# This is a chunk of R code. All text after a # symbol is a comment
# Set working directory using setwd() function
setwd('Enter the path to my working directory')
# Clear all variables in R's memory
rm(list=ls()) # Standard code to clear R's memory
Manipulating data in R
Using arrays and data frames
This tutorial is a mixture of R code chunks and explanations of the code. The R code chunks will appear in boxes.
Below is an example of a chunk of R code:
Sometimes the output from running this R code will be displayed after the chunk of code.
Here is a chunk of code followed by the R output
2 + 4 # Use R to add two numbers
[1] 6
Introduction
This worksheet covers the different ways to manipulate arrays of numbers and data frames in R.
Creating an array of numbers
One-dimensional arrays
Use the array()
function to create an array of numbers.
For example, to create an array of 13 zeros we would type
array(0, dim=13) # Create an array with 13 elements, all set to zero
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0
This is called a one dimensional array with 13 elements.
We can also create a 1D array of numbers using the concatenate function c()
c(3, 100, 30, 23, -9, 6.3) # Concatenate six numbers
[1] 3.0 100.0 30.0 23.0 -9.0 6.3
Two-dimensional arrays
A matrix is a two dimensional array. A matrix can also be created using the array()
function, but now the dim=
argument has two numbers: the first being the number of rows in the matrix and the second the number of columns in the matrix.
# Create a 2D array of zeros with 3 rows and 5 columns
array(0, dim=c(3,5))
[,1] [,2] [,3] [,4] [,5]
[1,] 0 0 0 0 0
[2,] 0 0 0 0 0
[3,] 0 0 0 0 0
Using [ ]
to extract data from an array
Using [ ]
to specify a single element
# Create a 1D array called a1 with 10 elements all set to be zero
<- array(0, dim=10) a1
If we want to set the 3rd element of the array a1 to be the number 8 we use square brackets []
as follows
# Set the 3rd elements of the array a1 to be the number 10
3] <- 8
a1[# Display the modified array a1 a1
[1] 0 0 8 0 0 0 0 0 0 0
Using [ ]
to specify a several elements
We can use [ ]
to specify several elements of the a1 array. For example, to display just the 3rd, 5th and 8th elements
# Display the 3rd, 5th and 8th elements of a1
c(3, 5, 8) ] a1[
[1] 8 0 0
And we can use this approach to set the values of several elements at the same time
# Set the 3rd, 5th and 8th elements of a1 to be 2.5, -76 and 109, respectively
c(3, 5, 8) ] <- c( 2.5, -76, 109)
a1[ # Display the modified array a1 a1
[1] 0.0 0.0 2.5 0.0 -76.0 0.0 0.0 109.0 0.0 0.0
Using [ ]
to specify an element of a 2D array
First create a 2D array with 3 rows and 5 columns (i.e 15 elements in total) that contain random numbers draw from a standard normal distribution (mean=0, standard deviation=1)
# Create a 3 by 5 array containing random data from a normal distribution
<- array( rnorm(15), dim=c(3,5))
a2 # Display the array a2 a2
[,1] [,2] [,3] [,4] [,5]
[1,] -0.3876384 -0.2909345 -0.2427396 -1.2765277 1.186071236
[2,] -1.6578986 1.0727205 -0.1618912 0.8652126 -0.004456946
[3,] -2.1676924 -0.2850214 1.0837220 -0.8157207 0.868907546
The row and column indices of an element can be specified inside square brackets, separated by a comma
# Display the element in the 2nd row, 4th column of a2
2, 4] a2[
[1] 0.8652126
We could reassign specific elements of a 2D array
# Reassign the element in the 2nd row, 4th column of a2 to be zero
2, 4] <- 0
a2[ # Display the array a2 a2
[,1] [,2] [,3] [,4] [,5]
[1,] -0.3876384 -0.2909345 -0.2427396 -1.2765277 1.186071236
[2,] -1.6578986 1.0727205 -0.1618912 0.0000000 -0.004456946
[3,] -2.1676924 -0.2850214 1.0837220 -0.8157207 0.868907546
Using [ ]
to specify an entire row/column of a 2D array
An entire row can be specified by leaving the index blank
# Display all the elements in the 2nd row of a2
2, ] a2[
[1] -1.657898611 1.072720455 -0.161891234 0.000000000 -0.004456946
and similarly for an entire column
# Display all the elements in the 4th column of a2
4] a2[ ,
[1] -1.2765277 0.0000000 -0.8157207
Creating a data frame
Importing data
Data frames are produced when you import data into R using one of the data import functions (e.g. read.csv()
,read.table()
).
Here is the R code to import the data in file WOLF.CSV. The data frame is called wolf and contains the imported data.
# Import WOLF.CSV file using read.table function
<- read.table('WOLF.CSV', header=TRUE, sep=',') wolf
Looking at the first 6 rows of the wolf data frame shows that it contains 7 variables (i.e. 7 columns). The columns are called: Individual, Sex, Population, Colour, Cpgmg, Tpgmg, Ppgmg.
head(wolf)
Individual Sex Population Colour Cpgmg Tpgmg Ppgmg
1 1 M 2 W 15.86 5.32 NA
2 2 F 1 D 20.02 3.71 14.37622
3 3 F 2 W 9.95 5.30 21.65902
4 4 F 1 D 25.22 3.71 13.42507
5 5 M 2 D 21.13 5.34 NA
6 6 M 2 W 12.48 4.60 NA
Converting a 2D array into a data frame
Use the as.data.frame()
function to convert a 2D array into a data frame. Each column of the array will become a different variable. By default the columns will be labelled V1, V2, V3, etc
# Convert array a2 into a data frame
<- as.data.frame(a2)
d2 str(d2) # Display the structure of the new data frame
'data.frame': 3 obs. of 5 variables:
$ V1: num -0.388 -1.658 -2.168
$ V2: num -0.291 1.073 -0.285
$ V3: num -0.243 -0.162 1.084
$ V4: num -1.277 0 -0.816
$ V5: num 1.18607 -0.00446 0.86891
Creating a data frame from scratch
The data.frame()
function can be used to create a data frame from scratch.
# Create data frame containing information from a food diary
<- data.frame(day=c('Mon','Tues','Wed','Thurs','Mon'),
d3 week=c(1,1,1,2,3),
meal=c('Breakfast','Lunch','Lunch','Dinner','Dinner'),
portions = c(2,1,0,0,1))
Above is an example where a data frame with 4 variables (from a food diary) is created (variable names are day, week, meal, portions). Each variable has 5 observations (rows).
Looking at the structure of the data frame shows that the text has been taken as data type character (abbreviated to chr
) and numbers are data type numeric
str(d3) # Display the structure of data frame d3
'data.frame': 5 obs. of 4 variables:
$ day : chr "Mon" "Tues" "Wed" "Thurs" ...
$ week : num 1 1 1 2 3
$ meal : chr "Breakfast" "Lunch" "Lunch" "Dinner" ...
$ portions: num 2 1 0 0 1
If we want day
and meal
to be factors we could use the as.factor()
function.
$day <- as.factor(d3$day) # Convert day to be a factor
d3$meal <- as.factor(d3$meal) # Convert meal to be a factor d3
or we could use the argument stringsAsFactors=TRUE
in the original data.frame()
function.
Specifying a column
A column of a data frame corresponds to a single variable.
The $
symbol
The dollar symbol ($
) allows you to specify individual variables in a data frame. The dollar is preceded by the name of the data frame and followed by the name of the variable: <name of data frame>$<name of variable>
To view the meal and week variables in the data frame d3 (above) type
$week # Display variable week in the d3 data frame d3
[1] 1 1 1 2 3
$meal # Display variable meal in the d3 data frame d3
[1] Breakfast Lunch Lunch Dinner Dinner
Levels: Breakfast Dinner Lunch
Note that for factors (e.g. the meal variable) R displays a list of the levels in the factor (i.e. the unique values of the qualitative variable).
Square brackets [ ]
Just as for arrays, square brackets can be used to specify a column from a data frame. To specify the 3rd column in the d3 data frame type
3] # Display the 3rd column in data frame d3 d3[ ,
[1] Breakfast Lunch Lunch Dinner Dinner
Levels: Breakfast Dinner Lunch
Note that the row index is blank in order to specify all rows
Specifying a row
An observation (a row in a data frame) can be specified using [ ]
To specify the 2nd observation in the wolf data frame type
2, ] wolf[
Individual Sex Population Colour Cpgmg Tpgmg Ppgmg
2 2 F 1 D 20.02 3.71 14.37622
Note that the column index is left blank in order to specify all columns
We can use the same notation to specify a single value. To specify the 2nd observation of the 5th column in the wolf data frame type
2, 5] wolf[
[1] 20.02
Logical indexing
Logical indexing uses a logical variable to specify elements in an array or data frame. A logical variable is either TRUE
or FALSE
.
A logical index will extract values from a data frame at positions where the logical index is TRUE
.
Data frame example
Here is an example using the d3 data frame. We want to extract all observations from week 1.
# Display data in the d3 data frame d3
day week meal portions
1 Mon 1 Breakfast 2
2 Tues 1 Lunch 1
3 Wed 1 Lunch 0
4 Thurs 2 Dinner 0
5 Mon 3 Dinner 1
# Create a logical array that is TRUE if an observation is from week 1
<- d3$week==1
logInd str(logInd) # Display the structure of the logical array
logi [1:5] TRUE TRUE TRUE FALSE FALSE
# Use the logical array inside square brackets to specify rows corresponding to week 1
d3[logInd, ]
day week meal portions
1 Mon 1 Breakfast 2
2 Tues 1 Lunch 1
3 Wed 1 Lunch 0
Rows corresponding the where logInd is TRUE
are extracted.
The same result could be obtained using the subset()
function
# Use the subset function to specify rows corresponding to week 1
subset(d3, week==1)
day week meal portions
1 Mon 1 Breakfast 2
2 Tues 1 Lunch 1
3 Wed 1 Lunch 0
Array example
Here is an example where we want to extract all negative numbers from the 2D array a2
# Display the 2D array a2 a2
[,1] [,2] [,3] [,4] [,5]
[1,] -0.3876384 -0.2909345 -0.2427396 -1.2765277 1.186071236
[2,] -1.6578986 1.0727205 -0.1618912 0.0000000 -0.004456946
[3,] -2.1676924 -0.2850214 1.0837220 -0.8157207 0.868907546
# Create a logical array that is TRUE when an element of a2 is negative
<- a2<0 logInd
# Extract elements of a2 that are negative
a2[logInd]
[1] -0.387638396 -1.657898611 -2.167692442 -0.290934469 -0.285021383
[6] -0.242739598 -0.161891234 -1.276527741 -0.815720675 -0.004456946
Reassigning values in a data frame using [ ]
Above we used logical indexing to extract all observations from week 1 in the data frame d3
. We can use this approach to reassign the values in a data frame. This can be useful when correcting errors in a data frame.
Here is how you reassign the values of portions for week 1 in data frame d3 to be equal to 0
# Create a logical array that is TRUE if an observation is from week 1
<- d3$week == 1
loInd
# Reassign portions to be 0 in week 1
$portions[logInd] <- 0 d3
Reorganising a data frame
This section shows how to pivot data. This means combining several columns into one column (i.e. putting data in a longer format) and splitting data from one column across several columns (i.e. putting data into a wider format). We will be using the pivot_longer()
and pivot_wider()
functions from the package tidyr
.
We will use three more data sets (INSECT.TXT, MALIN_HEAD.TXT and BEEKEEPER.TXT). These data sets are explained at http://www.ucd.ie/ecomodel/Resources/datasets_WebVersion.html
First import the data
# TAB delimited variables
<- read.table('INSECT.TXT', header=T, skip=3, sep='\t')
insect <- insect[,-c(7,8)] # Remove the last two empty columns
insect
<- read.table('MALIN_HEAD.TXT', header=T, sep='\t')
rainfall
# White space delimited variables
<- read.table('BEEKEEPER.TXT', header=T, skip=6, sep='') bees
Several columns into one column (using pivot_longer()
)
Look at the insect data frame
insect
Spray.A Spray.B Spray.C Spray.D Spray.E Spray.F
1 10 11 0 3 3 11
2 7 17 1 5 5 9
3 20 21 7 12 3 15
4 14 11 2 6 5 22
5 14 16 3 4 3 15
6 12 14 1 3 6 16
7 10 17 2 5 1 13
8 23 17 1 5 1 10
9 17 19 3 5 3 26
10 20 21 0 5 2 26
11 14 7 1 2 6 24
12 13 13 4 4 4 13
The insect
data frame is not tidy: it is not organised so that each row is an observation and each column is a variable.
It has two variables: spray name (A-F) and insect count, so ideally we want a data frame with two columns (one column would contain the type of spray used on a plot and the second column would contain the insect count from the plot). As it stands the spray name appears as the names of the columns, and the insect count is spread across six columns. Ideally we need to reorganize the data into a long-format, so that there are just two columns, one for each variable.
To reorganize the data frame so that each column corresponds to a single variable we use the pivot_longer()
function in the tidyr
package. This function will:
- Specify which columns to combine into one column with the
cols=
argument. - Create a new variable containing the names of the columns from the insect data frame. This column’s name can be set with the
names_to=
argument of thepivot_longer()
function.
- Create a second new variable containing the numerical data in the columns. This column’s name can be set with the
values_to=
argument of thepivot_longer()
function.
library('tidyr') # Load tidyr package
# Tidy the insect data frame
# Gather all the count data into one column
<- pivot_longer(data=insect,
insect.long cols=c(1:6),
values_to="Count",
names_to="Spray")
head(insect.long) # Display the reshaped data frame
# A tibble: 6 × 2
Spray Count
<chr> <int>
1 Spray.A 10
2 Spray.B 11
3 Spray.C 0
4 Spray.D 3
5 Spray.E 3
6 Spray.F 11
This reorganized data frame is in long-format and is sometimes called a molten data frame.
Example: Reorganising climate data
A common example of using the pivot_longer()
function to combine multiple columns of data into a single column occurs with monthly climate data (e.g. Met Eireann’s monthly data; https://www.met.ie/climate/available-data). It is common for raw data to be displayed with columns to represent individual months. To analyse these data we will want all rainfall data in one column, and a second column added giving the month.
# Display the data rainfall
Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1 2016 166.3 130.7 72.7 52.9 65.6 72.4 109.9 63.4 NA NA NA NA
2 2015 176.0 85.8 123.1 64.7 137.0 56.1 132.7 111.0 29.7 71.9 222.9 272.9
3 2014 162.2 189.9 71.6 33.4 86.8 48.6 86.0 95.3 23.0 131.4 134.4 150.5
4 2013 140.9 74.1 61.7 61.6 102.5 85.5 56.5 92.6 69.7 103.8 116.0 178.6
With these data we will combine the monthly rainfall data into two columns: one with the name of the month and one with the rainfall for the month.
However we do not want to change the Year column, so we instruct R to not use the year column, using the argument cols=-Year
(see below).
# Organise the rainfall data
<- pivot_longer(data=rainfall,
rainfall.long cols=-Year,
names_to="Month",
values_to="Rainfall")
# Display the first 6 lines of the organised data frame
head(rainfall.long)
# A tibble: 6 × 3
Year Month Rainfall
<int> <chr> <dbl>
1 2016 Jan 166.
2 2016 Feb 131.
3 2016 Mar 72.7
4 2016 Apr 52.9
5 2016 May 65.6
6 2016 Jun 72.4
One column across several columns (using pivot_wider()
)
Sometimes you want to take well formatted data and spread one variable across several columns. One option to do this you can use the pivot_wider()
function in tidyr
.
The bees
data frame is well formatted (one variable in one column and each row an observation). In this well formatted data set we have a column called Flowers (density of flowers), and a column called Field (the field where the flowers density was measured). We will spread the Flowers data out so that the density of flowers is in two columns: one columns for field 1 and one column for field 2.
We will use the pivot_wider()
function. This function will:
Create column names from the values in variable Field (categorical variable). This is done with the
names_from=
andnames_prefix=
arguments of thepivot_wider()
function.Indicate that the Flowers variable will be used to fill the newly created columns. This is done with the
values_from=
argument of thepivot_wider()
function.
# Spread the Field across multiple columns
# (see below for an explanation of names_prefix="Field_")
<- pivot_wider(data=bees,
bees.wide values_from="Flowers",
names_from="Field",
names_prefix="Field_")
head(bees.wide) # Display the spread out data frame
# A tibble: 6 × 4
Treatment Plot Field_1 Field_2
<int> <int> <int> <int>
1 1 1 23 31
2 1 2 26 32
3 1 3 20 35
4 1 4 21 29
5 2 1 20 30
6 2 2 24 32
The names_prefix="Field_"
argument is used to create the new column names, by adding Field_ as a prefix to the field number.
Example: Reorganising climate data
We could use the pivot_wider()
function to return the monthly rainfall data (see above) from its tidy format (data frame rainfall.long
) to a wide format where each month’s rainfall has its own column.
# Reorganise the rainfall data back into a wide format
<- pivot_wider(data=rainfall.long,
rainfall.wide values_from="Rainfall",
names_from="Month")
# Display the wide-format rainfall data rainfall.wide
# A tibble: 4 × 13
Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2016 166. 131. 72.7 52.9 65.6 72.4 110. 63.4 NA NA NA NA
2 2015 176 85.8 123. 64.7 137 56.1 133. 111 29.7 71.9 223. 273.
3 2014 162. 190. 71.6 33.4 86.8 48.6 86 95.3 23 131. 134. 150.
4 2013 141. 74.1 61.7 61.6 102. 85.5 56.5 92.6 69.7 104. 116 179.
Other data organisation packages
There are several other data manipulation tools in R. Two common packages for data manipulation are:
reshape2
dplyr
See also http://www.r-bloggers.com/reshape-and-aggregate-data-with-the-r-package-reshape2/ and http://www.cookbook-r.com/Manipulating_data/Converting_data_between_wide_and_long_format/
Summarising a variable by group
The summary()
function produces a summary of an entire data frame. However, it doesn’t allow you to calculate summaries for groups within the data frame.
If you want to calculate a mean across different grouping (e.g. for each value of a qualitative factor) the aggregate()
function is very useful.
To calculate the mean rainfall across years for each month:
# Mean rainfall across years 2013-2016 for each month
aggregate(Rainfall~Month, # Define aggregation using a formula
data=rainfall.long, # Specify data frame to be aggregated
FUN=mean, # Specify function to use for aggregation
na.rm=TRUE, # Ignore missing values
na.action=na.pass) # Pass missing data to the aggregation function
Month Rainfall
1 Apr 53.1500
2 Aug 90.5750
3 Dec 200.6667
4 Feb 120.1250
5 Jan 161.3500
6 Jul 96.2750
7 Jun 65.6500
8 Mar 82.2750
9 May 97.9750
10 Nov 157.7667
11 Oct 102.3667
12 Sep 40.8000
The order of the months is alphabetical, making the summary harder to read. This could be improved by specifying the Month
variable as a factor with the levels in a specified order. Below is the code to do this.
# Define Month as a factor with levels in a specified order
$Month <- factor(rainfall.long$Month,
rainfall.longlevels = c('Jan','Feb','Mar','Apr','May','Jun',
'Jul','Aug','Sep','Oct','Nov','Dec'))
The aggregate
function can be used to produce a summary for each month
# Summary of rainfall across years 2013-2016 for each month
aggregate(Rainfall~Month,
data=rainfall.long,
FUN=summary)
Month Rainfall.Min. Rainfall.1st Qu. Rainfall.Median Rainfall.Mean
1 Jan 140.9000 156.8750 164.2500 161.3500
2 Feb 74.1000 82.8750 108.2500 120.1250
3 Mar 61.7000 69.1250 72.1500 82.2750
4 Apr 33.4000 48.0250 57.2500 53.1500
5 May 65.6000 81.5000 94.6500 97.9750
6 Jun 48.6000 54.2250 64.2500 65.6500
7 Jul 56.5000 78.6250 97.9500 96.2750
8 Aug 63.4000 85.3000 93.9500 90.5750
9 Sep 23.0000 26.3500 29.7000 40.8000
10 Oct 71.9000 87.8500 103.8000 102.3667
11 Nov 116.0000 125.2000 134.4000 157.7667
12 Dec 150.5000 164.5500 178.6000 200.6667
Rainfall.3rd Qu. Rainfall.Max.
1 168.7250 176.0000
2 145.5000 189.9000
3 85.3000 123.1000
4 62.3750 64.7000
5 111.1250 137.0000
6 75.6750 85.5000
7 115.6000 132.7000
8 99.2250 111.0000
9 49.7000 69.7000
10 117.6000 131.4000
11 178.6500 222.9000
12 225.7500 272.9000
Summary of topics
- Creating an array (1D and 2D)
array()
functionc()
function
- Creating a data frame
- Importing data into a data frame
as.data.frame()
functiondata.frame()
function
- Specifying and modifying elements in arrays and data frames
- Square brackets
[]
- Dollar symbol
$
- Logical indices
- Square brackets
- Reorganising a data frame
- Gathering data into one column
- Spreading data across several columns
- Groupwise summary a data frame
Further Reading
All these books can be found in UCD’s library
- Andrew P. Beckerman and Owen L. Petchey, 2012 Getting Started with R: An introduction for biologists (Oxford University Press, Oxford)
- Mark Gardner, 2012 Statistics for Ecologists Using R and Excel (Pelagic, Exeter)
- Tenko Raykov and George A Marcoulides, 2013 Basic statistics: an introduction with R (Rowman and Littlefield, Plymouth)
- John Verzani, 2005 Using R for introductory statistics (Chapman and Hall, London)
- Andy P. Field, Jeremy Miles, Zoë Field, 2013 Discovering statistics using R (Sage, London)