Organising Data in R
A tutorial about data analysis using R (Website Version)
This tutorial is a mixture of R code chunks and explanations of the code. The R code chunks will appear in boxes.
Below is an example of a chunk of R code:
# This is a chunk of R code. All text after a # symbol is a comment
# Set working directory using setwd() function
setwd('Enter the path to my working directory')
# Clear all variables in R's memory
rm(list=ls()) # Standard code to clear R's memory
Sometimes the output from running this R code will be displayed after the chunk of code.
Here is a chunk of code followed by the R output
2 + 4 # Use R to add two numbers
[1] 6
Objectives
The objectives of this tutorial are:
- Introduce the concept of a data frame
- Demonstrate how data frames can be manipulated
- Demonstrate how to reformat data and code for missing data
- Explain data subsetting in R
- Save imported data to a compact binary file
Introduction
This tutorial will show you how to view, subset and manipulate data frames within R. This assumes that the data have been successfully imported into R (if you are unsuccessful at importing data into R then you need to read the data importing worksheet).
The data we’ll be using have been imported from these files:
- WOLF.CSV: This file is a text file of comma separated variables.
- INSECT.TXT: This file is a text file of TAB delimited variables.
These data sets are described at http://DrJonYearsley.github.io/Resources/datasets_WebVersion.html
Viewing a data frame
Finding variable names
Use the ls() function to print a list of variables in R’s memory
ls() # Display the variables in R's memory
[1] "insect" "wolf"
A poor way to view data
Typing the name of a variable will display all the data contained in the variable.
# Display the entire insect data frame
insect
Spray.A Spray.B Spray.C Spray.D Spray.E Spray.F X X.1
1 10 11 0 3 3 11 NA NA
2 7 17 1 5 5 9 NA NA
3 20 21 7 12 3 15 NA NA
4 14 11 2 6 5 22 NA NA
5 14 16 3 4 3 15 NA NA
6 12 14 1 3 6 16 NA NA
7 10 17 2 5 1 13 NA NA
8 23 17 1 5 1 10 NA NA
9 17 19 3 5 3 26 NA NA
10 20 21 0 5 2 26 NA NA
11 14 7 1 2 6 24 NA NA
12 13 13 4 4 4 13 NA NA
BEWARE: Printing out the entire data set is rarely useful, because data sets are often too large to fit on a computer screen (for example, the wolf data frame has 178 rows of data, making it hard to read in one go). There are often better ways to view a data frame than to just print out the entire variable.
Good ways to view data
Here are some options for viewing data frames:
head(wolf) # Display the first 6 lines of the wolf data frame
tail(wolf, n=10) # Display the last 10 lines of the wolf data frame
summary(wolf) # Display an overview of the wolf data frame
str(wolf) # Display the structure of the wolf data frame
The summary() function is particularly useful. It displays summary statistics for each variable in a data frame. Later we will see how the summary() function has many uses, such as displaying summary results from a data analysis.
The summary output for a data frame depends upon a variable’s data type.
- For quantitative data (num and int) the summary shows the minimum, first quartile (25% quantile), the median (50% quantile or second quartile), the mean, the third quartile (75% quantile), the maximum and the number of missing values (missing values are represented as NA in R). Examples of numerical data in the wolf data frame are Cpgmg, Tpgmg and Ppgmg.
- For qualitative data (factor, logi) the summary shows the number of data points in each category. For a factor with many categories only the most frequent categories are listed, and the remaining categories are lumped together as (Other). The number of missing values is also shown. Examples of qualitative data in the wolf data frame are Sex and Colour.
- For plain text data that isn’t qualitative the summary displays the type of data (Class :character).
The data type of a variable (e.g. quantitative, qualitative, character) is displayed in the output from the str() function.
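To see how the summary depends upon data type, here is a minimal sketch using a small made-up data frame. The data frame toy and all of its variables are hypothetical and are not part of the wolf or insect data.

```r
# Hypothetical toy data frame with one variable of each data type
toy <- data.frame(
  mass   = c(2.5, 3.1, NA, 4.0),           # numeric (num), one missing value
  sex    = factor(c("F", "M", "F", "M")),  # factor (qualitative)
  tagged = c(TRUE, FALSE, TRUE, TRUE),     # logical (logi)
  id     = c("a1", "a2", "a3", "a4"),      # character (chr)
  stringsAsFactors = FALSE                 # Keep id as plain text
)
str(toy)      # Display the data type of each variable
summary(toy)  # The summary of each variable depends upon its data type
```

Comparing the two outputs column by column should show quartiles for mass, counts per level for sex and only the class for id.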
Viewing part of a data frame
Referring to a single column in a data frame using $
A single variable (column) in a data frame can be specified by giving the name of the data frame, followed by a $ followed by the name of the variable.
Here is an example that specifies just the cortisol data in the wolf data frame
wolf$Cpgmg # Display just the cortisol data
The names of the variables can be seen at the top of each column of data (for example, using the head() function)
# Variable names appear above each column of data
head(wolf) # Display first 6 rows of data.
Individual Sex Population Colour Cpgmg Tpgmg Ppgmg
1 1 M 2 W 15.86 5.32 NA
2 2 F 1 D 20.02 3.71 14.37622
3 3 F 2 W 9.95 5.30 21.65902
4 4 F 1 D 25.22 3.71 13.42507
5 5 M 2 D 21.13 5.34 NA
6 6 M 2 W 12.48 4.60 NA
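A column extracted with $ is an ordinary vector, so it can be passed straight to other functions. The sketch below assumes a small made-up data frame (toy and its variable names are hypothetical) rather than the wolf data.

```r
# Hypothetical toy data frame with one missing cortisol value
toy <- data.frame(id = 1:4, cortisol = c(15.86, 20.02, NA, 25.22))

cort <- toy$cortisol        # Extract a single column as a vector
length(cort)                # One value per row of the data frame
mean(cort, na.rm = TRUE)    # Mean of the column, ignoring the missing value
```

The argument na.rm = TRUE is needed here because mean() returns NA when any value is missing.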
Adding a variable into a data frame
We can add a variable to a data frame using the $ operator.
Here is an example where we add the variable Replicate (1-12) which codes for each replicate of an experimental treatment
insect$Replicate <- c(1:12) # Add a variable called Replicate to the data frame
head(insect) # Display the first 6 rows of the updated data frame
Spray.A Spray.B Spray.C Spray.D Spray.E Spray.F X X.1 Replicate
1 10 11 0 3 3 11 NA NA 1
2 7 17 1 5 5 9 NA NA 2
3 20 21 7 12 3 15 NA NA 3
4 14 11 2 6 5 22 NA NA 4
5 14 16 3 4 3 15 NA NA 5
6 12 14 1 3 6 16 NA NA 6
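The same approach can add a variable calculated from existing columns. This is a sketch on a made-up data frame (toy and the Total variable are hypothetical). It also uses seq_len(nrow(...)) so that the number of replicates does not have to be typed in by hand.

```r
# Hypothetical toy data frame with counts for two sprays
toy <- data.frame(Spray.A = c(10, 7, 20), Spray.B = c(11, 17, 21))

toy$Replicate <- seq_len(nrow(toy))      # Number the rows 1, 2, 3
toy$Total <- toy$Spray.A + toy$Spray.B   # Add a variable derived from two columns
toy                                      # Display the updated data frame
```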
Changing a variable’s data type
Data in statistical analyses are often one of two basic data types: quantitative or qualitative data.
- R calls a continuous quantitative variable numeric (abbreviated to num)
- R calls a discrete quantitative variable integer (abbreviated to int)
- R calls a qualitative variable a factor
A qualitative variable is a set of labels (e.g. large, medium and small). Each label is called a level of the factor.
R also has other data types. Some examples are:
- character data type = plain text (abbreviated to chr)
- logical data type = a variable that is TRUE or FALSE (abbreviated to logi)
In the wolf data frame the variables Population, Individual, Sex and Colour are qualitative (the labels from each of these variables identify a data point to a population, an individual, a sex and a coat colour, respectively).
The data types that R has assigned each variable can be seen by looking at the structure of the wolf data frame
str(wolf) # Display the structure of the data frame
'data.frame': 178 obs. of 7 variables:
$ Individual: int 1 2 3 4 5 6 7 8 9 10 ...
$ Sex : chr "M" "F" "F" "F" ...
$ Population: int 2 1 2 1 2 2 1 1 1 2 ...
$ Colour : chr "W" "D" "W" "D" ...
$ Cpgmg : num 15.86 20.02 9.95 25.22 21.13 ...
$ Tpgmg : num 5.32 3.71 5.3 3.71 5.34 4.6 4.58 9.27 4.81 5.07 ...
$ Ppgmg : num NA 14.4 21.7 13.4 NA ...
You can see some issues here:
- The variables Population and Individual have been assigned as quantitative variables (R has identified them as numerical integers, int, because the wolf.csv file used whole numbers as labels for these two variables), but they are really qualitative labels.
- The variables Sex and Colour have been identified as containing text (chr type), but we want these to be recognised as qualitative nominal data types (R calls this data type a factor). The variable Sex has three levels: ‘M’, ‘F’ and ‘U’ (undetermined sex). The variable Colour has two levels, ‘D’ and ‘W’; a blank entry should be explicitly recognised as missing data.
We want to redefine the variables Population, Sex and Colour so that R recognizes them as factors (unordered factors). We will also redefine the variable Individual to be plain text (i.e. a character) to demonstrate the as.character() function.
# Convert Population variable from numeric to a factor (a qualitative variable)
wolf$Population <- as.factor(wolf$Population)
# Convert Sex variable from character to a factor (a qualitative variable)
wolf$Sex <- as.factor(wolf$Sex)
# Convert Colour variable from character to a factor (a qualitative variable)
wolf$Colour <- as.factor(wolf$Colour)
# Convert Individual variable from numeric to plain text
wolf$Individual <- as.character(wolf$Individual)
# Display an overview of the data frame
summary(wolf)
Individual Sex Population Colour Cpgmg Tpgmg
Length:178 F:72 1: 45 : 30 Min. : 4.75 Min. : 3.140
Class :character M:76 2:103 D: 37 1st Qu.:12.16 1st Qu.: 4.372
Mode :character U:30 3: 30 W:111 Median :15.61 Median : 5.070
Mean :17.74 Mean : 6.148
3rd Qu.:20.35 3rd Qu.: 6.317
Max. :73.19 Max. :61.790
Ppgmg
Min. :12.76
1st Qu.:19.50
Median :25.00
Mean :25.89
3rd Qu.:30.01
Max. :53.28
NA's :109
Notice how the summaries of the variables Population, Sex, Colour and Individual have changed now that Population, Sex and Colour are factors and Individual is plain text. Also note that missing values, NA’s, are explicitly taken into account when summarizing the data (e.g. the variable Ppgmg).
There is a set of related functions for coercing variables into other data types. Here are some examples
as.factor(...) # Coerces a variable to be a factor (qualitative, nominal)
as.numeric(...) # Coerces a variable to be numeric (quantitative, continuous)
as.character(...) # Coerces a variable to be a character (plain text)
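One thing to watch for when coercing: applying as.numeric() directly to a factor returns the internal integer codes of the levels, not the numbers shown in the labels. The short sketch below uses a made-up factor f (hypothetical, not from the tutorial data) and shows the usual workaround of converting to character first.

```r
# Hypothetical factor whose labels look like numbers
f <- factor(c("10", "20", "10", "30"))

as.numeric(f)                 # Integer codes of the levels: 1 2 1 3
as.numeric(as.character(f))   # The values in the labels: 10 20 10 30
```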
Removing a variable from a data frame
Sometimes we want to remove a variable from a data frame.
The insect data frame has two variables that should not be part of the data set (X and X.1). This is quite common when importing data. In this case the reason is two additional TABs at the end of each line in the text file. These TABs are hard to see, but R recognized them, created two additional variables and named them with default labels.
The columns can be removed by first finding out how many rows and columns the data frame has and then removing the two unwanted columns (columns 7 and 8) by their position. Here is the code
ncol(insect) # Number of columns in data frame
nrow(insect) # Number of rows in data frame
dim(insect) # Display number of rows and columns
insect <- insect[ ,-c(7,8)] # Remove columns 7 and 8 (X and X.1)
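Removing columns by position depends on the column order, which changes whenever a variable is added. As an alternative sketch, columns can be removed by name. The data frame toy and the vector unwanted below are hypothetical stand-ins for the insect data.

```r
# Hypothetical toy data frame with two unwanted columns, X and X.1
toy <- data.frame(Spray.A = c(10, 7), Spray.B = c(11, 17), X = NA, X.1 = NA)

unwanted <- c("X", "X.1")                         # Names of the columns to remove
toy.tidy <- toy[ , !(names(toy) %in% unwanted)]   # Keep every other column
names(toy.tidy)                                   # Display the remaining variable names
```

This version gives the same result regardless of where X and X.1 sit in the data frame.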
Set missing data to NA
Always use NA to represent missing data
Data on coat colour is missing for population 3. R explicitly represents missing data as NA, but the WOLF.CSV data file uses a blank space to represent missing data.
The code below sets these blank spaces to NA
# Create a logical variable that is TRUE if an observation is from population 3
bool.index <- wolf$Population==3
# Set coat colour variable to be NA for observations from population 3
wolf$Colour[bool.index] <- NA
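A variation on this idea is to test for the blank entries themselves, rather than relying on knowing which population they come from. The sketch below assumes a made-up data frame (toy is hypothetical) in which Colour is still plain text. Blank entries can also be converted at import time using the na.strings argument of read.csv().

```r
# Hypothetical toy data frame where a blank string marks missing coat colour
toy <- data.frame(Population = c(1, 2, 3, 3),
                  Colour = c("D", "W", "", ""),
                  stringsAsFactors = FALSE)

toy$Colour[toy$Colour == ""] <- NA   # Set every blank entry to NA
toy$Colour <- factor(toy$Colour)     # Convert to a factor; NA is not a level
summary(toy$Colour)                  # Counts for D and W plus the number of NA's
```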
Subset of a data frame
Selecting observations (rows) from a data frame
To select only particular rows from a data frame using a criterion you can use the subset function.
For example, to make a subset of the data in wolf that contains only females,
wolf.F <- subset(wolf, Sex=='F') # Create a subset with data on female wolves
Another way to subset the data frame is to use a logical index:
# Create a logical variable which is TRUE if an observation is for a female
bool.index <- wolf$Sex=='F'
# Create a subset containing only data on female wolves
wolf.F2 <- wolf[bool.index, ]
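The two approaches behave differently when the variable being tested contains missing values: subset() drops rows where the test is NA, whereas a logical index returns a row of NA's for each of them. Here is a small sketch on a made-up data frame (toy is hypothetical) that shows the difference, along with which() as one way to make the logical index ignore NA's.

```r
# Hypothetical toy data frame with one missing value for Sex
toy <- data.frame(id = 1:4, Sex = c("F", "M", NA, "F"))

nrow(subset(toy, Sex == "F"))         # 2 rows: the NA row is dropped
nrow(toy[toy$Sex == "F", ])           # 3 rows: includes a row filled with NA
nrow(toy[which(toy$Sex == "F"), ])    # 2 rows: which() keeps only TRUE values
```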
Make a subset using several variables
# Create a subset containing only data on female wolves in Population 1
# method 1:
wolf.F3 <- subset(wolf, Sex=='F' & Population==1)
# Create a subset containing only data on female wolves in Population 1
# method 2:
bool.index <- wolf$Sex=='F' & wolf$Population==1
wolf.F4 <- wolf[bool.index,]
Another example using a logical OR (|)
# Create a subset containing only data on wolves in Population 1 OR Population 2
wolf.F5 <- subset(wolf, Population==1 | Population==2)
summary(wolf.F5)
Individual Sex Population Colour Cpgmg Tpgmg
Length:148 F:72 1: 45 : 0 Min. : 4.75 Min. : 3.250
Class :character M:76 2:103 D: 37 1st Qu.:12.16 1st Qu.: 4.378
Mode :character U: 0 3: 0 W:111 Median :15.38 Median : 5.030
Mean :16.61 Mean : 5.617
3rd Qu.:19.98 3rd Qu.: 6.067
Max. :40.43 Max. :15.130
Ppgmg
Min. :12.76
1st Qu.:19.50
Median :25.00
Mean :25.89
3rd Qu.:30.01
Max. :53.28
NA's :79
Dropping unused levels of a factor
The subset wolf.F5 contains no data from population 3, but the factor Population still has 3 levels. To remove unused levels from a factor use the function droplevels().
Using the droplevels() function on the data frame wolf.F5 will remove the level for population 3, as well as any other levels that contain no data (e.g. wolves with an undetermined sex, level U of variable Sex)
wolf.F5 <- droplevels(wolf.F5) # Update the levels of factors in wolf.F5
summary(wolf.F5) # The factor Population now has 2 levels
Individual Sex Population Colour Cpgmg Tpgmg
Length:148 F:72 1: 45 D: 37 Min. : 4.75 Min. : 3.250
Class :character M:76 2:103 W:111 1st Qu.:12.16 1st Qu.: 4.378
Mode :character Median :15.38 Median : 5.030
Mean :16.61 Mean : 5.617
3rd Qu.:19.98 3rd Qu.: 6.067
Max. :40.43 Max. :15.130
Ppgmg
Min. :12.76
1st Qu.:19.50
Median :25.00
Mean :25.89
3rd Qu.:30.01
Max. :53.28
NA's :79
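The effect of droplevels() can be checked directly with the levels() function. The following sketch uses a made-up factor (pop and keep are hypothetical names) rather than the wolf data.

```r
# Hypothetical factor with three populations
pop <- factor(c(1, 2, 2, 3))

keep <- pop[pop != 3]       # Remove the observation from population 3
levels(keep)                # Still three levels: "1" "2" "3"
levels(droplevels(keep))    # Unused level removed: "1" "2"
```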
Selecting variables (columns) from a data frame
The subset command can be used to extract one or more variables from a data frame. For example, to select only the cortisol (Cpgmg) and Population variables from the wolf data frame (these are the fifth and third columns in the data frame, respectively)
# Create a subset of the data containing the variables 'Population' and 'Cpgmg'
wolf.subset1 <- subset(wolf, select=c('Population','Cpgmg'))
Other ways to select variables from a data frame
# Create a subset of the data containing the variables 'Population' and 'Cpgmg'
wolf.subset2 <- wolf[,c('Population','Cpgmg')]
# Create a subset of the data containing the variables 'Population' and 'Cpgmg'
# (columns 3 and 5 in the wolf data frame)
wolf.subset3 <- wolf[,c(3,5)]
# Create a subset of the data containing the variable 'Population'
# using the variable name
wolf$Population
Variables (columns) and observations (rows) can be selected at the same time. Here is an example selecting data on population identity and cortisol for just female wolves
# Create a subset of the data containing only female wolves and the
# variables 'Population' and 'Cpgmg'
wolf.subset4 <- subset(wolf, Sex=='F', select=c('Population','Cpgmg'))
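Rows and columns can also be selected together using square brackets, with the row criterion before the comma and the column names after it. The sketch below compares the two approaches on a made-up data frame (toy, a and b are hypothetical names) with no missing values.

```r
# Hypothetical toy data frame
toy <- data.frame(Sex = c("F", "M", "F"),
                  Population = c(1, 2, 2),
                  Cpgmg = c(20.02, 15.86, 9.95))

# Select female rows and two columns using subset()
a <- subset(toy, Sex == "F", select = c("Population", "Cpgmg"))
# Select the same rows and columns using square brackets
b <- toy[toy$Sex == "F", c("Population", "Cpgmg")]

all.equal(a, b)   # Check that the two subsets match
```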
Saving data
Large data sets can be time consuming to import into R. Once a file has been imported it is a good idea to save the data in R’s native binary format. Data in this format is quick to import and takes up less space on the hard drive. By convention, files containing data in R’s binary format have the suffix .Rdata.
To save the variables wolf and insect to a file use the save() command
# Save wolf and insect to a file called 'sheet2_data.Rdata'
save(wolf, insect, file='sheet2_data.Rdata')
We can verify that the data have been correctly saved by clearing R’s memory and re-importing them using the load() command. Try running the following commands to see if you can reload the data saved in file sheet2_data.Rdata.
rm(list=ls()) # Clear variables from memory
ls() # Display the variables in R's memory
load(file='sheet2_data.Rdata') # Import R binary data from a file
ls() # Display the variables in R's memory
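The same round trip can be tried without touching your working directory. This is a minimal sketch that assumes a made-up data frame (toy, toy.copy and loaded are hypothetical names) and writes to a temporary file created with tempfile().

```r
# Hypothetical toy data frame to save and reload
toy <- data.frame(x = 1:3, y = c("a", "b", "c"))
toy.copy <- toy                        # Keep a copy for comparison

tmp <- tempfile(fileext = ".Rdata")    # Path to a temporary file
save(toy, file = tmp)                  # Save toy in R's binary format
rm(toy)                                # Remove toy from R's memory

loaded <- load(tmp)                    # load() returns the names of the restored objects
loaded                                 # Display the names of the restored objects
all.equal(toy, toy.copy)               # Check the reloaded data match the original
unlink(tmp)                            # Delete the temporary file
```

Note that load() restores objects under the names they were saved with, so it will overwrite any existing variable with the same name.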
Summary of the topics covered
- Displaying contents of a data frame
- Manipulating data in a data frame
- Creating subsets of data
- Saving a data frame to a file using R’s binary data file format
- Reading data from an R binary data file
Further Reading
All these books can be found in UCD’s library
- Andrew P. Beckerman and Owen L. Petchey, 2012 Getting Started with R: An introduction for biologists (Oxford University Press, Oxford) [Chapter 3]
- Michael J. Crawley, 2015 Statistics : an introduction using R (John Wiley & Sons, Chichester) [Chapter 2]
- Tenko Raykov and George A Marcoulides, 2013 Basic statistics: an introduction with R (Rowman and Littlefield, Plymouth)