# This is a chunk of R code. All text after a # symbol is a comment
# Set working directory using setwd() function
setwd('Enter the path to my working directory')
# Clear all variables in R's memory
rm(list=ls()) # Standard code to clear R's memory
General Linear Models (fitting)
A tutorial about data analysis using R (Website Version)
This tutorial is a mixture of R code chunks and explanations of the code. The R code chunks will appear in boxes.
Below is an example of a chunk of R code:
Sometimes the output from running this R code will be displayed after the chunk of code.
Here is a chunk of code followed by the R output
2 + 4 # Use R to add two numbers
[1] 6
Objectives
The objectives of this tutorial are:
- To introduce general linear models
- As statistical models for describing a population
- As a tool for estimating quantities of interest and making decisions from data
- The concept of response and explanatory variables
- Explain how general linear model can be divided into deterministic and random parts
- Develop statistical models for a hypothesis (H1) and a null-hypothesis (H0)
- Introduce the concept of model parameters
- Introduce R’s
lm()
function for fitting a general linear model to data - Demonstrate the use of
lm()
by fitting models to cortisol data in the WOLF.CSV file
Statistical modelling
Review this online lesson introducing statistical models before continuing.
Online Lesson: Statisticial Modelling
This lesson is at https://DrJonYearsley.github.io/OnlineLessons/lesson4_statisticalmodelling_Website.html
A story about wolves
Every winter in Canada hunters shoot and trap wolves for their pelts. This hunting reduces the population size of wolves, but it has many other knock on consequences. For example, hunting also removes individuals from social groups which disrupts the wolf’s social hierarchy.
Questions about the world
Biologists have many questions about the impact of such hunting upon the wolves:
- Does hunting change a wolf pack’s behaviour?
- Do hunted wolves show signs of stress?
- Does hunting impact a wolf packs individual reproductive rate?
- Are all wolves affected equally by hunting?
And the list goes on.
We’ll focus on one question:
Do hunted wolves show signs of stress?
We happen to have a dataset that is relevant to this question: the WOLF.CSV data. This file contains data on individual wolves from three populations. Population 1 had almost no hunting intensity, population two had heavy hunting intensity and population 3 was a captive population. Cortisol concentrations from hair samples of individual wolves are also in this dataset. Cortisol is a physiological marker of stress.
To study this the above question we have:
- A response variable: Quantitative data on the main topic of interest (stress) in the form of individual cortisol concentrations (units of pg/mg).
- An explanatory variable: Data that might explain some of the variation in the response. In this case hunting intensity may help explain wolf stress levels. It is a qualitative variable (the population identity).
Hypotheses
We’re going to write our question as a testable statement, which is called a hypothesis. It is very important that this hypothesis is testable so that we can use our data to quantitatively test the hypothesis.
Start with a hypothesis about the effect of hunting on cortisol concentrations, which we’ll call the alternative hypothesis (H~1):
- H1: Wolves in population 2 (heavily hunted) have average cortisol levels that are higher than wolves in population 1 (almost not hunted).
We’ll also write a hypothesis that is a statement about no effect, which we’ll call the null hypothesis (H0)
- H0: Wolves in population 2 have the same average cortisol levels as wolves in population 1
Testable predictions
If these hypotheses are fit for purpose we should be able to generate a testable prediction (or several predictions) from each hypothesis.
For example:
Hypothesis | Prediction |
---|---|
H1 | Prediction 1: The concentration of cortisol in a sample of wolf hair from population 2 will be higher, on average, than a sample of hair from population 1. |
H1 | Prediction 2: The concentration of cortisol in blood samples from population 2 will be higher, on average, than a sample of blood from population 1. |
H0 | Prediction 1: On average, the concentration of cortisol in a sample of wolf hair from populations 1 and 2 will be the same. |
H0 | Prediction 2: On average the concentration of cortisol in blood samples from populations 1 and 2 will be the same. |
Importing the data
First we import our data (our sample which gives partial infomration about our system). We make sure that the data type of the Population
variable is set as qualitative using the function as.factor()
.
<- read.table('WOLF.CSV', header=T, sep=',') # Import wolf data set
wolf $Population <- as.factor(wolf$Population) # Set Population as a qualitative variable wolf
We will focus on populations 1 and 2 (light and heavy hunting respectively), so let’s subset our data to contain only these two populations.
<- droplevels(subset(wolf, Population!=3)) # Extract data from population 1 & 2 wolf12
Notice that we have used the droplevels()
function to ‘update’ the levels of the factors in the wolf12
data frame. This droplevels()
function will force R to forget about the level for population 3.
Exploratory data analysis
After importing the data it’s important to start making a picture of the data by visualising the data using numerical and graphical summaries.
Numerical data exploration
First let’s look at a summary of the data
summary(wolf12) # Numerical summary of the data
Individual Sex Population Colour
Min. : 1.00 Length:148 1: 45 Length:148
1st Qu.: 37.75 Class :character 2:103 Class :character
Median : 74.50 Mode :character Mode :character
Mean : 74.50
3rd Qu.:111.25
Max. :148.00
Cpgmg Tpgmg Ppgmg
Min. : 4.75 Min. : 3.250 Min. :12.76
1st Qu.:12.16 1st Qu.: 4.378 1st Qu.:19.50
Median :15.38 Median : 5.030 Median :25.00
Mean :16.61 Mean : 5.617 Mean :25.89
3rd Qu.:19.98 3rd Qu.: 6.067 3rd Qu.:30.01
Max. :40.43 Max. :15.130 Max. :53.28
NA's :79
Start to get a feeling for the numbers. Note the names of the variables, their range, the central tendency and their spread.
We can calculate the central tendency for each population using the aggregate()
function
# Calculate mean Cortisol in each population
aggregate(Cpgmg~Population, data=wolf12, FUN=mean)
Population Cpgmg
1 1 15.56222
2 2 17.07495
Notice that mean cortisol is higher in population 2. Is that the pattern we’d expect?
Graphical data exploration
Exploring the data with graphics is also useful
library(ggplot2) # Load ggplot2
First we’ll look at the distribution of the cortisol concentrations
# Plot the distribution of the cortisol data
ggplot(data=wolf12,
aes(x=Cpgmg)) +
geom_histogram(bins=20) +
labs(x='Cortisol Concentration [pg/mg]') +
theme_bw()
The distribution looks slightly right-skewed.
We can plot the raw data
# Plot the raw cortisol data
ggplot(data=wolf12,
aes(x=Population,
y=Cpgmg,
colour=Population)) +
geom_point(position='jitter') +
labs(x='Population',
y='Cortisol Concentration [pg/mg]') +
theme_bw()
or summarise the data with a boxplot. Importantly, The boxplot not only visualises the central tendency of the data from the two populations (the median) but also the spread (in terms of the 25% and 75% quantiles as well as the whiskers).
# Draw a boxplot of the cortisol data from populations 1 and 2
ggplot(data=wolf12,
aes(x=Population,
y=Cpgmg)) +
geom_boxplot() +
labs(x='Population',
y='Cortisol Concentration [pg/mg]') +
theme_bw()
Median cortisol levels in population 2 are slightly higher than population 1, but there is a lot of spread about the central tendency.
A violin plot is an alternative to a boxplot
# Draw a violin plot of the cortisol data from populations 1 and 2
ggplot(data=wolf12,
aes(x=Population,
y=Cpgmg)) +
geom_violin() +
labs(x='Population',
y='Cortisol Concentration [pg/mg]') +
theme_bw()
We may want to look at how several variables are able to explain cortisol values (the response variable). For example, differences between males and females and differences between the two populations.
# Draw a violin plot of the cortisol data from populations 1 and 2
# for males and females
ggplot(data=wolf12,
aes(x=Population,
y=Cpgmg,
fill=Sex)) +
geom_violin() +
labs(x='Population',
y='Cortisol Concentration [pg/mg]') +
theme_bw()
Balanced and unbalanced data
Use the table()
function to look at how much data we have from each population.
# Look at the balance of the data across
# levels of Population
table(wolf12$Population)
1 2
45 103
The sample size for population 1 is lower than population 2. This is called an unbalanced design and is very common when working with data collected from the field. A balanced design is when sample sizes are the same for all levels of a factor. Classical ANOVA analysis required the data to be balanced. So these data would require special treatment if analysed with a classical ANOVA. General linear models can be used with unbalanced data, because they use maximum likelihood to fit the model. However, general linear model analysis is easiest for balanced data sets, and they should not be used for data that are extremely unbalanced.
Analyse the data
Modelling the population
Our data are a sample. The data contain partial information about the topic of our hypotheses (i.e. the population). The population is the set of cortisol concentrations from all wolves from the two locations, at any time of the year, and maybe even across more of ta wolf’s body than just the hair.
If we are going to investigate our hypotheses then we need a way to describe the population. We will use a statistical model to describe the distribution of data we may expect across the whole population.
A statistical model
The figure below sketches out a possible statistical model. It sketches the distribution of cortisol for populations 1 and 2. The model has three important aspects:
- The response variable is cortisol concentration (on the x-axis), and it is a quantitative variable
- The average value of cortisol in each population is described by a number (represented as \mu_1 and \mu_2, whose value is as yet not known). In other words, the average value in each population has a unique value than can be determined. Given complete infomration about the population we could determine the value precisely.
- The variation of cortisol values in each population is being described by a normal distribution. The variation within a population is being described as being random because we are not trying to specify unique values of cortisol for every individual wolf. Instead we are describing all the possible outcomes of cortisol values by a normal distribution. There is an important differnce to the last point: Given complete infomration about the population we could not determine the precise value of a wolf’s cortisol value, but we can assign a probability to observing a cortisol concentration.
This model in the above sketch could be written in equation form as
\text{Cpgmg} = \mu_i + \text{Normal}(0, \sigma)
where \mu_i is the mean cortisol for population i (i is an index that can be 1 or 2) and \sigma is the standard deviation in cortisol values (i.e. the spread in cortisol values).
Note that:
- the spread is the same in both population 1 and 2.
- \text{Normal}(0,\sigma) is shorthand for a normal distribution with meanof zero and standard deviation \sigma.
General linear models
The sketch above is a representation of a general linear model. General linear models are a hugely important family of statistical models.
A general linear model always uses a normal distribution to describe variation in the population that we will not try to associate with another variable (e.g. variation in cortisol concentrations between wolves in a population). The normal distribution in a general linear model is assumed to have a mean of zero and a constant standard deviation that must be estimated from data.
Special cases of general linear models are the methods of: linear regression, t-tests, analysis of variance (ANOVA) and analysis of covariance (ANCOVA).
General linear models are a statistical tool that allow us to:
- Analyse a very wide range of data with one statistical methodology
- Always describe randomness using a normal distribution
- Estimate means and other parameters of the population (e.g. the difference in mean cortisol between populations)
- Estimate the uncertainty in population parameters from point 3.
- Quantify the strength of evidence against the null-hypothesis
The work-horse function in R for fitting general linear models to data is the lm()
function.
We will use the general linear model framework to estimate mean cortisol values in populations 1 & 2, and then test the hypothesis that these means are the same. We will then compare this approach to a t-test approach and finally to a permutation test approach.
Hypotheses revisited
Each hypothesis (the null-hypothesis, H0 and the alternative hypothesis, H1) can be expressed in terms of our statistical model.
Model for H0
The null-hypothesis states that the there is no difference in average cortisol concentrations between populations 1 and 2.
This means that \mu_1 = \mu_2. The model corresponding to this is sketched below.
This model could also be written in equation form as
\text{Cpgmg} = \mu + \text{Normal}(0, \sigma)
where \mu is the mean cortisol in population 1 or 2 and \sigma is the standard deviation in cortisol values
In R the formula for this model is written simply as Cpgmg ~ 1
(for more details on how to specify a model using a formula go to http://DrJonYearsley.github.io/Resources/R_formulas_WebVersion.html).
You could also think of this model in terms of trying to predict average cortisol values. To predict average cortisol in either population 1 or population 2, given this null-hypothesis, you would make the same prediction (\mu) for both population.
The random part (represented by \text{Normal}(0, \sigma)) is always assumed to be a normal distribution when using the lm()
function because lm()
fits a general linear model to the data. The deterministic part (sometimes called the systematic part) in the formula Cpgmg ~ 1
is represented by the 1
which means that it is a constant (i.e. there is one mean that is not affected by any other variable). This constant is called the Intercept by R.
Model for H1
The hypothesis states that avergage cortisol is different in populations 1 and 2.
This means that \mu_1 \ne \mu_2. The model corresponding to this is our first sketch showing two bell shaped normal distributions.
We therefore need two average values (\mu_1 and \mu_2) plus an estimate of spread (\sigma) to specify the model for this hypothesis.
You could also think of this model in terms of trying to predict average cortisol values. To predict average cortisol in either population 1 or population 2, given this alternative-hypothesis, you would make the prediction \mu_1 for population 1 and \mu_2 for population 2 (i.e. different predictions.
In R the formula for this model will be written simply as Cpgmg ~ 1 + Population
Formulas in R: Click here for more details on how to specify a model using a formula
Remember:
The random part of a general linear model is always assumed to be a normal distribution when using thelm()
function.
The deterministic part now includes the factorPopulation
to show that separate means should be fitted to population 1 and 2 (each level of the factor). The factorPopulation
is an example of an explantory variable.
Model parameters
- The model for H0 has two parameters: \mu and \sigma.
- The model for H1 has three parameters: \mu_1, \mu_2 and \sigma.
Parameters can also be called coefficients. The term coefficient is commonly used when talking about the deterministic part of a model.
Fitting each of these models to the wolf12
data involves estimating values for these parameters. The lm()
function fits each model using a method called maximum likelihood.
Fitting a model to data
Use the lm()
function to fit a general linear model to data.
To use the lm()
function we specify the model using a formula (see above) and specify the data frame containing the data.
Our formulas are
- H0 model:
Cpgmg ~ 1
- H1 model:
Cpgmg ~ 1 + Population
The response (Cpgmg) is on the left hand side of the tilde (~
) and the deterministic part of the model (the explanatory variables) is on the right hand side of the tilde. The random part is not represented in the formula because it is automatically taken to be a normal distribution. The function lm()
always assumes the random part of a model is a normal distribution.
Here is the R code to fit the H0 model (null model) and the H1 model (alternative model)
# Fit a linear model for the null-model
# (i.e. Population is not part of the model)
<- lm(Cpgmg ~ 1, data=wolf12)
m0
# Fit a linear model for the alternative model
# (i.e. Population affects cortisol)
<- lm(Cpgmg ~ 1 + Population, data=wolf12) m1
Summarising the fitted model
The summary()
function to can be used to view the fitted models.
summary(m0) # Summarise the null model
Call:
lm(formula = Cpgmg ~ 1, data = wolf12)
Residuals:
Min 1Q Median 3Q Max
-11.865 -4.455 -1.240 3.360 23.815
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 16.6150 0.5051 32.9 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.145 on 147 degrees of freedom
The summary of each model prints:
Call:
thelm()
command used to fitted the modelResiduals:
the quartiles of the residuals (described in later tutorials)Coefficients:
the fitted parameters for the deterministic part of the model (called the model coefficients) and their uncertainty (standard errors). For model m0 this is just the mean cortisol (called theIntercept
) with an estimated value of 16.6 (standard error=0.51).Residual standard error:
the fitted value for the standard deviation of the normal distribution in the statistical model. For the m0 model this is sn-1= 6.14.- the residual degrees of freedom. This is the number of data points (=148) minus the number of parameters in the deterministic part of the model. For model m0 there is one parameter in the deterministic part of the model (\mu), therefore the residual degrees of freedom = 147.
If the model is more complex than fitting a single mean (e.g. model m1) then summary()
also prints:
Multiple R-squared:
the R2 of the model. For model m1 the R2=0.013. The R2 can be interpreted as the percentage of the variance in the response that is explained by the deterministic part of the model.- an adjusted R2 which we will not be considering.
F statistic:
an additional F statistic which we will not be considering
Interpreting model coefficients
Model H0
The H0 model has one coefficient, the mean cortisol value, \mu (output from summary(m0)
). This is estimated to be 16.61 (s.e. 0.51). This information is from the (Intercept)
row in the model summary info.
Remember:
that a fitted coefficients has an uncertainty because we used a sample to estimate a property of the population.It is important to give an uncertainty with every estimated value from a model (a standard error or a 95% confidence interval).
We advise you to use a 95% confidence interval (confidence intervals will be discussed in a later tutorial):
- Use the
coef()
function to quickly see the fitted values for a model’s coefficients - Use the
confint()
command to obtain 95% confidence intervals in these fitted coefficients
# Model coefficient estimates and uncertainties
coef(m0) # Coefficients for model m0
(Intercept)
16.615
confint(m0) # 95% confidence intervals on the coefficients for model m0
2.5 % 97.5 %
(Intercept) 15.61685 17.61315
We can now report our estimate for the mean cortisol value (notice that we back transform to quote the result in the original units, not on the log-scale):
The average cortisol for all wolves from populations 1 and 2 is 16.61 pg/mg (95%CI: 15.62 - 17.61)
Model H1
The H1 model has two coefficients:
coef(m1) # Coefficients for model m1
(Intercept) Population2
15.562222 1.512729
confint(m1) # 95% confidence intervals on the coefficients for model m1
2.5 % 97.5 %
(Intercept) 13.7575212 17.366923
Population2 -0.6505745 3.676033
- the mean cortisol value for population 1 (\mu_1). This is estimated to be 15.56 (95% CI: (95% CI: 13.76 - 17.37). This is the coefficient labelled
(Intercept)
. - the difference between the means of population 1 and 2 (\mu_2 - \mu_1). This difference is estimated to be 1.51 (95% CI: -0.65 - 3.68). This is the coefficient labelled
Population2
.
Residuals
A residual is the difference between a data point and the prediction from a model. The residual is therefore associated with the random part of the model. It is the part of the response that cannot be explained by the model.
Most of the assumptions in our statistical model concern the residuals. So we will look at residuals in more detail when we discuss if the fitted model obeys the assumptions (model validation is the subject of a later tutorial).
The residuals of a fitted linear model can be obtained with the residuals()
function
<- residuals(m1) # Extract the residuals for model m1 m1_resid
The distribution of residuals has a mean of zero,
mean(m1_resid) # Mean of the residuals from model m1
[1] 5.521109e-16
The calculated mean is extremely close to zero. It is not exactly zero because of the limited accuracy of the computer (remember 1.0e-16
is computer shorthand for 1x10-16 = 0.0000000000000001).
The distribution of residuals can also be visualised by plotting a histogram
# Plot the distribution of residuals from model m1
ggplot(data=as.data.frame(m1_resid),
aes(x=m1_resid)) +
geom_histogram(bins=30) +
theme_bw()
The standard deviation of the residuals is close to being an estimate of \sigma from our statistical model (it is a slight under-estimate).
sd(m1_resid) # Standard deviation of the residuals
[1] 6.104729
A more precise estimate is given by the residual standard error
in the summary()
output.
Long-hand calculation of residuals
Residuals can be calculated explicitly by subtracting the model predictions (using the predict()
function) from the original data.
<- wolf12$logCpgmg - predict(m1) # Calculate residuals by hand m1_resid2
Have a go at comparing the two approaches for calculating residuals and show that they give the same answer
Summary of the topics covered
- The statistical model for general linear models revolves around the Normal distribution
- General linear models always use a normal distribution as the random part
- The variable of interest is called the response variable
- Statistical models of data always have a deterministic and a random part
- Statistical models have parameters
- The values of a model’s parameters can be estimated by fitting a model to data
- The
lm()
function fits general linear models to data
Further Reading
All these books can be found in UCD’s library
- Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45(12), 1304–1312.
- Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49(12), 997–1003.
- Andrew P. Beckerman and Owen L. Petchey, 2012 Getting Started with R: An introduction for biologists (Oxford University Press, Oxford) [Chapter 5]
- Michael J. Crawley, 2015 Statistics : an introduction using R (John Wiley & Sons, Chichester) [Chapters 7, 8]
- Tenko Raykov and George A Marcoulides, 2013 Basic statistics: an introduction with R (Rowman and Littlefield, Plymouth)