# This is a chunk of R code. All text after a # symbol is a comment
# Set working directory using setwd() function
setwd('Enter the path to my working directory')
# Clear all variables in R's memory
rm(list=ls()) # Standard code to clear R's memory
Distributions
A tutorial about data analysis using R (Website Version)
This tutorial is a mixture of R code chunks and explanations of the code. The R code chunks will appear in boxes.
Below is an example of a chunk of R code:
Sometimes the output from running this R code will be displayed after the chunk of code.
Here is a chunk of code followed by the R output
2 + 4 # Use R to add two numbers
[1] 6
Objectives
The objectives of this tutorial are:
- Introduce the concept of a distribution
- Distinguish between empirical and theoretical distributions
- Introduce the normal distribution
- Summarise a distribution using central tendency, spread and skew
- Introduce the central limit theorem
Introduction
This tutorial introduces the idea of a distribution. We will look at the distribution of the human height observations for woman from the HEIGHT.CSV data file and the distribution of cortisol levels in wolves from the WOLF.CSV data file.
We will also look at an important theoretical distribution, the normal distribution, and introduce the central limit theorem to explain why the normal distribution (a theoretical distribution) is so often used to analyse real data.
To start with we will import the data from the HEIGHT.CSV file and extract just the data for females.
# ************************************
# Import data ------------------------
<- read.table('HEIGHT.CSV', header=T, sep=',') # Import human height data
human <- subset(human, SEX=='Female') # Extract data for females humanF
And the data from WOLF.CSV
<- read.table('WOLF.CSV', header=T, sep=',') # Import wolf data set wolf
What is a distribution?
Review this online lesson about distributions before continuing.
Examples of distributions
1: Will it rain tomorrow?
The answer to this question is unknown. There are two possible outcomes: yes it will rain, no it will not rain. Although a definite answer is not possible, stating a distribution of outcomes is possible if we can estimate the probability of the two outcomes (the two probabilities must add up to one).
Below is the distribution for an estimated 70% chance of rain. This is an example of a theoretical distribution called a Bernoulli distribution
2: How tall is a human?
The humanF
data shows that the heights of woman can take a range of values. The distribution of the observed data lies between 1.41 m and 1.83 m.
Above is a plot of the raw height data showing the distribution of the data. Each vertical line represents a data point. The data are not uniformly spread out across the entire range. There is a greater density of data at values in the centre of the range.
We can see the distribution of the data better using a histogram (R’s hist()
function or geom_histogram()
with the ggplot2 package). A histogram divides the range of heights into a number of bins and counts the number of data points in each bin.
Below is an example of a histogram with approximately 30 bins.
# Display a histogram of female heights from the HEIGHT.CSV dataset
ggplot(data=humanF, # Define data to be plotted
aes(x=HEIGHT)) +
geom_histogram(bins=30) + # Draw a histogram using 30 bins
labs(x='Height',y='Count') + # Set axis titles
theme_bw() + # Set background to white
theme(axis.title = element_text(size=20), # Set fontsize of axis title
axis.text = element_text(size=16)) # Set fontsize of axis labels
The distribution of height is the probability of observing an individual of height x (where x can be any possible value).
Values with a high probability
We can see that heights in the centre of the range (around 1.63 m) have the highest probability of being observed.
Values with a low probability
From the histogram we can see that the probability of observing a height greater than 1.8 m is close to zero (there are 7 individuals out of 1552 with heights larger than 1.8 m).
Visualising a distribution
Above we saw two graphical visualisations of distributions.
A distribution of a variable can be represented graphically, with a list of all the possible values of the variable on the x-axis and the relative frequency of observing each outcome (a number between 0 and 1) on the y-axis. Adding up the relative frequencies of all possible outcomes will always give a total of one, because it is 100% certain that some outcome is observed. The uncertainty is in which outcome.
The distribution of a continuous variable is sometimes called a probability density function (or PDF for short).
The distribution of a variable with a fixed number of outcomes (a qualitative variable, or a discrete quantitative variable) is sometimes called a probability mass function.
The shape of a distribution is best visualised using a quantile-quantile plot (QQ-plot).
Numerical Summary Statistics
We can start to describe any distribution by using summary statistics that describe some important aspects of the shape of a distribution.
Some common summary statistics are:
- central tendency (i.e. a value characteristic of the high probability region)
- spread (i.e. the ‘width’ of the high probability region)
- skew
Central tendency
Central tendency is a number that aims to describe where most outcomes occur
Three common measures of central tendency are:
- the arithmetic mean (R function
mean()
)
- the median (R function
median()
)
- the mode
Below we calculate the median of human height using the median()
function
# Calculate the median height, ignoring missing data
median(humanF$HEIGHT, na.rm=TRUE)
[1] 1.628
Notice that the median corresponds to the region of high probability in the histogram of heights (above).
Spread
Spread is a number that aims to describe how variable the outcomes are around a central tendency
Three measures of spread are:
- the standard deviation (spread around the mean, R function
sd()
) - the inter-quartile range (spread around the median, R function
IQR()
) - the median absolute deviation (spread around the median, R function
mad()
)
Below we calculate the median absolute deviation (MAD) of human heights using the function mad()
# Calculate the median absolute deviation of
# height, ignoring missing data
mad(humanF$HEIGHT, na.rm=T)
[1] 0.0622692
Range
Range is the difference between the largest and smallest observation in a data set
The largest and smallest observations in a data set can be calculated using the function range()
(or by using the functions max()
and min()
).
The max and min are not robust summary statistic of a distribution. This is because the max and min strongly depend upon the sample size of your data set. As the sample size increases you are more likely to observed extreme values (the maximum will increase and the minimum will decrease).
Use extreme quantiles (e.g. 1% and 99% quantiles) as a more robust summary statistic of a distribution
The maximum and minimum observed heights can be found using R’s range()
function
# Calculate the range of the data, ignoring missing data
range(humanF$HEIGHT, na.rm=TRUE)
[1] 1.409 1.829
A more robust measure of a distribution’s extremes are quantiles (say 1% and 99% quantiles)
# Calculate 1% and 99% quantiles, ignoring missing data
quantile(humanF$HEIGHT, prob=c(0.01, 0.99), na.rm=TRUE)
1% 99%
1.47704 1.77896
Theoretical Distributions
The Normal distribution
The normal distribution (sometimes called a Gaussian distribution) is an extremely important theoretical distribution in data analysis. The distribution is completely described by two parameters:
- the mean (commonly represented by \mu)
- the standard deviation (commonly represented by \sigma)
These two parameters describe the bell shape of the normal distribution. The normal distribution is symmetric, meaning that the distribution can be reflected about the mean and look the same.
Generating random numbers
R can generate random numbers from a specified theoretical distribution. Here are three examples.
Normal distribution
Below is the code to generate 10,000 values with mean=2 and standard deviation=0.5.
rnorm()
function generates random data drawn from a normal distribution (you can specify the two parameters of the normal distribution: the mean and standard deviation).
# .................................................
# Generate 10000 data point drawn from a normal distribution
# .................................................
<- rnorm(n=10000, mean=2, sd=0.5)
normal_data
# Produce a histogram of the generated data
ggplot(data=as.data.frame(normal_data), # Define data to be plotted
aes(x=normal_data)) +
geom_histogram(bins=30) + # Draw a histogram using 30 bins
labs(x='x', # Set axis titles
y='Count') +
theme_bw() + # Set background to white
theme(axis.title = element_text(size=20), # Set fontsize of axis title
axis.text = element_text(size=16)) # Set fontsize of axis labels
Binomial distribution
The binomial distribution can be used to simulate the chance of seeing a species at a location given the probability that the species exists in the habitat.
rbinom()
function generates the number of success given the number of attempts and the probability of a success.
Below is the code to generate the number of times a species is seen in a habitat out of 20 replicate habitats and a probability of occurrence of 0.69 (69%). This is repeated 10,000 times and the distribution from these 10,000 draws from a binomial distribution plotted.
# .................................................
# Generate 10000 data points drawn from a binomial distribution
# .................................................
<- rbinom(n=10000, size=20, prob=0.69)
binomial_data
# Produce a histogram of the generated data
ggplot(data=as.data.frame(binomial_data), # Define data to be plotted
aes(x=binomial_data)) +
geom_bar() + # Draw a bar plot
labs(x='Times species seen in 20 locations',# Set axis titles
y='Count') +
theme_bw() + # Set background to white
theme(axis.title = element_text(size=20), # Set fontsize of axis title
axis.text = element_text(size=16)) # Set fontsize of axis labels
Categorical distribution
A categorical distribution is the relative frequencies for the levels of a qualitative variable
sample()
generates a random sample from a set of qualitative outcomes (e.g. picking numbers in the lottery).
Here is an example that randomly picks one day of the week
# .................................................
# Generate categorical data
# .................................................
# Sample from a set of possible outcomes using sample()
# The outcomes are all the days of the week
<- c('Monday','Tuesday','Wednesday',
week 'Thursday','Friday','Saturday','Sunday')
# Sample 1 day from the week
sample(week, size=1)
[1] "Tuesday"
We can repeat this sampling many times (70 times, say) using a for-loop
# Create a blank data frame
<- data.frame(day=rep(NA, times=70))
categorical_data
for (i in 1:70) {
$day[i] <- sample(week, size=1)
categorical_data
}
# Produce a histogram of the generated data
ggplot(data=categorical_data, # Define data to be plotted
aes(x=day)) +
geom_bar() + # Draw a barplot
labs(x='Day of week', # Set axis titles
y='Count') +
theme_bw() + # Set background to white
theme(axis.title = element_text(size=20), # Set fontsize of axis title
axis.text = element_text(size=16), # Set fontsize of axis labels
axis.text.x = element_text(angle=90,# Rotate x axis labels
hjust=1))# Justify text
Using a theoretical distribution
A theoretical distribution is used in data analysis to:
- provide a mathematical description of a random variable
- provide a distribution which can mimic observations
- provides a link between statistical theory and observed data
A Normal distribution is a theoretical distribution. A Normal distribution with mean = 1.63 m and standard deviation = 0.064 produces data similar to our observations of female heights (humanF$HEIGHT
)
Below we have graphed histograms of the 1552 real human height observations (left) and a random sample of 1552 values drawn from a Normal distribution (right). On top of these two histograms the Normal distribution is represented as the red line (with mean = 1.63 m and standard deviation = 0.064 m).
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
Summary of topics
- A distribution describes all possible outcomes and their relative frequencies
- A measure of central tendency (e.g. median, mean) and a measure of spread (e.g. median absolute deviation, standard deviation) are descriptive statistics that start to describe a distribution
- A histogram and quantile-quantile plot can be used to visualise a distribution
- The normal distribution is the single most important theoretical distribution in data analysis
- The normal distribution is completely described by two parameters: mean and standard deviation
- Normal data can be generate in R using the
rnorm()
function
- Categorical data can be generated using the
sample()
function
Further Reading
All these books can be found in UCD’s library
- Mark Gardner, 2012 Statistics for Ecologists Using R and Excel (Pelagic, Exeter) [Chapter 4]
- Tenko Raykov and George A Marcoulides, 2013 Basic statistics: an introduction with R (Rowman and Littlefield, Plymouth)
- John Verzani, 2005 Using R for introductory statistics (Chapman and Hall, London)