rcode17

Regression: R code for Chapter 17 examples

Download the R code on this page as a single file here

New methods on this page

Click on a function argument for a short description of its meaning. The variable names are plucked from the examples further below.

Linear regression:
lionRegression <- lm(, )
summary()

Predict mean y on a regression line corresponding to a given x-value:
predict(, , )

ANOVA test of zero regression slope:
anova()

Other new methods:
95% confidence interval for the slope.
95% confidence bands.
95% prediction intervals.
Scatter plot smoothing.
Nonlinear regression.
Logistic regression.

Example 17.1. Lion noses

Estimate a linear regression and calculate predicted values and their uncertainty for the relationship between lion age and proportion of black on the lion nose.

Read and inspect the data.

lion <- read.csv(url("http://www.zoology.ubc.ca/~schluter/WhitlockSchluter/wp-content/data/chapter17/chap17e1LionNoses.csv"))
head(lion)

Basic scatter plot.

plot(ageInYears ~ proportionBlack, data = lion)

Commands for a fancier scatter plot using more options (Figure 17.1-1) are .

Fit the linear regression to the data using least squares. Use lm, which was also used for ANOVA. Save the results into a model object, and then use other commands to extract the quantities we want.

lionRegression <- lm(ageInYears ~ proportionBlack, data = lion)

Show the estimated linear regression coefficients: slope and intercept. The output of summary includes other valuable quantities, including standard errors and a t-test of zero regression slope. If you want only the regression coeffients without the rest of the data, use coef(lionRegression).

summary(lionRegression)

Add the least squares regression line to the plot. The quick method is to use abline. If needed, redraw the scatter plot first.

abline(lionRegression)

Another way to draw the regression line is to generate predicted values and then connect them with a line (Figure 17.1-4). This has the advantage of more control - for example, we can make sure that the line doesn't fall outside the range of the data, avoiding extrapolation. Commands to accomplish this are .

Use the regression line for prediction. For example, here is how to predict mean lion age corresponding to a value of 0.50 of proportion black in the nose. Standard error of the predicted mean is included.

predict(lionRegression, data.frame(proportionBlack = 0.50), se.fit = TRUE)

Calculate the residual and regression mean squares. All the mean squares are included in the anova table, which is generate using the anova command.

anova(lionRegression)

Standard error of slope. The standard errors are provided in the output of the summary function.

summary(lionRegression)

95% confidence interval for slope.

confint(lionRegression)

95% confidence bands (Figure 17.2-1). If needed, redraw the scatter plot, including the regression line, before executing the commands below. The seq function below generates a sequence of 100 new x-values evenly spaced between the smallest and largest values of proportionBlack. The predict function calculates the corresponding predicted mean y-values on the regression line, as well as the lower and upper limits of the confidence intervals for the predicted mean y-values at each x-value (these are put into the data frame named ypt. Finally, the lines functions connect the consecutive lower limits of the confidence intervals with a line, and do the same for the consecutive upper limits, forming the bands.

xpt <- seq(min(lion$proportionBlack), max(lion$proportionBlack), 
	length.out = 100)
ypt <- data.frame( predict(lionRegression, 
	newdata = data.frame(proportionBlack = xpt), 
	interval = "confidence") )
lines(ypt$lwr ~ xpt, lwd = 1.5, lty = 2)
lines(ypt$upr ~ xpt, lwd = 1.5, lty = 2)

95% prediction intervals (Figure 17.2-1). If needed, redraw the scatter plot, including the regression line, before executing the commands below. The commands below are the same as those used to generate the confidence bands (see above), but here the lower and upper limits are confidence intervals for the prediction of a single individual y-value, rather than the mean value, corresponding to each x-value.

xpt <- seq(min(lion$proportionBlack), max(lion$proportionBlack), 
	length.out = 100)
ypt <- data.frame(predict(lionRegression, newdata = data.frame(proportionBlack = xpt), 
	interval = "prediction", level = 0.95))
lines(ypt$lwr ~ xpt, lwd = 1.5, lty = 3)
lines(ypt$upr ~ xpt, lwd = 1.5, lty = 3)

Example 17.3. Prairie home campion

Test the null hypothesis of zero regression slope. The data are from an experiment investigating the effect of plant species diversity on the stability of plant biomass production.

Read and inspect the data.

prairie <- read.csv(url("http://www.zoology.ubc.ca/~schluter/WhitlockSchluter/wp-content/data/chapter17/chap17e3PlantDiversityAndStability.csv"))
head(prairie)

Log-transform stability and include the new variable in the data frame. Inspect the data frame once again to make sure the function worked as intended.

prairie$logStability <- log(prairie$biomassStability)
head(prairie)

Scatter plot with regression line.

plot(logStability ~ nSpecies, data = prairie) 
prairieRegression <- lm(logStability ~ nSpecies, data = prairie)
abline(prairieRegression)

Commands for a fancier scatter plot with more options (Figure 17.3-1) are .

t-test and 95% confidence interval of regression slope.

prairieRegression <- lm(logStability ~ nSpecies, data = prairie)
summary(prairieRegression)
confint(prairieRegression)

ANOVA test of zero regression slope (Table 17.3-2).

anova(prairieRegression)

R2. The R2 associated with the regression is part of the output of the summary command.

summary(prairieRegression)$r.squared

Figure 17.5-2. Tail feather outlier

Examine the effect on an outlier on linear regression. The data are percentage of white in the tail feathers of the dark-eyed junco at sites at different latitudes.

Read and inspect data.

junco <- read.csv(url("http://www.zoology.ubc.ca/~schluter/WhitlockSchluter/wp-content/data/chapter17/chap17f5_2JuncoOutlier.csv"))
head(junco)

Scatter plot of all the data.

plot(percentWhiteTail ~ latitude, data = junco)

Commands for a more elaborate scatter plot using more options (Figure 17.5-2) are .

Add regression lines to the scatter plot, calculated both with and without the outlier. You may need to redraw the scatter plot using the above commands before executing the following functions.

juncoRegression <- lm(percentWhiteTail ~ latitude, data = junco)
abline(juncoRegression, col = "red", lwd = 2)
juncoRegressionNoOut <- lm(percentWhiteTail ~ latitude, 
	data = junco, subset = percentWhiteTail > 38)
abline(juncoRegressionNoOut, lwd = 2)

Figure 17.5-3. Desert pool fish

Detect a nonlinear relationship graphically, using data on surface area of desert pools and the number of fish species present.

Read and inspect data.

desertFish <- read.csv(url("http://www.zoology.ubc.ca/~schluter/WhitlockSchluter/wp-content/data/chapter17/chap17f5_3DesertPoolFish.csv"))
head(desertFish)

Scatter plot.

plot(nFishSpecies ~ poolArea, data = desertFish)

Commands for a fancier scatter plot are .

Add regression line to previous scatter plot.

desertRegression <- lm(nFishSpecies ~ poolArea, data = desertFish)
abline(desertRegression)

Add curve based on smoothing to previous scatter plot.

desertSmooth <-loess(nFishSpecies ~ poolArea, data = desertFish)
xpts <- seq(min(desertFish$poolArea), max(desertFish$poolArea), length.out = 100)
ypts <- predict(desertSmooth, new = data.frame(poolArea = xpts))
lines(xpts, ypts)

Here is another method to produce a smooth curve in a scatter plot.

desertSmooth <-smooth.spline(desertFish$poolArea, desertFish$nFishSpecies, df = 4)
xpts <- seq(min(desertFish$poolArea), max(desertFish$poolArea), length.out = 100)
ypts <- predict(desertSmooth, xpts)$y
lines(xpts, ypts)

Figure 17.5-4 (left). Blue tit cap color

Create a residual plot to check assumptions. The data are cap color of offspring and parents in a sample of the blue tit.

Read and inspect the data.

capColor <- read.csv(url("http://www.zoology.ubc.ca/~schluter/WhitlockSchluter/wp-content/data/chapter17/chap17f5_4BlueTitCapColor.csv"))
head(capColor)

Basic scatter plot with regression line.

plot(offspringColorScore ~ midparentColorScore, data = capColor)
capColorRegression <- lm(offspringColorScore ~ midparentColorScore, data = capColor)
abline(capColorRegression)

Residual plot. Residuals are calculated from the model object created from the previous lm command, using the residuals command.

plot(residuals(capColorRegression) ~ midparentColorScore, data = capColor)
abline(c(0,0))

Commands for a fancier residual plot using more options are .


Figure 17.5-4 (right). Cockroach neurons

Make a residual plot to check assumptions. The data are temperature and the firing rate of neurons in cockroach.

Read and inspect the data.

roach <- read.csv(url("http://www.zoology.ubc.ca/~schluter/WhitlockSchluter/wp-content/data/chapter17/chap17f5_4CockroachNeurons.csv"))
head(roach)

Scatter plot with regression line.

plot(rate ~ temperature, data = roach)
roachRegression <- lm(rate ~ temperature, data = roach)
abline(roachRegression)

Residual plot.

plot(residuals(roachRegression) ~ temperature, data = roach)
abline(c(0,0))

Commands for a more elaborate residual plot are .


Figure 17.6-3. Iris pollination

Use a square root transformation to improve the fit to assumptions of linear regression. The data are number of pollen grains received and floral tube length of an iris species.

Read and inspect data.

iris <- read.csv(url("http://www.zoology.ubc.ca/~schluter/WhitlockSchluter/wp-content/data/chapter17/chap17f6_3IrisPollination.csv"))
head(iris)

Scatter plot with regression line.

plot(grainsDeposited ~ tubeLengthMm, data = iris)
irisRegression <- lm(grainsDeposited ~ tubeLengthMm, data = iris)
abline(irisRegression)

Residual plot.

plot(residuals(irisRegression) ~ tubeLengthMm, data = iris)
abline(c(0,0))

Commands for a residual plot with more options are .

Square root transformation.

iris$sqRootGrains <- sqrt(iris$grainsDeposited + 1/2)

Scatter plot using transformed data, with new regression line added.

plot(sqRootGrains ~ tubeLengthMm, data = iris)
irisRegressionSqrt <- lm(sqRootGrains ~ tubeLengthMm, data = iris)
abline(irisRegressionSqrt)

Residual plot based on the transformed data.

plot(residuals(irisRegressionSqrt) ~ tubeLengthMm, data = iris)
abline(c(0,0))

Instructions for a residual plot with more options are .


Figure 17.8-1. Iron and phytoplankton growth

Fit a nonlinear regression curve having an asymptote (Michaelis-Menten curve). The data are the relationship between population growth rate of a phytoplankton in culture and the concentration of iron in the medium.

Read and examine data.

phytoplankton <- read.csv(url("http://www.zoology.ubc.ca/~schluter/WhitlockSchluter/wp-content/data/chapter17/chap17f8_1IronAndPhytoplanktonGrowth.csv"))
head(phytoplankton)

Scatter plot.

plot(phytoGrowthRate ~ ironConcentration, data = phytoplankton)

Instructions for a scatter plot with more options are .

Fit a Michaelis-Menten curve to the phytoplankton data using the nls (nonlinear least squares). To fit the curve, provide a formula that also includes symbols (letters) for the parameters to be estimated. In the following function, we use "a" and "b" to indicate the parameters of the Michaelis-Menten curve we want to estimate. The function includes an argument where we must provide an initial guess of parameter values. The value of the initial guess is not so important -- here we choose a=1 and b=1 as initial guesses. The first function below carried out the model fit and save the results in an R object named phytoCurve.

phytoCurve <- nls(phytoGrowthRate ~ a*ironConcentration / ( b+ironConcentration), 
	data = phytoplankton, list(a = 1, b = 1))

Obtain the parameter estimates using the summary command to , including standard errors and t-tests of null hypotheses that parameter values are zero.

summary(phytoCurve)

Add the nonlinear regression curve to scatter plot. If necessary, redraw the scatter plot before issuing the following commands.

xpts <- seq(min(phytoplankton$ironConcentration), 
	max(phytoplankton$ironConcentration), length.out = 100)
ypts <- predict(phytoCurve, new = data.frame(ironConcentration = xpts))
lines(xpts, ypts)

Many of the functions that can be used to extract results from a saved lm object work in the same way when applied to an nls object, such as predict, residuals, and coef.


Figure 17.8-2. Pond plants and productivity

Fit a quadratic curve to the relationship between the number of plant species present in ponds and pond productivity.

Read and examine data.

pondProductivity <- read.csv(url("http://www.zoology.ubc.ca/~schluter/WhitlockSchluter/wp-content/data/chapter17/chap17f8_2PondPlantsAndProductivity.csv"))
head(pondProductivity)

Scatter plot.

plot(species ~ productivity, data = pondProductivity)

Commands for a fancier scatter plot are .

Fit a quadratic curve to the data. Here, the single variable productivity in the data frame is included in the formula both as itself and as the squared term, productivity2. To make the squared term work, we need to wrap the term with I(). The results of the model fit are saved in an R object productivityCurve.

productivityCurve <- lm(species ~ productivity + I(productivity^2), 
	data = pondProductivity)

Show estimates of the parameters of the quadratic curve (regression coefficients) are obtained as follows, along with standard errors and t-tests.

summary(productivityCurve)

Add quadratic regression curve to scatter plot. If necessary, redraw the previous scatter plot before issuing the following commands.

xpts <- seq(min(pondProductivity$productivity), max(pondProductivity$productivity), 
	length.out = 100)
ypts <- predict(productivityCurve, new = data.frame(productivity = xpts))
lines(xpts, ypts)

Example 17.8. Incredible shrinking seal

Fit a formula-free curve (cubic spline) to the relationship between body length and age for female fur seals.

Read and inspect data

shrink <- read.csv(url("http://www.zoology.ubc.ca/~schluter/WhitlockSchluter/wp-content/data/chapter17/chap17e8ShrinkingSeals.csv"))
head(shrink)

Scatter plot.

plot(length ~ ageInDays, data = shrink, pch = ".")

Commands for a fancier scatter plot with more options are .

Fit a cubic spline. The argument df stands for effective degrees of freedom, which allows you to control how complicated the curve should be. The simplest possible curve is a straight line, which has df=2 (one for slope and another for intercept). More complex curves require more df. Here we fit a very complicated curve.

shrinkCurve <- smooth.spline(shrink$ageInDays, shrink$length, df = 40)

Add curve to scatter plot. If needed, replot the scatter plot before issuing the commands below.

xpts <- seq(min(shrink$ageInDays), max(shrink$ageInDays), length.out = 1000)
ypts <- predict(shrinkCurve, xpts)$y
lines(xpts, ypts)

Figure 17.9-1. Guppy cold death

Fit a logistic regression to the relationship between guppy mortality and duration of exposure to a temperature of 5 degrees C.

Read and inspect the data.

guppy <- read.csv(url("http://www.zoology.ubc.ca/~schluter/WhitlockSchluter/wp-content/data/chapter17/chap17f9_1GuppyColdDeath.csv"))
head(guppy)

Draw a frequency table of mortality at different exposure times.

table(guppy$mortality, guppy$exposureDurationMin, 
	dnn = c("Mortality","Exposure (min)"))

Scatter plot of the data.

plot(mortality ~ jitter(exposureDurationMin), data = guppy)

Commands for a fancier scatter plot, with more options, are .

Fit a logistic regression.

guppyGlm <- glm(mortality ~ exposureDurationMin, data = guppy,
	family = binomial(link = logit))

Add logistic regression curve to scatter plot.

xpts <- seq(min(guppy$exposureDurationMin), max(guppy$exposureDurationMin), 
	length.out = 100)
ypts <- predict(guppyGlm, newdata = data.frame(exposureDurationMin = xpts), 
	type = "response")
lines(xpts, ypts)

Table of regression coefficients, with parameter estimates (Table 17.9-2).

summary(guppyGlm)

95% confidence intervals for parameter estimates. It is necessary to load the MASS package first.

library(MASS)
confint(guppyGlm)

Predict probability of mortality (mean mortality) for a given x-value, 10 min duration, including standard error of prediction.

predict(guppyGlm, newdata = data.frame(exposureDurationMin = 10),
	type = "response", se.fit = TRUE)

Estimate the LD50, the dose at which half the individuals are predicted to die from exposure.

library(MASS)
dose.p(guppyGlm, p = 0.50)

Analysis of deviance table, with a of the null hypothesis of zero slope (Table 17.9-3).

anova(guppyGlm, test = "Chi")
Formula describing a relationship between the response variable (the y-variable) and the explanatory variable (the x-variable).
The name of the data frame containing the variables in the formula.
The name of the R object containing the results of lm in the previous step.
The name of the R object with the saved results from a previous lm command.
The predict function requires that the x-value have the same name as the explanatory variable in the regression and be in a data frame. This argument accomplishes both at once.
The argument causes the predict function also to show a standard error for the prediction.
The name of the R object with the saved results from a previous lm command.

plot(ageInYears ~ proportionBlack, data = lion, pch = 16, col="firebrick", 
	las = 1, cex = 1.5, bty = "l", xlim = c(0,0.8), ylim = c(0,14), 
	ylab = "Age (years)", xlab = "Proportion black")


xpts <- range(lion$proportionBlack)
ypts <- predict(lionRegression, data.frame(proportionBlack = xpts))
lines(ypts ~ xpts, lwd = 1.5)


plot(logStability ~ nSpecies, data = prairie, bty = "l", col = "firebrick", 
	pch = 1, las = 1, cex = 1.5, xlim = c(0,16), xaxp = c(0,16,8), 
	xlab = "Species number treatment", ylab = "Log-transformed ecosystem stability")
prairieRegression <- lm(logStability ~ nSpecies, data = prairie)
xpts <- c(1,16)
ypts <- predict(prairieRegression, newdata = data.frame(nSpecies = xpts))
lines(ypts ~ xpts, lwd=1.5)


plot(percentWhiteTail ~ latitude, data = junco, pch = 16, col="firebrick", 
 	las = 1, cex = 1.5, bty = "l", xlab = "Latitude (degrees N)", 
 	ylab = "Percent white in tail")


plot(nFishSpecies ~ poolArea, data = desertFish, pch = 16, col = "firebrick", 
  	las = 1, cex = 1.5, bty = "l", xlab = expression(Area~(m^2)), 
  	ylab = "Number of species", ylim = c(0,6))


plot(residuals(capColorRegression) ~ midparentColorScore, data = capColor, 
	pch = 16, col = "firebrick", las = 1, cex = 1.5, bty = "l", 
	xlab = "Midparent color score", ylab = "Residuals")
abline(c(0,0))


plot(residuals(roachRegression) ~ temperature, data = roach, 
	pch = 16, col = "firebrick", las = 1, cex = 1.5, bty = "l", 
	xlab = "Temperature", ylab = "Residuals")
abline(c(0,0))


plot(residuals(irisRegression) ~ tubeLengthMm, data = iris, pch = 16,  
	col = "firebrick", las = 1, cex = 1.5, bty = "l",
	xlab = "Floral tube length (mm)", ylab = "Residuals")
abline(c(0,0))


plot(residuals(irisRegressionSqrt) ~ tubeLengthMm, data = iris,  
	pch= 16, col = "firebrick", las = 1, cex=1.5, bty = "l", 
	xlab = "Floral tube length (mm)", ylab = "Residuals")
abline(c(0,0))


plot( phytoGrowthRate ~ ironConcentration, data = phytoplankton, pch = 16, 
	col = "firebrick", las = 1, cex = 1.5, bty = "l", 
	ylab = "Growth rate (no./day)",
	xlab = expression(paste("Iron concentration (", mu, "mol)")) )


plot(species ~ productivity, data = pondProductivity, pch = 16, 
	col = "firebrick", las = 1, cex = 1.5, bty = "l", 
	ylab = "Number of species", xlab = "Productivity (g/15 days)" )


plot(jitter(length, factor = 2) ~ ageInDays, data = shrink, pch = ".", 
	col = "firebrick", las = 1, bty = "l", ylab = "Body length (cm)",
	xlab = "Female age (days)")


plot(jitter(mortality, factor = 0.1) ~ jitter(exposureDurationMin, factor = 1),  
	data = guppy, col = "firebrick", las = 1, bty = "l", ylab = "Mortality",
	xlab = "Duration of exposure (min)")