On Thursday, we did a basic introduction to R, learned some basic data visualization, and saw how to do some simple statistical tests. Today, we take those tests further and talk about regression. With regression, we can see if one or more variables can be used to predict another variable. Very cool. Let's get started.
Like our last workshop, we'll be making use of some additional packages. The readxl package is again just to read in the data, and tidyverse will mostly be used to clean the data a little bit. The new one for this workshop is lme4, which contains functions for mixed-effects regression models.
library(readxl)
library(tidyverse)
library(lme4)
Everything else will be taken care of using base R functions.
We'll be making use of the two data sets your instructors have made available to you. The first is data collected from bones in Trino Vercellese. We'll call it the trino data set.
# Read in the Trino data
trino <- read_excel("data/Trino subadults data.xlsx")
The other data set is a much larger spreadsheet and contains many more measurements. We'll load that in and call it pathology. Most of this code is identical to what was done in the last workshop and just cleans the data up a little bit. Since this workshop is not about cleaning and tidying data, I won't explain the minutiae of what's going on here, but I encourage you to look at Grolemund and Wickham's R for Data Science to learn more about these steps.
# Read the data in
pathology <- read_excel("data/Data for R workshop.xls", sheet = 1)
# Remove the useless column
pathology$X__1 <- NULL
# Remove rows with NAs
pathology <- pathology[complete.cases(pathology),]
# Clean up the Age and Sex columns
pathology <- pathology %>%
# Clean up the Max Age column
mutate(`Max Age` = fct_recode(`Max Age`, "6 mos" = "6mos", "0" = "birth")) %>%
separate(`Max Age`, into = c("Max Age", "text"), sep = " ", fill = "right", convert = TRUE) %>%
mutate(`Max Age` = if_else(is.na(text), `Max Age`, `Max Age`/12)) %>%
select(-text) %>%
# Do the exact same thing for the Min Age Column
mutate(`Min Age` = fct_recode(`Min Age`, "2 mos" = "2mos", "0" = "birth", "0" = "Birth")) %>%
separate(`Min Age`, into = c("Min Age", "text"), sep = " ", fill = "right", convert = TRUE) %>%
mutate(`Min Age` = if_else(is.na(text), `Min Age`, `Min Age`/12)) %>%
select(-text) %>%
# Clean up the Sex column
mutate(Sex = fct_collapse(`Summary Sex Det.`,
Female = c("Female", "Female?"),
Male = c("Male", "Male?"))) %>%
print()
## # A tibble: 252 x 25
## `Burial #` `Cribra orbitalia` `Porotic hyperostosis` `Max LEH score`
## <chr> <dbl> <dbl> <dbl>
## 1 RO-938 0 0 0
## 2 RO-640 0 0 0
## 3 RO-1163 0 0 0
## 4 RO-1511 0 0 0
## 5 RO-1641 0 0 0
## 6 RO-1671 0 0 0
## 7 RO-533 0 0 0
## 8 RO-637 0 0 0
## 9 RO-710 0 0 0
## 10 RO-713 0 0 0
## # ... with 242 more rows, and 21 more variables: `Is periostitis
## # present?` <dbl>, `L shoulder DJD` <dbl>, `R shoulder DJD` <dbl>, `L
## # elbow DJD` <dbl>, `R elbow DJD` <dbl>, `L hip DJD` <dbl>, `R hip
## # DJD` <dbl>, `L knee DJD` <dbl>, `R knee DJD` <dbl>, `L wrist/hand
## # DJD` <dbl>, `R wrist/hand DJD` <dbl>, `L ankle/foot DJD` <dbl>, `R
## # ankle/foot DJD` <dbl>, `cerv vert DJD` <dbl>, `thor vert DJD` <dbl>,
## # `lum vert DJD` <dbl>, `Summary Sex Det.` <chr>, `Summary
## # age-at-death` <chr>, `Min Age` <dbl>, `Max Age` <dbl>, Sex <fct>
Finally, since most of the columns in this data set are technically categorical variables, let's explicitly tell R that, because as it stands R thinks they're numbers. The fastest way I know how to do this is the mutate_at function in the dplyr library. Again, I don't expect you to know exactly what this is doing, but you should run it to make your data set a bit cleaner.
pathology <- pathology %>%
mutate_at(vars(`Cribra orbitalia`:`lum vert DJD`), as.factor)
summary(pathology)
## Burial # Cribra orbitalia Porotic hyperostosis Max LEH score
## Length:252 0:198 0:214 0:137
## Class :character 1: 53 1: 29 1: 57
## Mode :character 2: 1 2: 9 2: 36
## 3: 22
##
##
## Is periostitis present? L shoulder DJD R shoulder DJD L elbow DJD
## 0:190 0:237 0:238 0:234
## 1: 33 1: 14 1: 12 1: 12
## 2: 26 2: 1 2: 2 2: 6
## 3: 1
## 4: 1
## 7: 1
## R elbow DJD L hip DJD R hip DJD L knee DJD R knee DJD L wrist/hand DJD
## 0:230 0:215 0:209 0:238 0:238 0:228
## 1: 17 1: 31 1: 37 1: 12 1: 10 1: 18
## 2: 5 2: 6 2: 6 2: 1 2: 3 2: 6
## 3: 1 3: 1
##
##
## R wrist/hand DJD L ankle/foot DJD R ankle/foot DJD cerv vert DJD
## 0:225 0:238 0:238 0:222
## 1: 16 1: 14 1: 11 1: 22
## 2: 11 2: 2 2: 8
## 3: 1
##
##
## thor vert DJD lum vert DJD Summary Sex Det. Summary age-at-death
## 0:232 0:229 Length:252 Length:252
## 1: 15 1: 12 Class :character Class :character
## 2: 4 2: 7 Mode :character Mode :character
## 3: 1 3: 4
##
##
## Min Age Max Age Sex
## Min. : 0.00 Min. : 0.00 Female :51
## 1st Qu.: 4.75 1st Qu.: 7.00 Male :34
## Median :20.00 Median :30.00 Subadult:80
## Mean :21.29 Mean :29.51 UID :87
## 3rd Qu.:35.00 3rd Qu.:49.25
## Max. :50.00 Max. :80.00
So with that, our data is cleaned up and ready to go. You're of course welcome to use your own data at this point, but for illustrative purposes, I'll use the sample data.
In the previous workshop, we did some t-tests and chi-squared tests (as well as some related ones for special cases) that let you know if two groups were different in some way. This is great and useful and certainly has its purpose in research. What it does not do is say whether one variable predicts the outcome of another. For example, we can see if the children's and adults' bones have different distributions of isotope ratios, but that treats age as a binary variable. A better question would be something like, "Can we predict the age at death of a skeleton based on its isotope ratios?" or "After taking into account the two isotope ratios, are status and sex significant predictors of a skeleton's age?" Linear regression models can answer these questions.
To start, we’ll begin with a linear regression with just a single continuous predictor, and then we’ll move on to a categorical predictor. Then we’ll look at how to add multiple predictors to a single model.
A linear regression model takes some data and essentially "fits a line" to it. Imagine a scatter plot, with the explanatory variable along the x-axis and the response variable on the y-axis. Actually, we don't even need to imagine this; let's just make one.
ggplot(trino, aes(x = d13C, y = AgeYrs)) +
geom_point()
Even though in reality the d13C ratio is a consequence of how long a person lived, it is now 2,500 years after the fact: you don't know the skeleton's age, but you do know the isotope ratio. So we're going to use what you do know to figure out what you don't know.
What a linear regression model is going to do is find a straight line that comes closest to all of the points. Just by looking at this plot, we can imagine that this line will start in the upper left and move down towards the bottom right at some angle. In fact, we can plot this line right in ggplot with the stat_smooth function.
ggplot(trino, aes(x = d13C, y = AgeYrs)) +
geom_point() +
stat_smooth(method = "lm")
Here, the blue line represents the best-fit line for your data, and the darker gray band is the confidence interval: even if the true line isn't exactly where the blue line is drawn, we can be about 95% sure it falls somewhere within that gray band.
What this line represents is a model. If you’re apartment shopping and management takes you on a tour through a model apartment, you can appreciate that the model is usually a simplified version of what the actual apartment might be like. There will be furniture but maybe not in the exact place you would put yours. It has closets and cupboards, but there may be nothing in them. Overall, it seems like there’s just less stuff in a model apartment. But, it’s still useful to tour and walk around in one because it’ll give you a pretty good idea of what to expect in the actual one.
The same thing kind of applies to a regression model. It's a simplified version of reality, but it's still useful to analyze because it can be used to make predictions about the data. Obviously the model isn't going to fit your data perfectly (lots of dots are relatively far from it), but the prediction is that if you were to collect more data from the same population, the points would generally fall somewhere along that line.
So this is great and all, but you can visually fit a line to anything, even random data. What we need are some actual numbers. To do that, we use the lm function, which stands for "linear model." This function takes two arguments.
The first is a formula with the template response_variable ~ explanatory_variable. Here, we need to think about what our explanatory and response variables are. We want to be able to use the d13C variable to predict AgeYrs, so the AgeYrs column is the response variable and goes first. Then, we put a tilde (~), which is that little squiggly thing on your keyboard to the left of the number 1 key. After the tilde, we put the explanatory variable, d13C. This all acts as the first argument to the lm function.
The second argument is the data that we want to run the model on. So in this case, it's data = trino.
We'll run this model and save it to a new object called trino_lm1:
trino_lm1 <- lm(AgeYrs ~ d13C, data = trino)
Hooray! You did statistics! Now let’s see how to interpret what you’ve done.
Okay, how can we see the results? Run the summary function on the new object you created.
summary(trino_lm1)
##
## Call:
## lm(formula = AgeYrs ~ d13C, data = trino)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.765 -1.780 -1.022 1.887 11.932
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -18.7687 10.0033 -1.876 0.0681 .
## d13C -1.2123 0.5235 -2.316 0.0259 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.159 on 39 degrees of freedom
## (58 observations deleted due to missingness)
## Multiple R-squared: 0.1209, Adjusted R-squared: 0.09834
## F-statistic: 5.363 on 1 and 39 DF, p-value: 0.02592
Okay, lots of numbers! How do we make sense of all this? Let’s take each section one at a time.
Call:
This section simply shows you what the original formula was. Sometimes helpful as a sanity check to make sure it did the right thing.
Residuals:
A residual of an observation is how far off the model's prediction is from the observed value. Visually, if you draw a vertical line from a point to the blue line, the residual is the length of that line. A model's residuals should be normally distributed, and this portion of the output helps you see if they are. The median residual should be close to zero, and the 1Q and 3Q values should be about the same magnitude away from zero. In our case, this looks pretty good, but you can run more formal tests to say for sure. In fact, we can access these residuals with trino_lm1$residuals:
trino_lm1$residuals
## 1 2 3 4 5 6
## -2.9162011 -3.7648045 -2.1312849 -0.6765363 -2.7798883 -2.7648045
## 7 8 9 10 11 12
## -1.5525140 -3.3709497 -1.1888268 -2.4011173 -1.6586592 -1.5374302
## 13 14 15 16 17 18
## -1.7798883 -2.3709497 -2.6134078 1.0385475 -1.0223464 -1.6284916
## 19 20 21 22 23 24
## -3.6893855 -1.5072626 0.9324022 -1.7195531 -0.9921788 -0.6284916
## 25 26 27 28 29 30
## -1.1134078 -0.1284916 -1.4620112 -0.1134078 -0.3558659 0.9927374
## 31 32 33 34 35 36
## 1.9776536 1.8865922 2.8564246 2.8865922 3.7351955 3.0078212
## 37 38 39 40 41
## 2.7653631 3.5229050 2.7955307 7.5379888 11.9324022
Here, we see lots of numbers, so it’s not that useful for humans to read through. But we can run a normality test on these to see if they’re normally distributed.
shapiro.test(trino_lm1$residuals)
##
## Shapiro-Wilk normality test
##
## data: trino_lm1$residuals
## W = 0.84582, p-value = 5.922e-05
Because we have a small p-value (i.e. < 0.05), that is evidence to reject the null hypothesis that the data are normally distributed. That's not great for us. The reason we want normally distributed residuals is that one of the assumptions in a linear regression model is that the relationship between the two variables is linear. Examining the residuals is a way to check that, although an easier and better way is actually to just look at the plot. If the relationship is not linear, it just means there's a more complicated pattern than a straight line, such as a parabola or a more complicated curve. So, for example, what may be happening is that the ratio decreases quickly in children and then decreases more slowly in adults. For now we'll ignore this, but it's good to keep in mind.
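If you do want to entertain a curved relationship, one option is to add a squared term to the model. This is just a hedged sketch of the idea, not something we'll pursue further here, and the object name trino_lm1_quad is mine; the I() wrapper tells R to treat d13C^2 as a literal arithmetic term.
# A sketch of allowing curvature: add a squared d13C term to the model
# Compare this summary to trino_lm1 to see whether the curve helps
trino_lm1_quad <- lm(AgeYrs ~ d13C + I(d13C^2), data = trino)
summary(trino_lm1_quad)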
Skipping down to the bottom section, there are lots of numbers there to tell you how well the model fits the data. The key value you want to look at is the Adjusted R-squared, which, for this model, is 0.09834. This is a value that ranges from 0 to 1, with higher values indicating a better fit. You can think of this number as the proportion of the variation your model was able to capture. 0.09834 is not great, so there are likely other things besides the d13C ratio that can be used to help predict the age of a skeleton.
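As an aside, you don't have to read this number off the printout; the summary object stores it, so you can pull it out programmatically:
# Extract the adjusted R-squared directly from the model summary
summary(trino_lm1)$adj.r.squared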
Coefficients:
The section you want to pay the most attention to is actually the table of coefficients (repeated here since space is not an issue in this online document).
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -18.7687    10.0033  -1.876   0.0681 .
## d13C         -1.2123     0.5235  -2.316   0.0259 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
So this is organized as a table, with the (Intercept) as one row and your explanatory variable, d13C, as another row. The four columns give various summary statistics about these two rows. To interpret this table, let's start with the estimate of the (Intercept) (the first column of the first row). If you think back to plotting a line on a graph, the intercept is where the line crosses the y-axis, meaning where the variable on the x-axis is zero. For most regression models, this number is meaningless. Technically, it's the age you would expect a skeleton to be if it had an isotope ratio of zero. Not only is a ratio of zero meaningless (I assume), but an age of -18.7687 doesn't make sense. In fact, even if those numbers did make sense, you can't extrapolate beyond the range of values of your predictor variable in your sample, and since you only have data between about -21.0 and -16.7, you can only use this model to make predictions in that range. The other numbers in the intercept row aren't too useful; the p-value only says whether the intercept itself is significantly different from zero, and since we don't really care about the intercept anyway, it doesn't matter.
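If you ever need to check that range for yourself rather than reading it off the scatter plot, range() will report the smallest and largest observed values of the predictor:
# The range of observed d13C values; predictions are only trustworthy inside this range
range(trino$d13C, na.rm = TRUE)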
However, the formula for any line needs a slope and an intercept. We've already got the intercept, and as it turns out, the estimate in the d13C row, -1.2123, is the slope of the line. So, just as you would interpret the slope of any other line, this number represents the amount of change along the y-axis for every one unit along the x-axis. More practically, if the isotope ratio goes up by 1, the predicted age goes down by 1.2123 years. Since the isotope ratios are negative, we can turn that around: the model predicts that a skeleton with an isotope ratio of -20 will be 1.2123 years older than a skeleton with an isotope ratio of -19. To calculate exactly what the model predicts, you take the estimates and plug them into the formula
predicted age = intercept + d13C × the estimate for d13C
So, if we want to see how old the model predicts a skeleton to be if the ratio is -19
-18.7687 + -19 * -1.2123
## [1] 4.265
the model predicts it to be 4.265 years old. If the isotope ratio is -20, then
-18.7687 + -20 * -1.2123
## [1] 5.4773
it's predicted to be 5.4773 years old. So, if you were to collect more data and you wanted to estimate the age, you could use this formula to predict it just based on the d13C isotope ratio, though keep in mind that the model fit wasn't great (the adjusted R-squared was low), so this might not be the most reliable formula.
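Rather than plugging the estimates into the formula by hand, you can also give the fitted model a small data frame of new values and let the predict function do the arithmetic. The numbers should match the hand calculations above, up to rounding.
# Predicted ages for hypothetical skeletons with d13C ratios of -19 and -20
predict(trino_lm1, newdata = data.frame(d13C = c(-19, -20)))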
One final part of this output that we certainly want to pay attention to is the p-value for the d13C variable. This p-value says whether d13C is a significant predictor of AgeYrs. Because the value is small (0.0259 < 0.05), that's evidence to reject the null hypothesis that d13C is not a significant predictor, and we can conclude that d13C is a significant predictor of the age of the skeleton. The asterisk next to the p-value is a quick visual shorthand, and there's a little key underneath the coefficients table: if there's a ., the p-value is between 0.1 and 0.05; one asterisk (*) means it's between 0.05 and 0.01; two asterisks (**) means it's between 0.01 and 0.001; and three (***) means it's less than 0.001. Again, this is not a degree of significance, but an indication of how surprised you should be to find what you did, assuming there is no relationship between the predictor and the response variable.
Now that we know that d13C was a significant predictor, see if d15N is also significant. Plot it to get some intuition of what to expect, and then fit the model. Check the residuals to see if they’re normally distributed. Does the model fit the data better than the d13C model? What is the estimated age of a skeleton if their d15N ratio is 8?
First, we’ll plot it just as we did before:
ggplot(trino, aes(x = d15N, y = AgeYrs)) +
geom_point() +
stat_smooth(method = "lm")
Then we’ll fit the model.
trino_fit2 <- lm(AgeYrs ~ d15N, data = trino)
Before even looking at the output, let’s check the normality of the residuals:
shapiro.test(trino_fit2$residuals)
##
## Shapiro-Wilk normality test
##
## data: trino_fit2$residuals
## W = 0.91353, p-value = 0.004265
Again, it looks like the residuals are not normally distributed, so we should be wary when interpreting this model. Here’s the model summary.
summary(trino_fit2)
##
## Call:
## lm(formula = AgeYrs ~ d15N, data = trino)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.9752 -1.7939 -0.7922 1.4631 9.0652
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.7973 2.9419 4.350 9.49e-05 ***
## d15N -0.8979 0.3092 -2.904 0.00604 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.055 on 39 degrees of freedom
## (58 observations deleted due to missingness)
## Multiple R-squared: 0.1778, Adjusted R-squared: 0.1567
## F-statistic: 8.431 on 1 and 39 DF, p-value: 0.006044
So the adjusted R-squared is higher this time at 0.1567, suggesting this model fits the data better than the previous one. d15N is a significant predictor, with a p-value of about 0.006. So, if we had a skeleton with a ratio of 8, its predicted age would be:
12.7973 + 8 * -0.8979
## [1] 5.6141
About 5.6 years old.
A simple linear regression using a single categorical variable is a lot like doing the Wilcoxon or the Kruskal-Wallis test that we did in the last section. We're essentially seeing if the two groups are different from one another in some way. For this section, we'll switch over to the pathology data and see what we can predict about the skeleton's age using various predictors.
To start simple, we'll use the only variable that has just two levels, the L ankle/foot DJD column. In all of the observations, the only two values represented in this data are 0 (pathology is not observable) and 1 (pathology is observable but absent). So essentially, this is a binary variable. To get a feel for what we're working with, let's plot the data using box plots.
ggplot(pathology, aes(x = `L ankle/foot DJD`, y = `Min Age`)) +
geom_boxplot()
So based on this, we have reason to suspect that there is some significant effect. Let's fit a model that predicts the minimum age of a skeleton with L ankle/foot DJD as the only predictor.
path_lm1 <- lm(`Min Age` ~ `L ankle/foot DJD`, data = pathology)
summary(path_lm1)
##
## Call:
## lm(formula = `Min Age` ~ `L ankle/foot DJD`, data = pathology)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.6723 -15.9223 -0.6723 14.3277 29.3277
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.6723 0.9692 21.329 < 2e-16 ***
## `L ankle/foot DJD`1 11.0420 4.1120 2.685 0.00773 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.95 on 250 degrees of freedom
## Multiple R-squared: 0.02804, Adjusted R-squared: 0.02415
## F-statistic: 7.211 on 1 and 250 DF, p-value: 0.007731
The model output is very similar to what we saw earlier. Note that the model fit, measured by the adjusted R-squared, is even lower at 0.02. This is not good. But, if you think about it, you're not going to be able to predict age very well at all if the only information you have is "not observable" vs "observable but absent" pathology on a skeleton's left ankle. However, even though it doesn't do a great job at explaining all the data, the variable is still a significant predictor since it has a p-value of less than 0.05.
Interpreting the model coefficients is a little bit different for a categorical variable though. For continuous data, the intercept represented the estimated value when the predictor was at zero. For a categorical variable, there is no clear zero point. So what R does is choose a reference level, and by default, it'll choose the one that comes first alphabetically. Since we're working with number-like levels in this categorical variable, it'll choose the lowest value, zero. To double-check what the reference level is, you can use the levels function and see which value comes first.
levels(pathology$`L ankle/foot DJD`)
## [1] "0" "1"
So, the intercept is the predicted value when the variable is at its reference level. Specifically, this model predicts that the minimum age of a skeleton with a score of 0 for L ankle/foot DJD is about 20.67 years.
To get the predicted value for another level of a categorical variable, you just add that level's estimate to the intercept. So in this case, when the L ankle/foot DJD value is "1", you would add 11.04 to 20.67 to get a predicted minimum age of about 31.71 years. Even though these are gross approximations of age, this predictor was statistically significant, which is great news for us.
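Since this model has a single categorical predictor, the intercept and the coefficient are really just recovering the group means, which you can check directly; the two means should line up with 20.67 and 20.67 + 11.04.
# Mean minimum age within each level of L ankle/foot DJD
aggregate(`Min Age` ~ `L ankle/foot DJD`, data = pathology, FUN = mean)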
So now, let's see if we can interpret a more complicated variable. Let's pick Max LEH score since it has four levels. As always, it's good to plot the data to get a feel for what you can expect.
ggplot(pathology, aes(x = `Max LEH score`, y = `Min Age`)) +
geom_boxplot()
So for this variable, it looks like the main difference is going to be between 0 and everything else. The differences between 1, 2, and 3 are not big and might not be statistically significant. Let's test this formally by fitting a regression model to the data.
path_lm2 <- lm(`Min Age` ~ `Max LEH score`, data = pathology)
summary(path_lm2)
##
## Call:
## lm(formula = `Min Age` ~ `Max LEH score`, data = pathology)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.228 -15.150 0.056 11.341 31.850
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.150 1.265 14.345 < 2e-16 ***
## `Max LEH score`1 8.078 2.334 3.461 0.000633 ***
## `Max LEH score`2 5.795 2.774 2.089 0.037696 *
## `Max LEH score`3 5.509 3.401 1.620 0.106544
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.81 on 248 degrees of freedom
## Multiple R-squared: 0.05422, Adjusted R-squared: 0.04278
## F-statistic: 4.74 on 3 and 248 DF, p-value: 0.003111
Okay, so what do we see here? Unlike the model outputs we've seen previously, this one has a separate coefficient for every level in your predictor variable. So we have coefficients for 1, 2, and 3. But where's zero? Remember that the intercept is the estimate when the predictor is at its reference level, which in this case is zero.
Okay, so based on this output, you might conclude that the scores of 1 and 2 are significant but not 3. What does that mean? As it turns out, when you use a predictor variable like this, the significance is based on the difference between that level and the reference level. So, what this output is saying is that a score of 1 is significantly different from 0 and that a score of 2 is also significantly different from 0, but that a score of 3 is not significantly different from zero. What this does not tell us is whether there is a significant difference between 1 and 2 or between 1 and 3 or between 2 and 3. How can we do that?
A simple way is to just change the reference level. First, I'm going to make a copy of the pathology data set, because I only want the change I'm about to make to be temporary.
pathology_temp <- pathology
Then, I'll redefine the Max LEH score column as a releveled version of itself, only with the reference level set to "1" (accomplished using ref = "1").
pathology_temp$`Max LEH score` <- relevel(pathology_temp$`Max LEH score`, ref = "1")
Now, I’ll run another regression model on this temporary data frame and examine the output.
path_lm3 <- lm(`Min Age` ~ `Max LEH score`, data = pathology_temp)
summary(path_lm3)
##
## Call:
## lm(formula = `Min Age` ~ `Max LEH score`, data = pathology_temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.228 -15.150 0.056 11.341 31.850
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.228 1.961 13.372 < 2e-16 ***
## `Max LEH score`0 -8.078 2.334 -3.461 0.000633 ***
## `Max LEH score`2 -2.284 3.153 -0.724 0.469528
## `Max LEH score`3 -2.569 3.717 -0.691 0.490112
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.81 on 248 degrees of freedom
## Multiple R-squared: 0.05422, Adjusted R-squared: 0.04278
## F-statistic: 4.74 on 3 and 248 DF, p-value: 0.003111
So the results look quite different. Note that we now have coefficients for 0, 2, and 3, because 1 is now the reference level, meaning it's captured by the intercept. Crucially, we draw different conclusions. Now 0 is significant, but the significance for 2 went away. Why is that? It's because while 0 is significantly different from 1 in this data set, the difference between 1 and 2 is not significant.
Based on my limited knowledge of this data, it's probably the case that keeping the reference level at zero for any or all of these variables makes the most sense. You're free to change the reference level when it makes theoretical sense, meaning it makes sense intuitively to compare the various levels to that one. If you find yourself changing the reference level willy-nilly and sticking with one because you like the results (especially if it doesn't make a lot of sense why that one should be considered the reference level), you're essentially fudging the numbers to get what you want. Just make sure the reference level makes sense.
A more efficient way of viewing all the differences is to use what's called an analysis of variance. Similar to the regression model or even the Kruskal-Wallis test, this will tell us whether Max LEH score is a significant predictor of Min Age. Unfortunately, to run this, we can't have spaces in the names of our columns, so we're going to have to rename them. One way to do that is to just create new columns that are equal to the ones you want to rename.
pathology_temp$min_age <- pathology_temp$`Min Age`
pathology_temp$score <- pathology_temp$`Max LEH score`
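As an aside, the rename function from dplyr does the same renaming in one step, if you'd rather not carry duplicate columns around. This sketch starts from the original pathology data and stores the result in a new object (pathology_renamed, a name I've made up) so it doesn't clash with the copies made above.
# Equivalent renaming with dplyr (new_name = old_name), stored separately here
pathology_renamed <- rename(pathology, min_age = `Min Age`, score = `Max LEH score`)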
We'll then run the analysis of variance on this temporary data set using the new column names. The syntax is identical to the lm syntax.
path_aov <- aov(min_age ~ score, data = pathology_temp)
summary(path_aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## score 3 3118 1039.4 4.74 0.00311 **
## Residuals 248 54386 219.3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The summary is simple and just shows whether the score column is a significant predictor of min_age. We already knew that from the regression model. What it does not show, though, is which levels of score differ from one another. To see that, we have to do a post hoc test. A good way to do that is Tukey's Honest Significant Difference test using the TukeyHSD function, which will compare all pairs of levels and see which ones are significantly different from one another.
TukeyHSD(path_aov)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = min_age ~ score, data = pathology_temp)
##
## $score
## diff lwr upr p adj
## 0-1 -8.0784351 -14.115685 -2.041186 0.0035233
## 2-1 -2.2836257 -10.437965 5.870713 0.8873011
## 3-1 -2.5689793 -12.182905 7.044946 0.9004118
## 2-0 5.7948094 -1.378961 12.968580 0.1594170
## 3-0 5.5094559 -3.288118 14.307030 0.3693385
## 3-2 -0.2853535 -10.650785 10.080078 0.9998702
So this shows a table with p-values suggesting the significance between each pair. Looking through this, it looks like the only significant difference is between 1 and 0, with a p-value of about 0.004.
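If you'd rather see these comparisons than read them, the output of TukeyHSD can be passed straight to plot, which draws a confidence interval for each pairwise difference; intervals that don't cross zero correspond to significant differences.
# Visualize the pairwise comparisons; intervals that exclude zero are significant
plot(TukeyHSD(path_aov))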
So doing regression with categorical variables takes a bit more work to interpret. The results are less clear because you have to understand what the reference values are. But once you get it, things work out and you’ll be able to interpret these numbers without any problems.
So what we've done so far is look at a single variable predicting another one. While the predictors we chose (d13C, d15N, L ankle/foot DJD, and Max LEH score) were all significant, the overall models weren't very effective at predicting the age of the skeleton. The next step is to see if we can add more predictors to the model and, with the combination of more factors, do a better job of predicting this.
This is where multiple regression comes in. Multiple regression is just like simple linear regression, only we add more than one variable. The functions are the same; we just add more columns to the formula. You can add more numeric or categorical columns, or any combination of the two, and the model will work just fine. The interesting thing about this is that you can see the effect of one variable after taking into account the influence of all the others. If you've got two variables trying to predict a third, you can think of a plane trying to fit the data in a three-dimensional space. When you add more variables (which you certainly can do), it gets harder to visualize, but the concept is the same. Let's see this in action.
Going back to the trino data, we saw that d13C and d15N were both significant predictors of the age of the skeleton, with the latter doing a slightly better job. Let's see if we can create a better model by using both of them. To add variables to a formula, just put a plus sign between them.
trino_fit3 <- lm(AgeYrs ~ d13C + d15N, data = trino)
summary(trino_fit3)
##
## Call:
## lm(formula = AgeYrs ~ d13C + d15N, data = trino)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5997 -1.7699 -0.7029 1.6553 10.1064
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.5499 14.2153 0.039 0.9693
## d13C -0.5471 0.6211 -0.881 0.3840
## d15N -0.7054 0.3794 -1.859 0.0707 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.064 on 38 degrees of freedom
## (58 observations deleted due to missingness)
## Multiple R-squared: 0.1942, Adjusted R-squared: 0.1518
## F-statistic: 4.579 on 2 and 38 DF, p-value: 0.01653
Okay, so the output is in virtually the same format, only each predictor has its own row in the table of coefficients. Remember that d13C and d15N, when by themselves, were both significant predictors of the skeleton's age? What do you notice now? Neither is. So what's up with that? As it turns out, each variable, after taking into account what the other does, doesn't really add much information about the skeleton's age. This doesn't mean one of the predictors is useless (we found that both were significant individually); it's usually a sign that the two predictors are correlated with each other, so much of the information they carry about age is shared between them.
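A quick way to check this suspicion is to look at how strongly the two predictors are correlated with each other. I haven't verified the value for this data set, but a high correlation here would be consistent with that explanation.
# Correlation between the two isotope ratios, ignoring rows with missing values
cor(trino$d13C, trino$d15N, use = "complete.obs")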
One other small detail is the adjusted R-squared. While it is still higher than that of trino_lm1, which was 0.098, it's actually slightly lower than that of trino_fit2, which was 0.1567. How can this be a worse model than the one with just d15N? The adjusted R-squared measures how well the model fits your data given how many variables there are. Basically, it adds a penalty for each additional variable. If a variable's contribution outweighs the penalty, the adjusted R-squared goes up. But in this case, the contribution of d13C is less than the penalty attached to it, so there's a drop in the adjusted R-squared. This would be a time when you're probably justified in dropping the d13C variable from the model and sticking with the simpler one.
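If you want something more formal than comparing adjusted R-squared values by eye, the anova function can compare the two nested models directly and test whether adding d13C significantly improves the fit. (This assumes both models were fit to the same rows, which appears to be the case here since both report the same 58 missing observations.)
# Does adding d13C on top of d15N significantly improve the fit?
anova(trino_fit2, trino_fit3)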
Let’s say the predictors were significant. How do we interpret the results of a multiple regression model? Basically, the same as when there’s just one variable. The only difference is that because there are more variables, you have to add more numbers. The reference level for both continuous variables is zero, so if you want to predict the age of someone with a d13C measurement of -19 and a d15N measurement of 10, you’d start with the intercept (0.5499) and add the coefficient of d13C (-0.5471) times the d13C measurement (-19) and then add the coefficient of the d15N (-0.7054) times the d15N measurement (10).
0.5499 + (-0.5471 * -19) + (-0.7054 * 10)
## [1] 3.8908
The predicted age is 3.89 years old.
When you've got multiple categorical variables, interpreting the output is similar to multiple continuous variables, only you have to keep track of the reference levels. Looking just at the adult bones in the trino data set, let's fit a model that predicts the d13C ratio using both Sex and Status as predictors.
# First, subset the data
adult_bones <- trino[trino$SumAge == "Adult",]
# Fit the model
trino_lm4 <- lm(d13C ~ Sex + Status, data = adult_bones)
summary(trino_lm4)
##
## Call:
## lm(formula = d13C ~ Sex + Status, data = adult_bones)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.82773 -0.30950 -0.03639 0.25496 1.07227
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -19.9126 0.1931 -103.108 < 2e-16 ***
## SexMale 0.7576 0.1988 3.811 0.000803 ***
## StatusLow 0.6827 0.1857 3.677 0.001130 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4898 on 25 degrees of freedom
## Multiple R-squared: 0.5103, Adjusted R-squared: 0.4712
## F-statistic: 13.03 on 2 and 25 DF, p-value: 0.000133
The model output suggests that the men had a higher d13C ratio than the women did, by about 0.76. Additionally, low status people had a higher ratio by about 0.68. So, high status women were predicted to have a d13C ratio of -19.91, while low status men had a predicted ratio of (-19.9126 + 0.7576 + 0.6827 =) -18.47. Technically, these results might not be reliable because we fit two variables to a model with only 28 observations, but it's something to consider looking into.
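Again, predict can do this adding-up for you. Note that I'm assuming the Status column uses the labels "High" and "Low", which is what the coefficient names in the output suggest; adjust the labels if your data codes them differently.
# Predicted d13C for a high status woman and a low status man
predict(trino_lm4, newdata = data.frame(Sex = c("Female", "Male"),
                                        Status = c("High", "Low")))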
With multiple regression, we're not limited to just two variables. You can add many more, depending on the number of observations you have. One rule of thumb that I've heard is that you can have one variable per 15 observations. The pathology data has 252 observations, meaning we could theoretically have up to 16 variables. This doesn't mean you have to add that many, because as we saw in the previous model, more is not necessarily better. Plus, the more variables you have, the less statistical power your model has, meaning it's going to be less accurate because the data is being stretched too thin.
So as it stands, the pathology data has something like 19 or 20 pathology-related variables. This already exceeds the number of variables we really can feasibly include in a model when we’re working with “only” 252 observations. But, as it turns out, because of the way categorical variables are handled mathematically in the model, they count as more than one. So if there are three unique values in a categorical variable, that’ll count as 2 variables. If there are 4, it’ll count as 3. So really, if you want to include all 19 variables, you’d need at least 675 skeletons to analyze.
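If you want to see this expansion concretely, you can peek at the design matrix R builds behind the scenes: a four-level factor like Max LEH score turns into three dummy columns, plus the intercept column.
# The first few rows of the design matrix for a single four-level factor
head(model.matrix(~ `Max LEH score`, data = pathology))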
Now, there are ways around this. One is to use some sort of dimension reduction technique like PCA where you combine the information from multiple columns into fewer abstract ones. I don’t know how to do this with categorical variables though.
The other technique is to let the computer choose which variables to include. It's probably not the case that every variable makes a significant contribution to predicting the skeleton's age, especially after taking into account the information that all the other variables have contributed. What is more likely is that some variables will contribute information and most of the others won't.
So, what we’re going to do is one of many ways to do variable selection. What we’re going to tell R to do is to evaluate many possible models and tell us which one is the best. Pretty slick. The way it works is like this.
First, we start with a null model. This is a regular linear model, but with no variables as predictors. As a placeholder, we put a constant, 1, as the only term on the right-hand side. This model essentially fits a horizontal line to the data. If none of the predictors are any good, this null model will be no better and no worse than any of the others.
null_model <- lm(`Min Age` ~ 1, data = pathology)
We then provide a list of variables that we want R to consider for inclusion. We have 19 of these, so we just have to type them all out. I couldn't find a shortcut, so you can copy and paste what I have here. This all goes inside the step function. As its first argument, it takes the null model, and then in the scope argument, you put all the variables you want to be tested, after a tilde.
step_model <- step(null_model, scope = ~ `Cribra orbitalia` + `Porotic hyperostosis` + `Max LEH score` +
`Is periostitis present?` + `L shoulder DJD` + `R shoulder DJD` + `L elbow DJD` +
`R elbow DJD` + `L hip DJD` + `R hip DJD` + `L knee DJD` + `R knee DJD` +
`L wrist/hand DJD` + `R wrist/hand DJD` + `L ankle/foot DJD` + `R ankle/foot DJD` +
`cerv vert DJD` + `thor vert DJD` + `lum vert DJD`)
## Start: AIC=1370.41
## `Min Age` ~ 1
##
## Df Sum of Sq RSS AIC
## + `R wrist/hand DJD` 2 4217.3 53287 1355.2
## + `cerv vert DJD` 2 3376.2 54128 1359.2
## + `R hip DJD` 2 2977.9 54526 1361.0
## + `L hip DJD` 2 2837.7 54666 1361.7
## + `Max LEH score` 3 3118.1 54386 1362.4
## + `L wrist/hand DJD` 2 2099.8 55404 1365.0
## + `L ankle/foot DJD` 1 1612.1 55892 1365.2
## + `R elbow DJD` 2 1694.9 55809 1366.9
## + `Is periostitis present?` 5 2953.2 54551 1367.1
## + `lum vert DJD` 3 2056.3 55448 1367.2
## + `Cribra orbitalia` 2 1547.6 55957 1367.5
## + `R shoulder DJD` 2 1252.6 56252 1368.9
## + `L knee DJD` 3 1641.4 55863 1369.1
## + `L shoulder DJD` 2 1182.3 56322 1369.2
## + `thor vert DJD` 3 1466.0 56038 1369.9
## <none> 57504 1370.4
## + `L elbow DJD` 2 817.8 56686 1370.8
## + `R ankle/foot DJD` 3 1072.1 56432 1371.7
## + `Porotic hyperostosis` 2 480.4 57024 1372.3
## + `R knee DJD` 3 888.9 56615 1372.5
##
## Step: AIC=1355.21
## `Min Age` ~ `R wrist/hand DJD`
##
## Df Sum of Sq RSS AIC
## + `cerv vert DJD` 2 1693.0 51594 1351.1
## + `L hip DJD` 2 1659.0 51628 1351.2
## + `R hip DJD` 2 1373.3 51914 1352.6
## + `Max LEH score` 3 1519.0 51768 1353.9
## <none> 53287 1355.2
## + `L ankle/foot DJD` 1 268.2 53019 1355.9
## + `Cribra orbitalia` 2 627.1 52660 1356.2
## + `thor vert DJD` 3 922.4 52364 1356.8
## + `R shoulder DJD` 2 468.8 52818 1357.0
## + `lum vert DJD` 3 801.0 52486 1357.4
## + `Is periostitis present?` 5 1626.4 51660 1357.4
## + `R elbow DJD` 2 377.6 52909 1357.4
## + `L elbow DJD` 2 288.8 52998 1357.8
## + `L shoulder DJD` 2 260.7 53026 1358.0
## + `L knee DJD` 3 665.5 52621 1358.0
## + `Porotic hyperostosis` 2 173.5 53113 1358.4
## + `L wrist/hand DJD` 2 16.3 53270 1359.1
## + `R knee DJD` 3 420.6 52866 1359.2
## + `R ankle/foot DJD` 3 236.7 53050 1360.1
## - `R wrist/hand DJD` 2 4217.3 57504 1370.4
##
## Step: AIC=1351.08
## `Min Age` ~ `R wrist/hand DJD` + `cerv vert DJD`
##
## Df Sum of Sq RSS AIC
## + `L hip DJD` 2 1241.45 50352 1348.9
## + `R hip DJD` 2 955.96 50638 1350.4
## <none> 51594 1351.1
## + `Porotic hyperostosis` 2 607.17 50987 1352.1
## + `Max LEH score` 3 1009.02 50585 1352.1
## + `L ankle/foot DJD` 1 164.79 51429 1352.3
## + `Cribra orbitalia` 2 409.67 51184 1353.1
## + `R elbow DJD` 2 184.25 51410 1354.2
## + `L shoulder DJD` 2 129.98 51464 1354.4
## + `R shoulder DJD` 2 129.50 51464 1354.4
## + `L wrist/hand DJD` 2 35.30 51559 1354.9
## + `L elbow DJD` 2 22.71 51571 1355.0
## + `lum vert DJD` 3 419.33 51174 1355.0
## - `cerv vert DJD` 2 1693.00 53287 1355.2
## + `L knee DJD` 3 315.52 51278 1355.5
## + `thor vert DJD` 3 275.07 51319 1355.7
## + `Is periostitis present?` 5 992.59 50601 1356.2
## + `R ankle/foot DJD` 3 85.24 51509 1356.7
## + `R knee DJD` 3 80.55 51513 1356.7
## - `R wrist/hand DJD` 2 2534.17 54128 1359.2
##
## Step: AIC=1348.94
## `Min Age` ~ `R wrist/hand DJD` + `cerv vert DJD` + `L hip DJD`
##
## Df Sum of Sq RSS AIC
## <none> 50352 1348.9
## + `Porotic hyperostosis` 2 595.73 49757 1349.9
## + `L ankle/foot DJD` 1 43.91 50308 1350.7
## - `L hip DJD` 2 1241.45 51594 1351.1
## + `Cribra orbitalia` 2 340.96 50011 1351.2
## - `cerv vert DJD` 2 1275.45 51628 1351.2
## + `R hip DJD` 2 265.45 50087 1351.6
## + `Max LEH score` 3 640.08 49712 1351.7
## + `L shoulder DJD` 2 191.06 50161 1352.0
## + `L elbow DJD` 2 143.22 50209 1352.2
## + `L wrist/hand DJD` 2 79.58 50273 1352.5
## + `R shoulder DJD` 2 35.18 50317 1352.8
## + `R elbow DJD` 2 28.12 50324 1352.8
## + `lum vert DJD` 3 235.54 50117 1353.8
## + `R ankle/foot DJD` 3 207.14 50145 1353.9
## + `L knee DJD` 3 138.91 50213 1354.2
## + `R knee DJD` 3 129.25 50223 1354.3
## + `thor vert DJD` 3 54.20 50298 1354.7
## + `Is periostitis present?` 5 684.09 49668 1355.5
## - `R wrist/hand DJD` 2 2335.30 52688 1356.4
This will produce a lot of output, but it’s useful to look through because you can see what the various steps were.
What step does first is fit 19 individual models, each with one variable as the only predictor. It gets summary statistics about each of these models, including values that indicate how well the model fits the data. Instead of adjusted R-squared, this procedure uses what's called AIC: the model with the lowest AIC is considered the best fit to the data. So in step 1, you can see the 19 candidate models in order of their AIC. Note that the null model itself, or rather, the summary statistics if no change is made to the current model, is also included as <none>. (Interestingly, including any of 4 of the variables was actually worse than the null model.) In the end, the model that was selected as the best was the one with R wrist/hand DJD.
In step 2, it repeats the process, only this time it adds each of the other variables to a model that already contains R wrist/hand DJD. In addition, it tries removing the R wrist/hand DJD variable to see if the null model is actually better. You can see the results of these 19 candidate models in the output. The model that turned out to be the best was the one that included cerv vert DJD as a second predictor.
Step 3 does the same thing. Starting from the model with both R wrist/hand DJD and cerv vert DJD, it tries adding each of the 17 remaining variables, one at a time. It also tries removing each of the variables already in the model, in case that makes a difference (and it sometimes does!). It found that a model that included those two plus L hip DJD was the best.
Step 4 does the same thing. The only difference was that it found that none of the candidate models did any better (meaning none had a lower AIC score) than the current best model with those three predictors. So, it concludes that the current model is the best and finishes.
When we save the output of the step function into a variable, that variable contains the best-fitting model. We can see the results of that model now.
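Before digging into the summary, you can quickly confirm which formula the procedure settled on. If you also ask for its AIC, keep in mind that the AIC function includes constants that step's extractAIC calculation drops, so the absolute number won't match the 1348.94 printed in the trace, even though comparisons between models come out the same.
# The formula and AIC of the model that step() selected
formula(step_model)
AIC(step_model)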
summary(step_model)
##
## Call:
## lm(formula = `Min Age` ~ `R wrist/hand DJD` + `cerv vert DJD` +
## `L hip DJD`, data = pathology)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.0484 -14.2984 0.9516 10.9516 30.9516
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.048 1.008 18.893 < 2e-16 ***
## `R wrist/hand DJD`1 8.016 4.042 1.983 0.04845 *
## `R wrist/hand DJD`2 13.507 4.549 2.969 0.00328 **
## `cerv vert DJD`1 -1.306 3.694 -0.354 0.72392
## `cerv vert DJD`2 12.844 5.476 2.345 0.01980 *
## `L hip DJD`1 4.242 3.121 1.359 0.17537
## `L hip DJD`2 13.577 6.107 2.223 0.02712 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.34 on 245 degrees of freedom
## Multiple R-squared: 0.1244, Adjusted R-squared: 0.1029
## F-statistic: 5.8 on 6 and 245 DF, p-value: 1.145e-05
So it turns out that this model contains three variables: R wrist/hand DJD, cerv vert DJD, and L hip DJD. With multiple categorical variables, this is where interpreting the output gets a little tricky. Keep in mind that the intercept represents the predicted age of the skeleton when all the predictor variables are at their reference levels. In this case, they're all zero. So to interpret the coefficients, you add the relevant estimates to that intercept.
For example, if you have a skeleton with a score of 0 for both R wrist/hand DJD and cerv vert DJD but a score of 1 for L hip DJD, you would take the coefficient for the left hip at level 1 (4.242) and add it to the intercept (19.048) to get a predicted minimum age of about 23.29.
But let's say we have a skeleton with a score of 0 for cerv vert DJD, 1 for L hip DJD, and 2 for R wrist/hand DJD. We start with the intercept (19.048), add the estimate for L hip DJD at level 1 (4.242), and add the estimate for R wrist/hand DJD at level 2 (13.507) to get an estimated age of 36.797. So for multiple categorical variables, you add whichever estimates are relevant to the intercept to get the estimated age.
In the end, this model is a little unsatisfactory. It's not likely that the only way to determine the age of a skeleton is by looking at the presence or absence of pathology in just the right wrist/hand, cerv vert, and left hip. I'm not entirely sure what to do with this kind of data. It's more likely that the variables are all correlated in some way, and that they all contribute to predicting the age of the skeleton roughly equally. I would recommend looking into alternatives to PCA that work on categorical data to see if you can reduce these potentially correlated variables into a more manageable number.
Today's workshop has introduced the idea of regression. Basically, you can use regression to see if one or more variables predict another. It's a very powerful, very useful tool when analyzing your data. As you can imagine, there are lots of things that could still be said about regression, but hopefully this was enough to get you started on your own data and to get you curious to learn more.
There are some additional concepts that we just didn't have time for that might be worth looking into. Everything we did has to do with predicting a continuous variable. But sometimes you want to predict a binary variable. For that, you use what's called a logistic regression model. Interpreting the model output is a bit trickier with logistic regression, but it's also a very useful tool. If you want to use variables to predict some sort of categorical variable with more than 2 levels, that's what's called multinomial logistic regression. I don't know how to do it very well, but at least you have the term to google now. Finally, I would also look into what are called mixed-effects models, which are super powerful models that take into account the idiosyncrasies of individual skeletons or field sites and their effect on the overall results.
Statistics is fun, and it's always worth exploring what else is out there to see what other things you can glean from your data.