1 Introduction

This last part of the workshop is more specialized. Since I work with regional accents of American English, my research is so far from what you guys do that I have virtually no clue what it is that you do. Which is cool! I picture you guys as sort of a mix between the TV show Bones and the movie National Treasure. That may be completely off, but I’ll go with it. However, statistics is field-independent, so the principles that I’ve learned in linguistics still apply to you and your data. Which is also very cool.

Your instructor has provided us with some sample data, similar to what you recently collected. She has also explained what kinds of statistical tests are used with this kind of data. In this last section of the workshop, I’ll briefly go over how to implement some of these things in R.

1.1 Load the packages

We’ll need several additional packages for this workshop. The first, readxl, helps us read in the Excel spreadsheets that the data is stored in. The second, ggplot2, is for plotting. tidyverse is for data cleanup. The last, vcd, provides some specialized functions we’ll use later on.

library(readxl)
library(ggplot2)
library(tidyverse)
library(vcd)

Everything else will be taken care of using base R functions.

1.2 Load the data

I’ve gotten two datasets from your instructors to use in this workshop. Each can be used to illustrate certain concepts, so we’ll work with both of them.

The first is data collected from bones in Trino Vercellese. We’ll call it the trino dataset.

trino <- read_excel("data/Trino subadults data.xlsx")
print(trino)
## # A tibble: 99 x 7
##    ID    SumAge        AgeYrs  d15N  d13C Status Sex  
##    <chr> <chr>          <dbl> <dbl> <dbl> <chr>  <chr>
##  1 188-D Dead Subadult    0.5  11.5 -18.3 High   <NA> 
##  2 188   Dead Subadult    0.5  10.1 -19   High   <NA> 
##  3 173-C Dead Subadult    0.8  10.8 -17.9 High   <NA> 
##  4 192   Dead Subadult    0.8  11.6 -16.7 High   <NA> 
##  5 133   Dead Subadult    1    10   -18.6 High   <NA> 
##  6 188-A Dead Subadult    1.5   8.1 -19   High   <NA> 
##  7 190   Dead Subadult    1.5   8   -18   High   <NA> 
##  8 194   Dead Subadult    1.5   9.9 -19.5 High   <NA> 
##  9 385   Dead Subadult    1.5  11.8 -17.7 Low    <NA> 
## 10 446   Dead Subadult    1.5  11.7 -18.7 Low    <NA> 
## # ... with 89 more rows

Because this spreadsheet essentially contains three different datasets, I’ll split it into those three. I’ll use the subsetting technique that we covered in Part 1 of this workshop. I’ll get the data from the children by selecting just the rows of trino that have "Dead Subadult" as the value in the SumAge column. Note that you need two equals signs (==) for that. Similarly, I’ll get the adult bones’ data and the adult teeth data.

children_bones <- trino[trino$SumAge == "Dead Subadult",]
adult_bones   <- trino[trino$SumAge == "Adult",]
adult_teeth   <- trino[trino$SumAge == "Living Subadult",]

The other dataset is a much larger spreadsheet, and contains many more measurements. We’ll load that in and call it pathology.

pathology <- read_excel("data/Data for R workshop.xls", sheet = 1)

There’s a bit of cleaning we need to do with this dataset before we can work with it. First, there’s one column at the very end that had some sort of comment in it. We’ll toss that since we don’t need it. Since there wasn’t a column name in the original spreadsheet, it was given the default name X__1. To remove a single column, we can just set it equal to NULL:

pathology$X__1 <- NULL

Next, there are a few skeletons near the bottom (7, actually) that have very little or no pathology data, and there are others with missing measurements. If we subset the pathology data using the complete.cases function, we can remove these rows. But before we do, it’s worth exploring the rows we’re about to remove. To take the logical opposite of something, put an exclamation point (!) in front of it. So while subsetting with complete.cases keeps only the rows without NAs, subsetting with !complete.cases shows just the rows that have them.

pathology[!complete.cases(pathology),]
## # A tibble: 14 x 24
##    `Burial #` `Cribra orbitalia` `Porotic hyperostosis` `Max LEH score`
##    <chr>                   <dbl>                  <dbl>           <dbl>
##  1 RO-1735                     1                      0               0
##  2 RO-1365                     0                      0               0
##  3 RA-106                      0                      0               0
##  4 RA-144                      0                      0               0
##  5 RO-1119                    NA                     NA               0
##  6 RO-1166-B                  NA                     NA               0
##  7 RO-1364                    NA                     NA               0
##  8 RO-1371                    NA                     NA               0
##  9 RO-1690                    NA                     NA               0
## 10 RO-1699                    NA                     NA               0
## 11 RO-638-B                   NA                     NA               0
## 12 RO-861                      0                      0               0
## 13 RO-862                      0                      0               0
## 14 L-231                       2                      2               1
## # ... with 20 more variables: `Is periostitis present?` <dbl>, `L shoulder
## #   DJD` <dbl>, `R shoulder DJD` <dbl>, `L elbow DJD` <dbl>, `R elbow
## #   DJD` <dbl>, `L hip DJD` <dbl>, `R hip DJD` <dbl>, `L knee DJD` <dbl>,
## #   `R knee DJD` <dbl>, `L wrist/hand DJD` <dbl>, `R wrist/hand
## #   DJD` <dbl>, `L ankle/foot DJD` <dbl>, `R ankle/foot DJD` <dbl>, `cerv
## #   vert DJD` <dbl>, `thor vert DJD` <dbl>, `lum vert DJD` <dbl>, `Summary
## #   Sex Det.` <chr>, `Summary age-at-death` <chr>, `Min Age` <chr>, `Max
## #   Age` <chr>

Some of these might be in error, but I don’t know. You can go back and fix them in Excel if you’d like. Since NAs in the data tend to mess things up down the road, we’ll remove all 14 of these rows.

pathology <- pathology[complete.cases(pathology),]

Finally, the age data is a little messy. We want numbers in there, but because some of the rows contain text like 8 months, R interprets the entire column as text. That’s probably not what you want for age data.

Let’s see what we’re working with here. We can tell R to show us the unique values in the Max Age column with the unique function, and we can wrap that in sort to put them in order. Note that because the column was originally named Max Age in Excel, R retained the space, which makes it trickier to work with. This means that when you refer to this column you’ll have to put backticks (`) on either side of Max Age:

sort(unique(pathology$`Max Age`))
##  [1] "0.25"      "1"         "1.5"       "10"        "10 months"
##  [6] "11"        "11 mos"    "12 mos"    "13"        "14.5"     
## [11] "15"        "15 mos"    "16 mos"    "17"        "18"       
## [16] "19"        "2"         "2 mos"     "20"        "22"       
## [21] "24"        "25"        "3"         "3 mos"     "3.5"      
## [26] "30"        "34"        "35"        "4"         "4 mos"    
## [31] "40"        "45"        "49"        "5"         "50"       
## [36] "55"        "6"         "6 months"  "6 mos"     "60"       
## [41] "6mos"      "7"         "7 months"  "70"        "8"        
## [46] "8 mos"     "80"        "9"         "9 mos"     "birth"
sort(unique(pathology$`Min Age`))
##  [1] "0"        "1"        "10"       "12"       "15"       "16"      
##  [7] "17"       "18"       "2"        "2 mos"    "2.5"      "20"      
## [13] "24"       "25"       "2mos"     "3"        "3 mos"    "30"      
## [19] "35"       "4"        "4 months" "4 mos"    "40"       "45"      
## [25] "5"        "5 months" "50"       "6"        "6 mos"    "7"       
## [31] "8"        "8 months" "8 mos"    "9"        "9 mos"    "9.5"     
## [37] "birth"    "Birth"

So the plan is to strip away the months and mos and, in those rows, divide the leftover number by 12 to get an age in years. But because of some inconsistencies in the data, we’ll have to individually change 6mos and 2mos (since there’s no space) as well as birth and Birth. This is the kind of thing you might find easier to do in Excel, but for convenience, here’s some R code that will take care of it for you. It’s beyond the scope of this workshop to explain this code, but you can look at my Tidyverse workshops to find out more.

pathology <- pathology %>%
  
  # Clean up the Max Age column
  mutate(`Max Age` = fct_recode(`Max Age`, "6 mos" = "6mos", "0" = "birth")) %>%
  separate(`Max Age`, into = c("Max Age", "text"), sep = " ", fill = "right", convert = TRUE) %>%
  mutate(`Max Age` = if_else(is.na(text), `Max Age`, `Max Age`/12)) %>%
  select(-text) %>%
  
  # Do the exact same thing for the Min Age Column
  mutate(`Min Age` = fct_recode(`Min Age`, "2 mos" = "2mos", "0" = "birth", "0" = "Birth")) %>%
  separate(`Min Age`, into = c("Min Age", "text"), sep = " ", fill = "right", convert = TRUE) %>%
  mutate(`Min Age` = if_else(is.na(text), `Min Age`, `Min Age`/12)) %>%
  select(-text) %>%
  print()
## # A tibble: 252 x 24
##    `Burial #` `Cribra orbitalia` `Porotic hyperostosis` `Max LEH score`
##    <chr>                   <dbl>                  <dbl>           <dbl>
##  1 RO-938                      0                      0               0
##  2 RO-640                      0                      0               0
##  3 RO-1163                     0                      0               0
##  4 RO-1511                     0                      0               0
##  5 RO-1641                     0                      0               0
##  6 RO-1671                     0                      0               0
##  7 RO-533                      0                      0               0
##  8 RO-637                      0                      0               0
##  9 RO-710                      0                      0               0
## 10 RO-713                      0                      0               0
## # ... with 242 more rows, and 20 more variables: `Is periostitis
## #   present?` <dbl>, `L shoulder DJD` <dbl>, `R shoulder DJD` <dbl>, `L
## #   elbow DJD` <dbl>, `R elbow DJD` <dbl>, `L hip DJD` <dbl>, `R hip
## #   DJD` <dbl>, `L knee DJD` <dbl>, `R knee DJD` <dbl>, `L wrist/hand
## #   DJD` <dbl>, `R wrist/hand DJD` <dbl>, `L ankle/foot DJD` <dbl>, `R
## #   ankle/foot DJD` <dbl>, `cerv vert DJD` <dbl>, `thor vert DJD` <dbl>,
## #   `lum vert DJD` <dbl>, `Summary Sex Det.` <chr>, `Summary
## #   age-at-death` <chr>, `Min Age` <dbl>, `Max Age` <dbl>

I’m also going to create a cleaned up version of the Summary Sex Det. column. Here’s what we have so far:

table(pathology$`Summary Sex Det.`)
## 
##   Female  Female?     Male    Male? Subadult      UID 
##       21       30       20       14       80       87

Basically, I’ll collapse it down by removing the question marks, merging Female? into Female and Male? into Male (and leaving Subadult and UID as they are).

pathology <- pathology %>%
  mutate(Sex = fct_collapse(`Summary Sex Det.`,
                            Female = c("Female", "Female?"),
                            Male   = c("Male", "Male?"))) %>%
  print()
## # A tibble: 252 x 25
##    `Burial #` `Cribra orbitalia` `Porotic hyperostosis` `Max LEH score`
##    <chr>                   <dbl>                  <dbl>           <dbl>
##  1 RO-938                      0                      0               0
##  2 RO-640                      0                      0               0
##  3 RO-1163                     0                      0               0
##  4 RO-1511                     0                      0               0
##  5 RO-1641                     0                      0               0
##  6 RO-1671                     0                      0               0
##  7 RO-533                      0                      0               0
##  8 RO-637                      0                      0               0
##  9 RO-710                      0                      0               0
## 10 RO-713                      0                      0               0
## # ... with 242 more rows, and 21 more variables: `Is periostitis
## #   present?` <dbl>, `L shoulder DJD` <dbl>, `R shoulder DJD` <dbl>, `L
## #   elbow DJD` <dbl>, `R elbow DJD` <dbl>, `L hip DJD` <dbl>, `R hip
## #   DJD` <dbl>, `L knee DJD` <dbl>, `R knee DJD` <dbl>, `L wrist/hand
## #   DJD` <dbl>, `R wrist/hand DJD` <dbl>, `L ankle/foot DJD` <dbl>, `R
## #   ankle/foot DJD` <dbl>, `cerv vert DJD` <dbl>, `thor vert DJD` <dbl>,
## #   `lum vert DJD` <dbl>, `Summary Sex Det.` <chr>, `Summary
## #   age-at-death` <chr>, `Min Age` <dbl>, `Max Age` <dbl>, Sex <fct>

So with that, our data is cleaned up and ready to go.

2 Comparing two groups

In this section, we’ll look at tests that compare the central tendencies of two groups. Technically, t-tests test whether the means are the same, while the Mann-Whitney test (a.k.a. the U-test) and the Kruskal-Wallis test compare the medians. Basically, we’re going to be comparing two groups of numbers to see if they’re different.

These tests are for variables that are numeric measurements, like isotope ratios; they are not appropriate for the 0, 1, and 2 labels that have been applied to the skeletal pathology variables. For categorical variables, we’ll look at the chi-squared test in the next section. The null hypothesis of these tests is that the two groups are the same, so we want to see whether the difference between them is statistically significant.

Again, I don’t fully understand what your data is, so I’ll just pick two comparisons that seem logical to me, though they might not make sense to you. In this section, we’ll compare the nitrogen isotope ratios between the adults and the children in the trino dataset. Presumably they’d be different; we can hypothesize that the ratios would be higher in the children than in the adults.

However, before we get into these tests, it’s important to consider the underlying assumptions of the tests we’re about to run. Statistical tests come with these assumptions because they rely on properties of certain distributions (like the normal distribution) in their calculations. If we violate the assumptions and run the test anyway, we won’t be able to trust the test’s output.

2.1 t-test

2.1.1 Checking the assumptions

The first and most basic test is called the t-test. It’s a common statistical test, but it comes with a handful of assumptions that your data has to meet. Let’s look at each of these.

  1. Your data has to be randomly sampled from the underlying population.

    This is relatively straightforward: your data can’t be cherry-picked. However, what this means is that every member of the population should have had an equal chance at being selected as a part of the sample. If your population is “everyone,” technically every single person should be fair game for inclusion in your data. In reality, this rarely, if ever, happens. However, if your population is smaller, such as “all the skeletons in this particular field site”, then you might be fine. But it’s important to understand that the results you draw from your statistical tests can only apply to your population.

  2. Observations are independent

    Each observation/measurement in your sample should be completely independent of all the others. If three of the skeletons in your field site come from one area and the rest come from another, and you find that those three are all measurably different in some way, then this assumption has been violated.

  3. The variable should be numeric.

    This test only works with numbers: no categorical variables here. Even if your dataset contains numbers, it’s important to check whether R interprets them as a numeric variable. Sometimes an occasional letter or bit of text slips in (usually, I find, in rows marked NA for some reason), and one slip-up will turn the entire column into text. Let’s make sure the data we’re working with are numbers. A quick way is to run summary on the data and see if you get number-like summary statistics such as the mean and quantiles:

    summary(trino)
    ##       ID               SumAge              AgeYrs            d15N       
    ##  Length:99          Length:99          Min.   : 0.500   Min.   : 3.700  
    ##  Class :character   Class :character   1st Qu.: 2.000   1st Qu.: 8.600  
    ##  Mode  :character   Mode  :character   Median : 3.500   Median : 9.200  
    ##                                        Mean   : 4.368   Mean   : 9.316  
    ##                                        3rd Qu.: 6.000   3rd Qu.: 9.900  
    ##                                        Max.   :14.500   Max.   :12.000  
    ##                                        NA's   :58                       
    ##       d13C          Status              Sex           
    ##  Min.   :-21.0   Length:99          Length:99         
    ##  1st Qu.:-19.7   Class :character   Class :character  
    ##  Median :-19.3   Mode  :character   Mode  :character  
    ##  Mean   :-19.1                                        
    ##  3rd Qu.:-18.6                                        
    ##  Max.   :-16.7                                        
    ## 

    So here, it looks like ID, SumAge, Status, and Sex are all being treated as categorical variables, while AgeYrs, d15N, and d13C are numbers. That looks right to me, so we’ll continue with that. Side note: you can also check whether something is numeric with is.numeric, putting the column you want to check in parentheses.

    is.numeric(trino$d15N)
    ## [1] TRUE

    If it’s a number, the value should be TRUE, which is what we see in this case. Okay great.

  4. You should have at least 30 observations.

    This is important because more data generally means more statistical power. If you have just a little bit of data, it’s hard to tell what the underlying patterns are. You can count your data, or use a function such as nrow or table, to check whether there are more than 30 observations.

    nrow(adult_bones)
    ## [1] 28
    table(trino$SumAge)
    ## 
    ##           Adult   Dead Subadult Living Subadult 
    ##              28              41              30

    So we’re good for two of the groups, but we only have 28 observations for the adults. I don’t know how strict the cutoff of 30 is, but it might be good to make a mental note that we may not have enough data for the adults. If we split the data up by high and low Status, the picture is bleaker. I’ll use table again, but give it the two columns I want to compare (SumAge and Status):

    table(trino$SumAge, trino$Status)
    ##                  
    ##                   High Low
    ##   Adult             14  14
    ##   Dead Subadult     31  10
    ##   Living Subadult   14  16

    So, if we want to compare the high vs. low groups, we definitely don’t have 30 observations for each subset. We’ll keep this in mind for later.

  5. The variances of the samples should be homogeneous.

    This means that the spread of the measurements should be about the same in each group. So if you’ve got one measurement that ranges from 0–1 and another measurement that ranges from 10,000–1,000,000, then your data violates the assumption of the test. However, by default, the t-test function in R uses Welch’s adjustment, which corrects for unequal variances, so you don’t have to worry about this one so much.
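
    If you’d like to check this assumption anyway, here’s a quick sketch (an aside of my own, not part of the original workshop materials): compare the sample variances directly, or run an F test with base R’s var.test. Keep in mind that var.test itself assumes normal data, which, as we’ll see in the next assumption, is a problem for us here.

    # Compare the spread of the two groups directly (a rough check)
    var(children_bones$d13C)
    var(adult_bones$d13C)

    # Base R's F test for equal variances (note: it assumes normality)
    var.test(children_bones$d13C, adult_bones$d13C)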

  6. The data should be normally distributed.

    This is an important one. You’ll have to test whether the data is normally distributed, meaning it looks like a typical bell curve when you plot it. We can visually inspect it with ggplot. You can use geom_histogram like we did earlier to do this. Let’s start with the d15N measurement in the children bones data:

    ggplot(children_bones, aes(d15N)) + 
      geom_histogram(binwidth = 0.2)

    Alternatively, you can use geom_density, which will literally plot a curve for you.

    ggplot(children_bones, aes(d15N)) + 
      geom_density()

    Hm. That doesn’t look like much of a bell curve. This doesn’t bode well for us. Fortunately, statisticians are smart and have come up with actual statistical tests to see whether things are normally distributed, lest our eyes deceive us. We can run what’s called the Shapiro-Wilk test on the data, which tests for normality, using the shapiro.test function.

    shapiro.test(children_bones$d15N)
    ## 
    ##  Shapiro-Wilk normality test
    ## 
    ## data:  children_bones$d15N
    ## W = 0.90711, p-value = 0.002707

    This test has normality as its null hypothesis: it assumes that the data is normally distributed, so a low p-value is evidence against that null hypothesis and suggests that the data is not normal. (The test statistic, W, gets further from 1 the more the data departs from normality.) The typical cutoff for p-values is 0.05, so anything smaller than that is considered “low.” Because our p-value is much smaller than 0.05, we have to conclude that the data is not normal. That’s not good if we want to run the t-test. Fortunately, there are ways around all this, which we’ll get to in a sec. For now, let’s finish the rest of these assumptions for the t-test.

    We can visually look at all the other groups as well using the facet_wrap function in ggplot. Let’s facet the graph by SumAge and by Status and get six plots all at once.

    ggplot(trino, aes(d15N)) + 
      geom_density() + 
      facet_wrap(~SumAge + Status)

    So here, it seems like none of the subgroups are normally distributed. Again, there is a workaround that we’ll get to in a few minutes.

2.1.1.1 Your Turn!

2.1.1.1.1 The challenge
  1. We saw how to plot a histogram for one variable, and how to plot six density curves all at once. See if you can plot six histograms all at once. Try plotting the d13C data instead of d15N.

  2. Pick a subset that looks the most normal and do a test for normality. You may have to create a new dataset for it that gets just one group of people and one Status, like this:

    children_high <- trino[trino$SumAge == "Dead Subadult" & trino$Status == "High",]
    shapiro.test(children_high$d15N)
    ## 
    ##  Shapiro-Wilk normality test
    ## 
    ## data:  children_high$d15N
    ## W = 0.94602, p-value = 0.1211
2.1.1.1.2 The solution
  1. Here’s a plot with all six histograms for the d13C data:

    ggplot(trino, aes(d13C)) + 
      geom_histogram(binwidth = 0.1) + 
      facet_wrap(~SumAge + Status)

  2. It looks like the High Dead Subadult data might be normally distributed. Let’s test that using the Shapiro-Wilk test:

    children_high <- trino[trino$SumAge == "Dead Subadult" & trino$Status == "High",]
    shapiro.test(children_high$d13C)
    ## 
    ##  Shapiro-Wilk normality test
    ## 
    ## data:  children_high$d13C
    ## W = 0.98217, p-value = 0.8699

Well, look at that! We have a high p-value (> 0.05), so we have no evidence that this subset is non-normal. This particular subset has 31 observations, which is good for us, but other subsets have fewer, and the Shapiro-Wilk test isn’t very reliable when there’s not much data. Something to keep in mind.

2.1.2 Running the t-test

To summarize what we know so far: our data meets most of the assumptions, but it violates normality and, to some degree, the minimum number of observations. This means that we really shouldn’t run the t-test. We will anyway for illustrative purposes, but we shouldn’t trust the results.

The function we’ll use is t.test. As arguments, you’ll put the subsets that you want to test: in this case it’s the d13C column of the adult_bones subset and the d13C column of the children_bones subset.

Now, before we move on, it’s important to think about your hypothesis. There are slightly different versions of the t-test: the two-tailed and the one-tailed test. Two-tailed means that you’re testing whether two groups are different, but you’re not sure or don’t care which group is bigger or smaller. This is fine, but it takes a bigger difference to reach statistical significance. If your hypothesis states that the measurements from one group are bigger than the other’s, you can do a one-tailed test. The benefit is that you’re “rewarded” for your prior knowledge of the data, and a smaller difference is easier to detect.

In our data, we want to test specifically whether the children’s d13C ratios are greater than the adults’ measurements. So, as a third argument to t.test, we’ll add alternative = "greater". Doing this runs a one-tailed t-test, which only looks to see whether the first dataset (the children) is larger than the second dataset (the adults).

t.test(children_bones$d13C, adult_bones$d13C, alternative = "greater")
## 
##  Welch Two Sample t-test
## 
## data:  children_bones$d13C and adult_bones$d13C
## t = -0.14402, df = 66.901, p-value = 0.557
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  -0.3550893        Inf
## sample estimates:
## mean of x mean of y 
## -19.08537 -19.05714

There’s kind of a lot of output, but the important thing to look for is the p-value. Here, we have a high p-value (0.557). The null hypothesis is that the two means are the same, so because the p-value is high, we do not have enough evidence to reject that null hypothesis. Bummer.

But wait! We ran this test knowing that we violated its assumptions. Maybe we’ll get different results by using one of the other tests.

2.1.2.1 Your Turn!

2.1.2.1.1 The challenge
  1. Try flipping the t-test we just did around by putting the adult data first and then the children data and changing "greater" to "less". What do you notice about the two tests?

  2. Pick two different groups and run a t-test on them.

2.1.2.1.2 The solution
  1. If you flip the test around, the numbers are identical:

    t.test(adult_bones$d13C, children_bones$d13C, alternative = "less")
    ## 
    ##  Welch Two Sample t-test
    ## 
    ## data:  adult_bones$d13C and children_bones$d13C
    ## t = 0.14402, df = 66.901, p-value = 0.557
    ## alternative hypothesis: true difference in means is less than 0
    ## 95 percent confidence interval:
    ##       -Inf 0.3550893
    ## sample estimates:
    ## mean of x mean of y 
    ## -19.05714 -19.08537
  2. For two different groups, I’ll compare the d15N measurements of the high-status children with the low-status children. I’m not sure whether one will be higher than the other, so I’ll use "two.sided" for the alternative argument.

    children_high <- children_bones[children_bones$Status == "High",]
    children_low  <- children_bones[children_bones$Status == "Low",]
    t.test(children_high$d15N, children_low$d15N, alternative = "two.sided")
    ## 
    ##  Welch Two Sample t-test
    ## 
    ## data:  children_high$d15N and children_low$d15N
    ## t = 1.4552, df = 10.702, p-value = 0.1743
    ## alternative hypothesis: true difference in means is not equal to 0
    ## 95 percent confidence interval:
    ##  -0.5599367  2.7231626
    ## sample estimates:
    ## mean of x mean of y 
    ##  9.651613  8.570000

    The p-value here is lower, but it’s not less than 0.05. This does not mean it’s “closer” to being significant. It’s still not significant.

2.2 The Mann-Whitney-U test

The t-test is what we call a parametric test, which basically means it depends on normally distributed data. Since our data is not normally distributed, we have to resort to nonparametric tests, which are alternatives to parametric tests that do not assume normality.

The Mann-Whitney test (which has lots of other names, like the U-test and the Wilcoxon rank-sum test) is a nonparametric version of the t-test in that it compares two groups without assuming they’re normally distributed. How this works is handled in the underlying mathematics, which you can read about on your own. The key difference is that while the t-test compares the two groups’ means, the Mann-Whitney test compares their medians. This is an important distinction that you should incorporate into your write-up if you use these tests.

One important assumption of the Mann-Whitney test is that the two groups, while they don’t have to be normally distributed, should come from populations with the same shape of distribution. So if one group is left-skewed and the other is right-skewed, this test won’t be reliable. I don’t know of a formal way to test this, so plotting the groups and making sure the density curves look roughly similar is what I would do.
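
Here’s a quick sketch of that kind of eyeball check for our two groups (bones_only is just a throwaway name I’m making up here): stack the two subsets back together and overlay their density curves. If the curves have roughly the same shape, just shifted, we’re probably okay.

bones_only <- rbind(children_bones, adult_bones)
ggplot(bones_only, aes(d13C, color = SumAge)) + 
  geom_density()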

To actually run it, we use the R function wilcox.test. The syntax and arguments are identical to t.test.

wilcox.test(children_bones$d13C, adult_bones$d13C, alternative = "greater")
## Warning in wilcox.test.default(children_bones$d13C, adult_bones$d13C,
## alternative = "greater"): cannot compute exact p-value with ties
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  children_bones$d13C and adult_bones$d13C
## W = 552, p-value = 0.6084
## alternative hypothesis: true location shift is greater than 0

Uh-oh. We got a warning message. It says that the test cannot compute an exact p-value with ties. What that means is that the data has some observations that are exactly the same value: they’re tied. We can modify the data a little bit by wrapping it in the jitter function. This function adds a small amount of random noise to your data so that there are no ties:

children_bones$d13C
##  [1] -18.3 -19.0 -17.9 -16.7 -18.6 -19.0 -18.0 -19.5 -17.7 -18.7 -18.5
## [12] -18.4 -18.6 -19.5 -19.7 -17.1 -18.8 -19.3 -21.0 -19.2 -17.6 -20.2
## [23] -19.6 -19.3 -19.7 -19.3 -20.4 -19.7 -19.9 -19.2 -18.8 -19.7 -18.9
## [34] -19.7 -19.0 -19.6 -19.8 -20.0 -20.6 -20.4 -17.6
jitter(children_bones$d13C)
##  [1] -18.28584 -19.01831 -17.91561 -16.70443 -18.60925 -19.00504 -17.98670
##  [8] -19.50933 -17.68728 -18.68793 -18.48860 -18.39294 -18.61879 -19.49797
## [15] -19.69523 -17.08093 -18.80041 -19.31108 -21.00860 -19.20409 -17.59796
## [22] -20.19171 -19.61718 -19.28154 -19.68300 -19.30037 -20.38079 -19.70105
## [29] -19.90654 -19.19864 -18.78322 -19.69272 -18.88895 -19.71694 -18.99777
## [36] -19.61810 -19.81298 -20.00856 -20.58995 -20.41404 -17.61426

The values are all very close to the originals, so it should have no impact on the overall results. Also, because the noise is random, your numbers might not be the same as what you see here.
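
As a side note (not something we did in the original workshop), if you want your jittered values, and therefore the test results below, to come out the same every time you run the code, you can set the random seed before calling jitter:

set.seed(42)  # any number works; this just makes the random noise repeatable
jitter(children_bones$d13C)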

So, let’s add some noise to the data, and then run wilcox.test again.

wilcox.test(jitter(children_bones$d13C), jitter(adult_bones$d13C), alternative = "greater")
## 
##  Wilcoxon rank sum test
## 
## data:  jitter(children_bones$d13C) and jitter(adult_bones$d13C)
## W = 557, p-value = 0.5841
## alternative hypothesis: true location shift is greater than 0

So, unfortunately, the high p-value that we saw in the t-test, and with the ties just now, didn’t go away. We don’t have enough evidence to conclude that the d13C ratios for the children are higher than for the adults; as far as this test is concerned, we can’t say they differ.

2.3 The Kruskal-Wallis test

The Kruskal-Wallis test is very similar to the Mann-Whitney test. It has the same underlying assumptions and similar math. The only real difference is that it can handle three or more groups.

Because it works with three or more groups, the syntax is a little bit different. Instead of listing the columns we want to test, we use what’s called formula notation in R: you put the column name of the dependent variable, then a tilde (~), and then the column name of the independent variable. So here the dependent variable is d13C (or d15N, it’s up to you) and the independent variable is SumAge. Then, we need to put data = trino to tell the function where the data is coming from. Unfortunately, because of how R read in the data, we’ll have to convert the SumAge column into what’s called a factor instead of a character vector. It’s not important why that is, but here’s the code for how to do it. At least we don’t need to add jitter!

trino$SumAge <- as.factor(trino$SumAge)
kruskal.test(d13C ~ SumAge, data = trino)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  d13C by SumAge
## Kruskal-Wallis chi-squared = 0.4716, df = 2, p-value = 0.7899

So, the output is simple, and shows a large p-value, suggesting that the d13C measurements for all three groups (dead subadult, living subadult, and adult) are not significantly different from one another. I don’t know if this is what you expected to find, and maybe with more data the results would be different. But it’s at least what this data sample shows.

2.3.1 Your Turn!

Run a t-test, Mann-Whitney test, or a Kruskal-Wallis test on your own data and see what results you can find.

2.4 Plotting two groups

Regardless of what test you do, it’s always good to visualize your data. I like to use boxplots or violin plots. If the results of the statistical tests are not what you expected, you might see why when you plot them. Here, I provide the code for two simple plots. The first is a boxplot comparing the d15N ratios in the high- and low-status children’s bones. The second compares the d15N measurements across all three groups.

ggplot(children_bones, aes(x = Status, y = d15N)) + 
  geom_boxplot()

ggplot(trino, aes(x = SumAge, y = d15N)) + 
  geom_violin()

2.5 Reporting your statistical tests

When it comes time to write up your results, it’s important to explain your methods. It’s not sufficient to say “the two groups were different (p < 0.05)”. This provides the reader with zero information as to how you came to that conclusion. Some things that you need to include are

  1. What data you ran the test on. I like to make it very clear what the two groups were and what variables were involved.

  2. The name of the test. Include whether it was a one-tailed or a two-tailed test.

  3. The test statistic. By this, I mean t in the t-test, W in the Mann-Whitney test, or the Kruskal-Wallis chi-squared for the Kruskal-Wallis test. We didn’t really talk about these, but they are the numbers that the p-value is based on. They can be found in the output of the test.

  4. For the t-test, the degrees of freedom. Again, found in the output, usually abbreviated “df”.

  5. The p-value.

  6. It’s also nice to include the mean value (for the t-test) or the median value (for the other two), as well as a plot, so people can get a feel for the difference.

So here’s a test that we did earlier:

wilcox.test(jitter(children_bones$d13C), jitter(adult_bones$d13C), alternative = "greater")
## 
##  Wilcoxon rank sum test
## 
## data:  jitter(children_bones$d13C) and jitter(adult_bones$d13C)
## W = 559, p-value = 0.5746
## alternative hypothesis: true location shift is greater than 0

And I’ll calculate the medians of the two groups to include them in the output:

median(children_bones$d13C)
## [1] -19.2
median(adult_bones$d13C)
## [1] -19.25

And here’s what I would say if I were reporting it. Keep in mind, I don’t really know what this data is, so if I totally botch the wording of the data, my bad.

“A one-tailed Mann-Whitney test was used to compare the d13C isotope ratios between children and adults. The test suggests that the difference between the ratio for the adult bones (median = -19.25) and for the children (-19.2) was not statistically significant (W = 559, p = 0.575).”

It’s not necessary to be overly specific in the test statistic or the p-value. I round mine to a couple decimal places each. Especially if the p-value is less than 0.001 or so, I just put “< 0.001” because nobody cares that it’s 0.00024781 or something.
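
If you’d like R to do that rounding for you, here’s a small sketch using base R functions (the 0.001 cutoff is just my own choice):

# Round a p-value to three decimal places
round(0.5746, 3)

# format.pval() prints anything below the eps cutoff as "< eps"
format.pval(c(0.5746, 0.00024781), digits = 3, eps = 0.001)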

2.5.0.1 Your Turn!

Write up the results from one of the tests you did earlier.

3 Differences between groups

3.1 The χ2 test

The chi-squared test (written χ2; note that the symbol is not an X but the Greek letter chi, χ) is used to compare expected versus observed counts. The observed numbers are the ones you see in your sample. The expected ones are what you would find if the two variables you’re looking at were completely independent of each other (i.e., if the null hypothesis were true). We’ll start with a simple chi-squared test on a 2x2 table, and then we’ll move on to a bigger one.

So let’s think about this. Again, I don’t know your data, so I don’t know how useful a question this is, but let’s say we’re interested in the proportion of high-status people among the children versus the adults in the trino data. We have some number of high-status and some number of low-status people for both the children and the adults. Is the ratio between those about the same? If status has nothing to do with age, we would expect the ratios to be about equal. But if there is a relationship, we would expect the proportions to differ.

Unfortunately, to run a chi-squared test, you’ll need to do a bit of data preparation. First, we’ll take a new subset of our data. We want all the rows in the trino dataset where the SumAge is either "Dead Subadult" or "Adult". To do this, we can use the %in% function and supply a list of possible values with c(). Since this is just the bones and not the teeth data, we’ll call it bones.

bones <- trino[trino$SumAge %in% c("Dead Subadult", "Adult"),]
bones
## # A tibble: 69 x 7
##    ID    SumAge        AgeYrs  d15N  d13C Status Sex  
##    <chr> <fct>          <dbl> <dbl> <dbl> <chr>  <chr>
##  1 188-D Dead Subadult    0.5  11.5 -18.3 High   <NA> 
##  2 188   Dead Subadult    0.5  10.1 -19   High   <NA> 
##  3 173-C Dead Subadult    0.8  10.8 -17.9 High   <NA> 
##  4 192   Dead Subadult    0.8  11.6 -16.7 High   <NA> 
##  5 133   Dead Subadult    1    10   -18.6 High   <NA> 
##  6 188-A Dead Subadult    1.5   8.1 -19   High   <NA> 
##  7 190   Dead Subadult    1.5   8   -18   High   <NA> 
##  8 194   Dead Subadult    1.5   9.9 -19.5 High   <NA> 
##  9 385   Dead Subadult    1.5  11.8 -17.7 Low    <NA> 
## 10 446   Dead Subadult    1.5  11.7 -18.7 Low    <NA> 
## # ... with 59 more rows

What we then have to do is turn this long spreadsheet into a small table that summarizes how many observations we have in each combination of groups. We can do this with the table function. Like other functions we’ve seen before, we just supply the two column names we are concerned with as arguments to the function.

bones_table <- table(bones$SumAge, bones$Status)
bones_table
##                  
##                   High Low
##   Adult             14  14
##   Dead Subadult     31  10
##   Living Subadult    0   0

But wait. Why is “Living Subadult” even a part of that table if there are none in this subset? This will mess up the chi-squared test if there are zeros. Long story, but it has to do with how R encodes the data. We can get rid of these with the droplevels function on bones before creating the table.

bones <- droplevels(bones)
bones_table <- table(bones$SumAge, bones$Status)
bones_table
##                
##                 High Low
##   Adult           14  14
##   Dead Subadult   31  10

There we go. So if we look here, we have 14 adults with high status and 14 adults with low status: a 1:1 ratio. But for the children, we have 31 with high status and only 10 with low status, slightly more than 3:1. In fact, the odds of a child being high status are (31/10) / (14/14) = 3.1 times the odds of an adult being high status. To me that looks like it could be meaningful. Let’s run the actual chi-squared test on bones_table to see whether it’s statistically significant.

chisq.test(bones_table)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  bones_table
## X-squared = 3.7476, df = 1, p-value = 0.05288
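
As an aside, if you want to see the expected counts that the test compares our observed counts against (the counts we’d expect if status and age group were completely unrelated), you can save the test to an object and pull them out of the result:

bones_chisq <- chisq.test(bones_table)
bones_chisq$expected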

So here we have a p-value of 0.05288. This is slightly higher than the conventional cutoff of 0.05, and if you want to be technical about it, that cutoff is strict: this result is considered not significant. Lots of people have gotten results like these and have fudged the wording to make them sound better. This page has an ever-growing compilation of expressions people have used to report p-values that are greater than 0.05. I searched for ones that were 0.052, like ours, and found these various wordings and expressions:

  • (barely) not statistically significant

  • a clear tendency to significance

  • a marginal trend toward significance

  • a possible trend toward significance

  • approaching prognostic significance

  • did not quite achieve the conventional levels of significance

  • just skirting the boundary of significance

  • narrowly avoided significance

  • nearly borderline significance

  • not exactly significant

  • on the brink of significance

There are many, many more of these on the site I mentioned above. I provide this list not to give you ideas of what to say in your write-up, but to show you that these kinds of descriptions are written by authors who really want the thing they found to be true but don’t quite have the numbers to back it up. Significance is not a continuum but a binary yes/no, so for something to be “approaching,” “trending toward,” or “bordering on” significance is simply talking around the fact that the results are not significant.

In reality, 0.05 is an arbitrary cutoff, and should be considered a guideline of sorts. Personally, I would not report this as significant. I usually only report things as significant if the p-value is less than 0.01 specifically to avoid the situation we’re in now and to make sure that the thing I found really is there. But that’s just me.

We can actually get a measure of the effect size. I won’t go into detail, but Levshina (2015:209) shows that you can use the function assocstats in the vcd library to get what’s called Cramer’s V.

library(vcd)
assocstats(bones_table)
##                     X^2 df P(> X^2)
## Likelihood Ratio 4.7902  1 0.028621
## Pearson          4.8104  1 0.028289
## 
## Phi-Coefficient   : 0.264 
## Contingency Coeff.: 0.255 
## Cramer's V        : 0.264

Cramer’s V ranges between 0 and 1 and measures the effect size. A rule of thumb is that the effect is strong if Cramer’s V is greater than 0.5, moderate if it’s between 0.3 and 0.5, and small if it’s between 0.1 and 0.3. Our value of 0.264 would be considered a small effect size. Take that for what it’s worth, but again, personally, that would be another nail in the coffin and I wouldn’t report this as significant.

3.1.1 Your Turn!

Run a chi-squared test on your own data!

3.2 Fisher’s Exact Test

One of the assumptions of the chi-squared test is that the expected count in each “cell” of the table you’re testing should be at least 5. But sometimes you have small numbers to work with. For example, let’s look at just the adult bones and run a chi-squared test on the Status and Sex of the people.

adult_table <- table(adult_bones$Status, adult_bones$Sex)
adult_table
##       
##        Female Male
##   High      4   10
##   Low       5    9

As you can see, three of the four cells are fine, but there are only 4 observations for the high-status females. If we run a chi-squared test, it’ll get mad at us.

chisq.test(adult_table)
## Warning in chisq.test(adult_table): Chi-squared approximation may be
## incorrect
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  adult_table
## X-squared = 0, df = 1, p-value = 1

It gives you output, but there’s a warning that says the approximation may be incorrect. The solution is to run what’s called Fisher’s Exact Test. This is essentially the same thing, but it can handle cells with small numbers.

fisher.test(adult_table)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  adult_table
## p-value = 1
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.1070051 4.6418518
## sample estimates:
## odds ratio 
##  0.7285285

Now, in this case, the ratios are very similar, so we get a very high p-value of 1. In other words, we have no evidence at all that the ratio of high- to low-status observations differs between the men and the women. (For all I know, this may be by design.)

3.3 On Larger Tables

As it turns out, you’re often going to want to compare variables that each have more than two possible levels. Fortunately, the chi-squared test and Fisher’s Exact Test can both handle that. We’ll see whether certain levels of skeletal pathology are related to the sex of the skeleton.

cribra_orbitalia_table <- table(pathology$`Cribra orbitalia`, pathology$`Summary Sex Det.`)
cribra_orbitalia_table
##    
##     Female Female? Male Male? Subadult UID
##   0      8      19   10     8       70  83
##   1     12      11   10     6       10   4
##   2      1       0    0     0        0   0

So, as you can see, there was only one skeleton that had a “2” for this variable. All those zeros are going to mess up the tests a little bit. Let’s remove that one observation to make things easier for us.

cribra_orbitalia_no_2 <- pathology[pathology$`Cribra orbitalia` != 2,]
cribra_orbitalia_table <- table(cribra_orbitalia_no_2$`Cribra orbitalia`, cribra_orbitalia_no_2$`Summary Sex Det.`)
cribra_orbitalia_table
##    
##     Female Female? Male Male? Subadult UID
##   0      8      19   10     8       70  83
##   1     12      11   10     6       10   4

That’s looking better. Since we have the one cell with just four observations, we should probably do Fisher’s exact test.

fisher.test(cribra_orbitalia_table)

Now, you probably got an error when you ran this. That’s because running this test on a larger table is computationally quite intensive. The error says to increase the size of the workspace, which you can do with the workspace argument. From what I found on Stack Exchange, people were setting it to 6e8, which seems to work. But beware: this may take a very long time to run if the table is big, as in an hour or two. For this particular test on the desktop Mac I’m on, it was nearly instantaneous, but your laptops might have a harder time with it.

fisher.test(cribra_orbitalia_table, workspace = 6e8)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  cribra_orbitalia_table
## p-value = 9.8e-11
## alternative hypothesis: two.sided
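
If even the bigger workspace is too slow on your machine, another option (an aside of mine, not something we used in the workshop) is to ask fisher.test for a simulated p-value instead of an exact one. It’s a Monte Carlo approximation, but it runs quickly:

# simulate.p.value asks for a Monte Carlo estimate of the p-value;
# B is the number of simulated tables (more = a more stable estimate)
fisher.test(cribra_orbitalia_table, simulate.p.value = TRUE, B = 10000)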

Anyway, the exact test returns a tiny p-value. Hooray! But what does that mean? It means that at least one of the cells is significantly different from what we’d expect, but it does not tell you which one. Basically, something is interesting, but we can’t tell for sure what just from this output. Fortunately, there’s another handy function in the vcd library called assoc. When we run this function on the table and specify shade = TRUE, it produces a nice plot showing which cells stand out.

assoc(cribra_orbitalia_table, shade = TRUE)

Here we see some blue and red rectangles. The blue means that there are more observed values than expected, and red means there are fewer than expected. The height of the bar represents the size of the residual, and the width represents how much data was expected from that cell.

If you run a chi-square test on the data—keep in mind that we should take the results with a grain of salt because of the cell with only four observations—and put $residuals after it, you can get these numbers. Anything with an absolute value greater than 2 is considered significant.

chisq.test(cribra_orbitalia_table)$residuals
## Warning in chisq.test(cribra_orbitalia_table): Chi-squared approximation
## may be incorrect
##    
##         Female    Female?       Male      Male?   Subadult        UID
##   0 -1.9579219 -0.9590181 -1.4543989 -0.9159249  0.8676242  1.7346708
##   1  3.7843389  1.8536233  2.8111124  1.7703313 -1.6769739 -3.3528314

However, let’s take some things into consideration. First, this table splits Female from Female? and Male from Male?. What would happen if we collapsed those? We did that in the setup phase of the workshop and called the new column Sex. Let’s run the test on that instead.

cribra_orbitalia_table2 <- table(cribra_orbitalia_no_2$`Cribra orbitalia`, cribra_orbitalia_no_2$Sex)
cribra_orbitalia_table2
##    
##     Female Male Subadult UID
##   0     27   18       70  83
##   1     23   16       10   4

You can see how many in each category there are.

fisher.test(cribra_orbitalia_table2, workspace = 6e8)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  cribra_orbitalia_table2
## p-value = 5.508e-11
## alternative hypothesis: two.sided

The test was still significant, which is good for us.

assoc(cribra_orbitalia_table2, shade = TRUE)

Okay, good. So even if we collapse the sex factor down, our results are still the same. One thing we should be suspicious of, though, is that the cell that had only four values, the UID skeletons with a value of 1, happened to come out as significant. I’d be particularly cautious about that one because there’s just not enough data to say for sure.

3.3.1 Your Turn!

3.3.1.1 The Challenge

Pick another variable and run a chi-square test, or, if more appropriate, a Fisher’s Exact test.

3.3.1.2 The Solution

Since I have no idea what this data represents, and since I’m leaning on my left elbow as I look through the data, I’ll pick the L elbow DJD column.

left_elbow_table <- table(pathology$`L elbow DJD`, pathology$`Sex`)
left_elbow_table
##    
##     Female Male Subadult UID
##   0     43   26       80  85
##   1      6    5        0   1
##   2      2    3        0   1

I see that most of the cells have small numbers. This is a perfect case where Fisher’s Exact Test is more appropriate.

fisher.test(left_elbow_table, workspace = 6e8)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  left_elbow_table
## p-value = 5.763e-06
## alternative hypothesis: two.sided

I got a very small p-value, so the test is significant. Let’s see which cells are deviant.

assoc(left_elbow_table, shade = TRUE)

So both the female and male skeletons have more 1’s than expected, and the males also have more 2’s. Cool.

3.4 Writing up your results

Similar to the previous tests, it’s important to include information about the statistical tests you ran when you write up your results. For chi-squared tests, report the actual chi-squared test statistic, the degrees of freedom (in parentheses), and the p-value. You can go above and beyond and report the odds ratio and Cramer’s V as well. Assuming that we interpret the test as significant, you might write something like this:

Using a chi-squared test, we found that the proportion of high-status to low-status individuals was significantly higher for children than for adults (χ2(1) = 3.748, p = 0.0529). The effect size was small (Cramer’s V = 0.264), with the odds of a child being high status 3.1 times those of an adult.

4 Principal Components Analysis

The last topic for today is what’s called Principal Components Analysis, or PCA. This is a technique that is used for a variety of things. For example, if you have tons and tons of variables, you can use PCA to reduce them down to a manageable number (this is called “dimension reduction”). I’ve seen PCA used to combine correlated variables, which are then fed into regression models (this is called “principal components regression”). I’ve also seen it used as a sort of cluster analysis, not only to see which variables go together but to see which observations form distinct clusters.

What PCA does exactly is pretty cool. Imagine you’re holding a cube, about a foot long on each side, made of crystal. Inside that cube, little dots are plotted, as if it were a 3D scatterplot. Since you’re holding the cube, you can rotate it and look at it from any angle, viewing the same arrangement of dots from lots of vantage points. The cool part is that if you look at it in a certain way, the dots all come close to lining up. Even though the dots are scattered in a 3D space, from this angle it looks like they’re pretty close to being in a straight line, a one-dimensional space. This is essentially what PCA does. It “rotates” your data around in a multi-dimensional cube until it finds a way to look at it such that the points “line up” into a smaller number of dimensions. It’s not going to be perfect, and you lose some information by simplifying things like this, but it’s easier to work with a line than a cube, and it’s easier to work with fewer variables than many.

Because PCA is so popular, there are lots of ways to do it in R. Levshina (2015) uses the PCA function in the FactoMineR package, and Julia Silge uses prcomp_irlba from the irlba package. I’ll stick with the base R prcomp function, but you may find a more specialized one that suits your needs later.

Unfortunately, our sample data out of the box isn’t ideal for PCA. Typically, you have lots of numeric variables, and we only have two (d15N and d13C), or three (with AgeYrs) for some of the data. For now, we’ll start with just the children’s data, since we have three variables, but later we’ll move on to some artificial data for illustrative purposes.

4.1 PCA on children’s data

The good thing is there’s virtually zero data prep for this; we can call the prcomp function directly. As the first argument, we give it the children_bones dataset, but only the columns we want to include. Since PCA only works on numbers, we’ll just use the AgeYrs, d15N, and d13C variables, which are columns 3, 4, and 5. We also add the center = TRUE and scale = TRUE arguments: centering subtracts each variable’s mean and scaling divides each variable by its standard deviation, so that a variable measured on a larger scale (like AgeYrs) doesn’t dominate the analysis just because its numbers are bigger.
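
If you’re curious what centering and scaling actually do, here’s a quick side sketch (not part of the main analysis) using base R’s scale function on one of the columns:

scaled_age <- as.vector(scale(children_bones$AgeYrs))
mean(scaled_age)  # essentially 0, because the mean has been subtracted
sd(scaled_age)    # exactly 1, because we divided by the standard deviation

With that out of the way, we can call prcomp on the three columns we care about: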

pca_children <- prcomp(children_bones[,c(3:5)], center = TRUE, scale = TRUE)

This creates a new object we’ve called pca_children. We can examine this object by typing its name.

pca_children
## Standard deviations (1, .., p=3):
## [1] 1.3798521 0.8248012 0.6447567
## 
## Rotation (n x k) = (3 x 3):
##               PC1        PC2        PC3
## AgeYrs  0.5162823 -0.8420219  0.1563708
## d15N   -0.6183360 -0.2401605  0.7483205
## d13C   -0.5925481 -0.4830343 -0.6446430

There are two groups of information here: the standard deviations (which we’ll get to in a second) and the rotations. In the matrix of rotations, each column represents one of the three new variables, the principal components. If you give PCA three variables (like we did), it will return three PCs. You can think of these three PCs as the new x-, y-, and z-axes in our rotated space. Alternatively, you can think of each PC as a unique mixture of your original variables, with the contribution of each original variable represented by the numbers. If a number is close to zero, that original variable didn’t contribute much to the PC, but if it’s close to 1 or -1, the PC contains a lot of information about that variable. In other words, the rotations show how correlated the PCs are with the original variables.

We can view some summary information about the PCs by calling summary on the object:

summary(pca_children)
## Importance of components:
##                           PC1    PC2    PC3
## Standard deviation     1.3799 0.8248 0.6448
## Proportion of Variance 0.6347 0.2268 0.1386
## Cumulative Proportion  0.6347 0.8614 1.0000

Here we see the “importance” of the PCs. PC1 is always the most important, then PC2, and then PC3. You’ll notice the standard deviation and proportion of variance each get smaller as you move to the right. The cumulative proportion shows how much of the variance is retained by keeping that many PCs. For example, PC2 has a cumulative proportion of 0.86, which means that about 86% of all the information contained in the three original variables is captured by the first two PCs. A rule of thumb I’ve heard is to keep as many PCs as it takes to get the cumulative proportion up to about 80% and toss the rest. So if we really wanted to, we could toss PC3 and make do with just two PCs to represent these three variables.
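
A common way to visualize this is a scree plot, which just charts the variance of each PC so you can see where it drops off. Base R has a screeplot function for this (a quick sketch on my part; you could also build the same thing in ggplot2 from the summary above):

screeplot(pca_children, type = "lines")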

So, let’s see if we can do something with this information. The rotations that we saw earlier are genuinely useful numbers: when you multiply your (centered and scaled) original observations by them, you get the new PCs. Fortunately, that’s already been done for us and can be found in the object already:

head(pca_children$x)
##              PC1         PC2        PC3
## [1,] -1.92419547  0.25676213  0.2993800
## [2,] -0.93527344  0.82639929  0.1016708
## [3,] -1.84897358  0.08593727 -0.2921220
## [4,] -2.91092268 -0.64459330 -0.7196776
## [5,] -1.06651723  0.51270553 -0.1929981
## [6,]  0.01160381  0.88077445 -0.8094310

These six rows show the PCs for the first six observations in our data. We can tag all of this information onto the original dataset now:

children_with_pca <- cbind(children_bones, pca_children$x)
head(children_with_pca)
##      ID        SumAge AgeYrs d15N  d13C Status  Sex         PC1
## 1 188-D Dead Subadult    0.5 11.5 -18.3   High <NA> -1.92419547
## 2   188 Dead Subadult    0.5 10.1 -19.0   High <NA> -0.93527344
## 3 173-C Dead Subadult    0.8 10.8 -17.9   High <NA> -1.84897358
## 4   192 Dead Subadult    0.8 11.6 -16.7   High <NA> -2.91092268
## 5   133 Dead Subadult    1.0 10.0 -18.6   High <NA> -1.06651723
## 6 188-A Dead Subadult    1.5  8.1 -19.0   High <NA>  0.01160381
##           PC2        PC3
## 1  0.25676213  0.2993800
## 2  0.82639929  0.1016708
## 3  0.08593727 -0.2921220
## 4 -0.64459330 -0.7196776
## 5  0.51270553 -0.1929981
## 6  0.88077445 -0.8094310

So now we have our original dataset, plus three new columns (the PCs). Keeping in mind that the first two PCs capture about 86% of the information in the three columns (AgeYrs, d15N, and d13C), we can actually plot these and see what patterns emerge. Here, I’ll plot PC1 on the x-axis and PC2 on the y-axis.

ggplot(children_with_pca, aes(PC1, PC2)) + 
  geom_point()

So this is cool and all, but not super useful. Like we saw in the ggplot2 tutorial, let’s swap out geom_point for geom_label, and use the ID as the text for that label. And for kicks, let’s color the labels based on the status (high vs. low).

ggplot(children_with_pca, aes(PC1, PC2, label = ID, color = Status)) + 
  geom_label()

Okay, so now we’re seeing something interesting. It seems like most of the blue ones (the low-status burials) are congregated over on one side of the plot, while two stragglers (446 and 385) are off on their own. This is primarily a difference in PC1, since they’re far apart horizontally on the graph. Let’s go back to the rotations to remind ourselves what PC1 consists of.

pca_children
## Standard deviations (1, .., p=3):
## [1] 1.3798521 0.8248012 0.6447567
## 
## Rotation (n x k) = (3 x 3):
##               PC1        PC2        PC3
## AgeYrs  0.5162823 -0.8420219  0.1563708
## d15N   -0.6183360 -0.2401605  0.7483205
## d13C   -0.5925481 -0.4830343 -0.6446430

Okay, so it’s a fairly even mix of all three variables. Maybe these two skeletons were anomalies in some way. Similarly, there were two points that sat very high and very low on the graph, 364 and S-547. This is primarily a difference in PC2, and looking at the rotations we see that the variable with the largest contribution (in absolute value) to PC2 is AgeYrs. Looking back at the original data, S-547 wasn’t exactly extreme in years, but 364 was the oldest child in the sample. The second oldest was 34-B, which is close to 364 on the graph and also has a relatively extreme placement along the y-axis. But if we plot the data with the fill color of the labels corresponding to the d15N variable (I’m using scale_fill_distiller to provide a pretty color scheme), we see that S-547 has far and away the lowest value.

ggplot(children_with_pca, aes(PC1, PC2, label = ID, fill = d15N)) + 
  geom_label() + 
  scale_fill_distiller(type = "div", palette = "RdYlGn")
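
If you’d rather check that age claim in code than by scanning the spreadsheet, a quick sketch like this (my own aside) sorts the children by age:

# Sort the children by age, oldest first, and peek at the top of the list.
head(children_with_pca[order(children_with_pca$AgeYrs, decreasing = TRUE),
                       c("ID", "AgeYrs", "d15N", "PC2")])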

Finally, the other useful thing about PCA is that you can see how your variables cluster together. So far we’ve looked at the observations themselves, but we can also examine the variables that went into the PCA. To do this, we plot their rotations as a scatterplot: variables that behave similarly will appear clustered together, while those that are dissimilar will be far apart from each other.
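
We’ll do this properly with the bigger dataset in the next section, but here’s what it looks like for the children’s PCA (a quick sketch using the same approach we’ll walk through below): pull out the rotations, turn them into a data frame, and plot PC1 against PC2 with the variable names as labels.

# The rotations live in pca_children$rotation. The variable names are stored
# as row names, so we copy them into a proper column for ggplot to use.
child_rotation <- as.data.frame(pca_children$rotation)
child_rotation$variable <- rownames(child_rotation)
ggplot(child_rotation, aes(PC1, PC2, label = variable)) + 
  geom_label()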

So PCA is super cool because it allows us to view our data in lots of interesting new ways. With the data we have, though, it’s not super useful, mostly because the number of variables is small. In the next section we’ll add some random variables to see how this works with more columns.

4.2 PCA on artificial data

For this section, I’m going to create a bunch of additional, randomly generated columns. Some of them are purely random, and some of them are our existing numeric columns with random noise added. We’ll work with the full dataset this time, so we’ll leave out the AgeYrs column.
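
One practical aside of my own before we generate anything: random numbers come out different every time you run the code, so if you want your artificial columns (and therefore the PCA results below) to be reproducible, set a seed first. It won’t recreate the exact numbers printed in this handout, but it will make your own runs repeatable.

# The seed value is arbitrary; any number works.
set.seed(1)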

n <- nrow(trino)

# Create just random data based on various standard distributions.
rand <- runif(n, min = 5, max = 10)
norm <- rnorm(n, mean = 1, sd = 1)
pois <- rpois(n, lambda = 3)
binom <- rbinom(n, size = 20, prob = 0.2)

# Add random noise to our existing data
d15N_mod1 <- trino$d15N + rnorm(n)
d15N_mod2 <- trino$d15N + rnorm(n)
d13C_mod1 <- trino$d13C + rnorm(n)
d13C_mod2 <- trino$d13C + rnorm(n)

# Combine it all into a new dataset.
trino_with_artificial <- cbind(trino, rand, norm, pois, binom, d15N_mod1, d15N_mod2, d13C_mod1, d13C_mod2)

Now that that’s done, we’ll run a PCA on all of these columns.

trino_pca <- prcomp(trino_with_artificial[,c(4,5,8:15)], center = TRUE, scale = TRUE)
trino_pca
## Standard deviations (1, .., p=10):
##  [1] 1.7446435 1.2448456 1.1248118 1.0341019 0.9493473 0.8703731 0.7297135
##  [8] 0.6669748 0.5211261 0.4053284
## 
## Rotation (n x k) = (10 x 10):
##                   PC1         PC2         PC3         PC4         PC5
## d15N      -0.46044877  0.38848449 -0.06435253  0.09185095 -0.06566278
## d13C      -0.40051740 -0.41801368  0.11204293 -0.13525735  0.11120113
## rand      -0.05084298 -0.12260696 -0.66589912 -0.19620009  0.05973243
## norm      -0.03987038  0.01145321  0.61459353 -0.21690252 -0.50475454
## pois      -0.14404363 -0.00736575  0.29410427  0.62102379  0.55011555
## binom      0.10719318 -0.11311831 -0.24300769  0.69130053 -0.61687036
## d15N_mod1 -0.40148008  0.37943094 -0.09696541 -0.08805124 -0.13609410
## d15N_mod2 -0.42046377  0.34034106 -0.05423243  0.07970591 -0.04151371
## d13C_mod1 -0.35928231 -0.41489469 -0.06008287 -0.04199401 -0.13096327
## d13C_mod2 -0.35227735 -0.46302618  0.02558967  0.09105582 -0.06652126
##                   PC6          PC7         PC8         PC9         PC10
## d15N       0.01851772 -0.005230779  0.01845313 -0.06553464  0.784328832
## d13C      -0.21320629  0.012840229 -0.14519322 -0.74343407 -0.047322666
## rand       0.68156720  0.050737264 -0.11042495 -0.13115250 -0.019891702
## norm       0.55492663  0.003319036 -0.07334635 -0.07231399 -0.012905449
## pois       0.41703078 -0.009833082  0.13549166 -0.04448075 -0.100271159
## binom     -0.04404782  0.001608370 -0.00723953 -0.23144704 -0.051698521
## d15N_mod1 -0.03784565  0.387908148  0.54790088 -0.07690461 -0.448502264
## d15N_mod2 -0.05193695 -0.344731449 -0.61970163  0.16205980 -0.405622861
## d13C_mod1  0.01627925 -0.617716243  0.45007959  0.30299454 -0.006172135
## d13C_mod2 -0.03838140  0.588419361 -0.23621503  0.49096949  0.059810478

As you can see, with more variables, the output is bigger. Since we supplied it with 10 variables, we now have 10 PCs, and interpreting them is much trickier. You can see that the real data (the top two rows) and their noisy versions (the bottom four rows) make up a substantial part of PC1, but that the purely random columns (rand, norm, pois, and binom) don’t. PC3, on the other hand, has a lot more of those randomly generated numbers in it. Before we figure out how many PCs to keep, the sketch below shows a quicker way to read a rotation matrix this size than scanning it by eye.
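
This is just a convenience trick of mine, using nothing but what’s already stored in trino_pca: sort the absolute loadings for a single PC and the heavy hitters float to the top.

# Which variables contribute most to PC1? Sort the absolute loadings.
sort(abs(trino_pca$rotation[, "PC1"]), decreasing = TRUE)

# And the same for PC3, which is dominated by the random columns.
sort(abs(trino_pca$rotation[, "PC3"]), decreasing = TRUE)

Now, let’s see how many PCs we should keep.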

summary(trino_pca)
## Importance of components%s:
##                           PC1    PC2    PC3    PC4     PC5     PC6     PC7
## Standard deviation     1.7446 1.2448 1.1248 1.0341 0.94935 0.87037 0.72971
## Proportion of Variance 0.3044 0.1550 0.1265 0.1069 0.09013 0.07575 0.05325
## Cumulative Proportion  0.3044 0.4593 0.5859 0.6928 0.78292 0.85868 0.91193
##                            PC8     PC9    PC10
## Standard deviation     0.66697 0.52113 0.40533
## Proportion of Variance 0.04449 0.02716 0.01643
## Cumulative Proportion  0.95641 0.98357 1.00000

Again, the rule of thumb says to keep as many PCs as it takes to get up to about 80%, so in this case, 5 (that gets us to about 78%, which is close enough). Another rule of thumb I’ve heard is to keep the PCs whose standard deviations are greater than 1, which would mean only 4 for us. Either way, keeping anything beyond 4 or 5 offers little additional explanatory power. We can also plot these, simply, with the base R (i.e. not ggplot2) function plot:

plot(trino_pca)

The bars will always be in descending order, but sometimes you’ll get a clear indication of which PCs to keep: you’ll see three or so that are relatively tall, and the rest are all tiny. In our case there’s no clear break like that, probably because we have random variables in our data. At any rate, we’ll go ahead and add all the PCs back into our main data frame and do the same plot we saw earlier.

trino_with_pca <- cbind(trino, trino_pca$x)
ggplot(trino_with_pca, aes(PC1, PC2, label = ID, color = Status)) + 
  geom_label()

This time, the results are murkier. Because we’re working with the adult data mixed in with the children’s, and because we added a bunch of random variables, the patterns are harder to see: coloring by status doesn’t reveal any obvious grouping. Our friend S-547 is off in no man’s land again, probably because of their low d15N ratio.

If you color the labels by some of the other variables, you can start to get a feel for how each variable contributes to PC1 and PC2. Here we see that even though there may be a couple very high values for binom (keep in mind, these are randomly generated), it’s not enough to put that label at the edge of the graph.

ggplot(trino_with_pca, aes(PC1, PC2, label = ID, fill = binom)) + 
  geom_label() + 
  scale_fill_distiller(type = "div", palette = "RdYlGn")

But this may have to do with the fact that the random variables weren’t a significant part of PC1 or PC2. If we plot PC3, we see that the configuration of the plot is different and that the extreme values tend to move towards the edges of the graph.

ggplot(trino_with_pca, aes(PC2, PC3, label = ID, fill = binom)) + 
  geom_label() + 
  scale_fill_distiller(type = "div", palette = "RdYlGn")

Just like we sketched with the children’s data, the plotting data is already in the trino_pca object and can be accessed with $rotation. Converted into a data frame, it gives you an entire spreadsheet with the coordinates of each variable in all the PCs (= dimensions). So, we’ll create a new object, rotation, and do that.

rotation <- as.data.frame(trino_pca$rotation)

Also, the names of the variables are not stored in the data frame itself but are just the “row names”, so we’ll add them explicitly by creating a new column called variable and setting it equal to the row names.

rotation$variable = rownames(rotation)

Now we can plot it. We’ll just plot PC1 and PC2, and put the variable name as the label.

ggplot(rotation, aes(PC1, PC2, label = variable)) + 
         geom_label()

Ahh, this is what we would expect. The two real variables, d13C and d15N, are relatively far from each other, and each is clustered with the two noisy copies we created from it. Very cool. The randomly generated columns, meanwhile, sit off by themselves near (0,0). They don’t pattern with anything, which is what we would expect. This plot matches my intuition of the data, which is awesome.
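
If you want a non-graphical check on that clustering (another little sketch of mine), the plain correlation matrix tells the same story: the noisy copies correlate strongly with the columns they were built from, and the purely random columns don’t correlate strongly with anything.

# Correlations between the variables that went into the PCA, rounded for readability.
round(cor(trino_with_artificial[,c(4,5,8:15)]), 2)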

In conclusion, PCA is a very handy tool. There’s no real statistical test going on here, and you might not report the PCA directly in your paper, but it’s a good data exploration tool at the very least. You can use it to inform decisions about what you do with your analysis down the road. Here, we mostly used it to detect some of the more extreme values, but you might find clear clusters that would be very hard to detect otherwise. It would be worth investigating some of the patterns you see in these plots to find out whether there’s something interesting in your data that you didn’t know about before.

4.3 PCA on the pathology data

4.3.1 Disclaimer

In some ways, the skeleton pathology data is ideal for PCA because there are lots of variables you want to look at. However, in other ways, it is not appropriate for PCA, because PCA only works on numeric data. You might point out that there are numbers in the spreadsheet, but keep in mind that those numbers are essentially codes: one-character placeholders for categorical labels that happen to be digits. Because they’re digits, R will erroneously interpret them as a numeric variable. The same would apply if you had a dataset of athletes’ jersey numbers: yes, they’re numbers, but not really. Technically, our data is what we would call “ordinal” data, because it’s categorical but there is a meaningful order to the categories.
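
To make the “numbers that aren’t really numbers” point concrete, here’s a toy sketch with made-up severity codes (not a real column from your spreadsheet), showing how you would tell R that values like these are ordered categories rather than measurements:

# Made-up severity scores coded 0-3. As plain numbers, R will happily average
# them; as an ordered factor, it treats them as ranked categories instead.
severity_codes <- c(0, 2, 1, 3, 2, 0)
severity <- factor(severity_codes, levels = 0:3, ordered = TRUE)
severity
mean(severity_codes)   # runs, but the "average severity" isn't very meaningful
mean(severity)         # warns and returns NA: you can't average categories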

Because of this, PCA is not an appropriate tool for this kind of data. You can force it, and the code will run, but that doesn’t mean the results are appropriate or meaningful (just as it’s meaningless to say that player 20 is twice as good as player 10 on a soccer team). I asked a stats friend what they thought of PCA on ordinal data and his response was (in a text): “*cringe*”. There are alternative methods out there that do the same kind of thing PCA does but are designed for ordinal, categorical, or binary data; I’m just not familiar with how they work.

With that said, the rest of this section will run a PCA on ordinal data, but please be aware that it’s more for illustrative purposes. We can’t trust the results. Besides, as we’ll see in the output, the results are difficult to interpret anyway.

4.3.2 Run the PCA

The code to run this PCA is identical to what we’ve seen before, so it should be pretty straightforward. First, we’ll isolate just the columns we’re concerned with and save them in a new dataset.

pathology_numbers <- pathology[c(2:19)]
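
Since prcomp will refuse to run on anything that isn’t stored as a number, it doesn’t hurt to double-check the columns we just kept (a quick sketch; keep in mind that “stored as a number” is not the same thing as “meaningfully numeric”):

# TRUE for every column means prcomp will at least run.
sapply(pathology_numbers, is.numeric)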

Then we’ll run the PCA and examine the output.

not_good_pca <- prcomp(pathology_numbers, scale = TRUE, center = TRUE)
summary(not_good_pca)
## Importance of components%s:
##                           PC1     PC2     PC3     PC4     PC5     PC6
## Standard deviation     2.6490 1.28152 1.13926 1.08732 0.98695 0.88985
## Proportion of Variance 0.3898 0.09124 0.07211 0.06568 0.05412 0.04399
## Cumulative Proportion  0.3898 0.48109 0.55319 0.61887 0.67299 0.71698
##                            PC7     PC8     PC9    PC10    PC11    PC12
## Standard deviation     0.81802 0.81200 0.78623 0.72570 0.69268 0.67938
## Proportion of Variance 0.03718 0.03663 0.03434 0.02926 0.02666 0.02564
## Cumulative Proportion  0.75415 0.79078 0.82513 0.85438 0.88104 0.90668
##                          PC13    PC14    PC15    PC16    PC17    PC18
## Standard deviation     0.6321 0.58469 0.54254 0.49629 0.45384 0.43779
## Proportion of Variance 0.0222 0.01899 0.01635 0.01368 0.01144 0.01065
## Cumulative Proportion  0.9289 0.94787 0.96423 0.97791 0.98935 1.00000

Since we gave it 18 variables, it returns 18 PCs. Looks like the first 8 or 9 account for 80% of the variance, but only PC1 has a high standard deviation. We can see this clearly in the plot:

plot(not_good_pca)

Nevertheless, let’s add the PCs onto the data frame and see what we get when we plot them. I’ll do the same thing as before and plot the burial number as the label, and I picked Min Age for the fill color.

pathology_with_bad_pca <- cbind(pathology, not_good_pca$x)
ggplot(pathology_with_bad_pca, aes(PC1, PC2, label = `Burial #`, fill = `Min Age`)) + 
  geom_label() + 
  scale_fill_distiller(type = "div", palette = "RdYlGn")

This plot is typical of what you get after doing PCA on categorical data. Most of the data points are clustered around each other in an unnaturally clean way; see how the data sort of forms a point towards the left? Typically, the points should be much more spread out than this. There are a few deviant observations, but it’s hard to say whether they actually are deviant in some way or whether it’s an artifact of the forced PCA. Also, there doesn’t seem to be much of a pattern to the ages.

We can plot the variables themselves as well. While it may be tempting to interpret this, especially if you like what you see, be warned that these results are completely unreliable.

rotation <- as.data.frame(not_good_pca$rotation)
rotation$variable = rownames(rotation)
ggplot(rotation, aes(PC1, PC2, label = variable)) + 
         geom_label()

5 References and Useful Resources

The explanations for the statistics in this workshop came from these sources:

  • Hankins, Matthew. “Still not Significant”. Blog post. April 21, 2013. https://mchankins.wordpress.com/2013/04/21/still-not-significant-2/.

  • Levshina, Natalia (2015). How to do linguistics with R: Data exploration and statistical analysis. John Benjamins, Amsterdam.

    This is essentially an intro to stats book that happens to use linguistic data. I believe it would be good reading even if you don’t have data like what is used in it. In particular, I used Chapter 5 for the t-test and Mann-Whitney test, Chapter 9 for the chi-squared tests, and Chapter 18 for PCA.

  • McDonald, John H. (2014). Handbook of Biological Statistics (3rd ed.). Sparky House Publishing, Baltimore, Maryland.

    Specifically, I used the online version, last updated July 20, 2015. I used the “Kruskal-Wallis test” section, which contains the content of pages 157–164 in the printed version, and the “Chi-square test of goodness-of-fit” section, which contains the content of pages 45–52 in the printed version. An excellent resource with lots of additional information on these tests.

  • Silge, Julia. “Understanding PCA using Stack Overflow data”. Blog post. May 18, 2018. https://juliasilge.com/blog/stack-overflow-pca/.

    A very cool blog post on PCA. She uses some R code that we didn’t discuss here, but she shows some really cool clustering with lots of variables.