Datasets

Author

Joey Stanley

Below you will find some datasets I use for various workshops or publications. They are all either my own data (so, based on my own voice or behavior) or publicly available.

Cereal (.csv or .txt)

A dataset containing a bunch of information about 80 different kinds of breakfast cereal. I use this in some of my ggplot2 workshops. I got the dataset from the one that Chris Crawford has made available on Kaggle.com, but I’ve removed some of the columns to make it more manageable for a workshop. In his words, “If you like to eat cereal, do yourself a favor and avoid this dataset at all costs. After seeing these data it will never be the same for me to eat Fruity Pebbles again.”

Joey’s Vowels

This dataset comes from me reading about 300 sentences at home in my kitchen. I had it automatically transcribed, force-aligned, and formant-extracted using DARLA, and I made zero effort to clean it up, so it’s a little messy. Because DARLA utilizes FAVE for formant extraction, all the typical FAVE columns are there. This dataset is the one I use in a lot of my tutorials.

Missionary Voice

A collection of eight recordings to accompany a forthcoming publication in Proceedings of the Linguistic Society of America annual meeting. It includes the two male and two female speakers that sounded the most like Latter-day Saint missionaries and the two male and two female speakers that sounded the least like missionaries.

Olympic Games (athletes, events, years)

This is data about the athletes and events from the Olympic Games. I downloaded the dataset from Kaggle.com, where user Randi Griffin made it available. I use this in my tidyverse workshops to illustrate joining different datasets.

Sample Audio

This is a collection of recordings from two projects I’ve done. The ones labeled Carol, Daniel, Kathleen, Doug, and Margaret are roughly one-minute clips from interviews I did in Washington and contain first-hand accounts of Mount St. Helens. Stephanie, Jordan, Erica, Julia, Sabrina, Rodney, and Corey are one-sentence recordings collected via Amazon Mechanical Turk from Western American English speakers. There is also one clip of my own voice in there. Each recording comes with a sentence-, word-, and phoneme-level transcription. These speakers have given consent for me to distribute brief recordings for educational purposes. All names are pseudonyms.

Snoozing

This is data about when I’ve woken up every morning for about two years. It’s just information I downloaded from my iPhone’s Health app. I use this in some Tidyverse workshops for simple illustrations of reading in an Excel file and doing some data manipulation and joining.

Stranger Things

A small dataset with basic information about each episode of Stranger Things. I use this in my most recent ggplot2 workshops. I went to IMDb’s data download page to actually get it, and did a little bit of prep work to make it ready for the workshop.

Top 25 Girl Names in 2017

A small dataset simply listing the 25 most common baby girl names in 2017 in the US and how many babies were given those names. I created it with the help of the babynames package, which contains data from the Social Security database. This is used in the ggplot2 workshop. Note that this is saved as a tab-delimited file to help practice reading in different file types.
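
For anyone unsure what reading a tab-delimited file involves, here is a minimal sketch. The workshop itself uses R; this uses plain Python only to show the idea. The column names (name, n) and the values are placeholders I made up for illustration, not the contents of the actual file.

```python
import csv
import io

# Hypothetical stand-in for the tab-delimited file. The column names and
# values are placeholders, not the real 2017 names or counts.
sample_text = "name\tn\nNameA\t300\nNameB\t200\nNameC\t100\n"


def read_tab_delimited(handle):
    """Read tab-separated text into a list of dicts, one per row."""
    reader = csv.DictReader(handle, delimiter="\t")
    return list(reader)


rows = read_tab_delimited(io.StringIO(sample_text))

# Every value comes back as a string, so convert the count column explicitly.
for row in rows:
    row["n"] = int(row["n"])

print(rows[0])
```

With a real file, the same function should work on an opened file handle, e.g. `open(path, newline="")`, in place of the `io.StringIO` wrapper.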

Vowels (tall and wide)

Two very small datasets of sample acoustic measurements. They contain identical information, but one is “tall” and the other is “wide.” They’re used in my first Tidyverse workshop to illustrate reshaping dataframes.
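
To make the tall versus wide distinction concrete, here is a small sketch of the same information in both shapes. The workshop does this in R with the tidyverse; the plain Python below is only an illustration. The column names (vowel, formant, hz), the helper functions, and the measurements are all made up and are not taken from the actual files.

```python
from collections import defaultdict

# Hypothetical "tall" data: one row per measurement (values are made up).
tall = [
    {"vowel": "IY", "formant": "F1", "hz": 300},
    {"vowel": "IY", "formant": "F2", "hz": 2300},
    {"vowel": "AA", "formant": "F1", "hz": 750},
    {"vowel": "AA", "formant": "F2", "hz": 1200},
]


def tall_to_wide(rows, id_col, name_col, value_col):
    """Spread name/value pairs into one column per name, one row per id."""
    grouped = defaultdict(dict)
    for row in rows:
        grouped[row[id_col]][row[name_col]] = row[value_col]
    return [{id_col: key, **values} for key, values in grouped.items()]


def wide_to_tall(rows, id_col, name_col, value_col):
    """Gather every non-id column back into name/value pairs."""
    return [
        {id_col: row[id_col], name_col: name, value_col: value}
        for row in rows
        for name, value in row.items()
        if name != id_col
    ]


wide = tall_to_wide(tall, "vowel", "formant", "hz")
for row in wide:
    print(row)

# Reshaping back recovers the original rows, so no information is lost.
round_trip = wide_to_tall(wide, "vowel", "formant", "hz")
```

The wide version has one row per vowel with F1 and F2 as separate columns, while the tall version has one row per vowel-formant pair. Converting one to the other and back gives the same rows.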