A few weeks ago, I wrote about a grant I was awarded where I’ll use Amazon Mechanical Turk (“MTurk”) to collect data from people all across the West. Today, I did a soft launch of the request and already got recordings from five people!
After weeks of carefully wrangling my MTurk request, my Qualtrics survey, and my IRB forms, I finally got it all set up. I’ve had a handful of projects get approved by the IRB, but this one was a little different since it was through MTurk, so I was a little unsure how to go about some things. Luckily, our IRB office was having open houses all through the semester, which were very helpful.
Just got IRB approval on my first try. I'm getting better at this! Huh, so going to the @UGAHSO open houses helps a ton! Who'da thunk?— Joey Stanley (@joey_stan) May 17, 2017
I decided to do a soft release first. $2500 is a lot of money to just throw into a task all at once and I wanted to make sure things were working out right. So I put in enough for five people to do the task. Within the hour I was getting data sent to me! It was crazy!
With MTurk, I'm literally getting data emailed to me throughout the day. Pretty exciting.— Joey Stanley (@joey_stan) May 18, 2017
I got all five in one day with no problem. I’m glad I did the soft release though because there were a couple small snafus that I had to fix. For example, I underestimated how much time it would take people to finish the task, so I’ll raise the amount they’re compensated: I can afford fewer workers that way, but at least I pay them an honest amount.
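To make that trade-off concrete, here is a rough sketch of the arithmetic. The $2500 budget is the real figure from above, but the per-worker rewards are hypothetical placeholders (I'm not listing my actual rates here), and the calculation ignores any platform fees.

```python
def affordable_workers(budget_cents: int, reward_cents: int) -> int:
    """Number of workers a fixed budget covers at a given reward (fees ignored)."""
    return budget_cents // reward_cents


BUDGET_CENTS = 250_000  # the $2500 budget, in cents to avoid float rounding

# Hypothetical rewards per worker, in cents; not the rates actually used.
for reward_cents in (300, 500):
    n = affordable_workers(BUDGET_CENTS, reward_cents)
    print(f"${reward_cents / 100:.2f} per worker -> {n} workers")
```

Raising the reward shrinks the pool proportionally, which is exactly the trade I decided was worth making.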
I’ll spend the next few days making absolutely certain that the task I want them to do is what’s right for this project. But at some point, I’ll pull the trigger and let ’er rip. From that point on, all I need to do is approve people’s work (to make sure they get paid) and then just enjoy the hours and hours of recordings showing up in my inbox. What a way to collect data!
So this happened:
Thank you, MTurker, for pointing out that my consent form says the software I ask you download "will be harmful to your computer." #Typo— Joey Stanley (@joey_stan) May 22, 2017
Okay, so several weeks have passed, and the data collection phase is drawing to a close. In just a couple of weeks, I was able to get data from almost 200 people. I had some major time constraints on how I could use my money, so I had to find ways to spend it more quickly. I ended up creating an entirely new task, similar to the first one, with a whole new batch of sentences and words for people to read. A large portion of my participants returned to do the second part, meaning I have around 30 minutes of audio from almost 100 people.
This is an incredible dataset I’ve collected. I don’t know how much audio I have total yet, but it’s well over 50 hours. That’s pretty good for just three weeks.
However, I will be the first to say that it was a rough three weeks. It seems like every hour I was getting data emailed to me, and several times a day I had to sit and catalogue the recordings and speaker metadata, while managing the MTurk tasks. Most of the time, it was relatively straightforward, but some participants needed a little extra attention because of technical difficulties, glitches in the system, or complaints here and there. Luckily, I did this when I wasn’t in classes, because otherwise it would have been impossible.
At last, my data collection has drawn to a close. I ended up with about 212 speakers and 84 hours of data. Not bad. Now comes the daunting task of processing all of this. Even a small task that takes just a minute per person will take over 3 hours to do for all speakers! This will take a very long time for me to get through, but from the 2% that I’ve looked at so far, it’s going to be a very fruitful corpus.
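For the curious, the back-of-the-envelope math behind that estimate looks something like this. It's just a sketch: the function name is made up for illustration, and the speaker count is the approximate 212 from above.

```python
def total_hours(n_speakers: int, minutes_per_speaker: float) -> float:
    """Total time, in hours, to repeat one per-speaker task for every speaker."""
    return n_speakers * minutes_per_speaker / 60


N_SPEAKERS = 212  # approximate number of speakers in the corpus

# A one-minute task, repeated once per speaker.
hours = total_hours(N_SPEAKERS, 1)
print(f"{hours:.1f} hours")  # 3.5 hours
```

Swap in a five-minute task and the same function gives a little under 18 hours, which is why even small per-speaker steps add up so quickly.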