Data Collection in Dialectology using Amazon Mechanical Turk

How-to Guides
MTurk
Methods
Research
Skills
Author

Joey Stanley

Published

July 14, 2020

Note

The following is a manuscript I put together around November 2017. It was intended for publication somewhere, but I felt it was too informally written and too specific to a particular method to be useful. Besides, now it’s a little dated. Rather than collecting dust on my computer in perpetuity, I’ve decided to release it as a blog post. Hopefully it can be useful to someone out there.

For well over a century, dialectologists have needed to collect linguistic data from people in a large geographic area. The gold standard is to conduct in-person interviews, but the time commitment and costs associated with travel are prohibitively expensive for most researchers. This paper discusses Amazon Mechanical Turk as an innovative alternative to conducting fieldwork, and provides detail on the kinds of things to consider when using this new tool.

Data Collection Methods

Before discussing why any new methodology is even worth considering, it is prudent to briefly summarize the various ways in which dialectologists have collected data to show that techniques have evolved with technology several times.

Petyt (1980) explains that one of the earliest dialectology projects in modern times began in 1877 in Düsseldorf, Germany. With the help of government funding, Georg Wenker mailed questionnaires to every village that had a school, asking teachers to “translate” 40 sentences into the local dialect. A staggering 52,000 questionnaires were returned, but only a fraction of them were published before the project was discontinued in 1956.

Towards the end of the 19th century, Jules Gilliéron headed the new Atlas Linguistique de la France (Gilliéron 1902), and recruited Edmond Edmont to do the fieldwork. In what might have been the most serene sociolinguistic fieldwork ever, middle-aged Edmont bicycled across France and French-speaking Europe for eight years, documenting the language of 639 rural areas. The questionnaire eventually included 1900 items, arranged in a semi-conversational structure. Later researchers in Europe would improve on this format all the way through the 1970s.

Comparing the two projects, Petyt (1980) concludes that while the German survey outstrips the French project in geographic coverage, Gilliéron’s is superior in the amount of data per person. In addition, German schoolteachers were required to use a modified orthography to convey pronunciation differences while Edmont recorded utterances phonetically. Using the postal service had its advantages though: it saved time, money, and personnel.

It was around the mid-twentieth century that linguists began taking advantage of developing technology and started using portable recording equipment in fieldwork. The Linguistic Atlas of New England (Kurath 1939) was completed too early for audio recording, but interviews after 1950 in the Linguistic Atlas of the Middle and South Atlantic States (McDavid, Jr. et al. 1980), as well as all of the interviews in the Linguistic Atlas of the Gulf States (Pederson, McDaniel & Adams 1986) and the Dictionary of American Regional English (Cassidy 1985), were recorded.

The use of this technology provided numerous benefits over traditional pencil-and-paper transcription. First, the interviewer no longer had to transcribe carefully during the interview and could instead be an active participant in the conversation. It allowed for transparency in data collection as well as preservation of the audio, so that one could double-check the fieldworker’s transcriptions. It also made acoustic analysis of the audio possible, rather than relying on transcriptions alone. Finally, it allowed for the analysis of other portions of the interview, or “collateral data” (Antieau 2017), besides the targeted lexical items, even if that analysis occurs decades later (see, for example, Olsen et al. 2017). However, the number of fieldworkers and the time commitment required for even small-scale dialectology projects was still an inhibiting factor in data collection (which was now coupled with the added cost of recording equipment).

By the 1980s, technology had advanced to allow linguists to record interviews that took place over the telephone. This was a major breakthrough and a new way to cut costs associated with data collection. Labov (1984) describes procedures used for an early telephone survey, showing that most aspects of an in-person interview (a short questionnaire, grammaticality judgements, word lists, and minimal pairs) could be done over the phone. Labov, Ash & Boberg (2006:36) explain that collecting data via telephone offers the advantage of collecting large amounts of data without the cost of sending fieldworkers all over the country. Convenience comes at the expense of audio quality, but they explain that their analysis was satisfactory and mostly comparable to in-person interviews. This comparison is remarkably similar to the early dialectology projects in Europe: Wenker’s questionnaires provided more data but were less reliable, while Gilliéron’s sample was smaller but more accurate.

Today, the internet provides nearly limitless data, and dialectologists are using innovative techniques for data collection. For example, the primary instrument in the Harvard Dialect Survey (Vaux & Golder 2003) was a questionnaire distributed over the internet rather than through the mail (see also Katz 2016). Boberg (2016) collected data by teaming up with a popular newspaper in Canada in order to study preferences between Canadian and American spelling. And Grieve (2015) compiled a corpus of more than 200,000 letters to the editor to study regional variation in written American English.

The advantage of these projects is that it is relatively easy to collect an enormous amount of data, even for a single researcher, in relatively little time. However, similar to using telephones and questionnaires in previous projects, this comes at the expense of quality. Furthermore, it is difficult to gather a large amount of data, including metadata, from any one person, making individual-level analysis limited, if possible at all.

The other major limitation is that it is difficult to collect audio data by scraping the web or distributing questionnaires online. This makes intuitive sense: because text is a simpler data format, it is much easier to extract and analyze with automatic methods. It certainly is possible to use existing recordings to compile a corpus: Stanley & Renwick (2016) analyze the speech of one speaker’s religious sermons over several decades using 39 hours of publicly available recordings. However, as is true of any convenience sample, the biggest challenge is to build an audio corpus that is representative and evenly sampled across a population.

In-person fieldwork will likely always be the gold standard, but an alternative that is less time- and resource-intensive for gathering audio in dialectology is needed. Like distributing questionnaires, a method that does not require the researcher to be physically present could facilitate audio collection in a short period of time. One possible solution to this problem is Amazon Mechanical Turk.

What is Amazon Mechanical Turk?

Amazon Mechanical Turk (hereafter “MTurk”) is a US-based crowdsourcing internet marketplace owned by Amazon. It is a platform for businesses and individuals (“requesters”) to post tasks called Human Intelligence Tasks (“HITs”) for workers to complete. Typical HITs are simple tasks that cannot (yet) be automated like completing surveys, tagging images, data cleanup, writing product descriptions, and transcribing audio. Workers complete these for a small monetary payment (called a “reward”). In exchange for their services, the site charges requesters an additional 20%–40% of the worker’s compensation for each task.

Requesters have full control over their HITs. They determine how many workers they would like to perform the task. They set the compensation amount based on the amount of work required to complete the task. They also set a maximum amount of time allotted for the task (perhaps one hour for very short tasks or a day for longer ones). With justification, they can reject payment for a particular worker if the task is incomplete or unsatisfactory, or they can reward workers with extra money for a job well done. Finally, requesters can set up their tasks so that only workers who meet certain qualifications can see the HIT. Filters can target specific age groups, ethnicities, genders, and—critical for dialectology—geographic areas (specific countries or US states).
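To make these options concrete, here is a minimal sketch of what posting such a HIT could look like programmatically, using the AWS boto3 Python client rather than the MTurk website. The title, reward, survey URL, and states are illustrative placeholders, not the settings from any particular study.

```python
# A minimal sketch of posting a location-restricted HIT with the boto3
# MTurk client. All values (title, reward, URL, states) are illustrative.
import boto3

mturk = boto3.client("mturk", region_name="us-east-1")

# An ExternalQuestion points workers to an outside survey (e.g. Qualtrics).
external_question = """
<ExternalQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2006-07-14/ExternalQuestion.xsd">
  <ExternalURL>https://example.qualtrics.com/jfe/form/SV_hypothetical</ExternalURL>
  <FrameHeight>600</FrameHeight>
</ExternalQuestion>
"""

response = mturk.create_hit(
    Title="Record yourself reading short sentences",
    Description="Read a list of sentences aloud and upload the recordings.",
    Keywords="audio, recording, speech, survey",
    Reward="4.50",                             # per-assignment payment in USD
    MaxAssignments=100,                        # how many workers can complete it
    AssignmentDurationInSeconds=2 * 60 * 60,   # time allotted per worker
    LifetimeInSeconds=14 * 24 * 60 * 60,       # how long the HIT stays posted
    Question=external_question,
    QualificationRequirements=[{
        # Built-in Locale qualification: only workers registered in these
        # US states can see and accept the HIT.
        "QualificationTypeId": "00000000000000000071",
        "Comparator": "In",
        "LocaleValues": [
            {"Country": "US", "Subdivision": "UT"},
            {"Country": "US", "Subdivision": "ID"},
            {"Country": "US", "Subdivision": "WY"},
        ],
        "ActionsGuarded": "DiscoverPreviewAndAccept",
    }],
)
print("Posted HIT:", response["HIT"]["HITId"])
```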

From the worker’s side, MTurk is more straightforward. After creating a free MTurk account, workers can browse existing HITs and complete whatever tasks they choose. When they finish a HIT, they are provided with a completion code that is unique to them and/or the HIT. Requesters use this to link workers to data and to prevent spoofers from getting paid. Within a short time, typically less than three days, workers receive payment, which can be transferred to a bank account.

While a HIT can be created using the MTurk interface, it is common to provide workers with a link to a third-party survey distributor. A popular distributor is Qualtrics, which allows researchers to create surveys with a wide variety of customizable features. While Qualtrics provides analyses of its own, the data and metadata collected in these surveys are available for download in spreadsheet form, allowing the researcher to perform their own analysis.

In summary, MTurk is an online space where people are willing to complete small tasks for payment. Unlike in other kinds of studies, this makes recruiting very easy: willing workers are ready and waiting for tasks. Linguists can take advantage of this platform to find participants for their own studies more quickly and easily than through many other recruitment methods.

MTurk for Linguistics Research

The field of linguistics has already benefited from MTurk. Gibson, Piantadosi & Fedorenko (2011) show how the platform can be used to gather grammaticality judgements (see also Sprouse 2011). Karttunen (2014) provides evidence for semantic interpretations of sentences using MTurk workers. Wood et al. (2015) successfully used MTurk workers for acceptability judgements of non-standard syntactic constructions.

Kim et al. (2016) was perhaps the first to collect audio using MTurk in order to study dialect features. As a supplement to answers from questionnaires that asked about regional lexical items and phonological features, a separate HIT was set up to collect audio. Focusing on New Englanders, they asked workers to read 12 sentences twice each and paid them $2–$4. The researchers received roughly 10–15 recordings a day from the relatively small targeted geographic area, though it included major urban centers such as Boston, Hartford, and Providence. After two months, they were able to collect audio from around 800 participants. Processing this much audio takes time, but drawing from just 390 speakers (52,418 vowel tokens), they show that their sample contained fine-grained sociophonetic variation in New England comparable to previous research using traditional methods.

In June 2017, I adopted a similar methodology and collected audio from MTurk workers in the western United States. Aiming for more data from each speaker, two similar HITs were set up: the first asked participants to read 132 sentences and a 288-item word list and the second had 125 sentences and a 260-item word list. Each of these tasks paid workers $4.50. Not all workers completed both tasks, but in less than four weeks I was able to collect 84 hours of audio from 212 speakers in the targeted area, providing me with 619,738 vowel tokens.

These studies show that researchers have already started taking advantage of MTurk for linguistic studies. In particular, a staggering amount of audio can be gathered for dialectology using this platform. The questions that remain are whether this is good data and whether this technique is a viable option for linguistic research.

Technical Aspects of MTurk

As with any data collection technique, MTurk has its pros and cons. The benefits are obvious: it is easy to collect audio from many speakers in a specific geographic area in a short amount of time. With that said, it is not free: a budget of at least several hundred dollars to a couple thousand is required to collect a large sample, depending on the task. But traveling to interview hundreds of people spread over New England—or especially the West—is expensive, certainly more than the few dollars per person that an MTurk study would cost. In addition to being cheaper, using MTurk is faster than traditional methods and can be done entirely on a computer by a single researcher. However, this efficiency is not without its drawbacks. The following sections describe some of the limitations of an MTurk study that researchers should be aware of when considering a project using this platform.

Demographics

Proper sampling procedures are crucial when designing a study. If the sample is not representative of the population, the results are not generalizable. This section considers who MTurk workers are generally, how they compare to the general population, and issues with demographics on MTurk.

The majority of MTurk workers are based in India or the United States (Ross et al. 2010; Pavlick et al. 2014). This is good news for dialectologists wanting to study language in those two countries because it is easy to find workers there. There are workers in many other countries, but they are far less common, making MTurk-based dialectological work outside of these two major countries less feasible. Fortunately, as was mentioned above, researchers can set up HITs so that they are visible only to workers whose IP addresses locate them in the targeted area.

Spatially, within the United States, workers are relatively representative of the country’s population distribution. Most workers come from urban centers but some come from very rural areas. In my sample of 212 people across the western United States, all major cities were represented and I was still able to get some workers from Wyoming, Montana, and South Dakota (none, however, came from North Dakota). If the researcher wishes to have representation in rural areas of the United States, it may take several months to get sufficient coverage and a large sample is required, but it is possible.

How is this geographic information gathered? MTurk has information on users’ IP addresses to allow for filtering by location, but this information is not available to requesters. The simplest way is to include a survey question that asks for the worker’s current location, where they grew up, or their residential history. Another option is to use an online survey distributor such as Qualtrics, which collects participants’ IP addresses by default. But workers must be made aware in consent forms if this information is to be collected.

Focusing just on the United States, compared to the national average, workers tend to be lower in socioeconomic status, more educated, younger (μ = 31.6 years), less representative of ethnic minorities (particularly African Americans), less likely to be married, and significantly more liberal and Democratic (Levay, Freese & Druckman 2016 and sources within). Because of the computer-based nature of the platform, workers are presumably literate and more comfortable with computers, which means older people and people lower in socioeconomic status are probably underrepresented. These trends have implications for the kinds of conclusions that are drawn from MTurk studies, assuming all data is pooled together. However, if a sufficiently large sample is gathered, a balanced subset that controls for certain demographic factors can be analyzed instead, which can still reasonably be generalized to the non-MTurk population.
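As a rough sketch of what that balancing might look like, the snippet below draws an equal number of speakers from each demographic cell of a pooled sample. It assumes a hypothetical metadata spreadsheet with speaker, age_group, and gender columns and uses the pandas library; it is one simple strategy, not the only way to balance a sample.

```python
# Sketch: draw an equal number of speakers from each age-by-gender cell so
# that analyses are not dominated by MTurk's younger, more computer-savvy
# demographic. Column names are hypothetical.
import pandas as pd

speakers = pd.read_csv("speaker_metadata.csv")   # speaker, age_group, gender, ...

# The smallest demographic cell determines how many speakers to keep per cell.
n_per_cell = int(speakers.groupby(["age_group", "gender"]).size().min())

balanced = (
    speakers
    .groupby(["age_group", "gender"], group_keys=False)
    .apply(lambda cell: cell.sample(n=n_per_cell, random_state=1))
)
print(balanced["speaker"].nunique(), "speakers in the balanced subset")
```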

As always, demographic data from MTurk must be taken with a grain of salt. Unfortunately, some workers will try to cheat their way through the task in order to get paid in less time. Chandler & Paolacci (2017) found that workers will falsify their own information in order to get through prescreening questions, so putting an exact description of the required qualifications is not recommended. IP addresses can be spoofed as well, meaning geolocation is not always completely reliable. For user-submitted recordings, it’s necessary to listen to the audio to make sure workers completed the task properly before approving payment.

Thus, an audio sample collected from MTurk workers is, strictly speaking, not representative of all Americans. Only a very carefully designed random sampling procedure can accomplish this, which is becoming more difficult with increasing mobility (Bailey 2017). Together with the restriction to reading tasks, this demographic bias should be carefully considered when drawing conclusions from MTurk studies.

Types of Tasks

While MTurk makes it easy to reach a particular group of people, there are limitations to the types of tasks they can complete. Questionnaires that include grammaticality judgements, intuitions about rhyming pairs, and lexical choices for certain concepts can be easily adapted to an MTurk task. But the kind of audio that can be collected using MTurk is limited.

Collecting recordings of spontaneous, naturalistic, or conversational speech is probably not feasible in this setting. First, another person is required, meaning the worker must be involved in recruiting that second person, and there is no guarantee that ethical recruitment procedures (as determined by the researcher’s institution’s ethics board) will be followed or that informed consent will be given. Even if the researcher asks the worker to talk to themselves for a few minutes, this is a very different speech style than what many sociolinguists consider naturalistic data.

Thus, researchers are essentially limited to reading tasks to collect audio, but there are several different kinds that can still be done. Workers can read carefully crafted stories, paragraphs, sentences, or word lists to collect many tokens of some variable. Workers can also be asked to read minimal pairs, which can complement intuition data collected in a later portion of the survey.

This restriction to reading tasks is a limitation of MTurk and should be taken into consideration when designing a study and interpreting its results. For example, Stanley & Vanderniet (2018) found that non-mainstream consonantal variants were infrequent and hypercorrect forms were more common in MTurk data. The research question determines the methodology, and MTurk may not be appropriate for questions that target non-standard language.

Audio Quality

An obvious drawback to using MTurk for audio is that the sound quality is completely uncontrolled. Substandard audio can still provide meaningful results (Labov, Ash & Boberg 2006; Kim et al. 2016), but MTurk audio will never approach laboratory data. This makes it even more imperative to perform procedures such as normalization on vowel data when making interspeaker comparisons, in order to tease apart speaker differences from recording differences (Rathcke et al. 2017).
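One common choice (though not necessarily the one any particular study uses) is Lobanov normalization, which z-scores each formant within speaker. A minimal sketch with pandas, assuming a hypothetical spreadsheet of formant measurements:

```python
# Sketch of Lobanov (z-score) normalization: each formant is centered and
# scaled within speaker, removing much of the between-speaker and
# between-recording variation. Column names are hypothetical.
import pandas as pd

vowels = pd.read_csv("vowel_measurements.csv")   # speaker, vowel, F1, F2

for formant in ["F1", "F2"]:
    by_speaker = vowels.groupby("speaker")[formant]
    vowels[formant + "_lob"] = (vowels[formant] - by_speaker.transform("mean")) / by_speaker.transform("std")
```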

In my study, I asked participants to use an external microphone if they had one and to specify the model. Some workers reported using decent condenser microphones (Blue Yeti, Blue Snowball, Samson C03U) or gaming headsets, all of which provided clean audio. Others used the microphone on iPhone earbuds or some other inexpensive external microphone. But several recordings were noisy, very quiet, or (especially if they used the built-in microphones on their laptops) punctuated with clicks on the keyboard and taps on the trackpad.

Not only is it important to consider the microphone being used, but users also have to use some sort of recording software since MTurk does not have a way to record audio. Kim et al. (2016) provided instructions for QuickTime or similar built-in computer software and asked workers to use those. In an attempt to control for quality and file formats, I asked users to use the free recording software Audacity and provided instructions for downloading, installing, and using it. This software records with a sampling frequency of 44,100 Hz with 32-bit depth by default, even if the microphone cannot match those specifications, ensuring at least consistency in the recorder even if the microphones are all different. Importantly, the instructions showed users how to save the audio as a .WAV file. However, having to set up a rudimentary “recording studio” is an atypical amount of prep work for a HIT: some users were not comfortable using new computer software, and it was a source of much frustration.
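Because software and settings can never be fully controlled, it is worth spot-checking what actually comes back. Below is a minimal sketch, assuming the uploads have already been downloaded into a local folder, that uses Python's built-in wave module to confirm each file is a readable uncompressed .WAV at the expected sampling rate.

```python
# Sketch: confirm each uploaded file is a readable PCM .wav at the
# requested 44,100 Hz before it enters the analysis pipeline.
# The folder name is illustrative.
import wave
from pathlib import Path

for path in sorted(Path("uploads").glob("*.wav")):
    try:
        with wave.open(str(path), "rb") as wav:
            rate = wav.getframerate()
            if rate != 44100:
                print(f"{path.name}: unexpected sampling rate ({rate} Hz)")
    except wave.Error as err:
        print(f"{path.name}: not a readable PCM .wav file ({err})")
```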

As an alternative, researchers could consider whether workers should just use their smartphones as recording devices. Admittedly, this is probably easier for the workers in some ways, but it makes things more difficult on the analyst’s end. First, it assumes that all workers have smartphones, which is not guaranteed. It may be easier to record, but it takes time to transfer recordings from phones to computers, and explicit instructions must be given for different phone-computer combinations. Finally, the format is inconsistent across smartphones and likely not one that is immediately usable for linguistic research. For example, the Voice Memos app on iPhones records in a compressed .M4A format, which must be converted to .WAV or some other format to be usable in most linguistic software; the conversion itself is simple, but the fidelity lost to the lossy compression cannot be recovered. The recording quality of smartphones has continued to improve in recent years, but currently it is probably better to use a computer-based recording device for MTurk studies.

Having any control over audio quality is difficult and is a primary concern for MTurk-based audio. Not all workers have good recording equipment, and it is a challenge to control software and formatting. With that said, the audio is not unusable, and with the volume of data available on MTurk, clear patterns can be found among the noise.
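For what it's worth, the conversion step mentioned above is easy to script. Here is a minimal sketch using the pydub library, which calls out to ffmpeg (both are assumptions, not tools mentioned in the original study):

```python
# Sketch: batch-convert smartphone .m4a uploads to mono 44,100 Hz .wav
# files. Converting makes the files readable by tools like Praat but does
# not restore fidelity lost to the original lossy compression.
from pathlib import Path
from pydub import AudioSegment

for m4a in Path("uploads").glob("*.m4a"):
    audio = AudioSegment.from_file(m4a, format="m4a")
    audio = audio.set_frame_rate(44100).set_channels(1)
    audio.export(str(m4a.with_suffix(".wav")), format="wav")
```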

File Uploads

Just as MTurk does not have a way to record audio, it does not have a file-upload system either, so any study that requires user-submitted audio needs a way to get the files from the workers’ computers to the researcher’s. Kim et al. (2016) linked workers to a custom-designed webpage that had recording and uploading capabilities. A custom site like this takes considerable preparation for the researchers, but allows complete control over all the data gathered from the workers.

Lacking the skills to build such a site, I provided users a link to a Qualtrics survey for uploading files. Currently, the option to allow users to upload files is an add-on feature and does not come standard on the basic Qualtrics license. Even with the feature, individual files must be relatively small (no more than a couple minutes of uncompressed audio), but currently there is no storage limit imposed on a Qualtrics account or survey, meaning all audio for the project can be safely stored on their servers.

It is possible to get an entire recording session in a single file (especially if software like Audacity is used), but the problem is uploading such a large file. If a worker’s internet connection is slow, this takes a lot of their valuable time. Kim et al. (2016) asked users to record, save, and upload one sentence at a time (there were 12 sentences total). Since my HITs had over 100 sentences, this was not feasible. Instead, I asked workers to record ten sentences at a time (corresponding to one page of the provided script). Again, this meant more work and tedium for the workers, which was reflected in their compensation.

It is unusual for Qualtrics surveys to include a large amount of uploaded data, so as a safety precaution, I also asked users to submit their files using the cloud-based file transferring site WeTransfer. This site is free and easy to use: users simply drag and drop files onto the webpage (up to 2GB for free—more than enough for a single person’s audio), put in sender and recipient email addresses, and submit the form. The site then stores the files on its servers for a week. Immediately after uploading and submitting the form, the sender and recipient get emails with a link to download the files. Since it is against MTurk policy to collect user emails, I set up a sender and a recipient email account and asked users to use those emails instead. This not only protected their anonymity but also provided me with two copies of the files in case of typos or some other error.

These extra precautions were worth the extra time: some files failed to upload in Qualtrics and others failed to come via WeTransfer. Between the Qualtrics uploads and the two copies from WeTransfer, I fortunately did not lose any audio. Whatever method is chosen as a file upload system should be reliable so that workers’ time is not wasted.

Organization

Even if all the files are received by the researcher, any large project necessarily requires some degree of organization or else data can be lost. Dialectology projects often have dozens or hundreds of speakers each with various recordings and some amount of metadata. Organization is especially important for MTurk projects because audio, speaker metadata, HIT metadata, MTurk IDs, and completion codes must all be linked together.

A lot of organization relies on careful tracking of worker IDs, which are unique to each worker and are available to requesters. In my project, I asked workers to give their files a specific name that included their MTurk ID (e.g. “ID12345678_1-10.wav”) before uploading them. This created a link between audio files and the specific worker. This also connected workers to their metadata provided in the Qualtrics survey since these uniquely named audio files were uploaded as a part of their survey.
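Because all of this linking hinges on the file names, it is worth parsing them programmatically rather than by hand. Here is a minimal sketch that recovers the worker ID and page range from names following the convention above; the folder name and the exact regular expression are assumptions about how the files are stored.

```python
# Sketch: parse file names like "ID12345678_1-10.wav" into a table that can
# be joined to the Qualtrics metadata by worker ID.
import re
from pathlib import Path
import pandas as pd

pattern = re.compile(r"ID(?P<worker_id>[A-Za-z0-9]+)_(?P<first>\d+)-(?P<last>\d+)\.wav$", re.IGNORECASE)

rows = []
for wav in Path("audio").glob("*.wav"):
    match = pattern.match(wav.name)
    if match is None:
        print("Could not parse file name:", wav.name)   # follow up manually
        continue
    rows.append({"file": wav.name, **match.groupdict()})

audio_files = pd.DataFrame(rows)
```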

Another thing to consider is the completion codes, the key that links a worker to the completed task when requesters approve payment. Theoretically, codes can be anything the requester sets up, including words, numbers, or a random string of characters. These should be randomly generated in some way or else people will be able to provide a “correct” completion code without actually doing the task.

Qualtrics makes it easy to provide completion codes at the end of a survey using a random number generator. This number is saved in the downloadable Qualtrics spreadsheet and is thus linked to the audio, metadata, and worker ID. When it comes time to approve payment, it is then easy to see whether the MTurk ID and completion code match what is on the Qualtrics spreadsheet.
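That matching step is easy to automate. A minimal sketch with pandas, assuming a batch results file downloaded from MTurk and the Qualtrics export; all column names are illustrative and depend on how the HIT and survey are set up:

```python
# Sketch: flag submissions whose worker ID and completion code have no
# matching row in the Qualtrics export before approving payment.
import pandas as pd

batch = pd.read_csv("mturk_batch_results.csv")   # downloaded from MTurk
survey = pd.read_csv("qualtrics_export.csv")     # downloaded from Qualtrics

merged = batch.merge(
    survey,
    left_on=["WorkerId", "Answer.surveycode"],
    right_on=["worker_id", "completion_code"],
    how="left",
    indicator=True,
)

# Anything without a match needs a closer look before approving or rejecting.
unmatched = merged[merged["_merge"] == "left_only"]
print(len(unmatched), "submissions without a matching survey response")
```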

Finally, all this should be connected to the metadata about the HIT itself. Most of the metadata on the HIT, such as start and completion times, is not useful for the researcher. But it does contain unique IDs for the HIT itself, which are required when asking MTurk to help resolve disputes with workers.

Keeping track of all this information is not easy or straightforward. There are several spreadsheets that need to be maintained, all of which update frequently as workers complete the tasks. Currently, the MTurk website is not the most intuitive interface, though in 2017 an API was launched that is accessible through the MTurkR package in R (Leeper 2017), making it easier to organize HITs, approve payment, and communicate with workers through the programming language R (R Core Team 2014).
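The same API can be scripted from other languages as well. For consistency with the other sketches in this post, here is a minimal example using the AWS boto3 Python client (an assumption; MTurkR is the route mentioned above) that pulls in submitted assignments for a HIT and approves them once the audio has been checked:

```python
# Sketch: list submitted assignments for a HIT and approve them. The HIT ID
# is a placeholder; approval should only happen after the audio is checked.
import boto3

mturk = boto3.client("mturk", region_name="us-east-1")

assignments = mturk.list_assignments_for_hit(
    HITId="3EXAMPLEHITID",
    AssignmentStatuses=["Submitted"],
)["Assignments"]

for assignment in assignments:
    mturk.approve_assignment(
        AssignmentId=assignment["AssignmentId"],
        RequesterFeedback="Thanks for the recordings!",
    )
```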

The Ethics of MTurk Workers

Though they are anonymous, MTurk workers are people. Most do not rely on income from completing tasks (Ross et al. 2010), but they deserve fair pay for their time. As with any study on human subjects, researchers must clearly explain what the task will require them to do, what the estimated time is, and what their compensation will be. Recording audio requires workers to have a microphone. It also implies that they need to be in a semi-private setting and not in a public library or similar place. Installing software requires access to a computer with administrator privileges. To gauge how long the task would take to complete, I ran a pilot study on some MTurkers and found that it took about twice as long as my original estimates. (I then gave those who participated in the pilot study a bonus.)

Most workers are motivated to do a good job. This is in part because requesters can always deny payment if a task is not completed or not done properly; there is no reason for workers to invest time into doing something wrong and not get paid. Motivation also comes from the fact that some requesters make their HITs available only to workers with a history of many approved payments, and that a rejected HIT stays permanently on a worker’s profile and can exclude them from other tasks. The majority of workers for a single project will complete the job as requested with no additional work on the researcher’s part.

However, no matter how much compensation you offer or how long your estimated time is, there will be workers who feel like they were taken advantage of. If they are less comfortable with computers, then using (and installing) software will take much longer, as well as saving and uploading files. Some people have trouble reading out loud and will take longer to make the recordings. There will be technical errors on your end, their end, and with MTurk. Some workers may want to spread the task over two days, which will cause the HIT to time out. One worker asked if his wife could complete the task as well, so I had to temporarily lift the ban on multiple submissions from the same IP address. Though communication between the two parties is limited, workers are able to send these concerns to requesters, but the tone can be harsh.

MTurk has ways of dealing with some of these issues. For example, upon completion of a task, requesters can offer a bonus to workers who do an especially good job. This can be used as an incentive, on a case-by-case basis, if specific workers feel like they are underpaid. It is also possible to set up “dummy HITs” that are only good for a specific worker ID as a way to pay people after technical errors. It is recommended that part of the budget for an MTurk project be set aside for bonuses and other extra payments to workers. When it comes to working with your participants, the 80-20 rule applies: most will take minimal effort, but some will take a lot of extra attention.
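Granting a bonus is a single API call. A minimal sketch with boto3, where the IDs and amount are placeholders:

```python
# Sketch: grant an extra payment to a worker whose task took much longer
# than estimated. Worker and assignment IDs are placeholders; the amount is
# a string in USD, per the MTurk API.
import boto3

mturk = boto3.client("mturk", region_name="us-east-1")

mturk.send_bonus(
    WorkerId="AEXAMPLEWORKERID",
    AssignmentId="3EXAMPLEASSIGNMENTID",
    BonusAmount="1.50",
    Reason="Thanks for your patience with the recording software issues.",
)
```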

Between keeping track of the various spreadsheets of metadata, ensuring good audio quality, organizing files, and resolving concerns with MTurk workers, this kind of project was nearly a full-time job for me for at least a week or two. Researchers wishing to use MTurk should be aware of the time commitment required to essentially be the accountant and human resources department of a large project.

Conclusions

The gold standard for dialectology research is in-person fieldwork because the researcher has complete control over the sampling, demographics, recording equipment, and can collect different kinds of styles besides reading tasks. However, this is prohibitively time-intensive and expensive for most researchers. An alternative is to use Amazon Mechanical Turk for collecting audio.

This paper has outlined some of the positive and negative aspects of MTurk. Most notably, this platform provides a way to get recordings of speech from a specific geographic area without the researcher having to travel to the field site. Though there are costs associated with paying workers and fees to MTurk, it can be considerably cheaper to collect this data since there are no costs associated with travel or equipment.

The most obvious negative aspect of MTurk-based audio is the sound quality. It is near-impossible to control for recording equipment and to ensure audio fidelity. However, just as meaningful results can be drawn from questionnaires (Boberg 2016) and telephone interviews (Labov, Ash & Boberg 2006), MTurk data can be used for meaningful dialectology research as well (Kim et al. 2016; Stanley & Vanderniet 2018).

Amazon Mechanical Turk is not the solution to all problems in dialectology research, but it is a viable alternative to traditional methods. As technology advances, linguistic methodology should follow closely behind, and linguists should be aware of the strengths, weaknesses, assumptions, and techniques associated with each new tool. Amazon Mechanical Turk is a feasible option for collecting audio data for linguistic analysis and should be considered when a researcher lacks the means for a full-scale dialectology project.