Abstract

Microtask crowdsourcing is increasingly critical to the creation of extremely large datasets. As a result, crowd workers spend weeks or months repeating the exact same tasks — making it necessary to understand their behavior over these long periods of time. We utilize three large, longitudinal datasets of nine million annotations collected from Amazon Mechanical Turk to examine claims that workers fatigue or satisfice over these long periods, producing lower quality work. We find that, contrary to these claims, workers are extremely stable in their accuracy over the entire period. To understand whether workers set their accuracy based on the task’s requirements for acceptance, we then perform an experiment where we vary the required accuracy for a large crowdsourcing task. Workers did not adjust their accuracy based on the acceptance threshold: workers who were above the threshold continued working at their usual quality level, and workers below the threshold self-selected themselves out of the task. Capitalizing on this consistency, we demonstrate that it is possible to predict workers’ long-term accuracy using just a glimpse of their performance on the first five tasks.

CSCW 2017

Paper

Download the paper here.

Dataset

Download the dataset here.

Slides

Download the slides for our talk at CSCW here.

Bibtex

@inproceedings{hata2017glimpse,
  title={A Glimpse Far into the Future: Understanding Long-term Crowd Worker Quality},
  author={Hata, Kenji and Krishna, Ranjay and Fei-Fei, Li and Bernstein, Michael},
  booktitle={CSCW: Computer-Supported Cooperative Work and Social Computing},
  year={2017}
}

Dataset

The Visual Genome project contains multiple datasets that were collected over a long period of time. We focus our analysis on the image description dataset, the visual question answering dataset, and the verification dataset. In total, we analyze 6.4 thousand crowd workers who contributed 8.89 million annotations over a period of nine months.

Data Analysis

We found that, across all three datasets, workers were consistent in their microtask submissions. Contrary to prior literature on satisficing, workers not only maintain a similar quality of work over time but also become more efficient as they gain experience with the task. Only a small proportion of workers are affected by fatigue from continuous, monotonous tasks. The paper goes into more detail about workers' consistency in accuracy, word diversity, and annotation speed. Read the paper to learn more.
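As an illustration of this kind of stability check (a minimal sketch, not the paper's analysis pipeline), the snippet below measures whether each worker's accuracy drifts between their earliest and latest blocks of submissions. It assumes a hypothetical flat file annotations.csv with worker_id, timestamp, and correct (0/1) columns.

# Minimal sketch: per-worker accuracy drift over consecutive blocks of tasks.
# Assumes a hypothetical annotations.csv with columns: worker_id, timestamp, correct.
import pandas as pd

df = pd.read_csv("annotations.csv", parse_dates=["timestamp"])
df = df.sort_values(["worker_id", "timestamp"])

# Index each worker's submissions in order, then bucket them into blocks of 100 tasks.
df["task_index"] = df.groupby("worker_id").cumcount()
df["block"] = df["task_index"] // 100

# Mean accuracy per worker per block; a stable worker shows a flat profile.
accuracy = (
    df.groupby(["worker_id", "block"])["correct"]
      .mean()
      .unstack("block")
)

# Per-worker drift: accuracy in the last block minus accuracy in the first block.
drift = accuracy.apply(
    lambda row: row.dropna().iloc[-1] - row.dropna().iloc[0], axis=1
)
print(drift.describe())

A drift distribution centered near zero is consistent with the stability we report; the same grouping can be reused for word diversity or annotation speed instead of accuracy.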

Effect of acceptance threshold and transparency on worker performance and retention

To further investigate the cause of this consistency, we analyzed 1.1 thousand additional workers who contributed 0.67 million additional annotations under varying experimental conditions. We found that process-centric factors such as the acceptance threshold used to accept or reject work (whether high at 96% or low at 70%) had no significant impact on the accuracy of annotations. Workers were consistent whether they knew (high transparency) or did not know (low transparency) what these acceptance thresholds were. However, we did find that workers would self-select out of tasks they could not complete effectively.
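As a rough illustration of this comparison (a sketch under assumed data, not the paper's analysis code), the snippet below compares per-worker accuracy between the high and low acceptance-threshold conditions with a non-parametric test. It assumes a hypothetical file experiment.csv with worker_id, threshold_condition ("high"/"low"), and correct (0/1) columns.

# Minimal sketch: compare per-worker accuracy across acceptance-threshold conditions.
# Assumes a hypothetical experiment.csv with columns: worker_id, threshold_condition, correct.
import pandas as pd
from scipy.stats import mannwhitneyu

df = pd.read_csv("experiment.csv")

# Aggregate to one accuracy value per worker so workers, not annotations,
# are the unit of analysis.
per_worker = (
    df.groupby(["worker_id", "threshold_condition"])["correct"]
      .mean()
      .reset_index(name="accuracy")
)

high = per_worker.loc[per_worker.threshold_condition == "high", "accuracy"]
low = per_worker.loc[per_worker.threshold_condition == "low", "accuracy"]

# Non-parametric test of whether the two accuracy distributions differ.
stat, p = mannwhitneyu(high, low, alternative="two-sided")
print(f"Mann-Whitney U = {stat:.1f}, p = {p:.3f}")

The same per-worker aggregation can be split by the transparency conditions to check whether knowing the threshold changes the accuracy distribution.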

Implications for Crowdsourcing

Attaining high-quality judgments from crowd workers is often seen as a challenge. Given that workers perform consistently over time, we discuss the implications of this consistency for crowdsourcing design. We show that models can be trained on a worker's first few submissions to predict their future performance, as sketched below. We reinforce that person-centric strategies, rather than process-centric interventions, should be employed when creating tasks.
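The snippet below illustrates the prediction idea in its simplest form; it is a sketch of the approach, not a reimplementation of the paper's models. It assumes a hypothetical annotations.csv with worker_id, task_index, and correct (0/1) columns, and regresses each worker's long-term accuracy on their accuracy over their first five tasks.

# Minimal sketch: predict long-term accuracy from a worker's first five tasks.
# Assumes a hypothetical annotations.csv with columns: worker_id, task_index, correct.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("annotations.csv")
df = df.sort_values(["worker_id", "task_index"])

# Split each worker's history into their first five tasks and everything after.
rank = df.groupby("worker_id").cumcount()
first5 = df[rank < 5]
rest = df[rank >= 5]

# Feature: accuracy on the first five tasks. Target: accuracy on all later tasks.
early = first5.groupby("worker_id")["correct"].mean().rename("early_accuracy")
later = rest.groupby("worker_id")["correct"].mean().rename("long_term_accuracy")
data = pd.concat([early, later], axis=1).dropna()

# Hold out a set of workers to evaluate how well the early glimpse generalizes.
X_train, X_test, y_train, y_test = train_test_split(
    data[["early_accuracy"]], data["long_term_accuracy"],
    test_size=0.25, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
print("R^2 on held-out workers:", model.score(X_test, y_test))

In practice, richer features (such as annotation speed or word diversity from the first few tasks) can be added as extra columns without changing the overall structure.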

Read the paper for more...