CrowdED

CrowdED is a two-stage statistical guideline for optimal crowdsourcing experimental design. It estimates a priori the optimal assignment of workers to tasks so as to maximize accuracy across all tasks.


Installation

Install the package with pip:

pip install crowdED

Installing from source (Optional)

git clone https://github.com/MaastrichtU-IDS/crowdED.git
cd crowdED
pip install --editable ./

Note: crowdED is currently compatible only with Python 3.6.

Examples

Create a synthetic dataset of tasks

This example requires the shortuuid package. Install it with pip install shortuuid (in a notebook, prefix the command with !).

import crowded.simulate as cs

#define your parameters
total_tasks = 415
p_hard_tasks = 0.4
number_of_valid_answers = 3

#create task dataset
df_tasks = cs.Tasks(number_of_valid_answers).create(total_tasks, p_hard_tasks)
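
To make the role of the three parameters concrete, here is a minimal standard-library sketch of what a synthetic task table could look like. It is not crowdED's implementation: the function make_tasks, the is_hard flag, and the answer labels are hypothetical names chosen for illustration, and only the true_answers column name is taken from the examples in this README.

```python
import random

def make_tasks(total_tasks, p_hard_tasks, number_of_valid_answers, seed=0):
    """Hypothetical stand-in for a synthetic task table (not crowdED's code)."""
    rng = random.Random(seed)
    # Illustrative answer labels; the real library may label answers differently.
    valid_answers = [f"answer_{i}" for i in range(number_of_valid_answers)]
    n_hard = round(total_tasks * p_hard_tasks)
    # Mark exactly n_hard tasks as hard, chosen at random.
    hard_ids = set(rng.sample(range(total_tasks), n_hard))
    return [
        {
            "task_id": i,
            "true_answers": rng.choice(valid_answers),
            "is_hard": i in hard_ids,
        }
        for i in range(total_tasks)
    ]

tasks = make_tasks(total_tasks=415, p_hard_tasks=0.4, number_of_valid_answers=3)
n_hard = sum(t["is_hard"] for t in tasks)
print(len(tasks), n_hard)
```

With the parameters above, 40% of the 415 tasks are flagged as hard and every task receives one of the three valid answers as its gold standard.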

Create a synthetic dataset of workers

import crowded.simulate as cs

#define your parameters
total_workers = 40
alpha = 28
beta = 2
#create worker dataset
df_workers = cs.Workers(alpha, beta).create(total_workers)
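
The sketch below shows one plausible reading of alpha and beta, assuming they are the shape parameters of a Beta distribution from which each worker's reliability is drawn. That interpretation is an assumption, as is the helper name make_workers; only the prob_worker column name comes from the examples in this README.

```python
import random

def make_workers(total_workers, alpha, beta, seed=0):
    """Hypothetical stand-in: one reliability score per worker."""
    rng = random.Random(seed)
    # ASSUMPTION: reliability is sampled from Beta(alpha, beta).
    return [
        {"worker_id": i, "prob_worker": rng.betavariate(alpha, beta)}
        for i in range(total_workers)
    ]

workers = make_workers(total_workers=40, alpha=28, beta=2)
mean_prob = sum(w["prob_worker"] for w in workers) / len(workers)
expected_mean = 28 / (28 + 2)  # mean of Beta(alpha, beta) is alpha / (alpha + beta)
print(round(mean_prob, 3), round(expected_mean, 3))
```

Under this assumption, alpha = 28 and beta = 2 describe a pool of mostly reliable workers, since the distribution's mean is 28 / 30, about 0.93.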

Assign workers to tasks easily and fairly

import crowded.simulate as cs

#workers per task should always be smaller than the number of workers
wpt = 5
#create assignment
df_tw = cs.AssignTasks(df_tasks, df_workers, wpt).create()
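
One simple way to obtain a fair assignment is a shuffled round-robin, sketched below with the standard library. This illustrates the idea only and is not necessarily how AssignTasks works internally; the function assign_tasks and its seed parameter are hypothetical.

```python
import random
from collections import Counter

def assign_tasks(task_ids, worker_ids, wpt, seed=0):
    """Hypothetical round-robin assignment of wpt distinct workers per task."""
    if not 0 < wpt < len(worker_ids):
        raise ValueError("wpt must be positive and smaller than the number of workers")
    rng = random.Random(seed)
    order = list(worker_ids)
    rng.shuffle(order)  # randomize which worker starts the rotation
    pairs = []
    cursor = 0
    for task_id in task_ids:
        # Take the next wpt workers in cyclic order; because wpt < len(order),
        # the workers chosen for a single task are always distinct.
        for k in range(wpt):
            pairs.append((task_id, order[(cursor + k) % len(order)]))
        cursor = (cursor + wpt) % len(order)
    return pairs

pairs = assign_tasks(range(415), range(40), wpt=5)
load = Counter(worker for _, worker in pairs)
print(len(pairs), min(load.values()), max(load.values()))
```

Because workers are taken in a fixed rotation, no worker ends up with more than one task above any other, which is one reasonable meaning of "fair" here.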

Compute Bayes probability and predict the workers' answers

import crowded.method as cm

#define the parameters
x = df_tw['prob_task'] #vector of probabilities of tasks
y = df_tw['prob_worker'] #vector of probabilities of workers
z = df_tasks['true_answers'].unique()  #vector of valid answers in the experiment
#compute probability
cp = cm.ComputeProbability(x, y, z)

#define the parameters
g = df_tw['true_answers'] #vector of gold standard answers
p = cp.predict() #binary vector of 0 and 1
z = df_tasks['true_answers'].unique()  #vector of valid answers in the experiment
#compute match
worker_answer = cm.WorkerAnswer(g, p, z)
#add the answers to the assignment dataset
df_tw['worker_answers'] = worker_answer.match()
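
The two steps above can be sketched without the library as follows. This is a simplified illustration under stated assumptions, not crowdED's method: the product of the task and worker probabilities stands in for the Bayes computation (whose exact formula is not given here), and predict_correct and match_answers are hypothetical helper names.

```python
import random

def predict_correct(prob_task, prob_worker, rng):
    """Return a binary vector: 1 if the simulated worker answers correctly."""
    # ASSUMPTION: P(correct) = prob_task * prob_worker, for illustration only.
    return [
        1 if rng.random() < pt * pw else 0
        for pt, pw in zip(prob_task, prob_worker)
    ]

def match_answers(gold, predicted, valid_answers, rng):
    """Turn the binary vector into concrete answers."""
    answers = []
    for g, p in zip(gold, predicted):
        if p == 1:
            answers.append(g)  # correct: repeat the gold standard answer
        else:
            # incorrect: pick one of the other valid answers at random
            answers.append(rng.choice([a for a in valid_answers if a != g]))
    return answers

rng = random.Random(0)
valid = ["A", "B", "C"]
gold = ["A", "B", "C", "A"]
prob_task = [0.9, 0.9, 0.5, 0.5]
prob_worker = [0.95, 0.6, 0.95, 0.6]

predicted = predict_correct(prob_task, prob_worker, rng)
worker_answers = match_answers(gold, predicted, valid, rng)
print(predicted, worker_answers)
```

A simulated answer equals the gold standard exactly when the corresponding entry of the binary vector is 1, which mirrors how p and g are combined in the example above.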

Citing this work

If you use CrowdED in a scientific publication, you are encouraged (though not required) to cite the following paper:

CrowdED: Guideline for Optimal Crowdsourcing Experimental Design.
Amrapali Zaveri, Pedro Hernandez Serrano, Manisha Desai and Michel Dumontier
https://doi.org/10.1145/3184558.3191543

Bibtex entry:

    @inproceedings{Zaveri:2018:CGO:3184558.3191543,
    author = {Zaveri, Amrapali and Serrano, Pedro Hernandez and Desai, Manisha and Dumontier, Michel},
    title = {CrowdED: Guideline for Optimal Crowdsourcing Experimental Design},
    booktitle = {Companion Proceedings of the The Web Conference 2018},
    series = {WWW '18},
    year = {2018},
    isbn = {978-1-4503-5640-4},
    location = {Lyon, France},
    pages = {1109--1116},
    numpages = {8},
    url = {https://doi.org/10.1145/3184558.3191543},
    doi = {10.1145/3184558.3191543},
    acmid = {3191543},
    publisher = {International World Wide Web Conferences Steering Committee},
    address = {Republic and Canton of Geneva, Switzerland},
    keywords = {biomedical, crowdsourcing, data quality, data science, fair, metadata, reproducibility},
    }

License

This repository is licensed under the terms of the MIT License.

Programming languages
  • Python 59%
  • R 41%

Participating organisations

Maastricht University

Contributors

Pedro V Hernandez Serrano