Annotation quality and progress tracking⚓︎
The platform offers two ways to monitor the quality of annotations and the progress of the screening/coding efforts.
Inter-rater reliability⚓︎
You may wish to see how far annotators have progressed with the assignments and how well they agree. This information is available via "Annotation" -> "Quality Monitor" and at the bottom of each assignment scope config page.
The screenshot above shows the list of all assignment scopes in an example project. Scopes are grouped by annotation scheme.
At the end of each row, you see the number of assignments (user->item pairs) and the annotation progress. For more details, you can click the small list icon to the left of that, which shows which document was assigned to which user and the respective assignment status. The timestamp indicates the day this assignment scope was created (which is usually also the day the assignments were created). By clicking the eye button at the beginning of a row, you can load additional details and inter-rater reliability metrics.
This table shows the averages of the computed scores for each label.
- Pearson correlation
- Spearman correlation
- Kendall's tau
- Cohen's kappa
- Fleiss' kappa and Randolph's kappa
- Krippendorff's alpha
- Overlap: Of the items that were assigned to either annotator, how many did both annotate with that label.
- Agreement: Of those overlapping items, how often the annotators agreed (or disagreed) on the label.
- Not all implementations are from these libraries; have a look here if you are unsure.
For most scores, we exclude items where at least one annotator did not annotate that label for that item.
All labels are assumed to be nominal, meaning there is no order/rating in numerical choice values.
Note: We map multi-choice labels to unique values, so [1,2,3] and [1,2] are still full disagreement, no grace!
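If you are unsure how these numbers relate to your raw annotations, the following minimal Python sketch mimics the rules above for a single label and one pair of annotators. It is not the platform's implementation, and the annotator names and values are made up: items missing an annotation are dropped, multi-choice answers are mapped to one combined value, and overlap, agreement, and Cohen's kappa are computed on the rest.

```python
from sklearn.metrics import cohen_kappa_score

# Made-up annotations for one label: item id -> value; None = not annotated,
# tuples stand in for multi-choice selections.
alice = {1: 0, 2: 1, 3: (1, 2), 4: None, 5: 1}
bob = {1: 0, 2: 0, 3: (1, 2, 3), 4: 1, 5: 1}

def normalise(value):
    # map multi-choice answers to one hashable combined value,
    # so (1, 2) vs (1, 2, 3) counts as plain disagreement ("no grace")
    return tuple(sorted(value)) if isinstance(value, (list, tuple, set)) else value

# keep only items that both annotators actually annotated with this label
items = [i for i in alice if i in bob and alice[i] is not None and bob[i] is not None]
a = [normalise(alice[i]) for i in items]
b = [normalise(bob[i]) for i in items]

overlap = len(items) / len(alice.keys() | bob.keys())       # both annotated / assigned to either
agreement = sum(x == y for x, y in zip(a, b)) / len(items)  # share of exact matches
kappa = cohen_kappa_score([str(x) for x in a], [str(y) for y in b])  # nominal labels

print(overlap, agreement, kappa)
```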
To save energy and improve loading speed, we do not compute these scores on the fly. Typically, the data does not change anyway, but if it does (during active annotation), you can recompute all scores with the button at the top right.
The rows with the green background (starting with the label) show the scores averaged over all pairs of annotators, unless the respective score can handle multiple annotators directly. You may see a label more than once; this can happen when your annotation scheme allows repeats. To save space, we do not display the full label hierarchy path (as, for example, in the resolution screen) but only the last key.
Clicking on the person-with-tag icon at the top right, next to the recompute button, reveals all pairwise metrics. Note that, at this time, we compute each metric only "in one direction", since the developer assumed the metrics to be symmetric. If a metric is not symmetric, let us know so that we can compute it both ways.
If you want to drill down to the last detail, you can reveal the label values by clicking the tags icon next to the usernames.
Progress tracking with stopping criterion⚓︎
The buscar stopping criterion was developed by Max Callaghan and Finn Müller-Hansen to provide a statistically grounded metric for deciding when it is safe to stop annotating; in other words: have we seen at least x% of all relevant articles? This metric can only be used in projects that use prioritised screening, as it depends on an oracle that guesses which documents are relevant and on humans to confirm how well the oracle is doing.
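To give a rough idea of the statistics behind it: the score is essentially a one-sided hypergeometric test asking how likely it would be to find so few relevant documents among the most recently screened ones if enough relevant documents were still hidden for recall to be below the target. The sketch below is a deliberately simplified, hypothetical illustration of that idea (the function name, the fixed window, and the example numbers are ours); the platform follows the published criterion, which evaluates this more carefully, so please rely on the paper for the exact formulation.

```python
from math import floor
from scipy.stats import hypergeom

def stopping_p_score(labels, n_total, recall_target=0.95, window=100):
    """Simplified illustration of a hypergeometric stopping test
    (not the platform's implementation).

    labels: 0/1 inclusion decisions in the order the documents were screened
            (1 = relevant); n_total: size of the whole dataset.
    """
    r_seen = sum(labels)           # relevant documents found so far
    tail = labels[-window:]        # most recently screened documents
    k_tail = sum(tail)             # relevant documents among them
    # Null hypothesis "recall < target": at least this many relevant documents
    # were still unscreened when the window started.
    k_missed = floor(r_seen / recall_target) + 1 - (r_seen - k_tail)
    n_unscreened = n_total - (len(labels) - len(tail))
    # Probability of seeing at most k_tail relevant documents in the window
    # if the window were a random draw from the then-unscreened documents.
    return hypergeom.cdf(k_tail, n_unscreened, k_missed, len(tail))

# e.g. 9,000 of 10,000 documents screened, 150 relevant found in total,
# but only 2 of them among the last 2,000 screened:
screened = [1] * 148 + [0] * 6852 + [1] * 2 + [0] * 1998
print(stopping_p_score(screened, n_total=10_000, window=2000))  # small -> likely past the target
```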
You can create multiple trackers per project to track the progress on different screening efforts or inclusion rules. The trackers can be configured or viewed via "Annotation" -> "Progress Monitor".
On the top left of this screen, you see the list of trackers. Clicking on the eye icon next to a tracker opens the respective details. With the "Add tracker" button on the top right, you can create a new tracker.
Tracker settings⚓︎
For the configuration, you can reveal all settings by clicking the small plus icon on the right of the settings panel (or close it again by clicking the minus). Please provide meaningful names and don't forget to save when you have made changes to the settings.
Recall target⚓︎
You can select a recall target; typically this is 95% (0.95).
Number of items⚓︎
We do not assume that the total number of items equals the number of documents in the project; hence, you have to provide that number yourself, based on the size of the dataset this tracker refers to.
Batch size⚓︎
The buscar score is not computed after every annotation but in chunks of annotations.
If you set the batch size to -1, the score will be computed after each assignment scope or batch of resolved annotations.
Alternatively, you can set this to any number greater than 10, and the score will be computed after every batch of that many annotated documents.
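As a hypothetical illustration of this setting (assumed behaviour, with an invented helper name), this is when the score would be recomputed for a given batch size:

```python
# Assumed behaviour: cumulative annotation counts at which the score is recomputed.
def recompute_points(n_annotated: int, batch_size: int) -> list[int]:
    if batch_size == -1:
        # special value: recompute once per assignment scope / batch of
        # resolved annotations instead of after a fixed number of documents
        return []
    return list(range(batch_size, n_annotated + 1, batch_size))

print(recompute_points(130, 25))  # [25, 50, 75, 100, 125]
```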
Inclusion rule⚓︎
The platform makes no assumptions about what "relevance" means in your annotation scheme. Hence, you have to define the inclusion rule using a subset of the NQL. The example in the screenshot has a boolean label with the key "rel" that has to be true for a document to be included in the study.
'Complex' inclusion rules
Each clause has the form <key> <operator> <value>, where <operator> is one of =, >=, <=, >, <, !=.
Here, <key> refers to the key defined in the annotation scheme and <value> is either true/false for boolean labels or the respective integer value of a single-choice label.
Multi-choice, float, and string labels are not implemented.
You can combine multiple clauses with AND/OR.
For example, you could define something more complex:
Let's assume that there are, amongst many more, three labels: carbon pricing (cp; yes=1, no=0, maybe=2),
is implemented (imp; yes=1, no=0, maybe=2), and ex-post (exp; yes=1, no=0).
Here, we consider documents relevant (included in the study) only if they describe an implemented ex-post carbon pricing policy, so we define the rule as cp=1 AND imp=1 AND exp=1.
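To make these semantics concrete, here is a small, purely illustrative Python equivalent of that rule (the label values are invented; on the platform you enter the NQL clause itself, not Python):

```python
# Invented resolved labels for one document: boolean labels would be True/False,
# single-choice labels carry their integer value (as described above).
labels = {"cp": 1, "imp": 1, "exp": 0}

# Python equivalent of the rule "cp=1 AND imp=1 AND exp=1":
included = labels["cp"] == 1 and labels["imp"] == 1 and labels["exp"] == 1
print(included)  # False -> not relevant here, because exp is "no"
```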
Selected scopes⚓︎
You can pick annotations from all assignment scopes (person icon) and resolution scopes (person-with-checkmark icon), and even mix the two. In the case of assignment scopes, a clause of the inclusion rule is satisfied if any annotator made the defined choice. Note that this applies per clause, not to the entire statement (see the example below). Typically, one would always and only use resolved labels for this.
The order in which you add scopes to the list of selected scopes matters! If you need to re-order the list, you currently have to clear it first and add the scopes back in the correct order.
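Because the per-clause behaviour can be surprising, here is a hypothetical example (invented annotator names and labels, reusing the clauses cp=1 and imp=1 from above): even if no single annotator satisfies the whole rule, the document still counts as included when each clause is satisfied by someone.

```python
# Invented annotations from an assignment scope for one document.
annotations = {
    "alice": {"cp": 1, "imp": 0},
    "bob": {"cp": 0, "imp": 1},
}

# Per-clause check (assumed tracker behaviour for assignment scopes):
# each clause is satisfied if ANY annotator matches it, then clauses are combined.
cp_clause = any(a["cp"] == 1 for a in annotations.values())    # True (alice)
imp_clause = any(a["imp"] == 1 for a in annotations.values())  # True (bob)
print(cp_clause and imp_clause)  # True -> counted as included

# Per-annotator check, which is NOT how assignment scopes are evaluated:
print(any(a["cp"] == 1 and a["imp"] == 1 for a in annotations.values()))  # False
```

This is one more reason why one would typically base a tracker on resolved labels only.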
Reset and update⚓︎
Computing the buscar score is computationally expensive. Thus, we keep the series of scores in the database, and you can update it, i.e. compute the latest step(s) of the existing series after you have added one or more scopes to the list. If you choose to reset, the series will be deleted and computed from scratch. If you make any changes to the recall target, number of items, batch size, or inclusion rule, you always need to reset.
Plot doesn't properly update
This is a known bug: the graph does not refresh properly. For now, please refresh the page after you update or reset the scores. Sorry.
Plot and stats⚓︎
The x-axis marks the number of screened items. The bars in the background are green for items marked for inclusion and red otherwise. The red line plot shows the cumulative number of relevant documents seen (left y-axis). The blue line plot shows the buscar score (right y-axis).
As long as the buscar score is high, you have to continue annotating. It will stay at or near 1 for a while, so you might not even see it. Once it goes down, the end might be near. Please read the respective paper to fully understand the implications of that score.


