Priority screening

Documentation incomplete

This is just a placeholder page with some notes. Please get in contact if you would be interested in properly documenting this section.

TODO: Background

Configure setup
Verify inclusion rule and data table
Train model and wait
Once predictions are available, use them in the assignment config

Setup config

incl_field, incl_pred_field when generating the data table, the result of the inclusion rule is written to the column incl_field and the predictions are written to the column incl_pred_field (keep these unchanged unless they collide with other column names that might exist)
train_split proportion of the dataset to use for training (the rest is used for some basic evaluation; usually keep this as high as possible if you value a potentially better model over more certainty about its quality)
n_predictions storing all predictions puts an unnecessary burden on our database, so specify the smallest possible number (e.g. the number of assignments you are planning to make with the predictions)
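
As a rough illustration of what train_split and n_predictions control, the following is a minimal sketch and not the platform's implementation. The function names are hypothetical, and keeping the highest-scoring items is an assumption about how the stored predictions are chosen.

```python
import random


def split_dataset(labelled_ids, train_split, seed=42):
    """Shuffle labelled item IDs and split them into a training and an evaluation part."""
    ids = list(labelled_ids)
    random.Random(seed).shuffle(ids)
    cut = int(len(ids) * train_split)
    return ids[:cut], ids[cut:]


def keep_top_predictions(scores, n_predictions):
    """Keep only the n_predictions highest-scoring items (assumed selection rule)."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:n_predictions]


# 90% of the labelled items go to training, the remaining 10% to evaluation.
train_ids, eval_ids = split_dataset(range(100), train_split=0.9)

# Only two predictions are stored, e.g. because two assignments are planned.
scores = {"a": 0.2, "b": 0.9, "c": 0.5, "d": 0.7}
stored = keep_top_predictions(scores, n_predictions=2)
```

A higher train_split leaves fewer items for evaluation, which is the trade-off described above.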

Training

Head over to GitLab (or follow the button)
Create a new pipeline
Configure the variables (most importantly the ID; unless you know what you are doing, leave the other variables untouched)
Run pipeline (prioritise and cleanup; only run the other steps if you know what you are doing)
Wait

Data table

basis for training data and predictions
the NQL query defines the total dataset (training data and what you want to predict on)
ignores hierarchy (if unclear, this is probably totally fine for you)
when an item appears in multiple selected scopes, all labels are kept and assigned to the first scope; some labels might be overridden (edge case where an item was resolved multiple times)
column name template {res or username}|{label key}:{value} (where res stands for the resolved label)
the table is first filled with 1s for selected label values only and then populated with 0s where it makes sense (e.g. assume method:0, method:1 and method:2 are all part of the "Methods" label and 2 was selected, then method:0 and method:1 are implicitly set to 0/False)
an empty cell in the table means no label is available
missing information propagates to the inclusion rule
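
The column naming and the implicit filling with 0s can be sketched as follows. This is a simplified illustration under assumptions, not the actual table-building code: the function names, the input format, and the username alice and label key rel are made up, and None stands for an empty cell.

```python
def column_name(source, key, value):
    """Apply the template {res or username}|{label key}:{value}."""
    return f"{source}|{key}:{value}"


def build_row(annotations, label_values, sources):
    """Build one table row as a dict of column name -> 1, 0, or None (no label).

    annotations:  list of (source, label key, selected value)
    label_values: dict mapping each label key to all values it can take
    sources:      "res" and/or usernames that get their own columns
    """
    row = {
        column_name(s, k, v): None
        for s in sources
        for k, values in label_values.items()
        for v in values
    }
    # Step 1: write a 1 for every selected label value.
    for source, key, value in annotations:
        row[column_name(source, key, value)] = 1
    # Step 2: the other values of an annotated label implicitly become 0.
    for source, key, _ in annotations:
        for v in label_values[key]:
            name = column_name(source, key, v)
            if row[name] is None:
                row[name] = 0
    return row


row = build_row(
    annotations=[("alice", "method", 2)],
    label_values={"method": [0, 1, 2], "rel": [0, 1]},
    sources=["res", "alice"],
)
```

Here alice|method:2 is 1, alice|method:0 and alice|method:1 are 0, and every rel column as well as every res column stays empty because no label is available for them.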

Training artifacts

data.arrow Full dataset with all labels and predictions.
buscar_est.png BUSCAR score with number of relevant articles over time and estimation based on predictions on the rest.
buscar_est.json ... and the matching data
inclusion_statistics.png Number of items and included/excluded items per assignment scope.
inclusion_statistics.json ... and the matching data
report_test.json
predictions.csv
buscar_p.json
buscar_recall.json
inclusion_curve.png
buscar.png
workload_estimation.txt
report_self.json
score_distribution.png
roc_auc.png
roc_auc.json

Models

BERT et al

Inclusion rule

based on columns
note NA OR 1 is 1 and NA OR 0 is NA (makes sense, but confusing at first)
note NA AND 1 is NA and NA AND 0 is 0 (makes sense, but confusing at first)
general idea:
- a list of column names means all of them should be True (where a value is available)
- append ! to a column name to require that a value exists; without it, NA or 1 is accepted implicitly, which can be made explicit by appending ?
- prepend ~ to a column name to negate its value (no effect on NA)
- if you do not care where a label comes from as long as at least one user chose it, drop the first part of the column name (ANYSRC)
- append * to an ANYSRC to require that all sources are 1 (e.g. all users agree)
- note that this always includes the resolution as well, but that should implicitly agree anyway
lists can be enclosed in [ .. ]
lists can be space- or comma-separated
prepend bracketed list with AND or OR to indicate how columns shall be combined
you can combine clauses with AND or OR
use ( .. ) to combine multiple nested clauses
play with the grammar at https://www.lark-parser.org/ide/ (add start: clause to the beginning)

?clause: cols             
       | clause _and  clause            -> and
       | clause _or  clause            -> or
       | "(" clause ")"
       | _neg clause                 -> not

cols: col [(("," | " ") col)*]  -> anded
    | "OR" "[" col [(("," | " ") col)*] "]"  -> ored
    | "AND" "[" col [(("," | " ") col)*] "]"  -> anded


col: SRC     -> maybeyes
   | SRC "!" -> forceyes
   | SRC "?" -> maybeyes
   | _neg SRC     -> maybeno
   | _neg SRC "!" -> forceno
   | _neg SRC "?" -> maybeno
   | ANYSRC  -> anyyes
   | ANYSRC "*" -> allyes
   | ANYSRC "!*" -> forceallyes
   | _neg ANYSRC  -> anyno
   | _neg ANYSRC "*" -> allno
   | _neg ANYSRC "!*" -> forceallno

SRC: LAB "|" LAB ":" DIGIT+
ANYSRC: LAB ":" DIGIT+
LAB: (LETTER|DIGIT|"-"|"_")+
_neg: "-" | "~"

_and: "AND"i | "&"
_or: "OR"i | "|"

%import common.DIGIT
%import common.LETTER
%import common.WS

%ignore WS
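
The NA behaviour noted for AND, OR and negation in the inclusion rule matches three-valued (Kleene) logic. The sketch below only illustrates those truth tables; the function names are hypothetical, None stands for NA, and it does not parse or evaluate the grammar itself.

```python
def na_and(a, b):
    """AND over 1, 0 and None (NA): a 0 decides the result, otherwise NA propagates."""
    if a == 0 or b == 0:
        return 0
    if a is None or b is None:
        return None
    return 1


def na_or(a, b):
    """OR over 1, 0 and None (NA): a 1 decides the result, otherwise NA propagates."""
    if a == 1 or b == 1:
        return 1
    if a is None or b is None:
        return None
    return 0


def na_not(a):
    """Negation flips 1 and 0 and has no effect on NA."""
    return None if a is None else 1 - a


# The four cases from the notes on the inclusion rule.
results = {
    "NA OR 1": na_or(None, 1),
    "NA OR 0": na_or(None, 0),
    "NA AND 1": na_and(None, 1),
    "NA AND 0": na_and(None, 0),
}
```

The result is NA only when the available value cannot decide the outcome on its own, which is why NA OR 1 is 1 while NA OR 0 stays NA.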