
Priority screening

Documentation incomplete

This is just a placeholder page with some notes. Please get in contact if you would be interested in properly documenting this section.

TODO: Background

  1. Configure setup
  2. Verify inclusion rule and data table
  3. Train model and wait
  4. Once the predictions are in, use them in the assignment config


Setup config

  • incl_field and incl_pred_field: when the data table is generated, the result of the inclusion rule is written to the column incl_field and the predictions are written to incl_pred_field (leave these unchanged unless they collide with other column names that might exist)
  • train_split: proportion of the dataset used for training; the rest is held out for some basic evaluation (usually keep this as high as possible if you value a potentially better model over more certainty about its quality)
  • n_predictions: storing all predictions puts an unnecessary burden on our database, so specify the smallest number you need (e.g. the number of assignments you plan to make with the predictions)
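
To make the three settings above more concrete, here is a minimal sketch of how they could be validated before training. The class `PrioritySetup`, its `validate` method and the helper `split_sizes` are hypothetical names made up for this illustration, and the accepted value ranges are assumptions; none of this is the platform's actual code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrioritySetup:
    """Hypothetical container mirroring the settings described above."""

    incl_field: str        # column receiving the inclusion-rule result
    incl_pred_field: str   # column receiving the predictions
    train_split: float     # proportion of the dataset used for training
    n_predictions: int     # number of predictions to keep

    def validate(self, existing_columns: set[str]) -> None:
        # The two output columns must not collide with columns already in the table.
        clashes = {self.incl_field, self.incl_pred_field} & existing_columns
        if clashes:
            raise ValueError(f"column name collision: {sorted(clashes)}")
        # Assumed ranges: a proportion in (0, 1] and at least one prediction.
        if not 0.0 < self.train_split <= 1.0:
            raise ValueError("train_split must be in (0, 1]")
        if self.n_predictions < 1:
            raise ValueError("n_predictions must be positive")


def split_sizes(n_items: int, train_split: float) -> tuple[int, int]:
    """Return (items used for training, items left for evaluation)."""
    n_train = round(n_items * train_split)
    return n_train, n_items - n_train
```

With 1000 items and `train_split=0.9`, `split_sizes` leaves 100 items for evaluation; a lower split gives more certainty about quality at the cost of less training data.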

Training

  1. Head over to GitLab (or follow the button)
  2. Create a new pipeline
  3. Configure the variables (most importantly the ID; unless you know what you are doing, leave the other variables untouched)
  4. Run the pipeline (prioritise and cleanup; only run the other steps if you know what you are doing)
  5. Wait


Data table

  • basis for training data and predictions
  • the NQL query defines the total dataset (training data and what you want to predict on)
  • ignores hierarchy (if it is unclear what that means, it is probably totally fine for you)
  • when an item appears in multiple selected scopes, all labels are kept and assigned to the first scope; some labels might be overridden (edge case where an item was resolved multiple times)
  • column name template: {res or username}|{label key}:{value} (where res stands for resolved)
  • the table is first filled with 1s for the selected label values and then populated with 0s where it makes sense (e.g. assume method:0, method:1 and method:2 are all part of the "Methods" label and 2 was selected; then 0 and 1 are implicitly 0/False)
  • empty cells in the table mean that no label is available
  • missing information propagates to the inclusion rule
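
The filling logic described above can be sketched as follows. This is an illustration under assumptions, not the actual implementation: the function `build_table`, the tuple layout of `annotations` and the `scheme` mapping are hypothetical, and `None` stands in for an empty cell.

```python
def build_table(annotations, scheme):
    """Build {item_id: {column: 1 | 0 | None}} from annotations.

    annotations: list of (item_id, source, label_key, value) tuples, where
                 source is "res" or a username (hypothetical input layout).
    scheme:      dict mapping each label key to all of its possible values.
    """
    # One column per source, label key and possible value.
    columns = sorted(
        {f"{src}|{key}:{v}" for _, src, key, _ in annotations for v in scheme[key]}
    )
    items = sorted({item for item, _, _, _ in annotations})
    # Start with empty cells everywhere (None = no label available).
    table = {item: dict.fromkeys(columns) for item in items}

    for item, src, key, value in annotations:
        for v in scheme[key]:
            col = f"{src}|{key}:{v}"
            if v == value:
                table[item][col] = 1   # the selected value
            elif table[item][col] is None:
                table[item][col] = 0   # sibling value, implicitly not selected
    return table


scheme = {"method": [0, 1, 2]}
annotations = [("doc1", "res", "method", 2), ("doc2", "alice", "method", 0)]
table = build_table(annotations, scheme)
```

Here `table["doc1"]` holds 1 for `res|method:2`, 0 for `res|method:0` and `res|method:1`, and `None` for all `alice|...` columns, because alice did not label that item.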

Training artifacts

  • data.arrow Full dataset with all labels and predictions.
  • buscar_est.png BUSCAR score with number of relevant articles over time and estimation based on predictions on the rest.
  • buscar_est.json ... and the matching data
  • inclusion_statistics.png Number of items and included/excluded items per assignment scope.
  • inclusion_statistics.json ... and the matching data
  • report_test.json
  • predictions.csv
  • buscar_p.json
  • buscar_recall.json
  • inclusion_curve.png
  • buscar.png
  • workload_estimation.txt
  • report_self.json
  • score_distribution.png
  • roc_auc.png
  • roc_auc.json

Models

BERT et al

Inclusion rule

  • based on the columns of the data table
  • note: NA OR 1 is 1 and NA OR 0 is NA (makes sense, but is confusing at first)
  • note: NA AND 1 is NA and NA AND 0 is 0 (makes sense, but is confusing at first)
  • general idea:
    • a list of column names means all should be True (if a value is available)
    • append ! to a column name to indicate that the value has to exist; without it, NA or 1 is implicitly assumed, which can be made explicit by appending ?
    • prepend ~ to a column name to negate the value (no effect on NA)
    • if you do not care where a label comes from as long as at least one user chose it, drop the first part of the column name (ANYSRC)
    • append * to an ANYSRC to indicate that all should be 1 (e.g. all users agree)
    • note that this always includes the resolution as well, but that should implicitly agree anyway
  • lists can be enclosed in [ .. ]
  • lists can be space- or comma-separated
  • prepend a bracketed list with AND or OR to indicate how the columns shall be combined
  • you can combine clauses with AND or OR
  • use ( .. ) to combine multiple nested clauses
  • play with the grammar at https://www.lark-parser.org/ide/ (add start: clause to the beginning)
?clause: cols             
       | clause _and  clause            -> and
       | clause _or  clause            -> or
       | "(" clause ")"
       | _neg clause                 -> not

cols: col [(("," | " ") col)*]  -> anded
    | "OR" "[" col [(("," | " ") col)*] "]"  -> ored
    | "AND" "[" col [(("," | " ") col)*] "]"  -> anded


col: SRC     -> maybeyes
   | SRC "!" -> forceyes
   | SRC "?" -> maybeyes
   | _neg SRC     -> maybeno
   | _neg SRC "!" -> forceno
   | _neg SRC "?" -> maybeno
   | ANYSRC  -> anyyes
   | ANYSRC "*" -> allyes
   | ANYSRC "!*" -> forceallyes
   | _neg ANYSRC  -> anyno
   | _neg ANYSRC "*" -> allno
   | _neg ANYSRC "!*" -> forceallno

SRC: LAB "|" LAB ":" DIGIT+
ANYSRC: LAB ":" DIGIT+
LAB: (LETTER|DIGIT|"-"|"_")+
_neg: "-" | "~"

_and: "AND"i | "&"
_or: "OR"i | "|"

%import common.DIGIT
%import common.LETTER
%import common.WS

%ignore WS
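
The NA behaviour noted for the inclusion rule (NA OR 1 is 1, NA OR 0 is NA, NA AND 1 is NA, NA AND 0 is 0) corresponds to three-valued logic. The sketch below shows one way to express it, using `None` for NA; the function names are hypothetical, and mapping `any_of` to a plain ANYSRC and `all_of` to ANYSRC with `*` is an assumption drawn from the notes above rather than the actual evaluator.

```python
from functools import reduce
from typing import Iterable, Optional

Tri = Optional[bool]  # True, False, or None for NA; pass real bools, not 0/1


def tri_or(a: Tri, b: Tri) -> Tri:
    if a is True or b is True:
        return True      # a single 1 decides the result, even next to NA
    if a is None or b is None:
        return None      # NA OR 0 stays NA
    return False


def tri_and(a: Tri, b: Tri) -> Tri:
    if a is False or b is False:
        return False     # a single 0 decides the result, even next to NA
    if a is None or b is None:
        return None      # NA AND 1 stays NA
    return True


def tri_not(a: Tri) -> Tri:
    return None if a is None else not a  # negation has no effect on NA


def any_of(values: Iterable[Tri]) -> Tri:
    """At least one source says 1 (assumed meaning of a plain ANYSRC)."""
    return reduce(tri_or, values, False)


def all_of(values: Iterable[Tri]) -> Tri:
    """All sources say 1 (assumed meaning of ANYSRC followed by *)."""
    return reduce(tri_and, values, True)
```

For a row where one source is NA, one says 1 and one says 0, `any_of` yields True and `all_of` yields False; with only NA and 1 present, `all_of` yields NA, which is how missing information propagates into the rule.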