NACSOS Query Language

Across the platform, you can utilise our NQL to filter for specific items. At this point, the filters are limited to academic, lexis, and generic projects (no twitter or patents).

Under development

The NQL is now in its second version. Things might still change over time; we aim to point this out in the changelog. Recently, the parsing and translation of the query was moved to the front-end. This allows us to provide hints while you type. This will come soon!

Overview

The following shows the general query keywords that you can combine in any way you like with AND and OR. Any (sub-)clause can be prepended with "NOT" in order to negate the statement. It is strongly recommended to use parentheses for all sub-clauses.

TITLE: dqstring
ABSTRACT: dqstring
YEAR: comp year
DATE: comp date
DOI: doi [ , ... ]
OA: openalex_id [ , ... ]
ID: item_id [ , ... ]
SOURCE: dqstring
META: meta-clause
IMPORT: [( NOT ) import_id] [ , ... ]
LABEL: label-clause
assigned-clause
annotated-clause
abstract-clause

Note, however, that not every combination makes sense. This can lead to result sets that don't match your expectations or even to errors. Using parentheses for every sub-clause ensures that the query is evaluated as intended (e.g. a AND b OR c is ambiguous, (a AND (b OR c)) is not).
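To see why the parentheses matter, the following minimal Python sketch evaluates both readings of `a AND b OR c` for the same item. The truth values are made up for illustration, and this is not how the platform itself evaluates queries.

```python
# Hypothetical truth values of three sub-clauses for a single item.
a, b, c = False, True, True

# The two possible readings of the ambiguous query "a AND b OR c".
left_grouped = (a and b) or c    # ((a AND b) OR c)
right_grouped = a and (b or c)   # (a AND (b OR c))

print(left_grouped)   # True: the item matches
print(right_grouped)  # False: the item does not match
```

The same item is included under one reading and excluded under the other, which is why spelling out the grouping is the safer habit.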

How to read this guide

Things in parentheses are optional. To make things easier to read this guide prefers to show all options spelled out.

The bold parts (such as clause or data_type) are placeholders and need to be replaced. You should find a corresponding section with more details on this page.

Many of the keywords have aliases to make the query more compact. To keep this guide clear, not all are mentioned, but you might find them in the text. In most cases, the capitalisation is ignored. For example TITLE can be title, TI, or ti, or Ti, ...

The [ , ... ] bits symbolise a comma-separated list. If you prefer, you can put those in brackets. Any of the following are acceptable comma-separated lists (in this example for integers): 1, 2, 3, 1,2,3, [1,2,3], {1,2,3}
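If you generate queries from a script, it can help to see that all four spellings carry the same list. The sketch below uses a hypothetical helper, `parse_int_list`, that is not part of NQL or the platform; it only mimics the accepted forms for integers.

```python
def parse_int_list(text: str) -> list[int]:
    """Parse '1, 2, 3', '1,2,3', '[1,2,3]' or '{1,2,3}' into a list of ints."""
    stripped = text.strip()
    # Drop one pair of matching surrounding brackets or braces, if present.
    if (stripped.startswith("[") and stripped.endswith("]")) or (
        stripped.startswith("{") and stripped.endswith("}")
    ):
        stripped = stripped[1:-1]
    # int() tolerates the whitespace left around each element.
    return [int(part) for part in stripped.split(",")]


for form in ["1, 2, 3", "1,2,3", "[1,2,3]", "{1,2,3}"]:
    print(parse_int_list(form))  # [1, 2, 3] each time
```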

At the end of this guide we included several case-studies/examples that might better explain how to translate this more formal grammar into actual queries.

If you are interested, you can have a look at the nearley grammar [in the repository](https://gitlab.pik-potsdam.de/mcc-apsis/nacsos/nacsos-web/-/blob/master/src/util/nql/grammar.ne?ref_type=heads). There is also a hidden page where you can look at the parsed output.

Field filters

`TITLE` (or `TI`)

Only available in academic and lexis projects. Filters on the title of a paper or article. At this point, this is very basic and is translated in the database to title LIKE '%[dqstring]%'. For example: title: "carbon dioxide"

`ABSTRACT` (or `ABS` or `TEXT`)

Available in all projects. Filters on the text (e.g. article content, paper abstract). At this point, this is very basic and is translated in the database to item.text LIKE '%[dqstring]%'. For example: abstract: "carbon dioxide"
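The substring matching described for `TITLE` and `ABSTRACT` can be tried out locally. The sketch below uses an in-memory SQLite database as a stand-in; the table, column, and rows are made up, and the platform's own database may behave differently (in particular, whether `LIKE` is case-sensitive depends on the database engine).

```python
import sqlite3

# In-memory stand-in for the item table (hypothetical schema and data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item (text TEXT)")
conn.executemany(
    "INSERT INTO item (text) VALUES (?)",
    [
        ("Rising carbon dioxide levels affect crops.",),
        ("Carbon capture removes dioxide from flue gas.",),
        ("Methane emissions from wetlands.",),
    ],
)

# Equivalent of: abstract: "carbon dioxide"
needle = "carbon dioxide"
rows = conn.execute(
    "SELECT text FROM item WHERE text LIKE '%' || ? || '%'", (needle,)
).fetchall()
matches = [row[0] for row in rows]
print(matches)  # ['Rising carbon dioxide levels affect crops.']
```

Only the first row matches, because the two words must appear next to each other exactly as written; the filter does not search for the words independently.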

`YEAR` (or `YR` or `PY`)

Only available in academic and lexis projects. Filters on the publication_year of a paper or year of the publication time of an article. The comp is not optional. For example year: =2019.

`DATE` (or `PD`)

Only available in lexis projects. Filters on the date part of the publication time of an article. The comp is not optional. For example date: <2023-06-14.

`DOI` a.k.a. document object identifier

Only available in academic projects. Filters for DOI (checks if the article is in the list of DOIs). For example: DOI: 10.1145/3406522.3446034, 10.1016/j.pec.2016.03.015, 10.2196/jmir.7639

`OA` a.k.a. OpenAlex ID

Only available in academic projects. Filters for openalex_id (checks if the article is in the list of OpenAlex ids). For example: OA: W2741809807

`ID` a.k.a. item_id

Available in all projects. Filters for the item_id(s) used on the platform from the list provided. Note, this will only work for items within the project you are searching from. For example: id: 0d22fd06-245c-4028-85b2-29296095621c

`SOURCE` (or `SRC`) a.k.a. venue, newspaper

Only available in academic and lexis projects. Filters on the publication venue name of an article or name of the newspaper (or whatever source LexisNexis provided). For example SRC: "The Guardian".

`META`

Available in all projects. During import, some values are stored in a semi-structured meta-data field. You may have seen this when looking at the raw data by clicking the ? on an item card. See below for how the meta-clause works. At this point, this is not implemented.

Filter by imports

You can filter for items that were imported from a specific import. The following query will look for items that are in the first but not the second import (note the preceding ~).

import: 689d6b84-fa26-4d24-8d8c-f13a13481a45,
        ~530e46de-7d0e-4954-a787-2f666742a6a4

The logic applied by this filter may not always seem intuitive, but it is consistent. For example, if you list only one id and it is excluded, the filter returns all items in the project except those from that import. If multiple ids are listed, the "all the rest" is restricted to the ones that are not excluded. In this way, you could also query for the overlap of queries. Note that the following does not give you the overlap between imports but the set union:

import: 689d6b84-fa26-4d24-8d8c-f13a13481a45,
        530e46de-7d0e-4954-a787-2f666742a6a4
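The include/exclude behaviour can be pictured with plain sets. The following Python sketch is one reading of the description above, not the platform's implementation; `filter_by_imports`, the import names, and the item numbers are all hypothetical.

```python
def filter_by_imports(project_items, imports, included=(), excluded=()):
    """Return the item ids selected by an import filter (assumed semantics)."""
    if included:
        # Items that appear in at least one included import (set union).
        base = set().union(*(imports[name] for name in included))
    else:
        # Only exclusions were given: start from every item in the project.
        base = set(project_items)
    removed = set().union(*(imports[name] for name in excluded))
    return base - removed


# Hypothetical project with five items and two imports.
project_items = {1, 2, 3, 4, 5}
imports = {"A": {1, 2, 3}, "B": {3, 4}}

print(filter_by_imports(project_items, imports, included=["A"], excluded=["B"]))  # {1, 2}
print(filter_by_imports(project_items, imports, included=["A", "B"]))             # {1, 2, 3, 4}
print(filter_by_imports(project_items, imports, excluded=["A"]))                  # {4, 5}
```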

Filter by assigned items (`assigned-clause`)

You may want to find items that have already been assigned before. Depending on your use case, you are interested in different "logics", or "modes". In any case, this considers only assignments, not annotations. The platform technically allows annotations without assignments. Results here can and will include items that are assigned and also annotated, so this does not implement the logic for "assigned but not annotated".

The assigned-clause is defined as follows:

IS ASSIGNED
IS ASSIGNED IN assignment_scope_id [ , ... ]
IS ASSIGNED BUT NOT IN assignment_scope_id [ , ... ]
IS NOT ASSIGNED
IS NOT ASSIGNED IGNORING assignment_scope_id [ , ... ]
IS ASSIGNED WITH annotation_scheme_id
IS NOT ASSIGNED WITH annotation_scheme_id

IS ASSIGNED
Return all items that are assigned in any assignment scope to any number of users.

IS ASSIGNED IN
Return all items that are assigned to any number of users within the list of assignment scopes.

IS ASSIGNED BUT NOT IN
Return all items that are assigned to any number of users but excluding/ignoring assignments from the listed assignment scopes.

IS NOT ASSIGNED
Return all items that are not assigned anywhere.

IS NOT ASSIGNED IGNORING
Return all items that are not assigned anywhere if we pretend assignments from the listed assignment scopes do not exist.

IS ASSIGNED WITH
Return all items that are assigned to any number of users within the assignment scopes of this annotation scheme.

IS ASSIGNED BUT NOT WITH
Return all items that are assigned to any number of users but excluding/ignoring assignments from the assignment scopes of this annotation scheme.
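The scope-based modes can be sketched with a small in-memory example. The data and function names below are hypothetical and only mirror the descriptions above; the `WITH` variants are left out because they additionally need the mapping from annotation scheme to assignment scopes.

```python
# Hypothetical assignments as (item_id, assignment_scope_id) pairs.
ASSIGNMENTS = [("i1", "s1"), ("i1", "s2"), ("i2", "s2"), ("i3", "s3")]
ALL_ITEMS = {"i1", "i2", "i3", "i4"}


def is_assigned(scopes=None, but_not_in=None):
    """IS ASSIGNED, IS ASSIGNED IN ..., or IS ASSIGNED BUT NOT IN ..."""
    return {
        item
        for item, scope in ASSIGNMENTS
        if (scopes is None or scope in scopes)
        and (but_not_in is None or scope not in but_not_in)
    }


def is_not_assigned(ignoring=None):
    """IS NOT ASSIGNED or IS NOT ASSIGNED IGNORING ..."""
    # Pretend assignments from the ignored scopes do not exist.
    return ALL_ITEMS - is_assigned(but_not_in=ignoring)


print(sorted(is_assigned()))                     # ['i1', 'i2', 'i3']
print(sorted(is_assigned(scopes={"s1"})))        # ['i1']
print(sorted(is_assigned(but_not_in={"s2"})))    # ['i1', 'i3']
print(sorted(is_not_assigned()))                 # ['i4']
print(sorted(is_not_assigned(ignoring={"s2"})))  # ['i2', 'i4']
```

Under this reading, `IS ASSIGNED BUT NOT IN` and `IS NOT ASSIGNED IGNORING` with the same scopes split the project's items into two complementary sets.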

Filter by abstract length (`abstract-clause`)

The abstract-clause is defined as follows:

HAS ABSTRACT ( comp uint )
HAS NO ABSTRACT

This filter can be used if you are looking for abstracts with a particular length. In the first option, an abstract has to exist (note that this may include an empty string). Specifying the length is optional (the part in parentheses). The second option returns all items with no abstract (note that this does not include zero-length abstracts).
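The difference between a missing abstract and an empty one is easy to overlook. The sketch below models it in Python with `None` for "no abstract" and `""` for a zero-length abstract. The function names are hypothetical, and measuring the length in characters is an assumption.

```python
import operator

# The comparators listed under "Operators for numbers and dates".
COMPARATORS = {
    "=": operator.eq, "!=": operator.ne,
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
}


def has_abstract(text, comp=None, length=None):
    """HAS ABSTRACT, optionally followed by a length comparison."""
    if text is None:
        return False
    if comp is None:
        return True
    return COMPARATORS[comp](len(text), length)


def has_no_abstract(text):
    """HAS NO ABSTRACT: only items where the abstract is missing entirely."""
    return text is None


print(has_abstract(""))               # True: an empty string still counts
print(has_no_abstract(""))            # False
print(has_no_abstract(None))          # True
print(has_abstract("short", ">", 100))  # False
```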

Filter by annotated items (`annotated-clause`)

The annotated-clause is defined as follows:

HAS ANNOTATION
HAS ANNOTATION IN assignment_scope_id [ , ... ]
HAS ANNOTATION WITH annotation_scheme_id
HAS NO ANNOTATION
HAS NO ANNOTATION IN assignment_scope_id [ , ... ]
HAS NO ANNOTATION WITH annotation_scheme_id

HAS ANNOTATION
Return all items that were (manually) annotated.

HAS ANNOTATION IN
Return all items that were annotated in any of these assignment scopes.

HAS ANNOTATION WITH
Return all items that were annotated in any of the assignment scopes using this annotation scheme.

HAS NO ANNOTATION
Return all items that were never (manually) annotated.

HAS NO ANNOTATION IN
Return all items that were not annotated in any of these assignment scopes.

HAS NO ANNOTATION WITH
Return all items that were not annotated in any of the assignment scopes using this annotation scheme.

Filter by labels (`label-clause`)

The label-clause is defined as follows:

( type ) value-clause ( from-clause ) ( with-clause ) ( users-clause ) ( repeats-clause )

Clauses in parentheses are optional.

`value-clause`

This clause determines the key and value of the labels you want to filter for. There are three types of filters based on the type of label (bool, single, or multi) you are filtering:

KEY comp uint
KEY = bool
KEY LIKE dqstring
KEY set-comp uint [ , ... ]

The KEY is to be replaced with the label key which was specified in the annotation scheme. In the following example, the key for the label "Relevant" is rel.

Filtering for label type "string" and "float" is currently not implemented.

`from-clause`

FROM assignment_scope_id [ , ... ]
FROM bot_annotation_metadata_id [ , ... ]

This limits the filter to only look for labels that result from assignments in any one of these assignment scopes, bot-annotation scopes, or label resolutions.

Note, that you cannot mix bot labels, resolutions, and human labels in one label-clause. For that you would have to use multiple clauses, for example:

(
  rel=true FROM [scope_id]
  OR
  RESOLUTION rel=true FROM [resolve_id1], [resolve_id2], [resolve_id3]
)
AND 
BOT method=3

This would look for items where the resolved label marked the item as "relevant" in any of the three resolutions. Because one assignment scope may not be fully annotated and resolved yet, it also accepts any annotation as "relevant" in [scope_id]. Finally, it uses the bot annotation (for example a classifier) to keep only items where "method" is 3.

`with-clause`

WITH annotation_scheme_id

This is a shorthand for the from-clause and is equivalent to listing all assignment scope ids (or resolutions) for an annotation scheme.

`users-clause`

BY user_id [ , ... ]
BY ANY user_id [ , ... ]
BY ALL user_id [ , ... ]

This limits the search to labels by listed users. Note, that this is not available in combination with the BOT or RESOLUTION keyword.

There are two modes:
1. "ALL" listed users that annotated the item must have the specified label or
2. "ANY" of the listed users annotated this item with the specified label.

If not specified, defaults to "ANY".
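The two modes can be compared on a single item. The sketch below is one interpretation of the description above; the function name, user names, and labels are hypothetical. Treating an item that none of the listed users annotated as a non-match is an assumption.

```python
def matches_users(labels_by_user, listed_users, wanted, mode="ANY"):
    """Check one item; labels_by_user maps user_id -> label value."""
    # Only consider listed users who actually annotated this item.
    values = [labels_by_user[u] for u in listed_users if u in labels_by_user]
    if not values:
        return False  # assumption: no annotation by a listed user, no match
    if mode == "ALL":
        return all(value == wanted for value in values)
    return any(value == wanted for value in values)


# Hypothetical item: alice labelled it relevant, bob did not.
labels = {"alice": True, "bob": False}

print(matches_users(labels, ["alice", "bob"], True, mode="ANY"))    # True
print(matches_users(labels, ["alice", "bob"], True, mode="ALL"))    # False
print(matches_users(labels, ["alice", "carol"], True, mode="ALL"))  # True
```

In the last call, carol is listed but never annotated the item, so only alice's label is considered.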

`repeats-clause`

REPEATS uint [ , ... ]

Limits the filter to labels at a specific position. This only makes sense if you marked a label as "can be repeated", e.g. to indicate "primarily this", "secondary this", ... Here you can restrict the search to, for example, only the first-of-a-kind label. If this clause is not specified, the filter always looks at all repeats.

Label type (`type`)

Can be one of

USER (for labels made by humans during annotation)
BOT (for labels made by a script)
RES (for resolved labels)
RESOLVED (for resolved labels)
RESOLUTION (for resolved labels)

When left empty, defaults to USER.

Comparators (`comp` / `set-comp`)

Operators for numbers and dates (`comp`)

= Equality
< Smaller than / earlier than
<= Smaller or equal / earlier than or in/on
> Larger than / later than
>= Larger or equal / later than or in/on
!= Inequality

Operators for sets (`set-comp`)

Relevant mainly for multi-labels.

== Are both exactly equal (order ignored)? Example: [4,5,6] == [5,4,6] → yes
@> Does the first array contain the second? Example: [1,4,3] @> [3,1,3] → yes
!> Does the first array contain none of the second? Example: [1,4,3] !> [2,5,6] → yes
&& Is there some overlap? Example: [2,3,7] && [1,4,2,6] → yes
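The four set operators map naturally onto Python's built-in `set` type. The sketch below reuses the values from the list above. The function names are made up, and the code only illustrates the semantics, not how the database evaluates them.

```python
def set_equal(a, b):
    """== : same elements, order and duplicates ignored."""
    return set(a) == set(b)


def contains(a, b):
    """@> : every element of b is also in a."""
    return set(a) >= set(b)


def contains_none(a, b):
    """!> : no element of b is in a."""
    return set(a).isdisjoint(b)


def overlaps(a, b):
    """&& : at least one element is shared."""
    return not set(a).isdisjoint(b)


print(set_equal([4, 5, 6], [5, 4, 6]))      # True
print(contains([1, 4, 3], [3, 1, 3]))       # True
print(contains_none([1, 4, 3], [2, 5, 6]))  # True
print(overlaps([2, 3, 7], [1, 4, 2, 6]))    # True
```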

Data types

dqstring
Is a string that starts and ends with double-quotes, for example "example". In between the quotes, you can use all characters on the keyboard. If you need to use quotes, you need to escape them, for example "They said: \"no\"!"

item_id / import_id / ...
This is the so-called uuid assigned to something on the platform, such as an item, import, assignment scope, ... You may have seen them around already, these are the weird-looking 36-character-long strings of numbers and the letters a-f separated by four dashes. Often, the uuid is shown as part of the URL or directly in the interface, so you can just copy it without quotes. In some cases you can put NOT (or - or ~) in front of a single id to explicitly exclude it.

openalex_id
The id used for a work object without quotes. This excludes the "https://" part but includes the "W" at the beginning.

doi
The DOI as it is stored in the database without quotes. The standard for that is to exclude the "https://" part and only save the "actual" DOI.

bool
true for "yes" or false for "no" (not case-sensitive).

uint
Any positive integer.

year
Four digit number.

date
Date in the format YYYY-MM-DD without quotes.
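When building queries from a script, a few small helpers can catch malformed values before they reach the query. The following sketch is not part of the platform and all function names are hypothetical. The two URL prefixes are an assumption about what people typically paste.

```python
import re
import uuid
from datetime import date


def to_dqstring(text: str) -> str:
    """Wrap text in double quotes and escape embedded double quotes."""
    return '"' + text.replace('"', '\\"') + '"'


def is_uuid(candidate: str) -> bool:
    """True for the canonical 36-character form with four dashes."""
    try:
        return str(uuid.UUID(candidate)) == candidate.lower()
    except ValueError:
        return False


def strip_url_prefix(identifier: str) -> str:
    """Drop a leading resolver URL so only the bare DOI or OpenAlex id remains."""
    for prefix in ("https://doi.org/", "https://openalex.org/"):
        if identifier.startswith(prefix):
            return identifier[len(prefix):]
    return identifier


def is_date(candidate: str) -> bool:
    """True for a valid calendar date written as YYYY-MM-DD."""
    if not re.fullmatch(r"[0-9]{4}-[0-9]{2}-[0-9]{2}", candidate):
        return False
    try:
        date.fromisoformat(candidate)
    except ValueError:
        return False
    return True


print(to_dqstring('They said: "no"!'))                  # "They said: \"no\"!"
print(is_uuid("0d22fd06-245c-4028-85b2-29296095621c"))  # True
print(strip_url_prefix("https://openalex.org/W2741809807"))  # W2741809807
print(is_date("2023-06-14"), is_date("2023-6-14"))      # True False
```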

Practical examples

TODO

Example 1: Climate and health

TODO

Example 2: Something else

TODO

Using NQL in Python

At this point, the grammar and parser are only available in JavaScript. The backend currently works with the translated query that the frontend sends as nested json objects. There is a hidden page where you can write your NQL query and get it translated to a Python dict which you can then copy into your Python code.

Example 1: Fetch results using `Query` object

import json
from sqlalchemy.orm import Session
from nacsos_data.db import get_engine
from nacsos_data.util import NQLQuery, NQLFilterParser

# Add your project_id here
PROJECT_ID = ...  
# Copy the translated query here
NQL = NQLFilterParser.validate_python(json.loads(...))
# Maximum number of results
LIMIT = 10

query = NQLQuery(query=NQL, project_id=PROJECT_ID)
db_engine = get_engine(conf_file='/path/to/config/remote.env')
with db_engine.session() as session:  # type: Session
    n_docs = query.count(session=session)
    print(f'Query leads to {n_docs:,} hits')
    print(f'Listing the first {LIMIT}:')
    docs = query.results(session=session, limit=LIMIT)
    for doc in docs:
        print(doc)

Example 2: Translate NQL to SQL for customisation

You can already call query.count() to get the number of results, but let's try to replicate it as a proof of concept for showing how you can use the generated SQLAlchemy query.

import json
from sqlalchemy import func, distinct
from sqlalchemy.orm import Session
from nacsos_data.db import get_engine
from nacsos_data.util import nql_to_sql, NQLFilterParser

# Add your project_id here
PROJECT_ID = ...  
# Copy the translated query here
NQL = NQLFilterParser.validate_python(json.loads(...))
# Maximum number of results
LIMIT = 10

db_engine = get_engine(conf_file='/path/to/config/remote.env')
with db_engine.session() as session:  # type: Session
    stmt_nql = nql_to_sql(query=NQL, project_id=PROJECT_ID)
    # We need to create a subquery or CTE to add things to the query
    stmt_sub = stmt_nql.subquery()
    stmt_cnt = func.count(distinct(stmt_sub.c.item_id))
    cnt = session.execute(stmt_cnt).scalar()
    print(f'Found {cnt:,} items for this query.')
