# Importing data
In general, Items in the database should always be unique, fully populated (all fields set) and "clean".
We consider the case for Tweets to be trivial, so this article focuses mainly on AcademicItems.
## Database representation
From the perspective of the database, there are five relevant tables:
- `Item` -> `item`: The main reference for that item; mainly contains `text` (e.g. the abstract).
- `AcademicItem` -> `academic_item`: Contains additional meta-data specific to academic items. Consult the respective models for how to populate the fields.
- `AcademicItemVariant` -> `academic_item_variant`: Used to keep track of alternative values in case of duplicates.
- `Import` -> `import`: Context in which items are ingested into the platform. Should have a descriptive title (e.g. source and date) and the query in the description for more details.
- `m2m_import_item_table` -> `m2m_import_item`: Keeps track of the many-to-many relation between items and imports. This way, we keep only one item but can remember that it came from multiple sources.
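To make the relations concrete, here is a rough sketch of the five tables as plain dataclasses. The field lists are heavily abbreviated and purely illustrative (they only mirror a few of the columns shown in the queries further below); consult the actual nacsos_data models for the authoritative definitions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Item:
    item_id: str                # primary key, shared with academic_item
    text: Optional[str] = None  # e.g. the abstract


@dataclass
class AcademicItem:
    item_id: str                # same id as the Item row
    title: Optional[str] = None
    doi: Optional[str] = None
    publication_year: Optional[int] = None


@dataclass
class AcademicItemVariant:
    item_variant_id: str
    item_id: str                # points at the consolidated item
    import_id: str              # the import that contributed this variant
    publication_year: Optional[int] = None


@dataclass
class Import:
    import_id: str
    name: str                   # descriptive title, e.g. source and date
    description: str            # e.g. the query used


@dataclass
class M2MImportItem:
    import_id: str              # one item can belong to many imports...
    item_id: str                # ...and one import contains many items
```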
## Utility functions
Properly populating all those relations can be tedious.
To this end, nacsos_data provides several utility functions that you can use to ingest data into a project.
### Reading and transforming data
Most notably, there are utils for reading different data formats and producing AcademicItemModels:
- `nacsos_data.util.academic.wos.read_wos_file()` for reading a RIS-style Web of Science export.
- `nacsos_data.util.academic.scopus.read_scopus_file()` for reading a CSV export from Scopus (all fields except references!).
- `nacsos_data.util.academic.openalex.download_openalex_query()` to run a query on OpenAlex (local solr instance) and store the result in a jsonl file.
- `nacsos_data.util.academic.openalex.generate_items_from_openalex()` to go through that file and yield `AcademicItem`s.
- `nacsos_data.util.academic.nacsos_legacy.fetch_nacsos_legacy_doc()` to read one document from NACSOS-legacy and convert it to the NACSOS2 format.
- `nacsos_data.util.academic.nacsos_legacy.read_nacsos1_annotations()` for reading all DocOwnerships and DocUserCats related to a Query from NACSOS-legacy (in proper order) and translating them to a set of `Assignment`s and `Annotation`s.
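The two OpenAlex helpers follow a common streaming pattern: the download step writes one JSON record per line, and the generator parses the file back lazily, so even large exports never have to fit into memory at once. A minimal, self-contained sketch of that pattern (not the actual nacsos_data implementation):

```python
import json
from typing import Iterator


def write_jsonl(path: str, records: list) -> None:
    # One JSON object per line, as a query-download step might produce.
    with open(path, 'w') as f:
        for record in records:
            f.write(json.dumps(record) + '\n')


def generate_items(path: str) -> Iterator[dict]:
    # Lazily yield one parsed record at a time instead of loading the full file.
    with open(path, 'r') as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)
```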
### Ingestion helpers
- `nacsos_data.util.academic.duplicate.find_duplicates()` finds candidates for duplicates of the given item (see below for details).
- `nacsos_data.util.academic.duplicate._are_actually_duplicate()` double-checks candidates to clear false positives (see below for details).
- `nacsos_data.util.academic.duplicate.fuse_items()` fuses two items that are determined to be duplicates and returns a new item with consolidated values, for example with respectively missing fields populated or the author list extended.
- `nacsos_data.util.academic.importer.duplicate_insertion()` handles the proper updates and insertion of variants, given the `item_id` of an existing item and a new item that is deemed to be its duplicate.
- `nacsos_data.util.academic.importer.import_academic_items()` is pretty much a one-stop shop for all of the above and more. Just give it a list of items and a project, and it will handle everything.
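To make the idea behind `fuse_items()` concrete, here is a hypothetical sketch operating on plain dictionaries (the real function works on `AcademicItemModel`s and has many more rules): scalar fields fall back to whichever record has a value, and list fields such as authors or keywords are merged.

```python
def fuse(a: dict, b: dict) -> dict:
    """Consolidate two duplicate records into one (illustrative only)."""
    fused = {}
    for key in a.keys() | b.keys():
        value_a, value_b = a.get(key), b.get(key)
        if isinstance(value_a, list) or isinstance(value_b, list):
            # Extend list fields (e.g. authors, keywords) without duplicating entries.
            merged = list(value_a or [])
            merged += [v for v in (value_b or []) if v not in merged]
            fused[key] = merged
        else:
            # Populate respectively missing fields.
            fused[key] = value_a if value_a is not None else value_b
    return fused
```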
## Conventions of handling duplicates
Please note that the most recent algorithm for detecting duplicates is probably in this file.
In general, we aim to keep duplicates out of the platform during import.
Cleaning up all references after the fact can be very challenging and error-prone.
That said, if you ever end up in the situation where you have to merge two items together, a good start might be to look at nacsos_data.util.duplicate.update_references().
It will try to update all references in the database, but keep in mind that there might be other references in some meta-data fields that are not covered.
Furthermore, all external references you might already have would also have to be updated.
Hence, you should pay attention during import so that only duplicate-free and clean data ends up in the database in the first place.
There are some functions to help with that (see above). The general concept is as follows:
- Get a clean `AcademicItemModel` from your data source (see helpers above)
- Fetch potential candidates for duplicates (e.g. based on title-slug or any of the IDs)
- Determine which of the candidates are "real" duplicates (if everything was done right before, this should only be one)
- If no duplicates are found, insert a new item (and respective m2m references)
- If duplicates are found, create a variant entry
    - If this is the first time we found a duplicate for this item, create a variant off of the original entry
    - Check all values of the new item against existing variants and drop all values that already exist
    - Insert the variant into the respective table, so that for each `item_id`, the titles, ids, etc. appear only once
    - Update the values in the items table (e.g. to populate previously missing fields)
This list roughly describes how the nacsos_data library implements the procedure.
For details and the most up-to-date approach, check the code.
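As a hypothetical end-to-end sketch of this procedure, the following uses plain dictionaries keyed by title slug as a stand-in for the real database tables (scalar fields only; the actual implementation in `import_academic_items()` also handles lists, IDs, and much more):

```python
items: dict = {}     # title_slug -> consolidated item
variants: dict = {}  # title_slug -> list of variant rows


def slugify(title: str) -> str:
    # A crude title slug: lower-cased letters only.
    return ''.join(c for c in title.lower() if c.isalpha())


def ingest(new: dict) -> None:
    slug = slugify(new['title'])
    if slug not in items:
        # No duplicate found: insert as a new item.
        items[slug] = dict(new)
        return
    existing = items[slug]
    if slug not in variants:
        # First duplicate for this item: create a variant off of the original entry.
        variants[slug] = [dict(existing)]
    # Keep only values that don't already exist in any variant (scalar fields only).
    seen = {k: {v.get(k) for v in variants[slug]} for k in new}
    variant = {k: v for k, v in new.items()
               if v is not None and v not in seen[k]}
    if variant:
        variants[slug].append(variant)
    # Update the consolidated item, e.g. populate previously missing fields.
    for k, v in new.items():
        if existing.get(k) is None:
            existing[k] = v
```

The test values below reuse the "Who Pays a Price on Carbon?" example from the illustration further down: the second ingest adds a variant and fills the missing DOI without creating a second item.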
## Deduplication on the platform
The default import on the platform uses the following algorithm and rules to identify duplicates.
Basic philosophy: duplicate candidates found via title + abstract matching are always treated as duplicates, even if other rules (e.g. different authors or DOI) would disagree. Only drastically simplified texts are compared: everything is lower-cased, all characters but A-Z and spaces are removed, only texts with more than 300 characters are considered, and 95% token overlap is required. Although this may seem too "aggressive" for some applications, such items are essentially the same paper for any topic model, classifier, or human annotator. Also, we keep track of "variants" of an item, so alternative values for different fields can be considered in later analyses, for example the publication year in a time series.
Beyond abstract matching: if the abstract does not exist and the title + abstract text is not sufficiently long, we try other rules to identify duplicates. First, we look for candidates based on title slug, DOI, WebOfScience-ID, Scopus-ID, OpenAlex-ID, Pubmed-ID, Dimensions-ID, and SemanticScholar-ID. We then reject candidates whose publication years differ by more than 2 (oftentimes, papers with the same title are published several years apart) and also do an additional check of the title (sometimes, titles differ due to version/part numbers or the year of a report, which we would consider false-positive matches). At this point, we do not consider any deeper matching based on authors, venues, or the like. Our main goal is to find duplicates with respect to content, not necessarily slight variations, versions (preprint vs published), or database inconsistencies.
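As a rough illustration of these rules (a simplified sketch, not the platform's actual implementation): the thresholds mirror the numbers above, and the 95% criterion is read here as a Jaccard-style token overlap, which is one possible interpretation.

```python
import re

MIN_LENGTH = 300    # only drastically simplified texts longer than this are compared
MIN_OVERLAP = 0.95  # required token overlap


def simplify(text: str) -> str:
    # Lower-case and drop every character except a-z and spaces.
    return re.sub(r'[^a-z ]', '', text.lower())


def texts_match(text_a: str, text_b: str) -> bool:
    a, b = simplify(text_a), simplify(text_b)
    if len(a) <= MIN_LENGTH or len(b) <= MIN_LENGTH:
        return False  # too short: fall back to the ID/title-slug rules
    tokens_a, tokens_b = set(a.split()), set(b.split())
    overlap = len(tokens_a & tokens_b) / len(tokens_a | tokens_b)
    return overlap >= MIN_OVERLAP


def years_plausible(year_a: int, year_b: int, max_diff: int = 2) -> bool:
    # Candidates whose publication years differ by more than two are rejected.
    return abs(year_a - year_b) <= max_diff
```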
## Examples
It might help to demonstrate how easy it actually is to import things using the correct functions.
### Importing OpenAlex
```python
import logging

from nacsos_data.util.academic.importer import import_academic_items
from nacsos_data.util.academic.openalex import generate_items_from_openalex
from nacsos_data.db import get_engine_async

PROJECT_ID = '[FILL PROJECT_ID HERE]'
OA_EXPORT_FILE = '/path/to/exported/data.jsonl'
LOG_FILE = '/path/to/logfile.txt'  # set to None to write to std-out

# The function will create an Import for the following user and name/description.
IMPORT_USER = '[UUID of a user]'
IMPORT_NAME = '[Descriptive name for the import]'
IMPORT_DESC = '''
Data source: OpenAlex\n
Dump version: 2023-05-02\n
Query: "(carbon OR emission? OR CO2 OR GHG) AND (price? OR pricing OR tax OR taxes)"
'''

# Set to False to actually write to the database; otherwise this will simulate
# the import as best as possible.
DRY_RUN = True

logging.basicConfig(
    format='%(asctime)s [%(levelname)s] %(name)s: %(message)s',
    level=logging.DEBUG,  # set log level as you like
    filename=LOG_FILE
)

db_engine_async = get_engine_async(conf_file='/path/to/config.env')

# This next line is almost the only thing you have to change for different sources.
academic_items = generate_items_from_openalex(openalex_export=OA_EXPORT_FILE, project_id=PROJECT_ID)

# This might be needed, since pydantic sometimes doesn't json-serialise recursively, so we help it along.
for item in academic_items:
    if item.authors is not None:
        item.authors = [a.dict() for a in item.authors]

# Read the docstring of this function to learn what all the parameters mean!
import_id, item_ids = await import_academic_items(
    items=academic_items,
    project_id=PROJECT_ID,
    db_engine=db_engine_async,
    import_name=IMPORT_NAME,
    description=IMPORT_DESC,
    user_id=IMPORT_USER,
    import_id=None,
    dry_run=DRY_RUN,
    trust_new_authors=False,
    trust_new_keywords=False,
)
```
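Note that `import_academic_items()` is awaited, so the snippet above assumes it runs in an async context (e.g. a Jupyter notebook or the `python -m asyncio` REPL). As a standalone script, you would wrap the logic in an async entry point along these lines (sketch only; the body is elided):

```python
import asyncio


async def main() -> None:
    # ... set up the engine and items as in the snippet above, then:
    # import_id, item_ids = await import_academic_items(...)
    pass


if __name__ == '__main__':
    asyncio.run(main())
```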
### Illustration
Let's have a look at the item with id '040737f0-55b2-4556-bbb2-0c7a5140ac15':
```sql
SELECT ai.item_id,
       ai.doi,
       ai.wos_id,
       ai.scopus_id,
       ai.openalex_id,
       ai.s2_id,
       ai.title,
       ai.title_slug,
       ai.publication_year,
       ai.source,
       ai.keywords,
       substring(ai.authors::text for 140) as authors,
       ai.pubmed_id,
       ai.project_id,
       substring(i.text for 100) as abstract,
       i.type
FROM academic_item ai
         LEFT JOIN item i on i.item_id = ai.item_id
WHERE ai.item_id = '040737f0-55b2-4556-bbb2-0c7a5140ac15';
```
| item_id | doi | wos_id | scopus_id | openalex_id | s2_id | title | title_slug | publication_year | source | keywords | authors | pubmed_id | project_id | abstract | type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 040737f0-55b2-4556-bbb2-0c7a5140ac15 | 10.1007/s10640-010-9345-x | WOS:000278401900007 | 2-s2.0-77953480306 | W2148165261 | null | Who Pays a Price on Carbon? | whopaysapriceoncarbon | 2010 | ENVIRONMENTAL & RESOURCE ECONOMICS | ["Distributional incidence", "Carbon tax", "Tradable permits", "LIFETIME INCIDENCE", "TAX", "EMISSIONS", "POLICIES"] | [{"name": "Grainger, Corbett A.", "email": null, "orcid": null, "s2_id": null, "scopus_id": null, "openalex_id": null, "affiliations": [{"na | null | 748e739d-f011-44de-9cb0-c9cb4bb18d08 | We use the 2003 Consumer Expenditure Survey and emissions estimates from an input-output model to es | academic |
... and its variants. As you can see, most fields are null to save space: only values that differ from what is already stored are kept, such as the two alternative publication years. For keywords, authors, and meta, we don't drop redundant values, as that would involve too many rules and confusing conflicts.
```sql
SELECT item_variant_id,
       item_id,
       import_id,
       doi,
       wos_id,
       scopus_id,
       openalex_id,
       s2_id,
       pubmed_id,
       title,
       publication_year,
       source,
       keywords,
       substring(authors::text for 140) as authors,
       substring(abstract for 140) as abstract,
       substring(meta::text for 140) as meta
FROM academic_item_variant
WHERE item_id = '040737f0-55b2-4556-bbb2-0c7a5140ac15';
```
| item_variant_id | item_id | import_id | doi | wos_id | scopus_id | openalex_id | s2_id | pubmed_id | title | publication_year | source | keywords | authors | abstract | meta |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 287c06bf-fa58-4562-9a2a-c1d8e716a173 | 040737f0-55b2-4556-bbb2-0c7a5140ac15 | ae9eb81c-7f20-41c1-8d97-94a876d599e5 | null | null | null | W3121621695 | null | null | Who Pays a Price on Carbon | 2009 | null | null | null | We use the 2003 Consumer Expenditure Survey and emissions estimates from an input-output model to estimate the incidence of a price on carbo | {"type": "posted-content", "cited_by_count": 6} |
| eab623e8-66bf-4668-9a7f-344e34a35753 | 040737f0-55b2-4556-bbb2-0c7a5140ac15 | ae9eb81c-7f20-41c1-8d97-94a876d599e5 | null | null | null | W3023054246 | null | null | null | null | null | null | null | null | {"type": "posted-content", "cited_by_count": 1} |
| d78c0da5-801e-4f59-a259-152dc9801f91 | 040737f0-55b2-4556-bbb2-0c7a5140ac15 | ae9eb81c-7f20-41c1-8d97-94a876d599e5 | 10.1007/s10640-010-9345-x | null | null | W2148165261 | null | null | Who Pays a Price on Carbon? | 2010 | null | null | null | null | {"type": "journal-article", "cited_by_count": 146} |
| 8312cbb0-25fd-462b-b0e1-e3a42aaabfa6 | 040737f0-55b2-4556-bbb2-0c7a5140ac15 | 586c1818-e747-42b7-b078-4e745de20c39 | null | WOS:000278401900007 | 2-s2.0-77953480306 | null | null | null | null | null | ENVIRONMENTAL & RESOURCE ECONOMICS | ["Distributional incidence", "Carbon tax", "Tradable permits", "LIFETIME INCIDENCE", "TAX", "EMISSIONS", "POLICIES"] | [{"name": "Grainger, Corbett A.", "email": null, "orcid": null, "s2_id": null, "scopus_id": null, "openalex_id": null, "affiliations": [{"na | null | {"nacsos1": {"vol": "46", "date": "JUL", "issn": "0924-6460", "lang": "English", "subj": "Business & Economics; Environmental Sciences & Eco |
| 41803df6-21ac-471b-aea8-868ea1f29857 | 040737f0-55b2-4556-bbb2-0c7a5140ac15 | 7d0364ca-bb44-48ef-968d-cbe2700fab2c | null | null | null | null | null | null | null | null | null | ["Distributional incidence", "Carbon tax", "Tradable permits", "LIFETIME INCIDENCE", "TAX", "EMISSIONS", "POLICIES"] | [{"name": "Grainger, Corbett A.", "email": null, "orcid": null, "s2_id": null, "scopus_id": null, "openalex_id": null, "affiliations": [{"na | null | {"nacsos1": {"vol": "46", "date": "JUL", "issn": "0924-6460", "lang": "English", "subj": "Business & Economics; Environmental Sciences & Eco |
| af04e84b-a09c-47be-bf00-591eec2b98e3 | 040737f0-55b2-4556-bbb2-0c7a5140ac15 | 2188a241-1aa7-4b7d-9218-ba88c8497866 | null | null | null | null | null | null | null | null | null | ["Distributional incidence", "Carbon tax", "Tradable permits", "LIFETIME INCIDENCE", "TAX", "EMISSIONS", "POLICIES"] | [{"name": "Grainger, Corbett A.", "email": null, "orcid": null, "s2_id": null, "scopus_id": null, "openalex_id": null, "affiliations": [{"na | null | {"nacsos1": {"vol": "46", "date": "JUL", "issn": "0924-6460", "lang": "English", "subj": "Business & Economics; Environmental Sciences & Eco |
For completeness, via the m2m table...
```sql
SELECT *
FROM m2m_import_item
WHERE m2m_import_item.item_id = '040737f0-55b2-4556-bbb2-0c7a5140ac15';
```
| import_id | item_id | time_created | type |
|---|---|---|---|
| ae9eb81c-7f20-41c1-8d97-94a876d599e5 | 040737f0-55b2-4556-bbb2-0c7a5140ac15 | 2023-05-12 16:53:53.381928 +00:00 | explicit |
| 586c1818-e747-42b7-b078-4e745de20c39 | 040737f0-55b2-4556-bbb2-0c7a5140ac15 | 2023-05-12 18:20:52.376814 +00:00 | explicit |
| 7d0364ca-bb44-48ef-968d-cbe2700fab2c | 040737f0-55b2-4556-bbb2-0c7a5140ac15 | 2023-05-12 18:25:50.147696 +00:00 | explicit |
| 2188a241-1aa7-4b7d-9218-ba88c8497866 | 040737f0-55b2-4556-bbb2-0c7a5140ac15 | 2023-05-12 18:29:15.864089 +00:00 | explicit |
... we can trace it to four distinct imports:
```sql
SELECT import.import_id,
       import.user_id,
       import.project_id,
       import.name,
       substring(import.description for 140) as description,
       import.type,
       import.time_created,
       import.time_started,
       import.time_finished,
       import.config,
       import.pipeline_task_id
FROM import,
     m2m_import_item
WHERE import.import_id = m2m_import_item.import_id
  AND m2m_import_item.item_id = '040737f0-55b2-4556-bbb2-0c7a5140ac15';
```
| import_id | user_id | project_id | name | description | type | time_created | time_started | time_finished | config | pipeline_task_id |
|---|---|---|---|---|---|---|---|---|---|---|
| ae9eb81c-7f20-41c1-8d97-94a876d599e5 | 088011c5-546b-4c4a-b099-54ab59a7a99a | 748e739d-f011-44de-9cb0-c9cb4bb18d08 | Initial OpenAlex import | OpenAlex query: ({!surround v="(carbon OR emission? OR CO2 OR GHG) 2N (price? OR pricing OR tax OR taxes OR taxation OR trading OR trade OR | script | 2023-05-12 16:51:25.399264 +00:00 | 2023-05-12 16:51:25.421586 +00:00 | 2023-05-12 17:00:31.865510 +00:00 | null | null |
| 586c1818-e747-42b7-b078-4e745de20c39 | 088011c5-546b-4c4a-b099-54ab59a7a99a | 748e739d-f011-44de-9cb0-c9cb4bb18d08 | NACSOS-legacy (Scopus) | TITLE-ABS-KEY((((carbon OR emission OR CO2 OR GHG OR greenhouse-gas) W/2 (pric* OR tax* OR trading OR trade OR tradable OR levy OR levies OR | script | 2023-05-12 16:17:58.160652 +00:00 | 2023-05-12 16:17:58.170225 +00:00 | 2023-05-12 16:22:04.009238 +00:00 | null | null |
| 7d0364ca-bb44-48ef-968d-cbe2700fab2c | 088011c5-546b-4c4a-b099-54ab59a7a99a | 748e739d-f011-44de-9cb0-c9cb4bb18d08 | NACSOS-legacy (WoS) | TS=((((carbon OR emission$ OR CO2 OR GHG OR "greenhouse gas" OR "greenhouse gases") NEAR/1 (pric* OR tax* OR trading OR trade OR tradable OR | script | 2023-05-12 16:22:57.841787 +00:00 | 2023-05-12 16:22:57.846607 +00:00 | 2023-05-12 16:27:10.685794 +00:00 | null | null |
| 2188a241-1aa7-4b7d-9218-ba88c8497866 | 088011c5-546b-4c4a-b099-54ab59a7a99a | 748e739d-f011-44de-9cb0-c9cb4bb18d08 | NACSOS-legacy (RePEc) | ('carbon price' OR 'emission price' OR 'emissions price' OR 'CO2 price' OR 'GHG price' OR 'greenhouse gas price' OR 'greenhouse gases price' | script | 2023-05-12 16:27:39.260514 +00:00 | 2023-05-12 16:27:39.268536 +00:00 | 2023-05-12 16:29:33.003618 +00:00 | null | null |