
Importing data⚓︎

In general, Items in the database should always be unique, fully populated (all fields set) and "clean". We consider the case for Tweets to be trivial, so this article focuses mainly on AcademicItems.

Database representation⚓︎

From the perspective of the database, there are five relevant tables:

  • Item -> item: The main reference for an item; primarily holds the text (e.g. the abstract).
  • AcademicItem -> academic_item: Contains additional meta-data specific to academic items. Consult the respective models for how to populate the fields.
  • AcademicItemVariant -> academic_item_variant: Used to keep track of alternative values in case of duplicates.
  • Import -> import: Context in which items are ingested into the platform. Should have a descriptive title (e.g. source and date) and the query in the description for more details.
  • m2m_import_item_table -> m2m_import_item: Keeps track of the many-to-many relation of items and the imports. In this way, we keep only one item but can remember that it came from multiple sources.
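
The relations between these five tables can be sketched with plain dataclasses. This is only an illustration of the schema described above; the field selections are abbreviated and do not mirror the actual nacsos_data models:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Item:                    # item: main reference, holds the text
    item_id: str
    text: Optional[str] = None

@dataclass
class AcademicItem:            # academic_item: 1:1 with item, extra meta-data
    item_id: str
    title: Optional[str] = None
    doi: Optional[str] = None

@dataclass
class AcademicItemVariant:     # academic_item_variant: alternative values
    item_variant_id: str
    item_id: str
    import_id: str

@dataclass
class Import:                  # import: ingestion context
    import_id: str
    name: str

@dataclass
class M2MImportItem:           # m2m_import_item: links items to imports
    import_id: str
    item_id: str

# One item ingested via two imports: a single Item row, two m2m rows.
links = [M2MImportItem('imp-1', 'item-1'), M2MImportItem('imp-2', 'item-1')]
assert len({link.item_id for link in links}) == 1
```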

Utility functions⚓︎

Properly populating all those relations can be tedious. To this end, nacsos_data provides several utility functions that you can use to ingest data into a project.

Reading and transforming data⚓︎

Most notably, there are utilities for reading different data formats and producing AcademicItemModels (see the modules under nacsos_data.util.academic).

Ingestion helpers⚓︎

Conventions for handling duplicates⚓︎

Please note that the most recent algorithm for detecting duplicates is probably in this file. In general, we aim to keep duplicates out of the platform during import; cleaning up all references after the fact is challenging and error-prone. That said, if you ever do have to merge two items, a good starting point is nacsos_data.util.duplicate.update_references(). It will try to update all references in the database, but keep in mind that there may be additional references in meta-data fields it does not cover, and any external references you already maintain would also have to be updated. Hence, pay attention during import so that only duplicate-free, clean data ends up in the database in the first place.

There are some functions to help with that (see above). The general concept is as follows:

  1. Get a clean AcademicItemModel from your data source (see helpers above)
  2. Fetch potential duplicate candidates (e.g. based on the title slug or any of the IDs)
  3. Determine which of the candidates are "real" duplicates (if everything was done right before, there should be at most one)
    • If no duplicates are found, insert a new item (and the respective m2m references)
    • If duplicates are found, create a variant entry
      • If this is the first duplicate found for this item, create a variant off of the original entry
      • Check all values of the new item against existing variants and drop all values that already exist
      • Insert the variant into the respective table, so that for each item_id, the titles, IDs, etc. appear only once
      • Update the values in the items table (e.g. to populate previously missing fields)

This list roughly describes how the nacsos_data library implements the procedure. For details and the most up-to-date approach, check the code.
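
In very condensed form, the steps above can be sketched as follows. This is not the nacsos_data implementation; the helper names are hypothetical, in-memory dicts and lists stand in for the database tables, and only scalar fields and a title-slug lookup are considered:

```python
from typing import Any

items: dict[str, dict[str, Any]] = {}    # stand-in for item/academic_item
variants: list[dict[str, Any]] = []      # stand-in for academic_item_variant
m2m: list[tuple[str, str]] = []          # stand-in for m2m_import_item

def find_duplicates(new: dict[str, Any]) -> list[str]:
    # step 2: fetch candidates, here only via the title slug
    return [iid for iid, it in items.items()
            if it.get('title_slug') == new.get('title_slug')]

def ingest(new: dict[str, Any], import_id: str) -> str:
    dupes = find_duplicates(new)
    if not dupes:
        # no duplicate found: insert a new item and the m2m reference
        items[new['item_id']] = new
        m2m.append((import_id, new['item_id']))
        return new['item_id']
    item_id = dupes[0]  # assuming at most one "real" duplicate
    existing = items[item_id]
    # first duplicate for this item: create a variant off the original entry
    if not any(v['item_id'] == item_id for v in variants):
        variants.append(dict(existing))
    # drop all (scalar) values that already exist in a variant
    known = {(k, v[k]) for v in variants if v['item_id'] == item_id for k in v}
    variant = {k: val for k, val in new.items() if (k, val) not in known}
    variant['item_id'] = item_id
    variants.append(variant)
    # populate previously missing fields on the main item
    for k, val in new.items():
        if existing.get(k) is None:
            existing[k] = val
    m2m.append((import_id, item_id))
    return item_id
```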

Deduplication on the platform⚓︎

The default import on the platform uses the following algorithm and rules to identify duplicates.

Basic philosophy: Duplicate candidates found based on title + abstract matching are always treated as duplicates, even if other rules (e.g. different authors or DOI) would disagree. The comparison uses drastically simplified texts (everything is lower-cased and all characters except a-z and spaces are removed); only texts with more than 300 characters are considered, and 95% token overlap is required. This may seem too "aggressive" for some applications, but such records are essentially the same paper for any topic model, classifier, or human annotator. Also, we keep track of "variants" of an item, so alternative values for different fields can be considered in later analyses, for example the publication year in a time series.

Beyond abstract matching: If an abstract does not exist, or the title + abstract text is not sufficiently long, we fall back to other rules to identify duplicates. First, we look for candidates based on the title slug, DOI, WebOfScience-ID, Scopus-ID, OpenAlex-ID, Pubmed-ID, Dimensions-ID, and SemanticScholar-ID. We then reject candidates whose publication years differ by more than two (papers with the same title are often published several years apart) and additionally re-check the title (titles sometimes differ by version/part numbers or the year of a report, which we would consider false-positive matches). At this point, we do not consider any deeper matching based on authors, venues, or the like. Our main goal is to find duplicates with respect to content, not to resolve slight variations, versions (preprint vs. published), or database inconsistencies.
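
Very roughly, those fallback rules could be sketched as below. The helper is hypothetical and the field names are assumptions loosely following the tables shown further down; the title re-check for version/part numbers is omitted:

```python
import re

ID_FIELDS = ('doi', 'wos_id', 'scopus_id', 'openalex_id',
             'pubmed_id', 'dimensions_id', 's2_id')

def title_slug(title: str) -> str:
    # e.g. 'Who Pays a Price on Carbon?' -> 'whopaysapriceoncarbon'
    return re.sub(r'[^a-z]', '', title.lower())

def could_be_duplicate(a: dict, b: dict, max_year_diff: int = 2) -> bool:
    # candidate if the title slugs match or any ID is shared
    slug_match = title_slug(a.get('title') or '') == title_slug(b.get('title') or '') != ''
    id_match = any(a.get(f) and a.get(f) == b.get(f) for f in ID_FIELDS)
    if not (slug_match or id_match):
        return False
    # reject candidates published too far apart (often same-titled, different papers)
    ya, yb = a.get('publication_year'), b.get('publication_year')
    return not (ya and yb and abs(ya - yb) > max_year_diff)
```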

Examples⚓︎

It might help to demonstrate how easy it actually is to import things using the correct functions.

Importing OpenAlex⚓︎

import logging
from nacsos_data.util.academic.importer import import_academic_items
from nacsos_data.util.academic.openalex import generate_items_from_openalex
from nacsos_data.db import get_engine_async

PROJECT_ID = '[FILL PROJECT_ID HERE]'
OA_EXPORT_FILE = '/path/to/exported/data.jsonl'
LOG_FILE = '/path/to/logfile.txt'  # set to None to write to std-out

# The function will create an Import for the following user and name/description.
IMPORT_USER = '[UUID of a user]'
IMPORT_NAME = '[Descriptive name for the import]'
IMPORT_DESC = '''
Data source: OpenAlex\n
Dump version: 2023-05-02\n
Query: "(carbon OR emission? OR CO2 OR GHG) AND (price? OR pricing OR tax OR taxes)"
'''

# Set to False to actually write to the database; otherwise this will only
# simulate the import as best as possible.
DRY_RUN = True

logging.basicConfig(
    format='%(asctime)s [%(levelname)s] %(name)s: %(message)s',
    level=logging.DEBUG,  # set log level as you like
    filename=LOG_FILE
)

db_engine_async = get_engine_async(conf_file='/path/to/config.env')

# This next line is almost the only thing you have to change for different sources
academic_items = generate_items_from_openalex(openalex_export=OA_EXPORT_FILE, project_id=PROJECT_ID)

# This may be necessary since pydantic sometimes doesn't JSON-serialise
# nested models recursively, so we help it along.
for item in academic_items:
    if item.authors is not None:
        item.authors = [a.dict() for a in item.authors]

# Read the docstring of this function to learn what all the parameters mean!
# Note: `await` requires an async context; wrap the call in an async function
# and run it with asyncio.run(), or execute this in a notebook.
import_id, item_ids = await import_academic_items(
    items=academic_items,
    project_id=PROJECT_ID,
    db_engine=db_engine_async,
    import_name=IMPORT_NAME,
    description=IMPORT_DESC,
    user_id=IMPORT_USER,
    import_id=None,
    dry_run=DRY_RUN,
    trust_new_authors=False,
    trust_new_keywords=False,
)

Illustration⚓︎

Let's have a look at the item with id '040737f0-55b2-4556-bbb2-0c7a5140ac15':

SELECT ai.item_id,
       ai.doi,
       ai.wos_id,
       ai.scopus_id,
       ai.openalex_id,
       ai.s2_id,
       ai.title,
       ai.title_slug,
       ai.publication_year,
       ai.source,
       ai.keywords,
       substring(ai.authors::text for 140) as authors,
       ai.pubmed_id,
       ai.project_id,
       substring(i.text for 100)           as abstract,
       i.type
FROM academic_item ai
         LEFT JOIN item i on i.item_id = ai.item_id
WHERE ai.item_id = '040737f0-55b2-4556-bbb2-0c7a5140ac15';
item_id:          040737f0-55b2-4556-bbb2-0c7a5140ac15
doi:              10.1007/s10640-010-9345-x
wos_id:           WOS:000278401900007
scopus_id:        2-s2.0-77953480306
openalex_id:      W2148165261
s2_id:            null
title:            Who Pays a Price on Carbon?
title_slug:       whopaysapriceoncarbon
publication_year: 2010
source:           ENVIRONMENTAL & RESOURCE ECONOMICS
keywords:         ["Distributional incidence", "Carbon tax", "Tradable permits", "LIFETIME INCIDENCE", "TAX", "EMISSIONS", "POLICIES"]
authors:          [{"name": "Grainger, Corbett A.", "email": null, "orcid": null, "s2_id": null, "scopus_id": null, "openalex_id": null, "affiliations": [{"na
pubmed_id:        null
project_id:       748e739d-f011-44de-9cb0-c9cb4bb18d08
abstract:         We use the 2003 Consumer Expenditure Survey and emissions estimates from an input-output model to es
type:             academic

... and its variants. As you can see, most fields are null to save space: only differing values are stored, such as the two alternative publication years. For keywords, authors, and meta, we don't drop extra values, as there would be too many rules and confusing conflicts involved.

SELECT item_variant_id,
       item_id,
       import_id,
       doi,
       wos_id,
       scopus_id,
       openalex_id,
       s2_id,
       pubmed_id,
       title,
       publication_year,
       source,
       keywords,
       substring(authors::text for 140) as authors,
       substring(abstract for 140)      as abstract,
       substring(meta::text for 140)    as meta
FROM academic_item_variant
WHERE item_id = '040737f0-55b2-4556-bbb2-0c7a5140ac15';
All variants belong to item_id = 040737f0-55b2-4556-bbb2-0c7a5140ac15; null fields are omitted below.

item_variant_id 287c06bf-fa58-4562-9a2a-c1d8e716a173 (import ae9eb81c-7f20-41c1-8d97-94a876d599e5):
  openalex_id: W3121621695
  title: Who Pays a Price on Carbon
  publication_year: 2009
  abstract: We use the 2003 Consumer Expenditure Survey and emissions estimates from an input-output model to estimate the incidence of a price on carbo
  meta: {"type": "posted-content", "cited_by_count": 6}

item_variant_id eab623e8-66bf-4668-9a7f-344e34a35753 (import ae9eb81c-7f20-41c1-8d97-94a876d599e5):
  openalex_id: W3023054246
  meta: {"type": "posted-content", "cited_by_count": 1}

item_variant_id d78c0da5-801e-4f59-a259-152dc9801f91 (import ae9eb81c-7f20-41c1-8d97-94a876d599e5):
  doi: 10.1007/s10640-010-9345-x
  openalex_id: W2148165261
  title: Who Pays a Price on Carbon?
  publication_year: 2010
  meta: {"type": "journal-article", "cited_by_count": 146}

item_variant_id 8312cbb0-25fd-462b-b0e1-e3a42aaabfa6 (import 586c1818-e747-42b7-b078-4e745de20c39):
  wos_id: WOS:000278401900007
  scopus_id: 2-s2.0-77953480306
  source: ENVIRONMENTAL & RESOURCE ECONOMICS
  keywords: ["Distributional incidence", "Carbon tax", "Tradable permits", "LIFETIME INCIDENCE", "TAX", "EMISSIONS", "POLICIES"]
  authors: [{"name": "Grainger, Corbett A.", "email": null, "orcid": null, "s2_id": null, "scopus_id": null, "openalex_id": null, "affiliations": [{"na
  meta: {"nacsos1": {"vol": "46", "date": "JUL", "issn": "0924-6460", "lang": "English", "subj": "Business & Economics; Environmental Sciences & Eco

item_variant_id 41803df6-21ac-471b-aea8-868ea1f29857 (import 7d0364ca-bb44-48ef-968d-cbe2700fab2c):
  keywords: ["Distributional incidence", "Carbon tax", "Tradable permits", "LIFETIME INCIDENCE", "TAX", "EMISSIONS", "POLICIES"]
  authors: [{"name": "Grainger, Corbett A.", "email": null, "orcid": null, "s2_id": null, "scopus_id": null, "openalex_id": null, "affiliations": [{"na
  meta: {"nacsos1": {"vol": "46", "date": "JUL", "issn": "0924-6460", "lang": "English", "subj": "Business & Economics; Environmental Sciences & Eco

item_variant_id af04e84b-a09c-47be-bf00-591eec2b98e3 (import 2188a241-1aa7-4b7d-9218-ba88c8497866):
  keywords: ["Distributional incidence", "Carbon tax", "Tradable permits", "LIFETIME INCIDENCE", "TAX", "EMISSIONS", "POLICIES"]
  authors: [{"name": "Grainger, Corbett A.", "email": null, "orcid": null, "s2_id": null, "scopus_id": null, "openalex_id": null, "affiliations": [{"na
  meta: {"nacsos1": {"vol": "46", "date": "JUL", "issn": "0924-6460", "lang": "English", "subj": "Business & Economics; Environmental Sciences & Eco
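
The "store only differing values" behaviour visible above could be sketched like this. The helper name and field list are illustrative, not the nacsos_data API; scalar fields only, since keywords, authors, and meta are kept as-is per the note above:

```python
SCALAR_FIELDS = ('doi', 'wos_id', 'scopus_id', 'openalex_id', 's2_id',
                 'pubmed_id', 'title', 'publication_year', 'source', 'abstract')

def build_variant(new: dict, known_variants: list[dict]) -> dict:
    """Null out every scalar value already recorded in a previous variant."""
    seen = {(f, v.get(f))
            for v in known_variants for f in SCALAR_FIELDS
            if v.get(f) is not None}
    return {f: new.get(f) if (f, new.get(f)) not in seen else None
            for f in SCALAR_FIELDS}
```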

For completeness, via the m2m table...

SELECT *
FROM m2m_import_item
WHERE m2m_import_item.item_id = '040737f0-55b2-4556-bbb2-0c7a5140ac15';
import_id item_id time_created type
ae9eb81c-7f20-41c1-8d97-94a876d599e5 040737f0-55b2-4556-bbb2-0c7a5140ac15 2023-05-12 16:53:53.381928 +00:00 explicit
586c1818-e747-42b7-b078-4e745de20c39 040737f0-55b2-4556-bbb2-0c7a5140ac15 2023-05-12 18:20:52.376814 +00:00 explicit
7d0364ca-bb44-48ef-968d-cbe2700fab2c 040737f0-55b2-4556-bbb2-0c7a5140ac15 2023-05-12 18:25:50.147696 +00:00 explicit
2188a241-1aa7-4b7d-9218-ba88c8497866 040737f0-55b2-4556-bbb2-0c7a5140ac15 2023-05-12 18:29:15.864089 +00:00 explicit

... we can trace it to four distinct imports:

SELECT import.import_id,
       import.user_id,
       import.project_id,
       import.name,
       substring(import.description for 140) as description,
       import.type,
       import.time_created,
       import.time_started,
       import.time_finished,
       import.config,
       import.pipeline_task_id
FROM import,
     m2m_import_item
WHERE import.import_id = m2m_import_item.import_id
  AND m2m_import_item.item_id = '040737f0-55b2-4556-bbb2-0c7a5140ac15';
All four imports share user_id = 088011c5-546b-4c4a-b099-54ab59a7a99a and project_id = 748e739d-f011-44de-9cb0-c9cb4bb18d08; type is "script", config and pipeline_task_id are null throughout, and descriptions are truncated to 140 characters.

import_id ae9eb81c-7f20-41c1-8d97-94a876d599e5, "Initial OpenAlex import":
  description: OpenAlex query:
               ({!surround v="(carbon OR emission? OR CO2 OR GHG) 2N (price? OR pricing OR tax OR taxes OR taxation OR trading OR trade OR
  time_created: 2023-05-12 16:51:25.399264 +00:00, time_started: 2023-05-12 16:51:25.421586 +00:00, time_finished: 2023-05-12 17:00:31.865510 +00:00

import_id 586c1818-e747-42b7-b078-4e745de20c39, "NACSOS-legacy (Scopus)":
  description: TITLE-ABS-KEY((((carbon OR emission OR CO2 OR GHG OR greenhouse-gas) W/2 (pric* OR tax* OR trading OR trade OR tradable OR levy OR levies OR
  time_created: 2023-05-12 16:17:58.160652 +00:00, time_started: 2023-05-12 16:17:58.170225 +00:00, time_finished: 2023-05-12 16:22:04.009238 +00:00

import_id 7d0364ca-bb44-48ef-968d-cbe2700fab2c, "NACSOS-legacy (WoS)":
  description: TS=((((carbon OR emission$ OR CO2 OR GHG OR "greenhouse gas" OR "greenhouse gases") NEAR/1 (pric* OR tax* OR trading OR trade OR tradable OR
  time_created: 2023-05-12 16:22:57.841787 +00:00, time_started: 2023-05-12 16:22:57.846607 +00:00, time_finished: 2023-05-12 16:27:10.685794 +00:00

import_id 2188a241-1aa7-4b7d-9218-ba88c8497866, "NACSOS-legacy (RePEc)":
  description: ('carbon price' OR 'emission price' OR 'emissions price' OR 'CO2 price' OR 'GHG price' OR 'greenhouse gas price' OR 'greenhouse gases price'
  time_created: 2023-05-12 16:27:39.260514 +00:00, time_started: 2023-05-12 16:27:39.268536 +00:00, time_finished: 2023-05-12 16:29:33.003618 +00:00

Further reading⚓︎