# Importing data
In general, Items in the database should always be unique, fully populated (all fields set) and "clean".
We consider the case for Tweets to be trivial, so this article focuses mainly on AcademicItems.
## Database representation
From the perspective of the database, there are five relevant tables:
- `Item` -> `item`: The main reference for that item; mainly contains `text` (e.g. the abstract).
- `AcademicItem` -> `academic_item`: Contains additional meta-data specific to academic items. Consult the respective models for how to populate the fields.
- `AcademicItemVariant` -> `academic_item_variant`: Used to keep track of alternative values in case of duplicates.
- `Import` -> `import`: Context in which items are ingested into the platform. Should have a descriptive title (e.g. source and date) and the query in the description for more details.
- `m2m_import_item_table` -> `m2m_import_item`: Keeps track of the many-to-many relation between items and imports. This way, we keep only one item but can remember that it came from multiple sources.
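To make the relations concrete, here is a rough sketch of the five tables as plain dataclasses. The field lists are heavily abbreviated and purely illustrative (they only mirror a few of the columns shown in the queries further below); consult the actual nacsos_data models for the authoritative definitions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Item:
    item_id: str                # primary key, shared with academic_item
    text: Optional[str] = None  # e.g. the abstract


@dataclass
class AcademicItem:
    item_id: str                # same id as the Item row
    title: Optional[str] = None
    doi: Optional[str] = None
    publication_year: Optional[int] = None


@dataclass
class AcademicItemVariant:
    item_variant_id: str
    item_id: str                # points at the consolidated item
    import_id: str              # the import that contributed this variant
    publication_year: Optional[int] = None


@dataclass
class Import:
    import_id: str
    name: str                   # descriptive title, e.g. source and date
    description: str            # e.g. the query used


@dataclass
class M2MImportItem:
    import_id: str              # one item can belong to many imports...
    item_id: str                # ...and one import contains many items
```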
## Utility functions
Properly populating all those relations can be tedious.
To this end, nacsos_data provides several utility functions that you can use to ingest data into a project.
### Reading and transforming data
Most notably, there are utils for reading different data formats and producing AcademicItemModels:
- `nacsos_data.util.academic.wos.read_wos_file()` for reading a RIS-style Web of Science export.
- `nacsos_data.util.academic.scopus.read_scopus_file()` for reading a CSV export from Scopus (all fields except references!).
- `nacsos_data.util.academic.openalex.download_openalex_query()` to run a query on OpenAlex (local solr instance) and store the result in a jsonl file.
- `nacsos_data.util.academic.openalex.generate_items_from_openalex()` to go through that file and yield `AcademicItem`s.
- `nacsos_data.util.academic.nacsos_legacy.fetch_nacsos_legacy_doc()` to read one document from NACSOS-legacy and convert it to the NACSOS2 format.
- `nacsos_data.util.academic.nacsos_legacy.read_nacsos1_annotations()` for reading all DocOwnerships and DocUserCats related to a Query from NACSOS-legacy (in proper order) and translating them to a set of `Assignment`s and `Annotation`s.
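The two OpenAlex helpers follow a common streaming pattern: the download step writes one JSON record per line, and the generator parses the file back lazily, so even large exports never have to fit into memory at once. A minimal, self-contained sketch of that pattern (not the actual nacsos_data implementation):

```python
import json
from typing import Iterator


def write_jsonl(path: str, records: list) -> None:
    # One JSON object per line, as a query-download step might produce.
    with open(path, 'w') as f:
        for record in records:
            f.write(json.dumps(record) + '\n')


def generate_items(path: str) -> Iterator[dict]:
    # Lazily yield one parsed record at a time instead of loading the full file.
    with open(path, 'r') as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)
```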
### Ingestion helpers
- `nacsos_data.util.academic.duplicate.find_duplicates()` finds candidates for duplicates of the given item (see below for details).
- `nacsos_data.util.academic.duplicate._are_actually_duplicate()` double-checks candidates to clear false positives (see below for details).
- `nacsos_data.util.academic.duplicate.fuse_items()` fuses two items that are determined to be duplicates and returns a new item with consolidated values, for example with respectively missing fields populated or the author list extended.
- `nacsos_data.util.academic.importer.duplicate_insertion()` handles the proper updates and insertion of variants, given the `item_id` of an existing item and a new item that is deemed to be its duplicate.
- `nacsos_data.util.academic.importer.import_academic_items()` is pretty much a one-stop shop for all of the above and more. Just give it a list of items and a project, and it will handle everything.
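To make the idea behind `fuse_items()` concrete, here is a hypothetical sketch operating on plain dictionaries (the real function works on `AcademicItemModel`s and has many more rules): scalar fields fall back to whichever record has a value, and list fields such as authors or keywords are merged.

```python
def fuse(a: dict, b: dict) -> dict:
    """Consolidate two duplicate records into one (illustrative only)."""
    fused = {}
    for key in a.keys() | b.keys():
        value_a, value_b = a.get(key), b.get(key)
        if isinstance(value_a, list) or isinstance(value_b, list):
            # Extend list fields (e.g. authors, keywords) without duplicating entries.
            merged = list(value_a or [])
            merged += [v for v in (value_b or []) if v not in merged]
            fused[key] = merged
        else:
            # Populate respectively missing fields.
            fused[key] = value_a if value_a is not None else value_b
    return fused
```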
## Conventions of handling duplicates
Please note that the most recent algorithm for detecting duplicates is probably in this file.
In general, we aim to keep duplicates out of the platform during import.
Cleaning up all references after the fact can be very challenging and error-prone.
That said, if you ever end up in the situation where you have to merge two items together, a good start might be to look at nacsos_data.util.duplicate.update_references().
It will try to update all references in the database, but keep in mind that there might be other references in some meta-data fields that are not covered.
Furthermore, all external references you might already have would also have to be updated.
Hence, you should pay attention during import so that only duplicate-free and clean data ends up in the database in the first place.
There are some functions to help with that (see above). The general concept is as follows:
- Get a clean `AcademicItemModel` from your data source (see helpers above)
- Fetch potential candidates for duplicates (e.g. based on title-slug or any of the IDs)
- Determine which of the candidates are "real" duplicates (if everything was done right before, this should only be one)
- If no duplicates are found, insert a new item (and respective m2m references)
- If duplicates are found, create a variant entry
    - If this is the first time we found a duplicate for this item, create a variant off of the original entry
    - Check all values of the new item against existing variants and drop all values that already exist
    - Insert the variant into the respective table, so that for each `item_id`, the titles, ids, etc. appear only once
    - Update the values in the items table (e.g. to populate previously missing fields)
This list roughly describes how the nacsos_data library implements the procedure.
For details and the most up-to-date approach, check the code.
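As a hypothetical end-to-end sketch of this procedure, the following uses plain dictionaries keyed by title slug as a stand-in for the real database tables (scalar fields only; the actual implementation in `import_academic_items()` also handles lists, IDs, and much more):

```python
items: dict = {}     # title_slug -> consolidated item
variants: dict = {}  # title_slug -> list of variant rows


def slugify(title: str) -> str:
    # A crude title slug: lower-cased letters only.
    return ''.join(c for c in title.lower() if c.isalpha())


def ingest(new: dict) -> None:
    slug = slugify(new['title'])
    if slug not in items:
        # No duplicate found: insert as a new item.
        items[slug] = dict(new)
        return
    existing = items[slug]
    if slug not in variants:
        # First duplicate for this item: create a variant off of the original entry.
        variants[slug] = [dict(existing)]
    # Keep only values that don't already exist in any variant (scalar fields only).
    seen = {k: {v.get(k) for v in variants[slug]} for k in new}
    variant = {k: v for k, v in new.items()
               if v is not None and v not in seen[k]}
    if variant:
        variants[slug].append(variant)
    # Update the consolidated item, e.g. populate previously missing fields.
    for k, v in new.items():
        if existing.get(k) is None:
            existing[k] = v
```

The test values below reuse the "Who Pays a Price on Carbon?" example from the illustration further down: the second ingest adds a variant and fills the missing DOI without creating a second item.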
## Deduplication on the platform
The default import on the platform uses the following algorithm and rules to identify duplicates.
Basic philosophy: duplicate candidates found via title + abstract matching are always treated as duplicates, even if other rules (e.g. different authors or DOI) would disagree. Only drastically simplified texts are compared: everything is lower-cased, all characters but A-Z and spaces are removed, only texts with more than 300 characters are considered, and 95% token overlap is required. Although this may seem too "aggressive" for some applications, such items are essentially the same paper for any topic model, classifier, or human annotator. Also, we keep track of "variants" of an item, so alternative values for different fields can be considered in later analyses, for example the publication year in a time series.
Beyond abstract matching: if the abstract does not exist and the title + abstract text is not sufficiently long, we try other rules to identify duplicates. First, we look for candidates based on title slug, DOI, WebOfScience-ID, Scopus-ID, OpenAlex-ID, Pubmed-ID, Dimensions-ID, and SemanticScholar-ID. We then reject candidates whose publication years differ by more than 2 (oftentimes, papers with the same title are published several years apart) and also do an additional check of the title (sometimes, titles differ due to version/part numbers or the year of a report, which we would consider false-positive matches). At this point, we do not consider any deeper matching based on authors, venues, or the like. Our main goal is to find duplicates with respect to content, not necessarily slight variations, versions (preprint vs published), or database inconsistencies.
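As a rough illustration of these rules (a simplified sketch, not the platform's actual implementation): the thresholds mirror the numbers above, and the 95% criterion is read here as a Jaccard-style token overlap, which is one possible interpretation.

```python
import re

MIN_LENGTH = 300    # only drastically simplified texts longer than this are compared
MIN_OVERLAP = 0.95  # required token overlap


def simplify(text: str) -> str:
    # Lower-case and drop every character except a-z and spaces.
    return re.sub(r'[^a-z ]', '', text.lower())


def texts_match(text_a: str, text_b: str) -> bool:
    a, b = simplify(text_a), simplify(text_b)
    if len(a) <= MIN_LENGTH or len(b) <= MIN_LENGTH:
        return False  # too short: fall back to the ID/title-slug rules
    tokens_a, tokens_b = set(a.split()), set(b.split())
    overlap = len(tokens_a & tokens_b) / len(tokens_a | tokens_b)
    return overlap >= MIN_OVERLAP


def years_plausible(year_a: int, year_b: int, max_diff: int = 2) -> bool:
    # Candidates whose publication years differ by more than two are rejected.
    return abs(year_a - year_b) <= max_diff
```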
## Examples
It might help to demonstrate how easy it actually is to import things using the correct functions.
### Importing OpenAlex
```python
import logging

from nacsos_data.util.academic.importer import import_academic_items
from nacsos_data.util.academic.openalex import generate_items_from_openalex
from nacsos_data.db import get_engine_async

PROJECT_ID = '[FILL PROJECT_ID HERE]'
OA_EXPORT_FILE = '/path/to/exported/data.jsonl'
LOG_FILE = '/path/to/logfile.txt'  # set to None to write to std-out

# The function will create an Import for the following user and name/description.
IMPORT_USER = '[UUID of a user]'
IMPORT_NAME = '[Descriptive name for the import]'
IMPORT_DESC = '''
Data source: OpenAlex\n
Dump version: 2023-05-02\n
Query: "(carbon OR emission? OR CO2 OR GHG) AND (price? OR pricing OR tax OR taxes)"
'''

# Set to False to actually write to the database; otherwise this will simulate
# the import as best as possible.
DRY_RUN = True

logging.basicConfig(
    format='%(asctime)s [%(levelname)s] %(name)s: %(message)s',
    level=logging.DEBUG,  # set log level as you like
    filename=LOG_FILE
)

db_engine_async = get_engine_async(conf_file='/path/to/config.env')

# This next line is almost the only thing you have to change for different sources.
academic_items = generate_items_from_openalex(openalex_export=OA_EXPORT_FILE, project_id=PROJECT_ID)

# This might be needed, since pydantic sometimes doesn't json-serialise recursively, so we help it along.
for item in academic_items:
    if item.authors is not None:
        item.authors = [a.dict() for a in item.authors]

# Read the docstring of this function to learn what all the parameters mean!
import_id, item_ids = await import_academic_items(
    items=academic_items,
    project_id=PROJECT_ID,
    db_engine=db_engine_async,
    import_name=IMPORT_NAME,
    description=IMPORT_DESC,
    user_id=IMPORT_USER,
    import_id=None,
    dry_run=DRY_RUN,
    trust_new_authors=False,
    trust_new_keywords=False,
)
```
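Note that `import_academic_items()` is awaited, so the snippet above assumes it runs in an async context (e.g. a Jupyter notebook or the `python -m asyncio` REPL). As a standalone script, you would wrap the logic in an async entry point along these lines (sketch only; the body is elided):

```python
import asyncio


async def main() -> None:
    # ... set up the engine and items as in the snippet above, then:
    # import_id, item_ids = await import_academic_items(...)
    pass


if __name__ == '__main__':
    asyncio.run(main())
```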
### Illustration
Let's have a look at the item with id '040737f0-55b2-4556-bbb2-0c7a5140ac15':
```sql
SELECT ai.item_id,
       ai.doi,
       ai.wos_id,
       ai.scopus_id,
       ai.openalex_id,
       ai.s2_id,
       ai.title,
       ai.title_slug,
       ai.publication_year,
       ai.source,
       ai.keywords,
       substring(ai.authors::text for 140) as authors,
       ai.pubmed_id,
       ai.project_id,
       substring(i.text for 100) as abstract,
       i.type
FROM academic_item ai
         LEFT JOIN item i on i.item_id = ai.item_id
WHERE ai.item_id = '040737f0-55b2-4556-bbb2-0c7a5140ac15';
```
| item_id | doi | wos_id | scopus_id | openalex_id | s2_id | title | title_slug | publication_year | source | keywords | authors | pubmed_id | project_id | abstract | type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 040737f0-55b2-4556-bbb2-0c7a5140ac15 | 10.1007/s10640-010-9345-x | WOS:000278401900007 | 2-s2.0-77953480306 | W2148165261 | null | Who Pays a Price on Carbon? | whopaysapriceoncarbon | 2010 | ENVIRONMENTAL & RESOURCE ECONOMICS | ["Distributional incidence", "Carbon tax", "Tradable permits", "LIFETIME INCIDENCE", "TAX", "EMISSIONS", "POLICIES"] | [{"name": "Grainger, Corbett A.", "email": null, "orcid": null, "s2_id": null, "scopus_id": null, "openalex_id": null, "affiliations": [{"na | null | 748e739d-f011-44de-9cb0-c9cb4bb18d08 | We use the 2003 Consumer Expenditure Survey and emissions estimates from an input-output model to es | academic |
... and its variants. As you can see, most fields are null to save space: only values that differ from what is already stored are kept, such as the two alternative publication years. For keywords, authors, and meta, we don't drop redundant values, as that would involve too many rules and confusing conflicts.
```sql
SELECT item_variant_id,
       item_id,
       import_id,
       doi,
       wos_id,
       scopus_id,
       openalex_id,
       s2_id,
       pubmed_id,
       title,
       publication_year,
       source,
       keywords,
       substring(authors::text for 140) as authors,
       substring(abstract for 140) as abstract,
       substring(meta::text for 140) as meta
FROM academic_item_variant
WHERE item_id = '040737f0-55b2-4556-bbb2-0c7a5140ac15';
```
| item_variant_id | item_id | import_id | doi | wos_id | scopus_id | openalex_id | s2_id | pubmed_id | title | publication_year | source | keywords | authors | abstract | meta |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 287c06bf-fa58-4562-9a2a-c1d8e716a173 | 040737f0-55b2-4556-bbb2-0c7a5140ac15 | ae9eb81c-7f20-41c1-8d97-94a876d599e5 | null | null | null | W3121621695 | null | null | Who Pays a Price on Carbon | 2009 | null | null | null | We use the 2003 Consumer Expenditure Survey and emissions estimates from an input-output model to estimate the incidence of a price on carbo | {"type": "posted-content", "cited_by_count": 6} |
| eab623e8-66bf-4668-9a7f-344e34a35753 | 040737f0-55b2-4556-bbb2-0c7a5140ac15 | ae9eb81c-7f20-41c1-8d97-94a876d599e5 | null | null | null | W3023054246 | null | null | null | null | null | null | null | null | {"type": "posted-content", "cited_by_count": 1} |
| d78c0da5-801e-4f59-a259-152dc9801f91 | 040737f0-55b2-4556-bbb2-0c7a5140ac15 | ae9eb81c-7f20-41c1-8d97-94a876d599e5 | 10.1007/s10640-010-9345-x | null | null | W2148165261 | null | null | Who Pays a Price on Carbon? | 2010 | null | null | null | null | {"type": "journal-article", "cited_by_count": 146} |
| 8312cbb0-25fd-462b-b0e1-e3a42aaabfa6 | 040737f0-55b2-4556-bbb2-0c7a5140ac15 | 586c1818-e747-42b7-b078-4e745de20c39 | null | WOS:000278401900007 | 2-s2.0-77953480306 | null | null | null | null | null | ENVIRONMENTAL & RESOURCE ECONOMICS | ["Distributional incidence", "Carbon tax", "Tradable permits", "LIFETIME INCIDENCE", "TAX", "EMISSIONS", "POLICIES"] | [{"name": "Grainger, Corbett A.", "email": null, "orcid": null, "s2_id": null, "scopus_id": null, "openalex_id": null, "affiliations": [{"na | null | {"nacsos1": {"vol": "46", "date": "JUL", "issn": "0924-6460", "lang": "English", "subj": "Business & Economics; Environmental Sciences & Eco |
| 41803df6-21ac-471b-aea8-868ea1f29857 | 040737f0-55b2-4556-bbb2-0c7a5140ac15 | 7d0364ca-bb44-48ef-968d-cbe2700fab2c | null | null | null | null | null | null | null | null | null | ["Distributional incidence", "Carbon tax", "Tradable permits", "LIFETIME INCIDENCE", "TAX", "EMISSIONS", "POLICIES"] | [{"name": "Grainger, Corbett A.", "email": null, "orcid": null, "s2_id": null, "scopus_id": null, "openalex_id": null, "affiliations": [{"na | null | {"nacsos1": {"vol": "46", "date": "JUL", "issn": "0924-6460", "lang": "English", "subj": "Business & Economics; Environmental Sciences & Eco |
| af04e84b-a09c-47be-bf00-591eec2b98e3 | 040737f0-55b2-4556-bbb2-0c7a5140ac15 | 2188a241-1aa7-4b7d-9218-ba88c8497866 | null | null | null | null | null | null | null | null | null | ["Distributional incidence", "Carbon tax", "Tradable permits", "LIFETIME INCIDENCE", "TAX", "EMISSIONS", "POLICIES"] | [{"name": "Grainger, Corbett A.", "email": null, "orcid": null, "s2_id": null, "scopus_id": null, "openalex_id": null, "affiliations": [{"na | null | {"nacsos1": {"vol": "46", "date": "JUL", "issn": "0924-6460", "lang": "English", "subj": "Business & Economics; Environmental Sciences & Eco |
For completeness, via the m2m table...
```sql
SELECT *
FROM m2m_import_item
WHERE m2m_import_item.item_id = '040737f0-55b2-4556-bbb2-0c7a5140ac15';
```
| import_id | item_id | time_created | type |
|---|---|---|---|
| ae9eb81c-7f20-41c1-8d97-94a876d599e5 | 040737f0-55b2-4556-bbb2-0c7a5140ac15 | 2023-05-12 16:53:53.381928 +00:00 | explicit |
| 586c1818-e747-42b7-b078-4e745de20c39 | 040737f0-55b2-4556-bbb2-0c7a5140ac15 | 2023-05-12 18:20:52.376814 +00:00 | explicit |
| 7d0364ca-bb44-48ef-968d-cbe2700fab2c | 040737f0-55b2-4556-bbb2-0c7a5140ac15 | 2023-05-12 18:25:50.147696 +00:00 | explicit |
| 2188a241-1aa7-4b7d-9218-ba88c8497866 | 040737f0-55b2-4556-bbb2-0c7a5140ac15 | 2023-05-12 18:29:15.864089 +00:00 | explicit |
... we can trace it to four distinct imports:
```sql
SELECT import.import_id,
       import.user_id,
       import.project_id,
       import.name,
       substring(import.description for 140) as description,
       import.type,
       import.time_created,
       import.time_started,
       import.time_finished,
       import.config,
       import.pipeline_task_id
FROM import,
     m2m_import_item
WHERE import.import_id = m2m_import_item.import_id
  AND m2m_import_item.item_id = '040737f0-55b2-4556-bbb2-0c7a5140ac15';
```
| import_id | user_id | project_id | name | description | type | time_created | time_started | time_finished | config | pipeline_task_id |
|---|---|---|---|---|---|---|---|---|---|---|
| ae9eb81c-7f20-41c1-8d97-94a876d599e5 | 088011c5-546b-4c4a-b099-54ab59a7a99a | 748e739d-f011-44de-9cb0-c9cb4bb18d08 | Initial OpenAlex import | OpenAlex query: ({!surround v="(carbon OR emission? OR CO2 OR GHG) 2N (price? OR pricing OR tax OR taxes OR taxation OR trading OR trade OR | script | 2023-05-12 16:51:25.399264 +00:00 | 2023-05-12 16:51:25.421586 +00:00 | 2023-05-12 17:00:31.865510 +00:00 | null | null |
| 586c1818-e747-42b7-b078-4e745de20c39 | 088011c5-546b-4c4a-b099-54ab59a7a99a | 748e739d-f011-44de-9cb0-c9cb4bb18d08 | NACSOS-legacy (Scopus) | TITLE-ABS-KEY((((carbon OR emission OR CO2 OR GHG OR greenhouse-gas) W/2 (pric* OR tax* OR trading OR trade OR tradable OR levy OR levies OR | script | 2023-05-12 16:17:58.160652 +00:00 | 2023-05-12 16:17:58.170225 +00:00 | 2023-05-12 16:22:04.009238 +00:00 | null | null |
| 7d0364ca-bb44-48ef-968d-cbe2700fab2c | 088011c5-546b-4c4a-b099-54ab59a7a99a | 748e739d-f011-44de-9cb0-c9cb4bb18d08 | NACSOS-legacy (WoS) | TS=((((carbon OR emission$ OR CO2 OR GHG OR "greenhouse gas" OR "greenhouse gases") NEAR/1 (pric* OR tax* OR trading OR trade OR tradable OR | script | 2023-05-12 16:22:57.841787 +00:00 | 2023-05-12 16:22:57.846607 +00:00 | 2023-05-12 16:27:10.685794 +00:00 | null | null |
| 2188a241-1aa7-4b7d-9218-ba88c8497866 | 088011c5-546b-4c4a-b099-54ab59a7a99a | 748e739d-f011-44de-9cb0-c9cb4bb18d08 | NACSOS-legacy (RePEc) | ('carbon price' OR 'emission price' OR 'emissions price' OR 'CO2 price' OR 'GHG price' OR 'greenhouse gas price' OR 'greenhouse gases price' | script | 2023-05-12 16:27:39.260514 +00:00 | 2023-05-12 16:27:39.268536 +00:00 | 2023-05-12 16:29:33.003618 +00:00 | null | null |