OpenAlex⚓︎
Importing from OpenAlex⚓︎
You can access our self-hosted OpenAlex instances (Solr and Postgres) directly. For more information on how to write queries, please consult the respective pages in this documentation.
You can use the Solr full-text search via the platform:
Unless you know exactly what the settings mean, just use the defaults. The +20/-20 next to the "Offset" field effectively provides pagination, but it does not work well for high page counts. You can get a histogram (number of papers per year) by checking the box and clicking "Query".
Sometimes there might be an error after 30s. This means that the query was cancelled because it was too slow. You may try it once or twice more and get lucky.
As mentioned below, it is always best practice to avoid wildcard queries and instead use an explicit OR clause.
By clicking on "Tokens" under the query box, you get a small tool to find expansion candidates for a postfix wildcard.
Solr⚓︎
Regarding the direct import of OpenAlex into NACSOS: please use this responsibly, as wrong settings may cause you to import millions of documents into the NACSOS database.
Common pitfalls to look out for:

- Make sure to select the field you want to query (either within the query or by setting `df`).
- Make sure to select the correct query parser. Typically, `defType=lucene` works fine; for complex queries it is required.
- Make sure to set `q.op=AND`; this is typically what you want for boolean queries (see the sketch below).
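For example, a minimal request with these settings could look like this (a sketch; the endpoint URL and the `title_abstract` field are the ones used in the export scripts further below):

import requests

url = 'http://srv-mcc-apsis-rechner:8983/solr/openalex/select'
params = {
    'q': 'climate AND (change OR changes)',  # sample query
    'df': 'title_abstract',  # default field to search
    'defType': 'lucene',     # standard query parser
    'q.op': 'AND',           # default boolean operator
    'rows': 10,
}
res = requests.get(url, params=params).json()
print(res['response']['numFound'])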
Wildcards⚓︎
- The wildcard character `?` matches any single character.
- The wildcard character `*` matches zero or more sequential characters and can be placed in the middle of a word.
- Note that it is always better to spell out the explicit matches rather than using wildcards. It may help to look at the Solr index to see what the wildcard would match (see the example below).
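For example, instead of the wildcard query

climate AND chang*

you could spell out the variants you actually want to match (an illustrative expansion; check the index for the actual candidates):

climate AND (change OR changes OR changing OR changed)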
Querying multiple fields⚓︎
To query multiple fields (e.g. title and abstract), change `defType` to `edismax` and set `qf` to a space-separated list of fields.
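In a request, that might look like this (a sketch; `title` and `abstract` are assumed to be the relevant field names here):

defType=edismax
qf=title abstract
q=climate AND adaptation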
Near operators⚓︎
Using the eDisMax mode (necessary to query multiple fields), you can search for terms that occur close to each other.
For example, "GHG emissions"~3 will find results where "GHG" and "emissions" occur within 3 positions of each other.
However, you cannot use wildcards or combine terms. For that, you need...
Complex Phrase or Surround Parser⚓︎
These can be used to create more complex queries in conjunction with the NEAR operator. For example:
With the complex phrase parser
{!complexphrase inOrder=false df=title}"(co2 OR GHG) emissio*"~5 NOT "co2 emission*" NOT "GHG emission*"
Similarly, with the surround parser:

- `N` will match unordered, `W` ordered.
- `climate W change` should be used instead of `"climate change"`.
- `climate 3N chang*` could match `climate change` but also `changing in the climate`.
These things can also be mixed with standard queries:
("dissonance" OR tariff* OR "time-varying pricing") AND ({!surround v="(energy OR electric) 15W (consumption OR conservation)"} OR "price responsiveness")
For complex queries, you have to unset `defType` (pick `-----` or `lucene` in the dropdown) and set the default query field (`df`) to `title_abstract` (or whichever field you want to search).
Here is an example query
If you use any of these `{! ...}` queries, you need to set the parser (aka `defType`) to `lucene`.
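For example, sending the complex phrase query from above via Python could look like this (a sketch; endpoint and field names as in the export scripts below):

import requests

url = 'http://srv-mcc-apsis-rechner:8983/solr/openalex/select'
params = {
    # the {!complexphrase} query from above; defType must be lucene
    'q': '{!complexphrase inOrder=false df=title}"(co2 OR GHG) emissio*"~5 '
         'NOT "co2 emission*" NOT "GHG emission*"',
    'defType': 'lucene',
    'df': 'title_abstract',
    'rows': 10,
}
res = requests.get(url, params=params).json()
print(res['response']['numFound'])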
Faceting⚓︎
Documentation here. Facets are just like filters in online shops, where you get a side panel to filter by price, brand, or size. A facet only provides statistics on what is there and how much of it; it does not actually filter.
It is great for getting the distribution of publications over time by adding the following query parameters to the URL:
facet=true
facet.range=publication_year
facet.range.start=1990
facet.range.end=2024
facet.range.gap=1
facet.sort=index
Easier, and giving you full coverage:
facet=true
facet.field=publication_year
facet.sort=index
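For example, fetching the per-year counts via Python could look like this (a sketch; note `rows=0`, since we only need the facet statistics, not the documents):

import requests

url = 'http://srv-mcc-apsis-rechner:8983/solr/openalex/select'
params = {
    'q': 'climate',  # sample query, adjust as needed
    'df': 'title_abstract',
    'rows': 0,  # facet counts only, no documents
    'facet': 'true',
    'facet.field': 'publication_year',
    'facet.sort': 'index',
}
res = requests.get(url, params=params).json()
# facet_fields returns a flat list: [value, count, value, count, ...]
counts = res['facet_counts']['facet_fields']['publication_year']
for year, n in zip(counts[::2], counts[1::2]):
    print(year, n)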
Tips, pitfalls and notes for translating WoS or Scopus queries to Solr⚓︎
- Make sure that parentheses are properly applied: WoS querying assumes implicit parentheses in combinations of the boolean operators `AND` and `OR` (see here), e.g. `copper OR lead AND algae` is implicitly searched as `copper OR (lead AND algae)`. In Solr, the behavior is different: `copper OR lead AND algae` only gives results matching `lead AND algae`.
- When translating queries, it is essential to test independent parts separately. Solr may return results even if a part of the query is not correctly structured. For example, using the near operator `3N` without the proper parser `{!surround v=''}` will not raise an error, but will look for `3N` as a token and therefore return no results for this part of the query.
- Near operators: `NEAR/x` in WoS needs to be translated to `(x+1)N` (see the sketch after this list). In WoS, the `x` indicates the maximal number of words between the two terms, while in Solr it indicates the number of steps between them.
- Concatenated near operators: `{!surround v='(soil 3N carbon 3N sequestration)'}` is internally treated as `{!surround v='((soil 3N carbon) 3N sequestration)'}`, while `{!surround v='(soil 3N (carbon 3N sequestration))'}` may return slightly different results (same behavior as in WoS).
- Internally, the querying works with indices that are built on tokens. This tool can be used to test how a query matches internally (click `Analysis` in the Solr menu).
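A hypothetical helper for the `NEAR/x` translation could look like this (a sketch; `wos_near_to_surround` is not part of any library, and it only covers the simple two-term pattern, not nested or quoted expressions):

import re

def wos_near_to_surround(query: str) -> str:
    """Translate WoS 'a NEAR/x b' into Solr surround 'a (x+1)N b'.

    Hypothetical helper for illustration only.
    """
    def repl(m: re.Match) -> str:
        # x words in between in WoS corresponds to x+1 steps in Solr
        return f'{int(m.group(1)) + 1}N'
    return re.sub(r'NEAR/(\d+)', repl, query)

print(wos_near_to_surround('soil NEAR/2 carbon'))
# -> 'soil 3N carbon'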
Downloading results⚓︎
A script to retrieve results from the database looks like this:
Simple export script written in Python (JSONL)
import json
import logging
import requests
from time import time
from pathlib import Path
from datetime import timedelta
q = """
climate AND chang*
"""
logging.basicConfig(format='%(asctime)s [%(levelname)s] %(name)s: %(message)s', level=logging.DEBUG)
logger = logging.getLogger()
logger.setLevel(logging.DEBUG)
export_fields = [
'id', 'title', 'abstract', 'mag',
'publication_year', 'cited_by_count', 'type', 'doi'
]
BATCH_SIZE = 10000
TARGET_FILE = Path('/path/to/export/file.jsonl')
TARGET_FILE.parent.mkdir(exist_ok=True, parents=True)
url = 'http://srv-mcc-apsis-rechner:8983/solr/openalex/select'
data = {
'q': q,
'df': 'title_abstract',
'sort': 'id desc',
'q.op': 'AND',
'fl': ','.join(export_fields),
'rows': BATCH_SIZE,
'cursorMark': '*'
}
logger.info(f'Querying endpoint with batch_size={BATCH_SIZE:,}: {url}')
logger.info(f'Writing results to: {TARGET_FILE}')
with open(TARGET_FILE, 'w') as f_out:
    t0 = time()
    batch_i = 0
    num_docs_cum = 0
    while True:
        t1 = time()
        batch_i += 1
        logger.info(f'Running query for batch {batch_i} with cursor "{data["cursorMark"]}"')
        t2 = time()
        res = requests.post(url, data=data).json()
        # advance the cursor for deep pagination
        data['cursorMark'] = res['nextCursorMark']
        n_docs_total = res['response']['numFound']
        batch_docs = res['response']['docs']
        n_docs_batch = len(batch_docs)
        num_docs_cum += n_docs_batch
        logger.debug(f'Query took {timedelta(seconds=time() - t2)} and yielded {n_docs_batch:,} docs')
        logger.debug(f'Current progress: {num_docs_cum:,}/{n_docs_total:,}={num_docs_cum / max(n_docs_total, 1):.2%} docs')
        if n_docs_batch == 0:
            logger.info('No documents in this batch, assuming to be done!')
            break
        logger.debug('Writing documents to file...')
        for doc in batch_docs:
            f_out.write(json.dumps(doc) + '\n')
        logger.debug(f'Done with batch {batch_i} in {timedelta(seconds=time() - t1)}; '
                     f'{timedelta(seconds=time() - t0)} passed overall')
Simple export script written in Python (CSV)
import csv
import logging
import requests
from time import time
from pathlib import Path
from datetime import timedelta
q = """
climate change
"""
logging.basicConfig(format='%(asctime)s [%(levelname)s] %(name)s: %(message)s', level=logging.DEBUG)
logger = logging.getLogger()
logger.setLevel(logging.DEBUG)
export_fields = [
'id', 'title', 'abstract', 'mag',
'publication_year', 'cited_by_count', 'type', 'doi'
]
BATCH_SIZE = 10000
TARGET_FILE = Path('/path/to/export/file.csv')
TARGET_FILE.parent.mkdir(exist_ok=True, parents=True)
url = 'http://srv-mcc-apsis-rechner:8983/solr/openalex/select'
data = {
'q': q,
'df': 'title_abstract',
'sort': 'id desc',
'q.op': 'AND',
'fl': ','.join(export_fields),
'rows': BATCH_SIZE,
'cursorMark': '*'
}
logger.info(f'Querying endpoint with batch_size={BATCH_SIZE:,}: {url}')
logger.info(f'Writing results to: {TARGET_FILE}')
with open(TARGET_FILE, 'w', newline='') as f_out:
    writer = csv.DictWriter(f_out, fieldnames=export_fields, quoting=csv.QUOTE_ALL, dialect='unix')
    writer.writeheader()
    t0 = time()
    batch_i = 0
    num_docs_cum = 0
    while True:
        t1 = time()
        batch_i += 1
        logger.info(f'Running query for batch {batch_i} with cursor "{data["cursorMark"]}"')
        t2 = time()
        res = requests.post(url, data=data).json()
        # advance the cursor for deep pagination
        data['cursorMark'] = res['nextCursorMark']
        n_docs_total = res['response']['numFound']
        batch_docs = res['response']['docs']
        n_docs_batch = len(batch_docs)
        num_docs_cum += n_docs_batch
        logger.debug(f'Query took {timedelta(seconds=time() - t2)} and yielded {n_docs_batch:,} docs')
        logger.debug(f'Current progress: {num_docs_cum:,}/{n_docs_total:,}={num_docs_cum / max(n_docs_total, 1):.2%} docs')
        if n_docs_batch == 0:
            logger.info('No documents in this batch, assuming to be done!')
            break
        logger.debug('Writing documents to file...')
        writer.writerows(batch_docs)
        logger.debug(f'Done with batch {batch_i} in {timedelta(seconds=time() - t1)}; '
                     f'{timedelta(seconds=time() - t0)} passed overall')
Simple export script written in Python using the nacsos_data library
import logging
from pathlib import Path
from nacsos_data.util.academic.openalex import download_openalex_query_raw
q = """
your query here
"""
BATCH_SIZE = 10000
TARGET_FILE = Path('/path/to/export/file.jsonl')
TARGET_FILE.parent.mkdir(exist_ok=True, parents=True)
URL = 'http://srv-mcc-apsis-rechner:8983/solr/openalex'
export_fields = [
'id', 'title', 'abstract', 'mag',
'publication_year', 'cited_by_count', 'type', 'doi'
]
logging.basicConfig(format='%(asctime)s [%(levelname)s] %(name)s: %(message)s', level=logging.DEBUG)
logger = logging.getLogger()
logger.setLevel(logging.DEBUG)
download_openalex_query_raw(TARGET_FILE,
query=q,
openalex_endpoint=URL,
batch_size=BATCH_SIZE,
export_fields=export_fields,
log=logger)
Simple export script written in Python using the nacsos_data library (as AcademicItemModel)
import logging
from pathlib import Path
from nacsos_data.util.academic.openalex import download_openalex_query_item
q = """
your query here
"""
BATCH_SIZE = 10000
TARGET_FILE = Path('/path/to/export/file.jsonl')
TARGET_FILE.parent.mkdir(exist_ok=True, parents=True)
URL = 'http://srv-mcc-apsis-rechner:8983/solr/openalex'
logging.basicConfig(format='%(asctime)s [%(levelname)s] %(name)s: %(message)s', level=logging.DEBUG)
logger = logging.getLogger()
logger.setLevel(logging.DEBUG)
download_openalex_query_item(TARGET_FILE,
query=q,
openalex_endpoint=URL,
batch_size=BATCH_SIZE,
log=logger)
Helper for making wildcards explicit⚓︎
You may run into an error similar to "too many subqueries".
This is usually due to the fact that Solr will expand wildcards before executing the query.
For example, if you search for "NACSOS AND rock*", Solr will look at all words it knows about and create a query like this:
"NACSOS AND (rock OR rocks OR rocking OR rocker OR ...)".
This list can be massive, hence the limit to prevent overloading.
Also for your own sanity, it might be good to make wildcards explicit.
Knowing which terms Solr has indexed helps with this.
The `/terms` component with a prefix search will help (see the sketch below).
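For example, listing indexed terms that start with a given prefix could look like this (a sketch, assuming the `/terms` request handler is enabled on the core):

import requests

url = 'http://srv-mcc-apsis-rechner:8983/solr/openalex/terms'
params = {
    'terms.fl': 'title_abstract',  # field whose index to inspect
    'terms.prefix': 'rock',        # prefix the wildcard would expand
    'terms.limit': 20,             # top terms by document frequency
}
res = requests.get(url, params=params).json()
# terms are returned as a flat list: [term, count, term, count, ...]
terms = res['terms']['title_abstract']
for term, count in zip(terms[::2], terms[1::2]):
    print(term, count)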
Click here to get the basic config


