Direct access to data on the platform⚓︎

There are several ways to access the data stored in the platform. Internally, the platform uses the data library nacsos-data. This library contains SQLAlchemy 2.0 schema definitions, corresponding pydantic models, some utilities and common CRUD (create, read, update, delete) functionality.

This guide provides a primer on how to access data programmatically. For details on how to use SQLAlchemy (Version 2.0!!) or SQL in general, please refer to the respective documentations.

Prerequisites⚓︎

Access to the database (server and port can be reached; you have a username and password)
Most recent version of nacsos_data installed (optional)

Installing nacsos_data

Let's assume you have have a python environment (or jupyter kernel) ready and activated. In order to use the nacsos_data library, you have to install it first. You can either install it from git directly or install a local (editable) version.

Install from gitInstall locally

Get the id/hash of the latest commit (e.g. via the API or the WebUI).

# replace <GIT_HASH> with the id/hash (long format)
pip install -e "git+ssh://git@gitlab.pik-potsdam.de/mcc-apsis/nacsos/nacsos-data.git@<GIT_HASH>#egg=nacsos_data"

cd path/to/workspace/
git clone git@gitlab.pik-potsdam.de:mcc-apsis/nacsos/nacsos-data.git
pip install -e nacsos-data

Connect to the database⚓︎

For anything you want to do with the database, you need to connect to the database and open a session. Please refer to the SQLAlchemy documentation on details how data is handled in a session, there are some caveats to it.

The nacsos_data library provides some utility function to get a session. Here's a small "Hello World" example that counts the number of rows in the item table using a raw SQL statement.

Connection details from config fileConnection details from environment variablesProvide connection details

The most practical way to keep secrets a secret is to create a file with all the connection information and putting it in a save place on your computer. Whenever needed, you can point to that file either by (globally or temporarily) setting the environment variable NACSOS_CONFIG to the path (ideally absolute) to the config file or by providing the path in the script.

Code config.env

from nacsos_data.db.connection import get_engine
from sqlalchemy import text

# if NACSOS_CONFIG is set in the environment, you can drop the `conf_file` kwarg
engine = get_engine(conf_file='path/to/config.env')
with engine.session() as session:
    response = s.execute(text('SELECT count(1) FROM item'))
    num_rows = response.scalars().all()
    print(f'The database contains {num_rows:,} items.')

NACSOS_DB__HOST="localhost"
NACSOS_DB__PORT=5432
NACSOS_DB__USER="nacsos"
NACSOS_DB__PASSWORD="secret"
NACSOS_DB__DATABASE="nacsos_core"

You may be using a jupyter server which provides a secure vault for environment variables, or you are in some sort of cloud environment that provides a similar mechanism. This allows you to provide connection details via NACSOS_DB__HOST, NACSOS_DB__USER, ... Internally, the library tries to read a config file, falls back to defaults and overrides from the environment variables (see pydantic docs).

Code Available variables

from nacsos_data.db.connection import get_engine
from sqlalchemy import text

engine = get_engine()
with engine.session() as session:
    response = s.execute(text('SELECT count(1) FROM item'))
    num_rows = response.scalars().all()
    print(f'The database contains {num_rows:,} items.')

export NACSOS_DB__HOST="localhost"
export NACSOS_DB__PORT=5432
export NACSOS_DB__USER="nacsos"
export NACSOS_DB__PASSWORD="secret"
export NACSOS_DB__DATABASE="nacsos_core"

If you have all the connection details, you can directly add them to a script.
Note, that this potentially dangerous, as secrets may leak into repositories.

from nacsos_data.db.connection import get_engine, DatabaseConfig
from sqlalchemy import text

settings = DatabaseConfig(HOST='localhost', PORT=5432, USER='nacsos', PASSWORD='secret', DATABASE='nacsos_core')
engine = get_engine(setttings=settings)
with engine.session() as session:
    response = s.execute(text('SELECT count(1) FROM item'))
    num_rows = response.scalars().all()
    print(f'The database contains {num_rows:,} items.')

Using SQLAlchemy ORM Schemas⚓︎

If you are not familiar with ORM, you should not use this method to interact with the database before you read up on the topic, e.g. the SQLAlchemy documentation.

All relevant tables in the database have a corresponding schema definition in nacsos_data.db.schemas. You can use these to construct complex SQL statements or query for "objects" stored in the database, such as a document or annotation for a document.

The following example may serve as a starting point in combination with the SQLAlchemy ORM Query Guide. We'll assume you already connected an engine (see above). Note, that in this example we are not fetching a session via a context manager but opening it from the engine directly, so we also have to make sure it is closed properly! This does however simplify things for use in jupyter, since ORM models are otherwise deconstructed outside the context of a session. For more details when data is kept in sync with the database or when the data assigned to an ORM object is cleared, please study the SQLAlchemy documentation.

from sqlalchemy import select
from nacsos_data.db.schemas import Project
session = engine()
stmt = select(Project)
result = session.execute(stmt)
projects = result.scalars().all()
session.close()

Using `nacsos_data` CRUD⚓︎

Some basic and very common functionality is already implemented in some CRUD functions available in the nacsos_data.db.crud package. The functions usually require you to provide an async engine as the main purpose it to use them in the context of the platform backend. If you never heard of asynchronous execution before, this may be confusing at first. Furthermore, the CRUD functions usually don't return ORM objects but pydantic objects that are mirroring the schema definitions. This has the benefit of persistence beyond the session context and a complete separation from the database to prevent any data manipulation to be written back to the database unintentionally.

You can also use the CRUD helper functions, but they require an asynchronous engine

from nacsos_data.db import get_engine_async
from nacsos_data.db.crud.projects import read_all_projects

engine = get_engine_async(conf_file='path/to/config.env')
projects = await read_all_projects(engine=engine)

Using pydantic models⚓︎

All SQLAlchemy schemas are mirrored by a corresponding pydantic model. This allows us to return models without ORM binding but the same interface. It also is helpful when you are storing data elsewhere, e.g. in the form of json files. You can use the pydantic models to store and load the data in the same format used by the database (see helper functions or model exports).

All model definitions (incl. documentation) are in the nacsos_data.models package.