Direct access to data on the platform⚓︎
There are several ways to access the data stored in the platform.
Internally, the platform uses the data library nacsos-data.
This library contains SQLAlchemy 2.0 schema definitions,
corresponding pydantic models, some utilities and common
CRUD (create, read, update, delete) functionality.
This guide provides a primer on how to access data programmatically. For details on how to use SQLAlchemy (Version 2.0!!) or SQL in general, please refer to the respective documentations.
Prerequisites⚓︎
- Access to the database (server and port can be reached; you have a username and password)
- Most recent version of
nacsos_datainstalled (optional)
Installing nacsos_data
Let's assume you have have a python environment (or jupyter kernel) ready and activated.
In order to use the nacsos_data library, you have to install it first.
You can either install it from git directly or install a local (editable) version.
Get the id/hash of the latest commit (e.g. via the API or the WebUI).
# replace <GIT_HASH> with the id/hash (long format)
pip install -e "git+ssh://git@gitlab.pik-potsdam.de/mcc-apsis/nacsos/nacsos-data.git@<GIT_HASH>#egg=nacsos_data"
cd path/to/workspace/
git clone git@gitlab.pik-potsdam.de:mcc-apsis/nacsos/nacsos-data.git
pip install -e nacsos-data
Connect to the database⚓︎
For anything you want to do with the database, you need to connect to the database and open a session. Please refer to the SQLAlchemy documentation on details how data is handled in a session, there are some caveats to it.
The nacsos_data library provides some utility function to get a session.
Here's a small "Hello World" example that counts the number of rows in the item table using a raw SQL statement.
The most practical way to keep secrets a secret is to create a file with all the connection information
and putting it in a save place on your computer. Whenever needed, you can point to that file either by
(globally or temporarily) setting the environment variable NACSOS_CONFIG to the path (ideally absolute) to the
config file or by providing the path in the script.
from nacsos_data.db.connection import get_engine
from sqlalchemy import text
# if NACSOS_CONFIG is set in the environment, you can drop the `conf_file` kwarg
engine = get_engine(conf_file='path/to/config.env')
with engine.session() as session:
response = s.execute(text('SELECT count(1) FROM item'))
num_rows = response.scalars().all()
print(f'The database contains {num_rows:,} items.')
NACSOS_DB__HOST="localhost"
NACSOS_DB__PORT=5432
NACSOS_DB__USER="nacsos"
NACSOS_DB__PASSWORD="secret"
NACSOS_DB__DATABASE="nacsos_core"
You may be using a jupyter server which provides a secure vault for environment variables, or you are in
some sort of cloud environment that provides a similar mechanism.
This allows you to provide connection details via NACSOS_DB__HOST, NACSOS_DB__USER, ...
Internally, the library tries to read a config file, falls back to defaults and overrides from the environment
variables (see pydantic docs).
from nacsos_data.db.connection import get_engine
from sqlalchemy import text
engine = get_engine()
with engine.session() as session:
response = s.execute(text('SELECT count(1) FROM item'))
num_rows = response.scalars().all()
print(f'The database contains {num_rows:,} items.')
export NACSOS_DB__HOST="localhost"
export NACSOS_DB__PORT=5432
export NACSOS_DB__USER="nacsos"
export NACSOS_DB__PASSWORD="secret"
export NACSOS_DB__DATABASE="nacsos_core"
If you have all the connection details, you can directly add them to a script.
Note, that this potentially dangerous, as secrets may leak into repositories.
from nacsos_data.db.connection import get_engine, DatabaseConfig
from sqlalchemy import text
settings = DatabaseConfig(HOST='localhost', PORT=5432, USER='nacsos', PASSWORD='secret', DATABASE='nacsos_core')
engine = get_engine(setttings=settings)
with engine.session() as session:
response = s.execute(text('SELECT count(1) FROM item'))
num_rows = response.scalars().all()
print(f'The database contains {num_rows:,} items.')
Using SQLAlchemy ORM Schemas⚓︎
If you are not familiar with ORM, you should not use this method to interact with the database before you read up on the topic, e.g. the SQLAlchemy documentation.
All relevant tables in the database have a corresponding schema definition in nacsos_data.db.schemas.
You can use these to construct complex SQL statements or query for "objects" stored in the database, such as
a document or annotation for a document.
The following example may serve as a starting point in combination with the SQLAlchemy ORM Query Guide. We'll assume you already connected an engine (see above). Note, that in this example we are not fetching a session via a context manager but opening it from the engine directly, so we also have to make sure it is closed properly! This does however simplify things for use in jupyter, since ORM models are otherwise deconstructed outside the context of a session. For more details when data is kept in sync with the database or when the data assigned to an ORM object is cleared, please study the SQLAlchemy documentation.
from sqlalchemy import select
from nacsos_data.db.schemas import Project
session = engine()
stmt = select(Project)
result = session.execute(stmt)
projects = result.scalars().all()
session.close()
Using nacsos_data CRUD⚓︎
Some basic and very common functionality is already implemented in some CRUD functions available
in the nacsos_data.db.crud package.
The functions usually require you to provide an async engine as the main purpose it to use them in the context
of the platform backend.
If you never heard of asynchronous execution before, this may be confusing at first.
Furthermore, the CRUD functions usually don't return ORM objects but pydantic objects that are mirroring the schema
definitions.
This has the benefit of persistence beyond the session context and a complete separation from the database to prevent
any data manipulation to be written back to the database unintentionally.
You can also use the CRUD helper functions, but they require an asynchronous engine
from nacsos_data.db import get_engine_async
from nacsos_data.db.crud.projects import read_all_projects
engine = get_engine_async(conf_file='path/to/config.env')
projects = await read_all_projects(engine=engine)
Using pydantic models⚓︎
All SQLAlchemy schemas are mirrored by a corresponding pydantic model. This allows us to return models without ORM binding but the same interface. It also is helpful when you are storing data elsewhere, e.g. in the form of json files. You can use the pydantic models to store and load the data in the same format used by the database (see helper functions or model exports).
All model definitions (incl. documentation) are in the nacsos_data.models package.