Storing documents

  • purpose of Item
  • different types (and relation to project type)
  • how to handle different "views" on a document (e.g. paragraphs, full doc, abstract,...)
  • practices on (de)duplication

Twitter

todo

Academic

AcademicItem inherits from Item; Item.text holds the abstract.

Standard Item fields are described in nacsos_data/models/items/academic.py. Fields that map multiple values to a single item (authors, keywords) are initially stored in jsonb columns (their structure is defined in models/items/academic.py). If we find that this causes performance issues (e.g. difficulties in finding items authored by "Minx, JC"), we may revisit this decision.
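As a rough illustration of the trade-off, the sketch below keeps authors as a list of JSON-like objects on each item and looks up items by author name. The field names (`name`, `orcid`), the `AcademicItemRow` class and the `items_by_author` helper are hypothetical; the real structure is defined in models/items/academic.py. Plain dataclasses are used here only to keep the example dependency-free.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class Author:
    # Hypothetical shape of one entry in the jsonb `authors` column.
    name: str
    orcid: Optional[str] = None


@dataclass
class AcademicItemRow:
    # Simplified stand-in for a database row, not the real AcademicItem model.
    item_id: int
    title: str
    authors: List[Author] = field(default_factory=list)

    def authors_json(self) -> str:
        # Serialised form, roughly what would be written to the jsonb column.
        return json.dumps([asdict(a) for a in self.authors])


def items_by_author(rows: List[AcademicItemRow], name: str) -> List[int]:
    # Linear scan over every author list. In postgres this would be a
    # jsonb query instead, which is where performance could become an issue.
    return [row.item_id for row in rows if any(a.name == name for a in row.authors)]


rows = [
    AcademicItemRow(1, "Paper A", [Author("Minx, JC"), Author("Doe, J")]),
    AcademicItemRow(2, "Paper B", [Author("Doe, J")]),
]
matches = items_by_author(rows, "Minx, JC")
```

The lookup has to inspect the nested author objects of every item, which is the kind of access pattern that might motivate a separate authors table later on.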

All metadata that does not fit the standard columns is stored as jsonb. This column should be type-annotated with a typing.Union of pydantic models describing the possible shapes the metadata object can take (for our sanity and for documentation). Note that postgres also allows indices on jsonb (see docs).
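A minimal sketch of what such a union of shapes could look like. To stay dependency-free it uses dataclasses as a stand-in for pydantic models; the class names (`SourceAMeta`, `SourceBMeta`), their fields, and the `kind` discriminator key are all assumptions made for illustration, not the actual nacsos_data definitions.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Union


@dataclass
class SourceAMeta:
    # Hypothetical metadata shape for records imported from one source.
    record_id: str
    cited_by: Optional[int] = None


@dataclass
class SourceBMeta:
    # Hypothetical metadata shape for records imported from another source.
    accession: str
    subject_areas: List[str] = field(default_factory=list)


# The annotation for the jsonb column: one of the known shapes.
AcademicMeta = Union[SourceAMeta, SourceBMeta]


def parse_meta(raw: Dict[str, Any]) -> AcademicMeta:
    # Pick the shape based on a discriminator key stored with the object
    # (the `kind` key is an assumption of this sketch).
    kind = raw.get("kind")
    if kind == "source_a":
        return SourceAMeta(record_id=raw["record_id"], cited_by=raw.get("cited_by"))
    if kind == "source_b":
        return SourceBMeta(
            accession=raw["accession"],
            subject_areas=list(raw.get("subject_areas", [])),
        )
    raise ValueError(f"Unknown meta shape: {kind!r}")


meta = parse_meta({"kind": "source_a", "record_id": "abc-1", "cited_by": 12})
```

Listing the shapes explicitly means an unexpected object fails loudly at load time rather than silently ending up in the database.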

Within a Project, each individual scientific document must map to one, and only one, AcademicItem object. This means that when adding a document to a Project, we must search for AcademicItems in that project that have the same or very similar metadata (see deduplication). This also means that the same document will appear multiple times across different Projects as different AcademicItems. Therefore, although the proprietary IDs and doi are in principle unique to an individual document, they will not be unique in our database.
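The rule can be sketched as a check that runs before an item is added. The matching criteria used here (same doi, or same normalised title) are only an assumption for illustration; the actual criteria belong to the deduplication procedure. The `AcademicItem` class below is a simplified stand-in, and the doi value in the usage lines is a placeholder.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AcademicItem:
    # Simplified stand-in; the real model lives in nacsos_data.
    project_id: str
    title: str
    doi: Optional[str] = None


def normalise_title(title: str) -> str:
    # Lowercase and collapse every run of non-alphanumerics to one space.
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def find_duplicate(existing: List[AcademicItem], candidate: AcademicItem) -> Optional[AcademicItem]:
    for item in existing:
        # Only items of the same project count as duplicates.
        if item.project_id != candidate.project_id:
            continue
        same_doi = candidate.doi is not None and item.doi == candidate.doi
        same_title = normalise_title(item.title) == normalise_title(candidate.title)
        if same_doi or same_title:
            return item
    return None


def add_to_project(store: List[AcademicItem], candidate: AcademicItem) -> AcademicItem:
    # Reuse the existing item if the document is already in the project.
    duplicate = find_duplicate(store, candidate)
    if duplicate is not None:
        return duplicate
    store.append(candidate)
    return candidate


store: List[AcademicItem] = []
first = add_to_project(store, AcademicItem("p1", "A Title!", "10.0000/placeholder"))
again = add_to_project(store, AcademicItem("p1", "a title"))
other = add_to_project(store, AcademicItem("p2", "A Title!", "10.0000/placeholder"))
```

After these calls the store holds two items with the same doi, one per project, which is why the doi cannot carry a database-wide uniqueness constraint.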

Multi-part items (RFC)

Background

The fundamental question is how to deal with different use cases. For papers, for example, the text could be the abstract, the title, the full text, or individual paragraphs of the full text. One option is that, depending on context, the same item (?) points to different texts. Alternatively, we treat the specific item as the unique reference and Item as the context-sensitive one; this would lead to a lot of repeated data, though.

Proposal

For simplicity, this proposal looks at the use case of parliamentary speeches (but the concept should be applicable to the other use cases described above).

  • Each session in parliament has a transcript
  • Transcripts consist of multiple parts (in order)
  • Each part is a contribution by an entity (e.g. a speech or comment)
  • Both transcript and part inherit from BaseItem
  • part has an FK to transcript
  • transcript contains the full-text transcript (yes, this duplicates the text in the parts)
  • transcript is referred to when, e.g., labelling the entire transcript
  • part is referred to when labels refer to that specific contribution
  • In the interface, we can decide to show parts individually (with all their specific metadata, while still keeping the context of the transcript)
  • In the interface, e.g. when labelling, the interface has to handle where to assign the label (transcript item or part item)
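The proposed structure can be sketched with plain dataclasses. This is not the ORM mapping itself; the field names (`position`, `speaker`), the ID counter and the helpers `build_transcript` and `label_target` are hypothetical and only illustrate how transcript and part relate and how a label target might be resolved.

```python
import itertools
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class BaseItem:
    item_id: int
    text: str


@dataclass
class Part(BaseItem):
    transcript_id: int  # would be a foreign key to the transcript
    position: int       # order of the contribution within the transcript
    speaker: str


@dataclass
class Transcript(BaseItem):
    # `text` holds the full transcript, duplicating the text of the parts.
    parts: List[Part] = field(default_factory=list)


_ids = itertools.count(1)  # stand-in for database-generated IDs


def build_transcript(contributions: List[Tuple[str, str]]) -> Transcript:
    # `contributions` is an ordered list of (speaker, text) pairs.
    transcript_id = next(_ids)
    parts = [
        Part(item_id=next(_ids), text=text, transcript_id=transcript_id,
             position=i, speaker=speaker)
        for i, (speaker, text) in enumerate(contributions)
    ]
    full_text = "\n".join(p.text for p in parts)
    return Transcript(item_id=transcript_id, text=full_text, parts=parts)


def label_target(transcript: Transcript, position: Optional[int] = None) -> int:
    # No position: the label applies to the whole transcript item.
    if position is None:
        return transcript.item_id
    # Otherwise it applies to the item of that specific contribution.
    return transcript.parts[position].item_id


session = build_transcript([("Speaker A", "Hello."), ("Speaker B", "Hear, hear!")])
```

Because both classes share `BaseItem`, a label only needs to store a single item ID, and the interface decides whether that ID belongs to the transcript or to one of its parts.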

Relationships / Duplicates

  • Within a project data should be unique
  • Across projects, data does not need to be unique; in fact, data will be duplicated for each project (if there are overlaps)
  • Ideally, normalisations and standards should still carry across projects (and be back-filled/updated when conventions change)
  • This allows partitioning of the tables along projects
  • If cross-project information sharing is needed, we'll run a script that finds pairwise duplicates (e.g. by twitter_id) and thus indirectly allow access to annotations from both projects
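A script of the kind described in the last point could look roughly like the sketch below. The `ItemRef` tuple and the `cross_project_pairs` function are hypothetical; the sketch groups items by `twitter_id` and reports every pair that comes from two different projects.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, NamedTuple, Tuple


class ItemRef(NamedTuple):
    # Minimal reference to an item; the fields are assumptions of this sketch.
    project_id: str
    item_id: int
    twitter_id: str


def cross_project_pairs(items: List[ItemRef]) -> List[Tuple[ItemRef, ItemRef]]:
    # Group items by the shared external key.
    by_key: Dict[str, List[ItemRef]] = defaultdict(list)
    for item in items:
        by_key[item.twitter_id].append(item)

    pairs: List[Tuple[ItemRef, ItemRef]] = []
    for group in by_key.values():
        for a, b in combinations(group, 2):
            # Items within one project are unique, so only pairs that
            # span two projects are of interest.
            if a.project_id != b.project_id:
                pairs.append((a, b))
    return pairs


items = [
    ItemRef("p1", 1, "t100"),
    ItemRef("p2", 7, "t100"),
    ItemRef("p1", 2, "t200"),
]
pairs = cross_project_pairs(items)
```

Each resulting pair links two project-local items that describe the same tweet, so annotations made in one project can be looked up from the other.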

Querying tips

  • https://docs.sqlalchemy.org/en/20/orm/queryguide/inheritance.html
  • https://docs.sqlalchemy.org/en/20/orm/inheritance.html

Further reading