OPKC Data Standard

The Open Pathogen Kinetics Commons is built on a simple, flexible data standard designed to maximize data contribution while ensuring the core variables for kinetic modeling are always present.

The standard groups data fields into three tiers: "Essential," "Encouraged," and "Accepted." This structure lets us incorporate datasets from a wide variety of sources, from detailed clinical studies to broader surveillance programs.

Data Field Definitions

Essential Fields

These fields are the bare minimum required for a dataset to be included in the database. They establish the fundamental links between a patient, a time point, and a measurement.

Pathogen

The specific pathogen, e.g., "SARS-CoV-2".

Time

Time of sample collection, in days, relative to an event (e.g., symptom onset).

Pathogen Load

The measured pathogen load. We prefer Log10(copies/mL).

Platform Type

Category of platform used to measure pathogen load (e.g., "RT-qPCR", "rapid antigen").

Units

The units of the reported pathogen load, e.g., "Ct", "Log10 copies/mL".

DOI

Digital Object Identifier for the source publication.

Patient Identifier

An anonymized, unique ID for each patient.

Patient Species

The species of the patient (e.g., "Homo sapiens").
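
As a rough illustration of how the Essential Fields could be checked before a dataset is submitted, here is a minimal Python sketch. The column names (such as "PathogenLoad" and "PatientID") are assumptions made for this example; the standard names the fields but does not fix their spelling in a file.

```python
# Hypothetical column names for the eight Essential Fields.
# The exact spelling used in OPKC files is an assumption here.
ESSENTIAL_FIELDS = (
    "Pathogen", "Time", "PathogenLoad", "PlatformType",
    "Units", "DOI", "PatientID", "PatientSpecies",
)


def missing_essential_fields(record):
    """Return the Essential Fields that are absent or blank in one record."""
    missing = []
    for field in ESSENTIAL_FIELDS:
        value = record.get(field)
        # Treat None and empty/whitespace-only strings as missing.
        # A numeric 0 (e.g., Time = 0) is a valid value and is kept.
        if value is None or str(value).strip() == "":
            missing.append(field)
    return missing


# Made-up record for illustration only.
example = {
    "Pathogen": "SARS-CoV-2",
    "Time": 3,
    "PathogenLoad": 6.2,
    "PlatformType": "RT-qPCR",
    "Units": "Log10 copies/mL",
    "DOI": "",  # left blank on purpose to show a failed check
    "PatientID": "P001",
    "PatientSpecies": "Homo sapiens",
}

print(missing_essential_fields(example))  # ['DOI']
```

A record that passes this check has everything needed to place one measurement on one patient's timeline and trace it back to its source.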

Encouraged Fields

These fields provide critical context that enables more powerful and stratified analyses. We strongly encourage contributors to include these fields whenever possible.

StudyID

A short, unique identifier for the study (e.g., "ke2022").

InfectionID

An identifier for the infection, useful when the same patient undergoes multiple infections.

Sample Source

The anatomical source of the sample (e.g., "nasal", "saliva").

Sample Method

How the sample was collected (e.g., "swab", "wash").

GE/mL Conversion

Parameters (e.g., intercept, slope) to convert from Ct to copies/mL.

Targets

The specific gene target(s) of the assay (e.g., "N gene", "S gene").

Platform Tech

The specific underlying technology of the assay (e.g., "Alinity", "cobas").
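
To show how the GE/mL Conversion parameters listed above might be applied, here is a minimal Python sketch. It assumes a linear relationship of the form log10(copies/mL) = intercept + slope * Ct; the standard does not state the exact form, and some studies report the inverse calibration curve instead, so the contributor's own definition should take precedence. The parameter values are made up.

```python
def ct_to_log10_copies(ct, intercept, slope):
    """Convert a Ct value to log10(copies/mL).

    Assumes the linear form log10(copies/mL) = intercept + slope * Ct.
    This form is an assumption for illustration, not part of the standard.
    """
    return intercept + slope * ct


# Made-up calibration parameters; real values come from each study's
# standard curve. The slope is negative because a higher Ct means
# less starting material.
intercept = 12.0
slope = -0.25

for ct in (20.0, 30.0):
    print(ct, ct_to_log10_copies(ct, intercept, slope))
```

Storing the parameters alongside the raw Ct values, rather than only the converted loads, keeps the original measurement available if a calibration is later revised.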

Accepted Fields

These fields represent valuable clinical and demographic metadata. While not required, they are gratefully accepted and will be incorporated into the database to allow for highly specific data filtering.

Age

Patient age. We prefer a range (AgeRng1, AgeRng2) for privacy.

Symptoms

Presence, absence, or description of symptoms (e.g., "fever", "cough").

Treatments

Any treatments administered (e.g., "Paxlovid", "Remdesivir").

Comorbidities

Pre-existing conditions (e.g., "diabetes", "hypertension").

Hospitalization

Indicator of whether the patient was hospitalized.

Subtype

Pathogen subtype or variant (e.g., "A/H1N1", "BA.1").

Database Architecture

Our goal is to eventually transition this flat-file commons into a queryable relational database. A preliminary sketch of this architecture is shown below. This structure will allow for more efficient querying and linking of studies, individuals, and samples.

Figure: Database Schema Diagram
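
One possible reading of the studies, individuals, and samples structure can be sketched with Python's built-in sqlite3 module. All table and column names below are hypothetical and are not taken from the OPKC schema diagram; the DOI and the measurements are placeholders.

```python
import sqlite3

# In-memory database so the sketch leaves no files behind.
conn = sqlite3.connect(":memory:")

# Hypothetical three-table layout: study -> individual -> sample.
conn.executescript("""
CREATE TABLE study (
    study_id TEXT PRIMARY KEY,
    doi      TEXT NOT NULL
);
CREATE TABLE individual (
    patient_id TEXT PRIMARY KEY,
    study_id   TEXT NOT NULL REFERENCES study(study_id),
    species    TEXT NOT NULL
);
CREATE TABLE sample (
    sample_id     INTEGER PRIMARY KEY,
    patient_id    TEXT NOT NULL REFERENCES individual(patient_id),
    pathogen      TEXT NOT NULL,
    time_days     REAL NOT NULL,
    pathogen_load REAL NOT NULL,
    units         TEXT NOT NULL,
    platform_type TEXT NOT NULL
);
""")

# Placeholder rows for illustration only.
conn.execute("INSERT INTO study VALUES (?, ?)", ("study_a", "10.0000/placeholder"))
conn.execute("INSERT INTO individual VALUES (?, ?, ?)",
             ("P001", "study_a", "Homo sapiens"))
conn.executemany(
    "INSERT INTO sample (patient_id, pathogen, time_days, pathogen_load, "
    "units, platform_type) VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("P001", "SARS-CoV-2", 3.0, 7.1, "Log10 copies/mL", "RT-qPCR"),
        ("P001", "SARS-CoV-2", 1.0, 5.5, "Log10 copies/mL", "RT-qPCR"),
    ],
)

# Link samples back to their study and return them in time order.
rows = conn.execute("""
    SELECT st.study_id, i.patient_id, s.time_days, s.pathogen_load
    FROM sample AS s
    JOIN individual AS i ON i.patient_id = s.patient_id
    JOIN study AS st ON st.study_id = i.study_id
    ORDER BY i.patient_id, s.time_days
""").fetchall()

for row in rows:
    print(row)
```

Splitting the flat file this way stores study-level and patient-level attributes once, and a join recovers the flat view when needed.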

Our Rationale

This data standard is a pragmatic balance between comprehensiveness and usability. We want to capture the rich biological, clinical, and epidemiological context of each sample, but we also recognize that overly complex standards are a barrier to contribution.

Why these fields?

Our Essential Fields are the minimum set required to plot a single kinetic trajectory and link it to its source. Our Encouraged Fields were chosen because they are the most common and powerful variables used for stratification (e.g., comparing viral loads by sample source, or by assay target). The Accepted Fields provide deeper clinical context that is invaluable for specific research questions, such as linking viral kinetics to disease severity.
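
The tiers described above can be illustrated with a small Python sketch: Essential Fields alone are enough to assemble a per-patient trajectory, and a single Encouraged Field such as Sample Source allows a simple stratified summary. The field names and values are hypothetical, and the sketch assumes all loads are already in the same units.

```python
from collections import defaultdict


def build_trajectories(records):
    """Group records by patient and sort each patient's points by time."""
    by_patient = defaultdict(list)
    for r in records:
        by_patient[r["PatientID"]].append((r["Time"], r["PathogenLoad"]))
    return {pid: sorted(points) for pid, points in by_patient.items()}


def peak_load_by(records, field):
    """Highest observed load within each value of a stratifying field."""
    peaks = {}
    for r in records:
        key = r.get(field, "unknown")  # records lacking the field are pooled
        peaks[key] = max(peaks.get(key, float("-inf")), r["PathogenLoad"])
    return peaks


# Made-up records; loads are assumed to share the same units.
records = [
    {"PatientID": "P001", "Time": 3, "PathogenLoad": 7.1, "SampleSource": "nasal"},
    {"PatientID": "P001", "Time": 1, "PathogenLoad": 5.5, "SampleSource": "nasal"},
    {"PatientID": "P002", "Time": 2, "PathogenLoad": 4.8, "SampleSource": "saliva"},
]

print(build_trajectories(records))
print(peak_load_by(records, "SampleSource"))
```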

What's omitted?

We have intentionally omitted fields that are difficult to standardize, such as detailed clinical symptom scoring, or fields that are often protected by privacy regulations and require complex data use agreements, such as specific locations or full dates. Our focus is on creating a public-ready dataset that can be shared and used broadly with minimal friction.