OPKC Data Standard
The Open Pathogen Kinetics Commons is built on a simple, flexible data standard designed to maximize data contribution while ensuring the core variables for kinetic modeling are always present.
The standard groups data fields into three tiers: "Essential," "Encouraged," and "Accepted." This structure allows us to incorporate datasets from a wide variety of sources, from detailed clinical studies to broader surveillance programs.
Data Field Definitions
Essential Fields
These fields are the bare minimum required for a dataset to be included in the database. They establish the fundamental links between a patient, a time point, and a measurement.
Pathogen
The specific pathogen, e.g., "SARS-CoV-2".
Time
Time of sample collection, in days, relative to an event (e.g., symptom onset).
Pathogen Load
The measured pathogen load, preferably reported as Log10 copies/mL.
Platform Type
Category of platform used to measure pathogen load (e.g., "RT-qPCR", "rapid antigen").
Units
The units for the pathogen load (e.g., "Ct", "Log10 copies/mL").
DOI
Digital Object Identifier for the source publication.
Patient Identifier
An anonymized, unique ID for each patient.
Patient Species
The species of the patient (e.g., "Homo sapiens").
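As a rough illustration of how the Essential tier could be enforced at submission time, the sketch below checks one record for missing or empty essential fields. The column spellings (for example "PathogenLoad" and "PatientID") and the function name are assumptions made for this example; the standard does not prescribe exact column names here.

```python
# Hypothetical column names; the standard does not prescribe exact spellings.
ESSENTIAL_FIELDS = (
    "Pathogen",
    "Time",
    "PathogenLoad",
    "PlatformType",
    "Units",
    "DOI",
    "PatientID",
    "PatientSpecies",
)


def missing_essential_fields(record):
    """Return the essential fields that are absent or empty in one record."""
    missing = []
    for field in ESSENTIAL_FIELDS:
        value = record.get(field)
        # Treat None and blank strings as missing; a numeric 0 is a valid value.
        if value is None or str(value).strip() == "":
            missing.append(field)
    return missing


example = {
    "Pathogen": "SARS-CoV-2",
    "Time": 3.0,  # days relative to symptom onset
    "PathogenLoad": 6.2,
    "PlatformType": "RT-qPCR",
    "Units": "Log10 copies/mL",
    "DOI": "",  # left blank on purpose to show a failed check
    "PatientID": "P001",
    "PatientSpecies": "Homo sapiens",
}

print(missing_essential_fields(example))  # ['DOI']
```

A record that returns an empty list has every essential field populated; whether the values themselves are plausible would need separate checks.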
Encouraged Fields
These fields provide critical context that enables more powerful and stratified analyses. We strongly encourage contributors to include these fields whenever possible.
StudyID
A short, unique identifier for the study (e.g., "ke2022").
InfectionID
An identifier for the infection, useful when the same patient undergoes multiple infections.
Sample Source
The anatomical source of the sample (e.g., "nasal", "saliva").
Sample Method
How the sample was collected (e.g., "swab", "wash").
GE/mL Conversion
Parameters (e.g., intercept, slope) to convert from Ct to copies/mL.
Targets
The specific gene target(s) of the assay (e.g., "N gene", "S gene").
Platform Tech
The specific underlying technology of the assay (e.g., "Alinity", "cobas").
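To make the GE/mL Conversion field more concrete, here is a minimal sketch of how an intercept and slope might be applied. It assumes a linear standard curve of the form Ct = intercept + slope × Log10(copies/mL); the standard does not state which functional form contributors should use, so both this form and the function name are assumptions, and the parameter values are illustrative only.

```python
def ct_to_log10_copies(ct, intercept, slope):
    """Invert an assumed linear standard curve:
    Ct = intercept + slope * log10(copies/mL)."""
    if slope == 0:
        raise ValueError("slope must be non-zero")
    return (ct - intercept) / slope


# Illustrative parameters only, not taken from any real assay.
log10_load = ct_to_log10_copies(ct=25.0, intercept=40.0, slope=-3.0)
print(log10_load)  # 5.0
```

Under this assumed form the slope is negative, because lower Ct values correspond to higher pathogen loads.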
Accepted Fields
These fields represent valuable clinical and demographic metadata. While not required, they are gratefully accepted and will be incorporated into the database to allow for highly specific data filtering.
Age
Patient age. We prefer a range (AgeRng1, AgeRng2) for privacy.
Symptoms
Presence, absence, or description of symptoms (e.g., "fever", "cough").
Treatments
Any treatments administered (e.g., "Paxlovid", "Remdesivir").
Comorbidities
Pre-existing conditions (e.g., "diabetes", "hypertension").
Hospitalization
Indicator of whether the patient was hospitalized.
Subtype
Pathogen variant or subtype (e.g., "A/H1N1", "BA.1").
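For contributors who hold exact ages, one possible way to produce the preferred age range is sketched below. The 10-year bin width, the inclusive upper bound, and the function name are assumptions for illustration; the standard only states that a range (AgeRng1, AgeRng2) is preferred.

```python
def age_to_range(age_years, width=10):
    """Map an exact age to an inclusive (AgeRng1, AgeRng2) bin of the given width."""
    if age_years < 0:
        raise ValueError("age must be non-negative")
    if width <= 0:
        raise ValueError("width must be positive")
    lower = int(age_years // width) * width
    return lower, lower + width - 1


print(age_to_range(34))  # (30, 39)
```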
Database Architecture
Our goal is to eventually transition this flat-file commons into a queryable relational database. A preliminary sketch of this architecture is shown below. This structure will allow for more efficient querying and linking of studies, individuals, and samples.
[Figure: Database schema diagram]
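The relational layout described above could look roughly like the following sketch, which uses Python's built-in sqlite3 module with an in-memory database. The table names, column names, and inserted values are all assumptions made for illustration (the DOI is a placeholder string) and do not reproduce the project's actual schema.

```python
import sqlite3

# Table and column names are assumptions for illustration,
# not the project's actual schema.
SCHEMA = """
CREATE TABLE study (
    study_id TEXT PRIMARY KEY,
    doi      TEXT NOT NULL
);
CREATE TABLE individual (
    individual_id INTEGER PRIMARY KEY,
    study_id      TEXT NOT NULL REFERENCES study(study_id),
    patient_id    TEXT NOT NULL,
    species       TEXT NOT NULL
);
CREATE TABLE sample (
    sample_id     INTEGER PRIMARY KEY,
    individual_id INTEGER NOT NULL REFERENCES individual(individual_id),
    pathogen      TEXT NOT NULL,
    time_days     REAL NOT NULL,
    pathogen_load REAL NOT NULL,
    units         TEXT NOT NULL,
    platform_type TEXT NOT NULL
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Made-up rows: one study, one individual, two samples.
conn.execute("INSERT INTO study VALUES (?, ?)", ("ke2022", "placeholder-doi"))
conn.execute(
    "INSERT INTO individual VALUES (?, ?, ?, ?)",
    (1, "ke2022", "P001", "Homo sapiens"),
)
conn.executemany(
    "INSERT INTO sample VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (1, 1, "SARS-CoV-2", 1.0, 5.1, "Log10 copies/mL", "RT-qPCR"),
        (2, 1, "SARS-CoV-2", 3.0, 7.4, "Log10 copies/mL", "RT-qPCR"),
    ],
)

# Link samples back to their individual and study in one query.
rows = conn.execute(
    """
    SELECT st.study_id, i.patient_id, s.time_days, s.pathogen_load
    FROM sample AS s
    JOIN individual AS i ON i.individual_id = s.individual_id
    JOIN study AS st ON st.study_id = i.study_id
    ORDER BY s.time_days
    """
).fetchall()
print(rows)
```

Splitting the flat file this way stores study-level and individual-level metadata once, with each sample row pointing back to it through a key.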
Our Rationale
This data standard is a pragmatic balance between comprehensiveness and usability. We want to capture the rich biological, clinical, and epidemiological context of each sample, but we also recognize that overly complex standards are a barrier to contribution.
Why these fields?
Our Essential Fields are the minimum set required to plot a single kinetic trajectory and link it to its source. Our Encouraged Fields were chosen because they are the most common and powerful variables used for stratification (e.g., comparing pathogen loads by sample source or by assay target). The Accepted Fields provide deeper clinical context that is invaluable for specific research questions, such as linking pathogen kinetics to disease severity.
What's omitted?
We have intentionally omitted fields that are difficult to standardize, such as detailed clinical symptom scoring, or fields that are often protected by privacy regulations and require complex data use agreements, such as specific locations or full dates. Our focus is on creating a public-ready dataset that can be shared and used broadly with minimal friction.