Run Metadata

run.parquet stores one row per data acquisition run, capturing the technical and instrument properties of each MS run. Each row links to one or more biological samples through a samples list, which supports TMT/iTRAQ multiplexing where a single run contains multiple labeled channels. The file is derived from the community SDRF (Sample and Data Relationship Format) standard.

See the full YAML schema in run.yaml.


Instrument, Enzyme, and Dissociation Fields

Instrument, enzyme, and dissociation method fields store plain strings (human-readable names). CV accession mappings are stored separately in ontology.parquet, keeping run.parquet simple and consistent with sample.parquet.


MODIFICATION Structure

Modification parameters use a dedicated struct with explicit fields, since modifications have well-known properties (fixed/variable, position, target residue) that benefit from a typed schema:

import pyarrow as pa

MODIFICATION = pa.struct([
    pa.field("accession", pa.string()),                        # UNIMOD accession (required), e.g. "UNIMOD:21"
    pa.field("name", pa.string(), nullable=True),               # human-readable name, e.g. "Phospho"
    pa.field("fixed", pa.bool_(), nullable=True),               # true = fixed, false = variable
    pa.field("position", pa.string(), nullable=True),           # e.g. "Anywhere", "Protein N-term"
    pa.field("target_amino_acid", pa.string(), nullable=True),  # e.g. "S,T,Y", "C"
])

PyArrow Schema

import pyarrow as pa

# (MODIFICATION defined above)

run_schema = pa.schema([
    # Identity -- maps to SDRF "assay name"
    pa.field("run_accession", pa.string()),             # PK
    pa.field("run_file_name", pa.string()),              # raw data file name (without extension)

    # Sample-channel mapping (one run can contain multiple samples via TMT/iTRAQ)
    pa.field("samples", pa.list_(pa.struct([
        pa.field("sample_accession", pa.string()),      # FK to sample.parquet
        pa.field("label", pa.string()),                 # channel label, e.g. "TMT126", "LFQ"
        pa.field("biological_replicate", pa.int32(), nullable=True),
        pa.field("technical_replicate", pa.int32(), nullable=True),
    ]))),

    # Run-level properties
    pa.field("fraction", pa.string(), nullable=True),

    # Instrument and method (plain strings; CV mappings in ontology.parquet)
    pa.field("instrument", pa.string(), nullable=True),
    pa.field("enzymes", pa.list_(pa.string()), nullable=True),
    pa.field("dissociation_method", pa.string(), nullable=True),

    # Modification parameters -- list of typed modification structs
    pa.field("modification_parameters", pa.list_(MODIFICATION), nullable=True),
])

Field Reference

| Field | Description | Type | Required |
| --- | --- | --- | --- |
| run_accession | Unique run identifier (= SDRF assay name) | string | yes |
| run_file_name | Raw data file name (without extension) | string | yes |
| file_name | Original file name with extension (e.g. S1_Frontal_1.raw) | string | no |
| samples | List of sample-channel mappings with replicate info (FK to sample.parquet) | list[struct] | yes |
| fraction | Fraction identifier | string | no |
| instrument | Mass spectrometer name | string | no |
| enzymes | Proteolytic enzyme name(s) | list[string] | no |
| dissociation_method | Fragmentation method name (e.g. HCD, CID, ETD) | string | no |
| modification_parameters | Modifications configured in the database search | list[MODIFICATION] | no |

Samples list detail

Each element in the samples list contains:

| Subfield | Description | Type |
| --- | --- | --- |
| sample_accession | FK to sample.parquet | string |
| label | Channel label (e.g. TMT126, TMT127N, LFQ) | string |
| biological_replicate | Biological replicate number | int32 (nullable) |
| technical_replicate | Technical replicate number | int32 (nullable) |

Why replicates live in the samples struct

biological_replicate and technical_replicate are stored per sample-run pair rather than at the run or sample level. This makes the samples list a self-contained design matrix: each entry fully describes one sample's role in one run, without requiring joins to additional tables.
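
To illustrate the design-matrix idea, here is a small sketch that flattens run rows (as plain Python dicts, e.g. from `Table.to_pylist()`) into one record per sample-run pair. The helper name `design_matrix` and the example values are hypothetical, not part of the specification.

```python
def design_matrix(runs):
    """Flatten run rows into one record per (run, sample) pair."""
    rows = []
    for run in runs:
        for entry in run["samples"]:
            # Each samples entry already carries label and replicate info,
            # so only the run identifier needs to be attached.
            rows.append({"run_accession": run["run_accession"], **entry})
    return rows


# Invented example: one run holding two labeled channels.
example_runs = [
    {
        "run_accession": "Run_01",
        "samples": [
            {"sample_accession": "Sample_01", "label": "TMT126",
             "biological_replicate": 1, "technical_replicate": 1},
            {"sample_accession": "Sample_02", "label": "TMT127N",
             "biological_replicate": 2, "technical_replicate": 1},
        ],
    },
]
matrix = design_matrix(example_runs)
for record in matrix:
    print(record)
```

No lookup against another table is involved: every column of the resulting design matrix comes from the run row itself.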

Label-free vs TMT

In a label-free experiment, each run maps to one sample with label "LFQ":

{"samples": [
    {"sample_accession": "Sample_01", "label": "LFQ",
     "biological_replicate": 1, "technical_replicate": 1}
]}

In a TMT-10plex experiment, each run maps to 10 samples:

{"samples": [
    {"sample_accession": "Sample_01", "label": "TMT126",
     "biological_replicate": 1, "technical_replicate": 1},
    {"sample_accession": "Sample_02", "label": "TMT127N",
     "biological_replicate": 2, "technical_replicate": 1},
    {"sample_accession": "Sample_03", "label": "TMT127C",
     "biological_replicate": 3, "technical_replicate": 1},
    ...
]}
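
The two layouts can be handled by the same code path. The sketch below maps each channel label to its sample for a single run; the helper name `channel_map` is hypothetical, and the check that a label appears at most once per run is an assumption made for this example rather than a rule stated by the specification.

```python
def channel_map(run):
    """Return {label: sample_accession} for one run row (a plain dict)."""
    mapping = {}
    for entry in run["samples"]:
        label = entry["label"]
        # Assumption for this sketch: a channel label is unique within a run.
        if label in mapping:
            raise ValueError(f"duplicate label {label!r} in run")
        mapping[label] = entry["sample_accession"]
    return mapping


lfq_run = {"samples": [
    {"sample_accession": "Sample_01", "label": "LFQ",
     "biological_replicate": 1, "technical_replicate": 1},
]}
tmt_run = {"samples": [
    {"sample_accession": "Sample_01", "label": "TMT126",
     "biological_replicate": 1, "technical_replicate": 1},
    {"sample_accession": "Sample_02", "label": "TMT127N",
     "biological_replicate": 2, "technical_replicate": 1},
]}

lfq_channels = channel_map(lfq_run)
tmt_channels = channel_map(tmt_run)
# A run with more than one channel is multiplexed.
is_multiplexed = len(tmt_channels) > 1
```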

MODIFICATION detail

Each MODIFICATION struct contains:

| Subfield | Description | Type | Required |
| --- | --- | --- | --- |
| accession | UNIMOD accession (e.g. UNIMOD:21, UNIMOD:4) | string | yes |
| name | Human-readable name (e.g. Phospho, Carbamidomethyl) | string | no |
| fixed | true = fixed modification, false = variable modification | bool | no |
| position | Where the modification can occur (e.g. Anywhere, Protein N-term) | string | no |
| target_amino_acid | Target residue(s), comma-separated (e.g. S,T,Y, C) | string | no |
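
Because `fixed` and `target_amino_acid` are both nullable, consumers may want to handle the null case explicitly. The sketch below groups modifications by the `fixed` flag and splits the comma-separated residues into a list; the helper name `split_modifications` and the `unspecified` bucket are choices made for this example, not part of the specification.

```python
def split_modifications(mods):
    """Group modification dicts by the nullable `fixed` flag."""
    groups = {"fixed": [], "variable": [], "unspecified": []}
    for mod in mods:
        # target_amino_acid is a comma-separated string and may be null.
        raw = mod.get("target_amino_acid") or ""
        residues = [r.strip() for r in raw.split(",") if r.strip()]
        entry = (mod["accession"], residues)
        if mod.get("fixed") is True:
            groups["fixed"].append(entry)
        elif mod.get("fixed") is False:
            groups["variable"].append(entry)
        else:
            # `fixed` is optional, so a null value gets its own bucket.
            groups["unspecified"].append(entry)
    return groups


example_mods = [
    {"accession": "UNIMOD:4", "name": "Carbamidomethyl", "fixed": True,
     "position": "Anywhere", "target_amino_acid": "C"},
    {"accession": "UNIMOD:21", "name": "Phospho", "fixed": False,
     "position": "Anywhere", "target_amino_acid": "S,T,Y"},
    {"accession": "UNIMOD:35"},
]
groups = split_modifications(example_mods)
```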

Example Data

Modification parameters:

[
    {
        "accession": "UNIMOD:4",
        "name": "Carbamidomethyl",
        "fixed": true,
        "position": "Anywhere",
        "target_amino_acid": "C"
    },
    {
        "accession": "UNIMOD:21",
        "name": "Phospho",
        "fixed": false,
        "position": "Anywhere",
        "target_amino_acid": "S,T,Y"
    }
]

Enzymes:

["Trypsin"]

Instrument:

"Q Exactive HF"

Dissociation method:

"HCD"

CV accession resolution

CV accessions for instrument, enzyme, and dissociation method names are stored in ontology.parquet, not in run.parquet. This keeps run.parquet simple and consistent with sample.parquet.

Relationship to per-PSM modifications

The modification_parameters list in run.parquet captures what was configured in the search engine. For per-peptide modification evidence with localization scores, see the Modifications specification.

Reading run.parquet

import pyarrow.compute as pc
import pyarrow.parquet as pq

runs = pq.read_table("PXD014414.run.parquet")
print(runs.column_names)
# ['run_accession', 'run_file_name', 'samples', 'fraction',
#  'instrument', 'enzymes', ...]

# Get all sample-channel mappings for a specific run.
# Compute functions live in pyarrow.compute, not pyarrow.parquet.
run_row = runs.filter(
    pc.equal(runs.column("run_accession"), "Run_01")
).to_pydict()
print(run_row["samples"])

Querying with DuckDB

-- All runs with their instruments
SELECT run_accession, instrument
FROM 'PXD014414.run.parquet';

-- Unnest samples to get the full design matrix
SELECT r.run_accession, s.sample_accession, s.label,
       s.biological_replicate, s.technical_replicate
FROM 'PXD014414.run.parquet' r, UNNEST(r.samples) AS s;

-- Join runs with samples to get full metadata
SELECT r.run_accession, r.run_file_name,
       rs.biological_replicate, rs.technical_replicate,
       s.organism, s.disease
FROM 'PXD014414.run.parquet' r,
     UNNEST(r.samples) AS rs,
     'PXD014414.sample.parquet' s
WHERE rs.sample_accession = s.sample_accession;