Run Metadata

run.parquet stores one row per data acquisition run, capturing the technical and instrument properties of each MS run. Each row links to one or more biological samples through a samples list, which supports TMT/iTRAQ multiplexing where a single run contains multiple labeled channels. The file is derived from the community SDRF (Sample and Data Relationship Format) standard.

See the full YAML schema in run.yaml.

Instrument, Enzyme, and Dissociation Fields

Instrument, enzyme, and dissociation method fields store plain strings (human-readable names). CV accession mappings are stored separately in ontology.parquet, keeping run.parquet simple and consistent with sample.parquet.

MODIFICATION Structure

Modification parameters use a dedicated struct with explicit fields, since modifications have well-known properties (fixed/variable, position, target residue) that benefit from a typed schema:

import pyarrow as pa

MODIFICATION = pa.struct([
    pa.field("accession", pa.string()),                        # UNIMOD accession (required), e.g. "UNIMOD:21"
    pa.field("name", pa.string(), nullable=True),               # human-readable name, e.g. "Phospho"
    pa.field("fixed", pa.bool_(), nullable=True),               # true = fixed, false = variable
    pa.field("position", pa.string(), nullable=True),           # e.g. "Anywhere", "Protein N-term"
    pa.field("target_amino_acid", pa.string(), nullable=True),  # e.g. "S,T,Y", "C"
])

PyArrow Schema

import pyarrow as pa

# (MODIFICATION is the struct defined in the MODIFICATION Structure section above)

run_schema = pa.schema([
    # Identity -- maps to SDRF "assay name"
    pa.field("run_accession", pa.string()),             # PK
    pa.field("run_file_name", pa.string()),              # raw data file name (without extension)

    # Sample-channel mapping (one run can contain multiple samples via TMT/iTRAQ)
    pa.field("samples", pa.list_(pa.struct([
        pa.field("sample_accession", pa.string()),      # FK to sample.parquet
        pa.field("label", pa.string()),                 # channel label, e.g. "TMT126", "LFQ"
        pa.field("biological_replicate", pa.int32(), nullable=True),
        pa.field("technical_replicate", pa.int32(), nullable=True),
    ]))),

    # Run-level properties
    pa.field("fraction", pa.string(), nullable=True),

    # Instrument and method (plain strings; CV mappings in ontology.parquet)
    pa.field("instrument", pa.string(), nullable=True),
    pa.field("enzymes", pa.list_(pa.string()), nullable=True),
    pa.field("dissociation_method", pa.string(), nullable=True),

    # Modification parameters -- list of typed modification structs
    pa.field("modification_parameters", pa.list_(MODIFICATION), nullable=True),
])

Field Reference

Field	Description	Type	Required
`run_accession`	Unique run identifier (= SDRF `assay name`)	string	yes
`run_file_name`	Raw data file name (without extension)	string	yes
`file_name`	Original file name with extension (e.g. `S1_Frontal_1.raw`)	string	no
`samples`	List of sample-channel mappings with replicate info (FK to sample.parquet)	list[struct]	yes
`fraction`	Fraction identifier	string	no
`instrument`	Mass spectrometer name	string	no
`enzymes`	Proteolytic enzyme name(s)	list[string]	no
`dissociation_method`	Fragmentation method name (e.g. HCD, CID, ETD)	string	no
`modification_parameters`	Modifications configured in the database search	list[MODIFICATION]	no
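
As a rough illustration of the Required column, the sketch below checks a run row (represented as a plain Python dict) for the three required fields. The helper name missing_required_fields and the example rows are hypothetical and not part of the specification; only the field names come from the table above.

```python
# Required top-level fields, as listed in the Field Reference table.
REQUIRED_RUN_FIELDS = ("run_accession", "run_file_name", "samples")


def missing_required_fields(row):
    """Return the required run fields that are absent or None in `row`.

    Hypothetical helper for illustration; not part of the specification.
    """
    return [name for name in REQUIRED_RUN_FIELDS if row.get(name) is None]


# Illustrative rows (values are made up for the example).
complete_row = {
    "run_accession": "Run_01",
    "run_file_name": "S1_Frontal_1",
    "samples": [
        {"sample_accession": "Sample_01", "label": "LFQ",
         "biological_replicate": 1, "technical_replicate": 1}
    ],
    "instrument": "Q Exactive HF",
}
incomplete_row = {"run_accession": "Run_02", "instrument": "Q Exactive HF"}

print(missing_required_fields(complete_row))    # []
print(missing_required_fields(incomplete_row))  # ['run_file_name', 'samples']
```

Optional fields such as fraction or instrument are simply ignored by this check, since the table marks them as not required.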

Samples list detail

Each element in the samples list contains:

Subfield	Description	Type
`sample_accession`	FK to `sample.parquet`	string
`label`	Channel label (e.g. `TMT126`, `TMT127N`, `LFQ`)	string
`biological_replicate`	Biological replicate number	int32 (nullable)
`technical_replicate`	Technical replicate number	int32 (nullable)

Why replicates live in the samples struct

biological_replicate and technical_replicate are stored per sample-run pair rather than at the run or sample level. This makes the samples list a self-contained design matrix: each entry fully describes one sample's role in one run, without requiring joins to additional tables.
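
To make the "self-contained design matrix" idea concrete, here is a minimal sketch that flattens run rows (as plain Python dicts) into one record per sample-run pair. The function name design_matrix and the example runs are assumptions made for illustration; the subfield names are those of the samples struct.

```python
def design_matrix(runs):
    """Flatten run rows into one dict per (run, sample) pair.

    Hypothetical helper for illustration: each output row carries the run
    accession plus every subfield of one samples entry, so no further join
    is needed to see a sample's role in a run.
    """
    rows = []
    for run in runs:
        for entry in run["samples"]:
            rows.append({"run_accession": run["run_accession"], **entry})
    return rows


# Illustrative data: the same sample acquired twice (two technical replicates).
example_runs = [
    {"run_accession": "Run_01", "samples": [
        {"sample_accession": "Sample_01", "label": "LFQ",
         "biological_replicate": 1, "technical_replicate": 1}]},
    {"run_accession": "Run_02", "samples": [
        {"sample_accession": "Sample_01", "label": "LFQ",
         "biological_replicate": 1, "technical_replicate": 2}]},
]

for row in design_matrix(example_runs):
    print(row)
```

Because the replicate numbers sit on the sample-run pair, the two rows for Sample_01 differ only in run_accession and technical_replicate.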

Label-free vs TMT

In a label-free experiment, each run maps to one sample with label "LFQ":

{"samples": [
    {"sample_accession": "Sample_01", "label": "LFQ",
     "biological_replicate": 1, "technical_replicate": 1}
]}

In a TMT-10plex experiment, each run maps to 10 samples:

{"samples": [
    {"sample_accession": "Sample_01", "label": "TMT126",
     "biological_replicate": 1, "technical_replicate": 1},
    {"sample_accession": "Sample_02", "label": "TMT127N",
     "biological_replicate": 2, "technical_replicate": 1},
    {"sample_accession": "Sample_03", "label": "TMT127C",
     "biological_replicate": 3, "technical_replicate": 1},
    ...
]}
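
The two layouts can be told apart programmatically. The sketch below is one possible way to do it on plain Python dicts; the helper names is_label_free and channel_map are hypothetical, and the rule that a label appears at most once per run is an assumption of this example rather than something the specification states.

```python
def is_label_free(run):
    """True when the run maps to exactly one sample with label "LFQ"."""
    samples = run["samples"]
    return len(samples) == 1 and samples[0]["label"] == "LFQ"


def channel_map(run):
    """Map each channel label to its sample accession.

    Assumption (not stated by the specification): a label appears at most
    once per run, so a duplicate is reported as an error.
    """
    mapping = {}
    for entry in run["samples"]:
        label = entry["label"]
        if label in mapping:
            raise ValueError(f"duplicate label {label!r} in {run['run_accession']}")
        mapping[label] = entry["sample_accession"]
    return mapping


# Illustrative rows modeled on the two examples above (first three TMT channels only).
lfq_run = {"run_accession": "Run_01", "samples": [
    {"sample_accession": "Sample_01", "label": "LFQ"}]}
tmt_run = {"run_accession": "Run_02", "samples": [
    {"sample_accession": "Sample_01", "label": "TMT126"},
    {"sample_accession": "Sample_02", "label": "TMT127N"},
    {"sample_accession": "Sample_03", "label": "TMT127C"}]}

print(is_label_free(lfq_run), is_label_free(tmt_run))
print(channel_map(tmt_run))
```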

MODIFICATION detail

Each MODIFICATION struct contains:

Subfield	Description	Type	Required
`accession`	UNIMOD accession (e.g. `UNIMOD:21`, `UNIMOD:4`)	string	yes
`name`	Human-readable name (e.g. `Phospho`, `Carbamidomethyl`)	string	no
`fixed`	`true` = fixed modification, `false` = variable modification	bool	no
`position`	Where the modification can occur (e.g. `Anywhere`, `Protein N-term`)	string	no
`target_amino_acid`	Target residue(s), comma-separated (e.g. `S,T,Y`, `C`)	string	no
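
Since fixed is nullable and target_amino_acid is a comma-separated string, consumers typically need a small amount of parsing. The following is a minimal sketch on plain Python dicts; the function name split_modifications and the "unspecified" bucket for a null fixed value are choices made for this example, not part of the specification.

```python
def split_modifications(mods):
    """Group modifications by their `fixed` flag and split target residues.

    Hypothetical helper for illustration. A null `fixed` value is placed in
    an "unspecified" bucket, which is a choice made for this sketch only.
    """
    groups = {"fixed": [], "variable": [], "unspecified": []}
    for mod in mods:
        fixed = mod.get("fixed")
        if fixed is None:
            key = "unspecified"
        else:
            key = "fixed" if fixed else "variable"
        # "S,T,Y" -> ["S", "T", "Y"]; a null or empty value -> []
        raw = mod.get("target_amino_acid") or ""
        residues = [r.strip() for r in raw.split(",") if r.strip()]
        groups[key].append((mod["accession"], residues))
    return groups


example_mods = [
    {"accession": "UNIMOD:4", "name": "Carbamidomethyl", "fixed": True,
     "position": "Anywhere", "target_amino_acid": "C"},
    {"accession": "UNIMOD:21", "name": "Phospho", "fixed": False,
     "position": "Anywhere", "target_amino_acid": "S,T,Y"},
]

print(split_modifications(example_mods))
```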

Example Data

Modification parameters:

[
    {
        "accession": "UNIMOD:4",
        "name": "Carbamidomethyl",
        "fixed": true,
        "position": "Anywhere",
        "target_amino_acid": "C"
    },
    {
        "accession": "UNIMOD:21",
        "name": "Phospho",
        "fixed": false,
        "position": "Anywhere",
        "target_amino_acid": "S,T,Y"
    }
]

Enzymes:

["Trypsin"]

Instrument:

"Q Exactive HF"

Dissociation method:

"HCD"

CV accession resolution

To resolve the CV accession for an instrument, enzyme, or dissociation method name, look it up in ontology.parquet. run.parquet stores only the plain-string names, which keeps it simple and consistent with sample.parquet.

Relationship to per-PSM modifications

The modification_parameters list in run.parquet captures what was configured in the search engine. For per-peptide modification evidence with localization scores, see the Modifications specification.

Reading run.parquet

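import pyarrow.compute as pc
import pyarrow.parquet as pq

runs = pq.read_table("PXD014414.run.parquet")
print(runs.column_names)
# ['run_accession', 'run_file_name', 'samples', 'fraction',
#  'instrument', 'enzymes', ...]

# Get all sample-channel mappings for a specific run.
# Compute functions live in pyarrow.compute, not pyarrow.parquet.
run_row = runs.filter(
    pc.equal(runs.column("run_accession"), "Run_01")
).to_pydict()
# to_pydict() returns one list per column, so this prints a list
# holding the samples list of each matching run.
print(run_row["samples"])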

Querying with DuckDB

-- All runs with their instruments
SELECT run_accession, instrument
FROM 'PXD014414.run.parquet';

-- Unnest samples to get the full design matrix
SELECT r.run_accession, s.sample_accession, s.label,
       s.biological_replicate, s.technical_replicate
FROM 'PXD014414.run.parquet' r, UNNEST(r.samples) AS s;

-- Join runs with samples to get full metadata
SELECT r.run_accession, r.run_file_name,
       rs.biological_replicate, rs.technical_replicate,
       s.organism, s.disease
FROM 'PXD014414.run.parquet' r,
     UNNEST(r.samples) AS rs,
     'PXD014414.sample.parquet' s
WHERE rs.sample_accession = s.sample_accession;

Sample Metadata -- sample.parquet
Dataset Metadata -- dataset.parquet
SDRF Conversion -- Column mapping and ingestion process
Modifications -- Per-peptide modification evidence
QPX Format Overview