Run Metadata
run.parquet stores one row per data acquisition run, capturing the technical and instrument properties of each MS run. Each row links to one or more biological samples through a samples list, which supports TMT/iTRAQ multiplexing where a single run contains multiple labeled channels. The file is derived from the community SDRF (Sample and Data Relationship Format) standard.
See the full YAML schema in run.yaml.
Instrument, Enzyme, and Dissociation Fields
Instrument, enzyme, and dissociation method fields store plain strings (human-readable names). CV accession mappings are stored separately in ontology.parquet, keeping run.parquet simple and consistent with sample.parquet.
MODIFICATION Structure
Modification parameters use a dedicated struct with explicit fields, since modifications have well-known properties (fixed/variable, position, target residue) that benefit from a typed schema:
MODIFICATION = pa.struct([
    pa.field("accession", pa.string()),                          # UNIMOD accession (required), e.g. "UNIMOD:21"
    pa.field("name", pa.string(), nullable=True),                # human-readable name, e.g. "Phospho"
    pa.field("fixed", pa.bool_(), nullable=True),                # true = fixed, false = variable
    pa.field("position", pa.string(), nullable=True),            # e.g. "Anywhere", "Protein N-term"
    pa.field("target_amino_acid", pa.string(), nullable=True),   # e.g. "S,T,Y", "C"
])
PyArrow Schema
import pyarrow as pa

# (MODIFICATION defined above)
run_schema = pa.schema([
    # Identity -- maps to SDRF "assay name"
    pa.field("run_accession", pa.string()),   # PK
    pa.field("run_file_name", pa.string()),   # raw data file name (without extension)

    # Sample-channel mapping (one run can contain multiple samples via TMT/iTRAQ)
    pa.field("samples", pa.list_(pa.struct([
        pa.field("sample_accession", pa.string()),   # FK to sample.parquet
        pa.field("label", pa.string()),              # channel label, e.g. "TMT126", "LFQ"
        pa.field("biological_replicate", pa.int32(), nullable=True),
        pa.field("technical_replicate", pa.int32(), nullable=True),
    ]))),

    # Run-level properties
    pa.field("fraction", pa.string(), nullable=True),

    # Instrument and method (plain strings; CV mappings in ontology.parquet)
    pa.field("instrument", pa.string(), nullable=True),
    pa.field("enzymes", pa.list_(pa.string()), nullable=True),
    pa.field("dissociation_method", pa.string(), nullable=True),

    # Modification parameters -- list of typed modification structs
    pa.field("modification_parameters", pa.list_(MODIFICATION), nullable=True),
])
Field Reference
| Field | Description | Type | Required |
|---|---|---|---|
| run_accession | Unique run identifier (= SDRF assay name) | string | yes |
| run_file_name | Raw data file name (without extension) | string | yes |
| file_name | Original file name with extension (e.g. S1_Frontal_1.raw) | string | no |
| samples | List of sample-channel mappings with replicate info (FK to sample.parquet) | list[struct] | yes |
| fraction | Fraction identifier | string | no |
| instrument | Mass spectrometer name | string | no |
| enzymes | Proteolytic enzyme name(s) | list[string] | no |
| dissociation_method | Fragmentation method name (e.g. HCD, CID, ETD) | string | no |
| modification_parameters | Modifications configured in the database search | list[MODIFICATION] | no |
Samples list detail
Each element in the samples list contains:
| Subfield | Description | Type |
|---|---|---|
| sample_accession | FK to sample.parquet | string |
| label | Channel label (e.g. TMT126, TMT127N, LFQ) | string |
| biological_replicate | Biological replicate number | int32 (nullable) |
| technical_replicate | Technical replicate number | int32 (nullable) |
Why replicates live in the samples struct
biological_replicate and technical_replicate are stored per sample-run pair rather than at the run or sample level. This makes the samples list a self-contained design matrix: each entry fully describes one sample's role in one run, without requiring joins to additional tables.
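To illustrate the design-matrix idea, here is a small pure-Python sketch that flattens the samples lists into one row per sample-run pair. The function name design_matrix and the two example runs are hypothetical; the input is assumed to be a list of dicts shaped like the rows of run.parquet (for example, what Table.to_pylist() would return).

```python
from typing import Any, Dict, List


def design_matrix(runs: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Flatten each run's samples list into one row per (run, sample) pair."""
    rows = []
    for run in runs:
        # Treat a missing or null samples list as empty.
        for entry in run.get("samples") or []:
            rows.append({
                "run_accession": run["run_accession"],
                "sample_accession": entry["sample_accession"],
                "label": entry["label"],
                # Replicate numbers are nullable, so use .get() for them.
                "biological_replicate": entry.get("biological_replicate"),
                "technical_replicate": entry.get("technical_replicate"),
            })
    return rows


# Illustrative input: the same sample measured in two technical replicates.
runs = [
    {"run_accession": "Run_01", "samples": [
        {"sample_accession": "Sample_01", "label": "LFQ",
         "biological_replicate": 1, "technical_replicate": 1}]},
    {"run_accession": "Run_02", "samples": [
        {"sample_accession": "Sample_01", "label": "LFQ",
         "biological_replicate": 1, "technical_replicate": 2}]},
]

for row in design_matrix(runs):
    print(row)
```

Because the replicate numbers travel with each entry, the same sample can carry a different technical replicate number in each run without touching sample.parquet.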
Label-free vs TMT
In a label-free experiment, each run maps to one sample with label "LFQ":
{"samples": [
  {"sample_accession": "Sample_01", "label": "LFQ",
   "biological_replicate": 1, "technical_replicate": 1}
]}
In a TMT-10plex experiment, each run maps to 10 samples:
{"samples": [
  {"sample_accession": "Sample_01", "label": "TMT126",
   "biological_replicate": 1, "technical_replicate": 1},
  {"sample_accession": "Sample_02", "label": "TMT127N",
   "biological_replicate": 2, "technical_replicate": 1},
  {"sample_accession": "Sample_03", "label": "TMT127C",
   "biological_replicate": 3, "technical_replicate": 1},
  ...
]}
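The two layouts above can be told apart programmatically. The sketch below is one possible check, not part of the format: the function name summarize_channels is hypothetical, and it assumes a label-free run is exactly one entry labeled "LFQ". The duplicate-label check is an extra sanity test added for illustration.

```python
from collections import Counter
from typing import Any, Dict


def summarize_channels(run: Dict[str, Any]) -> Dict[str, Any]:
    """Report channel count, label-free status, and any duplicated labels."""
    labels = [entry["label"] for entry in run["samples"]]
    duplicates = sorted(lab for lab, n in Counter(labels).items() if n > 1)
    return {
        "n_channels": len(labels),
        # Assumption: label-free means a single entry labeled "LFQ".
        "label_free": labels == ["LFQ"],
        "duplicate_labels": duplicates,
    }


lfq_run = {"samples": [
    {"sample_accession": "Sample_01", "label": "LFQ"},
]}
tmt_run = {"samples": [
    {"sample_accession": "Sample_01", "label": "TMT126"},
    {"sample_accession": "Sample_02", "label": "TMT127N"},
    {"sample_accession": "Sample_03", "label": "TMT127C"},
]}

print(summarize_channels(lfq_run))
print(summarize_channels(tmt_run))
```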
MODIFICATION detail
Each MODIFICATION struct contains:
| Subfield | Description | Type | Required |
|---|---|---|---|
| accession | UNIMOD accession (e.g. UNIMOD:21, UNIMOD:4) | string | yes |
| name | Human-readable name (e.g. Phospho, Carbamidomethyl) | string | no |
| fixed | true = fixed modification, false = variable modification | bool | no |
| position | Where the modification can occur (e.g. Anywhere, Protein N-term) | string | no |
| target_amino_acid | Target residue(s), comma-separated (e.g. S,T,Y, C) | string | no |
Example Data
Modification parameters:
[
  {
    "accession": "UNIMOD:4",
    "name": "Carbamidomethyl",
    "fixed": true,
    "position": "Anywhere",
    "target_amino_acid": "C"
  },
  {
    "accession": "UNIMOD:21",
    "name": "Phospho",
    "fixed": false,
    "position": "Anywhere",
    "target_amino_acid": "S,T,Y"
  }
]
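A consumer will often want to separate fixed from variable modifications and expand the comma-separated residues. The sketch below shows one way to do that on the example above; the function name split_modifications and the "unspecified" bucket for entries whose fixed flag is null are assumptions of this example, not part of the format.

```python
from typing import Any, Dict, List, Tuple

ModEntry = Tuple[str, List[str]]  # (accession, target residues)


def split_modifications(mods: List[Dict[str, Any]]) -> Dict[str, List[ModEntry]]:
    """Group modifications by their fixed flag and expand target residues."""
    groups: Dict[str, List[ModEntry]] = {"fixed": [], "variable": [], "unspecified": []}
    for mod in mods:
        # target_amino_acid is nullable and comma-separated, e.g. "S,T,Y".
        raw = mod.get("target_amino_acid") or ""
        residues = [r.strip() for r in raw.split(",") if r.strip()]
        flag = mod.get("fixed")
        if flag is True:
            key = "fixed"
        elif flag is False:
            key = "variable"
        else:
            key = "unspecified"  # fixed is nullable in the struct
        groups[key].append((mod["accession"], residues))
    return groups


example = [
    {"accession": "UNIMOD:4", "name": "Carbamidomethyl", "fixed": True,
     "position": "Anywhere", "target_amino_acid": "C"},
    {"accession": "UNIMOD:21", "name": "Phospho", "fixed": False,
     "position": "Anywhere", "target_amino_acid": "S,T,Y"},
]

print(split_modifications(example))
```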
Enzymes:
Instrument:
Dissociation method:
CV accession resolution
CV accessions for instrument, enzyme, and dissociation method names are stored in ontology.parquet, not in run.parquet. This keeps run.parquet simple and consistent with sample.parquet.
Relationship to per-PSM modifications
The modification_parameters list in run.parquet captures what was configured in the search engine. For per-peptide modification evidence with localization scores, see the Modifications specification.
Reading run.parquet
import pyarrow.compute as pc
import pyarrow.parquet as pq

runs = pq.read_table("PXD014414.run.parquet")
print(runs.column_names)
# ['run_accession', 'run_file_name', 'samples', 'fraction',
#  'instrument', 'enzymes', ...]

# Get all sample-channel mappings for a specific run.
# Compute functions live in pyarrow.compute, not pyarrow.parquet.
run_row = runs.filter(
    pc.equal(runs.column("run_accession"), "Run_01")
).to_pydict()
print(run_row["samples"])
Querying with DuckDB
-- All runs with their instruments
SELECT run_accession, instrument
FROM 'PXD014414.run.parquet';

-- Unnest samples to get the full design matrix
SELECT r.run_accession, s.sample_accession, s.label,
       s.biological_replicate, s.technical_replicate
FROM 'PXD014414.run.parquet' r, UNNEST(r.samples) AS s;

-- Join runs with samples to get full metadata
SELECT r.run_accession, r.run_file_name,
       rs.biological_replicate, rs.technical_replicate,
       s.organism, s.disease
FROM 'PXD014414.run.parquet' r,
     UNNEST(r.samples) AS rs,
     'PXD014414.sample.parquet' s
WHERE rs.sample_accession = s.sample_accession;
Related Pages
- Sample Metadata -- sample.parquet
- Dataset Metadata -- dataset.parquet
- SDRF Conversion -- Column mapping and ingestion process
- Modifications -- Per-peptide modification evidence
- QPX Format Overview