YAML Schema Reference
QPX data views have formal schema definitions in YAML format. These schemas define the structure and types of each field using Arrow-native type names and serve as the canonical, language-independent schema definition for the QPX format.
Expression views use AnnData
The expression views (Absolute Expression, Differential Expression) use AnnData (.h5ad) as their primary format and do not have YAML schemas. See AnnData Concepts for details.
API views have no schemas
API views (e.g., peptide summaries, protein summaries) are computed on demand from the primary data views. They are programmable and do not have formal schemas.
YAML vs Parquet
YAML schemas (.yaml) define the logical data model: the field names, Arrow types, nullability, and documentation for each view. The data itself is serialized in Apache Parquet format for efficient columnar storage and querying. The YAML schemas are the single source of truth for the QPX data model, while Parquet handles the physical storage layer.
How to use these schemas
- Validation: Use these schemas to validate that a QPX file conforms to the expected structure.
- Code generation: The QPX loader converts YAML schemas to PyArrow schemas at import time.
- Documentation: Each field includes a doc attribute describing its purpose and semantics.
- Shared types: Common struct types (e.g., score, modification, intensity) are defined once in types.yaml and reused across schemas.
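As a rough illustration of the validation use case, the sketch below checks one row against a schema that has already been parsed into a Python dict (for example by a YAML parser). The helper validate_row and the trimmed PSM_SCHEMA are hypothetical and are not the QPX loader's API. Treating extra_columns: true as "unknown columns are allowed" is an assumption about that flag's meaning.

```python
# Hypothetical helper, not part of the QPX loader.
# PSM_SCHEMA mirrors what a YAML parser would return for a trimmed psm schema.
PSM_SCHEMA = {
    "name": "psm",
    "primary_key": ["sequence", "charge", "run_file_name", "scan"],
    "fields": {
        "sequence": {"type": "string", "required": True},
        "charge": {"type": "int16", "required": True},
        "run_file_name": {"type": "string", "required": True},
        "scan": {"type": "list<int32>", "required": True},
        "rt": {"type": "float32"},
    },
}


def validate_row(schema, row):
    """Return a list of problems; an empty list means the row passes."""
    problems = []
    fields = schema["fields"]
    # Every field marked required must be present and non-null.
    for name, spec in fields.items():
        if spec.get("required") and row.get(name) is None:
            problems.append(f"missing required field: {name}")
    # Assumption: extra_columns: true means undeclared columns are tolerated.
    if not schema.get("extra_columns", False):
        for name in row:
            if name not in fields:
                problems.append(f"unknown field: {name}")
    return problems


good = {"sequence": "PEPTIDEK", "charge": 2, "run_file_name": "run_01", "scan": [1234]}
bad = {"sequence": "PEPTIDEK", "charge": 2, "retention": 12.5}

print(validate_row(PSM_SCHEMA, good))
print(validate_row(PSM_SCHEMA, bad))
```

This only covers presence checks. Type checking against the Arrow type names would need an Arrow library and is left out here.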
Schemas
Data Views
name: psm
file_type: psm_file
primary_key: [sequence, charge, run_file_name, scan]
doc: "Peptide Spectrum Matches from identification searches"
fields:
  # Core identification (shared with Feature)
  sequence:
    type: string
    required: true
    doc: "Unmodified peptide sequence"
  peptidoform:
    type: string
    required: true
    doc: "ProForma notation"
  modifications:
    type: "list<modification>"
    doc: "Structured modifications"
  charge:
    type: int16
    required: true
    doc: "Precursor charge"
  posterior_error_probability:
    type: float64
    doc: "PEP - probability PSM is incorrect (lower is better)"
  is_decoy:
    type: bool
    required: true
    doc: "Decoy match flag"
  calculated_mz:
    type: float32
    required: true
    doc: "Theoretical m/z"
  observed_mz:
    type: float32
    required: true
    doc: "Experimental m/z"
  mass_error_ppm:
    type: float32
    doc: "Mass error in ppm: 1e6 * (observed_mz - calculated_mz) / calculated_mz"
  additional_scores:
    type: "list<score>"
    doc: "Search engine scores"
  predicted_rt:
    type: float32
    doc: "Predicted RT (seconds)"
  run_file_name:
    type: string
    required: true
    doc: "Run file containing this PSM"
  cv_params:
    type: "list<cv_param>"
    doc: "CV parameters"
  scan:
    type: "list<int32>"
    required: true
    doc: "Scan ID (array of int components)"
  rt:
    type: float32
    doc: "Retention time (seconds)"
  ion_mobility:
    type: float32
    doc: "Ion mobility"
  missed_cleavages:
    type: int16
    doc: "Number of missed enzymatic cleavages (MS:1003044)"
  # PSM-specific fields
  protein_accessions:
    type: "list<string>"
    doc: "Protein accessions the peptide maps to"
  # Cross-linking (XL-MS)
  cross_links:
    type: "list<cross_link>"
    doc: "Cross-link information for XL-MS experiments"
  # Spectral data (optional - may be stored in mz.parquet instead)
  mz_array:
    type: "list<float32>"
    doc: "Spectrum m/z values"
  intensity_array:
    type: "list<float32>"
    doc: "Spectrum intensity values"
  charge_array:
    type: "list<int32>"
    doc: "Fragment charge values"
  ion_type_array:
    type: "list<string>"
    doc: "Fragment ion type annotations (b, y, a, etc.)"
  ion_mobility_array:
    type: "list<float32>"
    doc: "Fragment ion mobility values"
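The mass_error_ppm field is documented by its formula, 1e6 * (observed_mz - calculated_mz) / calculated_mz. A minimal sketch of that computation follows; the function name and the example m/z values are illustrative only.

```python
def mass_error_ppm(observed_mz: float, calculated_mz: float) -> float:
    """Mass error in parts per million, as given in the field's doc string."""
    return 1e6 * (observed_mz - calculated_mz) / calculated_mz


# An observed m/z that is 0.001 above a theoretical m/z of 500 is about 2 ppm off.
error = mass_error_ppm(observed_mz=500.001, calculated_mz=500.0)
print(round(error, 3))
```

The sign follows the formula: a positive value means the observed m/z is higher than the theoretical one.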
name: feature
file_type: feature_file
primary_key: [sequence, charge, run_file_name, anchor_protein]
doc: "Quantified peptide features per MS run"
fields:
  # Core identification
  sequence:
    type: string
    required: true
    doc: "Unmodified peptide sequence"
  peptidoform:
    type: string
    required: true
    doc: "ProForma notation"
  modifications:
    type: "list<modification>"
    doc: "Structured modifications"
  charge:
    type: int16
    required: true
    doc: "Charge state"
  posterior_error_probability:
    type: float64
    doc: "PEP for the peptide match"
  is_decoy:
    type: bool
    required: true
    doc: "Decoy flag"
  calculated_mz:
    type: float32
    required: true
    doc: "Theoretical m/z"
  observed_mz:
    type: float32
    required: true
    doc: "Experimental m/z"
  mass_error_ppm:
    type: float32
    doc: "Mass error in ppm: 1e6 * (observed_mz - calculated_mz) / calculated_mz"
  additional_scores:
    type: "list<score>"
    doc: "Search engine scores"
  predicted_rt:
    type: float32
    doc: "Predicted RT (seconds)"
  run_file_name:
    type: string
    required: true
    doc: "Run file name"
  cv_params:
    type: "list<cv_param>"
    doc: "CV parameters"
  scan:
    type: "list<int32>"
    required: true
    doc: "Scan ID (array of int components)"
  rt:
    type: float32
    doc: "Retention time (seconds)"
  ion_mobility:
    type: float32
    doc: "Ion mobility"
  missed_cleavages:
    type: int16
    doc: "Number of missed enzymatic cleavages (MS:1003044)"
  # Quantification
  intensities:
    type: "list<intensity>"
    doc: "Primary intensities per label"
  additional_intensities:
    type: "list<additional_intensity>"
    doc: "Tool-provided intensities (normalized, LFQ, iBAQ) read from upstream output"
  # Protein mapping
  pg_accessions:
    type: "list<pg_protein>"
    doc: "Protein group accessions with optional peptide positions (start, end)"
  anchor_protein:
    type: string
    required: true
    doc: "Representative protein"
  unique:
    type: bool
    doc: "Unique peptide indicator"
  pg_global_qvalue:
    type: float64
    optional: true
    doc: "Global q-value of PG"
  pg_positions:
    type: "list<protein_position>"
    optional: true
    doc: "Peptide positions per protein"
  # Ion mobility window
  ion_mobility_start:
    type: float32
    doc: "Start ion mobility"
  ion_mobility_stop:
    type: float32
    doc: "Stop ion mobility"
  # Gene annotations
  gg_accessions:
    type: "list<string>"
    optional: true
    doc: "Gene group identifiers (gene symbols; Ensembl IDs when available)"
  gg_names:
    type: "list<string>"
    optional: true
    doc: "Gene group names (gene symbols from tool output)"
  # Spectra reference
  id_run_file_name:
    type: string
    doc: "Run file containing best PSM"
  rt_start:
    type: float32
    doc: "RT window start"
  rt_stop:
    type: float32
    doc: "RT window end"
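Each schema declares a primary_key. Assuming the usual meaning (the listed fields together identify a row uniquely), duplicates can be detected as sketched below. The helper names and sample rows are hypothetical. List-valued key fields such as psm.scan are converted to tuples so they can be hashed.

```python
from collections import Counter

# Key fields copied from the feature schema's primary_key entry.
FEATURE_KEY = ["sequence", "charge", "run_file_name", "anchor_protein"]


def key_of(row, key_fields):
    """Build a hashable key; list values (e.g. psm.scan) become tuples."""
    return tuple(
        tuple(row[name]) if isinstance(row[name], list) else row[name]
        for name in key_fields
    )


def duplicate_keys(rows, key_fields):
    """Return every key that occurs more than once, in first-seen order."""
    counts = Counter(key_of(row, key_fields) for row in rows)
    return [key for key, count in counts.items() if count > 1]


# Made-up rows: the first and third share the same primary key.
rows = [
    {"sequence": "PEPTIDEK", "charge": 2, "run_file_name": "run_01", "anchor_protein": "P12345"},
    {"sequence": "PEPTIDEK", "charge": 3, "run_file_name": "run_01", "anchor_protein": "P12345"},
    {"sequence": "PEPTIDEK", "charge": 2, "run_file_name": "run_01", "anchor_protein": "P12345"},
]

duplicates = duplicate_keys(rows, FEATURE_KEY)
print(duplicates)  # [('PEPTIDEK', 2, 'run_01', 'P12345')]
```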
name: pg
file_type: pg_file
primary_key: [anchor_protein, run_file_name]
doc: "Protein groups with per-run quantification"
fields:
  # Protein group identity
  pg_accessions:
    type: "list<string>"
    required: true
    doc: "Protein accessions within this group"
  pg_names:
    type: "list<string>"
    doc: "Protein group names"
  gg_accessions:
    type: "list<string>"
    doc: "Gene group identifiers (gene symbols; Ensembl IDs when available)"
  gg_names:
    type: "list<string>"
    doc: "Gene group names (gene symbols from tool output)"
  gg_qvalue:
    type: float64
    nullable: true
    doc: "Gene group q-value (e.g., DIA-NN GG.Q.Value)"
  anchor_protein:
    type: string
    required: true
    doc: "Anchor/leading protein of the group"
  # Run-level context
  run_file_name:
    type: string
    required: true
    doc: "Raw file name"
  # Quality metrics
  global_qvalue:
    type: float64
    doc: "Global q-value at experiment level"
  pg_qvalue:
    type: float64
    doc: "Protein group q-value at run level"
  # Quantification
  intensities:
    type: "list<intensity>"
    doc: "Primary intensity per label"
  additional_intensities:
    type: "list<additional_intensity>"
    doc: "Tool-provided intensities (normalized, LFQ, iBAQ) read from upstream output"
  # Flags
  is_decoy:
    type: bool
    required: true
    doc: "Decoy flag"
  contaminant:
    type: bool
    doc: "Contaminant flag"
  # Peptide/feature counts
  peptides:
    type: "list<peptide_per_protein>"
    required: true
    doc: "Peptide counts per protein in the group"
  peptide_counts:
    type: peptide_counts
    doc: "Unique/total peptide sequence counts"
  feature_counts:
    type: feature_counts
    doc: "Unique/total feature counts"
  # Protein properties
  sequence_coverage:
    type: float32
    doc: "Sequence coverage percentage"
  molecular_weight:
    type: float32
    doc: "Molecular weight (kDa)"
  # Scores and metadata
  additional_scores:
    type: "list<score>"
    doc: "Additional scores"
  cv_params:
    type: "list<cv_param>"
    doc: "CV parameters"
name: mz
file_type: mz_file
primary_key: [id]
doc: "Mass spectrometry spectral data (scan-level)"
fields:
  # Scan identification
  id:
    type: string
    required: true
    doc: "Unique scan/spectrum identifier"
  ms_level:
    type: int32
    required: true
    doc: "MS level (1=MS1, 2=MS2)"
  centroid:
    type: bool
    required: true
    doc: "Centroided (true) or profile (false)"
  # Timing and mobility
  scan_start_time:
    type: float32
    required: true
    doc: "Scan start time (minutes)"
  inverse_ion_mobility:
    type: float32
    doc: "Inverse ion mobility (TIMS)"
  ion_injection_time:
    type: float32
    required: true
    doc: "Ion injection time (ms)"
  total_ion_current:
    type: float32
    required: true
    doc: "Total ion current"
  # Precursor info (MS2+ only)
  precursors:
    type: "list<precursor>"
    doc: "Precursor ions"
  # Spectral data
  mz:
    type: "list<float32>"
    required: true
    doc: "m/z values"
  intensity:
    type: "list<float32>"
    required: true
    doc: "Intensity values"
  # Metadata
  cv_params:
    type: "list<cv_param>"
    doc: "CV parameters"
Metadata Views
name: dataset
file_type: dataset_file
primary_key: [project_accession]
doc: "Project-level metadata for a QPX dataset (single-row file)"
fields:
  project_accession:
    type: string
    required: true
    doc: "Project accession (e.g., PXD014414)"
  project_title:
    type: string
    doc: "Title of the project"
  project_description:
    type: string
    doc: "Project description"
  pubmed_id:
    type: string
    doc: "PubMed ID"
  software_name:
    type: string
    doc: "Software that generated the data"
  software_version:
    type: string
    doc: "Software version"
  creation_date:
    type: string
    required: true
    doc: "ISO 8601 date when dataset was created"
  # Integrity / packaging
  file_checksums:
    type: "map<string, string>"
    doc: "SHA-256 hex digests keyed by file name (relative to dataset root)"
  file_row_counts:
    type: "map<string, int64>"
    doc: "Row counts keyed by file name"
  file_sizes_bytes:
    type: "map<string, int64>"
    doc: "File sizes in bytes keyed by file name"
  total_structures:
    type: int32
    doc: "Number of Parquet structures in this dataset"
  packaged_at:
    type: string
    doc: "ISO 8601 timestamp when integrity was computed"
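The integrity fields of the dataset view (file_checksums, file_sizes_bytes) can be filled with the standard library alone. The sketch below is one possible way to compute them and is not the QPX packaging code. The function name package_integrity and the choice to include every *.parquet file under the dataset root are assumptions. file_row_counts is omitted because it requires a Parquet reader.

```python
import hashlib
import tempfile
from pathlib import Path


def package_integrity(root: Path) -> dict:
    """Compute SHA-256 hex digests and byte sizes keyed by path relative to root."""
    checksums, sizes = {}, {}
    for path in sorted(root.rglob("*.parquet")):
        key = path.relative_to(root).as_posix()
        digest = hashlib.sha256()
        with path.open("rb") as handle:
            # Read in 1 MiB chunks so large files are not loaded into memory at once.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
        checksums[key] = digest.hexdigest()
        sizes[key] = path.stat().st_size
    return {"file_checksums": checksums, "file_sizes_bytes": sizes}


# Demonstration on a throwaway directory with a placeholder file.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "psm.parquet").write_bytes(b"not a real parquet file")
    integrity = package_integrity(root)

print(integrity["file_sizes_bytes"])  # {'psm.parquet': 23}
```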
name: sample
file_type: sample_file
primary_key: [sample_accession]
doc: "Biological sample metadata, one row per sample"
extra_columns: true
fields:
  sample_accession:
    type: string
    required: true
    doc: "Unique sample identifier (SDRF source name)"
  organism:
    type: string
    required: true
    doc: "Species (e.g. Homo sapiens)"
  organism_part:
    type: string
    required: true
    doc: "Tissue or organ (e.g. Brain; Frontal cortex)"
  disease:
    type: string
    optional: true
    doc: "Disease state (omitted when absent from SDRF)"
  cell_line:
    type: string
    optional: true
    doc: "Cell line name (omitted when absent from SDRF)"
  cell_type:
    type: string
    optional: true
    doc: "Cell type (omitted when absent from SDRF)"
  sex:
    type: string
    optional: true
    doc: "Biological sex (omitted when absent from SDRF)"
  age:
    type: string
    optional: true
    doc: "Age of specimen (omitted when absent from SDRF)"
  developmental_stage:
    type: string
    optional: true
    doc: "Developmental stage (omitted when absent from SDRF)"
  ancestry:
    type: string
    optional: true
    doc: "Ancestry category (omitted when absent from SDRF)"
  individual:
    type: string
    optional: true
    doc: "Individual/patient identifier (omitted when absent from SDRF)"
  sample_description:
    type: string
    optional: true
    doc: "Free-text sample description (omitted when absent from SDRF)"
name: run
file_type: run_file
primary_key: [run_file_name]
doc: "MS acquisition run metadata, one row per run"
fields:
  run_accession:
    type: string
    required: true
    doc: "Unique run identifier (SDRF assay name)"
  run_file_name:
    type: string
    required: true
    doc: "Raw data file name (without extension)"
  file_name:
    type: string
    doc: "Original file name with extension (e.g. S1_Frontal_1.raw)"
  # Sample-channel mapping (supports multiplexed runs)
  samples:
    type: "list<sample_channel>"
    required: true
    doc: "Sample-channel mappings with replicate info"
  # Run properties
  fraction:
    type: string
    doc: "Fraction identifier"
  # Instrument and method (plain strings; CV mappings in ontology.parquet)
  instrument:
    type: string
    doc: "Mass spectrometer name"
  enzymes:
    type: "list<string>"
    doc: "Proteolytic enzyme names"
  dissociation_method:
    type: string
    doc: "Fragmentation method name (HCD, CID, ETD)"
  # Search configuration
  modification_parameters:
    type: "list<modification_param>"
    doc: "Modifications configured in database search"
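The samples field maps each channel label in a run to a sample_accession, while the feature and pg views report intensities per label. One plausible way to attribute intensities to samples is to join on run_file_name and label, as sketched below. That join is an assumption about how the views are meant to be combined. The rows and the helper intensities_by_sample are made up for illustration.

```python
# Illustrative rows shaped like the run and feature schemas; values are made up.
run_row = {
    "run_file_name": "run_01",
    "samples": [
        {"sample_accession": "S1", "label": "TMT126"},
        {"sample_accession": "S2", "label": "TMT127N"},
    ],
}
feature_row = {
    "run_file_name": "run_01",
    "intensities": [
        {"label": "TMT126", "intensity": 1500.0},
        {"label": "TMT127N", "intensity": 900.0},
    ],
}


def intensities_by_sample(run_row, feature_row):
    """Map sample_accession -> intensity by matching channel labels within one run."""
    if run_row["run_file_name"] != feature_row["run_file_name"]:
        raise ValueError("rows come from different runs")
    label_to_sample = {
        channel["label"]: channel["sample_accession"] for channel in run_row["samples"]
    }
    return {
        label_to_sample[item["label"]]: item["intensity"]
        for item in feature_row["intensities"]
    }


per_sample = intensities_by_sample(run_row, feature_row)
print(per_sample)  # {'S1': 1500.0, 'S2': 900.0}
```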
name: ontology
file_type: ontology_file
primary_key: [field_name, view]
doc: "Field-to-ontology mappings. Makes the dataset self-describing"
fields:
  field_name:
    type: string
    required: true
    doc: "snake_case QPX field or score name"
  ontology_name:
    type: string
    doc: "Proper ontology term name"
  ontology_accession:
    type: string
    doc: "Ontology accession identifier (e.g., MS:1002252)"
  ontology_source:
    type: string
    doc: "Ontology prefix (MS, UBERON, UNIMOD, etc.)"
  ontology_version:
    type: string
    doc: "Version of the ontology used for resolution (e.g., 4.1.235)"
  view:
    type: string
    required: true
    doc: "QPX view this field belongs to (psm, feature, pg, etc.)"
  description:
    type: string
    doc: "Human-readable description"
  source_column_name:
    type: string
    doc: "Original column name in the tool output (e.g., Precursor.Quantity)"
  source_tool:
    type: string
    doc: "Tool name that produced this field (e.g., DIA-NN)"
name: provenance
file_type: provenance_file
primary_key: [step_order]
doc: "Processing provenance - one row per pipeline step"
fields:
  step_order:
    type: int32
    required: true
    doc: "Execution order (1, 2, 3, ...)"
  step_category:
    type: string
    required: true
    doc: "Broad category (workflow, database_search, quantification, etc.)"
  step_name:
    type: string
    required: true
    doc: "Specific name (sequence_search, psm_rescoring, etc.)"
  tool_name:
    type: string
    required: true
    doc: "Tool that performed this step"
  tool_version:
    type: string
    doc: "Tool version"
  tool_uri:
    type: string
    doc: "Container URI or tool URL"
  parameters:
    type: "list<parameter>"
    doc: "Key-value pairs of important settings"
  config:
    type: string
    doc: "Full tool configuration as JSON string"
  output_views:
    type: "list<string>"
    doc: "QPX views this step generated (e.g., ['psm', 'feature'])"
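Because provenance rows carry an explicit step_order and store config as a JSON string, a consumer can reconstruct the pipeline order and decode each configuration with the standard library. The rows below are placeholders: the tool names, the step name feature_quantification, and the settings are invented for illustration and do not describe a real pipeline.

```python
import json

# Placeholder provenance rows, deliberately listed out of order.
provenance_rows = [
    {
        "step_order": 2,
        "step_category": "quantification",
        "step_name": "feature_quantification",  # hypothetical step name
        "tool_name": "example_quant_tool",      # hypothetical tool
        "config": json.dumps({"normalization": "median"}),
        "output_views": ["feature", "pg"],
    },
    {
        "step_order": 1,
        "step_category": "database_search",
        "step_name": "sequence_search",
        "tool_name": "example_search_tool",     # hypothetical tool
        "config": json.dumps({"precursor_tolerance_ppm": 10}),
        "output_views": ["psm"],
    },
]


def steps_producing(rows, view):
    """Names of the steps that generated a given view, in execution order."""
    ordered = sorted(rows, key=lambda row: row["step_order"])
    return [row["step_name"] for row in ordered if view in row["output_views"]]


# Decode the JSON config strings, keyed by step name.
configs = {row["step_name"]: json.loads(row["config"]) for row in provenance_rows}

print(steps_producing(provenance_rows, "psm"))
print(configs["sequence_search"])
```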
Shared Types
# Shared struct types used across QPX schemas.
# These are referenced by name in schema YAML files (e.g., type: list<intensity>).
score:
  fields:
    score_name: {type: string}
    score_value: {type: float64}
    higher_better: {type: bool, nullable: true}
cv_param:
  fields:
    cv_name: {type: string}
    cv_value: {type: string}
modification_position:
  fields:
    position: {type: int32}
    amino_acid: {type: string, nullable: true}
    scores: {type: "list<score>", nullable: true}
modification:
  fields:
    name: {type: string}
    accession: {type: string, nullable: true}
    positions: {type: "list<modification_position>"}
intensity:
  fields:
    label: {type: string}
    intensity: {type: float32}
intensity_pair:
  fields:
    intensity_name: {type: string}
    intensity_value: {type: float32}
additional_intensity:
  fields:
    label: {type: string}
    intensities: {type: "list<intensity_pair>"}
pg_protein:
  fields:
    accession: {type: string}
    start: {type: int32, nullable: true}
    end: {type: int32, nullable: true}
    pre: {type: string, nullable: true}
    post: {type: string, nullable: true}
protein_position:
  fields:
    protein_accession: {type: string}
    start: {type: int32}
    end: {type: int32}
ontology_property:
  fields:
    key: {type: string}
    value: {type: string}
ontology_term:
  fields:
    accession: {type: string}
    name: {type: string, nullable: true}
    properties: {type: "list<ontology_property>", nullable: true}
sample_channel:
  fields:
    sample_accession: {type: string}
    label: {type: string}
    biological_replicate: {type: int32, nullable: true}
    technical_replicate: {type: int32, nullable: true}
modification_param:
  fields:
    accession: {type: string}
    name: {type: string, nullable: true}
    fixed: {type: bool, nullable: true}
    position: {type: string, nullable: true}
    target_amino_acid: {type: string, nullable: true}
peptide_per_protein:
  fields:
    protein_name: {type: string}
    peptide_count: {type: int32}
peptide_counts:
  fields:
    unique_sequences: {type: int32}
    total_sequences: {type: int32}
feature_counts:
  fields:
    unique_features: {type: int32}
    total_features: {type: int32}
precursor:
  fields:
    selected_ion_mz: {type: float32}
    selected_ion_charge: {type: int32, nullable: true}
    selected_ion_intensity: {type: float32, nullable: true}
    isolation_window_target: {type: float32, nullable: true}
    isolation_window_lower: {type: float32, nullable: true}
    isolation_window_upper: {type: float32, nullable: true}
    spectrum_ref: {type: string, nullable: true}
parameter:
  fields:
    key: {type: string}
    value: {type: string}
cross_link:
  fields:
    xl_type: {type: string}
    partner_sequence: {type: string, nullable: true}
    partner_peptidoform: {type: string, nullable: true}
    donor_position: {type: int32}
    acceptor_position: {type: int32, nullable: true}
    linker_name: {type: string}
    linker_accession: {type: string, nullable: true}
    linker_mass: {type: float64}
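Type strings such as list<intensity> or map<string, int64> refer either to Arrow primitives or to the shared structs above. The sketch below expands such a string into a nested Python description, which is roughly the step that would precede building an Arrow schema. It is not the QPX loader's implementation: the function resolve and the output shape are illustrative, nullability is ignored, and map keys are assumed to be primitive types.

```python
# A few shared types copied from the listing above, as a YAML parser would return them.
SHARED_TYPES = {
    "intensity": {"fields": {
        "label": {"type": "string"},
        "intensity": {"type": "float32"},
    }},
    "intensity_pair": {"fields": {
        "intensity_name": {"type": "string"},
        "intensity_value": {"type": "float32"},
    }},
    "additional_intensity": {"fields": {
        "label": {"type": "string"},
        "intensities": {"type": "list<intensity_pair>"},
    }},
}


def resolve(type_name, shared):
    """Recursively expand a type string into nested dicts; primitives stay strings."""
    type_name = type_name.strip()
    if type_name.startswith("list<") and type_name.endswith(">"):
        return {"list": resolve(type_name[len("list<"):-1], shared)}
    if type_name.startswith("map<") and type_name.endswith(">"):
        # Assumes the key type contains no comma (true for primitive keys).
        key, _, value = type_name[len("map<"):-1].partition(",")
        return {"map": {"key": resolve(key, shared), "value": resolve(value, shared)}}
    if type_name in shared:
        members = shared[type_name]["fields"]
        return {"struct": {name: resolve(spec["type"], shared) for name, spec in members.items()}}
    return type_name  # treated as an Arrow primitive name such as float32


print(resolve("list<additional_intensity>", SHARED_TYPES))
print(resolve("map<string, int64>", SHARED_TYPES))
```

Keeping the expansion recursive means a struct that references another shared type, as additional_intensity does with intensity_pair, resolves without special handling.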