YAML Schema Reference

Each QPX data view has a formal schema definition in YAML. The schemas define the structure and type of every field using Arrow-native type names and serve as the canonical, language-independent definition of the QPX format.

Expression views use AnnData

The expression views (Absolute Expression, Differential Expression) use AnnData (.h5ad) as their primary format and do not have YAML schemas. See AnnData Concepts for details.

API views have no schemas

API views (e.g., peptide summaries, protein summaries) are computed on demand from the primary data views. They are programmable and do not have formal schemas.

YAML vs Parquet

YAML schemas (.yaml) define the logical data model: the field names, Arrow types, nullability, and documentation for each view. The data itself is serialized as Apache Parquet for efficient columnar storage and querying. The YAML schemas are the single source of truth for the QPX data model; Parquet is the physical storage layer.

How to use these schemas

  • Validation: Use these schemas to validate that a QPX file conforms to the expected structure.
  • Code generation: The QPX loader converts YAML schemas to PyArrow schemas at import time.
  • Documentation: Each field includes a doc attribute describing its purpose and semantics.
  • Shared types: Common struct types (e.g., score, modification, intensity) are defined once in types.yaml and reused across schemas.
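
As a rough illustration of the validation use case, the sketch below checks rows against a hand-copied subset of the psm schema: required fields must be present and non-null, and the primary key must be unique. The PSM_SCHEMA dict and the validate_rows helper are hypothetical and are not part of the QPX API; a real validator would load the YAML file and also check Arrow types.

```python
from typing import Any

# Hand-copied subset of psm.yaml; a real validator would parse the YAML file.
PSM_SCHEMA = {
    "primary_key": ["sequence", "charge", "run_file_name", "scan"],
    "required": ["sequence", "peptidoform", "charge", "is_decoy",
                 "calculated_mz", "observed_mz", "run_file_name", "scan"],
}


def _hashable(value: Any) -> Any:
    """Turn list values (e.g. scan: list<int32>) into tuples so they can go in a set."""
    return tuple(value) if isinstance(value, list) else value


def validate_rows(rows: list[dict[str, Any]], schema: dict[str, list[str]]) -> list[str]:
    """Return human-readable problems; an empty list means the rows passed."""
    problems = []
    seen_keys = set()
    for index, row in enumerate(rows):
        # Required fields must be present and non-null.
        for name in schema["required"]:
            if row.get(name) is None:
                problems.append(f"row {index}: missing required field '{name}'")
        # The primary key must identify each row uniquely.
        key = tuple(_hashable(row.get(name)) for name in schema["primary_key"])
        if key in seen_keys:
            problems.append(f"row {index}: duplicate primary key {key}")
        seen_keys.add(key)
    return problems


# Invented example rows: row 1 repeats row 0, row 2 has a null peptidoform.
good = {"sequence": "PEPTIDEK", "peptidoform": "PEPTIDEK", "charge": 2,
        "is_decoy": False, "calculated_mz": 464.7, "observed_mz": 464.7,
        "run_file_name": "run_01", "scan": [1024]}
rows = [good, dict(good), {**good, "scan": [1025], "peptidoform": None}]
problems = validate_rows(rows, PSM_SCHEMA)
for problem in problems:
    print(problem)
```

Converting the list-valued scan field to a tuple is what allows it to take part in the primary key comparison.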

Schemas

Data Views

psm.yaml
name: psm
file_type: psm_file
primary_key: [sequence, charge, run_file_name, scan]
doc: "Peptide Spectrum Matches from identification searches"

fields:
  # Core identification (shared with Feature)
  sequence:
    type: string
    required: true
    doc: "Unmodified peptide sequence"
  peptidoform:
    type: string
    required: true
    doc: "ProForma notation"
  modifications:
    type: "list<modification>"
    doc: "Structured modifications"
  charge:
    type: int16
    required: true
    doc: "Precursor charge"
  posterior_error_probability:
    type: float64
    doc: "PEP - probability PSM is incorrect (lower is better)"
  is_decoy:
    type: bool
    required: true
    doc: "Decoy match flag"
  calculated_mz:
    type: float32
    required: true
    doc: "Theoretical m/z"
  observed_mz:
    type: float32
    required: true
    doc: "Experimental m/z"
  mass_error_ppm:
    type: float32
    doc: "Mass error in ppm: 1e6 * (observed_mz - calculated_mz) / calculated_mz"
  additional_scores:
    type: "list<score>"
    doc: "Search engine scores"
  predicted_rt:
    type: float32
    doc: "Predicted RT (seconds)"
  run_file_name:
    type: string
    required: true
    doc: "Run file containing this PSM"
  cv_params:
    type: "list<cv_param>"
    doc: "CV parameters"
  scan:
    type: "list<int32>"
    required: true
    doc: "Scan ID (array of int components)"
  rt:
    type: float32
    doc: "Retention time (seconds)"
  ion_mobility:
    type: float32
    doc: "Ion mobility"
  missed_cleavages:
    type: int16
    doc: "Number of missed enzymatic cleavages (MS:1003044)"

  # PSM-specific fields
  protein_accessions:
    type: "list<string>"
    doc: "Protein accessions the peptide maps to"

  # Cross-linking (XL-MS)
  cross_links:
    type: "list<cross_link>"
    doc: "Cross-link information for XL-MS experiments"

  # Spectral data (optional - may be stored in mz.parquet instead)
  mz_array:
    type: "list<float32>"
    doc: "Spectrum m/z values"
  intensity_array:
    type: "list<float32>"
    doc: "Spectrum intensity values"
  charge_array:
    type: "list<int32>"
    doc: "Fragment charge values"
  ion_type_array:
    type: "list<string>"
    doc: "Fragment ion type annotations (b, y, a, etc.)"
  ion_mobility_array:
    type: "list<float32>"
    doc: "Fragment ion mobility values"
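
The mass_error_ppm formula in the doc string above can be worked through directly. The helper below is an illustrative sketch rather than QPX code, and the m/z values are made up for the example.

```python
def mass_error_ppm(observed_mz: float, calculated_mz: float) -> float:
    """Mass error in parts per million, following the mass_error_ppm doc string."""
    return 1e6 * (observed_mz - calculated_mz) / calculated_mz


# Made-up values: the observed m/z is 0.001 above a theoretical m/z of 500.0.
error = mass_error_ppm(observed_mz=500.001, calculated_mz=500.0)
print(f"{error:.2f} ppm")  # 2.00 ppm
```

A positive value means the observed m/z is higher than the theoretical one.
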
feature.yaml
name: feature
file_type: feature_file
primary_key: [sequence, charge, run_file_name, anchor_protein]
doc: "Quantified peptide features per MS run"

fields:
  # Core identification
  sequence:
    type: string
    required: true
    doc: "Unmodified peptide sequence"
  peptidoform:
    type: string
    required: true
    doc: "ProForma notation"
  modifications:
    type: "list<modification>"
    doc: "Structured modifications"
  charge:
    type: int16
    required: true
    doc: "Charge state"
  posterior_error_probability:
    type: float64
    doc: "PEP for the peptide match"
  is_decoy:
    type: bool
    required: true
    doc: "Decoy flag"
  calculated_mz:
    type: float32
    required: true
    doc: "Theoretical m/z"
  observed_mz:
    type: float32
    required: true
    doc: "Experimental m/z"
  mass_error_ppm:
    type: float32
    doc: "Mass error in ppm: 1e6 * (observed_mz - calculated_mz) / calculated_mz"
  additional_scores:
    type: "list<score>"
    doc: "Search engine scores"
  predicted_rt:
    type: float32
    doc: "Predicted RT (seconds)"
  run_file_name:
    type: string
    required: true
    doc: "Run file name"
  cv_params:
    type: "list<cv_param>"
    doc: "CV parameters"
  scan:
    type: "list<int32>"
    required: true
    doc: "Scan ID (array of int components)"
  rt:
    type: float32
    doc: "Retention time (seconds)"
  ion_mobility:
    type: float32
    doc: "Ion mobility"
  missed_cleavages:
    type: int16
    doc: "Number of missed enzymatic cleavages (MS:1003044)"

  # Quantification
  intensities:
    type: "list<intensity>"
    doc: "Primary intensities per label"
  additional_intensities:
    type: "list<additional_intensity>"
    doc: "Tool-provided intensities (normalized, LFQ, iBAQ) read from upstream output"

  # Protein mapping
  pg_accessions:
    type: "list<pg_protein>"
    doc: "Protein group accessions with optional peptide positions (start, end)"
  anchor_protein:
    type: string
    required: true
    doc: "Representative protein"
  unique:
    type: bool
    doc: "Unique peptide indicator"
  pg_global_qvalue:
    type: float64
    optional: true
    doc: "Global q-value of PG"
  pg_positions:
    type: "list<protein_position>"
    optional: true
    doc: "Peptide positions per protein"

  # Ion mobility window
  ion_mobility_start:
    type: float32
    doc: "Start ion mobility"
  ion_mobility_stop:
    type: float32
    doc: "Stop ion mobility"

  # Gene annotations
  gg_accessions:
    type: "list<string>"
    optional: true
    doc: "Gene group identifiers (gene symbols; Ensembl IDs when available)"
  gg_names:
    type: "list<string>"
    optional: true
    doc: "Gene group names (gene symbols from tool output)"

  # Spectra reference
  id_run_file_name:
    type: string
    doc: "Run file containing best PSM"
  rt_start:
    type: float32
    doc: "RT window start"
  rt_stop:
    type: float32
    doc: "RT window end"
pg.yaml
name: pg
file_type: pg_file
primary_key: [anchor_protein, run_file_name]
doc: "Protein groups with per-run quantification"

fields:
  # Protein group identity
  pg_accessions:
    type: "list<string>"
    required: true
    doc: "Protein accessions within this group"
  pg_names:
    type: "list<string>"
    doc: "Protein group names"
  gg_accessions:
    type: "list<string>"
    doc: "Gene group identifiers (gene symbols; Ensembl IDs when available)"
  gg_names:
    type: "list<string>"
    doc: "Gene group names (gene symbols from tool output)"
  gg_qvalue:
    type: float64
    nullable: true
    doc: "Gene group q-value (e.g., DIA-NN GG.Q.Value)"
  anchor_protein:
    type: string
    required: true
    doc: "Anchor/leading protein of the group"

  # Run-level context
  run_file_name:
    type: string
    required: true
    doc: "Raw file name"

  # Quality metrics
  global_qvalue:
    type: float64
    doc: "Global q-value at experiment level"
  pg_qvalue:
    type: float64
    doc: "Protein group q-value at run level"

  # Quantification
  intensities:
    type: "list<intensity>"
    doc: "Primary intensity per label"
  additional_intensities:
    type: "list<additional_intensity>"
    doc: "Tool-provided intensities (normalized, LFQ, iBAQ) read from upstream output"

  # Flags
  is_decoy:
    type: bool
    required: true
    doc: "Decoy flag"
  contaminant:
    type: bool
    doc: "Contaminant flag"

  # Peptide/feature counts
  peptides:
    type: "list<peptide_per_protein>"
    required: true
    doc: "Peptide counts per protein in the group"
  peptide_counts:
    type: peptide_counts
    doc: "Unique/total peptide sequence counts"
  feature_counts:
    type: feature_counts
    doc: "Unique/total feature counts"

  # Protein properties
  sequence_coverage:
    type: float32
    doc: "Sequence coverage percentage"
  molecular_weight:
    type: float32
    doc: "Molecular weight (kDa)"

  # Scores and metadata
  additional_scores:
    type: "list<score>"
    doc: "Additional scores"
  cv_params:
    type: "list<cv_param>"
    doc: "CV parameters"
mz.yaml
name: mz
file_type: mz_file
primary_key: [id]
doc: "Mass spectrometry spectral data (scan-level)"

fields:
  # Scan identification
  id:
    type: string
    required: true
    doc: "Unique scan/spectrum identifier"
  ms_level:
    type: int32
    required: true
    doc: "MS level (1=MS1, 2=MS2)"
  centroid:
    type: bool
    required: true
    doc: "Centroided (true) or profile (false)"

  # Timing and mobility
  scan_start_time:
    type: float32
    required: true
    doc: "Scan start time (minutes)"
  inverse_ion_mobility:
    type: float32
    doc: "Inverse ion mobility (TIMS)"
  ion_injection_time:
    type: float32
    required: true
    doc: "Ion injection time (ms)"
  total_ion_current:
    type: float32
    required: true
    doc: "Total ion current"

  # Precursor info (MS2+ only)
  precursors:
    type: "list<precursor>"
    doc: "Precursor ions"

  # Spectral data
  mz:
    type: "list<float32>"
    required: true
    doc: "m/z values"
  intensity:
    type: "list<float32>"
    required: true
    doc: "Intensity values"

  # Metadata
  cv_params:
    type: "list<cv_param>"
    doc: "CV parameters"

Metadata Views

dataset.yaml
name: dataset
file_type: dataset_file
primary_key: [project_accession]
doc: "Project-level metadata for a QPX dataset (single-row file)"

fields:
  project_accession:
    type: string
    required: true
    doc: "Project accession (e.g., PXD014414)"
  project_title:
    type: string
    doc: "Title of the project"
  project_description:
    type: string
    doc: "Project description"
  pubmed_id:
    type: string
    doc: "PubMed ID"
  software_name:
    type: string
    doc: "Software that generated the data"
  software_version:
    type: string
    doc: "Software version"
  creation_date:
    type: string
    required: true
    doc: "ISO 8601 date when dataset was created"

  # Integrity / packaging
  file_checksums:
    type: "map<string, string>"
    doc: "SHA-256 hex digests keyed by file name (relative to dataset root)"
  file_row_counts:
    type: "map<string, int64>"
    doc: "Row counts keyed by file name"
  file_sizes_bytes:
    type: "map<string, int64>"
    doc: "File sizes in bytes keyed by file name"
  total_structures:
    type: int32
    doc: "Number of Parquet structures in this dataset"
  packaged_at:
    type: string
    doc: "ISO 8601 timestamp when integrity was computed"
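
The integrity fields can be populated with standard-library tools. The sketch below assumes that file_checksums holds lowercase SHA-256 hex digests and that keys are POSIX-style paths relative to the dataset root. The function name and the *.parquet glob are illustrative assumptions, not the QPX packaging code. file_row_counts is left out because it needs a Parquet reader.

```python
import hashlib
import tempfile
from pathlib import Path


def compute_integrity(dataset_root: Path) -> dict[str, dict]:
    """Build file_checksums and file_sizes_bytes maps for Parquet files under a root."""
    checksums, sizes = {}, {}
    for path in sorted(dataset_root.rglob("*.parquet")):
        key = path.relative_to(dataset_root).as_posix()
        digest = hashlib.sha256()
        with path.open("rb") as handle:
            # Hash in 1 MiB chunks so large files are not read into memory at once.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
        checksums[key] = digest.hexdigest()
        sizes[key] = path.stat().st_size
    return {"file_checksums": checksums, "file_sizes_bytes": sizes}


# Demonstrate on a throwaway directory with a placeholder file (not real Parquet data).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "psm.parquet").write_bytes(b"placeholder")
    integrity = compute_integrity(root)
print(integrity["file_sizes_bytes"])
```

A consumer can recompute the digests the same way and compare them with the stored map to detect modified or truncated files.
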
sample.yaml
name: sample
file_type: sample_file
primary_key: [sample_accession]
doc: "Biological sample metadata, one row per sample"
extra_columns: true

fields:
  sample_accession:
    type: string
    required: true
    doc: "Unique sample identifier (SDRF source name)"
  organism:
    type: string
    required: true
    doc: "Species (e.g. Homo sapiens)"
  organism_part:
    type: string
    required: true
    doc: "Tissue or organ (e.g. Brain; Frontal cortex)"
  disease:
    type: string
    optional: true
    doc: "Disease state (omitted when absent from SDRF)"
  cell_line:
    type: string
    optional: true
    doc: "Cell line name (omitted when absent from SDRF)"
  cell_type:
    type: string
    optional: true
    doc: "Cell type (omitted when absent from SDRF)"
  sex:
    type: string
    optional: true
    doc: "Biological sex (omitted when absent from SDRF)"
  age:
    type: string
    optional: true
    doc: "Age of specimen (omitted when absent from SDRF)"
  developmental_stage:
    type: string
    optional: true
    doc: "Developmental stage (omitted when absent from SDRF)"
  ancestry:
    type: string
    optional: true
    doc: "Ancestry category (omitted when absent from SDRF)"
  individual:
    type: string
    optional: true
    doc: "Individual/patient identifier (omitted when absent from SDRF)"
  sample_description:
    type: string
    optional: true
    doc: "Free-text sample description (omitted when absent from SDRF)"
run.yaml
name: run
file_type: run_file
primary_key: [run_file_name]
doc: "MS acquisition run metadata, one row per run"

fields:
  run_accession:
    type: string
    required: true
    doc: "Unique run identifier (SDRF assay name)"
  run_file_name:
    type: string
    required: true
    doc: "Raw data file name (without extension)"
  file_name:
    type: string
    doc: "Original file name with extension (e.g. S1_Frontal_1.raw)"

  # Sample-channel mapping (supports multiplexed runs)
  samples:
    type: "list<sample_channel>"
    required: true
    doc: "Sample-channel mappings with replicate info"

  # Run properties
  fraction:
    type: string
    doc: "Fraction identifier"

  # Instrument and method (plain strings; CV mappings in ontology.parquet)
  instrument:
    type: string
    doc: "Mass spectrometer name"
  enzymes:
    type: "list<string>"
    doc: "Proteolytic enzyme names"
  dissociation_method:
    type: string
    doc: "Fragmentation method name (HCD, CID, ETD)"

  # Search configuration
  modification_parameters:
    type: "list<modification_param>"
    doc: "Modifications configured in database search"
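
To make the sample-channel mapping concrete, the sketch below builds a lookup from (run_file_name, label) to sample_accession for one multiplexed run. The row values are invented, the helper is hypothetical, and it assumes that the label in sample_channel matches the label used in the intensity struct of the feature and pg views.

```python
# A hypothetical run row shaped like run.yaml; all values are invented.
run_row = {
    "run_accession": "run 1",
    "run_file_name": "TMT_set1_F1",
    "samples": [
        {"sample_accession": "sample 1", "label": "TMT126",
         "biological_replicate": 1, "technical_replicate": 1},
        {"sample_accession": "sample 2", "label": "TMT127N",
         "biological_replicate": 2, "technical_replicate": None},
    ],
}


def channel_to_sample(row: dict) -> dict[tuple[str, str], str]:
    """Map (run_file_name, label) to sample_accession for one run row."""
    return {
        (row["run_file_name"], channel["label"]): channel["sample_accession"]
        for channel in row["samples"]
    }


lookup = channel_to_sample(run_row)
print(lookup[("TMT_set1_F1", "TMT127N")])  # sample 2
```
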
ontology.yaml
name: ontology
file_type: ontology_file
primary_key: [field_name, view]
doc: "Field-to-ontology mappings. Makes the dataset self-describing"

fields:
  field_name:
    type: string
    required: true
    doc: "snake_case QPX field or score name"
  ontology_name:
    type: string
    doc: "Proper ontology term name"
  ontology_accession:
    type: string
    doc: "Ontology accession identifier (e.g., MS:1002252)"
  ontology_source:
    type: string
    doc: "Ontology prefix (MS, UBERON, UNIMOD, etc.)"
  ontology_version:
    type: string
    doc: "Version of the ontology used for resolution (e.g., 4.1.235)"
  view:
    type: string
    required: true
    doc: "QPX view this field belongs to (psm, feature, pg, etc.)"
  description:
    type: string
    doc: "Human-readable description"
  source_column_name:
    type: string
    doc: "Original column name in the tool output (e.g., Precursor.Quantity)"
  source_tool:
    type: string
    doc: "Tool name that produced this field (e.g., DIA-NN)"
provenance.yaml
name: provenance
file_type: provenance_file
primary_key: [step_order]
doc: "Processing provenance - one row per pipeline step"

fields:
  step_order:
    type: int32
    required: true
    doc: "Execution order (1, 2, 3, ...)"
  step_category:
    type: string
    required: true
    doc: "Broad category (workflow, database_search, quantification, etc.)"
  step_name:
    type: string
    required: true
    doc: "Specific name (sequence_search, psm_rescoring, etc.)"
  tool_name:
    type: string
    required: true
    doc: "Tool that performed this step"
  tool_version:
    type: string
    doc: "Tool version"
  tool_uri:
    type: string
    doc: "Container URI or tool URL"
  parameters:
    type: "list<parameter>"
    doc: "Key-value pairs of important settings"
  config:
    type: string
    doc: "Full tool configuration as JSON string"
  output_views:
    type: "list<string>"
    doc: "QPX views this step generated (e.g., ['psm', 'feature'])"

Shared Types

types.yaml
# Shared struct types used across QPX schemas.
# These are referenced by name in schema YAML files (e.g., type: list<intensity>).

score:
  fields:
    score_name: {type: string}
    score_value: {type: float64}
    higher_better: {type: bool, nullable: true}

cv_param:
  fields:
    cv_name: {type: string}
    cv_value: {type: string}

modification_position:
  fields:
    position: {type: int32}
    amino_acid: {type: string, nullable: true}
    scores: {type: "list<score>", nullable: true}

modification:
  fields:
    name: {type: string}
    accession: {type: string, nullable: true}
    positions: {type: "list<modification_position>"}

intensity:
  fields:
    label: {type: string}
    intensity: {type: float32}

intensity_pair:
  fields:
    intensity_name: {type: string}
    intensity_value: {type: float32}

additional_intensity:
  fields:
    label: {type: string}
    intensities: {type: "list<intensity_pair>"}

pg_protein:
  fields:
    accession: {type: string}
    start: {type: int32, nullable: true}
    end: {type: int32, nullable: true}
    pre: {type: string, nullable: true}
    post: {type: string, nullable: true}

protein_position:
  fields:
    protein_accession: {type: string}
    start: {type: int32}
    end: {type: int32}

ontology_property:
  fields:
    key: {type: string}
    value: {type: string}

ontology_term:
  fields:
    accession: {type: string}
    name: {type: string, nullable: true}
    properties: {type: "list<ontology_property>", nullable: true}

sample_channel:
  fields:
    sample_accession: {type: string}
    label: {type: string}
    biological_replicate: {type: int32, nullable: true}
    technical_replicate: {type: int32, nullable: true}

modification_param:
  fields:
    accession: {type: string}
    name: {type: string, nullable: true}
    fixed: {type: bool, nullable: true}
    position: {type: string, nullable: true}
    target_amino_acid: {type: string, nullable: true}

peptide_per_protein:
  fields:
    protein_name: {type: string}
    peptide_count: {type: int32}

peptide_counts:
  fields:
    unique_sequences: {type: int32}
    total_sequences: {type: int32}

feature_counts:
  fields:
    unique_features: {type: int32}
    total_features: {type: int32}

precursor:
  fields:
    selected_ion_mz: {type: float32}
    selected_ion_charge: {type: int32, nullable: true}
    selected_ion_intensity: {type: float32, nullable: true}
    isolation_window_target: {type: float32, nullable: true}
    isolation_window_lower: {type: float32, nullable: true}
    isolation_window_upper: {type: float32, nullable: true}
    spectrum_ref: {type: string, nullable: true}

parameter:
  fields:
    key: {type: string}
    value: {type: string}

cross_link:
  fields:
    xl_type: {type: string}
    partner_sequence: {type: string, nullable: true}
    partner_peptidoform: {type: string, nullable: true}
    donor_position: {type: int32}
    acceptor_position: {type: int32, nullable: true}
    linker_name: {type: string}
    linker_accession: {type: string, nullable: true}
    linker_mass: {type: float64}

Viewing schemas programmatically

You can load and inspect any schema in Python using the QPX loader:

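from qpx.core.models.loader import load_schema

# Load the schema for the feature view by name.
schema = load_schema("feature")

# Print each field with its Arrow type and nullability.
for field in schema._iter_fields():
    print(f"{field.name}: {field.arrow_type} (nullable={field.nullable})")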