QPX Format Overview

The QPX format is a modern, scalable data format designed specifically for proteomics data analysis. It addresses the limitations of existing formats -- XML-based HUPO-PSI standards (mzML, mzIdentML) and tab-delimited formats like mzTab -- which struggle with large-scale datasets and advanced analytical use cases such as AI/ML model development.

QPX organizes proteomics results into multiple views, each capturing a different aspect of the data (identifications, quantifications, spectra, metadata). Data and metadata views use Apache Parquet as the primary serialization format, with the original SDRF file kept as TSV. Expression views (absolute and differential expression) use AnnData (.h5ad) for native interoperability with the scverse ecosystem.

Note

QPX does not aim to replace mzTab or individual tool output formats. Its goal is to provide a unified, performance-oriented format that enables AI-related use cases and easy integration of proteomics results across tools and platforms.

Data model

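The diagram below shows the QPX views and how they relate to each other. Solid arrows indicate data flow from raw spectra through identification and quantification to final expression results; dashed arrows link the peptide-protein map and the metadata views to the views they relate to.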

```mermaid
graph LR
    MZ[mz<br/>Mass Spectra] --> PSM[psm<br/>Peptide Spectrum Matches]
    PSM --> FEAT[feature<br/>Peptide Features]
    FEAT --> PG[pg<br/>Protein Groups]
    FEAT --> API["API Views<br/>(on demand)"]
    PG --> API
    PG --> AE[ae<br/>Absolute Expression]
    PG --> DE[de<br/>Differential Expression]
    PSM -.-> PPM[pepmap<br/>Peptide-Protein Map]
    FEAT -.-> PPM
    SAMPLE[sample<br/>Sample Metadata] -.-> FEAT
    SAMPLE -.-> PG
    SAMPLE -.-> AE
    RUN[run<br/>Run Metadata] -.-> FEAT
    RUN -.-> PG
    RUN -.-> API
    DATASET[dataset<br/>Dataset Metadata] -.-> MZ
    DATASET -.-> PSM
    DATASET -.-> FEAT

    style MZ fill:#e1f5fe
    style PSM fill:#e8f5e9
    style FEAT fill:#e8f5e9
    style PG fill:#fff3e0
    style PPM fill:#e8f5e9
    style API fill:#e8f5e9,stroke-dasharray: 5 5
    style AE fill:#fce4ec
    style DE fill:#fce4ec
    style SAMPLE fill:#f3e5f5
    style RUN fill:#f3e5f5
    style DATASET fill:#f3e5f5
```
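
The `-->` edges in the diagram can be written down as a small adjacency map, which makes it easy to ask which views a given view is derived from. The following is a minimal sketch: the `DATA_FLOW` dictionary and the `upstream_views` helper are illustrative names rather than part of any QPX API, and only the data-flow edges (not the dashed `-.->` ones) are included.

```python
from collections import deque

# Data-flow edges copied from the `-->` arrows in the diagram (source -> targets).
# "api" stands for the on-demand API views.
DATA_FLOW = {
    "mz": ["psm"],
    "psm": ["feature"],
    "feature": ["pg", "api"],
    "pg": ["api", "ae", "de"],
}


def upstream_views(view):
    """Return every view that `view` is directly or indirectly derived from."""
    # Invert the edge map so we can walk from a view back to its sources.
    parents = {}
    for source, targets in DATA_FLOW.items():
        for target in targets:
            parents.setdefault(target, []).append(source)

    # Breadth-first walk over the inverted edges.
    seen = set()
    queue = deque([view])
    while queue:
        current = queue.popleft()
        for parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


print(sorted(upstream_views("de")))  # ['feature', 'mz', 'pg', 'psm']
```

Read this way, a differential expression result traces back through protein groups, features, and PSMs to the spectra.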

Views at a glance

Data Views

View	File pattern	Format	Description
`mz`	`{PREFIX}.mz.parquet`	Parquet	Raw and processed mass spectra (MS1 and MS2)
`psm`	`{PREFIX}.psm.parquet`	Parquet	Peptide spectrum matches from database search engines (primarily DDA)
`feature`	`{PREFIX}.feature.parquet`	Parquet	Quantified peptide features per MS run, with sample and channel intensities
`pg`	`{PREFIX}.pg.parquet`	Parquet	Protein groups with quantification across runs and channels
`pepmap`	`{PREFIX}.pepmap.parquet`	Parquet	Deduplicated peptide-to-protein mapping with gene names and uniqueness flags

API Views

These are not stored as standalone files. They are programmable, on-demand views that the API computes by joining and aggregating the primary data views. Users can request different fields and aggregations depending on their needs (e.g., protein counts, peptide abundances, identification summaries). See API Views for details.
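
To make the join-and-aggregate idea concrete, here is a minimal sketch in plain Python that counts distinct peptides per protein by combining feature-like rows with a peptide-to-protein map. The field names (`sequence`, `protein`, `run`, `intensity`), the toy values, and the `peptide_counts_per_protein` helper are all hypothetical; they are not taken from the QPX schema or from the actual API.

```python
from collections import defaultdict

# Toy stand-ins for rows of the `feature` and `pepmap` views.
# All field names and values here are assumptions made for illustration.
feature_rows = [
    {"sequence": "AGLQFPVGR", "run": "run1", "intensity": 1200.0},
    {"sequence": "AGLQFPVGR", "run": "run2", "intensity": 900.0},
    {"sequence": "LVNELTEFAK", "run": "run1", "intensity": 450.0},
    {"sequence": "DLGEEHFK", "run": "run1", "intensity": 300.0},
]
pepmap_rows = [
    {"sequence": "AGLQFPVGR", "protein": "PROT_A"},
    {"sequence": "LVNELTEFAK", "protein": "PROT_A"},
    {"sequence": "DLGEEHFK", "protein": "PROT_B"},
]


def peptide_counts_per_protein(features, pepmap):
    """Join features to proteins on the peptide sequence, then count distinct peptides."""
    # Build the lookup side of the join: peptide sequence -> set of proteins.
    proteins_by_peptide = defaultdict(set)
    for row in pepmap:
        proteins_by_peptide[row["sequence"]].add(row["protein"])

    # Aggregate: collect the distinct peptides observed for each protein.
    peptides_by_protein = defaultdict(set)
    for row in features:
        for protein in proteins_by_peptide.get(row["sequence"], ()):
            peptides_by_protein[protein].add(row["sequence"])

    return {protein: len(peptides) for protein, peptides in peptides_by_protein.items()}


print(peptide_counts_per_protein(feature_rows, pepmap_rows))
# {'PROT_A': 2, 'PROT_B': 1}
```

Because nothing is materialized on disk, a different aggregation (for example, summed intensity per protein and run) only changes the final step, not the stored views.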

Expression Views

View	File pattern	Format	Description
`ae`	`{PREFIX}.ae.h5ad`	AnnData	Absolute expression matrix (iBAQ values per protein per sample)
`de`	`{PREFIX}.de.h5ad`	AnnData	Differential expression results (fold changes, p-values between contrasts)

Metadata

View	File pattern	Format	Description
`dataset`	`{PREFIX}.dataset.parquet`	Parquet	Project-level metadata, search parameters, and software provenance
`sample`	`{PREFIX}.sample.parquet`	Parquet	Biological sample metadata (one row per sample)
`run`	`{PREFIX}.run.parquet`	Parquet	Data acquisition run metadata (one row per run)
`ontology`	`{PREFIX}.ontology.parquet`	Parquet	Field-to-ontology mapping (makes the dataset self-describing)
`provenance`	`{PREFIX}.provenance.parquet`	Parquet	Processing chain: tools, versions, parameters, and FDR thresholds
`sdrf`	`{PREFIX}.sdrf.tsv`	TSV	Original SDRF file, preserved for provenance

Tip

File names follow the pattern `{PREFIX}.{view}.{format}` -- for example, `PXD014414.feature.parquet` or `PXD014414.sample.parquet`. See File Extensions & Naming for the full convention.
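
The naming pattern can be captured in two small helpers, one that builds a file name and one that parses it back. This is only a sketch: `VIEW_FORMATS`, `qpx_filename`, and `parse_qpx_filename` are illustrative names, the view-to-format mapping is copied from the tables on this page, and the full naming convention may include rules not covered here.

```python
# View -> file format, as listed in the tables on this page.
VIEW_FORMATS = {
    "mz": "parquet", "psm": "parquet", "feature": "parquet",
    "pg": "parquet", "pepmap": "parquet",
    "ae": "h5ad", "de": "h5ad",
    "dataset": "parquet", "sample": "parquet", "run": "parquet",
    "ontology": "parquet", "provenance": "parquet", "sdrf": "tsv",
}


def qpx_filename(prefix, view):
    """Build `{PREFIX}.{view}.{format}` for a known view."""
    if view not in VIEW_FORMATS:
        raise ValueError(f"unknown QPX view: {view!r}")
    return f"{prefix}.{view}.{VIEW_FORMATS[view]}"


def parse_qpx_filename(filename):
    """Split a file name into (prefix, view, format) and check the view/format pair."""
    # Split from the right so that a prefix containing dots is left intact.
    parts = filename.rsplit(".", 2)
    if len(parts) != 3:
        raise ValueError(f"expected {{PREFIX}}.{{view}}.{{format}}, got {filename!r}")
    prefix, view, fmt = parts
    if VIEW_FORMATS.get(view) != fmt:
        raise ValueError(f"unexpected view/format pair in {filename!r}")
    return prefix, view, fmt


print(qpx_filename("PXD014414", "feature"))          # PXD014414.feature.parquet
print(parse_qpx_filename("PXD014414.sample.parquet"))  # ('PXD014414', 'sample', 'parquet')
```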

Which view do I need?

Use the decision guide below to find the right starting point for your use case.

Use case	Recommended view(s)
Train an ML model on PSM-level spectral features	`psm` (with spectral arrays) + `mz`
Retrieve peptide intensities per sample	`feature`
Build a volcano plot	`de`
Compare protein abundance across conditions	`ae` or `pg`
Look up which proteins a peptide maps to	`pepmap`, `feature`, or `psm`
Get a quick protein or peptide report for a web portal	API Views
Understand the experimental design	`sample` + `run`
Get project-level metadata and search parameters	`dataset`
Integrate with scanpy / scverse ecosystem	`.ae.h5ad` + `.de.h5ad`
List all files in a QPX project	`dataset` or glob `{PREFIX}.*`
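
For the file-listing case, a directory scan is often enough. The sketch below assumes that all files of a project sit in one directory and follow the `{PREFIX}.{view}.{format}` naming described on this page; `list_project_files` is a hypothetical helper, not a QPX API function.

```python
import glob
import tempfile
from pathlib import Path


def list_project_files(directory, prefix):
    """Map each view name to its file for one project, based on file names only."""
    files_by_view = {}
    # Escape the prefix so characters such as '[' are matched literally.
    pattern = f"{glob.escape(prefix)}.*.*"
    for path in sorted(Path(directory).glob(pattern)):
        # Expect exactly `{view}.{format}` after the prefix and its dot.
        remainder = path.name[len(prefix) + 1:]
        view, _, fmt = remainder.partition(".")
        if view and fmt and "." not in fmt:
            files_by_view[view] = path
    return files_by_view


# Usage with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["PXD014414.feature.parquet", "PXD014414.ae.h5ad",
                 "PXD014414.sdrf.tsv", "OTHER.psm.parquet"]:
        (Path(tmp) / name).touch()
    found = list_project_files(tmp, "PXD014414")
    print(sorted(found))  # ['ae', 'feature', 'sdrf']
```

Matching on the prefix alone picks up the Parquet, AnnData, and TSV files together, which a `*.parquet` pattern would not.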

Note

Some views are better suited for specific acquisition methods. The `psm` view is primarily intended for DDA experiments. The `feature` view covers both DDA and DIA workflows.

Shared concepts

Several data structures are reused across multiple views. Each concept has its own page with full definitions and examples.

Concept	Used in views	Description
Peptidoform	PSM, Feature	Peptide sequence with modifications in ProForma notation
Modifications	PSM, Feature	Structured modification representation with localization scores
Intensities	Feature, Protein Group	Primary and additional intensity measurements across channels
Scores & CV Terms	PSM, Feature, Protein Group	Search engine scores and controlled vocabulary parameters
Scan Numbers	PSM, Feature, MZ	Instrument-specific scan identifiers and format conventions
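
As a rough illustration of the Peptidoform concept, the sketch below splits a ProForma-style string into its bare sequence and a list of residue-level modifications. It handles only the simplest case, a name in square brackets directly after a residue, and ignores the rest of the ProForma grammar (terminal, labile, ambiguous, and mass-delta notations). The function name and return shape are illustrative and do not reflect the QPX column layout.

```python
import re

# One residue letter, optionally followed by a single [modification] tag.
_TOKEN = re.compile(r"([A-Z])(?:\[([^\]]+)\])?")


def parse_simple_peptidoform(peptidoform):
    """Return (bare_sequence, [(1-based position, modification name), ...])."""
    residues = []
    modifications = []
    cursor = 0
    for match in _TOKEN.finditer(peptidoform):
        # Any gap between tokens means the string uses syntax this sketch does not cover.
        if match.start() != cursor:
            raise ValueError(f"unsupported syntax at index {cursor} in {peptidoform!r}")
        residue, modification = match.groups()
        residues.append(residue)
        if modification is not None:
            modifications.append((len(residues), modification))
        cursor = match.end()
    if cursor != len(peptidoform):
        raise ValueError(f"unsupported syntax at index {cursor} in {peptidoform!r}")
    return "".join(residues), modifications


print(parse_simple_peptidoform("EM[Oxidation]EVEES[Phospho]PEK"))
# ('EMEVEESPEK', [(2, 'Oxidation'), (7, 'Phospho')])
```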

Current status

The QPX format is currently at version 1.0 and is primarily implemented in the quantms workflow. The format is open and can be adopted by any software tool. Versioning follows `{major}.{minor}` semantics: major releases may introduce breaking changes, while minor updates remain backward compatible. Every view file includes a `qpx_version` metadata field that identifies which specification version produced it.
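
A reader can use the version string to decide whether it should attempt to load a file. The following is a minimal sketch of such a check, not a documented rule: it assumes that a reader handles files with the same major version and a minor version no newer than its own, and it leaves aside how the `qpx_version` value is actually read from a file. Both function names are hypothetical.

```python
def parse_qpx_version(version):
    """Parse a '{major}.{minor}' string such as '1.0' into a tuple of ints."""
    major, sep, minor = version.partition(".")
    if not sep or not major.isdecimal() or not minor.isdecimal():
        raise ValueError(f"not a '{{major}}.{{minor}}' version: {version!r}")
    return int(major), int(minor)


def can_read(file_version, reader_version):
    """Return True if a reader for `reader_version` is expected to read the file.

    Assumption: the major versions must match, and the file's minor version
    must not be newer than the reader's.
    """
    file_major, file_minor = parse_qpx_version(file_version)
    reader_major, reader_minor = parse_qpx_version(reader_version)
    return file_major == reader_major and file_minor <= reader_minor


print(can_read("1.0", "1.2"))  # True: older minor version, same major
print(can_read("2.0", "1.2"))  # False: major version differs
```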