QPX Format Overview
The QPX format is a modern, scalable data format designed specifically for proteomics data analysis. It addresses the limitations of existing formats -- XML-based HUPO-PSI standards (mzML, mzIdentML) and tab-delimited formats like mzTab -- which struggle with large-scale datasets and advanced analytical use cases such as AI/ML model development.
QPX organizes proteomics results into multiple views, each capturing a different aspect of the data (identifications, quantifications, spectra, metadata). Data views use Apache Parquet as the primary serialization format. Expression views (absolute and differential expression) use AnnData (.h5ad) for native interoperability with the scverse ecosystem.
Note
QPX does not aim to replace mzTab or individual tool output formats. Its goal is to provide a unified, performance-oriented format that enables AI-related use cases and easy integration of proteomics results across tools and platforms.
Data model
The diagram below shows the QPX views and how they relate to each other. Solid arrows indicate data flow from raw spectra through identification and quantification to final expression results. Dashed arrows indicate supporting links involving the metadata and mapping views.
```mermaid
graph LR
MZ[mz<br/>Mass Spectra] --> PSM[psm<br/>Peptide Spectrum Matches]
PSM --> FEAT[feature<br/>Peptide Features]
FEAT --> PG[pg<br/>Protein Groups]
FEAT --> API["API Views<br/>(on demand)"]
PG --> API
PG --> AE[ae<br/>Absolute Expression]
PG --> DE[de<br/>Differential Expression]
PSM -.-> PPM[pepmap<br/>Peptide-Protein Map]
FEAT -.-> PPM
SAMPLE[sample<br/>Sample Metadata] -.-> FEAT
SAMPLE -.-> PG
SAMPLE -.-> AE
RUN[run<br/>Run Metadata] -.-> FEAT
RUN -.-> PG
RUN -.-> API
DATASET[dataset<br/>Dataset Metadata] -.-> MZ
DATASET -.-> PSM
DATASET -.-> FEAT
style MZ fill:#e1f5fe
style PSM fill:#e8f5e9
style FEAT fill:#e8f5e9
style PG fill:#fff3e0
style PPM fill:#e8f5e9
style API fill:#e8f5e9,stroke-dasharray: 5 5
style AE fill:#fce4ec
style DE fill:#fce4ec
style SAMPLE fill:#f3e5f5
style RUN fill:#f3e5f5
style DATASET fill:#f3e5f5
```
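The solid arrows in the diagram can also be read as a small dependency graph. The sketch below is only an illustration of that reading: the `UPSTREAM` mapping and the `upstream_views` helper are hypothetical names, not part of any QPX API, and the edges are transcribed by hand from the solid arrows (the on-demand API views and the dashed links are left out).

```python
# Hypothetical transcription of the solid arrows in the diagram:
# each view maps to the views it is directly derived from.
UPSTREAM = {
    "mz": [],
    "psm": ["mz"],
    "feature": ["psm"],
    "pg": ["feature"],
    "ae": ["pg"],
    "de": ["pg"],
}


def upstream_views(view: str) -> list[str]:
    """Return every view that `view` depends on, nearest first."""
    ordered: list[str] = []
    pending = list(UPSTREAM[view])
    while pending:
        current = pending.pop(0)
        if current not in ordered:
            ordered.append(current)
            pending.extend(UPSTREAM[current])
    return ordered


print(upstream_views("de"))  # ['pg', 'feature', 'psm', 'mz']
```

A helper like this could be used to decide which files must be present before a given view can be regenerated.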
Views at a glance
Data Views
| View | File pattern | Format | Description |
|---|---|---|---|
| mz | {PREFIX}.mz.parquet | Parquet | Raw and processed mass spectra (MS1 and MS2) |
| psm | {PREFIX}.psm.parquet | Parquet | Peptide spectrum matches from database search engines (primarily DDA) |
| feature | {PREFIX}.feature.parquet | Parquet | Quantified peptide features per MS run, with sample and channel intensities |
| pg | {PREFIX}.pg.parquet | Parquet | Protein groups with quantification across runs and channels |
| pepmap | {PREFIX}.pepmap.parquet | Parquet | Deduplicated peptide-to-protein mapping with gene names and uniqueness flags |
API Views
These are not stored as standalone files. They are programmable, on-demand views that the API computes by joining and aggregating the primary data views. Users can request different fields and aggregations depending on their needs (e.g., protein counts, peptide abundances, identification summaries). See API Views for details.
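As a rough illustration of what "joining and aggregating" can mean here, the sketch below counts distinct quantified peptides per protein by combining rows from a feature-like table with a pepmap-like table. Everything in it is assumed for illustration: the column names (`sequence`, `protein`, `run`, `intensity`), the made-up accessions and rows, and the `peptides_per_protein` helper are hypothetical and do not reflect the actual QPX schema or API.

```python
from collections import defaultdict

# Hypothetical rows standing in for the feature and pepmap views.
# Column names and values are illustrative, not the real QPX schema.
features = [
    {"sequence": "PEPTIDEK", "run": "run1", "intensity": 1200.0},
    {"sequence": "PEPTIDEK", "run": "run2", "intensity": 900.0},
    {"sequence": "SAMPLER", "run": "run1", "intensity": 450.0},
]
pepmap = [
    {"sequence": "PEPTIDEK", "protein": "P12345"},
    {"sequence": "SAMPLER", "protein": "P12345"},
    {"sequence": "SAMPLER", "protein": "Q67890"},
]


def peptides_per_protein(feature_rows, pepmap_rows):
    """Join features to proteins by peptide sequence, then count
    the distinct quantified peptides for each protein."""
    # Index the mapping view: peptide sequence -> set of proteins.
    proteins_by_peptide = defaultdict(set)
    for row in pepmap_rows:
        proteins_by_peptide[row["sequence"]].add(row["protein"])

    # Join and aggregate: protein -> set of peptide sequences seen.
    peptides_by_protein = defaultdict(set)
    for row in feature_rows:
        for protein in proteins_by_peptide[row["sequence"]]:
            peptides_by_protein[protein].add(row["sequence"])

    return {protein: len(peps) for protein, peps in peptides_by_protein.items()}


counts = peptides_per_protein(features, pepmap)
print(counts)
```

In practice a join like this would more likely run in a query engine over the Parquet files; the pure-Python version only shows the shape of the computation.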
Expression Views
| View | File pattern | Format | Description |
|---|---|---|---|
| ae | {PREFIX}.ae.h5ad | AnnData | Absolute expression matrix (iBAQ values per protein per sample) |
| de | {PREFIX}.de.h5ad | AnnData | Differential expression results (fold changes, p-values between contrasts) |
Metadata
| View | File pattern | Format | Description |
|---|---|---|---|
| dataset | {PREFIX}.dataset.parquet | Parquet | Project-level metadata, search parameters, and software provenance |
| sample | {PREFIX}.sample.parquet | Parquet | Biological sample metadata (one row per sample) |
| run | {PREFIX}.run.parquet | Parquet | Data acquisition run metadata (one row per run) |
| ontology | {PREFIX}.ontology.parquet | Parquet | Field-to-ontology mapping (makes the dataset self-describing) |
| provenance | {PREFIX}.provenance.parquet | Parquet | Processing chain: tools, versions, parameters, and FDR thresholds |
| sdrf | {PREFIX}.sdrf.tsv | TSV | Original SDRF file, preserved for provenance |
Tip
File names follow the pattern {PREFIX}.{view}.{format}, for example PXD014414.feature.parquet or PXD014414.sample.parquet. See File Extensions & Naming for the full convention.
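The naming pattern can be taken apart mechanically. The following is a minimal sketch, assuming only the {PREFIX}.{view}.{format} pattern described above; `QpxFileName` and `parse_qpx_filename` are hypothetical names introduced for this example, not part of a QPX library.

```python
from typing import NamedTuple


class QpxFileName(NamedTuple):
    """Components of a '{PREFIX}.{view}.{format}' file name (hypothetical type)."""
    prefix: str
    view: str
    format: str


def parse_qpx_filename(name: str) -> QpxFileName:
    """Split the name from the right, so a prefix that itself
    contains dots is kept intact.

    Raises ValueError if the name has fewer than two dots.
    """
    prefix, view, fmt = name.rsplit(".", 2)
    return QpxFileName(prefix, view, fmt)


print(parse_qpx_filename("PXD014414.feature.parquet"))
# QpxFileName(prefix='PXD014414', view='feature', format='parquet')
```

Splitting from the right is a design choice made here for robustness; the full naming convention may impose stricter rules on the prefix.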
Which view do I need?
Use the decision guide below to find the right starting point for your use case.
| Use case | Recommended view(s) |
|---|---|
| Train an ML model on PSM-level spectral features | psm (with spectral arrays) + mz |
| Retrieve peptide intensities per sample | feature |
| Build a volcano plot | de |
| Compare protein abundance across conditions | ae or pg |
| Look up which proteins a peptide maps to | pepmap, feature, or psm |
| Get a quick protein or peptide report for a web portal | API Views |
| Understand the experimental design | sample + run |
| Get project-level metadata and search parameters | dataset |
| Integrate with scanpy / scverse ecosystem | .ae.h5ad + .de.h5ad |
| List all files in a QPX project | dataset or glob {PREFIX}.* |
Note
Some views are better suited for specific acquisition methods. The psm view is primarily intended for DDA experiments. The feature view covers both DDA and DIA workflows.
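For the "list all files" use case in the guide above, a glob over the project directory is usually enough. This is a minimal sketch, assuming the files sit in one directory and follow the {PREFIX}.{view}.{format} naming shown in the tables; `list_project_files` is a hypothetical helper, and the demonstration uses empty placeholder files rather than real QPX data.

```python
import tempfile
from pathlib import Path


def list_project_files(directory: Path, prefix: str) -> list[str]:
    """Return the names of all files in `directory` that start with
    '{prefix}.', sorted alphabetically.

    The prefix is used verbatim in a glob pattern, so it should not
    contain glob metacharacters such as '*' or '['.
    """
    return sorted(
        path.name for path in directory.glob(f"{prefix}.*") if path.is_file()
    )


# Demonstration with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp)
    for name in (
        "PXD014414.psm.parquet",
        "PXD014414.ae.h5ad",
        "PXD014414.sdrf.tsv",
        "notes.txt",
    ):
        (project / name).touch()
    found = list_project_files(project, "PXD014414")

print(found)
# ['PXD014414.ae.h5ad', 'PXD014414.psm.parquet', 'PXD014414.sdrf.tsv']
```

Matching on the prefix alone, rather than on `.parquet`, also picks up the AnnData and TSV files.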
Shared concepts
Several data structures are reused across multiple views. Each concept has its own page with full definitions and examples.
| Concept | Used in views | Description |
|---|---|---|
| Peptidoform | PSM, Feature | Peptide sequence with modifications in ProForma notation |
| Modifications | PSM, Feature | Structured modification representation with localization scores |
| Intensities | Feature, Protein Group | Primary and additional intensity measurements across channels |
| Scores & CV Terms | PSM, Feature, Protein Group | Search engine scores and controlled vocabulary parameters |
| Scan Numbers | PSM, Feature, MZ | Instrument-specific scan identifiers and format conventions |
Current status
The QPX format is at version 1.0, primarily implemented in the quantms workflow. The format is open and can be adopted by any software tool. Versioning follows {major}.{minor} semantics: major releases may introduce breaking changes, while minor updates remain backward compatible. Every view file includes a qpx_version metadata field to identify which specification version produced it.
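A reader could use the {major}.{minor} rule to decide whether it can open a file. The sketch below shows one possible interpretation and is not taken from the specification: it assumes a reader accepts files with the same major version and a minor version no newer than its own. `parse_version` and `can_read` are hypothetical helpers, and how the qpx_version value is retrieved from a file is left out.

```python
def parse_version(text: str) -> tuple[int, int]:
    """Parse a '{major}.{minor}' string such as '1.0'."""
    major, minor = text.split(".")
    return int(major), int(minor)


def can_read(file_version: str, reader_version: str) -> bool:
    """Assumed policy: same major version, and the file's minor
    version is not newer than the reader's."""
    file_major, file_minor = parse_version(file_version)
    reader_major, reader_minor = parse_version(reader_version)
    return file_major == reader_major and file_minor <= reader_minor


print(can_read("1.0", "1.2"))  # True: older minor, same major
print(can_read("2.0", "1.2"))  # False: major version differs
```

Whether an older reader may also open files from a newer minor release depends on how the specification defines backward compatibility; the conservative choice here rejects them.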