Serialization & Parquet Details¶
QPX uses two serialization formats, each chosen for its suitability to the data it stores: Apache Parquet for columnar proteomics data and metadata, and AnnData (.h5ad) for expression views.
Supported Serialization Formats¶
| Format | Extension | Used By | Description |
|---|---|---|---|
| Apache Parquet | .parquet | PSM, Feature, PG, Peptide, Protein, MZ, Dataset, Sample, Run | Columnar storage with compression |
| AnnData (HDF5) | .h5ad | AE, DE | Matrix-form expression data (scverse ecosystem) |
Note
The choice of format for each view is driven by its access patterns. Data views and metadata use Parquet for column-level filtering and large-scale analytical queries. Expression views use AnnData (.h5ad) for native interoperability with the scverse ecosystem (scanpy, scvi-tools, muon). The original SDRF (.sdrf.tsv) is preserved alongside the Parquet files for provenance but is not a QPX output format.
Parquet Format¶
Apache Parquet is a columnar storage format designed for efficient analytical processing. Unlike row-oriented formats (CSV, TSV, mzTab), Parquet stores data column by column, which provides significant advantages for proteomics workflows:
- File metadata and column metadata: Each Parquet file includes a footer that describes the schema, row groups, column statistics (min/max values), and custom key-value metadata. This allows query engines to skip irrelevant data without reading the full file.
- Column-oriented design: Analytical queries that access a subset of columns (e.g., retrieving only `sequence` and `charge` from a PSM file) read only the relevant column chunks, reducing I/O dramatically compared to row-based formats.
- Broad ecosystem support: Parquet is natively supported by PyArrow, DuckDB, Polars, Apache Spark, pandas, R (via the `arrow` package), and many other tools. This makes QPX files immediately usable across languages and frameworks without custom parsers.
Compression¶
Parquet's columnar layout achieves substantial compression ratios because values within a column tend to have similar data types and distributions. The following benchmarks illustrate the storage savings and write performance for real-world proteomics datasets:
| Project | Type | Original Size (GB) | Parquet Size (MB) | PSM Write Time (s) | Feature Write Time (s) |
|---|---|---|---|---|---|
| PXD046440 | MaxQuant | 48 | 337/343 | 985.27 | 678.47 |
| PXD016999 | mzTab | 160 | 155/228 | 539.00 | 3554.53 |
| PXD019909 | DIA-NN | 1.9 | 195 | -- | 229.48 |
Compression algorithms
QPX supports the following compression algorithms for Parquet files:
- Zstd -- Best balance of compression ratio and speed; broad cross-language support (default)
- Snappy -- Fast compression/decompression with moderate ratio
- Gzip -- Higher compression ratio, slower speed
- LZO -- Optimized for decompression speed
- None -- No compression
In addition to block-level compression, Parquet applies encoding schemes at the column level:
- RLE (Run-Length Encoding) -- Efficient for columns with many repeated values (e.g., `charge`). Boolean columns like `is_decoy` use native bit-packing (1 bit per value) with optional RLE for maximum compression
- Dictionary encoding -- Replaces repeated string values with integer codes (e.g., `run_file_name`, `sequence`)
Parquet Features¶
QPX leverages the following Parquet capabilities:
- Columnar Storage: Each column is stored independently, enabling query engines to read only the columns needed for a given analysis. This reduces I/O for analytical queries that access a subset of fields.
- Efficient Compression: Column-level compression with Zstd, Snappy, Gzip, or LZO, combined with RLE and dictionary encoding, achieves compression ratios of 10x--100x on typical proteomics datasets.
- Schema Evolution: Columns can be added, removed, or modified across QPX versions without breaking compatibility with existing files. Readers that encounter unknown columns can safely ignore them.
- Complex Data Types: Parquet natively supports nested structs, arrays, and maps. QPX uses these for fields like `modifications` (array of structs), `additional_scores` (array of structs), and spectral arrays (`mz_array`, `intensity_array`).
Hive Partitioning¶
QPX supports Hive-style partitioning for large datasets. This splits a single Parquet file into a directory tree where each subdirectory encodes a partition column value (e.g., run_file_name=run_01/). This is particularly useful for datasets with 100+ runs where a single Parquet file would exceed several GB.
Directory Structure¶
The default partition column is run_file_name:
```
PXD004683/
├── exp.pg.parquet          # Single-file structure (small)
├── exp.sample.parquet
├── exp.run.parquet
├── exp.dataset.parquet
└── feature/                # Partitioned structure (large)
    ├── run_file_name=run_01.raw/
    │   └── part-0.parquet
    ├── run_file_name=run_02.raw/
    │   └── part-0.parquet
    └── run_file_name=run_03.raw/
        └── part-0.parquet
```
Auto-Discovery¶
When opening a dataset, QPX checks for single Parquet files first. If no single file is found for a structure, it falls back to checking for a Hive-partitioned directory with the same name (e.g., feature/).
```python
import qpx

# Automatically discovers both single files and partitioned directories
ds = qpx.open("PXD004683/")
print(ds.feature.count())  # Works regardless of storage layout
```
Writing Partitioned Data¶
```python
from qpx.writers.base import BaseWriter
import pyarrow.parquet as pq

# Read existing data
table = pq.read_table("exp.feature.parquet")

# Write as Hive-partitioned
BaseWriter.write_partitioned(table, "feature/", partition_cols=["run_file_name"])
```
DuckDB Integration¶
DuckDB reads Hive-partitioned directories natively:
```sql
-- The partition column (run_file_name) is automatically available
SELECT run_file_name, COUNT(*) AS n_features
FROM read_parquet('feature/**/*.parquet', hive_partitioning=true)
GROUP BY run_file_name;
```
Practical limits of Parquet metadata
Parquet stores all file-level and column-level metadata in the file footer. When reading a partitioned dataset, the footer of every partition file must be loaded into memory to reconstruct the full schema and row group statistics. For datasets with thousands of partitions, this metadata overhead can become significant. Consider using fewer, larger partitions rather than many small ones.
Cloud / S3 Access¶
QPX datasets can be opened directly from S3-compatible cloud storage without downloading files locally. This uses DuckDB's httpfs extension under the hood.
Opening a Dataset from S3¶
```python
import qpx

ds = qpx.open(
    "s3://my-bucket/datasets/PXD014414/",
    s3_config={
        "region": "us-east-1",
        "access_key_id": "AKIA...",
        "secret_access_key": "...",
    },
)

# Use exactly like a local dataset
print(ds.feature.count())
df = ds.feature.filter("charge > 2").to_df()
```
Configuration Options¶
| Parameter | Description |
|---|---|
| `region` | AWS region (e.g., `"us-east-1"`) |
| `access_key_id` | AWS access key ID |
| `secret_access_key` | AWS secret access key |
| `endpoint` | Custom S3 endpoint for MinIO or other S3-compatible services |
| `anonymous` | Set `True` for public buckets (no credentials needed) |
S3-Compatible Services¶
For MinIO or other S3-compatible services, set the endpoint parameter:
```python
ds = qpx.open(
    "s3://proteomics/PXD014414/",
    s3_config={
        "endpoint": "minio.example.com:9000",
        "access_key_id": "minioadmin",
        "secret_access_key": "minioadmin",
    },
)
```
Performance
Cloud access adds network latency to every query. For interactive analysis, consider downloading the dataset locally first. Cloud access is most useful for automated pipelines and one-off queries against large remote datasets.
File-Level Metadata¶
Every QPX Parquet file includes metadata as key-value pairs stored in the Parquet file footer. These fields identify the file's origin, format version, and provenance:
| Key | Description |
|---|---|
| `qpx_version` | Version of the QPX specification used to generate the file |
| `software_provider` | Name and version of the software that produced the data |
| `project_accession` | ProteomeXchange or project accession (e.g., PXD012345) |
| `project_title` | Human-readable title of the project |
| `scan_format` | Format of scan identifiers: `scan`, `index`, or `nativeId` |
| `creator` | Name of the tool or person who created the file |
| `file_type` | QPX view type (e.g., `psm_file`, `feature_file`, `pg_file`) |
| `creation_date` | ISO 8601 date when the file was created |
| `compression_format` | Compression algorithm used: `zstd`, `snappy`, `gzip`, `lzo`, or `none` |
Writing Parquet with metadata in Python
```python
import pyarrow as pa
import pyarrow.parquet as pq

# Build your table
table = pa.table({"sequence": ["PEPTIDER", "ANOTHERPEPTIDE"]})

# Define file-level metadata
file_metadata = {
    'qpx_version': '1.0',
    'software_provider': 'QuantMS 1.3.0',
    'project_accession': 'PXD012345',
    'file_type': 'psm_file',
    'creation_date': '2021-01-01',
}

# Merge with existing schema metadata and write
existing_metadata = table.schema.metadata or {}
merged_metadata = {
    **existing_metadata,
    **{k.encode(): v.encode() for k, v in file_metadata.items()},
}
table = table.replace_schema_metadata(merged_metadata)
pq.write_table(table, 'psm_data.parquet')
```
Reading file metadata in Python
See Also¶
- File Naming -- file extension and naming conventions for QPX files
- Versioning -- version format and backward compatibility rules
- QPX Format Overview -- overview of all QPX views and their relationships