
Serialization & Parquet Details

QPX uses two serialization formats, each chosen for its suitability to the data it stores: Apache Parquet for columnar proteomics data and metadata, and AnnData (.h5ad) for expression views.

Supported Serialization Formats

| Format | Extension | Used By | Description |
|---|---|---|---|
| Apache Parquet | .parquet | PSM, Feature, PG, Peptide, Protein, MZ, Dataset, Sample, Run | Columnar storage with compression |
| AnnData (HDF5) | .h5ad | AE, DE | Matrix-form expression data (scverse ecosystem) |

Note

The choice of format for each view is driven by its access patterns. Data views and metadata use Parquet for column-level filtering and large-scale analytical queries. Expression views use AnnData (.h5ad) for native interoperability with the scverse ecosystem (scanpy, scvi-tools, muon). The original SDRF (.sdrf.tsv) is preserved alongside the Parquet files for provenance but is not a QPX output format.

Parquet Format

Apache Parquet is a columnar storage format designed for efficient analytical processing. Unlike row-oriented formats (CSV, TSV, mzTab), Parquet stores data column by column, which provides significant advantages for proteomics workflows:

  • File metadata and column metadata: Each Parquet file includes a footer that describes the schema, row groups, column statistics (min/max values), and custom key-value metadata. This allows query engines to skip irrelevant data without reading the full file.
  • Column-oriented design: Analytical queries that access a subset of columns (e.g., retrieving only sequence and charge from a PSM file) read only the relevant column chunks, reducing I/O dramatically compared to row-based formats.
  • Broad ecosystem support: Parquet is natively supported by PyArrow, DuckDB, Polars, Apache Spark, pandas, R (via the arrow package), and many other tools. This makes QPX files immediately usable across languages and frameworks without custom parsers.

Compression

Parquet's columnar layout achieves substantial compression ratios because values within a column tend to have similar data types and distributions. The following benchmarks illustrate the storage savings and write performance for real-world proteomics datasets:

| Project | Type | Original Size (GB) | Parquet Size (MB) | PSM Write Time (s) | Feature Write Time (s) |
|---|---|---|---|---|---|
| PXD046440 | MaxQuant | 48 | 337/343 | 985.27 | 678.47 |
| PXD016999 | mzTab | 160 | 155/228 | 539.00 | 3554.53 |
| PXD019909 | DIA-NN | 1.9 | 195 | -- | 229.48 |

Compression algorithms

QPX supports the following compression algorithms for Parquet files:

  • Zstd -- Best balance of compression ratio and speed; broad cross-language support (default)
  • Snappy -- Fast compression/decompression with moderate ratio
  • Gzip -- Higher compression ratio, slower speed
  • LZO -- Optimized for decompression speed
  • None -- No compression

In addition to block-level compression, Parquet applies encoding schemes at the column level:

  • RLE (Run-Length Encoding) -- Efficient for columns with many repeated values (e.g., charge). Boolean columns like is_decoy use native bit-packing (1 bit per value) with optional RLE for maximum compression
  • Dictionary encoding -- Replaces repeated string values with integer codes (e.g., run_file_name, sequence)

Parquet Features

QPX leverages the following Parquet capabilities:

  • Columnar Storage: Each column is stored independently, enabling query engines to read only the columns needed for a given analysis. This reduces I/O for analytical queries that access a subset of fields.
  • Efficient Compression: Column-level compression with Zstd, Snappy, Gzip, or LZO, combined with RLE and dictionary encoding, achieves compression ratios of 10x--100x on typical proteomics datasets.
  • Schema Evolution: Columns can be added, removed, or modified across QPX versions without breaking compatibility with existing files. Readers that encounter unknown columns can safely ignore them.
  • Complex Data Types: Parquet natively supports nested structs, arrays, and maps. QPX uses these for fields like modifications (array of structs), additional_scores (array of structs), and spectral arrays (mz_array, intensity_array).

Hive Partitioning

QPX supports Hive-style partitioning for large datasets. This splits a single Parquet file into a directory tree where each subdirectory encodes a partition column value (e.g., run_file_name=run_01/). This is particularly useful for datasets with 100+ runs where a single Parquet file would exceed several GB.

Directory Structure

The default partition column is run_file_name:

PXD004683/
├── exp.pg.parquet                   # Single-file structure (small)
├── exp.sample.parquet
├── exp.run.parquet
├── exp.dataset.parquet
└── feature/                         # Partitioned structure (large)
    ├── run_file_name=run_01.raw/
    │   └── part-0.parquet
    ├── run_file_name=run_02.raw/
    │   └── part-0.parquet
    └── run_file_name=run_03.raw/
        └── part-0.parquet

Auto-Discovery

When opening a dataset, QPX checks for single Parquet files first. If no single file is found for a structure, it falls back to checking for a Hive-partitioned directory with the same name (e.g., feature/).

import qpx

# Automatically discovers both single files and partitioned directories
ds = qpx.open("PXD004683/")
print(ds.feature.count())  # Works regardless of storage layout

Writing Partitioned Data

from qpx.writers.base import BaseWriter
import pyarrow.parquet as pq

# Read existing data
table = pq.read_table("exp.feature.parquet")

# Write as Hive-partitioned
BaseWriter.write_partitioned(table, "feature/", partition_cols=["run_file_name"])

DuckDB Integration

DuckDB reads Hive-partitioned directories natively:

-- The partition column (run_file_name) is automatically available
SELECT run_file_name, COUNT(*) AS n_features
FROM read_parquet('feature/**/*.parquet', hive_partitioning=true)
GROUP BY run_file_name;

Practical limits of Parquet metadata

Parquet stores all file-level and column-level metadata in the file footer. When reading a partitioned dataset, the footer of every partition file must be loaded into memory to reconstruct the full schema and row group statistics. For datasets with thousands of partitions, this metadata overhead can become significant. Consider using fewer, larger partitions rather than many small ones.

Cloud / S3 Access

QPX datasets can be opened directly from S3-compatible cloud storage without downloading files locally. This uses DuckDB's httpfs extension under the hood.

Opening a Dataset from S3

import qpx

ds = qpx.open(
    "s3://my-bucket/datasets/PXD014414/",
    s3_config={
        "region": "us-east-1",
        "access_key_id": "AKIA...",
        "secret_access_key": "...",
    },
)

# Use exactly like a local dataset
print(ds.feature.count())
df = ds.feature.filter("charge > 2").to_df()

Configuration Options

| Parameter | Description |
|---|---|
| region | AWS region (e.g., "us-east-1") |
| access_key_id | AWS access key ID |
| secret_access_key | AWS secret access key |
| endpoint | Custom S3 endpoint for MinIO or other S3-compatible services |
| anonymous | Set True for public buckets (no credentials needed) |

S3-Compatible Services

For MinIO or other S3-compatible services, set the endpoint parameter:

ds = qpx.open(
    "s3://proteomics/PXD014414/",
    s3_config={
        "endpoint": "minio.example.com:9000",
        "access_key_id": "minioadmin",
        "secret_access_key": "minioadmin",
    },
)

Performance

Cloud access adds network latency to every query. For interactive analysis, consider downloading the dataset locally first. Cloud access is most useful for automated pipelines and one-off queries against large remote datasets.

File-Level Metadata

Every QPX Parquet file includes metadata as key-value pairs stored in the Parquet file footer. These fields identify the file's origin, format version, and provenance:

| Key | Description |
|---|---|
| qpx_version | Version of the QPX specification used to generate the file |
| software_provider | Name and version of the software that produced the data |
| project_accession | ProteomeXchange or project accession (e.g., PXD012345) |
| project_title | Human-readable title of the project |
| scan_format | Format of scan identifiers: scan, index, or nativeId |
| creator | Name of the tool or person who created the file |
| file_type | QPX view type (e.g., psm_file, feature_file, pg_file) |
| creation_date | ISO 8601 date when the file was created |
| compression_format | Compression algorithm used: zstd, snappy, gzip, lzo, or none |

Writing Parquet with metadata in Python

import pyarrow as pa
import pyarrow.parquet as pq

# Build your table
table = pa.table({"sequence": ["PEPTIDER", "ANOTHERPEPTIDE"]})

# Define file-level metadata
file_metadata = {
    'qpx_version': '1.0',
    'software_provider': 'QuantMS 1.3.0',
    'project_accession': 'PXD012345',
    'file_type': 'psm_file',
    'creation_date': '2021-01-01',
}

# Merge with existing schema metadata and write
existing_metadata = table.schema.metadata or {}
merged_metadata = {
    **existing_metadata,
    **{k.encode(): v.encode() for k, v in file_metadata.items()}
}
table = table.replace_schema_metadata(merged_metadata)
pq.write_table(table, 'psm_data.parquet')

Reading file metadata in Python

import pyarrow.parquet as pq

parquet_file = pq.ParquetFile("psm_data.parquet")
metadata = parquet_file.schema_arrow.metadata
for key, value in metadata.items():
    print(f"{key.decode()}: {value.decode()}")

See Also

  • File Naming -- file extension and naming conventions for QPX files
  • Versioning -- version format and backward compatibility rules
  • QPX Format Overview -- overview of all QPX views and their relationships