Collection Indexes¶

An index is a materialized, pre-computed data structure that enables fast search across all datasets in a collection. Unlike the collection itself (which is programmatic), indexes are persisted as Parquet files because the alternative -- scanning the full collection for every query -- is impractical at scale.

Why Indexes?¶

Consider a collection with dozens of datasets. Without an index, searching for a peptide means scanning every PSM file; finding a differentially expressed protein means opening every DE view. With an index, the query reads only a small, targeted partition:

# Without index: scans ALL parquet files (~minutes)
coll.sql("SELECT * FROM psm WHERE sequence = 'PEPTIDEK'")

# With index: reads only the relevant partition (~milliseconds)
coll.index("peptide").search("PEPTIDEK")

An index trades an up-front build cost (minutes to hours, paid again whenever the index is rebuilt) for fast queries (milliseconds). Think of it as a library catalog: the catalog is small and fast to search, and it tells you which book (dataset) to open.

Index Types¶

QPX defines a standard framework for collection indexes. Each index type materializes data from a specific QPX view into a Hive-partitioned Parquet structure optimized for its search pattern.

Peptide Index¶

Maps peptide sequences to the datasets that contain them. Built from psm.parquet across all datasets.


Source view	PSM
Stored at	`_index/peptide/`
Partitioned by	First two amino acids of the sequence (~400 partitions)

_index/peptide/
├── sequence_prefix=AA/data.parquet
├── sequence_prefix=AC/data.parquet
├── ...
├── sequence_prefix=PE/data.parquet    # "PEPTIDEK" lives here
└── sequence_prefix=YY/data.parquet
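
As a minimal sketch of how an exact-match lookup could resolve to a single partition file, the helper below derives the path from the first two residues of the sequence. The function name `peptide_partition_path` and the upper-casing step are assumptions made for illustration; they are not part of the documented QPX API.

```python
from pathlib import PurePosixPath

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def peptide_partition_path(sequence: str, root: str = "_index/peptide") -> PurePosixPath:
    """Return the partition file that would hold `sequence` (hypothetical helper)."""
    if len(sequence) < 2:
        raise ValueError("need at least two residues to derive a partition prefix")
    prefix = sequence[:2].upper()  # assumption: prefixes are stored in upper case
    return PurePosixPath(root) / f"sequence_prefix={prefix}" / "data.parquet"


print(peptide_partition_path("PEPTIDEK"))

# 20 standard residues give 20 * 20 two-letter prefixes, hence ~400 partitions.
print(len(AMINO_ACIDS) ** 2)
```

Because the partition key is a pure function of the query string, a reader can open one file without listing or scanning the others.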

Schema¶

Each row represents one (peptide, peptidoform, dataset) combination:

Field	Type	Description
`sequence`	string	Unmodified peptide amino acid sequence
`peptidoform`	string	Peptide with modifications in ProForma notation
`project_accession`	string	Dataset identifier (e.g., `PXD000561`)
`charge_states`	array[int16]	Distinct precursor charge states observed
`spectra_count`	int32	Number of PSMs for this peptidoform in this dataset
`best_pep`	float64	Best (lowest) posterior error probability across those PSMs
`protein_accessions`	array[string]	All proteins this peptide maps to in this dataset

Querying¶

coll = qpx.open_collection("/data/my_collection/")

# Search by exact sequence
results = coll.index("peptide").search("PEPTIDEK")

# Search by prefix (wildcard)
results = coll.index("peptide").search_prefix("PEPTID")

# Search by peptidoform (with modifications)
results = coll.index("peptide").search_peptidoform("PEP(Phospho)TIDEK")

Build Algorithm¶

SELECT
    sequence, peptidoform, project_accession,
    LIST(DISTINCT charge)                     AS charge_states,
    COUNT(*)                                  AS spectra_count,
    MIN(posterior_error_probability)           AS best_pep,
    -- UNNEST cannot be used inside an aggregate: flatten the per-PSM lists, then de-duplicate
    LIST_DISTINCT(FLATTEN(LIST(protein_accessions))) AS protein_accessions,
    LEFT(sequence, 2)                         AS sequence_prefix
FROM psm
WHERE is_decoy = false
GROUP BY sequence, peptidoform, project_accession
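
For readers who prefer to see the aggregation outside SQL, here is a rough pure-Python equivalent of the query above. It is a sketch, not the QPX implementation: `build_peptide_index` is a hypothetical name, the input rows are made-up toy values, and it assumes no missing values in `charge` or `posterior_error_probability`.

```python
from collections import defaultdict


def build_peptide_index(psm_rows):
    """Aggregate PSM rows into one entry per (sequence, peptidoform, dataset)."""
    groups = defaultdict(list)
    for row in psm_rows:
        if row["is_decoy"]:
            continue  # mirrors WHERE is_decoy = false
        key = (row["sequence"], row["peptidoform"], row["project_accession"])
        groups[key].append(row)

    entries = []
    for (sequence, peptidoform, project), rows in groups.items():
        entries.append({
            "sequence": sequence,
            "peptidoform": peptidoform,
            "project_accession": project,
            # sorted only to make the output deterministic
            "charge_states": sorted({r["charge"] for r in rows}),
            "spectra_count": len(rows),
            "best_pep": min(r["posterior_error_probability"] for r in rows),
            "protein_accessions": sorted(
                {acc for r in rows for acc in r["protein_accessions"]}
            ),
            "sequence_prefix": sequence[:2],
        })
    return entries


# Toy PSM rows, invented for illustration only.
toy_psms = [
    {"sequence": "PEPTIDEK", "peptidoform": "PEPTIDEK", "project_accession": "PXD000561",
     "charge": 2, "posterior_error_probability": 0.01,
     "protein_accessions": ["P1", "P2"], "is_decoy": False},
    {"sequence": "PEPTIDEK", "peptidoform": "PEPTIDEK", "project_accession": "PXD000561",
     "charge": 3, "posterior_error_probability": 0.001,
     "protein_accessions": ["P2"], "is_decoy": False},
    {"sequence": "KEDITPEP", "peptidoform": "KEDITPEP", "project_accession": "PXD000561",
     "charge": 2, "posterior_error_probability": 0.5,
     "protein_accessions": ["DECOY_P1"], "is_decoy": True},
]

index_rows = build_peptide_index(toy_psms)
print(index_rows)
```

The two target PSMs collapse into a single index row, and the decoy row is dropped, which is where the size reduction relative to the PSM data comes from.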

Protein Index¶

Maps protein accessions to the datasets that identified or quantified them. Built from pg.parquet across all datasets.


Source view	Protein Group (PG)
Stored at	`_index/protein/`
Partitioned by	First two characters of the anchor protein accession (~300-400 partitions)

_index/protein/
├── accession_prefix=A0/data.parquet
├── accession_prefix=P0/data.parquet   # "P04637" (TP53) lives here
├── accession_prefix=Q9/data.parquet
└── ...

Schema¶

Each row represents one (protein, dataset) combination:

Field	Type	Description
`anchor_protein`	string	Representative protein accession for the group
`protein_accessions`	array[string]	All accessions in the protein group
`project_accession`	string	Dataset identifier
`gg_names`	array[string]	Gene names associated with the protein group
`global_qvalue`	float64	Protein-level global q-value
`num_peptides`	int32	Number of distinct peptides supporting the protein group
`num_runs`	int32	Number of MS runs where the protein was quantified

Querying¶

# Find all datasets that identified TP53
results = coll.index("protein").search("P04637")

# Search by gene name (partitions are keyed by accession, so this cannot prune: it scans gg_names across partitions)
results = coll.index("protein").search_gene("TP53")

# Find all datasets for accessions that share a prefix (prefix search)
results = coll.index("protein").search_prefix("P046")
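
One plausible way for a prefix search to use the partition layout is to prune on the first two characters of the query and then filter rows inside the surviving partitions. The sketch below shows only the pruning step; `partitions_for_prefix` is a hypothetical helper, and whether `search_prefix` works this way internally is an assumption.

```python
def partitions_for_prefix(query: str, available: list[str], key_len: int = 2) -> list[str]:
    """Return the partition keys a prefix search would need to read."""
    head = query[:key_len]  # only the leading characters can be matched against keys
    return [key for key in available if key.startswith(head)]


# Partition keys as they might appear under _index/protein/ (illustrative subset).
keys = ["A0", "P0", "P1", "Q9"]

print(partitions_for_prefix("P046", keys))  # query longer than the key: one partition
print(partitions_for_prefix("P", keys))     # query shorter than the key: several partitions
```

A query of at least two characters touches a single partition, while a one-character query fans out to every partition whose key starts with that character.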

Build Algorithm¶

SELECT
    anchor_protein, protein_accessions, project_accession,
    gg_names, global_qvalue,
    num_peptides,
    COUNT(DISTINCT run_file_name) AS num_runs,
    LEFT(anchor_protein, 2)       AS accession_prefix
FROM pg
GROUP BY anchor_protein, protein_accessions, project_accession,
         gg_names, global_qvalue, num_peptides

Differential Expression Index¶

Maps proteins to their differential expression results across studies. Built from differential expression (DE) views. Enables meta-analysis queries like "which proteins are consistently up-regulated in cancer studies?"


Source view	Differential Expression (DE)
Stored at	`_index/de/`
Partitioned by	First two characters of the protein accession

_index/de/
├── accession_prefix=A0/data.parquet
├── accession_prefix=P0/data.parquet
└── ...

Schema¶

Each row represents one (protein, contrast, dataset) combination:

Field	Type	Description
`protein_accession`	string	Protein accession
`gene_name`	string, null	Gene name
`project_accession`	string	Dataset identifier
`contrast`	string	Comparison label (e.g., "disease_vs_control")
`log2_fold_change`	float64	Log2 fold change
`adj_pvalue`	float64	Adjusted p-value (Benjamini-Hochberg or similar)
`regulation`	string	"up" or "down" when `adj_pvalue` < 0.05, by the sign of `log2_fold_change`; otherwise "ns" (not significant)
`organisms`	array[string]	Organisms in this study (from sample metadata)

Querying¶

# Which studies show TP53 as differentially expressed?
results = coll.index("de").search("P04637")
#   project_accession | contrast              | log2fc | adj_pvalue | regulation
#   PXD000561         | tumor_vs_normal       | 2.3    | 1.2e-08    | up
#   PXD002137         | treatment_vs_control  | -0.8   | 0.03       | down

# Meta-analysis: proteins consistently up-regulated across 3+ studies
results = coll.index("de").search_consistent(
    min_studies=3,
    direction="up",
    adj_pvalue_cutoff=0.05,
)

# Find all DE results for a gene
results = coll.index("de").search_gene("BRCA1")
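
To make the meta-analysis query concrete, here is a sketch of the kind of computation `search_consistent` might perform over DE index rows. The logic is an assumption, not the documented behavior: it counts distinct datasets per protein (so a study with several matching contrasts counts once) and re-derives the direction from `log2_fold_change` and the supplied cutoff. The function name and the toy rows are illustrative.

```python
from collections import defaultdict


def consistent_proteins(de_rows, min_studies=3, direction="up", adj_pvalue_cutoff=0.05):
    """Return accessions regulated in `direction` in at least `min_studies` datasets."""
    if direction not in ("up", "down"):
        raise ValueError("direction must be 'up' or 'down'")
    studies = defaultdict(set)
    for row in de_rows:
        if row["adj_pvalue"] >= adj_pvalue_cutoff:
            continue  # not significant at the requested cutoff
        fold_change = row["log2_fold_change"]
        matches = fold_change > 0 if direction == "up" else fold_change < 0
        if matches:
            studies[row["protein_accession"]].add(row["project_accession"])
    return sorted(acc for acc, seen in studies.items() if len(seen) >= min_studies)


# Toy DE rows with placeholder study names, invented for illustration only.
toy_de = [
    {"protein_accession": "P04637", "project_accession": "study_a",
     "log2_fold_change": 2.3, "adj_pvalue": 1.2e-08},
    {"protein_accession": "P04637", "project_accession": "study_b",
     "log2_fold_change": 1.1, "adj_pvalue": 0.01},
    {"protein_accession": "P04637", "project_accession": "study_c",
     "log2_fold_change": 0.9, "adj_pvalue": 0.04},
    {"protein_accession": "P38398", "project_accession": "study_a",
     "log2_fold_change": 1.5, "adj_pvalue": 0.001},
    {"protein_accession": "P38398", "project_accession": "study_b",
     "log2_fold_change": -0.7, "adj_pvalue": 0.02},
    {"protein_accession": "P38398", "project_accession": "study_c",
     "log2_fold_change": 1.2, "adj_pvalue": 0.2},
]

print(consistent_proteins(toy_de, min_studies=3, direction="up"))
```

Since each index row already carries the dataset and contrast, this kind of query needs no access to the underlying DE views.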

Build Algorithm¶

SELECT
    protein_accession, gene_name, project_accession,
    contrast, log2_fold_change, adj_pvalue,
    CASE
        WHEN adj_pvalue < 0.05 AND log2_fold_change > 0 THEN 'up'
        WHEN adj_pvalue < 0.05 AND log2_fold_change < 0 THEN 'down'
        ELSE 'ns'
    END AS regulation,
    s.organisms,
    LEFT(protein_accession, 2) AS accession_prefix
FROM de
JOIN (
    SELECT project_accession, LIST(DISTINCT organism) AS organisms
    FROM sample
    GROUP BY project_accession
) s USING (project_accession)
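
The `CASE` expression in the query above can be mirrored in a few lines of Python, which makes the edge cases easier to see. This is an illustrative sketch; `classify_regulation` and its `alpha` parameter are hypothetical names, with `alpha` defaulting to the 0.05 cutoff used in the query.

```python
def classify_regulation(log2_fold_change, adj_pvalue, alpha=0.05):
    """Mirror the CASE expression: return 'up', 'down', or 'ns'."""
    if log2_fold_change is None or adj_pvalue is None:
        # In SQL a comparison with NULL is never true, so the ELSE branch applies.
        return "ns"
    if adj_pvalue < alpha and log2_fold_change > 0:
        return "up"
    if adj_pvalue < alpha and log2_fold_change < 0:
        return "down"
    return "ns"


print(classify_regulation(2.3, 1.2e-08))  # significant, positive fold change
print(classify_regulation(-0.8, 0.03))    # significant, negative fold change
print(classify_regulation(1.5, 0.2))      # not significant
print(classify_regulation(0.0, 0.001))    # significant p-value but zero fold change
```

Note that a fold change of exactly zero, or a missing value, falls through to "ns" regardless of the p-value.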

Future Index Types¶

The framework supports additional index types following the same pattern:

Index	Source view	Partition key	Use case
spectrum	MZ	Binned precursor m/z	"Find spectra matching this m/z and charge"
modification	PSM/Feature	Modification name	"Which datasets have phosphorylation data?"
sample	Sample	Organism/tissue	"Find all liver tissue datasets"

Building Indexes¶

coll = qpx.open_collection("/data/my_collection/")

# Build a specific index
coll.build_index("peptide")
coll.build_index("protein")
coll.build_index("de")

# Build all available indexes
coll.build_all_indexes()

The build process for each index:

1. Reads the source view from every dataset in the collection.
2. Applies index-specific filters (e.g., is_decoy = false for peptide index).
3. Aggregates per the index schema's grouping key.
4. Partitions by the index's partition column.
5. Writes Hive-partitioned Parquet files to _index/{type}/.
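
The build steps listed above can be sketched as one generic in-memory pipeline, with the index-specific parts passed in as functions. Everything here is hypothetical scaffolding for illustration: `build_partitions` and its parameters are not QPX API, and the final step only names the target files instead of writing Parquet.

```python
from collections import defaultdict


def build_partitions(datasets, keep, group_key, aggregate, partition_key, index_type):
    """Run the build steps in memory and return {target_path: index_rows}."""
    groups = defaultdict(list)
    for rows in datasets:                    # read the source view of every dataset
        for row in rows:
            if keep(row):                    # apply index-specific filters
                groups[group_key(row)].append(row)

    partitions = defaultdict(list)
    for key, rows in groups.items():
        entry = aggregate(key, rows)         # aggregate per grouping key
        partitions[partition_key(entry)].append(entry)  # assign to a partition

    # A real build would write each list as a Parquet file; this sketch only names them.
    return {
        f"_index/{index_type}/{name}/data.parquet": entries
        for name, entries in partitions.items()
    }


# Toy source rows for two datasets, invented for illustration only.
toy_datasets = [
    [{"sequence": "PEPTIDEK", "project_accession": "PXD000561", "is_decoy": False},
     {"sequence": "PEPTIDEK", "project_accession": "PXD000561", "is_decoy": True}],
    [{"sequence": "AAGLK", "project_accession": "PXD002137", "is_decoy": False}],
]

layout = build_partitions(
    toy_datasets,
    keep=lambda r: not r["is_decoy"],
    group_key=lambda r: (r["sequence"], r["project_accession"]),
    aggregate=lambda key, rows: {
        "sequence": key[0], "project_accession": key[1], "spectra_count": len(rows),
    },
    partition_key=lambda e: f"sequence_prefix={e['sequence'][:2]}",
    index_type="peptide",
)
print(sorted(layout))
```

Keeping the filter, grouping key, aggregation, and partition key as separate inputs is one way to let the peptide, protein, and DE indexes share a single build path.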

Rebuilding¶

Indexes can go stale when datasets are added or removed. The rebuild process is idempotent -- it drops the existing index and recreates it from scratch:

# Rebuild after adding new datasets
coll.build_index("peptide", force=True)

# Check if index is stale (compares dataset list vs index metadata)
coll.index("peptide").is_stale()

Index Metadata¶

Each index stores a metadata file at _index/{type}/_metadata.json:

{
    "index_type": "peptide",
    "source_view": "psm",
    "qpx_version": "1.0",
    "built_at": "2026-04-08T14:30:00Z",
    "datasets_included": ["PXD000561", "PXD002137"],
    "total_entries": 14400000,
    "partitions": 400,
    "compression": "zstd"
}

The datasets_included list enables staleness detection: if the datasets currently in the collection differ from this list, because a dataset was added or removed, the index is stale.
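
A staleness check along these lines can be sketched as a set comparison between the metadata and the current dataset list. The helper below is an illustration, not the actual `is_stale()` implementation; it treats both additions and removals as stale, in line with the Rebuilding section, and the accession `PXD999999` is a made-up placeholder.

```python
import json


def is_stale(metadata_json: str, current_datasets) -> bool:
    """Return True when the indexed datasets differ from the current collection."""
    indexed = set(json.loads(metadata_json)["datasets_included"])
    return indexed != set(current_datasets)


metadata = '{"index_type": "peptide", "datasets_included": ["PXD000561", "PXD002137"]}'

print(is_stale(metadata, ["PXD002137", "PXD000561"]))               # same datasets, any order
print(is_stale(metadata, ["PXD000561", "PXD002137", "PXD999999"]))  # a dataset was added
print(is_stale(metadata, ["PXD000561"]))                            # a dataset was removed
```

Comparing sets ignores ordering, so listing the same datasets in a different order does not trigger a rebuild.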

Storage¶

Indexes are lightweight compared to the source data:

Index type	Typical size	Notes
Peptide	~0.05% of PSM data	Aggregates millions of PSMs into unique peptide-dataset pairs
Protein	~0.01% of PG data	One row per protein group per dataset
DE	~0.02% of DE data	One row per protein per contrast per dataset

Conventions¶

Indexes live under _index/ at the collection root.
The _ prefix signals "infrastructure, not a dataset" -- collection discovery skips it.
Each index type gets its own subdirectory: _index/peptide/, _index/protein/, _index/de/.
Partition column names use the {field}_prefix naming pattern.
Index Parquet files use ZSTD compression.
Index metadata is stored as _metadata.json (not Parquet) for easy inspection.
Each index declares its source_view in metadata, linking it to the QPX view it materializes.