Collection Indexes¶
An index is a materialized, pre-computed data structure that enables fast search across all datasets in a collection. Unlike the collection itself (which is programmatic), indexes are persisted as Parquet files because the alternative -- scanning the full collection for every query -- is impractical at scale.
Why Indexes?¶
Consider a collection with dozens of datasets. Without an index, searching for a peptide means scanning every PSM file; finding a differentially expressed protein means opening every DE view. With an index, the query reads only a small, targeted partition:
# Without index: scans ALL parquet files (~minutes)
coll.sql("SELECT * FROM psm WHERE sequence = 'PEPTIDEK'")
# With index: reads only the relevant partition (~milliseconds)
coll.index("peptide").search("PEPTIDEK")
An index trades one-time build cost (minutes to hours) for instant queries (milliseconds). Think of it as a library catalog: the catalog is small and fast to search, and it tells you which book (dataset) to open.
Index Types¶
QPX defines a standard framework for collection indexes. Each index type materializes data from a specific QPX view into a Hive-partitioned Parquet structure optimized for its search pattern.
Peptide Index¶
Maps peptide sequences to the datasets that contain them. Built from psm.parquet across all datasets.
| Source view | PSM |
| Stored at | _index/peptide/ |
| Partitioned by | First two amino acids of the sequence (~400 partitions) |
_index/peptide/
├── sequence_prefix=AA/data.parquet
├── sequence_prefix=AC/data.parquet
├── ...
├── sequence_prefix=PE/data.parquet # "PEPTIDEK" lives here
└── sequence_prefix=YY/data.parquet
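Because the partition key is derived from the sequence itself, a lookup can compute which file to read without listing the directory. The following is a minimal sketch of that routing, assuming the two-letter prefix rule from the table above; `partition_path` and `partitions_for_prefix` are hypothetical helpers for illustration, not part of the QPX API.

```python
from pathlib import PurePosixPath

INDEX_ROOT = PurePosixPath("_index/peptide")  # location taken from the table above


def partition_path(sequence: str) -> PurePosixPath:
    """Return the single partition file that can contain `sequence`."""
    if len(sequence) < 2:
        raise ValueError("need at least two residues to pick a partition")
    return INDEX_ROOT / f"sequence_prefix={sequence[:2].upper()}" / "data.parquet"


def partitions_for_prefix(prefix: str, existing: list[str]) -> list[str]:
    """Select the partition keys a prefix search would have to read.

    `existing` is the list of two-letter keys present on disk. A prefix of
    two or more residues maps to one key; a one-residue prefix fans out to
    every key that starts with it.
    """
    head = prefix[:2].upper()
    return sorted(key for key in existing if key.startswith(head))
```

Under this assumption, an exact search and any prefix search of two or more residues touch one file, while a one-residue prefix reads up to 20 partitions.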
Schema¶
Each row represents one (peptide, peptidoform, dataset) combination:
| Field | Type | Description |
|---|---|---|
| sequence | string | Unmodified peptide amino acid sequence |
| peptidoform | string | Peptide with modifications in ProForma notation |
| project_accession | string | Dataset identifier (e.g., PXD000561) |
| charge_states | array[int16] | Distinct precursor charge states observed |
| spectra_count | int32 | Number of PSMs for this peptide in this dataset |
| best_pep | float64 | Best (lowest) posterior error probability across all PSMs |
| protein_accessions | array[string] | All proteins this peptide maps to in this dataset |
Querying¶
coll = qpx.open_collection("/data/my_collection/")
# Search by exact sequence
results = coll.index("peptide").search("PEPTIDEK")
# Search by prefix (wildcard)
results = coll.index("peptide").search_prefix("PEPTID")
# Search by peptidoform (with modifications)
results = coll.index("peptide").search_peptidoform("PEP[Phospho]TIDEK")  # ProForma uses square brackets
Build Algorithm¶
SELECT
sequence, peptidoform, project_accession,
LIST(DISTINCT charge) AS charge_states,
COUNT(*) AS spectra_count,
MIN(posterior_error_probability) AS best_pep,
LIST_DISTINCT(FLATTEN(LIST(protein_accessions))) AS protein_accessions,  -- collect per-PSM lists, flatten, deduplicate
LEFT(sequence, 2) AS sequence_prefix
FROM psm
WHERE is_decoy = false
GROUP BY sequence, peptidoform, project_accession
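To make the grouping concrete, the same aggregation can be sketched in plain Python over a list of PSM records. This is an illustration only, not the QPX build code: `build_peptide_rows` is a hypothetical name, and the input is assumed to be dicts carrying the PSM columns used in the query above.

```python
from collections import defaultdict


def build_peptide_rows(psms):
    """Aggregate PSM dicts into one row per (sequence, peptidoform, dataset)."""
    groups = defaultdict(list)
    for psm in psms:
        if psm["is_decoy"]:
            continue  # mirrors WHERE is_decoy = false
        key = (psm["sequence"], psm["peptidoform"], psm["project_accession"])
        groups[key].append(psm)

    rows = []
    for (sequence, peptidoform, project), members in groups.items():
        rows.append({
            "sequence": sequence,
            "peptidoform": peptidoform,
            "project_accession": project,
            # Distinct charges, sorted for a stable output order.
            "charge_states": sorted({m["charge"] for m in members}),
            "spectra_count": len(members),
            "best_pep": min(m["posterior_error_probability"] for m in members),
            # Union of the per-PSM accession lists.
            "protein_accessions": sorted(
                {acc for m in members for acc in m["protein_accessions"]}
            ),
            "sequence_prefix": sequence[:2],
        })
    return rows
```

Each output row collapses every target PSM of one peptidoform in one dataset, which is why the index stays far smaller than the PSM data it summarizes.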
Protein Index¶
Maps protein accessions to the datasets that identified or quantified them. Built from pg.parquet across all datasets.
| Source view | Protein Group (PG) |
| Stored at | _index/protein/ |
| Partitioned by | First two characters of the anchor protein accession (~300-400 partitions) |
_index/protein/
├── accession_prefix=A0/data.parquet
├── accession_prefix=P0/data.parquet # "P04637" (TP53) lives here
├── accession_prefix=Q9/data.parquet
└── ...
Schema¶
Each row represents one (protein, dataset) combination:
| Field | Type | Description |
|---|---|---|
| anchor_protein | string | Representative protein accession for the group |
| protein_accessions | array[string] | All accessions in the protein group |
| project_accession | string | Dataset identifier |
| gg_names | array[string] | Gene names associated with the protein group |
| global_qvalue | float64 | Protein-level global q-value |
| num_peptides | int32 | Number of distinct peptides supporting the protein group |
| num_runs | int32 | Number of MS runs where the protein was quantified |
Querying¶
# Find all datasets that identified TP53
results = coll.index("protein").search("P04637")
# Search by gene name (requires scanning gene name column within partitions)
results = coll.index("protein").search_gene("TP53")
# Find all datasets with a protein family (prefix search)
results = coll.index("protein").search_prefix("P046")
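The difference in cost between accession and gene searches follows from the partition key: an accession determines its partition, while a gene name carries no routing information. A small sketch of that contrast, using an in-memory dict as a stand-in for the partition files; both functions and the toy data layout are assumptions for illustration, not the QPX implementation.

```python
def search_accession(partitions, accession):
    """Read only the partition keyed by the accession's first two characters."""
    rows = partitions.get(accession[:2], [])
    return [row for row in rows if row["anchor_protein"] == accession]


def search_gene(partitions, gene):
    """Scan the gg_names column of every partition; no key narrows the search."""
    return [
        row
        for rows in partitions.values()
        for row in rows
        if gene in row["gg_names"]
    ]
```

Here `partitions` maps a two-character key such as "P0" to the list of rows stored under accession_prefix=P0.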
Build Algorithm¶
SELECT
anchor_protein, protein_accessions, project_accession,
gg_names, global_qvalue,
num_peptides,
COUNT(DISTINCT run_file_name) AS num_runs,
LEFT(anchor_protein, 2) AS accession_prefix
FROM pg
GROUP BY anchor_protein, protein_accessions, project_accession,
gg_names, global_qvalue, num_peptides
Differential Expression Index¶
Maps proteins to their differential expression results across studies. Built from differential expression (DE) views. Enables meta-analysis queries like "which proteins are consistently up-regulated in cancer studies?"
| Source view | Differential Expression (DE) |
| Stored at | _index/de/ |
| Partitioned by | First two characters of the protein accession |
Schema¶
Each row represents one (protein, contrast, dataset) combination:
| Field | Type | Description |
|---|---|---|
| protein_accession | string | Protein accession |
| gene_name | string, null | Gene name |
| project_accession | string | Dataset identifier |
| contrast | string | Comparison label (e.g., "disease_vs_control") |
| log2_fold_change | float64 | Log2 fold change |
| adj_pvalue | float64 | Adjusted p-value (BH or similar) |
| regulation | string | "up", "down", or "ns" (not significant) |
| organisms | array[string] | Organisms in this study (from sample metadata) |
Querying¶
# Which studies show TP53 as differentially expressed?
results = coll.index("de").search("P04637")
# project_accession | contrast | log2fc | adj_pvalue | regulation
# PXD000561 | tumor_vs_normal | 2.3 | 1.2e-08 | up
# PXD002137 | treatment_vs_control | -0.8 | 0.03 | down
# Meta-analysis: proteins consistently up-regulated across 3+ studies
results = coll.index("de").search_consistent(
min_studies=3,
direction="up",
adj_pvalue_cutoff=0.05,
)
# Find all DE results for a gene
results = coll.index("de").search_gene("BRCA1")
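One plausible reading of the meta-analysis query is "count, per protein, the distinct datasets in which it is significantly regulated in the requested direction". The sketch below implements that reading over DE index rows held as dicts. `consistent_proteins` is a hypothetical standalone function, and counting datasets (not contrasts) is an assumption about the intended semantics.

```python
from collections import defaultdict


def consistent_proteins(rows, min_studies=3, direction="up", adj_pvalue_cutoff=0.05):
    """Return proteins regulated in `direction` in at least `min_studies` datasets."""
    studies = defaultdict(set)
    for row in rows:
        if row["regulation"] == direction and row["adj_pvalue"] < adj_pvalue_cutoff:
            # A set ensures several contrasts in one dataset count once.
            studies[row["protein_accession"]].add(row["project_accession"])
    return {
        protein: sorted(projects)
        for protein, projects in studies.items()
        if len(projects) >= min_studies
    }
```

The result maps each qualifying protein accession to the sorted list of datasets that support it.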
Build Algorithm¶
SELECT
protein_accession, gene_name, project_accession,
contrast, log2_fold_change, adj_pvalue,
CASE
WHEN adj_pvalue < 0.05 AND log2_fold_change > 0 THEN 'up'
WHEN adj_pvalue < 0.05 AND log2_fold_change < 0 THEN 'down'
ELSE 'ns'
END AS regulation,
s.organisms,
LEFT(protein_accession, 2) AS accession_prefix
FROM de
JOIN (
SELECT project_accession, LIST(DISTINCT organism) AS organisms
FROM sample
GROUP BY project_accession
) s USING (project_accession)
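The CASE expression above is the only derived column in this index, so it is worth spelling out its edge cases. A direct Python translation, as a sketch: `classify_regulation` is a hypothetical name, and the 0.05 threshold is the one hard-coded in the query.

```python
def classify_regulation(log2_fold_change, adj_pvalue, alpha=0.05):
    """Label a DE result as "up", "down", or "ns" (not significant)."""
    # In SQL, a NULL comparison is not true, so such rows reach the ELSE branch.
    if log2_fold_change is None or adj_pvalue is None:
        return "ns"
    if adj_pvalue < alpha and log2_fold_change > 0:
        return "up"
    if adj_pvalue < alpha and log2_fold_change < 0:
        return "down"
    # Not significant, or significant with a fold change of exactly zero.
    return "ns"
```

Note that the comparison is strict: a result with an adjusted p-value of exactly 0.05 is labeled "ns".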
Future Index Types¶
The framework supports additional index types following the same pattern:
| Index | Source view | Partition key | Use case |
|---|---|---|---|
| spectrum | MZ | Binned precursor m/z | "Find spectra matching this m/z and charge" |
| modification | PSM/Feature | Modification name | "Which datasets have phosphorylation data?" |
| sample | Sample | Organism/tissue | "Find all liver tissue datasets" |
Building Indexes¶
coll = qpx.open_collection("/data/my_collection/")
# Build a specific index
coll.build_index("peptide")
coll.build_index("protein")
coll.build_index("de")
# Build all available indexes
coll.build_all_indexes()
The build process for each index:
- Reads the source view from every dataset in the collection.
- Applies index-specific filters (e.g., is_decoy = false for the peptide index).
- Aggregates per the index schema's grouping key.
- Partitions by the index's partition column.
- Writes Hive-partitioned Parquet files to _index/{type}/.
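The last two steps, partitioning and file layout, can be sketched as follows. The function only plans which rows go to which file and does not write Parquet; `plan_partitions` is a hypothetical helper, and the rows are assumed to be dicts that already carry the partition column.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def plan_partitions(rows, index_type, partition_column):
    """Group aggregated rows by partition value and map each group to its file."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[partition_column]].append(row)

    root = PurePosixPath("_index") / index_type
    # Hive-style layout: one directory per value, named column=value.
    return {
        str(root / f"{partition_column}={value}" / "data.parquet"): bucket
        for value, bucket in sorted(buckets.items())
    }
```

For the peptide index this would be called as `plan_partitions(rows, "peptide", "sequence_prefix")`, producing one entry per two-letter prefix present in the data.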
Rebuilding¶
Indexes can go stale when datasets are added or removed. The rebuild process is idempotent -- it drops the existing index and recreates it from scratch:
# Rebuild after adding new datasets
coll.build_index("peptide", force=True)
# Check if index is stale (compares dataset list vs index metadata)
coll.index("peptide").is_stale()
Index Metadata¶
Each index stores a metadata file at _index/{type}/_metadata.json:
{
"index_type": "peptide",
"source_view": "psm",
"qpx_version": "1.0",
"built_at": "2026-04-08T14:30:00Z",
"datasets_included": ["PXD000561", "PXD002137"],
"total_entries": 14400000,
"partitions": 400,
"compression": "zstd"
}
The datasets_included list enables staleness detection: if the datasets currently in the collection differ from this list (one was added or removed), the index is stale.
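A staleness check along these lines needs only the metadata file and the current dataset list. The sketch below treats both added and removed datasets as drift, in line with the Rebuilding section; `dataset_drift` and `is_stale` are hypothetical functions, not the QPX implementation.

```python
import json


def dataset_drift(metadata_json, current_datasets):
    """Return (added, removed) dataset lists relative to the index metadata."""
    indexed = set(json.loads(metadata_json)["datasets_included"])
    current = set(current_datasets)
    return sorted(current - indexed), sorted(indexed - current)


def is_stale(metadata_json, current_datasets):
    """True when the index was built on a different set of datasets."""
    added, removed = dataset_drift(metadata_json, current_datasets)
    return bool(added or removed)
```

Returning the two lists separately also lets a caller report which datasets triggered the rebuild.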
Storage¶
Indexes are lightweight compared to the source data:
| Index type | Typical size | Notes |
|---|---|---|
| Peptide | ~0.05% of PSM data | Aggregates millions of PSMs into unique peptide-dataset pairs |
| Protein | ~0.01% of PG data | One row per protein group per dataset |
| DE | ~0.02% of DE data | One row per protein per contrast per dataset |
Conventions¶
- Indexes live under _index/ at the collection root.
- The _ prefix signals "infrastructure, not a dataset" -- collection discovery skips it.
- Each index type gets its own subdirectory: _index/peptide/, _index/protein/, _index/de/.
- Partition column names use the {field}_prefix naming pattern.
- Index Parquet files use ZSTD compression.
- Index metadata is stored as _metadata.json (not Parquet) for easy inspection.
- Each index declares its source_view in metadata, linking it to the QPX view it materializes.
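The underscore convention keeps dataset discovery simple. A minimal sketch, assuming each dataset is a directory directly under the collection root; `discover_datasets` is a hypothetical helper, and the temporary directory exists only to make the example runnable.

```python
import tempfile
from pathlib import Path


def discover_datasets(collection_root):
    """List dataset directories, skipping infrastructure entries such as _index/."""
    root = Path(collection_root)
    return sorted(
        entry.name
        for entry in root.iterdir()
        if entry.is_dir() and not entry.name.startswith("_")
    )


# Build a throwaway collection layout and run discovery on it.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("PXD000561", "PXD002137", "_index"):
        (Path(tmp) / name).mkdir()
    found = discover_datasets(tmp)
```

In this layout `found` contains the two dataset directories and omits _index.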