Processing Provenance
provenance.parquet captures the processing chain that generated a QPX dataset -- which tools ran, in what order, with what parameters. Its purpose is understanding, not reproduction: a user looking at the data should be able to answer questions like "what FDR was used?", "what search tolerance?", or "why is my protein missing?" without reading pipeline logs.
See the full YAML schema in provenance.yaml.
Understanding, not reproduction
provenance.parquet records the key decisions and parameters that shape results. It is not a workflow execution log or a Nextflow -resume file. For full reproducibility, refer to the pipeline's own configuration and execution artifacts.
PyArrow Schema
import pyarrow as pa
PARAMETER = pa.struct([
pa.field("key", pa.string()), # parameter name (snake_case)
pa.field("value", pa.string()), # parameter value (as string)
])
provenance_schema = pa.schema([
# Step identity
pa.field("step_order", pa.int32()), # execution order (1, 2, 3...)
pa.field("step_category", pa.string()), # broad category
pa.field("step_name", pa.string()), # specific step name
# Tool
pa.field("tool_name", pa.string()), # tool name (e.g. "comet", "percolator")
pa.field("tool_version", pa.string(), nullable=True), # tool version
pa.field("tool_uri", pa.string(), nullable=True), # container URI or tool URL
# Parameters -- curated summary of key settings
pa.field("parameters", pa.list_(PARAMETER), nullable=True),
# Full configuration -- raw JSON for complete reference
pa.field("config", pa.string(), nullable=True),
# What this step produced
pa.field("output_views", pa.list_(pa.string()), nullable=True),
])
Field Reference
| Field | Description | Type | Required |
|---|---|---|---|
| step_order | Execution order (1, 2, 3...) | int32 | yes |
| step_category | Broad category of the processing step | string | yes |
| step_name | Specific name for the step | string | yes |
| tool_name | Name of the tool or software | string | yes |
| tool_version | Version of the tool | string | no |
| tool_uri | Container URI (e.g., Docker/Singularity image) or tool URL | string | no |
| parameters | Key-value pairs of important settings (curated summary) | list[struct{key, value}] | no |
| config | Full tool/step configuration as a JSON string | string | no |
| output_views | Which QPX views this step generated (e.g., ["psm"], ["feature", "pg"]) | list[string] | no |
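The field table can be turned into a small check that runs before a step is written. The sketch below is a minimal, standard-library illustration only: `REQUIRED_FIELDS`, `OPTIONAL_FIELDS`, and `validate_step` are hypothetical names that are not part of QPX, and the checks cover presence and basic Python types rather than the full Arrow schema.

```python
# Hypothetical helper: checks one provenance step (a plain dict) against
# the field table. None of these names are defined by QPX itself.
REQUIRED_FIELDS = {
    "step_order": int,
    "step_category": str,
    "step_name": str,
    "tool_name": str,
}
OPTIONAL_FIELDS = {
    "tool_version": str,
    "tool_uri": str,
    "parameters": list,
    "config": str,
    "output_views": list,
}


def validate_step(step: dict) -> list:
    """Return a list of human-readable problems; an empty list means the step passed."""
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        value = step.get(name)
        if value is None:
            problems.append(f"missing required field: {name}")
        elif not isinstance(value, expected):
            problems.append(f"{name} should be {expected.__name__}")
    for name, expected in OPTIONAL_FIELDS.items():
        value = step.get(name)
        if value is not None and not isinstance(value, expected):
            problems.append(f"{name} should be {expected.__name__} or null")
    # Every parameter must be a {key, value} pair whose members are strings.
    params = step.get("parameters")
    if isinstance(params, list):
        for param in params:
            ok = (
                isinstance(param, dict)
                and isinstance(param.get("key"), str)
                and isinstance(param.get("value"), str)
            )
            if not ok:
                problems.append(f"parameter is not a string key/value pair: {param!r}")
    return problems
```

A step that passes returns an empty list, so the caller can collect problems across all steps and report them together instead of stopping at the first failure.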
parameters vs config
parameters is the curated summary -- a short list of the most important settings, queryable with SQL UNNEST. Use it for quick lookups like "what FDR was used?".
config is the complete reference -- the full tool configuration as a JSON string. Use it when you need every setting, not just the highlights. For a quantms pipeline, this would be the Nextflow params.json; for a Comet search, the full comet.params converted to JSON.
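The difference between the two fields shows up clearly when reading a step in code. The following is a minimal sketch using only the standard library; the `step` dict is trimmed from the workflow step in the example on this page, and `curated_parameter` and `full_config` are hypothetical helper names.

```python
import json

# Trimmed copy of the workflow step from the example on this page.
step = {
    "parameters": [
        {"key": "profile", "value": "docker"},
        {"key": "acquisition_method", "value": "lfq"},
    ],
    "config": '{"psm_fdr": 0.01, "protein_fdr": 0.01, "max_mods": 3}',
}


def curated_parameter(step: dict, key: str):
    """Look up a key in the curated parameters list; values are always strings."""
    for param in step.get("parameters") or []:  # parameters may be null
        if param["key"] == key:
            return param["value"]
    return None


def full_config(step: dict) -> dict:
    """Parse the raw config JSON; returns {} when config is null or absent."""
    raw = step.get("config")
    return json.loads(raw) if raw else {}


print(curated_parameter(step, "acquisition_method"))  # found in the curated summary
print(curated_parameter(step, "max_mods"))            # not curated, so None
print(full_config(step)["max_mods"])                  # present only in the full config
```

Note that `parameters` stores every value as a string, while values parsed from `config` keep their JSON types (here `max_mods` comes back as a number).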
Step Categories
Each processing step belongs to a category that groups steps by function.
| step_category | Description | Typical tools |
|---|---|---|
| workflow | Pipeline orchestrator | quantms, nf-core, Nextflow |
| raw_conversion | Raw file conversion to open format | ThermoRawFileParser, msconvert |
| database_search | Sequence database search | Comet, MSGF+, MaxQuant, Sage, DIA-NN |
| rescoring | PSM rescoring and validation | Percolator, PeptideProphet, mokapot |
| fdr_filtering | FDR threshold application | OpenMS, custom scripts |
| quantification | Peptide/protein quantification | FlashLFQ, DIA-NN, IonQuant |
| protein_inference | Protein group inference | Epifany, ProteinProphet, IDPicker |
| normalization | Intensity normalization | quantms, MSstats |
| statistical_analysis | Differential expression / statistics | MSstats, limma, DEqMS, proDA |
| expression | Absolute expression computation | quantms (iBAQ) |
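A writer may want to catch misspelled category names before they reach the file. The sketch below copies the category names from the table; treating that list as a closed vocabulary is an assumption made here, since the table does not say whether other categories are permitted. `KNOWN_CATEGORIES` and `unknown_categories` are hypothetical names.

```python
# Category names copied from the table above. Treating the list as closed
# is an assumption of this sketch, not something the table states.
KNOWN_CATEGORIES = {
    "workflow",
    "raw_conversion",
    "database_search",
    "rescoring",
    "fdr_filtering",
    "quantification",
    "protein_inference",
    "normalization",
    "statistical_analysis",
    "expression",
}


def unknown_categories(steps: list) -> list:
    """Return the sorted step_category values that are not in the table."""
    seen = {step["step_category"] for step in steps}
    return sorted(seen - KNOWN_CATEGORIES)


steps = [
    {"step_order": 1, "step_category": "database_search"},
    {"step_order": 2, "step_category": "rescorng"},  # deliberate typo
]
print(unknown_categories(steps))
```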
Example
quantms LFQ pipeline
[
{
"step_order": 1,
"step_category": "workflow",
"step_name": "pipeline_execution",
"tool_name": "quantms",
"tool_version": "1.3.0",
"tool_uri": null,
"parameters": [
{"key": "nextflow_version", "value": "23.10.0"},
{"key": "profile", "value": "docker"},
{"key": "acquisition_method", "value": "lfq"}
],
"config": "{\"input\":\"samplesheet.csv\",\"database\":\"UP000005640_9606.fasta\",\"acquisition_method\":\"lfq\",\"search_engines\":\"comet\",\"psm_fdr\":0.01,\"protein_fdr\":0.01,\"enable_mod_localization\":true,\"precursor_mass_tolerance\":10,\"fragment_mass_tolerance\":0.02,\"variable_mods\":\"Oxidation (M)\",\"fixed_mods\":\"Carbamidomethyl (C)\",\"max_mods\":3,\"match_between_runs\":true}",
"output_views": null
},
{
"step_order": 2,
"step_category": "raw_conversion",
"step_name": "raw_to_mzml",
"tool_name": "thermorawfileparser",
"tool_version": "1.4.2",
"tool_uri": "docker://ghcr.io/bigbio/thermorawfileparser:1.4.2",
"parameters": [
{"key": "output_format", "value": "mzML"},
{"key": "peak_picking", "value": "true"}
],
"output_views": ["mz"]
},
{
"step_order": 3,
"step_category": "database_search",
"step_name": "sequence_search",
"tool_name": "comet",
"tool_version": "2024.01.0",
"tool_uri": "docker://ghcr.io/bigbio/quantms-comet:2024.01.0",
"parameters": [
{"key": "precursor_mass_tolerance", "value": "10ppm"},
{"key": "fragment_mass_tolerance", "value": "0.02Da"},
{"key": "isotope_error", "value": "0/1/2"},
{"key": "fasta_database", "value": "UP000005640_9606.fasta"},
{"key": "fasta_checksum", "value": "sha256:a1b2c3d4..."}
],
"output_views": ["psm"]
},
{
"step_order": 4,
"step_category": "rescoring",
"step_name": "psm_rescoring",
"tool_name": "percolator",
"tool_version": "3.06.1",
"tool_uri": "docker://ghcr.io/bigbio/quantms-percolator:3.06.1",
"parameters": [
{"key": "train_fdr", "value": "0.01"},
{"key": "test_fdr", "value": "0.01"}
],
"output_views": ["psm"]
},
{
"step_order": 5,
"step_category": "fdr_filtering",
"step_name": "multi_level_fdr",
"tool_name": "openms",
"tool_version": "3.1.0",
"tool_uri": null,
"parameters": [
{"key": "psm_fdr", "value": "0.01"},
{"key": "peptide_fdr", "value": "0.01"},
{"key": "protein_fdr", "value": "0.01"}
],
"output_views": null
},
{
"step_order": 6,
"step_category": "protein_inference",
"step_name": "protein_grouping",
"tool_name": "epifany",
"tool_version": "3.1.0",
"tool_uri": null,
"parameters": [
{"key": "algorithm", "value": "bayesian"},
{"key": "protein_fdr", "value": "0.01"}
],
"output_views": ["pg"]
},
{
"step_order": 7,
"step_category": "quantification",
"step_name": "peptide_quantification",
"tool_name": "flashlfq",
"tool_version": "1.2.0",
"tool_uri": null,
"parameters": [
{"key": "match_between_runs", "value": "true"},
{"key": "mbr_rt_tolerance", "value": "2.5min"}
],
"output_views": ["feature"]
},
{
"step_order": 8,
"step_category": "normalization",
"step_name": "intensity_normalization",
"tool_name": "quantms",
"tool_version": "1.3.0",
"tool_uri": null,
"parameters": [
{"key": "method", "value": "median"},
{"key": "log_transform", "value": "true"}
],
"output_views": null
},
{
"step_order": 9,
"step_category": "statistical_analysis",
"step_name": "group_comparison",
"tool_name": "msstats",
"tool_version": "4.10.0",
"tool_uri": null,
"parameters": [
{"key": "normalization", "value": "equalizeMedians"},
{"key": "imputation", "value": "MinProb"},
{"key": "fdr_threshold", "value": "0.05"}
],
"output_views": ["de"]
},
{
"step_order": 10,
"step_category": "expression",
"step_name": "ibaq_calculation",
"tool_name": "quantms",
"tool_version": "1.3.0",
"tool_uri": null,
"parameters": [
{"key": "method", "value": "ibaq"},
{"key": "log_transform", "value": "true"}
],
"output_views": ["ae"]
}
]
Common Queries
-- Full processing chain in order
SELECT step_order, step_category, tool_name, tool_version
FROM 'PXD014414.provenance.parquet'
ORDER BY step_order;
-- What search engine and tolerances were used?
SELECT tool_name, tool_version, parameters
FROM 'PXD014414.provenance.parquet'
WHERE step_category = 'database_search';
-- What FDR thresholds were applied anywhere in the pipeline?
SELECT s.step_name, p.key, p.value
FROM 'PXD014414.provenance.parquet' s,
UNNEST(s.parameters) AS p
WHERE p.key LIKE '%fdr%';
-- Which tools produced the feature view?
SELECT step_name, tool_name, tool_version
FROM 'PXD014414.provenance.parquet'
WHERE list_contains(output_views, 'feature');
-- All container images used
SELECT step_name, tool_name, tool_uri
FROM 'PXD014414.provenance.parquet'
WHERE tool_uri IS NOT NULL;
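The FDR query above has a direct counterpart when the provenance rows are already in memory as Python dicts. This is a minimal sketch assuming rows shaped like the example on this page; `find_parameters` is a hypothetical helper, and the two rows are trimmed from that example. Unlike the SQL query, it sorts the result by `step_order`.

```python
def find_parameters(steps: list, needle: str) -> list:
    """Return (step_name, key, value) for every parameter whose key contains needle.

    Mirrors the UNNEST + LIKE '%needle%' query: each step's parameters list
    is flattened into one row per key/value pair.
    """
    rows = []
    for step in sorted(steps, key=lambda s: s["step_order"]):
        for param in step.get("parameters") or []:  # parameters may be null
            if needle in param["key"]:
                rows.append((step["step_name"], param["key"], param["value"]))
    return rows


# Two rows trimmed from the quantms LFQ example.
steps = [
    {
        "step_order": 5,
        "step_name": "multi_level_fdr",
        "parameters": [
            {"key": "psm_fdr", "value": "0.01"},
            {"key": "peptide_fdr", "value": "0.01"},
            {"key": "protein_fdr", "value": "0.01"},
        ],
    },
    {
        "step_order": 7,
        "step_name": "peptide_quantification",
        "parameters": [
            {"key": "match_between_runs", "value": "true"},
            {"key": "mbr_rt_tolerance", "value": "2.5min"},
        ],
    },
]

for row in find_parameters(steps, "fdr"):
    print(row)
```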
User Story: "Why is my protein missing?"
A researcher notices that a protein they expected is absent from the results. They can trace through the provenance to understand why:
| Check | Query | What to look for |
|---|---|---|
| Was it in the database? | database_search → fasta_database | Protein must be in the FASTA file |
| Were tolerances too tight? | database_search → precursor_mass_tolerance | Tight tolerances may miss modified peptides |
| Was FDR too strict? | fdr_filtering → psm_fdr, protein_fdr | Strict thresholds remove borderline identifications |
| Was it collapsed into another group? | protein_inference → algorithm | Check the PG view for related protein groups |
| Was match-between-runs off? | quantification → match_between_runs | MBR increases identifications across runs |
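The checklist above can be gathered in one pass over the provenance rows. The sketch below is an illustration under stated assumptions: `MISSING_PROTEIN_CHECKS` simply restates the category and parameter pairs from the table, `missing_protein_report` is a hypothetical helper, and the sample rows are trimmed from the example on this page. A value of `None` means the parameter was not recorded in the curated summary, which is itself worth following up in `config`.

```python
# (step_category, parameter key) pairs restated from the checklist table.
MISSING_PROTEIN_CHECKS = [
    ("database_search", "fasta_database"),
    ("database_search", "precursor_mass_tolerance"),
    ("fdr_filtering", "psm_fdr"),
    ("fdr_filtering", "protein_fdr"),
    ("protein_inference", "algorithm"),
    ("quantification", "match_between_runs"),
]


def missing_protein_report(steps: list) -> dict:
    """Map each (category, key) check to its recorded value, or None if absent."""
    recorded = {}
    for step in steps:
        for param in step.get("parameters") or []:  # parameters may be null
            # Keep the first value seen for a given category/key pair.
            recorded.setdefault((step["step_category"], param["key"]), param["value"])
    return {check: recorded.get(check) for check in MISSING_PROTEIN_CHECKS}


# Rows trimmed from the quantms LFQ example.
steps = [
    {
        "step_category": "database_search",
        "parameters": [
            {"key": "precursor_mass_tolerance", "value": "10ppm"},
            {"key": "fasta_database", "value": "UP000005640_9606.fasta"},
        ],
    },
    {
        "step_category": "quantification",
        "parameters": [{"key": "match_between_runs", "value": "true"}],
    },
]

for (category, key), value in missing_protein_report(steps).items():
    print(f"{category} -> {key}: {value}")
```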
Relationship to Existing Metadata
provenance.parquet complements the existing metadata files; it does not replace them:
| Existing | Purpose | Provenance adds |
|---|---|---|
| dataset.parquet → software_name, software_version | Quick project-level lookup | Full tool chain with all versions |
| run.parquet → enzymes, modification_parameters | Per-run search config | Pipeline-level search engine settings |
| run.parquet → instrument, dissociation_method | Instrument info | Stays in run.parquet (per-run, not per-pipeline) |
Related Pages
- Dataset Metadata -- project-level metadata
- Run Metadata -- per-run instrument and search parameters
- QPX Format Overview