Processing Provenance

provenance.parquet captures the processing chain that generated a QPX dataset -- which tools ran, in what order, with what parameters. Its purpose is understanding, not reproduction: a user looking at the data should be able to answer questions like "what FDR was used?", "what search tolerance?", or "why is my protein missing?" without reading pipeline logs.

See the full YAML schema in provenance.yaml.

Understanding, not reproduction

provenance.parquet records the key decisions and parameters that shape results. It is not a workflow execution log, and it does not stand in for Nextflow's -resume cache. For full reproducibility, refer to the pipeline's own configuration and execution artifacts.


PyArrow Schema

import pyarrow as pa

PARAMETER = pa.struct([
    pa.field("key", pa.string()),        # parameter name (snake_case)
    pa.field("value", pa.string()),      # parameter value (as string)
])

provenance_schema = pa.schema([
    # Step identity
    pa.field("step_order", pa.int32(), nullable=False),      # execution order (1, 2, 3...)
    pa.field("step_category", pa.string(), nullable=False),  # broad category
    pa.field("step_name", pa.string(), nullable=False),      # specific step name

    # Tool
    pa.field("tool_name", pa.string(), nullable=False),      # tool name (e.g. "comet", "percolator")
    pa.field("tool_version", pa.string(), nullable=True),  # tool version
    pa.field("tool_uri", pa.string(), nullable=True),      # container URI or tool URL

    # Parameters -- curated summary of key settings
    pa.field("parameters", pa.list_(PARAMETER), nullable=True),

    # Full configuration -- raw JSON for complete reference
    pa.field("config", pa.string(), nullable=True),

    # What this step produced
    pa.field("output_views", pa.list_(pa.string()), nullable=True),
])

Field Reference

| Field | Description | Type | Required |
|---|---|---|---|
| step_order | Execution order (1, 2, 3...) | int32 | yes |
| step_category | Broad category of the processing step | string | yes |
| step_name | Specific name for the step | string | yes |
| tool_name | Name of the tool or software | string | yes |
| tool_version | Version of the tool | string | no |
| tool_uri | Container URI (e.g., Docker/Singularity image) or tool URL | string | no |
| parameters | Key-value pairs of important settings (curated summary) | list[struct{key, value}] | no |
| config | Full tool/step configuration as a JSON string | string | no |
| output_views | Which QPX views this step generated (e.g., ["psm"], ["feature", "pg"]) | list[string] | no |

parameters vs config

parameters is the curated summary -- a short list of the most important settings, queryable with SQL UNNEST. Use it for quick lookups like "what FDR was used?".

config is the complete reference -- the full tool configuration as a JSON string. Use it when you need every setting, not just the highlights. For a quantms pipeline, this would be the Nextflow params.json; for a Comet search, the full comet.params converted to JSON.
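As a rough illustration of the difference, the sketch below reads one setting from each field using only the Python standard library. The `step` dict is a trimmed-down stand-in for the workflow row in the example on this page, and `get_parameter` is a hypothetical helper written for this illustration, not part of any QPX library.

```python
import json

# Trimmed-down workflow step, shaped like one row of provenance.parquet.
# The values are illustrative, not taken from a real dataset.
step = {
    "step_name": "pipeline_execution",
    "parameters": [
        {"key": "profile", "value": "docker"},
        {"key": "acquisition_method", "value": "lfq"},
    ],
    "config": '{"psm_fdr": 0.01, "protein_fdr": 0.01, "max_mods": 3}',
}


def get_parameter(step, key):
    """Return the curated parameter value for `key`, or None if absent."""
    # `parameters` is nullable, so fall back to an empty list.
    for param in step.get("parameters") or []:
        if param["key"] == key:
            return param["value"]
    return None


# Curated summary: values are always stored as strings.
acquisition = get_parameter(step, "acquisition_method")

# Full configuration: decode the JSON string to recover typed values.
config = json.loads(step["config"]) if step.get("config") else {}
psm_fdr = config.get("psm_fdr")

print(acquisition, psm_fdr)
```

Because `parameters` stores every value as a string, numeric comparisons need an explicit conversion, whereas values decoded from `config` keep their JSON types.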

Step Categories

Each processing step belongs to a category that groups steps by function.

| step_category | Description | Typical tools |
|---|---|---|
| workflow | Pipeline orchestrator | quantms, nf-core, Nextflow |
| raw_conversion | Raw file conversion to open format | ThermoRawFileParser, msconvert |
| database_search | Sequence database search | Comet, MSGF+, MaxQuant, Sage, DIA-NN |
| rescoring | PSM rescoring and validation | Percolator, PeptideProphet, mokapot |
| fdr_filtering | FDR threshold application | OpenMS, custom scripts |
| quantification | Peptide/protein quantification | FlashLFQ, DIA-NN, IonQuant |
| protein_inference | Protein group inference | Epifany, ProteinProphet, IDPicker |
| normalization | Intensity normalization | quantms, MSstats |
| statistical_analysis | Differential expression / statistics | MSstats, limma, DEqMS, proDA |
| expression | Absolute expression computation | quantms (iBAQ) |
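A writer may want to sanity-check its steps before saving them. The sketch below is one possible check, under two assumptions that this page does not state outright: that the categories listed above form a closed vocabulary, and that `step_order` values should be unique. `check_steps` is a hypothetical helper, not an existing QPX function.

```python
# Categories copied from the table above. Treating them as a closed
# vocabulary is an assumption of this sketch.
STEP_CATEGORIES = {
    "workflow", "raw_conversion", "database_search", "rescoring",
    "fdr_filtering", "quantification", "protein_inference",
    "normalization", "statistical_analysis", "expression",
}


def check_steps(steps):
    """Return a list of human-readable problems found in `steps`."""
    problems = []

    # Assumption: each step has its own position in the chain.
    orders = [s["step_order"] for s in steps]
    if len(set(orders)) != len(orders):
        problems.append("duplicate step_order values")

    for s in steps:
        if s["step_category"] not in STEP_CATEGORIES:
            problems.append(
                f"step {s['step_order']}: unknown category {s['step_category']!r}"
            )
    return problems


steps = [
    {"step_order": 1, "step_category": "workflow"},
    {"step_order": 2, "step_category": "database_search"},
]
print(check_steps(steps))
```

An empty list means no problems were found under these assumptions.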

Example

quantms LFQ pipeline

[
  {
    "step_order": 1,
    "step_category": "workflow",
    "step_name": "pipeline_execution",
    "tool_name": "quantms",
    "tool_version": "1.3.0",
    "tool_uri": null,
    "parameters": [
      {"key": "nextflow_version", "value": "23.10.0"},
      {"key": "profile", "value": "docker"},
      {"key": "acquisition_method", "value": "lfq"}
    ],
    "config": "{\"input\":\"samplesheet.csv\",\"database\":\"UP000005640_9606.fasta\",\"acquisition_method\":\"lfq\",\"search_engines\":\"comet\",\"psm_fdr\":0.01,\"protein_fdr\":0.01,\"enable_mod_localization\":true,\"precursor_mass_tolerance\":10,\"fragment_mass_tolerance\":0.02,\"variable_mods\":\"Oxidation (M)\",\"fixed_mods\":\"Carbamidomethyl (C)\",\"max_mods\":3,\"match_between_runs\":true}",
    "output_views": null
  },
  {
    "step_order": 2,
    "step_category": "raw_conversion",
    "step_name": "raw_to_mzml",
    "tool_name": "thermorawfileparser",
    "tool_version": "1.4.2",
    "tool_uri": "docker://ghcr.io/bigbio/thermorawfileparser:1.4.2",
    "parameters": [
      {"key": "output_format", "value": "mzML"},
      {"key": "peak_picking", "value": "true"}
    ],
    "output_views": ["mz"]
  },
  {
    "step_order": 3,
    "step_category": "database_search",
    "step_name": "sequence_search",
    "tool_name": "comet",
    "tool_version": "2024.01.0",
    "tool_uri": "docker://ghcr.io/bigbio/quantms-comet:2024.01.0",
    "parameters": [
      {"key": "precursor_mass_tolerance", "value": "10ppm"},
      {"key": "fragment_mass_tolerance", "value": "0.02Da"},
      {"key": "isotope_error", "value": "0/1/2"},
      {"key": "fasta_database", "value": "UP000005640_9606.fasta"},
      {"key": "fasta_checksum", "value": "sha256:a1b2c3d4..."}
    ],
    "output_views": ["psm"]
  },
  {
    "step_order": 4,
    "step_category": "rescoring",
    "step_name": "psm_rescoring",
    "tool_name": "percolator",
    "tool_version": "3.06.1",
    "tool_uri": "docker://ghcr.io/bigbio/quantms-percolator:3.06.1",
    "parameters": [
      {"key": "train_fdr", "value": "0.01"},
      {"key": "test_fdr", "value": "0.01"}
    ],
    "output_views": ["psm"]
  },
  {
    "step_order": 5,
    "step_category": "fdr_filtering",
    "step_name": "multi_level_fdr",
    "tool_name": "openms",
    "tool_version": "3.1.0",
    "tool_uri": null,
    "parameters": [
      {"key": "psm_fdr", "value": "0.01"},
      {"key": "peptide_fdr", "value": "0.01"},
      {"key": "protein_fdr", "value": "0.01"}
    ],
    "output_views": null
  },
  {
    "step_order": 6,
    "step_category": "protein_inference",
    "step_name": "protein_grouping",
    "tool_name": "epifany",
    "tool_version": "3.1.0",
    "tool_uri": null,
    "parameters": [
      {"key": "algorithm", "value": "bayesian"},
      {"key": "protein_fdr", "value": "0.01"}
    ],
    "output_views": ["pg"]
  },
  {
    "step_order": 7,
    "step_category": "quantification",
    "step_name": "peptide_quantification",
    "tool_name": "flashlfq",
    "tool_version": "1.2.0",
    "tool_uri": null,
    "parameters": [
      {"key": "match_between_runs", "value": "true"},
      {"key": "mbr_rt_tolerance", "value": "2.5min"}
    ],
    "output_views": ["feature"]
  },
  {
    "step_order": 8,
    "step_category": "normalization",
    "step_name": "intensity_normalization",
    "tool_name": "quantms",
    "tool_version": "1.3.0",
    "tool_uri": null,
    "parameters": [
      {"key": "method", "value": "median"},
      {"key": "log_transform", "value": "true"}
    ],
    "output_views": null
  },
  {
    "step_order": 9,
    "step_category": "statistical_analysis",
    "step_name": "group_comparison",
    "tool_name": "msstats",
    "tool_version": "4.10.0",
    "tool_uri": null,
    "parameters": [
      {"key": "normalization", "value": "equalizeMedians"},
      {"key": "imputation", "value": "MinProb"},
      {"key": "fdr_threshold", "value": "0.05"}
    ],
    "output_views": ["de"]
  },
  {
    "step_order": 10,
    "step_category": "expression",
    "step_name": "ibaq_calculation",
    "tool_name": "quantms",
    "tool_version": "1.3.0",
    "tool_uri": null,
    "parameters": [
      {"key": "method", "value": "ibaq"},
      {"key": "log_transform", "value": "true"}
    ],
    "output_views": ["ae"]
  }
]

Common Queries

-- Full processing chain in order
SELECT step_order, step_category, tool_name, tool_version
FROM 'PXD014414.provenance.parquet'
ORDER BY step_order;

-- What search engine and tolerances were used?
SELECT tool_name, tool_version, parameters
FROM 'PXD014414.provenance.parquet'
WHERE step_category = 'database_search';

-- What FDR thresholds were applied anywhere in the pipeline?
SELECT s.step_name, p.key, p.value
FROM 'PXD014414.provenance.parquet' s,
     UNNEST(s.parameters) AS p
WHERE p.key LIKE '%fdr%';

-- Which tools produced the feature view?
SELECT step_name, tool_name, tool_version
FROM 'PXD014414.provenance.parquet'
WHERE list_contains(output_views, 'feature');

-- All container images used
SELECT step_name, tool_name, tool_uri
FROM 'PXD014414.provenance.parquet'
WHERE tool_uri IS NOT NULL;
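The same lookups can be done without a SQL engine once the rows are available as a list of dicts (for example via PyArrow's `Table.to_pylist()`). The sketch below mirrors the FDR and feature-view queries above in plain Python; `find_parameters` and `steps_producing` are hypothetical helper names, and the two sample rows are abbreviated from the example on this page.

```python
def find_parameters(steps, substring):
    """Yield (step_name, key, value) for parameters whose key contains `substring`."""
    for step in sorted(steps, key=lambda s: s["step_order"]):
        # `parameters` is nullable, so treat None as an empty list.
        for param in step.get("parameters") or []:
            if substring in param["key"]:
                yield step["step_name"], param["key"], param["value"]


def steps_producing(steps, view):
    """Return the steps whose output_views include `view`."""
    return [s for s in steps if view in (s.get("output_views") or [])]


# Two abbreviated rows, shaped like provenance.parquet records.
steps = [
    {
        "step_order": 7,
        "step_name": "peptide_quantification",
        "tool_name": "flashlfq",
        "parameters": [{"key": "match_between_runs", "value": "true"}],
        "output_views": ["feature"],
    },
    {
        "step_order": 5,
        "step_name": "multi_level_fdr",
        "tool_name": "openms",
        "parameters": [
            {"key": "psm_fdr", "value": "0.01"},
            {"key": "protein_fdr", "value": "0.01"},
        ],
        "output_views": None,
    },
]

fdr_settings = list(find_parameters(steps, "fdr"))
feature_tools = [s["tool_name"] for s in steps_producing(steps, "feature")]
print(fdr_settings, feature_tools)
```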

User Story: "Why is my protein missing?"

A researcher notices that a protein they expected is absent from the results. They can trace through the provenance to understand why:

| Check | Query (step_category: parameter key) | What to look for |
|---|---|---|
| Was it in the database? | database_search: fasta_database | Protein must be in the FASTA file |
| Were tolerances too tight? | database_search: precursor_mass_tolerance | Tight tolerances may miss modified peptides |
| Was FDR too strict? | fdr_filtering: psm_fdr, protein_fdr | Strict thresholds remove borderline identifications |
| Was it collapsed into another group? | protein_inference: algorithm | Check the PG view for related protein groups |
| Was match-between-runs off? | quantification: match_between_runs | MBR increases identifications across runs |
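The checklist above can be walked programmatically. The following sketch collects, for each question, the relevant parameters from the matching step category. It assumes the provenance rows are already loaded as a list of dicts; `CHECKS` and `trace_missing_protein` are hypothetical names used only for this illustration, and the sample rows are abbreviated.

```python
# (question, step_category, parameter keys), copied from the table above.
CHECKS = [
    ("Was it in the database?", "database_search", ["fasta_database"]),
    ("Were tolerances too tight?", "database_search", ["precursor_mass_tolerance"]),
    ("Was FDR too strict?", "fdr_filtering", ["psm_fdr", "protein_fdr"]),
    ("Was it collapsed into another group?", "protein_inference", ["algorithm"]),
    ("Was match-between-runs off?", "quantification", ["match_between_runs"]),
]


def trace_missing_protein(steps):
    """Return {question: {key: value}} with the settings relevant to each check."""
    report = {}
    for question, category, keys in CHECKS:
        found = {}
        for step in steps:
            if step["step_category"] != category:
                continue
            for param in step.get("parameters") or []:
                if param["key"] in keys:
                    found[param["key"]] = param["value"]
        # An empty dict means the provenance does not record this setting.
        report[question] = found
    return report


steps = [
    {
        "step_category": "database_search",
        "parameters": [
            {"key": "precursor_mass_tolerance", "value": "10ppm"},
            {"key": "fasta_database", "value": "UP000005640_9606.fasta"},
        ],
    },
    {
        "step_category": "quantification",
        "parameters": [{"key": "match_between_runs", "value": "true"}],
    },
]

report = trace_missing_protein(steps)
for question, settings in report.items():
    print(question, settings or "not recorded")
```

The report only surfaces the recorded settings; deciding whether a tolerance or threshold was too strict for a given protein remains a judgment for the researcher.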

Relationship to Existing Metadata

provenance.parquet complements the existing metadata files rather than replacing them:

| Existing | Purpose | Provenance adds |
|---|---|---|
| dataset.parquet: software_name, software_version | Quick project-level lookup | Full tool chain with all versions |
| run.parquet: enzymes, modification_parameters | Per-run search config | Pipeline-level search engine settings |
| run.parquet: instrument, dissociation_method | Instrument info | Stays in run.parquet (per-run, not per-pipeline) |