Transform Commands

Transform and process data within the QPX ecosystem.

Overview

The transform command group provides tools for processing QPX data for downstream use. The commands cover gene annotation, protein accession normalization, sample and run metadata updates, and protein-level quantification from feature data.

Available Commands

  • gene-map - Map genes from FASTA
  • normalize-accessions - Normalize protein accession formats (full ↔ bare)
  • update-metadata - Update sample/run metadata from a revised SDRF
  • quantify - Protein quantification via mokume (DirectLFQ, MaxLFQ, iBAQ, TopN, etc.)

gene-map

Map gene information from a FASTA database into QPX parquet files.

Description

Enriches protein identifications in QPX PSM or feature files with gene-level metadata extracted from FASTA database headers. 
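
As a rough illustration of what "gene-level metadata extracted from FASTA database headers" can mean, the sketch below pulls the `GN=` field out of UniProt-style headers and keys it by accession. It is not the gene-map implementation; the function name `gene_map_from_fasta` and the sample records are hypothetical, and only the UniProt header layout is taken as given.

```python
import re

# UniProt-style header: ">db|ACCESSION|ENTRY_NAME description ... GN=GENE ..."
_HEADER = re.compile(r"^>(?:sp|tr)\|([^|]+)\|(\S+)")
_GENE = re.compile(r"\bGN=(\S+)")


def gene_map_from_fasta(lines):
    """Return {accession: gene_name} for headers that carry a GN= field."""
    mapping = {}
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence line, nothing to parse
        header = _HEADER.match(line)
        gene = _GENE.search(line)
        if header and gene:
            mapping[header.group(1)] = gene.group(1)
    return mapping


# Hypothetical two-record FASTA, truncated sequences for brevity.
fasta = [
    ">sp|P04114|APOB_HUMAN Apolipoprotein B-100 OS=Homo sapiens OX=9606 GN=APOB PE=1 SV=2",
    "MDPPRPALLALLALPALLLLLLAGARA",
    ">sp|P02768|ALBU_HUMAN Albumin OS=Homo sapiens OX=9606 GN=ALB PE=1 SV=2",
    "MKWVTFISLLFLFSSAYS",
]
genes = gene_map_from_fasta(fasta)
print(genes)
```

Headers without a `GN=` field are skipped in this sketch, which is one reason a species-specific, well-annotated FASTA tends to give more complete gene annotation.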

Parameters

Parameter        Type       Required  Default  Description
--parquet-path   FILE       Yes       -        QPX PSM or feature parquet file path
--fasta          FILE       Yes       -        FASTA database file path
--output-folder  DIRECTORY  Yes       -        Output directory for generated files
--species        TEXT       No        human    Species name for gene mapping
--verbose        FLAG       No        -        Enable verbose logging

Usage Examples

Basic Example

Map gene information to parquet file:

qpxc transform gene-map \
    --parquet-path ./output/psm.parquet \
    --fasta proteins.fasta \
    --output-folder ./output

With Species Parameter

qpxc transform gene-map \
    --parquet-path ./output/feature.parquet \
    --fasta tests/examples/fasta/Homo-sapiens.fasta \
    --output-folder ./output \
    --species human

Output Files

  • Output: Enhanced parquet file(s) with gene information
  • Format: Parquet file in output folder
  • Added Fields: Gene names and metadata from FASTA headers

Best Practices

  • Use species-specific FASTA files for accurate gene annotation
  • Enable verbose mode for debugging

normalize-accessions

Normalize protein accession formats between full UniProt form (sp|ACC|NAME) and bare form (ACC).

Description

Forward (default): converts full UniProt identifiers to bare accessions.

    sp|P04114|APOB_HUMAN → P04114
    CONTAM_sp|CONTAM_P02768|... → CONTAM_P02768

Reverse: converts bare accessions back to full UniProt format. Requires a FASTA database to look up the full identifiers.

Normalizes anchor_protein and pg_accessions in both feature.parquet and pg.parquet files.
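
The two directions can be pictured with a small sketch. It assumes the bare accession is always the second `|`-separated field and that the reverse lookup is built from the first token of each FASTA header. The helper names (`to_bare`, `build_reverse_lookup`, `to_full`) are hypothetical and are not part of qpxc.

```python
def to_bare(identifier):
    """Forward: 'sp|P04114|APOB_HUMAN' -> 'P04114'; bare input is returned unchanged."""
    parts = identifier.split("|")
    return parts[1] if len(parts) >= 3 else identifier


def build_reverse_lookup(fasta_lines):
    """Map each bare accession to the full identifier found in a FASTA header."""
    lookup = {}
    for line in fasta_lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:  # the identifier is the first token of the header
                lookup[to_bare(tokens[0])] = tokens[0]
    return lookup


def to_full(accession, lookup):
    """Reverse: restore the full identifier, leaving unknown accessions as-is."""
    return lookup.get(accession, accession)


# Forward on a pg_accessions-like list (the second entry is already bare).
bare = [to_bare(acc) for acc in ["sp|P04114|APOB_HUMAN", "P02768"]]
print(bare)

# Reverse needs a FASTA to recover the entry name.
lookup = build_reverse_lookup([">sp|P04114|APOB_HUMAN Apolipoprotein B-100", "MDPPRPALL"])
print(to_full("P04114", lookup))
```

The sketch also shows why the reverse direction needs `--fasta`: the entry name (`APOB_HUMAN`) is discarded by the forward step and cannot be rebuilt from the accession alone.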

Parameters

Parameter    Type       Required  Default              Description
--dataset    DIRECTORY  Yes       -                    Path to a QPX dataset directory (containing quantms.*.parquet files)
--direction  TEXT       No        forward              'forward' (sp|ACC|NAME → ACC) or 'reverse' (ACC → sp|ACC|NAME)
--fasta      FILE       No        -                    FASTA database file (required for --direction reverse)
--in-place   FLAG       No        -                    Overwrite original files instead of writing to --output
--output     DIRECTORY  No        overwrites in place  Output directory (default: overwrites in place)
--verbose    FLAG       No        -                    Enable verbose logging

Usage Examples

Normalize protein accession formats:

# Forward: strip sp|...|... to bare accessions
qpxc transform normalize-accessions \
    --dataset ./my_dataset --direction forward --in-place

# Reverse: restore full UniProt identifiers from FASTA
qpxc transform normalize-accessions \
    --dataset ./my_dataset --direction reverse \
    --fasta proteins.fasta --in-place

# Forward to a new directory (non-destructive)
qpxc transform normalize-accessions \
    --dataset ./my_dataset --direction forward \
    --output ./my_dataset_normalized

update-metadata

Update sample.parquet and run.parquet metadata from a revised SDRF file, with safety checks on protected fields.

Description

Re-generates sample.parquet and run.parquet from the new SDRF. Original files are backed up as *.parquet.bak before overwriting.

Safe to update (metadata-only, no impact on data): disease, organism_part, cell_type, cell_line, sex, age, treatment, individual, ancestry, developmental_stage, and any additional characteristics[*] columns.

Protected fields (will BLOCK unless --force is used): instrument, enzymes, modification_parameters, fraction, dissociation_method, label/channel mapping, data file references.

If --old-sdrf is provided, the tool compares old vs new SDRF and blocks if any protected fields changed. Without --old-sdrf, the safety check is skipped (useful for first-time metadata enrichment).
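
The safety check can be thought of as a diff restricted to protected columns. The sketch below is one possible reading, not the tool's actual logic. It assumes rows are matched by position, and the mapping from the protected fields to SDRF column names (`comment[instrument]`, `comment[data file]`, and so on) is an assumption made for illustration. The function names and the sample SDRF text are hypothetical.

```python
import csv
import io

# Assumed SDRF column names for the protected fields listed above.
PROTECTED_COLUMNS = {
    "comment[instrument]",
    "comment[cleavage agent details]",
    "comment[modification parameters]",
    "comment[fraction identifier]",
    "comment[dissociation method]",
    "comment[label]",
    "comment[data file]",
}


def read_protected(handle):
    """Return, for each SDRF row, the (column, value) pairs of protected columns."""
    rows = csv.reader(handle, delimiter="\t")
    header = [name.strip().lower() for name in next(rows)]
    keep = [i for i, name in enumerate(header) if name in PROTECTED_COLUMNS]
    return [[(header[i], row[i]) for i in keep] for row in rows]


def changed_protected_fields(old_handle, new_handle):
    """Names of protected columns whose values differ (rows matched by position)."""
    old, new = read_protected(old_handle), read_protected(new_handle)
    if len(old) != len(new):
        return {"<row count>"}
    changed = set()
    for old_row, new_row in zip(old, new):
        for column, _ in set(old_row) ^ set(new_row):
            changed.add(column)
    return changed


def check_update(old_text, new_text, force=False):
    """Raise unless the update is metadata-only or force is set."""
    changed = changed_protected_fields(io.StringIO(old_text), io.StringIO(new_text))
    if changed and not force:
        raise ValueError(f"protected fields changed: {sorted(changed)}")
    return changed


HEADER = "source name\tcharacteristics[disease]\tcomment[instrument]\tcomment[data file]\n"
OLD = HEADER + "sample 1\tnormal\tQ Exactive\trun1.raw\n"
NEW_SAFE = HEADER + "sample 1\tcarcinoma\tQ Exactive\trun1.raw\n"
NEW_UNSAFE = HEADER + "sample 1\tnormal\tOrbitrap Fusion\trun1.raw\n"

print(check_update(OLD, NEW_SAFE))                 # metadata-only change passes
print(check_update(OLD, NEW_UNSAFE, force=True))   # reported, but allowed with force
```

Comparing (column, value) pairs per row keeps the sketch tolerant of SDRF files that repeat a column name, such as several modification parameter columns.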

Parameters

Parameter   Type       Required  Default  Description
--dataset   DIRECTORY  Yes       -        Path to a QPX dataset directory
--sdrf      FILE       Yes       -        Path to the updated SDRF TSV file
--old-sdrf  FILE       No        -        Path to the original SDRF (for safety checks). If omitted, protected-field validation is skipped.
--force     FLAG       No        -        Apply changes even if protected fields (instrument, enzymes, modifications, fractions, labels) have changed. Use with caution.
--verbose   FLAG       No        -        Enable verbose logging

Usage Examples

Update dataset metadata from SDRF:

# Update with safety check
qpxc transform update-metadata \
    --dataset ./my_project \
    --sdrf ./updated_sdrf.tsv \
    --old-sdrf ./original_sdrf.tsv

# Update without safety check (first-time enrichment)
qpxc transform update-metadata \
    --dataset ./my_project \
    --sdrf ./enriched_sdrf.tsv

# Force update even if protected fields changed
qpxc transform update-metadata \
    --dataset ./my_project \
    --sdrf ./new_sdrf.tsv \
    --old-sdrf ./old_sdrf.tsv --force

quantify

Compute protein-level quantification from QPX feature data using mokume.

Description

Reads a QPX feature.parquet file, extracts peptide-level intensities, and computes protein-level quantification using the selected method.

Supported methods:

  • directlfq - DirectLFQ intensity traces (default)
  • maxlfq - MaxLFQ delayed normalization
  • topn - Average of N most intense peptides
  • top3 - Average of 3 most intense peptides
  • ibaq - Intensity-Based Absolute Quantification (requires --fasta)
  • sum - Sum of all peptide intensities
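
To make the simpler roll-ups concrete, here is a minimal sketch of the `topn` and `sum` definitions given above, applied per protein and run. It does not call mokume and is not its implementation. The flat `(protein, run, intensity)` tuples and the function name `quantify` are assumptions chosen for illustration.

```python
from collections import defaultdict


def quantify(rows, method="topn", n=3):
    """Aggregate peptide intensities into one value per (protein, run)."""
    groups = defaultdict(list)
    for protein, run, intensity in rows:
        groups[(protein, run)].append(intensity)

    result = {}
    for key, values in groups.items():
        if method == "sum":
            result[key] = sum(values)
        elif method == "topn":
            top = sorted(values, reverse=True)[:n]  # N most intense peptides
            result[key] = sum(top) / len(top)       # fewer than N: average what exists
        else:
            raise ValueError(f"unsupported method in this sketch: {method}")
    return result


# Hypothetical peptide-level rows: (anchor protein, run file name, intensity).
rows = [
    ("P04114", "run1", 100.0),
    ("P04114", "run1", 300.0),
    ("P04114", "run1", 200.0),
    ("P04114", "run1", 50.0),
    ("P02768", "run1", 80.0),
    ("P02768", "run1", 40.0),
]
print(quantify(rows, method="topn", n=3))
print(quantify(rows, method="sum"))
```

With `n=3`, the lowest of the four P04114 peptides is ignored, which shows how `topn` and `sum` can diverge for proteins with many low-intensity peptides.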

Parameters

Parameter       Type     Required  Default    Description
--feature-path  FILE     Yes       -          QPX feature.parquet file path
--method        TEXT     No        directlfq  Quantification method (directlfq, maxlfq, topn, top3, ibaq, sum)
--fasta         FILE     No        -          FASTA database (required for ibaq method)
--enzyme        TEXT     No        Trypsin    Enzyme for iBAQ digestion (default: Trypsin)
--topn-n        INTEGER  No        3          N for TopN method (default: 3)
--threads       INTEGER  No        -1         Parallel threads for MaxLFQ (-1 = all cores)
--output        PATH     Yes       -          Output file path (.parquet, .tsv, or .csv)
--normalize     FLAG     No        -          Normalize quantification values
--organism      TEXT     No        human      Organism for iBAQ (default: human)
--ploidy        INTEGER  No        2          Ploidy for iBAQ ruler (default: 2)
--cpc           FLOAT    No        200        Cellular protein concentration (g/L) for iBAQ ruler (default: 200)
--min-aa        INTEGER  No        7          Min peptide length for iBAQ (default: 7)
--max-aa        INTEGER  No        30         Max peptide length for iBAQ (default: 30)
--verbose       FLAG     No        -          Enable verbose logging

Supported Methods

Method     Description                               Extra Requirements
directlfq  DirectLFQ intensity traces (default)      pip install mokume[directlfq]
maxlfq     MaxLFQ delayed normalization              --
topn       Average of N most intense peptides        --topn-n to set N
top3       Average of 3 most intense peptides        --
ibaq       Intensity-Based Absolute Quantification   --fasta required
sum        Sum of all peptide intensities            --
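
The ibaq method needs a FASTA because iBAQ divides a protein's summed intensity by the number of peptides it could theoretically produce. The sketch below shows that idea using the usual trypsin rule (cleave after K or R, except before P) and the 7 to 30 residue window from the parameter defaults. It ignores missed cleavages and is not mokume's implementation. The function names and the toy sequence are hypothetical.

```python
def tryptic_peptides(sequence):
    """In-silico trypsin digest: cleave after K/R unless the next residue is P."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        following = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and following != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # C-terminal remainder
    return peptides


def ibaq(summed_intensity, sequence, min_aa=7, max_aa=30):
    """Summed intensity divided by the count of theoretical peptides in range."""
    observable = [p for p in tryptic_peptides(sequence) if min_aa <= len(p) <= max_aa]
    if not observable:
        raise ValueError("no theoretical peptides in the allowed length range")
    return summed_intensity / len(observable)


# Toy sequence: three peptides fall inside the 7-30 window, "DDK" does not.
SEQUENCE = "AAAAAAAK" + "CCCCCCCCR" + "DDK" + "EEEEEEEE"
print(tryptic_peptides(SEQUENCE))
print(ibaq(600.0, SEQUENCE))
```

Changing `--min-aa` or `--max-aa` changes the denominator, so iBAQ values are only comparable between runs that use the same window and enzyme.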

Usage Examples

DirectLFQ (default)

qpxc transform quantify \
    --feature-path ./qpx_output/feature.parquet \
    --method directlfq \
    -o proteins_directlfq.parquet

iBAQ (requires FASTA)

qpxc transform quantify \
    --feature-path ./qpx_output/feature.parquet \
    --method ibaq --fasta proteome.fasta \
    -o proteins_ibaq.tsv

MaxLFQ with 8 threads

qpxc transform quantify \
    --feature-path ./qpx_output/feature.parquet \
    --method maxlfq --threads 8 \
    -o proteins_maxlfq.parquet

TopN with normalization

qpxc transform quantify \
    --feature-path ./qpx_output/feature.parquet \
    --method topn --topn-n 5 --normalize \
    -o proteins_top5.parquet

Output Files

  • Format: Determined by the output file extension
  • Parquet: .parquet files with protein-level quantification
  • TSV: .tsv files (tab-separated)
  • CSV: .csv files (comma-separated)
  • Content: Protein accessions, sample IDs, and quantified intensities

Common Issues

Issue: mokume is not installed

  • Solution: Install with pip install mokume

Issue: DirectLFQ is not installed

  • Solution: Install with pip install mokume[directlfq]

Issue: --fasta option is required for the ibaq method

  • Solution: Provide a FASTA database file with --fasta

Best Practices

  • Ensure QPX feature.parquet contains valid anchor_protein, intensities, and run_file_name fields
  • If using older QPX datasets, ensure fields have been migrated from legacy names (precursor_charge → charge, id_scan → scan)
  • Decoy entries (is_decoy=true) and zero-intensity rows are automatically filtered
  • Use --normalize for cross-sample normalization
  • Use --threads to control parallelism for MaxLFQ