Transform Commands¶

Transform and process data within the QPX ecosystem.

Overview¶

The transform command group processes QPX data for downstream use. Its commands cover gene annotation, accession normalization, sample/run metadata updates, and protein-level quantification from feature data.

Available Commands¶

gene-map - Map gene information from FASTA headers onto QPX parquet files
normalize-accessions - Normalize protein accession formats (full ↔ bare)
update-metadata - Update sample/run metadata from a revised SDRF
quantify - Protein quantification via mokume (DirectLFQ, MaxLFQ, iBAQ, TopN, etc.)

gene-map¶

Map gene information from a FASTA database onto a QPX parquet file.

Description¶

Enriches protein identifications in QPX PSM or feature files with gene-level metadata extracted from FASTA database headers.
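To make the kind of metadata carried by a FASTA header concrete, the sketch below pulls the accession and gene name out of UniProt-style headers. It assumes headers follow the `>db|ACCESSION|ENTRY_NAME ... GN=GENE ...` layout; `parse_fasta_genes` is a hypothetical helper written for illustration, not part of qpxc, and the real gene-map logic may differ.

```python
import re

# Assumed UniProt-style header layout:
# >sp|P04114|APOB_HUMAN Apolipoprotein B-100 OS=Homo sapiens OX=9606 GN=APOB PE=1 SV=2
HEADER_RE = re.compile(r"^>\w+\|([^|]+)\|(\S+)")
GENE_RE = re.compile(r"\bGN=(\S+)")


def parse_fasta_genes(lines):
    """Return {accession: gene name or None} for every UniProt-style header."""
    genes = {}
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence line, nothing to extract
        header = HEADER_RE.match(line)
        if header is None:
            continue  # header does not follow the assumed layout
        gene = GENE_RE.search(line)
        genes[header.group(1)] = gene.group(1) if gene else None
    return genes


fasta = [
    ">sp|P04114|APOB_HUMAN Apolipoprotein B-100 OS=Homo sapiens OX=9606 GN=APOB PE=1 SV=2",
    "MDPPRPALLA",  # placeholder sequence line
    ">sp|P02768|ALBU_HUMAN Albumin OS=Homo sapiens OX=9606 GN=ALB PE=1 SV=2",
    "MKWVTFISLL",  # placeholder sequence line
]
print(parse_fasta_genes(fasta))  # {'P04114': 'APOB', 'P02768': 'ALB'}
```

Headers without a `GN=` field map to `None`, which is one reason a species-specific, well-annotated FASTA gives better coverage.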

Parameters¶

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `--parquet-path` | FILE | Yes | - | QPX PSM or feature parquet file path |
| `--fasta` | FILE | Yes | - | FASTA database file path |
| `--output-folder` | DIRECTORY | Yes | - | Output directory for generated files |
| `--species` | TEXT | No | `human` | Species name for gene mapping |
| `--verbose` | FLAG | No | - | Enable verbose logging |

Usage Examples¶

Basic Example¶

Map gene information onto a PSM parquet file, relying on the default species (`human`):

qpxc transform gene-map \
    --parquet-path ./output/psm.parquet \
    --fasta proteins.fasta \
    --output-folder ./output

With Species Parameter¶

qpxc transform gene-map \
    --parquet-path ./output/feature.parquet \
    --fasta tests/examples/fasta/Homo-sapiens.fasta \
    --output-folder ./output \
    --species human

Output Files¶

Output: Enhanced parquet file(s) with gene information
Format: Parquet file in output folder
Added Fields: Gene names and metadata from FASTA headers

Best Practices¶

Use species-specific FASTA files for accurate gene annotation
Enable verbose mode for debugging

normalize-accessions¶

Normalize protein accession formats between full UniProt form (sp|ACC|NAME) and bare form (ACC).

Description¶

Forward (default): converts full UniProt identifiers to bare accessions.
    sp|P04114|APOB_HUMAN → P04114
    CONTAM_sp|CONTAM_P02768|... → CONTAM_P02768
Reverse: converts bare accessions back to full UniProt format. Requires a FASTA database to look up the full identifiers.

Normalizes anchor_protein and pg_accessions in both feature.parquet and pg.parquet files.
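The two directions can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the qpxc implementation: `to_bare`, `build_reverse_index`, and `to_full` are hypothetical names, and the sketch assumes the full identifier is the first whitespace-delimited token of each FASTA header.

```python
def to_bare(accession):
    """Forward: 'sp|P04114|APOB_HUMAN' -> 'P04114'; bare input is returned unchanged."""
    parts = accession.split("|")
    return parts[1] if len(parts) >= 3 else accession


def build_reverse_index(fasta_headers):
    """Map bare accession -> full identifier, read from FASTA header lines."""
    index = {}
    for header in fasta_headers:
        full_id = header.lstrip(">").split()[0]  # text up to the first space
        index[to_bare(full_id)] = full_id
    return index


def to_full(accession, index):
    """Reverse: look up the bare accession; unknown accessions are left untouched."""
    return index.get(accession, accession)


index = build_reverse_index([">sp|P04114|APOB_HUMAN Apolipoprotein B-100 OS=Homo sapiens"])

print(to_bare("sp|P04114|APOB_HUMAN"))            # P04114
print(to_bare("CONTAM_sp|CONTAM_P02768|ENTRY"))   # CONTAM_P02768 (ENTRY is a placeholder name)
print(to_full("P04114", index))                   # sp|P04114|APOB_HUMAN
```

The reverse direction needs the FASTA because the database prefix and entry name cannot be derived from the bare accession alone.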

Parameters¶

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `--dataset` | DIRECTORY | Yes | - | Path to a QPX dataset directory (containing quantms.*.parquet files) |
| `--direction` | TEXT | No | `forward` | 'forward' (sp\|ACC\|NAME → ACC) or 'reverse' (ACC → sp\|ACC\|NAME) |
| `--fasta` | FILE | No | - | FASTA database file (required for --direction reverse) |
| `--in-place` | FLAG | No | - | Overwrite original files instead of writing to --output |
| `--output` | DIRECTORY | No | overwrites in place | Output directory |
| `--verbose` | FLAG | No | - | Enable verbose logging |

Usage Examples¶

Normalize protein accession formats:

# Forward: strip sp|...|... to bare accessions
qpxc transform normalize-accessions \
    --dataset ./my_dataset --direction forward --in-place

# Reverse: restore full UniProt identifiers from FASTA
qpxc transform normalize-accessions \
    --dataset ./my_dataset --direction reverse \
    --fasta proteins.fasta --in-place

# Forward to a new directory (non-destructive)
qpxc transform normalize-accessions \
    --dataset ./my_dataset --direction forward \
    --output ./my_dataset_normalized

update-metadata¶

Update sample.parquet and run.parquet metadata from a revised SDRF file, with safety checks on protected fields.

Description¶

Re-generates sample.parquet and run.parquet from the new SDRF. Original files are backed up as *.parquet.bak before overwriting.

Safe to update (metadata-only, no impact on data): disease, organism_part, cell_type, cell_line, sex, age, treatment, individual, ancestry, developmental_stage, and any additional characteristics[*] columns.

Protected fields (will BLOCK unless --force is used): instrument, enzymes, modification_parameters, fraction, dissociation_method, label/channel mapping, data file references.

If --old-sdrf is provided, the tool compares old vs new SDRF and blocks if any protected fields changed. Without --old-sdrf, the safety check is skipped (useful for first-time metadata enrichment).
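The old-versus-new comparison can be pictured as a per-run diff restricted to protected columns. The sketch below is a minimal illustration, not the tool's actual check: the SDRF column names in `PROTECTED_COLUMNS`, the use of `comment[data file]` as the row key, and the helper names are all assumptions, and it assumes one row per data file (label-free data) with no repeated column names.

```python
import csv
import io

# Assumed SDRF column names for some protected fields; the columns qpxc
# actually inspects may differ.
PROTECTED_COLUMNS = [
    "comment[instrument]",
    "comment[cleavage agent details]",
    "comment[modification parameters]",
    "comment[fraction identifier]",
    "comment[dissociation method]",
    "comment[label]",
]
KEY_COLUMN = "comment[data file]"


def read_sdrf(text):
    """Parse SDRF TSV text into {data file: row dict}."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {row[KEY_COLUMN]: row for row in reader}


def protected_changes(old, new):
    """List (data file, column, old value, new value) for every protected change."""
    changes = []
    for data_file, old_row in old.items():
        new_row = new.get(data_file)
        if new_row is None:  # a data file reference disappeared
            changes.append((data_file, KEY_COLUMN, data_file, None))
            continue
        for column in PROTECTED_COLUMNS:
            if old_row.get(column) != new_row.get(column):
                changes.append((data_file, column, old_row.get(column), new_row.get(column)))
    return changes


HEADER = "source name\tcharacteristics[disease]\tcomment[instrument]\tcomment[data file]\n"
old = read_sdrf(HEADER + "s1\tnormal\tQ Exactive\trun1.raw\n")
safe = read_sdrf(HEADER + "s1\tcarcinoma\tQ Exactive\trun1.raw\n")
unsafe = read_sdrf(HEADER + "s1\tnormal\tOrbitrap Fusion\trun1.raw\n")

print(protected_changes(old, safe))    # [] because only the disease annotation changed
print(protected_changes(old, unsafe))  # one entry for comment[instrument]
```

An empty list corresponds to the case where the update may proceed; a non-empty list is the situation in which the update is blocked unless --force is given.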

Parameters¶

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `--dataset` | DIRECTORY | Yes | - | Path to a QPX dataset directory |
| `--sdrf` | FILE | Yes | - | Path to the updated SDRF TSV file |
| `--old-sdrf` | FILE | No | - | Path to the original SDRF (for safety checks). If omitted, protected-field validation is skipped. |
| `--force` | FLAG | No | - | Apply changes even if protected fields (instrument, enzymes, modifications, fractions, labels) have changed. Use with caution. |
| `--verbose` | FLAG | No | - | Enable verbose logging |

Usage Examples¶

Update dataset metadata from SDRF:

# Update with safety check
qpxc transform update-metadata \
    --dataset ./my_project \
    --sdrf ./updated_sdrf.tsv \
    --old-sdrf ./original_sdrf.tsv

# Update without safety check (first-time enrichment)
qpxc transform update-metadata \
    --dataset ./my_project \
    --sdrf ./enriched_sdrf.tsv

# Force update even if protected fields changed
qpxc transform update-metadata \
    --dataset ./my_project \
    --sdrf ./new_sdrf.tsv \
    --old-sdrf ./old_sdrf.tsv --force

quantify¶

Compute protein-level quantification from QPX feature data using mokume.

Description¶

Reads a QPX feature.parquet file, extracts peptide-level intensities, and computes protein-level quantification using the selected method.

Supported methods:

directlfq - DirectLFQ intensity traces (default)
maxlfq - MaxLFQ delayed normalization
topn - Average of N most intense peptides
top3 - Average of 3 most intense peptides
ibaq - Intensity-Based Absolute Quantification (requires --fasta)
sum - Sum of all peptide intensities

Parameters¶

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `--feature-path` | FILE | Yes | - | QPX feature.parquet file path |
| `--method` | TEXT | No | `directlfq` | Quantification method (directlfq, maxlfq, topn, top3, ibaq, sum) |
| `--fasta` | FILE | No | - | FASTA database (required for ibaq method) |
| `--enzyme` | TEXT | No | Trypsin | Enzyme for iBAQ digestion |
| `--topn-n` | INTEGER | No | 3 | N for TopN method |
| `--threads` | INTEGER | No | -1 | Parallel threads for MaxLFQ (-1 = all cores) |
| `--output`, `-o` | PATH | Yes | - | Output file path (.parquet, .tsv, or .csv) |
| `--normalize` | FLAG | No | - | Normalize quantification values |
| `--organism` | TEXT | No | human | Organism for iBAQ |
| `--ploidy` | INTEGER | No | 2 | Ploidy for iBAQ ruler |
| `--cpc` | FLOAT | No | 200 | Cellular protein concentration (g/L) for iBAQ ruler |
| `--min-aa` | INTEGER | No | 7 | Min peptide length for iBAQ |
| `--max-aa` | INTEGER | No | 30 | Max peptide length for iBAQ |
| `--verbose` | FLAG | No | - | Enable verbose logging |
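The `--enzyme`, `--min-aa`, and `--max-aa` options matter because iBAQ divides a protein's summed peptide intensity by the number of peptides that could theoretically be observed for it. The sketch below illustrates that idea for trypsin only, with no missed cleavages; `tryptic_peptides` and `ibaq` are hypothetical helpers, the sequence is artificial, and mokume's actual digestion rules may differ.

```python
import re


def tryptic_peptides(sequence, min_aa=7, max_aa=30):
    """Fully tryptic peptides (cleave after K/R, not before P) within the length window."""
    pieces = re.split(r"(?<=[KR])(?!P)", sequence)  # zero-width split, Python 3.7+
    return [p for p in pieces if min_aa <= len(p) <= max_aa]


def ibaq(peptide_intensities, sequence, min_aa=7, max_aa=30):
    """Summed intensity divided by the number of theoretically observable peptides."""
    n_observable = len(tryptic_peptides(sequence, min_aa, max_aa))
    if n_observable == 0:
        raise ValueError("no observable peptides for this sequence")
    return sum(peptide_intensities) / n_observable


sequence = "AAAAAAAKCCCRPDDDDDDDDKEEK"  # artificial sequence for illustration
print(tryptic_peptides(sequence))        # ['AAAAAAAK', 'CCCRPDDDDDDDDK']
print(ibaq([4.0e6, 2.0e6], sequence))    # 3000000.0
```

In the example, `EEK` is dropped because it is shorter than 7 residues, and the `R` followed by `P` is not treated as a cleavage site, so two peptides remain in the denominator.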

Supported Methods¶

| Method | Description | Extra Requirements |
|---|---|---|
| `directlfq` | DirectLFQ intensity traces (default) | `pip install mokume[directlfq]` |
| `maxlfq` | MaxLFQ delayed normalization | -- |
| `topn` | Average of N most intense peptides | `--topn-n` to set N |
| `top3` | Average of 3 most intense peptides | -- |
| `ibaq` | Intensity-Based Absolute Quantification | `--fasta` required |
| `sum` | Sum of all peptide intensities | -- |
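The two simplest methods in the table, `topn` and `sum`, can be written out directly. The following is a minimal sketch of the arithmetic rather than mokume's code: the `(protein, sample, peptide, intensity)` row layout and the `quantify_rows` name are assumptions, and when a protein has fewer than N peptides in a sample the sketch simply averages those available.

```python
from collections import defaultdict


def quantify_rows(rows, method="topn", n=3):
    """rows: iterable of (protein, sample, peptide, intensity) tuples."""
    grouped = defaultdict(list)
    for protein, sample, _peptide, intensity in rows:
        grouped[(protein, sample)].append(intensity)

    result = {}
    for key, intensities in grouped.items():
        if method == "sum":
            result[key] = sum(intensities)
        elif method == "topn":
            top = sorted(intensities, reverse=True)[:n]  # N most intense peptides
            result[key] = sum(top) / len(top)
        else:
            raise ValueError(f"unsupported method: {method}")
    return result


rows = [
    ("P1", "s1", "pepA", 10.0),
    ("P1", "s1", "pepB", 20.0),
    ("P1", "s1", "pepC", 30.0),
    ("P1", "s1", "pepD", 40.0),
    ("P1", "s2", "pepA", 5.0),
    ("P1", "s2", "pepB", 15.0),
]
print(quantify_rows(rows, "topn", n=3))  # {('P1', 's1'): 30.0, ('P1', 's2'): 10.0}
print(quantify_rows(rows, "sum"))        # {('P1', 's1'): 100.0, ('P1', 's2'): 20.0}
```

With `n=3` this is the same calculation as `top3`; DirectLFQ and MaxLFQ involve cross-sample normalization steps that are not captured by a per-sample aggregate like this.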

Usage Examples¶

DirectLFQ (default)¶

qpxc transform quantify \
    --feature-path ./qpx_output/feature.parquet \
    --method directlfq \
    -o proteins_directlfq.parquet

iBAQ (requires FASTA)¶

qpxc transform quantify \
    --feature-path ./qpx_output/feature.parquet \
    --method ibaq --fasta proteome.fasta \
    -o proteins_ibaq.tsv

MaxLFQ with 8 threads¶

qpxc transform quantify \
    --feature-path ./qpx_output/feature.parquet \
    --method maxlfq --threads 8 \
    -o proteins_maxlfq.parquet

TopN with normalization¶

qpxc transform quantify \
    --feature-path ./qpx_output/feature.parquet \
    --method topn --topn-n 5 --normalize \
    -o proteins_top5.parquet

Output Files¶

Parquet: .parquet files with protein-level quantification
TSV/CSV: .tsv (tab-separated) or .csv (comma-separated) files, selected by the output file extension
Content: Protein accessions, sample IDs, and quantified intensities
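Choosing the format from the file extension can be sketched as a small dispatch on the path suffix. This is an illustration only: `render_table` and the column names are hypothetical, and the Parquet branch is left out because it needs a third-party library such as pyarrow.

```python
import csv
import io
from pathlib import Path

DELIMITERS = {".tsv": "\t", ".csv": ","}


def render_table(output_path, header, rows):
    """Return the text that would be written for a .tsv or .csv output path."""
    suffix = Path(output_path).suffix.lower()
    if suffix == ".parquet":
        raise NotImplementedError("Parquet output needs a library such as pyarrow")
    if suffix not in DELIMITERS:
        raise ValueError(f"unsupported output extension: {suffix!r}")
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=DELIMITERS[suffix], lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


header = ["protein", "sample", "intensity"]  # hypothetical column names
rows = [["P04114", "s1", 30.0]]
print(render_table("proteins.tsv", header, rows))
print(render_table("proteins.csv", header, rows))
```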

Common Issues¶

Issue: mokume is not installed

Solution: Install with pip install mokume

Issue: DirectLFQ is not installed

Solution: Install with pip install "mokume[directlfq]" (the quotes stop shells such as zsh from treating the square brackets as a glob pattern)

Issue: --fasta option is required for the ibaq method

Solution: Provide a FASTA database file with --fasta

Best Practices¶

Ensure QPX feature.parquet contains valid anchor_protein, intensities, and run_file_name fields
If using older QPX datasets, ensure fields have been migrated from legacy names (precursor_charge → charge, id_scan → scan)
Decoy entries (is_decoy=true) and zero-intensity rows are automatically filtered
Use --normalize for cross-sample normalization
Use --threads to control parallelism for MaxLFQ
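A quick pre-flight check along the lines of the first three points might look like the sketch below. It is a hedged illustration, not what `quantify` runs internally: the helper names are hypothetical, and the filter assumes each record has already been flattened to a single numeric `intensity` value, which may not match the layout of the `intensities` field in a real feature.parquet.

```python
REQUIRED_FIELDS = {"anchor_protein", "intensities", "run_file_name"}
LEGACY_NAMES = {"precursor_charge": "charge", "id_scan": "scan"}


def migrate_columns(columns):
    """Rename legacy column names to their current equivalents."""
    return [LEGACY_NAMES.get(name, name) for name in columns]


def missing_fields(columns):
    """Required fields that are absent from the given column names."""
    return sorted(REQUIRED_FIELDS - set(columns))


def keep_row(row):
    """True for target (non-decoy) rows with a positive intensity."""
    return not row.get("is_decoy", False) and row.get("intensity", 0) > 0


columns = ["anchor_protein", "intensities", "precursor_charge", "id_scan"]
print(migrate_columns(columns))  # ['anchor_protein', 'intensities', 'charge', 'scan']
print(missing_fields(columns))   # ['run_file_name']

records = [
    {"anchor_protein": "P04114", "is_decoy": False, "intensity": 1200.0},
    {"anchor_protein": "DECOY_P04114", "is_decoy": True, "intensity": 900.0},
    {"anchor_protein": "P02768", "is_decoy": False, "intensity": 0.0},
]
print([r["anchor_protein"] for r in records if keep_row(r)])  # ['P04114']
```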

Related Commands¶

Convert Commands - Convert raw data to QPX format
Visualization Commands - Visualize transformed data
Statistics Commands - Analyze transformed data
