Transform Commands¶
Transform and process data within the QPX ecosystem.
Overview¶
The transform command group provides tools for annotating, normalizing, and quantifying QPX data. These commands cover gene annotation from FASTA headers, protein accession normalization, sample/run metadata updates from a revised SDRF, and protein-level quantification from feature data.
Available Commands¶
- gene-map - Map genes from FASTA
- normalize-accessions - Normalize protein accession formats (full ↔ bare)
- update-metadata - Update sample/run metadata from a revised SDRF
- quantify - Protein quantification via mokume (DirectLFQ, MaxLFQ, iBAQ, TopN, etc.)
gene-map¶
Map gene information from FASTA to parquet format.
Description¶
Enriches protein identifications in QPX PSM or feature files with gene-level metadata extracted from FASTA database headers.
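As a rough illustration of what "gene-level metadata extracted from FASTA database headers" can mean, the sketch below pulls the GN= token out of UniProt-style headers. It is not the gene-map implementation: the function name parse_fasta_genes and the example sequences are made up for illustration, and only the standard UniProt header layout (db|ACC|NAME ... GN=GENE ...) is assumed.

```python
import re

# Matches the GN=<gene> token in a UniProt-style FASTA header.
GENE_RE = re.compile(r"\bGN=(\S+)")


def parse_fasta_genes(fasta_text):
    """Return {accession: gene_name} for headers that carry a GN= token.

    Hypothetical helper for illustration; not part of QPX.
    """
    genes = {}
    for line in fasta_text.splitlines():
        if not line.startswith(">"):
            continue  # sequence line, nothing to parse
        fields = line[1:].split(None, 1)
        if not fields:
            continue  # bare ">" with no identifier
        parts = fields[0].split("|")  # e.g. sp|P04114|APOB_HUMAN
        accession = parts[1] if len(parts) >= 3 else fields[0]
        match = GENE_RE.search(line)
        if match:
            genes[accession] = match.group(1)
    return genes


# Sequences are short placeholders, not the real protein sequences.
example_fasta = """>sp|P04114|APOB_HUMAN Apolipoprotein B-100 OS=Homo sapiens OX=9606 GN=APOB PE=1 SV=2
MDPPRPALLALLALPALLLL
>sp|P02768|ALBU_HUMAN Albumin OS=Homo sapiens OX=9606 GN=ALB PE=1 SV=2
MKWVTFISLLFLFSSAYS
"""

gene_by_accession = parse_fasta_genes(example_fasta)
print(gene_by_accession)
```

Headers without a GN= token are simply skipped here; how gene-map treats such entries is not specified in this page.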
Parameters¶
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| --parquet-path | FILE | Yes | - | QPX PSM or feature parquet file path |
| --fasta | FILE | Yes | - | FASTA database file path |
| --output-folder | DIRECTORY | Yes | - | Output directory for generated files |
| --species | TEXT | No | human | Species name for gene mapping |
| --verbose | FLAG | No | - | Enable verbose logging |
Usage Examples¶
Basic Example¶
Map gene information to parquet file:
qpxc transform gene-map \
--parquet-path ./output/psm.parquet \
--fasta proteins.fasta \
--output-folder ./output
With Species Parameter¶
qpxc transform gene-map \
--parquet-path ./output/feature.parquet \
--fasta tests/examples/fasta/Homo-sapiens.fasta \
--output-folder ./output \
--species human
Output Files¶
- Output: Enhanced parquet file(s) with gene information
- Format: Parquet file in output folder
- Added Fields: Gene names and metadata from FASTA headers
Best Practices¶
- Use species-specific FASTA files for accurate gene annotation
- Enable verbose mode for debugging
normalize-accessions¶
Normalize protein accession formats between full UniProt form (sp|ACC|NAME) and bare form (ACC).
Description¶
Forward (default): converts full UniProt identifiers to bare accessions.
- sp|P04114|APOB_HUMAN → P04114
- CONTAM_sp|CONTAM_P02768|... → CONTAM_P02768
Reverse: converts bare accessions back to full UniProt format. Requires a FASTA database to look up the full identifiers.
Normalizes anchor_protein and pg_accessions in both feature.parquet and pg.parquet files.
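The two directions can be pictured with a small Python sketch. This is an illustration of the mapping only, not the code normalize-accessions runs: the helper names (to_bare, build_reverse_lookup, to_full) are hypothetical, and the third field of the contaminant identifier is an assumed value, since the example above elides it.

```python
def to_bare(accession):
    """Forward: 'sp|P04114|APOB_HUMAN' -> 'P04114'; other shapes pass through."""
    parts = accession.split("|")
    return parts[1] if len(parts) == 3 else accession


def build_reverse_lookup(fasta_headers):
    """Map bare accession -> full identifier, taken from FASTA header lines."""
    lookup = {}
    for header in fasta_headers:
        fields = header.lstrip(">").split(None, 1)
        if not fields:
            continue  # empty header line
        full_id = fields[0]
        lookup[to_bare(full_id)] = full_id
    return lookup


def to_full(accession, lookup):
    """Reverse: restore the full identifier; unknown accessions stay as-is."""
    return lookup.get(accession, accession)


headers = [
    ">sp|P04114|APOB_HUMAN Apolipoprotein B-100 OS=Homo sapiens OX=9606 GN=APOB PE=1 SV=2",
]
lookup = build_reverse_lookup(headers)

forward = to_bare("sp|P04114|APOB_HUMAN")
# Third field is an assumed value for illustration only.
contaminant = to_bare("CONTAM_sp|CONTAM_P02768|ALBU_HUMAN")
restored = to_full("P04114", lookup)
print(forward, contaminant, restored)
```

The lookup step is why the reverse direction needs a FASTA file: the entry name (APOB_HUMAN) cannot be derived from the bare accession alone.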
Parameters¶
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| --dataset | DIRECTORY | Yes | - | Path to a QPX dataset directory (containing quantms.*.parquet files) |
| --direction | TEXT | No | forward | 'forward' (sp\|ACC\|NAME → ACC) or 'reverse' (ACC → sp\|ACC\|NAME) |
| --fasta | FILE | No | - | FASTA database file (required for --direction reverse) |
| --in-place | FLAG | No | - | Overwrite original files instead of writing to --output |
| --output | DIRECTORY | No | overwrites in place | Output directory (default: overwrites in place) |
| --verbose | FLAG | No | - | Enable verbose logging |
Usage Examples¶
Normalize protein accession formats:
# Forward: strip sp|...|... to bare accessions
qpxc transform normalize-accessions \
--dataset ./my_dataset --direction forward --in-place
# Reverse: restore full UniProt identifiers from FASTA
qpxc transform normalize-accessions \
--dataset ./my_dataset --direction reverse \
--fasta proteins.fasta --in-place
# Forward to a new directory (non-destructive)
qpxc transform normalize-accessions \
--dataset ./my_dataset --direction forward \
--output ./my_dataset_normalized
update-metadata¶
Update sample.parquet and run.parquet metadata from a revised SDRF file, with safety checks on protected fields.
Description¶
Re-generates sample.parquet and run.parquet from the new SDRF. Original files are backed up as *.parquet.bak before overwriting.
Safe to update (metadata-only, no impact on data):
- disease, organism_part, cell_type, cell_line, sex, age, treatment, individual, ancestry, developmental_stage
- any additional characteristics[*] columns
Protected fields (will BLOCK unless --force is used):
- instrument, enzymes, modification_parameters, fraction, dissociation_method
- label/channel mapping
- data file references
If --old-sdrf is provided, the tool compares old vs new SDRF and blocks if any protected fields changed. Without --old-sdrf, the safety check is skipped (useful for first-time metadata enrichment).
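The protected-field comparison described above could be sketched roughly as follows. This is not the update-metadata implementation. The mapping from protected field names to SDRF column headers in PROTECTED_COLUMNS is an assumption based on common SDRF-Proteomics column names, the function names are hypothetical, and rows are compared in file order for simplicity.

```python
import csv
import io

# Assumed SDRF column headers for the protected fields (illustrative mapping).
PROTECTED_COLUMNS = [
    "comment[instrument]",
    "comment[cleavage agent details]",
    "comment[modification parameters]",
    "comment[fraction identifier]",
    "comment[dissociation method]",
    "comment[label]",
    "comment[data file]",
]


def read_sdrf(text):
    """Parse SDRF TSV text into row dicts mapping column name -> list of values."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = [name.strip().lower() for name in next(reader)]
    rows = []
    for row in reader:
        record = {}
        for name, value in zip(header, row):
            # SDRF allows repeated column names, so collect values in a list.
            record.setdefault(name, []).append(value)
        rows.append(record)
    return rows


def changed_protected_fields(old_text, new_text):
    """Return the sorted protected columns whose values differ row by row."""
    changed = set()
    for old, new in zip(read_sdrf(old_text), read_sdrf(new_text)):
        for column in PROTECTED_COLUMNS:
            if old.get(column) != new.get(column):
                changed.add(column)
    return sorted(changed)


header = "source name\tcharacteristics[disease]\tcomment[instrument]\tcomment[data file]\n"
old_sdrf = header + "sample 1\tnot available\tQ Exactive\trun1.raw\n"
enriched_sdrf = header + "sample 1\tbreast cancer\tQ Exactive\trun1.raw\n"
risky_sdrf = header + "sample 1\tbreast cancer\tOrbitrap Fusion\trun1.raw\n"

safe_changes = changed_protected_fields(old_sdrf, enriched_sdrf)
blocked_changes = changed_protected_fields(old_sdrf, risky_sdrf)
print(safe_changes, blocked_changes)
```

In this sketch an empty result corresponds to an update that would proceed, while a non-empty result corresponds to the case that is blocked unless --force is given.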
Parameters¶
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| --dataset | DIRECTORY | Yes | - | Path to a QPX dataset directory |
| --sdrf | FILE | Yes | - | Path to the updated SDRF TSV file |
| --old-sdrf | FILE | No | - | Path to the original SDRF (for safety checks). If omitted, protected-field validation is skipped. |
| --force | FLAG | No | - | Apply changes even if protected fields (instrument, enzymes, modifications, fractions, labels) have changed. Use with caution. |
| --verbose | FLAG | No | - | Enable verbose logging |
Usage Examples¶
Update dataset metadata from SDRF:
# Update with safety check
qpxc transform update-metadata \
--dataset ./my_project \
--sdrf ./updated_sdrf.tsv \
--old-sdrf ./original_sdrf.tsv
# Update without safety check (first-time enrichment)
qpxc transform update-metadata \
--dataset ./my_project \
--sdrf ./enriched_sdrf.tsv
# Force update even if protected fields changed
qpxc transform update-metadata \
--dataset ./my_project \
--sdrf ./new_sdrf.tsv \
--old-sdrf ./old_sdrf.tsv --force
quantify¶
Compute protein-level quantification from QPX feature data using mokume.
Description¶
Reads a QPX feature.parquet file, extracts peptide-level intensities, and computes protein-level quantification using the selected method.
Supported methods:
- directlfq - DirectLFQ intensity traces (default)
- maxlfq - MaxLFQ delayed normalization
- topn - Average of N most intense peptides
- top3 - Average of 3 most intense peptides
- ibaq - Intensity-Based Absolute Quantification (requires --fasta)
- sum - Sum of all peptide intensities
Parameters¶
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| --feature-path | FILE | Yes | - | QPX feature.parquet file path |
| --method | TEXT | No | directlfq | Quantification method (directlfq, maxlfq, topn, top3, ibaq, sum) |
| --fasta | FILE | No | - | FASTA database (required for ibaq method) |
| --enzyme | TEXT | No | Trypsin | Enzyme for iBAQ digestion |
| --topn-n | INTEGER | No | 3 | N for TopN method |
| --threads | INTEGER | No | -1 | Parallel threads for MaxLFQ (-1 = all cores) |
| --output | PATH | Yes | - | Output file path (.parquet, .tsv, or .csv) |
| --normalize | FLAG | No | - | Normalize quantification values |
| --organism | TEXT | No | human | Organism for iBAQ |
| --ploidy | INTEGER | No | 2 | Ploidy for iBAQ ruler |
| --cpc | FLOAT | No | 200 | Cellular protein concentration for iBAQ ruler |
| --min-aa | INTEGER | No | 7 | Min peptide length for iBAQ |
| --max-aa | INTEGER | No | 30 | Max peptide length for iBAQ |
| --verbose | FLAG | No | - | Enable verbose logging |
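To make the role of --enzyme, --min-aa, and --max-aa more concrete, here is a minimal sketch of the textbook iBAQ idea: the summed peptide intensity of a protein divided by the number of theoretically observable peptides from an in-silico digest. It is not mokume's implementation. The function names are hypothetical, the digest uses the common trypsin rule (cleave after K or R unless followed by P) with no missed cleavages, and the --ploidy and --cpc ruler options are not covered.

```python
def tryptic_peptides(sequence):
    """In-silico trypsin digest: cleave after K/R unless the next residue is P."""
    peptides = []
    start = 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # C-terminal remainder
    return peptides


def ibaq(peptide_intensities, sequence, min_aa=7, max_aa=30):
    """Summed intensity divided by the count of observable peptides."""
    observable = [
        p for p in tryptic_peptides(sequence) if min_aa <= len(p) <= max_aa
    ]
    if not observable:
        return float("nan")  # no peptide falls in the length window
    return sum(peptide_intensities) / len(observable)


# Made-up sequence, built piecewise so the peptide lengths are easy to read.
sequence = "MK" + "A" * 7 + "KP" + "G" * 6 + "R" + "S" * 8 + "K"
peptides = tryptic_peptides(sequence)
value = ibaq([1000.0, 2000.0], sequence)
print(peptides, value)
```

Here the two-residue peptide MK is excluded by the default length window of 7 to 30, so the summed intensity is divided by 2 rather than 3.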
Supported Methods¶
| Method | Description | Extra Requirements |
|---|---|---|
| directlfq | DirectLFQ intensity traces (default) | pip install mokume[directlfq] |
| maxlfq | MaxLFQ delayed normalization | -- |
| topn | Average of N most intense peptides | --topn-n to set N |
| top3 | Average of 3 most intense peptides | -- |
| ibaq | Intensity-Based Absolute Quantification | --fasta required |
| sum | Sum of all peptide intensities | -- |
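The two simplest methods in the table, topn and sum, can be sketched in a few lines of Python. This is an illustration of the definitions given in the table, not the mokume code path. The input layout (protein, sample, peptide, intensity tuples), the function names, and the peptide labels are assumptions made for the example.

```python
from collections import defaultdict


def group_intensities(rows):
    """Group positive peptide intensities by (protein, sample)."""
    grouped = defaultdict(list)
    for protein, sample, _peptide, intensity in rows:
        if intensity > 0:  # zero-intensity rows carry no signal
            grouped[(protein, sample)].append(intensity)
    return grouped


def topn_quantify(rows, n=3):
    """Average of the n most intense peptides per protein and sample."""
    result = {}
    for key, values in group_intensities(rows).items():
        top = sorted(values, reverse=True)[:n]
        result[key] = sum(top) / len(top)
    return result


def sum_quantify(rows):
    """Sum of all peptide intensities per protein and sample."""
    return {key: sum(values) for key, values in group_intensities(rows).items()}


# Hypothetical peptide labels and intensities.
rows = [
    ("P04114", "S1", "PEP1", 100.0),
    ("P04114", "S1", "PEP2", 400.0),
    ("P04114", "S1", "PEP3", 300.0),
    ("P04114", "S1", "PEP4", 200.0),
    ("P04114", "S2", "PEP1", 50.0),
    ("P04114", "S2", "PEP2", 0.0),
]

top3 = topn_quantify(rows, n=3)
totals = sum_quantify(rows)
print(top3, totals)
```

When a protein has fewer than n peptides in a sample, this sketch averages whatever is available; whether mokume does the same is not stated in this page.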
Usage Examples¶
DirectLFQ (default)¶
qpxc transform quantify \
--feature-path ./qpx_output/feature.parquet \
--method directlfq \
-o proteins_directlfq.parquet
iBAQ (requires FASTA)¶
qpxc transform quantify \
--feature-path ./qpx_output/feature.parquet \
--method ibaq --fasta proteome.fasta \
-o proteins_ibaq.tsv
MaxLFQ with 8 threads¶
qpxc transform quantify \
--feature-path ./qpx_output/feature.parquet \
--method maxlfq --threads 8 \
-o proteins_maxlfq.parquet
TopN with normalization¶
qpxc transform quantify \
--feature-path ./qpx_output/feature.parquet \
--method topn --topn-n 5 --normalize \
-o proteins_top5.parquet
Output Files¶
- Parquet: .parquet files with protein-level quantification
- TSV: .tsv files (tab-separated)
- CSV: .csv files (comma-separated)
- Format: determined by the output file extension
- Content: Protein accessions, sample IDs, and quantified intensities
Common Issues¶
Issue: mokume is not installed
- Solution: Install with pip install mokume
Issue: DirectLFQ is not installed
- Solution: Install with pip install mokume[directlfq]
Issue: --fasta option is required for the ibaq method
- Solution: Provide a FASTA database file with --fasta
Best Practices¶
- Ensure QPX feature.parquet contains valid anchor_protein, intensities, and run_file_name fields
- If using older QPX datasets, ensure fields have been migrated from legacy names (precursor_charge → charge, id_scan → scan)
- Decoy entries (is_decoy=true) and zero-intensity rows are automatically filtered
- Use --normalize for cross-sample normalization
- Use --threads to control parallelism for MaxLFQ
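The field migration and filtering mentioned in the list above can be pictured with a small sketch over plain dictionaries. It is only an illustration: the real feature.parquet schema is richer than this, the flat intensity key and the DECOY_ prefix are assumptions for the example, and the helper names are hypothetical.

```python
# Legacy field names and their current equivalents, as listed above.
LEGACY_FIELD_NAMES = {"precursor_charge": "charge", "id_scan": "scan"}


def migrate_fields(row):
    """Rename legacy keys to their current names; other keys are unchanged."""
    return {LEGACY_FIELD_NAMES.get(key, key): value for key, value in row.items()}


def keep_row(row):
    """Keep rows that are not decoys and have a positive intensity."""
    return not row.get("is_decoy", False) and row.get("intensity", 0) > 0


rows = [
    {"anchor_protein": "P04114", "precursor_charge": 2, "id_scan": 101,
     "intensity": 1500.0, "is_decoy": False},
    {"anchor_protein": "DECOY_P04114", "charge": 2, "scan": 102,
     "intensity": 900.0, "is_decoy": True},
    {"anchor_protein": "P02768", "charge": 3, "scan": 103,
     "intensity": 0.0, "is_decoy": False},
]

cleaned = [migrate_fields(row) for row in rows if keep_row(row)]
print(cleaned)
```

Only the first row survives: the second is a decoy and the third has zero intensity.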
Related Commands¶
- Convert Commands - Convert raw data to QPX format
- Visualization Commands - Visualize transformed data
- Statistics Commands - Analyze transformed data