Transform Commands¶
Transform and process data within the QPX ecosystem.
Overview¶
The transform command group provides tools for processing and transforming QPX data into various downstream formats. These commands enable absolute and differential expression analysis, metadata mapping, and data format conversions.
Available Commands¶
- ae - Convert iBAQ to absolute expression format
- differential - Convert MSstats differential expression data
- gene - Map gene information to proteins
- ibaq - Process iBAQ quantification files
- spectra - Map spectrum information
- uniprot - Map latest UniProt annotations
- anndata - Merge AE files into AnnData format
ae¶
Convert iBAQ absolute expression data to QPX format.
Description¶
Transforms iBAQ (intensity-Based Absolute Quantification) data into the standardized QPX absolute expression format. It integrates protein quantification with sample metadata from SDRF files.
Format Specification: For details about the AE format structure and fields, see the Absolute Expression Format Specification.
Parameters¶
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| --ibaq-file | FILE | Yes | - | iBAQ file path |
| --sdrf-file | FILE | Yes | - | SDRF file path |
| --protein-file | FILE | No | - | Protein file that meets specific requirements |
| --project-file | FILE | No | - | QPX project file |
| --output-folder | DIRECTORY | Yes | - | Output directory for generated files |
| --output-prefix | TEXT | No | - | Prefix for output files |
| --delete-existing | FLAG | No | - | Delete existing files in output folder |
Usage Examples¶
Basic Example¶
Convert iBAQ data with default settings:
qpxc transform ae \
--ibaq-file ibaq_data.tsv \
--sdrf-file metadata.sdrf.tsv \
--output-folder ./output
With Project Metadata¶
qpxc transform ae \
--ibaq-file tests/examples/AE/PXD016999.1-ibaq.tsv \
--sdrf-file tests/examples/AE/PXD016999-first-instrument.sdrf.tsv \
--project-file tests/examples/AE/project.json \
--output-folder ./output \
--output-prefix ae_with_metadata \
--delete-existing
Filter Specific Proteins¶
qpxc transform ae \
--ibaq-file tests/examples/AE/PXD016999.1-ibaq.tsv \
--sdrf-file tests/examples/AE/PXD016999-first-instrument.sdrf.tsv \
--protein-file tests/examples/fasta/Homo-sapiens.fasta \
--output-folder ./output
Input File Formats¶
iBAQ File: Tab-separated file with protein accessions and iBAQ intensities
SDRF File: Standard PRIDE SDRF format with sample metadata
Output Files¶
- Output: {output-prefix}-{uuid}.absolute.parquet
- Format: Parquet file containing absolute expression quantification
- Schema: Conforms to QPX absolute expression specification
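Because the output name contains a generated UUID, a downstream script cannot hard-code the path. The following is a minimal Python sketch for locating the newest matching file. It assumes `--output-prefix` was passed; the function name `find_latest_output` is hypothetical and not part of qpxc.

```python
from pathlib import Path
from typing import Optional


def find_latest_output(
    folder: str, prefix: str, suffix: str = ".absolute.parquet"
) -> Optional[Path]:
    """Return the most recently modified '{prefix}-*{suffix}' file, or None."""
    # The '*' stands in for the UUID part of the file name.
    matches = list(Path(folder).glob(f"{prefix}-*{suffix}"))
    if not matches:
        return None
    return max(matches, key=lambda path: path.stat().st_mtime)
```

The same helper could be pointed at the other transform outputs by changing `suffix`, for example `.differential.parquet`.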
Common Issues¶
Issue: Mismatched sample names between iBAQ and SDRF
- Solution: Ensure column names in iBAQ file match sample identifiers in SDRF
Issue: Missing protein accessions
- Solution: Provide --protein-file to filter and validate protein IDs
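A sample-name mismatch can be caught before running the command. The sketch below is one way to do that check in Python, under two assumptions that are not confirmed by this page: the iBAQ file is laid out with one column per sample (as the advice above implies) plus an identifier column here called `protein`, and the SDRF sample identifier is the standard `source name` column. The function name is hypothetical.

```python
import csv
import io
from typing import List, Tuple


def compare_sample_names(
    ibaq_tsv: str, sdrf_tsv: str, id_column: str = "protein"
) -> Tuple[List[str], List[str]]:
    """Return (samples only in the iBAQ file, samples only in the SDRF)."""
    # Sample names are taken from the iBAQ header row (assumed wide layout).
    ibaq_header = next(csv.reader(io.StringIO(ibaq_tsv), delimiter="\t"))
    ibaq_samples = {name for name in ibaq_header if name != id_column}

    # Locate the SDRF 'source name' column case-insensitively.
    sdrf_rows = list(csv.reader(io.StringIO(sdrf_tsv), delimiter="\t"))
    header = [name.strip().lower() for name in sdrf_rows[0]]
    source_index = header.index("source name")
    sdrf_samples = {row[source_index] for row in sdrf_rows[1:] if row}

    return (
        sorted(ibaq_samples - sdrf_samples),
        sorted(sdrf_samples - ibaq_samples),
    )
```

Both returned lists being empty means every sample name appears in both files. The function takes file contents as strings, so a real file would be read first, for example with `pathlib.Path(...).read_text()`.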
Best Practices¶
- Always provide project metadata file when available for better data provenance
- Use --delete-existing flag carefully to avoid accidental data loss
- Validate SDRF file format before processing
- Check sample name consistency across input files
differential¶
Convert MSstats differential expression data to QPX format.
Description¶
Transforms differential expression analysis results from MSstats into the standardized QPX differential expression format. Supports FDR-based filtering and protein-specific subsetting.
Parameters¶
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| --msstats-file | FILE | Yes | - | MSstats differential file |
| --sdrf-file | FILE | Yes | - | SDRF file needed to extract metadata |
| --project-file | FILE | No | - | QPX project file |
| --protein-file | FILE | No | - | Protein file that meets specific requirements |
| --fdr-threshold | FLOAT | No | 0.05 | FDR threshold to filter results |
| --output-folder | DIRECTORY | Yes | - | Output directory for generated files |
| --output-prefix | TEXT | No | - | Prefix for output files |
| --delete-existing | FLAG | No | - | Delete existing files in output folder |
| --verbose | FLAG | No | - | Enable verbose logging |
Usage Examples¶
Basic Example¶
Convert MSstats differential expression data:
qpxc transform differential \
--msstats-file msstats_comparisons.csv \
--sdrf-file metadata.sdrf.tsv \
--output-folder ./output
With Custom FDR Threshold¶
qpxc transform differential \
--msstats-file tests/examples/DE/PXD033169.sdrf_openms_design_msstats_in_comparisons.csv \
--sdrf-file tests/examples/DE/PXD033169.sdrf.tsv \
--fdr-threshold 0.01 \
--output-folder ./output \
--output-prefix de_stringent \
--verbose
With Project Metadata¶
qpxc transform differential \
--msstats-file tests/examples/DE/PXD033169.sdrf_openms_design_msstats_in_comparisons.csv \
--sdrf-file tests/examples/DE/PXD033169.sdrf.tsv \
--project-file tests/examples/DE/project.json \
--fdr-threshold 0.05 \
--output-folder ./output \
--delete-existing
Input File Format¶
MSstats File: CSV file with comparison results
Protein,Label,log2FC,SE,Tvalue,DF,pvalue,adj.pvalue
P12345,Condition2-Condition1,2.5,0.3,8.33,10,0.0001,0.001
Q67890,Condition2-Condition1,-1.8,0.4,-4.5,10,0.002,0.01
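To preview how a given threshold affects a file like the one above, the filtering step can be approximated in a few lines of Python. This is a sketch, not qpxc's implementation: it assumes the filter is applied to the `adj.pvalue` column, that the comparison is "at or below the threshold", and that every value is numeric (rows with `NA` would need extra handling).

```python
import csv
import io
from typing import Dict, List

MSSTATS_CSV = """Protein,Label,log2FC,SE,Tvalue,DF,pvalue,adj.pvalue
P12345,Condition2-Condition1,2.5,0.3,8.33,10,0.0001,0.001
Q67890,Condition2-Condition1,-1.8,0.4,-4.5,10,0.002,0.01
"""


def filter_by_fdr(csv_text: str, threshold: float = 0.05) -> List[Dict[str, str]]:
    """Keep rows whose adj.pvalue is at or below the threshold (assumed rule)."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [row for row in rows if float(row["adj.pvalue"]) <= threshold]


for threshold in (0.05, 0.005):
    kept = [row["Protein"] for row in filter_by_fdr(MSSTATS_CSV, threshold)]
    print(threshold, kept)
# 0.05 ['P12345', 'Q67890']
# 0.005 ['P12345']
```

Counting the surviving rows at a few thresholds this way is a quick check when a run produces no significant results.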
Output Files¶
- Output: {output-prefix}-{uuid}.differential.parquet
- Format: Parquet file containing differential expression results
- Schema: Conforms to QPX differential expression specification
Common Issues¶
Issue: No significant results after FDR filtering
- Solution: Increase --fdr-threshold or check input data quality
Issue: Memory errors with large comparison files
- Solution: Process comparisons in batches or increase available memory
Best Practices¶
- Use FDR threshold of 0.05 or lower for publication-quality results
- Enable verbose mode to monitor filtering statistics
- Validate comparison group names match SDRF metadata
- Include project file for complete data provenance
gene¶
Map gene information from FASTA to parquet format.
Description¶
Maps gene names and information from a FASTA file to protein identifications in QPX PSM or feature files. This command enriches protein data with gene-level metadata extracted from FASTA headers.
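As an illustration of what "gene-level metadata extracted from FASTA headers" can look like, the sketch below pulls the accession and the `GN=` gene name out of UniProt-style headers. It assumes UniProt header formatting and is not taken from qpxc's source; the function name is hypothetical.

```python
import re
from typing import Dict, Iterable

# UniProt-style header: >db|ACCESSION|ENTRY_NAME description ... GN=GENE ...
HEADER_RE = re.compile(r"^>\w+\|([^|]+)\|\S+")
GENE_RE = re.compile(r"\bGN=(\S+)")


def accession_to_gene(fasta_lines: Iterable[str]) -> Dict[str, str]:
    """Map protein accession to gene name for headers that carry a GN= field."""
    mapping: Dict[str, str] = {}
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        header = HEADER_RE.match(line)
        gene = GENE_RE.search(line)
        if header and gene:
            mapping[header.group(1)] = gene.group(1)
    return mapping
```

Headers without a `GN=` field are skipped, which is one reason a FASTA file from a different source or species can leave proteins without gene annotation.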
Parameters¶
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| --parquet-path | FILE | Yes | - | PSM or feature parquet file path |
| --fasta | FILE | Yes | - | FASTA file path |
| --output-folder | DIRECTORY | Yes | - | Output directory for generated files |
| --file-num | INTEGER | No | 10 | Number of rows to read in each batch |
| --partitions | TEXT | No | - | Fields for splitting files (comma-separated) |
| --species | TEXT | No | human | Species name |
Usage Examples¶
Basic Example¶
Map gene information to parquet file:
qpxc transform gene \
--parquet-path ./output/psm.parquet \
--fasta proteins.fasta \
--output-folder ./output
With Partitioning¶
qpxc transform gene \
--parquet-path ./output/feature.parquet \
--fasta tests/examples/fasta/Homo-sapiens.fasta \
--output-folder ./output \
--file-num 20 \
--partitions reference_file_name \
--species human
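The combination of batch reading (`--file-num`) and splitting by field (`--partitions`) can be pictured with the sketch below. It only illustrates the general idea on in-memory rows; how qpxc actually reads batches and names the per-partition files is not described on this page, and both function names are hypothetical.

```python
from collections import defaultdict
from itertools import islice
from typing import Dict, Iterable, Iterator, List, Tuple

Row = Dict[str, str]


def batched(rows: Iterable, size: int) -> Iterator[List]:
    """Yield successive lists of at most `size` items."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def partition_rows(
    rows: Iterable[Row], partitions: str, batch_size: int = 10
) -> Dict[Tuple[str, ...], List[Row]]:
    """Group rows by the comma-separated partition fields, one batch at a time."""
    fields = [name.strip() for name in partitions.split(",")]
    groups: Dict[Tuple[str, ...], List[Row]] = defaultdict(list)
    for batch in batched(rows, batch_size):
        for row in batch:
            key = tuple(row[field] for field in fields)
            groups[key].append(row)
    return dict(groups)
```

With `partitions="reference_file_name"`, every distinct source file name becomes its own group, which corresponds to one output unit per run.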
Output Files¶
- Output: Enhanced parquet file(s) with gene information
- Format: Parquet file in output folder
- Added Fields: Gene names and metadata from FASTA headers
Best Practices¶
- Use species-specific FASTA files for accurate gene annotation
- Adjust --file-num based on available memory for large files
- Use partitioning for better file organization in large datasets
ibaq¶
Convert feature data to iBAQ format.
Description¶
Transforms feature-level quantification data into iBAQ (intensity-Based Absolute Quantification) format. This command integrates feature quantification with sample metadata from SDRF files to generate iBAQ values.
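As background, iBAQ is conventionally defined as the summed peptide intensity of a protein divided by its number of theoretically observable peptides. The sketch below illustrates that general definition per protein and sample. It is not qpxc's implementation, and the input structures (a list of intensity tuples and a dictionary of theoretical peptide counts) are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def ibaq_values(
    peptide_intensities: Iterable[Tuple[str, str, float]],
    theoretical_peptide_counts: Dict[str, int],
) -> Dict[Tuple[str, str], float]:
    """Compute iBAQ per (protein, sample) from (protein, sample, intensity) tuples."""
    totals: Dict[Tuple[str, str], float] = defaultdict(float)
    for protein, sample, intensity in peptide_intensities:
        totals[(protein, sample)] += intensity

    result = {}
    for (protein, sample), total in totals.items():
        count = theoretical_peptide_counts.get(protein, 0)
        if count > 0:  # skip proteins without a usable peptide count
            result[(protein, sample)] = total / count
    return result
```

For example, a protein with peptide intensities 100 and 50 in one sample and three theoretically observable peptides gets an iBAQ value of 50 for that sample.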
Parameters¶
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| --feature-file | FILE | Yes | - | Feature file path |
| --sdrf-file | FILE | Yes | - | SDRF file for metadata extraction |
| --output-folder | DIRECTORY | Yes | - | Output directory for generated files |
| --output-prefix | TEXT | No | ibaq | Prefix for output files |
Usage Examples¶
Basic Example¶
Convert feature data to iBAQ format:
qpxc transform ibaq \
--feature-file ./output/feature.parquet \
--sdrf-file ./metadata.sdrf.tsv \
--output-folder ./output
With Custom Prefix¶
qpxc transform ibaq \
--feature-file ./output/feature.parquet \
--sdrf-file ./metadata.sdrf.tsv \
--output-folder ./output \
--output-prefix ibaq_quantification
Output Files¶
- Output: {output-prefix}-{uuid}.ibaq.parquet
- Format: Parquet file containing iBAQ quantification values
- Content: Protein-level iBAQ values per sample
Best Practices¶
- Ensure feature file contains all necessary quantification data
- Verify SDRF metadata matches sample identifiers in feature file
- Use iBAQ output for absolute protein quantification analysis
spectra¶
Map spectrum information from mzML to parquet format.
Description¶
Enriches PSM or feature data with additional spectral information extracted from mzML files. This command maps spectrum metadata and peak information to the corresponding peptide-spectrum matches.
Parameters¶
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| --parquet-path | FILE | Yes | - | PSM or feature parquet file path |
| --mzml-directory | DIRECTORY | Yes | - | Directory containing mzML files |
| --output-folder | DIRECTORY | Yes | - | Output directory for generated files |
| --file-num | INTEGER | No | 10 | Number of rows to read in each batch |
| --partitions | TEXT | No | - | Fields for splitting files (comma-separated) |
Usage Examples¶
Basic Example¶
Map spectrum information to parquet:
qpxc transform spectra \
--parquet-path ./output/psm.parquet \
--mzml-directory ./mzml_files \
--output-folder ./output
With Batch Processing and Partitioning¶
qpxc transform spectra \
--parquet-path ./output/psm.parquet \
--mzml-directory ./mzml_files \
--output-folder ./output \
--file-num 20 \
--partitions reference_file_name
Output Files¶
- Output: Enhanced PSM/feature parquet file(s) with spectral information
- Format: Parquet file in output folder
- Added Fields: Spectrum metadata and peak information from mzML files
Best Practices¶
- Ensure mzML files are in the specified directory with correct naming
- Adjust --file-num based on available memory and file size
- Use partitioning for organized output when processing large datasets
uniprot¶
Map feature data to latest UniProt version.
Description¶
Maps peptides and features to the latest UniProt protein database using a FASTA file. This command updates protein identifications to match current UniProt accessions and annotations.
Parameters¶
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| --feature-file | FILE | Yes | - | Feature file path |
| --fasta | FILE | Yes | - | UniProt FASTA file path |
| --output-folder | DIRECTORY | Yes | - | Output directory for generated files |
| --output-prefix | TEXT | No | - | Prefix for output files |
Usage Examples¶
Basic Mapping¶
Map features to latest UniProt:
qpxc transform uniprot \
--feature-file ./output/feature.parquet \
--fasta uniprot_human.fasta \
--output-folder ./output
With Custom Prefix¶
qpxc transform uniprot \
--feature-file ./output/feature.parquet \
--fasta ./uniprot_human_2024.fasta \
--output-folder ./output \
--output-prefix feature_updated
Output Files¶
- Output: {output-prefix}-{uuid}.feature.parquet
- Format: Parquet file with updated UniProt mappings
- Content: Feature data mapped to latest UniProt protein identifications
Best Practices¶
- Use the latest UniProt FASTA file for most current annotations
- Run this command when updating to a new UniProt release
- Verify FASTA file matches the organism of your study
anndata¶
Merge multiple AE files into a single file in AnnData format.
Description¶
Combines multiple absolute expression (AE) files from a directory into a single AnnData object (H5AD format). This command is useful for integrating data from multiple experiments for downstream analysis with scanpy or other Python-based tools.
Parameters¶
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| --directory | DIRECTORY | Yes | - | Directory containing the AE files to merge |
| --output-folder | DIRECTORY | Yes | - | Output directory for generated files |
| --output-prefix | TEXT | No | - | Prefix for output files |
Usage Examples¶
Basic Example¶
Merge AE files into AnnData format:
qpxc transform anndata \
--directory ./ae_files \
--output-folder ./output
With Custom Prefix¶
qpxc transform anndata \
--directory ./ae_files \
--output-folder ./output \
--output-prefix merged_ae
Output Files¶
- Output: {output-prefix}-{uuid}.h5ad
- Format: AnnData H5AD file (HDF5-based format)
- Structure:
  - X: Protein expression matrix
  - obs: Sample metadata
  - var: Protein metadata
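The relationship between the three parts of the structure can be sketched in plain Python: rows of X follow obs (samples) and columns follow var (proteins). The record fields `sample`, `protein`, and `ibaq` are hypothetical stand-ins for the AE columns, and filling missing values with 0.0 is an assumption made for this sketch, not a documented behavior of the command.

```python
from typing import Dict, List, Tuple, Union

Record = Dict[str, Union[str, float]]


def pivot_to_matrix(records: List[Record]) -> Tuple[List[List[float]], List[str], List[str]]:
    """Pivot long-format records into (X, obs_names, var_names)."""
    obs_names = sorted({str(r["sample"]) for r in records})   # one row per sample
    var_names = sorted({str(r["protein"]) for r in records})  # one column per protein
    row_of = {name: i for i, name in enumerate(obs_names)}
    col_of = {name: j for j, name in enumerate(var_names)}

    # Missing (sample, protein) pairs stay at 0.0 (assumption for this sketch).
    X = [[0.0] * len(var_names) for _ in obs_names]
    for r in records:
        X[row_of[str(r["sample"])]][col_of[str(r["protein"])]] = float(r["ibaq"])
    return X, obs_names, var_names
```

With the `anndata` package installed, a matrix laid out this way is the shape that `anndata.AnnData` expects for `X`, and the merged file produced by this command can be loaded for inspection with `anndata.read_h5ad`.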
Use Cases¶
- Integration with scanpy for dimensionality reduction and clustering
- Compatibility with Python machine learning libraries
- Multi-experiment data integration
- Cross-study meta-analysis
Best Practices¶
- Ensure all AE files in the directory have consistent format
- Validate sample metadata consistency before merging
- Use the output with scanpy or other AnnData-compatible tools
- Consider memory requirements for large datasets
Related Commands¶
- Convert Commands - Convert raw data to QPX format
- Visualization Commands - Visualize transformed data
- Statistics Commands - Analyze transformed data