Transform Commands

Transform and process data within the QPX ecosystem.

Overview

The transform command group provides tools for processing and transforming QPX data into various downstream formats. These commands enable absolute and differential expression analysis, metadata mapping, and data format conversions.

Available Commands

  • ae - Convert iBAQ to absolute expression format
  • differential - Convert MSstats differential expression data
  • gene - Map gene information to proteins
  • ibaq - Process iBAQ quantification files
  • spectra - Map spectrum information
  • uniprot - Map latest UniProt annotations
  • anndata - Merge AE files into AnnData format

ae

Convert iBAQ absolute expression data to QPX format.

Description

Transforms iBAQ (intensity-Based Absolute Quantification) data into the standardized QPX absolute expression format. It integrates protein quantification with sample metadata from SDRF files.

Format Specification: For details about the AE format structure and fields, see the Absolute Expression Format Specification.

Parameters

Parameter          Type       Required  Default  Description
--ibaq-file        FILE       Yes       -        iBAQ file path
--sdrf-file        FILE       Yes       -        SDRF file path
--protein-file     FILE       No        -        Protein file used to filter and validate protein IDs
--project-file     FILE       No        -        QPX project file
--output-folder    DIRECTORY  Yes       -        Output directory for generated files
--output-prefix    TEXT       No        -        Prefix for output files
--delete-existing  FLAG       No        -        Delete existing files in output folder

Usage Examples

Basic Example

Convert iBAQ data with default settings:

qpxc transform ae \
    --ibaq-file ibaq_data.tsv \
    --sdrf-file metadata.sdrf.tsv \
    --output-folder ./output

With Project Metadata

qpxc transform ae \
    --ibaq-file tests/examples/AE/PXD016999.1-ibaq.tsv \
    --sdrf-file tests/examples/AE/PXD016999-first-instrument.sdrf.tsv \
    --project-file tests/examples/AE/project.json \
    --output-folder ./output \
    --output-prefix ae_with_metadata \
    --delete-existing

Filter Specific Proteins

qpxc transform ae \
    --ibaq-file tests/examples/AE/PXD016999.1-ibaq.tsv \
    --sdrf-file tests/examples/AE/PXD016999-first-instrument.sdrf.tsv \
    --protein-file tests/examples/fasta/Homo-sapiens.fasta \
    --output-folder ./output

Input File Formats

iBAQ File: Tab-separated file with protein accessions and iBAQ intensities

ProteinID    Sample1    Sample2    Sample3
P12345       1000000    950000     1050000
Q67890       500000     480000     520000

SDRF File: Standard PRIDE SDRF format with sample metadata

Output Files

Common Issues

Issue: Mismatched sample names between iBAQ and SDRF

  • Solution: Ensure column names in iBAQ file match sample identifiers in SDRF

Issue: Missing protein accessions

  • Solution: Provide --protein-file to filter and validate protein IDs
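
The sample-name mismatch can be caught before running the command. The sketch below is a minimal pre-flight check and is not part of qpxc. It assumes the wide iBAQ layout shown above (first column is the protein ID, remaining columns are samples) and that the SDRF sample identifiers are in the `source name` column; both assumptions may not hold for your files. The function name `sample_mismatches` is hypothetical.

```python
import csv
import io


def sample_mismatches(ibaq_text, sdrf_text, sdrf_column="source name"):
    """Return (samples only in iBAQ, samples only in SDRF) as sorted lists."""
    # Header of the wide iBAQ table: the first column is the protein ID.
    ibaq_header = next(csv.reader(io.StringIO(ibaq_text), delimiter="\t"))
    ibaq_samples = set(ibaq_header[1:])

    # SDRF is tab-separated; collect the values of the assumed sample column.
    sdrf_rows = csv.DictReader(io.StringIO(sdrf_text), delimiter="\t")
    sdrf_samples = {row[sdrf_column] for row in sdrf_rows}

    return sorted(ibaq_samples - sdrf_samples), sorted(sdrf_samples - ibaq_samples)


# Tiny in-memory example; real files would be read with open(path).read().
ibaq = "ProteinID\tSample1\tSample2\tSample3\nP12345\t1000000\t950000\t1050000\n"
sdrf = (
    "source name\tcharacteristics[organism]\n"
    "Sample1\tHomo sapiens\n"
    "Sample2\tHomo sapiens\n"
    "Sample4\tHomo sapiens\n"
)

only_ibaq, only_sdrf = sample_mismatches(ibaq, sdrf)
print("Only in iBAQ:", only_ibaq)
print("Only in SDRF:", only_sdrf)
```

Two empty lists mean the sample names agree under these assumptions.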

Best Practices

  • Provide a project metadata file whenever one is available, for better data provenance
  • Use the --delete-existing flag carefully to avoid accidental data loss
  • Validate the SDRF file format before processing
  • Check that sample names are consistent across input files

differential

Convert MSstats differential expression data to QPX format.

Description

Transforms differential expression analysis results from MSstats into the standardized QPX differential expression format. Supports FDR-based filtering and protein-specific subsetting.

Parameters

Parameter          Type       Required  Default  Description
--msstats-file     FILE       Yes       -        MSstats differential file
--sdrf-file        FILE       Yes       -        SDRF file needed to extract metadata
--project-file     FILE       No        -        QPX project file
--protein-file     FILE       No        -        Protein file used to restrict results to specific proteins
--fdr-threshold    FLOAT      No        0.05     FDR threshold to filter results
--output-folder    DIRECTORY  Yes       -        Output directory for generated files
--output-prefix    TEXT       No        -        Prefix for output files
--delete-existing  FLAG       No        -        Delete existing files in output folder
--verbose          FLAG       No        -        Enable verbose logging

Usage Examples

Basic Example

Convert MSstats differential expression data:

qpxc transform differential \
    --msstats-file msstats_comparisons.csv \
    --sdrf-file metadata.sdrf.tsv \
    --output-folder ./output

With Custom FDR Threshold

qpxc transform differential \
    --msstats-file tests/examples/DE/PXD033169.sdrf_openms_design_msstats_in_comparisons.csv \
    --sdrf-file tests/examples/DE/PXD033169.sdrf.tsv \
    --fdr-threshold 0.01 \
    --output-folder ./output \
    --output-prefix de_stringent \
    --verbose

With Project Metadata

qpxc transform differential \
    --msstats-file tests/examples/DE/PXD033169.sdrf_openms_design_msstats_in_comparisons.csv \
    --sdrf-file tests/examples/DE/PXD033169.sdrf.tsv \
    --project-file tests/examples/DE/project.json \
    --fdr-threshold 0.05 \
    --output-folder ./output \
    --delete-existing

Input File Format

MSstats File: CSV file with comparison results

Protein,Label,log2FC,SE,Tvalue,DF,pvalue,adj.pvalue
P12345,Condition2-Condition1,2.5,0.3,8.33,10,0.0001,0.001
Q67890,Condition2-Condition1,-1.8,0.4,-4.5,10,0.002,0.01

Output Files

  • Output: {output-prefix}-{uuid}.differential.parquet
  • Format: Parquet file containing differential expression results
  • Schema: Conforms to QPX differential expression specification

Common Issues

Issue: No significant results after FDR filtering

  • Solution: Increase --fdr-threshold or check input data quality

Issue: Memory errors with large comparison files

  • Solution: Process comparisons in batches or increase available memory
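
Before changing --fdr-threshold, it can help to see how many rows would survive a given cutoff. The sketch below is an illustration, not qpxc's implementation: it assumes the filter is applied to the `adj.pvalue` column of the MSstats CSV shown above and that rows at or below the threshold are kept. The helper name `count_passing` is hypothetical.

```python
import csv
import io


def count_passing(msstats_text, fdr_threshold=0.05, column="adj.pvalue"):
    """Return (rows at or below the threshold, total rows)."""
    total = passing = 0
    for row in csv.DictReader(io.StringIO(msstats_text)):
        total += 1
        try:
            value = float(row[column])
        except (TypeError, ValueError):
            continue  # skip missing or non-numeric entries such as "NA"
        if value <= fdr_threshold:
            passing += 1
    return passing, total


# The two example rows from the input format above.
msstats = (
    "Protein,Label,log2FC,SE,Tvalue,DF,pvalue,adj.pvalue\n"
    "P12345,Condition2-Condition1,2.5,0.3,8.33,10,0.0001,0.001\n"
    "Q67890,Condition2-Condition1,-1.8,0.4,-4.5,10,0.002,0.01\n"
)

for threshold in (0.05, 0.005):
    kept, total = count_passing(msstats, threshold)
    print(f"threshold {threshold}: {kept} of {total} rows pass")
```

If very few rows pass even at a lenient threshold, the input data is more likely the problem than the cutoff.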

Best Practices

  • Use an FDR threshold of 0.05 or lower for publication-quality results
  • Enable verbose mode to monitor filtering statistics
  • Check that comparison group names match the SDRF metadata
  • Include a project file for complete data provenance

gene

Map gene information from a FASTA file onto a PSM or feature parquet file.

Description

Maps gene names and information from a FASTA file to protein identifications in QPX PSM or feature files. This command enriches protein data with gene-level metadata extracted from FASTA headers.

Parameters

Parameter        Type       Required  Default  Description
--parquet-path   FILE       Yes       -        PSM or feature parquet file path
--fasta          FILE       Yes       -        FASTA file path
--output-folder  DIRECTORY  Yes       -        Output directory for generated files
--file-num       INTEGER    No        10       Number of rows to read in each batch
--partitions     TEXT       No        -        Fields for splitting files (comma-separated)
--species        TEXT       No        human    Species name

Usage Examples

Basic Example

Map gene information to parquet file:

qpxc transform gene \
    --parquet-path ./output/psm.parquet \
    --fasta proteins.fasta \
    --output-folder ./output

With Partitioning

qpxc transform gene \
    --parquet-path ./output/feature.parquet \
    --fasta tests/examples/fasta/Homo-sapiens.fasta \
    --output-folder ./output \
    --file-num 20 \
    --partitions reference_file_name \
    --species human

Output Files

  • Output: Enhanced parquet file(s) with gene information
  • Format: Parquet file in output folder
  • Added Fields: Gene names and metadata from FASTA headers
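
To check what gene information a FASTA file can contribute, the headers can be inspected directly. The sketch below assumes UniProt-style headers (`>db|ACCESSION|ENTRY_NAME ... GN=GENE ...`) and reads the gene name from the `GN=` field. How qpxc itself parses headers is not described here, so treat this only as a way to preview the data. The function name `gene_map` is hypothetical.

```python
import re


def gene_map(fasta_lines):
    """Map accession -> gene name (or None) from UniProt-style FASTA headers."""
    mapping = {}
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line
        parts = line[1:].split("|")
        if len(parts) < 3:
            continue  # not a UniProt-style header
        match = re.search(r"\bGN=(\S+)", line)
        mapping[parts[1]] = match.group(1) if match else None
    return mapping


# Illustrative headers; the second accession is made up and has no GN= field.
fasta = """>sp|P04637|P53_HUMAN Cellular tumor antigen p53 OS=Homo sapiens OX=9606 GN=TP53 PE=1 SV=4
MEEPQSDPSV
>tr|X0X0X0|X0X0X0_HUMAN Uncharacterized protein OS=Homo sapiens OX=9606 PE=4 SV=1
MAAAK
"""

genes = gene_map(fasta.splitlines())
print(genes)
```

Entries mapped to None have no gene name in their header, so no gene name can be read from them.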

Best Practices

  • Use species-specific FASTA files for accurate gene annotation
  • Adjust --file-num based on available memory for large files
  • Use partitioning for better file organization in large datasets

ibaq

Convert feature data to iBAQ format.

Description

Transforms feature-level quantification data into iBAQ (intensity-Based Absolute Quantification) format. This command integrates feature quantification with sample metadata from SDRF files to generate iBAQ values.

Parameters

Parameter        Type       Required  Default  Description
--feature-file   FILE       Yes       -        Feature file path
--sdrf-file      FILE       Yes       -        SDRF file for metadata extraction
--output-folder  DIRECTORY  Yes       -        Output directory for generated files
--output-prefix  TEXT       No        ibaq     Prefix for output files

Usage Examples

Basic Example

Convert feature data to iBAQ format:

qpxc transform ibaq \
    --feature-file ./output/feature.parquet \
    --sdrf-file ./metadata.sdrf.tsv \
    --output-folder ./output

With Custom Prefix

qpxc transform ibaq \
    --feature-file ./output/feature.parquet \
    --sdrf-file ./metadata.sdrf.tsv \
    --output-folder ./output \
    --output-prefix ibaq_quantification

Output Files

  • Output: {output-prefix}-{uuid}.ibaq.parquet
  • Format: Parquet file containing iBAQ quantification values
  • Content: Protein-level iBAQ values per sample
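
For readers new to the metric, iBAQ is commonly defined as the summed intensity of a protein's identified peptides divided by the number of peptides theoretically observable for that protein. The sketch below illustrates that definition only. The digestion rule (cleave after K or R unless followed by P) and the 6 to 30 residue length window are assumptions made for the example, and the command's own computation may differ. The function names are hypothetical.

```python
import re


def tryptic_peptides(sequence, min_len=6, max_len=30):
    """In-silico trypsin digest: cleave after K or R, except before P."""
    pieces = re.split(r"(?<=[KR])(?!P)", sequence)
    # Keep only peptides inside the assumed observable length window.
    return [p for p in pieces if min_len <= len(p) <= max_len]


def ibaq(peptide_intensities, sequence):
    """Summed peptide intensity divided by the number of theoretical peptides."""
    n_theoretical = len(tryptic_peptides(sequence))
    if n_theoretical == 0:
        raise ValueError("no theoretical peptides in the length window")
    return sum(peptide_intensities) / n_theoretical


# Toy protein: digests to AAAAAAK, CCCK, DDDDDDDR, EEEEEE; CCCK is too short.
toy_sequence = "AAAAAAKCCCKDDDDDDDREEEEEE"
print(tryptic_peptides(toy_sequence))
print(ibaq([3.0e6, 1.5e6, 1.5e6], toy_sequence))
```

Dividing by the theoretical peptide count is what makes values comparable between proteins of different lengths.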

Best Practices

  • Ensure feature file contains all necessary quantification data
  • Verify SDRF metadata matches sample identifiers in feature file
  • Use iBAQ output for absolute protein quantification analysis

spectra

Map spectrum information from mzML to parquet format.

Description

Enriches PSM or feature data with additional spectral information extracted from mzML files. This command maps spectrum metadata and peak information to the corresponding peptide-spectrum matches.

Parameters

Parameter         Type       Required  Default  Description
--parquet-path    FILE       Yes       -        PSM or feature parquet file path
--mzml-directory  DIRECTORY  Yes       -        Directory containing mzML files
--output-folder   DIRECTORY  Yes       -        Output directory for generated files
--file-num        INTEGER    No        10       Number of rows to read in each batch
--partitions      TEXT       No        -        Fields for splitting files (comma-separated)

Usage Examples

Basic Example

Map spectrum information to parquet:

qpxc transform spectra \
    --parquet-path ./output/psm.parquet \
    --mzml-directory ./mzml_files \
    --output-folder ./output

With Batch Processing and Partitioning

qpxc transform spectra \
    --parquet-path ./output/psm.parquet \
    --mzml-directory ./mzml_files \
    --output-folder ./output \
    --file-num 20 \
    --partitions reference_file_name

Output Files

  • Output: Enhanced PSM/feature parquet file(s) with spectral information
  • Format: Parquet file in output folder
  • Added Fields: Spectrum metadata and peak information from mzML files

Best Practices

  • Ensure mzML files are in the specified directory with correct naming
  • Adjust --file-num based on available memory and file size
  • Use partitioning for organized output when processing large datasets
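
The naming requirement can be checked up front. The sketch below assumes that each run referenced in the parquet file (for example through `reference_file_name`, as used with --partitions above) corresponds to a file named `<name>.mzML` in the directory. That naming rule is an assumption for illustration, as is the helper name `missing_mzml`.

```python
import tempfile
from pathlib import Path


def missing_mzml(reference_names, mzml_directory):
    """Return the reference names that have no '<name>.mzML' file in the directory."""
    available = {
        path.stem
        for path in Path(mzml_directory).iterdir()
        if path.suffix.lower() == ".mzml"
    }
    return sorted(set(reference_names) - available)


# Self-contained demo using a temporary directory with one mzML file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "run_01.mzML").touch()
    missing = missing_mzml(["run_01", "run_02"], tmp)

print("Missing mzML files for:", missing)
```

An empty list means every referenced run has a matching mzML file under the assumed naming rule.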

uniprot

Map feature data to the latest UniProt version.

Description

Maps peptides and features to the latest UniProt protein database using a FASTA file. This command updates protein identifications to match current UniProt accessions and annotations.

Parameters

Parameter        Type       Required  Default  Description
--feature-file   FILE       Yes       -        Feature file path
--fasta          FILE       Yes       -        UniProt FASTA file path
--output-folder  DIRECTORY  Yes       -        Output directory for generated files
--output-prefix  TEXT       No        -        Prefix for output files

Usage Examples

Basic Mapping

Map features to latest UniProt:

qpxc transform uniprot \
    --feature-file ./output/feature.parquet \
    --fasta uniprot_human.fasta \
    --output-folder ./output

With Custom Prefix

qpxc transform uniprot \
    --feature-file ./output/feature.parquet \
    --fasta ./uniprot_human_2024.fasta \
    --output-folder ./output \
    --output-prefix feature_updated

Output Files

  • Output: {output-prefix}-{uuid}.feature.parquet
  • Format: Parquet file with updated UniProt mappings
  • Content: Feature data mapped to latest UniProt protein identifications

Best Practices

  • Use the latest UniProt FASTA file for most current annotations
  • Run this command when updating to a new UniProt release
  • Verify FASTA file matches the organism of your study

anndata

Merge multiple AE files into a single AnnData file.

Description

Combines multiple absolute expression (AE) files from a directory into a single AnnData object (H5AD format). This command is useful for integrating data from multiple experiments for downstream analysis with scanpy or other Python-based tools.

Parameters

Parameter        Type       Required  Default  Description
--directory      DIRECTORY  Yes       -        Directory containing the AE files to merge
--output-folder  DIRECTORY  Yes       -        Output directory for generated files
--output-prefix  TEXT       No        -        Prefix for output files

Usage Examples

Basic Example

Merge AE files into AnnData format:

qpxc transform anndata \
    --directory ./ae_files \
    --output-folder ./output

With Custom Prefix

qpxc transform anndata \
    --directory ./ae_files \
    --output-folder ./output \
    --output-prefix merged_ae

Output Files

  • Output: {output-prefix}-{uuid}.h5ad
  • Format: AnnData H5AD file (HDF5-based format)
  • Structure:
      • X: Protein expression matrix
      • obs: Sample metadata
      • var: Protein metadata
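
The X/obs/var layout can be pictured with a small pivot. The sketch below uses plain Python lists rather than the anndata library and made-up `(sample, protein, value)` records, since the AE column names are not listed here. It only illustrates the AnnData convention that obs entries index the rows of X and var entries index its columns. The function name `to_matrix` is hypothetical.

```python
def to_matrix(records):
    """Pivot (sample, protein, value) records into X, obs (rows) and var (columns)."""
    obs = sorted({sample for sample, _, _ in records})
    var = sorted({protein for _, protein, _ in records})
    row = {sample: i for i, sample in enumerate(obs)}
    col = {protein: j for j, protein in enumerate(var)}
    # None marks protein/sample pairs with no measurement.
    X = [[None] * len(var) for _ in obs]
    for sample, protein, value in records:
        X[row[sample]][col[protein]] = value
    return X, obs, var


# Made-up records for illustration only.
records = [
    ("Sample1", "P12345", 1000000.0),
    ("Sample2", "P12345", 950000.0),
    ("Sample1", "Q67890", 500000.0),
]

X, obs, var = to_matrix(records)
print(obs)  # row labels (samples)
print(var)  # column labels (proteins)
print(X)
```

In the real H5AD file, obs and var are tables that can carry additional metadata columns for each sample and protein.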

Use Cases

  • Integration with scanpy for dimensionality reduction and clustering
  • Compatibility with Python machine learning libraries
  • Multi-experiment data integration
  • Cross-study meta-analysis

Best Practices

  • Ensure all AE files in the directory have consistent format
  • Validate sample metadata consistency before merging
  • Use the output with scanpy or other AnnData-compatible tools
  • Consider memory requirements for large datasets