
Statistics Commands

Perform statistical analysis on QPX data.

Overview

The stats command group generates statistical summaries of QPX data files. Use these commands to assess data quality and completeness and to collect key metrics for experimental reports.

Available Commands

All statistics commands are accessed through the analyze subcommand:

  • project-ae - Generate statistics for absolute expression data
  • psm - Generate statistics for PSM data

project-ae

Generate comprehensive statistics for a project's absolute expression data.

Description

Analyzes both absolute expression (AE) and PSM data to generate a complete statistical summary. This command is useful for quality control and generating summary statistics for publications or reports.

Parameters

Parameter         Type   Required   Default   Description
--absolute-path   FILE   Yes        -         Absolute expression file path
--parquet-path    FILE   Yes        -         PSM parquet file path
--save-path       FILE   No         -         Output statistics file path (e.g. stats.txt)

Usage Examples

Print to Console

Omit --save-path to print the project statistics to the console:

qpxc stats analyze project-ae \
    --absolute-path ./output/ae.parquet \
    --parquet-path ./output/psm.parquet

Save to File

qpxc stats analyze project-ae \
    --absolute-path ./output/ae.parquet \
    --parquet-path ./output/psm.parquet \
    --save-path ./reports/project_statistics.txt

Output Format

The command generates a text report with the following metrics (example values):

Number of proteins: 2,547
Number of peptides: 12,384
Number of samples: 24
Number of peptidoforms: 15,921
Number of msruns: 24
iBAQ Number of proteins: 2,547
iBAQ Number of samples: 24

Metrics Explained

Metric                    Description
Number of proteins        Unique protein identifications
Number of peptides        Unique peptide sequences identified
Number of samples         Biological samples in the dataset
Number of peptidoforms    Unique modified peptide forms
Number of msruns          Total MS runs performed
iBAQ Number of proteins   Proteins with iBAQ quantification
iBAQ Number of samples    Samples with iBAQ data
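
When a script reads values back out of this report, note that a plain substring search for Number of proteins also matches the iBAQ Number of proteins line. The following is a minimal sketch of an exact-label lookup. It assumes the report uses the "Label: value" layout shown above; get_metric is a hypothetical helper name, not part of qpxc.

```shell
#!/bin/bash
# get_metric FILE LABEL
# Print the value of the line whose label matches LABEL exactly,
# with thousands separators removed. Prints nothing if the label is absent.
get_metric() {
    awk -F': *' -v label="$2" '
        $1 == label { gsub(/,/, "", $2); print $2; exit }
    ' "$1"
}

# Sample report built from the example values shown above.
report=$(mktemp)
cat > "$report" <<'EOF'
Number of proteins: 2,547
Number of peptides: 12,384
Number of samples: 24
Number of peptidoforms: 15,921
Number of msruns: 24
iBAQ Number of proteins: 2,547
iBAQ Number of samples: 24
EOF

get_metric "$report" "Number of proteins"       # prints 2547
get_metric "$report" "iBAQ Number of samples"   # prints 24
rm -f "$report"
```

Matching on the full label rather than a substring keeps the two protein counts separate, which matters once the numbers feed into arithmetic or threshold checks.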

Use Cases

  • Quality Control: Verify expected number of identifications
  • Publication Reporting: Generate summary statistics for methods sections
  • Data Completeness: Assess coverage across samples
  • Comparative Analysis: Compare statistics across different processing pipelines

Best Practices

  • Run statistics after data processing to verify completeness
  • Compare statistics with expected values based on sample type and instrument
  • Use for QC before downstream analysis
  • Include in supplementary materials for publications

psm

Generate statistics for PSM (Peptide-Spectrum Match) data.

Description

Analyzes PSM data to generate detailed statistics about identifications, including protein, peptide, and PSM counts.

Parameters

Parameter        Type   Required   Default   Description
--parquet-path   FILE   Yes        -         PSM parquet file path
--save-path      FILE   No         -         Output statistics file path (e.g. stats.txt)

Usage Examples

Print to Console

Omit --save-path to print the PSM statistics to the console:

qpxc stats analyze psm \
    --parquet-path ./output/psm.parquet

Save to File

qpxc stats analyze psm \
    --parquet-path ./output/psm.parquet \
    --save-path ./reports/psm_statistics.txt

Output Format

Number of proteins: 1,823
Number of peptides: 8,642
Number of peptidoforms: 11,205
Number of PSMs: 45,892
Number of msruns: 12

Metrics Explained

Metric                   Description
Number of proteins       Unique proteins with at least one PSM
Number of peptides       Unique peptide sequences (without modifications)
Number of peptidoforms   Unique peptide sequences with modifications
Number of PSMs           Total peptide-spectrum matches
Number of msruns         Number of MS runs contributing data

Understanding the Metrics

Peptide vs Peptidoform

  • Peptide: Amino acid sequence (e.g., PEPTIDE)
  • Peptidoform: Peptide + modifications (e.g., PEPTIDE[+16])

Expected Ratios

Typical ratios for good-quality data:

  • PSMs per peptide: 2-5 (varies with replicates)
  • Peptidoforms per peptide: 1-3 (depends on PTM analysis)
  • Peptides per protein: 3-20 (depends on protein abundance and coverage)
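
These ratios can be computed directly from a saved PSM report. The sketch below assumes the "Label: value" layout shown under Output Format and uses the example values from that section; compute_ratios is a hypothetical helper name, not a qpxc command.

```shell
#!/bin/bash
# compute_ratios FILE
# Read "Label: value" lines and print the three ratios described above.
compute_ratios() {
    awk -F': *' '
        # Strip thousands separators and store each count under its label.
        { gsub(/,/, "", $2); value[$1] = $2 + 0 }
        END {
            peptides = value["Number of peptides"]
            proteins = value["Number of proteins"]
            if (peptides == 0 || proteins == 0) {
                print "cannot compute ratios: missing or zero counts" > "/dev/stderr"
                exit 1
            }
            printf "PSMs per peptide: %.2f\n", value["Number of PSMs"] / peptides
            printf "Peptidoforms per peptide: %.2f\n", value["Number of peptidoforms"] / peptides
            printf "Peptides per protein: %.2f\n", peptides / proteins
        }
    ' "$1"
}

# Example values from the Output Format section.
stats=$(mktemp)
cat > "$stats" <<'EOF'
Number of proteins: 1,823
Number of peptides: 8,642
Number of peptidoforms: 11,205
Number of PSMs: 45,892
Number of msruns: 12
EOF

compute_ratios "$stats"
# PSMs per peptide: 5.31
# Peptidoforms per peptide: 1.30
# Peptides per protein: 4.74
rm -f "$stats"
```

For the example values, PSMs per peptide comes out slightly above the typical 2-5 range, which is consistent with the note that this ratio varies with the number of replicates.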

Use Cases

  • Quality Assessment: Verify identification rates
  • Method Optimization: Compare different search parameters
  • Replication Analysis: Assess consistency across runs
  • FDR Validation: Ensure sufficient identifications after filtering

Best Practices

  • Compare PSM counts before and after FDR filtering
  • Monitor peptidoform counts to assess modification analysis quality
  • Track msrun numbers to verify all files processed correctly
  • Use as input for sample size calculations in future experiments
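
The first practice above, comparing counts before and after FDR filtering, can be sketched as a comparison of two saved reports. This assumes you have run the psm command once on unfiltered data and once on filtered data. compare_reports is a hypothetical helper name, and the "before" values below are made up for illustration.

```shell
#!/bin/bash
# compare_reports BEFORE_FILE AFTER_FILE
# For each metric present in both reports, print both counts and the
# percentage of the "before" count that remains in the "after" report.
compare_reports() {
    awk -F': *' '
        { gsub(/,/, "", $2) }                      # drop thousands separators
        NR == FNR { before[$1] = $2 + 0; next }    # first file: remember counts
        ($1 in before) && before[$1] > 0 {
            printf "%s: %d -> %d (%.1f%% retained)\n",
                   $1, before[$1], $2, 100 * $2 / before[$1]
        }
    ' "$1" "$2"
}

before=$(mktemp)
after=$(mktemp)
# Illustrative counts only; not real results.
cat > "$before" <<'EOF'
Number of proteins: 2,104
Number of peptides: 10,377
Number of PSMs: 61,250
EOF
cat > "$after" <<'EOF'
Number of proteins: 1,823
Number of peptides: 8,642
Number of PSMs: 45,892
EOF

compare_reports "$before" "$after"
rm -f "$before" "$after"
```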

Statistical Interpretation Guide

Data Quality Indicators

Good Quality Signs

  • Consistent protein counts across replicates
  • Expected PSM-to-peptide ratios
  • Complete data across all msruns
  • Reasonable peptidoform diversity

Potential Issues

  • Very low PSM counts: Search parameter issues, poor sample quality
  • High peptidoform-to-peptide ratio: Over-prediction of modifications
  • Missing msruns: File processing errors
  • Extreme variation: Batch effects, contamination
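
A missing msrun is the easiest of these issues to detect automatically. The sketch below compares the reported msrun count with the number of runs you expected to process. check_msruns is a hypothetical helper name, and the report layout is assumed to match the Output Format section.

```shell
#!/bin/bash
# check_msruns STATS_FILE EXPECTED_COUNT
# Returns 0 if the reported msrun count matches, 1 on a mismatch,
# and 2 if the report has no "Number of msruns" line.
check_msruns() {
    local reported
    reported=$(awk -F': *' '
        $1 == "Number of msruns" { gsub(/,/, "", $2); print $2; exit }
    ' "$1")
    if [ -z "$reported" ]; then
        echo "ERROR: no msrun count found in $1" >&2
        return 2
    fi
    if [ "$reported" -ne "$2" ]; then
        echo "WARNING: $reported msruns reported, $2 expected"
        return 1
    fi
    echo "OK: all $2 msruns present"
}

# Example values from the PSM Output Format section.
stats=$(mktemp)
cat > "$stats" <<'EOF'
Number of PSMs: 45,892
Number of msruns: 12
EOF

check_msruns "$stats" 12   # prints "OK: all 12 msruns present"
rm -f "$stats"
```

In practice the expected count might come from counting the raw input files, for example with ls ./raw/*.mzML | wc -l, where the ./raw directory and the .mzML extension are assumptions about your project layout.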

Comparative Analysis

When comparing datasets:

  1. Normalize by sample amount: Account for loading differences
  2. Consider instrument type: Different platforms yield different numbers
  3. Account for search space: Database size affects identification rates
  4. Match FDR thresholds: Ensure fair comparison

Reporting Guidelines

For publications, report:

  • Total unique proteins, peptides, and PSMs
  • Number of biological replicates
  • Number of MS runs (msruns), including technical replicates
  • FDR thresholds applied
  • Protein and peptide identification rates

Automation and Integration

Batch Processing

Process multiple files:

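#!/bin/bash
# Create the report directory up front; harmless if it already exists
mkdir -p ./reports

for file in ./output/*.psm.parquet; do
    # Skip the literal pattern when no files match the glob
    [ -e "$file" ] || continue
    # e.g. sample1.psm.parquet -> sample1.psm_stats.txt
    filename=$(basename "$file" .parquet)
    qpxc stats analyze psm \
        --parquet-path "$file" \
        --save-path "./reports/${filename}_stats.txt"
done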
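
After a batch run, the per-file reports can be collected into a single tab-separated table for side-by-side comparison. This is a minimal sketch that assumes the *_stats.txt naming used by the loop above and the PSM report layout from the Output Format section. summarize_reports is a hypothetical helper name, and the two sample reports are made up.

```shell
#!/bin/bash
# summarize_reports DIR
# Print one TSV row per *_stats.txt report found in DIR.
summarize_reports() {
    printf 'file\tproteins\tpeptides\tpeptidoforms\tpsms\tmsruns\n'
    for report in "$1"/*_stats.txt; do
        [ -e "$report" ] || continue   # no reports matched the glob
        awk -F': *' -v name="$(basename "$report" _stats.txt)" '
            { gsub(/,/, "", $2); v[$1] = $2 }
            END {
                printf "%s\t%s\t%s\t%s\t%s\t%s\n", name,
                    v["Number of proteins"], v["Number of peptides"],
                    v["Number of peptidoforms"], v["Number of PSMs"],
                    v["Number of msruns"]
            }
        ' "$report"
    done
}

# Two made-up reports to demonstrate the output shape.
reports=$(mktemp -d)
printf 'Number of proteins: 1,823\nNumber of peptides: 8,642\nNumber of peptidoforms: 11,205\nNumber of PSMs: 45,892\nNumber of msruns: 12\n' \
    > "$reports/sample_a.psm_stats.txt"
printf 'Number of proteins: 1,790\nNumber of peptides: 8,401\nNumber of peptidoforms: 10,950\nNumber of PSMs: 44,310\nNumber of msruns: 12\n' \
    > "$reports/sample_b.psm_stats.txt"

summarize_reports "$reports"
rm -rf "$reports"
```

A metric that is absent from a report produces an empty cell rather than an error, so incomplete reports are easy to spot in the table.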

Integration with Reports

Use output in automated reports:

qpxc stats analyze psm \
    --parquet-path ./output/psm.parquet \
    --save-path ./reports/stats.txt

# Extract specific metrics
grep "Number of proteins" ./reports/stats.txt >> ./summary_report.md
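
If the summary report is Markdown, the whole statistics file can be turned into a table instead of appending single lines. The sketch below assumes the "Label: value" report layout shown earlier; stats_to_markdown is a hypothetical helper name, not a qpxc feature.

```shell
#!/bin/bash
# stats_to_markdown FILE
# Convert "Label: value" lines into a two-column Markdown table.
stats_to_markdown() {
    printf '| Metric | Value |\n| --- | --- |\n'
    # NF >= 2 skips blank lines and anything without a "label: value" shape.
    awk -F': *' 'NF >= 2 { printf "| %s | %s |\n", $1, $2 }' "$1"
}

# Example values from the PSM Output Format section.
stats=$(mktemp)
cat > "$stats" <<'EOF'
Number of proteins: 1,823
Number of peptides: 8,642
Number of PSMs: 45,892
EOF

stats_to_markdown "$stats"
rm -f "$stats"
```

Redirecting the function's output with >> ./summary_report.md appends the table to the report in the same way as the grep example above.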

Combining with Visualization

Generate both statistics and plots:

# Generate statistics
qpxc stats analyze psm \
    --parquet-path ./output/psm.parquet \
    --save-path ./reports/stats.txt

# Generate visualizations
qpxc visualize plot peptide-distribution \
    --feature-path ./output/feature.parquet \
    --save-path ./plots/peptide_dist.svg

Advanced Usage

Quality Control Workflow

Complete QC workflow example:

#!/bin/bash

# 0. Make sure the output directory exists before writing reports
mkdir -p ./qc

# 1. Generate PSM statistics
qpxc stats analyze psm \
    --parquet-path ./output/psm.parquet \
    --save-path ./qc/psm_stats.txt

# 2. Generate AE statistics (if available)
if [ -f ./output/ae.parquet ]; then
    qpxc stats analyze project-ae \
        --absolute-path ./output/ae.parquet \
        --parquet-path ./output/psm.parquet \
        --save-path ./qc/ae_stats.txt
fi

# 3. Create visualizations
qpxc visualize plot box-intensity \
    --feature-path ./output/feature.parquet \
    --save-path ./qc/intensity_boxplot.svg

qpxc visualize plot peptide-distribution \
    --feature-path ./output/feature.parquet \
    --save-path ./qc/peptide_distribution.svg

echo "QC report generated in ./qc/"

Custom Thresholds

Define quality thresholds:

#!/bin/bash

# Generate statistics
qpxc stats analyze psm \
    --parquet-path ./output/psm.parquet \
    --save-path ./stats.txt

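# Check thresholds. The match is anchored at the start of the line so that
# prefixed labels such as "iBAQ Number of proteins" are never picked up.
proteins=$(grep "^Number of proteins" ./stats.txt | awk '{print $4}' | tr -d ',')
peptides=$(grep "^Number of peptides" ./stats.txt | awk '{print $4}' | tr -d ',')

# ${var:-0} treats a missing metric as 0 so the numeric test never sees an empty string
if [ "${proteins:-0}" -lt 1000 ]; then
    echo "WARNING: Low protein count (${proteins:-missing})"
fi

if [ "${peptides:-0}" -lt 5000 ]; then
    echo "WARNING: Low peptide count (${peptides:-missing})"
fi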
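
For automated pipelines it can help if the threshold check reports failure through its exit status, so a later step can stop on bad data. The following sketch reuses the same two thresholds as above. check_min and run_qc are hypothetical helper names, and the report layout is assumed to match the Output Format section.

```shell
#!/bin/bash
# check_min STATS_FILE LABEL MINIMUM
# Returns 0 if the metric meets the minimum, 1 if it is too low,
# and 2 if the metric is missing from the report.
check_min() {
    local value
    value=$(awk -F': *' -v label="$2" '
        $1 == label { gsub(/,/, "", $2); print $2; exit }
    ' "$1")
    if [ -z "$value" ]; then
        echo "ERROR: '$2' not found in $1" >&2
        return 2
    fi
    if [ "$value" -lt "$3" ]; then
        echo "WARNING: $2 is $value (minimum $3)"
        return 1
    fi
}

# run_qc STATS_FILE
# Apply every threshold and return non-zero if any check failed.
run_qc() {
    local status=0
    check_min "$1" "Number of proteins" 1000 || status=1
    check_min "$1" "Number of peptides" 5000 || status=1
    return "$status"
}

# Example values from the PSM Output Format section; both checks pass.
stats=$(mktemp)
cat > "$stats" <<'EOF'
Number of proteins: 1,823
Number of peptides: 8,642
EOF

if run_qc "$stats"; then
    echo "QC passed"
fi
rm -f "$stats"
```

Running every check before returning, rather than stopping at the first failure, means a single run lists all metrics that fall below their thresholds.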