Statistics Commands¶
Perform statistical analysis on QPX data.
Overview¶
The stats command group provides tools for generating comprehensive statistical summaries of QPX data files. These commands help assess data quality, completeness, and provide key metrics for experimental reports.
Available Commands¶
All statistics commands are accessed through the analyze subcommand:
- project-ae - Generate statistics for absolute expression data
- psm - Generate statistics for PSM data
project-ae¶
Generate comprehensive statistics for a project's absolute expression data.
Description¶
Analyzes both absolute expression (AE) and PSM data to generate a complete statistical summary. This command is useful for quality control and generating summary statistics for publications or reports.
Parameters¶
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| --absolute-path | FILE | Yes | - | Absolute expression file path |
| --parquet-path | FILE | Yes | - | PSM parquet file path |
| --save-path | FILE | No | - | Output statistics file path (e.g. stats.txt) |
Usage Examples¶
Print to Console¶
Generate project statistics and print them to the console by omitting the optional --save-path parameter:
qpxc stats analyze project-ae \
--absolute-path ./output/ae.parquet \
--parquet-path ./output/psm.parquet
Save to File¶
qpxc stats analyze project-ae \
--absolute-path ./output/ae.parquet \
--parquet-path ./output/psm.parquet \
--save-path ./reports/project_statistics.txt
Output Format¶
The command generates a text report with the following metrics:
Number of proteins: 2,547
Number of peptides: 12,384
Number of samples: 24
Number of peptidoforms: 15,921
Number of msruns: 24
iBAQ Number of proteins: 2,547
iBAQ Number of samples: 24
Metrics Explained¶
| Metric | Description |
|---|---|
| Number of proteins | Unique protein identifications |
| Number of peptides | Unique peptide sequences identified |
| Number of samples | Biological samples in the dataset |
| Number of peptidoforms | Unique modified peptide forms |
| Number of msruns | Total MS runs performed |
| iBAQ Number of proteins | Proteins with iBAQ quantification |
| iBAQ Number of samples | Samples with iBAQ data |
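As a rough completeness check, the iBAQ counts can be related to the overall counts in the same report. The sketch below assumes the report consists of plain `label: value` lines exactly as in the sample output above, and that the iBAQ and overall counts refer to comparable sets of proteins and samples. The coverage percentages are an illustration, not something the command itself reports.

```shell
# Sample report copied from the Output Format section; a real report may differ.
report=$(mktemp)
cat > "$report" <<'EOF'
Number of proteins: 2,547
Number of peptides: 12,384
Number of samples: 24
Number of peptidoforms: 15,921
Number of msruns: 24
iBAQ Number of proteins: 2,547
iBAQ Number of samples: 24
EOF

# Index every value by its exact label, so "Number of proteins" and
# "iBAQ Number of proteins" stay distinct, then print coverage percentages.
coverage=$(awk -F': ' '
    { gsub(",", "", $2); v[$1] = $2 + 0 }   # drop thousands separators
    END {
        if (v["Number of proteins"] > 0)
            printf "iBAQ protein coverage: %.1f%%\n", 100 * v["iBAQ Number of proteins"] / v["Number of proteins"]
        if (v["Number of samples"] > 0)
            printf "iBAQ sample coverage: %.1f%%\n", 100 * v["iBAQ Number of samples"] / v["Number of samples"]
    }' "$report")

printf '%s\n' "$coverage"
rm -f "$report"
```

Matching on the exact label matters here: a plain `grep "Number of proteins"` would return both the overall line and the iBAQ line of this report.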
Use Cases¶
- Quality Control: Verify expected number of identifications
- Publication Reporting: Generate summary statistics for methods sections
- Data Completeness: Assess coverage across samples
- Comparative Analysis: Compare statistics across different processing pipelines
Best Practices¶
- Run statistics after data processing to verify completeness
- Compare statistics with expected values based on sample type and instrument
- Use for QC before downstream analysis
- Include in supplementary materials for publications
psm¶
Generate statistics for PSM (Peptide-Spectrum Match) data.
Description¶
Analyzes PSM data to generate detailed statistics about identifications, including protein, peptide, and PSM counts.
Parameters¶
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| --parquet-path | FILE | Yes | - | PSM parquet file path |
| --save-path | FILE | No | - | Output statistics file path (e.g. stats.txt) |
Usage Examples¶
Print to Console¶
Generate PSM statistics and print them to the console by omitting the optional --save-path parameter:
qpxc stats analyze psm \
--parquet-path ./output/psm.parquet
Save to File¶
qpxc stats analyze psm \
--parquet-path ./output/psm.parquet \
--save-path ./reports/psm_statistics.txt
Output Format¶
Number of proteins: 1,823
Number of peptides: 8,642
Number of peptidoforms: 11,205
Number of PSMs: 45,892
Number of msruns: 12
Metrics Explained¶
| Metric | Description |
|---|---|
| Number of proteins | Unique proteins with at least one PSM |
| Number of peptides | Unique peptide sequences (without modifications) |
| Number of peptidoforms | Unique peptide sequences with modifications |
| Number of PSMs | Total peptide-spectrum matches |
| Number of msruns | Number of MS runs contributing data |
Understanding the Metrics¶
Peptide vs Peptidoform¶
- Peptide: Amino acid sequence (e.g., PEPTIDE)
- Peptidoform: Peptide + modifications (e.g., PEPTIDE[+16])
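The distinction can be illustrated with a small shell sketch. It assumes modifications are written in square brackets, as in the example above. The peptidoform list is made up for illustration, and this is not necessarily how the command counts internally.

```shell
# Hypothetical peptidoform list using the bracket notation shown above.
peptidoforms='PEPTIDE
PEPTIDE[+16]
PEPT[+80]IDE
ANOTHERK'

# Remove every [...] group to recover the bare amino acid sequence.
strip_mods() {
    sed 's/\[[^]]*]//g'
}

# Unique peptidoforms: distinct entries as written.
n_peptidoforms=$(printf '%s\n' "$peptidoforms" | sort -u | wc -l | tr -d ' ')
# Unique peptides: distinct entries after stripping modifications.
n_peptides=$(printf '%s\n' "$peptidoforms" | strip_mods | sort -u | wc -l | tr -d ' ')

echo "peptidoforms=$n_peptidoforms peptides=$n_peptides"
```

Here three of the four peptidoforms collapse to the same peptide, which is why the peptidoform count is always at least as large as the peptide count.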
Expected Ratios¶
Typical ratios for good-quality data:
- PSMs per peptide: 2-5 (varies with replicates)
- Peptidoforms per peptide: 1-3 (depends on PTM analysis)
- Peptides per protein: 3-20 (depends on protein abundance and coverage)
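These ratios can be computed directly from a saved report. The sketch below is a minimal example that assumes the `label: value` layout of the sample PSM output shown earlier. The file is a temporary copy of that sample, not real data.

```shell
# Sample PSM report copied from the Output Format section.
report=$(mktemp)
cat > "$report" <<'EOF'
Number of proteins: 1,823
Number of peptides: 8,642
Number of peptidoforms: 11,205
Number of PSMs: 45,892
Number of msruns: 12
EOF

# Read all metrics into a map keyed by label, then print the three ratios.
ratios=$(awk -F': ' '
    { gsub(",", "", $2); v[$1] = $2 + 0 }   # drop thousands separators
    END {
        # Stop if a denominator is missing, to avoid dividing by zero.
        if (v["Number of peptides"] == 0 || v["Number of proteins"] == 0) exit 1
        printf "PSMs per peptide: %.2f\n", v["Number of PSMs"] / v["Number of peptides"]
        printf "Peptidoforms per peptide: %.2f\n", v["Number of peptidoforms"] / v["Number of peptides"]
        printf "Peptides per protein: %.2f\n", v["Number of peptides"] / v["Number of proteins"]
    }' "$report")

printf '%s\n' "$ratios"
rm -f "$report"
```

For the sample report this gives about 5.31 PSMs per peptide, 1.30 peptidoforms per peptide, and 4.74 peptides per protein.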
Use Cases¶
- Quality Assessment: Verify identification rates
- Method Optimization: Compare different search parameters
- Replication Analysis: Assess consistency across runs
- FDR Validation: Ensure sufficient identifications after filtering
Best Practices¶
- Compare PSM counts before and after FDR filtering
- Monitor peptidoform counts to assess modification analysis quality
- Track msrun numbers to verify all files processed correctly
- Use as input for sample size calculations in future experiments
Statistical Interpretation Guide¶
Data Quality Indicators¶
Good Quality Signs¶
- Consistent protein counts across replicates
- Expected PSM-to-peptide ratios
- Complete data across all msruns
- Reasonable peptidoform diversity
Potential Issues¶
- Very low PSM counts: Search parameter issues, poor sample quality
- High peptidoform-to-peptide ratio: Over-prediction of modifications
- Missing msruns: File processing errors
- Extreme variation: Batch effects, contamination
Comparative Analysis¶
When comparing datasets:
- Normalize by sample amount: Account for loading differences
- Consider instrument type: Different platforms yield different numbers
- Account for search space: Database size affects identification rates
- Match FDR thresholds: Ensure fair comparison
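Once the conditions above are matched, two saved reports can be put side by side. The following sketch assumes both files use the `label: value` layout of the sample outputs. The file names and the pipeline B numbers are hypothetical.

```shell
workdir=$(mktemp -d)

# Two hypothetical reports for the same data processed by different pipelines.
cat > "$workdir/pipeline_a.txt" <<'EOF'
Number of proteins: 1,823
Number of peptides: 8,642
Number of PSMs: 45,892
EOF
cat > "$workdir/pipeline_b.txt" <<'EOF'
Number of proteins: 1,790
Number of peptides: 8,901
Number of PSMs: 44,120
EOF

# Print: metric, value in first report, value in second report, signed difference.
compare_reports() {
    awk -F': ' '
        { gsub(",", "", $2) }                 # drop thousands separators
        NR == FNR { a[$1] = $2; next }        # first file: remember values
        ($1 in a) { printf "%s\t%d\t%d\t%+d\n", $1, a[$1], $2, $2 - a[$1] }
    ' "$1" "$2"
}

diff_table=$(compare_reports "$workdir/pipeline_a.txt" "$workdir/pipeline_b.txt")
printf '%s\n' "$diff_table"
rm -rf "$workdir"
```

Only metrics present in both reports are printed, so a PSM report and a project-ae report can be compared on their shared lines.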
Reporting Guidelines¶
For publications, report:
- Total unique proteins, peptides, and PSMs
- Number of biological replicates
- Number of technical replicates (msruns)
- FDR thresholds applied
- Protein and peptide identification rates
Automation and Integration¶
Batch Processing¶
Process multiple files:
#!/bin/bash
for file in ./output/*.psm.parquet; do
    filename=$(basename "$file" .parquet)
    qpxc stats analyze psm \
        --parquet-path "$file" \
        --save-path "./reports/${filename}_stats.txt"
done
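After a batch run, the per-file reports can be merged into one table for easier comparison. This sketch assumes reports named `<name>_stats.txt`, as produced by the loop above, and the `label: value` layout of the sample output. The two report files and their numbers are placeholders created in a temporary directory.

```shell
# Stand-in for ./reports with two hypothetical per-file reports.
reports=$(mktemp -d)
printf 'Number of proteins: 1,823\nNumber of PSMs: 45,892\n' > "$reports/run1.psm_stats.txt"
printf 'Number of proteins: 1,650\nNumber of PSMs: 39,004\n' > "$reports/run2.psm_stats.txt"

# Long-format summary: one row per (file, metric) pair.
summary="$reports/summary.tsv"
printf 'file\tmetric\tvalue\n' > "$summary"
for f in "$reports"/*_stats.txt; do
    awk -F': ' -v name="$(basename "$f" _stats.txt)" \
        '{ gsub(",", "", $2); printf "%s\t%s\t%s\n", name, $1, $2 }' "$f" >> "$summary"
done

cat "$summary"
```

The long format (one metric per row) loads directly into a spreadsheet or a plotting tool, and it stays valid when some reports contain metrics that others lack.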
Integration with Reports¶
Use output in automated reports:
qpxc stats analyze psm \
--parquet-path ./output/psm.parquet \
--save-path ./reports/stats.txt
# Extract specific metrics
grep "Number of proteins" ./reports/stats.txt >> ./summary_report.md
Combining with Visualization¶
Generate both statistics and plots:
# Generate statistics
qpxc stats analyze psm \
--parquet-path ./output/psm.parquet \
--save-path ./reports/stats.txt
# Generate visualizations
qpxc visualize plot peptide-distribution \
--feature-path ./output/feature.parquet \
--save-path ./plots/peptide_dist.svg
Advanced Usage¶
Quality Control Workflow¶
Complete QC workflow example:
#!/bin/bash
# 1. Generate PSM statistics
qpxc stats analyze psm \
--parquet-path ./output/psm.parquet \
--save-path ./qc/psm_stats.txt
# 2. Generate AE statistics (if available)
if [ -f ./output/ae.parquet ]; then
    qpxc stats analyze project-ae \
        --absolute-path ./output/ae.parquet \
        --parquet-path ./output/psm.parquet \
        --save-path ./qc/ae_stats.txt
fi
# 3. Create visualizations
qpxc visualize plot box-intensity \
--feature-path ./output/feature.parquet \
--save-path ./qc/intensity_boxplot.svg
qpxc visualize plot peptide-distribution \
--feature-path ./output/feature.parquet \
--save-path ./qc/peptide_distribution.svg
echo "QC report generated in ./qc/"
Custom Thresholds¶
Define quality thresholds:
#!/bin/bash
# Generate statistics
qpxc stats analyze psm \
--parquet-path ./output/psm.parquet \
--save-path ./stats.txt
# Check thresholds
proteins=$(grep "Number of proteins" ./stats.txt | awk '{print $4}' | tr -d ',')
peptides=$(grep "Number of peptides" ./stats.txt | awk '{print $4}' | tr -d ',')
if [ "$proteins" -lt 1000 ]; then
    echo "WARNING: Low protein count ($proteins)"
fi
if [ "$peptides" -lt 5000 ]; then
    echo "WARNING: Low peptide count ($peptides)"
fi
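The same check can be wrapped in small helper functions that match the metric label exactly and report a missing metric instead of failing inside the numeric comparison. This is a sketch under the assumption that the report uses the `label: value` layout of the sample output. The helper names `get_metric` and `check_min` and the PSM threshold of 50000 are hypothetical.

```shell
# Temporary copy of part of the sample PSM report.
stats=$(mktemp)
cat > "$stats" <<'EOF'
Number of proteins: 1,823
Number of peptides: 8,642
Number of PSMs: 45,892
EOF

# get_metric FILE LABEL: print the value for an exact label, without commas.
get_metric() {
    awk -F': ' -v key="$2" '$1 == key { gsub(",", "", $2); print $2; exit }' "$1"
}

# check_min FILE LABEL MINIMUM: return non-zero if the metric is missing or too low.
check_min() {
    value=$(get_metric "$1" "$2")
    if [ -z "$value" ]; then
        echo "ERROR: metric '$2' not found in $1"
        return 2
    fi
    if [ "$value" -lt "$3" ]; then
        echo "WARNING: Low value for '$2' ($value < $3)"
        return 1
    fi
    echo "OK: $2 = $value"
}

status=0
check_min "$stats" "Number of proteins" 1000 || status=1
check_min "$stats" "Number of peptides" 5000 || status=1
check_min "$stats" "Number of PSMs" 50000 || status=1   # hypothetical threshold
echo "QC status: $status"
```

Because the label is compared exactly, the helper also works on project-ae reports, where a plain `grep "Number of proteins"` would match the iBAQ line as well. The final status can be used as the script's exit code in an automated pipeline.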
Related Commands¶
- Convert Commands - Generate data files for analysis
- Transform Commands - Process data before statistics
- Visualization Commands - Create visual representations of statistics