Visualization Guide
Create publication-quality visualizations from QPX datasets using the Python API.
Overview
QPX exposes its plotting functions through dataset views. Each view has a .plot() method, and the examples in this guide use its return value as a matplotlib figure. Because all plots are created with matplotlib, they can be saved in vector formats (SVG, PDF) for high-resolution output.
Getting Started
Load a QPX dataset and create visualizations:
import qpx
# Load dataset
ds = qpx.Dataset("path/to/dataset/")
# View-based plotting
ds.identifications.plot() # Identification summary plot
ds.runs.plot() # Run-level QC plot
ds.modifications.plot() # Modification distribution plot
ds.qc.plot() # Quality control dashboard
Available Visualizations
Identification Summary
Visualize protein, peptide, and PSM identifications:
import qpx
ds = qpx.Dataset("./output/")
# Generate identification summary plot
fig = ds.identifications.plot()
# Save to file
fig.savefig("./plots/identifications.svg", format='svg', bbox_inches='tight')
Output:
- Bar plot showing counts of proteins, peptides, and PSMs
- Stacked bars for different identification levels
- Useful for quick assessment of dataset size
Interpretation:
- High protein counts: Good depth of coverage
- Low peptide-to-protein ratio: May indicate poor fragmentation or search issues
- High PSM-to-peptide ratio: Peptides are identified repeatedly, which often reflects good reproducibility across runs
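The ratios mentioned in the interpretation notes can be computed directly from a PSM table. The following is a minimal sketch on a made-up DataFrame that stands in for ds.psm.data; the column names sequence and protein_accessions are the ones used elsewhere in this guide, while the helper identification_ratios is hypothetical and not part of QPX.

```python
import pandas as pd


def identification_ratios(psm_df: pd.DataFrame) -> dict:
    """Return identification counts and the two ratios for a PSM table."""
    n_psms = len(psm_df)
    n_peptides = psm_df["sequence"].nunique()
    n_proteins = psm_df["protein_accessions"].nunique()
    return {
        "proteins": n_proteins,
        "peptides": n_peptides,
        "psms": n_psms,
        # Guard against empty tables to avoid division by zero
        "peptides_per_protein": n_peptides / n_proteins if n_proteins else float("nan"),
        "psms_per_peptide": n_psms / n_peptides if n_peptides else float("nan"),
    }


# Toy PSM table standing in for ds.psm.data (values are made up)
toy_psms = pd.DataFrame({
    "sequence": ["PEPTIDEK", "PEPTIDEK", "ELVISK", "LIVESK", "LIVESK", "LIVESK"],
    "protein_accessions": ["P1", "P1", "P1", "P2", "P2", "P2"],
})

ratios = identification_ratios(toy_psms)
print(ratios)
```

What counts as a "low" or "high" ratio depends on the instrument, sample type, and search settings, so these numbers are best compared between similar datasets rather than against a fixed cutoff.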
Run Summary
Visualize statistics across MS runs:
import qpx
ds = qpx.Dataset("./output/")
# Generate run-level QC plot
fig = ds.runs.plot()
# Save to file
fig.savefig("./plots/run_summary.svg", format='svg', bbox_inches='tight')
Output:
- Multiple panels showing run-level metrics
- PSM counts per run
- Peptide and protein identifications per run
- Useful for identifying problematic runs
Interpretation:
- Consistent bars: Good technical reproducibility
- Outlier runs: May indicate instrument issues or sample problems
- Declining counts: Possible column degradation
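Outlier runs can also be flagged numerically. The sketch below applies a modified z-score based on the median absolute deviation (MAD) to PSM counts per run; it is one possible approach, not something the QPX views are known to do. The toy table, the helper flag_outlier_runs, and the cutoff of 3.5 (a commonly used value for modified z-scores) are assumptions for illustration.

```python
import pandas as pd


def flag_outlier_runs(run_counts: pd.Series, threshold: float = 3.5) -> pd.Series:
    """Return a boolean Series marking runs whose count is a robust outlier."""
    median = run_counts.median()
    mad = (run_counts - median).abs().median()
    if mad == 0:
        # No robust spread to compare against, so nothing is flagged
        return pd.Series(False, index=run_counts.index)
    modified_z = 0.6745 * (run_counts - median) / mad
    return modified_z.abs() > threshold


# Toy PSM table standing in for ds.psm.data (run names and counts are made up)
toy_psms = pd.DataFrame({
    "run_file_name": (["run_a"] * 100 + ["run_b"] * 104 + ["run_c"] * 98
                      + ["run_d"] * 30 + ["run_e"] * 101)
})

run_counts = toy_psms.groupby("run_file_name").size()
outliers = flag_outlier_runs(run_counts)
print(run_counts[outliers])
```

The median and MAD are used instead of the mean and standard deviation so that a single bad run does not distort the reference it is compared against.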
Modification Distribution
Visualize post-translational modifications:
import qpx
ds = qpx.Dataset("./output/")
# Generate modification distribution plot
fig = ds.modifications.plot()
# Save to file
fig.savefig("./plots/modifications.svg", format='svg', bbox_inches='tight')
Output:
- Bar plot showing frequency of different modifications
- Grouped by modification type
- Useful for PTM analysis validation
Interpretation:
- Expected modifications: Oxidation, carbamidomethylation, etc.
- Unexpected modifications: May indicate search parameter issues
- Modification frequency: Reflects biological state and search sensitivity
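To check for unexpected modifications outside the plot, the frequencies can be tabulated and compared against the modifications configured in the search. This sketch assumes a modifications column holding one plain-text label per PSM, which may not match the real column layout; the toy data and the expected set are made up.

```python
import pandas as pd

# Toy PSM table standing in for ds.psm.data (labels are made up)
toy_psms = pd.DataFrame({
    "modifications": ["Oxidation", "Carbamidomethyl", "Carbamidomethyl", None,
                      "Carbamidomethyl", "Oxidation", None, "Phospho"]
})

# Share of PSMs per modification label, in percent
mod_share = (
    toy_psms["modifications"]
    .fillna("Unmodified")
    .value_counts(normalize=True)
    * 100
)

# Labels that were not part of the (assumed) search configuration
expected = {"Oxidation", "Carbamidomethyl", "Unmodified"}
unexpected = [mod for mod in mod_share.index if mod not in expected]

print(mod_share.round(1))
print("Unexpected:", unexpected)
```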
Quality Control Dashboard
Comprehensive QC visualization:
import qpx
ds = qpx.Dataset("./output/")
# Generate QC dashboard
fig = ds.qc.plot()
# Save to file
fig.savefig("./plots/qc_dashboard.svg", format='svg', bbox_inches='tight')
Output:
- Multi-panel dashboard with key QC metrics
- Intensity distributions
- Missing value patterns
- Identification rates
- Run-to-run consistency
Interpretation:
- Aligned intensity distributions: Good normalization
- Low missing values: Complete quantification
- Consistent identification rates: Technical reproducibility
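Run-to-run consistency can be summarized as pairwise correlations of log intensities. The sketch below uses simulated data in place of ds.feature.data and assumes intensity columns prefixed with sample_, as in the other examples of this guide; it does not reproduce what the built-in dashboard computes.

```python
import numpy as np
import pandas as pd

# Simulated feature table standing in for ds.feature.data (not real data)
rng = np.random.default_rng(0)
shared_signal = rng.normal(loc=6.0, scale=1.0, size=200)
toy_features = pd.DataFrame({
    f"sample_{i}": 10 ** (shared_signal + rng.normal(scale=0.1, size=200))
    for i in range(1, 4)
})

intensity_cols = [c for c in toy_features.columns if c.startswith("sample_")]
values = toy_features[intensity_cols]

# Non-positive intensities become NaN before the log transform
log_int = np.log10(values.where(values > 0))

# Pearson correlation; pandas ignores NaN pairwise
corr = log_int.corr()

# Mask the diagonal to find the least similar pair of samples
off_diagonal = corr.where(~np.eye(len(corr), dtype=bool))
lowest_pairwise_r = off_diagonal.min().min()

print(corr.round(3))
print(f"Lowest pairwise correlation: {lowest_pairwise_r:.3f}")
```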
Intensity Distribution Plots
Box Plot
Visualize intensity distributions across samples:
import qpx
import matplotlib.pyplot as plt
ds = qpx.Dataset("./output/")
# Get intensity columns
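intensity_cols = [col for col in ds.feature.data.columns
                  if col.startswith('sample_')]
# Create box plot
fig, ax = plt.subplots(figsize=(12, 6))
# Keep positive intensities only, then log10-transform each column
# (pandas resolves the string 'log10' to the NumPy function of that name)
ds.feature.data[intensity_cols].apply(lambda x: x[x > 0].apply('log10')).boxplot(ax=ax)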
ax.set_xlabel('Sample')
ax.set_ylabel('log10(Intensity)')
ax.set_title('Intensity Distribution Across Samples')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
# Save
fig.savefig("./plots/intensity_boxplot.svg", format='svg', bbox_inches='tight')
Output:
- Box plots for each sample showing intensity distribution
- Log-transformed intensities for better visualization
- Box shows interquartile range (IQR)
- Whiskers extend to the most extreme data points within 1.5×IQR of the box edges (matplotlib default)
- Outliers shown as individual points
Interpretation:
- Aligned medians: Good normalization
- Similar IQR: Consistent quantification across samples
- Many outliers: May indicate contamination or technical issues
- Different ranges: Batch effects or loading differences
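Whether the medians are aligned can be checked numerically as well as visually. The following sketch computes each sample's median log10 intensity and its offset from the median across samples; the toy table and the helper median_offsets are hypothetical, and the sample_ prefix follows the convention used in this guide.

```python
import numpy as np
import pandas as pd


def median_offsets(feature_df: pd.DataFrame, prefix: str = "sample_") -> pd.Series:
    """Offset of each sample's median log10 intensity from the overall median."""
    cols = [c for c in feature_df.columns if c.startswith(prefix)]
    values = feature_df[cols]
    # Non-positive intensities become NaN and are skipped by median()
    log_int = np.log10(values.where(values > 0))
    medians = log_int.median()
    return medians - medians.median()


# Toy feature table standing in for ds.feature.data (values are made up)
toy_features = pd.DataFrame({
    "peptide": ["A", "B", "C"],
    "sample_1": [1e5, 1e6, 1e7],
    "sample_2": [1e5, 1e6, 1e7],
    "sample_3": [1e6, 1e7, 1e8],
})

offsets = median_offsets(toy_features)
print(offsets)
```

In this toy table sample_3 sits one log10 unit (a factor of ten) above the others, which is the kind of shift that would point to a loading difference or missing normalization.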
KDE (Kernel Density Estimation)
Plot smooth density distributions:
import qpx
import matplotlib.pyplot as plt
import numpy as np
ds = qpx.Dataset("./output/")
# Get intensity columns (limit to first 10 samples for readability)
intensity_cols = [col for col in ds.feature.data.columns
                  if col.startswith('sample_')][:10]
# Create KDE plot (pandas needs SciPy installed for .plot.kde)
fig, ax = plt.subplots(figsize=(10, 6))
for col in intensity_cols:
    intensities = ds.feature.data[col].dropna()
    positive = intensities[intensities > 0]
    # A density estimate needs at least two positive values
    if len(positive) > 1:
        log_intensities = np.log10(positive)
        log_intensities.plot.kde(ax=ax, label=col)
ax.set_xlabel('log10(Intensity)')
ax.set_ylabel('Density')
ax.set_title('Intensity Distribution (KDE)')
ax.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
# Save
fig.savefig("./plots/intensity_kde.svg", format='svg', bbox_inches='tight')
Output:
- Overlaid kernel density curves for each sample
- Smooth representation of intensity distributions
- Legend shows sample identifiers
Interpretation:
- Overlapping curves: Good sample-to-sample consistency
- Shifted curves: Potential batch effects or normalization issues
- Different shapes: Sample-specific technical issues
- Bimodal distributions: Distinct protein abundance classes
Peptide Distribution Plots
Peptides per Protein
Visualize peptide coverage across proteins:
import qpx
import matplotlib.pyplot as plt
ds = qpx.Dataset("./output/")
# Count peptides per protein
peptides_per_protein = ds.psm.data.groupby('protein_accessions')['sequence'].nunique()
top_proteins = peptides_per_protein.nlargest(20)
# Create bar plot
fig, ax = plt.subplots(figsize=(12, 6))
top_proteins.plot.bar(ax=ax)
ax.set_xlabel('Protein')
ax.set_ylabel('Number of Peptides')
ax.set_title('Peptide Distribution Across Top 20 Proteins')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
# Save
fig.savefig("./plots/peptides_per_protein.svg", format='svg', bbox_inches='tight')
Output:
- Bar plot showing peptide counts per protein
- Top 20 proteins by peptide count
- X-axis: Protein identifiers
- Y-axis: Number of unique peptides
Interpretation:
- High peptide counts: Abundant proteins with good coverage
- Single peptide proteins: May be less confident identifications
- Distribution shape: Reflects proteome complexity
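The share of single-peptide proteins is a quick numeric companion to this plot. The sketch below repeats the same groupby on a made-up PSM table that stands in for ds.psm.data and assumes one accession string per row in protein_accessions.

```python
import pandas as pd

# Toy PSM table standing in for ds.psm.data (values are made up)
toy_psms = pd.DataFrame({
    "protein_accessions": ["P1", "P1", "P1", "P2", "P3", "P3"],
    "sequence": ["AAK", "CCK", "AAK", "DDK", "EEK", "FFK"],
})

# Unique peptide sequences per protein
peptides_per_protein = toy_psms.groupby("protein_accessions")["sequence"].nunique()

# Fraction of proteins supported by exactly one peptide
single_peptide_fraction = (peptides_per_protein == 1).mean()

print(peptides_per_protein)
print(f"Single-peptide proteins: {single_peptide_fraction:.1%}")
```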
Peptides by Condition
Compare peptide identifications across experimental conditions:
import qpx
import matplotlib.pyplot as plt
ds = qpx.Dataset("./output/")
# Assuming 'condition' column exists in PSM data or can be derived from sample metadata
# This is a simplified example - adapt based on your metadata structure
if 'condition' in ds.psm.data.columns:
    peptides_by_condition = ds.psm.data.groupby('condition')['sequence'].nunique()
    # Create bar plot
    fig, ax = plt.subplots(figsize=(10, 6))
    peptides_by_condition.plot.bar(ax=ax)
    ax.set_xlabel('Condition')
    ax.set_ylabel('Number of Peptides')
    ax.set_title('Peptides Identified by Condition')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    # Save
    fig.savefig("./plots/peptides_by_condition.svg", format='svg', bbox_inches='tight')
Interpretation:
- High variation: May indicate batch effects or quality issues
- Low counts in specific conditions: May suggest technical problems
- Consistent counts: Good data quality and reproducibility
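When the PSM table has no condition column, one way to derive it is to join the PSMs to a sample sheet. This is a sketch under assumptions: the sample_sheet table and its layout are hypothetical, and run_file_name is used as the join key only because it appears in other examples of this guide.

```python
import pandas as pd

# Toy PSM table standing in for ds.psm.data (values are made up)
toy_psms = pd.DataFrame({
    "run_file_name": ["run_a", "run_a", "run_b", "run_b", "run_c", "run_d"],
    "sequence": ["AAK", "CCK", "AAK", "DDK", "AAK", "EEK"],
})

# Hypothetical sample sheet mapping each run to a condition
sample_sheet = pd.DataFrame({
    "run_file_name": ["run_a", "run_b", "run_c"],
    "condition": ["control", "control", "treated"],
})

# Left join keeps every PSM; validate guards against duplicate runs in the sheet
annotated = toy_psms.merge(
    sample_sheet, on="run_file_name", how="left", validate="many_to_one"
)

# PSMs from runs missing in the sample sheet end up without a condition
unmatched = int(annotated["condition"].isna().sum())

# groupby drops rows without a condition by default
peptides_by_condition = annotated.groupby("condition")["sequence"].nunique()

print(peptides_by_condition)
print(f"PSMs without a condition: {unmatched}")
```

Checking the unmatched count before plotting helps catch runs that would otherwise silently disappear from the comparison.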
iBAQ Distribution
Visualize absolute protein abundance (iBAQ values):
import qpx
import matplotlib.pyplot as plt
import numpy as np
ds = qpx.Dataset("./output/")
# Assuming iBAQ columns exist in protein group data
ibaq_cols = [col for col in ds.pg.data.columns
             if 'ibaq' in col.lower() and col.startswith('sample_')]
if ibaq_cols:
    # Plot first sample as example
    sample_col = ibaq_cols[0]
    ibaq_values = ds.pg.data[sample_col].dropna()
    log_ibaq = np.log10(ibaq_values[ibaq_values > 0])
    # A density estimate needs at least two positive values
    if len(log_ibaq) > 1:
        # Create histogram with KDE; density=True puts both on the same scale
        fig, ax = plt.subplots(figsize=(10, 6))
        log_ibaq.plot.hist(bins=50, density=True, alpha=0.6, ax=ax, label='Histogram')
        log_ibaq.plot.kde(ax=ax, label='KDE', linewidth=2)
        ax.set_xlabel('log10(iBAQ Intensity)')
        ax.set_ylabel('Density')
        ax.set_title(f'iBAQ Distribution - {sample_col}')
        ax.legend()
        plt.tight_layout()
        # Save
        fig.savefig("./plots/ibaq_distribution.svg", format='svg', bbox_inches='tight')
Output:
- Histogram + kernel density estimate of iBAQ values
- Log-transformed for better visualization
- X-axis: log10(iBAQ intensity)
- Y-axis: Density (the histogram is normalized so it shares a scale with the KDE)
Interpretation:
- Bimodal distribution: Distinct protein abundance classes
- Long tail: High-abundance proteins (housekeeping, structural)
- Narrow range: Limited dynamic range, possible detection issues
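The dynamic range can be expressed as the span of log10 values between two quantiles. The helper below is a hypothetical sketch; using the 1st and 99th percentiles by default is an arbitrary choice meant to reduce the influence of extreme values, and the toy Series stands in for one iBAQ column of ds.pg.data.

```python
import numpy as np
import pandas as pd


def dynamic_range_log10(values: pd.Series,
                        lower_q: float = 0.01,
                        upper_q: float = 0.99) -> float:
    """Span of log10 values between two quantiles, in orders of magnitude."""
    # NaN and non-positive values fail the comparison and are dropped
    positive = values[values > 0]
    if positive.empty:
        return float("nan")
    log_values = np.log10(positive)
    return float(log_values.quantile(upper_q) - log_values.quantile(lower_q))


# Toy iBAQ values (made up), including a zero and a missing value
toy_ibaq = pd.Series([1e3, 1e4, 1e5, 1e6, 1e7, 0.0, np.nan])

full_range = dynamic_range_log10(toy_ibaq, lower_q=0.0, upper_q=1.0)
trimmed_range = dynamic_range_log10(toy_ibaq)

print(f"Full range: {full_range:.2f} orders of magnitude")
print(f"1st to 99th percentile: {trimmed_range:.2f} orders of magnitude")
```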
Missing Value Patterns
Visualize missing data patterns:
import qpx
import matplotlib.pyplot as plt
ds = qpx.Dataset("./output/")
# Get intensity columns
intensity_cols = [col for col in ds.feature.data.columns
                  if col.startswith('sample_')]
# Calculate missing value percentages
missing_pct = ds.feature.data[intensity_cols].isna().mean() * 100
# Create bar plot
fig, ax = plt.subplots(figsize=(12, 6))
missing_pct.plot.bar(ax=ax, color='coral')
ax.set_xlabel('Sample')
ax.set_ylabel('Missing Values (%)')
ax.set_title('Missing Value Percentage by Sample')
ax.axhline(y=30, color='r', linestyle='--', label='30% threshold')
plt.xticks(rotation=45, ha='right')
plt.legend()
plt.tight_layout()
# Save
fig.savefig("./plots/missing_values.svg", format='svg', bbox_inches='tight')
Interpretation:
- Low missing values (<20%): Good data quality
- Moderate (20-40%): Acceptable for most analyses
- High (>40%): May require imputation or filtering
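The bands above can be applied to each sample programmatically. This sketch uses a made-up feature table in place of ds.feature.data; the helper classify_missing is hypothetical and simply encodes the 20% and 40% boundaries from the list.

```python
import numpy as np
import pandas as pd


def classify_missing(pct: float) -> str:
    """Map a missing-value percentage to the bands used in this guide."""
    if pct < 20:
        return "low"
    if pct <= 40:
        return "moderate"
    return "high"


# Toy feature table standing in for ds.feature.data (values are made up)
nan = np.nan
toy_features = pd.DataFrame({
    "sample_1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    "sample_2": [1.0, nan, 3.0, nan, 5.0, nan, 7.0, 8.0, 9.0, 10.0],
    "sample_3": [nan, nan, 3.0, nan, 5.0, nan, 7.0, nan, 9.0, 10.0],
})

intensity_cols = [c for c in toy_features.columns if c.startswith("sample_")]
missing_pct = toy_features[intensity_cols].isna().mean() * 100
quality = missing_pct.map(classify_missing)

print(pd.DataFrame({"missing_pct": missing_pct.round(1), "band": quality}))
```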
Customization and Styling
Publication-Quality Plots
Enhance plots for publication:
import qpx
import matplotlib.pyplot as plt
# Set publication style
plt.style.use('seaborn-v0_8-paper')
plt.rcParams['figure.dpi'] = 300
plt.rcParams['font.size'] = 12
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['axes.titlesize'] = 16
ds = qpx.Dataset("./output/")
# Create plot with custom styling
fig = ds.identifications.plot()
# Save in multiple formats
fig.savefig("./plots/identifications.svg", format='svg', bbox_inches='tight', dpi=300)
fig.savefig("./plots/identifications.pdf", format='pdf', bbox_inches='tight', dpi=300)
fig.savefig("./plots/identifications.png", format='png', bbox_inches='tight', dpi=300)
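Setting plt.rcParams changes the style for every later plot in the session. If the publication settings should apply to one figure only, matplotlib's rc_context can scope them. The sketch below uses a plain bar chart with made-up counts instead of a QPX view, and writes to an in-memory buffer rather than a file.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

publication_rc = {
    "font.size": 12,
    "axes.labelsize": 14,
    "axes.titlesize": 16,
    "savefig.dpi": 300,  # resolution used by savefig for raster output
}

font_size_before = plt.rcParams["font.size"]
buffer = io.BytesIO()

# Settings apply only inside the with-block and are restored afterwards
with plt.rc_context(publication_rc):
    fig, ax = plt.subplots(figsize=(7, 4))
    ax.bar(["Proteins", "Peptides", "PSMs"], [120, 950, 4300])  # made-up counts
    ax.set_ylabel("Count")
    ax.set_title("Identifications")
    font_size_inside = plt.rcParams["font.size"]
    # Save inside the block so that savefig.dpi takes effect
    fig.savefig(buffer, format="png", bbox_inches="tight")

plt.close(fig)
font_size_after = plt.rcParams["font.size"]
print(font_size_before, font_size_inside, font_size_after)
```

Text properties are fixed when the labels are created, so both the plotting calls and the save call belong inside the with-block.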
Multi-Panel Figures
Create comprehensive visualization panels:
import qpx
import matplotlib.pyplot as plt
ds = qpx.Dataset("./output/")
# Create multi-panel figure
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
# Panel 1: Identifications
ax1 = axes[0, 0]
stats = {
    'Proteins': ds.psm.data['protein_accessions'].nunique(),
    'Peptides': ds.psm.data['sequence'].nunique(),
    'PSMs': ds.psm.count()
}
# Convert dict views to lists so the call also works on older matplotlib versions
ax1.bar(list(stats.keys()), list(stats.values()))
ax1.set_ylabel('Count')
ax1.set_title('A. Identifications')
# Panel 2: Runs
ax2 = axes[0, 1]
run_psms = ds.psm.data.groupby('run_file_name').size()
ax2.bar(range(len(run_psms)), run_psms.values)
ax2.set_xlabel('Run')
ax2.set_ylabel('PSM Count')
ax2.set_title('B. PSMs per Run')
# Panel 3: Modifications
ax3 = axes[1, 0]
if 'modifications' in ds.psm.data.columns:
    mod_counts = ds.psm.data['modifications'].value_counts().head(10)
    mod_counts.plot.barh(ax=ax3)
ax3.set_xlabel('Count')
ax3.set_title('C. Top 10 Modifications')
# Panel 4: Intensity distribution (if feature data available)
ax4 = axes[1, 1]
if hasattr(ds, 'feature') and ds.feature.count() > 0:
    intensity_cols = [col for col in ds.feature.data.columns
                      if col.startswith('sample_')][:5]
    for col in intensity_cols:
        intensities = ds.feature.data[col].dropna()
        positive = intensities[intensities > 0]
        # A density estimate needs at least two positive values
        if len(positive) > 1:
            positive.apply('log10').plot.kde(ax=ax4, label=col)
    ax4.set_xlabel('log10(Intensity)')
    ax4.set_ylabel('Density')
    ax4.set_title('D. Intensity Distribution')
    ax4.legend()
plt.tight_layout()
fig.savefig("./plots/comprehensive_qc.svg", format='svg', bbox_inches='tight', dpi=300)
General Plotting Tips
Output Formats
- SVG: Recommended for publications (scalable, editable)
- PDF: Alternative vector format (portable)
- PNG: Raster format (use high DPI: 300+)
Color Considerations
- Plots use color-blind friendly palettes by default
- Ensure sufficient contrast for grayscale printing
- Use ColorBrewer palettes for multi-category plots
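matplotlib ships several ColorBrewer qualitative palettes as colormaps, for example Set2 and Dark2. The sketch below reads the Dark2 colors and estimates how each would look in grayscale with the Rec. 601 luma weights; this is a rough approximation of perceived brightness, and the 0.1 spacing used as a warning threshold is an arbitrary assumption.

```python
import matplotlib.pyplot as plt

# Dark2 is one of the ColorBrewer qualitative palettes bundled with matplotlib
palette = plt.get_cmap("Dark2").colors


def luma(rgb) -> float:
    """Approximate grayscale brightness (0 = black, 1 = white), Rec. 601 weights."""
    r, g, b = rgb[:3]
    return 0.299 * r + 0.587 * g + 0.114 * b


luminance = [luma(color) for color in palette]

# Colors with similar luma are hard to tell apart in grayscale print
ordered = sorted(luminance)
gaps = [b - a for a, b in zip(ordered, ordered[1:])]
too_close = sum(gap < 0.1 for gap in gaps)  # 0.1 is an illustrative cutoff

print([round(value, 2) for value in luminance])
print(f"Adjacent pairs closer than 0.1 in luma: {too_close}")
```

When many colors end up with similar brightness, adding distinct markers, hatches, or line styles keeps the series distinguishable in grayscale.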
Size and Resolution
- Vector formats (SVG/PDF) scale without quality loss
- For raster formats, use DPI ≥ 300 for publications
- Typical journal figure widths: single column (about 3.5"), double column (about 7"); check the target journal's author guidelines
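For raster output, the pixel dimensions follow from the figure size in inches multiplied by the DPI. A small sketch of that arithmetic, with a hypothetical helper and example heights chosen for illustration:

```python
def pixel_size(width_in: float, height_in: float, dpi: int = 300) -> tuple:
    """Pixel dimensions of a figure of the given size in inches at a given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


single_column = pixel_size(3.5, 2.5)   # (1050, 750)
double_column = pixel_size(7.0, 4.0)   # (2100, 1200)

print(single_column, double_column)
```

Passing these widths as figsize (in inches) to plt.subplots gives the corresponding figure. Saving with bbox_inches='tight' crops or pads the canvas, so the final image can differ slightly from these numbers.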
Best Practices
- Label axes clearly: Include units where applicable
- Add titles: Descriptive but concise
- Use legends: When plotting multiple series
- Limit colors: Use 5-10 distinct colors maximum
- Save originals: Keep vector formats for future editing
Integration with Analysis Workflows
Combine statistics and visualization:
from pathlib import Path
import qpx
import matplotlib.pyplot as plt

def comprehensive_qc_workflow(dataset_path, output_dir):
    """Generate comprehensive QC report with plots."""
    ds = qpx.Dataset(dataset_path)
    # Create the output directory if it does not exist yet
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    # Generate all plots
    plots = {
        'identifications': ds.identifications.plot(),
        'runs': ds.runs.plot(),
        'modifications': ds.modifications.plot(),
        'qc': ds.qc.plot()
    }
    # Save plots and close each figure to free memory
    for name, fig in plots.items():
        fig.savefig(output_dir / f"{name}.svg", format='svg', bbox_inches='tight')
        plt.close(fig)
    # Generate statistics summary
    with open(output_dir / "statistics.txt", 'w') as f:
        f.write("QPX Quality Control Report\n")
        f.write("=" * 50 + "\n\n")
        f.write(f"Proteins: {ds.psm.data['protein_accessions'].nunique():,}\n")
        f.write(f"Peptides: {ds.psm.data['sequence'].nunique():,}\n")
        f.write(f"PSMs: {ds.psm.count():,}\n")
        f.write(f"Runs: {ds.psm.data['run_file_name'].nunique()}\n")
    print(f"QC report generated in: {output_dir}")

# Usage
comprehensive_qc_workflow("./output/", "./qc_report/")
Advanced Customization
For advanced customization beyond the built-in views, you can access the underlying data directly:
import qpx
import matplotlib.pyplot as plt
import pandas as pd
ds = qpx.Dataset("./output/")
# Access raw data
psm_df = ds.psm.data
feature_df = ds.feature.data
pg_df = ds.pg.data
# Create custom visualizations using matplotlib, seaborn, plotly, etc.
# Example: Custom scatter plot
fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(psm_df['rt'], psm_df['observed_mz'], alpha=0.5, s=10)
ax.set_xlabel('Retention Time (seconds)')
ax.set_ylabel('Observed m/z')
ax.set_title('PSM Distribution in RT-m/z Space')
plt.tight_layout()
fig.savefig("./plots/custom_scatter.svg", format='svg', bbox_inches='tight')
Related Documentation
- Statistics Guide - Generate numeric summaries
- Dataset API Reference - Core dataset structure
- Views Specification - Available dataset views
- Quickstart Guide - Get started with QPX