
Differential Expression

The differential expression (DE) view stores statistical results comparing protein abundance between experimental conditions. It contains fold changes, p-values, and associated metrics for each protein across contrasts. This is a key output of the proteomics data analysis workflow, enabling identification of differentially expressed proteins.

Use cases

  • Store differentially expressed proteins between contrasts with fold changes and p-values.
  • Enable visualization using Volcano Plots and other statistical plots.
  • Enable integration with other omics data for multi-omics analysis.
  • Store metadata about the statistical method and contrast definitions alongside the results.
  • Integrate with scanpy plotting functions such as sc.pl.rank_genes_groups after converting the results to scanpy's rank_genes_groups layout.

Format

The differential expression view uses AnnData (.h5ad) as its primary format. DE results are stored in AnnData's uns (unstructured metadata) slot, following scanpy's rank_genes_groups convention. The AnnData object can optionally include an expression matrix in X (from the AE view) alongside the DE results.

For general AnnData concepts and conventions shared with the absolute expression view, see AnnData Concepts.

AnnData structure

AnnData (n_obs x n_vars = samples x proteins)
    obs:    sample metadata (organism, tissue, disease, ...)
    var:    protein metadata (gene_name, ...)
    X:      protein quantification matrix (e.g. MaxLFQ, TMT intensity)
    uns:    DE results + file-level metadata

DE results in uns

DE results are stored in uns["de_results"] as a dictionary keyed by contrast identifier. Each contrast holds parallel arrays (one entry per protein) named after the fields of scanpy's rank_genes_groups convention, extended with QPX-specific fields.

adata.uns["de_results"] = {
    "cancer_vs_normal": {
        "names": np.array(["P04217", "P68871", ...]),           # protein accessions
        "gene_names": np.array(["A1BG", "HBB", ...]),          # gene symbols
        "logfoldchanges": np.array([-1.542, 0.891, ...]),       # log2 fold change
        "scores": np.array([-8.567, 4.321, ...]),               # t-value / test statistic
        "pvals": np.array([0.0001, 0.002, ...]),                # raw p-values
        "pvals_adj": np.array([0.005, 0.045, ...]),             # adjusted p-values (BH)
        "se": np.array([0.18, 0.206, ...]),                     # standard error
        "df": np.array([37, 35, ...]),                          # degrees of freedom
        "is_significant": np.array([True, True, ...]),          # pre-computed significance
        "issue": np.array([None, None, ...]),                   # inference issues
        "condition_test": "squamous cell carcinoma",            # test condition
        "condition_reference": "normal",                        # reference condition
    }
}

Fields per contrast

| Field | Description | Type | Required |
| --- | --- | --- | --- |
| names | Protein accessions (sorted by p-value) | array[string] | Yes |
| gene_names | Gene symbols corresponding to each protein | array[string] | No |
| logfoldchanges | Log2 fold change (test / reference) | array[float64] | Yes |
| scores | Test statistic (t-value or equivalent) | array[float64] | No |
| pvals | Raw p-values | array[float64] | Yes |
| pvals_adj | Adjusted p-values (multiple testing corrected) | array[float64] | Yes |
| se | Standard error of the log2 fold change | array[float64] | No |
| df | Degrees of freedom | array[int32] | No |
| is_significant | Pre-computed significance at the stored FDR threshold | array[bool] | No |
| issue | Issue with protein quantification, if any | array[string] | No |
| condition_test | Test condition in the contrast | string | Yes |
| condition_reference | Reference/control condition in the contrast | string | Yes |
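
The required fields listed above can be checked before a file is written. The following is a minimal sketch, not part of QPX: `validate_contrast` is a hypothetical helper, and the rule that every array has the same length as `names` is an assumption based on the arrays holding one entry per protein.

```python
REQUIRED_ARRAYS = ("names", "logfoldchanges", "pvals", "pvals_adj")
REQUIRED_STRINGS = ("condition_test", "condition_reference")
OPTIONAL_ARRAYS = ("gene_names", "scores", "se", "df", "is_significant", "issue")


def validate_contrast(contrast):
    """Return a list of problems found in one contrast dictionary (hypothetical helper)."""
    problems = []
    # Every required field must be present.
    for key in REQUIRED_ARRAYS + REQUIRED_STRINGS:
        if key not in contrast:
            problems.append(f"missing required field: {key}")
    # The two condition fields are plain strings, not arrays.
    for key in REQUIRED_STRINGS:
        if key in contrast and not isinstance(contrast[key], str):
            problems.append(f"{key} must be a string")
    # Assumed rule: all per-protein arrays are as long as "names".
    if "names" in contrast:
        n_proteins = len(contrast["names"])
        for key in REQUIRED_ARRAYS + OPTIONAL_ARRAYS:
            if key in contrast and len(contrast[key]) != n_proteins:
                problems.append(
                    f"{key} has {len(contrast[key])} entries, expected {n_proteins}"
                )
    return problems


# Plain lists are used here; NumPy arrays behave the same because only len() is needed.
example = {
    "names": ["P04217", "P68871"],
    "logfoldchanges": [-1.542, 0.891],
    "pvals": [0.0001, 0.002],
    "pvals_adj": [0.005],  # deliberately one entry short
    "condition_test": "squamous cell carcinoma",
}
for problem in validate_contrast(example):
    print(problem)
```

Returning a list of problems rather than raising on the first one makes it easier to report everything wrong with a contrast in a single pass.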

uns (file-level metadata)

| Key | Description | Type |
| --- | --- | --- |
| qpx_version | Version of the QPX format | string |
| file_type | Always "differential_expression" | string |
| project_accession | Project accession in PRIDE Archive | string |
| statistical_method | Statistical method used (e.g., msstats_group_comparison, limma, deqms) | string |
| correction_method | Multiple testing correction method (e.g., BH) | string |
| fdr_threshold | FDR threshold used for is_significant | string |
| factor_names | JSON array of factor names from SDRF | string |
| contrasts | JSON array of contrast identifiers | string |
| creation_date | Date when the file was created | string |
| creator | Name of the tool that created the file | string |
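
Because every file-level value is stored as a string, readers need to decode the JSON arrays and the numeric threshold themselves. The sketch below shows one way to do that with the standard library only. A plain dictionary stands in for `adata.uns`, `parse_file_metadata` is a hypothetical helper, and the `factor_names` value is illustrative.

```python
import json

# Plain dict standing in for adata.uns; keys follow the table above.
uns = {
    "qpx_version": "2.0",
    "file_type": "differential_expression",
    "correction_method": "BH",
    "fdr_threshold": "0.05",
    "factor_names": '["disease"]',  # illustrative value, not taken from a real project
    "contrasts": '["cancer_vs_normal"]',
}


def parse_file_metadata(uns):
    """Decode string-typed QPX metadata into Python values (hypothetical helper)."""
    if uns.get("file_type") != "differential_expression":
        raise ValueError("not a differential expression file")
    return {
        "fdr_threshold": float(uns["fdr_threshold"]),
        "factor_names": json.loads(uns.get("factor_names", "[]")),
        "contrasts": json.loads(uns["contrasts"]),
    }


meta = parse_file_metadata(uns)
print(meta["contrasts"], meta["fdr_threshold"])
```

The decoded `contrasts` list can then be compared against the keys of `uns["de_results"]` to confirm that the file is internally consistent.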

Tool mapping

QPX's DE fields map to common statistical tool outputs:

| QPX field | scanpy | MSstats | PyDESeq2 | limma (R) | edgeR (R) |
| --- | --- | --- | --- | --- | --- |
| names | names | Protein | (index) | (rownames) | (rownames) |
| logfoldchanges | logfoldchanges | log2FC | log2FoldChange | logFC | logFC |
| scores | scores | Tvalue | stat | t | -- |
| pvals | pvals | pvalue | pvalue | P.Value | PValue |
| pvals_adj | pvals_adj | adj.pvalue | padj | adj.P.Val | FDR |
| se | -- | SE | lfcSE | -- | -- |
| df | -- | DF | -- | df.residual | -- |
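
The mapping can be applied mechanically with a column rename. The sketch below converts an MSstats-style table, already filtered to a single comparison, into one QPX contrast dictionary. `msstats_to_contrast` is a hypothetical helper, the rows are illustrative, and the code assumes all seven mapped columns are present.

```python
import pandas as pd

# Column mapping taken from the MSstats column of the table above.
MSSTATS_TO_QPX = {
    "Protein": "names",
    "log2FC": "logfoldchanges",
    "Tvalue": "scores",
    "pvalue": "pvals",
    "adj.pvalue": "pvals_adj",
    "SE": "se",
    "DF": "df",
}


def msstats_to_contrast(table, condition_test, condition_reference):
    """Build one QPX contrast dict from an MSstats-style table (hypothetical helper)."""
    # Rename to QPX field names and sort by raw p-value, as the names field requires.
    renamed = table.rename(columns=MSSTATS_TO_QPX).sort_values("pvals")
    contrast = {field: renamed[field].to_numpy() for field in MSSTATS_TO_QPX.values()}
    contrast["condition_test"] = condition_test
    contrast["condition_reference"] = condition_reference
    return contrast


# Illustrative rows reusing values from the examples in this document.
msstats_table = pd.DataFrame({
    "Protein": ["P04217", "P68871"],
    "log2FC": [-1.542, 0.891],
    "SE": [0.18, 0.206],
    "Tvalue": [-8.567, 4.321],
    "DF": [37, 35],
    "pvalue": [0.0001, 0.002],
    "adj.pvalue": [0.005, 0.045],
})
contrast = msstats_to_contrast(msstats_table, "squamous cell carcinoma", "normal")
print(contrast["names"], contrast["logfoldchanges"])
```

The same pattern should carry over to the other tools by swapping the mapping dictionary, with the caveat that tools marked `--` in the table do not provide the corresponding field.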

Example

Creating a DE AnnData

import anndata as ad
import numpy as np
import pandas as pd

# Protein metadata
var = pd.DataFrame({
    "gene_name": ["A1BG", "HBB", "TP53"],
}, index=["P04217", "P68871", "P04637"])

# Sample metadata (optional — can be empty if only storing DE results)
obs = pd.DataFrame(index=["Sample-1", "Sample-2", "Sample-3", "Sample-4"])

# Create AnnData (X can be empty or from AE)
adata = ad.AnnData(
    X=np.zeros((len(obs), len(var))),
    obs=obs,
    var=var,
)

# Add DE results
adata.uns["de_results"] = {
    "cancer_vs_normal": {
        # Entries are ordered by raw p-value, as required for "names"
        "names": np.array(["P04637", "P04217", "P68871"]),
        "gene_names": np.array(["TP53", "A1BG", "HBB"]),
        "logfoldchanges": np.array([2.345, -1.542, 0.891]),
        "scores": np.array([12.456, -8.567, 4.321]),
        "pvals": np.array([0.00001, 0.0001, 0.002]),
        "pvals_adj": np.array([0.001, 0.005, 0.045]),
        "se": np.array([0.188, 0.18, 0.206]),
        "df": np.array([37, 37, 35], dtype=np.int32),
        "is_significant": np.array([True, True, True]),
        # Empty strings stand in for "no issue" here: an object array of None
        # is not a string array and is expected to fail when written to .h5ad
        "issue": np.array(["", "", ""]),
        "condition_test": "squamous cell carcinoma",
        "condition_reference": "normal",
    }
}

# Add file-level metadata
adata.uns["qpx_version"] = "2.0"
adata.uns["file_type"] = "differential_expression"
adata.uns["statistical_method"] = "msstats_group_comparison"
adata.uns["correction_method"] = "BH"
adata.uns["fdr_threshold"] = "0.05"
adata.uns["contrasts"] = '["cancer_vs_normal"]'

# Save
adata.write("PXD000000.de.h5ad")

Using with scanpy

import anndata as ad
import numpy as np
import scanpy as sc

adata = ad.read_h5ad("PXD000000.de.h5ad")
de_results = adata.uns["de_results"]
contrast_ids = list(de_results)

def to_recarray(field):
    # scanpy expects record arrays with one named field per group (here: per contrast).
    # This requires every contrast to list the same number of proteins.
    return np.rec.fromarrays(
        [de_results[c][field] for c in contrast_ids], names=contrast_ids
    )

# Convert to scanpy rank_genes_groups format for plotting
adata.uns["rank_genes_groups"] = {
    # "groupby" is a placeholder obs column name; adjust it to your sample metadata
    "params": {"groupby": "condition", "reference": "normal", "method": "t-test"},
}
for field in ("names", "logfoldchanges", "scores", "pvals", "pvals_adj"):
    adata.uns["rank_genes_groups"][field] = to_recarray(field)

# scanpy plots the first n_genes entries of each contrast, in stored order
sc.pl.rank_genes_groups(adata, n_genes=20)

Querying DE results

import anndata as ad
import pandas as pd

adata = ad.read_h5ad("PXD000000.de.h5ad")

# Get significant up-regulated proteins in a contrast
contrast = adata.uns["de_results"]["cancer_vs_normal"]
de_df = pd.DataFrame({
    "protein": contrast["names"],
    "gene_name": contrast["gene_names"],
    "log2fc": contrast["logfoldchanges"],
    "pvalue": contrast["pvals"],
    "adj_pvalue": contrast["pvals_adj"],
    "significant": contrast["is_significant"],
})

# Filter: significant and up-regulated
up_regulated = de_df[(de_df["significant"]) & (de_df["log2fc"] > 1.0)]
print(up_regulated.sort_values("adj_pvalue"))
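
The same per-contrast arrays are enough to draw a volcano plot. The sketch below uses matplotlib directly with values from the cancer_vs_normal example. Both cutoffs are assumptions chosen for illustration, and the output file name is hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Values taken from the cancer_vs_normal example in this document.
log2fc = np.array([-1.542, 0.891, 2.345])
pvals_adj = np.array([0.005, 0.045, 0.001])
labels = ["A1BG", "HBB", "TP53"]

FDR_CUTOFF = 0.05     # assumed, matching fdr_threshold in the example
LOG2FC_CUTOFF = 1.0   # assumed effect-size cutoff

# Volcano coordinates: effect size on x, -log10 adjusted p-value on y.
neg_log10 = -np.log10(pvals_adj)
highlighted = (pvals_adj < FDR_CUTOFF) & (np.abs(log2fc) > LOG2FC_CUTOFF)
colors = ["crimson" if flag else "grey" for flag in highlighted]

fig, ax = plt.subplots(figsize=(4, 4))
ax.scatter(log2fc, neg_log10, c=colors)
for x, y, label in zip(log2fc, neg_log10, labels):
    ax.annotate(label, (x, y), textcoords="offset points", xytext=(4, 4))
# Dashed guides mark the assumed cutoffs.
ax.axhline(-np.log10(FDR_CUTOFF), linestyle="--", linewidth=0.8, color="black")
ax.axvline(LOG2FC_CUTOFF, linestyle="--", linewidth=0.8, color="black")
ax.axvline(-LOG2FC_CUTOFF, linestyle="--", linewidth=0.8, color="black")
ax.set_xlabel("log2 fold change")
ax.set_ylabel("-log10 adjusted p-value")
fig.savefig("volcano.png", dpi=150)
```

In a real file the three arrays would come from `contrast["logfoldchanges"]`, `contrast["pvals_adj"]`, and `contrast["gene_names"]` rather than being typed in.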

File naming

DE AnnData files follow the QPX naming convention:

{PREFIX}.de.h5ad

Example: PXD000000.de.h5ad
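
A small helper pair can build and recognize names that follow this pattern. This is a sketch under the assumption that the prefix may be any non-empty string; `de_filename` and `parse_de_filename` are hypothetical names, not part of QPX.

```python
import re

# Matches {PREFIX}.de.h5ad and captures the prefix.
DE_FILENAME = re.compile(r"^(?P<prefix>.+)\.de\.h5ad$")


def de_filename(prefix):
    """Build a DE file name from a prefix such as a project accession."""
    return f"{prefix}.de.h5ad"


def parse_de_filename(filename):
    """Return the prefix of a DE file name, or None if the name does not match."""
    match = DE_FILENAME.match(filename)
    return match.group("prefix") if match else None


print(de_filename("PXD000000"))
print(parse_de_filename("PXD000000.de.h5ad"))
```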

Notes

Bundling AE and DE

AE and DE results can be stored in the same AnnData file. The expression matrix (X and layers) comes from the AE view, and DE results are added to uns["de_results"]. This enables a single file that contains both the expression data and the statistical comparisons.

QPX is richer than scanpy

Fields like se, gene_names, condition_test, condition_reference, and is_significant have no direct equivalent in scanpy's rank_genes_groups format. QPX preserves more statistical detail.

Relationship to other views

The DE view takes as input the protein quantification from the Protein Group View or the Absolute Expression View. For absolute expression values per sample, see the Absolute Expression View.

Protein group encoding

Protein groups in the DE results use a single representative protein accession. The full protein group membership is available in the Protein Group View.