Differential Expression

The differential expression (DE) view stores statistical results comparing protein abundance between experimental conditions. It contains fold changes, p-values, and associated metrics for each protein across contrasts. This is a key output of the proteomics data analysis workflow, enabling identification of differentially expressed proteins.

Use cases

Store differentially expressed proteins between contrasts with fold changes and p-values.
Enable visualization with volcano plots and other statistical plots.
Enable integration with other omics data for multi-omics analysis.
Store metadata about the statistical method and contrast definitions alongside the results.
Integrate with scanpy plotting functions such as sc.pl.rank_genes_groups, after converting the results to scanpy's rank_genes_groups layout.

Format

The differential expression view uses AnnData (.h5ad) as its primary format. DE results are stored in AnnData's uns (unstructured annotation) slot and reuse the field names of scanpy's rank_genes_groups convention. The AnnData object can optionally include an expression matrix in X (from the AE view) alongside the DE results.

For general AnnData concepts and conventions shared with the absolute expression view, see AnnData Concepts.

AnnData structure

AnnData (n_obs x n_vars = samples x proteins)
    obs:    sample metadata (organism, tissue, disease, ...)
    var:    protein metadata (gene_name, ...)
    X:      protein quantification matrix (e.g. MaxLFQ, TMT intensity)
    uns:    DE results + file-level metadata

DE results in `uns`

DE results are stored in uns["de_results"] as a dictionary keyed by contrast identifier. Each contrast holds parallel arrays with one entry per protein, plus two strings naming the compared conditions. The array names follow scanpy's rank_genes_groups convention and are extended with QPX-specific fields.

adata.uns["de_results"] = {
    "cancer_vs_normal": {
        "names": np.array(["P04217", "P68871", ...]),           # protein accessions
        "gene_names": np.array(["A1BG", "HBB", ...]),          # gene symbols
        "logfoldchanges": np.array([-1.542, 0.891, ...]),       # log2 fold change
        "scores": np.array([-8.567, 4.321, ...]),               # t-value / test statistic
        "pvals": np.array([0.0001, 0.002, ...]),                # raw p-values
        "pvals_adj": np.array([0.005, 0.045, ...]),             # adjusted p-values (BH)
        "se": np.array([0.18, 0.206, ...]),                     # standard error
        "df": np.array([37, 35, ...]),                          # degrees of freedom
        "is_significant": np.array([True, True, ...]),          # pre-computed significance
        "issue": np.array(["", "", ...]),                       # inference issues ("" if none)
        "condition_test": "squamous cell carcinoma",            # test condition
        "condition_reference": "normal",                        # reference condition
    }
}

Fields per contrast

Field	Description	Type	Required
`names`	Protein accessions (sorted by p-value)	`array[string]`	Yes
`gene_names`	Gene symbols corresponding to each protein	`array[string]`	No
`logfoldchanges`	Log2 fold change (test / reference)	`array[float64]`	Yes
`scores`	Test statistic (t-value or equivalent)	`array[float64]`	No
`pvals`	Raw p-values	`array[float64]`	Yes
`pvals_adj`	Adjusted p-values (multiple testing corrected)	`array[float64]`	Yes
`se`	Standard error of the log2 fold change	`array[float64]`	No
`df`	Degrees of freedom	`array[int32]`	No
`is_significant`	Pre-computed significance at the stored FDR threshold	`array[bool]`	No
`issue`	Issue with protein quantification, if any	`array[string]`	No
`condition_test`	Test condition in the contrast	`string`	Yes
`condition_reference`	Reference/control condition in the contrast	`string`	Yes
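
The Required column above can be checked mechanically before a file is written. The following is a minimal sketch, not part of QPX itself: the function name validate_contrast and the rule that every array must be as long as names are assumptions made for illustration.

```python
# Field lists taken from the "Fields per contrast" table.
REQUIRED_ARRAYS = ("names", "logfoldchanges", "pvals", "pvals_adj")
OPTIONAL_ARRAYS = ("gene_names", "scores", "se", "df", "is_significant", "issue")
REQUIRED_STRINGS = ("condition_test", "condition_reference")


def validate_contrast(contrast):
    """Return a list of problems found in one contrast dict (empty if none)."""
    problems = []
    # Every required field must be present.
    for key in REQUIRED_ARRAYS + REQUIRED_STRINGS:
        if key not in contrast:
            problems.append(f"missing required field: {key}")
    # The two condition labels are plain strings, not arrays.
    for key in REQUIRED_STRINGS:
        if key in contrast and not isinstance(contrast[key], str):
            problems.append(f"{key} must be a string")
    # Assumption: all per-protein arrays are parallel to "names".
    if "names" in contrast:
        n = len(contrast["names"])
        for key in REQUIRED_ARRAYS + OPTIONAL_ARRAYS:
            if key in contrast and len(contrast[key]) != n:
                problems.append(
                    f"{key} has length {len(contrast[key])}, expected {n}"
                )
    return problems


example = {
    "names": ["P04217", "P68871"],
    "logfoldchanges": [-1.542, 0.891],
    "pvals": [0.0001, 0.002],
    "pvals_adj": [0.005],  # deliberately one element short
    "condition_test": "squamous cell carcinoma",
}
for problem in validate_contrast(example):
    print(problem)
```

The check only uses len() and key lookups, so it works on plain lists as well as on the NumPy arrays read back from an .h5ad file.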

uns (file-level metadata)

Key	Description	Type
`qpx_version`	Version of the QPX format	`string`
`file_type`	Always `"differential_expression"`	`string`
`project_accession`	Project accession in PRIDE Archive	`string`
`statistical_method`	Statistical method used (e.g., `msstats_group_comparison`, `limma`, `deqms`)	`string`
`correction_method`	Multiple testing correction method (e.g., `BH`)	`string`
`fdr_threshold`	FDR threshold used for `is_significant`	`string`
`factor_names`	JSON array of factor names from SDRF	`string`
`contrasts`	JSON array of contrast identifiers	`string`
`creation_date`	Date when the file was created	`string`
`creator`	Name of the tool that created the file	`string`
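
Because every file-level key is typed as a string, values such as contrasts, factor_names, and fdr_threshold need decoding before use. The sketch below shows one way to do that with the standard library. The helper names are hypothetical, and the strict comparison (adjusted p-value below the threshold) is an assumption, since the table does not say whether the boundary is inclusive.

```python
import json


def parse_file_metadata(uns):
    """Decode the string-encoded file-level keys into Python values."""
    meta = dict(uns)
    # "contrasts" and "factor_names" are JSON arrays stored as strings.
    for key in ("contrasts", "factor_names"):
        if key in meta:
            meta[key] = json.loads(meta[key])
    # "fdr_threshold" is a number stored as a string.
    if "fdr_threshold" in meta:
        meta["fdr_threshold"] = float(meta["fdr_threshold"])
    return meta


def recompute_significance(pvals_adj, fdr_threshold):
    """Assumed rule: significant when the adjusted p-value is below the threshold."""
    return [p < fdr_threshold for p in pvals_adj]


uns = {
    "file_type": "differential_expression",
    "fdr_threshold": "0.05",
    "contrasts": '["cancer_vs_normal"]',
}
meta = parse_file_metadata(uns)
flags = recompute_significance([0.005, 0.045, 0.2], meta["fdr_threshold"])
print(meta["contrasts"], flags)
```

Recomputing the flags this way is useful as a cross-check against a stored is_significant array, or when a different threshold is wanted.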

Tool mapping

QPX's DE fields map to common statistical tool outputs:

QPX field	scanpy	MSstats	PyDESeq2	limma (R)	edgeR (R)
`names`	`names`	`Protein`	(index)	(rownames)	(rownames)
`logfoldchanges`	`logfoldchanges`	`log2FC`	`log2FoldChange`	`logFC`	`logFC`
`scores`	`scores`	`Tvalue`	`stat`	`t`	--
`pvals`	`pvals`	`pvalue`	`pvalue`	`P.Value`	`PValue`
`pvals_adj`	`pvals_adj`	`adj.pvalue`	`padj`	`adj.P.Val`	`FDR`
`se`	--	`SE`	`lfcSE`	--	--
`df`	--	`DF`	--	`df.residual`	--
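
As an illustration of the MSstats column of this table, the sketch below renames MSstats-style result rows into a QPX contrast dictionary. The function rows_to_contrast and the input shape (a list of dicts, one per protein) are assumptions for the example. A pandas DataFrame can be brought into that shape with to_dict("records").

```python
# MSstats column name -> QPX field name, copied from the tool mapping table.
MSSTATS_TO_QPX = {
    "Protein": "names",
    "log2FC": "logfoldchanges",
    "Tvalue": "scores",
    "pvalue": "pvals",
    "adj.pvalue": "pvals_adj",
    "SE": "se",
    "DF": "df",
}


def rows_to_contrast(rows, condition_test, condition_reference):
    """Build a QPX-style contrast dict from MSstats-like row dicts."""
    # QPX lists proteins sorted by p-value.
    ordered = sorted(rows, key=lambda row: row["pvalue"])
    contrast = {
        qpx_field: [row[source] for row in ordered]
        for source, qpx_field in MSSTATS_TO_QPX.items()
        # Skip columns that are not present in every row.
        if all(source in row for row in ordered)
    }
    contrast["condition_test"] = condition_test
    contrast["condition_reference"] = condition_reference
    return contrast


rows = [
    {"Protein": "P68871", "log2FC": 0.891, "Tvalue": 4.321,
     "pvalue": 0.002, "adj.pvalue": 0.045, "SE": 0.206, "DF": 35},
    {"Protein": "P04217", "log2FC": -1.542, "Tvalue": -8.567,
     "pvalue": 0.0001, "adj.pvalue": 0.005, "SE": 0.18, "DF": 37},
]
contrast = rows_to_contrast(rows, "squamous cell carcinoma", "normal")
print(contrast["names"], contrast["pvals"])
```

The values are returned as Python lists. Wrapping each one in np.asarray before assigning the dictionary to adata.uns["de_results"] would match the array types in the field table.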

Example

Creating a DE AnnData

import anndata as ad
import numpy as np
import pandas as pd

# Protein metadata
var = pd.DataFrame({
    "gene_name": ["A1BG", "HBB", "TP53"],
}, index=["P04217", "P68871", "P04637"])

# Sample metadata (optional — can be empty if only storing DE results)
obs = pd.DataFrame(index=["Sample-1", "Sample-2", "Sample-3", "Sample-4"])

# Create AnnData (X can be empty or from AE)
adata = ad.AnnData(
    X=np.zeros((len(obs), len(var))),
    obs=obs,
    var=var,
)

# Add DE results (entries ordered by ascending p-value, as the field table specifies)
adata.uns["de_results"] = {
    "cancer_vs_normal": {
        "names": np.array(["P04637", "P04217", "P68871"]),
        "gene_names": np.array(["TP53", "A1BG", "HBB"]),
        "logfoldchanges": np.array([2.345, -1.542, 0.891]),
        "scores": np.array([12.456, -8.567, 4.321]),
        "pvals": np.array([0.00001, 0.0001, 0.002]),
        "pvals_adj": np.array([0.001, 0.005, 0.045]),
        "se": np.array([0.188, 0.18, 0.206]),
        "df": np.array([37, 37, 35], dtype=np.int32),
        "is_significant": np.array([True, True, True]),
        # Empty string = no issue; an object array of None cannot be written as an HDF5 string array
        "issue": np.array(["", "", ""]),
        "condition_test": "squamous cell carcinoma",
        "condition_reference": "normal",
    }
}

# Add file-level metadata
adata.uns["qpx_version"] = "2.0"
adata.uns["file_type"] = "differential_expression"
adata.uns["statistical_method"] = "msstats_group_comparison"
adata.uns["correction_method"] = "BH"
adata.uns["fdr_threshold"] = "0.05"
adata.uns["contrasts"] = '["cancer_vs_normal"]'

# Save
adata.write("PXD000000.de.h5ad")

Using with scanpy

import anndata as ad
import numpy as np
import scanpy as sc

adata = ad.read_h5ad("PXD000000.de.h5ad")
de_results = adata.uns["de_results"]
contrast_ids = list(de_results)

# scanpy expects one structured array per field, with one named column per
# group (here: per contrast). All contrasts must list the same number of proteins.
def to_recarray(field):
    columns = [np.asarray(de_results[c][field]) for c in contrast_ids]
    return np.rec.fromarrays(columns, names=contrast_ids)

# "scores" is optional in QPX but is read by sc.pl.rank_genes_groups
fields = ["names", "scores", "logfoldchanges", "pvals", "pvals_adj"]
adata.uns["rank_genes_groups"] = {field: to_recarray(field) for field in fields}
adata.uns["rank_genes_groups"]["params"] = {
    # scanpy stores a single reference group; taken here from the first contrast
    "reference": de_results[contrast_ids[0]]["condition_reference"],
    "method": "t-test",
}

# Plot the top entries per contrast; scanpy plots them in the stored order
sc.pl.rank_genes_groups(adata, n_genes=20)

Querying DE results

import anndata as ad
import pandas as pd

adata = ad.read_h5ad("PXD000000.de.h5ad")

# Get significant up-regulated proteins in a contrast
contrast = adata.uns["de_results"]["cancer_vs_normal"]
de_df = pd.DataFrame({
    "protein": contrast["names"],
    "gene_name": contrast["gene_names"],
    "log2fc": contrast["logfoldchanges"],
    "pvalue": contrast["pvals"],
    "adj_pvalue": contrast["pvals_adj"],
    "significant": contrast["is_significant"],
})

# Filter: significant and up-regulated
up_regulated = de_df[(de_df["significant"]) & (de_df["log2fc"] > 1.0)]
print(up_regulated.sort_values("adj_pvalue"))
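
The same contrast dictionary can feed the volcano plots listed among the use cases. The sketch below prepares the plot coordinates and a direction label per protein. The function name, the cutoffs (an absolute log2 fold change of 1.0 and an adjusted p-value of 0.05), and the floor used to avoid log10(0) are illustrative assumptions, not QPX defaults.

```python
import math


def volcano_points(contrast, lfc_cutoff=1.0, alpha=0.05, floor=1e-300):
    """Return (name, log2fc, -log10(p), label) tuples for one contrast."""
    points = []
    rows = zip(
        contrast["names"],
        contrast["logfoldchanges"],
        contrast["pvals"],
        contrast["pvals_adj"],
    )
    for name, lfc, p, p_adj in rows:
        # Classify on the adjusted p-value, plot the raw p-value.
        if p_adj < alpha and lfc >= lfc_cutoff:
            label = "up"
        elif p_adj < alpha and lfc <= -lfc_cutoff:
            label = "down"
        else:
            label = "ns"
        # Clamp p-values of exactly zero so the logarithm is defined.
        points.append((name, lfc, -math.log10(max(p, floor)), label))
    return points


contrast = {
    "names": ["P04217", "P68871", "P04637"],
    "logfoldchanges": [-1.542, 0.891, 2.345],
    "pvals": [0.0001, 0.002, 0.00001],
    "pvals_adj": [0.005, 0.045, 0.001],
}
for point in volcano_points(contrast):
    print(point)
```

The second and third tuple elements are the x and y coordinates, so they can be passed to any scatter-plot function, with the label controlling the colour.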

File naming

DE AnnData files follow the QPX naming convention:

{PREFIX}.de.h5ad

Example: PXD000000.de.h5ad
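
A small sketch of building and parsing names that follow this pattern. Both helper names are hypothetical, and the code assumes the prefix is whatever precedes the .de.h5ad suffix.

```python
from pathlib import Path

DE_SUFFIX = ".de.h5ad"


def de_filename(prefix):
    """Return the DE file name for a given prefix."""
    return f"{prefix}{DE_SUFFIX}"


def prefix_from_filename(path):
    """Extract the prefix from a DE file name, ignoring any directories."""
    name = Path(path).name
    if not name.endswith(DE_SUFFIX):
        raise ValueError(f"not a DE AnnData file name: {name}")
    return name[: -len(DE_SUFFIX)]


print(de_filename("PXD000000"))
print(prefix_from_filename("results/PXD000000.de.h5ad"))
```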

Notes

Bundling AE and DE

AE and DE results can be stored in the same AnnData file. The expression matrix (X and layers) comes from the AE view, and DE results are added to uns["de_results"]. This enables a single file that contains both the expression data and the statistical comparisons.

QPX is richer than scanpy

Fields such as se, df, gene_names, issue, condition_test, condition_reference, and is_significant have no direct equivalent in scanpy's rank_genes_groups format, so QPX preserves more statistical detail than a scanpy-only representation.
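
One practical use of the extra fields is a consistency check between scores, logfoldchanges, and se. The sketch below assumes the test statistic is a t-value computed as log2 fold change divided by its standard error. That holds for many linear-model tests but is an assumption here, not something QPX guarantees. The function name and the 1% tolerance are also illustrative.

```python
import math


def inconsistent_scores(contrast, rel_tol=0.01):
    """Return names whose score is not close to logfoldchange / se."""
    flagged = []
    rows = zip(
        contrast["names"],
        contrast["logfoldchanges"],
        contrast["se"],
        contrast["scores"],
    )
    for name, lfc, se, score in rows:
        # A zero standard error makes the ratio undefined, so flag it too.
        if se == 0 or not math.isclose(score, lfc / se, rel_tol=rel_tol):
            flagged.append(name)
    return flagged


contrast = {
    "names": ["P04217", "P68871", "P04637"],
    "logfoldchanges": [-1.542, 0.891, 2.345],
    "se": [0.18, 0.206, 0.188],
    "scores": [-8.567, 4.321, 12.456],
}
print(inconsistent_scores(contrast))
```

A non-empty result does not necessarily mean the data are wrong. It may simply indicate that the statistical method reports a different kind of statistic in scores.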

Relationship to other views

The DE view takes as input the protein quantification from the Protein Group View or the Absolute Expression View. For absolute expression values per sample, see the Absolute Expression View.

Protein group encoding

Protein groups in the DE results use a single representative protein accession. The full protein group membership is available in the Protein Group View.