Absolute Expression

Absolute expression (AE) quantification determines the baseline amount of a target protein in a sample. In proteomics, the primary computational approach is intensity-based absolute quantification (iBAQ).
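As a rough illustration of the idea behind iBAQ: the summed intensity of a protein's identified peptides is divided by the number of peptides that are theoretically observable for that protein. The sketch below shows only that ratio. It is not the pipeline that produces QPX files, and the function name and input values are hypothetical.

```python
def ibaq(peptide_intensities, n_theoretical_peptides):
    """Summed peptide intensity divided by the theoretically observable peptide count."""
    if n_theoretical_peptides <= 0:
        raise ValueError("n_theoretical_peptides must be positive")
    return sum(peptide_intensities) / n_theoretical_peptides


# Hypothetical protein: three identified peptides, 12 theoretically observable peptides
value = ibaq([1.2e6, 3.4e5, 8.0e5], 12)
print(value)
```

Dividing by the theoretical peptide count is what makes values comparable between proteins of different length, since longer proteins tend to yield more peptides.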

Use cases

Store and retrieve baseline protein abundance (iBAQ values) per sample.
Understand expression profiles of a protein across different conditions, tissues, and organisms.
Provide a proxy for protein copy number estimation and concentration in biological samples.
Enable fast visualization of absolute expression results.
Integrate with scverse ecosystem tools (scanpy, scvi-tools, muon) for multi-omics analysis.

Format

The absolute expression view uses AnnData (.h5ad) as its primary format. AnnData is a matrix-format standard from the scverse ecosystem that naturally represents the samples-by-proteins structure of absolute expression data.

For general AnnData concepts and conventions shared with the differential expression view, see AnnData Concepts.

AnnData structure

AnnData (n_obs x n_vars = samples x proteins)
    obs:    sample metadata (organism, tissue, disease, ...)
    var:    protein metadata (gene_name, ...)
    X:      ibaq_log (primary quantification matrix)
    layers: ibaq_raw, ibaq_ppb, copies_per_cell, concentration_nm
    uns:    file-level metadata (qpx_version, project_accession, ...)

Slots

AnnData slot	Content	Description
`X`	`ibaq_log`	Primary data matrix -- log-transformed iBAQ values (samples x proteins)
`obs`	Sample metadata	One row per sample with experimental annotations
`var`	Protein metadata	One row per protein with gene names
`layers["ibaq_raw"]`	Raw iBAQ	Untransformed iBAQ values
`layers["ibaq_ppb"]`	Relative iBAQ	Parts-per-billion normalized iBAQ
`layers["copies_per_cell"]`	Copy number	Estimated protein copies per cell
`layers["concentration_nm"]`	Concentration	Estimated protein concentration in nM
`uns`	File metadata	QPX version, project accession, factor values, etc.

Convention for primary matrix

ibaq_log is chosen as the primary quantification (maps to X). This parallels the scRNA-seq convention where X holds log-normalized counts and layers["counts"] holds raw counts.

obs (sample metadata)

Each row in obs represents one sample. The index is the sample accession.

Field	Description	Type	Required
`sample_accession`	Sample accession from SDRF (used as index)	`string`	Yes
`organism`	Organism (promoted from SDRF)	`string`	No
`organism_part`	Tissue or organ (promoted from SDRF)	`string`	No
`disease`	Disease condition (promoted from SDRF)	`string`	No
`cell_line`	Cell line (promoted from SDRF)	`string`	No
`biological_replicate`	Biological replicate number	`int32`	No
`technical_replicate`	Technical replicate number	`int32`	No

Additional SDRF factor values may be included as extra columns in obs.
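The obs layout described above can be sketched with pandas: the sample accession becomes the index, the replicate columns are cast to int32, and any extra factor column is carried along unchanged. The input records and the column name `factor_value_treatment` are hypothetical and only illustrate an additional SDRF factor.

```python
import pandas as pd

# Hypothetical SDRF-derived records; "factor_value_treatment" is an illustrative extra column
records = pd.DataFrame({
    "sample_accession": ["PXD000000-Sample-1", "PXD000000-Sample-2"],
    "organism": ["Homo sapiens", "Homo sapiens"],
    "organism_part": ["heart", "liver"],
    "biological_replicate": [1, 1],
    "technical_replicate": [1, 2],
    "factor_value_treatment": ["control", "control"],
})

# Use the sample accession as the index, as the obs table requires
obs = records.set_index("sample_accession")

# Replicate numbers are documented as int32
obs = obs.astype({"biological_replicate": "int32", "technical_replicate": "int32"})

print(obs.dtypes)
```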

var (protein metadata)

Each row in var represents one protein. The index is the protein accession.

Field	Description	Type	Required
`protein`	Protein accession (UniProt), used as index	`string`	Yes
`gene_name`	Gene symbol	`string`	No

uns (file-level metadata)

Key	Description	Type
`qpx_version`	Version of the QPX format	`string`
`file_type`	Always `"absolute_expression"`	`string`
`project_accession`	Project accession in PRIDE Archive	`string`
`project_title`	Project title	`string`
`factor_value`	Factor value from SDRF	`string`
`creation_date`	Date when the file was created	`string`
`creator`	Name of the tool that created the file	`string`
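A reader of AE files may want to sanity-check uns against the table above. The sketch below checks only what the table states: file_type is always "absolute_expression", and every documented key holds a string. The table does not say which keys are mandatory, so absent keys are not reported. The helper `check_uns` is hypothetical.

```python
# Key names and the fixed file_type value come from the uns table;
# which keys are mandatory is not specified there, so only types are checked.
DOCUMENTED_KEYS = (
    "qpx_version", "file_type", "project_accession", "project_title",
    "factor_value", "creation_date", "creator",
)


def check_uns(uns):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    if uns.get("file_type") != "absolute_expression":
        problems.append("file_type should be 'absolute_expression'")
    for key in DOCUMENTED_KEYS:
        if key in uns and not isinstance(uns[key], str):
            problems.append(f"{key} should be a string, got {type(uns[key]).__name__}")
    return problems


good = {"qpx_version": "2.0", "file_type": "absolute_expression", "project_accession": "PXD000000"}
bad = {"qpx_version": 2.0, "file_type": "unknown"}

print(check_uns(good))
print(check_uns(bad))
```

The function accepts any mapping, so it can be called with a plain dict or with `adata.uns` directly.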

Example

Creating an AE AnnData

import anndata as ad
import numpy as np
import pandas as pd

# Sample metadata (obs)
obs = pd.DataFrame({
    "organism": ["Homo sapiens", "Homo sapiens"],
    "organism_part": ["heart", "liver"],
    "disease": ["normal", "normal"],
    "biological_replicate": [1, 1],
}, index=["PXD000000-Sample-1", "PXD000000-Sample-2"])

# Protein metadata (var)
var = pd.DataFrame({
    "gene_name": ["A1BG", "HBB", "TP53"],
}, index=["P04217", "P68871", "P04637"])

# Primary matrix: ibaq_log (samples x proteins)
X = np.array([
    [8.48, 6.23, 7.91],   # Sample-1
    [7.12, 9.45, 8.03],   # Sample-2
])

# Create AnnData
adata = ad.AnnData(X=X, obs=obs, var=var)

# Add alternative quantifications as layers
adata.layers["ibaq_raw"] = np.array([
    [5678.9, 234.5, 2345.6],
    [1234.5, 6789.0, 3456.7],
])

# Add file-level metadata
adata.uns["qpx_version"] = "2.0"
adata.uns["file_type"] = "absolute_expression"
adata.uns["project_accession"] = "PXD000000"

# Save
adata.write("PXD000000.ae.h5ad")

Querying AE data

import anndata as ad
import numpy as np

adata = ad.read_h5ad("PXD000000.ae.h5ad")

# Expression of a specific gene across all samples
tp53_mask = adata.var["gene_name"] == "TP53"
print(adata[:, tp53_mask].X)

# All heart tissue samples
heart = adata[adata.obs["organism_part"] == "heart"]
print(heart.X)

# Ten most abundant proteins in the first sample.
# Sorting the negated values gives descending order and keeps NaN (undetected) entries last.
sample = adata[0, :]
values = np.asarray(sample.X).ravel()
top_proteins = sample.var_names[np.argsort(-values)[:10]]

File naming

AE AnnData files follow the QPX naming convention:

{PREFIX}.ae.h5ad

Example: PXD000000.ae.h5ad
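The naming pattern can be applied and reversed with a few lines of Python. The helper names `ae_filename` and `ae_prefix` below are hypothetical and serve only to illustrate the `{PREFIX}.ae.h5ad` convention.

```python
from pathlib import Path

AE_SUFFIX = ".ae.h5ad"


def ae_filename(prefix):
    """Build the AE file name for a given prefix."""
    return f"{prefix}{AE_SUFFIX}"


def ae_prefix(path):
    """Recover the prefix from an AE file path, or raise if the suffix does not match."""
    name = Path(path).name
    if not name.endswith(AE_SUFFIX):
        raise ValueError(f"{name!r} does not end with {AE_SUFFIX!r}")
    return name[: -len(AE_SUFFIX)]


print(ae_filename("PXD000000"))
print(ae_prefix("data/PXD000000.ae.h5ad"))
```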

Notes

Missing values

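Proteins not detected in a sample are stored as NaN in the AnnData matrix. This is analogous to dropout events in scRNA-seq, although scRNA-seq count matrices usually record dropouts as zeros rather than NaN, so downstream code should use NaN-aware operations.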
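Since undetected proteins are NaN, plain aggregations such as `mean` return NaN for any protein with a missing sample. The sketch below uses a small illustrative matrix (the values are made up) to count detections and compute NaN-aware means with NumPy.

```python
import numpy as np

# Illustrative ibaq_log matrix: 2 samples x 3 proteins, NaN = not detected
X = np.array([
    [8.48, np.nan, 7.91],
    [7.12, 9.45, np.nan],
])

detected = ~np.isnan(X)

proteins_per_sample = detected.sum(axis=1)   # detected proteins in each sample
samples_per_protein = detected.sum(axis=0)   # samples in which each protein was detected
mean_per_protein = np.nanmean(X, axis=0)     # mean over detected values only

print(proteins_per_sample, samples_per_protein, mean_per_protein)
```

The same calls work on `adata.X` or on any layer, as long as the matrix is a dense NumPy array.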

Multi-omics integration

AE AnnData can be combined with scRNA-seq AnnData for joint analysis (e.g., CITE-seq + bulk proteomics comparison), provided that features are first mapped to shared identifiers such as gene symbols. The scverse ecosystem provides tools like muon for multi-modal data integration.

Relationship to other views

The AE view derives from the Protein Group View, which provides per-file protein quantification. AE aggregates across files and computes iBAQ-based absolute quantities. For differential comparisons between conditions, see the Differential Expression View.

Protein group encoding

Protein groups in the var index are written as a single representative protein accession. The full protein group membership is available in the Protein Group View.