Absolute Expression
Absolute expression (AE) quantification estimates the baseline amount of each target protein in a sample. In proteomics, the primary computational method for this is intensity-based absolute quantification (iBAQ).
Use cases
- Store and retrieve baseline protein abundance (iBAQ values) per sample.
- Understand expression profiles of a protein across different conditions, tissues, and organisms.
- Provide a proxy for protein copy number estimation and concentration in biological samples.
- Enable fast visualization of absolute expression results.
- Integrate with scverse ecosystem tools (scanpy, scvi-tools, muon) for multi-omics analysis.
Format
The absolute expression view uses AnnData (.h5ad) as its primary format. AnnData is a matrix-format standard from the scverse ecosystem that naturally represents the samples-by-proteins structure of absolute expression data.
For general AnnData concepts and conventions shared with the differential expression view, see AnnData Concepts.
AnnData structure
AnnData (n_obs x n_vars = samples x proteins)
  obs: sample metadata (organism, tissue, disease, ...)
  var: protein metadata (gene_name, ...)
  X: ibaq_log (primary quantification matrix)
  layers: ibaq_raw, ibaq_ppb, copies_per_cell, concentration_nm
  uns: file-level metadata (qpx_version, project_accession, ...)
Slots
| AnnData slot | Content | Description |
|---|---|---|
| X | ibaq_log | Primary data matrix -- log-transformed iBAQ values (samples x proteins) |
| obs | Sample metadata | One row per sample with experimental annotations |
| var | Protein metadata | One row per protein with gene names |
| layers["ibaq_raw"] | Raw iBAQ | Untransformed iBAQ values |
| layers["ibaq_ppb"] | Relative iBAQ | Parts-per-billion normalized iBAQ |
| layers["copies_per_cell"] | Copy number | Estimated protein copies per cell |
| layers["concentration_nm"] | Concentration | Estimated protein concentration in nM |
| uns | File metadata | QPX version, project accession, factor values, etc. |
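To illustrate how a normalized layer can relate to the raw one, the sketch below derives a parts-per-billion matrix from raw iBAQ values. It assumes that ibaq_ppb is each protein's share of the sample's total iBAQ scaled by 1e9. This page does not state the exact formula, so both the scaling and the helper name `ibaq_to_ppb` are assumptions rather than the QPX reference implementation.

```python
import numpy as np


def ibaq_to_ppb(ibaq_raw: np.ndarray) -> np.ndarray:
    """Scale each sample (row) so that its detected proteins sum to 1e9.

    Assumption: ppb = raw / per-sample total * 1e9. NaN entries
    (protein not detected) are ignored in the total and stay NaN.
    """
    totals = np.nansum(ibaq_raw, axis=1, keepdims=True)
    return ibaq_raw / totals * 1e9


# Illustrative values only (samples x proteins).
ibaq_raw = np.array([
    [5678.9, 234.5, np.nan],   # third protein not detected in this sample
    [1234.5, 6789.0, 3456.7],
])
ibaq_ppb = ibaq_to_ppb(ibaq_raw)
print(np.nansum(ibaq_ppb, axis=1))
```

The result has the same shape as the input, so it could be stored as `adata.layers["ibaq_ppb"]` next to the raw layer.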
Convention for primary matrix
ibaq_log is chosen as the primary quantification (maps to X). This parallels the scRNA-seq convention where X holds log-normalized counts and layers["counts"] holds raw counts.
obs (sample metadata)
Each row in obs represents one sample. The index is the sample accession.
| Field | Description | Type | Required |
|---|---|---|---|
| sample_accession | Sample accession from SDRF (used as index) | string | Yes |
| organism | Organism (promoted from SDRF) | string | No |
| organism_part | Tissue or organ (promoted from SDRF) | string | No |
| disease | Disease condition (promoted from SDRF) | string | No |
| cell_line | Cell line (promoted from SDRF) | string | No |
| biological_replicate | Biological replicate number | int32 | No |
| technical_replicate | Technical replicate number | int32 | No |
Additional SDRF factor values may be included as extra columns in obs.
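The obs conventions above can be checked before building the AnnData object. The following is a minimal sketch using pandas only; `normalize_obs` is a hypothetical helper, and naming the index `sample_accession` is an assumption based on the table, not a documented requirement.

```python
import pandas as pd

# Columns that the obs table lists with type int32.
INT32_COLUMNS = ("biological_replicate", "technical_replicate")


def normalize_obs(obs: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: check the sample index and cast replicate columns."""
    if obs.index.has_duplicates:
        raise ValueError("sample accessions in obs.index must be unique")
    out = obs.copy()
    out.index = out.index.astype(str)
    out.index.name = "sample_accession"  # assumed index name
    for column in INT32_COLUMNS:
        if column in out.columns:
            out[column] = out[column].astype("int32")
    return out


raw_obs = pd.DataFrame(
    {"organism_part": ["heart", "liver"], "biological_replicate": [1, 1]},
    index=["PXD000000-Sample-1", "PXD000000-Sample-2"],
)
clean_obs = normalize_obs(raw_obs)
print(clean_obs.dtypes)
```

Extra SDRF factor columns pass through unchanged, which matches the note that additional columns are allowed.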
var (protein metadata)
Each row in var represents one protein. The index is the protein accession.
| Field | Description | Type | Required |
|---|---|---|---|
| protein | Protein accession (UniProt), used as index | string | Yes |
| gene_name | Gene symbol | string | No |
uns (file-level metadata)
| Key | Description | Type |
|---|---|---|
| qpx_version | Version of the QPX format | string |
| file_type | Always "absolute_expression" | string |
| project_accession | Project accession in PRIDE Archive | string |
| project_title | Project title | string |
| factor_value | Factor value from SDRF | string |
| creation_date | Date when the file was created | string |
| creator | Name of the tool that created the file | string |
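A reader of an AE file may want to confirm these keys before trusting the contents. The sketch below is one possible check written against a plain dictionary. `check_uns` is a hypothetical helper, and because the table does not say which keys are mandatory, it only enforces the fixed file_type value and the string type of any key that is present.

```python
EXPECTED_FILE_TYPE = "absolute_expression"
STRING_KEYS = (
    "qpx_version",
    "file_type",
    "project_accession",
    "project_title",
    "factor_value",
    "creation_date",
    "creator",
)


def check_uns(uns: dict) -> list:
    """Hypothetical helper: return a list of problems (empty if none found)."""
    problems = []
    if uns.get("file_type") != EXPECTED_FILE_TYPE:
        problems.append(f'file_type should be "{EXPECTED_FILE_TYPE}"')
    for key in STRING_KEYS:
        # Only keys that are present are type-checked (assumption).
        if key in uns and not isinstance(uns[key], str):
            problems.append(f"{key} should be a string")
    return problems


print(check_uns({"file_type": "absolute_expression", "qpx_version": "2.0"}))
print(check_uns({"file_type": "differential", "qpx_version": 2.0}))
```

With an AnnData object loaded, the same function could be called as `check_uns(dict(adata.uns))`.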
Example
Creating an AE AnnData
import anndata as ad
import numpy as np
import pandas as pd
# Sample metadata (obs)
obs = pd.DataFrame({
    "organism": ["Homo sapiens", "Homo sapiens"],
    "organism_part": ["heart", "liver"],
    "disease": ["normal", "normal"],
    "biological_replicate": [1, 1],
}, index=["PXD000000-Sample-1", "PXD000000-Sample-2"])
# Protein metadata (var)
var = pd.DataFrame({
    "gene_name": ["A1BG", "HBB", "TP53"],
}, index=["P04217", "P68871", "P04637"])
# Primary matrix: ibaq_log (samples x proteins)
X = np.array([
    [8.48, 6.23, 7.91],  # Sample-1
    [7.12, 9.45, 8.03],  # Sample-2
])
# Create AnnData
adata = ad.AnnData(X=X, obs=obs, var=var)
# Add alternative quantifications as layers
adata.layers["ibaq_raw"] = np.array([
    [5678.9, 234.5, 2345.6],
    [1234.5, 6789.0, 3456.7],
])
# Add file-level metadata
adata.uns["qpx_version"] = "2.0"
adata.uns["file_type"] = "absolute_expression"
adata.uns["project_accession"] = "PXD000000"
# Save
adata.write("PXD000000.ae.h5ad")
Querying AE data
import anndata as ad
import numpy as np
adata = ad.read_h5ad("PXD000000.ae.h5ad")
# Expression of a specific gene across all samples
tp53_idx = adata.var["gene_name"] == "TP53"
print(adata[:, tp53_idx].X)
# All heart tissue samples
heart = adata[adata.obs["organism_part"] == "heart"]
print(heart.X)
# Most abundant proteins in the first sample
sample = adata[0, :]
values = np.asarray(sample.X).ravel()
# Undetected proteins are NaN; map them to -inf so they rank last, not first
ranked = np.argsort(np.where(np.isnan(values), -np.inf, values))[::-1]
top_proteins = sample.var.index[ranked[:10]]
print(top_proteins)
File naming
AE AnnData files follow the QPX naming convention:
Example: PXD000000.ae.h5ad
Notes
Missing values
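Proteins not detected in a sample are stored as NaN in the AnnData matrix. This is analogous to dropout events in scRNA-seq, although scRNA-seq count matrices usually record dropouts as zeros rather than NaN, so downstream code should use NaN-aware operations.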
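Because missing values are NaN, plain sums and means would propagate NaN. A short NumPy sketch of NaN-aware summaries on an illustrative ibaq_log matrix:

```python
import numpy as np

# Illustrative ibaq_log matrix (samples x proteins); NaN = not detected.
ibaq_log = np.array([
    [8.48, np.nan, 7.91],
    [7.12, 9.45, 8.03],
])

detected = ~np.isnan(ibaq_log)
proteins_per_sample = detected.sum(axis=1)   # detected proteins in each sample
samples_per_protein = detected.sum(axis=0)   # samples in which each protein was seen
mean_log = np.nanmean(ibaq_log, axis=0)      # per-protein mean over detected samples

print(proteins_per_sample, samples_per_protein, mean_log)
```

The same calls work on `adata.X` or on any layer, as long as the matrix is dense.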
Multi-omics integration
AE AnnData can be concatenated with scRNA-seq AnnData for joint analysis (e.g., CITE-seq + bulk proteomics comparison). The scverse ecosystem provides tools like muon for multi-modal data integration.
Relationship to other views
The AE view derives from the Protein Group View, which provides per-file protein quantification. AE aggregates across files and computes iBAQ-based absolute quantities. For differential comparisons between conditions, see the Differential Expression View.
Protein group encoding
Protein groups in the var index are written as a single representative protein accession. The full protein group membership is available in the Protein Group View.