Field Ontology Mapping
QPX fields are mapped to controlled vocabulary (CV) and ontology terms where applicable. This mapping provides formal semantic definitions for each field, enabling interoperability with proteomics standards, biological databases, and data management platforms like lamindb.
snake_case everywhere
All QPX field names and score names use snake_case -- no colons, dots, spaces, or mixed case. This ensures every name is a valid identifier in SQL, Python, R, and any query language without quoting. The proper ontology term name (e.g., Comet:xcorr for comet_xcorr) and accession (e.g., MS:1002252) are stored in this mapping.
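As a minimal sketch of the naming rule, the check below accepts only lowercase letters, digits, and underscores. The `is_snake_case` helper and its exact pattern are assumptions made for illustration; the QPX specification may state the rule differently.

```python
import re

# Hypothetical pattern: a lowercase letter first, then lowercase letters,
# digits, or underscores. This is an assumed reading of the rule, not a
# definition taken from the QPX specification.
_SNAKE_CASE = re.compile(r"[a-z][a-z0-9_]*")


def is_snake_case(name: str) -> bool:
    """Return True if `name` matches the assumed snake_case pattern."""
    return _SNAKE_CASE.fullmatch(name) is not None


print(is_snake_case("comet_xcorr"))  # QPX score name
print(is_snake_case("Comet:xcorr"))  # ontology term name, contains ":" and uppercase
```

A check like this could run before writing ontology.parquet, so that an ontology term name never ends up in the field_name column by mistake.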
QPX uses three categories of ontologies:
- PSI-MS (MS) -- mass spectrometry terms for instrument, acquisition, and identification fields.
- Biological ontologies -- for sample metadata fields: NCBI Taxonomy, UBERON, EFO/MONDO, Cell Ontology, etc.
- UniMod (UNIMOD) -- for post-translational modification definitions.
Fields without ontology mappings
Fields with blank ontology entries (shown as --- in the tables on this page) are QPX-specific fields without standardized CV terms. These fields are defined entirely by the QPX specification.
ontology.parquet -- machine-readable mapping
Each QPX dataset includes an ontology.parquet file that stores the field-to-ontology mapping in a queryable format. This makes the dataset self-describing -- any consumer can look up the proper ontology term for any field or score name without external documentation.
See the full YAML schema in ontology.yaml.
Schema
import pyarrow as pa

ontology_schema = pa.schema([
    pa.field("field_name", pa.string()),                          # snake_case QPX field name (e.g. "comet_xcorr")
    pa.field("ontology_name", pa.string(), nullable=True),        # proper ontology term (e.g. "Comet:xcorr")
    pa.field("ontology_accession", pa.string(), nullable=True),   # CV accession (e.g. "MS:1002252")
    pa.field("ontology_source", pa.string(), nullable=True),      # ontology prefix (e.g. "MS", "UBERON", "UNIMOD")
    pa.field("ontology_version", pa.string(), nullable=True),     # version of the ontology (e.g. "4.1.235")
    pa.field("view", pa.string()),                                # which view (e.g. "psm", "feature", "sample")
    pa.field("description", pa.string(), nullable=True),          # human-readable description
    pa.field("source_column_name", pa.string(), nullable=True),   # original column name in tool output
    pa.field("source_tool", pa.string(), nullable=True),          # tool that produced this field
])
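To make the row shape concrete without depending on pyarrow, the sketch below mirrors the nine schema columns in a standard-library dataclass. The `OntologyRow` class and its `is_mapped` method are hypothetical names, and treating `field_name` and `view` as the only required columns is an assumption based on which fields lack `nullable=True` in the schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OntologyRow:
    """One row of ontology.parquet; attribute names follow the schema columns."""

    field_name: str
    view: str
    ontology_name: Optional[str] = None
    ontology_accession: Optional[str] = None
    ontology_source: Optional[str] = None
    ontology_version: Optional[str] = None
    description: Optional[str] = None
    source_column_name: Optional[str] = None
    source_tool: Optional[str] = None

    def is_mapped(self) -> bool:
        """True when the row carries an ontology accession."""
        return self.ontology_accession is not None


# A field with a PSI-MS mapping.
mapped = OntologyRow(
    field_name="comet_xcorr",
    view="psm",
    ontology_name="Comet:xcorr",
    ontology_accession="MS:1002252",
    ontology_source="MS",
    description="Cross-correlation score from Comet",
)
# A QPX-specific field: every ontology column stays empty.
unmapped = OntologyRow(field_name="rt_start", view="feature")

print(mapped.is_mapped(), unmapped.is_mapped())
```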
Example rows
| field_name | ontology_name | ontology_accession | ontology_source | view | description |
|---|---|---|---|---|---|
| comet_xcorr | Comet:xcorr | MS:1002252 | MS | psm | Cross-correlation score from Comet |
| organism | organism | NCBITaxon | NCBITaxon | sample | Species |
| organism_part | organism part | UBERON | UBERON | sample | Anatomical structure |
| rt | retention time | MS:1000894 | MS | psm | Retention time (seconds) |
| anchor_protein | anchor protein | MS:1001591 | MS | feature | Representative protein |
Querying
-- Look up the ontology term for a field
SELECT ontology_name, ontology_accession
FROM 'PXD014414.ontology.parquet'
WHERE field_name = 'comet_xcorr';
-- All fields with PSI-MS mappings
SELECT field_name, ontology_name, ontology_accession
FROM 'PXD014414.ontology.parquet'
WHERE ontology_source = 'MS';
-- All sample metadata ontology mappings
SELECT field_name, ontology_name, ontology_source
FROM 'PXD014414.ontology.parquet'
WHERE view = 'sample';
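The same lookups can be expressed in plain Python once the rows are in memory. The sketch below hard-codes three of the example rows as dictionaries instead of reading the Parquet file, and the helper names `lookup_term` and `fields_by_source` are hypothetical.

```python
from typing import Optional

# Three of the example rows, hard-coded for illustration only.
ROWS = [
    {"field_name": "comet_xcorr", "ontology_name": "Comet:xcorr",
     "ontology_accession": "MS:1002252", "ontology_source": "MS", "view": "psm"},
    {"field_name": "rt", "ontology_name": "retention time",
     "ontology_accession": "MS:1000894", "ontology_source": "MS", "view": "psm"},
    {"field_name": "organism_part", "ontology_name": "organism part",
     "ontology_accession": "UBERON", "ontology_source": "UBERON", "view": "sample"},
]


def lookup_term(rows: list[dict], field_name: str) -> Optional[tuple[str, str]]:
    """Return (ontology_name, ontology_accession) for the first match, or None."""
    for row in rows:
        if row["field_name"] == field_name:
            return row["ontology_name"], row["ontology_accession"]
    return None


def fields_by_source(rows: list[dict], source: str) -> list[str]:
    """Return the field names whose ontology_source equals `source`."""
    return [row["field_name"] for row in rows if row["ontology_source"] == source]


print(lookup_term(ROWS, "comet_xcorr"))
print(fields_by_source(ROWS, "MS"))
```

Because a field name such as rt can appear in more than one view, a real lookup would probably filter on `view` as well; this sketch returns the first match only.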
Sample Metadata (sample.parquet)
Sample metadata fields map to biological ontologies. These mappings are critical for integration with lamindb/bionty and for validating annotations against community-standard registries.
| Field | Ontology | Ontology Accession | Description |
|---|---|---|---|
| sample_accession | --- | --- | Unique sample identifier (= SDRF source name) |
| organism | NCBI Taxonomy | NCBITaxon | Species (e.g., Homo sapiens = NCBITaxon:9606) |
| organism_part | UBERON | UBERON | Anatomical structure (e.g., heart = UBERON:0000948) |
| disease | EFO / MONDO | EFO, MONDO | Disease state (e.g., breast cancer = MONDO:0007254) |
| cell_line | Cellosaurus | Cellosaurus | Cell line (e.g., HeLa = CVCL_0030) |
| cell_type | Cell Ontology | CL | Cell type (e.g., T cell = CL:0000084) |
| sex | PATO | PATO | Biological sex (e.g., female = PATO:0000383) |
| age | --- | --- | Age of the specimen (free text) |
| developmental_stage | HsapDv / MmusDv | HsapDv | Developmental stage |
| ancestry | HANCESTRO | HANCESTRO | Ancestry category |
| individual | --- | --- | Individual/patient identifier |
| sample_description | --- | --- | Free-text sample description |
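A lightweight sanity check on sample annotations is to confirm that each accession carries a prefix listed for its field. The sketch below is one way to do that; the `EXPECTED_PREFIXES` mapping and the `has_expected_prefix` helper are hypothetical, and including `MmusDv:` for developmental_stage is an assumption drawn from the Ontology column of the table.

```python
# Expected accession prefixes per sample field, taken from the table above.
# Cellosaurus identifiers use "CVCL_" rather than a "PREFIX:" form.
EXPECTED_PREFIXES = {
    "organism": ("NCBITaxon:",),
    "organism_part": ("UBERON:",),
    "disease": ("EFO:", "MONDO:"),
    "cell_line": ("CVCL_",),
    "cell_type": ("CL:",),
    "sex": ("PATO:",),
    "developmental_stage": ("HsapDv:", "MmusDv:"),  # MmusDv is an assumption
    "ancestry": ("HANCESTRO:",),
}


def has_expected_prefix(field: str, accession: str) -> bool:
    """Check that `accession` starts with a prefix listed for `field`.

    Fields absent from EXPECTED_PREFIXES (e.g. free-text fields such as
    `age`) have no ontology constraint here, so any value is accepted.
    """
    prefixes = EXPECTED_PREFIXES.get(field)
    if prefixes is None:
        return True
    return accession.startswith(prefixes)


print(has_expected_prefix("organism", "NCBITaxon:9606"))
print(has_expected_prefix("organism", "UBERON:0000948"))
```

This only inspects the prefix. It does not confirm that the term exists in the ontology, which is what a registry lookup would do.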
lamindb / bionty compatibility
Each sample metadata field maps to a bionty registry:
| QPX field | bionty registry | Example validation |
|---|---|---|
| organism | bt.Organism | bt.Organism.from_source(name="Homo sapiens") |
| organism_part | bt.Tissue | bt.Tissue.from_source(name="heart") |
| disease | bt.Disease | bt.Disease.from_source(name="breast cancer") |
| cell_line | bt.CellLine | bt.CellLine.from_source(name="HeLa") |
| cell_type | bt.CellType | bt.CellType.from_source(name="T cell") |
See Compatibility with lamindb and AnnData for details.
Run Metadata (run.parquet)
| Field | Ontology | Ontology Accession | Description |
|---|---|---|---|
| run_accession | --- | --- | Unique run identifier (= SDRF assay name) |
| run_file_name | --- | --- | Raw data file name without path or extension |
| samples | --- | --- | Sample-channel mapping (FK to sample.parquet) |
| fraction | --- | --- | Fraction identifier |
| instrument | PSI-MS | MS | Mass spectrometer (e.g., Q Exactive HF = MS:1002523) |
| enzymes | PSI-MS | MS | Proteolytic enzymes (e.g., Trypsin = MS:1001251) |
| dissociation_method | PSI-MS | MS | Fragmentation method (e.g., HCD = MS:1000422) |
| modification_parameters | UniMod | UNIMOD | Search modifications (e.g., Phospho = UNIMOD:21) |
| additional_terms | PSI-MS | MS | Other ontology-backed terms (acquisition method, labeling, etc.) |
Dataset Metadata (dataset.parquet)
| Field | Ontology | Ontology Accession | Description |
|---|---|---|---|
| project_accession | --- | --- | ProteomeXchange accession or local project identifier |
| project_title | --- | --- | Title of the project |
| project_description | --- | --- | Description of the project |
| pubmed_id | --- | --- | PubMed ID of the associated publication |
| qpx_version | --- | --- | Version of the QPX format specification |
| software_name | --- | --- | Software that generated the data |
| software_version | --- | --- | Version of the software |
| creation_date | --- | --- | ISO 8601 creation date |
PSM View (psm.parquet)
| Field | Ontology Name | Ontology Accession | Description |
|---|---|---|---|
| sequence | AA sequence | MS:1001344 | Amino acid sequence |
| peptidoform | peptidoform | MS:1003049 | Peptide with modifications in ProForma notation |
| modifications | protein modifications | MS:1000933 | Structured modification list with UNIMOD accessions |
| charge | charge state | MS:1000041 | Charge state of the precursor ion |
| posterior_error_probability | posterior error probability | MS:1001493 | PEP score: probability the PSM is incorrect (lower is better) |
| is_decoy | decoy peptide | MS:1002217 | Whether the PSM is from a decoy sequence |
| calculated_mz | theoretical monoisotopic m/z | MS:1003053 | Theoretical m/z from molecular composition |
| observed_mz | selected ion m/z | MS:1002234 | Experimentally observed m/z |
| rt | retention time | MS:1000894 | Retention time of the MS2 scan (seconds) |
| predicted_rt | predicted retention time | MS:1000897 | Predicted retention time (seconds) |
| run_file_name | --- | --- | Spectrum file name without path or extension |
| scan | scan number | MS:1003057 | Scan identifier as array of integer components |
| protein_accessions | --- | --- | Protein accessions the peptide maps to (optional column) |
| ion_mobility | ion mobility drift time | MS:1002476 | Ion mobility value for the precursor |
| mz_array | m/z array | MS:1000514 | Array of fragment m/z values |
| intensity_array | intensity array | MS:1000515 | Array of fragment intensity values |
| charge_array | --- | --- | Array of fragment ion charge values |
| ion_type_array | --- | --- | Array of fragment ion type annotations (b, y, a, etc.) |
| ion_mobility_array | --- | --- | Array of fragment ion mobility values |
| mass_error_ppm | --- | --- | Mass error in ppm |
| missed_cleavages | missed cleavages | MS:1003044 | Number of missed enzymatic cleavages |
| additional_scores | --- | --- | Score array with name, value, and direction (see Scores) |
| cv_params | --- | --- | Controlled vocabulary parameters |
Feature View (feature.parquet)
| Field | Ontology Name | Ontology Accession | Description |
|---|---|---|---|
| sequence | AA sequence | MS:1001344 | Amino acid sequence |
| peptidoform | peptidoform | MS:1003049 | Peptide with modifications in ProForma notation |
| modifications | protein modifications | MS:1000933 | Structured modification list |
| charge | charge state | MS:1000041 | Charge of the quantified analyte |
| is_decoy | decoy peptide | MS:1002217 | Whether the peptide is from a decoy sequence |
| calculated_mz | theoretical monoisotopic m/z | MS:1003053 | Theoretical m/z |
| observed_mz | selected ion m/z | MS:1002234 | Experimentally observed m/z |
| rt | retention time | MS:1000894 | Precursor retention time (seconds) |
| rt_start | --- | --- | Start of the retention time window |
| rt_stop | --- | --- | End of the retention time window |
| predicted_rt | predicted retention time | MS:1000897 | Predicted retention time (seconds) |
| ion_mobility | ion mobility drift time | MS:1002476 | Ion mobility value for the precursor |
| ion_mobility_start | --- | --- | Start ion mobility value |
| ion_mobility_stop | --- | --- | End ion mobility value |
| run_file_name | --- | --- | Run file containing the feature |
| id_run_file_name | --- | --- | Run file containing the best PSM for this feature |
| scan | scan number | MS:1003057 | Scan identifier of the best PSM |
| intensities | --- | --- | Primary intensity across labels (see Intensities) |
| additional_intensities | --- | --- | Tool-provided intensities (normalized, LFQ, iBAQ) |
| pg_accessions | --- | --- | Protein group accessions |
| anchor_protein | anchor protein | MS:1001591 | Representative protein of the protein group |
| pg_positions | --- | --- | Peptide start/end positions in each protein |
| pg_global_qvalue | protein-level global FDR | MS:1001214 | Global q-value of the protein group |
| unique | --- | --- | Unique peptide indicator |
| gg_accessions | --- | --- | Gene group identifiers |
| gg_names | --- | --- | Gene group names |
| mass_error_ppm | --- | --- | Mass error in ppm |
| missed_cleavages | missed cleavages | MS:1003044 | Number of missed enzymatic cleavages |
| posterior_error_probability | posterior error probability | MS:1001493 | PEP for the peptide match |
| additional_scores | --- | --- | Score array (see Scores) |
| cv_params | --- | --- | Controlled vocabulary parameters |
Protein Group View (pg.parquet)
| Field | Ontology Name | Ontology Accession | Description |
|---|---|---|---|
| pg_accessions | --- | --- | Protein accessions within this group |
| pg_names | --- | --- | Descriptive names for the proteins |
| gg_accessions | --- | --- | Gene group accessions |
| gg_names | --- | --- | Gene names |
| anchor_protein | anchor protein | MS:1001591 | Representative protein of the group |
| run_file_name | --- | --- | Raw file containing this protein group |
| peptide_counts | --- | --- | Peptide sequence counts (unique + total) |
| feature_counts | --- | --- | Feature counts (unique + total) |
| global_qvalue | protein-level global FDR | MS:1001214 | Global q-value at the experiment level |
| pg_qvalue | --- | --- | Run-level protein group q-value |
| is_decoy | decoy peptide | MS:1002217 | Whether the protein group is a decoy |
| contaminant | --- | --- | Contaminant indicator (1 = contaminant) |
| sequence_coverage | sequence coverage | MS:1001093 | Percentage of protein sequence covered by peptides |
| molecular_weight | molecular mass | MS:1000224 | Molecular weight of the protein (kDa) |
| intensities | --- | --- | Primary intensity across labels |
| additional_intensities | --- | --- | Tool-provided intensities |
| peptides | --- | --- | Peptide counts per individual protein in the group |
| additional_scores | --- | --- | Additional scores and metrics |
| cv_params | --- | --- | Controlled vocabulary parameters |
Mass Spectra View (mz.parquet)
| Field | Ontology Name | Ontology Accession | Description |
|---|---|---|---|
| id | spectrum identifier nativeID format | MS:1000767 | Unique identifier for the scan |
| ms_level | ms level | MS:1000511 | MS level (1 = MS1, 2 = MS2, etc.) |
| centroid | centroid spectrum | MS:1000127 | Whether data is centroided |
| scan_start_time | scan start time | MS:1000016 | Start time of the scan (minutes) |
| inverse_ion_mobility | inverse reduced ion mobility | MS:1002815 | Inverse ion mobility (1/K0) for TIMS data |
| ion_injection_time | ion injection time | MS:1000927 | Ion injection time (milliseconds) |
| total_ion_current | total ion current | MS:1000285 | Total ion current for the scan |
| precursors | precursor | MS:1000456 | Precursor ions for this MS/MS scan |
| mz | m/z array | MS:1000514 | Array of m/z values |
| intensity | intensity array | MS:1000515 | Array of intensity values |
| cv_params | --- | --- | Additional CV parameters |
Score Fields
Score names used in additional_scores, with their PSI-MS ontology mappings where one exists (--- marks a score without a mapping). All QPX score names use snake_case -- the ontology name column shows the proper CV term.
| QPX Name (snake_case) | Ontology Name | Ontology Accession | Direction |
|---|---|---|---|
| posterior_error_probability | posterior error probability | MS:1001493 | lower is better |
| global_qvalue | PSM-level global FDR | MS:1002350 | lower is better |
| pg_global_qvalue | protein-level global FDR | MS:1001214 | lower is better |
| andromeda_score | Andromeda:score | MS:1002338 | higher is better |
| andromeda_delta_score | Andromeda:delta score | MS:1003433 | higher is better |
| percolator_score | percolator:score | MS:1001492 | higher is better |
| openms_score | OpenMS:Best PSM Score | MS:1003114 | higher is better |
| msgf_raw_score | MS-GF:RawScore | MS:1002049 | higher is better |
| comet_xcorr | Comet:xcorr | MS:1002252 | higher is better |
| comet_deltacn | Comet:deltacn | MS:1002253 | higher is better |
| comet_expect | Comet:expectation value | MS:1002257 | lower is better |
| msgf_spec_evalue | MS-GF:SpecEValue | MS:1002052 | lower is better |
| sage_hyperscore | Sage:hyperscore | --- | higher is better |
| diann_qvalue | --- | --- | lower is better |
| diann_global_qvalue | --- | --- | lower is better |
| diann_cscore | --- | --- | higher is better |
| consensus_support | --- | --- | higher is better |
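The Direction column matters whenever two values of the same score are compared, for example when picking the best PSM for a peptide. The sketch below shows one way to use it; the `SCORE_DIRECTION` mapping covers only a few rows of the table, and the `is_better` helper is a hypothetical name.

```python
# Direction of a few scores, copied from the table above.
SCORE_DIRECTION = {
    "posterior_error_probability": "lower",
    "global_qvalue": "lower",
    "andromeda_score": "higher",
    "comet_xcorr": "higher",
    "comet_expect": "lower",
    "sage_hyperscore": "higher",
}


def is_better(score_name: str, candidate: float, reference: float) -> bool:
    """Return True if `candidate` is strictly better than `reference`."""
    direction = SCORE_DIRECTION[score_name]  # raises KeyError for unknown scores
    if direction == "lower":
        return candidate < reference
    return candidate > reference


print(is_better("comet_xcorr", 3.2, 2.1))     # higher is better
print(is_better("global_qvalue", 0.05, 0.01)) # lower is better
```

Keeping the direction in data rather than in code means a consumer can rank by any score it finds in additional_scores without tool-specific branches.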
QPX-specific Fields
Fields defined by QPX without standardized ontology terms.
| Field | View(s) | Description |
|---|---|---|
| run_file_name | PSM, Feature, PG | Spectrum/run file name without path or extension |
| pg_accessions | Feature, PG | Protein group accessions |
| pg_names | PG | Descriptive protein names |
| gg_accessions | PG | Gene group accessions |
| gg_names | PG | Gene names |
| rt_start / rt_stop | Feature | Retention time window boundaries |
| ion_mobility_start / ion_mobility_stop | Feature | Ion mobility window boundaries |
| intensities / additional_intensities | Feature, PG | Intensity structures (see Intensities) |
| peptide_counts / feature_counts | PG | Count structures for peptides and features |
| contaminant | PG | Contaminant indicator |
| pg_positions | Feature | Peptide positions within each protein |
| id_run_file_name / id_scan | Feature | Reference to best PSM for the feature |
| protein_accessions | PSM | Protein accessions (optional column) |
| charge_array / ion_type_array / ion_mobility_array | PSM | Fragment ion annotation arrays |
| parent_ion_fraction | Score | Fraction of target peak intensity in the isolation window |
| precursor_quantification_score | Score | QuantUMS-derived quantification confidence: 1.0 / (1.0 + SD) |
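The formula given for precursor_quantification_score, 1.0 / (1.0 + SD), can be written as a one-line function. This page does not define SD, so the sketch below treats it simply as a non-negative number (presumably a standard deviation reported by QuantUMS, which is an assumption); the function name is reused from the field name for readability.

```python
def precursor_quantification_score(sd: float) -> float:
    """Map a non-negative spread value `sd` to a confidence in (0, 1]."""
    if sd < 0:
        raise ValueError("sd must be non-negative")
    return 1.0 / (1.0 + sd)


# Zero spread gives the maximum confidence; larger spread lowers it.
for sd in (0.0, 1.0, 4.0):
    print(sd, precursor_quantification_score(sd))
```

The mapping is monotonically decreasing, so it is consistent with a "higher is better" reading of the score.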
Compatibility with lamindb and AnnData
AnnData
The QPX expression views (ae.h5ad, de.h5ad) use AnnData natively. The ontology mapping provides semantic definitions for the .obs (sample) and .var (protein/gene) slots:
| AnnData slot | QPX source | Ontology-backed fields |
|---|---|---|
| .obs | sample.parquet | organism (NCBITaxon), organism_part (UBERON), disease (EFO/MONDO), cell_type (CL), cell_line (Cellosaurus), sex (PATO) |
| .var | pg.parquet | protein (UniProt), gene_name (HGNC/Ensembl) |
| .uns | dataset.parquet | project_accession, qpx_version |
When building AnnData objects from QPX Parquet files, the .obs columns inherit the ontology definitions from sample.parquet. This means any AnnData created from QPX data carries structured, ontology-backed annotations that tools like scanpy and lamindb can validate.
lamindb / bionty
lamindb uses bionty registries to validate biological entities. QPX's sample metadata fields map directly to bionty registries:
| QPX field | bionty registry | Ontology source | Example |
|---|---|---|---|
| organism | bt.Organism | NCBI Taxonomy | Homo sapiens → NCBITaxon:9606 |
| organism_part | bt.Tissue | UBERON | heart → UBERON:0000948 |
| disease | bt.Disease | MONDO | breast cancer → MONDO:0007254 |
| cell_type | bt.CellType | Cell Ontology | T cell → CL:0000084 |
| cell_line | bt.CellLine | Cellosaurus | HeLa → CVCL_0030 |
| sex | --- | PATO | female → PATO:0000383 |
| developmental_stage | bt.DevelopmentalStage | HsapDv | --- |
This means QPX datasets can be registered in lamindb with validated, ontology-backed annotations:
import lamindb as ln
import bionty as bt
import anndata as ad
# Read QPX AnnData
adata = ad.read_h5ad("PXD014414.ae.h5ad")
# Validate against bionty registries
bt.Organism.from_source(name="Homo sapiens")
bt.Tissue.from_source(name="heart")
bt.Disease.from_source(name="breast cancer")
# Register as a lamindb artifact with validated metadata
artifact = ln.Artifact.from_anndata(adata, description="PXD014414 absolute expression")
artifact.save()
Parquet views and lamindb
The Parquet data views (psm, feature, pg, mz) can also be registered as lamindb artifacts. The ontology mapping allows lamindb to understand what each field represents:
import lamindb as ln
import bionty as bt  # needed for the bt.* registry calls below
# Register Parquet files as artifacts
artifact = ln.Artifact("PXD014414.feature.parquet", description="PXD014414 features")
artifact.save()
# Link to biological context via validated sample metadata
artifact.organisms.add(bt.Organism.from_source(name="Homo sapiens"))
artifact.tissues.add(bt.Tissue.from_source(name="heart"))
The key benefit: because QPX defines clear ontology mappings for sample metadata, lamindb can automatically validate and link QPX datasets to the correct biological entities in its registry. No manual curation of field semantics is needed.
Ontology references
| Ontology | Prefix | URL | Used for |
|---|---|---|---|
| PSI-MS | MS: | OLS4 | MS instrument, acquisition, identification fields |
| UniMod | UNIMOD: | UniMod | Post-translational modifications |
| NCBI Taxonomy | NCBITaxon: | OLS4 | Organism / species |
| UBERON | UBERON: | OLS4 | Anatomical structures / tissues |
| EFO | EFO: | OLS4 | Experimental factors, diseases |
| MONDO | MONDO: | OLS4 | Disease ontology |
| Cell Ontology | CL: | OLS4 | Cell types |
| Cellosaurus | CVCL_ | Cellosaurus | Cell lines |
| PATO | PATO: | OLS4 | Phenotypic qualities (sex, etc.) |
| HANCESTRO | HANCESTRO: | OLS4 | Human ancestry |
| HsapDv | HsapDv: | OLS4 | Human developmental stages |