PSM View
The PSM (Peptide Spectrum Match) view captures spectrum-level identification results. Each row represents a single match between a mass spectrum and a peptide sequence, including the identification scores, optional spectral data, and protein mappings.
Use Cases
- AI/ML training: Provides peptide-spectrum pairs with optional spectral arrays (m/z, intensity, ion types) for training intensity prediction, de novo sequencing, and clustering models.
- Spectrum-level analysis: Enables detailed inspection of individual identifications, including retention time, charge state, and search engine scores.
- DDA identification results: Designed primarily for data-dependent acquisition (DDA) workflows where each spectrum yields one or more peptide identifications.
Schema
Core Identification Fields
| Field | Description | Type | Required |
|---|---|---|---|
| sequence | Unmodified peptide amino acid sequence | string | yes |
| peptidoform | Peptide sequence with modifications in ProForma notation | string | yes |
| modifications | Structured list of modifications with name, accession, position, and localization scores | array[struct], null | no |
| charge | Charge state of the precursor ion | int16 | yes |
| posterior_error_probability | Posterior error probability (PEP) for the peptide-spectrum match: the probability that the PSM is incorrect. Lower values indicate higher confidence (lower is better). Ranges from 0.0 (confident) to 1.0 (likely incorrect) | float64, null | no |
| is_decoy | Whether the PSM is a decoy match (true) or a target match (false) | bool | yes |
| calculated_mz | Theoretical peptide mass-to-charge ratio based on identified sequence and modifications | float32 | yes |
| observed_mz | Experimental observed peptide mass-to-charge ratio | float32 | yes |
| mass_error_ppm | Mass error in ppm: 1e6 × (observed_mz − calculated_mz) / calculated_mz | float32, null | no |
| missed_cleavages | Number of missed enzymatic cleavages | int16, null | no |
| rt | MS2 scan's retention time (in seconds) | float32, null | no |
| predicted_rt | Predicted retention time of the peptide (in seconds) | float32, null | no |
| run_file_name | Spectrum file name without path or extension | string | yes |
| scan | Scan identifier as an array of integer components (e.g., [43920] for single-scan instruments, [10, 1, 345] for Waters function/process/scan) | array[int32] | yes |
| additional_scores | List of score structures with name, value, and direction indicator | array[struct], null | no |
| cv_params | Optional list of controlled vocabulary parameters for additional metadata | array[struct], null | no |
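The mass_error_ppm formula in the table can be checked with a few lines of Python. This is a minimal sketch: the function name `mass_error_ppm` is an illustrative helper rather than part of any QPX library, and the input values are the calculated_mz and observed_mz from the basic PSM record in the Example section.

```python
def mass_error_ppm(observed_mz: float, calculated_mz: float) -> float:
    """Return 1e6 * (observed_mz - calculated_mz) / calculated_mz."""
    if calculated_mz <= 0:
        raise ValueError("calculated_mz must be positive")
    return 1e6 * (observed_mz - calculated_mz) / calculated_mz


# Values from the basic PSM record example: roughly 0.63 ppm.
error = mass_error_ppm(observed_mz=635.3315, calculated_mz=635.3311)
print(f"{error:.2f} ppm")
```

A positive result means the observed m/z is higher than the theoretical value; a negative result means it is lower.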
Optional Fields
These fields are optional and may not exist in the file at all. They are included based on conversion settings or user preference.
| Field | Description | Type | Required |
|---|---|---|---|
| protein_accessions | Protein accessions of all proteins that the peptide maps to. Optional because protein mapping can be recovered from the feature and protein group views | array[string], null | no |
| cross_links | Cross-link information for XL-MS experiments. Each entry describes one cross-link site. null for non-cross-linked PSMs | array[struct], null | no |
| ion_mobility | Ion mobility value for the precursor ion | float32, null | no |
| mz_array | Array of m/z values for the spectrum | array[float32], null | no |
| intensity_array | Array of intensity values for the spectrum | array[float32], null | no |
| charge_array | Array of fragment ion charge values | array[int32], null | no |
| ion_type_array | Array of fragment ion type annotations (e.g., b, y, a) | array[string], null | no |
| ion_mobility_array | Array of fragment ion mobility values | array[float32], null | no |
Nullable vs Optional
Core fields marked "no" in the Required column are nullable: the column always exists in the file, but individual values may be null. Optional fields (for example protein_accessions and the spectral arrays) may be absent from the file entirely, depending on conversion settings. Protein mappings can be recovered by joining with the feature and protein group views.
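The difference matters when reading a file: a missing column and a null value call for different checks. The sketch below illustrates the distinction with plain Python containers. The helper `field_status` and the sample column set are hypothetical and stand in for whatever reader is actually used.

```python
from typing import Any, Mapping, Set


def field_status(columns: Set[str], row: Mapping[str, Any], field: str) -> str:
    """Classify a field as 'absent', 'null', or 'present' for one row."""
    if field not in columns:
        return "absent"   # optional column was not written to the file
    if row.get(field) is None:
        return "null"     # column exists, but this row has no value
    return "present"


# Hypothetical file written without spectral arrays or protein accessions.
columns = {"sequence", "charge", "rt", "predicted_rt"}
row = {"sequence": "AAAAAAAAAAGAAGGR", "charge": 2, "rt": 5140.98, "predicted_rt": None}

for name in ("rt", "predicted_rt", "mz_array"):
    print(name, field_status(columns, row, name))
```

Here rt is present, predicted_rt is null (a nullable core field), and mz_array is absent (an optional field that was not written).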
Cross-Linking Fields
The cross_links field supports XL-MS (cross-linking mass spectrometry) experiments following the mzIdentML 1.3 specification. Each entry in the array represents one cross-link site.
| Field | Description | Type | Required |
|---|---|---|---|
| xl_type | Type of cross-link: "inter" (between two peptides), "intra" (within same peptide), or "dead-end" (one reactive end) | string | yes |
| partner_sequence | Plain amino acid sequence of the beta (partner) peptide. null for dead-end and intra-peptide links | string, null | no |
| partner_peptidoform | Beta peptide in ProForma notation. null for dead-end and intra-peptide links | string, null | no |
| donor_position | Cross-link attachment position on the alpha peptide (1-indexed) | int32 | yes |
| acceptor_position | Attachment position on the beta peptide (1-indexed). null for dead-end links | int32, null | no |
| linker_name | Name of the cross-linker reagent (e.g., "DSS", "BS3", "DSSO") | string | yes |
| linker_accession | XLMOD controlled vocabulary accession for the linker (e.g., "XLMOD:02001") | string, null | no |
| linker_mass | Cross-linker mass in Daltons | float64 | yes |
Cross-linked PSM
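As an illustration only, a cross_links entry for an inter-peptide link might look like the Python dictionary below. The partner sequence, positions, and linker mass are made-up example values (the DSS name and XLMOD:02001 accession are the ones quoted in the table above), and `check_cross_link` is a hypothetical helper that encodes the nullability rules from the table; it is not part of any QPX tooling.

```python
from typing import Any, Dict, List


def check_cross_link(entry: Dict[str, Any]) -> List[str]:
    """Return a list of rule violations for one cross_links entry."""
    problems = []
    xl_type = entry.get("xl_type")
    if xl_type not in {"inter", "intra", "dead-end"}:
        problems.append(f"unknown xl_type: {xl_type!r}")
    # Partner peptide fields are null for dead-end and intra-peptide links.
    if xl_type in {"intra", "dead-end"}:
        for key in ("partner_sequence", "partner_peptidoform"):
            if entry.get(key) is not None:
                problems.append(f"{key} must be null for {xl_type} links")
    # The acceptor position is null only for dead-end links.
    if xl_type == "dead-end" and entry.get("acceptor_position") is not None:
        problems.append("acceptor_position must be null for dead-end links")
    # Required fields must always carry a value.
    for key in ("donor_position", "linker_name", "linker_mass"):
        if entry.get(key) is None:
            problems.append(f"{key} is required")
    return problems


example = {
    "xl_type": "inter",
    "partner_sequence": "LKECCEKPLLEK",    # made-up beta peptide
    "partner_peptidoform": "LKECCEKPLLEK",
    "donor_position": 5,                    # 1-indexed, alpha peptide
    "acceptor_position": 2,                 # 1-indexed, beta peptide
    "linker_name": "DSS",
    "linker_accession": "XLMOD:02001",
    "linker_mass": 138.068,                 # illustrative value in Daltons
}
print(check_cross_link(example))  # an empty list means no rule was violated
```

In a PSM record, such an entry would appear as one element of the cross_links array alongside the usual alpha-peptide fields (sequence, peptidoform, charge, and so on).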
Non-cross-linked PSMs
For regular (non-XL) PSMs, the cross_links field is null. This keeps the schema clean: because all cross-link details live inside one nullable struct array, non-cross-linked datasets carry no extra columns.
Shared Fields
Several fields in the PSM view use structures shared across other QPX views:
- For details on the modifications field structure, see Modifications.
- For details on additional_scores and score semantics, see Scores.
- For details on cv_params usage and recommended terms, see Scores & CV Terms.
Example
Basic PSM Record
{
"sequence": "AAAAAAAAAAGAAGGR",
"peptidoform": "[Acetyl]-AAAAAAAAAAGAAGGR",
"charge": 2,
"scan": [42164],
"rt": 5140.98,
"calculated_mz": 635.3311,
"observed_mz": 635.3315,
"is_decoy": false,
"posterior_error_probability": 5.58e-20,
"predicted_rt": null,
"run_file_name": "20200101_sample_A",
"protein_accessions": ["Q86U42-2", "Q86U42"],
"modifications": [
{
"name": "Acetyl",
"accession": "UniMod:1",
"positions": [
{
"position": 0,
"amino_acid": null,
"scores": []
}
]
}
],
"additional_scores": [
{
"score_name": "andromeda_score",
"score_value": 175.73,
"higher_better": true
},
{
"score_name": "andromeda_delta_score",
"score_value": 160.47,
"higher_better": true
},
{
"score_name": "parent_ion_fraction",
"score_value": 0.0,
"higher_better": true
}
],
"cv_params": [
{"cv_name": "dissociation method", "cv_value": "HCD"},
{"cv_name": "normalized collision energy", "cv_value": "28"}
]
}
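Because additional_scores is a list of structs rather than a set of fixed columns, looking up one score means scanning the list by name. The sketch below assumes the struct field names shown in the record above (score_name, score_value, higher_better); the helpers `get_score` and `is_better` are illustrative, not part of a QPX API.

```python
from typing import Any, Dict, Optional


def get_score(record: Dict[str, Any], name: str) -> Optional[float]:
    """Return the value of the named score, or None if it is not present."""
    for score in record.get("additional_scores") or []:
        if score["score_name"] == name:
            return score["score_value"]
    return None


def is_better(score: Dict[str, Any], other_value: float) -> bool:
    """True if the score's value beats other_value, given its direction flag."""
    if score["higher_better"]:
        return score["score_value"] > other_value
    return score["score_value"] < other_value


record = {
    "additional_scores": [
        {"score_name": "andromeda_score", "score_value": 175.73, "higher_better": True},
        {"score_name": "andromeda_delta_score", "score_value": 160.47, "higher_better": True},
    ]
}
print(get_score(record, "andromeda_score"))   # 175.73
print(get_score(record, "global_qvalue"))     # None
```

The `or []` guard covers the case where additional_scores is null, which the schema allows.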
PSM with Spectral Data
When spectral arrays are included, the record also contains peak-level data:
{
"sequence": "AAAAAAAAAAGAAGGR",
"peptidoform": "[Acetyl]-AAAAAAAAAAGAAGGR",
"charge": 2,
"scan": [42164],
"rt": 5140.98,
"calculated_mz": 635.3311,
"observed_mz": 635.3315,
"is_decoy": false,
"run_file_name": "20200101_sample_A",
"mz_array": [175.119, 289.163, 360.200, 431.236, 488.258],
"intensity_array": [1234.5, 5678.9, 3456.7, 2345.6, 1234.5],
"charge_array": [1, 1, 1, 1, 1],
"ion_type_array": ["y1", "y2", "b3", "b4", "b5"]
}
File Metadata
PSM Parquet files store file-level metadata as key-value pairs in the Parquet footer. The following metadata fields are defined:
| Field | Description |
|---|---|
| qpx_version | Version of the QPX format used to generate the file |
| software_provider | Name and version of the software that generated the data |
| scan_format | Format of scan identifiers: scan, index, or nativeId |
| creator | Name of the tool or person who created the file |
| file_type | Type of the file (value: psm_file) |
| creation_date | Date when the file was created |
| compression_format | Compression algorithm used: zstd (default), snappy, gzip, lzo, or none |
Reading file metadata in Python
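A minimal sketch of one way to do this, assuming the pyarrow package is installed and that the keys in the table above are stored as UTF-8 strings in the footer. The helper names and the sample values are hypothetical.

```python
from typing import Dict, Mapping, Optional

QPX_KEYS = (
    "qpx_version", "software_provider", "scan_format", "creator",
    "file_type", "creation_date", "compression_format",
)


def decode_footer_metadata(raw: Optional[Mapping[bytes, bytes]]) -> Dict[str, str]:
    """Decode Parquet key-value metadata and keep only the fields listed above."""
    decoded = {}
    for key_bytes, value_bytes in (raw or {}).items():
        key = key_bytes.decode("utf-8")
        if key in QPX_KEYS:  # skip unrelated entries such as ARROW:schema
            decoded[key] = value_bytes.decode("utf-8")
    return decoded


def read_psm_metadata(path: str) -> Dict[str, str]:
    """Read the footer of a Parquet file without loading any row data."""
    import pyarrow.parquet as pq  # third-party dependency, imported lazily

    return decode_footer_metadata(pq.read_metadata(path).metadata)


# Decoding step shown on a hand-written footer; the values are illustrative.
raw = {b"qpx_version": b"1.0", b"file_type": b"psm_file", b"ARROW:schema": b"..."}
print(decode_footer_metadata(raw))
```

With a real file, `read_psm_metadata("path/to/file.parquet")` would return the same kind of dictionary; the path here is a placeholder.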
Notes
DDA-specific view
The PSM view is designed primarily for DDA (data-dependent acquisition) methods. It is not recommended for DIA experiments, where the feature view should be used instead. Generating a PSM file for DIA data would produce duplicated information relative to the feature view.
PEP is a PSM-level metric
posterior_error_probability is defined only in the PSM view. It represents the probability that a specific peptide-spectrum match is incorrect (lower is better). Most major tools export PEP directly as P(incorrect), for example Percolator (posterior_error_prob) and MaxQuant (PEP). FragPipe is an exception: it exports the PeptideProphet Probability, which is P(correct), so converters must compute PEP = 1 - probability. The feature and peptide views do not carry PEP as a top-level field; use additional_scores or best_id_score if a derived PEP value is needed at those levels.
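The conversion for tools that report P(correct) is a one-liner, sketched below with an input range check. The function name `pep_from_probability` is illustrative and does not refer to an existing converter API.

```python
def pep_from_probability(probability: float) -> float:
    """Convert P(correct) to a posterior error probability, PEP = 1 - probability."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError(f"probability must be in [0, 1], got {probability}")
    return 1.0 - probability


print(pep_from_probability(0.75))  # 0.25
```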
- Relationship to feature view: The PSM view captures individual spectrum matches, while the Feature View aggregates these into quantified peptide features with intensity data. A single feature may correspond to multiple PSMs across different scans.
- Protein inference: Protein inference results should not be the primary focus of the PSM view. protein_accessions is an optional column; protein mappings can be recovered by joining through the feature and protein group views. When included, it is useful for peptide filtering and protein-level browsing. For full protein group information, use the protein group (PG) view.
- Spectral arrays: The mz_array and intensity_array are parallel arrays of the same length. For large-scale spectral storage, the dedicated mass spectra (mz) view is recommended.
- Recommended additional scores: global_qvalue (experiment-level PSM q-value), rank (peptide rank in search results), and pg_global_qvalue (protein group q-value used for filtering).
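The parallel-array property noted under "Spectral arrays" is easy to verify before using a record for training. The sketch below checks it with plain Python. It also applies the same length check to charge_array, ion_type_array, and ion_mobility_array when they are present, which is an assumption that goes beyond what the notes state. The helper names are hypothetical.

```python
from typing import Any, Dict

SPECTRAL_ARRAYS = ("mz_array", "intensity_array", "charge_array",
                   "ion_type_array", "ion_mobility_array")


def spectral_array_lengths(record: Dict[str, Any]) -> Dict[str, int]:
    """Return the length of every spectral array that is present and non-null."""
    return {name: len(record[name]) for name in SPECTRAL_ARRAYS
            if record.get(name) is not None}


def arrays_are_parallel(record: Dict[str, Any]) -> bool:
    """True if all present spectral arrays share one length."""
    return len(set(spectral_array_lengths(record).values())) <= 1


# Arrays from the "PSM with Spectral Data" example record.
record = {
    "mz_array": [175.119, 289.163, 360.200, 431.236, 488.258],
    "intensity_array": [1234.5, 5678.9, 3456.7, 2345.6, 1234.5],
    "charge_array": [1, 1, 1, 1, 1],
    "ion_type_array": ["y1", "y2", "b3", "b4", "b5"],
}
print(arrays_are_parallel(record))  # True
```

A record with no spectral arrays at all also passes, since there is nothing to compare.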