PSM View

The PSM (Peptide Spectrum Match) view captures spectrum-level identification results. Each row represents a single match between a mass spectrum and a peptide sequence, including the identification scores, optional spectral data, and protein mappings.

Use Cases

  • AI/ML training: Provides peptide-spectrum pairs with optional spectral arrays (m/z, intensity, ion types) for training intensity prediction, de novo sequencing, and clustering models.
  • Spectrum-level analysis: Enables detailed inspection of individual identifications, including retention time, charge state, and search engine scores.
  • DDA identification results: Designed primarily for data-dependent acquisition (DDA) workflows where each spectrum yields one or more peptide identifications.

Schema

Core Identification Fields

| Field | Description | Type | Required |
|---|---|---|---|
| sequence | Unmodified peptide amino acid sequence | string | yes |
| peptidoform | Peptide sequence with modifications in ProForma notation | string | yes |
| modifications | Structured list of modifications with name, accession, position, and localization scores | array[struct], null | no |
| charge | Charge state of the precursor ion | int16 | yes |
| posterior_error_probability | Posterior error probability (PEP) for the peptide-spectrum match: the probability that the PSM is incorrect. Ranges from 0.0 (confident) to 1.0 (likely incorrect); lower is better | float64, null | no |
| is_decoy | Whether the PSM is a decoy match (true) or a target match (false) | bool | yes |
| calculated_mz | Theoretical peptide mass-to-charge ratio based on identified sequence and modifications | float32 | yes |
| observed_mz | Experimental observed peptide mass-to-charge ratio | float32 | yes |
| mass_error_ppm | Mass error in ppm: 1e6 × (observed_mz − calculated_mz) / calculated_mz | float32, null | no |
| missed_cleavages | Number of missed enzymatic cleavages | int16, null | no |
| rt | MS2 scan's retention time (in seconds) | float32, null | no |
| predicted_rt | Predicted retention time of the peptide (in seconds) | float32, null | no |
| run_file_name | Spectrum file name without path or extension | string | yes |
| scan | Scan identifier as an array of integer components (e.g., [43920] for single-scan instruments, [10, 1, 345] for Waters function/process/scan) | array[int32] | yes |
| additional_scores | List of score structures with name, value, and direction indicator | array[struct], null | no |
| cv_params | Optional list of controlled vocabulary parameters for additional metadata | array[struct], null | no |
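
As a worked illustration of the mass_error_ppm formula, the sketch below applies it to the calculated_mz and observed_mz values used in the Basic PSM Record example on this page. The function name mass_error_ppm is a hypothetical helper chosen for this illustration, not part of any QPX library.

```python
def mass_error_ppm(observed_mz: float, calculated_mz: float) -> float:
    """Return the signed mass error in parts per million (ppm)."""
    if calculated_mz <= 0:
        raise ValueError("calculated_mz must be positive")
    return 1e6 * (observed_mz - calculated_mz) / calculated_mz


# Values taken from the Basic PSM Record example.
error = mass_error_ppm(observed_mz=635.3315, calculated_mz=635.3311)
print(f"{error:.2f} ppm")  # prints "0.63 ppm"
```

A positive value means the observed m/z is higher than the theoretical one; a negative value means it is lower.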

Optional Fields

These fields are optional and may not exist in the file at all. They are included based on conversion settings or user preference.

| Field | Description | Type | Required |
|---|---|---|---|
| protein_accessions | Protein accessions of all proteins that the peptide maps to. Optional because protein mapping can be recovered from the feature and protein group views | array[string], null | no |
| cross_links | Cross-link information for XL-MS experiments. Each entry describes one cross-link site. null for non-cross-linked PSMs | array[struct], null | no |
| ion_mobility | Ion mobility value for the precursor ion | float32, null | no |
| mz_array | Array of m/z values for the spectrum | array[float32], null | no |
| intensity_array | Array of intensity values for the spectrum | array[float32], null | no |
| charge_array | Array of fragment ion charge values | array[int32], null | no |
| ion_type_array | Array of fragment ion type annotations (e.g., b, y, a) | array[string], null | no |
| ion_mobility_array | Array of fragment ion mobility values | array[float32], null | no |

Nullable vs Optional

Core fields marked "no" in the Required column are nullable: the column always exists in the file, but individual values may be null. Optional fields (such as protein accessions and spectral data) may be absent from the file entirely, depending on conversion settings. Protein mappings can be recovered by joining with the feature and protein group views.
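
The distinction matters when reading a file: a reader should check whether an optional column exists before accessing it, and separately handle null values in columns that do exist. The following is a minimal sketch of that check. The helper names are hypothetical, and the list of column names is assumed to come from the file schema (for example, `pyarrow.parquet.read_schema(path).names`).

```python
# Optional PSM columns, as listed in the Optional Fields table.
OPTIONAL_COLUMNS = (
    "protein_accessions", "cross_links", "ion_mobility",
    "mz_array", "intensity_array", "charge_array",
    "ion_type_array", "ion_mobility_array",
)


def absent_optional_columns(column_names):
    """Return the optional PSM columns that are not in the file at all."""
    present = set(column_names)
    return [name for name in OPTIONAL_COLUMNS if name not in present]


def describe_value(row, column_names, field):
    """Classify a field as 'absent', 'null', or 'value' for one row."""
    if field not in column_names:
        return "absent"  # the column was never written to the file
    return "null" if row.get(field) is None else "value"


# Hypothetical file: a few core columns plus protein_accessions, no spectral arrays.
column_names = ["sequence", "charge", "rt", "protein_accessions"]
row = {
    "sequence": "AAAAAAAAAAGAAGGR",
    "charge": 2,
    "rt": None,
    "protein_accessions": ["Q86U42-2", "Q86U42"],
}

print(absent_optional_columns(column_names))
print(describe_value(row, column_names, "rt"))        # nullable core field: "null"
print(describe_value(row, column_names, "mz_array"))  # optional field: "absent"
```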

Cross-Linking Fields

The cross_links field supports XL-MS (cross-linking mass spectrometry) experiments following the mzIdentML 1.3 specification. Each entry in the array represents one cross-link site.

| Field | Description | Type | Required |
|---|---|---|---|
| xl_type | Type of cross-link: "inter" (between two peptides), "intra" (within same peptide), or "dead-end" (one reactive end) | string | yes |
| partner_sequence | Plain amino acid sequence of the beta (partner) peptide. null for dead-end and intra-peptide links | string, null | no |
| partner_peptidoform | Beta peptide in ProForma notation. null for dead-end and intra-peptide links | string, null | no |
| donor_position | Cross-link attachment position on the alpha peptide (1-indexed) | int32 | yes |
| acceptor_position | Attachment position on the beta peptide (1-indexed). null for dead-end links | int32, null | no |
| linker_name | Name of the cross-linker reagent (e.g., "DSS", "BS3", "DSSO") | string | yes |
| linker_accession | XLMOD controlled vocabulary accession for the linker (e.g., "XLMOD:02001") | string, null | no |
| linker_mass | Cross-linker mass in Daltons | float64 | yes |

Cross-linked PSM

{
  "sequence": "AAKPEPTIDER",
  "cross_links": [
    {
      "xl_type": "inter",
      "partner_sequence": "LKSEQUENCER",
      "partner_peptidoform": "LKSEQUENCER",
      "donor_position": 3,
      "acceptor_position": 2,
      "linker_name": "DSS",
      "linker_accession": "XLMOD:02001",
      "linker_mass": 138.0681
    }
  ]
}

Non-cross-linked PSMs

For regular (non-XL) PSMs, the cross_links field is null. This keeps the schema clean -- the struct-based design adds no overhead for non-cross-linked datasets.
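
The nullability rules in the cross-link table can be checked mechanically. The sketch below is one possible consistency check, not an official validator: the function name check_cross_link is hypothetical, and it enforces only the constraints stated in the table (allowed xl_type values, which fields are null for which link types, and 1-indexed positions).

```python
VALID_XL_TYPES = {"inter", "intra", "dead-end"}


def check_cross_link(entry: dict) -> list:
    """Return a list of problems found in one cross_links entry (empty if none)."""
    problems = []
    xl_type = entry.get("xl_type")
    if xl_type not in VALID_XL_TYPES:
        problems.append(f"unknown xl_type: {xl_type!r}")

    # Partner fields are null for dead-end and intra-peptide links.
    if xl_type in ("intra", "dead-end"):
        for field in ("partner_sequence", "partner_peptidoform"):
            if entry.get(field) is not None:
                problems.append(f"{field} should be null for {xl_type} links")

    # acceptor_position is null for dead-end links.
    if xl_type == "dead-end" and entry.get("acceptor_position") is not None:
        problems.append("acceptor_position should be null for dead-end links")

    # Positions are 1-indexed.
    for field in ("donor_position", "acceptor_position"):
        position = entry.get(field)
        if position is not None and position < 1:
            problems.append(f"{field} is 1-indexed and must be >= 1")
    return problems


# The inter-peptide entry from the Cross-linked PSM example.
inter_link = {
    "xl_type": "inter",
    "partner_sequence": "LKSEQUENCER",
    "partner_peptidoform": "LKSEQUENCER",
    "donor_position": 3,
    "acceptor_position": 2,
    "linker_name": "DSS",
    "linker_accession": "XLMOD:02001",
    "linker_mass": 138.0681,
}
print(check_cross_link(inter_link))  # prints []
```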

Shared Fields

Several fields in the PSM view use structures shared across other QPX views:

  • For details on the modifications field structure, see Modifications.
  • For details on additional_scores and score semantics, see Scores.
  • For details on cv_params usage and recommended terms, see Scores & CV Terms.

Example

Basic PSM Record

{
  "sequence": "AAAAAAAAAAGAAGGR",
  "peptidoform": "[Acetyl]-AAAAAAAAAAGAAGGR",
  "charge": 2,
  "scan": [42164],
  "rt": 5140.98,
  "calculated_mz": 635.3311,
  "observed_mz": 635.3315,
  "is_decoy": false,
  "posterior_error_probability": 5.58e-20,
  "predicted_rt": null,
  "run_file_name": "20200101_sample_A",
  "protein_accessions": ["Q86U42-2", "Q86U42"],
  "modifications": [
    {
      "name": "Acetyl",
      "accession": "UniMod:1",
      "positions": [
        {
          "position": 0,
          "amino_acid": null,
          "scores": []
        }
      ]
    }
  ],
  "additional_scores": [
    {
      "score_name": "andromeda_score",
      "score_value": 175.73,
      "higher_better": true
    },
    {
      "score_name": "andromeda_delta_score",
      "score_value": 160.47,
      "higher_better": true
    },
    {
      "score_name": "parent_ion_fraction",
      "score_value": 0.0,
      "higher_better": true
    }
  ],
  "cv_params": [
    {"cv_name": "dissociation method", "cv_value": "HCD"},
    {"cv_name": "normalized collision energy", "cv_value": "28"}
  ]
}

PSM with Spectral Data

When spectral arrays are included, the record also contains peak-level data:

{
  "sequence": "AAAAAAAAAAGAAGGR",
  "peptidoform": "[Acetyl]-AAAAAAAAAAGAAGGR",
  "charge": 2,
  "scan": [42164],
  "rt": 5140.98,
  "calculated_mz": 635.3311,
  "observed_mz": 635.3315,
  "is_decoy": false,
  "run_file_name": "20200101_sample_A",
  "mz_array": [175.119, 289.163, 360.200, 431.236, 488.258],
  "intensity_array": [1234.5, 5678.9, 3456.7, 2345.6, 1234.5],
  "charge_array": [1, 1, 1, 1, 1],
  "ion_type_array": ["y1", "y2", "b3", "b4", "b5"]
}
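
When consuming such a record, the spectral arrays are typically zipped into per-peak tuples. The sketch below shows one way to do this for a PSM loaded as a Python dict. The function name annotated_peaks is hypothetical. The Notes section states that mz_array and intensity_array are parallel arrays; treating charge_array and ion_type_array as parallel to them as well is an assumption based on the example above.

```python
def annotated_peaks(psm: dict) -> list:
    """Return (mz, intensity, charge, ion_type) tuples, one per peak."""
    mz = psm.get("mz_array")
    intensity = psm.get("intensity_array")
    if mz is None or intensity is None:
        return []  # spectral arrays are absent or null for this PSM
    if len(mz) != len(intensity):
        raise ValueError("mz_array and intensity_array must have the same length")

    n = len(mz)
    # Assumption: annotation arrays, when present, have one entry per peak.
    charges = psm.get("charge_array") or [None] * n
    ion_types = psm.get("ion_type_array") or [None] * n
    if len(charges) != n or len(ion_types) != n:
        raise ValueError("annotation arrays must match mz_array in length")
    return list(zip(mz, intensity, charges, ion_types))


# Spectral arrays from the example record above.
psm = {
    "mz_array": [175.119, 289.163, 360.200, 431.236, 488.258],
    "intensity_array": [1234.5, 5678.9, 3456.7, 2345.6, 1234.5],
    "charge_array": [1, 1, 1, 1, 1],
    "ion_type_array": ["y1", "y2", "b3", "b4", "b5"],
}
peaks = annotated_peaks(psm)
base_peak = max(peaks, key=lambda peak: peak[1])  # most intense peak
print(base_peak)  # prints (289.163, 5678.9, 1, 'y2')
```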

File Metadata

PSM Parquet files store file-level metadata as key-value pairs in the Parquet footer. The following metadata fields are defined:

| Field | Description |
|---|---|
| qpx_version | Version of the QPX format used to generate the file |
| software_provider | Name and version of the software that generated the data |
| scan_format | Format of scan identifiers: scan, index, or nativeId |
| creator | Name of the tool or person who created the file |
| file_type | Type of the file (value: psm_file) |
| creation_date | Date when the file was created |
| compression_format | Compression algorithm used: zstd (default), snappy, gzip, lzo, or none |

Reading file metadata in Python

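import pyarrow.parquet as pq

# Opening the file reads the Parquet footer, not the row data.
parquet_file = pq.ParquetFile("experiment.psm.parquet")

# Keys and values are bytes; metadata is None when the file has no key-value pairs.
metadata = parquet_file.schema_arrow.metadata or {}
for key, value in metadata.items():
    print(f"{key.decode()}: {value.decode()}")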

Notes

DDA-specific view

The PSM view is designed primarily for DDA (data-dependent acquisition) methods. It is not recommended for DIA experiments, where the feature view should be used instead. Generating a PSM file for DIA data would produce duplicated information relative to the feature view.

PEP is a PSM-level metric

posterior_error_probability is defined only in the PSM view. It represents the probability that a specific peptide-spectrum match is incorrect (lower is better). Percolator (posterior_error_prob) and MaxQuant (PEP) export PEP directly as P(incorrect). FragPipe instead exports the PeptideProphet probability, which is P(correct), so converters must compute PEP = 1 - probability. The feature and peptide views do not carry PEP as a top-level field; use additional_scores or best_id_score if a derived PEP value is needed at those levels.
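
The conversion for probability-style scores is a one-line complement. A minimal sketch, assuming the input is a P(correct) value in the range 0 to 1; the function name pep_from_probability is hypothetical:

```python
def pep_from_probability(probability: float) -> float:
    """Convert P(correct), e.g. a PeptideProphet probability, to PEP = P(incorrect)."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0.0 and 1.0")
    return 1.0 - probability


# A PSM with P(correct) = 0.99 has a PEP of about 0.01.
pep = pep_from_probability(0.99)
print(f"{pep:.4f}")  # prints "0.0100"
```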

  • Relationship to feature view: The PSM view captures individual spectrum matches, while the Feature View aggregates these into quantified peptide features with intensity data. A single feature may correspond to multiple PSMs across different scans.
  • Protein inference: Protein inference results should not be the primary focus of the PSM view. protein_accessions is an optional column — protein mappings can be recovered by joining through the feature and protein group views. When included, it is useful for peptide filtering and protein-level browsing. For full protein group information, use the protein group (PG) view.
  • Spectral arrays: The mz_array and intensity_array are parallel arrays of the same length. For large-scale spectral storage, the dedicated mass spectra (mz) view is recommended.
  • Recommended additional scores: global_qvalue (experiment-level PSM q-value), rank (peptide rank in search results), and pg_global_qvalue (protein group q-value used for filtering).
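
Because these recommended scores live inside additional_scores rather than in top-level columns, filtering on them requires a lookup by name. The sketch below shows one way to do that for PSMs loaded as Python dicts. The helper names and the 0.01 threshold are illustrative assumptions; the score_name and score_value keys follow the Basic PSM Record example.

```python
def get_score(psm: dict, name: str):
    """Return the value of the named entry in additional_scores, or None."""
    for score in psm.get("additional_scores") or []:
        if score["score_name"] == name:
            return score["score_value"]
    return None


def passes_qvalue_filter(psm: dict, threshold: float = 0.01) -> bool:
    """Keep target PSMs whose global_qvalue is at or below the threshold."""
    qvalue = get_score(psm, "global_qvalue")
    return (not psm["is_decoy"]) and qvalue is not None and qvalue <= threshold


# Hypothetical records reduced to the fields used by the filter.
psms = [
    {"is_decoy": False,
     "additional_scores": [{"score_name": "global_qvalue", "score_value": 0.001,
                            "higher_better": False}]},
    {"is_decoy": False,
     "additional_scores": [{"score_name": "global_qvalue", "score_value": 0.05,
                            "higher_better": False}]},
    {"is_decoy": True,
     "additional_scores": [{"score_name": "global_qvalue", "score_value": 0.001,
                            "higher_better": False}]},
    {"is_decoy": False, "additional_scores": None},
]
kept = [psm for psm in psms if passes_qvalue_filter(psm)]
print(len(kept))  # prints 1
```

PSMs without a global_qvalue entry are dropped here; a converter or analysis script may prefer to keep them, depending on how the file was produced.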