Mass Spectra View

The mass spectra (mz) view is a Parquet file that stores raw mass spectrometry spectral data in a columnar format. This view is based on the mz_parquet format developed by Michael Lazear, adapted for the QPX ecosystem with additional metadata support through controlled vocabulary (CV) parameters.

Use cases

  • Retrieve precursor mass, retention time, and intensity values for all spectra in a file.
  • Enable visualization and interactive scanning at the mass spectra level.
  • Provide structured spectral data for AI/ML training and prediction tasks.
  • Store spectra in a columnar format that lets readers load only the columns and rows a query needs, which makes large-scale retrieval more efficient than parsing XML-based mzML.

Schema

See the full YAML schema in mz.yaml.

Scan identification

| Field | Description | Type | Required |
|---|---|---|---|
| id | Unique identifier for the scan or spectrum | string | Yes |
| ms_level | The MS level (1 for MS1, 2 for MS2, etc.) | int | Yes |
| centroid | Whether the data is centroided (true) or profile mode (false) | boolean | Yes |

Timing and mobility

| Field | Description | Type | Required |
|---|---|---|---|
| scan_start_time | Start time of the scan in minutes | float | Yes |
| inverse_ion_mobility | Inverse ion mobility (1/K0), used for TIMS data | float | No |
| ion_injection_time | Ion injection time in milliseconds | float | Yes |
| total_ion_current | Total ion current (TIC) for the scan | float | Yes |

Precursors

| Field | Description | Type | Required |
|---|---|---|---|
| precursors | List of precursor ions for this MS/MS scan | array[struct] | No |

Each precursor struct contains:

| Sub-field | Description | Type | Required |
|---|---|---|---|
| selected_ion_mz | m/z value of the selected precursor ion | float | Yes |
| selected_ion_charge | Charge state of the selected precursor ion | int | No |
| selected_ion_intensity | Intensity of the selected precursor ion | float | No |
| isolation_window_target | Target m/z for the isolation window | float | No |
| isolation_window_lower | Lower bound of the isolation window | float | No |
| isolation_window_upper | Upper bound of the isolation window | float | No |
| spectrum_ref | Reference to another spectrum (for linking to external datasets) | string | No |
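
The use cases mention retrieving precursor mass, while the precursor struct stores m/z and charge rather than a mass. As a minimal sketch, the neutral mass can be derived from selected_ion_mz and selected_ion_charge, assuming positive-mode protonated ions. The function name precursor_neutral_mass and the constant PROTON_MASS are illustrative and not part of the schema.

```python
from typing import Optional

# Mass of a proton in daltons (rounded).
PROTON_MASS = 1.007276466


def precursor_neutral_mass(selected_ion_mz: float,
                           selected_ion_charge: Optional[int]) -> Optional[float]:
    """Return the neutral mass of a protonated precursor.

    Assumes positive-mode [M + zH]z+ ions. selected_ion_charge is optional
    in the schema, so None (or a non-positive charge) yields None.
    """
    if not selected_ion_charge or selected_ion_charge <= 0:
        return None
    return (selected_ion_mz - PROTON_MASS) * selected_ion_charge


# Precursor values from the Example section: m/z 547.2891 at charge 2.
mass = precursor_neutral_mass(547.2891, 2)
print(round(mass, 4))
```

Returning None for a missing charge keeps the caller from silently treating an unknown charge state as 1.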

Spectral data

| Field | Description | Type | Required |
|---|---|---|---|
| mz | Array of m/z values for the scan | array[float] | Yes |
| intensity | Array of intensity values corresponding to the m/z values | array[float] | Yes |

CV params

| Field | Description | Type | Required |
|---|---|---|---|
| cv_params | List of controlled vocabulary parameters for additional metadata | array[struct{name, value}] | No |

Each CV param struct contains:

| Sub-field | Description | Type |
|---|---|---|
| name | Name of the CV term (e.g., from PSI-MS or other ontologies) | string |
| value | Value associated with the CV term | string |
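
Because cv_params is a list of name/value structs, looking up a single term means scanning the list. A small sketch of turning it into a dictionary is shown here; the helper name cv_params_to_dict is hypothetical, and the choice that a repeated name keeps its last value is an assumption of this sketch, not something the schema specifies.

```python
from typing import Dict, List, Optional


def cv_params_to_dict(cv_params: Optional[List[Dict[str, str]]]) -> Dict[str, str]:
    """Index CV params by name; a missing (null) list yields an empty dict.

    If a name repeats, the last value wins (an assumption of this sketch).
    """
    return {p["name"]: p["value"] for p in (cv_params or [])}


# The cv_params list from the Example section.
cv_params = [
    {"name": "MS:1000016", "value": "42.156"},
    {"name": "MS:1000512", "value": "FTMS + c NSI d Full ms2 547.29@hcd30.00"},
]
lookup = cv_params_to_dict(cv_params)

# Values are stored as strings, so numeric terms need explicit conversion.
scan_start = float(lookup["MS:1000016"])
```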

Example

An MS2 spectrum with a single precursor ion:

{
  "id": "controllerType=0 controllerNumber=1 scan=15236",
  "ms_level": 2,
  "centroid": true,
  "scan_start_time": 42.156,
  "inverse_ion_mobility": null,
  "ion_injection_time": 35.0,
  "total_ion_current": 2.45e6,
  "precursors": [
    {
      "selected_ion_mz": 547.2891,
      "selected_ion_charge": 2,
      "selected_ion_intensity": 8.3e5,
      "isolation_window_target": 547.29,
      "isolation_window_lower": 0.7,
      "isolation_window_upper": 0.7,
      "spectrum_ref": null
    }
  ],
  "mz": [
    110.0713, 120.0808, 136.0757, 147.1128, 175.1190,
    234.1448, 262.1397, 349.1718, 462.2559, 575.3399,
    688.4240, 801.5080, 914.5921, 1013.6605
  ],
  "intensity": [
    5200.0, 12300.0, 8900.0, 45000.0, 23000.0,
    67000.0, 34000.0, 89000.0, 120000.0, 95000.0,
    78000.0, 56000.0, 34000.0, 12000.0
  ],
  "cv_params": [
    {"name": "MS:1000016", "value": "42.156"},
    {"name": "MS:1000512", "value": "FTMS + c NSI d Full ms2 547.29@hcd30.00"}
  ]
}

Notes

Preferred over embedding spectra in PSM view

While the PSM view can optionally carry mz_array and intensity_array fields, the mz view is the preferred format for storing and retrieving spectral data. Separating spectra from identifications allows independent access patterns: spectra can be queried by retention time or m/z range without loading identification data, and vice versa.

Querying with DuckDB

The columnar Parquet layout enables efficient SQL-based access to spectra:

-- Retrieve all MS2 spectra in a retention time window (minutes, inclusive)
SELECT id, scan_start_time, precursors, mz, intensity
FROM 'sample.mz.parquet'
WHERE ms_level = 2 AND scan_start_time BETWEEN 30.0 AND 45.0;

-- Count spectra per MS level
SELECT ms_level, COUNT(*) AS num_spectra
FROM 'sample.mz.parquet'
GROUP BY ms_level;
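
To make the semantics of the two queries concrete, here is a rough pure-Python equivalent over rows that have already been loaded as dictionaries. The rows list is made-up illustration data, not taken from a real file, and in practice DuckDB would apply the filter while reading the Parquet file rather than after loading everything.

```python
from collections import Counter

# Hypothetical rows, reduced to the columns the queries filter on.
rows = [
    {"id": "scan=1", "ms_level": 1, "scan_start_time": 29.8},
    {"id": "scan=2", "ms_level": 2, "scan_start_time": 30.0},
    {"id": "scan=3", "ms_level": 2, "scan_start_time": 44.9},
    {"id": "scan=4", "ms_level": 2, "scan_start_time": 45.1},
]

# WHERE ms_level = 2 AND scan_start_time BETWEEN 30.0 AND 45.0
# (BETWEEN is inclusive on both ends).
ms2_in_window = [
    r["id"] for r in rows
    if r["ms_level"] == 2 and 30.0 <= r["scan_start_time"] <= 45.0
]

# SELECT ms_level, COUNT(*) ... GROUP BY ms_level
spectra_per_level = Counter(r["ms_level"] for r in rows)
```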

Array alignment

The mz and intensity arrays MUST have the same length within each record. Each element at index i in intensity corresponds to the peak at index i in mz.
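
A reader or writer can enforce this rule before pairing the arrays. The following is a minimal sketch; the function name paired_peaks is hypothetical, and raising ValueError on a mismatch is one possible policy rather than behavior defined by the format.

```python
from typing import List, Sequence, Tuple


def paired_peaks(mz: Sequence[float],
                 intensity: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (m/z, intensity) pairs, raising if the arrays are misaligned."""
    if len(mz) != len(intensity):
        raise ValueError(
            f"mz has {len(mz)} values but intensity has {len(intensity)}"
        )
    return list(zip(mz, intensity))


# First five peaks of the spectrum in the Example section.
mz = [110.0713, 120.0808, 136.0757, 147.1128, 175.1190]
intensity = [5200.0, 12300.0, 8900.0, 45000.0, 23000.0]

peaks = paired_peaks(mz, intensity)

# Base peak: the pair with the highest intensity.
base_peak = max(peaks, key=lambda p: p[1])
```

Checking the lengths explicitly matters because zip alone would silently truncate to the shorter array and hide a corrupted record.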