SDRF Conversion

When an SDRF file is ingested into QPX, bracket-style column names (e.g. characteristics[organism]) are mapped to clean snake_case field names (e.g. organism) and the data is split into two Parquet files -- one for biological samples and one for data acquisition runs. This page documents the column mapping registry, the multi-valued metadata convention, and the end-to-end conversion process.

SDRF Column Name Mapping

The registry below defines how each bracket-style SDRF column name maps to a snake_case QPX field name and which Parquet file receives it. The mapping is deterministic and applied automatically during ingestion.

Sample Columns (→ sample.parquet)

Sample columns describe biological properties of the specimen. These are deduplicated and stored in sample.parquet.

SDRF Column	QPX Field Name	Target File
`source name`	`sample_accession`	sample.parquet
`characteristics[organism]`	`organism`	sample.parquet
`characteristics[organism part]`	`organism_part`	sample.parquet
`characteristics[disease]`	`disease`	sample.parquet
`characteristics[cell line]`	`cell_line`	sample.parquet
`characteristics[cell type]`	`cell_type`	sample.parquet
`characteristics[sex]`	`sex`	sample.parquet
`characteristics[age]`	`age`	sample.parquet
`characteristics[developmental stage]`	`developmental_stage`	sample.parquet
`characteristics[ancestry category]`	`ancestry`	sample.parquet
`characteristics[individual]`	`individual`	sample.parquet
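
As an illustration of how the sample registry could be applied, the sketch below stores the table above as a dictionary and looks up each SDRF header in it. The names `SAMPLE_COLUMNS` and `rename_sample_columns` are hypothetical, and the case-insensitive, whitespace-trimmed matching is an assumption rather than documented QPX behaviour.

```python
# Registry copied from the sample-column table above.
SAMPLE_COLUMNS = {
    "source name": "sample_accession",
    "characteristics[organism]": "organism",
    "characteristics[organism part]": "organism_part",
    "characteristics[disease]": "disease",
    "characteristics[cell line]": "cell_line",
    "characteristics[cell type]": "cell_type",
    "characteristics[sex]": "sex",
    "characteristics[age]": "age",
    "characteristics[developmental stage]": "developmental_stage",
    "characteristics[ancestry category]": "ancestry",
    "characteristics[individual]": "individual",
}


def rename_sample_columns(header):
    """Return (index, qpx_name) pairs for the SDRF columns found in the registry."""
    renamed = []
    for index, column in enumerate(header):
        # Assumption: headers are matched case-insensitively after trimming.
        key = column.strip().lower()
        if key in SAMPLE_COLUMNS:
            renamed.append((index, SAMPLE_COLUMNS[key]))
    return renamed
```

Columns that are absent from the registry (for example run-level `comment[...]` columns) are simply skipped here, so the same header can be passed through a second, run-level registry.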

Run Columns (→ run.parquet)

Run columns describe technical and instrument-related properties of the data acquisition. These are stored in run.parquet.

SDRF Column	QPX Field Name	Target File
`assay name`	`run_accession`	run.parquet
`comment[data file]`	`run_file_name`	run.parquet
`comment[label]`	`label` (inside `samples` list)	run.parquet
`comment[fraction identifier]`	`fraction`	run.parquet
`characteristics[biological replicate]`	`biological_replicate` (inside `samples` list)	run.parquet
`comment[technical replicate]`	`technical_replicate` (inside `samples` list)	run.parquet
`comment[instrument]`	`instrument` (as ONTOLOGY_TERM)	run.parquet
`comment[cleavage agent details]`	`enzymes` (as list of ONTOLOGY_TERM)	run.parquet
`comment[proteomics data acquisition method]`	`additional_terms` (as ONTOLOGY_TERM)	run.parquet
`comment[dissociation method]`	`dissociation_method` (as ONTOLOGY_TERM)	run.parquet
`comment[modification parameters]`	`modification_parameters` (as list of MODIFICATION)	run.parquet
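
Several run columns are stored as ONTOLOGY_TERM values. The sketch below shows one way such a cell could be parsed, assuming the cell uses the `NT=<name>;AC=<accession>` key=value syntax commonly seen in SDRF files. The function name `parse_ontology_term` and the output keys `name` and `accession` are hypothetical; the actual ONTOLOGY_TERM struct layout is not specified on this page.

```python
def parse_ontology_term(cell):
    """Parse 'NT=name;AC=accession' into a dict; plain text becomes the name."""
    term = {"name": None, "accession": None}
    for part in cell.split(";"):
        key, sep, value = part.partition("=")
        if not sep:
            # No key=value syntax: treat the text as a free-text name.
            if part.strip():
                term["name"] = part.strip()
            continue
        key = key.strip().upper()
        if key == "NT":
            term["name"] = value.strip()
        elif key == "AC":
            term["accession"] = value.strip()
        # Other keys (if any) are ignored in this sketch.
    return term
```

For example, `parse_ontology_term("NT=Trypsin;AC=MS:1001251")` yields a name of `Trypsin` and an accession of `MS:1001251`, while a bare `Trypsin` cell yields the name only.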

Factor Columns (→ future factor.parquet)

Factor columns capture experimental design variables from factor value[X] columns in the SDRF. These are stored in factor.parquet.

SDRF Column	QPX Field Name	Target File
`factor value[disease]`	`factor_disease`	factor.parquet
`factor value[organism part]`	`factor_organism_part`	factor.parquet

Note

Factor columns are open-ended. Any factor value[X] column in the SDRF will be mapped to factor_X (with spaces replaced by underscores). Replicate identifiers (biological_replicate, technical_replicate) are stored in the samples struct of run.parquet rather than here, so the design matrix is self-contained in a single file.
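
The open-ended factor rule can be sketched as a small function. The name `factor_field_name` is hypothetical; the only transformation applied is the documented one (spaces replaced by underscores), and returning `None` for non-factor columns is an assumption made for this example.

```python
import re

# Matches 'factor value[X]' and captures X.
_FACTOR_RE = re.compile(r"^factor value\[(.+)\]$", re.IGNORECASE)


def factor_field_name(sdrf_column):
    """Map 'factor value[X]' to 'factor_X'; return None for other columns."""
    match = _FACTOR_RE.match(sdrf_column.strip())
    if match is None:
        return None
    # Documented rule: spaces in X become underscores.
    return "factor_" + match.group(1).strip().replace(" ", "_")
```

With this sketch, `factor value[organism part]` becomes `factor_organism_part`, matching the table above.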

Multi-valued Metadata

In rare cases a single SDRF row may describe a sample derived from multiple organisms (e.g. a spike-in experiment) or multiple cell lines (pooled samples). QPX handles this natively using list columns.

Convention

All biological property columns in sample.parquet use list[string] types. During SDRF ingestion, pipe-delimited values in the source TSV (e.g. Homo sapiens|Saccharomyces cerevisiae) are split and stored as proper list elements:

# Single-valued (most samples)
organism = ["Homo sapiens"]

# Multi-valued (spike-in / pooled samples)
organism = ["Homo sapiens", "Saccharomyces cerevisiae"]

This avoids delimiter conventions in the stored data and enables native list operations in query engines.
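
A minimal sketch of the splitting step, assuming values are trimmed and empty fragments are dropped (neither detail is specified above). The function name `split_multi_valued` is hypothetical.

```python
def split_multi_valued(cell):
    """Split a pipe-delimited SDRF cell into a list of trimmed, non-empty values."""
    return [part.strip() for part in cell.split("|") if part.strip()]
```

A single-valued cell such as `Homo sapiens` still comes back as a one-element list, so every row has the same list[string] shape.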

SQL Queries

When querying sample.parquet with DuckDB, use UNNEST to explode list columns or list_contains to filter:

-- All unique organisms in the project
SELECT DISTINCT UNNEST(organism) AS organism
FROM 'PXD014414.sample.parquet';

-- Samples with a specific disease
SELECT sample_accession, disease
FROM 'PXD014414.sample.parquet'
WHERE list_contains(disease, 'breast cancer');

SDRF Conversion Process

At ingestion time, the SDRF TSV is split into sample.parquet and run.parquet:

1. Load the SDRF TSV file.
2. Rename columns using the field registry above.
3. Extract sample-level columns -- deduplicate rows by sample_accession (= source name) to produce sample.parquet.
4. Extract run-level columns -- one row per run_accession (= assay name), with the samples list linking to sample accessions, labels, and replicate numbers, to produce run.parquet.
5. Parse ontology terms -- instrument and enzyme cells are parsed into ONTOLOGY_TERM structs, and modification parameter cells into MODIFICATION structs.
6. Preserve provenance -- the original .sdrf.tsv file is kept alongside the Parquet files.
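
The sample/run split described above can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not the QPX implementation: `split_sdrf`, `SAMPLE_FIELDS`, and `RUN_FIELDS` are hypothetical names, the registries are truncated subsets, the first occurrence of a sample is kept when deduplicating, and writing Parquet and parsing ontology terms are left out.

```python
import csv
import io

# Subsets of the registry tables above, kept short for illustration.
SAMPLE_FIELDS = {
    "characteristics[organism]": "organism",
    "characteristics[disease]": "disease",
}
RUN_FIELDS = {
    "comment[data file]": "run_file_name",
    "comment[fraction identifier]": "fraction",
}


def split_sdrf(tsv_text):
    """Split SDRF TSV text into (sample_rows, run_rows) lists of dicts."""
    samples = {}  # sample_accession -> sample row
    runs = {}     # run_accession -> run row
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        sample_accession = row["source name"]
        run_accession = row["assay name"]

        # Deduplicate samples; assumption: the first occurrence wins.
        if sample_accession not in samples:
            sample = {"sample_accession": sample_accession}
            for sdrf_name, qpx_name in SAMPLE_FIELDS.items():
                if sdrf_name in row:
                    # Biological properties are stored as lists of strings.
                    sample[qpx_name] = row[sdrf_name].split("|")
            samples[sample_accession] = sample

        # One row per run; each SDRF row adds one entry to its samples list.
        if run_accession not in runs:
            run = {"run_accession": run_accession, "samples": []}
            for sdrf_name, qpx_name in RUN_FIELDS.items():
                if sdrf_name in row:
                    run[qpx_name] = row[sdrf_name]
            runs[run_accession] = run
        runs[run_accession]["samples"].append({
            "sample_accession": sample_accession,
            "label": row.get("comment[label]"),
        })
    return list(samples.values()), list(runs.values())
```

In a multiplexed experiment several SDRF rows share one assay name, so a single run row ends up with several entries in its `samples` list, one per label.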

Project Layout

PXD014414/
  PXD014414.sdrf.tsv             # Original SDRF (provenance)
  PXD014414.sample.parquet       # Biological samples (deduplicated)
  PXD014414.run.parquet          # Data acquisition runs with ontology terms
  PXD014414.dataset.parquet      # Project-level metadata
  PXD014414.feature.parquet
  PXD014414.psm.parquet
  ...

Sample Metadata -- sample.parquet
Run Metadata -- run.parquet
Dataset Metadata -- dataset.parquet
QPX Format Overview