Sample Metadata

sample.parquet stores one row per biological sample, capturing the biological properties of each specimen in the study. It is derived from the community SDRF (Sample and Data Relationship Format) standard, with bracket-style column names translated to clean snake_case field names during ingestion.

See the full YAML schema in sample.yaml.

SDRF as the ingestion format

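QPX does not define its own sample metadata model. It adopts the community SDRF standard and converts it to query-friendly Parquet files. At ingestion time, bracket-style column names are translated to snake_case field names (for example, `characteristics[biological replicate]` becomes `biological_replicate`), and the data is split by entity type, so that sample-level properties are stored in sample.parquet.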
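The name translation described above can be sketched in a few lines of Python. This is an illustration only, not the QPX converter: the helper `sdrf_to_snake_case` is hypothetical, and its rules (strip the `characteristics[...]` wrapper, lowercase, replace runs of non-alphanumeric characters with underscores) are assumptions inferred from the column examples on this page.

```python
import re

# Hypothetical helper, not part of QPX. The rules are assumptions inferred
# from examples such as characteristics[biological replicate] -> biological_replicate.
_CHARACTERISTICS = re.compile(
    r"^\s*characteristics\s*\[(?P<name>[^\]]+)\]\s*$", re.IGNORECASE
)

def sdrf_to_snake_case(column: str) -> str:
    """Translate a bracket-style SDRF column name to a snake_case field name."""
    match = _CHARACTERISTICS.match(column)
    name = match.group("name") if match else column
    # Lowercase, then collapse every run of non-alphanumerics into one underscore.
    return re.sub(r"[^0-9a-z]+", "_", name.strip().lower()).strip("_")

print(sdrf_to_snake_case("characteristics[biological replicate]"))  # biological_replicate
print(sdrf_to_snake_case("characteristics[bmi]"))                   # bmi
```

Renames that are not purely mechanical, such as SDRF `source name` becoming `sample_accession`, would need an explicit lookup table on top of this sketch.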

PyArrow Schema

import pyarrow as pa

sample_schema = pa.schema([
    # Identity -- maps to SDRF "source name"
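    # Note: pa.field defaults to nullable=True, so required fields set nullable=False
    pa.field("sample_accession", pa.string(), nullable=False),  # PK, required

    # Mandatory biological properties
    pa.field("organism", pa.string(), nullable=False),          # required
    pa.field("organism_part", pa.string(), nullable=False),     # required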

    # Optional standard properties
    pa.field("disease", pa.string(), nullable=True),
    pa.field("cell_line", pa.string(), nullable=True),
    pa.field("cell_type", pa.string(), nullable=True),
    pa.field("sex", pa.string(), nullable=True),
    pa.field("age", pa.string(), nullable=True),
    pa.field("developmental_stage", pa.string(), nullable=True),
    pa.field("ancestry", pa.string(), nullable=True),
    pa.field("individual", pa.string(), nullable=True),
    pa.field("sample_description", pa.string(), nullable=True),

    # Extra columns from SDRF characteristics are added dynamically
    # e.g. pa.field("biological_replicate", pa.string(), nullable=True),
    # e.g. pa.field("bmi", pa.string(), nullable=True),
])

Field Reference

Core fields (always present)

Field	Description	Type	Required
`sample_accession`	Unique sample identifier (= SDRF `source name`)	string	yes
`organism`	Species (e.g. `"Homo sapiens"`)	string	yes
`organism_part`	Tissue or organ (e.g. `"Brain; Frontal cortex"`)	string	yes
`disease`	Disease state (e.g. `"breast cancer"`)	string	no
`cell_line`	Cell line name (e.g. `"HeLa"`)	string	no
`cell_type`	Cell type (e.g. `"T cell"`)	string	no
`sex`	Biological sex (e.g. `"female"`)	string	no
`age`	Age of the specimen (e.g. `"45"`)	string	no
`developmental_stage`	Developmental stage	string	no
`ancestry`	Ancestry category	string	no
`individual`	Individual/patient identifier	string	no
`sample_description`	Free-text sample description	string	no
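The required/optional split in the table can be checked mechanically before or after ingestion. The following is a minimal sketch, assuming each sample is available as a plain Python dict. The helper `missing_required` and the second, deliberately invalid row are hypothetical and exist only for illustration.

```python
# Required fields, taken from the "Required" column of the table above.
REQUIRED_FIELDS = ("sample_accession", "organism", "organism_part")

def missing_required(row):
    """Return the required field names that are absent, None, or blank in `row`."""
    missing = []
    for name in REQUIRED_FIELDS:
        value = row.get(name)
        if value is None or not str(value).strip():
            missing.append(name)
    return missing

# Illustrative rows; the second one is intentionally invalid.
rows = [
    {"sample_accession": "Sample_01", "organism": "Homo sapiens", "organism_part": "breast"},
    {"sample_accession": "Sample_02", "organism": "Homo sapiens", "organism_part": None},
]
for row in rows:
    print(row["sample_accession"], missing_required(row))
# Sample_01 []
# Sample_02 ['organism_part']
```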

Extra columns (dataset-specific)

Any non-standard SDRF characteristics[X] column becomes its own string column in the Parquet file. For example:

Extra column	SDRF source
`biological_replicate`	`characteristics[biological replicate]`
`bmi`	`characteristics[bmi]`
`treatment`	`characteristics[treatment]`
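Because extra columns vary between datasets, it can be useful to work out which columns of a given file are dataset-specific. A minimal sketch, assuming the core field list from the table above; the helper `extra_columns` is hypothetical and works on any list of column names (for example, the `column_names` of a table read with PyArrow).

```python
# Core field names, copied from the "Core fields" table.
CORE_FIELDS = (
    "sample_accession", "organism", "organism_part", "disease", "cell_line",
    "cell_type", "sex", "age", "developmental_stage", "ancestry",
    "individual", "sample_description",
)

def extra_columns(column_names):
    """Return the column names that are not core fields, preserving order."""
    core = set(CORE_FIELDS)
    return [name for name in column_names if name not in core]

columns = ["sample_accession", "organism", "organism_part",
           "disease", "sex", "biological_replicate"]
print(extra_columns(columns))  # ['biological_replicate']
```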

Multi-valued fields

When a sample has multiple values for a property (e.g., a spike-in with two organisms), the values are joined with "; " into a single string: "Homo sapiens; Saccharomyces cerevisiae".
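To recover the individual values, a consumer can split on the same separator. A minimal sketch, assuming the `"; "` separator described above; the helper `split_multi_valued` is hypothetical.

```python
def split_multi_valued(value, sep="; "):
    """Split a joined field into its values; None or an empty string yields []."""
    if not value:
        return []
    return [part.strip() for part in value.split(sep) if part.strip()]

print(split_multi_valued("Homo sapiens; Saccharomyces cerevisiae"))
# ['Homo sapiens', 'Saccharomyces cerevisiae']
print(split_multi_valued("Homo sapiens"))
# ['Homo sapiens']
```

A value that itself contains `"; "` cannot be distinguished from two joined values, so this split is only as reliable as the source annotations.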

Example Rows

sample_accession	organism	organism_part	disease	sex	biological_replicate
Sample_01	Homo sapiens	breast	breast cancer	female	1
Sample_02	Homo sapiens	breast	normal	female	2
Sample_03	Homo sapiens; Saccharomyces cerevisiae	breast	breast cancer	male	3

Reading sample.parquet

import pyarrow.compute as pc
import pyarrow.parquet as pq

# Load the whole sample table into memory
samples = pq.read_table("PXD014414.sample.parquet")
print(samples.column_names)
# ['sample_accession', 'organism', 'organism_part', 'disease', ..., 'biological_replicate']

# Get unique organisms (multi-valued entries stay joined as one string)
organisms = pc.unique(samples.column("organism")).to_pylist()
print(organisms)  # ['Homo sapiens', 'Homo sapiens; Saccharomyces cerevisiae']

Querying with DuckDB

-- All unique organisms in the project
SELECT DISTINCT organism
FROM 'PXD014414.sample.parquet';

-- Samples with a specific disease
SELECT sample_accession, organism_part, disease
FROM 'PXD014414.sample.parquet'
WHERE disease LIKE '%breast cancer%';

-- Group by extra column
SELECT biological_replicate, COUNT(*) AS n_samples
FROM 'PXD014414.sample.parquet'
GROUP BY biological_replicate;

Run Metadata -- run.parquet
Dataset Metadata -- dataset.parquet
SDRF Conversion -- Column mapping and ingestion process
QPX Format Overview