Sample Metadata
sample.parquet stores one row per biological sample, capturing the biological properties of each specimen in the study. It is derived from the community SDRF (Sample and Data Relationship Format) standard, with bracket-style column names translated to clean snake_case field names during ingestion.
See the full YAML schema in sample.yaml.
SDRF as the ingestion format
QPX does not invent its own sample metadata model. It adopts the community SDRF standard and converts it to query-friendly Parquet files. At ingestion time, bracket-style column names such as characteristics[biological replicate] are translated to snake_case field names such as biological_replicate, and the data is split by entity type.
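The name translation can be sketched in a few lines of Python. This is only an illustration of the idea, not the QPX ingestion code: the function name `to_field_name` and the `EXPLICIT_RENAMES` table are hypothetical, and the only explicit rename included is the one this page documents (SDRF source name to sample_accession).

```python
import re

# Explicit rename documented on this page; any further entries would be assumptions.
EXPLICIT_RENAMES = {"source name": "sample_accession"}


def to_field_name(sdrf_column: str) -> str:
    """Translate an SDRF column header into a snake_case field name (sketch)."""
    key = sdrf_column.strip().lower()
    if key in EXPLICIT_RENAMES:
        return EXPLICIT_RENAMES[key]
    # Keep the text inside the brackets if present, otherwise the whole header.
    match = re.search(r"\[(.+?)\]", key)
    inner = match.group(1) if match else key
    # Collapse every run of non-alphanumeric characters into one underscore.
    return re.sub(r"[^a-z0-9]+", "_", inner).strip("_")


print(to_field_name("characteristics[biological replicate]"))  # biological_replicate
print(to_field_name("source name"))                            # sample_accession
```

The real mapping may treat some columns differently; the SDRF Conversion page is the authoritative description.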
PyArrow Schema
import pyarrow as pa

sample_schema = pa.schema([
    # Identity -- maps to SDRF "source name"
    pa.field("sample_accession", pa.string()),  # PK, required
    # Mandatory biological properties
    pa.field("organism", pa.string()),  # required
    pa.field("organism_part", pa.string()),  # required
    # Optional standard properties
    pa.field("disease", pa.string(), nullable=True),
    pa.field("cell_line", pa.string(), nullable=True),
    pa.field("cell_type", pa.string(), nullable=True),
    pa.field("sex", pa.string(), nullable=True),
    pa.field("age", pa.string(), nullable=True),
    pa.field("developmental_stage", pa.string(), nullable=True),
    pa.field("ancestry", pa.string(), nullable=True),
    pa.field("individual", pa.string(), nullable=True),
    pa.field("sample_description", pa.string(), nullable=True),
    # Extra columns from SDRF characteristics are added dynamically
    # e.g. pa.field("biological_replicate", pa.string(), nullable=True),
    # e.g. pa.field("bmi", pa.string(), nullable=True),
])
Field Reference
Core fields (always present)
| Field | Description | Type | Required |
|---|---|---|---|
| sample_accession | Unique sample identifier (= SDRF source name) | string | yes |
| organism | Species (e.g. "Homo sapiens") | string | yes |
| organism_part | Tissue or organ (e.g. "Brain; Frontal cortex") | string | yes |
| disease | Disease state (e.g. "breast cancer") | string | no |
| cell_line | Cell line name (e.g. "HeLa") | string | no |
| cell_type | Cell type (e.g. "T cell") | string | no |
| sex | Biological sex (e.g. "female") | string | no |
| age | Age of the specimen (e.g. "45") | string | no |
| developmental_stage | Developmental stage | string | no |
| ancestry | Ancestry category | string | no |
| individual | Individual/patient identifier | string | no |
| sample_description | Free-text sample description | string | no |
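The Required column can be turned into a quick sanity check. The following is a minimal sketch, assuming rows are available as plain Python dictionaries (for example from a table's `to_pylist()`); the helper name `missing_required` and the sample rows are hypothetical and are not part of QPX.

```python
# The three fields marked "yes" in the Required column of the table.
REQUIRED_FIELDS = ("sample_accession", "organism", "organism_part")


def missing_required(row: dict) -> list:
    """Return the required fields that are absent, None, or blank in one row."""
    return [name for name in REQUIRED_FIELDS if not (row.get(name) or "").strip()]


# Hypothetical rows: the second one lacks a value for organism_part.
rows = [
    {"sample_accession": "Sample_01", "organism": "Homo sapiens", "organism_part": "breast"},
    {"sample_accession": "Sample_02", "organism": "Homo sapiens", "organism_part": None},
]

problems = {row["sample_accession"]: missing_required(row) for row in rows}
print(problems)  # {'Sample_01': [], 'Sample_02': ['organism_part']}
```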
Extra columns (dataset-specific)
Any non-standard SDRF characteristics[X] column becomes its own string column in the Parquet file. For example:
| Extra column | SDRF source |
|---|---|
| biological_replicate | characteristics[biological replicate] |
| bmi | characteristics[bmi] |
| treatment | characteristics[treatment] |
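Because extra columns differ between datasets, it can help to list them before writing queries. A minimal sketch, assuming the twelve core field names from the table above; the helper `extra_columns` is hypothetical, and the column list is made up for illustration (in practice it could come from `samples.column_names`).

```python
# Core field names, as listed in the field reference on this page.
CORE_FIELDS = [
    "sample_accession", "organism", "organism_part", "disease",
    "cell_line", "cell_type", "sex", "age", "developmental_stage",
    "ancestry", "individual", "sample_description",
]


def extra_columns(column_names):
    """Return the dataset-specific columns, preserving their original order."""
    core = set(CORE_FIELDS)
    return [name for name in column_names if name not in core]


# Illustrative column list for one dataset.
columns = ["sample_accession", "organism", "organism_part", "disease",
           "sex", "biological_replicate", "bmi"]
print(extra_columns(columns))  # ['biological_replicate', 'bmi']
```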
Multi-valued fields
When a sample has multiple values for a property (e.g., a spike-in with two organisms), the values are joined with "; " into a single string: "Homo sapiens; Saccharomyces cerevisiae".
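A consumer that needs the individual values has to split on the same separator. The sketch below uses plain Python string handling; the helper name `split_values` is hypothetical, and the input list mirrors the organism values used on this page.

```python
SEPARATOR = "; "  # separator described above for multi-valued fields


def split_values(value):
    """Split a joined multi-valued string into its parts; None yields an empty list."""
    if value is None:
        return []
    return [part.strip() for part in value.split(SEPARATOR) if part.strip()]


organisms = [
    "Homo sapiens",
    "Homo sapiens",
    "Homo sapiens; Saccharomyces cerevisiae",
]

# Distinct individual organisms across all samples.
distinct = sorted({name for value in organisms for name in split_values(value)})
print(distinct)  # ['Homo sapiens', 'Saccharomyces cerevisiae']
```

Splitting before deduplicating matters here: taking unique values of the raw column would report the joined string as if it were a separate organism.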
Example Rows
| sample_accession | organism | organism_part | disease | sex | biological_replicate |
|---|---|---|---|---|---|
| Sample_01 | Homo sapiens | breast | breast cancer | female | 1 |
| Sample_02 | Homo sapiens | breast | normal | female | 2 |
| Sample_03 | Homo sapiens; Saccharomyces cerevisiae | breast | breast cancer | male | 3 |
Reading sample.parquet
import pyarrow.compute as pc
import pyarrow.parquet as pq

samples = pq.read_table("PXD014414.sample.parquet")
print(samples.column_names)
# ['sample_accession', 'organism', 'organism_part', 'disease', ..., 'biological_replicate']

# Get unique organisms
organisms = pc.unique(samples.column("organism")).to_pylist()
print(organisms)  # ['Homo sapiens', 'Homo sapiens; Saccharomyces cerevisiae']
Querying with DuckDB
-- All unique organisms in the project
SELECT DISTINCT organism
FROM 'PXD014414.sample.parquet';
-- Samples with a specific disease
SELECT sample_accession, organism_part, disease
FROM 'PXD014414.sample.parquet'
WHERE disease LIKE '%breast cancer%';
-- Group by extra column
SELECT biological_replicate, COUNT(*) AS n_samples
FROM 'PXD014414.sample.parquet'
GROUP BY biological_replicate;
Related Pages
- Run Metadata -- run.parquet
- Dataset Metadata -- dataset.parquet
- SDRF Conversion -- Column mapping and ingestion process
- QPX Format Overview