Dataset Metadata

dataset.parquet is a single-row Parquet file capturing project-level metadata for a QPX dataset: project identity and software provenance. It does not store per-run or per-sample metadata; those live in run.parquet and sample.parquet, respectively.

See the full YAML schema in dataset.yaml.

Aggregated biology is derived, not stored

Fields like organisms, organism parts, diseases, and cell lines are not stored in dataset.parquet. These are derived at query time from sample.parquet:

SELECT DISTINCT UNNEST(organism) AS organism FROM 'PXD014414.sample.parquet';

This eliminates the risk of project metadata drifting out of sync with the actual sample annotations. Similarly, instrument, enzyme, and modification information lives in run.parquet since it can vary per run.


PyArrow Schema

import pyarrow as pa

dataset_schema = pa.schema([
    # Project identity
    pa.field("project_accession", pa.string(), nullable=False),
    pa.field("project_title", pa.string(), nullable=True),
    pa.field("project_description", pa.string(), nullable=True),
    pa.field("pubmed_id", pa.string(), nullable=True),
    pa.field("qpx_version", pa.string(), nullable=False),

    # Software
    pa.field("software_name", pa.string(), nullable=True),
    pa.field("software_version", pa.string(), nullable=True),

    # Provenance
    pa.field("creation_date", pa.string(), nullable=False),

    # Integrity / Packaging
    pa.field("file_checksums", pa.map_(pa.string(), pa.string()), nullable=True),
    pa.field("file_row_counts", pa.map_(pa.string(), pa.int64()), nullable=True),
    pa.field("file_sizes_bytes", pa.map_(pa.string(), pa.int64()), nullable=True),
    pa.field("total_structures", pa.int32(), nullable=True),
    pa.field("packaged_at", pa.string(), nullable=True),
])

Field Reference

Core Identity

| Field | Description | Type | Required |
| --- | --- | --- | --- |
| project_accession | Project accession identifier (e.g. PXD014414) | string | yes |
| project_title | Title of the project | string | no |
| project_description | Description of the project | string | no |
| pubmed_id | PubMed ID associated with the project | string | no |
| qpx_version | Version of the QPX format specification | string | yes |

Software Provenance

| Field | Description | Type | Required |
| --- | --- | --- | --- |
| software_name | Name of the software that generated the data (e.g. quantms) | string | no |
| software_version | Version of the software | string | no |

File Provenance

| Field | Description | Type | Required |
| --- | --- | --- | --- |
| creation_date | ISO 8601 date when the dataset was created | string | yes |
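Both creation_date and the packaged_at field in the next section are stored as plain ISO 8601 strings. A sketch of producing such values with the standard library; the exact precision QPX expects is an assumption:

```python
from datetime import datetime, timezone

now = datetime.now(timezone.utc)

# creation_date is a date; packaged_at is a full timestamp.
creation_date = now.date().isoformat()           # e.g. 2024-05-01
packaged_at = now.isoformat(timespec="seconds")  # e.g. 2024-05-01T12:00:00+00:00
print(creation_date, packaged_at)
```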

Integrity / Packaging

These fields enable dataset validation after packaging or transfer. They are populated by calling ds.compute_integrity() and verified with ds.verify_integrity().

| Field | Description | Type | Required |
| --- | --- | --- | --- |
| file_checksums | SHA-256 hex digests keyed by file name (relative to dataset root) | map&lt;string, string&gt; | no |
| file_row_counts | Row counts keyed by file name | map&lt;string, int64&gt; | no |
| file_sizes_bytes | File sizes in bytes keyed by file name | map&lt;string, int64&gt; | no |
| total_structures | Number of Parquet structures in this dataset | int32 | no |
| packaged_at | ISO 8601 timestamp when integrity was computed | string | no |

Computing and verifying integrity

import qpx

ds = qpx.open("PXD014414/")

# Compute integrity (checksums, row counts, file sizes)
integrity = ds.compute_integrity()

# Save integrity data into dataset.parquet
meta_df = ds.dataset_meta.to_df()
meta_dict = meta_df.iloc[0].to_dict()
meta_dict.update(integrity)
ds.save_structure([meta_dict], "dataset", prefix="PXD014414")
ds.refresh()

# Later: verify the dataset is intact
result = ds.verify_integrity()
if result["errors"]:
    print("Integrity check failed:", result["errors"])
else:
    print("Dataset integrity verified")

Reading dataset.parquet in Python

import pyarrow.parquet as pq

dataset = pq.read_table("PXD014414.dataset.parquet")
row = dataset.to_pydict()

print(row["project_accession"])   # ['PXD014414']
print(row["software_name"])       # ['quantms']

Reading dataset.parquet with DuckDB

SELECT project_accession, software_name, software_version
FROM 'PXD014414.dataset.parquet';