Protein Group View¶

The protein group (PG) view is a tabular Parquet file with one record per protein group per raw file. Each record links a protein group to a raw file in which it was identified or quantified and carries peptide counts, feature counts, quality metrics, and intensity-based quantification.

This view is analogous to outputs from tools such as MaxQuant (proteinGroups.txt), DIA-NN (pg_matrix), and FragPipe protein group reports.

Use cases¶

Retrieve all protein groups identified or quantified in a given raw file.
Retrieve protein group abundance by file and condition.
Store and query FDR q-values for protein groups at both the run and experiment level.
Support downstream statistical analysis by providing per-file protein-level quantification.
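
As a minimal sketch of the first and third use cases, assume the PG records have already been loaded into Python dictionaries (for example with a Parquet reader). The helper name `protein_groups_in_run` and the 1% default cutoff are illustrative assumptions, not part of the specification.

```python
def protein_groups_in_run(records, run_file_name, max_pg_qvalue=0.01):
    """Return target protein groups from one raw file that pass a run-level q-value cutoff."""
    selected = []
    for rec in records:
        if rec["run_file_name"] != run_file_name:
            continue
        if rec["is_decoy"]:
            continue  # drop decoy groups
        qvalue = rec.get("pg_qvalue")
        # pg_qvalue is optional; records without it are skipped in this sketch
        if qvalue is None or qvalue > max_pg_qvalue:
            continue
        selected.append(rec)
    return selected


records = [
    {"pg_accessions": ["P04217"], "run_file_name": "run_01", "is_decoy": False, "pg_qvalue": 0.005},
    {"pg_accessions": ["P02768"], "run_file_name": "run_01", "is_decoy": False, "pg_qvalue": 0.2},
    {"pg_accessions": ["REV_P1"], "run_file_name": "run_01", "is_decoy": True, "pg_qvalue": 0.001},
    {"pg_accessions": ["P04217"], "run_file_name": "run_02", "is_decoy": False, "pg_qvalue": 0.002},
]
hits = protein_groups_in_run(records, "run_01")
print([rec["pg_accessions"] for rec in hits])
```

The same pattern applies to `global_qvalue` when an experiment-level cutoff is wanted instead.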

Schema¶

Fields marked with (PK) are primary keys and MUST NOT be null. Fields whose type is listed with `null` (for example, `float64`, null) may have null values. See the full YAML schema in pg.yaml.

Identity¶

Field	Description	Type	Required
`pg_accessions`	Protein accessions of all proteins within this group	`array[string]`	Yes (PK)
`pg_names`	Descriptive names for the proteins in the group	`array[string]`	No
`gg_accessions`	Gene group accessions as a string array	`array[string]`	No
`gg_names`	Gene names corresponding to the proteins in the group	`array[string]`	No
`gg_qvalue`	Gene group q-value (e.g., DIA-NN GG.Q.Value)	`float64`, null	No
`anchor_protein`	Representative protein of the group (leading protein)	`string`	No
`run_file_name`	The raw file containing the identified/quantified protein group	`string`	Yes (PK)

Counts¶

Field	Description	Type	Required
`peptide_counts`	Peptide sequence counts for this protein group in this file	`struct`	No
`peptide_counts.unique_sequences`	Number of peptide sequences unique to this protein group within this file	`int`	--
`peptide_counts.total_sequences`	Total number of peptide sequences identified for this protein group in this file	`int`	--
`feature_counts`	Peptide feature counts (peptide-charge combinations) for this protein group	`struct`	No
`feature_counts.unique_features`	Number of unique peptide features specific to this protein group within this file	`int`	--
`feature_counts.total_features`	Total number of peptide features identified for this protein group in this file	`int`	--

Quality¶

Field	Description	Type	Required
`global_qvalue`	Global q-value of the protein group at the experiment level	`float64`	No
`pg_qvalue`	Protein group q-value at the run level	`float64`	No
`is_decoy`	Whether the protein group is a decoy (`true`) or a target (`false`)	`bool`	Yes
`contaminant`	Contaminant flag	`bool`, null	No
`sequence_coverage`	Percentage of the protein sequence covered by identified peptides	`float32`	No
`molecular_weight`	Molecular weight of the protein in kDa	`float32`	No

Quantification¶

Field	Description	Type	Required
`intensities`	Primary intensity-based abundance of the protein group across labels. See Intensities	`array[struct]`	No
`additional_intensities`	Pre-computed intensity values from the upstream tool (normalized, LFQ, iBAQ, etc.). See Intensities	`array[struct]`	No
`additional_scores`	Additional scores and metrics (posterior error probability, confidence, etc.). See Scores	`array[struct]`	No

Each entry in intensities contains:

Sub-field	Description	Type
`label`	Label identifier (e.g., TMT126, LFQ)	`string`
`intensity`	Raw intensity value	`float32`

Each entry in additional_intensities contains:

Sub-field	Description	Type
`label`	Label identifier (e.g., TMT126, LFQ)	`string`
`intensities`	Array of name-value pairs for derived intensities	`array[struct{intensity_name, intensity_value}]`
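
The two nested structures can be read together. The sketch below flattens `intensities` and `additional_intensities` from one record into `(label, name, value)` tuples. The function name `flatten_intensities` and the use of the literal name `"intensity"` for the primary value are assumptions made for illustration.

```python
def flatten_intensities(record):
    """Return (label, intensity_name, value) tuples for one PG record."""
    rows = []
    # Primary intensities: one value per label
    for entry in record.get("intensities") or []:
        rows.append((entry["label"], "intensity", entry["intensity"]))
    # Additional intensities: several named values per label
    for entry in record.get("additional_intensities") or []:
        for item in entry["intensities"]:
            rows.append((entry["label"], item["intensity_name"], item["intensity_value"]))
    return rows


record = {
    "intensities": [{"label": "TMT126", "intensity": 1.5e8}],
    "additional_intensities": [
        {
            "label": "TMT126",
            "intensities": [
                {"intensity_name": "lfq", "intensity_value": 1.2e8},
                {"intensity_name": "ibaq", "intensity_value": 3.4e7},
            ],
        }
    ],
}
rows = flatten_intensities(record)
```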

Peptide detail¶

Field	Description	Type	Required
`peptides`	Number of peptides per individual protein in the protein group	`array[struct]`	Yes

Each entry in peptides contains:

Sub-field	Description	Type
`protein_name`	Protein accession	`string`
`peptide_count`	Number of peptides for this protein	`int`

CV params¶

Field	Description	Type	Required
`cv_params`	Optional list of controlled vocabulary parameters for additional metadata	`array[struct{cv_name, cv_value}]`	No

Example¶

{
  "pg_accessions": ["P04217", "A0A024R4E5"],
  "pg_names": ["Alpha-1B-glycoprotein", "Alpha-1B-glycoprotein variant"],
  "gg_accessions": ["A1BG"],
  "gg_names": ["A1BG"],
  "anchor_protein": "P04217",
  "run_file_name": "20230101_sample_01",
  "peptide_counts": {
    "unique_sequences": 12,
    "total_sequences": 18
  },
  "feature_counts": {
    "unique_features": 24,
    "total_features": 36
  },
  "global_qvalue": 0.001,
  "pg_qvalue": 0.005,
  "is_decoy": false,
  "contaminant": false,
  "sequence_coverage": 45.2,
  "molecular_weight": 54.3,
  "intensities": [
    {
      "label": "TMT126",
      "intensity": 1.5e8
    }
  ],
  "additional_intensities": [
    {
      "label": "TMT126",
      "intensities": [
        {"intensity_name": "lfq", "intensity_value": 1.2e8},
        {"intensity_name": "ibaq", "intensity_value": 3.4e7}
      ]
    }
  ],
  "peptides": [
    {"protein_name": "P04217", "peptide_count": 15},
    {"protein_name": "A0A024R4E5", "peptide_count": 3}
  ],
  "additional_scores": [
    {"score_name": "posterior_error_probability", "score_value": 0.0001, "higher_better": false}
  ],
  "cv_params": null
}

Recommended Intensity Names¶

The following names are commonly used in additional_intensities for the PG view. Use consistent naming so downstream tools can recognise them across datasets.

`intensity_name`	Source tool(s)	Description
`maxlfq`	DIA-NN, MaxQuant, FragPipe	MaxLFQ normalised protein group quantity
`lfq`	MaxQuant	Label-free quantification intensity
`ibaq`	MaxQuant	Intensity-based absolute quantification
`topn`	DIA-NN	Top-N normalised protein group quantity
`normalize_intensity`	quantms	Median-normalised intensity
`spectral_count`	FragPipe, MaxQuant	Number of PSMs for razor peptides
`unique_spectral_count`	FragPipe	Number of PSMs for unique peptides only
`total_spectral_count`	FragPipe	Number of PSMs for all peptides (razor + shared)
`genes_maxlfq`	DIA-NN	Gene-group level MaxLFQ quantity
`genes_maxlfq_unique`	DIA-NN	Gene-group MaxLFQ using only proteotypic peptides
`ratio_h_l`	MaxQuant (SILAC)	Heavy-to-light protein ratio
`ratio_h_l_normalized`	MaxQuant (SILAC)	Normalised heavy-to-light ratio
`reporter_intensity`	MaxQuant (TMT/iTRAQ)	Raw reporter-ion intensity
`reporter_intensity_corrected`	MaxQuant (TMT/iTRAQ)	Isotope-purity-corrected reporter intensity

Spectral counts as intensities

Spectral counts are quantitative measures that vary per run and per label, so they fit naturally in additional_intensities. Store them as float values (e.g., 3.0 instead of 3) since the struct uses float32.
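
A minimal sketch of that conversion, assuming integer counts keyed by the recommended names. The helper `spectral_count_entry` is hypothetical.

```python
def spectral_count_entry(label, counts):
    """Build one additional_intensities entry from integer spectral counts."""
    return {
        "label": label,
        "intensities": [
            # Cast to float because intensity_value is a float field
            {"intensity_name": name, "intensity_value": float(value)}
            for name, value in counts.items()
        ],
    }


entry = spectral_count_entry("LFQ", {"spectral_count": 3, "unique_spectral_count": 2})
```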

Gene-group quantities

DIA-NN reports protein-group level (PG.MaxLFQ) and gene-group level (Genes.MaxLFQ) quantities separately. Store gene-group quantities in additional_intensities with names prefixed genes_ (e.g., genes_maxlfq, genes_maxlfq_unique). The gg_accessions field identifies the gene group.
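
The naming convention can be sketched as a column-to-name lookup. The sketch assumes a DIA-NN report row is available as a dictionary and that label-free data uses the label `LFQ`. The helper name `diann_additional_intensities` is illustrative.

```python
# DIA-NN column -> recommended intensity_name (gene-group names carry the genes_ prefix)
DIANN_INTENSITY_NAMES = {
    "PG.MaxLFQ": "maxlfq",
    "Genes.MaxLFQ": "genes_maxlfq",
    "Genes.MaxLFQ.Unique": "genes_maxlfq_unique",
}


def diann_additional_intensities(row, label="LFQ"):
    """Collect protein-group and gene-group quantities into one additional_intensities entry."""
    items = []
    for column, name in DIANN_INTENSITY_NAMES.items():
        value = row.get(column)
        if value is None:
            continue  # column absent from this report
        items.append({"intensity_name": name, "intensity_value": float(value)})
    return [{"label": label, "intensities": items}]


row = {"Protein.Group": "P04217", "Genes": "A1BG", "PG.MaxLFQ": 1.2e8, "Genes.MaxLFQ": 1.1e8}
additional = diann_additional_intensities(row)
```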

Recommended Score Names¶

The following score names are commonly used in additional_scores for the PG view. For the full list of recommended score names and naming conventions, see Scores.

`score_name`	Source tool(s)	Direction	Description
`posterior_error_probability`	MaxQuant	lower is better	Protein-group PEP
`andromeda_score`	MaxQuant	higher is better	Andromeda protein-level score
`protein_probability`	FragPipe	higher is better	ProteinProphet probability
`top_peptide_probability`	FragPipe	higher is better	Highest PeptideProphet probability among peptides
`pg_maxlfq_quality`	DIA-NN	higher is better	QuantUMS quality for PG.MaxLFQ estimate
`genes_maxlfq_quality`	DIA-NN	higher is better	QuantUMS quality for gene-level MaxLFQ
`pg_pep`	DIA-NN	lower is better	Posterior error probability at the protein group level
`gg_qvalue`	DIA-NN	lower is better	Gene group q-value
`lib_pg_qvalue`	DIA-NN	lower is better	Library protein group q-value
`protein_qvalue`	DIA-NN	lower is better	Unique protein q-value (proteotypic evidence)
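
The Direction column corresponds to the `higher_better` flag shown in the example record. As a hedged sketch, a converter could keep the table as a lookup and build `additional_scores` entries from it. The names `HIGHER_BETTER` and `make_score` are assumptions for illustration.

```python
# Direction column from the table above, expressed as the higher_better flag
HIGHER_BETTER = {
    "posterior_error_probability": False,
    "andromeda_score": True,
    "protein_probability": True,
    "top_peptide_probability": True,
    "pg_maxlfq_quality": True,
    "genes_maxlfq_quality": True,
    "pg_pep": False,
    "gg_qvalue": False,
    "lib_pg_qvalue": False,
    "protein_qvalue": False,
}


def make_score(score_name, score_value):
    """Build one additional_scores entry for a recommended score name."""
    if score_name not in HIGHER_BETTER:
        raise KeyError(f"not a recommended score name: {score_name}")
    return {
        "score_name": score_name,
        "score_value": float(score_value),
        "higher_better": HIGHER_BETTER[score_name],
    }


pep = make_score("posterior_error_probability", 0.0001)
```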

Tool Mappings¶

This section shows how output columns from common search engines and pipelines map to pg.parquet fields.

Wide-to-long conversion

MaxQuant (proteinGroups.txt), DIA-NN (pg_matrix), and FragPipe (combined_protein.tsv) report protein groups in wide format (one row per protein group, one column per experiment or run). QPX uses long format (one row per protein group per run). Converters must melt the wide-format columns into separate rows keyed by run_file_name.
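
A minimal sketch of the melt step for a MaxQuant-style row. It assumes the per-experiment columns are named `Intensity <experiment>`, that each experiment corresponds to one raw file, and that label-free data uses the label `LFQ`. The function name `melt_maxquant_row` is hypothetical.

```python
def melt_maxquant_row(row, experiments):
    """Turn one wide proteinGroups.txt row into one long record per experiment."""
    accessions = row["Protein IDs"].split(";")
    records = []
    for exp in experiments:
        raw = row.get(f"Intensity {exp}")
        if raw in (None, ""):
            continue  # no intensity reported for this experiment
        records.append({
            "pg_accessions": accessions,
            "run_file_name": exp,  # assumption: experiment name equals raw file name
            "intensities": [{"label": "LFQ", "intensity": float(raw)}],
        })
    return records


wide_row = {
    "Protein IDs": "P04217;A0A024R4E5",
    "Intensity run_01": "150000000",
    "Intensity run_02": "",
}
long_records = melt_maxquant_row(wide_row, ["run_01", "run_02"])
```

When one experiment spans several raw files, a real converter would need an explicit experiment-to-file mapping instead of the one-to-one assumption used here.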

MaxQuant (`proteinGroups.txt`)¶

The tables below cover, in order: Identity & Metadata, Counts, Quantification, and Quality.

MaxQuant column	QPX field	Notes
`Protein IDs`	`pg_accessions`	Semicolon-separated → array
`Majority protein IDs`	`anchor_protein`	First accession of majority IDs
`Protein names`	`pg_names`	Semicolon-separated → array
`Gene names`	`gg_names`	Semicolon-separated → array
`Sequence coverage [%]`	`sequence_coverage`
`Mol. weight [kDa]`	`molecular_weight`
`Reverse`	`is_decoy`	`"+"` → `true`, else `false`
`Potential contaminant`	`contaminant`	`"+"` → `true`, else `false`
`Only identified by site`	`cv_params`	`{cv_name: "only_identified_by_site", cv_value: "true"}`
`Identification type [exp]`	`cv_params`	`{cv_name: "identification_type", cv_value: "By MS/MS"}`
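
The identity mappings above can be sketched as follows, assuming one proteinGroups.txt row has been read into a dictionary of strings. The helper names `split_list` and `maxquant_identity` are illustrative, and the contaminant flag is emitted as a boolean to match the schema type.

```python
def split_list(value):
    """Split a semicolon-separated MaxQuant cell into a list, dropping empty items."""
    return [item for item in value.split(";") if item] if value else []


def maxquant_identity(row):
    """Map MaxQuant identity and metadata columns to QPX fields."""
    majority = split_list(row.get("Majority protein IDs", ""))
    return {
        "pg_accessions": split_list(row.get("Protein IDs", "")),
        "pg_names": split_list(row.get("Protein names", "")),
        "gg_names": split_list(row.get("Gene names", "")),
        # First accession of the majority IDs is the representative protein
        "anchor_protein": majority[0] if majority else None,
        "is_decoy": row.get("Reverse") == "+",
        "contaminant": row.get("Potential contaminant") == "+",
    }


mq_row = {
    "Protein IDs": "P04217;A0A024R4E5",
    "Majority protein IDs": "P04217",
    "Protein names": "Alpha-1B-glycoprotein",
    "Gene names": "A1BG",
    "Reverse": "",
    "Potential contaminant": "+",
}
identity = maxquant_identity(mq_row)
```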

MaxQuant column	QPX field	Notes
`Unique peptides`	`peptide_counts.unique_sequences`	Peptides exclusive to this PG
`Peptides` or `Razor + unique peptides`	`peptide_counts.total_sequences`	Total assigned peptides
`Peptide counts (all)`	`peptides`	Per-protein peptide counts

MaxQuant column	QPX field	Notes
`Intensity [exp]`	`intensities`	Primary raw intensity per run
`LFQ intensity [exp]`	`additional_intensities` → `lfq`	MaxLFQ normalised
`iBAQ [exp]`	`additional_intensities` → `ibaq`	Intensity-based absolute quantification
`MS/MS count [exp]`	`additional_intensities` → `spectral_count`	PSM count as float
`Reporter intensity [channel]`	`intensities`	For TMT/iTRAQ, one entry per channel
`Reporter intensity corrected [channel]`	`additional_intensities` → `reporter_intensity_corrected`
`Ratio H/L [exp]`	`additional_intensities` → `ratio_h_l`	SILAC ratio
`Ratio H/L normalized [exp]`	`additional_intensities` → `ratio_h_l_normalized`	Normalised SILAC ratio

MaxQuant column	QPX field	Notes
`Q-value`	`global_qvalue`	Protein group FDR
`Score`	`additional_scores` → `andromeda_score`
`PEP`	`additional_scores` → `posterior_error_probability`

DIA-NN (main report)¶

DIA-NN's main report is precursor-level: PG-level columns (PG.*, Genes.*) are repeated for every precursor of the group within a run. Deduplicate by Protein.Group + Run so that each protein group yields exactly one pg.parquet row per run.
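
A minimal sketch of the deduplication, assuming the report rows are dictionaries keyed by DIA-NN column name. The function name `deduplicate_pg_rows` is hypothetical. Because PG-level values are repeated across precursors, keeping the first row per key is sufficient here.

```python
def deduplicate_pg_rows(precursor_rows):
    """Keep one row per (Protein.Group, Run) from a precursor-level report."""
    first_seen = {}
    for row in precursor_rows:
        key = (row["Protein.Group"], row["Run"])
        # setdefault keeps the first precursor row seen for each group and run
        first_seen.setdefault(key, row)
    return list(first_seen.values())


precursors = [
    {"Protein.Group": "P04217", "Run": "run_01", "Precursor.Id": "AAK2", "PG.Quantity": 1.5e8},
    {"Protein.Group": "P04217", "Run": "run_01", "Precursor.Id": "LLR3", "PG.Quantity": 1.5e8},
    {"Protein.Group": "P04217", "Run": "run_02", "Precursor.Id": "AAK2", "PG.Quantity": 1.4e8},
]
pg_rows = deduplicate_pg_rows(precursors)
```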

The tables below cover, in order: Identity, Quantification, and Quality.

DIA-NN column	QPX field	Notes
`Protein.Group`	`pg_accessions`	Semicolon-separated → array; also `anchor_protein` (first accession)
`Protein.Names`	`pg_names`
`Genes`	`gg_names` / `gg_accessions`
`Run`	`run_file_name`	Raw file name without path

DIA-NN column	QPX field	Notes
`PG.Quantity`	`intensities`	Non-normalised protein group quantity
`PG.MaxLFQ`	`additional_intensities` → `maxlfq`	QuantUMS/MaxLFQ normalised
`PG.Normalised`	`additional_intensities` → `normalize_intensity`	Normalised PG quantity
`PG.TopN`	`additional_intensities` → `topn`	Top-N normalised quantity
`Genes.MaxLFQ`	`additional_intensities` → `genes_maxlfq`	Gene-group MaxLFQ
`Genes.MaxLFQ.Unique`	`additional_intensities` → `genes_maxlfq_unique`	Gene-group MaxLFQ (proteotypic only)
`Genes.TopN`	`additional_intensities` → `genes_topn`	Gene-group Top-N

DIA-NN column	QPX field	Notes
`Global.PG.Q.Value`	`global_qvalue`	Experiment-level PG FDR
`PG.Q.Value`	`pg_qvalue`	Run-level PG FDR
`PG.PEP`	`additional_scores` → `pg_pep`
`PG.MaxLFQ.Quality`	`additional_scores` → `pg_maxlfq_quality`	QuantUMS quality metric
`GG.Q.Value`	`additional_scores` → `gg_qvalue`	Gene group q-value
`Lib.PG.Q.Value`	`additional_scores` → `lib_pg_qvalue`	Library PG q-value
`Protein.Q.Value`	`additional_scores` → `protein_qvalue`	Unique protein q-value
`Genes.MaxLFQ.Quality`	`additional_scores` → `genes_maxlfq_quality`

FragPipe (`combined_protein.tsv`)¶

FragPipe outputs are pre-filtered (no decoys or contaminants). Per-experiment columns are prefixed with the experiment name.

The tables below cover, in order: Identity & Metadata, Counts, Quantification, and Quality.

FragPipe column	QPX field	Notes
`Protein ID`	`anchor_protein`	Leading protein accession
`Protein ID` + `Indistinguishable Proteins`	`pg_accessions`	Combine into array
`Description`	`pg_names`	Protein description
`Gene`	`gg_names` / `gg_accessions`
`Coverage`	`sequence_coverage`	Percentage (0–100)
`Protein Length`	`cv_params`	`{cv_name: "sequence_length", cv_value: "495"}`
`Protein Existence`	`cv_params`	`{cv_name: "protein_existence", cv_value: "1"}`

FragPipe column	QPX field	Notes
`[exp] Total Peptides`	`peptide_counts.total_sequences`	Per-experiment peptide count

FragPipe column	QPX field	Notes
`[exp] Intensity`	`intensities`	Primary razor peptide intensity
`[exp] MaxLFQ Intensity`	`additional_intensities` → `maxlfq`	MaxLFQ normalised
`[exp] MaxLFQ Unique Intensity`	`additional_intensities` → `maxlfq_unique`	MaxLFQ using only unique peptides
`[exp] Unique Intensity`	`additional_intensities` → `unique_intensity`	Intensity from unique peptides only
`[exp] Spectral Count`	`additional_intensities` → `spectral_count`	Razor peptide PSMs
`[exp] Unique Spectral Count`	`additional_intensities` → `unique_spectral_count`	Unique peptide PSMs
`[exp] Total Spectral Count`	`additional_intensities` → `total_spectral_count`	All peptide PSMs

FragPipe column	QPX field	Notes
`Protein Probability`	`additional_scores` → `protein_probability`	ProteinProphet score
`Top Peptide Probability`	`additional_scores` → `top_peptide_probability`	Best PeptideProphet score

FragPipe q-values

FragPipe does not report explicit protein-group q-values. Instead, proteins are filtered at a given FDR threshold (typically 1%) using the ProteinProphet Protein Probability before combined_protein.tsv is written. Set global_qvalue to null and store the probability in additional_scores.
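
In code, that recommendation might look like the sketch below, which assumes a combined_protein.tsv row held as a dictionary. The helper name `fragpipe_quality_fields` is illustrative.

```python
def fragpipe_quality_fields(row):
    """Build the quality-related QPX fields for one FragPipe protein row."""
    scores = [
        {
            "score_name": "protein_probability",
            "score_value": float(row["Protein Probability"]),
            "higher_better": True,
        }
    ]
    top = row.get("Top Peptide Probability")
    if top not in (None, ""):
        scores.append({
            "score_name": "top_peptide_probability",
            "score_value": float(top),
            "higher_better": True,
        })
    # FragPipe reports no protein-group q-value, so global_qvalue stays null
    return {"global_qvalue": None, "additional_scores": scores}


quality = fragpipe_quality_fields({"Protein Probability": "0.9998", "Top Peptide Probability": "0.999"})
```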

Notes¶

Relationship to other views

The PG view provides per-file protein group quantification. For derived per-sample summaries (protein counts, abundances, etc.), see API Views. For downstream absolute or differential expression results, see the Absolute Expression and Differential Expression views.

Primary key constraints

The combination of pg_accessions and run_file_name forms the composite primary key. Both fields MUST NOT be null. Each record represents a single protein group observed in a single raw file.
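
A writer could check this constraint before emitting the file. The sketch below is one possible check, not a prescribed validator: the function name `check_primary_keys` is hypothetical, and it assumes that the order of accessions within `pg_accessions` is significant when comparing keys.

```python
def check_primary_keys(records):
    """Return a list of problems with the composite key (pg_accessions, run_file_name)."""
    problems = []
    seen = set()
    for index, rec in enumerate(records):
        accessions = rec.get("pg_accessions")
        run = rec.get("run_file_name")
        if accessions is None or run is None:
            problems.append(f"record {index}: null primary key field")
            continue
        # Lists are not hashable, so the accession array is compared as a tuple
        key = (tuple(accessions), run)
        if key in seen:
            problems.append(f"record {index}: duplicate key")
        seen.add(key)
    return problems


candidate_records = [
    {"pg_accessions": ["P04217", "A0A024R4E5"], "run_file_name": "run_01"},
    {"pg_accessions": ["P04217", "A0A024R4E5"], "run_file_name": "run_02"},
    {"pg_accessions": ["P04217", "A0A024R4E5"], "run_file_name": "run_01"},
    {"pg_accessions": None, "run_file_name": "run_01"},
]
problems = check_primary_keys(candidate_records)
```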