Protein Group View

The protein group (PG) view is a tabular Parquet file that contains the details of protein groups identified and quantified per raw file. It captures the relationship between protein groups and the raw files in which they were detected, including peptide counts, feature counts, quality metrics, and intensity-based quantification.

This view is analogous to outputs from tools such as MaxQuant (proteinGroups.txt), DIA-NN (pg_matrix), and FragPipe protein group reports.

Use cases

  • Retrieve all protein groups identified or quantified in a given raw file.
  • Retrieve protein group abundance by file and condition.
  • Store and query FDR q-values for protein groups at both the run and experiment level.
  • Support downstream statistical analysis by providing per-file protein-level quantification.

Schema

Fields marked with (PK) are primary keys and MUST NOT be null. Fields marked with (nullable) may have null values. See the full YAML schema in pg.yaml.

Identity

| Field | Description | Type | Required |
| --- | --- | --- | --- |
| pg_accessions | Protein accessions of all proteins within this group | array[string] | Yes (PK) |
| pg_names | Descriptive names for the proteins in the group | array[string] | No |
| gg_accessions | Gene group accessions as a string array | array[string] | No |
| gg_names | Gene names corresponding to the proteins in the group | array[string] | No |
| gg_qvalue | Gene group q-value (e.g., DIA-NN GG.Q.Value) | float64, null | No |
| anchor_protein | Representative protein of the group (leading protein) | string | No |
| run_file_name | The raw file containing the identified/quantified protein group | string | Yes (PK) |

Counts

| Field | Description | Type | Required |
| --- | --- | --- | --- |
| peptide_counts | Peptide sequence counts for this protein group in this file | struct | No |
| peptide_counts.unique_sequences | Number of peptide sequences unique to this protein group within this file | int | -- |
| peptide_counts.total_sequences | Total number of peptide sequences identified for this protein group in this file | int | -- |
| feature_counts | Peptide feature counts (peptide-charge combinations) for this protein group | struct | No |
| feature_counts.unique_features | Number of unique peptide features specific to this protein group within this file | int | -- |
| feature_counts.total_features | Total number of peptide features identified for this protein group in this file | int | -- |

Quality

| Field | Description | Type | Required |
| --- | --- | --- | --- |
| global_qvalue | Global q-value of the protein group at the experiment level | float64 | No |
| pg_qvalue | Protein group q-value at the run level | float64 | No |
| is_decoy | Whether the protein group is a decoy (true) or a target (false) | bool | Yes |
| contaminant | Contaminant flag | bool, null | No |
| sequence_coverage | Percentage of the protein sequence covered by identified peptides | float32 | No |
| molecular_weight | Molecular weight of the protein in kDa | float32 | No |

Quantification

| Field | Description | Type | Required |
| --- | --- | --- | --- |
| intensities | Primary intensity-based abundance of the protein group across labels. See Intensities | array[struct] | No |
| additional_intensities | Pre-computed intensity values from the upstream tool (normalized, LFQ, iBAQ, etc.). See Intensities | array[struct] | No |
| additional_scores | Additional scores and metrics (posterior error probability, confidence, etc.). See Scores | array[struct] | No |

Each entry in intensities contains:

| Sub-field | Description | Type |
| --- | --- | --- |
| label | Label identifier (e.g., TMT126, LFQ) | string |
| intensity | Raw intensity value | float32 |

Each entry in additional_intensities contains:

| Sub-field | Description | Type |
| --- | --- | --- |
| label | Label identifier (e.g., TMT126, LFQ) | string |
| intensities | Array of name-value pairs for derived intensities | array[struct{intensity_name, intensity_value}] |

Peptide detail

| Field | Description | Type | Required |
| --- | --- | --- | --- |
| peptides | Number of peptides per individual protein in the protein group | array[struct] | Yes |

Each entry in peptides contains:

| Sub-field | Description | Type |
| --- | --- | --- |
| protein_name | Protein accession | string |
| peptide_count | Number of peptides for this protein | int |

CV params

| Field | Description | Type | Required |
| --- | --- | --- | --- |
| cv_params | Optional list of controlled vocabulary parameters for additional metadata | array[struct{cv_name, cv_value}] | No |

Example

{
  "pg_accessions": ["P04217", "A0A024R4E5"],
  "pg_names": ["Alpha-1B-glycoprotein", "Alpha-1B-glycoprotein variant"],
  "gg_accessions": ["A1BG"],
  "gg_names": ["A1BG"],
  "anchor_protein": "P04217",
  "run_file_name": "20230101_sample_01",
  "peptide_counts": {
    "unique_sequences": 12,
    "total_sequences": 18
  },
  "feature_counts": {
    "unique_features": 24,
    "total_features": 36
  },
  "global_qvalue": 0.001,
  "pg_qvalue": 0.005,
  "is_decoy": false,
  "contaminant": false,
  "sequence_coverage": 45.2,
  "molecular_weight": 54.3,
  "intensities": [
    {
      "label": "TMT126",
      "intensity": 1.5e8
    }
  ],
  "additional_intensities": [
    {
      "label": "TMT126",
      "intensities": [
        {"intensity_name": "lfq", "intensity_value": 1.2e8},
        {"intensity_name": "ibaq", "intensity_value": 3.4e7}
      ]
    }
  ],
  "peptides": [
    {"protein_name": "P04217", "peptide_count": 15},
    {"protein_name": "A0A024R4E5", "peptide_count": 3}
  ],
  "additional_scores": [
    {"score_name": "posterior_error_probability", "score_value": 0.0001, "higher_better": false}
  ],
  "cv_params": null
}
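
As a rough illustration of the first use case, records shaped like the example above can be filtered in memory by raw file and run-level q-value. This is a minimal sketch, not part of the specification: the function name `protein_groups_in_run`, the 1% default threshold, and the sample records are assumptions made for the example.

```python
def protein_groups_in_run(records, run_file_name, max_qvalue=0.01):
    """Return target protein groups seen in one raw file at or below a run-level q-value."""
    selected = []
    for rec in records:
        # Skip other raw files and decoy groups.
        if rec["run_file_name"] != run_file_name or rec["is_decoy"]:
            continue
        qvalue = rec.get("pg_qvalue")
        # pg_qvalue is optional; records without one are skipped here (an assumption).
        if qvalue is not None and qvalue <= max_qvalue:
            selected.append(rec)
    return selected


# Illustrative records only; the values are made up for this sketch.
records = [
    {"pg_accessions": ["P04217", "A0A024R4E5"], "run_file_name": "20230101_sample_01",
     "pg_qvalue": 0.005, "is_decoy": False},
    {"pg_accessions": ["P02768"], "run_file_name": "20230101_sample_01",
     "pg_qvalue": 0.2, "is_decoy": False},
    {"pg_accessions": ["P04217", "A0A024R4E5"], "run_file_name": "20230101_sample_02",
     "pg_qvalue": 0.004, "is_decoy": False},
]

hits = protein_groups_in_run(records, "20230101_sample_01")
print([rec["pg_accessions"] for rec in hits])
```

In practice the same filter would more likely be expressed as a predicate pushed down to a Parquet reader; the in-memory version only shows which fields are involved.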

The following names are commonly used in additional_intensities for the PG view. Use consistent naming so downstream tools can recognise them across datasets.

| intensity_name | Source tool(s) | Description |
| --- | --- | --- |
| maxlfq | DIA-NN, MaxQuant, FragPipe | MaxLFQ normalised protein group quantity |
| lfq | MaxQuant | Label-free quantification intensity |
| ibaq | MaxQuant | Intensity-based absolute quantification |
| topn | DIA-NN | Top-N normalised protein group quantity |
| normalize_intensity | quantms | Median-normalised intensity |
| spectral_count | FragPipe, MaxQuant | Number of PSMs for razor peptides |
| unique_spectral_count | FragPipe | Number of PSMs for unique peptides only |
| total_spectral_count | FragPipe | Number of PSMs for all peptides (razor + shared) |
| genes_maxlfq | DIA-NN | Gene-group level MaxLFQ quantity |
| genes_maxlfq_unique | DIA-NN | Gene-group MaxLFQ using only proteotypic peptides |
| ratio_h_l | MaxQuant (SILAC) | Heavy-to-light protein ratio |
| ratio_h_l_normalized | MaxQuant (SILAC) | Normalised heavy-to-light ratio |
| reporter_intensity | MaxQuant (TMT/iTRAQ) | Raw reporter-ion intensity |
| reporter_intensity_corrected | MaxQuant (TMT/iTRAQ) | Isotope-purity-corrected reporter intensity |

Spectral counts as intensities

Spectral counts are quantitative measures that vary per run and per label, so they fit naturally in additional_intensities. Store them as float values (e.g., 3.0 instead of 3) since the struct uses float32.

Gene-group quantities

DIA-NN reports protein-group level (PG.MaxLFQ) and gene-group level (Genes.MaxLFQ) quantities separately. Store gene-group quantities in additional_intensities with names prefixed genes_ (e.g., genes_maxlfq, genes_maxlfq_unique). The gg_accessions field identifies the gene group.
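
The two conventions above (spectral counts stored as floats, gene-group quantities prefixed with genes_) can be sketched as a small helper that builds one additional_intensities entry. The helper name `build_additional_intensities` and the sample values are assumptions for illustration; only the struct layout comes from the schema.

```python
def build_additional_intensities(label, values):
    """Build one additional_intensities entry from a name -> value mapping."""
    return {
        "label": label,
        "intensities": [
            # float() turns integer spectral counts such as 3 into 3.0,
            # matching the float-valued intensity_value sub-field.
            {"intensity_name": name, "intensity_value": float(value)}
            for name, value in values.items()
        ],
    }


# Illustrative values only.
entry = build_additional_intensities(
    "LFQ",
    {"maxlfq": 1.2e8, "genes_maxlfq": 1.1e8, "spectral_count": 3},
)
print(entry)
```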

The following score names are commonly used in additional_scores for the PG view. For the full list of recommended score names and naming conventions, see Scores.

| score_name | Source tool(s) | Direction | Description |
| --- | --- | --- | --- |
| posterior_error_probability | MaxQuant | lower is better | Protein-group PEP |
| andromeda_score | MaxQuant | higher is better | Andromeda protein-level score |
| protein_probability | FragPipe | higher is better | ProteinProphet probability |
| top_peptide_probability | FragPipe | higher is better | Highest PeptideProphet probability among peptides |
| pg_maxlfq_quality | DIA-NN | higher is better | QuantUMS quality for PG.MaxLFQ estimate |
| genes_maxlfq_quality | DIA-NN | higher is better | QuantUMS quality for gene-level MaxLFQ |
| pg_pep | DIA-NN | lower is better | Posterior error probability at the protein group level |
| gg_qvalue | DIA-NN | lower is better | Gene group q-value |
| lib_pg_qvalue | DIA-NN | lower is better | Library protein group q-value |
| protein_qvalue | DIA-NN | lower is better | Unique protein q-value (proteotypic evidence) |

Tool Mappings

This section shows how output columns from common search engines and pipelines map to pg.parquet fields.

Wide-to-long conversion

MaxQuant (proteinGroups.txt), FragPipe (combined_protein.tsv), and DIA-NN's pg_matrix report protein groups in wide format (one row per protein group, one column per experiment/run). QPX uses long format (one row per protein group per run). Converters must melt the per-run columns into separate rows keyed by run_file_name.
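
A minimal sketch of the melt step, assuming MaxQuant-style column names ("Protein IDs", "Intensity [exp]") and a label-free experiment in which every entry uses the label "LFQ". The function name `melt_intensities` and the sample row are hypothetical.

```python
def melt_intensities(wide_row, prefix="Intensity "):
    """Turn one wide-format row into one long-format record per run."""
    accessions = [acc for acc in wide_row["Protein IDs"].split(";") if acc]
    records = []
    for column, value in wide_row.items():
        # Only per-run columns such as "Intensity run_01" carry the prefix
        # with a trailing space; the bare summed "Intensity" column does not.
        if not column.startswith(prefix):
            continue
        records.append({
            "pg_accessions": accessions,
            "run_file_name": column[len(prefix):],
            # "LFQ" as the label is an assumption for label-free data.
            "intensities": [{"label": "LFQ", "intensity": float(value)}],
        })
    return records


# Illustrative wide row with two runs.
wide = {
    "Protein IDs": "P04217;A0A024R4E5",
    "Intensity": "1.5e8",
    "Intensity run_01": "1.5e8",
    "Intensity run_02": "0",
}
long_rows = melt_intensities(wide)
for row in long_rows:
    print(row["run_file_name"], row["intensities"])
```

Whether a zero intensity should produce a row or be dropped as "not quantified" is a converter decision that this sketch does not make; it keeps every per-run column.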

MaxQuant (proteinGroups.txt)

| MaxQuant column | QPX field | Notes |
| --- | --- | --- |
| Protein IDs | pg_accessions | Semicolon-separated → array |
| Majority protein IDs | anchor_protein | First accession of majority IDs |
| Protein names | pg_names | Semicolon-separated → array |
| Gene names | gg_names | Semicolon-separated → array |
| Sequence coverage [%] | sequence_coverage | |
| Mol. weight [kDa] | molecular_weight | |
| Reverse | is_decoy | "+" → true, else false |
| Potential contaminant | contaminant | "+" → true, else false |
| Only identified by site | cv_params | {cv_name: "only_identified_by_site", cv_value: "true"} |
| Identification type [exp] | cv_params | {cv_name: "identification_type", cv_value: "By MS/MS"} |
| Unique peptides | peptide_counts.unique_sequences | Peptides exclusive to this PG |
| Peptides or Razor + unique peptides | peptide_counts.total_sequences | Total assigned peptides |
| Peptide counts (all) | peptides | Per-protein peptide counts |
| Intensity [exp] | intensities | Primary raw intensity per run |
| LFQ intensity [exp] | additional_intensities → lfq | MaxLFQ normalised |
| iBAQ [exp] | additional_intensities → ibaq | Intensity-based absolute quantification |
| MS/MS count [exp] | additional_intensities → spectral_count | PSM count as float |
| Reporter intensity [channel] | intensities | For TMT/iTRAQ, one entry per channel |
| Reporter intensity corrected [channel] | additional_intensities → reporter_intensity_corrected | |
| Ratio H/L [exp] | additional_intensities → ratio_h_l | SILAC ratio |
| Ratio H/L normalized [exp] | additional_intensities → ratio_h_l_normalized | Normalised SILAC ratio |
| Q-value | global_qvalue | Protein group FDR |
| Score | additional_scores → andromeda_score | |
| PEP | additional_scores → posterior_error_probability | |
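
The identity conversions in the MaxQuant mapping above (semicolon-separated values to arrays, "+" flags to booleans, first majority accession as anchor) might look like the following. This is a sketch under assumptions: the helper names `split_ids` and `maxquant_identity` are hypothetical, the input is one row already read into a dict of strings, and contaminant is emitted as a boolean because the schema declares it as bool.

```python
def split_ids(value):
    """Split a semicolon-separated MaxQuant cell into a list, dropping empty parts."""
    return [part for part in value.split(";") if part]


def maxquant_identity(row):
    """Map the identity and flag columns of one proteinGroups.txt row."""
    majority = split_ids(row.get("Majority protein IDs", ""))
    return {
        "pg_accessions": split_ids(row["Protein IDs"]),
        "pg_names": split_ids(row.get("Protein names", "")),
        "gg_names": split_ids(row.get("Gene names", "")),
        # Anchor is the first majority accession, or None if the cell is empty.
        "anchor_protein": majority[0] if majority else None,
        # MaxQuant marks flags with "+"; anything else counts as unset.
        "is_decoy": row.get("Reverse", "") == "+",
        "contaminant": row.get("Potential contaminant", "") == "+",
    }


# Illustrative row; values are made up for this sketch.
row = {
    "Protein IDs": "P04217;A0A024R4E5",
    "Majority protein IDs": "P04217",
    "Protein names": "Alpha-1B-glycoprotein",
    "Gene names": "A1BG",
    "Reverse": "",
    "Potential contaminant": "+",
}
identity = maxquant_identity(row)
print(identity)
```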

DIA-NN (main report)

DIA-NN's main report is precursor-level. PG-level columns (PG.*, Genes.*) are repeated for every precursor in the group within a run. Deduplicate by Protein.Group + Run to produce one pg.parquet row.

| DIA-NN column | QPX field | Notes |
| --- | --- | --- |
| Protein.Group | pg_accessions | Semicolon-separated → array; also anchor_protein (first accession) |
| Protein.Names | pg_names | |
| Genes | gg_names / gg_accessions | |
| Run | run_file_name | Raw file name without path |
| PG.Quantity | intensities | Non-normalised protein group quantity |
| PG.MaxLFQ | additional_intensities → maxlfq | QuantUMS/MaxLFQ normalised |
| PG.Normalised | additional_intensities → normalize_intensity | Normalised PG quantity |
| PG.TopN | additional_intensities → topn | Top-N normalised quantity |
| Genes.MaxLFQ | additional_intensities → genes_maxlfq | Gene-group MaxLFQ |
| Genes.MaxLFQ.Unique | additional_intensities → genes_maxlfq_unique | Gene-group MaxLFQ (proteotypic only) |
| Genes.TopN | additional_intensities → genes_topn | Gene-group Top-N |
| Global.PG.Q.Value | global_qvalue | Experiment-level PG FDR |
| PG.Q.Value | pg_qvalue | Run-level PG FDR |
| PG.PEP | additional_scores → pg_pep | |
| PG.MaxLFQ.Quality | additional_scores → pg_maxlfq_quality | QuantUMS quality metric |
| GG.Q.Value | additional_scores → gg_qvalue | Gene group q-value |
| Lib.PG.Q.Value | additional_scores → lib_pg_qvalue | Library PG q-value |
| Protein.Q.Value | additional_scores → protein_qvalue | Unique protein q-value |
| Genes.MaxLFQ.Quality | additional_scores → genes_maxlfq_quality | |
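
The deduplication by Protein.Group + Run described for the DIA-NN main report can be sketched as follows. The function names `dedupe_pg_rows` and `to_pg_record` are hypothetical, the rows are assumed to be dicts of strings (for example from a TSV reader), and "LFQ" as the intensity label is an assumption for label-free DIA data.

```python
def dedupe_pg_rows(precursor_rows):
    """Keep the first precursor row seen for each (Protein.Group, Run) pair."""
    first_seen = {}
    for row in precursor_rows:
        key = (row["Protein.Group"], row["Run"])
        # PG-level columns repeat for every precursor, so any row of the pair works.
        first_seen.setdefault(key, row)
    return list(first_seen.values())


def to_pg_record(row):
    """Map the PG-level columns of one deduplicated row to pg.parquet fields."""
    accessions = row["Protein.Group"].split(";")
    return {
        "pg_accessions": accessions,
        "anchor_protein": accessions[0],
        "run_file_name": row["Run"],
        "global_qvalue": float(row["Global.PG.Q.Value"]),
        "pg_qvalue": float(row["PG.Q.Value"]),
        "intensities": [{"label": "LFQ", "intensity": float(row["PG.Quantity"])}],
    }


# Illustrative precursor-level rows: two precursors in run_01, one in run_02.
pg_columns = {"Protein.Group": "P04217;A0A024R4E5", "PG.Quantity": "1.5e8",
              "PG.Q.Value": "0.005", "Global.PG.Q.Value": "0.001"}
rows = [
    {**pg_columns, "Run": "run_01", "Precursor.Id": "PEPTIDEK2"},
    {**pg_columns, "Run": "run_01", "Precursor.Id": "PEPTIDER3"},
    {**pg_columns, "Run": "run_02", "Precursor.Id": "PEPTIDEK2"},
]
pg_records = [to_pg_record(row) for row in dedupe_pg_rows(rows)]
print(len(pg_records))
```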

FragPipe (combined_protein.tsv)

FragPipe outputs are pre-filtered (no decoys or contaminants). Per-experiment columns are prefixed with the experiment name.

| FragPipe column | QPX field | Notes |
| --- | --- | --- |
| Protein ID | anchor_protein | Leading protein accession |
| Protein ID + Indistinguishable Proteins | pg_accessions | Combine into array |
| Description | pg_names | Protein description |
| Gene | gg_names / gg_accessions | |
| Coverage | sequence_coverage | Percentage (0–100) |
| Protein Length | cv_params | {cv_name: "sequence_length", cv_value: "495"} |
| Protein Existence | cv_params | {cv_name: "protein_existence", cv_value: "1"} |
| [exp] Total Peptides | peptide_counts.total_sequences | Per-experiment peptide count |
| [exp] Intensity | intensities | Primary razor peptide intensity |
| [exp] MaxLFQ Intensity | additional_intensities → maxlfq | MaxLFQ normalised |
| [exp] MaxLFQ Unique Intensity | additional_intensities → maxlfq_unique | MaxLFQ using only unique peptides |
| [exp] Unique Intensity | additional_intensities → unique_intensity | Intensity from unique peptides only |
| [exp] Spectral Count | additional_intensities → spectral_count | Razor peptide PSMs |
| [exp] Unique Spectral Count | additional_intensities → unique_spectral_count | Unique peptide PSMs |
| [exp] Total Spectral Count | additional_intensities → total_spectral_count | All peptide PSMs |
| Protein Probability | additional_scores → protein_probability | ProteinProphet score |
| Top Peptide Probability | additional_scores → top_peptide_probability | Best PeptideProphet score |

FragPipe q-values

FragPipe does not report explicit protein-group q-values. The Protein Probability (from ProteinProphet) is filtered at a given FDR threshold (typically 1%) before writing combined_protein.tsv. Set global_qvalue to null and store the probability in additional_scores.
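
A short sketch of that rule, assuming one combined_protein.tsv row read into a dict of strings. The function name `fragpipe_quality_fields` is hypothetical; the higher_better flag follows the additional_scores struct shown in the example record and the "higher is better" direction listed for both scores.

```python
def fragpipe_quality_fields(row):
    """Build the quality fields for one FragPipe protein row."""
    return {
        # FragPipe reports no explicit protein-group q-value.
        "global_qvalue": None,
        "additional_scores": [
            {
                "score_name": "protein_probability",
                "score_value": float(row["Protein Probability"]),
                "higher_better": True,
            },
            {
                "score_name": "top_peptide_probability",
                "score_value": float(row["Top Peptide Probability"]),
                "higher_better": True,
            },
        ],
    }


# Illustrative values only.
quality = fragpipe_quality_fields(
    {"Protein Probability": "0.9987", "Top Peptide Probability": "0.999"}
)
print(quality)
```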

Notes

Relationship to other views

The PG view provides per-file protein group quantification. For derived per-sample summaries (protein counts, abundances, etc.), see API Views. For downstream absolute or differential expression results, see the Absolute Expression and Differential Expression views.

Primary key constraints

The combination of pg_accessions and run_file_name forms the composite primary key. Both fields MUST NOT be null. Each record represents a single protein group observed in a single raw file.
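
One way a converter might check this constraint before writing is sketched below. The function name `check_primary_key` is hypothetical, and two choices are assumptions rather than part of the specification: an empty pg_accessions array is treated like a null, and accession order is treated as significant when comparing keys.

```python
def check_primary_key(records):
    """Return a list of problems with the (pg_accessions, run_file_name) key."""
    errors = []
    seen = set()
    for index, rec in enumerate(records):
        accessions = rec.get("pg_accessions")
        run = rec.get("run_file_name")
        # Both key fields must be present; an empty array is rejected too (assumption).
        if not accessions or run is None:
            errors.append(f"record {index}: null primary key field")
            continue
        # Lists are unhashable, so the array is compared as an ordered tuple.
        key = (tuple(accessions), run)
        if key in seen:
            errors.append(f"record {index}: duplicate key {key}")
        seen.add(key)
    return errors


# Illustrative records: one valid, one duplicate, one with a missing run.
records = [
    {"pg_accessions": ["P04217", "A0A024R4E5"], "run_file_name": "20230101_sample_01"},
    {"pg_accessions": ["P04217", "A0A024R4E5"], "run_file_name": "20230101_sample_01"},
    {"pg_accessions": ["P04217"], "run_file_name": None},
]
problems = check_primary_key(records)
for problem in problems:
    print(problem)
```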