Assemble protein groups from peptide-to-protein mapping

Usage

assemble_protein_groups(
  proteins_to_peptides,
  minspec_peptides = 1L,
  protein_grouping_rule = default_protein_grouping,
  consume_subsets = TRUE,
  verbose = FALSE,
  ...
)

Arguments

proteins_to_peptides: A data frame with peptide and protein columns and an optional peptide_rank column.
minspec_peptides: Minimum number of specific peptides for a group.
protein_grouping_rule: Function to decide on merging protein groups.
consume_subsets: Logical, whether to keep protein groups that do not have specific peptides. When TRUE, the minspec_peptides filter is ignored.
verbose: Logical, for verbose output of the inference process.
...: Additional arguments for the protein_grouping_rule function.

Value

A nested data frame of protein groups. Its proteins and peptides columns are themselves data frames that list the proteins and peptides mapping to each protein group.

Details

The function implements the classical "greedy" inference of protein groups from bottom-up MS data as described in the (...) paper. It relies on the protein-to-peptide mapping and on specific peptides, i.e. peptides that map only to the proteins of a single protein group.

In comparison to the original implementation, it provides several advanced features that allow the inference algorithm to produce more relevant protein groups for specific use cases.
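
The notion of a specific peptide can be illustrated with a minimal base R sketch. It is not the package's inference algorithm: it only groups proteins with identical peptide sets and then finds the peptides that occur in exactly one such group. All protein and peptide identifiers are hypothetical.

```r
# Toy protein-to-peptide mapping (hypothetical identifiers).
proteins_to_peptides <- data.frame(
  protein = c("P1", "P1", "P2", "P2", "P3", "P3"),
  peptide = c("pepA", "pepB", "pepA", "pepB", "pepB", "pepC"),
  stringsAsFactors = FALSE
)

# Peptide set of each protein, encoded as a sorted, comma-joined signature.
peptide_sets <- split(proteins_to_peptides$peptide,
                      proteins_to_peptides$protein)
signature_of <- vapply(
  peptide_sets,
  function(p) paste(sort(unique(p)), collapse = ","),
  character(1)
)

# Proteins with identical peptide sets cannot be told apart,
# so they end up in the same group.
groups <- split(names(signature_of), signature_of)

# A peptide is specific if it occurs in exactly one group.
group_peptides <- strsplit(names(groups), ",", fixed = TRUE)
peptide_counts <- table(unlist(group_peptides))
specific_peptides <- names(peptide_counts)[as.vector(peptide_counts) == 1]
```

In this toy example P1 and P2 form one group with the specific peptide pepA, P3 forms another group with the specific peptide pepC, and pepB is shared between the two groups.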

Peptide ranks and decisive peptides

The input proteins_to_peptides data frame can optionally contain a peptide_rank column that ranks the peptide-to-protein assignments. It can be used, for example, to give priority to peptides identified in actual biosamples (rank 1) over peptides identified only in negative control experiments (rank 2). Peptides with NA ranks are excluded from the protein group inference altogether. This makes it possible, for example, to ignore peptides shared between different organisms in mixed-species samples.

For a given protein, only the peptides with the lowest rank (among all its peptides) – the decisive peptides – will be considered for the protein group inference. The non-decisive peptides (including the unranked ones) will still be mapped to the protein groups. The status of individual peptides is provided by the is_decisive column in the peptides nested data frame.
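
The rule above can be sketched in base R as follows. This is an illustration of the description, not the package's code; the data and the min_rank helper variable are made up for the example, and only the peptide_rank and is_decisive column names come from this page.

```r
# Toy mapping with ranks (hypothetical identifiers); NA marks an unranked peptide.
proteins_to_peptides <- data.frame(
  protein      = c("P1", "P1", "P1", "P2", "P2"),
  peptide      = c("pepA", "pepB", "pepC", "pepC", "pepD"),
  peptide_rank = c(1L, 2L, NA, 2L, 2L),
  stringsAsFactors = FALSE
)

# Lowest non-NA rank among the peptides of each protein.
min_rank <- ave(
  proteins_to_peptides$peptide_rank,
  proteins_to_peptides$protein,
  FUN = function(r) if (all(is.na(r))) NA_integer_ else min(r, na.rm = TRUE)
)

# A peptide is decisive for a protein if it is ranked and
# carries the lowest rank among that protein's peptides.
proteins_to_peptides$is_decisive <-
  !is.na(proteins_to_peptides$peptide_rank) &
  proteins_to_peptides$peptide_rank == min_rank
```

Here only pepA is decisive for P1, while both pepC and pepD are decisive for P2, because rank 2 is the lowest rank P2 has.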

User-defined protein grouping rules

To preserve the quantitative information of shared peptides, the MaxQuant protein inference implementation introduced the concept of "razor" peptides. These are the peptides shared by several protein groups, but assigned to a specific protein group based on parsimony principles. While not specific, these peptides are still used for quantifying the protein group they are assigned to.

assemble_protein_groups() does not implement razor peptides. Instead, it allows tuning how proteins are merged into protein groups, so that proteins with very few specific peptides are not split into separate groups with their shared peptides discarded from quantification.

Users can provide a custom function that defines how the protein group candidates should be merged. The default_protein_grouping() rule allows specifying how many specific peptides (either as an absolute number or as a fraction of all candidate peptides) are required to keep protein groups separate. By increasing nspec_peptides above the standard value of 1 (a single specific peptide), one can group the proteins more aggressively. This helps to reduce the number of protein groups that map to different isoforms or variants/alleles of the same protein.

The pregrouped_protein_grouping() rule allows further tuning by using, for example, gene IDs as the identifiers of protein pre-groups and defining more stringent criteria for separating proteins of the same pre-group.
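
The kind of decision such a rule makes can be sketched as below. The keep_separate() helper and its min_spec_fraction argument are hypothetical and are not the signature of default_protein_grouping(); combining the absolute and the fractional threshold with a logical AND is also an assumption made for this sketch.

```r
# Hypothetical helper, NOT the package's default_protein_grouping().
# Decides whether a candidate keeps its own protein group, given its number
# of specific peptides and the total number of candidate peptides.
keep_separate <- function(n_specific, n_candidate,
                          nspec_peptides = 1L, min_spec_fraction = 0) {
  n_specific >= nspec_peptides &&
    n_specific >= min_spec_fraction * n_candidate
}

keep_separate(1L, 10L)                       # TRUE: one specific peptide suffices
keep_separate(1L, 10L, nspec_peptides = 2L)  # FALSE: candidate would be merged
keep_separate(2L, 10L, min_spec_fraction = 0.5)  # FALSE: only 20% are specific
```

Raising either threshold makes more candidates fail the test, which corresponds to the more aggressive grouping described above.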

Protein groups without specific peptides

The classical protein group inference requires that each protein group contains at least one specific peptide. Proteins without specific peptides are discarded, since there is no evidence of their presence in the sample. In certain scenarios, however, it is known that these proteins are present in the sample, and they have to be quantified separately. One such case is the cleavage products of complement activation cascade proteins: although these cleavage products are known to be present in the sample, most of their peptides also map to the full-length protein and are therefore combined into a single protein group by the standard algorithm. This can be avoided by setting consume_subsets = TRUE, which retains protein groups without specific peptides. However, extra care has to be taken when interpreting the quantitative data for such protein groups: it does not represent the real abundance of these protein fragments, but rather the average abundance of the peptides that map to these fragments and are also shared with the other fragments.
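
The subset situation can be made concrete with a small base R sketch. The full_length and fragment names and their peptides are hypothetical; the code only shows why a cleavage product ends up without specific peptides and does not call assemble_protein_groups().

```r
# Hypothetical peptide sets: every fragment peptide also maps
# to the full-length protein.
peptide_sets <- list(
  full_length = c("pepA", "pepB", "pepC"),
  fragment    = c("pepA", "pepB")
)

# Peptides that occur in more than one protein are shared.
all_peptides <- unlist(peptide_sets, use.names = FALSE)
shared_peptides <- unique(all_peptides[duplicated(all_peptides)])

# Specific peptides of each protein: those that are not shared.
specific_peptides <- lapply(peptide_sets, setdiff, shared_peptides)

# The fragment's peptide set is contained in the full-length protein's set.
fragment_is_subset <- all(peptide_sets$fragment %in% peptide_sets$full_length)
```

The fragment has no specific peptides and its peptide set is a subset of the full-length protein's, so any quantity reported for it would be derived entirely from shared peptides.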

References

TODO