
Assemble protein groups from peptide-to-protein mapping
Source: R/protgroup_assembly.R
Usage
assemble_protein_groups(
  proteins_to_peptides,
  minspec_peptides = 1L,
  protein_grouping_rule = default_protein_grouping,
  consume_subsets = TRUE,
  verbose = FALSE,
  ...
)
Arguments
- proteins_to_peptides
A data frame with peptide, protein, and an optional peptide_rank column.
- minspec_peptides
Minimum number of specific peptides for a group.
- protein_grouping_rule
Function to decide on merging protein groups.
- consume_subsets
Logical, whether to keep protein groups that do not have specific peptides. When
TRUE, the minspec_peptides filter is ignored.
- verbose
Logical, for verbose output of the inference process.
- ...
Additional arguments for the protein_grouping_rule function.
Value
A nested data frame of protein groups. Its proteins and peptides columns
are themselves data frames that contain the proteins and peptides mapping to
the given protein group.
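As a minimal sketch of the expected input, the toy data frame below uses invented protein and peptide identifiers. The commented-out call assumes that the package providing assemble_protein_groups() is attached; its output is not shown here because it depends on the implementation.

```r
# Toy peptide-to-protein mapping (all identifiers are made up for illustration)
proteins_to_peptides <- data.frame(
  protein = c("P1", "P1", "P2", "P2", "P3"),
  peptide = c("AAAK", "CCCK", "CCCK", "DDDR", "CCCK"),
  # optional column: 1 = seen in biosamples, 2 = controls only, NA = unranked
  peptide_rank = c(1L, 1L, 1L, 2L, NA),
  stringsAsFactors = FALSE
)

# Assumes the package that provides assemble_protein_groups() is attached:
# protgroups <- assemble_protein_groups(proteins_to_peptides,
#                                       minspec_peptides = 1L,
#                                       verbose = TRUE)
```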
Details
The function implements the classical "greedy" inference of protein groups from bottom-up MS data as described in (...) paper. It is based on the protein-to-peptides mapping and on specific peptides, which only map to proteins unique to a given protein group.
In comparison to the original implementation, it provides several advanced features that allow the inference algorithm to produce more relevant protein groups for specific use cases.
Peptide ranks and decisive peptides
The input proteins_to_peptides data frame can optionally contain a peptide_rank column
that ranks the peptide-to-protein assignment. It can be used, for example, to give priority
to the peptides identified in actual biosamples (rank 1) over peptides only
identified in negative control experiments (rank 2). Peptides with NA ranks
will be excluded from the protein group inference altogether. This makes it possible,
for example, to ignore the peptides shared between different organisms in mixed-species samples.
For a given protein, only the peptides with the lowest rank (among all its peptides) –
the decisive peptides – will be considered for the protein group inference.
The non-decisive peptides (including the unranked ones) will still be mapped to
the protein groups. The status of individual peptides is provided by the is_decisive
column in the peptides nested data frame.
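The rule for decisive peptides can be illustrated with a small base R sketch. It is not the package's implementation: the mapping data frame and its identifiers are hypothetical, and the is_decisive flag is computed here on the flat input rather than on the nested peptides data frame.

```r
# Toy mapping with ranks (all identifiers are hypothetical)
mapping <- data.frame(
  protein = c("P1", "P1", "P1", "P2", "P2"),
  peptide = c("AAAK", "CCCK", "EEEK", "CCCK", "DDDR"),
  peptide_rank = c(1L, 2L, NA, 2L, 2L),
  stringsAsFactors = FALSE
)

# Lowest non-NA rank among the peptides of each protein
best_rank <- tapply(
  mapping$peptide_rank, mapping$protein,
  function(r) if (all(is.na(r))) NA_integer_ else min(r, na.rm = TRUE)
)
lowest <- as.vector(best_rank[mapping$protein])

# A peptide is decisive for a protein if it is ranked and has that lowest rank
mapping$is_decisive <- !is.na(mapping$peptide_rank) &
  mapping$peptide_rank == lowest
```

In this toy input, P1 has a rank 1 peptide, so its rank 2 and unranked peptides are not decisive, while both rank 2 peptides of P2 are decisive because P2 has nothing better.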
User-defined protein grouping rules
To preserve the quantitative information of shared peptides, the MaxQuant protein inference implementation introduced the concept of "razor" peptides. These are the peptides shared by several protein groups, but assigned to a specific protein group based on parsimony principles. While not specific, these peptides are still used for quantifying the protein group they are assigned to.
assemble_protein_groups() does not implement razor peptides. Instead,
it allows tuning how the proteins are merged into protein groups, so that
proteins with very few specific peptides are not split into separate groups,
which would exclude their shared peptides from quantification.
Users can provide a custom function that defines how the protein group candidates
should be merged. The default_protein_grouping() allows specifying how many
specific peptides (either as an absolute number or as a fraction of all candidate
peptides) are required to keep the protein groups separate. By increasing
nspec_peptides above the standard 1 (a single specific peptide), one can
group the proteins more aggressively. This helps to reduce the number of
protein groups that map to different isoforms or variants/alleles of the same protein.
The pregrouped_protein_grouping() allows further tuning, e.g. by using
gene IDs as the identifiers of protein pre-groups and defining more stringent
criteria for separating proteins of the same pre-group.
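The kind of criterion described above can be sketched as a standalone function. Note that keep_separate() and its arguments are hypothetical: this is neither default_protein_grouping() nor the signature expected by protein_grouping_rule, and requiring both thresholds at once is an assumption made only for illustration.

```r
# Hypothetical helper, NOT the package's default_protein_grouping():
# decides whether a candidate keeps its own protein group, given the number
# of its specific peptides and the total number of its candidate peptides
keep_separate <- function(n_specific, n_total,
                          min_count = 1L, min_fraction = 0) {
  n_specific >= min_count && (n_specific / n_total) >= min_fraction
}

# Standard rule: a single specific peptide is enough to stay separate
standard <- keep_separate(n_specific = 1L, n_total = 10L)

# Stricter absolute threshold: the same candidate would be merged
stricter <- keep_separate(n_specific = 1L, n_total = 10L, min_count = 2L)

# Fraction-based threshold: 3 of 10 specific peptides is below 50%
by_fraction <- keep_separate(n_specific = 3L, n_total = 10L, min_fraction = 0.5)
```

Raising either threshold makes separation harder, which corresponds to the more aggressive grouping of isoforms and variants described above.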
Protein groups without specific peptides
The classical protein group inference requires that each protein group
contains at least one specific peptide. Proteins without specific peptides
are discarded, since there is no evidence of their presence in the sample.
In certain scenarios, however, it is known that these proteins are present in
the sample, and they have to be quantified separately. One such case is the
cleavage products of complement activation cascade proteins: although it is known
that these cleavage products are present in the sample, most of their peptides
also map to the full-length protein and are, therefore, combined into a single
protein group by the standard algorithm.
This can be avoided by setting consume_subsets = TRUE, which retains
protein groups without specific peptides.
However, extra care has to be taken when interpreting the quantitative data for
such protein groups, as they do not represent the real abundance of these protein
fragments, but rather the average abundance of peptides that map to these fragments
and are also shared with the other fragments.
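The subset situation can be illustrated with a toy mapping in base R. This is only a sketch at the level of individual proteins, not the package's group-level algorithm; the identifiers FULL and FRAG are hypothetical stand-ins for a full-length protein and one of its cleavage products, and each protein-peptide pair is assumed to occur once.

```r
# Toy mapping: "FRAG" stands in for a cleavage product of "FULL"
mapping <- data.frame(
  protein = c("FULL", "FULL", "FULL", "FRAG", "FRAG"),
  peptide = c("AAAK", "CCCK", "DDDR", "CCCK", "DDDR"),
  stringsAsFactors = FALSE
)

# Number of proteins each peptide maps to
n_proteins <- table(mapping$peptide)

# A peptide is specific if it maps to exactly one protein
mapping$is_specific <- as.vector(n_proteins[mapping$peptide]) == 1L

# Count specific peptides per protein; FRAG has none
n_specific <- tapply(mapping$is_specific, mapping$protein, sum)
```

Here FRAG has no specific peptides, since both of its peptides also map to FULL, so any quantity reported for it would be based entirely on shared peptides.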