Perform in silico digestion of protein sequences into peptides using specified digestion rules (e.g., trypsin). Returns a data frame with one peptide per row, retaining all original protein information.
Usage
proteins_digest(
  proteins,
  seq_col = "seq",
  rule = "(?<=[KR])",
  min_length = 6L,
  max_length = 40L,
  max_miscleavages = 1L
)
Arguments
- proteins
A data frame containing protein information with a sequence column
- seq_col
Name of the column containing amino acid sequences (default: "seq")
- rule
Regular expression pattern for digestion enzyme cleavage sites. Default is trypsin rule: cleavage after K or R
- min_length
Minimum length of resulting peptides (default: 6)
- max_length
Maximum length of resulting peptides (default: 40)
- max_miscleavages
Maximum number of missed cleavages allowed (default: 1)
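To make the interaction between rule, max_miscleavages, and the length limits concrete, here is a minimal base R sketch for a single sequence. It is an illustration under stated assumptions, not the package's implementation: digest_sketch and aa_seq are hypothetical names, and the N-terminal M/X handling described under Details is deliberately left out.

```r
# Hypothetical helper (not part of the package): digest one sequence in base R.
digest_sketch <- function(aa_seq, rule = "(?<=[KR])", min_length = 6L,
                          max_length = 40L, max_miscleavages = 1L) {
  # Split after every position matched by the zero-width lookbehind rule.
  fragments <- strsplit(aa_seq, rule, perl = TRUE)[[1]]
  ends <- cumsum(nchar(fragments))
  starts <- ends - nchar(fragments) + 1L
  n <- length(fragments)

  out <- list()
  for (i in seq_len(n)) {
    # Join fragment i with up to max_miscleavages following fragments.
    for (k in 0:max_miscleavages) {
      j <- i + k
      if (j > n) break
      out[[length(out) + 1L]] <- data.frame(
        peptide = substr(aa_seq, starts[i], ends[j]),
        pep_start = starts[i],
        pep_end = ends[j],
        n_miscleavages = k,
        stringsAsFactors = FALSE
      )
    }
  }
  res <- do.call(rbind, out)
  res$pep_length <- nchar(res$peptide)
  # Apply the length filter last, so short fragments can still be joined.
  res[res$pep_length >= min_length & res$pep_length <= max_length, , drop = FALSE]
}

digest_sketch("AKBCKDKEKFK", max_miscleavages = 2L)$peptide
```

With the default min_length of 6, only the joined peptides of 6 or more residues survive the filter, which is why short sequences with many cleavage sites can return few rows.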
Value
A data frame with one row per digested peptide, containing:
- All original protein columns except the sequence column
- peptide: the peptide sequence
- pep_start: start position in the protein sequence (1-based)
- pep_end: end position in the protein sequence (1-based)
- pep_length: length of the peptide
- n_miscleavages: number of missed cleavages
- is_start_cleaved: whether the peptide start matches the cleavage rule or is the protein start/N-terminal M
- is_end_cleaved: whether the peptide end matches the cleavage rule or is the protein end
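The position columns are related in a simple way: with 1-based inclusive coordinates, pep_length equals pep_end - pep_start + 1, and the peptide is the corresponding substring of the protein sequence. The sketch below checks this on a small hand-built data frame (not actual function output) that reuses peptides and positions from the Examples section.

```r
# Hand-built rows for illustration; values taken from the Examples section.
protein_seq <- "MKTESTPEPTIDEKANDANOTHERPEPTIDER"
peptides <- data.frame(
  peptide = c("KTESTPEPTIDEK", "TESTPEPTIDEK", "ANDANOTHERPEPTIDER"),
  pep_start = c(2L, 3L, 15L),
  pep_end = c(14L, 14L, 32L),
  stringsAsFactors = FALSE
)

# 1-based, inclusive coordinates: length = end - start + 1.
peptides$pep_length <- peptides$pep_end - peptides$pep_start + 1L

# substr() does not recycle its first argument, so repeat the sequence.
recovered <- substr(
  rep(protein_seq, nrow(peptides)),
  peptides$pep_start,
  peptides$pep_end
)

stopifnot(
  identical(recovered, peptides$peptide),
  all(peptides$pep_length == nchar(peptides$peptide))
)
```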
Details
The function performs fully specific digestion (fully tryptic under the default rule) using a vectorized approach. If a protein sequence starts with methionine (M) or an unknown amino acid (X), an additional cleavage site is added between positions 1 and 2 to handle N-terminal processing.
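The extra N-terminal site can be pictured as one more entry in the list of cleavage positions. The following base R sketch assumes a site is identified by the residue position after which the chain is cut; cleavage_sites_sketch is a hypothetical helper and may differ from how the function tracks sites internally.

```r
# Hypothetical helper: positions after which the sequence may be cut.
cleavage_sites_sketch <- function(aa_seq, rule = "(?<=[KR])") {
  fragments <- strsplit(aa_seq, rule, perl = TRUE)[[1]]
  sites <- cumsum(nchar(fragments))
  # Drop the protein end; it is a terminus, not an internal site.
  sites <- sites[sites < nchar(aa_seq)]
  # Extra site between positions 1 and 2 for an N-terminal M or X.
  if (substr(aa_seq, 1, 1) %in% c("M", "X")) {
    sites <- sort(unique(c(1L, sites)))
  }
  sites
}

cleavage_sites_sketch("MKTESTPEPTIDEKANDANOTHERPEPTIDER")  # 1 2 14 24
cleavage_sites_sketch("AKTESTPEPTIDEKANDANOTHERPEPTIDER")  # 2 14 24
```

Only the M-initial sequence gains the site after position 1; the A-initial sequence keeps just the K/R sites.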
Examples
if (FALSE) { # \dontrun{
# Basic example protein data frame
library(tibble)
proteins <- tibble(
  protein_id = "P12345",
  seq = "MKTESTPEPTIDEKANDANOTHERPEPTIDER"
)
# Digest with default trypsin settings
peptides <- proteins_digest(proteins)
print(peptides)
# Expected output includes peptides:
# - "KTESTPEPTIDEK" (from M-removal at N-term, positions 2-14; K at position 2 is a missed cleavage)
# - "TESTPEPTIDEK" (cleaved after K at position 2, positions 3-14)
# - "ANDANOTHERPEPTIDER" (C-terminal peptide, positions 15-32; R at position 24 is a missed cleavage)
# Test N-terminal M/X handling - demonstrates additional cleavage sites
proteins_nterm <- tibble(
  protein_id = c("P001", "P002", "P003"),
  seq = c(
    "MKTESTPEPTIDEKANDANOTHERPEPTIDER", # M at N-term - extra cleavage after pos 1
    "XKTESTPEPTIDEKANDANOTHERPEPTIDER", # X at N-term - extra cleavage after pos 1
    "AKTESTPEPTIDEKANDANOTHERPEPTIDER"  # A at N-term - no extra cleavage
  )
)
peptides_nterm <- proteins_digest(proteins_nterm)
print(peptides_nterm)
# P001 and P002 should have peptides like:
# - "MKTESTPEPTIDEK" for P001, "XKTESTPEPTIDEK" for P002 (full N-terminal, positions 1-14)
# - "KTESTPEPTIDEK" (M/X-cleaved, positions 2-14)
# P003 should only have:
# - "AKTESTPEPTIDEK" (full N-terminal, positions 1-14)
# Test peptides with miscleavages - demonstrates max_miscleavages parameter
proteins_miscleavage <- tibble(
  protein_id = "P004",
  seq = "AKBCKDKEKFK" # Multiple K sites for testing missed cleavages
)
# Allow up to 2 missed cleavages
peptides_miscleavage <- proteins_digest(proteins_miscleavage, max_miscleavages = 2L)
print(peptides_miscleavage)
# Candidate peptides before length filtering:
# 0 miscleavages: "AK", "BCK", "DK", "EK", "FK"
# 1 miscleavage: "AKBCK", "BCKDK", "DKEK", "EKFK"
# 2 miscleavages: "AKBCKDK", "BCKDKEK", "DKEKFK"
# Note: with the default min_length = 6L, only the three 2-miscleavage
# peptides (6-7 residues) are returned; lower min_length to see the rest
# Use custom sequence column name
proteins_custom <- tibble(
  protein_id = "P12345",
  sequence = "MKTESTPEPTIDEKANDANOTHERPEPTIDER"
)
peptides_custom <- proteins_digest(proteins_custom, seq_col = "sequence")
print(peptides_custom)
} # }
