In silico protein digestion — proteins

Perform in silico digestion of protein sequences into peptides using a specified cleavage rule (e.g., trypsin). Returns a data frame with one peptide per row, retaining all original protein columns except the sequence column.

Usage

proteins_digest(
  proteins,
  seq_col = "seq",
  rule = "(?<=[KR])",
  min_length = 6L,
  max_length = 40L,
  max_miscleavages = 1L
)

Arguments

proteins: A data frame containing protein information with a sequence column
seq_col: Name of the column containing amino acid sequences (default: "seq")
rule: Regular expression pattern for the cleavage sites of the digestion enzyme. The default is a simple trypsin rule: cleavage after every K or R (the pattern makes no exception for a following P)
min_length: Minimum length of resulting peptides (default: 6)
max_length: Maximum length of resulting peptides (default: 40)
max_miscleavages: Maximum number of missed cleavages allowed (default: 1)
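
To illustrate how a rule pattern partitions a sequence, here is a minimal base R sketch. It assumes the pattern is treated as a Perl-compatible regular expression (the default uses a lookbehind, which base R supports only with perl = TRUE). The helper split_by_rule() and the proline-aware pattern are hypothetical illustrations, not part of the package; whether proteins_digest() handles a lookahead rule the same way is an assumption.

```r
# Hypothetical helper: split one sequence at the zero-width positions matched by `rule`.
split_by_rule <- function(seq, rule = "(?<=[KR])") {
  strsplit(seq, rule, perl = TRUE)[[1]]
}

# Default rule: cleave after every K or R, even when the next residue is P
split_by_rule("AKPGRC")
# "AK" "PGR" "C"

# Illustrative alternative (assumption): no cleavage when the next residue is P
split_by_rule("AKPGRC", rule = "(?<=[KR])(?!P)")
# "AKPGR" "C"
```

Because the lookbehind is zero-width, the K or R stays at the end of the peptide on its left rather than being consumed by the split.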

Value

A data frame with one row per digested peptide, containing:

All original protein columns except the sequence column
peptide: the peptide sequence
pep_start: start position in protein sequence (1-based)
pep_end: end position in protein sequence (1-based)
pep_length: length of the peptide
n_miscleavages: number of missed cleavages
is_start_cleaved: whether peptide start matches cleavage rule or is protein start/N-term M
is_end_cleaved: whether peptide end matches cleavage rule or is protein end

Details

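The function performs fully specific digestion (fully tryptic under the default rule) using a vectorized approach. If a protein sequence starts with methionine (M) or an unknown amino acid (X), an additional cleavage site is added between positions 1 and 2, so that peptides with and without the first residue are both generated (for example, to account for N-terminal methionine removal).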
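
The enumeration described above can be sketched for a single sequence in base R. This is not the package's implementation: digest_sketch() is a hypothetical helper, and two conventions are assumptions chosen to agree with the examples on this page. First, a leading M or X only adds position 2 as an extra peptide start. Second, missed cleavages are counted as the rule-derived sites lying strictly inside a peptide, so the extra N-terminal site is never counted as missed.

```r
# Hypothetical sketch, not the package's implementation.
digest_sketch <- function(seq, rule = "(?<=[KR])", min_length = 6L,
                          max_length = 40L, max_miscleavages = 1L) {
  # Fully cleaved fragments; with a lookbehind rule the K/R stays on the left piece
  frags <- strsplit(seq, rule, perl = TRUE)[[1]]
  ends <- cumsum(nchar(frags))          # candidate peptide ends (last one is the protein end)
  internal <- ends[ends < nchar(seq)]   # rule-derived sites inside the protein
  starts <- c(1L, internal + 1L)        # candidate peptide starts

  # Assumption: a leading M or X adds position 2 as an extra start
  if (substr(seq, 1, 1) %in% c("M", "X")) starts <- sort(union(starts, 2L))

  # All start/end combinations, keeping only valid spans
  grid <- expand.grid(pep_start = starts, pep_end = ends)
  grid <- grid[grid$pep_start <= grid$pep_end, ]
  grid$peptide <- substring(seq, grid$pep_start, grid$pep_end)
  grid$pep_length <- nchar(grid$peptide)

  # Assumption: missed cleavages = rule-derived sites strictly inside the peptide
  grid$n_miscleavages <- vapply(seq_len(nrow(grid)), function(i) {
    sum(internal >= grid$pep_start[i] & internal < grid$pep_end[i])
  }, integer(1))

  # Apply the length and missed-cleavage filters
  keep <- grid$pep_length >= min_length & grid$pep_length <= max_length &
    grid$n_miscleavages <= max_miscleavages
  grid[keep, c("peptide", "pep_start", "pep_end", "pep_length", "n_miscleavages")]
}

digest_sketch("AKBCKDKEKFK", max_miscleavages = 2L)$peptide
# "AKBCKDK" "BCKDKEK" "DKEKFK"
```

Enumerating every start/end pair is quadratic in the number of cleavage sites, which is acceptable for a single protein in a sketch; a vectorized implementation over a whole data frame would likely bound the pairs by max_miscleavages up front.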

Examples

if (FALSE) { # \dontrun{
library(tibble) # provides tibble()

# Basic example protein data frame
proteins <- tibble(
    protein_id = "P12345",
    seq = "MKTESTPEPTIDEKANDANOTHERPEPTIDER"
)

# Digest with default trypsin settings
peptides <- proteins_digest(proteins)
print(peptides)
# Expected output includes peptides:
# - "KTESTPEPTIDEK" (from M-removal at N-term, positions 2-14)
# - "TESTPEPTIDEK" (after cleavage at K2, positions 3-14)
# - "ANDANOTHERPEPTIDER" (C-terminal peptide with one missed cleavage at R24, positions 15-32)

# Test N-terminal M/X handling - demonstrates additional cleavage sites
proteins_nterm <- tibble(
    protein_id = c("P001", "P002", "P003"),
    seq = c(
        "MKTESTPEPTIDEKANDANOTHERPEPTIDER", # M at N-term - extra cleavage after pos 1
        "XKTESTPEPTIDEKANDANOTHERPEPTIDER", # X at N-term - extra cleavage after pos 1
        "AKTESTPEPTIDEKANDANOTHERPEPTIDER" # A at N-term - no extra cleavage
    )
)

peptides_nterm <- proteins_digest(proteins_nterm)
print(peptides_nterm)
# P001 and P002 should have peptides like:
# - "MKTESTPEPTIDEK" (full N-terminal, positions 1-14)
# - "KTESTPEPTIDEK" (M/X-cleaved, positions 2-14)
# P003 should only have:
# - "AKTESTPEPTIDEK" (full N-terminal, positions 1-14)

# Test peptides with miscleavages - demonstrates max_miscleavages parameter
proteins_miscleavage <- tibble(
    protein_id = "P004",
    seq = "AKBCKDKEKFK" # Multiple K sites for testing missed cleavages
)

# Allow up to 2 missed cleavages
peptides_miscleavage <- proteins_digest(proteins_miscleavage, max_miscleavages = 2L)
print(peptides_miscleavage)
# Peptides generated before length filtering:
# 0 miscleavages: "AK", "BCK", "DK", "EK", "FK"
# 1 miscleavage: "AKBCK", "BCKDK", "DKEK", "EKFK"
# 2 miscleavages: "AKBCKDK", "BCKDKEK", "DKEKFK"
# Note: with the default min_length = 6L, only the three 2-miscleavage
# peptides (6-7 residues) pass the length filter and are returned

# Use custom sequence column name
proteins_custom <- tibble(
    protein_id = "P12345",
    sequence = "MKTESTPEPTIDEKANDANOTHERPEPTIDER"
)
peptides_custom <- proteins_digest(proteins_custom, seq_col = "sequence")
print(peptides_custom)
} # }