Bioinformatics Algorithms: An Active Learning Approach

Chapter 11: Was T. rex Really a Big Chicken?

Why do we include only masses of prefix and suffix subpeptides in the ideal spectrum of a peptide (e.g., RED or DCA for REDCA)? Why don't we include masses of other subpeptides (like EDC)?

Internal subpeptides like EDC of REDCA require two bonds (before E and after C) to be broken to generate the fragment ion. In contrast, prefix and suffix subpeptides require only one bond to be broken. As a result, although internal subpeptides correspond to some fragment ions, we ignore them because they are much less common than the fragment ions generated by prefix and suffix subpeptides.
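To make the distinction concrete, the following minimal sketch lists the masses of the proper prefix and suffix subpeptides of REDCA using integer amino acid masses. The function name `prefix_suffix_masses` and the small mass table are illustrative assumptions, not code from the main text.

```python
# Integer masses (Da) for the residues in REDCA only; a full table would cover all 20.
INTEGER_MASS = {"R": 156, "E": 129, "D": 115, "C": 103, "A": 71}

def prefix_suffix_masses(peptide, mass_table=INTEGER_MASS):
    """Return (prefix_masses, suffix_masses) for all proper prefixes and suffixes.

    Hypothetical helper for illustration: each mass corresponds to a fragment
    produced by breaking exactly one bond of the peptide.
    """
    residues = [mass_table[aa] for aa in peptide]
    # Breaking the bond after position k yields the prefix [0:k] and the suffix [k:].
    prefixes = [sum(residues[:k]) for k in range(1, len(peptide))]
    suffixes = [sum(residues[k:]) for k in range(1, len(peptide))]
    return prefixes, suffixes

prefixes, suffixes = prefix_suffix_masses("REDCA")
print(prefixes)  # masses of R, RE, RED, REDC
print(suffixes)  # masses of EDCA, DCA, CA, A
```

An internal subpeptide such as EDC appears in neither list because producing it would require breaking two bonds.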

How do we infer the charges of annotated peaks in a spectrum?

For example, how did we annotate the tall peak y12++ as having charge +2 in the figure below (one of the annotations of DinosaurSpectrum)?

As described in the main text, mass spectrometers measure the mass-to-charge ratio rather than the mass of fragment ions. Thus, a peak in a spectrum with a given mass-to-charge ratio m/z corresponds to several candidate masses, one for each possible value of its (unknown) charge z. If one of these candidate masses matches a mass in the theoretical spectrum, we may infer that the peak has charge z.
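The reasoning above can be sketched as a small search over candidate charges. This is only an illustration under stated assumptions: the function names, the tolerance, the maximum charge, and the proton mass correction (about 1.00728 Da per charge) are not taken from the main text. Setting `proton_mass=0` reproduces the simplified model in which the mass is simply z times m/z.

```python
PROTON_MASS = 1.00728  # Da, approximate; an assumption not stated in the text

def candidate_masses(mz, max_charge=3, proton_mass=PROTON_MASS):
    """Map each candidate charge z to the mass implied by the observed m/z."""
    return {z: z * (mz - proton_mass) for z in range(1, max_charge + 1)}

def infer_charges(mz, theoretical_masses, tolerance=0.5,
                  max_charge=3, proton_mass=PROTON_MASS):
    """Return every charge z whose implied mass matches a theoretical mass.

    Hypothetical helper: a peak is annotated with charge z when z * (m/z),
    corrected for the attached protons, lies within `tolerance` of a mass
    in the theoretical spectrum.
    """
    matches = []
    for z, mass in candidate_masses(mz, max_charge, proton_mass).items():
        if any(abs(mass - t) <= tolerance for t in theoretical_masses):
            matches.append(z)
    return matches

# Toy example: a peak at m/z 500.0 and a single theoretical mass near 998 Da.
print(infer_charges(500.0, [997.985]))
```

If more than one charge matches, the annotation is ambiguous and additional evidence is needed to choose between them.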

How does the use of rounded (integer) masses of amino acids diminish our ability to accurately sequence peptides?

Although we rounded amino acid masses to integers to simplify the presentation, proteomics researchers do not round masses when interpreting spectra. For example, amino acids K and Q have different molecular compositions (C₆H₁₂ON₂ versus C₅H₈O₂N₂) and different monoisotopic masses (128.09497 Da versus 128.05858 Da), but their integer masses are the same (128 Da), so rounding makes them indistinguishable. Modern mass spectrometers, however, are accurate enough to detect mass differences as small as 0.01 Da, which is sufficient to distinguish K from Q, whose masses differ by about 0.036 Da.
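The arithmetic behind this example is short enough to check directly. The sketch below uses the two monoisotopic masses quoted above; the helper `distinguishable` and the idea of comparing the mass difference against a resolution value are illustrative assumptions.

```python
K_MASS = 128.09497  # monoisotopic mass of K (Da), as quoted above
Q_MASS = 128.05858  # monoisotopic mass of Q (Da), as quoted above

def distinguishable(mass_a, mass_b, resolution):
    """Hypothetical check: two masses can be told apart when their
    difference is at least the smallest difference the instrument resolves."""
    return abs(mass_a - mass_b) >= resolution

print(round(K_MASS), round(Q_MASS))             # identical integer masses
print(abs(K_MASS - Q_MASS))                     # difference of about 0.036 Da
print(distinguishable(K_MASS, Q_MASS, 0.01))    # instrument resolving 0.01 Da
print(distinguishable(K_MASS, Q_MASS, 1.0))     # integer-mass resolution
```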

Why do the heights of some peaks in DinosaurSpectrum correlate so poorly with the corresponding amplitudes in the spectral vector?

DinosaurSpectrum is reproduced below (top), along with its spectral vector (bottom).

The transformation of a spectrum into a spectral vector is a complex process that takes into account many factors in addition to the heights of peaks. For details of this transformation, see Kim et al., 2008.

If we know the proteome, isn’t it always better to use peptide identification instead of peptide sequencing?

If a spectrum that we analyze originated from a peptide in a known proteome, then it makes sense to identify this peptide by database search. However, our knowledge of proteomes remains incomplete, even for the well-studied human proteome. Biologists therefore sometimes use de novo peptide sequencing to discover peptides that do not appear in the currently known (still incomplete) proteome.

Why do we generate the decoy database by assuming that all amino acids have the same frequency (1/20), despite the fact that many proteomes have widely varying amino acid frequencies?

In practice, proteomics researchers do typically generate decoy databases by taking into account amino acid frequencies in the proteome under study. This is often achieved by randomly shuffling the amino acids in real proteins to generate a decoy database.
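A minimal sketch of the shuffling approach follows. Shuffling each protein preserves its length and its exact amino acid composition, so the decoy database automatically reflects the amino acid frequencies of the real proteome. The function name, the fixed seed, and the toy sequences are assumptions made for illustration; production tools may use other strategies, such as reversing sequences.

```python
import random

def shuffled_decoys(proteins, seed=0):
    """Return one decoy per protein by randomly permuting its residues.

    Hypothetical helper: each decoy has the same length and amino acid
    counts as the protein it was derived from.
    """
    rng = random.Random(seed)  # fixed seed so the decoy database is reproducible
    decoys = []
    for protein in proteins:
        residues = list(protein)
        rng.shuffle(residues)
        decoys.append("".join(residues))
    return decoys

# Toy "proteome" of two short, made-up sequences.
proteome = ["MKTAYIAKQR", "REDCAREDCA"]
for real, decoy in zip(proteome, shuffled_decoys(proteome)):
    print(real, "->", decoy)
```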

What is the running time of the dynamic programming algorithm for computing the size of a spectral dictionary?

The algorithm for computing |Dictionary_threshold(Spectrum)| amounts to filling in the N × M table Size(i, t), where N is the maximum score among all peptides against Spectrum and M is the parent mass of Spectrum. To fill in each element of this table, we need to consider all possible amino acids and apply the equation for Size(i, t) from the section "Spectral Dictionaries". Each element therefore takes time proportional to the number of amino acids, and the total running time is O(20 · N · M) for the standard 20-letter alphabet.
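The table-filling procedure can be sketched as follows. This is an illustration rather than the book's reference implementation: it assumes the recurrence Size(i, t) = Σ Size(i − |a|, t − s_i) over all amino acid masses |a|, with i indexing mass and t indexing score, and it assumes a spectral vector with non-negative integer amplitudes so that scores can index a list directly. The function name and the toy two-letter alphabet (masses 4 and 5) are also assumptions.

```python
def spectral_dictionary_size(spectral_vector, amino_acid_masses,
                             threshold, max_score):
    """Count peptides whose score against the spectral vector lies in
    [threshold, max_score].

    size[i][t] = number of peptides of mass i scoring t against the first
    i coordinates of the spectral vector (assumed non-negative integers).
    """
    m = len(spectral_vector)  # parent mass
    size = [[0] * (max_score + 1) for _ in range(m + 1)]
    size[0][0] = 1  # the empty peptide has mass 0 and score 0
    for i in range(1, m + 1):
        s_i = spectral_vector[i - 1]
        for t in range(max_score + 1):
            total = 0
            # Try every amino acid as the last residue of the peptide.
            for a in amino_acid_masses:
                prev_mass, prev_score = i - a, t - s_i
                if prev_mass >= 0 and 0 <= prev_score <= max_score:
                    total += size[prev_mass][prev_score]
            size[i][t] = total
    return sum(size[m][t] for t in range(threshold, max_score + 1))

# Toy alphabet with masses 4 and 5; parent mass 9 admits two peptides.
vector = [0, 0, 0, 3, 1, 0, 0, 0, 2]
print(spectral_dictionary_size(vector, [4, 5], threshold=4, max_score=10))
```

The three nested loops run over masses, scores, and amino acids, which makes the cost proportional to the table size times the alphabet size.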