Step 3: Post-Processing - Unrestricted identification of modified peptides


5.3 Unrestricted identification of modified peptides

5.3.4 Step 3: Post-Processing

Most identification tools attempt to assign a statistical quality measure to a PSM, such as a P-value/E-value or FDR. These measures are commonly estimated by screening the experimental dataset against a database of randomised peptide sequences. The performance of a tool can be evaluated based on the trade-off between error rate and sensitivity, often visualised as a Receiver Operating Characteristic (ROC) curve [17]. Despite efforts to reduce the number of candidate peptides in the database and the development of improved scoring algorithms, high-scoring random matches remain a problem when looking for PTMs in an unsupervised manner. Strict error rate thresholds often lead to a substantial loss of sensitivity, whereas lenient thresholds return disputable matches that have to be validated manually.

A range of post-processing algorithms has been developed to further refine the search results.

In addition to presenting a sophisticated alignment algorithm to compare theoretical and experimental spectra, Tsur et al. [66] proposed a new way to tackle the loss of discriminatory power in open modification searches. Their approach relies on the tabulation of all the mass shifts reported by the software for each amino acid in a large dataset. Assuming that incorrect modification mass assignments will distribute randomly across all amino acids, those matches containing residue and mass-shift combinations that are reported multiple times are more likely to represent true modified peptides.
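The tabulation idea can be illustrated with a minimal Python sketch. It is not the implementation of Tsur et al. [66]: the input format (one residue and mass shift pair per modified PSM), the rounding to nominal mass and the `min_count` threshold are assumptions made purely for illustration.

```python
from collections import Counter


def tabulate_mass_shifts(reported, min_count=2):
    """Count (residue, nominal mass shift) pairs and keep the recurring ones.

    `reported` is an iterable of (residue, mass_shift_in_Da) pairs, one per
    modified PSM. The format and `min_count` are illustrative assumptions.
    """
    counts = Counter((residue, round(shift)) for residue, shift in reported)
    return {key: n for key, n in counts.items() if n >= min_count}


# Hypothetical mass shifts reported by an open modification search.
reported = [
    ("M", 15.99), ("M", 16.01), ("M", 15.98),  # recurring: likely oxidation
    ("K", 42.01), ("K", 42.02),                # recurring
    ("G", 37.4), ("W", -11.2),                 # isolated: likely random
]
recurring = tabulate_mass_shifts(reported)
print(recurring)
```

Pairs that occur only once are dropped, reflecting the assumption that random assignments spread thinly over all residue and mass-shift combinations.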

PTMFinder [73] elaborates on the same idea of studying the global evidence for modifications found in the experimental data. Here the focus is shifted from the significance of individual PSMs to modification site scoring. The authors acknowledge that Open Modification Searches are error prone, but try to make use of the fact that large datasets tend to contain a lot of redundant and complementary spectra. The post-processing tool is designed to be a pluggable module that can handle the output of any OMS tool, although it was demonstrated in combination with MS-Alignment and the InsPecT scoring. The basic idea is to group the identification results by modification site and extract the evidence for each occurrence.

Features such as the number of overlapping peptides carrying the same modification and the same modified peptide found in multiple charge states are used to train a Support Vector Machine (SVM) that distinguishes between false and correct modification site assignments.

ComByne [74] is another interesting post-processing module scoring modification sites. In a first step, peptide match probabilities are adjusted for peptide length, missed cleavages and modifications. The rationale is that the chances of randomly matching a semi-tryptic, modified or unmodified peptide with an elevated score differ, since, for example, the database may contain substantially more modified peptide candidates than unmodified ones.

Furthermore, short peptides may randomly be assigned high scores solely based on correct prefix or suffix amino acids. Similar reasoning is used by other post-processing tools such as PeptideProphet [32] and Panoramics [75]. A novelty in ComByne is that it refines the p-value of a peptide match based on the difference between the measured retention time and the predicted retention time of the peptide candidate. In addition, p-values are adjusted based on corroboration, where, in analogy with PTMFinder, the global evidence for a modification site is taken into account. The discovery of overlapping peptides improves the p-values of both peptides. Similarly, when the unmodified and modified variants of the same peptide species are found, the p-values are recalculated. ComByne can also score phosphorylation sites. An identification of, e.g., ASLGS[-18]LEGEASSPK becomes more believable if another spectrum matches ASLGS[+80]LEGEASSPK, because phosphorylated serine has a common neutral loss of 98 Da, which leaves a residue 18 Da lighter than unmodified serine (80 - 98 = -18).
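The phosphorylation example can be made concrete with a small Python sketch. It only detects corroborating pairs and does not reproduce how ComByne [74] recalculates p-values; the bracket notation follows the peptides quoted above, while the function names are hypothetical.

```python
import re


def parse_modified_peptide(annotated):
    """Split e.g. 'ASLGS[+80]LEGEASSPK' into the bare sequence and a dict
    mapping 1-based residue positions to nominal mass shifts."""
    sequence, mods = "", {}
    for residue, shift in re.findall(r"([A-Z])(?:\[([+-]?\d+)\])?", annotated):
        sequence += residue
        if shift:
            mods[len(sequence)] = int(shift)
    return sequence, mods


def is_corroborated(candidate, others):
    """True if a -18 Da shift in `candidate` sits on a site that carries
    +80 Da in one of `others` with the same bare sequence."""
    seq, mods = parse_modified_peptide(candidate)
    loss_sites = {pos for pos, shift in mods.items() if shift == -18}
    for other in others:
        other_seq, other_mods = parse_modified_peptide(other)
        if other_seq == seq and any(other_mods.get(p) == 80 for p in loss_sites):
            return True
    return False


print(is_corroborated("ASLGS[-18]LEGEASSPK", ["ASLGS[+80]LEGEASSPK"]))
```

A match with +80 Da on a different serine of the same peptide does not count as corroboration in this sketch, since the neutral loss argument applies to the modified residue itself.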

5.4 Improving modification discovery

5.4.1 Mass-accuracy and modification discovery

The mass spectrometer type used to analyse samples dramatically influences the number of candidate peptides per spectrum in a restricted PFF search. An experimental spectrum with a precursor mass of 1000.48 Da has approximately 90 times more unmodified peptide candidates in UniProtKB/Swiss-Prot (Yeast) if produced on a low mass accuracy ion-trap instrument (precursor mass accuracy +/- 2 Da) compared to an FT-ICR (precursor mass accuracy +/- 0.006 Da) (see Table 2). Exact precursor mass measurements naturally speed up the bioinformatic part of an LC-MS workflow when the data are analysed with a classical PFF tool. However, substantially higher precursor mass precision does not by default lead to a dramatic increase in the number of identified peptides and proteins. Haas et al. [76] investigated the benefits of high mass accuracy measurements by analysing different datasets with an LTQ-FT instrument, where the FT-ICR part of the instrument was either not exploited or used for the MS survey scan. The MS/MS data of a complex peptide mixture derived from the yeast proteome were submitted to SEQUEST [15] with search parameters adapted to the instrumentation. Interestingly, despite a dramatic search space reduction, only 10% more peptide identifications were produced when the FT was turned on. The FT-ICR proved advantageous mainly for assigning MS/MS spectra with low signal-to-noise ratios. SEQUEST returned 100% more confident peptide matches in the high mass accuracy data when analysing a yeast sample enriched for phosphorylated peptides. These spectra are often dominated by ions resulting from the neutral loss of phosphoric acid (as seen in Fig. 1b), whereas sequence-specific fragment ions formed through cleavage of the peptide backbone amide bonds can be of low intensity. In an OMS, higher precursor mass accuracy does not reduce the number of candidate peptides per spectrum. Consequently, search times are not reduced, but PSMs with the same modification mass can be grouped more discriminatively and modification masses can be mapped accurately to known modifications annotated in databases. To the best of our knowledge, no study has been published investigating the benefits of high precursor mass accuracy measurements for OMS.
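How the precursor mass tolerance determines the number of candidates in a restricted search can be sketched in a few lines of Python. The peptide masses below are invented for illustration; in a real search they would come from an in-silico digest of the sequence database, and the function name is hypothetical.

```python
from bisect import bisect_left, bisect_right


def count_candidates(sorted_masses, precursor_mass, tolerance_da):
    """Number of peptide masses within +/- tolerance_da of the precursor.

    `sorted_masses` must be sorted in ascending order.
    """
    lo = bisect_left(sorted_masses, precursor_mass - tolerance_da)
    hi = bisect_right(sorted_masses, precursor_mass + tolerance_da)
    return hi - lo


# Hypothetical peptide masses in Da (assumption, not real database content).
masses = sorted([998.51, 999.47, 1000.476, 1000.483, 1000.52, 1001.9, 1002.6])

ion_trap = count_candidates(masses, 1000.48, 2.0)    # +/- 2 Da window
ft_icr = count_candidates(masses, 1000.48, 0.006)    # +/- 0.006 Da window
print(ion_trap, ft_icr)
```

In an OMS the window is instead widened by the allowed modification mass range, which is why the candidate count there does not shrink with better precursor accuracy.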

High mass accuracy measurements of fragment ions have been shown to be of great importance for peptide identification based on de novo sequencing [49, 77]. In a recent publication [78], the benefits of fragment ions acquired at high mass accuracy, using an LTQ-Orbitrap, were further investigated. Orbitrap data and linear ion trap data from the same Pseudomonas aeruginosa sample were analysed in restricted searches with Phenyx [33] (Genebio SA), setting the fragment mass tolerance to 12 ppm and 0.5 Da respectively. As the overall search space was increased, by allowing for multiple missed tryptic cleavage sites and a variable (methyl-ester) modification on amino acids D, E, S and T, high accuracy fragment mass measurements resulted in more identified spectra at low FDRs. The spectra not confidently identified in the Phenyx searches were extracted and submitted to the OMS tool Popitam [53], searching the Orbitrap data with a fragment mass tolerance of 0.01 Da (the minimum accepted by the software tool) and the ion-trap data with 0.5 Da. While no additional confident identifications were found in the ion-trap dataset, peptides with mass shifts corresponding to oxidation, adduction of sodium, methylation and dethiomethylation were frequently observed in the high fragment mass accuracy Orbitrap data.
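The role of mass tolerance when interpreting an observed mass shift can be illustrated with the following Python sketch. The table holds standard monoisotopic mass differences for the four modifications named above; the lookup function, its name and the example tolerances are assumptions and do not describe how Phenyx or Popitam work internally.

```python
# Monoisotopic mass differences in Da for the modifications named in the text.
KNOWN_MODIFICATIONS = {
    "oxidation": 15.994915,
    "methylation": 14.015650,
    "sodium adduct": 21.981943,
    "dethiomethylation": -48.003371,
}


def ppm_to_da(mz, ppm):
    """Convert a relative tolerance in ppm into Da at a given m/z."""
    return mz * ppm * 1e-6


def annotate_mass_shift(shift_da, tolerance_da):
    """Names of all known modifications within tolerance of the shift."""
    return sorted(
        name
        for name, mass in KNOWN_MODIFICATIONS.items()
        if abs(mass - shift_da) <= tolerance_da
    )


print(ppm_to_da(1000.0, 12))             # 12 ppm expressed in Da at m/z 1000
print(annotate_mass_shift(15.99, 0.01))  # tight tolerance: one explanation
print(annotate_mass_shift(15.0, 1.5))    # loose tolerance: ambiguous
```

With a tight tolerance a shift near 16 Da has a single explanation in this small table, whereas a loose tolerance leaves oxidation and methylation indistinguishable.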

Fortunately, high mass accuracy instruments (Orbitrap and FT) are becoming increasingly common in proteomics labs. For an in-depth review of mass accuracy in proteomics experiments, we recommend [79].

5.4.2 Extended identification workflows

A number of clever data pre-processing steps have been proposed. They are worth considering for reducing computational time and possibly increasing PTM identification rate.

MS/MS experiments often generate redundant datasets containing multiple spectra of the same peptides. On this basis, a fast clustering algorithm was presented that groups similar spectra and replaces them with a single representative spectrum [80]. The authors demonstrated that datasets of over 10 million spectra could be reduced by a factor of ten, significantly speeding up the subsequent database search. This is especially meaningful when using OMS tools, as the analysis time per spectrum is much longer than in a restricted search with a classical PFF tool.
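The general principle of spectrum clustering can be sketched as follows. This is a simple greedy scheme based on the cosine similarity of binned spectra, not the algorithm published in [80]; the bin width, the precursor tolerance, the similarity threshold and the use of the first cluster member as representative are all assumptions.

```python
import math
from collections import defaultdict


def binned(peaks, bin_width=1.0):
    """Sum peak intensities into m/z bins and scale the vector to unit length."""
    vec = defaultdict(float)
    for mz, intensity in peaks:
        vec[int(mz / bin_width)] += intensity
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {b: v / norm for b, v in vec.items()} if norm else {}


def cosine(a, b):
    """Dot product of two unit-length sparse vectors."""
    return sum(v * b.get(k, 0.0) for k, v in a.items())


def cluster_spectra(spectra, precursor_tol=2.0, min_similarity=0.8):
    """Greedy clustering: a spectrum joins the first cluster whose
    representative has a similar precursor m/z and fragment pattern;
    otherwise it starts a new cluster. Returns lists of spectrum indices."""
    clusters = []  # (precursor m/z, representative vector, member indices)
    for index, (precursor_mz, peaks) in enumerate(spectra):
        vec = binned(peaks)
        for rep_mz, rep_vec, members in clusters:
            close = abs(rep_mz - precursor_mz) <= precursor_tol
            if close and cosine(vec, rep_vec) >= min_similarity:
                members.append(index)
                break
        else:
            clusters.append((precursor_mz, vec, [index]))
    return [members for _, _, members in clusters]


# Hypothetical spectra: (precursor m/z, [(fragment m/z, intensity), ...]).
spectra = [
    (500.3, [(175.1, 30.0), (304.2, 100.0), (433.2, 60.0)]),
    (500.4, [(175.1, 28.0), (304.2, 95.0), (433.3, 66.0)]),
    (612.8, [(147.1, 80.0), (276.2, 40.0), (405.2, 90.0)]),
]
print(cluster_spectra(spectra))
```

Only one representative per cluster would then be submitted to the database search, which is where the reduction in search time comes from.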

Bern et al. [81] describe a spectrum quality assessment tool and show how it may be of particular interest as a pre-processing step preceding a modification search: spectra assigned a high quality score but not identified in a restricted search can often be explained by modified peptides, although many modifications produce low quality spectra.

We strongly recommend reading Tanner et al. [82], which provides a well-written user manual for the InsPecT software platform. The authors suggest that an OMS could be followed by a restrictive follow-up search. Here the data is further explored for some of the more frequent or especially interesting modifications listed in the OMS output. The restrictive search can be more sensitive in detecting modifications with known effects on fragmentation, and multiple modifications per peptide are allowed. SeMoP [69] automates such a three-step strategy. First, a standard database search is performed with SEQUEST [15]. Second, either all peptides of the identified proteins or only the identified peptides from step one are exhaustively explored for modifications. Lastly, the data is re-submitted to SEQUEST for a targeted search for specific modifications found in the unrestricted search, allowing for multiple modifications per peptide.

5.4.3 Experimental set-up for PTM identification

Studying the results presented in the papers describing the OMS tools discussed above, it becomes clear that the vast majority of the modifications identified are in fact not post-translational but rather chemical modifications induced by sample preparation, such as cysteine carboxyamidomethylation, N-terminal and lysine carbamylation, methionine oxidation, and sodium and potassium adducts. PTMs, especially those previously unknown, can be expected to be of low abundance. This is illustrated in Figure 5, which shows the modification mass distribution of a typical OMS. The histogram displays the confident PSMs returned when screening a human blood plasma dataset produced on an Orbitrap instrument and analysed with a novel library-search-based OMS tool, QuickMod (Ahrné et al., manuscript in preparation).

The detection of low-abundance post-translationally modified peptides is very limited when analysing complex mixtures because these peptides are overshadowed by unmodified peptides and by peptides modified during sample preparation. In addition, post-processing algorithms tend to favour the identification of abundant modifications for which extensive evidence can be found. These factors complicate the discovery of rare PTMs. The detection problem can be partly alleviated by various promising sample preparation techniques, such as protein isolation with anti-phosphoamino acid antibodies [83, 7] and affinity-based enrichment of modified proteins or peptides [6, 84-85].

Seo et al. present a protocol targeting low abundance PTMs [86]. The authors describe a clever LC-MS workflow, Selectively Excluded Mass Screening Analysis (SEMSA), where samples are analysed by an LC-ESI-qTOF in multiple rounds. For each round, the precursor masses of spectra confidently identified using MODi [55] are added to a mass exclusion list, allowing for the fragmentation of precursor ions of low intensities. A similar set-up was presented by Schmidt et al. [87], where unidentified MS features were added to an inclusion list for targeted fragmentation; it led to extensive identification of phosphorylation sites in a protein mixture obtained from Drosophila melanogaster lysates.
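The effect of a growing exclusion list can be simulated with a short Python sketch. It is a toy model rather than the SEMSA protocol of [86]: it assumes that the instrument fragments the `top_n` most intense eligible precursors in each round and, as a deliberate simplification, that every fragmented precursor is confidently identified and therefore excluded afterwards.

```python
def simulate_exclusion_rounds(precursors, top_n, n_rounds, tol=0.01):
    """Return the precursor masses fragmented in each acquisition round.

    `precursors` is a list of (mass, intensity) pairs. Masses within `tol`
    Da of an entry on the exclusion list are skipped.
    """
    exclusion = []
    rounds = []
    for _ in range(n_rounds):
        eligible = [
            (mass, intensity)
            for mass, intensity in precursors
            if all(abs(mass - excluded) > tol for excluded in exclusion)
        ]
        selected = sorted(eligible, key=lambda p: p[1], reverse=True)[:top_n]
        masses = [mass for mass, _ in selected]
        rounds.append(masses)
        # Simplifying assumption: all fragmented precursors were identified.
        exclusion.extend(masses)
    return rounds


# Hypothetical precursor masses (Da) and intensities.
precursors = [(800.4, 1e6), (912.5, 5e5), (1020.6, 2e4), (1100.7, 1e4)]
print(simulate_exclusion_rounds(precursors, top_n=2, n_rounds=2))
```

In the second round the two abundant precursors are skipped, so the low intensity ions are selected for fragmentation.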

Another interesting LC-MS identification workflow was recently published by Carapito et al. [88] where spectra are acquired under different collision conditions and a peptide mass inclusion list is compiled based on the detection of modification specific neutral loss fragments and reporter ions in combination with ion signals corresponding to the modified and unmodified peptide masses. In a final step the peptides on the mass inclusion list are sequenced in a directed MS/MS mode.

Barnes et al. [89] developed a software tool, MassShiftFinder, tackling the detection problem by screening MALDI-TOF data for potentially modified peptides which then can be selected for subsequent TOF-TOF analysis. Their algorithm performs a blind search for modifications using peptide mass fingerprints from two proteases with different cleavage specificities. If the same mass shift relative to the unmodified theoretical values is observed for both proteases, and the peptides are overlapping, the mass shift can correspond to a modification or a substitution.
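The two-protease consistency check can be sketched in Python as below. This is an illustration of the principle only and not the MassShiftFinder algorithm of [89]; the data layout (start position, end position, theoretical mass), the tolerances and the example masses are assumptions.

```python
def candidate_shifts(theoretical, observed, max_shift=100.0, min_shift=0.05):
    """All (start, end, shift) where an observed mass differs from a
    theoretical peptide mass by a non-zero shift of at most max_shift Da."""
    out = []
    for start, end, mass in theoretical:
        for obs in observed:
            shift = obs - mass
            if min_shift < abs(shift) <= max_shift:
                out.append((start, end, shift))
    return out


def shared_shifts(shifts_a, shifts_b, tol=0.05):
    """Shifts seen in both digests on overlapping peptides, reported with
    the overlapping residue range and the mean shift."""
    hits = []
    for start_a, end_a, shift_a in shifts_a:
        for start_b, end_b, shift_b in shifts_b:
            overlap = max(start_a, start_b) <= min(end_a, end_b)
            if overlap and abs(shift_a - shift_b) <= tol:
                mean_shift = round((shift_a + shift_b) / 2, 2)
                hits.append((max(start_a, start_b), min(end_a, end_b), mean_shift))
    return hits


# Hypothetical data: (start residue, end residue, theoretical mass in Da).
tryptic = [(10, 20, 1200.60)]
chymotryptic = [(15, 28, 1500.75)]
observed_tryptic = [1280.57, 1250.10]
observed_chymotryptic = [1580.72, 1510.00]

hits = shared_shifts(
    candidate_shifts(tryptic, observed_tryptic),
    candidate_shifts(chymotryptic, observed_chymotryptic),
)
print(hits)
```

The overlapping residue range returned for each hit narrows down where the modification or substitution can be located.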

Working with more than one protease is in general a good idea in order to increase the sequence coverage of PTM sites [10]. Strong b- or y-ion peaks on either side of a modified residue are the best evidence for site specificity. Digesting the samples with multiple proteases also improves the chances of producing such a spectrum, facilitating modification localisation. Furthermore, the confident identification of low abundance peptides generally requires multiple LC-MS/MS analyses of replicate or similar samples [65].

As mentioned earlier in this review, an additional problem which makes the identification of real PTMs particularly tricky is the fact that some modified peptides fragment poorly in the mass spectrometer. Low-energy CAD/CID tandem mass spectrometry has been, by far, the most common method used to dissociate peptide ions for subsequent sequence analysis. Ideally, the peptide is cleaved randomly at the amide bonds along its backbone to produce a homologous series of b- and y-type fragment ions. The presence of multiple basic residues prevents full fragmentation upon collision activation/induction and directs the backbone bond dissociation to specific sites and therefore inhibits the production of a sufficiently diverse set of sequence ions. Further, post-translational modifications such as phosphorylation, sulfonation, nitrosylation and O- and N-linked glycosylation may similarly redirect the sites of preferred cleavage. Often the modified moiety is cleaved off and the peptide backbone is left more or less intact. The resulting spectra tend to contain little peptide sequence information and may not allow for successful identification. In this regard, CAD/CID is most effective for short, low-charged, unmodified peptides. New instrumentation technologies support alternative solutions for data generation that have the potential to improve peptide and protein identification, in particular the identification of peptides carrying labile modifications.

As n-dimensional mass spectrometry has become more practical, new techniques to identify modified peptides have been developed, making up for the limitations of CAD/CID fragmentation. Newer ion-trap instruments provide the option of collecting MS3 spectra of abundant MS2 peaks. Peptides carrying labile modifications have been analysed by automated data-dependent triggering of MS3 acquisition whenever a dominant neutral loss ion of the appropriate mass is detected in an MS2 spectrum [8, 90, 91]. By separately fragmenting the neutral loss ion, an MS3 spectrum rich in sequence information can be produced. Different approaches have been tested to combine MS2 and MS3 spectra from the same peptide to improve peptide identification [92-94].
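The trigger condition for neutral-loss-dependent MS3 acquisition can be written down as a small Python sketch. It assumes the loss of phosphoric acid (monoisotopic mass 97.976896 Da) from a phosphopeptide; the tolerance, the `top_n` parameter and the function names are illustrative assumptions and do not reflect the acquisition software of any particular instrument.

```python
H3PO4 = 97.976896  # monoisotopic mass of phosphoric acid in Da


def neutral_loss_mz(precursor_mz, charge, loss_mass=H3PO4):
    """m/z of the precursor after losing a neutral fragment of loss_mass."""
    return precursor_mz - loss_mass / charge


def should_trigger_ms3(peaks, precursor_mz, charge, tol=0.5, top_n=1):
    """True if one of the top_n most intense MS2 peaks lies within tol of
    the expected neutral loss m/z. `peaks` is a list of (m/z, intensity)."""
    target = neutral_loss_mz(precursor_mz, charge)
    most_intense = sorted(peaks, key=lambda p: p[1], reverse=True)[:top_n]
    return any(abs(mz - target) <= tol for mz, _ in most_intense)


# Hypothetical MS2 spectrum of a doubly charged precursor at m/z 750.0.
ms2_peaks = [(701.0, 100.0), (400.2, 20.0), (588.3, 15.0)]
print(should_trigger_ms3(ms2_peaks, precursor_mz=750.0, charge=2))
```

For a doubly charged precursor the loss appears roughly 49 m/z units below the precursor, which is why the charge state has to be known or tested for several values.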

Other methods to generate higher quality spectra of peptides carrying labile modifications rely on new fragmentation techniques altogether. Electron Capture Dissociation (ECD) is a method for peptide dissociation which is relatively indifferent to peptide sequence and length while avoiding the loss of labile modifications during fragmentation [95]. However, ECD requires an FT-ICR mass spectrometer. Syka et al. [96] introduced Electron Transfer Dissociation (ETD), which has proven useful for the identification of modified peptides and peptides with basic residues. ETD fragments peptides at the Cα-N bond by transferring an electron from a radical anion to a protonated peptide, inducing fragmentation patterns similar to ECD, but can be used on more widely accessible ion-trap or Orbitrap mass spectrometers [97]. Olsen et al. [98] demonstrated a third PTM-friendly fragmentation technology which takes advantage of the Orbitrap's architecture: Higher energy C-trap dissociation (HCD). HCD spectra show richer fragmentation than typical CAD/CID spectra, especially in the low-mass region of the spectrum, including a2, b2, y1 and y2 ions and immonium ions of histidine and of modified residues, such as the immonium ion of phosphotyrosine.

Many more experimental protocols have been described in the literature aiming to increase identification of PTMs. For details we refer to an excellent review on this topic [3].

Figure 5. The vast majority of the modifications identified in a typical OMS are in fact not post-translational but rather modifications induced by sample preparation, such as cysteine carboxyamidomethylation, N-terminal and lysine carbamylation, methionine oxidation, and sodium and potassium adducts. PTMs, especially those previously unknown, can be expected to be of low abundance. The histogram displays the confident modified PSMs returned when screening a human blood plasma dataset produced on an Orbitrap instrument and analysed with a novel library-search-based OMS tool, QuickMod (manuscript in preparation).

5.5 OMS studies

Typical PFF tools successfully explore the unmodified fraction of the experimental data but often fail to identify a substantial part of the fragmentation spectra. The use of recently developed OMS software enables a more complete annotation of MS/MS datasets and refines our understanding of the biological system under investigation.

PTMFinder was evaluated on an impressively large dataset of 18 million spectra from a whole-lysate extract of HEK293 human embryonic kidney cells. Hundreds of previously uncharacterised modification sites were found, most of them phosphorylations, acetylations or methylations, in addition to more than nine hundred already documented modification sites [73]. In the same publication, the authors reported the discovery of several modification sites conserved between protein orthologues in humans and protists, based on the additional analysis of Dictyostelium discoideum samples.

The ocular lens is another suitable testing ground, as it is a particularly rich source of PTMs. Since the proteins in the mature lens fiber cells do not turn over during the long lifetime of the lens, a wide variety of PTMs is expected to accumulate on a large number of residues. This makes lens samples excellent candidates for exploring and evaluating the ability of OMS tools to detect protein modifications. Several of the recently developed OMS tools have been tested on such datasets, including InsPecT [56], PTM-Explorer [71], SIMS [57] and SwissPIT [64].

Willmarth et al. [99] identified a total of 155 modification sites in crystallins by analysing, with the InsPecT software suite, a human lens dataset produced on two different instruments: an LCQ Classic ion trap and a Q-TOF hybrid mass spectrometer. Of these, 77 were previously reported sites and 78 were newly detected, including carboxymethyl lysine (+58 Da), carboxyethyl lysine (+72 Da) and an arginine modification of +55 Da. PTM-Explorer was tested on lens protein samples from a 100-week-old mouse [71]. Approximately 30% of all identified peptides were found to carry modifications other than the common sample preparation artefacts propionamide on cysteines and methionine oxidation, mainly phosphorylation, acetylation and sodium adducts. The developers of SIMS benchmarked their tool against InsPecT on a small human lens protein test dataset of 243 high-resolution spectra of modified peptides, generated on a QTOF instrument [57]. The two algorithms returned identical results for 80% of the spectra. For a further 17% of the spectra, the tools agreed on the peptide and modification mass but disagreed on the modification site.

Other benchmarking studies also show that different OMS tools often agree on peptide identification and modification mass but the site assignment may differ. ProteinProspector [72] was compared to InsPecT on a publicly available dataset (regis-web.systemsbiology.net/PublicDatasets/ mix2) of 3734 spectra produced on a QSTAR instrument from a protein mixture of 18 standard proteins. The two tools reported the same peptides with the same modification mass for 1102 spectra, but approximately half of the modification sites did not agree.

ModifiComb [61] was evaluated on high-accuracy FT data where two complementary fragmentation techniques were used, Collision Activated Dissociation (CAD) and Electron Capture Dissociation (ECD), revealing several previously unknown modifications, later confirmed by Mascot in targeted searches, including a frequent 12 Da proline modification.

From the document: Exploring the use of MS/MS spectral libraries to improve protein identification and characterization (Page 107-0)