Article 1, Extended abstract: A simple workflow to increase MS2 identification rate

Different peptides have different fragmentation propensities, and for any given protein certain peptides are more easily detected and more confidently identified in an LC-MS/MS workflow.

Certain model organisms and sub-proteomes have been subject to extensive investigation in proteomics. These studies show that some peptides are detected all the time while others are never seen (Lam et al. 2007). The repeated rediscovery of the same identifiable peptides using sequence search based data analysis tools is often time-consuming and error-prone. In targeted proteomics studies, where the goal is systematic and repeated investigation of a certain predefined set of peptides or proteins rather than the discovery of previously unobserved peptides, identification results from prior experiments stored in publicly available databases may greatly facilitate the data analysis (see Table 1, Chapter 2).

Spectral library searching is a natural way of incorporating previous knowledge about observed peptides and their respective fragmentation patterns into a new search. The National Institute of Standards and Technology (NIST, http://peptide.nist.gov/) and the Institute for Systems Biology (ISB, http://www.peptideatlas.org/speclib/) have made considerable efforts to compile high-quality, publicly available spectrum libraries for different organisms and instrument types, including some specialized spectrum libraries rich in modified peptides. To enable the exploration of these spectral libraries when analyzing new experimental data, several spectral library tools have been developed, such as SpectraST (Lam et al. 2007), NIST MSPepSearch et al. 2006), X!Hunter (Craig et al. 2006), ProMEX (Hummel et al. 2007), HMMatch (Wu et al. 2007) and MSDash (Wu et al. 2008).

Lam et al. showed that a spectral library matching approach outperforms conventional sequence tools in terms of speed, error rates and sensitivity of peptide identification (Lam et al. 2007). Spectral library search tools provide rapid peptide identification as the experimental data is screened against small size databases, while time-consuming modeling of theoretical spectra is not needed. As a library spectrum is rich in information (including fragmentation intensities of a wide range of fragment ions) relative to a theoretical peptide spectrum, library search tools typically employ simple and fast scoring algorithms and are expected to yield more discriminative match scores than sequence search tools.

The limited protein coverage of currently available spectrum libraries is an obvious drawback of spectral library based data analysis. Even though MS studies have reached substantial depth and coverage for some model organisms (King et al. 2006, Brunner et al. 2007), existing libraries are far from complete, especially with respect to peptides from low-abundance proteins and modified peptides. Publicly available MS/MS data from human samples suffer from low proteome coverage. About 35% of all predicted human proteins have yet to be observed reliably by MS (Nilsson et al. 2010).

However, the amount of data available to the research community is growing steadily, thanks largely to the development of proteomics repositories such as PeptideAtlas, PRIDE, Peptidome and the Tranche data exchange system (Vizcaino et al. 2010, see Table 1). There is hope that all peptides detectable by MS, at least for the most frequently studied organisms, will eventually be discovered and annotated in spectral libraries.

Another drawback of the spectral library search approach arises from the varying peptide fragmentation patterns observed on different mass spectrometers and experimental conditions such as the collision energy. Ideally different laboratories should be able to build up customized spectral libraries, as an optimal spectrum library would include fragmentation spectra produced on the very same machine as the experimental data submitted to the library search tool.

The following article, entitled “A simple workflow to increase MS2 identification rate by subsequent spectral library search”, describes the development of a data processing tool that allows individual laboratories to build their own spectrum libraries. Furthermore, we demonstrate how a workflow combining a sequence search with a spectrum library search of in-house compiled spectrum libraries can dramatically improve the identification coverage of the experimental data. Here a spectrum library is created on the fly from the results of a sequence search analysis. In a next step, we make use of the precision and speed of a spectrum library search to identify spectra that the sequence search tool failed to annotate.

Technical Brief

A simple workflow to increase MS2 identification rate by subsequent spectral library search

Erik Ahrné¹, Alexandre Masselot², Pierre-Alain Binz¹,², Markus Müller¹, Frederique Lisacek¹

¹Swiss Institute of Bioinformatics, Proteome Informatics Group, Michel-Servet 1, CH-1211 Geneva, Switzerland; ²Geneva Bioinformatics (Genebio) SA, Av de Champel 25, CH-1206 Geneva, Switzerland

Abstract

Searching a spectral library for the identification of protein MS/MS data has proven to be a fast and accurate method that also yields a high identification rate. We investigated the potential to increase the peptide discovery rate, with little increase in computational time, by constructing a workflow based on a sequence search with Phenyx followed by a library search with SpectraST. Searching a consensus library compiled from the results of the prior Phenyx search increased the number of confidently matched spectra by up to 156%. The spectra additionally matched by SpectraST included noisy spectra, spectra representing peptides with missed cleavages, and spectra from post-translationally modified peptides.

Keywords: Mass Spectrometry / Protein Identification / Spectral Library Search / False Discovery Rate

Correspondence: Mr. Erik Ahrné, Swiss Institute of Bioinformatics, 1, rue Michel Servet, CH-1211 Genève 4, Switzerland

E-mail: Erik.Ahrne@isb-sib.ch

Fax: (+41 22) 379 58 58

Tandem mass spectrometry (MS/MS) is a well-established method to identify and characterize proteins from complex samples. Various tools are designed to analyse MS/MS data. Popular software such as Sequest [1], Mascot [2] and Phenyx [3] employs a Peptide Fragment Fingerprinting (PFF) algorithm, in which experimental spectra are compared to theoretical spectra generated in silico from a protein sequence database.

Recently, a different method for protein identification and characterisation based on spectral library search has shown promising results in terms of computational time, identification rate and accuracy when compared to traditional identification techniques based on sequence search [4-7]. In this approach, experimental spectra are scored against a carefully compiled database of previously identified experimental spectra. Matching experimental spectra against experimentally generated spectra, rather than in silico predicted ones, tends to lead to a higher sensitivity. One reason is that the actual intensities of all fragment types present in the library spectrum are considered, including neutral losses and various uncommon or even unknown fragments. Further, high precision and speed are achieved because the search space of a spectral library is significantly smaller than that of a sequence database: many putative peptides considered in a sequence search are presumably not detectable by a mass spectrometer and will therefore not be represented in a spectral library. The main weaknesses of this method concern the universality of the library. Firstly, the fragmentation pattern of a peptide depends on the type of mass spectrometer used and on experimental conditions such as the collision energy, suggesting that different laboratories may need customised spectral libraries. Secondly, only peptides stored in the library have a chance of being matched in a search. Several spectral library search tools have been developed: SpectraST [4], X!Hunter [5], BiblioSpec [6] and Libquest [7].

Here we present a method that aims at maximizing the number of interpreted spectra by combining the exhaustiveness of a sequence search approach with the sensitivity of a spectrum library search. We constructed a workflow that sequentially combines a Phenyx sequence search with a SpectraST spectral library search, where the spectral library is created from spectra confidently matched by Phenyx in the initial search (Fig. 1). The spectra are searched against this dedicated library, and the identifications from the initial sequence search and the library search are combined. Given the small size of a spectral library generated from the sequence search results of a single dataset, more spectral matches can be made, in an unrestricted search, at little extra cost in computational time. The additional matches may include spectra for which the sequence search on its own fails to assign a confident score, spectra where the wrong precursor ion isotope has been selected as parent mass, and spectra of modified peptides. The workflow was tested on two datasets of different complexity [8, 9].

Figure 1. Schema of the analysis workflow: The dataset of unidentified experimental spectra is searched by Phenyx. A consensus spectral library is constructed from the search results. Next, a library search using SpectraST is performed on the full dataset. The search results of the two identification tools are merged.
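The final merging step of the workflow can be sketched as follows. This is only an illustration: the data model (a mapping from spectrum identifier to peptide sequence), the function names and the rule that the sequence search result is kept when both tools identify the same spectrum are assumptions, not details taken from the article.

```python
from typing import Dict

# Hypothetical data model: spectrum identifier -> peptide sequence,
# holding only the confident identifications of one search tool.
Identifications = Dict[str, str]


def merge_identifications(sequence_hits: Identifications,
                          library_hits: Identifications) -> Identifications:
    """Combine confident hits of the sequence search and the library search.

    Assumption: when both tools identify the same spectrum, the sequence
    search result is kept.
    """
    merged = dict(library_hits)
    merged.update(sequence_hits)
    return merged


def additional_matches(sequence_hits: Identifications,
                       library_hits: Identifications) -> Identifications:
    """Spectra identified by the library search only."""
    return {sid: pep for sid, pep in library_hits.items()
            if sid not in sequence_hits}


# Toy example with invented spectrum identifiers.
phenyx = {"scan_1": "LFTGHPETLEK"}
spectrast = {"scan_1": "LFTGHPETLEK", "scan_2": "RHGLDNYR"}
combined = merge_identifications(phenyx, spectrast)
extra = additional_matches(phenyx, spectrast)
```

Keeping the two result sets separate until the last step makes it easy to report which identifications were contributed by the library search alone.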

The first test was performed on three replicate LTQ-FT (Thermo Fisher Scientific Inc.) datasets [8] consisting of two-fold dilution series spanning a dynamic range from 25 to 800 fmol of six non-human proteins spiked into a complex sample background of human proteins. The six proteins are: horse myoglobin (Swiss-Prot accession number (AC) P68083, identifier MYG_HORSE), bovine carbonic anhydrase (Swiss-Prot AC P00921, identifier CAH2_BOVIN), horse cytochrome c (Swiss-Prot AC P00004, identifier CYC_HORSE), chicken lysozyme (Swiss-Prot AC P00698, identifier LYSC_CHICK), yeast alcohol dehydrogenase (Swiss-Prot AC P00330, identifier ADH1_YEAST) and rabbit aldolase A (Swiss-Prot AC P00883, identifier ALDOA_RABIT). The Phenyx (Geneva Bioinformatics SA, Geneva, Switzerland) search parameters were set to carboxyamidomethylation of cysteine residues and a parent mass tolerance of 50 ppm, allowing for two missed tryptic cleavages. A combined target-decoy database search was performed to estimate the false discovery rate (FDR). The target database used was human IPI (v.3.15) containing the protein sequences of the six standard proteins.
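As an illustration of how a score threshold can be derived from a combined target-decoy search, the sketch below ranks matches by score and returns the lowest cutoff at which the estimated FDR stays within a given limit. The estimator used here (number of decoy matches divided by number of target matches above the cutoff) is an assumption chosen for simplicity; the article does not state which estimator was applied, and the function name is hypothetical.

```python
from typing import Iterable, Optional, Tuple


def score_threshold_at_fdr(matches: Iterable[Tuple[float, bool]],
                           max_fdr: float = 0.05) -> Optional[float]:
    """Return the lowest score cutoff whose estimated FDR is <= max_fdr.

    matches: (score, is_decoy) pairs from a combined target-decoy search.
    Assumption: FDR is estimated as decoys / targets among all matches
    scoring at or above the cutoff. Matches with equal scores are
    processed in arbitrary order.
    """
    ranked = sorted(matches, key=lambda m: m[0], reverse=True)
    targets = 0
    decoys = 0
    cutoff = None
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets > 0 and decoys / targets <= max_fdr:
            cutoff = score
    return cutoff  # None if no cutoff satisfies the limit


# Toy example with invented scores.
toy = [(10.0, False), (9.0, False), (8.0, True), (7.0, False)]
strict = score_threshold_at_fdr(toy, max_fdr=0.05)
loose = score_threshold_at_fdr(toy, max_fdr=0.5)
```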

The decoy database was generated by applying a third-order Markov chain [10]. For each of the six dilution runs, matches with a z-score above the value corresponding to an approximate 5% FDR were clustered together and a consensus spectrum was generated. For each cluster, all peaks were ordered by mass and peaks that grouped together within a 0.2 Da mass window were merged; the consensus peak mass was set to the centroided mass and the intensity to the average intensity. Only peaks appearing in more than 20% of the spectra were included.
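The consensus-building step described above can be sketched as follows. Several details are assumptions made for the sake of a runnable example: the mass window is measured from the first peak of each group, the centroided mass is taken as the intensity-weighted mean, the average intensity is taken over the peaks in the group, and all intensities are assumed to be positive. The function name is hypothetical.

```python
from typing import List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)


def consensus_spectrum(spectra: List[List[Peak]],
                       window: float = 0.2,
                       min_fraction: float = 0.2) -> List[Peak]:
    """Merge the spectra of one cluster into a consensus spectrum."""
    # Pool all peaks, ordered by mass, remembering their source spectrum.
    pooled = sorted((mz, inten, idx)
                    for idx, spec in enumerate(spectra)
                    for mz, inten in spec)

    # Group peaks lying within `window` Da of the first peak of the group.
    groups: List[List[Tuple[float, float, int]]] = []
    for peak in pooled:
        if groups and peak[0] - groups[-1][0][0] <= window:
            groups[-1].append(peak)
        else:
            groups.append([peak])

    consensus: List[Peak] = []
    for group in groups:
        sources = {idx for _, _, idx in group}
        # Keep only peaks seen in more than `min_fraction` of the spectra.
        if len(sources) / len(spectra) > min_fraction:
            total = sum(inten for _, inten, _ in group)
            centroid = sum(mz * inten for mz, inten, _ in group) / total
            consensus.append((centroid, total / len(group)))
    return consensus


# Toy cluster of six spectra with invented peaks.
cluster = [
    [(100.0, 10.0), (200.0, 5.0)],
    [(100.1, 30.0)],
    [(100.05, 20.0)],
    [(300.0, 1.0)],
    [(400.0, 1.0)],
    [(500.0, 1.0)],
]
result = consensus_spectrum(cluster)
```

In the toy cluster only the peak group near m/z 100 is present in more than 20% of the spectra, so it is the only peak retained.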

Next, the full dataset was searched with SpectraST. SpectraST employs a weighted dot product scoring algorithm for spectral comparison and was developed at the Institute for Systems Biology as an integrated tool of the Trans-Proteomic Pipeline. Given the small library size, the precursor m/z tolerance has little influence on the overall search time; we therefore tested the workflow with the precursor mass tolerance of the SpectraST search set to 100 Da. The SpectraST results were subjected to automated validation and probability assignment by PeptideProphet [11]. Peptide matches with scores passing an FDR threshold of 5% were registered, and the results of the Phenyx search and the SpectraST library search were compared. Spectra confidently identified with SpectraST but not with Phenyx were manually examined using an in-house spectral visualisation tool.

SpectraST was installed on a regular desktop PC. The spectral library of the LTQ-FT dataset of 303 spectra was built from the Phenyx search results exported to pepXML format in approximately 90 seconds. The library search time per spectrum was approximately 0.003 seconds.

In total, for the six dilution runs of the first replicate dataset, Phenyx identified 1485 spectra. SpectraST identified 3805 spectra at the same FDR. To explain the additionally identified spectra, we examined the spectra identified to the six non-human proteins known to be present in the samples. Phenyx matched 362 spectra to these proteins. SpectraST made 639 matches at the same FDR. SpectraST failed to confidently identify 3 of the Phenyx identifications. Upon further inspection, these spectra were found to be identified with the same peptides as those proposed by Phenyx, but with PeptideProphet probabilities falling below the threshold.

Of the 639 matches made by SpectraST, 280 were not confidently matched by Phenyx. After manual examination, these spectra could be grouped into 6 categories (Table 1). Approximately 40% of these additionally matched spectra were found to include deamidated asparagine (N) and/or glutamine (Q) residues. One third were noisy spectra that Phenyx failed to confidently match. About 15% represented peptides where the precursor ion had lost a molecule of water. Among the remaining additional identifications were spectra where the wrong isotope had been picked as parent mass, spectra representing N-terminally carboxyamidomethylated peptides, and noisy spectra of peptides with missed cleavages. Note that the sequence search parameter settings allowed for up to two missed cleavages. Of the 280 additional SpectraST matches, 13 were found to be incorrect identifications. For further verification of the SpectraST identifications, a randomly picked subset of 100 spectra matched to human proteins was manually inspected. Among these spectra, 6 appeared to be identified to the wrong peptide. The results described above were reproduced in the analysis of the second and third replicate samples.

In Figure 2a, a noisy experimental spectrum confidently identified by SpectraST is displayed with the candidate library spectrum. The match is assigned a PeptideProphet probability score of 0.99. Phenyx identified this spectrum to the same peptide as SpectraST but with a z-score of 5.74, which falls below the z-score threshold of 6.1 corresponding to an approximate FDR of 5%.

Figure 2b shows the library spectrum representing the unmodified doubly charged peptide LFTGHPETLEK matched to an experimental spectrum representing a doubly charged N-terminally carboxyamidomethylated variant of the same peptide. Peaks of the C-terminal ion series align in both spectra, leading to a high-scoring match, while those of the N-terminal ion series are shifted by 57 Da. This spectrum was not identified by Phenyx because the search parameters were not set to include N-terminal carboxyamidomethylation.
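The effect of an N-terminal modification on the two fragment ion series can be illustrated with a short calculation. The sketch below computes singly charged b- and y-ion m/z values for LFTGHPETLEK with and without an N-terminal carboxyamidomethyl group, using standard monoisotopic residue masses and a monoisotopic mass shift of about 57.02 Da for the modification. The function name is hypothetical, and the restriction to singly charged b and y ions is a simplifying assumption.

```python
# Standard monoisotopic residue masses (Da).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276
WATER = 18.010565
CARBAMIDOMETHYL = 57.021464  # monoisotopic mass shift of the modification


def fragment_ions(peptide, nterm_mod=0.0):
    """Return singly charged b- and y-ion m/z lists (b1..b(n-1), y1..y(n-1))."""
    masses = [MONO[aa] for aa in peptide]
    b_ions = []
    y_ions = []
    for i in range(1, len(peptide)):
        # b ions contain the N-terminus and therefore carry the modification.
        b_ions.append(sum(masses[:i]) + nterm_mod + PROTON)
        # y ions contain the C-terminus and are unaffected by it.
        y_ions.append(sum(masses[-i:]) + WATER + PROTON)
    return b_ions, y_ions


b_plain, y_plain = fragment_ions("LFTGHPETLEK")
b_mod, y_mod = fragment_ions("LFTGHPETLEK", nterm_mod=CARBAMIDOMETHYL)
b_shifts = [m - u for m, u in zip(b_mod, b_plain)]
y_shifts = [m - u for m, u in zip(y_mod, y_plain)]
```

Every b ion moves by the mass of the modification while every y ion stays in place, which is why the C-terminal series still aligns with the unmodified library spectrum.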

A large percentage of the spectra additionally matched in the library search represent deamidated peptides. Deamidation is a common post-translational modification that converts an asparagine residue into a mixture of isoaspartate and aspartate. Deamidation of glutamine residues also occurs, but at a lower rate. Deamidation can also arise as an artefact under acidic catalysis. When these spectra are manually compared with the matched library spectrum, it is apparent that deamidation often causes important changes in the fragmentation pattern.

Figure 2c shows the non-modified and asparagine-deamidated spectra of the RHGLDNYR peptide from chicken lysozyme. The high intensity of the b6 and y6 ions in the deamidated spectrum is possibly explained by the enhancement of fragmentation after acidic amino acids when no mobile proton is present. Even though the intensity distributions of the fragment ions of the two peptides are rather different, the match is assigned a PeptideProphet probability score of 0.99. A likely explanation is that, in addition to the simple dot product of the square roots of the peak intensities of the compared spectra, the SpectraST fval score incorporates two sub-scores, the delta dot score (∆D) and the dot bias score (DB). The ∆D score is the difference in dot product between the top hit and the second-highest-scoring candidate match. The DB score is a measure of how biased the dot product score is towards a few matching peaks. Given the small size of our spectral library, the probability of getting a high-scoring runner-up by chance is small. Furthermore, the contribution to the dot product is evenly distributed among the peaks in the deamidated spectrum; thus the ∆D and DB scores will promote a high fval score.
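The three quantities discussed above can be sketched in a few lines. This is not the SpectraST implementation: the 1 Da binning, the normalisation to unit length, the particular dot bias formula (square root of the summed squared per-bin products, divided by the dot product) and all function names are assumptions chosen to match the verbal description. How the sub-scores are combined into fval is deliberately left out, and the library is assumed to be non-empty.

```python
import math
from typing import Dict, List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)


def binned_sqrt_vector(peaks: List[Peak], bin_width: float = 1.0) -> Dict[int, float]:
    """Bin peaks, summing square-root intensities per bin, and scale to unit length."""
    vec: Dict[int, float] = {}
    for mz, inten in peaks:
        key = int(mz / bin_width)
        vec[key] = vec.get(key, 0.0) + math.sqrt(inten)
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {k: v / norm for k, v in vec.items()} if norm > 0 else {}


def dot_product(q: Dict[int, float], l: Dict[int, float]) -> float:
    return sum(v * l.get(k, 0.0) for k, v in q.items())


def dot_bias(q: Dict[int, float], l: Dict[int, float]) -> float:
    """Close to 1 when a single bin dominates the dot product, lower when it is spread out."""
    d = dot_product(q, l)
    if d == 0:
        return 0.0
    return math.sqrt(sum((v * l.get(k, 0.0)) ** 2 for k, v in q.items())) / d


def score_query(query: List[Peak], library: Dict[str, List[Peak]]):
    """Return (best entry, D, delta D, DB) for a query against a small library."""
    q = binned_sqrt_vector(query)
    ranked = sorted(((dot_product(q, binned_sqrt_vector(p)), name)
                     for name, p in library.items()), reverse=True)
    top_d, top_name = ranked[0]
    runner_up = ranked[1][0] if len(ranked) > 1 else 0.0
    db = dot_bias(q, binned_sqrt_vector(library[top_name]))
    return top_name, top_d, top_d - runner_up, db


# Toy library with invented peaks.
toy_library = {"A": [(100.0, 4.0), (200.0, 4.0)], "B": [(300.0, 1.0)]}
name, d, delta_d, db = score_query([(100.0, 4.0), (200.0, 4.0)], toy_library)
```

With a small library the runner-up score is usually low, so ∆D stays close to D, which is consistent with the explanation given for the deamidated match.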

For a more detailed discussion of the additional matches found by SpectraST in the LTQ-FT data, see Figure S4 in the supplementary material.

Figure 2. Spectra confidently matched with SpectraST (bottom) but with no confident Phenyx identification, displayed with the matched consensus library spectrum (top). (a) A noisy experimental spectrum confidently matched by SpectraST to a consensus library spectrum. (b) A spectrum representing an N-terminally carboxyamidomethylated peptide matched to an unmodified library spectrum of the same peptide. (c) A spectrum representing a deamidated peptide matched to the unmodified library spectrum.

Note that we also studied the accuracy of spectral count as a quantification method, based on the Phenyx search results alone and when also taking into account the additionally matched spectra found in the library search. Liu et al. 2004 [12] and Old et al. 2005 [13] suggested that there is a linear correlation between protein abundance and spectral count, the latter being the sum of all identified spectra for a given protein, including spectra redundant for ion charge states. The results of this comparison are presented in the supplemental material (see Figures S1-S3).
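Spectral counting as defined here amounts to tallying the identified spectra per protein. A minimal sketch, with invented spectrum identifiers and a hypothetical mapping from spectrum to protein, shows how the counts change once the additional library matches are included:

```python
from collections import Counter
from typing import Dict


def spectral_counts(spectrum_to_protein: Dict[str, str]) -> Counter:
    """Count identified spectra per protein (redundant charge states included)."""
    return Counter(spectrum_to_protein.values())


# Hypothetical identifications: spectrum identifier -> protein identifier.
sequence_search = {"scan_1": "MYG_HORSE", "scan_2": "LYSC_CHICK"}
library_only = {"scan_3": "MYG_HORSE", "scan_4": "MYG_HORSE"}

counts_sequence = spectral_counts(sequence_search)
counts_combined = spectral_counts({**sequence_search, **library_only})
```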

Table 1. Categorisation of spectra additionally identified by SpectraST, i.e. spectra with no confident Phenyx hit but confidently matched by SpectraST. For dataset I (LTQ-FT data) the categorisation covers the manually validated matches of the first replicate run identified to one of the six non-human proteins. For dataset II (QqTOF data) the categorisation covers the full dataset and is based on the manual identifications in Chalkley et al. [9].

* Bad mass precision (20), mis-cleavage (8), co-elution (1).

The second test involved a more complex dataset of 3269 spectra, from a sample containing over 200 proteins, acquired on a quadrupole selecting, quadrupole collision cell, time-of-flight (QqTOF) geometry instrument [9]. The Phenyx search parameters were set to a parent mass tolerance of 200 ppm and to include the following PTMs as variable modifications: oxidation of methionine, N-terminal acetylation, deamidation, methylation of lysine and pyroglutamate formation from N-terminal glutamine residues. The database searched was UniProtKB/Swiss-Prot v.54.6. The library was built from non-modified confident Phenyx identifications. False discovery rates were estimated and the library was created using the same protocol as for the LTQ-FT dataset. The QqTOF library contained 1017 spectra and was built in approximately 200 seconds. The search time per spectrum was on average 0.009 seconds.

In Chalkley et al. 2005 [9], 2479 of the yeast spectra were manually annotated. We used these

should be used with some caution, but we see no reason to believe that potentially wrongly assigned spectra would affect the overall comparison of the Phenyx and SpectraST results. When using the manual identifications to calculate the FDR for the Phenyx and SpectraST searches at given score thresholds, we found the values to be approximately one percent higher for both search tools than the FDR estimated from the decoy search and the PeptideProphet analysis. This could be explained by spectra matched to peptides with sequences very similar to the manual identifications. One example is the spectrum at 31.21 min from cation exchange fraction 1, which was manually annotated with the peptide sequence STEIIR but assigned the sequence TSEIIR by Phenyx and SpectraST. Another is the spectrum at 19.76 min from cation exchange fraction 6, manually annotated TITFHR but identified to the peptide DVTFHR by Phenyx and SpectraST.

At an FDR of 5%, Phenyx alone identified 1084 non-modified peptides identical to the manual identifications of these spectra. Given the content of the spectral library, SpectraST could at best match 1304 (20% more) spectra to the same peptides as the manual identifications. SpectraST made 1249 (15% more) “correct” identifications. Of the 55 spectra that SpectraST failed to recover, 38 were identified to the right peptide but with scores below the set threshold. These were primarily modified spectra and spectra that Phenyx also failed to confidently identify. The remaining 17 spectra, identified to peptides different from the manual identifications, were mainly cases where SpectraST matched a peptide with a sequence highly similar to the manual identification, and cases where Phenyx had made a high-scoring match different from the manual annotation, so that the same peptide was consequently matched by SpectraST.

The additional matches made by SpectraST are categorized in Table 1. Of the 122 non-modified spectra confidently matched by SpectraST but with no confident Phenyx identification, 52 were matched by Phenyx to the same peptide as by SpectraST, with z-scores between 5 and 6, where a z-score of 6 corresponds to an approximate FDR of 5% and a z-score of 5 corresponds to an approximate FDR of 15%.

In total, Phenyx correctly identified 47 post-translationally modified peptides, including oxidation of methionine residues (33), deamidation (8), N-terminal acetylation (4), and pyroglutamate formation from
