• Aucun résultat trouvé

Chapter 2 First Author Papers

2.3. Glycoforest

2.3.1. Overview

Glycoforest is a software package that was developed to accelerate the process of assigning glycan structures to spectra in glycomics MS/MS datasets. The challenges that were addressed are the generation of candidate glycan structures and the scoring of the generated candidates.

Glycoforest addresses the candidate generation and scoring by proposing a partial de novo algorithm that makes use of the OMS spectrum similarity score. The scoring function that was developed combines the OMS score from the candidate generation with the normalized dot product similarity between the query spectrum and the candidate’s theoretical spectrum.

Glycoforest was tested on two hand curated datasets (human gastric and salmon mucosa O-linked glycomes)23,24 for which MS/MS spectra were annotated manually. Glycoforest generated the human validated candidate structure for 92% of the test cases. The correct structure was found as the best scoring match for 70% and among the top 3 matches for 83%

of test cases. In addition to spectra that are annotated, the test dataset also contains spectra that are missing a manual annotation. Glycoforest was able to assign a structure to 1,532 consensus spectra. A portion containing 521 consensus spectra confirmed that Glycoforest annotated an additional 50 MS/MS spectra that were overlooked during manual annotation.

Glycoforest 1.0

Oliver Horlacher,†,‡ Chunsheng Jin,§ Davide Alocci,†,‡ Julien Mariethoz,†,‡ Markus Müller,†,‡

Niclas G. Karlsson,§ and Frederique Lisacek*,†,‡

Proteome Informatics Group, SIB Swiss Institute of Bioinformatics, Geneva, 1211, Switzerland

University of Geneva, Geneva, 1211, Switzerland

§Glyco Inammatory Group, Department of Medical Biochemistry and Cell Biology, Institute of Biomedicine, Sahlgrenska Academy, University of Gothenburg, Gothenburg, SE405 30, Sweden

*S Supporting Information

ABSTRACT: Tandem mass spectrometry, when combined with liquid chromatography and applied to complex mixtures, produces large amounts of raw data, which needs to be analyzed to identify molecular structures. This technique is widely used, particularly in glycomics. Due to a lack of high throughput glycan sequencing software, glycan spectra are predominantly se-quenced manually. A challenge for writing glycan-sequencing software is that there is no direct template that can be used to infer structures detectable in an organism. To help alleviate this bottleneck, we present Glycoforest 1.0, a partial de novo algorithm for sequencing glycan structures based on MS/MS spectra. Glycoforest was tested on two data sets (human gastric and salmon mucosa O-linked glycomes) for which MS/MS

spectra were annotated manually. Glycoforest generated the human validated structure for 92% of test cases. The correct structure was found as the best scoring match for 70% and among the top 3 matches for 83% of test cases. In addition, the Glycoforest algorithm detected glycan structures from MS/MS spectra missing a manual annotation. In total 1532 MS/MS previously unannotated spectra were annotated by Glycoforest. A portion containing 521 spectra was manually checked conrming that Glycoforest annotated an additional 50 MS/MS spectra overlooked during manual annotation.

G

lycomics is a 21st century endeavor that is gaining momentum1 and importance in biological knowledge.2 Glycans, in the form of both polysaccharides and glycoconju-gates, are the most abundant class of biomolecules and are increasingly recognized as being implicated in a wide range of biological processes. Despite this ubiquitous presence in living systems, glycomics has lagged behind genomics and proteo-mics, in particular, because, until recently, high throughput methods for the large scale characterization of the branching structures of complex carbohydrates, especially while con-jugated to the carrier protein, have been missing. In fact, the only way to generate oligosaccharide structural data is by biochemical analysis, but full and detailed characterization of glycan structures presented by a native protein is a challenging analytical process that requires sensitive, quantitative, and robust technologies to determine monosaccharide composition, linkages, and branching sequences. Nowadays, tandem mass spectrometry (MS/MS) is the most established analytical technique used to address the challenges of glycomics as it oers high levels of sensitivity and can handle complex mixtures.3 Robust analytical technology combined with strong bioinformatics support is common to most omics applica-tions. Such a combination is designed to extract as much information as possible from the information-rich data sets

generated. It appears then that glycomics needs the automation of analytical methods to enable large-scale studies. An increasing range of software tools is available for analyzing MS/MS data produced in glycomics (see Li et al.4recent book chapter or Maxwell et al.5 for review), but very few were dened to be operational in high throughput settings. Indeed, automating the identification of glycans from MS/MS spectra is challenging because there is no direct template, such as the genome in proteomics, which can be used to generate a database of all glycan structures in a given organism. Instead, possible structures have to be inferred from enzyme activity, or de novoalgorithms have to be used to assign the spectra. As Li et al.4had comprehensively covered the range of existing tools, we nonetheless mention the recent progress since Glyco-Workbench,6 one of the most popular software designed to speed up and ease the manual interpretation of MS/MS data.

GlycoWorkbench evaluates a set of user-selected structures upon establishing how well their theoretical fragment masses match the real MS/MS peak list, but it is very low-throughput as it only processes one spectrum at a time. Such a drawback is

Received: July 14, 2017 Accepted: September 13, 2017 Published: September 13, 2017

Article pubs.acs.org/ac

© 2017 American Chemical Society 10932 DOI:10.1021/acs.analchem.7b02754

Anal. Chem.2017, 89, 1093210940 C

Cite This: Anal. Chem. 2017, 89, 10932-10940

tackled in Glycomics Elucidation and Annotation Tool (GELATO), a component of the GRITS toolbox (http://

www.grits-toolbox.org). This software was designed to be able to handle combined MS and MS/MS glycomic les but is lacking a module for LC-MS and LC-MS/MS peak identication. Automation of glycomic MS/MS can be further increased with the use of Smart Annotation Enhancement Graph (SAGE),7a user behavior learning method. SAGE scores select or reject annotations based on prior structure assign-ments made by a user.

Despite this recent progress, an LC-MS/MS database search strategy similar to the most common identification approach in proteomics still cannot be easily transposed to glycomics. This is mainly due to the absence of a comprehensive and well-curated set of experimentally derived glycan structures, notwithstanding much-improved eorts for collecting data,810 but also to the lack of subtle scoring schemes. The rst database search strategy was implemented in GlycosidIQ, which generates a theoretical peak list for each structure in a database by computing all its theoretical fragments.11A scoring function determines the best match between the theoretical peak lists and the MS/MS spectra. This approach described in 2004 was limited by the availability of reliable data at the time.

In fact, recent eorts have mostly focused on glycoproteomics considering glycans as a source of mass shifts between glycosylated and nonglycosylated peptides. This approach is dicult due to the unmanageable expansion of the search space. For example, one of the latest methods tackling large-scale issues is GlycoMasterDB.12It uses higher energy collision dissociation (HCD) spectra to identify glycan structures/

masses and electron transfer dissociation (ETD) to identify the peptide sequence by matching against a database of glycans and peptides, respectively. The glycan database is constructed from GlycomeDB. An alternative option to database search is to adopt ade novostrategy for glycans13and/or glycopeptides, but such approaches have so far not been extensively validated. In all cases, false discovery rate (FDR) calculations are not straightforward and more often than not missing for glycan identications. Finally, scoring schemes are not easy to determine either, since result validation is not fool proof and sometimes glycobiologist dependent.

These examples highlight the objective diculty of implementing analytical pipelines. Nevertheless, the annotation of fragment spectra comprises a series of tedious and repetitive steps whose automation is straightforward and can result in a substantial decrease of the time needed for sequencing a structure. In that respect, recent progress in proteomics and metabolomics can benet glycomics. Small molecule identi-fication has long been relying on the use of spectral libraries,14,15and in proteomics, a signicant number of articles also promote the use of direct spectral matching and the construction of spectral libraries to support reliable peptide identication and quantication, especially in high-throughput settings.16,17 Computationally, using spectral libraries is also much more ecient as it skips the construction of theoretical spectra. The obvious downside is the potential incompleteness of the library and its subsequent limited coverage. Nonetheless, the constant improvement of instrument accuracy, sensitivity, and accumulation of knowledge in glycobiology has contributed to the generation of libraries with good coverage. Thus, we chose to associate automation with the establishment of glycan MS/MS spectral libraries. The idea of assigning spectra through matching a library consisting of experimentally determined

fragment spectra was proposed in glycomics by Kameyama et al.18 The experimentally determined MS/MS spectral library anticipated is now stored in UniCarb-DB.19

Spectra libraries rely on two core principles beyond mainstream approaches: (1) matching unidentied spectra to spectra in a spectral library or to unidentied spectra is more ecient than matching to theoretical spectra based on reference sequences and (2) consensus interpretation of sets of related spectra is more reliable than identication of one spectrum at a time. This type of method relies on spectrum clustering and primarily on the denition of similarity. Despite some attempts to rene this denition and sharpen spectral matches (e.g., Spec2spec and Tremolo),20,21 the normalized dot product remains a widely used, reliable, and simple measure. In this study, we have transposed this approach in the context of glycomics. So far, only one tool implements a similar method, SweetNET,22which is applied to solving glycopeptide structures.

In this Article, we introduce Glycoforest 1.0, a partialde novo algorithm for sequencing glycan structures based on MS/MS spectra. Glycoforest first clusters all MS/MS spectra in each individual LC-MS/MS run to collect spectra that belong to the same glycan, generating high-quality consensus spectra. Glycan structures are then assigned to the consensus spectra using a partial de novo glycan sequencing algorithm. The sequencing algorithm makes use of a similarity measure that was previously dened for nding unknown modications in peptides, otherwise called open modication search.23,24 The same principle is applied to search a library of reference spectra to nd references that share structural similarities with the query but dier in glycan composition. The score from the open modication search and the glycan structures obtained from the reference spectra are then used to generate and score glycan spectra matches. This algorithm is partially de novo because reference spectra for which the structure is known are required as a starting point for generating structures that are not present in the reference set. The sequencing results, including query spectra for which no structure was found, are visualized using a spectral network25that shows the potential structural relation-ships between consensus spectra. Glycoforest was evaluated against manually annotated results obtained from data sets of human gastric (16 rawles) and Atlantic salmon (20 rawles) mucin-typeO-glycan MS/MS spectra.

MATERIALS AND METHODS

Raw Data.The data used in this study were obtained from human gastric and Atlantic salmon mucin-typeO-glycan MS/

MS data set. Both data sets are manually annotated and have been previously described.26,27The raw data was downloaded from Swestore (http://srm.swegrid.se/). Briey, O-glycans were released from mucins, and the reduced glycans were analyzed by liquid-chromatography tandem mass spectrometry (LC-MS/MS) using a porous graphitized carbon particle column. The eluted O-glycans were detected using an LTQ mass spectrometer (Thermo Scientic, San Jose, CA, US) in negative ion mode. The gastric mucin data set consists of 16 raw mass spectrometry files (.raw) and one Excel file that contains a summary of the manually sequenced glycans and their associated retention times in each run. The Atlantic salmon data set contains 20 rawfiles as well as a summary Excel le of the retention times. This amounts to 36 raw les that contain 99 937 glycan and nonglycan MS/MS spectra.

Analytical Chemistry Article

DOI:10.1021/acs.analchem.7b02754 Anal. Chem.2017, 89, 1093210940 10933

Annotation of Known Structures. The manually annotated O-glycan structures that were used to assess the Glycoforest algorithm were obtained from the summary Excel les of the respective papers.26,27 Each row of the retention time table contains a glycan sequence and the retention time of that sequence for each run. For each structure, the elution peaks that contain spectra from that structure were identied using the retention time listed for each run. Once the correct elution peak was identied, the MS/MS spectra that belong to that peak were annotated with the structure. In cases of coelution, each MS/MS spectrum was annotated with the structure that contributed more ion current based on the shape of the elution peak.

Methods. Glycoforest 1.0 is a Java program written using the MzJava software library28 and Apache Spark29 (http://

spark.apache.org). The spectra data set was sequentially processed by going through three main steps (Figure 1): (1) preparing data for sequencing; (2) glycan structure sequencing;

(3) spectral network generation.

Step 1 collates the raw MS/MS spectra in each run using a clustering algorithm to generate consensus spectra. Step 2 uses the sequencing algorithm to assign structures to each consensus spectrum. Step 3 then creates a spectral network to visualize the structural relationships that are available to the sequencing algorithm.

Preparing Data for Sequencing. The raw MS/MS spectra wererst prepared for sequencing by data conversion, ltering, and then clustering to build consensus spectra. The 36 raw les were centroided and converted to mzXML using MSConvert.30The resulting mzXMLles were then converted to Hadoop les (http://hadoop.apache.org) using the MzJava library.28To reduce the number of low quality and nonglycan spectra, the following rules were applied to select MS/MS spectra: contain at least 10 peaks, have a retention time between 10 and 30 min, have a precursor m/z that is within tolerance (±0.3) of a glycan composition with at least one HexNAc.

Only the MS/MS spectra meeting the criteria above were considered for clustering. Details on how the glycan compositions were generated can be found in the Supporting Information.

Building Consensus Spectra. It is generally accepted that combining MS/MS spectra into consensus spectra improves the interpretation of spectral matches. Due to the large number of MS/MS spectra collected in this work, we employed a clustering strategy to collect spectra that belong together followed by applying a merging procedure to create consensus spectra. Practically, MS/MS spectra in each run were clustered

and a consensus spectrum was built from each cluster. Thus, each consensus spectrum represents at least one glycan structure. The aim of clustering is to collect all MS/MS spectra from the same structure without generating clusters that are contaminated with spectra from a dierent structure. This can be dicult due to the coelution of isomeric structures. To account for this, clustering parameters were chosen to prevent the generation of clusters that contain spectra from more than one structure. The downside of this choice is the possible generation of multiple clusters for a single structure. However, the merging of redundant clusters is easier to address than the splitting of ambiguous clusters. Like other MS/MS clustering algorithms,31 Glycoforest uses agglomerative hierarchical clustering based on spectra similarity. However, clustering in Glycoforest diers from other clustering algorithms on two key points: (i) the clusters are constrained using within run retention times; (ii) instead of setting a similarity threshold for the denition of a cluster, a statistical model of the clusters controls and terminates the clustering process when the model with the minimum Bayesian Information Criterion (BIC)32is found.

Briey, to make clustering more ecient, spectra of each run were grouped on the basis of precursor charge state, retention time, and precursor m/z value. Each of the groups was then clustered. During the clustering, Glycoforest keeps track of the cluster assignment of raw MS/MS spectra so that the consensus spectra can be generated from the unprocessed MS/MS spectra of each cluster. Further details of the clustering algorithm are provided in theSupporting Information.

Finally, consensus spectra were created by combining all peaks from the raw MS/MS spectra that belong to a cluster into one spectrum. Peaks with similarm/z values (±0.3 Da) were grouped, and each group was replaced by a single centroid peak. During centroiding, the number of peaks that were used to generate the centroid was recorded to keep track of the number of MS/MS spectra in which the peak occurred (spectrum count). The intensity of the centroid peak was calculated by summing the intensity of all the peaks that were merged and taking the square root. Once the consensus spectra had been generated, isotopic consensus spectra were removed using the isotope removal algorithm described in the Supporting Information.

Test Data and Reference Selection.All consensus spectra that include spectra that weremanually annotatedwith anO -glycan structure (as described in theMaterials and Methods) constitute what we call thetest set. The Glycoforest sequencing algorithm requires as reference a set of high-quality spectra that are unambiguously annotated. This set of spectra is called the reference set. Consensus spectra were included in thereference set if they hadone structure annotationand at least one peak for each glycosidic bond in the structure. The ion types that were considered when determining if a bond is present were B, C, Y, and Z ions with all charges smaller than or equal to the consensus precursor charge. We assumed that all manually annotated structures (usually short glycans) are correct in this study.

Glycan Sequencing. We dene here open similarity to be the similarity of two spectra, that have dierent precursor masses and therefore have dierent glycan compositions. More precisely, the open similarity accounts for the m/z shift of a subset of the peaks due to the addition, removal, or substitution of monosaccharides (Hex, Fuc HexNAc, Neu5Ac, Neu5Gc, Kdn) and sulfate (S) as summarized in Table S-1. The Figure 1. Main steps of the Glycoforest workow for assigning

glycans. In step 1, raw MS/MS data is clustered and consensus spectra are generated. In step 2, the sequencing algorithm is used to assign a structure to each consensus spectrum. In step 3, a spectral network is built to visualize the relationship between the reference, assigned, and unassigned spectra.

Analytical Chemistry Article

DOI:10.1021/acs.analchem.7b02754 Anal. Chem.2017, 89, 1093210940 10934

similarity was calculated using the function previously dened to nd unknown modications in peptides.23,24 The Glyco-forest sequencing algorithm exploits the fact that the open similarity of two consensus spectra (S1 and S2) correlates with the structural similarity of the associated glycans (G1 and G2).

The composition dierence, which was inferred from the mass dierence between S1 and S2, combined with open similarity was used to generate and score hypothesis about structural changes (structure transition) between G1 and G2.

The structure transitions that were considered were:

The structure transitions that were considered were: