Developing algorithms to automate the identification of post translational modification in LC-MS/MS data

Thesis

HORLACHER, Oliver

Abstract

Understanding post-translational modifications (PTMs) of proteins and how PTMs influence cellular processes is an important part of understanding the biology of both healthy and diseased cells. The most widely used experimental technique for studying PTMs is Liquid Chromatography Tandem Mass Spectrometry (LC-MS/MS). When LC-MS/MS is applied to complex mixtures, such as cell lines or tissue samples, a large amount of raw data is produced, which needs to be analysed to identify molecular structures. This thesis focuses on the development of software to automate the identification of PTMs in proteomic MS/MS data and of glycans in glycomic MS/MS data. The outcome of this thesis is three papers and their associated software packages: MzJava, MzMod and Glycoforest. MzJava provides the building blocks for developing MS/MS analysis software, MzMod is a validated and improved spectrum-library-based open modification search engine, and Glycoforest pioneered a new approach for automating glycomics analysis.

HORLACHER, Oliver. Developing algorithms to automate the identification of post translational modification in LC-MS/MS data. Thèse de doctorat : Univ. Genève, 2018, no. Sc. 5194

URN : urn:nbn:ch:unige-1045171

DOI : 10.13097/archive-ouverte/unige:104517

Available at:

http://archive-ouverte.unige.ch/unige:104517

Disclaimer: layout of this document may differ from the published version.

UNIVERSITE DE GENÈVE FACULTE DES SCIENCES

Département d'informatique Dr. Frédérique Lisacek

Developing algorithms to automate the identification of post translational modification in LC-MS/MS data

THESIS

presented to the Faculty of Science of the University of Geneva to obtain the degree of Doctor of Science, specialisation in bioinformatics

by

Oliver Horlacher

from Brugg (AG)

Thesis No. 5194

GENEVA 2018

Summary

Elucidating the role of post-translational modifications (PTMs) of proteins and their influence on cellular processes contributes to the understanding of cell biology.

The most commonly used experimental technique for studying PTMs is Liquid Chromatography followed by Tandem Mass Spectrometry (LC-MS/MS). When LC-MS/MS is applied to complex mixtures, such as cell lines or tissue samples, a large amount of raw data is produced and must be analysed to identify the molecular entities present. This thesis focuses on the development of software to automate the identification of PTMs in MS/MS data generated in proteomics and the identification of glycans in MS/MS data generated in glycomics. The work in this thesis led to three software packages: MzJava, MzMod and Glycoforest.

MzJava is a software library that can be used as a basis to guide and accelerate the development of software for processing and interpreting MS/MS spectrometry data. MzJava provides data structures and algorithms for representing and processing mass spectra and the associated biological molecules, such as metabolites, glycans or peptides. To ensure that MzJava contains code that is correct and easy to use, the library's Application Programming Interface was carefully designed and thoroughly tested. MzJava was used to develop all the software presented in this thesis and has been published as an open-source project.

MzMod is an open PTM search engine that uses a spectrum library and can process MS/MS datasets containing tens of millions of spectra. The method was designed not only for efficient processing of large data volumes, but also to improve the accuracy of the identification score by including the consistency of the peptide backbone ions and the quality of the PTM positioning. MzMod was validated by processing a set of 25 million spectra corresponding to 30 human tissues and then compared to MODa, a popular open PTM search engine.

This comparison showed that MzMod is easier to use than MODa when analysing large data volumes and that MzMod identifies four to five times more PTMs than MODa for modifications that are not fixed.

Glycoforest is a tool that accelerates the assignment of glycan structures to MS/MS spectra. The biggest challenge in developing software that automates the assignment of glycans to MS/MS spectra is the absence of a directly usable template for inferring the structures potentially present in a sample. Glycoforest addresses this challenge with a partially de novo algorithm that uses the open search engine and the associated spectrum similarity to generate candidate glycan structures. To select the correct structure from among the candidates, a score was defined that combines information from the open search result and the theoretical spectrum of each candidate. By analysing two manually annotated sets of MS/MS spectra, it is shown that Glycoforest generates 92% of the structures validated by a human expert. The score selects the expected structure in 70% of cases, and in 83% of cases it is ranked among the top three candidates.

Abstract

Understanding post-translational modifications (PTMs) of proteins and how PTMs influence cellular processes is an important part of understanding the biology of both healthy and diseased cells. The most widely used experimental technique for studying PTMs is Liquid Chromatography Tandem Mass Spectrometry (LC-MS/MS). When LC-MS/MS is applied to complex mixtures, such as cell lines or tissue samples, a large amount of raw data is produced, which needs to be analysed to identify molecular structures. This thesis focuses on the development of software to automate the identification of PTMs in proteomic MS/MS data and of glycans in glycomic MS/MS data. The outcome of this thesis is three software packages: MzJava, MzMod and Glycoforest.

MzJava is a library that can be used as building blocks to help guide and accelerate the development of software for processing and interpreting MS/MS spectrometry data. MzJava provides data structures and algorithms for representing and processing mass spectra and their associated biological molecules, such as metabolites, glycans and peptides. To ensure that MzJava contains code that is correct and easy to use, the library's application programming interface was carefully designed and thoroughly tested. MzJava was used to develop all software presented in this thesis and has been published as an open-source project.

MzMod is a spectrum-library-based open modification search (OMS) engine that is capable of processing MS/MS datasets that contain tens of millions of spectra. In addition to focusing on efficiently processing large datasets, improvements were also made to the accuracy of the OMS scoring function by including the consistency of the backbone ions and the quality of the PTM position. MzMod was validated using a dataset containing 25 million spectra from 30 human tissues and compared to MODa, which is a popular OMS engine. The validation showed that MzMod is easier to use than MODa when analysing large datasets and that MzMod identifies four to five times more PTMs than MODa for modifications that are not fixed.

Glycoforest is a tool that helps accelerate the process of assigning glycan structures to MS/MS spectra. The biggest challenge that needs to be addressed when developing software to automate the assignment of glycans to MS/MS spectra is that there is no direct template that can be used to infer the structures that are potentially present in a sample. Glycoforest addresses this challenge by using a partial de novo algorithm that makes use of OMS spectrum similarity to generate candidate glycan structures. To select the correct structure from among the candidates, a scoring function was developed that combines the information from the OMS similarity and the theoretical spectrum of the candidate. Using two manually annotated MS/MS data sets, we showed that Glycoforest can generate the human-validated candidate structure for 92% of the test cases. The scoring function was able to select the correct structure for 70% of the test cases, and the correct structure was among the top three best-scoring candidates for 83% of the test cases.

Chapter 1 Introduction

1.1. Background

Proteins are essential parts of organisms and are involved in almost all processes in a cell. It is estimated that the human genome has 20,000 to 22,000 protein-coding genes. However, the proteome is much more complex and diverse than would be expected given the number of genes in the genome. One of the major mechanisms that increase proteome diversity is post-translational modification (PTM)^1–3.

Post-translational modifications are chemical alterations of a protein that can occur at any step in the life cycle of a protein. These modifications can change the binding properties, 3D structure, enzymatic activity, life span, and subcellular localization of the proteins. There are more than 200 different PTMs recorded in the UniProt protein knowledgebase, ranging from small chemical modifications (e.g., acetylation, phosphorylation) to the addition of large biomolecules (e.g., glycosylation, ubiquitination)⁴. Phosphorylation is the addition of a phosphate group to serine, threonine or tyrosine, which increases the mass of the residue by about 80 Da (HPO3). Glycosylation, in contrast, is the addition of a complex carbohydrate polymer to asparagine (N-linked glycan) or to serine/threonine and possibly tyrosine (O-linked glycan). Glycans are complex tree structures and can range from 365 Da to 14,729 Da⁵.

Khoury et al. calculated statistics on the frequency of PTMs by analysing SwissProt⁶. They found that phosphorylation is the most common of the experimentally identified PTMs, while glycosylation is the most common putative PTM. Post-translational modifications play a fundamental role in almost all aspects of cell biology. For example, glycosylation is critical for molecular recognition, cell-cell and cell-matrix interactions, energy generation, and modification of protein conformations⁷. Glycans have been shown to be involved in diseases such as cancer, autoimmunity and arthritis^8–10.

Identifying modifications of proteins and understanding how PTMs influence cellular processes is important for understanding the biology of both healthy and diseased cells.

1.2. PTM detection

Currently, Liquid Chromatography Tandem Mass Spectrometry (LC-MS/MS) is the most widely used high-throughput technique for analysing PTMs^11,12. It is an analytical technique that combines the physical separation capability of liquid chromatography with the mass analysis capability of mass spectrometry. PTMs affect the molecular weight of the modified peptide, and this difference in mass can be detected by MS/MS.
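As a rough illustration of how a modification shows up as a mass difference, the following minimal sketch computes the m/z of a peptide ion with and without a phosphorylation. The class and method names are hypothetical and the peptide mass is an arbitrary example value; the proton mass and the phosphorylation shift (HPO3) are standard monoisotopic values.

```java
/** Hypothetical helper illustrating how a PTM shifts the observed m/z. */
class PtmMassShift {
    /** Monoisotopic mass of a proton in Da. */
    static final double PROTON_MASS = 1.007276;
    /** Monoisotopic mass added by phosphorylation (HPO3) in Da. */
    static final double PHOSPHO_SHIFT = 79.96633;

    /** m/z of a positively charged ion with the given neutral mass and charge. */
    static double mz(double neutralMass, int charge) {
        if (charge <= 0) {
            throw new IllegalArgumentException("charge must be positive");
        }
        return (neutralMass + charge * PROTON_MASS) / charge;
    }

    public static void main(String[] args) {
        double peptideMass = 1500.0; // arbitrary example mass in Da
        int charge = 2;
        double unmodified = mz(peptideMass, charge);
        double phosphorylated = mz(peptideMass + PHOSPHO_SHIFT, charge);
        System.out.printf("unmodified m/z:     %.4f%n", unmodified);
        System.out.printf("phosphorylated m/z: %.4f%n", phosphorylated);
        System.out.printf("m/z difference:     %.4f%n", phosphorylated - unmodified);
    }
}
```

Note that on the m/z axis the shift is divided by the charge, so a doubly charged phosphopeptide is displaced by roughly 40 rather than 80.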

1.2.1. Liquid Chromatography Tandem Mass Spectrometry

Liquid Chromatography (LC) separates the components of a liquid mixture by percolating the mixture through a column. The velocity at which individual molecules in the mixture move through the column is a function of the physical properties of the analytes, the content of the column and the composition of the mobile phase. The time at which a specific analyte elutes from the column is called the retention time. Upon elution, the analytes are sprayed directly into the mass spectrometer using electrospray, where they are first ionized and then measured in the first mass spectrometer. The resulting spectrum provides the mass-to-charge (m/z) ratio of the intact analyte ions (precursor ions) that are coming off the LC. Because the precursor m/z cannot be used to uniquely define the complete structure of the analytes, a second measurement is taken. The ions of a precursor are selected, and fragment ions are created and measured in the second mass spectrometer to produce the MS/MS spectrum for the analyte (Figure 1).

Figure 1. Schematic of tandem mass spectrometry. The figure was originally published by Murray.¹³

Common fragmentation methods that are used for proteomics and glycomics are collision-induced dissociation (CID), higher-energy collisional dissociation (HCD) and electron transfer dissociation (ETD)^7,14. Historically, CID has been the most commonly used; however, HCD and ETD are becoming more popular for both proteomics and glycomics, especially for studying PTMs^15,16. Figure 2 shows the backbone fragmentation locations and ion types for glycans and peptides. The ion types that are observed depend on the fragmentation method that is used.

Figure 2. Ion types produced by collision fragmentation of (A) glycans and (B) peptides. The figures were originally published by Han⁷ and Roepstorff¹⁷.

The fragments in Figure 2 are not the only fragments that are observed in an MS/MS spectrum. This is illustrated by Figure 3, which shows examples of CID MS/MS spectra for a glycan and a peptide. As can be seen, the spectra contain peaks that cannot be assigned to backbone fragments. Possible sources of these peaks include neutral losses, side-chain fragments and ions due to multiple backbone fragmentations.

Figure 3. Annotated CID MS/MS spectra obtained from (A) a glycan and (B) a peptide. The figures were originally published by Han⁷ and Hernandez¹⁸.

1.2.2. Targeted PTM vs. open PTM proteomics

LC-MS/MS has become the standard tool for identifying PTMs of proteins. Figure 4 illustrates a general workflow for a proteomics experiment. It starts with the isolation of proteins from biological samples such as whole cells or tissue lysate, followed by enzymatic digestion of the proteins into peptides. The most commonly used enzyme is trypsin, which cleaves the protein after lysine or arginine residues, producing peptides averaging 15 amino acids in length¹⁸. These peptides are then measured by LC-MS/MS, and the resulting spectra are assigned to a peptide and then to a protein using bioinformatic tools.
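The cleavage rule described above can be sketched as a simple in-silico digestion. This is a minimal illustration with a hypothetical class name, not the implementation used in this thesis: it cleaves after every lysine (K) or arginine (R) and deliberately ignores refinements that real digestion tools usually model, such as missed cleavages or suppressed cleavage before proline.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical in-silico digestion that cleaves after every K or R. */
class TrypsinDigest {
    /** Splits a protein sequence (one-letter codes) into tryptic peptides. */
    static List<String> digest(String protein) {
        List<String> peptides = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < protein.length(); i++) {
            char residue = protein.charAt(i);
            if (residue == 'K' || residue == 'R') {
                // The cleavage site is after the residue, so it stays in the peptide.
                peptides.add(protein.substring(start, i + 1));
                start = i + 1;
            }
        }
        if (start < protein.length()) {
            peptides.add(protein.substring(start)); // C-terminal peptide
        }
        return peptides;
    }

    public static void main(String[] args) {
        // Arbitrary example sequence.
        System.out.println(digest("AAKBBRCC"));
    }
}
```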

A typical MS/MS experiment results in spectra that can come from an unmodified peptide, a peptide modified due to sample preparation or a peptide carrying one or more modifications¹⁹. Spectra of unmodified peptides are generally analysed using a database search, where the experimental MS/MS spectra are matched against a user-selected protein database. Modified peptides can be analysed using a database search by specifying the modifications prior to the search. This is called a targeted PTM search.

For an LC-MS/MS experiment where a targeted search was used, only about 25% of the spectra are expected to be identified²⁰. A large proportion of the unidentified spectra are likely to be peptides carrying modifications that were not considered in the targeted PTM search²¹. Including many modifications (>5) in a targeted search is not practical, as it dramatically increases the search time and reduces sensitivity¹⁹. Instead, a technique called open modification search (OMS)¹⁹ is used. An OMS does not require prior specification of the modifications that are in the sample. Instead, the modification is read out directly from the MS/MS spectrum, allowing both known and novel PTMs to be identified.
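One way to picture how a modification can be "read out" is through the mass difference between the measured precursor and an unmodified candidate peptide. The sketch below is only an illustration of that idea under simplifying assumptions (a single modification, neutral masses as input, a small hand-picked table of standard monoisotopic shifts); the class and method names are hypothetical and it is not the scoring used by any specific OMS engine.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical lookup of a precursor mass difference against known PTM shifts. */
class DeltaMassLookup {
    /** A few well-known monoisotopic mass shifts in Da (illustrative subset). */
    static final Map<String, Double> KNOWN_SHIFTS = new LinkedHashMap<>();
    static {
        KNOWN_SHIFTS.put("Methylation", 14.01565);
        KNOWN_SHIFTS.put("Oxidation", 15.99491);
        KNOWN_SHIFTS.put("Acetylation", 42.01057);
        KNOWN_SHIFTS.put("Phosphorylation", 79.96633);
    }

    /** Names the modification implied by the mass difference, if it is known. */
    static String annotate(double queryMass, double candidateMass, double tolerance) {
        double delta = queryMass - candidateMass;
        if (Math.abs(delta) <= tolerance) {
            return "Unmodified";
        }
        for (Map.Entry<String, Double> shift : KNOWN_SHIFTS.entrySet()) {
            if (Math.abs(delta - shift.getValue()) <= tolerance) {
                return shift.getKey();
            }
        }
        // An open search still reports the shift, so novel PTMs are not lost.
        return String.format(Locale.ROOT, "Unknown modification (%+.4f Da)", delta);
    }

    public static void main(String[] args) {
        System.out.println(annotate(1579.9663, 1500.0, 0.01));
        System.out.println(annotate(1523.5, 1500.0, 0.01));
    }
}
```

In contrast to a targeted search, nothing about the modification has to be specified up front; the table is only used afterwards to give a known name to the observed shift.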

Figure 4. Proteomics strategies for PTM analysis. (A) Targeted PTM analysis: The sample is enriched for the PTM of interest before being analysed using a database search. (B) Open PTM search: The sample is analysed using an open modification search. This allows both known and novel PTMs to be identified.

1.2.3. Glycomics

Post-translational modifications that have a more complex structure, such as glycans, cannot be identified using a targeted or open search. This is because the mass of these modifications is not enough to uniquely identify the chemical structure. For example, there are 5 different glycan structures that have a mass of 733²². Due to this ambiguity, different experimental and bioinformatics techniques are required to analyse glycans.

Figure 5 shows the general workflow of a glycomic experiment. First, proteins are isolated from biological samples, such as cell lines, tissue or body fluid. Then the glycans are either enzymatically (N-glycans) or chemically (O-glycans) released from these proteins. The released glycans are purified and then measured by LC-MS/MS. The resulting MS/MS spectra are predominantly analysed by hand with the support of bioinformatics tools^23,24.

Figure 5. General workflow for analysing glycans by mass spectrometry.

1.3. Software Engineering and Bioinformatics Challenges

Characterizing and identifying PTMs in LC-MS/MS data is both a software engineering and bioinformatics challenge. The challenges that were addressed in this thesis are:

• The size of the LC-MS/MS data

• Writing good quality code

• Developing better algorithms to assign molecules to spectra

1.3.1. Big Data

MS datasets range in size from a few thousand to tens of millions of spectra, and dataset sizes are expected to increase as mass spectrometers are improved so that more of the molecules in the sample are measured. The software engineering challenge this poses is writing software that can gracefully scale from thousands to tens of millions of spectra. The challenge arises because different storage and computation resources are required to best facilitate the analysis of datasets of different sizes. For small datasets, single-threaded in-memory processing is sufficient. As the dataset gets bigger, it becomes necessary to process the data on multiple threads and to manage how data storage is split between memory and disk. For processing very large datasets, it is necessary to distribute the computing and data storage over large computer clusters or cloud infrastructure. This scaling challenge can be tackled by using big data frameworks such as Apache Hadoop²⁵ and Apache Spark²⁶. Hadoop has already been used in bioinformatics, primarily for next-generation sequence analysis²⁷ but also for proteomics^28–30, while Spark was more recently introduced^31,32.

Apache Hadoop. Apache Hadoop²⁵ is a top-level Apache project that aims to develop software for reliable, scalable, distributed computing of large datasets across clusters of computers. Hadoop is designed to scale up from single servers to thousands of machines. Failures in hardware and the network are handled in the software library. This makes it possible to build clusters using cheap commodity hardware. Hadoop includes four modules: Hadoop Common, Hadoop Distributed File System (HDFS), Hadoop YARN and Hadoop MapReduce.

Hadoop Common is the core of the Hadoop framework. It contains utilities that support the other Hadoop modules, such as abstractions of the underlying operating system and its file system. It also provides support for reading and writing Hadoop specific file formats such as Sequence Files. Sequence Files are flat files consisting of binary key/value pairs that are used by Hadoop MapReduce as input/output and to store temporary data.

The second module is the Hadoop Distributed File System (HDFS), which is based on the Google File System paper³³. HDFS provides a distributed storage layer that is schemaless and durable, automatically handles failure and automatically rebalances to even out disk space consumption on a cluster. HDFS works by splitting each file into 64 MB chunks (the default block size in early Hadoop versions; 128 MB since Hadoop 2) and storing each chunk on 3 different nodes (the default replication factor). If a component fails, the system automatically notices the defect and re-replicates the chunks that were stored on the failed node using data from the other two healthy replicas.
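The storage arithmetic implied by this scheme can be made concrete with a small calculation. The sketch below uses a hypothetical helper class, not an HDFS API; the 64 MB chunk size and the replication factor of 3 are taken from the description above and are passed in as parameters because both are configurable.

```java
/** Hypothetical helper for back-of-the-envelope HDFS storage estimates. */
class HdfsBlockMath {
    /** Number of chunks needed for a file; the last chunk may be partially filled. */
    static long blockCount(long fileBytes, long blockBytes) {
        if (fileBytes < 0 || blockBytes <= 0) {
            throw new IllegalArgumentException("sizes must be non-negative and block size positive");
        }
        return (fileBytes + blockBytes - 1) / blockBytes; // ceiling division
    }

    /** Raw disk space consumed on the cluster once every chunk is replicated. */
    static long rawStorageBytes(long fileBytes, int replication) {
        return fileBytes * replication;
    }

    public static void main(String[] args) {
        long fileBytes = 1L << 30;   // a 1 GiB file (example value)
        long blockBytes = 64L << 20; // 64 MiB chunks
        int replication = 3;
        long blocks = blockCount(fileBytes, blockBytes);
        System.out.println("chunks:         " + blocks);
        System.out.println("chunk replicas: " + blocks * replication);
        System.out.println("raw bytes:      " + rawStorageBytes(fileBytes, replication));
    }
}
```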

The third module is Hadoop YARN, which is a framework for job scheduling and cluster management. It provides services for resource management, workflow management, fault tolerance and job scheduling/monitoring. The final module is Hadoop MapReduce, which is a framework for parallel processing of large datasets. Hadoop MapReduce is based on the Google MapReduce paper³⁴. The three main problems that MapReduce addresses are:

• Parallelization

- How to parallelize the computation

• Distribution

- How to distribute the data

• Fault-tolerance

- How to handle component failure

MapReduce hides away most of the complexities of dealing with large scale distributed systems. MapReduce provides a minimal API that consists of two functions, map and reduce.

A key insight of the Google MapReduce paper was to send code to data, not data to code. This is an important differentiator when compared to traditional data warehouse systems and relational databases. Together with HDFS, MapReduce can handle terabytes and even petabytes of data and it can be used to process data that is too big to be moved²⁷.

To implement a MapReduce program, a map function is specified that takes a key-value pair as input and outputs a set of intermediate key-value pairs, and a reduce function is specified that merges all intermediate values that have the same key. The parallelization of programs written in this functional style is performed automatically by Hadoop and works as follows: the input data is split into independent chunks, each of which is processed in parallel by a map task. The outputs of the map tasks are then sorted by the framework and used as input for the reduce tasks.

By persisting data back to disk after each map and reduce function call, Hadoop can process data sets that are larger than the available memory on the cluster.
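To make the map, sort and reduce steps concrete, the following sketch emulates them for a word count in a single Java process. It does not use the Hadoop API; the class and method names are hypothetical, the input key is omitted for brevity, and a sorted in-memory map stands in for the framework's sorting and grouping, with no distribution, fault tolerance or persistence to disk.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Single-process emulation of the map -> sort/group -> reduce flow (not Hadoop). */
class MiniMapReduce {
    /** Map step: turns one input line into intermediate (word, 1) pairs. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    /** Reduce step: merges all intermediate values that share the same key. */
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    /** Stand-in for the framework: map every split, group by sorted key, reduce. */
    static Map<String, Integer> run(List<String> splits) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String split : splits) {
            for (Map.Entry<String, Integer> pair : map(split)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("a b a", "b c")));
    }
}
```

Because each call to `map` depends only on its own split and each call to `reduce` only on one key, a framework is free to run these calls on different machines.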

While MapReduce is capable of processing huge data sets, the availability of only one transformation (map) and one action (reduce) makes it difficult to use MapReduce for writing machine learning and graph processing algorithms.

Apache Spark. Apache Spark²⁶ is a top-level Apache project that provides a fast and general engine for large-scale data processing. As such Spark is a direct replacement and successor to Hadoop MapReduce. The main problems that Spark addresses are:

• Ease of use

- Enables iterative and streaming workloads

- Supports many more functions for transformations and actions

- Provides distributed shared variables

• Execution speed

- Removes the performance penalty that is incurred from persisting the data to disk by storing data in memory

- Uses “lazy evaluation”, which allows automatic optimization of data processing

While Spark directly competes with MapReduce, it is designed to integrate with the rest of the Hadoop infrastructure (Figure 6).

Figure 6. The Hadoop architecture.

Spark is much easier to use and more expressive than MapReduce. Spark generalizes the two-stage map/reduce paradigm to support an arbitrary directed acyclic graph (DAG) of tasks. The core concept is an abstraction called the Resilient Distributed Dataset (RDD). An RDD is an immutable collection of objects that is split into partitions, which are distributed across a cluster. Computations are performed in parallel by applying transformations or actions to the RDD. RDDs have map and reduce functions like MapReduce, but also add many other functions such as filter, flatMap, groupByKey and aggregateByKey.

To combine data from two or more RDDs, Spark provides functions such as union, intersection, join and cogroup. To share state across a cluster, Spark provides broadcast and accumulator variables. Broadcast variables are used to keep a read-only variable cached on each machine rather than shipping a copy of it with each task. Accumulators are variables that tasks can only add to and that can be used as distributed counters and sums. A good comparison of the relative ease of use of MapReduce and Spark is provided by the word count examples that come with both processing engines. The MapReduce word count example (https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html#Example%3A+WordCount+v2.0) is over 100 lines of code, and the programming style is very different to the style that would be used to write a non-distributed word count. The Spark equivalent is:

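import java.util.Arrays;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import scala.Tuple2;

// sc is an existing JavaSparkContext
JavaRDD<String> textFile = sc.textFile("hdfs://...");
JavaPairRDD<String, Integer> counts = textFile
    .flatMap(s -> Arrays.asList(s.split(" ")).iterator()) // split each line into words
    .mapToPair(word -> new Tuple2<>(word, 1))             // emit a (word, 1) pair per word
    .reduceByKey((a, b) -> a + b);                        // sum the counts for each word
counts.saveAsTextFile("hdfs://...");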

The Spark code is far more compact, and the programming style is very similar to what would be used to write a non-distributed word count.

Another advantage of Spark is the execution speed. Spark can be up to 100 times faster than MapReduce. This is mainly because RDDs can be persisted in memory and only need to spill to disk if the data is too big. This greatly reduces the serialization overhead incurred by Hadoop and consequently speeds up calculations. If the data is too large to fit in memory and the RDD is stored on disk, Spark is still up to 10 times faster than MapReduce. This is because Spark uses "lazy evaluation" of the task DAG, allowing automatic optimization of the execution plan to speed up computation.
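Lazy evaluation can be observed on a single machine with Java's own `java.util.stream` pipelines, which are also evaluated lazily. The sketch below is an analogy only: it is not Spark code and says nothing about how Spark plans or optimizes a task DAG. It merely shows that declaring a pipeline does no work until a terminal operation asks for a result, which is what gives a framework the freedom to skip unnecessary computation. The class name is hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

/** Single-machine analogy for lazy evaluation using Java streams (not Spark). */
class LazyEvaluationDemo {
    /** Returns {map calls before the terminal operation, map calls after it, result}. */
    static int[] run() {
        AtomicInteger mapCalls = new AtomicInteger();
        List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);

        // Declaring the pipeline (comparable to Spark transformations) runs nothing.
        Stream<Integer> pipeline = data.stream()
                .map(x -> { mapCalls.incrementAndGet(); return x * x; })
                .filter(x -> x > 4);
        int callsBeforeTerminalOperation = mapCalls.get();

        // The terminal operation (comparable to a Spark action) triggers the work;
        // findFirst can stop as soon as one element passes the filter.
        int first = pipeline.findFirst().orElseThrow(IllegalStateException::new);

        return new int[] { callsBeforeTerminalOperation, mapCalls.get(), first };
    }

    public static void main(String[] args) {
        int[] result = run();
        System.out.println("map calls before terminal operation: " + result[0]);
        System.out.println("map calls after terminal operation:  " + result[1]);
        System.out.println("first square greater than 4:         " + result[2]);
    }
}
```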

Spark code can be written almost the same way as code that would be written for a non-distributed program. This expressiveness, together with Spark’s performance and ability to spill data to disk for large datasets, makes Spark an excellent fit for processing small, medium and huge MS/MS datasets.

1.3.2. Code Quality

Like any other software project, a challenge for bioinformatics research projects is maintaining good code quality. In bioinformatics code quality is important because it helps ensure that research which uses the code is valid and that software development time is optimized. The characteristics of good quality code which are important for bioinformatics are:

• Correctness

• Maintainability

• Ease of use

• Efficiency

Correctness refers to the code's ability to produce the correct result. Correctness is correlated with the number of bugs, both known and unknown, that the software has. Obviously, code that has bugs and produces the wrong result is not suitable for producing scientific results that are valid. Code containing many bugs is also a major time sink due to the amount of time that is required to find, understand and fix the problem without introducing new bugs.

Maintainability refers to how easy it is for a programmer to work with the software. It addresses questions such as: How easy is it to debug the code? How fast is it to produce a fix? How quickly can a new developer understand the code? And how quickly can new features be added to the code? Because of the rapid change in requirements and features that are inherent in software that is developed for research, maintainability is especially important.

Ease of use refers to how easy it is to learn the APIs that the code exposes and how easy it is to use the APIs once they are learnt.

Efficiency directly relates to the performance and speed of running the software. Efficiency is important because code that takes a long time to run slows down the development process, not just because of the intrinsic time it takes, but also due to the cost of interrupting developer flow.

The literature on best practice for developing bioinformatics and scientific software^35–39 covers the basics of how to write good code, such as using a Version Control System (VCS) for keeping track of the code and how to organize and track data and results. It also covers the basics of coding best practices, such as making incremental changes, not writing duplicated code and writing code that works before optimizing it. However, there are further best practices and development methodologies used in software engineering that can also help with writing and maintaining good quality bioinformatics code. The best practices that are particularly useful when writing bioinformatics code are: testing, refactoring, measuring code quality and continuous integration.

Testing. In software engineering, testing, and especially automated testing, is an active area of research. There are generally four levels of tests: unit tests, integration tests, user interface tests and system tests. For bioinformatics software that is developed while conducting research, the testing levels of interest are unit tests and integration tests. Unit tests verify the correctness of a small unit of code, usually at the method level, and integration tests are used to check larger parts of the code, such as parsing an input file. To help write and automatically run the tests, frameworks and software libraries such as JUnit (http://junit.org/) and Mockito (http://mockito.org/) can be used. Obviously, having tests is useful to ensure that the code is correct. However, testing also helps with maintainability and usability of the code.

Software is far easier to maintain if there are tests that can give immediate feedback if changes to the existing codebase cause errors in other parts of the codebase. Additionally, writing code that is testable encourages developers to write small isolated units of code, which is a very good practice and makes the resulting code easier to maintain and use.
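As an illustration of what a method-level unit test looks like, the sketch below tests a small, isolated function that scales peak intensities relative to the most intense peak. Both the function and its tests are hypothetical examples rather than MzJava code, and the tests are written as plain static methods so that the sketch runs without any dependency; with JUnit, each test method would instead carry a `@Test` annotation and use the framework's assertions.

```java
import java.util.Arrays;

/** Hypothetical unit under test together with framework-free unit tests. */
class BasePeakNormalizer {
    /** Returns a copy of the intensities scaled so that the largest equals 1.0. */
    static double[] normalize(double[] intensities) {
        double max = 0.0;
        for (double intensity : intensities) {
            max = Math.max(max, intensity);
        }
        double[] scaled = new double[intensities.length];
        if (max == 0.0) {
            return scaled; // empty or all-zero input stays all zero
        }
        for (int i = 0; i < intensities.length; i++) {
            scaled[i] = intensities[i] / max;
        }
        return scaled;
    }

    static void check(boolean condition, String message) {
        if (!condition) {
            throw new AssertionError(message);
        }
    }

    static void testLargestPeakBecomesOne() {
        double[] result = normalize(new double[] {50.0, 200.0, 100.0});
        check(Arrays.equals(result, new double[] {0.25, 1.0, 0.5}), "scaled to base peak");
    }

    static void testAllZeroSpectrumStaysZero() {
        double[] result = normalize(new double[] {0.0, 0.0});
        check(Arrays.equals(result, new double[] {0.0, 0.0}), "all-zero input");
    }

    static void testInputIsNotModified() {
        double[] input = {2.0, 4.0};
        normalize(input);
        check(Arrays.equals(input, new double[] {2.0, 4.0}), "input left untouched");
    }

    public static void main(String[] args) {
        testLargestPeakBecomesOne();
        testAllZeroSpectrumStaysZero();
        testInputIsNotModified();
        System.out.println("all tests passed");
    }
}
```

Each test covers one behaviour, so a failure points directly at what broke when the function is later changed.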

Refactoring. Refactoring is the process of incrementally restructuring existing code without changing or adding to the code's behavior⁴⁰. Refactoring techniques take code that is correct and change its structure to make it easier to maintain and use. Refactoring relies heavily on unit tests to ensure that the code still works correctly after each refactoring step.

Measuring code quality. A commonly used tool for measuring code quality is SonarQube (http://www.sonarqube.org/). SonarQube combines static code analysis with the results of the unit tests to provide an overview of the overall code quality as well as the change to the quality that occurs whenever a change is made to the code.

Continuous Integration. Continuous integration (CI)⁴¹ is a software engineering practice in which changes to the code are automatically tested, and code quality is re-measured whenever the changes are committed to the VCS. The goal of CI is to provide rapid feedback so that changes that introduce issues or break the build can be corrected rapidly.

Test Driven Development. An agile development methodology that fits very well with bioinformatics software development is Test Driven Development (TDD)³⁹. In brief, TDD describes a short development cycle for adding improvements or new functionality. First, the developer writes an initially failing automated test that defines the bug, improvement or new functionality. Code is then written to make the test pass. This is followed by refactoring the new code to bring it up to acceptable quality. Figure 7 summarizes the TDD cycle.

Figure 7. The test-driven development process that was used while developing MzJava. This figure is an adaptation of an image from the blog https://softwareasscience.blogspot.ch/2014/02/tdd-test-driven-development-in-practice.html

1.3.3. Assigning molecules to MS/MS spectra

The bioinformatics challenge when developing algorithms to identify post-translational modifications in LC-MS/MS data is to assign the correct molecule to a spectrum. For open modification PTM proteomics, this means assigning modified peptides to spectra, and in glycomics it means assigning glycan structures to spectra.

The first algorithm to assign molecules to spectra was developed in 1966⁴². Since then, more sophisticated algorithms have become available^43–52, but further improvements can be made by refining old algorithms or developing new algorithms for assigning molecules to spectra.

Conceptually assigning a molecule to a spectrum can be split into two steps. The first is to generate a candidate set of potential molecules. And the second is to score the candidates so that the best molecule can be selected from the candidate set.

Generating Potential Candidates. The challenge when generating the set of candidates is to keep the candidate set as small as possible without discarding the correct structure. Keeping the candidate set small makes it easier for the scoring function to find the correct candidate.

Two methods have been used to generate candidates: deriving the candidates from the information contained in the query spectrum, and selecting the candidates from a set of molecules that are known to occur in the sample.

Using the information in the query spectrum to generate the molecule is called a de-novo search. De-novo searches have been used in glycomics⁵³ to assign structures and in proteomics⁵⁴ to find unmodified peptides. The advantage of de-novo searches is that it is possible to generate any peptide or glycan. The disadvantage is that very high-quality spectra are required, making the de-novo approach unsuitable for most spectra obtained in shotgun proteomics or glycomics experiments.

Selecting the candidates from a set of molecules that are known to occur in the sample is used by two related methods, spectrum library searches and database searches. For both methods, the candidate set is obtained by selecting all molecules that have a precursor ion that is within tolerance of the precursor of the query spectrum. The biggest difference between spectra libraries and databases is that spectra libraries store a validated high-quality experimental MS/MS spectrum for each molecule, whereas databases store a theoretical MS/MS spectrum, generated from in-silico fragmentation, for each molecule.

Spectra libraries contain experimental spectra and are created by collecting and validating high quality spectra from previous experiments. The advantage of this is that all molecules in the spectrum library are known to be detectable on a mass spectrometer, allowing the smallest possible candidate set to be selected. The disadvantage is that only molecules that were previously identified using MS/MS can be automatically detected. Spectra libraries were first used in the 1980s for small molecule identification⁵⁵. In 1998 Yates et al.⁵⁶ showed that spectra libraries can be used to identify peptides in shotgun proteomics. Spectral library searches were first used for glycomics in 2005 by Kameyama et al.⁵⁷.

The advantage of using database searches is that the database does not need to be created from MS/MS data. Consequently, database searches can be used to identify molecules that have not previously been detected using MS/MS. Database searches work best if it is possible to make a database that contains all the molecules that are likely to occur in a given sample. This is possible for proteomics, where the genome or transcriptome can be used as a template to generate all possible proteins. The proteins are then digested in silico and the resulting peptides are stored in the database along with the theoretical spectrum that is generated by in silico fragmentation.
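The in silico digestion step can be illustrated with a short sketch. The text does not say which protease is simulated, so the rule used here (cleave after K or R unless the next residue is P, a trypsin-like rule) is an assumption, as are the class and method names. Missed cleavages are deliberately not generated.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative in silico digestion; the class name and cleavage rule are assumptions. */
final class InSilicoDigest {

    /**
     * Splits a protein sequence after every K or R unless the next residue is P
     * (an assumed trypsin-like rule). Missed cleavages are not generated.
     */
    static List<String> digest(String protein) {
        List<String> peptides = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < protein.length(); i++) {
            char residue = protein.charAt(i);
            boolean last = i == protein.length() - 1;
            boolean cleave = (residue == 'K' || residue == 'R')
                    && (last || protein.charAt(i + 1) != 'P');
            if (cleave || last) {
                // Close the current peptide at position i (inclusive).
                peptides.add(protein.substring(start, i + 1));
                start = i + 1;
            }
        }
        return peptides;
    }

    public static void main(String[] args) {
        // The R at position 6 is followed by P, so it is not a cleavage site.
        System.out.println(digest("MKTAYRPLEKGG"));
    }
}
```

In a real database each resulting peptide would be stored together with its precursor mass and theoretical spectrum.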

Database searches have been successfully implemented in proteomics by tools such as Sequest⁵⁸, Mascot⁵⁹ and X!Tandem⁶⁰ and are widely used.

Creating a database for glycomics is more challenging because there is no direct template from which all the glycans can be derived. An alternative, but not exhaustive, source of glycan structures is a database of previously identified glycans, such as GlycomeDB⁶¹. Database searches are available for glycomics (e.g. GlycosidIQ⁶²). However, because the database does not contain all glycans, database searches have not been widely used in glycomics.

Both spectra library and database searches can be used to generate candidates that take PTMs into account. For open modification searches, the range of mass differences that is allowed is increased from the spectrometer's tolerance to the maximum allowed modification mass. For example, if the query precursor m/z is 800 and the maximum allowed modification mass is 200 Da, any peptide that has a precursor m/z between 600 and 800 will be selected (for a singly charged precursor). For targeted modifications, the peptide sequence is also checked to make sure it contains at least one amino acid that can have the modification, and the mass of the modification is taken into account when the query and peptide precursor are compared. For example, when looking for phosphorylation only peptides that contain an S, T or Y are considered and the query precursor must be within tolerance of the combined peptide and modification mass.
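The two precursor filters described above can be sketched as follows. This is a minimal illustration and not the implementation of any of the tools discussed: the class, method and parameter names are hypothetical, and, like the worked example in the text, the sketch compares precursor values directly, which assumes singly charged precursors.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical candidate selection for open and targeted modification searches. */
final class CandidateSelection {

    /** Hypothetical library or database entry. */
    static final class Candidate {
        final String sequence;
        final double precursorMz;

        Candidate(String sequence, double precursorMz) {
            this.sequence = sequence;
            this.precursorMz = precursorMz;
        }
    }

    /** Open search: accept any mass difference between 0 and maxModMass (plus tolerance). */
    static List<Candidate> openModification(List<Candidate> entries, double queryMz,
                                            double maxModMass, double tolerance) {
        List<Candidate> selected = new ArrayList<>();
        for (Candidate c : entries) {
            double delta = queryMz - c.precursorMz; // mass a modification would have to add
            if (delta >= -tolerance && delta <= maxModMass + tolerance) {
                selected.add(c);
            }
        }
        return selected;
    }

    /** Targeted search: the peptide needs a modifiable residue and the combined mass must match. */
    static List<Candidate> targeted(List<Candidate> entries, double queryMz, double modMass,
                                    String targetResidues, double tolerance) {
        List<Candidate> selected = new ArrayList<>();
        for (Candidate c : entries) {
            boolean hasSite = false;
            for (int i = 0; i < c.sequence.length() && !hasSite; i++) {
                hasSite = targetResidues.indexOf(c.sequence.charAt(i)) >= 0;
            }
            if (hasSite && Math.abs(queryMz - (c.precursorMz + modMass)) <= tolerance) {
                selected.add(c);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<Candidate> entries = new ArrayList<>();
        entries.add(new Candidate("PEPTIDE", 650.0));
        entries.add(new Candidate("SAMPLER", 590.0));
        // Query at 800 with a 200 Da window keeps only the entry at 650.
        System.out.println(openModification(entries, 800.0, 200.0, 0.5).size());
    }
}
```

For phosphorylation, `targeted` would be called with `targetResidues` set to `"STY"` and `modMass` set to the monoisotopic mass of HPO3 (about 79.966 Da).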

Scoring. Once the candidate molecules have been selected, a scoring algorithm is used to calculate the quality of the match between the query spectrum and the library or database spectrum. If the candidate with the highest score exceeds the threshold for a reliable match it is assigned to the query spectrum.

The scoring algorithm that is used depends on the search type. Spectrum library searches score the similarity between two experimental spectra, database searches score the match between an experimental and a theoretical spectrum, and open modification searches need to account for the mass shift due to the modification when calculating the score. In spectrum library searches, the similarity score is typically calculated using the normalized dot product (ndp). However, several other scoring functions, such as shared peak count, weighted normalized dot product or Pearson's correlation coefficient, have also been investigated. When calculating the similarity for open modification searches using spectra libraries, the peaks first need to be aligned to account for the modification mass, as illustrated by Figure 8, followed by calculating the ndp⁴⁵,⁶³. The open modification search tools that implement the spectra library approach include QuickMod⁵⁰, Bonanza⁴⁹, pMatch⁵¹ and Tier-Wise⁵².
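As a rough illustration of the alignment followed by the ndp, the sketch below matches each library peak to a query peak either at the same m/z or at the m/z shifted by the modification mass, and then divides the dot product of the matched intensities by the product of the two intensity vector norms. The greedy one-to-one matching, the fixed tolerance and all names are simplifying assumptions and do not reproduce the cited tools. With a shift of 0 the function reduces to a plain ndp over matched peaks.

```java
/** Simplified normalized dot product with an optional modification mass shift. */
final class SpectrumSimilarity {

    static double ndp(double[] libMz, double[] libInt, double[] queryMz, double[] queryInt,
                      double shift, double tolerance) {
        boolean[] used = new boolean[queryMz.length];
        double dot = 0.0;
        for (int i = 0; i < libMz.length; i++) {
            int best = -1;
            for (int j = 0; j < queryMz.length; j++) {
                if (used[j]) continue;
                // A fragment without the modification matches directly,
                // a fragment carrying it matches at the shifted position.
                boolean direct = Math.abs(queryMz[j] - libMz[i]) <= tolerance;
                boolean shifted = Math.abs(queryMz[j] - (libMz[i] + shift)) <= tolerance;
                if ((direct || shifted) && (best < 0 || queryInt[j] > queryInt[best])) {
                    best = j;
                }
            }
            if (best >= 0) {
                used[best] = true; // each query peak is matched at most once
                dot += libInt[i] * queryInt[best];
            }
        }
        double norm = Math.sqrt(sumOfSquares(libInt) * sumOfSquares(queryInt));
        return norm == 0.0 ? 0.0 : dot / norm;
    }

    private static double sumOfSquares(double[] values) {
        double sum = 0.0;
        for (double v : values) sum += v * v;
        return sum;
    }

    public static void main(String[] args) {
        // Singly charged b1, y1, b2, y2 of TMY and of TMY with an oxidised M (+15.9949 Da).
        double[] tmy = {102.055, 182.081, 233.095, 313.122};
        double[] tmoy = {102.055, 182.081, 249.090, 329.117};
        double[] ones = {1.0, 1.0, 1.0, 1.0};
        System.out.println(ndp(tmy, ones, tmoy, ones, 0.0, 0.01));
        System.out.println(ndp(tmy, ones, tmoy, ones, 15.9949, 0.01));
    }
}
```

In the TMY example only b1 and y1 match without the shift, so half of the signal is lost. Allowing the shift recovers b2 and y2, which both contain the oxidised methionine.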

Figure 8. Aligning the unmodified spectrum of TMY with the spectrum of TMY with an oxidised M (TM{O}Y).

Database searches differ from spectra library searches in that the score is calculated from an experimental spectrum and a theoretical spectrum that was generated from in silico fragmentation. Due to the complex fragmentation chemistry, it is currently not possible to generate realistic theoretical spectra. The theoretical spectra that are generated only contain backbone ion peaks and all the peaks for the same ion type have the same intensity.
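A theoretical spectrum of this kind can be sketched in a few lines. The example below computes singly charged b and y ions from standard monoisotopic residue masses. The restriction to b and y ions at charge 1+, and the class and method names, are assumptions made for illustration. Every peak would be given the same intensity, so only the m/z values are returned.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative generator of backbone (b and y) ion m/z values for a peptide. */
final class TheoreticalSpectrum {
    static final double WATER = 18.010565;  // monoisotopic mass of H2O
    static final double PROTON = 1.007276;  // mass of a proton

    private static final Map<Character, Double> RESIDUE_MASS = new HashMap<>();
    static {
        // Monoisotopic residue masses of the 20 standard amino acids.
        RESIDUE_MASS.put('G', 57.02146);  RESIDUE_MASS.put('A', 71.03711);
        RESIDUE_MASS.put('S', 87.03203);  RESIDUE_MASS.put('P', 97.05276);
        RESIDUE_MASS.put('V', 99.06841);  RESIDUE_MASS.put('T', 101.04768);
        RESIDUE_MASS.put('C', 103.00919); RESIDUE_MASS.put('L', 113.08406);
        RESIDUE_MASS.put('I', 113.08406); RESIDUE_MASS.put('N', 114.04293);
        RESIDUE_MASS.put('D', 115.02694); RESIDUE_MASS.put('Q', 128.05858);
        RESIDUE_MASS.put('K', 128.09496); RESIDUE_MASS.put('E', 129.04259);
        RESIDUE_MASS.put('M', 131.04049); RESIDUE_MASS.put('H', 137.05891);
        RESIDUE_MASS.put('F', 147.06841); RESIDUE_MASS.put('R', 156.10111);
        RESIDUE_MASS.put('Y', 163.06333); RESIDUE_MASS.put('W', 186.07931);
    }

    private static double residueMass(char residue) {
        Double mass = RESIDUE_MASS.get(residue);
        if (mass == null) throw new IllegalArgumentException("Unknown residue: " + residue);
        return mass;
    }

    /** Singly charged b ions b1..b(n-1): N-terminal prefix masses plus a proton. */
    static double[] bIons(String peptide) {
        double[] ions = new double[Math.max(0, peptide.length() - 1)];
        double sum = PROTON;
        for (int i = 0; i < ions.length; i++) {
            sum += residueMass(peptide.charAt(i));
            ions[i] = sum;
        }
        return ions;
    }

    /** Singly charged y ions y1..y(n-1): C-terminal suffix masses plus water and a proton. */
    static double[] yIons(String peptide) {
        double[] ions = new double[Math.max(0, peptide.length() - 1)];
        double sum = WATER + PROTON;
        for (int i = 0; i < ions.length; i++) {
            sum += residueMass(peptide.charAt(peptide.length() - 1 - i));
            ions[i] = sum;
        }
        return ions;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(bIons("TMY")));
        System.out.println(java.util.Arrays.toString(yIons("TMY")));
    }
}
```

A targeted PTM search would add the modification mass to the modified residue before summing, which shifts every b or y ion that contains that residue.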

The scoring functions that are used by database searches take into account this difference between experimental and theoretical spectra. Commonly used database search algorithms are Sequest⁵⁸, Mascot⁵⁹ and X!Tandem⁶⁰.

The proteomic database algorithms can also be used for targeted PTM searches. This is done by adding the modification to the peptide and generating a new theoretical spectrum. The scoring is then the same as described above. Like the open modification search using spectra libraries, the database searches also first align the peaks before scoring the match between the experimental and theoretical spectra. Tools that can be used for finding PTMs using database searches include MS-alignment⁴⁵, MODi⁴⁶, ModifiComb⁴⁸, MODa⁴⁷ and MSFragger⁶⁴.

1.4. Objectives and Thesis Overview

The focus of this thesis is the development of algorithms to automate the identification of post-translational modifications in proteomic MS/MS data, and the identification of glycans in glycomic MS/MS data. This thesis is made up of three published articles.

The first article “MzJava: An open source library for mass spectrometry data processing”, presents a well-engineered and well-tested Java class library which was designed to ease the development of custom MS/MS data analysis software. In this article, we provide an overview of the available classes and functionalities, and describe the methodology that was used during the development of the library.

The second article “Mining Large Scale Tandem Mass Spectrometry Data for Protein Modifications Using Spectral Libraries” describes the development and testing of two tools, Liberator and MzMod, which were built using MzJava. Liberator is a tool to create spectral libraries for large data sets and MzMod is a spectrum library based open modification search (OMS) engine.

The last article “Glycoforest 1.0” introduces a partial de novo algorithm for assigning glycan structures to MS/MS spectra. The article describes the development and validation of Glycoforest.

Figure 9 provides a high-level overview of how the software packages presented in this thesis are connected.

Figure 9. High-level overview of how the software packages presented in this thesis are connected. MzJava provides the classes and functionalities for working with MS/MS data and for developing novel MS/MS analysis algorithms. MzJava is split into four modules: the core, big data infrastructure, proteomics and glycomics. The core and big data infrastructure modules contain functionality that is common to all MS data, the proteomics module contains functionality specific to peptides and proteins, and the glycomics module contains functionality specific to glycans. MzMod and Glycoforest are both written using MzJava, with MzMod building on top of the proteomics module, whereas Glycoforest is built on top of the glycomics module.

Chapter 2 First Author Papers

2.1. MzJava: An open source library for mass spectrometry data processing

2.1.1. Overview

MzJava is a library of algorithms and data structures that can be used to create software for processing and interpreting MS/MS data. MzJava originated from merging JPL (http://javaprotlib.sourceforge.net/) and another unpublished Java MS codebase which I developed prior to this thesis. During the merge, the code was comprehensively refactored and refined to produce a high quality, consistent and well-designed API. Further refinements and additions were made to MzJava while working on the other projects described in this thesis and by other members of the Proteome Informatics group while working on their projects. The notable additions that I contributed were the inclusion of support for Apache Hadoop and Apache Spark as well as new code to support glycomics MS/MS data analysis. During the work on MzJava, best practices in software engineering such as test-driven development and continuous integration were used. To ensure that these best practices were effective and to maintain our quality standards, the code quality was continuously tracked using SonarQube. This resulted in a software library that provides a solid foundation for developing software and algorithms for interpreting MS/MS data.

MzJava: An open source library for mass spectrometry data processing☆

Oliver Horlacher a,b, Frederic Nikitin a, Davide Alocci a,b, Julien Mariethoz a, Markus Müller a,b,⁎, Frederique Lisacek a,b,⁎

a Proteome Informatics Group, SIB Swiss Institute of Bioinformatics, Geneva 1211, Switzerland

b Centre Universitaire de Bioinformatique, University of Geneva, Geneva 1211, Switzerland

Article info

Article history:

Received 2 May 2015

Received in revised form 17 June 2015

Accepted 22 June 2015

Available online 30 June 2015

Keywords: Java, Mass spectrometry, Proteomics, Glycomics, Hadoop, Spark

Abstract

Mass spectrometry (MS) is a widely used and evolving technique for the high-throughput identification of molecules in biological samples. The need for sharing and reuse of code among bioinformaticians working with MS data prompted the design and implementation of MzJava, an open-source Java Application Programming Interface (API) for MS related data processing. MzJava provides data structures and algorithms for representing and processing mass spectra and their associated biological molecules, such as metabolites, glycans and peptides. MzJava includes functionality to perform mass calculation, peak processing (e.g. centroiding, filtering, transforming), spectrum alignment and clustering, protein digestion, fragmentation of peptides and glycans as well as scoring functions for spectrum–spectrum and peptide/glycan-spectrum matches. For data import and export MzJava implements readers and writers for commonly used data formats. For many classes support for the Hadoop MapReduce (hadoop.apache.org) and Apache Spark (spark.apache.org) frameworks for cluster computing was implemented. The library has been developed applying best practices of software engineering. To ensure that MzJava contains code that is correct and easy to use, the library's API was carefully designed and thoroughly tested. MzJava is an open-source project distributed under the AGPL v3.0 licence. MzJava requires Java 1.7 or higher. Binaries, source code and documentation can be downloaded from http://mzjava.expasy.org and https://bitbucket.org/sib-pig/mzjava.

This article is part of a Special Issue entitled: Computational Proteomics.

1. Introduction

Mass spectrometry (MS) has become a central analytical technique to characterise proteins, lipids, carbohydrates and metabolites in complex samples [1,2]. The diversity of biological questions that can be addressed with MS is reflected in a wide range of experimental workflows. Analysing data generated through these workflows is automated, though most of the time the variability of applications requires software customisation and/or extension. This situation is well described in a recent review [3] and justifies the development of libraries of MS related software to facilitate code reuse. Software libraries are meant to benefit the developers' own group and collaborators as well as the wider computational proteomics and glycomics communities.

Many early open source contributions dedicated to the automated analysis of proteomic data were coded in C++ or Perl. The trans-proteomic pipeline (TPP) [4] is an assembly of C++ programs and Perl scripts to process and statistically validate MS/MS search results from different search engines and integrate these with quantitative data. Later, TOPP was introduced as a management system for generic proteomics workflows [5], based on OpenMS [6]. OpenMS contains a well-designed Application Programming Interface (API), which makes it useful not only as a toolbox but also as a code base for software developers. ProteoWizard [7] is another C++ open source project for the conversion of proteomic MS/MS file formats and processing of MS/MS spectra. More recently, the Java programming language gained popularity in many fields and especially in bioinformatics due to its portability across different computer platforms and the availability of powerful and comprehensive class libraries that facilitate and accelerate software development. For example, the Chemistry Development Kit (CDK, http://sourceforge.net/projects/cdk/) is an open source library for cheminformatics and computational chemistry [8]. BioJava (http://biojava.org/, [9]) addresses the bioinformatics community and provides classes and tools for protein structure comparison, alignments of DNA and protein sequences, analysis of amino acid properties, protein modifications and prediction of disordered regions in proteins as well as parsers for common file formats.

A broad range of Java based open source solutions was developed for the proteomics community as comprehensively reviewed in [3]. Our focus being mainly on MS/MS data processing, the following refers to such dedicated toolboxes. Compomics is a collection of tools mainly for MS/MS data analysis [10,11]. It also contains the Compomics-utilities class library [12], which provides the code base tools. The Java Proteomics Library (JPL, javaprotlib.sourceforge.net) provides an API to process MS/MS spectra and their annotation. It served as the code base for several software projects dealing with MS/MS searches [13], spectral libraries and open modification searches [14], and data-independent quantification [15]. The PRIDE tool suite (http://pride-toolsuite.googlecode.com) contains a set of pure Java libraries, tools and packages designed to process and visualize versatile MS proteomics data [3]. It also contains a well-designed Java API (ms-data-core-api) facilitating the implementation of customized solutions [16]. In recent years, several open source Java-based laboratory information systems (LIMS) storing and processing proteomic or glycomic data were described [17–19].

⁎ Corresponding authors at: Swiss Institute of Bioinformatics, University of Geneva, CMU, Rue Michel-Servet 1, 1211 Geneva, Switzerland.

E-mail addresses: [email protected] (M. Müller), [email protected] (F. Lisacek).

Contents lists available at ScienceDirect

Journal of Proteomics

journal homepage: www.elsevier.com/locate/jprot

Glycomics MS data are still seldom produced in high throughput set-ups but this field evolves quickly and the need for automation is growing. A reference software for glycan MS and MS/MS annotation is GlycoWorkbench [20]. This tool is modular and mostly known for its convenient user interface to support glycan structural assignment. Theoretical glycan spectra can be calculated with its fragmentation tool based on the mechanisms and nomenclature described by Domon and Costello [21]. Importantly it relies on recognised standard description of monosaccharides and full structures [22].

The scalability of existing solutions is currently one of the greatest challenges. The ever-growing size of MS datasets and the need to process spectra by the tens of millions imposes the use of distributed data processing frameworks such as Hadoop MapReduce [23] (hadoop.apache.org) and Apache Spark [24] (spark.apache.org). Hadoop is already used in bioinformatics, primarily for next-generation sequencing analysis [25] but also for proteomics [26–28], while Spark was more recently introduced [29,30]. Hadoop MapReduce is an implementation of the MapReduce programming model described by Dean and Ghemawat [23]. Spark extends on the functionality and performance of Hadoop by allowing in-memory data storage and providing additional functions [24].

We introduce MzJava, a Java class library designed to ease the development of custom data analysis software by providing building blocks that are common to most MS data processing software. MzJava addresses the scaling issues by adding classes to interface with Hadoop and Spark. Furthermore, new code was included for processing glycomics data. MzJava originates from merging JPL and another unpublished Java MS codebase. During this merge the code was comprehensively refactored and refined in an effort to produce a consistent and well-designed API. Best practices in software engineering such as test driven development and continuous integration were applied during the implementation. Code quality metrics of MzJava are continuously tracked to maintain high quality standards. These metrics are used to benchmark MzJava in relation to other packages.

2. Materials and methods

2.1. Development aims

MzJava is mostly centred on MS/MS identification and annotation. This bias towards an identification-related API reflects our research focus. The use of the MzJava API is intended for writing software that is capable of processing large data sets. Consequently the API is designed to be extensible, flexible and efficient. During development we found that flexibility often comes at the cost of performance. Where we identified performance hot-spots we implemented solutions that allow either flexible or high performance code to be written, while in non performance critical code we kept flexibility as a major criterion. Additional design aims were to make the API not only easy to use, but also hard to misuse as well as prompt to fail whenever there are errors [31].

The development of MzJava entailed refactoring a substantial part of the JPL aiming at producing high quality and efficient code. The outcome is meant to be structured as a coherent API as opposed to bundling a collection of code pieces. MzJava follows Java naming and behaviour conventions and provides builders with fluent interfaces for constructing complex objects (http://en.wikipedia.org/wiki/Fluent_interface). To help prevent misuse and to make MzJava easy to use in multithreaded environments, mutable objects are avoided as much as possible. Objects that need to be mutable were designed to always be in a valid state.
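The combination of a fluent builder, an immutable result and early failure can be illustrated with a small sketch. The class below is hypothetical and is not part of the MzJava API. It only shows the general pattern: chained setter calls on a builder, validation in `build()`, and an object whose state cannot change after construction.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical immutable spectrum built through a fluent builder (not MzJava's API). */
final class SpectrumSketch {
    private final double precursorMz;
    private final int charge;
    private final double[] mzs;
    private final double[] intensities;

    private SpectrumSketch(double precursorMz, int charge, double[] mzs, double[] intensities) {
        this.precursorMz = precursorMz;
        this.charge = charge;
        this.mzs = mzs;
        this.intensities = intensities;
    }

    double precursorMz() { return precursorMz; }
    int charge() { return charge; }
    int peakCount() { return mzs.length; }
    double mz(int index) { return mzs[index]; }
    double intensity(int index) { return intensities[index]; }

    static Builder builder() { return new Builder(); }

    static final class Builder {
        private double precursorMz = Double.NaN;
        private int charge = 1;
        private final List<double[]> peaks = new ArrayList<>();

        Builder precursorMz(double value) { this.precursorMz = value; return this; }

        Builder charge(int value) { this.charge = value; return this; }

        Builder addPeak(double mz, double intensity) {
            // Fail at the point of the mistake instead of producing a corrupt spectrum.
            if (mz <= 0 || intensity < 0) throw new IllegalArgumentException("Invalid peak");
            peaks.add(new double[] {mz, intensity});
            return this;
        }

        SpectrumSketch build() {
            if (Double.isNaN(precursorMz) || precursorMz <= 0) {
                throw new IllegalStateException("Precursor m/z must be set");
            }
            if (charge == 0) throw new IllegalStateException("Charge must not be 0");
            double[] mzs = new double[peaks.size()];
            double[] intensities = new double[peaks.size()];
            for (int i = 0; i < peaks.size(); i++) {
                mzs[i] = peaks.get(i)[0];
                intensities[i] = peaks.get(i)[1];
            }
            // The arrays are created here and never exposed, so the result is immutable.
            return new SpectrumSketch(precursorMz, charge, mzs, intensities);
        }
    }

    public static void main(String[] args) {
        SpectrumSketch spectrum = SpectrumSketch.builder()
                .precursorMz(800.4).charge(2)
                .addPeak(102.055, 10.0).addPeak(233.095, 5.0)
                .build();
        System.out.println(spectrum.peakCount());
    }
}
```

Because the built object has no setters and never hands out its arrays, it can be shared between threads without synchronisation, which is the motivation the text gives for avoiding mutable objects.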

2.2. Development methodology

The methodology that was employed to develop MzJava follows best practice for scientific computing [32–35] and is influenced by agile software development, especially test driven development (TDD) [36] and continuous integration (CI) [37]. In brief, TDD describes a short development cycle for adding improvements or new functionality. First the developer writes an automated test, which initially fails, that defines the improvement or new functionality. Code is then written to make the test pass. This is followed by refactoring the new code to bring it up to acceptable quality. Fig. 1A summarizes the TDD cycle. Automated tests are written using JUnit (http://junit.org/) and Mockito (http://mockito.org/). CI is a software engineering practice in which changes to the code are automatically tested whenever they are added to the codebase. The goal of CI is to provide rapid feedback so that changes that introduce issues or break the build can be corrected rapidly. Jenkins (http://jenkins-ci.org/) is used to automate the CI.

Code quality scores are tracked using SonarQube (http://www.sonarqube.org/). Quality profiles were slightly altered from the Sonar way with Findbugs profiles. SonarQube is also used to evaluate the quality of comparable libraries such as BioJava, ms-data-core-api, jmzml (a library to handle mzML files [38]), and compomics-utilities. MzJava is a Maven project developed using IntelliJ Idea (https://www.jetbrains.com/idea/). Fig. 1B provides a more detailed view on the development cycle and the tools used.

2.3. Architecture

The MzJava architecture is modular and consists of three main modules:

1. The core module contains functionality that is common to all MS data.

2. The proteomics module contains functionality specific to peptides and proteins.

3. The glycomics module contains functionality specific to glycans.

Fig. 2 illustrates the organisation of the modules and highlights the central position of the core module that overlaps with the proteomics and glycomics modules.

3. Results

The MzJava core consists of three main parts: mol, ms and io (Fig. 2). The mol-core comprises classes to work with chemical compositions and their masses. The Atom class for example represents the mass and isotopic abundances of a chemical element. The Composition class deals with assemblies of atoms defined by their stoichiometric chemical formulae. NumericMass is used to represent objects that have a mass but no known composition. The main classes in ms-core deal with peak lists (PeakList, list of m/z-intensity pairs) and their associated meta data (Spectrum). There are a number of Spectrum subclasses to capture the meta data associated with particular types of spectra. For example, MsnSpectrum contains meta data such as scan number and retention time and ConsensusSpectrum captures meta data that is associated with a consensus spectrum such as the ids of the spectra from which the consensus was built, and the structure of the peptide/glycan that the consensus spectrum represents. The peak list associated with each

64 O. Horlacher et al. / Journal of Proteomics 129 (2015) 63–70
