
Inferring phenotypes from genotypes with machine learning: an application to the global problem of antibiotic resistance

Thesis

Alexandre Drouin

Doctorat en informatique

Philosophiæ doctor (Ph. D.)

Québec, Canada


Inferring phenotypes from genotypes with machine learning

An application to the global problem of antibiotic resistance

Thesis

Alexandre Drouin

Under the supervision of:

François Laviolette, Research supervisor
Mario Marchand, Research co-supervisor
Jacques Corbeil, Research co-supervisor


Résumé

Understanding the link between the genomic characteristics of an individual (the genotype) and its biological state (the phenotype) is essential to the development of personalized medicine, where treatments are tailored to each individual. Notably, it makes it possible to anticipate diseases, estimate response to treatments, and even identify new pharmaceutical targets. Machine learning is a science that aims to develop algorithms capable of learning from examples. These algorithms can be used to produce models that estimate phenotypes based on genotypes, which can then be studied to elucidate the biological mechanisms underlying the phenotypes. However, the use of learning algorithms in this context poses significant algorithmic and theoretical challenges. The high dimensionality of genomic data and the small size of data samples can lead to overfitting; the volume of the data requires adapted algorithms that limit their use of computational resources; and finally, the resulting models must be interpretable by domain experts, which is not always possible.

This thesis presents learning algorithms that produce interpretable models for the prediction of phenotypes based on genotypes. First, we explore the prediction of discrete phenotypes using rule-based algorithms. We propose new, highly optimized implementations, as well as generalization guarantees adapted to genomic data. Second, we turn to a more theoretical problem, interval regression, and propose two new learning algorithms, one of which is rule-based. Finally, we show that this type of regression can be used to predict continuous phenotypes and that it leads to more accurate models than conventional methods in the presence of censored or noisy data. The applicative theme of this thesis is the prediction of antibiotic resistance, a public health problem of global scale. We demonstrate that our algorithms can be used to predict resistance phenotypes with high accuracy, while contributing to a better understanding of them. Ultimately, our algorithms could support the development of tools enabling better use of antibiotics and improved epidemiological surveillance, a key component of the solution to this problem.


Abstract

A thorough understanding of the relationship between the genomic characteristics of an individual (the genotype) and its biological state (the phenotype) is essential to personalized medicine, where treatments are tailored to each individual. It notably makes it possible to anticipate diseases, estimate response to treatments, and even identify new pharmaceutical targets. Machine learning is a science that aims to develop algorithms that learn from examples. Such algorithms can be used to learn models that estimate phenotypes based on genotypes, which can then be studied to elucidate the biological mechanisms that underlie the phenotypes. Nonetheless, the application of machine learning in this context poses significant algorithmic and theoretical challenges. The high dimensionality of genomic data and the small size of data samples can lead to overfitting; the large volume of genomic data requires adapted algorithms that limit their use of computational resources; and importantly, the learned models must be interpretable by domain experts, which is not always possible.

This thesis presents learning algorithms that produce interpretable models for the prediction of phenotypes based on genotypes. Firstly, we explore the prediction of discrete phenotypes using rule-based learning algorithms. We propose new implementations that are highly optimized and generalization guarantees that are adapted to genomic data. Secondly, we study a more theoretical problem, namely interval regression. We propose two new learning algorithms, one of which is rule-based. Finally, we show that this type of regression can be used to predict continuous phenotypes and that this leads to models that are more accurate than those of conventional approaches in the presence of censored or noisy data.

The overarching theme of this thesis is an application to the prediction of antibiotic resistance, a global public health problem of high significance. We demonstrate that our algorithms can be used to accurately predict resistance phenotypes and to improve our understanding of them. Ultimately, we expect that our algorithms will contribute to the development of tools that enable a better use of antibiotics and improved epidemiological surveillance, a key component of the solution to this problem.


Contents

Résumé
Abstract
Contents
List of Tables
List of Figures
Remerciements
Foreword
    Included publications
    Other publications
Introduction
    The global problem of antibiotic resistance
    Machine learning
    Bioinformatics
    The genotype-to-phenotype problem
    Predicting antibiotic resistance
1 Supervised machine learning
    1.1 Types of supervised learning problems
    1.2 Loss functions
    1.3 Minimizing the expected loss function
    1.4 Overfitting and underfitting
    1.5 Model selection
    1.6 Generalization bounds for classification
I Algorithms for discrete phenotypes
2 Presentation of the first article
    2.1 Reference
    2.2 Context
    2.3 Contributions
3 Predictive computational phenotyping and biomarker discovery using reference-free genome comparisons
    3.1 Résumé de l'article
    3.2 Summary of the article
    3.3 Background
    3.4 Machine learning for biomarker discovery
    3.5 Results
    3.6 Discussion
    3.7 Conclusions
    3.8 Methods
    3.9 Future work
4 Presentation of the second article
    4.1 Reference
    4.2 Context
    4.3 Contributions
    4.4 Discussion
5 Large scale modeling of antimicrobial resistance with interpretable classifiers
    5.1 Résumé de l'article
    5.2 Summary of the article
    5.3 Introduction
    5.4 Methods
    5.5 Results and discussion
    5.6 Conclusion
6 Presentation of the third article
    6.1 Reference
    6.2 Context
    6.3 Contributions
    6.4 Discussion
7 Interpretable genotype-to-phenotype classifiers with performance guarantees
    7.1 Résumé de l'article
    7.2 Summary of the article
    7.3 Introduction
    7.4 Results
    7.5 Discussion
    7.6 Methods
    7.7 Acknowledgements
    7.8 Author contributions statement
    7.9 Additional information
II Algorithms for continuous phenotypes with censoring
8 Presentation of the fourth article
    8.1 Reference
    8.2 Context
    8.3 Contributions
    8.4 Discussion
9 Maximum Margin Interval Trees
    9.1 Résumé de l'article
    9.2 Summary of the article
    9.3 Introduction
    9.4 Related work
    9.5 Problem
    9.6 Algorithm
    9.7 Results
    9.8 Discussion and conclusions
10 Maximum Margin Interval k-Nearest Neighbors
    10.1 A k-nearest neighbors algorithm for interval regression
    10.2 Results and discussion
    10.3 Conclusion
11 Minimum inhibitory concentration prediction as an interval regression problem
    11.1 Introduction
    11.2 Methods
    11.3 Results and discussion
    11.4 Conclusion
Conclusion
A Appendix to Chapter 3
    A.1 The Set Covering Machine
    A.2 Removing genes from the data
    A.3 The k-mer length as a hyperparameter
    A.4 An upper bound on the error rate
    A.5 An efficient implementation
    A.6 Supplementary tables and figures
B Appendix to Chapter 5
C Appendix to Chapter 7
    C.1 Supplementary methods
    C.2 Supplementary figures
    C.3 Supplementary tables
D Appendix to Chapter 9
    D.1 Proofs
    D.2 Method details
    D.3 Additional results
E Appendix to Chapter 11

List of Tables

3.1 Error rate and complexity for algorithms that use feature selection
3.2 Error rate and complexity for algorithms that use the entire feature space
7.1 Overview of the benchmark datasets
7.2 Accuracy and complexity of all algorithms on the benchmark datasets
7.3 Accuracy and complexity of SCM and CART models learned using cross-validation and bound selection for model selection
11.1 Overview of the K. pneumoniae minimum inhibitory concentration prediction data
11.2 Mean squared error of real-valued and interval regression models
11.3 Accuracy of real-valued and interval regression models
A.1 Comparison of SCM models with the k-mer length set to 31 vs. selected by cross-validation
A.2 An example, drawn from the results for the doripenem dataset (P. aeruginosa), where extending the k-mer in the best k = 15 model does not lead to the best model for k = 21
A.3 Error rate and complexity of SCM models learned using cross-validation and bound selection for model selection
A.4 A detailed overview of the sequencing data used in this study
A.5 The antibiotic resistance datasets used in this study
A.6 Analysis of overfitting
A.7 Average and standard deviation of sensitivities for each dataset
A.8 Average and standard deviation of specificities for each dataset
A.9 Average and standard deviation of error rates for each dataset
B.1 Detailed description of the datasets extracted from the PATRIC database
B.2 Average testing set metrics for each dataset
C.1 Detailed results for all methods on the 107 PATRIC datasets
C.2 Extended benchmark
C.3 Sample compression bound values for the SCMb models on each benchmark dataset
C.4 Sample compression bound values for the CARTb models on each benchmark dataset

List of Figures

I.1 The central dogma of molecular biology
I.2 Illustration of the structure of a DNA molecule
I.3 Number of bacterial genomes in the GenBank database from 2007 to 2017
1.1 Overfitting and underfitting
1.2 Cross-validation
3.1 Distribution of resistant and sensitive isolates in each dataset
3.2 Visualization of antibiotic resistance models
3.3 Going beyond k-mers
3.4 Overcoming spurious correlations
3.5 The k-mer representation
4.1 The Kover AMR Platform
5.1 SCM results on the 36 PATRIC datasets
7.1 Summary of the data extracted from the PATRIC database
7.2 Accuracy of CARTb and SCMb models on all datasets, grouped by species
7.3 Visualization of rule-based genotype-to-phenotype models
7.4 Running time of CART and SCM using bound selection and cross-validation for model selection
7.5 Confusion matrices for the multi-class classification tasks
8.1 Maximum Margin Interval Tree cost w.r.t. each type of censored output
9.1 An example partition of leaf τ0 into leaves τ1 and τ2
9.2 First two steps of the dynamic programming algorithm
9.3 Empirical evaluation of the time complexity
9.4 Predictions on simulated datasets
9.5 Comparison of MMIT and other learning algorithms on various datasets
10.1 Illustration of the MMIKNN algorithm
10.2 Predictions of all interval regression algorithms for simulated datasets
10.3 Comparison of the mean squared error of MMIKNN and other interval regression algorithms in seven real and simulated datasets
10.4 Generalization error of MMIKNN as a function of the number of features in each simulated dataset
10.5 Comparison of MMIKNN and other interval regression algorithms in the datasets of Chapter 9
10.6 Mean squared error of MMIKNN and other interval regression algorithms on two datasets where MMIKNN is the best performing method
A.1 Antibiotic resistance models learned by SCM for each dataset
C.1 Accuracy of the CARTb and SCMb models with respect to the number of k-mers and the number of genomes in each of the 107 PATRIC datasets
C.2 Value of the sample compression bound with respect to the number of rules in the SCMb model for the M. tuberculosis dataset
C.3 Value of the sample compression bound with respect to the α hyperparameter of the CARTb algorithm for the M. tuberculosis dataset
C.4 Illustration of the bound selection and cross-validation model selection methods
D.1 Training and testing set mean squared error for various margin sizes
D.2 Average and standard deviation of the mean squared error on the five cross-validation folds across all datasets
D.3 Margin values (ε) for each of the five cross-validation folds
D.4 Minimum number of examples required to split a leaf, for each of the five cross-validation folds
D.5 Maximum tree depth for each of the five cross-validation folds
E.1 MIC prediction model for MMIT trained from the interval labels (yint) with the mean squared error as the cross-validation score
E.2 MIC prediction model for MMIT trained from the approximate real-valued labels (yapprox) with the mean squared error as the cross-validation score
E.3 MIC prediction model for MMIT trained from the interval labels (yint) with the ±1 two-fold dilution accuracy as the cross-validation score
E.4 MIC prediction model for MMIT trained from the measurement-noise-aware interval labels (yint*) with the ±1 two-fold dilution accuracy as the cross-validation score
E.5 MIC prediction model for MMIT trained from the approximate real-valued labels (yapprox) with the ±1 two-fold dilution accuracy as the cross-validation score

Not everything that can be counted counts. Not everything that counts can be counted.


Remerciements

I thank my research supervisor, François Laviolette, for taking me under his wing at the end of my undergraduate studies. Extremely generous with his time, he guided me throughout my graduate studies and passed on his passion for research. He made me rediscover mathematics and rekindled my interest in the discipline. Through his great talent as a teacher, he taught me the art of communicating and popularizing complex research ideas. I am very grateful to him for giving me several opportunities to present my work in equally exciting venues (Institut Curie, Argonne National Laboratory, Google Cambridge, etc.). I thank my co-supervisor, Mario Marchand, for instilling in me the rigor of scientific writing and for introducing me to learning theory. I also thank my co-supervisor, Jacques Corbeil, for introducing me to bioinformatics and for giving me numerous opportunities to collaborate in this field. Together, they constantly pushed me beyond my limits and contributed greatly to my training as a researcher.

I thank my partner (and namesake) Alexandra for accompanying me on this adventure and for always being there to encourage me and celebrate my successes. Almost ten years ago, she strongly suggested that I undertake university studies; it would seem that I developed a taste for them! I thank my parents, Michel and Sylvie, for their continued support and for constantly encouraging me to persevere in my studies. I also thank my sister Valérie, my parents-in-law Diane and Raymond, the many other members of my family, and my friends for their support and encouragement. I thank them all for enduring my perpetual unavailability, especially during the writing of this thesis.

I thank all the members of GRAAL whom I have had the privilege of working alongside. Special thanks to Sébastien Giguère and Gaël Letarte, who contributed significantly to the work presented in this thesis. Thanks also to Hana Ajakan, Baptiste Bauvin, Luc Bégin, Khalil Ben Fadhel, Jonathan Bergeron, Sébastien Boisvert, Francis Brochu, Ulysse Côté Allard, Patrick Dallaire, Maxime Déraspe, Gabriel Dubé, Christophe Duchesne-Ashworth, Pierre-Louis Gagnon, Nicolas Garneau, Jonathan Gingras, Pascal Germain, Mazid Abiodoun Osseni, Frédérik Paradis, Pier-Luc Plante, Frédéric Raymond, Émile Robitaille, Lynda Robitaille, Amélie Rolland, Jean-Francis Roy, Vladana Sagatovich, and Prudencio Tossou. Through collaborations and interesting discussions, you have all contributed to my training as a researcher.

I also thank Mathieu Blanchette, as well as the members of his research group at McGill University, for generously welcoming me as a visiting student and for passing on their passion for bioinformatics. Special thanks to Christopher JF Cameron, Maia Kaplan, and Faizy Ahsan, with whom I had the chance to collaborate. I also thank Toby Dylan Hocking, with whom I had the chance to collaborate during my time at this university.

I also thank the team at Element AI, particularly Alexandre Lacoste, Valérie Bécaert, Philippe Beaudoin, and Nicolas Chapados, for giving me the opportunity to get started in deep learning research.

Finally, I thank the Natural Sciences and Engineering Research Council of Canada for its financial support, as well as Calcul Québec and Compute Canada for access to the exceptional computing resources that allowed me to carry out the experiments presented in this thesis.


Foreword

This thesis presents a selection of four articles that were written during my doctoral studies. Combined with the other contributions presented in this thesis, these works form a common thread that represents our exploration of the genotype-to-phenotype prediction problem. My contribution to these articles is described hereafter.

Included publications

Predictive computational phenotyping and biomarker discovery using reference-free genome comparisons

Alexandre Drouin, Sébastien Giguère, Maxime Déraspe, Mario Marchand, Michael Tyers, Vivian G. Loo, Anne-Marie Bourgault, François Laviolette, and Jacques Corbeil in BMC Genomics, 17(1), 754 (2016).

Published: yes (journal)

Date of acceptance: July 6, 2016
Main author: Alexandre Drouin
Included as: Chapter 3

Editing for inclusion: The notation used to present the sample compression bound was adapted to match the one used in the thesis.

Author contributions: A.D., F.L., M.M. and S.G. designed the algorithmic extensions to the Set Covering Machine algorithm. A.D., F.L. and M.M. derived the sample compression bound for the Set Covering Machine. A.D. designed the out-of-core implementation of the Set Covering Machine. A.D., F.L., J.C., M.M. and S.G. designed the experimental protocols and A.D. conducted the experiments. A.D., J.C., M.D. and S.G. evaluated the biological relevance of the models. M.D. acquired the data and prepared it for analysis. A.-M.B. and V.L. acquired and provided the C. difficile genomes and the associated antibiotic susceptibility data. A.D., F.L., J.C., M.D., M.M., M.T. and S.G. wrote the manuscript.


Large scale modeling of antimicrobial resistance with interpretable classifiers

Alexandre Drouin, Frédéric Raymond, Gaël Letarte St-Pierre, Mario Marchand, Jacques Corbeil, and François Laviolette in Machine Learning for Health Workshop, Neural Information Processing Systems Conference (2016).

Published: yes (workshop)

Date of acceptance: November 15, 2016
Main author: Alexandre Drouin
Included as: Chapter 5

Editing for inclusion: The article is unchanged.

Author contributions: A.D. acquired the data from the PATRIC database and devised the web-based tool for the biological interpretation of models. A.D. and G.L. designed and conducted the antimicrobial resistance (AMR) prediction experiments. A.D., F.L., G.L., and M.M. analyzed the AMR prediction results. A.D., F.R., and J.C. analyzed the biological interpretation of the models. All authors contributed to writing the manuscript.

Interpretable genotype-to-phenotype classifiers with performance guarantees

Alexandre Drouin, Gaël Letarte, Frédéric Raymond, Mario Marchand, Jacques Corbeil, and François Laviolette in Scientific Reports, 9(1), 4071 (2019).

Published: yes (journal)

Date of acceptance: February 19, 2019
Main author: Alexandre Drouin
Included as: Chapter 7

Editing for inclusion: The notation used to present the sample compression bound was adapted to match the one used in the thesis.

Author contributions: A.D. conceived and conducted the AMR prediction experiments for binary phenotypes and the comparison of the bound-selection and cross-validation approaches for model selection. G.L. conceived and conducted the multi-class classification and running time measurement experiments. A.D., G.L., F.L., and M.M. analyzed the AMR prediction, model selection, and running time results. A.D., F.R., and J.C. analyzed the models and their biological interpretation. A.D., F.L., G.L., and M.M. derived the sample compression bound for classification trees. A.D. and G.L. implemented the Classification and Regression Trees algorithm in Kover, based on an out-of-core implementation of the Set Covering Machine algorithm by A.D. All authors contributed to writing the manuscript.


Maximum Margin Interval Trees

Alexandre Drouin, Toby Dylan Hocking, and François Laviolette in Advances in Neural Information Processing Systems (2017).

Published: yes (conference)

Date of acceptance: September 4, 2017
Main author: Alexandre Drouin
Included as: Chapter 9

Editing for inclusion: Minor changes were needed to adapt the notation to the one used in the thesis.

Author contributions: A.D. and T.D.H. devised the MMIT algorithm, as well as the dynamic programming algorithm at its core. A.D. and T.D.H. studied the time and space complexity of these algorithms. A.D. proved the theorem bounding the number of pointer moves in the dynamic programming algorithm for the linear hinge loss. A.D. proved the theorem stating that the solution of the dynamic programming algorithm is optimal. A.D., T.D.H., and F.L. contributed to formalizing the algorithm and to its presentation in the article. A.D. implemented the algorithms and A.D. and T.D.H. conducted the experiments. All authors contributed to analyzing the results and writing the manuscript.

Other publications

I have also contributed to other articles that are not included in this thesis, since their topics diverge from that of the thesis. Below, I briefly describe my contribution to these articles.

Accelerated Robust Point Cloud Registration in Natural Environments through Positive and Unlabeled Learning

Maxime Latulippe, Alexandre Drouin, Philippe Giguère, and François Laviolette in Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence (2013).

Published: yes (conference)

Date of acceptance: April 2, 2013
Main author: Maxime Latulippe

Author contributions: M.L. analyzed the problem, designed the proposed solution, and collected the data, under the supervision of P.G. A.D. designed the positive and unlabeled learning component of the method, under the supervision of F.L. A.D. participated in this project through a graduate course given by P.G. at Laval University. M.L. conducted the experiments and analyzed the results. All authors contributed to writing the manuscript.


Learning a peptide-protein binding affinity predictor with kernel ridge regression

Sébastien Giguère, Mario Marchand, François Laviolette, Alexandre Drouin, and Jacques Corbeil in BMC Bioinformatics (2013).

Published: yes (journal)

Date of acceptance: February 21, 2013
Main author: Sébastien Giguère

Author contributions: SG designed the GS kernel and algorithms for its computation, implemented the learning algorithm, and conducted experiments on the PepX and QSAM datasets. MM designed the learning algorithm. FL and MM did the proof of the symmetric positive semi-definiteness of the GS kernel. AD conducted experiments on MHC-II datasets. JC provided biological insight and knowledge. This work was done under the supervision of MM, FL, and JC. All authors contributed to writing the manuscript.

MHC-NP: Predicting peptides naturally processed by the MHC

Sébastien Giguère, Alexandre Drouin, Alexandre Lacoste, Mario Marchand, Jacques Corbeil, and François Laviolette in Journal of immunological methods, 400, 30-36 (2013).

Published: yes (journal)

Date of acceptance: October 5, 2013
Main author: Sébastien Giguère

Author contributions: S.G. analyzed the problem and designed the eluted peptide prediction method. A.D. and S.G. conducted the eluted peptide prediction experiments and analyzed the results. A.L. conducted the experiments on advanced hyperparameter tuning and analyzed the results. A.D. and S.G. implemented the MHC-NP tool and prepared its integration into the Immune Epitope Database. All authors contributed to writing the manuscript.

Mass spectra alignment using virtual lock-masses

Francis Brochu, Pier-Luc Plante, Alexandre Drouin, François Laviolette, Mario Marchand, and Jacques Corbeil, Submitted (2018).

Published: no (submitted)

Date of submission: July 19, 2018
Main author: Francis Brochu

Author contributions: A.D., F.B., F.L., M.M., and P.-L.P. devised the algorithms for the identification of virtual lock masses (VLM) and the correction of spectra given such VLMs. P.-L.P. and J.C. acquired the data. F.B. conducted the experiments and analyzed the results. All authors contributed to writing the manuscript.

Deep Learning for Electromyographic Hand Gesture Signal Classification Using Transfer Learning

Ulysse Côté-Allard, Cheikh Latyr Fall, Alexandre Drouin, Alexandre Campeau-Lecours, Clément Gosselin, Kyrre Glette, François Laviolette, and Benoit Gosselin, Submitted (2018).

Published: no (submitted)

Date of submission: December 23, 2017
Main author: Ulysse Côté-Allard

Author contributions: U.C.A. designed and conducted the experiments, acquired the data, devised the transfer learning algorithm and its implementation, and wrote the manuscript. C.L.F. conducted a literature review on deep learning in embedded systems. A.D. contributed to writing, revising, and enhancing the technical descriptions in the manuscript. A.C.L. contributed to and revised the assistive and rehabilitation aspects of the manuscript. C.G. and K.G. contributed the hardware required to conduct the experiments. F.L. and B.G. contributed their ideas and guidance. All authors contributed to editing and revising the manuscript.


Introduction

The global problem of antibiotic resistance

The discovery of antibiotics and their introduction as therapeutic agents has revolutionized the field of medicine (Blair et al., 2015). For most of the past century, antibiotics have been deeply embedded in medical practices. Modern medicine relies on antibiotics for the management of infectious diseases and to ensure the success of medical procedures, such as surgery and cancer chemotherapy (Davies and Davies, 2010; Blair et al., 2015; World Health Organization, 2018). In the pre-antibiotic era, infectious disease management was significantly different. In fact, it was not until the end of the 19th century that the germ theory of disease, postulating a causal association between some bacteria and diseases, was widely accepted (Richmond, 1954; Davies and Davies, 2010). Prior to this discovery, scientists believed that many infections, such as smallpox, syphilis, and tuberculosis, were caused by miasma, i.e., contaminated air (Richmond, 1954). The discovery of infectious bacterial agents stimulated the search for preventive and therapeutic treatments. Among the first antibiotics to be discovered were penicillin and sulphonamides (Davies and Davies, 2010). Penicillin was discovered by Alexander Fleming in 1928 and, following the work of Howard Walter Florey, was first used to treat patients in the early 1940s. Ever since, antibiotics have saved countless lives and, in retrospect, their discovery can be seen as a turning point in human history (Davies and Davies, 2010). In fact, Charles Fletcher, a research fellow who conducted the first clinical trials for penicillin, wrote:

It is difficult to convey the excitement of actually witnessing the amazing power of penicillin over infections for which there had previously been no effective treatment. I could not then imagine the transformation of medicine and surgery that penicillin would produce. But I did glimpse the disappearance of the chambers of horrors which seems to be the best way to describe those old septic wards, and could see that we should never again have to fear the streptococcus [...] or the more deadly staphylococcus (Fletcher, 1984; Cars, 2014).

However, despite invaluable improvements in the outcome of patient treatments, microbiologists anticipated that the widespread use of antibiotics would lead to another significant problem: antibiotic resistance. This refers to the state of bacteria which have developed the ability to survive in the presence of antibiotics. In fact, in 1942, microbiologist René Dubos claimed that the benefits of antibiotics were being "bought at the cost of a huge ransom" (Moberg, 1996; Cars, 2014). Moreover, when receiving their Nobel Prize in Physiology or Medicine (1945) for the discovery of penicillin, Fleming and Florey put forth the issue of antibiotic resistance (Cars, 2014; World Health Organization, 2014). What these scientists feared is that the use of antibiotics would slowly eliminate bacteria that were susceptible to antibiotic treatment, favoring the survival of multidrug-resistant strains, against which such treatments would be ineffective.

Over the past decades, antibiotic resistance has become an increasingly serious issue. In fact, even prior to the widespread use of penicillin, strains having the ability to inactivate this drug had been discovered (Davies and Davies, 2010). Similarly, sulphonamide antibiotics have been plagued by resistance ever since they were introduced (Davies and Davies, 2010). To date, many bacteria have developed multidrug resistance, meaning that they resist more than one antibiotic. Moreover, some cases of pan-resistance, i.e., resistance to all known treatments, have been reported (Jeukens et al., 2017a). As a consequence, the range of therapeutic options is considerably reduced, undermining patient outcomes, while increasing periods of hospital care and healthcare costs (Davies and Davies, 2010; World Health Organization, 2018). For instance, people with methicillin-resistant Staphylococcus aureus, a common source of severe infection, are estimated to be 64% more likely to die than people with the nonresistant variant (World Health Organization, 2018). In addition, Cars (2014) claims that "without effective antibiotics, the rate of postoperative infections in patients with hip replacement is 40%–50%, and about 30% of those with an infection will die". A recent report of the Centers for Disease Control and Prevention of the United States of America (USA) provides a gross estimate of the burden that antibiotic resistance imposes on society in terms of morbidity, mortality, and healthcare costs. Their findings indicate that, each year, at least two million Americans are infected with antibiotic resistant bacteria, of which at least 23,000 die as a direct result of the infection. Moreover, the financial cost of this issue is estimated at 20–35 billion dollars in healthcare costs and loss of productivity for the USA alone (Centers for Disease Control and Prevention, 2013). Furthermore, a recent report by the Public Health Agency of Canada claims that drug-resistant infections impose a significant financial burden on Canada's healthcare system and that 1 in 16 patients admitted to Canadian hospitals are expected to develop an infection due to a multidrug-resistant organism (Public Health Agency of Canada, 2017). Unfortunately, this issue is not limited to the USA and Canada, and is a worldwide threat (World Health Organization, 2018). For instance, cases of gonorrhoea that resist last-resort drugs have been reported in countries such as Australia, Austria, Canada, France, Japan, Norway, Slovenia, South Africa, Sweden, and the United Kingdom of Great Britain and Northern Ireland (World Health Organization, 2018). Antibiotic resistant bacteria occur naturally in every country and their global spread is fueled by the speed and volume of intercontinental travel (Jeukens et al., 2017a). Adding to the urgency of this problem, no new families of antibiotics have been discovered since 1987 and the development pipeline is practically empty (Cars, 2014; Davies and Davies, 2010; Nathan and Cars, 2014; World Health Organization, 2014). To date, experts agree that we are at risk of entering a "post-antibiotic era". But what has led to the issues that we are facing today?

Clearly, the development of antibiotic resistance is not entirely due to human activity, since it has been proven that some resistance genes have been present in nature for billions of years (Blair et al., 2015). These genes evolved to protect bacteria from microorganisms that naturally produce antibiotic molecules and spread them in their local environment. Interestingly, most antibiotics that are used in human medicine are derived from such microorganisms (e.g., penicillin). Nevertheless, there is striking evidence that the rapid rise of antibiotic resistance has been fueled by our overuse and misuse of antibiotic drugs, which placed significant selective pressure on bacterial populations and favored the evolution of resistant strains (World Health Organization, 2014, 2018). According to Davies and Davies (2010), "this is not a natural process, but a man-made situation superimposed on nature" and "there is perhaps no better example of the Darwinian notions of selection and survival." For instance, over the last half-century, millions of metric tons of antibiotics have been produced and released into the environment, pressing bacterial populations to develop drug resistance to survive (Davies and Davies, 2010). Moreover, it is estimated that less than half of the total antibiotic production is destined for human use (Davies and Davies, 2010; Nathan and Cars, 2014). The rest is used for alternative applications, such as growth promotion in agriculture, which favors the spread of antibiotic resistant bacteria into ground waters, vegetables, and meat (Davies and Davies, 2010; President's Council of Advisors on Science and Technology, 2014). Such applications are incompatible with the fact that antibiotics are used to treat infections in humans (Nathan and Cars, 2014). Furthermore, antibiotics are often misused in the treatment of humans. For instance, up to 50% of the antibiotics prescribed for patient treatment are used suboptimally or are not needed, e.g., for viral infections (Centers for Disease Control and Prevention, 2013). Moreover, in many developing countries, antibiotic use is not regulated and these drugs are available over the counter (Nathan and Cars, 2014). Consequently, there is an urgent need to regulate and limit the global usage of antibiotics, and the World Health Organization has stressed that this would require a global effort (World Health Organization, 2014, 2015, 2018).

Recently, four core actions have been proposed to combat antibiotic resistance (Centers for Disease Control and Prevention, 2013; President's Council of Advisors on Science and Technology, 2014). These are: 1) preventing infections and preventing the spread of resistance, 2) tracking resistant bacteria, 3) improving the use of today's antibiotics, and 4) promoting the development of new antibiotics and diagnostic tests for resistant bacteria. It has been emphasized that a key part of implementing these actions is the development of new tools to help identify resistance in bacteria (Obama, 2014; White House, 2015). Such tools could consist of algorithms capable of analyzing bacterial isolates to quickly determine their biological properties (referred to as the phenotype). These algorithms could be used for epidemiological surveillance by systematically, and rapidly, screening bacterial isolates for drug resistance, which would be useful for actions (1) and (2). Moreover, they could serve to improve the appropriate use of antibiotics (referred to as antibiotic stewardship) by recommending antibiotics that are likely to be effective against specific infections, which would be useful for actions (1), (3), and (4). Hence, there is a pressing need to develop bioinformatics methods for the prediction of antibiotic resistance.

In this thesis, we explore how machine learning can be coupled with genomics to generate accurate predictions of phenotypes. We propose learning algorithms to achieve this goal and use antibiotic resistance prediction to demonstrate their effectiveness. We first present the fields of machine learning and bioinformatics and then dive deeper into the problem of antibiotic resistance prediction.

Machine learning

Intelligence refers to the ability of an entity to learn, perceive, reason, and make decisions that maximize its chances of successfully accomplishing its goals. Similarly, artificial intelligence is the ability of a machine, e.g., a computer, to perform tasks that require intelligence. For a computer to be considered intelligent, it must not necessarily master each of the previous abilities. In fact, our perception of artificial intelligence is constantly evolving, as advances in computing power and algorithmic developments constantly push back the limits of modern computers. This is humorously reflected in the popular quote "artificial intelligence is what has not been done yet" (Hofstadter, 1979). For instance, one of the milestones of artificial intelligence is IBM's Deep Blue chess-playing supercomputer, which defeated the world champion, Garry Kasparov, in 1997 (Campbell et al., 2002). Although it showed great performance at this game, Deep Blue did not have the ability to learn to improve its performance. Rather, it used its powerful multi-core architecture and domain knowledge to search the space of possible actions. In contrast, Google DeepMind recently proposed the AlphaGo model, which showed outstanding performance at the game of Go and had the ability to improve with experience (Silver et al., 2016, 2017). AlphaGo sparked public amazement when it outperformed the world champions, Lee Sedol (2016) and Ke Jie (2017), at this game, which has long been viewed as the most challenging classical game for artificial intelligence (Silver et al., 2016). As opposed to Deep Blue, these models did not solely rely on computational power to search the immense space of possible actions. Instead, they attempted to predict which actions would lead to good game outcomes and had the ability to improve these predictions as they accumulated game-playing experience.


Giving computers the ability to learn from experience is the focus of machine learning, a subfield of computer science. Specifically, this field focuses on the development of learning algorithms, which learn to perform tasks by analyzing a set of examples. Informally, a learning algorithm can be pictured as a student. The student is taught how to perform tasks through a series of examples and is later expected to successfully apply the acquired knowledge to new instances of the same tasks. For instance, a student could be taught how to add two numbers through a series of examples consisting of pairs of numbers and their sum. The student would later be expected to successfully add pairs of numbers that it has not yet encountered. Similarly, an arts student could be shown a few pictures of dogs and later expected to produce a novel, realistic drawing of a dog. In other words, machine learning is tasked with creating algorithms that can infer the underlying rules of a set of observations and apply them to new observations.

Learning tasks can be grouped into three overarching categories: supervised, semi-supervised, and unsupervised learning. In supervised learning, each example is associated with an expected outcome. The goal of the algorithm is to learn a mapping from the contextual information of an example, also called its features, to its expected outcome. For instance, supervised learning could be used to predict the selling price of houses based on their characteristics (e.g., number of rooms, square footage, etc.), as sketched below. Similarly, in semi-supervised learning, some examples are associated with a label, while others are not. This type of problem generally arises in settings where labeling examples is costly or requires significant domain expertise. Finally, in unsupervised learning, no labels are provided and the learning algorithm is tasked with estimating the underlying distribution of the learning examples. Applications of this type of learning include clustering, anomaly detection (e.g., fraud detection), and generation (i.e., producing new instances, similar to those observed).
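To make the supervised setting concrete, here is a minimal sketch in Python, assuming scikit-learn is available; the house features and prices are invented for illustration.

    # Minimal supervised learning sketch (hypothetical data).
    from sklearn.linear_model import LinearRegression

    # Features: [number of rooms, square footage]; labels: selling prices.
    X_train = [[3, 1200], [4, 1600], [2, 800], [5, 2100]]
    y_train = [210000, 280000, 150000, 360000]

    model = LinearRegression().fit(X_train, y_train)  # learn the mapping
    print(model.predict([[4, 1500]]))  # estimate the price of an unseen house

The same fit/predict pattern applies to the genomic prediction problems studied in this thesis, with genotypes as features and phenotypes as outcomes.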

In this thesis, we explore how supervised learning can be coupled with bioinformatics to learn mappings from the biological composition of individuals to properties of interest, such as disease state and response to treatment. Supervised learning is further detailed in Chapter 1, while bioinformatics is introduced in the next section.

Bioinformatics

Bioinformatics is an applicative field of science, at the interface of computer science and biology, that is focused on using algorithms to understand biological processes. Specifically, this field revolves around designing algorithms and methods to acquire, analyze, and draw conclusions from data that reflect the biological state of an entity. The purpose of this section is to introduce the basic concepts of bioinformatics that are required to understand this thesis. For a detailed overview of the field, the reader is referred to the textbooks of Pevzner (2000), Baldi and Brunak (2001), and Lesk (2014).


Figure I.1: The central dogma of molecular biology. Solid arrows indicate transfers that commonly occur in cells, while dotted arrows indicate transfers that do not occur in most cells, but may occur under special circumstances. (Source: Barillot et al., 2012)

Most of the work presented in this thesis revolves around one central idea: comparing groups of individuals to identify characteristics that are associated with biological states of interest (e.g., antibiotic resistance, cancer, etc.). Specifically, we will explore how machine learning can be used to determine the association between the genetic composition of an individual (genotype) and its biological state (phenotype). Before discussing how such comparisons can be made, we introduce the basic elements that compose living cells, namely deoxyribonucleic acid (DNA), ribonucleic acid (RNA), and proteins. As we will discuss, these elements are sequential molecules that encode all of the information required to create life.

The central dogma of molecular biology

In 1958, Francis Crick formalized the relationship between DNA, RNA, and proteins by proposing the central dogma of molecular biology (Crick, 1970). This dogma summarizes how sequence information flows through biological systems. Notably, it states that DNA sequences are transcribed into RNA sequences, which are, in turn, translated into protein sequences. Figure I.1 illustrates all the known paths of sequence information flow within cells.

DNA Deoxyribonucleic acid (DNA) is a chain-like molecule composed of four nucleotides: adenine, cytosine, guanine, and thymine, which are generally represented using the alphabet {A, C, G, T}. Just like letters can be combined to form meaningful words, combinations of nucleotides form the blueprint of every part of the cellular machinery. As illustrated in Figure I.2, DNA molecules are composed of two complementary sequences (strands) that intertwine to form a double helix. The strands are said to be complementary, since there is a bijective mapping between the nucleotides present on one strand and the ones present at the same positions on the other strand. In fact, nucleotides are paired using the following rules: A ↔ T and C ↔ G. For example, if an adenine is present on one of the strands, then a thymine is present, at the corresponding position, on the other strand.

Figure I.2: Illustration of the structure of a DNA molecule. (Source: Cancer Research UK / Wikimedia Commons)
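As an illustrative sketch (not code from the thesis), the base-pairing rules above translate directly into a function that derives one strand from the other.

    # Sketch: derive the complementary strand from the pairing rules A<->T, C<->G.
    PAIRING = {"A": "T", "T": "A", "C": "G", "G": "C"}

    def complement(strand: str) -> str:
        """Nucleotide found at each corresponding position of the other strand."""
        return "".join(PAIRING[nucleotide] for nucleotide in strand)

    assert complement("ATTGC") == "TAACG"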

RNA As stated by the dogma, DNA molecules are converted into ribonucleic acid (RNA) molecules by a process called transcription. As opposed to DNA, RNA molecules are single-stranded. Moreover, they do not contain thymine (T) nucleotides, since those are replaced by uracil (U) nucleotides during the transcription process. RNA molecules have various functions within the cell but, for the purpose of this thesis, we can see RNA as an intermediate step on our quest to proteins. The interested reader is referred to Clancy (2008) for an overview of RNA functions.
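As a small sketch of this step (simplified: transcription actually reads the complementary template strand, but the net effect on the coding strand is the substitution below):

    # Sketch: the RNA transcript of a coding strand replaces thymine with uracil.
    def transcribe(dna: str) -> str:
        return dna.replace("T", "U")

    assert transcribe("ATGGCC") == "AUGGCC"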

Protein Proteins are the end result of the flow of information that was initiated at the DNA level. They are key components of cells, taking part in internal chemical reactions, structural composition, interactions with the outside world, and cellular movement. Proteins are formed during the translation process, where RNA is converted into sequences of small molecules called amino acids. There are 20 amino acids that are known to occur naturally, each of which is encoded by a triplet of RNA nucleotides, called a codon. Notice that there are 4³ = 64 possible combinations of nucleotides, but only 20 amino acids. Hence, many combinations of nucleotides map to the same amino acid. It is believed that this redundancy mechanism evolved to mitigate the effect of variations in DNA and RNA on the resulting protein sequences. Finally, it is also worth mentioning that some segments of DNA are non-coding, meaning that they are not converted into proteins. Such segments were previously referred to as "junk DNA", since they were believed to serve no purpose. However, it has now been shown that they play a variety of key roles in the cell. For more details, the reader is referred to Pennisi (2012).
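The codon-to-amino-acid mapping can be sketched with a small excerpt of the genetic code (only a few of the 64 codons are shown; the redundancy described above is visible in the four codons that all encode alanine).

    # Sketch: translate RNA three nucleotides (one codon) at a time.
    CODON_TABLE = {
        "AUG": "Met",  # also the usual start codon
        "GCU": "Ala", "GCC": "Ala", "GCA": "Ala", "GCG": "Ala",  # redundancy
        "UGG": "Trp",
    }

    def translate(rna: str) -> list:
        return [CODON_TABLE[rna[i:i + 3]] for i in range(0, len(rna) - 2, 3)]

    assert translate("AUGGCUGCG") == ["Met", "Ala", "Ala"]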

In conclusion, the central dogma of molecular biology states how the information encoded in DNA eventually yields proteins, which are key elements of cells, which are themselves key components of living individuals. Therefore, slight modifications in the DNA, whether deliberate, erroneous, or random, can have a significant impact on the phenotype of an individual.

Genomics

“Omic” sciences are fields of study of biology that focus on the analysis of all the elements of a given type in individuals. Notably, omic sciences can be used to compare sets of individuals to determine the variations that are associated with a phenotype of interest. These fields usually carry a name composed of a prefix indicating the type of elements under study, and the suffix “omics”. For instance, genomics refers to the analysis of the entire genetic material (genome), proteomics refers to the analysis of all the proteins (proteome), and metabolomics refers to the analysis of all the metabolites (metabolome) present in an individual. Omic sciences involve characterizing and quantifying these elements with the aim of drawing conclusions regarding their function, structure, and interactions. The analysis of these large sets of elements requires the manipulation of very large datasets, acquired using precise measurement instruments, such as DNA sequencers and mass spectrometers.

DNA Sequencing DNA molecules are converted to a format that is readable by computers with the help of DNA sequencers. Such devices use chemical and physical processes to read the nucleotides that compose a molecule and convert it into a sequence of characters from the alphabet {A, C, G, T}. Initially, DNA sequencing relied on costly, low-throughput technologies, such as Sanger sequencing (Sanger et al., 1977). In fact, the Human Genome Project, which used such technologies to sequence the first human genome, required 14 years and US$3 billion to reach completion (van Dijk et al., 2014). In sharp contrast, nowadays, a human genome can be sequenced for around US$1,000 in only a few days (van Dijk et al., 2014; Goodwin et al., 2016). Consequently, it is now possible to perform population-scale studies, such as the Thousand Genomes Project (1000 Genomes Project Consortium et al., 2012), where 1000 human genomes were sequenced. Moreover, DNA sequencing is increasingly being used in translational research areas, such as clinical diagnostics (Török and Peacock, 2012; van Dijk et al., 2014). The astonishing increase in the throughput of DNA sequencing technology is due to the advent of a new paradigm: next-generation sequencing.

Next-generation sequencing The particularity of next-generation sequencing (NGS) methods is that they do not read the DNA molecule sequentially in an end-to-end fashion. Rather, they break it down into small fragments and sequence them in parallel. In fact, modern instruments can sequence thousands to millions of such fragments at once (van Dijk et al., 2014). Each fragment yields a small sequence of DNA (referred to as a read), typically ranging from a few hundred to a few thousand nucleotides, depending on the instrumentation (Goodwin et al., 2016). Reads are akin to the pieces of a puzzle and must be assembled in order to recover the full DNA sequence (Boisvert et al., 2010). However, reads are interspersed with sequencing errors due to limitations of the instruments (Goodwin et al., 2016). In order to mitigate the effect of such errors, it is necessary that each position in the molecule be covered by multiple reads (Boisvert et al., 2010). The number of times that each position is covered (referred to as the coverage) is generally a good indicator of the quality of the resulting sequence. Nevertheless, high coverage is obtained at the cost of more resources, more sophisticated instruments, and longer sequencing runs. Hence, NGS offers a flexible platform that allows the rapid acquisition of DNA sequences of various qualities, based on the needs of practitioners. In the next section, we present the genotype-to-phenotype problem, one of the most significant applications of next-generation sequencing.
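The notion of coverage can be sketched with a short computation over hypothetical read alignments, each given as a (start, length) pair on the molecule.

    # Sketch: per-position coverage from hypothetical read alignments.
    def coverage(genome_length: int, reads: list) -> list:
        """Count, for each position, how many reads span it."""
        counts = [0] * genome_length
        for start, length in reads:
            for position in range(start, min(start + length, genome_length)):
                counts[position] += 1
        return counts

    # Three overlapping 5-nucleotide reads on a 10-nucleotide molecule.
    print(coverage(10, [(0, 5), (2, 5), (4, 5)]))
    # -> [1, 1, 2, 2, 3, 2, 2, 1, 1, 0]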

The genotype-to-phenotype problem

The advent of high-throughput sequencing has given rise to new opportunities in clinical research and healthcare (Koboldt et al., 2013; van Dijk et al., 2014). It is now possible to sequence the genome of large cohorts of individuals and combine this data with phenotypic information (e.g., related to health or disease), in order to uncover DNA patterns that explain phenotypes of interest. The task of identifying such patterns is known as the genotype-to-phenotype problem and is the main topic of this thesis. Achieving this goal shows great promise in improving our understanding of the underlying biological mechanisms of diseases, while improving our ability to make accurate diagnoses and prognoses (Koboldt et al., 2013; van Dijk et al., 2014).

In this work, we address the genotype-to-phenotype problem by seeking models that accurately predict phenotypes based on a small set of genomic variations (e.g., mutations, insertions, deletions, etc.). Our primary objective is the accurate prediction of the phenotype, while our secondary objective is the interpretability of the resulting model. Specifically, we propose learning algorithms that produce models that can be explained to domain experts. As a consequence, the models can simultaneously be used for prediction (e.g., diagnostics, prognostics) and to investigate the biological mechanisms that underlie a phenotype. It is important to contrast this approach with classical genome-wide association studies (GWAS; see Bush and Moore (2012) for an introduction). In GWAS, all the genomic variations present in the data are screened for association with the phenotype using statistical hypothesis testing. Then, the variations that are deemed significantly associated are used to train a learning algorithm. This approach has the benefit of enumerating all genomic variations that are potentially associated with the phenotype. However, the variations used to fit the model are selected in an independent step, which can lead to suboptimal selection with regard to prediction accuracy.


In contrast, we favor an approach where the selection of genomic variations is built into the learning process and is based on their ability to compose an accurate model.
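The contrast can be sketched with scikit-learn on a hypothetical binary matrix of genomic variations (rows: individuals; columns: presence/absence of a variation); the L1-penalized model below merely stands in for the embedded, rule-based selection developed in this thesis.

    # Sketch: filter-based (GWAS-style) vs. embedded variation selection.
    import numpy as np
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(100, 1000))  # hypothetical genotype matrix
    y = X[:, 3]                               # phenotype tied to one variation

    # GWAS-style: screen variations independently, then fit on the survivors.
    screen = SelectKBest(chi2, k=10).fit(X, y)
    model_filter = LogisticRegression().fit(screen.transform(X), y)

    # Embedded: a sparsity-inducing penalty selects variations during learning.
    model_embedded = LogisticRegression(penalty="l1", solver="liblinear",
                                        C=0.1).fit(X, y)
    print(np.flatnonzero(model_embedded.coef_))  # variations kept by the model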

Predicting antibiotic resistance

The core application of this thesis is the prediction of antibiotic resistance phenotypes from bacterial genomes, which is an instance of the genotype-to-phenotype problem. The development of tools for the rapid detection of antibiotic resistant bacteria is of crucial importance in fighting the global spread of antibiotic resistance (Didelot et al., 2012; Centers for Disease Control and Prevention, 2013; Pulido et al., 2013; Kollef and Micek, 2014). Accelerating the diagnosis of antibiotic resistant infections is bound to shorten the time of empiric therapy, during which patients are treated based on the most likely cause of infection, which would lead to better patient outcomes and reduced costs (Török and Peacock, 2012; Pulido et al., 2013; Kollef and Micek, 2014).

Molecular mechanisms of antibiotic resistance

Before diving into the methods that can be used to detect antibiotic resistance, we briefly describe the molecular mechanisms that underlie such phenotypes. The presentation is based on Blair et al. (2015); the reader is referred to this work for additional details. The molecular mechanisms of antibiotic resistance can be grouped into three main categories: 1) those that minimize the concentration of antibiotics in the cell, 2) those that alter the antibiotic's target, and 3) those that inactivate the antibiotic. Bacteria can acquire resistance mechanisms through mutations or horizontal gene transfer, a process by which bacterial cells can exchange sets of genes (carried on mobile DNA molecules called plasmids).

The first category of mechanisms limits the effectiveness of an antibiotic by preventing it from accessing its target. This is achieved in two ways: reduced permeability or increased efflux. First, permeability refers to the ability of an antibiotic to penetrate a cell by traversing its outer membrane through channels (referred to as porins). Resistance arises due to mutations that reduce porin expression or render porins incapable of transporting the antibiotic. Second, bacterial efflux pumps transport the antibiotic out of the cell, rendering it ineffective. The overexpression of such pumps, due to mutations in regulatory genes, is a common resistance mechanism. Of note, some pumps are specific to one antibiotic, whereas others are less specific and can lead to multidrug resistance. This is particularly concerning, since multidrug-resistance efflux pumps have been found on plasmids that can transfer between bacteria.

The second category of mechanisms prevents the proper functioning of antibiotics by altering their target. Once inside the cell, antibiotics typically function by binding to a specific target, which disrupts the cell's activity. Hence, resistance can be achieved by preventing the antibiotic from binding to its target. This can be done by introducing mutations in the target's binding site. It can also be achieved without mutations, through enzymes (encoded by genes) that alter the target through post-translational modifications (e.g., methylation) and prevent binding. Mutated target genes and target-altering enzymes can both be transferred between bacteria through horizontal gene transfer.

The third category of mechanisms relies on enzymes, encoded by resistance genes, to destroy or modify the antibiotic and render it ineffective. Enzymes that destroy antibiotics typically rely on hydrolysis, a process that cleaves antibiotic molecules. Such enzymes include the infamous β-lactamases, which confer resistance to β-lactam antibiotics, such as penicillin. Additionally, some enzymes, such as aminoglycoside acetyltransferases, modify the structure of antibiotic molecules, preventing them from binding to their target. In both cases, enzymes encoded by resistance genes are known to spread between bacteria through horizontal gene transfer.

In summary, molecular mechanisms of antibiotic resistance are diverse and mostly result from genome modification. Such mechanisms spread through bacterial populations due to clonal reproduction and horizontal gene transfer, accompanied by strong selective pressure due to antibiotic use. With most antibiotic resistance mechanisms taking root in the genome, it is reasonable to expect that accurate genotype-to-phenotype models of such phenotypes can be constructed.

Methods for the detection of antibiotic resistance

Current approach

The current methodology used to process bacterial samples involves many time-consuming steps that require significant human expertise (Didelot et al., 2012). Prior to determining which antibiotics are likely to be effective against a bacterial infection, a sample must be collected, the pathogen must be isolated from the sample, its species must be determined, and it must be cultured in the presence of various concentrations of the antibiotic (Didelot et al., 2012). This process can take from a few days to a few weeks depending on the nature of the pathogen (Didelot et al., 2012; Török and Peacock, 2012; Pulido et al., 2013; Zankari et al., 2013). In addition, some pathogens are difficult to grow, or even non-culturable, which requires the use of species-specific tests based on alternative approaches such as polymerase chain reaction (PCR) to identify segments of mutated genes conferring resistance (Didelot et al., 2012; Pennisi, 2012).

Whole genome sequencing

Whole bacterial genome sequencing shows great promise in improving and simplifying the current clinical microbiology diagnostic pipeline (Didelot et al., 2012; Goldberg et al., 2015; Land et al., 2015). Given the genome of a pathogen, one can rapidly infer its species and run a series of diagnostic tests (e.g., antibiotic resistance, virulence, etc.) in parallel (Goldberg et al., 2015; Ellington et al., 2017; Votintseva et al., 2017). Moreover, the computational nature of such tests allows them to be easily updated and extended (Pesesky et al., 2016). Due to the recent advances in next-generation sequencing, it is envisaged that this approach could allow routine antibiotic susceptibility testing with turnaround times as low as a single day (Didelot et al., 2012; Török and Peacock, 2012; Zankari et al., 2013; Tuite et al., 2014; Goldberg et al., 2015; Land et al., 2015). In recent work, Votintseva et al. (2017) showed a proof of concept of such a diagnostic pipeline for Mycobacterium tuberculosis, going from patient sample to diagnosis in less than a single day. However, prior to the integration of whole genome sequencing in routine practices, a good understanding of genotype-to-phenotype relationships needs to be established (Didelot et al., 2012; Zankari et al., 2013). This requires the development of efficient software and algorithms, databases to store bacterial genomes and known genotype-to-phenotype associations, as well as extensive validation (Didelot et al., 2012; Török and Peacock, 2012; McArthur et al., 2013; Zankari et al., 2013; Tuite et al., 2014; Kwong et al., 2015; Land et al., 2015). Nevertheless, it is expected that, over many years, whole genome sequencing will gradually become a standard approach to routine diagnostics in clinical microbiology laboratories (Didelot et al., 2012; Pulido et al., 2013; Zankari et al., 2013).

A growing body of data

The introduction of routine whole genome sequencing in clinical microbiology laboratories opens new possibilities for large-scale data collection. An increasing number of genome sequences are available in public repositories, such as the DNA Data Bank of Japan (Tateno et al., 2002), the European Nucleotide Archive (Leinonen et al., 2010), GenBank (Benson et al., 2018), and the Sequence Read Archive (Kodama et al., 2012). For instance, as illustrated in Figure I.3, the number of bacterial genomes available in GenBank has increased from 370 in 2007, to 126 010 in 2017, corresponding to a 340-fold increase over the past ten years3. Other repositories, such as the Gene Ontology (Ashburner et al., 2000; Gene Ontology Consortium, 2017), maintain information about genes, such as their function and the biological processes in which they take part. On top of these sources of data, databases such as the Pathosystems Resource Integration Center (PATRIC) database (Wattam et al., 2016; Antonopoulos et al., 2017) aggregate all of this complex information into a single resource. In fact, the PATRIC database regroups publicly available bacterial genomes along with metadata, such as measured antibiotic resistance phenotypes, phylogenetic classifications, and high quality gene annotations. This database constitutes a valuable resource for large-scale genotype-to-phenotype studies of antibiotic resistance and is one of the main sources of data used in this thesis.

3. Based on data extracted from https://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/prokaryotes.txt on April 14, 2018.



Figure I.3: Number of bacterial genomes in the GenBank database from 2007 to 2017.

Bioinformatics of antibiotic resistance prediction

The increasing abundance of bacterial genomic data has stimulated the development of bioinformatics tools for the prediction of antibiotic resistance from genome sequences. These tools can be grouped into two categories: 1) those that exploit a curated set of genomic variants that are known to be associated with resistance (referred to as curation-based methods) and 2) those that seek new genotype-to-phenotype associations (referred to as de novo methods), which are the topic of this thesis.

Both types of methods have been successfully used to predict antibiotic resistance phenotypes, such as discrete levels of resistance (e.g., resistant vs. susceptible) and quantitative measurements of resistance, such as the minimum inhibitory concentration (i.e., the minimum concentration of an antibiotic required to inhibit the growth of bacterial cells). Below, we review the literature on methods for predicting antibiotic resistance from genome sequences and categorize them accordingly.
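As a concrete illustration of how a quantitative measurement is turned into a discrete label, the sketch below binarizes MIC values against a clinical breakpoint. The isolates and the breakpoint are hypothetical; real breakpoints are antibiotic- and species-specific and are published by bodies such as CLSI and EUCAST.

```python
# Toy illustration of discretizing MICs into resistance phenotypes.
def mic_to_label(mic_ug_per_ml: float, breakpoint: float) -> str:
    """Label an isolate resistant if its MIC exceeds the clinical breakpoint."""
    return "resistant" if mic_ug_per_ml > breakpoint else "susceptible"

# Hypothetical isolates, hypothetical breakpoint of 4 ug/mL
for mic in [0.5, 2.0, 8.0, 32.0]:
    print(f"MIC = {mic:>4} ug/mL -> {mic_to_label(mic, breakpoint=4.0)}")
```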

Curation-based methods

Curation-based methods make predictions based on the presence or absence of a predefined set of genomic variations that are known to be antibiotic resistance determinants. This includes the presence of known resistance genes and mutations in the target site of antibiotics. Several public databases catalog resistance determinants and their effect on resistance to various antibiotics, among which are the ARG-ANNOT database (Gupta et al., 2013), the Comprehensive Antibiotic Resistance Database (CARD; McArthur et al., 2013), the Mobile Elements and Resistance Genes Enhanced for Metagenomics database (MERGEM; Déraspe, 2015), the ResFinder database (Zankari et al., 2012), and the Resfams database (Gibson et al., 2015).

The first step in any curation-based prediction method is the identification of resistance determinants in the genome of interest. Databases usually provide their own tool for this purpose. The ARG-ANNOT, CARD, and ResFinder databases rely on the Basic Local Alignment Search Tool (BLAST) (Altschul et al., 1990), an algorithm that can rapidly align small sequences (e.g., resistance genes) to larger ones (e.g., genomes) using sequence similarity. In contrast, the Resfams database relies on hidden Markov models to detect the presence of resistance determinants. Moreover, Bradley et al. (2015) have proposed to use De Bruijn graphs to represent genomes and genomic variations in order to rapidly detect resistance determinants.

Once resistance determinants have been identified, these methods make predictions based on two approaches: 1) rules based on expert knowledge, or 2) machine learning models (Pesesky et al., 2016). The first approach, which is the most common, consists of predicting that an isolate is resistant to an antibiotic if a predefined set of resistance determinants is present in its genome (Stoesser et al., 2013; Gordon et al., 2014; Hasman et al., 2014; Kos et al., 2014; Bradley et al., 2015; Walker et al., 2015; Metcalf et al., 2016; Pesesky et al., 2016; Moran et al., 2017). These approaches are constrained by current knowledge and are generally limited to detecting phenotypes that depend on a single determinant. In contrast, the second approach consists of training a machine learning algorithm to predict resistance phenotypes, using the presence or absence of resistance determinants as features (Rishishwar et al., 2013; Pesesky et al., 2016; Eyre et al., 2017; Yang et al., 2017b). This approach has the advantage of not being limited to expert-defined rules and has the ability to model phenotypes that depend on multiple determinants (Pesesky et al., 2016).
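The sketch below contrasts the two prediction strategies, assuming the resistance determinants have already been identified in each genome (e.g., with BLAST). The determinant names and phenotypes are hypothetical placeholders, not entries of any real database.

```python
from sklearn.tree import DecisionTreeClassifier

CURATED_DETERMINANTS = ["blaX", "geneY", "mutZ"]  # hypothetical curated set

def expert_rule(determinants_found: set) -> int:
    """Approach 1: predict resistant (1) if any curated determinant is present."""
    return int(any(d in determinants_found for d in CURATED_DETERMINANTS))

# Approach 2: learn from the presence/absence of determinants as binary features.
genomes = [{"blaX"}, {"geneY", "mutZ"}, set(), {"mutZ"}]
X = [[int(d in g) for d in CURATED_DETERMINANTS] for g in genomes]
y = [1, 1, 0, 0]  # hypothetical measured phenotypes
model = DecisionTreeClassifier(random_state=0).fit(X, y)
```

Unlike the fixed expert rule, the learned model can weigh combinations of determinants against the observed phenotypes, which is what allows it to capture multifactorial resistance.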

Finally, it must be noted that, by definition, curation-based predictors are unable to detect novel antibiotic resistance mechanisms (Pesesky et al., 2016; Jeukens et al., 2017a; Macesic et al., 2017). This issue is alleviated by the use of de novo methods, which we present in the next section and explore throughout this thesis.

De novo methods

In contrast with curation-based methods, de novo methods do not rely on prior knowledge of genomic variations associated with resistance. Rather, they use statistical evidence to detect patterns that are predictive of the phenotypes. Hence, these methods have the ability to detect novel genotype-to-phenotype associations and advance our understanding of phenotypes. Consequently, we have chosen to orient this thesis towards this type of method.

De novo methods must make use of a data representation that highlights the genomic variations that exist within the population. A typical approach consists in representing each genome by a set of single nucleotide polymorphisms (SNP), which are variations that occur at a single base pair location within the population (Brookes, 1999; Koboldt et al., 2013; Nielsen et al., 2011). However, this approach relies on multiple sequence alignment, which is computationally expensive and can fail in the presence of large-scale genomic rearrangements, such as horizontal gene transfer, a common occurrence in bacterial populations (Bonham-Carter et al., 2014; Leimeister et al., 2014; Song et al., 2014; Vinga and Almeida, 2003; Vinga, 2007). In contrast, reference-free methods, which represent each genome by a set of words, alleviate the need for multiple sequence alignment (Bonham-Carter et al., 2014; Leimeister et al., 2014; Song et al., 2014; Vinga and Almeida, 2003; Vinga, 2007). For example, in the k-mer representation, each genome is represented by the set of k-mers (i.e., short words of k nucleotides) that it contains. Genomes can then be compared based on the presence and absence of such words. The main downside of the k-mer representation is that it contains a lot of redundancy, due to the fact that many k-mers are always present or absent simultaneously. In this sense, Jaillard et al. (2017) and Jaillard et al. (2018) proposed to replace k-mers by unitigs, i.e., words of variable length with unique presence/absence patterns that are generated using compacted De Bruijn graphs. In this thesis, we use the k-mer representation due to its simplicity and effectiveness, but it is important to note that the proposed algorithms would also work with other representations, such as SNPs and unitigs.
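The following minimal sketch shows how a k-mer presence/absence representation is built. Real analyses rely on specialized k-mer counting tools (e.g., DSK, Jellyfish) to handle genome-scale data; this toy version only illustrates the representation itself.

```python
def kmers(sequence: str, k: int) -> set:
    """The set of all overlapping words of length k in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}

genomes = {"isolate_1": "ATGCGTACGT", "isolate_2": "ATGCCTACGT"}  # toy sequences
k = 4

# Vocabulary: every k-mer observed in any genome of the population
vocabulary = sorted(set.union(*(kmers(seq, k) for seq in genomes.values())))

# Binary feature matrix: one row per genome, one column per k-mer
X = {name: [int(w in kmers(seq, k)) for w in vocabulary]
     for name, seq in genomes.items()}
```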

Bacterial genome-wide association studies are a type of de novo method that uses statistical hypothesis testing with the aim of finding all genomic variations that are associated with a phenotype (see Power et al. (2017) for a review). Historically, the success of bacterial GWAS has been limited by the strong population structure that exists among bacteria (Chen and Shapiro, 2015; Falush, 2016; Power et al., 2017). In contrast with humans, bacteria evolve clonally, i.e., one cell divides into two descendants with nearly identical genomes. This tends to create strong biases in the data that can cause genomic variants to incorrectly seem associated with the phenotype. For example, any variation that is unique to a set of highly related individuals with the same phenotype (i.e., a clade of the phylogenetic tree) will appear to be associated with the phenotype, while it is truly an artifact of evolution (Chen and Shapiro, 2015). Recently, methods that perform bacterial GWAS while rejecting spurious associations arising from population structure have been proposed (Earle et al., 2016; Lees et al., 2016; Collins and Didelot, 2018). It is therefore increasingly feasible to detect relevant genotype-to-phenotype associations using this approach (Falush, 2016). The main advantage of the GWAS approach is that it is highly scalable, since the association between each genomic variation and the phenotype can be tested in parallel. However, this approach tends to detect associations between single genomic variations and the phenotype, and can fail to detect relevant patterns when phenotypes are multifactorial.
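The following sketch, assuming NumPy and SciPy, illustrates the basic GWAS-style screening step: each variant is tested for association with the phenotype (here with Fisher's exact test) under a Bonferroni correction. For simplicity, the population structure corrections discussed above are deliberately omitted.

```python
import numpy as np
from scipy.stats import fisher_exact

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 500))  # variant presence/absence matrix
y = rng.integers(0, 2, size=100)         # binary phenotype labels

threshold = 0.05 / X.shape[1]            # Bonferroni-corrected significance level
significant = []
for j in range(X.shape[1]):
    # 2x2 contingency table: variant presence/absence vs. phenotype
    table = [[np.sum((X[:, j] == a) & (y == b)) for b in (0, 1)] for a in (0, 1)]
    if fisher_exact(table)[1] < threshold:
        significant.append(j)
print(len(significant), "variants pass the corrected threshold")
```

Each test is independent of the others, which is what makes this approach trivially parallelizable, but also why it only captures single-variant associations.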

This brings us to machine learning algorithms, which attempt to learn a model that accurately predicts the phenotype using the observed genomic variations. In contrast with GWAS, such algorithms do not seek to detect all genotype-to-phenotype associations, nor do they attempt to find causal associations. Rather, their objective is to build a model that relies on a, possibly complex, combination of genomic variations to achieve accurate predictions of the phenotype. In this setting, algorithms that are interpretable, i.e., that expose the genomic basis for their predictions, are of particular interest. Such models can help discover new genotype-to-phenotype associations.
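To give a flavor of what such an interpretable model can look like, the toy sketch below greedily builds a conjunction of presence rules over genomic features. It is a deliberate simplification written for intuition, not one of the algorithms studied in this thesis.

```python
import numpy as np

def greedy_conjunction(X: np.ndarray, y: np.ndarray, max_rules: int = 2) -> list:
    """Greedily pick features for the rule: predict 1 iff all picked features are 1."""
    picked, covered = [], np.ones(len(y), dtype=bool)
    for _ in range(max_rules):
        if not covered.any():
            break
        # Score each feature by how well "feature present" matches the labels
        # among the samples still covered by the current conjunction.
        scores = [np.mean((X[covered, j] == 1) == (y[covered] == 1))
                  for j in range(X.shape[1])]
        best = int(np.argmax(scores))
        picked.append(best)
        covered &= X[:, best] == 1
    return picked  # indices of the genomic features used by the model
```

The resulting model is a short list of genomic features, each of which can be handed to a domain expert for biological interpretation.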

