Benchmarking in (meta-) genomics: LEMMI & BUSCO


SEPPEY, Mathieu. Benchmarking in (meta-) genomics: LEMMI & BUSCO. Thèse de doctorat : Univ. Genève, 2020, no. Sc. 5496

DOI: 10.13097/archive-ouverte/unige:143927
URN: urn:nbn:ch:unige-1439276

Available at: http://archive-ouverte.unige.ch/unige:143927

Disclaimer: layout of this document may differ from the published version.


UNIVERSITÉ DE GENÈVE

FACULTÉ DE MÉDECINE
Département de médecine génétique et développement
Professeur Evgeny M. Zdobnov

FACULTÉ DES SCIENCES
Département d'informatique
Dr. Frédérique Lisacek

___________________________________________________________

Benchmarking in (meta-) genomics:

LEMMI & BUSCO

THÈSE

Présentée à la Faculté des sciences de l'Université de Genève
pour obtenir le grade de Docteur ès sciences, mention bioinformatique

Par

Mathieu Seppey

De Hérémence (Valais)

Thèse N°5496

Genève, 2020


À mon papa


ACKNOWLEDGEMENTS

I start by warmly thanking my thesis supervisor, Prof. Evgeny Zdobnov, for trusting me and giving me the opportunity to spend these years in his research group exploring the world of bioinformatics and comparative genomics. Then, I would like to thank the thesis committee, Dr. Frédérique Lisacek, Prof. Patrick Ruch, and Prof. Ioannis Xenarios, for investing their time in evaluating my work.

I am also thankful to all my past and present colleagues in the "ezlab" group, from whom I have learned a lot. As this work was finalized far away from Geneva in a strange time of confinement, I already miss our habit of going together to the food truck in the park every Wednesday, as we had for years. A special thanks goes to Matthew and Chris for proofreading several of the texts constituting this thesis. In addition, during my time at the universities of Lausanne, Vancouver, and Geneva, I have encountered numerous professors, teaching assistants, and students who have shared their passion for science; some have remained friends, all deserve gratitude.

Finally, it would have been difficult for me to ever present this work without the encouragement of my family during all these years since I decided to take a new path in my life and start a scientific curriculum. My mother Antoinette, my father Aimé, and my brother François have been watching this closely, sometimes a bit puzzled, but always supportive in a caring environment. My father left us just before he could see me reach this milestone, but he was there to appreciate all the steps leading to it.


CONTENTS

1 Abstract ... 8

2 Résumé ... 9

3 Introduction ... 10

3.1 Challenges facing computational biology in the age of high-throughput sequencing ... 10

3.2 Benchmarking ... 13

3.2.1 Comparative evaluation of computational methods ... 13

3.2.2 Quality evaluation of assemblies and annotated gene sets ... 14

3.2.2.1 Universal orthologs as measuring units ... 14

3.2.2.2 CEGMA, BUSCO, and CheckM ... 16

3.3 Software containers ... 17

3.4 Metagenome research ... 19

3.4.1 Uncovering the uncultured majority ... 19

3.4.2 WGS taxonomic classifiers ... 19

3.4.3 Taxonomic resources ... 21

3.4.4 Benchmarking WGS taxonomic classifiers ... 22

3.5 Thesis outline ... 27

3.5.1 Continuous benchmarking of WGS taxonomic classifiers with LEMMI ... 27

3.5.2 Consolidating BUSCO ... 28

3.5.3 Additional work ... 29

4 Results ... 30

4.1 The proof of concept and initial release of LEMMI ... 30

4.2 Releasing enhanced BUSCO pipelines ... 57

4.2.1 Updating the markers to version odb9 and introducing BUSCO v2 ... 57

4.2.2 BUSCO v3 ... 58

4.3 Documenting BUSCO usage ... 59

4.3.1 Methods in Molecular Biology: Gene Prediction ... 59


4.3.2 Methods in Molecular Biology: Insect Genomics ... 59

4.4 The genome of the banded demoiselle ... 107

5 Discussion and perspectives ... 124

5.1 LEMMI taxonomic classifiers ... 124

5.2 BUSCO ... 132

6 Conclusion ... 136

7 Bibliography ... 137

8 Appendix ... 149

8.1 Supplementary material Seppey et al. 2020 ... 149

8.2 Supplementary material Waterhouse et al. 2018 ... 171

8.3 Supplementary material Ioannidis et al. 2017 ... 193


1 ABSTRACT

The amount of information obtained by studying the genomes of isolated organisms as well as by metagenomic sampling has grown steadily during the past decade. The complexity and imperfections of sequencing processes translate into noisy in-silico data that necessitate a variety of methods and heuristics to deliver a biological interpretation of their nucleotide content; this includes for instance genotyping and taxonomic identification. As some results are expected to fit reality better than others, criteria and strategies to judge the quality of the data and the performance of the methods are needed to guide experimental choices. Evaluations should be neutral, systematic, and reproducible. In this work, I explore two aspects of assessing computational processes dedicated to biological research, and their results, against predefined gold standards. They are addressed through the development of two distinct bioinformatics resources dedicated to benchmarking. i) I introduce the LEMMI platform; it hosts a continuous evaluation of taxonomic classifiers to which candidate methods are submitted wrapped in software containers. They are assessed on their ability to describe metagenomic samples simulated from publicly available data, as well as on the computing resources they use. To guarantee the fairest evaluation possible, the reference database can be controlled during the benchmark and the entire procedure is free from human-made, arbitrary decisions. The results can be explored through online dynamic plots and rankings. ii) The BUSCO software uses universally conserved orthologs to assess the completeness of genomic sequences and their protein annotation sets. It employs marker genes organized in clade-specific datasets covering eukaryotes, bacteria, and archaea. We have reworked the initial version to offer a reliable pipeline and to make it evolve with the source of its datasets, the OrthoDB catalog of orthologs. We have also made a significant effort to document proper and reproducible uses of BUSCO, both for its main purpose of completeness assessment and for additional objectives.


2 RÉSUMÉ

La quantité d'informations obtenues lors de l'étude des génomes d'organismes isolés ainsi que ceux provenant de métagénomes a augmenté de façon constante durant la dernière décennie. La complexité des processus de séquençage conduit à l'obtention de données électroniques imparfaites avec un niveau élevé de bruit. Celles-ci nécessitent une variété de méthodes et heuristiques pour donner une interprétation biologique à leur contenu en nucléotides. Ceci inclut notamment le génotypage et l'identification taxonomique. Comme certains résultats reflètent mieux la réalité que d'autres, des critères et des stratégies pour juger de la qualité des données et des performances des méthodes sont nécessaires pour guider les choix expérimentaux. Ces évaluations doivent être neutres, systématiques et reproductibles. Dans ce travail, j'explore deux aspects liés à l'évaluation de processus informatiques dédiés à la recherche biologique, ainsi que de leurs résultats, en se basant sur des données étalons prédéfinies. Ces aspects sont abordés au travers du développement de deux ressources bio-informatiques distinctes dédiées à l'évaluation des performances (benchmark). i) Je présente la plate-forme LEMMI; elle fournit une évaluation continue des outils de classification taxonomique par le biais de conteneurs de logiciels dans lesquels sont encapsulées les méthodes candidates. Elles sont jugées sur leur habilité à décrire les échantillons métagénomiques simulés à partir de données publiques ainsi qu’en fonction de leur usage des ressources informatiques. Pour garantir l'évaluation la plus juste possible, la base de données de référence peut être contrôlée durant le benchmark et le processus entier se fait sans décision humaine arbitraire. Les résultats peuvent être consultés au travers de graphiques et de classements dynamiques. ii) Le logiciel BUSCO utilise des gènes orthologues conservés universellement pour évaluer le niveau de complétion de séquences génomiques et de leurs annotations de protéines. Il utilise des gènes marqueurs organisés en jeux de données spécifiques à des lignées et couvre les eucaryotes, les bactéries et les archées. Nous avons retravaillé la version initiale pour offrir un outil fiable et le faire évoluer avec la source de ses données qui est le catalogue de gènes orthologues OrthoDB. Nous avons également fait un effort conséquent pour documenter un usage correct et reproductible de BUSCO, à la fois pour son but principal qui est d'évaluer le niveau de complétion des séquences, mais aussi pour des objectifs additionnels.


3 INTRODUCTION

3.1 Challenges facing computational biology in the age of high-throughput sequencing

Genomics has emerged as a major component of molecular biology. It is dedicated to deciphering and handling the sequences upon which all life is based: long chains of (deoxy-) ribonucleic acid (DNA, RNA) organized in chromosomes that collectively form genomes ranging from thousands to billions of base pairs (bp). These sequences can be recovered from the cells of macroscopic and microscopic organisms, as well as from non-cellular entities such as viruses. The objectives of genomics are multiple: from studying the organization and the evolution of genomes across related species1, to building phylogenetic trees of entire groups2, to cataloging the function of each genic and intergenic portion of the human genome3. Converting RNA to complementary DNA enables the quantification of gene expression, which allows for instance the assessment of the transcriptional response when a bacterial species is exposed to different environmental stresses4. Genomics is now a key player in medicine and agriculture; it enables the identification of rare variants that are beneficial or detrimental in human5, animal6, and plant genomes7, the identification of pathogenic organisms8, and the tracking of the presence of antibiotic resistance genes9.

Genomic investigations are driven by the readout of DNA sequencers producing raw data in the form of reads that can then be resolved, by the use of advanced methods, into more elaborate datasets such as assemblies and protein sets. The length of sequenced reads ranges from dozens to thousands of bp, and modern sequencing methods are referred to as either short-read or long-read technologies. Illumina devices have led the market over the past decade with their flow cells enabling massively parallel production of short reads through cluster amplification followed by sequencing by synthesis combined with fluorescence signal detection. While it has enabled high throughput, a small read size is not a desirable property for sequencing as it complicates the subsequent assembly process. Therefore, longer-read technologies have emerged, with the machines from Pacific Biosciences and Oxford Nanopore; the latter is known for its MinION portable device. These technologies have yet to overcome the well-established Illumina offering, as they suffer from an error rate higher by an order of magnitude, owing to their single-molecule approach, as opposed to cluster amplification. Long-read technologies are evolving rapidly and are likely to play a major role in future improvements in all fields of genomics. DNA sequencing technologies are reviewed in detail by Knetsch et al.10. Costs that were initially prohibitive have decreased dramatically over the years with the diversification of high-throughput technologies (Figure 3.1a). As a consequence, the barrier imposed by limited access to data has fallen and the sequences of thousands of organisms are made available in public databases every year (Figure 3.1b); some represent novel taxa, while others are the result of re-sequencing based on different technologies. These aim at complementing or improving the material already available for a species but create redundancies. Additionally, this ever-growing amount of data has detrimental effects as it brings new costs and limitations for performing biological experiments; these now lie in obtaining the resources and human competencies to conduct analyses. This has been a strong incentive to develop novel methods and heuristics in the field of bioinformatics, the engineering side of computational biology. Innovations have aimed at increasing the accuracy of the results and the speed of the analyses, and at reducing the volatile memory, the number of central processing units (CPUs), and the storage capacity necessary for computations. Numerous approaches have been published, for instance, to resolve genome assemblies11 or to perform read alignment against a reference followed by single nucleotide polymorphism variant calling12.


Figure 3.1. a. Decreasing costs of sequencing over time. Wetterstrand KA. DNA Sequencing Costs: Data from the NHGRI Genome Sequencing Program (GSP), available at www.genome.gov/sequencingcostsdata, accessed June 1st, 2020. b. Number of whole genome sequencing entries deposited on GenBank. https://www.ncbi.nlm.nih.gov/genbank/statistics, accessed June 1st, 2020.

Nowadays, data and method implementations can circulate rapidly and efficiently thanks to public databases centralizing biological sequences13–15 and version control systems provided by platforms offering free code hosting such as GitHub (github.com) and GitLab (gitlab.com). Furthermore, preprint servers (e.g. biorxiv.org) provide open access to non-peer-reviewed publications describing any method hosted on these code repositories. The drawback of such a rich and dynamic scientific ecosystem is its weak ability to guarantee the quality, the maintenance, and sometimes the mere access to these resources. Nevertheless, good scientific research does require proven methods and validated data as only these can be translated into daily clinical practice or become part of environmental monitoring solutions. Therefore, it is necessary to develop strategies dedicated to assessing bioinformatics algorithm implementations and the quality of genomic data they produce. In both cases, comparisons built on pre-existing knowledge can be an effective approach.

3.2 Benchmarking

3.2.1 Comparative evaluation of computational methods

The action of measuring the quality of a process or its result against a predefined truth or representative dataset (the gold standard) is referred to as benchmarking. The term benchmark sometimes indicates the gold standard set itself. For consistency and clarity throughout the present thesis, I will refer to benchmarking when evoking the evaluation process and to the gold standard set otherwise. In computational biology, multiple initiatives have been conducted to benchmark methods. Three major approaches can be identified: self-assessment against competitors upon publication of a tool, independent comparative studies, and broad community challenges. The Critical Assessment of protein Structure Prediction (CASP)16, which started as early as 1994 and published CASP1317 in 2018, is the oldest effort to federate an entire field around regularly benchmarking its methods. Several similar initiatives have appeared, such as the Assemblathon for genome assemblers18, the Quest for Orthologs19, or the Global Alliance for Genomics and Health20 dedicated to human variant calling. In metagenomics classification (the main focus of a subsequent chapter), several independent comparative studies had been published21–24 before the Critical Assessment of Metagenome Interpretation (CAMI)25 conducted a community-wide challenge aiming at unifying benchmarking in the field. However, the rather long delay between CAMI1 (started in 2015, results in 2017) and CAMI226 (started in 2018, unpublished in mid-2020) has not stopped developers from providing their own benchmarks upon publishing27–29, although sometimes reusing CAMI1 datasets. In the aftermath of this challenge, additional intermittent comparative studies have been released30–33. Overall, the level of agreement on how to proceed and how to judge the outcomes depends on the extent of standardization desired and implemented by each field and on initiatives from the academic, governmental, or private sector. Developers working on problems with little or no community still have to design their own self-evaluation when publishing a new method, as in the case of phage annotation tools34.


3.2.2 Quality evaluation of assemblies and annotated gene sets

The line between benchmarking a tool and evaluating its output is thin, as the assessment of a method necessarily involves investigating its results. There are situations in which the use of multiple approaches during routine analyses is too costly; comparative evaluations of methods using gold standard sets as described above are then an efficient way to select the most appropriate tool, ensuring the best outcome for most data. On the other hand, it is sometimes desirable to have a trial-and-error approach centered on evaluating the results obtained with the actual data when the entire project consists of producing a single high-quality output. Sequenced genomes can be affected by sample collection, library preparation protocols, or the combination of different sequencing technologies. Therefore, it is common for researchers to compare the results of concurrent assembly and annotation runs on the same input data. As the starting points of many downstream experiments, assemblies and protein predictions need to undergo quality assessment to ensure their content reflects the underlying organism. First, basic "sanity checks" considering only the data can be used as preliminary validations to identify and correct technical artifacts35. Second, the cumulative size in bp of all assembled contiguous sequences (contigs) can be compared to the expected size as estimated with flow cytometry, which does not rely on sequencing; this provides an independent quantitative estimation of completeness. Additionally, continuity metrics can be calculated from the sequences, the most common being N50, which indicates the contig size above which half of the recovered genome is assembled. A low N50 corresponds to a fragmented and low-quality assembly that will complicate the subsequent gene prediction process. These assessments, however, do not match the definition of benchmarking stated earlier, which involves a comparison to a gold standard set. A different approach exists for assessing genome completeness: using the expected gene content (i.e. markers) defined by previously sequenced organisms belonging to the same group to validate the assembly. This can be seen as an evolutionarily informed gold standard. By contrast with N50, this strategy relies on a biologically relevant feature: homology of sequences.
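To make the continuity metric concrete, the short sketch below (illustrative only, not taken from any published tool) computes N50 from a list of contig lengths: contigs are sorted from longest to shortest and the N50 is the length at which the cumulative sum first reaches half of the total assembly size.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given its contig lengths (in bp).

    N50 is the contig length such that contigs of that length or longer
    contain at least half of the total assembled bases.
    """
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= total / 2:
            return length
    return 0  # empty assembly


# Toy example: half of the 100 kb total is reached at the 30 kb contig.
print(n50([40_000, 30_000, 20_000, 10_000]))  # -> 30000
```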

3.2.2.1 Universal orthologs as measuring units

Genomes evolve by duplications and losses of their content36 (genic and intergenic) or by acquiring external sequences through various horizontal transfer mechanisms37. In a genomics context, a gene is a portion of a chromosome matching a specific pattern that can be expressed in the form of RNA; when this is a messenger RNA translated into amino acids, the gene is qualified as protein-coding. Two genes that share a common ancestral sequence are homologous. When the link between them is traced back to a speciation event, as opposed to a duplication in the common ancestral species or a horizontal transfer, the gene enters the subcategory of homology that is called orthology. Orthologous genes, or orthologs, are crucial to comparative genomics as they usually represent the closest sequences shared by related species, evolutionarily constrained as a similar function is assumed38,39. Orthologs connecting all cellular life can still be identified thanks to fundamental biochemical pathways. The Clusters of Orthologous Groups40 (COG) was the pioneering method to delineate families of orthologs, i.e. all genes in a given clade that descend from the same copy in their shared ancestor. Families of orthologs can display different evolutionary histories and present distinct structures41; gene losses act against universality, while duplications following the initial speciation generate families of orthologs in which some species possess several members. As for their use in benchmarking, the ideal markers for assessing genome completeness are orthologs that i) have remained universal within the clade of interest; ii) are constrained to being single copy (low duplicability). Universal single copy orthologs (USCO) (Figure 3.2) can be used i) to gauge completeness by assessing their presence; ii) to quantify the level of contamination, i.e. technical duplications or chimeric assembly of multiple organisms, by assessing their number of copies.

Figure 3.2. Illustration of orthology through hierarchical delineation of gene families. Each shape pictures a gene; colored leaves represent extant species while nodes stand for ancestral states prior to a speciation event. When the clade under the root node (R) is considered, all genes connected by dashed lines are orthologs. The two kinds of square are duplicated genes belonging to the same group of orthologs. As the gene represented by a star is lost in one extant species, the USCOs common to the clade under R are the genes depicted by the circle and the triangle. From the perspective of the clade below the node N2, the five different symbols are distinct USCOs as the two types of square are linked by a duplication event that occurred prior to the speciation founding that clade.

___________________________________________________________

3.2.2.2 CEGMA, BUSCO, and CheckM

The first tool dedicated to using universal orthologs is the Core Eukaryotic Genes Mapping Approach (CEGMA)42. The primary purpose intended by the authors was to extract a set of 458 eukaryotic universal protein-coding genes from input draft assemblies to have species-specific evidence to train an ab-initio gene predictor.

However, with the CEGMA results in hand, it was clear that considering the ratio of recovered to expected genes would be a good proxy for benchmarking assembly completeness. The initial CEGMA approach did not enforce low-duplicability as a criterion for selecting universal genes. A later refinement43 included a partial filtering of duplicated orthologs that decreased the number of markers to 248. CEGMA universal orthologs are a subset of the clusters of Orthologous Groups for euKaryotes44 (KOG) and are based on six model species. They are represented by profile-Hidden Markov Models (profile-HMMs) and can be predicted as "complete" or "partial". While CEGMA considers only the root level of eukaryotes, the Benchmarking Universal Single Copy Orthologs (BUSCO) sets were next introduced to enable a similar approach for different subclades of eukaryotes; they offer a higher resolution as more closely related organisms share a larger number of orthologs (Figure 3.2). BUSCO groups are the subset produced by the OrthoDB hierarchical catalog of orthologs45, determined by the rule of (near-) universality and low duplicability in 90% of the sampled species; they constitute the marker datasets used by the eponymous software to predict candidate genes in the input genome or transcriptome assembly and assess them with profile-HMMs.

Additionally, BUSCO evaluates protein annotation sets. The first release46,47 contained six eukaryotic datasets, namely Eukaryotes (429 genes), Metazoans, Arthropods, Vertebrates, Fungi, and Plantae. The software reports the completeness in terms of expected gene content by predicting genes that are complete, present either in a single copy or duplicated, as well as gene fragments. While CEGMA has been discontinued, the BUSCO approach has the advantage of being bound to OrthoDB, which has provided regular updates to reflect the growing genome sequencing effort48. The initial release of the BUSCO software also contained a dataset for the root bacterial node, based on a different source of markers49. In parallel, the software CheckM50 was released to fulfil a similar task for non-eukaryotic organisms, i.e. bacteria and archaea; it defined multiple clade-specific marker sets and also considered the genomic location of each gene to correct for non-independent collocated markers. The data underlying the CheckM analyses were curated from the Integrated Microbial Genomes and Microbiomes system51 when the software was released in 2014 and have not been updated since 2015.
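To illustrate how marker recovery translates into a completeness report, the following minimal sketch (a simplified illustration, not the actual BUSCO or CheckM code; marker names and the "fragment" encoding are hypothetical) tallies per-marker search outcomes into the categories usually reported: complete single-copy, complete duplicated, fragmented, and missing.

```python
from collections import Counter


def summarize_markers(marker_hits):
    """Summarize per-marker hit counts into completeness categories.

    marker_hits maps each expected universal single-copy ortholog (USCO)
    to the number of complete copies found; for simplicity of the sketch,
    a partial match is encoded as the string "fragment".
    """
    counts = Counter()
    for marker, hit in marker_hits.items():
        if hit == "fragment":
            counts["fragmented"] += 1
        elif hit == 0:
            counts["missing"] += 1
        elif hit == 1:
            counts["complete_single"] += 1
        else:  # more than one complete copy: duplication or contamination
            counts["complete_duplicated"] += 1
    total = len(marker_hits)
    return {cat: 100.0 * n / total for cat, n in counts.items()}


# Hypothetical search results for four markers.
print(summarize_markers({"usco1": 1, "usco2": 2, "usco3": "fragment", "usco4": 0}))
```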

3.3 Software containers

Comparative evaluations of computational methods usually stand on static publications, offering no guarantee that an efficient tool remains accessible for use in a real setting on a long-term basis. Difficulties in preserving access to methods over time have been an extensive problem in biomedical research52. Archival stability is not only essential for conducting new analyses if a tool is recognized as high-performing, but also for ensuring reproducibility, which is an important aspect of the scientific effort. Moreover, with the increasing number of tools to choose from, it is impossible for high performance computing (HPC) environment staff to support all new methods. To tackle these problems, virtualization can be leveraged; it enables packaging applications in easily deployable and distributable images, allowing users to set up pipelines that can be archived and used by others on different infrastructures without the installation burden. The main idea behind virtualization is to install and run an operating system (OS) designed to interact directly with the hardware in a higher-level setting running on a host system. This is achieved by specialized software that abstracts the low-level components so the guest OS can function as if it were natively installed on the machine.

There are multiple types of virtualization that differ by the extent of abstraction, isolation, and the layers that stand between the hardware and the guest OS. Two broad categories can be distinguished here: hardware-level virtualization, or hypervisors, and software-level virtualization, or containers (Figure 3.3). Hypervisors such as Oracle VM VirtualBox simulate the hardware, while container applications such as Docker only virtualize the OS kernel, which makes them more lightweight and less prone to performance overhead, close to running on a native OS53,54. Container images are built by following plain-text recipe files and can then be distributed manually or through online hubs. Docker (www.docker.com) is a container solution that has become very popular and has a free distribution and hosting infrastructure (hub.docker.com), but suffers from security concerns that have prevented it from establishing itself in scientific HPC; Singularity (https://sylabs.io/docs/) is an alternative solution that does not suffer from these safety issues55 and is compatible with Docker-based images. The popularity of Docker and the availability of Singularity on HPC make containers an efficient approach for method distribution. In the form of versioned recipe files, they are lightweight but not immune to the unavailability of source material, while as larger built images, they will remain operational as long as the container technology exists and is backward-compatible. To date, several initiatives have tried to organize and standardize the creation and distribution of containers for computational biology applications under unified concepts56,57.
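As an illustration of how containers simplify deployment, the snippet below is a minimal sketch (hypothetical image name and paths, not a component of LEMMI) showing how a pipeline could invoke a containerized tool through the Docker command line from Python, bind-mounting a host directory so that inputs and outputs persist outside the container.

```python
import subprocess
from pathlib import Path


def run_in_container(image, command, workdir):
    """Run `command` inside a Docker image, mounting `workdir` at /data.

    The container is removed after it exits (--rm); only files written
    to the mounted /data directory persist on the host.
    """
    workdir = Path(workdir).resolve()
    docker_cmd = [
        "docker", "run", "--rm",
        "-v", f"{workdir}:/data",  # bind-mount the host directory
        image,
    ] + command
    subprocess.run(docker_cmd, check=True)


# Hypothetical example: run a containerized classifier on reads in ./sample.
# run_in_container("example/classifier:1.0",
#                  ["classify", "--reads", "/data/reads.fastq",
#                   "--out", "/data/profile.tsv"],
#                  "./sample")
```

The same pattern applies to Singularity on HPC systems, using its own run command and bind-mount option in place of the Docker ones.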

Figure 3.3. Different architectures enabling virtualization of operating systems (OS) to distribute applications (App.); UI = user interface. Top left: no virtualization, an OS installed natively manages its hardware. Top right: type 1 hypervisor, a virtualization software manages the hardware and abstracts it for each guest OS. Bottom left: type 2 hypervisor, a virtualization software running as an application abstracts the hardware for the guest OS through the host OS. Bottom right: container engine, the guest OS is not fully installed; only high-level layers such as the file systems are virtualized and isolated processes are run through the host OS kernel. In these three types of virtualization, the guest OSs permit a "plug and play" manipulation; they are not bound to a local install and several of them can be loaded in parallel, unloaded, and distributed easily.


3.4 Metagenome research

3.4.1 Uncovering the uncultured majority

Before molecular biology techniques, the foundations of microbiology were built on knowledge obtained from petri dishes and were thus limited to organisms that could grow in an artificial medium. However, early surveys based on sequencing the small subunit (SSU) 16S ribosomal RNA (16S rRNA) gene, used as the main phylogenetic marker in bacteria and archaea, revealed that the majority of the phyla have never been cultured58. This finding, along with the increasing availability of high-throughput sequencers, has fuelled the study of entire ecological communities of microorganisms such as human microbiota59 or samples from natural areas60. Studies limited to organism identification and quantification have often relied on amplicon-based analyses of targeted markers; it is a popular approach as it is cost-effective, and broad-range primers can be used to amplify 16S rRNA sequences of most bacteria and archaea61. Microbial eukaryotes do not possess this marker, but equivalent studies can be conducted using its counterpart SSU, the 18S rRNA, or internal transcribed spacers that offer a better discriminative power in fungi62. The level of nucleotide identity, usually 97% in bacteria and archaea49, is used to cluster organisms into operational taxonomic units (OTU), a term that is preferred to species as these clusters provide no biological insight into the actual existence of species. Beyond these objectives, amplicon-based analyses are limited; the risk of missing taxa owing to inadequate primers cannot be excluded. In addition, the growing interest in strain-level interpretation63 is incompatible with the low discriminative power of a unique marker gene. Non-targeted whole genome sequencing of the total DNA and RNA content of a sample (WGS metagenomics) can overcome these limitations. It enables the design of protocols recovering all organisms at once, including viruses, improves the taxonomic resolution, and provides extra material to study functional aspects of each organism's genome. The drawback of WGS metagenomics is the increased computational burden induced by expanding the amount of data recovered for each organism.

3.4.2 WGS taxonomic classifiers

Some of the problems encountered when investigating WGS metagenomics samples had previously been solved for individual genomes, and existing workflows could be adapted to handle differences unique to metagenomes. For instance, many methods had been implemented for assembling reads into contigs from isolated or clonal samples, upon which metagenome-aware versions have been based (e.g. metaSPAdes64). On the other hand, new kinds of problems have arisen from studying complex microbial communities, notably how to define the identity of all organisms and their abundances relative to the sampled total; this is done by tools known as classifiers that combine binning (i.e. grouping sequences by composition or by similarity to a reference) and profiling (i.e. estimating the abundance of these groups), some of which also attach taxonomic labels to the sequences. Implementations of both tasks are often entangled, as profiling requires some sort of binning of either all sequences or a subset used to predict markers. While metagenome-assembled genomes are expected to contain more information for the task, reads are often the preferred input to rapidly classify complex microbial communities down to the lowest abundance organisms, for which assemblies are difficult owing to insufficient coverage. Moreover, assembly can be facilitated by an upstream classification of reads. During the process, sequences have to be clustered by composition to form OTUs or to be assigned a phylogenetically closest neighbor from taxonomically annotated references. Short read libraries often harbor patterns with no discriminative power, either made of low complexity repetitive elements or containing conserved sequences shared across multiple clades. Therefore, raw outputs of standard alignment tools like BLAST65 need refinement to produce a meaningful taxonomic description of the sample. Different strategies have been explored by classification methods; tools such as MEGAN66 and Kraken67 have introduced the use of the lowest common ancestor (LCA) to organize matches of reads to references into a multi-level taxonomic profile (Figure 3.4); tools such as CLARK68 have implemented ways to skim the reference to keep only the discriminative information at a given taxonomic level. These strategies remain imperfect, as they do not return a biologically accurate abundance when distributing reads across multiple levels (a read resolved at the genus rank will cause the underlying species abundance to be underestimated by one read). Bracken69, which is a companion tool for Kraken, re-estimates the true abundance by probabilistically redistributing all the reads down to the species level. Taxonomic classifiers vary in their technical approaches as well as in the type of results they produce; while approximately serving the same purpose, different families of methods can be distinguished. Some tools label all reads and also provide taxonomic composition profiles while others report only one of the two (abundance estimation is possible without binning the full read content); some use complete genomes as references while others are based on curated marker genes; some report read abundance while others report organism abundance proxied by genome copies or marker copies; finally, some are based on alignments while others rely on k-mer counting. Many methods have been developed in a continuous effort to cope with the ever-increasing number of reference genomes, which makes BLAST-like alignments impractical in terms of runtime and the use of comprehensive references extremely memory-demanding with most approaches.

Figure 3.4. Lowest common ancestor (LCA) approach as implemented by Kraken67. The reference database contains all oligonucleotides of a given length (k-mers) extracted from the represented genomes, each assigned to the LCA taxon of all sequences where it was found. When classifying a read, the counts of all occurring k-mers are used to find the best-supported root-to-leaf path (in red) and select the corresponding taxonomic label. The minimal number of matches supporting the node can be customized to a more stringent value, resulting in a higher level being assigned (for instance here the better-supported genus pictured by a circle). When given the same toy example, CLARK68, which is also k-mer-based but uses only discriminative entries at a given taxonomic level, would classify the query based on the grayed area and agree on the label at the species level.
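To make the path-scoring idea tangible, here is a simplified sketch (a toy taxonomy stored as a child-to-parent dictionary; not Kraken's actual data structures or code) of selecting the best-supported root-to-leaf path from per-taxon k-mer hit counts.

```python
def ancestors(taxon, parent):
    """Yield a taxon and all of its ancestors up to the root."""
    while taxon is not None:
        yield taxon
        taxon = parent.get(taxon)


def classify_read(kmer_hits, parent, leaves):
    """Pick the leaf whose root-to-leaf path accumulates the most k-mer hits.

    kmer_hits: taxon -> number of read k-mers whose reference LCA is that taxon.
    parent:    child taxon -> parent taxon (the root maps to None).
    leaves:    taxa without children (candidate final labels).
    Returns (None, 0) when no k-mer supports any path, i.e. unclassified.
    """
    best_leaf, best_score = None, 0
    for leaf in leaves:
        score = sum(kmer_hits.get(t, 0) for t in ancestors(leaf, parent))
        if score > best_score:
            best_leaf, best_score = leaf, score
    return best_leaf, best_score


# Toy taxonomy: root -> genus -> two species; most k-mers support species_A.
parent = {"genus": "root", "species_A": "genus", "species_B": "genus", "root": None}
hits = {"root": 1, "genus": 3, "species_A": 5, "species_B": 1}
print(classify_read(hits, parent, ["species_A", "species_B"]))  # ('species_A', 9)
```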

3.4.3 Taxonomic resources

To interpret newly acquired genomic data in the light of our understanding of evolution, one needs references annotated with metadata reflecting their taxonomic position, commonly based on seven main levels: superkingdom (domain), phylum, class, order, family, genus, and species. Amplicon-based analyses can rely on several databases containing SSU sequences bound to a taxonomic annotation, such as Silva70, while WGS metagenomics has mainly used the National Center for Biotechnology Information (NCBI) taxonomy71, which annotates sequences deposited for instance on RefSeq13. These resources offer unequal content in terms of sheer numbers of taxa but also display discrepancies72. Indeed, the science of naming and circumscribing organisms had existed long before the advent of sequencing, and the established phenotype-based taxonomy only partially agrees with sequence-based classification. An effort to reconcile taxonomy and phylogeny is ongoing, as seen with the Silva database, which has been curated to resolve conflicts in favor of phylogenetic evidence following an SSU guide tree. However, altering established knowledge is an arduous task and the NCBI taxonomy, which is not automatically generated from sequences, has adopted a balanced strategy to favor a phylogenetic taxonomy while still reflecting the consensus in the literature71. Recently, others have postulated that a phylogeny-based taxonomy for WGS sequences cannot be avoided to ensure usable classifications, given the increasing number of understudied organisms sampled by non-targeted metagenomics73. The Genome Taxonomy DataBase (GTDB) project74 has reprocessed the NCBI content and offers a taxonomy that is genome-based and thus disagrees with the NCBI version on a significant part of its content. This will add an extra layer of complexity when benchmarking taxonomic classification methods. Finally, the increasingly important strain level will have to be reintegrated by future taxonomic systems, as it was abandoned by the NCBI taxonomy in 2014 (https://www.ncbi.nlm.nih.gov/refseq/announcements/2013 accessed June 8th, 2020).

3.4.4 Benchmarking WGS taxonomic classifiers

With the increasing interest in metagenome research, the need for method evaluations has grown; Mavromatis et al.23 were the first in 2007 to conduct the assessment of three binning strategies in a study that also included assemblers and gene predictors at a time when metagenomics was at its beginning. By recreating in-silico communities from publicly available microbial sequences, they showed the importance of such an approach when no ground truth can be easily defined from actual samples.

The first large comparative study24 to focus on classifiers came in 2012, after the field had expanded. It presented the evaluation of 25 tools qualified as sequence classification programs. The authors distinguished three categories of tools, namely phylogenetic, similarity (i.e. alignment), and compositional approaches. They drew general conclusions about precision and runtime but stated that "the overall variance of the statistics makes it difficult to make definitive statements about the superiority of one program or method over another". The content of the reference database has a major influence on the classification, and this first study did not exclude the source of the artificial reads from the reference used to train some of the tools. The next study22, published in 2015, tested 14 methods considered relevant metagenome analysis tools and used artificially evolved genomes to mitigate the aforementioned problem, although exact matches caused by a common source remained, leading to unrealistic recalls. They concluded that differences among methods were large, that no tool was close to estimating true abundances, and reported runtimes varying by orders of magnitude. No clear method ranking was defined. The same year, Peabody et al.21 produced the first comparative benchmarking that controlled the content of the reference for tools allowing their database to be manipulated. They excluded the source of the reads and selectively removed entire clades to simulate a real-life scenario in which certain organisms had not been observed before75. They also reported that filtering low abundance taxa using a threshold was worth exploring to avoid false positives while maintaining a satisfying recall; an idea that was later reused by McIntyre et al.33 in a comparative study that evaluated combinations of methods using an ensemble approach. In parallel, the Critical Assessment of Metagenome Interpretation (CAMI) challenge took place between 2015 and 201725. CAMI covered several questions related to metagenomics, including binning and profiling, which were considered as non-overlapping problems, and the choice of the category was left to developers. Their approach regarding datasets was slightly different: while reads were generated in-silico from artificial microbial communities, their sources were not publicly available genomes but organisms sequenced for the challenge. This choice was an alternative to in-silico clade exclusion, ensuring no candidate tool would have the source of the reads in its reference, effectively eliminating the problem until the publication of the challenge material. CAMI also released companion papers describing methods and utilities for benchmarking56,76–78 and later a classifier based on their effort79. A second edition26 of CAMI is taking place in 2019-2020.
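As an illustration of the abundance-threshold idea reported by Peabody et al. and reused by McIntyre et al., the following minimal sketch (hypothetical values, not their actual implementations) removes taxa predicted below a relative-abundance cutoff and renormalizes the remaining profile.

```python
def filter_profile(profile, min_abundance=0.001):
    """Drop taxa below `min_abundance` (relative) and renormalize the rest.

    profile: taxon -> relative abundance (values summing to ~1.0).
    Low-abundance predictions are often false positives, so a cutoff can
    raise precision at the cost of missing genuinely rare organisms.
    """
    kept = {taxon: ab for taxon, ab in profile.items() if ab >= min_abundance}
    total = sum(kept.values())
    return {taxon: ab / total for taxon, ab in kept.items()} if total else {}


# Hypothetical predicted profile with a spurious low-abundance taxon.
print(filter_profile({"E. coli": 0.70, "B. subtilis": 0.299, "spurious sp.": 0.001},
                     min_abundance=0.01))
```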

Subsequent studies30,32 have generally aimed at improving minor aspects of these initial works, in addition to introducing alternative metrics, assessing different taxonomic levels, and evaluating newly available methods (Figure 3.5 and Table 3.1).

A meta-analysis of previous studies was conducted and provides a comprehensive review of the problem31. Finally, as many of these studies and other individual method publications have produced mock communities and in-silico reads, the Mockrobiota project was created to keep track of all available datasets80.

Figure 3.5. Assessment overlap among five prominent comparative studies having evaluated a total of 66 tools performing binning-like and profiling-like tasks on WGS sequencing data. Zeros are omitted for visual clarity. The number of tools included in each study is in brackets. The year corresponds to the first appearance of a preprint or a publication. Initial and subsequent versions of a method were considered as similar (Kraken = Kraken 2), variations of a method were considered as different (KrakenUniq != Kraken).


Table 3.1. Characteristics of comparative studies dedicated to metagenomics classifiers identified in recent years. The year corresponds to the first appearance of a preprint or a publication. Metrics (applied to classified reads or organisms): recall or sensitivity (SEN), specificity (SPEC), precision or positive prediction value (PPV), negative prediction value (NPV), F1-score (F1), area under the precision recall curve (AUPR), proportion of assigned reads (ASSIGN), true positives (TP), false positive (FP), true negative (TN), false negative (FN), Matthews Correlation Coefficient (MCC), any distance metric to real abundance such as L1-score (DIST), weighted UniFrac (UniFrac), accuracy (ACC) = (TP + TN)/(TP + FP + FN + TN), coverage (COV) = TP/Total expected results number, error per query (EPQ) = FP/Total query number, memory, runtime.

Mavromatis et al.23 (2007): 3 tools. Metrics: SEN, SPEC. Organisms: bacteria. Taxonomic levels: all + strain.

Bazinet et al.24 (2012): 25 tools. Metrics: SEN, SPEC, ASSIGN, memory, runtime. Organisms: bacteria, viruses, human. Taxonomic levels: all.

Lindgreen et al.22 (2015): 14 tools. Metrics: TP, FP, TN, FN, SEN, SPEC, PPV, NPV, MCC, ASSIGN, DIST. Organisms: bacteria, archaea, eukaryotes. Taxonomic levels: phylum, genus.

Peabody et al.21 (2015): 38 tools. Metrics: SEN, SPEC, ASSIGN. Organisms: bacteria, archaea. Taxonomic levels: all. Comments: for 11/38 tools, clade exclusion and identical reference construction.

McIntyre et al.33 (2017): 12 tools. Metrics: FP, SEN, SPEC, DIST, AUPR, memory, runtime. Organisms: bacteria, archaea, human. Taxonomic levels: genus, species, subspecies. Comments: mix of simulated and real titrated samples, including MinION; ensemble approach combining tools.

Gardner et al.31 (2017): 25 tools. Metrics: SEN, PPV, F1. Organisms: -. Taxonomic levels: -. Comments: meta-analysis.

Sczyrba et al.25 (2017): 19 tools. Metrics: FP, SEN, PPV, DIST, UniFrac. Organisms: bacteria, archaea, viruses. Taxonomic levels: all. Comments: run as a challenge with shared publication; simulation based on dedicated sequencing.

Escobar-Zepeda et al.32 (2018): 4 tools. Metrics: SEN, SPEC, MCC, ACC, COV, EPQ. Organisms: bacteria, archaea. Taxonomic levels: all + subspecies. Comments: compared with four other, amplicon-based, methods.

Ye et al.30 (2019): 17 tools. Metrics: AUPR, DIST, memory, runtime. Organisms: bacteria, archaea, viruses. Taxonomic levels: genus, species. Comments: identical reference construction; reuses CAMI sets.

* All stands for the six or seven main levels: (superkingdom), phylum, class, order, family, genus, species. NCBI taxonomy is assumed.
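To relate the abbreviations above to their definitions, the snippet below is a small sketch (toy numbers, not values from any of the listed studies) computing a few of the measures defined in the table caption from confusion-matrix counts and from predicted versus true abundance profiles.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Classification metrics as defined in the Table 3.1 caption.

    Assumes non-degenerate counts (no division by zero).
    """
    sen = tp / (tp + fn)                   # recall / sensitivity (SEN)
    ppv = tp / (tp + fp)                   # precision (PPV)
    f1 = 2 * ppv * sen / (ppv + sen)       # F1-score (F1)
    acc = (tp + tn) / (tp + fp + fn + tn)  # accuracy (ACC)
    return {"SEN": sen, "PPV": ppv, "F1": f1, "ACC": acc}


def l1_distance(predicted, true):
    """L1 distance (DIST) between two relative-abundance profiles."""
    taxa = set(predicted) | set(true)
    return sum(abs(predicted.get(t, 0.0) - true.get(t, 0.0)) for t in taxa)


# Toy example: 8 correctly identified taxa, 2 false positives, 1 missed taxon.
print(confusion_metrics(tp=8, fp=2, tn=0, fn=1))
print(l1_distance({"A": 0.6, "B": 0.4}, {"A": 0.5, "B": 0.4, "C": 0.1}))  # 0.2
```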


3.5 Thesis outline

As the number of contributions made to the field of metagenomics has soared (both in terms of data and methods), the aim of this thesis was to provide a solution able to highlight the most valuable methods and their references in a systematic way.

Methods should appear ranked according to multiple criteria to help non-expert users make informed decisions. The scope was broadly defined as metagenomics tools, and was later refined to taxonomic classifiers as the number of methods in this category made it an ideal choice to explore a new benchmarking strategy. In parallel, the BUSCO software maintained by our group has needed substantial maintenance and improvement to keep up with the field and match the needs and demands of its community of users. My contribution to this tool has grown to the point that it constitutes an equally important part of my work during the time dedicated to this thesis. While the two bioinformatics resources presented below have benchmarking as a common denominator, their specific objectives do not overlap and they are discussed separately in the rest of this manuscript.

3.5.1 Continuous benchmarking of WGS taxonomic classifiers with LEMMI

The pace of publication of new metagenomics classifiers in peer-reviewed journals and on preprint servers made us realize that existing comparative studies (Table 3.1) were well suited to discuss and drive the scientific progress of the field, but not to keep track of the best method implementations available at any given time. The CAMI challenge has been following a rather unclear agenda that does not fit the continuous need for benchmarking; independent comparative studies, which are not publicly announced, may overlook interesting methods. Therefore, it is likely that some valuable tools remain in the shadows owing to a lack of visibility when their developers need it, facing difficulties in reaching an audience in a saturated ecosystem. Moreover, the weak overlap between comparative studies (Figure 3.5) makes it difficult to gauge new tools against older methods and maintain a clear view of the evolution over time. To provide a different and complementary approach to address this problem, I developed a platform that hosts a benchmark of taxonomic classification methods wrapped in Docker containers, is continuously open to new submissions, and provides results through a dynamic web platform (Figure 3.6). The results section for the first part of this thesis contains the paper describing the platform. In the discussion, I cover the expected impact, limitations, and perspectives for the expansion of the Live Evaluation of Computational Methods for Metagenome Investigation (LEMMI) platform.

Figure 3.6. Visual abstract presenting the LEMMI platform.

3.5.2 Consolidating BUSCO

The initial version of the utility, BUSCO v147, released by our group and based on OrthoDB v746, was well received and gained momentum in the community working on assemblies of eukaryotic non-model species; CEGMA was discontinued and its developers promoted BUSCO as a possible replacement (https://www.acgt.me/blog/2015/5/18/goodbye-cegma-hello-busco accessed June 1st, 2020). As the new datasets based on OrthoDB v9 were in production, I reworked the initial BUSCO script to obtain two successive versions (BUSCO v2 and BUSCO v3) that were structured into robust pipelines, user-friendly, and error tolerant. Ever since, BUSCO (https://busco.ezlab.org/) has been distributed through virtual machines and hosted on a gitlab.com repository around which a community of users has formed. The results section below presents four publications related to this work: i) the OrthoDB v9.1 release, which introduces BUSCO v2 and the corresponding odb9 datasets; ii) the BUSCO v3 paper, which describes the evolution of the tool from its initial version and presents use cases for taking full advantage of the pipeline; iii) a chapter covering BUSCO in an edition of the book series "Methods in Molecular Biology" dedicated to gene prediction; iv) another chapter in the same series of books, focused on insect genomics. The BUSCO pipeline remains under continuous development and its latest release has aimed at improving the use of BUSCO for assessing the completeness of metagenome-assembled genomes (MAGs). The transition from BUSCO v3 to BUSCO v4 and the remaining challenges will be part of the discussion and perspectives section dedicated to the second part of this thesis.

3.5.3 Additional work

Our research group has sequenced several insect genomes in the framework of the i5k consortium81. I helped with this effort by working on one species. The last publication included in the results contains my contribution to the draft genome of the damselfly Calopteryx splendens.


4 RESULTS

4.1 The proof of concept and initial release of LEMMI

LEMMI: A continuous benchmarking platform for metagenomics classifiers82 (2020)

Mathieu Seppey, Mosè Manni, Evgeny M. Zdobnov

https://doi.org/10.1101/gr.260398.119

The manuscript included below is the version accepted and published online in Genome Research, Seppey et al., 2020.

Full text: pages 30 - 55

Website and code repository:

https://lemmi.ezlab.org/

https://gitlab.com/ezlab/lemmi

Context and contribution:

The starting point of this project was the wish to tackle the problem of benchmarking in the field of metagenomics from the angle of continuous evaluation, to better reflect the state of the art. I first explored what had been done by others, including the CAMI challenge, and retained the valuable ideas. One requirement was to reuse previously suggested exchange formats to contribute to standardization in the field and not attempt to bring unnecessary changes. I then concluded that mandatory containerization and the use of publicly available data for generating gold standard sets would be advantageous to enable highly controllable, flexible, and continuous evaluations. While LEMMI is intended to cover a broad range of methods in metagenomics, it was decided to limit the scope of our first continuous benchmarking project to taxonomic classifiers. From that point, I defined the precise approach, developed the pipeline, and created the website with input, help, and suggestions from other members of our group. During development, we benchmarked some of the most popular tools in the field and also integrated lesser-known methods. The ultimate goal is to establish a community wishing to make spontaneous contributions. Some developers of methods we included have already shown interest and provided updates.

This publication presents the platform and discusses relevant results.


4.2 Releasing enhanced BUSCO pipelines

4.2.1 Updating the markers to version odb9 and introducing BUSCO v2

OrthoDB v9.1: cataloging evolutionary and functional annotations for animal, fungal, plant, archaeal, bacterial and viral orthologs83 (2017)

Evgeny M. Zdobnov, Fredrik Tegenfeldt, Dmitry Kuznetsov, Robert M. Waterhouse, Felipe A. Simão, Panagiotis Ioannidis, Mathieu Seppey, Alexis Loetscher and Evgenia V. Kriventseva

http://doi.org/10.1093/nar/gkw1119

Full text: pages 59 - 64

Context and contribution:

The OrthoDB catalog of orthologs was first released in 2007 and had already been the subject of four publications45,46,48,84 accompanying its growth. This paper describes the transition from version 8 to version 9.1, which coincided with the introduction of plants, archaea, and viruses into the database. Added to the existing and newly introduced eukaryotes and bacteria, the total count of species surpassed 5,000 entries. To maximize the benefits of this wealth of data for the end-user, the development team made extensive revisions to the web interface as well as to the programmatic access to the content. My direct contribution to this update was the annotation of OrthoDB groups with COG functional categories85.

This paper also dedicates a paragraph to BUSCO v2, first introducing the update of the datasets to version "odb9", which takes advantage of the improved species sampling in OrthoDB v9 compared to the initial BUSCO sets based on OrthoDB v7. It then reports improvements in the underlying software, the transfer of the code to the GitLab platform, and the distribution of a VirtualBox machine to facilitate the use of BUSCO. I led and implemented most of these changes, notably by structuring the early BUSCO Python scripts into parent-child classes and improving the error management and notification aspects of the program.


4.2.2 BUSCO v3

BUSCO Applications from Quality Assessments to Gene Prediction and Phylogenomics86 (2018)

Robert M. Waterhouse, Mathieu Seppey, Felipe A. Simão, Mosè Manni, Panagiotis Ioannidis, Guennadi Klioutchnikov, Evgenia V. Kriventseva, and Evgeny M. Zdobnov

http://doi.org/10.1093/molbev/msx319

Full text: pages 65 - 70

Website and code repositories:

https://busco.ezlab.org/v3/

https://gitlab.com/ezlab/busco/-/tags/3.0.2

https://gitlab.com/ezlab/busco_usecases

Context and contribution:

We further edited the code underlying the BUSCO assessment procedure to improve the analysis and make it fully controllable through a configuration file; this would eventually facilitate its integration as part of broader pipelines, such as our OrthoDB platform. I again led this effort with input from other developers in our group. A notable change in the core algorithm was the use of extra amino acid sequences representing less common variants to help capture more divergent proteins in the first step of the procedure, which locates candidate genomic regions with tblastn65. The changes made to the software were significant enough to dub it BUSCO v3 in this paper, which demonstrates how BUSCO can be used beyond mere completeness assessment. My direct contribution to this work, beyond the development of BUSCO, was the analysis of publicly available Lactobacillus and Aspergillus genomes to highlight how BUSCO scores can help select high quality genomes, sometimes in partial disagreement with technical continuity metrics (N50) and annotation with the "reference genome" flag. I also used BUSCO to recreate a phylogeny of rodents agreeing with the literature by mixing publicly available genomes and transcriptomes; this illustrates the power of BUSCO marker genes and the associated software for easily conducting phylogenomics studies. Finally, to encourage the communication of BUSCO results in a harmonized fashion, I developed an R script to process multiple BUSCO outputs and create a figure presenting all results with a recognizable color scheme.


4.3 Documenting BUSCO usage

4.3.1 Methods in Molecular Biology: Gene Prediction

BUSCO: Assessing Genome Assembly and Annotation Completeness87 (2019)

Mathieu Seppey, Mosè Manni, Evgeny M. Zdobnov

http://doi.org/10.1007/978-1-4939-9173-0_14

Full text: pages 71 - 89

Context and contribution:

We answered an invitation to contribute to an edition of Methods in Molecular Biology (MiMB) covering gene prediction with a focus on eukaryotic genomes. This series of books provides protocols and is popular for its "Notes" section that centralizes practical tips on how to use the method. It was for us a unique opportunity to gather all the feedback and questions we have received through our user support over the years and write a guide on how to properly use and report a BUSCO assessment. I led the writing of this chapter with significant input from my co-authors. A different chapter of the same book describes gVolante88, a genome quality assessment web platform that is an example of a broader pipeline integrating BUSCO, as evoked above.

4.3.2 Methods in Molecular Biology: Insect Genomics

Using BUSCO to Assess Insect Genomic Resources89 (2019)

Robert M. Waterhouse, Mathieu Seppey, Felipe A. Simão, Evgeny M. Zdobnov

http://doi.org/10.1007/978-1-4939-8775-7_6

Full text: pages 90 - 105

Context and contribution:

This was written in response to a parallel invitation to describe BUSCO applied to insect genomics in a different edition of the MiMB series. My contribution to this chapter, written by RMW, is limited to discussions and review.

