
3.2.1 Comparative evaluation of computational methods

The action of measuring the quality of a process or its result against a predefined truth or representative dataset (the gold standard) is referred to as benchmarking. The term benchmark sometimes designates the gold standard set itself; for consistency and clarity throughout the present thesis, I will refer to benchmarking when evoking the evaluation process and to the gold standard set otherwise. In computational biology, multiple initiatives have been conducted to benchmark methods. Three major approaches can be identified: self-assessment against competitors upon publication of a tool, independent comparative studies, and broad community challenges. The Critical Assessment of protein Structure Prediction (CASP)16, which started as early as 1994 and published CASP1317 in 2018, is the oldest effort to federate an entire field around regularly benchmarking its methods. Several similar initiatives have appeared, such as the Assemblathon for genome assemblers18, the Quest for Orthologs19, or the Global Alliance for Genomics and Health20 dedicated to human variant calling. In metagenomics classification (the main focus of a subsequent chapter), several independent comparative studies had been published21–24 before the Critical Assessment of Metagenome Interpretation (CAMI)25 conducted a community-wide challenge aiming at unifying benchmarking in the field. However, the rather long delay between CAMI1 (started in 2015, results in 2017) and CAMI226 (started in 2018, unpublished in mid-2020) has not stopped developers from providing their own benchmarks upon publication27–29, although sometimes reusing CAMI1 datasets. In the aftermath of this challenge, additional intermittent comparative studies have been released30–33. Overall, the level of agreement on how to proceed and how to judge the outcomes depends on the extent of standardization desired and implemented by each field and by initiatives from the academic, governmental, or private sectors. Developers working on problems with little or no community still have to design their own self-evaluation when publishing a new method, as in the case of phage annotation tools34.


3.2.2 Quality evaluation of assemblies and annotated gene sets

The line between benchmarking a tool and evaluating its output is thin, as the assessment of a method necessarily involves investigating its results. There are situations in which the use of multiple approaches during routine analyses is too costly; comparative evaluations of methods using gold standard sets as described above are then an efficient way to select the most appropriate tool, ensuring the best outcome for most data. On the other hand, it is sometimes desirable to have a trial-and-error approach centered on evaluating the results obtained with the actual data, when the entire project consists of producing a single high-quality output. Sequenced genomes can be affected by sample collection, library preparation protocols, or the combination of different sequencing technologies. Therefore, it is common for researchers to compare the results of concurrent assembly and annotation runs on the same input data. As the starting points of many downstream experiments, assemblies and protein predictions need to undergo quality assessment to ensure their content reflects the underlying organism. First, basic "sanity checks" considering only the data can be used as preliminary validations to identify and correct technical artifacts35. Second, the cumulative size in bp of all assembled contiguous sequences (contigs) can be compared to the expected genome size as estimated with flow cytometry, which does not rely on sequencing; this provides an independent quantitative estimation of completeness. Additionally, continuity metrics can be calculated from the sequences, the most common being N50: the contig length such that contigs of at least that length together cover half of the total assembly size.
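As a minimal sketch, the N50 metric can be computed by summing contig lengths in decreasing order until half of the total assembly size is reached; the contig lengths below are invented for illustration:

```python
def n50(contig_lengths):
    """N50: the length L such that contigs of length >= L
    together cover at least half of the total assembly size."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0

# The same total size in fewer, longer contigs yields a higher N50.
fragmented = [100] * 50        # 5,000 bp in 50 contigs
contiguous = [4000, 600, 400]  # 5,000 bp in 3 contigs
print(n50(fragmented))   # 100
print(n50(contiguous))   # 4000
```

Note that N50 is purely a continuity statistic: it says nothing about whether the assembled bases are correct or complete, which motivates the marker-based approaches discussed below.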

A low N50 corresponds to a fragmented, low-quality assembly that will complicate the subsequent gene prediction process. These assessments, however, do not match the definition of benchmarking stated earlier, which involves a comparison to a gold standard set. A different approach exists for assessing genome completeness: using the expected gene content (i.e. markers) defined by previously sequenced organisms belonging to the same group to validate the assembly. This can be seen as an evolutionarily informed gold standard. By contrast with N50, this strategy relies on a biologically relevant feature: homology of sequences.

3.2.2.1 Universal orthologs as measuring units

Genomes evolve by duplications and losses of their content36 (genic and intergenic) or by acquiring external sequences through various horizontal transfer mechanisms37. In a genomics context, a gene is a portion of chromosome matching a specific pattern that can be expressed in the form of RNA; when this is a messenger RNA translated into amino acids, the gene is qualified as protein-coding. Two genes that share a common ancestral sequence are homologous. When the link between them is traced back to a speciation event, as opposed to a duplication in the common ancestral species or a horizontal transfer, the gene enters the subcategory of homology that is called orthology.

Orthologous genes, or orthologs, are crucial to comparative genomics as they usually represent the closest sequences shared by related species, evolutionarily constrained as a similar function is assumed38,39. Orthologs connecting all cellular life can still be identified thanks to fundamental biochemical pathways. The Clusters of Orthologous Groups40 (COG) was the pioneering method to delineate families of orthologs, i.e. all genes in a given clade that descend from the same copy in their shared ancestor. Families of orthologs can display different evolutionary histories and present distinct structures41; gene losses act against universality, while duplications following the initial speciation generate families of orthologs in which some species possess several members. As for their use in benchmarking, the ideal markers for assessing genome completeness are orthologs that i) have remained universal within the clade of interest; ii) are constrained to a single copy (low duplicability). Universal single-copy orthologs (USCO) (Figure 3.2) can be used i) to gauge completeness by assessing their presence; ii) to quantify the level of contamination, i.e. technical duplications or chimeric assembly of multiple organisms, by assessing their number of copies.
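The two uses of USCOs can be sketched as a simple tally over per-marker copy counts: completeness is the fraction of expected markers recovered at all, and duplication (a contamination proxy) is the fraction found in more than one copy. The marker names and counts below are hypothetical:

```python
def usco_summary(copy_counts):
    """Summarize USCO recovery from per-marker copy counts.

    copy_counts maps each expected universal single-copy ortholog
    to the number of copies found in the assembly (0 = missing).
    """
    n = len(copy_counts)
    present = sum(1 for c in copy_counts.values() if c >= 1)
    duplicated = sum(1 for c in copy_counts.values() if c >= 2)
    return {
        "completeness": present / n,    # fraction of markers recovered
        "duplication": duplicated / n,  # possible contamination/chimerism
    }

# Hypothetical counts for five markers: one missing, one duplicated.
counts = {"uscoA": 1, "uscoB": 1, "uscoC": 0, "uscoD": 2, "uscoE": 1}
print(usco_summary(counts))  # {'completeness': 0.8, 'duplication': 0.2}
```

This is only the counting logic; in practice the copy counts themselves come from homology searches of the assembly against the marker families.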

Figure 3.2. Illustration of orthology through hierarchical delineation of gene families.

Each shape pictures a gene; colored leaves represent extant species while nodes stand for ancestral states prior to a speciation event. When the clade under the root node (R) is considered, all genes connected by dashed lines are orthologs. The two kinds of square are duplicated genes belonging to the same group of orthologs. As the gene represented by a star is lost in one extant species, the USCO common to the clade under R are the genes depicted by the circle and the triangle. From the perspective of the clade below the node N2, the five different symbols are distinct USCOs, as the two types of square are linked by a duplication event that occurred prior to the speciation founding that clade.

___________________________________________________________

3.2.2.2 CEGMA, BUSCO, and CheckM

The first tool dedicated to using universal orthologs is the Core Eukaryotic Genes Mapping Approach (CEGMA)42. The authors' primary purpose was to extract a set of 458 eukaryotic universal protein-coding genes from input draft assemblies in order to have species-specific evidence to train an ab-initio gene predictor. However, with the CEGMA results in hand, it was clear that the ratio of recovered to expected genes would be a good proxy for benchmarking assembly completeness. The initial CEGMA approach did not enforce low duplicability as a criterion for selecting universal genes; a later refinement43 included a partial filtering of duplicated orthologs that decreased the number of markers to 248. CEGMA universal orthologs are a subset of the Clusters of Orthologous Groups for euKaryotes44 (KOG) and are based on six model species. They are represented by profile Hidden Markov Models (profile-HMMs) and can be predicted as "complete" or "partial". While CEGMA considers only the root level of eukaryotes, the Benchmarking Universal Single Copy Orthologs (BUSCO) sets were next introduced to enable a similar approach for different subclades of eukaryotes; they offer a higher resolution as more closely related organisms share a larger number of orthologs (Figure 3.2). BUSCO groups are a subset produced by the OrthoDB hierarchical catalog of orthologs45, determined by the rule of (near-)universality and low duplicability in 90% of the sampled species; they constitute the marker datasets used by the eponymous software to predict candidate genes in the input genome or transcriptome assembly and assess them with profile-HMMs.

Additionally, BUSCO evaluates protein annotation sets. The first release46,47 contained six eukaryotic datasets, namely Eukaryotes (429 genes), Metazoans, Arthropods, Vertebrates, Fungi, and Plantae. The software reports completeness in terms of expected gene content by classifying predicted genes as complete, present either in a single copy or duplicated, as well as gene fragments. While CEGMA has been discontinued, the BUSCO approach has the advantage of being bound to OrthoDB, which has provided regular updates to reflect the growing genome sequencing effort48. The initial release of the BUSCO software also contained a dataset for the root bacterial node, based on a different source of markers49. In parallel, the software CheckM50 was released to fulfil a similar task for non-eukaryotic organisms, i.e. bacteria and archaea; it defined multiple clade-specific marker sets and also considered the genomic location of each gene to correct for non-independent collocated markers. The data underlying the CheckM analyses were curated from the Integrated Microbial Genomes and Microbiomes system51 when the software was released in 2014 and have not been updated since 2015.
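The reporting logic behind the complete/duplicated/fragmented categories can be illustrated with a toy classifier; this is not the actual BUSCO implementation (which relies on gene prediction and profile-HMM alignments), only the final bookkeeping step, with hypothetical marker names:

```python
def classify_markers(hits):
    """Classify markers in the spirit of BUSCO's categories.

    hits maps marker -> list of booleans, one per gene copy found,
    True if the copy covers the expected (full) length.
    Categories: C = complete single-copy, D = complete duplicated,
    F = fragmented (only partial copies), M = missing.
    """
    counts = {"C": 0, "D": 0, "F": 0, "M": 0}
    for marker, copies in hits.items():
        full = [c for c in copies if c]
        if len(full) == 1:
            counts["C"] += 1
        elif len(full) > 1:
            counts["D"] += 1
        elif copies:  # matches found, but none full-length
            counts["F"] += 1
        else:
            counts["M"] += 1
    return counts

hits = {
    "m1": [True],        # complete, single copy
    "m2": [True, True],  # complete, duplicated
    "m3": [False],       # fragment only
    "m4": [],            # missing
}
print(classify_markers(hits))  # {'C': 1, 'D': 1, 'F': 1, 'M': 1}
```

High D counts flag possible contamination or assembly chimerism, while high F and M counts flag incompleteness, mirroring the two uses of USCOs described above.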
