
LEMMI taxonomic classifiers

My work on this first topic resulted in a benchmarking pipeline that conducts automated evaluations of taxonomic classifiers wrapped in Docker containers (automated in the sense that no human decision is required or permitted, although the current workflow still requires a manual trigger). It provides multiple metrics through a web interface accessible at https://lemmi.ezlab.org. It offers a reactive solution: newly submitted methods can appear online within days and receive an independent rating in LEMMI, including a measurement of the resources they use.
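As an illustration of what such an automated evaluation amounts to in practice, the minimal sketch below times a containerized classifier on one benchmark sample through the Docker command line. The image name, mount points, and output location are hypothetical placeholders, not the actual LEMMI container contract.

    import subprocess
    import time
    from pathlib import Path

    def run_containerized_classifier(image: str, reads: Path, reference: Path, out_dir: Path) -> float:
        """Run a classifier image on one sample and return its wall-clock runtime (seconds).

        The image name and mount points are illustrative; LEMMI defines its own
        container contract for inputs, outputs, and resource accounting.
        """
        out_dir.mkdir(parents=True, exist_ok=True)
        cmd = [
            "docker", "run", "--rm",
            "-v", f"{reads}:/input/reads.fq:ro",        # sample to classify (read-only)
            "-v", f"{reference}:/input/reference:ro",   # prebuilt reference (read-only)
            "-v", f"{out_dir}:/output",                 # where the tool must write its results
            image,
        ]
        start = time.monotonic()
        subprocess.run(cmd, check=True)
        return time.monotonic() - start

    if __name__ == "__main__":
        runtime = run_containerized_classifier(
            image="example/classifier:1.0",             # hypothetical image
            reads=Path("sample_50M.fq"),
            reference=Path("reference_db"),
            out_dir=Path("results"),
        )
        print(f"Wall-clock runtime: {runtime / 3600:.1f} h")

Peak memory could similarly be recorded, for instance from the container's cgroup accounting, which is what makes resource usage figures comparable across tools.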

In this section, I first summarize the benefits unique to our platform, then review desirable improvements while discussing the challenges posed by the automated evaluation of taxonomic classifiers. Finally, I consider possible expansions of the benchmark to other sub-fields of metagenomics.

At the time of writing, the LEMMI taxonomic classifiers platform has been publicly accessible for a year and a half and contains eight published methods27,29,67,68,93–96 evaluated using different versions and parameters. Figure 5.1 shows these methods ranked according to one of the many preset metrics available to visitors inspecting results on the web platform, along with the number of citations of each tool in the first half of 2020. The case of MetaCache29 is an example of a method that has remained in the shadows, as mentioned in the introduction. This tool appeared in mid-2017 in a peer-reviewed journal, claiming competitive performance against the already popular methods Kraken and CLARK, notably superior memory efficiency. However, it was not included in the first CAMI challenge and has not been selected by subsequent independent comparative studies30. Under the LEMMI criteria, the claims made by the MetaCache authors are supported: Kraken, and CLARK in its normal mode, cannot build references with the available amount of memory, whereas MetaCache can, and it also proves accurate when compared to other methods.

Unfortunately, this advantage is not reflected in the adoption of the tool as proxied by its number of citations to date. This method likely failed to reach its users at the time when it could have been an efficient choice for researchers wishing to maximize the representativity of their reference database with less memory. As demonstrated by our results, Kraken 297 has since dramatically decreased resource usage compared to its previous implementation, making MetaCache unnecessary for a community used to the Kraken tool suite. The number of available methods awaiting independent confirmation of their performance has grown substantially98–104, and such a scenario is likely to repeat. By contrast, some of the most recent methods are not mature enough to be used in a real setting; for example, kASA99, despite making interesting claims on small test datasets, does not scale to real-life sample sizes and could not complete the evaluation procedure owing to unacceptable runtime (Figure 5.2, https://www.ezlab.org/lemmi_failed.html). This demonstrates that, once recognized by the community, our LEMMI platform could act as a filter, distinguishing methods of no practical interest from those with real potential to improve daily analyses in metagenomics classification. I have selected all novel tools included in LEMMI to date on the basis of well-documented git repositories and convincing supporting publications; they have been wrapped into containers by myself or other members of our group. Method developers and users have shown interest and provided advice and updates. The platform is now awaiting external submissions to help prospective users get a more comprehensive overview of available tools. This will allow those who have invested in valuable method development to be proactive and have their work not only recognized by a publication but also used. To organize benchmarking, we have set up an environment promoting exchanges and suggestions at https://gitlab.com/ezlab/lemmi/-/issues.


Figure 5.1. Methods included in the LEMMI release beta01.20200615, presented here using the ranking preset SD.DEFAULT. The number of citations in 2020 corresponds to that of the most-cited related paper on Google Scholar as of June 15th, 2020.


Figure 5.2. Runtime observed when testing kASA on a single LEMMI dataset containing 50 million reads. This value of about eight days for one run was judged incompatible with conducting a full benchmark that encompasses 16 runs. This shows how our platform can dismiss methods unlikely to be useful in a real setting.

My work shows that the continuous evaluation of new methods on generic problems, while monitoring resource usage, will greatly contribute to maintaining a better overview of the field at any time. In the current LEMMI context, these generic problems are restricted to providing taxonomic bins and profiles at the NCBI genus and species levels, formed out of gold standard sets mixing bacterial and archaeal short reads. To cover the whole range of questions posed by taxonomic classification, the LEMMI platform will have to diversify its content to sample from the entire tree of life and simulate scenarios involving a host organism. Notably, the inclusion of gold standard sets dedicated to viruses would be more informative for the virology community, since the greater distances observed between reference and sampled organisms may favor protein-based classifiers such as Kaiju94. The in-silico reads will also have to represent additional sequencing technologies, as some methods are now optimized to leverage long reads101. In addition, there is a pressing need for LEMMI to assess tools at the strain level, since recently published methods target that rank101,102, and to support alternative taxonomic systems73,74. While the LEMMI platform has been conceived to evolve easily in these directions in future releases, barriers to achieving flexible and fair benchmarking often arise from method implementations themselves. When addressing new problems, tool developers have sometimes reinvented analysis workflows; however, these changes should only take place when they are essential.

My experience while developing this platform is that the field of metagenomics classification has lacked clear guidelines, a shared vocabulary, and standardized outputs that would facilitate unbiased evaluations while utilizing the full potential of each candidate tool. The fuzzy distinction between taxonomic binners and profilers illustrates this problem: Kraken and CLARK fell into different categories in the first CAMI challenge, making them non-comparable, while they have otherwise been considered direct competitors and precursors in the use of k-mers for taxonomic classification. As we wrapped candidate methods into LEMMI containers, we often wrote post-processing scripts and sometimes hacked the code to bypass rigid workflows in order to obtain comparable readouts. I postulate that our unequal knowledge of each method may have introduced biases that tool creators themselves could have handled better. This strongly advocates for a community approach to benchmarking that involves developers, as envisioned once the LEMMI concept is fully established. However, this is likely not sufficient to maximize the benefits of automated benchmarking if method implementations are not designed from the beginning to meet minimal common requirements. The CAMI challenge has defined output file formats for profiles and binning results that might become the established standard; they are the ones supported by the current version of LEMMI. However, these formats do not allow a confidence value supporting a prediction to be included, and thus this information, computed by some tools, cannot be considered during the evaluation. Fortunately, these files could be extended to contain additional properties. Besides output formats, method implementations show further differences that bring noise when conducting an automated benchmark such as ours.
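To make the point about extensible outputs concrete, the sketch below writes a profile in the tab-separated CAMI/bioboxes profiling layout and appends an extra, non-standard column carrying a per-prediction confidence; readers that ignore unknown trailing columns would remain compatible. The column name and the example values are assumptions for illustration only, not an agreed extension of the format.

    import csv
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class ProfileEntry:
        taxid: str
        rank: str
        taxpath: str          # pipe-separated NCBI taxids from the root to this taxon
        taxpathsn: str        # pipe-separated scientific names matching taxpath
        percentage: float     # relative abundance at this rank (sums to ~100 per rank)
        confidence: Optional[float] = None  # non-standard extra column (assumption)

    def write_profile(path: str, sample_id: str, entries: list[ProfileEntry]) -> None:
        """Write a CAMI-style profile, adding a trailing CONFIDENCE column when available."""
        with open(path, "w", newline="") as handle:
            handle.write(f"@SampleID:{sample_id}\n")
            handle.write("@Version:0.9.1\n")
            handle.write("@Ranks:superkingdom|phylum|class|order|family|genus|species\n")
            handle.write("@@TAXID\tRANK\tTAXPATH\tTAXPATHSN\tPERCENTAGE\tCONFIDENCE\n")
            writer = csv.writer(handle, delimiter="\t")
            for e in entries:
                row = [e.taxid, e.rank, e.taxpath, e.taxpathsn, f"{e.percentage:.4f}"]
                row.append("" if e.confidence is None else f"{e.confidence:.3f}")
                writer.writerow(row)

    # Toy profile with a single genus and confidence values attached.
    write_profile("sample0.profile", "sample_0", [
        ProfileEntry("2", "superkingdom", "2", "Bacteria", 100.0, 0.99),
        ProfileEntry("561", "genus", "2|1224|1236|91347|543|561",
                     "Bacteria|Proteobacteria|Gammaproteobacteria|Enterobacterales|Enterobacteriaceae|Escherichia",
                     100.0, 0.87),
    ])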

The content of the reference database is a problem that was extensively covered in the results section. With LEMMI, we have given its construction an importance that goes beyond what was reported in most previous benchmarking efforts, notably by measuring the runtime and memory necessary to process any given genomic material. Moreover, the use of mandatory references cancels out differences among methods regarding that aspect. A tool that claims improvements in analysis performance but is limited to using a bundled database provided with no update strategy will be outdated in a matter of months. It may still be a breakthrough from an algorithmic point of view, but its implementation should not be recommended to users who wish to perform analyses up-to-date with the field, which raises the question of its usefulness. Our platform can technically allow tools to use their bundled reference, as we recognize that the use of curated markers can be a useful strategy for taxonomic identification and we have aimed at evaluating what is directly available to users.

I have therefore included Metaphlan2, which follows this approach and has been providing updates over the years, but I have decided not to include MSC98 despite its publication in a peer-reviewed journal in 2019: this tool does not provide any documented reference processing script and has no public versioning or issue board from which to expect evolutions. As mentioned above for filtering methods that do not scale to large datasets, once recognized by the community, LEMMI will act as a strong incentive for developers to properly finish their development by making database construction a feature in its own right before trying to publish their work. The Kraken tools provide scripts to process any genomes as references, making them suitable candidates to enter the LEMMI benchmark. In addition, their authors distribute preprocessed databases that are regularly updated. They therefore represent a sustainable approach for those who cannot afford to build their own reference. However, we show in our LEMMI publication that the choice of using only sequences tagged as "complete genomes" should be questioned, as it creates a resource that may be misleading, missing numerous organisms that are represented in public databases.

Another difference among method implementations that is a major obstacle to comparisons is the unclear definition of what relative abundance represents: either the quantity of sequenced reads or the quantity of organisms. Multiplying categories is not the solution and limits the benefit of benchmarking when the experimental objectives are clear from the final user's point of view. The biological quantity to predict is the organism abundance, while read abundance is rather an intermediate technical result; therefore, LEMMI expects a relative abundance profile representing organisms, proxied by genome copies or marker copies, and provides the information necessary for conducting a normalization (i.e. the assembly size when building the reference). As many tools have not been designed to fulfill this task, their currently evaluated versions include a normalization script of my own making.

Taxonomic differences constitute an additional major problem preventing LEMMI from conducting a fully unbiased evaluation of the core algorithm of each method. Tools that do not follow the NCBI taxonomic system of identifiers have to go through a conversion process in which information might be lost in translation. For instance, whenever a spelling does not match that of NCBI, the use of Latin names will cause a candidate tool to miss an organism it could in fact predict. It goes without saying that method implementations have not been primarily designed to enter comparative benchmarking but to be convenient for users. While past evaluation studies have done their best to take these differences into account, I expect LEMMI to encourage researchers who wish to submit their developments to structure their tools in an atomistic way that facilitates their evaluation in the platform (Figure 5.3). In particular, allowing custom reference genomes to be processed and alternate taxonomies to be plugged in (e.g. NCBI/GTDB) will introduce flexibility that is not only essential to a proper evaluation in LEMMI, but also highly beneficial to the final users of the tool.
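As a minimal sketch of the kind of normalization script mentioned above: read counts are converted into an organism-level relative abundance by dividing each taxon's count by its assembly size and renormalizing. The read counts and assembly sizes here are toy values; real profiles would additionally require care with multi-assembly taxa and unclassified reads.

    def organism_abundance(read_counts: dict[str, int], assembly_sizes: dict[str, int]) -> dict[str, float]:
        """Convert per-taxon read counts into organism-level relative abundances (percent).

        Dividing read counts by assembly size gives a value proportional to genome
        copies, which is then renormalized so the profile sums to 100.
        """
        copies = {taxid: count / assembly_sizes[taxid] for taxid, count in read_counts.items()}
        total = sum(copies.values())
        return {taxid: 100.0 * value / total for taxid, value in copies.items()}

    # Toy example: equal read counts but different genome sizes lead to
    # clearly different organism abundances.
    reads = {"562": 1_000_000, "1280": 1_000_000}   # E. coli, S. aureus (NCBI taxids)
    sizes = {"562": 5_000_000, "1280": 2_800_000}   # approximate assembly sizes (bp)
    print(organism_abundance(reads, sizes))
    # -> roughly {'562': 35.9, '1280': 64.1}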

Figure 5.3. Architecture that would enable an unbiased automated benchmark in LEMMI if adopted when developing future methods. Alternate taxonomic systems and constrained reference genomes are supported. The blue items denote elements that the LEMMI benchmarking platform can provide or control. Dashed lines represent components that decrease the explanatory power of the benchmark if not adjustable.

One distinctive feature of the LEMMI approach is the mandatory submission of the method as a container instead of the resulting predictions. While this excludes web-based platforms without a standalone equivalent, it guarantees that the good performance obtained by a candidate tool does not depend strongly on expertise and resources possessed only by its developers. Furthermore, the material used for simulating the gold standard sets and references is based on public data, and all computations are done under the supervision of the platform; this allows the free redefinition of known and unknown organisms and gives more flexibility to generate new scenarios than the strategy chosen by CAMI, which relies on new sequencing. This workflow has an obvious cost in terms of computing infrastructure: it requires multiple references to be built and analyses to be run for each permutation of parameters on each tool. Fortunately, the continuous approach helps to spread submissions over time and mitigates this problem. A drawback of asking for inputs as containers is the complexity of the task for developers; the time invested in preparing a compatible submission is a barrier that will fall only if the interest in LEMMI is strong, but also if everything is done to facilitate this task. The current version of LEMMI provides a detailed user guide on how to proceed, along with a demo dataset and a test script.
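To illustrate the redefinition of known and unknown organisms mentioned above, the sketch below randomly withholds a fraction of the available taxa from the reference so that their reads must be classified without a matching genome. The fraction and seed are arbitrary choices for the example, not LEMMI's actual scenario-generation procedure.

    import random

    def split_known_unknown(taxids: list[str], unknown_fraction: float = 0.2, seed: int = 42) -> tuple[set[str], set[str]]:
        """Partition taxa into a 'known' set (allowed in the reference) and an
        'unknown' set (present in the sample only), to simulate novel organisms.
        """
        rng = random.Random(seed)
        shuffled = taxids[:]
        rng.shuffle(shuffled)
        n_unknown = int(len(shuffled) * unknown_fraction)
        unknown = set(shuffled[:n_unknown])
        known = set(shuffled[n_unknown:])
        return known, unknown

    # Example with placeholder taxids: the 'unknown' genomes are excluded from
    # reference building but still contribute reads to the gold standard sample.
    known, unknown = split_known_unknown([str(t) for t in range(1, 101)])
    print(len(known), "known,", len(unknown), "unknown")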

However, researchers can only partially reproduce how their containerized methods will behave in the complex LEMMI environment with respect to file paths and permissions, which makes the current submission process prone to unwanted back-and-forth with the submitter. Therefore, I would include in future expansions of the project a complete standalone version of the pipeline. It would not only enable developers to ensure that their container is compatible with the LEMMI platform, but also help them to evaluate whether their current project is mature enough, and performs well enough against competitors, to be worth submitting to our centralized public benchmark.
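Until such a standalone version exists, part of this back-and-forth could be anticipated with a simple local pre-check of the kind sketched below: running the container with a non-root user, a read-only input mount, and a single writable output mount surfaces the most common path and permission problems. The image name, mount points, and user ID are placeholders, and this is only a rough approximation of the constraints the platform actually imposes.

    import subprocess
    from pathlib import Path

    def precheck_container(image: str, demo_input: Path, out_dir: Path) -> bool:
        """Run a candidate container under restrictive conditions similar in spirit
        to a benchmarking environment: non-root user, read-only input, writes
        allowed only under /output. Returns True if the run succeeds and produces
        at least one output file.
        """
        out_dir.mkdir(parents=True, exist_ok=True)
        cmd = [
            "docker", "run", "--rm",
            "--user", "1000:1000",               # do not rely on running as root
            "-v", f"{demo_input}:/input:ro",     # inputs are read-only
            "-v", f"{out_dir}:/output",          # the only writable location
            image,
        ]
        result = subprocess.run(cmd)
        produced_output = any(out_dir.iterdir())
        return result.returncode == 0 and produced_output

    if __name__ == "__main__":
        ok = precheck_container("example/classifier:1.0", Path("demo_data"), Path("precheck_out"))
        print("pre-check passed" if ok else "pre-check failed: review paths and permissions")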

Within the frame of the LEMMI platform, we are contemplating a concept that can be applied to many subfields of microbial genomics analysis. The future evolution of the platform should include a second version of the taxonomic classification benchmark; it would fulfill the promise of continuity and improve while building a community eager to contribute. In parallel, the platform could be expanded to any task addressed by multiple methods for which in-silico data can provide a satisfying model of reality, or for which public gold standard sets exist. We are considering variant calling as the focus of a second component to include in the LEMMI platform.

Past benchmarking efforts20 have essentially focused on human variant predictions, and a recent study has addressed the problem for bacteria105. Variant calling targeting viruses, in which our group has expertise, has not been the subject of dedicated benchmarking. As common workflows rely on a read aligner and a variant caller as distinct steps that exchange well-defined formats, methods dedicated to these two tasks could be submitted in separate containers to enable a systematic exploration of their combinations.
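A sketch of what that systematic exploration could look like is given below: every aligner container is paired with every caller container and each combination is evaluated. The container names and the run/score functions are placeholders standing in for the actual benchmark logic.

    from itertools import product

    # Hypothetical container images; real submissions would each provide one tool.
    ALIGNERS = ["example/aligner-a:1.0", "example/aligner-b:2.1"]
    CALLERS = ["example/caller-x:0.9", "example/caller-y:1.4"]

    def run_pipeline(aligner: str, caller: str, sample: str) -> str:
        """Placeholder: align the sample with `aligner`, then call variants with `caller`,
        exchanging a standard alignment file (e.g. BAM) between the two containers."""
        return f"vcf_from_{aligner.split('/')[-1]}_{caller.split('/')[-1]}"

    def score(vcf: str, truth: str) -> float:
        """Placeholder: compare predicted variants against a gold standard (e.g. an F1 score)."""
        return 0.0

    results = {}
    for aligner, caller in product(ALIGNERS, CALLERS):
        vcf = run_pipeline(aligner, caller, sample="demo_reads.fq")
        results[(aligner, caller)] = score(vcf, truth="truth_variants.vcf")

    for (aligner, caller), value in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{aligner} + {caller}: {value:.3f}")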

