Methods and Results

From the document Metagenomic Characterization of the Virome of Clinical Samples (pages 27-147)

4.1 – Methods for the detection of viruses in metagenomic samples

Analyses were performed with a pipeline named ezVIR2, which expands on the original tool, ezVIR, developed by Tom Petty [24]. The pipeline identifies viruses in both RNA-seq and DNA-seq libraries, and can process data from any short-read sequencer. The first version relied on individual greedy alignments against a panel of more than ten thousand clinically relevant viral genomes, and reported the best results based on the most-covered viral strains. In practice, every library was mapped more than ten thousand times, which proved both slow and computationally heavy, so virus detection required a major overhaul. Instead of thousands of individual alignments, the second version relies on representatives of sequence clusters grouped by sequence similarity. A single index is created for these representative sequences, and reads are mapped against it to find which families of viruses are present in each sample. Each representative genome with a positive result is marked, and the corresponding clusters are fully explored by mapping all reads to each genome they contain, with the objective of finding the closest possible strain.

The pipeline is structured as follows. ezVIR2 first performs quality control of the input data by removing or trimming low-quality and low-complexity sequences using Trimmomatic v0.39 [44] and TagDust2 [81], respectively. Reads with a quality score lower than 30, corresponding to a 0.1% error rate, and reads with an entropy value lower than 16, corresponding to a very low-complexity structure, are removed. Human data is removed by aligning the quality-filtered data to the human reference genome (hg38) and transcriptome using the short-read aligner SNAP [50], with settings allowing a maximum of eight mismatches per read pair.
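The correspondence between a quality score of 30 and a 0.1% error rate follows from the Phred definition, Q = -10 log10(p). The sketch below only illustrates that arithmetic together with a simple read filter. The function names and the mean-quality criterion are assumptions made for illustration; they are not taken from ezVIR2 or from Trimmomatic.

```python
import math


def phred_to_error(q):
    """Convert a Phred quality score into a base-call error probability."""
    return 10 ** (-q / 10)


def error_to_phred(p):
    """Convert a base-call error probability into a Phred quality score."""
    return -10 * math.log10(p)


def passes_quality(qualities, min_q=30):
    """Hypothetical filter: keep a read if its mean quality reaches min_q.

    The mean-quality rule is an assumption; the source only states that
    reads with a quality score lower than 30 are removed.
    """
    return sum(qualities) / len(qualities) >= min_q
```

With this definition, `phred_to_error(30)` evaluates to about 0.001, the 0.1% error rate quoted for the threshold.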
Reads that did not map to the human reference, referred to from this point onward as "non-human data", are recovered and aligned in two steps against a curated database of clinically relevant viruses, developed in collaboration with Prof. Kaiser's group. This database currently comprises over 11,000 viral sequences known to be associated with human disease or potential zoonotic transmission. The first step determines only which families of viruses are present, by aligning the reads against a smaller version of the virus database built from ~4,000 representative sequence clusters. To cluster the genomes, I used CD-HIT-EST [82] with a sequence identity threshold of 99% at the nucleotide level. The alignment is made against a representative sequence of each cluster, defined as the longest sequence in the cluster. Clusters identified in the first step are taken to the second step, where reads are mapped individually to each viral sequence inside the cluster in order to find the closest representative strain for each cluster. The resulting output is a plot showing how many virus genera were found and which strains are the most representative for each virus family (Figure 7).

Based on the observations made by Wright et al. [83] for single- and double-indexed Illumina samples, I also applied a cut-off value for read abundance in order to reduce the number of false-positive hits due to cross-contamination in multiplexed samples. Briefly, for each virus, I calculated the ratio between its abundance in each sample and its abundance in the sample where it was most abundant. A ratio of 0.24% was used as the cut-off value for cross-talk; any virus detected in a library with a ratio lower than this value was assumed to derive from improperly demultiplexed reads. The majority of viruses considered to be contaminants were confirmed to be absent by RT-PCR.

To detect divergent sequences, non-human reads are assembled into contigs using three different assembly tools: metaSPAdes [84], MEGAHIT [85] and IDBA-UD [86]. Each assembly is then scanned with DIAMOND [87], a BLASTX-like tool, using NCBI's NR as the database, with an e-value threshold of 0.05 as the cut-off for further analysis. DIAMOND's output is a BLAST table that is fed into MEGAN [88], a software that bins results according to the lowest common ancestor (LCA) by assigning a weight to each reference sequence according to the number of BLAST results assigned to it and to sequences in the same taxonomic species, with a threshold of 75%. To check for consensus between assemblies, contigs classified as virus or virus-like are compared across all three assemblies using MAFFT [89], a multiple sequence aligner. The same sequences are further analysed by checking their read coverage and depth, and by translating their nucleotide sequence in all six reading frames in order to identify their gene structure. Each hypothetical protein is classified by scanning with PSI-BLAST [90], with a maximum e-value of 0.05 as threshold. Additionally, sequences with no classification are also translated in the six reading frames and scanned with PSI-BLAST, using the same threshold value.
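The six-frame translation applied to candidate contigs can be sketched as follows. This is a minimal illustration that assumes the standard genetic code and reports unknown codons as "X"; the function names are hypothetical and the pipeline may rely on a dedicated tool instead.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(seq):
    """Translate a nucleotide sequence in the three forward and three reverse frames."""
    seq = seq.upper()
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames
```

Each translated frame would then be split at stop codons ("*") to obtain the hypothetical proteins that are passed to the protein search.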

The full code and database are available at: https://gitlab.com/ezlab/ezvir2
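The read-abundance cut-off used in this section to flag cross-talk between multiplexed libraries can be sketched as follows. The input layout (a dictionary of read counts per virus and per library) and the function name are assumptions made for illustration; only the 0.24% ratio comes from the text.

```python
def flag_cross_talk(read_counts, cutoff=0.0024):
    """Flag libraries whose reads for a virus are likely cross-talk.

    read_counts maps each virus to a {library: read count} dictionary
    (hypothetical layout). For each virus, every library is compared with
    the library holding the most reads for that virus; libraries whose
    ratio falls below the cutoff (0.24%) are flagged.
    """
    flagged = {}
    for virus, per_library in read_counts.items():
        top = max(per_library.values(), default=0)
        if top == 0:
            continue  # virus not detected in any library
        suspects = sorted(
            lib for lib, n in per_library.items() if 0 < n / top < cutoff
        )
        if suspects:
            flagged[virus] = suspects
    return flagged
```

For example, a library with 150 reads of a virus that has 100,000 reads in another library of the same run has a ratio of 0.15% and would be flagged, whereas a library with 5,000 reads (5%) would not.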

Figure 7. Representation of the ezVIR2 analysis pipeline. In the first section, the sequenced metagenomic libraries are filtered for host data, low-quality and low-complexity reads, as well as sequences derived from reagents and other contaminants (UniVec). In Phase-1, non-human reads are mapped against a database of 4,500 clinically relevant viruses in order to determine which families of viruses are present per sample. In Phase-2, this search is extended to the full database in order to determine the closest known species for each identified virus family.
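The two mapping phases can be sketched as follows. The data structures and function names are hypothetical, and ranking genomes by coverage in Phase-2 is an assumption; the text only states that the longest sequence represents each cluster and that the second step looks for the closest strain.

```python
def pick_representatives(clusters, lengths):
    """Choose the longest genome of each cluster as its representative."""
    return {
        cluster_id: max(members, key=lambda genome: lengths[genome])
        for cluster_id, members in clusters.items()
    }


def clusters_to_explore(representatives, phase1_hits):
    """Phase-1: keep the clusters whose representative received mapped reads."""
    return [
        cluster_id
        for cluster_id, rep in representatives.items()
        if phase1_hits.get(rep, 0) > 0
    ]


def closest_strain(members, coverage):
    """Phase-2: pick the best-covered genome in a cluster (assumed criterion)."""
    return max(members, key=lambda genome: coverage.get(genome, 0.0))
```

Only the clusters returned by `clusters_to_explore` need a per-genome mapping, which is what avoids aligning every library against every genome in the database.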

4.2 – Identification of an Astrovirus in respiratory disease of unknown etiology

Astrovirus VA1 identified by next-generation sequencing in a nasopharyngeal specimen of a febrile Tanzanian child with acute respiratory disease of unknown etiology [70]

Authors: Samuel Cordey, Francisco Brito, Diem Lan Vu, Lara Turin, Mary Kilowoko, Esther Kyungu, Blaise Genton, Evgeny M. Zdobnov, Valérie D'Acremont, and Laurent Kaiser

DOI: doi:10.1038/emi.2016.67 Full Text: Pages 38 to 40

Context: In this paper we detail the detection of an Astrovirus VA1 genome in a study of 30 nasopharyngeal samples of Tanzanian children suffering from respiratory symptoms, with no known etiology (six pools of five patients each). The presence of the Astrovirus was later confirmed by RT-PCR. Astroviruses are not commonly found outside the gastrointestinal system, and this analysis was their first documented report in respiratory specimens. The analysis also detected the presence of several other viruses, namely parainfluenza virus type 2 and type 4 (PIV2/PIV4), which fit the symptoms presented by some of the patients. PIV2 was also confirmed in two patients by RT-PCR, while PIV4 was not confirmed by RT-PCR.

Contributions: I performed the bioinformatics analysis of the metagenomic samples using the ezVIR2 pipeline to process the libraries and detect the presence of viruses, as described previously. Following the initial detection of the Astrovirus VA1 in the first pool, the RNA and DNA of each patient in that pool were sequenced individually and analysed using the same approach. I also performed an additional phylogenetic analysis of the detected Astrovirus, which consisted of assembling the reads mapping to the virus, performing a BLAST analysis to identify the contigs belonging to the capsid sequence, and generating a consensus tree based on a multiple sequence alignment between them and a set of representative Astrovirus capsid sequences. Additional software used: MAFFT [91] for generating the multiple sequence alignment, IDBA-UD [86] for assembly, and IQ-TREE [92] for generating the tree.

4.3 – Identification of an Astrovirus in meningitis-like disease of unknown etiology

Astrovirus MLB2, a New Gastroenteric Virus Associated with Meningitis and Disseminated Infection [71]

Authors: Samuel Cordey, Diem-Lan Vu, Manuel Schibler, Arnaud G. L'Huillier, Francisco Brito, Mylène Docquier, Klara M. Posfay-Barbe, Thomas J. Petty, Lara Turin, Evgeny M. Zdobnov, and Laurent Kaiser

DOI: doi:10.3201/eid2205.151807 Full Text: pages 41 to 48

Context: This paper is part of the analysis of a cohort of patients with meningitis-like symptoms and no known etiology. RNA and DNA were extracted from each patient's cerebrospinal fluid (CSF) and sequenced in order to find potential pathogens.

Bioinformatics analysis detected one patient positive for Astrovirus MLB2, which was later also found in urine, plasma and anal swab samples. The detected strain was used for a prevalence study in 943 fecal and 424 cerebrospinal fluid samples from the HUG, which detected an additional five cases of infection by Astrovirus MLB2. While astroviruses are commonly associated with cases of gastroenteritis, this publication contributed to a growing body of evidence that they might also play a role in CNS disease.

Contributions: I performed the bioinformatics analysis using the following software: ezVIR2 for the processing of samples and detection of viruses, Sparse [93] for virus genome assembly, and MAFFT and IQ-TREE for phylogenetic analysis. The initial analysis identified an Astrovirus MLB2 at low coverage (35% of total sequence length) in the cerebrospinal fluid of one patient, which subsequently led to the sequencing and analysis of the patient's plasma, urine, and anal swab. The virus was identified in all three samples and confirmed by RT-PCR. In particular, I recovered a complete Astrovirus MLB2 sequence from the anal swab sample, with 98.5% similarity to previously described sequences. The recovered sequence contributed to defining the PCR probe used for the downstream prevalence study.

4.4 – A review on the Astroviridae family and their role in human disease

Novel human astroviruses: Novel human diseases? [73]

Authors: Diem-Lan Vu, Samuel Cordey, Francisco Brito, and Laurent Kaiser

DOI: doi:10.1038/s41426-018-0025-1 Full text: Pages 49 to 56

Context: A follow-up to the two previous papers on astroviruses. Upon finding this virus family in non-gastrointestinal samples, an in-depth review of these viruses was required to give context to their possible involvement in other diseases. This publication documented the currently known information on the Astroviridae family, including their structure, prevalence, and involvement in human disease, with a focus on disease outside of the gastrointestinal system, namely CNS infection in immunocompromised hosts.

Contributions: My contribution to this publication was the bioinformatics analysis section, which consisted of retrieving a representative set of complete genomes and capsid genes (GenBank) for the two main genera of Astrovirus (Mamastrovirus and Avastrovirus) and performing two phylogenetic analyses, one based on the full genome sequence and another based on the capsid sequence, using MAFFT for the multiple sequence alignment and IQ-TREE for the phylogenetic tree reconstruction.

4.5 – Metaviromics of haematopoietic stem cell transplants

Human Pegivirus Persistence in Human Blood Virome after Allogeneic Haematopoietic Stem-cell Transplantation [76]

Authors: Diem-Lan Vu, Samuel Cordey, Federico Simonetta, Francisco Brito, Mylène Docquier, Lara Turin, Christian van Delden, Elsa Boely, Carole Dantin, Amandine Pradier, Eddy Roosnek, Yves Chalandon, Evgeny M. Zdobnov, Stavroula Masouridi-Levrat, Laurent Kaiser

DOI: doi:10.1016/J.CMI.2018.05.004 Full Text: Pages 57 to 64

Context: The main question of this study was how virus infection resurges in patients with reduced immune resistance. In particular, we wanted to observe whether viruses that are considered commensal become pathogenic when the capabilities of the immune system are reduced. For this, the plasma of 40 immunocompromised patients who had undergone haematopoietic stem cell transplantation was sequenced (RNA and DNA) and analysed. The study evidenced a number of different virus infections, the most abundant viruses being polyomaviruses, anelloviruses, herpesviruses, and pegiviruses. Pegivirus infection was studied further, revealing infections that lasted up to a year after the procedure, though no association was found between it and factors such as survival, relapse, and immune reconstitution.

Contributions: I performed the bioinformatics analysis for all 80 samples (40 DNA and 40 RNA) using ezVIR2. Additional analyses not mentioned in the publication were required to remove possible read overlaps in virus families with high sequence similarity. This was done by mapping the reads to each genome individually and checking which reads overlapped between results; in cases where the overlap between two viruses was complete, the least abundant virus was removed. I also performed a phylogenetic analysis of the assembled pegivirus, using IDBA-UD for the assembly, MAFFT for the multiple sequence alignment and IQ-TREE for the phylogenetic tree reconstruction. Finally, I produced the figures for visualizing the virome contents.
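The read-overlap check can be sketched as follows. It assumes that a "complete" overlap means every read assigned to the less abundant virus also maps to a more abundant one, and that abundance is measured by the number of mapped reads; both are interpretations rather than documented behaviour, and the function name is hypothetical.

```python
def remove_redundant_hits(reads_by_virus):
    """Drop viruses whose reads are entirely shared with a more abundant virus.

    reads_by_virus maps each virus to the set of read identifiers that
    mapped to it when the reads were aligned to that genome individually.
    """
    # Process viruses from most to least abundant.
    ranked = sorted(
        reads_by_virus, key=lambda virus: len(reads_by_virus[virus]), reverse=True
    )
    kept = []
    for virus in ranked:
        reads = reads_by_virus[virus]
        # Complete overlap: all reads of this virus also map to a kept virus.
        if any(reads <= reads_by_virus[other] for other in kept):
            continue
        kept.append(virus)
    return kept
```

A virus that shares only part of its reads with a more abundant relative is kept, since the remaining reads are evidence that it is present in its own right.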

4.6 – Blood metaviromics, part 1: Red blood cells and plasma

Metagenomics analysis of red blood cell and fresh‐frozen plasma units [74]

Authors: Pierre Lau, Samuel Cordey, Francisco Brito, Diderik Tirefort, Thomas J. Petty, Lara Turin, Arthur Guichebaron, Mylène Docquier, Evgeny M. Zdobnov, Sophie Waldvogel-Abramowski, Thomas Lecompte, Laurent Kaiser, and Olivier Preynat-Seauve

DOI: doi:10.1111/trf.14148 Full Text: Pages 65 to 78

Context: The main objective of this study was to assess the safety of blood products deemed eligible for transfusion, using an unbiased approach (metagenomics). Three hundred red blood cell and fresh-frozen plasma units were pooled into 20 libraries for each product type (10 DNA, 10 RNA) and sequenced. Three categories of viruses were detected: commensal viruses (TTV, HPgV), potentially pathogenic viruses (HHV5, HPV), and one case of a virus rarely reported in blood (Astrovirus), which is not covered by the typical screening applied to blood units. At the time of publication, it was also one of the first studies using a metagenomic approach to characterize RNA and DNA from healthy donors, helping to establish the profile of a healthy person's blood virome.

Contributions: I performed the bioinformatics analysis of all 40 samples (20 RNA and 20 DNA) using ezVIR2 in order to characterize the viral metagenome of blood donors. Confirmation of the results using MMSeqs2, BLAST, and Centrifuge was done by Pierre Lau.

4.7 – Blood metaviromics, part 2: Platelets

Metagenomics analysis of the virome of 300 concentrates from a Swiss platelet bank [75]

Authors: Francisco Brito, Samuel Cordey, Eric Delwart, Xutao Deng, Diderik Tirefort, Coralie Lemoine‐Chaduc, Evgeny Zdobnov, Thomas Lecompte, Laurent Kaiser, Sophie Waldvogel‐Abramowski, Olivier Preynat‐Seauve

DOI: doi:10.1111/vox.12695 Full text: Pages 79 to 82

Context: Following the blood donor sample analysis, the recovery of an Astrovirus sequence prompted interest in whether the same observations would be made in donor platelet units, since they are often transfused to immunocompromised patients. As established in previous publications, astroviruses are associated with meningitis-like disease in immunocompromised patients, and their presence would represent a high risk for the patient. Total RNA and DNA of 300 platelet units were sequenced for the analysis, pooled into 20 libraries (10 RNA, 10 DNA). Additionally, a positive control library was included in both the RNA and DNA runs. Results showed a virome similar to the one observed in red blood cell and fresh-frozen plasma units, suggesting that platelet concentrates retain the virome despite the plasma/leucocyte reduction processes. No pathogenic viruses were reported.

Contributions: I performed the bioinformatics analysis using ezVIR2, wrote the manuscript and produced the figures, along with Samuel Cordey, Olivier Preynat‐Seauve and Evgeny Zdobnov.

4.8 – Identification of a case of complex contamination in clinical metagenomic samples

Novel Rhabdovirus and an almost complete drain fly transcriptome recovered from two independent contaminations of clinical samples [94]

Authors: Francisco Brito, Mosè Manni, Florian Laubscher, Manuel Schibler, Mary-Anne Hartley, Kristina Keitel, Tarsis Mlaganile, Valerie d'Acremont, Samuel Cordey, Laurent Kaiser, Evgeny M Zdobnov

Context: [...] apparent rhabdovirus infection, a virus that is known to cause meningitis-like disease. However, further analysis revealed a case of complex sample contamination by a drain fly, which was also the real host of the rhabdovirus.

Contributions: I wrote the manuscript in collaboration with Mosè Manni, Samuel Cordey and Evgeny Zdobnov. I performed the initial library analysis in collaboration with Florian Laubscher (I analysed MG2015, Florian analysed MG2017) with ezVIR2. I also performed the de novo assembly, the sequence analysis which allowed for the discovery of the highly-divergent virus, and the phylogenetic analysis. COI detection analysis was performed by Mosè Manni.

4.9 – The metavirome of the cerebrospinal fluid

Viral sequences detection by high-throughput sequencing in cerebrospinal fluid of individuals with and without central nervous system disease [95]

Authors: Manuel Schibler, Francisco Brito, Marie-Céline Zanella, Evgeny M. Zdobnov, Florian Laubscher, Arnaud G L'Huillier, Juan Ambrosioni, Noémie Wagner, Klara M Posfay-Barbe, Mylène Docquier, Eduardo Schiffer, Georges L. Savoldelli, Roxane Fournier, Lauriane Lenggenhager, Samuel Cordey, Laurent Kaiser

DOI: doi:10.3390/genes10080625 Full Text: Pages 115 to 126

Context: This publication presents a metagenomic analysis of the cohort with unknown central nervous system (CNS) disease (26 patients), 10 patients with known CNS disease, and a control group of 30 patients presenting no CNS disease who underwent elective surgery. The two main objectives were to characterize the CNS virome and to determine which circulating viruses could potentially be associated with CNS disease. Results evidenced several commensal viruses (anellovirus, herpesvirus, pegivirus), and a potentially pathogenic infection by an astrovirus, previously documented in section 4.3.

No novel viruses were identified, with the exception of a Rhabdovirus derived from an external contamination of one library, detailed in section 4.8.

Contributions: I contributed to this publication with the bioinformatics analysis, in collaboration with Florian Laubscher, using ezVIR2 for virus detection, IDBA-UD for assembly, and DIAMOND and BLAST for the identification of novel virus contigs. Additional analysis regarding the contaminations is detailed in section 4.8.

4.10 – Identification of astrovirus infections in febrile Tanzanian children

Detection of novel astroviruses MLB1 and MLB2 in the sera of febrile Tanzanian children [72]

Authors: Samuel Cordey, Mary-Anne Hartley, Kristina Keitel, Florian Laubscher, Francisco Brito, Thomas Junier, Frank Kagoro, Josephine Samaka, John Masimba, Zamzam Said, Hosiana Temba, Tarsis Mlaganile, Mylène Docquier, Jacques Fellay, Laurent Kaiser & Valérie D'Acremont

DOI: doi:10.1038/s41426-018-0025-1 Full Text: Pages 101 to 103

Context: This analysis was made in the context of studying the virome of a cohort of Tanzanian individuals. It detected several novel MLB1 and MLB2 astroviruses, suggesting possible new associations between these viruses and disease.

Contributions: I contributed to developing the approach used to perform the virus analysis (ezVIR2) and to assembling the astrovirus contigs, but did not perform the analysis, which was done by Florian Laubscher.

4.11 – Identification of dicistrovirus infections in febrile Tanzanian children

Detection of dicistroviruses RNA in blood of febrile Tanzanian children [78]

Authors: Samuel Cordey, Florian Laubscher, Mary-Anne Hartley, Thomas Junier, Francisco J. Pérez-Rodriguez, Kristina Keitel, Gael Vieille, Josephine Samaka, Tarsis Mlaganile, Frank Kagoro, Noémie Boillat-Blanco, Zainab Mbarack, Mylène Docquier, Francisco Brito, Daniel Eibach, Jürgen May, Peter Sothmann, Cassandra Aldrich, John Lusingu, Caroline Tapparel, Valérie D'Acremont & Laurent Kaiser

DOI: doi:10.1080/22221751.2019.1603791 Full text: Pages 104 to 114

Context: Part of the analysis of the same cohort described in section 4.10. The analysis assembled and reported a high number of potentially pathogenic dicistroviruses in several patients.

Contributions: I contributed to developing the approach used to perform the virus analysis, and I independently verified the dicistrovirus findings made by Florian Laubscher.

4.12 – Analysis of human/virus infection dynamics in cultured airway epithelia

Propagation of respiratory viruses in human airway epithelia reveals persistent virus-specific signatures [80]

Authors: Manel Essaidi-Laziosi, Francisco Brito, Sacha Benaoudia, Léna Royston, Valeria Cagno, Mélanie Fernandes-Rocha, Isabelle Piuz, Evgeny Zdobnov, Song Huang, Samuel Constant, Marc-Olivier Boldi, Laurent Kaiser, Caroline Tapparel

DOI: doi:10.1016/j.jaci.2017.07.018 Full text: Pages 127 to 137

Context: The objective of this publication was to study how respiratory viruses propagate and adapt to the human immune system when infecting airway epithelia. To complement the bench work analyses, two bioinformatics approaches were used. The first focused on a transcriptomic analysis of human epithelial cells, in order to observe changes over time in gene expression, caused by the infection of two viruses,
