
Metagenomic Characterization of the Virome of Clinical Samples

BRITO, Francisco


BRITO, Francisco. Metagenomic Characterization of the Virome of Clinical Samples. Thèse de doctorat : Univ. Genève, 2020, no. Sc. 5477

URN: urn:nbn:ch:unige-1443445

DOI: 10.13097/archive-ouverte/unige:144344

Available at: http://archive-ouverte.unige.ch/unige:144344


Metagenomic Characterization of the Virome of Clinical Samples

THÈSE

présentée à la Faculté des Sciences de l'Université de Genève pour obtenir le grade de Docteur ès Sciences, mention Bioinformatique

par

Francisco Maria de Aboim Borges Fialho de Brito

de Lisboa (Portugal)

Thèse N° 5477

GENÈVE

Centre d'Impression de l'Université de Genève 2020

UNIVERSITÉ DE GENÈVE
FACULTÉ DE MÉDECINE, Département de Médecine génétique et Développement : Prof. Evgeny M. Zdobnov
FACULTÉ DES SCIENCES, Département d'Informatique : Dr. Frédérique Lisacek


Acknowledgements

I would first like to thank my thesis supervisors, Prof. Evgeny Zdobnov and Prof. Laurent Kaiser, for the opportunity to work with them. I would also like to thank all the collaborators, in particular Samuel Cordey, Diem-Lan Vu Cantero, Olivier Preynat-Seauve, and Caroline Tapparel.

Secondly, I would like to thank the members of the committee, Dr. Frédérique Lisacek, Prof. Ioannis Xenarios, and Prof. Jacques Fellay, for taking the time to read this manuscript and give feedback.

I am also thankful to all of the EZLab's current and former members for precious discussions and feedback, in particular Alexis Loetscher, Felipe Simão, and Christopher Rands for long discussions on various topics, which were very definitely always about serious science and research.

Thank you to my parents, Margarida Aboim Borges and Miguel Fialho de Brito, for the love, support and encouraging words, as well as to my friends for always being there for me, in particular the ones back in Portugal who were always ready to meet up whenever I came back to Lisbon for a visit (and even synchronize their own visits to Lisbon!).

Finally, I would be remiss not to thank the Sauvains for allowing me and my friends to spend long evenings at their store, relaxing with some games and food after a long day of work.


Contents

1. Abstract

2. Résumé

3. Introduction

3.1 – Virus classification

3.2 – Viral infection and impact on human health

3.3 – Detection of viruses

3.4 – Metagenomics

3.5 – Thesis objectives

3.6 – Thesis contributions

3.7 – Thesis outline

4. Methods and Results

4.1 – Methods for the detection of viruses in metagenomic samples

4.2 – Identification of an Astrovirus in respiratory disease of unknown etiology

4.3 – Identification of an Astrovirus in meningitis-like disease of unknown etiology

4.4 – A review on the Astroviridae family and their role in human disease

4.5 – Metaviromics of haematopoietic stem cell transplants

4.6 – Blood metaviromics, part 1: Red blood cells and plasma

4.7 – Blood metaviromics, part 2: Platelets

4.8 – Identification of a case of complex contamination in clinical metagenomic samples

4.9 – The metavirome of the cerebrospinal fluid

4.10 – Identification of astrovirus infections in febrile Tanzanian children

4.11 – Identification of dicistrovirus infections in febrile Tanzanian children

4.12 – Analysis of human/virus infection dynamics in cultured airway epithelia

4.13 – Metaviromics of Kawasaki disease patients

5. Discussion

5.1 – Metagenomic data analysis and contamination

5.2 – Astrovirus cases and their role in disease

5.3 – The role of commensal viruses in HSCT recovery

5.4 – Characterization of the virome of blood donors

5.5 – Characterization of the cerebrospinal fluid virome

5.6 – Characterization of the human/virus dynamics in airway epithelia

5.7 – Characterization of the virome of Kawasaki Disease patients

6. Conclusions

7. Bibliography

8. Appendix

8.1 – Supplementary material for discussion

8.1.1 – Supplementary material for section 5.1

8.1.2 – Supplementary material for section 5.5

8.1.3 – Supplementary material for section 5.6

8.1.4 – Supplementary material for section 5.7

8.2 – Supplementary online material

8.2.1 – Supplementary online material for section 4.2

8.2.2 – Supplementary online material for section 4.3

8.2.3 – Supplementary online material for section 4.5

8.2.4 – Supplementary online material for section 4.6

8.2.5 – Supplementary online material for section 4.9

8.2.6 – Supplementary online material for section 4.10

8.2.7 – Supplementary online material for section 4.11

8.2.8 – Supplementary online material for section 4.12

8.2.9 – Supplementary online material for section 4.13


1. Abstract

Metagenomic approaches give us an in-depth overview of the human microbiome in a single analysis, without any a priori knowledge of a sample's contents. They are becoming an important complement to routine assays because their open approach can detect divergent organisms that are not part of standard diagnostic panels. Viruses play an important role in human health, being the cause of several clinically relevant infections and diseases. For the work presented in this thesis, I collaborated with the University of Geneva Hospitals virology laboratory to develop analysis methods for evidencing viral sequences in human clinical metagenomic data (ezVIR2), with the aim of characterizing diseases of unknown etiology suspected to be viral. In order to better understand the role of each virus, we first assessed the healthy virome of an individual by analyzing metagenomic samples of plasma, red blood cells and platelets from healthy donors at the hospital's blood bank, as well as the cerebrospinal fluid of healthy individuals, revealing viruses such as anelloviruses, pegiviruses, and papillomaviruses to be part of the healthy human microbiome. We then compared these results to the blood, cerebrospinal fluid, and respiratory tract virome of patients with disease of unknown etiology, showing the presence of some pathogenic viruses. Several strains of astrovirus were described outside of their typical infection site (the gastrointestinal tract), suggesting their likely involvement in meningitis-like disease and disseminated infection.

We also recovered several dicistroviruses in the blood of febrile children, and we assessed the potential role of viruses in the recovery of highly immunocompromised patients by analyzing the metagenome of hematopoietic stem cell transplantees, which contained pegivirus infections that were both long-lasting and of high viral load. These infections were found not to interfere with the patients' recovery. The human/virus dynamics during infection were also a focus of our study, revealing specific signatures for rhabdovirus strains during an infection of airway epithelia. Additionally, we assessed the viral signatures of patients with Kawasaki disease, showing no connection between the patients and describing only commensal viruses. Finally, I detailed a complex case of contamination in clinical samples, where a rhabdovirus was identified in patients suffering from meningitis-like symptoms, a disease commonly associated with viral infection.

However, the host of that viral infection turned out to be a drain fly contaminating the samples. In summary, we advanced the potential of metagenomic approaches in the clinical setting, successfully using them to describe several cases of infection as well as to characterize the normal virome of an individual.


2. Résumé

Les approches métagénomiques nous donnent un aperçu en profondeur du microbiome humain en une seule analyse, sans la nécessité d'avoir une connaissance a priori des contenus d'un échantillon. De plus en plus, elles deviennent un complément aux tests de routine, car leur approche ouverte est capable de détecter des organismes divergents qui ne font pas partie des tests de diagnostic standards.

Les virus jouent un rôle important dans la santé humaine, étant à l'origine de plusieurs infections et maladies cliniquement pertinentes. Pour les études présentées dans cette thèse, j'ai collaboré avec le laboratoire de virologie de l'Hôpital de l'Université de Genève pour élaborer des méthodes d'analyse permettant de mettre en évidence des séquences virales dans des données métagénomiques cliniques humaines (ezVIR2), dans le but de caractériser des maladies d'étiologie inconnue, suspectées d'être d'origine virale. Afin de mieux comprendre le rôle de chaque virus, nous avons d'abord évalué le virome sain d'un individu en analysant des échantillons métagénomiques de plasma, globules rouges et plaquettes de donneurs sains issus de la banque du sang de l'hôpital, ainsi que le liquide céphalorachidien d'individus sains, révélant que les anellovirus, pegivirus et papillomavirus font partie d'un microbiome sain. Nous avons ensuite comparé ces résultats avec ceux obtenus à partir du virome du sang, du liquide céphalorachidien et des voies respiratoires de patients atteints d'une maladie d'étiologie inconnue, montrant la présence de pathogènes viraux. Plusieurs souches d'astrovirus ont été décrites en dehors de leur site d'infection typique (gastro-intestinal), suggérant leur implication probable dans une maladie de type méningite et une infection disséminée. Nous avons également récupéré plusieurs dicistrovirus dans le sang d'enfants fébriles. Nous avons aussi évalué le rôle potentiel des virus dans la récupération des patients fortement immunodéprimés, en analysant le métagénome des transplantés de cellules-souches hématopoïétiques, contenant des infections de pegivirus à la fois persistantes et à charge virale élevée. Ces infections n'ont pas interféré avec la guérison des patients. La dynamique entre humain et virus a également été au centre de notre étude, révélant des signatures spécifiques pour les souches de rhabdovirus lors d'une infection de l'épithélium des voies aériennes. Nous avons aussi évalué les signatures virales des patients atteints de la maladie de Kawasaki, ne montrant aucun lien entre les patients et décrivant uniquement des virus commensaux. Finalement, j'ai détaillé un cas de contamination complexe dans des échantillons cliniques, où un rhabdovirus a été identifié chez des patients souffrant de symptômes de méningite, une maladie généralement associée à des infections virales.


Cependant, le vrai hôte de l'infection virale s'est avéré être une petite mouche qui contaminait ces échantillons. En résumé, nous avons développé le potentiel des approches métagénomiques dans un milieu clinique et l'avons utilisé avec succès pour décrire plusieurs cas d'infection, ainsi que pour comprendre ce qu'est le virome normal d'un individu.


3. Introduction

3.1 – Virus classification

Viruses are defined as infectious, obligate intracellular parasites. They are comprised of either DNA or RNA, and replicate by relying on the host's cellular systems. They are thought to be the most abundant type of organism on the planet and have been observed to infect almost all other organisms, from prokaryotes to eukaryotes, displaying multiple morphologies and methods of infection and transmission. Unlike other domains of life, viruses lack a universal set of genes that can be used to classify and compare them (such as the 16S ribosomal subunit or replication-related genes), with the partial exception of the RNA-dependent RNA polymerase gene, which is found in almost all RNA viruses; historically, their classification has therefore relied on a combination of several properties, from host range and pathogenicity to structure and replication mechanism. The best known categorization is the Baltimore classification, which separates viruses into seven types based on a combination of features: replication method, genetic material, and strand orientation (Table 1, Figure 1).

The taxonomic classification and organization of viruses is maintained by the International Committee on Taxonomy of Viruses (ICTV)1 which, as of February 2019, reported a total of 14 viral taxonomic orders, 150 families, 1019 genera, and 5560 classified viral species. While ICTV approval of a taxonomic classification has historically relied on virus isolation and culturing, developments in sequencing technologies have shown that complete, high-quality viral sequences can be obtained by sequencing alone, thus reducing the reliance on physical characteristics for taxonomic classification. Metagenomic approaches in particular have greatly contributed to expanding the known virome, allowing for the identification of hundreds of thousands of new viral genomes2,3. While they do not characterize the biological properties of viruses directly (e.g. pathogenicity, host), sequencing approaches provide information on the viral genomic sequence (e.g. predicted genes, GC content, codon usage, sequence motifs) and on how it relates to other viral species. The amount of information provided by these approaches prompted a decision by the ICTV in 2016 to accept sequencing data alone as confirmation of the presence of a virus in a sample4, allowing for a more accurate taxonomic classification of viruses.

Despite this change, the ICTV has not defined complete rules on how to systematically classify viruses, providing no universal thresholds for sequence similarity, no guidance on which other properties to consider, and no naming conventions. This, coupled with the fact that virus classification throughout the years has had different rules applied between and within virus families, shows that virus classification remains inconsistent. A clear example of this is the Flavivirus genus, where the nucleotide distance between two viral sequences varies anywhere between 6% and 43%5. Another example is the Herpesvirales order: the traditional approach followed a morphology-based-only classification (shape of capsid/envelope/nucleus), but once genomic sequence similarities were taken into account, it became apparent that the group's species (human herpesviruses 1 to 8) are more divergent from one another than previously expected6. Other examples include viruses that are classified based on a single gene, e.g. the Tymovirales order, whose classification is based exclusively on the RNA polymerase gene, despite the genome encoding a large polyprotein with four distinct genes7. Recently, attempts have been made to standardize virus classification, proposing methods for scoring viruses based on their genomic similarity8,9. However, despite these suggestions and improvements to classification methods, an under-sampling problem still exists, requiring more viral genomes to be sequenced and characterized. These inconsistencies in classification methods show the need to improve our knowledge of viral sequence diversity in order to accurately classify virus families, in particular those that could be implicated in human disease.

Table 1 - The Baltimore classification of viruses, and the number of known families within each class, according to ICTV. Classes are defined by type of mRNA production.

Group   Classification                                 Number of families (* subfamilies)   Abbreviation
I       Double-stranded DNA viruses                    34                                   dsDNA
II      Single-stranded DNA viruses                    11                                   ssDNA
III     Double-stranded RNA viruses                    9                                    dsRNA
IV      Positive-sense single-stranded RNA viruses     38                                   ssRNA (+)
V       Negative-sense single-stranded RNA viruses     15                                   ssRNA (-)
VI      RNA retro-transcribing viruses                 2*                                   ssRNA-RT
VII     DNA retro-transcribing viruses                 2                                    dsDNA-RT


Figure 1 – The Baltimore classification of viruses (from ViralZone10).

3.2 – Viral infection and impact on human health

A successful viral infection requires cells with two specific characteristics: susceptible cells, which act as the point of entry for the infection (e.g. epithelial cells), and permissive cells, whose infection allows new viruses to be produced (e.g. leukocytes). After the virus successfully enters the host, the viral genome directs the synthesis of viral components using the cell's own machinery, and these components self-assemble into new viral particles, ready to infect more cells11. When infected cells are not permissive, the infection is characterized as abortive, since the virus lacks the proper conditions to replicate. The viral infection cycle is divided into four steps: attachment and entry, decoding of genome information, genome replication, and assembly and release of viral particles. When the virus infects a permissive cell, it can go through two distinct phases: the lytic phase and the lysogenic phase. In the lytic phase, viruses replicate in high quantities using the cell's machinery, then exit the infected cell and go on to infect other permissive cells, restarting the cycle. In the lysogenic phase, also known as the latent phase, viral expression is heavily reduced and no new viral particles are produced, in order to avoid detection by the host's immune system. Viruses stay in this dormant state until a set of conditions, such as stress or co-infection by other organisms, triggers the start of the lytic phase.

The lysogenic phase can be further characterized depending on how the virus remains dormant within the cell: it can either remain in the cytoplasm/nucleus in a stabilized form, known as episomal latency, or integrate into the host's genome, known as proviral latency.

In all eukaryotic viral taxonomic orders, we find viruses that play a role in human disease or have an impact on human-related activities, such as economic resources (e.g. crops and cattle). Such viruses include influenza virus, human immunodeficiency virus (HIV), yellow fever virus (YFV), and West Nile virus (WNV), which have been the focus of extensive research so that their detection and treatment can be done as quickly and accurately as possible, in order to minimize their impact. In addition to viruses with well characterized infections and outbreaks, there are viruses with a high impact on human health that have been described only recently and have yet to be associated with major cases of disseminated disease. These are known as emerging viruses, i.e. viruses with an increasing incidence or the potential for one12. They can be novel viruses recently described, highly divergent versions of known viruses, or known viruses that had not previously been associated with specific types of disease. Emerging viruses are often associated with cases of zoonotic infection, where a virus present in a non-human host is transferred to humans, and their outbreaks are tracked by the World Health Organization (WHO) in a list of "prioritized diseases", first created in 2015 (https://www.who.int/activities/prioritizing-diseases-for-research-and-development-in-emergency-contexts). Known examples of emerging viruses are Zika virus13 (ZIKV), with zoonotic transmission by mosquitoes and then between humans via mother-to-child transmission, sex, and blood transfusion, and Middle East respiratory syndrome coronavirus14 (MERS-CoV), with a zoonotic association to bats and camels. Currently, there are two ongoing emerging virus outbreaks: the Ebola virus outbreak in Africa, and the 2019-nCoV outbreak in China.

Although viruses are described almost exclusively as causative agents of disease, not all of them are pathogenic; viruses with no relevant clinical effect are known as commensal viruses. Known examples of commensal viruses are human pegiviruses (HPgV) and Torque teno viruses (TTV). Although their infection has no observable clinical effect, pegivirus infections have been shown to have positive effects on the survival of patients co-infected with HIV15, whereas TTVs have been proposed as markers to predict transplant rejection16. Additionally, viral sequences, due to their ability to integrate into the genome of another organism, can be found in almost every human cell independently of any pathogenic effect. Retrovirus-like sequences are estimated to represent around 8% of the human genome17, and are likely the remnants of primate germ cell infections whose integration was retained across millions of years. For example, human endogenous retroviruses (HERV) can integrate into the human genome, although their effects on human health are mostly unknown. Recent studies point towards a possible role in the regulation of innate immunity18, with both positive and negative effects, such as a potential use in treating ovarian cancer, or a reduced recognition of metastatic tumor cells by the immune system, respectively.

3.3 – Detection of viruses

Several diagnostic tests have been developed to detect the presence of a viral infection. Traditional tests rely on previous knowledge of pathogens and diseases, and are targeted to specific strains/families. Typically, a single test is used per virus, the most popular being serology-based assays and sequence-based assays, which require minimal to no virus culturing. Serology tests are based on identifying the reaction that occurs when antibodies bind to a virus's antigens. Some of the most-used serology-based tests are hemagglutination assays (HA), immunofluorescent antibody tests (IFAT), and enzyme-linked immunosorbent assays (ELISA) (Figure 2). Briefly, HAs are used for viruses that bind to red blood cells, such as influenza and dengue, and consist of progressive dilutions of antibodies incubated with viruses and red blood cells, allowing both the presence of a virus and the minimum concentration of antibodies required to neutralize it to be tested. IFAT, on the other hand, relies on tagging antibodies with fluorescent dyes in order to measure the light signal they emit, in a direct or indirect way. In the direct approach, the antibody that binds to the viral antigen is tagged with fluorescent molecules, whereas in the indirect approach an untagged antibody binds to the antigen, and an antibody tagged with fluorescence binds specifically to the first antibody. Similarly, ELISA fixes the antibodies to a solid surface before the sample is added19; if the virus is present, the antibodies capture it and any other molecule is subsequently washed away. Finally, a round of antibodies with fluorescent molecules is added, which bind to the captured viruses.

Other serology tests include western blotting, flocculation and virus neutralization11. Serology tests are fast, cheap and accurate, all important factors for routine assays. On the other hand, they lack sensitivity and will miss novel viruses and viruses with altered antibody recognition sites, due to the limited number of available virus-specific antibodies and their high specificity, respectively. Additionally, developing new antibodies is a slow and expensive process. Emerging viruses can also be missed, since the virus-disease association is not previously known and consequently the virus will likely not be searched for. Finally, serology tests have low resolution at the molecular level, i.e. they do not give information on the viral nucleotide sequence, and are thus incapable of providing information on important characteristics such as strain-level identity and variants.

Figure 2 – Adapted from Principles of Virology11. Examples of serology assays: a) hemagglutination, b) direct and indirect immunoassays, c) enzyme-linked immunosorbent assay (ELISA).

Sequence-based (or molecular) diagnostic tests, as the name suggests, are based on detecting molecular sequences specific to the organisms of interest. The two most widely used tests are real-time quantitative polymerase chain reaction (qPCR) and short-read high-throughput sequencing (HTS/NGS). A qPCR assay is a variant of the polymerase chain reaction (PCR) that allows the amount of viral particles present in a sample (the viral load) to be quantified by using specific primers to amplify the viral sequence. Since qPCRs search for known sequences, they cannot be used to search for novel pathogens, and their effectiveness is influenced by the high mutation rates of viruses, which can quickly make a strain too divergent to be tracked by primers designed to amplify a specific genus or family of virus. Much like serological assays, emerging viruses can also be missed in cases where they are not typically associated with the symptoms observed, and therefore will not be tested for. Despite these limitations, qPCR is one of the fastest and most cost-effective approaches, two features essential for clinical diagnostic tests. Another sequence-based test is Sanger sequencing20, which for many years was the go-to sequencing approach to identify viruses and perform phylogenetic analyses in a clinical setting, such as tracking HIV infection21. While accurate, it is both slow and technically demanding.

Short-read high-throughput sequencing (HTS), also known as next-generation sequencing (NGS) or deep sequencing, is an umbrella term for sequencing techniques that aim to produce large amounts of short nucleotide sequences, known as reads (typically 50-300 nucleotides long), derived from fragments of the genome of interest. These reads provide relevant information about the genome, such as levels of gene expression, genomic abundance, and sequence variants22. While it has a higher cost and turnaround time than serological/PCR assays, it also offers more information, allowing for the identification of virulence factors, genotyping, phylogenetics, and the discovery of novel viruses, and it unlocks the possibility of complementary parallel analyses, such as checking the effects of pathogens on human gene expression. Due to these advantages, it has quickly become the second line of diagnostics for cases where routine assays fail to establish a disease etiology. In the clinic, the ideal assay would be able to detect the presence of all known pathogens in a single, high-resolution test.

3.4 – Metagenomics

With the help of HTS, approaches have been developed that allow the full contents of a sample to be sequenced and characterized in a single run, potentially allowing all pathogens to be identified at the same time. These are known as metagenomic approaches, and consist of untargeted and unbiased sequencing of a sample's contents, be it environmental, clinical, or otherwise, offering a complete overview of the sequences within it. Since it is untargeted, metagenomic sequencing captures all contents of a sample without a priori knowledge, and does not require culturing. In clinical cases of disease of unknown etiology, metagenomics has been shown to be a reliable approach, as several research groups have been able to identify viruses and bacteria that correlate with the patient's symptoms. The first application of metagenomics as a clinically actionable diagnostic was published in 201423, where the authors successfully identified and treated a clinical case of leptospirosis. Standard clinical assays had not found an etiology for the observed symptoms, since they were negative for Leptospira. After the bacteria were detected, the patient was treated with antibiotics and recovered.

Shortly after, several studies showed that this approach was also able to detect viruses in a clinical setting, and established which laboratory methods are the most appropriate for such analyses24–26. Metagenomic sequencing approaches also allow for the characterization of novel sequences; traditionally, a virus culture is needed to confirm a new viral sequence. In recent years, however, sequencing has become cheaper and more accurate, and confirmation by assembling the full virus sequence with high coverage has been shown to be enough to confirm its existence. This brought an exponential growth in the discovery of novel virus-like sequences (Figure 3). In 2017, a study identified several uncharacterized viruses circulating in the blood of patients by analysing a total of 1300 cell-free DNA samples derived from 188 patients27. Out of 2917 novel sequences, 276 novel contigs were anellovirus, and 523 were classified as phage/prophage contigs. Along with this growth, the number of false positive results has also risen, with several descriptions of novel viruses later shown to be reagents, contaminations or sequencing artefacts. In 2019, a study showed which reagents and contaminants could be misinterpreted as viral sequences, through the analysis and comparison of 700 metagenomic libraries prepared with several different types of sequencing preparation kits and various reagents28. The results show a total of 493 virus-like sequences significantly associated with laboratory components, and not with the actual contents of the sample. In order to avoid these contaminations, pre-processing steps must be implemented before analysing metagenomic samples for pathogens, by targeting and removing adapter sequences, viral vectors and other sequences likely to be misinterpreted as "real viruses".

Additionally, a recently published review has called the results of some metagenomic experiments into question, claiming they fail to provide proper controls for a healthy viral background29. Indeed, one cannot determine which circulating viruses have a clinical impact without information on what constitutes a healthy virome. In 2017, a major study helped to establish the healthy blood virome by characterizing the DNA-based blood virome of 8000 healthy individuals30. It reported 94 different viruses, 19 of which were human viruses, most already known to be highly prevalent in humans, such as anelloviruses, Merkel cell polyomavirus (human polyomavirus 5), papillomaviruses, and herpesviruses. However, the study did not cover RNA or highly divergent viruses, missing a large part of the expected contents of the healthy human virome. Additionally, blood is not the only possible reservoir for viruses; a complete characterization of the human virome across its multiple tissues is necessary to accurately determine pathogenic infections. Metagenomics presents itself as the most promising approach for this analysis, as it does not need a priori information about the sample's contents and is able to produce an accurate representation of a patient's virome/microbiome independently of the type of sample used, be it blood, pulmonary fluid, cerebrospinal fluid, or others.

Figure 3 – Yearly cumulative number of viral sequences submitted to GenBank's virus section (gbVRL).


A standard metagenomics assay consists of three main steps: data generation, data processing, and data interpretation. To generate raw data, several protocols and sequencing technologies are available, each with their own advantages and disadvantages.

Currently, the most popular approach is the one developed by Illumina Inc., called sequencing-by-synthesis31,32 or, as it is more commonly known, "Illumina sequencing". It uses short reads, up to 300 nucleotides long, which are produced in large amounts (hundreds of millions to billions of reads) in order to cover all the genomic content present in the sample several times over. Other short-read HTS techniques have been brought to the market throughout the years, namely Roche's 454 sequencer (pyrosequencing)33, Thermo Fisher's Ion Torrent (semiconductor-based sequencing)34, and Life Technologies' SOLiD (sequencing by ligation)35, though these were not as successful. In contrast with short-read sequencing, other approaches have been developed using technologies that allow long reads to be sequenced without an amplification step, such as Oxford Nanopore and Pacific Biosciences SMRT sequencing. These help to solve several of the existing problems with short-read HTS, namely its inability to resolve repetitive regions and long homopolymers in the genome: the larger read size (over 10 kb) easily covers those regions and their surrounding sequences, and avoiding amplification circumvents the sequence errors it generates.

Currently, their disadvantages are a high error rate and a high monetary cost compared with the Illumina equivalent for the same amount of output data, as well as the need for a large amount of input material (PacBio). Despite this, studies have shown the viability of this technology for metagenomic analyses of viruses36–38, as well as of hybrid approaches that use both long- and short-read sequencing, which have allowed the detection of several mobile elements in the metagenome of patients39. Additionally, Oxford Nanopore's MinION sequencer trades sequencing depth for portability, which allows studies to be performed directly in the field using portable laboratories. This approach heavily contributed to the characterization of the Zika virus during the 2018 outbreak.

Generally speaking, sequencing protocols start with a total nucleic acid extraction from the sample of interest, e.g. a complete RNA/DNA extraction from a blood sample (Figure 4). The extracted DNA/RNA is then randomly sheared into smaller fragments. This process takes into account the read lengths generated by each type of protocol; fragments can vary from 20 to 800 bp for short-read protocols, or up to 20 kb for long-read protocols. In the case of RNA libraries, the RNA is also reverse-transcribed into cDNA before sequencing. When samples have a low DNA/RNA yield, a PCR amplification step may also be required to reach the required amount of input material. The exception to this is PacBio, which by default requires a large amount of non-amplified input DNA/RNA40. Sequencing adapters are then ligated to the DNA/cDNA fragments, which are then ready to be sequenced. The final product of the sequencing run is a library of millions of reads, all belonging to the same sample. In order to minimize costs, several samples are usually sequenced at the same time, in a process known as multiplexing. In order to accurately sort each read into its sample, a unique identifier needs to be included. This is known as read indexing, and consists of a small, unique oligonucleotide sequence (8-10 nt long) which is attached to the fragments along with the sequencing adapters, thus allowing reads to be assigned to their library of origin. In a metagenomic sample, the most abundant component is typically the host's own genomic content; for example, in a human metagenomic sample, up to 95% of the reads generated are attributed to human DNA/RNA30,41, with the rest distributed between contaminants and the host's microbiome, despite the latter representing a larger pool of different organisms. Viruses in particular are often the smallest proportion of reads in a sample, and consequently have the weakest signal for detection. Methods have been developed to improve this signal, albeit not without significant drawbacks. Size filtration and density-based enrichment can increase the amount of viral reads, but with a significant increase in bias in the composition of the viral populations42. Primer-based amplification can also bias results, as it can miss unknown and highly divergent viral sequences and, in cases of co-infection with same-family viruses, preferentially amplify a particular strain over the others.
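Returning to the read-indexing step described above, the short Python sketch below illustrates how demultiplexing works in principle: reads are assigned back to their sample of origin based on the index sequence attached to them. The index-to-sample table, the assumption that the index sits at the start of each read, and all sequences shown are illustrative placeholders, not part of any real protocol or sample sheet.

```python
from collections import defaultdict

# Hypothetical index-to-sample assignments; in practice these come from the run's sample sheet.
SAMPLE_INDICES = {
    "ACGTACGT": "patient_01",
    "TGCATGCA": "patient_02",
}

def demultiplex(reads, index_length=8):
    """Assign each read to a library based on its index sequence.
    For simplicity the index is assumed to be the first `index_length` bases of the read;
    reads carrying an unknown index are set aside as 'undetermined'."""
    libraries = defaultdict(list)
    for read in reads:
        index, insert = read[:index_length], read[index_length:]
        sample = SAMPLE_INDICES.get(index, "undetermined")
        libraries[sample].append(insert)
    return libraries

# Toy example: two indexed reads and one read with an unrecognized index.
reads = ["ACGTACGTTTGACCATGGA", "TGCATGCAGGCATTACCGT", "AAAAAAAACCGTACGTTAG"]
print({sample: len(inserts) for sample, inserts in demultiplex(reads).items()})
```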

Figure 4 – A typical workflow for metagenomic analyses. The boxes in green represent steps requiring bench work (wet-lab), and are tied to data production. The blue boxes represent the steps using bioinformatics (dry-lab), and relate to data processing. Finally, the red box represents a step requiring both wet- and dry-lab contributions, and relates to data interpretation.


Once the read libraries have been created for each sample, they need to be processed in order to determine their contents and how much of each organism is present (Figure 4, blue boxes). This is the data processing step, and it is split into three parts: pre-processing, host content removal, and analysis of non-host data.

The pre-processing step aims to remove sequences that can bias downstream analyses. Low-quality reads, low-complexity reads, homopolymers, and uncalled bases (tagged as N) are removed, as their repetitive properties and unreliable quality make them unlikely to be assigned to the correct organism. Briefly, these are defined as follows. Read quality scores (QS) represent the probability of nucleotides being miscalled during sequencing, and are defined by the equation QS = -10·log10(e), where e is the estimated probability of a base call being incorrect; for example, a QS of 40 means a 99.99% probability that the nucleotides called during sequencing are correct (high quality), whereas a QS of 10 lowers that probability to only 90% (low quality). Homopolymers are long stretches of the same nucleotide within a read, covering a large portion (>50%) of its total length; their repetitive nature causes both base-calling and alignment ambiguity due to the lack of a distinctive pattern. Low-complexity reads are similar to homopolymers: they are reads with a large number of repetitive patterns (i.e. short tandem repeats), which also cause ambiguity when attempting to identify their correct origin, both at the organism level and at their location within a genome. Additionally, removal of adapter sequences and kit-related sequences is recommended, typically by scanning all reads against UniVec, a database that collects known vector sequences, adapters and reagents.

Other pre-processing steps include the removal of PCR duplicates; these are reads that have the exact same sequence and chromosomal coordinates, and derive from an excessive number of amplification cycles and uneven fragment sizes. Since DNA/RNA is sheared randomly, the likelihood of observing the exact same reads belonging to the same region is extremely low; should identical reads mapping to the same region be observed, they need to be filtered, as they could cause certain genomic regions to be overrepresented. Finally, the removal of ribosomal sequences is also recommended in analyses where the nucleotide sequence is translated into hypothetical proteins, not only because ribosomal RNAs are not biologically translated into amino acids, but also because their translated sequence is not identified as ribosomal and is instead assigned to other organisms (viruses, bacteria, mammals), clouding the interpretation of results.
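To make the quality-score relationship concrete, the short Python sketch below converts between Phred-style quality scores and base-call error probabilities, and applies a simple mean-quality and N-content filter to a read. The Q30 and 10% N thresholds are illustrative assumptions for this sketch, not prescribed values.

```python
import math

def phred_to_error_prob(qs: float) -> float:
    """Invert QS = -10*log10(e): recover the base-call error probability e from a quality score."""
    return 10 ** (-qs / 10)

def error_prob_to_phred(e: float) -> float:
    """Convert an estimated base-call error probability into a Phred-style quality score."""
    return -10 * math.log10(e)

def passes_quality_filter(qualities, bases, min_mean_q=30.0, max_n_fraction=0.1):
    """Illustrative read filter: keep a read only if its mean quality is high enough
    and it does not contain too many uncalled bases (N)."""
    mean_q = sum(qualities) / len(qualities)
    n_fraction = bases.upper().count("N") / len(bases)
    return mean_q >= min_mean_q and n_fraction <= max_n_fraction

# Q40 corresponds to a 0.01% error rate, Q10 to a 10% error rate, as in the text.
print(phred_to_error_prob(40))     # 0.0001
print(phred_to_error_prob(10))     # 0.1
print(error_prob_to_phred(0.001))  # 30.0
print(passes_quality_filter([38, 40, 35, 12], "ACGN"))
```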

In terms of software, there are several packages that assess read quality and sequencing errors, ranging from simple quality analysis (FastQC43) to complete tools that assess read quality, detect and filter sequencing adapters, and improve read quality by trimming low-quality ends (Trimmomatic44, Prinseq45, BBDuk46). After filtering libraries for reads that are either not part of the sample or contain information derived from sequencing errors, we are left with the reads belonging to the actual components of the sample: the host and its microbiome. As mentioned previously, and with the exception of the gut microbiome, the number of reads attributed to the host is comparatively higher than the number assigned to the microbiome, despite the latter representing a larger pool of organisms. Host data can impact downstream analyses, and its inclusion can increase the likelihood of misclassified reads, due to its abundance and the presence of regions with high sequence similarity between organisms. In the case of clinical samples, human data can also be subject to mandatory removal in order to preserve the patients' privacy. For these reasons, host content is typically removed in order to simplify the analysis of the microbial content.

After this, the remaining reads are ready to be classified according to their taxonomic profile. This is seen as the most challenging step of the analysis, with two main computational problems to solve. The first is that NGS produces a large amount of data (hundreds of millions of reads, increasing with every new generation of sequencers) which needs to be classified in an accurate and timely way; the second is that the number of reference organisms to scan against has also been growing at an accelerated pace (Figure 3), increasing the number of sequence comparisons needed and thus extending the duration of the analysis. There are two main approaches to the taxonomic classification of reads. The first directly compares their similarity to a curated database of reference sequences, such as human-related viruses (e.g. ViPR47) or a large database of different organisms (e.g. RefSeq, GenBank), and works best for identifying sequences close to already known organisms, i.e. organisms whose sequence has already been described; this is known as a reference-based approach.

Software-wise, BLAST is the most well-known reference-based tool48, and it is seen as one of the most sensitive tools for classifying metagenomic data, with the drawback of being one of the slowest, making it unfit for the clinical setting. Overall, reference-based tools can be split into read-mapping approaches, which classify each read individually according to a set of reference sequences/taxa, and profiling approaches, which search for a set of specific matches within the dataset and report the relative abundance of each taxon rather than classifying each read/read-pair individually. Read-mapping tools (BLAST, BWA49, BBMap46, SNAP50) align each read against a database of representative genomes (or index) chosen by the user, which can vary in size and complexity, from a small number of specific marker genes to several sets of complete genomes. Profiling approaches (e.g. Kraken51, CLARK52), on the other hand, take the sequences of all relevant genomes and create a database by breaking down each genome into smaller fragments of size k, called k-mers. These are sorted and stored along with taxon information (taxID), resulting in a look-up table that can be searched for exact k-mer matches defining a certain family/species/strain; the more genomes a k-mer appears in, the less specific it is. By default, these databases use a fixed k-mer size of 31, though this can be changed depending on the amount of memory available: smaller k-mer sizes raise sensitivity and consequently call more false positives, whereas longer k-mers raise specificity and the number of false negatives. While faster than alignment-based approaches, exact-match methods also have a lower sensitivity, leading to the non-detection of divergent sequences. To counteract this, software has been developed in which k-mers can be spaced, extending possible k-mer matches with gaps (e.g. Seed-Kraken53 and MetaProb54), with some success.
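As a toy illustration of the k-mer look-up idea behind such profilers, the Python sketch below builds a k-mer-to-taxon table from a couple of made-up reference sequences and counts exact matches for a read. Real tools index complete genome collections, compress the table, and resolve ambiguous k-mers to a last common ancestor; the sequences, taxa and the tiny k used here are purely illustrative.

```python
from collections import defaultdict

def kmers(sequence: str, k: int):
    """Yield all overlapping k-mers of a sequence."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]

def build_kmer_table(references: dict, k: int) -> dict:
    """Map each k-mer to the set of taxa whose genome contains it.
    k-mers shared by several taxa are less specific, as noted in the text."""
    table = defaultdict(set)
    for taxon, genome in references.items():
        for kmer in kmers(genome, k):
            table[kmer].add(taxon)
    return table

def classify_read(read: str, table: dict, k: int) -> dict:
    """Count exact k-mer matches per taxon for one read (no LCA resolution here)."""
    counts = defaultdict(int)
    for kmer in kmers(read, k):
        for taxon in table.get(kmer, ()):
            counts[taxon] += 1
    return dict(counts)

# Hypothetical miniature references; k is kept small so the example stays readable.
references = {"virus_A": "ATCGATCGGCTAGCTA", "virus_B": "TTGGCCAATCGATTAC"}
table = build_kmer_table(references, k=5)
print(classify_read("ATCGATCGGC", table, k=5))
```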

The second approach is known as de novo assembly and, as the name implies, it relies on assembling reads into larger sequences without the need for a reference database, in order to reconstruct the genomes present in the sample. This approach is particularly useful to recover sequences not present in databases, which can range from slight divergence from a known reference to an entirely novel genome, never before described and/or classified. De novo approaches take reads and overlap them to form larger sequences, called contigs.

Two of the most commonly used algorithms in read assembly are Overlap-Layout-Consensus (OLC) and de Bruijn graphs (DBG)55. OLC, as the name suggests, is a three-step approach: first, the algorithm overlaps the reads by performing an all-versus-all pairwise read comparison; then it creates a layout of these pairwise comparisons to build the contigs, with each read corresponding to a node in the graph and each edge being an overlap between reads, forming a multiple sequence alignment (MSA). Finally, based on the MSA, it picks the consensus sequence by selecting the most likely nucleotide for each position of the contig, i.e. a majority vote (Figure 5a). Notable OLC-based assemblers are IVA56, Omega57, and Genovo58. The DBG algorithm works by breaking down each read into k-mers and merging them into a graph based on their relationship with neighboring k-mers (i.e. each k-mer is a node and each edge connects two neighbors). The contig is calculated by finding the optimal path in which each edge is visited only once (Figure 5b). Recent assembler benchmarking tests classify the DBG-based assemblers metaSPAdes, MEGAHIT, and IDBA-UD as the best options for assembling viromes from metagenomic data59,60. Both algorithms have their own advantages and disadvantages. OLC approaches require less pre-processing and have a higher tolerance for heterozygous organisms, as they contain a built-in error correction step: by having to calculate a consensus sequence, mismatches are "voted out" of the final sequence. However, due to the nature of all-versus-all approaches (all reads are compared against all reads), the more data there is, the slower and more computationally demanding the process becomes. It is therefore recommended for samples with low read abundance and longer reads. Comparatively, DBG approaches are less computationally intensive, making them suitable for HTS/NGS data; however, they require more pre-assembly error correction steps, as more sequencing errors produce more bubbles, or branching k-mer paths. In both approaches, after successfully assembling the contents of a library, the contigs generated are taxonomically classified by binning them according to abundance or sequence similarity, or by comparison against known references, similar to the reference-based approach described above.
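To illustrate the de Bruijn idea on a toy scale, the Python sketch below builds a graph whose nodes are (k-1)-mers and whose edges are the k-mers of each read, then walks unused edges greedily from a chosen start node. Real assemblers search for Eulerian-like paths, correct errors and resolve bubbles; the reads, the k value and the start node here are illustrative assumptions.

```python
from collections import defaultdict

def build_de_bruijn_graph(reads, k):
    """Each k-mer in a read adds an edge from its (k-1)-mer prefix to its (k-1)-mer suffix."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph

def greedy_walk(graph, start):
    """Follow edges from the start node, consuming each edge at most once,
    and extend the contig by one nucleotide per step."""
    remaining = {node: list(targets) for node, targets in graph.items()}
    contig, node = start, start
    while remaining.get(node):
        node = remaining[node].pop(0)
        contig += node[-1]
    return contig

# Hypothetical overlapping reads drawn from the same short sequence.
reads = ["ATGGCGT", "GGCGTGC", "CGTGCAA"]
graph = build_de_bruijn_graph(reads, k=4)
print(greedy_walk(graph, "ATG"))  # reconstructs the overlapping sequence
```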

Metagenomic analysis software has seen a boom in development in recent years, with several tools published every year, each claiming to be more efficient and accurate than the previous ones on specific aspects of the analysis. Since benchmarking tests for these tools are not standardized, a need arose for a universal benchmarking framework capable of performing unbiased evaluations of their quality by independent groups. The first evaluation study of its kind was the CAMI challenge, which provided original, never-before-analysed libraries to any group that wanted to assess and submit their tool61. This resulted in a more unified view of metagenomic analysis tools, showing that most tools struggled with classification at the species level or lower. However, it only provided a snapshot of the tools available at the time of the study, ignoring any new tools published since. Other reviews have been published since then, but the same problem remains62,63. To tackle this problem, our lab created an in-house solution (LEMMI64), led by Seppey et al., a tool that allows for continuous benchmarking, with the capability of adding any tool to the benchmarking tests and comparisons in real time.


Figure 5 – Representation of assembly algorithms typically used in metagenomics. a) Overlap-Layout-Consensus algorithm: each read (size 6 nt) goes through a pairwise comparison against all other reads and is laid out according to these comparisons in order to calculate the best consensus sequence. Mismatches (in blue) are corrected during the consensus step, with each nucleotide position decided by majority vote. b) De Bruijn algorithm: each read is split into all possible k-mers (size 4) and connected to the others by their neighboring relationships in order to calculate the best path. In some cases several possible paths may exist, forming bubbles (in blue).

Once a classification of reads and/or assembled contigs is achieved, we move to data interpretation (Figure 4, red box); processing the data alone is not enough to understand the contents of a library, and further steps are needed to understand the biological context of the results. First, before trying to determine possible roles in disease, one of the most important steps is to determine whether the organisms reported are actually there, by physically confirming their presence with an RT-PCR assay, which can require the development of new probes to detect the newly assembled sequences.

Related to the confirmation of results, post-processing identification and removal of contaminants can also be required; packages such as decontam65 remove background noise related to the sequencer and the reagents used by comparing the ratio of the species observed in a control library against the results from the patients' libraries. This approach has been used to expose various cases of reagent contamination, such as demonstrating that the majority of species thought to be part of the human placenta's microbiome are, in fact, contaminants66–68. Another important set of post-processing analyses is the assessment of how "good" the assemblies are; this is done by checking the overall coverage of each contig, assessing for possible chimeric assemblies (the merging of reads from different organisms), and checking assembly completeness with software such as BUSCO69, which reports how many of the expected single-copy genes were recovered. Finally, gene prediction software and phylogenetic analyses show how the detected organisms relate to previously described ones, giving context on possible roles in disease.

3.5 – Thesis objectives

In this thesis I explored the applications of metagenomic approaches to the characterization of the human virome and the detection of known pathogenic viruses, emerging viruses, and highly divergent or even novel viruses (Figure 6). I mainly focused on two sample types, cerebrospinal fluid (CSF) and blood. To perform these detections, I first needed to establish which viruses are typically found in a healthy virome, e.g. in blood donors, and which known pathogenic viruses are observable in patients. After determining these two backgrounds, I was able to identify which novel/emerging viruses are found only in cases of disease, and whether they play a role as pathogens.

Figure 6 – Overall objectives of the thesis. Starting with samples of healthy donors and samples of patients with disease of unknown etiology, I aim to use a metagenomic approach to identify and characterize known pathogenic and commensal viruses. Following this, I intend to look into viruses not previously described, which will contribute to future association studies in order to determine their impact on disease.

3.6 – Thesis contributions

In the context of clinical metagenomics, I collaborated with the virology department at the University of Geneva Hospitals (HUG) in a variety of studies, with the aim of characterizing the human virome and its association with disease, by developing and maintaining an analysis pipeline specialized in the detection of viral infections. One of the multiple topics was the characterization of astrovirus infections in the cerebrospinal fluid, blood and respiratory tract of patients70–72 and their involvement in disease, as well as a review of the state-of-the-art knowledge on astroviruses73. Other topics included a study of the virome of the blood bank (red blood cells, plasma and platelets) at the HUG74,75 in the context of transfusion safety; a metagenomics study on viral reactivation in patients who underwent hematopoietic stem cell transplantation76 (HSCT), focusing on how it might affect patient recovery; an analysis of the metavirome of the cerebrospinal fluid of patients presenting meningitis-like disease compared with non-symptomatic patients77, in order to clarify the etiology of unexplained encephalitis cases; a confirmation analysis of dicistrovirus in the blood of febrile patients78; and a metagenomics study on the etiology of Kawasaki disease79.

Additionally, I contributed bioinformatics analyses to a non-metagenomics-based study aiming to identify virus-specific effects on human airway epithelia during infection80. Finally, my work contributed to the detection and characterization of an unexpected and complex contamination case, which also resulted in the assembly of a transcriptome of a drain fly detected in the contaminated clinical samples.

3.7 – Thesis outline

After the introduction above, I present the results section, divided into two parts: the first part is a set of descriptions of each publication I contributed to during the course of this thesis, highlighting the context of the study and my contribution to it, followed by the publications themselves in the second part. In the discussion section, each subsection aggregates the previously shown publications by topic. Results are discussed from the point of view of how the bioinformatics approaches applied in these studies contributed to obtaining results relevant to the questions posed in each publication. First, I discuss the methods developed and used to analyse the contents of metagenomic libraries, their limitations regarding contamination, and possible improvements. Secondly, I discuss the Astroviridae cases, which are all merged into one section (sections 4.2, 4.3, 4.4 and 4.9), as are the virome characterizations of blood and platelet donors at the HUG (sections 4.6 and 4.7); each other publication has its own subsection. Lastly, the conclusion section discusses the overall contributions made, the challenges, and future perspectives on the application of metagenomics in a clinical setting. Additionally, the appendices contain the online supplementary material from each publication, as well as additional figures and tables necessary for the discussion of the results that were not included in the original manuscripts.


4. Methods and Results

4.1 – Methods for the detection of viruses in metagenomic samples

Analyses were made with a pipeline named ezVIR2, expanding on the original tool, ezVIR, developed by Tom Petty24. This pipeline identifies viruses from both RNA and DNA-seq libraries and can process data from any short-read sequencer. The first version relied on individual greedy alignments against a panel of more than ten thousand clinically relevant viral genomes, outputting the best results based on the most-covered viral strains. In practice, this meant that every library was individually mapped more than ten thousand times, which proved both slow and computationally heavy, so the virus detection step had to be overhauled. Briefly, instead of thousands of individual alignments, the second version relies on representative sequences of clusters of genomes grouped by sequence similarity. A single index is created for these representative sequences, and reads are mapped against it in order to find which families of viruses are present in each sample. Each representative genome with a positive result is marked, and the corresponding clusters are fully explored by mapping all reads to each genome they contain, with the objective of finding the closest possible strain.

The pipeline structure and software are the following. ezVIR2 first performs quality control of the input data, removing or trimming low quality and low complexity sequences using Trimmomatic v0.3944 and Tagdust281, respectively. Reads with a quality score lower than 30, corresponding to a 0.1% error rate, and reads with an entropy value lower than 16, corresponding to a very low complexity structure, are removed. Human data is removed by aligning the quality-filtered data to the human reference genome (hg38) and transcriptome using the short-read aligner SNAP50, with settings allowing for a maximum of eight mismatches per read pair. Reads that did not map to the human reference, referred to from this point onward as "non-human data", are recovered and aligned in two steps against a curated database of clinically relevant viruses, developed in collaboration with Prof. Kaiser's group. This database currently comprises over 11,000 viral sequences known to be associated with human disease or potential zoonotic transmission.

When searching for viruses, the first step determines only which families of viruses are present, by aligning the reads against a smaller version of the virus database consisting of ~4,000 representative sequence clusters. To cluster the genomes, I used CD-HIT-EST82 with a sequence identity threshold of 99% between sequences at the nucleotide level. The alignment is made against a representative sequence of each cluster, defined as the longest sequence per cluster. Clusters identified in the first step are taken to the second step, where reads are mapped to each viral sequence inside the cluster individually, in order to identify the closest representative strain for each cluster. The resulting output is a plot showing how many virus genera were found and which are the most representative strains for each virus family (Figure 7). Based on the observations made by Wright et al83 for single/double-indexed Illumina samples, I also applied a cut-off value for read abundance in order to reduce the number of false positive hits due to cross-contamination in multiplexed samples. Briefly, for each virus, I calculated the ratio between the read count of each sample and that of the sample with the highest abundance. A ratio of 0.24% was then used as the cut-off value for cross-talk: any library containing a virus with a ratio lower than this value was assumed to derive from reads that were improperly demultiplexed. The majority of viruses considered to be contaminants were confirmed by RT-PCR to be absent.

To detect divergent sequences, non-human reads are assembled into contigs using three different assembly tools: metaSPAdes84, MEGAHIT85 and IDBA-UD86. Each assembly is then scanned with DIAMOND87, a blastX-like tool, using NCBI's NR as the input database, and an e-value threshold of 0.05 is used as the cut-off for further analysis. DIAMOND's output is a BLAST table that is fed into MEGAN88, a software tool that bins results according to the lowest common ancestor (LCA), assigning a weight to each reference sequence based on the number of BLAST hits assigned to it and to sequences of the same taxonomic species, with a threshold of 75%. In order to check for a consensus between assemblies, contigs classified as virus or virus-like are compared across all three assemblies using MAFFT89, a multiple sequence aligner. The same sequences are further analysed by checking their read coverage and depth, and by translating their nucleotide sequence in all six reading frames in order to identify their gene structure. Each hypothetical protein is classified by scanning with PSI-BLAST90, using a maximum e-value of 0.05 as threshold. Additionally, sequences with no classification are also translated in the six reading frames and scanned with PSI-BLAST, using the same threshold value.
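To make the binning step more concrete, the listing below sketches a naive LCA assignment, i.e. taking the deepest taxon shared by all the hits of a contig. It is a simplified illustration only, not the ezVIR2 or MEGAN implementation, and it omits MEGAN's reference weighting and 75% threshold; the lineages in the example are invented.

    # Naive LCA assignment, for illustration only: MEGAN additionally weights
    # reference sequences and applies a 75% threshold before binning contigs.

    def naive_lca(lineages):
        """Each lineage is a list of taxa ordered from root to species.
        Return the deepest taxon shared by all lineages."""
        if not lineages:
            return None
        common = lineages[0]
        for lineage in lineages[1:]:
            prefix = []
            for a, b in zip(common, lineage):
                if a != b:
                    break
                prefix.append(a)
            common = prefix
        return common[-1] if common else "root"

    # Invented example: three DIAMOND hits obtained for a single contig
    hits = [
        ["Viruses", "Riboviria", "Astroviridae", "Mamastrovirus", "Astrovirus MLB2"],
        ["Viruses", "Riboviria", "Astroviridae", "Mamastrovirus", "Astrovirus MLB1"],
        ["Viruses", "Riboviria", "Astroviridae", "Mamastrovirus"],
    ]
    print(naive_lca(hits))  # -> Mamastrovirus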

The full code and database are available at: https://gitlab.com/ezlab/ezvir2
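The cross-talk cut-off described above can also be expressed in a few lines. The following minimal sketch assumes that per-virus read counts per sample are already available; the sample names and counts are invented, and the function is not the published ezVIR2 code.

    # Cross-talk (index hopping) filter: for one virus, flag every sample whose
    # read count is below 0.24% of the count in the most abundant sample.

    CROSS_TALK_RATIO = 0.0024  # 0.24% cut-off, following Wright et al.

    def flag_cross_talk(read_counts, cutoff=CROSS_TALK_RATIO):
        """read_counts: {sample_id: reads assigned to this virus}.
        Return the samples whose signal is attributed to cross-contamination."""
        max_count = max(read_counts.values())
        if max_count == 0:
            return set()
        return {sample for sample, count in read_counts.items()
                if count / max_count < cutoff}

    # Invented example: one virus detected across four multiplexed libraries
    counts = {"S1": 12500, "S2": 18, "S3": 4100, "S4": 3}
    print(flag_cross_talk(counts))  # S2 and S4 would be treated as cross-talk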


Figure 7: Representation of the ezVIR2 analysis pipeline. Briefly, in the first section, the sequenced metagenomic libraries are filtered for host data, low quality and low complexity reads, as well as sequences derived from reagents and other contaminants (UniVec). In Phase-1, non-human reads are mapped against a database of 4,500 clinically relevant viruses in order to determine which families of viruses are present in each sample. In Phase-2, the search is extended to the full database, in order to determine the closest known species for each identified virus family.
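As a complement to the figure, the following sketch outlines the Phase-1/Phase-2 logic in schematic Python. The map_reads argument is a hypothetical stand-in for the actual short-read aligner used by ezVIR2, the toy implementation only counts exact substring matches, and the cluster contents are invented; none of this is the pipeline's real code.

    # Schematic two-phase detection: screen reads against one representative per
    # cluster (Phase 1), then map against every member of the positive clusters
    # to pick the best-supported strain (Phase 2).

    def detect_viruses(reads, clusters, map_reads, min_reads=1):
        """clusters: {representative_id: {genome_id: sequence, ...}}, where the
        representative is the longest genome of its cluster."""
        # Phase 1: which clusters (and hence virus families) are present?
        positive = [rep for rep, members in clusters.items()
                    if map_reads(reads, members[rep]) >= min_reads]

        # Phase 2: within each positive cluster, find the best-supported strain
        best_strain = {}
        for rep in positive:
            counts = {genome_id: map_reads(reads, sequence)
                      for genome_id, sequence in clusters[rep].items()}
            best_strain[rep] = max(counts, key=counts.get)
        return best_strain

    def toy_map_reads(reads, reference):
        # Toy stand-in for a real aligner: count reads found verbatim
        return sum(1 for read in reads if read in reference)

    # Invented example with two tiny "genomes" in a single cluster
    clusters = {"repA": {"repA": "ACGTACGTTTGA", "strainA2": "ACGTACGAATGA"}}
    reads = ["ACGTACGT", "TTGA"]
    print(detect_viruses(reads, clusters, toy_map_reads))  # {'repA': 'repA'}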


4.2 – Identification of an Astrovirus in respiratory disease of unknown etiology

Astrovirus VA1 identified by next-generation sequencing in a nasopharyngeal specimen of a febrile Tanzanian child with acute respiratory disease of unknown etiology70

Authors: Samuel Cordey, Francisco Brito, Diem Lan Vu, Lara Turin, Mary Kilowoko, Esther Kyungu, Blaise Genton, Evgeny M. Zdobnov, Valérie D'Acremont, and Laurent Kaiser

DOI: 10.1038/emi.2016.67
Full text: pages 38 to 40

Context: In this paper we detail the detection of an Astrovirus VA1 genome in a study of 30 nasopharyngeal samples from Tanzanian children suffering from respiratory symptoms with no known etiology (six pools of five patients each). The presence of the Astrovirus was later confirmed by RT-PCR. Astroviruses are not commonly found outside the gastrointestinal system, and this analysis constituted their first documented report in respiratory specimens. The analysis also detected the presence of several other viruses, namely parainfluenza virus types 2 and 4 (PIV2/PIV4), which fit the symptoms presented by some of the patients. PIV2 was confirmed in two patients by RT-PCR, while PIV4 was not.

Contributions: I performed the bioinformatics analysis of the metagenomic samples using the ezVIR2 pipeline to process the libraries and detect the presence of viruses, as described previously. Following the initial detection of the Astrovirus VA1 in the first pool, the RNA and DNA of each patient in that pool were sequenced individually and analysed using the same approach. I also performed an additional phylogenetic analysis of the detected Astrovirus, which consisted of assembling the reads mapping to the virus, performing a BLAST analysis to identify the contigs belonging to the capsid sequence, and generating a consensus tree based on a multiple sequence alignment between these contigs and a set of representative Astrovirus capsid sequences. Additional software used: MAFFT91 for generating the multiple sequence alignment, IDBA-UD86 for assembly, and IQ-TREE92 for generating the tree.
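The capsid-contig identification step can be sketched as follows, assuming the assembled contigs were compared against a set of Astrovirus capsid references with BLAST and the results saved in tabular format (-outfmt 6); the file name and thresholds are illustrative assumptions rather than the values used in the publication.

    # Minimal sketch: keep contigs with a sufficiently good hit to a capsid
    # reference, parsed from a BLAST tabular (-outfmt 6) result file.
    import csv

    OUTFMT6_FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch",
                      "gapopen", "qstart", "qend", "sstart", "send",
                      "evalue", "bitscore"]

    def capsid_contigs(blast_tsv, min_pident=80.0, max_evalue=1e-5):
        selected = set()
        with open(blast_tsv) as handle:
            reader = csv.DictReader(handle, fieldnames=OUTFMT6_FIELDS,
                                    delimiter="\t")
            for row in reader:
                if (float(row["pident"]) >= min_pident
                        and float(row["evalue"]) <= max_evalue):
                    selected.add(row["qseqid"])
        return selected

    # Hypothetical usage, on a file produced beforehand with blastn:
    # keep = capsid_contigs("contigs_vs_capsid_refs.outfmt6.tsv")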


4.3 – Identification of an Astrovirus in meningitis-like disease of unknown etiology

Astrovirus MLB2, a New Gastroenteric Virus Associated with Meningitis and Disseminated Infection71

Authors: Samuel Cordey, Diem-Lan Vu, Manuel Schibler, Arnaud G. L'Huillier, Francisco Brito, Mylène Docquier, Klara M. Posfay-Barbe, Thomas J. Petty, Lara Turin, Evgeny M. Zdobnov, and Laurent Kaiser

DOI: 10.3201/eid2205.151807
Full text: pages 41 to 48

Context: This paper is part of the analysis of a cohort of patients with meningitis-like symptoms and no known etiology. RNA and DNA were extracted from each patient's cerebrospinal fluid (CSF) and sequenced, in order to find potential pathogens.

Bioinformatics analysis detected one patient positive for Astrovirus MLB2, which was later also found in urine, plasma and anal swab samples. The detected strain was used for a prevalence study in 943 fecal and 424 cerebrospinal fluid samples from the HUG, which identified an additional five cases of Astrovirus MLB2 infection. While Astroviruses are commonly associated with cases of gastroenteritis, this publication contributed to a growing body of evidence that they might also play a role in CNS disease.

Contributions: I performed the bioinformatics analysis using the following software: ezVIR2 for the processing of samples and detection of viruses, Sparse93 for virus genome assembly, and MAFFT and IQ-TREE for phylogenetic analysis. The initial analysis identified an Astrovirus MLB2 in the cerebrospinal fluid of a patient, with low coverage (35% of the total sequence length), which subsequently led to the sequencing and analysis of the patient's plasma, urine, and anal swab. The virus was identified in all three samples and confirmed by RT-PCR. In particular, I recovered a complete Astrovirus MLB2 sequence in the anal swab sample, with 98.5% similarity to previously described sequences. The recovered sequence contributed to defining the PCR probe used for the downstream prevalence study.
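Coverage figures such as the 35% breadth mentioned above can be derived from a per-base depth profile. The sketch below assumes a tab-separated table in the style of samtools depth output (reference name, 1-based position, depth) and a known genome length; the file name and length in the usage example are placeholders.

    # Breadth of coverage = fraction of genome positions covered by at least
    # min_depth reads, computed from a samtools-depth-style table
    # (reference <tab> position <tab> depth).

    def coverage_breadth(depth_file, genome_length, min_depth=1):
        covered = set()
        with open(depth_file) as handle:
            for line in handle:
                _ref, pos, depth = line.rstrip("\n").split("\t")
                if int(depth) >= min_depth:
                    covered.add(int(pos))
        return len(covered) / genome_length

    # Hypothetical usage (placeholder file name and genome length):
    # breadth = coverage_breadth("mlb2_csf.depth.tsv", genome_length=6100)
    # print(f"{breadth:.1%}")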
