From the document Metagenomic Characterization of the Virome of Clinical Samples (pages 10-27)

3.1 – Virus classification

Viruses are defined as infectious, obligate intracellular parasites. Their genomes consist of either DNA or RNA, and they replicate by relying on the host's cellular systems. They are thought to be the most abundant type of organism on the planet and have been observed to infect almost all other organisms, from prokaryotes to eukaryotes, displaying multiple morphologies and methods of infection and transmission. Unlike cellular organisms, viruses lack a universal set of genes that can be used to classify and compare them (e.g. the 16S ribosomal RNA gene or replication-related genes); the closest exception is the RNA-dependent RNA polymerase gene, which is found in almost all RNA viruses. Historically, their classification has therefore relied on a combination of several properties, from host range and pathogenicity to structure and replication mechanism. The best known categorization is the Baltimore classification, which separates viruses into seven types based on a combination of features: replication method, genetic material, and strand orientation (Table 1, Figure 1). The taxonomic classification and organization of viruses is maintained by the International Committee on Taxonomy of Viruses (ICTV) [1] which, as of February 2019, reported a total of 14 viral taxonomic orders, 150 families, 1019 genera, and 5560 viral species classified. While approval by the ICTV for the taxonomic classification of viruses has historically relied on virus isolation/culturing, developments in sequencing technologies have shown that complete, high-quality viral sequences can be obtained using only sequencing, thus reducing the reliance on physical characteristics for taxonomic classification. Metagenomic approaches, in particular, have greatly contributed to expanding the known virome, allowing for the identification of hundreds of thousands of new viral genomes [2,3]. Despite not characterizing the biological properties of the viruses directly (e.g. pathogenicity, host), sequencing approaches provide information on the viral genomic sequence (e.g. predicted genes, GC content, codon usage, sequence motifs) and how these relate to other viral species. The amount of information provided by these approaches prompted a decision by the ICTV in 2016 to accept sequencing data alone as confirmation of the presence of a virus in a sample [4], allowing for a more accurate taxonomic classification of viruses.

Despite this implementation, the ICTV has not defined complete rules on how to systematically classify viruses, providing no universal thresholds for sequence similarity, no guidance on which other properties to consider, and no naming conventions. This, coupled with the fact that virus classification throughout the years has had different rules applied between and within virus families, shows that virus classification remains inconsistent. A clear example of this is the Flavivirus genus, where the nucleotide distance between two viral sequences varies anywhere between 6% and 43% [5]. Another example is the Herpesvirales order: the traditional classification approach was based on morphology only (shape of capsid/envelope/nucleus), but once genomic sequence similarities were taken into account, it became evident that the group's species (Herpesvirus 1 to 8) have a higher diversity between types than previously expected [6]. Other examples include viruses which are classified based on only one gene, e.g. the Tymovirales order, whose classification is based exclusively on RNA polymerase genes, despite the genome encoding a large polyprotein with four distinct genes [7]. Recently, several groups have attempted to normalize virus classification, proposing methods for scoring viruses based on their genomic similarity [8,9]. However, despite suggestions and improvements to the classification methods, an under-sampling problem still exists, requiring more viral genomes to be sequenced and characterized. These inconsistencies in classification methods show the existing need to improve our knowledge of viral sequence diversity in order to accurately classify virus families, in particular ones that could be implicated in human disease.
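The sequence-derived properties mentioned above, such as GC content and the nucleotide distance between two viral sequences, are straightforward to compute once sequences are available. The following is a minimal sketch in Python, assuming the two sequences have already been aligned to the same length; the function names and the choice of the simple proportion of differing sites (p-distance) are illustrative assumptions, not the method used by the ICTV or by the cited studies.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C bases among the unambiguous A/C/G/T positions."""
    counted = [base for base in seq.upper() if base in "ACGT"]
    if not counted:
        raise ValueError("sequence contains no A/C/G/T bases")
    return sum(base in "GC" for base in counted) / len(counted)


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned sequences.

    Positions where either sequence has a gap ('-') or an ambiguous
    base are skipped rather than counted as differences.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in "ACGT" and y in "ACGT":
            compared += 1
            differing += x != y
    if compared == 0:
        raise ValueError("no comparable positions")
    return differing / compared


if __name__ == "__main__":
    # Toy sequences, for illustration only.
    print(gc_content("ATGCGC"))
    print(p_distance("ACGTACGTAC", "ACGTACGTAA"))
```

A fixed threshold on such a distance is exactly what the text notes is missing: the same value can separate species in one family and fall within a single species in another.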

Table 1 - The Baltimore classification of viruses, and the number of known families within each class, according to ICTV. Classes are defined by type of mRNA production.

Group  Classification                               Genome type  Number of families (* subfamilies)
I      Double-stranded DNA viruses                  dsDNA        34
II     Single-stranded DNA viruses                  ssDNA        11
III    Double-stranded RNA viruses                  dsRNA        9
IV     Positive-sense single-stranded RNA viruses   ssRNA (+)    38
V      Negative-sense single-stranded RNA viruses   ssRNA (-)    15
VI     RNA retro-transcribing viruses               ssRNA-RT     2*
VII    DNA retro-transcribing viruses               dsDNA-RT     2

Figure 1 – The Baltimore classification of viruses. From ViralZone [10].
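The seven Baltimore groups in Table 1 follow from three features of the genome, so the assignment can be written as a small decision function. This is a sketch only: the function name and its parameters (`nucleic_acid`, `strands`, `sense`, `reverse_transcribing`) are hypothetical and chosen for illustration.

```python
def baltimore_group(nucleic_acid, strands, sense=None, reverse_transcribing=False):
    """Return the Baltimore group (I-VII) for a genome description.

    nucleic_acid: "DNA" or "RNA"
    strands: "ss" (single-stranded) or "ds" (double-stranded)
    sense: "+" or "-", only needed for ssRNA viruses that do not
           reverse-transcribe
    reverse_transcribing: True for groups VI (ssRNA-RT) and VII (dsDNA-RT)
    """
    nucleic_acid = nucleic_acid.upper()
    strands = strands.lower()
    if nucleic_acid not in ("DNA", "RNA"):
        raise ValueError("nucleic_acid must be 'DNA' or 'RNA'")
    if strands not in ("ss", "ds"):
        raise ValueError("strands must be 'ss' or 'ds'")

    if reverse_transcribing:
        if nucleic_acid == "RNA" and strands == "ss":
            return "VI"
        if nucleic_acid == "DNA" and strands == "ds":
            return "VII"
        raise ValueError("no Baltimore group for this reverse-transcribing genome")

    if nucleic_acid == "DNA":
        return "I" if strands == "ds" else "II"
    if strands == "ds":
        return "III"
    if sense == "+":
        return "IV"
    if sense == "-":
        return "V"
    raise ValueError("sense ('+' or '-') is required for ssRNA viruses")


if __name__ == "__main__":
    print(baltimore_group("RNA", "ss", sense="+"))
```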

3.2 – Viral infection and impact in human health

A successful viral infection requires two types of cells: susceptible cells, which act as the point of entry for the infection (e.g. epithelial cells), and permissive cells, whose infection allows for the production of new viruses (e.g. leukocytes). After the virus enters the host successfully, the viral genome directs the synthesis of viral components using the cell's own machinery; these components self-assemble into new viral particles, ready to infect more cells [11]. When infected cells are not permissive, the infection is characterized as abortive, since the virus lacks the proper conditions to replicate. The viral infection cycle is divided into four steps: attachment and entry, decoding of genome information, genome replication, and assembly and release of viral particles. When the virus infects a permissive cell, it can go through two distinct phases: the lytic phase and the lysogenic phase. In the lytic [...] triggers the start of the lytic phase, such as stress or co-infection by other organisms.

The lysogenic phase can be further characterized depending on how the virus remains dormant within the cell: it can either remain in the cytoplasm/nucleus in a stabilized [...]

[...] which have been the focus of extensive research, so that their detection and treatment can be done in the fastest and most accurate way possible, in order to minimize their impact. In addition to viruses with well characterized infections and outbreaks, there are viruses with a high impact on human health which have been recently described and have yet to be associated with major cases of disseminated disease. These are known as emerging viruses, i.e. viruses whose incidence is increasing or has the potential to increase [12]. They can be novel viruses recently described, highly divergent versions of known viruses, or known viruses that had not been previously associated with specific types of disease. Emerging viruses are often associated with cases of zoonotic infection, where a virus present in a non-human host is transferred to humans, and their outbreaks are tracked by the World Health Organization (WHO) in a list of "prioritized diseases", first created in 2015. Known examples of emerging viruses are Zika virus (ZIKV) [13], with a zoonotic transmission by mosquitoes and then between humans via mother-to-child transmission, sex, and blood transfusion, and the Middle East respiratory syndrome coronavirus (MERS-CoV) [14], with a zoonotic association to bats and camels. Currently, there are two ongoing emerging virus outbreaks: the Ebola virus outbreak in Africa, and the 2019-nCoV outbreak in China.

Despite being described almost exclusively as causative agents of disease, not all viruses are pathogenic; viruses with no relevant clinical effect are known as commensal viruses. Known examples of commensal viruses are human pegiviruses (HPgV) and Torque Teno viruses (TTV). Despite their infection having no observable clinical effect, pegivirus infections have been shown to have positive effects on the survival of patients co-infected with HIV [15], whereas TTVs have been proposed as markers to predict transplant rejection [16]. Additionally, viral sequences, due to their ability to integrate into the genome of another organism, can be found in almost every human cell independently of their pathogenic effect. Retrovirus-like sequences are estimated to represent around 8% of the human genome [17], and are likely to be the remnants of primate germ cell infections whose integration was retained across millions of years. For example, human endogenous retroviruses (HERVs) are integrated into the human genome, although their effects on human health are mostly unknown. Recent studies point towards a possible role in the regulation of innate immunity [18], with both positive and negative effects, such as a potential use in treating ovarian cancer or a diminished recognition of metastatic tumor cells by the immune system, respectively.

3.3 – Detection of viruses

Several diagnostic tests have been developed in order to detect the presence of a viral infection. Traditional tests rely on previous knowledge of pathogens and diseases, and are targeted to specific strains/families. Typically, a single test is used per virus, with the most popular ones being serology-based assays and sequence-based assays, which require minimal to no virus culturing. Serology tests are based on the identification of the reaction that occurs when antibodies bind to the viruses' antigens. Some of the most-used serology-based tests are hemagglutination assays (HA), immunofluorescent antibody tests (IFAT), and enzyme-linked immunosorbent assays (ELISA) (Figure 2). Briefly, HAs are used for viruses that bind to red blood cells, such as influenza and dengue, and consist of progressive dilutions of antibodies, which are incubated against viruses and red blood cells, allowing for testing of both the presence of a virus and the minimum concentration of antibodies required to neutralize said virus. IFAT, on the other hand, relies on tagging antibodies with fluorescent dyes in order to measure the light signal they emit, in a direct or indirect way. In the direct approach, the antibody that binds to the viral antigen is tagged with fluorescent molecules, whereas in the indirect approach an untagged antibody is used to bind to the antigen, and a fluorescently tagged antibody is used to bind specifically to the first antibody. Similarly, ELISA fixes the antibodies to a solid surface and then adds the sample in a first round [19]; if the virus is present, the antibodies will capture it and any other molecule is subsequently washed away. Finally, a round of enzyme-linked detection antibodies is added, which bind to the captured viruses and generate a measurable signal.

Other serology tests include western blotting, flocculation and virus neutralization [11]. Serology tests are fast, cheap and accurate, all of which are important factors for routine assays. On the other hand, they lack sensitivity and will miss novel viruses and viruses with altered antibody recognition sites, due to the limited number of available virus-specific antibodies and their high virus specificity, respectively. Additionally, researching new antibodies is a slow and expensive process. Emerging viruses can also be missed since the virus-disease association is not previously known, and consequently the virus will likely not be searched for. Finally, they present a low resolution at the molecular level, i.e. they do not give information on the viral nucleotide sequence, and thus are incapable of providing information on important characteristics such as strain-level identity and variants.

Figure 2 – Examples of serology assays: a) hemagglutination, b) direct and indirect immunoassays, c) enzyme-linked immunosorbent assay (ELISA). Adapted from Principles of Virology [11].

Sequence (or molecular) diagnostic tests, as the name suggests, are based on detecting molecular sequences specific to the organisms of interest. The two most widely used tests are the real-time quantitative polymerase chain reaction (qPCR) test and short-read high-throughput sequencing (HTS/NGS). A qPCR assay is a variant of the polymerase chain reaction (PCR) test which allows the amount of viral particles present in a sample (viral load) to be quantified by using specific primers to amplify the viral sequence. Since qPCRs search for known sequences, they cannot be used to search for novel pathogens, and their effectiveness is influenced by the high mutation rates in viruses, which can quickly make a strain too divergent to be tracked by the primers designed to amplify a specific genus/family of virus. Much like in serological assays, emerging viruses can also be missed in cases where they are not typically associated with the symptoms observed, and therefore will not be tested for. Despite these limitations, qPCRs are one of the fastest and most cost-effective approaches, two features essential for clinical diagnostic tests. Another sequence-based test is Sanger sequencing [20], which for many years was the go-to sequencing approach to identify viruses and perform phylogenetic analyses in a clinical setting, such as tracking HIV infection [21]. While accurate, it is both slow and technically demanding.
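To illustrate how a qPCR read-out can be turned into a viral load, the sketch below uses a standard curve, a common approach in which the threshold cycle (Ct) of a dilution series of known copy number is fitted as a linear function of log10(copies). The source does not describe the quantification procedure, so the standard-curve model, the function names, and the synthetic values in the usage example are all assumptions made for illustration.

```python
import math


def fit_standard_curve(copies, cts):
    """Least-squares fit of Ct = slope * log10(copies) + intercept."""
    if len(copies) != len(cts) or len(copies) < 2:
        raise ValueError("need at least two (copies, Ct) standards")
    xs = [math.log10(c) for c in copies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(cts) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("standards must span more than one concentration")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, cts))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def copies_from_ct(ct, slope, intercept):
    """Invert the standard curve to estimate the starting copy number."""
    return 10 ** ((ct - intercept) / slope)


def amplification_efficiency(slope):
    """Efficiency implied by the slope; 1.0 means perfect doubling per cycle."""
    return 10 ** (-1 / slope) - 1


if __name__ == "__main__":
    # Synthetic dilution series (hypothetical values, not measured data).
    standards = [1e2, 1e3, 1e4, 1e5, 1e6]
    measured_ct = [38.0 - 3.32 * math.log10(c) for c in standards]
    slope, intercept = fit_standard_curve(standards, measured_ct)
    print(copies_from_ct(25.0, slope, intercept))
    print(amplification_efficiency(slope))
```

The same inversion explains why primer mismatches matter: a divergent strain amplifies less efficiently, shifting Ct upward and leading to an underestimated load or a missed detection.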

Short-read high-throughput sequencing (HTS), next-generation sequencing (NGS), and deep sequencing are umbrella terms for sequencing techniques that aim to produce a large amount of short nucleotide sequences, known as reads (typically 50-300 nucleotides long), derived from fragments of the genome of interest. These reads provide relevant information about the genome, such as levels of gene expression, genomic abundance, sequence variants, etc. [22]. While it has a higher cost and turnaround time than serological/PCR assays, HTS also offers more information, allowing for the identification of virulence factors, genotyping, phylogenetics, and the discovery of novel viruses, and it unlocks the possibility of complementary parallel analyses, such as checking the effects of the pathogens on human gene expression. Due to these advantages, it has quickly become the second line of diagnostics for cases where the routine assays are not successful in establishing a disease etiology. In the clinic, the ideal assay would be able to detect the presence of all known pathogens in a single, high-resolution test.

3.4 – Metagenomics

With the help of HTS, approaches have been developed to allow for sequencing and characterization of the full contents of a sample in only one run, potentially allowing for the identification of all pathogens at the same time. These are known as metagenomic approaches, and consist of untargeted and unbiased sequencing of a sample's contents, be it environmental, clinical, or otherwise, offering a complete overview of the sequences within it. Since it is untargeted, it captures all contents of a sample without a priori knowledge, and does not require culturing. In clinical cases with diseases of unknown etiology, metagenomics has been shown to be a reliable approach, as several research groups have been able to identify viruses and bacteria that correlate with the patient's symptoms. The first application of metagenomics as a clinically actionable diagnosis was published in 2014 [23], where the authors successfully identified and treated a clinical case of leptospirosis. Standard clinical assays had not found an etiology for the observed symptoms, since they were negative for Leptospira. After the bacteria were detected, the patient was treated with antibiotics and recovered.

Shortly after, several studies showed how this approach was also able to detect viruses in a clinical setting, and which laboratory methods are the most appropriate for such analyses [24–26]. Metagenomic sequencing approaches also allow for the characterization of novel sequences; traditionally, a virus culture was needed in order to confirm a new viral sequence. However, in recent years, sequencing organisms has become cheaper and more accurate, and assembling the full virus sequence with high coverage has been shown to be enough to confirm its existence. This brought an exponential growth in the discovery of novel virus-like sequences (Figure 3). In 2017, a study identified several uncharacterized viruses circulating in the blood of patients by analysing a total of 1300 cell-free DNA samples derived from 188 patients [27]. Out of 2917 novel sequences, 276 novel contigs were Anellovirus, and 523 were classified as phage/prophage contigs. Along with this growth, the number of false positive results has also risen, with several descriptions of novel viruses later being confirmed to be reagents, contaminations or sequencing artefacts. In 2019, a study showed which reagents and contaminants could be interpreted as viral sequences, through the analysis and comparison of 700 metagenomic libraries prepared with several different types of sequencing prep-kits, as well as various reagents [28]. The results show a total of 493 virus-like sequences significantly associated with laboratory components, and not with the actual contents of the sample. In order to avoid these contaminations, pre-processing steps must be implemented before analysing metagenomic samples for pathogens, by targeting and removing adapter sequences, viral vectors and other sequences likely to be misinterpreted as "real viruses".

[...] DNA-based blood virome of 8000 healthy individuals [30]. It reported 94 different viruses, 19 of which were human viruses, most already known to be highly prevalent in humans, such as anelloviruses, Merkel cell polyomaviruses (or human polyomavirus 5), papillomaviruses, and herpesviruses. However, the study did not focus on RNA or highly divergent viruses, missing a large part of the expected contents of the healthy human virome. Additionally, blood is not the only possible reservoir for viruses; a complete characterization of the human virome in its multiple tissues is necessary to accurately determine pathogenic infections. Metagenomics presents itself as the most promising approach to perform this analysis, as it does not need a priori information about the sample's contents and is able to produce an accurate representation of the contents of a patient's virome/microbiome independently of the type of sample used, be it blood, pulmonary fluid, cerebrospinal fluid, etc.
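The pre-processing step described above, removing adapters and contaminant-like sequences before screening for pathogens, can be sketched as follows. This is a deliberately simplified illustration: it uses exact substring matching, whereas real pipelines rely on dedicated trimming and alignment tools, and the function names, the `min_len` cut-off, and the toy sequences in the usage example are hypothetical.

```python
def trim_adapter(read: str, adapter: str) -> str:
    """Cut the read at the first full occurrence of the adapter, if any."""
    idx = read.find(adapter)
    return read if idx == -1 else read[:idx]


def preprocess(reads, adapter, contaminants, min_len=30):
    """Trim adapters, then drop short reads and reads matching a contaminant.

    contaminants: sequences (e.g. vector or reagent-derived fragments)
    that should not be reported as sample content. Matching is by exact
    substring, which is a simplification of alignment-based screening.
    """
    kept = []
    for read in reads:
        trimmed = trim_adapter(read.upper(), adapter.upper())
        if len(trimmed) < min_len:
            continue  # too short to classify reliably after trimming
        if any(c.upper() in trimmed for c in contaminants):
            continue  # likely laboratory-derived, not from the sample
        kept.append(trimmed)
    return kept


if __name__ == "__main__":
    # Toy adapter, contaminant and reads, for illustration only.
    adapter = "AGATCGGAAGAGC"
    contaminants = ["GGGGCCCCGGGG"]
    reads = [
        "ACGTTGCAACGTTGCAACGTTGCAACGTTGCAACGT" + adapter + "TTTT",
        "ACGT" + adapter,
        "TTACGGGGCCCCGGGGACGTACGTACGTACGTACGTAC",
    ]
    print(preprocess(reads, adapter, contaminants))
```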

Figure 3 – Yearly cumulative number of viral sequences submitted to GenBank's virus section (gbVRL).

A standard metagenomics assay consists of three main steps: data generation, data processing, and data interpretation. To generate raw data, several protocols and sequencing technologies are available, each with its own advantages and disadvantages.

Currently, the most popular approach is sequencing-by-synthesis [31,32], developed by Illumina Inc. and more commonly known as "Illumina sequencing". It produces short reads, up to 300 nucleotides long, in large amounts (hundreds of millions to billions of reads) in order to cover all the genomic content present in the sample several times over. Other short-read HTS techniques have been brought to the market throughout the years, namely Roche's 454 sequencer (pyrosequencing) [33], Thermo Fisher's Ion Torrent (ion semiconductor sequencing) [34], and Life Technologies' SOLiD (sequencing by ligation) [35], though none has been as successful. In contrast with short-read sequencing, other approaches have been developed that allow long reads to be sequenced without an amplification step, such as Oxford Nanopore and Pacific Biosciences SMRT sequencing. These help to solve several of the existing problems with short-read HTS, namely its inability to resolve repetitive regions and long homopolymers in the genome: the larger read size (over 10 kb) easily covers those regions and their surrounding sequences, and the absence of an amplification step circumvents the sequence errors it would generate.

Currently, their disadvantages are tied to a high error rate and a high price when compared with the Illumina equivalent for the same amount of output data, as well as the need for a large amount of input material (PacBio). Despite this, studies have shown the viability of this technology for metagenomic analyses of viruses [36–38], as well as of hybrid approaches that use both long- and short-read sequencing, which have allowed for the detection of several mobile elements in the metagenome of patients [39]. Additionally, Oxford Nanopore's MinION sequencing machine offers portability over sequencing depth, which allows for studies directly in the field, using portable laboratories. This approach contributed heavily to the characterization of the Zika virus during the 2015–2016 outbreak.
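The idea of covering the genomic content "several times over", and the trade-off between portability and sequencing depth, can be made concrete with the usual back-of-the-envelope estimate of average coverage: number of reads times read length, divided by genome size. The sketch below also includes the fraction of reads that actually derive from the target, since in a metagenomic sample most reads typically come from the host or other organisms. The function names and all numbers in the usage example are illustrative assumptions, not values from the source.

```python
import math


def expected_coverage(n_reads, read_len, genome_size, target_fraction=1.0):
    """Average depth over a genome: (reads from target * read length) / genome size."""
    if genome_size <= 0 or read_len <= 0:
        raise ValueError("genome_size and read_len must be positive")
    if not 0 < target_fraction <= 1:
        raise ValueError("target_fraction must be in (0, 1]")
    return n_reads * target_fraction * read_len / genome_size


def reads_needed(target_coverage, read_len, genome_size, target_fraction=1.0):
    """Total reads to sequence so the target genome reaches the desired depth."""
    if genome_size <= 0 or read_len <= 0:
        raise ValueError("genome_size and read_len must be positive")
    if not 0 < target_fraction <= 1:
        raise ValueError("target_fraction must be in (0, 1]")
    return math.ceil(target_coverage * genome_size / (read_len * target_fraction))


if __name__ == "__main__":
    # Hypothetical run: 10 million 150-nt reads, a 10 kb viral genome,
    # and 1 in 100,000 reads deriving from the virus.
    print(expected_coverage(10_000_000, 150, 10_000, target_fraction=1e-5))
    print(reads_needed(30, 150, 10_000, target_fraction=1e-5))
```

This estimate ignores uneven coverage and sequencing errors, so it gives a lower bound on the sequencing effort rather than a guarantee of a complete assembly.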

Generally speaking, sequencing protocols start by performing a total nucleic acid extraction and sequencing a sample of interest, e.g. a complete RNA/DNA extraction from a blood sample (Figure 4). The extracted DNA/RNA is then randomly sheared into smaller fragments. This process takes into account the read lengths generated for each type of protocol; for short reads, they can vary from 20 to 800bp, up

