5.1 – Metagenomic data analysis and contamination
In this section, I discuss the overall impact and challenges posed by metagenomic analysis across all the publications presented. Metagenomics allows for an unbiased characterization of the full contents of a sample, by capturing and sequencing all organisms present within it. Consequently, the contents of a metagenomic library are a complex pool of sequences belonging to several different organisms, and require robust filtering and sorting steps in order to be classified accurately. Without these steps, data analysis and interpretation can lead to erroneous conclusions, such as associating contaminants classified as viruses with cases of disease of unknown etiology. Examples of this are the xenotropic murine leukemia virus and parvo-like viruses, which were initially thought to be novel pathogenic viruses and were later confirmed to be reagent contaminants96,97.
In order to perform the bioinformatics analysis presented in these publications, I contributed to the development of a pipeline (ezVIR2) that processes metagenomic data and identifies clinically relevant viral sequences, and that also relies on the identification and removal of non-sample-related sequences, whether they come from contamination or from associated sequencing biases. Viruses derived from reagent contamination depend on the kits used for preparing the samples, and on variations between preparation kits, reagents and sequencing approaches; these were thoroughly documented by Asplund et al28, who showed the high diversity of sequences that could be misclassified as viral. In the libraries analysed in the course of this thesis, I observed two main types of contamination: reagent-based contaminants and library cross-talk. Reagent contaminants were found in all studies, though two cases are specific to the libraries from the HSCT patient study (section 4.5) and the platelet donor bank study (section 4.7). In the platelet bank study, kadipiroviruses and parvo-like hybrid viruses were observed in the majority of libraries; both are documented contaminants of the Qiagen kits used in the library preparation step97,98. In the HSCT study, mastadenovirus C was observed in five libraries. All recovered reads mapped to a specific genomic region which is used as part of a viral vector (supplementary figure 1, section 8.1). While I could not determine its origin, it has been documented by Asplund et al. as a reagent contaminant. Additionally, the xenotropic murine leukemia virus was observed, as expected, across all experiments, in both DNA and RNA samples. Screening samples for cloning vectors using the UniVec database also revealed contaminations, mainly several samples presenting Illumina's pFosill-2 vector
and the E. coli pBR322 vector, likely derived from contaminations in the reagents used in the sequencing kits. The second type of contamination, library cross-talk, was observed in all sequencing runs; simply put, when libraries are multiplexed and sequenced in the same lane, some reads can be misassigned to other libraries, leading to the incorrect characterization of the patients' viromes. The rate of misassigned reads can be significantly reduced by using double-indexing approaches and stricter demultiplexing parameters/software99,100, neither of which was available for the majority of the libraries produced in the context of this thesis, as they had already been prepared, sequenced and demultiplexed previously. A solution was found in the form of an approximation by applying a ratio that allowed the detection of cases of database used for a metagenomic detection of viruses; while assembly approaches are useful for cases of high viral loads, mapping approaches are required to track viruses whose assembly is not feasible due to a low read abundance, or fragmented coverage of the viral genome. Curation of the database contents is essential to reduce the number of false positive results, by culling viruses or genomic regions that are often used as parts of vectors or in experiments being run in the same laboratory, as well as common reagent contaminants. Initial iterations of the ezVIR database contained sequences that were consistently identified as pathogenic viruses and were later confirmed to be false positives. The clearest example of this was a 1991 proviral Human immunodeficiency virus 2 (D00835.1), whose submitted genome is flanked by human sequences, corresponding to a human DNA methyltransferase 1 in chromosome 19. A low number of reads (50-750 reads per sample) were not being mapped to the
Along with reagent contaminants, library cross-talk, and sequencing biases, metagenomic libraries can also present more complex cases of contamination. One example is a situation in which an unexpected contaminant, i.e. one not typically identified in the course of the analysis pipeline, is itself infected with a virus that is also associated with human disease. This is the case of the publication in section 4.8, where I evidenced an unexpected contamination by a drain fly infected with a rhabdovirus. These viruses are of high clinical importance when found in the CSF, as they are associated with cases of encephalitis in humans101, which in this case was one of the symptoms observed in the patients. Since the insect had not been previously sequenced and its sequence was highly divergent, its presence could easily have been missed without a thorough analysis of the non-human data, and the virus could have been misdiagnosed as the likely cause of the patient's disease. This complex case of insect species identification took advantage of the lab's expertise in phylogenetics; the initial approach (de novo assembly, DIAMOND, MEGAN) suggested the organism was likely a mosquito (A. gambiae), a species also often involved in human disease, which could have led to other incorrect interpretations regarding human/vector interactions. However, the identification of the COI gene and a phylogenetic analysis using BOLD Systems, BUSCO and OrthoDB allowed us to identify the organism as a drain fly. These insects are ubiquitously found colonizing water filters and water sources and have no known impact on human disease, but have been found infesting hospital air filters102. Overall, this case evidenced the need for an overlap of clinical, computational, and biological backgrounds/specialists when applying metagenomics to clinical cases, in order to minimize the risk of misdiagnosis.
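The ratio-based approximation for library cross-talk mentioned earlier in this section can be illustrated with a minimal sketch. The exact ratio used in this work is not reproduced here; the sketch assumes a simple rule in which a viral hit in one library is flagged as possible cross-talk when its read count is a very small fraction of the highest read count for the same virus in the same sequencing run. The function name `flag_cross_talk`, the input layout, and the `min_ratio` threshold are all hypothetical.

```python
from collections import defaultdict


def flag_cross_talk(read_counts, min_ratio=0.01):
    """Flag viral hits that may result from library cross-talk.

    read_counts maps (run_id, library_id, virus) -> number of mapped reads.
    A hit is flagged when its read count is below `min_ratio` times the
    highest read count for the same virus in the same run.
    The 0.01 default is an illustrative assumption, not a validated cutoff.
    """
    # Highest read count observed for each virus within each run.
    top = defaultdict(int)
    for (run, _library, virus), n in read_counts.items():
        top[(run, virus)] = max(top[(run, virus)], n)

    flagged = set()
    for (run, library, virus), n in read_counts.items():
        best = top[(run, virus)]
        # The strongest library in a run is never flagged against itself.
        if 0 < n < best and n / best < min_ratio:
            flagged.add((run, library, virus))
    return flagged


# Illustrative counts only; these are not data from the studies.
example_counts = {
    ("run1", "libA", "HPgV"): 50000,
    ("run1", "libB", "HPgV"): 12,
    ("run1", "libC", "HPgV"): 30000,
    ("run2", "libD", "HPgV"): 12,
}
suspect_hits = flag_cross_talk(example_counts)
```

A rule of this kind only suggests which hits deserve a second look: a genuine low-load infection sequenced next to a high-load one would also be flagged, so flagged hits would still need confirmation, for example by PCR.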
5.2 – Astrovirus cases and their role in disease
In this section I discuss all publications relating to the detection and characterization of Astroviruses (4.2, 4.3, 4.4, 4.6, 4.10). Astrovirus infections are typically associated with gastrointestinal disease, but recent evidence has shown that the virus also circulates in other regions of the human body, with a potential involvement in other diseases, namely meningitis-like disease in immunocompromised patients. In my work for this thesis, I collaborated on several studies that evidence its presence outside the gastrointestinal tract. First, we documented the first known case of an Astrovirus VA1 infection in the respiratory system of febrile children. In blood, we documented the first known case of Astrovirus MLB1 and provided further evidence of the presence of Astrovirus MLB2 in immunocompromised patients; we also identified Astrovirus VA1 in a healthy donor with an acute asymptomatic viraemia. The latter case is particularly relevant for blood transfusion safety, as astroviruses are not part of routine assays and could cause disease in transfusion recipients with compromised immune systems.
Finally, I helped evidence the presence of Astrovirus MLB2 in the CSF of patients with meningitis-like disease. While the mechanisms by which the virus infects and migrates are still unclear, a mounting body of evidence has shown its possible involvement as a pathogenic agent in immunocompromised patients, causing encephalitis and meningitis-like disease. In order to study these mechanisms, a more focused study would be necessary, likely not using metagenomics, as the goal would be to examine a specific virus/host interaction rather than the full microbiome. A likely way to approach this would be a study similar to the one described in 4.12, using iPSC practical applications, the metagenomic identification of astrovirus MLB2 prompted the development of a PCR probe for this strain, which helped to identify 6 more cases in a retrospective prevalence study, described in section 4.3. If the association between disease and virus is confirmed in a later study, this PCR test could easily be included as part of routine assays for meningoencephalitis cases of unknown etiology, with a much quicker turnaround than current metagenomic approaches.
5.3 – The role of commensal viruses in HSCT recovery
In this study, the metagenomic approach and bioinformatics analysis allowed for the detection and tracking of commensal viruses not taken into consideration by typical diagnostic approaches. The analysis showed the presence of several virus families: Anelloviridae, Herpesviridae, Polyomaviridae, Flaviviridae, Papillomaviridae, Adenoviridae, and Togaviridae, most of which were confirmed by RT-PCR. Additionally, the PCR analysis detected cases of human herpesvirus 4 and 5 (EBV and CMV, respectively) infections with a low viral load, which were not present in the ezVIR2 analysis.
Overall, the observed virome was consistent with a previous study on the gut virome of HSCT patients103, and was less diverse than the virome found in the respiratory tract of HSCT patients presenting disease symptoms104, which included respiratory syncytial viruses and human metapneumovirus. Additionally, human pegivirus (HPgV) infections, which were detected in over 30% of the patients, were followed up with further laboratory analyses and found to last upwards of a year, although with no impact on recovery of the immune system, survival, or relapse. The read abundance of pegiviruses for most patients was enough to allow the assembly of a complete or almost complete sequence; the phylogenetic analysis of these sequences did not show a common pattern between patients, and the differences between the viral genomic sequences evidenced the uniqueness of each infection, ruling out sample contamination or a nosocomial infection between patients. Regarding false-positive results, some unexpected contaminants were detected, namely a high number of reads mapping to a specific region of the adenovirus genome, already discussed in 5.1. The origin of this contamination was not determined, but it is hypothesized to originate from one of the reagents used for library preparation. Additionally, a case of a possible rubella (Togaviridae) infection required further analysis in order to determine its origin. Mapping the recovered reads to all known Rubella genomes showed that the reads mapped preferentially to Rubella vaccine strains, although the read abundance was not enough to pinpoint which vaccine strain was used. In this case, patient history complemented the analysis, suggesting that the cause was indeed a recent MMR (measles, mumps, rubella) vaccination.
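The pattern described above for the adenovirus contaminant, with many reads confined to one region of the genome, can be checked by computing the breadth of coverage of the mapped reads. The following is a minimal sketch rather than the procedure implemented in ezVIR2: the function names, the interval representation, and the `min_reads` and `max_breadth` thresholds are assumptions chosen for illustration.

```python
def coverage_breadth(alignments, genome_length):
    """Return the fraction of the reference covered by at least one read.

    alignments is an iterable of (start, end) pairs, given as 0-based,
    half-open intervals on the reference genome.
    """
    covered = 0
    current_end = 0  # right-most reference position counted so far
    for start, end in sorted(alignments):
        start = max(start, current_end)  # skip bases already counted
        end = min(end, genome_length)
        if end > start:
            covered += end - start
            current_end = end
    return covered / genome_length


def looks_localized(alignments, genome_length, min_reads=20, max_breadth=0.1):
    """Heuristic: many reads but a narrow footprint on the genome.

    Both thresholds are illustrative assumptions, not validated cutoffs.
    """
    reads = list(alignments)
    breadth = coverage_breadth(reads, genome_length)
    return len(reads) >= min_reads and breadth <= max_breadth


# Illustrative 150 bp reads on a hypothetical 36 kb reference.
stacked_reads = [(1000 + 5 * i, 1150 + 5 * i) for i in range(100)]
spread_reads = [(350 * i, 350 * i + 150) for i in range(100)]
```

With these inputs, `looks_localized(stacked_reads, 36000)` is true and `looks_localized(spread_reads, 36000)` is false. A narrow footprint alone does not prove contamination, but combined with recurrence of the same region across libraries it is a useful warning sign.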
5.4 – Characterization of the virome of blood donors
In this section I discuss the publications related to the blood donor studies at the Geneva University Hospital (4.6 & 4.7). The metagenomic study of the donor pools revealed that the standard analysis of blood units did not miss any of the pathogenic viruses checked before blood donations: HIV, hepatitis A, B, C, and E viruses, and Parvovirus B19. Although Parvovirus B19 reads were recovered in 9 pools, they were found to derive from a cross-contamination originating from the positive control spike rather than from any of the patient pools, as the reads in all pools mapped to the same region of the genome, with low abundance. Regarding sample contamination, the negative control used for the red blood cell and plasma samples did not evidence any of the expected viruses tied to reagent contaminants, namely circoviruses, parvo-like viruses, and kadipiroviruses. Additionally, at the time of publication, the reads recovered for Merkel cell polyomavirus and papillomaviruses were hypothesized to be skin contaminants, but the results later published by Asplund et al28 indicate that these are more likely laboratory reagent contaminants. Regarding the virome observed by HTS, the results show clean plasma and RBC units, evidencing only commensal viruses: in DNA samples, Anelloviridae and Herpesviridae were observed; in RNA samples, pegiviruses were observed. The one exception is an observation of Astroviridae, which was traced to one donor with an acute asymptomatic viraemia. The results for the DNA virome were mirrored in a later study with a larger scope by Moustafa et al30, which characterized the blood DNA virome of 8000 individuals, recovering primarily anelloviruses, herpesviruses, and papillomaviruses. Typically, less than 1-2% of the total sequenced reads are attributed to viruses in metagenomic analyses, which are therefore likely to miss infections with low viral loads.
Consequently, the lower resolution of the libraries (30 patients per pool) has likely further reduced the possibility of finding low viral load infections affecting single patients in a pool. Despite the low number of pathogenic viruses, the unexpected finding of this study (the astrovirus infection) prompted the study of platelet concentrates, as these are commonly transfused to immunocompromised patients, and astroviruses, as discussed in 5.2, have been associated with meningitis-like disease and fever in immunocompromised patients. While the analysis did not find potentially pathogenic viruses in donor platelets, we evidenced that viruses are still present in platelet concentrates, likely through leftover plasma and RBC content, and could therefore still be transferred to other patients.
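The effect of pooling on sensitivity can be made concrete with a back-of-the-envelope calculation. The sketch below assumes that each individual contributes equally to a pool, so that a virus present in a single individual is diluted by the pool size, and that the number of viral reads follows a Poisson distribution. These modelling choices, the function names, and all numbers are illustrative assumptions and are not taken from the studies.

```python
import math


def expected_viral_reads(total_reads, viral_fraction, pool_size):
    """Expected reads from a virus carried by one individual in a pool.

    viral_fraction is the share of reads the virus would represent if that
    individual were sequenced alone; equal contribution per individual is
    assumed, so pooling divides this share by pool_size.
    """
    return total_reads * viral_fraction / pool_size


def prob_at_least(k, mean_reads):
    """Poisson probability of observing at least k reads."""
    below = sum(
        math.exp(-mean_reads) * mean_reads ** i / math.factorial(i)
        for i in range(k)
    )
    return 1.0 - below


# Hypothetical low-load infection: one viral read per million when the
# individual is sequenced alone, in a library of 10 million reads.
alone = expected_viral_reads(10_000_000, 1e-6, pool_size=1)
pooled = expected_viral_reads(10_000_000, 1e-6, pool_size=30)
chance_alone = prob_at_least(5, alone)
chance_pooled = prob_at_least(5, pooled)
```

Under these assumptions the expected number of viral reads drops from about ten to well under one when moving from a single-individual library to a pool of 30, so a detection rule requiring several supporting reads would almost never be satisfied in the pooled library.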
5.5 – Characterization of the cerebrospinal fluid virome
Here I discuss section 4.11. This manuscript details the analysis of the cerebrospinal fluid (CSF) of 36 patients with central nervous system (CNS) inflammation and compares the results against the CSF of 30 patients who underwent elective surgery, with no observable disease symptoms. Pegivirus infections were found in both groups. Anelloviruses, regarded as commensal viruses, were only found in patients with inflammation, and are likely linked to the increased lymphocyte count in those patients. A single astrovirus infection was detected in a patient with CNS inflammation of unknown origin, reported in a separate publication (section 4.3). Additionally, gemycircularvirus was found in two elective surgery individuals and one patient with inflammation, a molluscum contagiosum virus (poxvirus) was found in one elective surgery individual, and a megavirus was detected in a patient with inflammation. Due to the lack of accurate probes, these cases could not be confirmed by RT-PCR. Gemycircularviruses have been suggested to be associated with encephalitis105; here, however, we find them circulating in patients both with and without CNS inflammation. Finally, one case of HIV was not detected by the metagenomic approach, which was attributed to the low viral load and the small amount of sample analysed. Overall, the study contributed to expanding our knowledge of the virome of the cerebrospinal fluid by identifying what the expected healthy virome contains, and the detection of an astrovirus led to other studies on this virus' prevalence in patients with similar symptoms (4.2). The study of this cohort also allowed for the identification of an unexpected drain fly contaminant, and the characterization of its transcriptome (4.8).
5.6 – Characterization of the human/virus dynamics in airway epithelia
Here I discuss section 4.12. My contribution to this study involved only a set of complementary bioinformatics analyses with the objective of understanding how different pathogenic viruses affect the human epithelia, as well as assessing the viability of the in vitro airway epithelia model used for the study. Two separate experiments were carried out in order to study the effects of viral infection over time. The first focused on the effects of Rhinovirus B48 (low cytotoxicity) and C15 (moderate cytotoxicity) on the human epithelial tissue, by sequencing the human transcriptome at 3 time points, each representing a different stage of the infection (the start of the infection, the infectious period, and prolonged infection), then performing a differential gene expression analysis (DEA) and a pathway enrichment analysis. The second experiment focused on the adaptation of the viruses during the infection, by sequencing two viruses, enterovirus D68 and respiratory syncytial virus, at the same 3 stages of infection and performing a variant calling analysis. The differential gene expression analysis complemented the laboratory analysis on cytokine production, evidencing an effect on genes involved in pathways regarding cilia morphogenesis in the rhinovirus C15 samples, and no effect in the rhinovirus B48 samples. Additionally, the DEA showed that a prolonged infection by these viruses affects the expression of fewer genes over time (section 8.1.3 - Sup. Figure 3). The sequencing results also show a clear variation in viral abundance over time (Sup. Figure 4 - section 8.1.3), though no significant variants were observed between time points. This could be explained by the lack of an immune system in the model, other than the interferon response observed in the results; an immune system would exert more selective pressure on the viruses and, consequently, lead to more variation.
Overall, the contribution of bioinformatics to this study helped show the potential of the epithelial culture model for studying respiratory viruses, by finding potential targets for antiviral therapies, identifying the virulence levels of each of the strains tested, and showing how human gene expression is affected by viral infection over time.
5.7 – Characterization of the virome of Kawasaki Disease patients
In this section I discuss section 4.13. The etiology of Kawasaki Disease (KD) is yet to be established; studies have shown it to be likely associated with viral or bacterial infections106, though so far there has been no demonstrable association between a specific set of pathogens and the patients' symptoms. Similarly, our metagenomic approach did not find a common pathogenic thread between patients; several different viruses were identified and confirmed, some associated with a recent vaccination, commensal viruses, or other unrelated disease. Regarding the patients' virome, we found a large number of novel anellovirus sequences which, after a phylogenetic study (Section 8.1.4a - Supplementary Figure 5), did not show consistent clustering patterns that could evidence a possible correlation between the virus and symptoms. We also found four contigs for a divergent picobirnavirus-like genome, identifiable only at the amino acid level, in one of the patients.
Picobirnaviruses have recently been proposed to be bacterial viruses due to unique sequences present in their genomes, typically only found in phages107. Clinically, they have previously been associated with cases of diarrhea108. The contigs found for the virus derive from two separate segments, similar to the canonical structure of known picobirnaviruses (Section 8.1.5b, Sup. Figure 6, Sup. Table 1); they were not detected by RT-PCR, nor were they present in any of the other libraries, including those of other studies in this thesis. Unlike the contaminating drain fly and rhabdovirus sequences found in 4.8, I did not find traces of external organisms that could identify this sequence as a contaminant, but the lack of a negative control for this study and the negative RT-PCR result do not allow us to classify this sequence as a positive result, as it could simply be a spurious event of reagent contamination.