TARGETED ORGANELLE GENOME ASSEMBLY AND HETEROPLASMY DETECTION

Nicolas Dierckxsens

Université Libre de Bruxelles

Supervisor:

Prof. Dr. Patrick Mardulyn

Co-Supervisor:

Dr. Guillaume Smits

Dissertation presented in partial fulfilment of the requirements for the degree of Doctor in Science

DECLARATION

This dissertation is the result of my own work and includes nothing which is the outcome of work done in collaboration, except where specifically indicated in the text. It has not been previously submitted, in part or in whole, to any university or institution for any degree, diploma, or other qualification.

Nicolas Dierckxsens

This thesis has been written under the supervision of Prof. Dr. Patrick Mardulyn and Dr. Guillaume Smits.

The members of the jury are:

• Prof. Tom Lenaerts (Université Libre de Bruxelles - President)
• Dr. Guillaume Smits (Université Libre de Bruxelles)
• Prof. Patrick Mardulyn (Université Libre de Bruxelles)
• Prof. Ludwig Triest (Vrije Universiteit Brussel)

ACKNOWLEDGEMENTS

Firstly, I would like to express my sincere gratitude to my promotors Dr. Guillaume Smits and Prof. Patrick Mardulyn for giving me the opportunity to start my PhD and for their continuous support during the last four years. Without Dr. Guillaume Smits, I would not have found the funding to finish my project, and I would like to thank him especially for that. I would also like to thank Ludwig Triest for his support in the first year and for the data he made available to me. Besides my advisors, I would like to thank the members of the jury for making their time available and for giving me feedback on my thesis project.

ABSTRACT
Thanks to the development of next-generation sequencing (NGS) technology, whole genome data can be readily obtained from a variety of samples. Since the massive increase in available sequencing data, the development of efficient assembly algorithms has become the new bottleneck. Almost every newly released tool is based on the De Bruijn graph method, which focuses on assembling complete datasets with mathematical models. Although decreasing sequencing costs have made whole genome sequencing (WGS) the most straightforward and least laborious approach to gathering sequencing data, many research projects are only interested in the extranuclear genomes. Unfortunately, few of the available tools are specifically designed to efficiently retrieve these extranuclear genomes from WGS datasets. We developed a seed-and-extend algorithm that assembles circular organelle genomes from WGS data, starting from a single short seed sequence. The algorithm has been tested on several new (Gonioctena intermedia and Avicennia marina) and public (Arabidopsis thaliana and Oryza sativa) whole genome Illumina datasets, and it consistently outperformed other assemblers in reliability and coverage. In our benchmark, NOVOPlasty assembled each genome in under 30 minutes, using at most 16 GB of RAM. NOVOPlasty is the only de novo assembler that directly and rapidly produces a high-quality circular contig from whole genome sequencing data.

RÉSUMÉ

Thanks to the development of next-generation sequencing techniques, whole genome data can be obtained for many organisms. Despite a considerable increase in the available sequence data, the development of new assembly algorithms has remained limited. Almost all new assemblers are based on the De Bruijn graph approach, which focuses on assembling complete datasets by means of mathematical models. Although the decrease in sequencing costs has encouraged the collection of whole genome sequences, many research projects are only interested in the extranuclear genomes. Unfortunately, very few assemblers are optimized to extract these extranuclear genomes from whole genome data. We therefore developed a seed-and-extend algorithm (NOVOPlasty) that assembles these circular genomes from whole genome sequence data, starting from a short seed sequence. The algorithm was tested on several whole genome Illumina datasets, both new (Gonioctena intermedia and Avicennia marina) and publicly available (Arabidopsis thaliana and Oryza sativa); it consistently gave better results (reliability and coverage) than those obtained with other assemblers. In our comparison, NOVOPlasty assembled each genome in under 30 minutes, using a maximum of 16 GB of RAM. NOVOPlasty is the only de novo assembler that directly and rapidly generates a high-quality circular contig from whole genome sequence data.

CONTENTS

1 WHAT IS A GENOME?
  1.1 DISCOVERY OF THE GENETIC CODE
  1.2 NUCLEAR DNA
  1.3 ORGANELLE GENOMES
    1.3.1 Mitochondria
    1.3.2 Chloroplasts
2 DNA SEQUENCING
  2.1 SANGER SEQUENCING
  2.2 NEXT-GENERATION SEQUENCING (NGS)
    2.2.1 Roche 454 and SOLiD
    2.2.2 Ion Torrent
    2.2.3 Illumina (Solexa) sequencing
  2.3 THIRD-GENERATION SEQUENCING
    2.3.1 Helicos
    2.3.2 Pacific Biosciences
    2.3.3 Oxford Nanopore
  2.4 COMPARISON BETWEEN SEQUENCING PLATFORMS
3 SEQUENCE ASSEMBLY
  3.1 ASSEMBLY METHODS
    3.1.1 String-based
    3.1.2 Graph-based
  3.2 ASSEMBLY COMPARISON
  3.3 ORGANELLE GENOME ASSEMBLY
    3.3.1 MITObim
    3.3.2 Org.Asm: The ORGanelle ASeMbler
    3.3.3 Norgal
    3.3.4 GetOrganelle
    3.3.5 Organelle_PBA
4 PROBLEM DESCRIPTION AND OBJECTIVES
  4.1 PUBLICATIONS
5 NOVOPLASTY
  5.1 INTRODUCTION
  5.2 MATERIALS AND METHODS
    5.2.1 Sequencing
    5.2.2 De novo assembly
    5.2.3 Quality assessment
    5.2.4 NOVOPlasty algorithm
  5.3 RESULTS
    5.3.1 Chloroplast assembly
    5.3.2 Mitochondrial assembly
    5.3.3 Overall performance
    5.3.4 Seed compatibility
    5.3.5 Complex assemblies
  5.4 DISCUSSION
  5.5 NOVOPLASTY UPDATES
6 HETEROPLASMY DETECTION
  6.1 INTRODUCTION
  6.2 MATERIALS AND METHODS

1 WHAT IS A GENOME?

A genome is an organism’s complete set of DNA, including all of its genes. In humans, a copy of the entire genome, more than 3 billion DNA base pairs, is contained in all cells that have a nucleus. Each genome contains all of the information needed to build and maintain that organism.

1.1 Discovery of the Genetic Code

Although James Watson and Francis Crick are often seen as the discoverers of DNA, the molecule had already been identified decades before. It was a Swiss chemist, Johann Friedrich Miescher, who first isolated a mysterious substance that he called 'nuclein' (1,2). Beyond establishing that the molecule originated from the nucleus, he could not determine its exact function or structure; hence, it took a few more decades before his discovery was fully appreciated by the scientific community.

The rediscovery of the inheritance theory of Gregor Mendel resulted in a flood of research to prove or disprove the theory of how physical characteristics are inherited from one generation to the next (3).

In the late 19th century, Walther Flemming discovered chromosomes in the nucleus and described the process of mitosis for the first time (4). Building on his findings, Walter Sutton and Theodor Boveri proposed that chromosomes bear hereditary factors in accordance with Mendelian laws (5). All these and many other groundbreaking discoveries made it possible for James Watson and Francis Crick to decipher the molecular structure of DNA in April 1953. They described DNA as a double-stranded helix, with the two strands connected by hydrogen bonds (6).

1.2 Nuclear DNA

All eukaryotic organisms have a nucleus that contains almost the entire genome. Nuclear DNA (nDNA) is organised in chromosomes, and in sexually reproducing organisms one copy of each chromosome is inherited from each parent. The most studied genome is the human genome, which contains 23 chromosome pairs.

1.3 Organelle genomes

Organelles that descend from endosymbiotic bacteria are essential for ATP production and can be found in almost every eukaryotic cell (7). Their circular genomes are haploid and usually maternally inherited, making them ideal for phylogenetic and phylogeographic studies. As organelle genomes encode a few essential genes, it is not surprising that mutations in those genes can lead to serious disorders. Despite their small size, these organelle genomes are the most studied part of the genome (8).

1.3.1 Mitochondria

Mitochondrial DNA (mtDNA) is a circular multi-copy genome that is maternally inherited through the oocyte. The copy number depends on the energy needs of the cell: tissues such as muscle and liver have a high mtDNA count, whereas mitochondria are absent from red blood cells (9). Although the mitochondrial genome varies in length between species, it usually does not exceed 25,000 bp (human mtDNA consists of 16,569 nucleotides). Around 1500 genes are linked to mitochondria in humans, of which 37 are encoded by the mitochondrial genome and the rest by the nuclear genome. Plant mitochondrial genomes are usually much larger, which is caused by the transfer of chloroplast DNA. This makes the sequence of plant mitochondria highly variable and complicated to study (10).

1.3.2 Chloroplasts

2 DNA SEQUENCING

Since the discovery of DNA, scientists have searched for new ways to decipher the genetic code. Researchers quickly understood that determining the order of its bases could greatly advance our understanding of living matter. DNA sequencing techniques have evolved from laborious methods for determining short sequences to sophisticated machines that are capable of sequencing complete genomes.

2.1 Sanger sequencing

In Sanger sequencing, the labelled fragments are separated based on their length by capillary gel electrophoresis, which results in a chromatogram where each peak in fluorescence intensity represents a base call. Although other technologies worked on the same principle, the accuracy, robustness and ease of use of Sanger sequencing made it the most common DNA sequencing technology for years to come (19).

2.2 Next-Generation Sequencing (NGS)

After a long period of relatively slow progress in DNA sequencing, a new wave of sequencing technologies emerged between 2005 and 2010 (20). This next generation of sequencers was cheaper, faster and highly parallel; because of the latter, these platforms are also called 'high-throughput sequencing technologies'. In 2006, it still cost ~$14 million to generate a draft human genome, whereas by 2015 this had dropped to $1,500 (Figure 2.2) (21). These advances were so great that they revolutionized the fields of genetics and molecular biology (22).


2.2.1 Roche 454 and SOLiD

Roche 454 was the first next-generation sequencer on the market, yet it was never able to secure a lasting foothold and was discontinued in 2013 after it had become non-competitive. This sequencer worked on the principle of 'pyrosequencing', in which the addition of a base is detected optically through the release of a pyrophosphate (23). The method achieved longer reads than other NGS technologies, but its low throughput and high costs made it less attractive.

The arrival of third-generation sequencing and the dominance of Illumina also led to the discontinuation of ABI SOLiD in 2016. SOLiD technology used DNA ligase instead of DNA polymerase for sequencing. A typical sequencing machine sequentially identifies every nucleotide as A, C, G or T, whereas SOLiD uses colour-space encoding to represent a sequence as transitions between nucleotides. This method achieved high accuracies, but the short read lengths and the limitation to resequencing destined it to disappear (24).

2.2.2 Ion Torrent

With the additional competition from third-generation sequencers, Ion Torrent will probably struggle to survive in this highly competitive field.

2.2.3 Illumina (Solexa) sequencing

Solexa was founded in 1998 to commercialize the 'sequencing-by-synthesis' technology that was developed at the University of Cambridge by Shankar Balasubramanian and David Klenerman (27). In 2007, just one year after the first commercial Solexa sequencer was introduced, Illumina purchased Solexa and has continuously improved the technology since then (28).


Figure 2.2.3: Illumina sequencing technology. (A) An NGS library is prepared by fragmenting a DNA sample and ligating specialized adapters to both fragment ends. (B) The DNA fragments with adapters are hybridized onto the flow cell and amplified into clonal clusters through bridge amplification. (C) During each sequencing cycle, one fluorescently labelled nucleotide is incorporated into each complementary strand. The emission wavelength and intensity are used to identify the bases. (31)

The majority of Illumina datasets are paired-end, which means that both ends of each fragment are sequenced. Paired reads can compensate for the short read lengths of Illumina data and help to resolve repeats. Illumina offers several NGS platforms based on this technology, all with different properties to appeal to as many users as possible. Generally, all Illumina platforms produce a very high number of accurate but relatively short reads. This high throughput combined with high accuracy makes it the ideal choice for transcriptome analysis (32), rare variant calling (33) or population-based sequencing projects (34).

2.3 Third-Generation Sequencing

Next-generation technologies first break the DNA up into small fragments, followed by an amplification step to generate a large enough signal during base calling (35). Third-generation technologies are able to sequence the DNA sample directly, without an amplification step. This offers several advantages, such as simplified sample preparation, no amplification-induced errors, no GC bias and direct RNA sequencing (36). Still, the most important advancement of these technologies is the ability to generate far longer reads than ever before. Complex genomic regions such as genome-wide duplications or tandem repeats, which were often too complex to assemble with short reads, have now been resolved with third-generation sequencers (37). Third-generation sequencing is also known as 'single-molecule sequencing' and 'long-read sequencing', referring respectively to the method behind the technology and to its most visible improvement.

2.3.1 Helicos

Although Helicos (38) was the first single-molecule sequencer, it is not always counted as third-generation sequencing, as it was limited to very short read lengths. Rather than revolutionizing DNA sequencing, Helicos modified existing techniques to make single-molecule sequencing possible. Besides the use of only one fluorescently tagged nucleotide per cycle and the absence of amplification, there was not much difference compared to Illumina sequencing. The addition of only one nucleotide per cycle and the extensive imaging needed to compensate for the absence of amplification resulted in a very slow cycle time. Other issues were the short read length (35 bp), the high error rate, and the high cost and size of the Helicos machine. These disadvantages were too great to compete with Illumina and led to the bankruptcy of Helicos in 2012 (39).

2.3.2 Pacific Biosciences

Pacific Biosciences (PacBio) developed a single-molecule real-time (SMRT) sequencing technology that monitors fluorescently labelled nucleotides as they are incorporated into individual template molecules. The PacBio platform consists of SMRT Cells containing tens of thousands of microfabricated nanostructures called zero-mode waveguides (ZMWs). A single polymerase is immobilized at the bottom of each ZMW and forms a complex with one of the DNA templates after these are loaded onto the SMRT Cell. These templates, called SMRTbells, are created by ligating hairpin adapters to both ends of a double-stranded DNA (dsDNA) molecule. Inside each ZMW, the polymerase performs sequencing-by-synthesis in a similar fashion to some of the NGS technologies. PacBio's innovation does not lie in the sequencing method, but in how the fluorescent emission is detected inside the ZMW. The ZMW has such a small light-detection volume that, when it is illuminated from below, the wavelength of the light cannot pass through efficiently. The light only penetrates the lower part of the ZMW, creating a powerful light microscope that reduces background noise up to 1000-fold (41). This makes it possible to detect in real time the individual fluorophores that are cleaved from the incorporated nucleotides (Figure 2.3.2).

The low detection accuracy (85%) is a serious drawback of this method, but it can be improved by sequencing one DNA template multiple times in one run, which is made possible by the circular design of the DNA template (SMRTbell). Long-insert libraries are passed through only once and produce very long reads (>20 kb) with low accuracy, while shorter libraries are sequenced multiple times into a continuous long read (CLR), which consists of multiple subreads from the same insert that can be combined into one highly accurate consensus sequence, termed a circular consensus sequence (CCS) (42). PacBio does not maintain multiple platforms like Illumina; instead, it brings a new machine to the market to replace the previous one over time. After the PacBio RS (2011) and the PacBio RS II (2013), the new Sequel System was released in 2015. These machines output variable read lengths, with an average between 10,000 and 15,000 bp. This is a huge advantage compared to NGS platforms, especially for de novo whole genome sequencing, as short reads cannot resolve complex genomic regions (repeats and whole genome duplications). The downsides are the higher error rates, higher costs and reduced throughput (43).

2.3.3 Oxford Nanopore

Figure 2.3.3: Oxford Nanopore sequencing technology. (A) A docking enzyme unwinds a dsDNA molecule, which is then pulled through the pore. The inset shows the distinct disruptions in the current for each nucleotide. (B) The MinION sequencing machine, which can transfer real-time sequencing data to a desktop computer. (C) The even more compact SmidgION is currently under development and can be connected to a smartphone (45).

In 2014, Oxford Nanopore Technologies (ONT) introduced the MinION and has since released two more machines, the GridION X5 in 2017 and the PromethION in 2018. The GridION can run up to five of the same flow cells to increase throughput or to sequence multiple samples on the same device. The PromethION is their largest device, with 48 flow cells containing 144,000 nanopores (compared to the MinION's 500), which makes it ideal for human-genome-scale projects and is designed to compete with the high-throughput Illumina devices. While the GridION and the PromethION were developed to increase throughput, ONT is currently developing an even more compact version of the MinION that can be connected directly to a smartphone, named the SmidgION (Figure 2.3.3).


These compact sizes and simple library preparation protocols make nanopore sequencers ideal for use in the field, for example during pathogen outbreaks or for remote monitoring (45). Data are made available in real time as soon as the first DNA template passes through one of the pores, which makes it possible to monitor the experiment and terminate or extend it when needed. The MinION has already been used in environments without access to a sequencing lab, such as the International Space Station (46) and Antarctica (47). The only persistent disadvantage is the lower accuracy of around 85%, similar to that of PacBio's long-read sequencing technology. Yet this technology offers many advantages and could improve even further in the near future, which could make it a serious contender for Illumina.

2.4 Comparison between sequencing platforms


Table 2.4: Comparison of the currently most used sequencing technologies (37,41,48,49).

Technology | Read length | Accuracy | Throughput | Advantages | Disadvantages
Sanger | 400-900 bp | 99.99% | NA | Very high accuracy, still seen as the gold standard | Expensive, time consuming, impractical for larger sequencing projects
Ion Torrent | < 440 bp | 98% | 2 GB | Less expensive equipment, fast | Homopolymer errors, low throughput
Illumina | 2 x 150-300 bp | 99.9% | 1.2 GB (iSeq) - 1800 GB (HiSeq X) | Low cost per Mb, high throughput | Short reads, systematic errors in homopolymers
PacBio Sequel | 10-15 kbp | 87% | 5-10 GB | Epigenetics, random error profile, long reads | Expensive, low accuracy
Oxford Nanopore MinION | 10-15 kbp | 85% | 0.1 GB | Direct RNA sequencing, field use, small equipment size, short library preparation, very long reads | Low accuracy


3 SEQUENCE ASSEMBLY

The rapid evolution of DNA sequencing technologies over the last decades now makes it possible to produce a large amount of sequencing data within a reasonable budget and time frame. Computational power and well-designed software are needed to derive useful information from these sequencing data, and these two elements struggle to keep up with the continuous progression of sequencing technologies. Yet insufficient computational power is seldom the true problem and can often be resolved by designing more memory-efficient software. There is a wide range of software designed for processing sequencing data; in this chapter, we restrict our focus to sequence assembly software.

3.1 Assembly methods

Piecing such fragmented data back together into complete chromosomes is an extremely complex task. Many aspects have to be taken into account during the design of an assembly algorithm. The properties of the sequencing data have to be thoroughly studied, as well as the complexity of the genome and the available computational resources (50). When a very similar genome has previously been assembled, it is possible to align all the reads to this reference and reconstruct the genome by mirroring it on the reference. This simplifies the assembly greatly, but it is limited to very close references and is unable to detect structural variation. Reference-based alignment is frequently used in the medical field to detect single nucleotide polymorphisms (SNPs) and small insertions or deletions (indels) (51). This chapter will only discuss de novo assembly, where a genome is reconstructed without a reference.

The development of de novo sequence assemblers has always run in parallel with the evolution of DNA sequencers, and assemblers are often designed for one particular sequencing technology. The general principle of de novo assembly is to find overlaps among reads and join them into contiguous sequences, called contigs. The first assemblers were string-based methods that were efficient for Sanger reads or small early-NGS datasets. Currently, the vast majority of assemblers are more complex graph-based algorithms (Overlap Layout Consensus or De Bruijn) that can deal with the increasing number of reads per dataset (52).

3.1.1 String-based

With a greedy approach, the extension with the largest overlap is always chosen, so that a repeat sequence such as CACA can collapse into CA, causing a misassembly. Besides the collapse of repeats, the greedy approach can also lead to chimeric assemblies when the algorithm selects the wrong extension following a duplicated region. This strategy was initially designed for Sanger reads, which are relatively long and therefore less prone to cause errors in repetitive or duplicated regions (55).

Figure 3.1.1a: Example of a greedy string-based assembly of four short reads. Only the shortest common superstring is considered; other possible overlaps are not taken into consideration. In this example, the greedy approach results in a collapse of the CACA repeat.

This strategy was also adapted for the less accurate and shorter (30-90 bp) reads from NGS technologies. These assemblers (SSAKE (56), VCAKE (57)) used a different string-based method to avoid the misassemblies caused by the greedy approach. They consider all the reads that have an overlap larger than a certain length (k-mer) and terminate the extension when there is a conflict between multiple reads that could extend the contig. In Figure 3.1.1b the contig is terminated to avoid a misassembly, which illustrates the difference in outcome compared to the greedy approach of Figure 3.1.1a.

Figure 3.1.1b: Example of a string-based assembly of four short reads without a greedy approach. The same four reads as in Figure 3.1.1a lead to a shorter contig, avoiding a possible misassembly.
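To make the contrast between Figures 3.1.1a and 3.1.1b concrete, the greedy strategy can be sketched in a few lines of Python. This is a toy illustration, not the SSAKE or VCAKE implementation; the four reads are drawn from a hypothetical sequence TTACACACAGTT so that the repeat collapse becomes visible.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (at least min_len), else 0."""
    start = 0
    while True:
        start = a.find(b[:min_len], start)
        if start == -1:
            return 0
        if b.startswith(a[start:]):
            return len(a) - start
        start += 1

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the largest overlap (greedy shortest common superstring)."""
    contigs = list(reads)
    while len(contigs) > 1:
        best_len, best_a, best_b = 0, None, None
        for a in contigs:
            for b in contigs:
                if a is not b:
                    olen = overlap(a, b, min_len)
                    if olen > best_len:
                        best_len, best_a, best_b = olen, a, b
        if best_len == 0:
            break                                   # no overlaps left to merge
        contigs.remove(best_a)
        contigs.remove(best_b)
        contigs.append(best_a + best_b[best_len:])  # merge the two sequences
    return contigs

# Reads sampled from the hypothetical repeat-containing sequence TTACACACAGTT:
print(greedy_assemble(["TTACA", "ACACA", "CACAG", "CAGTT"]))
# ['TTACACAGTT'] -- one CA unit of the repeat has been collapsed, as in Figure 3.1.1a
```

A conservative assembler in the style of Figure 3.1.1b would instead stop extending as soon as candidate reads disagree on the next bases, trading contig length for correctness.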

3.1.2 Graph-based


Figure 3.1.2: Example of a graph where nodes are connected by edges. (A) A directed graph with orientated edges. (B) An undirected graph.

3.1.2.1 Overlap Layout Consensus (OLC)

The first graph-based assemblers were based on the OLC approach and designed for Sanger (CAP3 (59), PHRAP (60)) or 454 (Newbler) reads. They gave better results than the string-based assemblers, but were soon superseded by De Bruijn graph algorithms.

As the name indicates, OLC algorithms consist of three phases:

• Overlap: All the overlaps between reads are found by an all-against-all pairwise comparison and are used to build an overlap graph. In this weighted graph, the nodes represent the reads and the edges the overlaps. The minimum overlap length and the minimum percent identity are the two parameters that shape the graph. Higher values of these parameters result in shorter but more accurate contigs (61).

• Layout: During the layout stage, the graph is first simplified by removing redundant information. Then the layout of the reads along the genome is determined by finding the (Hamiltonian) path that passes through each node exactly once (62). Finding one such path would be the ideal situation; in reality, multiple paths are formed that split the genome up into contigs.


• Consensus: In the final stage, a multiple sequence alignment of all the reads along the layout is used to build a consensus sequence for each contig. A toy example covering all three phases is given after Figure 3.1.2.1 below.

Figure 3.1.2.1: Example of an overlap graph that consists of five reads of 3 bp. The nodes are the reads and the weighted edges are the overlaps. The red lines show the Hamiltonian path, which results in the following sequence: ATCCAGT. Image adapted from (63).
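The three OLC phases can be demonstrated end to end on Figure 3.1.2.1. The snippet below assumes the five 3 bp reads are the consecutive 3-mers of ATCCAGT (this is not stated explicitly in the figure) and replaces the heuristics of real assemblers with a brute-force search over all read orders, which is only feasible for such a tiny graph.

```python
from itertools import permutations

def suffix_prefix_overlap(a, b):
    """Longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a[-k:] == b[:k]:
            return k
    return 0

reads = ["ATC", "TCC", "CCA", "CAG", "AGT"]        # presumed reads of Figure 3.1.2.1

# Overlap: all-against-all pairwise comparison -> weighted overlap graph.
graph = {(a, b): suffix_prefix_overlap(a, b) for a in reads for b in reads if a != b}

# Layout: pick the Hamiltonian path (every read used once) with the largest total overlap.
def total_overlap(order):
    return sum(graph[(order[i], order[i + 1])] for i in range(len(order) - 1))

layout = max(permutations(reads), key=total_overlap)

# Consensus: spell the sequence along the chosen layout (trivial here, since the toy reads are error-free).
sequence = layout[0]
for prev, nxt in zip(layout, layout[1:]):
    sequence += nxt[graph[(prev, nxt)]:]

print(sequence)   # ATCCAGT, the path highlighted in red in Figure 3.1.2.1
```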


3.1.2.2 De Bruijn graph

Long before its application in genome assembly, the Dutch mathematician Nicolaas de Bruijn developed this graph theory in 1946, basing it on Euler's first theorem of graph theory, which dates back to 1735 (66). The De Bruijn graph (DBG) is the most recent method, and almost all assembly software of the last decade is based on this approach. The increasing number of reads generated by NGS technologies made it almost impossible to resolve the more complex OLC graphs. For this reason, the De Bruijn graph approach became the preferred method for assembly developers (61).

The De Bruijn graph approach can also be broken up into three steps:

• Break up the reads into successive k-mers: All reads (of length L) are divided into (L-k+1) k-mers and each unique k-mer is stored in a hash table. As many reads share the same k-mers, this hash table takes less memory than storing the complete reads. The optimal k-mer depends on the read length and the complexity of the genome; it is hard to determine theoretically and therefore has to be established experimentally (67).

• Construct a De Bruijn graph: Nodes represent sequences of length x and each edge connects nodes that have an exact x-1 overlap, corresponding to the prefix and suffix of an x-mer. The edges are directed from the prefix to the suffix, and can also be weighted to represent the number of reads that support each edge. There are two variations of this graph representation, the node-centric and the edge-centric.

o Node-centric: The nodes represent the k-mer substrings and the edges are the (k-1) overlaps between two nodes (68).

o Edge-centric: Every node is a (k-1)-mer and an edge connects two nodes if their (k-1)-mers are consecutive in some read (69).

Both definitions are equivalent: a node-centric de Bruijn graph for a k-mer is identical to an edge-centric de Bruijn graph for a (k+1)-mer (70).

• Find an Eulerian path: The genome is reconstructed by finding a path through the graph that contains every edge exactly once. As long as all the possible k-mers of the genome are present in the graph, there will be an Eulerian path. In theory, an Eulerian path can connect all the k-mers into a complete genome or chromosome. In practice, the Eulerian path will be interrupted many times because of repetitive and duplicated regions (55). A toy example of all three steps is sketched after Figure 3.1.2.2 below.

Figure 3.1.2.2: Example of a De Bruijn graph with an Eulerian path. A small example circular sequence (ATGGCGTGCA) has been split up into k-mers with a length of 3 bp (k). The edge-centric graph is constructed by connecting the prefix (k-1) of each k-mer to its suffix (k-1). With this approach, each edge represents a k-mer. The Eulerian path is found by passing through each edge once, which leads to a completely reconstructed sequence. (71)
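The three steps above can be reproduced on the toy genome of Figure 3.1.2.2. The sketch below builds the edge-centric graph for k = 3 and walks an Eulerian circuit with Hierholzer's algorithm; it assumes perfect, error-free k-mers, which in real data would first have to be extracted from the reads.

```python
from collections import defaultdict

def circular_kmers(seq, k):
    """All k-mers of a circular sequence, wrapping around the origin."""
    doubled = seq + seq[:k - 1]
    return [doubled[i:i + k] for i in range(len(seq))]

def de_bruijn(kmers):
    """Edge-centric De Bruijn graph: nodes are (k-1)-mers, every k-mer is one directed edge."""
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
    return graph

def eulerian_circuit(graph, start):
    """Hierholzer's algorithm: traverse every edge exactly once."""
    remaining = {node: list(targets) for node, targets in graph.items()}
    stack, circuit = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())
        else:
            circuit.append(stack.pop())
    return circuit[::-1]

genome = "ATGGCGTGCA"                               # toy circular sequence of Figure 3.1.2.2
kmers = circular_kmers(genome, 3)                   # in practice the k-mers come from the reads
path = eulerian_circuit(de_bruijn(kmers), genome[:2])
reconstruction = "".join(node[-1] for node in path[1:])
print(reconstruction)
# A circular sequence containing every 3-mer exactly once. Note that even this tiny example
# is ambiguous: the 2-mers TG and GC each occur twice, so two different Eulerian circuits
# (and two reconstructions) exist -- the repeat problem described in the text.
```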

Choosing the best k-mer for an assembly has always been a difficult question. Longer k-mers can improve the assembly of repetitive regions, but require more memory and are more sensitive to sequencing errors. With the increasing memory efficiency of assembly algorithms and longer read lengths, the search for the optimal k-mer is becoming more and more important. Some assemblers are now capable of using multiple k-mers during the assembly instead of a fixed k. IDBA-UD (72) starts with a small k-mer, removes the reads that are already assembled and moves on to a larger k-mer with the remaining reads. SPAdes (73) also conducts multiple assemblies with different k-mers.

3.2 Assembly comparison

Many assemblers have been developed over time and there is no clear consensus on which software works best. Benchmarks published by software developers are often overfitted to their own tool and therefore not completely reliable. There are a few efforts, like the Assemblathon (74,75), that offer a neutral comparison between assembly tools, but even then only a few genomes were tested and the last edition dates back to 2013. When a new genome has been sequenced, it is hard to predict which assembly tool will behave best. Most de novo assembly projects therefore try out multiple tools before selecting the software for the final assembly.

Many factors can influence the performance of an assembly tool:

• The NGS technology used for sequencing: Each technology has a different pattern of sequencing errors and different read lengths. Generally, string-based and OLC algorithms perform better on long-read data (PacBio, Nanopore or Sanger), while De Bruijn graphs are more suited for short-read datasets with high coverage (Illumina) (76). Hybrid assemblers have the ability to combine different NGS datasets (MIRA, MaSuRCA (77)).

• The size of the sequenced genome: Larger genomes are generally harder to resolve in a graph, especially in OLC graphs, while string-based algorithms are too slow and require too much memory for large genomes. As long as Illumina reads are used to assemble large genomes, De Bruijn algorithms remain the only option (78).

• The heterozygosity of the genome: High heterozygosity makes any assembly much more complex and increases the computational power needed to resolve the assembly graphs. Some assemblers were specifically designed for heterozygous genomes (Platanus (79), dipSPAdes (80)).

• The abundance of repetitive and duplicated regions: This is the greatest obstacle in genome assembly, especially for short-read datasets. Some algorithms perform better than others, but when a repeat is longer than the read length, there is no way to resolve it. The only solutions are the use of long-read technologies or the design of mate-pair libraries. Mate pairs are not easy to construct, but when the distance between the paired reads is large enough, most duplicated regions can be resolved (81).

• The available computational power: CPU time and memory consumption can differ greatly between assemblers. OLC graphs consume a lot of memory with large datasets, but there are also significant differences between De Bruijn graph assemblers.

3.3 Organelle genome assembly


Figure 3.3: Organelle genomes in GenBank. (A) Total number of deposited genomes (15704) in GenBank as of 14 March 2016. (B) Annual deposition of mitochondrial and chloroplast genomes in GenBank since 2003. Statistics from the National Center for Biotechnology Information Genome Resources (https://www.ncbi.nlm.nih.gov/genome/browse/ (13 March 2016, date last accessed)).

Inverted repeats and homogeneous regions in the mitochondrial genome complicate the assembly (83). A number of algorithms adapt existing nuclear genome assembly tools to organelle genome assembly (84,85). One of these strategies is to filter reads with high coverage; however, the coverage of GC-rich regions is reduced during sequencing, which would lead to their exclusion (84,85). The first and most widely used assembler specifically designed for mitochondrial assembly is MITObim (86). A few others followed, though their usage is limited to a relatively small number of assembly projects.

3.3.1 MITObim


Figure 3.3.1: MITObim workflow. The algorithm starts with mapping mitochondrial reads to the conserved regions of the mitochondrial reference. Those mapped reads are used to build an initial assembly, which is then used to bait new reads that overlap this assembly. This process is repeated until the mitochondrial genome is reconstructed (86).

3.3.2 Org.Asm: The ORGanelle ASeMbler

Org.Asm relies on an estimate of the coverage of the organelle sequences. The assembly is initiated by a set of protein sequences that function as seeds. Protein sequences are more conserved than their nucleic counterparts, which makes it possible to initiate the assembly with more distantly related sequences. An initial estimate of the coverage is made by aligning the reads to the set of seeds and is later refined after 15 kb has been assembled. This tool has been successfully used for several assembly projects (88-92).

3.3.3 Norgal

Norgal (de novo ORGAneLle extractor) (93) is a pipeline for mitochondrial assembly that identifies a high-frequency subset of k-mers that are predominantly of mitochondrial origin and then de novo assembles the corresponding subset of reads with IDBA-UD (72). The advantage of this method is that it does not require a reference or seed sequence to initiate the assembly. The benchmark in its publication compares Norgal against NOVOPlasty and MITObim for five mitochondrial genomes; NOVOPlasty produced the most accurate results overall, in a fraction of the time that Norgal requires. Norgal is also limited to mitochondrial assembly; for chloroplasts, the authors therefore suggest first running Norgal on a chloroplast dataset and using the assembled contig as a seed for NOVOPlasty or MITObim.
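The core idea behind this kind of reference-free extraction, isolating the high-abundance fraction of the reads before assembly, can be sketched as follows. This is not Norgal's actual implementation; the k-mer size and abundance threshold are arbitrary placeholders, and Norgal subsequently assembles the retained reads with IDBA-UD.

```python
from collections import Counter
from statistics import median

def kmer_counts(reads, k):
    """Count every k-mer across the whole read set."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def high_frequency_reads(reads, k=21, threshold=50):
    """Keep reads whose median k-mer abundance exceeds the threshold.
    Organelle reads end up in this high-abundance fraction because the
    organelle genome is present in far more copies than the nuclear genome."""
    counts = kmer_counts(reads, k)
    kept = []
    for read in reads:
        freqs = [counts[read[i:i + k]] for i in range(len(read) - k + 1)]
        if freqs and median(freqs) >= threshold:
            kept.append(read)
    return kept
```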

3.3.4 GetOrganelle


3.3.5 Organelle_PBA


4 PROBLEM DESCRIPTION AND OBJECTIVES

Before this project, organelle genomes were often assembled by mapping reads to existing reference genomes, which bears the risk of misassembly through the incorporation of sequences from the reference. Despite the proven usefulness of MITObim, there was a need for a better-performing organelle assembler that could also assemble chloroplast genomes.

While MITObim uses an existing graph-based assembler (MIRA) in iterations, the goal of this work was to develop a completely new and independent assembler that could extend a seed sequence into the circular genome. The decreasing price of NGS data made skimming WGS data the preferred choice for obtaining reads to assemble organelle genomes. For this reason, I decided to focus on WGS data from short-read sequencing technologies. In order to develop a tool that is as accurate and user-friendly as possible, it had to meet the following criteria:

1. A flexible seed input: It should be possible to start the assembly of an organelle genome with a seed from a distantly related species. This is especially important for chloroplast genome assembly, as the number of available chloroplast genomes is still limited.

2. Free of contamination: The assembler should only output contigs originating from the organelle genome that the user is targeting. Assembly tools that are not specifically designed for organelle genome assembly will predominantly output nuclear contigs. When the organelle genome assembly is divided over several contigs, it is not straightforward to reconstruct the complete genome.

3. Accessible to users with limited informatics experience: Many assembly tools or pipelines are difficult to use and require experience with Linux. To make a tool accessible to as many users as possible, it should be cross-platform and should limit the number of installations and dependencies.

These characteristics of organelle genomes had to be incorporated in the assembly method; otherwise there would be no significant improvement compared to existing assemblers.

Once a robust assembler was developed, it could be extended with extra features allowing the analysis of genomic variation. After genome assembly, users still have to use different software for variant calling, heteroplasmy detection, or genome annotation. One tool that includes all these features would make organelle research accessible to researchers with limited experience in bioinformatics.

The main goal of this thesis was therefore to create a user-friendly computer program specifically designed to assemble mitochondrial and chloroplast genomes from Illumina NGS data, and then to compare its efficiency and accuracy to other existing tools. Finally, I developed the program further to uncover different variants of an organelle genome in a dataset, which can be used to study both heteroplasmy and nuclear copies of mitochondrial genome fragments (so-called NUMTs). In particular, I used this tool to evaluate whether it was possible to differentiate heteroplasmic variants from NUMTs.

4.1 Publications

Related to this thesis:

Dierckxsens N, Mardulyn P and Smits G. (2018) "Unravelling heteroplasmy patterns with NOVOPlasty." Submitted.

Dierckxsens N, Mardulyn P and Smits G. (2017) "NOVOPlasty: de novo assembly of organelle genomes from whole genome data." Nucleic Acids Research 45(4):e18. doi:10.1093/nar/gkw955.

Other projects:

5 NOVOPLASTY

Thanks to the evolution of next-generation sequencing (NGS) technology, whole genome data can be readily obtained from a variety of samples. There are many algorithms available to assemble these reads, but few of them focus on assembling the extranuclear genomes. Therefore, we developed a seed-and-extend algorithm that assembles these circular genomes from whole genome sequencing (WGS) data, starting from a single seed sequence. The algorithm has been tested on several new (Gonioctena intermedia and Avicennia marina) and public (Arabidopsis thaliana and Oryza sativa) whole genome Illumina datasets, and it always outperformed the other tested assemblers in assembly reliability and coverage.

5.1 Introduction

The circular genomes of chloroplasts and mitochondria are frequently targeted for de novo assembly. Both genomes are usually maternally inherited, have a conserved gene organization and are often used in phylogenetic and phylogeographic studies, or as a barcode in plant and food identification (99). Different in vitro strategies to isolate these genomes from the much larger nuclear chromosomes have been developed, but this task has proven to be particularly challenging. Before the development of NGS technology, organelle genome assembly was based on conventional primer walking strategies, using long-range PCR and cloning of PCR products, which are laborious and costly (100–102). NGS made it possible to develop novel strategies to reconstruct the entire chloroplast or mitochondrial genome, thereby dramatically reducing time and costs compared to the more conventional methods. It is now affordable to obtain whole genome data in a short timespan by using genomic DNA extracted from whole cells (103). Besides nuclear sequences, a high copy number of extranuclear sequences will be present in the sample, usually around 5 to 10% chloroplast DNA (104) and around 1–2% mitochondrial DNA (105), allowing both nuclear and extranuclear genomes to be assembled from one simple experiment. Shallow sequencing of genomic DNA will result in comparatively deep sequencing of the high-copy fraction of the genome; this approach is called genome skimming. Although assembling the complete data set will generate contigs for the organelle genomes, it is also possible to first isolate the chloroplast or mitochondrial reads, and then assemble this subset. The best strategy depends on the data set, the computational power and the availability of a reference genome.
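The percentages quoted above translate directly into the deep organelle coverage that genome skimming relies on. The following back-of-the-envelope calculation uses purely illustrative numbers (they are not taken from the thesis): ten million 2 x 150 bp read pairs, a 150 kb chloroplast genome and a 16 kb animal mitochondrial genome.

```python
read_pairs  = 10_000_000                       # a shallow whole genome sequencing run (assumed)
total_bases = read_pairs * 2 * 150             # paired-end 150 bp reads

chloro_fraction, chloro_size = 0.05, 150_000   # ~5% chloroplast reads, ~150 kb genome (assumed size)
mito_fraction,   mito_size   = 0.01,  16_000   # ~1% mitochondrial reads, ~16 kb genome (assumed size)

print(total_bases * chloro_fraction / chloro_size)   # ~1000x chloroplast coverage
print(total_bases * mito_fraction   / mito_size)     # ~1875x mitochondrial coverage
```

Even a data set that covers the nuclear genome only a few times therefore gives the organelle genomes ample coverage for de novo assembly.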

When sequence reads are obtained from a total DNA extract, there will be a large excess of reads from the nuclear genome. To reduce the runtime and computational resources needed for the assembly of the organelle genomes, which are several orders of magnitude smaller, it is suggested to work with a relatively low total number of reads (104). Since the copy number of organelle genomes is much higher than that of the nuclear genome, working with a whole genome data set of low coverage is largely sufficient (106). One strategy often used to reduce the ratio of nuclear to organelle reads prior to the assembly consists in filtering the extranuclear sequences, either by keeping only regions of higher coverage or by mapping reads to a reference genome. Filtering by differential coverage will often result in the undesirable exclusion of regions of low or high GC content (see Figure 4.1), as many NGS systems perform less efficiently in these regions (105). Another option is to isolate plastid or mitochondrial DNA prior to sequencing by capturing these molecules with specific probes. However, many specific probes need to be designed to cover the complete organelle genome, such that this approach is only recommended when many samples must be sequenced in parallel.

Figure 4.1: Coverage depth for a 12,000 bp long region of the mitochondrial genome of Gonioctena intermedia. There are several regions with a low GC content, resulting in a reduced read coverage.

In this chapter, NOVOPlasty is presented and compared to other assemblers that can be used for organelle genome assembly, through the benchmarked assembly of new and reference mitochondrial and chloroplast genomes from multiple organisms.

5.2 Materials and Methods

5.2.1 Sequencing

All in-house non-human samples were sequenced on the Illumina HiSeq platform (101 bp or 126 bp paired-end reads). The human mitochondrial samples (PCR-free) were sequenced on the Illumina HiSeq X platform (150 bp paired-end reads).

Two public data sets of Arabidopsis thaliana and of Oryza sativa were downloaded from the European Nucleotide Archive (http://www.ebi.ac.uk). Data sets SRR1174256 (A. thaliana), SRR1810277 (A. thaliana) and ERR477442 (O. sativa) were sequenced on the Illumina HiSeq 2000 platform and consist of paired-end reads with read lengths of 90 bp, 101 bp and 96 bp, respectively. Data sets DRX021298 (A. thaliana) and SRR1328237 (O. sativa) were sequenced on the Illumina HiSeq 2500 platform and consist of paired-end reads with read lengths of 150 bp and 151 bp, respectively. A total of 20% of data sets SRR1174256 and SRR1810277, and 8% of SRR1328237, were sub-sampled for the benchmarking study.

5.2.2 De novo assembly

All assemblies were executed on an Intel Xeon CPU machine containing 24 cores of 2.93 GHz and a total of 96.8 GB of RAM. Our program NOVOPlasty is written in Perl. In addition, four open-source assemblers (MITObim (86), MIRA (64), ARC (https://github.com/ibest/ARC) and SOAPdenovo2 (107)) and the pay-for-use CLC assembler (CLCbio, Aarhus, Denmark) were run on the same data for comparison.

5.2.3 Quality assessment

The assemblies were compared on six quality indicators: speed, memory efficiency, disk space, genome coverage, assembly accuracy and the number of contigs. Comparing speed and system requirements was straightforward, since each assembler ran on the same machine and used the same input data set. The quality indicators were measured relative to the corresponding reference, as mentioned above. The genome coverage represents the percentage of the reference genome that was assembled, minus ambiguous nucleotides. The accuracy represents the percentage of correctly assembled nucleotides relative to the 'perfect' validated alignments. The highest possible scores (100%) for speed, memory efficiency, disk space, genome coverage, assembly accuracy and number of contigs were set to 0 min, 0 GB of RAM, 0 GB, 100%, 100% and 1 contig, respectively. The lowest score (0%) was always chosen close to the average of the worst-performing assembler, to get a clear difference between the assemblers. All percentages were rounded to two decimal digits. The absolute values for each assembly can be examined in Appendix 1.
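The scoring is a linear rescaling between the best possible value and a 'worst observed' anchor for each property. A minimal sketch of that rescaling is shown below; the runtime values and the 120-minute anchor are invented for illustration and are not the anchors used in the benchmark.

```python
def score(value, best, worst):
    """Rescale a metric linearly so that `best` maps to 100% and `worst` to 0%,
    clamping values beyond the anchors into the 0-100 range."""
    pct = (worst - value) / (worst - best) * 100
    return round(min(100.0, max(0.0, pct)), 2)

print(score(4,  best=0, worst=120))   # 96.67 -> a 4-minute assembly scores high on speed
print(score(75, best=0, worst=120))   # 37.5  -> a 75-minute assembly scores much lower
```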

5.2.4 NOVOPlasty algorithm

Unlike conventional assemblers, NOVOPlasty does not try to assemble every read, but extends the given seed until the circular genome is formed. The assembly is circularized when the length is within the expected range and both ends overlap by at least 200 bp. When a repetitive region is detected, circularization is postponed until the assembly exits the repetitive region. Since whole genome data usually contain a high coverage of extranuclear sequences, the algorithm is capable of extending one read into a complete circular genome (Figure 4.2.4a).

Figure 4.2.4a: Workflow of NOVOPlasty. For simplicity, the workflow is limited to unidirectional extension. (A) All reads are stored in a hash table with a unique id. A second hash table maps the read start (the first k bases of each read, where k is the k-mer parameter, default 38) to the ids of the corresponding reads. (B) Scope of search 1 is the region where a match of the read start indicates an extension of the sequence. All these matching reads are stored separately. (C) The positions of the paired reads are verified by aligning each paired read to a previously assembled area, which is determined by the library insert size (scope of search 2). (D) A consensus sequence of the different extensions is determined.
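The workflow of Figure 4.2.4a can be condensed into a short sketch. This is not the Perl implementation of NOVOPlasty, only an idealized illustration of the read-start hash table and the consensus extension it describes; mismatch tolerance, the paired-read verification of step C, bidirectional extension and the circularization check are all left out.

```python
from collections import defaultdict

K = 38   # length of the indexed 'read start' (the default value of NOVOPlasty's k-mer parameter)

def index_read_starts(reads):
    """Hash table mapping the first K bases of each read to the ids of those reads (step A)."""
    index = defaultdict(list)
    for read_id, read in enumerate(reads):
        index[read[:K]].append(read_id)
    return index

def extend_seed(seed, reads, max_length=200_000):
    """Extend the seed in one direction using exact read-start matches and a per-base
    majority consensus (steps B and D of Figure 4.2.4a, strongly simplified)."""
    index = index_read_starts(reads)
    contig = seed
    while len(contig) < max_length:
        candidates = [reads[i] for i in index.get(contig[-K:], [])]
        added = min((len(r) - K for r in candidates), default=0)
        if added <= 0:
            break                                    # no read extends the current contig end
        consensus = ""
        for pos in range(K, K + added):              # majority vote over the added positions
            bases = [r[pos] for r in candidates]
            consensus += max(set(bases), key=bases.count)
        contig += consensus
    return contig
```

In the real algorithm the extension continues until the contig reaches the expected genome size and its two ends overlap by at least 200 bp, at which point the assembly is circularized as described above.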

Figure 4.2.4b: NOVOPlasty's strategy for SNR regions. The figure depicts the alignment of Illumina reads around an SNR region (G-repeat), indicated by the red circle. The SNR region is flanked by regions with high frequencies of sequencing errors (red dots). These high error frequencies only occur after an SNR region and are therefore absent in either the forward or the reverse reads. By assembling towards the SNR from both directions, the erroneous regions of the reads can be avoided.


Figure 4.2.4c: NOVOPlasty’s strategy for tandem repeats. When NOVOPlasty encounters tandem repeats that are unsolvable with short reads, it will search for the sequence that is directly adjacent to the tandem repeat. That sequence will be used as a seed to start a new contig and will form a scaffold with the previous contig that ends in the tandem repeat.

5.3 Results

5.3.1 Chloroplast assembly


Table 5.3.1: Benchmarking results for the assembly of the A. marina chloroplast.


5.3.2 Mitochondrial assembly

NOVOPlasty has been tested for the assembly of seven mitochondrial genomes from three different species. Except for the mitochondrion of the leaf beetle Gonioctena intermedia, all assemblies resulted in a complete circular genome. Detailed results and statistics can be found in Appendix 1.

All assemblers besides ARC successfully reconstructed the complete human mitochondrial genome in a single contig. The failure of ARC was unexpected, since the reference was almost identical. According to the developers, this problem might be resolved by removing the adapters before the assembly. Quality assessment showed that the MIRA assembly contained one mismatch and four unidentified nucleotides, while the other assemblies were identical to the reference. MITObim performed best in memory consumption (1.5 GB) and NOVOPlasty had the shortest runtime (4 min 04 s) (see Appendix 1).

The ARC assembly was 99.99% accurate but comprised only 85.39% of the genome. Except for SOAPdenovo2, all de novo assemblers were able to assemble a larger fraction of the genome (Table 5.3.2). Since it is not possible to assemble the tandem repeats accurately with short reads, we also calculated the coverage and accuracy against the mitochondrial reference without the repetitive section of the control region. This significantly increased the accuracy for the assemblies that partially covered the tandem repeats. The MIRA and MITObim assemblies had the highest genome coverage, but the lowest accuracy. If we only look at the repetitive region, the accuracies are 95.44% and 90.62% for MIRA and MITObim, respectively. This shows that these assemblers are not reliable for problematic regions and lose their advantage of higher genome coverage. Of the four assemblers with an accuracy above 99.97%, NOVOPlasty has the highest genome coverage (Table 5.3.2).

Table 5.3.2: Benchmarking results for the assembly of the G. intermedia mitochondrion.

The increase in read length improved the genome coverage of the NOVOPlasty assembly: it rose from 92.74% to 94.66%. A second assembly of this data set, with an average coverage depth of 892, improved the genome coverage further to 95.18%. This shows that an increased read length combined with deeper coverage can help NOVOPlasty to resolve problematic regions, since the gain came from the highly repetitive and AT-rich region. More remarkable were the results of CLC and SOAPdenovo2: both assemblies showed a reduced genome coverage and accuracy. The genome coverage of SOAPdenovo2 and CLC was reduced by 64% and 1.4%, respectively. The reduction for CLC could be explained by differences in the sample preparation or in the sequencing run (resulting in reads of reduced quality or an underrepresented region), but this cannot explain the large reduction for SOAPdenovo2. One explanation could be the high coverage depth, which can cause problems with some assembly tools. This was tested by repeating the assembly with a sub-sample of 50% of the previous data set. The results showed an increase in genome coverage from 30.8% to 46.8% for the SOAPdenovo2 assembly, demonstrating an adverse effect of increased coverage depth on the quality of the SOAPdenovo2 assembly.

5.3.3 Overall performance

In contrast to NOVOPlasty, the other tested assemblers output a pool of contigs originating from the nuclear, mitochondrial and chloroplast genomes. This pool can in some cases contain more than 1,700,000 contigs (see Appendix 1), which can make it problematic to isolate the organelle genome. Finally, NOVOPlasty scores best on genome coverage and accuracy, the two most important indicators of an accurate assembly.

Figure 5.3.3: Score graph derived from the benchmark study. Each property of each assembler was given a score proportional to those of the other assemblers. Each score is based on the average results of seven assemblies and expressed as a percentage. A score of 100% is always the most favourable; a more detailed explanation can be found in the 'Quality assessment' section of Materials and Methods. (*) Highest score for the corresponding property.

5.3.4 Seed compatibility

The main difference with traditional seed-dependent assemblers is that NOVOPlasty does not use the seed sequence itself to initiate the assembly, but uses it to retrieve one sequence read of the targeted genome from the data set. Possible sequencing errors in that read are corrected by aligning the whole dataset to it, making the corrected sequence independent of the read that was fished out. Subsequently, the corrected read is elongated until the genome is circularized. This new strategy was tested with a variety of seed sequences, originating from closely to relatively distantly related species. Twelve different mitochondrial (Figure 5.3.4a) and chloroplast (Figure 5.3.4b) genomes were tested as seed sequences for the assembly of the mitochondrial genome of Homo sapiens and the chloroplast genome of Arabidopsis thaliana, respectively.

Figure 5.3.4b: Seed compatibility test for the de novo assembly of the chloroplast of Arabidopsis thaliana, with 12 different chloroplast genomes and 12 RuBP subunits as seed sequences. A green dot means that the chloroplast genome of that species can be used as a seed for the chloroplast assembly of A. thaliana. A red M indicates that NOVOPlasty assembled the mitochondrial genome instead of the chloroplast genome. The same colour indications apply to the RuBP subunits. The phylogenetic tree is based on information extracted from the NCBI taxonomy database (113), using phyloT (http://phylot.biobyte.de/).

In such cases, it is advisable to use as seed short regions that are specific to the chloroplast genome and have no equivalent in the mitochondrial genome (instead of the complete genome sequence). This was empirically confirmed by selecting the Rubisco-bis-phosphate oxygenase (RuBP) subunit as a seed sequence (Figure 5.3.4b), which resulted in a successful assembly of the chloroplast genome for 10 out of 12 seed sequences. We were unable to initiate the assembly with the RuBP subunits of Pyropia perforata (RuBP subunit not present) and Lobosphaera incisa as seed sequences. While these two algae species are evolutionarily very distant from Arabidopsis, we were able to assemble the chloroplast genome of Arabidopsis using the complete chloroplast genomes of both algae as seeds. We would therefore recommend first trying short, specific portions of the genome as seed, and then trying the complete chloroplast genome if unsuccessful.

5.3.5 Complex assemblies

Figure 6.3.5: BLAST results of the assembly outputs for Avicennia officinalis from three assembly algorithms. All the contigs of each assembly output are aligned against the verified chloroplast genome sequence of Avicennia officinalis. Each red bar represents a contig that is correctly aligned against the reference.

5.4 Discussion

Organelle genomes can be assembled from whole genome data obtained by different platforms, but the outcome can vary greatly depending on the assembly software.

Due to a lack of reliable and user-friendly open-source software for the assembly of mitochondrial and, especially, plastid genomes, many researchers select the pay-for-use CLC assembler (99,116,117). We present NOVOPlasty, an open-source alternative specifically designed for assembling organelle genomes, capable of delivering the complete genome sequence within 30 minutes. The algorithm takes full advantage of the high coverage of organelle genomes in NGS data, which makes it capable of assembling reads even from problematic regions, such as AT-rich stretches. No reference genome is needed and the assembly can be initiated from a wide range of seed sequences. When the final assembly delivered by NOVOPlasty consists of several contigs, those are automatically arranged sequentially, which facilitates finishing the assembly with complementary methods.


chloroplasts). All but six, characterized by particularly complex repeats, displayed a single contig.

The software is open source and can be downloaded at https://github.com/ndierckx/NOVOPlasty. Besides a standard Perl installation, there are no software or module requirements to run the script. All paired-end Illumina whole genome data sets are compatible with NOVOPlasty. It is recommended to have sufficient coverage (30x for the organelle genome) and to use untrimmed reads to assemble a complete circular genome. Incomplete assemblies caused by low-coverage regions (low GC) can be resolved by using higher coverage (up to 1000x or more), but be aware that higher coverage will slow down the assembly and require more virtual memory. A manual and an example of the configuration file can also be found on the GitHub page.

5.5 NOVOPlasty updates

Since the release of NOVOPlasty, there have been several updates to improve its robustness and to include new functionalities. The timeline of updates for NOVOPlasty, from version 1.0 until the current version 2.7.2, is shown in Figure 6.5a.

Figure 6.5a: Contributions made to the GitHub page of NOVOPlasty since the first release. This graph shows the number of changes to the code since the first version was published.

The new functionalities include:

• Use of IUPAC codes: previously, NOVOPlasty only output 'N' when a nucleotide was ambiguous.

• Zipped files as input: bz2 and gz extensions are supported.

• Additional metrics: after the assembly, NOVOPlasty reports the total number of reads, the number of assembled reads, the average coverage and the fraction of organelle genome reads.

• Option to save the assembled reads in a separate file.

• Expansion of the read id database: some read id formats were previously not supported by NOVOPlasty.

Other updates were added to prevent problems to reoccur:

 Max memory option: Most users run NOVOPlasty on their laptop, even though it often does not have enough virtual memory for their large datasets. After I regularly received messages that the NOVOPlasty assembly terminated during the building of hash tables, I added a ‘Max memory’ option to the config file. This option will subsample the dataset until the given memory has been filled. This option does not only prevent termination trough lack of memory, but will also speed up the assembly by subsampling large datasets.

 Improved seed retrieval: datasets with low coverage often failed to retrieve an initial read when the seed was not closely related. This was resolved by making the alignment to the seed independent of the given k-mer, which means the k-mer can be reduced to align more reads to the seed.
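The sketch below illustrates the idea behind the ‘Max memory’ option: reads are only loaded until a rough memory budget is exhausted. It is a simplified stand-in, not the actual NOVOPlasty code; the file name, the per-base overhead and the strategy of simply stopping at the budget (rather than sampling reads evenly across the file) are assumptions made for the example.

#!/usr/bin/perl
use strict;
use warnings;

# Illustrative memory-capped subsampling of a FASTQ file.
my $max_memory_gb  = 4;                             # value taken from the config
my $bytes_per_base = 10;                            # crude guess of hash-table overhead
my $budget_bases   = $max_memory_gb * 1024**3 / $bytes_per_base;

my $stored_bases = 0;
my @subsample;
open my $fh, '<', 'reads_1.fastq' or die "cannot open reads: $!";
while (my $header = <$fh>) {
    my $seq  = <$fh>;
    my $plus = <$fh>;
    my $qual = <$fh>;
    last unless defined $qual;                      # truncated record
    chomp $seq;
    last if $stored_bases + length($seq) > $budget_bases;   # budget reached
    push @subsample, $seq;
    $stored_bases += length($seq);
}
close $fh;
printf "Kept %d reads (%.2f Gbases) within the memory budget\n",
       scalar @subsample, $stored_bases / 1e9;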


possible (Figure 6.5b). This method makes it possible to resolve the inverted repeat in chloroplast genomes, which reduces the post-processing time.

Figure 6.5b: The use of a reference genome by NOVOPlasty. The assembly will still be de novo; the reference genome is only used to resolve ambiguous positions that would result in the termination of the contig.
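To make the idea in Figure 6.5b concrete, the toy sketch below consults a reference only when the reads support more than one possible next base; an unambiguous extension never touches the reference. The sequences, the anchor length and the vote counts are invented for the example, and this is not the actual NOVOPlasty implementation.

#!/usr/bin/perl
use strict;
use warnings;

# Reference-assisted disambiguation during an otherwise de novo extension.
my $reference = 'ACGTACGTGGATCCATTGCAGGCT';   # toy reference sequence
my $contig    = 'ACGTACGTGGATCC';             # growing de novo contig
my %votes     = (A => 7, G => 6);             # two equally plausible next bases

my $next;
if (keys %votes == 1) {
    ($next) = keys %votes;                    # unambiguous: reference not used
} else {
    # Locate the contig end in the reference and take the base that follows it,
    # but only if that base is among the candidates supported by the reads.
    my $anchor = substr($contig, -12);
    my $pos    = index($reference, $anchor);
    if ($pos >= 0) {
        my $ref_base = substr($reference, $pos + 12, 1);
        $next = $ref_base if exists $votes{$ref_base};
    }
}
if (defined $next) { $contig .= $next }       # otherwise the contig would end here
print "$contig\n";                            # prints ACGTACGTGGATCCA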


6 HETEROPLASMY DETECTION


6.1 Introduction

Mitochondria are small organelles involved in various cellular functions, with ATP production through respiration as their primary task. On average, one mitochondrion harbors between 2 and 10 mtDNA molecules, and each cell contains hundreds to thousands of mitochondria, depending on the cell function (118). When inherited or somatic mutations cause variation among the mtDNA copies within one cell or between different cells of an organism, we speak of heteroplasmy. The majority of these mutations are somatic and occur at an increasing rate with age. As mitochondrial genomes are maternally inherited, mutations can also be inherited from the mother. On rare occasions, paternal leakage of mitochondria during fertilization has been reported in several species (119,120).


Figure 6.1a: Mitochondrial heteroplasmy dynamics. When cells divide, random segregation of mitochondria can lead to a reduced or increased frequency of the mutant mtDNA. When the mutant type exceeds a certain threshold, disease symptoms associated with this mutation can start to develop (127).

There is also great interest from the field of forensic science, since these mutation patterns can enhance the likelihood ratio for a potential match (128,129). Considering that mitochondrial genomes are frequently used as markers in evolutionary studies, widespread occurrence of heteroplasmy within a species could have important implications for estimates of divergence times between populations or species (130).


Figure 6.1b: Heteroplasmy detection with Sanger sequencing. Double peaks on electropherograms indicate the presence of heteroplasmy. (A) Heteroplasmy in four brain samples; the D-84 samples are from the same patient, but from different regions of the brain. The low heteroplasmy frequency of sample C-69 cannot be confirmed with Sanger sequencing, while S-111E and D-84E have frequencies high enough to confirm heteroplasmy. D-84J shows no detectable heteroplasmy. (B) An example of length polymorphism in the CA repeat (137).


DNA by polymerase chain reaction (PCR) with specific primers (133). Both strategies remain sensitive to NUMT enrichment and, depending on the experimental design, can result in incomplete mitochondrial coverage. The most straightforward and least laborious approach is whole genome sequencing (WGS), where all of the DNA in the cells is sequenced (138). Achieving very deep coverage is more costly; however, the dataset can also be used for other genomic analyses. Large numbers of WGS datasets are already available online, and the continuing decrease of NGS sequencing costs will make it even more attractive to produce such datasets in the future. For the detection of rare point mutations (<0.5%), there is a new method named duplex sequencing, which has a reliable detection threshold down to 0.01% (139). This method tags both strands of the DNA and will only identify SNPs when they are present on both strands, which strongly reduces false calls from random sequencing errors. Duplex sequencing is less effective for indel detection, and its high cost makes it impractical for whole genome sequencing. Moreover, this improved accuracy only applies to sequencing errors; NUMTs can still interfere during alignment.
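As a toy illustration of the duplex principle just described (this is not a real duplex sequencing pipeline, and the base calls below are invented), a substitution at a given position is only accepted when the consensus of both strand families of the same tagged molecule carries it:

#!/usr/bin/perl
use strict;
use warnings;

# Bases observed at one position on reads from the two strands of one tagged molecule.
my $reference_base = 'A';
my %strand_reads = (
    plus  => [ 'G', 'G', 'G', 'A' ],
    minus => [ 'G', 'G' ],
);

# Take the majority base of each strand family as its consensus.
my %consensus;
for my $strand (keys %strand_reads) {
    my %count;
    $count{$_}++ for @{ $strand_reads{$strand} };
    ($consensus{$strand}) = sort { $count{$b} <=> $count{$a} } keys %count;
}

# Only a change confirmed by both strands is reported as a true variant.
if ($consensus{plus} eq $consensus{minus} && $consensus{plus} ne $reference_base) {
    print "confirmed variant: $reference_base -> $consensus{plus}\n";
} else {
    print "no confirmed variant (possible error on a single strand)\n";
}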

Heteroplasmy detection pipelines are based on aligning the reads to a mitochondrial reference. Most algorithms counter sequencing errors with strategies ranging from simple base-quality filtering to more complex statistical models (140,141). Depending on the chosen strategy and the quality of the dataset, this can lower the detection threshold to around 1%. NUMT interference can be reduced by filtering out these sequences before heteroplasmy detection, which can be achieved by aligning all the reads to the nuclear genome. This works very well when the nuclear genome originates from the same individual; however, this is rarely the case. In most cases, a reference nuclear genome will be used, even though it does not always contain the same NUMTs as the sequenced individual.
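A minimal sketch of such a reference-based approach is given below: bases from a pileup at one mitochondrial position are filtered on Phred quality, and the site is flagged as heteroplasmic when a minor allele exceeds a frequency threshold. The thresholds and the toy pileup are invented for the example; real pipelines add the statistical error models mentioned above.

#!/usr/bin/perl
use strict;
use warnings;

# Frequency-based heteroplasmy call at a single position of a toy pileup.
my $min_qual  = 20;      # Phred threshold for a base to be counted
my $min_freq  = 0.01;    # 1% detection threshold
my $min_depth = 100;     # minimum filtered depth to call the site at all

# Each entry is [observed base, Phred quality].
my @pileup = ( (map { [ 'A', 35 ] } 1 .. 480),
               (map { [ 'G', 30 ] } 1 .. 15),
               (map { [ 'G', 10 ] } 1 .. 5) );   # low-quality bases are discarded

my %count;
for my $obs (@pileup) {
    my ($base, $qual) = @$obs;
    $count{$base}++ if $qual >= $min_qual;
}
my $depth = 0;
$depth += $_ for values %count;

if ($depth >= $min_depth) {
    for my $base (sort keys %count) {
        my $freq = $count{$base} / $depth;
        printf "%s: %.3f%s\n", $base, $freq,
               ($freq >= $min_freq && $freq <= 1 - $min_freq) ? "  <- heteroplasmic" : "";
    }
}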


that detects intra-individual polymorphisms during a de novo assembly of the mitochondrial genome. We extended the organelle assembler NOVOPlasty with this new heteroplasmy option, which is also able to assemble the region around each detected polymorphic site and subsequently identify connections between nearby mutations. Depending on the detected mutation density, NOVOPlasty is able to assemble from 200 bp up to the complete mitochondrial haplotype or NUMT. This new method can identify mutations originating from NUMTs or the presence of multiple haplotypes.

6.2 Materials and Methods

6.2.1 Sequencing

The Gonioctena intermedia dataset (PCR-free) was sequenced on the Illumina HiSeq platform (126 bp or 250 bp paired-end reads). The human samples (PCR-free) were sequenced on the Illumina HiSeqX platform (150 bp paired-end reads). We received one dataset from the University of Innsbruck, which was used in a benchmark for their scalable web server for the analysis of mtDNA studies (mtDNA-Server) (142). This dataset is a mixture (1:100) of two human samples (HM625679.1 and KC286589.1) sequenced on the Illumina HiSeq platform.

6.2.2 Heteroplasmy detection
