
CONTENTS

1 WHAT IS A GENOME?
1.1 DISCOVERY OF THE GENETIC CODE
1.2 NUCLEAR DNA
1.3 ORGANELLE GENOMES
1.3.1 Mitochondria
1.3.2 Chloroplasts
2 DNA SEQUENCING
2.1 SANGER SEQUENCING
2.2 NEXT-GENERATION SEQUENCING (NGS)
2.2.1 Roche 454 and SOLiD
2.2.2 Ion Torrent
2.2.3 Illumina (Solexa) sequencing
2.3 THIRD-GENERATION SEQUENCING
2.3.1 Helicos
2.3.2 Pacific Biosciences
2.3.3 Oxford Nanopore
2.4 COMPARISON BETWEEN SEQUENCING PLATFORMS
3 SEQUENCE ASSEMBLY
3.1 ASSEMBLY METHODS
3.1.1 String-based
3.1.2 Graph-based
3.2 ASSEMBLY COMPARISON
3.3 ORGANELLE GENOME ASSEMBLY
3.3.1 MITObim
3.3.2 Org.Asm: The ORGanelle ASeMbler
3.3.3 Norgal
3.3.4 GetOrganelle
3.3.5 Organelle_PBA
4 PROBLEM DESCRIPTION AND OBJECTIVES
4.1 PUBLICATIONS
5 NOVOPLASTY
5.1 INTRODUCTION
5.2 MATERIALS AND METHODS
5.2.2 De novo assembly
5.2.3 Quality assessment
5.2.4 NOVOPlasty algorithm
5.3 RESULTS
5.3.1 Chloroplast assembly
5.3.2 Mitochondrial assembly
5.3.3 Overall performance
5.3.4 Seed compatibility
5.3.5 Complex assemblies
5.4 DISCUSSION
5.5 NOVOPLASTY UPDATES
6 HETEROPLASMY DETECTION
6.1 INTRODUCTION
6.2 MATERIALS AND METHODS
6.2.1 Sequencing
6.2.2 Heteroplasmy detection
6.2.3 Mutation linkage
6.3 RESULTS
6.3.1 Mitochondrial assembly
6.3.2 mtDNA-Server dataset
6.3.3 Human WGS datasets
6.3.4 G. intermedia dataset
6.4 DISCUSSION
7 DISCUSSION
7.1 FUTURE WORK
7.1.1 NOVOPlasty updates
7.1.2 NOVOLoci
7.1.3 NOVOLoci in practice
7.2 CONCLUSIONS
8 APPENDICES
9 REFERENCES

1 WHAT IS A GENOME?

A genome is an organism’s complete set of DNA, including all of its genes. In humans, a copy of the entire genome, more than 3 billion DNA base pairs, is contained in all cells that have a nucleus. Each genome contains all of the information needed to build and maintain that organism.

1.1 Discovery of the Genetic Code

Although James Watson and Francis Crick are often seen as the discoverers of DNA, the molecule had already been identified decades earlier. It was the Swiss chemist Johann Friedrich Miescher who first isolated a mysterious substance that he called ‘nuclein’ (1,2). Beyond the fact that these molecules originate from the nucleus, he could not determine the exact function or structure of this new molecule. This is why it took several more decades before his discovery was fully appreciated by the scientific community.

Albrecht Kossel was the first to determine the chemical structure of ‘nuclein’ and named it deoxyribonucleic acid (DNA). He also succeeded in isolating the five nucleobases that form the building blocks of DNA and RNA: adenine (A), cytosine (C), guanine (G), thymine (T) and uracil (U) (2). In the early 1900s, a growing interest in the inheritance theory of Gregor Mendel resulted in a flood of research to prove or disprove his theory of how physical characteristics are inherited from one generation to the next (3).

In the late 19th century, Walther Flemming discovered chromosomes in the nucleus and described the process of mitosis for the first time (4). Building on his findings, Walter Sutton and Theodor Boveri proposed that chromosomes bear hereditary factors in accordance with Mendelian laws (5). These and many other groundbreaking discoveries made it possible for James Watson and Francis Crick to decipher the molecular structure of DNA in April 1953. They described DNA as a double-stranded helix, with the two strands connected by hydrogen bonds (6).

1.2 Nuclear DNA

All eukaryotic organisms have a nucleus that contains almost the entire genome. Nuclear DNA (nDNA) is organised in chromosomes, and in sexually reproducing organisms one copy is inherited from each parent. The most studied genome is that of the human, which contains 23 chromosome pairs.

1.3 Organelle genomes

Organelles, which are descended from endosymbiotic bacteria, are essential for ATP production and can be found in almost every eukaryotic cell (7). Their circular genomes are haploid and usually maternally inherited, making them ideal for phylogenetic and phylogeographic studies. As organelle genomes encode a few essential genes, it is not surprising that mutations in those genes can lead to serious disorders. Despite their small size, organelle genomes are the most studied part of the genome (8).

1.3.1 Mitochondria

Mitochondria are cytoplasmic organelles that provide the cell with energy by producing adenosine triphosphate (ATP) through oxidative phosphorylation (OXPHOS). Mitochondrial DNA (mtDNA) is a circular multi-copy genome that is maternally inherited through the oocyte. The number of copies depends on the energy needs of the cell: tissues like muscle and liver have a high mtDNA count, while mitochondria are absent in red blood cells (9). Although the mitochondrial genome varies in length between species, it usually does not exceed 25,000 bp (human mtDNA consists of 16,569 nucleotides). There are around 1500 genes linked to mitochondria in humans, of which 37 are encoded by the mitochondrial genome and the rest by the nuclear genome. Plant mitochondrial genomes are usually much longer, which is caused by the transfer of chloroplast DNA. This makes the sequence of plant mitochondrial genomes highly variable and complicated to study (10).

1.3.2 Chloroplasts

Although plants produce ATP through their mitochondria, they also need chloroplasts to produce carbon sources. While animals acquire carbohydrates through their food, the chloroplasts of plants make sugars by converting carbon dioxide (CO2), water (H2O) and solar energy into glyceraldehyde 3-phosphate. Generally, chloroplast DNA (cpDNA) consists of one circular DNA molecule with a length of 120–220 kbp. Most chloroplast genomes contain two inverted repeats, which separate a long single copy section (LSC) from a short single copy section (SSC) (11,12).

Figure 1.3.2: The chloroplast genome of Arabidopsis thaliana, assembled with NOVOPlasty (13), annotated with DOGMA (14) and drawn with OG Draw (15).


2 DNA SEQUENCING

Since the discovery of DNA, scientists have searched for new ways to decipher the genetic code. Researchers quickly understood that determining the order of the nucleotides could greatly increase our understanding of living matter. DNA sequencing techniques have evolved from laborious methods for determining short sequences to sophisticated machines that are capable of sequencing a complete genome.

2.1 Sanger sequencing

It all started with a 24 bp sequence that was identified using a method known as wandering-spot analysis (16). This time-consuming and labour-intensive method was quickly replaced by faster and more efficient sequencing methods developed by Frederick Sanger (17). To this day, Sanger’s chain termination method is still widely used to determine short sequences (<1000 bp). Sanger sequencing extends a primer sequence with a mixture of the four DNA nucleotides (dATP, dTTP, dCTP, dGTP) and fluorescent “chain terminator” nucleotides that mark the ends of the fragments (18). This generates fragments of different lengths, each marked with a fluorescent dye that indicates the last nucleotide. These fragments are then separated by length using capillary gel electrophoresis, which results in a chromatogram where each peak in fluorescence intensity represents a base call. Although other technologies worked on the same principle, the accuracy, robustness and ease of use of Sanger sequencing made it the most common DNA sequencing technology for years to come (19).
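The read-out principle described above can be illustrated with a toy simulation. This is only a sketch under strong assumptions: primers, dyes and signal noise are ignored, termination is assumed to occur at least once at every position, and the function names are illustrative rather than taken from any existing tool.

```python
def terminated_fragments(synthesised_strand):
    """Return one chain-terminated fragment for every position of the strand.

    Assumption: a terminator nucleotide is incorporated at least once at each
    position, so every prefix of the synthesised strand is present.
    """
    return [synthesised_strand[:i] for i in range(1, len(synthesised_strand) + 1)]


def read_sequence(fragments):
    """Order fragments by length (as electrophoresis does) and read the
    terminal, dye-labelled base of each fragment."""
    ordered = sorted(fragments, key=len)
    return "".join(fragment[-1] for fragment in ordered)


fragments = terminated_fragments("GATTACA")
fragments.reverse()  # the fragments are mixed in the tube; order is irrelevant
print(read_sequence(fragments))  # prints GATTACA
```

Sorting by length is what turns an unordered mixture of fragments into an ordered series of base calls.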

2.2 Next-Generation Sequencing (NGS)

After a long period of relatively slow progress in DNA sequencing, a new wave of sequencing technologies emerged between 2005 and 2010 (20). This next generation of sequencers was cheaper, faster and highly parallel; because of the latter, they were also called “High-Throughput Sequencing Technologies”. In 2006, it still cost ~$14 million to generate a draft human genome, while by 2015 the cost had dropped to $1500 (Figure 2.2) (21). These advances were so great that they revolutionized the fields of genetics and molecular biology (22).

Figure 2.2: Cost per raw megabase of DNA sequence over time, with the introduction of NGS marked (21).

2.2.1 Roche 454 and SOLiD

Roche 454 was the first Next-Generation Sequencer on the market, yet it never gained a lasting foothold and was discontinued in 2013 after it became non-competitive. This sequencer worked on the principle of ‘pyrosequencing’, where the addition of bases is detected optically after the release of a pyrophosphate (23). This method achieved longer reads than other NGS technologies, but the low throughput and high costs made it less attractive.

The arrival of third-generation sequencing and the dominance of Illumina also led to the discontinuation of ABI-SOLiD in 2016. SOLiD technology used DNA ligase instead of DNA polymerase for sequencing. A typical sequencing machine sequentially identifies every nucleotide as A, C, G or T, while SOLiD used colour space encoding to represent a sequence as transitions between nucleotides. This method achieved high accuracy, but the short read lengths and the limitation to resequencing meant it was destined to disappear (24).
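Colour space encoding can be made concrete with a small sketch. It assumes the commonly described four-colour di-base scheme, in which the colour of a transition equals the XOR of the 2-bit codes of the two bases. The numeric colour labels 0 to 3 and the function names are illustrative assumptions, not the vendor's own notation.

```python
# 2-bit codes chosen so that XOR of two bases gives the transition colour.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = {code: base for base, code in BASE_CODE.items()}


def to_colour_space(seq):
    """Encode a sequence as its first base plus one colour per transition."""
    colours = [BASE_CODE[a] ^ BASE_CODE[b] for a, b in zip(seq, seq[1:])]
    return seq[0], colours


def from_colour_space(first_base, colours):
    """Decode by walking the transitions, starting from the known first base."""
    bases = [first_base]
    for colour in colours:
        bases.append(CODE_BASE[BASE_CODE[bases[-1]] ^ colour])
    return "".join(bases)


first, colours = to_colour_space("ATGGCA")
print(first, colours)                     # prints A [3, 1, 0, 3, 1]
print(from_colour_space(first, colours))  # prints ATGGCA
```

Because every base is decoded from the previous one, a single wrong colour call changes all downstream bases, which is why colour-space reads are usually compared with a reference in colour space rather than decoded first.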

2.2.2 Ion Torrent

Before the arrival of long-read sequencing, Ion Torrent was the only competition left for Illumina. Unlike most NGS technologies, Ion Torrent does not make use of optical signals. Instead, it uses semiconductor sequencing technology, which detects the change in pH during the incorporation of a nucleotide. The change in pH is caused by the release of an H⁺ ion during the addition of a dNTP, and is used to determine how many bases are added in each cycle (25). The main advantage of this technology is that the Ion Personal Genome Machine (PGM) is small, has a fast turnaround time and is relatively cheap to acquire. Smaller laboratories that regularly require sequencing data do not always have the budget to invest in expensive Illumina machines, and therefore opt for Ion Torrent machines. The much lower throughput and the systematic indel errors in homopolymers remain persistent disadvantages of Ion Torrent compared to Illumina (26).
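The link between the pH signal and homopolymer errors can be sketched with an idealised flowgram. In this simplified model, one nucleotide species is flowed over the chip at a time and the signal is proportional to the number of bases incorporated. The flow order `TACG` and all function names are assumptions made for illustration.

```python
def flowgram(seq, flow_order="TACG", max_flows=100):
    """Ideal signal per flow: how many bases are incorporated when a single
    nucleotide species is offered (0 when the next base does not match)."""
    signals, pos, flow = [], 0, 0
    while pos < len(seq) and flow < max_flows:
        nucleotide = flow_order[flow % len(flow_order)]
        run = 0
        while pos < len(seq) and seq[pos] == nucleotide:
            run += 1
            pos += 1
        signals.append((nucleotide, run))
        flow += 1
    return signals


def call_bases(signals):
    """Round each (possibly noisy) signal to the nearest whole number of bases."""
    return "".join(nucleotide * round(signal) for nucleotide, signal in signals)


ideal = flowgram("TTACCCG")
print(ideal)              # prints [('T', 2), ('A', 1), ('C', 3), ('G', 1)]
print(call_bases(ideal))  # prints TTACCCG

# A noisy signal on a homopolymer is rounded to the wrong length: an indel.
print(call_bases([("T", 2), ("A", 1), ("C", 3.6), ("G", 1)]))  # prints TTACCCCG
```

The longer the homopolymer, the smaller the relative difference between the signals for n and n+1 bases, which is why indel errors concentrate in these regions.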

With Illumina’s new compact and low-cost iSeq 100 sequencer and the increasing competition from third-generation sequencers, Ion Torrent will probably have to struggle to survive in this highly competitive field.

2.2.3 Illumina (Solexa) sequencing

Solexa was founded in 1998 to commercialize the ‘Sequencing-by-Synthesis’ technology that was developed at the University of Cambridge by Shankar Balasubramanian and David Klenerman (27). In 2007, just one year after the first commercial Solexa sequencer was introduced, Illumina purchased Solexa and has continuously improved this technology since then (28).

Before DNA can be sequenced, Illumina technology requires the sample to be fragmented and ligated with adapters in order to attach each fragment to a flow cell. Each attached fragment is amplified by bridge amplification, resulting in clusters of identical sequences, which are needed to observe a strong optical signal during the sequencing phase. All four reversible fluorescent-labelled nucleotides are added to the flow cell during each sequencing cycle. These nucleotides have a reversible 3' blockage that limits nucleotide incorporation to one at a time. After each round of synthesis, a digital image of the flow cell is taken and each incorporated base is determined from the emission wavelength and intensity. The 3' blockage and fluorescent dye are then chemically removed for the next sequencing cycle (29,30). This process is repeated until each full DNA molecule is sequenced (Figure 2.2.3).

Figure 2.2.3: Illumina sequencing technology. (A) An NGS library is prepared by fragmenting a DNA sample and ligating specialized adapters to both fragment ends. (B) The DNA fragments with adapters are hybridized onto the flow cell and amplified into clonal clusters through bridge amplification. (C) During each sequencing cycle, one fluorescent-labelled nucleotide is incorporated into each complementary strand. The emission wavelength and intensity are used to identify the bases. (31)

The majority of Illumina datasets are paired-end, which means that both ends of each fragment are sequenced. Paired reads can compensate for the short read lengths of Illumina data and help to resolve repeats. Illumina offers several NGS platforms based on this technology, all with different properties to appeal to as many users as possible.

Generally, all Illumina platforms produce a very high number of accurate, but relatively short reads. This high throughput combined with high accuracy makes Illumina the ideal choice for transcriptome analysis (32), rare variant calling (33) or population-based sequencing projects (34).

2.3 Third-Generation Sequencing

There is still debate about what defines the different generations of sequencing technologies, as there are always some technologies that fall in between. We will follow the most common criterion for categorizing third-generation technologies, namely the capability of reading DNA sequences at the single-molecule level. NGS technologies first break up the DNA into small fragments, followed by an amplification step to generate a large enough signal during base calling (35). Third-generation technologies are able to sequence the DNA sample directly without an amplification step. This offers several advantages, such as simplified sample preparation, no amplification-induced errors, no GC-bias and direct RNA sequencing (36). Still, the most important advance of these technologies is the ability to generate far longer reads than ever before. Complex genomic regions such as genome-wide duplications or tandem repeats, which were often too complex to assemble with short reads, have now been resolved with third-generation sequencers (37). Third-generation sequencing is also known as ‘Single Molecule sequencing’ and ‘Long-Read sequencing’, referring respectively to the method behind the technology and its most visible improvement.

2.3.1 Helicos

Although Helicos (38) was the first single molecule sequencer, it is not always linked to third-generation sequencing, as it was limited to very short read lengths. Rather than revolutionizing DNA sequencing, Helicos modified existing techniques to make single molecule sequencing possible. Besides the use of only one fluorescent-tagged nucleotide per cycle and the absence of amplification, there was not much difference compared to Illumina sequencing. The addition of only one nucleotide per cycle and the extensive imaging needed to compensate for the absence of amplification resulted in a very slow cycle time. Other issues were the short read length (35 bp), the high error rate, and the high cost and large size of the Helicos machine. These disadvantages were too great to compete with Illumina and led to the bankruptcy of Helicos in 2012 (39).

2.3.2 Pacific Biosciences

We can consider Pacific Biosciences’ Single Molecule Real Time (SMRT) sequencing technology as the first true third-generation sequencer (40). It was commercially introduced in 2010 and is now available in a range of PacBio machines. The technology is based on sequencing-by-synthesis and optical monitoring of fluorescently tagged nucleotides as they are incorporated into individual template molecules. The PacBio platform consists of SMRT Cells, containing tens of thousands of microfabricated nanostructures called zero-mode waveguides (ZMW). A single polymerase is immobilized at the bottom of each ZMW and forms a complex with one of the DNA templates after they are loaded onto the SMRT Cell. These templates are called SMRTbells and are created by ligating hairpin adapters to both ends of a double-stranded DNA (dsDNA) molecule. Inside each ZMW, the polymerase performs sequencing-by-synthesis in a similar fashion to some of the NGS technologies. PacBio’s innovation does not originate from the sequencing method, but from how the fluorescent emission is detected inside the ZMW. The ZMW has such a small light detection volume that, when it is illuminated from below, light of that wavelength cannot pass through efficiently. The light only penetrates the lower part of the ZMW, creating a powerful light microscope that reduces background noise up to 1000-fold (41). This makes it possible to detect in real time the individual fluorophores that are cleaved from the incorporated nucleotides (Figure 2.3.2).

Figure 2.3.2: SMRT sequencing technology. (A) A ZMW with an immobilized DNA template-polymerase complex at the bottom. (B) The four nucleotides, each labelled with a different fluorescent dye, are added to the ZMW. As a nucleotide is held in the detection volume by the polymerase, a light pulse is generated and can be detected until the fluorophore is cleaved (41).

The low detection accuracy (85%) is a serious drawback of this method, but it can be improved by sequencing one DNA template multiple times in one cycle. This is made possible by the circular design of the DNA template (SMRTbell). Long insert libraries will pass only once and produce very long reads (>20 kb) with low accuracy, while shorter libraries will be sequenced multiple times into a continuous long read (CLR), which consists of multiple subreads from the same insert that can be combined into one highly accurate consensus sequence, termed a circular consensus sequence (CCS) (42).
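The gain in accuracy from combining subreads can be illustrated with a per-position majority vote. This is a deliberately simplified sketch, not the actual CCS algorithm: it assumes that the subreads are already aligned, have equal length and contain only substitution errors, whereas real subreads also contain insertions and deletions and must be aligned first. The function name and the example subreads are hypothetical.

```python
from collections import Counter


def majority_consensus(aligned_subreads):
    """Call the most frequent base in every column of the aligned subreads."""
    consensus = []
    for column in zip(*aligned_subreads):
        base, _count = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)


# Five passes over the same insert, four of them with one random error each.
subreads = [
    "ACGTTGCA",
    "ACCTTGCA",
    "ACGTTGGA",
    "AGGTTGCA",
    "ACGTAGCA",
]
print(majority_consensus(subreads))  # prints ACGTTGCA
```

Because the errors are random rather than systematic, they rarely hit the same position in several passes, so the vote recovers the correct base.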

PacBio does not focus on multiple platforms like Illumina; instead, it brings a new machine to the market to replace the previous one over time. After the PacBio RS (2011) and PacBio RS II (2013), the new Sequel System was released in 2015. These machines output variable read lengths, with an average of 10,000–15,000 bp. This is a huge advantage compared to NGS platforms, especially for de novo whole-genome sequencing, as short reads cannot resolve complex genomic regions (repeats and whole-genome duplications). The downsides are the higher error rates, higher costs and reduced throughput (43).

2.3.3 Oxford Nanopore

Oxford Nanopore is the latest sequencing technology to be commercialized and introduced a completely new way of DNA sequencing. The concept dates from the 1980s, but it took decades to develop it into a competitive technology (44). Rather than synthesising the DNA template, nucleotides are directly detected while they pass through a nanopore. These protein pores are embedded in a membrane with an electric potential that pulls the DNA through the pore. Every time a nucleotide passes through the pore, the ion current changes in a distinct manner for each of the four bases (Figure 2.3.3) (45).

Figure 2.3.3: Oxford Nanopore sequencing technology. (A) A docking enzyme unwinds the dsDNA, after which a strand is pulled through the pore. The small picture shows the distinct disruptions in the current for each nucleotide. (B) The MinION sequencing machine, which can transfer real-time sequencing data to a desktop. (C) The even more compact SmidgION, which is currently under development and can be connected to a smartphone (45).

In 2014, Oxford Nanopore Technologies (ONT) introduced the MinION and has since released two more machines, the GridION X5 in 2017 and the PromethION in 2018. The GridION can run up to five identical flow cells to increase the throughput or to sequence multiple samples on the same device. The PromethION is their largest device with 48 flow cells, containing 144,000 nanopores (in comparison to the MinION’s 500), which is ideal for human-genome-scale projects and is designed to compete with the high-throughput Illumina devices. While the GridION and the PromethION were developed to increase the throughput, ONT is currently developing an even more compact version of the MinION that can be directly connected to a smartphone, named the SmidgION (Figure 2.3.3).

These compact sizes and simple library preparation protocols make Nanopore sequencers ideal for use in the field, for example during pathogen outbreaks or remote monitoring (45). Data are made available in real time once the first DNA template passes through one of the pores, which makes it possible to monitor the experiment and terminate or extend it when needed. The MinION has already been used in environments where there is no access to a sequencing lab, like the International Space Station (46) and Antarctica (47). The only persistent disadvantage is the lower accuracy of around 85%, similar to that of the long-read sequencing technology of PacBio. Yet this technology offers many advantages and could improve even further in the near future, which could make it a serious competitor to Illumina.

2.4 Comparison between sequencing platforms

With the many platforms available on the market, it is important to consider all the options and select the method best suiting your project. The chosen platform should not only be determined by your budget and the amount of DNA available, but also by the type of projects the data will be used for in the future and which computational methods will be needed to accomplish this.

Table 2.4: Comparison of the currently most used sequencing technologies (37,41,48,49).

| Technology | Read length | Accuracy | Throughput | Advantages | Disadvantages |
|---|---|---|---|---|---|
| Sanger | 400–900 bp | 99.99% | NA | Very high accuracy; still seen as the gold standard | Expensive; time-consuming; impractical for larger sequencing projects |
| Ion Torrent | < 440 bp | 98% | 2 GB | Less expensive equipment; fast | Homopolymer errors; low throughput |
| Illumina | 2 x 150–300 bp | 99.9% | 1.2 (iSeq) – 1800 (HiSeq X) GB | Low cost per Mb; high throughput | Short reads; systemic errors in homopolymers |
| PacBio Sequel | 10–15 kbp | 87% | 5–10 GB | Epigenetics; random error profile; long reads | Expensive; low accuracy |
| Oxford Nanopore MinION | 10–15 kbp | 85% | 0.1 GB | Direct RNA sequencing; field use; small equipment size; short library preparations; very long reads | Low accuracy; low throughput |

3 SEQUENCE ASSEMBLY

The rapid evolution of DNA sequencing technologies in recent decades now makes it possible to produce a large amount of sequencing data within a reasonable budget and time frame. Computational power and well-designed software are needed to derive useful information from these sequencing data. The problem is that these two elements cannot keep pace with the continuous progress of sequencing technologies. Yet insufficient computational power is seldom the true problem and could often be resolved by designing more memory-efficient software. There is a wide range of software designed for processing sequencing data; in this chapter, we will restrict our focus to sequence assembly software.

3.1 Assembly methods

Sequence assembly refers to aligning and combining short sequences to reconstruct the original sequence of an entire DNA molecule. As sequencing technologies generate huge amounts of relatively short reads in a random order, there is a need for well-designed algorithms to automate the reconstruction. To appreciate the complexity of the task, consider that a whole human genome Illumina dataset with 30x coverage has around 750 million reads of 100–250 bp. Reconstructing this huge amount of fragmented data into the 23 chromosome pairs is an extremely complex task. Many aspects have to be taken into account during the design of an assembly algorithm. The properties of the sequencing data have to be thoroughly studied, as well as the complexity of the genome and the available computational resources (50). When a very similar genome has previously been assembled, it is possible to align all the reads to this reference and reconstruct a genome that is modelled on the reference. This simplifies the assembly greatly, but is limited to very close references and is unable to detect structural variations. Reference-based alignment is frequently used in the medical field to detect Single Nucleotide Polymorphisms (SNPs) and small insertions or deletions (INDELs) (51). This chapter will only discuss de novo assembly, where a genome is reconstructed without a reference.
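The read count quoted for a 30x human dataset follows from the usual definition of coverage as the total number of sequenced bases divided by the genome size. The sketch below is a back-of-the-envelope calculation that assumes a rounded genome size of 3 billion bp and a uniform read length; the function names are illustrative.

```python
import math


def reads_needed(genome_size_bp, coverage, read_length_bp):
    """Number of reads N such that N * read_length / genome_size >= coverage."""
    return math.ceil(coverage * genome_size_bp / read_length_bp)


def coverage(n_reads, read_length_bp, genome_size_bp):
    """Average depth: total sequenced bases divided by the genome size."""
    return n_reads * read_length_bp / genome_size_bp


HUMAN_GENOME_BP = 3_000_000_000  # assumption: rounded size of the human genome

for read_length in (100, 120, 250):
    print(read_length, reads_needed(HUMAN_GENOME_BP, 30, read_length))
# prints 100 900000000
# prints 120 750000000
# prints 250 360000000
```

Under these assumptions, a figure of around 750 million reads corresponds to an average read length of about 120 bp, and the whole 100–250 bp range spans roughly 360 to 900 million reads.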

The development of de novo sequence assemblers has always run in parallel with the evolution of DNA sequencers, and assemblers are often designed for one particular sequencing technology. The general principle of de novo assemblers is to find overlaps among reads and join them into contiguous sequences, called contigs. The first assemblers were string-based methods that were efficient for Sanger reads or small early-NGS datasets. Currently, the vast majority of assemblers are more complex graph-based algorithms (Overlap Layout Consensus or De Bruijn) that can deal with the increasing number of reads per dataset (52).

3.1.1 String-based

The first and most intuitive approach to sequence assembly was a greedy string-based method. Greedy assemblers iteratively join the reads that are most similar to each other until no more reads can be joined. ‘Greedy’ refers to the strategy of only considering the best overlap between two reads without looking at the global outcome (53). During each iteration, the algorithm looks for the Shortest Common Superstring (SCS) and rejects reads with a shorter overlap. This simplifies the design of the algorithm, but will not always lead to a correct assembly, with a high risk of misassembly in repetitive regions (52,54). Figure 3.1.1a is such an example, where the repeat sequence CACA collapses into CA, causing a misassembly. Besides the collapse of repeats, the greedy approach can also lead to chimeric assemblies when the algorithm selects the wrong extension following a duplicated region. This strategy was initially designed for Sanger reads, which are relatively long and therefore less prone to cause errors in repetitive or duplicated regions (55).

Figure 3.1.1a: Example of a greedy string-based assembly of four short reads. Only the shortest common superstring is considered; other possible overlaps are not taken into consideration. In this example, a greedy approach results in a collapse of the CACA repeat.
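The greedy strategy can be sketched in a few lines. The code below is a minimal illustration rather than the implementation of any published assembler: it repeatedly merges the pair of sequences with the longest suffix-prefix overlap, breaking ties by taking the first pair found. The reads are those of the example (TGAC, GACA, ACAT, CACA); the function names and the `min_overlap` parameter are assumptions.

```python
def overlap(a, b):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_overlap=3):
    """Merge the best-overlapping pair until no overlap >= min_overlap is left."""
    contigs = list(reads)
    while True:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b)
                    if k > best[0]:  # strict: the first best pair wins ties
                        best = (k, i, j)
        k, i, j = best
        if k < min_overlap:
            return contigs
        merged = contigs[i] + contigs[j][k:]
        contigs = [merged] + [c for n, c in enumerate(contigs) if n not in (i, j)]


print(greedy_assemble(["TGAC", "GACA", "ACAT", "CACA"]))
# prints ['TGACAT', 'CACA']
```

The contig TGACAT is shorter than the original sequence TGACACAT: the read CACA, which spans the repeat, can no longer be placed once ACAT has been joined directly to TGACA.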

This strategy was also adapted for the less accurate and shorter (30–90 bp) reads from NGS technologies. These assemblers (SSAKE (56), VCAKE (57)) used a different method of string-based assembly to avoid the misassemblies caused by the greedy approach. They consider all the reads that have an overlap longer than a certain length (k-mer) and terminate the extension when there is a conflict between multiple reads that could extend the contig. In Figure 3.1.1b the contig is terminated to avoid a misassembly, which illustrates the difference in outcome compared to the greedy approach of Figure 3.1.1a.

Figure 3.1.1a (content): the reads TGAC, GACA, ACAT and CACA are merged into TGACA and then into TGACAT. Original sequence: TGACACAT.

Figure 3.1.1b: Example of a string-based assembly of four short reads without a greedy approach. The same four reads as in Figure 3.1.1a lead to a shorter contig to avoid a possible misassembly.
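The non-greedy behaviour can be sketched as a seed extension that stops as soon as the overlapping reads disagree on the next base. This is a simplified illustration in the spirit of the approach described above, not the actual algorithm of SSAKE or VCAKE. The function names and the `min_overlap` and `max_length` parameters are assumptions.

```python
def next_bases(contig, reads, min_overlap):
    """Collect the base each overlapping read would add to the contig end.

    For every read, only its longest overlap with the contig end is used, and
    the overlap must leave at least one base of the read to extend with.
    """
    bases = set()
    for read in reads:
        for k in range(len(read) - 1, min_overlap - 1, -1):
            if contig.endswith(read[:k]):
                bases.add(read[k])
                break
    return bases


def extend_contig(seed, reads, min_overlap=2, max_length=1000):
    """Extend one base at a time; stop on a conflict or when no read overlaps."""
    contig = seed
    while len(contig) < max_length:  # guard against endless circular extension
        bases = next_bases(contig, reads, min_overlap)
        if len(bases) != 1:
            break
        contig += bases.pop()
    return contig


reads = ["TGAC", "GACA", "ACAT", "CACA"]
print(extend_contig("TGAC", reads))  # prints TGACA
```

After TGACA, the read ACAT proposes a T and the read CACA proposes a C, so the extension is terminated instead of choosing one of them.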

3.1.2 Graph-based

Despite the development of a few string-based assemblers for the new NGS technologies, it was clear that there was a need for a new approach to handle large short-read datasets. The general idea was to switch from a local approach (string-based) to a global analysis of the relationships between the reads. This would increase the assembly speed and would be more suitable for larger genomes and high-throughput datasets (58). Graphs are mathematical structures that model pairwise relations between objects, which are represented as a network of nodes (or vertices) connected by edges. Edges can be directed from one node to another or be undirected when there is no distinction between the two nodes associated with each edge. A graph is weighted when the edges are given a score, which can be represented by a number or the thickness of the edge (Figure 3.1.2).

Figure 3.1.2: Example of a graph where nodes are connected by edges. (A) A directed graph with orientated edges. (B) An undirected graph.

3.1.2.1 Overlap Layout Consensus (OLC)

The first graph-based assemblers were based on the OLC approach and designed for Sanger (CAP3 (59), PHRAP (60)) or 454 (Newbler) reads. They gave better results than the string-based assemblers, but were soon superseded by De Bruijn graph algorithms.

As the name indicates, OLC algorithms consist of three phases:

• Overlap: All the overlaps between reads are found by an all-against-all pairwise comparison and used to build an overlap graph. In this weighted graph, the nodes represent the reads and the edges the overlaps. The minimum overlap length and minimum percent identity are the two parameters that shape the graph. Higher values of these parameters will result in shorter but more accurate contigs (61).

• Layout: During the layout stage, the graph is first simplified by removing redundant information. Then the layout of the reads along the genome is determined by finding the (Hamiltonian) path that passes through each node exactly once (62). Finding one such path would be the ideal situation; in reality, multiple paths will be formed that split up the genome into contigs.

• Consensus: In the final stage, a multiple sequence alignment of all the reads along the layout is used to build a consensus sequence for each contig.

Figure 3.1.2.1: Example of an overlap graph that consists of five reads of 3 bp. The nodes are the reads and the weighted edges are the overlaps. The red lines show the Hamiltonian path that results in the following sequence: ATCCAGT. Image adapted from (63).
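The three OLC phases can be sketched on a toy dataset. The five 3 bp reads below are hypothetical and were chosen only so that the layout spells the same sequence as in the figure. The Hamiltonian path is found by brute force over all read orders, which is feasible only for a handful of reads, and the consensus step is reduced to merging error-free overlaps rather than performing a multiple sequence alignment.

```python
from itertools import permutations


def overlap(a, b, min_overlap):
    """Longest proper suffix of `a` equal to a prefix of `b` (0 if too short)."""
    for k in range(min(len(a), len(b)) - 1, min_overlap - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def overlap_graph(reads, min_overlap):
    """Overlap phase: weighted, directed edges between overlapping reads."""
    graph = {}
    for a in reads:
        for b in reads:
            if a != b:
                k = overlap(a, b, min_overlap)
                if k:
                    graph[(a, b)] = k
    return graph


def hamiltonian_layout(reads, graph):
    """Layout phase: the read order that visits every read exactly once along
    existing edges, preferring the largest total overlap (None if impossible)."""
    best_path, best_weight = None, -1
    for path in permutations(reads):
        steps = list(zip(path, path[1:]))
        if all(step in graph for step in steps):
            weight = sum(graph[step] for step in steps)
            if weight > best_weight:
                best_path, best_weight = path, weight
    return best_path


def consensus(path, graph):
    """Consensus phase, simplified: append the non-overlapping part of each read."""
    contig = path[0]
    for a, b in zip(path, path[1:]):
        contig += b[graph[(a, b)]:]
    return contig


reads = ["CCA", "ATC", "AGT", "TCC", "CAG"]  # hypothetical reads
graph = overlap_graph(reads, min_overlap=2)
layout = hamiltonian_layout(reads, graph)
print(layout)                    # prints ('ATC', 'TCC', 'CCA', 'CAG', 'AGT')
print(consensus(layout, graph))  # prints ATCCAGT
```

The number of possible read orders grows factorially with the number of reads, which gives an impression of why the layout phase becomes so hard for large datasets.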

MIRA (64) is one of the few OLC assemblers that survived and successfully adapted to the dominant Illumina technology. This assembler is also slowly disappearing, but is in some cases still used for the assembly of bacterial or small eukaryotic genomes. The most important advantage of MIRA is the hybrid assembly option, which makes it possible to combine sequencing reads from multiple technologies. The latest update featured the option to combine long PacBio reads with short accurate Illumina reads, an option that is highly desired in the assembly community and often lacking in modern assembly software (65). The dominance of Illumina has made the De Bruijn graph the preferred assembly method, as it is still the most efficient method for large short-read datasets. Yet OLC is making a comeback due to the rise of long-read technologies (PacBio and Nanopore), as OLC is still the best option for long reads with low to moderate coverage.

3.1.2.2 De Bruijn graph

Long before its application in genome assembly, the Dutch mathematician Nicolaas de Bruijn developed this graph theory in 1946 and based it on Euler’s first theorem of graph theory, dating back to 1735 (66). The De Bruijn graph (DBG) is the most recent method, and almost all assembly software of the last decade is based on this approach. The increasing number of reads generated by NGS technologies made it almost impossible to resolve the more complex OLC graphs. For this reason, the De Bruijn graph approach became the preferred method for assembly developers (61).

We can also break up the De Bruijn graph approach into three steps:

• Break up reads into successive k-mers: All reads (length L) are divided into (L-k+1) k-mers and each unique k-mer is stored in a hash table. As many reads share the same k-mers, this hash table will take less memory than storing the complete reads. The optimal k-mer size depends on the read length and the complexity of the genome and is hard to determine theoretically; it therefore has to be established experimentally (67).

• Construct a De Bruijn graph: Nodes represent sequences with length x and each edge connects nodes that have an exact x-1 overlap, which corresponds to the prefix and suffix of the x-mer. The edges are directed from the prefix to the suffix, and can also be weighted to represent the number of reads that support each edge. There are two variations of this graph representation, the node-centric and the edge-centric.

  o Node-centric: The nodes represent the k-mer substrings and the edges are the (k-1) overlap between two nodes (68).

  o Edge-centric: Every node is a (k-1) substring and an edge connects two nodes if their (k-1)-mers are consecutive in some read (69).

  Both definitions are equivalent: a node-centric de Bruijn graph for a k-mer is identical to an edge-centric de Bruijn graph for a (k+1)-mer (70).

• Find the Eulerian path: Instead of passing through every node exactly once, as in the Hamiltonian path, the Eulerian approach looks for a path that contains every edge once. As long as all the possible k-mers of the genome are present in the graph, there will be an Eulerian path. In theory, an Eulerian path can connect all the k-mers into a complete genome or chromosome. In practice, the Eulerian path will be interrupted many times because of repetitive and duplicated regions (55).

Figure 3.1.2.2: Example of a De Bruijn graph with an Eulerian path. A small circular example sequence (ATGGCGTGCA) has been split up into k-mers with a length (k) of 3 bp. The edge-centric graph is constructed by connecting the prefix (k-1) of each k-mer to the suffix (k-1) of that k-mer. With this approach, each edge represents a k-mer. The Eulerian path is found by passing through each edge once, which leads to a complete reconstructed sequence (71).
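The three steps described above can be sketched in a few lines of Python, using the circular example sequence of Figure 3.1.2.2 with k = 3. This is a minimal illustration and not the implementation of any particular assembler: the function names are hypothetical, the k-mers are taken directly from the sequence instead of from reads, and Hierholzer's algorithm is assumed here as the way to find the Eulerian circuit.

```python
from collections import defaultdict


def circular_kmers(seq, k):
    """Return all k-mers of a circular sequence, one starting at each position."""
    extended = seq + seq[:k - 1]  # wrap around the origin
    return [extended[i:i + k] for i in range(len(seq))]


def build_edge_centric_graph(kmers):
    """Nodes are (k-1)-mers; every k-mer is an edge from its prefix to its suffix."""
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_circuit(graph, start):
    """Hierholzer's algorithm: use every edge exactly once and return the node path."""
    remaining = {node: list(targets) for node, targets in graph.items()}
    stack, circuit = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())  # follow an unused edge
        else:
            circuit.append(stack.pop())  # no unused edges left at this node
    return circuit[::-1]


def spell_circular(circuit):
    """Spell the circular sequence: first base of every node except the repeated last one."""
    return "".join(node[0] for node in circuit[:-1])


genome = "ATGGCGTGCA"
kmers = circular_kmers(genome, 3)
graph = build_edge_centric_graph(kmers)
circuit = eulerian_circuit(graph, start=kmers[0][:-1])
reconstructed = spell_circular(circuit)
print(reconstructed)
```

The reconstructed sequence contains exactly the same k-mers as the input, but it is not guaranteed to be identical to it: the (k-1)-mers TG and GC each occur twice in this example, so more than one Eulerian circuit exists. This is a small-scale version of the ambiguity that repeats cause in real assemblies.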

Choosing the best k-mer size for an assembly has always been a difficult question to answer. Longer k-mers can improve the assembly of repetitive regions, but require more memory and are more sensitive to sequencing errors. With the increasing memory efficiency of assembly algorithms and longer read lengths, the search for the optimal k-mer size is becoming more and more important. Some assemblers are now capable of using multiple k-mer sizes during the assembly instead of a fixed k. IDBA-UD (72) starts with a small k-mer, removes the reads that are already assembled and moves on to a larger k-mer with the remaining reads. SPAdes (73) also conducts multiple assemblies with different k-mers.
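The effect of k on repetitive regions can be made concrete with a toy example. The sketch below counts the nodes of an edge-centric graph that have more than one distinct successor, which is where an assembler can no longer extend unambiguously. The sequence is invented for illustration and the function name is hypothetical; real assemblers work on reads with errors and uneven coverage, which this sketch ignores.

```python
from collections import defaultdict


def branching_nodes(seq, k):
    """Count (k-1)-mer nodes with more than one distinct successor in the graph of seq."""
    successors = defaultdict(set)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        successors[kmer[:-1]].add(kmer[1:])
    return sum(1 for targets in successors.values() if len(targets) > 1)


# Toy linear sequence: the 6 bp repeat GGATCC occurs twice, in different contexts.
toy = "AAGGATCCTTGGATCCGA"
for k in (4, 7, 8):
    print(k, branching_nodes(toy, k))
```

In this example the graph still branches at k = 7, because the nodes are 6-mers and the repeat itself becomes a node with two successors. Only when the nodes are longer than the repeat (k = 8) does the branch disappear, which is why longer k-mers help with repeats as long as read length, memory and error rate allow them.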

3.2 Assembly comparison

Many assemblers have been developed over time and there is no clear consensus on which software works best. Benchmarks published by software developers are often overfitted to their own tool and therefore not completely reliable. There have been a few efforts, like the Assemblathon (74,75), to provide a neutral comparison between assembly tools, but even then only a few genomes were tested and the last edition dates back to 2013.

When a new genome has been sequenced, it is hard to predict which assembly tool will perform best. Most de novo assembly projects will therefore try out multiple tools before selecting the software for the final assembly.

Many factors can influence the performance of an assembly tool:

• The NGS technology used to sequence:

Each technology has a different pattern of sequencing errors and different read lengths. Generally, string-based and OLC algorithms perform better on long-read data (PacBio, Nanopore or Sanger), while De Bruijn graph algorithms are better suited for short-read datasets with high coverage (Illumina) (76). Hybrid assemblers can combine different NGS datasets (MIRA, MaSuRCA (77)).

• The size of the sequenced genome:

Larger genomes generally make it harder to resolve a graph, especially OLC graphs, while string-based algorithms are too slow and require too much memory for large genomes. As long as Illumina reads are used to assemble large genomes, De Bruijn algorithms remain the only option (78).

• The level of heterozygosity and ploidy:

High heterozygosity makes any assembly much more complex and will increase the computational power needed to resolve the assembly graphs. Some assemblers were specifically designed for heterozygous genomes (Platanus (79), dipSPAdes (80)).

• The abundance of repetitive and duplicated regions:

This is the greatest obstacle in genome assembly, especially for short-read datasets. Some algorithms perform better than others, but when the repeat is longer than the read length, there is no way to resolve it. The only solutions are the use of long-read technologies or the design of mate pair libraries. Mate pairs are not easy to construct, but when the distance between the paired reads is large enough, most duplicated regions can be resolved (81).

• The available computational power:

CPU time and memory consumption can differ greatly between assemblers. OLC graphs consume a lot of memory with large datasets, but there are also significant differences between De Bruijn graph assemblers.

3.3 Organelle genome assembly

The majority of sequence assemblers are designed for whole genome sequencing (WGS) data and for assembly of the complete dataset. Although organelle genomes are the most frequently sequenced genomes after prokaryotic genomes (Figure 3.3), few assembly tools were developed for this purpose. The different strategies to separate organelle DNA from a sample are further discussed in chapter 4; this section is therefore restricted to organelle genome assembly from WGS datasets.


Figure 3.3: Organelle genomes in GenBank. (A) Total number of deposited genomes (15704) in GenBank as of 14 March 2016. (B) Annual deposition of mitochondrial and chloroplast genomes in GenBank since 2003. Statistics from the National Center for Biotechnology Information Genome Resources (https://www.ncbi.nlm.nih.gov/genome/browse/ (13 March 2016, date last accessed)).

The reduced costs of NGS technologies have made WGS the fastest and least laborious way to sequence DNA samples. Methods in which organelle DNA is first separated from the sample by capture protocols can lead to gaps in sequence coverage, which will not happen with WGS, as all the DNA of the sample is sequenced (82). The downside is that the majority of sequences are of nuclear origin and complicate the organelle assembly. Since organelle genomes are abundant in eukaryotic cells, their coverage will be far higher than the coverage of nuclear sequences, which means that a low-coverage dataset will ideally generate short nuclear contigs and one large organelle contig. This makes it easy to identify the organelle genome. However, when repetitive regions break up the genome into smaller contigs, it is not straightforward to merge them. This is particularly problematic because the inverted repeats of chloroplast genomes and homogeneous regions in mitochondrial genomes complicate the assembly (83). There are a number of algorithms that adapt existing nuclear genome assembly tools to organelle genome assembly (84,85). One of the strategies is to filter reads with high coverage; however, the coverage of GC-rich regions is reduced during sequencing, which would lead to their exclusion (84,85). The first and most widely used assembler that was specifically designed for mitochondrial assembly is MITObim (86). A few others followed, though their usage is limited to a relatively small number of assembly projects.
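The coverage argument made above can be illustrated with a short sketch that flags contigs whose mean coverage lies far above the median coverage of all contigs. The contig names, coverage values and the tenfold cut-off are assumptions chosen for illustration only; none of the cited tools is claimed to use this exact rule, and as noted above a strict coverage filter risks discarding regions with reduced coverage.

```python
from statistics import median


def candidate_organelle_contigs(coverage_by_contig, fold=10.0):
    """Return contigs whose mean coverage is at least `fold` times the median coverage."""
    baseline = median(coverage_by_contig.values())
    return sorted(
        name for name, cov in coverage_by_contig.items() if cov >= fold * baseline
    )


# Hypothetical mean coverage per contig from a low-coverage WGS assembly.
contigs = {"contig_1": 8.0, "contig_2": 10.0, "contig_3": 12.0, "contig_4": 950.0}
print(candidate_organelle_contigs(contigs))
```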

3.3.1 MITObim

MITObim is not an assembly algorithm in itself; instead, it uses the different modules of the MIRA assembler to reconstruct mitochondrial genomes from WGS data. It has a mapping mode for reference-guided assembly and a de novo mode that requires a seed sequence to initiate the assembly process. MITObim iteratively runs two MIRA modules: first the baiting module (MIRAbait) retrieves reads that align to the reference or seed, and then the main assembly module assembles the filtered reads. This process is repeated until the complete genome is reconstructed (Figure 3.3.1). MITObim was published in 2013 and has been the most popular tool for mitochondrial assembly since then, resulting in more than 500 citations.


Figure 3.3.1: MITObim workflow. The algorithm starts with mapping mitochondrial reads to the conserved regions of the mitochondrial reference. Those mapped reads are used to build an initial assembly, which is then used to bait new reads that overlap this assembly. This process is repeated until the mitochondrial genome is reconstructed (86).

3.3.2 Org.Asm: The ORGanelle ASeMbler

Org.Asm (87) is an organelle assembler developed by Eric Coissac in 2016 and specifically designed for assembly from WGS data. The assembly algorithm is based on a De Bruijn graph, with the difference that it takes into account the heightened coverage of the organelle sequences. The assembly is initiated by a set of protein sequences that function as seeds. Protein sequences are more conserved than their nucleic counterparts, which makes it possible to initiate the assembly with more distantly related sequences. An initial estimate of the coverage is made by aligning the reads to the set of seeds and is later refined after 15 kb has been assembled. This tool has been successfully used for several assembly projects (88-92).

3.3.3 Norgal

Norgal (de Novo ORGAneLle extractor) (93) is a pipeline for mitochondrial assembly that identifies a high-frequency subset of k-mers that are predominantly of mitochondrial origin and assembles this subset of reads de novo with IDBA-UD (72). The advantage of this method is that it does not require a reference or seed sequence to initiate the assembly. The benchmark in the Norgal publication compares Norgal against NOVOPlasty and MITObim for 5 mitochondrial genomes. NOVOPlasty produced the most accurate results overall, in a fraction of the time that Norgal requires. Norgal is also limited to mitochondrial assembly; for chloroplast datasets, the authors therefore suggest first running Norgal and then using the assembled contig as a seed for NOVOPlasty or MITObim.
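The idea of selecting reads through high-frequency k-mers can be sketched as follows. This is an illustration of the general principle only and not Norgal's code: the function name, the fixed count threshold and the minimum fraction of high-frequency k-mers per read are all assumptions, and reads are assumed to be at least k bases long.

```python
from collections import Counter


def kmer_list(seq, k):
    """All overlapping k-mers of a read, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def high_depth_reads(reads, k, min_count, min_fraction=0.5):
    """Keep reads in which at least `min_fraction` of the k-mers occur `min_count` times or more."""
    counts = Counter(kmer for read in reads for kmer in kmer_list(read, k))
    kept = []
    for read in reads:
        kmers = kmer_list(read, k)
        frequent = sum(counts[kmer] >= min_count for kmer in kmers)
        if frequent / len(kmers) >= min_fraction:
            kept.append(read)
    return kept


# Toy data: one high-copy read sampled five times, two single-copy reads.
reads = ["ATGGCGTGCA"] * 5 + ["TTTTTCCCCC", "GACTGACTAA"]
print(len(high_depth_reads(reads, k=5, min_count=3)))
```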

3.3.4 GetOrganelle

GetOrganelle (94) is a pipeline for de novo assembly of chloroplast genomes from whole genome data. This pipeline, which was published in 2018, uses a baiting strategy similar to that of MITObim: it maps reads to a reference or seed sequence to bait out organelle sequences. While MITObim has an assembly phase after each iteration of baiting, GetOrganelle uses the filtered reads as baits in the next iteration and only assembles once no more organelle reads can be extracted. Bowtie2 (95) is used to map reads to the bait sequences and SPAdes (73) to assemble them.
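The iterative baiting loop described above can be sketched with a toy example. To keep the example self-contained, read mapping with Bowtie2 is replaced here by a much simpler criterion, namely sharing at least one exact k-mer with the current bait set; the function names, the value of k and the reads are hypothetical.

```python
def kmer_set(seq, k):
    """Set of all overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def bait_reads(seed, reads, k=5):
    """Recruit reads iteratively until a round adds no new read."""
    bait_kmers = kmer_set(seed, k)
    recruited = set()
    while True:
        new_hits = [
            idx for idx, read in enumerate(reads)
            if idx not in recruited and kmer_set(read, k) & bait_kmers
        ]
        if not new_hits:
            break  # no more reads can be extracted
        for idx in new_hits:
            recruited.add(idx)
            bait_kmers |= kmer_set(reads[idx], k)  # recruited reads become baits
    return [reads[idx] for idx in sorted(recruited)]


# Four overlapping reads from one toy target sequence plus one unrelated read.
reads = ["CATTAGCCGA", "TTTTTCCCCC", "ATGGCGTGCA", "AGCCGATAGG", "CGTGCATTAG"]
print(bait_reads("GGCGTG", reads))
```

Only one of the reads overlaps the seed directly; the others are reached in later rounds through reads recruited earlier. The recruited reads would then be passed to an assembler in a single final step.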


3.3.5 Organelle_PBA

The arrival of long-read technologies can simplify the assembly of more complex organelle genomes. While the assembly tools above are all designed for short-read sequences, Organelle_PBA (96) is designed solely for PacBio technology. Organelle reads are filtered from the dataset by aligning them to the genome of a closely related species. Those reads are corrected and de novo assembled using Sprai (97), with an additional scaffolding step by SSPACE-LongRead (98). Like MITObim, Organelle_PBA is not an assembly algorithm in itself, but a pipeline that relies on existing tools. For the moment, Organelle_PBA is the only tool designed for organelle assembly with long-read technologies, although more can be expected to follow.


4 PROBLEM DESCRIPTION AND OBJECTIVES

Assembly algorithms have been developed in parallel with the evolution of sequencing technologies. The increasing throughput and decreasing price per Gb (21) have led to very large datasets that require De Bruijn graph algorithms to assemble the complete dataset within a reasonable time (76). As NGS data were mainly used for the assembly of the whole nuclear genome, most assembly algorithms were designed to assemble all reads as quickly and accurately as possible. Before 2008, the costs of NGS on whole genomic DNA extracts were too high to justify collecting such data solely for the assembly of organelle genomes. Nowadays, however, NGS costs have decreased substantially, and analysis of whole genome data has become the preferred strategy.

Despite the increasing use of WGS data for organelle genomes, developers of assembly algorithms did not focus on this specific application. Before the development of NOVOPlasty, MITObim was the only organelle assembler available and was used for a large portion of the organelle genomes deposited at NCBI. Although the algorithm was designed for assembling mitochondrial genomes and never optimized for the larger chloroplast genomes, it was also used for that purpose. These chloroplast genomes were often assembled by mapping reads to existing reference genomes, which bears the risk of misassembly through incorporation of sequences from the reference. Despite the proven usefulness of MITObim, there was a need for a better performing organelle assembler that could also assemble chloroplast genomes.

While MITObim uses an existing graph-based assembler (MIRA) in iterations, the goal of this work was to develop a completely new and independent assembler that could extend a seed sequence into the circular genome. The decreasing price of NGS data made skimming WGS data the preferred choice to obtain reads for assembling organelle genomes. For this reason, I decided to focus on WGS data from short-read sequencing technologies. In order to develop a tool that is as accurate and user-friendly as possible, it had to meet the following criteria:

1. A flexible seed input: It should be possible to start the assembly of an organelle genome with a seed from a distantly related species. This is especially important for chloroplast genome assembly, as the number of available chloroplast genomes is still limited.

2. Free of contamination: The assembler should only output contigs originating from the organelle genome that the user is targeting. Assembly tools that are not specifically designed for organelle genome assembly will predominantly output nuclear contigs. When the organelle genome assembly is divided over several contigs, it is not straightforward to reconstruct the complete genome.

3. Accessible for users with limited informatics experience: Many assembly tools or pipelines are difficult to use and require experience with Linux. To make it accessible to as many users as possible, a tool should be cross-platform and should limit the number of installations and dependencies.

4. Contiguous assembly: The goal is to output one circular sequence for as many datasets as possible. Complex genomic regions (low-complexity regions, repetitive sequences or duplicated segments) can cause an assembly to break up into multiple contigs. New strategies to resolve these regions should be incorporated in the assembly method, otherwise there would be no significant improvement compared to current assemblers.


Once a robust assembler is developed, it can be extended with extra features allowing the analysis of genomic variation. After genome assembly, users still have to use different software for variant calling, heteroplasmy detection, or genome annotation. One tool that includes all these features would make organelle research accessible to researchers with limited experience in bioinformatics.

The main goal of this thesis was therefore to create a user-friendly computer program specifically designed to assemble mitochondrial and chloroplast genomes from Illumina NGS data, and then to compare its efficiency and accuracy to those of other existing tools.

Finally, I developed the program further to uncover different variants of an organelle genome in a dataset, which can be used to study both heteroplasmy and nuclear copies of mitochondrial genome fragments (so-called numts). In particular, I used this tool to evaluate whether it was possible to differentiate heteroplasmic variants from numts.

4.1 Publications

Related to this thesis:

Dierckxsens N, Mardulyn P and Smits G. (2018) “Unravelling heteroplasmy patterns with NOVOPlasty.” Submitted.

Dierckxsens N, Mardulyn P and Smits G. (2017) “NOVOPlasty: de novo assembly of organelle genomes from whole genome data.” Nucleic Acids Research 45(4):e18. doi:10.1093/nar/gkw955.

Other projects:

Moretto M, Sonego P, Dierckxsens N, et al. (2016) “COLOMBOS v3.0: leveraging gene expression compendia for cross-species analyses.” Nucleic Acids Research 44 (Database issue): D620-D623. doi:10.1093/nar/gkv1251.


5 NOVOPLASTY

Thanks to the evolution in next-generation sequencing (NGS) technology, whole genome data can be readily obtained from a variety of samples. There are many algorithms available to assemble these reads, but few of them focus on assembling the extranuclear genomes. Therefore, we developed a seed-and-extend algorithm that assembles these circular genomes from whole genome sequencing (WGS) data, starting from a single seed sequence. The algorithm has been tested on several new (Gonioctena intermedia and Avicennia marina) and public (Arabidopsis thaliana and Oryza sativa) whole genome Illumina datasets and it always outperformed other assemblers in assembly accuracy and coverage. In our benchmark, NOVOPlasty assembled all the genomes in less than 30 minutes with a maximum memory requirement of 16 GB. NOVOPlasty is the only de novo assembler that provides a fast and straightforward manner to extract the extranuclear genomes from WGS data in one circular high quality contig. The seed-and-extend method is currently being adapted for the assembly of small genomes, metagenomes and local assembly of regions of interest.

The software is open source and can be downloaded at https://github.com/ndierckx/NOVOPlasty


5.1 Introduction

The circular genomes of chloroplasts and mitochondria are frequently targeted for de novo assembly. Both genomes are usually maternally inherited, have a conserved gene organization and are often used in phylogenetic and phylogeographic studies, or as a barcode in plant and food identification (99). Different in vitro strategies to isolate these genomes from the much larger nuclear chromosomes have been developed, but this task has proven to be particularly challenging. Before the development of NGS technology, organelle genome assembly was based on conventional primer walking strategies, using long-range PCR and cloning of PCR products, which are laborious and costly (100–102). NGS made it possible to develop novel strategies to construct the entire chloroplast or mitochondrial genome, thereby dramatically reducing time and costs compared to the more conventional methods. It is now affordable to obtain whole genome data in a short timespan by using genomic DNA extracted from whole cells (103). Besides nuclear sequences, a high copy number of extranuclear sequences will be present in the sample, usually around 5–10% of chloroplast DNA (104) and around 1–2% of mitochondrial DNA (105), which makes it possible to assemble both nuclear and extranuclear genomes from one simple experiment. Shallow sequencing of genomic DNA will result in comparatively deep sequencing of the high-copy fraction of the genome; this approach is called genome skimming. Although assembling the complete data set will generate contigs for the organelle genomes, it is also possible to first isolate the chloroplast or mitochondrial reads, and then assemble this subset. The best strategy depends on the data set, computational power and reference genome availability.
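The effect of genome skimming can be made concrete with a small calculation of the expected mean depth, taken as the number of sequenced bases belonging to a genome divided by the size of that genome. The chloroplast fraction of 5% is the lower bound given above; the number of reads, the 150 kb chloroplast genome and the 135 Mb nuclear genome are round, illustrative assumptions, and for simplicity all remaining reads are treated as nuclear.

```python
def expected_depth(n_reads, read_length, fraction, genome_size):
    """Expected mean depth for a genome that contributes `fraction` of all reads."""
    return n_reads * read_length * fraction / genome_size


# Hypothetical shallow data set: 2 million reads of 101 bp.
chloroplast = expected_depth(2_000_000, 101, 0.05, 150_000)
nuclear = expected_depth(2_000_000, 101, 0.95, 135_000_000)
print(f"chloroplast: {chloroplast:.1f}x, nuclear: {nuclear:.2f}x")
```

Under these assumptions the chloroplast genome is covered roughly 67 times, while the nuclear genome is covered less than twice, which illustrates why a shallow data set can be sufficient for organelle assembly.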

When a reference genome sequence is available from a closely related organism, a genome sequence can easily be assembled by mapping the reads to the reference (103). However, when the reference is too distant, the assembly will contain numerous mismatches. Reference assemblies generally require less computational time and virtual memory, but can only handle a limited number of differences between the targeted and reference genome to stay accurate. In many cases, a de novo assembly is the preferred strategy for an accurate assembly.


When sequence reads are obtained from a total DNA extract, there will be a large excess of reads from the nuclear genome. To reduce the runtime and computational resources needed for the assembly of the organelle genomes, which are several orders of magnitude smaller, it is suggested to work with a relatively low total number of reads (104). Since the copy number of organelle genomes is much higher than the copy number of the nuclear genome, a whole genome data set of low coverage is largely sufficient (106). One strategy often used to reduce the ratio of nuclear to organelle reads prior to the assembly consists of filtering the extranuclear sequences, either by keeping only regions of higher coverage or by mapping reads to a reference genome. Filtering by differential coverage will often result in the undesirable exclusion of regions of low or high GC content (see Figure 4.1), as many NGS systems perform less efficiently in these regions (105). Another option is to isolate plastid or mitochondrial DNA prior to sequencing by capturing these molecules with specific probes. However, many specific probes need to be designed to cover the complete organelle genome, so this approach is only recommended when many samples must be sequenced in parallel.

Figure 4.1: Coverage depth for a 12 000 bp long region of the mitochondrial genome of Gonioctena intermedia. There are several regions with a low GC content, resulting in a reduced read coverage.
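Regions that are at risk of reduced coverage can be located by computing the GC content along a sequence in windows. The sketch below is a generic illustration; the window and step sizes are arbitrary choices and the function names are hypothetical.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def gc_windows(seq, window, step):
    """Return (start, GC fraction) for each full window along the sequence."""
    return [
        (start, gc_fraction(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


print(gc_windows("AAAAGGGG", window=4, step=4))
```

Plotting these values next to the read depth of the same windows would show whether drops in coverage coincide with extreme GC content.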

With recent technological advances and cost reduction of shotgun sequencing, the most reliable and straightforward route to a complete assembly of an extranuclear genome would be to sequence a whole genome extract and use the complete data set for the assembly, a bioinformatic procedure that is not always straightforward with the tools currently available. Here, we present a novel algorithm, NOVOPlasty, specifically developed for the de novo assembly of mitochondrial and chloroplast genomes from whole genome data. We compared its performance with available software commonly used for organelle genome assembly, through the benchmarked assembly of new and reference mitochondrial and chloroplast genomes from multiple organisms.

5.2 Materials and Methods

5.2.1 Sequencing

All in-house non-human samples were sequenced on the Illumina HiSeq platform (101 bp or 126 bp paired-end reads). The human mitochondria samples (PCR-free) were sequenced on the Illumina HiSeqX platform (150 bp paired-end reads).

Public data sets of Arabidopsis thaliana and Oryza sativa were downloaded from the European Nucleotide Archive (http://www.ebi.ac.uk). Data sets SRR1174256 (A. thaliana), SRR1810277 (A. thaliana) and ERR477442 (O. sativa) were sequenced on the Illumina HiSeq 2000 platform and consisted of paired-end reads with a read length of 90 bp, 101 bp and 96 bp, respectively. Data sets DRX021298 (A. thaliana) and SRR1328237 (O. sativa) were sequenced on the Illumina HiSeq 2500 platform and consisted of paired-end reads with a read length of 150 bp and 151 bp, respectively. A total of 20% of data sets SRR1174256 and SRR1810277, and 8% of SRR1328237, were sub-sampled for the benchmarking study.

5.2.2 De novo assembly

All assemblies were executed on an Intel Xeon CPU machine containing 24 cores of 2.93 GHz and a total of 96.8 GB of RAM. Our program NOVOPlasty is written in Perl. In addition, four open-source assemblers (MITObim (86), MIRA (64), ARC (https://github.com/ibest/ARC) and SOAPdenovo2 (107)) and the pay-for-use CLC assembler (CLCbio, Aarhus, Denmark) were used on the same data, for comparison.

5.2.3 Quality assessment

To obtain a reliable quality assessment, we chose to benchmark the different tools with well-known model organisms. A human data set was used for a mitochondrial assembly and Arabidopsis thaliana and Oryza sativa for the chloroplast assembly. The latter two were executed in duplicate; their accession numbers in the European Nucleotide Archive are listed in the ‘Sequencing’ section. In addition to these data sets, one unknown mitochondrial genome (Gonioctena intermedia) and one unknown chloroplast genome (Avicennia marina) were included in the comparison. The mitochondrial genome of G. intermedia contains a highly repetitive section, which was useful to assess the performance on long repetitive regions. In this case, a reference genome was obtained by assembling long PacBio (Pacific Biosciences) reads together with short Illumina reads using the MIRA assembler. Some of the PacBio long reads cover the complete mitochondrial genome, which made it possible to assemble a reliable reference genome for the benchmark study. The NOVOPlasty assembly of the mitochondrial genome of G. intermedia was submitted to GenBank (KX922881). The other reference genomes were retrieved from the National Center for Biotechnology Information (NCBI) database (https://www.ncbi.nlm.nih.gov). GenBank entry AP000423.1 was selected for the A. thaliana assemblies, KM103369.1 for data set SRR1328237 of O. sativa, KM088022.1 for data set ERR477442 of O. sativa and X93334.1 for H. sapiens. Even though these references are very accurate, some variation between individual samples is expected. To detect these putative differences between our data sets and the reference used, we realigned all reads to each assembly with Bowtie2 (95) for visual inspection with Tablet (108). Each data set contained a small number of single nucleotide polymorphisms (SNPs), which were corrected in the respective reference genomes to acquire a perfect reference for our benchmark study. Visual proof of the SNPs justifies these corrections and can be examined in Appendix 1.

The different data sets were used for a benchmarking study comparing six assemblers, namely, MIRA, MITObim, SOAPdenovo2, CLC, NOVOPlasty and ARC. ARC was only used for the mitochondrial assemblies since it still relies on reference genomes and when a very close reference was lacking, the assemblies resulted in a lower genome coverage than with the de novo approach. All tested assemblers were evaluated for speed, memory efficiency, disk usage, genome coverage, assembly accuracy and number of contigs. Comparing speed and system requirements was straightforward, since each assembler ran on the same machine and made use of the same input data set.


The quality indicators were measured relative to the corresponding reference as mentioned above. The genome coverage represents the percentage of the reference genome that was assembled minus ambiguous nucleotides. The accuracy represents the percentage of correctly assembled nucleotides relative to the ‘perfect’ validated alignments. The highest possible scores (100%) for speed, memory efficiency, disk space, genome coverage, assembly accuracy and number of contigs were set to 0 min, 0 GB of RAM, 0 GB, 100%, 100% and 1 contig, respectively. The lowest score (0%) was always chosen close to the average of the assembler that performed the worst, to get a clear difference between the assemblers. All percentages were rounded off to two decimal digits. The absolute values for each assembly can be examined in Appendix 1.
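One way to turn an absolute measurement into such a percentage score is a linear rescaling between the value that corresponds to 100% and the value that corresponds to 0%. The text does not state the exact formula, so the linear form and the clamping to the 0–100 range are assumptions; the anchor values for the worst case in the example are invented.

```python
def normalised_score(value, best, worst):
    """Linearly rescale `value` so that `best` maps to 100 and `worst` to 0."""
    score = 100 * (worst - value) / (worst - best)
    return round(max(0.0, min(100.0, score)), 2)  # clamp, then round to two decimals


# Hypothetical examples: runtime in minutes, genome coverage in percent, contig count.
print(normalised_score(15, best=0, worst=60))
print(normalised_score(98, best=100, worst=50))
print(normalised_score(1, best=1, worst=5))
```

The same function works for indicators where lower is better (runtime, memory, contigs) and for those where higher is better (coverage, accuracy), because only the positions of the best and worst anchor values change.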

5.2.4 NOVOPlasty algorithm

NOVOPlasty is a seed-and-extend assembler, similar to string overlap algorithms like SSAKE (56) and VCAKE (57). It starts by storing the sequences in a hash table, which allows quick access to the reads (Figure 4.2.4a). The assembly has to be initiated by a seed, which is iteratively extended bidirectionally. The supplied seed sequence itself is not used as the starting point of the assembly, but to retrieve one sequence read of the targeted genome from the NGS data set, which is then extended. This strategy can handle a wider range of seed inputs without incorporating mismatches into the assembly. The seed sequence can be one sequence read, a conserved gene or even a complete organelle genome from a distant species. The end and start of the seed are scanned for overlapping reads in the hash table and stored separately. All putative extensions are identified and subsequently crosschecked with the paired reads to verify that they are positioned correctly. Relatively similar sequences are grouped together and every base extension is resolved by a consensus between the overlapping reads. When there is more than one possible consensus extension (i.e. more than one group of sufficient size), the assembly splits and two new contigs are created. Unlike most assemblers, NOVOPlasty does not try to assemble every read, but extends the given seed until the circular genome is formed. The assembly circularizes when the length is in the expected range and both ends overlap by at least 200 bp. When a repetitive region is detected, the
