
GENOME ASSEMBLY OF ARGANIA SPINOSA


Academic year: 2021


Kingdom of Morocco

Ministry of National Education, Vocational Training, Higher Education and Research

MOHAMMED V UNIVERSITY

FACULTY OF MEDICINE AND PHARMACY OF RABAT

MASTER'S THESIS

MASTER IN MEDICAL BIOTECHNOLOGY, OPTION: BIOMEDICAL

Topic

Genome Assembly of Argania spinosa

Presented by:

Abdellah IDRISSI AZAMI

Supervised by:

Pr. Hassan GHAZAL

Chair: Azeddine IBRAHIMI, Professor, Faculty of Medicine and Pharmacy, UMV, Rabat
Supervisor: Hassan GHAZAL, Professor, Centre National Pour la Recherche Scientifique, CNRST
Examiner: Abdelhamid EL MOUSSADIK, Professor, Faculty of Sciences, Ibn Zohr University, Agadir
Examiner: Noureddine HAMAMOUCH, Professor, Faculty of Sciences, UMV, Rabat
Examiner: Fatima GABOUN, Doctor, Institut National de la Recherche Agronomique, Rabat

Class of: November 2020


Abstract

Argania spinosa is a plant endemic to mid-western Morocco. Its oil is considered one of the most expensive oils in the world thanks to its nutritional value and its medical and therapeutic effects. A genomic study of the Argane tree would facilitate the discovery of the biosynthesis pathways of the oil and of other metabolites, which would help in extracting, purifying, and improving the quality of those metabolites. To this end, the whole genome of the Argane tree was sequenced using the Illumina HiSeq Xten and PacBio RS II technologies, the genome size was estimated from the K-mer distribution, and the genome was then assembled following short-read, long-read, and hybrid assembly strategies with three assemblers, namely Canu, SOAPdenovo 2, and MaSuRCA. The hybrid assembly gave the largest assembled sequence size (423 Mb) and the highest N50 value (40 kb) among the three assembly approaches. The assembled genome size was lower than the estimated genome size (630-713 Mb), which might be due to the high heterozygosity of the plant and to the abundance of repetitive regions in the genome. To predict the structure and function of the genes in the Argane genome, an ab initio gene prediction was performed, which resulted in the prediction of 115 genes, with 5128 introns and 6073 exons. This is very low compared to the expected 30,000-40,000 genes, which is due to the low quality of the input assembly. Seven genes were predicted to be complete and were annotated by homology search with BLASTp. Three of them code for photosystems and RuBisCO, the two complexes that drive the light and dark phases of photosynthesis, two encode ribosomal proteins, one codes for RNA polymerase, and the seventh encodes γ-tocopherol methyltransferase, which is involved in the biosynthesis of vitamin E. The mitochondrial genome assembly gave a circular scaffold of 245,326 bp that contains 21 genes, mainly involved in the oxidative phosphorylation process.
Towards full exploitation of the genome information for the discovery of oil biosynthesis pathways, we will enhance the genome assembly by using other assemblers, as well as by combining manual and automated assembly procedures.

Keywords: Argania spinosa, Morocco, Genome assembly, Assemblers, Annotation, Mitogenome, Genome Size estimate.


Summary

Argania spinosa is a plant endemic to central-western Morocco. Its oil is considered one of the most expensive oils in the world thanks to its nutritional value and its medical and therapeutic effects. A genomic study of the Argane tree would facilitate the discovery of metabolite biosynthesis pathways, which would help in extracting, purifying, and improving the quality of these metabolites, and subsequently in using them in the medical and pharmaceutical fields. In this study, the genomic DNA of this plant was sequenced using Illumina HiSeq Xten and PacBio RS II. The genome size of the Argane tree was estimated using the K-mer distribution method, and the sequencing depth was determined. The genome was then assembled following three approaches: short-read assembly, long-read assembly, and hybrid assembly. The assembly results show that the hybrid assembly is the most reliable, since it gives the largest assembled sequence size (423 Mb) with the highest N50 value (40 kb). However, the assembled sequence size is still much smaller than the estimated size (630-713 Mb). This may be due to the heterozygosity of the plant and to the abundance of repetitive regions. To predict the structure and functions of the genes in the different genomic regions of the plant, an ab initio gene prediction was carried out, which yielded 115 genes, totalling 5128 introns and 6073 exons. These results are very low compared with the expected number of genes (30,000-40,000 genes). This is due to the low quality of the assembly.
Only seven genes were predicted to be complete, and their homology-based annotation using BLASTp showed that they consist of: three genes that code for photosystems and RuBisCO, the two key complexes of the light and dark phases of photosynthesis; two genes that code for ribosomal proteins; one gene that codes for RNA polymerase; and a last gene that codes for γ-tocopherol methyltransferase, an enzyme involved in the vitamin E biosynthesis pathway. The assembly of the mitochondrial DNA gave a single circular sequence of 245,326 bp; its annotation revealed the presence of 21 genes, all of which are involved in the oxidative phosphorylation process. To discover the Argane oil biosynthesis pathway using genomic data, the assembly must be improved by using other assemblers and by performing a combined manual and automated hybrid assembly.

Keywords: Argania spinosa, Assembly, Genome, Assemblers, Annotation, Mitogenomics, Genome size estimation.


Acknowledgement

First and foremost, I would like to express my deepest gratitude to God for His continuous support.

I would like to thank Professor Hassan GHAZAL most sincerely for his devotion to work, his focus, and the considerable effort he put into helping me assimilate the material. Equally important were his indulgence, support, and encouragement, which pushed me to challenge myself and give the utmost of what I can.

I am also grateful to Professor Azeddine IBRAHIMI, who gave me the opportunity to attend the Master of Biotechnology, which allowed me to learn more and improve my Biology and Bioinformatics skills.

My deepest regards go to our committee members, Professor Abdelhamid EL MOUSSADIK, Professor Noureddine HAMAMOUCH, and Doctor Fatima GABOUN, for their time and endeavours throughout the review process.

Sincere thanks to Doctor Bouchra CHAOUNI for teaching me advanced Bioinformatics tools and methods, and for guiding, helping, and supporting me during this thesis study.


Dedication

I dedicate this project to my close friends Sofia and Nihal, and to all my family members, mainly my older and younger brothers. I am indebted to you for your support and love.

I am also grateful to all my friends who provided their help each time I needed it.


List of figures

Figure 1: Photo of Argane tree whose genome has been sequenced
Figure 2: FASTQ file format
Figure 3: Genome Assembly algorithms
Figure 4: Multi-FASTA file format representation of an NGS output sequence
Figure 5: Example of Genome Annotation GFF file format
Figure 6: Genome size variation across plants and other organisms
Figure 7: Genome size and redundancy variation across plants and other organisms
Figure 8: Photo of the Argane AMGHAR shrub that provided DNA for sequencing
Figure 9: Quality profile of the short-reads library
Figure 10: Quality profile of the long-reads library
Figure 11: K-mer distribution profile for the K-mer size of 21 using the DSK tool
Figure 12: K-mer distribution profile for the K-mer size of 79


List of Tables

Table 1: Comparison of sequencing platforms
Table 2: Summary of Argane libraries generated by sequencers
Table 3: Summary of Argane genome assembly results using Canu, SOAPdenovo 2, and MaSuRCA
Table 4: Predicted functions of the 7 genes predicted as complete on the MaSuRCA Argane genome assembly
Table 5: Mitochondrial genome size of some plants


List of Abbreviations

CVD: Cardiovascular Disease
DBG: De Bruijn Graph
ESTs: Expressed Sequence Tags
GFF: General Feature Format
HDL: High-Density Lipoproteins
LDL: Low-Density Lipoproteins
LINEs: Long Interspersed Nuclear Elements
MRSA: Methicillin-Resistant Staphylococcus aureus
NGS: Next-Generation Sequencing
OLC: Overlap Layout Consensus
SINEs: Short Interspersed Nuclear Elements
VAO: Virgin Argane Oil


Table of Contents

Introduction
Part A: Bibliography
1. The Biology of the plant
2. Argane Oil
3. Medicinal and Pharmaceutical potential of Argane oil
4. Overview about genomics
4.1. DNA sequencing
4.2. Genome assembly
4.3. Genome annotation
5. Plant genome assembly challenges
6. Aim of the study
Part B: Material and Methods
1. Plant material
2. DNA extraction and Sequencing
3. Data preprocessing
4. K-mer distribution analysis
5. Genome assembly
6. Gene prediction
7. Mitogenome assembly
8. Mitogenome annotation
Part C: Results and Discussion
1. Results
1.1. Raw data quality
1.2. K-mer distribution and genome size estimation
1.3. Genome assembly
1.4. Gene prediction
1.5. Mitogenome
2. Discussion
Conclusion and perspectives


Introduction

Argania spinosa is a plant endemic to the Middle West of Morocco [1,2]. It is considered one of the three main biosphere reserves in Morocco [1]. The Argane forest occupies an area of 8280 km² [3], mainly in the dry lowlands of the Souss valley and in the sunny mountains of the Anti-Atlas [3]. The Argane tree grows very slowly [3], but it can live for more than 200 years and can survive long drought periods thanks to its deep root system [3]. The Argania spinosa tree can be shrubby or grow to a height of 7 m to 10 m; it has a trunk diameter of up to 100 cm [4], yellow-green leaves, and round green fruits [4].

Argane oil is considered one of the rarest and most expensive oils in the world [5]. It is extracted from the kernels contained in the nut-sized fruit of the Argane tree [3,5]. At the national level, Argane oil production represents 1.6% of the Moroccan consumption of edible oils, and 9% of the national oil production [5].

Argane oil has been used as a natural remedy in herbal medicine for many decades [3]. It is traditionally used by women for body care, massaging, and cooking [5]. This oil has a high percentage of unsaturated fatty acids, which gives it a high nutritional quality [3]. Tocopherols are the main antioxidants in Argane oil, and they have an anti-cancer effect [3].

Argane oil has been recognized for its various medicinal and therapeutic effects. It has many remarkable properties, including the restoration of the skin water-lipid layer, an increase in the nutrients of skin cells, intracellular oxygen stimulation, free radical neutralization, and defense of the connective tissue [5]. Several studies revealed the benefits of Argane oil in the prevention of many health disorders, such as coronary disease and atherosclerosis [3], cardiovascular disease [4], and cancer [6]. This preventive effect is due to its antioxidant activity [6]. It also has an antidiabetic effect [7] and an antibiotic effect against Methicillin-Resistant Staphylococcus aureus (MRSA) [8], in addition to its ability to protect hair [9].

The term “Genomics” refers to the study of all the genes of an organism. The discovery of DNA sequencing methods by Sanger and Gilbert in the late 1970s was the starting event of the genomics era, but its main driving force was the launch of the Human Genome Project in 1990 [11], which led to the first use of shotgun sequencing [11]. Genome assembly consists of reconstructing the sequence composition of the sequenced DNA of an organism [12]. This process mainly uses two graph algorithms, De Bruijn Graph (DBG) and Overlap-Layout-Consensus (OLC), which require high computational resources [11].

Genome annotation stands for giving a meaning to each part of the genome. Structural annotation consists of delineating and demarcating genes and regulatory regions in a genome, while functional annotation consists of assigning a function to structural elements [13]. Plant genomes are hard to process because of their complexity. This complexity is caused mainly by four factors: (i) large genome size, (ii) polyploidy, (iii) heterozygosity, and (iv) a high rate of redundancy [14,15].

The aim of this study is to assemble the genome of Argania spinosa, which will help in identifying the genes involved in the biosynthesis of Argane oil and other important metabolites. This will help in extracting and purifying those metabolites, and even in improving their production, in order to use them for medical and pharmaceutical purposes.

Part A: Bibliography

1. The Biology of the plant

Argania spinosa, or the Argane tree (NCBI taxid: 85884), is a xerothermophilic tree endemic to the Middle West of Morocco [1,2]. It is not only considered one of the three main biosphere reserves in Morocco [1], but also an 80-million-year-old relic tree species that has been known since the time of the Phoenicians [3]. It is widely agreed that all Argania trees vanished from Northern Africa after the Quaternary glaciations, but persisted in the Souss valley, where optimal conditions for the trees' survival continued [3]. Argania spinosa remains the only species of the Sapotaceae family still existing in the subtropical climate. The Argania forest occupies roughly 8280 km² [3], mainly in the dry lowlands of the Souss valley and in the sunny mountains of the Anti-Atlas [3]. The Argane tree grows very slowly and takes 15 years to mature, but it is extremely resistant [3]. Argane trees can live for 150 years, and sometimes more than 200 years, and they can survive long drought periods because of their deep root system [3]. The Argane tree can be shrubby or grow to a height of 7 m to 10 m, with a trunk diameter of up to 100 cm [4]. It has yellow-green leaves and round green fruits [4] (Figure 1).


Figure 1: Photo of Argane tree whose genome has been sequenced (Wikipedia). Legend: A: the Argane tree; B: the Argane fruit.

2. Argane Oil

Argane oil is extracted from the kernels contained in the nut-sized fruit produced by the Argane tree [3,5]. This oil is considered one of the rarest and most expensive oils in the world [5]. Argane oil represents 25% of the daily intake of fatty substances [5]. At the national level, Morocco annually produces 3000 to 4000 tons of Argane oil, which represents 1.6% of the Moroccan consumption of edible oils, and 9% of the national oil production [5].

Argane oil has a high percentage of unsaturated fatty acids, which gives it a high nutritional quality [3]. It is relatively stable during storage and frying because of its natural antioxidants, mainly tocopherols [3]. Virgin Argane oil (VAO) is characterized by a high content of antioxidants, mono- and poly-unsaturated fatty acids, and phenolic compounds; together with its cosmetic applications, this gives the oil a tangible significance as a renewable resource of high social and economic value [4]. Argane oil has been used as a natural remedy in herbal medicine for many decades [3]. It is traditionally used by south Moroccan women for nail protection, hair care and skincare, massaging, and cooking [5].

3. Medicinal and Pharmaceutical potential of Argane oil

Argane oil has been recognized for its medicinal and therapeutic effects. It has many remarkable properties, including (i) the restoration of the skin water-lipid layer, (ii) an increase in the nutrients of skin cells, (iii) intracellular oxygen stimulation, (iv) free radical neutralization, and (v) defense of the connective tissue [5]. Several experiments revealed the benefits of Argane oil in coronary disease prevention and protection from atherosclerosis through a number of biological mechanisms [3].

The phenolic extracts of virgin Argane oil (VAO-PE) were the subject of a study by Berrougui et al. [4] aiming to reveal their protective action against cardiovascular disease (CVD). This study revealed that VAO-PE could significantly (i) increase the fluidity of the High-Density Lipoprotein (HDL, “good” cholesterol) phospholipid bilayer, (ii) reduce the disappearance of vitamin E, (iii) prolong the lag phase and reduce the progression rate of lipid peroxidation, (iv) inhibit the oxidation of Low-Density Lipoproteins (LDL, “bad” cholesterol), and (v) enhance reverse cholesterol transport, which helps to prevent cardiovascular diseases [4].

Bnouham et al. [7] studied the antidiabetic effect of Argane oil in healthy rats and in rats with induced diabetes, and showed that Argane oil significantly reduces glycemia and improves body mass [7].

Faria et al. [9] analysed the hair-protective effect of commercial conditioner agents mixed with Argane oil and/or Theobroma grandiflorum seed butter on Caucasian hair treated with hair dye. The results showed that using Argane oil with a conditioner agent after hair dyeing reduces the loss of proteins, which reduces the damage caused by the hair dye [9].

Methicillin-Resistant Staphylococcus aureus (MRSA) strains carry the mecA gene, which makes them resistant to the antibiotic methicillin [8]. The standard drugs used to treat MRSA infections are vancomycin and teicoplanin, which have renal toxicity [8]. A study by Naher et al. [8] aimed to reveal the effect of Argane oil in the treatment of MRSA infection by isolating 20 MRSA isolates and treating them with a mixture of Argane oil and H2O2. The results of this study revealed the ability of the mixture to inhibit the growth of MRSA isolates by 80% and to produce an inhibition zone similar to that of teicoplanin [8].


Furthermore, Argane oil has anti-cancer properties. It shares a similar composition with olive oil, hence the cancer chemoprotective effect attributed to olive oil has also been attributed to Argane oil. The high levels of γ-tocopherol and the high squalene content of Argane oil have even led to the suggestion that its chemoprotective effect may be greater [6].

Antioxidants present in Argane oil delay the onset of reactive oxygen species after lipid peroxidation [6]. Specific investigations on prostatic cells have shown that, in vitro, Argane oil polyphenols and sterols have cytotoxic properties and exert an inhibitory effect on the proliferation of hormone-independent (DU145 and PC3) as well as hormone-dependent (LNCaP) prostate cancer cell lines used in therapeutic research [6].

4. Overview about genomics

The term “genome” refers to all the DNA molecules in the cell of an organism [12], while “omics” refers to studies that work collectively on the quantification and characterization of many biological molecules, in order to study the structure, functions, and dynamics of an organism [10]. Genomics is the study of all the genes of an organism, in addition to their interactions with each other and with the organism's environment [10]. The actual beginning of the genomics era was the discovery of DNA sequencing methods by Sanger and Gilbert in the late 1970s, but the main driving force of this field was the launch of the Human Genome Project in 1990 [11]. This project led John Craig Venter and his colleagues to use shotgun sequencing for the first time [11], which is based on breaking the whole genome into small fragments of DNA [16]. This makes sequencing easier, but it complicates the assembly of the resulting small fragments, or reads, into the original genome sequence [11].

4.1. DNA sequencing

DNA sequencing stands for the chemical identification of the order of bases in a polynucleotide strand. Walter Gilbert developed, with Allan Maxam, a chemical cleavage method for identifying the order of the nucleotides in a polynucleotide strand, while Sanger developed another method based on chain termination (Table 1) [16]. Both the Gilbert and the Sanger sequencing methods had disadvantages, namely long run times, limited accuracy, a large number of reaction mixtures, and, above all, a limited sequence length [16]. Other advanced techniques, mainly shotgun sequencing methods, have been developed in order to overcome the limits of the conventional techniques. The idea was first introduced by John C. Venter in the 1990s [11], but it was in 2005 that 454 Life Science (Branford, Connecticut, United States; shut down by Roche in 2013) released the first genome sequencer based on shotgun sequencing [17]. In the same decade, two other companies, Illumina (San Diego, California) and Applied Biosystems (Foster City, California), released their genome sequencers. This was the beginning of the Next-Generation Sequencing (NGS) era [16].

NGS sequencers are based on breaking the target DNA sequence into small fragments, called reads, and then sequencing all of them at once. Bases can be identified either by photo identification (as in Pyrosequencing and Sequencing by Synthesis) or by chemical identification (as in Sequencing by Ligation and Ion Semiconductor Sequencing) (Table 1) [17]. Short-read technologies also have limitations, such as GC bias, difficulties in mapping to repetitive elements, trouble discriminating paralogous sequences, and difficulties in phasing alleles. Long-read single-molecule sequencers resolve these obstacles. Moreover, they offer higher consensus accuracies and can detect epigenetic modifications from native DNA (Table 1) [18].

Both short-read sequencers and long-read Single-Molecule Real-Time (SMRT) sequencers store their output sequences in files with the “FASTQ” extension. This is a text file format that stores data in sets of four lines: the first line contains information about the read, the second contains the sequence, the third is a separator, and the fourth contains the sequencing quality of each base in the read sequence, encoded as ASCII characters (Figure 2) [19].


Table 1: Comparison of sequencing platforms [20]

Sequencing Platform | Amplification Method | Sequencing Method | Read Length (bp) | Error Rate (%) | Number of Reads Per Run | Time Per Run (Hours) | Cost Per Million Bases (USD)
Sanger | PCR | Dideoxy chain termination | 600-1000 | 0.001 | 96 | 0.5-3 | 500
Ion Torrent | PCR | Polymerase synthesis | 200 | 1 | 8.2 x 10^7 | 2-4 | 0.10
454 Roche GS | PCR | Pyrosequencing | 700 | 1 | 1 x 10^6 | 23 | 8.57
Illumina | PCR | Synthesis | 2 x 125 | 0.1 | 8 x 10^9 (paired) | 7-60 | 0.03
SOLiD | PCR | Ligation | 2 x 60 | 5 | 8 x 10^8 | 144 | 0.11
PacBio RS II | Real-time single molecule template | Synthesis | ~10,000-15,000 | 13 | 3.5-7.5 x 10^4 | 0.5-4 | 0.40-0.80
Oxford Nanopore MinION | None | Nanopore | ~2000-5000 | 38 | 1.1-4.7 x 10^4 | 50 | 6.44-17.90


Figure 2: FASTQ file format [19]

Legend: Each group of four lines represents one read. The first line of each read contains the read information and starts with @, the second line contains the sequence, the third line is a separator, and the fourth line contains the quality of each base in ASCII format.
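The four-line record structure described above can be illustrated with a short Python sketch. It is a minimal example, not the tool used in this study: the function name `parse_fastq` and the example read are hypothetical, and the quality line is assumed to use the Phred+33 encoding (quality score = ASCII code minus 33), which should be checked for any given library.

```python
from typing import Iterator, List, Tuple


def parse_fastq(lines: List[str]) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (read_id, sequence, phred_scores) from four-line FASTQ records."""
    for i in range(0, len(lines) - 3, 4):
        header, sequence, separator, quality = (l.strip() for l in lines[i:i + 4])
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record at line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch at line {i + 1}")
        # Assumption: Phred+33 encoding, i.e. score = ASCII code - 33.
        scores = [ord(ch) - 33 for ch in quality]
        yield header[1:], sequence, scores


# Hypothetical single-read example (not taken from the Argane libraries).
example = [
    "@read_1 hypothetical example",
    "ACGTN",
    "+",
    "IIII#",
]
records = list(parse_fastq(example))
for read_id, sequence, scores in records:
    print(read_id, sequence, scores)
```

In this example the character `I` decodes to a score of 40 and `#` to a score of 2, which is the kind of per-base information summarized by quality-control tools.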

4.2. Genome assembly

Genome assembly is a computational process that aims to reconstruct the sequence composition of the DNA of an organism [12]. This process goes through three steps: first, assembling reads into bigger sequences called contigs; second, joining contigs into scaffolds; and third, constructing chromosomes from scaffolds (Figure 3) [12]. The genome assembly process uses graph algorithms, namely (i) Overlap Layout Consensus (OLC), based on finding overlaps between reads and then joining them, depending on the sequencing depth, in order to construct contigs and scaffolds, and (ii) De Bruijn Graph (DBG), based on K-mers; each K-mer is a word of a specific size made of nucleotides, the size of a K-mer always being lower than the size of the reads [12]. Instead of looking for overlaps between reads, DBG looks for overlaps between K-mers [12]. Since it uses K-mers instead of reads, DBG requires fewer computational resources than OLC, but it has a low accuracy for long-read assembly. Hence, DBG is usually used for short-read assembly, while OLC is used for long-read assembly [12].
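The idea behind the DBG approach can be illustrated with a toy Python sketch. It is a simplified illustration and not the algorithm implemented by any particular assembler: the function names and the three example reads are hypothetical, the reads are assumed to be error-free, and the graph uses the common formulation in which nodes are (K-1)-mers and each distinct K-mer is an edge.

```python
from collections import defaultdict
from typing import Dict, List


def build_de_bruijn(reads: List[str], k: int) -> Dict[str, List[str]]:
    """Map each (k-1)-mer prefix to the (k-1)-mer suffixes that follow it."""
    graph: Dict[str, List[str]] = defaultdict(list)
    seen = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in seen:  # each distinct K-mer contributes one edge
                continue
            seen.add(kmer)
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


def walk_unambiguous(graph: Dict[str, List[str]], start: str) -> str:
    """Extend a contig while the current node has exactly one successor."""
    contig = start
    node = start
    visited = {start}
    while len(graph.get(node, [])) == 1:
        node = graph[node][0]
        if node in visited:  # stop on a cycle (e.g. a repeat)
            break
        visited.add(node)
        contig += node[-1]
    return contig


# Hypothetical overlapping reads drawn from the sequence ATGGCGTGCA.
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = build_de_bruijn(reads, k=4)
contig = walk_unambiguous(graph, "ATG")
print(contig)
```

On real data, sequencing errors, repeats, and heterozygous sites create branches and bubbles in this graph, which is where the walk above would stop and where assemblers need additional heuristics.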

The hybrid assembly approach is based on combining long reads and short reads. It first assembles short reads using DBG to create mega reads, then assembles long reads using OLC, and the final step is to map the mega reads to the assembled long reads. The good quality of short reads and the large size of (assembled) long reads make this approach the most accurate assembly approach [21]. The output of the assembly step is a “FASTA” (or Multi-FASTA) file. This is a text file format that contains one or more headers with information about the sequences, followed by the sequences themselves (Figure 4) [22]. The evaluation of assembly quality requires measuring a set of parameters: (i) the N50, which is the length of the shortest contig/scaffold among the longest contigs/scaffolds that together cover at least half of the total assembly size; the higher the N50, the better the assembly quality; (ii) the size of the largest contig/scaffold; and (iii) the number of contigs/scaffolds, which should be close to the number of chromosomes in the target organism [12].
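As an illustration of how these metrics are obtained from an assembly file, the following Python sketch reads a Multi-FASTA text and computes the N50. The function names and the four toy contigs are hypothetical and are not results of this study.

```python
from typing import Dict, List


def parse_fasta(text: str) -> Dict[str, str]:
    """Return {sequence_name: sequence} from Multi-FASTA text."""
    parts: Dict[str, List[str]] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # keep the first word of the header
            parts[name] = []
        elif name is not None:
            parts[name].append(line)
    return {n: "".join(chunks) for n, chunks in parts.items()}


def n50(lengths: List[int]) -> int:
    """Length L such that sequences of length >= L cover half of the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


# Hypothetical toy assembly with four contigs of 8, 6, 4 and 2 bases.
assembly = """>contig_1
ACGTACGT
>contig_2
ACGTAC
>contig_3
ACGT
>contig_4
AC
"""
lengths = [len(seq) for seq in parse_fasta(assembly).values()]
print("contigs:", len(lengths), "largest:", max(lengths), "N50:", n50(lengths))
```

Here the total size is 20 bases and the two longest contigs (8 + 6 = 14 bases) already cover half of it, so the N50 is 6, whereas the median contig length would be 5.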

Figure 3: Genome Assembly algorithms [12]

Legend: The gray rectangle at the top represents the target genome and the arrows represent reads; vertices in OLC represent reads, while vertices in DBG represent K-mers.


4.3. Genome annotation

Genome annotation stands for giving a meaning to each part of the genome. It has two components: (i) structural annotation, which consists of delineating and demarcating genes and regulatory regions in a genome, and (ii) functional annotation, which consists of assigning a function to structural elements [13]. The annotation of eukaryotic genomes follows a precise series of steps. The first step is repeat identification; the term repeat refers to two different types of sequences: (i) low-complexity sequences, such as homopolymeric runs of nucleotides, and (ii) transposable elements, such as Long Interspersed Nuclear Elements (LINEs) and Short Interspersed Nuclear Elements (SINEs) [23]. The second step consists of an evidence alignment of the genome to databases of previously identified proteins, RNAs, and Expressed Sequence Tags (ESTs) [23]. The third step consists of gene prediction; this can be done either by ab initio prediction, which uses mathematical models to identify Coding Sequences (CDS), introns, and exons, or by evidence-driven gene prediction, which relies on external evidence such as protein and EST alignments [23]. The annotation output is often a “GFF” file (General Feature Format). It is a text file format divided into 9 tab-separated columns; it contains meta-information header lines and, for each feature, the sequence on which it lies, its type (exon, CDS, intron …), its start and end positions, the strand that carries it, and a column of attributes (Figure 5) [24].


Figure 5: Example of Genome Annotation GFF file format [25]
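The nine-column layout of a GFF file can be illustrated with the following Python sketch. It is a minimal reader written for illustration only: the `Feature` type, the `parse_gff` function, and the three example rows (with GFF3-style `key=value` attributes) are hypothetical and do not come from the Argane annotation.

```python
from typing import Dict, List, NamedTuple


class Feature(NamedTuple):
    seqid: str
    source: str
    feature_type: str
    start: int
    end: int
    score: str
    strand: str
    phase: str
    attributes: Dict[str, str]


def parse_gff(text: str) -> List[Feature]:
    """Parse tab-separated GFF lines, skipping headers and comments (#)."""
    features = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            raise ValueError(f"expected 9 columns, found {len(cols)}")
        # Assumption: GFF3-style attributes written as key=value;key=value.
        attrs = dict(item.split("=", 1) for item in cols[8].split(";") if "=" in item)
        features.append(Feature(cols[0], cols[1], cols[2], int(cols[3]),
                                int(cols[4]), cols[5], cols[6], cols[7], attrs))
    return features


# Hypothetical gene with two exons on a scaffold named scaffold_1.
rows = [
    ["scaffold_1", "AUGUSTUS", "gene", "100", "900", ".", "+", ".", "ID=g1"],
    ["scaffold_1", "AUGUSTUS", "exon", "100", "400", ".", "+", ".", "ID=g1.e1;Parent=g1"],
    ["scaffold_1", "AUGUSTUS", "exon", "600", "900", ".", "+", ".", "ID=g1.e2;Parent=g1"],
]
example = "##gff-version 3\n" + "\n".join("\t".join(r) for r in rows)
features = parse_gff(example)
# GFF coordinates are 1-based and inclusive, hence the +1.
exon_bases = sum(f.end - f.start + 1 for f in features if f.feature_type == "exon")
print(len(features), "features,", exon_bases, "exonic bases")
```

Counting features by type in this way is how totals such as the number of genes, exons, and introns of an annotation can be derived.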

5. Plant genome assembly challenges

Plant genomes are known for their complexity, which is due to several specific factors:

- Large genome size: Plants are among the organisms with the largest genome sizes (from 100 Mbases to 100 Gbases) [14]. The large size of plant genomes makes their sequencing and assembly harder than for many other species (Figure 6 and Figure 7) [26].

- Redundancy: The proportion of repetitive regions in plant genomes is higher than in many other species [26]. This higher redundancy makes overlap detection hard for assemblers. Generally, it creates loops in the constructed graph, called bubbles [12], which represent an obstacle to assembly (Figure 7) [12].

- Polyploidy: The term “polyploidy” stands for the existence of multiple copies of identical or similar chromosome sets in one species. It is a known feature of plants [15,27], and it causes the presence of many different sequences that belong to the same region, which the assembler might detect as different sequences [12].

- Heterozygosity: This is the presence of different alleles at the same locus in the same organism [28]. Like polyploidy, it increases the number of different sequences that belong to the same region; since assemblers do not detect this, these sequences are assembled as different contigs [12].


Figure 7: Genome size and redundancy variation across plants and other organisms [26]

6. Aim of the study

The aim of this study is to assemble and annotate the genome of Argania spinosa, which will help in identifying the genes and pathways involved in the biosynthesis of Argane oil and other metabolites. This will help in improving the production of those metabolites in order to use them for medical and pharmaceutical purposes.


Part B: Material and Methods

1. Plant material

DNA was extracted from a 9-year-old shrub called Argane AMGHAR. The shrub has only one main trunk, with a height of 3 m (Figure 8).

Figure 8: Photo of the Argane AMGHAR shrub that provided DNA for sequencing

2. DNA extraction and Sequencing

DNA was extracted from lyophilized leaf tissue using the Qiagen Plant DNeasy mini kit. A paired-end library was prepared using the Illumina Nextera DNA Library Prep Kit, with an average insert size of 600 bp. Libraries were sequenced on the Illumina HiSeq Xten platform, which generated 471M reads, with an average size of 2 x 150 bp and a total size of 144 Gbases (coverage = ~230x) (Table 2). Long reads were generated using PacBio RS II, which produced 817K single-end reads, with an average size of 6300 bases and a total size of about 7 Gbases (coverage = ~12x) (Table 2).


Table 2: Summary of Argane libraries generated by sequencers

Sequencing platform | Type of library | Number of reads | Average read size | Total size | Coverage
Illumina HiSeq Xten | Paired End | 957,451,810 | 150 bp | 144 Gbases | 236x
PacBio RS II | Single End | 6,705,437 | 6300 bases | 7.6 Gbases | 13x

3. Data preprocessing

Data quality was visualized using the FastQC [29] software, which produced plots of the base quality per position, the GC content, the length distribution, and the adapter content. In order to remove highly repetitive sequences and normalize the size of the sequences, short-read libraries were normalized using the BBMap tool [30]. The long-read libraries were not preprocessed, since the long-read assembler requires unfiltered reads as input; the read quality filtering is performed during the assembly process.

4. K-mer distribution analysis

In order to explore the dataset before the assembly, we used DSK [31], a tool for counting the K-mers in a given library; the K-mer counts can provide information such as the genome size and the heterozygosity. In order to obtain a useful K-mer distribution, we used unique K-mers and selected large K-mer sizes. For this purpose, we used two K-mer sizes, 21 and 79; longer K-mers required too many computational resources and could not be run.
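The principle of K-mer counting and of the resulting distribution can be sketched in Python as follows. This is a naive in-memory illustration, not the disk-based algorithm of DSK: the function names and the two example reads are hypothetical, and, as a simplifying assumption, a K-mer and its reverse complement are counted separately.

```python
from collections import Counter
from typing import Dict, List


def count_kmers(reads: List[str], k: int) -> Counter:
    """Count every K-mer of size k occurring in the reads."""
    counts: Counter = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" not in kmer:  # skip K-mers containing undetermined bases
                counts[kmer] += 1
    return counts


def abundance_histogram(counts: Counter) -> Dict[int, int]:
    """histogram[d] = number of distinct K-mers observed exactly d times."""
    return dict(sorted(Counter(counts.values()).items()))


# Hypothetical reads; real libraries contain hundreds of millions of reads.
reads = ["ACGTACGT", "CGTACGA"]
counts = count_kmers(reads, k=4)
histogram = abundance_histogram(counts)
print(histogram)
```

Plotting this histogram (depth on the x-axis, number of distinct K-mers on the y-axis) gives the K-mer distribution profile from which the peaks are read.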

The genome size was calculated from the K-mer distribution plot, based on the depth of the genome size peak. When the plot of the K-mer distribution shows a single peak, that peak is used as the genome size peak, whereas when the plot shows several peaks, the lower peak is taken as the genome size peak [40]. The sequencing coverage was calculated from the estimated genome size using the following formula:

$$\text{Coverage} = \frac{\text{Total size of the library (bases)}}{\text{Estimated genome size (bases)}}$$
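The two calculations described above can be sketched in Python as follows. The estimator used here, the total number of K-mers divided by the depth of the genome size peak, is a commonly used one and is an assumption of this sketch rather than a statement of the exact formula applied in this study; the function names and the toy histogram are hypothetical. The coverage example uses the 144 Gbases of short reads and the lower genome size estimate of 630 Mb.

```python
from typing import Dict


def estimate_genome_size(histogram: Dict[int, int], peak_depth: int) -> float:
    """Assumed estimator: total number of K-mers / depth of the genome size peak."""
    total_kmers = sum(depth * n_distinct for depth, n_distinct in histogram.items())
    return total_kmers / peak_depth


def sequencing_coverage(total_bases: float, genome_size: float) -> float:
    """Coverage = total size of the library / estimated genome size."""
    return total_bases / genome_size


# Hypothetical K-mer histogram with a single peak at depth 10.
toy_histogram = {9: 100, 10: 800, 11: 100}
toy_size = estimate_genome_size(toy_histogram, peak_depth=10)

# Short-read library of this study against the lower genome size estimate.
coverage = sequencing_coverage(144e9, 630e6)
print(toy_size, round(coverage))
```

With these inputs the coverage is close to the ~230x reported for the short-read library; in practice, low-depth K-mers caused by sequencing errors are usually excluded before applying such an estimator.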

5. Genome assembly

Three genome assemblies were performed. First, the long reads alone were assembled using the Canu assembler [32], which filters the library by quality and then builds an overlap-layout-consensus (OLC) graph from which the assembly is generated. Second, the short reads alone were assembled using SOAPdenovo 2 [33], an assembler that uses a de Bruijn graph (DBG) to assemble large genomes from short reads. Lastly, a hybrid assembly combining short and long reads was performed using SPAdes [34], which first assembles the short reads with a DBG and then maps the result to the long reads to generate scaffolds.
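The de Bruijn graph idea used by short-read assemblers can be shown with a toy Python sketch. It is not how SOAPdenovo 2 or SPAdes are implemented: it assumes error-free reads, a single strand, and no repeats, and the function names and reads are hypothetical.

```python
from collections import defaultdict

def build_de_bruijn(reads, k):
    """Nodes are (k-1)-mers; each K-mer adds an edge from its prefix to its suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def walk_unambiguous(graph, start):
    """Extend a contig while the current node has exactly one successor."""
    path, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop on a cycle instead of looping forever
            break
        path += nxt[-1]
        seen.add(nxt)
        node = nxt
    return path

# Hypothetical overlapping reads from the sequence ATGGCGTGA.
reads = ["ATGGC", "GGCGT", "CGTGA"]
graph = build_de_bruijn(reads, k=4)
contig = walk_unambiguous(graph, "ATG")
print(contig)
```

The walk stops as soon as a node has zero or several successors, which is where repeats and heterozygous sites break contigs in real data.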

6. Gene prediction

To predict coding sequences in the assembled sequences, we used AUGUSTUS [35] for ab initio prediction, with the Solanum lycopersicum genome annotation as the template organism, since it is the closest species to Argania spinosa among the available templates. The resulting CDS sequences were then aligned to Ericales reference proteins using BLASTp [36].

7. Mitogenome assembly

To extract mitochondrial-like reads from the whole-genome libraries, we used GetOrganelle [37], which also filtered the extracted reads to remove duplicate sequences. A first DBG was then generated with SPAdes, and the graph was cleaned using GetOrganelle to produce one circular sequence.


8. Mitogenome annotation

The mitochondrial genome was annotated using GeSeq [38], with Vaccinium macrocarpon and Camellia sinensis as template organisms. They are the only Ericales with reference mitochondrial genome sequences in the NCBI RefSeq database. The results were visualized with the OrganellarGenomeDRAW tool [39].

Part C: Results

1.1. Raw data quality

Read quality profiles from FastQC showed that the short-read library is of very good quality: the median and the mean quality at each position are above 30, and the lower quantile only drops below 30 from the 90th position onward (Figure 9). This library did not need to be filtered, but to make the assembly easier we reduced redundancy using the BBduk script from the BBMap tool [30]. This tool detects redundant sequences through duplicate K-mers; we used a K-mer size of 79 for this step.


Figure 9: Quality profile of the short-read library.

Legend: The red area represents bad quality, the yellow area average quality, and the green area good quality; the blue line represents the mean quality per position in the read, and the black lines represent the quantiles of the qualities at each position. The blue line lies between quality scores of 34 and 36, and all box plots are above the 36 quality score, which means that the quality of this library is good.

Read quality profiles for the long-read library show that all bases beyond the 100th position have a quality score below 30, and those beyond the 300th position have a quality score below 20 (Figure 10); hence the quality of the long reads is low. This library would need filtering, but the long-read assembler performs read correction as a step of the assembly, so the long-read library was not preprocessed.


Figure 10: Quality profile of the long-read library.

Legend: The red area represents bad quality, the yellow area average quality, and the green area good quality; the blue line represents the mean quality per position in the read, the yellow boxes represent the quantiles of the qualities at each position, the black lines represent the outlier qualities at each position, and the red line represents the median quality per position. The box plots and the blue line move from the green area to the red area starting from position 50, which means that all sequences beyond this position have bad quality.

1.2. Genome size and coverage estimation

K-mer frequencies in sequencing data can be counted with many of the currently available tools, such as DSK, which we used to perform the K-mer distribution analysis (Figure 11). The presence of two peaks in the profiles reflects the heterozygosity of the plant. For both K-mer sizes, 21 and 79, the heterozygosity peak is at the same level, while the lower peak, which corresponds to the genome size, is not (Figure 11 and Figure 12). The estimated genome size is about 630 Mbases with the 21-mer profile and 713 Mbases with the 79-mer profile. The coverage calculated from the estimated genome size was ~236x for short reads and ~13x for long reads.


Figure 11: K-mer distribution profile for the K-mer size of 21 using DSK tool.

Legend: The first peak corresponds to the heterozygosity and represents the frequency of repetitive K-mers. The second peak corresponds to the genome size and represents the frequency of non-repetitive K-mers. K-mer abundance is the coverage of each detected K-mer. Number of distinct K-mers is the frequency of K-mers with a specific coverage. The two peaks are labelled "Heterozygosity peak" and "Genome size peak" on the plot.


Figure 12: K-mer distribution profile for K-mer size of 79.

Legend: The first peak corresponds to the heterozygosity and represents the frequency of repetitive K-mers. The second peak corresponds to the genome size and represents the frequency of non-repetitive K-mers. K-mer abundance is the coverage of each detected K-mer. Number of distinct K-mers is the frequency of K-mers with a specific coverage.

1.3. Genome assembly

We used Masurca for the Argane genome assembly (Table 3). The long-read assembly generated 115 contigs with a total size of 909 Kb; the largest contig is 115 Kb, the smallest is 5321 bases, and the N50 is about 18 Kb. The short-read assembly generated 15K contigs with a total size of about 194 Mb and an N50 of 35 Kb; the largest contig is about 261 Kb and the smallest about 280 bases. The hybrid assembly generated 82K contigs with a total size of 423 Mb; the largest contig is about 325 Kb, the smallest 6430 bases, and the N50 is about 40 Kb (Table 3).

Table 3: Summary of Argane genome assembly results using Canu, SOAPdenovo 2, and Masurca

Assembly type | Total size | N50 | Number of contigs | Largest contig
Long reads only | 909 Kb | 18 Kb | 115 | 115 Kb
Short reads only | 194 Mb | 35 Kb | 15K | 261 Kb
Hybrid assembly | 423 Mb | 40 Kb | 82K | 325 Kb
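Since the N50 is the main contiguity metric in the table above, a minimal Python sketch of its usual definition may help: the N50 is the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the total assembly size. The function name and the toy contig lengths are hypothetical.

```python
def n50(lengths):
    """Length of the contig at which the running sum reaches half the total size."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty assembly

# Hypothetical contig lengths (bases): total 300, half 150.
toy_contigs = [100, 80, 60, 40, 20]
print(n50(toy_contigs))
```

Because the metric depends only on the length distribution, an assembly can have a reasonable N50 and still cover a small fraction of the genome, so the total size should always be read alongside it.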


1.4. Genome annotation

Although the genome assembly was of low quality, we decided to proceed with the genome annotation for preliminary exploration purposes. Ab initio gene prediction using the ? tool predicted 150 genes, totalling 5128 introns and 6073 exons. Seven of these 150 genes were predicted as complete genes, while the rest were partial genes (Table or Figure?). The alignment of these seven complete genes revealed their functions (Table 4). Three genes encode photosynthesis complexes (Photosystem I chlorophyll A, Photosystem II chlorophyll B, and RuBisCO large subunit), two encode ribosomal proteins (Ribosomal Proteins L1 and L2), one encodes the γ-Tocopherol methyltransferase, and one encodes RNA Polymerase Subunit Beta (Table 4).
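Counts of genes, introns, and exons such as those reported above are typically tallied from the feature-type column (third column) of the GFF file written by the gene predictor. The Python sketch below shows one way to do this; the toy lines are hypothetical, and the feature names emitted by a given predictor may differ from the ones used here.

```python
from collections import Counter

def count_features(gff_lines):
    """Tally the feature type (3rd tab-separated column) of each GFF record."""
    counts = Counter()
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 9:  # a GFF record has nine columns
            counts[fields[2]] += 1
    return counts

# Hypothetical records, not actual predictor output.
toy_gff = [
    "# hypothetical predictor output",
    "contig1\tpredictor\tgene\t1\t900\t.\t+\t.\tg1",
    "contig1\tpredictor\texon\t1\t300\t.\t+\t.\tg1.t1",
    "contig1\tpredictor\tintron\t301\t600\t.\t+\t.\tg1.t1",
    "contig1\tpredictor\texon\t601\t900\t.\t+\t.\tg1.t1",
]
counts = count_features(toy_gff)
print(dict(counts))
```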

Table 4: Predicted functions of the seven genes predicted as complete in the Masurca Argane genome assembly

Predicted gene name | Function
γ-TMT | γ-Tocopherol methyltransferase
rpl-2 | Ribosomal Protein L2
rpl-1 | Ribosomal Protein L1
psa-A | Photosystem I chlorophyll A
psb | Photosystem II chlorophyll B
rbcL | RuBisCO large subunit
rpoB | RNA Polymerase Subunit Beta

1.5. Analysis of the Argane Mitogenome (Mitochondrial Genome)

138M mitochondrial-like reads were extracted from the whole-genome library using the GetOrganelle tool and filtered to remove duplicate reads. The resulting 121M unique reads were assembled using SPAdes, which generated one circular scaffold of 245,326 bp. This is comparable to other plant mitochondrial genomes (Table 5).


Table 5: Mitochondrial genome size of some plants

Plant | Mitochondrial genome size
Camellia sinensis | 707,441 bp
Vaccinium macrocarpon | 459,678 bp
Arabidopsis thaliana | 367,808 bp
Olea europaea | 710,688 bp
Oryza sativa | 401,567 bp

The mitogenome annotation showed that five gene groups in the mitochondrial genome belong to the Oxidative Phosphorylation pathway: seven genes encode NADH Dehydrogenase subunits (nad1 - nad7), two encode Succinate Dehydrogenase subunits (sdh3 and sdh4), one encodes Ubiquinol Cytochrome C Reductase (cob), three encode Cytochrome C Oxidase (cox1 - cox3), and five encode ATP synthase (atp1, atp4, atp6, atp8, and atp9). Two further genes encode Cytochrome C biogenesis proteins (ccmB and ccmC). In addition, 18 tRNAs, 4 rRNAs, one ORF (orf100), and 10 ribosomal protein genes were annotated in the mitogenome (Figure 13).


Figure 13: Argane mitochondrial genome map assembled using SPAdes

Part D: Discussion

Genome assembly is a crucial step in the genomic workflow, since gene annotation cannot be done reliably unless the assembly is of good quality. Genome assembly becomes more complicated with genome size and organism type, with plants being the most difficult to assemble. The genome of the Argania spinosa tree is no exception. Applying the Masurca assembler to the Argane genome did not result in a good-quality assembly, because of highly repetitive regions and high heterozygosity, which interfere with the sequence overlapping process during read assembly. The genome size estimated for this plant was about 630 Mbases and 713 Mbases for K-mer sizes 21 and 79, respectively. Since 79-mers are longer, the number of unique 79-mers will be higher than that of 21-mers, which makes the genome size estimated from 79-mers more accurate. However, both estimates might be close to the actual genome size, since they fall within the genome size range of the Sapotaceae family, which goes from 273 Mb to 2.5 Gb. The sequencing coverage calculated from the estimated genome size was ~236x for short reads and ~13x for long reads; this difference explains the difference in quality between the long-read and short-read assemblies.

The long-read assembly produced with the Canu assembler generated 115 contigs, which makes it, from this point of view, better than the short-read and hybrid assemblies, which generated X and X contigs, respectively. However, the total assembled sequence size is much lower than in both other approaches. The N50 and the largest contig of the long-read assembly are also lower than those of the short-read and hybrid assemblies. These differences are due to the low quality of the input sequences, which causes Canu to drop many of them while correcting reads. Generally, the low quality of the long reads obtained with the available technologies makes the assembly inaccurate for eukaryotic organisms.

We then compared the two other approaches, short-read and hybrid assembly. The short-read assembly was done using SOAPdenovo 2, an assembler able to work on large genomes such as plant genomes. However, it generated a total assembled sequence size of 193 Mb, which is still lower than the estimated genome size and lower than the hybrid result. This is most probably due to the high heterozygosity of the plant, which creates many bubbles while the DBG is being built. The hybrid assembly generated a larger total sequence size because using long reads as a map for short reads helps to clean the bubbles from the graph, which leads to longer contigs. The hybrid assembly result is still lower than the estimated genome size and than the result of the published first draft Argane genome study [40], which used the same dataset and the same assembly approach. This difference is due to the assemblers used: while the first draft genome used the Masurca assembler, we used SPAdes for the hybrid assembly. Both assemblers can perform hybrid assembly; however, SPAdes is less suited to large genomes than Masurca. This further suggests that the choice of the assembler can also influence the results.
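The bubbles attributed to heterozygosity in the discussion above can be reproduced on a toy example. The Python sketch below builds the DBG edges of two hypothetical haplotypes that differ by a single SNP and reports the node where the graph forks and the node where the two paths rejoin. It is a simplified illustration, not the bubble-handling logic of any assembler.

```python
from collections import defaultdict

def de_bruijn_edges(sequences, k):
    """Set of (prefix, suffix) edges, one per distinct K-mer."""
    edges = set()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            edges.add((kmer[:-1], kmer[1:]))
    return edges

def branching_nodes(edges):
    """Nodes with several successors (forks) or several predecessors (joins)."""
    out_deg, in_deg = defaultdict(int), defaultdict(int)
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
    forks = sorted(n for n, d in out_deg.items() if d > 1)
    joins = sorted(n for n, d in in_deg.items() if d > 1)
    return forks, joins

# Hypothetical haplotypes differing by one SNP (index 5: G -> T).
hap_a = "ATGCCGTAGGA"
hap_b = "ATGCCTTAGGA"
forks, joins = branching_nodes(de_bruijn_edges([hap_a, hap_b], k=4))
print(forks, joins)
```

With a homozygous input (the same haplotype twice) both lists are empty, so every heterozygous site adds one fork and one join that the assembler has to resolve.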

For a preliminary exploration of the obtained assembly, we proceeded with genome annotation using ab initio gene prediction, which identified seven complete genes in the assembled genome. Two of them encode Photosystem chlorophyll A and B proteins, complexes that are responsible for photosynthesis. Another predicted gene involved in photosynthesis is the RuBisCO large subunit gene; RuBisCO is the leading enzyme of the Calvin cycle, the dark phase of photosynthesis. These three complexes belong to the chloroplast, where photosynthesis happens. These genes were reported by Khayi et al. [43] to be present in the chloroplast genome of the Argane tree; their presence here might be due to chloroplast DNA being present in the whole-genome library. The γ-TMT gene is one of the seven complete genes predicted; it encodes γ-Tocopherol methyltransferase, which is involved in the biosynthesis pathway of α-Tocopherol, also known as Vitamin E [41]. Vitamin E is a lipophilic antioxidant that reduces active oxygen radicals, which makes it a good nutrient for cancer prevention [42]. Tocopherol is known for its ability to prevent many disorders, such as Alzheimer's disease and heart disease, due to its antioxidant properties. The low number of predicted genes is due to the quality of the assembly. To predict all the genes, expected to number 30000 to 40000 as in many other plants (Table Y), the assembly should at least reach the scaffolding step.

Particular interest was given to the Argane mitochondrial genome. Its assembly yielded a circular DNA molecule of 245 Kb, a size within the range of plant mitochondrial genomes (200 Kb - 2000 Kb) [44]. Mitogenome annotation revealed 6 classes of genes that are involved directly or indirectly in the Oxidative Phosphorylation process. These genes encode respiratory chain proteins, such as NADH dehydrogenase and ATP synthase, or are involved in the biosynthesis of cytochromes, such as the ccmB and ccmC genes.


Conclusion and perspectives

Argania spinosa is an endemic tree of Morocco that is highly important for its nutritional, pharmaceutical, and cosmetic value. Studies have revealed the medical and therapeutic effects of a number of metabolites of the Argane tree. The discovery and analysis of the genomic regions involved in the biosynthesis of those metabolites would contribute to cultural and genetic strategies to improve the production and quality of these metabolites. In this Master thesis, we assembled the Argane tree genome using NGS short-read and long-read DNA libraries. The genome size of the Argane tree was estimated from K-mer frequencies to be 630 Mbases to 713 Mbases. The long-read assembly generated a low number of contigs with a low total sequence length, because of the quality of the reads generated by SMRT technology. Indeed, the assembler used, Canu, drops much of the data while correcting these long reads. Hybrid assembly seems to be the best assembly approach, but the assembler should be chosen wisely. Gene prediction revealed the presence of γ-Tocopherol methyltransferase, which is involved in the biosynthesis pathway of Tocopherol, an antioxidant agent that can prevent many health disorders, including Alzheimer's disease and heart disease [45]. The mitochondrial genome was also assembled and has a size of 245,326 bases. Mitogenome annotation demonstrated the presence of 21 genes, most of which encode complexes involved directly or indirectly in the Oxidative Phosphorylation process.

We propose to improve the assembly using a method that combines both automatic and manual steps, in order to be able to perform a good annotation of the Argane genome. Transcriptome analysis is also needed to achieve accurate structural and functional annotation.


References

[1] FAO, Population (French Edition) 5 (2010) 764.

[2] S. Moukrim, S. Lahssini, M. Rhazi, H.M. Alaoui, A. Benabou, I. Wahby, M. El Madihi, M. Arahou, L. Rhazi, Agroforest Syst 93 (2019) 1209–1219.

[3] S. Zrira, in: M. Neffati, H. Najjaa, Á. Máthé (Eds.), Medicinal and Aromatic Plants of the World - Africa Volume 3, Springer Netherlands, Dordrecht, 2017, pp. 91–125.

[4] H. Berrougui, M. Cloutier, M. Isabelle, A. Khalil, Atherosclerosis 184 (2006) 389–396.

[5] A. Adlouni, Phytothérapie 8 (2010) 89–97.

[6] H.E. Monfalouti, D. Guillaume, C. Denhez, Z. Charrouf, Journal of Pharmacy and Pharmacology 62 (2010) 1669–1675.

[7] M. Bnouham, S. Bellahcen, W. Benalla, A. Legssyer, A. Ziyyat, H. Mekhfi, Journal of Complementary and Integrative Medicine 5 (2008).

[8] P.D.H.S. Naher, A.P.A.K. Al-Saffar, L.D.H.O. Al, Global Journal of Research In Engineering (2014).

[9] P.M. Faria, L.N. Camargo, R.S.H. Carvalho, L.A. Paludetti, M.V.R. Velasco, R.M. da Gama, Journal of Cosmetics, Dermatological Sciences and Applications 03 (2013) 40.

[10] J.O. Lay, R. Liyanage, S. Borgmann, C.L. Wilkins, TrAC Trends in Analytical Chemistry 25 (2006) 1046–1056.

[11] J. Weissenbach, Comptes Rendus Biologies 339 (2016) 231–239.

[12] A. Kalyanaraman, in: D. Padua (Ed.), Encyclopedia of Parallel Computing, Springer US, Boston, MA, 2011, pp. 755–768.

[13] L.A. Bright, S.C. Burgess, B. Chowdhary, C.E. Swiderski, F.M. McCarthy, BMC Bioinformatics 10 (2009) S8.


[14] T.P. Michael, Brief Funct Genomics 13 (2014) 308–317.

[15] L.D. Gottlieb, Heredity 91 (2003) 91–92.

[16] N. Saraswathy, P. Ramalingam, in: N. Saraswathy, P. Ramalingam (Eds.), Concepts and Techniques in Genomics and Proteomics, Woodhead Publishing, 2011, pp. 57–76.

[17] J.S. Reis-Filho, Breast Cancer Research 11 (2009) S12.

[18] S. Ardui, A. Ameur, J.R. Vermeesch, M.S. Hestand, Nucleic Acids Res 46 (2018) 2159–2168.

[19] P.P. Sinha, Bioinformatics with R Cookbook, Packt Publishing, Birmingham, 2014.

[20] S. Derocles, D. Bohan, A. Dumbrell, J. Kitson, F. Massol, C. Pauvert, M. Plantegenest, C. Vacher, D. Evans, Advances in Ecological Research 58 (2018) 1–62.

[21] A.V. Zimin, G. Marçais, D. Puiu, M. Roberts, S.L. Salzberg, J.A. Yorke, Bioinformatics 29 (2013) 2669–2677.

[22] M. Hosseini, D. Pratas, A. Pinho, Information, MDPI 7 (2016) 56.

[23] M. Yandell, D. Ence, Nature Reviews Genetics 13 (2012) 329–342.

[24] S. Gundersen, M. Kalaš, O. Abul, A. Frigessi, E. Hovig, G. Sandve, BMC Bioinformatics 12 (2011) 494.

[25] R. Podicheti, Q. Dong, Cold Spring Harbor Protocols 2010 (2010) pdb.prot5392.

[26] M.T. Rabanus-Wallace, N. Stein, in: T. Miedaner, V. Korzun (Eds.), Applications of Genetic and Genomic Research in Cereals, Woodhead Publishing, 2019, pp. 19–47.

[27] J.S. Heslop-Harrison, in: S. Brenner, J.H. Miller (Eds.), Encyclopedia of Genetics, Academic Press, New York, 2001, pp. 1509–1511.


[29] B. Bioinformatics, Cambridge, UK: Babraham Institute (2011).

[30] B. Bushnell, BBMap: A Fast, Accurate, Splice-Aware Aligner, Lawrence Berkeley National Lab.(LBNL), Berkeley, CA (United States), 2014.

[31] G. Rizk, D. Lavenier, R. Chikhi, Bioinformatics 29 (2013) 652–653.

[32] S. Koren, B.P. Walenz, K. Berlin, J.R. Miller, N.H. Bergman, A.M. Phillippy, Genome Res. 27 (2017) 722–736.

[33] R. Luo, B. Liu, Y. Xie, Z. Li, W. Huang, J. Yuan, G. He, Y. Chen, Q. Pan, Y. Liu, J. Tang, G. Wu, H. Zhang, Y. Shi, Y. Liu, C. Yu, B. Wang, Y. Lu, C. Han, D.W. Cheung, S.-M. Yiu, S. Peng, Z. Xiaoqian, G. Liu, X. Liao, Y. Li, H. Yang, J. Wang, T.-W. Lam, J. Wang, Gigascience 1 (2012) 18.

[34] A. Bankevich, S. Nurk, D. Antipov, A.A. Gurevich, M. Dvorkin, A.S. Kulikov, V.M. Lesin, S.I. Nikolenko, S. Pham, A.D. Prjibelski, A.V. Pyshkin, A.V. Sirotkin, N. Vyahhi, G. Tesler, M.A. Alekseyev, P.A. Pevzner, J Comput Biol 19 (2012) 455–477.

[35] M. Stanke, R. Steinkamp, S. Waack, B. Morgenstern, Nucleic Acids Res 32 (2004) W309–W312.

[36] S.F. Altschul, W. Gish, W. Miller, E.W. Myers, D.J. Lipman, J Mol Biol 215 (1990) 403–410.

[37] J.-J. Jin, W.-B. Yu, J.-B. Yang, Y. Song, C.W. dePamphilis, T.-S. Yi, D.-Z. Li, Genome Biology 21 (2020) 241.

[38] M. Tillich, P. Lehwark, T. Pellizzer, E.S. Ulbricht-Jones, A. Fischer, R. Bock, S. Greiner, Nucleic Acids Res 45 (2017) W6–W11.

[39] S. Greiner, P. Lehwark, R. Bock, Nucleic Acids Res 47 (2019) W59–W64.

[40] S. Khayi, N.E. Azza, F. Gaboun, S. Pirro, O. Badad, M.G. Claros, D.A. Lightfoot, T. Unver, B. Chaouni, R. Merrouch, B. Rahim, S. Essayeh, M. Ganoudi, R. Abdelwahd, G. Diria, M.A. Mdarhi, M. Labhilili, D. Iraqi, J. Mouhaddab, H. Sedrati, M. Memari, N. Hamamouch, J. de D. Alché, N. Boukhatem, R. Mrabet, R. Dahan, A. Legssyer, M. Khalfaoui, M. Badraoui, Y. Van de Peer, T. Tatusova, A. El Mousadik, R. Mentag, H. Ghazal, F1000Res 7 (2020) 1310.

[41] E. Bergmüller, S. Porfirova, P. Dörmann, Plant Mol Biol 52 (2003) 1181–1190.

[42] C. Constantinou, A. Papas, A.I. Constantinou, International Journal of Cancer 123 (2008) 739–752.

[43] Khayi S, Gaboun F, Pirro S, Tatusova T, El Mousadik A, Ghazal H, et al. Complete Chloroplast Genome of Argania spinosa: Structural Organization and Phylogenetic Relationships in Sapotaceae. Plants 2020;9. https://doi.org/10.3390/plants9101354.

[44] Fauron C, Allen J, Clifton S, Newton K. Plant Mitochondrial Genomes. In: Daniell H, Chase C, editors. Mol. Biol. Biotechnol. Plant Organelles Chloroplasts Mitochondria, Dordrecht: Springer Netherlands; 2004, p. 151–77. https://doi.org/10.1007/978-1-4020-3166-3_6.

[45] Monfalouti HE, Guillaume D, Denhez C, Charrouf Z. Therapeutic potential of argan oil: a review. J Pharm Pharmacol 2010;62:1669–75. https://doi.org/10.1111/j.2042-7158.2010.01190.x.
