
Identification and correction of genome mis-assemblies due to heterozygosity


HAL Id: hal-01092959

https://hal.inria.fr/hal-01092959

Submitted on 9 Dec 2014

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.



Anaïs Gouin, Anthony Bretaudeau, Claire Lemaitre, Fabrice Legeai

To cite this version:

Anaïs Gouin, Anthony Bretaudeau, Claire Lemaitre, Fabrice Legeai. Identification and correction of genome mis-assemblies due to heterozygosity. European Conference on Computational Biology (ECCB), Sep 2014, Strasbourg, France. ECCB 2014, 2014. ⟨hal-01092959⟩

Anaïs GOUIN¹, Anthony BRETAUDEAU², Claire LEMAITRE¹ and Fabrice LEGEAI²

¹ INRIA/IRISA/GenScale, Campus de Beaulieu, 35042 Rennes cedex, France

² INRA, Institut de Génétique, Environnement et Protection des Plantes (IGEPP), Domaine de la Motte – 35653 Le Rheu

Motivation: Some heterozygous regions show significant divergence between the two haplotypes, and the assembly process can then build two different contigs instead of one consensus sequence.

Objective: Set up a strategy to detect and correct false duplications in already-built assemblies.


PIPELINE

[Pipeline diagram. Recoverable step labels: Re-alignment of selected pairs; Chaining; Identification of mis-assemblies; Set up thresholds: A, B, lim1, lim2; Filtering duplications by checking uniqueness of matches; Final selection of allelism problems; Genome correction: remove scaffolds (entire or parts). Tools named in the diagram: Plast [3], Lastz [1].]

Set up thresholds

206 chains matching known false duplications (manually curated):

153 “inners” / 53 “ends”

114 “included” / 38 “borders”

~80% of known allelic regions with chosen thresholds

[Schematics of the chain categories between a query and a reference scaffold: “inner”, “end”, “included”, “border”. Threshold labels: A (bp); B (% of query); added coverage (lim1, ≤ lim2).]

Genome correction

Application: Spodoptera frugiperda genome

[1] Harris R.S., Improved pairwise alignment of genomic DNA, Ann Arbor: ProQuest, 2007:84

[2] Kajitani R. et al., Efficient de novo assembly of highly heterozygous genomes from whole-genome shotgun short reads, Genome Res., 2014, 24(8):1384-95

[3] Nguyen V.H., Lavenier D., PLAST: parallel local alignment search tool for database comparison, BMC Bioinformatics, 2009, 10:329

[4] Waterhouse et al., OrthoDB: a hierarchical catalog of animal, fungal and bacterial orthologs, Nucleic Acids Research, 2013, PMID:23180791

[Figure annotations: Potential erroneous duplications; Expected coverage; Potential duplications.]

Perspectives:

Not studied here: potential allelism problems within a scaffold, at the contig level?

Other parameters to take into account: SNP counts (to select regions) and mate pairs (to validate the corrected assembly).

Pre-selection of “similar” scaffolds

Filter over chain size: < 1 kb & < 80% of query

Analysis of chains matching known false duplications

[Diagram labels: query; reference.]
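The chain-size filter can be sketched as follows. This is only one possible reading of the poster label “< 1 kb & < 80% of query”: a chain is discarded when it is both shorter than 1 kb and covers less than 80% of the query scaffold. The `Chain` record, its field names, and the function names are hypothetical and do not come from the authors' pipeline.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Chain:
    """Hypothetical summary of one alignment chain between two scaffolds."""
    query_len: int   # length of the query scaffold, in bp
    aligned_bp: int  # number of query bases covered by the chain


def is_too_small(chain: Chain, min_bp: int = 1000,
                 min_query_fraction: float = 0.80) -> bool:
    """Assumed rule: drop chains that are < min_bp AND cover < 80% of the query."""
    if chain.query_len <= 0:
        raise ValueError("query_len must be positive")
    fraction = chain.aligned_bp / chain.query_len
    return chain.aligned_bp < min_bp and fraction < min_query_fraction


def filter_chains(chains: List[Chain]) -> List[Chain]:
    """Keep only the chains that pass the assumed size filter."""
    return [c for c in chains if not is_too_small(c)]


if __name__ == "__main__":
    chains = [
        Chain(query_len=5000, aligned_bp=500),       # short, 10% of query
        Chain(query_len=600, aligned_bp=550),        # short, but most of the query
        Chain(query_len=100000, aligned_bp=20000),   # long chain
    ]
    for kept in filter_chains(chains):
        print(kept)
```

Under this reading, a short chain is still kept when it covers most of a small query scaffold, which matters if small scaffolds can be entirely redundant.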

Identification of mis-assemblies

[Threshold plots for “included” and “borders” chains. Recoverable labels: Expected coverage; lim1; lim2; A; B; “Potential bad annotations: duplications?”]

                     Initial assembly   Platanus [2]   Corrected
                     (Allpaths)         assembly       assembly
Total size (Mb)      526.0              470.1          434.9
Nb. scaffolds        48 272             41 633         41 536
N50 (bp)             39 593             75 578         52 733
BUSCO* no hit        14                 9              15
BUSCO* single hit    1497               2047           1904
BUSCO* multi hit     782                237            374

Expected size: ~375 Mb

*BUSCO: Benchmarking sets of Universal Single-Copy Orthologs (Arthropoda species) [4]
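The assembly statistics above can be put on a common scale with a few lines of Python. The sketch below is illustrative only: the column order (initial Allpaths assembly, Platanus assembly, corrected assembly) is assumed from the extracted header, the dictionary keys and function names are invented for the example, and `n50` implements the usual definition of N50 rather than the authors' own script.

```python
from typing import Dict, List

EXPECTED_SIZE_MB = 375.0  # "Expected size: ~375 Mb" on the poster

# Values copied from the poster table; column order is an assumption.
ASSEMBLIES: Dict[str, Dict[str, float]] = {
    "Initial (Allpaths)": {"size_mb": 526.0, "no_hit": 14, "single_hit": 1497, "multi_hit": 782},
    "Platanus": {"size_mb": 470.1, "no_hit": 9, "single_hit": 2047, "multi_hit": 237},
    "Corrected": {"size_mb": 434.9, "no_hit": 15, "single_hit": 1904, "multi_hit": 374},
}


def n50(lengths: List[int]) -> int:
    """Standard N50: length L such that scaffolds >= L hold at least half the bases."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length
    raise AssertionError("unreachable for non-negative lengths")


def summarize(stats: Dict[str, float],
              expected_mb: float = EXPECTED_SIZE_MB) -> Dict[str, float]:
    """Express size as % excess over the expected size, BUSCO hits as % of all groups."""
    total = stats["no_hit"] + stats["single_hit"] + stats["multi_hit"]
    return {
        "busco_total": total,
        "excess_size_pct": 100.0 * (stats["size_mb"] - expected_mb) / expected_mb,
        "single_hit_pct": 100.0 * stats["single_hit"] / total,
        "multi_hit_pct": 100.0 * stats["multi_hit"] / total,
    }


if __name__ == "__main__":
    for name, stats in ASSEMBLIES.items():
        s = summarize(stats)
        print(f"{name}: +{s['excess_size_pct']:.1f}% size, "
              f"{s['single_hit_pct']:.1f}% single hit, "
              f"{s['multi_hit_pct']:.1f}% multi hit")
```

The three BUSCO columns each sum to 2293 groups, so the single-hit and multi-hit percentages are directly comparable across assemblies. A drop in multi-hit groups is the signal one would expect when falsely duplicated allelic regions are removed.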

[Figure residue. Recoverable axis and panel labels: dist1; dist2; Added coverage; % query covered; “inners”; “ends”; Coverage of scaffolds; Number of scaffolds.]

Our strategy clearly improves the initial assembly.

However, performing a new assembly with a tool that handles heterozygosity (such as Platanus) is still more effective.

[Axis labels: max(dist1, dist2); added coverage.]
