
SGS Read quality analysis


Academic year: 2022


Christophe Klopp, Bioinfo Genotoul

(2)

Overview

What is a good quality read?

The file formats

Manufacturer-defined

The fastq file format

Common biases

Duplication bias story

HiSeq quality analysis

(3)

The sequencing process

1) Sampling

2) DNA or RNA extraction

3) Amplification (*)

4) Tagging (*)

5) Sequencing

(4)

Illumina sequencing

(5)

Illumina library preparation

(6)

The possible problems

1) Sampling:

    1) Contamination

    2) Low quality material

2) DNA or RNA extraction bias

3) Amplification bias

4) Tagging: uneven mixing

5) Sequencing:

    1) Cross contamination

    2) Region selection (low or high GC content)

    3) Read production (quantity, Ns, ...)

(7)

Manufacturer file formats

Roche 454

Sff format

Binary format containing flowgram

(8)

fastq file formats

(9)

Sequencing bias

Platform related

Roche 454 (data from Jean-Marc Aury, CNS)

99.9% mapped reads

Mean error rate: 0.55%

37% deletions, 53% insertions, 10% substitutions

Homopolymer errors

emPCR duplications

Plate location bias

Solexa (data from Jean-Marc Aury, CNS)

98.5% mapped reads

Mean error rate: 0.38%

3% deletions, 2% insertions, 95% substitutions

Low coverage of A/T-rich regions

(10)

RNA-Seq read content profile

(11)

Pyrosequencing read replication bias: evidence and correction proposal for genome sequencing

(12)

Laurence Drouilhet, PhD student

Prediction of 100 SNPs

80 were false SNPs linked to reads having the same start position

Where it all started!

(13)

This does not look randomly picked!

How strange!

(14)

The questions

Is it normal to have so many reads starting at this position?

If not, where does it come from?

Why do these reads have the same error?

(15)

The material

1 Titanium E. coli run from the local platform

test run used to validate the sequencer (two half plates)

First region : 671 856 reads

Second region : 529 653 reads

Reference: Escherichia coli str. K-12 substr. MG1655, complete genome

NCBI / LOCUS : NC_000913

4,639,675 bp

circular BCT DNA

08-MAY-2009

(16)

Duplicate reads of the 454

Categories: single sequences, representative sequence, duplicate sequences

[Figure: two half runs (relative). Stacked bars from 0% to 100% showing reads in cluster, cluster base and out of clusters for half run 1, half run 2 and a simulated run of 550 000 sequences from the genome.]

(17)

What is the structure of the duplicate read graph?

Number of couples, triplets, ...

[Figure: number of clusters (log scale, 1 to 100 000) by cluster size (2 to 26) for 1_clusters, 2_clusters and genome_clusters.]

(18)

Validation of the observation in other runs

GS FLX standard: NCBI SRA

Bias already exists

[Figure: stacked bars from 0% to 100% showing reads in cluster, cluster base and out of clusters for SRR001355, SRR016860, SRR016859 and SRR000868.]

(19)

The 454 process


(20)

Where are the duplicate reads located on the plate?

No specific location

But the half runs have different profiles

Duplicate reads are not clustered around particular wells

(21)

Where are the duplicate reads on the genome

All reads on the reference genome

Duplicate reads on the reference genome

(22)

Do the half runs share the same duplicate reads?

Only 1.6% of the duplicate reads from the second half run also exist in the first half run.

Duplicate reads are not due to the fragmentation and selection process.

(23)

Do duplicate reads have specific patterns?

Distance between two adjacent reads / complexity

(24)

Do duplicate reads have specific patterns?

No specific pattern in:

GC %

Di-nucleotide %

Tri-nucleotide %

Using megablast and same start (-p 98 -s 140)

Strand of the same-start alignment results:

1 216 659 forward/forward

110 896 forward/reverse

(25)

Duplication during emPCR ?

One bead = one micro-reactor?

Martine Yerle 2009

(26)

And it can be worse!

Martine Yerle 2009

(27)

What are the impacts of n-plicated reads?

False SNPs

Percentage of SNPs removed when the n-plicated sequences are removed: 6.5% to 33.5%

Wrong expression measurement

Longer assembly processing

(28)

Correction proposal: Pyrocleaner

Target: remove the n-plicated sequences to get as close as possible to the random results

[Figure: number of clusters (log scale, 10 to 1 000 000) for Sim run1, 454 run1, Sim run2 and 454 run2.]

(29)

Using the end position

(30)

After Pyrocleaner!

Use the start position and the read length

https://mulcyber.toulouse.inra.fr/projects/pyrocleaner/

[Figure: number of clusters (log scale, 1 to 1 000 000) by cluster size (1 to 30) for Sim run1, 454 run1 cleaned up, Sim run2 and 454 run2 cleaned up.]
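The idea of using the start position and the read length can be sketched as follows. This is a minimal illustration and not Pyrocleaner's actual implementation: the `AlignedRead` record, the `remove_duplicates` function and the `length_tolerance` parameter are hypothetical names, and the sketch assumes the reads have already been placed on a reference.

```python
from collections import defaultdict
from typing import NamedTuple


class AlignedRead(NamedTuple):
    """Hypothetical minimal record for a placed read (not a Pyrocleaner type)."""
    name: str
    start: int    # leftmost position on the reference
    strand: str   # "+" or "-"
    length: int   # read length in bases


def remove_duplicates(reads, length_tolerance=0):
    """Keep one representative per group of reads sharing start, strand and length.

    Reads with the same start and strand are sorted from longest to shortest.
    A read is dropped when its length is within `length_tolerance` bases of the
    current representative; otherwise it becomes a new representative.
    """
    by_start = defaultdict(list)
    for read in reads:
        by_start[(read.start, read.strand)].append(read)

    kept = []
    for group in by_start.values():
        group.sort(key=lambda r: r.length, reverse=True)
        representative = None
        for read in group:
            if (representative is None
                    or representative.length - read.length > length_tolerance):
                representative = read
                kept.append(read)
    return kept


reads = [
    AlignedRead("r1", 100, "+", 250),
    AlignedRead("r2", 100, "+", 250),  # same start, strand and length as r1
    AlignedRead("r3", 100, "+", 180),  # same start but clearly shorter
    AlignedRead("r4", 100, "-", 250),  # same start, other strand
    AlignedRead("r5", 500, "+", 300),
]
cleaned = remove_duplicates(reads)
```

With a tolerance of 0 only exact start and length matches are collapsed. A larger tolerance removes more reads, at the risk of discarding genuine independent fragments.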

(31)

Conclusions

Duplication

Exists

Can be substantial

Depends on the experiment

Solution :

Use the read length

Use random tags

(32)

Quality analysis of the HiSeq reads produced during the validation runs

(33)

Overview

Reference genome (PhiX)

Illumina reads known biases

Produced data

What do we consider a good run?

What do we consider a good shotgun run?

Software pieces

Results

NG6

(34)

PhiX genome

Accession : NC_001422

Circular genome

Length : 5,386 bases

Composition: 31.3% T, 24% A, 23.3% G, 21.4% C (44.7% GC)

(35)

Sequence uniqueness in the PhiX genome

All sequences longer than 13 bases are unique within the genome

[Figure: percentage of unique k-mers in the PhiX genome by k-mer length (1 to 49).]
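The uniqueness measurement can be reproduced with a short sketch like the one below. The function name `unique_kmer_fraction` is an assumption made for illustration, and the sketch only looks at the forward strand; it treats the genome as circular, as PhiX is, by letting k-mers span the origin.

```python
from collections import Counter


def unique_kmer_fraction(genome, k, circular=True):
    """Return the fraction of k-mer positions whose k-mer occurs exactly once."""
    if circular:
        # Append the first k-1 bases so k-mers spanning the origin are counted.
        sequence = genome + genome[:k - 1]
    else:
        sequence = genome
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    counts = Counter(kmers)
    unique = sum(1 for kmer in kmers if counts[kmer] == 1)
    return unique / len(kmers)


# Toy sequence for demonstration only, not the PhiX genome.
toy = "ACGTACGA"
fractions = {k: unique_kmer_fraction(toy, k) for k in (1, 3, 4)}
```

Running the same function on the NC_001422 sequence for k from 1 to 49 would give the kind of curve the slide describes.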

(36)

Known Illumina read biases

(37)

Produced data

2 flowcells: 500 million reads per flowcell (× 2 if paired-end)

8 lanes: 65 million reads per lane (× 2 if paired-end)

100 base pair reads coming from one or both ends of a fragment.
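As a worked example of what these figures imply, the sketch below computes the expected number of reads and bases. The function name `expected_yield` and its parameters are illustrative assumptions; the numbers are the ones quoted on the slide.

```python
def expected_yield(reads_per_flowcell=500_000_000, read_length=100,
                   paired_end=True, flowcells=2):
    """Return (total reads, total bases) for the given run layout."""
    reads = reads_per_flowcell * (2 if paired_end else 1) * flowcells
    return reads, reads * read_length


reads, bases = expected_yield()
single_reads, single_bases = expected_yield(paired_end=False, flowcells=1)
# 500 million reads spread over 8 lanes, close to the 65 million quoted per lane.
reads_per_lane = 500_000_000 / 8
```

For a paired-end run on two flowcells this gives 2 billion reads and 200 billion bases.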

(38)

4 runs

[Figure: layout of the validation runs 1 to 4, each with 2 flowcells (Flowcell A and Flowcell B), labelled A+, A-, B+ and B-.]

(39)

What do we consider a good run?

The number of reads produced should match the manufacturer standards (2 × 500 million reads per flowcell).

The read length should match the manufacturer standards (100 base pairs).

The per base quality values given by the image processing software should be above the common threshold (Phred > 20).

All reads should align to the reference.

The error rate should be low :

Ambiguous bases : N

Bases different from the reference

Insertions / deletions

The paired-end reads should align on the reference in opposite directions.
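The Phred threshold in the list above can be checked directly on a FASTQ quality string. The sketch below is a minimal illustration; the function names are assumptions, and the ASCII offset of 33 is an assumption too, since older Illumina pipelines encoded qualities with an offset of 64.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into a list of Phred scores."""
    return [ord(char) - offset for char in quality_string]


def error_probability(phred):
    """Convert a Phred score into the probability that the base call is wrong."""
    return 10 ** (-phred / 10)


def fraction_at_least(quality_string, threshold=20, offset=33):
    """Return the fraction of bases whose Phred score reaches the threshold."""
    scores = phred_scores(quality_string, offset)
    return sum(score >= threshold for score in scores) / len(scores)


scores = phred_scores("I5+")          # three bases of decreasing quality
share_q20 = fraction_at_least("I5+")  # share of bases at Phred 20 or above
```

A Phred score of 20 corresponds to an error probability of 1%, which is why it is often used as a cut-off.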

(40)

What do we consider a good shotgun run?

Half of the reads should align to one strand of the reference genome and the other half to the other strand.

The same number of reads should align at every position of the reference genome.

The error type and rate should be the same at each position of the reads and of the genome.

The insert size should follow a Gaussian law.
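Two of these criteria, strand balance and even coverage, can be measured with a sketch such as the following. The function names are assumptions. The strand is read from the SAM FLAG field (bit 0x10 marks a reverse-strand alignment, bit 0x4 an unmapped read), and evenness is summarised by the coefficient of variation of per-position depths, which is one possible choice among others.

```python
from statistics import mean, pstdev


def strand_balance(sam_lines):
    """Return the fraction of mapped reads aligned to the forward strand."""
    forward = reverse = 0
    for line in sam_lines:
        if line.startswith("@"):      # header line
            continue
        flag = int(line.split("\t")[1])
        if flag & 0x4:                # unmapped read
            continue
        if flag & 0x10:               # reverse strand
            reverse += 1
        else:
            forward += 1
    total = forward + reverse
    return forward / total if total else 0.0


def coverage_cv(depths):
    """Return the coefficient of variation of per-position read depths."""
    average = mean(depths)
    return pstdev(depths) / average if average else float("inf")


sam = [
    "@HD\tVN:1.0",
    "r1\t0\tphiX\t1\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t16\tphiX\t5\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]
balance = strand_balance(sam)
```

A balance near 0.5 and a coefficient of variation near 0 would match the expectations listed above.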

(41)

Software pieces

Quality checks (fastqc): http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/

Alignment (bwa): http://bio-bwa.sourceforge.net/

Alignment processing (samtools): http://samtools.sourceforge.net/

Alignment verification (picard): http://picard.sourceforge.net/
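The tools above could be chained roughly as in the sketch below. This is an assumed workflow, not the one used for the validation runs: the slides do not say which bwa algorithm was run, so `bwa mem` is swapped in here as a commonly used subcommand, the Picard step is left out, and all file names and the `build_pipeline` and `run_pipeline` helpers are hypothetical.

```python
import subprocess


def build_pipeline(reference, reads_1, reads_2, prefix):
    """Return the steps as (command, stdout file or None) pairs, without running them."""
    sam = f"{prefix}.sam"
    bam = f"{prefix}.sorted.bam"
    return [
        (["fastqc", reads_1, reads_2], None),                   # quality reports
        (["bwa", "index", reference], None),                    # index the reference
        (["bwa", "mem", reference, reads_1, reads_2], sam),     # SAM goes to stdout
        (["samtools", "sort", "-o", bam, sam], None),           # coordinate-sorted BAM
        (["samtools", "index", bam], None),
        (["samtools", "flagstat", bam], None),                  # alignment summary
    ]


def run_pipeline(steps):
    """Run each step in order, redirecting stdout to a file where requested."""
    for command, stdout_path in steps:
        if stdout_path is None:
            subprocess.run(command, check=True)
        else:
            with open(stdout_path, "w") as handle:
                subprocess.run(command, check=True, stdout=handle)


steps = build_pipeline("phix.fa", "run1_R1.fastq", "run1_R2.fastq", "run1")
```

Building the command list separately from running it makes it possible to inspect or log the steps before any tool is launched.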

(42)

Number of reads per flowcell

[Figure: number of reads (0 to 2 000 000 000) for runs 1 to 4; series A, B, A filtered and B filtered.]

(43)

Read length

[Figure: read length (0 to 120) for runs 1 to 4; series FCA+, FCA-, FCB+ and FCB-.]

(44)

Quality along the reads

(45)

Laser effect

(46)

Ambiguous bases : Ns

(47)

Read nucleotide content

(48)

Kmers representation

TTCTG 23 TGCTG 25

(49)

Alignment rates

[Figure: alignment rate (0 to 100) for runs 1 to 4; series FCA+, FCA-, FCB+ and FCB-.]

(50)

Bwa limits

With another alignment tool

Message from bwa

[Figure: stacked bar from 0% to 100% with the categories aligned with bwa, aligned with blast, contamination and others. Table residue for the bwa row: count, percent 97.551%, cumulative 97.551%.]

(51)

Alignment vs base quality

(52)

Insertions, deletions, substitutions

[Figure: counts in thousands.]

Substitution rate: 2‰

Insertions or deletions: 0.1‰
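To compare these observed rates with the quality values reported by the sequencer, they can be put on the Phred scale, Q = -10 log10(p). The sketch below is illustrative only; the function names are assumptions and the 100-base read length is the one quoted earlier in the slides.

```python
import math


def rate_to_phred(error_rate):
    """Convert an error rate (a probability) into the equivalent Phred score."""
    return -10 * math.log10(error_rate)


def expected_errors_per_read(error_rate, read_length=100):
    """Return the mean number of errors expected in a read of the given length."""
    return error_rate * read_length


substitution_q = rate_to_phred(0.002)   # 2 per thousand
indel_q = rate_to_phred(0.0001)         # 0.1 per thousand
substitutions_per_read = expected_errors_per_read(0.002)
```

A substitution rate of 2‰ corresponds to a Phred score of about 27, and to about 0.2 substitutions per 100-base read on average.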

(53)

Alignment strand

[Figure: share of reverse and forward alignments (0% to 100%) for each read set, from 1FCA+ to 4FCB-.]

(54)

Insert sizes

[Figure: insert size distribution, counts in thousands.]

(55)

Reference genome coverage

(56)

Reference genome paired-end read coverage

(57)

Substitutions along the reference genome

(58)

Conclusions

The number and length of the produced reads match the manufacturer standards, and often exceed them.

Judging from the alignment rate and the low error rate, the read quality is good.

The per-base Phred quality values provided are pessimistic compared to the alignment rate.

The libraries used for the platform validation do not make it possible to really evaluate whether the sequencing protocol is 'shotgun'.

(59)

NG6

http://ng6.toulouse.inra.fr/

(60)

Exercises: your turn!

http://bioinfo.genotoul.fr/index.php?id=160

(61)

Quality at each position

(62)

Nucleotides at each position
