SGS Read quality analysis

(1)


Christophe Klopp, Bioinfo Genotoul

(2)

Overview



What is a good quality read?



The file formats



Constructor defined



The fastq file format



Common biases



Duplication bias story



HiSeq quality analysis

(3)

The sequencing process

1) Sampling

2) DNA or RNA extraction

**3) Amplification (*)**

**4) Tagging (*)**

5) Sequencing

(4)

Illumina sequencing

(5)

Illumina library preparation

(6)

The possible problems

1) Sampling:

   1) Contamination

   2) Low quality material

2) DNA or RNA extraction bias

3) Amplification bias

4) Tagging: uneven mixing

5) Sequencing:

   1) Cross contamination

   2) Region selection (low or high GC content)

   3) Read production (quantity, Ns, ...)

(7)

Manufacturer file formats



Roche 454



Sff format



Binary format containing flowgram

(8)

fastq file formats
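A minimal sketch of the four-line FASTQ record layout (header, sequence, separator, qualities), to make the format concrete. The function names `read_fastq` and `decode_qualities` are illustrative. The default offset of 33 (Sanger / Illumina 1.8+ encoding) is an assumption: older Illumina pipelines used an offset of 64, which is one reason there are several fastq "formats".

```python
import io

def read_fastq(handle):
    """Yield (name, sequence, quality_string) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:           # end of file
            return
        if not header.strip():   # tolerate blank lines between records
            continue
        seq = handle.readline().rstrip("\n")
        handle.readline()        # '+' separator line, content ignored
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError("malformed FASTQ record: " + header.strip())
        yield header[1:].strip(), seq, qual

def decode_qualities(qual, offset=33):
    """Convert quality characters to Phred scores (offset 33 is an assumption)."""
    return [ord(c) - offset for c in qual]

example = "@read1\nACGTN\n+\nIIII#\n"
records = list(read_fastq(io.StringIO(example)))
scores = decode_qualities(records[0][2])
```

With offset 33, the character `I` decodes to Phred 40 and `#` to Phred 2, a value commonly attached to an `N`.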

(9)

Sequencing bias

● Platform related

● Roche 454 (data from Jean-Marc Aury, CNS)

– 99.9% mapped reads

– Mean error rate: 0.55%

– 37% deletions, 53% insertions, 10% substitutions

– Homopolymer errors

– emPCR duplications

– Plate location bias

● Solexa (data from Jean-Marc Aury, CNS)

– 98.5% mapped reads

– Mean error rate: 0.38%

– 3% deletions, 2% insertions, 95% substitutions

– Low coverage of A/T-rich regions

(10)

RNA-Seq read content profile

(11)

Pyrosequencing read replication bias: evidence and correction proposal for genome sequencing

(12)

Where it all started!

Laurence Drouilhet (PhD student): prediction of 100 SNPs

80 were false SNPs linked to reads having the same start

(13)

This does not look randomly picked!

How strange!

(14)

The questions



Is it normal to have so many reads starting at this position?



If not, where does it come from?



Why do these reads have the same error?

(15)

The material



1 Titanium E. coli run from the local platform: test run used to validate the sequencer (two half plates)

– First region: 671 856 reads

– Second region: 529 653 reads

Reference: Escherichia coli str. K-12 substr. MG1655, complete genome

– NCBI / LOCUS: NC_000913

– 4,639,675 bp

– circular BCT DNA

– 08-MAY-2009

(16)

Duplicate reads of the 454

[Figure: stacked bars (0% to 100%) showing the share of reads in cluster, cluster base, and out of clusters for half run 1, half run 2, and a simulated run of 550 000 sequences drawn from the genome. Legend: single sequences, representative sequence, duplicate sequences; two half runs (relative) and simulated run.]

(17)

What is the structure of the duplicate read graph?



Number of pairs, triplets, ...

[Figure: axis ticks 2 to 26 and a log scale from 1 to 100 000; series: 1_clusters, 2_clusters, genome_clusters.]

(18)

Validation of the observation in other runs

GS FLX standard: NCBI SRA

The bias was already present

[Figure: stacked bars (0% to 100%) showing the share of reads in cluster, cluster base, and out of clusters for SRR001355, SRR016860, SRR016859 and SRR000868.]

(19)

The 454 process

?

(20)

Where are the duplicate reads located on the plate?



No specific location



But the half runs have different profiles



Duplicate reads are not clustered around particular wells

(21)

Where are the duplicate reads on the genome



All reads on the reference genome



Duplicate reads on the reference genome

(22)

Do the half runs share the same duplicate reads?

Only 1.6% of the duplicate reads from the second half run also exist in the first half run

Duplicate reads are not due to the fragmentation and selection process.

(23)

Do duplicate reads have specific patterns?

– Distance between two adjacent reads / complexity

(24)

Do duplicate reads have specific patterns?

No specific pattern:

– GC %

– Di-nucleotide %

– Tri-nucleotide %

Using megablast and same start (-p 98 -s 140)

Strand of the same-start alignments:

– 1 216 659 forward/forward

– 110 896 forward/reverse

(25)

Duplication during emPCR ?



One bead = one micro-reactor?

Martine Yerle 2009

(26)

And it can be worse!

Martine Yerle 2009

(27)

What are the impacts of n-plicated reads?

False SNPs

– Percentage of SNPs removed when the n-plicated sequences are removed: 6.5% to 33.5%



Wrong expression measurement



Longer assembly processing

(28)

Correction proposal: Pyrocleaner

Target: remove the n-plicated sequences to get as close as possible to the random results

[Figure: cluster counts on a log scale (10 to 1 000 000); series: Sim run1 clusters, 454 run1 clusters, Sim run2 clusters, 454 run2 clusters.]

(29)

Using the end position

(30)

After Pyrocleaner!



Use the start position and the read length

https://mulcyber.toulouse.inra.fr/projects/pyrocleaner/

[Figure: axis ticks 1 to 30 and a log scale from 1 to 1 000 000; series: Sim run1, 454 run1 cleaned up, Sim run 2, 454 run2 cleaned up.]
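The idea of combining the start position with the read length can be sketched as follows. This is not the Pyrocleaner implementation: it is a simplified illustration that assumes reads are already aligned and described by hypothetical tuples `(read_id, start, strand, length)`. Reads sharing the same start and strand are treated as n-plicates only when their lengths also agree within `length_tolerance` (an assumed parameter).

```python
from collections import defaultdict

def remove_n_plicates(reads, length_tolerance=0):
    """Keep one read per (start, strand, ~length) group.

    reads: iterable of (read_id, start, strand, length) tuples (hypothetical layout).
    Returns the ids of the reads that are kept.
    """
    groups = defaultdict(list)
    for read_id, start, strand, length in reads:
        groups[(start, strand)].append((length, read_id))

    kept = []
    for key in sorted(groups):
        last_kept_length = None
        # Walk reads by increasing length; a read is kept only if it is
        # clearly longer than the last read kept at this start position.
        for length, read_id in sorted(groups[key]):
            if last_kept_length is None or length - last_kept_length > length_tolerance:
                kept.append(read_id)
                last_kept_length = length
    return kept

reads = [
    ("r1", 100, "+", 250),
    ("r2", 100, "+", 250),  # same start, strand and length as r1: n-plicate
    ("r3", 100, "+", 400),  # same start but different length: kept
    ("r4", 100, "-", 250),
    ("r5", 731, "+", 250),
]
kept = remove_n_plicates(reads)
```

Using the length in addition to the start position avoids discarding reads that start at the same place by chance but come from different fragments.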

(31)

Conclusions



Duplication



Exists,



Can be very important



Depends on the experiment



Solution :



Use the read length



Use random tags

(32)

Quality analysis of the HiSeq reads produced during the validation runs

(33)

Overview

● Reference genome (PhiX)

● Known biases of Illumina reads

● Produced data

● What do we consider a good run?

● What do we consider a good shotgun run?

● Software pieces

● Results

● NG6

(34)

PhiX genome

● Accession : NC_001422

● Circular genome

● Length : 5,386 bases

● Composition: 31.3% T, 24% A, 23.3% G, 21.4% C (44.7% GC)
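The GC figure is simply the sum of the G and C percentages (23.3% + 21.4% = 44.7%). As a minimal sketch, the composition of any sequence can be computed as below. The function names are illustrative and the toy sequence is not the PhiX genome.

```python
from collections import Counter

def base_composition(sequence):
    """Return the percentage of A, C, G and T among the unambiguous bases."""
    counts = Counter(sequence.upper())
    total = sum(counts[base] for base in "ACGT")  # assumes at least one A/C/G/T
    return {base: 100.0 * counts[base] / total for base in "ACGT"}

def gc_percent(sequence):
    """GC content in percent: share of G plus share of C."""
    composition = base_composition(sequence)
    return composition["G"] + composition["C"]

toy = "ATGCGCTTAA"  # toy sequence: 3 A, 3 T, 2 G, 2 C
toy_gc = gc_percent(toy)
```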

(35)

Sequence uniqueness in the PhiX genome

● All sequences longer than 13 bases are unique within the genome

[Figure: unique kmers in the PhiX genome; axis ticks 1 to 49 and a percentage scale from 0% to 100%.]
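A minimal sketch of how such a uniqueness profile could be computed. It assumes that "unique" means the k-mer occurs exactly once on the forward strand, and it wraps around the end of the sequence because the genome is circular. The function name and the toy sequence are illustrative, and this is not necessarily how the original figure was produced.

```python
from collections import Counter

def unique_kmer_fraction(genome, k, circular=True):
    """Fraction of k-mer start positions whose k-mer occurs exactly once.

    Assumes 1 <= k <= len(genome) and considers the forward strand only.
    """
    seq = genome.upper()
    if circular:
        seq = seq + seq[:k - 1]  # let k-mers span the origin of a circular genome
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(kmers)
    unique = sum(1 for kmer in kmers if counts[kmer] == 1)
    return unique / len(kmers)

toy_genome = "ACGTACGA"  # toy circular sequence, not PhiX
profile = {k: unique_kmer_fraction(toy_genome, k) for k in (1, 3, 4)}
```

On a real genome the same function would be called for each k on the x axis, and the curve is expected to reach 100% once k is large enough.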

(36)

Known Illumina read biases

(37)

Produced data

2 flowcells: 500 million reads per flowcell (× 2 if paired-end)

8 lanes: 65 million reads per lane (× 2 if paired-end)

100 base pair reads coming from one or both ends of a fragment.

+ -
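The figures above can be cross-checked with a little arithmetic: 8 lanes at 65 million reads give 520 million reads, consistent with the roughly 500 million reads per flowcell. The sketch below uses only the numbers from this slide. The constant and function names are illustrative, and the yield in bases is a derived value, not a figure quoted in the slides.

```python
LANES_PER_FLOWCELL = 8
READS_PER_LANE = 65_000_000
FLOWCELLS_PER_RUN = 2
READ_LENGTH = 100  # base pairs

def expected_yield(paired_end=True):
    """Return (reads per flowcell, reads per run, bases per run)."""
    ends = 2 if paired_end else 1
    reads_per_flowcell = LANES_PER_FLOWCELL * READS_PER_LANE * ends
    reads_per_run = reads_per_flowcell * FLOWCELLS_PER_RUN
    bases_per_run = reads_per_run * READ_LENGTH
    return reads_per_flowcell, reads_per_run, bases_per_run

single_end = expected_yield(paired_end=False)
paired = expected_yield(paired_end=True)
```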

(38)

4 runs

[Figure: runs 1 to 4, each with Flowcell A and Flowcell B; data sets labelled A+, A-, B+, B-.]

(39)

What do we consider a good run?

● The number of reads produced should match the manufacturer standards (2 × 500 million reads per flowcell).

● The read length should match the manufacturer standards (100 base pairs)

● The per base quality values given by the image processing software should be above the common threshold (Phred > 20).

● All reads should align to the reference.

● The error rate should be low :

● Ambiguous bases : N

● Bases different from the reference

● Insertions / deletions

● The paired-end reads should align on the reference in opposite directions.
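The Phred threshold in the list above can be read as an error probability through the standard definition Q = -10 log10(p), so Phred 20 corresponds to one expected error in 100 bases. A minimal sketch follows. The function names are illustrative and the quality list is made-up example data.

```python
def phred_to_error_probability(q):
    """Standard Phred definition: Q = -10 * log10(p), hence p = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)

def fraction_above(qualities, threshold=20):
    """Share of bases whose Phred score is strictly above the threshold."""
    return sum(1 for q in qualities if q > threshold) / len(qualities)

example_qualities = [40, 40, 30, 20, 2]  # made-up scores for one short read
p20 = phred_to_error_probability(20)
share = fraction_above(example_qualities)
```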

(40)

What do we consider a good shotgun run?

● Half of the reads should align on one strand of the reference genome and the other half on the other strand.

● The same number of reads should align at every position of the reference genome.

● The error type and rate should be the same at each position of the reads and of the genome.

● The insert size should follow a Gaussian distribution.
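The first criterion can be checked directly from alignments in SAM format, where bit 0x10 of the FLAG field marks a read aligned on the reverse strand and bit 0x4 marks an unmapped read. The sketch below is a simplified illustration: the function name and the example lines are made up, and secondary or supplementary alignments are not filtered.

```python
def strand_balance(sam_lines):
    """Count forward and reverse alignments from SAM text lines.

    Returns (forward, reverse, forward_fraction).
    """
    forward = reverse = 0
    for line in sam_lines:
        if line.startswith("@"):   # header line
            continue
        flag = int(line.split("\t")[1])
        if flag & 0x4:             # read unmapped
            continue
        if flag & 0x10:            # read aligned on the reverse strand
            reverse += 1
        else:
            forward += 1
    total = forward + reverse
    return forward, reverse, (forward / total if total else float("nan"))

example_sam = [
    "@SQ\tSN:phiX\tLN:5386",
    "r1\t0\tphiX\t1\t60\t5M\t*\t0\t0\tACGTA\tIIIII",
    "r2\t16\tphiX\t10\t60\t5M\t*\t0\t0\tACGTA\tIIIII",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGTA\tIIIII",
    "r4\t99\tphiX\t20\t60\t5M\t=\t120\t105\tACGTA\tIIIII",
    "r5\t147\tphiX\t120\t60\t5M\t=\t20\t-105\tACGTA\tIIIII",
]
balance = strand_balance(example_sam)
```

A forward fraction close to 0.5 is what the criterion expects.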

(41)

Software pieces

● Quality checks (fastqc): http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/

● Alignment (bwa): http://bio-bwa.sourceforge.net/

● Alignment processing (samtools): http://samtools.sourceforge.net/

● Alignment verification (picard): http://picard.sourceforge.net/

(42)

Number of reads per flowcell

[Figure: number of reads (0 to 2 000 000 000) for runs 1 to 4; series: A, B, A filtered, B filtered.]

(43)

Read length

[Figure: read length (0 to 120) for runs 1 to 4; series: FCA +, FCA -, FCB +, FCB -.]

(44)

Quality along the reads

(45)

Laser effect

(46)

Ambiguous bases : Ns

(47)

Read nucleotide content

(48)

Kmers representation

TTCTG 23 TGCTG 25

(49)

Alignment rates

[Figure: alignment rate (0 to 100) for runs 1 to 4; series: FCA +, FCA -, FCB +, FCB -.]

(50)

Bwa limits

● With another alignment tool

● Message from bwa

[Figure: stacked bar (0% to 100%) of read categories: aligned with bwa, aligned with blast, contamination, others. Values shown for bwa: percent 97.551%, cumulative 97.551%.]

(51)

Alignment vs base quality

(52)

Insertions, deletions, substitutions

[Figure: counts in thousands.]

Substitution rate: 2‰

Insertion or deletion rate: 0.1‰
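As a minimal sketch of how an insertion/deletion rate can be derived from alignments, the code below counts inserted and deleted bases in SAM CIGAR strings and expresses them per thousand aligned bases. The function name and the choice of denominator are assumptions, not necessarily the definition used for the figures above. Substitutions cannot be counted this way when the CIGAR uses `M`, since `M` covers both matches and mismatches.

```python
import re

# One CIGAR operation: a length followed by an operation code from the SAM spec.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def indel_rate(cigars):
    """Return (inserted, deleted, aligned, indels per thousand aligned bases)."""
    inserted = deleted = aligned = 0
    for cigar in cigars:
        for length, op in CIGAR_OP.findall(cigar):
            n = int(length)
            if op == "I":
                inserted += n
            elif op == "D":
                deleted += n
            elif op in "M=X":      # bases aligned to the reference
                aligned += n
    per_thousand = 1000.0 * (inserted + deleted) / aligned
    return inserted, deleted, aligned, per_thousand

example_cigars = ["100M", "50M1I49M", "40M2D60M"]  # made-up alignments
ins, dels, aligned_bases, rate = indel_rate(example_cigars)
```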

(53)

Alignment strand

[Figure: share (0% to 100%) of reverse and forward alignments for each data set, from 1FCA+ to 4FCB- (runs 1 to 4, flowcells A and B, + and -).]

(54)

Insert sizes

[Figure: counts in thousands.]

(55)

Reference genome coverage

(56)

Reference genome paired-end read coverage

(57)

Substitutions along the reference genome

(58)

Conclusions

● The number and length of the produced reads match the manufacturer standards and often exceed them.

● The read quality is good, judging by the alignment rate and the low error rate.

● The per-base Phred quality values provided are pessimistic compared to the alignment rate.

● The libraries used for the platform validation do not really allow us to evaluate whether the sequencing protocol is 'shotgun'.

(59)

NG6

http://ng6.toulouse.inra.fr/

(60)

Exercises : Up to you!

http://bioinfo.genotoul.fr/index.php?id=160

(61)

Quality at each position

(62)
