SGS Read quality analysis
Christophe Klopp, Bioinfo Genotoul
Overview
What is a good quality read?
The file formats
Manufacturer-defined
The fastq file format
Common biases
Duplication bias story
HiSeq quality analysis
The sequencing process
1) Sampling
2) DNA or RNA extraction
3) Amplification (*)
4) Tagging (*)
5) Sequencing
Illumina sequencing
Illumina library preparation
The possible problems
1) Sampling:
   1) Contamination
   2) Low quality material
2) DNA or RNA extraction bias
3) Amplification bias
4) Tagging: uneven mixing
5) Sequencing:
   1) Cross contamination
   2) Region selection (low or high GC content)
   3) Read production (quantity, Ns, ...)
Manufacturer file formats
Roche 454
SFF format
Binary format containing the flowgrams
fastq file formats
Sequencing bias
● Platform related
● Roche 454 (data from Jean-Marc Aury CNS)
– 99.9% mapped reads
– Mean error rate: 0.55%
– 37% deletions, 53% insertions, 10% substitutions
– Homopolymer errors
– emPCR duplications
– Plate location bias
● Solexa (data from Jean-Marc Aury CNS)
– 98.5% mapped reads
– Mean error rate: 0.38%
– 3% deletions, 2% insertions, 95% substitutions
– Low A/T rich coverage
RNA-Seq read content profile
Pyrosequencing read replication bias: evidence and correction proposal for genome sequencing
Laurence Drouilhet (PhD student): prediction of 100 SNPs
80 were false SNPs linked to reads having the same start
Where it all started!
This does not look randomly picked!
How strange!
The questions
Is it normal to have so many reads starting at this position?
If not, where does it come from?
Why do these reads have the same error?
The material
1 Titanium E. coli run from the local platform
test run used to validate the sequencer (two half plates)
First region : 671 856 reads
Second region : 529 653 reads
Reference: Escherichia coli str. K-12 substr. MG1655, complete genome
NCBI / LOCUS : NC_000913
4,639,675 bp
circular BCT DNA
08-MAY-2009
Duplicate reads of the 454
Single sequences
Representative sequence
Duplicate sequences
[Figure: Two half runs (relative). Share of reads in clusters, cluster base reads and reads out of clusters (0% to 100%) for half run 1, half run 2 and a simulated run of 550 000 sequences drawn from the genome.]
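The simulated run serves as the random baseline for the two half runs. A minimal sketch of how such a baseline could be obtained is given here, assuming reads are placed uniformly at random on both strands of a circular genome and that two reads belong to the same cluster when they share start position and strand. The function name and the seed parameter are illustrative and are not taken from the original analysis.

```python
import random
from collections import Counter


def simulate_duplicate_fraction(genome_length, n_reads, seed=0):
    """Return the fraction of randomly placed reads that share their
    (start, strand) with at least one other read."""
    rng = random.Random(seed)
    # Draw a start position and a strand for each simulated read.
    starts = Counter(
        (rng.randrange(genome_length), rng.choice("+-")) for _ in range(n_reads)
    )
    # A read is "in a cluster" when its (start, strand) was drawn more than once.
    in_cluster = sum(count for count in starts.values() if count > 1)
    return in_cluster / n_reads
```

Calling `simulate_duplicate_fraction(4_639_675, 550_000)` uses the E. coli genome length and the simulated read count quoted on the slides; under these assumptions only a few percent of the reads are expected to share a start by chance.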
What is the structure of the duplicate read graph?
Number of couples, triplets,...
[Figure: number of clusters (log scale, 1 to 100 000) by cluster size (2 to 26) for half run 1 clusters, half run 2 clusters and simulated genome clusters.]
Validation of the observation in other runs
GS FLX standard : NCBI SRA
Bias already exists
SRR001355
SRR016860
SRR016859
SRR000868
[Figure: share of reads in clusters, cluster base reads and reads out of clusters (0% to 100%) for each SRA run.]
The 454 process
Where are the duplicate reads located on the plate?
No specific location
But the half runs have different profiles
Duplicate reads are not clustered around particular wells
Where are the duplicate reads on the genome?
All reads on the reference genome
Duplicate reads on the reference genome
Do the half runs share the same duplicate reads?
Only 1.6% of the duplicate reads from the second half run also exist in the first half run
Duplicate reads are not due to the fragmentation and selection process.
Do duplicate reads have specific patterns?
Distance between two adjacent reads / complexity
Do duplicate reads have specific patterns?
No specific pattern :
GC %
Di-nucleotide %
Tri-nucleotide %
Using megablast and same start (-p 98 -s 140)
Strand of the same-start alignments:
1 216 659 forward/forward
110 896 forward/reverse
Duplication during emPCR ?
One bead = one micro-reactor?
Martine Yerle 2009
And it can be worse!
Martine Yerle 2009
What are the impacts of n-plicated reads?
False SNPs
Percentage of SNPs removed when n-plicated sequences are removed: 6.5% to 33.5%
Wrong expression measurement
Longer assembly processing
Correction proposal : Pyrocleaner
Target : removing the n-plicated sequences to be as close as possible to the random results
[Figure: cluster counts (log scale, 10 to 1 000 000) for simulated and 454 clusters of run 1 and run 2.]
Using the end position
After Pyrocleaner!
Use the start position and the read length
https://mulcyber.toulouse.inra.fr/projects/pyrocleaner/
[Figure: number of clusters (log scale, 1 to 1 000 000) by cluster size (1 to 30) for simulated runs 1 and 2 and for the cleaned-up 454 runs 1 and 2.]
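The idea of combining the start position with the read length can be sketched as follows. This is not Pyrocleaner's actual implementation: it is a minimal illustration assuming each read is already described by a start position, a strand and a length, and the `Read` type, the function name and the `length_tolerance` parameter are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Read(NamedTuple):
    name: str
    start: int    # start position of the read
    strand: str   # "+" or "-"
    length: int   # read length in bases


def remove_nplicates(reads, length_tolerance=0):
    """Keep one representative per group of reads sharing start, strand
    and (within length_tolerance) length."""
    groups = defaultdict(list)
    for read in reads:
        groups[(read.start, read.strand)].append(read)

    kept = []
    for group in groups.values():
        # Longest reads first, so that the representative is the longest one.
        group.sort(key=lambda r: r.length, reverse=True)
        representatives = []
        for read in group:
            is_duplicate = any(
                abs(rep.length - read.length) <= length_tolerance
                for rep in representatives
            )
            if not is_duplicate:
                representatives.append(read)
        kept.extend(representatives)
    return kept
```

With `length_tolerance=0`, two reads with the same start and strand but different lengths are both kept, which protects reads that share a start by chance; a larger tolerance removes more reads.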
Conclusions
Duplication
Exists
Can be very substantial
Depends on the experiment
Solution :
Use the read length
Use random tags
Quality analysis of the HiSeq reads produced during the validation runs
Overview
● Reference genome (PhiX)
● Illumina reads known biases
● Produced data
● What do we consider a good run?
● What do we consider a good shotgun run?
● Software pieces
● Results
● NG6
PhiX genome
● Accession : NC_001422
● Circular genome
● Length : 5,386 bases
● Composition: 31.3% T, 24% A, 23.3% G, 21.4% C (44.7% GC)
Sequence uniqueness in the PhiX genome
● All sequences longer than 13 bases are unique within the genome
[Figure: percentage of unique k-mers in the PhiX genome for k = 1 to 49.]
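The uniqueness profile can be recomputed from the genome sequence. The sketch below assumes the sequence is supplied as a plain string (for example read from a FASTA file for NC_001422), counts k-mers on one strand only, and wraps around the end of the sequence because the genome is circular. The function name is illustrative.

```python
from collections import Counter


def unique_kmer_fraction(genome, k, circular=True):
    """Fraction of k-mer start positions whose k-mer occurs exactly once."""
    if circular:
        # Append the first k-1 bases so k-mers spanning the origin are counted.
        sequence = genome + genome[: k - 1]
        n_positions = len(genome)
    else:
        sequence = genome
        n_positions = len(genome) - k + 1
    counts = Counter(sequence[i : i + k] for i in range(n_positions))
    unique = sum(1 for count in counts.values() if count == 1)
    return unique / n_positions
```

Evaluating this function for k = 1 to 49 on the PhiX sequence would give one point per k, and the smallest k returning 1.0 is the length above which every sequence is unique on that strand.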
Known Illumina read biases
Produced data
2 flowcells: 500 million reads per flowcell (* 2 if paired-end)
8 lanes: 65 million reads per lane (* 2 if paired-end)
100 base pair reads coming from one or both ends of a fragment.
[Figure: layout of the 4 runs (1 to 4), each with flowcell A and flowcell B and the read sets A+, A-, B+, B-.]
What do we consider a good run?
● The number of reads produced should match the manufacturer standards (2 * 500 million reads per flowcell).
● The read length should match the manufacturer standards (100 base pairs)
● The per base quality values given by the image processing software should be above the common threshold (Phred > 20).
● All reads should align to the reference.
● The error rate should be low :
● Ambiguous bases : N
● Bases different from the reference
● Insertions / deletions
● The paired-end reads should align on the reference in opposite directions.
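The Phred threshold in the criteria above relates a quality value Q to an error probability p through Q = -10 * log10(p), so Phred 20 corresponds to one expected error in 100 bases. The sketch below decodes a FASTQ quality string and reports the share of bases above the threshold. The ASCII offset of 33 is an assumption (older Illumina pipelines used 64), and the function names are illustrative.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into a list of Phred values."""
    return [ord(char) - offset for char in quality_string]


def error_probability(phred):
    """Error probability corresponding to a Phred value."""
    return 10 ** (-phred / 10)


def fraction_above(quality_string, threshold=20, offset=33):
    """Share of bases whose Phred value is strictly above the threshold."""
    scores = phred_scores(quality_string, offset)
    return sum(1 for q in scores if q > threshold) / len(scores)
```

For example, `phred_scores("I5!")` decodes to 40, 20 and 0 with the offset of 33, so only one of the three bases is above the Phred > 20 threshold.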
What do we consider a good shotgun run?
● Half of the reads should align on one strand of the reference genome and the other half on the other strand.
● The same number of reads should be aligned on every position of the reference genome.
● The error type and rate should be the same at each position of the reads and of the genome.
● The insert size should follow a Gaussian distribution.
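Two of the shotgun criteria above, strand balance and even coverage, can be turned into simple numbers. The sketch below assumes alignments are available as strand symbols and as (start, length) pairs with 0-based starts on a circular reference. The function names and the use of the coefficient of variation as an evenness measure are choices made for this illustration.

```python
from statistics import mean, pstdev


def strand_balance(strands):
    """Fraction of alignments on the forward strand (0.5 is the ideal)."""
    return sum(1 for s in strands if s == "+") / len(strands)


def coverage(genome_length, alignments):
    """Per-position depth for (start, length) alignments on a circular genome."""
    depth = [0] * genome_length
    for start, length in alignments:
        for offset in range(length):
            # Wrap around the end of the circular reference.
            depth[(start + offset) % genome_length] += 1
    return depth


def coefficient_of_variation(values):
    """Standard deviation divided by the mean; 0 means perfectly even."""
    return pstdev(values) / mean(values)
```

A forward fraction close to 0.5 and a low coefficient of variation of the depth would both point towards a shotgun-like run.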
Software pieces
● Quality checks (fastqc): http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/
● Alignment (bwa): http://bio-bwa.sourceforge.net/
● Alignment processing (samtools): http://samtools.sourceforge.net/
● Alignment verification (picard): http://picard.sourceforge.net/
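The tools above exchange alignments in the SAM format, whose FLAG field (second column) encodes whether a read is unmapped (bit 0x4) or aligned on the reverse strand (bit 0x10). As a minimal sketch of the kind of summary used in the results, the function below counts mapped and reverse-strand reads from SAM text lines. It is an illustration written for this document, not a replacement for samtools or picard, and the function name is hypothetical.

```python
def summarize_sam(lines):
    """Count total, mapped and reverse-strand reads in SAM text lines."""
    total = mapped = reverse = 0
    for line in lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & 0x900:
            continue  # skip secondary (0x100) and supplementary (0x800) records
        total += 1
        if not flag & 0x4:  # 0x4 set means the read is unmapped
            mapped += 1
            if flag & 0x10:  # 0x10 set means reverse strand
                reverse += 1
    return {"total": total, "mapped": mapped, "reverse": reverse}
```

The alignment rate is then `mapped / total`, and the strand balance is `reverse / mapped`.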
Number of reads per flowcell
[Figure: number of reads (0 to 2 000 000 000) for runs 1 to 4, flowcells A and B, raw and filtered.]
Read length
[Figure: read length (0 to 120) for runs 1 to 4: FCA +, FCA -, FCB +, FCB -.]
Quality along the reads
Laser effect
Ambiguous bases : Ns
Read nucleotide content
K-mer representation
TTCTG 23 TGCTG 25
Alignment rates
[Figure: alignment rate (0 to 100%) for runs 1 to 4: FCA +, FCA -, FCB +, FCB -.]
Bwa limits
● With another alignment tool
● Message from bwa
[Figure: share of reads (0% to 100%) aligned with bwa, aligned with blast, from contamination, and other; bwa percentage: 97.551%, cumulative: 97.551%.]
Alignment vs base quality
Insertions, deletions, substitutions
In thousands
Substitution rate: 2‰
Insertions or deletions: 0.1‰
Alignment strand
[Figure: share of forward and reverse alignments (0% to 100%) for each run and read set, from 1FCA+ to 4FCB-.]
Insert sizes
In thousands
Reference genome coverage
Reference genome paired-end read coverage
Substitutions along the reference genome
Conclusions
● The number and length of the produced reads match the manufacturer standards, and are often higher.
● The read quality is good, judging by the alignment rate and the low error rate.
● The per-base Phred quality values provided are pessimistic compared to the alignment rate.
● The library used for the platform validation does not really allow us to evaluate whether the sequencing protocol is 'shotgun'.
NG6
http://ng6.toulouse.inra.fr/