NGS read alignment and SNP calling
Christophe Klopp - 2012
Genetic variation
http://en.wikipedia.org/wiki/Genetic_variation
Genetic variation, variations in alleles of genes, occurs both within and among populations. Genetic variation is important because it provides the “raw material” for natural selection.
Types of variations
● SNP : Single nucleotide polymorphism
● CNV : copy number variation
● Chromosomal rearrangement
● Chromosomal duplication
http://en.wikipedia.org/wiki/Copy-number_variation http://en.wikipedia.org/wiki/Human_genetic_variation
The variation transmission
● Mutation : in molecular biology and genetics, mutations are changes in a genomic sequence: the DNA sequence of a cell's genome or the DNA or RNA sequence of a virus (http://en.wikipedia.org/wiki/Mutation).
● Mutations are transmitted if they are not lethal.
● Mutations can impact the phenotype.
Genetic markers and genotyping
● A set of SNPs is selected along the genome.
● The phenotypes are collected for individuals.
● The SNPs are genotyped (measured) for the same individuals.
● This makes it possible to find locations where the genotype is linked to the phenotype :
– Major genes
– QTL (Quantitative Trait Loci)
The 1000 genomes project
● Joint project NCBI / EBI
● Common data formats :
– fastq
– SAM (Sequence Alignment/Map)
Sequence Alignment/Map (SAM) format
➢ Data sharing was a major issue with the 1000 genomes
➢ Capture all of the critical information about NGS data in a single indexed and compressed file
➢ Sharing data across tools
➢ Generic alignment format
➢ Supports short and long reads (454 – Solexa – SOLiD)
➢ Flexible in style, compact in size, efficient in random access
SAM format
Website :
http://samtools.sourceforge.net
Paper :
Li H.*, Handsaker B.*, Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R. and 1000 Genome Project Data Processing Subgroup (2009) The Sequence alignment/map (SAM) format and SAMtools.
Bioinformatics, 25, 2078-9. [PMID: 19505943]
SAM format Header section
➢ Header lines start with @ followed by a two-letter record type code (e.g. @HD, @SQ, @RG, @PG)
➢ Header fields are TAG:VALUE pairs (e.g. SN:chr1 in an @SQ line)
SAM format Alignment section
➢ 11 mandatory fields
➢ Variable number of optional fields
➢ Fields are tab delimited
SAM format Full example
<QNAME> <FLAG> <RNAME> <POS> <MAPQ> <CIGAR> <MRNM> <MPOS> <ISIZE> <SEQ> <QUAL>
[<TAG>:<VTYPE>:<VALUE> [...]]
[Figure: example SAM file with its Header and Alignment sections]
Optional field tags (<TAG>) :
X? : Reserved for end users
NM : Number of nuc. differences
MD : String for mismatching positions
RG : Read group
[...]
Value types (<VTYPE>) :
A : Printable character
i : Signed 32-bit integer
f : Single-precision float number
Z : Printable string
H : Hex string (high nybble first)
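As a minimal sketch of the layout described here, the Python function below splits one alignment line into its 11 mandatory fields and its optional TAG:VTYPE:VALUE fields. The function name parse_sam_line, the example read and the choice to convert only the i and f value types are illustrative assumptions; this is not part of SAMtools.

```python
# Mandatory field names, as listed in the template (older names for the mate fields).
MANDATORY = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
             "MRNM", "MPOS", "ISIZE", "SEQ", "QUAL"]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "MPOS", "ISIZE"}


def parse_sam_line(line):
    """Hypothetical helper: turn one tab-delimited SAM alignment line into a dict."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 11:
        raise ValueError("expected at least 11 tab-delimited fields")
    record = {}
    for name, value in zip(MANDATORY, cols[:11]):
        record[name] = int(value) if name in INT_FIELDS else value
    tags = {}
    for field in cols[11:]:
        # Optional fields look like TAG:VTYPE:VALUE; VALUE may itself contain ':'.
        tag, vtype, value = field.split(":", 2)
        if vtype == "i":
            value = int(value)
        elif vtype == "f":
            value = float(value)
        tags[tag] = value
    record["TAGS"] = tags
    return record


# Made-up read, for illustration only.
example = "read1\t0\tchr1\t5\t30\t4M\t*\t0\t0\tTCAG\tIIII\tNM:i:0\tMD:Z:4"
rec = parse_sam_line(example)
print(rec["RNAME"], rec["POS"], rec["TAGS"])
```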
SAM format Flag field
http://picard.sourceforge.net/explain-flags.html
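The FLAG field is a bitwise combination of properties, which is what the page linked above decodes. A minimal Python sketch of the same idea follows; the bit values are those of the SAM specification, while the function name explain_flag and the wording of the descriptions are assumptions made for this example.

```python
# Bit values from the SAM specification; descriptions are paraphrased.
FLAG_BITS = [
    (0x1, "read is paired"),
    (0x2, "read mapped in proper pair"),
    (0x4, "read unmapped"),
    (0x8, "mate unmapped"),
    (0x10, "read on reverse strand"),
    (0x20, "mate on reverse strand"),
    (0x40, "first in pair"),
    (0x80, "second in pair"),
    (0x100, "not primary alignment"),
    (0x200, "fails quality checks"),
    (0x400, "PCR or optical duplicate"),
]


def explain_flag(flag):
    """Hypothetical helper: list the properties encoded in a SAM FLAG value."""
    return [description for bit, description in FLAG_BITS if flag & bit]


for description in explain_flag(99):
    print(description)
```

For example, 99 = 1 + 2 + 32 + 64, so the read is paired, mapped in a proper pair, has its mate on the reverse strand and is the first read of the pair.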
SAM format Extended CIGAR format
Ref:    GCATTCAGATGCAGTACGC
Read:     ccTCAG--GCATTAgtg
POS:    5
CIGAR:  2S4M2D6M3S
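To make the example concrete, the sketch below parses a CIGAR string and derives how many read bases and reference bases it covers. The helper names are assumptions for this illustration; the operation letters (M, I, D, N, S, H, P, =, X) are those of the SAM specification.

```python
import re

CIGAR_FULL = re.compile(r"(?:\d+[MIDNSHP=X])+")
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Hypothetical helper: '2S4M' -> [(2, 'S'), (4, 'M')]."""
    if not CIGAR_FULL.fullmatch(cigar):
        raise ValueError("malformed CIGAR string: " + cigar)
    return [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]


def read_length(cigar):
    """Number of bases stored in SEQ: M, I, S, = and X consume the read."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MIS=X")


def reference_span(pos, cigar):
    """First and last reference positions (1-based) covered by the alignment.

    M, D, N, = and X consume the reference; soft clips do not.
    """
    ref_len = sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")
    return pos, pos + ref_len - 1


cigar = "2S4M2D6M3S"
print(parse_cigar(cigar))
print(read_length(cigar))        # 15, the length of ccTCAGGCATTAgtg
print(reference_span(5, cigar))  # (5, 16)
```

The two soft-clipped bases at the start are not counted in POS: position 5 is where the first matched base (T) sits on the reference.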
BAM format
➢ Binary representation of SAM
➢ Compressed by BGZF library
➢ Greatly reduces storage space requirements to about 27% of original SAM
SAMtools
➢ Library and software package
➢ Creating sorted and indexed BAM files from SAM files
➢ Removing PCR duplicates
➢ Merging alignments
➢ Visualization of alignments from BAM files
➢ SNP calling
➢ Short indel detection
http://samtools.sourceforge.net/samtools.shtml
SAMtools Example usage
➢ Create BAM from SAM
samtools view -bS -o aln.bam aln.sam
➢ Sort BAM file
samtools sort example.bam sortedExample
➢ Merge sorted BAM files
samtools merge sortedMerge.bam sorted1.bam sorted2.bam
➢ Index BAM file
samtools index sortedExample.bam
➢ Visualize BAM file
samtools tview sortedExample.bam reference.fa
Exercise 1
● Download the sam file :
–
● Visualize the file content : find the number of exact matching reads
● Install the samtools :
– http://samtools.sourceforge.net/samtools.shtml
● Produce a sorted and indexed bam file from the sam file
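One possible way to approach the counting step is sketched below in Python. It assumes that "exact matching" means a mapped read carrying the optional field NM:i:0 (no nucleotide difference); if the aligner used for the exercise does not write the NM tag, another criterion is needed. The function name and the sample lines are made up for the illustration.

```python
def count_exact_matches(sam_lines):
    """Hypothetical helper: count mapped reads whose NM tag is 0."""
    count = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 11:
            continue  # not a complete alignment line
        if int(cols[1]) & 0x4:
            continue  # FLAG bit 0x4: read is unmapped
        if "NM:i:0" in cols[11:]:
            count += 1
    return count


# Made-up lines; with a real file one could pass an open file handle instead,
# e.g. count_exact_matches(open("aln.sam")).
sample = [
    "@HD\tVN:1.0\n",
    "r1\t0\tchr1\t5\t30\t4M\t*\t0\t0\tTCAG\tIIII\tNM:i:0\n",
    "r2\t0\tchr1\t9\t30\t4M\t*\t0\t0\tATGC\tIIII\tNM:i:1\n",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n",
]
print(count_exact_matches(sample))
```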
Visualizing the alignment IGV
➢ IGV : Integrative Genomics Viewer
➢ Website : http://www.broadinstitute.org/igv
➢ High-performance visualization tool
➢ Interactive exploration of large, integrated datasets
➢ Supports a wide variety of data types
➢ Documentation
➢ Developed at the Broad Institute of MIT and Harvard
Visualizing the alignment IGV (screenshots)
➢ Loading the reference
➢ Loading the bam file
➢ Zoom
➢ Loading a gff file
Visualizing the alignment IGV - Coverage
➢ Generate the coverage information to be displayed in IGV.
java -jar igvtools.jar count aln.bam aln.depth.tdf ref.genome
➢ Remark : ref.genome was generated when we imported the genome sequence
➢ This step is optional, but it is essential if you want to see the read depth information at large scale.
Exercise 2
● Open IGV with Java Web Start : http://www.broadinstitute.org/igv
● Create the genome using the fasta file
● Load the sorted bam file
The pileup format
http://samtools.sourceforge.net/pileup.shtml
Columns : Chr. - Coord. - Reference base ('*' for indel) - Number of reads covering the site - Read bases - Base qualities
Read bases :
➢ '.' and ',' : match to the reference base on the forward/reverse strand
➢ 'ACGTN' and 'acgtn' : for a mismatch on the forward/reverse strand
➢ '^' and '$' : start/end of a read segment
➢ '+[0-9]+[ACGTNacgtn]+' and '-[0-9]+[ACGTNacgtn]+' : insertion/deletion
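The read bases column can be decoded with a small scanner. The Python sketch below counts matches, mismatches, insertions and deletions in one read-bases string. It relies on one detail taken from the pileup documentation linked above rather than from this list: the '^' symbol is followed by one character encoding the mapping quality, which must be skipped. The function name and the example string are assumptions for the illustration.

```python
def summarize_read_bases(bases):
    """Hypothetical helper: count events in a pileup read-bases string."""
    counts = {"match": 0, "mismatch": 0, "insertion": 0, "deletion": 0}
    i = 0
    while i < len(bases):
        symbol = bases[i]
        if symbol == "^":
            i += 2  # skip '^' and the mapping-quality character after it
            continue
        if symbol in "+-":
            # Indel: sign, length in digits, then that many bases.
            j = i + 1
            while j < len(bases) and bases[j].isdigit():
                j += 1
            length = int(bases[i + 1:j])
            counts["insertion" if symbol == "+" else "deletion"] += 1
            i = j + length
            continue
        if symbol in ".,":
            counts["match"] += 1
        elif symbol in "ACGTNacgtn":
            counts["mismatch"] += 1
        # '$' (end of a read segment) and '*' (deleted base) are not counted
        i += 1
    return counts


print(summarize_read_bases("^F.,$A+2AGt-1c*"))
```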
Using mpileup
➢ Get the raw pileup:
samtools mpileup -f ref.fa aln.bam > raw.txt
ref.fa : FASTA-formatted file of the reference genome
aln.bam : sorted BAM file of the alignments
raw.txt : output in pileup format
-f : reference sequence, ref.fa (in FASTA format)
Mpileup output
Variant Calling
samtools mpileup -uf ref.fa aln1.bam aln2.bam | bcftools view -bvcg - > var.raw.bcf
bcftools view var.raw.bcf | vcfutils.pl varFilter -D100 > var.flt.vcf
var.raw.bcf : binary compressed variants
var.flt.vcf : text filtered variants
-b : output BCF instead of VCF
-v : output potential variant sites only (force -c)
-c : SNP calling (force -e)
-g : call genotypes at variant sites (force -c)
-D100 : maximum read depth [10000000]
VCF format
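A VCF file is tab-delimited text whose first fixed columns are CHROM, POS, ID, REF, ALT, QUAL, FILTER and INFO. As a hedged sketch of how such records can be handled, the Python function below keeps simple SNPs and rewrites each one as a 9-column GFF line. The function name, the "samtools" source label, the "SNP" feature type and the ref=/alt= attributes are illustrative choices, not a format imposed by SAMtools, and the input lines are made up.

```python
def vcf_snp_to_gff(vcf_line, source="samtools"):
    """Hypothetical helper: convert one VCF data line to a GFF line.

    Returns None for header lines and for records that are not simple SNPs.
    """
    if vcf_line.startswith("#"):
        return None
    cols = vcf_line.rstrip("\n").split("\t")
    chrom, pos, _identifier, ref, alt, qual = cols[:6]
    alternatives = alt.split(",")
    # A simple SNP has a one-base REF and only one-base ALT alleles.
    if len(ref) != 1 or any(len(a) != 1 or a == "." for a in alternatives):
        return None
    attributes = "ref=" + ref + ";alt=" + alt
    # GFF columns: seqname, source, feature, start, end, score, strand, frame, attributes
    return "\t".join([chrom, source, "SNP", pos, pos, qual, ".", ".", attributes])


snp = "chr1\t100\t.\tA\tG\t50\t.\tDP=10"
indel = "chr1\t200\t.\tAT\tA\t30\t.\tDP=8"
print(vcf_snp_to_gff(snp))
print(vcf_snp_to_gff(indel))
```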
Exercise 3
● Visualize the bam and bai files in IGV
● Produce a tdf file for the coverage
● Find SNPs from the mpileup file
● Transform it into gff
●