NGS reads aligning and SNP calling


Christophe Klopp - 2012

(2)

Genetic variation

http://en.wikipedia.org/wiki/Genetic_variation

Genetic variation, that is, variation in the alleles of genes, occurs both within and among populations. It is important because it provides the "raw material" for natural selection.

(3)

Types of variations

SNP : Single nucleotide polymorphism

CNV : copy number variation

Chromosomal rearrangement

Chromosomal duplication

http://en.wikipedia.org/wiki/Copy-number_variation

http://en.wikipedia.org/wiki/Human_genetic_variation

(4)

The variation transmission

Mutation: in molecular biology and genetics, mutations are changes in a genomic sequence: the DNA sequence of a cell's genome or the DNA or RNA sequence of a virus (http://en.wikipedia.org/wiki/Mutation).

Mutations are transmitted if they are not lethal.

Mutations can impact the phenotype.

(5)

Genetic markers and genotyping

A set of SNPs is selected along the genome.

The phenotypes are collected for individuals.

The SNPs are genotyped (measured) for the same individuals.

This makes it possible to find locations where the genotype is linked to the phenotype:

Major genes

QTL (Quantitative Trait Loci)

(6)

The 1000 genomes project

Joint project NCBI / EBI

Common data formats :

fastq

SAM (Sequence Alignment/Map)

(7)

Sequence Alignment/Map (SAM) format

Data sharing was a major issue for the 1000 Genomes Project

Capture all of the critical information about NGS data in a single indexed and compressed file

Sharing of data across tools

Generic alignment format

Supports short and long reads (454, Solexa, SOLiD)

Flexible in style, compact in size, efficient in random access

SAM format

Website :

http://samtools.sourceforge.net

Paper :

Li H.*, Handsaker B.*, Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R. and 1000 Genome Project Data Processing Subgroup (2009) The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics, 25, 2078-9. [PMID: 19505943]

(8)

Sequence Alignment/Map (SAM) format

SAM format

(9)

SAM format Header section

Header lines start with @ followed by a two-letter record type code (for example @HD, @SQ, @RG)

Header fields are tab-delimited TAG:VALUE pairs
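As a minimal sketch of how one header line can be split into its two-letter code and its colon-separated fields, assuming tab-delimited fields as in the current SAM specification. The helper name `parse_header_line` and the sample line are illustrative, not taken from the slides.

```python
def parse_header_line(line):
    """Split one SAM header line into (two-letter code, dict of fields)."""
    if not line.startswith("@"):
        raise ValueError("not a header line: %r" % line)
    parts = line.rstrip("\n").split("\t")
    code = parts[0][1:]                  # e.g. HD, SQ, RG, PG, CO
    if code == "CO":                     # comment lines carry free text
        return code, {"text": "\t".join(parts[1:])}
    fields = {}
    for item in parts[1:]:
        # Split on the first colon only: values may themselves contain ':'
        key, _, value = item.partition(":")
        fields[key] = value
    return code, fields


# Illustrative header line (made up for this example)
print(parse_header_line("@SQ\tSN:chr1\tLN:1000"))
```

Values are kept as strings here; converting LN to an integer, for instance, is left to the caller.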

(10)

SAM format Alignment section

11 mandatory fields

Variable number of optional fields

Fields are tab delimited
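A minimal sketch of reading one alignment line into its 11 mandatory fields plus the optional ones. The field names follow this deck (MRNM, MPOS, ISIZE; later versions of the specification rename them RNEXT, PNEXT, TLEN). `SamRecord`, `parse_alignment_line` and the sample line are illustrative assumptions.

```python
from collections import namedtuple

SamRecord = namedtuple(
    "SamRecord",
    "qname flag rname pos mapq cigar mrnm mpos isize seq qual optional",
)


def parse_alignment_line(line):
    """Parse one tab-delimited SAM alignment line."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 11:
        raise ValueError("expected at least 11 fields, got %d" % len(cols))
    return SamRecord(
        qname=cols[0],
        flag=int(cols[1]),
        rname=cols[2],
        pos=int(cols[3]),       # 1-based leftmost position
        mapq=int(cols[4]),
        cigar=cols[5],
        mrnm=cols[6],
        mpos=int(cols[7]),
        isize=int(cols[8]),
        seq=cols[9],
        qual=cols[10],
        optional=cols[11:],     # raw TAG:VTYPE:VALUE strings, if any
    )


# Illustrative record (made up for this example)
line = "read1\t0\tref\t5\t30\t2S4M2D6M3S\t*\t0\t0\tCCTCAGGCATTAGTG\t*\tNM:i:3"
rec = parse_alignment_line(line)
print(rec.qname, rec.pos, rec.cigar, rec.optional)
```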

(11)

SAM format Full example

<QNAME> <FLAG> <RNAME> <POS> <MAPQ> <CIGAR> <MRNM> <MPOS> <ISIZE> <SEQ> <QUAL> [<TAG>:<VTYPE>:<VALUE> [...]]

Header

Alignment

Tags (TAG):

X? : Reserved for end users

NM : Number of nucleotide differences

MD : String for mismatching positions

RG : Read group

[...]

Value types (VTYPE):

A : Printable character

i : Signed 32-bit integer

f : Single-precision float number

Z : Printable string

H : Hex string (high nybble first)
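The optional TAG:VTYPE:VALUE fields can be converted to native values according to their type letter. The following is a sketch covering only the five types listed above; `parse_optional_field` is a hypothetical helper name and the sample fields are made up.

```python
def parse_optional_field(field):
    """Convert one TAG:VTYPE:VALUE string into (tag, typed value)."""
    # Split on the first two colons only: a Z value may contain ':'
    tag, vtype, value = field.split(":", 2)
    if vtype == "i":
        return tag, int(value)
    if vtype == "f":
        return tag, float(value)
    if vtype == "H":
        return tag, bytes.fromhex(value)   # high nybble first
    if vtype in ("A", "Z"):
        return tag, value
    raise ValueError("unsupported value type: %r" % vtype)


for f in ["NM:i:1", "MD:Z:3G8", "RG:Z:group1", "XS:f:0.5"]:
    print(parse_optional_field(f))
```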

(12)

SAM format Flag field

http://picard.sourceforge.net/explain-flags.html
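The FLAG field is a bitwise combination of properties. The following sketch decodes it in the same spirit as the page linked above; the descriptions are paraphrased and `explain_flag` is an illustrative name. Later versions of the SAM specification add a further bit (0x800, supplementary alignment) that is not listed here.

```python
SAM_FLAGS = [
    (0x1, "read paired"),
    (0x2, "read mapped in proper pair"),
    (0x4, "read unmapped"),
    (0x8, "mate unmapped"),
    (0x10, "read on reverse strand"),
    (0x20, "mate on reverse strand"),
    (0x40, "first in pair"),
    (0x80, "second in pair"),
    (0x100, "not primary alignment"),
    (0x200, "read fails quality checks"),
    (0x400, "PCR or optical duplicate"),
]


def explain_flag(flag):
    """Return the descriptions of all bits set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAGS if flag & bit]


print(explain_flag(99))
print(explain_flag(16))
```

For example, 99 = 0x40 + 0x20 + 0x2 + 0x1, that is, a paired read, mapped in a proper pair, first in pair, whose mate is on the reverse strand.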

(13)

SAM format Extended CIGAR format

Ref:   GCATTCAGATGCAGTACGC
Read:    ccTCAG--GCATTAgtg

POS  CIGAR
5    2S4M2D6M3S
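A short sketch of how the CIGAR string of this example can be decoded: soft-clipped bases (S) consume the read but not the reference, deletions (D) consume the reference but not the read. The function names are illustrative.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Return the CIGAR string as a list of (length, operation) pairs."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Rebuild the string to reject anything the pattern did not cover
    if "".join("%d%s" % (n, op) for n, op in ops) != cigar:
        raise ValueError("malformed CIGAR: %r" % cigar)
    return ops


def query_length(ops):
    """Number of read bases described (M, I, S, = and X consume the read)."""
    return sum(n for n, op in ops if op in "MIS=X")


def reference_span(ops):
    """Number of reference bases covered (M, D, N, = and X consume the reference)."""
    return sum(n for n, op in ops if op in "MDN=X")


ops = parse_cigar("2S4M2D6M3S")
pos = 5                                   # 1-based leftmost aligned position
end = pos + reference_span(ops) - 1       # last reference base covered
print(query_length(ops), reference_span(ops), end)   # 15 12 16
```

The read has 15 bases (2 + 4 + 6 + 3) and covers reference positions 5 to 16.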

(14)

BAM format

Binary representation of SAM

Compressed with the BGZF library

Reduces storage space requirements to about 27% of the original SAM file
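Because BGZF blocks are valid gzip members, the start of a BAM file can be read with a standard gzip library; the decompressed stream begins with the four magic bytes "BAM\1". A minimal sketch of such a check follows; `looks_like_bam` is a hypothetical helper, and it only inspects the first bytes, it does not validate the file.

```python
import gzip


def looks_like_bam(path):
    """Return True if the file is gzip-compressed and starts with the BAM magic."""
    with open(path, "rb") as fh:
        if fh.read(2) != b"\x1f\x8b":      # gzip magic number
            return False
    with gzip.open(path, "rb") as fh:
        return fh.read(4) == b"BAM\x01"    # BAM magic after decompression


if __name__ == "__main__":
    import sys
    for name in sys.argv[1:]:
        print(name, looks_like_bam(name))
```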

(15)

SAMtools

Library and software package

Creating sorted and indexed BAM files from SAM files

Removing PCR duplicates

Merging alignments

Visualization of alignments from BAM files

SNP calling

Short indel detection

http://samtools.sourceforge.net/samtools.shtml

(16)

SAMtools

Example usage

(17)

SAMtools Example usage

Create BAM from SAM

samtools view -bS aln.sam -o aln.bam

Sort BAM file

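samtools sort example.bam sortedExample

(samtools 0.1.x syntax, where the last argument is an output prefix; with samtools 1.3 or later, use: samtools sort -o sortedExample.bam example.bam)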

Merge sorted BAM files

samtools merge sortedMerge.bam sorted1.bam sorted2.bam

Index BAM file

samtools index sortedExample.bam

Visualize BAM file

samtools tview sortedExample.bam reference.fa

(18)

Exercise 1

Downloading the sam file :

Visualizing the file content : find the number of exact matching reads

Installing the samtools :

http://samtools.sourceforge.net/samtools.shtml

Producing a sorted and indexed BAM file from the SAM file

(19)

Visualizing the alignment IGV

IGV : Integrative Genomics Viewer

Website : http://www.broadinstitute.org/igv

(20)

Visualizing the alignment IGV

High-performance visualization tool

Interactive exploration of large, integrated datasets

Supports a wide variety of data types

Documentation

Developed at the Broad Institute of MIT and Harvard

(21)

Visualizing the alignment

IGV

(22)

Visualizing the alignment

IGV - Loading the reference

(24)

Visualizing the alignment

IGV - Loading the bam file

(26)

Visualizing the alignment

IGV - Zoom

(28)

Visualizing the alignment

IGV - Loading a gff file

(30)

Visualizing the alignment IGV - Coverage

Generate the coverage information to be displayed in IGV.

java -jar igvtools.jar count aln.bam aln.depth.tdf ref.genome

Remark : ref.genome was generated when we imported the genome sequence

This step is optional, but it is needed if you want to see the read depth when zoomed out to a large scale.

(31)

Visualizing the alignment

IGV - Coverage

(32)

Exercise 2

Open IGV with Java Web Start:

http://www.broadinstitute.org/igv

Create the genome using the fasta file

Load the sorted bam file

(33)

The pileup format

http://samtools.sourceforge.net/pileup.shtml

Chr. - Coord. - Base('*' for indel) - Number of reads covering the site - Read bases* - Base qualities

Read bases :

'.' and ',' : match to the reference base on the forward/reverse strand

'ACTGN' and 'actgn' : for a mismatch on the forward/reverse strand

'^' and '$' : start/end of a read segment

'+[0-9]+[ACGTNacgtn]+' and '-[0-9]+[ACGTNacgtn]+' : insertion/deletion
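The read-bases column can be decoded symbol by symbol. The sketch below counts the symbols listed above. It relies on one detail not stated on the slide: in the pileup format, '^' is followed by one character encoding the mapping quality, which must be skipped. `summarize_read_bases` and the sample string are illustrative.

```python
def summarize_read_bases(bases):
    """Count matches, mismatches, indels and read starts/ends in a pileup read-bases string."""
    counts = {"match": 0, "mismatch": 0, "insertion": 0,
              "deletion": 0, "read_start": 0, "read_end": 0}
    i = 0
    while i < len(bases):
        c = bases[i]
        if c in ".,":
            counts["match"] += 1
        elif c in "ACGTNacgtn":
            counts["mismatch"] += 1
        elif c == "^":
            counts["read_start"] += 1
            i += 1                      # skip the mapping-quality character
        elif c == "$":
            counts["read_end"] += 1
        elif c in "+-":
            j = i + 1
            while j < len(bases) and bases[j].isdigit():
                j += 1
            length = int(bases[i + 1:j])
            counts["insertion" if c == "+" else "deletion"] += 1
            i = j + length - 1          # jump to the last base of the indel sequence
        # any other symbol (for example '*') is ignored in this sketch
        i += 1
    return counts


print(summarize_read_bases("..,^F.A+2AG,t-1c.$"))
```

On the sample string this gives 6 matches, 2 mismatches, 1 insertion, 1 deletion, 1 read start and 1 read end.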

(34)

Using mpileup

Get the raw pileup:

samtools mpileup -f ref.fa aln.bam > raw.txt

ref.fa : reference genome in FASTA format

aln.bam : sorted BAM file containing the alignments

raw.txt : output in pileup format, with consensus calls

-f : reference sequence, ref.fa (in FASTA format)

(35)

Mpileup output

(36)

Variant Calling

samtools mpileup -uf ref.fa aln1.bam aln2.bam | bcftools view -bvcg - > var.raw.bcf

bcftools view var.raw.bcf | vcfutils.pl varFilter -D100 > var.flt.vcf

These commands use the samtools/bcftools 0.1.x syntax; in bcftools 1.x the calling step is performed by bcftools call.

var.raw.bcf : binary compressed variants

var.flt.vcf : filtered variants in text format

-b : output BCF instead of VCF

-v : output potential variant sites only (force -c)

-c : SNP calling (force -e)

-g : call genotypes at variant sites (force -c)

-D100 : maximum read depth [10000000]
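To illustrate what a maximum read depth filter does, here is a rough sketch that drops VCF records whose depth exceeds a threshold. It is not a reimplementation of vcfutils.pl varFilter: it assumes the depth is available as a DP= entry in the INFO column (column 8), and that records at exactly the threshold are kept. The function names and sample records are made up.

```python
def info_depth(vcf_line):
    """Return the DP value from the INFO column of a VCF record, or None."""
    info = vcf_line.rstrip("\n").split("\t")[7]
    for entry in info.split(";"):
        key, _, value = entry.partition("=")
        if key == "DP":
            return int(value)
    return None


def keep_by_depth(vcf_lines, max_depth=100):
    """Yield header lines and the records whose depth does not exceed max_depth."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line
            continue
        depth = info_depth(line)
        if depth is None or depth <= max_depth:
            yield line


records = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50\t.\tDP=30;AF1=1",
    "chr1\t200\t.\tC\tT\t60\t.\tDP=250",
]
for kept in keep_by_depth(records, max_depth=100):
    print(kept)
```

Very high depth often indicates repeats or collapsed duplications, which is why a ceiling on depth is a common filter.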

(37)

VCF format

(38)

Exercise 3

Visualize the bam and bai files in IGV

Produce a tdf file for the coverage

Find SNPs from the mpileup file

Transform it into gff

Load the gff in IGV
