DNA Sequencing data analysis - Genomic characterization of basal cell carcinoma of the skin

Sequencing reads were mapped to the Genome Reference Consortium release h37 (GRCh37) human reference genome with the Burrows-Wheeler Aligner (BWA) (Li and Durbin, 2009), and sequencing quality and target enrichment were verified with Picard tools metrics (http://broadinstitute.github.io/picard/). The mapped reads were then processed with the Genome Analysis Toolkit (GATK version 3.3.0) (McKenna et al., 2010) following the best practices (valid at least through June 2015) for exomes and the specific indications for smaller targeted regions. The general steps are described below, and a general description of the programs and formats can be found in Box 3 or the glossary:

BAM file preprocessing using Picard tools:

• Reads were sorted by coordinate (SortSam tool).

• Duplicates were marked (MarkDuplicates tool).

• Read group information was added (AddOrReplaceReadGroups tool).

• An updated index was made for the preprocessed BAM files (BuildBamIndex tool).
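The four Picard steps above can be sketched as a small Python helper that only builds the command lines, without running them. This is a minimal sketch, not the commands actually used in this work: the single-jar `picard.jar` invocation, the intermediate file names, and the read group values (including `RGPL=illumina`) are assumptions made for illustration.

```python
from typing import List


def picard_preprocessing_commands(
    bam: str, sample: str, picard_jar: str = "picard.jar"
) -> List[List[str]]:
    """Build (but do not run) the four Picard preprocessing commands for one BAM."""
    stem = bam[:-4] if bam.endswith(".bam") else bam
    # Hypothetical intermediate file names, one per step.
    sorted_bam = f"{stem}.sorted.bam"
    dedup_bam = f"{stem}.dedup.bam"
    rg_bam = f"{stem}.rg.bam"
    base = ["java", "-jar", picard_jar]
    return [
        # 1. Sort reads by coordinate.
        base + ["SortSam", f"INPUT={bam}", f"OUTPUT={sorted_bam}",
                "SORT_ORDER=coordinate"],
        # 2. Mark duplicate reads.
        base + ["MarkDuplicates", f"INPUT={sorted_bam}", f"OUTPUT={dedup_bam}",
                f"METRICS_FILE={stem}.dup_metrics.txt"],
        # 3. Add read group information (all values here are placeholders).
        base + ["AddOrReplaceReadGroups", f"INPUT={dedup_bam}", f"OUTPUT={rg_bam}",
                f"RGID={sample}", f"RGLB={sample}", "RGPL=illumina",
                f"RGPU={sample}", f"RGSM={sample}"],
        # 4. Index the final preprocessed BAM.
        base + ["BuildBamIndex", f"INPUT={rg_bam}"],
    ]


if __name__ == "__main__":
    for command in picard_preprocessing_commands("tumor_sample.bam", "tumor_sample"):
        print(" ".join(command))
```

Each returned list can be passed to `subprocess.run(command, check=True)` to execute the step; building the commands separately makes the chain of intermediate files easy to inspect first.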

GATK steps were carried out:

• The regions in the dataset requiring local realignment due to indels were determined (RealignerTargetCreator tool).

• The reads were realigned over the identified intervals (IndelRealigner tool).

• Positions with mismatches to the reference whose quality score needed to be adjusted were identified (BaseRecalibrator tool).

• The adjustments of quality scores were applied to the BAM files (PrintReads tool).

The BaseRecalibrator tool was not applied to Cancer Panel-sequenced samples based on the recommended best practices for small target regions.
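The GATK steps, including the option of skipping recalibration for small target regions, can be sketched in the same way. This is a hedged illustration using GATK 3 command-line arguments; the intermediate file names, the choice of which steps receive the interval file (`-L`), and the use of a single known-sites VCF are assumptions, not details taken from this work.

```python
from typing import List, Optional


def gatk_preprocessing_commands(
    bam: str,
    reference: str,
    intervals: str,
    known_sites: Optional[str] = None,
    recalibrate: bool = True,
    gatk_jar: str = "GenomeAnalysisTK.jar",
) -> List[List[str]]:
    """Build (but do not run) GATK 3 realignment and recalibration commands."""
    stem = bam[:-4] if bam.endswith(".bam") else bam
    # Hypothetical intermediate file names.
    targets = f"{stem}.realign_targets.intervals"
    realigned = f"{stem}.realigned.bam"
    base = ["java", "-jar", gatk_jar]
    commands = [
        # 1. Find regions that need local realignment around indels.
        base + ["-T", "RealignerTargetCreator", "-R", reference, "-I", bam,
                "-L", intervals, "-o", targets],
        # 2. Realign reads over those regions.
        base + ["-T", "IndelRealigner", "-R", reference, "-I", bam,
                "-targetIntervals", targets, "-o", realigned],
    ]
    if recalibrate:
        if known_sites is None:
            raise ValueError("BaseRecalibrator needs a known-sites VCF")
        table = f"{stem}.recal.table"
        commands += [
            # 3. Model base quality score adjustments.
            base + ["-T", "BaseRecalibrator", "-R", reference, "-I", realigned,
                    "-L", intervals, "-knownSites", known_sites, "-o", table],
            # 4. Apply the adjustments and write the recalibrated BAM.
            base + ["-T", "PrintReads", "-R", reference, "-I", realigned,
                    "-BQSR", table, "-o", f"{stem}.recal.bam"],
        ]
    return commands


if __name__ == "__main__":
    # Exome sample: all four steps. Panel sample: realignment only.
    exome = gatk_preprocessing_commands(
        "tumor_sample.bam", "human_h37.fasta",
        "exome_coordinates.interval_list", known_sites="variants_in_dbSNP.vcf")
    panel = gatk_preprocessing_commands(
        "panel_sample.bam", "human_h37.fasta",
        "panel_coordinates.interval_list", recalibrate=False)
    print(len(exome), len(panel))
```

Calling the helper with `recalibrate=False` mirrors the treatment of the Cancer Panel-sequenced samples, for which only the two realignment steps are produced.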

When a target region needed to be provided, the exome target set for the Genome Reference Consortium release h37 provided by GATK was used for exome-sequenced samples. For Cancer Panel-sequenced samples, a file containing the coordinates of the Cancer Panel exons was used.

2.1. Somatic variant calling

SNVs were called on a set of 39 exome-sequenced tumor samples and their matched normals. Variants were called with SAMtools (Li, 2011) mpileup on both tumor and germline samples, using standard parameters. Only variants with QS >50 and CV >20 that were absent from the matched germline were considered somatic.
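The somatic filter applied to the SAMtools calls can be sketched as follows. This is a minimal illustration only: each variant is represented as a dictionary with hypothetical keys `site`, `QS` and `CV`, where `QS` and `CV` are simply taken as the numeric per-variant values named in the text, and a site is assumed to be identified by chromosome, position, reference allele and alternate allele.

```python
from typing import Any, Dict, Iterable, List, Set, Tuple

# (chromosome, position, reference allele, alternate allele)
Site = Tuple[str, int, str, str]


def somatic_candidates(
    tumor_calls: Iterable[Dict[str, Any]],
    germline_sites: Set[Site],
    min_qs: float = 50,
    min_cv: float = 20,
) -> List[Dict[str, Any]]:
    """Keep tumor calls with QS > min_qs and CV > min_cv that are absent
    from the matched germline sample."""
    return [
        call
        for call in tumor_calls
        if call["QS"] > min_qs
        and call["CV"] > min_cv
        and call["site"] not in germline_sites
    ]


if __name__ == "__main__":
    # Toy data, not taken from the study.
    tumor = [
        {"site": ("1", 100, "A", "T"), "QS": 60, "CV": 30},  # kept
        {"site": ("1", 200, "C", "G"), "QS": 60, "CV": 30},  # in germline
        {"site": ("2", 300, "G", "A"), "QS": 40, "CV": 30},  # QS too low
    ]
    germline = {("1", 200, "C", "G")}
    print(somatic_candidates(tumor, germline))
```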

Somatic variants were also called in this same set of samples with MuTect (Cibulskis et al., 2013) version 1.1.4 with standard parameters, and using the matched germline as control.

The commands used were similar to this example:

java -jar muTect-1.1.4.jar \
    --analysis_type MuTect \
    --reference_sequence human_h37.fasta \
    --cosmic variants_in_cosmic.vcf \
    --dbsnp variants_in_dbSNP.vcf \
    --intervals exome_coordinates.interval_list \
    --input_file:normal matched_germline.bam \
    --input_file:tumor tumor_sample.bam \
    -o somatic_SNV.out

The numbers of somatic variants identified by the two methods were compared. Both performed well at identifying somatic SNVs when the tumor fraction was high (>20%).

In contrast, when the fraction of tumor cells in the sample was low, the number of variants found by SAMtools decreased considerably, both relative to MuTect and relative to its own performance on samples with a higher tumor fraction (Figure 20). This observation confirmed that MuTect is a suitable method for somatic variant calling in our BCC samples, especially when the tumor fraction is low (Cibulskis et al., 2013). Furthermore, MuTect is widely used for somatic variant calling, and several publications support its application (Banerji et al., 2012, Cancer Genome Atlas Research, 2008, Stransky et al., 2011, Stieglitz et al., 2015, Ojesina et al., 2014).


For somatic indel calling, the performance of Pindel and of HaplotypeCaller, one of the variant-calling algorithms provided by the GATK, was compared on 23 tumor-germline pairs. From previous experience with other sequencing projects in the lab, we considered the false positive rate of Pindel (the somatic indel caller of choice in the laboratory) to be quite high, and its quality scores in the mid ranges not fully reliable when evaluating variants. Additionally, HaplotypeCaller takes as input the BAM files that have been realigned around indels, which reduces the number of false positive and false negative calls.

HaplotypeCaller is a variant caller, not necessarily a somatic variant caller. In order to identify somatic indels with it, the following steps were carried out:

Box 3. Programs and Formats used in HTS data processing

BAM format: a BAM file is a binary version of a SAM file, which stands for Sequence Alignment/Map format, a TAB-delimited text format of an alignment containing information on each read.

GATK: the Genome Analysis Toolkit is a software package for the analysis of high-throughput sequencing data developed by the data science and engineering group at the Broad Institute (McKenna et al., 2010, Van der Auwera et al., 2013).

HaplotypeCaller: is the principal variant caller algorithm from the GATK. It calls SNVs and indels simultaneously using de novo assembly and a Bayesian statistical model.

IGV: the Integrative Genomics Viewer is an interactive visualization tool for the exploration of large genomic datasets developed by Robinson et al. (2011).

Igvtools: is a set of utilities to pre-process data files for the IGV and for other programs.

MuTect: is a tool that identifies somatic point mutations in high-throughput sequencing data of cancer genomes, developed as part of the GATK-related programs (Cibulskis et al., 2013).

PicardTools: a set of Java command line tools for the manipulation of high-throughput sequencing data and formats.

Pindel: is an algorithm that detects deletions and insertions of small, medium and large sizes from paired-end short reads, developed by Ye et al. (2009).

SAMtools: is a set of utilities for manipulating and processing short DNA sequence read alignments in different formats developed by Li et al. (2009).

SAMtools tview: is a text alignment viewer to visualize the alignment of short reads to the reference genome and uses different colors and symbols to display different read and base specific features.

VCF format: the Variant Call Format is a text file format that stores gene sequence variation information in a compact way along with general information about the sample and the genome used in its processing.

• Variant calling was carried out on the germline samples with the standard parameters of HaplotypeCaller.

• Variant calling on the tumor sample was carried out with HaplotypeCaller, using the VCF file of the germline sample as input for the comparison (“--comp”) argument.

This argument flags all the variants from the tumor sample that are present in the germline VCF file, allowing them to be easily filtered out or further investigated.

Command example:

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
    --reference_sequence human_h37.fasta \
    --comp germline_variants.vcf \
    -I 5-VS056-T1recalreads.bam \
    --intervals exome_coordinates.interval_list \
    --genotyping_mode DISCOVERY \
    -stand_emit_conf 10 \
    -stand_call_conf 30 \
    -nct 10 \
    -o tumor_variants.vcf

• All indels flagged as present in the matched germline sample were removed. Only indels marked as somatic (not present in the germline control) were kept.

HaplotypeCaller identified fewer somatic indels than Pindel in all samples (Figure 20).
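The final filtering step, keeping only indels that were not flagged as present in the germline VCF, can be sketched as below. This is an illustrative sketch rather than the script used here: it assumes the flag appears in the INFO column under the name `comp` (the name is exposed as a parameter in case the comparison track was named differently), and it treats a record as an indel when any alternate allele differs in length from the reference allele.

```python
from typing import Iterable, List


def somatic_indels(vcf_lines: Iterable[str], comp_flag: str = "comp") -> List[str]:
    """Return VCF records that are indels and lack the germline comparison flag."""
    kept = []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip header and empty lines
        fields = line.split("\t")
        ref, alts, info = fields[3], fields[4].split(","), fields[7]
        # Ignore symbolic alleles such as <NON_REF> when testing for indels.
        is_indel = any(
            len(alt) != len(ref) for alt in alts if not alt.startswith("<")
        )
        in_germline = comp_flag in info.split(";")
        if is_indel and not in_germline:
            kept.append(line)
    return kept


if __name__ == "__main__":
    # Toy records, not taken from the study.
    records = [
        "##fileformat=VCFv4.1",
        "1\t100\t.\tA\tAT\t50\t.\tAC=1;comp\tGT\t0/1",  # indel seen in germline
        "1\t200\t.\tGTC\tG\t60\t.\tAC=1\tGT\t0/1",      # somatic indel candidate
        "1\t300\t.\tC\tT\t70\t.\tAC=1\tGT\t0/1",        # SNV, not an indel
    ]
    for record in somatic_indels(records):
        print(record)
```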

Several examples were visually inspected with SAMtools tview and igvtools to determine whether the indels called by Pindel were correct. In all the cases examined, the indels called by Pindel were considered false positives. After these comparisons among variant callers, we settled on MuTect and HaplotypeCaller with the comparison mode for somatic variant calling on tumors for which a germline control was available.


For the samples for which a matched germline was not available, a Panel of Normals (PON) was constructed from 50 germline samples.

A PON is useful to reduce false positives and to remove germline events that might not be in the dbSNP VCF file provided to MuTect. It is created by running MuTect on germline samples as if they were tumors without a matched germline sample. The resulting VCF file contains the sites that were called as somatic SNVs by MuTect.

Figure 20. Comparison of the number of somatic variants identified by SAMtools and MuTect, or by Pindel and HaplotypeCaller. a. Number of somatic SNVs called with SAMtools and MuTect on 39 tumor-germline pairs. The number of somatic SNVs per sample is indicated on the y axis. On the x axis, tumor samples are ordered by decreasing tumor fraction from the left. The red circles indicate different tumor fractions throughout the dataset. b. Number of somatic indels called with Pindel and HaplotypeCaller on 23 tumor-germline pairs. The number of somatic indels per sample is indicated on the y axis. Tumor samples are displayed on the x axis.


The PON file is provided as an argument when running MuTect on samples with no matched germline. MuTect will reject all sites present in the PON, and will not call them as somatic unless they are in the COSMIC variant file (Cibulskis et al., 2013). Since only samples sequenced with the Cancer Panel did not have a matched germline, the PON only covers variation within the genes of the Cancer Panel.
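The PON rule described above (reject a site found in the PON unless it is also in COSMIC) can be written as a one-line predicate. This sketch only restates the rule as given in the text; it is not MuTect's implementation, and the site representation (chromosome, position) is an assumption.

```python
from typing import Iterable, List, Set, Tuple

# (chromosome, position)
Site = Tuple[str, int]


def passes_pon(site: Site, pon_sites: Set[Site], cosmic_sites: Set[Site]) -> bool:
    """A site survives if it is absent from the PON, or present in COSMIC."""
    return site not in pon_sites or site in cosmic_sites


def filter_with_pon(
    candidates: Iterable[Site], pon_sites: Set[Site], cosmic_sites: Set[Site]
) -> List[Site]:
    """Apply the PON rule to a list of candidate somatic sites."""
    return [s for s in candidates if passes_pon(s, pon_sites, cosmic_sites)]


if __name__ == "__main__":
    # Toy sites, not taken from the study.
    pon = {("1", 100), ("1", 200)}
    cosmic = {("1", 200)}
    candidates = [("1", 100), ("1", 200), ("2", 300)]
    print(filter_with_pon(candidates, pon, cosmic))
```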

Even with PON filtering, germline variants can wrongly be considered somatic when no matched germline is available. For this reason, only stop-gain, splice-site, or probably damaging non-synonymous mutations were considered putative drivers in these samples.

In order to estimate the fraction of false positive somatic variants in tumors without a matching control, we reanalyzed with a PON 43 tumors for which we have a matched germline sample. We then compared the number of mutations identified with either germline or PON filtering. We estimated that the number of false somatic variants is no more than 15% of the total number of variants in non-matched samples. Of the false somatic variants, only 4.6% were “false” drivers, not filtered out by the PON but present in the matched germline sample.
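One way to compute such an estimate is to treat every variant that survives PON filtering but not matched-germline filtering as a false somatic call. The sketch below follows that reading; the exact formula used in this work is not given in the text, so the definition and the variant identifiers are assumptions.

```python
from typing import Hashable, Set


def false_somatic_fraction(
    pon_filtered: Set[Hashable], germline_filtered: Set[Hashable]
) -> float:
    """Fraction of PON-filtered calls that matched-germline filtering rejects."""
    if not pon_filtered:
        return 0.0
    false_calls = pon_filtered - germline_filtered
    return len(false_calls) / len(pon_filtered)


if __name__ == "__main__":
    # Toy variant identifiers, not taken from the study.
    with_pon = {"v1", "v2", "v3", "v4"}
    with_germline = {"v1", "v2", "v3"}
    print(false_somatic_fraction(with_pon, with_germline))
```

With these toy sets, one of the four PON-filtered calls is absent from the germline-filtered set, so the function returns 0.25.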

2.2. Sanger Sequencing

The LATS1 gene was not part of the Cancer Panel (see Results). We were interested in the status of a particular LATS1 residue, R995, which was found to be mutated in a significant fraction of the exome dataset. The mutation status of R995 in the samples sequenced with the Cancer Panel was evaluated by PCR amplification of the region of interest and subsequent Sanger sequencing of the PCR product.

PCR reactions were set up with 1 µL DNA (~60 ng/µL), 1 µL of each primer (Figure 21a), 10 µL JumpStart Taq ReadyMix (SIGMA P2893) and 8 µL water, for a final volume of 20 µL per sample. A standard PCR program was used to amplify the regions. Some samples were loaded on an agarose gel to verify that a single PCR product had been obtained. All samples were then purified with 20 µL microCLEAN (microzone, 2MCL) and resuspended in 25 µL TE 1x.

The concentration of some samples was measured with a NanoDrop spectrophotometer to verify that it was as expected. All samples were then Sanger sequenced.

Only three LATS1 variants were identified in the Sanger-sequenced samples and were added to the list of somatic variants. An example can be seen in Figure 21b.

Because variants at low allele frequency are difficult to detect with Sanger sequencing, we believe, based on the frequency of LATS1 mutations in the exome-sequenced samples, that at least three or four other tumors harbor LATS1 mutations that could not be identified.

2.3. Somatic copy number aberration (SCNA) analysis

SCNAs were inferred from the data of the exome-sequenced samples, taking into account allelic coverage, percentage of reads for germline heterozygous variants, and fraction of tumor cells in a sample.

The regions of SCNAs in the tumor were probabilistically inferred using a twenty state

Figure 21. LATS1 mutational assessment in Cancer Panel-sequenced samples. a. PCR primers designed to amplify the region flanking residue R995 of the LATS1 gene. b. Example of a LATS1 somatic variant identified with Sanger sequencing (sample 5-WS013-T1). In this sample, the position 148 (marked with a red arrow) corresponds to c.2983. The mutation causes a p.R995C substitution.


where $C_j^{T/N}$ represents the normalized coverage of exon $j$ in the tumor/normal, $P_j^{T/N}$ the percentage of reads covering the major allele in the tumor/normal, and the fraction of tumor cells equals 1 − contamination due to normal cells.

According to (1), the non-redundant admissible states are given by the outer product of the vector $C$ of possible coverages normalized with respect to a diploid genome, [0, 0.5, 1, 1.5, 2], and the vector $P$ of coverage fractions of the major allele, [0, 16.6, 25, 50]. We modeled the observed […]

After identification, the SCNAs were manually inspected. GISTIC 2.0.16 (Mermel et al., 2011) was then applied to these data to reveal regions with significant copy number gains and losses.
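The twenty admissible states mentioned earlier in this section follow directly from pairing each of the five normalized coverage values with each of the four major-allele coverage fractions. The short sketch below enumerates them; it only illustrates the state space, not the inference model itself, and the variable names are chosen for illustration.

```python
from itertools import product
from typing import List, Sequence, Tuple

# Values as listed in the text.
COVERAGES = [0, 0.5, 1, 1.5, 2]            # normalized to a diploid genome
MAJOR_ALLELE_FRACTIONS = [0, 16.6, 25, 50]  # coverage fraction of the major allele


def admissible_states(
    coverages: Sequence[float], fractions: Sequence[float]
) -> List[Tuple[float, float]]:
    """Enumerate every (coverage, major-allele fraction) pair."""
    return list(product(coverages, fractions))


if __name__ == "__main__":
    states = admissible_states(COVERAGES, MAJOR_ALLELE_FRACTIONS)
    print(len(states))  # 5 coverages x 4 fractions
    for state in states:
        print(state)
```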
