• Aucun résultat trouvé

DNA quality control and library preparation

1. Sample collection and preparation

1.2. DNA quality control and library preparation

DNA from the frozen or the AllProtect-preserved tumors was extracted with the QIAGEN DNeasy Blood and Tissue Kit. DNA from corresponding blood was extracted using a QIAGEN Autopure LS. The quality of the DNAs obtained with these two methods was evaluated by gel electrophoresis. Only samples showing a clear band of good quality

83 of 299

genomic DNA were used in the study. The concentration of the extracted DNA was measured with a Qubit fluorometer (Invitrogen).

For FFPE samples, punch biopsies or whole excisates with the characteristics mentioned in the previous section were used for DNA extraction. A new microtome blade was used for each block and the necessary tools were cleaned with xylene and 100% ethanol in between samples to avoid cross-contamination during sectioning. From each block, the first 5 slices (of 5µm) were discarded and the subsequent 10-20 slices of 10µm (depending on the size of the embedded sample) were transferred to a 1.5 ml Eppendorf tube and used for DNA extraction with the Qiagen QIAmp DNA FFPE Tissue kit. Concentration of the extracted samples was quantified with a Qubit fluorometer and subsequently, the quality of all DNA samples was evaluated with gel electrophoresis if sample concentration was adequate (20 ng/µl or more) or with the Agilent TapeStation genomic DNA ScreenTape when DNA amounts were low.

DNA extracted from FFPE samples is usually highly fragmented and quantities can be small, making the objective evaluation of a sample’s quality difficult. Initially, about 50% of the samples used to test the DNA extraction and library preparation workflow failed library preparation or sequencing.

Carrying out whole genome amplification (WGA) to obtain high molecular weight DNA did not solve the problem, as this process requires the presence of adequate template to work properly and some of the samples were too fragmented to be used in WGA. In some other cases, most of the sample was lost after sonication, or was washed away during purification steps. Furthermore, carrying out some of the library preparation steps (such as adapter ligation) with the incorrect amount of DNA and without proper adjustment of reagent’s ratios, can lead to poor performance that is reflected in subsequent steps. It is not practical to troubleshoot in a per-sample basis when large numbers are being prepared and when amounts of sample are limited, so we aimed at a solution applicable to all samples.

In order to identify samples adequate to undergo library preparation, the DNA integrity number (DIN) provided by the Agilent TapeStation was used during sample triage. The DIN ranges from 0 to 10. According to Agilent TapeStation literature, the higher the DIN

84 of 299

number, the higher a sample’s integrity is, (http://www.agilent.com/cs/library/

applications/5991-5258EN.pdf Last accessed 16.11.2015), as can be observed in the electronic gel image generated from the electropherogram of samples of different degradation levels (Figure 18).

Samples with a DIN of three or more were used for library preparation as per Agilent’s recommendation (Personal communication M. Walter and R. Beneke, Agilent Technologies).

Although some samples had a relatively good DIN (the highest obtained was 6.7), most samples did not. The median DIN for these FFPE samples was 4.8.

By combining the DIN and enhancements in the library preparation protocol described below, the success rate for FFPE library preparation increased to virtually 100% (all the libraries prepared after these adjustments were successful and the sequencing data they produced was deemed of good quality for analysis).

Figure 18. DNA degradation series analyzed with Agilent Genomic DNA ScreenTape. Gel image of a degradation series of 15 genomic DNA samples at 60ng/µL concentration. The DINs of the different samples is displayed at the bottom of the lane. The blue arrow on top shows how the increased degradation of a sample correlates to a lower DIN. Reproduced from http://www.agilent.com/cs/library/ applications/5991-5258EN.pdf Last accessed 16.11.2015

85 of 299 1.2.2 DNA Library preparation

1.2.2.1. Library preparation for exome sequencing

Libraries for exome sequencing were prepared for 68 of the 104 frozen/AllProtect preserved tumors with a matched germline sample, which was also prepared for exome sequencing. Exome libraries were prepared with the Agilent SureSelect Human All Exon V5 with either 1µg or 200ng DNA as input, following Agilent’s protocols.

Briefly, 1µg or 200ng input DNA was sheared in a Covaris S2 sonicator for six cycles of 60 seconds, with the parameters recommended by Agilent (Table 4). After sonication and sample purification with Agencourt AMPure XP beads following Agilent’s recommendations, the ends of the fragmented DNA were repaired and adenylated on their 3’end. Universal paired-end adaptors were ligated and the input material was amplified with 5 (with 250 ng of adaptor ligated DNA, when starting library preparation with ≥1µg DNA) or 10 PCR cycles (when starting library preparation material is 200 ng) with the conditions specified in the protocol. After amplification, 750 ng of library in a 3.4 µl volume was reached through sample concentration in a vacuum concentrator with low heat. The library was then hybridized to the biotinylated RNA exome capture library probe set (Agilent SureSelect Human All Exon V5) for 16 hours. The selected/hybridized fragments were captured with Dynabeads MyOne Streptavidin T1 magnetic beads and unspecific binding was washed following Agilent’s standard protocol.

Setting Value sequencing. Reproduced from http://www.agilent.com/cs/library/usermanuals/Public/

G7530-90000_SureSelect_IlluminaXTMultiplexed_1.8.pdf Last accessed 16.11.2015

86 of 299

After washing, the selected DNA was further amplified and sample-specific indexes were added with 10 cycles of PCR following Agilent indications. Indexed library concentration was measured with a Qubit fluorometer and its size distribution and overall integrity were evaluated with an Agilent Bioanalyzer High Sensitivity chip.

1.2.2.2. Library preparation for Cancer Panel sequencing

For the 100 FFPE samples, the remaining 36 fresh tissue/blood paired samples and 21 unpaired fresh tissue samples, the Cancer Panel, a custom Agilent target sequencing panel, was used (Box 2).

Box 2. Cancer Panel

This custom Agilent target sequencing panel (called throughout the text “Cancer Panel”) was developed in the lab to study cancer samples. It contains 387 cancer-related genes that were selected based on their mutational profiles in the COSMIC v71 database (Forbes et al., 2015 CosmicDatabase). These genes’ fraction of truncating and recurrent mutations, and their presence in different tumor types was taken into account when selecting them. All cancer related genes reported in recent reviews (Vogelstein et al., 2013, Lawrence et al., 2014) were also included in the Cancer Panel. The genes included are the following:

ABL1 recommended Agilent SureSelect Custom Target protocols, with the same overall

87 of 299

processing as the exome-sequenced samples, with the following modifications in the protocol:

Standard pre-capture PCR of 10 cycles in all cases.

The RNA capture library probe set used was an Agilent custom made Cancer Panel instead of the Agilent SureSelect Human All Exon V5.

The volume of RNA capture library used was 1 µL instead of 2 µL as recommended in the protocol. The volume of library used was 400 ng in 1.7 µL instead of 750 ng in 3.4 µL as recommended in the protocol. In concordance, all the other volumes for hybridization and sequence capture were halved.

Standard post-capture PCR of 13 cycles instead of 10.

Originally, the FFPE samples were prepared using the same Agilent protocol and 400 ng or more DNA as input, but the success rate for library preparation was 60%, with an additional 10% of the samples appearing to produce a good library but failing at the sequencing step. Additionally, the amount of sample required was not always available and initially, a significant number of the samples were deemed insufficient for library preparation.

In order to improve the quality of the libraries prepared, the following steps were implemented besides the DIN adoption (described in the previous section) to determine sample’s quality:

1. Sonication optimization. Sonication was performed on the Covaris S2 system.

After testing different sonication times and parameter settings for an FFPE extracted sample (200 ng DNA in 50 µL TE buffer 1x), the conditions recommended in the protocol were adjusted as follows: The sonication time was shortened from the Agilent standard (shown in Table 4), to 3 minutes (3 pulses of 60 seconds each). The rest of the parameters were kept the same. Since the DNA was already partially fragmented, 3 minutes was enough to provide a good fragment size distribution of the input material, and was comparable to the results obtained by sonicating for the standard time genomic DNA samples (Figure 19).

88 of 299

2. Library optimization. Instead of using the Agilent reagents for library preparation, other protocols were investigated and discussed with Agilent’s technical experts. The option recommended by them, and that yielded the best results in library preparation, was the KAPA Hyper Prep Kit for Illumina platforms (KR0961–v1.14). This kit and protocol are optimized for low quantity or low quality samples, including FFPE-extracted material, producing good libraries with quantities as small as 1 ng input DNA.

This protocol allowed the inclusion in the study of samples for which only small amounts of DNA were available, as well as those that had significant levels of degradation (but a DIN still ≥ 3).

Good quality libraries from 50 to 200 ng degraded DNA as starting material were produced.

The principles of the sample preparation using the KAPA protocol are the same as for the Agilent protocol, but the number of purification steps is lower and the enzymes used for Figure 19. Different sonication times for FFPE-extracted DNA. a. Gel image of four different time points tested. Ladder to the left. Reproduced from Agilent bioanalyzer report document. b.

Electropherograms of four different time points tested, labeled on top. The images were obtained by loading 1 µL of sonicated DNA in an Agilent Bioanalyzer DNA High Sensitivity chip. The bottom two images are examples of the profiles of two genomic DNA samples after standard sonication times (6 pulses of 60 seconds). Time point 4 and genomic DNA examples were purified with AMPure magnetic beads to remove fragments <120 bp before loading on the chip.

a

b

89 of 299

Table 5. Differences between Agilent SureSelect and KAPA Hyper protocols for library preparation. The fields highlighted in blue represent the steps that ultimately were carried out during the preparation of the FFPE Cancer Panel libraries. With information from http://www.agilent.com/cs/library/usermanuals/Public/G7530-90000_SureSelect_Illumina XTMultiplexed_1.8.pdf Last accessed 16.11.2015 and https://www.kapabiosystems.com/

assets/KAPA_Hyper_Prep_Kit_TDS.pdf Last accessed 16.11.2015

ligation of adaptors, as well as amplification of the libraries, are different. These enzymes have been shown to be more efficient, especially at the adapter ligation step and PCR

A-tailing Yes No (single step with end repair)

Purification after adaptor

90 of 299 1.3. Sequencing of exome and cancer panel genes

Libraries were sequenced in an Illumina HiSeq2000, 100bp paired end to a median coverage of 100x (range 76-204x) for exomes and 171x (range 34-952x) for cancer panel.

Average Illumina error rate for FFPE sequenced samples was 0.23 +/- 0.09% for read 1 and 0.4 +/- 0.10% for read 2, with a cluster density of ~1000 K/mm2. The mean of >= Q30 bases was 92.31%. For fresh tissue and blood samples, the average Illumina error rate was of 0.24 +/- 0.05% for read 1 and 0.37 +/- 0.08% for read 2, with a cluster density of ~900 K/mm2. The mean of >= Q30 bases was 91.83%.

91 of 299

2. DNA Sequencing data analysis

Sequencing reads were mapped to the Genome Reference Consortium release h37 reference human genome with the Burrows-Wheeler Aligner (BWA) (Li and Durbin, 2009) and sequencing quality and target enrichment was verified with Picard tools metrics (http://broadinstitute.github.io/picard/). The mapped reads were then processed with the Genome Analysis Toolkit (GATK version 3.3.0) (McKenna et al., 2010) following best practices (valid at least through June 2015) for exome and specific indications for smaller targeted regions. The general steps are described below and a general description of the programs and formats can be found in Box 3 or the glossary:

BAM files preprocessing using Picard tools:

Reads were sorted by coordinate number (SortSam tool).

Duplicates were marked (MarkDuplicates tool).

Read group information was added (AddOrReplaceReadGroups tool).

An updated index was made for the preprocessed BAM files (BuildBamIndex tool).

GATK steps were carried out:

The regions in the dataset requiring local realignment due to indels were determined (RealignerTargetCreator tool).

The reads were realigned over the identified intervals (IndelRealigner tool).

Positions with mismatches to the reference whose quality score needed to be adjusted were identified (BaseRecalibrator tool).

The adjustments of quality scores were applied to the BAM files (PrintReads tool).

The BaseRecalibrator tool was not applied to Cancer Panel-sequenced samples based on the recommended best practices for small target regions.

When a target region needed to be provided, the exome set provided by GATK of the Genome Reference Consortium release h37 was used for exome-sequenced samples. For Cancer Panel-sequenced samples, a file containing the coordinates of the Cancer Panel exomes was used.

92 of 299 2.1. Somatic variant calling

SNVs were called on a set of 39 tumor samples and their matched normal that had been exome sequenced. Variants were called with SAMtools (Li, 2011) mpileup on both tumor and germline samples, using standard parameters. Only variants with QS >50, CV >20 and absent from the matched germline, were considered somatic.

Somatic variants were also called in this same set of samples with MuTect (Cibulskis et al., 2013) version 1.1.4 with standard parameters, and using the matched germline as control.

The commands used were similar to this example:

muTect-1.1.4.jar

--analysis_type MuTect \

--reference_sequence human_h37.fasta \ --cosmic variants_in_cosmic.vcf \

--dbsnp variants_in_dbSNP.vcf \

--intervals exome_coordinates.interval_list \ --input_file:normal matched_germline.bam \ --input_file:tumor tumor_sample.bam \ -o somatic_SNV.out \

The number of somatic variants identified with both methods was compared. We observed that both were good at identifying somatic SNVs when the tumor fraction was high (>20%).

On the other hand, when the fraction of tumor cells in the sample was low, the number of variants found by samtools diminished considerably when compared to MuTect and to its performance on samples with a higher tumor fraction (Figure 20). This observation confirmed that MuTect is a good method for somatic variant calling in our BCC samples, especially when tumor fraction is low (Cibulskis et al., 2013). Furthermore, MuTect is widely used in the field of somatic variant calling and several strong publications support its application (Banerji et al., 2012, Cancer Genome Atlas Research, 2008, Stransky et al., 2011, Stieglitz et al., 2015, Ojesina et al., 2014).

93 of 299

For somatic indel calling, the performance of Pindel and HaplotypeCaller, one of the variant caller algorithms proposed by the GATK, was compared on 23 tumor-germline pairs. By previous experience in other sequencing projects in the lab, we have considered that the false positive rate of Pindel (the somatic indel caller of choice in the laboratory) is quite high, and that quality scores in the mid ranges are not fully reliable when evaluating variants. Additionally, HaplotypeCaller is powerful because it uses as input the BAM files that have been realigned around indels, reducing this way the amount of false positive and false negative calls.

HaplotypeCaller is a variant caller, not necessarily a somatic variant caller. In order to identify somatic indels with it, the following steps were followed:

Box 3. Programs and Formats used in HTS data processing

BAM format: a BAM file is a binary version of a SAM file, which stands for Sequence Alignment/Map format, a TAB-delimited text format of an alignment containing information on each read.

GATK: Genome Analysis Toolkit is a software package for the analysis of high-throughput sequencing data by data science and engineering group at the Broad Institute (McKenna et al., 2010, Van der Auwera et al., 2013).

HaplotypeCaller: is the principal variant caller algorithm from the GATK. It calls SNVs and indels simultaneously using de novo assembly and a Bayesian statistical model.

IGV: the Integrative Genomics Viewer is an interactive visualization tool for the exploration of large genomic datasets developed by Robinson et al. (2011).

Igvtools: is a set of utilities to pre-process data files for the IGV and for other programs.

MuTect: is a tool that identified somatic point mutations in high-throughput sequencing data of cancer genomes developed as part of the GATK related programs (Cibulskis et al., 2013).

PicardTools: a set of Java command line tools for the manipulation of high-throughput sequencing data and formats.

Pindel: is an algorithm that detects deletions and insertions from small, medium and large sizes from paired-end short reads, developed by Ye et al. (2009).

SAMtools: is a set of utilities for manipulating and processing short DNA sequence read alignments in different formats developed by Li et al. (2009).

SAMtools tview: is a text alignment viewer to visualize the alignment of short reads to the reference genome and uses different colors and symbols to display different read and base specific features.

VCF format: the Variant Call Format is a text file format that stores gene sequence variation information in a compact way along with general information about the sample and the genome used in its processing.

94 of 299

Variant calling was carried out on the germline samples with the standard parameters of HaplotypeCaller.

Variant calling on the tumor sample was carried out with HaplotypeCaller, using the VCF file of the germline sample as input for the comparison “--comp” argument.

This function flags all the variants from the tumor sample that are present in the germline VCF file, allowing them to be easily filtered out or further investigated.

Command example:

GenomeAnalysisTK.jar -T HaplotypeCaller \

--reference_sequence human_h37.fasta \ --comp germline_variants.vcf \

-I 5-VS056-T1recalreads.bam \

--intervals exome_coordinates.interval_list \ --genotyping_mode DISCOVERY \

-stand_emit_conf 10 \ -stand_call_conf 30 \ -nct 10 \

-o tumor_variants.vcf

All indels that were flagged as being present in the matched germline sample were removed. Only indels marked as somatic (not present in the germline control) were kept.

HaplotypeCaller identified less somatic indels than Pindel in all samples (Figure 20).

Several examples were visually explored with SAMtools tview and igvtools to see if the calling of those indels by Pindel was correct or not. In all the cases, the indels called by Pindel were considered false positives. After these comparisons among variant callers for somatic variants, we settled on using MuTect and HaplotypeCaller with the comparison mode for somatic variant calling on tumors for which a germline control was available.

95 of 299

For the samples for which a matched germline was not available, a Panel of Normals (PON) was constructed from 50 germline samples.

A PON is useful to reduce false positives and remove germline events that might not be in the dbSNP VCF file provided to MuTect. It is created by running MuTect on germline samples as if they were tumors without a matched germline sample. The produced VCF file contains the sites that were called as somatic SNVs by mutect.

Figure 20. Comparison of the number of somatic variants identified by samtools and MuTect or Pindel and HaplotypeCaller. a. The number of somatic SNVs called with samtools and MuTect on 39 tumor-germline pairs was compared. The number of somatic SNVs per sample is indicated on the y axis. On the x axys, tumor samples are ordered by tumor fraction in decreasing order from the left. The red circles indicate different tumor fractions throughout the dataset. b. The number of somatic indels called with Pindel and HaplotypeCaller on 23 tumor-germline pairs was compared. The number of somatic SNVs per sample is indicated on the y axis.

Tumor samples are displayed in the x axis.

a

b

96 of 299

The PON file is provided as an argument when running MuTect on samples with no matched germline. MuTect will reject all sites present in the PON, and won’t call them as somatic unless they are in the COSMIC variant file (Cibulskis et al., 2013). Since only samples sequenced with the Cancer Panel did not have a matched germline, the PON only covers variation within the genes of the Cancer Panel.

Although even with a PON filtering germline variants can wrongly be considered as somatic

Although even with a PON filtering germline variants can wrongly be considered as somatic