
1. Cancer and genomics

1.3. Cancer databases

The production of large amounts of data requires proper organizational and analytical tools. The Catalogue Of Somatic Mutations In Cancer (COSMIC) (Forbes et al., 2015, Cosmic Database), created in 2004, gives users access to a constantly updated and manually curated database. It includes detailed information on the Cancer Gene Census (http://cancer.sanger.ac.uk/census/), which at the beginning of December 2015 contained 572 genes involved in cancer across 2,500 human diseases and 47 primary tissues. The COSMIC database is a comprehensive collection of cancer genomics information, as it contains data from more than 300 cancer publications and the datasets produced through the ICGC and TCGA. COSMIC is strongest in the description of point mutations and fusion genes observed in cancer, but other features such as SCNAs and gene expression are also contained in the database. The data is accessible through the COSMIC website (http://cancer.sanger.ac.uk/), where it can be searched from a text input box or by selecting specific category fields in the dropdown menus. The selected data is then displayed as diagrams, along with complementary information and additional search and filtering options. Furthermore, the full database is downloadable to allow for detailed and customized exploration when required (Forbes et al., 2015). Additional databases similar or complementary to COSMIC have also been initiated by other consortia or groups. Some examples are cBioPortal (http://www.cbioportal.org/) from the Memorial Sloan Kettering Cancer Center/TCGA and the ICGC data portal (https://dcc.icgc.org/).

1.4. Comprehensive analysis of large datasets

The first study published by the sequencing consortia was on glioblastoma. The article, published in 2008, provides an integrative and comprehensive analysis of glioblastoma, the most common primary brain tumor in adults (Cancer Genome Atlas Research, 2008). The authors studied copy number, gene expression and methylation in 206 tumors, and complemented these data with DNA sequence analysis of 91 of these tumors. Because of the obvious clinical interest in this tumor type, a significant amount of information was already known.

Glioblastomas frequently have amplifications or mutational activation of RTK genes, activation of PI3K pathway genes, and inactivation of TP53 or RB1 genes. By using microarrays, additional SCNAs were interrogated. Homozygous deletions of NF1 and PARK2 were frequent events, as well as amplification of AKT3. Amplification of other genes (FGFR2, IRS2) and deletion of others (PTPRD) were also recurrent events in the dataset. The changes in copy number correlated well to gene expression data of the genes within the SCNAs. cnLOH in 17q was also considered relevant in this analysis.

For somatic mutations, 601 genes were investigated, yielding 453 validated non-silent somatic variants in 223 genes, 79 of which were mutated more than once. Mutation rates were between 1.4 and 5.8 mutations/Mb, depending on the treatment status of the tumors. Untreated tumors harbored fewer mutations, since treated tumors were enriched for mutations in mismatch repair (MMR) genes. Eight genes were considered to be significantly mutated (FDR < 10^-3). Mutations in TP53 clustered in the DNA binding domain, a known TP53 hotspot for mutations in human cancer, in a significant percentage of the tumors. NF1 was strongly put forward as an important gene in glioblastoma. Somatic inactivating mutations in this gene were found in 14% of the tumors and deletions in about 20% of the samples, while some tumors had reduced NF1 expression with no detectable genomic alteration. The EGFR family of genes is known to be frequently activated in glioblastoma, and almost half the samples had EGFR alterations. However, a point mutation in ERBB2 had been identified in glioblastoma in only one previously reported case. Twenty-two of the tumors had focal amplifications of the WT allele, 16 had point mutations associated with a focal amplification, and 3 had ERBB2 point mutations with no amplification. Although only half of the point mutations were validated, ERBB2 can be considered a gene directly involved in glioblastoma. Additionally, activating missense mutations of PIK3CA are known to occur frequently in glioblastoma, and were found in a fraction of the studied samples. On the other hand, its regulatory protein PIK3R1 was not known to be frequently mutated in cancers, but it was found mutated in 10% of the studied tumors. The mutations or small deletions all occurred in the same protein domain and were predicted to cause constitutive PI(3)K activity.
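The mutation rates quoted above are expressed in mutations per megabase (Mb) of sequence. As a minimal sketch of that arithmetic, the function below divides a mutation count by the amount of sequence examined; the function name and the example counts are hypothetical and are not taken from the glioblastoma study.

```python
def mutation_rate_per_mb(n_mutations: int, bases_covered: int) -> float:
    """Somatic mutations per megabase of adequately covered sequence."""
    if bases_covered <= 0:
        raise ValueError("bases_covered must be positive")
    return n_mutations / (bases_covered / 1_000_000)


# Hypothetical tumors with the same 2 Mb of covered sequence (illustrative only).
untreated_rate = mutation_rate_per_mb(n_mutations=3, bases_covered=2_000_000)
treated_rate = mutation_rate_per_mb(n_mutations=12, bases_covered=2_000_000)
print(untreated_rate, treated_rate)
```

Normalizing by covered sequence rather than reporting raw counts is what makes rates comparable between studies that sequence different numbers of genes.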

The most interesting observation in this study was probably the one obtained from the methylation studies. Cancer-specific DNA methylation of the CpG regions in promoters of 2,305 genes was measured and compared to normal brain methylation. Methylation of the MGMT promoter results in a failure to repair alkylated guanine residues caused by treatment. Most interestingly, the mutational spectrum of the MMR genes reflected MGMT methylation status (Figure 3b).

These observations show that the initial genetics of the tumor, along with the treatment, can affect the rate at which somatic mutations are acquired. These interactions should be taken into consideration when selecting a therapy for patients with a methylated MGMT promoter.

The integrated analysis of this glioblastoma cohort unveiled a large network of alterations, with the RTK, P53 and RB pathways at its core. These three main pathways were affected in most tumors (P < 0.0018), but only one component of each pathway harbored an inactivating mutation, and this phenomenon was found to be highly significant (P53: P = 9.3 x 10^-10; RB: P = 2.5 x 10^-13; RTK: P = 0.022).

This first integrative approach to the study of a cancer type by a consortium was deemed successful. The known genetic factors of glioblastoma were confirmed, and novel relevant genes and regulatory pathways were uncovered. Furthermore, some of the findings can provide a direct therapeutic benefit to individuals whose tumors have a specific phenotype. Approaching cancer research in an unbiased and systematic way, with no prior hypothesis, also proved valuable for uncovering unexpected key players in cancer.


Over the last few years, the discoveries that these world-wide collaborations in cancer genomics research have produced are substantial. Besides the detailed and comprehensive characterization of specific tumor types, the integration of the information gathered for individual samples has proven highly valuable. For example, the grouped evaluation of numerous tumor types provided detailed information on the mechanisms and patterns of SCNAs in cancer (Zack et al., 2013). Additionally, the deep analysis of the mutational landscape of several tumor types has allowed researchers to appreciate that the number of accumulated mutations (mutational load) is highly variable across cancers (Lawrence et al., 2013) and that mutational signatures of cancer might be a powerful method to classify and perhaps even therapeutically approach specific cancer types, since these are frequently correlated to tumor etiology (Alexandrov et al., 2013a).

1.4.1. SCNAs profiling in large datasets

Figure 3. Methylation status of the MGMT promoter and its relationship with treatment and mutational contexts in glioblastoma. a. The y axis corresponds to the number of mutations. The x axis corresponds to the treatment status of a patient (+ treated, - non-treated), the methylation status of the MGMT promoter (Meth = methylated, - non-methylated), and the mutational status of MMR genes (Mut = at least one MMR gene mutated, - non-mutated). The numbers under the bars represent the number of samples in each group. b. Mutational spectrum of the MMR genes as a function of treatment and methylation status of the MGMT promoter. Color codes for both graphs are at the bottom. Reproduced from Cancer Genome Atlas Research, 2008.

Using SNP arrays, Zack and collaborators studied SCNAs across 4,934 tumor samples of 11 different cancer types (Zack et al., 2013) as part of the Pan-Cancer analysis project (Cancer Genome Atlas Research et al., 2013), with the objective of distinguishing driver from passenger events and identifying the mechanisms of SCNA acquisition in cancer. They also aimed at pinpointing the key genes within a SCNA that were ultimately driving the cancer phenotype. After inferring the SCNA profiles that best explained the ploidy determined for each tumor, the authors called 202,244 SCNAs (median of 39 per tumor sample) and classified them into 6 different categories (Table 2).

| SCNA category | Median per tumor |
|---|---|
| Focal copy gain, smaller than chromosome arm | 11 |
| Focal copy loss, smaller than chromosome arm | 12 |
| Arm-level copy gain, full arm-length or longer | 3 |
| Arm-level copy loss, full arm-length or longer | 5 |
| cnLOH | 1 |
| Whole genome duplication | 37% of cancers |

Table 2. Types of SCNAs across 11 tumor types. Events assessed from SNP array data of 4,934 tumors, from Zack et al., 2013.

They observed that cancers with whole genome duplication (WGD) had twice the rate of SCNAs of tumors without it. This correlated well with WGD tumors having an average ploidy of 3.31 rather than 4, while tumors with no WGD had a ploidy of 1.99 (when 2 was the expected ploidy). WGD occurred early in the history of SCNA events in tumors, while other types of SCNAs arose after the WGD event. The average copy number profile for these 11 cancer types in the WGD or near-euploid state can be seen in Figure 4a.

Focal SCNAs that extended to the telomeres were longer than intrachromosomal SCNAs. These internal SCNAs had frequencies inversely proportional to their length, while telomeric SCNAs were uniform in size (Figure 4b) and were more frequent than expected assuming random positions for SCNAs (P < 0.0001). SCNAs in general tended to end at the centromeres.
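The per-segment categories of Table 2 and the ploidy values discussed above can be illustrated with a small sketch. It assumes each segment comes with a total and a minor-allele copy number and is compared against a baseline of 2 copies; the `Segment` fields, the function names and the example values are hypothetical, and this is not the procedure used by Zack et al., who inferred copy number relative to each tumor's estimated ploidy.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """One copy-number segment (field names are hypothetical)."""
    length: int      # segment length in bp
    arm_length: int  # length in bp of the chromosome arm carrying the segment
    total_cn: int    # total copy number of the segment
    minor_cn: int    # copy number of the less abundant allele


def classify(seg: Segment, baseline_cn: int = 2) -> str:
    """Assign a segment to a Table 2 category (WGD is a per-sample call, so it is not handled here)."""
    scale = "arm-level" if seg.length >= seg.arm_length else "focal"
    if seg.total_cn > baseline_cn:
        return f"{scale} gain"
    if seg.total_cn < baseline_cn:
        return f"{scale} loss"
    if seg.minor_cn == 0:
        return "cnLOH"  # baseline copy number, but one parental allele is lost
    return "neutral"


def mean_ploidy(segments: List[Segment]) -> float:
    """Length-weighted average total copy number across all segments."""
    total_length = sum(s.length for s in segments)
    return sum(s.length * s.total_cn for s in segments) / total_length


# Hypothetical sample with four segments (illustrative values only).
segments = [
    Segment(100_000_000, 100_000_000, 3, 1),  # whole arm at three copies
    Segment(5_000_000, 60_000_000, 1, 0),     # small one-copy loss
    Segment(20_000_000, 60_000_000, 2, 0),    # two copies of a single allele
    Segment(75_000_000, 150_000_000, 2, 1),   # unaltered
]
labels = [classify(s) for s in segments]
sample_ploidy = mean_ploidy(segments)
print(labels, sample_ploidy)
```

Weighting by segment length is what lets a sample that has undergone WGD and then lost material end up with a non-integer ploidy below 4, as reported above.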

Figure 4. Characteristics of different types of SCNAs. a. Number of amplifications (red) or deletions (blue) in 10 cancer types from an arm-level or a focal perspective (top and bottom respectively). In each cancer type, samples with WGD events are at the right and samples without WGD at the left; SCNAs in samples with WGD are resolved according to their timing relative to the WGD event. b. Distribution of lengths of SCNAs originating at telomeres compared to intra-chromosomal SCNAs. c. Rates of chromothripsis across different cancer types. BLCA = bladder, BRCA = breast, COAD = colon and rectal carcinoma, GBM = glioblastoma multiforme, HNSC = head and neck squamous cell, KIRC = kidney renal clear cell, LUAD = lung adenocarcinoma, LUSC = lung squamous cell, OV = ovary, UCEC = uterine corpus endometrial carcinoma. All three panels reproduced from Zack et al., 2013.

Chromothripsis was detected in 5% of samples, with frequencies that varied by tumor type (Figure 4c) but were unrelated to the overall rate of SCNAs per sample. Chromothripsis tended to occur in specific regions and was associated with particular driver events.

Across all cancer types, 70 recurrent amplifications and the same number of recurrent deletions were identified. The authors identified “peak” regions within these SCNAs that were more likely to contain oncogenes or tumor suppressor genes. SCNAs within the peak regions were shorter than events occurring elsewhere in the chromosome (P < 0.0001) and they were also more often high-amplitude events (P < 0.0001). The frequency of events in these peak regions was stable across tumors of the same lineage. Of the 70 peak regions of amplification, 24 contained an oncogene known to be activated by amplification (such as CCND1, EGFR, MYC, ERBB2, and CCNE1) or other genes directly involved in carcinogenesis, such as TERC, which encodes the RNA template used by TERT, a known oncogene. Of the peak regions of deletion, 12 contained tumor suppressor genes (such as ATM, NOTCH, FOXK2 and PPP2R2A) and two other regions had tumor suppressor gene candidates (ERRFI1 and FOXC1).

The peaks that contained no obvious cancer gene candidates were subjected to literature citation searching algorithms, and enrichment for topics related to epigenetic and mitochondrial regulation was observed. This finding stresses the relevance of epigenetic alterations in cancer progression, in concordance with previous observations (Berman et al., 2012, Fullgrabe et al., 2011). When significantly mutated genes (SMGs) were called, the authors found inside the peak regions all genes known to be tumor suppressors, as well as a significant fraction of genes known to act as oncogenes through amplification. It was interesting to note that deleted regions are probably enriched in tumor suppressor genes, as they had more truncating and frameshift deletions than expected (P = 0.0002). Furthermore, of the 770 peak regions identified across specific cancer lineages, 84% occurred in at least two lineages and 65% were inside peak regions from the pan-cancer analysis.

This study was the first high-resolution analysis of SCNAs and ploidy across several cancer types. It reports areas in the genome that are enriched in SCNAs in cancer and that probably contain genes or regulatory elements that act as drivers of tumorigenesis, and it confirms the relevance of several previously identified tumor suppressor genes and oncogenes. It stresses the advantages of unbiased approaches in large datasets for the identification of events and genes important in carcinogenesis.

1.4.2. Factors influencing mutation rates

As mentioned before, the most conventional approach for the identification of somatic mutations and genes directly involved in tumorigenesis is the comparison of tumor and germline DNA of large sample sets, followed by statistical analysis to identify SMGs. It was initially thought that large datasets would increase the sensitivity and the specificity of the analyses, but in most cases what increased was the number of false positives, since highly mutable genes such as olfactory receptors and large genes such as TTN and PCLO were systematically identified and sometimes even nominated as cancer genes. To better understand mutational processes in cancer, over 3,000 tumor-germline pairs from 27 different cancer types, for which whole exome or whole genome sequencing had been carried out, were studied as part of the Pan-Cancer analysis project (Lawrence et al., 2013).

The authors observed large variation in mutation rates among different cancer types (Figure 5). Disparity within tumors of the same type was also striking, with some specimens having mutation rates as low as 1 mutation/Mb and others having rates of over 100 mutations/Mb, as in the case of melanoma (Figure 5). In many cases, increased or decreased mutation rates could be correlated with confounding factors, such as tobacco use in lung carcinomas, exposure or not to UV light in melanoma, or the presence of mismatch repair mutations in colon cancers.

Lawrence and collaborators also looked at the mutational profiles of each tumor type and whether they could be used to cluster different tumor types together. They observed that different types of lung cancer share the C>A signature consistent with tobacco exposure, while melanomas have mostly C>T mutations, the classical sign of UV-light-induced mutations. Tumors from the gastrointestinal tract show C>T mutations in a CpG context, and epithelial cancers (bladder, cervical and head and neck) display a large fraction of C>T/G mutations in a TpC context, which could be caused by APOBEC enzymes acting to restrict viral infection, a common contributing factor in a fraction of these cancer types.
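The substitution classes and sequence contexts named above (C>A, C>T, CpG, TpC) can be made concrete with a short sketch. It assumes each mutation is given as a three-base reference context with the mutated base in the middle, and it reports the change from the pyrimidine strand, as is conventional in mutational spectrum analyses. The function names are hypothetical, and giving CpG priority over TpC when both apply (the context TCG) is an arbitrary choice made for this illustration.

```python
from typing import Tuple

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def classify_substitution(context: str, alt: str) -> Tuple[str, str]:
    """Return (substitution class, context label) for a single-base substitution.

    context: three reference bases with the mutated base in the middle.
    alt: the mutant base on the same strand as context.
    """
    context, alt = context.upper(), alt.upper()
    if len(context) != 3 or alt not in COMPLEMENT or any(b not in COMPLEMENT for b in context):
        raise ValueError("expected a 3-base context and a single A/C/G/T alternative")
    if context[1] == alt:
        raise ValueError("alternative base equals reference base")
    if context[1] in "AG":
        # Purine reference: describe the same event from the opposite strand.
        context = reverse_complement(context)
        alt = COMPLEMENT[alt]
    ref = context[1]
    change = f"{ref}>{alt}"
    if ref == "C" and context[2] == "G":
        label = "CpG"
    elif ref == "C" and context[0] == "T":
        label = "TpC"
    else:
        label = "other"
    return change, label


# Illustrative calls (not data from Lawrence et al.).
print(classify_substitution("ACG", "T"))
print(classify_substitution("TCA", "G"))
print(classify_substitution("CGT", "A"))
```

Counting these labels per tumor gives the kind of spectrum that can then be compared or clustered across tumor types.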


Mutation frequency was found to vary significantly throughout the genome of specific tumors and within tumor types. This could be a consequence of specific genomic features, such as gene expression (Pleasance et al., 2010a). The germline mutation rate is lower in highly expressed genes due to transcription-coupled repair (Fousteri and Mullenders, 2008), and this was confirmed in their sample set, where mutations were less frequent in highly expressed genes: the average mutation rate was almost 3 times higher in the genes with the lowest expression than in those with the highest expression. Another feature found to be important was DNA replication time, which is also known to be correlated with germline mutation (Stamatoyannopoulos et al., 2009). Late-replicating regions are expected to have higher mutation rates, and this correlation was indeed observed in the dataset of Lawrence and collaborators, where the mutation rate was three times higher in the latest than in the earliest replicating regions. These observations explain some of the false positive cancer genes. For example, both olfactory receptors and large genes are expressed at low levels, are late replicating, and have a large number of silent or intronic mutations (Lawrence et al., 2013).

Figure 5. Somatic mutation frequencies in 27 different tumor types. Each dot was obtained through a tumor-matched normal comparison; the vertical axis indicates the frequency of somatic mutations in the exome. Tumor types are ordered based on their median somatic mutation frequency and the number of samples per tumor type is indicated above the plot. Reproduced from Lawrence et al., 2013.

The authors integrated their observations to create a powerful algorithm to identify SMGs in cancer. This algorithm, called MutSigCV, takes into account the mutational covariates described in their publication and performs well when identifying SMGs, effectively reducing false positives. Since its development, MutSigCV has been the program of choice to identify cancer genes and it has been used in numerous publications (Cancer Genome Atlas Research, 2012, Cancer Genome Atlas Network, 2015, Cancer Genome Atlas, 2015, Gao et al., 2014, Jones et al., 2012, Pugh et al., 2013, Pickering et al., 2014).
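The effect of a gene-specific background rate on false positives can be illustrated with a toy calculation. The sketch below is not MutSigCV and does not reproduce its statistical model; it simply assumes that mutation counts follow a Poisson distribution and that a single multiplicative factor summarizes covariates such as low expression and late replication. All names and numbers are hypothetical.

```python
import math


def poisson_upper_tail(k: int, lam: float) -> float:
    """P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    cdf = sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k))
    return max(0.0, 1.0 - cdf)


def gene_p_value(observed: int, coding_bp: int, n_samples: int,
                 rate_per_bp: float, covariate_factor: float = 1.0) -> float:
    """Probability of seeing at least `observed` mutations under the background model."""
    expected = rate_per_bp * covariate_factor * coding_bp * n_samples
    return poisson_upper_tail(observed, expected)


# Hypothetical large, late-replicating, poorly expressed gene (illustrative only).
observed, coding_bp, n_samples = 40, 100_000, 100
cohort_rate = 2e-6  # assumed cohort-wide background rate per bp

p_uniform = gene_p_value(observed, coding_bp, n_samples, cohort_rate)
# Assume the local background is about three times the cohort average.
p_adjusted = gene_p_value(observed, coding_bp, n_samples, cohort_rate, covariate_factor=3.0)
print(p_uniform, p_adjusted)
```

Under the uniform rate the gene looks strongly enriched for mutations, whereas with the higher local background the same count is unremarkable, which is the behavior expected for genes such as TTN or the olfactory receptors.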

The study of a very similar dataset confirms the results regarding mutation rate among cancer types and mutational signatures and how they can be used to cluster tumors by their class and sometimes etiology (Kandoth et al., 2013). This study additionally identified several genes that were very frequently mutated in cancer. The most frequently mutated gene in their 3,281 tumors of 12 different types was TP53 (42% of the samples), followed by PIK3CA (mutated in 10% of their samples). Mutations in these two genes were specific to particular groups of cancers, TP53 being more frequent in ovarian or endometrial carcinomas and basal breast cancer, while PIK3CA did not occur in ovarian, lung or kidney cancers. Mutations in SMG across the 12 cancer types were subjected to unsupervised clustering, and it was found that 72% of the samples were indeed clustering with tumors of the same tissue type, having mutations in the same driver genes. Pairwise comparisons among mutations in the 127 SMG identified in this study found 14 mutually exclusive gene pairs. For example, TP53 and CDH1 are mutually exclusive in breast cancer (FDR 0.05), while TP53, PTEN, VHL, NPM1 and GATA3 are mutually exclusive across the full dataset (P=0.01). In contrast, there were a number of associations detected, with 148 co-occurring mutations across the SMGs in the dataset.
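Pairwise mutual exclusivity and co-occurrence of the kind described above are commonly assessed on a 2x2 table of samples. The following is a minimal sketch using one-sided Fisher exact tests built from the hypergeometric distribution; it is not necessarily the test used by Kandoth et al., and the function name and sample counts are hypothetical.

```python
from math import comb
from typing import Tuple


def exclusivity_cooccurrence_p(both: int, only_a: int, only_b: int,
                               neither: int) -> Tuple[float, float]:
    """One-sided Fisher exact p-values for a pair of genes.

    Returns (p_exclusive, p_cooccurring): the probability of an overlap at most
    as large, or at least as large, as the one observed, if mutations in the
    two genes were independent.
    """
    n = both + only_a + only_b + neither
    a_total = both + only_a
    b_total = both + only_b
    denominator = comb(n, b_total)

    def pmf(x: int) -> float:
        return comb(a_total, x) * comb(n - a_total, b_total - x) / denominator

    smallest = max(0, a_total + b_total - n)
    largest = min(a_total, b_total)
    p_exclusive = sum(pmf(x) for x in range(smallest, both + 1))
    p_cooccurring = sum(pmf(x) for x in range(both, largest + 1))
    return p_exclusive, p_cooccurring


# Hypothetical cohort of 100 tumors: gene A mutated in 40, gene B in 30, both in 2.
p_excl, p_co = exclusivity_cooccurrence_p(both=2, only_a=38, only_b=28, neither=32)
print(p_excl, p_co)
```

When many gene pairs are tested, as with the 127 SMGs here, the resulting p-values would also need correction for multiple testing.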

Furthermore, Kandoth et al. (2013) were able to temporally place the occurrence of mutations in the history of a tumor by looking at the variant allele fraction (VAF) distribution of mutations in some of the SMGs in acute myeloid leukemia (AML), breast and uterine/cervical cancers. TP53 had the highest VAF in these cancer types, suggesting it tends to appear early in tumorigenesis, although its elevated VAF might be due to cnLOH, a common event in TP53 and other tumor suppressors such as BRCA1, BRCA2 and PTEN. Other genes were also identified in specific tumor types. For example, in AML, DNMT3A and SMC3 had significantly higher VAFs than average (P < 0.0003 and P < 0.05 respectively). These analyses are interesting because they can allow the identification of the primary drivers of tumorigenesis among all the contributing drivers in a specific cancer type. The results of this study, along with the ones described before and many other reports, have helped us understand the mechanisms of cancer and how this disease begins.
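The VAF reasoning above, including the caveat about cnLOH, can be illustrated with a simple model of the expected VAF. The sketch assumes a mixture of normal cells and tumor cells that all share the same total copy number at the locus; the function names, parameters and example values are hypothetical and are not taken from Kandoth et al.

```python
def observed_vaf(alt_reads: int, ref_reads: int) -> float:
    """Fraction of reads at a position that support the variant allele."""
    return alt_reads / (alt_reads + ref_reads)


def expected_vaf(purity: float, mutant_copies: int, tumor_total_cn: int,
                 cancer_cell_fraction: float = 1.0, normal_cn: int = 2) -> float:
    """Expected VAF of a somatic mutation in a tumor/normal cell mixture.

    purity: fraction of cells in the sample that are tumor cells.
    mutant_copies: mutated copies per tumor cell that carries the mutation.
    cancer_cell_fraction: fraction of tumor cells that carry the mutation.
    """
    mutant_alleles = purity * cancer_cell_fraction * mutant_copies
    all_alleles = purity * tumor_total_cn + (1.0 - purity) * normal_cn
    return mutant_alleles / all_alleles


# Hypothetical sample with 80% tumor cells (illustrative values only).
early_het = expected_vaf(purity=0.8, mutant_copies=1, tumor_total_cn=2)
early_cnloh = expected_vaf(purity=0.8, mutant_copies=2, tumor_total_cn=2)
late_subclonal = expected_vaf(purity=0.8, mutant_copies=1, tumor_total_cn=2,
                              cancer_cell_fraction=0.5)
print(early_het, early_cnloh, late_subclonal)
```

A mutation present in every tumor cell yields a higher VAF than one confined to a subclone, but cnLOH raises the VAF further without the mutation being any earlier, which is why copy number has to be considered before VAF is read as timing.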

Even though most of the mutations identified in cancer are passenger mutations, they are the product of the same mutational processes that give rise to the drivers. The mutations bear the signature of these mutational processes: the type of DNA damage, the length and strength of exposure to the mutational process, and the DNA repair mechanisms involved. What we ultimately see when a tumor is sequenced is a combination of different mutational signatures that result from all the processes involved in cancer formation (Figure 6). Until recently, our understanding of these processes was limited to specific genes, but genome-wide approaches are starting to be applied to the analysis of large or comprehensive datasets (Helleday et al., 2014, Alexandrov and Stratton, 2014).

The most convenient dataset to study the full landscape of mutational signatures in cancer
