
Genomic characterization of basal cell carcinoma of the skin


Thesis

Reference

Genomic characterization of basal cell carcinoma of the skin

BONILLA BUSTILLO, Ximena

Abstract

Skin basal cell carcinoma (BCC) is caused by UV-light radiation activating the Hh signaling pathway through inactivating mutations in PTCH1 or activating events in SMO, as well as TP53 mutations. In this thesis we show that BCC is the tumor with the highest mutation rate to date with 65 mutations/Mb, and that its mutational signature of 90% C>T mutations in a pyrimidine context reveals clear UV-light causality. We show the relevance of downstream Hh genes and Hippo pathway genes in BCC by the identification of MYCN and PTPN14 point mutations. Additional genes such as LATS1, STK19, ERBB2, FBXW7, PPP6C, KNSTRN, and CASP8 were also mutated in a fraction of tumors. Gene expression studies support our finding that oncogenic mutations downstream of SMO and Hippo pathway activating mutations are alternative mechanisms in BCC. The results presented in this thesis enhance our understanding of tumorigenesis in BCC and provide the means for better classification of tumors and treatment options.

BONILLA BUSTILLO, Ximena. Genomic characterization of basal cell carcinoma of the skin. Thèse de doctorat : Univ. Genève, 2016, no. Sc. 4888

URN : urn:nbn:ch:unige-819068

DOI : 10.13097/archive-ouverte/unige:81906

Available at:

http://archive-ouverte.unige.ch/unige:81906

Disclaimer: layout of this document may differ from the published version.


UNIVERSITY OF GENEVA

Department of Molecular Biology, FACULTY OF SCIENCE

Professor Thanos D. Halazonetis

Department of Genetic Medicine and Development, FACULTY OF MEDICINE

Professor Stylianos E. Antonarakis

Genomic Characterization of Basal Cell Carcinoma of the Skin

THESIS

presented to the Faculty of Science of the University of Geneva to obtain the degree of Doctor of Science, specialization in biology

by

Ximena BONILLA-BUSTILLO

from

MEXICO

Thesis no. 4888

Geneva, 2016


UNIVERSITY OF GENEVA

Doctorate of Science, specialization in biology

Thesis of

Ms. Ximena BONILLA BUSTILLO

entitled:

"Genomic Characterization of Basal Cell Carcinoma of the Skin"

The Faculty of Science, on the recommendation of Mr. S. ANTONARAKIS, full professor and thesis director (Faculty of Medicine, Department of Genetic Medicine and Development), Mr. T. D. HALAZONETIS, full professor and thesis co-director (Department of Molecular Biology), and Mr. P. CAMPBELL, doctor (Cancer Genetics & Genomics, The Wellcome Trust Sanger Institute, Hinxton, Cambridge, United Kingdom), authorizes the printing of this thesis, without expressing an opinion on the propositions stated therein.

Geneva, 11 January 2016

Thesis - 4888 -

The Dean

N.B. - The thesis must carry the above declaration and meet the conditions listed in the "Informations relatives aux thèses de doctorat à l'Université de Genève" (information on doctoral theses at the University of Geneva).


Table of Contents

ACKNOWLEDGEMENTS
RÉSUMÉ EN FRANÇAIS (FRENCH SUMMARY)
SUMMARY
INTRODUCTION
1. Cancer and genomics
1.1. The first cancer genomic studies
1.2. International cancer consortia
1.3. Cancer databases
1.4. Comprehensive analysis of large datasets
1.4.1. SCNAs profiling in large datasets
1.4.2. Factors influencing mutation rates
1.4.3. Mutational signatures
2. Basal cell carcinoma of the skin
2.1. Epidemiology of BCC
2.2. BCC risk factors
2.3. Clinical and histological features
2.4. Pathogenesis of BCC
2.4.1. The hedgehog signaling pathway
2.4.2. Hedgehog genes and their role in BCC
2.4.3. The TP53 gene in BCC
2.5. Mutational signatures in BCC
2.6. BCC treatment and management options
3. Genomic studies of BCC and related cancers
3.1. Exome and transcriptome sequencing analysis of BCCs
3.1.1. Advantages and limitations of BCC genomic studies
3.2. Squamous cell carcinoma
3.3. Cutaneous melanoma
3.4. Genomic studies of other Hh driven cancers
JUSTIFICATION
OBJECTIVES
METHODS
1. Sample collection and preparation
1.1. Sample collection
1.1.1. Fresh tissue samples
1.1.2. FFPE tissue samples
1.1.3. Publicly accessible data
1.2. DNA quality control and library preparation
1.2.1. DNA extraction
1.2.2. DNA library preparation
1.3. Sequencing of exome and cancer panel genes
2. DNA sequencing data analysis
2.1. Somatic variant calling
2.2. Sanger sequencing
2.3. Somatic copy number aberration (SCNA) analysis
3. Identification of significantly mutated genes
3.1. MutSigCV
3.2. TumOnc
4. RNA sample preparation and sequencing
5. Investigation of the relevance of the identified mutations
5.1. In silico protein structure predictions
5.2. N-myc functional assays
5.3. Immunohistochemistry
RESULTS
SECTION 1: Sample collection and sequencing
SECTION 2: Mutation rate and mutational profiles in BCC
Rationale and objectives
2.1. Mutation rate in BCC
2.2. Mutational profile of BCC
SECTION 3: Identification of driver genes
Rationale and objectives
3.1. Identification of BCC driver genes
3.1.1. Identification of significantly mutated genes
3.1.2. Identification of recurrently mutated nucleotides
3.1.3. Additional known cancer genes are mutated in BCC
SECTION 4: Novel driver genes in BCC
Rationale and objectives
4.1. Mutational profile of the BCC driver genes
4.1.1. Tumor suppressors in BCC
4.1.2. Mutations in MYCN and FBXW7
4.1.3. Mutations in PTPN14
4.1.4. Mutations in LATS1/LATS2
4.1.5. Other putative drivers
4.2. Somatic copy number aberrations in BCC
SECTION 5: Relationship between drivers and phenotypes
Rationale and objectives
5.1. Histological subtypes, drug resistance and BCC drivers
5.2. Tumor clonality and germline variants
5.3. BCCs in Gorlin syndrome
SECTION 6: Gene expression in BCC
Rationale and objectives
6.1. Gene expression in BCC
DISCUSSION
ABBREVIATIONS AND GLOSSARY
REFERENCES
APPENDIX
Appendix A.
Appendix B.
Appendix C.
Appendix D.
Appendix E.
Appendix F.
Appendix G.


ACKNOWLEDGEMENTS


I would like to express my deepest gratitude to Prof. Stylianos E. Antonarakis for giving me the opportunity to spend the last few years in his lab working on many interesting and exciting topics with very talented people, and for his invaluable support during my PhD and over the last weeks of thesis writing. I would also like to very specially thank Sergey Nikolaev for trusting in me and inviting me to work on the genomics of BCC, giving me the opportunity to delve into an exciting topic, and guiding me through it.

I would like to thank Prof. Thanos Halazonetis and Prof. Peter Campbell for agreeing to review this thesis manuscript, and Prof. Marguerite Neerman-Arbez and Prof. Dominique Belin for reviewing and grading my hors-thèse exam.

Additionally, I would like to thank the NCCR-Frontiers in Genetics PhD program for accepting me as one of their students, for giving me the opportunity to pursue my research here in Switzerland, and for allowing me to do laboratory rotations. Special thanks to Prof. Didier Trono and Prof. Pedro Herrera and their teams, with whom I did my rotations.

I would like to thank all current and past members of the SEA research and clinical labs for their continuous support during my PhD. It is impossible to mention each one of you individually here, but I feel very lucky to have worked with such highly skilled, enthusiastic people.

Special thanks go out to Michel Guipponi for always having my back and for teaching me how to plan a project and approach problem solving. I would like to thank Audrey Letourneau as well, for being a good friend, a great example of discipline and kindness, and for sharing everything she knew with me. Corinne Gehrig, a terrific friend and outstanding technician, thank you for teaching me how to work in the lab and for receiving me with arms wide open from day one.

A huge thank you to Kostya Popadin and Richard Fish who, each in their own way, shared with me their enthusiasm for scientific discovery and their novel ideas and projects. These discussions were a very nice reminder of our ultimate objectives as researchers and what really matters.

Federico Santoni, Periklis Makrythanasis, Daniel Robyr, Jean-Louis Blouin, Samuel Lukowski, thank you for the scientific discussions and your invaluable help throughout these years. Pascale Ribaux, Emilie Falconnet, Geneviève Duriaux-Sail, thank you for your excellent technical work and your contributions to the projects I’ve been involved with.

I would like to thank fellow lab/GEDEV-mates Reza Sailani, Georgios Stamoulis, Marco Garieri, Alex Fort, Mari Nelis, Mariana Bustamante, Muhammad Ansar, Andrew Brown, Ana Viñuela, Delphine Baronnier, Valentina Cigliola. It has been very nice sharing scientific and fun times with you all.


To all the collaborators involved in the BCC project, thank you for being part of this study, for your hard work and your recommendations and comments; and thank you to all the patients that agreed to donate their tumors and blood.

Thank you to the administrative and secretarial team, whose help has been invaluable. Francine Chopard, Sarita Goutorbe, Ancilla Stefani, Teresa Enes, Natacha Gulizia, you girls are truly great!

Thank you to the members of the Dermitzakis and Zdobnov labs for their help with technical issues and their friendship. Those conversations over coffee, evening beers, hikes and movie nights were always lots of fun.

I would like to very specially thank Aline Dousse and Natalia Lugli, two good friends who worked on writing their theses at the same time I did and who offered me support, encouragement, good humor, and shared relaxing moments when needed.

To Alfonso, Leo, Françoise and Pierre, thank you for being like family away from home. I would like to thank my friends for all the fun moments and adventures we’ve had together: Maria, Sot, Julien, Andrea, Silja, Nathaly, Adriana, Janine, Maciej, Mery, Nico, Séverine.

I would like to thank my parents and my brother for their constant support in spite of the distance and a very special thank you to Mike, for being unconditionally there and just plain awesome. I couldn’t have wished for a better partner than you.


RÉSUMÉ EN FRANÇAIS (FRENCH SUMMARY)

Basal cell carcinoma of the skin (BCC) is the most common malignant tumor in humans, with a worldwide incidence of 7 million new cases per year (Lomas et al., 2012). BCC usually develops from progenitor cells of the basal layer of the interfollicular epidermis exposed to UV radiation. UV-light-induced mutagenesis is known to cause aberrant activation of the Hedgehog (Hh) signaling pathway, through mutations that inactivate PTCH1 or activate SMO, two key genes of this pathway, together with inactivating mutations of TP53. However, the morphology and aggressiveness of BCC, as well as the incomplete response to SMO inhibitors in a large fraction of pharmacologically treated tumors, could be associated with the acquisition of additional mutations. An effective and inclusive way to address the genetic mechanisms at play in a particular cancer type is to carry out an unbiased study of a large number of samples. Advances in sequencing and in analytical techniques in cancer genomics, following the creation of international research consortia in this field, allowed us to carry out the genomic analysis of 293 BCC samples.

The objective of this thesis is to understand the genetic mechanisms involved in the development and progression of BCC through the study of mutational processes, the identification of additional mutations involved in tumorigenesis, and the study of the relationship between these mutations and BCC-specific phenotypes across our sample set.

In this thesis we show that BCC is the tumor with the highest mutation rate identified to date, with 65 mutations per megabase. Its mutational signature consists of 90% C>T mutations in a pyrimidine context, revealing a causal relationship between UV light and the very high mutation rate. We found that 85% of the samples studied carry a driver mutation in one of the Hh genes linked to tumorigenesis in BCC. Using published software (MutSigCV) and new algorithms (TumOnc), we found that the great majority of tumors have mutations in other genes of interest. In particular, the identification and functional validation of point mutations in MYCN and truncating mutations in PTPN14 showed that genes downstream of SMO in the Hh signaling pathway and genes in the Hippo pathway, respectively, are common and important players in BCC oncogenesis. Additional genes, including LATS1, STK19, ERBB2, FBXW7, PPP6C, K/N/H/RAS and RB1, also carried known oncogenic mutations in a fraction of tumors. Interestingly, we observed that tumors with a high risk of recurrence had almost twice as many mutations in the newly identified BCC driver genes as tumors with a low risk of recurrence. In addition, we noted that SMO mutations were more numerous in tumors resistant to treatment with SMO inhibitors such as vismodegib.

Because patients sometimes develop multiple tumors, BCC is a good model for studying cancer evolution. In our samples, we identified four pairs of clonal tumors, two of which shared a substantial fraction of their somatic mutations and therefore had a common origin. The other two, by contrast, shared a single TP53 mutation that probably arose in a region of skin containing a pre-existing predisposing mutation. Finally, we carried out gene expression studies to confirm the relevance of the identified mutations and observed that the target genes of the MYCN and Hippo signaling pathways were more highly expressed in BCC samples than in unaffected skin. This finding corroborates our observation that oncogenic mutations downstream of SMO in the Hh pathway and activating mutations of the Hippo pathway are oncogenic mechanisms that are alternative or concomitant to the known mechanisms of BCC.

The results presented in this thesis highlight the relationship between the newly identified driver genes and the different BCC phenotypes. These observations indicate that mutations in these genes are interesting predictors of BCC behavior, particularly in patients with advanced and metastatic tumors that cannot be surgically removed. Our findings underline the genetic complexity of cancer and offer a better understanding of the biology of BCC, the most common type of cancer.


SUMMARY


Skin basal cell carcinoma (BCC) is the most common malignant neoplasm in humans and has a worldwide incidence of 7 million new cases per year (Lomas et al., 2012). BCC usually arises from progenitor cells in the basal layer of the interfollicular epidermis exposed to UV-light radiation. UV-light mutagenesis is known to cause the aberrant activation of the hedgehog (Hh) signaling pathway through inactivating mutations of PTCH1 or activating events in SMO, two key players of this pathway, as well as TP53 mutations. However, BCC morphology and aggressiveness, as well as the incomplete response to SMO inhibitors in a large fraction of pharmacologically treated tumors, may be associated with the acquisition of additional driver mutations. An effective and inclusive way of identifying the genetic mechanisms in a particular cancer type is to carry out an unbiased study of genome and transcriptome sequencing of a large number of samples. International cancer consortia have performed dozens of studies on the genomic characterization of different cancer types, but BCC has neither been investigated nor included in the prospective studies, probably because of its low aggressiveness and relatively simple clinical management.

The objective of this thesis is to understand the genetic mechanisms involved in BCC development and progression through the study of the mutational processes, the identification of additional drivers of tumorigenesis, and the analysis of the relationship of these drivers with specific BCC phenotypes in a sample set of 293 BCCs with a matched blood sample for 156 of them.

In this thesis we show that BCC is the tumor with the highest mutation rate to date with 65 mutations/Mb, and that its mutational signature of 90% C>T mutations in a pyrimidine context reveals clear causality between UV-light and the extremely high mutation rate. Furthermore, we have found that 85% of the studied samples have a driver mutation in one of the known Hh gene drivers (PTCH1 and SMO) or TP53.

By using publicly available (MutSigCV) and novel (TumOnc) algorithms, we found that the great majority of tumors harbor somatic mutations in additional genes. In particular, the identification and functional validation of MYCN point mutations and PTPN14 truncating mutations show that downstream Hh genes and Hippo pathway genes, respectively, are frequent drivers in BCC oncogenesis. Additional genes such as LATS1, STK19, ERBB2, FBXW7, PPP6C, KNSTRN and CASP8, among others, were also mutated in a fraction of tumors and display known oncogenic mutations. Interestingly, we observed that tumors with a higher risk of recurrence have almost twice as many mutations in the novel BCC drivers as tumors with a low recurrence risk. In addition, we have found that SMO mutations are enriched in tumors resistant to treatment with SMO inhibitors such as vismodegib.

We also carried out gene expression studies to confirm the relevance of our identified mutations and we observed that MYCN and Hippo pathway target genes were upregulated in BCC when compared to non-affected skin. This observation supports our finding that oncogenic mutations downstream of SMO in the Hh pathway, and mutations that activate the Hippo pathway, are alternative or co-occurring oncogenic mechanisms in BCC. Since patients can develop multiple tumors, BCC is an excellent model to study tumor evolution. In our sample set we identified four clonal tumor pairs, two of which shared a significant fraction of their somatic mutations and therefore had a common origin, while the other two shared a single TP53 mutation and probably arose in a patch of skin with a pre-existing predisposing variant.

The results presented in this thesis enhance our understanding of tumorigenesis in BCC and provide the means for better classification of tumors and treatment options.


INTRODUCTION


1. Cancer and genomics

Cancer is considered a chronic and genetic disease (Hanahan and Weinberg, 2011) characterized by the growth and spread of abnormal cells. It is responsible for a substantial fraction of human deaths worldwide, having caused 8.2 million deaths in 2012 (American Cancer Society, 2015).

The importance of the genome in cancer etiology was suggested early on in the history of cancer research. At the beginning of the 20th century, Theodor Boveri observed dividing cancer cells under the microscope, identified chromosomal aberrations, and suggested that they were responsible for abnormal cell proliferation (Stratton et al., 2009). Subsequent multidisciplinary work over the following decades established that chromosomal aberrations (Rowley, 1973) and point mutations in specific genes (Reddy et al., 1982) are directly linked to cancer initiation and progression.

Cancer development depends on the acquisition of somatic mutations in individual cells and on selection acting on them. Selection eliminates cells with deleterious mutations, while cells that have acquired mutations allowing them to divide more or survive longer than others thrive. Most of these advantageous mutations have a relatively mild effect and cause no distinguishable phenotype at the organism level. However, some of them may confer a substantial proliferative or invasive advantage to cells, allowing them to grow at significant rates, invade tissues and metastasize (Stratton et al., 2009).

These mutations are known as “driver” mutations. They promote tumorigenesis and occur in a set of genes known as “cancer genes” (Box 1). The mutations that do not contribute to tumorigenesis but that hitchhike along with the drivers because they were acquired previously by the cell, are called “passenger” mutations (Stratton et al., 2009). It is important to note that cancer genes can also harbor passenger mutations: not all the mutational events occurring on a gene promote tumorigenesis.


Box 1. Types of mutation and cancer genes

Driver mutation: a mutation causally implicated in oncogenesis through conferring growth and selection advantages to the cell that has it.

Passenger mutation: a mutation present in a cancerous cell that has not been selected for, has not conferred an advantage to the cell and has not contributed to cancer development.

Gain of Function (GoF) mutations: a mutation that confers an enhanced or additional function to the protein product. The effect of these mutations is usually dominant.

Loss of Function (LoF) mutations: a mutation that causes the abolishment or reduction of protein function. The effect of these mutations is usually recessive.

Cancer gene: a gene that, when mutated, contributes to cancer development. The total number of these genes is unknown, but mouse studies suggest that more than 2,000 genes could contribute to cancer development in one way or another.

Oncogene: a gene carrying activating mutations that cause cells to become cancerous through the stimulation of cell division, decrease of cell differentiation or inhibition of cell death, among others. Their effect is usually dominant; a mutation in one allele is enough to confer oncogenic characteristics to the cell.

Tumor suppressor gene: a gene that normally inhibits cell proliferation, stimulates cell death or promotes DNA repair and that, when carrying mutations or methylation defects with inactivating consequences, contributes to cancer. Its effect is usually recessive; inactivation of one copy is followed by mechanisms that facilitate the loss of the second copy.

(Concepts in this box are from Stratton et al., 2009, Weinberg et al., 2007, Kufe et al., 2003)

Cancer genes can be of two different types. Oncogenes are genes that are aberrantly activated by a mutation (GoF, gain of function mutations), increasing the selective advantage of the cells that carry it. Tumor suppressor genes are inactivated by mutations (LoF, loss of function mutations), and it is this inactivation that confers a proliferative advantage to the mutated cells. The patterns of mutations in oncogenes and tumor suppressor genes are very different. Oncogenes are recurrently mutated at specific amino acids, while tumor suppressors have a high fraction of truncating mutations distributed along the whole protein sequence (Vogelstein et al., 2013) (Figure 1). A sometimes applied “rule of thumb”, the 20/20 rule, states that genes in which 20% or more of the mutations occur at recurring sites are oncogenes, while genes in which 20% or more of the mutations have a truncating effect on the protein product are tumor suppressor genes (Vogelstein et al., 2013).
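The 20/20 rule described above can be illustrated with a minimal Python sketch. It is not the procedure used by Vogelstein et al.: the function name, the input format, the definition of a "recurring site" as a residue hit by at least two missense mutations, and the "ambiguous" outcome when both criteria are met are all assumptions made for illustration, and the mutation lists are toy data.

```python
from collections import Counter

def classify_gene_20_20(mutations, threshold=0.20):
    """Apply a simplified 20/20 rule to one gene (illustrative sketch).

    mutations: list of (protein_position, kind) tuples, where kind is
    "missense" or "truncating". Both the format and the labels are
    assumptions made for this example.
    """
    total = len(mutations)
    if total == 0:
        return "unclassified"

    # Fraction of mutations that truncate the protein product.
    truncating = sum(1 for _, kind in mutations if kind == "truncating")

    # Assumption: a "recurring site" is a residue hit by >= 2 missense mutations.
    missense_sites = Counter(pos for pos, kind in mutations if kind == "missense")
    recurrent = sum(n for n in missense_sites.values() if n >= 2)

    oncogene_like = recurrent / total >= threshold
    suppressor_like = truncating / total >= threshold

    if oncogene_like and suppressor_like:
        return "ambiguous"
    if oncogene_like:
        return "oncogene-like"
    if suppressor_like:
        return "tumor-suppressor-like"
    return "unclassified"

# Toy data: a hotspot-dominated gene and a gene with scattered truncations.
hotspot_gene = [(132, "missense")] * 8 + [(100, "missense"), (178, "missense")]
scattered_gene = ([(p, "truncating") for p in (50, 120, 251, 320, 445, 556)]
                  + [(p, "missense") for p in (60, 200, 661, 700)])

print(classify_gene_20_20(hotspot_gene))
print(classify_gene_20_20(scattered_gene))
```

In practice such a classification would only be a first pass, since cancer genes also carry passenger mutations and the counts depend on how many tumors have been sequenced.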


About 1% of human genes are reported to be recurrently mutated in cancer, most of them somatically and about 20% of them through germline variants that confer an increased risk of cancer (http://cancer.sanger.ac.uk/census). The methods used to identify them vary and have changed along with technological advances over the years. The first approaches consisted of investigating genes at the breakpoints of chromosomal translocations in hematological malignancies. Afterwards, systematic and informed literature reviews allowed the identification of candidate cancer genes that were then specifically assayed through biochemical experimentation (Stratton et al., 2009).

Figure 1. Pattern of mutations in oncogenes and tumor suppressor genes. a. Mutations in IDH1, a known oncogene, occur mostly at a specific hotspot in the protein product. b. Mutations in RB1, a known tumor suppressor gene, occur all along the protein sequence and are mostly truncating mutations. Red lollipops represent truncating mutations, green lollipops missense mutations and purple lollipops both truncating and missense events in the same amino acid. Protein functional domains are represented by colored boxes; the number of mutations is depicted on the left. The most recurrent events per protein are labeled with the amino acid change. Protein diagrams were generated with data for all cancer studies available on cBioPortal (Cerami et al., 2012, Gao et al., 2013), with the mutation mapper tool.

The sequencing of the human genome (Lander et al., 2001, Venter et al., 2001) and the explosion in genomic research that followed opened a new era in genetics research. The advances in sequencing technology and cost reduction (Wetterstrand K.A.) over the last 15 years, as well as the development of powerful data analytical methods, have allowed researchers in the field of genetics to approach questions in a more global and integrative way (Reuter et al., 2015). Cancer genetics is one of the fields that has benefited most from these advances in sequencing technology, because genetically profiling tumors and affected individuals is no longer a prohibitive endeavor but a cost-effective one (Meldrum et al., 2011, Storrs, 2012, Reuter et al., 2015, Chin et al., 2011).

1.1. The first cancer genomic studies

The first unbiased study of somatic mutations in cancer was published in 2006. It reported the huge effort carried out by Sjoblom et al. (2006) to investigate mutations in the 13,023 genes present at that time in the CCDS (consensus coding sequence) database, in two of the most common and aggressive cancers worldwide. To do so, they performed PCR amplification of the exons of these genes in 11 cell lines or xenografts of colorectal cancer and 11 of breast cancer, as well as in two matched normal samples. They identified about half a million putative somatic variants; after removal of germline variants and SNVs in dbSNP, followed by visual verification and re-sequencing, only 1307 confirmed somatic mutations in 1149 genes remained.

The identified genes were further sequenced in a set of 24 colorectal or breast cancer samples to investigate their mutational profile and mutation prevalence, and the identified variants were verified with similarly stringent methods. In the end, they had identified 921 breast and 751 colorectal cancer somatic mutations.

One of their most interesting observations was the difference in mutational spectrum between colorectal and breast cancer: 59% of the mutations in colorectal cancer and 35% in breast cancer were C:G>T:A, whereas 7% of the mutations in colorectal cancer and 29% in breast cancer were C:G>G:C. Furthermore, mutations tended to occur in different regions of genes and in specific sequence contexts. The authors report all these differences as being highly significant (P<0.0001) and indicative of the mechanisms responsible for mutagenesis in these two cancer types. They were able to confirm the presence of mutations in the previously known breast and colorectal cancer genes with frequencies >10%, in genes mutated in other cancer types, and in genes whose copy number changes have been linked to cancer.
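A mutational spectrum such as the one reported above is obtained by collapsing each substitution and its complement into one base-pair class (for example, G>A on one strand is the same event as C>T on the other, written C:G>T:A) and counting the fraction of mutations in each class. The following Python sketch shows this bookkeeping under stated assumptions: the function names and the input format are hypothetical, and the substitution list is toy data, not data from Sjoblom et al.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def base_pair_class(ref, alt):
    """Collapse a substitution onto the pyrimidine strand.

    For example, G>A is reported as C:G>T:A, the same class as C>T.
    """
    if ref in ("A", "G"):  # purine reference: switch to the complementary strand
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}:{COMPLEMENT[ref]}>{alt}:{COMPLEMENT[alt]}"

def spectrum(substitutions):
    """Return the fraction of substitutions in each base-pair class."""
    counts = Counter(base_pair_class(ref, alt) for ref, alt in substitutions)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}

# Toy list of (reference base, mutated base) pairs.
toy_substitutions = [("C", "T"), ("G", "A"), ("C", "G"), ("A", "G")]
for cls, fraction in sorted(spectrum(toy_substitutions).items()):
    print(cls, fraction)
```

Comparing two tumor types then amounts to comparing the resulting dictionaries class by class, with an appropriate statistical test for the difference in proportions.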

This study showed that unbiased, large-scale approaches are highly valuable not only for validating previously known genes but also for identifying novel candidates that might have been missed before and that may provide new insights into pathogenesis. It also showed that mutational spectra may vary significantly between cancer types, and that this variation can reflect the underlying mutational mechanisms responsible for tumor formation in the first place.

After this study, numerous others have followed with increased experience in approaches and analytical tools. A first report of a whole genome sequenced tumor (although the study focused on coding variants only) appeared in the journal Nature in 2008, followed by other reports in 2010 in the same journal. The discussion of two examples follows.

A euploid acute myeloid leukemia (AML) and a matched non-affected skin sample were whole genome sequenced (WGS) by Ley et al. (2008). Over 2.5 million SNVs were identified in the AML sample, but 97.6% of them were also present in the matched skin. The authors then further analyzed only the high confidence coding variants, selected after applying the filtering criteria delineated in Figure 2: eight heterozygous variants, all in genes not previously linked to AML (CDH24, PCDH24, GPR123, EBI2, PTPRT, KNDC1, SLC15A1, GRINL1B). They also identified two known somatic insertions in the FLT3 and NPM genes. All somatic events, except the FLT3 insertion, were present in the majority of reads, and therefore in the dominant clone, both in the original sample and in a relapse sample taken a year later. The major contributions of this study were advancements in the analysis of WGS data from tumors and the identification of somatic mutations with NGS technologies without the need for extremely high coverage. The identified genes had not been previously linked to AML, and the mutations identified, although somatic, may be relevant only for the sequenced patient or may not be relevant for cancer progression at all. The study further stresses the importance of unbiased approaches, as they allow the discovery of novel genes whose role in disease needs further investigation but that might not even have been considered as candidates before, for lack of evidence linking them to cancer.
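The core idea behind this kind of tumor/normal filtering, subtracting everything seen in the matched normal and in catalogs of known germline variation, can be sketched with set operations. This is a deliberately simplified illustration and not the pipeline of Ley et al.: the function name, the variant key format and the toy coordinates are assumptions, and real pipelines add coverage, quality and validation steps that are not modeled here.

```python
def candidate_somatic(tumor, normal, known_germline_sets):
    """Return tumor variants absent from the matched normal and from catalogs
    of known germline variation.

    Variants are keyed as (chromosome, position, ref, alt) tuples; this key
    format is an assumption made for the example.
    """
    candidates = set(tumor) - set(normal)      # drop the patient's germline variants
    for known in known_germline_sets:          # e.g. dbSNP or published genomes
        candidates -= set(known)
    return candidates

# Toy variant calls with made-up coordinates.
tumor_calls = {("chr1", 100, "C", "T"), ("chr2", 200, "G", "A"),
               ("chr3", 300, "A", "G"), ("chr4", 400, "C", "A")}
normal_calls = {("chr1", 100, "C", "T")}       # also seen in matched skin
catalog_a = {("chr2", 200, "G", "A")}          # stand-in for a public SNP catalog
catalog_b = {("chr9", 900, "T", "C")}          # stand-in for a published genome

somatic = candidate_somatic(tumor_calls, normal_calls, [catalog_a, catalog_b])
print(sorted(somatic))
```

The surviving candidates would still need experimental confirmation, as was done by re-sequencing in the studies discussed here.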

(29)

20 of 299

Figure 2. Filters used to identify somatic SNVs in a tumor sample with a matched normal. Watson/Venter correspond to genetic variation present in these two publicly available, high-confidence human genomes. UTR = untranslated regions. Reproduced from Ley et al., 2008.

In 2010, Pleasance, Cheetham and collaborators (Pleasance et al., 2010a) sequenced a metastatic melanoma cell line (COLO-829) and a lymphoblastoid cell line from the same individual (COLO-829BL) at 40x coverage. Variant calling was performed for both samples and the two sets of variants were then compared to identify somatic mutations private to the cancer cells. Somatic rearrangements were predicted from reads whose ends mapped to different parts of the genome, and the great majority of these and of the called SNVs were confirmed by PCR. When studying the mutational signature, the authors found that most mutations were C>T/G>A transitions and that the majority of the dinucleotide mutations were of the CC>TT type, compatible with the UV-light signature expected in melanoma. Further investigation led the authors to observe that the mutations occurred in a pyrimidine context much more frequently than expected by chance (P=0.0001 for both single and dinucleotide mutations), confirming that most mutations in COLO-829 are induced by UV light. The mechanisms of nucleotide excision repair (NER) were also analyzed: 2/3 of the mutations occurred on the non-transcribed strand, consistent with NER being a repair mechanism for UV-light-induced damage. The group went on to search for additional mutational mechanisms and identified C>A/G>T signatures, compatible with the involvement of reactive oxygen species (ROS) in melanoma. This study, and its back-to-back publication on the genomic analysis of small-cell lung cancer (Pleasance et al., 2010b), are highly valuable because they present the first comprehensive collection of coding and non-coding somatic variants in a cancer genome. They also present a detailed study of the mutational mechanisms in melanoma and small-cell lung cancer and the processes underlying them. The genes carrying somatic mutations identified in these studies could be important in carcinogenesis and merit further study. The authors predict that comprehensive, high-quality catalogs of somatic mutations in cancer will soon be accessible and will allow a better understanding of what causes cancer and which factors influence its treatment.
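Because a substitution on one DNA strand implies the complementary substitution on the other (for example, C>T and G>A describe the same event), mutation spectra such as the one discussed above are usually tallied after collapsing each substitution onto the pyrimidine of the mutated base pair. The following sketch shows one way to do this; the function names and the example input are illustrative assumptions, not code from the cited studies.

```python
from collections import Counter
from typing import Iterable, Tuple

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def substitution_class(ref: str, alt: str) -> str:
    """Collapse a substitution onto its pyrimidine (C or T) reference base."""
    ref, alt = ref.upper(), alt.upper()
    if ref not in COMPLEMENT or alt not in COMPLEMENT or ref == alt:
        raise ValueError(f"not a valid single-base substitution: {ref}>{alt}")
    if ref in ("A", "G"):
        # Purine reference: report the event on the opposite strand.
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}>{alt}"


def spectrum(substitutions: Iterable[Tuple[str, str]]) -> Counter:
    """Count substitutions in the six classes C>A, C>G, C>T, T>A, T>C, T>G."""
    return Counter(substitution_class(ref, alt) for ref, alt in substitutions)


# Illustrative input: two UV-like transitions, one C>A/G>T event, one T>C event.
example_spectrum = spectrum([("C", "T"), ("G", "A"), ("G", "T"), ("T", "C")])
```

With this convention, the C>T and G>A calls in the example are counted together as C>T, and the G>T call is counted as C>A.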

1.2. International cancer consortia

The huge potential of the comprehensive analysis of human cancers was initially recognized by the Wellcome Trust Sanger Institute in Cambridge, UK, where the Cancer Genome Project was originally proposed and started (Dickson, 1999). Several years later, in 2006, the National Human Genome Research Institute (NHGRI) and the National Cancer Institute (NCI) in the U.S.A. invested around 100 million dollars in the establishment of The Cancer Genome Atlas (TCGA), an initiative comprising several research institutions and research groups that studies cancer-specific DNA mutations and their relationship to cancer phenotype, aggressiveness, response to treatment and overall characteristics (The Cancer Genome Atlas consortium website).

Two years later, the International Cancer Genome Consortium (ICGC) was established as an umbrella for cancer sequencing efforts around the world, to provide the research community with a comprehensive set of genomic abnormalities in cancer and to promote the discovery of cancer targets, drugs and therapeutic strategies, clinical tests and biomarkers (International Cancer Genome et al., 2010, Jennings and Hudson, 2013). The ICGC also aims to make the generated data publicly available as fast as possible and to establish guidelines, standards and common approaches for the myriad studies of different cancer types that are being undertaken all over the world.

The ICGC comprises important groups such as the Cancer Genome Project from the Wellcome Trust Sanger Institute, the Tumor Sequencing Project from the Broad Institute and Baylor College of Medicine, and the TCGA effort. These major groups, along with several smaller consortia and individual labs, collaborate in the genomic characterization of different cancer types. To date, the original ICGC plan of sequencing 500 tumor samples and their matched germline samples for 50 tumor types (Jennings and Hudson, 2013) has grown to 78 committed and ongoing cancer sequencing projects (International Cancer Genome Consortium website). Current statistics (as of November 2015) for the ICGC can be found in Table 1, and a table with a full overview of the studied cancers and the types of analyses carried out on them can be found in the Appendix. Most of the investigated tumor types are common and solid, as their collection and isolation is mostly straightforward, but a special effort is being made regarding rare or liquid cancer types, such as adrenocortical carcinoma and leukemias, respectively (The Cancer Genome Atlas consortium website).

Table 1. Current statistics for the ICGC consortium. https://dcc.icgc.org/ Last accessed: 14.11.2015.

Statistic                      Value
Cancer projects                66
Cancer primary sites           21
Donors with molecular data     12,979
Total donors                   16,318
Simple somatic mutations       16,459,160
Mutated genes                  57,543

Although strategies vary among cancer sequencing projects, most of the large studies follow an approach similar to the one presented by Sjöblom et al. (Sjoblom et al., 2006), where non-affected tissue is compared to the tumor, but using the type of technology employed by Ley et al. (Ley et al., 2008) and Pleasance, Cheetham et al. (Pleasance et al., 2010a). More and more studies integrate different omics into the approach, providing a more rounded view of a tumor's landscape. Furthermore, the integration of datasets across various cancer types allows novel approaches and discoveries. The Pan-Cancer analysis project, for example, has focused on studying general cancer mechanisms by this means (Cancer Genome Atlas Research et al., 2013), and integrative studies have provided highly valuable data on the processes involved in cancer over recent years. Comprehensive analyses of SCNAs (Zack et al., 2013), somatic mutation landscapes (Kandoth et al., 2013, Alexandrov et al., 2013a), processes of mutagenesis (Burns et al., 2013), mutation heterogeneity (Lawrence et al., 2013) and other topics have provided an immense amount of novel findings and data for subsequent studies.

1.3. Cancer databases

The production of large amounts of data requires proper organizational and analytical tools. The Catalogue Of Somatic Mutations In Cancer (COSMIC) (Forbes et al., 2015, Cosmic Database), created in 2004, allows users to explore a constantly updated and manually curated database. It includes the Cancer Gene Census (http://cancer.sanger.ac.uk/census/), which at the beginning of December 2015 contained 572 genes involved in cancer across 2,500 human diseases and 47 primary tissues. COSMIC is a comprehensive collection of cancer genomics information, as it contains data from more than 300 cancer publications and the datasets produced through the ICGC and TCGA. COSMIC is strongest in the description of point mutations and fusion genes observed in cancer, but other features such as SCNAs and gene expression are also contained in the database. The data are easily accessible through the COSMIC website (http://cancer.sanger.ac.uk/), where they can be searched from a text input box or by selecting specific category fields in dropdown menus. The selected data are then displayed in user-friendly, intuitive diagrams along with complementary information and additional search and filtering options. Furthermore, the full database can be downloaded to allow detailed and customized exploration when required (Forbes et al., 2015). Additional databases similar or complementary to COSMIC have also been initiated by other consortia or groups. Some examples are cBioPortal (http://www.cbioportal.org/) from the Memorial Sloan Kettering Cancer Center/TCGA and the ICGC data portal (https://dcc.icgc.org/).

1.4. Comprehensive analysis of large datasets

The first study published by a sequencing consortium was on glioblastoma. The article, published in 2008, provides an integrative and comprehensive analysis of glioblastoma, the most common primary brain tumor in adults (Cancer Genome Atlas Research, 2008). The authors studied copy number, gene expression and methylation in 206 tumors, and complemented this with DNA sequence analysis of 91 of these tumors. Because of the obvious clinical interest in this tumor type, significant information was previously known. Glioblastomas frequently have amplifications or mutational activation of RTK genes, activation of PI3K pathway genes, and inactivation of TP53 or RB1 genes. Additional SCNAs were interrogated using microarrays. Homozygous deletions of NF1 and PARK2 were frequent events, as was amplification of AKT3. Amplification of other genes (FGFR2, IRS2) and deletion of others (PTPRD) were also recurrent events in the dataset. The changes in copy number correlated well with the expression of the genes within the SCNAs. cnLOH in 17q was also considered relevant in this analysis.

For somatic mutations, 601 genes were investigated, yielding 453 validated non-silent somatic variants in 223 genes, 79 of which were mutated more than once. Mutation rates were between 1.4 and 5.8 mutations/Mb, depending on the treatment status of the tumors. Untreated tumors harbored fewer mutations, since treated tumors were enriched for mutations in mismatch repair (MMR) genes. Eight genes were considered to be significantly mutated (FDR < 10^-3). Mutations in TP53 clustered in the DNA binding domain, a known TP53 mutation hotspot in human cancer, in a significant percentage of the tumors. NF1 was strongly put forward as an important gene in glioblastoma. Somatic inactivating mutations in this gene were found in 14% of the tumors and deletions in about 20% of the samples, while some tumors had reduced NF1 expression but no detectable genomic alteration. The EGFR family of genes is known to be frequently activated in glioblastoma, and almost half the samples had EGFR alterations. However, a point mutation in ERBB2 had previously been reported in glioblastoma in only one case. Twenty-two of the tumors had focal amplifications of the WT allele, 16 had point mutations associated with a focal amplification, and 3 had ERBB2 point mutations with no amplification. Although only half of the point mutations were validated, ERBB2 can be considered a gene directly involved in glioblastoma. Additionally, activating missense mutations of PIK3CA are known to occur frequently in glioblastoma, and were found in a fraction of the studied samples. On the other hand, its regulatory protein PIK3R1 was not known to be frequently mutated in cancers, but it was found mutated in 10% of the studied tumors. The mutations and small deletions all occurred in the same protein domain and were predicted to cause constitutive PI(3)K activity.

The most interesting observation in this study was probably the one obtained from the methylation analysis. Cancer-specific DNA methylation of the CpG regions in the promoters of 2,305 genes was measured and compared to methylation in normal brain. The methylation status of the promoter of the MGMT gene (a DNA repair gene whose protein product removes alkyl groups from guanine residues) was found to be associated with a tumor's response to treatment with alkylating agents. About 20% of the samples had a methylated MGMT promoter. Treated glioblastomas showed a dramatic change in the sequence context in which mutations tended to occur (Figure 3a), consistent with a failure to repair the alkylated guanine residues caused by treatment. Most interestingly, the mutational spectrum of the MMR genes reflected MGMT methylation status (Figure 3b). These observations show that the initial genetics of the tumor, together with the treatment, can affect the rate at which somatic mutations are acquired. These interactions should be taken into consideration when selecting a therapy for patients with a methylated MGMT promoter.

The integrated analysis of this glioblastoma cohort unveiled a large network of alterations, with the RTK, P53 and RB pathways at its core. These three main pathways were affected in most tumors (P<0.0018), but within a given tumor only one component of each pathway tended to harbor an inactivating mutation, and this phenomenon was found to be highly significant (P53: P = 9.3 x 10^-10; RB: P = 2.5 x 10^-13; RTK: P = 0.022).

This first integrative approach to the study of a cancer type by a consortium was deemed successful. The known genetic factors of glioblastoma were confirmed, and novel relevant genes and regulatory pathways were uncovered. Furthermore, some of the findings can provide a direct therapeutic benefit to individuals whose tumors have a specific phenotype. Approaching cancer research in an unbiased and systematic way, with no prior hypothesis, proved valuable for uncovering unexpected key players in cancer.


Over the last few years, these world-wide collaborations in cancer genomics research have produced substantial discoveries. Besides the detailed and comprehensive characterization of specific tumor types, the integration of the information gathered for individual samples has proven highly valuable. For example, the grouped evaluation of numerous tumor types provided detailed information on the mechanisms and patterns of SCNAs in cancer (Zack et al., 2013). Additionally, the deep analysis of the mutational landscape of several tumor types has shown that the number of accumulated mutations (mutational load) is highly variable across cancers (Lawrence et al., 2013), and that mutational signatures might be a powerful means to classify, and perhaps even therapeutically approach, specific cancer types, since they are frequently correlated with tumor etiology (Alexandrov et al., 2013a).

1.4.1. SCNAs profiling in large datasets

Figure 3. Methylation status of the MGMT promoter and its relationship with treatment and mutational contexts in glioblastoma. a. The y axis corresponds to the number of mutations. The x axis corresponds to the treatment status of a patient (+ treated, - non-treated), the methylation status of the MGMT promoter (Meth = methylated, - non-methylated), and the mutational status of MMR genes (Mut = at least one MMR gene mutated, - non-mutated). The numbers under the bars represent the number of samples in each group. b. Mutational spectrum of the MMR genes as a function of treatment and methylation status of the MGMT promoter. Color codes for both graphs are at the bottom. Reproduced from Cancer Genome Atlas Research, 2008.

Using SNP arrays, Zack and collaborators studied SCNAs across 4,934 tumor samples of 11 different cancer types (Zack et al., 2013) as part of the Pan-Cancer analysis project (Cancer Genome Atlas Research et al., 2013), with the objective of distinguishing driver from passenger events and identifying the mechanisms of SCNA acquisition in cancer. They also aimed at pinpointing the key genes within an SCNA that ultimately drive the cancer phenotype. After inferring the SCNA profiles that best explained the ploidy determined for each tumor, the authors called 202,244 SCNAs (median of 39 per tumor sample) and classified them into 6 different categories (Table 2).

Table 2. Types of SCNAs across 11 tumor types. Events assessed from SNP array data of 4,934 tumors from Zack et al., 2013.

SCNA category                                       Median per tumor
Focal copy gain, smaller than chromosome arm        11
Focal copy loss, smaller than chromosome arm        12
Arm-level copy gain, full arm-length or longer      3
Arm-level copy loss, full arm-length or longer      5
cnLOH                                               1
Whole genome duplication                            37% of cancers

They observed that cancers with whole genome duplication (WGD) had twice the rate of SCNAs of tumors without it. This correlated well with WGD tumors having an average ploidy of 3.31 rather than 4, while tumors with no WGD had a ploidy of 1.99 (when 2 was the expected ploidy). WGD occurred early in the history of SCNA events in tumors, while other types of SCNAs arose after the WGD event. The average copy number profile for these 11 cancer types in the WGD or near-euploid state can be seen in Figure 4a.

Focal SCNAs that extended to the telomeres were longer than intrachromosomal SCNAs. These internal SCNAs had frequencies inversely proportional to their length, while telomeric SCNAs were uniform in size (Figure 4b) and were more frequent than expected assuming random positions for SCNAs (P<0.0001). SCNAs in general tended to end at the centromeres.
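The size-based categories of Table 2 and the notion of average ploidy can be made concrete with a small sketch. It assumes that each copy number segment is described by its length, the length of the chromosome arm it lies on, and its copy number change, and it applies the simple rule from Table 2 (focal if shorter than the arm, arm-level otherwise). The function names and the example values are hypothetical; the criteria actually used by Zack et al. (2013) are more elaborate.

```python
from typing import Iterable, Tuple


def classify_scna(segment_length: int, arm_length: int, copy_change: int) -> str:
    """Label an SCNA as focal or arm-level, and as a gain or a loss."""
    if copy_change == 0:
        raise ValueError("copy_change must be non-zero for a gain or a loss")
    scale = "arm-level" if segment_length >= arm_length else "focal"
    direction = "gain" if copy_change > 0 else "loss"
    return f"{scale} copy {direction}"


def mean_ploidy(segments: Iterable[Tuple[int, float]]) -> float:
    """Length-weighted average copy number over (length, copy_number) segments."""
    segments = list(segments)
    total_length = sum(length for length, _ in segments)
    if total_length <= 0:
        raise ValueError("segments must cover a positive total length")
    return sum(length * copy_number for length, copy_number in segments) / total_length


# Illustrative genome: half of it at two copies, half at four copies.
example_ploidy = mean_ploidy([(100, 2), (100, 4)])
```

A genome that was duplicated and then lost copies of some regions would yield an average below 4 under this calculation, which is the pattern reported for WGD tumors.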

Figure 4. Characteristics of different types of SCNAs. a. Number of amplifications (red) or deletions (blue) in 10 cancer types from an arm-level or a focal perspective (top and bottom, respectively). In each cancer type, samples with WGD events are at the right and samples without WGD at the left; SCNAs in samples with WGD are resolved according to their timing relative to the WGD event. b. Distribution of lengths of SCNAs originating at telomeres compared to intra-chromosomal SCNAs. c. Rates of chromothripsis across different cancer types. BLCA = bladder, BRCA = breast, COAD = colon and rectal carcinoma, GBM = glioblastoma multiforme, HNSC = head and neck squamous cell, KIRC = kidney renal cell, LUAD = lung adenocarcinoma, LUSC = lung squamous cell, OV = ovary, UCEC = uterine corpus endometrial carcinoma. All three panels reproduced from Zack et al., 2013.

Chromothripsis was detected in 5% of samples, with frequencies that varied by tumor type (Figure 4c) but were unrelated to the overall rate of SCNAs per sample. Chromothripsis tended to occur in specific regions and was associated with particular driver events.

Across all cancer types, 70 recurrent amplifications and the same number of recurrent deletions were identified. The authors identified "peak" regions within these SCNAs that were more likely to contain oncogenes or tumor suppressor genes. SCNAs within the peak regions were shorter than events occurring elsewhere in the chromosome (P<0.0001) and were also more often high-amplitude events (P<0.0001). The frequency of events in these peak regions was stable across tumors of the same lineage. Of the 70 peak regions of amplification, 24 contained an oncogene known to be activated by amplification (such as CCND1, EGFR, MYC, ERBB2, and CCNE1) or other genes directly involved in carcinogenesis, such as TERC, which encodes the RNA template used by TERT, a known oncogene. Of the peak regions of deletion, 12 contained tumor suppressor genes (such as ATM, NOTCH, FOXK2 and PPP2R2A) and two other regions had tumor suppressor gene candidates (ERRFI1 and FOXC1).

The peaks that contained no obvious cancer gene candidates were subjected to literature citation searching algorithms, and enrichment for topics related to epigenetic and mitochondrial regulation was observed. This finding stresses the relevance of epigenetic alterations in cancer progression, in concordance with previous observations (Berman et al., 2012, Fullgrabe et al., 2011). When significantly mutated genes (SMGs) were called within the peak regions, the authors recovered all the genes known to be tumor suppressors, as well as a significant fraction of the genes known to act as oncogenes through amplification. Notably, deleted regions are probably enriched for tumor suppressor genes, as they had more truncating and frameshift deletions than expected (P=0.0002). Furthermore, of the 770 peak regions identified across specific cancer lineages, 84% occurred in at least two lineages and 65% were inside peak regions from the pan-cancer analysis.

This study was the first high-resolution analysis of SCNAs and ploidy across several cancer types. It reports areas of the genome that are enriched in SCNAs in cancer and that probably contain genes or regulatory elements acting as drivers of tumorigenesis, and it confirms the relevance of several previously identified tumor suppressor genes and oncogenes. It stresses the advantages of unbiased approaches in large datasets for the identification of events and genes important in carcinogenesis.

1.4.2. Factors influencing mutation rates

As mentioned before, the most conventional approach for the identification of somatic mutations and genes directly involved in tumorigenesis is the comparison of tumor and germline DNA in large sample sets, followed by statistical analysis to identify SMGs. It was initially thought that large datasets would increase the sensitivity and the specificity of the analyses, but in most cases what increased was the number of false positives: highly mutable genes such as olfactory receptors, and large genes such as TTN and PCLO, were systematically identified and sometimes even nominated as cancer genes. To better understand mutational processes in cancer, over 3,000 tumor-germline pairs from 27 different cancer types, for which whole exome or whole genome sequencing had been carried out, were studied as part of the Pan-Cancer analysis project (Lawrence et al., 2013).

The authors observed large variation in mutation rates among different cancer types (Figure 5). The disparity among tumors of the same type was also striking, with some specimens having mutation rates as low as 1 mutation/Mb and others exceeding 100 mutations/Mb, as in the case of melanoma (Figure 5). In many cases, increased or decreased mutation rates could be correlated with known factors, such as tobacco use in lung carcinomas, exposure or not to UV light in melanoma, or the presence of mismatch repair mutations in colon cancers.
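The mutation rates quoted above are simply the number of somatic mutations divided by the amount of sequence in which they could be called, expressed per megabase. A minimal sketch follows; the sample names, the mutation counts and the assumed 30 Mb of covered exome are illustrative values, not data from Lawrence et al. (2013).

```python
from statistics import median


def mutations_per_mb(n_mutations: int, covered_bases: int) -> float:
    """Somatic mutation rate in mutations per megabase of covered sequence."""
    if covered_bases <= 0:
        raise ValueError("covered_bases must be positive")
    return n_mutations / (covered_bases / 1_000_000)


# Hypothetical cohort: somatic mutation counts per sample over a 30 Mb target.
COVERED_BASES = 30_000_000
example_counts = {"sample_1": 30, "sample_2": 450, "sample_3": 3000}
example_rates = {
    sample: mutations_per_mb(count, COVERED_BASES)
    for sample, count in example_counts.items()
}
example_median_rate = median(example_rates.values())
```

In this toy cohort the rates span two orders of magnitude (1, 15 and 100 mutations/Mb), the kind of within-type spread described for melanoma, and tumor types would be ordered by their median as in Figure 5.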

Lawrence and collaborators also examined the mutational profile of each tumor type, and whether these profiles could be used to cluster different tumor types together. They observed that different types of lung cancer share the C>A signature consistent with tobacco exposure, while melanomas have mostly C>T mutations, a classical sign of UV-light-induced mutation. Tumors of the gastrointestinal tract show C>T mutations in a CpG context, and epithelial cancers (bladder, cervical, and head and neck) display a large fraction of C>T/G mutations in a TpC context. The latter could be caused by APOBEC enzymes, which restrict viral infection, a common contributing factor in a fraction of these cancer types.


Mutation frequency was found to vary significantly throughout the genome of specific tumors and within tumor types. This could be a consequence of specific genomic features, such as gene expression (Pleasance et al., 2010a). The germline mutation rate is lower in highly expressed genes due to transcription-coupled repair (Fousteri and Mullenders, 2008), and this was confirmed in their sample set, where mutations were less frequent in highly expressed genes: the average mutation rate was almost 3 times higher in the lowest expressed genes than in the highest expressed ones. Another important feature was DNA replication time, which is also known to be correlated with germline mutation (Stamatoyannopoulos et al., 2009). Late-replicating regions are expected to have higher mutation rates, and this correlation was indeed observed in the dataset of Lawrence and collaborators, where the mutation rate was three times higher in the latest than in the earliest replicating regions. These observations explain some of the false positive cancer genes. For example, both olfactory receptors and large genes are lowly expressed, late replicating, and have a large number of silent or intronic mutations (Lawrence et al., 2013).

Figure 5. Somatic mutation frequencies in 27 different tumor types. Each dot was obtained through a tumor-matched normal comparison; the vertical axis indicates the frequency of somatic mutations in the exome. Tumor types are ordered by their median somatic mutation frequency, and the number of samples per tumor type is indicated above the plot. Reproduced from Lawrence et al., 2013.

The authors integrated their observations to create a powerful algorithm to identify SMGs in cancer. This algorithm, called MutSigCV, takes into account the mutational covariates described in their publication and performs well when identifying SMGs, effectively reducing false positives. Since its development, MutSigCV has been the program of choice to identify cancer genes and it has been used in numerous publications (Cancer Genome Atlas Research, 2012, Cancer Genome Atlas Network, 2015, Cancer Genome Atlas, 2015, Gao et al., 2014, Jones et al., 2012, Pugh et al., 2013, Pickering et al., 2014).
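The reason a gene-specific background rate reduces false positives can be shown with a deliberately simplified sketch. This is not the MutSigCV algorithm: it assumes that a background mutation rate per base has already been estimated for each gene (for instance from covariates such as expression and replication time) and models the number of mutations in a cohort as a Poisson variable. All function names and numeric values are illustrative assumptions.

```python
from math import exp, lgamma, log


def poisson_upper_tail(k: int, lam: float, max_terms: int = 10_000) -> float:
    """P(X >= k) for X ~ Poisson(lam), summed term by term in log space."""
    if lam <= 0:
        raise ValueError("lam must be positive")
    if k <= 0:
        return 1.0
    total = 0.0
    for i in range(k, k + max_terms):
        term = exp(-lam + i * log(lam) - lgamma(i + 1))
        total += term
        if term < total * 1e-16:  # remaining terms are negligible
            break
    return min(total, 1.0)


def gene_p_value(observed: int, gene_length: int, n_samples: int,
                 background_rate: float) -> float:
    """Probability of seeing at least `observed` mutations by chance alone."""
    expected = background_rate * gene_length * n_samples
    return poisson_upper_tail(observed, expected)


# Same observed count (5 mutations in 100 tumors), very different conclusions.
# Hypothetical large, late-replicating gene with a high background rate:
p_large_gene = gene_p_value(5, gene_length=100_000, n_samples=100,
                            background_rate=3e-6)
# Hypothetical small gene with a low background rate:
p_small_gene = gene_p_value(5, gene_length=1_500, n_samples=100,
                            background_rate=1e-6)
```

Five mutations are unremarkable for the large gene, where about 30 are expected by chance, but highly unexpected for the small one, where about 0.15 are expected. A single genome-wide background rate would miss this distinction.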

The study of a very similar dataset confirmed the results regarding mutation rates among cancer types and mutational signatures, and how they can be used to cluster tumors by class and sometimes by etiology (Kandoth et al., 2013). This study additionally identified several genes that were very frequently mutated in cancer. The most frequently mutated gene in their 3,281 tumors of 12 different types was TP53 (42% of the samples), followed by PIK3CA (mutated in 10% of the samples). Mutations in these two genes were specific to particular groups of cancers: TP53 mutations were more frequent in ovarian and endometrial carcinomas and in basal breast cancer, while PIK3CA mutations did not occur in ovarian, lung or kidney cancers. Mutations in SMGs across the 12 cancer types were subjected to unsupervised clustering, and 72% of the samples clustered with tumors of the same tissue type, having mutations in the same driver genes. Pairwise comparisons among mutations in the 127 SMGs identified in this study found 14 mutually exclusive gene pairs. For example, TP53 and CDH1 are mutually exclusive in breast cancer (FDR = 0.05), while TP53, PTEN, VHL, NPM1 and GATA3 are mutually exclusive across the full dataset (P=0.01). In contrast, a number of associations were also detected, with 148 co-occurring mutation pairs across the SMGs in the dataset.
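Mutual exclusivity between two genes can be assessed from a 2x2 table of tumors (both genes mutated, only the first, only the second, neither). One common choice, shown below as a sketch, is a one-sided Fisher's exact test asking how likely it is to see so few doubly mutated tumors if the two genes were mutated independently. This is an assumption for illustration and not necessarily the test used by Kandoth et al. (2013); the counts in the example are invented.

```python
from math import comb


def exclusivity_p_value(both: int, only_a: int, only_b: int, neither: int) -> float:
    """One-sided Fisher's exact test: P(co-occurrences <= both) under independence."""
    n_a = both + only_a                 # tumors with gene A mutated
    n_b = both + only_b                 # tumors with gene B mutated
    n = both + only_a + only_b + neither
    smallest_overlap = max(0, n_a + n_b - n)
    favourable = sum(
        comb(n_a, k) * comb(n - n_a, n_b - k)
        for k in range(smallest_overlap, both + 1)
    )
    return favourable / comb(n, n_b)


# Invented cohort of 10 tumors: 5 with gene A only, 5 with gene B only.
p_exclusive = exclusivity_p_value(both=0, only_a=5, only_b=5, neither=0)
```

For the invented cohort, only one of the 252 possible arrangements has no doubly mutated tumor, so the p-value is 1/252. The same table read from the upper tail (P(co-occurrences >= both)) would test for co-occurrence instead.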

Furthermore, Kandoth et al. (2013) were able to place the occurrence of mutations in time within the history of a tumor by looking at the variant allele fraction (VAF) distribution of mutations in some of the SMGs in acute myeloid leukemia (AML), breast and uterine/cervical cancers. TP53 had the highest VAF in these cancer types, suggesting that it tends to be mutated early in tumorigenesis, although its elevated VAF might be due to cnLOH, a common event in TP53 and other tumor suppressors such as BRCA1, BRCA2 and PTEN. Other genes were also identified in specific tumor types. For example, in AML, DNMT3A and SMC3 had significantly higher VAFs than average (P<0.0003 and P<0.05, respectively). These analyses are interesting because they can allow the identification of the primary drivers of tumorigenesis among all the contributing drivers in a specific cancer type. The results of this study, along with the ones described before and many other reports, have been crucial to the global understanding of cancer genomics and of the different processes and mechanisms underlying it.
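The variant allele fraction used for this kind of temporal ordering is the share of sequencing reads at a position that carry the mutant allele. The sketch below also shows, under the standard assumption of a tumor sample mixed with diploid normal cells, why copy-neutral LOH raises the expected VAF of a clonal mutation. The function and parameter names are illustrative and not taken from Kandoth et al. (2013).

```python
def variant_allele_fraction(alt_reads: int, total_reads: int) -> float:
    """Observed VAF: mutant reads divided by all reads covering the position."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return alt_reads / total_reads


def expected_vaf(purity: float, mutated_copies: int, tumor_copies: int,
                 normal_copies: int = 2) -> float:
    """Expected VAF of a clonal mutation in a tumor/normal cell mixture.

    purity: fraction of tumor cells in the sample (0 < purity <= 1).
    mutated_copies: copies of the locus carrying the mutation in tumor cells.
    tumor_copies: total copies of the locus in tumor cells.
    """
    if not 0 < purity <= 1:
        raise ValueError("purity must be in (0, 1]")
    mutant = purity * mutated_copies
    total = purity * tumor_copies + (1 - purity) * normal_copies
    return mutant / total


# Pure tumor, heterozygous mutation at a two-copy locus.
vaf_heterozygous = expected_vaf(1.0, mutated_copies=1, tumor_copies=2)
# Pure tumor, same locus after cnLOH: both copies now carry the mutation.
vaf_cnloh = expected_vaf(1.0, mutated_copies=2, tumor_copies=2)
```

A clonal heterozygous mutation is expected at a VAF of 0.5 in a pure tumor, whereas after cnLOH the same mutation is expected at 1.0, so a high VAF alone does not prove that a mutation arose early.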

1.4.3. Mutational signatures

Finding better ways to identify driver genes in cancer, as well as identifying the genes themselves, is a large and important endeavor in cancer genomics. However, the study of the full spectrum of somatic mutations can also help us understand the mechanisms of cancer and how this disease begins.

Even though most of the mutations identified in cancer are passenger mutations, they are the product of the same mutational processes that give rise to the drivers. The mutations bear the signature of these mutational processes: the type of DNA damage, the length and strength of exposure to it, and the DNA repair mechanisms involved. What we ultimately see when a tumor is sequenced is a combination of different mutational signatures resulting from all the processes involved in cancer formation (Figure 6). Our understanding of these processes was, until recently, limited to specific genes, but genome-wide approaches are starting to be applied to the analysis of large or comprehensive datasets (Helleday et al., 2014, Alexandrov and Stratton, 2014).

The most convenient dataset for studying the full landscape of mutational signatures in cancer is WGS of cancer samples. Available sample sets of this kind are limited. Apart from a complete yet concise report of mutational signatures in melanoma and lung cancer (Pleasance et al., 2010a, Pleasance et al., 2010b) (briefly discussed above), none had been studied until Serena Nik-Zainal et al. (2012) analyzed mutational signatures in the genomes of 21 breast cancers, nine of which had germline predisposing mutations in BRCA1 or BRCA2. The 21 tumors and their matched germline samples were sequenced to >30x and somatic variants were called. RNAseq was performed for 17 of the 21 samples.

The authors observed significant variation in the frequency of each type of substitution (C>A, C>G, C>T, T>A, T>C, and T>G) across all samples. When the bases at the 5' and 3' sides of the mutated base were included as the context of the mutation, certain contexts were overrepresented compared to chance. They exemplify this finding with C>T mutations in an XpCpG context (where X is any base), which were found in all the tumors. This is a well-known mutational mechanism, due to the deamination to thymine of methylated cytosines in that particular context. Furthermore, this signature is more common outside of CpG islands (P < 0.0001), where there is more methylation (Nik-Zainal et al., 2012). To dissect all mutational signatures, including overlapping and weak ones, the authors used a mathematical approach that extracts patterns from multidimensional data (Alexandrov et al., 2013b). They found five different mutational signatures in the 21 samples, each with a particular profile over the trinucleotide contexts and each occurring at a different frequency in the 21 breast cancers, which gives rise to the particular mutational signature of each tumor (Figure 7a,b).
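Combining the six substitution types with the four possible bases on each side of the mutated base gives 6 x 4 x 4 = 96 mutation categories, which form the profile on which signature extraction operates. The sketch below assigns a mutation to its category after collapsing it onto the pyrimidine strand; the `A[C>T]G` label format and the function names are assumptions made for illustration.

```python
from typing import List

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def trinucleotide_channel(context: str, alt: str) -> str:
    """Mutation category for a substitution in its trinucleotide context.

    context: 5' base, reference base and 3' base on the reference strand.
    alt: the mutant base replacing the middle base of `context`.
    """
    context, alt = context.upper(), alt.upper()
    if len(context) != 3 or len(alt) != 1 or not set(context + alt) <= set("ACGT"):
        raise ValueError("expected a 3-base context and a 1-base alt over ACGT")
    if alt == context[1]:
        raise ValueError("alt must differ from the reference base")
    if context[1] in ("A", "G"):
        # Purine reference: describe the event on the opposite strand.
        context, alt = reverse_complement(context), reverse_complement(alt)
    return f"{context[0]}[{context[1]}>{alt}]{context[2]}"


def all_channels() -> List[str]:
    """Enumerate every category: 4 (5' base) x 6 (substitution) x 4 (3' base)."""
    return [
        f"{five}[{ref}>{alt}]{three}"
        for five in "ACGT"
        for ref, alts in (("C", "AGT"), ("T", "ACG"))
        for alt in alts
        for three in "ACGT"
    ]
```

For instance, a C>T change in an ACG context and a G>A change in a CGT context describe the same event seen from opposite strands, and both map to the category `A[C>T]G`, an example of the XpCpG context discussed above.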

Figure 7. Mutational signatures across 21 breast cancers. a. Fraction of contribution of each mutation type in each context in the five mutational signatures identified in breast cancer. The major components contributing to each signature are marked with arrows. b. Proportion (marked below) of the total substitutions of each of the five mutational signatures in “a” for the 21 breast cancer genomes. Reproduced from Nik-Zainal et al., 2012.


Although no correlation was found between the presence of specific somatically mutated genes and a signature, unsupervised clustering grouped all the cancers with BRCA1 or BRCA2 mutations together because of the similarity of their mutational signatures: Signature A was depleted and signature D was enriched in comparison to the other tumors. Dinucleotide mutations were also enriched in these tumors (P < 0.001) and followed the same patterns as single nucleotide substitutions. Regional micro-clusters of multiple mutations, termed kataegis, were found in 13 of the 21 tumors. These were sometimes associated with chromothripsis or other structural rearrangements and were loosely associated with signatures E and B.
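Localized clusters of mutations of this kind can be found by looking at the distances between consecutive mutations on a chromosome. The sketch below groups mutations that lie close to each other and reports groups above a minimum size. The thresholds used (at most 1,000 bp between neighbors, at least six mutations) are assumptions chosen for illustration and are not stated in the text above; the function name and example positions are likewise hypothetical.

```python
from typing import Iterable, List, Tuple


def find_mutation_clusters(positions: Iterable[int], max_gap: int = 1000,
                           min_mutations: int = 6) -> List[Tuple[int, int, int]]:
    """Find runs of closely spaced mutations on one chromosome.

    Returns (start, end, n_mutations) for every run in which consecutive
    mutations are at most `max_gap` bases apart and that contains at least
    `min_mutations` mutations.
    """
    clusters: List[Tuple[int, int, int]] = []
    current: List[int] = []
    for pos in sorted(positions):
        if current and pos - current[-1] > max_gap:
            # Gap too large: close the current run and start a new one.
            if len(current) >= min_mutations:
                clusters.append((current[0], current[-1], len(current)))
            current = []
        current.append(pos)
    if len(current) >= min_mutations:
        clusters.append((current[0], current[-1], len(current)))
    return clusters


# Six tightly spaced mutations followed by two isolated ones.
example_clusters = find_mutation_clusters(
    [100, 300, 450, 800, 1200, 1500, 50_000, 120_000]
)
```

Only the first six positions form a cluster in this example; the two distant mutations are treated as isolated events.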

The analysis of transcription and its relationship to mutational processes in these cancers showed that transcriptional strand bias and expression-related mutations occur in breast cancer but that they usually do not co-drive mutational processes and instead, happen independently. It was also observed that the greater the distance from the transcription start site (TSS), the greater the accumulation of mutations (Nik-Zainal et al., 2012).

The signatures of indels were also investigated. Interestingly, indels mediated by regions of overlapping microhomology (short tandem repeats or stretches of the same base at the breakpoints) were more likely to occur in tumors with BRCA1 or BRCA2 germline mutations (P = 2.23 x 10^-16). Microhomology is a signature of non-homologous end joining (NHEJ), whereas BRCA1 and BRCA2 are involved in homologous recombination. When these genes are mutated, cells would presumably find alternative ways of carrying out DNA repair (Nik-Zainal et al., 2012). The results summarized above, and others not discussed here, highlight the importance of genome-wide analysis of mutational signatures, and show the complex and dynamic relationships between mutation, transcription, and DNA repair.
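Microhomology at a deletion can be detected by comparing the deleted sequence with the sequence flanking the breakpoints. The sketch below measures how many bases at the start of the deleted sequence match the start of the right flank, or how many bases at its end match the end of the left flank, and uses this to sort deletions into three classes. The three-way classification, the function names and the example sequences are assumptions for illustration and may differ from the criteria applied by Nik-Zainal et al. (2012).

```python
def _common_prefix_length(a: str, b: str) -> int:
    """Number of leading characters shared by two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def microhomology_length(left_flank: str, deleted: str, right_flank: str) -> int:
    """Longest breakpoint homology between a deletion and its flanks."""
    forward = _common_prefix_length(deleted, right_flank)
    backward = _common_prefix_length(deleted[::-1], left_flank[::-1])
    return max(forward, backward)


def classify_deletion(left_flank: str, deleted: str, right_flank: str) -> str:
    """Classify a deletion by the sequence homology at its breakpoints."""
    if not deleted:
        raise ValueError("deleted sequence must not be empty")
    homology = microhomology_length(left_flank, deleted, right_flank)
    if homology >= len(deleted):
        # The whole deleted unit is repeated next to the breakpoint.
        return "repeat-mediated"
    if homology > 0:
        return "microhomology-mediated"
    return "other"
```

As a usage example, deleting `GCATC` between the flanks `ACGTTT` and `GCAAGG` leaves three bases (`GCA`) of homology at the breakpoint and is classified as microhomology-mediated, whereas deleting one `CAG` unit from a `CAG` repeat is classified as repeat-mediated.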

To extend the analysis on breast cancer to the majority of tumor types that had been sequenced to that point, Alexandrov et al. (2013a) analyzed the almost 5 million mutations from over 7,000 cancers of 30 different types/subtypes (507 WGS and 6,535 WES) using the same approach utilized for breast cancer but adapting it to exome-sequenced data.

By using point mutations and information about their trinucleotide context, the authors identified 21 different mutational signatures. Some of them were more frequent than
