Computational analyses of gene fusions, viruses and parasitic genomic elements in breast cancer

Université Libre de Bruxelles

Faculté de Médecine

Danai Fimereli

Supervisor: Vincent Detours

Thesis presented for the degree of Doctor in Biomedical Sciences

This thesis was written under the supervision of Prof. Vincent Detours

The Jury members are:

Prof. Pierre Bergmann (President), Prof. Daniel Christophe,

Prof. Vincent Detours (Supervisor), Prof. Tom Lenaerts,

Prof. Pierre Smeesters, Prof. Guillaume Smits

The external experts are:

Prof. Katleen De Preter (UGent) and

Abstract

Breast cancer is the most common cancer in women and research efforts to unravel the underlying mechanisms that drive carcinogenesis are continuous. The emergence of high-throughput sequencing techniques and their constant advancement, in combination with large scale studies of genomic and transcriptomic data, allowed the identification of important genetic changes that take place in the breast cancer genome, including somatic mutations, copy number aberrations and genomic rearrangements.

The overall aim of this thesis is to explore the genetic changes that take place in the breast cancer transcriptome and their possible contribution to carcinogenesis. This thesis comprises three different but connected research studies and was performed under the supervision of Prof. Vincent Detours at the Institut de Recherche Interdisciplinaire en Biologie Humaine et Moléculaire, Université Libre de Bruxelles.

The aim of the first research study was the identification of expressed gene fusions in breast cancer and the study of their association with other genomic events. To this end, transcriptome sequencing and Single Nucleotide Polymorphism array data for a cohort of 55 tumors and 10 normal breast tissues were combined. Gene fusions were detected in the majority of the samples, with evident differences between breast cancer subtypes: HER2+ samples had significantly more fusions than the other subtypes. The genome-wide analysis uncovered localization of fusion genes on specific chromosomes such as 17, 8 and 20. Additionally, a positive correlation between the number of gene fusions and the number of amplifications was observed, including the association between fusions on chromosome 17 and the amplifications in HER2+ samples, which can be attributed to the highly rearranged genomes of this subtype. Finally, the absence of highly recurrent fusions across this cohort adds to the notion that gene fusions in breast cancer are most likely private events, with the majority being “passenger” events.

sequences (~2-30 reads) were extracted belonging to the viruses EBV, HHV6 and Merkel cell polyomavirus. Such low levels of viral expression argue against a viral etiology for breast cancer, but one should not exclude possible cases of integrated but silent viruses.

In the third research project, we analyzed in silico the transcriptional profiles of human endogenous retroviruses in breast cancer. Although ERVs are scattered across the genome in large numbers, only some of them are actively transcribed, accounting for a small percentage of the total mapped reads. Alongside protein-coding genes and lncRNAs, they show distinct expression profiles across the different breast cancer subtypes, with luminal and basal-like samples clearly separating from each other. Additionally, distinct profiles between ER+ and ER- samples were observed. Tumor-specific ERV loci show an association with the immune status of the tumors, indicating that ERVs are reactivated in tumors and could play a role in the activation of the immune response cascade.

Acknowledgements

I would like to start this note by thanking Prof. Vincent Detours for giving me the opportunity to complete this thesis in his lab. His help and guidance throughout these years have been more than valuable and none of it would have happened if it were not for him. His constant striving to perform better, his inspirational motivation and the fact that he never allowed me to underestimate my research work have helped me become the researcher I am today. I am also grateful to Dr. Christos Sotiriou for his inspiring collaboration and guidance and to Prof. Carine Maenhaut for her help and guidance during this thesis. I would also like to thank Dr. Debora Fumagalli for always being available to answer my questions. This research received financial support from the Belgian Fonds National de la Recherche Scientifique (F.R.S-FNRS) through a Télévie grant, the Fonds David et Alice Van Buuren and the Fondation Rose et Jean Hoguet.

This PhD process would not have been the same without my fellow bioinformaticians/computational biologists that I was honored to spend these past years with. First and foremost, Maxime Tarabich and Gil Tomás, we went through all of it together and we did it! Without you two this trip would not have been so awesome! David Gacquer, your help throughout all these years is more than priceless! Tomasz Konopka, you have helped me become a better scientist in many ways, thank you! Norman David Brown, especially for you, I will make sure that my next paper will be published in your favorite journal! And Joel Rodrigues Vitoria, as a newbie, welcome to the club!! I will always remember the fantastic trips we all did together for work and fun and hopefully see you all soon in our next adventures!

Moreover, Elina, Diana, Manu, Zineb, Katerina, Genevieve, Jaime, Raquel, Aurelie, Louise, Mathieu, Alice, Olivier, Bach, Wallis, Jessica, Valeria, Elena and Pierre, it was a pleasure meeting you all and having great times of team-building fun, Friday beers with card games and ping-pong, barbeques and endless discussions about nothing and everything.

I have been more than fortunate to have friends to call my family. Adriana, together since university and sometimes you still know me better than I do myself. Kosta, though it was bioinformatics that brought us close, now it's much more than that which keeps us together. Maria, we met in Greece but luck (or fate, or the stars in the sky or the aliens out there) rejoined us here in Brussels, to spend some great nights in those weird places that I never wanted to go to but somehow liked afterwards. Elena, you were the legend of Hellas BC but then I came to Brussels and you had to settle for second place, sorry. Irini, you were always there for me and made me the person I am today. Thank you all for your support!

Finally, no words will ever express my feelings for my family who have been my inspiration for everything and the guiding light through my life. Dad, Mom and Eleni, I love you! Σας αγαπώ πολύ!

List of Tables

Table 2.1. Number of fusions across breast cancer subtypes

Table 3.1. Viral sequences detected in the samples with the use of several computational methods

Table 4.1. Top 20 activated subfamilies

Table S2.1. Clinical and pathological data of the 55 patients

Table S2.2. Table containing the 370 detected fusions

Table S4.1. Table containing the number of activated loci found in each subfamily

Table S4.2. Table of differentially expressed ERVs across breast cancer subtypes

Table S4.3. Table of differentially expressed ERVs between ER+ and ER− tumors

List of Figures

Figure 1.1. The hallmarks of cancer by Hanahan and Weinberg

Figure 1.2. Significantly mutated genes across the different breast cancer subtypes

Figure 1.3. Rearrangements that can lead to gene fusions

Figure 1.4. The different repeat classes in the human genome (version hg19)

Figure 1.5. Classification and genomic distribution of retrotransposons

Figure 1.6. The transmission and fixation of exogenous retroviruses into the genome

Figure 1.7. Proviral structure of a human endogenous retrovirus (HERV-K)

Figure 1.8. Workflows for whole-exome, whole-genome and transcriptome sequencing

Figure 1.9. NGS workflow of Illumina technology

Figure 2.1. Differences of the number of gene fusions between breast cancer subtypes

Figure 2.2. Distribution of gene fusions across the human genome

Figure 2.3. Proportion of fusion breakpoints located close to copy number aberrations

Figure 2.4. Correlation analysis between the number of fusions and copy number aberrations

Figure 3.1. Computational pipelines for the detection of non-human sequences

Figure 4.1. Transcriptional profiles of retroelements

Figure 4.2. Subtype-specific transcription of ERV elements in breast cancer

Figure 4.3. Boxplot of mean LTR expression in ER+ and ER− samples

Figure 4.4. ERV profiles in tumors and normal breast tissues

Figure S4.1. Bioinformatics pipeline used for the quantification of repetitive elements

Figure S4.2. Percentage of reads mapping to LTRs and internal regions of ERVs

Figure S4.3. PCA analysis of the family-wise expression of ERVs

Figure S4.4. Reads mapped to ERVs with uniquely mapping or primary mapping approach

Abbreviations

BLAT: BLAST-like alignment tool

BLAST: Basic Local Alignment Search Tool

DNMTis: DNA methyltransferase inhibitors

EBV: Epstein-Barr virus

ER: Estrogen Receptor

ERV: Endogenous retrovirus

FDR: False discovery rate

FISH: Fluorescence In Situ Hybridization

FPKM: Fragments Per Kilobase per Million aligned reads

HBV: Hepatitis B virus

HCMV: Human Cytomegalovirus

HCV: Hepatitis C virus

HER2: Human Epidermal Growth Factor Receptor 2

HHV4: Human herpesvirus 4

HHV8: Human herpesvirus 8

HPV: Human papillomavirus

HR: Homologous recombination

HTLV-I: Human T-lymphotropic virus I

IHC: Immunohistochemistry

KSHV: Kaposi's sarcoma herpesvirus

LINEs: Long interspersed elements

LTR: Long terminal repeat

MaLR: Mammalian apparent LTR retrotransposons

MCV: Merkel cell polyomavirus

MMTV: Mouse mammary tumor virus

NHEJ: Non-homologous end joining

NGS: Next-Generation Sequencing

ORF: Open reading frame

PCR: Polymerase chain reaction

PR: Progesterone Receptor

RNA-seq: RNA sequencing

SNP: Single Nucleotide Polymorphism

TCGA: The Cancer Genome Atlas

TE: Transposable elements

Chapter 1

Introduction

1.1 Cancer

Cancer is a disease of the genome. Normal cells in the human body accumulate somatic mutations that in most cases are harmless, but under some conditions can lead to a phenotypic change. These mutations fall into two categories, “driver” and “passenger” mutations. “Driver” mutations confer a growth advantage on cancer cells and are thus implicated in carcinogenesis, while “passenger” mutations give no advantage to the cell. Mutations that play an important role in oncogenesis typically affect two types of genes, the oncogenes and the tumor suppressor genes. An oncogene can be any gene encoding a protein that can transform cells or induce cancer. A tumor suppressor gene can be any gene that can inhibit cell proliferation. Consequently, activation of an oncogene or silencing of a tumor suppressor gene can lead to a cancer phenotype.

Figure 1.1. The hallmarks of cancer by Hanahan and Weinberg [2].

some tumors. In chromothripsis, the genome is shattered and re-assembled in a single event, leading to a large number of chromosomal rearrangements and DNA copy number changes localized in a single chromosome [6]. In kataegis, multiple point mutations take place near sites of genomic rearrangements [7].

1.1.1 Breast cancer

1.1.1.1 Epidemiology and risk factors

Breast cancer is the most common cancer in women worldwide and the second most common overall after lung cancer [8]. In 2017, it is estimated that almost 252,000 new cases of breast cancer will be diagnosed in women in the United States of America, and that about 40,000 women will die from the disease (“American Cancer Society. Cancer Facts & Figures 2017”). Various risk factors have been associated with this disease, including age, family history, radiation, lifestyle, endogenous hormones and hormone replacement therapy, among others [9].

1.1.1.2 Classification of breast cancer

Breast cancer can be classified according to various systems including histopathology, immunohistochemistry and gene expression profiling.

Based on histopathology, the majority of breast tumors are classified as ductal or lobular carcinomas. Ductal carcinoma is the most common type of breast cancer. It develops in the duct system of the breast and can be divided into invasive and in situ ductal carcinoma, depending on whether or not the cells invade the basement membrane. Lobular carcinoma arises in the milk-producing glands of the breast and can also be divided into invasive and in situ.

Immunohistochemistry also allows the classification of breast cancer and helps guide treatment decisions. Depending on the expression of human epidermal growth factor receptor 2 (HER2), estrogen receptor (ER) and progesterone receptor (PR), breast cancer can be classified into ER positives, HER2 positives, triple negatives and triple positives.

expression of the hormonal receptors or HER2 amplification, (2) the HER2-enriched tumors that show an amplification of the HER2 gene, (3) the luminal A tumors that express the hormone receptors, with low proliferation and good prognosis, (4) the luminal B tumors that also express the hormone receptors and are highly proliferative and (5) the “normal-like” tumors that have an expression profile similar to normal mammary tissue. These subtypes were also identified by other studies, including the work of Parker et al. [11], who devised a 50-gene classifier (PAM50) including hormone receptor genes, proliferation-related genes and genes with epithelial and basal features. Importantly, this classification has prognostic and predictive value for breast tumors and is currently widely used.

1.1.1.3 Mutational processes in breast cancer

Figure 1.2. Significantly mutated genes across the different breast cancer subtypes [12].

1.2 Chromosomal rearrangements, gene fusions and cancer

1.2.1 Chromosomal rearrangements and gene fusions

Figure 1.3: Rearrangements that can lead to gene fusions. Balanced rearrangements can be caused by a translocation, an insertion or an inversion. Unbalanced rearrangements are usually deletions and duplications [15].

pathway, the two ends are simply joined together, while, in the alternative pathway, small sequence identities (called microhomology) are found at the edge of the breaks. Overall, NHEJ-mediated repair is considered the main mechanism involved in rearrangements, as shown by the presence of overlapping microhomology and non-templated sequences at rearrangement junctions in breast tumors [21]. Apart from the NHEJ pathway, other pathways have been proposed to be involved in translocations. The breakage-fusion-bridge cycle [22] is known to cause amplifications and translocations, formed through the loss of telomeres, end-to-end chromosome fusions and breakage of the fused chromosome. In non-allelic homologous recombination, the misalignment of low-copy repeats during meiosis or mitosis can lead to genomic rearrangements including duplications and deletions [23]. Finally, during microhomology-mediated break-induced replication, the 3’ single-strand tails from collapsed replication forks can anneal to single-stranded DNA in proximity, also causing duplications [24].

When a genomic rearrangement brings together two otherwise distant genes, a gene fusion is formed. The first gene fusion (called “the Philadelphia chromosome”) was discovered in 1960 in chronic myeloid leukemia and occurs between the genes BCR and ABL1 located on chromosomes 22 and 9 respectively [25]. During this reciprocal translocation, the ABL1 gene moves to a part of the BCR gene, creating in this way an elongated chromosome 9 and a truncated chromosome 22.

EML4 gene, led to the production of anti-cancer drugs inhibiting its kinase action. Papillary thyroid carcinomas share different versions of the RET/PTC fusions, which include the RET proto-oncogene, encoding a receptor tyrosine kinase, and different partner genes. As in the previous examples, the presence of the tyrosine kinase as the 3' gene leads to its activation and to uncontrolled proliferation [29].

1.2.2 Detection of gene fusions

1.2.3 Fusions in breast cancer

The inter-tumor heterogeneity of breast cancer that is reflected in the mutational status can also be seen in the landscape of genomic rearrangements and gene fusions. The prevalence of these events has been the subject of several studies and some of their most important points will be discussed here.

The four different breast cancer subtypes differ in their number of rearrangements, as was first seen in the study of Stephens et al. [37], who performed whole-genome sequencing on 24 breast cancers, including both primary tumors and cell lines. HER2+ subtypes carried the majority of the fusions, localized on chromosome 17, while luminal subtypes showed only few rearrangements. This heterogeneity was confirmed in the study of Banerji et al. [38], where the whole genomes of 22 tumors with paired normal samples were sequenced. An additional aspect of the rearrangements in breast cancer is the lack of recurrence across samples, originally pointed out by Stephens et al. [30]. In the study of Robinson et al. [39], no recurrent fusion pair was detected; instead, individual recurrent genes forming fusions with different partners were present. In one example, several genes from the MAST kinase family were found in multiple fusions with different partners. The preserved kinase domain of those fusions results in a growth and proliferative advantage, indicating that these genes could be used as potential drug targets.

1.3 Human viruses and cancer

1.3.1 Oncogenic viruses

The first oncogenic virus was discovered in 1908 by Ellerman and Bang [43], who demonstrated that erythroleukemia could be transmitted between chickens by cell-free tissue filtrates. Two years later Rous [44] showed that sarcoma could also be transmitted between chickens by cell-free tumor extracts. Both studies concluded that an agent that could pass the filters, namely a virus, was the cause of cancer. The discovery of Rous earned him the Nobel Prize in 1966 and opened the path for tumor virus and cancer biology.

(MCV), the first virus to be discovered through NGS in Merkel cell carcinoma [47]. The involvement of viruses in carcinogenesis can be through a direct or an indirect mechanism. In the direct mechanism, the expression of viral oncogenic genes can lead to the silencing of tumor suppressor genes of the host and thus to cell transformation (like in HPV-induced cancers). Moreover the expression of viral proteins can prevent cell death and promote genomic instability (like in the EBV-induced cancers). In the indirect mechanism, a chronic inflammation is caused by the long-term infection of a virus, like in the case of the HCV-induced liver cancer.

1.3.2 Viral infections as a cause of breast cancer

1.3.3 Endogenous retroviruses as part of the human genome

1.3.3.1 Repetitive elements in the human genome

In 2001, the sequencing of the human genome brought to light the fine details of its structure. Around 30,000-40,000 protein-coding genes were annotated (currently this number is reduced to ~20,000 genes), accounting for only a small fraction of the genome (1.5%). A much larger part, almost 50%, consists of transposable elements (TE) and other repetitive sequences [58] (Figure 1.4).

Figure 1.4. The different repeat classes in the human genome (version hg19) [59].

Transposable elements, as their name indicates, have the ability to transpose from one position to another within the genome. Based on whether reverse transcription is needed during transposition, they are divided into two categories, namely DNA and RNA transposons. DNA transposons cover ~3% of the human genome and encode a transposase protein required for excising them as double-stranded DNA and reintegrating them into another position in the genome through a simple “cut-and-paste” mechanism [60]. The RNA transposons (also called retrotransposons) follow a different mechanism: the DNA is first transcribed into RNA, the RNA is then reverse transcribed into DNA, and finally this DNA is inserted back into the genome at a new position.

transposons constitutes about 8% of the human genome and differs from the non-LTR transposons due to the presence of long terminal repeats (LTRs) on both ends of an internal sequence. This group consists of both endogenous retroviruses (ERVs) and mammalian apparent LTR retrotransposons (MaLRs), which are distantly related to ERVs [61]. Each of the three classes of retrotransposons is divided into families and each family into subfamilies. Each subfamily is spread across multiple positions in the human genome (Figure 1.5). Despite their name, only a few of the TEs are presently actively transposing in the human genome.

Figure 1.5. Classification and genomic distribution of retrotransposons. Retrotransposons

1.3.3.2 Endogenous retroviruses

Endogenous retroviruses are the remains of infections by ancient exogenous retroviruses that gained access to the germline cells and integrated into the genome ~30-45 million years ago [63]. After the initial integration, such a retroviral element could amplify and copy itself into new positions in the genome while being passed on to the descendants' somatic and germline cells (Figure 1.6).

Figure 1.6: The transmission and fixation of exogenous retroviruses into the genome [64].

Figure 1.7. Proviral structure of a human endogenous retrovirus (HERV-K) [65].

The amplification of the ERVs in the genome occurs in two different ways, depending on the presence or absence of functional env and gag genes. In the first case, the amplification takes place through an extracellular re-infection of the cells, where newly formed viral particles leave the existing cell, insert into another cell and integrate again into the genome. In the second case, the presence of a defective env gene and an altered Gag protein leads to the transcription of such an element and its re-integration in the genome of the same cell [64].

The integration of ERVs in the human genome and their subsequent amplification millions of years ago led to their fixation in the human population. However, it has been proposed that the HML-2/HERV-K family, which integrated very recently into the human genome, is “insertionally polymorphic”. Marchi et al. [69] analyzed whole genomes of individuals and showed that 13 ERV loci that are not present in the reference human genome were found in the genomes of those individuals, with a frequency ranging from 1 individual to 95% of them. Such unfixed loci could be associated with different diseases, since their activation is more likely than that of loci already fixed in the human genome [70][71].

remarkable example of this symbiosis exists during the formation of the placenta. The env gene of the HERV-W family encodes a protein called syncytin 1. This protein is known to mediate cell-to-cell fusion of cytotrophoblasts into syncytiotrophoblasts, which constitute the interface between the embryo and the mother [75]. In addition to syncytin 1, the expression of a second protein called syncytin 2, which is encoded by the env gene of the HERV-FRD family, has been detected in the placenta [76]. The transcription of both genes is the result of a hypomethylation known to take place in the placenta but not in other tissues [77]. In contrast to the above example, other cases of active ERVs have been linked to pathological conditions in humans, such as multiple sclerosis (HERV-W, HERV-H and HERV-Fc1) [78][79], schizophrenia (HERV-W) [80] and cancer. As the main subject of this thesis is breast cancer, focus will hereafter be placed only on the connection between ERVs and cancer, without underestimating their importance in other diseases.

It is well established that an overall hypomethylation takes place in cancer cells compared to their control normal tissues [81]. Specifically, regions of repetitive elements, like retrotransposons, are very often hypomethylated in tumors, thus escaping transcriptional control, which leads to the expression of both their mRNA and proteins [82]. The reactivation and overexpression of ERVs is detected in various cancer types and has been associated with carcinogenesis. In melanomas and melanoma cell lines, expression of HERV-K proteins and antibodies against HERV-K were found [83][84]. In ovarian cancer, increased expression of HERV-K was observed relative to normal ovarian tissues, together with other classes of HERV env mRNA including ERV3 and HERV-E, while antibodies against these elements were also detected [85]. In prostate cancer, HERV-K gag mRNA was significantly elevated when compared to control samples, while IFNγ and its downstream effector were significantly higher in the patients with high levels of HERV-K gag mRNA [86]. Env transcripts of HERV-K were detected in breast cancer cell lines but not in normal cells, while no expression of other tested ERVs was detected in the tumor samples [87].

lead to the overexpression of human oncogenes. Long terminal repeats of ERVs are known to carry promoter and enhancer sequences used for the expression of their own genes. The activation of LTRs located upstream of an oncogene could bypass the normal promoter and lead to the expression of this gene. The insertional mutagenesis mechanism is the least likely since ERVs, in comparison to other repetitive elements, are not known to actively retrotranspose in the human genome, possibly with the exception of some HERV-K elements.

1.4 Next-Generation Sequencing

1.4.1 Sequencing techniques

The first methods to obtain the DNA sequence from a sample were independently developed by two different groups in the 1970s, namely Maxam and Gilbert [89] and Sanger and colleagues [90]. While both Gilbert and Sanger won the Nobel Prize in chemistry in 1980 for their contribution, it was the Sanger method that became widely used, and it remains useful even today. The major achievement of the sequencing technology was the release in 2001 of the draft and three years later of the “complete” first human genome sequence. Throughout the years, different NGS technologies were developed and advanced significantly, reducing the time required to complete the sequencing and improving the quality of the output data. Nowadays, a single genome can be sequenced within a few hours, at a cost of ~1000 dollars.

hybridization capture followed by its subtraction through binding to magnetic beads. The next step in both techniques is the fragmentation of the mRNA into smaller pieces, followed by a size selection and a reverse transcription into cDNA. Adapters and PCR primers are added at both ends of the double-stranded DNA and the sequences are amplified.

Figure 1.8: Workflows for whole-exome, whole-genome and transcriptome sequencing [91].

over, another base will incorporate, emit the corresponding light and this process will be repeated multiple times. At the end of the process, the color sequence will correspond to the bases sequenced, producing the so-called reads (Figure 1.9).

Figure 1.9: NGS workflow of Illumina technology. (a) library preparation; (b) structure of

1.4.2 Next-generation sequencing applications

This section presents in detail how NGS can be used for the analysis of RNA-seq data, focusing on those techniques used in our study workflow. In each of the three following chapters, a more detailed presentation of the methods used is available, as part of the Materials and Methods sections.

Transcriptome reconstruction

The initial step in almost every analysis that involves RNA or DNA sequences requires the positioning of the sequencing reads back onto the genome from which they originate. To this end, two main approaches have been developed, the “reference based” alignment and the “de novo assembly”. Originally, different algorithms were developed for local pairwise alignments of short sequences, including the Basic Local Alignment Search Tool (BLAST) [93] and the BLAST-like alignment tool (BLAT) [94]. BLAT in particular was created and used for the assembly and annotation of the human genome, in order to overcome the slowness of preexisting alignment tools. The two algorithms differ in various aspects, one of which involves the indexing step. BLAST builds an index of the query sequence and then scans through a database, whereas BLAT builds an index of the database and then scans through the query sequence. The emergence of NGS required the design of faster and more accurate algorithms for the alignment of short sequences against full genomes.
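The difference in the indexing step can be illustrated with a minimal sketch. It is a toy k-mer seed finder, not the actual BLAST or BLAT implementation; the function names, the toy sequences and the choice of k = 4 are assumptions made for illustration only.

```python
from collections import defaultdict


def build_kmer_index(sequence, k):
    """Map every k-mer of `sequence` to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(sequence) - k + 1):
        index[sequence[pos:pos + k]].append(pos)
    return index


def seed_hits(index, scanned, k):
    """Scan `scanned` and report (scanned_pos, indexed_pos) pairs sharing a k-mer."""
    hits = []
    for pos in range(len(scanned) - k + 1):
        for indexed_pos in index.get(scanned[pos:pos + k], []):
            hits.append((pos, indexed_pos))
    return hits


# Toy sequences (hypothetical); real tools work on genome-scale databases.
database = "ACGTACGTTAGC"
query = "CGTTAG"
K = 4

# BLAT-like strategy: index the database once, then scan the query.
blat_like = seed_hits(build_kmer_index(database, K), query, K)

# BLAST-like strategy: index the query, then scan the database.
blast_like = seed_hits(build_kmer_index(query, K), database, K)
```

Both strategies find the same seed matches; what changes is which sequence is held in memory as an index and which one is streamed past it. An index of the database is expensive to build but can be reused for many queries.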

Currently, the “reference based” alignment is carried out by fast algorithms, like Tophat2 [95] or STAR [96], and can be completed within minutes. Such algorithms are specifically designed for RNA-seq data since they take into consideration the splicing events that occur in the transcriptome and allow a read to map across an exon-exon junction. To achieve this, the exon annotation of each gene/transcript is used in combination with the reference genome sequence.

the segments are connected creating a directed graph and leading to the reconstruction of the genome. This approach is currently the only one available for organisms without a reference genome, although genomes of closely related species can also be used for a “reference based” alignment. Although both approaches can be used for transcriptome reconstruction, the “de novo assembly” is slower and more resource-consuming, thus making the “reference based” alignment the most widely used.
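The idea of connecting read segments into a directed graph can be sketched as follows. The sketch assumes a de Bruijn-style graph, in which each k-mer links its (k-1)-mer prefix to its (k-1)-mer suffix; the text above does not name the graph type, so this choice, like the function names and toy reads, is an assumption. The walker only handles an error-free, unbranched toy graph.

```python
from collections import defaultdict


def build_graph(reads, k):
    """Link the (k-1)-mer prefix of every k-mer to its (k-1)-mer suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unbranched_path(graph):
    """Follow the graph from its single start node while the path is unambiguous."""
    targets = {node for successors in graph.values() for node in successors}
    starts = [node for node in graph if node not in targets]
    if len(starts) != 1:
        raise ValueError("toy walker expects exactly one start node")
    node = starts[0]
    contig = node
    visited = {node}
    while len(graph.get(node, ())) == 1:
        (node,) = graph[node]
        if node in visited:  # stop instead of looping forever on a cycle
            break
        visited.add(node)
        contig += node[-1]
    return contig


# Three overlapping toy reads (hypothetical) sampled from one short sequence.
reads = ["ACGTC", "GTCAG", "CAGTT"]
contig = walk_unbranched_path(build_graph(reads, k=4))
```

Real assemblers must additionally resolve branches caused by repeats, sequencing errors and alternative isoforms, which is one reason why de novo assembly is the slower approach.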

Gene expression and differential expression analysis

Transcriptome sequencing is the methodology for identifying transcripts and quantifying gene expression levels. As a first step, expression quantification requires counting the number of reads aligned to each gene or transcript. Whether a read maps to one or to multiple positions in the genome should be taken into consideration during this step. In most cases, reads mapping to multiple positions in the genome, which typically belong to repetitive elements, are discarded and the expression estimate is based only on uniquely mapped reads. In the next step, read counts for each gene are normalized by the library size of the sample and the gene length. In this way, a comparison between different samples and genes is feasible. Different tools have been developed for automated expression calculation. Cufflinks [98] is one of the most used tools for assembling known transcripts, discovering novel ones and quantifying their abundance in a single run. Other approaches require multiple steps for deriving the expression estimates, with tools like HTSeq-count [99] and featureCounts [100] designed especially for the first step of read counting.
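The normalization by library size and gene length can be written out as a short sketch of the FPKM measure listed in the abbreviations. The gene names, counts and lengths are hypothetical, and the library size is taken here as the sum of the counted fragments, which is a simplifying assumption.

```python
def fpkm(counts, lengths_bp):
    """Fragments per kilobase of gene length per million counted fragments."""
    total_fragments = sum(counts.values())  # library size (assumed: all counted fragments)
    return {
        gene: count * 1e9 / (lengths_bp[gene] * total_fragments)
        for gene, count in counts.items()
    }


# Hypothetical toy data: geneB has three times the fragments and three times the length.
counts = {"geneA": 500, "geneB": 1500}
lengths_bp = {"geneA": 1000, "geneB": 3000}

expression = fpkm(counts, lengths_bp)
```

In this toy case both genes receive the same FPKM value, because the higher raw count of `geneB` is explained entirely by its greater length.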

Comparing the expression of individual genes among samples of different conditions, like cancerous and normal tissues or biological replicates, is an additional analytical step in the pipeline. The identification of differentially expressed genes is possible with the use of various tools developed for this purpose like edgeR [101] or DESeq2 [102] that require the raw read counts as an input parameter or Cuffdiff as a continuation of the Cufflinks pipeline described above [103].

when a genomic rearrangement occurs in the genome, the reads will map to positions that do not comply with the above rules and thus will be flagged (discordant alignment). For example, in a gene fusion caused by a deletion, the two reads will be mapped at a greater distance from each other than would be expected if this deletion were not present in the cancer genome.
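The flagging of discordant alignments can be sketched as a simple rule check on each read pair. The `ReadPair` fields, the 500 bp insert-size limit and the coordinates are hypothetical, and the orientation rule is simplified to "the mates lie on opposite strands"; real aligners apply more detailed criteria.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    """Mapped positions of the two mates of a paired-end read (hypothetical fields)."""
    chrom1: str
    pos1: int
    strand1: str
    chrom2: str
    pos2: int
    strand2: str


def is_discordant(pair, max_insert=500):
    """Return True when a pair violates the expected mapping rules."""
    if pair.chrom1 != pair.chrom2:
        return True  # mates on different chromosomes, e.g. a translocation
    if pair.strand1 == pair.strand2:
        return True  # unexpected orientation, e.g. an inversion
    if abs(pair.pos2 - pair.pos1) > max_insert:
        return True  # mates too far apart, e.g. a deletion
    return False


normal_pair = ReadPair("chr17", 1000, "+", "chr17", 1300, "-")
deletion_pair = ReadPair("chr17", 1000, "+", "chr17", 250000, "-")
translocation_pair = ReadPair("chr9", 5000, "+", "chr22", 7000, "-")
```

Here `deletion_pair` corresponds to the deletion example in the text: both mates map to the same chromosome in the expected orientation, but much further apart than the library's insert size allows.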

Fusion detection algorithms, like deFuse [32], take advantage of these properties of the mapped reads. They first map all reads to the genome and extract the discordant alignments, clustering together those that represent the same fusion event. These “spanning reads” harbor a fusion breakpoint in the insert sequence between the two mates of the pair. The algorithm then searches for reads that cannot be mapped contiguously to the genome but span the fusion breakpoint. These are called “split reads”, with the fusion breakpoint located inside the read. The combination of these two types of reads indicates a fusion event. Not every called fusion is true, however, and further annotation and filtering of the predicted fusions is therefore necessary. For example, fusions formed by genes with high sequence similarity are removed, as it is difficult to determine the genomic position from which their reads originate. Other filters, which vary between algorithms, include the removal of ribosomal RNA, the small size of which would likewise cause mapping problems, and the removal of mitochondrial genes. After extensive filtering, the remaining gene fusions can be considered true positives with high probability and can be further validated and analyzed.
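The requirement that a candidate fusion be supported by both types of reads can be sketched as follows. This is a simplified illustration and not the deFuse algorithm: evidence is grouped by gene pair rather than clustered by genomic position, and the function names, thresholds and gene names are hypothetical.

```python
from collections import defaultdict


def summarize_support(evidence):
    """Count spanning and split reads for each candidate gene pair.

    evidence: iterable of (gene5, gene3, kind), kind is "spanning" or "split"
    """
    support = defaultdict(lambda: {"spanning": 0, "split": 0})
    for gene5, gene3, kind in evidence:
        if kind not in ("spanning", "split"):
            raise ValueError(f"unknown evidence type: {kind}")
        support[(gene5, gene3)][kind] += 1
    return dict(support)


def supported_fusions(evidence, min_spanning=1, min_split=1):
    """Keep the gene pairs supported by both types of reads."""
    return sorted(
        pair
        for pair, n in summarize_support(evidence).items()
        if n["spanning"] >= min_spanning and n["split"] >= min_split
    )


# Invented evidence: only the first pair has both kinds of support
example_evidence = [
    ("GENE_A", "GENE_B", "spanning"),
    ("GENE_A", "GENE_B", "split"),
    ("GENE_C", "GENE_D", "spanning"),
]
```

Here only the GENE_A/GENE_B candidate would be retained, since the second candidate lacks a split read covering the breakpoint.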

Presence of non-human reads in human samples

As mentioned previously, the analysis of transcriptomic and genomic data from human-derived samples starts with the alignment of reads against the human reference genome. The majority of these reads (>80%) will map to the human genome; however, a small part will remain unmapped. These reads can be of human origin but unmapped because of sequencing errors or poorly annotated transcripts, or they can be of non-human origin [104]. One source of non-human sequences in human samples is contamination by viral or bacterial sequences that are either used as a quality control or originate from other infected samples [105]. For example, the use of enterobacteria phage phiX as a quality control during sequencing can leave contaminant sequences.


1.5 Aims and contributions of the research work

The results obtained during this thesis are presented in three chapters. Each chapter includes an introduction, a Material and Methods section, the results and a discussion. Chapters two and three each led to the preparation of a manuscript.

The overall aim of this thesis is to explore the presence of genetic changes that take place in the breast cancer transcriptome and their possible contribution to carcinogenesis.

In the first Results chapter (Chapter 2 in the thesis), the aim is to identify gene fusions in breast cancer transcriptomes and to study their distribution and association with other genomic events. Genomic rearrangements and gene fusions play an important role in carcinogenesis. Examples like the ABL1-BCR fusion in chronic myelogenous leukemia or the TMPRSS2-ERG fusion in prostate cancer highlight the importance of fusions in the cancer cell. In order to detect gene fusions, the transcriptomes of 55 breast cancers were analyzed using NGS. We aimed to identify expressed gene fusions, evaluate any differences between breast cancer subtypes, assess their recurrence among samples and correlate these fusions with other genomic alterations, like copy-number aberrations, and with clinical parameters. Gene fusion data from a larger cohort (TCGA database) were also analyzed for comparison with our findings.

Contributions of this research include: 1) the identification of well-described and novel tumor-specific gene fusions involved in breast cancer, 2) the description of the fusion landscape in the existing breast cancer subtype classification and 3) the observation of a non-random localization of fusions across the genome and their association with other genomic events.


viruses. An unbiased search for expressed viral sequences using next-generation sequencing was conducted. Five different computational methods were employed in order to scan the transcriptome of 58 breast cancers, in such a way as to overcome the limits of each individual method. The analysis was also complemented with PCR and IHC tests of specific viruses.

Contributions of this research include: 1) a methodologically robust argument against a viral etiology in breast cancer and 2) the evaluation of different techniques for the detection of non-human sequences in human samples.

In the third Results chapter (Chapter 4 in the thesis), the aim is to identify expressed sequences of endogenous retroviruses in breast cancer. Human endogenous retroviruses are remnants of exogenous retroviruses that integrated into the human genome millions of years ago. They are estimated to comprise almost 8% of the human genome; however, they are non-infectious because of the accumulation of mutations and deletions and because of epigenetic silencing mechanisms. Although their presence in the human genome is either beneficial or negligible, their reactivation could also contribute to oncogenesis. As a first step, the expression levels of all retroviral elements were quantified alongside protein-coding genes in breast cancer samples through transcriptome sequencing. Their transcriptional profiles were analyzed both across all breast tumors and between the different breast cancer subtypes. We also identified retroviral elements that are more highly expressed in tumor samples than in their adjacent normal tissues and that could be involved in carcinogenesis.

Contributions of this research include: 1) a representation of the


1.6 Bibliography

1. Hanahan D, Weinberg RA. The Hallmarks of Cancer. Cell. 2000;100:57–70.

2. Hanahan D, Weinberg RA. Hallmarks of Cancer: The Next Generation. Cell. 2011;144:646–74.

3. Armitage P, Doll R. The Age Distribution of Cancer and a Multi-stage Theory of Carcinogenesis. Br. J. Cancer. 1954;8:1–12.

4. Armitage P, Doll R. A Two-stage Theory of Carcinogenesis in Relation to the Age Distribution of Human Cancer. Br. J. Cancer. 1957;11:161–9.

5. Fearon ER, Vogelstein B. A genetic model for colorectal tumorigenesis. Cell. 1990;61:759–67.

6. Stephens PJ, Greenman CD, Fu B, Yang F, Bignell GR, Mudie LJ, et al. Massive Genomic Rearrangement Acquired in a Single Catastrophic Event during Cancer Development. Cell. 2011;144:27–40.

7. Nik-Zainal S, Alexandrov LB, Wedge DC, Van Loo P, Greenman CD, Raine K, et al. Mutational Processes Molding the Genomes of 21 Breast Cancers. Cell. 2012;149:979–93.

8. Global Cancer Facts & Figures | American Cancer Society. http://www.cancer.org/research/cancerfactsstatistics/global

9. Key TJ, Verkasalo PK, Banks E. Epidemiology of breast cancer. Lancet Oncol. 2001;2:133–40.

10. Perou CM, Sørlie T, Eisen MB, van de Rijn M, Jeffrey SS, Rees CA, et al. Molecular portraits of human breast tumours. Nature. 2000;406:747–52.


12. Network TCGA. Comprehensive molecular portraits of human breast tumours. Nature. 2012;490:61–70.

13. Bergamaschi A, Kim YH, Wang P, Sørlie T, Hernandez-Boussard T, Lonning PE, et al. Distinct patterns of DNA copy number alteration are associated with different clinicopathological features and gene-expression subtypes of breast cancer. Genes Chromosomes Cancer. 2006;45:1033–40.

14. Griffiths AJ, Gelbart WM, Miller JH, Lewontin RC. Chromosomal Rearrangements. 1999. http://www.ncbi.nlm.nih.gov/books/NBK21367/

15. Mertens F, Johansson B, Fioretos T, Mitelman F. The emerging complexity of gene fusions in cancer. Nat. Rev. Cancer. 2015;15:371–81.

16. Mani R-S, Chinnaiyan AM. Triggers for genomic rearrangements: insights into genomic, cellular and environmental influences. Nat. Rev. Genet. 2010;11:819–29.

17. Bounacer A, Wicker R, Caillou B, Cailleux AF, Sarasin A, Schlumberger M, et al. High prevalence of activating ret proto-oncogene rearrangements, in thyroid tumors from patients who had received external radiation. Oncogene. 1997;15:1263–73.

18. Arlt MF, Durkin SG, Ragland RL, Glover TW. Common fragile sites as targets for chromosome rearrangements. DNA Repair. 2006;5:1126–35.

19. Raghavan SC, Lieber MR. DNA structures at chromosomal translocation sites. BioEssays. 2006;28:480–94.

20. Roukos V, Burman B, Misteli T. The cellular etiology of chromosome translocations. Curr. Opin. Cell Biol. 2013;25:357–64.


23. Liu P, Carvalho CMB, Hastings PJ, Lupski JR. Mechanisms for recurrent and complex human genomic rearrangements. Curr. Opin. Genet. Dev. 2012;22:211–20.

24. Hastings PJ, Ira G, Lupski JR. A Microhomology-Mediated Break-Induced Replication Model for the Origin of Human Copy Number Variation. PLoS Genet. 2009;5(1).

25. Nowell PC, Hungerford DA. A minute chromosome in human chronic granulocytic leukemia. Science. 1960;132:1497.

26. Tomlins SA, Rhodes DR, Perner S, Dhanasekaran SM, Mehra R, Sun X-W, et al. Recurrent Fusion of TMPRSS2 and ETS Transcription Factor Genes in Prostate Cancer. Science. 2005;310:644–8.

27. Tomlins SA, Laxman B, Varambally S, Cao X, Yu J, Helgeson BE, et al. Role of the TMPRSS2-ERG Gene Fusion in Prostate Cancer. Neoplasia N. Y. N. 2008;10:177–88.

28. Soda M, Choi YL, Enomoto M, Takada S, Yamashita Y, Ishikawa S, et al. Identification of the transforming EML4–ALK fusion gene in non-small-cell lung cancer. Nature. 2007;448:561–6.

29. Romei C, Elisei R. RET/PTC Translocations and Clinico-Pathological Features in Human Papillary Thyroid Carcinoma. Front. Endocrinol. 2012;3:54.

30. Langer-Safer PR, Levine M, Ward DC. Immunological method for mapping genes on Drosophila polytene chromosomes. Proc. Natl. Acad. Sci. U. S. A. 1982;79:4381–5.

31. Ritz A, Paris PL, Ittmann MM, Collins C, Raphael BJ. Detection of recurrent rearrangement breakpoints from copy number data. BMC Bioinformatics. 2011;12:114.


33. Kim D, Salzberg SL. TopHat-Fusion: an algorithm for discovery of novel fusion transcripts. Genome Biol. 2011;12:1–15.

34. Iyer MK, Chinnaiyan AM, Maher CA. ChimeraScan: a tool for identifying chimeric transcription in sequencing data. Bioinformatics. 2011;27:2903–4.

35. McPherson A, Wu C, Wyatt AW, Shah S, Collins C, Sahinalp SC. nFuse: Discovery of complex genomic rearrangements in cancer using high-throughput sequencing. Genome Res. 2012;22:2250–61.

36. Newman AM, Bratman SV, Stehr H, Lee LJ, Liu CL, Diehn M, et al. FACTERA: a practical method for the discovery of genomic rearrangements at breakpoint resolution. Bioinforma. Oxf. Engl. 2014;30:3390–3.

37. Stephens PJ, McBride DJ, Lin M-L, Varela I, Pleasance ED, Simpson JT, et al. Complex landscapes of somatic rearrangement in human breast cancer genomes. Nature. 2009;462:1005–10.

38. Banerji S, Cibulskis K, Rangel-Escareno C, Brown KK, Carter SL, Frederick AM, et al. Sequence analysis of mutations and translocations across breast cancer subtypes. Nature. 2012;486:405–9.

39. Robinson DR, Kalyana-Sundaram S, Wu Y-M, Shankar S, Cao X, Ateeq B, et al. Functionally recurrent rearrangements of the MAST kinase and Notch gene families in breast cancer. Nat. Med. 2011;17:1646–51.

40. Veeraraghavan J, Tan Y, Cao X-X, Kim JA, Wang X, Chamness GC, et al. Recurrent ESR1-CCDC170 rearrangements in an aggressive subset of oestrogen receptor-positive breast cancers. Nat. Commun. 2014;5:4577.


43. Ellermann DV, Bang O. Experimentelle Leukämie bei Hühnern. II. Z. Für Hyg. Infekt. 1909;63:231–72.

44. Rous P. A sarcoma of the fowl transmissible by an agent separable from the tumor cells. J. Exp. Med. 1911;13:397–411.

45. Epstein MA, Achong BG, Barr YM. Virus particles in cultured lymphoblasts from Burkitt's lymphoma. Lancet Lond. Engl. 1964;1:702–3.

46. Bodelon C, Untereiner ME, Machiela MJ, Vinokurova S, Wentzensen N. Genomic characterization of viral integration sites in HPV-related cancers. Int. J. Cancer. 2016;139:2001–11.

47. Feng H, Shuda M, Chang Y, Moore PS. Clonal Integration of a Polyomavirus in Human Merkel Cell Carcinoma. Science. 2008;319:1096–100.

48. Lonardo AD, Venuti A, Marcante ML. Human papillomavirus in breast cancer. Breast Cancer Res. Treat. 1992;21:95–100.

49. Baltzell K, Buehring GC, Krishnamurthy S, Kuerer H, Shen HM, Sison JD. Limited evidence of human papillomavirus on breast tissue using molecular in situ methods. Cancer. 2012;118:1212–20.

50. Kroupis C, Markou A, Vourlidis N, Dionyssiou-Asteriou A, Lianidou ES. Presence of high-risk human papillomavirus sequences in breast cancer tissues and association with histopathological characteristics. Clin. Biochem. 2006;39:727–31.

51. Simões PW, Medeiros LR, Simões Pires PD, Edelweiss MI, Rosa DD, Silva FR, et al. Prevalence of Human Papillomavirus in Breast Cancer: A Systematic Review. Int. J. Gynecol. Cancer. 2012;22:343–7.

52. Li N, Bi X, Zhang Y, Zhao P, Zheng T, Dai M. Human papillomavirus infection and sporadic breast carcinoma risk: a meta-analysis. Breast Cancer Res. Treat. 2011;126:515–20.


54. Taher C, Boniface J de, Mohammad A-A, Religa P, Hartman J, Yaiw K-C, et al. High Prevalence of Human Cytomegalovirus Proteins and Nucleic Acids in Primary Breast Cancer and Metastatic Sentinel Lymph Nodes. PLOS ONE. 2013;8:e56795.

55. Utrera-Barillas D, Valdez-Salazar H-A, Gómez-Rangel D, Alvarado-Cabrero I, Aguilera P, Gómez-Delgado A, et al. Is human cytomegalovirus associated with breast cancer progression? Infect. Agent. Cancer. 2013;8:12.

56. Khoury JD, Tannir NM, Williams MD, Chen Y, Yao H, Zhang J, et al. Landscape of DNA Virus Associations across Human Malignant Cancers: Analysis of 3,775 Cases Using RNA-Seq. J. Virol. 2013;87:8916–26.

57. Tang K-W, Alaei-Mahabadi B, Samuelsson T, Lindh M, Larsson E. The landscape of viral expression and host gene fusion and adaptation in human cancer. Nat. Commun. 2013;4:2513.

58. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, et al. Initial sequencing and analysis of the human genome. Nature. 2001;409:860–921.

59. Treangen TJ, Salzberg SL. Repetitive DNA and next-generation sequencing: computational challenges and solutions. Nat. Rev. Genet. 2012;13:36–46.

60. Smit AF. Interspersed repeats and other mementos of transposable elements in mammalian genomes. Curr. Opin. Genet. Dev. 1999;9:657–63.

61. Smit AF. Identification of a new, abundant superfamily of mammalian LTR-transposons. Nucleic Acids Res. 1993;21:1863–72.

62. Göke J, Ng HH. CTRL+INSERT: retrotransposons and their contribution to regulation and innovation of the transcriptome. EMBO Rep. 2016;17:1131–44.

63. Sverdlov ED. Retroviruses and primate evolution. BioEssays. 2000;22:161–71.


65. Hohn O, Hanke K, Bannert N. HERV-K(HML-2), the Best Preserved Family of HERVs: Endogenization, Expression, and Implications in Health and Disease. Front. Oncol. 2013;3:246.

66. Bannert N, Kurth R. The Evolutionary Dynamics of Human Endogenous Retroviral Families. Annu. Rev. Genomics Hum. Genet. 2006;7:149–73.

67. Magiorkinis G, Belshaw R, Katzourakis A. “There and back again”: revisiting the pathophysiological roles of human endogenous retroviruses in the post-genomic era. Philos. Trans. R. Soc. Lond. B. Biol. Sci. 2013;368:20120504.

68. Downey RF, Sullivan FJ, Wang-Johanning F, Ambs S, Giles FJ, Glynn SA. Human endogenous retrovirus K and cancer: Innocent bystander or tumorigenic accomplice? Int. J. Cancer. 2015;137:1249–57.

69. Marchi E, Kanapin A, Magiorkinis G, Belshaw R. Unfixed endogenous retroviral insertions in the human population. J. Virol. 2014;JVI.00919-14.

70. Wildschutte JH, Williams ZH, Montesion M, Subramanian RP, Kidd JM, Coffin JM. Discovery of unfixed endogenous retrovirus insertions in diverse human populations. Proc. Natl. Acad. Sci. 2016;113:E2326–34.

71. Moyes D, Griffiths DJ, Venables PJ. Insertional polymorphisms: a new lease of life for endogenous retroviruses in human disease. Trends Genet. 2007;23:326–33.

72. Subramanian RP, Wildschutte JH, Russo C, Coffin JM. Identification, characterization, and comparative genomic distribution of the HERV-K (HML-2) group of human endogenous retroviruses. Retrovirology. 2011;8:90.

73. Lavie L, Kitova M, Maldener E, Meese E, Mayer J. CpG Methylation Directly Regulates Transcriptional Activity of the Human Endogenous Retrovirus Family HERV-K(HML-2). J. Virol. 2005;79:876–83.


75. Mi S, Lee X, Li X, Veldman GM, Finnerty H, Racie L, et al. Syncytin is a captive retroviral envelope protein involved in human placental morphogenesis. Nature. 2000;403:785–9.

76. Blaise S, Parseval N de, Bénit L, Heidmann T. Genomewide screening for fusogenic human endogenous retrovirus envelopes identifies syncytin 2, a gene conserved on primate evolution. Proc. Natl. Acad. Sci. 2003;100:13013–8.

77. Matoušková M, Blažková J, Pajer P, Pavlíček A, Hejnar J. CpG methylation suppresses transcriptional activity of human syncytin-1 in non-placental tissues. Exp. Cell Res. 2006;312:1011–20.

78. Perron H, Germi R, Bernard C, Garcia-Montojo M, Deluen C, Farinelli L, et al. Human endogenous retrovirus type W envelope expression in blood and brain cells provides new insights into multiple sclerosis disease. Mult. Scler. Houndmills Basingstoke Engl. 2012;18:1721–36.

79. Laska MJ, Brudek T, Nissen KK, Christensen T, Møller-Larsen A, Petersen T, et al. Expression of HERV-Fc1, a Human Endogenous Retrovirus, Is Increased in Patients with Active Multiple Sclerosis. J. Virol. 2012;86:3713–22.

80. Leboyer M, Tamouza R, Charron D, Faucard R, Perron H. Human endogenous retrovirus type W (HERV-W) in schizophrenia: A new avenue of research at the gene–environment interface. World J. Biol. Psychiatry. 2013;14:80–90.

81. Ehrlich M. DNA methylation in cancer: too much, but also too little. Oncogene. 2002;21:5400–13.

82. Feinberg AP, Vogelstein B. Hypomethylation distinguishes genes of some human cancers from their normal counterparts. Nature. 1983;301:89–92.


84. Schiavetti F, Thonnard J, Colau D, Boon T, Coulie PG. A Human Endogenous Retroviral Sequence Encoding an Antigen Recognized on Melanoma by Cytolytic T Lymphocytes. Cancer Res. 2002;62:5510–6.

85. Wang-Johanning F, Liu J, Rycaj K, Huang M, Tsai K, Rosen DG, et al. Expression of multiple human endogenous retrovirus surface envelope proteins in ovarian cancer. Int. J. Cancer J. Int. Cancer. 2007;120:81–90.

86. Wallace TA, Downey RF, Seufert CJ, Schetter A, Dorsey TH, Johnson CA, et al. Elevated HERV-K mRNA expression in PBMC is associated with a prostate cancer diagnosis particularly in older men and smokers. Carcinogenesis. 2014;35:2074–83.

87. Wang-Johanning F, Frost AR, Johanning GL, Khazaeli MB, LoBuglio AF, Shaw DR, et al. Expression of Human Endogenous Retrovirus K Envelope Transcripts in Human Breast Cancer. Clin. Cancer Res. 2001;7:1553–60.

88. Zhao J, Rycaj K, Geng S, Li M, Plummer JB, Yin B, et al. Expression of Human Endogenous Retrovirus Type K Envelope Protein is a Novel Candidate Prognostic Marker for Human Breast Cancer. Genes Cancer. 2011;2:914–22.

89. Maxam AM, Gilbert W. A new method for sequencing DNA. Proc. Natl. Acad. Sci. U. S. A. 1977;74:560–4.

90. Sanger F, Nicklen S, Coulson AR. DNA sequencing with chain-terminating inhibitors. Proc. Natl. Acad. Sci. U. S. A. 1977;74:5463–7.

91. Bras J, Guerreiro R, Hardy J. Use of next-generation sequencing and other whole-genome strategies to dissect neurological disease. Nat. Rev. Neurosci. 2012;13:453–64.

92. Shin J, Ming G, Song H. Decoding neural transcriptomes and epigenomes via high-throughput sequencing. Nat. Neurosci. 2014;17:1463–75.

93. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J. Mol. Biol. 1990;215:403–10.


95. Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. 2013;14:R36.

96. Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2012;bts635.

97. Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat. Biotechnol. 2011;29:644–52.

98. Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, et al. Transcript assembly and abundance estimation from RNA-Seq reveals thousands of new transcripts and switching among isoforms. Nat. Biotechnol. 2010;28:511–5.

99. Anders S, Pyl PT, Huber W. HTSeq--a Python framework to work with high-throughput sequencing data. Bioinforma. Oxf. Engl. 2015;31:166–9.

100. Liao Y, Smyth GK, Shi W. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics. 2014;30:923–30.

101. Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26:139–40.

102. Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15:550.

103. Trapnell C, Hendrickson DG, Sauvageau M, Goff L, Rinn JL, Pachter L. Differential analysis of gene regulation at transcript resolution with RNA-seq. Nat. Biotechnol. 2013;31:46–53.


Chapter 2

A comprehensive analysis of expressed gene fusion events in breast cancer

2.1 Introduction

Gene fusions have primarily been a subject of study in hematological malignancies and sarcomas owing to their pathogenic nature. The best known example involves a reciprocal translocation between chromosomes 9 and 22 giving rise to the BCR-ABL gene fusion in chronic myelogenous leukemia [1][2]. More recently, pathogenic gene fusions have also been discovered in solid tumors. For instance, interstitial deletions of chromosome 21 have been shown to lead to the TMPRSS2-ERG fusion in prostate cancer in half of the cases [3][4]. These and other examples highlight the role played by genomic rearrangements in reshaping the cancer genome and underline the importance of studying gene fusions in other malignancies.

Genomic rearrangements have long been studied using low-resolution methods including fluorescent in situ hybridization, spectral karyotyping and cytogenetic techniques. However, the advent of next generation sequencing has opened the door to faster and more efficient ways of uncovering such events. Whole genome and whole transcriptome (RNA-seq) sequencing have been used for the detection of gene fusions at the DNA level or in RNA respectively, providing a comprehensive repertoire of rearrangements in the cancer genome [5][6].


were detected in breast cancers (among other types of cancer) using RNA-seq data from TCGA [9][5]. Both studies detected recurrent fusions, with ESR1-CCDC170 being the most frequent. This fusion was however found in less than 10% of the patients, showing that recurrent fusions are not frequent in breast cancer. The analysis of 560 breast cancer genomes identified for the first time six different rearrangement signatures but the presence of recurrent in-frame fusions was limited [6].

The aim of this chapter is to explore the occurrence of gene fusions in a cohort of well-characterized breast cancers equally distributed among the known subtypes (i.e., triple negative, TN; HER2 positive, HER2+; luminal A, LA; and luminal B, LB). Using RNA-seq, we detected a varying number of fusions across the samples, with clear-cut differences between breast cancer subtypes. Hotspots of fusion genes were observed on specific chromosomes, like 17, 8 or 20, whereas other chromosomes harbored only a few fusion genes. By integrating data obtained from Single Nucleotide Polymorphism (SNP) arrays, a correlation was established between the number of gene fusions and the number of amplifications. Finally, recurrent fusions were rare in our samples, confirming that gene fusions in breast cancer are most likely private events not shared between samples.

My contribution to this research study involves:

• Bioinformatics analysis of sequencing data

• Data and results interpretation

• Manuscript preparation

2.2 Material and Methods

Patients and samples


Antibody staining for estrogen receptor (ER), progesterone receptor (PgR), human epidermal growth factor receptor 2 (HER2) and Ki67, as well as FISH for HER2, were performed as described in Fumagalli et al. [10] and according to the ASCO-CAP guidelines [11]. The histological grade was defined using the modified Bloom-Richardson grading system [12][13]. On the basis of their immunohistochemistry (IHC) profile, patients were classified into one of the four main known IHC breast cancer (BC) subtypes: TN (ER, PgR, and HER2 negative), HER2+ (any ER and PgR, HER2 positive), LA (ER positive, HER2 negative, histological grade 1) and LB (ER positive, HER2 negative, histological grade 3). The quantity and location (stromal or intratumoral) of tumor-infiltrating lymphocytes (TILs) were evaluated as described previously [14]. The use of the data is consistent with the informed consent signed by the patients or has been granted approval by the local Ethics Committee (study number: CE1967) and is in accordance with the applicable laws and regulations of Belgium.
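The classification rules given above can be written as a small decision function. The following Python sketch encodes only the four definitions stated in the text. The function name and its arguments are hypothetical, and combinations that the text does not cover (for instance ER-positive, HER2-negative tumors of histological grade 2) are returned as unclassified rather than guessed.

```python
def ihc_subtype(er_positive, pgr_positive, her2_positive, grade):
    """Assign an IHC breast cancer subtype from receptor status and grade.

    Returns "HER2+", "TN", "LA", "LB", or None when the combination is
    not covered by the four definitions given in the text.
    """
    if her2_positive:
        return "HER2+"  # any ER and PgR status
    if not er_positive and not pgr_positive:
        return "TN"  # ER, PgR and HER2 negative
    if er_positive and grade == 1:
        return "LA"  # ER positive, HER2 negative, grade 1
    if er_positive and grade == 3:
        return "LB"  # ER positive, HER2 negative, grade 3
    return None  # not defined by the rules above
```

For example, an ER-positive, HER2-negative tumor of grade 3 is assigned to LB, whereas any HER2-positive tumor is assigned to HER2+ regardless of its hormone receptor status.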

RNA extraction

RNA from fresh-frozen material was extracted using TRIzol® (Life Technologies, Carlsbad, California) following the manufacturer’s instructions. RNA concentration was measured using the NanoDrop 1000 (Thermo Scientific, Waltham, Massachusetts), and RNA integrity (RIN: RNA Integrity Number) was assessed using an Agilent 2100 Bioanalyzer (Agilent Technologies, Santa Clara, California). All the samples yielded enough material for downstream analyses and had a RIN of at least 6.5.

Transcriptome sequencing


Corporation, Beverly, Massachusetts). The cDNA fragments went through end repair, the addition of a single ‘A’ base and ligation of the adapters. The products were purified using AMPure XP beads and enriched by PCR (15 cycles) to create the final cDNA library, which was again purified using AMPure XP beads. Library quality control and quantification were performed using the Agilent 2100 Bioanalyzer and qRT-PCR; libraries were pooled (4 libraries/pool). Clusters were generated in a cBot Cluster Generation System using the Paired-End Cluster Generation Kit v2-HS and sequenced on the Illumina HiSeq 2000 platform (Illumina) in 2x50 base-pair (bp) paired-end mode.

DNA extraction and SNP arrays

DNA extraction was performed on 50 samples as previously described [10]. Briefly, DNA from fresh-frozen tissue was extracted using the DNeasy Blood and Tissue kit® (Qiagen) following the manufacturer’s instructions. DNA concentration was measured using the NanoDrop 1000 and all the samples yielded enough material for downstream analyses. Affymetrix Genome-Wide SNP6 arrays were performed at AROS Applied biotechnologies a/s (Aarhus, Denmark) following the manufacturer’s instructions. The raw intensity values were normalized to obtain the Log2 ratios using the copy number workflow in Affymetrix Power Tools release 1.17.0 with default parameter settings. We used the full version of the CDF, version na.32 of NetAffx’s annotation database and version na.32.r1 of the HapMap 270 reference file. Following normalization, we computed the Median Absolute Pairwise Deviation (MAPD), with a fixed cut-off of 0.35 to discard failures, and the Median Auto-Correlation (MAC) on the raw Log2 ratios, with a fixed cut-off of 0.4 to flag arrays with unusually wavy profiles as potential failures. A total of 50 arrays were performed, corresponding to 50 unique samples. Four arrays had MAPD or MAC values outside these limits and were rehybridized. Sequencing and SNP array data obtained from the enrolled patients are archived at the European Genome-phenome Archive, https://www.ebi.ac.uk/ega, under accession number EGAS00001000495.
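The MAPD quality check described above can be illustrated with a short Python sketch. It assumes that MAPD is the median of the absolute differences between the Log2 ratios of adjacent markers ordered by genomic position, and that an array passes when its MAPD does not exceed the cut-off. Both points are assumptions about the exact definition, the function names are hypothetical and the Log2 ratios are invented.

```python
from statistics import median


def mapd(log2_ratios):
    """Median absolute difference between Log2 ratios of adjacent markers.

    log2_ratios: list of Log2 ratios ordered by genomic position (assumption)
    """
    if len(log2_ratios) < 2:
        raise ValueError("need at least two markers")
    diffs = [abs(b - a) for a, b in zip(log2_ratios, log2_ratios[1:])]
    return median(diffs)


def passes_mapd(log2_ratios, cutoff=0.35):
    """Return True when the array is not noisier than the fixed cut-off."""
    return mapd(log2_ratios) <= cutoff
```

A smooth profile such as 0.0, 0.1, 0.3, 0.2 has a MAPD of about 0.1 and passes, whereas a profile alternating between 0 and 1 has a MAPD of 1 and would be discarded.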

Detection of fusion transcripts


default. Using an in-house pipeline, we applied the following filters to remove possible false positive fusions: 1) fusions involving ribosomal or mitochondrial genes were removed, 2) fusions formed between paralogous genes were removed, 3) fusions found in both normal and tumor samples were removed, 4) read-through fusions were removed. Fusion pairs consisting of the same genes but with different breakpoint positions were allowed.
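The four filters listed above can be sketched in Python as follows. This is an illustration under stated assumptions and not the in-house pipeline itself: the record layout (`gene5`, `gene3`, `read_through`), the function name and the lookup sets are hypothetical, and the annotation of paralogs and read-throughs is assumed to be available beforehand.

```python
def filter_fusions(calls, excluded_genes, paralog_pairs, normal_fusions):
    """Apply the four false-positive filters to a list of fusion calls.

    calls:          list of dicts with keys "gene5", "gene3", "read_through"
    excluded_genes: set of ribosomal and mitochondrial gene names
    paralog_pairs:  set of frozensets, each holding two paralogous genes
    normal_fusions: set of (gene5, gene3) pairs seen in normal samples
    """
    kept = []
    for call in calls:
        pair = (call["gene5"], call["gene3"])
        if pair[0] in excluded_genes or pair[1] in excluded_genes:
            continue  # 1) ribosomal or mitochondrial gene involved
        if frozenset(pair) in paralog_pairs:
            continue  # 2) fusion between paralogous genes
        if pair in normal_fusions:
            continue  # 3) also found in normal samples
        if call["read_through"]:
            continue  # 4) read-through transcript
        kept.append(call)  # no deduplication by gene pair
    return kept


# Invented calls with placeholder gene names
example_calls = [
    {"gene5": "GENE_A", "gene3": "GENE_B", "read_through": False},
    {"gene5": "GENE_A", "gene3": "GENE_B", "read_through": False},
    {"gene5": "RIBO_1", "gene3": "GENE_C", "read_through": False},
    {"gene5": "PARA_1", "gene3": "PARA_2", "read_through": False},
    {"gene5": "GENE_D", "gene3": "GENE_E", "read_through": False},
    {"gene5": "GENE_F", "gene3": "GENE_G", "read_through": True},
]
example_kept = filter_fusions(
    example_calls,
    excluded_genes={"RIBO_1"},
    paralog_pairs={frozenset(("PARA_1", "PARA_2"))},
    normal_fusions={("GENE_D", "GENE_E")},
)
```

Because the sketch never collapses calls by gene pair, the two GENE_A/GENE_B calls, which stand for the same genes with different breakpoint positions, are both retained.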

Copy Number analysis

Affymetrix Genome-Wide SNP6 arrays contain both polymorphic SNP probes and non-polymorphic CNV probes. We used two parallel approaches involving (a) allele-specific copy number analysis using the SNP probes and (b) total copy number analysis using the full set of 1.8M markers and the parameters from (a) to control for the cancer cell fraction (CCF) and genomic mass. For analysis (a), only informative SNP probes displaying a heterozygous genotype (AB) and 0.1 < BAF < 0.9, based on the B allele frequency (BAF) and genotyping calls, were kept. The Log2 ratios and BAF were segmented jointly, per sample, using a multitrack segmentation algorithm from the library copynumber [16] to determine common breakpoints. Estimates for the CCF and genomic mass were obtained using Genome Alteration Print [17]. Samples with a CCF < 30% were excluded, leaving 40 unique patients. For analysis (b), the Log2 ratios for these 40 arrays were segmented using the circular binary segmentation algorithm in the library DNAcopy [18] with default parameter settings. We considered as an amplification or a deletion any locus with a segmented Log2 ratio > 0.3 or < -0.3, respectively. Finally, we used a distance of 100 kb on either side of each copy number breakpoint to search for fusion events. For comparison with TCGA data, we obtained the fusions detected by Yoshihara et al. [5], TCGA level 3 copy number data from the TCGA data portal [19] and PAM50 RNA-seq subtypes from the UCSC cancer genomics browser [20]. The same workflow as for our data was applied.
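The last two rules of this workflow, the Log2 ratio thresholds and the 100 kb search window, lend themselves to a brief Python sketch. The function names and the coordinate representation are hypothetical. The sketch assumes that a fusion breakpoint is given as a (chromosome, position) pair and that the window is inclusive at its edges, neither of which is specified in the text.

```python
def call_segment(log2_ratio, threshold=0.3):
    """Label a segment from its segmented Log2 ratio."""
    if log2_ratio > threshold:
        return "amplification"
    if log2_ratio < -threshold:
        return "deletion"
    return "neutral"


def fusion_near_breakpoint(fusion_pos, cn_breakpoints, window=100_000):
    """Check whether a fusion lies within the window of a breakpoint.

    fusion_pos:     (chromosome, position) of the fusion breakpoint
    cn_breakpoints: iterable of (chromosome, position) copy number breakpoints
    """
    chrom, pos = fusion_pos
    return any(
        c == chrom and abs(pos - p) <= window for c, p in cn_breakpoints
    )
```

For instance, a segment with a Log2 ratio of 0.45 is called an amplification, and a fusion breakpoint located 50 kb away from a copy number breakpoint on the same chromosome is considered associated with it.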

Statistical Analysis


2.3 Results

2.3.1 Clinical and pathological characterization of our patient population

The clinical and pathological characterization of the 55 patients participating in this research study is described in Supplementary Table S2.1 (Appendix 2). The mean age of patients was 55.3 (range: 34-85) with 69% above the age of 50. The samples included 14 HER2+, 16 TN, 16 LA and 9 LB. Twenty-nine percent were of histological grade 1, 11% of grade 2 and 60% of grade 3. The proportion of stromal TILs ranged from 3% to 55%.

2.3.2 Detection of gene fusions in our population


Table 2.1. Number of fusions across breast cancer subtypes

                    HER2+   LA     LB     TN     TOTAL
Number of samples   14      16     9      16     55
Number of fusions   176     28     52     114    370
Fusion median       12      1      1      4      4
Fusion range        4-31    0-8    0-24   0-21   0-31

Figure 2.1. Differences of the number of gene fusions between breast cancer subtypes.

