Bioinformatic inference of a prognostic epigenetic signature of immunity in breast cancers

Academic year: 2021

Bioinformatic inference of a prognostic epigenetic signature of immunity in breast cancers

Martin Bizet

Interuniversity Institute of Bioinformatics in Brussels

Laboratory of Cancer Epigenetics

Machine Learning Group

Université Libre de Bruxelles, Faculty of Sciences

A thesis submitted for the degree of

Doctor of Philosophy

Under the direction of

Pr. G. Bontempi (Promoter)

Pr. F. Fuks (Co-promoter)


Résumé

The alteration of epigenetic marks is increasingly recognised as a fundamental characteristic of cancers. In this thesis, we used DNA methylation profiles to improve the classification of breast cancer patients through a machine-learning-based approach. The long-term objective is the development of clinical tools for personalised medicine.

The DNA methylation data were acquired using a DNA array dedicated to methylation, called Infinium. This technology is recent compared with, for example, gene expression arrays, and its preprocessing is not yet standardised. The first part of this thesis was therefore devoted to evaluating normalisation methods by comparing normalised data with other technologies (pyrosequencing and RRBS) for the two most recent Infinium technologies (450k and 850k). We also evaluated the coverage of biologically relevant regions (promoters and enhancers) by the two technologies.


Summary

Epigenetic alterations are increasingly recognised as a hallmark of cancers. In this thesis, we used a machine-learning-based approach to improve the classification of breast cancer patients using DNA methylation profiling, with the long-term aim of making treatment more personalised.

The DNA methylation data were acquired using a high density DNA methylation array called Infinium. This technology is recent compared to expression arrays and its preprocessing is not yet standardised. So, the first part of this thesis was to evaluate the normalisation methods by comparing normalised data against other technologies (pyrosequencing and RRBS) for the two most recent Infinium arrays (450k and 850k). We also went deeper into the evaluation of these arrays by assessing their coverage of biologically relevant regions like promoters and enhancers.


Acknowledgments

This thesis is the result of a fruitful collaboration between many laboratories, and I want to thank the many people without whom it would not have been possible.

First, I want to thank my promoter and co-promoter, Gianluca Bontempi and François Fuks, for hiring me in their laboratories. Gianluca, your guidance in the complex but exciting world of machine learning was very important to me. François, thank you for introducing me to the amazing field of epigenetics.

Then, I also thank the members of my three laboratories (the "Laboratory of Cancer Epigenetics", the "Machine Learning Group" and the "Interuniversity Institute of Bioinformatics in Brussels"), and particularly the people of my offices at Erasme (Evelyne, Romy, Sylvie, ...) and La Plaine (Rudy, Stefan, Matthias, ...), for all the funny and emotional moments we shared. We still do not have the Nobel prize, but I hope we will still have the opportunity to work on it... around a beer...

Importantly, I want to thank Sarah, Matthieu and Jana for their daily guidance. This work would really have been impossible without you! I think you really made me grow up!

Also, particular thanks to Fabrizio: our discussions about machine learning were really useful. Thank you, Olivier and Eric, for introducing me to the world of long noncoding RNA; I miss our long brainstorming sessions!

This thesis is also the result of the collaboration with the laboratory of Christos Sotiriou, who provided the samples and, more importantly, expertise in the breast cancer field. I thank them a lot for that.

Particular thanks to Catharina, Matthieu, Audrey, Clemence, Nathaniel and Nitesh for proofreading my thesis manuscript.

I thank my wife and my mother, who supported me when I was overwhelmed by stress! And, of course, Alan and Edwin, my twins, who bring me happiness every day!!


Acronyms

Acronyms Meaning

1st Exon First exon of the transcript
27k Infinium HumanMethylation27 beadarray
450k Infinium HumanMethylation450 beadarray
5caC 5-carboxycytidine
5fC 5-formylcytidine
5hmC 5-hydroxymethylcytidine
5mC 5-methylcytidine (i.e. DNA methylation)
6mA 6-methyladenosine
850k Infinium HumanMethylation850 beadarray
β-value Methylated over total signal
∆β Absolute difference of β-values
∆N-TP73 Alternative TP73 transcript (lacking N domain)
ψ Pseudouridine
A Adenosine
ADAR1 Adenosine deaminase acting on RNA 1
ADP Adenosine diphosphate
AIC Akaike information criterion
AID Activation-induced cytidine deaminase
ALYREF Aly/REF export factor
APOBEC Apolipoprotein B mRNA editing enzyme, catalytic polypeptide-like
ARGO Argonaute protein
ATP Adenosine triphosphate
AUC Area under the ROC curve
AURKA Aurora kinase A
BC Breast cancer
BER Balanced error rate
BMIQ Beta mixture quantile dilation normalisation
bp Number of base pairs distance


ca5C 5-carboxyribocytidine
CAGE Cap analysis gene expression
CD20+ Cells expressing the cluster of differentiation 20 (B lymphocytes)
CD247 Cluster of differentiation 247 immune marker
CD3+ Cells expressing the cluster of differentiation 3 (T lymphocytes)
CD3D Cluster of differentiation 3 delta part immune marker
CD4+ Cells expressing the cluster of differentiation 4
CD45 Cluster of differentiation 45 immune gene
CD45+ Cells expressing the cluster of differentiation 45 (leucocytes)
CD8+ Cells expressing the cluster of differentiation 8 (= CTL)
CD8A Cluster of differentiation 8 alpha part gene
CDS Coding DNA sequence
CGI CpG island
CHARM Comprehensive high throughput array for relative methylation
ChIA-PET Chromatin interaction analysis by paired-end tag sequencing
ChIP-seq Chromatin immunoprecipitation sequencing
CI Confidence interval
ComBat Combining batch method
CpG C followed by G
CR Cross-reactive
CTCF CCCTC-binding factor
CTL Cytotoxic T cells (= CD8+)
CXCL9 C-X-C motif chemokine ligand 9 immune marker
Dasen Pipeline with background correction followed by Nasen
DC Dendritic cells
DCIS Ductal carcinoma in situ
DMFS Distant metastasis free survival
DMP Differentially methylated positions
DMR Differentially methylated regions
DNA Deoxyribonucleic acid
DNMT1 DNA methyltransferase 1
DNMT3A DNA methyltransferase 3A
DNMT3B DNA methyltransferase 3B
dNTP Deoxynucleotide triphosphate
dTET Drosophila TET homolog
eCGI Enhancer associated CGI
EGF Epidermal growth factor
ENCODE Encyclopædia of DNA elements project
Epcam Epithelial cell adhesion molecule


ERBB2 ErbB2 receptor tyrosine kinase 2 gene (coding for HER2 protein)
eRNA Enhancer associated RNA
EZH2 Enhancer of zeste 2
f5C 5-formylribocytidine
FANTOM5 Functional annotation of the mammal genome version 5 database
FDA Food and drug administration
FDR False discovery rate
FFPE Formalin fixed paraffin embedded sample
FISH Fluorescence in situ hybridisation
FN False negative
FOXP1 Forkhead box P1 gene
FP False positive
FTO Fat mass & obesity associated protein
Fun Functional normalisation
G Guanosine
GENCODE Gene-dedicated part of ENCODE consortium
GEO Gene expression omnibus database
GZMB Granzyme B immune activation marker
H & E Hæmatoxylin and eosin staining
H1 Histone 1
H2A Histone 2A
H2B Histone 2B
H3 Histone 3
H4 Histone 4
HAT Histone acetyltransferase
HCT116 DKO HCT116 cells double knock-out for DNMT1 and DNMT3B
HCT116 WT HCT116 cells with wild type background
HDAC Histone deacetylase
HDM Histone demethylase
HER2 Human epidermal growth factor receptor 2; its associated breast cancer subtype
hg18 Human genome build version 18
hg19 Human genome build version 19
hg38 Human genome build version 38
HKMT Histone lysine methyltransferase
hm5C 5-hydroxymethylribocytidine
HMM Hidden Markov Model
HMT Histone methyltransferase
HNSC Head and neck squamous cell carcinoma
HpaII Methylation-sensitive restriction enzyme from Hæmophilus parainfluenzae


IDC Invasive ductal carcinoma
IGF2 Insulin-like growth factor 2
IHC Immunohistochemistry
ILC Invasive lobular carcinoma
INA Internexin neuronal intermediate filament protein alpha (MeTIL signature)
K Lysine
kb Kilobases
KDM1A Histone lysine demethylase 1A
KI67 Kiel 67 protein
KLHL6 Kelch-like family member 6 (MeTIL signature)
LCIS Lobular carcinoma in situ
lincRNA Long intergenic non-coding RNA
LINE Long interspersed nuclear element
LNCipedia Long noncoding encyclopædia database
lncRNA Long non-coding RNA
loess Local-regression
LOESS Local-regression-based normalisation
LUM Luminal breast cancer subtype
LumA Luminal A breast cancer subtype
LumB Luminal B breast cancer subtype
lumiMethyB Background correction for methylation from lumi
m1A 1-methylriboadenosine
m5C 5-methylribocytidine
m6A 6-methylriboadenosine
m7G 7-methylriboguanosine
M-A plot log2 expression ratio in function of average signal plot
MAF Minor allele frequency
MBD Methylation binding domain
MBT Malignant brain tumours domain
McrBC Methylation-sensitive restriction enzyme from Escherichia coli K12
MDR Multidrug resistance gene
MDS Multidimensional scaling
MECP2 Methyl-CpG binding protein 2
MeDIP-chip Methylated DNA immunoprecipitation on array (also called chip)
MeDIP-seq Methylated DNA immunoprecipitation sequencing
MethylCap-seq DNA methylation capture sequencing
MeTIL Methylation-based assessment of TIL
METTL14 Methyltransferase like 14
METTL3 Methyltransferase like 3
MGMT O-6-methylguanine-DNA methyltransferase


miRNA MicroRNA
MLH1 MutL Homolog 1
MOF Monomeric and oligomeric flavanols
mRMR Minimal redundancy maximal relevance feature selection
mRNA Messenger RNA
MspI Methylation-insensitive restriction enzyme from Moraxella sp
M-value log2 methylated over unmethylated signals ratio
Nasen Between-array normalisation separating Infinium types and colour channel
NGS Next-generation sequencing
NK Natural killer cells
NKI Netherlands Cancer Institute
Nm 2’-O-methylation
NMF Non-negative matrix factorisation
NMLS NormaliseMethyLumiSet normalisation
NOOB Normal-exponential-based background correction using out-of-bound probes
NOOB+Fun Pipeline formed by NOOB background correction followed by Fun
Normexp Normal-exponential-based background correction using control probes
NPCA Normalised PCA
NSUN2 NOP2/Sun RNA methyltransferase family member
OR Odds ratio
OSAT Optimal sample assignment tool
OxBS Oxidative bisulphite treatment
oxmC Oxidised modification of 5mC
oxmU Oxidised modification of T
p300 Protein 300
PASR Promoter associated short RNA
PaTIL Pathological intratumoral TIL readings
PBC Peak-based correction
PCA Principal component analysis
pCGI Promoter associated CGI
PCPG Pheochromocytoma and Paraganglioma
pCR Pathological complete response
PCR Polymerase chain reaction
PHD Plant homeodomain
PI3K Phosphatidylinositol 3-kinase
PIK3CA Phosphatidylinositol-4,5-bisphosphate 3-kinase catalytic subunit alpha
piRNA Piwi RNA


pre-mRNA Unmatured mRNA
PRF1 Perforin 1 immune activation marker
PRIDE Proteomic identification database
pri-miRNA Primary miRNA
PRMT Protein arginine methyltransferase
PROMPT Promoter upstream transcripts
PR Progesterone receptor
PTEN Phosphatase and tensin homolog
PTPRCAP Protein tyrosine phosphatase, receptor type C-associated protein (MeTIL signature)
PUS1 Pseudouridylase 1
PUS7 Pseudouridylase 7
QC Quality check
QN Quantile normalisation
QN+BMIQ Pipeline formed by smooth quantile colour adjustment followed by BMIQ
qPCR Quantitative polymerase chain reaction
R Arginine
RAD51 RAD51 recombinase
RAR-β Retinoic acid receptor β
RASSF1A Ras association domain family member 1 (MeTIL signature)
RCP Regression on correlated probes normalisation
RefSeq NCBI Reference Sequence database
rho Spearman correlation coefficient
RNA Ribonucleic acid
RNAi Interfering RNA
ROC Receiver operating characteristic curve
RRBS Reduced representation bisulphite sequencing
RRF Regularised random forest
rRNA Ribosomal RNA
S Serine
SAM S-adenosylmethionine
SEMA3B Semaphorin 3B (MeTIL signature)
siRNA Small interfering RNA
SKCM Skin cutaneous melanoma
Smooth Smooth quantile normalisation
SMOTE Synthetic minority oversampling technique
SMUG1 Single-strand-selective monofunctional uracil-DNA glycosylase 1
snoRNA Small nucleolar RNA
SNP Single nucleotide polymorphism
snRNA Small nuclear RNA
SQN Subset quantile normalisation


SSO Subset sampling optimisation undersampling method
STAT1 Signal transducer and activator of transcription 1 associated signature
SWAN Subset within-array quantile normalisation
SWI/SNF SWItch/Sucrose Non-Fermentable
T Threonine (in proteic context); Thymidine (in DNA context)
TDG Thymine DNA glycosylase
TET Ten-eleven translocation methyl cytosine dioxygenase
TF Transcription factor
THCA Thyroid carcinoma
THYM Thymoma
TIL Tumour infiltrating lymphocyte
tiRNA Tiny RNA
TME Tumour microenvironment
TN Triple negative breast cancer subtype
TNM Staging integrating tumour size, node invasion and presence of metastasis
TOP Trial of Principle cohort
Tost Pipeline from Touleimat and Tost
TP True positive
TP53 Tumour protein 53
TP73 Tumour protein 73
Tregs Regulatory T cells
tRNA Transfer RNA
TSS Transcription start site
TSS1500 1500bp upstream of the TSS
TSS200 200bp upstream of the TSS
TSSaRNA TSS associated RNA
TTS Transcription termination site
T-UCR Transcribed ultraconserved noncoding RNA
UCSC University of California Santa Cruz website
WGSBS Whole-genome shotgun bisulphite sequencing
WTAP WT1-associated protein
Y Tyrosine


Contents

1 Introduction
1.1 Breast Cancers
1.1.1 Breast Cancers Prognostication
1.1.2 Prediction of Treatment Efficiency in Breast Cancers
1.1.3 High Throughput Tumour Profiling
1.2 Epigenetics
1.3 Infinium Technology
1.4 Extracting Signatures
2 Aim of the Thesis & Original Contributions
2.1 Original Work
2.1.1 Infinium HumanMethylation Beadarrays Evaluation
2.1.2 Breast Cancers MeTIL Signature Extraction
2.1.3 Other Related Projects
3 Biological Background
3.1 Breast Cancers
3.1.1 Anatomopathological, Histological Classification and Staging
3.1.2 Clinical and Molecular Classification
3.1.3 Tumour Microenvironment
3.2 Epigenetics
3.2.1 Epigenetics & the central dogma of molecular biology


3.2.4 Epigenetic at cis-Regulatory Elements
3.2.4.1 Promoter
3.2.4.2 Gene Body
3.2.4.3 Enhancers
3.2.4.4 Regulatory Region Identification
3.2.5 Epigenetic Alterations in Breast Cancers
3.3 DNA Methylation Assessment
3.3.1 Bisulphite Conversion
3.3.2 Massively Parallel Sequencing
3.3.3 Infinium beadarrays
3.3.3.1 Promoter-centric Infinium Arrays (GoldenGate and HumanMethylation27 beadarrays)
3.3.3.2 High Coverage Infinium Arrays (HumanMethylation450 and HumanMethylation850 Beadarrays)
3.3.4 Methods for Validation at Single-site Scale
4 Bioinformatic Background
4.1 Unreliable Infinium Probes Filtering
4.1.1 High Detection P values
4.1.2 Cross-reactive Probes
4.1.3 Probes Containing Common SNPs
4.1.4 Probes Located on Heterochromosomes
4.2 Infinium HumanMethylation Beadarrays Normalisation
4.2.1 Inheritance from Expression Array Normalisation
4.2.2 Within-array Normalisation
4.2.3 Between-array Normalisation
4.3 Extracting Signatures from microarray data
4.3.1 Gene Expression Signatures in Breast Cancers
4.3.2 Machine-Learning-based Signature Extraction
4.3.2.1 Biological Knowledge
4.3.2.2 Feature Extraction
4.3.2.3 Filter Feature selection
4.3.2.4 Embedded Feature selection
4.3.2.5 Wrapper Feature selection


5 Infinium HumanMethylation Beadarrays Evaluation
5.1 Processing of Infinium HumanMethylation High-density Beadarrays
5.2 Dataset Description
5.3 Filtering Impact in 450k, 850k and RRBS Technologies
5.4 Evaluation of Normalisation Methods
5.4.1 Evaluation of 450k Within-array Normalisation Methods
5.4.2 Evaluation of 450k Between-array Normalisation Methods
5.4.3 Evaluation of Normalisation Methods on 850k Data
5.4.4 Variance Heterogeneity
5.5 Biological Features Covered by Infinium Beadarrays
5.5.1 Development of Alternative Annotations
5.5.1.1 Regulatory Regions
5.5.1.2 Association to Transcript
5.5.1.3 CpG Island-associated Regions
5.5.1.4 Promoter/Non-promoter Regions
5.5.1.5 Illumina Default Annotation
5.5.2 Infinium HumanMethylation850 Coverage Evaluation
5.5.3 Epigenetic-based 850k Annotation
5.5.4 Differential Methylation Analysis with 850k
5.6 Discussion
5.6.1 The Epigenetic-based Annotation We Developed Improves Infinium Interpretability
5.6.2 Our Study Reveals the Broad Methylome View Provided by 850k Relatively to RRBS
5.6.3 Our Comparative Study Highlights PBC and NOOB as Best Within-array Normalisation
5.6.4 Our Comparative Study Reveals that Between-array Normalisation can Artefactually Distort Data
5.6.5 Our Between-replicates Analysis and Side Projects Show the Need for a Methylation Difference Threshold
6 The MeTIL Score: Predicting TIL Amount with DNA Methylation thanks to Machine Learning
6.1 Data and Cohort Description
6.2 Derivation of the MeTIL Signature


6.2.2 Generation of a Signature Population
6.2.3 Final Signature Selection
6.3 Computation of the MeTIL Score from the Signature
6.4 Evaluation of the MeTIL Score Performance
6.4.1 Evaluation of TIL Distributions Using the MeTIL Score
6.4.2 Prediction of Survival and Response to Chemotherapy with the MeTIL Score
6.4.3 Evaluation of TILs through Bisulphite Pyrosequencing of MeTIL Markers
6.4.4 Prediction of Survival Outcome in Other Cancer Types with the MeTIL Score
6.5 Discussion
6.5.1 Our Original Machine Learning Approach Extracts the Representative Signature from a Signatures Population
6.5.2 Our MeTIL signature Specifically Reflects TILs
6.5.3 Our MeTIL Score Predict Outcome and Response to Chemotherapy
6.5.4 Our MeTIL Score May be Transferred in Clinics Using Pyrosequencing
6.5.5 Our MeTIL Score is Prognostic in Other Cancers
7 Conclusions & Perspectives
7.1 Summary of the Contributions of this Thesis
7.1.1 Infinium HumanMethylation Preprocessing
7.1.2 Epigenetic-based Annotation
7.1.3 Development of a Score Reflecting TILs
7.2 Future Works
7.2.1 Improvement of Infinium Processing
7.2.2 Exploration of the Signature Population
7.2.3 Extension of the MeTIL Signature
7.2.4 Epigenetic in Breast Cancers
A Background: Supplementary Information
A.1 Epigenetic Modifications
A.1.1 Histone Modifications


A.2 DNA methylation Assessment
A.2.1 Methylation-sensitive Restriction Enzymes
A.2.2 Affinity Enrichment
A.2.3 Massively Parallel Sequencing
A.2.3.1 Restriction-based Sequencing (Methyl-seq)
A.2.3.2 Affinity-based Sequencing (MeDIP & MethylCap)
A.2.4 Microarrays
A.2.4.1 Restriction-based Microarrays (MethylScope and CHARM)
A.3 Signature Extraction from microarrays
A.3.1 Cox Regression
A.3.2 Feature Extraction
A.3.3 Mutual Information
A.3.4 Logistic Regression
B Infinium Evaluation: Supplementary Material
B.1 Normalisation
B.2 Biological features covered by Infinium arrays
C MeTIL score: Supplementary Material
C.1 Extracting Signatures
C.2 Patient Cohorts
D Publications


Chapter 1

Introduction

Here, we introduce the motivations for studying epigenetics in breast cancers.

First, we describe the clinical need to infer signatures in breast cancers for both prognostication and prediction.

Secondly, the concept of epigenetics and its involvement in cancer is briefly described, with a particular focus on DNA methylation.

Then, the technology used to measure DNA methylation alterations is introduced and the main approaches to extract signatures are briefly explained.

Finally, the different contributions of the present thesis are introduced.

1.1 Breast Cancers

Breast cancers (BC) are the most frequently encountered cancers in women from Western countries. In Europe, BC have an incidence of 71.1 new cases per 100 000 each year, and the incidence even reached 106 cases per 100 000 in Belgium in 2008 [276, 68]. In the United States, 246 660 new cases were diagnosed in 2016 (Figure 1.1). BC are therefore one of the major public health concerns in Western countries, since one woman out of eight will develop breast cancer during her lifetime [224].


Figure 1.1: From [224] Ten leading cancer types for the estimated new cancer cases (top)


Figure 1.2: Breast cancers prognostication. Solid line: time frame, dashed line: data analysis, red: tumour, green: tumour removal through surgery.

In an effort to reduce the rate of recurrence and prolong survival, new therapies are being developed, and randomised trials have been conducted regularly since the mid-1980s with the aim of improving both prognostication and prediction of response to therapy [76].

1.1.1 Breast Cancers Prognostication

Prognostication aims to predict the survival of a patient, or the risk of developing metastases without treatment after the initial surgery (Figure 1.2).

Several features can be taken into consideration for breast cancer prognosis, e.g. clinical variables (age of the patient, invasion of the lymph nodes by cancer cells and tumour size).


other [198]. So, the use of histological grade is not sufficient for an accurate prognosis of breast cancer patients.

Additionally, the histopathologist may also quantify non-tumoral cells such as stromal or lymphocytic cells. Being able to quantify the latter could be particularly relevant, as tumour infiltrating lymphocytes (TILs) are known to be associated with good prognosis. However, even more than the histological grade, TIL quantification suffers from subjectivity and varies from one pathologist to another. Furthermore, it relies on semi-quantitative measurements, which are limited in accuracy and reproducibility. To reduce this subjectivity, scoring systems are under development, but no consensus has been reached yet [113, 64, 214, 229].

1.1.2 Prediction of Treatment Efficiency in Breast Cancers

In order to improve patient survival and/or reduce toxic side effects by targeting cancer cells more specifically, new systemic adjuvant treatments are continually under development. To provide the appropriate treatment to the patient, predicting the response of the breast cancer patient to the treatment is essential. There are two settings for drug delivery: in the adjuvant setting, the drug is delivered after surgery (Figure 1.3), while in the neoadjuvant setting, the drug is delivered prior to surgery (Figure 1.4) [173]. As for prognostication, predictions in the adjuvant setting are based on features extracted from the primary tumour. In the neoadjuvant setting, the situation is more complex: the features used for prediction are extracted from a biopsy of the breast tumour taken before the neoadjuvant therapy. Breast surgery is carried out to remove the primary tumour after neoadjuvant therapy, and the efficacy of the treatment is then assessed (e.g. decrease in tumour size). In particular, the pathological complete response (pCR) is defined as the complete disappearance of tumour cells in the breast and the axillary lymph nodes. pCR has been shown to be associated with excellent long-term survival. Therefore, in the neoadjuvant setting, only the response or resistance to the treatment is predicted, leaving aside the survival of the patients.

In this thesis, we assess prognosis on adjuvant cohorts, while the predictive power of our score is evaluated on a cohort treated in the neoadjuvant setting.


Figure 1.3: Breast cancer adjuvant response prediction. Solid line: time frame, dashed line: data analysis, red: tumour, green: tumour removal through surgery.

Figure 1.4: Breast cancer neoadjuvant response prediction. Solid line: time frame, dashed


The expression status of the hormonal receptors (the œstrogen receptor [ER] and the progesterone receptor [PR]) can be used to predict response to hormonotherapy, while expression and/or gene amplification of the Human epidermal growth factor receptor 2 (HER2) oncogene can be used to identify individuals who may benefit from targeted anti-HER2 therapy. However, current prediction models need to be improved to integrate additional mechanisms, such as the emergence of resistance or the influence of the surrounding non-tumoral cells.

1.1.3 High Throughput Tumour Profiling

In summary, prognostication and treatment efficiency prediction remain challenging in breast cancers because of their highly heterogeneous nature. Indeed, traditional histopathological characteristics based on microscopic examination of tumour sample anatomy are not able to capture the biological differences existing between tumours and, therefore, patients with similar histopathological characteristics may experience very different rates of survival and responses to anti-cancer therapies.

With the development of high throughput molecular profiling, such as microarray-based gene expression profiling, the quantitative measurement of the expression of thousands of genes in parallel became possible. This allowed the identification of sets of genes that can be used as molecular markers for prognosis or treatment efficiency prediction. Such a set of genes, called a signature, was able to outperform classical clinical risk classification and to refine tumour stratification [254]. Many studies have attempted to identify subtypes of breast tumours from gene expression data, using both unsupervised (clustering) and supervised (machine-learning-based) methods [199, 230, 232, 98, 191]. These studies consistently showed at least four subtypes which exhibit distinct clinical outcomes (see Section 3.1). These intrinsic subtypes are now well established, and clinical tests for subtyping are commercially available [226].


1.2 Epigenetics

Formed from the Greek prefix epi, meaning ‘on top of’, and the term genetics, epigenetics was introduced by Conrad Waddington in 1942 as the domain studying the causal relationship between genes and their products which leads to the observed phenotype. The definition has slowly evolved, and epigenetics is currently defined as “the study of the stable and reversible changes of the gene expression that occurs without alterations of the deoxyribonucleic acid (DNA) sequence and can be inherited from one cell to its daughters” (though the heritability part is still under debate) [54].

While there is still a debate concerning which biological processes to include in the epigenetics field, in this thesis we considered four different biological phenomena:

• Chemical modifications of the DNA, and particularly DNA methylation, which consists of the addition of a methyl group to a cytidine, mostly when this cytidine is followed by a guanosine (CpG context). The role of epigenetic modifications in BC will be the main focus of this thesis.

• Histone post-translational modifications, which are additions of chemical groups to proteins called histones. DNA is wrapped around these proteins and their modifications affect gene regulation.

• Expression of noncoding RNAs, which are transcribed from genes that do not code for any protein and often have gene regulatory functions.

• Chemical modifications of RNAs, i.e. modifications that change the final expression, notably by affecting the translation and/or the degradation rate.

These different concepts, described in detail in the Biological Background (Section 3.2), are involved in the regulation of gene expression and therefore explain why an identical genome can lead to multiple phenotypes. So, while all higher organisms have a unique genome containing the information of all their genes, they also have multiple epigenomes regulating gene expression at the level of each cell type.


Figure 1.5: From [58]. While gene expression profiling highlights mainly tumour cell markers, DNA methylation seems more sensitive to the tumour microenvironment.

emerging [192]. In addition, recent studies have shown that DNA methylation can refine BC classification, with subtypes only partially overlapping those based on gene expression [57, 234]. This result highlights the capacity of DNA methylation to provide a view of cancer biology that is complementary to gene expression.

Furthermore, the evolution of a cancer is also strongly influenced by the presence of other cell types in its vicinity (i.e. the tumour microenvironment), which is effectively reflected by the global epigenome of a tumour sample. In particular, thanks to its cell type specificity, DNA methylation can potentially be used as a snapshot of the non-cancerous cells present in the tumour and affecting the outcome (Figure 1.5) [58, 117].

1.3 Infinium Technology


technologies is essential, it also implies new challenges from the data analysis point of view [181].

While Infinium HumanMethylation450 (450k) is the most popular technology for DNA methylation assessment at genome scale, with more than 650 000 samples published on the Gene Expression Omnibus (GEO) platform, it remains a recent technology compared to gene expression arrays, and no consensus exists about the most suitable preprocessing methodology. In particular, its design requires two distinct normalisation steps: the within-array normalisation, which corrects specific biases that exist between probes within a single array, and the between-array normalisation, which corrects variations between samples.

Infinium 450k and HumanMethylation850 (850k) beadarrays assess the methylation status of more than 450 000 and 850 000 cytidines respectively, each one being a different variable. Considering that diagnostic tests are limited to a very small set of features (fewer than ten in most cases) [226], the need for signature extraction strategies is evident.

1.4 Extracting Signatures

Selecting a small subset of variables to precisely discriminate between different clinically relevant conditions (e.g. absence, low amount or high amount of lymphocytes) is essential to develop diagnostic kits able to make an accurate prognosis or to accurately predict the response to a therapy. Nowadays, the discovery of this subset, called a signature, is becoming more and more challenging, since high throughput technologies lead to an exponential growth of the number of variables while the number of samples is limited for practical and ethical reasons. Furthermore, the imbalance of the data (i.e. the under-representation of one condition compared to another) often affects the predictive performance of the signature.
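The effect of class imbalance on performance assessment can be illustrated with a minimal Python sketch. It compares plain accuracy with the balanced error rate (BER, listed in the acronyms), computed here as the mean of the per-class error rates. The function names and the toy labels are hypothetical and chosen only for illustration; they are not taken from the analyses of this thesis.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class matches the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_error_rate(y_true, y_pred):
    """Mean of the per-class error rates, so each class weighs the same."""
    classes = sorted(set(y_true))
    per_class_errors = []
    for c in classes:
        # Indices of the samples that truly belong to class c
        idx = [i for i, t in enumerate(y_true) if t == c]
        errors = sum(1 for i in idx if y_pred[i] != c)
        per_class_errors.append(errors / len(idx))
    return sum(per_class_errors) / len(classes)


# Toy imbalanced data (illustrative): nine "low" samples, one "high" sample,
# and a trivial classifier that always predicts the majority class.
y_true = ["low"] * 9 + ["high"]
y_pred = ["low"] * 10

print(accuracy(y_true, y_pred))             # 0.9, looks good
print(balanced_error_rate(y_true, y_pred))  # 0.5, no better than chance
```

In this toy case, the accuracy hides the fact that the minority class is never detected, whereas the balanced error rate exposes it.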

Two strategies are mainly used for extracting signatures, one based on biostatistics, the other on machine learning.


Therefore, different multiple-testing correction methods have been introduced [53]. Since the biostatistical approach evaluates each feature independently, it has several drawbacks:

• All relevant variables are integrated into the final signature. However, highly correlated variables share the same information and lead to a larger signature without improving the overall predictive power.

• Detecting features that are weakly relevant individually but become highly predictive together is almost impossible.

• Finally, biostatistical tests are highly dependent on specific assumptions whose validity is sometimes difficult to assess.
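To make the notion of multiple-testing correction more concrete, the following minimal sketch adjusts a list of per-feature p-values. The choice of the Benjamini-Hochberg procedure, the function name and the example p-values are assumptions made for illustration only; the text above refers to correction methods in general.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values, one per tested feature (e.g. one per CpG site).
p_values = [0.001, 0.008, 0.039, 0.041, 0.60]
adjusted = benjamini_hochberg(p_values)
# Features kept at a 5% false discovery rate.
selected = [i for i, q in enumerate(adjusted) if q <= 0.05]
```

With these illustrative values only the first two features pass the 5% threshold, although four of the raw p-values are below 0.05.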

The second strategy is based on machine learning methodologies. The rationale is to develop, on a subset of the samples called the training set, a model able to discriminate between the different conditions of interest (called classes) using a set of features. The prediction performance of the model is then evaluated by comparing its predictions with the reality on another part of the data called the test set. To extract a signature using machine learning, several models have to be developed using different subsets of the features. The final signature is the one leading to the best performance of the model.
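The machine learning strategy can be sketched as follows. This is a toy illustration under stated assumptions, not the procedure developed in this thesis: the nearest-centroid classifier, the exhaustive search over subsets of a fixed size, all function names and the toy data are hypothetical.

```python
from itertools import combinations


def centroids(samples, labels, features):
    """Mean of each selected feature, computed separately per class."""
    result = {}
    for cls in set(labels):
        rows = [s for s, l in zip(samples, labels) if l == cls]
        result[cls] = [sum(r[f] for r in rows) / len(rows) for f in features]
    return result


def predict(model, sample, features):
    """Assign the class whose centroid is closest (squared Euclidean distance)."""
    def dist(cls):
        return sum((sample[f] - c) ** 2 for f, c in zip(features, model[cls]))
    # sorted() makes tie-breaking deterministic.
    return min(sorted(model), key=dist)


def accuracy(model, samples, labels, features):
    """Fraction of samples whose predicted class matches the known class."""
    hits = sum(predict(model, s, features) == l for s, l in zip(samples, labels))
    return hits / len(samples)


def extract_signature(train, train_y, test, test_y, size):
    """Fit one model per feature subset of the given size; keep the best one."""
    best_score, best_subset = -1.0, None
    for subset in combinations(range(len(train[0])), size):
        model = centroids(train, train_y, subset)       # training set only
        score = accuracy(model, test, test_y, subset)   # evaluated on test set
        if score > best_score:
            best_score, best_subset = score, subset
    return best_score, best_subset


# Toy data: 3 features per sample; only feature 0 separates the classes.
train = [[0.1, 0.5, 0.9], [0.2, 0.4, 0.1], [0.9, 0.5, 0.8], [0.8, 0.6, 0.2]]
train_y = ["low", "low", "high", "high"]
test = [[0.15, 0.6, 0.3], [0.85, 0.4, 0.7]]
test_y = ["low", "high"]

best_score, signature = extract_signature(train, train_y, test, test_y, size=1)
```

Choosing the subset on the test set follows the description above literally; with real data the subset is usually chosen on a separate validation split or by cross-validation, so that the final test estimate is not used twice.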


Chapter 2

Aim of the Thesis & Original Contributions

As introduced in the previous chapter, the currently used factors for prognosis and prediction of response to chemotherapy in breast cancers are suboptimal and insufficient to explain the differences in survival and response observed in clinical practice. It is, therefore, essential to explore the existence of markers to improve prognostication and prediction of the efficacy of neoadjuvant anthracycline-based chemotherapy, one of the most commonly administered chemotherapies in BC but unfortunately associated with rare but severe side-effects.

In this thesis, we explore the role of epigenetic alterations, which are increasingly recognised as a hallmark of many cancers. In particular, we decided to focus on the development of a DNA methylation score reflecting variation in the tumour microenvironment (particularly focusing on TILs) and, thereby, improving both prognostication and prediction of response to chemotherapy. Using high-throughput Infinium 450k microarray technology, we aim to extract a small set of highly predictive epigenetic sites, with the final purpose of developing a diagnostic tool for the optimal management of the individual breast cancer patient.


assess DNA methylation [139], the main objective of these analyses is to provide guidelines for an optimal use of this technology to the research community.

After this efficient preprocessing of the data, some challenges remained, such as the relatively small size of the cohort and its imbalance. Therefore, in the second part of this thesis (Chapter 6), we develop a machine learning framework that maximises the selection of relevant features in small, unbalanced datasets, where there is a high risk of extracting an overfitted signature (a signature whose predictive power is restricted to the dataset and cannot be validated on other datasets). Using this approach, we finally developed a score, called the MeTIL score, that reflects TILs in breast cancer samples.

2.1 Original Work

The present thesis is the result of the collaboration between the “Laboratory of Cancer Epigenetics”, the “Machine Learning Group” and the “Interuniversity Institute of Bioinformatics in Brussels”. The “Breast Cancer Translational Research Laboratory”, from Bordet Institute, was also involved in this project, by providing sample biopsies and clinical expertise.

The research led to the publication of several articles (Figure 2.1 and Section D). In this manuscript, we focus on articles related to Infinium preprocessing (a second-author paper “A comprehensive overview of Infinium HumanMethylation450 data processing” [Figure 2.1 7] and a first-author paper in preparation “Reannotation and normalisation of Infinium 850k bead chip improves high-throughput analysis of enhancer methylation”) or to the extraction of a lymphocytic signature in breast cancers (co-first author paper “DNA methylation-based immune response signature improves patient diagnosis in multiple cancers.” [Figure 2.1 1]).

2.1.1 Infinium HumanMethylation Beadarrays Evaluation

and the impact of array normalisation methods. Finally, we discuss how to properly analyse Infinium data and illustrate this using related projects in which we performed bioinformatic analyses in the field of epigenetics.

Here are the main results of our analyses:

• The comparative evaluation reveals that the “peak-based normalisation” (PBC) and the “Normal-exponential-based background correction using out-of-bound probes” (NOOB) are the most efficient within-array normalisation methods.

• No between-array normalisation proved efficient, highlighting a need for improvement.

• We also identified an 850k-specific bias: a high variance heterogeneity that cannot be corrected by existing normalisation methods, except for a slight improvement with PBC.

• Finally, a proper analysis of Infinium beadarrays can only be obtained with a good probe annotation. Therefore, we developed an alternative annotation strongly improving the biological interpretability of the Infinium beadarrays.

2.1.2 Breast Cancers MeTIL Signature Extraction

The second contribution of this thesis (Section 6) is the identification of DNA methylation markers that recapitulate the evaluation of TILs and their impact on long-term outcome in breast cancers.

In summary, we obtained the following results:

• We developed a machine learning procedure allowing the extraction of signatures and the robust assessment of their predictive power. The modularity of our procedure allowed us to test different “dimensionality reduction” and prediction model methods and to compare their performances.

• We applied this machine learning procedure to breast cancer Infinium data to extract a set of DNA methylation-based immune markers (MeTIL signature) measuring TIL distributions in a more robust and sensitive manner than conventional pathological methods.


• MeTIL markers also improved prognostication in other malignancies, including melanoma and lung cancers. Furthermore, with MeTIL markers we reported a prognostic value for TILs in malignancies where it was previously unrecognised.

• We demonstrated the possibility of applying this methodology in the clinic, since MeTIL markers can be determined by bisulphite pyrosequencing from low amounts of DNA from FFPE tumour tissue.

Overall, our machine-learning-based analysis highlights the power of DNA methylation to evaluate tumour immune response and the potential of this approach to improve prognostication of breast and other cancers.

2.1.3 Other Related Projects

It is also worth noting that a number of other related projects were completed and led to publications.

• The bioinformatic analysis of the second-author paper “Portraying breast cancers with long noncoding RNAs.” (Figure 2.1 3) allowed the identification of long noncoding RNAs whose expression is altered in breast cancers. While this paper is not related to DNA methylation, the analysis also required expertise in the bioinformatics of epigenetics. In particular, the main challenges were the reannotation of public expression arrays to identify long-noncoding-associated probes and the evaluation of the potential role of long noncoding RNAs using the “guilt by association” approach.

• In the article “The interplay between the lysine demethylase KDM1A and DNA methyltransferases in cancer cells is cell cycle dependent.” (Figure 2.1 4), the analysis of Infinium data did not show any significant changes when KDM1A was knocked down while its interaction with a main DNA methylation writer enzyme had been shown. This suggests a DNA methylation independent role of this interaction.


• In the “FOXP1 is a regulator of quiescence in healthy human CD4+ T cells and is constitutively repressed in T cells from patients with lymphoproliferative disorders.” article (Figure 2.1 2), a proper analysis of Infinium 450k beadarray was applied in order to identify significant DNA methylation alterations of the FOXP1 gene between naive and memory T cells demonstrating the involvement of epigenetic regulation of this gene in the T cell quiescence process.


Chapter 3

Biological Background

Here, we first describe breast cancers: after a summary of the existing classifications, we focus on the influence of the tumour microenvironment.

Then, we detail the epigenetic mechanisms (focusing on DNA methylation), their regulation and biological functions. In particular, we describe how their cross-talk at key genomic regulatory regions allows for a fine regulation of gene expression. We also highlight the importance of epigenetic alterations in breast cancers.

Finally, we present the technologies available to quantify DNA methylation both at genome and site-specific scale (focusing on bisulphite-based methods).

3.1 Breast Cancers

Breast cancers are very heterogeneous diseases, which can arise from different cell types, or even tissues. The causal mechanisms leading to the disease are also variable as well as the histopathological phenotypes. Therefore, it is more accurate to consider breast cancers as several diseases affecting the same anatomic structure (the breast). Several classifications exist allowing more dedicated treatment of the patients.

3.1.1 Anatomopathological, Histological Classification and Staging

Figure 3.2: from [256] Mammary gland cells lineage. Mammary stem cells (left panel) can differentiate to luminal or basal progenitors (middle panel) which lead to mature cells (right panel).

of origin (e.g. progenitor cells) than less aggressive ones, or a dedifferentiation process occurs in aggressive carcinoma which leads to the acquisition of stem cell-like traits. This stem cell-like phenotype can be identified by a histopathological study of a slide of the tumour tissue. Accordingly, a histopathological classification has been developed which classifies breast tumours into three grades:

• grade I is characterised by well differentiated and weakly proliferative cells,

• grade III is undifferentiated and highly proliferative,

• grade II presents an intermediate phenotype [216].

Another approach is the TNM staging, an integrative score taking into account primary tumour size, number of invaded lymph nodes and presence of metastasis. The higher the stage, the more aggressive the tumour.

3.1.2 Clinical and Molecular Classification

may be used:

• The abundance of specific markers can be quantified on a tumour sample slide by targeting them using ‘immunohistochemistry’ (IHC) at protein level or ‘fluorescence in situ hybridisation’ (FISH) at DNA level.

• Microarrays can be used to quantify marker abundance at RNA or DNA levels. While some microarrays specifically dedicated to popular signatures are commonly used, this method is generally less used in clinics because of its higher cost [226].

Classically, breast cancers can be classified based on the assessment of four markers: (1) oestrogen and (2) progesterone hormonal receptors (ER and PR), (3) HER2 gene amplification status and (4) the Kiel 67 protein (KI67), which is a proliferation marker. Alternatively, gene expression microarrays have been used to identify breast cancer subtypes in an unsupervised manner, based on the similarity between the gene expression patterns of patients [199, 230]. Signatures were then developed to classify patients with a restricted number of genes, like the PAM50 signature based on 50 genes or the Trigene based on 3 genes [191, 98]. The two approaches give similar results and five subtypes with different prognoses and therapies were defined (see Figure 3.3):

• Luminal-A (LumA) and Luminal-B (LumB) tumours share many traits and can be regrouped within the Luminal subtype (LUM). Indeed, both subtypes show an overexpression of ER and/or PR without HER2-amplification. When ER or PR binds its appropriate hormone, its conformation changes, allowing its binding to DNA and the activation of key regulatory pathways [245]. The LUM subtype, which represents around 70% of the breast cancers [108], is treated using hormone therapy. This therapy consists in inhibiting the ER receptor (tamoxifen) or the synthesis of oestrogen through an aromatase inhibitor (letrozole). Indeed, LUM cancers are dependent on ER or PR receptors for their growth. In comparison to the other breast cancers, LumA are characterised by a differentiated phenotype, a low proliferation rate and a good prognosis [49]. LumB are more aggressive and proliferative. Therefore, proliferation markers, like KI67 or Aurora kinase A (AURKA) expression, can be used to distinguish LumB from LumA.


gene. This gene codes for the HER2 protein, a receptor of the epidermal growth factor (EGF). Its activation increases proliferation through the phosphatidylinositol 3-kinase (PI3K) pathway. While targeted therapies specifically inhibiting HER2 have been developed (e.g. Trastuzumab), resistance mechanisms often occur and this subtype, representing 15% of the breast cancers, retains a poor prognosis. Usually, tumours with HER2-amplification are characterised as HER2-amplified independently of their hormonal receptor status; however, HER2-amplified tumours that also overexpress hormonal receptors, which account for 10% of the breast cancers, do not behave similarly to hormone-negative ones and can also be considered as a different subtype [156, 108].

• Triple-Negative tumours (TN) are negative for the overexpression of PR and ER and for the amplification of HER2. This subtype, also called basal-like when identified using the PAM50 signature, accounts for around 12% of the cases [108]. TN tumours often present a stem-cell-like phenotype, highly dedifferentiated and proliferative, and it has been suggested that this phenotype can be explained by a less differentiated cell of origin. In particular, TN are suspected to arise from the luminal progenitor cells [151]. Therefore, TN are very aggressive and present a poor prognosis. Furthermore, no specific treatment is currently available and patients are treated only with chemotherapies that block replication (anthracycline) or cell division (docetaxel). Therefore, a better understanding of this subtype is essential to improve patient care. Notably, this subtype remains heterogeneous and TN were recently further refined into 6 subtypes, opening new avenues for personalised medicine [144].

• Normal-like tumours show a gene expression pattern very similar to the normal samples. This subtype is quite rare and cannot be identified by IHC-based approach. It shows a good prognosis [30].
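The marker-based rules described in the list above can be summarised as a small decision function. This is only a sketch: the function name and the boolean inputs are hypothetical, and the thresholds that turn an IHC or FISH measurement into a positive or negative status are assumed to be applied upstream, since they are not given here. The normal-like subtype is left out because, as stated above, it cannot be identified by an IHC-based approach.

```python
def classify_subtype(er_positive, pr_positive, her2_amplified, high_proliferation):
    """Map four marker statuses to a subtype label following the rules in the text."""
    if her2_amplified:
        # HER2 amplification usually defines the subtype,
        # independently of the hormonal receptor status.
        return "HER2-amplified"
    if er_positive or pr_positive:
        # Luminal tumours: a proliferation marker (e.g. KI67)
        # separates LumB from LumA.
        return "LumB" if high_proliferation else "LumA"
    # Negative for ER, PR and HER2 amplification.
    return "Triple-Negative"


# Hypothetical marker statuses for three tumours.
examples = [
    classify_subtype(True, False, False, False),   # ER+ only, low proliferation
    classify_subtype(True, True, False, True),     # ER+/PR+, high proliferation
    classify_subtype(False, False, False, True),   # all three markers negative
]
```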


Figure 3.4: from [129] Non-silent somatic mutations by breast cancer subtypes. Missense mutations are shown in light grey and truncation mutations in dark grey. Blue: LumA, cyan: LumB, pink: HER2, red: basal-like.

(72%) but rarely mutated in luminal tumours (12% of the LumA and 19% of the LumB) (see Figure 3.4) [129]. In addition, the Breast cancer 1 (BRCA1) and Breast cancer 2 (BRCA2) mutations are rare inherited mutations that strongly increase the risk of developing a TN breast cancer.

3.1.3 Tumour Microenvironment

A tumour sample is not only composed of cancerous cells; several other cell types surrounding these cells are also present [117]. These cells, which form the tumour microenvironment (TME), play a major role in the disease process. Indeed, there is a cross-talk between these cells and the tumour [123].

First, cancer-associated fibroblasts have been shown to favour tumour growth, notably by allowing tissue remodelling and extracellular matrix deposition [188, 203]. Then, endothelial cells have a main role in tumour neoangiogenesis, which has long been known to be essential to tumour growth and metastasis [86]. Adipocytes also seem to be involved in chemotherapy resistance mechanisms [31].


Figure 3.5: From [123] Origins and influence of tumour heterogeneity. The tumour sample is composed of tumour cells and their TME (fibroblasts, immune infiltrate, vascular network ...).

TN [113, 159, 155, 64] as well as in HER2-amplified BCs treated with chemotherapy and Trastuzumab [113, 155, 64, 27]. However, the effect on the tumour is highly dependent on the immune cell type. While T lymphocytes expressing the cluster of differentiation 8 (CD8+), by killing cancerous cells, are of good prognosis [165], regulatory T lymphocytes reduce immune response and are linked to a bad prognosis [141]. T helpers, expressing the cluster of differentiation 4 (CD4+), could be of good or bad prognosis depending on their cytokine secretions [89]. The link of B cells to the outcome is also ambiguous, with some studies showing a favourable prognosis and others a worse outcome [219, 10].


developed for TILs and they also predict better clinical outcome and response to therapy in TN and HER2 tumours [113, 64, 221]. In particular, T lymphocyte infiltration, estimated using gene expression levels, allowed the prediction of survival in breast cancers [96].

3.2 Epigenetics

3.2.1 Epigenetics & the central dogma of molecular biology

Even if all non-cancerous cells of a human body have the same genome, hundreds of cell types presenting different shapes and functions can be identified (like the adipocytes, lymphocytes ... mentioned in the previous section [Section 3.1.3]). This observation was difficult to explain with the original version of the central dogma of molecular biology. According to it, DNA encodes all biological processes thanks to the succession of billions of monomers, called nucleotides, forming a 4-letter alphabet sequence (A: Adenosine, C: Cytidine, G: Guanosine and T: Thymidine): the genome. Thanks to its double-stranded structure, resulting from the specific complementarity of each nucleotide from one strand with the facing nucleotide on the other strand, the sequence is inherited from one cell to its daughters via the replication process. The DNA functional unit is the gene, which is transcribed into another 4-letter macromolecule called ribonucleic acid (RNA) (where U: Uridine replaces T). A particular class of RNA, called messenger RNA (mRNA), possesses a coding sequence part (CDS) which is translated into 20-letter macromolecules, the proteins, while untranslated regions (UTR) remain on the upstream (5’UTR) and downstream (3’UTR) parts of the transcript. In the original dogma, most of the genes generate a single type of mRNA and the proteins are considered as the main effectors of the cell, mediating cell reactions through catalysis (enzymes) and interacting with each other and other macromolecules (Figure 3.6) [46].
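The sequence relationships described above (strand complementarity, replacement of T by U during transcription, and the split of an mRNA into 5’UTR, CDS and 3’UTR) can be illustrated with a few lines of code. The sequence, the CDS coordinates and the function names below are hypothetical and chosen only for illustration.

```python
def reverse_complement(strand):
    """Sequence of the facing DNA strand, read in its own 5'->3' direction."""
    pairs = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(pairs[base] for base in reversed(strand.upper()))


def transcribe(coding_strand):
    """RNA copy of a DNA coding strand: same sequence with U in place of T."""
    return coding_strand.upper().replace("T", "U")


def split_mrna(mrna, cds_start, cds_end):
    """Split an mRNA into (5'UTR, CDS, 3'UTR); coordinates are 0-based, end exclusive."""
    return mrna[:cds_start], mrna[cds_start:cds_end], mrna[cds_end:]


# Hypothetical 14-nucleotide "gene" with a 3-codon coding sequence.
gene = "GGCATGGCTTAAGC"
facing_strand = reverse_complement(gene)
mrna = transcribe(gene)
utr5, cds, utr3 = split_mrna(mrna, 3, 12)
```

Here the CDS is the 9-nucleotide stretch AUGGCUUAA, flanked by a 3-nucleotide 5’UTR and a 2-nucleotide 3’UTR.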


Figure 3.6: from https://en.wikipedia.org/ Central dogma of molecular biology

The existence of distinct cell types is related to the concept of epigenetics: indeed, in higher organisms, epigenetic modifications are one of the main mechanisms leading to cell differentiation. Starting from a poorly differentiated cell (called a stem cell), a lineage specification occurs. This process involves the acquisition of new epigenetic patterns that silence genes associated with an undifferentiated state (pluripotency genes) and activate genes specific to a cell type (see Figure 3.7 for a schematic example with DNA methylation). This new pattern will be conserved in the daughter cells thanks to maintenance enzymes. Finally, cells with the same genome become highly different from one another and specialised for particular functions [33].

3.2.2 Chromatin Structure

Figure 3.7: from [117] The lineage specification process involves DNA methylation alterations.

the next one [39].

The chromatin exists in two forms: the heterochromatin, which is highly compacted and not accessible to transcription, and the euchromatin, which is less dense and where genes can be transcribed. While one part of the chromatin is constitutively dense and not transcribed whatever the cell type, another can be in a euchromatin state (potentially transcribed) in one cell type and in a heterochromatin state in other cell types (Figure 3.8). The regulation of the chromatin status is therefore one of the key mechanisms that impact gene expression, and its control involves the epigenetic modifications described hereafter [134].

Figure 3.8: Adapted from http://www.stomponstep1.com/ Schematic representation of the


Figure 3.9: Adapted from [175] Epigenetic modifications types.

3.2.3 Epigenetic Modifications

As briefly introduced (Section 1.2), the biological mechanisms to consider as epigenetic modifications are under debate. In this thesis, we group four types of mechanisms (summarised in Figure 3.9) which affect gene expression by modulating the ability of protein factors (e.g. transcription factors [TFs], RNA polymerase, splicing factors ...) to bind at specific DNA or RNA locations, by direct interaction or by modification of the chromatin status.

As the main focus of this thesis, DNA modifications are described in detail hereafter, while histone modifications, noncoding RNAs and RNA modifications are only briefly described. A more detailed description of these last three mechanisms can be found in the appendix (Section A.1).

3.2.3.1 DNA Modifications


While the methylation of adenosine at carbon 6 (6mA) has recently been identified in mouse DNA, the majority of epigenetically functional modifications identified in mammals occur at the carbon 5 of cytidine [135]. The most abundant of these alterations is the 5-methylcytidine (5mC), often referred to as DNA methylation. Of note, 5mC is also present in bacteria and plants and notably plays an essential role in bacterial defence against phages. In vertebrates, 5mC has been shown to be involved in many biological mechanisms: gene regulation, particularly in cell pluripotency and differentiation, genomic imprinting, X-chromosome inactivation ... [62, 121, 33]. It also prevents the activation of endoparasitic sequences (like retroviruses and other transposable elements) and participates in chromosome segregation [138, 142]. Importantly, 5mC is altered in several diseases and particularly in cancers [121].

Around 4% of the C are methylated and the vast majority (> 99.9%) of the cytidines that can undergo methylation are located in a CpG dinucleotide context [152]. The formation of 5mC from C is under the control of DNA methyltransferase enzymes (DNMT). Those enzymes catalyse the transfer of a methyl group from S-adenosylmethionine (SAM) to the carbon 5 of the cytidine [62]. In mammals, three enzymes are responsible for this reaction: DNMT1, DNMT3A and DNMT3B. During the replication process, the new strand is always generated with unmethylated cytidine only. Therefore, in the absence of maintenance, a passive demethylation occurs by dilution of the 5mC marks along replication. DNMT1, on the one hand (with the participation of DNMT3A and B), is responsible for 5mC maintenance. Indeed, DNMT1 preferentially recognises hemimethylated CpG sites (i.e. CpG sites with 5mC on one strand and C on the other, which typically occur after the replication process) and catalyses the methylation of the C at this site. DNMT3A and DNMT3B, on the other hand, are responsible for setting up new methylation patterns during embryonic development. This is done in association with other proteins, leading to a precise and tissue-specific location of the de novo 5mC marks [62] (see Figure 3.10). As spontaneous deamination of 5mC to thymine is a frequent mutation in the germline, CpGs are not uniformly distributed and the genome is globally depleted in CpGs, except in regions of high CpG density, called CpG islands (CGI), that are probably not or only transiently methylated in the germline. Importantly, the methylation status of CGI varies in other tissues and, most of the time, all CpGs within a CGI share the same methylation status [121].
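The balance between passive dilution and maintenance described above can be illustrated with a deliberately simple population-level model. The model, the function name and the `maintenance_efficiency` parameter are assumptions introduced for illustration: at each replication the parental strands keep their marks, while each new strand receives a mark only if its template is methylated and maintenance acts on the resulting hemimethylated site.

```python
def methylated_fraction(initial, maintenance_efficiency, replications):
    """Fraction of strands carrying 5mC at a CpG site after some replications.

    initial: starting fraction of methylated strands (between 0 and 1).
    maintenance_efficiency: probability that a hemimethylated site is
        re-methylated on the new strand (hypothetical parameter).
    """
    fraction = initial
    for _ in range(replications):
        # Half of the strands are parental (marks kept); the other half are
        # new and methylated only when maintenance copies the parental mark.
        fraction = fraction * (1 + maintenance_efficiency) / 2
    return fraction


# Without maintenance, the mark is halved at every replication.
no_maintenance = methylated_fraction(1.0, 0.0, 3)
# With perfect maintenance, the mark is fully preserved.
full_maintenance = methylated_fraction(1.0, 1.0, 3)
```

In this sketch, three replications without maintenance leave one eighth of the strands methylated, whereas perfect maintenance keeps the site fully methylated.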


Figure 3.10: Adapted from [240] DNA methylation reactions and associated enzymes.

5-hydroxymethylcytidine (5hmC), 5-formylcytidine (5fC) and 5-carboxycytidine (5caC). Finally, both 5fC and 5caC are recognised by the thymine DNA glycosylase (TDG), which catalyses the excision of these bases and leads to their replacement by an unmethylated C via the base excision-repair machinery. Alternatively, 5caC may also be decarboxylated into C by a currently unknown enzyme, or passively diluted through replication, as oxidised modifications of 5mC (oxmC; e.g. 5hmC, 5fC and 5caC) are not recognised by DNMT1 [59] (see Figure 3.11). In mouse embryonic stem cells, 30 000 molecules of 5mC, 1 300 of 5hmC, 20 of 5fC and 3 of 5caC have been counted per 1 000 000 cytidines. The very low proportion of 5fC and 5caC suggests that these modifications are intermediates of the demethylation process but may not have an epigenetic role per se. Conversely, 5hmC accounts for around 0.1% of the cytidines and can even reach 1% in some tissues such as brain. The location of this modification has, furthermore, been found to be relatively stable in a particular cell type, and proteins interacting with 5hmC have been reported as potential readers of this mark. Among these readers were chromatin modifiers and transcription factors, suggesting a potential epigenetic role of 5hmC [36].
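As a quick check of the orders of magnitude quoted above, the counts per million cytidines can be converted into percentages. The variable names are illustrative; the counts are those given in the text for mouse embryonic stem cells.

```python
# Molecules counted per 1 000 000 cytidines (values quoted in the text).
counts_per_million = {"5mC": 30_000, "5hmC": 1_300, "5fC": 20, "5caC": 3}

# Percentage of all cytidines carrying each modification.
percent = {mark: 100 * n / 1_000_000 for mark, n in counts_per_million.items()}

# Abundance of each oxidised form relative to 5mC.
relative_to_5mc = {
    mark: n / counts_per_million["5mC"]
    for mark, n in counts_per_million.items()
    if mark != "5mC"
}
```

The conversion gives 3% for 5mC and 0.13% for 5hmC, in line with the figure of around 0.1% quoted above, while 5fC and 5caC together represent less than 0.003% of the cytidines.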


Figure 3.11: From [132] DNA demethylation through a TET-mediated oxidative process. AM-PD: Active Modification - Passive Demethylation; AM-AR: Active Modification - Active Replacement

an unmethylated C status. However, this mechanism remains highly speculative, as no enzyme has been demonstrated to catalyse the oxmC to oxmU reaction and 5mC deamination to T (catalysed by APOBEC3A) has not been shown to be involved in the demethylation process [220] (see Figure 3.12).

In addition to the aforementioned physiological demethylation processes, demethylating drugs such as azacytidine have been developed and have recently been approved by the ‘Food and Drug Administration’ (FDA) [75]. This opens new therapeutic avenues by providing the potential to reverse the pathological 5mC patterns observed in diseases such as cancers.

3.2.3.2 Histone Modifications

Figure 3.12: From [220] Cytidine and modified cytidine hypothetical deamination process (only the base is shown).

• The histone modification can affect the affinity between the histone and the DNA. When the affinity is high, the histone is stuck at its position and DNA is not accessible to the transcription machinery. This causes gene inhibition. Conversely, a lower affinity between DNA and histone is associated with gene activation. A typical example is histone acetylation, whose writing, by histone acetyltransferases (HAT), causes gene activation, while its erasing, by histone deacetylases (HDAC), inhibits transcription.

• The histone modification can also be recognised by specific proteins (readers) which cause the recruitment of remodelling proteins (e.g. the SWItch/Sucrose Non-Fermentable [SWI/SNF] complex), which use adenosine triphosphate (ATP) hydrolysis as an energy source to actively move the histones, thereby changing the state of chromatin from euchromatin to heterochromatin or conversely. This mechanism is common to most of the histone marks, including acetylation, and has been extensively studied for histone methylation, where the writing by histone methyltransferases (HMT) and the erasing by histone demethylases (HDM) can lead to either gene activation or inhibition depending on the location of the mark on the histone.


A more detailed description of the regulation of the main histone marks is available in appendix (Section A.1.1).

3.2.3.3 Noncoding RNAs

One of the major improvements to the central dogma of molecular biology, discovered in the last decades, is the existence of noncoding RNAs. These RNAs are produced by the transcription of specific genes but do not code for any protein and play their biological role as RNA molecules.

Three main classes of noncoding RNAs can be defined based on their length: the short noncoding RNAs (smaller than 40 base-pairs (bp)), the medium-sized noncoding RNAs and the long noncoding RNAs (lncRNAs) (larger than 200bp). In addition, a functional classification can be done and is detailed in appendix (Section A.1.2).

3.2.3.4 RNA Modifications

In addition to DNA and histone, a third type of epigenetic modification occurs at RNA level, and may also play a role in gene regulation. This new layer of epigenetic alterations gives rise to a new field called epitranscriptomics (Section A.1.3).

3.2.4 Epigenetics at cis-Regulatory Elements

An intense cross-talk exists between the different layers of epigenetic modifications. For example, the cross-talk between histone marks, DNA modifications and noncoding RNAs is a strong element of transcription regulation. In DNA, many genomic regions are under epigenetic control (Figure 3.13). In particular, three types of regulatory regions concentrating many epigenetic elements have been described: the promoters, the gene bodies and the enhancers.

3.2.4.1 Promoter


Figure 3.13: From [117] For years, DNA methylation has been studied in a gene-promoter


Figure 3.14: From [133] Lysine methylation marks at promoter and gene body.

and distant parts of the promoter contain additional transcription factors binding sites required for fine regulation [227].

In addition to transcription factors, epigenetic mechanisms are strongly involved in promoter regulation. Active transcription requires TFs and Pol II to access DNA. Therefore, the chromatin should be in an open state (euchromatin) characterised by acetylated marks (H3K27ac and H3K9ac). In addition to acetylation, histones may also be methylated [134]. Some methylation marks are linked to an active state of promoters (H3K4me3) and others to an inactive state (H3K27me3) (see Figure 3.14). A particular class of promoters, called poised promoters, possess both active and inactive marks and are transcriptionally inactive.


Figure 3.15: From [146] Summary of the opposing effects of H3K4me3 and H3K27me3, and their proposed roles at poised promoters.

catalyses the H3K27me3 mark while PRC2 is responsible for chromatin compaction (PRC2) [146]. This repressive effect of 5mC is reinforced by the interaction of DNMT enzymes with HDACs and HMTs (e.g. H3K9-methyltransferase) and with the Enhancer of zeste 2 (EZH2) protein of the PRC2 complex [91, 255]. Therefore, 5mC is related to chromatin compaction and long-term silencing [146]. It has been shown that DNMT3A and 3B binding requires the presence of a nucleosome to catalyse 5mC synthesis. As active TSSs fully lack nucleosomes, 5mC is thought to be a second-step mark, occurring after inactivation of the gene and loss of the H3K4me3 mark, which locks the gene in a silenced state through DNA compaction [121]. In this context, a potential role of poised promoters would be to inactivate genes through the H3K27me3 mark while avoiding DNA methylation-mediated full silencing by keeping the H3K4me3 mark (see Figure 3.15) [146].

3.2.4.2 Gene Body

The gene body is the region where transcription of a gene may occur. It ranges from the TSS to the TTS. In active genes, as this region requires access to DNA, chromatin is in an open state and specific histone marks like H3K36me3 are present [133].

For DNA methylation, the interpretation of the 5mC pattern is much more complex. Indeed, 5mC changes in this region can be linked to different biological processes.


seems to be associated with an increase in gene expression [138]. Gene body methylation could potentially play a role in alternative splicing events [121]. Indeed, retained exons have been shown to present a higher methylation level and DNA methylation is suggested to recruit the Methyl-CpG binding protein 2 (MeCP2) which promotes exon recognition [184, 171].

• Regions up to 2 kilobases (kb) from CGIs have also been reported to contain a large proportion of tissue-specific DNA methylation alterations. The authors also showed that these regions, called shores, were altered in colon cancers. ‘Cap analysis gene expression’ (CAGE) experiments associated a large proportion of these alterations with transcript initiation. This allowed the authors to conclude that DNA methylation alterations in shores can be related to alternative promoters [115]. The presence of alternative TSSs is, indeed, a second explanation for apparent non-promoter alterations (see Figure 3.14, green boxes on bottom right). While most DNA methylation promoter studies focus on the promoter of the main transcript of a gene, some alternative transcripts initiate at other TSSs and have their own promoters. DNA methylation alterations affecting these alternative promoters may appear as intergenic or gene body alterations if only the main transcript is taken into account. As observed for shores, tissue-specific methylation seems to be much more common in these alternative promoters than in the main promoter. This could indicate an important role in cell lineage [172]. Some studies have shown alterations of the methylation level of these alternative promoters, notably in cancers.

• Similarly, it is also important to take into account the noncoding transcriptome to avoid classifying some cytidines as intergenic or gene body associated while their methylation is in fact impacting a noncoding transcript [138]. DNA methylation has been shown to be a key regulator of lncRNA expression. For example, Li et al. have recently used DNA methylation to identify epigenetically deregulated lncRNAs in breast cancers [150].
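The definition of shores used in the list above (regions up to 2 kb from a CGI) lends itself to a simple positional annotation. The sketch below is an illustration under stated assumptions: the function `cgi_context`, the inclusive `(start, end)` coordinate convention and the toy island coordinates are all hypothetical and are not taken from the cited studies.

```python
SHORE_WIDTH = 2000  # shores: up to 2 kb on either side of a CGI


def cgi_context(position, islands, shore_width=SHORE_WIDTH):
    """Return 'island', 'shore' or 'other' for a CpG position.

    islands: list of (start, end) tuples on the same chromosome,
    with both bounds inclusive (an assumed convention).
    """
    in_shore = False
    for start, end in islands:
        if start <= position <= end:
            return "island"  # being inside any island takes priority
        if start - shore_width <= position <= end + shore_width:
            in_shore = True  # keep scanning in case another island contains it
    return "shore" if in_shore else "other"


if __name__ == "__main__":
    toy_islands = [(10_000, 11_000)]  # made-up coordinates
    for pos in (10_500, 9_000, 20_000):
        print(pos, cgi_context(pos, toy_islands))
```

The same interval logic could be reused with a list of alternative TSSs or noncoding transcripts, so that a CpG is not labelled as intergenic or gene body associated when it actually falls within another annotated feature.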


Figure 3.16: From [149]. Functional roles of eRNA transcripts. Enhancer RNAs could functionally contribute to gene activation at least partially by modulating the stability of enhancer:promoter (E:P) looping via interacting with looping factors or could regulate the chromatin accessibility of its target promoter region.

3.2.4.3 Enhancers


affected genes were strongly associated with key cancer processes, revealing the essential role of enhancer DNA methylation alterations in cancers [13]. Following a similar strategy, the ELMER algorithm used the relation between methylation and expression among publicly available samples from ‘The Cancer Genome Atlas’ consortium (TCGA) to identify enhancers with DNA methylation alterations in cancers [275]. Interestingly, a specific class of CpG islands has recently been identified. In contrast to the most studied CGIs, located at promoters (pCGIs), these CGIs are associated with enhancers (eCGIs). eCGIs seem to be more frequently hypermethylated in cancers, with a silencing effect on the tumour-suppressor genes targeted by the eCGI-associated enhancers. In addition, these enhancers seem to target more transcription regulators than other enhancers, thereby potentially playing a more important role in the regulatory network [15].
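The general idea of relating enhancer methylation to the expression of candidate target genes can be illustrated with a simple correlation screen. The sketch below is not the ELMER procedure: it swaps in a plain Pearson correlation, and the function names, the -0.5 threshold and the toy probe and gene identifiers are all assumptions made for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / (norm_x * norm_y)


def anticorrelated_pairs(methylation, expression, pairs, threshold=-0.5):
    """Keep (probe, gene) pairs whose methylation and expression are
    negatively correlated across samples (r <= threshold).

    methylation: dict probe -> methylation values, one per sample
    expression:  dict gene  -> expression values, same sample order
    pairs:       candidate (probe, gene) links, e.g. nearby genes
    """
    hits = []
    for probe, gene in pairs:
        r = pearson(methylation[probe], expression[gene])
        if r <= threshold:
            hits.append((probe, gene, r))
    return hits


if __name__ == "__main__":
    # Made-up values for four samples.
    meth = {"cg_toy": [0.1, 0.2, 0.8, 0.9]}
    expr = {"GENE_A": [9.0, 8.0, 2.0, 1.0], "GENE_B": [1.0, 2.0, 8.0, 9.0]}
    print(anticorrelated_pairs(meth, expr,
                               [("cg_toy", "GENE_A"), ("cg_toy", "GENE_B")]))
```

A real analysis would also need to correct for multiple testing and to restrict candidate genes to a plausible genomic neighbourhood of each enhancer probe.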

The interaction between an enhancer and its target promoter is under the control of the so-called insulator regions. These regions contain a binding site for a cohesin/CCCTC binding factor (CTCF) complex, which allows the generation of large loops of DNA. DNA regions where enhancer-promoter interactions occur thereby become isolated from the rest of the genome within these loops. These structures within the DNA are not fixed and are under the control of DNA methylation. Indeed, CTCF binding is inhibited by 5mC. Therefore, 5mC at a CTCF site can restore enhancer-promoter interactions and thereby activate the enhancer target genes [169].

3.2.4.4 Regulatory Region Identification

Due to the major biological roles of enhancer regulation, it becomes essential to identify enhancers as well as other regulatory regions. This is usually done through the localisation of specific histone marks [6, 8] or through the binding of specific factors (such as the ‘protein 300’ [p300] for enhancers [102]). Tools like i-cisTarget can be used to identify enriched regulatory regions (e.g. enhancers) from a set of genomic positions or co-expressed genes [104]. Also, the ‘Encyclopedia of DNA Elements’ (ENCODE) project provides a publicly available classification of the whole genome of 9 cell lines into 15 chromatin states, including promoter and enhancer states. This classification is based on a hidden Markov model (HMM) using chromatin immunoprecipitation sequencing (ChIP-seq) data of CTCF together with 8 histone marks (see Figure 3.17) [79].
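To give an intuition of how an HMM can segment a genome into chromatin states, the sketch below decodes the most likely state path with the Viterbi algorithm on a toy example. It is not the ENCODE model: it assumes only two states, a single binary observation per genomic bin (mark present or absent), and probabilities that are entirely made up for illustration, whereas the published classification uses 15 states learned from CTCF and 8 histone marks.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most likely hidden-state path for a sequence of observations.

    Probabilities are combined in log space to avoid numerical underflow.
    """
    # best[s]: log-probability of the best path ending in state s
    best = {s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
            for s in states}
    back = []  # back[t][s]: best predecessor of state s at step t + 1
    for obs in observations[1:]:
        prev, best, pointers = best, {}, {}
        for s in states:
            p = max(states, key=lambda q: prev[q] + math.log(trans_p[q][s]))
            best[s] = (prev[p] + math.log(trans_p[p][s])
                       + math.log(emit_p[s][obs]))
            pointers[s] = p
        back.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Toy parameters (assumptions, not learned from data).
STATES = ["background", "enhancer"]
START = {"background": 0.8, "enhancer": 0.2}
TRANS = {"background": {"background": 0.9, "enhancer": 0.1},
         "enhancer": {"background": 0.2, "enhancer": 0.8}}
EMIT = {"background": {0: 0.9, 1: 0.1},   # mark rarely seen in background
        "enhancer": {0: 0.2, 1: 0.8}}     # mark usually seen at enhancers

if __name__ == "__main__":
    bins = [0, 0, 1, 1, 1, 0, 0]  # presence (1) or absence (0) of a mark
    print(viterbi(bins, STATES, START, TRANS, EMIT))
```

In a realistic setting the emission model would cover a vector of several marks per bin and the parameters would be estimated from the ChIP-seq data rather than fixed by hand.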
