
The two technologies (microarrays and sequencing) described in Section 1.3 require bioinformatic and statistical methods at several levels. Several artifacts of microarrays and HTS may distort the estimation of DNA copy numbers, and these experimental artifacts lead to systematic biases that must be corrected. For CGH arrays, the most common is the spatial artifact: boundary effects, regional shifts and systematic variations are often observed [Reimers and Weinstein, 2005]. For HTS, the most common bias is due to GC content, which influences read alignment on the reference sequence [Benjamini and Speed, 2012]. These issues are often addressed with bioinformatic methods, but also with statistical models.

Microarray and sequencing data can be used to explore several fields. For instance, [Beerenwinkel et al., 2014] recently reviewed mathematical models of cancer evolution. Indeed, cancer can be seen as an evolutionary process with specific features: tumors display, for instance, abnormal chromosome copy numbers, an elevated mutation rate, and several chromosomal rearrangements. Therefore, several phylogenetic methods that take into account the particular features of tumors have been developed to study the clonal evolution of cancers [Chowdhury et al., 2013, Greenman et al., 2012]. Recently, an evolutionary study of ovarian cancers showed a correlation between genetic heterogeneity, patient survival, and drug resistance [Schwarz et al., 2014].

In this thesis, we focus on the detection of the copy number alterations that can be observed in tumor cells. Improving the detection of these alterations can lead to a better understanding of tumor evolution by integrating the discoveries in the

CHAPTER 1. GENERAL INTRODUCTION

2014, Jiang et al., 2016].

To discover genomic biomarkers, it is necessary to develop statistical methods able to deal with the features of microarrays and NGS described in Section 1.4.4. This thesis is structured in three parts: the first is about segmentation models, the second is about a model of tumor heterogeneity, and the third is about the bioinformatic considerations required to deal with real data.

In the first part, we start by introducing the statistical models that are usually used to segment this kind of data, namely the c, b and d signals, or even the c1 and c2 signals (Chapter 2).

The aim of segmentation methods is to recover the genomic locations (breakpoints) such that the copy number state is not the same before and after these points. Microarrays and NGS produce a large quantity of information at the scale of the kilobase, or even of the single base. Therefore, these methods must be statistically efficient, to discover relevant biomarkers correctly, but also computationally efficient, in terms of both time and space complexity. After extending the univariate segmentation methods so that they apply simultaneously to the c and d signals, we present in Chapter 3 a new strategy to evaluate the gain of using both the c and d signals to recover alterations in DNA copy number signals. The last chapter of this part presents a new method that does not require transforming b signals into d signals (Chapter 4).

Then, in the second part, we focus on the discovery of tumoral heterogeneity by dealing with several samples simultaneously. After a chapter that briefly introduces the models used to study tumoral heterogeneity (Chapter 5), we present a new model in Chapter 6. This model can be applied to discover intra- or inter-tumoral heterogeneity from microarray or HTS data. We present applications to two different kinds of data sets (Chapter 7). The first is a public microarray data set, from which we attempt to infer intra-tumoral heterogeneity using several samples of the same patient. The second comes from a collaboration with Institut Curie: we analyze heterogeneity across several patients suffering from a particular breast cancer, using WES data.

To finish, the last part summarizes some bioinformatic contributions that solve problems encountered when analyzing real data. Indeed, applying the heterogeneity model to real data raised several issues at the normalization level. The two chapters of this part therefore focus on data normalization. Chapter 8 deals with the estimation of the DoH signal from microarray data in the absence of a normal reference, and is linked to the first application. Chapter 9 focuses on the normalization of WES data to obtain TCN, BAF and DoH signals, as from microarrays.

Part I

Joint segmentation methods

Table of Contents

2 DNA copy number segmentation
    2.1 Typology of copy number segmentation methods
    2.2 Univariate models
    2.3 Two dimensional methods
    2.4 Model selection
    2.5 Conclusion

3 Performance evaluation of DNA copy number segmentation methods
    3.1 Background
    3.2 Generating data with known "truth"
    3.3 Evaluation pipeline
    3.4 Results
    3.5 Summary and discussion

4 Non-parametric segmentation method using kernels
    4.1 Introduction
    4.2 Model in RKHS
    4.3 Algorithms
    4.4 Combination of kernels
    4.5 Model selection
    4.6 Results on the realistic simulated framework
    4.7 Conclusion

Chapter 2

DNA copy number segmentation

This chapter introduces the segmentation models used to detect alterations in DNA copy number signals. After a brief review of the main methods to segment DNA copy number signals, we present the univariate models in a second section. In a third section, we describe the approach that we have considered to jointly segment the TCN and the DoH. Finally, we present the standard approach to select the best model in the case of segmentation models.

2.1 Typology of copy number segmentation methods

This section reviews the segmentation models previously described by [Neuvial et al., 2011] and [Zhang, 2010]. Over the last twenty years, many different methods have been proposed for the analysis of DNA copy number profiles. Most of them can be classified into four categories: methods based on Hidden Markov Models (HMM), multiple change-point methods, recursive segmentation methods and fused lasso-based methods.

1. HMM-based approaches rely on the idea that the recovered DNA copy number should be discrete and that these different levels can be modeled using a small number of HMM states. A typical example of such an HMM is the work of [Fridlyand, 2004]. For the specific case of SNP array analysis in cancer samples, several dedicated HMMs have been proposed [Sun et al., 2009, Greenman et al., 2010, Chen et al., 2011] (Section 2.2.1).

2. Multiple change-point methods assume that the observed signal is affected by abrupt changes and that between these breaks the signal should be homogeneous [Picard et al., 2005] (Section 2.2.3).

3. Recursive segmentation approaches rely on the intuitive idea that a segmentation can be recovered by recursively cutting the signal into two or more pieces. A typical example of such a recursive approach is the work of [Olshen et al., 2004] (Section 2.2.4).

4. Methods based on a fused lasso penalty rely on the idea that, in most cases, two successive measurements should have the same estimate. This is encoded by an L1 penalty on successive differences. The recovered signal is guaranteed to be piecewise constant. A typical example of such a fused model is the work of [Tibshirani et al., 2005]. This class of methods can be viewed as solving a convex relaxation of the multiple change-point problem (Section 2.2.5).
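To make the link between items 2 and 4 concrete, the two optimization problems can be written side by side. This is only a sketch in an assumed notation that is not taken from this chapter: y_1, ..., y_n denotes the observed signal, mu_1, ..., mu_n the piecewise-constant estimate, K a maximal number of breakpoints and lambda a regularization parameter. The fused lasso as originally published may include additional penalty terms; only the penalty on successive differences is shown here.

```latex
% Multiple change-point problem with at most K breakpoints (non-convex):
\min_{\mu \in \mathbb{R}^n} \; \sum_{i=1}^{n} (y_i - \mu_i)^2
\quad \text{subject to} \quad
\sum_{i=1}^{n-1} \mathbf{1}\{\mu_{i+1} \neq \mu_i\} \le K

% Convex relaxation: the number of jumps is replaced by their L1 norm,
% written here in penalized form with parameter \lambda \ge 0:
\min_{\mu \in \mathbb{R}^n} \; \frac{1}{2} \sum_{i=1}^{n} (y_i - \mu_i)^2
\; + \; \lambda \sum_{i=1}^{n-1} \lvert \mu_{i+1} - \mu_i \rvert
```

The first problem counts the jumps of mu, which makes it combinatorial. The second replaces this count by the sum of the absolute jump sizes, which is convex in mu.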

The above classification is by no means exhaustive (see for example [Hupé et al., 2004, Ben-Yaacov and Eldar, 2008]), but summarizes the most common approaches linked to the work of this thesis. In the next section, we present the main classical models.
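As an illustration of the recursive idea in item 3, the following minimal sketch cuts a signal in two at the point that most reduces the residual sum of squares, then recurses on each piece. It is a simplified least-squares variant written for illustration, not the method of [Olshen et al., 2004]; the function names and the stopping threshold `min_gain` are hypothetical.

```python
import numpy as np


def best_split(y):
    """Return (gain, k) for the single cut y[:k] | y[k:] that most reduces
    the residual sum of squares of a piecewise-constant fit.
    k is None when no cut gives a strictly positive gain."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    if n < 2:
        return 0.0, None
    total = np.sum((y - y.mean()) ** 2)
    csum = np.cumsum(y)          # running sums of y
    csum2 = np.cumsum(y ** 2)    # running sums of y squared
    best_gain, best_k = 0.0, None
    for k in range(1, n):
        # Residual sum of squares of each side around its own mean.
        left = csum2[k - 1] - csum[k - 1] ** 2 / k
        right = (csum2[-1] - csum2[k - 1]) - (csum[-1] - csum[k - 1]) ** 2 / (n - k)
        gain = total - (left + right)
        if gain > best_gain:
            best_gain, best_k = gain, k
    return best_gain, best_k


def binary_segmentation(y, min_gain, offset=0):
    """Recursively cut y in two; stop when the best cut gains less than
    min_gain (an illustrative, hypothetical stopping rule).
    Returns the sorted list of breakpoint positions in the full signal."""
    y = np.asarray(y, dtype=float)
    gain, k = best_split(y)
    if k is None or gain < min_gain:
        return []
    return (binary_segmentation(y[:k], min_gain, offset)
            + [offset + k]
            + binary_segmentation(y[k:], min_gain, offset + k))


# Usage on a simulated profile: a gained region between positions 20 and 35.
rng = np.random.default_rng(0)
profile = np.concatenate([np.zeros(20), 2 * np.ones(15), np.zeros(25)])
noisy = profile + rng.normal(scale=0.2, size=profile.size)
breakpoints = binary_segmentation(noisy, min_gain=5.0)
```

In practice the stopping rule is the delicate part: a fixed threshold such as `min_gain` depends on the noise level, which is why model selection needs its own treatment.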