
We have developed a framework to assess the performance of DNA copy number segmentation methods. A critical aspect of this framework is that it generates realistic copy-number profiles by resampling real SNP array data. This allows us to study a large number of scenarios without relying on a particular statistical model. In our opinion, the framework is simple to use, as it depends on few parameters, all of which have a straightforward biological interpretation. An R package is available, and we believe that our proposed data generation scheme can readily be applied to other data sets and technologies. The set of segmentation methods compared can also be extended, as explained in the package documentation. In this chapter, we illustrated the use of this framework on two SNP array data sets from Affymetrix and Illumina.

We were able to identify which technological and biological parameters drive the performance of segmentation methods. First, the percentage of tumor cells in the sample plays a critical role: below 70%, it is probably hopeless to recover the whole set of breakpoints with high accuracy. We emphasize the relevance of the considered range of cellularity for applications: we expect tumor cell lines to be well represented by the 100% setting, while 50% is not unusual in clinical practice. Second, different microarray technologies may lead to different performance. Specifically, the ratio of the number of informative allelic probes (heterozygous SNPs) to the total number of probes is a crucial aspect, particularly at high levels of normal contamination. Finally, not all methods achieve similar performance across the scenarios that we have considered. Interestingly, we show that methods that take advantage of both signal dimensions are generally, but not always, better than those using only one of them. This variability between segmentation methods may be attributed to some extent to the biological and technological contexts, in the sense that some methods might be better adapted to certain scenarios.

Our framework provides a way to critically evaluate the performance of segmentation methods, and therefore to rationally select one or several of them for a particular data set. Such a quantitative assessment is also useful for interpretation. For example, we showed that even in favorable scenarios, performance is not perfect. Furthermore, perhaps unexpectedly, we showed that copy number transitions involving the gain or loss of a single DNA copy are not all equally easy to recover, meaning that the proportion of different types of copy number transitions recovered by a particular segmentation method may not be directly interpretable.

Chapter 4

Non-parametric segmentation method using kernels

4.1 Introduction

This chapter is a collaborative work with Alain Célisse, Guillemette Marot, and Guillem Rigaill. We recently submitted this work to the Computational Statistics and Data Analysis (CSDA) journal. In this chapter, we use kernel tricks to develop a new non-parametric segmentation procedure. This method frees us from the DoH transformation described in section 1.4.2. We bring several contributions to the computational aspects and the statistical performance of the kernel change-point procedure introduced by [Arlot et al., 2012], in order to segment TCN and BAF, first separately and then jointly.

The model presented in this chapter is similar to the model described in Chapter 2. However, instead of considering change-points in the mean of the signal, we consider change-points in the whole distribution of the signal. This assumption is more realistic for BAF signals, for which changes clearly do not occur only in the mean: the distribution is multimodal within each segment, and the number of modes, their locations, and the variance vary from one segment to another. For instance, in Fig. 4.1, three modes are easy to observe in the normal region (1,1), corresponding to the three SNP genotypes (AA, AB, and BB). In the gain region (1,2), where four genotypes are expected (AAA, AAB, ABB, and BBB), the modes are more difficult to distinguish, but the variance appears to have increased. In the third region (cn-LOH), denoted (0,2) in terms of parental copy numbers, both the number and the location of the modes have changed compared to the gain region.

Figure 4.1 – BAF signal with three regions: (1,1), (1,2), and (0,2).

All these observations motivated us to propose a non-parametric method to detect these different types of changes simultaneously.

This chapter describes a new algorithm to simultaneously perform the dynamic programming step of [Harchaoui and Cappé, 2007] and compute the required elements of the cost matrix on the fly. As a consequence, this algorithm has a complexity of order $O(SJ^2)$ in time and $O(SJ)$ in space (including both the dynamic programming and the cost matrix computation). This improved space complexity comes without an increased time complexity, which is a key point for genome analysis. However, for larger data sets the time complexity is still high, and we therefore develop a new algorithm based on a low-rank approximation to the Gram matrix. This computational improvement comes at the price of an approximation, which yields (almost) the best segmentations from 1 to $S$ segments with a complexity of order $O(Sp^2J)$ in time and $O((S+p)J)$ in space, where $p$ is the rank of the approximation. Finally, we adapt the model selection of [Lebarbier, 2005] described in section 2.4 to our case and illustrate its good empirical statistical performance. Before describing the kernel segmentation method, we need to introduce some essential notions on kernels [Shawe-Taylor and Cristianini, 2004, Schölkopf et al., 2004].
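To fix ideas, here is a minimal R sketch of kernel change-point detection by dynamic programming. It is deliberately naive and is not the algorithm developed in this chapter: it stores the full $J \times J$ Gram matrix and recomputes each segment cost from the corresponding Gram block, so it enjoys none of the time or space savings discussed above. All function and parameter names (gaussKernelMatrix, segmentCost, kernelSegDP, delta) are illustrative.

## Naive kernel segmentation by dynamic programming (illustration only).
## Stores the full J x J Gram matrix (O(J^2) space) and recomputes each
## segment cost from its Gram block, unlike the on-the-fly algorithm of
## this chapter.
gaussKernelMatrix <- function(x, delta) {
  exp(-outer(x, x, function(a, b) (a - b)^2) / delta)
}

## RKHS dispersion of segment s..e: sum of k(X_i, X_i) over the segment
## minus the squared norm of the segment's empirical mean element.
segmentCost <- function(K, s, e) {
  idx <- s:e
  sum(diag(K)[idx]) - sum(K[idx, idx]) / length(idx)
}

kernelSegDP <- function(x, S, delta = 0.1) {
  J <- length(x)
  K <- gaussKernelMatrix(x, delta)
  L  <- matrix(Inf, nrow = S, ncol = J)          ## L[s, t]: best cost of x[1..t] in s segments
  bt <- matrix(NA_integer_, nrow = S, ncol = J)  ## backtracking pointers
  for (t in 1:J) L[1, t] <- segmentCost(K, 1, t)
  if (S >= 2) {
    for (s in 2:S) {
      for (t in s:J) {
        cand <- sapply((s - 1):(t - 1),
                       function(tp) L[s - 1, tp] + segmentCost(K, tp + 1, t))
        best <- which.min(cand)
        L[s, t]  <- cand[best]
        bt[s, t] <- (s - 1) + best - 1           ## position of the last change-point
      }
    }
  }
  cp <- integer(0)                               ## recover change-point positions
  t <- J
  if (S >= 2) for (s in S:2) { t <- bt[s, t]; cp <- c(t, cp) }
  list(changePoints = cp, cost = L[S, J])
}

For example, kernelSegDP(c(rnorm(100, 0, 1), rnorm(100, 0, 3)), S = 2, delta = 1) should typically place its single change-point near position 100, even though the two halves of the signal have the same mean.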

Let $(X_1, \ldots, X_J) \in \mathcal{X}^J$ be a signal of length $J$ within which there are some changes in distribution. This signal can represent either the total DNA copy number or the B allele fraction. Then, we consider a positive definite kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. We denote by $\mathcal{H}$ its associated reproducing kernel Hilbert space (RKHS), and by $\Phi : \mathcal{X} \to \mathcal{H}$ the canonical feature map defined by $\Phi(x) = k(x, \cdot)$ (a function of $\mathcal{H}$). There is a strong link between the canonical feature map and the kernel function, since $\forall x, y \in \mathcal{X}$, $\langle \Phi(x), \Phi(y) \rangle_{\mathcal{H}} = k(x, y)$.
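To make this reproducing property concrete, here is a toy numerical check in R for a kernel whose feature map happens to be explicit, namely the homogeneous polynomial kernel of degree 2 on $\mathbb{R}^2$ (for the Gaussian kernel used later, $\Phi$ maps into an infinite-dimensional space, so no such finite check is possible). The function names are illustrative.

## Check <Phi(x), Phi(y)>_H = k(x, y) for k(x, y) = <x, y>^2 on R^2,
## whose explicit feature map is Phi(x) = (x1^2, x2^2, sqrt(2) * x1 * x2).
kPoly2 <- function(x, y) sum(x * y)^2
phi    <- function(x) c(x[1]^2, x[2]^2, sqrt(2) * x[1] * x[2])

x <- c(0.3, -1.2)
y <- c(2.0, 0.7)
kPoly2(x, y)          ## 0.0576
sum(phi(x) * phi(y))  ## 0.0576 as well: the two quantities coincide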


Figure 4.2 – Mapping the initial data to the Hilbert space $\mathcal{H}$: the reproducing kernel $k(\cdot, \cdot) : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, its RKHS $\mathcal{H}$, the canonical feature map $\Phi(x) = k(x, \cdot)$, and the inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ (slide on kernel change-point detection by Alain Célisse).

For each observation $X_j$, we consider its image $Y_j = \Phi(X_j) \in \mathcal{H}$ and the mean element $\mu^\star_j \in \mathcal{H}$ of the distribution of $X_j$, defined by

\[ \forall f \in \mathcal{H}, \qquad \langle \mu^\star_j, f \rangle_{\mathcal{H}} = \mathbb{E}_{Y_j}\big[\langle Y_j, f \rangle_{\mathcal{H}}\big] = \mathbb{E}_{X_j}\big[\langle \Phi(X_j), f \rangle_{\mathcal{H}}\big]. \]

There is a strong connection between the mean element $\mu^\star_j$ and the distribution of $X_j$, denoted $P_{X_j}$. Indeed, for particular kernels (namely, characteristic kernels), a change in the distribution of $X_j$ implies a change in the mean element $\mu^\star_j$:

\[ P_{X_j} \neq P_{X_{j'}} \;\Longrightarrow\; \mu^\star_j \neq \mu^\star_{j'}. \]
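As an illustration (not part of the segmentation procedure itself), this property can be checked empirically: the squared RKHS distance $\|\mu_1 - \mu_2\|^2_{\mathcal{H}}$ between the mean elements of two samples can be estimated from blocks of the Gram matrix. In the R sketch below, the two toy BAF-like samples have the same mean but different numbers of modes; the helper names, bandwidth, and simulated values are all illustrative.

## Estimated squared RKHS distance between the mean elements of two samples,
## computed from Gram matrix blocks, for a Gaussian kernel.
gaussK <- function(a, b, delta = 0.05) {
  exp(-outer(a, b, function(u, v) (u - v)^2) / delta)
}
meanElementDist2 <- function(x1, x2, delta = 0.05) {
  mean(gaussK(x1, x1, delta)) + mean(gaussK(x2, x2, delta)) -
    2 * mean(gaussK(x1, x2, delta))
}

set.seed(1)
## 3 modes (AA, AB, BB), as in a normal (1,1) region
bafNormal <- sample(c(0, 0.5, 1), 300, replace = TRUE) + rnorm(300, 0, 0.02)
## 2 modes (AA, BB), as in a cn-LOH (0,2) region
bafLOH    <- sample(c(0, 1), 300, replace = TRUE) + rnorm(300, 0, 0.02)

meanElementDist2(bafNormal, bafLOH)     ## clearly positive: the distributions differ
meanElementDist2(bafNormal, bafNormal)  ## 0: identical samples

Note that the two toy samples have approximately the same mean (close to 0.5), so a criterion based only on the mean of the raw BAF signal would not separate them, whereas their mean elements in $\mathcal{H}$ do.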

Kernel examples: several usual kernels for $\mathcal{X} = \mathbb{R}$ are defined below; the Gaussian and exponential kernels are characteristic, while the polynomial kernel is not.

• Gaussian kernel: $k(x, y) = \exp\left\{ -\frac{\|x - y\|^2}{\delta} \right\}$

• Exponential kernel: $k(x, y) = \exp\left\{ -\frac{|x - y|}{\delta} \right\}$

• Polynomial kernel: $k(x, y) = (\delta_0 + \delta_1 \langle x, y \rangle)^{\delta_2}$ with $\delta = (\delta_0, \delta_1, \delta_2)$
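Written as R functions on $\mathcal{X} = \mathbb{R}$ (where $\langle x, y \rangle = xy$), with illustrative default values for the $\delta$ parameters, these kernels read:

## The three kernels above on X = R; the delta values are illustrative defaults.
kGauss <- function(x, y, delta = 1) exp(-(x - y)^2 / delta)
kExp   <- function(x, y, delta = 1) exp(-abs(x - y) / delta)
kPoly  <- function(x, y, delta = c(1, 1, 2)) (delta[1] + delta[2] * x * y)^delta[3]

kGauss(0.2, 0.8)  ## exp(-0.36), about 0.70
kPoly(0.2, 0.8)   ## (1 + 0.16)^2 = 1.3456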

We denote by $K = \{K_{i,j}\}_{1 \leq i,j \leq J}$ the associated Gram matrix, where $K_{i,j} = k(X_i, X_j)$. The model and algorithms presented in the following can be applied with any type of kernel. However, to discover breakpoints in DNA copy number signals, we focus on characteristic kernels, and more particularly on the Gaussian one. After describing the model and the algorithms (sections 4.2 and 4.3), a short section describes how to combine two kernels (section 4.4). Then, we present the model selection used in this case (section 4.5). Finally, we present the overall performance of the procedure at the end of the chapter (section 4.6).
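To fix ideas, here is a minimal R sketch of building such a Gram matrix for a Gaussian kernel, together with one standard way of obtaining a joint kernel for the two dimensions, namely summing a TCN kernel and a BAF kernel (a sum of positive definite kernels is positive definite). This is only meant to show the mechanics; the combination actually used is the one described in section 4.4, and all names, bandwidths, and toy signals below are illustrative.

## Gram matrix of a Gaussian kernel, and a joint Gram matrix obtained by
## summing the TCN and BAF Gram matrices (one possible combination).
gaussGram <- function(x, delta = 0.1) {
  exp(-outer(x, x, function(a, b) (a - b)^2) / delta)
}

set.seed(2)
## toy two-segment profile: a copy-number gain after position 200
tcn <- c(rnorm(200, 2, 0.2), rnorm(200, 3, 0.2))
baf <- c(sample(c(0, 0.5, 1), 200, replace = TRUE) + rnorm(200, 0, 0.02),
         sample(c(0, 1/3, 2/3, 1), 200, replace = TRUE) + rnorm(200, 0, 0.02))

Ktcn   <- gaussGram(tcn)
Kbaf   <- gaussGram(baf)
Kjoint <- Ktcn + Kbaf   ## J x J joint Gram matrix, here with J = 400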