A few words on identifiability

A major issue for this type of reconstruction is that, even ignoring the noise in the input profile, several different choices of weights and subclones may yield the same reconstructed profile when neither the weights nor the subclones are observed. From a statistical viewpoint, this can be phrased as an identifiability problem [Behr and Munk, 2015].
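To make the identifiability issue concrete, here is a minimal sketch with made-up numbers (the helper `mix` and both sets of subclones are hypothetical, chosen only for illustration). Two different pairs of subclones, mixed with the same weights, give exactly the same noiseless profile, so the profile alone cannot tell them apart.

```python
def mix(weights, subclones):
    """Weighted sum of subclone profiles, one value per locus."""
    return [sum(w * z for w, z in zip(weights, column))
            for column in zip(*subclones)]

# Hypothetical example: two subclones observed over four loci.
weights = [0.5, 0.5]
subclones_a = [[2, 2, 2, 4], [2, 4, 2, 2]]
subclones_b = [[2, 2, 2, 3], [2, 4, 2, 3]]

profile_a = mix(weights, subclones_a)
profile_b = mix(weights, subclones_b)

# Both decompositions reconstruct the same profile.
print(profile_a, profile_b)  # [2.0, 3.0, 2.0, 3.0] [2.0, 3.0, 2.0, 3.0]
```

The two decompositions differ only at the last locus, where a gain carried by a single subclone is indistinguishable from a smaller gain shared by both.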

This issue of identifiability can be overcome by adding constraints to the model parameters, and two major directions have been considered in the literature. One possibility is to constrain the latent profiles to belong to a pre-determined alphabet (e.g. {0, 1, 2, ..., 10}). Necessary and sufficient conditions have recently been given for the problem to be identifiable under this constraint [Behr and Munk, 2015], and the

CHAPTER 5. DISCOVERING HETEROGENEITY IN CANCERS

profile values to a finite alphabet may not always be realistic in practice. Indeed, copy number signals measured from microarray or sequencing technologies are not directly proportional to true copy numbers due to non-linearities induced by the biological assays, including saturation effects [Skvortsov et al., 2007].

More recent methods attempt to infer intra-tumor heterogeneity from copy number data obtained from HTS or SNP arrays by integrating mutation information [Ha et al., 2014, Oesper et al., 2013, Roth et al., 2014, Li and Li, 2014]. [Jiang et al., 2016] pointed out that, with the exception of PyClone, these methods do not take several samples into account at the same time. However, PyClone does not allow copy number alterations to be subclonal.

Dealing with several profiles is one way to overcome the above-mentioned identifiability issue. Such models assume that several profiles are observed and share the same set of latent profiles. From our perspective, this assumption is both justified by the underlying biology and weak enough to be consistent with the observed copy number signals, which is why we work under it. We assume that several DNA copy number profiles are observed and that each of them is a mixture of the same subclones (possibly with weight 0 for some of the subclones). By handling several samples, this model can address both intra-tumor and inter-tumor heterogeneity.
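The shared-subclone assumption can be written compactly as follows. The notation is ours and is only an assumption for illustration: y_i is the i-th observed profile, z_k the k-th latent profile, w_{ik} the weight of subclone k in sample i, and \varepsilon_i a noise term. The requirement that the weights sum to one anticipates the convex combination used in Chapter 6.

```latex
% Assumed notation: n observed profiles, K latent profiles (subclones).
y_i = \sum_{k=1}^{K} w_{ik} \, z_k + \varepsilon_i ,
\qquad i = 1, \dots, n ,
\qquad w_{ik} \ge 0 , \quad \sum_{k=1}^{K} w_{ik} = 1 .
% Matrix form: Y = W Z + E, where the rows of Z are the latent profiles.
```

A weight w_{ik} = 0 simply means that subclone k is absent from sample i.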

5.7 Conclusion

This chapter introduced the classical matrix factorization models, which is the direction that we have chosen to analyze heterogeneity. We have seen that it is possible to add several constraints to these models, which makes it possible to encode biological priors. For example, a sample can be seen as a mixture of several cell types, with weights representing the proportion of each type (archetypal analysis constraint). In addition, we would like the latent profiles to contain only a few alterations; in particular, a fused LASSO constraint could be of interest (sparse dictionary learning constraint).
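One way to combine these two constraints in a single criterion is sketched below. This is an illustrative formulation under assumed notation (Y is the n x J matrix of observed profiles, W the n x K weight matrix, Z the K x J matrix of latent profiles, \lambda a tuning parameter), not necessarily the exact criterion used in the following chapters. Only the fusion (total variation) term of the fused LASSO is shown.

```latex
\min_{W, Z} \;\; \| Y - W Z \|_F^2
  \; + \; \lambda \sum_{k=1}^{K} \sum_{j=1}^{J-1} \left| Z_{k,j+1} - Z_{k,j} \right|
\quad \text{subject to} \quad
W_{ik} \ge 0 , \;\; \sum_{k=1}^{K} W_{ik} = 1 \;\; \text{for all } i .
```

The simplex constraint on each row of W makes the weights interpretable as proportions, while the penalty favors piecewise-constant latent profiles with few change points.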

Therefore, our contribution in this part is to discover heterogeneity from DNA copy number data by extending existing models. We add constraints that make the model more biologically realistic and integrate the BAF signal (Chapter 6).

The goal is to discover characteristics of the resistant subclones in DNA copy number data. All the information from SNP array data, i.e. the B allele fraction and the total copy number at each SNP, is integrated through the PSCN (parent-specific copy number) information (Section 1.4.3). We are currently implementing the method as an R package named InCaSCN (Inferring cancer subclones using DNA copy number) to discover heterogeneity. We have applied this method to two real data sets that contain several samples at various time points and locations for the same patient (Chapter 7). A paper with J. Chiquet and P. Neuvial is currently in preparation.
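As a reminder of how the B allele fraction and the total copy number combine into parent-specific copy numbers, here is a minimal sketch. It uses the usual decomposition through the decrease in heterozygosity, which is meaningful only at SNPs that are heterozygous in the germline. The function name is hypothetical and the code is not taken from the InCaSCN package.

```python
def parent_specific_copy_numbers(tcn, baf):
    """Split a total copy number into (minor, major) copy numbers.

    Assumes the SNP is heterozygous in the germline, so that the
    B allele fraction (baf) reflects the allelic imbalance.
    """
    dh = 2.0 * abs(baf - 0.5)        # decrease in heterozygosity, in [0, 1]
    minor = tcn * (1.0 - dh) / 2.0
    major = tcn * (1.0 + dh) / 2.0
    return minor, major

normal = parent_specific_copy_numbers(2.0, 0.5)       # balanced, two copies
gain = parent_specific_copy_numbers(3.0, 2.0 / 3.0)   # single-copy gain
loh = parent_specific_copy_numbers(2.0, 1.0)          # copy-neutral LOH
print(normal, gain, loh)
```

By construction the minor and major copy numbers always add up to the total copy number, so no information is lost by working with the pair.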

Chapter 6

Inferring cancer subclones using parental DNA copy numbers

6.1 Introduction

The objective of this chapter is to present our approach to studying cancer heterogeneity using copy number alterations (CNA). The model is therefore applicable to data from array-based comparative genomic hybridization (aCGH), single nucleotide polymorphism (SNP) microarrays, whole exome sequencing (WES), or whole genome sequencing (WGS). We address the problem of reconstructing the underlying subclones and the corresponding weights from a series of DNA copy number profiles measured by these technologies.

The model presented here is applicable to samples from the same patient taken at various time or spatial points [Schwarz et al., 2015], but also to samples from a homogeneous group of several patients. This assumption has already been made in the literature [Nowak et al., 2011, Masecchia et al., 2013]. Our model may be seen as an extension of these approaches, with the following original contributions:

1. leveraging the allelic signals available from SNP array or sequencing data in order to explicitly integrate parent-specific copy numbers [Olshen et al., 2011] in the model;

2. making the mixing weights interpretable as such by modeling each profile as a convex combination of latent profiles;

3. modeling tumor clonality at the level of copy number segments (not individual probes), which is the level of information at which such events occur.

Our model has similarities with the model recently proposed in [Jiang et al., 2016], except that it enables us to analyze several samples in order to explore both intra-tumor and inter-tumor heterogeneity. In contrast, the model of [Jiang et al., 2016] is restricted to intra-tumor heterogeneity, because its primary aim is to build phylogenetic trees in order to understand the tumor history of a single patient.

The model that we propose in this chapter is inspired by dictionary learning methods (see Section 5.2), like the models of [Nowak et al., 2011, Masecchia et al., 2013]. The convex combination of latent profiles used to explain each sample is inspired by archetypal analysis, introduced in Section 5.5.

We formulate the problem of estimating the parameters of the model as an optimization problem and propose an iterative algorithm to estimate these parameters. We assess the performance of this approach using realistic simulations based on real DNA copy number data [Pierre-Jean et al., 2015]. We also apply the model to two different kinds of real data sets (Chapter 7).
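To give a feel for what such an iterative scheme can look like, here is a minimal sketch of a generic alternating projected-gradient procedure for the unpenalized least-squares criterion with weights constrained to the simplex. It is an assumption-laden illustration, not the algorithm proposed in this chapter: all function names are hypothetical, no fused LASSO penalty is applied, and the toy data are made up.

```python
import random

def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def residual(y, w, z):
    """Entry-wise difference Y - W Z."""
    fit = matmul(w, z)
    return [[yi - fi for yi, fi in zip(yr, fr)] for yr, fr in zip(y, fit)]

def sq_norm(m):
    """Squared Frobenius norm."""
    return sum(v * v for row in m for v in row)

def project_simplex(v):
    """Euclidean projection of v onto {x : x >= 0, sum(x) = 1}."""
    cumulative, theta = 0.0, 0.0
    for i, ui in enumerate(sorted(v, reverse=True), start=1):
        cumulative += ui
        t = (cumulative - 1.0) / i
        if ui - t > 0:
            theta = t
    return [max(x - theta, 0.0) for x in v]

def fit_mixture(y, n_subclones, n_iter=200, seed=0):
    """Alternate gradient steps on the weights W and latent profiles Z."""
    rng = random.Random(seed)
    # Start from uniform weights and latent profiles drawn among the samples.
    w = [[1.0 / n_subclones] * n_subclones for _ in y]
    z = [list(row) for row in rng.sample(y, n_subclones)]
    history = [sq_norm(residual(y, w, z))]
    for _ in range(n_iter):
        # Weights: gradient step (step size 1 / (2 ||Z||_F^2)), then
        # projection of each row onto the simplex.
        direction = matmul(residual(y, w, z), transpose(z))
        scale = 1.0 / sq_norm(z)
        w = [project_simplex([wi + scale * di for wi, di in zip(wr, dr)])
             for wr, dr in zip(w, direction)]
        # Latent profiles: plain gradient step (no penalty in this sketch).
        direction = matmul(transpose(w), residual(y, w, z))
        scale = 1.0 / sq_norm(w)
        z = [[zi + scale * di for zi, di in zip(zr, dr)]
             for zr, dr in zip(z, direction)]
        history.append(sq_norm(residual(y, w, z)))
    return w, z, history

# Made-up noiseless data: four samples mixing two subclones over six segments.
true_z = [[2, 2, 3, 3, 2, 2], [2, 1, 1, 2, 2, 4]]
true_w = [[1.0, 0.0], [0.7, 0.3], [0.4, 0.6], [0.1, 0.9]]
y = matmul(true_w, true_z)
w_hat, z_hat, history = fit_mixture(y, n_subclones=2)
print(history[0], history[-1])
```

The step sizes are conservative bounds on the inverse Lipschitz constants, so the reconstruction error cannot increase from one iteration to the next. In line with the identifiability discussion of Chapter 5, a small error does not by itself guarantee that the true weights and subclones have been recovered.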

From the document: Développement de méthodes statistiques pour l'analyse du nombre de copies d'ADN en cancérologie (pages 104-108)