DNA copy number data in cancerology

In this thesis, we focus on genomic alterations in tumor cells at the level of the DNA copy number. This section describes the notation and variables used throughout the manuscript.

1.4.1 Total copy number and B allele fraction

As explained in section 1.2, in tumor cells, parts of a chromosome of various sizes (from kilobases to a chromosome arm) may be deleted, or copied several times. As a result, DNA copy numbers in tumor cells are piecewise constant along the genome.

For illustration, Figure 1.9 displays an example of copy number signals that may be obtained from SNP-array data. Red vertical lines represent change points. In this particular example, the first region [0-2200] is normal, the second one [2200-6100] is a region where one of the parental chromosomes has been duplicated, and the third one [6100-10000] is a region of uniparental disomy, also called copy-neutral loss of heterozygosity (cn-LOH), that is, a region where one of the parental chromosomes has been duplicated and the other one deleted. The top panel represents estimates of the total copy number (TCN), denoted by c. The bottom panel represents estimates of the B allele fraction (BAF), denoted by b. We refer to section 1.3.1 for an explanation of how these estimates may be obtained and to [Neuvial et al., 2011] for the normalization of these quantities.

In the normal region [0-2200], the total copy number is centered around two copies and allelic ratios have three modes centered at 0, 1/2 and 1. These modes correspond to homozygous SNPs AA (b = 0) and BB (b = 1), and heterozygous SNPs AB (b = 1/2). We note that in the second region, where the tumor has three copies of DNA, the average observed signal is substantially below the true copy number. This is due to the presence of normal cells in the “tumor sample”, a phenomenon known as normal contamination, which shrinks the observed signals toward two copies of DNA. We refer to [Neuvial et al., 2011] for a more detailed explanation of this phenomenon and of other sources of non-calibration in DNA copy number signals, such as the ploidy of the tumor.

One important observation is that change points occur at the same position in both dimensions. This is explained by the fact that a change in only one of the parental copy numbers is reflected in both c and b. Therefore, it makes sense to analyze both dimensions of the signal jointly in order to identify change points.

In the following, we denote by J the number of loci, and by c_j and b_j the total copy number and the B allele fraction, respectively, at locus j, for all j = 1, ..., J.

CHAPTER 1. GENERAL INTRODUCTION

[Figure 1.9 plot: top panel, total copy number c (y-axis from 1.0 to 4.0); bottom panel, allelic ratio b (y-axis from 0.0 to 0.8); x-axis, locus index from 0 to 10000; regions labeled (1,1), (1,2) and (0,2).]

Figure 1.9 – Example SNP array data. Total copy numbers (c) and allelic ratios (b) along 10,000 genomic loci. Red vertical lines represent change points, and red horizontal lines represent mean signal levels between two change points. SNPs that are heterozygous in the germline are colored in black; all of the other loci are colored in gray.

ically, one parental copy has been lost when the other parental copy has been gained.

Parental copy number estimation is detailed in section 1.4.3.

1.4.2 DoH transformation

In order to facilitate the separation between the different altered regions (mathematically, a segmentation problem), allelic ratios (b) are generally transformed into unimodal signals, as originally proposed in [Staaf et al., 2008]. This transformation is motivated by the fact that allelic ratios can be symmetrized (“folded”) and that SNPs that are homozygous in the germline (plotted in gray in Figure 1.9) can be discarded, as they carry very little information about copy-number changes. Following [Bengtsson et al., 2010], we define the decrease of heterozygosity (DoH) as:

\[
d_j = 2 \left| b_j - \frac{1}{2} \right| \tag{1.3}
\]

only for SNPs that are heterozygous in the germline, which is essentially a rescaled version of the “mirrored/folded BAF” defined by [Staaf et al., 2008]. After this transformation, DNA copy numbers can be considered as a bivariate, piecewise-constant signal, as illustrated by Figure 1.10.

[Figure 1.10 plot: top panel, total copy number c (y-axis from 1.0 to 4.0); bottom panel, decrease of heterozygosity d (y-axis from 0.0 to 0.8); x-axis, locus index from 0 to 10000; regions labeled (1,1), (1,2) and (0,2).]

Figure 1.10 – Example SNP array data along 10,000 genomic loci, after transformation of allelic ratios (b) into decrease of heterozygosity (d), following [Bengtsson et al., 2010, Staaf et al., 2008]. Red vertical lines represent change points, and red horizontal lines represent mean signal levels between two change points. SNPs that are heterozygous in the germline are colored in black; all of the other loci are colored in gray.
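As an illustration of Equation (1.3), the following minimal Python sketch folds allelic ratios into DoH values. The function name, the list-based inputs and the `is_het` germline-heterozygosity flags are assumptions made for this example; in practice the flags would come from genotype calls, which are not discussed here.

```python
def decrease_of_heterozygosity(baf, is_het):
    """Return d_j = 2 * |b_j - 1/2| for germline-heterozygous SNPs.

    baf    : list of B allele fractions b_j (hypothetical input)
    is_het : list of booleans, True if SNP j is heterozygous in the germline
    Loci that are not heterozygous in the germline are returned as None,
    since DoH is only defined for heterozygous SNPs.
    """
    return [2 * abs(b - 0.5) if het else None for b, het in zip(baf, is_het)]


# Noise-free allelic ratios: AA, AB in a normal region, AB in a gain region, BB
baf = [0.0, 0.5, 1 / 3, 2 / 3, 1.0]
is_het = [False, True, True, True, False]
doh = decrease_of_heterozygosity(baf, is_het)
```

In this toy input, the two heterozygous SNPs of the gain region (b = 1/3 and b = 2/3) are folded onto the same value d = 1/3, which is what makes the transformed signal unimodal within a region.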

It should be emphasized at this stage that, because the proportion of heterozygous markers among SNPs is generally of the order of 1/3 for a given sample, the number of informative markers is several times larger for c than for d.

1.4.3 Parental copy number computation

[Neuvial et al., 2011] propose to estimate the parental copy numbers (maternal and paternal copies) from the DoH described in the previous section. Considering a SNP j which is heterozygous in the germline, the minor and major copy numbers at j are defined as the smaller and the larger of the two parental copy numbers. They can be estimated as:

\[
\begin{cases}
c^{1}_j = c_j (1 - d_j)/2 \\
c^{2}_j = c_j (1 + d_j)/2
\end{cases}
\tag{1.4}
\]


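[Figure 1.11 plot: four panels plotted against genome position, of which the panel labels TCN and BAF survive; y-axis ticks 1, 2, 3 on three panels and 1/3, 2/3, 1 on the BAF panel; regions labeled gain, cn-LOH, loss and normal.]

Figure 1.11 – TCN, BAF, and minor and major copy number representations along the genome.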

Here d_j is the DoH defined in (1.3). By definition, minor and major copy numbers have the following properties: c_j = c¹_j + c²_j, d_j = (c²_j − c¹_j)/c_j and c¹_j ≤ c²_j. The interpretation in terms of loss of heterozygosity (LOH) is also very simple: for instance, a minor copy number close to 0 corresponds to an LOH alteration, and c¹_j = c²_j corresponds to allelic balance.
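The estimates of Equation (1.4) and the properties just listed can be checked numerically with a short Python sketch. The function name and the scalar inputs are assumptions made for this example; the sketch only restates the formulas and is not the implementation of [Neuvial et al., 2011].

```python
def parental_copy_numbers(c, d):
    """Return (minor, major) copy numbers from TCN c and DoH d, as in (1.4)."""
    minor = c * (1 - d) / 2
    major = c * (1 + d) / 2
    return minor, major


# Noise-free single copy gain: c = 3 and d = 1/3
c, d = 3.0, 1 / 3
minor, major = parental_copy_numbers(c, d)

# Properties stated in the text
sum_ok = abs((minor + major) - c) < 1e-12        # c = c1 + c2
doh_ok = abs((major - minor) / c - d) < 1e-12    # d = (c2 - c1) / c
order_ok = minor <= major                        # c1 <= c2
```

With c = 3 and d = 1/3, the sketch recovers a minor copy number of 1 and a major copy number of 2, up to floating-point rounding.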

The list below describes the most common copy number states in terms of minor and major copy numbers, denoted by the vector (c¹, c²), where c¹ is the minor copy number and c² is the major copy number [Neuvial et al., 2011].

• (1,1): normal (one copy from each parent)

• (0,1): hemizygous deletion (loss of one parental copy)

• (0,2): copy-neutral LOH (loss of one parental copy and gain of the other)

• (1,2): single copy gain

A noise-free graphical representation of these four common types of alterations is shown in Figure 1.11.
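To make the link between these states and the expected signals explicit, the sketch below computes, for each state of the list, the noise-free total copy number, the DoH, and the BAF modes of germline-heterozygous SNPs. The dictionary, the function name and the use of exact fractions are choices made for this example. It assumes a pure tumor sample (no normal contamination) and a total copy number of at least one.

```python
from fractions import Fraction

# Copy number states from the list above, as (minor, major) pairs
STATES = {
    "normal": (1, 1),
    "hemizygous deletion": (0, 1),
    "copy-neutral LOH": (0, 2),
    "single copy gain": (1, 2),
}


def noise_free_signals(c1, c2):
    """Return (c, d, baf_modes) for a state (c1, c2) with c1 + c2 >= 1.

    c         : total copy number, c1 + c2
    d         : decrease of heterozygosity, (c2 - c1) / c
    baf_modes : sorted BAF values of heterozygous SNPs, since the B allele
                may sit on either the minor or the major parental copy
    """
    c = c1 + c2
    d = Fraction(c2 - c1, c)
    baf_modes = sorted({Fraction(c1, c), Fraction(c2, c)})
    return c, d, baf_modes


expected = {name: noise_free_signals(*state) for name, state in STATES.items()}
```

For example, the single copy gain (1,2) gives c = 3, d = 1/3 and BAF modes at 1/3 and 2/3, whereas the normal state (1,1) gives c = 2, d = 0 and a single BAF mode at 1/2.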

1.4.4 Features of DNA copy number data

It is clear from Figures 1.9 and 1.10 that DNA copy number profiles have particular features. The first is that DNA copy number data in tumor cells are piecewise constant along the genome. The second, shared by genomic data in general, is high dimension. As noted in section 1.3, genomic information can be quantified at a large number of loci along the genome: microarrays now yield around 10⁶ observations, and HTS can reach around 3 × 10⁹ observations (a whole human genome).

In addition, tumor samples are generally not composed of a single type of cell: the sample usually contains a non-negligible proportion of normal cells when it is taken. This contamination makes the altered regions harder to identify (Figure 1.10).
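The effect of normal contamination can be illustrated with a small Python sketch. It assumes a simple linear mixture of normal cells in state (1,1) and tumor cells in state (c¹, c²); this model is not spelled out in the text and is used here only as an illustration (see [Neuvial et al., 2011] for a detailed treatment). The function name and the `normal_fraction` parameter are hypothetical.

```python
def observed_signals(c1, c2, normal_fraction):
    """Return the noise-free observed (c, d) under an assumed linear mixture.

    c1, c2          : minor and major copy numbers in tumor cells
    normal_fraction : assumed proportion of normal cells, between 0 and 1
    Normal cells carry one copy from each parent, i.e. the state (1, 1).
    """
    minor = normal_fraction * 1 + (1 - normal_fraction) * c1
    major = normal_fraction * 1 + (1 - normal_fraction) * c2
    c = minor + major
    d = (major - minor) / c
    return c, d


# Single copy gain (1, 2): pure tumor versus 50% normal cells
c_pure, d_pure = observed_signals(1, 2, 0.0)
c_mixed, d_mixed = observed_signals(1, 2, 0.5)
```

Under this assumption, a single copy gain observed with 50% normal cells has a total copy number of 2.5 instead of 3 and a DoH of 0.2 instead of 1/3: both signals are pulled toward the normal state, so the contrast between regions decreases.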

From the document: Développement de méthodes statistiques pour l'analyse du nombre de copies d'ADN en cancérologie (pages 20-24)