
To study the genome (not only in cancer research), high-throughput molecular biological techniques have been developed over the last twenty years, and several efficient technologies have been designed to identify biomarkers involved in genetic diseases and cancers. DNA microarrays and high-throughput sequencing are two kinds of emerging technologies developed to quantify the various levels of genetic information.

DNA microarrays and, for the past ten years, high-throughput sequencing technologies (whole genome sequencing (WGS) and whole exome sequencing (WES)) have been used to detect biomarkers involved in genetic diseases and cancers. In this section, we briefly explain

CHAPTER 1. GENERAL INTRODUCTION

1.3.1 DNA microarray data

DNA microarrays were first produced by the company Affymetrix in 1991, but they only started being used in 1995 [Schena et al., 1995]. Microarrays were first used to measure gene expression, and more particularly to compare expression in tumor cells to expression in normal cells. Techniques to analyze DNA copy number were designed during the same period, in 1992 [Kallioniemi et al., 1992]. In 1998, microarrays were developed to genotype multiple regions of the genome (called loci) [Wang et al., 1998].

Data from two types of DNA microarrays have been studied in this thesis: Comparative Genomic Hybridization (CGH) arrays and Single Nucleotide Polymorphism (SNP) arrays.

CGH arrays were developed to measure the total DNA copy number at predefined loci on the genome at high resolution. The principle is the following:

DNA from a reference sample and from a test sample is collected and labeled with two different fluorophores. The arrays are then cleaned and scanned so that the resulting images can be processed by image analysis software (see Fig. 1.4). The signal obtained from these arrays after image processing is summarized by the ratio between the amounts of test and reference DNA. An example of signals obtained from CGH arrays is presented in Section 1.4.

Figure 1.4 – Principle of array-CGH (extracted from [Barillot et al., 2012])

SNP arrays. A Single-Nucleotide Polymorphism (SNP, pronounced "snip") is defined as a variation of the DNA sequence occurring at a single genome position. It is characterized by a nucleotide (A, T, C, or G) that differs between members of a population (or between paired chromosomes in an individual). For instance, the DNA fragment sequence at the top of Fig. 1.5 is AAGCCTA, whereas the DNA fragment sequence at the bottom is AAGCTTA. In this case, we say that there are two alleles: C and T (denoted arbitrarily by allele A and allele B). In most cases, SNPs only have two alleles.
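As a small illustration of the definition above, the following Python sketch locates the positions at which two aligned fragments differ. The function name `snp_positions` is hypothetical, and the sketch assumes the two fragments are already aligned and of equal length; it is not how SNPs are discovered in practice, only a way to make the example of Fig. 1.5 concrete.

```python
def snp_positions(seq_a: str, seq_b: str) -> list[tuple[int, str, str]]:
    """Return (0-based position, base in seq_a, base in seq_b) for each mismatch.

    Assumes both sequences are already aligned and have the same length.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned and of equal length")
    return [(i, a, b) for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


# The two fragments quoted in the text for Fig. 1.5.
top = "AAGCCTA"
bottom = "AAGCTTA"

diffs = snp_positions(top, bottom)
print(diffs)  # [(4, 'C', 'T')]
```

The single mismatch is at the fifth nucleotide, where the two alleles are C and T.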

Figure 1.5 – What is a SNP? Source: http://www.dnabaser.com/articles/SNP/SNP-single-nucleotide-polymorphism.html

SNP arrays measure allele quantities at a large number of predefined loci. They are used to study small variations between whole genomes [Visscher et al., 2012].

Indeed, one can test whether a SNP appears more frequently in a population suffering from a particular disease than in a healthy population. SNP arrays can also be used to study genetic abnormalities in cancer: by measuring allele intensities at predefined loci, it is possible to deduce both the DNA copy number and the genotype of each locus, either homozygous (AA or BB) or heterozygous (AB).

Formally, for each $j = 1, \dots, J$, let us denote by $\theta_{Aj}$ and $\theta_{Bj}$ the signal intensities measured at SNP $j$ for alleles A and B, respectively. $\theta_{Aj}$ and $\theta_{Bj}$ are proportional to the allele quantity.

We define the first dimension of the signal as the total DNA copy number at SNP $j$, which is proportional to $\theta^t_j = \theta^t_{Aj} + \theta^t_{Bj}$ (the sum of the quantities of allele A and allele B in the tumor sample, denoted $t$). If a reference sample is available, the total DNA copy number in the tumor sample can be measured by:

$$c_j = 2 \times \frac{\theta^t_j}{\theta^r_j} \tag{1.1}$$

where $\theta^r_j$ is the corresponding total signal in the reference sample, denoted $r$.
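Equation (1.1) can be sketched in a few lines of Python. The function name and the intensity values below are hypothetical and chosen only for illustration. The sketch also assumes that the reference signal is the total (A plus B) intensity of a sample carrying two DNA copies at the locus.

```python
def total_copy_number(theta_t: float, theta_r: float) -> float:
    """Total copy number c_j = 2 * theta_t / theta_r, as in Eq. (1.1).

    theta_t: total (A + B) intensity at the SNP in the tumor sample.
    theta_r: total intensity at the same SNP in the reference sample,
             assumed here to carry two DNA copies.
    """
    if theta_r <= 0:
        raise ValueError("reference intensity must be positive")
    return 2.0 * theta_t / theta_r


# Hypothetical intensities at one SNP (arbitrary units).
theta_a_t, theta_b_t = 900.0, 600.0   # tumor, alleles A and B
theta_t = theta_a_t + theta_b_t       # total tumor signal
theta_r = 1000.0                      # total reference signal

c = total_copy_number(theta_t, theta_r)
print(c)  # 3.0, i.e. a gain of one copy at this locus
```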

The second dimension of the signal from SNP arrays is the B allele fraction (BAF).

The BAF is defined at SNP $j$ by:

$$b^t_j = \frac{\theta^t_{Bj}}{\theta^t_{Aj} + \theta^t_{Bj}} \tag{1.2}$$

and is between 0 and 1.

For example, in a normal cell with two DNA copies, the BAF is close to 0 if only allele A is observed, close to 1 if only allele B is observed, and close to 0.5 if both alleles A and B are observed. In what follows, "heterozygous in the germline" refers to the genotype in the normal cells of the patient.
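The BAF of Eq. (1.2) and the three expected values in a normal diploid cell can be sketched as follows. The function names, the tolerance of 0.15 and the intensity values are assumptions made for this illustration, not a calling rule taken from the thesis; real genotype callers rely on statistical models rather than fixed thresholds.

```python
def b_allele_fraction(theta_a: float, theta_b: float) -> float:
    """BAF = theta_B / (theta_A + theta_B), as in Eq. (1.2)."""
    total = theta_a + theta_b
    if total <= 0:
        raise ValueError("total intensity must be positive")
    return theta_b / total


def call_genotype(baf: float, tol: float = 0.15) -> str:
    """Naive genotype call for a diploid locus (tol is an arbitrary choice)."""
    if baf <= tol:
        return "AA"          # only allele A observed
    if baf >= 1.0 - tol:
        return "BB"          # only allele B observed
    if abs(baf - 0.5) <= tol:
        return "AB"          # both alleles observed in similar amounts
    return "ambiguous"       # e.g. allelic imbalance in a tumor sample


# Hypothetical (theta_A, theta_B) intensities at three SNPs.
for theta_a, theta_b in [(1000.0, 20.0), (480.0, 520.0), (10.0, 990.0)]:
    baf = b_allele_fraction(theta_a, theta_b)
    print(f"BAF = {baf:.2f} -> {call_genotype(baf)}")
```

A BAF that falls outside the three bands is reported as ambiguous. In a tumor sample, such values are typically the informative ones, since they suggest an imbalance between the two alleles.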

1.3.2 High-throughput sequencing data

DNA sequencing is used to determine the sequence of anything from a single gene to an entire genome. This technology provides the ordered sequence of nucleotides present in DNA or ribonucleic acid (RNA), and allows us to detect mutations typically linked to diseases at a higher resolution than DNA microarrays.

Whole genome sequencing (WGS)

High-throughput sequencing (HTS) is a recent technology that allows us to sequence DNA and RNA much more quickly and cheaply than Sanger sequencing. Whole genome sequencing consists in collecting a DNA sample and then determining the identity of the nucleotides (A, T, C, G) that compose the genome of a living being. The first step is to cut the whole DNA sequence into short fragments of between 10bp and 100bp. After a replication step, the start and the end of all the replicated fragments are sequenced and read (Fig. 1.6). Then, each sequenced fragment, called a read, is aligned to a reference sequence. The depth of sequencing is the number of times a nucleotide is read during the process. For instance, a depth of 100x means that, on average, close to 100 reads overlap a given position.
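To make the notion of depth concrete, the following Python sketch counts, for each position of a toy reference, how many aligned reads cover it. The function name, the representation of a read as a half-open `(start, end)` interval and the toy alignments are assumptions made for this example; production pipelines compute depth from alignment files with dedicated tools.

```python
def depth_per_position(reads: list[tuple[int, int]], ref_length: int) -> list[int]:
    """Number of reads covering each position of a reference of length ref_length.

    Each read is a half-open interval (start, end) in 0-based coordinates.
    """
    # Difference array: +1 where a read starts, -1 just after it ends.
    diff = [0] * (ref_length + 1)
    for start, end in reads:
        if not 0 <= start < end <= ref_length:
            raise ValueError(f"read ({start}, {end}) falls outside the reference")
        diff[start] += 1
        diff[end] -= 1

    # A running sum of the difference array gives the depth at each position.
    depth, running = [], 0
    for i in range(ref_length):
        running += diff[i]
        depth.append(running)
    return depth


# Toy alignments on a reference of 10 nucleotides.
reads = [(0, 4), (2, 6), (3, 8), (5, 10)]
depth = depth_per_position(reads, 10)
mean_depth = sum(depth) / len(depth)

print(depth)       # [1, 1, 2, 3, 2, 3, 2, 2, 1, 1]
print(mean_depth)  # 1.8
```

The mean depth equals the total number of aligned bases (here 18) divided by the reference length (here 10), which is how a figure such as 100x is to be read.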

Whole exome sequencing (WES)

However, the exons (Fig. 1.7), which are short DNA sequences that lead to transcripts, represent only 1.5% of the human genome [Venter et al., 2001]. In addition, the majority of disease-causing variants are located in exons. In whole exome sequencing, the coding portion of the genome is therefore captured first and then sequenced (see Fig. 1.8 for more details).

As a result, a small part of the genome can be sequenced more deeply for the same cost as sequencing the whole genome.
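The order of magnitude of this gain can be sketched with a short calculation. It assumes that cost is proportional to the number of sequenced bases, that every sequenced base falls on the target, and that the human genome is roughly 3.1 billion bases long; the 30x starting depth is a hypothetical value. Capture cost and off-target reads are ignored, so the result is an idealized upper bound rather than a realistic estimate.

```python
GENOME_SIZE = 3.1e9      # assumed approximate length of the human genome (bases)
EXOME_FRACTION = 0.015   # exons represent about 1.5% of the genome (see text)


def mean_depth(total_bases: float, target_size: float) -> float:
    """Mean depth = number of sequenced bases / size of the targeted region."""
    if target_size <= 0:
        raise ValueError("target size must be positive")
    return total_bases / target_size


# Sequencing budget: the bases needed for a hypothetical 30x whole genome run.
budget = 30 * GENOME_SIZE

wgs_depth = mean_depth(budget, GENOME_SIZE)
wes_depth = mean_depth(budget, EXOME_FRACTION * GENOME_SIZE)

print(round(wgs_depth), round(wes_depth))  # 30 2000
```

Under these assumptions the same budget yields a depth about 1/0.015, or roughly 67 times, higher on the exome than on the whole genome.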

Figure 1.6 – Whole genome sequencing (Property of US National Human Genome Research Institute)

In cancer research, this technology allows us to recover DNA copy number alterations as well as mutations that can drive the evolution of cancer and somatic events such as translocations. To analyze this kind of data, we have to adapt our statistical methods.


Figure 1.7 – Gene (Property of US National Human Genome Research Institute)

Figure 1.8 – Whole exome sequencing. Source: http://biol1020-2012-2.blogspot.fr/2012/08/a-new-breast-cancer-susceptibility-gene.html