HAL Id: inria-00001181

https://hal.inria.fr/inria-00001181

Submitted on 30 Mar 2006


A Data Mining Approach to Discover Genetic and Environmental Factors involved in Multifactorial Diseases

Laetitia Jourdan, Clarisse Dhaenens, El-Ghazali Talbi, Sophie Gallina

To cite this version:

Laetitia Jourdan, Clarisse Dhaenens, El-Ghazali Talbi, Sophie Gallina. A Data Mining Approach to Discover Genetic and Environmental Factors involved in Multifactorial Diseases. Knowledge-Based Systems, Elsevier, 2002, 15 (4), pp. 235–242. DOI: 10.1016/S0950-7051(01)00145-9. inria-00001181.


A data mining approach to discover genetic and environmental factors involved in multifactorial diseases

L. Jourdan^a,*, C. Dhaenens^a, E.-G. Talbi^a, S. Gallina^b

^a LIFL, Bâtiment M3, Cité Scientifique, 59655 Villeneuve d'Ascq Cedex, France
^b Biological Institute, Multifactorial Disease Laboratory, 1 rue du Professeur Calmette, B.P. 245, 59019 Lille Cedex, France

* Corresponding author. Tel.: +33-320337186. E-mail addresses: jourdan@lifl.fr (L. Jourdan), talbi@lifl.fr (E.-G. Talbi).

Received 16 March 2001; revised 22 April 2001; accepted 31 May 2001

Abstract

In this paper, we are interested in discovering genetic and environmental factors that are involved in multifactorial diseases. Experiments have been carried out by the Biological Institute of Lille and a large amount of data has been generated. To exploit these data, data mining tools are required, and we propose a two-phase optimisation approach using a specific genetic algorithm. During the first step, we select significant features with a specific genetic algorithm. Then, during the second step, we cluster affected individuals according to the features selected by the first phase. The paper describes the specificities of the genetic problem that we are studying, and presents in detail the genetic algorithm that we have developed to deal with this very large size feature selection problem. Results on both artificial and real data are presented.

© 2001 Elsevier Science Ltd. All rights reserved.

Keywords: Data mining; Clustering; Genetic algorithm; Feature selection; Multifactorial disease

1. Introduction

Common diseases such as Type 2 diabetes, obesity, asthma, hypertension and certain cancers represent a major public health concern (around 160 million people have Type 2 diabetes). These complex diseases have a multifactorial aetiology in which environmental factors (for example, the Body Mass Index, which is a measure of obesity, or the age at onset, which is the age of the individual when diabetes was diagnosed), as well as genetic factors (genetic markers), contribute to the pathogenesis of the disease. In order to localise the involved factors, we must be able to detect complex interactions such as [(gene A and gene B) or (gene C and environmental factor D)] in one or more populations. Classical methods of genetic analysis test models with one genetic factor. Some methods have been adapted to test the interaction with a second factor, once a first major factor has been detected. These methods are not intended to test a huge number of genetic models combining more than two genes, so they have only a limited capacity to dissect the genetics of complex disease traits.

In order to elucidate the molecular bases of Type 2 diabetes and obesity, the Multifactorial Disease Laboratory at the Biological Institute of Lille has performed large analyses on collections of affected families from different populations. Since the precise molecular mechanisms leading to these diseases are largely unknown, a genome-wide scan strategy was used to localise the genetic factors. This strategy requires no presumptions about interesting loci (a locus is a specific portion of DNA, located without any ambiguity on one chromosome). Searching for interaction patterns in the huge amount of data generated during this first step should increase the power to detect regions containing genes with no individual major effect, but whose interaction leads to the disease. The biological analysis already performed consists of four steps.

1. Collect families with at least two or three affected members.

2. Extract DNA of parents and offspring from a blood sample.

3. Characterise each DNA sample at 400 loci spread over the 23 chromosomes.

4. Compute the genetic similarities for each pair of relatives at each locus.

Loci are polymorphous, so that parents may have different variants, called alleles. Each individual has two alleles for each locus, one inherited from his father and the other from his mother. So we can calculate a genetic similarity for a pair of relatives (for example, two brothers), based on the number of alleles they have in common. For example, at a given locus, two brothers have a probability of 25% of sharing no allele, 50% of sharing one allele, and 25% of sharing two alleles.

The genetic similarity for each pair is calculated with multi-point methods, in order to have extrapolated values at regular positions between loci. That leads to 3652 values corresponding to 3652 points of comparison on the 23 chromosomes. Then, a comparison is made between the genetic similarity observed and what would be expected from the statistical sharing probabilities. This leads to a binary matrix, where a 1 indicates that the observed similarity is greater than the expected probability.

The method must detect a combination of factors for which a subset of lines shares a common pattern of values. However, the method should not converge to a unique solution, but must produce several solutions in order to take into account the heterogeneity of the populations. Moreover, only very few factors (less than 5%) should be relevant.

This is an unsupervised clustering problem, where 3654 features (3652 comparison points and two environmental factors) have to be considered, but where biologists have in mind that only very few features are relevant. (From now on, the term feature is used according to the data mining terminology; it is defined in Section 2.)

We would like here to point out some specificities of the problem compared to usual data mining problems.

• A lot of features have to be considered (up to 3654).
• Very few features are significant (less than 5%).
• Only few data (number of pairs of individuals) are available, compared to the large number of features.
• We can only work with affected people, because individuals who are unaffected when their DNA is extracted may become affected later.
• The objective is to discover not only a single association, but several associations of genes and environmental factors, and to group affected people according to these associations.

To deal with an unsupervised clustering problem on such a large number of features, it is not possible to apply classical clustering algorithms directly. Algorithms dedicated to clustering, such as the k-means algorithm, are not able to deal with so many features: their executions would be time consuming and the results obtained would not be exploitable. Therefore, we adopt a two-phase approach.

• A feature selection phase using a genetic algorithm.
• A clustering phase.

For the first phase, feature selection, we developed a genetic algorithm to extract the most influential features, and in particular influential associations of features. This heuristic approach has been chosen because the number of features is large. For this specific problem, some advanced mechanisms have been introduced into the genetic algorithm, such as dedicated genetic operators, sharing and random immigrants, and a particular distance operator has been defined. The second phase is then a clustering based on the features selected during the previous phase. The clustering algorithm used is k-means.

This paper is organised as follows. First, we give some mathematical background. Then, our approach, an adaptation of a genetic algorithm for this particular feature selection problem, is detailed. Section 4 presents the clustering phase. In Sections 5 and 6, we report results on experiments carried out with data from GAW11 (a workshop on genetic analysis) and with real datasets.

2. Mathematical background

In this section, we give definitions of concepts and functions used in the rest of the paper.

Unsupervised classification (or clustering, or learning from observations). This is a data mining task in which the system has to classify a set of objects without any information on the characteristics of the classes. It has to find its own classes (clusters).

Feature (or attribute). A quantity describing an instance. Here, a feature may be an environmental factor or a genetic comparison point.

Support. The notion of support in data mining denotes the number of times (the number of rows) a set of features is met, over the number of times at least one member of the set is met. Here is a formal definition of the support: let $R$ be a set of items, $r$ be a binary database over $R$ and $X \subseteq R$ be a set of items. The item set $X$ matches a row $t \in r$ if $X \subseteq t$. The set of rows in $r$ matched by $X$ is denoted by $\{t \in r \mid X \subseteq t\}$, and

$$\mathrm{support}(X) = \frac{|\{t \in r \mid X \subseteq t\}|}{|\{t \in r \mid \exists x_i \in X,\; x_i \in t\}|}$$
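To make the definition concrete, here is a minimal sketch (ours, not from the paper) of this support computed over a binary matrix whose rows are pairs of relatives and whose columns are features; the NumPy-based data layout is an assumption.

```python
import numpy as np

def support(data: np.ndarray, features: list) -> float:
    """Support of a feature set X over a binary 0/1 matrix r.

    data: (n_rows, n_features) array; features: column indices of X.
    Returns |{t in r | X subset of t}| / |{t in r | some x_i of X is in t}|.
    """
    sub = data[:, features]
    matched_all = int(sub.all(axis=1).sum())   # rows where every feature is 1
    matched_any = int(sub.any(axis=1).sum())   # rows where at least one is 1
    return matched_all / matched_any if matched_any else 0.0

# Toy usage: 4 pairs of relatives, 3 features.
r = np.array([[1, 1, 0],
              [1, 1, 1],
              [0, 1, 0],
              [1, 0, 0]])
print(support(r, [0, 1]))  # 2 rows match both features, 4 match at least one -> 0.5
```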

Scattering criterion. To measure the quality of a clustering, we use the classical scattering criterion based on statistical concepts. It is defined as follows.

Given $c$ clusters and $n$ features, let $x_j$ be the $j$-th cluster, with $j = 1, \ldots, c$. Let $m$ be the global mean vector, $m = (m_1\ m_2 \cdots m_n)$, and $m_j$ the vector of means for the $j$-th cluster, $m_j = (m_{j1}\ m_{j2} \cdots m_{jn})$. $X_i$ is an element of the cluster $x_j$, and $(X_i - m_j)^t$ is the transpose of the vector $(X_i - m_j)$.

The scatter matrix for the $j$-th cluster is then:

$$P_j = \sum_{X_i \in x_j} (X_i - m_j)(X_i - m_j)^t$$

We can then define the intra-cluster scatter matrix for $c$ clusters:

$$P_W = \sum_{j=1}^{c} P_j$$

and the inter-cluster scatter matrix between clusters:

$$P_B = \sum_{j=1}^{c} (m_j - m)(m_j - m)^t$$

The scattering criterion is then $\mathrm{tr}(P_W^{-1} P_B)$, where $\mathrm{tr}$ is the trace. The higher $\mathrm{tr}(P_W^{-1} P_B)$ is, the better the ratio between the intra-cluster scatter and the inter-cluster scatter, and the better the quality of the clustering.
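As a sanity check of the criterion, it can be computed directly with NumPy. This sketch is ours, not the authors' code; it uses a pseudo-inverse in case $P_W$ is singular, a case the paper does not discuss.

```python
import numpy as np

def scattering_criterion(X: np.ndarray, labels: np.ndarray) -> float:
    """tr(P_W^{-1} P_B): the higher, the better the clustering quality.

    X: (n_samples, n_features) data matrix; labels: cluster index per sample.
    """
    m = X.mean(axis=0)                              # global mean vector
    n = X.shape[1]
    p_w = np.zeros((n, n))                          # intra-cluster scatter P_W
    p_b = np.zeros((n, n))                          # inter-cluster scatter P_B
    for j in np.unique(labels):
        cluster = X[labels == j]
        m_j = cluster.mean(axis=0)                  # mean vector of cluster j
        d = cluster - m_j
        p_w += d.T @ d                              # scatter matrix P_j
        diff = (m_j - m).reshape(-1, 1)
        p_b += diff @ diff.T
    return float(np.trace(np.linalg.pinv(p_w) @ p_b))
```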

3. The feature selection phase

The first phase of our algorithm deals with isolating the very few relevant features from a large set. This is not exactly the classical feature selection problem known in data mining, where, as in Ref. [1] for example, around 50% of the features are selected: here, the idea is that less than 5% of the features have to be selected. However, this problem is very close to the classical feature selection problem, and we take inspiration from the literature on this subject.

3.1. Description of the data to be analysed for interaction patterns

The genetic similarities are merged in a matrix together with the environmental factors. In this matrix, lines represent pairs of relatives and columns represent the genetic and environmental factors. Each factor will be called a feature. This matrix has no missing values for the genetic data, since we use extrapolation. The size of the matrix ranges from 500 to 1000 lines, depending on the number of families, and from 400 to 3654 columns, depending on the extrapolation function. The size of this matrix is unusual for data mining, because the number of columns (i.e. the features) is of the same order as the number of lines.

3.2. Feature subset selection

The feature subset selection problem refers to the task of identifying and selecting a useful subset of features from a large set of redundant, perhaps irrelevant, features [1].

Two models exist, depending on whether or not the selection is coupled with the classification scheme. The first one, the filter model, realises the feature subset selection and the classification in two separate phases and uses a measure that is simple and fast to compute. The second one, the wrapper model, realises the feature subset selection and the classification in the same process and engages a learning algorithm to measure the classification accuracy.

Classically, feature selection is treated as an optimisation problem: finding the minimal number of bits that have value 1 with which a given criterion is satisfied.

Feature selection and extraction can optimise classification performance and even affect classifier design [2].

Researchers have shown that optimal feature selection is NP-hard [3]. Therefore, only heuristics, and in particular meta-heuristics such as genetic algorithms [1,2], are able to deal with large-size problems.

[Fig. 1. Our genetic algorithm: the different phases.]

For this phase, we decided to use a genetic algorithm, as the literature shows that they are well adapted to problems with a large number of features [2,4,5].

A genetic algorithm (GA) works by repeatedly modifying a population of artificial structures through the application of genetic operators. The goal is to find the best possible solution, or good solutions, for the problem. For this particular feature selection problem, we have developed a particular GA (see Fig. 1). We present here the main characteristics and adaptations that we made to deal with this particular problem.
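To fix ideas, here is a schematic sketch of the loop of Fig. 1 under our reading of the paper; the operators are described in the following subsections (where small sketches of each are also given), and all parameter values here are illustrative, not the authors'.

```python
def run_ga(fitness, random_individual, select, crossover, mutate,
           pop_size=100, generations=500, stall_limit=20):
    """Schematic GA loop: selection, crossover, mutation, replacement,
    plus random immigrants when the best individual stalls (Section 3.9).
    Sharing (Section 3.8) is assumed folded into the supplied fitness."""
    population = [random_individual() for _ in range(pop_size)]
    best, stall = None, 0
    for _ in range(generations):
        top = max(population, key=fitness)
        best, stall = (best, stall + 1) if top == best else (top, 0)
        if stall >= stall_limit:                     # random immigrants
            mean_f = sum(map(fitness, population)) / pop_size
            population = [ind if fitness(ind) >= mean_f else random_individual()
                          for ind in population]
            stall = 0
        parents = select(population), select(population)
        for child in crossover(*parents):
            child = mutate(child)
            worst = min(population, key=fitness)     # offspring kept only if
            if fitness(child) > fitness(worst):      # better than the least good
                population[population.index(worst)] = child
    return max(population, key=fitness)
```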

3.3. The encoding

The genetic algorithm uses the concept of chromosome to encode and manipulate solutions. Each chromosome defines an individual of the population of the genetic algorithm. Here, we use a binary representation: a chromosome is a string of bits whose size corresponds to the number of features. A 0 or 1 at position i indicates whether feature i is selected (1) or not (0).

3.4. The chromosomal distance (a distance adapted to the problem)

We noted that features are correlated: a gene and its neighbours on the same chromosome have a similar influence. We decided to create a distance in order to be able to compare two individuals of our GA and to recognise two similar ones. This distance takes this correlation into account. Biologist experts indicate that the length of correlation is 20 centiMorgan (a genetic map unit): i.e. a gene is correlated with the genes that are around it. This correlation stops at the chromosomal cut.

3.5. The fitness function

As we deal with an unsupervised classification (clustering), it is impossible to evaluate the accuracy of the feature selection phase according to the clustering, so we need to build a particular objective function and to adopt a filter model.

The fitness function we developed refers to the support notion (see Section 2) of an association. The function is composed of two parts. The first part favours, for a small support, a small number of selected features; the second part, the most important one (multiplied by 2), favours, for a large support, a large number of features. What is expected is to favour good associations (in terms of support) with as many features as possible.

$$F = (1 - S) \times \left( \frac{\frac{T}{10} - 10 \times SF}{T} \right) + 2 \times S \times \left( \frac{10 \times SF - \frac{T}{10}}{T} \right)$$

where $S = |A \cap B \cap C \cap \cdots| \,/\, |A \cup B \cup C \cup \cdots|$ is the support; $A, B, C, \ldots$ are the selected features; $|A \cap B \cap C \cap \cdots|$ is the number of rows for which features $A, B, C, \ldots$ are all equal to 1; $|A \cup B \cup C \cup \cdots|$ is the number of rows for which at least one of the features $A, B, \ldots$ is equal to 1; $T$ is the total number of features; and $SF$ is the number of selected significant features (selected features that are not too close in terms of the chromosomal distance).

[Fig. 2. The SSOCF crossover operator.]
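A direct transcription of this fitness, under our reading of the reconstructed formula (in particular, the sign of the second term is chosen so that, as the text says, a large support rewards many selected features); the `significant_count` argument stands for SF, whose computation depends on the chromosomal distance of Section 3.4 and is not reproduced here.

```python
def fitness(data, selected, significant_count, total_features):
    """Two-part fitness of Section 3.5 (our reading of the formula).

    data: binary matrix as a list of 0/1 rows; selected: feature indices.
    significant_count: SF, the selected features that are not too close in
    terms of the chromosomal distance (problem-specific, caller-supplied).
    """
    rows_all = sum(1 for row in data if all(row[i] for i in selected))
    rows_any = sum(1 for row in data if any(row[i] for i in selected))
    s = rows_all / rows_any if rows_any else 0.0        # support S
    t, sf = total_features, significant_count
    part = (t / 10 - 10 * sf) / t
    # first term: small support -> few features; second: large support -> many
    return (1 - s) * part + 2 * s * (-part)
```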

3.6. The genetic operators

These operators allow GAs to explore the search space. However, operators typically have destructive as well as constructive effects, and they must be adapted to the problem.

Crossover

We use the subset size-oriented common feature crossover operator (SSOCF) [6], which keeps useful informative blocks and produces offspring that have the same distribution as the parents (see Fig. 2).

The shared features are kept by the offspring, and the non-shared features are inherited by the offspring corresponding to the $i$-th parent with probability $(n_i - n_c)/n_u$, where $n_i$ is the number of selected features of the $i$-th parent, $n_c$ is the number of commonly selected features across both mating partners and $n_u$ is the number of non-shared selected features.
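A sketch of SSOCF as we read the description above and Ref. [6]: bits on which the parents agree are copied to both offspring, and each non-shared feature becomes 1 in offspring i with probability (n_i − n_c)/n_u, so the expected number of selected features of offspring i equals that of parent i.

```python
import random

def ssocf(parent1, parent2):
    """Subset Size-Oriented Common Feature crossover (cf. Fig. 2)."""
    n1, n2 = sum(parent1), sum(parent2)
    n_c = sum(1 for a, b in zip(parent1, parent2) if a == b == 1)  # common 1s
    n_u = sum(1 for a, b in zip(parent1, parent2) if a != b)       # non-shared
    if n_u == 0:                       # identical parents: nothing to mix
        return parent1[:], parent2[:]
    p1, p2 = (n1 - n_c) / n_u, (n2 - n_c) / n_u
    child1, child2 = [], []
    for a, b in zip(parent1, parent2):
        if a == b:                     # shared bit: kept by both offspring
            child1.append(a)
            child2.append(a)
        else:                          # non-shared feature, distributed at random
            child1.append(1 if random.random() < p1 else 0)
            child2.append(1 if random.random() < p2 else 0)
    return child1, child2
```

With the values of Fig. 2 (n1 = 8, n2 = 4, n_c = 2, n_u = 8), offspring 1 keeps the 2 common features and receives each non-shared one with probability 6/8, for an expected size of 8, like its parent.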

Mutation

Mutation is an operator which allows diversity. During the mutation stage, a chromosome has a probability $p_{mut}$ of mutating. If a chromosome is selected for mutation, a number $n$ of bits to be flipped is chosen, then $n$ bits are randomly chosen and flipped. In order to create a large diversity, we set $p_{mut}$ to around 10%.

Offspring are kept only if they fit better than the least good individual of the population.
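A minimal sketch of this mutation; the paper does not state how the number n of flipped bits is drawn, so the bound `max_flips` is an assumption.

```python
import random

def mutate(chromosome, p_mut=0.1, max_flips=5):
    """n-bit flip mutation: with probability p_mut, flip n random bits.
    max_flips is illustrative; the paper leaves the choice of n open."""
    if random.random() >= p_mut:
        return chromosome[:]
    child = chromosome[:]
    n = random.randint(1, max_flips)
    for i in random.sample(range(len(child)), n):
        child[i] = 1 - child[i]        # flip the selected bit
    return child
```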

3.7. Selection

We implement a probabilistic binary tournament selection. Tournament selection holds $n$ tournaments to choose $n$ individuals. Each tournament consists of sampling two individuals of the population and choosing the fittest one with a probability $p \in [0.5, 1]$.
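A sketch of this selection; p = 0.8 is an illustrative value inside the stated interval [0.5, 1].

```python
import random

def binary_tournament(population, fitness, p=0.8):
    """Sample two individuals and return the fitter one with probability p."""
    a, b = random.sample(population, 2)
    fitter, weaker = (a, b) if fitness(a) >= fitness(b) else (b, a)
    return fitter if random.random() < p else weaker
```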

3.8. Sharing

To avoid premature convergence and to discover several different good solutions (different relevant associations of features), we use a niching mechanism. A comparison of such mechanisms is given in Ref. [7]. The objective is to boost the selection chance of individuals that lie in less crowded areas of the search space. We use a niche count that measures how crowded the neighbourhood of a solution is. The distance $D$ is the chromosomal distance adapted to our problem, presented before. The fitness of individuals in highly concentrated regions of the search space is degraded, and this new fitness value is used, in place of the initial value, for the selection. The sharing fitness $f_{sh}(i)$ of an individual $i$, where $n$ is the size of the population, $\alpha_{sh} = 1$ and $\sigma_{sh} = 3$, is:

$$f_{sh}(i) = \frac{F(i)}{\sum_{j=1}^{n} Sh(D_{i,j})}$$

where

$$A_{i,j} = \left( \frac{D_{i,j}}{\sigma_{sh}} \right)^{\alpha_{sh}}, \qquad Sh(D_{i,j}) = \begin{cases} 1 - A_{i,j} & \text{if } D_{i,j} < \sigma_{sh} \\ 0 & \text{otherwise} \end{cases}$$
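A transcription of the sharing formula; `distance` stands for the chromosomal distance of Section 3.4. Note that the niche count is at least 1, since an individual is at distance 0 from itself, so no division by zero can occur.

```python
def sharing_fitness(ind, population, raw_fitness, distance,
                    sigma_sh=3.0, alpha_sh=1.0):
    """f_sh(i) = F(i) / sum_j Sh(D_ij), with Sh(d) = 1 - (d / sigma_sh)**alpha_sh
    when d < sigma_sh and 0 otherwise (sigma_sh = 3, alpha_sh = 1 in the paper)."""
    niche_count = 0.0
    for other in population:
        d = distance(ind, other)
        if d < sigma_sh:
            niche_count += 1.0 - (d / sigma_sh) ** alpha_sh
    return raw_fitness(ind) / niche_count
```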

3.9. Random immigrant

Random immigrant is another method that helps to maintain diversity in the population. It should also help to avoid premature convergence [8]. We use random immigrants as follows: when the best individual remains the same during $N$ generations, all the individuals of the population whose fitness $f_{sh}(i)$ is under the mean are replaced by new, randomly generated individuals.
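A sketch of the replacement step, assuming the shared fitness of Section 3.8 is available as a function; the stall detection (best individual unchanged for N generations) appears in the GA loop sketch given earlier.

```python
def random_immigrants(population, shared_fitness, random_individual):
    """Replace each individual whose shared fitness is below the population
    mean with a new randomly generated individual."""
    values = [shared_fitness(ind) for ind in population]
    mean = sum(values) / len(values)
    return [ind if v >= mean else random_individual()
            for ind, v in zip(population, values)]
```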

4. The clustering phase

The second phase of our method is a clustering phase using the features selected during the first phase.

4.1. Clustering

Clustering is used to identify classes of objects sharing common characteristics. Clustering methods can be applied to many human activities, and particularly to the problem of taking automatic decisions, for example in the establishment of a medical diagnosis from the clinical description of a patient. Classification, or data clustering, can isolate similarities and differences in the database and make groups of similar data, which are called classes or groups. Clustering methods make classes of homogeneous objects [9]. Widely used clustering methods are: k nearest neighbours, k-means [10] and density clustering [11]. However, most of the time, only a few features are considered by these methods. In our case, we need to consider 3654 features, and clustering methods have difficulties dealing with such a large number; that is why we needed to pre-select some of the most relevant features.

Much research has been done on clustering problems, but as our main work is on the feature selection process, we do not describe all of the clustering literature. The interested reader may refer to the tutorial [12].

4.2. Objective of the clustering phase

The objective of the clustering phase is to group together pairs of individuals that share the same genetic particularities. A feature selected during the first phase indicates that, alone or together with other features, it is often shared by affected people. The clustering should now indicate which are the most significant associations.

Two pairs grouped in the same cluster indicate that the people concerned share similar genetic particularities. Hence, these particularities may explain why they are affected, and a rule such as 'the disease may be explained by features A, B and C' may be expressed. Each class will represent a rule, so the final result could ideally be 'this disease may be explained by the association of features A, B, C or D, E, F, G or H, I'.

4.3. Application of k-means

The feature selection phase has selected significant features, so we can now use a classical algorithm: the k-means algorithm.

The k-means algorithm is an iterative procedure which requires an initial classification of the data. It proceeds as follows: it computes the centre of each cluster, then computes a new partition by assigning each object to the cluster whose centre is the closest to that object. This cycle is repeated for a given number of iterations, or until the assignment has not changed during one iteration [13].

Since the number of features is now very small, we implement a classical version of the k-means algorithm, randomly selecting the initial centres.
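A compact sketch of this classical k-means with random initial centres (ours; the paper gives no implementation details). X is assumed to be a NumPy array of shape (n_objects, n_features).

```python
import numpy as np

def kmeans(X, k, iterations=100, seed=None):
    """Classical k-means: random initial centres, then alternate assignment
    and centre recomputation until stable or out of iterations."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(iterations):
        # assign each object to the cluster whose centre is the closest
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                                  # assignment unchanged
        labels = new_labels
        for j in range(k):                         # recompute each centre
            members = X[labels == j]
            if len(members):
                centres[j] = members.mean(axis=0)
    return labels, centres
```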

5. An illustrative example

In this part, we give a short example of the two phases of our method.

If we consider the matrix shown in Table 1, where E1 is an environmental factor and A, B, C, …, K are genetic factors, the best solutions are $f_{E1C} = 271/120$, $f_{CF} = 271/120$, $f_{IJ} = 112/75$ and $f_{E1CF} = 7/5$.

The genetic algorithm will select the features E1, C, F, I, J.

Then, the k-means algorithm will build the following clusters:

• cluster 1: E1, C, F; and
• cluster 2: I, J.

6. Experimental results

6.1. Experiments on artificial data

Experiments were first executed on an artificial database constructed for the workshop GAW11 (Genetic Analysis Workshop), which was held in 1998 and was organised by Dr David A. Greenberg.² This base is public and is used to evaluate different data mining methods. It is an artificial database, but it was constructed to be very similar to real databases, and we know, by construction, the relevant associations of features (associations of loci (A, B, C, D) and environmental factors (E1, E2)) which can influence the disease. The associations to be found are A + B + D and E1 + C.

This test base is composed of 491 features and 165 pairs of individuals.

For 10 runs, we wanted to know how many times the associations were discovered by the genetic algorithm. Table 2 presents the results.

The first phase was able to discover real interactions of loci. Some of them are more difficult to find than others.

Then, we ran the k-means algorithm with the results of the GA, in order to validate those results.

We gave the 11 features selected among the 491 by the first phase (only 2.24% of the initial number of features) to the k-means algorithm. The k-means algorithm helps us to discover associations between genes, and associations between genes and environmental factors. We had previously experimented with the classical k-means algorithm without any feature selection: the execution time was very long (over 7500 min) and the results could not be interpreted (we did not know which features were involved in the disease). With the feature selection, the execution time of k-means decreased to 1 min and the results are exploitable.


617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 Table 1

E1 A B C D F G H I J K

1 0 0 1 0 0 0 0 0 0 0

1 0 0 1 0 1 0 0 0 0 0

0 0 0 1 0 1 0 1 0 0 0

1 0 0 1 0 1 0 0 0 0 1

0 0 0 0 0 0 1 0 1 1 0

0 0 0 0 1 0 0 0 1 1 0

0 1 0 0 0 0 0 0 1 1 1

0 0 1 0 0 0 0 0 0 1 0

0 0 0 0 0 0 0 1 0 1 0

² Dr D.A. Greenberg, Department of Psychiatry, Box 1229, Mt Sinai Medical Center, 1 G. Levy Place, New York, NY 10029, USA.

Table 2
Associations discovered by the GA with the artificial data

Association     A+B   A+D   B+D   C+E1
Frequency (%)   100   50    20    10

Table 3
Clusters obtained and their occurrences with the artificial data

Cluster                 Occurrence
(A B D), (E1 C)         4
(A B D), (E1 C D)       1
(A B), (E1 D C)         1
(A B), (E1)             2
(E1 A D B), (E1 C B)    1
(A B D), (E1 D)         1


We ran k-means (with k = 2) 10 times with the 11 selected features. Table 3 presents the clusters obtained and their number of occurrences. It shows that the k-means algorithm, using the results of the GA, is able to construct clusters very closely related to the solution presented in the results of the workshop. Moreover, this exact solution was found in 40% of the executions.

6.2. Experiments on real data

Experiments have been executed on real data provided by the Biological Institute of Lille (B.I.L.) for the study of diabetes. The first dataset was composed of 543 pairs of individuals who have diabetes. Biologists examined 385 points of comparison and two environmental factors (the age at onset, which is the age of the individual when diabetes was diagnosed, and the BMI, Body Mass Index, which is a measure of obesity).

Experiments are now being carried out on a more complete database composed of 1075 pairs of individuals, 3652 comparison points and two environmental factors. As the data and results are confidential, we are not allowed to detail all the biological results, but we present some aspects here.

Performance of the method

The proposed approach must be able to deal with a very large dataset, so we first evaluated its performance and made some tests to determine the limits of our application.

The genetic algorithm is the longest phase of our approach, so we tested its ability to deal with large databases. Firstly, we tested it for different numbers of features and obtained the following results: 1000 features, 60 min; 2000 features, 140 min; 3000 features, 207 min; 3654 features, 255 min. So the running time of the genetic algorithm grows linearly with the number of features.

Secondly, we tested it for different numbers of pairs of individuals and obtained the following results: 200 pairs, 56 min; 400 pairs, 64 min; 600 pairs, 74 min; 800 pairs, 76 min.

Hence, these experiments show that this algorithm may be applied to very large datasets (with a large number of features) and is able to give results in reasonable time.

Results

For the first real dataset (453 pairs and 387 features), experiments have been carried out. We present here the results of 10 executions of the genetic algorithm. Over the 10 executions, a total of seven features corresponding to seven different chromosome locations (that we cannot reveal) (A, B, C, D, F, G, H) and an environmental factor (E1) have been selected in different associations. Table 4 indicates the number of times each association has been found.

This table shows that some associations are easy to find (e.g. the association of features D and G is found in 100% of the executions) and others are more difficult (the association E1 + H is found in 10% of the executions).

Then, in order to validate these results, and to be able to propose interesting and complete associations, we ran the k-means algorithm with the eight selected features.

As we do not know how many clusters there are (i.e. how many associations will be found), we ran k-means with several values of k. The best results were found with k = 3, according to the scattering criterion. Table 5 reports the number of times each association was found with k = 3 when the scattering criterion was good.

Experiments are now being done with the complete database, as the performance study showed that the method is able to deal with such a large database. Those results are confidential and we cannot give any information about them. What we can say is that the results of two executions are similar, which allows us to express assumptions that biologists now have to confirm.

The important aspect here is the ability of the method to find several associations. The method is a tool that must help biologists to identify interesting areas of chromosomes that they will be able to study in detail.

7. Conclusions and discussion

This paper presents an optimisation method developed to deal with a data mining problem. The objective is to provide a tool to help biologists to find interactions between genes involved in multifactorial diseases. These interactions are difficult to find, and statistical methods are mostly based on single-gene models. We model this problem as a particular feature selection problem, where we want to find different combinations of genes. We solve this problem using a two-phase method, where the first phase selects the most relevant associations of features and the second phase validates the feature selection by grouping affected people according to the selected features, using the classical k-means algorithm. For the first phase, we propose an adapted genetic algorithm. We have tested our approach on constructed data provided for GAW11 and compared our results with other works. Now, this method is used by the biologists of the B.I.L. to exploit the data they have collected to study diabetes and obesity.

Table 4
Associations discovered by the GA with the real data

Association     B+C   D+G   A+F   A+D   E1+H
Frequency (%)   50    100   50    40    10

Table 5
Clusters obtained and their occurrences with the real data

Clusters                      Occurrence (%)
(A F H), (B C D), (E1 G)      75
(A H), (B C D), (E1 F G)      25


This approach gave promising results (it is able to identify interesting associations of genes and/or environmental factors) that biologists have to confirm with biological experiments. Complex diseases are a good challenge for geneticists. These diseases are characterised by etiologic heterogeneity, and with data mining approaches, such as the one proposed here, biologists are able to discover interactions between genes, and between genes and environmental factors.

The perspectives for this project may be the following.

• Taking into account the family aspect of our data in the evaluation function.
• Working on different clustering methods (k-nearest neighbours, dedicated methods, etc.), because the k-means algorithm is not able to give good solutions for each run.

References

[1] J. Yang, V. Honavar, Feature subset selection using a genetic algorithm, in: H. Liu, H. Motoda (Eds.), Feature Extraction, Construction and Selection: A Data Mining Perspective, Kluwer Academic, MA, 1998, pp. 117–136.

[2] M. Pei, E.D. Goodman, W.F. Punch, Feature extraction using genetic algorithms, Technical Report, Michigan State University, GARAGe, June 1997.

[3] P.M. Narendra, K. Fukunaga, A branch and bound algorithm for feature subset selection, IEEE Trans. Computers C-26 (9) (1977) 917–922.

[4] M. Pei, E.D. Goodman, W.F. Punch, Y. Ding, Genetic algorithms for classification and feature extraction, in: Annual Meeting: Classification Society of North America, June 1995.

[5] M. Pei, E.D. Goodman, W.F. Punch, Pattern discovery from data using genetic algorithms, in: Proceedings of the First Pacific-Asia Conference on Knowledge Discovery and Data Mining, February 1997.

[6] C. Emmanouilidis, A. Hunter, J. MacIntyre, A multiobjective evolutionary setting for feature selection and a commonality-based crossover operator, in: Congress on Evolutionary Computation 2000, vol. 2, CEC, 2000, pp. 309–316.

[7] S.W. Mahfoud, Niching methods for genetic algorithms, PhD thesis, University of Illinois, 1995.

[8] C. Bates Congdon, A comparison of genetic algorithms and other machine learning systems on a complex classification task from common disease research, PhD thesis, University of Michigan, 1995.

[9] T. Mitchell, Machine Learning, McGraw-Hill, 1997.

[10] K. Fukunaga, Introduction to Statistical Pattern Recognition, Academic Press, San Diego, CA, 1990.

[11] R. Lefebvre, G. Venturi, Le Data Mining, Eyrolles Informatique, 1998.

[12] A. Hinneburg, D.A. Keim, Tutorial: clustering techniques for large datasets from the past to the future, in: PKDD 2000, September 2000, p. 65. URL: http://hawaii.informatik.uni-halle.de/hinnebur/ClusterTutorial/

[13] N. Monmarché, M. Slimane, G. Venturini, AntClass: discovery of clusters in numeric data by a hybridization of an ant colony with the k-means algorithm, Technical Report 213, École d'Ingénieurs en Informatique pour l'Industrie (E3i), Université de Tours, January 1999.
