Conclusion

The analysis of microarray data proved to be a very challenging task, in part because of the sheer size of the data, which turned even such simple things as the determination of p-values and clustering into complicated enterprises. The other difficulty lay in the imprecision of the questions asked. Biologists often have qualitative questions, which are not always easy to formulate mathematically. They also do not always correctly assess what is possible and what is not. For these reasons, it has often been necessary to work out what a biologist might be interested in that could also be obtained from the data. This can be the hardest part of the work.

The work presented in this thesis spans a broad spectrum. The first half concentrated mainly on technical issues: how to estimate significance levels, how to improve the data quality and how to store the data. For these parts, the questions were quite clear, and most of the difficulties came from the size of the data and its particularities. In this first half, the techniques presented are mostly improvements of existing techniques. The main original points were:

1. In the statistical chapter, the comparison of different scoring functions, the merging of false discovery rates determined on different intensity windows and the determination of a local false discovery rate.

2. In the data improvement chapter, the creation of two efficient data quality criteria.

The merging of different scans to avoid saturation was original at the time it was developed, although it has since been published by others. Assessing the different modifications in a thorough and consistent manner is an obvious step, but one that is surprisingly rarely taken.

3. The data storage chapter presents a database model which is essentially the development of simple design ideas.

The second half was concerned with questions which the biologists did not ask directly, but which could be inferred from their frustrations. Since the problems those chapters try to solve are original, the techniques used tend to be original as well. The three original points raised in those chapters were:

1. The discovery of the composition of complex samples. The expression profile of complex samples can depend more on their composition than on their pathological status. A means of mathematically dissecting those samples was proposed. This work is completely original.

2. The discovery of different clusterings. The samples in an experiment can often be organized in more than one way, depending on the criterion chosen. A means is given of clustering the genes according to the clustering they induce on the samples. The idea of this work stems from a paper showing the existence of different superimposed clusterings. The redefinition of this problem as a clustering problem on the genes is original, as is the algorithmic means of performing said clustering.

3. The last chapter concerns the link between genetic networks and clustering. It is shown that supposing the existence of a form of genetic network implies the existence of a clustering, whose form is determined by the form of the genetic network. This is illustrated with a Boolean network, showing that its identification can be separated into two tasks: a clustering, and an identification of the clustering with the network. This way of linking clustering and genetic networks is original, as are the algorithms proposed.
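
The link between a Boolean network and a clustering can be made concrete with a toy example. The sketch below is only an illustration under stated assumptions, not the algorithm proposed in the thesis: the network (two regulators `r1` and `r2`, six target genes `g1` to `g6`), the rules and the function names are all hypothetical. Genes governed by the same Boolean rule have identical on/off profiles across samples, so grouping genes by profile yields a clustering whose form is dictated by the network.

```python
from collections import defaultdict
from itertools import product

# Hypothetical toy network (an assumption for illustration only):
# each target gene is a Boolean function of two regulators r1 and r2.
RULES = {
    "g1": lambda r1, r2: r1,
    "g2": lambda r1, r2: r1,
    "g3": lambda r1, r2: r2,
    "g4": lambda r1, r2: r2,
    "g5": lambda r1, r2: r1 and r2,
    "g6": lambda r1, r2: r1 and r2,
}


def simulate(rules):
    """Return each gene's 0/1 profile over all regulator states (the 'samples')."""
    states = list(product([False, True], repeat=2))
    return {gene: tuple(int(rule(*state)) for state in states)
            for gene, rule in rules.items()}


def cluster_by_profile(profiles):
    """Task 1 (clustering): group genes that share an identical profile."""
    clusters = defaultdict(list)
    for gene, profile in profiles.items():
        clusters[profile].append(gene)
    return sorted(sorted(genes) for genes in clusters.values())


profiles = simulate(RULES)
clusters = cluster_by_profile(profiles)
print(clusters)
```

In this noise-free setting the second task, identifying each cluster with a part of the network, reduces to matching each shared profile to the Boolean rule that generates it. With real, noisy measurements an exact-match grouping would have to be replaced by a proper clustering method.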