
1.2 Data mining techniques

1.2.2 Data representation

The representation of the data plays an important role in selecting the appropriate data mining technique. In the example of the blood analysis, the data can be represented as real numbers. Usually one variable does not suffice for representing a sample, and hence vectors or matrices of variables need to be used. For instance, an apple can be represented by a digital image portraying the fruit. A digital image is a matrix of pixels, each with a certain color. In this case, the image of the apple is represented as a matrix of real numbers. A sound can instead be represented as a sequence of consecutive audio signals. In this case the data are represented as vectors of real numbers. The length of the representing vector is important, as longer vectors represent the sound more accurately. Other representations can make use of graphs or networks, as is the case of the financial application discussed in Section 1.3.3.

Some data mining techniques use distances between samples for partitioning or classifying data. Computing the distance between two samples means computing the distance between the two vectors or two matrices of variables representing the samples. An efficient representation of the data makes it easier to define a good distance function. Even when a data mining technique does not use a distance function (as is the case of artificial neural networks), data representation is important because it helps the technique to perform its task better.
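As an illustration of how a distance between two samples can be computed once they are represented as vectors or matrices, the following minimal sketch uses the Euclidean distance. The choice of this particular distance, the function names and the sample values are assumptions made for the example only; other distance functions may suit a given representation better.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two equal-length vectors of real numbers."""
    if len(a) != len(b):
        raise ValueError("samples must have the same number of variables")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def flatten(matrix):
    """Turn a matrix (a list of rows) into a single vector, row by row."""
    return [value for row in matrix for value in row]

# Two hypothetical samples, each described by three real variables.
sample_a = [1.0, 2.0, 3.0]
sample_b = [4.0, 6.0, 3.0]
print(euclidean_distance(sample_a, sample_b))  # 5.0

# Two hypothetical 2x2 gray-scale images, compared pixel by pixel.
image_a = [[0.0, 0.0], [0.0, 0.0]]
image_b = [[1.0, 1.0], [1.0, 1.0]]
print(euclidean_distance(flatten(image_a), flatten(image_b)))  # 2.0
```

Flattening a matrix into a vector is only one way to compare two images; it works here because both matrices have the same size and the pixels are compared position by position.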

In order to understand the importance of data representation, let us consider as an example the different ways a DNA (deoxyribonucleic acid) sequence can be represented. DNA contains the genetic instructions used in the development and the functioning of all living organisms. It consists of two strands that wrap around each other and are held together by chemical bonds. Each strand is a sequence built from 4 kinds of bases, and each base has a complementary base. This means that one strand can be rebuilt by using the information located on the other one. A single sequence of bases is therefore sufficient for representing a DNA molecule. One possible representation is the sequence of the initials of the names of the bases:

A for adenine, C for cytosine, G for guanine and T for thymine. On a computer, a character is represented using the ASCII code, which is stored as an 8-bit code. However, as pointed out in [49], there are more efficient representations. Four names or initials can be coded by 4 integer numbers, for instance 0 for adenine, 1 for cytosine, 2 for guanine and 3 for thymine. These numbers can be represented on computers using a 2-bit code: 00, 01, 10, 11. This code is certainly more efficient than the ASCII code, since it needs one fourth of the bits for representing the same data. Figure 1.2 gives a schematic comparison of the possible representations for the DNA molecules.
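The 2-bit code can be sketched in a few lines. The example below packs four bases into each byte using the integer codes given above; the function names and the bit order inside a byte (first base in the two highest bits) are assumptions made for illustration only.

```python
# Integer codes from the text: 0 adenine, 1 cytosine, 2 guanine, 3 thymine.
BASE_TO_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_TO_BASE = "ACGT"

def pack_dna(sequence):
    """Pack a string over A, C, G, T into bytes, four bases per byte."""
    packed = bytearray((len(sequence) + 3) // 4)
    for i, base in enumerate(sequence):
        shift = 6 - 2 * (i % 4)  # assumed order: first base in the highest bits
        packed[i // 4] |= BASE_TO_CODE[base] << shift
    return bytes(packed)

def unpack_dna(packed, length):
    """Recover the first `length` bases from the packed bytes."""
    bases = []
    for i in range(length):
        shift = 6 - 2 * (i % 4)
        bases.append(CODE_TO_BASE[(packed[i // 4] >> shift) & 0b11])
    return "".join(bases)

sequence = "GATTACAG"
packed = pack_dna(sequence)
print(len(sequence.encode("ascii")), len(packed))  # 8 2
print(unpack_dna(packed, len(sequence)) == sequence)  # True
```

The number of bases has to be stored alongside the packed bytes, because the last byte may be only partially filled when the sequence length is not a multiple of four.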

In living organisms a DNA molecule can be divided into genes. Genes contain the information for coding proteins. Proteins have been studied for many years because of their high importance in biology, and finding out the secrets they still hide is one of the major challenges in modern biology. Because of its relevance, this topic is largely treated in the specialized literature. There is a considerable amount of work dedicated to the protein representation and its conformations. In January 2009, Google Scholar provided more than 6000 papers on "protein folding" published during 2008, and already about 300 papers published in 2009. Just to quote one of them, the work in [115] presents the recent progress for uncovering the secrets of protein folding.

Fig. 1.2 The codes that can be used for representing a DNA sequence.

Even though protein molecules are not specifically studied in agricultural-related fields, we decided to discuss here the different ways a protein conformation can be modeled. This is a very interesting example, because it shows how a single object, the protein, can be modeled in different ways. The model to be used can then be chosen on the basis of the experiments to be performed. In the following, only the spatial conformations that proteins can assume are taken into consideration, leaving out protein chemical and physical features.

Proteins are formed by other smaller molecules called amino acids. Only 20 different amino acids are involved in protein synthesis, and therefore proteins can be built by using only 20 different molecular bricks. Each amino acid consists of a part that is common to all amino acids and a part that characterizes it, which is called the side chain.

The amino acids forming a protein are bonded chemically to each other through the atoms of their common parts. Therefore, a protein can be seen as a chain of amino acids: the sequence of atoms contained in the common parts forms the so-called backbone of the protein, to which the side chains of all the amino acids are attached.

Among the atoms contained in the common part of each amino acid, more importance is given to the carbon atom usually labeled with the symbol Cα. In some models presented in the literature [38, 172, 175], this atom alone is used for representing an entire amino acid in a protein. In this case, protein conformations are represented through the spatial coordinates of n atoms, each of them representing an amino acid. Clearly, these models give a very simplified representation of a protein conformation. In fact, information about the side chains is not included at all, and therefore the model cannot discriminate among the 20 amino acids. However, this representation is able to trace the protein backbone.

More accurate representations of the protein backbones can be obtained if more atoms are considered. If three particular atoms from the common part of each amino acid are considered (two carbon atoms Cα and C and a nitrogen N), then this information is sufficient for rebuilding the whole backbone of the protein. Therefore, a protein backbone can be represented precisely by a sequence of 3n atomic coordinates, where n is the number of amino acids.

This representation is however not much used, because there is another representation of the protein backbones which is much more efficient. A torsion angle can be computed among four consecutive atoms of the sequence of atoms N, Cα and C representing a protein backbone. Then, a corresponding sequence of 3n−3 torsion angles can be computed. This other sequence can be used for representing the protein backbone as well, because the torsion angles can be computed from the atomic coordinates, and vice versa. The representation based on the torsion angles is more efficient, because the protein backbone is represented by using fewer numbers. Indeed, a sequence of 3n atoms is a sequence of 9n coordinates, whereas a sequence of 3n−3 angles is just a sequence of 3n−3 real numbers.
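To make the passage from coordinates to angles concrete, the sketch below computes the torsion angle defined by four consecutive atoms with a standard cross-product formula. The function names and the sign convention are assumptions of this example and are not taken from the cited works. The last lines compare the two counts given above for a hypothetical backbone of n = 100 amino acids.

```python
import math

def sub(a, b):
    """Component-wise difference of two 3D points."""
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    """Scalar product of two 3D vectors."""
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    """Vector product of two 3D vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def torsion_angle(p0, p1, p2, p3):
    """Torsion angle in degrees, in (-180, 180], among four consecutive atoms."""
    b1, b2, b3 = sub(p1, p0), sub(p2, p1), sub(p3, p2)
    n1, n2 = cross(b1, b2), cross(b2, b3)  # normals of the two planes
    x = dot(n1, n2)
    y = math.sqrt(dot(b2, b2)) * dot(b1, n2)
    return math.degrees(math.atan2(y, x))

# Hypothetical coordinates: the fourth atom is rotated a quarter turn
# out of the plane defined by the first three.
print(torsion_angle((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, 0, 1)))

n = 100  # hypothetical number of amino acids
print(9 * n, 3 * n - 3)  # 900 297
```

Going in the opposite direction, from the angles back to the coordinates, additionally requires the bond lengths and bond angles along the backbone, which this sketch does not handle.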

In the applications, the representation based on the sequence of torsion angles is further simplified. The sequence of atoms on the backbone is a continuous repetition of the atoms N, Cα and C. Each quadruplet defining a torsion angle contains two atoms of the same kind that belong to two bonded amino acids. Then, the torsion angles can be divided into 3 groups, depending on the kind of atom that appears twice.

Torsion angles of the same group are usually denoted by the same symbol: the most used symbols are φ, ψ and ω. Statistical analysis of the torsion angle ω showed that its value is rather constant. For this reason, the torsion angles ω are often not considered as variables, so that only 2n−2 real numbers are needed for representing a protein backbone by the sequence of torsion angles φ and ψ. One of the most successful methods for the prediction of protein conformations, ASTROFOLD, uses this efficient representation [130, 131].
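The grouping and the resulting counts can be checked with a short sketch. It enumerates the quadruplets of consecutive backbone atoms and labels each one by the atom that appears twice, following the usual naming convention (φ when C is repeated, ψ when N is repeated, ω when Cα is repeated). This mapping, the atom label "CA" for Cα and the function names are assumptions of the example, not details stated in the text above.

```python
BACKBONE_ATOMS = ("N", "CA", "C")  # "CA" stands for the alpha carbon

# Assumed naming convention: the group is identified by the repeated atom.
GROUP_BY_REPEATED_ATOM = {"C": "phi", "N": "psi", "CA": "omega"}

def backbone_atoms(n):
    """Atom labels along the backbone of a chain of n amino acids."""
    return [atom for _ in range(n) for atom in BACKBONE_ATOMS]

def torsion_groups(n):
    """Group label of each torsion angle defined by four consecutive atoms."""
    atoms = backbone_atoms(n)
    groups = []
    for i in range(len(atoms) - 3):
        quadruplet = atoms[i:i + 4]
        # The first and the last atom of a quadruplet are of the same kind.
        groups.append(GROUP_BY_REPEATED_ATOM[quadruplet[0]])
    return groups

n = 4  # hypothetical number of amino acids
groups = torsion_groups(n)
print(groups[:3])  # ['psi', 'omega', 'phi']
print(len(groups))  # 9, that is 3n-3
print(len([g for g in groups if g != "omega"]))  # 6, that is 2n-2
```

Each of the three groups contains n−1 angles, so dropping the ω group leaves the 2n−2 variables mentioned above.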

Depending on the problem under study, different representations of the protein backbones can be convenient. In the problem studied in [138, 139, 140, 141, 152, 153], for instance, the distances between the atoms of each quadruplet that can be defined on the protein backbone are known. This information is used for computing the cosine of the torsion angle among the atoms of each such quadruplet. Since two angles of opposite sign share the same cosine, a known cosine leaves only two possible values for the torsion angle. If all these values are computed in advance, then the sequence of torsion angles φ and ψ can be replaced by a sequence of binary variables, which take the values 0 and 1 only. In this representation, 2n−2 variables are still needed for representing the protein backbone, but the variables are binary rather than real.
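The step from a known cosine to a binary variable can be sketched as follows. The encoding used here (0 for the non-negative angle, 1 for the negative one) and the function names are arbitrary assumptions of the example; the cited works may use a different convention, and the computation of the cosines from the distances is not shown.

```python
import math

def candidate_angles(cosine):
    """The two torsion angles (in radians) that share the given cosine."""
    magnitude = math.acos(max(-1.0, min(1.0, cosine)))  # clamp rounding noise
    return (magnitude, -magnitude)

def encode(angles):
    """Split torsion angles into their cosines and one binary variable each."""
    cosines = [math.cos(a) for a in angles]
    bits = [0 if a >= 0 else 1 for a in angles]  # assumed: 1 marks a negative angle
    return cosines, bits

def decode(cosines, bits):
    """Rebuild the torsion angles from the cosines and the binary variables."""
    if len(cosines) != len(bits):
        raise ValueError("one binary variable is needed per cosine")
    return [candidate_angles(c)[b] for c, b in zip(cosines, bits)]

# Hypothetical torsion angles, in radians.
angles = [1.0, -2.0, 0.5]
cosines, bits = encode(angles)
print(bits)  # [0, 1, 0]
print(decode(cosines, bits))
```

Once the cosines are fixed, a conformation is identified by the list of bits alone, which is why the search over real-valued angles turns into a search over binary sequences.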

The representation of entire protein conformations is more complex. The full-atom representation consists of the spatial coordinates of all the atoms of the protein. Even though some of the atoms can be omitted because their coordinates can be computed from others, the full-atom representation remains too complex, especially for large proteins. Another possibility is to represent the protein backbone with the φ and ψ torsion angles, and to represent each side chain through suitable torsion angles χ that can be defined on it. A protein molecule can contain 20 different amino acids, and therefore 20 different sets of torsion angles χ need to be defined, each of them tailored to the shape of the corresponding side chain.

Figure 1.3 shows three possible representations of myoglobin, a very important protein. On the left, the full-atom representation of the protein is shown: atoms having a different color or gray scale refer to different kinds of atoms. In the middle, the same representation is presented, where all the atoms related to the side chains are omitted. The figure gives an idea of how many more atoms need to be considered when the information about the side chains is also included. Finally, on the right, the path followed by the protein backbone is shown, which can be identified through the sequence of torsion angles φ and ψ. Note that we did not include the representation of the protein backbone as a sequence of binary variables, because it would just be a sequence of numbers 0 and 1. The conformation of the protein in Figure 1.3 has been downloaded from the Protein Data Bank (PDB) [18, 186], a public Web database of protein conformations.

Fig. 1.3 Three representations for protein molecules. From left to right: the full-atom representation of the whole protein, the representation of the atoms of the backbone only, and the representation through the torsion angles φ and ψ.

Depending on the problem to be solved, one representation can be more convenient than the others. For instance, in [175], the protein backbones are represented by the trace of the Cα carbon atoms, because the considered model is based on the relative distances between such Cα atoms. The model is used for simulating protein conformations. In [131], the sequence of torsion angles is instead used, because the aim is to predict the conformation of proteins starting from their chemical composition. The complexity of this problem calls for a representation where the maximum amount of information is stored by using the minimum number of variables. Finally, in [139], the molecular distance geometry problem is to be solved. In this case, some of the distances between the atoms of the protein backbone are known, and the coordinates of such atoms must be computed. By using the information on the distances, the representation can be simplified to a sequence of binary variables. In this way, the complexity of the problem decreases, and it can then be solved efficiently. Protein molecules have also been studied by using data mining techniques. Recent papers on this topic are, for instance, [47, 107, 242].

From the document DATA MINING IN AGRICULTURE (pages 25-29)