

Algorithm 7 Motifs forest algorithm
Require: A learning set of instances L
Ensure: A motif forest M*
1: M* ← ∅
2: for i = 1 to k do
3:     Li ← bootstrap a sample from L
4:     Mi ← build a motifs-tree from Li
5:     M* ← M* ∪ Mi
6: end for
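Algorithm 7 can be sketched in a few lines of Python. This is a minimal illustration, not the thesis implementation: `build_motifs_tree` is a hypothetical stand-in for the genetic-algorithm-based tree learner, and all names are illustrative.

```python
import random

def build_motifs_tree(sample):
    # Stand-in for the motifs-tree learner (hypothetical here); the real
    # learner runs a genetic algorithm to find a motif at each split.
    return {"trained_on": len(sample)}

def build_motifs_forest(instances, k, bag_fraction=1.0, seed=0):
    """Algorithm 7: draw k bootstrap samples from L, build one motifs-tree
    per sample, and return their union as the forest M*."""
    rng = random.Random(seed)
    n = max(1, int(bag_fraction * len(instances)))
    forest = []
    for _ in range(k):
        bag = [rng.choice(instances) for _ in range(n)]  # with replacement
        forest.append(build_motifs_tree(bag))
    return forest

forest = build_motifs_forest(list(range(100)), k=9, bag_fraction=0.6)
print(len(forest), forest[0]["trained_on"])  # 9 60
```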

subset to build the test node. For each test node construction, a new subset of features is randomly selected. Otherwise the growth of the tree is similar to the C4.5 algorithm. In the motifs forest, only bagging is applied to the motifs-trees, but this ensemble approach makes it close to a random forest. Indeed, for a motifs-tree, each split during the learning phase can also be seen as a search for the best feature in a random subset of features.

Considering that the feature space is the set of all possible motifs, the feature selection method of C4.5 was not applied due to the size of this space. The genetic algorithm setup used in motifs-tree learning is in effect a search in a random subset of the possible motifs because:

— the initial population on which all genetic operators will be applied is randomly drawn and is of finite size;

— the number of generations is bounded by a parameter, and the value used is much lower than the number of applications of genetic operators needed to generate all the motifs reachable from the initial population (i.e. not an exhaustive search).

So, the search is bound to a subset of motifs defined by the initial population, the number of generations and the genetic operators used⁶. In other words, using bagging with motifs-trees is similar to a random forest with trees built on features drawn from the motifs space.
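The boundedness argument above can be made concrete with a toy genetic search. This sketch is only an illustration of the two bounds (finite random initial population, capped generation count); the alphabet, motif length, fitness and mutation operator are all made up and much simpler than the actual setup of chapter 7.

```python
import random

ALPHABET = "ACDE"  # toy alphabet; real motifs are far richer

def random_motif(rng):
    return "".join(rng.choice(ALPHABET) for _ in range(5))

def mutate(motif, rng):
    # Toy operator: replace one position with a random letter.
    i = rng.randrange(len(motif))
    return motif[:i] + rng.choice(ALPHABET) + motif[i + 1:]

def bounded_ga_search(fitness, pop_size=50, generations=30, seed=0):
    """Only motifs derivable from the finite random initial population
    within `generations` operator applications can ever be visited."""
    rng = random.Random(seed)
    population = [random_motif(rng) for _ in range(pop_size)]
    for _ in range(generations):
        parents = sorted(population, key=fitness, reverse=True)[: pop_size // 2]
        children = [mutate(rng.choice(parents), rng)
                    for _ in range(pop_size - len(parents))]
        population = parents + children  # elitist: best motifs survive
    return max(population, key=fitness)

best = bounded_ga_search(lambda m: m.count("A"))
```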

8.4.2 Classification performances of the motifs forest

As often observed with a bagging or random forest approach, the ensemble classifier yields better predictive performance. However, the improvements obtained when evaluating with the 2012 dataset are not impressive. The complete cross-validated performances are provided in table 49. The parameters used to learn the forest are:

— initial instances set (the one used with bagging): this is the same set as the one used to train a single motifs-tree;

— number of motifs-trees: 99;

— bagging size: 0.6;

— the genetic algorithm parameters are the same as the ones used for a single motifs-tree.
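With these parameters set, the forest still needs a combination rule at prediction time. The snippet below assumes the standard bagging choice, a majority vote over the trees; the chapter does not spell out the aggregation rule, so this is an assumption, and the constant "trees" are purely illustrative.

```python
from collections import Counter

def forest_predict(forest, instance):
    """Combine tree predictions by majority vote (assumed aggregation,
    the standard one for bagging). An odd tree count such as the 99
    used above avoids ties on a binary label."""
    votes = Counter(tree(instance) for tree in forest)
    return votes.most_common(1)[0][0]

# Toy forest of constant "trees" (illustration only):
toy_forest = ([lambda seq: "acetylated"] * 60
              + [lambda seq: "not acetylated"] * 39)
print(forest_predict(toy_forest, "MAS..."))  # acetylated
```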

Moreover, an obvious drawback of this approach is that the white box property of the motifs-tree vanishes. Analysing a single motifs-tree is not always easy, and combining the trees in an ensemble makes it very difficult to extract which features in an instance allow the prediction to occur. Therefore the use of ensemble learning as described here is not the best way to improve performances.

6. However, all the possible motifs can be generated if the evolution runs for a sufficiently long amount of time.


Table 49: Cross-validated performances of the ensemble learning method on the 2012 datasets. MCC means Matthews correlation coefficient. The provided gain is the increase in MCC versus the cross-validated performances of a single motifs-tree.

Dataset      Accuracy  Sensitivity  Specificity  MCC   MCC gain
Eukaryota    0.90      0.95         0.81         0.78  +0.01
Metazoa      0.91      0.97         0.76         0.76  +0.04
H. sapiens   0.94      0.99         0.59         0.69  +0.06
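The MCC values reported in table 49 follow from the confusion matrix by the standard formula. A small reference implementation, with made-up counts chosen only to illustrate a high-sensitivity, lower-specificity regime:

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient: +1 perfect, 0 random, -1 inverse."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Illustrative counts (not from the thesis datasets):
print(round(mcc(tp=95, tn=81, fp=19, fn=5), 2))  # 0.77
```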

8.5 Conclusion and perspective

In this chapter we demonstrated that motifs-trees can be used to build a good predictor for Nα-terminal acetylation. Moreover, in comparison to the tools currently published and available, we propose a new state-of-the-art approach to predict Nα-terminal acetylation. There are mainly two available predictors: NetAcet and the previous state of the art, TermiNator3. Regarding NetAcet, its limitations make the comparison pointless, as it works only on potential NatA substrates and only for S. cerevisiae (section 5.2).

When compared with the state of the art, namely TermiNator3, we obtain comparable performances regarding generalization. However, the motifs used in TermiNator3 seem to need an update: when tested with the 2015 datasets, the performances of that predictor drop. But finding new motifs without a machine learning approach in a set of more than 6 000 proteins is a hard task. With the 2015 Eukaryota dataset, the motifs-trees still produce good results, with the same generalization performances as with the 2012 dataset. The comparison of the N-terminal methionine cleavage prediction is not discussed because both the motifs-trees and TermiNator3 produce excellent results.

Another important contribution is the proposed criteria to build a clean set of Nα-terminally acetylated proteins and, especially, of non-acetylated proteins. It is crucial for supervised learning to have a clean dataset: that is to say, having proteins correctly labeled as Nα-acetylated, which is easy information to extract from UniProtKB, and having proteins without the acetylation annotation that are really not acetylated. Obtaining those non-acetylated proteins is the hard part of building a dataset. We hope that the criteria we have proposed will be used, and refined if needed, by other groups that tackle the problem of Nα-terminal acetylation prediction.

We have also tried to produce a white box classifier by using motifs. To describe a sequence, a regular-expression-like descriptor seems to be the easiest model to use. In the case of the human initiator methionine cleavage we succeeded in extracting features from the motifs in the tree. Regarding Nα-terminal acetylation it was more difficult, even for the short H. sapiens tree. We realized that aligned motifs are able to detect very subtle features in the sequence, but this can make them difficult to read or interpret. We think that there are several potential improvements. First, the motifs are sometimes too complicated; this may be improved by allowing another plague operator (see chapter 7.4) that can slightly reduce the fitness of the individual. It can be done by accepting the modification of the operator on the motif if the decrease in fitness stays under a given percentage of the fitness of the unmodified individual.

Another variant can be applied in a tree pruning manner. Usually a tree is pruned in a bottom-up fashion: for each node, we estimate the error made when the node is converted into a leaf against the error when the subtree is conserved. This principle and estimation can be applied to a plague operator. At the end of the growth phase, we traverse the tree in a bottom-up manner and at each node we apply a plague operator that decreases the fitness of the individual. We thus have two subtrees: one with the unaltered motif as the root node and one with the altered motif. Then the error estimation is applied to both subtrees, and the subtree to keep is selected in the same manner as the decision to replace a subtree by a leaf is made in pessimistic pruning [Quinlan, 1987]. Such a step may have the advantage of producing trees that generalize better and are more readable by a human.
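The selection step of this proposed variant could be sketched as follows. This is only a sketch of the idea under simplifying assumptions: the plague operator and the tree traversal are left abstract, and the continuity-corrected estimate below is a simplification in the spirit of Quinlan (1987), whose full procedure also accounts for the number of leaves in the subtree.

```python
def pessimistic_error(misclassified, n):
    """Continuity-corrected error estimate, (E + 0.5) / N, a simplified
    form of the pessimistic estimate used by Quinlan (1987)."""
    return (misclassified + 0.5) / n

def keep_altered(err_unaltered, err_altered, n):
    """Keep the plague-simplified subtree when its pessimistic error is
    not worse than the unaltered subtree's, on the same n instances."""
    return pessimistic_error(err_altered, n) <= pessimistic_error(err_unaltered, n)

print(keep_altered(10, 10, 100))  # True: equal error, the simpler motif wins
print(keep_altered(10, 20, 100))  # False: the alteration costs too much
```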

Part IV

CONCLUSION
