

Algorithm 6 Motifs-tree growth

Require: a set of examples I = {(s_i, c_j)}, where s_i is a sequence and c_j is one of the two class labels. The stop function is an implementation of a stop criterion depending on the instance set and the depth of the tree.

Ensure: the root of a decision tree.

function partition(I, m, τ)
    L, R ← ∅, ∅
    for (x, y) ∈ I do
        s ← the alignment score between m and x (equation 93)
        if s ≤ τ then
            L ← L ∪ {(x, y)}
        else
            R ← R ∪ {(x, y)}
        end if
    end for
    return (L, R)
end function

function leaf(I)
    K ← the most frequent class in I
    return a leaf predicting K
end function

function grow(I, d)
    if stop(I, d) then
        return leaf(I)
    else
        (m, τ) ← a motif searched by GA on I
        (L, R) ← partition(I, m, τ)
        l ← grow(L, d + 1), the child when a sequence matches m
        r ← grow(R, d + 1), the child when a sequence does not match m
        return a node using the motif (m, τ) with l and r as children
    end if
end function

root ← grow(I, 0)
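The recursion of Algorithm 6 can be sketched in Python as follows. This is only an illustrative sketch: the alignment score (equation 93), the genetic algorithm search and the stop criterion are passed in as functions, and the `toy_*` stand-ins, the `Node` class and the `predict` helper are hypothetical names introduced here, not part of the original implementation. A guard against empty splits is also added for the sketch.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    prediction: Optional[int] = None   # set on leaves only
    motif: Optional[object] = None     # set on internal nodes only
    threshold: float = 0.0
    left: Optional["Node"] = None      # examples with score <= threshold
    right: Optional["Node"] = None     # examples with score > threshold


def partition(examples, motif, threshold, score):
    """Split the examples on the alignment score obtained with the motif."""
    left = [(x, y) for x, y in examples if score(motif, x) <= threshold]
    right = [(x, y) for x, y in examples if score(motif, x) > threshold]
    return left, right


def leaf(examples):
    """Build a leaf predicting the most frequent class of the examples."""
    majority = Counter(y for _, y in examples).most_common(1)[0][0]
    return Node(prediction=majority)


def grow(examples, depth, stop, search_motif, score):
    """Recursively grow the tree, as in the grow function of Algorithm 6."""
    if stop(examples, depth):
        return leaf(examples)
    motif, threshold = search_motif(examples)
    left, right = partition(examples, motif, threshold, score)
    if not left or not right:  # guard added for this sketch: useless split
        return leaf(examples)
    return Node(motif=motif, threshold=threshold,
                left=grow(left, depth + 1, stop, search_motif, score),
                right=grow(right, depth + 1, stop, search_motif, score))


def predict(node, sequence, score):
    """Follow the branches down to a leaf and return its class."""
    while node.prediction is None:
        if score(node.motif, sequence) <= node.threshold:
            node = node.left
        else:
            node = node.right
    return node.prediction


# Hypothetical stand-ins, only there to make the sketch runnable.
def toy_score(motif, sequence):
    return 1.0 if sequence[1] in motif else 0.0


def toy_search(examples):
    return "ACGPST", 0.5  # a fixed "motif" instead of a GA search


def toy_stop(examples, depth):
    return depth >= 1 or len({y for _, y in examples}) == 1
```

With the toy stand-ins, `grow(data, 0, toy_stop, toy_search, toy_score)` returns a one-node tree whose right child collects the sequences scoring above the threshold.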

8 Motifs-trees performances and proteomic analysis for H. sapiens

In this chapter, the performances of the initiator methionine cleavage and Nα-terminal acetylation predictions are presented and discussed, along with an analysis of these two post-translational modifications in H. sapiens with the help of the generated motifs-trees. The performance and analysis are based on the 2012 dataset (see chapter 6).

8.1 Initiator methionine cleavage

We start by showing the classification scores obtained for the initiator methionine cleavage. As stated in section 1.4, the Nα-terminal acetylation may occur either on the initiator methionine or on the first exposed amino acid when this methionine is cleaved. Therefore, for a given protein, we need either the cleavage information on its initiator methionine or a tool to predict it. We decided to use the motifs-trees to predict initiator methionine cleavage, although this problem has already been tackled and its prediction is already quite successful [Frottin et al., 2006, Martinez et al., 2008]. We also use this post-translational modification as a validation, as it is well characterized in the human proteome. This allows us to assess whether the motifs-trees can be regarded as white-box models.

8.1.1 Parameters selection and classification performance

The setup of the algorithm is detailed in table 31. We tested a range of parameters by scanning. We obtained similar classification performances and there was no interest in tuning them. The search for a good discriminant motif does not seem to be sensitive to these parameters, probably because we are not using a single motif but combining several in a tree. Indeed, if a path in the tree does not provide enough performance, another motif is added, so the new motif compensates for the errors of the previous one. However, one parameter does impact the quality of the prediction: the number of considered N-terminal residues (the number of amino acids parameter). The original choice for this parameter is six, the size that disambiguates the 2012 Nα-terminal acetylation Eukaryota dataset (see footnote 1). Indeed, even if we have 2558 different proteins in the Eukaryota 2012 dataset, considering only the first n residues reduces the number of possible polypeptides. For example, with only the first two residues we theoretically have 400 (20^2) possible peptides. This produces a dataset with a lot of repetitions and ambiguities: with so few possible peptides, some combinations will certainly appear in both the Nα-acetylated and the non-Nα-acetylated examples of the training set. Under this condition it is very unlikely that the algorithm can find a correct relationship between the sequences and the acetylation. Hence such ambiguities are removed from the dataset.
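As an illustration of this ambiguity argument, the following sketch looks for the N-terminal prefixes that appear with both class labels and for the smallest prefix length at which no such prefix remains. The function names and the toy sequences are hypothetical and only serve the example; the actual dataset preparation may differ.

```python
from collections import defaultdict


def ambiguous_prefixes(examples, n):
    """Return the n-residue prefixes that occur with more than one class label."""
    labels = defaultdict(set)
    for sequence, label in examples:
        labels[sequence[:n]].add(label)
    return {prefix for prefix, seen in labels.items() if len(seen) > 1}


def smallest_unambiguous_length(examples, max_length):
    """Return the smallest prefix length without ambiguity, or None."""
    for n in range(1, max_length + 1):
        if not ambiguous_prefixes(examples, n):
            return n
    return None


# Toy examples (hypothetical): 1 = acetylated, 0 = not acetylated.
toy = [("MASKLT", 1), ("MASKQT", 0), ("MKLLVV", 0)]
print(20 ** 2)                              # number of possible dipeptides
print(ambiguous_prefixes(toy, 4))           # prefixes seen in both classes
print(smallest_unambiguous_length(toy, 6))  # first length that separates them
```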

1. Originally the goal was to predict Nα-terminal acetylation, which is why the parameter selection is mainly based on the Nα-terminal acetylation datasets.

So the first justification for our choice of the number of residues is that it makes all the Nα-acetylated sequences different from the non-Nα-acetylated sequences. With six residues there are enough possible peptides (between 20^5 and 20^6, see footnote 2). This lower bound for the number of residues is a legitimate choice in order to have a clean dataset. However, we can question the upper bound, because the amino acids that are further in the sequence may influence the acetylation of the protein by NATs. To select this parameter we apply a nested cross-validation on the 2012 Nα-terminal acetylation datasets with the following values: 6, 7, 8, 9, 10, 15, 20 and 40 residues. The nested cross-validation is composed of an outer and an inner cross-validation. The outer cross-validation is composed of k folds F_i = {L_i, T_i}, with i = 1, ..., k, built on the available dataset. Then:

1. For a given F_i, the training (or learning) set L_i is used to build k′ folds F′_j = {L′_j, T′_j}, with j = 1, ..., k′.

2. Each parameter value (or combination of values if there is more than one parameter) is trained and tested on the F′_j (the inner step).

3. The parameter value that minimizes the error is selected to be trained on L_i and tested on T_i (the outer step).

4. These steps are repeated for all F_i.

5. When all F_i have been processed, the scores on the T_i are aggregated.
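The five steps above can be sketched as follows. This is a minimal sketch under stated assumptions: `train` and `error` are user-supplied functions (hypothetical names), the folds are built by simple shuffling rather than stratification, and the outer scores are aggregated by a plain mean, which may differ from the aggregation actually used.

```python
import random
from statistics import mean


def k_folds(items, k, rng):
    """Shuffle the items and return k (learning set, test set) pairs."""
    shuffled = items[:]
    rng.shuffle(shuffled)
    folds = []
    for i in range(k):
        test = shuffled[i::k]
        learn = [x for j, x in enumerate(shuffled) if j % k != i]
        folds.append((learn, test))
    return folds


def nested_cv(data, values, train, error, k=10, k_inner=10, seed=0):
    """Return the mean outer error and the value selected in each outer fold."""
    rng = random.Random(seed)
    outer_errors, selected = [], []
    for learn, test in k_folds(data, k, rng):      # outer folds F_i
        inner = k_folds(learn, k_inner, rng)       # inner folds F'_j (step 1)
        # Steps 2 and 3: keep the value with the lowest mean inner error.
        best = min(values,
                   key=lambda v: mean(error(train(l, v), t) for l, t in inner))
        selected.append(best)
        outer_errors.append(error(train(learn, best), test))
    return mean(outer_errors), selected            # step 5


# Toy problem (hypothetical): the label is 1 when x >= 7 and the "model"
# is a threshold classifier whose only parameter is the candidate value.
def toy_train(learn, value):
    return lambda x: int(x >= value)


def toy_error(model, test):
    return sum(model(x) != y for x, y in test) / len(test)


toy_data = [(x, int(x >= 7)) for x in range(20)]
```

As in the text, nothing forces the selected value to be the same in every outer fold, which is why the function also returns the list of selected values.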

It should be noted that the best selected value is not necessarily the same across the inner cross-validations. Thus the F_i are not all evaluated with the same number of N-terminal residues. We applied the outer cross-validation with k = 10 and the inner cross-validation with k′ = 10. This procedure provides a good estimation of the error of the algorithm. The results are reported in table 32 and are above the baseline for each taxon.

The selected lengths after the inner cross-validation phase are:

- for H. sapiens: 6 (six times), 7 (three times) and 8 (once);

- for Metazoa: 6 (four times), 7 (twice), 8 (once), 9 (twice) and 10 (once);

- for Eukaryota: 6 (six times), 7 (once), 8 (once) and 9 (twice).

Most of the time a length of 6 residues offers the best performance. However, to select the best parameter to build a model, these values are evaluated with another 10-fold cross-validation, and the length minimizing the error is the selected value for the parameter. From these results we already see that there is no need to use more than 10 residues: the residues that follow add noise to the data and the algorithm loses its capacity to generalize. In the case of the Nα-terminal acetylation the selected value is 6 and is confirmed by cross-validation (see the dedicated section 8.2.1 and table 34). The same parameter is used for initiator methionine cleavage.

Two potential problems arise from the algorithm we used to build our classifier. The first is common to all machine learning algorithms and is the lack of generalization (see chapter 5 or Hastie et al. [2001]). The second problem is the stability of our model, that is to say the consistency of the results despite the stochastic nature of the genetic algorithm. Indeed, we have no guarantee that every evolution will converge to a good solution. To evaluate the stability we simply applied 10 independent stratified 10-fold cross-validations on our datasets, combining the cross-validation results to obtain the average and the standard deviation. So if the cross-validation results have a high average classification score with low standard deviations,

2. These bounds arise because methionine and alanine are frequently found as the first residue.


Table 31: Parameters used to build the motifs-trees. It groups the parameters of the genetic algorithm, plague, individual, alignment and decision tree. These parameters are used for the motifs-trees predicting initiator methionine cleavage and Nα-terminal acetylation for the three considered taxa.

Parameter                 Value     Parameter                Value
Tournament size           5         Population size          250
Max generations           150       Mutation probability     0.75
Number of plague remove   20        Number of plague clean   100
Number of amino acids     6         Gap penalty              -0.0625
Pruning factor (α)        0.5       Bucket size              6

Table 32: Results obtained by outer and inner cross-validation (k = 10, k′ = 10) assessing the quality of the algorithm in predicting Nα-terminal acetylation on the 2012 datasets. BL stands for baseline and MCC for Matthews correlation coefficient. For the list of selected N-terminus lengths, see text.

Taxon        BL     Accuracy   Sensitivity   Specificity   MCC
Eukaryota    0.63   0.84       0.87          0.79          0.66
Metazoa      0.71   0.86       0.90          0.77          0.67
H. sapiens   0.86   0.88       0.93          0.60          0.54

the method is stable and produces classifiers with a good generalization capability. As we only take the first six residues, several proteins can be represented by the same six amino acids. To prevent instances of the training (or learning) set from also appearing in the test set, we removed those duplicates from the training set of each fold. This ensures that the test set contains only sequences never seen during the learning phase. The prediction results are detailed in table 33. They are very good, and we stress that the standard deviation of the score between the 10 cross-validations was ≈ 10^-4. However, we must pay attention to the accuracy values. Since the classes of our training set are imbalanced, a trivial classifier could easily reach a high accuracy (table 27). For instance, 72% of our eukaryotic proteins undergo initiator methionine cleavage in our dataset, so a trivial majority classifier that predicts all proteins as cleaved obtains an accuracy of 0.72. Nevertheless the motifs-trees greatly improve the accuracy over the baseline. For H. sapiens the baseline is 0.69 and the accuracy improvement is 0.26; for Metazoa they are 0.72 and 0.22; and for Eukaryota they are 0.72 and 0.21. These performances encourage us to produce a model for human initiator methionine cleavage prediction based on the complete human dataset. The purpose is to produce a MetAP specificity analysis based on the motifs-tree.
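The removal of duplicates between the training and test sets of a fold, mentioned in this section, could look like the sketch below. The function name and the toy sequences are hypothetical; the only assumption taken from the text is that a protein is represented by its first six residues.

```python
def remove_test_duplicates(train, test, n=6):
    """Drop the training examples whose first n residues occur in the test set."""
    test_prefixes = {sequence[:n] for sequence, _ in test}
    return [(s, y) for s, y in train if s[:n] not in test_prefixes]


# Toy fold (hypothetical sequences): the first training example shares its
# six N-terminal residues with a test example and is therefore discarded.
toy_train_set = [("MASKLTQQ", 1), ("MKLLVVAA", 0), ("MSTPRRAA", 1)]
toy_test_set = [("MASKLTEE", 1)]
cleaned = remove_test_duplicates(toy_train_set, toy_test_set)
print(cleaned)
```

Filtering the training set rather than the test set keeps every test sequence in the evaluation, which matches the statement that the test set only contains sequences never seen during learning.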

8.1.2 Human MetAPs specificity analysis

The human dataset is the only dataset we have that is dedicated to a single organism. So, to avoid the problem of homologous proteins in other organisms, we focus on H. sapiens. The analyzed tree is the product of learning on the full dataset. We point out that the motifs found during different training runs are close and are combined in similar trees. The tree is illustrated in figure 70.

Figure 70: The motifs-tree predicting the initiator methionine cleavage from the H. sapiens 2012 dataset. This model has been built on the full H. sapiens dataset. Each node of the tree is represented by the motif used for its test. Leaves are represented using a sequence logo made with all the sequences ending in that leaf. The label under a leaf specifies the class corresponding to the prediction made at the leaf and its accuracy on the human dataset (this cannot be used to infer its error on unseen proteins, see table 33). The initiator methionine is always present in all sequences, but is not displayed in the logo as it does not provide any information. Moreover each sequence logo is rescaled according to its highest value (the maximum being 4.32 bits). The branches are labeled with the alignment score condition required on the test to follow the path indicated by the branch. The sequence logo on top illustrates the composition of the H. sapiens proteins extracted from UniProtKB (table 27).


Table 33: Results assessing the quality of the initiator methionine cleavage prediction on the 2012 dataset. Score values are the mean over 10 independent stratified cross-validations, each made with 10 folds. MCC means Matthews correlation coefficient. The score standard deviation between the 10 cross-validations is ≈ 10^-4 for each taxon.

Taxon        Baseline   Accuracy   Sensitivity   Specificity   MCC
Eukaryota    0.72       0.93       0.95          0.89          0.83
Metazoa      0.72       0.94       0.96          0.91          0.86
H. sapiens   0.69       0.95       0.96          0.93          0.89
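For reference, the scores reported in table 33 follow the usual definitions computed from a confusion matrix, with the baseline being the accuracy of a majority-class predictor. The sketch below uses made-up counts chosen only for illustration; they are not the counts behind the table.

```python
import math


def scores(tp, tn, fp, fn):
    """Usual binary classification scores computed from a confusion matrix."""
    total = tp + tn + fp + fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "baseline": max(tp + fn, tn + fp) / total,  # majority-class accuracy
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        # By convention the MCC is set to 0 when a marginal count is empty.
        "mcc": (tp * tn - fp * fn) / denominator if denominator else 0.0,
    }


# Hypothetical counts: 72 cleaved and 28 uncleaved proteins.
model = scores(tp=69, tn=26, fp=2, fn=3)
trivial = scores(tp=72, tn=0, fp=28, fn=0)  # predicts "cleaved" everywhere
print(model)
print(trivial)
```

The trivial classifier reaches the baseline accuracy but has a specificity and an MCC of zero, which is why accuracy alone is not sufficient on imbalanced classes.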

We analyze the discovered motifs to propose assumptions about the substrates of the enzymes catalyzing the post-translational modification. We study how sequences are split at each node and we try to extract the features that separate the two sets of sequences induced by the split. Note that there are two genes encoding MetAPs in human, MetAP1 and MetAP2 [Bradshaw et al., 1998], but the information about which enzyme catalyzes the cleavage is not known and is not taken into consideration in the model.

First of all, we see that the motifs-tree is composed of three motifs, each of them leading to at least one leaf (i.e. a predicted class):

- the sequences that do not contain the signal described by the first motif are classified as not undergoing the initiator methionine cleavage;

- the sequences containing the signals of both of the first two motifs are classified as undergoing the initiator methionine cleavage;

- the sequences reaching the last node are classified as not undergoing the initiator methionine cleavage if the signal of the third motif is detected in the sequence.

Hence a match on the first two motifs induces cleavage of the initiator methionine. A match on the first two motifs and a match on the last motif do not induce cleavage. To understand which features are exploited by the motifs-tree to discriminate the sequences, we focus on the first node. The first motif is described by the following token sequence:

    • {S,CHAM830104} {S,CHAM830104} • {F,GARJ730101} • •    (98)

and the analysis is split into two steps:

1. score analysis and identification of the discriminant tokens in the motif;

2. identification of the positions of interest in the amino acid sequences.

We begin by identifying the discriminant tokens (step 1). To do so, we compute the average motif score profile. The profile is computed for a set of sequences aligned on a motif. For a given alignment, each token contributes to the alignment score either through its similarity with the aligned amino acid or by being gapped. If the contributions of each token over all the sequences are summed and normalized, we obtain an average motif score profile.

Formally, let m = (t_1, t_2, ..., t_k) be a k-token motif, and S = {s_j} a set of amino acid sequences, s_j = (a_{j1}, a_{j2}, ..., a_{jn}). The profile of m on all the sequences in S is a vector c = (c_1, c_2, ..., c_k), whose components c_i are given by:

    c_i = \frac{1}{|S|} \sum_{j=1}^{|S|} \sigma(t_i, x_{ji})    (99)

where x_j is the aligned sequence, i.e. s_j with the alignment gaps. So x_{ji} is the i-th symbol of the sequence j aligned with m. It can be either an amino acid or a gap (σ with a gap always equals the gap penalty γ). To identify the discriminant tokens in the motif, we compute the profiles for the sequences following the left (c_l) and the right (c_r) branch and plot the difference c_r − c_l. A positive difference points to a token that increases, on average, the score of the sequences following the right branch; a negative difference points to a token that increases, on average, the score of the sequences following the left branch. As we want to identify the features contributing to the signal strength, we are interested in the positive differences. In the case of the first motif, the profile difference emphasizes the discriminant power of the tokens at positions 2 and 3 in the motif (figure 71 (a)). The two tokens are the same, namely the token {S,CHAM830104}.
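Equation 99 and the profile difference can be illustrated with the sketch below. It assumes that each aligned sequence is given as a list with one entry per token (an amino acid, or `None` when the token is gapped) and that the similarity σ is supplied as a function; the toy tokens, which are simple sets of amino acids, are hypothetical and much simpler than the tokens of the motifs-trees.

```python
GAP = None  # marker used in this sketch for a gapped token


def score_profile(aligned, motif, sigma, gap_penalty):
    """Average contribution c_i of each token over the aligned sequences."""
    totals = [0.0] * len(motif)
    for x in aligned:
        for i, (token, symbol) in enumerate(zip(motif, x)):
            totals[i] += gap_penalty if symbol is GAP else sigma(token, symbol)
    return [total / len(aligned) for total in totals]


def profile_difference(right, left, motif, sigma, gap_penalty):
    """Difference c_r - c_l between the right and left branch profiles."""
    c_r = score_profile(right, motif, sigma, gap_penalty)
    c_l = score_profile(left, motif, sigma, gap_penalty)
    return [r - l for r, l in zip(c_r, c_l)]


# Toy setting (hypothetical): a token is a set of accepted amino acids.
def toy_sigma(token, amino_acid):
    return 1.0 if amino_acid in token else 0.0


toy_motif = [set("AS"), set("K")]
toy_right = [["A", "K"], ["S", GAP]]
toy_left = [["L", "K"], ["V", "K"]]
difference = profile_difference(toy_right, toy_left, toy_motif,
                                toy_sigma, gap_penalty=-0.0625)
print(difference)  # a positive entry marks a token favoring the right branch
```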

This property is interesting because it gives the maximum similarity (i.e. 1.0) between Ser and the following amino acids: Ala (A), Cys (C), Gly (G), Pro (P), Thr (T) and, obviously, Ser (S). It gives a similarity of 0.5 between Ser and the other amino acids, except Leu (L), with which it has no similarity. So it clearly promotes the presence of A, C, G, P, S and T. Regarding the amino acids producing a similarity of 0.5, it is interesting to note that the threshold is 5.4375, which is the maximum alignment score possible with the motif minus 0.5. So the use of this property in the first motif acts as a selector for the amino acids having a similarity score of 1.
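The value of the threshold can be checked with a short computation. The sketch assumes that each of the six residues contributes at most 1.0 to the alignment score and that, the motif having seven tokens, exactly one token is gapped at the gap penalty of table 31; these assumptions are ours and are only meant to show where 5.4375 may come from.

```python
residues = 6            # number of amino acids parameter (table 31)
tokens = 7              # number of tokens of the first motif (equation 98)
gap_penalty = -0.0625   # gap penalty (table 31)

# Assumed best case: every residue scores 1.0 and one token is gapped.
max_score = residues * 1.0 + (tokens - residues) * gap_penalty
threshold = max_score - 0.5

# A single residue with a similarity of 0.5 brings the score down to the
# threshold, so the sequence does not exceed it.
score_with_one_half = max_score - (1.0 - 0.5)

print(max_score, threshold, score_with_one_half > threshold)
```

Under these assumptions, only sequences whose aligned residues all reach a similarity of 1 score strictly above the threshold, which is consistent with the selector behavior described in the text.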

Now that we have identified two tokens having an impact on the alignment score, we must identify where, in the protein sequence, the specificity induced by the token is discriminant (step 2). To do so, we rely on a plot showing how many times a token i is aligned with the residue at position j of the sequences following the right branch. This histogram shows that the two tokens of interest are mainly aligned with the second amino acid (the one immediately after the initiator methionine) and, to a lesser extent, with the third amino acid (figure 71 (b)).
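A normalized histogram of aligned positions of this kind could be computed as in the sketch below. It assumes that each alignment is described by a list giving, for every token, the 0-based index of the residue it is aligned with, or `None` when the token is gapped; this representation and the toy alignments are hypothetical.

```python
def aligned_position_histogram(alignments, n_tokens, n_residues):
    """Fraction of alignments in which token i is aligned with residue j."""
    counts = [[0] * n_residues for _ in range(n_tokens)]
    for alignment in alignments:
        for token_index, residue_index in enumerate(alignment):
            if residue_index is not None:  # gapped tokens are not counted
                counts[token_index][residue_index] += 1
    total = len(alignments)
    return [[count / total for count in row] for row in counts]


# Toy alignments (hypothetical) for a 2-token motif and 2-residue sequences.
toy_alignments = [[0, 1], [None, 1], [0, None], [1, None], [0, 1]]
histogram = aligned_position_histogram(toy_alignments, n_tokens=2, n_residues=2)
stack_heights = [sum(row) for row in histogram]
print(histogram)
print(stack_heights)  # a height below 1.0 means the token is sometimes gapped
```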

This rough analysis allows us to conclude that this node splits the protein set based on the presence of an alanine, cysteine, glycine, proline or serine immediately after the initiator methionine. Moreover, as this node leads to a leaf for the sequences in which the signal is not detected, we can observe that the proteins not having those amino acids at the second position do not undergo the initiator methionine cleavage. Therefore the following rule can be proposed: if a sequence starts with M¬[ACGPSTV], the methionine is not cleaved. This has been experimentally observed [Burstein and Schechter, 1978, Meinnel et al., 2005] and is corroborated by our model. Moreover, this rule is compatible with the pattern found by Martinez et al. [2008, table 1]. If we take into account only the information regarding the initiator methionine cleavage in the cited publication, we can build the following rule: a match with M¬[ACGPST] for the first two amino acids implies no initiator methionine cleavage.
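Both rules can be written as a regular expression on the first two residues. The sketch below is only an illustration; the function name and the test sequences are hypothetical, and the set of residues is left as a parameter since the two rules quoted above differ on valine.

```python
import re


def methionine_retained(sequence, small_residues="ACGPSTV"):
    """True when the sequence matches the rule M¬[small_residues].

    A True result means the rule predicts that the initiator methionine is
    not cleaved; a False result only means that the rule does not apply.
    """
    return re.match(f"^M[^{small_residues}]", sequence) is not None


print(methionine_retained("MKLVVA"))            # second residue K: rule applies
print(methionine_retained("MASKLT"))            # second residue A: no match
print(methionine_retained("MVLSPA"))            # V belongs to the default set
print(methionine_retained("MVLSPA", "ACGPST"))  # V is outside the shorter set
```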

The same approach can be used to extract information from the other motifs. We summarize the main lines here. First, it is important to remember that we are going through a decision tree, so the alignments are applied to sequences that have been selected by the preceding motifs. The profile difference of the second motif indicates that the token at position 10 has a major contribution in producing discriminant alignment scores between proteins (figure 72). The token is [AFIKNQSW] and the histogram of aligned positions shows that it is almost always aligned with the second amino acid


[Figure 71 (plots not reproduced). Panel and axis labels: "Token (position in the motif)", "Average alignment score", "Alignment scores", "Left positions histogram", "Average profiles score difference"; legend: residue 1 to residue 6.]

Figure 71: (a) The motif score profile (equation 99) difference between the sequences achieving an alignment score less than or equal to the threshold and those achieving a score greater than the threshold. On this plot we can see that the tokens at positions 2 and 3 in the motif have an important contribution to the alignment scores of the sequences achieving a score higher than the threshold. (b) The normalized histogram of aligned positions illustrates with which positions of the amino acid sequence a token is aligned. The sequences considered to build this histogram are the ones following the right branch after the first motif. The colors of the stack indicate the position in the amino acid sequences. A stack lower than 1.0 reflects that the token is sometimes aligned with sequence gaps. E.g., a stack with a height of 0.4 means that the token is aligned with an amino acid for 40% of the alignments and is gapped for the remaining 60%.

in the sequence. But we already know that the sequences reaching this node should carry [ACGPSTV] as the second residue. So we can denoise this token by only considering the intersection between [AFIKNQSW] and [ACGPST], leading to a simplified form of the token: [AS]. So the motif seems to detect the presence of an alanine or a serine at the second position.
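The denoising step is a plain set intersection, which can be checked in a few lines (the variable names are ours):

```python
token = set("AFIKNQSW")                   # amino acids accepted by the token
allowed_second_residues = set("ACGPST")   # residues kept by the first motif

denoised = sorted(token & allowed_second_residues)
print("[" + "".join(denoised) + "]")
```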

Another token contributes substantially to the profile difference: token 13, which is a fixed amino acid token for Ser. This token is mainly aligned with the second and third amino acids of the sequences. As a relevant match implies that the sequence undergoes the initiator methionine cleavage, this leads us to propose that sequences starting with M[AS] are cleaved. But the MA sequences are highly represented in the set of sequences having a relevant match with the motif (68% of the set), and may hide the contribution of other tokens. So we removed those sequences from the protein set and produced a new profile difference. These new profiles emphasize the contribution of the second token in the motif, which is a fixed amino acid token for Pro and is always aligned with the second amino acid of the sequence. So, considering the preceding motif and the information provided by the tokens
