Supervised learning to tune simulated annealing for in silico protein structure prediction

Alejandro Marcos Alvarez, Francis Maes and Louis Wehenkel

Department of Electrical Engineering and Computer Science - University of Liège, Belgium

Contact information: amarcos@ulg.ac.be, http://www.montefiore.ulg.ac.be/~ama/

Simulated annealing is a widely used stochastic optimization algorithm whose efficiency essentially depends on the proposal distribution used to generate the next search state at each step. We propose to adapt this distribution to a family of parametric optimization problems by using supervised machine learning on a sample of search states derived from a set of typical runs of the algorithm over this family. We apply this idea in the context of in silico protein structure prediction.

Motivation

Protein structure prediction is a topical and challenging open problem in bioinformatics. Its significance stems from the importance of studying protein structures in biomedical research, both to improve our understanding of human physiology and to accelerate drug design.

The most reliable way to determine protein structures is to use experimental methods such as X-ray crystallography or NMR spectroscopy. These methods are, however, expensive and time-consuming, and the design of in silico protein structure prediction methods has therefore become a very active research field.

General problem statement

The problem we are considering is in silico protein structure prediction, which amounts to predicting the 3D coordinates of each atom in the protein given its amino acid sequence.

Characteristics

• modeled as a parametric optimization problem parameterized by λ
  – high-dimensional for usefully sized proteins;
  – λ ≡ the amino acid sequence of the protein;
  – sλ ≡ current state (structure) of the protein.

• cost function ≡ the energy function E of the protein
  – large number of local minima;
  – global minimum of E corresponds to the sought structure;
  – E includes all constraints;
  – evaluating E can be computationally expensive.

• optimization algorithm ≡ simulated annealing (SA) [4]
  – protein-specific operators o ∈ O used to modify the structure.

Optimization algorithm

Algorithm 1: Simulated annealing

Let B be a budget of iterations, E(·) the oracle evaluating the energy, and T(i) a non-increasing cooling schedule defined over {1, . . . , B}.
Input: λ the problem instance, Sλ its solution space, s0 ∈ Sλ the chosen initial state, p(o) a proposal distribution used to sample operators.

1: s = s0
2: e = E(s0)
3: for i = 1 . . . B do
4:   propose o ∈ O s.t. o ∼ p(o)
5:   s′ = o(s)
6:   e′ = E(s′)
7:   with probability min{1, exp((e − e′) / (k T(i)))} do
8:     s = s′
9:     e = e′
10:  end
11: end for
12: return s
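
For concreteness, here is a minimal Python sketch of Algorithm 1. It is not the authors' implementation (which operates on Rosetta protein structures); the energy oracle, operators, and schedule are generic placeholders:

```python
import math
import random

def simulated_annealing(s0, energy, operators, weights, T, B, k=1.0):
    """Algorithm 1: SA with a fixed, state-independent proposal p(o).

    operators -- candidate moves, each a function state -> state
    weights   -- proposal probabilities p(o), one per operator
    T         -- non-increasing cooling schedule, callable on 1..B
    """
    s, e = s0, energy(s0)
    for i in range(1, B + 1):
        o = random.choices(operators, weights=weights)[0]  # o ~ p(o)
        s_new = o(s)
        e_new = energy(s_new)
        # Metropolis criterion: accept downhill moves always, uphill
        # moves with probability exp((e - e') / (k * T(i)))
        if e_new <= e or random.random() < math.exp((e - e_new) / (k * T(i))):
            s, e = s_new, e_new
    return s
```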

Supervised learning based framework

Observations

SA's efficiency critically depends on p(o) (the naive, state-independent policy)!

What we are going to do

Use supervised machine learning to create a conditional probability distribution p(o | s) (conditional policy) and use it instead of p(o).
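
In terms of the SA sketch above, the only change is at the proposal step: the operator weights become a function of the current state. A hedged sketch, where `policy` and `phi` are hypothetical stand-ins for the learned model and the feature map:

```python
import random

def propose(s, operators, policy, phi):
    """Sample o ~ p(o | s) instead of the fixed p(o)."""
    weights = policy(phi(s))  # learned model: one probability per operator
    return random.choices(operators, weights=weights)[0]
```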

Work phases

1. Generate intermediate structures

Apply SA with p(o) on the learning set and save intermediate structures during optimization.

2. Generate good operators

For each intermediate structure, use an EDA [5] to discover good operators (that decrease energy).

3. Learn the conditional policy

Use the resulting dataset D of (structure features, good operator) pairs to learn a conditional operator selection policy.

4. Assessment

Use p(o | s) with SA on the test set and evaluate performance.
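
Taken together, phases 1–3 amount to assembling a supervised dataset and fitting a classifier to it; phase 4 reruns SA on held-out proteins with the learned policy. A speculative sketch of the data collection, with hypothetical helpers (`sa_run`, `eda_best_operator`, `phi`) standing in for the actual SA trajectories and EDA search:

```python
def build_dataset(proteins, sa_run, eda_best_operator, phi):
    """Phases 1-2: collect (state features, good operator) pairs."""
    D = []
    for protein in proteins:               # learning set
        for s in sa_run(protein):          # intermediate structures (phase 1)
            o_star = eda_best_operator(s)  # energy-decreasing operator (phase 2)
            D.append((phi(s), o_star))     # one supervised example
    return D                               # training data for phase 3
```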

Conditional distribution

• discrete parameters: maximum-entropy classifier [2];
• continuous parameters:

  μ = ⟨θμ, φ(sλ)⟩;
  σ = log {1 + exp (−⟨θσ, φ(sλ)⟩)};
  pθ(γ | φ(sλ)) ∼ Nθ(μ, σ).
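
These formulas transcribe directly to code. In the sketch below, `phi`, `theta_mu`, and `theta_sigma` are assumed to be given; σ is read as the standard deviation (the poster's Nθ(μ, σ) notation leaves this implicit), and the softplus keeps it positive:

```python
import numpy as np

def sample_gamma(s, phi, theta_mu, theta_sigma, rng=None):
    """Sample a continuous operator parameter gamma ~ N(mu, sigma)."""
    rng = rng or np.random.default_rng()
    f = phi(s)
    mu = theta_mu @ f                             # mu = <theta_mu, phi(s)>
    sigma = np.log1p(np.exp(-(theta_sigma @ f)))  # softplus, so sigma > 0
    return rng.normal(mu, sigma)
```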

Estimation of distribution algorithm (EDA)
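
The poster does not spell out the EDA used in phase 2. Purely to illustrate the idea behind [5]: an EDA alternates between sampling candidates from a probabilistic model, selecting the best ones, and refitting the model to that elite. A minimal continuous Gaussian variant:

```python
import numpy as np

def gaussian_eda(energy, dim, pop=50, elite=10, iters=20, rng=None):
    """Minimal Gaussian EDA: search for low-energy operator parameters."""
    rng = rng or np.random.default_rng()
    mu, sigma = np.zeros(dim), np.ones(dim)
    for _ in range(iters):
        X = rng.normal(mu, sigma, size=(pop, dim))  # sample a population
        order = np.argsort([energy(x) for x in X])  # lower energy is better
        best = X[order[:elite]]                     # keep the elite fraction
        mu = best.mean(axis=0)                      # refit the model
        sigma = best.std(axis=0) + 1e-8             # avoid collapse to zero
    return mu
```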

Results

Figure 1: Evolution of the average energy (log10(kcal/mol)) of the test-set proteins during one optimization run (x-axis: iterations, up to 2.5 × 10^5), comparing the original and the learned proposal distributions.

• The learning set is composed of 100 proteins randomly selected from the database PSIPRED [3].

• The test set is composed of 10 proteins randomly selected from the database PSIPRED [3].

• The parameters of SA were determined by a rule of thumb based on what can be found in official Rosetta tutorials (more details in [1]).

• The learned conditional distribution outperforms the original proposal distribution in terms of both convergence speed and final result.

• These results are promising but the structures predicted after one such learning iteration are still very different from the real structures.

Conclusions and future work

• Improvement: machine learning can improve optimization performance.

• Promising results: obtained in the context of in silico protein structure prediction.

• Learning for search: the approach learns a good way to search through the state space of a problem.

• General: it can be applied to other optimization problems and search methods.

• Local vs. global information: better efficiency may be expected if learning could take global information into account (in this work, local information is used).

• Future work includes:
  – optimization: fine-tuning of parameters, other algorithms;
  – learning: improvement of features and model selection.

References

[1] A. Marcos Alvarez. Prédiction de structures de macromolécules par apprentissage automatique. Master's thesis, University of Liège, Faculty of Engineering, 2011.

[2] A. L. Berger, V. J. D. Pietra, and S. A. D. Pietra. A maximum entropy approach to natural language processing. Computational linguistics, 22(1):39–71, 1996.

[3] D. T. Jones. Protein secondary structure prediction based on position-specific scoring matrices. Journal of Molecular Biology, 292(2):195–202, 1999.

[4] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated annealing. Science, New Series, 220(4598):671–680, 1983.

[5] P. Larrañaga and J. A. Lozano. Estimation of Distribution Algorithms: A New Tool for Evolutionary Computation. Springer, October 2002.

This work is supported by the Belgian Science Policy Office and was funded by a FRIA scholarship of the F.R.S.-FNRS.
