Des stratégies évolutionnaires globalement convergentes avec une application en imagerie sismique pour la géophysique


HAL Id: tel-01121075

https://tel.archives-ouvertes.fr/tel-01121075

Submitted on 27 Feb 2015

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.


Des stratégies évolutionnaires globalement convergentes avec une application en imagerie sismique pour la géophysique

Youssef Diouane

To cite this version:

Youssef Diouane. Des stratégies évolutionnaires globalement convergentes avec une application en imagerie sismique pour la géophysique. Mathématiques [math]. INP DE TOULOUSE, 2014. Français. ⟨tel-01121075⟩


THÈSE

In fulfilment of the requirements for the degree of

DOCTORAT DE L'UNIVERSITÉ DE TOULOUSE

Delivered by: Institut National Polytechnique de Toulouse (INP Toulouse)

Presented and defended on 17/10/2014 by:

Youssef DIOUANE

Globally convergent evolution strategies with application to an Earth imaging problem in geophysics

JURY

Henri Calandra, Total, USA, President of jury
Serge Gratton, INPT, France, PhD advisor
Luis Nunes Vicente, University of Coimbra, Portugal, PhD co-advisor
Stefano Lucidi, University of Rome, Italy, Referee
Thomas Baeck, Leiden University, Netherlands, Referee
Xavier Vasseur, CERFACS, France, Member of jury

Doctoral school and speciality:

MITT, Mathematics: Applied Mathematics

Research unit:

Centre Européen de Recherche et de Formation Avancée en Calcul Scientifique (CERFACS)

Thesis advisors:

Serge GRATTON and Luis Nunes VICENTE

Referees:

Stefano Lucidi and Thomas Baeck


Résumé

In recent years, a particular interest has developed in derivative-free optimization. This research field divides into two categories: one deterministic and the other stochastic. Although they belong to the same field, few links have so far been established between these two branches. This thesis aims to fill this gap by showing how techniques from deterministic optimization can improve the performance of evolution strategies, which are among the best methods in stochastic optimization.

Under certain assumptions, the proposed modifications ensure a form of global convergence, that is, convergence towards a first-order stationary point independently of the chosen starting point. We then propose to adapt our algorithm so that it can handle problems with general constraints. We also show how to improve the numerical performance of evolution strategies by incorporating a search step at the beginning of each iteration, in which a quadratic model is built using the points where the objective function has already been evaluated.

Thanks to recent technical advances in parallel computing and to the parallelizable nature of evolution strategies, we propose to apply our algorithm to the solution of a seismic imaging inverse problem. The results obtained improved the resolution of this problem.

Keywords: numerical optimization, evolution strategies, global convergence, sufficient decrease, inverse problems, subsurface imaging, acoustic full-waveform inversion, high performance computing (HPC).


Abstract

In recent years, there has been significant and growing interest in Derivative-Free Optimization (DFO). This field can be divided into two categories: deterministic and stochastic. Despite addressing the same problem domain, only a few interactions between the two DFO categories have been established in the existing literature. In this thesis, we attempt to bridge this gap by showing how ideas from deterministic DFO can improve the efficiency and the rigorousness of one of the most successful classes of stochastic algorithms, known as Evolution Strategies (ES's).

We propose to equip a class of ES's with known techniques from deterministic DFO. The modified ES's rigorously achieve a form of global convergence under reasonable assumptions. By global convergence, we mean convergence to first-order stationary points independently of the starting point. The modified ES's are extended to handle general constrained optimization problems. Furthermore, we show how to significantly improve the numerical performance of ES's by incorporating a search step at the beginning of each iteration. In this step, we build a quadratic model using the points where the objective function has been previously evaluated.

Motivated by the recent growth of high performance computing resources and the parallel nature of ES's, an application of our modified ES's to an Earth imaging problem in geophysics is proposed. The obtained results provide a great improvement over the known solutions of this problem.

Keywords: Numerical optimization, evolution strategies, global convergence, sufficient decrease, inverse problems, Earth imaging, acoustic full-waveform inversion, high performance computing (HPC).


Acknowledgements

It is a pleasure to thank the many people who made this thesis possible. First and foremost I want to thank my supervisors Serge Gratton and Luis Nunes Vicente. They have taught me, both consciously and unconsciously, how good research is done. I appreciate their availability for the fruitful discussions which made my PhD experience productive and stimulating. The joy and enthusiasm they have for their research was contagious and motivational for me, even during tough times in the PhD pursuit. I am also thankful for the excellent example they have provided as successful researchers. I am equally grateful to Henri Calandra and Total E&P for funding my PhD, without which this great experience would not have been possible, and for the very challenging geophysical application they provided, which justifies all the effort behind my studies. I would like to express my sincere thanks to Xavier Vasseur for his daily guidance and advice, from which I learned so much, not only for my thesis but also for my future career. I would also like to thank the referees, Thomas Baeck and Stefano Lucidi, for their careful and enlightening comments on my research.

I am also grateful to all of the ALGO team members at CERFACS for being with me during the past three years. Special thanks in particular to Selime Gürol for her help, advice, and encouragement. Many thanks also to Rafael Lago for his help during the early stage of my PhD. The CERFACS administration would not be as efficient without Brigitte Yzel and Michèle Campassens. Thanks to them for their permanent support with administrative procedures. They were always available to solve my problems with patience and a smile.

My special thanks to my best friend Elhoucine Bergou, thanks for all these 6 years spent together. My PhD would not have been the same without you my brother. Very special thanks to Zineb Ghormi for her never-ending support, trust, encouragement and understanding. My thanks go to my family and friends: my brothers Simohamed and Ayoub, my sister Mariam, my uncles Omar and Brahim, Hamza, Abdelhadi, Azhar, Nabil, Bassam, Naama, M’Barek, Daoud, Hassan, ...

Lastly, and most importantly, I wish to thank my parents, Aicha Ouaziz and Hissoune Diouane. They bore me, raised me, supported me, taught me, and loved me. To them I dedicate this thesis.


Contents

1 Introduction

2 Deterministic Derivative-Free Optimization
  2.1 Model based methods
    2.1.1 Trust-region framework
    2.1.2 Polynomial interpolation and regression models
      2.1.2.1 Polynomial bases
      2.1.2.2 Polynomial interpolation
      2.1.2.3 Under-determined interpolation models
      2.1.2.4 Regression models
    2.1.3 An interpolation based trust-region approach
      2.1.3.1 The trust-region subproblem
      2.1.3.2 Global convergence
  2.2 Direct-search methods
    2.2.1 Basic concepts
      2.2.1.1 Positive spanning sets and positive bases
      2.2.1.2 Gradient estimates
    2.2.2 Direct-search methods
      2.2.2.1 Coordinate-search method
      2.2.2.2 Direct-search framework
    2.2.3 Global convergence
      2.2.3.1 Global convergence for smooth functions
      2.2.3.2 Global convergence for non-smooth functions
  2.3 Conclusion

3 Stochastic Derivative-Free Optimization & Evolution Strategies
  3.1 Evolution strategies
    3.1.1 Notation and algorithm
    3.1.2 Recombination mechanism
    3.1.3 Selection mechanism
    3.1.4 Mutation mechanism
      3.1.4.1 The concept
      3.1.4.2 Example in real-valued search spaces
  3.2 A class of evolution strategies
    3.2.1 Concept and algorithm
    3.2.2 Some existing convergence results
    3.2.3 CMA-ES, a state of the art for ES
      3.2.3.1 The parent update
      3.2.3.2 Covariance matrix update
      3.2.3.3 Step size update
    3.2.4 Local meta-models and ES's
      3.2.4.1 Locally weighted regression
      3.2.4.2 Approximate ranking procedure
  3.3 Conclusion

4 Globally Convergent Evolution Strategies
  4.1 A class of evolution strategies provably globally convergent
    4.1.1 Globally convergent evolution strategies
    4.1.2 Convergence
      4.1.2.1 The step size behavior
      4.1.2.2 Global convergence
    4.1.3 Convergence assumptions
  4.2 Numerical experiments
    4.2.1 Algorithmic choices
    4.2.2 Test problems
    4.2.3 Test strategies
    4.2.4 Numerical results
    4.2.5 Global optimization tests
  4.3 Conclusions

5 Extension to Constraints
  5.1 A globally convergent ES for general constraints
    5.1.1 Algorithm description
    5.1.2 Step size behavior
    5.1.3 Global convergence
  5.2 A particularization for only unrelaxable constraints
    5.2.1 Algorithm description
    5.2.2 Asymptotic results
    5.2.3 Implementation choices
  5.3 Numerical experiments
    5.3.1 Unrelaxable constraints
      5.3.1.1 Solvers tested
      5.3.1.2 Algorithmic choices
      5.3.1.3 Test problems
      5.3.1.4 Comparison results
    5.3.2 Relaxable and unrelaxable constraints
      5.3.2.1 Test problems
      5.3.2.2 Test strategy
      5.3.2.3 Numerical results

6 Incorporating Local Models in a Globally Convergent ES
  6.1 Incorporating local models in a globally convergent ES
    6.1.1 The general strategy of the search step
    6.1.2 Trust-region subproblem in the search step
    6.1.3 Geometry control in the search step
    6.1.4 Constraints treatment in the search step
    6.1.5 Algorithm description
  6.2 Numerical experiments
    6.2.1 Test strategy
    6.2.2 Numerical results for unconstrained optimization
      6.2.2.1 Search step impact
      6.2.2.2 Comparison with other solvers
    6.2.3 Numerical results for constrained optimization
      6.2.3.1 Search step impact
      6.2.3.2 Comparison with other solvers
  6.3 Conclusions

7 Towards an Application in Seismic Imaging
  7.1 Full-waveform inversion
    7.1.1 Forward problem
    7.1.2 FWI as a least-squares local optimization
  7.2 ES for building an initial velocity model for FWI
    7.2.1 Methodology
    7.2.2 SEG/EAGE salt dome velocity model
    7.2.3 Search space reduction
      7.2.3.1 One-dimensional approximation procedure
      7.2.3.2 Three-dimensional approximation procedure
    7.2.4 A parallel ES for acoustic full waveform inversion
  7.3 Numerical experiments
    7.3.1 Implementation details
    7.3.2 Numerical results
  7.4 Conclusions

8 Conclusions & Perspectives
  8.1 Conclusions
  8.2 Perspectives

A Data & Performance Profiles Results

B Test Results


List of Figures

2.1 A graphical representation of the maximal positive basis $D_1$ (left) and the minimal positive basis $D_2$ (right) for $\mathbb{R}^2$.
2.2 For a given positive spanning set and a vector $w = -\nabla f(x)$ (green), there must exist at least one descent direction $d$ (red) (i.e. $w^\top d > 0$).
2.3 A positive spanning set with a very small cosine measure.
2.4 In $\mathbb{R}^2$, for a given positive spanning set the cosine measure is defined by $\cos(\theta)$, where $\theta$ (blue) is the largest angle between two adjacent vectors.
2.5 Six iterations of the coordinate-search method with opportunistic polling (following the order East/West/North/South). The initial point is $x_0 = [-3.5, -3.5]$ and the starting step size is $\alpha_0 = 3$. For successful iterations the step size is kept unchanged, otherwise it is reduced by a factor $\beta = 1/2$. The ellipses show the level sets of the objective function $f(x) = (x_1 + x_2 - 2)^2 + (x_1 - x_2)^2$. The optimum is located at the point $[1, 1]$.
3.1 A scalar density function for a normal distribution.
3.2 A 2-D situation where non-isotropic mutations, parallel to the y-axis, enhance the performance. The ellipses show the level sets of the objective function $f(x) = (x_1 + x_2 - 2)^2 + (x_1 - x_2)^2$.
3.3 A 2-D situation where it is more efficient to have correlated Gaussian mutations. The ellipses show the level sets of the objective function $f(x) = (x_1 + x_2 - 2)^2 + (x_1 - x_2)^2$.
3.4 A 2-D illustration of an evolution strategy. Generation after generation, the sampling distribution and the step size adapt to the landscape of the objective function. The ellipses show the level sets of the objective function.
3.5 A graphical representation of a 2-dimensional run of CMA-ES where $x_0 = [-4, -4]$, the initial step size is $\sigma_0^{\mathrm{CMA\text{-}ES}} = 1$, and the covariance matrix is isotropic (i.e. $C_0 = I_2$). The population size is $\lambda = 10$ and the new parent is chosen using the $\mu = 5$ best individuals. The ellipses show the level sets of the objective function $f(x) = (x_1 + x_2 - 2)^2 + (x_1 - x_2)^2$. The optimum is located at the point $[1, 1]$.
4.1 A 2-D illustration of three possible globally convergent evolution strategies. The ellipses show the level sets of the objective function.
4.2 Data profiles computed for the set of smooth problems, considering the two levels of accuracy, $10^{-3}$ and $10^{-7}$ (for the three modified versions).
4.3 Performance profiles computed for the set of smooth problems with a logarithmic scale, considering the two levels of accuracy, $10^{-2}$ and $10^{-4}$ (for the three modified versions).
4.4 Data profiles computed for the set of smooth problems, considering the two levels of accuracy, $10^{-3}$ and $10^{-7}$.
4.5 Data profiles computed for the set of nonstochastic noisy problems, considering the two levels of accuracy, $10^{-3}$ and $10^{-7}$.
4.6 Data profiles computed for the set of piecewise smooth problems, considering the two levels of accuracy, $10^{-3}$ and $10^{-7}$.
4.7 Data profiles computed for the set of stochastic noisy problems, considering the two levels of accuracy, $10^{-3}$ and $10^{-7}$.
4.8 Performance profiles computed for the set of smooth problems with a logarithmic scale, considering the two levels of accuracy, $10^{-2}$ and $10^{-4}$.
4.9 Performance profiles computed for the set of nonstochastic noisy problems with a logarithmic scale, considering the two levels of accuracy, $10^{-2}$ and $10^{-4}$.
4.10 Performance profiles computed for the set of piecewise smooth problems with a logarithmic scale, considering the two levels of accuracy, $10^{-2}$ and $10^{-4}$.
4.11 Performance profiles computed for the set of stochastic noisy problems with a logarithmic scale, considering the two levels of accuracy, $10^{-2}$ and $10^{-4}$.
4.12 Results for the mean/mean version, CMA-ES, and MADS on a set of multi-modal functions of dimension 10 (using $\lambda = 20$).
4.13 Results for the mean/mean version, CMA-ES, and MADS on a set of multi-modal functions of dimension 20 (using $\lambda = 40$).
4.14 Results for the mean/mean version, CMA-ES, and MADS on a set of multi-modal functions of dimension 10 (using $\lambda = 100$).
4.15 Results for the mean/mean version, CMA-ES, and MADS on a set of multi-modal functions of dimension 20 (using $\lambda = 200$).
5.1 A 2-D illustration of the barrier approach to handle linearly constrained problems using positive generators of the polar cone of the $\epsilon$-active constraints. Figure (5.1(a)) outlines the detection of an $\epsilon$-active mean parent point, while Figures (5.1(b)) and (5.1(c)) show the restoration process to conform the offspring distribution to the local geometry. The ellipses show the level sets of the objective function.
5.2 An illustration of the projection approach to handle linearly constrained problems. Figure (5.2(a)) outlines the projection of the unfeasible sample points. Figures (5.2(b)) and (5.2(c)) show the adaptation of the distribution of the offspring candidate solution to the local geometry of the constraints.
5.3 Performance profiles for 114 bound constrained problems (average objective function values for 10 runs).
5.4 Performance profiles for 107 general linearly constrained problems (average objective function values for 10 runs).
5.5 Data profiles for 114 bound constrained problems (average objective function values for 10 runs).
5.6 Data profiles for 107 general linearly constrained problems (average objective function values for 10 runs).
6.1 Data profiles computed for the set of smooth problems to assess the impact of incorporating local models, considering the two levels of accuracy, $10^{-3}$ and $10^{-7}$.
6.2 Data profiles computed for the set of nonstochastic noisy problems to assess the impact of incorporating local models, considering the two levels of accuracy, $10^{-3}$ and $10^{-7}$.
6.3 Data profiles computed for the set of piecewise smooth problems to assess the impact of incorporating local models, considering the two levels of accuracy, $10^{-3}$ and $10^{-7}$.
6.4 Data profiles computed for the set of stochastic noisy problems to assess the impact of incorporating local models, considering the two levels of accuracy, $10^{-3}$ and $10^{-7}$.
6.5 Comparison with the SID-PSM and BCDFO methods on the set of smooth problems using data profiles, considering the two levels of accuracy, $10^{-3}$ and $10^{-7}$.
6.6 Comparison with the SID-PSM and BCDFO methods on the set of nonstochastic noisy problems using data profiles, considering the two levels of accuracy, $10^{-3}$ and $10^{-7}$.
6.7 Comparison with the SID-PSM and BCDFO methods on the set of piecewise smooth problems using data profiles, considering the two levels of accuracy, $10^{-3}$ and $10^{-7}$.
6.8 Comparison with the SID-PSM and BCDFO methods on the set of stochastic noisy problems using data profiles, considering the two levels of accuracy, $10^{-3}$ and $10^{-7}$.
6.9 Data profiles computed for 114 bound constrained problems to assess the impact of incorporating local models, considering the two levels of accuracy, $10^{-3}$ and $10^{-7}$.
6.10 Data profiles computed for 107 general linearly constrained problems to assess the impact of incorporating local models, considering the two levels of accuracy, $10^{-3}$ and $10^{-7}$.
6.11 Data profiles for 114 bound constrained problems using an accuracy level of $10^{-3}$ (average objective function values for 10 runs).
6.12 Data profiles for 114 bound constrained problems using an accuracy level of $10^{-7}$ (average objective function values for 10 runs).
7.1 A graphical representation of acoustic waves propagated by a source, reflected by a reflective layer (in white), and detected by the geophones.
7.2 A graphical representation of acoustic wave propagation over a two-dimensional velocity model.
7.3 Academic 3D SEG/EAGE salt dome velocity model using Paraview [88]. The geophysical domain size is $20 \times 20 \times 5$ km$^3$, in which the minimal velocity is 1500 m/s. The velocity model represents a dome of salt in the subsurface of the Earth, which abruptly increases the velocity of propagation of the compressional waves.
7.4 The reduction procedure over a one-dimensional case.
7.5 The duplication procedure over a one-dimensional case.
7.7 A one-dimensional magnification procedure using the DCT transform. Compared to the duplicated vector, the magnification using the DCT transform represents the true velocity vector better.
7.8 3D duplicated and magnified models of the SEG/EAGE salt dome velocity model. The velocity models are built using $n = 8 \times 8 \times 5 = 320$; the original size of the true velocity model is $N = 225 \times 225 \times 70 = 3543750$.
7.9 A parallel evolution strategy for full waveform inversion.
7.10 The starting velocity model for the parallel evolution strategy.
7.11 Inversion results for the salt dome velocity model using $n = 320$ parameters. The working frequency is 1 Hz.
7.12 Objective function evaluation at the best population point for the first 278 iterations of the parallel evolution strategy.
7.13 Graphical representation of the salt dome of three velocity models: the true velocity salt dome (Figure 7.13(a)), the approximated one using 320 parameters (Figure 7.13(b)), and the inverted velocity model (Figure 7.13(c)). Only the points of the models with velocity equal to or larger than 3500 m/s are shown (to delineate the structure of the dome of salt).
7.14 Comparison of the inversion results for the salt dome velocity model using $n = 320$ parameters for different ranges of frequencies (1 Hz, 2 Hz and 3 Hz).
A.1 Data profiles computed for the set of nonstochastic noisy problems, considering the two levels of accuracy, $10^{-3}$ and $10^{-7}$ (for the three modified versions).
A.2 Data profiles computed for the set of piecewise smooth problems, considering the two levels of accuracy, $10^{-3}$ and $10^{-7}$ (for the three modified versions).
A.3 Data profiles computed for the set of stochastic noisy problems, considering the two levels of accuracy, $10^{-3}$ and $10^{-7}$ (for the three modified versions).
A.4 Performance profiles computed for the set of nonstochastic noisy problems with a logarithmic scale, considering the two levels of accuracy, $10^{-2}$ and $10^{-4}$ (for the three modified versions).
A.5 Performance profiles computed for the set of piecewise smooth problems with a logarithmic scale, considering the two levels of accuracy, $10^{-2}$ and $10^{-4}$ (for the three modified versions).
A.6 Performance profiles computed for the set of stochastic noisy problems with a logarithmic scale, considering the two levels of accuracy, $10^{-2}$ and $10^{-4}$ (for the three modified versions).


List of Algorithms

2.1 A DFO trust-region algorithm
2.2 Coordinate-search method
2.3 A direct-search method
3.1 A general framework for a (µ/ρ +, λ)-ES
3.2 A general framework for a (µ/µ_W, λ)-ES
3.3 Approximate ranking procedure
4.1 A class of globally convergent ES's
5.1 A globally convergent ES for general constraints (Main)
5.2 A globally convergent ES for general constraints (Restoration)
5.3 A globally convergent ES for unrelaxable constraints
5.4 Calculating the positive generators $D_k$
6.1 A globally convergent ES using a search step
7.1 A multi-scale algorithm for frequency-domain FWI
7.2 An adaptation of the ES algorithm to the FWI setting


List of Tables

4.1 The distribution of $n_p$ in the test set
4.2 Noiseless problems
4.3 Noisy problems
5.1 Comparison results for the extreme barrier approach using a maximal budget of 2000
5.2 Comparison results for the extreme barrier approach using a maximal budget of 20000
5.3 Comparison results for the merit approach and the progressive barrier one using a maximal budget of 2000
5.4 Comparison results for the merit approach and the progressive barrier one using a maximal budget of 20000
7.1 The distribution of the clusters and the population size depending on the working frequency
B.1 Results from the comparison of the solvers on bound-constrained problems (average of 10 runs for stochastic solvers), Part 1
B.2 Results from the comparison of the solvers on bound-constrained problems (average of 10 runs for stochastic solvers), Part 2
B.3 Results from the comparison of the solvers on bound-constrained problems (average of 10 runs for stochastic solvers), Part 3
B.4 Results from the comparison of the solvers on bound-constrained problems (average of 10 runs for stochastic solvers), Part 4
B.5 Results from the comparison of the solvers on bound-constrained problems (average of 10 runs for stochastic solvers), Part 5
B.6 Results from the comparison of the solvers on linearly constrained problems (average of 10 runs for stochastic solvers), Part 1
B.7 Results from the comparison of the solvers on linearly constrained problems (average of 10 runs for stochastic solvers), Part 2
B.8 Results from the comparison of the solvers on linearly constrained problems (average of 10 runs for stochastic solvers), Part 3
B.9 Results from the comparison of the solvers on linearly constrained problems (average of 10 runs for stochastic solvers), Part 4
B.10 Results from the comparison of the solvers on linearly constrained problems (average of 10 runs for stochastic solvers), Part 5


To my parents


Chapter 1

Introduction

Nowadays, many practical optimization problems are often noisy, complex, and not sufficiently explicitly defined to give reliable derivatives. In this thesis, we are interested in optimization problems where derivative information is unavailable or hard to obtain in practice. For instance, optimizing large and complex systems often requires the tuning of many parameters. These parameters are typically set to values that may have some mathematical meaning or that have been found to perform well. The choice of parameters can be done automatically using training data from simulations. In such a case, not only is it hard to find the derivatives with respect to the parameters, but numerical noise and possibly non-differentiability issues may also appear. As a consequence, we have seen a resurgence of interest in Derivative-Free Optimization (DFO) [52]. Derivative-based methods are better adapted to solving large scale optimization problems, typically with around $10^6$ unknowns or more. These methods can be very efficient when the starting point is accurate enough, but otherwise they suffer from stalled convergence to spurious local minima for non-convex optimization problems. Thus the holy grail for these problems is to warm-start the local optimization procedures by efficiently finding a good initial guess, without the need for sophisticated a priori knowledge of the objective function (such as the problem structure, its background, ...). When the number of unknowns included in the optimization can be reduced, it is possible to use a type of DFO methods that are known for their ability to handle hard problems and to find a good initial guess (a starting point leading to a better minimum). Once a starting point is found, derivative-based methods can be applied to refine the problem solution. In the scope of this thesis, we deal with a very large scale seismic imaging inversion problem [167], where we show that some DFO methods can improve the optimization procedure by finding an accurate initial guess from which one can initiate derivative-based methods [126], without any physical knowledge.


DFO methods use neither derivative information of the objective function or constraints nor an approximation to the derivatives. In fact, approximating derivatives is often very expensive and can produce misleading results due to the presence of noise. The DFO area can be divided into two categories, depending on the methods used to explore the search space. The first category is that of deterministic DFO algorithms, such as model-based methods [49, 130] or direct-search methods [52, 108]. The major drawback of these methods is that they can easily get stuck in a local optimum. The second category is stochastic derivative-free optimization [122, 158], which has been employed to mitigate this shortcoming of the local deterministic methods in the solution of difficult objective functions (e.g. non-smooth and multi-modal). Stochastic derivative-free optimization algorithms aim to be robust when dealing with multi-modal objective functions. Some of these methods are inspired by nature, in the same way that random processes are often associated with natural systems (e.g. mutations of genetic information, the annealing process of metal, molecular dynamics, or the swarm behavior of birds). Well-known representatives of stochastic methods are simulated annealing [107], particle swarm optimization [103], and evolutionary algorithms [26, 32, 91, 92, 142, 150]. In the past, stochastic DFO was regarded by the deterministic DFO community as another discipline, and only a few interactions between the two DFO categories were established. Meanwhile, stochastic optimization algorithms have been growing rapidly in popularity, thanks to some methods that became "industry standard" approaches for solving challenging optimization problems. Such growth led the deterministic DFO community to reconsider its position, and it has recently started to include stochastic frameworks in its research topics [27, 73, 115, 129].

Evolution strategies (ES's) are one of these successful stochastic algorithms, seen as a class of evolutionary algorithms that are naturally parallelizable, appropriate for continuous optimization, and that lead to interesting results [23, 37, 145]. Motivated by the industrial demand, we propose in this thesis to equip a class of ES's with known techniques from the deterministic DFO community based on step size control. The incorporated techniques are inspired by recent developments in direct-search methods [18, 52, 53, 74, 166]. Our modifications enhance the performance of the original algorithm, particularly when objective function evaluations are expensive. The proposed ES's rigorously achieve a form of global convergence under reasonable assumptions. By global convergence, we mean the ability of the algorithm to generate a sequence of points converging to a stationary point regardless of the starting point.


The problem under consideration in this thesis is of the form

$$\min \; f(x) \quad \text{s.t.} \quad x \in \Omega, \qquad (1.1)$$

where $f$ is a real-valued objective function assumed to be bounded from below. The feasible region $\Omega \subset \mathbb{R}^n$ of this problem can be defined by relaxable and/or non-relaxable constraints. Relaxable constraints need only be satisfied approximately or asymptotically. No violation is allowed when the constraints are non-relaxable (typically, they are bounds or linear constraints).

In Chapter 2, we give a short overview of existing deterministic derivative-free optimization methods and their classification. We present the general framework of model-based methods in their derivative-free context. We emphasize the multivariate polynomial interpolation techniques used to build different types of local polynomial interpolation and regression models. We also address (directional) direct-search methods, where the sampling is guided by a set of directions with specific features. Key concepts related to the sampling set are also outlined (i.e. positive spanning sets, descent directions, and the cosine measure). We end the chapter by reviewing some of the existing global convergence results for the presented direct-search methods.

As our main motivation is to equip a class of ES's with some direct-search techniques, Chapter 3 gives an overview of stochastic derivative-free optimization algorithms and in particular ES's: their appearance and history, their basic ideas and principles. We also present some theoretical aspects of ES's, in particular the main existing global convergence properties of ES algorithms. The chapter closes with a detailed description of CMA-ES [85, 86], regarded as the state of the art in stochastic derivative-free optimization.

In Chapter 4, we introduce our first contribution, where we show how to modify a large class of ES's for unconstrained optimization in order to rigorously achieve global convergence. The type of ES's under consideration recombines the parent points by means of a weighted sum, around which the offspring points are computed by random generation. The modifications consist essentially in the reduction of the size of the steps whenever a sufficient decrease condition on the function values is not verified. When the latter condition is fulfilled, the step size can be reset to the one maintained by the ES's themselves, as long as it is sufficiently large. We propose ways of imposing sufficient decrease for which global convergence holds under reasonable assumptions (e.g. density of certain limit directions in the unit sphere). Given a limited budget of function evaluations, our numerical experiments have shown that the modified CMA-ES is capable of further progress in function values. Moreover, we have observed that such an improvement in efficiency comes without significantly weakening the performance of the underlying method in the presence of several local minimizers.
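A minimal sketch of this kind of step-size control follows: offspring are sampled around a weighted recombination of the best parents, and the step size is contracted whenever a sufficient decrease condition on the function values is not met. The forcing function, the parameter values, and the expansion/contraction factors are illustrative assumptions, not the exact choices analyzed in Chapter 4.

```python
import numpy as np

def forcing(sigma):
    # Forcing function rho(sigma) = c * sigma^2, a common way to impose
    # sufficient decrease (illustrative constant).
    return 1e-4 * sigma**2

def modified_es(f, x0, sigma0=1.0, lam=10, mu=5, max_evals=2000, seed=0):
    """Sketch of an ES iteration with a sufficient-decrease test on the
    weighted mean of the best offspring (not the thesis implementation)."""
    rng = np.random.default_rng(seed)
    n = x0.size
    weights = np.log(mu + 0.5) - np.log(np.arange(1, mu + 1))
    weights /= weights.sum()                    # recombination weights
    x, sigma, fx, evals = x0.copy(), sigma0, f(x0), 1
    while evals + lam + 1 <= max_evals:
        # Sample lambda offspring around the current (mean) parent.
        offspring = x + sigma * rng.standard_normal((lam, n))
        values = np.array([f(y) for y in offspring])
        evals += lam
        best = np.argsort(values)[:mu]
        trial = weights @ offspring[best]       # weighted recombination
        f_trial = f(trial); evals += 1
        if f_trial <= fx - forcing(sigma):      # sufficient decrease holds
            x, fx = trial, f_trial
            sigma *= 2.0                        # successful: step may grow
        else:
            sigma *= 0.5                        # unsuccessful: contract
    return x, fx

# Example: minimize a simple shifted quadratic in dimension 5.
x_best, f_best = modified_es(lambda x: ((x - 1.0)**2).sum(), np.zeros(5))
```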

The modified ES is extended to handle general constrained optimization in Chapter 5. Our methodology is built upon the globally convergent evolution strategies previously introduced for unconstrained optimization. Two feasible approaches are considered to handle the non-relaxable constraints. In the first approach, the objective function is evaluated directly at the generated sample points, and feasibility is enforced through an extreme barrier function. The second approach projects the generated sample points onto the feasible domain before evaluating the objective function. The treatment of relaxable constraints is inspired by the merit function approach [74], where one tries to combine both the objective function and the constraint violation function. In the first numerical experiments, where we consider only unrelaxable constraints, we show that our proposed ES approaches (using the extreme barrier or the projection) are competitive with state-of-the-art solvers for derivative-free bound and linearly constrained optimization. In the second part of our numerical experiments, we test our algorithms based on the merit function approach in the presence of both relaxable and unrelaxable constraints. On the chosen test problems, the merit approach shows promising results compared to the progressive barrier one [19], in particular for relatively small feasible regions.
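As a concrete illustration of the two feasible approaches for non-relaxable constraints, the sketch below shows an extreme barrier wrapper and a projection onto simple bounds; the bound-constrained setting and all names are assumptions made for the example.

```python
import numpy as np

def extreme_barrier(f, lower, upper):
    """Return f(x) if x satisfies the bounds, +inf otherwise."""
    def f_omega(x):
        if np.all(x >= lower) and np.all(x <= upper):
            return f(x)
        return np.inf              # infeasible points are never preferred
    return f_omega

def project_onto_bounds(x, lower, upper):
    """Project a sampled point onto the feasible box before evaluating f."""
    return np.clip(x, lower, upper)

# Example usage on a 3-dimensional box [0, 1]^3.
lower, upper = np.zeros(3), np.ones(3)
f_eb = extreme_barrier(lambda x: (x**2).sum(), lower, upper)
print(f_eb(np.array([0.5, 2.0, 0.1])))                       # inf: bound violated
print(project_onto_bounds(np.array([0.5, 2.0, 0.1]), lower, upper))
```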

The modified ES proposed in Chapters 4 and 5 evaluates the objective function at a significantly large number of points at each iteration. These evaluations can be used in different ways to speed up the convergence and make ES algorithms more efficient, especially for small budgets. The possibility that we explore in Chapter 6 is to use the previously evaluated points to construct surrogate quadratic models of the objective function f. The surrogate models are computed using techniques inspired by model-based methods for deterministic DFO. Our hybrid algorithm has been designed to satisfy the convergence analysis of our globally convergent ES. As expected, our experiments show that incorporating local models improves the performance of our ES on both unconstrained and constrained optimization problems. Regression models are found to be the most efficient quadratic models within our ES algorithms.

Our target application is the solution of an Earth imaging problem in geophysics. In Chapter 7, without any physical knowledge, we use our globally convergent ES's to find a starting point for an optimization procedure that attempts to derive high-resolution quantitative models of the subsurface using the full information of acoustic waves, known as acoustic full-waveform inversion [167]. The chapter starts with a detailed description of the considered problem. We also outline one possible way to adapt our ES to the acoustic full-waveform inversion problem setting. A subspace approach is used for the parametrization of the problem. Motivated by the recent growth of high performance computing resources, we propose a highly parallel implementation of our ES adapted to the requirements of the problem. The initial results obtained in this direction show that great improvement can be expected in the automation of full-waveform inversion. Finally, we draw some conclusions and outline perspectives in Chapter 8.


Chapter 2

Deterministic Derivative-Free Optimization

Deterministic derivative-free optimization (DFO) methods either try to build models of the objective function based on sample function values, i.e. model-based methods [49, 52], or directly exploit a sample set of function evaluations without building an explicit model, i.e. direct-search methods [52, 108]. Motivated by the large number of DFO applications, researchers and practitioners have made significant progress on algorithmic and theoretical aspects of DFO methods over the past two decades. The most important progress concerns the recent algorithms and proofs of global convergence [17, 49, 52, 108, 149, 166]. By global convergence, we mean the ability of a method to generate a sequence of points converging to a stationary point regardless of the starting point. A point is said to be stationary if it satisfies the first-order necessary conditions, in the sense that the gradient is equal to zero if the objective function is differentiable or, in the non-smooth case, that all Clarke generalized directional derivatives are non-negative [43]. The book by Conn, Scheinberg, and Vicente [52] gives a good review of the state of the art of deterministic DFO, with a detailed description of the theoretical background needed to ensure convergence. The main classes of globally convergent algorithms for derivative-free optimization are:

1. Trust-region methods [49, 52, 130], where one minimizes accurate models inside a region of prespecified size. The models are built, for example, using either interpolation and regression techniques [50] or radial basis functions [168].

2. Directional direct-search methods [52, 108], where sampling is guided by sets of directions with appropriate properties, i.e. sets of directions generating $\mathbb{R}^n$ with non-negative coefficients. Popular algorithms in this class are coordinate search, pattern search, generalized pattern search (GPS) [17], generating set search (GSS) [108], and mesh adaptive direct search (MADS) [18]. We will often refer to this class of methods simply as direct-search methods.

3. Simplicial direct-search methods [52, 128], where optimization is carried out through simplex operations like reflection, expansion, or contraction. A popular example is the Nelder-Mead method [128], which is regarded as the most popular derivative-free method.

4. Line-search methods [52, 102], where one tries to optimize the objective function using a simplex gradient. The latter is typically chosen as the gradient of a linear interpolation or regression polynomial model. A popular example is the implicit-filtering method of Kelley et al. [102].

Only trust-region methods and direct-search methods are explored further in this thesis. The remainder of this chapter is organized as follows: we begin with a short overview of model-based methods, where we present the general framework of trust-region methods, including their relationship with regression and quadratic models. The second section is devoted to direct-search methods, where we present a class of globally convergent directional direct-search methods. The convergence results in this chapter are stated without proofs; for the proofs we refer the reader to [17, 49, 52, 108, 166] and the references given there.

2.1 Model based methods

Model based methods can be seen as a combination of the trust-region framework with interpolation models of the objective function. Basically, in these methods we construct a local model of the objective function and estimate the new step by minimizing the model inside a region. The model is constructed using points evaluated on a specific point subset. Such a point subset must satisfy some appropriate properties so that the models are well-defined. In this section, we briefly describe the essence of this approach. For a more detailed analysis, the reader is referred to [49, 51, 52, 130].

2.1.1 Trust-region framework

The trust-region framework is usually used when derivative information of the objective function is available or at least some estimates of the derivatives can be computed accurately. A typical trust-region method is as follows: at the $k$-th iteration, given the current iterate $x_k$, a model of the form

$$m_k(x_k + s) = f(x_k) + g_k^\top s + \frac{1}{2}\, s^\top H_k\, s \qquad (2.1)$$

(where $g_k$ and $H_k$ correspond to estimates of the gradient and the Hessian, respectively) is minimized in a neighborhood of the current iterate defined by the ball (or trust region)

$$B(x_k, \Delta_k) = \{x \in \mathbb{R}^n : \|x - x_k\| \le \Delta_k\}, \qquad (2.2)$$

centered at $x_k$ and with radius $\Delta_k$; the norm $\|\cdot\|$ could be iteration dependent, but is usually fixed. Different norm choices can be used depending on the minimization problem: for instance, in the unconstrained case the standard Euclidean norm is better suited [49, 52], while the infinity norm was shown to be more suited when considering bound constraints [49, 72].

The minimization of the model inside the trust region leads to a new trial point $x_k + s_k$. To determine whether the computed point is successful or not, we evaluate the objective function at the new point $x_k + s_k$ and compare the true reduction in the value of the objective function with the reduction predicted by the model. If the ratio

$$\rho_k = \frac{f(x_k) - f(x_k + s_k)}{m_k(x_k) - m_k(x_k + s_k)} \qquad (2.3)$$

is larger than a constant $\nu_1 > 0$, the step is accepted and the model is updated; the trust-region radius is possibly increased if the success is really significant. When the step is unsuccessful (meaning $\rho_k \le \nu_1$), the trial point is rejected and the trust-region radius $\Delta_k$ is reduced.
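A minimal sketch of the acceptance test based on the ratio (2.3) and of the radius update just described; the constants $\nu_1$, $\gamma$, and $\gamma_{\text{inc}}$ are illustrative values rather than those of any particular implementation.

```python
import numpy as np

def trust_region_update(f, m, x, s, delta,
                        nu1=0.1, gamma=0.5, gamma_inc=2.0):
    """One acceptance/radius decision of a basic trust-region iteration.

    f, m  : objective and current model, both callables of a point
    x, s  : current iterate and trial step (with ||s|| <= delta)
    """
    predicted = m(x) - m(x + s)          # reduction predicted by the model
    actual = f(x) - f(x + s)             # true reduction of the objective
    rho = actual / predicted
    if rho > nu1:                        # successful: accept, enlarge radius
        return x + s, gamma_inc * delta
    return x, gamma * delta              # unsuccessful: reject, shrink radius

# Example with an exact quadratic model of f(x) = ||x||^2 around x.
f = lambda z: float(z @ z)
x = np.array([2.0, -1.0])
m = lambda z: f(x) + 2 * x @ (z - x) + float((z - x) @ (z - x))
x_new, delta_new = trust_region_update(f, m, x, s=-0.5 * x, delta=2.0)
```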

The approximation model (2.1) is generally constructed using a second-order Taylor series expansion. However, in the derivative-free context, one uses alternative approximation techniques that are not based upon the derivatives of the objective function $f$. Quadratic interpolation is one of these techniques that can be combined with trust-region algorithms. To guarantee convergence, one needs to require the approximation model to be locally accurate enough. The interpolation set, as well as the mechanism for keeping it good enough inside the trust region, are described in the next section. The upcoming results are general interpolation and regression results that have proven useful when dealing with model-based optimization. The subscript $k$ is dropped in the following description for clarity, without loss of information, since we focus on a given iteration of the trust-region algorithm.


2.1.2 Polynomial interpolation and regression models

In this section, we consider the problem of interpolating known objective function values at a given set $Y$ of interpolation points, $Y = \{y^1, y^2, \ldots, y^p\} \subset \mathbb{R}^n$. We aim to find a model $m$ for which the interpolation condition

$$m(y^j) = f(y^j), \quad j = 1, \ldots, p, \qquad (2.4)$$

holds. We say that a set of points can be interpolated by a polynomial of a certain degree if, for the function $f$, there exists a polynomial $m$ such that (2.4) holds for all the points in the interpolation set $Y$.

2.1.2.1 Polynomial bases

Let $\mathcal{P}_n^d$ be the space of polynomials of degree $\le d$ in $\mathbb{R}^n$, and $q$ the dimension of this space. Let $\{\phi_i\}_{i=1}^{q}$ be a given basis of $\mathcal{P}_n^d$, which is a set of $q$ polynomials of degree $\le d$. Thus, any polynomial $m \in \mathcal{P}_n^d$ can be written uniquely as

$$m(x) = \sum_{j=1}^{q} \alpha_j \phi_j(x), \qquad (2.5)$$

where $\alpha_\phi = (\alpha_1, \ldots, \alpha_q)^\top \in \mathbb{R}^q$. Different polynomial bases $\phi$ can be considered; the simplest and most used polynomial basis is the basis of monomials, known as the natural basis $\bar\phi$. Such a basis is defined using multi-indices in the following way [52]: let a vector $\alpha^i = (\alpha^i_1, \ldots, \alpha^i_n) \in \mathbb{N}^n$ be called a multi-index and, for any $x \in \mathbb{R}^n$, define

$$x^{\alpha^i} = \prod_{j=1}^{n} x_j^{\alpha^i_j}, \qquad |\alpha^i| = \sum_{j=1}^{n} \alpha^i_j, \qquad \alpha^i! = \prod_{j=1}^{n} (\alpha^i_j!).$$

Then the elements of the natural basis are

$$\bar\phi_i(x) = \frac{1}{(\alpha^i)!}\, x^{\alpha^i}, \qquad i = 0, \ldots, q, \quad |\alpha^i| \le d.$$


The natural basis can then be written as follows:

$$\bar\phi = \left\{ 1,\; x_1,\; x_2,\; \ldots,\; x_n,\; \tfrac{1}{2}x_1^2,\; x_1 x_2,\; \ldots,\; \tfrac{1}{(d-1)!}\,x_{n-1}^{d-1}x_n,\; \tfrac{1}{d!}\,x_n^d \right\}. \qquad (2.6)$$

Consequently, for linear interpolation (i.e. $d = 1$) we have $q = n + 1$, and $q = \frac{(n+1)(n+2)}{2}$ for full quadratic interpolation (i.e. $d = 2$).
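These definitions translate directly into code. The sketch below enumerates the multi-indices of degree at most $d$ and evaluates the corresponding natural-basis monomials $x^{\alpha^i}/\alpha^i!$; the function names and the ordering of the basis elements are our own choices.

```python
import math
from itertools import product
import numpy as np

def multi_indices(n, d):
    """All multi-indices alpha in N^n with |alpha| <= d, sorted by total degree."""
    idx = [a for a in product(range(d + 1), repeat=n) if sum(a) <= d]
    return sorted(idx, key=sum)

def natural_basis(x, d):
    """Evaluate the natural basis of P_n^d at a point x."""
    x = np.asarray(x, dtype=float)
    values = []
    for alpha in multi_indices(x.size, d):
        coeff = 1.0 / np.prod([math.factorial(a) for a in alpha])
        values.append(coeff * np.prod(x ** np.array(alpha)))
    return np.array(values)

# For n = 2 and d = 2 the basis has q = (n+1)(n+2)/2 = 6 elements:
# the constant 1, the two linear monomials, and the three quadratic ones.
print(natural_basis([2.0, 3.0], d=2))
```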

2.1.2.2 Polynomial interpolation

Using (2.5) and (2.4), the coefficients $\alpha_\phi = (\alpha_1, \ldots, \alpha_q)^\top$ can be found by solving the equations

$$\sum_{j=1}^{q} \alpha_j \phi_j(y^i) = f(y^i), \quad i = 1, \ldots, p,$$

which can be written as a linear system of the form

$$M(\phi, Y)\,\alpha_\phi = f(Y), \qquad (2.7)$$

where the coefficient matrix $M(\phi, Y)$ and the right-hand side $f(Y)$ of this system are

$$M(\phi, Y) = \begin{pmatrix} \phi_1(y^1) & \phi_2(y^1) & \cdots & \phi_q(y^1) \\ \phi_1(y^2) & \phi_2(y^2) & \cdots & \phi_q(y^2) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_1(y^p) & \phi_2(y^p) & \cdots & \phi_q(y^p) \end{pmatrix} \qquad \text{and} \qquad f(Y) = \begin{pmatrix} f(y^1) \\ f(y^2) \\ \vdots \\ f(y^p) \end{pmatrix},$$

respectively.

If the coefficient matrix $M(\phi, Y)$ is square and nonsingular, then the set of points $Y$ is poised with respect to the subspace spanned by $\phi$. This means that $Y$ can be interpolated by a unique polynomial from this subspace. When the interpolation set remains poised under small perturbations, the set is called well-poised. If the set $Y$ is poised, then one can solve the linear system and find an interpolation polynomial. However, numerically, the coefficient matrix $M(\phi, Y)$ may be ill-conditioned depending on the choice of the basis $\{\phi_i\}_{i=1}^{q}$. Thus, in general, the condition number of the matrix $M(\phi, Y)$ is a bad measure of the poisedness of $Y$. However, if one chooses the interpolation basis $\phi$ as the natural basis of monomials $\bar\phi$, and $\hat Y$ as a shifted and scaled version of $Y$ such that $\hat Y \subset B(0; 1)$, the condition number of $M(\bar\phi, \hat Y)$ can be used to monitor the poisedness of the point set [52, Theorem 3.14].
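As an illustration of this monitoring strategy, the following sketch shifts and scales a small 2-D sample set into $B(0;1)$, assembles $M(\bar\phi, \hat Y)$ for full quadratic interpolation, and inspects its condition number; the sample points and the poisedness threshold are arbitrary illustrative choices.

```python
import numpy as np

def quad_basis_2d(x):
    """Natural quadratic basis of P_2^2 evaluated at x = (x1, x2)."""
    x1, x2 = x
    return np.array([1.0, x1, x2, 0.5 * x1**2, x1 * x2, 0.5 * x2**2])

def scaled_interpolation_matrix(Y):
    """Shift and scale Y into B(0; 1), then assemble M(phi_bar, Y_hat)."""
    Y = np.asarray(Y, dtype=float)
    center = Y[0]                                   # shift the first point to the origin
    radius = np.linalg.norm(Y - center, axis=1).max()
    Y_hat = (Y - center) / radius                   # scaled copy inside the unit ball
    return np.array([quad_basis_2d(y) for y in Y_hat])

# Six sample points in R^2 (q = 6 for full quadratic interpolation).
Y = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
              [1.0, 1.0], [0.5, 0.0], [0.0, 0.5]])
M = scaled_interpolation_matrix(Y)
cond = np.linalg.cond(M)
print("condition number of M(phi_bar, Y_hat):", cond)
if cond > 1e8:                                      # illustrative threshold
    print("the sample set is badly poised: its geometry should be improved")
```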

To incorporate models in the trust-region framework, one has to adapt the model construction to different numbers of degrees of freedom (which depend on both the cardinality of the interpolation set and the number of variables). For instance, during the first iterations one has only a few points and so cannot always construct an interpolation model. When $p = n + 1$ points are available, we can build a linear model, which is known to be sufficient to make some progress. When the number of function evaluations $p$ exceeds $n + 1$ but is not more than $\frac{1}{2}(n+1)(n+2)$, the coefficient matrix $M(\phi, Y)$ contains more columns than rows, and thus the interpolation polynomials defined by (2.4) are no longer unique for quadratic interpolation. To overcome this problem, one uses under-determined models, which have been widely used in many practical DFO implementations (see Section 2.1.2.3). A complete quadratic model can be built once the number of function evaluations equals $\frac{1}{2}(n+1)(n+2)$; such models are expected to lead to faster progress. When the number of function evaluations $p$ exceeds $\frac{1}{2}(n+1)(n+2)$, regression models can be used (see Section 2.1.2.4). Regression models have been shown to be often better than just selecting the 'best' subset of $\frac{1}{2}(n+1)(n+2)$ points and using the chosen subset to build complete quadratic models [50].

2.1.2.3 Under-determined interpolation models

The interpolation polynomials defined by (2.4) are not unique in this case; different approaches can be used [50, 52]:

Sub-basis models: A simple way to impose uniqueness of the interpolation polynomial is to restrict the linear system (2.7) so that it has a unique solution (by removing $q - p$ columns of $M(\phi, Y)$; the corresponding elements of the solution $\alpha_\phi$ are set to zero). This approach is in general not very successful, except if we have a priori knowledge of the sparsity structure of the gradient and the Hessian of the objective function. Such information can be exploited by deleting the corresponding columns in the linear system (2.7). Choosing $p$ columns in $M(\phi, Y)$ corresponds to removing polynomials from the basis $\phi$ to obtain a new one, $\tilde\phi$. As a consequence, the point set $Y$ has to be well poised with respect to the subspace generated by $\tilde\phi$.

Minimum norm models: A second approach to obtain a unique polynomial solution for the under-determined system (2.7) is to compute the minimum $\ell_2$-norm solution $\alpha_\phi$. In this case, the problem to solve is defined as follows:

$$\min \; \tfrac{1}{2}\|\alpha_\phi\|_2^2 \quad \text{s.t.} \quad M(\phi, Y)\,\alpha_\phi = f(Y). \qquad (2.8)$$

Assuming that the coefficient matrix $M(\phi, Y)$ has full row rank, the solution of problem (2.8) is given by

$$\alpha_\phi = M(\phi, Y)^\dagger f(Y), \qquad (2.9)$$

where $M(\phi, Y)^\dagger$ denotes the Moore-Penrose pseudo-inverse of $M(\phi, Y)$, which can be computed using a QR factorization or a singular value decomposition of the coefficient matrix. The polynomial solution found in (2.9) depends on the choice of the basis $\phi$. In practice, it has been observed that it is worthwhile to consider the minimum $\ell_2$-norm solution when working with the natural polynomial basis $\bar\phi$ [52, Section 5.1].
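A small numerical illustration of (2.8)-(2.9): with fewer interpolation points than basis functions, the Moore-Penrose pseudo-inverse (or, equivalently, numpy's minimum-norm least-squares routine) yields the coefficients of the under-determined quadratic model. The sample points and the test function are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n, q, p = 2, 6, 4                          # q basis functions, only p < q points

def quad_basis(x):                         # natural quadratic basis in R^2
    x1, x2 = x
    return np.array([1.0, x1, x2, 0.5 * x1**2, x1 * x2, 0.5 * x2**2])

Y = rng.standard_normal((p, n))            # p sample points
fY = np.array([np.sin(y[0]) + y[1]**2 for y in Y])     # arbitrary smooth f

M = np.array([quad_basis(y) for y in Y])   # p x q, full row rank (generically)
alpha = np.linalg.pinv(M) @ fY             # minimum l2-norm solution (2.9)

# The model interpolates f on Y ...
assert np.allclose(M @ alpha, fY)
# ... and coincides with the minimum-norm solution returned by lstsq.
alpha_lstsq, *_ = np.linalg.lstsq(M, fY, rcond=None)
assert np.allclose(alpha, alpha_lstsq)
```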

Minimum Frobenius norm models: The error bounds on both the objective function and its gradient, for under-determined interpolation models, depend on the norm of the Hessian of the model [52, Theorem 5.4]. Therefore, the motivation of this approach is to build models with a minimum value of the norm of the model Hessian. In the quadratic interpolation case, such a minimization is equivalent to minimizing the coefficients of $\alpha_\phi$ related to the quadratic monomials. By splitting the natural basis $\bar\phi$ into two parts, a linear part $\bar\phi_L = \{1, x_1, x_2, \ldots, x_n\}$ and a quadratic part $\bar\phi_Q = \{\tfrac{1}{2}x_1^2, x_1 x_2, \ldots, \tfrac{1}{2}x_n^2\}$, the interpolation model can be written as follows:

$$m(x) = \alpha_L^\top \bar\phi_L + \alpha_Q^\top \bar\phi_Q,$$

where $\alpha_L$ and $\alpha_Q$ are the solution of the following optimization problem:

$$\min \; \tfrac{1}{2}\|\alpha_Q\|_2^2 \quad \text{s.t.} \quad M(\bar\phi_L, Y)\,\alpha_L + M(\bar\phi_Q, Y)\,\alpha_Q = f(Y). \qquad (2.10)$$

The corresponding solution $\alpha_{\bar\phi} = [\alpha_L, \alpha_Q]$ is called the minimum Frobenius norm solution. In fact, due to the choice of the natural basis, solving problem (2.10) is equivalent to minimizing the Frobenius norm of the Hessian of $m(x)$ (the Frobenius norm $\|\cdot\|_F$ of a square matrix $A$ is defined by $\|A\|_F = \sqrt{\sum_{1 \le i,j \le n} A_{ij}^2}$). The solution of (2.10) exists and is uniquely defined if the following matrix is nonsingular:

$$F(\bar\phi, Y) = \begin{pmatrix} M(\bar\phi_Q, Y)\, M(\bar\phi_Q, Y)^\top & M(\bar\phi_L, Y) \\ M(\bar\phi_L, Y)^\top & 0 \end{pmatrix}.$$

The matrix $F(\bar\phi, Y)$ is nonsingular if and only if the coefficient matrix $M(\bar\phi_L, Y)$ has full column rank and $M(\bar\phi_Q, Y)\, M(\bar\phi_Q, Y)^\top$ is positive definite on the null space of $M(\bar\phi_L, Y)^\top$ (the last condition can be ensured if the matrix $M(\bar\phi_L, Y)$ has full row rank). In this case, the sample set $Y$ is called poised in the minimum Frobenius norm sense. The coefficients $\alpha_L$ and $\alpha_Q$ are computed by first solving

$$F(\bar\phi, Y) \begin{pmatrix} \mu \\ \alpha_L \end{pmatrix} = \begin{pmatrix} f(Y) \\ 0 \end{pmatrix}$$

to find $\alpha_L$ and the Lagrange multiplier $\mu$ of problem (2.10), and then computing $\alpha_Q = M(\bar\phi_Q, Y)^\top \mu$ to complete the model construction.
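The construction just described can be coded directly: assemble $F(\bar\phi, Y)$, solve for $(\mu, \alpha_L)$, and recover $\alpha_Q = M(\bar\phi_Q, Y)^\top \mu$. The sketch below does this for a small 2-D example; the sample points and the test function are arbitrary, and poisedness in the minimum Frobenius norm sense is simply assumed to hold for them.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 2, 5                                    # p points, with n + 1 = 3 < p < q = 6

def linear_part(x):                            # phi_bar_L = (1, x1, x2)
    return np.array([1.0, x[0], x[1]])

def quad_part(x):                              # phi_bar_Q = (x1^2/2, x1*x2, x2^2/2)
    return np.array([0.5 * x[0]**2, x[0] * x[1], 0.5 * x[1]**2])

Y = rng.standard_normal((p, n))
fY = np.array([np.cos(y[0]) + y[0] * y[1] for y in Y])

ML = np.array([linear_part(y) for y in Y])     # p x (n+1)
MQ = np.array([quad_part(y) for y in Y])       # p x nQ

# KKT matrix F(phi_bar, Y) and right-hand side (f(Y), 0).
F = np.block([[MQ @ MQ.T, ML],
              [ML.T, np.zeros((ML.shape[1], ML.shape[1]))]])
rhs = np.concatenate([fY, np.zeros(ML.shape[1])])
sol = np.linalg.solve(F, rhs)
mu, alpha_L = sol[:p], sol[p:]
alpha_Q = MQ.T @ mu                            # minimum Frobenius norm coefficients

# The resulting quadratic model interpolates f on Y.
assert np.allclose(ML @ alpha_L + MQ @ alpha_Q, fY)
```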

A variant of the minimum Frobenius norm model is the least Frobenius norm updating of quadratic models [137]. Instead of minimizing the Frobenius norm of the model Hessian, one minimizes its change from the previously computed Hessian to the current one. The new optimization problem can be formulated as follows:

$$\min \; \tfrac{1}{2}\|\alpha_Q - \alpha_Q^{\text{old}}\|_2^2 \quad \text{s.t.} \quad M(\bar\phi_L, Y)\,\alpha_L + M(\bar\phi_Q, Y)\,\alpha_Q = f(Y). \qquad (2.11)$$

This optimization problem is solved through a shifted problem in $\alpha_{\text{dif}} = \alpha_Q - \alpha_Q^{\text{old}}$ of the type given in (2.10).

Minimum Frobenius norm models and their variants have been shown to be among the most efficient and successful ways to build quadratic models, and they are implemented in many software packages [52, 138]. The minimization of the change in the Hessian of the model from one iteration to the next works very well in some cases, in particular when $p = 2n + 1$ [137, 138].

Sparse quadratic interpolation: When the structure of the Hessian is sparse, it is possible, by using the $\ell_1$ norm, to recover the sparsity of the constructed model in the under-determined case [28]. In fact, instead of solving (2.10) we consider the following optimization problem:

$$\min \; \|\alpha_Q\|_1 \quad \text{s.t.} \quad M(\bar\phi_L, Y)\,\alpha_L + M(\bar\phi_Q, Y)\,\alpha_Q = f(Y), \qquad (2.12)$$

where $\alpha_Q$, $\alpha_L$, $\bar\phi_L$, and $\bar\phi_Q$ are defined as in (2.10). Solving (2.12) is tractable, since it can be reformulated as a linear program (LP). The sparse quadratic approach is shown to be more advantageous when the Hessian of $f$ has zero entries [28].
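One standard way to make the LP explicit, sketched below under our own variable naming, is to split $\alpha_Q$ into nonnegative parts and minimize their sum with an off-the-shelf LP solver (scipy's linprog here); this illustrates the reformulation rather than reproducing the approach of [28].

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(3)
p, nL, nQ = 4, 3, 3                     # 4 points in R^2; 3 linear, 3 quadratic terms

def linear_part(x):  return np.array([1.0, x[0], x[1]])
def quad_part(x):    return np.array([0.5 * x[0]**2, x[0] * x[1], 0.5 * x[1]**2])

Y = rng.standard_normal((p, 2))
fY = np.array([y[0]**2 + 0.1 * y[1] for y in Y])
ML = np.array([linear_part(y) for y in Y])
MQ = np.array([quad_part(y) for y in Y])

# Variables: (alpha_L, u, v) with alpha_Q = u - v, u >= 0, v >= 0, so that
# sum(u) + sum(v) equals ||alpha_Q||_1 at the optimum.
c = np.concatenate([np.zeros(nL), np.ones(nQ), np.ones(nQ)])
A_eq = np.hstack([ML, MQ, -MQ])
res = linprog(c, A_eq=A_eq, b_eq=fY,
              bounds=[(None, None)] * nL + [(0, None)] * (2 * nQ),
              method="highs")
alpha_L = res.x[:nL]
alpha_Q = res.x[nL:nL + nQ] - res.x[nL + nQ:]
print("sparse quadratic coefficients:", np.round(alpha_Q, 6))
```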


2.1.2.4 Regression models

This section is devoted to the case where the number of points $p$ is larger than $q$, meaning that, in the quadratic interpolation case, $p$ exceeds $\frac{1}{2}(n+1)(n+2)$. In this situation, the linear system (2.7) is overdetermined and in general has no solution. The key idea of regression models is to find the best solution minimizing the gap between $M(\phi, Y)\,\alpha_\phi$ and $f(Y)$. In other words, the coefficients $\alpha_\phi$ are the solution of the following linear least-squares problem:

$$\min_{\alpha_\phi} \; \|M(\phi, Y)\,\alpha_\phi - f(Y)\|_2^2. \qquad (2.13)$$

When the coefficient matrix has full column rank, the minimization problem (2.13) has a unique solution given by solving the normal equations

$$M(\phi, Y)^\top M(\phi, Y)\,\alpha_\phi = M(\phi, Y)^\top f(Y).$$

To solve this linear system, a singular value decomposition or a QR factorization of the coefficient matrix can be used. Regression models are highly recommended, especially when the objective function is noisy [50, 52].
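For the overdetermined case, the sketch below solves (2.13) with an SVD-based least-squares routine instead of forming the normal equations explicitly; the data, the noise level, and the underlying test function are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
p = 20                                        # p > q = 6: overdetermined

def quad_basis(x):
    x1, x2 = x
    return np.array([1.0, x1, x2, 0.5 * x1**2, x1 * x2, 0.5 * x2**2])

Y = rng.standard_normal((p, 2))
noise = 1e-2 * rng.standard_normal(p)         # noisy objective values
fY = np.array([1.0 + 2 * y[0] - y[1] + y[0]**2 for y in Y]) + noise

M = np.array([quad_basis(y) for y in Y])      # p x q, full column rank
alpha, residuals, rank, _ = np.linalg.lstsq(M, fY, rcond=None)

# alpha also satisfies the normal equations M^T M alpha = M^T f(Y).
assert np.allclose(M.T @ M @ alpha, M.T @ fY)
print("regression coefficients:", np.round(alpha, 3))
```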

2.1.3 An interpolation based trust-region approach

Different interpolation-based trust-region methods are available in the literature. The existing methods can be divided into two categories: the first consists of methods that work well on practical problems but are not supported by a convergence theory; the second category includes methods for which global convergence was shown, but which are in practice less competitive than those in the first category. The algorithmic framework described in this section requires the use of fully linear models, i.e. models with accuracy properties similar to those of a first-order Taylor expansion. A rigorous definition of a fully linear model can be found in [51, Definition 3.1] (see also [52, Definition 10.3]). Algorithm 2.1 is a derivative-free interpolation-based trust-region algorithm for which global convergence to first-order stationary points is proved [51, 52].

The algorithm as presented is simple, we check if the norm of the model gradient is too small. If it is, we start the criticality step with the purpose of verifying if the gradient of the objective function f is also small. At each iteration, many situations can occur: an iteration is successful whenever ρk≥ ν1; the trial point is then accepted


Algorithm 2.1: A DFO trust-region algorithm.

Initialization: Let an initial point x_0 and the value f(x_0) be given. Choose an initial trust-region radius ∆_0 > 0. Select an initial model m_0. Set k = 0 and choose the parameters ε_g > 0, 0 < γ < 1 < γ_inc, 0 < ν_0 ≤ ν_1 < 1, and µ > β > 0.

1. Criticality step: When ||∇m_k(x_k)|| ≤ ε_g, apply some procedure to find a new model m_k and a new trust-region radius ∆_k such that ∆_k ≤ µ ||∇m_k(x_k)|| and m_k is fully linear on B(x_k; ∆_k), and such that, if ∆_k is reduced, one has β ||∇m_k(x_k)|| ≤ ∆_k.

2. Compute the step: Compute a step s_k such that

s_k = argmin_{s ∈ B(0, ∆_k)} m_k(x_k + s).    (2.14)

3. Accept the trial point: Compute f(x_k + s_k) and

ρ_k = [f(x_k) − f(x_k + s_k)] / [m_k(x_k) − m_k(x_k + s_k)].

If ρ_k ≥ ν_1, or if both ρ_k ≥ ν_0 and the model is fully linear on B(x_k; ∆_k), then x_{k+1} = x_k + s_k and the model is updated to take into consideration the new iterate, resulting in a new model m_{k+1}; otherwise m_{k+1} = m_k and x_{k+1} = x_k.

4. Improve the model: If ρ_k < ν_1, use a model-improvement algorithm to certify that the model m_k is fully linear on B(x_k; ∆_k). Let m_{k+1} be the new, possibly improved, model.

5. Update the trust-region radius: Set

∆_{k+1} ∈ [∆_k, min{γ_inc ∆_k, ∆_max}]   if ρ_k ≥ ν_1,
∆_{k+1} = γ ∆_k                           if ρ_k < ν_1 and m_k is fully linear,
∆_{k+1} = ∆_k                             if ρ_k < ν_1 and m_k is not certifiably fully linear.

Increment k by one and return to Step 1.

When ν_0 ≤ ρ_k < ν_1 and the model is fully linear (see Algorithm 2.1), the trial point is again accepted but the trust-region radius is decreased; such an iteration is called acceptable. The third situation occurs when ρ_k < ν_1 and the model m_k is not certifiably fully linear (see [51, Definition 3.1]). In this case, the geometry should be improved; the trial point may be included in the sample set but it will not be accepted as the new iterate; such an iteration is called model-improving. The last situation occurs when ρ_k < ν_0 and m_k is fully linear; in this case only the trust-region radius is reduced, while the other parameters (including the current iterate) are kept the same; such an iteration is declared unsuccessful. The model-improvement cycle in Step 4 can in principle be launched for an infinite number of iterations.


However, when the models are assumed to be fully linear and uniformly bounded, one can ensure that only finitely many improvement steps will take place [52]. The criticality step is not described in detail (see [51, 52] for more details); essentially, in such a step one keeps reducing the trust-region radius ∆_k and computing a fully linear model in B(x_k; ∆_k) until ∆_k ≤ µ ||∇m_k(x_k)|| is obtained. At the exit of the criticality step one also has ∆_k ≥ β ||∇m_k(x_k)|| (with µ > β).
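For illustration only, the fragment below sketches the acceptance test (Step 3) and the radius update (Step 5) of Algorithm 2.1 in Python. The criticality step, the model construction, and the fully linear certification are abstracted away (the fully_linear flag is assumed to come from a geometry-improving procedure such as those of [52, 138]), and the default parameter values are arbitrary.

```python
def accept_and_update(f, x, s, m, Delta, fully_linear,
                      nu0=0.25, nu1=0.75, gamma=0.5, gamma_inc=2.0,
                      Delta_max=1e3):
    """One pass through Steps 3 and 5 of Algorithm 2.1.
    f: objective; m: current model (callable); s: trial step;
    fully_linear: flag assumed to be provided by a geometry procedure."""
    pred = m(x) - m(x + s)                       # predicted decrease
    ared = f(x) - f(x + s)                       # actual decrease
    rho = ared / pred if pred > 0 else -float("inf")

    if rho >= nu1 or (rho >= nu0 and fully_linear):
        x_new = x + s                            # successful or acceptable
    else:
        x_new = x                                # model-improving or unsuccessful

    if rho >= nu1:
        Delta_new = min(gamma_inc * Delta, Delta_max)
    elif fully_linear:
        Delta_new = gamma * Delta                # contract with a certified model
    else:
        Delta_new = Delta                        # improve the geometry first
    return x_new, Delta_new
```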

2.1.3.1 The trust-region subproblem

In Step 2 of Algorithm 2.1, one needs to approximate a minimizer s_k of the following optimization problem (called the trust-region subproblem):

min_{s ∈ B(0, ∆_k)} m_k(x_k + s),    (2.15)

where m_k is the model of the objective function and B(0, ∆_k) is the trust region. The computation of such a step s_k is crucial for the convergence theory of trust-region methods. In general, it is not necessary to find an exact minimizer of this optimization problem as long as the computed step ensures some form of sufficient decrease condition, meaning that the new step s_k has to fulfill

m_k(x_k + s_k) ≤ m_k(x_k) − ψ_k,

where ψ_k is a positive value satisfying suitable conditions [52]. The key point is to make sure that the total decrease is at least a fraction of that obtained with the Cauchy step s_k^C [52, Chapter 10], for all iterations k:

m_k(x_k) − m_k(x_k + s_k) ≥ κ_fcd [m_k(x_k) − m_k(x_k + s_k^C)],    (2.16)

where κ_fcd ∈ (0, 1]. The Cauchy step s_k^C can be computed by a backtracking line search along the steepest descent direction given by the gradient of the model. As a consequence, the Cauchy step is defined by

s_k^C = −t_k^C g_k,    (2.17)

where t_k^C is given by

t_k^C = argmin_{t ≥ 0 : x_k − t g_k ∈ B(x_k; ∆_k)} m_k(x_k − t g_k).
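Since the model is quadratic, the one-dimensional minimizer defining t_k^C admits a closed form, which the following sketch computes directly (a backtracking search along −g_k, as described above, would be used when one prefers not to exploit this formula); the function name and the handling of the degenerate case g_k = 0 are illustrative.

```python
import numpy as np

def cauchy_step(g, H, Delta):
    """Cauchy step (2.17) for the quadratic model
    m_k(x_k + s) = f_k + g^T s + 0.5 s^T H s restricted to ||s|| <= Delta."""
    norm_g = np.linalg.norm(g)
    if norm_g == 0.0:
        return np.zeros_like(g)                 # x_k is model-stationary
    t_max = Delta / norm_g                      # largest t keeping x_k - t g in the ball
    gHg = g @ (H @ g)
    if gHg <= 0.0:
        t_C = t_max                             # model decreases all along -g
    else:
        t_C = min(norm_g ** 2 / gHg, t_max)     # unconstrained minimizer, clipped
    return -t_C * g
```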


The Cauchy step satisfies the condition

m_k(x_k) − m_k(x_k + s_k^C) ≥ (1/2) ||g_k|| min{ ||g_k|| / ||H_k||, ∆_k }.    (2.18)

2.1.3.2 Global convergence

To prove global convergence to first-order critical points (convergence to a stationary point regardless of the starting point), it suffices to assume, in addition to assumption (2.16), that the gradient of the objective function f is Lipschitz continuous. We suppose also that the model Hessian is bounded (see [52] for a complete and detailed convergence analysis).

Under such assumptions it is provable that the trust-region radius in Algorithm 2.1 converges to zero [52, Lemma 10.9]:

Lemma 2.1. Consider a sequence of iterations generated by Algorithm 2.1 without any stopping criterion. Then, under the above assumptions, one has

lim_{k→+∞} ∆_k = 0.    (2.19)

When the sequence of iterates is bounded, one can also prove that all limit points of the sequence of iterates are first-order stationary points. The global convergence result is then derived as follows [52, Theorem 10.13]:

Theorem 2.2. Consider a sequence of iterations generated by Algorithm 2.1 without any stopping criterion. Then, under the above assumptions, one has

lim_{k→+∞} ∇f(x_k) = 0.    (2.20)

2.2 Direct-search methods

Direct-search methods correspond to DFO algorithms where sampling, at each iteration, is guided by a finite set of directions with some appropriate features. These methods do not use any derivative approximation or model building. In this section, by direct-search we mean the directional type; we refer the reader to [52,102,128] and references therein for more details on the other types of direct-search methods. To describe direct-search algorithms, we first present some related basic concepts.


2.2.1 Basic concepts

To guide the optimization process, the directions used in direct-search methods must have some appropriate features. One essential property consists in ensuring that at least one of the chosen directions is a descent direction. A direction d is said to be a descent direction at the point x if there exists a positive value ᾱ such that

∀α ∈ (0, ᾱ],   f(x + αd) < f(x).    (2.21)

When f is continuously differentiable at x and ∇f(x) ≠ 0, any direction d satisfying −∇f(x)^T d > 0 is a descent direction. To ensure the existence of such directions, some notions related to positive spanning sets and positive bases are needed [52, 56].

2.2.1.1 Positive spanning sets and positive bases

The positive span of a set of vectors [v_1, . . . , v_r] in R^n is defined as the convex cone positively generated by [v_1, . . . , v_r], meaning the set {v ∈ R^n : v = Σ_{i=1}^{r} α_i v_i, α_i ≥ 0, i = 1, . . . , r} [52, 56].

Definition 2.3.

• A positive spanning set (PSS) in R^n is a set of vectors whose positive span is R^n.

• The set [v_1, . . . , v_r] is said to be positively dependent if one of the vectors is in the convex cone positively spanned by the remaining vectors, i.e., if one of the vectors is a positive combination of the others; otherwise, the set is positively independent.
• A positive basis in R^n is a positively independent set whose positive span is R^n.

Unlike bases of R^n, which contain exactly n vectors, a positive basis has at least n + 1 and at most 2n vectors [15, 56]. Positive bases with n + 1 and 2n vectors are referred to as minimal and maximal positive bases, respectively.

Example 2.1. Let B = [e_1, e_2, . . . , e_n] be the canonical basis of R^n, where e_i denotes the vector with a 1 in the i-th coordinate and 0's elsewhere, and let e = Σ_{i=1}^{n} e_i. Then

• D_⊕ = [B, −B] is a maximal positive basis of R^n, where −B = [−e_1, −e_2, . . . , −e_n];
• [B, −e] is a minimal positive basis of R^n.
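A small illustrative sketch (the function names are assumptions, not notation from the text) constructs the two positive bases of Example 2.1 and checks numerically, on random nonzero vectors w, that at least one element d of each basis satisfies w^T d > 0, the descent property formalized in Theorem 2.5 below.

```python
import numpy as np

def maximal_positive_basis(n):
    """D_plus = [I, -I]: the 2n columns e_1, ..., e_n, -e_1, ..., -e_n."""
    I = np.eye(n)
    return np.hstack([I, -I])

def minimal_positive_basis(n):
    """[I, -e]: the n+1 columns e_1, ..., e_n and -(e_1 + ... + e_n)."""
    I = np.eye(n)
    return np.hstack([I, -np.ones((n, 1))])

# For random nonzero w, some column d of each basis satisfies w^T d > 0.
rng = np.random.default_rng(1)
for D in (maximal_positive_basis(3), minimal_positive_basis(3)):
    W = rng.standard_normal((3, 100))           # 100 random directions
    assert np.all((D.T @ W).max(axis=0) > 0)
```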


Figure 2.1: A graphical representation of the maximal positive basis D_1 = D_⊕ (left) and the minimal positive basis D_2 (right) for R^2.

In Figure 2.1, we depict two positive bases for R^2 (maximal and minimal).

As stated in [52, Theorem 2.4], if [v_1, . . . , v_r] is a positive basis for R^n and W ∈ R^{n×n} is a nonsingular matrix, then [W v_1, . . . , W v_r] is also a positive basis for R^n. In other words, having a positive basis of R^n, one can ensure the existence of infinitely many different ones. Attractive properties of positive bases (explaining their use in direct-search methods) are as follows:

Theorem 2.4. Let [v_1, . . . , v_r] be a positive basis for R^n and w ∈ R^n. Then

[ ∀i ∈ {1, . . . , r}, v_i^T w ≥ 0 ] ⇒ [ w = 0 ].    (2.22)

Proof. Since [v_1, . . . , v_r] spans R^n positively, the vector −w can be written as

−w = Σ_{i=1}^{r} λ_i v_i,

where λ_i ≥ 0 for all i = 1, . . . , r.

From (2.22) we have v_i^T w ≥ 0 for all i ∈ {1, . . . , r}, and so

0 ≤ Σ_{i=1}^{r} λ_i v_i^T w = −w^T w ≤ 0.

The only possibility is then w = 0.

Thus, by choosing w = −∇f(x) in Theorem 2.4, positive bases can be used to check whether or not a point x ∈ R^n is a stationary point of the objective function.

Theorem 2.5. Let f be a continuously differentiable function with ∇f(x) ≠ 0 for some x ∈ R^n, and let [v_1, . . . , v_r] be a positive basis for R^n. Then there exists i ∈ {1, . . . , r} such that

−∇f(x)^T v_i > 0.


Proof. Let w = −∇f(x), where x ∈ R^n. One knows that w^T w > 0 for all nonzero w and, since [v_1, . . . , v_r] spans R^n positively, one has

w = Σ_{i=1}^{r} λ_i v_i,

where λ_i ≥ 0 for all i = 1, . . . , r. Hence,

w^T w = Σ_{i=1}^{r} λ_i w^T v_i > 0,

from which we conclude that at least one of the scalars w^T v_1, . . . , w^T v_r has to be positive.

In other words, Theorem 2.5 states that there must exist at least one descent direction in a positive basis. In Figure 2.2, we identify the descent direction for the two positive spanning sets D_1 and D_2 in R^2.

Figure 2.2: For a given positive spanning set and a vector w = −∇f(x) (green), there must exist at least one descent direction d (red), i.e., w^T d > 0.

2.2.1.2 Gradient estimates

By assuming that the set of search directions is a PSS, one is sure that at each iteration a descent direction must exist in the PSS. However, in practice finding a good descent direction may not be possible; see for instance Figure 2.3, where two vectors of the PSS tend to become collinear with opposite directions. A good descent direction can be defined as a direction making an acute angle with the negative gradient: the more acute the angle between the descent direction and the negative gradient of the objective function, the better the direction.

Figure 2.3: A positive spanning set with a very small cosine measure.

A PSS gives descent directions at each iteration but may not be good enough (depending on the level of acuteness) to ensure convergence; in this case the PSS is said to be degenerate. Thus, the question that arises naturally is: how can one measure and control any deterioration of the PSS property so as to avoid its degeneracy? For that sake, we review the notion of the cosine measure of positive spanning sets [108].

Definition 2.6. The cosine measure of a positive spanning set (with nonzero vectors) or of a positive basis D is defined by

cm(D) = min_{0 ≠ v ∈ R^n} max_{d ∈ D} (v^T d) / (||v|| ||d||).

In R^2, the cosine measure of a positive spanning set is the cosine of half of the largest angle θ between two of its adjacent vectors (see Figure 2.4).

Figure 2.4: In R^2, for a given positive spanning set the cosine measure is given by cos(θ/2), where θ (blue) is the largest angle between two adjacent vectors; for D_1, θ = π/2 and cm(D_1) = cos(π/4), while for D_2, θ = 3π/4 and cm(D_2) = cos(3π/8).

Remark 2.7. The cosine measure of a positive spanning set is strictly positive.
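As a rough illustration, the cosine measure can be estimated by sampling unit vectors v and taking the smallest of the inner maxima in Definition 2.6; since the true minimum ranges over all nonzero v, such sampling only provides an upper bound on cm(D). The sketch below, with illustrative names, does exactly that.

```python
import numpy as np

def cosine_measure_estimate(D, n_samples=100000, seed=0):
    """Monte Carlo estimate of cm(D) = min_{v != 0} max_{d in D} v^T d / (||v|| ||d||).
    Sampling v only gives an upper bound on the true cosine measure."""
    n = D.shape[0]
    Dn = D / np.linalg.norm(D, axis=0)          # normalize the columns of D
    rng = np.random.default_rng(seed)
    V = rng.standard_normal((n, n_samples))
    V /= np.linalg.norm(V, axis=0)              # random unit vectors v
    return (Dn.T @ V).max(axis=0).min()

D_plus = np.hstack([np.eye(2), -np.eye(2)])     # maximal positive basis of R^2
print(cosine_measure_estimate(D_plus))          # slightly above cos(pi/4) ~ 0.707
```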

In terms of descent, a key point of the cosine measure can be seen as follows: given a nonzero vector w ∈ R^n, one has

cm(D) ≤ max_{d ∈ D} (w^T d) / (||w|| ||d||).

Thus there must exist a d ∈ D such that

cm(D) ≤ (w^T d) / (||w|| ||d||).

In particular, if one chooses w = −∇f(x), then

cm(D) ||∇f(x)|| ||d|| ≤ −∇f(x)^T d.    (2.23)

A cosine measure close to zero indicates a deterioration of the PSS, meaning that the PSS becomes degenerate. To see how the cosine measure can predict such deterioration,

