Des stratégies évolutionnaires globalement convergentes avec une application en imagerie sismique pour la géophysique


HAL Id: tel-01121075

https://tel.archives-ouvertes.fr/tel-01121075

Submitted on 27 Feb 2015

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.


Des stratégies évolutionnaires globalement convergentes avec une application en imagerie sismique pour la géophysique

Youssef Diouane

To cite this version:

Youssef Diouane. Des stratégies évolutionnaires globalement convergentes avec une application en imagerie sismique pour la géophysique. Mathématiques [math]. INP DE TOULOUSE, 2014. Français. ⟨tel-01121075⟩


THÈSE

In fulfilment of the requirements for the degree of

DOCTORAT DE L'UNIVERSITÉ DE TOULOUSE

Delivered by: Institut National Polytechnique de Toulouse (INP Toulouse)

Presented and defended on 17/10/2014 by:

Youssef DIOUANE

Globally convergent evolution strategies with application to an Earth imaging problem in geophysics

JURY

Henri Calandra, Total, USA, President of jury
Serge Gratton, INPT, France, PhD advisor
Luis Nunes Vicente, University of Coimbra, Portugal, PhD co-advisor
Stefano Lucidi, University of Rome, Italy, Referee
Thomas Baeck, Leiden University, Netherlands, Referee
Xavier Vasseur, CERFACS, France, Member of jury

Doctoral school and speciality:

MITT, Mathematics: Applied Mathematics

Research unit:

Centre Européen de Recherche et de Formation Avancée en Calcul Scientifique (CERFACS)

Thesis advisors:

Serge GRATTON and Luis Nunes VICENTE

Referees:

Stefano Lucidi and Thomas Baeck


Résumé

In recent years, a particular interest has developed in derivative-free optimization. This research field divides into two categories: one deterministic and the other stochastic. Although they belong to the same field, few links have so far been established between these two branches. This thesis aims to fill this gap by showing how techniques from deterministic optimization can improve the performance of evolution strategies, which are among the best methods in stochastic optimization.

Under certain assumptions, the proposed modifications ensure a form of global convergence, that is, convergence towards a first-order stationary point independently of the chosen starting point. We then propose to adapt our algorithm so that it can handle problems with general constraints. We also show how to improve the numerical performance of evolution strategies by incorporating a search step at the beginning of each iteration, in which a quadratic model is built using the points where the objective function has already been evaluated.

Thanks to recent technical advances in parallel computing and to the parallelizable nature of evolution strategies, we propose to apply our algorithm to the solution of a seismic imaging inverse problem. The results obtained improved the resolution of this problem.

Keywords: numerical optimization, evolution strategies, global convergence, sufficient decrease, inverse problems, subsurface imaging, acoustic full-waveform inversion, high performance computing (HPC).


Abstract

In recent years, there has been significant and growing interest in Derivative-Free Optimization (DFO). This field can be divided into two categories: deterministic and stochastic. Despite addressing the same problem domain, only a few interactions between the two DFO categories have been established in the existing literature. In this thesis, we attempt to bridge this gap by showing how ideas from deterministic DFO can improve the efficiency and the rigorousness of one of the most successful classes of stochastic algorithms, known as Evolution Strategies (ES's).

We propose to equip a class of ES's with known techniques from deterministic DFO. The modified ES's rigorously achieve a form of global convergence under reasonable assumptions. By global convergence, we mean convergence to first-order stationary points independently of the starting point. The modified ES's are extended to handle general constrained optimization problems. Furthermore, we show how to significantly improve the numerical performance of ES's by incorporating a search step at the beginning of each iteration. In this step, we build a quadratic model using the points where the objective function has been previously evaluated.

Motivated by the recent growth of high performance computing resources and the parallel nature of ES's, an application of our modified ES's to an Earth imaging problem in geophysics is proposed. The obtained results provide a great improvement over the known solutions of this problem.

Keywords: Numerical optimization, evolution strategies, global convergence, sufficient decrease, inverse problems, Earth imaging, acoustic full-waveform inversion, high performance computing (HPC).


Acknowledgements

It is a pleasure to thank the many people who made this thesis possible. First and foremost I want to thank my supervisors Serge Gratton and Luis Nunes Vicente. They have taught me, both consciously and unconsciously, how good research is done. I appreciate their availability for the fruitful discussions which made my PhD experience productive and stimulating. The joy and enthusiasm they have for their research was contagious and motivational for me, even during tough times in the PhD pursuit. I am also thankful for the excellent example they have provided as successful researchers. I am equally grateful to Henri Calandra and Total E&P for funding my PhD, without which this great experience would not have been possible, and for the very challenging geophysical application they provided, which justifies all the effort behind my studies. I would like to express my sincere thanks to Xavier Vasseur for his daily guidance and advice, from which I learned so much, not only for my thesis but also for my future career. I would also like to thank the referees, Thomas Baeck and Stefano Lucidi, for their careful and enlightening comments on my research.

I am also grateful to all of the ALGO team members at CERFACS for being with me during the past three years. Special thanks in particular to Selime Gürol for her help, advice, and encouragement. Many thanks also to Rafael Lago for his help during the early stage of my PhD. The CERFACS administration would not be as efficient without Brigitte Yzel and Michèle Campassens. Thanks to them for their permanent support with administrative procedures. They were always available to solve my problems with patience and a smile.

My special thanks to my best friend Elhoucine Bergou, thanks for all these 6 years spent together. My PhD would not have been the same without you my brother. Very special thanks to Zineb Ghormi for her never-ending support, trust, encouragement and understanding. My thanks go to my family and friends: my brothers Simohamed and Ayoub, my sister Mariam, my uncles Omar and Brahim, Hamza, Abdelhadi, Azhar, Nabil, Bassam, Naama, M’Barek, Daoud, Hassan, ...

Lastly, and most importantly, I wish to thank my parents, Aicha Ouaziz and Hissoune Diouane. They bore me, raised me, supported me, taught me, and loved me. To them I dedicate this thesis.


Contents

1 Introduction

2 Deterministic Derivative-Free Optimization
  2.1 Model based methods
    2.1.1 Trust-region framework
    2.1.2 Polynomial interpolation and regression models
      2.1.2.1 Polynomial bases
      2.1.2.2 Polynomial interpolation
      2.1.2.3 Under-determined interpolation models
      2.1.2.4 Regression models
    2.1.3 An interpolation based trust-region approach
      2.1.3.1 The trust-region subproblem
      2.1.3.2 Global convergence
  2.2 Direct-search methods
    2.2.1 Basic concepts
      2.2.1.1 Positive spanning sets and positive bases
      2.2.1.2 Gradient estimates
    2.2.2 Direct-search methods
      2.2.2.1 Coordinate-search method
      2.2.2.2 Direct-search framework
    2.2.3 Global convergence
      2.2.3.1 Global convergence for smooth functions
      2.2.3.2 Global convergence for non-smooth functions
  2.3 Conclusion

3 Stochastic Derivative-Free Optimization & Evolution Strategies
  3.1 Evolution strategies
    3.1.1 Notation and algorithm
    3.1.2 Recombination mechanism
    3.1.3 Selection mechanism
    3.1.4 Mutation mechanism
      3.1.4.1 The concept
      3.1.4.2 Example in real-valued search spaces
  3.2 A class of evolution strategies
    3.2.1 Concept and algorithm
    3.2.2 Some existing convergence results
    3.2.3 CMA-ES, a state of the art for ES
      3.2.3.1 The parent update
      3.2.3.2 Covariance matrix update
      3.2.3.3 Step size update
    3.2.4 Local meta-models and ES's
      3.2.4.1 Locally weighted regression
      3.2.4.2 Approximate ranking procedure
  3.3 Conclusion

4 Globally Convergent Evolution Strategies
  4.1 A class of evolution strategies provably globally convergent
    4.1.1 Globally convergent evolution strategies
    4.1.2 Convergence
      4.1.2.1 The step size behavior
      4.1.2.2 Global convergence
    4.1.3 Convergence assumptions
  4.2 Numerical experiments
    4.2.1 Algorithmic choices
    4.2.2 Test problems
    4.2.3 Test strategies
    4.2.4 Numerical results
    4.2.5 Global optimization tests
  4.3 Conclusions

5 Extension to Constraints
  5.1 A globally convergent ES for general constraints
    5.1.1 Algorithm description
    5.1.2 Step size behavior
    5.1.3 Global convergence
  5.2 A particularization for only unrelaxable constraints
    5.2.1 Algorithm description
    5.2.2 Asymptotic results
    5.2.3 Implementation choices
  5.3 Numerical experiments
    5.3.1 Unrelaxable constraints
      5.3.1.1 Solvers tested
      5.3.1.2 Algorithmic choices
      5.3.1.3 Test problems
      5.3.1.4 Comparison results
    5.3.2 Relaxable and unrelaxable constraints
      5.3.2.1 Test problems
      5.3.2.2 Test strategy
      5.3.2.3 Numerical results

6 Incorporating Local Models in a Globally Convergent ES
  6.1 Incorporating local models in a globally convergent ES
    6.1.1 The general strategy of the search step
    6.1.2 Trust-region subproblem in the search step
    6.1.3 Geometry control in the search step
    6.1.4 Constraints treatment in the search step
    6.1.5 Algorithm description
  6.2 Numerical experiments
    6.2.1 Test strategy
    6.2.2 Numerical results for unconstrained optimization
      6.2.2.1 Search step impact
      6.2.2.2 Comparison with other solvers
    6.2.3 Numerical results for constrained optimization
      6.2.3.1 Search step impact
      6.2.3.2 Comparison with other solvers
  6.3 Conclusions

7 Towards an Application in Seismic Imaging
  7.1 Full-waveform inversion
    7.1.1 Forward problem
    7.1.2 FWI as a least-squares local optimization
  7.2 ES for building an initial velocity model for FWI
    7.2.1 Methodology
    7.2.2 SEG/EAGE salt dome velocity model
    7.2.3 Search space reduction
      7.2.3.1 One-dimensional approximation procedure
      7.2.3.2 Three-dimensional approximation procedure
    7.2.4 A parallel ES for acoustic full waveform inversion
  7.3 Numerical experiments
    7.3.1 Implementation details
    7.3.2 Numerical results
  7.4 Conclusions

8 Conclusions & Perspectives
  8.1 Conclusions
  8.2 Perspectives

A Data & Performance Profiles Results

B Test Results


List of Figures

2.1 A graphical representation of the maximal positive basis $D_1$ (left) and the minimal positive basis $D_2$ (right) for $\mathbb{R}^2$.
2.2 For a given positive spanning set and a vector $w = -\nabla f(x)$ (green), there must exist at least one descent direction $d$ (red) (i.e. $w^\top d > 0$).
2.3 A positive spanning set with a very small cosine measure.
2.4 In $\mathbb{R}^2$, for a given positive spanning set the cosine measure is defined by $\cos(\theta)$, where $\theta$ (blue) is the largest angle between two adjacent vectors.
2.5 Six iterations of the coordinate-search method with opportunistic polling (following the order East/West/North/South). The initial point is $x_0 = [-3.5, -3.5]$ and the starting step size is $\alpha_0 = 3$. For successful iterations the step size is kept unchanged, otherwise it is reduced by a factor $\beta = 1/2$. The ellipses show the level sets of the objective function $f(x) = (x_1 + x_2 - 2)^2 + (x_1 - x_2)^2$. The optimum is located at the point $[1, 1]$.
3.1 A scalar density function for a normal distribution.
3.2 A 2-D situation where non-isotropic mutations, parallel to the y-axis, enhance the performance. The ellipses show the level sets of the objective function $f(x) = (x_1 + x_2 - 2)^2 + (x_1 - x_2)^2$.
3.3 A 2-D situation where it is more efficient to have correlated Gaussian mutations. The ellipses show the level sets of the objective function $f(x) = (x_1 + x_2 - 2)^2 + (x_1 - x_2)^2$.
3.4 A 2-D illustration of an evolution strategy. Generation after generation, the sampling distribution and the step size adapt to the landscape of the objective function. The ellipses show the level sets of the objective function.
3.5 A graphical representation of a 2-dimensional run of CMA-ES where $x_0 = [-4, -4]$, the initial step size is $\sigma_0^{\mathrm{CMA\text{-}ES}} = 1$, and the covariance matrix is isotropic (i.e. $C_0 = I_2$). The population size is $\lambda = 10$ and the new parent is chosen using the $\mu = 5$ best individuals. The ellipses show the level sets of the objective function $f(x) = (x_1 + x_2 - 2)^2 + (x_1 - x_2)^2$. The optimum is located at the point $[1, 1]$.
4.1 A 2-D illustration of three possible globally convergent evolution strategies. The ellipses show the level sets of the objective function.
4.2 Data profiles computed for the set of smooth problems, considering the two levels of accuracy, $10^{-3}$ and $10^{-7}$ (for the three modified versions).
4.3 Performance profiles computed for the set of smooth problems with a logarithmic scale, considering the two levels of accuracy, $10^{-2}$ and $10^{-4}$ (for the three modified versions).
4.4 Data profiles computed for the set of smooth problems, considering the two levels of accuracy, $10^{-3}$ and $10^{-7}$.
4.5 Data profiles computed for the set of nonstochastic noisy problems, considering the two levels of accuracy, $10^{-3}$ and $10^{-7}$.
4.6 Data profiles computed for the set of piecewise smooth problems, considering the two levels of accuracy, $10^{-3}$ and $10^{-7}$.
4.7 Data profiles computed for the set of stochastic noisy problems, considering the two levels of accuracy, $10^{-3}$ and $10^{-7}$.
4.8 Performance profiles computed for the set of smooth problems with a logarithmic scale, considering the two levels of accuracy, $10^{-2}$ and $10^{-4}$.
4.9 Performance profiles computed for the set of nonstochastic noisy problems with a logarithmic scale, considering the two levels of accuracy, $10^{-2}$ and $10^{-4}$.
4.10 Performance profiles computed for the set of piecewise smooth problems with a logarithmic scale, considering the two levels of accuracy, $10^{-2}$ and $10^{-4}$.
4.11 Performance profiles computed for the set of stochastic noisy problems with a logarithmic scale, considering the two levels of accuracy, $10^{-2}$ and $10^{-4}$.
4.12 Results for the mean/mean version, CMA-ES, and MADS on a set of multi-modal functions of dimension 10 (using $\lambda = 20$).
4.13 Results for the mean/mean version, CMA-ES, and MADS on a set of multi-modal functions of dimension 20 (using $\lambda = 40$).
4.14 Results for the mean/mean version, CMA-ES, and MADS on a set of multi-modal functions of dimension 10 (using $\lambda = 100$).
4.15 Results for the mean/mean version, CMA-ES, and MADS on a set of multi-modal functions of dimension 20 (using $\lambda = 200$).
5.1 A 2-D illustration of the barrier approach to handle linearly constrained problems using positive generators of the polar cone of the $\epsilon$-active constraints. Figure (5.1(a)) outlines the detection of an $\epsilon$-active mean parent point, while Figures (5.1(b)) and (5.1(c)) show the restoration process to conform the offspring distribution to the local geometry. The ellipses show the level sets of the objective function.
5.2 An illustration of the projection approach to handle linearly constrained problems. Figure (5.2(a)) outlines the projection of the unfeasible sample points. Figures (5.2(b)) and (5.2(c)) show the adaptation of the distribution of the offspring candidate solution to the local geometry of the constraints.
5.3 Performance profiles for 114 bound constrained problems (average objective function values for 10 runs).
5.4 Performance profiles for 107 general linearly constrained problems (average objective function values for 10 runs).
5.5 Data profiles for 114 bound constrained problems (average objective function values for 10 runs).
5.6 Data profiles for 107 general linearly constrained problems (average objective function values for 10 runs).
6.1 Data profiles computed for the set of smooth problems to assess the impact of incorporating local models, considering the two levels of accuracy, $10^{-3}$ and $10^{-7}$.
6.2 Data profiles computed for the set of nonstochastic noisy problems to assess the impact of incorporating local models, considering the two levels of accuracy, $10^{-3}$ and $10^{-7}$.
6.3 Data profiles computed for the set of piecewise smooth problems to assess the impact of incorporating local models, considering the two levels of accuracy, $10^{-3}$ and $10^{-7}$.
6.4 Data profiles computed for the set of stochastic noisy problems to assess the impact of incorporating local models, considering the two levels of accuracy, $10^{-3}$ and $10^{-7}$.
6.5 Comparison with the SID-PSM and BCDFO methods on the set of smooth problems using data profiles, considering the two levels of accuracy, $10^{-3}$ and $10^{-7}$.
6.6 Comparison with the SID-PSM and BCDFO methods on the set of nonstochastic noisy problems using data profiles, considering the two levels of accuracy, $10^{-3}$ and $10^{-7}$.
6.7 Comparison with the SID-PSM and BCDFO methods on the set of piecewise smooth problems using data profiles, considering the two levels of accuracy, $10^{-3}$ and $10^{-7}$.
6.8 Comparison with the SID-PSM and BCDFO methods on the set of stochastic noisy problems using data profiles, considering the two levels of accuracy, $10^{-3}$ and $10^{-7}$.
6.9 Data profiles computed for 114 bound constrained problems to assess the impact of incorporating local models, considering the two levels of accuracy, $10^{-3}$ and $10^{-7}$.
6.10 Data profiles computed for 107 general linearly constrained problems to assess the impact of incorporating local models, considering the two levels of accuracy, $10^{-3}$ and $10^{-7}$.
6.11 Data profiles for 114 bound constrained problems using an accuracy level of $10^{-3}$ (average objective function values for 10 runs).
6.12 Data profiles for 114 bound constrained problems using an accuracy level of $10^{-7}$ (average objective function values for 10 runs).
7.1 A graphical representation of acoustic waves propagated by a source, reflected by a reflective layer (in white), and detected by the geophones.
7.2 A graphical representation of acoustic wave propagation over a two-dimensional velocity model.
7.3 Academic 3D SEG/EAGE salt dome velocity model using Paraview [88]. The geophysical domain size is $20 \times 20 \times 5$ km$^3$, in which the minimal velocity is 1500 m/s. The velocity model represents a dome of salt in the subsurface of the Earth, which abruptly increases the velocity of propagation of the compressional waves.
7.4 The reduction procedure over a one-dimensional case.
7.5 The duplication procedure over a one-dimensional case.
7.7 A one-dimensional magnification procedure using the DCT transform. Compared to the duplicated vector, the magnification using the DCT transform represents the true velocity vector better.
7.8 3D duplicated and magnified models of the SEG/EAGE salt dome velocity model. The velocity models are built using $n = 8 \times 8 \times 5 = 320$; the original size of the true velocity model is $N = 225 \times 225 \times 70 = 3543750$.
7.9 A parallel evolution strategy for full waveform inversion.
7.10 The starting velocity model for the parallel evolution strategy.
7.11 Inversion results for the salt dome velocity model using $n = 320$ parameters. The working frequency is 1 Hz.
7.12 Objective function evaluation at the best population point for the first 278 iterations of the parallel evolution strategy.
7.13 Graphical representation of the salt dome of three velocity models: the true velocity salt dome (Figure 7.13(a)), the approximated one using 320 parameters (Figure 7.13(b)), and the inverted velocity model (Figure 7.13(c)). Only the points of the models with velocity equal to or larger than 3500 m/s are shown (to delineate the structure of the dome of salt).
7.14 Comparison of the inversion results for the salt dome velocity model using $n = 320$ parameters for different ranges of frequencies (1 Hz, 2 Hz and 3 Hz).
A.1 Data profiles computed for the set of nonstochastic noisy problems, considering the two levels of accuracy, $10^{-3}$ and $10^{-7}$ (for the three modified versions).
A.2 Data profiles computed for the set of piecewise smooth problems, considering the two levels of accuracy, $10^{-3}$ and $10^{-7}$ (for the three modified versions).
A.3 Data profiles computed for the set of stochastic noisy problems, considering the two levels of accuracy, $10^{-3}$ and $10^{-7}$ (for the three modified versions).
A.4 Performance profiles computed for the set of nonstochastic noisy problems with a logarithmic scale, considering the two levels of accuracy, $10^{-2}$ and $10^{-4}$ (for the three modified versions).
A.5 Performance profiles computed for the set of piecewise smooth problems with a logarithmic scale, considering the two levels of accuracy, $10^{-2}$ and $10^{-4}$ (for the three modified versions).
A.6 Performance profiles computed for the set of stochastic noisy problems with a logarithmic scale, considering the two levels of accuracy, $10^{-2}$ and $10^{-4}$ (for the three modified versions).


List of Algorithms

2.1 A DFO trust-region algorithm
2.2 Coordinate-search method
2.3 A direct-search method
3.1 A general framework for a (µ/ρ +, λ)-ES
3.2 A general framework for a (µ/µ_W, λ)-ES
3.3 Approximate ranking procedure
4.1 A class of globally convergent ES's
5.1 A globally convergent ES for general constraints (Main)
5.2 A globally convergent ES for general constraints (Restoration)
5.3 A globally convergent ES for unrelaxable constraints
5.4 Calculating the positive generators $D_k$
6.1 A globally convergent ES using a search step
7.1 A multi-scale algorithm for frequency-domain FWI
7.2 An adaptation of the ES algorithm to the FWI setting


List of Tables

4.1 The distribution of $n_p$ in the test set
4.2 Noiseless problems
4.3 Noisy problems
5.1 Comparison results for the extreme barrier approach using a maximal budget of 2000
5.2 Comparison results for the extreme barrier approach using a maximal budget of 20000
5.3 Comparison results for the merit approach and the progressive barrier one using a maximal budget of 2000
5.4 Comparison results for the merit approach and the progressive barrier one using a maximal budget of 20000
7.1 The distribution of the clusters and the population size depending on the working frequency
B.1 Results from the comparison of the solvers on bound-constrained problems (average of 10 runs for stochastic solvers), Part 1
B.2 Results from the comparison of the solvers on bound-constrained problems (average of 10 runs for stochastic solvers), Part 2
B.3 Results from the comparison of the solvers on bound-constrained problems (average of 10 runs for stochastic solvers), Part 3
B.4 Results from the comparison of the solvers on bound-constrained problems (average of 10 runs for stochastic solvers), Part 4
B.5 Results from the comparison of the solvers on bound-constrained problems (average of 10 runs for stochastic solvers), Part 5
B.6 Results from the comparison of the solvers on linearly constrained problems (average of 10 runs for stochastic solvers), Part 1
B.7 Results from the comparison of the solvers on linearly constrained problems (average of 10 runs for stochastic solvers), Part 2
B.8 Results from the comparison of the solvers on linearly constrained problems (average of 10 runs for stochastic solvers), Part 3
B.9 Results from the comparison of the solvers on linearly constrained problems (average of 10 runs for stochastic solvers), Part 4
B.10 Results from the comparison of the solvers on linearly constrained problems (average of 10 runs for stochastic solvers), Part 5


To my parents


Chapter 1

Introduction

Nowadays, many practical optimization problems are often noisy, complex, and not sufficiently explicitly defined to give reliable derivatives. In this thesis, we are interested in optimization problems where derivative information is unavailable or hard to obtain in practice. For instance, optimizing large and complex systems often requires the tuning of many parameters. These parameters are typically set to values that may have some mathematical meaning or that have been found to perform well. The choice of parameters can be done automatically using training data from simulations. In such a case, not only is it hard to find the derivatives with respect to the parameters, but numerical noise and possibly non-differentiability issues may also appear. As a consequence, we have seen a resurgence of interest in Derivative-Free Optimization (DFO) [52]. Derivative-based methods are better adapted to solving large scale optimization problems, typically with around $10^6$ unknowns or more. These methods can be very efficient when the starting point is accurate enough, but otherwise they suffer from stalled convergence to spurious local minima for non-convex optimization problems. Thus the holy grail for these problems is to warm-start the local optimization procedures by efficiently finding a good initial guess, without the need for sophisticated a priori knowledge of the objective function (such as the problem structure, its background, ...). When the number of unknowns included in the optimization can be reduced, it is possible to use a type of DFO methods that are known for their ability to handle hard problems and to find a good initial guess (a starting point leading to a better minimum). Once a starting point is found, derivative-based methods can be applied to refine the problem solution. In the scope of this thesis, we deal with a very large scale seismic imaging inversion problem [167], where we show that some DFO methods can improve the optimization procedure by finding an accurate initial guess from which one can initiate derivative-based methods [126], without any physical knowledge.


DFO methods use neither derivative information of the objective function or constraints nor an approximation to the derivatives. In fact, approximating derivatives is often very expensive and can produce misleading results due to the presence of noise. The DFO area can be divided into two categories, depending on the methods used to explore the search space. The first category is that of deterministic DFO algorithms, such as model-based methods [49, 130] or direct-search methods [52, 108]. The major drawback of these methods is that they can easily get stuck in a local optimum. The second category is stochastic derivative-free optimization [122, 158], which has been employed to mitigate this shortcoming of the local deterministic methods in the solution of difficult objective functions (e.g. non-smooth and multi-modal). Stochastic derivative-free optimization algorithms aim to be robust when dealing with multi-modal objective functions. Some of these methods are inspired by nature, in the same way that random processes are often associated with natural systems (e.g. mutations of genetic information, the annealing process of metal, molecular dynamics, or the swarm behavior of birds). Well-known representatives of stochastic methods are simulated annealing [107], particle swarm optimization [103], and evolutionary algorithms [26, 32, 91, 92, 142, 150]. In the past, stochastic DFO was regarded by the deterministic DFO community as another discipline, and only a few interactions between the two DFO categories were established. Meanwhile, stochastic optimization algorithms have been growing rapidly in popularity, thanks to some methods that became "industry standard" approaches for solving challenging optimization problems. Such growth led the deterministic DFO community to reconsider its position, and it has recently started to include stochastic frameworks in its research topics [27, 73, 115, 129].

Evolution strategies (ES's) are one of these successful stochastic algorithms, seen as a class of evolutionary algorithms that are naturally parallelizable, appropriate for continuous optimization, and that lead to interesting results [23, 37, 145]. Motivated by the industrial demand, we propose in this thesis to equip a class of ES's with known techniques from the deterministic DFO community based on step size control. The incorporated techniques are inspired by recent developments in direct-search methods [18, 52, 53, 74, 166]. Our modifications enhance the performance of the original algorithm, particularly when objective function evaluations are expensive. The proposed ES's rigorously achieve a form of global convergence under reasonable assumptions. By global convergence, we mean the ability of the algorithm to generate a sequence of points converging to a stationary point regardless of the starting point.


The problem under consideration in this thesis is of the form

$$\min \; f(x) \quad \text{s.t.} \quad x \in \Omega, \qquad (1.1)$$

where $f$ is a real-valued objective function assumed to be bounded from below. The feasible region $\Omega \subset \mathbb{R}^n$ of this problem can be defined by relaxable and/or non-relaxable constraints. Relaxable constraints need only be satisfied approximately or asymptotically. No violation is allowed when the constraints are non-relaxable (typically, they are bounds or linear constraints).

In Chapter 2, we give a short overview of existing deterministic derivative-free optimization methods and their classification. We present the general framework of model-based methods in their derivative-free context. We emphasize the multivariate polynomial interpolation techniques used to build different types of local polynomial interpolation and regression models. We also address (directional) direct-search methods, where the sampling is guided by a set of directions with specific features. Key concepts related to the sampling set are also outlined (i.e. positive spanning sets, descent directions, and the cosine measure). We end the chapter by reviewing some of the existing global convergence results for the presented direct-search methods.

As our main motivation is to equip a class of ES's with some direct-search techniques, Chapter 3 gives an overview of stochastic derivative-free optimization algorithms and in particular ES's: their appearance and history, their basic ideas and principles. We also present some theoretical aspects of ES's, in particular the main existing global convergence properties of ES algorithms. The chapter closes with a detailed description of CMA-ES [85, 86], regarded as the state of the art in stochastic derivative-free optimization.

In Chapter 4, we introduce our first contribution, where we show how to modify a large class of ES's for unconstrained optimization in order to rigorously achieve global convergence. The type of ES's under consideration recombines the parent points by means of a weighted sum, around which the offspring points are computed by random generation. The modifications consist essentially in the reduction of the size of the steps whenever a sufficient decrease condition on the function values is not verified. When the latter condition is fulfilled, the step size can be reset to the one maintained by the ES's themselves, as long as it is sufficiently large. We propose ways of imposing sufficient decrease for which global convergence holds under reasonable assumptions (e.g. density of certain limit directions in the unit sphere). Given a limited budget of function evaluations, our numerical experiments have shown that the modified CMA-ES is capable of further progress in function values. Moreover, we have observed that such an improvement in efficiency comes without significantly weakening the performance of the underlying method in the presence of several local minimizers.
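A minimal sketch of this kind of step-size control follows: offspring are sampled around a weighted recombination of the best parents, and the step size is contracted whenever a sufficient decrease condition on the function values is not met. The forcing function, the parameter values, and the expansion/contraction factors are illustrative assumptions, not the exact choices analyzed in Chapter 4.

```python
import numpy as np

def forcing(sigma):
    # Forcing function rho(sigma) = c * sigma^2, a common way to impose
    # sufficient decrease (illustrative constant).
    return 1e-4 * sigma**2

def modified_es(f, x0, sigma0=1.0, lam=10, mu=5, max_evals=2000, seed=0):
    """Sketch of an ES iteration with a sufficient-decrease test on the
    weighted mean of the best offspring (not the thesis implementation)."""
    rng = np.random.default_rng(seed)
    n = x0.size
    weights = np.log(mu + 0.5) - np.log(np.arange(1, mu + 1))
    weights /= weights.sum()                    # recombination weights
    x, sigma, fx, evals = x0.copy(), sigma0, f(x0), 1
    while evals + lam + 1 <= max_evals:
        # Sample lambda offspring around the current (mean) parent.
        offspring = x + sigma * rng.standard_normal((lam, n))
        values = np.array([f(y) for y in offspring])
        evals += lam
        best = np.argsort(values)[:mu]
        trial = weights @ offspring[best]       # weighted recombination
        f_trial = f(trial); evals += 1
        if f_trial <= fx - forcing(sigma):      # sufficient decrease holds
            x, fx = trial, f_trial
            sigma *= 2.0                        # successful: step may grow
        else:
            sigma *= 0.5                        # unsuccessful: contract
    return x, fx

# Example: minimize a simple shifted quadratic in dimension 5.
x_best, f_best = modified_es(lambda x: ((x - 1.0)**2).sum(), np.zeros(5))
```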

The modified ES is extended to handle general constrained optimization in Chapter 5. Our methodology is built upon the globally convergent evolution strategies previously introduced for unconstrained optimization. Two feasible approaches are considered to handle the non-relaxable constraints. In the first approach, the objective function is evaluated directly at the generated sample points, and feasibility is enforced through an extreme barrier function. The second approach projects the generated sample points onto the feasible domain before evaluating the objective function. The treatment of relaxable constraints is inspired by the merit function approach [74], where one tries to combine both the objective function and the constraint violation function. In the first numerical experiments, where we consider only unrelaxable constraints, we show that our proposed ES approaches (using the extreme barrier or the projection) are competitive with state-of-the-art solvers for derivative-free bound and linearly constrained optimization. In the second part of our numerical experiments, we test our algorithms based on the merit function approach in the presence of both relaxable and unrelaxable constraints. On the chosen test problems, the merit approach shows promising results compared to the progressive barrier one [19], in particular for relatively small feasible regions.
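As a concrete illustration of the two feasible approaches for non-relaxable constraints, the sketch below shows an extreme barrier wrapper and a projection onto simple bounds; the bound-constrained setting and all names are assumptions made for the example.

```python
import numpy as np

def extreme_barrier(f, lower, upper):
    """Return f(x) if x satisfies the bounds, +inf otherwise."""
    def f_omega(x):
        if np.all(x >= lower) and np.all(x <= upper):
            return f(x)
        return np.inf              # infeasible points are never preferred
    return f_omega

def project_onto_bounds(x, lower, upper):
    """Project a sampled point onto the feasible box before evaluating f."""
    return np.clip(x, lower, upper)

# Example usage on a 3-dimensional box [0, 1]^3.
lower, upper = np.zeros(3), np.ones(3)
f_eb = extreme_barrier(lambda x: (x**2).sum(), lower, upper)
print(f_eb(np.array([0.5, 2.0, 0.1])))                       # inf: bound violated
print(project_onto_bounds(np.array([0.5, 2.0, 0.1]), lower, upper))
```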

The modified ES proposed in Chapters 4 and 5 evaluates the objective function at a significantly large number of points at each iteration. These evaluations can be used in different ways to speed up the convergence and make ES algorithms more efficient, especially for small budgets. The possibility that we explore in Chapter 6 is to use the previously evaluated points to construct surrogate quadratic models of the objective function f. The surrogate models are computed using techniques inspired by model-based methods for deterministic DFO. Our hybrid algorithm has been designed to satisfy the convergence analysis of our globally convergent ES. As expected, our experiments show that incorporating local models improves the performance of our ES on both unconstrained and constrained optimization problems. Regression models are found to be the most efficient quadratic models within our ES algorithms.

Our target application is the solution of an Earth imaging problem in geophysics. In Chapter 7, without any physical knowledge, we use our globally convergent ES's to find a starting point for an optimization procedure that attempts to derive high-resolution quantitative models of the subsurface using the full information of acoustic waves, known as acoustic full-waveform inversion [167]. The chapter starts with a detailed description of the considered problem. We also outline one possible way to adapt our ES to the acoustic full-waveform inversion problem setting. A subspace approach is used for the parametrization of the problem. Motivated by the recent growth of high performance computing resources, we propose a highly parallel implementation of our ES adapted to the requirements of the problem. The initial results obtained in this direction show that great improvement can be expected in the automation of full-waveform inversion. Finally, we draw some conclusions and outline perspectives in Chapter 8.


Chapter 2

Deterministic Derivative-Free Optimization

Deterministic derivative-free optimization (DFO) methods either try to build models of the objective function based on sample function values, i.e. model-based methods [49, 52], or directly exploit a sample set of function evaluations without building an explicit model, i.e. direct-search methods [52, 108]. Motivated by the large number of DFO applications, researchers and practitioners have made significant progress on algorithmic and theoretical aspects of DFO methods over the past two decades. The most important progress concerns the recent algorithms and proofs of global convergence [17, 49, 52, 108, 149, 166]. By global convergence, we mean the ability of a method to generate a sequence of points converging to a stationary point regardless of the starting point. A point is said to be stationary if it satisfies the first-order necessary conditions, in the sense that the gradient is equal to zero if the objective function is differentiable or, in the non-smooth case, that all Clarke generalized directional derivatives are non-negative [43]. The book by Conn, Scheinberg, and Vicente [52] gives a good review of the state of the art of deterministic DFO, with a detailed description of the theoretical background needed to ensure convergence. The main classes of globally convergent algorithms for derivative-free optimization are:

1. Trust-region methods [49, 52, 130], where one minimizes accurate models inside a region of prespecified size. The models are built, for example, using either interpolation and regression techniques [50] or radial basis functions [168].

2. Directional direct-search methods [52, 108], where sampling is guided by sets of directions with appropriate properties, i.e. sets of directions generating $\mathbb{R}^n$ with non-negative coefficients. Popular algorithms in this class are coordinate search, pattern search, generalized pattern search (GPS) [17], generating set search (GSS) [108], and mesh adaptive direct search (MADS) [18]. We will often refer to this class of methods simply as direct-search methods.

3. Simplicial direct-search methods [52, 128], where optimization is carried out through simplex operations like reflection, expansion, or contraction. A popular example is the Nelder-Mead method [128], which is regarded as the most popular derivative-free method.

4. Line-search methods [52, 102], where one tries to optimize the objective function using a simplex gradient. The latter is typically chosen as the gradient of a linear interpolation or regression polynomial model. A popular example is the implicit-filtering method of Kelley et al. [102].

Only trust-region methods and direct-search methods are explored further in this thesis. The remainder of this chapter is organized as follows: we begin with a short overview of model-based methods, where we present the general framework of trust-region methods, including their relationship with regression and quadratic models. The second section is devoted to direct-search methods, where we present a class of globally convergent directional direct-search methods. The convergence results in this chapter are stated without proofs; for the proofs we refer the reader to [17, 49, 52, 108, 166] and the references given there.

2.1 Model based methods

Model based methods can be seen as a combination of the trust-region framework with interpolation models of the objective function. Basically, in these methods we construct a local model of the objective function and estimate the new step by minimizing the model inside a region. The model is constructed using points evaluated on a specific point subset. Such a point subset must satisfy some appropriate properties so that the models are well-defined. In this section, we briefly describe the essence of this approach. For a more detailed analysis, the reader is referred to [49, 51, 52, 130].

2.1.1 Trust-region framework

The trust-region framework is usually used when derivative information of the objective function is available or at least some estimates of the derivatives can be computed accurately. A typical trust-region method is as follows: at the $k$-th iteration, given the current iterate $x_k$, a model of the form

$$m_k(x_k + s) = f(x_k) + g_k^\top s + \frac{1}{2}\, s^\top H_k\, s \qquad (2.1)$$

(where $g_k$ and $H_k$ correspond to estimates of the gradient and the Hessian, respectively) is minimized in a neighborhood of the current iterate defined by the ball (or trust region)

$$B(x_k, \Delta_k) = \{x \in \mathbb{R}^n : \|x - x_k\| \le \Delta_k\}, \qquad (2.2)$$

centered at $x_k$ and with radius $\Delta_k$; the norm $\|\cdot\|$ could be iteration dependent, but is usually fixed. Different norm choices can be used depending on the minimization problem: for instance, in the unconstrained case the standard Euclidean norm is better suited [49, 52], while the infinity norm was shown to be more suited when considering bound constraints [49, 72].

The minimization of the model inside the trust region leads to a new trial point $x_k + s_k$. To determine whether the computed point is successful or not, we evaluate the objective function at the new point $x_k + s_k$ and compare the true reduction in the value of the objective function with the reduction predicted by the model. If the ratio

$$\rho_k = \frac{f(x_k) - f(x_k + s_k)}{m_k(x_k) - m_k(x_k + s_k)} \qquad (2.3)$$

is larger than a constant $\nu_1 > 0$, the step is accepted and the model is updated; the trust-region radius is possibly increased if the success is really significant. When the step is unsuccessful (meaning $\rho_k \le \nu_1$), the trial point is rejected and the trust-region radius $\Delta_k$ is reduced.
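A minimal sketch of the acceptance test based on the ratio (2.3) and of the radius update just described; the constants $\nu_1$, $\gamma$, and $\gamma_{\text{inc}}$ are illustrative values rather than those of any particular implementation.

```python
import numpy as np

def trust_region_update(f, m, x, s, delta,
                        nu1=0.1, gamma=0.5, gamma_inc=2.0):
    """One acceptance/radius decision of a basic trust-region iteration.

    f, m  : objective and current model, both callables of a point
    x, s  : current iterate and trial step (with ||s|| <= delta)
    """
    predicted = m(x) - m(x + s)          # reduction predicted by the model
    actual = f(x) - f(x + s)             # true reduction of the objective
    rho = actual / predicted
    if rho > nu1:                        # successful: accept, enlarge radius
        return x + s, gamma_inc * delta
    return x, gamma * delta              # unsuccessful: reject, shrink radius

# Example with an exact quadratic model of f(x) = ||x||^2 around x.
f = lambda z: float(z @ z)
x = np.array([2.0, -1.0])
m = lambda z: f(x) + 2 * x @ (z - x) + float((z - x) @ (z - x))
x_new, delta_new = trust_region_update(f, m, x, s=-0.5 * x, delta=2.0)
```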

The approximation model (2.1) is generally constructed using a second-order Taylor series expansion. However, in the derivative-free context, one uses alternative approximation techniques that are not based upon the derivatives of the objective function $f$. Quadratic interpolation is one of these techniques that can be combined with trust-region algorithms. To guarantee convergence, one needs to require the approximation model to be locally accurate enough. The interpolation set, as well as the mechanism for keeping it good enough inside the trust region, are described in the next section. The upcoming results are general interpolation and regression results that have proven useful when dealing with model-based optimization. The subscript $k$ is dropped in the following description for clarity, without loss of information, since we focus on a given iteration of the trust-region algorithm.


2.1.2 Polynomial interpolation and regression models

In this section, we consider the problem of interpolating known objective function values at a given set $Y$ of interpolation points, $Y = \{y^1, y^2, \ldots, y^p\} \subset \mathbb{R}^n$. We aim to find a model $m$ for which the interpolation condition

$$m(y^j) = f(y^j), \quad j = 1, \ldots, p, \qquad (2.4)$$

holds. We say that a set of points can be interpolated by a polynomial of a certain degree if, for the function $f$, there exists a polynomial $m$ such that (2.4) holds for all the points in the interpolation set $Y$.

2.1.2.1 Polynomial bases

Let $\mathcal{P}_n^d$ be the space of polynomials of degree $\le d$ in $\mathbb{R}^n$, and $q$ the dimension of this space. Let $\{\phi_i\}_{i=1}^{q}$ be a given basis of $\mathcal{P}_n^d$, which is a set of $q$ polynomials of degree $\le d$. Thus, any polynomial $m \in \mathcal{P}_n^d$ can be written uniquely as

$$m(x) = \sum_{j=1}^{q} \alpha_j \phi_j(x), \qquad (2.5)$$

where $\alpha_\phi = (\alpha_1, \ldots, \alpha_q)^\top \in \mathbb{R}^q$. Different polynomial bases $\phi$ can be considered; the simplest and most used polynomial basis is the basis of monomials, known as the natural basis $\bar\phi$. Such a basis is defined using multi-indices in the following way [52]: let a vector $\alpha^i = (\alpha^i_1, \ldots, \alpha^i_n) \in \mathbb{N}^n$ be called a multi-index and, for any $x \in \mathbb{R}^n$, define

$$x^{\alpha^i} = \prod_{j=1}^{n} x_j^{\alpha^i_j}, \qquad |\alpha^i| = \sum_{j=1}^{n} \alpha^i_j, \qquad \alpha^i! = \prod_{j=1}^{n} (\alpha^i_j!).$$

Then the elements of the natural basis are

$$\bar\phi_i(x) = \frac{1}{(\alpha^i)!}\, x^{\alpha^i}, \qquad i = 0, \ldots, q, \quad |\alpha^i| \le d.$$


The natural basis can then be written as follows:

$$\bar\phi = \left\{ 1,\; x_1,\; x_2,\; \ldots,\; x_n,\; \tfrac{1}{2}x_1^2,\; x_1 x_2,\; \ldots,\; \tfrac{1}{(d-1)!}\,x_{n-1}^{d-1}x_n,\; \tfrac{1}{d!}\,x_n^d \right\}. \qquad (2.6)$$

Consequently, for linear interpolation (i.e. $d = 1$) we have $q = n + 1$, and $q = \frac{(n+1)(n+2)}{2}$ for full quadratic interpolation (i.e. $d = 2$).
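These definitions translate directly into code. The sketch below enumerates the multi-indices of degree at most $d$ and evaluates the corresponding natural-basis monomials $x^{\alpha^i}/\alpha^i!$; the function names and the ordering of the basis elements are our own choices.

```python
import math
from itertools import product
import numpy as np

def multi_indices(n, d):
    """All multi-indices alpha in N^n with |alpha| <= d, sorted by total degree."""
    idx = [a for a in product(range(d + 1), repeat=n) if sum(a) <= d]
    return sorted(idx, key=sum)

def natural_basis(x, d):
    """Evaluate the natural basis of P_n^d at a point x."""
    x = np.asarray(x, dtype=float)
    values = []
    for alpha in multi_indices(x.size, d):
        coeff = 1.0 / np.prod([math.factorial(a) for a in alpha])
        values.append(coeff * np.prod(x ** np.array(alpha)))
    return np.array(values)

# For n = 2 and d = 2 the basis has q = (n+1)(n+2)/2 = 6 elements:
# the constant 1, the two linear monomials, and the three quadratic ones.
print(natural_basis([2.0, 3.0], d=2))
```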

2.1.2.2 Polynomial interpolation

Using (2.5) and (2.4), the coefficients $\alpha_\phi = (\alpha_1, \ldots, \alpha_q)^\top$ can be found by solving the equations

$$\sum_{j=1}^{q} \alpha_j \phi_j(y^i) = f(y^i), \quad i = 1, \ldots, p,$$

which can be written as a linear system of the form

$$M(\phi, Y)\,\alpha_\phi = f(Y), \qquad (2.7)$$

where the coefficient matrix $M(\phi, Y)$ and the right-hand side $f(Y)$ of this system are

$$M(\phi, Y) = \begin{pmatrix} \phi_1(y^1) & \phi_2(y^1) & \cdots & \phi_q(y^1) \\ \phi_1(y^2) & \phi_2(y^2) & \cdots & \phi_q(y^2) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_1(y^p) & \phi_2(y^p) & \cdots & \phi_q(y^p) \end{pmatrix} \qquad \text{and} \qquad f(Y) = \begin{pmatrix} f(y^1) \\ f(y^2) \\ \vdots \\ f(y^p) \end{pmatrix},$$

respectively.

If the coefficient matrix $M(\phi, Y)$ is square and nonsingular, then the set of points $Y$ is poised with respect to the subspace spanned by $\phi$. This means that $Y$ can be interpolated by a unique polynomial from this subspace. When the interpolation set remains poised under small perturbations, the set is called well-poised. If the set $Y$ is poised, then one can solve the linear system and find an interpolation polynomial. However, numerically, the coefficient matrix $M(\phi, Y)$ may be ill-conditioned depending on the choice of the basis $\{\phi_i\}_{i=1}^{q}$. Thus, in general, the condition number of the matrix $M(\phi, Y)$ is a bad measure of the poisedness of $Y$. However, if one chooses the interpolation basis $\phi$ as the natural basis of monomials $\bar\phi$, and $\hat Y$ as a shifted and scaled version of $Y$ such that $\hat Y \subset B(0; 1)$, the condition number of $M(\bar\phi, \hat Y)$ can be used to monitor the poisedness of the point set [52, Theorem 3.14].
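As an illustration of this monitoring strategy, the following sketch shifts and scales a small 2-D sample set into $B(0;1)$, assembles $M(\bar\phi, \hat Y)$ for full quadratic interpolation, and inspects its condition number; the sample points and the poisedness threshold are arbitrary illustrative choices.

```python
import numpy as np

def quad_basis_2d(x):
    """Natural quadratic basis of P_2^2 evaluated at x = (x1, x2)."""
    x1, x2 = x
    return np.array([1.0, x1, x2, 0.5 * x1**2, x1 * x2, 0.5 * x2**2])

def scaled_interpolation_matrix(Y):
    """Shift and scale Y into B(0; 1), then assemble M(phi_bar, Y_hat)."""
    Y = np.asarray(Y, dtype=float)
    center = Y[0]                                   # shift the first point to the origin
    radius = np.linalg.norm(Y - center, axis=1).max()
    Y_hat = (Y - center) / radius                   # scaled copy inside the unit ball
    return np.array([quad_basis_2d(y) for y in Y_hat])

# Six sample points in R^2 (q = 6 for full quadratic interpolation).
Y = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
              [1.0, 1.0], [0.5, 0.0], [0.0, 0.5]])
M = scaled_interpolation_matrix(Y)
cond = np.linalg.cond(M)
print("condition number of M(phi_bar, Y_hat):", cond)
if cond > 1e8:                                      # illustrative threshold
    print("the sample set is badly poised: its geometry should be improved")
```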

To incorporate models in the trust-region framework, one has to adapt the model construction to different numbers of degrees of freedom (which depend on both the cardinality of the interpolation set and the number of variables). For instance, during the first iterations one has only a few points and so cannot always construct an interpolation model. When $p = n + 1$ points are available, we can build a linear model, which is known to be sufficient to make some progress. When the number of function evaluations $p$ exceeds $n + 1$ but is not more than $\frac{1}{2}(n+1)(n+2)$, the coefficient matrix $M(\phi, Y)$ contains more columns than rows, and thus the interpolation polynomials defined by (2.4) are no longer unique for quadratic interpolation. To overcome this problem, one uses under-determined models, which have been widely used in many practical DFO implementations (see Section 2.1.2.3). A complete quadratic model can be built once the number of function evaluations equals $\frac{1}{2}(n+1)(n+2)$; such models are expected to lead to faster progress. When the number of function evaluations $p$ exceeds $\frac{1}{2}(n+1)(n+2)$, regression models can be used (see Section 2.1.2.4). Regression models have been shown to be often better than just selecting the 'best' subset of $\frac{1}{2}(n+1)(n+2)$ points and using the chosen subset to build complete quadratic models [50].

2.1.2.3 Under-determined interpolation models

The interpolation polynomials defined by (2.4) are not unique in this case; different approaches can be used [50, 52]:

Sub-basis models: A simple way to impose uniqueness of the interpolation polynomial is to restrict the linear system (2.7) so that it has a unique solution (by removing $q - p$ columns of $M(\phi, Y)$; the corresponding elements of the solution $\alpha_\phi$ are set to zero). This approach is in general not very successful, except if we have a priori knowledge of the sparsity structure of the gradient and the Hessian of the objective function. Such information can be exploited by deleting the corresponding columns in the linear system (2.7). Choosing $p$ columns in $M(\phi, Y)$ corresponds to removing polynomials from the basis $\phi$ to obtain a new one, $\tilde\phi$. As a consequence, the point set $Y$ has to be well poised with respect to the subspace generated by $\tilde\phi$.

Minimum norm models: A second approach to obtain a unique polynomial solution for the under-determined system (2.7) is to compute the minimum $\ell_2$-norm solution $\alpha_\phi$. In this case, the problem to solve is defined as follows:

$$\min \; \tfrac{1}{2}\|\alpha_\phi\|_2^2 \quad \text{s.t.} \quad M(\phi, Y)\,\alpha_\phi = f(Y). \qquad (2.8)$$

Assuming that the coefficient matrix $M(\phi, Y)$ has full row rank, the solution of problem (2.8) is given by

$$\alpha_\phi = M(\phi, Y)^\dagger f(Y), \qquad (2.9)$$

where $M(\phi, Y)^\dagger$ denotes the Moore-Penrose pseudo-inverse of $M(\phi, Y)$, which can be computed using a QR factorization or a singular value decomposition of the coefficient matrix. The polynomial solution found in (2.9) depends on the choice of the basis $\phi$. In practice, it has been observed that it is worthwhile to consider the minimum $\ell_2$-norm solution when working with the natural polynomial basis $\bar\phi$ [52, Section 5.1].
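A small numerical illustration of (2.8)-(2.9): with fewer interpolation points than basis functions, the Moore-Penrose pseudo-inverse (or, equivalently, numpy's minimum-norm least-squares routine) yields the coefficients of the under-determined quadratic model. The sample points and the test function are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n, q, p = 2, 6, 4                          # q basis functions, only p < q points

def quad_basis(x):                         # natural quadratic basis in R^2
    x1, x2 = x
    return np.array([1.0, x1, x2, 0.5 * x1**2, x1 * x2, 0.5 * x2**2])

Y = rng.standard_normal((p, n))            # p sample points
fY = np.array([np.sin(y[0]) + y[1]**2 for y in Y])     # arbitrary smooth f

M = np.array([quad_basis(y) for y in Y])   # p x q, full row rank (generically)
alpha = np.linalg.pinv(M) @ fY             # minimum l2-norm solution (2.9)

# The model interpolates f on Y ...
assert np.allclose(M @ alpha, fY)
# ... and coincides with the minimum-norm solution returned by lstsq.
alpha_lstsq, *_ = np.linalg.lstsq(M, fY, rcond=None)
assert np.allclose(alpha, alpha_lstsq)
```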

Minimum Frobenius norm models: The error bounds on both the objective function and its gradient, for under-determined interpolation models, depend on the norm of the Hessian of the model [52, Theorem 5.4]. Therefore, the motivation of this approach is to build models with a minimum value of the norm of the model Hessian. In the quadratic interpolation case, such a minimization is equivalent to minimizing the coefficients of $\alpha_\phi$ related to the quadratic monomials. By splitting the natural basis $\bar\phi$ into two parts, a linear part $\bar\phi_L = \{1, x_1, x_2, \ldots, x_n\}$ and a quadratic part $\bar\phi_Q = \{\tfrac{1}{2}x_1^2, x_1 x_2, \ldots, \tfrac{1}{2}x_n^2\}$, the interpolation model can be written as follows:

$$m(x) = \alpha_L^\top \bar\phi_L + \alpha_Q^\top \bar\phi_Q,$$

where $\alpha_L$ and $\alpha_Q$ are the solution of the following optimization problem:

$$\min \; \tfrac{1}{2}\|\alpha_Q\|_2^2 \quad \text{s.t.} \quad M(\bar\phi_L, Y)\,\alpha_L + M(\bar\phi_Q, Y)\,\alpha_Q = f(Y). \qquad (2.10)$$

The corresponding solution $\alpha_{\bar\phi} = [\alpha_L, \alpha_Q]$ is called the minimum Frobenius norm solution. In fact, due to the choice of the natural basis, solving problem (2.10) is equivalent to minimizing the Frobenius norm of the Hessian of $m(x)$ (the Frobenius norm $\|\cdot\|_F$ of a square matrix $A$ is defined by $\|A\|_F = \sqrt{\sum_{1 \le i,j \le n} A_{ij}^2}$). The solution of (2.10) exists and is uniquely defined if the following matrix is nonsingular:

$$F(\bar\phi, Y) = \begin{pmatrix} M(\bar\phi_Q, Y)\, M(\bar\phi_Q, Y)^\top & M(\bar\phi_L, Y) \\ M(\bar\phi_L, Y)^\top & 0 \end{pmatrix}.$$

The matrix $F(\bar\phi, Y)$ is nonsingular if and only if the coefficient matrix $M(\bar\phi_L, Y)$ has full column rank and $M(\bar\phi_Q, Y)\, M(\bar\phi_Q, Y)^\top$ is positive definite on the null space of $M(\bar\phi_L, Y)^\top$ (the last condition can be ensured if the matrix $M(\bar\phi_L, Y)$ has full row rank). In this case, the sample set $Y$ is called poised in the minimum Frobenius norm sense. The coefficients $\alpha_L$ and $\alpha_Q$ are computed by first solving

$$F(\bar\phi, Y) \begin{pmatrix} \mu \\ \alpha_L \end{pmatrix} = \begin{pmatrix} f(Y) \\ 0 \end{pmatrix}$$

to find $\alpha_L$ and the Lagrange multiplier $\mu$ of problem (2.10), and then computing $\alpha_Q = M(\bar\phi_Q, Y)^\top \mu$ to complete the model construction.
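The construction just described can be coded directly: assemble $F(\bar\phi, Y)$, solve for $(\mu, \alpha_L)$, and recover $\alpha_Q = M(\bar\phi_Q, Y)^\top \mu$. The sketch below does this for a small 2-D example; the sample points and the test function are arbitrary, and poisedness in the minimum Frobenius norm sense is simply assumed to hold for them.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 2, 5                                    # p points, with n + 1 = 3 < p < q = 6

def linear_part(x):                            # phi_bar_L = (1, x1, x2)
    return np.array([1.0, x[0], x[1]])

def quad_part(x):                              # phi_bar_Q = (x1^2/2, x1*x2, x2^2/2)
    return np.array([0.5 * x[0]**2, x[0] * x[1], 0.5 * x[1]**2])

Y = rng.standard_normal((p, n))
fY = np.array([np.cos(y[0]) + y[0] * y[1] for y in Y])

ML = np.array([linear_part(y) for y in Y])     # p x (n+1)
MQ = np.array([quad_part(y) for y in Y])       # p x nQ

# KKT matrix F(phi_bar, Y) and right-hand side (f(Y), 0).
F = np.block([[MQ @ MQ.T, ML],
              [ML.T, np.zeros((ML.shape[1], ML.shape[1]))]])
rhs = np.concatenate([fY, np.zeros(ML.shape[1])])
sol = np.linalg.solve(F, rhs)
mu, alpha_L = sol[:p], sol[p:]
alpha_Q = MQ.T @ mu                            # minimum Frobenius norm coefficients

# The resulting quadratic model interpolates f on Y.
assert np.allclose(ML @ alpha_L + MQ @ alpha_Q, fY)
```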

A variant of the minimum Frobenius norm model is the least Frobenius norm updating of quadratic models [137]. Instead of minimizing the Frobenius norm of the model Hessian, one minimizes its change from the previously computed Hessian to the current one. The new optimization problem can be formulated as follows:

$$\min \; \tfrac{1}{2}\|\alpha_Q - \alpha_Q^{\text{old}}\|_2^2 \quad \text{s.t.} \quad M(\bar\phi_L, Y)\,\alpha_L + M(\bar\phi_Q, Y)\,\alpha_Q = f(Y). \qquad (2.11)$$

This optimization problem is solved through a shifted problem in $\alpha_{\text{dif}} = \alpha_Q - \alpha_Q^{\text{old}}$ of the type given in (2.10).

Minimum Frobenius norm models and their variants have been shown to be among the most efficient and successful ways to build quadratic models, and they are implemented in many software packages [52, 138]. The minimization of the change in the Hessian of the model from one iteration to the next works very well in some cases, in particular when $p = 2n + 1$ [137, 138].

Sparse quadratic interpolation: When the structure of the Hessian is sparse, it is possible, by using the $\ell_1$ norm, to recover the sparsity of the constructed model in the under-determined case [28]. In fact, instead of solving (2.10) we consider the following optimization problem:

$$\min \; \|\alpha_Q\|_1 \quad \text{s.t.} \quad M(\bar\phi_L, Y)\,\alpha_L + M(\bar\phi_Q, Y)\,\alpha_Q = f(Y), \qquad (2.12)$$

where $\alpha_Q$, $\alpha_L$, $\bar\phi_L$, and $\bar\phi_Q$ are defined as in (2.10). Solving (2.12) is tractable, since it can be reformulated as a linear program (LP). The sparse quadratic approach is shown to be more advantageous when the Hessian of $f$ has zero entries [28].
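One standard way to make the LP explicit, sketched below under our own variable naming, is to split $\alpha_Q$ into nonnegative parts and minimize their sum with an off-the-shelf LP solver (scipy's linprog here); this illustrates the reformulation rather than reproducing the approach of [28].

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(3)
p, nL, nQ = 4, 3, 3                     # 4 points in R^2; 3 linear, 3 quadratic terms

def linear_part(x):  return np.array([1.0, x[0], x[1]])
def quad_part(x):    return np.array([0.5 * x[0]**2, x[0] * x[1], 0.5 * x[1]**2])

Y = rng.standard_normal((p, 2))
fY = np.array([y[0]**2 + 0.1 * y[1] for y in Y])
ML = np.array([linear_part(y) for y in Y])
MQ = np.array([quad_part(y) for y in Y])

# Variables: (alpha_L, u, v) with alpha_Q = u - v, u >= 0, v >= 0, so that
# sum(u) + sum(v) equals ||alpha_Q||_1 at the optimum.
c = np.concatenate([np.zeros(nL), np.ones(nQ), np.ones(nQ)])
A_eq = np.hstack([ML, MQ, -MQ])
res = linprog(c, A_eq=A_eq, b_eq=fY,
              bounds=[(None, None)] * nL + [(0, None)] * (2 * nQ),
              method="highs")
alpha_L = res.x[:nL]
alpha_Q = res.x[nL:nL + nQ] - res.x[nL + nQ:]
print("sparse quadratic coefficients:", np.round(alpha_Q, 6))
```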


2.1.2.4 Regression models

This section is devoted to the case where the number of points $p$ is larger than $q$, meaning that, in the quadratic interpolation case, $p$ exceeds $\frac{1}{2}(n+1)(n+2)$. In this situation, the linear system (2.7) is overdetermined and in general has no solution. The key idea of regression models is to find the best solution minimizing the gap between $M(\phi, Y)\,\alpha_\phi$ and $f(Y)$. In other words, the coefficients $\alpha_\phi$ are the solution of the following linear least-squares problem:

$$\min_{\alpha_\phi} \; \|M(\phi, Y)\,\alpha_\phi - f(Y)\|_2^2. \qquad (2.13)$$

When the coefficient matrix has full column rank, the minimization problem (2.13) has a unique solution given by solving the normal equations

$$M(\phi, Y)^\top M(\phi, Y)\,\alpha_\phi = M(\phi, Y)^\top f(Y).$$

To solve this linear system, a singular value decomposition or a QR factorization of the coefficient matrix can be used. Regression models are highly recommended, especially when the objective function is noisy [50, 52].
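For the overdetermined case, the sketch below solves (2.13) with an SVD-based least-squares routine instead of forming the normal equations explicitly; the data, the noise level, and the underlying test function are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
p = 20                                        # p > q = 6: overdetermined

def quad_basis(x):
    x1, x2 = x
    return np.array([1.0, x1, x2, 0.5 * x1**2, x1 * x2, 0.5 * x2**2])

Y = rng.standard_normal((p, 2))
noise = 1e-2 * rng.standard_normal(p)         # noisy objective values
fY = np.array([1.0 + 2 * y[0] - y[1] + y[0]**2 for y in Y]) + noise

M = np.array([quad_basis(y) for y in Y])      # p x q, full column rank
alpha, residuals, rank, _ = np.linalg.lstsq(M, fY, rcond=None)

# alpha also satisfies the normal equations M^T M alpha = M^T f(Y).
assert np.allclose(M.T @ M @ alpha, M.T @ fY)
print("regression coefficients:", np.round(alpha, 3))
```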

2.1.3 An interpolation based trust-region approach

Different interpolation-based trust-region methods are available in the literature. The existing methods can be divided into two categories: the first consists of methods that work well on practical problems but are not supported by a convergence theory; the second category includes methods for which global convergence was shown, but which are in practice less competitive than those in the first category. The algorithmic framework described in this section requires the use of fully linear models, i.e. models with accuracy properties similar to those of a first-order Taylor expansion. A rigorous definition of a fully linear model can be found in [51, Definition 3.1] (see also [52, Definition 10.3]). Algorithm 2.1 is a derivative-free interpolation-based trust-region algorithm for which global convergence to first-order stationary points is proved [51, 52].

The algorithm as presented is simple, we check if the norm of the model gradient is too small. If it is, we start the criticality step with the purpose of verifying if the gradient of the objective function f is also small. At each iteration, many situations can occur: an iteration is successful whenever ρk≥ ν1; the trial point is then accepted


Algorithm 2.1: A DFO trust-region algorithm.

Initialization: Let an initial point x_0 and the value f(x_0) be given. Choose an initial trust-region radius ∆_0 > 0. Select an initial model m_0. Set k = 0 and choose the parameters ε_g > 0, 0 < γ < 1 < γ_inc, 0 < ν_0 ≤ ν_1 < 1, and µ > β > 0.

1. Criticality step: When ||∇m_k(x_k)|| ≤ ε_g, apply some procedure to find a new model m_k and a new trust-region radius ∆_k such that ∆_k ≤ µ ||∇m_k(x_k)|| and m_k is fully linear on B(x_k; ∆_k), and such that, if ∆_k is reduced, one has β ||∇m_k(x_k)|| ≤ ∆_k.

2. Compute the step: Compute a step s_k such that

s_k = argmin_{s ∈ B(0, ∆_k)} m_k(x_k + s).    (2.14)

3. Accept the trial point: Compute f(x_k + s_k) and

ρ_k = [f(x_k) − f(x_k + s_k)] / [m_k(x_k) − m_k(x_k + s_k)].

If ρ_k ≥ ν_1, or if both ρ_k ≥ ν_0 and the model is fully linear on B(x_k; ∆_k), then x_{k+1} = x_k + s_k and the model is updated to take into consideration the new iterate, resulting in a new model m_{k+1}; otherwise m_{k+1} = m_k and x_{k+1} = x_k.

4. Improve the model: If ρ_k < ν_1, use a model-improvement algorithm to certify that the model m_k is fully linear on B(x_k; ∆_k). Let m_{k+1} be the new, possibly improved, model.

5. Update the trust-region radius: Set

∆_{k+1} ∈ [∆_k, min{γ_inc ∆_k, ∆_max}]   if ρ_k ≥ ν_1,
∆_{k+1} = γ ∆_k                           if ρ_k < ν_1 and m_k is fully linear,
∆_{k+1} = ∆_k                             if ρ_k < ν_1 and m_k is not certifiably fully linear.

Increment k by one and return to Step 1.

When ν_0 ≤ ρ_k < ν_1 and the model is fully linear (see Algorithm 2.1), the trial point is again accepted but the trust-region radius is decreased; such an iteration is called acceptable. The third situation occurs when ρ_k < ν_1 and the model m_k is not certifiably fully linear (see [51, Definition 3.1]). In this case, the geometry should be improved; the trial point may be included in the sample set but it will not be accepted as the new iterate; such an iteration is called model-improving. The last situation occurs when ρ_k < ν_0 and m_k is fully linear; in this case only the trust-region radius is reduced, while the other parameters (including the current iterate) are kept the same; such an iteration is declared unsuccessful. The model-improvement cycle in Step 4 can in principle be launched for an infinite number of iterations.


However, when the models are assumed to be fully linear and uniformly bounded, one can ensure that only finitely many improvement steps will take place [52]. The criticality step is not described in detail (see [51, 52] for more details); essentially, in such a step one keeps reducing the trust-region radius ∆_k and computing a fully linear model in B(x_k; ∆_k) until ∆_k ≤ µ ||∇m_k(x_k)|| is obtained. At the exit of the criticality step one also has ∆_k ≥ β ||∇m_k(x_k)|| (with µ > β).
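For illustration only, the fragment below sketches the acceptance test (Step 3) and the radius update (Step 5) of Algorithm 2.1 in Python. The criticality step, the model construction, and the fully linear certification are abstracted away (the fully_linear flag is assumed to come from a geometry-improving procedure such as those of [52, 138]), and the default parameter values are arbitrary.

```python
def accept_and_update(f, x, s, m, Delta, fully_linear,
                      nu0=0.25, nu1=0.75, gamma=0.5, gamma_inc=2.0,
                      Delta_max=1e3):
    """One pass through Steps 3 and 5 of Algorithm 2.1.
    f: objective; m: current model (callable); s: trial step;
    fully_linear: flag assumed to be provided by a geometry procedure."""
    pred = m(x) - m(x + s)                       # predicted decrease
    ared = f(x) - f(x + s)                       # actual decrease
    rho = ared / pred if pred > 0 else -float("inf")

    if rho >= nu1 or (rho >= nu0 and fully_linear):
        x_new = x + s                            # successful or acceptable
    else:
        x_new = x                                # model-improving or unsuccessful

    if rho >= nu1:
        Delta_new = min(gamma_inc * Delta, Delta_max)
    elif fully_linear:
        Delta_new = gamma * Delta                # contract with a certified model
    else:
        Delta_new = Delta                        # improve the geometry first
    return x_new, Delta_new
```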

2.1.3.1 The trust-region subproblem

In Step 2 of Algorithm 2.1, one needs to approximate a minimizer s_k of the following optimization problem (called the trust-region subproblem):

min_{s ∈ B(0, ∆_k)} m_k(x_k + s),    (2.15)

where m_k is the model of the objective function and B(0, ∆_k) is the trust region. The computation of such a step s_k is crucial for the convergence theory of trust-region methods. In general, it is not necessary to find an exact minimizer of this optimization problem as long as the computed step ensures some form of sufficient decrease condition, meaning that the new step s_k has to fulfill

m_k(x_k + s_k) ≤ m_k(x_k) − ψ_k,

where ψ_k is a positive value satisfying suitable conditions [52]. The key point is to make sure that the total decrease is at least a fraction of that obtained with the Cauchy step s_k^C [52, Chapter 10], for all iterations k:

m_k(x_k) − m_k(x_k + s_k) ≥ κ_fcd [m_k(x_k) − m_k(x_k + s_k^C)],    (2.16)

where κ_fcd ∈ (0, 1]. The Cauchy step s_k^C can be computed by a backtracking line search along the steepest descent direction given by the gradient of the model. As a consequence, the Cauchy step is defined by

s_k^C = −t_k^C g_k,    (2.17)

where t_k^C is given by

t_k^C = argmin_{t ≥ 0 : x_k − t g_k ∈ B(x_k; ∆_k)} m_k(x_k − t g_k).
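Since the model is quadratic, the one-dimensional minimizer defining t_k^C admits a closed form, which the following sketch computes directly (a backtracking search along −g_k, as described above, would be used when one prefers not to exploit this formula); the function name and the handling of the degenerate case g_k = 0 are illustrative.

```python
import numpy as np

def cauchy_step(g, H, Delta):
    """Cauchy step (2.17) for the quadratic model
    m_k(x_k + s) = f_k + g^T s + 0.5 s^T H s restricted to ||s|| <= Delta."""
    norm_g = np.linalg.norm(g)
    if norm_g == 0.0:
        return np.zeros_like(g)                 # x_k is model-stationary
    t_max = Delta / norm_g                      # largest t keeping x_k - t g in the ball
    gHg = g @ (H @ g)
    if gHg <= 0.0:
        t_C = t_max                             # model decreases all along -g
    else:
        t_C = min(norm_g ** 2 / gHg, t_max)     # unconstrained minimizer, clipped
    return -t_C * g
```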


The Cauchy step satisfies the condition

m_k(x_k) − m_k(x_k + s_k^C) ≥ (1/2) ||g_k|| min{ ||g_k|| / ||H_k||, ∆_k }.    (2.18)

2.1.3.2 Global convergence

To prove global convergence to first-order critical points (convergence to a stationary point regardless of the starting point), it suffices to assume, in addition to assumption (2.16), that the gradient of the objective function f is Lipschitz continuous. We suppose also that the model Hessian is bounded (see [52] for a complete and detailed convergence analysis).

Under such assumptions it is provable that the trust-region radius in Algorithm 2.1 converges to zero [52, Lemma 10.9]:

Lemma 2.1. Consider a sequence of iterations generated by Algorithm 2.1 without any stopping criterion. Then, under the above assumptions, one has

lim_{k→+∞} ∆_k = 0.    (2.19)

When the sequence of iterates is bounded, one can also prove that all limit points of the sequence of iterates are first-order stationary points. The global convergence result is then derived as follows [52, Theorem 10.13]:

Theorem 2.2. Consider a sequence of iterations generated by Algorithm 2.1 without any stopping criterion. Then, under the above assumptions, one has

lim_{k→+∞} ∇f(x_k) = 0.    (2.20)

2.2 Direct-search methods

Direct-search methods correspond to DFO algorithms where sampling, at each iteration, is guided by a finite set of directions with some appropriate features. These methods do not use any derivative approximation or model building. In this section, by direct-search we mean the directional type; we refer the reader to [52,102,128] and references therein for more details on the other types of direct-search methods. To describe direct-search algorithms, we first present some related basic concepts.


2.2.1 Basic concepts

To guide the optimization process, the directions used in direct-search methods must have some appropriate features. One essential property consists in ensuring that at least one of the chosen directions is a descent direction. A direction d is said to be a descent direction at the point x if there exists a positive value ᾱ such that

∀α ∈ (0, ᾱ],   f(x + αd) < f(x).    (2.21)

When f is continuously differentiable at x and ∇f(x) ≠ 0, any direction d satisfying −∇f(x)^T d > 0 is a descent direction. To ensure the existence of such directions, some notions related to positive spanning sets and positive bases are needed [52, 56].

2.2.1.1 Positive spanning sets and positive bases

The positive span of a set of vectors [v_1, . . . , v_r] in R^n is defined as the convex cone positively generated by [v_1, . . . , v_r], meaning the set {v ∈ R^n : v = Σ_{i=1}^{r} α_i v_i, α_i ≥ 0, i = 1, . . . , r} [52, 56].

Definition 2.3.

• A positive spanning set (PSS) in R^n is a set of vectors whose positive span is R^n.

• The set [v_1, . . . , v_r] is said to be positively dependent if one of the vectors is in the convex cone positively spanned by the remaining vectors, i.e., if one of the vectors is a positive combination of the others; otherwise, the set is positively independent.
• A positive basis in R^n is a positively independent set whose positive span is R^n.

Unlike bases of R^n, which contain exactly n vectors, a positive basis has at least n + 1 and at most 2n vectors [15, 56]. Positive bases with n + 1 and 2n vectors are referred to as minimal and maximal positive bases, respectively.

Example 2.1. Let B = [e_1, e_2, . . . , e_n] be the canonical basis of R^n, where e_i denotes the vector with a 1 in the i-th coordinate and 0's elsewhere, and let e = Σ_{i=1}^{n} e_i. Then

• D_⊕ = [B, −B] is a maximal positive basis of R^n, where −B = [−e_1, −e_2, . . . , −e_n];
• [B, −e] is a minimal positive basis of R^n.
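A small illustrative sketch (the function names are assumptions, not notation from the text) constructs the two positive bases of Example 2.1 and checks numerically, on random nonzero vectors w, that at least one element d of each basis satisfies w^T d > 0, the descent property formalized in Theorem 2.5 below.

```python
import numpy as np

def maximal_positive_basis(n):
    """D_plus = [I, -I]: the 2n columns e_1, ..., e_n, -e_1, ..., -e_n."""
    I = np.eye(n)
    return np.hstack([I, -I])

def minimal_positive_basis(n):
    """[I, -e]: the n+1 columns e_1, ..., e_n and -(e_1 + ... + e_n)."""
    I = np.eye(n)
    return np.hstack([I, -np.ones((n, 1))])

# For random nonzero w, some column d of each basis satisfies w^T d > 0.
rng = np.random.default_rng(1)
for D in (maximal_positive_basis(3), minimal_positive_basis(3)):
    W = rng.standard_normal((3, 100))           # 100 random directions
    assert np.all((D.T @ W).max(axis=0) > 0)
```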


Figure 2.1: A graphical representation of the maximal positive basis D_1 = D_⊕ (left) and the minimal positive basis D_2 (right) for R^2.

In Figure 2.1, we depict two positive bases for R^2 (maximal and minimal).

As stated in [52, Theorem 2.4], if [v_1, . . . , v_r] is a positive basis for R^n and W ∈ R^{n×n} is a nonsingular matrix, then [W v_1, . . . , W v_r] is also a positive basis for R^n. In other words, having a positive basis of R^n, one can ensure the existence of infinitely many different ones. Attractive properties of positive bases (explaining their use in direct-search methods) are as follows:

Theorem 2.4. Let [v_1, . . . , v_r] be a positive basis for R^n and w ∈ R^n. Then

[ ∀i ∈ {1, . . . , r}, v_i^T w ≥ 0 ] ⇒ [ w = 0 ].    (2.22)

Proof. Since [v_1, . . . , v_r] spans R^n positively, the vector −w can be written as

−w = Σ_{i=1}^{r} λ_i v_i,

where λ_i ≥ 0 for all i = 1, . . . , r.

From (2.22) we have v_i^T w ≥ 0 for all i ∈ {1, . . . , r}, and so

0 ≤ Σ_{i=1}^{r} λ_i v_i^T w = −w^T w ≤ 0.

The only possibility is then w = 0.

Thus, by choosing w = −∇f(x) in Theorem 2.4, positive bases can be used to check whether or not a point x ∈ R^n is a stationary point of the objective function.

Theorem 2.5. Let f be a continuously differentiable function with ∇f(x) ≠ 0 for some x ∈ R^n, and let [v_1, . . . , v_r] be a positive basis for R^n. Then there exists i ∈ {1, . . . , r} such that

−∇f(x)^T v_i > 0.


Proof. Let w = −∇f(x), where x ∈ R^n. One knows that w^T w > 0 for all nonzero w and, since [v_1, . . . , v_r] spans R^n positively, one has

w = Σ_{i=1}^{r} λ_i v_i,

where λ_i ≥ 0 for all i = 1, . . . , r. Hence,

w^T w = Σ_{i=1}^{r} λ_i w^T v_i > 0,

from which we conclude that at least one of the scalars w^T v_1, . . . , w^T v_r has to be positive.

In other words, Theorem 2.5 states that there must exist at least one descent direction in a positive basis. In Figure 2.2, we identify the descent direction for the two positive spanning sets D_1 and D_2 in R^2.

Figure 2.2: For a given positive spanning set and a vector w = −∇f(x) (green), there must exist at least one descent direction d (red), i.e., w^T d > 0.

2.2.1.2 Gradient estimates

By assuming that the set of search directions is a PSS, one is sure that at each iteration a descent direction must exist in the PSS. However, in practice finding a good descent direction may not be possible; see for instance Figure 2.3, where two vectors of the PSS tend to become collinear with opposite directions. A good descent direction can be defined as a direction making an acute angle with the negative gradient: the more acute the angle between the descent direction and the negative gradient of the objective function, the better the direction.

Figure 2.3: A positive spanning set with a very small cosine measure.

A PSS gives descent directions at each iteration but may not be good enough (depending on the level of acuteness) to ensure convergence; in this case the PSS is said to be degenerate. Thus, the question that arises naturally is: how can one measure and control any deterioration of the PSS property so as to avoid its degeneracy? For that sake, we review the notion of the cosine measure of positive spanning sets [108].

Definition 2.6. The cosine measure of a positive spanning set (with nonzero vectors) or of a positive basis D is defined by

cm(D) = min_{0 ≠ v ∈ R^n} max_{d ∈ D} (v^T d) / (||v|| ||d||).

In R^2, the cosine measure of a positive spanning set is the cosine of half of the largest angle θ between two of its adjacent vectors (see Figure 2.4).

Figure 2.4: In R^2, for a given positive spanning set the cosine measure is given by cos(θ/2), where θ (blue) is the largest angle between two adjacent vectors; for D_1, θ = π/2 and cm(D_1) = cos(π/4), while for D_2, θ = 3π/4 and cm(D_2) = cos(3π/8).

Remark 2.7. The cosine measure of a positive spanning set is strictly positive.
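As a rough illustration, the cosine measure can be estimated by sampling unit vectors v and taking the smallest of the inner maxima in Definition 2.6; since the true minimum ranges over all nonzero v, such sampling only provides an upper bound on cm(D). The sketch below, with illustrative names, does exactly that.

```python
import numpy as np

def cosine_measure_estimate(D, n_samples=100000, seed=0):
    """Monte Carlo estimate of cm(D) = min_{v != 0} max_{d in D} v^T d / (||v|| ||d||).
    Sampling v only gives an upper bound on the true cosine measure."""
    n = D.shape[0]
    Dn = D / np.linalg.norm(D, axis=0)          # normalize the columns of D
    rng = np.random.default_rng(seed)
    V = rng.standard_normal((n, n_samples))
    V /= np.linalg.norm(V, axis=0)              # random unit vectors v
    return (Dn.T @ V).max(axis=0).min()

D_plus = np.hstack([np.eye(2), -np.eye(2)])     # maximal positive basis of R^2
print(cosine_measure_estimate(D_plus))          # slightly above cos(pi/4) ~ 0.707
```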

In terms of descent, a key point of the cosine measure can be seen as follows: given a nonzero vector w ∈ R^n, one has

cm(D) ≤ max_{d ∈ D} (w^T d) / (||w|| ||d||).

Thus there must exist a d ∈ D such that

cm(D) ≤ (w^T d) / (||w|| ||d||).

In particular, if one chooses w = −∇f(x), then

cm(D) ||∇f(x)|| ||d|| ≤ −∇f(x)^T d.    (2.23)

A cosine measure close to zero indicates a deterioration of the PSS, meaning that the PSS becomes degenerate. To see how the cosine measure can predict such deterioration,

