The ridge method for tame min-max problems


HAL Id: hal-03186676

https://hal.archives-ouvertes.fr/hal-03186676

Preprint submitted on 31 Mar 2021



Edouard Pauwels


March 31, 2021

Abstract

We study the ridge method for min-max problems, and investigate its convergence without any convexity, differentiability or qualification assumption. The central issue is to determine whether the “parametric optimality formula” provides a conservative field, a notion of generalized derivative well suited for optimization. The answer to this question is positive in a semi-algebraic, and more generally definable, context. The proof involves a new characterization of definable conservative fields which is of independent interest. As a consequence, the ridge method applied to definable objectives is proved to have a minimizing behavior and to converge to a set of equilibria which satisfy an optimality condition. Definability is key to our proof: we show that for a more general class of nonsmooth functions, conservativity of the parametric optimality formula may fail, resulting in an absurd behavior of the ridge method.

Keywords. min-max problems, ridge algorithm, parametric optimality, conservative fields, definable sets, o-minimal structures, Clarke subdifferential, First order methods


1 Introduction

1.1 Main result

We consider unconstrained minimization of an objective function:

$$f : x \mapsto \max_{y \in \mathbb{R}^r} F(x, y), \tag{1}$$

where $F : \mathbb{R}^p \times \mathbb{R}^r \to \mathbb{R}$ is locally Lipschitz and achieves its maximum¹ in $y$ for all $x \in \mathbb{R}^p$. The function $f$ will be called the value function; note that in this situation $f$ is also locally Lipschitz. These notations and assumptions are standing throughout this paper. We consider the Ridge Method (RM). Initialized with $x_0 \in \mathbb{R}^p$, it is defined recursively as follows, for all $k \in \mathbb{N}$,

$$\begin{aligned}
y_k &\in \arg\max_{y \in \mathbb{R}^r} F(x_k, y) \\
(u_k, 0) &\in \partial^c F(x_k, y_k) \\
x_{k+1} &= x_k - \alpha_k u_k,
\end{aligned} \tag{RM}$$

where $(\alpha_k)_{k \in \mathbb{N}}$ is a nonsummable sequence of positive step sizes tending to 0 and $\partial^c$ denotes the Clarke subgradient [19]. The existence of the update direction $u_k$ in (RM) is ensured by the Parametric Optimality (PO) formula for subgradients of partial maxima. The algorithm relies on the knowledge of

• A partial maximization oracle, which associates to any $x \in \mathbb{R}^p$ an element of the set $P(x) := \arg\max_{y \in \mathbb{R}^r} F(x, y)$, assumed to be nonempty.

• A first order oracle, which associates to any $(x, y) \in \mathbb{R}^p \times \mathbb{R}^r$ first order information about $F$ in the form of its Clarke subgradient, $\partial^c F$.
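The recursion (RM) and its two oracles can be sketched in a few lines of Python. This is only a minimal illustration for scalar $x$ and $y$: the names `ridge_method`, `argmax_oracle`, `first_order_oracle` and `step_size` are hypothetical, and the toy function $F(x, y) = x^2 - (y - x)^2$ is an assumption chosen so that both oracles have a closed form ($P(x) = \{x\}$ and $\nabla F(x, x) = (2x, 0)$).

```python
from typing import Callable, List, Tuple


def ridge_method(
    x0: float,
    argmax_oracle: Callable[[float], float],
    first_order_oracle: Callable[[float, float], Tuple[float, float]],
    step_size: Callable[[int], float],
    n_iter: int,
) -> List[Tuple[float, float]]:
    """Run n_iter steps of (RM) for scalar x and y; return the visited (x_k, y_k)."""
    x = x0
    history = []
    for k in range(n_iter):
        y = argmax_oracle(x)             # y_k in argmax_y F(x_k, y)
        u, w = first_order_oracle(x, y)  # (u_k, w_k), with w_k expected to be 0
        assert abs(w) < 1e-9, "oracle must return an element of the form (u, 0)"
        history.append((x, y))
        x = x - step_size(k) * u         # x_{k+1} = x_k - alpha_k u_k
    return history


# Toy smooth instance (an assumption for illustration only):
# F(x, y) = x^2 - (y - x)^2, so that f(x) = x^2.
def toy_argmax(x: float) -> float:
    return x


def toy_gradient(x: float, y: float) -> Tuple[float, float]:
    return 2.0 * y, -2.0 * (y - x)


def alphas(k: int) -> float:
    return 1.0 / (k + 3)                 # positive, tends to 0, not summable


iterates = ridge_method(1.0, toy_argmax, toy_gradient, alphas, n_iter=200)
x_last, y_last = iterates[-1]
```

On this toy instance the iterates decrease monotonically towards the minimizer $0$ of $f$. In the nonsmooth setting of the paper the first order oracle would return an element of the Clarke subgradient rather than a gradient.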

For most locally Lipschitz functions $F$, the Clarke subgradient $\partial^c F$ [19] carries absolutely no information about the function itself and is therefore useless from a computational perspective [36, 16, 17]. For our purpose, we need to restrict $F$ to a subclass which is well behaved with respect to subdifferentiation. We choose the class of path differentiable functions, which was identified by several authors to be well behaved in terms of subgradient differential inclusion [35, 15, 22, 13].

The purpose of this paper is to investigate asymptotic behavior of (RM). The main results are the following.

• When applied to a large and widespread subclass of functions $F$, for example semi-algebraic functions, and more generally definable functions, Algorithm (RM) has a minimizing behavior. For bounded sequences, the value function $f(x_k)$ converges, and accumulation points $(\bar{x}, \bar{y})$ of the sequence $(x_k, y_k)_{k \in \mathbb{N}}$ are equilibria which satisfy the optimality condition:

$$\begin{aligned}
0 &\in \operatorname{conv}\{u : (u, 0) \in \partial^c F(\bar{x}, y),\ y \in \arg\max_{z \in \mathbb{R}^r} F(\bar{x}, z)\} \\
\bar{y} &\in \arg\max_{y \in \mathbb{R}^r} F(\bar{x}, y)
\end{aligned}$$

Let us stress that most objectives found in applications are definable, see [7, 12] for discussion and examples, and [14, 13] for a recent account in deep learning.

• Without definability assumption, for general path-differentiable functions, the algorithm may fail to have any minimizing property. We construct a Lipschitz function $F$ which is path-differentiable, such that all inputs $x$ are steady states for (RM) but not critical in any reasonable sense for $f$. This underlines the importance of the definability assumption in the previous result and shows that Algorithm (RM) requires working with proper subclasses, beyond Lipschitz continuity and path-differentiability.

1.2 Parametric optimality and nonsmooth differential calculus

In order to analyse Algorithm (RM), we need a variational model for parametric optimality and calculus rules providing access to first order information for $f$, from the knowledge of the partial maximization oracle and the subgradient of $F$. Our main candidate is the parametric optimality formula, which was described in [31, Corollary 3.1.1] for partial minimization, see also [32, Theorem 10.13]. The Clarke subgradient is non directional, and therefore the formula is also valid for partial maximization. Under our setting, it ensures that for all $x \in \mathbb{R}^p$

$$\partial^c f(x) \subset \operatorname{conv}\{u : (u, 0) \in \partial^c F(x, y),\ y \in P(x)\}. \tag{PO}$$

The formula is in fact a result describing the subgradient of $f$, but we will be interested mostly in its right hand side, which will be referred to as the parametric optimality formula (PO formula)². Indeed, an element given by the PO formula can be computed from the knowledge of the maximization and first order oracles mentioned above and precisely corresponds to the search direction chosen in Algorithm (RM).
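As a small worked instance of how the right hand side of (PO) is evaluated, take the function $F(x, y) = -|x - y|$ used in the footnote on the envelope formula. The computation below is a hand derivation under that choice of $F$, not a statement taken from the cited references.

$$\begin{aligned}
f(x) &= \max_{y \in \mathbb{R}} -|x - y| = 0, \qquad P(x) = \{x\}, \\
\partial^c F(x, x) &= \operatorname{conv}\{(1, -1), (-1, 1)\} = \{(s, -s) : s \in [-1, 1]\}, \\
(u, 0) \in \partial^c F(x, x) &\iff u = 0, \\
\operatorname{conv}\{u : (u, 0) \in \partial^c F(x, y),\ y \in P(x)\} &= \{0\} = \partial^c f(x).
\end{aligned}$$

Here the PO formula holds with equality, whereas the partial subgradient in $x$ at fixed $y = x$ is the whole interval $[-1, 1]$.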

² Envelope formula: the PO formula is very similar to the envelope formula: $\partial^c f(x) \subset \operatorname{conv}\{u \in \partial^c_x F(x, y),\ y \in P(x)\}$ [19, Theorem 2.8.2], where $\partial^c_x$ denotes the subgradient for fixed $y$. This formula holds with equality if $F$ is convex in $x$, see [10, Proposition A.22], [24, 21], and more generally regular [19]. This is simpler and computationally more advantageous than the PO formula. However, it very much depends on convexity and is too coarse in general for our purposes: consider for example $F(x, y) = -|x - y|$, for which the envelope formula outputs $[-1, 1]$ for all $x$.

³ Failure of the parametric optimality formula: consider the function $F : (x, y) \mapsto -y \min\{|x|, 1\} + \min\{0, y\}$. We have for all $x \in \mathbb{R}$, $f(x) = \max_y F(x, y) = 0$. Yet fixing $x = 0$, we have $F(0, y) = 0$ for all $y > 0$, all such $y$ being partial maximizers. However $[-y, y] \times \{0\} \subset \partial^c F(0, y)$ for any $y > 0$. This shows that any $s \in \mathbb{R}$ is compatible with the PO formula at $x = 0$, yet the corresponding Clarke subgradient is only the singleton $\{0\}$, which coincides with the classical derivative.

As noted in [32, Theorem 10.13], the PO formula is sharp: it holds with equality in the case of a concave $F$ (jointly in $(x, y)$). However, in general nonconvex settings, without further qualification assumptions, the PO formula does not hold with equality³. Therefore, the PO formula does not necessarily provide a subgradient, and this constitutes the main difficulty in analyzing Algorithm (RM). We choose to use conservativity, a notion of generalized derivative which was recently introduced as a nonsmooth analysis tool which is compatible with differential calculus [13]. Most importantly, conservative fields may be used in place of subgradients for first order optimization [18, 13, 14], making them a natural candidate for algorithmic oracles in our context. With this in mind, the proposed analysis of Algorithm (RM) boils down to the central question:

For a path-differentiable F , does the PO formula define a conservative field for f ?

The answer to this question is negative in general, and we provide a counterexample. However, for definable functions the answer turns out to be positive. The proof of the latter result relies on a characterization of definable conservative fields based only on definable paths, which is of independent interest. In the context of conservativity, the definable case plays a special role as it is widespread in applications [13, 14] and many further properties are available [14, 29, 23]. The reader unfamiliar with definability may consider instead semialgebraicity, which is a special case, a function being semialgebraic when its graph can be represented as a finite union of solution sets of polynomial systems involving finitely many equalities and inequalities. Section 3 exposes basic definitions and more details regarding definability.

1.3 Applications of the main results

Min-max problems arise in machine learning applications with Generative Adversarial Networks (GANs) [27, 6], adversarial training of deep networks [34, 28], and further applications [1], bolstering research on algorithms for min-max problems, see for example [37] for a ridge-type method. Most of these problems amount to solving nonsmooth min-max problems, with definable objectives that are not convex or concave, based on first order methods and algorithmic differentiation oracles. Our results describe a notion of equilibrium for such problems as well as an algorithm to reach such equilibria, under assumptions which are general enough to encompass most machine learning applications [14].

Our main results are actually stated in terms of conservative fields, a notion which has been shown to be compatible with the rules of differential calculus, contrary to the notion of subgradient [13]. Beyond min-max problems, the fact that the PO formula defines a conservative field can be used in compositional modeling, in combination with algorithmic differentiation. This is in close connection with emerging extensions of deep neural networks which include optimization problems within their formulation, some of the network layers being defined as partial maxima or minima [5, 2, 9]. Conservativity of the PO formula provides theoretical ground to develop algorithmic differentiation tools for compositional problems involving such max structured functions.

2 Presentation of the main results

2.1 Technical preliminary

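For any integer $p \in \mathbb{N}$, we use the following notations. We denote by $\langle \cdot, \cdot \rangle$ the canonical Euclidean scalar product on $\mathbb{R}^p$ and by $\|\cdot\|$ its associated norm. A locally Lipschitz continuous function $f : \mathbb{R}^p \to \mathbb{R}$ is differentiable almost everywhere by Rademacher’s theorem, see for example [26]. Denote by $R \subset \mathbb{R}^p$ the full measure set where $f$ is differentiable, then the Clarke subgradient [19] of $f$ is given for any $x \in \mathbb{R}^p$ by

$$\partial^c f(x) = \operatorname{conv}\left\{ v \in \mathbb{R}^p : \exists\, y_k \underset{k \to \infty}{\longrightarrow} x \text{ with } y_k \in R,\ v_k = \nabla f(y_k) \underset{k \to \infty}{\longrightarrow} v \right\}.$$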
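The definition above can be illustrated numerically in dimension one, where the convex hull of limiting derivatives is an interval. The following is a rough sketch, not a rigorous procedure: the helper `clarke_interval_1d` is hypothetical, it replaces the limit $y_k \to x$ by a fixed small radius, and it replaces exact derivatives by central finite differences.

```python
def clarke_interval_1d(f, x, radius=1e-3, n_samples=50, eps=1e-7):
    """Approximate the Clarke subgradient of a Lipschitz f: R -> R at x.

    Returns (min, max) of finite-difference derivatives taken at points
    close to x; in dimension one this interval is the convex hull.
    """
    slopes = []
    for i in range(n_samples):
        offset = radius * (i + 0.5) / n_samples  # strictly positive, larger than eps
        for t in (x - offset, x + offset):
            slopes.append((f(t + eps) - f(t - eps)) / (2.0 * eps))
    return min(slopes), max(slopes)


lo_kink, hi_kink = clarke_interval_1d(abs, 0.0)      # at the kink of |.|
lo_smooth, hi_smooth = clarke_interval_1d(abs, 1.0)  # at a differentiable point
```

For the absolute value, the sketch recovers approximately $[-1, 1]$ at the origin and the singleton $\{1\}$ at $x = 1$, in agreement with the definition.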

A set valued map $D : \mathbb{R}^p \rightrightarrows \mathbb{R}^q$ is a function from $\mathbb{R}^p$ to the set of subsets of $\mathbb{R}^q$. The graph of $D$ is given by

$$\operatorname{graph} D = \{(x, z) : x \in \mathbb{R}^p,\ z \in D(x)\}.$$

$D$ is said to have closed graph or to be graph closed if $\operatorname{graph} D$ is closed as a subset of $\mathbb{R}^{p+q}$. An equivalent characterization is that for any converging sequences $(x_k)_{k \in \mathbb{N}}$ in $\mathbb{R}^p$ and $(v_k)_{k \in \mathbb{N}}$ in $\mathbb{R}^q$, with $v_k \in D(x_k)$ for all $k \in \mathbb{N}$, we have

$$\lim_{k \to \infty} v_k \in D\left(\lim_{k \to \infty} x_k\right).$$

$D$ is said to be locally bounded if for each compact $K \subset \mathbb{R}^p$, there is $M > 0$ such that $\|v\| \leq M$ for all $v \in D(x)$ and all $x \in K$. An absolutely continuous curve is a continuous function $x : \mathbb{R} \to \mathbb{R}^p$ which admits a derivative $\dot{x}$ for Lebesgue almost all $t \in \mathbb{R}$ (in which case $\dot{x}$ is Lebesgue measurable), and such that $x(t) - x(0)$ is the Lebesgue integral of $\dot{x}$ between $0$ and $t$ for all $t \in \mathbb{R}$.

These elements allow us to define the notion of conservativity of set valued mappings [13].

Definition 1 (Conservative fields) Let $D : \mathbb{R}^p \rightrightarrows \mathbb{R}^p$ be a set valued map with closed graph, non empty and locally bounded values and $f : \mathbb{R}^p \to \mathbb{R}$ a locally Lipschitz function. Then $f$ is a potential for $D$ if for all $x \in \mathbb{R}^p$, all absolutely continuous $\gamma : [0, 1] \to \mathbb{R}^p$ with $\gamma(0) = 0$ and $\gamma(1) = x$, and all measurable functions $v : [0, 1] \to \mathbb{R}^p$ such that $v(t) \in D(\gamma(t))$ for all $t \in [0, 1]$,

$$f(x) = f(0) + \int_0^1 \langle \dot{\gamma}(t), v(t) \rangle \, dt. \tag{2}$$

We shall also say that $D$ is a conservative field for $f$ or simply a conservative field. Such functions $f$ are called path differentiable.

The result of [13, Corollary 1] ensures that for a path differentiable f , ∂cf is a conservative field.
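The integral formula (2) can be checked numerically on a simple case. The sketch below is an illustration under assumptions that are not in the text: the function $f(x) = |x_1| + |x_2|$, the path `gamma` and the midpoint quadrature are all choices made for the example, and the selection picks $0$ on the kinks, which is an element of the Clarke subgradient there.

```python
import math


def f(x1, x2):
    return abs(x1) + abs(x2)


def sign(s):
    return (s > 0) - (s < 0)


def selection(x1, x2):
    # One element of the Clarke subgradient of f (0 is chosen on the kinks).
    return float(sign(x1)), float(sign(x2))


def gamma(t):
    # Smooth path with gamma(0) = (0, 0) and gamma(1) = (1, -0.5).
    return (t + 0.3 * math.sin(2 * math.pi * t),
            -0.5 * t + 0.3 * math.sin(4 * math.pi * t))


def gamma_dot(t):
    return (1.0 + 0.6 * math.pi * math.cos(2 * math.pi * t),
            -0.5 + 1.2 * math.pi * math.cos(4 * math.pi * t))


def line_integral(n=100_000):
    """Midpoint rule for the integral of <gamma_dot(t), v(t)> over [0, 1]."""
    total = 0.0
    for i in range(n):
        t = (i + 0.5) / n
        g1, g2 = gamma(t)
        d1, d2 = gamma_dot(t)
        v1, v2 = selection(g1, g2)
        total += (d1 * v1 + d2 * v2) / n
    return total


integral = line_integral()
increment = f(*gamma(1.0)) - f(*gamma(0.0))
```

Up to quadrature error, `integral` matches `increment`, even though the path crosses the kinks of $f$ several times.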


2.2 Characterization of definable conservative mappings

Our main convergence result holds under definability assumptions, we start by showing that conservativity admits a simpler characterization in this context. From now on we fix an o-minimal structure (for example semialgebraic sets, see Section 3 for more details on definability), all definable objects we shall consider are implicitly definable in this structure.

Definition 2 (Definably conservative field) Let $D : \mathbb{R}^p \rightrightarrows \mathbb{R}^p$ be a set valued field with closed graph, non empty and locally bounded values. Assuming in addition that $D$ is definable, $D$ is called definably conservative if equation (2) holds only for definable $C^1$ loops $\gamma$ and definable selections $v$.

Following Definition 2, it is obvious that a definable conservative field is definably conservative since definable $C^1$ loops are absolutely continuous and definable selections are measurable. The following result provides a converse, a slightly more general proof is found in Section 3.

Theorem 1 Let $D : \mathbb{R}^p \rightrightarrows \mathbb{R}^p$ be definably conservative, then $D$ is conservative.

Remark 1 It is of primary importance in Definition 3 that the definable loops and the definable set valued mapping $D$ are definable in the same o-minimal structure. For example, consider the set $E$, the graph of the exponential function in $\mathbb{R}^2$. By a theorem of Wilkie [39], there exists an o-minimal structure which contains all semi-algebraic sets and such that $E$ is definable in this structure, call it Wilkie’s structure. Consider $D : \mathbb{R}^2 \rightrightarrows \mathbb{R}^2$ to be $\{0\}$ outside of $E$ and the unit Euclidean ball on $E$; it has a closed graph, it is bounded with nonempty values and definable in Wilkie’s structure. Consider any differentiable semialgebraic loop $\gamma : [0, 1] \to \mathbb{R}^2$. Since $E$ is the graph of an analytic function, but is not a semialgebraic set, the intersection of $\gamma$ and $E$ must contain only finitely many points. Hence $\gamma$ and $D$ satisfy formula (2), but $D$ is obviously not conservative as it does not satisfy the integral formula along any nontrivial absolutely continuous path whose image is in $E$. Hence the importance of having a unique fixed o-minimal structure throughout the manuscript.

This result shows that definable conservativity is equivalent to conservativity in the definable world. Its proof is based on an equivalent characterization of conservativity in this context, variational stratification [11, 13]. The restriction to definable loops and selections in Definition 2 opens the possibility to use all results of o-minimal geometry [25, 20] in order to prove that a given field is conservative. This can be in particular useful to prove conservativity of the PO formula.

2.3 Definable parametric optimality formula

Taking advantage of the strong rigidity of definable objects, we obtain the following result, whose proof is given in Section 3.5.


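Theorem 2 Let $F : \mathbb{R}^p \times \mathbb{R}^r \to \mathbb{R}$ be locally Lipschitz and definable. Set

$$f : \mathbb{R}^p \to \mathbb{R}, \qquad x \mapsto \max_{y \in \mathbb{R}^r} F(x, y),$$

where the argmax is assumed to be locally bounded, call it $P(x) \subset \mathbb{R}^r$. Set

$$D_f : \mathbb{R}^p \rightrightarrows \mathbb{R}^p, \qquad x \mapsto \operatorname{conv}\{u : \exists y \in P(x),\ (u, 0) \in \partial^c F(x, y)\},$$

then $D_f$ is conservative for $f$.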

This result is a consequence of the more general Theorem 7 stated in Example 1. The result is in fact stated for more general conservative fields than the Clarke subgradient, which opens the possibility to use objects defined through other calculus rules in place of subgradients, for example, outputs of algorithmic differentiation [13, 14]. This result implies that the PO formula can be used as a first order optimization oracle in the definable world, as illustrated in the next section.

2.4 Algorithmic consequences

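Let us return to the initial problem

$$\min_{x \in \mathbb{R}^p} \max_{y \in \mathbb{R}^r} F(x, y)$$

where $F : \mathbb{R}^p \times \mathbb{R}^r \to \mathbb{R}$ is locally Lipschitz and, in addition, definable. Assume that the mapping $x \rightrightarrows P(x) = \arg\max_{y \in \mathbb{R}^r} F(x, y)$ is nonempty and locally bounded (or take a locally bounded subset). Consider the ridge algorithm: set $x_0 \in \mathbb{R}^p$ and iterate for $k \in \mathbb{N}$

$$\begin{aligned}
y_k &\in \arg\max_{y \in \mathbb{R}^r} F(x_k, y) \\
(u_k, 0) &\in \partial^c F(x_k, y_k) \\
x_{k+1} &= x_k - \alpha_k u_k
\end{aligned}$$

Assume that $(\alpha_k)_{k \in \mathbb{N}}$ is a non summable sequence of positive step sizes tending to zero. Assume that $(x_k)_{k \in \mathbb{N}}$ is bounded. Then $F(x_k, y_k)$ converges, and all accumulation points $(\bar{x}, \bar{y})$ of $(x_k, y_k)_{k \in \mathbb{N}}$ are PO critical points for $f$ such that

$$0 \in \operatorname{conv}\{u : (u, 0) \in \partial^c F(\bar{x}, y),\ y \in \arg\max_{z \in \mathbb{R}^r} F(\bar{x}, z)\} \tag{3}$$

$$\bar{y} \in \arg\max_{y \in \mathbb{R}^r} F(\bar{x}, y)$$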

The result follows by noticing that for all $k$, $u_k \in D_f(x_k)$ where $D_f$ is a convex valued conservative field for $f$ as described in Theorem 2. Indeed condition (3) can be equivalently read as $0 \in D_f(\bar{x})$. Note that using [13, Corollary 1], condition (3) is an optimality


Convex valued conservative fields can be used in place of subgradients in first order methods, while allowing to deploy the general method of [8], see [13] for convergence analysis. More precisely, the convergence result follows by combining Theorem 3.6, Remark 1.5(ii) and Proposition 3.27 of [8] with null deterministic perturbation term, the Morse-Sard condition being obtained in [13, Theorem 5] since $D_f$ is also definable.

Remark 2 The same result holds mutatis mutandis with a definable conservative field D in place of ∂cF , for example one obtained by algorithmic differentiation [13]. Similarly minimization and maximization could be interchanged arbitrarily, modulo changes in the step sign.

Remark 3 The convex hull in (3) is necessary, for example setting $F(x, y) = xy - 2\,\big||y| - 1\big|$, we have $f(x) = \max_y F(x, y) = |x|$ for all $x \in [-1, 1]$. In this case, the algorithm reduces to subgradient descent and the convex hull is necessary to obtain a valid optimality condition at 0.
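The example of Remark 3 is simple enough to be simulated. The sketch below rests on several assumptions made for illustration: the maximization oracle is a brute force search over a grid of $y$ values (the grid `Y_GRID` contains $y = \pm 1$), the identity $u = y$ used in `po_direction` is a hand computation valid at the maximizers $y = \pm 1$, and the step sizes $\alpha_k = 1/(k+1)$ are one admissible choice among many.

```python
def F(x, y):
    return x * y - 2.0 * abs(abs(y) - 1.0)


Y_GRID = [i / 100.0 for i in range(-300, 301)]  # contains y = -1 and y = 1


def value(x):
    """Grid approximation of f(x) = max_y F(x, y)."""
    return max(F(x, y) for y in Y_GRID)


def argmax_set(x, tol=1e-12):
    """Grid maximizers of F(x, .), up to the tolerance tol."""
    best = value(x)
    return [y for y in Y_GRID if F(x, y) >= best - tol]


def po_direction(x, y):
    # Hand computation: near y = +/-1 the partial derivative of F in x is y,
    # and (u, 0) in the Clarke subgradient of F at (x, y) forces u = y.
    return y


xs = [-1.0, -0.25, 0.0, 0.5, 1.0]
values = [value(x) for x in xs]
directions_at_zero = sorted(po_direction(0.0, y) for y in argmax_set(0.0))

# Ridge iteration with alpha_k = 1 / (k + 1), picking one maximizer per step.
x = 0.8
for k in range(500):
    y_k = argmax_set(x)[-1]
    x = x - po_direction(x, y_k) / (k + 1)
```

The grid values reproduce $f(x) = |x|$ on the sampled points. At $x = 0$ the admissible directions are $-1$ and $1$, so $0$ is only recovered after taking the convex hull, while the iterates of the ridge method approach $0$.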

2.5 Failure of parametric optimality formula in general

It was already shown that the PO formula does not necessarily provide elements of the subdifferential (see footnote 3). Yet the failure only occurred at the origin, which does not prevent the PO formula from providing a conservative field for the value function $f$. The following result shows that this is not the case in general, its proof is given in Section 4.

Theorem 3 There exists a Lipschitz path differentiable function $g : \mathbb{R}^2 \to \mathbb{R}$ such that

• for all $x \in [0, 1]$, $\max_{y \in \mathbb{R}} g(x, y) = x$ and the maximum is attained on $[0, 1]$.

• for all $x \in [0, 1]$, $0 \in \{v : (v, 0) \in \partial^c g(x, y),\ y \in \arg\max_{y \in \mathbb{R}} g(x, y)\}$.

• for countably many $x \in [0, 1]$, $\arg\max_{y \in \mathbb{R}} g(x, y)$ is a pair, for the rest it is a singleton.

The preceding result shows that the PO formula does not provide a conservative field as the value function is the identity on $\mathbb{R}$, but the formula may result in the constant 0, which is not compatible with the integration constraint in Definition 1. It is also obvious that the ridge algorithm (RM) applied to minimization of $f : x \mapsto \max_{y \in [0,1]} g(x, y)$ based on the PO formula may get stuck at any initialization point $x \in [0, 1]$ since they are all steady states of the algorithm (RM). This illustrates the fact that failure of conservativity entails in this case failure for first order algorithms based on the PO formula. Obviously the function $g$ given in Theorem 3 is not definable in any structure as otherwise Theorem 2 would apply.


3 Definably conservative fields and parametric optimality formula

The result described in Theorem 1 is actually proved for conservative mappings, which generalize conservative fields in a similar way as jacobians generalize gradients. We start with an extension of Definition 1 to this setting and make the necessary technical connections with the work of [13], in particular the chain rule along absolutely continuous curves and the variational stratification. These preliminaries, although not explicitly stated this way in [13], are direct consequences of [13] and given here for completeness. We then proceed to the proof of the main result of this section: definably conservative mappings are conservative.

3.1 Conservative mappings

The following defines conservativity for matrix set valued functions using vanishing circulation, as suggested in [13], it is equivalent to [13, Definition 4].

Definition 3 (Conservative mappings) Let $J : \mathbb{R}^p \rightrightarrows \mathbb{R}^{m \times p}$ be a set valued map with closed graph, non empty and locally bounded values. Then $J$ is a conservative mapping, if for all $x \in \mathbb{R}^p$, all absolutely continuous $\gamma : [0, 1] \to \mathbb{R}^p$ with $\gamma(0) = \gamma(1)$, and all measurable functions $V : [0, 1] \to \mathbb{R}^{m \times p}$ such that $V(t) \in J(\gamma(t))$ for all $t \in [0, 1]$,

$$\int_0^1 V(t) \dot{\gamma}(t) \, dt = 0. \tag{4}$$

If in addition $J$ is definable, then $J$ is called definably conservative if (4) holds only for definable $C^1$ loops $\gamma$ and definable selections $V$.

The following Lemma is a useful alternative characterization of conservativity through an operational chain rule corresponding to [13, Definition 4].

Lemma 1 Let $J : \mathbb{R}^p \rightrightarrows \mathbb{R}^{m \times p}$ be a graph closed locally bounded non empty valued map. Then the following are equivalent

(i) There exists $G : \mathbb{R}^p \to \mathbb{R}^m$, locally Lipschitz, such that for any absolutely continuous $\gamma : [0, 1] \to \mathbb{R}^p$, for almost all $t \in [0, 1]$

$$\frac{d}{dt} G(\gamma(t)) = M \dot{\gamma}(t), \qquad \forall M \in J(\gamma(t)).$$

(ii) J is a conservative mapping as stated in Definition 3.

In the situation of Lemma 1 (i), we will say that J is a conservative mapping for G, in which case G is defined up to constants using line integrals.


3.2 O-minimal structures

Important references on this topic are [20, 25]. An o-minimal structure on $(\mathbb{R}, +, \cdot)$ is a collection of sets $\mathcal{O} = (\mathcal{O}_p)_{p \in \mathbb{N}}$ where each $\mathcal{O}_p$ is itself a family of subsets of $\mathbb{R}^p$, such that for each $p \in \mathbb{N}$:

(i) $\mathcal{O}_p$ is stable by complementation, finite union, finite intersection and contains $\mathbb{R}^p$;

(ii) if $A$ belongs to $\mathcal{O}_p$, then both $A \times \mathbb{R}$ and $\mathbb{R} \times A$ belong to $\mathcal{O}_{p+1}$;

(iii) if $\pi : \mathbb{R}^{p+1} \to \mathbb{R}^p$ is the canonical projection onto $\mathbb{R}^p$ then, for any $A \in \mathcal{O}_{p+1}$, the set $\pi(A)$ belongs to $\mathcal{O}_p$;

(iv) $\mathcal{O}_p$ contains the family of real algebraic subsets of $\mathbb{R}^p$, that is, every set of the form $\{x \in \mathbb{R}^p \mid g(x) = 0\}$ where $g : \mathbb{R}^p \to \mathbb{R}$ is a polynomial function;

(v) the elements of $\mathcal{O}_1$ are exactly the finite unions of intervals.

A subset of Rp which belongs to an o-minimal structure O is said to be definable in O. A function is definable in O whenever its graph is definable in O. A set valued mapping (or a function) is said to be definable in O whenever its graph is definable in O. The terminology tame refers to definability in an o-minimal structure without specifying which structure. From now on we fix an o-minimal structure O, definable sets being implicitly definable in O.

The simplest o-minimal structure is given by the class of real semialgebraic objects. Recall that a set $A \subset \mathbb{R}^p$ is called semialgebraic if it is a finite union of sets of the form

$$\bigcap_{i=1}^{k} \{x \in \mathbb{R}^p \mid g_i(x) < 0,\ h_i(x) = 0\}$$

where the functions $g_i, h_i : \mathbb{R}^p \to \mathbb{R}$ are real polynomial functions and $k \geq 1$. The key tool to show that these sets form an o-minimal structure is the Tarski-Seidenberg principle, which ensures that (iii) holds true. As detailed in [20] this result can be expressed in the following way.

Proposition 1 (Quantifier elimination) Any first order formula (quantification on variables only) involving polynomials, equalities and inequalities, with definable functions and definable sets describes a definable set.
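To make the semialgebraic description above concrete, the following sketch encodes the graph of the absolute value as a finite union of sets of the form $\{g < 0,\ h = 0\}$ and tests membership with exact rational arithmetic. The encoding `ABS_GRAPH` and the helper `in_semialgebraic_set` are hypothetical names introduced for this illustration only.

```python
from fractions import Fraction

# Each piece is a list of (g, h) pairs encoding g(x, y) < 0 and h(x, y) = 0,
# where g and h are polynomials in (x, y).
ABS_GRAPH = [
    [(lambda x, y: -x, lambda x, y: y - x)],          # x > 0 and y = x
    [(lambda x, y: x, lambda x, y: y + x)],           # x < 0 and y = -x
    [(lambda x, y: -1, lambda x, y: x * x + y * y)],  # the origin
]


def in_semialgebraic_set(pieces, x, y):
    """True if (x, y) satisfies all constraints of at least one piece."""
    return any(
        all(g(x, y) < 0 and h(x, y) == 0 for g, h in piece)
        for piece in pieces
    )


points = [
    (Fraction(-3, 2), Fraction(3, 2)),
    (Fraction(0), Fraction(0)),
    (Fraction(2), Fraction(2)),
    (Fraction(1), Fraction(-1)),
]
membership = [in_semialgebraic_set(ABS_GRAPH, x, y) for x, y in points]
```

The first three points lie on the graph of $|\cdot|$ and the last one does not, so the absolute value is a semialgebraic function in the sense recalled above.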

3.3 Variational Stratification

The notion of Variational Stratification was first exposed in [13] and found an interesting application to conservativity [13]. The tangent space at a point $x$ of a differentiable manifold $M$ is denoted by $T_x M$⁴. Given a submanifold $M$ of a finite dimensional Riemannian manifold, it is endowed with the Riemannian structure inherited from the ambient space. Given $G : \mathbb{R}^p \to \mathbb{R}^m$ and $M \subset \mathbb{R}^p$ a differentiable submanifold on which $G$ is differentiable, we denote by $\operatorname{Jac}_M G$ its Riemannian jacobian or even, when no confusion is possible, $\operatorname{Jac} G$.

A $C^k$ stratification of a (sub)manifold $M$ (of $\mathbb{R}^p$) is a partition $S = (M_1, \ldots, M_m)$ of $M$ into $C^k$ manifolds having the property that $\operatorname{cl} M_i \cap M_j \neq \emptyset$ implies that $M_j$ is entirely contained in the boundary of $M_i$ whenever $i \neq j$. Assume that a function $G : M \to \mathbb{R}^m$ is given and that $M$ is stratified into manifolds on which $G$ is differentiable. For $x$ in $M$, we denote by $M_x$ the stratum containing $x$ and we simply write $\operatorname{Jac} G(x)$ for the jacobian of $G$ with respect to $M_x$.

Stratifications can have many properties, we refer to [25] and references therein for an account on this question and in particular for more on the idea of a Whitney stratification. The definition is as follows: a $C^r$-stratification $S = (M_i)_{i \in I}$ of a manifold $M$ has the Whitney-(a) property if, for each $x \in \operatorname{cl} M_i \cap M_j$ (with $i \neq j$) and for each sequence $(x_k)_{k \in \mathbb{N}} \subset M_i$, we have:

$$\left. \begin{array}{l} \lim_{k \to \infty} x_k = x \\ \lim_{k \to \infty} T_{x_k} M_i = \mathcal{T} \end{array} \right\} \implies T_x M_j \subset \mathcal{T}$$

where the second limit is to be understood in the Grassmannian, i.e., “directional”, sense. In the sequel we shall use the term Whitney stratification to refer to a $C^1$-stratification with the Whitney-(a) property. The following can be found for example in [25, 4.8].

Theorem 4 (Whitney stratification) Let $A_1, \ldots, A_k$ be definable subsets of $\mathbb{R}^p$, then there exists a definable Whitney stratification $(M_i)_{i \in I}$ compatible with $A_1, \ldots, A_k$, i.e. such that for each $i \in I$, there is $t \in \{1, \ldots, k\}$ such that $M_i \subset A_t$.

For the rest of this section, k denotes an arbitrary positive integer.

Definition 4 (Variational stratification [11]) Let $G : \mathbb{R}^p \to \mathbb{R}^m$ be locally Lipschitz continuous, let $J : \mathbb{R}^p \rightrightarrows \mathbb{R}^{m \times p}$ be a set valued map and let $r \geq 1$. We say that the couple $(G, J)$ has a $C^k$ variational stratification if there exists a $C^k$ Whitney stratification $S = (M_i)_{i \in I}$ of $\mathbb{R}^p$, such that $G$ is $C^k$ on each stratum and for all $x \in \mathbb{R}^p$,

$$J(x) \operatorname{Proj}_{T_{M_x}(x)} = \{\operatorname{Jac} G(x)\}, \tag{5}$$

where $\operatorname{Jac} G(x)$ is the jacobian of $G$ restricted to the active stratum $M_x$ containing $x$.

Theorem 5 (Characterization of conservativity) Let $J : \mathbb{R}^p \rightrightarrows \mathbb{R}^{m \times p}$ be a definable, nonempty, locally bounded, graph closed set valued mapping and $G : \mathbb{R}^p \to \mathbb{R}^m$ be a definable locally Lipschitz function. Then the following are equivalent

• $J$ is conservative for $G$.

• $(G, J)$ admit a $C^k$ variational stratification.

For the reverse implication, $J$ and $G$ need not be definable.

Proof : This result is essentially known and we point out the arguments for completeness. First, $J$ is conservative for $G$ if and only if the projection of each row of $J$ is conservative for the corresponding coordinate of $G$, this is Lemma 3 and 4 in [13] in combination with Lemma 1 above. Hence we may reason coordinatewise.

For the direct implication, it results from [13, Theorem 4] that each coordinate $G_i$ of $G$ and the corresponding row $J_i$ of $J$ admit a variational projection formula ([13, Definition 5]) for each $i = 1, \ldots, m$, this corresponds to the variational formula introduced in [11], which is limited to the univariate case. The variational projection formula is stable by considering submanifolds and hence is stable when refining a given stratification. Hence thanks to Theorem 4, we may find a common Whitney stratification such that the projection formula holds for each coordinate $G_i$ of $G$ and the corresponding row $J_i$ of $J$. This results in the formula given in Definition 4.

For the reverse implication, similarly as above, the variational stratification in Definition 4 implies the projection formula of [13, Definition 5] for each coordinate of $G$ with the corresponding row $J_i$ of $J$. By [13, Theorem 3] (see also [22] which states the result for the Clarke subgradient), each row of $J$ is conservative for the corresponding coordinate of $G$, which implies that $J$ is conservative for $G$ by [13, Lemma 4]. This does not require definability.

3.4 Definably conservative mappings

We start with a preliminary lemma which will then be applied recursively toward a proof of a variational stratification property from which Theorem 1 will follow.

Lemma 2 Let $J : \mathbb{R}^p \rightrightarrows \mathbb{R}^{m \times p}$ be a definably conservative mapping. Let $G : x \mapsto \int_0^1 V(\gamma(t)) \dot{\gamma}(t) \, dt$ for any $C^1$ definable $\gamma : [0, 1] \to \mathbb{R}^p$ with $\gamma(0) = 0$ and $\gamma(1) = x$ and any definable selection $V$ as in Definition 4. Then there exists a finite number of definable open sets $U_1, \ldots, U_N$ in $\mathbb{R}^p$ such that $\bigcup_{i=1}^{N} \operatorname{cl}(U_i) = \mathbb{R}^p$, $G$ is continuously differentiable on each set and $J = \{J_G\}$, the jacobian of $G$, on each set.

Proof : Note that it is not known a priori if $G$ is definable and therefore the results of [13] do not directly apply. Denote by $R$ the set where $J$ is single valued, this set is definable by Proposition 1 and we are going to show that its complement has empty interior. Toward a contradiction, suppose that the complement of $R$ has nonempty interior. Then, definable choice [25, 4.5] ensures that there exist two definable selections $V_1$ and $V_2$ such that $V_1 \neq V_2$ on a small open ball $B$. Let $v$ be a unit norm definable selection in $\operatorname{Im}(V_1 - V_2)^T$, which exists thanks to [25, 4.5]. Since $v$, $V_1$, $V_2$ are definable,


on $B$ (reducing and translating $B$ if necessary). Call $r$ the radius of $B$ and assume without loss of generality that it is centered at 0. Consider the solution to

$$\dot{\gamma}(t) = v(\gamma(t)), \qquad \gamma(0) = 0.$$

$\gamma$ is $C^1$ and stays in a neighborhood of 0 for small values of $t$, let us say that $\|\gamma(t)\| < r/2$ for all $t \in [0, \alpha]$ for a certain $\alpha > 0$. Let $0 < \epsilon < r/2$ be arbitrary and fix $\tilde{\gamma}$, a $C^1$ definable path with $\tilde{\gamma}(0) = 0$ such that

$$\max_{t \in [0, \alpha]} \max\{\|\gamma(t) - \tilde{\gamma}(t)\|, \|\gamma'(t) - \tilde{\gamma}'(t)\|\} \leq \epsilon,$$

take for example a polynomial approximation of $\gamma'$ and its integral, using the Weierstrass approximation theorem. Since $\tilde{\gamma}$ is a $C^1$ definable arc which remains in $B$ by construction, for almost all $t$ around 0, we have using Lemma 1,

$$\frac{d}{dt} G(\tilde{\gamma}(t)) = V_1(\tilde{\gamma}(t))(v(\gamma(t)) + u(t)) = V_2(\tilde{\gamma}(t))(v(\gamma(t)) + u(t))$$

where $u(t) := \tilde{\gamma}'(t) - v(\gamma(t))$ for all $t$. By construction, we have $\|u(t)\| \leq \epsilon$ for all $t$. We can let $\epsilon \to 0$ along a sequence of such approximations $\tilde{\gamma}$, and using continuity of $V_1$ and $V_2$, we obtain for almost all $t$,

$$V_1(\gamma(t)) v(\gamma(t)) = V_2(\gamma(t)) v(\gamma(t))$$

and therefore $(V_1(\gamma(t)) - V_2(\gamma(t))) v(\gamma(t)) = 0$ for all $t$ by continuity of $v$, $V_1$, $V_2$ and $\gamma$. Since $v(\gamma(t)) \in \operatorname{Im}(V_1(\gamma(t)) - V_2(\gamma(t)))^T$, this shows that $v(\gamma(t)) = 0$ for all $t$ around 0. This contradicts the fact that $v$ has unit norm.

This shows that the complement of $R$ has empty interior. By stratification, using Theorem 4, there exist $U_1, \ldots, U_N$, strata of maximal dimension, such that the complement of $R$ does not intersect any $U_i$, $i = 1, \ldots, N$, and $\bigcup_{i=1}^{N} \operatorname{cl}(U_i) = \mathbb{R}^p$.

By graph closedness and local boundedness, $J$ can be identified with a continuous function on each $U_i$. Let $x \in U_i$ for some $i$, there is a small ball around $x$ such that $J$ is continuous on the ball. By the definition of $G$, for any $v \in \mathbb{R}^p$ and $t > 0$ such that $x + tv$ remains in this ball,

$$\frac{G(x + tv) - G(x)}{t} = \int_0^1 J(x + stv)\, v \, ds = \frac{1}{t} \int_0^t J(x + sv)\, v \, ds.$$

Letting $t \to 0$, we have

$$\lim_{t \to 0,\, t > 0} \frac{G(x + tv) - G(x)}{t} = J(x) v$$

where the limit is by continuity of $J$ at $x$. This formula allows us to identify the partial derivatives at $x$ of each of the $m$ coordinate components of $G$ with entries of $J(x)$. Since these are continuous at $x$, $G$ is differentiable at $x$, and since $x \in U_i$ was arbitrary and $J$


Theorem 6 Let $J : \mathbb{R}^p \rightrightarrows \mathbb{R}^{m \times p}$ be a definably conservative mapping. Let $G : \mathbb{R}^p \to \mathbb{R}^m$ be defined as in Lemma 2. Then $(G, J)$ admits a $C^k$ variational stratification: $(M_i)_{i \in I}$, a definable Whitney stratification of $\mathbb{R}^p$ such that $G$ is $C^k$ on each stratum with

$$J(x) Q(x) = \{J_G(x)\},$$

where $Q(x)$ is the matrix representing orthogonal projection to the tangent space of the active stratum $M(x)$ at $x$, seen as a subspace of $\mathbb{R}^p$.

Proof: We shall prove that $(G, J)$ has a $C^1$ variational stratification. The $C^k$ variational stratification follows by definability and the existence of a $C^k$ stratification of $G$ using [25, 4.8], which allows one to refine a potential $C^1$ stratification to obtain differentiability up to order $k$.

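Let $M$ be a connected definable $C^1$ submanifold embedded in $\mathbb{R}^p$ which is also a connected $C^1$ cell (see [25, 4.2]). Since $M$ is a connected $C^1$ cell, there is a definable $C^1$ diffeomorphism $\theta\colon M \to \mathbb{R}^{\dim M}$, see for example [20, Section 6.2]. Set $\phi\colon \mathbb{R}^{\dim M} \to M$ such that $\phi = \theta^{-1}$. Consider now

$$\tilde G\colon \mathbb{R}^{\dim M} \to \mathbb{R}^m,\quad \tilde x \mapsto G(\phi(\tilde x)), \qquad \tilde J\colon \mathbb{R}^{\dim M} \rightrightarrows \mathbb{R}^{m\times \dim M},\quad \tilde x \rightrightarrows J(\phi(\tilde x))\,J_\phi(\tilde x).$$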

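For any definable $C^1$ path $\gamma\colon [0, 1] \to \mathbb{R}^{\dim M}$, $\phi\circ\gamma$ is definable and so $(\tilde G, \tilde J)$ satisfy the hypotheses of Lemma 2, since by Lemma 1, for almost all $t$,

$$\frac{d}{dt}\tilde G(\gamma(t)) = \frac{d}{dt}G(\phi\circ\gamma(t)) = J(\phi\circ\gamma(t))\,(\phi\circ\gamma)'(t) = J(\phi\circ\gamma(t))\,J_\phi(\gamma(t))\,\dot\gamma(t) = \tilde J(\gamma(t))\,\dot\gamma(t),$$

and $\gamma$ was an arbitrary $C^1$ definable path.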

Hence there exist $\tilde U_1, \ldots, \tilde U_N$ open in $\mathbb{R}^{\dim M}$ such that the union of their closures is equal to the whole of $\mathbb{R}^{\dim M}$, and on each set $\tilde U_i$, $\tilde J$ is single valued and $\tilde G$ is $C^1$ with $\tilde J = \{J_{\tilde G}\}$. For $i = 1, \ldots, N$, set $U_i = \phi(\tilde U_i)$; the union of their relative closures is $M$: $\cup_{i=1}^N \mathrm{cl}_M(U_i) = M$. Furthermore, $G = \tilde G\circ\theta$ is differentiable on each $U_i$, relative to $M$, and

$$\{J_G(x)\} = \{J_{\tilde G}(\theta(x))\,J_\theta(x)\} = J(x)\,J_\phi(\theta(x))\,J_\theta(x).$$

We remark that $J_\phi(\theta(x))J_\theta(x)$ is the projection on the tangent space of $M$ at $x$, so that each row of $J$ has a single valued projection on the tangent space of $M$ on each $U_i$.

Since $M$ was arbitrary, we may start with $M = \mathbb{R}^p$ and proceed by induction on the dimension by applying Lemma 2 with the above reasoning. We have a $C^1$ variational projection on a dense set of strata (cells) of dimension $p$. We may obtain a Whitney stratification, compatible with this set, so that the projection formula fails at most on a finite union of strata (cells) of dimension at most $p - 1$ [25, 4.8]. Each such stratum $M$ is a $C^1$ embedded submanifold as well as a $C^1$ cell, and we may repeat the process recursively until the dimension of the set where the projection formula does not hold is zero, i.e. a finite set of points, to obtain the desired Whitney stratification.
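As a simple illustration of the projection formula in Theorem 6 (an example chosen here for concreteness, not taken from the text), one may consider $G(x_1, x_2) = |x_1|$ on $\mathbb{R}^2$ with $J$ its Clarke subdifferential, written as a set of row vectors. The strata $M_1$, $M_2$, $M_3$ below are assumed names for the two open half planes and the vertical axis.

```latex
G(x_1, x_2) = |x_1|, \qquad
J(x) =
\begin{cases}
\{(\operatorname{sign}(x_1),\, 0)\} & \text{if } x_1 \neq 0,\\
[-1, 1] \times \{0\} & \text{if } x_1 = 0.
\end{cases}

% Strata: M_1 = \{x_1 > 0\}, \quad M_2 = \{x_1 < 0\}, \quad M_3 = \{x_1 = 0\}.

\text{On } M_1 \cup M_2:\quad Q(x) = I_2, \qquad J(x)Q(x) = \{\nabla G(x)^T\}.

\text{On } M_3:\quad Q(x) = \begin{pmatrix} 0 & 0 \\ 0 & 1 \end{pmatrix}, \qquad
J(x)Q(x) = \{(s, 0)\,Q(x) : s \in [-1, 1]\} = \{(0, 0)\}.
```

On $M_3$ the function $G$ vanishes identically, so its derivative along the stratum is zero, which agrees with the single valued projection $J(x)Q(x) = \{(0,0)\}$ even though $J(x)$ itself is a full segment.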


The following corollary combines Theorems 5 and 6. Theorem 1 is a special case for m = 1.

Corollary 1 Let $J\colon \mathbb{R}^p \rightrightarrows \mathbb{R}^{m\times p}$ be a definably conservative mapping, then $J$ is a conservative mapping.

3.5 Application to PO formula

This section describes how Corollary 1 can be used to prove that the PO formula provides a conservative field in the definable world. We start with a slightly more general result, which extends the finite selection process described in [14] from the discrete to the continuous setting.

Theorem 7 Let $F\colon \mathbb{R}^p\times\mathbb{R}^r \to \mathbb{R}$ be locally Lipschitz, $D$ a definable conservative mapping for $F$ and $P\colon \mathbb{R}^p \rightrightarrows \mathbb{R}^r$ a definable set valued field, with closed graph and nonempty, locally bounded values, such that for all $x \in \mathbb{R}^p$ and $y \in P(x)$, there exists $u \in \mathbb{R}^p$ such that $(u, 0) \in D(x, y)$, and $y \mapsto F(x, y)$ is constant on $P(x)$. We set

$$f\colon x \mapsto F(x, P(x)).$$

Then $f$ is continuous. Set

$$D_f\colon x \rightrightarrows \{u : \exists y \in P(x),\ (u, 0) \in D(x, y)\}.$$

Then $D_f$ is conservative for $f$.

Proof: One can check that $D_f$ is definable thanks to Proposition 1. Furthermore, it has closed graph and is locally bounded with nonempty values. Hence, by Theorem 1 and Lemma 1, we only have to prove a chain rule along definable $C^1$ curves.

Let $t \mapsto x(t) \in \mathbb{R}^p$ be a $C^1$ definable path. We will obtain definable selections of interest thanks to [25, 4.5], and draw conclusions thanks to Lemma 8, which asserts that definable set valued fields have countable dense definable selections.

Let $t \mapsto y(t) \in \mathbb{R}^r$ be a definable selection in $t \rightrightarrows P(x(t))$ and let $t \mapsto T(t) = (u(t), v(t))$ be a definable selection in $t \rightrightarrows D(x(t), y(t))$. Definable curves are piecewise continuously differentiable. Hence the functions $t \mapsto f(x(t)) = F(x(t), y(t))$, $t \mapsto x(t)$ and $t \mapsto y(t)$ are differentiable everywhere except at finitely many points, call them $t_1, \ldots, t_M$. Bounded definable curves have left and right limits everywhere, so for $i = 2, \ldots, M$, $t \mapsto y(t)$ is continuous on $(t_{i-1}, t_i)$ and can be extended to an absolutely continuous path on $[t_{i-1}, t_i]$. Hence we can use the fact that $D$ is conservative for $F$, which yields, using graph closedness of $P$ and continuity of $F$, for $i = 2, \ldots, M$,

$$f(x(t_i)) - f(x(t_{i-1})) = F\Big(x(t_i), \lim_{t\uparrow t_i} y(t)\Big) - F\Big(x(t_{i-1}), \lim_{t\downarrow t_{i-1}} y(t)\Big) = \int_{t_{i-1}}^{t_i} \frac{d}{dt}F(x(t), y(t))\,dt = \int_{t_{i-1}}^{t_i} \big(\langle \dot x(t), u(t)\rangle + \langle \dot y(t), v(t)\rangle\big)\,dt.$$


Finally, by removing the set of discontinuity points of $y$, we obtain for almost all $t$,

$$\frac{d}{dt}f(x(t)) = \frac{d}{dt}F(x(t), y(t)) = \langle \dot x(t), u(t)\rangle + \langle \dot y(t), v(t)\rangle.$$

Since $T$ was an arbitrary definable selection of $t \rightrightarrows D(x(t), y(t))$, which admits a countable dense sequence of definable selectors thanks to Lemma 8, we have for almost all $t$,

$$\frac{d}{dt}f(x(t)) = \langle \dot x(t), u\rangle + \langle \dot y(t), v\rangle, \qquad \forall (u, v) \in D(x(t), y(t)).$$

By the hypotheses, for each $u \in D_f(x)$, one can choose $v = 0$ and hence for almost all $t$,

$$\frac{d}{dt}f(x(t)) = \langle \dot x(t), u\rangle \qquad \forall u \text{ such that } (u, 0) \in D(x(t), y(t)).$$

Note that $y$ is an arbitrary definable selector in $t \rightrightarrows P(x(t))$. By Lemma 9, there is a countable family of such selectors $(y_i)_{i\in\mathbb{N}}$ such that for all $t$, $D_f(x(t)) = \mathrm{cl}\{u : \exists i \in \mathbb{N},\ (u, 0) \in D(x(t), y_i(t))\}$. This implies that for almost all $t$,

$$\frac{d}{dt}f(x(t)) = \langle \dot x(t), u\rangle \qquad \forall u \in D_f(x(t)).$$

Hence, we have a chain rule along definable arcs, which is the desired result. Indeed, repeating the proof of the direct implication in Lemma 1 for definable arcs, we have that $D_f$ is definably conservative, hence conservative thanks to Theorem 1, and admits $f$ as a potential.

This result can be applied to partial maximization as the following example shows. It could also be applied to partial minimization or differentiation of more general critical values such as local minima or local maxima (under suitable assumptions). The following example is a repetition of Theorem 2 which is based on Theorem 7.

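Example 1 (Partial maximization) Let $F\colon \mathbb{R}^p\times\mathbb{R}^r \to \mathbb{R}$ be definable locally Lipschitz and $D\colon \mathbb{R}^{p+r} \rightrightarrows \mathbb{R}^{p+r}$ be definable and conservative for $F$. Set

$$f\colon \mathbb{R}^p \to \mathbb{R}, \qquad x \mapsto \max_{y\in\mathbb{R}^r} F(x, y),$$

where the argmax is assumed to be nonempty and locally bounded, call it $P(x) \subset \mathbb{R}^r$. $P$ has a closed graph. Set

$$D_f\colon \mathbb{R}^p \rightrightarrows \mathbb{R}^p, \qquad x \rightrightarrows \mathrm{conv}\{u : \exists y \in P(x),\ (u, 0) \in D(x, y)\},$$

then $D_f$ is conservative for $f$. Indeed, $D_f$ has nonempty values by [32, Example 10.12], so $D$ and $P$ comply with Theorem 7.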
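The formula of Example 1 can be explored numerically on a toy problem. The sketch below is only illustrative: the objective `F`, the grid search in `argmax_y` and all function names are assumptions introduced here, not part of the text. For this smooth `F`, the gradient plays the role of the conservative field $D$, and the condition $(u, 0) \in D(x, y)$ reduces to $u = \partial_x F(x, y)$ at a maximizer $y$.

```python
def F(x, y):
    """Hypothetical semi-algebraic objective, chosen for illustration only."""
    return x * y - (y * y - 1.0) ** 2


def dF_dx(x, y):
    """Partial derivative of F with respect to x."""
    return y


def argmax_y(x, lo=-3.0, hi=3.0, n=60001):
    """Approximate maximizer of y -> F(x, y) on a uniform grid of [lo, hi].

    Assumes |x| <= 1, in which case the maximum over the real line is
    attained inside [lo, hi]. Returns (maximizer, maximum value).
    """
    step = (hi - lo) / (n - 1)
    best_y, best_val = lo, F(x, lo)
    for k in range(1, n):
        y = lo + k * step
        val = F(x, y)
        if val > best_val:
            best_y, best_val = y, val
    return best_y, best_val


def value(x):
    """Value function f(x) = max_y F(x, y), approximated on the grid."""
    return argmax_y(x)[1]


def po_element(x):
    """One element u = dF/dx(x, y*) of the set given by the formula."""
    y_star, _ = argmax_y(x)
    return dF_dx(x, y_star)


if __name__ == "__main__":
    h = 1e-3
    x = 0.5
    finite_diff = (value(x + h) - value(x - h)) / (2 * h)
    print(po_element(x), finite_diff)
```

At $x = 0$ the argmax of this toy objective is the pair $\{-1, 1\}$, so the formula returns $\mathrm{conv}\{-1, 1\} = [-1, 1]$, consistent with the one-sided slopes $-1$ and $1$ of the value function at that point.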

4 Failure of parametric optimality formula

This section is dedicated to the construction of the function g in Theorem 3. We start with the construction of a fractal set C and then describe the counterexample which will be based on the distance functions to C.


Figure 1 (six panels, axes x and y): The fractal construction in Section 4. We start with the closed unit square in black. It is split into four copies of size one fourth of the original square. This process is repeated recursively on each square ad infinitum. The additional red lines represent the projections of these sets on rotated axes. Considering $C_i$, $i \in \mathbb{N}$, the set obtained after $i$ steps ($C_0$ is the original square), we have $C_{i+1} \subset C_i$ for all $i$, and we let the limiting set be $C = \cap_{i\in\mathbb{N}} C_i$, which is closed. The projections of $C$ on each axis are full segments. Furthermore, in the limit, both projections on rotated axes are Cantor sets of zero measure.

4.1 A fractal set

Let $C$ be the fractal set whose construction is described in Figure 1. The construction of $C$ is similar to the one described in [38] to provide a counterexample to the Morse-Sard theorem. This construction was also used in [30] to provide a subgradient sequence on a path differentiable function which fails to have dissipative and minimizing properties. The set is defined as $C = \cap_{i\in\mathbb{N}} C_i$, where for each $i \in \mathbb{N}$, $C_i$ is the union of $4^i$ squares of size $1/4^i$. Furthermore, these sets form a nested decreasing sequence for the inclusion partial order. The set $C$ has the following properties.

• $C$ is closed as an intersection of closed sets; since it is also bounded, it is compact.

• The projections of $C$ on each axis are the full segments $[0, 1] \times \{0\}$ and $\{0\} \times [0, 1]$; we denote them by $\mathrm{proj}_x(C)$ and $\mathrm{proj}_y(C)$. Indeed, since we have a nested sequence of compact sets, $\mathrm{proj}_x(C) = \cap_{i\in\mathbb{N}}\mathrm{proj}_x(C_i) = \cap_{i\in\mathbb{N}}[0, 1] = [0, 1]$.

• In particular, $C$ is nonempty.

• The projection on each axis rotated clockwise by an angle of $\arctan(2)$ is a Cantor set of zero measure (see Figure 1). Indeed, at each step a constant proportion of each segment is removed from the projection; this is the simplest construction of Cantor sets.

• For each $i \ge 1$, there is a finite number of $x \in [0, 1]$ for which the vertical line through $x$ intersects $C_i$ at two distinct squares; call this set $X_i$. We have $X_i \subset X_{i+1}$, and for each $x \in [0, 1]$, $x \notin X_i$, the intersection of the vertical line at $x$ and $C_i$ lies in a single square. Set $X = \cup_{i\in\mathbb{N}} X_i$; $X$ is denumerable and for each $x \in X$, the vertical line through $x$ intersects $C$ at two distinct points, while for all other $x \in [0, 1]$ this intersection is a singleton.

• Let $f\colon \mathbb{R} \to \mathbb{R}$ be a function such that for some $i \in \mathbb{N}$, $(x, f(x)) \in C_i$ for all $x \in [0, 1]$; then the total variation of $f$ is at least $i$. Hence if $f$ satisfies this property for all $i$, that is $(x, f(x)) \in C$ for all $x \in [0, 1]$, then $f$ has infinite total variation.
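The properties listed above can be checked on the finite approximations $C_i$ with exact rational arithmetic. The sketch below is a minimal illustration under a stated assumption: the placement of the four sub-squares (the constant `OFFSETS`) is not specified in the text, and is chosen here as one arrangement consistent with the stated properties (full axis projections, pairwise collapsing projections in the rotated directions). All function names are hypothetical.

```python
from fractions import Fraction

# Assumed placement of the four sub-squares inside their parent square
# (lower-left corners, in units of the parent side). This is a hypothetical
# choice: the axis projections are full, and the projections along the
# linear forms 2x + y and x - 2y collapse in pairs.
OFFSETS = [
    (Fraction(0), Fraction(1, 2)),
    (Fraction(1, 4), Fraction(0)),
    (Fraction(1, 2), Fraction(3, 4)),
    (Fraction(3, 4), Fraction(1, 4)),
]


def squares(i):
    """Return (corners, side): lower-left corners of the 4**i squares of C_i."""
    corners = [(Fraction(0), Fraction(0))]
    side = Fraction(1)
    for _ in range(i):
        corners = [
            (x + side * ox, y + side * oy)
            for (x, y) in corners
            for (ox, oy) in OFFSETS
        ]
        side /= 4
    return corners, side


def x_projection_is_full(i):
    """Check that the x-projections of the squares of C_i tile [0, 1]."""
    corners, side = squares(i)
    starts = sorted(x for (x, _) in corners)
    return starts == [k * side for k in range(4 ** i)]


def rotated_projection_length(i):
    """Length of the image of C_i under the linear form (x, y) -> 2x + y.

    This equals the length of the projection on a rotated axis up to the
    constant factor sqrt(5). A square of side s maps to an interval of
    length 3s starting at 2x + y, where (x, y) is its lower-left corner.
    """
    corners, side = squares(i)
    starts = sorted({2 * x + y for (x, y) in corners})
    total, right = Fraction(0), None
    for s in starts:
        e = s + 3 * side
        if right is None or s > right:
            total += e - s
            right = e
        elif e > right:
            total += e - right
            right = e
    return total
```

Under this assumed placement, `rotated_projection_length(i)` is halved at each step, which illustrates why the rotated projections have measure zero in the limit, while `x_projection_is_full(i)` stays true for every `i`.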

4.2 Construction and proof of the counterexample

Consider the following Lipschitz function

$$f\colon \mathbb{R}^2 \to \mathbb{R}, \qquad z \mapsto -\mathrm{dist}(z, C).$$

Set $O = \mathbb{R}^2 \setminus C$; $O$ is open and we have for all $z \in O$,

$$f(z) = -\min_{c\in C}\|z - c\| = \max_{c\in C}\, -\|z - c\|.$$

Each function $f_c\colon z \mapsto -\|z - c\|$ is $C^1$ on $O$ and both $f_c$ and $\nabla f_c$ are jointly continuous with respect to $z$ and $c$ on $O \times C$. This shows that $f$ is lower $C^1$ on $O$ [32, Definition 10.29]. Hence $f$ is subdifferentially regular on $O$ [32, Theorem 10.31]. Combining with Lemma 5 we have

Lemma 3 f is path differentiable.

Proof: We will prove that $\partial^c f$ satisfies the chain rule along absolutely continuous curves [22, 13]; since $f$ is Lipschitz, this is sufficient to conclude, see also Lemma 1. Let $\gamma\colon [0, 1] \to \mathbb{R}^2$ be absolutely continuous and $R \subset [0, 1]$ the full measure set where $\gamma$ and $f\circ\gamma$ are differentiable. We will show that $\gamma$ satisfies the chain rule for almost all $t \in R$, which is sufficient to conclude. We set

$$E = \{t \in R : \gamma(t) \notin C\}.$$

We also consider $\tilde E \subset E$ with

$$\tilde E = \Big\{t \in E : \max_{v\in\partial^c f(\gamma(t))} \Big|\langle v, \dot\gamma(t)\rangle - \frac{d}{dt}f(\gamma(t))\Big| > 0\Big\}.$$

Fix $t \in \tilde E$ arbitrary. Since $t \in E$, choosing $a > 0$ small enough, we have by continuity of $\gamma$ that $\gamma([t - a, t + a]) \cap C = \emptyset$, that is $\gamma([t - a, t + a]) \subset O$. Since $f$ is lower $C^1$ on $O$, by [32, Theorem 10.31], $f$ is subdifferentially regular in a neighborhood of $\gamma([t - a, t + a]) \subset O$. We may apply [22, Lemma 5.4], which shows that $f$ satisfies the chain rule along the curve $\gamma$ restricted to the closed segment $I = [t - a, t + a]$; in other words $[t - a, t + a] \cap \tilde E$ has measure zero. The segment can be taken of arbitrarily small length; therefore, such intervals form a Vitali covering of $\tilde E$ (see for example [33, Section 6.2]). By the Vitali covering theorem [33, Section 6.2], for any $\epsilon > 0$, there is a finite collection of such segments $I_1, \ldots, I_K$ such that $\tilde E \setminus \cup_{k=1}^K I_k$ has (outer) measure at most $\epsilon$, which shows that $\tilde E$ has arbitrarily small (outer) measure and therefore has measure zero. Now set

$$E_1 = \{t \in R : \gamma(t) \in C,\ \gamma'(t) \neq 0\}, \qquad E_2 = \{t \in R : \gamma(t) \in C,\ \gamma'(t) = 0\}.$$

We have $R = E \cup E_1 \cup E_2$. We have shown that the chain rule holds for almost all $t$ in $E$, Lemma 6 ensures that $E_1$ has zero measure, and the chain rule holds trivially for all $t \in E_2$ because $f$ is Lipschitz. This shows that the chain rule holds for almost all $t \in R$ and the conclusion follows.

Figure 2 (six panels, axes x and y): Same construction as in Figure 1, except that we keep the whole product set, the red lines representing the projections on rotated axes and the black squares representing their products. In the limit, we obtain the same set $C$ as in Figure 1 since the distance between the sets represented here and those of Figure 1 goes to zero.

Now, we should characterize the subdifferential of $f$, which directly relates to the normal cone to $C$. We will use the notion of normal cone as described in [32, Definition 6.3]. As depicted in Figure 2, it turns out that $C$ is actually a product of Cantor sets, which are closed and have empty interior. Using Lemma 7 we obtain by [32, Proposition 6.41] that the normal cone to $C$ is $\mathbb{R}^2$ everywhere on $C$. We deduce the following.

Lemma 4 For all $z \in C$, $\partial^c f(z) = B$, where $B$ is the unit ball.

Proof: As shown in Figure 1, $C$ is contained in a product of Cantor sets, $\tilde C$. Actually $C$ is equal to a product of Cantor sets. Indeed, as detailed in Figure 2, the distance between the sets constructed in Figure 1 and those of Figure 2 decreases to 0, so that the limiting intersections are the same. By [32, Proposition 6.41] and using Lemma 7, $N_{\tilde C} = \mathbb{R}^2$ everywhere on $\tilde C$, hence on $C$. The result follows from [32, Example 8.53], using the fact that the Clarke subgradient is the convex closure of the limiting subgradient and that it commutes with multiplication by scalars.

For the function $g$ in Theorem 3, one could take for example $g\colon (x, y) \mapsto 2f(x, y) + x$, which satisfies all the required properties:

• Since the projection of $C$ on $x$ is $[0, 1]$ (Section 4.1), we have $\max_{y\in[0,1]} g(x, y) = x$.

• Since for all but countably many $x$, the intersection of $C$ with the vertical line at $x$ is a singleton (Section 4.1), for such $x$ the argmax is unique; for the others it is a pair.

• For every $x \in [0, 1]$ and $y \in \arg\max_{y\in[0,1]} g(x, y)$, we have $(x, y) \in C$, so $\partial^c(2f)(x, y) = 2B$ by Lemma 4. By the sum rule, $\partial^c g(x, y) = 2B + (1, 0)$, so that

$$\Big\{v : (v, 0) \in \partial^c g(x, y),\ y \in \arg\max_{y\in[0,1]} g(x, y)\Big\} = [-1, 3],$$

which contains 0.
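To make the failure explicit, the computation can be spelled out as follows. This is a sketch that uses only Lemma 4 and the sum rule; the symbol $\varphi$ for the value function of $g$ is introduced here for illustration.

```latex
% Value function of g and its derivative
\varphi(x) := \max_{y \in [0,1]} g(x, y) = x,
\qquad \varphi'(x) = 1 \quad \text{for all } x \in (0, 1).

% Set given by the parametric optimality formula at any maximizer y
(v, 0) \in \partial^c g(x, y) = 2B + (1, 0)
\iff (v - 1, 0) \in 2B
\iff |v - 1| \le 2
\iff v \in [-1, 3].
```

The resulting set contains the true derivative $1$ but also $0$, for every $x \in [0, 1]$, so every point appears critical although $\varphi$ has no critical point in $(0, 1)$. Since a conservative field coincides with the gradient of its potential almost everywhere [13], this set valued map cannot be conservative for $\varphi$.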

4.3 Need for a better behaved subclass

In the definition of the function $g$, the fractal nature of the construction makes the argmax mapping, $P\colon \mathbb{R}^p \rightrightarrows \mathbb{R}^r$, although almost everywhere a singleton, highly irregular. In this example, it is not even of bounded variation, in the sense that it is not possible to obtain bounded variation selections in $P$, and a fortiori no absolutely continuous ones. This explains why conservativity is destroyed: the connection between $g$, $F$ and its value function $f$, through $P$ in the PO formula, takes place outside of the absolutely continuous world and hence outside of the conservative world, which is built on absolutely continuous paths. Therefore additional restrictions on the function $F$ have to be enforced if one wants calculus rules for the PO formula which preserve conservativity. An intuitive direction is to ensure that the argmax mapping admits selections which are absolutely continuous, or close to absolutely continuous, in order to use the definition of conservativity in Definition 1. There are potentially many such classes; in Section 3.5 we focus on one of them, definable functions [25, 20], for which we have access to definable selections which are piecewise differentiable, which is close enough to absolute continuity for our purpose.

5 Lemmas and proofs

Proof of Lemma 1: The fact that (i) implies (ii) is direct by integration. Indeed, $G\circ\gamma$ is absolutely continuous. For any absolutely continuous $\gamma\colon [0, 1] \to \mathbb{R}^p$ with $\gamma(0) = \gamma(1)$, and any measurable selection $V$ as in the statement of the Lemma,

$$\int_0^1 V(\gamma(t))\,\dot\gamma(t)\,dt = \int_0^1 \frac{d}{dt}G(\gamma(t))\,dt = G(\gamma(1)) - G(\gamma(0)) = 0,$$

where the first equality uses (i) and the fact that $V(\gamma(t)) \in J(\gamma(t))$ for all $t$, and the second is absolute continuity of $G\circ\gamma$.
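The direct implication can be illustrated numerically on a closed loop. The sketch below is only an example under stated assumptions: the potential $G(x) = |x_1| + x_2^2$, the selection in its Clarke subdifferential, the off-center circle and the midpoint quadrature are all choices made here for illustration, and the function names are hypothetical.

```python
import math


def selection(x1, x2):
    """A selection in the Clarke subdifferential of G(x) = |x1| + x2**2.

    The value 0.0 taken by the first coordinate when x1 == 0 is an
    arbitrary choice inside [-1, 1].
    """
    s = (x1 > 0) - (x1 < 0)
    return (float(s), 2.0 * x2)


def circulation(n=20000, center=(0.3, 0.2)):
    """Midpoint-rule approximation of the integral over [0, 1] of
    V(gamma(t)) . gamma'(t), where gamma is the unit circle around
    `center`, traversed once (a closed absolutely continuous loop)."""
    total = 0.0
    for k in range(n):
        a = 2.0 * math.pi * (k + 0.5) / n
        g = (center[0] + math.cos(a), center[1] + math.sin(a))
        dg = (-2.0 * math.pi * math.sin(a), 2.0 * math.pi * math.cos(a))
        v = selection(g[0], g[1])
        total += (v[0] * dg[0] + v[1] * dg[1]) / n
    return total


if __name__ == "__main__":
    print(circulation())
```

The printed value is close to zero up to quadrature error, even though the loop crosses the line $x_1 = 0$ where the selection is discontinuous.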

For the reverse implication, fix $x \in \mathbb{R}^p$, an absolutely continuous path $\gamma\colon [0, 1] \to \mathbb{R}^p$ with $\gamma(0) = 0$, $\gamma(1) = x$, and a measurable selection $V\colon \mathbb{R}^p \to \mathbb{R}^p$ such that for all $z \in \mathbb{R}^p$, $V(z) \in J(z)$. We define

$$G(x) = \int_0^1 V(\gamma(t))\,\dot\gamma(t)\,dt.$$

By (ii), the value of $G$ does not depend on the choice of $\gamma$ and on the measurable selection $V$. Furthermore, we have by the Lebesgue differentiation theorem, for almost all $t \in [0, 1]$,

$$\frac{d}{dt}G(\gamma(t)) = V(\gamma(t))\,\dot\gamma(t). \qquad (6)$$

Now since $J$ is nonempty compact valued, from [3, Corollary 18.15], there exists a sequence $(V_i)_{i\in\mathbb{N}}$ of measurable selectors such that for all $t \in [0, 1]$,

$$J(\gamma(t)) = \mathrm{cl}\{V_i(\gamma(t))\}_{i\in\mathbb{N}}. \qquad (7)$$

The result follows by combining (6) and (7).

The following lemma is essentially a repetition of [30, Lemma 16], which we reproduce here for completeness.

Lemma 5 Let $C \subset \mathbb{R}^2$ be a closed set whose projections on the $x$ and $y$ axes both have measure zero. Then for any Lipschitz curve $\gamma\colon \mathbb{R} \to \mathbb{R}^2$, the set

$$E = \{t \in \mathbb{R} : \gamma(t) \in C\} \setminus \{t \in \mathbb{R} : \gamma'(t) = 0\}$$

has measure zero.

Proof: Write $\gamma_1, \gamma_2$ for the coordinates of $\gamma$ and $P_1, P_2$ for the projections of $C$ on the $x$ and $y$ axes respectively; by hypothesis they have measure zero. Let $B \subset \mathbb{R}$ be the set where $\gamma'$ is either not defined, or well defined and different from zero. We have $E = \gamma^{-1}(C) \cap B$. Let $A_1 \subset B$ and $A_2 \subset B$ be the sets where $\gamma_1' \neq 0$ and $\gamma_2' \neq 0$ respectively, and $A_3 \subset B$ the zero measure set where $\gamma'$ is not defined. We have $A_1 \cup A_2 \cup A_3 = B$.

Consider an enumeration $(I_j)_{j\in\mathbb{N}}$ of the intervals of the form $[p_j, q_j]$, with $p_j, q_j \in \mathbb{Q}$ and $p_j < q_j$ for $j \in \mathbb{N}$. Fix $j \in \mathbb{N}$ and $i \in \{1, 2\}$ and consider the following function on $I_j$:

$$f_{ji}\colon t \mapsto \min_{s\in I_j,\ \gamma_i(s) = \gamma_i(t)} s.$$

The set valued function $t \rightrightarrows \{s \in I_j : \gamma_i(s) = \gamma_i(t)\}$ has closed graph and nonempty compact values on $I_j$, and hence $f_{ji}$ is measurable by [3, Theorems 18.19 and 18.20]. Set

$$Q_j^i = \{t \in I_j : t = f_{ji}(t)\} \cap A_i \cap \gamma_i^{-1}(P_i).$$

$Q_j^i$ is measurable, $\gamma_i$ is injective on $Q_j^i$ by construction, and $\gamma_i(Q_j^i) \subset P_i$. Using the injectivity of $\gamma_i$ on $Q_j^i$ and a change of variable formula [26, Theorem 3.8], we have for any $j \in \mathbb{N}$ and $i = 1, 2$,

$$0 \le \int_{Q_j^i} |\gamma_i'(t)|\,dt = \int_{\gamma_i(Q_j^i)} 1\,dt \le \int_{P_i} 1\,dt = 0.$$

Since the first integrand is strictly positive, this means that $Q_j^i$ has measure zero.

Now consider $t \in \gamma_i^{-1}(P_i) \cap A_i$; this means that there exists $p \in P_i$ such that $t \in \gamma_i^{-1}(p) \cap A_i$. Since $t \in A_i$, we have $\gamma_i'(t) \neq 0$ and so there exists an interval $I_j$ containing $t$ such that $\{t\} = \{s \in I_j : \gamma_i(s) = \gamma_i(t)\}$. This shows that $t \in Q_j^i$ and, since $t$ was arbitrary, we have $\gamma_i^{-1}(P_i) \cap A_i = \cup_{j\in\mathbb{N}} Q_j^i$. Hence $\gamma_i^{-1}(P_i) \cap A_i$ has zero measure as a countable union of zero measure sets. We have

$$E = \gamma^{-1}(C) \cap B = \gamma^{-1}(C) \cap (A_1 \cup A_2 \cup A_3) \subset \gamma_1^{-1}(P_1) \cap \gamma_2^{-1}(P_2) \cap (A_1 \cup A_2 \cup A_3) \subset \big(\gamma_1^{-1}(P_1) \cap A_1\big) \cup \big(\gamma_2^{-1}(P_2) \cap A_2\big) \cup A_3.$$

All three sets on the right hand side have zero measure, so $E$ has zero measure.

Lemma 6 The result of Lemma 5 holds for any absolutely continuous curve $\gamma\colon [0, 1] \to \mathbb{R}^2$.

Proof: From [4, Lemma 1.1.4], $\gamma$ admits a Lipschitz reparametrization. That is, there exist an increasing absolutely continuous function $s\colon \mathbb{R} \to \mathbb{R}$ with Lipschitz inverse $t$ and a Lipschitz curve $\hat\gamma$, such that

$$\hat\gamma = \gamma\circ t, \qquad \|\hat\gamma'\|\circ s = \frac{\|\gamma'\|}{1 + \|\gamma'\|}, \qquad (8)$$

where the second identity holds almost everywhere. Lemma 5 holds true for $\hat\gamma$. We have for all $s$, using (8) and the fact that $t$ is the inverse of $s$,

$$\hat\gamma(s) \in C,\ \|\hat\gamma'\|(s) \neq 0 \iff \gamma(t(s)) \in C,\ (\|\hat\gamma'\|\circ s)(t(s)) \neq 0 \iff \gamma(t(s)) \in C,\ \|\gamma'(t(s))\| \neq 0.$$

Therefore, we have

$$t\big(\{s : \hat\gamma(s) \in C,\ \|\hat\gamma'\|(s) \neq 0\}\big) = \{t : \gamma(t) \in C,\ \|\gamma'(t)\| \neq 0\}.$$

The set $\{s : \hat\gamma(s) \in C,\ \|\hat\gamma'\|(s) \neq 0\}$ has measure zero by Lemma 5 because $\hat\gamma$ is Lipschitz. The Lebesgue measure of the image of a zero measure set by a Lipschitz map is zero and, since $t$ is Lipschitz, the right hand side has measure zero. This is the desired result.

Lemma 7 Let C ⊂ R be a closed set with empty interior. Then for all x ∈ C, NC(x) = R.

Proof: Denote by $f$ the distance function to $C$. $f$ is 1-Lipschitz and hence differentiable almost everywhere. Furthermore, $f(x) = 0$ if and only if $x \in C$ since $C$ is closed. Fix $x \in C$ and construct a sequence $(x_k)_{k\in\mathbb{N}}$ as follows: choose $\epsilon_k = 1/(k + 1)$ for $k \in \mathbb{N}$.

• If $(x, x + \epsilon_k) \cap C = \emptyset$, then $x_k = x$ and we have $1 \in N_C(x_k)$.

• Otherwise choose $z_k \in (x, x + \epsilon_k) \cap C$ and $y_k = \arg\max_{y\in[x, z_k]} f(y)$, where the maximum is positive by closedness of $C$ and the fact that $C$ has empty interior. $y_k$ has a projection on the left and on the right on $C$ (otherwise it cannot be in the argmax). Choose $x_k$ to be the projection on the left. We have $1 \in N_C(x_k)$ and $x \le x_k \le x + \epsilon_k$.

In all cases we have $x_k \to x$ and $1 \in N_C(x_k)$, which shows that $1 \in N_C(x)$. Similarly one could show that $-1 \in N_C(x)$ and hence $N_C(x) = \mathbb{R}$ since it is a cone. Since $x$ was arbitrary in $C$, this proves the desired result.

Lemma 8 Let $J\colon \mathbb{R}^p \rightrightarrows \mathbb{R}^m$ be a definable compact valued map with nonempty values. Then there exists a sequence of definable selectors $(V_i)_{i\in\mathbb{N}}$ for $J$ such that for all $x \in \mathbb{R}^p$,

$$J(x) = \mathrm{cl}\{V_i(x)\}_{i\in\mathbb{N}}.$$

Proof: Let $V_1$ be any definable selection of $J$; such a $V_1$ exists by [25, 4.5]. Set by recursion, for $i = 2, \ldots$, $V_i$ a definable selection of

$$x \rightrightarrows \arg\max_{v\in J(x)} \mathrm{dist}\big(v, \{V_j(x)\}_{j=1}^{i-1}\big),$$

which is definable. By a simple covering argument, using compactness, for all $x$,

$$\lim_{i\to\infty}\ \max_{v\in J(x)} \mathrm{dist}\big(v, \{V_j(x)\}_{j=1}^{i-1}\big) = 0,$$

which shows that the constructed sequence has the desired property.
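The recursion in this proof is a greedy farthest-point selection. A minimal sketch at a single fixed $x$, assuming that the compact set $J(x)$ is replaced by a finite list of points, is given below; the function name and the finite discretization are assumptions made for illustration, and definability plays no role in this toy version.

```python
import math


def greedy_dense_sequence(points, n_steps):
    """Greedy farthest-point selection on a finite set of points (tuples).

    Returns the selected points and, for each step i >= 2, the covering
    radius max_v dist(v, {V_1, ..., V_{i-1}}) computed before selecting V_i.
    """
    selected = [points[0]]  # V_1: an arbitrary first selection
    radii = []
    for _ in range(n_steps - 1):
        # Distance from each candidate to the set selected so far.
        dists = [min(math.dist(v, s) for s in selected) for v in points]
        r = max(dists)
        radii.append(r)
        # V_i: a point realizing the maximal distance.
        selected.append(points[dists.index(r)])
    return selected, radii


if __name__ == "__main__":
    grid = [(i / 4, j / 4) for i in range(5) for j in range(5)]
    _, radii = greedy_dense_sequence(grid, 10)
    print(radii)
```

The covering radii are nonincreasing because the selected set only grows, which is the finite counterpart of the limit stated in the proof.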

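Lemma 9 Using the notations of Theorem 7, there is a sequence of definable selectors $(y_i)_{i\in\mathbb{N}}$ such that for all $i \in \mathbb{N}$ and all $x \in \mathbb{R}^p$, $y_i(x) \in P(x)$ and

$$D_f(x) = \mathrm{cl}\{u : \exists i \in \mathbb{N},\ (u, 0) \in D(x, y_i(x))\}.$$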

Proof: Note that $D_f$ has compact values and is definable by Proposition 1. Using Lemma 8, we have a definable sequence $(u_i)_{i\in\mathbb{N}}$ of selections in $D_f$ such that $D_f(x) = \mathrm{cl}\{u_i(x)\}_{i\in\mathbb{N}}$. By definability, for $i \in \mathbb{N}$, using [25, 4.5], we can choose a definable sequence $(y_i)_{i\in\mathbb{N}}$ such that for all $x \in \mathbb{R}^p$,

$$y_i(x) \in P(x), \qquad (u_i(x), 0) \in D(x, y_i(x)).$$

We have for all $x \in \mathbb{R}^p$,

$$\lim_{i\to\infty}\ \max_{u\in D_f(x)} \mathrm{dist}\Big(u, \big\{\tilde u : (\tilde u, 0) \in D(x, y_j(x)),\ j = 1, \ldots, i-1\big\}\Big) \le \lim_{i\to\infty}\ \max_{u\in D_f(x)} \mathrm{dist}\big(u, \{u_j(x)\}_{j=1}^{i-1}\big) = 0,$$

which shows that the constructed sequence has the desired property.

Acknowledgments. The author acknowledges the support of ANR-3IA Artificial and Natural Intelligence Toulouse Institute, Air Force Office of Scientific Research, Air Force Material Command, USAF, under grant numbers FA9550-19-1-7026, FA9550-18-1-0226, and ANR MaSDOL - 19-CE23-0017-01. The author would like to thank Jérôme Bolte and Rodolfo Rios-Zertuche for interesting discussions which helped put this work together.

References

[1] Ablin, P., Peyré, G. and Moreau, T. (2020). Super-efficiency of automatic differentiation for functions defined as a minimum. In International Conference on Machine Learning.

[2] Agrawal, A., Amos, B., Barratt, S., Boyd, S., Diamond, S., Kolter, Z. (2019). Differentiable convex optimization layers. Advances in neural information processing systems.

[3] Aliprantis C.D., Border K.C. (2005) Infinite Dimensional Analysis (3rd edition) Springer

[4] Ambrosio L., Gigli N. and Savaré G. (2008). Gradient flows: in metric spaces and in the space of probability measures. Springer Science & Business Media.

[5] Amos, B., Kolter, J. Z. (2017). Optnet: Differentiable optimization as a layer in neural networks. In International Conference on Machine Learning.

[6] Arjovsky, Chintala, Bottou (2017). Wasserstein GAN. International Conference on Machine Learning.

[7] Attouch, H., Bolte, J., & Svaiter, B. F. (2013). Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward-backward splitting, and regularized Gauss-Seidel methods. Mathematical Programming, 137(1), 91-129.

[8] Benaïm M., Hofbauer J. and Sorin S. (2005). Stochastic approximations and differential inclusions. SIAM Journal on Control and Optimization, 44(1), 328-348.

[9] Berthet, Q., Blondel, M., Teboul, O., Cuturi, M., Vert, J. P., Bach, F. (2020). Learning with differentiable perturbed optimizers. Advances in neural information processing systems.

[10] Bertsekas D. P. (1971). Control of uncertain systems with a set-membership description of the uncertainty. Doctoral dissertation, Massachusetts Institute of Technology.

[11] Bolte J., Daniilidis A., Lewis A. and Shiota, M. (2007). Clarke subgradients of stratifiable functions. SIAM Journal on Optimization, 18(2), 556-572.

[12] Bolte, J., Sabach, S., & Teboulle, M. (2014). Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Mathematical Programming, 146(1), 459-494.

[13] Bolte, J. and Pauwels, E. (2020). Conservative set valued fields, automatic differentiation, stochastic gradient methods and deep learning. Mathematical Programming.

[14] Bolte, J. and Pauwels, E. (2020). A mathematical model for automatic differentiation in machine learning. Proceedings of the conference on neural information processing systems.

[15] Borwein J. M. and Moors, W. B. (1998). A chain rule for essentially smooth Lipschitz functions. SIAM Journal on Optimization, 8(2), 300-308.

[16] Borwein J., Moors W. and Wang, X. (2001). Generalized subdifferentials: a Baire categorical approach. Transactions of the American Mathematical Society, 353(10), 3875-3893.

[17] Borwein J. M. (2017). Generalisations, Examples, and Counter-examples in Analysis and Optimisation. Set-Valued and Variational Analysis, 25(3), 467-479.

[18] C. Castera, J. Bolte, C. Févotte and E. Pauwels (2019). An inertial Newton algorithm for deep learning. arXiv preprint arXiv:1905.12278.

[19] Clarke F. H. (1983). Optimization and nonsmooth analysis. SIAM.

[20] Coste M. (1999). An introduction to o-minimal geometry. RAAG notes, Institut de Recherche Mathématique de Rennes.

[21] Danskin J. M. (1966). The theory of max-min, with applications. SIAM Journal on Applied Mathematics, 14(4), 641-664.

[22] Davis D., Drusvyatskiy D., Kakade S., and Lee J. D. (2020). Stochastic subgradient method converges on tame functions. Foundations of Computational Mathematics, 20(1), 119-154.

[23] Davis, D. and Drusvyatskiy, D. (2021). Conservative and semismooth derivatives are equivalent for semialgebraic maps. arXiv preprint arXiv:2102.08484.

[24] Dem'Yanov V. F. (1966). On the solution of several minimax problems. I. Cybernetics, 2(6), 47-53.

[25] van den Dries L. and Miller C. (1996). Geometric categories and o-minimal structures. Duke Math. J, 84(2), 497-540.

[26] Evans L. C. and Gariepy R. F. (2015). Measure theory and fine properties of functions. Revised Edition. Chapman and Hall/CRC.

[27] Goodfellow, I.J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y. Generative adversarial nets. Advances in neural information processing systems.

[28] Goodfellow, I. J., Shlens, J., Szegedy, C. (2015). Explaining and harnessing adversarial examples. International Conference on Learning Representations.

[29] Lewis, A. and Tian, T. (2021). The structure of conservative gradient fields. arXiv preprint arXiv:2101.00699.

[30] Rios-Zertuche R. (2020). Examples of pathological dynamics of the subgradient method for Lipschitz path-differentiable functions. arXiv preprint arXiv:2007.11699.

[31] Rockafellar R. T. (1985). Extensions of subgradient calculus with applications to optimization. Nonlinear Analysis: Theory, Methods & Applications, 9(7), 665-698.

[32] Rockafellar R. T. and Wets R. J. B. (1998). Variational analysis (Vol. 317). Springer Science & Business Media.

[33] H. Royden, P. Fitzpatrick (2010) Real Analysis Prentice Hall

[34] Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., Fergus, R. (2014). Intriguing properties of neural networks. International Conference on Learning Representations.

[35] Valadier M. (1989). Entraînement unilatéral, lignes de descente, fonctions lipschitziennes non pathologiques. Comptes rendus de l'Académie des Sciences, 308, 241-244.

[36] Wang X. (1995). Pathological Lipschitz functions in R^n. Master Thesis, Simon

[37] Wang Y., Zhang G. and Ba J. (2020). On Solving Minimax Optimization Locally: A Follow-the-Ridge Approach. International Conference on Learning Representations.

[38] Whitney H. (1935). A function not constant on a connected set of critical points. Duke Mathematical Journal, 1(4), 514-517.

[39] Wilkie A. J. (1999). A theorem of the complement and some new o-minimal structures. Selecta Mathematica, 5(4), 397-421.


Contents

1 Introduction
1.1 Main result
1.2 Parametric optimality and nonsmooth differential calculus
1.3 Applications of the main results
2 Presentation of the main results
2.1 Technical preliminary
2.2 Characterization of definable conservative mappings
2.3 Definable parametric optimality formula
2.4 Algorithmic consequences
2.5 Failure of parametric optimality formula in general
3 Definably conservative fields and parametric optimality formula
3.1 Conservative mappings
3.2 O-minimal structures
3.3 Variational Stratification
3.4 Definably conservative mappings
3.5 Application to PO formula
4 Failure of parametric optimality formula
4.1 A fractal set
4.2 Construction and proof of the counterexample
4.3 Need for a better behaved subclass
5 Lemmas and proofs
References