
To sum up, Theorems 4.1 and 4.2 both give conditions for the existence of a solution to the Bellman equation. These conditions are, however, not comparable in the general case: on the one hand, weak continuity is weaker than uniform continuity in v; on the other hand, Theorem 4.2 requires one to exhibit an interval [v̲, v̄] such that B([v̲, v̄]) ⊂ [v̲, v̄]. A conceivable difficulty with Theorem 4.2 is therefore the actual possibility of identifying a pair of functions v̲ and v̄ which fulfils the assumptions of this theorem. The following class of examples is useful, as it introduces a general method for finding these two candidate values. Broadly speaking, it is based on the idea of using the value function of an aggregator that dominates the primitive aggregator A.
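As a minimal numerical sketch of how such a pair (v̲, v̄) can be checked, assuming a finite state space and the classical discounted Bellman operator in place of the abstract operator B of the theorem (all data below are made up for illustration):

```python
import numpy as np

# Toy discounted Bellman operator B on a finite state space.
# rewards[s, a]: immediate reward; P[a][s, s']: transition probabilities.
beta = 0.9
rng = np.random.default_rng(0)
n_states, n_actions = 5, 3
rewards = rng.uniform(0.0, 1.0, size=(n_states, n_actions))
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))  # P[a][s] is a distribution

def B(v):
    # (Bv)(s) = max_a rewards[s, a] + beta * E[v(s') | s, a]
    return np.max(rewards + beta * np.einsum("asx,x->sa", P, v), axis=1)

# Candidate bounds: constant functions built from the extreme rewards.
v_lo = np.full(n_states, rewards.min() / (1.0 - beta))
v_hi = np.full(n_states, rewards.max() / (1.0 - beta))

# Since B is monotone, B(v_lo) >= v_lo and B(v_hi) <= v_hi already imply
# B([v_lo, v_hi]) ⊂ [v_lo, v_hi].
assert np.all(B(v_lo) >= v_lo - 1e-12) and np.all(B(v_hi) <= v_hi + 1e-12)

# Iterating from either end of the interval converges to the fixed point.
v = v_lo.copy()
for _ in range(500):
    v = B(v)
print(np.max(np.abs(B(v) - v)))  # ~0: v solves the Bellman equation
```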

the noise W⁰. Then, by reformulating the original control problem as a stochastic control problem in which the conditional law P^{W⁰}_{X_t} is the sole controlled state variable, driven by the random noise W⁰, and by showing the continuity of the value function in the Wasserstein space of probability measures, we are able to prove a dynamic programming principle (DPP) for our stochastic McKean-Vlasov control problem. Next, to exploit the DPP, we use a notion of differentiability with respect to probability measures introduced by P.L. Lions in his lectures at the Collège de France [35] and detailed in the notes [11]. This notion of derivative is based on lifting functions of probability measures to functions defined on the Hilbert space of square-integrable random variables distributed according to the "lifted" probability measure. By combining this with a special Itô chain rule for flows of conditional distributions, we derive the dynamic programming Bellman equation for the stochastic McKean-Vlasov control problem, which is a fully nonlinear second-order partial differential equation (PDE) in the infinite-dimensional Wasserstein space of probability measures. By adapting standard arguments to our context, we prove from the dynamic programming principle that the value function is a viscosity solution of the Bellman equation. To complete our PDE characterization of the value function with a uniqueness result, it is convenient to work in the lifted Hilbert space of square-integrable random variables instead of the Wasserstein metric space of probability measures, in order to rely on general results for viscosity solutions of second-order Hamilton-Jacobi-Bellman equations in separable Hilbert spaces; see [33], [34], [23]. We also state a verification theorem, which is useful for obtaining an analytic feedback form of the optimal control when there is a smooth solution to the Bellman equation. Finally, we apply our results to the class of linear-quadratic (LQ) stochastic McKean-Vlasov control problems, for which one can obtain explicit solutions, and we illustrate with an example arising from an interbank systemic risk model.
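For the reader's orientation, Lions' lifting can be stated compactly; this is the standard formulation, not necessarily the paper's exact notation:

```latex
% Lift of u : \mathcal{P}_2(\mathbb{R}^d) \to \mathbb{R} to the Hilbert space
% H = L^2(\Omega;\mathbb{R}^d):
\hat{u}(X) \;:=\; u\bigl(\mathcal{L}(X)\bigr), \qquad X \in L^2(\Omega;\mathbb{R}^d),
% where \mathcal{L}(X) denotes the law of X. One says u is differentiable at \mu
% when \hat{u} is Fréchet differentiable at some X with \mathcal{L}(X) = \mu; then
D\hat{u}(X) \;=\; \partial_\mu u(\mu)(X), \qquad \mu = \mathcal{L}(X),
% where \partial_\mu u(\mu) : \mathbb{R}^d \to \mathbb{R}^d is the Lions derivative.
```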

The main contribution of our paper is the following: we provide a systematic method to solve any problem of this sort, including those in which the Agent can also control the volatility of the output process, and not just the drift. We first used this method to solve a previously unsolved Principal-Agent problem in a precursor to this paper, Cvitanić, Possamaï and Touzi [7], for the special case of CARA utility functions, showing that the optimal contract depends not only on the output value (in a linear way, because of CARA preferences), but also on the risk the output has been exposed to, via its quadratic variation. In the examples section of the present paper, we also show how to solve by our method other problems of this type, problems which had previously been solved by ad hoc methods, on a case-by-case basis. We expect there will be many other applications involving Principal-Agent problems of this type which have not been previously solved, and which our method will be helpful in solving. The present paper includes all the above cases as special cases (up to some technical considerations), considering a multi-dimensional model with arbitrary utility functions and the Agent's effort affecting both the drift and the volatility of the output, that is, both the return and the risk. Let us also point out that no Markovian-type assumptions are needed to use our approach, a point which also generalizes earlier results.

Key words: approximate dynamic programming, fuzzy approximation, value iteration, convergence analysis.
1 Introduction
Dynamic programming (DP) is a powerful paradigm for solving optimal control problems, thanks to its mild assumptions on the controlled process, which can be nonlinear or stochastic [3, 4]. In the DP framework, a model of the process is assumed to be available, and the immediate performance is measured by a scalar reward signal. The controller then maximizes the long-term performance, measured by the cumulative reward. DP algorithms can be extended to work without requiring a model of the process, in which case they are usually called reinforcement learning (RL) algorithms [24]. Most DP and RL algorithms work by estimating an optimal value function, i.e., the maximal cumulative reward as a function of the process state and possibly also of the control action. Representing value functions exactly is…
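As a minimal illustration of the value-function estimation this paragraph describes, here is a value-iteration sketch on a hypothetical tiny MDP (the states, actions, rewards, and transitions are made up for the example):

```python
import numpy as np

# Hypothetical tiny MDP: 4 states, 2 actions, reward r[s, a],
# transition probabilities P[a, s, s'], discount factor gamma.
gamma = 0.95
r = np.array([[0.0, 1.0], [0.5, 0.0], [0.0, 0.2], [1.0, 0.0]])
P = np.full((2, 4, 4), 0.25)  # uniform transitions, for illustration only

V = np.zeros(4)
for _ in range(1000):
    Q = r + gamma * np.einsum("asx,x->sa", P, V)  # Q(s, a)
    V_new = Q.max(axis=1)                          # Bellman optimality backup
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new
policy = Q.argmax(axis=1)  # greedy policy w.r.t. the estimated value function
```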

In this paper we focus on regional deterministic optimal control problems, i.e., problems where the dynamics and the cost functional may be different in several regions of the state space and present discontinuities at their interface.
Under the assumption that optimal trajectories have a locally finite number of switchings (no Zeno phenomenon), we use the duplication technique to show that the value function of the regional optimal control problem is the minimum, over all possible structures of trajectories, of value functions associated with classical optimal control problems settled over fixed structures, each of them being the restriction to some submanifold of the value function of a classical optimal control problem in higher dimension. The lifting (duplication) technique is thus seen as a kind of desingularization of the value function of the regional optimal control problem. In turn, we extend the classical sensitivity relations to regional optimal control problems, and we prove that the regularity of this value function is the same as (i.e., no more degenerate than) that of the higher-dimensional classical optimal control problem that lifts the problem.

Contributions. The extensive literature on submodular functions motivates us to investigate other fundamental questions concerning their structure. How much information is contained in a submodular function? How much of that information can be obtained in just a few value-oracle queries? Can an auctioneer efficiently estimate a player's utility function if it is submodular? To address these questions, we consider the problem of approximating a submodular function f everywhere while performing only a polynomial number of queries. More precisely, the problem we study is: …
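To make the object of study concrete: a set function f is submodular when its marginal gains diminish, i.e. f(A ∪ {x}) − f(A) ≥ f(B ∪ {x}) − f(B) for all A ⊆ B and x ∉ B, and the algorithms discussed access f only through value-oracle queries f(S). A small sketch with a toy coverage function (all names and sets are illustrative); note that the exhaustive check below needs exponentially many queries, which is exactly what the approximation question tries to avoid:

```python
from itertools import chain, combinations

def coverage_oracle(S):
    """Value oracle for a toy coverage function: f(S) = |union of the sets in S|."""
    universe = {"a": {1, 2}, "b": {2, 3}, "c": {3, 4, 5}}
    return len(set().union(*(universe[e] for e in S)))

def is_submodular(oracle, ground):
    """Exhaustively verify diminishing returns via value-oracle queries."""
    subsets = list(chain.from_iterable(combinations(ground, k)
                                       for k in range(len(ground) + 1)))
    for A in map(set, subsets):
        for B in map(set, subsets):
            if A <= B:
                for x in ground - B:
                    if oracle(A | {x}) - oracle(A) < oracle(B | {x}) - oracle(B):
                        return False
    return True

print(is_submodular(coverage_oracle, {"a", "b", "c"}))  # True: coverage is submodular
```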

F(s, A′) ≥ F(s, A)   ∀ A′ s.t. A′ ⊂ A   (2)
Equation (2) roughly means that a particular task cannot increase in value because of the presence of other assignments. Although many score functions typically used in task allocation satisfy this submodularity condition (for example, those used in the information theory community [10]), many do not. It is simple to demonstrate that the distributed greedy multi-assignment problem may fail to converge with a non-submodular score function, even with as few as 2 tasks and 2 agents. Consider the following example, where the notation for an agent's task group is (task ID, task score), and the sequential order added moves from left to right. The structure of this simple algorithm is that each agent produces bids on a set of desired tasks, then shares these with the other agents. This process repeats until no agent has an incentive to deviate from its current allocation. In the examples that follow, the nominal score achieved for servicing a task will be defined as T. The actual value achieved for servicing the task may be a function of other things the agent has already committed to, and ε is defined as some value with 0 < ε < T. Example 1: Allocations with a submodular score function …
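Before turning to the paper's examples: as a minimal numeric sketch (hypothetical values, not the paper's Example 1), a score with synergy between tasks violates condition (2):

```python
# Condition (2): F(s, A') >= F(s, A) for all A' ⊂ A -- a task's score may not
# increase because of other assignments already in the bundle A.
T, eps = 10.0, 1.0  # nominal task score and a small synergy bonus (illustrative)

def F(s, A):
    """Score of task s given the agent's other commitments A. Non-submodular:
    already holding the *other* task raises s's value (synergy)."""
    other = {"t1": "t2", "t2": "t1"}[s]
    return T + (eps if other in A else 0.0)

# The empty bundle is a strict subset of {"t2"}, yet the score went *up*:
print(F("t1", set()), F("t1", {"t2"}))  # 10.0 11.0 -> condition (2) violated
```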

Table 5: Accuracy of KNN with different metric learning algorithms, and their running times in seconds.
…datasets as in the previous experiment, but we let the ξ of the ξ-additive model vary from 1 to min(10, m). A value of ξ = 1 means that there is no interaction between features, and only singletons are considered. Increasing ξ adds orders of interaction, finally reaching the order of interaction tackled by the LEML approach without the ξ-additive method. It can be seen that each time we decrease ξ, the number of free parameters of f(S) is divided by 2, so that the running time of the method becomes very reasonable, even for fairly high-dimensional data. Table 5 also gives the results obtained through a grid search over ξ (last column). Interestingly, one can see that LEML-ξ often gives better results than LEML, showing that using all the m-tuple-wise combinations is not always necessary, and may even penalize performance (e.g. balance, ionosphere, liver, and sonar).

Observe that since |X|·|X|·(N+1) will typically be substantially smaller than |Y|, the use of this approximation architecture makes the linear program in Algorithm 3 a tractable program. For example, in models in which the state space has millions of billions of states, only thousands of basis functions are required. Note that our selection of basis functions produces the most general possible separable approximation: for each x, j, the function f^x_j(s) can take a different value for each s ∈ {0, ..., N}. Our selection of basis functions generalizes approximation architectures in which the value function is approximated as a linear combination of moments of the industry state. Moment-based approximations have been previously used in large-scale stochastic control problems that arise in macroeconomics (Krusell and Smith 1998), using a very different approach and algorithm from ours, though. In our computational experiments we observe that moving beyond simple linear combinations of, say, the first two moments of the industry state is valuable: often the simpler architecture fails to produce good approximations to MPE for our computational examples, while our proposed architecture does.
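A sketch of the separable architecture being described, under illustrative assumptions (the industry state s is taken to be a vector of counts s[x] ∈ {0, …, N} over a finite set X, with one indicator basis function per coordinate and level):

```python
import numpy as np

N = 10                  # maximum count per coordinate (illustrative)
X = list(range(6))      # hypothetical finite set indexing the industry state

def separable_features(s):
    """phi[(x, j)] = 1{s[x] == j}: one indicator per coordinate x and level j.
    A linear combination sum_{x,j} theta[x, j] * phi[(x, j)] can realize any
    separable function sum_x f_x(s[x]) -- the most general separable form."""
    phi = np.zeros((len(X), N + 1))
    for x in X:
        phi[x, s[x]] = 1.0
    return phi.ravel()

# Moment-based architectures are the special case theta[x, j] = j**k: e.g.
# the first moment sum_x s[x] is recovered with theta[x, j] = j.
s = np.array([3, 0, 10, 7, 2, 5])
theta_first_moment = np.tile(np.arange(N + 1), len(X))
print(separable_features(s) @ theta_first_moment, s.sum())  # equal: 27.0 27
```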

Unmanned Aircraft Systems (UAS) have the potential to perform many of the dangerous missions currently flown by manned aircraft. Yet the complexity of some tasks, such as air combat, has precluded UAS from successfully carrying out these missions autonomously. This paper presents a formulation of a level-flight, fixed-velocity, one-on-one air combat maneuvering problem and an approximate dynamic programming (ADP) approach for computing an efficient approximation of the optimal policy. In the version of the problem formulation considered, the aircraft learning the optimal policy is given a slight performance advantage. This ADP approach provides a fast response to a rapidly changing tactical situation, long planning horizons, and good performance without explicit coding of air combat tactics. The method's success is due to extensive feature development, reward shaping, and trajectory sampling. An accompanying fast and effective rollout-based policy extraction method is used to accomplish on-line implementation. Simulation results are provided that demonstrate the robustness of the method against an opponent beginning from both offensive and defensive situations.
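A minimal sketch of rollout-based policy extraction of the kind mentioned, under generic assumptions: the approximate value function J_tilde, the simulator step(s, a), the candidate action set, and the rollout width/depth are all stand-ins, not the paper's implementation:

```python
import numpy as np

def rollout_policy(state, actions, step, J_tilde, beta=0.95, width=8, depth=3, rng=None):
    """Pick the action with the best sampled lookahead value: average over
    `width` short rollouts of depth `depth`, then bootstrap with the
    approximate value function J_tilde at the leaf state."""
    rng = rng or np.random.default_rng()
    best_a, best_v = None, -np.inf
    for a in actions:
        total = 0.0
        for _ in range(width):
            s, r_sum, disc, act = state, 0.0, 1.0, a
            for _ in range(depth):
                s, r = step(s, act)          # simulator: next state and reward
                r_sum += disc * r
                disc *= beta
                act = rng.choice(actions)    # default (heuristic) continuation
            total += r_sum + disc * J_tilde(s)
        v = total / width
        if v > best_v:
            best_a, best_v = a, v
    return best_a
```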

In conventional DP computation, Bellman's equation [2] is evaluated iteratively and sequentially, and such time-consuming sequential iterations result in substantial computational delay. The notion of the "curse of dimensionality", as coined by Bellman in [10], refers to the vast computational effort required for the numerical solution of Bellman's equation when a large number of state variables is subject to the optimization objective function. Both the computational delay and the hardware resource requirements grow explosively as the problem size increases. To accelerate the DP computation, a first-order ordinary differential equation (ODE) system was proposed by Lam and Tong [1]; it can be employed to transform the sequential DP algorithm into a continuous-time parallel computational network which enables high-speed convergence to Bellman's optimality criterion. Here, a CMOS current-mode analog circuit is presented to provide a highly portable and low-power implementation of the proposed network architecture. A detailed analysis of computational speed, power consumption, and network convergence is presented. The realization of a circuit of reasonable size is demonstrated to exemplify the design principles. A procedure to test and validate the fabricated circuit is discussed. We have also investigated the error models, and the results lead to a compensation scheme whereby the errors due to nonideal current sources and device mismatch are minimized.
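One way to read the ODE formulation, sketched here as the general idea rather than the specific network of [1]: relax the fixed-point equation V = T(V) into the continuous-time dynamics dV/dt = T(V) − V, whose equilibria are exactly the solutions of Bellman's equation, and let the analog network integrate this flow with every state's equation evaluated in parallel. A discretized toy illustration:

```python
import numpy as np

# Toy shortest-path-style Bellman operator T on a small graph (illustrative).
inf = 1e9
cost = np.array([[inf, 2.0, 5.0, inf],
                 [inf, inf, 1.0, 4.0],
                 [inf, inf, inf, 1.0],
                 [inf, inf, inf, inf]])
goal = 3

def T(V):
    out = np.min(cost + V[None, :], axis=1)  # relax over successor states
    out[goal] = 0.0                           # boundary condition at the goal
    return out

# Continuous-time relaxation dV/dt = T(V) - V, integrated by forward Euler;
# an analog network would integrate all components simultaneously.
V, dt = np.zeros(4), 0.1
for _ in range(2000):
    V = V + dt * (T(V) - V)
print(V)  # converges to the optimal costs-to-go [4., 2., 1., 0.]
```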

…in the state space Q in low-speed (tape, disk) computer storage. It is common knowledge that in real problems excessive high-speed storage requirements can present a serious implementation…

Let ẑ be an upper bound on the objective function value of any optimal solution to the original discrete optimization problem P. Then, since P̂ is a representation of P, it follows that…

…the extension of the HJB dynamic programming approach. The second difficulty is related to the well-posedness of the HJB or adjoint equation, because it is set on an infinite domain in space. The third difficulty is the lack of a proof of convergence for the two algorithms suggested here, an HJB-based fixed point and a steepest descent based on the calculus of variations. When it converges, the fixed-point method is preferred, but there is no known way to guarantee convergence even to a local minimum; as for the steepest descent, we found it somewhat hard to use, mainly because it generates irregular oscillating solutions; some bounds on the derivatives of u need to be added in penalised form to the criterion. Numerically, both algorithms are cursed by the asymptotic behaviour of the adjoint state at infinity. So, when possible, the Riccati semi-analytical feedback solution is the best. Finally, but this applies only to this toy problem, the pure feedback solution is nearly optimal, easy to compute, and stable. Note also that this semi-analytical solution is a validation test, since it has been recovered by both algorithms.
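For context, the semi-analytical feedback referred to is the standard LQ construction. A sketch for a scalar problem with dynamics dx = (a x + b u) dt + σ dW and cost ∫ (q x² + r u²) dt + q_T x(T)²; all coefficient values below are illustrative, not those of the paper's toy problem:

```python
import numpy as np

# Scalar LQ data (illustrative values).
a, b, q, r, qT, T_end, dt = -1.0, 1.0, 1.0, 0.5, 1.0, 2.0, 1e-3

# Solve the Riccati ODE backward:  -dP/dt = 2 a P - (b**2 / r) P**2 + q,  P(T) = qT.
n = int(T_end / dt)
P = np.empty(n + 1)
P[n] = qT
for k in range(n, 0, -1):
    dP = 2 * a * P[k] - (b ** 2 / r) * P[k] ** 2 + q
    P[k - 1] = P[k] + dt * dP      # Euler step in reverse time

def feedback(t, x):
    """Optimal control in feedback form: u*(t, x) = -(b / r) P(t) x."""
    k = min(int(t / dt), n)
    return -(b / r) * P[k] * x
```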

2 Clustered Robust Routing problem
In this section we briefly present the CRR problem [SFC+18] and introduce some key notation. Given a network topology and a sequence of traffic matrices (TMs), which may represent the evolution over time of end-to-end connections (i.e., demands), the CRR problem consists in splitting the sequence into smaller clusters of contiguous TMs, with a single routing configuration applied to each cluster. Since a single routing configuration is applied to all TMs of a cluster, the goal is to find the clusters and corresponding routing configurations that minimize the worst deviation with respect to Dynamic-TE.
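Since the clusters must be contiguous, the partitioning side of the problem fits a classical dynamic program. A sketch, assuming a cost oracle cluster_cost(i, j) that returns the worst deviation of the best single routing configuration for TMs i..j (the oracle itself, i.e. the robust-routing subproblem, is not implemented here):

```python
import functools

def best_partition(n, k, cluster_cost):
    """Split TMs 0..n-1 into at most k contiguous clusters so as to minimize
    the worst (maximum) per-cluster cost. `cluster_cost(i, j)` must return the
    cost of serving TMs i..j inclusive with one routing configuration."""
    @functools.lru_cache(maxsize=None)
    def solve(start, left):
        if start == n:
            return 0.0, []
        if left == 0:
            return float("inf"), None
        best_cost, best_split = float("inf"), None
        for end in range(start, n):          # first cluster covers TMs start..end
            tail_cost, tail = solve(end + 1, left - 1)
            if tail is None:
                continue
            total = max(cluster_cost(start, end), tail_cost)
            if total < best_cost:
                best_cost, best_split = total, [(start, end)] + tail
        return best_cost, best_split

    return solve(0, k)

# Dummy oracle where a cluster's cost is its length (illustrative only):
print(best_partition(6, 3, lambda i, j: float(j - i + 1)))  # worst cluster cost 2.0
```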

…which requires a parametric maximization (minimization) over the control set U. For its discrete analogue G(V), a common practice in the literature is to compute the minimization by comparison, i.e., by evaluating the expression on a finite set of elements of U (see for instance [1, 12, 17] and references therein). In contrast to the comparison approach, the contribution of this paper is to demonstrate that an accurate realization of the min-operation on the right-hand side of (1.3) can have an important impact on the optimal controls that are determined on the basis of the dynamic programming principle. In this respect, the reader may take a preview of Figure 4.2, where differences between optimal control fields obtained with different minimization routines can be appreciated. Previous works concerning the construction of minimization routines for this problem date back to [8], where Brent's algorithm is proposed to solve high-dimensional Hamilton-Jacobi-Bellman equations, and to [10], where the authors consider a fast semi-Lagrangian algorithm for front propagation problems. In this latter reference, the authors determine the minimizer of a specific Hamiltonian by means of an explicit formula. Moreover, for local optimization strategies in dynamic programming, we refer to [16] for Brent's algorithm and to [22] for a Bundle Newton method.
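To illustrate the difference being examined, here is a sketch comparing minimization by comparison over a discretized control set with a Brent-type scalar minimization; H below is a stand-in for the expression to be minimized, not the scheme's actual Hamiltonian:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Stand-in for the expression to be minimized over the control set U = [-1, 1].
H = lambda u: 0.3 * u ** 2 + np.sin(3.0 * u) + 0.1 * u

# (a) Minimization by comparison over a finite discretization of U.
grid = np.linspace(-1.0, 1.0, 33)
u_cmp = grid[np.argmin(H(grid))]

# (b) Bounded Brent-type minimization to high accuracy.
res = minimize_scalar(H, bounds=(-1.0, 1.0), method="bounded")
u_brent = res.x

print(u_cmp, u_brent, H(u_cmp) - H(u_brent))  # comparison pays a grid-resolution penalty
```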

∑_{t=0}^{∞} β^t r(x_t, u_t, x_{t+1}),   (1)
with β ∈ (0, 1) the discount factor. In finite-horizon problems the sum is truncated and β ∈ (0, 1]. This random variable depends on the initial distribution q and on the policy π of the agent. The superscripts q, π stress that dependence but are omitted in the sequel when the context is clear. The agent applies a stochastic Markov policy π ∈ Σ. Stochastic policies are often relevant in a risk-sensitive context: see Section 4.1. The policy may be time-varying. Even if the process is observable, in many risk-sensitive setups it is necessary to extend the state space and define auxiliary variables (see Section 2.2). Therefore, let I refer to such an extended state space, identified with the information space of the agent. The random selection of an action in A according to a distribution conditioned on the information state i_t ∈ I is written…
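A sketch of how the return (1) is sampled under a stochastic Markov policy; env_reset, env_step, and policy are stand-in names for the simulator and the conditional action distribution, and truncation at a finite horizon approximates the infinite discounted sum:

```python
import numpy as np

def sample_return(env_reset, env_step, policy, beta=0.95, horizon=1000, rng=None):
    """One Monte Carlo sample of sum_t beta^t r(x_t, u_t, x_{t+1}).
    `policy(i)` returns a probability vector over actions given the
    (possibly extended) information state i."""
    rng = rng or np.random.default_rng()
    x = env_reset()
    total, disc = 0.0, 1.0
    for _ in range(horizon):
        probs = policy(x)                       # distribution over A given i_t
        u = rng.choice(len(probs), p=probs)     # random action selection
        x_next, r = env_step(x, u)
        total += disc * r
        disc *= beta
        x = x_next
    return total
```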

(The smoothing of the error after a single successive approximation step in this example is a coincidence. In general, several successive approximation steps will be required…)

Based on the variation of J with respect to u, we have used 100 iterations of a gradient method with fixed step size ω = 0.3. The parameters of the problem are T = 2, h = 0.1, θ = 1, σ² = (2/3)θ, and ρ₀, ρ_T are as in [7]. The numerical method for the PDEs is a centered space-time finite element method of degree 1 on a mesh of 150 points and 40 time steps.
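The optimization loop described above is plain fixed-step gradient descent; schematically (grad_J is a stand-in for the variation of J computed through the PDEs, not an implementation of it):

```python
def gradient_method(u0, grad_J, omega=0.3, iterations=100):
    """Fixed-step gradient descent on the control: u <- u - omega * grad J(u).
    `grad_J(u)` must return the gradient of the cost J at the current control."""
    u = u0
    for _ in range(iterations):
        u = u - omega * grad_J(u)
    return u
```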