• The rewards u(s) obtained in the final states s ∈ S_h. In this article we do not consider intermediate satisfaction degrees; however, our results could easily be extended to handle them.
Decision trees and finite-horizon Markov decision processes are two closely related frameworks. A finite-horizon (possibilistic) MDP can be translated into a decision tree by representing explicitly every possible trajectory. However, the number of such trajectories may be exponential in the horizon (O((|S| × |A|)^h)), and so is the size of the DT representation of the finite-horizon MDP. Thus, a naive application of the backward induction algorithm to a DT translation of an MDP is inefficient. Fortunately, [24] has shown that a backward induction algorithm can be defined for possibilistic MDPs whose complexity is only polynomial in the representation size of the MDP (as in the stochastic MDP case).
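To make the dynamic-programming structure concrete, here is a minimal sketch of optimistic possibilistic backward induction on a finite-horizon MDP. It is not the algorithm of [24] itself; the data layout (transition_pi, reward) and the optimistic max-min criterion are illustrative assumptions. Its running time is O(h·|S|²·|A|), i.e. polynomial in the MDP representation.

def possibilistic_backward_induction(states, actions, transition_pi, reward, horizon):
    """Optimistic possibilistic backward induction (illustrative sketch).

    transition_pi[s][a][s2] -- possibility degree of reaching s2 from s under a
    reward[s]               -- satisfaction degree of terminal state s
    Returns stage values and a greedy policy.
    """
    value = {(s, horizon): reward[s] for s in states}
    policy = {}
    for t in range(horizon - 1, -1, -1):
        for s in states:
            best_a, best_v = None, 0.0
            for a in actions:
                # optimistic criterion: max over successors of min(possibility, value)
                v = max(min(transition_pi[s][a][s2], value[(s2, t + 1)])
                        for s2 in states)
                if best_a is None or v > best_v:
                    best_a, best_v = a, v
            value[(s, t)] = best_v
            policy[(s, t)] = best_a
    return value, policy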
In order to overcome the drowning effect, Fargier and Sabbadin have proposed lexicographic refinements of possibilistic criteria for non-sequential decision problems [26]. However, these refinements have not yet been extended to sequential decision under uncertainty, where the drowning effect is also due to the reduction of compound possibilistic policies into simple possibility distributions on the consequences. The present paper proposes an extension of the lexicographic preference relations to finite-horizon sequential problems, providing lexicographic possibilistic decision criteria that compare full policies (and not simply their reductions). This allows us to equip possibilistic decision trees and finite-horizon Markov decision processes with backward induction algorithms that compute lexicographically optimal policies.
dimension are high. In [Hur+18], we proposed algorithms relying on deep neural networks for approximating/learning the optimal policy, and then possibly the value function, by performance/policy iteration or hybrid iteration with Monte Carlo regressions now or later. This research led to three algorithms, namely NNcontPI, Hybrid-Now and Hybrid-LaterQ, which are recalled in Section 2 and can be seen as a natural extension of actor-critic methods, developed in the reinforcement learning community for stationary stochastic problems ([SB98]), to finite-horizon control problems. Note that for stationary control problems it is usual to use techniques such as temporal difference learning, which rely on the fact that the value function and the optimal control do not depend on time, to improve the learning of the latter. Such techniques do not apply to finite-horizon control problems. In Section 3, we perform numerical and comparative tests to illustrate the efficiency of our different algorithms, on 100-dimensional nonlinear PDE examples as in [EHJ17] and quadratic backward stochastic differential equations as in [CR16], as well as on high-dimensional linear-quadratic stochastic control problems. We present numerical results for an option hedging problem in finance, and for energy storage problems arising in the valuation of gas storage and in microgrid management. Numerical results and comparisons with quantization-type algorithms (Qknn), introduced in this paper as an efficient algorithm to numerically solve low-dimensional control problems, are also provided. Finally, we conclude in Section 4 with some comments about possible extensions and improvements of our algorithms.
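As an illustration of the backward-in-time policy-learning idea these algorithms share, here is a minimal sketch in the spirit of NNcontPI. It is not the exact algorithm of [Hur+18]: the network sizes, the Gaussian sampling of training states, and the placeholder callables dynamics, running_cost and terminal_cost are all assumptions made only to keep the sketch self-contained.

import torch
import torch.nn as nn

def train_policies(dynamics, running_cost, terminal_cost, horizon,
                   state_dim, action_dim, n_samples=1000, n_epochs=200):
    # one feed-forward policy network per time step t = 0, ..., horizon - 1
    policies = [nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU(),
                              nn.Linear(32, action_dim)) for _ in range(horizon)]
    for t in reversed(range(horizon)):
        optimizer = torch.optim.Adam(policies[t].parameters(), lr=1e-3)
        for _ in range(n_epochs):
            x = torch.randn(n_samples, state_dim)       # assumed training distribution
            a = policies[t](x)
            x_next = dynamics(t, x, a, torch.randn(n_samples, state_dim))
            cost = running_cost(t, x, a)
            # Monte Carlo rollout under the already-trained (frozen) later policies
            xs = x_next
            for s in range(t + 1, horizon):
                a_s = policies[s](xs)
                cost = cost + running_cost(s, xs, a_s)
                xs = dynamics(s, xs, a_s, torch.randn(n_samples, state_dim))
            loss = (cost + terminal_cost(xs)).mean()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        # freeze the stage-t policy before moving to the earlier stage
        for p in policies[t].parameters():
            p.requires_grad_(False)
    return policies

The sketch assumes the dynamics and costs are differentiable torch functions, so that the gradient of the expected cost with respect to the stage-t policy flows through the simulated trajectory.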
Chance Constrained Finite-Horizon Optimal Control with Nonconvex Constraints
Masahiro Ono, Lars Blackmore, and Brian C. Williams
Abstract— This paper considers finite-horizon optimal control for dynamic systems subject to additive Gaussian-distributed stochastic disturbance and a chance constraint on the system state defined on a non-convex feasible space. The chance constraint requires that the probability of constraint violation is below a user-specified risk bound. A great deal of recent work has studied joint chance constraints, which are defined on a conjunction of linear state constraints. These constraints can handle convex feasible regions, but do not extend readily to problems with non-convex state spaces, such as path planning with obstacles.
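For reference (the notation is ours, not necessarily that of the paper), a joint chance constraint over a horizon of length T, with N linear state constraints a_i^T x_t ≤ b_i and risk bound Δ, typically takes the form
\[
\Pr\!\left( \bigwedge_{t=1}^{T} \bigwedge_{i=1}^{N} a_i^{\top} x_t \le b_i \right) \ge 1 - \Delta .
\]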
8.1 A problem of business strategy
Denote by y(s) the quantity of steel produced by an industry at time s. At every moment, this production can either be reinvested to expand the productive capacity or sold. The initial productive capacity is x > 0; this capacity grows at the reinvestment rate. Let u : [0, T] → [0, 1] be the function such that u(s) is the fraction of the output at time s that is reinvested. The objective is to maximize the total sales. As u(s) is the fraction of the output y(s) that we reinvest, (1 − u(s))y(s) is the part of y(s) that we sell, at a price P that is constant over the time horizon. So the first objective is
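Written out (our notation; the exponential dynamics for y are the natural reading of "capacity grows at the reinvestment rate"), the problem is
\[
\max_{u(\cdot)} \int_0^T P\,\bigl(1 - u(s)\bigr)\,y(s)\,\mathrm{d}s,
\qquad \dot y(s) = u(s)\,y(s), \quad y(0) = x, \quad 0 \le u(s) \le 1 .
\]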
An interesting question is whether the optimal selling price is a nonincreasing function of the initial inventory level, as is the case for a similar model with no fixed cost; see Federgr[r]
Our algorithm, Market-based Iterative Risk Allocation (MIRA), finds the optimal solution by solving the decomposed optimization problem and the root-finding problem iteratively and a[r]
good solution, and their computation time is often better than that of GMAA*. This might prove useful for approximate, heuristic-driven forward searches.
The computational record of the two 2-agent programs shows that MILP-2 agents is slower than MILP when the horizon grows. There are two reasons to which this sluggishness may be attributed. First, the time taken by the branch-and-bound (BB) method to solve a 0-1 MILP grows with the number of 0-1 variables in it, and MILP-2 agents has many more 0-1 variables than MILP, even though its total number of variables is exponentially smaller. Secondly, MILP-2 agents is a more complicated program than MILP; it has many more constraints. MILP is a simple program, concerned only with finding a subset of a given set: in addition to finding weights of histories, it also finds weights of terminal joint histories, and this is the only extra or superfluous quantity it is forced to find. MILP-2 agents, on the other hand, takes a much more circuitous route and finds many more superfluous quantities: in addition to weights of histories, it also finds supports of policies, regrets of histories and values of information sets. Thus, the relaxation of MILP-2 agents takes longer to solve than the relaxation of MILP. This is the second reason for the slowness with which the BB method solves MILP-2 agents.
6.2 Sufficient conditions for (H3)
We end this section by providing explicit examples where (H3) is satisfied. The idea consists in constructing switching strategies with a finite number of switches that satisfy the constraint imposed on the controlled diffusion. This yields a lower bound for the value function and, thanks to the estimate of Lemma 5.2, proves the polynomial growth of the value function.
INFINITE-HORIZON PROBLEMS UNDER PERIODICITY CONSTRAINT
J. BLOT, A. BOUADI AND B. NAZARET
Abstract. We study some infinite-horizon optimization problems on spaces of periodic functions, for non-periodic Lagrangians. The main strategy relies on the reduction to a finite horizon thanks to the introduction of an averaging operator. We then provide existence results and necessary optimality conditions, in which the corresponding averaged Lagrangian appears.
2 What is a strategy and how to measure its performance?
2.1 Deterministic strategies with finite horizon
We now propose a definition of sequential deterministic strategies for optimization with finite horizon. Assume that one has a budget of r evaluations after having evaluated y at an arbitrary n-point design X. One step of a sequential strategy essentially consists in looking for the next point at which to evaluate y, say x_{n+1}. In some sampling procedures, like crude Monte Carlo, x_{n+1} may be determined without taking into account the design X and the corresponding observations Y. However, in the considered case of adaptive strategies, x_{n+1} is determined on the basis of the available information. Furthermore, we restrict ourselves here to the case of deterministic strategies, i.e. where x_{n+1} only depends on the past and does not involve any random operator (like mutations in genetic algorithms). So x_{n+1} is in fact defined as some function of X and Y:
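Schematically (our notation), such a one-step deterministic strategy is a map
\[
x_{n+1} = s_{n+1}\bigl(\mathbf{X}, \mathbf{Y}\bigr),
\]
for some deterministic function s_{n+1} of the current design and the corresponding observations.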
f(s, Y_s, Z_s, U_s)\,\mathrm{d}s, \qquad t \in [0, T], \qquad (1.3)
in the context of mean-variance hedging, with H_s := 1_{τ ≤ s} and W a standard Brownian
motion. The interesting feature here lies in the fact that, under some assumptions on the market, the solution triplet (Y, Z, U) to the previous BSDE is completely described in terms of that of a BSDE with deterministic finite horizon. More precisely, if we assume that F is the natural filtration of W and that τ is a random time which is not an F-stopping time, then the BSDE with deterministic horizon associated with BSDE (1.3) is of the form
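As a generic point of reference (this is a standard form, not necessarily the exact equation of the cited work), a Brownian BSDE with deterministic horizon T reads
\[
Y_t = \xi + \int_t^T f(s, Y_s, Z_s)\,\mathrm{d}s - \int_t^T Z_s\,\mathrm{d}W_s, \qquad t \in [0, T].
\]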
In this paper, we consider the output controllability of finite-dimensional control systems. In many applications, some state variables are not relevant from the point of view of the practical control application. To give a basic idea, if one aims to model a car, the wheel angle will be one of the state variables, but this state variable is not relevant for the main purpose of a car, i.e., moving from one point to another. In this context, the system output could be the orientation and the position of the center of mass of the car. Our main goal is to control only the output of the system to some desired target in a given time T > 0, and then keep this output fixed for the remaining times t > T.
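To fix ideas (the linear setting and the symbols A, B, C, y_f below are our own illustrative assumptions), for a linear time-invariant system
\[
\dot x(t) = A x(t) + B u(t), \qquad y(t) = C x(t),
\]
the requirement is to find a control u steering the output to a prescribed target y_f at time T and maintaining y(t) = y_f for all t ≥ T.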
In this context, and with the aim of improving animal welfare, genetic selection should be favoured in order to improve the behaviour of sows and to facilitate [r]
2 The stochastic planning model
We consider a finite-horizon, discrete-time model with horizon N, where the time intervals (or slots) have a finite duration δ and are identified by an integer t. Each slot t has a maximum processing capacity of C_t tonnes and a setup cost σ_t incurred as soon as a treatment is executed, whose
In [3], it has been shown that the epigraph of the value function ϑ can always be described by an auxiliary optimal control problem without state constraints, for which the value functi[r]
The means of this functional exploration and of this structural diversification will depend critically on the capacity to import into cells the heterod[r]