Batch Mode Reinforcement Learning based on the Synthesis of Artificial Trajectories


(1)

Batch Mode Reinforcement Learning based on

the Synthesis of Artificial Trajectories

R. Fonteneau (1),(2)

Joint work with Susan A. Murphy (3), Louis Wehenkel (2) and Damien Ernst (2)

(1) Inria Lille – Nord Europe, France
(2) University of Liège, Belgium
(3) University of Michigan, USA

December 10th, 2012

(2)

Outline

Batch Mode Reinforcement Learning

– Reinforcement Learning
– Batch Mode Reinforcement Learning
– Objectives
– Main Difficulties & Usual Approach
– Remaining Challenges

A New Approach: Synthesizing Artificial Trajectories

– Formalization
– Artificial Trajectories: What For?

Estimating the Performance of Policies

– Model-free Monte Carlo Estimation
– The MFMC Algorithm
– Theoretical Analysis
– Experimental Illustration

(3)
(4)

Reinforcement Learning

● Reinforcement Learning (RL) aims at finding a policy maximizing received rewards by interacting with the environment

[Diagram: agent–environment loop — the agent sends actions; the environment returns observations and rewards]

(5)

Batch Mode Reinforcement Learning

● All the available information is contained in a batch collection of data

● Batch mode RL aims at computing a (near-)optimal policy from this collection of data

● Examples of BMRL problems: dynamic treatment regimes (inferred from clinical data), marketing optimization (based on customer histories), finance, etc.

[Diagram: batch mode RL — a finite collection of trajectories of the agent (actions, observations, rewards) is turned into a near-optimal decision strategy]

(6)

Batch Mode Reinforcement Learning

[Figure: batch of clinical data — patients 1 … p, treatment decisions over times 0, 1, …, T; question: what is the 'optimal' treatment?]

(8)

Objectives

● Main goal: Finding a "good" policy

● Many associated subgoals:

– Evaluating the performance of a given policy
– Computing performance guarantees
– Computing safe policies
– Choosing how to generate additional transitions
– ...

(20)

Main Difficulties & Usual Approach

● Main difficulties of the batch mode setting:

– Dynamics and reward functions are unknown (and not accessible to simulation)
– The state space and/or the action space are large or continuous
– The environment may be highly stochastic

● Usual approach:

– To combine dynamic programming with function approximators (neural networks, regression trees, SVMs, linear regression over basis functions, etc.)
– Function approximators have two main roles:

● To offer a concise representation of the state-action value function for deriving value / policy iteration algorithms

(25)

Remaining Challenges

● The black-box nature of function approximators may have some unwanted effects:

– hazardous generalization
– difficulties in computing performance guarantees
– inefficient use of optimal trajectories

(26)

A New Approach: Synthesizing Artificial Trajectories

(30)

Formalization

● System dynamics:
● Reward function:
● Performance of a policy:
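The formulas on this slide were images and are lost in the extraction. As a hedged reconstruction, the companion paper (Fonteneau et al., Annals of Operations Research, 2012) uses the following finite-horizon formalization, which the rest of the deck relies on:

x_{t+1} = f(x_t, u_t, w_t), \qquad r_t = \rho(x_t, u_t, w_t), \qquad w_t \sim p_{\mathcal{W}}(\cdot) \ \text{(i.i.d.)}

J^h(x_0) = \mathbb{E}_{w_0, \dots, w_{T-1}}\!\left[ \sum_{t=0}^{T-1} \rho\big(x_t, h(t, x_t), w_t\big) \right]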

(32)

Formalization

● The system dynamics, reward function and disturbance probability distribution are unknown

● Instead, we have access to a sample of one-step system transitions:

(33)

Formalization

Artificial Trajectories

Artificial trajectories are (ordered) sequences of elementary pieces of trajectories:
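Again as a hedged restatement of the paper's notation (the slide's formulas are not recoverable here), the batch of n one-step transitions and an artificial trajectory of length T can be written as

\mathcal{F}_n = \big\{ (x^l, u^l, r^l, y^l) \big\}_{l=1}^{n}, \qquad r^l = \rho(x^l, u^l, w^l), \quad y^l = f(x^l, u^l, w^l)

\big[ (x^{l_0}, u^{l_0}, r^{l_0}, y^{l_0}), \dots, (x^{l_{T-1}}, u^{l_{T-1}}, r^{l_{T-1}}, y^{l_{T-1}}) \big], \qquad l_0, \dots, l_{T-1} \in \{1, \dots, n\}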

(34)

Artificial Trajectories: What For?

● Artificial trajectories can help for:

– Estimating the performance of policies

– Computing performance guarantees

– Computing safe policies

(36)
(37)

Model-free Monte Carlo Estimation

● If the system dynamics and the reward function were accessible to simulation, then Monte Carlo (MC) estimation would allow estimating the performance of h...

(38)

MODEL OR SIMULATOR REQUIRED!

(41)

Model-free Monte Carlo Estimation

● If the system dynamics and the reward function were accessible to simulation, then Monte Carlo (MC) estimation would allow estimating the performance of h

● We propose an approach that mimics MC estimation by rebuilding p artificial trajectories from one-step system transitions

● These artificial trajectories are built so as to minimize the discrepancy (using a distance metric ∆) with a classical MC sample that could be obtained by simulating the system with the policy h; each one-step transition is used at most once

● We average the cumulated returns over the p artificial trajectories to obtain the MFMC estimate of the expected return of h
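To make the rebuilding procedure concrete, here is a minimal Python sketch of an MFMC-style estimator as described above. The function name, the Euclidean form of the distance ∆, and the greedy step-by-step nearest-neighbour selection are illustrative assumptions; the deck only specifies that each transition is used at most once and that the discrepancy with a simulated MC sample is minimized.

```python
import numpy as np

def mfmc_estimate(transitions, policy, x0, T, p):
    """Model-free Monte Carlo-style estimate of the T-stage return of `policy` from x0.

    transitions : list of one-step transitions (x, u, r, y)
    policy      : function (t, x) -> action
    T, p        : horizon and number of artificial trajectories (requires n >= p*T)
    Assumed distance: sum of Euclidean distances on states and actions.
    """
    available = list(range(len(transitions)))      # each transition used at most once
    returns = []
    for _ in range(p):
        x, cumulated = np.atleast_1d(np.asarray(x0, dtype=float)), 0.0
        for t in range(T):
            u = np.atleast_1d(policy(t, x))
            # greedily pick the unused transition whose (x^l, u^l) is closest to (x, u)
            best = min(
                available,
                key=lambda l: np.linalg.norm(np.atleast_1d(transitions[l][0]) - x)
                + np.linalg.norm(np.atleast_1d(transitions[l][1]) - u),
            )
            available.remove(best)
            _, _, r, y = transitions[best]
            cumulated += r                                    # collect the reward of the patched piece
            x = np.atleast_1d(np.asarray(y, dtype=float))     # jump to its end state
        returns.append(cumulated)
    return float(np.mean(returns))                 # average over the p artificial trajectories
```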

(42)
(43)

Example with T = 3, p = 2, n = 8
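As a toy usage of the sketch above in the same T = 3, p = 2, n = 8 setting (the transitions and the policy below are made up for illustration, not the deck's example):

```python
import numpy as np
# reuses mfmc_estimate from the sketch above
rng = np.random.default_rng(0)
transitions = [(rng.normal(size=1), rng.normal(size=1), float(rng.normal()), rng.normal(size=1))
               for _ in range(8)]                  # n = 8 one-step transitions (x, u, r, y)
policy = lambda t, x: -0.5 * x                     # some fixed policy h to evaluate
print(mfmc_estimate(transitions, policy, x0=np.array([-0.5]), T=3, p=2))
```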

(44)–(63)

[Slides 44–63: graphical step-by-step walkthrough of the example — rebuilding the p = 2 artificial trajectories of length T = 3 from the n = 8 one-step transitions]
(64)

Theoretical Analysis

Assumptions

● Lipschitz continuity assumptions:

(65)

Theoretical Analysis

Assumptions

● Distance metric ∆

● k-sparsity: defined from the distance of (x, u) to its k-th nearest neighbor (using the distance ∆) in the sample

(66)

Theoretical Analysis

Assumptions

● The k-sparsity can be seen as the smallest radius such that all ∆-balls in X×U contain at least k elements from the sample
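Since the formulas on these assumption slides are lost in the extraction, here is a hedged LaTeX transcription of assumptions of this type, following the notation of the companion paper (the exact norms and constants used on the slides may differ):

\|f(x,u,w)-f(x',u',w)\| \le L_f\big(\|x-x'\|+\|u-u'\|\big), \qquad |\rho(x,u,w)-\rho(x',u',w)| \le L_\rho\big(\|x-x'\|+\|u-u'\|\big), \qquad \|h(t,x)-h(t,x')\| \le L_h\|x-x'\|

\Delta\big((x,u),(x',u')\big) = \|x-x'\|+\|u-u'\|, \qquad \alpha_k(\mathcal{F}_n) = \sup_{(x,u)\in X\times U} \Delta_k\big((x,u),\mathcal{F}_n\big),

where \Delta_k\big((x,u),\mathcal{F}_n\big) denotes the distance from (x,u) to its k-th nearest neighbor (x^l,u^l) in the sample, for the metric ∆.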

(67)

Theoretical Analysis

Theoretical results

(68)

Theoretical Analysis

Theoretical results

Expected value of the MFMC estimator

Theorem
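The theorem statement was an image in the original deck. Based on the companion paper, the result has (up to the exact constant) the following form: the bias of the MFMC estimator is controlled by the pT-sparsity of the sample,

\Big| \mathbb{E}\big[ M^h_p(\mathcal{F}_n, x_0) \big] - J^h(x_0) \Big| \;\le\; C \, \alpha_{pT}(\mathcal{F}_n),

where C depends only on the horizon T and on the Lipschitz constants L_f, L_\rho, L_h; in particular, the bias vanishes as the sample becomes dense.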

(70)

Theoretical Analysis

Theoretical results

Variance of the MFMC estimator

Theorem
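Again hedged (the exact constants on the slide are not recoverable): based on the companion paper, the variance result is of the form

\sqrt{\operatorname{Var}\big[ M^h_p(\mathcal{F}_n, x_0) \big]} \;\le\; \sqrt{\operatorname{Var}\big[ \text{classical MC estimator with } p \text{ rollouts} \big]} \;+\; 2\,C\,\alpha_{pT}(\mathcal{F}_n),

i.e. the standard deviation of the MFMC estimator exceeds that of the corresponding classical Monte Carlo estimator by at most a term proportional to the pT-sparsity.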

(71)

Experimental Illustration

Benchmark

● Dynamics:
● Reward function:
● Policy to evaluate:
● Other information: p_W(·) is uniform

(72)

Experimental Illustration

Influence of n

[Figure: Monte Carlo estimator vs. Model-free Monte Carlo estimator — simulations for p = 10, n = 100 … 10 000, uniform grid, T = 15, x_0 = −0.5]

(73)

Experimental Illustration

Influence of p

[Figure: Monte Carlo estimator vs. Model-free Monte Carlo estimator — simulations for p = 1 … 100, n = 10 000, uniform grid, T = 15, x_0 = −0.5]

(74)

Experimental Illustration

MFMC vs FQI-PE

● Comparison with the FQI-PE algorithm using k-NN, n = 100, T = 5


(76)

Conclusions

[Summary diagram]

● Stochastic setting — MFMC: estimator of the expected return (bias / variance analysis, illustration); estimator of the VaR

● Deterministic setting (continuous or finite action space) — bounds on the return, CGRL, sampling strategy (convergence, convergence + additional properties, illustrations)

(78)

References

"Batch mode reinforcement learning based on the synthesis of artificial trajectories". R. Fonteneau, S.A. Murphy, L. Wehenkel and D. Ernst. To appear in Annals of Operations Research, 2012.

"Generating informative trajectories by using bounds on the return of control policies". R. Fonteneau, S.A. Murphy, L. Wehenkel and D. Ernst. Proceedings of the Workshop on Active Learning and Experimental Design 2010 (in conjunction with AISTATS 2010), 2-page highlight paper, Chia Laguna, Sardinia, Italy, May 16, 2010.

"Model-free Monte Carlo-like policy evaluation". R. Fonteneau, S.A. Murphy, L. Wehenkel and D. Ernst. In Proceedings of The Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2010), JMLR W&CP 9, pp 217-224, Chia Laguna, Sardinia, Italy, May 13-15, 2010.

"A cautious approach to generalization in reinforcement learning". R. Fonteneau, S.A. Murphy, L. Wehenkel and D. Ernst. Proceedings of The International Conference on Agents and Artificial Intelligence (ICAART 2010), 10 pages, Valencia, Spain, January 22-24, 2010.

"Inferring bounds on the performance of a control policy from a sample of trajectories". R. Fonteneau, S.A. Murphy, L. Wehenkel and D. Ernst. In Proceedings of The IEEE International Symposium on Adaptive Dynamic Programming and

Reinforcement Learning (ADPRL 2009), 7 pages, Nashville, Tennessee, USA, 30 March-2 April, 2009. Acknowledgements to F.R.S – FNRS for its financial support.

(79)
(80)

Estimating the Performance of Policies

Consider again the p artificial trajectories that were rebuilt by the MFMC estimator. The Value-at-Risk of the policy h can be straightforwardly estimated as follows:
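The estimator's formula was an image; a natural empirical version consistent with the description (the risk level b and the exact indexing convention are assumptions) is an empirical quantile of the p rebuilt returns: with R^{(1)} \le \dots \le R^{(p)} the sorted cumulated returns of the p artificial trajectories and b \in (0,1),

\widehat{\operatorname{VaR}}^{\,h}_b(x_0) \;=\; R^{(\lceil b\,p \rceil)}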

(82)

Deterministic Case: Computing Bounds

Bounds from a Single Trajectory

Proposition: given an artificial trajectory, the following bound on the return of the policy holds:
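The proposition's formula is not recoverable from this extraction. As a hedged reconstruction following the deterministic-case papers cited in the References (notation and constants are assumptions): for an artificial trajectory [(x^{l_t}, u^{l_t}, r^{l_t}, y^{l_t})]_{t=0}^{T-1} and an open-loop policy (u_0, \dots, u_{T-1}),

J^{(u_0,\dots,u_{T-1})}(x_0) \;\ge\; \sum_{t=0}^{T-1} \Big[ r^{l_t} - L_{Q_{T-t}} \, \Delta\big( (x^{l_t}, u^{l_t}), (y^{l_{t-1}}, u_t) \big) \Big], \qquad y^{l_{-1}} := x_0,

where the constants L_{Q_{T-t}} depend only on the horizon and on the Lipschitz constants L_f, L_\rho; an upper bound is obtained by adding the penalty terms instead of subtracting them.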

(83)

Deterministic Case: Computing Bounds

Maximal Bounds
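The maximal-bound formula on this slide was also an image; in the spirit of the cited papers (a hedged sketch, not the slide's exact statement), the maximal lower bound maximizes the single-trajectory bound above over all artificial trajectories that can be rebuilt from the sample:

L^{(u_0,\dots,u_{T-1})}(\mathcal{F}_n, x_0) \;=\; \max_{(l_0,\dots,l_{T-1}) \in \{1,\dots,n\}^T} \; \sum_{t=0}^{T-1} \Big[ r^{l_t} - L_{Q_{T-t}} \, \Delta\big( (x^{l_t}, u^{l_t}), (y^{l_{t-1}}, u_t) \big) \Big],

with the maximal upper bound defined symmetrically; both can be computed by dynamic programming without enumerating the n^T sequences.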

(84)

Deterministic Case: Computing Bounds

Tightness of Maximal Bounds

(85)

Inferring Safe Policies

From Lower Bounds to Cautious Policies

● Consider the set of open-loop policies:

● For such policies, bounds can be computed in a similar way

● We can then search for a specific policy for which the associated lower bound is maximized:

● An O(Tn²) algorithm for doing this: the CGRL algorithm (Cautious approach to Generalization in Reinforcement Learning); see the sketch below
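The following Python sketch illustrates how an O(Tn²) dynamic program of this kind can maximize a lower bound over sequences of sample transitions. It is an assumption-laden illustration, not the official CGRL implementation: the bound shape (reward minus a Lipschitz penalty on the jump between consecutive transitions), the Euclidean distance, and the constants LQ[t] passed in by the caller are all taken as given.

```python
import numpy as np

def cgrl_lower_bound(transitions, x0, T, LQ):
    """CGRL-style O(T n^2) dynamic program (illustrative sketch).

    Maximizes, over sequences of T one-step transitions from the sample, a lower bound
    of the form  sum_t [ r^{l_t} - LQ[t] * ||x^{l_t} - y^{l_{t-1}}|| ]  with y^{l_{-1}} = x0.
    LQ is a list of T Lipschitz-derived constants, assumed given.
    Returns the maximal lower bound and the index of the first transition of the
    maximizing sequence, whose action u^{l_0} is the cautious action suggested at time 0.
    """
    n = len(transitions)
    xs = np.array([np.atleast_1d(t[0]) for t in transitions], dtype=float)
    rs = np.array([t[2] for t in transitions], dtype=float)
    ys = np.array([np.atleast_1d(t[3]) for t in transitions], dtype=float)

    # D[l]: best bound over partial sequences ending with transition l at the current stage
    D = rs - LQ[0] * np.linalg.norm(xs - np.atleast_1d(np.asarray(x0, dtype=float)), axis=1)
    first = np.arange(n)                  # first transition of each best partial sequence
    for t in range(1, T):
        # penalty[l_prev, l]: distance between the end state of l_prev and the start state of l
        penalty = np.linalg.norm(ys[:, None, :] - xs[None, :, :], axis=2)
        scores = D[:, None] + rs[None, :] - LQ[t] * penalty      # n x n stage update
        best_prev = np.argmax(scores, axis=0)
        D = scores[best_prev, np.arange(n)]
        first = first[best_prev]
    l_star = int(np.argmax(D))
    return float(D[l_star]), int(first[l_star])
```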

(86)

Inferring Safe Policies

Convergence

(88)

Inferring Safe Policies

Experimental Results

[Figure: CGRL vs. FQI (Fitted Q Iteration) — one case where the state space is uniformly covered by the sample, and one where information about the Puddle area is removed]

(89)

Inferring Safe Policies

Bonus

(90)

Sampling Strategies

An Artificial Trajectories Viewpoint

● Given a sample of system transitions, how can we determine where to sample additional transitions?

● We define the set of candidate optimal policies:

● A transition is said to be compatible with the set of candidate optimal policies if it satisfies a compatibility condition, and we denote the set of all such compatible transitions accordingly

(91)

Sampling Strategies

● Iterative scheme:

● Conjecture:

(92)

Sampling Strategies

Illustration

● Action space:

● Dynamics and reward function:

● Horizon:

● Initial state:

● Total number of policies:

● Number of transitions:

(93)

Connection to Classic Batch Mode RL

Towards a New Paradigm for Batch Mode RL

[Figure: tree of transition indices — l_1, …, l_k; l_{1,1}, …, l_{k,k}; …; l_{1,1,…,1}, …, l_{k,k,…,k}]

(94)

Connection to Classic Batch Mode RL

Towards a New Paradigm for Batch Mode RL

● The k-NN FQI-PE algorithm:
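The slide's formula for the k-NN FQI-PE algorithm was an image. The sketch below shows a generic Fitted Q Iteration policy evaluation with k-NN averaging, which is presumably what the formula specifies; the Euclidean distance on (x, u) and the stationary-policy signature are assumptions, and the tree figure above corresponds to the branching of the k-NN averages across iterations.

```python
import numpy as np

def knn_fqi_pe(transitions, policy, x0, T, k):
    """Fitted Q Iteration policy evaluation with k-NN averaging (illustrative sketch).

    Iterates  Q_N(x, u) = mean over the k transitions (x^l, u^l) nearest to (x, u) of
    [ r^l + Q_{N-1}(y^l, policy(y^l)) ],  with Q_0 = 0, and returns Q_T(x0, policy(x0)).
    """
    xs = np.array([np.atleast_1d(t[0]) for t in transitions], dtype=float)
    us = np.array([np.atleast_1d(t[1]) for t in transitions], dtype=float)
    rs = np.array([t[2] for t in transitions], dtype=float)
    ys = np.array([np.atleast_1d(t[3]) for t in transitions], dtype=float)

    def knn(x, u):
        d = np.linalg.norm(xs - x, axis=1) + np.linalg.norm(us - np.atleast_1d(u), axis=1)
        return np.argsort(d)[:k]

    v = np.zeros(len(rs))                 # v[l] approximates Q_N(y^l, policy(y^l))
    for _ in range(T - 1):
        new_v = np.empty_like(v)
        for l in range(len(rs)):
            idx = knn(ys[l], policy(ys[l]))
            new_v[l] = np.mean(rs[idx] + v[idx])
        v = new_v
    x0 = np.atleast_1d(np.asarray(x0, dtype=float))
    idx = knn(x0, policy(x0))
    return float(np.mean(rs[idx] + v[idx]))
```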
