
11.3 Temporal Discounting and Optimal Policies

In delayed reinforcement learning, one often assumes that rewards in the distant future are not as valuable as more immediate rewards. This preference can be accommodated by a temporal discount factor, $0 \le \gamma < 1$.

The present value of a reward, $r_i$, occurring $i$ time units in the future, is taken to be $\gamma^i r_i$. Suppose we have a policy, $\pi(X)$, that maps input vectors into actions, and let $r_i(X)$ be the reward that will be received on the $i$-th time step after one begins executing policy $\pi$ starting in state $X$. Then the total reward accumulated over all time steps by policy $\pi$ beginning in state $X$ is:

$$V^{\pi}(X) = \sum_{i=0}^{\infty} \gamma^i\, r_i(X)$$

Figure 11.3: An Optimal Policy in the Grid World (figure not reproduced here; a grid with cells marked R and G).

One reason for using a temporal discount factor is so that the above sum will be finite. An optimal policy is one that maximizes $V^{\pi}(X)$ for all inputs, $X$.
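To make the role of the discount factor concrete, the following minimal Python sketch (the constant reward stream and the value of $\gamma$ are illustrative assumptions, not taken from the text) shows the discounted sum approaching its bound $R/(1-\gamma)$ for bounded rewards $|r_i| \le R$:

# Present value of a (truncated) reward stream under discounting.
def discounted_return(rewards, gamma=0.9):
    """Sum of gamma**i * r_i over the given reward sequence."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

# With r_i = 1 for every step, the discounted sum approaches 1 / (1 - 0.9) = 10.
print(discounted_return([1.0] * 1000, gamma=0.9))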

In general, we want to consider the case in which the rewards, $r_i$, are random variables and in which the effects of actions on environmental states are random. In Markovian environments, for example, the probability that action $a$ in state $X_i$ will lead to state $X_j$ is given by a transition probability $p[X_j \mid X_i, a]$. Then, we will want to maximize expected future reward and would define $V^{\pi}(X)$ as:

$$V^{\pi}(X) = E\!\left[\sum_{i=0}^{\infty} \gamma^i\, r_i(X)\right]$$

In either case, we call $V^{\pi}(X)$ the value of policy $\pi$ for input $X$.

If the action prescribed by $\pi$, taken in state $X$, leads to state $X'$ (randomly, according to the transition probabilities), then we can write $V^{\pi}(X)$ in terms of $V^{\pi}(X')$ as follows:

$$V^{\pi}(X) = r[X, \pi(X)] + \gamma \sum_{X'} p[X' \mid X, \pi(X)]\, V^{\pi}(X')$$

where (in summary):

$\gamma$ = the discount factor,

$V^{\pi}(X)$ = the value of state $X$ under policy $\pi$,

$r[X, \pi(X)]$ = the expected immediate reward received when we execute the action prescribed by $\pi$ in state $X$, and

$p[X' \mid X, \pi(X)]$ = the probability that the environment transitions to state $X'$ when we execute the action prescribed by $\pi$ in state $X$.

In other words, the value of state $X$ under policy $\pi$ is the expected value of the immediate reward received when executing the action recommended by $\pi$, plus the average value (under $\pi$, discounted by $\gamma$) of all of the states accessible from $X$. For an optimal policy, $\pi^*$ (and no others!), we have the famous "optimality equation":

$$V^{*}(X) = \max_a \left[ r[X, a] + \gamma \sum_{X'} p[X' \mid X, a]\, V^{*}(X') \right]$$

The theory of dynamic programming (DP) [Bellman, 1957, Ross, 1983] assures us that there is at least one optimal policy, $\pi^*$, that satisfies this equation. DP also provides methods for calculating $V^{*}(X)$ and at least one $\pi^*$, assuming that we know the average rewards and the transition probabilities. If we knew the transition probabilities, the average rewards, and $V^{*}(X)$ for all $X$ and $a$, then it would be easy to implement an optimal policy. We would simply select that $a$ that maximizes $r(X, a) + \gamma \sum_{X'} p[X' \mid X, a]\, V^{*}(X')$. That is,

$$\pi^*(X) = \arg\max_a \left[ r(X, a) + \gamma \sum_{X'} p[X' \mid X, a]\, V^{*}(X') \right]$$
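If the average rewards, the transition probabilities, and $V^{*}$ were all known, this greedy selection is straightforward to implement. The sketch below assumes integer-coded states with tables R[x][a] and P[x][a][x2] and a value list V; these data structures are illustrative, not part of the text:

# Greedy action selection from a known model and known optimal values.
def greedy_action(x, actions, R, P, V, gamma=0.9):
    """Return the action maximizing r(X, a) + gamma * sum_X' p[X'|X, a] * V(X')."""
    n = len(V)
    return max(actions,
               key=lambda a: R[x][a] + gamma * sum(P[x][a][x2] * V[x2] for x2 in range(n)))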

But, of course, we are assuming that we do not know these average rewards nor the transition probabilities, so we have to find a method that effectively learns them.

If we had a model of actions, that is, if we knew for every state, $X$, and action $a$ which state $X'$ resulted, then we could use a method called value iteration to find an optimal policy. Value iteration works as follows: We begin by assigning, randomly, an estimated value $\hat{V}(X)$ to every state, $X$. On the $i$-th step of the process, suppose we are at state $X_i$ (that is, our input on the $i$-th step is $X_i$), and that the estimated value of state $X_i$ on the $i$-th step is $\hat{V}_i(X_i)$. We then select the action $a$ that maximizes the estimated value of the predicted subsequent state. Suppose the subsequent state having the highest estimated value is $X'_i$. Then we update the estimated value, $\hat{V}_i(X_i)$, of state $X_i$ as follows:

$$\hat{V}_i(X) = \begin{cases} (1 - c_i)\,\hat{V}_{i-1}(X) + c_i\left[ r_i + \gamma \hat{V}_{i-1}(X'_i) \right] & \text{if } X = X_i, \\ \hat{V}_{i-1}(X) & \text{otherwise.} \end{cases}$$

We see that this adjustment moves the value of $\hat{V}_i(X_i)$ an increment (depending on $c_i$) closer to $\left[ r_i + \gamma \hat{V}_i(X'_i) \right]$. Assuming that $\hat{V}_i(X'_i)$ is a good estimate for $V_i(X'_i)$, then this adjustment helps to make the two estimates more consistent. Providing that $0 < c_i < 1$ and that we visit each state infinitely often, this process of value iteration will converge to the optimal values.
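A minimal sketch of one step of this incremental value iteration follows; the deterministic model functions next_state(x, a) and reward(x, a), and the dictionary V of estimated values, are illustrative assumptions standing in for the model of actions mentioned above:

# One step of value iteration with a deterministic model of actions.
def value_iteration_step(V, x, actions, next_state, reward, gamma, c):
    # Select the action whose predicted successor state has the highest estimated value.
    a = max(actions, key=lambda act: V[next_state(x, act)])
    x_next = next_state(x, a)
    # Move V[x] a fraction c of the way toward r + gamma * V[x_next].
    V[x] = (1.0 - c) * V[x] + c * (reward(x, a) + gamma * V[x_next])
    return x_next   # the state visited on the next step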

Discuss synchronous dynamic programming, asynchronous dynamic programming, and policy iteration.

11.4 Q-Learning

Watkins [Watkins, 1989] has proposed a technique that he calls incremental dynamic programming. Let $\pi_a$ stand for the policy that chooses action $a$ once, and thereafter chooses actions according to policy $\pi$. We define:

$$Q^{\pi}(X, a) = V^{\pi_a}(X)$$

Then the optimal value from state $X$ is given by:

$$V^{*}(X) = \max_a Q^{*}(X, a)$$

This equation holds only for an optimal policy, $\pi^*$. The optimal policy is given by:

$$\pi^*(X) = \arg\max_a Q^{*}(X, a)$$

Note that if an action $a$ makes $Q^{\pi}(X, a)$ larger than $V^{\pi}(X)$, then we can improve $\pi$ by changing it so that $\pi(X) = a$. Making such a change is the basis for a powerful learning rule that we shall describe shortly.
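A minimal sketch of that improvement step, assuming tabular values stored in Python dictionaries (Q keyed by (state, action) pairs, V_pi and pi keyed by state; this layout is an illustrative assumption):

# One policy-improvement pass: switch pi(X) to any action whose Q value
# exceeds the current value of X under pi.
def improve_policy(Q, V_pi, pi, actions):
    changed = False
    for x in pi:
        best = max(actions, key=lambda a: Q[(x, a)])
        if Q[(x, best)] > V_pi[x]:
            pi[x] = best
            changed = True
    return changed   # True if the policy was changed anywhere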

Suppose action $a$ in state $X$ leads to state $X'$. Then, using the definitions of $Q$ and $V$, it is easy to show that:

$$Q^{\pi}(X, a) = r(X, a) + \gamma\, E\!\left[ V^{\pi}(X') \right]$$

where $r(X, a)$ is the average value of the immediate reward received when we execute action $a$ in state $X$. For an optimal policy (and no others), we have another version of the optimality equation in terms of $Q$ values:

$$Q^{*}(X, a) = r(X, a) + \gamma\, E\!\left[ \max_{a'} Q^{*}(X', a') \right]$$

for all actions, $a$, and states, $X$. Now, if we had the optimal $Q$ values (for all $a$ and $X$), then we could implement an optimal policy simply by selecting the action that maximizes $r(X, a) + \gamma\, E\!\left[ \max_{a'} Q^{*}(X', a') \right]$.

That is,

$$\pi^*(X) = \arg\max_a \left[ r(X, a) + \gamma\, E\!\left[ \max_{a'} Q^{*}(X', a') \right] \right]$$

Watkins' proposal amounts to a TD(0) method of learning the $Q$ values.

We quote (with minor notational changes) from [Watkins & Dayan, 1992, page 281]:

"In $Q$-Learning, the agent's experience consists of a sequence of distinct stages or episodes. In the $i$-th episode, the agent:

observes its current state $X_i$,

selects [using the method described below] and performs an action $a_i$,

observes the subsequent state $X'_i$,

receives an immediate reward $r_i$, and

adjusts its $Q_{i-1}$ values using a learning factor $c_i$, according to:

$$Q_i(X, a) = \begin{cases} (1 - c_i)\,Q_{i-1}(X, a) + c_i\left[ r_i + \gamma V_{i-1}(X'_i) \right] & \text{if } X = X_i \text{ and } a = a_i, \\ Q_{i-1}(X, a) & \text{otherwise,} \end{cases}$$

where

$$V_{i-1}(X') = \max_b \left[ Q_{i-1}(X', b) \right]$$

is the best the agent thinks it can do from state $X'$. ...

The initial $Q$ values, $Q_0(X, a)$, for all states and actions are assumed given."

Using the current $Q$ values, $Q_i(X, a)$, the agent always selects the action that maximizes $Q_i(X, a)$. Note that only the $Q$ value corresponding to the state just exited and the action just taken is adjusted. And that $Q$ value is adjusted so that it is closer (by an amount determined by $c_i$) to the sum of the immediate reward plus the discounted maximum (over all actions) of the $Q$ values of the state just entered. If we imagine the $Q$ values to be predictions of ultimate (infinite horizon) total reward, then the learning procedure described above is exactly a TD(0) method of learning how to predict these $Q$ values. $Q$ learning strengthens the usual TD methods, however, because TD (applied to reinforcement problems using value iteration) requires a one-step lookahead, using a model of the effects of actions, whereas $Q$ learning does not.
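The quoted procedure is compact enough to state in code. The sketch below is a minimal tabular version; the environment interface (env.reset() returning a state, env.step(a) returning the next state, the reward, and a termination flag), the constant learning factor c, and the dictionary-based Q table are illustrative assumptions rather than part of Watkins' formulation:

# Tabular Q-learning following the quoted update rule.
def q_learning(env, actions, episodes=1000, gamma=0.9, c=0.1):
    Q = {}                                       # Q[(X, a)]; unseen entries default to 0
    q = lambda x, a: Q.get((x, a), 0.0)
    for _ in range(episodes):
        x = env.reset()
        done = False
        while not done:
            # The text has the agent act greedily on its current Q values; in practice
            # some exploration is added so that every (X, a) pair keeps being tried.
            a = max(actions, key=lambda act: q(x, act))
            x_next, r, done = env.step(a)
            v_next = max(q(x_next, b) for b in actions)   # V(X') = max_b Q(X', b)
            Q[(x, a)] = (1.0 - c) * q(x, a) + c * (r + gamma * v_next)
            x = x_next
    return Q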

A convenient notation (proposed by [Schwartz, 1993]) for representing the change in $Q$ value is:

$$Q(X, a) \xleftarrow{\ \beta\ } r + \gamma V(X')$$

where $Q(X, a)$ is the new $Q$ value for input $X$ and action $a$, $r$ is the immediate reward when action $a$ is taken in response to input $X$, $V(X')$ is the maximum (over all actions) of the $Q$ value of the state next reached when action $a$ is taken from state $X$, and $\beta$ is the fraction of the way toward which the new $Q$ value, $Q(X, a)$, is adjusted to equal $r + \gamma V(X')$.

Watkins and Dayan [Watkins & Dayan, 1992] prove that, under certain conditions, the $Q$ values computed by this learning procedure converge to optimal ones (that is, to ones on which an optimal policy can be based).

We define $n^i(X, a)$ as the index (episode number) of the $i$-th time that action $a$ is tried in state $X$. Then, we have:

Theorem 11.1 (Watkins and Dayan) For Markov problems with states $\{X\}$ and actions $\{a\}$, and given bounded rewards $|r_n| \le R$, learning rates $0 \le c_n < 1$, and

$$\sum_{i=0}^{\infty} c_{n^i(X,a)} = \infty, \qquad \sum_{i=0}^{\infty} \left[ c_{n^i(X,a)} \right]^2 < \infty$$

for all $X$ and $a$, then

$Q_n(X, a) \rightarrow Q^{*}(X, a)$ as $n \rightarrow \infty$, for all $X$ and $a$, with probability 1, where $Q^{*}(X, a)$ corresponds to the $Q$ values of an optimal policy.
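One common way to meet both step-size conditions is to let $c$ decay like $1/k$ in the number of times the particular state-action pair has been updated, since $\sum_k 1/k$ diverges while $\sum_k 1/k^2$ converges. The bookkeeping below is an illustrative sketch, not part of the theorem:

from collections import defaultdict

# Number of updates made so far to each (state, action) pair.
visit_counts = defaultdict(int)

def learning_rate(x, a):
    """Return c for the next update of Q(x, a), following a 1/k schedule."""
    visit_counts[(x, a)] += 1
    return 1.0 / visit_counts[(x, a)]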

Again, we quote from [Watkins & Dayan, 1992, page 281]:

"The most important condition implicit in the convergence theorem ... is that the sequence of episodes that forms the basis of learning must include an infinite number of episodes for each starting state and action. This may be considered a strong condition on the way states and actions are selected; however, under the stochastic conditions of the theorem, no method could be guaranteed to find an optimal policy under weaker conditions.

Note, however, that the episodes need not form a continuous sequence; that is, the $X'$ of one episode need not be the $X$ of the next episode."

The relationships among $Q$ learning, dynamic programming, and control are very well described in [Barto, Bradtke, & Singh, 1994]. $Q$ learning is best thought of as a stochastic approximation method for calculating the $Q$ values. Although the definition of the optimal $Q$ values for any state depends recursively on the expected values of the $Q$ values for subsequent states (and on the expected values of rewards), no expected values are explicitly computed by the procedure. Instead, these values are approximated by iterative sampling using the actual stochastic mechanism that produces successor states.

11.5 Discussion, Limitations, and