• Aucun résultat trouvé

Performance bound for Approximate Optimistic Policy Iteration

N/A
N/A
Protected

Academic year: 2023

Partager "Performance bound for Approximate Optimistic Policy Iteration"

Copied!
5
0
0

Texte intégral

(1)

HAL Id: inria-00480952

https://hal.inria.fr/inria-00480952

Submitted on 5 May 2010

HAL

is a multi-disciplinary open access archive for the deposit and dissemination of sci- entific research documents, whether they are pub- lished or not. The documents may come from teaching and research institutions in France or

L’archive ouverte pluridisciplinaire

HAL, est

destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires

Performance bound for Approximate Optimistic Policy Iteration

Bruno Scherrer, Christophe Thiery

To cite this version:

Bruno Scherrer, Christophe Thiery. Performance bound for Approximate Optimistic Policy Iteration.

[Technical Report] 2010. �inria-00480952�

(2)

Performance bound for Approximate Optimistic Policy Iteration

Christophe Thiery, Bruno Scherrer

We provide here a proof of the performance bound theorem published in Thiery and Scherrer (2010). This theorem applies to Least-SquaresλPolicy It- eration and more generally approximate, optimistic Policy Iteration algorithms.

Theorem 1 (Performance bound for Approximate Optimistic Policy Iteration) Letn)n≥1be a sequence of positive weights such thatP

n≥1λn = 1. LetQ0

be an arbitrary initialization. We consider an iterative algorithm that generates the sequencek, Qk)k≥1 with

πk+1←greedy(Qk), Qk+1←X

n≥1

λn(Bπk+1)nQkk+1.

ǫk+1 is the approximation error made when estimating the next value function.

Letǫ be a uniform majoration of that error, i.e. for allk,kǫkk≤ǫ. Then lim sup

k→∞

kQ−Qπkk≤ 2γ (1−γ)2ǫ.

Proof

Notations and main idea of the proof We will use the following notations:

• bk =Qk−Bπk+1Qk is the Bellman error,

• dk=Q−(Qk−ǫk) is the difference between the optimal value function and theQk iterate (before error),

• sk =Qk−ǫk−Qπk is the difference between theQk iterate (before error) and the (true) value of the policyπk,

• β=P

n≥1λnγn (note that 0≤β ≤γ).

(3)

The distance between the value of the optimal policy and the value of the current policy can be formulated as

kQ−Qπkk = max(Q−Qπk)

= max(Q−Qkk+Qk−ǫk−Qπk)

= max(dk+sk)

≤maxdk+ maxsk (1)

The idea of the proof is to compute upper bounds on dk and sk. As we will see, the bounds we will obtain will both depend on an upper on the Bellman errorbk, that we derive first.

An upper bound on the Bellman error bk: Asπk+1 is the greedy policy with respect toQk, we haveBπkQk ≤Bπk+1Qk, which allows us to write

bk = Qk−Bπk+1Qk

= Qk−BπkQk+BπkQk−Bπk+1Qk

≤ Qk−BπkQk

= (Qk−ǫkk)−Bπk(Qk−ǫkk)

= (Qk−ǫk)−Bπk(Qk−ǫk) +ǫk−γPπkǫk

= X

n≥1

λn[(Bπk)nQk−1]−X

n≥1

λn

(Bπk)n+1Qk−1

+ (I−γPπkk

= X

n≥1

λn

(Bπk)nQk−1]−(Bπk)n+1Qk−1

+ (I−γPπkk

= X

n≥1

λn(γPπk)n(Qk−1−BπkQk−1) + (I−γPπkk

= X

n≥1

λn(γPπk)nbk−1+ (I−γPπkk.

By using the fact thatPπk is a stochastic matrix, we have maxbk ≤X

n≥1

λnγnmaxbk−1+ (1 +γ)ǫ=βmaxbk−1+ (1 +γ)ǫ.

We then deduce by induction that maxbk

k−1

X

j=0

βj(1 +γ)ǫ+βkmaxb0= 1 +γ

1−βǫ+O(γk). (2) An upper bound ondk : Let us now consider thedkterm and its evolution.

dk+1 =Q−(Qk+1−ǫk+1)

=Q−X

n≥1

λn(Bπk+1)nQk

= X

n≥1

λn

Q−(Bπk+1)nQk

. (3)

(4)

Sinceπk+1is the greedy policy with respect toQk, we haveBπQk ≤Bπk+1Qk. Therefore

Q−(Bπk+1)nQk

=BπQ−BπQk+BπQk−Bπk+1Qk+Bπk+1Qk

−(Bπk+1)2Qk+ (Bπk+1)2Qk−. . .+ (Bπk+1)n−1Qk−(Bπk+1)nQk

≤BπQ−BπQk+γPπk+1(Qk−Bπk+1Qk) +

+(γPπk+1)2(Qk−Bπk+1Qk) +. . .+ (γPπk+1)n−1(Qk−Bπk+1Qk)

=γPπ(Q−Qk) + +

γPπk+1+ (γPπk+1)2+. . .+ (γPπk+1)n−1

(Qk−Bπk+1Qk)

=γPπ(Q−(Qk−ǫk))−γPπǫk+ +

γPπk+1+ (γPπk+1)2+. . .+ (γPπk+1)n−1

(Qk−Bπk+1Qk)

=γPπdk−γPπǫk+

γPπk+1+ (γPπk+1)2+. . .+ (γPπk+1)n−1 bk. AsPπ andPπk+1 are stochastic matrices, we deduce

max[Q−(Bπk+1)nQk] ≤γmaxdk+γǫ+ (γ+γ2+. . .+γn−1) maxbk

=γmaxdk+γǫ+γ−γn

1−γ maxbk. By using Equation 3, we obtain the following induction on maxdk:

maxdk+1≤γmaxdk+γǫ+X

n≥1

λn

γ−γn 1−γ maxbk

.

With the help of the Bellman error upper bound obtained earlier (Equation 2) we obtain

maxdk+1 ≤γmaxdk+γǫ+X

n≥1

λn

γ−γn (1−γ)(1−β)

(1 +γ)ǫ+O(γk)

=γmaxdk+γǫ+ γ−β

(1−γ)(1−β)(1 +γ)ǫ+O(γk) which gives, by taking the limit superior,

lim sup

k→∞

maxdk ≤ γ 1−γǫ+

γ−β (1−γ)2(1−β)

(1 +γ)ǫ. (4) An upper bound onsk : Let us now consider theskterm from Equation 1:

sk+1 =Qk+1−ǫk+1−Qπk+1

= X

n≥1

λn

(Bπk+1)nQk

−(Bπk+1)Qk

= X

n≥1

λn

(Bπk+1)nQk−(Bπk+1)Qk

. (5)

(5)

It can be seen that

(Bπk+1)nQk−(Bπk+1)Qk

= (Bπk+1)nQk−(Bπk+1)n+1Qk+ (Bπk+1)n+1Qk−(Bπk+1)n+2Qk+. . .

= (γPπk+1)n(Qk−Bπk+1Qk) + (γPπk+1)n+1(Qk−Bπk+1Qk) +. . .

= (γPπk+1)n[I+γPπk+1+ (γPπk+1)2+. . .]bk. As above, by using the stochasticity ofPπk+1, we obtain

max[(Bπk+1)nQk−(Bπk+1)Qk]≤γn(1 +γ+γ2+. . .) maxbk = γn

1−γmaxbk. By using Equation 5, we obtain an upper bound on maxsk+1:

maxsk+1≤ 1 1−γ

 X

n≥1

λnγnmaxbk

.

With the help of the Bellman error upper bound (Equation 2) and by taking the limit superior, we have

lim sup

k→∞

maxsk ≤ 1 1−γ

 X

m≥1

λnγn1 +γ 1−βǫ

= β

(1−γ)(1−β)(1 +γ)ǫ. (6) Conclusion of the proof Finally, let us get back to Equation 1 and use the upper bounds we just derived fordk (Equation 4) andsk (Equation 6):

lim sup

k→∞

kQ−Qπkk ≤lim sup

k→∞

maxdk+ lim sup

k→∞

maxsk

= γ

1−γǫ+

γ−β

(1−γ)2(1−β)+ β (1−γ)(1−β)

(1 +γ)ǫ.

= γ

1−γǫ+

γ−β+ (1−γ)β (1−γ)2(1−β)

(1 +γ)ǫ.

= γ

1−γǫ+ γ

(1−γ)2

(1 +γ)ǫ.

= γ(1−γ) +γ(1 +γ) (1−γ)2 ǫ

= 2γ

(1−γ)2ǫ.

References

Thiery, C. and B. Scherrer (2010). Least-squares λ policy iteration: Bias- variance trade-off in control problems. InICML’10: Proceedings of the 27th Annual International Conference on Machine Learning.

Références

Documents relatifs

To test whether the vesicular pool of Atat1 promotes the acetyl- ation of -tubulin in MTs, we isolated subcellular fractions from newborn mouse cortices and then assessed

Néanmoins, la dualité des acides (Lewis et Bronsted) est un système dispendieux, dont le recyclage est une opération complexe et par conséquent difficilement applicable à

Cette mutation familiale du gène MME est une substitution d’une base guanine par une base adenine sur le chromosome 3q25.2, ce qui induit un remplacement d’un acide aminé cystéine

En ouvrant cette page avec Netscape composer, vous verrez que le cadre prévu pour accueillir le panoramique a une taille déterminée, choisie par les concepteurs des hyperpaysages

Chaque séance durera deux heures, mais dans la seconde, seule la première heure sera consacrée à l'expérimentation décrite ici ; durant la seconde, les élèves travailleront sur

A time-varying respiratory elastance model is developed with a negative elastic component (E demand ), to describe the driving pressure generated during a patient initiated

The aim of this study was to assess, in three experimental fields representative of the various topoclimatological zones of Luxembourg, the impact of timing of fungicide

Attention to a relation ontology [...] refocuses security discourses to better reflect and appreciate three forms of interconnection that are not sufficiently attended to