
(1)

An overview of ℓ1-regularization for value function approximation

Journée GDR Robotique

Matthieu Geist
(Supélec, IMS-MaLIS Research Group)
matthieu.geist@supelec.fr

September 6th, 2012


(2)

1. Introduction: MDP Setting
2. Background: LSTD, ℓ1-regularization
3. Algorithms: Lasso-TD, ℓ1-PBR, ℓ1-LSTD, Dantzig-LSTD
4. Summary

(3)


MDP Setting

(Bertsekas & Tsitsiklis, 1996; Sutton & Barto, 1998)

Controller: a_t = π(s_t)
Reward: r_t = r(s_t, a_t)
System dynamics: s_{t+1} ∼ P(·|s_t, a_t)

Goal: given a policy π : S → A, compute its value

  v^π(s) = E[ Σ_{t≥0} γ^t r(s_t, π(s_t)) | s_0 = s, π ],  with 0 < γ < 1.


(4)


MDP Setting

The value function v satisfies the Bellman equation v = r + γ Pv ⇔ v = T v

We look for v̂(s) = Σ_{j=1}^p θ_j φ_j(s), i.e. v̂ = Φθ, where

  Φ = (φ(s_1)^T ; … ; φ(s_|S|)^T) = (φ_1 … φ_p)  and  θ = (θ_1 ; … ; θ_p).

T is only known through samples (s_i, r_i, s_i')_{i=1}^n, where s_i ∼ μ:

  Φ̃ = (φ(s_1)^T ; … ; φ(s_n)^T),  Φ̃' = (φ(s_1')^T ; … ; φ(s_n')^T),  r̃ = (r_1 ; … ; r_n).

We consider the situation where n ≪ p.
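Aside (not from the slides): before introducing function approximation, recall that for a small tabular MDP the value of a fixed policy can be computed exactly from the Bellman equation, which is a useful reference point. A minimal numpy sketch with made-up numbers:

```python
import numpy as np

# Toy 3-state chain under a fixed policy (made-up numbers, for illustration only).
# P[s, s'] is the transition probability under the policy, r[s] the expected reward.
gamma = 0.9
P = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.8, 0.2],
              [0.1, 0.0, 0.9]])
r = np.array([0.0, 1.0, -0.5])

# Bellman equation v = r + gamma * P v  <=>  (I - gamma * P) v = r.
v = np.linalg.solve(np.eye(3) - gamma * P, r)
print(v)
```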


(6)


LSTD

(Bradtke & Barto, 1996)

One looks for v̂ in the feature space satisfying v̂ = Π_μ T v̂.

The solution v̂ = Φθ_0 can be characterized as

  ω_θ = arg min_ω ‖r + γPΦθ − Φω‖_{μ,2}^2
  θ_0 = arg min_θ ‖Φθ − Φω_θ‖_{μ,2}^2

and approximated by its empirical counterpart:

  ω_θ = arg min_ω ‖r̃ + γΦ̃'θ − Φ̃ω‖_2^2
  θ_0 = arg min_θ ‖Φ̃θ − Φ̃ω_θ‖_2^2

(7)


LSTD (alternate writing)

We look for v̂ = Φθ_0 that solves v̂ = Π_μ T v̂. It is known that Π_μ = Φ(Φ^T D_μ Φ)^{-1} Φ^T D_μ, and (after some algebra) one can show that

  θ_0 = A^{-1} b  (linear system of size p)  ⇔  θ_0 = arg min_θ ‖Aθ − b‖_2^2,

where

  A = Φ^T D_μ (I − γP)Φ,  b = Φ^T D_μ r,

which can be approximated through their empirical counterparts:

  Ã = (1/n) Φ̃^T (Φ̃ − γΦ̃'),  b̃ = (1/n) Φ̃^T r̃.

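Aside (not from the slides): the empirical LSTD solve above takes a few lines of numpy. The sketch below assumes a feature map `phi` and a batch of sampled transitions; all names are illustrative.

```python
import numpy as np

def lstd(phi, states, rewards, next_states, gamma=0.95):
    """Empirical LSTD: build A_tilde, b_tilde from n transitions and solve for theta.

    phi: callable mapping a state to a feature vector of size p (illustrative).
    """
    Phi = np.array([phi(s) for s in states])            # n x p, i.e. Phi tilde
    Phi_next = np.array([phi(s) for s in next_states])  # n x p, i.e. Phi tilde prime
    r = np.asarray(rewards, dtype=float)
    n = len(r)

    A = Phi.T @ (Phi - gamma * Phi_next) / n            # p x p
    b = Phi.T @ r / n                                    # p
    # lstsq is a little safer than solve when A is ill-conditioned (few samples).
    theta, *_ = np.linalg.lstsq(A, b, rcond=None)
    return theta
```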

(8)


LSTD (another alternate writing)

Recall the first writing:

  ω_θ = arg min_ω ‖r̃ + γΦ̃'θ − Φ̃ω‖_2^2
  θ_0 = arg min_θ ‖Φ̃θ − Φ̃ω_θ‖_2^2

The second equation is minimized to zero for θ = ω_θ (the fixed point of Π_μ T exists). Therefore, the system rewrites as

  θ_0 = arg min_ω ‖r̃ + γΦ̃'θ_0 − Φ̃ω‖_2^2

(a fixed-point equation: θ_0 appears on both sides).

(9)

Least-squares regression and ℓ2-penalization

Regression (γ = 0): find θ such that r̃ ≈ Φ̃θ; the natural (least-squares) objective is

  θ_0 = arg min_θ ‖r̃ − Φ̃θ‖_2^2

analytical solution: θ_0 = (Φ̃^T Φ̃)^{-1} Φ̃^T r̃
this is an empirical projection: Φ̃θ_0 = Π̃_μ r̃ (so Π̃_μ = Φ̃(Φ̃^T Φ̃)^{-1} Φ̃^T)

ℓ2-penalized LS:

  θ_λ = arg min_θ ‖r̃ − Φ̃θ‖_2^2 + λ‖θ‖_2^2

analytical solution: θ_λ = (Φ̃^T Φ̃ + λI)^{-1} Φ̃^T r̃
this is a penalized empirical projection: Φ̃θ_λ = Π̃_μ^{(λ,2)} r̃


(10)

ℓ1-penalization

Lasso (Tibshirani, 1996) (ℓ1-penalized LS):

  θ_λ = arg min_θ ‖r̃ − Φ̃θ‖_2^2 + λ‖θ‖_1

promotes sparsity
no analytical solution, but efficient computation through the regularization path (θ_λ is piecewise linear in λ) (Efron et al., 2004)
this is a penalized empirical projection: Φ̃θ_λ = Π̃_μ^{(λ,1)} r̃

Dantzig Selector (Candes & Tao, 2007):

  θ_λ = arg min_θ ‖θ‖_1 subject to ‖Φ̃^T(r̃ − Φ̃θ)‖_∞ ≤ λ

promotes sparsity
this is actually a linear program

Many (Lasso) extensions exist, but let's keep things simple...
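Aside (not from the slides): a plain lasso fit on synthetic regression data illustrates the sparsity-inducing effect. scikit-learn is used purely for convenience; any lasso solver would do.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 50, 200                        # fewer samples than features
theta_true = np.zeros(p)
theta_true[:5] = rng.normal(size=5)   # only 5 truly relevant features
X = rng.normal(size=(n, p))
y = X @ theta_true + 0.1 * rng.normal(size=n)

# sklearn's Lasso minimizes (1/2n)||y - Xw||^2 + alpha ||w||_1 (alpha plays the role of lambda).
lasso = Lasso(alpha=0.1).fit(X, y)
print("non-zero coefficients:", np.sum(lasso.coef_ != 0))
```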


(12)

Algorithm (Lasso-TD)

LSTD (3rd writing):

  θ_0 = arg min_ω ‖r̃ + γΦ̃'θ_0 − Φ̃ω‖_2^2

Lasso-TD (Kolter & Ng, 2009):

  θ_λ = arg min_ω ‖r̃ + γΦ̃'θ_λ − Φ̃ω‖_2^2 + λ‖ω‖_1

(not a standard lasso problem)

Alternative writing:

  ω_θ = arg min_ω ‖r̃ + γΦ̃'θ − Φ̃ω‖_2^2 + λ‖ω‖_1
  θ_λ = arg min_θ ‖Φ̃θ − Φ̃ω_θ‖_2^2

Another alternative writing: ṽ_λ = Π̃_μ^{(λ,1)} T̂ ṽ_λ (with ṽ_λ = Φ̃θ_λ)

(13)

Properties (Lasso-TD)

not a standard lasso ⇒ originally requires an ad hoc solver
can be framed as an LCP (Linear Complementarity Problem) (Johns et al., 2010): more generic solvers
requires Ã to be a P-matrix (not necessarily true in the off-policy case: Π̃_μ^{(λ,1)} T̂ may have zero or multiple fixed points)
sparsity oracle inequality (Ghavamzadeh et al., 2011):

  inf_λ ‖v − ṽ_λ‖_n ≤ (1/(1 − γ)) inf_θ { ‖v − v̂_θ‖_n + O(√(‖θ‖_0 ln p / n)) }

remark: how to choose λ? (no cross-validation!)


(14)

Algorithm (ℓ1-PBR) (Geist & Scherrer, 2011; Hoffman et al., 2011)

Lasso-TD:

  ω_θ = arg min_ω ‖r̃ + γΦ̃'θ − Φ̃ω‖_2^2 + λ‖ω‖_1
  θ_λ = arg min_θ ‖Φ̃θ − Φ̃ω_θ‖_2^2

Why not:

  ω_θ = arg min_ω ‖r̃ + γΦ̃'θ − Φ̃ω‖_2^2
  θ_λ = arg min_θ ‖Φ̃θ − Φ̃ω_θ‖_2^2 + λ‖θ‖_1

? (this is ℓ1-PBR)

Alternative writing:

  θ_λ = arg min_θ ‖Π̃_μ (ṽ_θ − T̂ ṽ_θ)‖_2^2 + λ‖θ‖_1  (recall ṽ_θ = Φ̃θ)

(ℓ1-penalized Projected Bellman Residual)

(15)

Properties (ℓ1-PBR)

ℓ1-PBR is a standard lasso problem:

  θ_λ = arg min_θ ‖ỹ − Ψ̃θ‖_2^2 + λ‖θ‖_1,  with ỹ = Π̃_μ r̃ and Ψ̃ = Φ̃ − γΠ̃_μ Φ̃'

(easy extension to other penalizations)
weaker conditions than Lasso-TD (off-policy is fine)
but huge computational cost if p ≫ n (projection in O(p^3))
no finite-sample analysis

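Aside (not from the slides): since ℓ1-PBR reduces to a standard lasso once ỹ and Ψ̃ are formed, a direct sketch is short; the explicit empirical projection is exactly the O(p^3) cost mentioned above. Names and the use of scikit-learn are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

def l1_pbr(Phi, Phi_next, r, gamma, lam):
    """l1-penalized Projected Bellman Residual as a plain lasso (sketch)."""
    n, p = Phi.shape
    # Empirical projection onto the feature space: Pi = Phi (Phi^T Phi)^+ Phi^T.
    # The p x p (pseudo-)inverse is the O(p^3) bottleneck when p is large.
    proj = Phi @ np.linalg.pinv(Phi.T @ Phi) @ Phi.T      # n x n
    y = proj @ r                                          # y tilde = Pi_mu r tilde
    Psi = Phi - gamma * (proj @ Phi_next)                 # Psi tilde = Phi - gamma Pi_mu Phi'
    solver = Lasso(alpha=lam / (2 * n), fit_intercept=False, max_iter=10000)
    return solver.fit(Psi, y).coef_
```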

(16)

Algorithm (ℓ1-LSTD)

2nd formulation of LSTD: Ãθ_0 = b̃, with Ã = (1/n)Φ̃^T(Φ̃ − γΦ̃') and b̃ = (1/n)Φ̃^T r̃; equivalently,

  θ_0 = arg min_θ ‖Ãθ − b̃‖_2^2

ℓ1-LSTD (Pires, 2011; Pires & Szepesvári, 2012):

  θ_λ = arg min_θ ‖Ãθ − b̃‖_2^2 + λ‖θ‖_1
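Aside (not from the slides): ℓ1-LSTD is itself a plain lasso whose "design" is Ã and whose "target" is b̃, so the sketch is immediate (same illustrative conventions as before).

```python
import numpy as np
from sklearn.linear_model import Lasso

def l1_lstd(Phi, Phi_next, r, gamma, lam):
    """l1-LSTD: a lasso whose design is A tilde and whose target is b tilde (sketch)."""
    n, p = Phi.shape
    A = Phi.T @ (Phi - gamma * Phi_next) / n   # p x p
    b = Phi.T @ r / n                          # p
    # sklearn sees p "samples" here (the rows of A), hence alpha = lam / (2p).
    solver = Lasso(alpha=lam / (2 * p), fit_intercept=False, max_iter=10000)
    return solver.fit(A, b).coef_
```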

(17)

Properties (ℓ1-LSTD)

a "standard" lasso problem (use any solver)
no problem in the off-policy case
built-in (theoretical) λ-selection scheme
finite-sample analysis: with probability 1 − δ,

  inf_λ ‖Aθ_λ − b‖_2 ≤ O( ‖θ*‖_1 √((p^2/n) ln(1/δ)) )

(recall A = lim_{n→∞} Ã, b = lim_{n→∞} b̃, and Aθ* = b)


(18)

Algorithm (Dantzig-LSTD)

2nd formulation of LSTD:

  θ_0 = arg min_θ ‖Ãθ − b̃‖_2^2

Dantzig-LSTD (Geist et al., 2012):

  θ_λ = arg min_θ ‖θ‖_1 subject to ‖Ãθ − b̃‖_∞ ≤ λ

This is a linear program:

  min_{θ,u ∈ R^p} 1^T u  subject to  −u ≤ θ ≤ u  and  −λ1 ≤ Ãθ − b̃ ≤ λ1
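Aside (not from the slides): the linear program above can be handed to any LP solver. A sketch with scipy's linprog, stacking the variables as x = [θ; u]; all names are illustrative.

```python
import numpy as np
from scipy.optimize import linprog

def dantzig_lstd(A, b, lam):
    """Dantzig-LSTD as a linear program over x = [theta; u] (sketch)."""
    p = A.shape[1]
    I, Z = np.eye(p), np.zeros((p, p))
    c = np.concatenate([np.zeros(p), np.ones(p)])   # minimize 1^T u
    # All constraints written as A_ub @ x <= b_ub:
    #   theta - u <= 0 and -theta - u <= 0            (i.e. -u <= theta <= u)
    #   A theta <= b + lam and -A theta <= lam - b    (i.e. |A theta - b| <= lam)
    A_ub = np.block([[ I, -I],
                     [-I, -I],
                     [ A,  Z],
                     [-A,  Z]])
    b_ub = np.concatenate([np.zeros(2 * p), b + lam, lam - b])
    bounds = [(None, None)] * p + [(0, None)] * p   # theta free, u >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:p]                                # theta_lambda
```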

(19)

Properties (Dantzig-LSTD)

a standard linear program
no problem in the off-policy case
connection to Lasso-TD: ‖Ãθ_{Lasso-TD,λ} − b̃‖_∞ ≤ λ
heuristic model-selection scheme
finite-sample analysis: with probability 1 − δ,

  inf_λ ‖Aθ_λ − b‖_∞ ≤ O( ‖θ*‖_1 √((1/n) ln(p/δ)) )

(recall A = lim_{n→∞} Ã, b = lim_{n→∞} b̃, and Aθ* = b)
empirically: Lasso-TD, D-LSTD ≥ ℓ1-LSTD, ℓ1-PBR



(21)

The projected-fixed-point approaches seen so far can all be written as a nested problem of the form

  ω_θ = arg min_ω ‖r̃ + γΦ̃'θ − Φ̃ω‖_2^2 + λ_1 pen_1(ω)
  θ = arg min_θ ‖Φ̃θ − Φ̃ω_θ‖_2^2 + λ_2 pen_2(θ)

LSTD (no penalty): v̂_0 = Π̃_μ T̂ v̂_0

ℓ2,2-LSTD (Farahmand et al., 2008), with pen_1 = ‖ω‖_2^2 and pen_2 = ‖θ‖_2^2:

  θ_{λ_1,λ_2} = arg min_θ ‖v̂_θ − Π̃_μ^{(λ_1,2)} T̂ v̂_θ‖_2^2 + λ_2 ‖θ‖_2^2

Lasso-TD, with pen_1 = ‖ω‖_1 and no penalty on θ:

  ṽ_{θ_{λ_1}} = Π̃_μ^{(λ_1,1)} T̂ ṽ_{θ_{λ_1}}

ℓ1-PBR, with pen_1 = ‖ω‖_2^2 and pen_2 = ‖θ‖_1:

  θ_{λ_1,λ_2} = arg min_θ ‖v̂_θ − Π̃_μ^{(λ_1,2)} T̂ v̂_θ‖_2^2 + λ_2 ‖θ‖_1

(25)

The remaining two approaches are based on the residual Ãθ − b̃:

ℓ1-LSTD: θ_λ = arg min_θ ‖θ‖_1 subject to ‖Ãθ − b̃‖_2^2 ≤ λ

Dantzig-LSTD: θ_λ = arg min_θ ‖θ‖_1 subject to ‖Ãθ − b̃‖_∞ ≤ λ


(26)

Thank you for your attention!

Questions?

Final remark. This talk focused on projected-fixed-point-based regularization, but it is also possible to consider residual approaches, with a potential bias problem. The first (as far as I know) algorithm to use ℓ1-penalization for value function estimation is actually a (biased) residual approach (Loth et al., 2007).

(27)


References I

Bertsekas, Dimitri P., & Tsitsiklis, John N. 1996. Neuro-Dynamic Programming. Athena Scientific.

Bradtke, S. J., & Barto, A. G. 1996. Linear Least-Squares algorithms for temporal difference learning. Machine Learning, 22, 33–57.

Candes, E., & Tao, T. 2007. The Dantzig selector: statistical estimation when p is much larger than n. Annals of Statistics, 35(6), 2313–2351.

Efron, B., Hastie, T., Johnstone, I., & Tibshirani, R. 2004. Least Angle Regression. Annals of Statistics, 32(2), 407–499.

Farahmand, A., Ghavamzadeh, M., Szepesvári, C., & Mannor, S. 2008. Regularized Policy Iteration. In: Proc. of NIPS 21.

Geist, M., & Scherrer, B. 2011. ℓ1-penalized projected Bellman residual. In: Proc. of EWRL 9.

Geist, Matthieu, Scherrer, Bruno, Lazaric, Alessandro, & Ghavamzadeh, Mohammad. 2012. A Dantzig Selector Approach to Temporal Difference Learning. In: International Conference on Machine Learning (ICML).


(28)


References II

Ghavamzadeh, M., Lazaric, A., Munos, R., & Hoffman, M. 2011. Finite-Sample Analysis of Lasso-TD. In: Proc. of ICML.

Hoffman, M. W., Lazaric, A., Ghavamzadeh, M., & Munos, R. 2011. Regularized Least Squares Temporal Difference learning with nested ℓ2 and ℓ1 penalization. In: Proc. of EWRL 9.

Johns, J., Painter-Wakefield, C., & Parr, R. 2010. Linear Complementarity for Regularized Policy Evaluation and Improvement. In: Proc. of NIPS 23.

Kolter, J. Z., & Ng, A. Y. 2009. Regularization and Feature Selection in Least-Squares Temporal Difference Learning. In: Proc. of ICML.

Loth, Manuel, Davy, Manuel, & Preux, Philippe. 2007. Sparse Temporal Difference Learning using LASSO. In: IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning.

Pires, B. A. 2011. Statistical analysis of ℓ1-penalized linear estimation with applications. M.Phil. thesis, University of Alberta.

Pires, Bernardo A., & Szepesvári, Csaba. 2012. Statistical linear estimation with penalized estimators: an application to reinforcement learning. In: Proceedings of the 29th International Conference on Machine Learning (ICML 2012).

(29)


References III

Sutton, R. S., & Barto, A. G. 1998. Reinforcement Learning: an Introduction. The MIT Press.

Tibshirani, R. 1996. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, 58(1), 267–288.

