
(1)

An overview of ℓ1-regularization for value function approximation

Journée GDR Robotique

Matthieu Geist
(Supélec, IMS-MaLIS Research Group)
matthieu.geist@supelec.fr

September 6th, 2012


(2)

1. Introduction: MDP Setting
2. Background: LSTD, ℓ1-regularization
3. Algorithms: Lasso-TD, ℓ1-PBR, ℓ1-LSTD, Dantzig-LSTD
4. Summary

(3)


MDP Setting

(Bertsekas & Tsitsiklis, 1996; Sutton & Barto, 1998)

Controller: a_t = π(s_t)
Reward: r_t = r(s_t, a_t)
System dynamics: s_{t+1} ∼ P(·|s_t, a_t)

Goal: given a policy π : S → A, compute its value

  v^π(s) = E[ Σ_{t≥0} γ^t r(s_t, π(s_t)) | s_0 = s, π ],  with 0 < γ < 1.


(4)


MDP Setting

The value function v satisfies the Bellman equation v = r + γ Pv ⇔ v = T v

We look for v̂(s) = Σ_{j=1}^p θ_j φ_j(s), i.e. v̂ = Φθ, where

  Φ = (φ(s_1)^T ; … ; φ(s_|S|)^T) = (φ_1 … φ_p)  and  θ = (θ_1 ; … ; θ_p).

T is only known through samples (s_i, r_i, s_i')_{i=1}^n, where s_i ∼ μ:

  Φ̃ = (φ(s_1)^T ; … ; φ(s_n)^T),  Φ̃' = (φ(s_1')^T ; … ; φ(s_n')^T),  r̃ = (r_1 ; … ; r_n).

We consider the situation where n ≪ p.
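Aside (not from the slides): before introducing function approximation, recall that for a small tabular MDP the value of a fixed policy can be computed exactly from the Bellman equation, which is a useful reference point. A minimal numpy sketch with made-up numbers:

```python
import numpy as np

# Toy 3-state chain under a fixed policy (made-up numbers, for illustration only).
# P[s, s'] is the transition probability under the policy, r[s] the expected reward.
gamma = 0.9
P = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.8, 0.2],
              [0.1, 0.0, 0.9]])
r = np.array([0.0, 1.0, -0.5])

# Bellman equation v = r + gamma * P v  <=>  (I - gamma * P) v = r.
v = np.linalg.solve(np.eye(3) - gamma * P, r)
print(v)
```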


(6)


LSTD

(Bradtke & Barto, 1996)

One looks for v̂ in the feature space satisfying v̂ = Π_μ T v̂.

The solution v̂ = Φθ_0 can be characterized as

  ω_θ = arg min_ω ‖r + γPΦθ − Φω‖_{μ,2}^2
  θ_0 = arg min_θ ‖Φθ − Φω_θ‖_{μ,2}^2

and approximated by its empirical counterpart:

  ω_θ = arg min_ω ‖r̃ + γΦ̃'θ − Φ̃ω‖_2^2
  θ_0 = arg min_θ ‖Φ̃θ − Φ̃ω_θ‖_2^2

(7)


LSTD (alternate writing)

We look for v̂ = Φθ_0 that solves v̂ = Π_μ T v̂. It is known that Π_μ = Φ(Φ^T D_μ Φ)^{-1} Φ^T D_μ, and (after some algebra) one can show that

  θ_0 = A^{-1} b  (linear system of size p)  ⇔  θ_0 = arg min_θ ‖Aθ − b‖_2^2,

where

  A = Φ^T D_μ (I − γP)Φ,  b = Φ^T D_μ r,

which can be approximated through their empirical counterparts:

  Ã = (1/n) Φ̃^T (Φ̃ − γΦ̃'),  b̃ = (1/n) Φ̃^T r̃.

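Aside (not from the slides): the empirical LSTD solve above takes a few lines of numpy. The sketch below assumes a feature map `phi` and a batch of sampled transitions; all names are illustrative.

```python
import numpy as np

def lstd(phi, states, rewards, next_states, gamma=0.95):
    """Empirical LSTD: build A_tilde, b_tilde from n transitions and solve for theta.

    phi: callable mapping a state to a feature vector of size p (illustrative).
    """
    Phi = np.array([phi(s) for s in states])            # n x p, i.e. Phi tilde
    Phi_next = np.array([phi(s) for s in next_states])  # n x p, i.e. Phi tilde prime
    r = np.asarray(rewards, dtype=float)
    n = len(r)

    A = Phi.T @ (Phi - gamma * Phi_next) / n            # p x p
    b = Phi.T @ r / n                                    # p
    # lstsq is a little safer than solve when A is ill-conditioned (few samples).
    theta, *_ = np.linalg.lstsq(A, b, rcond=None)
    return theta
```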

(8)


LSTD (another alternate writing)

Recall the first writing:

  ω_θ = arg min_ω ‖r̃ + γΦ̃'θ − Φ̃ω‖_2^2
  θ_0 = arg min_θ ‖Φ̃θ − Φ̃ω_θ‖_2^2

The second equation is minimized to zero for θ = ω_θ (the fixed point of Π_μ T exists). Therefore, the system rewrites as

  θ_0 = arg min_ω ‖r̃ + γΦ̃'θ_0 − Φ̃ω‖_2^2

(a fixed-point equation: θ_0 appears on both sides).

(9)

Least-squares regression and ℓ2-penalization

Regression (γ = 0): find θ such that r̃ ≈ Φ̃θ; the natural (least-squares) objective is

  θ_0 = arg min_θ ‖r̃ − Φ̃θ‖_2^2

analytical solution: θ_0 = (Φ̃^T Φ̃)^{-1} Φ̃^T r̃
this is an empirical projection: Φ̃θ_0 = Π̃_μ r̃ (so Π̃_μ = Φ̃(Φ̃^T Φ̃)^{-1} Φ̃^T)

ℓ2-penalized LS:

  θ_λ = arg min_θ ‖r̃ − Φ̃θ‖_2^2 + λ‖θ‖_2^2

analytical solution: θ_λ = (Φ̃^T Φ̃ + λI)^{-1} Φ̃^T r̃
this is a penalized empirical projection: Φ̃θ_λ = Π̃_μ^{(λ,2)} r̃


(10)

ℓ1-penalization

Lasso (Tibshirani, 1996) (ℓ1-penalized LS):

  θ_λ = arg min_θ ‖r̃ − Φ̃θ‖_2^2 + λ‖θ‖_1

promotes sparsity
no analytical solution, but efficient computation through the regularization path (θ_λ is piecewise linear in λ) (Efron et al., 2004)
this is a penalized empirical projection: Φ̃θ_λ = Π̃_μ^{(λ,1)} r̃

Dantzig Selector (Candes & Tao, 2007):

  θ_λ = arg min_θ ‖θ‖_1 subject to ‖Φ̃^T(r̃ − Φ̃θ)‖_∞ ≤ λ

promotes sparsity
this is actually a linear program

Many (Lasso) extensions exist, but let's keep things simple...
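Aside (not from the slides): a plain lasso fit on synthetic regression data illustrates the sparsity-inducing effect. scikit-learn is used purely for convenience; any lasso solver would do.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 50, 200                        # fewer samples than features
theta_true = np.zeros(p)
theta_true[:5] = rng.normal(size=5)   # only 5 truly relevant features
X = rng.normal(size=(n, p))
y = X @ theta_true + 0.1 * rng.normal(size=n)

# sklearn's Lasso minimizes (1/2n)||y - Xw||^2 + alpha ||w||_1 (alpha plays the role of lambda).
lasso = Lasso(alpha=0.1).fit(X, y)
print("non-zero coefficients:", np.sum(lasso.coef_ != 0))
```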


(12)

Algorithm (Lasso-TD)

LSTD (3rd writing):

  θ_0 = arg min_ω ‖r̃ + γΦ̃'θ_0 − Φ̃ω‖_2^2

Lasso-TD (Kolter & Ng, 2009):

  θ_λ = arg min_ω ‖r̃ + γΦ̃'θ_λ − Φ̃ω‖_2^2 + λ‖ω‖_1

(not a standard lasso problem)

Alternative writing:

  ω_θ = arg min_ω ‖r̃ + γΦ̃'θ − Φ̃ω‖_2^2 + λ‖ω‖_1
  θ_λ = arg min_θ ‖Φ̃θ − Φ̃ω_θ‖_2^2

Another alternative writing: ṽ_λ = Π̃_μ^{(λ,1)} T̂ ṽ_λ (with ṽ_λ = Φ̃θ_λ)

(13)

Properties (Lasso-TD)

not a standard lasso ⇒ originally requires an ad hoc solver
can be framed as an LCP (Linear Complementarity Problem) (Johns et al., 2010): more generic solvers
requires Ã to be a P-matrix (not necessarily true in the off-policy case: Π̃_μ^{(λ,1)} T̂ may have zero or multiple fixed points)
sparsity oracle inequality (Ghavamzadeh et al., 2011):

  inf_λ ‖v − ṽ_λ‖_n ≤ (1/(1 − γ)) inf_θ { ‖v − v̂_θ‖_n + O(√(‖θ‖_0 ln p / n)) }

remark: how to choose λ? (no cross-validation!)


(14)

Algorithm (ℓ1-PBR) (Geist & Scherrer, 2011; Hoffman et al., 2011)

Lasso-TD:

  ω_θ = arg min_ω ‖r̃ + γΦ̃'θ − Φ̃ω‖_2^2 + λ‖ω‖_1
  θ_λ = arg min_θ ‖Φ̃θ − Φ̃ω_θ‖_2^2

Why not:

  ω_θ = arg min_ω ‖r̃ + γΦ̃'θ − Φ̃ω‖_2^2
  θ_λ = arg min_θ ‖Φ̃θ − Φ̃ω_θ‖_2^2 + λ‖θ‖_1

? (this is ℓ1-PBR)

Alternative writing:

  θ_λ = arg min_θ ‖Π̃_μ (ṽ_θ − T̂ ṽ_θ)‖_2^2 + λ‖θ‖_1  (recall ṽ_θ = Φ̃θ)

(ℓ1-penalized Projected Bellman Residual)

(15)

Properties (ℓ1-PBR)

ℓ1-PBR is a standard lasso problem:

  θ_λ = arg min_θ ‖ỹ − Ψ̃θ‖_2^2 + λ‖θ‖_1,  with ỹ = Π̃_μ r̃ and Ψ̃ = Φ̃ − γΠ̃_μ Φ̃'

(easy extension to other penalizations)
weaker conditions than Lasso-TD (off-policy is fine)
but huge computational cost if p ≫ n (projection in O(p^3))
no finite-sample analysis

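Aside (not from the slides): since ℓ1-PBR reduces to a standard lasso once ỹ and Ψ̃ are formed, a direct sketch is short; the explicit empirical projection is exactly the O(p^3) cost mentioned above. Names and the use of scikit-learn are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

def l1_pbr(Phi, Phi_next, r, gamma, lam):
    """l1-penalized Projected Bellman Residual as a plain lasso (sketch)."""
    n, p = Phi.shape
    # Empirical projection onto the feature space: Pi = Phi (Phi^T Phi)^+ Phi^T.
    # The p x p (pseudo-)inverse is the O(p^3) bottleneck when p is large.
    proj = Phi @ np.linalg.pinv(Phi.T @ Phi) @ Phi.T      # n x n
    y = proj @ r                                          # y tilde = Pi_mu r tilde
    Psi = Phi - gamma * (proj @ Phi_next)                 # Psi tilde = Phi - gamma Pi_mu Phi'
    solver = Lasso(alpha=lam / (2 * n), fit_intercept=False, max_iter=10000)
    return solver.fit(Psi, y).coef_
```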

(16)

Algorithm (ℓ1-LSTD)

2nd formulation of LSTD: Ãθ_0 = b̃, with Ã = (1/n)Φ̃^T(Φ̃ − γΦ̃') and b̃ = (1/n)Φ̃^T r̃; equivalently,

  θ_0 = arg min_θ ‖Ãθ − b̃‖_2^2

ℓ1-LSTD (Pires, 2011; Pires & Szepesvári, 2012):

  θ_λ = arg min_θ ‖Ãθ − b̃‖_2^2 + λ‖θ‖_1
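Aside (not from the slides): ℓ1-LSTD is itself a plain lasso whose "design" is Ã and whose "target" is b̃, so the sketch is immediate (same illustrative conventions as before).

```python
import numpy as np
from sklearn.linear_model import Lasso

def l1_lstd(Phi, Phi_next, r, gamma, lam):
    """l1-LSTD: a lasso whose design is A tilde and whose target is b tilde (sketch)."""
    n, p = Phi.shape
    A = Phi.T @ (Phi - gamma * Phi_next) / n   # p x p
    b = Phi.T @ r / n                          # p
    # sklearn sees p "samples" here (the rows of A), hence alpha = lam / (2p).
    solver = Lasso(alpha=lam / (2 * p), fit_intercept=False, max_iter=10000)
    return solver.fit(A, b).coef_
```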

(17)

Properties (ℓ1-LSTD)

a "standard" lasso problem (use any solver)
no problem in the off-policy case
built-in (theoretical) λ-selection scheme
finite-sample analysis: with probability 1 − δ,

  inf_λ ‖Aθ_λ − b‖_2 ≤ O( ‖θ*‖_1 √((p^2/n) ln(1/δ)) )

(recall A = lim_{n→∞} Ã, b = lim_{n→∞} b̃, and Aθ* = b)


(18)

Algorithm (Dantzig-LSTD)

2nd formulation of LSTD:

  θ_0 = arg min_θ ‖Ãθ − b̃‖_2^2

Dantzig-LSTD (Geist et al., 2012):

  θ_λ = arg min_θ ‖θ‖_1 subject to ‖Ãθ − b̃‖_∞ ≤ λ

This is a linear program:

  min_{θ,u ∈ R^p} 1^T u  subject to  −u ≤ θ ≤ u  and  −λ1 ≤ Ãθ − b̃ ≤ λ1
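Aside (not from the slides): the linear program above can be handed to any LP solver. A sketch with scipy's linprog, stacking the variables as x = [θ; u]; all names are illustrative.

```python
import numpy as np
from scipy.optimize import linprog

def dantzig_lstd(A, b, lam):
    """Dantzig-LSTD as a linear program over x = [theta; u] (sketch)."""
    p = A.shape[1]
    I, Z = np.eye(p), np.zeros((p, p))
    c = np.concatenate([np.zeros(p), np.ones(p)])   # minimize 1^T u
    # All constraints written as A_ub @ x <= b_ub:
    #   theta - u <= 0 and -theta - u <= 0            (i.e. -u <= theta <= u)
    #   A theta <= b + lam and -A theta <= lam - b    (i.e. |A theta - b| <= lam)
    A_ub = np.block([[ I, -I],
                     [-I, -I],
                     [ A,  Z],
                     [-A,  Z]])
    b_ub = np.concatenate([np.zeros(2 * p), b + lam, lam - b])
    bounds = [(None, None)] * p + [(0, None)] * p   # theta free, u >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:p]                                # theta_lambda
```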

(19)

Properties (Dantzig-LSTD)

a standard linear program
no problem in the off-policy case
connection to Lasso-TD: ‖Ãθ_{Lasso-TD,λ} − b̃‖_∞ ≤ λ
heuristic model-selection scheme
finite-sample analysis: with probability 1 − δ,

  inf_λ ‖Aθ_λ − b‖_∞ ≤ O( ‖θ*‖_1 √((1/n) ln(p/δ)) )

(recall A = lim_{n→∞} Ã, b = lim_{n→∞} b̃, and Aθ* = b)
empirically: Lasso-TD, D-LSTD ≥ ℓ1-LSTD, ℓ1-PBR



(21)

The projected-fixed-point approaches seen so far can all be written as a nested problem of the form

  ω_θ = arg min_ω ‖r̃ + γΦ̃'θ − Φ̃ω‖_2^2 + λ_1 pen_1(ω)
  θ = arg min_θ ‖Φ̃θ − Φ̃ω_θ‖_2^2 + λ_2 pen_2(θ)

LSTD (no penalty): v̂_0 = Π̃_μ T̂ v̂_0

ℓ2,2-LSTD (Farahmand et al., 2008), with pen_1 = ‖ω‖_2^2 and pen_2 = ‖θ‖_2^2:

  θ_{λ_1,λ_2} = arg min_θ ‖v̂_θ − Π̃_μ^{(λ_1,2)} T̂ v̂_θ‖_2^2 + λ_2 ‖θ‖_2^2

Lasso-TD, with pen_1 = ‖ω‖_1 and no penalty on θ:

  ṽ_{θ_{λ_1}} = Π̃_μ^{(λ_1,1)} T̂ ṽ_{θ_{λ_1}}

ℓ1-PBR, with pen_1 = ‖ω‖_2^2 and pen_2 = ‖θ‖_1:

  θ_{λ_1,λ_2} = arg min_θ ‖v̂_θ − Π̃_μ^{(λ_1,2)} T̂ v̂_θ‖_2^2 + λ_2 ‖θ‖_1

(25)

The remaining two approaches are based on the residual Ãθ − b̃:

ℓ1-LSTD: θ_λ = arg min_θ ‖θ‖_1 subject to ‖Ãθ − b̃‖_2^2 ≤ λ

Dantzig-LSTD: θ_λ = arg min_θ ‖θ‖_1 subject to ‖Ãθ − b̃‖_∞ ≤ λ


(26)

Thank you for your attention!

Questions?

Final remark. This talk focused on projected-fixed-point-based regularization, but it is also possible to consider residual approaches, with a potential bias problem. The first (as far as I know) algorithm to use ℓ1-penalization for value function estimation is actually a (biased) residual approach (Loth et al., 2007).

(27)


References I

Bertsekas, Dimitri P., & Tsitsiklis, John N. 1996. Neuro-Dynamic Programming. Athena Scientific.

Bradtke, S. J., & Barto, A. G. 1996. Linear Least-Squares algorithms for temporal difference learning. Machine Learning, 22, 33–57.

Candes, E., & Tao, T. 2007. The Dantzig selector: statistical estimation when p is much larger than n. Annals of Statistics, 35(6), 2313–2351.

Efron, B., Hastie, T., Johnstone, I., & Tibshirani, R. 2004. Least Angle Regression. Annals of Statistics, 32(2), 407–499.

Farahmand, A., Ghavamzadeh, M., Szepesvári, C., & Mannor, S. 2008. Regularized Policy Iteration. In: Proc. of NIPS 21.

Geist, M., & Scherrer, B. 2011. ℓ1-penalized projected Bellman residual. In: Proc. of EWRL 9.

Geist, Matthieu, Scherrer, Bruno, Lazaric, Alessandro, & Ghavamzadeh, Mohammad. 2012. A Dantzig Selector Approach to Temporal Difference Learning. In: International Conference on Machine Learning (ICML).


(28)


References II

Ghavamzadeh, M., Lazaric, A., Munos, R., & Hoffman, M. 2011. Finite-Sample Analysis of Lasso-TD. In: Proc. of ICML.

Hoffman, M. W., Lazaric, A., Ghavamzadeh, M., & Munos, R. 2011. Regularized Least Squares Temporal Difference learning with nested ℓ2 and ℓ1 penalization. In: Proc. of EWRL 9.

Johns, J., Painter-Wakefield, C., & Parr, R. 2010. Linear Complementarity for Regularized Policy Evaluation and Improvement. In: Proc. of NIPS 23.

Kolter, J. Z., & Ng, A. Y. 2009. Regularization and Feature Selection in Least-Squares Temporal Difference Learning. In: Proc. of ICML.

Loth, Manuel, Davy, Manuel, & Preux, Philippe. 2007. Sparse Temporal Difference Learning using LASSO. In: IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning.

Pires, B. A. 2011. Statistical analysis of ℓ1-penalized linear estimation with applications. M.Phil. thesis, University of Alberta.

Pires, Bernardo A., & Szepesvári, Csaba. 2012. Statistical linear estimation with penalized estimators: an application to reinforcement learning. In: Proceedings of the 29th International Conference on Machine Learning (ICML 2012).

(29)


References III

Sutton, R. S., & Barto, A. G. 1998. Reinforcement Learning: an Introduction. The MIT Press.

Tibshirani, R. 1996. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, 58(1), 267–288.

