Introduction Background Algorithms Summary
An overview of ℓ1-regularization for value function approximation
Journée GDR Robotique
Matthieu Geist
(Supélec, IMS-MaLIS Research Group) matthieu.geist@supelec.fr
September 6th, 2012
Matthieu Geist Regularizing LSTD 1/26
Outline

1 Introduction: MDP Setting
2 Background: LSTD; ℓ1-regularization
3 Algorithms: Lasso-TD; ℓ1-PBR; ℓ1-LSTD; Dantzig-LSTD
4 Summary
MDP Setting
(Bertsekas & Tsitsiklis, 1996; Sutton & Barto, 1998)
Controller: a_t = \pi(s_t)
Reward: r_t = r(s_t, a_t)
System dynamics: s_{t+1} \sim P(\cdot | s_t, a_t)

Goal: given a policy \pi : S \to A, compute its value

    v^\pi(s) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t r(s_t, \pi(s_t)) \,\middle|\, s_0 = s, \pi \right], \quad 0 < \gamma < 1.
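As a quick sanity check, v^\pi can be computed exactly on a small MDP by solving the linear system v = r + \gamma P v. The sketch below uses a hypothetical 3-state chain (all numbers illustrative, not from the talk).

```python
import numpy as np

# Hypothetical 3-state chain under a fixed policy pi:
# P[s, s'] is the transition matrix induced by pi, r the reward vector.
gamma = 0.9
P = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.8, 0.2],
              [0.1, 0.0, 0.9]])
r = np.array([0.0, 0.0, 1.0])

# v^pi solves the Bellman equation v = r + gamma * P v,
# i.e. v = (I - gamma * P)^{-1} r.
v = np.linalg.solve(np.eye(3) - gamma * P, r)

# Sanity check: v is a fixed point of the Bellman operator T.
assert np.allclose(v, r + gamma * P @ v)
```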
The value function v satisfies the Bellman equation v = r + γ Pv ⇔ v = T v
We look for \hat{v}(s) = \sum_{j=1}^{p} \theta_j \phi_j(s), i.e. \hat{v} = \Phi\theta, where

    \Phi = \begin{pmatrix} \phi(s_1)^\top \\ \vdots \\ \phi(s_{|S|})^\top \end{pmatrix} = (\phi_1 \dots \phi_p) \quad \text{and} \quad \theta = \begin{pmatrix} \theta_1 \\ \vdots \\ \theta_p \end{pmatrix}.

T is only known through samples (s_i, r_i, s_i')_{i=1}^{n}, where s_i \sim \mu:

    \tilde{\Phi} = \begin{pmatrix} \phi(s_1)^\top \\ \vdots \\ \phi(s_n)^\top \end{pmatrix}, \quad \tilde{\Phi}' = \begin{pmatrix} \phi(s_1')^\top \\ \vdots \\ \phi(s_n')^\top \end{pmatrix}, \quad \tilde{r} = \begin{pmatrix} r_1 \\ \vdots \\ r_n \end{pmatrix}.

We consider the situation where n \ll p.
LSTD
(Bradtke & Barto, 1996). One looks for \hat{v} in the feature space satisfying \hat{v} = \Pi_\mu T \hat{v}.

The solution \hat{v} = \Phi\theta_0 can be characterized as

    \omega_\theta = \arg\min_\omega \|r + \gamma P \Phi\theta - \Phi\omega\|_{\mu,2}^2
    \theta_0 = \arg\min_\theta \|\Phi\theta - \Phi\omega_\theta\|_{\mu,2}^2

and approximated by its empirical counterpart:

    \omega_\theta = \arg\min_\omega \|\tilde{r} + \gamma \tilde{\Phi}'\theta - \tilde{\Phi}\omega\|_2^2
    \theta_0 = \arg\min_\theta \|\tilde{\Phi}\theta - \tilde{\Phi}\omega_\theta\|_2^2
LSTD (alternate writing)
We look for \hat{v} = \Phi\theta_0 that solves \hat{v} = \Pi_\mu T \hat{v}. It is known that \Pi_\mu = \Phi(\Phi^\top D_\mu \Phi)^{-1} \Phi^\top D_\mu, and (after some algebra) one can show that

    \theta_0 = A^{-1} b   (a linear system of size p)
    \Leftrightarrow \theta_0 = \arg\min_\theta \|A\theta - b\|_2^2,

where

    A = \Phi^\top D_\mu (I - \gamma P) \Phi, \quad b = \Phi^\top D_\mu r.

These can be approximated through their empirical counterparts:

    \tilde{A} = \frac{1}{n} \tilde{\Phi}^\top (\tilde{\Phi} - \gamma \tilde{\Phi}'), \quad \tilde{b} = \frac{1}{n} \tilde{\Phi}^\top \tilde{r}.
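Concretely, LSTD amounts to building \tilde{A} and \tilde{b} from the transition samples and solving a p-by-p linear system. A minimal numpy sketch, on a hypothetical 3-state chain with one-hot features (illustrative numbers only; with one-hot features the approximation is exact, so \theta approaches v^\pi):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9

# Hypothetical 3-state chain under a fixed policy (illustrative numbers).
P = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.8, 0.2],
              [0.1, 0.0, 0.9]])
r_fn = np.array([0.0, 0.0, 1.0])

n = 20000
states = rng.integers(0, 3, size=n)                          # s_i ~ mu (uniform)
next_states = np.array([rng.choice(3, p=P[s]) for s in states])
rewards = r_fn[states]

Phi = np.eye(3)[states]            # \tilde{\Phi}: one row phi(s_i)^T per sample
Phi_next = np.eye(3)[next_states]  # \tilde{\Phi}'

# LSTD: solve \tilde{A} theta = \tilde{b}.
A = Phi.T @ (Phi - gamma * Phi_next) / n
b = Phi.T @ rewards / n
theta = np.linalg.solve(A, b)

v_true = np.linalg.solve(np.eye(3) - gamma * P, r_fn)  # exact v^pi
# theta approaches v_true as n grows (only sampling noise remains).
```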
LSTD (another alternate writing)
Recall the first writing:

    \omega_\theta = \arg\min_\omega \|\tilde{r} + \gamma \tilde{\Phi}'\theta - \tilde{\Phi}\omega\|_2^2
    \theta_0 = \arg\min_\theta \|\tilde{\Phi}\theta - \tilde{\Phi}\omega_\theta\|_2^2.

The second equation is minimized to zero for \theta = \omega_\theta (the fixed point of \Pi_\mu T exists). Therefore, the system rewrites as

    \theta_0 = \arg\min_\omega \|\tilde{r} + \gamma \tilde{\Phi}'\theta_0 - \tilde{\Phi}\omega\|_2^2

(a fixed-point equation: \theta_0 appears on both sides).
Least-squares regression and ℓ2-penalization
Regression (\gamma = 0): find \theta such that \tilde{r} \approx \tilde{\Phi}\theta. The natural least-squares objective is

    \theta_0 = \arg\min_\theta \|\tilde{r} - \tilde{\Phi}\theta\|_2^2,

with analytical solution \theta_0 = (\tilde{\Phi}^\top \tilde{\Phi})^{-1} \tilde{\Phi}^\top \tilde{r}. This is an empirical projection: \tilde{\Phi}\theta_0 = \tilde{\Pi}_\mu \tilde{r}, with \tilde{\Pi}_\mu = \tilde{\Phi}(\tilde{\Phi}^\top \tilde{\Phi})^{-1} \tilde{\Phi}^\top.

ℓ2-penalized least squares:

    \theta_\lambda = \arg\min_\theta \|\tilde{r} - \tilde{\Phi}\theta\|_2^2 + \lambda \|\theta\|_2^2,

with analytical solution \theta_\lambda = (\tilde{\Phi}^\top \tilde{\Phi} + \lambda I)^{-1} \tilde{\Phi}^\top \tilde{r}. This is a penalized empirical projection: \tilde{\Phi}\theta_\lambda = \tilde{\Pi}_\mu^{\lambda,2} \tilde{r}.
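A quick numerical check of the two closed forms above, on synthetic data (shapes are arbitrary; any n > p works):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 5
Phi = rng.normal(size=(n, p))   # plays the role of \tilde{\Phi}
r = rng.normal(size=n)          # plays the role of \tilde{r}
lam = 0.1

# Ridge closed form: (Phi^T Phi + lam I)^{-1} Phi^T r.
theta_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(p), Phi.T @ r)

# First-order optimality of the penalized objective:
# -2 Phi^T (r - Phi theta) + 2 lam theta = 0.
assert np.allclose(Phi.T @ (r - Phi @ theta_ridge), lam * theta_ridge)

# lam = 0 recovers ordinary least squares.
theta_ls = np.linalg.solve(Phi.T @ Phi, Phi.T @ r)
assert np.allclose(theta_ls, np.linalg.lstsq(Phi, r, rcond=None)[0])
```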
ℓ1-penalization
Lasso (Tibshirani, 1996), i.e. ℓ1-penalized least squares:

    \theta_\lambda = \arg\min_\theta \|\tilde{r} - \tilde{\Phi}\theta\|_2^2 + \lambda \|\theta\|_1

- promotes sparsity
- no analytical solution, but efficient computation through the regularization path (\theta_\lambda is piecewise linear in \lambda) (Efron et al., 2004)
- this is a penalized empirical projection: \tilde{\Phi}\theta_\lambda = \tilde{\Pi}_\mu^{\lambda,1} \tilde{r}

Dantzig Selector (Candes & Tao, 2007):

    \theta_\lambda = \arg\min_\theta \|\theta\|_1 \quad \text{subject to} \quad \|\tilde{\Phi}^\top(\tilde{r} - \tilde{\Phi}\theta)\|_\infty \le \lambda

- promotes sparsity
- this is actually a linear program

Many (lasso) extensions exist, but let's keep things simple...
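Neither problem has a closed form, but the lasso is easy to solve with proximal gradient (ISTA). The sketch below is my illustration on synthetic sparse data, not the LARS path algorithm cited above:

```python
import numpy as np

def soft_threshold(x, t):
    """Proximal operator of t * ||.||_1 (componentwise shrinkage)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def lasso_ista(Phi, r, lam, n_iter=5000):
    """Minimize ||r - Phi theta||_2^2 + lam ||theta||_1 by ISTA (proximal
    gradient) -- a simple stand-in for path algorithms such as LARS."""
    step = 1.0 / (2 * np.linalg.norm(Phi, 2) ** 2)  # 1 / Lipschitz const. of grad
    theta = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        grad = 2 * Phi.T @ (Phi @ theta - r)
        theta = soft_threshold(theta - step * grad, step * lam)
    return theta

rng = np.random.default_rng(2)
n, p = 100, 20
Phi = rng.normal(size=(n, p))
theta_true = np.zeros(p)
theta_true[:3] = [2.0, -1.5, 1.0]          # sparse ground truth
r = Phi @ theta_true + 0.01 * rng.normal(size=n)

theta = lasso_ista(Phi, r, lam=5.0)
# theta should approximately recover the 3-sparse support of theta_true.
```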
Lasso-TD: Algorithm
LSTD (3rd writing):

    \theta_0 = \arg\min_\omega \|\tilde{r} + \gamma \tilde{\Phi}'\theta_0 - \tilde{\Phi}\omega\|_2^2

Lasso-TD (Kolter & Ng, 2009):

    \theta_\lambda = \arg\min_\omega \|\tilde{r} + \gamma \tilde{\Phi}'\theta_\lambda - \tilde{\Phi}\omega\|_2^2 + \lambda \|\omega\|_1

(not a standard lasso problem). Alternative writing:

    \omega_\theta = \arg\min_\omega \|\tilde{r} + \gamma \tilde{\Phi}'\theta - \tilde{\Phi}\omega\|_2^2 + \lambda \|\omega\|_1
    \theta_\lambda = \arg\min_\theta \|\tilde{\Phi}\theta - \tilde{\Phi}\omega_\theta\|_2^2

Another alternative writing: \tilde{v}_\lambda = \tilde{\Pi}_\mu^{\lambda,1} \hat{T} \tilde{v}_\lambda (with \tilde{v}_\lambda = \tilde{\Phi}\theta_\lambda).
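Lasso-TD is a fixed-point problem, not a single lasso. The references solve it with an ad hoc solver or an LCP formulation; the sketch below instead runs a naive fixed-point iteration of the penalized projection (an illustrative assumption: iterating converges, which holds here on-policy), with ISTA as the inner lasso solver and a hypothetical 3-state chain as data:

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def lasso(M, y, lam, n_iter=500):
    """argmin_w ||y - M w||_2^2 + lam ||w||_1 via ISTA (proximal gradient)."""
    step = 1.0 / (2 * np.linalg.norm(M, 2) ** 2)
    w = np.zeros(M.shape[1])
    for _ in range(n_iter):
        w = soft_threshold(w - step * 2 * M.T @ (M @ w - y), step * lam)
    return w

# Hypothetical on-policy data: 3-state chain, one-hot features (illustrative).
rng = np.random.default_rng(3)
gamma, n = 0.9, 1000
P = np.array([[0.9, 0.1, 0.0], [0.0, 0.8, 0.2], [0.1, 0.0, 0.9]])
s = rng.integers(0, 3, size=n)
s2 = np.array([rng.choice(3, p=P[i]) for i in s])
Phi, Phi2 = np.eye(3)[s], np.eye(3)[s2]
r = np.array([0.0, 0.0, 1.0])[s]

# Naive fixed-point iteration:
#   theta <- argmin_w ||r + gamma Phi' theta - Phi w||_2^2 + lam ||w||_1.
# NOT the ad hoc/LCP solvers of the references, just an illustration
# that converges here in the on-policy case.
theta = np.zeros(3)
for _ in range(100):
    theta = lasso(Phi, r + gamma * Phi2 @ theta, lam=1.0)
```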
Lasso-TD: Properties
- Not a standard lasso ⇒ originally requires an ad hoc solver.
- Can be framed as an LCP (Linear Complementarity Problem) (Johns et al., 2010): more generic solvers.
- Requires \tilde{A} to be a P-matrix (not necessarily true in the off-policy case: \tilde{\Pi}_\mu^{\lambda,1} \hat{T} may have zero or multiple fixed points).
- Sparsity oracle inequality (Ghavamzadeh et al., 2011):

    \inf_\lambda \|v - \tilde{v}_\lambda\|_n \le \frac{1}{1-\gamma} \inf_\theta \left\{ \|v - \hat{v}_\theta\|_n + O\left( \sqrt{\frac{\|\theta\|_0 \ln p}{n}} \right) \right\}

- Remark: how to choose \lambda? (no cross-validation!)
ℓ1-PBR: Algorithm (Geist & Scherrer, 2011; Hoffman et al., 2011)
Lasso-TD:

    \omega_\theta = \arg\min_\omega \|\tilde{r} + \gamma \tilde{\Phi}'\theta - \tilde{\Phi}\omega\|_2^2 + \lambda \|\omega\|_1
    \theta_\lambda = \arg\min_\theta \|\tilde{\Phi}\theta - \tilde{\Phi}\omega_\theta\|_2^2

Why not instead:

    \omega_\theta = \arg\min_\omega \|\tilde{r} + \gamma \tilde{\Phi}'\theta - \tilde{\Phi}\omega\|_2^2
    \theta_\lambda = \arg\min_\theta \|\tilde{\Phi}\theta - \tilde{\Phi}\omega_\theta\|_2^2 + \lambda \|\theta\|_1

? (This is ℓ1-PBR.) Alternative writing:

    \theta_\lambda = \arg\min_\theta \|\tilde{\Pi}_\mu(\tilde{v}_\theta - \hat{T} \tilde{v}_\theta)\|_2^2 + \lambda \|\theta\|_1   (recall \tilde{v}_\theta = \tilde{\Phi}\theta)

(ℓ1-penalized Projected Bellman Residual).
ℓ1-PBR: Properties
- ℓ1-PBR is a standard lasso problem:

    \theta_\lambda = \arg\min_\theta \|\tilde{y} - \tilde{\Psi}\theta\|_2^2 + \lambda \|\theta\|_1, \quad \tilde{y} = \tilde{\Pi}_\mu \tilde{r}, \quad \tilde{\Psi} = \tilde{\Phi} - \gamma \tilde{\Pi}_\mu \tilde{\Phi}'

  (easy extension to other penalizations).
- Weaker conditions than Lasso-TD (off-policy is fine).
- But huge computational cost if p \gg n (the projection costs O(p^3)).
- No finite-sample analysis.
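Because ℓ1-PBR reduces to a standard lasso with \tilde{y} = \tilde{\Pi}_\mu \tilde{r} and \tilde{\Psi} = \tilde{\Phi} - \gamma \tilde{\Pi}_\mu \tilde{\Phi}', any lasso solver applies. A sketch with ISTA on a hypothetical 3-state chain (illustrative numbers; at \lambda = 0 the projected residual can be driven to zero, so the LSTD solution is recovered):

```python
import numpy as np

def lasso_ista(M, y, lam, n_iter=3000):
    """argmin_x ||y - M x||_2^2 + lam ||x||_1 via ISTA (proximal gradient)."""
    step = 1.0 / (2 * np.linalg.norm(M, 2) ** 2)
    x = np.zeros(M.shape[1])
    for _ in range(n_iter):
        x = x - step * 2 * M.T @ (M @ x - y)
        x = np.sign(x) * np.maximum(np.abs(x) - step * lam, 0.0)
    return x

# Hypothetical chain data with one-hot features (illustrative numbers).
rng = np.random.default_rng(4)
gamma, n = 0.9, 1000
P = np.array([[0.9, 0.1, 0.0], [0.0, 0.8, 0.2], [0.1, 0.0, 0.9]])
s = rng.integers(0, 3, size=n)
s2 = np.array([rng.choice(3, p=P[i]) for i in s])
Phi, Phi2 = np.eye(3)[s], np.eye(3)[s2]
r = np.array([0.0, 0.0, 1.0])[s]

# Reduction to a standard lasso: y = Pi r, Psi = Phi - gamma Pi Phi'.
# Note the O(n x n) empirical projection -- this is the costly step when p is large.
Pi = Phi @ np.linalg.solve(Phi.T @ Phi, Phi.T)
y = Pi @ r
Psi = Phi - gamma * Pi @ Phi2

theta = lasso_ista(Psi, y, lam=0.1)
```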
ℓ1-LSTD: Algorithm
2nd formulation of LSTD: \tilde{A}\theta_0 = \tilde{b}, with \tilde{A} = \frac{1}{n}\tilde{\Phi}^\top(\tilde{\Phi} - \gamma\tilde{\Phi}') and \tilde{b} = \frac{1}{n}\tilde{\Phi}^\top\tilde{r}; equivalently,

    \theta_0 = \arg\min_\theta \|\tilde{A}\theta - \tilde{b}\|_2^2.

ℓ1-LSTD (Pires, 2011; Pires & Szepesvári, 2012):

    \theta_\lambda = \arg\min_\theta \|\tilde{A}\theta - \tilde{b}\|_2^2 + \lambda \|\theta\|_1.
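Since this is now a standard lasso with design \tilde{A} and target \tilde{b} (both only p-dimensional), any lasso solver works. A sketch with ISTA, where the exact A and b of a hypothetical chain stand in for their sampled counterparts:

```python
import numpy as np

def lasso_ista(M, y, lam, n_iter=5000):
    """argmin_x ||y - M x||_2^2 + lam ||x||_1 via ISTA (proximal gradient)."""
    step = 1.0 / (2 * np.linalg.norm(M, 2) ** 2)
    x = np.zeros(M.shape[1])
    for _ in range(n_iter):
        x = x - step * 2 * M.T @ (M @ x - y)
        x = np.sign(x) * np.maximum(np.abs(x) - step * lam, 0.0)
    return x

# Exact A = Phi^T D_mu (I - gamma P) Phi and b = Phi^T D_mu r for a
# hypothetical 3-state chain (one-hot features, uniform mu), standing in
# for the sampled \tilde{A}, \tilde{b}.
gamma = 0.9
P = np.array([[0.9, 0.1, 0.0], [0.0, 0.8, 0.2], [0.1, 0.0, 0.9]])
A = (np.eye(3) - gamma * P) / 3
b = np.array([0.0, 0.0, 1.0]) / 3

theta = lasso_ista(A, b, lam=1e-4)
# As lam -> 0, theta approaches the plain LSTD solution A^{-1} b.
```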
ℓ1-LSTD: Properties
- A "standard" lasso problem (use any solver).
- No problem in the off-policy case.
- Built-in (theoretical) \lambda-selection scheme.
- Finite-sample analysis: with probability 1 - \delta,

    \inf_\lambda \|A\theta_\lambda - b\|_2 \le O\left( \|\theta^*\|_1 \sqrt{\frac{p}{2n} \ln\frac{1}{\delta}} \right)

  (recall A = \lim_{n\to\infty} \tilde{A}, b = \lim_{n\to\infty} \tilde{b}, and A\theta^* = b).
Dantzig-LSTD: Algorithm
2nd formulation of LSTD:

    \theta_0 = \arg\min_\theta \|\tilde{A}\theta - \tilde{b}\|_2^2.

Dantzig-LSTD (Geist et al., 2012):

    \theta_\lambda = \arg\min_\theta \|\theta\|_1 \quad \text{subject to} \quad \|\tilde{A}\theta - \tilde{b}\|_\infty \le \lambda.

This is a linear program:

    \min_{u,\theta \in \mathbb{R}^p} 1^\top u \quad \text{subject to} \quad -u \le \theta \le u, \quad -\lambda 1 \le \tilde{A}\theta - \tilde{b} \le \lambda 1.
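The LP above can be handed to any LP solver. A sketch with scipy's `linprog`, again using the exact A and b of a hypothetical 3-state chain as stand-ins for \tilde{A}, \tilde{b}:

```python
import numpy as np
from scipy.optimize import linprog

def dantzig_lstd(A, b, lam):
    """Solve min ||theta||_1 s.t. ||A theta - b||_inf <= lam as an LP,
    with stacked variables z = [u, theta] in R^{2p}."""
    p = A.shape[1]
    c = np.concatenate([np.ones(p), np.zeros(p)])     # objective 1^T u
    # |theta_j| <= u_j, written as  theta - u <= 0  and  -theta - u <= 0
    A_abs = np.block([[-np.eye(p),  np.eye(p)],
                      [-np.eye(p), -np.eye(p)]])
    # -lam <= A theta - b <= lam, as two one-sided constraints
    A_res = np.block([[np.zeros((p, p)),  A],
                      [np.zeros((p, p)), -A]])
    res = linprog(c,
                  A_ub=np.vstack([A_abs, A_res]),
                  b_ub=np.concatenate([np.zeros(2 * p), b + lam, lam - b]),
                  bounds=[(0, None)] * p + [(None, None)] * p)
    assert res.status == 0, "LP solver failed"
    return res.x[p:]

# Exact A, b of a hypothetical 3-state chain (one-hot features, uniform mu).
gamma = 0.9
P = np.array([[0.9, 0.1, 0.0], [0.0, 0.8, 0.2], [0.1, 0.0, 0.9]])
A = (np.eye(3) - gamma * P) / 3
b = np.array([0.0, 0.0, 1.0]) / 3

theta = dantzig_lstd(A, b, lam=1e-4)
# As lam -> 0, the feasible set shrinks to A theta = b, the LSTD solution.
```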
Dantzig-LSTD: Properties
- A standard linear program.
- No problem in the off-policy case.
- Connection to Lasso-TD: \|\tilde{A}\theta_\lambda^{\text{lassoTD}} - \tilde{b}\|_\infty \le \lambda.
- Heuristic model selection scheme.
- Finite-sample analysis: with probability 1 - \delta,

    \inf_\lambda \|A\theta_\lambda - b\|_\infty \le O\left( \|\theta^*\|_1 \sqrt{\frac{1}{n} \ln\frac{p}{\delta}} \right)

  (recall A = \lim_{n\to\infty} \tilde{A}, b = \lim_{n\to\infty} \tilde{b}, and A\theta^* = b).
- Empirically: Lasso-TD, Dantzig-LSTD ≥ ℓ1-LSTD, ℓ1-PBR.
Summary
The algorithms can be summarized through the nested optimization problem

    \omega_\theta = \arg\min_\omega \|\tilde{r} + \gamma \tilde{\Phi}'\theta - \tilde{\Phi}\omega\|_2^2 + \lambda_1 \|\omega\|_2^2
    \theta_0 = \arg\min_\theta \|\tilde{\Phi}\theta - \tilde{\Phi}\omega_\theta\|_2^2 + \lambda_2 \|\theta\|_2^2

LSTD (no penalization): \hat{v}_0 = \tilde{\Pi}_\mu \hat{T} \hat{v}_0.
    \omega_\theta = \arg\min_\omega \|\tilde{r} + \gamma \tilde{\Phi}'\theta - \tilde{\Phi}\omega\|_2^2 + \lambda_1 \|\omega\|_2^2
    \theta_{\lambda_1,\lambda_2} = \arg\min_\theta \|\tilde{\Phi}\theta - \tilde{\Phi}\omega_\theta\|_2^2 + \lambda_2 \|\theta\|_2^2

ℓ2,2-LSTD (Farahmand et al., 2008):

    \theta_{\lambda_1,\lambda_2} = \arg\min_\theta \|\hat{v}_{\theta_{\lambda_1,\lambda_2}} - \tilde{\Pi}_\mu^{\lambda_1,2} \hat{T} \hat{v}_{\theta_{\lambda_1,\lambda_2}}\|_2^2 + \lambda_2 \|\theta\|_2^2
    \omega_\theta = \arg\min_\omega \|\tilde{r} + \gamma \tilde{\Phi}'\theta - \tilde{\Phi}\omega\|_2^2 + \lambda_1 \|\omega\|_1
    \theta_{\lambda_1,\lambda_2} = \arg\min_\theta \|\tilde{\Phi}\theta - \tilde{\Phi}\omega_\theta\|_2^2 + \lambda_2 \|\theta\|_2^2

Lasso-TD: \hat{v}_{\theta_{\lambda_1}} = \tilde{\Pi}_\mu^{\lambda_1,1} \hat{T} \hat{v}_{\theta_{\lambda_1}}
    \omega_\theta = \arg\min_\omega \|\tilde{r} + \gamma \tilde{\Phi}'\theta - \tilde{\Phi}\omega\|_2^2 + \lambda_1 \|\omega\|_2^2
    \theta_{\lambda_1,\lambda_2} = \arg\min_\theta \|\tilde{\Phi}\theta - \tilde{\Phi}\omega_\theta\|_2^2 + \lambda_2 \|\theta\|_1

ℓ1-PBR:

    \theta_{\lambda_1,\lambda_2} = \arg\min_\theta \|\hat{v}_{\theta_{\lambda_1,\lambda_2}} - \tilde{\Pi}_\mu^{\lambda_1,2} \hat{T} \hat{v}_{\theta_{\lambda_1,\lambda_2}}\|_2^2 + \lambda_2 \|\theta\|_1
Based on the residual \tilde{A}\theta - \tilde{b}:

ℓ1-LSTD:

    \theta_\lambda = \arg\min_\theta \|\theta\|_1 \quad \text{subject to} \quad \|\tilde{A}\theta - \tilde{b}\|_2^2 \le \lambda

Dantzig-LSTD:

    \theta_\lambda = \arg\min_\theta \|\theta\|_1 \quad \text{subject to} \quad \|\tilde{A}\theta - \tilde{b}\|_\infty \le \lambda
Thank you for your attention!
Questions?
Final Remark. This talk focused on projected-fixed-point-based regularization, but it is also possible to consider residual approaches, with a potential bias problem. The first (as far as I know) algorithm to use ℓ1-penalization for value function estimation is actually a (biased) residual approach (Loth et al., 2007).
References I
Bertsekas, Dimitri P., & Tsitsiklis, John N. 1996. Neuro-Dynamic Programming. Athena Scientific.

Bradtke, S. J., & Barto, A. G. 1996. Linear Least-Squares algorithms for temporal difference learning. Machine Learning, 22, 33–57.

Candes, E., & Tao, T. 2007. The Dantzig selector: statistical estimation when p is much larger than n. Annals of Statistics, 35(6), 2313–2351.

Efron, B., Hastie, T., Johnstone, I., & Tibshirani, R. 2004. Least Angle Regression. Annals of Statistics, 32(2), 407–499.

Farahmand, A., Ghavamzadeh, M., Szepesvári, C., & Mannor, S. 2008. Regularized Policy Iteration. In: Proc. of NIPS 21.

Geist, M., & Scherrer, B. 2011. ℓ1-penalized projected Bellman residual. In: Proc. of EWRL 9.

Geist, Matthieu, Scherrer, Bruno, Lazaric, Alessandro, & Ghavamzadeh, Mohammad. 2012. A Dantzig Selector Approach to Temporal Difference Learning. In: International Conference on Machine Learning (ICML).
References II
Ghavamzadeh, M., Lazaric, A., Munos, R., & Hoffman, M. 2011. Finite-Sample Analysis of Lasso-TD. In: Proc. of ICML.

Hoffman, M. W., Lazaric, A., Ghavamzadeh, M., & Munos, R. 2011. Regularized Least Squares Temporal Difference learning with nested ℓ2 and ℓ1 penalization. In: Proc. of EWRL 9.

Johns, J., Painter-Wakefield, C., & Parr, R. 2010. Linear Complementarity for Regularized Policy Evaluation and Improvement. In: Proc. of NIPS 23.

Kolter, J. Z., & Ng, A. Y. 2009. Regularization and Feature Selection in Least-Squares Temporal Difference Learning. In: Proc. of ICML.

Loth, Manuel, Davy, Manuel, & Preux, Philippe. 2007. Sparse Temporal Difference Learning using LASSO. In: IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning.

Pires, B. A. 2011. Statistical analysis of ℓ1-penalized linear estimation with applications. M.Phil. thesis, University of Alberta.

Pires, Bernardo A., & Szepesvári, Csaba. 2012. Statistical linear estimation with penalized estimators: an application to reinforcement learning. In: Proceedings of the 29th International Conference on Machine Learning (ICML 2012).
References III
Sutton, R. S., & Barto, A. G. 1998. Reinforcement Learning: an Introduction. The MIT Press.

Tibshirani, R. 1996. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, 58(1), 267–288.