Unconstrained Recursive Importance Sampling and applications

V. Lemaire & G. Pagès
(*) LPMA, Univ. Paris 6
—
9 February 2009, CMAP-X
Basic problem of Numerical Probability: compute by Monte Carlo simulation

    m = E[F(X)],   X : (Ω, A, P) → R^q or (H, (.|.)_H) Hilbert space.

Examples in Finance: option premia, greek parameters, etc. If F(X) ∈ L²(P),

    || E[F(X)] − (1/M) Σ_{k=1}^M F(X_k) ||_2 = sqrt( Var(F(X)) / M ),

and the Central Limit Theorem, Law of the Iterated Logarithm, etc., are ruled by Var(F(X)).
This suggests variance reduction.
One tool (in R^d): importance sampling. Let X ∼ p(x)λ_d(dx) and X_θ ∼ p_θ(x)dx. Then

    E[F(X)] = ∫_{R^d} F(x) p(x) dx = ∫_{R^d} F(x) (p/p_θ)(x) p_θ(x) dx = E[ F(X_θ) (p/p_θ)(X_θ) ].

Resulting minimization problem: min_θ V(θ) with

    V(θ) = E[ F(X_θ)² (p²/p_θ²)(X_θ) ] = E[ F(X)² (p/p_θ)(X) ].

If θ ↦ p_θ(x) is log-concave and lim_{|θ|→∞} p_θ(x) = 0, then V is convex and lim_{|θ|→∞} V(θ) = +∞, so that

    Argmin_θ V = {∇V = 0} ≠ ∅.
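As a minimal illustration (not from the talk), the importance-sampling identity and its effect on the variance can be checked numerically in dimension 1 for a Gaussian mean shift; the payoff F and the shift θ = 1 below are arbitrary choices.

```python
import numpy as np

# Minimal 1-d check of E F(X) = E[F(X_theta) (p/p_theta)(X_theta)] for a
# Gaussian mean shift: p = N(0,1), p_theta = N(theta,1), so that
# (p/p_theta)(x) = exp(-theta*x + theta^2/2). Payoff and theta are arbitrary.

rng = np.random.default_rng(0)
F = lambda x: np.maximum(np.exp(x) - 1.0, 0.0)  # toy call-type payoff

def shifted_estimator(theta, M=200_000):
    x = theta + rng.standard_normal(M)              # X_theta ~ N(theta, 1)
    y = F(x) * np.exp(-theta * x + 0.5 * theta**2)  # F(X_theta) (p/p_theta)(X_theta)
    return y.mean(), y.var()

m0, v0 = shifted_estimator(0.0)   # crude Monte Carlo
m1, v1 = shifted_estimator(1.0)   # same expectation, different variance
```

Both runs target the same m = E[F(X)]; only the empirical variance changes with θ, which is exactly what the minimization of V exploits.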
Stochastic recursive zero search: the Robbins-Monro algorithm. If

    V(θ) = E[v(θ, X)]  “⇒”  ∇V(θ) = E[ ∂v/∂θ (θ, X) ]   (local gradient)
                                   = E[ w(θ, X) ]         (pseudo-local gradient).

Let ρ : R^d → (0, +∞) be a Borel function and set H(θ, x) := ρ(θ) w(θ, x).

Theorem (Extended Robbins-Monro algorithm). Let X_1, X_2, ..., X_n be i.i.d. with distribution L(X) on (Ω, A, P) (easy to simulate). Let γ = (γ_n)_{n≥1} satisfy

    Σ_n γ_n = +∞  and  Σ_n γ_n² < +∞.

Set

    θ_{n+1} = θ_n − γ_{n+1} H(θ_n, X_{n+1}),   θ_0 ∈ R^d.

Then, if (WARNING!)

    ∀ θ ∈ R^d,  ||H(θ, X)||_2 ≤ C(1 + |θ|),

there exists a random vector θ*_∞ : (Ω, A, P) → {∇V = 0} such that θ_n → θ*_∞ a.s. as n → ∞.
Why?

Key 1: with F_n = σ(θ_0, X_1, ..., X_n),

    E[ H(θ_n, X_{n+1}) | F_n ] = ( E[H(θ, X)] )_{|θ=θ_n} = ρ(θ_n) ∇V(θ_n),

so that

    θ_{n+1} = θ_n − γ_{n+1} ρ(θ_n) ∇V(θ_n)                                  (Newton-Raphson-like algorithm)
              − γ_{n+1} ( H(θ_n, X_{n+1}) − E[H(θ_n, X_{n+1}) | F_n] )       (martingale disturbance term).

Key 2: if sup_n E[|H(θ_n, X_{n+1})|²] < +∞ and Σ_{n≥1} γ_n² < +∞, then

    sup_n E[M_n²] ≤ Σ_{n≥1} γ_n² E[|H(θ_n, X_{n+1})|²] < +∞,

so that M_n → M_∞ ∈ L²(P): the disturbance term fades faster than γ_n.

Stochastic Approximation = deterministic zero search procedures + Monte Carlo (simulation).
I. Recursive variance reduction: the Arouna-Lapeyre (2003) algorithm revisited (Lemaire-P., 2008)

Importance sampling by mean translation: X ∼ N(0; I_d),

    E[F(X)] = ∫_{R^d} F(x) e^{−|x|²/2} dx/(2π)^{d/2},   p_θ(x) = p(x − θ)  with  p(x) = e^{−|x|²/2}/(2π)^{d/2}.

Cameron-Martin formula:

    E[F(X)] = E[ F(X + θ) e^{−(θ|X)} e^{−|θ|²/2} ].   (1)

The θ with the lowest variance is the solution to min_θ V(θ) with

    V(θ) = E[ F(X + θ)² e^{−2(θ|X)} e^{−|θ|²} ].

If F is smooth: large deviation approach (see Glasserman et al. 1999).
Comparison with GHS (Glasserman et al.):

    min_{θ∈R^d} E[ F(X)² e^{−⟨θ|X⟩ + |θ|²/2} 1_D(X) ],   where D := {F > 0}.

Now

    E[ F(X)² e^{−⟨θ|X⟩ + |θ|²/2} 1_D(X) ] = (2π)^{−d/2} ∫_D e^{2 log F(x) − ⟨θ|x⟩ + |θ|²/2 − |x|²/2} dx
                                          ≈ C max_{x∈D} exp( 2 log F(x) − ⟨θ|x⟩ + |θ|²/2 − |x|²/2 ),

so that the above minimization problem amounts to

    min_{θ∈R^d} max_{x∈D} ( 2 log F(x) − ⟨θ|x⟩ + |θ|²/2 − |x|²/2 ).
Following Arouna: a second change of variable gives

    V(θ) = e^{|θ|²/2} E[ F²(X) e^{−(X|θ)} ],

so that

    ∇V(θ) = e^{|θ|²/2} E[ F²(X) e^{−(θ|X)} (θ − X) ],

which yields

    θ_{n+1} = θ_n − γ_{n+1} H(θ_n, X_{n+1}),   H(θ, x) := e^{|θ|²/2} F²(x) e^{−(θ|x)} (θ − x).
Unfortunately,

    liminf_{|θ|→∞} ||H(θ, X)||_2 / |θ| = +∞ ...

Consequence: the algorithm does explode in general.

Remedy: Arouna suggests a constrained version of the algorithm with a slow relaxation of the constraint (i.e. a slowly increasing sequence of compact sets): an algorithm with repeated projections “à la Chen”, which is the mathematical formalization of repeated trials. A.s. convergence holds, with a CLT (Lelong 2007) once stabilization has occurred. In practice, the choice of the compact sets needs much care in connection with the step sequence.
New approach (Lemaire-P. (2008)): a third change of variable to plug θ back into the payoff F!

    ∇V(θ) = e^{|θ|²/2} E[ F²(X) e^{−(θ|X)} (θ − X) ] = e^{|θ|²} E[ F²(X − θ) (2θ − X) ].

Add a growth control of F at infinity (sub-multiplicativity):

    |F(x + y)| ≤ (a + b|F(x)|)(a + b|F(y)|) + c

⇓

    θ_{n+1} = θ_n − γ_{n+1} H̃(θ_n, X_{n+1}),   H̃(θ, x) := [ F(x − θ)² / (1 + F(−θ)²) ] (2θ − x).

The function H̃ satisfies the linear growth assumption in L²(P). Hence a.s. convergence to an optimal θ* ∈ Argmin V, with CLT, LIL, etc.
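A toy 1-d sketch of the resulting two-phase procedure (Phase 1: run the recursion; Phase 2: plain Monte Carlo at the frozen θ with the Cameron-Martin weight). The payoff F(x) = (x − 1)_+ and the step sequence γ_n = 5/(n + 500) are ad hoc choices for illustration, not the authors' calibrated settings.

```python
import numpy as np

# Unconstrained recursion for X ~ N(0,1) and the hypothetical payoff (x - 1)_+:
# H~(theta, x) = F(x - theta)^2 / (1 + F(-theta)^2) * (2 theta - x).

rng = np.random.default_rng(2)
F = lambda x: np.maximum(x - 1.0, 0.0)

theta = 0.0
xs = rng.standard_normal(200_000)
for n, x in enumerate(xs, start=1):          # Phase 1: search for theta*
    gamma = 5.0 / (n + 500)
    H = F(x - theta) ** 2 / (1.0 + F(-theta) ** 2) * (2.0 * theta - x)
    theta -= gamma * H

M = 200_000                                  # Phase 2: weighted Monte Carlo
x = rng.standard_normal(M)
shifted = F(x + theta) * np.exp(-theta * x - 0.5 * theta**2)
crude = F(rng.standard_normal(M))
```

The shifted estimator keeps the same mean for any frozen θ (formula (1)); only its variance depends on how close θ is to the optimum.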
II. Extension: mean translation for log-concave p.d.f.

Still p_θ(x) = p(x − θ), now with p log-concave and lim_{|x|→∞} p(x) = 0. We make the following assumption on the probability density p:

    ∃ a ∈ [1, 2] such that
    (i)  |∇p/p|(x) = O(|x|^{a−1}) as |x| → ∞,
    (ii) ∃ δ > 0 such that log p(x) + δ|x|^a is convex.

With the same approach based on three translations of the mean, if E[|X|^{2(a−1)} F²(X)] < +∞ and F is sub-multiplicative, then

    H(θ, x) := [ F²(x − θ) / (1 + F²(−θ)) ]                          (self-control)
               × e^{−2δ|θ|^a} [ p²(x − θ) / (p(x) p(x − 2θ)) ]       (≤ C)
               × ∇p(x − 2θ)/p(x − 2θ)

satisfies E[H(θ, X)] = ρ(θ)∇V(θ) and ||H(θ, X)||_2 ≤ C(1 + |θ|), so that the extended R-M Theorem applies.
III. Esscher transform

Set ψ(θ) := log E[e^{(θ|X)}] < +∞, θ ∈ R^d, and

    ∀ θ ∈ R^d,  p_θ(x) = e^{(θ|x) − ψ(θ)} p(x),   x ∈ R^d.

Assume that both X = X^{(0)} and X^{(θ)} = g(θ, U), U ∼ U([0,1]), can be simulated at the same “reasonable” cost. Then

    V(θ) = E[ F²(X) e^{−(θ|X) + ψ(θ)} ].

Suppose that the function ψ satisfies

    (i)  lim_{|θ|→∞} ψ(θ)/|θ| = +∞  or  lim_{|θ|→∞} ψ(θ) − 2ψ(θ/2) = +∞,
    (ii) ∃ δ > 0 such that θ ↦ ψ(θ) − δ|θ|² is concave,

and that the payoff F satisfies

    ∀ θ ∈ R^d,  E[ |X| F²(X) e^{(θ|X)} ] < +∞.

Then

    ∇V(θ) = E[ (∇ψ(θ) − X) F²(X) e^{−(θ|X) + ψ(θ)} ]
          = E[ (∇ψ(θ) − X^{(−θ)}) F²(X^{(−θ)}) ] e^{ψ(θ) − ψ(−θ)},

and the recursive procedure

    θ_{n+1} = θ_n − γ_{n+1} H(θ_n, X_{n+1}),   X_{n+1} := g(−θ_n, U_{n+1}),

where
– (U_n)_{n≥1} is an i.i.d. sequence (so that X_{n+1} ∼ L(X^{(−θ_n)})),
– H(θ, x) := e^{−(λ/2)√d |∇ψ(−θ)|} F²(x) (∇ψ(θ) − x),

satisfies θ_n → θ*_∞ a.s., where θ*_∞ is a {∇V = 0}-valued random vector.
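By the very definition of p_θ, the Esscher change of measure is unbiased: E[F(X)] = E[F(X^{(θ)}) e^{−(θ|X^{(θ)}) + ψ(θ)}]. This can be checked numerically (an illustration of my own, not from the talk) for X ~ Exp(1), where ψ(θ) = −log(1 − θ) for θ < 1 and the tilted law is Exp(1 − θ); the payoff and θ = 0.5 are arbitrary.

```python
import numpy as np

# Esscher identity check for X ~ Exp(1): psi(theta) = -log(1 - theta) and
# X^(theta) ~ Exp(1 - theta). The tail payoff (x - 3)_+ is a toy choice.

rng = np.random.default_rng(3)
F = lambda x: np.maximum(x - 3.0, 0.0)
theta = 0.5
psi = -np.log(1.0 - theta)

M = 400_000
x_crude = rng.exponential(1.0, M)
x_tilt = rng.exponential(1.0 / (1.0 - theta), M)   # numpy takes the scale = mean
est_crude = F(x_crude)
est_tilt = F(x_tilt) * np.exp(-theta * x_tilt + psi)
```

Tilting the samples toward the tail leaves the mean unchanged (here E[(X − 3)_+] = e^{−3}) while reducing the variance of the estimator.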
IV. Functional setting: pathwise dependent diffusions

We consider a d-dimensional Itô process X = (X_t)_{t∈[0,T]} solution to the S.D.E.

    (E_{b,σ,W}) ≡ dX_t = b(t, X^t) dt + σ(t, X^t) dW_t,   X_0 = x ∈ R^d,   (2)

where X^t denotes the process stopped at t.
– Includes diffusions and the Euler scheme.

    (H_{b,σ}) ≡ (i)  b and σ continuous on [0, T] × C([0, T], R^d),
                (ii) ∀ t ∈ [0, T], ∀ x, y ∈ C([0, T], R^d),
                     |b(t, y) − b(t, x)| + ||σ(t, y) − σ(t, x)|| ≤ C_{b,σ} ||x − y||_sup

ensures existence and uniqueness of a strong solution.
Aim: compute by Monte Carlo simulation E[F(X)], with F : C([0, T], R^d) → R, F(X) ∈ L¹(P) and P(F(X) ≠ 0) > 0.

An extension: replace θ ∈ R^d by

    ϕ(X^t) × θ(t),   ϕ the “driver”,

with ϕ : C([0, T], R^d) → R^q bounded and θ ∈ L²_T := L²([0, T], dt). For this talk, q = 1 and ϕ(ξ) ≡ 1 for every ξ ∈ C([0, T], R^d).

Tool: triple application of Girsanov's Theorem.
• Representation of E[F(X)] [Girsanov 1]:

    E[F(X)] = e^{−(1/2)||θ||²_{L²_T}} E[ F(X^{(θ)}) e^{−∫_0^T θ(s) dW_s} ],

where

    dX_t^{(θ)} = ( b(t, X^{(θ),t}) + θ(t) σ(t, X^{(θ),t}) ) dt + σ(t, X^{(θ),t}) dW_t,   X_0^{(θ)} = x ∈ R^d.

• Variance minimization [Girsanov 2]:

    min_{θ∈L²_T} V(θ)  or  min_{θ∈E} V(θ),   E = span(e_1, ..., e_m) ⊂ L²_T,

where

    V(θ) := e^{−||θ||²_{L²_T}} E[ F²(X^{(θ)}) e^{−2∫_0^T θ(s) dW_s} ]
          = e^{(1/2)||θ||²_{L²_T}} E[ F²(X) e^{−∫_0^T θ(s) dW_s} ].

• Gradient representation [Girsanov 3]: assume (H_{b,σ}) and E[F(X)^{2+δ}] < +∞. Then V is log-convex and lim_{||θ||_{L²_T}→+∞} V(θ) = +∞ (by Fatou), hence Argmin_{L²_T} V ≠ ∅ and Argmin_E V ≠ ∅ for E a closed (finite-dimensional) subspace of L²_T.
V is differentiable at every θ ∈ L²_T, with gradient ∇V(θ) ∈ L²_T: for every ψ ∈ L²_T,

    (∇V(θ)|ψ)_{L²_T} = e^{(1/2)||θ||²_{L²_T}} E[ F²(X) e^{−∫_0^T θ(s) dW_s} ( (θ|ψ)_{L²_T} − ∫_0^T ψ(s) dW_s ) ]
                     = e^{||θ||²_{L²_T}} E[ F²(X^{(−θ)}) ( 2(θ|ψ)_{L²_T} − ∫_0^T ψ(s) dW_s ) ].   (3)
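The first Girsanov representation can be sanity-checked numerically in a toy Black-Scholes model with a constant θ, since there X_T is exactly simulable and no Euler scheme is needed (all parameter values below are arbitrary, and this is my own check, not the speakers' experiment).

```python
import numpy as np
from math import erf, exp, log, sqrt

# [Girsanov 1] check for dX = r X dt + sigma X dW, F(X) = (X_T - K)_+ and a
# constant theta: E F(X) = e^{-theta^2 T/2} E[F(X^(theta)_T) e^{-theta W_T}],
# where X^(theta)_T amounts to shifting W_T by theta*T.

rng = np.random.default_rng(7)
x0, r, sigma, T, K, theta, M = 100.0, 0.04, 0.2, 1.0, 110.0, 1.0, 400_000

Phi = lambda z: 0.5 * (1.0 + erf(z / sqrt(2.0)))
f0 = x0 * exp(r * T)                                   # forward value
d1 = (log(f0 / K) + 0.5 * sigma**2 * T) / (sigma * sqrt(T))
ref = f0 * Phi(d1) - K * Phi(d1 - sigma * sqrt(T))     # E(X_T - K)_+ in closed form

w = sqrt(T) * rng.standard_normal(M)                   # W_T ~ N(0, T)
payoff = lambda wT: np.maximum(
    x0 * np.exp((r - 0.5 * sigma**2) * T + sigma * wT) - K, 0.0)

crude = payoff(w)
girsanov = np.exp(-0.5 * theta**2 * T) * payoff(w + theta * T) * np.exp(-theta * w)
```

Both estimators target the same closed-form value; in the functional setting of the slides the same identity holds path by path with θ(t)σ(t, ·) added to the drift.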
• “Sub-linearity correction” in (3): follows from an a priori (strong) control of X − X^{(θ)} (both living on the same space) in an appropriate space. Let p ≥ 1. For every θ ∈ L²_T, X and X^{(θ)} are strong solutions of E_{b,σ,W̃} and E_{b+σθ,σ,W̃}, and

    || sup_{t∈[0,T]} |X_t − X_t^{(θ)}| ||_p ≤ C_{b,σ} e^{C_{b,σ}T} || ∫_0^T |σ(s, X^{(θ),s}) θ(s)| ds ||_p
                                            ≤ C_σ ||θ||_{L¹_T} ( 1 + || ||X||_sup ||_{p(1+δ)} ) e^{(1/(2pδ))||θ||²_{L²_T}}.

Assume the functional F satisfies

    ∀ x ∈ C([0, T], R^d),  |F(x)| ≤ C_F (1 + ||x||^r_sup).

Let E = span(e_1, ..., e_m) ⊂ L²_T (to be specified). The algorithm defined by

    θ_{n+1} = θ_n − γ_{n+1} H_r(θ_n, X^{(−θ_n)}, W^{(n+1)}),

with (W^{(n)})_{n≥1} i.i.d. Brownian motions, X^{(−θ_n)} = G(−θ_n, W^{(n+1)}) solution to (E_{b−σθ_n, σ, W^{(n+1)}}), and, for every i ∈ {1, ..., m},

    H_{r,i}(θ, x, W) := (H_r(θ, x, W)|e_i)_{L²_T} = [ F²(x) / (1 + ||θ||^{2r}_{L²_T}) ] ( 2(θ|e_i)_{L²_T} − ∫_0^T e_i(s) dW_s ),

satisfies θ_n → θ*_E ∈ Argmin_E V a.s.
Practical implementation

Phase 1: compute θ* = θ_N using M_RM iterations.
Phase 2: compute E[F(X)] via a Monte Carlo simulation of size M_MC:

    E[F(X)] ≈ (1/M_MC) Σ_{m=1}^{M_MC} e^{−(1/2)||θ*||²_{L²_T}} F( (X^{(θ*)})^{(m)} ) e^{−∫_0^T θ*(s) dW_s^{(m)}},

with W^{(m)} independent Brownian motions. Take M_RM ≪ M_MC (e.g. M_RM ≈ M_MC/10).

Alternative: adaptive coupling (for mean translation only).
Numerical experiments (I): NIG Call option. X ∼ NIG(α, β, δ, µ), with density

    P_X(dx) = [ αδ K₁( α sqrt(δ² + (x − µ)²) ) / ( π sqrt(δ² + (x − µ)²) ) ] e^{δ sqrt(α² − β²) + β(x − µ)} dx,

where K₁ is the modified Bessel function of the second kind.

    F(x) = 50 (e^x − K)_+,   α = 2, β = 0.2, δ = 0.8, µ = 0.04,
    M_RM = 100 000,  M_MC = 1 000 000.
Translation:

    for n = 0 to M do
        X ~ NIG(alpha, beta, gamma, delta)
        theta = theta - 1/(n+1000) * H1(theta, X)
    for n = 0 to N do
        X ~ NIG(alpha, beta, gamma, delta)
        mean = mean + F(X) * p(X + theta) / p(X)

Esscher transform:

    for n = 0 to M do
        X ~ NIG(alpha, beta - theta, gamma, delta)
        theta = theta - 1/(n+1000) * H2(theta, X)
    for n = 0 to N do
        X ~ NIG(alpha, beta + theta, gamma, delta)
        mean = mean + F(X) * exp(-theta * X)
    mean = mean * exp(psi(theta))
    K     mean    crude var   var. ratio translation (θ)   var. ratio Esscher (θ)
    0.6   42.19   8538        5.885  (0.791)               56.484 (1.322)
    0.8   34.19   8388        7.525  (0.903)               39.797 (1.309)
    1.0   27.66   8176        9.218  (0.982)               32.183 (1.294)
    1.2   22.60   7930        10.068 (1.017)               29.232 (1.280)
    1.4   18.76   7677        9.956  (1.026)               28.496 (1.268)

Table 1: Variance reduction for different strikes.
Figure 1: Densities of X (crude), X + θ (translation) and X^{(θ)} (Esscher) in the case K = 1.
Numerical experiments (II): Simplified Spark Spread

We now consider an exchange option between gas and electricity (called a spark spread). A simplified form of the payoff is

    F(X) = 50 ( e^{X_elec} − c e^{X_gas} − K )_+,

where X_elec ∼ NIG(2, 0.2, 0.8, 0.04) and X_gas ∼ NIG(1.4, 0.2, 0.2, 0.04) are independent.
• 300 000 iterations of Robbins-Monro
• 3 000 000 iterations of Monte Carlo
Simplified Spark Spread option – results

    K     c     mean     crude var   var. ratio translation   var. ratio Esscher
    0.6   0.2   33.235   8378.4      5.2609                   27.455
          0.4   26.534   8133.3      5.0604                   28.669
          0.6   21.587   7862.7      4.8046                   30.649
          0.8   17.931   7595.2      4.5839                   33.656
          1     15.184   7344.2      4.4064                   37.489
    0.8   0.2   26.908   8160.1      5.1366                   28.876
          0.4   21.725   7884.9      4.844                    31.018
          0.6   17.955   7612.5      4.6031                   34.166
          0.8   15.156   7357.3      4.416                    38.167
          1     13.027   7123.9      4.2685                   42.781
Numerical experiments (III): Barrier options and local volatility models

Pseudo-CEV model (α ∈ (0, 1]):

    dX_t = r X_t dt + σ X_t^α ( X_t / sqrt(1 + X_t²) ) dW_t,   X_0 = x > 0.

Down-and-in Call:

    F(X) = (X_T − K)_+ 1_{ { min_{t≤T} X_t ≤ L } }.

Market parameters: X_0 = 100, r = 0.04, pseudo-volatility σ = 7, α = 0.5.
Contract parameters: K = 115, L = 65, T = 1.
Time discretization: continuous Euler scheme (Brownian bridge) with step T/n, n = 100.
M_RM = 50 000,  M_MC = 500 000.
Down & In Call option – Brownian interpolation

• (X̄_{t_k}) Euler scheme with step t_k = kT/n, n = 100.
• Brownian bridge interpolation and pre-conditioning:

    E[F(X̄)] = E[ E[ F(X̄) | X̄_{t_1}, ..., X̄_{t_n} ] ]
             = E[ (X̄_T − K)_+ ( 1 − Π_{k=0}^{n−1} p(X̄_{t_k}, X̄_{t_{k+1}}) ) ],

with

    p(x_k, x_{k+1}) = 1 − P( min_{t∈[t_k,t_{k+1}]} W_t ≤ (L − x_k)/σ(x_k) | W_{t_{k+1}} − W_{t_k} = (x_{k+1} − x_k)/σ(x_k) )
                    = 0                                                         if L ≥ min(x_k, x_{k+1}),
                    = 1 − e^{ −2(L − x_k)(L − x_{k+1}) / ( σ²(x_k)(t_{k+1} − t_k) ) }   otherwise.
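The closed-form non-crossing probability p(x_k, x_{k+1}) can be re-derived by brute force, sampling a Brownian bridge between the two Euler points on a fine grid (a check of my own; the diffusion coefficient is frozen over the step, and all numerical values are arbitrary test values).

```python
import numpy as np

def p_no_cross(x0, x1, L, sig, dt):
    """P(Brownian interpolation between x0 and x1 stays above the barrier L)."""
    if min(x0, x1) <= L:
        return 0.0
    return 1.0 - np.exp(-2.0 * (L - x0) * (L - x1) / (sig**2 * dt))

rng = np.random.default_rng(8)
x0, x1, L, sig, dt = 0.5, 0.3, 0.0, 1.0, 1.0
m, paths = 2000, 20_000
h = dt / m
b = np.full(paths, x0)
mn = b.copy()
for i in range(m - 1):                      # sequential Brownian bridge sampling
    rem = dt - i * h
    mean = b + (x1 - b) * (h / rem)
    var = sig**2 * h * (rem - h) / rem
    b = mean + np.sqrt(var) * rng.standard_normal(paths)
    mn = np.minimum(mn, b)
mn = np.minimum(mn, x1)
mc_cross = (mn <= L).mean()                 # discretized crossing frequency
exact_cross = 1.0 - p_no_cross(x0, x1, L, sig, dt)
```

The discretized minimum slightly under-detects crossings (the well-known discrete-barrier bias), so the brute-force frequency sits a little below the closed form; this is exactly the bias the pre-conditioning formula removes.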
Basis of L²([0,1], R)

• a polynomial basis: ∀ n ≥ 0, ∀ t ∈ [0,1],

    P̃_n(t) = P_n(2t − 1)  where  P_n(t) = (1/(2^n n!)) d^n/dt^n (t² − 1)^n;   (ShLeg)

• the Karhunen-Loève basis: ∀ n ≥ 0, ∀ t ∈ [0,1],

    e_n(t) = √2 sin( (n + 1/2) π t );   (KL)

• the Haar basis: ∀ n ≥ 0, ∀ k = 0, ..., 2^n − 1, ∀ t ∈ [0,1],

    ψ_{n,k}(t) = 2^{n/2} ψ(2^n t − k),   (Haar)

where ψ(t) = 1 if t ∈ [0, 1/2),  −1 if t ∈ [1/2, 1),  0 otherwise.
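The KL and Haar families above are orthonormal in L²([0,1]); a quick midpoint-rule check (illustrative only):

```python
import numpy as np

# Numerical orthonormality check of the KL and Haar bases on [0,1].

def kl(n, t):
    return np.sqrt(2.0) * np.sin((n + 0.5) * np.pi * t)

def haar(n, k, t):
    u = (2.0**n) * t - k                    # psi_{n,k}(t) = 2^{n/2} psi(2^n t - k)
    return (2.0 ** (n / 2.0)) * (((0.0 <= u) & (u < 0.5)).astype(float)
                                 - ((0.5 <= u) & (u < 1.0)).astype(float))

N = 200_000
t = (np.arange(N) + 0.5) / N                # midpoints: integral on [0,1] ~ mean
kl_norm = (kl(0, t) ** 2).mean()            # ~ 1
kl_orth = (kl(0, t) * kl(1, t)).mean()      # ~ 0
haar_norm = (haar(1, 0, t) ** 2).mean()     # ~ 1
haar_orth = (haar(1, 0, t) * haar(1, 1, t)).mean()  # ~ 0 (disjoint supports)
```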
Results for trivial driver ϕ = Id

    Basis            Dim.   Mean     CI 95%     Variance ratio
    Constant         1      3.1836   ±0.0251    2.6297
    ShiftLegendre    2      3.1830   ±0.0223    3.3258
                     4      3.1815   ±0.0215    3.5670
                     8      3.1813   ±0.0215    3.5659
    Karhunen-Loève   2      3.1852   ±0.0187    4.7254
                     4      3.1862   ±0.0183    4.9385
                     8      3.1918   ±0.0178    5.2183
    Haar             2      3.1834   ±0.0215    3.5699
                     4      3.1871   ±0.0186    4.7896
                     8      3.1864   ±0.0177    5.2675

Table 2: Variance ratio obtained for different bases in the local volatility model (K = 115, L = 65, variance of the crude Monte Carlo: 206.52).
Representations of the optimal variance reducer

[Plots of the optimal variance reducer for the constant, polynomial, Haar and KL bases, with 2, 4 and 8 basis functions.]
With a non-trivial driver

    ϕ(t, X̄^t) = ( p̄_k, 1 − p̄_k ),   with  p̄_k = P( min_{0≤s≤kT/n} X̄_s ≤ L | X̄_0, ..., X̄_{kT/n} ),

and E = (R 1_{[0,T]})², so that the variance reducer reads α p̄_k + β (1 − p̄_k).
Results with non-trivial driver ϕ

    K     L    Mean      CI 95%     Var.Ratio (Crude)   α         β
    85    65   3.1827    ±0.0127    10.02 (20.28)       -0.3057   1.5522
          75   6.4115    ±0.0190    9.96  (45.03)       -0.1428   1.7985
    95    65   3.1846    ±0.0124    10.65 (19.08)       -0.1141   1.9139
          75   6.4117    ±0.0199    9.07  (49.42)       -0.0029   1.9814
          85   11.4478   ±0.0293    8.03  (106.99)      0.1898    1.8937
    105   65   3.1835    ±0.0135    8.98  (22.65)       0.1487    1.9628
          75   6.4120    ±0.0209    8.21  (54.59)       0.1493    2.0060
          85   11.4458   ±0.0295    7.88  (108.94)      0.2503    1.8737
          95   18.6060   ±0.0345    9.83  (149.07)      0.5594    1.4343
    115   65   3.1817    ±0.0148    7.38  (27.54)       0.3062    1.6884
          75   6.4112    ±0.0209    8.18  (54.79)       0.1928    1.8119
          85   11.4470   ±0.0289    8.24  (104.16)      0.2599    1.7430
          95   18.6061   ±0.0346    9.79  (149.76)      0.5755    1.4313

Table 3: Variance reduction for different strikes and barrier levels in the local volatility model.
Results with non-trivial driver ϕ in the Black-Scholes model

    K     L    Mean      CI 95%     Var.Ratio (Crude)   α         β
    85    65   2.5738    ±0.0115    13.49 (16.56)       -0.1752   1.6685
          75   6.0489    ±0.0186    14.26 (43.39)       0.0493    1.9191
    95    65   2.5704    ±0.0110    14.64 (15.26)       0.0524    1.9987
          75   6.0492    ±0.0190    13.67 (45.25)       0.1557    2.0560
          85   11.5970   ±0.0301    12.23 (112.92)      0.4108    2.1226
    105   65   2.5687    ±0.0122    12.03 (18.56)       0.3888    2.1423
          75   6.0548    ±0.0206    11.66 (53.08)       0.3895    2.1720
          85   11.5953   ±0.0308    11.67 (118.32)      0.4524    2.1608
          95   19.2882   ±0.0348    17.17 (151.04)      0.6619    1.7910
    115   65   2.5706    ±0.0135    9.75  (22.90)       0.5473    1.8903
          75   6.0530    ±0.0211    11.16 (55.42)       0.4591    1.9371
          85   11.5976   ±0.0297    12.55 (109.98)      0.4807    2.0008
          95   19.2958   ±0.0347    17.21 (150.67)      0.7217    1.6380

Table 4: Variance reduction for different strikes and barrier levels in the Black-Scholes model (r = 0.04, σ = 0.7).
Computing VaR and CVaR by Monte Carlo simulation

O. Bardou & N. Frikha & G. Pagès
(*) LPMA, Univ. Paris 6
—
1 Definitions

Let X : (Ω, A, P) → R^d be a random vector (structure variable) and let ϕ : R^d → R be a Borel function representing a loss.

Definition (Value-at-Risk). Let α ∈ (0,1) be the confidence level.

    VaR_α(ϕ(X)) := inf { ξ | P(ϕ(X) ≤ ξ) ≥ α },

i.e. the lowest α-quantile of ϕ(X). If ϕ(X) has a continuous distribution (no atom), then P(ϕ(X) ≤ ξ) = α; if, moreover, the distribution of ϕ(X) has no “hole”, then VaR_α(ϕ(X)) is unique.

Definitions. (a) Conditional Value-at-Risk (CVaR) (at level α). As soon as ϕ(X) ∈ L¹(P), the conditional Value-at-Risk is defined by

    CVaR_α(ϕ(X)) := E[ ϕ(X) | ϕ(X) ≥ VaR_α(ϕ(X)) ].

It represents the mean loss once in the “stress” zone.

(b) Ψ-Conditional Value-at-Risk (Ψ-CVaR). If Ψ(ϕ(X)) ∈ L¹(P),

    Ψ-CVaR_α(ϕ(X)) := E[ Ψ(ϕ(X)) | ϕ(X) ≥ VaR_α(ϕ(X)) ].

It provides more insight into the distribution of ϕ(X) in the “stress” zone.
2 Rockafellar and Uryasev representation formula

Proposition. Let V and V_Ψ be the functions defined by

    V(ξ) = E[v(ξ, X)]  and  V_Ψ(ξ) = E[w_Ψ(ξ, X)],   (4)

where

    v(ξ, x) := ξ + (1/(1 − α)) (ϕ(x) − ξ)_+   (5)

and

    w_Ψ(ξ, x) := Ψ(ξ) + (1/(1 − α)) ( Ψ(ϕ(x)) − Ψ(ξ) ) 1_{ϕ(x)≥ξ}.   (6)

Assume L(ϕ(X)) is continuous. Then the function V is convex and differentiable, and any point of the set

    arg min V = {V' = 0} = { ξ | P(ϕ(X) ≤ ξ) = α }

is a VaR_α(ϕ(X)). Furthermore,

    CVaR_α(ϕ(X)) = min_{ξ∈R} V(ξ)  and, for every ξ*_α ∈ arg min V,  Ψ-CVaR_α(ϕ(X)) = V_Ψ(ξ*_α).

Proof. The function V is convex since the functions ξ ↦ (ϕ(x) − ξ)_+ are convex for every x ∈ R^d. V is differentiable with derivative V'(ξ) = 1 − (1/(1 − α)) P(ϕ(X) > ξ) and reaches its absolute minimum at any ξ*_α satisfying

    P(ϕ(X) > ξ*_α) = 1 − α,   i.e.  P(ϕ(X) ≤ ξ*_α) = α.

Moreover,

    V(ξ*_α) = ξ*_α + E[(ϕ(X) − ξ*_α)_+] / P(ϕ(X) > ξ*_α)
            = ( ξ*_α E[1_{ϕ(X)>ξ*_α}] + E[(ϕ(X) − ξ*_α)_+] ) / P(ϕ(X) > ξ*_α)
            = E[ ϕ(X) | ϕ(X) > ξ*_α ].

Likewise V_Ψ(ξ*_α) = Ψ-CVaR_α(ϕ(X)). ♦
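The proposition is easy to check numerically (my own sketch, not part of the talk) for ϕ(X) = X ~ N(0,1) and α = 0.95, where VaR ≈ 1.6449 and CVaR = φ(VaR)/(1 − α) ≈ 2.0627: minimizing a Monte Carlo estimate of V over a grid recovers both.

```python
import numpy as np

# Grid minimization of V(xi) = xi + E[(X - xi)_+]/(1 - alpha), estimated by
# Monte Carlo with common samples, for X ~ N(0,1) and alpha = 0.95.

rng = np.random.default_rng(4)
alpha = 0.95
x = rng.standard_normal(1_000_000)

xis = np.linspace(0.0, 3.0, 151)
V = np.array([xi + np.maximum(x - xi, 0.0).mean() / (1.0 - alpha) for xi in xis])
i = int(V.argmin())
var_est, cvar_est = xis[i], V[i]          # minimizer ~ VaR, minimum ~ CVaR
```

V is flat near its minimum (V' vanishes there), so the minimum value (CVaR) is estimated much more accurately than the minimizer (VaR), a point that matters for the stochastic algorithms below.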
3 Stochastic gradient and companion procedure

VaR_α computation. The derivative V' admits the representation

    V'(ξ) = E[H(ξ, X)]  with  H(ξ, x) := ∂v/∂ξ (ξ, x) = 1 − (1/(1 − α)) 1_{ϕ(x)≥ξ}.

Stochastic gradient defined by

    ξ_{n+1} = ξ_n − γ_{n+1} H(ξ_n, X_{n+1}),   n ≥ 0,   ξ_0 ∈ L¹(P),   (7)

where
– (X_n)_{n≥1} is an i.i.d. sequence of r.v. distributed as X, independent of ξ_0,
– (γ_n)_{n≥1} is a step sequence (decreasing to 0) satisfying

    (A1) ≡ Σ_{n≥1} γ_n = +∞  and  Σ_{n≥1} γ_n² < +∞.
Ψ-CVaR_α computation. Temporarily assume that ξ_n → ξ*_α = VaR_α(ϕ(X)) P-a.s.

– Naive idea: compute the function V_Ψ at ξ*_α = VaR_α(ϕ(X)),

    Ψ-CVaR_α = V_Ψ(ξ*_α) = E[w_Ψ(ξ*_α, X)],

using a regular Monte Carlo simulation,

    (1/n) Σ_{k=0}^{n−1} w_Ψ(ξ*_α, X'_{k+1}),   X'_k i.i.d. with distribution L(X).

– Alternative idea: an adaptive “companion” procedure of the quantile search algorithm: replace ξ*_α by ξ_k at step k, i.e.

    C_n = (1/n) Σ_{k=0}^{n−1} w_Ψ(ξ_k, X_{k+1}),   n ≥ 1,   C_0 = 0.

(C_n)_{n≥0} is the sequence of empirical means of the non-i.i.d. sequence (w_Ψ(ξ_k, X_{k+1}))_{k≥0}; it can be written recursively:

    C_{n+1} = C_n − (1/(n+1)) ( C_n − w_Ψ(ξ_n, X_{n+1}) ),   n ≥ 0,   C_0 = 0.

– Why γ_n rather than 1/n?... One may also average with the step γ_{n+1}:

    C_{n+1} = C_n − γ_{n+1} ( C_n − w_Ψ(ξ_n, X_{n+1}) ),   n ≥ 0,   C_0 = 0.

The resulting algorithm reads

    ξ_{n+1} = ξ_n − γ_{n+1} H(ξ_n, X_{n+1}),   ξ_0 ∈ L¹(P),   n ≥ 0,
    C_{n+1} = C_n − γ_{n+1} ( C_n − w_Ψ(ξ_n, X_{n+1}) ),   C_0 = 0,   n ≥ 0.
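A sketch of the coupled procedure for ϕ(X) = X ~ N(0,1), Ψ = Id and α = 0.95 (true values VaR ≈ 1.6449, CVaR ≈ 2.0627); the step γ_n = 20/(n + 200) for ξ and the 1/n-weighted companion are ad hoc tuning choices, not the authors' settings.

```python
import numpy as np

# Coupled recursion: xi_{n+1} = xi_n - gamma_{n+1} H(xi_n, X_{n+1}) for the
# quantile, plus the running mean C_n of w(xi_k, X_{k+1}) for the CVaR.

rng = np.random.default_rng(5)
alpha, n_iter = 0.95, 400_000
xs = rng.standard_normal(n_iter)

xi, C = 0.0, 0.0
for n in range(1, n_iter + 1):
    x = xs[n - 1]
    w = xi + max(x - xi, 0.0) / (1.0 - alpha)        # w(xi_n, X_{n+1}), Psi = Id
    C -= (C - w) / n                                 # companion empirical mean
    xi -= (20.0 / (n + 200)) * (1.0 - (x >= xi) / (1.0 - alpha))  # quantile step
```

Note that C is computed with the pre-update ξ_n, matching the summand w_Ψ(ξ_k, X_{k+1}); because V is flat at its minimum, the early (inaccurate) iterates of ξ barely bias C.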
3.1 The a.s. convergence of the quantile search algorithm

A (slightly) more general result than in Part III...

Theorem (Extended Robbins-Monro Theorem). Let H : R^q × R^d → R^q be a Borel function and X an R^d-valued random vector such that E[|H(ξ, X)|] < ∞ for every ξ ∈ R^q. Set

    ∀ ξ ∈ R^q,  h(ξ) = E[H(ξ, X)].

Suppose that the function h is continuous and that T* := {h = 0} satisfies

    ∀ ξ ∈ R^q \ T*, ∀ ξ* ∈ T*,  ⟨ξ − ξ*, h(ξ)⟩ > 0.   (8)

Let (γ_n)_{n≥1} be a decreasing step sequence satisfying

    (A1) ≡ Σ_n γ_n = +∞  and  Σ_n γ_n² < +∞.

Suppose that

    ∀ ξ ∈ R^q,  ||H(ξ, X)||_2 ≤ C(1 + |ξ|)   (9)

(which implies |h(ξ)|² ≤ C(1 + |ξ|²)). Let (X_n)_{n≥1} be an i.i.d. sequence of r.v. with distribution L(X), let ξ_0 be a random vector independent of (X_n)_{n≥1} with E[|ξ_0|] < ∞, all defined on the same probability space (Ω, A, P). Let (r_n)_{n≥1} be an F_n-measurable remainder sequence satisfying

    Σ_n γ_n |r_n| < ∞.

Then the recursive procedure defined by

    ξ_n = ξ_{n−1} − γ_n H(ξ_{n−1}, X_n) + γ_n r_n,   n ≥ 1,

satisfies: there exists a r.v. ξ_∞ such that ξ_n → ξ_∞ a.s. and ξ_∞ ∈ T* a.s. The convergence also holds in L^p(P), p ∈ (0,2).

Application to quantile search:

    H(ξ, x) = 1 − (1/(1 − α)) 1_{ϕ(x)≥ξ},   r_n ≡ 0
    ⇓
    ξ_n → ξ*_α a.s., where ξ*_α is a VaR_α(ϕ(X)) (an α-quantile of ϕ(X)).
3.2 The a.s. convergence of the companion algorithm

Representation of γ_n:

    γ_n = γ_0 Δ_n / S_n,   n ≥ 0,   with  S_n = Σ_{k=0}^n Δ_k,   γ_0 := sup_{n≥1} γ_n + 1.   (10)

Conversely,

    Δ_{n+1} = Δ_n (γ_{n+1}/γ_n) ( γ_0 / (γ_0 − γ_{n+1}) ),   n ≥ 0,   Δ_0 = 1.

Elementary computations show that

    Σ_n γ_n = +∞  ⇒  lim_{n→+∞} S_n = +∞,

and

    C_n = (1/S_n) Σ_{k=0}^{n−1} Δ_{k+1} w_Ψ(ξ_k, X_{k+1})
        = (1/S_n) ( Σ_{k=0}^{n−1} Δ_{k+1} V_Ψ(ξ_k) + Σ_{k=0}^{n−1} Δ_{k+1} ΔN_{k+1} ),

where

    ΔN_{k+1} = w_Ψ(ξ_k, X_{k+1}) − E[ w_Ψ(ξ_k, X_{k+1}) | F_k^X ] = w_Ψ(ξ_k, X_{k+1}) − V_Ψ(ξ_k)

is an F_n^X-martingale increment.

– The first sum on the right-hand side converges to V_Ψ(ξ*_α) = Ψ-CVaR_α(ϕ(X)) owing to the continuity of V_Ψ at ξ*_α and Cesàro's Lemma.
– The convergence to 0 of the second sum will follow from the a.s. convergence of the martingale

    N_n^γ := Σ_{k=1}^n γ_k ΔN_k,   n ≥ 1,

and the Kronecker Lemma, since γ_n = γ_0 Δ_n / S_n.

The sequence (N_n^γ)_{n≥1} is an F_n^X-martingale since the ΔN_k's are martingale increments, and

    E[ (ΔN_n)² | F_{n−1}^X ] ≤ (1/(1 − α)²) E[ (Ψ(ϕ(X)) − Ψ(ξ))² ]_{|ξ=ξ_{n−1}}.

The continuity of Ψ at ξ*_α and the a.s. convergence of ξ_k toward ξ*_α imply that

    sup_{n≥1} E[ (ΔN_n)² | F_{n−1}^X ] < ∞  a.s.

Consequently, Assumption (A1) implies

    ⟨N^γ⟩_∞ = Σ_{n≥1} γ_n² E[ (ΔN_n)² | F_{n−1}^X ] < ∞,

which in turn yields the a.s. convergence of (N_n^γ)_{n≥1}. ♦
3.3 Rate of convergence (I): CLT

Set Z_n = (ξ_n, C_n), with γ_n = κ/n, n ≥ 1. Under appropriate assumptions, with z* := (VaR_α(ϕ(X)), Ψ-CVaR_α(ϕ(X))),

    √n (Z_n − z*) → N( 0, √κ D*(κ) )  in distribution,

where D*(κ) depends on v, w_Ψ and [Dh(z*)]^{−1}, with

    h(ξ, c) = ( E[H(ξ, X)], c − E[w_Ψ(ξ, X)] ),

and κ > 1/(2 ℜ(λ_min)), λ_min being the eigenvalue of Dh(z*) with the lowest real part. To be compared to the regular Newton-Raphson algorithm (if d = 1 ...).

Question: how to minimize the asymptotic variance √κ D*(κ) as a function of κ?
3.4 Rate of convergence (II): Averaging principle

Theorem (Ruppert and Polyak's Averaging Principle). Suppose that the R^d-valued sequence (Z_n)_{n≥0} is defined recursively by

    Z_{n+1} = Z_n − γ_{n+1} ( h(Z_n) + ε_{n+1} ),

where (ε_n)_{n≥1} is an L^{2+η}-bounded sequence of martingale increments such that

    ∃ Γ ∈ S^+(d, R)  such that  E[ ε_{n+1} ε_{n+1}^t | F_n ] → Γ  a.s.

Assume that {h = 0} = {z*}, that M = Dh(z*) exists, lies in GL(d, R) and is repulsive (i.e. all its eigenvalues have positive real parts), and that

    h(z) = Dh(z*)(z − z*) + O(|z − z*|²).

Set γ_n = γ_1/n^a with 1/2 < a < 1, and

    Z̄_{n+1} := (Z_0 + ... + Z_n)/(n + 1) = Z̄_n − (1/(n+1)) ( Z̄_n − Z_n ),   n ≥ 0.

Then

    √n ( Z̄_n − z* ) → N(0, D*)  in distribution,

with D* = M^{−1} Γ (M^{−1})^T, which is optimal in terms of “statistical efficiency”.
Comments. A proof of this result is given in Duflo’s book Algorithmes stochastiques (p.169).
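The averaging principle can be sketched on the quantile recursion itself (illustrative constants of my own; Gaussian toy case with α = 0.95, true VaR ≈ 1.6449): slow steps γ_n = n^{−3/4} plus a running average of the iterates.

```python
import numpy as np

# Ruppert-Polyak averaging on the quantile search: gamma_n = n^{-3/4}
# (1/2 < a < 1) and bar-Z_n the running average of the iterates.

rng = np.random.default_rng(6)
alpha, n_iter = 0.95, 400_000
xs = rng.standard_normal(n_iter)

xi, xbar = 0.0, 0.0
for n in range(1, n_iter + 1):
    xi -= (1.0 / n**0.75) * (1.0 - (xs[n - 1] >= xi) / (1.0 - alpha))
    xbar += (xi - xbar) / n            # bar-Z_{n} = bar-Z_{n-1} - (bar-Z_{n-1} - Z_{n-1})/n
```

The averaged iterate reaches the optimal CLT variance without having to tune the constant κ of the γ_n = κ/n schedule.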
Application to VaR_α(ϕ(X)) and Ψ-CVaR_α(ϕ(X)) computation. Assume the distribution of ϕ(X) has a positive density f_{ϕ(X)} > 0 and set ξ*_α := VaR_α(ϕ(X)) (unique). Then D* is given by the symmetric matrix

    D* = ( a_11  a_12 )
         ( a_12  a_22 ),

with

    a_11 = α(1−α) / f_{ϕ(X)}(ξ*_α)²,
    a_12 = ( α / ((1−α) f_{ϕ(X)}(ξ*_α)) ) E[ (Ψ(ϕ(X)) − Ψ(ξ*_α)) 1_{ϕ(X)≥ξ*_α} ],
    a_22 = (1/(1−α)²) Var( (Ψ(ϕ(X)) − Ψ(ξ*_α)) 1_{ϕ(X)≥ξ*_α} ).
4 Speeding up the procedure

By the way, VaR_α(ϕ(X)) is about rare events! If α ≈ 1, the matrix D* explodes! In practice, the algorithm remains frozen!

(Classical) solution: variance reduction by importance sampling! Assume L(X) = p(x)λ_d(dx).

Example: twist the distribution of X by mean translation to minimize both

    Var( 1_{ϕ(X)≥ξ*_α} )                        (asymptotic variance of VaR_α),
    Var( (Ψ(ϕ(X)) − ξ*_α) 1_{ϕ(X)≥ξ*_α} )      (asymptotic variance of CVaR_α),

i.e. find parameters θ* and µ* that minimize the two functions

    Q*_1(θ) := E[ 1_{ϕ(X)≥ξ*_α} p(X) / p(X − θ) ],
    Q*_2(µ) := E[ (Ψ(ϕ(X)) − ξ*_α)² 1_{ϕ(X)≥ξ*_α} p(X) / p(X − µ) ].

Objections. 1. Explosion! We know how to cope with it (see Part II). 2. The target ξ*_α is unknown! We make it “implicit”.
4.1 General log-concave setting from Part II

If p satisfies the assumptions of the log-concave setting,

    ∃ a ∈ [1, 2] such that
    (i)  |∇p/p|(x) = O(|x|^{a−1}) as |x| → ∞,
    (ii) ∃ ρ > 0 such that log p(x) + ρ|x|^a is convex,

then the first step of the machinery of Part II yields: the optimal variance reducers (θ*_α, µ*_α) are zeros of

    ∇Q*_1(θ) = 0  and  ∇Q*_2(µ) = 0,  where