Gabriel Peyré
www.numerical-tours.com
Low Complexity Regularization of
Inverse Problems
Course #3
Proximal Splitting Methods
Overview of the Course
• Course #1: Inverse Problems
• Course #2: Recovery Guarantees
• Course #3: Proximal Splitting Methods
Convex Optimization
Problem: $\min_{x \in \mathcal{H}} G(x)$
Setting: $\mathcal{H}$: Hilbert space. Here: $\mathcal{H} = \mathbb{R}^N$.
Class of functions: $G : \mathcal{H} \to \mathbb{R} \cup \{+\infty\}$
Convex: $\forall\, t \in [0,1], \quad G(tx + (1-t)y) \leq t\,G(x) + (1-t)\,G(y)$
Lower semi-continuous: $\liminf_{x \to x_0} G(x) \geq G(x_0)$
Proper: $\{x \in \mathcal{H} : G(x) \neq +\infty\} \neq \emptyset$
Indicator of a closed convex set $C$: $\iota_C(x) = 0$ if $x \in C$, $+\infty$ otherwise.
Example: $\ell^1$ Regularization
Inverse problem: measurements $y = K f_0 + w$, where $K : \mathbb{R}^N \to \mathbb{R}^P$, $P \ll N$.
Model: $f_0 = \Psi x_0$ sparse in a dictionary $\Psi \in \mathbb{R}^{N \times Q}$, $Q \geq N$.
Coefficients $x \in \mathbb{R}^Q$ $\to$ image $f = \Psi x \in \mathbb{R}^N$ $\to$ observations $y = K f \in \mathbb{R}^P$; set $\Phi = K \Psi \in \mathbb{R}^{P \times Q}$.
Sparse recovery: $f^\star = \Psi x^\star$ where $x^\star$ solves
$\min_{x \in \mathbb{R}^Q} \frac{1}{2}\|y - \Phi x\|^2 + \lambda \|x\|_1$   (fidelity + regularization)
Inpainting example: masking operator $(K f)_i = f_i$ if $i \in \Omega$, $0$ otherwise, so $K : \mathbb{R}^N \to \mathbb{R}^P$ with $P = |\Omega|$; $\Psi \in \mathbb{R}^{N \times Q}$ a translation-invariant wavelet frame.
[Figure: original $f_0 = \Psi x_0$, observations $y = \Phi x_0 + w$, recovery $\Psi x^\star$.]
Overview
• Subdifferential Calculus
• Proximal Calculus
• Forward Backward
• Douglas Rachford
• Generalized Forward-Backward
• Duality
Sub-differential
Sub-differential: $\partial G(x) = \{u \in \mathcal{H} : \forall z, \; G(z) \geq G(x) + \langle u, z - x \rangle\}$
Example: $G(x) = |x|$, $\partial G(0) = [-1, 1]$.
Smooth functions: if $F$ is $C^1$, $\partial F(x) = \{\nabla F(x)\}$.
First-order conditions: $x^\star \in \operatorname{argmin}_{x \in \mathcal{H}} G(x) \iff 0 \in \partial G(x^\star)$
Monotone operator: $U(x) = \partial G(x)$ satisfies
$(u, v) \in U(x) \times U(y) \implies \langle y - x, v - u \rangle \geq 0$
Example: $\ell^1$ Regularization
$x^\star \in \operatorname{argmin}_{x \in \mathbb{R}^Q} G(x) = \frac{1}{2}\|y - \Phi x\|^2 + \lambda \|x\|_1$
$\partial G(x) = \Phi^*(\Phi x - y) + \lambda\, \partial \|\cdot\|_1(x)$ where
$\partial \|\cdot\|_1(x)_i = \{\operatorname{sign}(x_i)\}$ if $x_i \neq 0$, and $[-1, 1]$ if $x_i = 0$.
Support of the solution: $I = \{i \in \{0, \ldots, Q-1\} : x^\star_i \neq 0\}$
First-order conditions: $\exists\, s \in \mathbb{R}^Q$, $\Phi^*(\Phi x^\star - y) + \lambda s = 0$ with $s_I = \operatorname{sign}(x^\star_I)$ and $\|s_{I^c}\|_\infty \leq 1$.
Example: Total Variation Denoising
$f^\star \in \operatorname{argmin}_{f \in \mathbb{R}^N} \frac{1}{2}\|y - f\|^2 + \lambda J(f)$
Important: the optimization variable is $f$.
Finite difference gradient: $\nabla : \mathbb{R}^N \to \mathbb{R}^{N \times 2}$, $(\nabla f)_i \in \mathbb{R}^2$.
Discrete TV norm: $J(f) = \sum_i \|(\nabla f)_i\|$
Write $J(f) = G(\nabla f)$ with $G(u) = \sum_i \|u_i\|$.
Composition by linear maps: $\partial (J \circ A) = A^* \circ (\partial J) \circ A$
Hence $\partial J(f) = -\operatorname{div}\big(\partial G(\nabla f)\big)$ where
$\partial G(u)_i = \left\{ \frac{u_i}{\|u_i\|} \right\}$ if $u_i \neq 0$, and $\{v \in \mathbb{R}^2 : \|v\| \leq 1\}$ if $u_i = 0$.
First-order conditions: $\exists\, v \in \mathbb{R}^{N \times 2}$, $f^\star = y + \lambda \operatorname{div}(v)$ with
$\forall i \in I, \; v_i = \frac{(\nabla f^\star)_i}{\|(\nabla f^\star)_i\|}$ and $\forall i \in I^c, \; \|v_i\| \leq 1$, where $I = \{i : (\nabla f^\star)_i \neq 0\}$.
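As a concrete illustration of the operators used above, here is a minimal NumPy sketch of the finite-difference gradient $\nabla$ and of $\operatorname{div} = -\nabla^*$, assuming a square image and periodic boundary conditions (the names grad, div and tv_norm are ours, not from the course code):

```python
import numpy as np

def grad(f):
    """Finite-difference gradient of a square image f (periodic boundary), shape (n, n, 2)."""
    return np.stack([np.roll(f, -1, axis=0) - f,   # vertical differences
                     np.roll(f, -1, axis=1) - f],  # horizontal differences
                    axis=-1)

def div(u):
    """Discrete divergence, defined so that div = -grad^* (minus the adjoint of grad)."""
    return (u[..., 0] - np.roll(u[..., 0], 1, axis=0)
            + u[..., 1] - np.roll(u[..., 1], 1, axis=1))

def tv_norm(f):
    """Discrete TV norm J(f) = sum_i ||(grad f)_i||."""
    return np.sqrt((grad(f) ** 2).sum(axis=-1)).sum()

# Sanity check of the adjoint relation <grad f, u> = <f, -div u>.
rng = np.random.default_rng(0)
f, u = rng.standard_normal((16, 16)), rng.standard_normal((16, 16, 2))
assert np.allclose((grad(f) * u).sum(), -(f * div(u)).sum())
```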
Overview
• Subdifferential Calculus
• Proximal Calculus
• Forward Backward
• Douglas Rachford
• Generalized Forward-Backward
• Duality
Proximal Operators
Proximal operator of $G$: $\operatorname{Prox}_{\lambda G}(x) = \operatorname{argmin}_z \frac{1}{2}\|x - z\|^2 + \lambda G(z)$
Examples:
$G(x) = \|x\|_1 = \sum_i |x_i|$: $\operatorname{Prox}_{\lambda G}(x)_i = \max\left(0, 1 - \frac{\lambda}{|x_i|}\right) x_i$   (soft thresholding)
$G(x) = \|x\|_0 = |\{i : x_i \neq 0\}|$: $\operatorname{Prox}_{\lambda G}(x)_i = x_i$ if $|x_i| \geq \sqrt{2\lambda}$, $0$ otherwise   (hard thresholding)
$G(x) = \sum_i \log(1 + |x_i|^2)$: obtained as the root of a 3rd order polynomial.
[Figure: the penalties $\|x\|_0$, $|x|$, $\log(1 + x^2)$ and the corresponding maps $x \mapsto \operatorname{Prox}_{\lambda G}(x)$.]
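The two closed-form proximal maps above translate directly into code. Below is a minimal NumPy sketch (the names prox_l1 and prox_l0 are ours):

```python
import numpy as np

def prox_l1(x, lam):
    """Soft thresholding: Prox of lam * ||.||_1, applied componentwise."""
    return np.maximum(0.0, 1.0 - lam / np.maximum(np.abs(x), 1e-15)) * x

def prox_l0(x, lam):
    """Hard thresholding: Prox of lam * ||.||_0, keeps entries with |x_i| >= sqrt(2 * lam)."""
    return np.where(np.abs(x) >= np.sqrt(2.0 * lam), x, 0.0)

x = np.array([-3.0, -0.5, 0.0, 0.2, 2.0])
print(prox_l1(x, 1.0))  # shrinks towards 0: [-2, 0, 0, 0, 1]
print(prox_l0(x, 1.0))  # keeps entries above sqrt(2): [-3, 0, 0, 0, 2]
```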
Proximal Calculus
Separability: $G(x) = G_1(x_1) + \ldots + G_n(x_n) \implies \operatorname{Prox}_G(x) = (\operatorname{Prox}_{G_1}(x_1), \ldots, \operatorname{Prox}_{G_n}(x_n))$
Quadratic functionals: $G(x) = \frac{1}{2}\|\Phi x - y\|^2 \implies \operatorname{Prox}_{\gamma G}(x) = (\mathrm{Id} + \gamma \Phi^* \Phi)^{-1}(x + \gamma \Phi^* y)$
Composition by a tight frame ($A \circ A^* = \mathrm{Id}$): $\operatorname{Prox}_{G \circ A} = A^* \circ \operatorname{Prox}_G \circ A + \mathrm{Id} - A^* A$
Indicators: $G(x) = \iota_C(x) \implies \operatorname{Prox}_{\gamma G}(x) = \operatorname{Proj}_C(x) = \operatorname{argmin}_{z \in C} \|x - z\|$
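The quadratic-functional rule is one of the few proxes requiring a linear solve. A minimal sketch, assuming a dense matrix $\Phi$ small enough to factor directly (prox_quadratic is our name for it):

```python
import numpy as np

def prox_quadratic(x, Phi, y, gamma):
    """Prox of gamma/2 * ||Phi z - y||^2: solve (Id + gamma Phi^T Phi) z = x + gamma Phi^T y."""
    Q = np.eye(Phi.shape[1]) + gamma * Phi.T @ Phi
    return np.linalg.solve(Q, x + gamma * Phi.T @ y)

# Numerical check of the prox optimality condition (z - x) + gamma * Phi^T (Phi z - y) = 0.
rng = np.random.default_rng(0)
Phi, y, x = rng.standard_normal((5, 8)), rng.standard_normal(5), rng.standard_normal(8)
z = prox_quadratic(x, Phi, y, gamma=0.7)
assert np.allclose(z - x + 0.7 * Phi.T @ (Phi @ z - y), 0.0)
```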
Prox and Subdifferential
Inverse of a set-valued mapping: $x \in U(y) \iff y \in U^{-1}(x)$
Resolvent of $\partial G$:
$z = \operatorname{Prox}_{\gamma G}(x) \iff 0 \in z - x + \gamma \partial G(z) \iff x \in (\mathrm{Id} + \gamma \partial G)(z) \iff z = (\mathrm{Id} + \gamma \partial G)^{-1}(x)$
$\operatorname{Prox}_{\gamma G} = (\mathrm{Id} + \gamma \partial G)^{-1}$ is a single-valued mapping.
Fix point: $x^\star \in \operatorname{argmin}_x G(x) \iff 0 \in \gamma \partial G(x^\star) \iff x^\star \in (\mathrm{Id} + \gamma \partial G)(x^\star) \iff x^\star = (\mathrm{Id} + \gamma \partial G)^{-1}(x^\star) = \operatorname{Prox}_{\gamma G}(x^\star)$
Gradient and Proximal Descents
Gradient descent: $x^{(\ell+1)} = x^{(\ell)} - \tau \nabla G(x^{(\ell)})$   [explicit]
($G$ is $C^1$ and $\nabla G$ is $L$-Lipschitz)
Theorem: if $0 < \tau < 2/L$, $x^{(\ell)} \to x^\star$ a solution.
Sub-gradient descent: $x^{(\ell+1)} = x^{(\ell)} - \tau_\ell v^{(\ell)}$, $v^{(\ell)} \in \partial G(x^{(\ell)})$   [explicit]
Theorem: if $\tau_\ell \sim 1/\ell$, $x^{(\ell)} \to x^\star$ a solution.   Problem: slow.
Proximal-point algorithm: $x^{(\ell+1)} = \operatorname{Prox}_{\gamma_\ell G}(x^{(\ell)})$   [implicit]
Theorem: if $\gamma_\ell \geq c > 0$, $x^{(\ell)} \to x^\star$ a solution.   Problem: $\operatorname{Prox}_{\gamma G}$ hard to compute.
Overview
• Subdifferential Calculus
• Proximal Calculus
• Forward Backward
• Douglas Rachford
• Generalized Forward-Backward
• Duality
Proximal Splitting Methods
Solve $\min_{x \in \mathcal{H}} E(x)$.   Problem: $\operatorname{Prox}_{\gamma E}$ is not available.
Splitting: $E(x) = F(x) + \sum_i G_i(x)$ with $F$ smooth and each $G_i$ simple.
Iterative algorithms using only $\nabla F(x)$ and $\operatorname{Prox}_{\gamma G_i}(x)$:
Forward-Backward: solves $F + G$
Douglas-Rachford: solves $\sum_i G_i$
Primal-Dual: solves $\sum_i G_i \circ A_i$
Generalized FB: solves $F + \sum_i G_i$
Smooth + Simple Splitting
Inverse problem: measurements $y = K f_0 + w$, $K : \mathbb{R}^N \to \mathbb{R}^P$, $P \ll N$.
Model: $f_0 = \Psi x_0$ sparse in the dictionary $\Psi$; set $\Phi = K \Psi$.
Sparse recovery: $f^\star = \Psi x^\star$ where $x^\star$ solves $\min_{x \in \mathbb{R}^Q} F(x) + G(x)$ with
Data fidelity (smooth): $F(x) = \frac{1}{2}\|y - \Phi x\|^2$
Regularization (simple): $G(x) = \lambda \|x\|_1 = \lambda \sum_i |x_i|$
Forward-Backward
$x^\star \in \operatorname{argmin}_x F(x) + G(x)$   $(\star)$
$\iff 0 \in \nabla F(x^\star) + \partial G(x^\star)$
$\iff (x^\star - \tau \nabla F(x^\star)) \in x^\star + \tau \partial G(x^\star)$
Fix point equation: $x^\star = \operatorname{Prox}_{\tau G}\big(x^\star - \tau \nabla F(x^\star)\big)$
Forward-backward iteration: $x^{(\ell+1)} = \operatorname{Prox}_{\tau G}\big(x^{(\ell)} - \tau \nabla F(x^{(\ell)})\big)$
Projected gradient descent: the special case $G = \iota_C$.
Theorem: let $\nabla F$ be $L$-Lipschitz. If $\tau < 2/L$, $x^{(\ell)} \to x^\star$ a solution of $(\star)$.
Example: $\ell^1$ Regularization
$\min_x \frac{1}{2}\|\Phi x - y\|^2 + \lambda \|x\|_1 \quad\iff\quad \min_x F(x) + G(x)$
$F(x) = \frac{1}{2}\|\Phi x - y\|^2$, $\nabla F(x) = \Phi^*(\Phi x - y)$, $L = \|\Phi^* \Phi\|$
$G(x) = \lambda \|x\|_1$, $\operatorname{Prox}_{\tau G}(x)_i = \max\left(0, 1 - \frac{\tau \lambda}{|x_i|}\right) x_i$
Forward-backward $\iff$ iterative soft thresholding.
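Putting the gradient step and the soft-thresholding prox together gives the iterative soft thresholding scheme. A minimal NumPy sketch, assuming a dense $\Phi$ so that $L = \|\Phi^*\Phi\|$ can be computed exactly (the names ista and soft_thresh are ours):

```python
import numpy as np

def soft_thresh(v, t):
    """Componentwise soft thresholding, i.e. Prox of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(Phi, y, lam, n_iter=500):
    """Forward-backward / iterative soft thresholding for min_x 1/2 ||Phi x - y||^2 + lam ||x||_1."""
    L = np.linalg.norm(Phi.T @ Phi, 2)   # Lipschitz constant of the gradient of F
    tau = 1.0 / L                        # any 0 < tau < 2/L is admissible
    x = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        x = soft_thresh(x - tau * Phi.T @ (Phi @ x - y),  # explicit gradient step on F
                        tau * lam)                        # implicit (prox) step on G
    return x

# Small synthetic test: recover a 5-sparse vector from 50 Gaussian measurements.
rng = np.random.default_rng(0)
Phi = rng.standard_normal((50, 200)) / np.sqrt(50)
x0 = np.zeros(200); x0[rng.choice(200, 5, replace=False)] = 1.0
x_hat = ista(Phi, Phi @ x0, lam=0.01)
```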
Convergence Speed
Setting: $\min_x E(x) = F(x) + G(x)$ where $\nabla F$ is $L$-Lipschitz and $G$ is simple.
Theorem: for all $\ell > 0$, the FB iterates satisfy $E(x^{(\ell)}) - E(x^\star) \leq C/\ell$.
The constant $C$ degrades when the step size is taken too small, i.e. as $\tau L \to 0$.
Complexity theory: optimal in a worst-case sense (see also the Nesterov method).
Multi-steps Accelerations
Beck-Teboulle accelerated FB (FISTA): initialize $t^{(0)} = 1$, $y^{(0)} = x^{(0)}$, then iterate
$x^{(\ell+1)} = \operatorname{Prox}_{G/L}\left( y^{(\ell)} - \tfrac{1}{L} \nabla F(y^{(\ell)}) \right)$
$t^{(\ell+1)} = \frac{1 + \sqrt{1 + 4 (t^{(\ell)})^2}}{2}$
$y^{(\ell+1)} = x^{(\ell+1)} + \frac{t^{(\ell)} - 1}{t^{(\ell+1)}} \left( x^{(\ell+1)} - x^{(\ell)} \right)$
Theorem: for all $\ell > 0$, $E(x^{(\ell)}) - E(x^\star) \leq C/\ell^2$.
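A minimal sketch of the accelerated scheme for the same $\ell^1$ problem, reusing the soft-thresholding prox; the name fista and the choice of test problem are ours:

```python
import numpy as np

def fista(Phi, y, lam, n_iter=500):
    """Beck-Teboulle accelerated forward-backward for min_x 1/2 ||Phi x - y||^2 + lam ||x||_1."""
    soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)
    L = np.linalg.norm(Phi.T @ Phi, 2)
    x = np.zeros(Phi.shape[1])
    z = x.copy()   # extrapolated point y^(l), renamed z to avoid clashing with the data y
    t = 1.0
    for _ in range(n_iter):
        x_new = soft(z - (1.0 / L) * Phi.T @ (Phi @ z - y), lam / L)
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t ** 2)) / 2.0
        z = x_new + ((t - 1.0) / t_new) * (x_new - x)   # momentum / extrapolation step
        x, t = x_new, t_new
    return x
```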
Overview
• Subdifferential Calculus
• Proximal Calculus
• Forward Backward
• Douglas Rachford
• Generalized Forward-Backward
• Duality
Douglas Rachford Scheme
Problem: $\min_x G_1(x) + G_2(x)$ with $G_1$ and $G_2$ simple.   $(\star)$
Reflexive prox: $\operatorname{RProx}_{\gamma G}(x) = 2\operatorname{Prox}_{\gamma G}(x) - x$
Douglas-Rachford iterations:
$z^{(\ell+1)} = \left(1 - \frac{\alpha}{2}\right) z^{(\ell)} + \frac{\alpha}{2} \operatorname{RProx}_{\gamma G_2} \circ \operatorname{RProx}_{\gamma G_1}(z^{(\ell)})$
$x^{(\ell+1)} = \operatorname{Prox}_{\gamma G_1}(z^{(\ell+1)})$
Theorem: if $0 < \alpha < 2$ and $\gamma > 0$, $x^{(\ell)} \to x^\star$ a solution of $(\star)$.
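The iterations only touch $G_1$ and $G_2$ through their proximal maps, so a generic implementation can take the two proxes as arguments. A minimal sketch (douglas_rachford is our name; prox_gi(x, gamma) is assumed to return $\operatorname{Prox}_{\gamma G_i}(x)$):

```python
import numpy as np

def douglas_rachford(prox_g1, prox_g2, z0, gamma=1.0, alpha=1.0, n_iter=500):
    """Douglas-Rachford iterations for min_x G1(x) + G2(x), both simple."""
    rprox = lambda prox, x: 2.0 * prox(x, gamma) - x   # reflexive prox
    z = z0.copy()
    for _ in range(n_iter):
        z = (1.0 - alpha / 2.0) * z + (alpha / 2.0) * rprox(prox_g2, rprox(prox_g1, z))
    return prox_g1(z, gamma)   # x^(l+1) = Prox_{gamma G1}(z^(l+1))
```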
DR Fix Point Equation
$x^\star$ solves $\min_x G_1(x) + G_2(x)$
$\iff 0 \in \partial(G_1 + G_2)(x^\star)$
$\iff \exists\, z, \; z - x^\star \in \partial(\gamma G_1)(x^\star)$ and $x^\star - z \in \partial(\gamma G_2)(x^\star)$
$\iff x^\star = \operatorname{Prox}_{\gamma G_1}(z)$ and $(2x^\star - z) - x^\star \in \partial(\gamma G_2)(x^\star)$
$\iff x^\star = \operatorname{Prox}_{\gamma G_1}(z)$ and $x^\star = \operatorname{Prox}_{\gamma G_2}(2x^\star - z) = \operatorname{Prox}_{\gamma G_2} \circ \operatorname{RProx}_{\gamma G_1}(z)$
Hence $z = 2x^\star - \operatorname{RProx}_{\gamma G_1}(z) = 2\operatorname{Prox}_{\gamma G_2} \circ \operatorname{RProx}_{\gamma G_1}(z) - \operatorname{RProx}_{\gamma G_1}(z) = \operatorname{RProx}_{\gamma G_2} \circ \operatorname{RProx}_{\gamma G_1}(z)$,
i.e. $z = \frac{1}{2} z + \frac{1}{2} \operatorname{RProx}_{\gamma G_2} \circ \operatorname{RProx}_{\gamma G_1}(z)$.
Example: Constrained $\ell^1$
$\min_{\Phi x = y} \|x\|_1 \quad\iff\quad \min_x G_1(x) + G_2(x)$
$G_1(x) = \iota_C(x)$, $C = \{x : \Phi x = y\}$, $\operatorname{Prox}_{\gamma G_1}(x) = \operatorname{Proj}_C(x) = x + \Phi^* (\Phi \Phi^*)^{-1}(y - \Phi x)$
$G_2(x) = \|x\|_1$, $\operatorname{Prox}_{\gamma G_2}(x)_i = \max\left(0, 1 - \frac{\gamma}{|x_i|}\right) x_i$
Efficient if $\Phi \Phi^*$ is easy to invert.
Example: compressed sensing, $\Phi \in \mathbb{R}^{100 \times 400}$ Gaussian matrix, $\|x_0\|_0 = 17$, $y = \Phi x_0$.
[Figure: decay of $\log_{10}(\|x^{(\ell)}\|_1 - \|x^\star\|_1)$ along the iterations, for $\gamma = 0.01$, $\gamma = 1$, $\gamma = 10$.]
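Specializing the generic iteration to this constrained $\ell^1$ problem only requires the affine projection and soft thresholding given above. A minimal sketch in the spirit of the compressed-sensing experiment (sizes and random seed are arbitrary; dr_basis_pursuit is our name):

```python
import numpy as np

def dr_basis_pursuit(Phi, y, gamma=1.0, n_iter=1000):
    """Douglas-Rachford for min_{Phi x = y} ||x||_1 (G1 = indicator of the affine set, G2 = ||.||_1)."""
    pinv = Phi.T @ np.linalg.inv(Phi @ Phi.T)                  # Phi^* (Phi Phi^*)^{-1}
    proj_C = lambda x: x + pinv @ (y - Phi @ x)                # Prox of G1 (a projection, gamma-free)
    soft = lambda x, t: np.sign(x) * np.maximum(np.abs(x) - t, 0.0)
    rprox1 = lambda z: 2.0 * proj_C(z) - z
    rprox2 = lambda z: 2.0 * soft(z, gamma) - z
    z = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        z = 0.5 * z + 0.5 * rprox2(rprox1(z))                  # alpha = 1
    return proj_C(z)

# Compressed sensing test in the spirit of the slide: 100 x 400 Gaussian matrix, 17-sparse x0.
rng = np.random.default_rng(0)
Phi = rng.standard_normal((100, 400))
x0 = np.zeros(400); x0[rng.choice(400, 17, replace=False)] = rng.standard_normal(17)
x_hat = dr_basis_pursuit(Phi, Phi @ x0)
```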
More than 2 Functionals
$\min_x G_1(x) + \ldots + G_k(x)$ where each $G_i$ is simple
$\iff \min_{(x_1, \ldots, x_k)} G(x_1, \ldots, x_k) + \iota_C(x_1, \ldots, x_k)$ where
$G(x_1, \ldots, x_k) = G_1(x_1) + \ldots + G_k(x_k)$
$C = \{(x_1, \ldots, x_k) \in \mathcal{H}^k : x_1 = \ldots = x_k\}$
$G$ and $\iota_C$ are simple:
$\operatorname{Prox}_{\gamma G}(x_1, \ldots, x_k) = (\operatorname{Prox}_{\gamma G_i}(x_i))_i$
$\operatorname{Prox}_{\gamma \iota_C}(x_1, \ldots, x_k) = (\tilde{x}, \ldots, \tilde{x})$ where $\tilde{x} = \frac{1}{k} \sum_i x_i$
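Both proximal maps on the product space are cheap: one acts block-wise, the other averages the copies. A minimal sketch, storing the $k$ copies as the rows of a $(k, n)$ array (prox_consensus and prox_separable are our names):

```python
import numpy as np

def prox_consensus(xs):
    """Prox of the indicator of C = {(x_1, ..., x_k) : x_1 = ... = x_k}: replace each copy by the mean."""
    return np.broadcast_to(xs.mean(axis=0), xs.shape).copy()

def prox_separable(xs, proxes, gamma):
    """Prox of G(x_1, ..., x_k) = sum_i G_i(x_i): apply Prox_{gamma G_i} to the i-th row."""
    return np.stack([prox(x, gamma) for prox, x in zip(proxes, xs)])
```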
Auxiliary Variables: DR
$\min_x G_1(x) + G_2 \circ A(x)$ with $G_1$, $G_2$ simple and $A : \mathcal{H} \to \mathcal{E}$ a linear map.
$\iff \min_{z \in \mathcal{H} \times \mathcal{E}} G(z) + \iota_C(z)$ with $G(x, y) = G_1(x) + G_2(y)$ and $C = \{(x, y) \in \mathcal{H} \times \mathcal{E} : A x = y\}$
$\operatorname{Prox}_{\gamma G}(x, y) = (\operatorname{Prox}_{\gamma G_1}(x), \operatorname{Prox}_{\gamma G_2}(y))$
$\operatorname{Prox}_{\iota_C}(x, y) = (x - A^* \tilde{y}, \; y + \tilde{y}) = (\tilde{x}, A\tilde{x})$ where $\tilde{y} = (\mathrm{Id} + A A^*)^{-1}(A x - y)$ and $\tilde{x} = (\mathrm{Id} + A^* A)^{-1}(A^* y + x)$
Efficient if $\mathrm{Id} + A A^*$ or $\mathrm{Id} + A^* A$ is easy to invert.
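A minimal sketch of this projection for a small dense $A$, checking numerically that the two equivalent formulas above agree (proj_graph is our name):

```python
import numpy as np

def proj_graph(x, y, A):
    """Projection of (x, y) onto C = {(x, y) : A x = y}, for a matrix A."""
    n = A.shape[1]
    x_tilde = np.linalg.solve(np.eye(n) + A.T @ A, A.T @ y + x)   # (Id + A^* A)^{-1} (A^* y + x)
    return x_tilde, A @ x_tilde

# Check against the equivalent formula via y_tilde = (Id + A A^*)^{-1} (A x - y).
rng = np.random.default_rng(0)
A, x, y = rng.standard_normal((7, 5)), rng.standard_normal(5), rng.standard_normal(7)
xt, yt = proj_graph(x, y, A)
y_tilde = np.linalg.solve(np.eye(7) + A @ A.T, A @ x - y)
assert np.allclose(xt, x - A.T @ y_tilde) and np.allclose(yt, y + y_tilde)
```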
Example: TV Regularization
Compute the solution of $\min_f \frac{1}{2}\|K f - y\|^2 + \lambda \|\nabla f\|_1$ where $\|u\|_1 = \sum_i \|u_i\|$,
handled by DR with the auxiliary variable $u = \nabla f$:
$G_1(u) = \lambda \|u\|_1$, $\operatorname{Prox}_{\gamma G_1}(u)_i = \max\left(0, 1 - \frac{\gamma \lambda}{\|u_i\|}\right) u_i$
$G_2(f) = \frac{1}{2}\|K f - y\|^2$, $\operatorname{Prox}_{\gamma G_2}(f) = (\mathrm{Id} + \gamma K^* K)^{-1}(f + \gamma K^* y)$
$C = \{(f, u) \in \mathbb{R}^N \times \mathbb{R}^{N \times 2} : u = \nabla f\}$
$\operatorname{Proj}_C(f, u) = (\tilde{f}, \nabla \tilde{f})$ where $(\mathrm{Id} + \nabla^* \nabla)\tilde{f} = \nabla^* u + f$ (recall $\nabla^* = -\operatorname{div}$):
$O(N \log(N))$ operations using the FFT.
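A minimal sketch of the two building blocks specific to this TV splitting: the group soft thresholding $\operatorname{Prox}_{\gamma G_1}$ and the FFT-based projection onto $C$, assuming a square image and periodic finite differences, consistent with the grad/div sketch given earlier (all names are ours):

```python
import numpy as np

def grad(f):   # periodic finite-difference gradient, as in the earlier sketch
    return np.stack([np.roll(f, -1, 0) - f, np.roll(f, -1, 1) - f], axis=-1)

def div(u):    # minus the adjoint of grad
    return (u[..., 0] - np.roll(u[..., 0], 1, 0)) + (u[..., 1] - np.roll(u[..., 1], 1, 1))

def prox_g1(u, t):
    """Prox of t * sum_i ||u_i||: soft thresholding of the norm of each gradient vector."""
    norms = np.maximum(np.sqrt((u ** 2).sum(axis=-1, keepdims=True)), 1e-15)
    return np.maximum(0.0, 1.0 - t / norms) * u

def proj_C(f, u):
    """Projection onto C = {(f, u) : u = grad f}: solve (Id + grad^* grad) f~ = grad^* u + f by FFT."""
    n = f.shape[0]
    w = 2.0 * np.pi * np.fft.fftfreq(n)
    symbol = 1.0 + (2 - 2 * np.cos(w))[:, None] + (2 - 2 * np.cos(w))[None, :]  # symbol of Id + grad^* grad
    f_tilde = np.real(np.fft.ifft2(np.fft.fft2(f - div(u)) / symbol))
    return f_tilde, grad(f_tilde)
```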
Example: TV Regularization
[Figure: original $f_0$, observations $y = K f_0 + w$, recovery $f^\star$, and the evolution along the iterations $\ell$.]
Overview
• Subdifferential Calculus
• Proximal Calculus
• Forward Backward
• Douglas Rachford
• Generalized Forward-Backward
• Duality