A remark on the optimal transport between two probability measures sharing the same copula

(1)

HAL Id: hal-00844906

https://hal-enpc.archives-ouvertes.fr/hal-00844906

Submitted on 16 Jul 2013

HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci- entific research documents, whether they are pub- lished or not. The documents may come from

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de

A remark on the optimal transport between two probability measures sharing the same copula

Aurélien Alfonsi, Benjamin Jourdain

To cite this version:

Aurélien Alfonsi, Benjamin Jourdain. A remark on the optimal transport between two probability measures sharing the same copula. Statistics and Probability Letters, Elsevier, 2014, dx.doi.org/10.1016/j.spl.2013.09.035. �hal-00844906�

(2)

A remark on the optimal transport between two probability measures sharing the same copula

A. Alfonsi, B. Jourdain^∗ July 16, 2013

Abstract

We are interested in the Wasserstein distance between two probability measures onRⁿ sharing the same copula C. The image of the probability measure dC by the vectors of pseudo-inverses of marginal distributions is a natural generalization of the coupling known to be optimal in dimension n= 1. It turns out that for cost functionsc(x, y) equal to the p-th power of theL^q norm ofx−y inRⁿ, this coupling is optimal only whenp=qi.e. when c(x, y) may be decomposed as the sum of coordinate-wise costs.

Keywords: Optimal transport, Copula, Wasserstein distance, Inversion of the cumulative distribution function.

AMS Classification (2010): 60E05, 60E15.

1 Optimal transport between two probability measures sharing the same copula

Given two probability measures µ and ρ, the optimal transport theory aims at minimizing R c(x, y)ν(dx, dy) over all couplings ν with first marginal ν ◦((x, y) 7→ x)⁻¹ = µ and second marginal ν ◦((x, y) 7→ y)⁻¹ = ρ for a measurable non-negative cost function c. We use the notationν<^µρ for such couplings. In the present note, we are interested in the particular case of the so-called Wasserstein distance between two probability measures µand ρ on Rⁿ :

W_p,q(µ, ρ) = inf

ν<^µρ

Z

Rⁿ×Rⁿ

kx−yk^p_qν(dx, dy) 1/p

(1.1)

∗Universit´e Paris-Est, CERMICS, Projet MathFi ENPC-INRIA-UMLV, 6 et 8 avenue Blaise Pascal, 77455 Marne La Vall´ee, Cedex 2, France, e-mails : alfonsi@cermics.enpc.fr, jourdain@cermics.enpc.fr. This research benefited from the support of the “Chaire Risques Financiers”, Fondation du Risque and the French National Research Agency (ANR) under the program ANR-12-BLAN Stab.

(3)

obtained for the choicec(x, y) =kx−yk^p_q. HereRⁿis endowed with the norm k(x₁, . . . , x_n)k_q = (Pn

i=1|x_i|^q)^1/q for q ∈ [1,+∞) whereas p ∈ [1,+∞) is the power of this norm in the cost function.

In dimension n = 1, kxk_q = |x| so that the Wasserstein distance does not depend on q and is simply denoted by W_p. Moreover, the optimal transport is given by the inversion of the cumulative distribution functions : whatever p∈[1,+∞), the optimal coupling is the image of the Lebesgue measure on (0,1) byu7→(F_µ⁻¹(u), F_ρ⁻¹(u)) where foru∈(0,1),F_µ⁻¹(u) = inf{x∈ R:µ((−∞, x])≥u}andF_ρ⁻¹(u) = inf{x∈R:ρ((−∞, x])≥u}(see for instance Theorem 3.1.2 in [3]). This implies that W_p^p(µ, ρ) =R

(0,1)|F_µ⁻¹(u)−F_ρ⁻¹(u)|^pdu.

In higher dimensions, according to Sklar’s theorem (see for instance Theorem 2.10.11 in Nelsen [1]), µ

n

Y

i=1

(−∞, xi]

!

=C(µ1((−∞, x1]), . . . , µn((−∞, xn]))

where we denote by µ_i = µ◦((x₁, . . . , x_n) 7→ x_i)⁻¹ the i-th marginal of µ and C is a copula function i.e. C(u₁, . . . , u_n) =m(Qn

i=1[0, u_i]) for some probability measurem on [0,1]ⁿ with all marginals equal to the Lebesgue measure on [0,1]. The copula functionC is uniquely determined on the product of the ranges of the marginal cumulative distribution functionsx_i 7→µ_i((−∞, x_i]).

In particular, when the marginalsµ_i do not weight points, the copulaCis uniquely determined.

Sklar’s theorem shows that the dependence structure associated withµis encoded in the copula functionC. Last, we give the well-known Fr´echet-Hoeffding bounds

∀u₁, . . . , u_n∈[0,1], C_n⁻(u₁, . . . , u_n)≤C(u₁, . . . , u_n)≤C_n⁺(u₁, . . . , u_n)

that hold for any copula functionCwithC_n⁺(u₁, . . . , u_n) = min(u₁, . . . , u_n) andC_n⁻(u₁, . . . , u_n) = (u₁+· · ·+u_n−n+ 1)⁺ (see Nelsen [1], Theorem 2.10.12 or Rachev and R¨uschendorf [3], section 3.6). We recall that the copula C_n⁺is then-dimensional cumulative distribution function of the image of the Lebesgue measure on [0,1] by R∋x7→(x, . . . , x)∈Rⁿ. Also the copulaC₂⁻ is the 2-dimensional cumulative distribution function of the image of the Lebesgue measure on [0,1]

by R∋x7→(x,1−x)∈R² and, forn≥3,C_n⁻ is not a copula.

In dimension n= 1, the unique copula function is C(u) =uand therefore the optimal coupling between µ and ρ, which necessarily share this copula, is the image of the probability measure dC byu7→(F_µ⁻¹(u), F_ρ⁻¹(u)). It is therefore natural to wonder whether, whenµandρshare the same copula C in higher dimensions, the optimal coupling is still the image of the probability measure dC by (u₁, . . . , u_n) 7→ (F_µ⁻₁¹(u₁), . . . , F_µ⁻_n¹(u_n), F_ρ⁻₁¹(u₁), . . . , F_ρ⁻_n¹(u_n)). We denote by µ⋄ρ this probability law on R²ⁿ. It turns out that the picture is more complicated than in dimension one because of the choice of the index q of the norm.

Proposition 1.1 Let n ≥ 2, µ and ρ be two probability measures on Rⁿ sharing the same copula C andW_p,q(µ, ρ) = inf_ν<^µ_ρ

R

Rⁿ×Rⁿkx−yk^pqν(dx, dy)1/p

.

• If p=q, then an optimal coupling between µand ρ is given byν =µ⋄ρ and W_p,p^p (µ, ρ) =

Z

[0,1]ⁿ n

X

i=1

|F_µ⁻¹_i (u_i)−F_ρ⁻¹_i (u_i)|^pdC(u₁, . . . , u_n) = Z

[0,1]

n

X

i=1

|F_µ⁻¹_i (u)−F_ρ⁻¹_i (u)|^pdu.

(4)

• If p 6= q, the coupling µ⋄ρ is in general no longer optimal. For p < q, if C 6= C_n⁺, we can construct probability measures µandρ on Rⁿ admitting C as their unique copula such that

Z

Rⁿ×Rⁿ

kx−yk^p_qµ⋄ρ(dx, dy) 1/p

>W_p,q(µ, ρ).

For p > q, the same conclusion holds ifn≥3 or n= 2 andC 6=C₂⁻.

Remark 1.2 Let µ and ρ be two probability measures on Rⁿ and ν <^µ_ρ. For n = 1, ν is said to be comonotonic if ν((−∞, x],(−∞, y]) = C₂⁺(µ((−∞, x]), ρ((−∞, y])). Puccetti and Scarsini [2] investigate several extensions of this notion for n≥2. In particular, they say that ν is π-comonotonic (resp. c-comonotonic) if µ and ρ have a common copula and ν = µ⋄ρ (resp. ν maximizes R

Rⁿ×Rⁿhx, yiν(dx, dy)˜ over all the coupling measuresν<˜ ^µ_ρ). Looking at some connections between their different definitions of comonotonicity, they show in Lemma 4.4 that π-comonotonicity implies c-comonotonicity. Since

Z

Rⁿ×Rⁿ

kx−yk²₂ν(dx, dy) =˜ Z

Rⁿ

kxk²₂µ(dx) + Z

Rⁿ

kyk²₂ρ(dy)−2 Z

Rⁿ×Rⁿ

hx, yi˜ν(dx, dy), this yields our result in the case p=q= 2.

2 Proof of Proposition 1.1

The optimality in the case q =p, follows by choosing d₁ = . . . =d_n = d^′₁ = . . . =d^′_n = d^′′₁ = . . .=d^′′_n= 1,c_i(y_i, z_i) =|y_i−z_i|^p,α=dC, andϕ_i =F_µ⁻_i¹,ψ_i =F_ρ⁻_i¹ in the following Lemma.

Lemma 2.1 Let α be a probability measure on R^d¹×R^d²×. . .×R^dⁿ with respective marginals α₁, . . . , α_n on R^d¹, . . . ,R^dⁿ and ϕ_i : R^dⁱ → R^d^′ⁱ, ψ_i : R^dⁱ → R^d^′′ⁱ and c_i : R^d^′ⁱ×R^d^′′ⁱ → R+ be measurable functions such that

∀i∈ {1, . . . , n}, inf

νi<^αi◦^ϕ

−1 i αi◦ψ−1

i

Z

R^d^′ⁱ×R^d^′′ⁱ

c_i(yi, z_i)νi(dyi, dz_i) = Z

R^di

c_i(ϕi(xi), ψi(xi))αi(dxi). (2.1)

Then setting ϕ : x = (x1, . . . , x_n) ∋ R^d¹ ×. . .×R^dⁿ 7→ (ϕ1(x1), . . . , ϕn(xn)) ∈ R^d^′¹^+...+d^′ⁿ and ψ:x∋R^d¹ ×. . .×R^dⁿ 7→(ψ₁(x₁), . . . , ψ_n(x_n))∈R^d^′′¹^+...+d^′′ⁿ, one has

infν<^α◦ϕ−_α◦ψ−¹₁

Z

R^d^′¹^+...+d^′ⁿ×R^d^′′¹^+...+d^′′ⁿ n

X

i=1

c_i(y_i, z_i)ν(dy, dz)

= Z

R^d¹⁺^...+^dn n

X

i=1

c_i(ϕ_i(x_i), ψ_i(x_i))α(dx) =

n

X

i=1

Z

R^di

c_i(ϕ_i(x_i), ψ_i(x_i))α_i(dx_i).

Proof of Lemma 2.1. We give two alternative proofs of the Lemma. The first one is based

(5)

on basic arguments.

Z

R^d^{1 +}^...+^dn n

X

i=1

c_i(ϕ_i(x_i), ψ_i(x_i))α(dx)

≥ inf

ν<^α^◦^ϕ−¹

α◦ψ−1

Z

R^d^′¹^+...+d^′ⁿ×R^d^′′¹^+...+d^′′ⁿ n

X

i=1

≥

n

X

i=1

inf

ν<^α^◦^ϕ−¹

α◦ψ−1

Z

R^d′¹^+...+d′ⁿ×R^d′′¹^+...+d′′ⁿ

≥

n

X

i=1

inf

νi<^αi^◦ϕ

−1 i αi◦ψ−1

i

Z

R^d^′ⁱ×R^d^′′ⁱ

c_i(y_i, z_i)ν_i(dy_i, dz_i)

=

n

X

i=1

Z

R^di

c_i(ϕ_i(x_i), ψ_i(x_i))α_i(dx_i)

= Z

R^d¹⁺^...+^dn n

X

i=1

c_i(ϕ_i(x_i), ψ_i(x_i))α(dx), where we used that

• the probability measureα◦(ϕ⁻¹, ψ⁻¹) onR^d^′¹^+...+d^′ⁿ×R^d^′′¹^+...+d^′′ⁿ has respective marginals α◦ϕ⁻¹ and α◦ψ⁻¹ on R^d^′¹^+...+d^′ⁿ and R^d^′′¹^+...+d^′′ⁿ, for the first inequality,

• the infimum of a sum is greater than the sum of infima, for the second inequality,

• the respective marginals ofα◦ϕ⁻¹ andα◦ψ⁻¹ onR^d^′ⁱ andR^d^′′ⁱ areα_i◦ϕ⁻_i ¹ andα_i◦ψ_i⁻¹, for the third one,

• and the hypotheses for the first equality.

The second proof is given to illustrate the theory of optimal transport. It requires to make the following additional assumption on the cost function c_i, for all 1≤i≤n:

there exist measurable functions g_i:R^d^′ⁱ →R+ and h_i :R^d^′′ⁱ →R+ such that c_i(y_i, z_i)≤g_i(y_i) +h_i(z_i),

Z

R^di

g_i(ϕ_i(x_i))α_i(dx_i)<∞, Z

R^di

h_i(ψ_i(x_i))α_i(dx_i)<∞. (2.2) Basically, this assumption ensures that R

R^d′ⁱ×R^d′′ⁱ c_i(y_i, z_i)ν_i(dy_i, dz_i) < ∞ for any coupling ν_i<^αⁱ^◦^ϕ⁻

1 i

αi◦ψ⁻_i¹.

We now introduce some definitions that are needed, and refer to the Section 3.3 of Rachev and R¨uschendorf [3] for a full introduction. Let ¯c : R^d^′ ×R^d^′′ → R. A function f : R^d^′ → R¯ is

¯

c-convex if there is a function a : R^d^′′ → R¯ such that f(y) = sup_z_∈_Rd′′¯c(y, z) −a(z). For y ∈R^d^′, we define the ¯c-subgradient:

∂_c_¯f(y) ={z∈R^d^′′ s.t.∀˜y∈domf, f(˜y)−f(y)≥c(˜¯y, z)−c(y, z)},¯

(6)

where domf ={y∈R^d^′, f(y)<∞}.

We are now in position to prove the result again. Let X = (X₁, . . . , X_d) be a random variable with probability measure α. We define Y_i =ϕ_i(X_i),Z_i =ψ_i(X_i) and ¯c_i =−c_i. From (2.1), we know that (Y_i, Z_i) is an optimal coupling that maximizes E[¯c_i(Y_i, Z_i)]. From Theorem 3.3.11 of Rachev and R¨uschendorf [3], this implies the existence of a ¯c_i-convex functionf_i:R^d^′ⁱ →R¯ such that Z_i ∈∂_¯_c_if_i(Yi). By definition of the ¯c_i-convexity, there exists a functiona_i :R^d^′′ⁱ → R¯ such that f_i(y_i) = sup

zi∈R^d^′′ⁱ c(y¯ _i, z_i)−a(z_i). We define for y = (y₁, . . . , y_n) ∈ R^d^′¹ ×. . .×R^d^′ⁿ and z= (z₁, . . . , z_n)∈R^d^′′¹ ×. . .×R^d^′′ⁿ,

f(y) =

n

X

i=1

f_i(yi), a(z) =

n

X

i=1

a_i(zi) and ¯c(y, z) =

n

X

i=1

¯

c_i(yi, z_i).

The function f is ¯c-convex since we clearly havef(y) = sup

z∈R^d^′′¹⁺^···^+d^′′ⁿ¯c(y, z)−a(z). Then, we have the straightforward inclusion:

∂_¯_cf(y) ={(z₁, . . . , z_n)∈R^d^′′¹ ×. . .×R^d^′′ⁿ, s.t.∀˜y∈domf,

n

X

i=1

f_i(˜y_i)−f_i(y_i)≥

n

X

i=1

¯

c_i(˜y_i, z_i)−

n

X

i=1

¯

c_i(y_i, z_i)}

⊃∂_¯_c₁f₁(y₁)× · · · ×∂_¯_c_nf_n(y_n).

This gives immediatelyY ∈∂_c_¯f(X). Using again Theorem 3.3.11 of [3], we get that the coupling (Y, Z) with lawµ⋄ρ is optimal in the sense that it maximizes E[¯c(Y, Z)].

We now prove that the couplingµ⋄ρ is in general no longer optimal whenq6=p. We first deal with the dimension n= 2. Given two copulas C₂ and C₂^′ on [0,1]², let (U₁, U₂^′) be distributed according dC₂^′. Given (U1, U₂^′), let U₂ be distributed according to the conditional distribution of the second coordinate given that the first one is equal toU₁ underdC₂ andU₁^′ be distributed according to the conditional distribution of the first coordinate given that the second one in equal to U₂^′ still under dC₂. This way the random variables U₁, U₂, U₁^′ and U₂^′ are uniformly distributed on [0,1] and both the vectors (U₁, U₂) and (U₁^′, U₂^′) are distributed according todC₂. For ε∈[0,1], we consider

Y_ε= (U1, εU₂), Zε = (εU1, U₂) andZ_ε^′ = (εU₁^′, U₂^′).

We notice that the copula of Y_ε and Z_ε is C₂ since the copula is preserved by coordinatewise increasing functions (see Nelsen [1], Theorem 2.4.3). Also, Z_ε and Z_ε^′ obviously have the same law. We will show that for a suitable choice of C₂^′ andε >0 small enough, we generally have

E[kYε−Z_εk^p_q]>E[kYε−Z_ε^′k^p_q].

Denoting by µ^ε the law of Y_ε and ρ^ε the common law of Z_ε and Z_ε^′, this implies the desired conclusion since

Z

R²×R²

kx−yk^p_qµ^ε⋄ρ^ε(dx, dy) = Z

[0,1]²

k(F_µ⁻ε¹

1 (u1), F_µ⁻ε¹

2 (u2))−(F_ρ⁻ε¹

1 (u1), F_ρ⁻ε¹

2 (u2))k^p_qdC₂(u1, u₂)

=E[kYε−Z_εk^p_q]>E[kYε−Z_ε^′k^p_q]≥ W_p,q^p (µ, ρ).

(7)

The coupling between Y_ε and Z_ε gives the score

E[((1−ε)^qU₁^q+ (1−ε)^qU₂^q)^p/q] →

ε→0E[(U₁^q+U₂^q)^p/q], while the one between Y_ε and Z_ε^′ gives:

E[(|U₁−εU₁^′|^q+|U₂^′ −εU₂|^q)^p/q] →

ε→0E[(U₁^q+ (U₂^′)^q)^p/q].

We now focus on the cost function ¯c(u₁, u₂) =−(u^q₁+u^q₂)^p/q foru₁, u₂ ∈(0,1). We have

∂_u₁∂_u₂c(u¯ ₁, u₂) =p(q−p)u^q₁⁻¹u^q₂⁻¹(u^q₁+u^q₂)^p^q⁻².

Whenq < p(resp.q > p) this is negative (resp. positive) for anyu₁, u₂ ∈(0,1), i.e. ¯c(resp. −¯c) satisfies the so-called Monge condition. By Theorem 3.1.2 of Rachev and R¨uschendorf [3], we get that E[¯c(U₁, U₂^′)] is maximal for U₂^′ = 1−U₁ (resp. U₂^′ =U₁), i.e when C₂^′(u, v) =C₂⁻(u, v) (resp. C₂^′(u, v) =C₂⁺(u, v)). Besides, since ∂_u₁∂_u₂¯c(u₁, u₂) does not vanish, we have

E[¯c(U₁, U₂)]<E[¯c(U₁,1−U₁)] (resp. E[¯c(U₁, U₂)]<E[¯c(U₁, U₁)])

when C₂6=C₂⁻ (resp.C₂6=C₂⁺). Taking U₂^′ = 1−U₁ (resp. U₂^′ =U₁), we have in this case E[((1−ε)^qU₁^q+ (1−ε)^qU₂^q)^p/q]>E[(|U₁−εU₁^′|^q+|U₂^′ −εU₂|^q)^p/q]

for ε > 0 small enough. Notice that since both Y₀ and Z₀ have one constant coordinate, the range of the cumulative distribution function of this coordinate is {0,1} and the probability measures µ₀ and ρ₀ share any two dimensional copula and in particular C₂⁻ (resp. C₂⁺). That is why one has to choose ε >0.

This two dimensional example can be easily extended to dimension n≥3. LetC_n now denote a n-dimensional copula, C₂(u₁, u₂) = C_n(u₁, u₂,1, . . . ,1) and C₂^′ be another two-dimensional copula. We define (U₁, U₂, U₁^′, U₂^′) as above. Then, we choose (U₃, . . . , U_n) (resp. (U₃^′, . . . , U_n^′)) distributed according to the conditional law of then−2 last coordinates given that the two first are equal to (U₁, U₂) (resp. (U₁^′, U₂^′)) underdC_n. Last, we define

Y_ε= (U₁, εU₂, εU₃, . . . , εU_n), Z_ε= (εU₁, U₂, εU₃, . . . , εU_n) and Z_ε^′ = (εU₁^′, U₂^′, εU₃^′, . . . , εU_n^′).

We still have on the one hand that Y_ε andZ_ε share the same copula and on the other hand that Z_ε andZ_ε^′ have the same law. Moreover,

E[kY_ε−Z_εk^p_q] →

ε→0E[(U₁^q+U₂^q)^p/q] and E[kY_ε−Z_ε^′k^p_q] →

ε→0E[(U₁^q+ (U₂^′)^q)^p/q].

Taking ε > 0 small enough, we get that E[kY_ε−Z_εk^pq] > E[kY_ε−Z_ε^′k^pq] ≥ Wp,q^p (µ, ρ) if, for some u₁, u₂ ∈ [0,1], C_n(u₁, u₂,1, . . . ,1) > C₂⁻(u₁, u₂) when q < p (resp. C_n(u₁, u₂,1, . . . ,1) <

C₂⁺(u₁, u₂) when q > p). If there exist i < j such that

C_n(1, . . . ,1, u_i,1, . . . ,1, u_j,1, . . . ,1)> C₂⁻(u_i, u_j) (resp. < C₂⁺(u_i, u_j) ),

then one may repeat the above reasoning with the i-th andj-th coordinates replacing the first and second ones. Then the couplingν =µ⋄ρis not optimal for ε >0 small enough. Forn≥3, there is no copula such that C_n(1, . . . ,1, u_i,1, . . . ,1, u_j,1, . . . ,1) = C₂⁻(u_i, u_j) for any i < j , u_i, u_j ∈[0,1], since this would imply thatU₂= 1−U₁ =U₃ andU₃ = 1−U₂. Also, the only one copula satisfying C_n(1, . . . ,1, u_i,1, . . . ,1, u_j,1, . . . ,1) =C₂⁺(u_i, u_j) for any i < j, u_i, u_j ∈[0,1], is C_n⁺ since the former condition implies U_i=U_j for any i < j.

(8)

References

[1] Nelsen, R.B. An introduction to copulas, 2nd edition. Springer-Verlag, 2006.

[2] Puccetti, G. and Scarsini, M., Multivariate comonotonicity, Journal of Multivariate Anal- ysis, p. 291–304, Vol. 101, No. 1, 2010.

[3] Rachev, S.T. and R¨uschendorf, L. Mass Transportation problems. Springer-Verlag, 1998.