THE CONTROLLED BIASED COIN PROBLEM

OLIVIER GOSSNER AND TRISTAN TOMALA

Abstract. We study the maxmin value of a zero-sum repeated game where player 1 is restricted to pure strategies but privately observes the realizations of some random variables. This kind of problem was introduced by Gossner and Vieille [GV02] in the case where player 1 observes i.i.d. random variables. The paper solves the case where the law of the random variable (the coin) depends on player 1’s action. We also discuss the general case where the law of the coin is controlled by both players.

1. Model and definitions

1.1. The repeated game. Let (A, B, g) be a zero-sum game where A (resp. B) is the finite set of actions of player 1 (resp. 2) and g : A × B → R is the payoff function. Let S be a finite set of signals and µ : A → ∆(S) be a transition probability. The repeated game unfolds as follows. At each stage t = 1, 2, . . ., each player chooses an action in her own set of actions and if (a, b) ∈ A × B is the action profile played, the payoff for player 1 is g(a, b). A signal s is then drawn according to µ(·|a) and announced to player 1 only.

A history of length n of the game [resp. for player 2] is an element h_n of H_n = (A × B × S)^n [resp. h_n^2 of H_n^2 = (A × B)^n]. A pure strategy σ for player 1 is a sequence (σ_n)_{n≥0} with σ_n : H_n → A. A behavioral strategy τ for player 2 is a sequence (τ_n)_{n≥0} with τ_n : H_n^2 → ∆(B), where ∆(B) denotes the set of probability distributions on B. A pair of strategies (σ, τ) with σ pure and τ behavioral induces a probability distribution P_{σ,τ} on the set of plays (A × B × S)^∞ endowed with the product σ-algebra. The average payoff up to stage n is:

γ_n(σ, τ) = E_{σ,τ}[(1/n) Σ_{m=1}^{n} g(a_m, b_m)]

where (a_m, b_m) is the random pair of actions at stage m. The uniform maxmin payoff of the repeated game is defined as follows (see [GV02] for this model and [MSZ94] for general repeated games):

Definition 1.
(1) Player 1 guarantees v ∈ R if: ∀ε > 0, ∃σ, ∃N s.t. ∀τ, ∀n ≥ N, γ_n(σ, τ) ≥ v − ε.
(2) Player 2 defends v ∈ R if: ∀ε > 0, ∀σ, ∃τ, ∃N s.t. ∀n ≥ N, γ_n(σ, τ) ≤ v + ε.
(3) The maxmin is the number v_∞ ∈ R such that player 1 guarantees v_∞ and player 2 defends v_∞.

1.2. Information theory tools. Let x be a finite random variable with law P. Throughout the paper, log denotes the base-2 logarithm. By definition, the entropy of x is:

H(x) = −E[log P(x)] = −Σ_x P(x) log P(x)

Note that H(x) is non-negative and depends only on the law P of x. One can therefore define the entropy of a probability distribution P = (p_k)_{k=1,...,L} with finite support by H(P) = −Σ_{k=1}^{L} p_k log p_k.

Date: 3rd February 2005.
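As a quick sanity check on this definition (our own illustration, not part of the paper; the function name `entropy` is ours), the entropy of a finite law can be computed directly:

```python
import math

def entropy(p):
    """Entropy, in bits, of a finite probability vector p; terms with p_k = 0 contribute 0."""
    return -sum(pk * math.log2(pk) for pk in p if pk > 0)

# A fair coin carries one bit of uncertainty, a deterministic coin none,
# and the uniform law on 4 points carries log 4 = 2 bits.
print(entropy([0.5, 0.5]))   # 1 bit
print(entropy([1.0, 0.0]))   # 0 bits
print(entropy([0.25] * 4))   # 2 bits
```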


Let (x, y) be a couple of finite random variables with joint law P. Define P(x|y) = P(x = x | y = y) if P(y = y) > 0 and arbitrarily otherwise. The conditional entropy of x given y is:

H(x | y) = −E[log P(x | y)]

One has the following chain rule:

H(x, y) = H(y) + H(x | y)

2. The characterization
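To see the chain rule in action, here is a small numerical check on an arbitrarily chosen joint law (our own illustration; the law P below is not from the paper):

```python
import math

def H(p):
    """Entropy in bits of a finite probability vector."""
    return -sum(x * math.log2(x) for x in p if x > 0)

# An arbitrary joint law P(x, y) on {0,1} x {0,1}.
P = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.4, (1, 1): 0.2}

# Marginal law of y.
Py = {y: sum(p for (x, yy), p in P.items() if yy == y) for y in (0, 1)}
H_xy = H(list(P.values()))   # H(x, y)
H_y = H(list(Py.values()))   # H(y)
# H(x | y) = -E[log P(x | y)]
H_x_given_y = -sum(p * math.log2(p / Py[y]) for (x, y), p in P.items())

print(abs(H_xy - (H_y + H_x_given_y)))  # ~0: the chain rule holds
```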

Although player 1 is restricted to pure strategies, since player 2 does not observe the signals of player 1, she holds a belief about the action of player 1, that is, a probability distribution on A. Let x ∈ ∆(A) be this belief about the action of player 1 at stage n. At stage n, player 2 gets to observe the random action a of player 1 but not the random signal s observed by player 1. Define the entropy variation associated to x as H(s|a) − H(a), which writes:

∆H(x) = Σ_a x(a) H(µ(·|a)) − H(x)

This represents the variation of uncertainty for player 2 at stage n. For x ∈ ∆(A), the optimal payoff associated to x is:

π(x) = min_b g(x, b)

where g is extended to mixed actions by linearity. This function represents the expected payoff obtained when player 2 plays her best reply to her belief about the action of player 1. Let then:

V = {(∆H(x), π(x)) | x ∈ ∆(A)}

and

v* = sup{x_2 ∈ R | (x_1, x_2) ∈ co V, x_1 ≥ 0}

This is the highest payoff associated to a convex combination of mixed actions under the constraint that the average entropy variation is non-negative. Note that ∆H and π are continuous on ∆(A) and thus V is compact. For every x such that H(x) = 0, ∆H(x) ≥ 0; thus V ∩ {x_1 ≥ 0} ≠ ∅ and the supremum that defines v* is a maximum.
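For intuition, the pair (∆H(x), π(x)) is easy to tabulate in a small example. The sketch below (our own, not from the paper; helper names are ours) does so for matching pennies with signal laws p = µ(0|T), q = µ(0|B), the running example of Section 4:

```python
import math

def h2(t):
    """Binary entropy in bits."""
    return 0.0 if t <= 0 or t >= 1 else -t * math.log2(t) - (1 - t) * math.log2(1 - t)

# Matching pennies, with signal laws mu(0|T) = p, mu(0|B) = q (cf. Section 4).
p, q = 0.0, 0.5
payoff = [[1, 0], [0, 1]]  # rows T, B; columns L, R

def delta_H(x):
    """Entropy variation for the belief (x, 1 - x) on {T, B}."""
    return x * h2(p) + (1 - x) * h2(q) - h2(x)

def pi(x):
    """Payoff when player 2 best-replies to the belief (x, 1 - x)."""
    return min(x * payoff[0][b] + (1 - x) * payoff[1][b] for b in (0, 1))

# Playing B for sure gains one bit but yields payoff 0; the optimal
# mixed action (1/2, 1/2) yields 1/2 but consumes half a bit.
print(delta_H(0.0), pi(0.0))  # one bit gained, payoff 0
print(delta_H(0.5), pi(0.5))  # half a bit consumed, payoff 1/2
```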

We now give other expressions of v* which are more convenient for computations. First we show how to express the number v* through simple optimization problems. Define for each real number h:

u(h) = max{π(x) | x ∈ ∆(A), ∆H(x) ≥ h}

From the definition of V we have, for each h:

u(h) = max{x_2 | (x_1, x_2) ∈ V, x_1 ≥ h}

Since V is compact, u(h) is well defined. Let cav u be the least concave function pointwise greater than u. Then:

sup{x_2 ∈ R | (x_1, x_2) ∈ co V, x_1 ≥ 0} = cav u(0)

To see this, note that u is upper-semi-continuous and non-increasing, and that the hypograph of u is the comprehensive set V − R²₊ associated to V. Thus cav u is also non-increasing and u.s.c., and its hypograph is co V − R²₊. Now using Fenchel duality we get:

Lemma 2. cav u(0) = inf_{λ≥0} max_x min_b {λ∆H(x) + g(x, b)}


Proof. cav u(0) = −(−u)**(0), where F* denotes the Fenchel conjugate of F. By definition of the Fenchel conjugate:

(−u)*(λ) = sup_h {hλ − (−u(h))}
(−u)**(h) = sup_λ {hλ − (−u)*(λ)}

Thus cav u(0) = inf_λ sup_h {hλ + u(h)} and, by the definition of u:

cav u(0) = inf_λ max_x sup_{h ≤ ∆H(x)} {hλ + π(x)}

For λ < 0, letting h go to −∞ makes the sup infinite. Thus we may restrict to λ ≥ 0, which gives:

cav u(0) = inf_{λ≥0} max_x sup_{h ≤ ∆H(x)} min_b {hλ + g(x, b)}
         = inf_{λ≥0} max_x min_b {λ∆H(x) + g(x, b)}

concluding the proof. □
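This dual formula lends itself to direct numerical evaluation. The following sketch (our own; the grid resolution and the helper name `v_star` are assumptions, not from the paper) approximates v* for matching pennies with signal laws p, q and checks it against the closed forms derived in Section 4:

```python
import math

def h2(t):
    """Binary entropy in bits."""
    return 0.0 if t <= 0 or t >= 1 else -t * math.log2(t) - (1 - t) * math.log2(1 - t)

def v_star(p, q, payoff, grid=1001):
    """Approximate inf_{lam>=0} max_x min_b { lam*dH(x) + g(x,b) } on a coarse grid."""
    xs = [i / (grid - 1) for i in range(grid)]
    # Precompute (pi(x), dH(x)) for each candidate mixed action x = P(T).
    pts = [(min(x * payoff[0][b] + (1 - x) * payoff[1][b] for b in range(2)),
            x * h2(p) + (1 - x) * h2(q) - h2(x)) for x in xs]
    lams = [i / 100 for i in range(1001)]  # lambda in [0, 10]
    return min(max(pay + lam * dh for pay, dh in pts) for lam in lams)

mp = [[1, 0], [0, 1]]  # matching pennies
print(v_star(0.25, 0.25, mp), h2(0.25) / 2)  # i.i.d. signals: both close to H(1/4)/2
print(v_star(0.0, 0.5, mp))                  # close to 1/3, as in Section 4
```

The grid makes this only a rough check; the agreement with the closed-form values is within the grid resolution.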

We then have the following result:

Theorem 3. The uniform maxmin v_∞ exists and v_∞ = v*.

3. Proof of Theorem 3

3.1. Player 2 defends v*. The argument used here is a generalization of the one found in [GV02]. We introduce some notation. Let σ be a strategy of player 1. Define inductively τ_σ as the strategy of player 2 that plays stage-best replies to σ. At stage 1, τ_σ(∅) ∈ argmin_b g(σ(∅), b), where ∅ is the null history that starts the game. Assume that τ_σ is defined on histories of length less than n. For every history h_n^2 of player 2, let x_{n+1}(h_n^2) be the distribution of the action of player 1 given h_n^2 and let τ_σ(h_n^2) be in argmin_b g(x_{n+1}(h_n^2), b).

Lemma 4. For each integer n and strategy σ: γ_n(σ, τ_σ) ≤ v*.

Proof. Let σ be a strategy of player 1 and set τ = τ_σ. Let (a_n, b_n, s_n) be the sequences of random action profiles and signals associated to (σ, τ), h_n^2 the random history of player 2, and x_{n+1} = x_{n+1}(h_n^2). The expected payoff at stage m + 1 after h_m^2 is min_b g(x_{m+1}(h_m^2), b) = π(x_{m+1}(h_m^2)) from the definition of τ, and thus γ_n(σ, τ) = E_{σ,τ}[(1/n) Σ_{m=1}^{n} π(x_m)].

We now compute the entropy variations ∆H(x_{m+1}(h_m^2)). By definition of x_{m+1},

∆H(x_{m+1}(h_m^2)) = H(s_{m+1} | a_{m+1}, h_m^2) − H(a_{m+1} | h_m^2)

and thus:

E_{σ,τ} ∆H(x_{m+1}) = H(s_{m+1} | a_{m+1}) − H(a_{m+1} | h_m^2)

Denote H_m = H(s_1, . . . , s_m | a_1, b_1, . . . , a_m, b_m). Applying the additivity formula in two different ways:

H(s_1, . . . , s_m, a_{m+1}, b_{m+1}, s_{m+1} | h_m^2) = H_m + H(b_{m+1} | h_m^2) + H(s_{m+1} | a_{m+1})
                                                   = H(a_{m+1} | h_m^2) + H(b_{m+1} | h_m^2) + H_{m+1}

where we use that a_{m+1} and b_{m+1} are independent conditional on h_m^2, that a_{m+1} is deterministic given h_m^2 and the past signals s_1, . . . , s_m (σ is pure), and that s_{m+1} depends on a_{m+1} only. Thus E_{σ,τ} ∆H(x_{m+1}) = H_{m+1} − H_m and:

E_{σ,τ} (1/n) Σ_{m≤n} ∆H(x_m) = (1/n) H_n ≥ 0

Therefore the vector (E_{σ,τ} (1/n) Σ_{m≤n} ∆H(x_m), γ_n(σ, τ)) belongs to co V ∩ {x_1 ≥ 0}, whence γ_n(σ, τ) ≤ v*. □

3.2. Player 1 guarantees v∗. Call a strategy of player 1 autonomous if it does not depend on player 2’s moves.

Lemma 5. Let σ be an autonomous strategy. For each stage n and strategy τ for player 2, γ_n(σ, τ) ≥ γ_n(σ, τ_σ).

Proof. Straightforward, since player 2's moves do not influence the actions of player 1. □

Corollary 6. If ∀ε > 0 there exists an autonomous strategy σ and an integer N such that ∀n ≥ N, γ_n(σ, τ_σ) ≥ v* − ε, then player 1 guarantees v*.

An autonomous strategy σ maps ∪_n (A × S)^n to A and thus generates a probability distribution P_σ on (A × S)^∞. Conversely, call a distribution P on (A × S)^∞ an A-distribution if, for each n and for P-almost every history h_n ∈ (A × S)^n, the distribution of a_{n+1} conditional on h_n is a Dirac measure. Every A-distribution induces an autonomous strategy.

Given an autonomous σ, consider the belief of player 2 at stage n: given h_n^2, let x_{n+1} be the distribution of σ(a_1, s_1, . . . , a_n, s_n) given h_n^2. The empirical distribution of beliefs up to stage n is the random variable:

d_n = (1/n) Σ_{m≤n} δ_{x_m}

where δ_x denotes the Dirac measure on x. The random variable d_n has values in D = ∆(∆(A)) and we let δ = E_σ(d_n) be the element of D such that for every real-valued continuous function f on ∆(A), E_σ(∫ f(x) dd_n(x)) = ∫ f(x) dδ(x). We get:

γ_n(σ, τ_σ) = E_δ π

Empirical distributions of beliefs are studied in [GT04], where we prove the following result:

Theorem 7 ([GT04]). For every δ ∈ ∆(∆(A)) such that E_δ ∆H ≥ 0, there exists an A-distribution P such that E_P d_n weak-∗ converges to δ.

It follows readily:

Lemma 8. Player 1 guarantees sup {Eδπ | δ ∈ ∆(∆(A)), Eδ∆H ≥ 0}.

We conclude the proof with the following lemma:

Lemma 9. sup {E_δ π | δ ∈ ∆(∆(A)), E_δ ∆H ≥ 0} = v*.

Proof. Immediate, since the set of vectors (E_δ ∆H, E_δ π) as δ varies in ∆(∆(A)) is exactly co V. □


4. 2 × 2 games

We study 2 × 2 games and give an explicit formula for v*. Consider the game G whose matrix is displayed below. The set of signals is {0, 1}, µ(0|T) = p, µ(0|B) = q, where w.l.o.g. 0 ≤ p ≤ q ≤ 1/2.

        L   R
    T   a   b      p
    B   c   d      q

Let us assume that the game has no value in pure strategies, that is, w.l.o.g. min{a, d} > max{b, c}. Then player 1 has a unique optimal strategy x* = (d − c)/(a − b + d − c). Let h_0 := ∆H(x*). We consider two cases.

• First case: h_0 ≥ 0. Denoting by v the value of G, remark that u(h) ≤ v for each h and that u(0) ≥ π(x*) = v. Thus cav u(0) = v.

• Second case: h_0 < 0. Note that ∆H(x) is convex on [0, 1] and π(x) is linear on [0, x*] and on [x*, 1]. Therefore, every point in co V is dominated by a point lying either on the segment [(h_0, v); (H(q), c)] or on [(h_0, v); (H(p), b)]. Computing the intersection of those segments with the vertical axis we get:

v* = max{ [H(p) v − h_0 b] / [H(p) − h_0] ; [H(q) v − h_0 c] / [H(q) − h_0] }

For the matching pennies game, i.e. a = d = 1, b = c = 0, h_0 = [H(p) + H(q)]/2 − 1 and:

v* = H(q) / [2 + H(q) − H(p)]

• If p = q, v* = H(q)/2, which is the result of [GV02].

• If p = 0, q = 1/2, v* = 1/3 = (2/3)·(1/2), which means that player 1 can get the value of G two thirds of the time. This result can be proven directly by considering the following idea of a strategy. Play B for 2n stages so as to generate 2n fair coins; this is the triggering phase. Then play the mixed strategy (1/2, 1/2) i.i.d. for 2n stages, using the 2n coins; this is the main phase. With high probability, during the main phase player 1 plays B n times and generates n coins. Player 1 then plays B n times so as to get 2n coins again, and the main phase can be restarted. With high probability, this process can be repeated a large number of times. The triggering phase becomes insignificant for payoffs, and the average payoff is [(1/2)·2n + 0·n]/(3n) = 1/3.
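The closed form above is straightforward to evaluate. The sketch below (ours, not from the paper; the helper name `v_star_2x2` is our own, and it assumes the normalization min{a, d} > max{b, c} of this section) checks the two special cases just quoted:

```python
import math

def h2(t):
    """Binary entropy in bits."""
    return 0.0 if t <= 0 or t >= 1 else -t * math.log2(t) - (1 - t) * math.log2(1 - t)

def v_star_2x2(a, b, c, d, p, q):
    """Closed-form v* for a 2x2 game with min{a, d} > max{b, c} and signal laws p, q."""
    x_opt = (d - c) / (a - b + d - c)   # player 1's optimal mixed action
    v = a * x_opt + c * (1 - x_opt)     # value of the one-shot game G
    h0 = x_opt * h2(p) + (1 - x_opt) * h2(q) - h2(x_opt)
    if h0 >= 0:                         # first case: the entropy constraint does not bind
        return v
    # Second case: intersect the two segments with the vertical axis.
    cand_p = (h2(p) * v - h0 * b) / (h2(p) - h0)
    cand_q = (h2(q) * v - h0 * c) / (h2(q) - h0)
    return max(cand_p, cand_q)

# Matching pennies: a = d = 1, b = c = 0.
print(v_star_2x2(1, 0, 0, 1, 0.25, 0.25))  # H(1/4)/2, the i.i.d. case of [GV02]
print(v_star_2x2(1, 0, 0, 1, 0.0, 0.5))    # 1/3
```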

5. The bi-controlled case

We analyze here the case where the law of the coin depends on both actions, i.e. the data of the game are the same except for the transition µ, which now maps A × B to ∆(S). The definitions of strategies and of the maxmin of the game are unchanged. For each x ∈ ∆(A) and b ∈ B, set:

∆H(x, b) = Σ_a x(a) H(µ(·|a, b)) − H(x)

5.1. A heuristic analysis. We define a heuristic vector payoff game as follows: at each stage n = 1, 2, . . ., player 1 chooses x ∈ ∆(A), player 2 chooses b ∈ B, and the payoff vector is (∆H(x, b), g(x, b)). The payoff vector is publicly observable.

Heuristic Result. Player 1 guarantees v if and only if the set of payoffs {u = (u_1, u_2) ∈ R² | u_1 ≥ 0, u_2 ≥ v} is approachable ([Bl56]) by player 1 in the vector payoff game.


Proof (heuristic). As in the proof of Theorem 3, a pair of strategies in the game induces at each stage a belief of player 2 regarding the next action of player 1: let x_{n+1} be the distribution of σ(a_1, s_1, . . . , a_n, s_n) given h_n^2. One may again define the empirical distribution of beliefs up to stage n, but this quantity now depends on both players' strategies. The heuristic is to generalize [GT04]'s result to the case where the beliefs of player 2 depend on her own action; this is subject to ongoing research. The main idea is to replace the feasibility condition E_δ ∆H ≥ 0 by an approachability condition on the set of distributions δ that verify such conditions. □

We deduce a characterization of the maxmin value:

Corollary 10. The maxmin value of the game is the largest number v* such that {u = (u_1, u_2) ∈ R² | u_1 ≥ 0, u_2 ≥ v*} is approachable by player 1, and:

v* = inf_{λ≥0} max_x min_b {g(x, b) + λ∆H(x, b)}

Proof. The first part of the corollary is a direct consequence of the heuristic result. Using Blackwell's necessary and sufficient condition for approachability of convex sets, we get that {u = (u_1, u_2) ∈ R² | u_1 ≥ 0, u_2 ≥ v} is approachable by player 1 if and only if for each p ∈ [0, 1]:

p v + (1 − p) 0 ≤ max_x min_b {p g(x, b) + (1 − p) ∆H(x, b)}

For p = 0 this says max_x min_b ∆H(x, b) ≥ 0, which is verified (take x to be a pure action). Assume thus p > 0. Dividing by p and setting λ = (1 − p)/p, the condition becomes that for each λ ≥ 0:

v ≤ max_x min_b {g(x, b) + λ∆H(x, b)}

The highest v that verifies this is clearly:

v* = inf_{λ≥0} max_x min_b {g(x, b) + λ∆H(x, b)} □

5.2. Matching pennies. We consider matching pennies and use the above formula to compare two situations: (1) the coin is fully controlled by player 2, and is deterministic if Left and fair if Right; (2) the coin is fully controlled by player 1, and is deterministic if Top and fair if Bottom. The two situations are displayed below. Which is the most favourable situation for player 1?

        L   R                     L   R
    T   1   0                 T   1   0      0
    B   0   1        or       B   0   1      1/2
        0   1/2

In the second case, we already know that the maxmin value is 1/3. We now use our formula to compute the value in the first case.

Proposition 11. In the first case, v* is the unique solution in [0, 1) of the equation:

λ = (1 + λ)/2 − λ H((1 + λ)/2)

Numerically: 0.355 < v* < 0.356.


Proof. We wish to compute:

inf_{λ≥0} max_x min_b {g(x, b) + λ∆H(x, b)}

which equals:

inf_{λ≥0} max_x min {x − λH(x) ; (1 − x) + λ(1 − H(x))}

Set F(λ) = max_x min_b {g(x, b) + λ∆H(x, b)}. Given x and λ, the minimum is achieved at L iff x ≤ (1 + λ)/2, which holds for every x whenever λ ≥ 1. For each such λ, we solve max_x (x − λH(x)). This expression being convex in x, the maximum is achieved at x = 1, thus F(λ) = 1 for all λ ≥ 1.

Assume now λ ≤ 1. Since both expressions x − λH(x) and (1 − x) + λ(1 − H(x)) are convex in x, the maximum of min {x − λH(x) ; (1 − x) + λ(1 − H(x))} is achieved either at 0, at 1, or at the x that equalizes the two expressions. One easily deduces that:

F(λ) = max { λ ; (1 + λ)/2 − λH((1 + λ)/2) }

We let f(λ) = (1 + λ)/2 − λH((1 + λ)/2). Studying the variations of f, we find that there exists a unique λ_0 < 1 such that f(λ_0) = λ_0 and, on [0, 1), f(λ) ≥ λ ⇔ λ ≤ λ_0. Numerically, 0.355 < λ_0 < 0.356. Thus:

F(λ) = f(λ) for 0 ≤ λ ≤ λ_0,
F(λ) = λ for λ_0 ≤ λ ≤ 1.

It follows that inf_{λ≥0} F(λ) = min_{[0,λ_0]} f(λ). Numerical resolution of f′(λ) = 0 shows that f′(λ_0) < 0, so f is decreasing on [0, λ_0] and the minimum is f(λ_0) = λ_0, hence the result. □

References

[Bl56] D. Blackwell. An analog of the minimax theorem for vector payoffs. Pacific Journal of Mathematics, 6:1–8, 1956.

[GT04] O. Gossner and T. Tomala. Empirical distributions of beliefs under imperfect observation. Cahiers du CEREMADE 410, Université Paris Dauphine, Paris, 2004.

[GV02] O. Gossner and N. Vieille. How to play with a biased coin? Games and Economic Behavior, 41:206–226, 2002.

[MSZ94] J.-F. Mertens, S. Sorin, and S. Zamir. Repeated games. CORE Discussion Papers 9420–9422, 1994.
