Zero-sum Stochastic Games: Limit Optimal Trajectories

SYLVAIN SORIN AND GUILLAUME VIGERAL

Abstract. We consider zero-sum stochastic games. For every discount factor λ, a time normalization allows one to represent the game as being played on the interval [0, 1]. We introduce the trajectories of cumulated expected payoff and of cumulated occupation measure up to time t ∈ [0, 1], under ε-optimal strategies. A limit optimal trajectory is defined as an accumulation point of these trajectories as the discount factor tends to 0. We study existence, uniqueness and characterization of these limit optimal trajectories for absorbing games.

1. Introduction

The analysis of two-person zero-sum repeated games in discrete time may be performed along two lines:

1) asymptotic approach: to each probability distribution θ on the set of stages (m = 1, 2, ...) one associates the game $G_\theta$ where the evaluation of the stream of stage payoffs $\{g_m\}$ is $\sum_{m=1}^{+\infty}\theta_m g_m$, and one denotes its value by $v_\theta$. Given a preordered family {Θ, ≻} of probability distributions, one studies whether $v_\theta$ converges as θ ∈ Θ "goes to ∞" according to ≻. Typical examples correspond to n-stage games ($\theta_m = \frac1n \mathbf{1}_{m\le n}$, n → ∞), λ-discounted games ($\theta_m = \lambda(1-\lambda)^{m-1}$, λ → 0), or more generally decreasing evaluations ($\theta_m \ge \theta_{m+1}$) with $\theta_1 \to 0$. The game has an asymptotic value v∗ when these limits exist and coincide.

2) uniform approach: for each strategy of Player 1, one evaluates the amount that can be obtained against any strategy of the opponent in any sufficiently long game for {Θ, ≻}. This allows one to define a maxmin and a minmax, and the game has a uniform value $v_\infty$ when minmax = maxmin.

The second approach is stronger than the first one (the existence of $v_\infty$ implies the existence of v∗ and their equality), but there are games with an asymptotic value and no uniform value (incomplete information on both sides, Aumann and Maschler [1], Mertens and Zamir [7]; stochastic games with signals on the moves). The first approach deals only with families of values while the second explicitly considers strategies. The main difference is that in the first case the (ε-)optimal strategies of the players may depend on the evaluation represented by θ.

We focus here on a class of games where this dependence has a smooth representation. Basically in addition to the asymptotic properties of the value one studies the asymptotic behavior along the play induced by (ε)-optimal strategies.

Date: December 4, 2018.

Some of the results of this paper were presented in “Atelier Franco-Chilien: Dynamiques, optimisation et apprentissage” Valparaiso, November 2010 and a preliminary version of this paper was given at the Game Theory Conference in Stony Brook, July 2012. This research was supported by grant PGMO 0294-01 (France).


The first step is to normalize the duration of the game using the evaluation θ. We consider each game $G_\theta$ as being played on [0, 1], stage n lasting from time $t_{n-1} = \sum_{m<n}\theta_m$ to time $t_n = t_{n-1} + \theta_n$ (with $t_0 = 0$). Note that here time t corresponds to the fraction t of the total duration of the game, as evaluated through θ. In particular, given (ε)-optimal strategies in $G_\theta$, the stream of expected stage payoffs generates a bounded measurable trajectory on [0, 1] and one will consider its asymptotic behavior.

The next section introduces the basic definitions and concepts that allow us to describe our results. The main proofs are in Section 3. Further examples are in Sections 4 and 5.

To end this quick overview, let us recall that there are games which do not have an asymptotic value: stochastic games with compact action spaces, Vigeral [15]; finite stochastic games with signals on the state, Ziliotto [16]; or more generally Sorin and Vigeral [13].

2. Limit optimal trajectories

Let Γ be a two-person zero-sum stochastic game with state space Ω, action spaces I and J, stage payoff g from Ω × I × J to $\mathbb{R}$ and transition ρ from Ω × I × J to ∆(Ω). We assume that Ω is finite, I and J are compact metric, and g and ρ are continuous. We keep the same notation for the multilinear extensions to X = ∆(I) and Y = ∆(J), where as usual ∆(A) denotes the set of probabilities on A.

For any pair of stationary strategies $(x, y) \in X^\Omega \times Y^\Omega$, any state ω ∈ Ω and any stage n, denote by $c_n^{\omega,x,y}$ the expected payoff at stage n under these stationary strategies, given the initial state ω, and by $q_n^{\omega,x,y} \in \Delta(\Omega)$ the corresponding distribution of the state at stage n. Hence $c_n^{\omega,x,y} = \langle q_n^{\omega,x,y}, \bar g(x, y)\rangle$, where $\bar g(x, y)$ stands for the vector payoff with component in state ζ ∈ Ω given by g(ζ; x(ζ), y(ζ)).

Definition 1.

For any $(x, y) \in X^\Omega \times Y^\Omega$, any discount factor λ ∈ (0, 1], and any starting state ω, define the function $l_\lambda^{\omega,x,y}: [0, 1] \to \mathbb{R}$ by
$$l_\lambda^{\omega,x,y}(t_n) = \lambda \sum_{i=1}^{n} (1 - \lambda)^{i-1}\, c_i^{\omega,x,y}$$
for $t_n = \lambda \sum_{i=1}^{n}(1-\lambda)^{i-1}$, and by linear interpolation between these dates $\{t_n\}$.

Thus $l_\lambda^{\omega,x,y}(t_n)$ corresponds to the expectation of the accumulated payoff for the first n stages in the discounted game, or up to time $t_n$, and $l_\lambda^{\omega,x,y}(t)$ to the same quantity at the fraction t of the game, both under x and y in the λ-discounted game starting from ω.

Let M (Ω) denote the set of positive measures on Ω. We introduce similarly the expected accumulated occupation measure at time t under x and y in the λ-discounted game starting from ω as follows:

Definition 2.

For any $(x, y) \in X^\Omega \times Y^\Omega$, any discount factor λ, and any starting state ω, define the function $Q_\lambda^{\omega,x,y}: [0, 1] \to M(\Omega)$ by
$$Q_\lambda^{\omega,x,y}(t_n) = \lambda \sum_{i=1}^{n} (1 - \lambda)^{i-1}\, q_i^{\omega,x,y}$$
and by linear interpolation between these dates $\{t_n\}$.

Note that for any t ∈ [0, 1], $Q_\lambda^{\omega,x,y}(t) \in t\,\Delta(\Omega)$.
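As an illustration of Definitions 1 and 2, here is a minimal numerical sketch (not from the paper): for a toy two-state absorbing chain it computes the dates $t_n$ and the cumulated payoff and occupation trajectories at these dates. The discount factor, the per-stage absorption probability and the payoffs are arbitrary illustrative assumptions.

```python
import numpy as np

# Illustrative two-state chain: state 0 is non-absorbing, state 1 is absorbing.
# Under fixed stationary strategies the induced play is a Markov chain, so the
# expected stage payoff c_n and state distribution q_n can be computed directly.
lam = 0.05          # discount factor (assumption: any value in (0, 1])
p_abs = 0.10        # per-stage absorption probability under (x, y) (assumption)
g_non, g_abs = 1.0, 0.0   # stage payoff in the non-absorbing / absorbing state

N = 400                              # number of stages kept in the sketch
n = np.arange(1, N + 1)
t = 1.0 - (1.0 - lam) ** n           # dates t_n = lambda * sum_{i<=n} (1-lambda)^{i-1}
q_non = (1.0 - p_abs) ** (n - 1)     # probability of still being non-absorbed at stage n
c = q_non * g_non + (1.0 - q_non) * g_abs     # expected payoff at stage n

weights = lam * (1.0 - lam) ** (n - 1)
l_traj = np.cumsum(weights * c)      # l_lambda(t_n): cumulated expected payoff
Q_traj = np.cumsum(weights * q_non)  # Q_lambda(t_n)(omega_0): occupation of the non-absorbing state

print(t[:3], l_traj[:3], Q_traj[:3])
```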

Denote by $l_\lambda^{x,y}$ and $Q_\lambda^{x,y}$ the Ω-vectors of functions $l_\lambda^{\omega,x,y}(\cdot)$ and $Q_\lambda^{\omega,x,y}(\cdot)$ respectively.

Limit trajectories for the payoff and occupation measures will be defined as accumulation points of $l_\lambda^{x_\lambda,y_\lambda}$ and $Q_\lambda^{x_\lambda,y_\lambda}$ under λ-dependent ε-optimal strategies $x_\lambda$ and $y_\lambda$ as λ tends to 0.

More precisely, denote by $X_\lambda^\varepsilon$ (resp. $Y_\lambda^\varepsilon$) the set of ε-optimal stationary strategies in the λ-discounted game $\Gamma_\lambda$ (with value $v_\lambda$) for Player 1 (resp. for Player 2).

Then we introduce:

Definition 3. $l = (l^\omega : [0, 1] \to \mathbb{R})_{\omega\in\Omega}$ is a limit optimal trajectory for the expected accumulated payoff (LOTP) if:
$$\forall \varepsilon > 0,\ \exists \lambda_0 > 0,\ \forall \lambda < \lambda_0,\ \exists x_\lambda \in X_\lambda^\varepsilon,\ \exists y_\lambda \in Y_\lambda^\varepsilon,\ \forall \omega \in \Omega,\ \forall t \in [0, 1],\quad |l^\omega(t) - l_\lambda^{\omega,x_\lambda,y_\lambda}(t)| \le \varepsilon.$$

$Q = (Q^\omega : [0, 1] \to M(\Omega))_{\omega\in\Omega}$ is a limit optimal trajectory for the expected accumulated occupation measure (LOTM) if:
$$\forall \varepsilon > 0,\ \exists \lambda_0 > 0,\ \forall \lambda < \lambda_0,\ \exists x_\lambda \in X_\lambda^\varepsilon,\ \exists y_\lambda \in Y_\lambda^\varepsilon,\ \forall \omega \in \Omega,\ \forall t \in [0, 1],\quad \|Q^\omega(t) - Q_\lambda^{\omega,x_\lambda,y_\lambda}(t)\| \le \varepsilon.$$

Alternate weaker and stronger definitions are as follows: in both cases, if "$\forall \lambda < \lambda_0$" is replaced by "for some $\lambda_n$ going to 0", we will speak of a weak LOT. If "$\exists x_\lambda \in X_\lambda^\varepsilon,\ \exists y_\lambda \in Y_\lambda^\varepsilon$" is replaced by "$\exists \varepsilon' < \varepsilon,\ \forall x_\lambda \in X_\lambda^{\varepsilon'},\ \forall y_\lambda \in Y_\lambda^{\varepsilon'}$", we will speak of a strong LOT.

Remark 4.

- A weak LOT always exists by standard arguments of equicontinuity.
- If a LOTP l exists, $v_\lambda$ converges to l(1).
- If a strong LOT exists, it is unique.
- No strong LOTM exists in general (just consider a game where the payoff is always 0).
- If the game has a uniform value and both players use ε-optimal strategies, the average expected payoff is essentially constant along the play.

A first approach to this topic concerns one-player games (or games where one player controls the transitions), where there is no finiteness assumption on Ω. If $v_\lambda$ converges uniformly, then there exists a strong LOTP and it is linear with respect to t, which means that the expected payoff is constant along the trajectory (Sorin, Venel and Vigeral [14]).


The same article provides an example of a two-player game with finite action sets and a countable state space, where the LOTP is not unique.

The main contributions of the current paper are:

- For absorbing games, existence of a linear LOTP, and existence of a "geometric" algebraic LOTM.
- For finite absorbing games, existence of a strong LOTP.
- An example of a finite game where the LOTM is not semialgebraic.
- An example of a compact absorbing game with non-uniqueness of the LOTP.

Let us mention recent results of Oliu-Barton and Ziliotto [8] establishing the existence of a linear strong LOTP for finite stochastic games and optimal strategies: the class of games is larger and they allow for any kind of optimal strategies. Our results deal with compact action spaces and ε-optimal strategies.

As a final comment, let us underline the fact that the previous concepts and definitions can be extended to any repeated game, for any evaluation and any type of strategies.

3. Absorbing games

An absorbing game Γ is defined by two sets of actions I and J, two stage payoff functions g, g∗ from I × J to [−1, 1], and a probability of absorption p∗ from I × J to [0, 1].

I and J are compact metric sets, and g, g∗ and p∗ are (jointly) continuous.

The repeated game is played in discrete time as follows. At stage t = 1, 2, ... (if absorption has not yet occurred) Player 1 chooses $i_t \in I$ and, simultaneously, Player 2 chooses $j_t \in J$:

(i) the payoff at stage t is $g(i_t, j_t)$;

(ii) with probability $p^*(i_t, j_t)$ absorption is reached and the payoff in all future stages s > t is $g^*(i_t, j_t)$;

(iii) with probability $p(i_t, j_t) := 1 - p^*(i_t, j_t)$ the situation is repeated at stage t + 1.

Recall that the asymptotic analysis for these games is due to Kohlberg [3] in the case where I and J are finite, and to Rosenberg and Sorin [9] in the current framework. In either case the value $v_\lambda$ of the discounted game $\Gamma_\lambda$ converges to some v as λ goes to 0. This does not require any assumption on the information of the players. In case of full observation of the actions, or of the stage payoff, a uniform value exists; see Mertens and Neyman [5] in the finite case and Mertens, Neyman and Rosenberg [6] for compact actions.

Recall that X = ∆(I) and Y = ∆(J) are the sets of probabilities on I and J. The functions g, p and p∗ are bilinearly extended to X × Y. Let
$$G^*(x, y) := p^*(x, y)\,g^*(x, y) := \int_{I\times J} p^*(i, j)\,g^*(i, j)\,x(di)\,y(dj).$$
$g^*(x, y)$ is thus the expected absorbing payoff conditional on absorption (and is thus only defined when $p^*(x, y) > 0$).

3.1. An auxiliary game.

Consider the two-person zero-sum game A, defined for any $(x, x', a) \in S = X^2 \times \mathbb{R}^+$ and $(y, y', b) \in T = Y^2 \times \mathbb{R}^+$ by the payoff function
$$A(x, x', a, y, y', b) = \frac{g(x, y) + a\, G^*(x', y) + b\, G^*(x, y')}{1 + a\, p^*(x', y) + b\, p^*(x, y')}. \qquad (1)$$

3.1.1. General properties.

The following proposition extends to the compact case results due to Laraki [4] in the finite case (later simplified by Cardaliaguet, Laraki and Sorin [2]).

Proposition 5.

1) The game A has a value, which is $v = \lim v_\lambda$. More precisely
$$v = \max_{x\in X}\ \sup_{(x',a)\in X\times\mathbb{R}^+}\ \inf_{(y,y',b)\in T} A(x, x', a, y, y', b) = \min_{y\in Y}\ \inf_{(y',b)\in Y\times\mathbb{R}^+}\ \sup_{(x,x',a)\in S} A(x, x', a, y, y', b).$$

2) Moreover, if (x, x′, a) is ε-optimal in the game A then for any λ small enough the stationary strategy $\hat x_\lambda := \frac{x+\lambda a x'}{1+\lambda a}$ is 2ε-optimal in $\Gamma_\lambda$.

Proof.

1) Consider an accumulation point w of the family $\{v_\lambda\}$ and let $\lambda_n \to 0$ be such that $v_{\lambda_n}$ converges to w. We will show that
$$w \le \sup_{(x,x',a)\in S}\ \inf_{(y,y',b)\in T}\ \frac{g(x, y) + a\, G^*(x', y) + b\, G^*(x, y')}{1 + a\, p^*(x', y) + b\, p^*(x, y')}. \qquad (2)$$

A dual argument proves at the same time that the family $\{v_\lambda\}$ converges and that the auxiliary game A has a value.

Let $r_\lambda(x, y)$ be the payoff in the game $\Gamma_\lambda$ induced by a pair of stationary strategies (x, y) ∈ X × Y. It satisfies
$$r_\lambda(x, y) = \lambda g(x, y) + (1 - \lambda)\big[(1 - p^*(x, y))\,r_\lambda(x, y) + G^*(x, y)\big] \qquad (3)$$
hence
$$r_\lambda(x, y) = \frac{\lambda g(x, y) + (1 - \lambda)\,G^*(x, y)}{\lambda + (1 - \lambda)\,p^*(x, y)}. \qquad (4)$$
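As a quick sanity check (not part of the paper), the closed form (4) can be verified numerically against the fixed-point equation (3); the values of λ, g, G∗ and p∗ below are arbitrary illustrative choices.

```python
# Minimal numerical check that formula (4) solves the fixed-point equation (3).
# The numbers below are arbitrary illustrative values, not data from the paper.
lam, g, G_star, p_star = 0.1, 0.3, 0.25, 0.4   # with G_star = p_star * g_star

r = (lam * g + (1 - lam) * G_star) / (lam + (1 - lam) * p_star)   # formula (4)
fixed_point = lam * g + (1 - lam) * ((1 - p_star) * r + G_star)   # right-hand side of (3)

assert abs(r - fixed_point) < 1e-12
print(r)
```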

In particular, for any $x_\lambda \in X$ optimal for Player 1 in $\Gamma_\lambda$, one obtains
$$v_\lambda \le \frac{\lambda g(x_\lambda, y) + (1 - \lambda)\,G^*(x_\lambda, y)}{\lambda + (1 - \lambda)\,p^*(x_\lambda, y)}, \quad \forall y \in Y, \qquad (5)$$

which one can write as
$$v_\lambda \le \frac{g(x_\lambda, y) + \frac{1-\lambda}{\lambda}\, G^*(x_\lambda, y)}{1 + \frac{1-\lambda}{\lambda}\, p^*(x_\lambda, y)}, \quad \forall y \in Y. \qquad (6)$$

Let x ∈ X be an accumulation point of $\{x_{\lambda_n}\}$ and, given ε > 0, let λ in the sequence $\{\lambda_n\}$ be such that
$$|g(x, y) - g(x_\lambda, y)| \le \varepsilon, \quad \forall y \in Y$$
(we use the fact that g is uniformly continuous on X × Y) and $|v_\lambda - w| \le \varepsilon$. Then with $a = \frac{1-\lambda}{\lambda}$ and $x' = x_\lambda$, (6) implies
$$w - \varepsilon \le \frac{g(x, y) + a\, G^*(x', y)}{1 + a\, p^*(x', y)} + \varepsilon, \quad \forall y \in Y. \qquad (7)$$

On the other hand, going to the limit in (5) leads to

w p∗(x, y′) ≤ G∗(x, y′), ∀y′ ∈ Y. (8)

We multiply (7) by the denominator $1 + a\,p^*(x', y)$ and add (8) multiplied by $b \in \mathbb{R}^+$ to obtain the property:

∀ε > 0, ∃ x, x′ ∈ X and a ∈ $\mathbb{R}^+$ such that
$$w \le \frac{g(x, y) + a\, G^*(x', y) + b\, G^*(x, y')}{1 + a\, p^*(x', y) + b\, p^*(x, y')} + 2\varepsilon, \quad \forall y, y' \in Y,\ b \in \mathbb{R}^+ \qquad (9)$$
which implies (2). Note moreover that x is independent of ε, hence the result.

2) Let (x, x′, a) be ε-optimal in the game A and let $\hat x_\lambda := \frac{x + \lambda a x'}{1 + \lambda a}$. Using (4) one obtains
$$r_\lambda(\hat x_\lambda, y) = \frac{\lambda\big[g(x, y) + \lambda a\, g(x', y)\big] + (1 - \lambda)\big[G^*(x, y) + \lambda a\, G^*(x', y)\big]}{\lambda(1 + \lambda a) + (1 - \lambda)\big(p^*(x, y) + \lambda a\, p^*(x', y)\big)}.$$
Note that
$$A\Big(x, x', a, y, y, \frac{1 - \lambda}{\lambda}\Big) = \frac{\lambda g(x, y) + \lambda a\, G^*(x', y) + (1 - \lambda)\, G^*(x, y)}{\lambda + \lambda a\, p^*(x', y) + (1 - \lambda)\, p^*(x, y)}.$$
Thus
$$\Big| r_\lambda(\hat x_\lambda, y) - A\Big(x, x', a, y, y, \frac{1 - \lambda}{\lambda}\Big) \Big| \le 4C\lambda a$$
where C is a bound on the payoffs. Hence, for any y ∈ Y,
$$v - r_\lambda(\hat x_\lambda, y) \le \varepsilon + 4C\lambda a \le 2\varepsilon$$
for λ small enough. □
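The following sketch (not from the paper) illustrates Proposition 5 2) on the Big Match of Example 15 below: it implements the auxiliary payoff (1) and formula (4), checks on a crude grid that (x, x′, a) = (B, T, 1) guarantees the value 1/2 in A, and that $r_\lambda(\hat x_\lambda, y)$ stays near 1/2 against stationary replies y, where $\hat x_\lambda$ here equals the classical optimal strategy (λ/(1+λ), 1/(1+λ)). The helper names are ad hoc.

```python
import itertools
import numpy as np

# Big Match data (Example 15 below): rows (T, B), columns (L, R); the T row is absorbing.
g      = np.array([[1.0, 0.0], [0.0, 1.0]])   # stage payoffs
g_star = np.array([[1.0, 0.0], [0.0, 0.0]])   # absorbing payoffs (only the T row is used)
p_star = np.array([[1.0, 1.0], [0.0, 0.0]])   # absorption probabilities

def bilin(M, x, y):
    return float(x @ M @ y)

def A(x, xp, a, y, yp, b):
    """Payoff (1) of the auxiliary game."""
    num = bilin(g, x, y) + a * bilin(p_star * g_star, xp, y) + b * bilin(p_star * g_star, x, yp)
    den = 1.0 + a * bilin(p_star, xp, y) + b * bilin(p_star, x, yp)
    return num / den

def r_lambda(x, y, lam):
    """Stationary discounted payoff, formula (4)."""
    G = bilin(p_star * g_star, x, y)
    return (lam * bilin(g, x, y) + (1 - lam) * G) / (lam + (1 - lam) * bilin(p_star, x, y))

T, B = np.array([1.0, 0.0]), np.array([0.0, 1.0])
grid = np.linspace(0.0, 1.0, 51)

# (x, x', a) = (B, T, 1) guarantees the value 1/2 in A (crude grid over the opponent's choices).
worst = min(A(B, T, 1.0, np.array([u, 1 - u]), np.array([w, 1 - w]), b)
            for u, w, b in itertools.product(grid, grid, [0.0, 1.0, 10.0, 1e4]))
print(worst)                                   # ~ 0.5

# Induced strategy in Gamma_lambda: x_hat = (B + lambda*1*T)/(1 + lambda*1).
lam = 0.01
x_hat = (B + lam * 1.0 * T) / (1.0 + lam * 1.0)       # = (lam/(1+lam), 1/(1+lam))
print(min(r_lambda(x_hat, np.array([u, 1 - u]), lam) for u in grid))   # ~ 0.5 = v
```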

A is an auxiliary limit game in the sense that:

i) There is a map φ from S × (0, 1] to X (that associates to a stationary strategy of player 1 in A and a discount factor a stationary strategy of player 1 in Γ).

ii) There is a map ψ from Y × (0, 1] to T (that associates to a stationary strategy of player 2 in Γ and a discount factor a stationary strategy of player 2 in A).

iii)

rλ(φ(λ, s), y) ≥ A(s, ψ(λ, y)) − o(1), ∀s ∈ S, ∀y ∈ Y

iv) A dual property holds.

These properties imply: lim vλ exists and equals v(A).

We then recover Corollary 3.2 in Sorin and Vigeral [12], with a new proof that will be useful in the sequel.

Corollary 6.
$$v = \min_{y\in Y}\max_{x\in X}\ \mathrm{med}\Big( g(x, y)\,;\ \sup_{x''\,:\,p^*(x'',y)>0} g^*(x'', y)\,;\ \inf_{y''\,:\,p^*(x,y'')>0} g^*(x, y'') \Big) = \max_{x\in X}\min_{y\in Y}\ \mathrm{med}\Big( g(x, y)\,;\ \sup_{x''\,:\,p^*(x'',y)>0} g^*(x'', y)\,;\ \inf_{y''\,:\,p^*(x,y'')>0} g^*(x, y'') \Big) \qquad (10)$$
where med is the median of three numbers, and with the usual convention that $\sup_{x''\in\emptyset} = -\infty$ and $\inf_{y''\in\emptyset} = +\infty$. Moreover if (x, x′, a) (resp. (y, y′, b)) is ε-optimal in A then x (resp. y) is ε-optimal in (10).

Proof.

For any ε > 0 fix a triplet $(y, y'_\varepsilon, b_\varepsilon) \in T$ of the second player, ε-optimal in A, where we can assume that y does not depend on ε by the previous result. Then for any x″ ∈ X such that $p^*(x'', y) > 0$, one has
$$v + \varepsilon \ge A(x, x'', +\infty, y, y'_\varepsilon, b_\varepsilon) = g^*(x'', y) \qquad (11)$$
thus $v + \varepsilon \ge h^+(y) := \sup_{x''\,:\,p^*(x'',y)>0} g^*(x'', y)$. Denote similarly $h^-(x) := \inf_{y''\,:\,p^*(x,y'')>0} g^*(x, y'')$.

On the other hand, for any x,
$$v + \varepsilon \ge A(x, x'', 0, y, y'_\varepsilon, b_\varepsilon) = \frac{g(x, y) + b_\varepsilon\, G^*(x, y'_\varepsilon)}{1 + b_\varepsilon\, p^*(x, y'_\varepsilon)}.$$
Now if $p^*(x, y'_\varepsilon) > 0$,
$$\frac{g(x, y) + b_\varepsilon\, G^*(x, y'_\varepsilon)}{1 + b_\varepsilon\, p^*(x, y'_\varepsilon)} \ge \min\{g(x, y),\ g^*(x, y'_\varepsilon)\},$$
hence in all cases
$$v + \varepsilon \ge A(x, x'', 0, y, y'_\varepsilon, b_\varepsilon) \ge \min\{g(x, y),\ h^-(x)\}.$$
Thus for any x ∈ X
$$v + \varepsilon \ge \mathrm{med}\big( g(x, y);\ h^+(y);\ h^-(x) \big). \qquad (12)$$
Letting ε go to 0 and using the dual inequality establishes the results. □


We establish here more precise results concerning the decomposition of the payoff induced by ε-optimal strategies in the game A .

Proposition 7. Let (x, x′, a) and (y, y′, b) be ε-optimal in the game A.

a) If $p^*(x, y) > 0$ then $|g^*(x, y) - v| \le \varepsilon$.

b) $|g(x, y) - v| \le 2\,\big(1 + a\,p^*(x', y) + b\,p^*(x, y')\big)\,\varepsilon$.

c) If $a\,p^*(x', y) + b\,p^*(x, y') > 0$ then
$$\left|\frac{a\,G^*(x', y) + b\,G^*(x, y')}{a\,p^*(x', y) + b\,p^*(x, y')} - v\right| \le 3\,\frac{1 + a\,p^*(x', y) + b\,p^*(x, y')}{a\,p^*(x', y) + b\,p^*(x, y')}\,\varepsilon.$$

Proof. a) This is exactly equation (11) and its dual.

b) From $v + \varepsilon \ge A(x, x', 0, y, y', b)$ we get
$$\frac{g(x, y) + b\,G^*(x, y')}{1 + b\,p^*(x, y')} \le v + \varepsilon.$$
On the other hand, $v - \varepsilon \le A(x, x', a, y, y', +\infty)$, hence $G^*(x, y') \ge (v - \varepsilon)\,p^*(x, y')$. Combining both inequalities yields
$$g(x, y) \le (v + \varepsilon)\big(1 + b\,p^*(x, y')\big) - b(v - \varepsilon)\,p^*(x, y') \le v + \varepsilon\big(1 + 2b\,p^*(x, y')\big) \le v + 2\varepsilon\big(1 + a\,p^*(x', y) + b\,p^*(x, y')\big) \qquad (13)$$
and the dual inequality is similar.

c) Since $A(x, x', a, y, y', b) \ge v - \varepsilon$, one has
$$(v - \varepsilon)\big(1 + a\,p^*(x', y) + b\,p^*(x, y')\big) \le g(x, y) + a\,G^*(x', y) + b\,G^*(x, y') \le v + 2\varepsilon\big(1 + a\,p^*(x', y) + b\,p^*(x, y')\big) + a\,G^*(x', y) + b\,G^*(x, y')$$
by (13), hence
$$v\,\big(a\,p^*(x', y) + b\,p^*(x, y')\big) - a\,G^*(x', y) - b\,G^*(x, y') \le 3\varepsilon\big(1 + a\,p^*(x', y) + b\,p^*(x, y')\big)$$
and the dual inequality is similar. □

3.2. Asymptotic properties in $\Gamma_\lambda$.

Since the game is absorbing, we simply write $Q_\lambda^{x,y}(t)$ for $Q_\lambda^{\omega_0,x,y}(t)(\omega_0)$, where $\omega_0$ is the non-absorbing state.

Lemma 8. Let $x_\lambda$ and $y_\lambda$ be two families of (not necessarily optimal) stationary strategies of Player 1 and Player 2 respectively. Assume that $\frac{p^*(x_\lambda, y_\lambda)}{\lambda}$ converges to some γ in [0, +∞] as λ goes to 0. Then $Q_\lambda^{x_\lambda,y_\lambda}(t)$ converges uniformly in t to
$$\frac{1 - (1 - t)^{1+\gamma}}{1 + \gamma}$$
as λ goes to 0, with the natural convention that $\frac{1 - (1 - t)^{1+\gamma}}{1 + \gamma} = 0$ when γ = +∞.

Proof. By Definition 2, for any λ and $t_n = \lambda\sum_{i=1}^{n}(1-\lambda)^{i-1} = 1 - (1-\lambda)^n$,
$$Q_\lambda^{x_\lambda,y_\lambda}(t_n) = \lambda \sum_{i=1}^{n} (1 - \lambda)^{i-1}\big(1 - p^*(x_\lambda, y_\lambda)\big)^{i-1} = \frac{1 - \big((1 - \lambda)(1 - p^*(x_\lambda, y_\lambda))\big)^{n}}{1 + \frac{p^*(x_\lambda,y_\lambda)}{\lambda} - p^*(x_\lambda, y_\lambda)}$$
with linear interpolation between these dates.

Remark first that this implies that $Q_\lambda^{x_\lambda,y_\lambda}(t) \le \big[1 + \frac{p^*(x_\lambda,y_\lambda)}{\lambda}\big]^{-1}$ for all t and λ, which gives at the limit the desired result if γ = +∞.

Assume now that γ ∈ [0, +∞[, and thus that $p^*(x_\lambda, y_\lambda)$ tends to 0 as λ goes to 0. Fix t and λ, and let n be the integer part of $\frac{\ln(1-t)}{\ln(1-\lambda)}$, so that $t_n \le t \le t_{n+1}$. Since $Q_\lambda^{x_\lambda,y_\lambda}(t_n)$ is nondecreasing in n,
$$Q_\lambda^{x_\lambda,y_\lambda}(t) \ge Q_\lambda^{x_\lambda,y_\lambda}(t_n) = \frac{1 - \big((1 - \lambda)(1 - p^*(x_\lambda, y_\lambda))\big)^{n}}{1 + \frac{p^*(x_\lambda,y_\lambda)}{\lambda} - p^*(x_\lambda, y_\lambda)} \ge \frac{1 - \big((1 - \lambda)(1 - p^*(x_\lambda, y_\lambda))\big)^{\frac{\ln(1-t)}{\ln(1-\lambda)} - 1}}{1 + \frac{p^*(x_\lambda,y_\lambda)}{\lambda} - p^*(x_\lambda, y_\lambda)}. \qquad (14)$$
Similarly,
$$Q_\lambda^{x_\lambda,y_\lambda}(t) \le Q_\lambda^{x_\lambda,y_\lambda}(t_{n+1}) = \frac{1 - \big((1 - \lambda)(1 - p^*(x_\lambda, y_\lambda))\big)^{n+1}}{1 + \frac{p^*(x_\lambda,y_\lambda)}{\lambda} - p^*(x_\lambda, y_\lambda)} \le \frac{1 - \big((1 - \lambda)(1 - p^*(x_\lambda, y_\lambda))\big)^{\frac{\ln(1-t)}{\ln(1-\lambda)} + 1}}{1 + \frac{p^*(x_\lambda,y_\lambda)}{\lambda} - p^*(x_\lambda, y_\lambda)}. \qquad (15)$$
Since $\big((1 - \lambda)(1 - p^*(x_\lambda, y_\lambda))\big)^{\frac{\ln(1-t)}{\ln(1-\lambda)}} = (1-t)^{1+\frac{\ln(1-p^*(x_\lambda,y_\lambda))}{\ln(1-\lambda)}}$, letting λ go to 0 in (14) and (15) yields the result. □
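A quick numerical illustration of Lemma 8 (not from the paper): for families of strategies with $p^*(x_\lambda, y_\lambda) = \gamma\lambda$, the trajectory $Q_\lambda$ approaches $\frac{1-(1-t)^{1+\gamma}}{1+\gamma}$ uniformly; the values of γ and λ below are arbitrary.

```python
import numpy as np

def Q_lambda(lam, p_star, N=200000):
    """Cumulated occupation measure of the non-absorbing state at the dates t_n."""
    n = np.arange(1, N + 1)
    t = 1.0 - (1.0 - lam) ** n
    Q = np.cumsum(lam * ((1.0 - lam) * (1.0 - p_star)) ** (n - 1))
    return t, Q

gamma = 2.0                              # illustrative value
for lam in [0.1, 0.01, 0.001]:
    t, Q = Q_lambda(lam, gamma * lam)    # strategies with p* = gamma * lambda
    limit = (1.0 - (1.0 - t) ** (1.0 + gamma)) / (1.0 + gamma)
    print(lam, np.max(np.abs(Q - limit)))   # sup distance shrinks as lambda -> 0
```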

For any $(x, x', a) \in X^2 \times \mathbb{R}^+$ and $(y, y', b) \in Y^2 \times \mathbb{R}^+$ define
$$\gamma(x, x', a, y, y', b) = \begin{cases} +\infty & \text{if } p^*(x, y) > 0,\\ a\,p^*(x', y) + b\,p^*(x, y') & \text{if } p^*(x, y) = 0. \end{cases}$$

Corollary 9. Let $(x, x', a) \in X^2 \times \mathbb{R}^+$ and $(y, y', b) \in Y^2 \times \mathbb{R}^+$ and denote $\hat x_\lambda := \frac{x+\lambda a x'}{1+\lambda a}$ and $\hat y_\lambda := \frac{y+\lambda b y'}{1+\lambda b}$. Then $Q_\lambda^{\hat x_\lambda, \hat y_\lambda}(t)$ converges uniformly in t to $\frac{1 - (1 - t)^{1+\gamma(x,x',a,y,y',b)}}{1 + \gamma(x, x', a, y, y', b)}$ as λ goes to 0.

Proposition 10. Any absorbing game has a LOTM $Q(t) = \frac{1-(1-t)^{1+\gamma}}{1+\gamma}$ for some γ ∈ [0, +∞], and a LOTP l(t) = tv.

Proof. For every n let $(x, x'_n, a_n)$ and $(y, y'_n, b_n)$ be $\frac1n$-optimal strategies for each player in A (recall that x and y can be chosen independently of n). Up to extraction, $\gamma_n := \gamma(x, x'_n, a_n, y, y'_n, b_n)$ converges to some γ in [0, +∞].

Fix ε > 0 and let $n \ge \frac{2}{\varepsilon}$ be such that
$$\left|\frac{1 - (1 - t)^{1+\gamma_n}}{1 + \gamma_n} - \frac{1 - (1 - t)^{1+\gamma}}{1 + \gamma}\right| \le \frac{\varepsilon}{2} \qquad (16)$$
on [0, 1]. By Proposition 5 the strategies $\hat x^n_\lambda := \frac{x+\lambda a_n x'_n}{1+\lambda a_n}$ and $\hat y^n_\lambda := \frac{y+\lambda b_n y'_n}{1+\lambda b_n}$ are ε-optimal in $\Gamma_\lambda$ for λ small enough. Corollary 9 and equation (16) imply that
$$\left|Q_\lambda^{\hat x^n_\lambda,\hat y^n_\lambda}(t) - \frac{1 - (1 - t)^{1+\gamma}}{1 + \gamma}\right| \le \varepsilon \qquad (17)$$
for all λ small enough and t ∈ [0, 1]. This establishes the first part of the Proposition. Clearly
$$l_\lambda^{\hat x^n_\lambda,\hat y^n_\lambda}(t) = Q_\lambda^{\hat x^n_\lambda,\hat y^n_\lambda}(t)\,g(\hat x^n_\lambda, \hat y^n_\lambda) + \Big(t - Q_\lambda^{\hat x^n_\lambda,\hat y^n_\lambda}(t)\Big)\, g^*(\hat x^n_\lambda, \hat y^n_\lambda). \qquad (18)$$
Recall that the payoff function is assumed bounded by 1. Then equations (17) and (18) imply that for λ small enough and every t,
$$\left|l_\lambda^{\hat x^n_\lambda,\hat y^n_\lambda}(t) - \frac{1 - (1 - t)^{1+\gamma}}{1 + \gamma}\,g(\hat x^n_\lambda, \hat y^n_\lambda) - \Big(t - \frac{1 - (1 - t)^{1+\gamma}}{1 + \gamma}\Big)\, g^*(\hat x^n_\lambda, \hat y^n_\lambda)\right| \le 2\varepsilon.$$
Since $\hat x^n_\lambda$ and $\hat y^n_\lambda$ converge to x and y, we then have for λ small enough
$$\left|l_\lambda^{\hat x^n_\lambda,\hat y^n_\lambda}(t) - \frac{1 - (1 - t)^{1+\gamma}}{1 + \gamma}\,g(x, y) - \Big(t - \frac{1 - (1 - t)^{1+\gamma}}{1 + \gamma}\Big)\, g^*(\hat x^n_\lambda, \hat y^n_\lambda)\right| \le 3\varepsilon. \qquad (19)$$

We now consider four separate cases. Basically, either γ = 0 or +∞ and equation (19) implies that l(t) is linear, hence equal to tv since l(1) = v by near optimality of the strategies $\hat x^n_\lambda$ and $\hat y^n_\lambda$ in $\Gamma_\lambda$; or γ ∈ ]0, +∞[ and then both g(x, y) and $g^*(\hat x^n_\lambda, \hat y^n_\lambda)$ are close to v by Proposition 7, which once again implies l(t) = tv.

Case 1: $p^*(x, y) > 0$. Then $g^*(\hat x^n_\lambda, \hat y^n_\lambda)$ converges to $g^*(x, y)$ as λ goes to 0, and by Proposition 7 a), $|g^*(x, y) - v| \le \varepsilon$. Since γ = +∞ in that case, equation (19) yields $|l_\lambda^{\hat x^n_\lambda,\hat y^n_\lambda}(t) - tv| \le 5\varepsilon$ for all λ small enough, uniformly in t.

Case 2: $p^*(x, y) = 0$ and γ = 0, hence $\gamma_n \le 1$ (up to choosing a larger n). Then Proposition 7 b) implies $|g(x, y) - v| \le 4\varepsilon$, and equation (19) yields $|l_\lambda^{\hat x^n_\lambda,\hat y^n_\lambda}(t) - tv| \le 7\varepsilon$ for all λ small enough, uniformly in t.

Case 3: $p^*(x, y) = 0$ and γ ∈ ]0, +∞[, hence $\frac{\gamma}{2} \le \gamma_n \le 1 + \gamma$ (up to choosing a larger n). Then Proposition 7 b) implies $|g(x, y) - v| \le 2(2 + \gamma)\varepsilon$. Moreover, $p^*(x, y) = 0$ implies that $g^*(\hat x^n_\lambda, \hat y^n_\lambda)$ converges to $\frac{a_n G^*(x'_n, y) + b_n G^*(x, y'_n)}{a_n p^*(x'_n, y) + b_n p^*(x, y'_n)}$ as λ goes to 0, and by Proposition 7 c),
$$\left|\frac{a_n G^*(x'_n, y) + b_n G^*(x, y'_n)}{a_n p^*(x'_n, y) + b_n p^*(x, y'_n)} - v\right| \le 3\,\frac{1+\gamma/2}{\gamma/2}\,\varepsilon.$$
Hence equation (19) yields $|l_\lambda^{\hat x^n_\lambda,\hat y^n_\lambda}(t) - tv| \le \big(7 + 2\gamma + 3\,\frac{1+\gamma/2}{\gamma/2}\big)\varepsilon$ for all λ small enough, uniformly in t.

Case 4: $p^*(x, y) = 0$ and γ = +∞, hence $\gamma_n \ge 1$ (up to choosing a larger n). Then $g^*(\hat x^n_\lambda, \hat y^n_\lambda)$ converges to $\frac{a_n G^*(x'_n, y) + b_n G^*(x, y'_n)}{a_n p^*(x'_n, y) + b_n p^*(x, y'_n)}$ as λ goes to 0, and by Proposition 7 c),
$$\left|\frac{a_n G^*(x'_n, y) + b_n G^*(x, y'_n)}{a_n p^*(x'_n, y) + b_n p^*(x, y'_n)} - v\right| \le 6\varepsilon.$$
Thus equation (19) yields $|l_\lambda^{\hat x^n_\lambda,\hat y^n_\lambda}(t) - tv| \le 9\varepsilon$ for all λ small enough, uniformly in t.

As claimed, in every case we see that l(t) = tv. 

Remark 11.

Recall that Q(·) and l(·) represent the expected cumulated occupation measure and payoff. By differentiating these quantities with respect to t we get that the asymptotic probability q(t) of still being in the non-absorbing state at time t is $(1 - t)^\gamma$, and that the current asymptotic payoff is v at any time.

Remark 12.

Let us give a simple heuristic behind the form $(1 - t)^\gamma$. Denote here by q(t) the probability of absorption before time t, so that the play is still in the non-absorbing state with probability 1 − q(t). Assuming that this quantity is well defined and smooth, note that at time t the remaining game has length 1 − t and weight 1 − q(t), hence by renormalization (see the figure below)
$$\frac{q'(t)}{1 - q(t)}\,(1 - t) = q'(0), \quad\text{so that}\quad \frac{q'(t)}{1 - q(t)} = \frac{q'(0)}{1 - t},$$
which leads, with q(0) = 0, to $q(t) = 1 - (1 - t)^\gamma$ with γ = q′(0) (integrating $\frac{q'(s)}{1-q(s)} = \frac{\gamma}{1-s}$ between 0 and t gives $-\ln(1-q(t)) = -\gamma\ln(1-t)$).

[Figure: graph of q(t) for t ∈ [0, 1], with the slopes q′(0) at time 0 and q′(t) at time t, illustrating the renormalization of the remaining game.]

Let us now illustrate the four cases of the preceding proof with examples.

Example 13.

Consider the absorbing game

L R

T 1∗ 0

B 0 1∗

with asymptotic value 1. Let x = 1/2 T + 1/2 B. Then (x, x, n) is 1/n-optimal in A for Player 1, while any (y, y′, b) is optimal for Player 2. Since $p^*(x, y) > 0$ for all y, case 1 occurs for any choice of (y, y′, b), hence the corresponding γ is +∞ and Q(t) = 0 for all t.

Notice that in $\Gamma_\lambda$ the only optimal stationary strategy is (1/2, 1/2) for each player, leading to the same asymptotic trajectory Q(·) = 0.

Example 14.

Consider the absorbing game

L R

T 1∗ 0

B 0 1

with asymptotic value 1. Then (B, T, n) is 1/n-optimal in A, while any (y, y′, b) is optimal. The associated γ is n y(L), hence either y(L) = 0 and case 2 holds with γ = 0 and Q(t) = t, or y(L) > 0 and case 4 occurs with γ = +∞ and Q(t) = 0.

Notice that in $\Gamma_\lambda$ the only optimal stationary strategy is $x_\lambda = y_\lambda = \big(\frac{\sqrt\lambda}{1+\sqrt\lambda}, \frac{1}{1+\sqrt\lambda}\big)$ for each player. Since $p^*(x_\lambda, y_\lambda) = \frac{\lambda}{(1+\sqrt\lambda)^2} \sim \lambda$, Lemma 8 implies that the asymptotic trajectory associated to optimal strategies is $Q(t) = t - \frac{t^2}{2}$. Moreover, for any γ ≥ 0 the strategy $z_\lambda = \big(\frac{\gamma\sqrt\lambda}{1+\sqrt\lambda}, 1 - \frac{\gamma\sqrt\lambda}{1+\sqrt\lambda}\big)$ of Player 2 is ε-optimal in $\Gamma_\lambda$ for λ small enough, and $p^*(x_\lambda, z_\lambda) \sim \gamma\lambda$, hence any Q(t) of the form $\frac{1-(1-t)^{1+\gamma}}{1+\gamma}$ is an asymptotic behavior.
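A small numerical sketch (not from the paper, and relying on the payoff matrix of Example 14 as reconstructed above): it checks that the symmetric strategies with weight $\frac{\sqrt\lambda}{1+\sqrt\lambda}$ on the first action yield a value close to 1, and that against $z_\lambda$ the normalized absorption probability $p^*(x_\lambda, z_\lambda)/\lambda$ is close to the chosen γ. The value of λ is an arbitrary illustrative choice.

```python
import numpy as np

# Example 14 (as reconstructed): rows (T, B), columns (L, R); only (T, L) is absorbing.
g      = np.array([[1.0, 0.0], [0.0, 1.0]])
g_star = np.array([[1.0, 0.0], [0.0, 0.0]])
p_star = np.array([[1.0, 0.0], [0.0, 0.0]])

def r_lambda(x, y, lam):
    """Stationary discounted payoff, formula (4)."""
    G = float(x @ (p_star * g_star) @ y)
    p = float(x @ p_star @ y)
    return (lam * float(x @ g @ y) + (1 - lam) * G) / (lam + (1 - lam) * p)

lam = 1e-4
s = np.sqrt(lam) / (1 + np.sqrt(lam))
x_lam = np.array([s, 1 - s])                       # = y_lam by symmetry

# x_lam guarantees roughly the same amount against pure and mixed replies: ~ 1/(1+sqrt(lam)).
print([r_lambda(x_lam, np.array([u, 1 - u]), lam) for u in (0.0, 0.5, 1.0)])

# Against z_lam the rescaled absorption probability converges to gamma.
for gamma in (0.0, 1.0, 3.0):
    z = np.array([gamma * s, 1 - gamma * s])
    print(gamma, float(x_lam @ p_star @ z) / lam)  # ~ gamma for small lambda
```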

Example 15.

Consider the Big Match

L R

T 1∗ 0∗

B 0 1

with asymptotic value 1/2. Let y = 1/2 L + 1/2 R. Then (B, T, 1) and (y, y, 0) are optimal in A, with γ = 1. Hence case 3 holds, and the corresponding Q(t) is $t - \frac{t^2}{2}$. The optimal strategies in $\Gamma_\lambda$ are $\big(\frac{\lambda}{1+\lambda}, \frac{1}{1+\lambda}\big)$ and (1/2, 1/2) respectively, leading to the same $Q(t) = t - \frac{t^2}{2}$.

3.3. Finite case.

We now prove that when the game is finite, the limit payoff trajectory is linear for every pair of near-optimal stationary strategies, not only those given by Proposition 5. That is, l(t) = tv is a strong limit behavior for the cumulated payoff.

Proposition 16. Let Γ be a finite absorbing game with asymptotic value v, and let $x_\lambda$ and $y_\lambda$ be families of ε(λ)-optimal stationary strategies in $\Gamma_\lambda$, with ε(λ) going to 0 as λ goes to 0. Then for every t ∈ [0, 1], $l_\lambda^{x_\lambda,y_\lambda}(t)$ converges to tv as λ goes to 0.

We will use in the proof of this proposition the following elementary lemma given without proof.

Lemma 17. Let a, b, c, d be real numbers with c and d positive. Then $\min\big(\frac{a}{c}, \frac{b}{d}\big) \le \frac{a+b}{c+d} \le \max\big(\frac{a}{c}, \frac{b}{d}\big)$, with equality if and only if $\frac{a}{c} = \frac{b}{d}$.

Proof of Proposition 16. The result is clear for t = 0 or 1; assume by contradiction that it is false for some t ∈ ]0, 1[. Hence there is a sequence $\lambda_n$ going to 0 and ε(λ_n)-optimal strategies $x_{\lambda_n}$ and $y_{\lambda_n}$ such that $l_{\lambda_n}^{x_{\lambda_n},y_{\lambda_n}}(t)$ converges to tw with w ≠ v. Up to extraction of subsequences, $x_{\lambda_n}$ and $y_{\lambda_n}$ converge to x and y respectively. Also up to extraction, all the following limits exist in [0, +∞]: $\alpha(i) = \lim_{n\to\infty} \frac{x_{\lambda_n}(i)}{\lambda_n}$, $\beta(j) = \lim_{n\to\infty} \frac{y_{\lambda_n}(j)}{\lambda_n}$, and $\gamma = \lim_{n\to\infty} \frac{p^*(x_{\lambda_n}, y_{\lambda_n})}{\lambda_n}$. If γ ≠ 0 (and hence $p^*(x_{\lambda_n}, y_{\lambda_n}) > 0$ for n large enough), denote $g^*(x, y) := \lim \frac{G^*(x_{\lambda_n}, y_{\lambda_n})}{p^*(x_{\lambda_n}, y_{\lambda_n})}$.

Recall formula (4):
$$r_{\lambda_n}(x_{\lambda_n}, y_{\lambda_n}) = \frac{\lambda_n\, g(x_{\lambda_n}, y_{\lambda_n}) + (1 - \lambda_n)\,G^*(x_{\lambda_n}, y_{\lambda_n})}{\lambda_n + (1 - \lambda_n)\,p^*(x_{\lambda_n}, y_{\lambda_n})} \qquad (20)$$
and since $x_\lambda$ and $y_\lambda$ are families of ε(λ)-optimal strategies in $\Gamma_\lambda$, $r_{\lambda_n}(x_{\lambda_n}, y_{\lambda_n})$ converges to v as n tends to infinity.

Recall that by Lemma 8, at the limit
$$l(t) = \frac{1 - (1 - t)^{1+\gamma}}{1 + \gamma}\, g(x, y) + \Big(t - \frac{1 - (1 - t)^{1+\gamma}}{1 + \gamma}\Big)\, g^*(x, y).$$
We first claim that γ ∈ ]0, +∞[.

If γ = 0, l(t) = t g(x, y), and by near optimality of $x_\lambda$ and $y_\lambda$, v = l(1) = g(x, y), hence l(t) = tv, a contradiction.

If γ = +∞, l(t) = t g∗(x, y), and by near optimality of $x_\lambda$ and $y_\lambda$, v = l(1) = g∗(x, y), hence l(t) = tv, a contradiction.

Hence γ ∈ ]0, +∞[, and w is a nontrivial convex combination of g(x, y) and g∗(x, y). Since v = l(1) is also a convex combination of g(x, y) and g∗(x, y), the assumption that v ≠ w implies g(x, y) ≠ g∗(x, y). Assume without loss of generality g(x, y) < v < g∗(x, y).

We classify the actions i of Player 1 in 4 categories $I_1$ to $I_4$:

• i ∈ $I_1$ if x(i) > 0,

• i ∈ $I_2$ if x(i) = 0 and α(i) = +∞,

• i ∈ $I_3$ if x(i) = 0 and α(i) ∈ ]0, +∞[,

• i ∈ $I_4$ if x(i) = 0 and α(i) = 0.

Hence actions of category 1 are of order 1, actions of category 3 are of order λ, actions of category 4 are of order o(λ), and actions of category 2 are played with probability going to 0 but large with respect to λ. Define categories $J_1$ to $J_4$ of Player 2 in a similar way. By definition,

$$\frac{p^*(x_{\lambda_n}, y_{\lambda_n})}{\lambda_n} = \sum_{I\times J} p^*(i, j)\,\frac{x_{\lambda_n}(i)\, y_{\lambda_n}(j)}{\lambda_n}. \qquad (21)$$
Recall that the left-hand side converges to γ ∈ ]0, +∞[, hence up to extraction $p^*(i,j)\frac{x_{\lambda_n}(i)\, y_{\lambda_n}(j)}{\lambda_n}$ converges in [0, +∞[ for any i and j; denote by $\delta_{ij}$ the limit. If i and j are of categories k and l with k + l > 4 then $\delta_{ij} = 0$. If k + l < 4 then $\frac{x_{\lambda_n}(i)\, y_{\lambda_n}(j)}{\lambda_n}$ diverges to +∞, which implies that $p^*(i, j) = 0 = \delta_{ij}$.

Hence going to the limit in (21) we get that $\gamma = \gamma_{1,3} + \gamma_{2,2} + \gamma_{3,1}$, where $\gamma_{1,3} := \sum_{I_1\times J_3} p^*(i, j)\,x(i)\,\beta(j)$, $\gamma_{2,2} := \sum_{I_2\times J_2} \delta_{ij}$, and $\gamma_{3,1} := \sum_{I_3\times J_1} p^*(i, j)\,\alpha(i)\,y(j)$. Recall that γ > 0, thus at least one of $\gamma_{1,3}$, $\gamma_{2,2}$ or $\gamma_{3,1}$ is positive.

Similarly, passing to the limit in the definition of $G^*(x_{\lambda_n}, y_{\lambda_n}) := \sum_{I\times J} G^*(i, j)\,x_{\lambda_n}(i)\,y_{\lambda_n}(j)$ yields
$$\mu := \lim \frac{G^*(x_{\lambda_n}, y_{\lambda_n})}{\lambda_n} = \mu_{1,3} + \mu_{2,2} + \mu_{3,1},$$
where $\mu_{1,3} := \sum_{I_1\times J_3} x(i)\,\beta(j)\,G^*(i, j)$, $\mu_{2,2} := \sum_{I_2\times J_2} \delta_{ij}\,g^*(i, j)$, and $\mu_{3,1} := \sum_{I_3\times J_1} \alpha(i)\,y(j)\,G^*(i, j)$. Note that if $\gamma_{k,4-k} = 0$ for some k then $\mu_{k,4-k} = 0$ as well.

Finally, going to the limit in equation (20) yields
$$v = \frac{g(x, y) + \mu_{1,3} + \mu_{2,2} + \mu_{3,1}}{1 + \gamma_{1,3} + \gamma_{2,2} + \gamma_{3,1}}$$
and similarly going to the limit in the definition of $g^*(x, y) := \lim \frac{G^*(x_{\lambda_n}, y_{\lambda_n})}{p^*(x_{\lambda_n}, y_{\lambda_n})}$ yields
$$g^*(x, y) = \frac{\mu_{1,3} + \mu_{2,2} + \mu_{3,1}}{\gamma_{1,3} + \gamma_{2,2} + \gamma_{3,1}}.$$
Recall that we assumed $g^*(x, y) > v$. By Lemma 17, there exists k such that $\gamma_{k,4-k} > 0$ and $\frac{\mu_{k,4-k}}{\gamma_{k,4-k}} \ge g^*(x, y) > v$.

Assume first that k = 1. Consider now the following strategy $y'_{\lambda_n}$: $y'_{\lambda_n}(j) = 0$ for j ∈ $J_3$, and $y'_{\lambda_n}(j) = y_{\lambda_n}(j)$ for all other j, except for an arbitrary $j_0 \in J_1$ for which $y'_{\lambda_n}(j_0) = y_{\lambda_n}(j_0) + \sum_{j\in J_3} y_{\lambda_n}(j)$. The only effect of this deviation is that now $\gamma'_{1,3} = \mu'_{1,3} = 0$. Hence
$$\lim_{n\to\infty} r_{\lambda_n}(x_{\lambda_n}, y'_{\lambda_n}) = \frac{g(x, y) + \mu_{2,2} + \mu_{3,1}}{1 + \gamma_{2,2} + \gamma_{3,1}}$$
which is strictly less than v by Lemma 17 since $v < \frac{\mu_{1,3}}{\gamma_{1,3}}$; this contradicts the ε(λ_n)-optimality of $x_{\lambda_n}$.

Assume next that k = 2. Consider now the following strategy $y'_{\lambda_n}$: $y'_{\lambda_n}(j) = 0$ for j ∈ $J_2$, and $y'_{\lambda_n}(j) = y_{\lambda_n}(j)$ for all other j, except for an arbitrary $j_0 \in J_1$ for which $y'_{\lambda_n}(j_0) = y_{\lambda_n}(j_0) + \sum_{j\in J_2} y_{\lambda_n}(j)$. The only effect of this deviation is that now $\gamma'_{2,2} = \mu'_{2,2} = 0$. Hence
$$\lim_{n\to\infty} r_{\lambda_n}(x_{\lambda_n}, y'_{\lambda_n}) = \frac{g(x, y) + \mu_{1,3} + \mu_{3,1}}{1 + \gamma_{1,3} + \gamma_{3,1}}$$
which is strictly less than v by Lemma 17 since $v < \frac{\mu_{2,2}}{\gamma_{2,2}}$; this again contradicts the ε(λ_n)-optimality of $x_{\lambda_n}$.

Finally assume that k = 3. Consider now the following strategy $x'_{\lambda_n}$: $x'_{\lambda_n}(i) = 2\,x_{\lambda_n}(i)$ for i ∈ $I_3$, and $x'_{\lambda_n}(i) = x_{\lambda_n}(i)$ for all other i, except for an arbitrary $i_0 \in I_1$ for which $x'_{\lambda_n}(i_0) = x_{\lambda_n}(i_0) - \sum_{i\in I_3} x_{\lambda_n}(i)$ (which is nonnegative for n large enough). The only effect of this deviation is that now $\gamma'_{3,1} = 2\gamma_{3,1}$ and $\mu'_{3,1} = 2\mu_{3,1}$. Hence
$$\lim_{n\to\infty} r_{\lambda_n}(x'_{\lambda_n}, y_{\lambda_n}) = \frac{g(x, y) + \mu_{1,3} + \mu_{2,2} + 2\mu_{3,1}}{1 + \gamma_{1,3} + \gamma_{2,2} + 2\gamma_{3,1}}$$
which is strictly more than v by Lemma 17 since $v < \frac{\mu_{3,1}}{\gamma_{3,1}}$; this contradicts the ε(λ_n)-optimality of $y_{\lambda_n}$ and completes the proof. □

4. Stochastic finite games

4.1. Non-algebraic limit trajectories.

Consider the following zero-sum stochastic game with two non absorbing states and two actions for each player. In the first state s1 (which is the starting state) the payoff and transitions are as follows:

L R

U 1* 0+

D 0 1

where ∗ denotes absorption and + that there is a deterministic transition to state 2. Starting from the second state s2 the game is a linear variation of the Big Match:

L R

U 1* -1*

D -1 1

Since $v_\lambda(s_2) = 0$ for all λ and since there is no return once the play has entered state $s_2$, the optimal play in state $s_1$ is the same as in the Big Match, in which γ = 1. So in both states the optimal strategies in $\Gamma_\lambda$ are D + λU for Player 1 and 1/2 L + 1/2 R for Player 2. By a scaling of time, the preceding section tells us that in the limit game the probability of being in state $s_2$ at time t, given that there was a transition from $s_1$ to $s_2$ at time z, is $\frac{1-t}{1-z}$. Since (also from the preceding section) the time of transition from $s_1$ to another state (which is $s_2$ with probability 1/2) has a uniform law on [0, 1], the probability of being in $s_2$ at time t is
$$p(t) = \frac12 \int_0^t \frac{1 - t}{1 - z}\,dz = -\frac{(1 - t)\ln(1 - t)}{2}.$$
Notice that this is not an algebraic function of t, whereas this was always the case in the preceding section. Similarly the probability of absorption before time t is $1 - (1 - t)\big(1 - \frac{\ln(1-t)}{2}\big)$ and is also not algebraic.
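A quick numerical check (not from the paper) that the integral above equals $-\frac{(1-t)\ln(1-t)}{2}$ and that the resulting absorption probability matches the stated expression; the chosen values of t are arbitrary.

```python
import numpy as np

def p_s2(t, m=200000):
    """(1/2) * integral_0^t (1-t)/(1-z) dz, by a simple midpoint Riemann sum."""
    dz = t / m
    z = (np.arange(m) + 0.5) * dz
    return float(np.sum(0.5 * (1.0 - t) / (1.0 - z)) * dz)

for t in (0.25, 0.5, 0.9):
    closed_form = -(1.0 - t) * np.log(1.0 - t) / 2.0
    print(t, p_s2(t), closed_form)                       # the two columns agree

# Probability of absorption before time t: 1 - (prob. in s1) - (prob. in s2).
t = 0.5
print(1.0 - (1.0 - t) - p_s2(t), 1.0 - (1.0 - t) * (1.0 - np.log(1.0 - t) / 2.0))
```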

4.2. No ε−optimal strategies of the form x + aλx′.

In the following game with two non-absorbing states the payoff is always 1 in state a and −1 in state b, with the following deterministic transitions:

state a
L R
U a b
D b 1*

state b
L R
U b a
D a −1*


It is easy to see that the asymptotic value is 0 and that optimal strategies in the λ-discounted game put a weight of order √λ on D and on R in both states, hence the absorption probability is of the order of λ per stage in each state.

We show that strategies of the form $(x_a + C_a\lambda x'_a,\; x_b + C_b\lambda x'_b)$ cannot guarantee more than −1 to Player 1 as λ goes to 0.

- If $x_a(D)\,x_b(D) > 0$, Player 2 plays L in a and R in b, inducing an absorbing payoff of −1: from a the play eventually reaches b, where absorption at −1∗ occurs with positive probability.

- If $x_a(D) = 0$ and $x_b(D) > 0$, Player 2 plays R in both states, inducing an absorbing payoff of −1: the payoff will be absorbing with high probability in finite time and the relative probability of 1∗ vanishes with λ.

- If $x_a(D) > 0$ and $x_b(D) = 0$, Player 2 plays L in both states, inducing a non-absorbing payoff of −1: for λ small, most of the time the state is b.

- If $x_a(D) = 0$ and $x_b(D) = 0$, Player 2 plays R in state a and L in state b. The event "absorbing payoff of 1" occurs at stage n if $\omega_n = a$ and $i_n = D$; hence $\omega_{n-1} = b$ and $i_{n-1} = D$. Now the event "$i_n = D$ and $i_{n-1} = D$" has probability of order $\lambda^2$, so the absorbing component of the λ-discounted payoff converges to 0 with λ. Moreover the non-absorbing payoff is mainly −1.

5. An absorbing game with compact action sets and nonlinear LOTP

We consider the following absorbing game with compact action sets. There are three states: two absorbing states 0∗ and −1∗, and the non-absorbing state ω, in which the payoff is 1 whatever the actions taken. The action sets are X = Y = {0} ∪ {1/n, n ∈ ℕ∗} with the usual distance. The probabilities of absorption are given by:
$$\rho(0^*\,|\,x, y) = \begin{cases} 0 & \text{if } x = y\\ \sqrt{y} & \text{if } x \ne y\end{cases} \qquad\text{and}\qquad \rho(-1^*\,|\,x, y) = \begin{cases} y & \text{if } x = y\\ 0 & \text{if } x \ne y.\end{cases}$$

It is easily checked that both functions ρ(0∗| ·, ·) and ρ(−1∗| ·, ·) are (jointly) continuous.

Proposition 18. For any discount factor λ ∈]0, 1], 0 (resp. 1) is optimal for Player 1 (resp. Player 2) in the λ-discounted game, and vλ= λ.

The corresponding payoff trajectory is: l(t) = 0 on [0, 1].

Proof. Action 0 of Player 1 ensures that there will never be absorption to state −1∗, and thus that the payoff is nonnegative at every stage; hence the λ-discounted payoff is at least λ. On the other hand, action 1 of Player 2 ensures that there is absorption (to 0∗ or to −1∗) with probability 1 at the end of stage 1, and thus that the stage payoff from stage 2 on is nonpositive. Since the payoff in stage 1 is 1 irrespective of the players' actions, the proposition is established.

Notice that the play under this pair of optimal strategies is simple: there is immediate absorption to 0∗, and in particular the limit payoff trajectory is linear and equal to 0 for every time t. □

We now prove that there are other ε-optimal strategies, with a different limit payoff trajectory. Denote $\{\lambda\} := \frac{1}{[1/\lambda]}$ where [·] is the integer part; hence $\lambda \le \{\lambda\} < \frac{\lambda}{1-\lambda}$ and $1/\{\lambda\} \in \mathbb{N}^*$ for all λ ∈ ]0, 1].

Proposition 19. For any discount factor λ ∈]0, 1], {λ} is λ-optimal for Player 1 and √λ-optimal for Player 2 in the λ-discounted game.

The corresponding payoff trajectory is: l(t) = t − t2.

Proof. If both players play {λ}, the payoff in the λ-discounted game is, according to formula (4),
$$r_\lambda(\{\lambda\}, \{\lambda\}) = \frac{\lambda - (1 - \lambda)\{\lambda\}}{\lambda + (1 - \lambda)\{\lambda\}}$$
which is nonnegative since $\{\lambda\} < \frac{\lambda}{1-\lambda}$. On the other hand, since λ ≤ {λ}, one gets $r_\lambda(\{\lambda\}, \{\lambda\}) \le \lambda$.

If Player 1 plays {λ} while Player 2 plays y ≠ {λ}, there is no absorption to −1∗, hence $r_\lambda(\{\lambda\}, y) \ge 0$. Thus {λ} is λ-optimal for Player 1.

If Player 2 plays {λ} while Player 1 plays x ≠ {λ}, then, according once again to formula (4),
$$r_\lambda(x, \{\lambda\}) = \frac{\lambda}{\lambda + (1 - \lambda)\sqrt{\{\lambda\}}} \le \frac{\lambda}{\lambda + (1 - \lambda)\sqrt{\lambda}} \le \sqrt{\lambda}.$$
Thus {λ} is √λ-optimal for Player 2.

Notice that while the limit value is 0 and ({λ}, {λ}) is a pair of near-optimal strategies, along the induced play the non-absorbing payoff is 1 and the absorbing payoff is −1. One can compute that the associated γ is 1, hence under these strategies $Q(t) = t - \frac{t^2}{2}$, so that the accumulated limit payoff up to time t is $t - t^2$, which is nonlinear and positive for every t ∈ ]0, 1[. □
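A small numerical sketch (not from the paper) of the play under ({λ}, {λ}): it computes {λ} = 1/[1/λ], checks 0 ≤ r_λ({λ}, {λ}) ≤ λ via formula (4), and compares the cumulated payoff trajectory of Definition 1 with t − t²; the value of λ is an arbitrary illustrative choice.

```python
import math
import numpy as np

def frac(lam):
    """{lambda} := 1 / floor(1/lambda)."""
    return 1.0 / math.floor(1.0 / lam)

lam = 3e-3
q = frac(lam)                                   # both players play {lambda}: x = y
# x = y, so absorption goes to -1* with probability {lambda}; payoffs: 1 before, -1 after.
r = (lam * 1.0 + (1 - lam) * (-q)) / (lam + (1 - lam) * q)     # formula (4)
print(0.0 <= r <= lam)                                         # lambda-optimality bound

# Cumulated payoff trajectory l_lambda(t_n) under ({lambda}, {lambda}) (Definition 1).
N = 20000
n = np.arange(1, N + 1)
t = 1.0 - (1.0 - lam) ** n
alive = (1.0 - q) ** (n - 1)                     # prob. of no absorption before stage n
stage_payoff = alive * 1.0 + (1.0 - alive) * (-1.0)
l = np.cumsum(lam * (1.0 - lam) ** (n - 1) * stage_payoff)
print(np.max(np.abs(l - (t - t ** 2))))          # small for small lambda
```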

Basically the players use a jointly controlled procedure either to follow ({λ}, {λ}) or to get at most (resp. at least) 0.

6. Concluding comments

A first series of interesting open questions is directly related to the results presented here, such as: the extension of Proposition 12 to general (not stationary) strategies, or more generally the analysis in the framework of arbitrary (not discounted) evaluations and general stochastic games.

It is also natural to consider other families of repeated games: a first class of interest is games with incomplete information. The natural equivalent of LOTM in this framework is the speed at which information is transmitted during the game.

References

[1] Aumann R.J. and M. Maschler: Repeated Games with Incomplete Information, M.I.T. Press, (1995).

[2] Cardaliaguet P., R. Laraki and S. Sorin: A continuous time approach for the asymptotic value in two-person zero-sum repeated games. SIAM J. Control and Optimization, 50, 1573-1596, (2012).

[3] Kohlberg E.: Repeated games with absorbing states. Annals of Statistics, 2, 724-738, (1974).

[4] Laraki R.: Explicit formulas for repeated games with absorbing states. International Journal of Game Theory, 39, 53-69, (2010).

[5] Mertens J.-F. and A. Neyman: Stochastic games. International Journal of Game Theory, 10, 53-66, (1981).

[6] Mertens J.-F., A. Neyman and D. Rosenberg: Absorbing games with compact action spaces. Mathematics of Operations Research, 34, 257-262, (2009).

[7] Mertens J.-F. and S. Zamir: The value of two-person zero-sum repeated games with lack of information on both sides. International Journal of Game Theory, 1, 39-64, (1971).

[8] Oliu-Barton M. and B. Ziliotto: Constant payoff in zero-sum stochastic games, arXiv:1811.04518v1, (2018).

[9] Rosenberg D. and S. Sorin: An operator approach to zero-sum repeated games. Israel Journal of Mathematics, 121, 221-246, (2001).

[10] Shapley L. S.: Stochastic games. Proceedings of the National Academy of Sciences of the U.S.A., 39, 1095-1100, (1953).

[11] Sorin S.: A First Course on Zero-Sum Repeated Games. Springer, (2002).

[12] Sorin S. and G. Vigeral: Existence of the limit value of two person zero-sum discounted repeated games via comparison theorems, Journal of Optimization Theory and Applications, 157, 564-576, (2013).

[13] Sorin S. and G. Vigeral: Reversibility and oscillations in zero-sum discounted stochastic games, Journal of Dynamics and Games, 2, 103-115, (2015).

[14] Sorin S., X. Venel and G. Vigeral: Asymptotic properties of optimal trajectories in dynamic programming, Sankhya, 72-A, 237-245, (2010).

[15] Vigeral G.: A zero-sum stochastic game with compact action sets and no asymptotic value, Dynamic Games and Applications, 3, 172-186, (2013).

[16] Ziliotto B.: Zero-sum repeated games: counterexamples to the existence of the asymptotic value and the conjecture maxmin = lim vn, Annals of Probability, 44, 1107-1133, (2016).


Sylvain Sorin, Sorbonne Université, UPMC Paris 06, Institut de Mathématiques de Jussieu-Paris Rive Gauche, UMR 7586, CNRS, F-75005, Paris, France

E-mail address: sylvain.sorin@imj-prg.fr https://webusers.imj-prg.fr/sylvain.sorin

Guillaume Vigeral (corresponding author), Université Paris-Dauphine, PSL Research University, CNRS, CEREMADE, Place du Maréchal De Lattre de Tassigny, 75775 Paris cedex 16, France

E-mail address: vigeral@ceremade.dauphine.fr

http://www.ceremade.dauphine.fr/ vigeral/indexenglish.html
