
HAL Id: hal-01091192

https://hal.archives-ouvertes.fr/hal-01091192

Submitted on 4 Dec 2014


Note: On A Proof Of Positionality Of Mean-Payoff Stochastic Games

Hugo Gimbert, Edon Kelmendi

To cite this version:

Hugo Gimbert, Edon Kelmendi. Note: On A Proof Of Positionality Of Mean-Payoff Stochastic Games.

[Research Report] Université de Bordeaux, LaBRI. 2014. ⟨hal-01091192⟩


Note: On A Proof Of Positionality Of Mean-Payoff Stochastic Games

Hugo Gimbert, CNRS, LaBRI, Université de Bordeaux, hugo.gimbert@labri.fr

Edon Kelmendi, LaBRI, Université de Bordeaux, edon.kelmendi@labri.fr

December 4, 2014

1 Introduction

Two-player stochastic games played on finite arenas are considered, where a reward is attached to each state. The players' objectives are, respectively, to maximize and to minimize the long-run average reward. The purpose of this note is to point out an error in the proof given by Liggett and Lippman in [6] that both players have optimal strategies that are positional, and to provide one possible resolution.

The arena consists of a finite set of states $S_i$ controlled by player $i \in \{1,2\}$, for every $s \in S_1 \cup S_2 = S$ a set of actions denoted $A(s)$, and a transition probability $p : S \times A(S) \to \Delta(S)$, where $A(S) = \bigcup_{s \in S} A(s)$ and $\Delta(S)$ is the set of probability distributions on $S$, i.e. functions $d : S \to \mathbb{R}_+$ with the property that $\sum_{s \in S} d(s) = 1$. The game starts at some state $s_0 \in S_i$, whence player $i$ chooses an action $a \in A(s_0)$, after which the chance of being in state $s_1$ is $p(s_0,a)(s_1)$, and so on for an infinite duration. Denote by $S^*$ (resp. $S^\omega$) the set of finite (resp. infinite) sequences of elements of $S$. Strategies for player $i$ are functions $\sigma_i : S^* S_i \to A(S)$; the positional strategies are of the type $\sigma_i : S_i \to A(S)$. Fixing an initial state $s_0 \in S$ and two strategies $\sigma_1, \sigma_2$, one for each player, gives rise to a unique probability measure on the sigma-algebra generated by the cylinders $\{\pi S^\omega \mid \pi \in S^*\}$, denoted $\mathbb{P}^{\sigma_1,\sigma_2}_{s_0}$, with the property
$$\mathbb{P}^{\sigma_1,\sigma_2}_{s_0}(s_1 s_2 \cdots s_n S^\omega) \;=\; \prod_{i=0}^{n-1} p\bigl(s_i, \sigma_{k(s_i)}(s_0 \cdots s_i)\bigr)(s_{i+1}),$$
where $k(s_i) = j$ if $s_i \in S_j$. A payoff function is a function $f : S^\omega \to \mathbb{R}$. We will deal mainly with the following payoff functions:
$$D_\beta(s_0 s_1 \cdots) = \sum_{i=0}^{\infty} \beta^i\, r(s_i),$$
$$\underline{M}(s_0 s_1 \cdots) = \liminf_{n\to\infty} \frac{1}{n+1} \sum_{i=0}^{n} r(s_i), \qquad \overline{M}(s_0 s_1 \cdots) = \limsup_{n\to\infty} \frac{1}{n+1} \sum_{i=0}^{n} r(s_i),$$
where $\beta \in [0,1)$ and $r : S \to \mathbb{R}_+$ describes the rewards attached to the states.

Let $T_i$ for $i \in \mathbb{N}$ be the random variable defined by $T_i(s_0 \cdots s_i \cdots) = s_i$. We abbreviate the random variable $D_\beta(T_0 T_1 \cdots)$ as $D_\beta$, and similarly for the other two payoff functions.
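As a concrete illustration of why one distinguishes $\underline{M}$ from $\overline{M}$, the following Python sketch (an arbitrary example of ours, not taken from the note) approximates $D_\beta$ and the prefix averages for a play whose rewards alternate between blocks of $0$s and $1$s of doubling length, so that the averages oscillate and $\liminf \ne \limsup$:

```python
from itertools import islice

def rewards():
    """Reward sequence r(s_0) r(s_1) ...: blocks of 0s and 1s of
    doubling length (1 zero, 2 ones, 4 zeros, 8 ones, ...)."""
    bit, length = 0, 1
    while True:
        for _ in range(length):
            yield bit
        bit, length = 1 - bit, 2 * length

def discounted(beta, n):
    """Truncated discounted payoff sum_{i<n} beta^i r(s_i),
    approximating D_beta for beta in [0, 1)."""
    return sum(beta**i * r for i, r in enumerate(islice(rewards(), n)))

def average(n):
    """Prefix average (1/(n+1)) sum_{i<=n} r(s_i), whose liminf and
    limsup define the mean payoffs."""
    return sum(islice(rewards(), n + 1)) / (n + 1)
```

Here `average(14)` equals $2/3$ while `average(6)` equals $2/7$: the prefix averages keep oscillating, so this play has no limit mean payoff, only a $\liminf$ and a $\limsup$.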

Having fixed the notation, we proceed by presenting the proof of the following theorem, to which [6] is devoted.

Theorem 1 ([6]). There exists a pair of positional strategies $\sigma_1^\sharp, \sigma_2^\sharp$ such that for all strategies $\sigma_1, \sigma_2$ and $s \in S$,
$$\mathbb{E}^{\sigma_1,\sigma_2^\sharp}_{s}[\overline{M}] \;\le\; \mathbb{E}^{\sigma_1^\sharp,\sigma_2^\sharp}_{s}[M] \;\le\; \mathbb{E}^{\sigma_1^\sharp,\sigma_2}_{s}[\underline{M}],$$
where $M(s_0 s_1 \cdots) = \lim_{n\to\infty} \frac{1}{n+1} \sum_{i=0}^{n} r(s_i)$.

We introduce two results on which the proof is based.

Theorem 2 ([1]). For one-player games (when $S_i = \emptyset$ for one $i \in \{1,2\}$), there exist a positional strategy $\sigma^\sharp$ and $\beta^\sharp \in [0,1)$ such that for all $\beta \ge \beta^\sharp$, all strategies $\sigma$, and all $s_0 \in S$,
$$\mathbb{E}^{\sigma^\sharp}_{s_0}[D_\beta] \;\ge\; \mathbb{E}^{\sigma}_{s_0}[D_\beta]$$
if $S_2 = \emptyset$, and the opposite inequality if $S_1 = \emptyset$.

Theorem 3 ([7]). For all $\beta \in [0,1)$ there exist positional strategies $\sigma_1^\sharp, \sigma_2^\sharp$ such that for all strategies $\sigma_1, \sigma_2$ and $s_0 \in S$,
$$\mathbb{E}^{\sigma_1,\sigma_2^\sharp}_{s_0}[D_\beta] \;\le\; \mathbb{E}^{\sigma_1^\sharp,\sigma_2^\sharp}_{s_0}[D_\beta] \;\le\; \mathbb{E}^{\sigma_1^\sharp,\sigma_2}_{s_0}[D_\beta].$$
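Shapley's theorem is constructive via value iteration, and in the perfect-information setting of this note the one-step game at each state reduces to a max (player 1) or a min (player 2) over actions. The following Python sketch runs this iteration on a toy game of our own making (all state names, actions and numbers are illustrative assumptions, not from the note):

```python
# Value iteration for a discounted perfect-information stochastic game:
# states owned by player 1 maximize, states owned by player 2 minimize.
BETA = 0.8
REWARD = {"s0": 0.0, "s1": 1.0}
OWNER = {"s0": 1, "s1": 2}
# TRANS[s][a] is the distribution p(s, a) over successor states.
TRANS = {
    "s0": {"a": {"s0": 1.0}, "b": {"s1": 1.0}},
    "s1": {"c": {"s0": 0.5, "s1": 0.5}},
}

def value_iteration(eps=1e-10):
    """Iterate v(s) <- r(s) + beta * opt_a sum_t p(s,a)(t) v(t)
    until successive iterates differ by less than eps."""
    v = {s: 0.0 for s in TRANS}
    while True:
        nv = {}
        for s, actions in TRANS.items():
            vals = [sum(p * v[t] for t, p in dist.items())
                    for dist in actions.values()]
            best = max(vals) if OWNER[s] == 1 else min(vals)
            nv[s] = REWARD[s] + BETA * best
        if max(abs(nv[s] - v[s]) for s in v) < eps:
            return nv
        v = nv
```

Since the discounting makes the update a contraction with factor `BETA`, the iteration converges geometrically to the discounted values from Theorem 3.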


Fixing a positional strategy for one player results in a Markov decision process. Let $\sigma_2$ be a positional strategy for player 2; then according to Theorem 2 there exist a positional strategy $\sigma_1^\sharp$ and $\beta_{\sigma_2} \in [0,1)$ such that for all strategies $\sigma_1$ of player 1, all $\beta \ge \beta_{\sigma_2}$ and all $s \in S$, $\mathbb{E}^{\sigma_1^\sharp,\sigma_2}_{s}[D_\beta] \ge \mathbb{E}^{\sigma_1,\sigma_2}_{s}[D_\beta]$. By fixing a positional strategy of player 1 we get a symmetric statement. Let $\beta^\sharp = \max\{\beta_{\sigma} \mid \sigma \text{ a positional strategy}\}$, i.e. the largest $\beta$ resulting from Theorem 2 by fixing a positional strategy of either player; the maximum is well defined because there are only finitely many positional strategies.

Let $\sigma_1^\sharp, \sigma_2^\sharp$ be the pair of positional strategies given by Theorem 3 for $\beta = \beta^\sharp$. The erroneous claim in [6] is that Theorems 2 and 3 imply the following: for all $\beta \ge \beta^\sharp$ and $s_0 \in S$,
$$\mathbb{E}^{\sigma_1,\sigma_2^\sharp}_{s_0}[D_\beta] \;\le\; \mathbb{E}^{\sigma_1^\sharp,\sigma_2^\sharp}_{s_0}[D_\beta] \;\le\; \mathbb{E}^{\sigma_1^\sharp,\sigma_2}_{s_0}[D_\beta]. \qquad (1)$$
The example in the next section shows that this implication fails.

2 An example

Consider the following one-player game with two states.

[Figure 1: a one-player game with states $s_0$, $s_1$ and actions $a$, $b$.]

Assume that $r(s_0) = 0$ and $r(s_1) = 1$, and that $S_1 = \{s_0, s_1\}$. For all $\beta \ge 0$ the optimal strategy is to play $b$ in $s_0$, with a gain of $\sum_k \beta^k$. But note that for $\beta = 0$ the strategy that plays $a$ in $s_0$ is also optimal. Therefore $\beta^\sharp$ can be equal to $0$, and $\sigma_1^\sharp$ in Theorem 3 can be the strategy that plays $a$ in $s_0$. But for this pair (1) does not hold, since for $\beta > 0$ the strategy that plays $b$ in $s_0$ is strictly better.
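The two discounted values can be checked in closed form. The transition structure used below is our reading of Figure 1 and thus an assumption: action $a$ loops on $s_0$ (reward $0$), while action $b$ leads to $s_1$, which is absorbing (reward $1$).

```python
def disc_value(beta, play_b):
    """Discounted payoff D_beta from s0, assuming action a loops on s0
    (reward 0) and action b moves to the absorbing state s1 (reward 1)."""
    if not play_b:
        return 0.0               # rewards 0, 0, 0, ...
    return beta / (1.0 - beta)   # rewards 0, 1, 1, ...: sum_{k>=1} beta^k

# At beta = 0 both strategies are optimal ...
assert disc_value(0.0, play_b=True) == disc_value(0.0, play_b=False) == 0.0
# ... but for any beta > 0 playing b is strictly better, so (1) fails
# when sigma_1# is the strategy playing a.
assert all(disc_value(b, True) > disc_value(b, False) for b in (0.01, 0.5, 0.99))
```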

We provide a different proof of the existence of positional strategies which are optimal for $\beta \ge \beta^\sharp$. This proof follows closely the one given in [3] for the discrete case, which in turn follows the one given in [4] for Markov decision processes.


3 A different proof

Lemma 1 ([3],[4]). There exist a pair of positional strategies $\sigma_1^\sharp, \sigma_2^\sharp$ and $\beta^\sharp \in [0,1)$ such that for all $\beta \ge \beta^\sharp$ and $s_0 \in S$,
$$\mathbb{E}^{\sigma_1,\sigma_2^\sharp}_{s_0}[D_\beta] \;\le\; \mathbb{E}^{\sigma_1^\sharp,\sigma_2^\sharp}_{s_0}[D_\beta] \;\le\; \mathbb{E}^{\sigma_1^\sharp,\sigma_2}_{s_0}[D_\beta]. \qquad (2)$$

Proof. Let $\sigma_1, \sigma_2$ be two positional strategies and $s_0 \in S$. We first argue that $f(\beta) = \mathbb{E}^{\sigma_1,\sigma_2}_{s_0}[D_\beta]$ is a rational function (a ratio of two polynomials in $\beta$). Indeed, since the fixed strategies are positional, the dynamics of the game are described by the Markov chain given in the form of the $S \times S$ stochastic matrix
$$P_{\sigma_1,\sigma_2}(s, s') = \begin{cases} p(s, \sigma_1(s))(s') & \text{if } s \in S_1, \\ p(s, \sigma_2(s))(s') & \text{if } s \in S_2. \end{cases}$$
For $s \in S$ let $\delta_s$ be the Dirac distribution on $s$ given in the form of a row vector (i.e. a $1 \times S$ matrix with all components $0$ except in position $s$), and let $r$ be the column vector of rewards. Then
$$\mathbb{E}^{\sigma_1,\sigma_2}_{s_0}[D_\beta] \;=\; \delta_{s_0} \sum_{k=0}^{\infty} (\beta P_{\sigma_1,\sigma_2})^k\, r,$$
where $(\beta P_{\sigma_1,\sigma_2})^0 = I$, the identity matrix. It is a theorem (cf. [5]) that if $\lim_{k\to\infty} (\beta P_{\sigma_1,\sigma_2})^k = 0$, then the matrix $I - \beta P_{\sigma_1,\sigma_2}$ is invertible and $\sum_{k=0}^{\infty} (\beta P_{\sigma_1,\sigma_2})^k = (I - \beta P_{\sigma_1,\sigma_2})^{-1}$; here the limit is indeed $0$ because $\beta < 1$ and $P_{\sigma_1,\sigma_2}$ is stochastic. Since, by Cramer's rule, each entry of the inverse is a quotient of polynomials in $\beta$, we conclude that $f(\beta) = \mathbb{E}^{\sigma_1,\sigma_2}_{s_0}[D_\beta]$ is a rational function, with both polynomials having a finite degree.
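The matrix identity $\mathbb{E}[D_\beta] = \delta_{s_0}(I - \beta P)^{-1} r$ is easy to sanity-check numerically. A minimal sketch using NumPy, with an arbitrary two-state chain of our choosing standing in for $P_{\sigma_1,\sigma_2}$:

```python
import numpy as np

# Arbitrary illustrative stochastic matrix (a stand-in for
# P_{sigma1,sigma2}) and reward vector; beta is the discount factor.
P = np.array([[0.3, 0.7],
              [0.4, 0.6]])
r = np.array([0.0, 1.0])
beta = 0.9

# Closed form: the vector of discounted values (I - beta P)^{-1} r.
values = np.linalg.solve(np.eye(2) - beta * P, r)

# Truncated series sum_{k<N} (beta P)^k r for comparison.
series, term = np.zeros(2), r.copy()
for _ in range(500):
    series += term
    term = beta * (P @ term)

assert np.allclose(values, series)  # the series matches the closed form
```

Entry $s$ of `values` is $\mathbb{E}^{\sigma_1,\sigma_2}_{s}[D_\beta]$; solving the entries of $(I - \beta P)^{-1}$ by Cramer's rule is exactly what makes $f(\beta)$ a rational function of $\beta$.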

Let $(\beta_n)$ be a sequence with elements in $[0,1)$ such that $\lim_{n\to\infty} \beta_n = 1$. By Theorem 3, for every $\beta_n$ there is a pair of positional strategies that is optimal, and since the set of positional strategies is finite we may assume that there exist positional strategies $\sigma_1^\sharp, \sigma_2^\sharp$ such that (2) holds for a subsequence $(\beta'_n) \subseteq (\beta_n)$ and the pair $\sigma_1^\sharp, \sigma_2^\sharp$. We argue that there exists $\beta^\sharp \in [0,1)$ such that (2) holds for all $\beta \ge \beta^\sharp$. Assume on the contrary that for every $\beta' \in [0,1)$ there exist $\beta \ge \beta'$, a pair of positional strategies $\tau_1, \tau_2$, and a state $s_0 \in S$ such that either $\mathbb{E}^{\tau_1,\sigma_2^\sharp}_{s_0}[D_\beta] > \mathbb{E}^{\sigma_1^\sharp,\sigma_2^\sharp}_{s_0}[D_\beta]$ or $\mathbb{E}^{\sigma_1^\sharp,\tau_2}_{s_0}[D_\beta] < \mathbb{E}^{\sigma_1^\sharp,\sigma_2^\sharp}_{s_0}[D_\beta]$. Consequently, since there are only finitely many positional strategies and states, there exists a sequence $(\mu_n)$ with $\mu_n \in [0,1)$ and $\lim_{n\to\infty} \mu_n = 1$ such that either $\mathbb{E}^{\tau_1,\sigma_2^\sharp}_{s_0}[D_{\mu_n}] > \mathbb{E}^{\sigma_1^\sharp,\sigma_2^\sharp}_{s_0}[D_{\mu_n}]$ for all $n \in \mathbb{N}$, or $\mathbb{E}^{\sigma_1^\sharp,\tau_2}_{s_0}[D_{\mu_n}] < \mathbb{E}^{\sigma_1^\sharp,\sigma_2^\sharp}_{s_0}[D_{\mu_n}]$ for all $n \in \mathbb{N}$. Assume the latter. But then $g(\beta) = \mathbb{E}^{\sigma_1^\sharp,\sigma_2^\sharp}_{s_0}[D_\beta] - \mathbb{E}^{\sigma_1^\sharp,\tau_2}_{s_0}[D_\beta]$ would have infinitely many zeros: it is strictly positive at each $\mu_n$, while by (2) it is non-positive at each $\beta'_n$, so it changes sign infinitely often on $[0,1)$. This is not possible, since we showed that $g$ is a rational function and it is not identically zero. Similarly if we assume the former. □


We give the rest of the proof of Theorem 1 found in [6] for the sake of completeness. For all $n \in \mathbb{N}$ define $H_n(s_0 s_1 \cdots) = \sum_{k=0}^{n} r(s_k)$, and abbreviate by $H_n$ the random variable $H_n(T_0 T_1 \cdots)$. Let $\sigma_1^\sharp, \sigma_2^\sharp$ be a pair of positional strategies resulting from Lemma 1. Now from Theorem 4.2 in [2] we have, for all $s \in S$, that
$$\mathbb{E}^{\sigma_1^\sharp,\sigma_2^\sharp}_{s}[H_n] \;-\; \sup_{\sigma} \mathbb{E}^{\sigma,\sigma_2^\sharp}_{s}[H_n]$$
is bounded uniformly in $n$. Therefore, for all strategies $\sigma$,
$$\mathbb{E}^{\sigma_1^\sharp,\sigma_2^\sharp}_{s}[M] \;=\; \lim_{n\to\infty} \frac{1}{n} \sup_{\sigma'} \mathbb{E}^{\sigma',\sigma_2^\sharp}_{s}[H_n] \;\ge\; \limsup_{n\to\infty} \frac{1}{n}\, \mathbb{E}^{\sigma,\sigma_2^\sharp}_{s}[H_n] \;=\; \mathbb{E}^{\sigma,\sigma_2^\sharp}_{s}[\overline{M}],$$
and symmetrically for the inequality $\mathbb{E}^{\sigma_1^\sharp,\sigma_2^\sharp}_{s}[M] \le \mathbb{E}^{\sigma_1^\sharp,\sigma_2}_{s}[\underline{M}]$.

References

[1] David Blackwell. Discrete dynamic programming. The Annals of Mathematical Statistics, 33(2):719–726, 1962.

[2] Barry W. Brown. On the iterative method of dynamic programming on a finite space discrete time Markov process. The Annals of Mathematical Statistics, pages 1279–1285, 1965.

[3] Hugo Gimbert and Wiesław Zielonka. Applying Blackwell optimality: priority mean-payoff games as limits of multi-discounted games.

[4] Arie Hordijk and Alexander A. Yushkevich. Blackwell optimality. In Handbook of Markov Decision Processes, pages 231–267. Springer, 2002.

[5] John G. Kemeny and James Laurie Snell. Finite Markov Chains, volume 356. Van Nostrand, Princeton, NJ, 1960.

[6] Thomas M. Liggett and Steven A. Lippman. Stochastic games with perfect information and time average payoff. SIAM Review, 11(4):604–607, 1969.

[7] Lloyd S. Shapley. Stochastic games. Proceedings of the National Academy of Sciences of the United States of America, 39(10):1095, 1953.
