Policy iteration for stochastic zero-sum games

(1)

HAL Id: hal-01024097

https://hal.inria.fr/hal-01024097

Submitted on 15 Jul 2014



Marianne Akian

To cite this version:

Marianne Akian. Policy iteration for stochastic zero-sum games. NETCO 2014, 2014, Tours, France.

⟨hal-01024097⟩

(2)


INRIA Saclay - Île-de-France and CMAP, École Polytechnique

NETCO 2014, 23-27 June 2014, Tours

Joint work with Stéphane Gaubert, see arXiv:1310.4953

(3)

Hamilton-Jacobi-Bellman-Isaacs equations

The stationary equation:

$$-H(x, Dv(x)) = 0, \quad x \in X \subset \mathbb{R}^d, \quad \text{plus a boundary condition},$$

with

$$H(x,p) = \min_{a \in A(x)} \max_{b \in B(x)} \big[ f(x,a,b) \cdot p + g(x,a,b) \big], \quad x \in X,\ p \in \mathbb{R}^d,$$

is the dynamic programming equation satisfied by the (upper) value function of the zero-sum game problem:

$$v(x) = \inf_{(\alpha_t)_{t \ge 0}} \sup_{(\beta_t)_{t \ge 0}} \int_0^\infty g(x_t, \alpha_t, \beta_t)\, dt,$$

where $\dot{x}_t = f(x_t, \alpha_t, \beta_t)$ for all $t \ge 0$, and inf and sup are taken over nonanticipating strategies of the first and second player (where the second player knows the current action of the first player).

Example: pursuit evasion games.

(4)

⇒ $v = F(v)$, the fixed point equation of the dynamic programming (or Shapley) operator $F$ of a discrete-time zero-sum two-player stochastic game problem with finite state space.

Same for: discounted problems, optimal stopping time problems, stochastic games.

(5)

Discrete time and state zero-sum stochastic games

Let $F : \mathbb{R}^n \to \mathbb{R}^n$ be defined by:

$$[F(v)]_i := \min_{a \in A_i} \max_{b \in B_i} \Big( \sum_{j \in [n]} M_{ij}^{ab} v_j + r_i^{ab} \Big), \quad i \in [n],$$

with $M_{ij}^{ab} \ge 0$ for all $i, j \in [n]$, $a \in A_i$, $b \in B_i$.

The map $F$ is the dynamic programming or Shapley operator of a discrete-time zero-sum two-player game problem with perfect information on the finite state space $X := [n] := \{1, \dots, n\}$, with:

• $A_i$, $B_i$: sets of actions of the 1st and 2nd player (MIN and MAX) when in state $i$;

• $r_i^{ab}$: reward paid by MIN to MAX at each time;

• $M_{ij}^{ab} := \gamma_i^{ab} P_{ij}^{ab} \ge 0$;

• $\gamma_i^{ab} := \sum_{j \in [n]} M_{ij}^{ab} \ge 0$: discount factor ($< 1$, or $\le 1$, or $= 1$);

• $P_{ij}^{ab}$: transition probability from $i$ to $j$ ($\sum_{j \in [n]} P_{ij}^{ab} = 1$).
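The definition of $F$ can be made concrete with a minimal Python sketch. The nested-dictionary encoding `game[i][a][b] = (row, reward)`, the function name `shapley`, and the numbers of the toy game are illustrative assumptions, not part of the talk.

```python
# Hypothetical encoding (an assumption for illustration):
# game[i][a][b] = (row, reward), with row[j] = M_ij^{ab} >= 0 and reward = r_i^{ab}.

def shapley(game, v):
    """Return F(v), where [F(v)]_i = min_a max_b (sum_j M_ij^{ab} v_j + r_i^{ab})."""
    result = []
    for state in game:
        result.append(min(
            max(sum(m * vj for m, vj in zip(row, v)) + r
                for (row, r) in responses.values())
            for responses in state.values()
        ))
    return result

# Toy 2-state game; every row sums to 0.5, so all discount factors equal 0.5.
game = [
    {"a1": {"b1": ([0.5, 0.0], 1.0), "b2": ([0.0, 0.5], 0.0)},
     "a2": {"b1": ([0.25, 0.25], 2.0)}},
    {"a1": {"b1": ([0.0, 0.5], 3.0)}},
]

print(shapley(game, [0.0, 0.0]))  # [1.0, 3.0]
print(shapley(game, [2.0, 4.0]))  # [2.0, 5.0]
```

In state 1 of the toy game both players have a single action, so the min and max there are trivial.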

(6)

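Let $F : \mathbb{R}^n \to \mathbb{R}^n$ be defined by:

$$[F(v)]_i := \min_{a \in A_i} \max_{b \in B_i} \Big( \sum_{j \in [n]} M_{ij}^{ab} v_j + r_i^{ab} \Big), \quad i \in [n],$$

with $M_{ij}^{ab} \ge 0$ for all $i, j \in [n]$, $a \in A_i$, $b \in B_i$. Denote $\gamma_i^{ab} := \sum_{j \in [n]} M_{ij}^{ab}$. Then:

• $F$ is order preserving: $u \le v \Rightarrow F(u) \le F(v)$, for all $u, v \in \mathbb{R}^n$;

• if $\gamma_i^{ab} \le 1$ for all $i \in [n]$, $a \in A_i$, $b \in B_i$, then $F$ is additively sub-homogeneous: $F(\lambda + u) \le \lambda + F(u)$, for all $\lambda \ge 0$ and $u \in \mathbb{R}^n$;

• thus $F$ is sup-norm nonexpansive;

• if $\gamma_i^{ab} = 1$ for all $i \in [n]$, $a \in A_i$, $b \in B_i$, then $F$ is additively homogeneous: $F(\lambda + u) = \lambda + F(u)$, for all $\lambda \in \mathbb{R}$ and $u \in \mathbb{R}^n$.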
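The step from the first two properties to nonexpansiveness is not spelled out on the slide. A standard argument, written here with the shorthand $\lambda := \|u - v\|_\infty$ (a notation introduced only for this derivation), runs as follows:

```latex
\begin{align*}
u \le v + \lambda
  &\;\Longrightarrow\; F(u) \le F(v + \lambda) \le \lambda + F(v)
  && \text{(order preserving, then sub-homogeneous)},\\
v \le u + \lambda
  &\;\Longrightarrow\; F(v) \le F(u + \lambda) \le \lambda + F(u)
  && \text{(same argument with $u$ and $v$ exchanged)},\\
  &\;\Longrightarrow\; \|F(u) - F(v)\|_\infty \le \lambda = \|u - v\|_\infty .
\end{align*}
```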

(7)

Let the value function of the game with infinite horizon be given by:

$$v_x = \inf_{(\alpha_k)_{k \ge 0}} \sup_{(\beta_k)_{k \ge 0}} E\Big[ \sum_{k=0}^{\infty} \Big( \prod_{\ell=0}^{k-1} \gamma_{X_\ell}^{\alpha_\ell, \beta_\ell} \Big) r_{X_k}^{\alpha_k, \beta_k} \,\Big|\, X_0 = x \Big],$$

where $\alpha_k$ and $\beta_k$ are possible strategies of both players of the game (at time $k$), and $X_k \in [n]$ is the state process of the game satisfying

$$P(X_{k+1} = j \mid X_k = i,\ \alpha_k = a,\ \beta_k = b) = P_{ij}^{ab}.$$

If $\gamma_x^{ab} \le \bar\gamma < 1$, then $F$ is a sup-norm contraction:

$$\|F(v) - F(w)\|_\infty \le \bar\gamma \|v - w\|_\infty,$$

and $v$ is the unique solution of

$$v = F(v).$$

Moreover, the optimal actions in $F(v)$ give the optimal stationary strategies of the game.

(8)

Problem: compute $v \in \mathbb{R}^n$ such that $F(v) = v$, when such a solution is unique, and bound the complexity of this computation.

When $\gamma_i^{ab} \le \bar\gamma < 1$ for all $i \in [n]$, $a \in A_i$, $b \in B_i$:

• The value iterations coincide with fixed point iterations, $v^{k+1} = F(v^k)$, and with the finite horizon approximations with $T = k$ and $\phi = v^0$. They converge geometrically towards $v$ with factor $\bar\gamma$:

$$\lim_{k \to \infty} \|v^k - v\|^{1/k} \le \bar\gamma.$$

• However, the value iteration algorithm is only pseudopolynomial.

• Also the existence of a polynomial algorithm is an open problem.

• What about the policy iteration?
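As a point of comparison for policy iteration, the value iteration $v^{k+1} = F(v^k)$ can be sketched as below. The stopping rule uses the standard a posteriori bound for a $\bar\gamma$-contraction; the function name `value_iteration`, the tolerance, and the toy operator `F` are assumptions made for illustration.

```python
def value_iteration(F, v0, gamma_bar, tol=1e-10, max_iter=10_000):
    """Iterate v^{k+1} = F(v^k) for a sup-norm contraction F with factor gamma_bar < 1.

    Stops when gamma_bar / (1 - gamma_bar) * ||v^{k+1} - v^k|| <= tol, which
    bounds the sup-norm distance from v^{k+1} to the fixed point.
    """
    v = list(v0)
    for k in range(max_iter):
        w = F(v)
        step = max(abs(x - y) for x, y in zip(w, v))
        v = w
        if gamma_bar / (1.0 - gamma_bar) * step <= tol:
            return v, k + 1
    raise RuntimeError("value iteration did not converge")

# Toy Shapley operator of a 2-state game with all discount factors equal to 0.5
# (illustrative numbers): state 0 has a min over two actions, the first of
# which faces a max over two responses.
def F(v):
    return [
        min(max(0.5 * v[0] + 1.0, 0.5 * v[1]),
            0.25 * v[0] + 0.25 * v[1] + 2.0),
        0.5 * v[1] + 3.0,
    ]

v, iterations = value_iteration(F, [0.0, 0.0], gamma_bar=0.5)
print(v, iterations)  # v is close to the fixed point [3, 6]
```

The number of iterations needed grows like $\log(1/\text{tol}) / \log(1/\bar\gamma)$, which is why the method degrades when $\bar\gamma$ approaches 1.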

(9)

Policy iterations for discounted games

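Assume: $A_i$ and $B_i$ are finite sets, and $\gamma_i^{ab} \le \bar\gamma < 1$ for all $i \in [n]$, $a \in A_i$, $b \in B_i$.

Denote by $\Sigma := \{\sigma : i \in [n] \mapsto \sigma_i \in A_i\}$ and $\Delta := \{\delta : i \in [n] \mapsto \delta_i \in B_i\}$ the sets of policies, and for $\sigma \in \Sigma$ and $\delta \in \Delta$, define the matrices and vectors:

$$M^{(\sigma\delta)} = \big(M_{ij}^{\sigma_i \delta_i}\big)_{i,j = 1, \dots, n}, \qquad r^{(\sigma\delta)} = \big(r_i^{\sigma_i \delta_i}\big)_{i = 1, \dots, n},$$

and the affine maps

$$F^{(\sigma\delta)}(v) = M^{(\sigma\delta)} v + r^{(\sigma\delta)}, \quad v \in \mathbb{R}^n.$$

Then $F$ can be written as:

$$F(v) = \min_{\sigma \in \Sigma} F^{(\sigma)}(v), \quad \text{with } F^{(\sigma)}(v) := \max_{\delta \in \Delta} F^{(\sigma\delta)}(v), \quad v \in \mathbb{R}^n,$$

where minima and maxima are for the partial order of $\mathbb{R}^n$.

The maps $F^{(\sigma\delta)}$, $F^{(\sigma)}$ and $F$ are all order preserving and contracting for the sup-norm with contraction factor $\bar\gamma$.

Important: the infimum and supremum are attained because the sets $\{F^{(\sigma)}(v) \mid \sigma \in \Sigma\}$ and $\{F^{(\sigma\delta)}(v) \mid \delta \in \Delta\}$ are rectangular.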

(10)

(Howard, 1960) for 1-player games, (Denardo, 1967) for 2-player games.

Using operators:

Given an initial policy $\sigma^0 \in \Sigma$, apply successively the two following steps for $s \ge 0$ until $\sigma^{s+1} = \sigma^s$:

1. Compute the fixed point $v^s$ of $F^{(\sigma^s)}$;

2. Improve the policy: choose an optimal policy for $v^s$, that is $\sigma^{s+1} \in \Sigma$ such that $F(v^s) = F^{(\sigma^{s+1})}(v^s)$, with $\sigma^{s+1} = \sigma^s$ as soon as this is possible.

Step 1 is solved by using policy iteration for the (one-player) game with fixed policy $\sigma^s$, which constructs $v^{s,l}$ and $\delta^{s,l}$ from $\delta^{s,0}$.

(11)

Policy iterations for discounted games

With control terminology:

1. Compute the value $v^s$ of the game with fixed policy $\sigma^s$, that is the solution of $v = F^{(\sigma^s)}(v)$;

2. Improve the policy: choose an optimal policy for $v^s$, that is $\sigma^{s+1} \in \Sigma$ such that $F(v^s) = F^{(\sigma^{s+1})}(v^s)$, or equivalently:

$$\sigma_i^{s+1} \in \operatorname*{argmin}_{a \in A_i} \Big\{ \max_{b \in B_i} \Big( \sum_{j \in [n]} M_{ij}^{ab} v_j^s + r_i^{ab} \Big) \Big\}, \quad i \in [n].$$
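The two nested loops can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the implementation used by the author: the list encoding `game[i][a][b] = (row, reward)`, the helper names (`solve`, `q`, `evaluate`, `inner_pi`, `policy_iteration`), the tolerance `eps`, and the toy game are all hypothetical. Policies are kept unchanged unless a strict improvement exists, which corresponds to the rule "$\sigma^{s+1} = \sigma^s$ as soon as this is possible".

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(n):
            if r != c:
                f = aug[r][c] / aug[c][c]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[c])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def q(game, i, a, b, v):
    """Value sum_j M_ij^{ab} v_j + r_i^{ab} of the action pair (a, b) in state i."""
    row, r = game[i][a][b]
    return sum(m * x for m, x in zip(row, v)) + r

def evaluate(game, sigma, delta):
    """Fixed point of the affine map F^(sigma delta): solve (I - M) v = r."""
    n = len(game)
    A, rhs = [], []
    for i in range(n):
        row, r = game[i][sigma[i]][delta[i]]
        A.append([(1.0 if i == j else 0.0) - row[j] for j in range(n)])
        rhs.append(r)
    return solve(A, rhs)

def inner_pi(game, sigma, eps=1e-12):
    """Internal loop: MAX improves delta against the fixed policy sigma."""
    n = len(game)
    delta = [0] * n
    while True:
        v = evaluate(game, sigma, delta)
        new = list(delta)
        for i in range(n):
            a = sigma[i]
            best = max(range(len(game[i][a])), key=lambda b: q(game, i, a, b, v))
            if q(game, i, a, best, v) > q(game, i, a, delta[i], v) + eps:
                new[i] = best  # switch only on strict improvement
        if new == delta:
            return v, delta
        delta = new

def policy_iteration(game, sigma0, eps=1e-12):
    """External loop: MIN improves sigma using the values v^s of the inner loop."""
    n = len(game)
    sigma = list(sigma0)
    while True:
        v, delta = inner_pi(game, sigma, eps)

        def worst(i, a):
            return max(q(game, i, a, b, v) for b in range(len(game[i][a])))

        new = list(sigma)
        for i in range(n):
            best = min(range(len(game[i])), key=lambda a: worst(i, a))
            if worst(i, best) < worst(i, sigma[i]) - eps:
                new[i] = best  # switch only on strict improvement
        if new == sigma:
            return v, sigma, delta
        sigma = new

# Toy 2-state game with all discount factors equal to 0.5 (illustrative numbers).
game = [
    [[([0.5, 0.0], 1.0), ([0.0, 0.5], 0.0)], [([0.25, 0.25], 2.0)]],
    [[([0.0, 0.5], 3.0)]],
]
v, sigma, delta = policy_iteration(game, [1, 0])
print(v, sigma, delta)  # v is close to [3, 6]; sigma == [0, 0]; delta == [1, 0]
```

Starting from $\sigma^0 = (1, 0)$, the first value is approximately $(4.67, 6)$ and the second is $(3, 6)$, which illustrates that the values $v^s$ do not increase along the external loop.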

(12)

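Simplex algorithm for 1-player games with Dantzig pivoting:

1. Compute the value $v^s$ of the game with fixed policy $\sigma^s$, that is the solution of $v = F^{(\sigma^s)}(v)$;

2. Improve the policy: choose a policy $\sigma^{s+1} \in \Sigma$ such that

$$\sigma_i^{s+1} \in \operatorname*{argmin}_{a \in A_i} \Big\{ \max_{b \in B_i} \Big( \sum_{j \in [n]} M_{ij}^{ab} v_j^s + r_i^{ab} \Big) \Big\}$$

for one $i$ such that $\big(F^{(\sigma^s)}(v^s) - F(v^s)\big)_i$ is maximal (the policy is changed at this single state only).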

(13)

Policy iterations for discounted games

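[Figure: nested policy iterations on policies. The external PI loop runs over the policies $\sigma^0, \dots, \sigma^{s-1}, \sigma^s, \sigma^{s+1}, \dots$; for each $\sigma^s$, the internal PI loop runs over the policies $\delta^{s,0}, \dots, \delta^{s,l-1}, \delta^{s,l}, \dots$]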

(14)

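[Figure: the same nested policy iterations on values, with the labels "PI external", "PI internal", "(Linear Equations)" and "(Nonlinear Equations)". The external loop produces $v^{s-2}, v^{s-1}, v^s, \dots$, converging to $v$; the internal loop produces $v^{s,l-1}, v^{s,l}, \dots$, converging to $v^s$.]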

(15)

Policy iterations for discounted games: monotone convergence

• The sequence $(v^s)_{s \ge 0}$ is nonincreasing;

• Hence, the sequence $(\sigma^s)_{s \ge 0}$ does not visit the same policy twice until it becomes stationary;

• So the sequence $(v^s)_s$ is stationary after a finite time (at most $\#\Sigma$), and converges towards the solution $v$ of $v = F(v)$.

(16)


• When $s$ is fixed, the sequence $(v^{s,l})_l$ is nondecreasing;

• Hence, the sequence $(\delta^{s,l})_l$ does not visit the same policy twice until it becomes stationary;

• So the sequence $(v^{s,l})_l$ is stationary after a finite time (at most $\#\Delta$), and converges towards the solution $v^s$ of $v = F^{(\sigma^s)}(v)$.

(17)

Policy iterations for discounted games: well known properties

• Policy iterations converge faster than value iterations: for all $s \ge 0$, $v \le v^{s+1} \le F(v^s) \le v^s$, so $v \le v^s \le F^s(v^0) \le v^0$.

• If the discount factor is uniformly bounded by some constant $\bar\gamma < 1$, then for the sup-norm we have:

$$\|v^{s+1} - v\| \le \|F(v^s) - v\| \le \bar\gamma \|v^s - v\|.$$

• For 1-player games with an infinite number of actions and under regularity conditions, policy iterations coincide with the Newton algorithm, and have a super-linear convergence.

• However, in general, the number of (external) iterations is only known to be bounded by $\#\Sigma$, and $\#\Sigma \ge 2^n$ if $\#A_i \ge 2$ for all $i \in [n]$.

(18)

• (Friedmann, 2009) exhibited a 2-player deterministic game problem with $\gamma \simeq 1$ for which policy iteration needs an exponential number of iterations.

• (Fearnley, 2010) and (Andersson, 2009) showed the same for a 1-player stochastic game.

(19)

Policy iterations for discounted games: recent results

(Ye, 2011) showed that the policy iteration algorithm and the simplex algorithm solve 1-player discounted games with fixed discount factor $\gamma < 1$ in strongly polynomial time.

(Hansen, Miltersen and Zwick, 2011) extended and improved this result to the policy iteration algorithm for 2-player games. They show that the number of iterations $s_{\max}$ (to obtain stationarity) satisfies:

$$s_{\max} \le (m+1)\Big(1 + \frac{\log(n^2/(1-\gamma))}{-\log(\gamma)}\Big) = O\Big(\frac{m}{1-\gamma} \log \frac{n}{1-\gamma}\Big),$$

with $m$ = the total number of actions: the number of $(i, a, b)$ with $i \in [n]$, $a \in A_i$ and $b \in B_i$.

(Feinberg, Huang, 2013): Same for a one-player game with mean-payoff, and a state $i_0$ such that $P_{i, i_0}^{a} \ge 1 - \gamma$, for all $i \in [n]$, $a \in A_i$.

Question: What remains true when the discount factors $\gamma_i^{ab}$ are not uniformly bounded by a constant $< 1$, or for games with mean-payoff?

(20)

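Let us fix $\lambda < 1$. The policy iteration algorithm for the class of 2-player games satisfying

$$r(M^{(\sigma\delta)}) \le \lambda \quad \forall \sigma \in \Sigma,\ \delta \in \Delta$$

is strongly polynomial. More precisely, the number of external iterations $s_{\max}$ satisfies:

$$s_{\max} \le (m_1 - n)\Big(1 + \Big\lfloor \frac{\log(1-\lambda)}{\log(\lambda)} \Big\rfloor\Big) = O\Big(\frac{m_1 - n}{1-\lambda} \log \frac{1}{1-\lambda}\Big),$$

with $m_1$ = the total number of actions of the first player: the number of $(i, a)$ with $i \in [n]$ and $a \in A_i$.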
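To get a feeling for the size of this bound, it can be evaluated numerically. The function name `external_iteration_bound` and the sample values of $m_1$, $n$ and $\lambda$ below are illustrative assumptions.

```python
import math

def external_iteration_bound(m1, n, lam):
    """Evaluate (m1 - n) * (1 + floor(log(1 - lam) / log(lam))) for 0 < lam < 1.

    m1 is the total number of actions of the first player, n the number of
    states, and lam the bound on the spectral radii of the matrices M^(sigma delta).
    """
    if not 0.0 < lam < 1.0:
        raise ValueError("lam must lie strictly between 0 and 1")
    return (m1 - n) * (1 + math.floor(math.log(1.0 - lam) / math.log(lam)))

# Example: n = 10 states, 3 actions per state for the first player (m1 = 30).
print(external_iteration_bound(30, 10, 0.5))  # 40
print(external_iteration_bound(30, 10, 0.9))  # 440
```

The bound depends on the game only through $m_1 - n$ and $\lambda$, not on the rewards or on the actions of the second player.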

(21)

Proof.

• Adapt the proof of (Hansen, Miltersen and Zwick, 2011) by using sup-norms instead of $\ell_1$ norms and the nonlinear maps $F^{(\delta)}$ to obtain the above bound when the discount factors are $\le \lambda$. A similar bound is obtained by (Scherrer, 2013) in the one-player case with fixed discount factor.

• Using nonlinear spectral theory, show that for all $\lambda < \mu < 1$, there exists $\phi \in \mathbb{R}^n$ such that $\phi_i > 0$, $i \in [n]$, and $M^{(\sigma\delta)} \phi \le \mu \phi$.

• Let $G(v) = \phi^{-1} F(\phi v)$ with $\phi v = (\phi_i v_i)_{i \in [n]}$. Then $G$ is the dynamic programming operator of a game with discount factors $\le \mu$, and the sequences of policies $\sigma^s$ for $F$ and $G$ are the same, and so is $s_{\max}$.

• Equivalently, $F$ is contracting on $\mathbb{R}^n$ with contraction factor $\mu$, for the weighted sup-norm $\|\cdot\|_\phi$ defined by:

$$\|v\|_\phi := \max_{i \in [n]} \Big| \frac{v_i}{\phi_i} \Big| \quad \forall v \in \mathbb{R}^n.$$

• Take the infimum of the bound over all $\mu$.

(22)


(23)

Definition (Nonlinear spectral radii (Nussbaum, Mallet-Paret, 1998))

Let $h$ be a nonlinear continuous positively homogeneous map on a closed convex cone $C$ of $\mathbb{R}^n$ ($h(\lambda v) = \lambda h(v)$ for all $\lambda > 0$ and $v \in C$):

• The cone eigenvalue spectral radius of $h$, $\hat{r}_C(h)$, is the maximal modulus of an eigenvalue of $h$ in $C$, where $\lambda$ is an eigenvalue associated to $v \in C \setminus \{0\}$ if $h(v) = \lambda v$.

• The Collatz-Wielandt number $\mathrm{cw}_C(h)$ is the infimum of the super-eigenvalues of $h$, where $\lambda > 0$ is a super-eigenvalue if there exists $v$ in the interior of $C$ such that $h(v) \le \lambda v$.

• Bonsall's spectral radius of $h$ is defined as:

$$r_C(h) := \inf_{k \ge 1} \|h^k\|_C^{1/k}, \quad \text{with } \|h\|_C := \sup_{x \in C,\ \|x\| = 1} \|h(x)\|,$$

for any given norm $\|\cdot\|$ on $\mathbb{R}^n$.

(24)

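For $C = \mathbb{R}^n_+$, all the above spectral radius notions of $h$ coincide:

$$r(h) = \inf_{k \ge 1} \|h^k\|^{1/k}_{\mathbb{R}^n_+} = \max\{\lambda \in \mathbb{R} \mid \exists v \in \mathbb{R}^n_+ \setminus \{0\},\ h(v) = \lambda v\} = \inf\{\lambda > 0 \mid \exists v \in (\mathbb{R}^*_+)^n,\ h(v) \le \lambda v\}.$$

Proposition (A., Gaubert, Nussbaum, arXiv 2011)

Assume that $h$ and $h_\pi$ are continuous, positively homogeneous, order preserving selfmaps of $\mathbb{R}^n_+$, for all $\pi \in \Pi$, and that $h(v) = \max_{\pi \in \Pi} h_\pi(v)$ for all $v \in \mathbb{R}^n_+$. Then

$$r(h) = \max_{\pi \in \Pi} r(h_\pi).$$

Applying the proposition to $h(v) := \max_{\sigma \in \Sigma} \max_{\delta \in \Delta} (M^{(\sigma\delta)} v)$, we get that $r(h) \le \lambda < \mu$, and so by the theorem, there exists $\phi \in (\mathbb{R}^*_+)^n$ such that $M^{(\sigma\delta)} \phi \le h(\phi) \le \mu \phi$, for all $\sigma \in \Sigma$, $\delta \in \Delta$.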
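One possible way to compute such a vector $\phi$ numerically is sketched below. It is an assumption of this sketch, not the construction used in the talk: it iterates $\phi \leftarrow \mathbf{1} + h(\phi)/\mu$, which converges when $r(h) < \mu$ and yields $h(\phi) = \mu(\phi - \mathbf{1}) \le \mu\phi$ at the limit. The function name `scaling_vector` and the two sample matrices are hypothetical; taking the maximum row by row over the matrices mirrors the rectangular structure of the policy sets.

```python
def scaling_vector(matrices, mu, iters=2000):
    """Return (phi, h(phi)) after iterating phi <- 1 + h(phi) / mu.

    Here h(v)_i = max over the given matrices M of (M v)_i. The iteration is
    assumed to converge, which holds when the spectral radius of h is < mu.
    """
    n = len(matrices[0])

    def h(v):
        return [max(sum(M[i][j] * v[j] for j in range(n)) for M in matrices)
                for i in range(n)]

    phi = [1.0] * n
    for _ in range(iters):
        phi = [1.0 + x / mu for x in h(phi)]
    return phi, h(phi)

# Two nonnegative matrices (illustrative numbers). The first has a row sum
# equal to 2, so it is not contracting for the plain sup-norm, but its
# spectral radius is sqrt(0.2), about 0.447.
M1 = [[0.0, 2.0], [0.1, 0.0]]
M2 = [[0.3, 0.0], [0.0, 0.3]]
mu = 0.6
phi, h_phi = scaling_vector([M1, M2], mu)
print(phi)    # close to [9.75, 2.625]
print(h_phi)  # close to [5.25, 0.975], componentwise below mu * phi
```

With this $\phi$, both matrices are contracting with factor $\mu$ for the weighted sup-norm $\|\cdot\|_\phi$, even though the first one is not contracting for the unweighted sup-norm.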

(25)

Consider the value function of the game with mean-payoff:

$$\eta_x = \inf_{(\alpha_k)_{k \ge 0}} \sup_{(\beta_k)_{k \ge 0}} \limsup_{T \to \infty} \frac{1}{T} E\Big[ \sum_{k=0}^{T-1} r_{X_k}^{\alpha_k, \beta_k} \,\Big|\, X_0 = x \Big].$$

Let $F$ be the dynamic programming operator such that $\gamma_i^{ab} \equiv 1$.

$F$ is additively homogeneous. We say that $v \in \mathbb{R}^n$ is a (nonlinear additive) eigenvector or bias of $F$ with eigenvalue $\rho \in \mathbb{R}$ if $F(v) = \rho + v$.

• If $\rho$ exists, then $\eta_x = \rho$ for all $x \in [n]$.

• If all the matrices $M^{(\sigma\delta)}$ are irreducible, then $\rho$ exists and the eigenvector $v$ is unique up to an additive constant.

• Other existence results for $\rho$: (Bather, 1973), (Gaubert, Gunawardena, 2001).

(26)

(Hoffman and Karp, 1966) We have to solve $\rho + v = F(v)$.

Using operators:

1. Compute the additive eigenvalue and eigenvector $\rho^s$ and $v^s$ of $F^{(\sigma^s)}$, that is the solution of $\rho + v = F^{(\sigma^s)}(v)$;

2. Improve the policy: choose an optimal policy for $v^s$, that is $\sigma^{s+1} \in \Sigma$ such that $F(v^s) = F^{(\sigma^{s+1})}(v^s)$.

Step 1 is solved by using policy iteration for the (one-player) game with fixed policy $\sigma^s$, which constructs $\rho^{s,l}$, $v^{s,l}$ and $\delta^{s,l}$ from $\delta^{s,0}$.

(27)

Policy iterations for “irreducible” mean-payoff games

(Hoffman and Karp, 1966) We have to solve $\rho + v = F(v)$.

With control terminology:

1. Compute the value $\rho^s$ and the bias $v^s$ of the game with fixed policy $\sigma^s$, that is the solution of $\rho + v = F^{(\sigma^s)}(v)$;

2. Improve the policy: choose an optimal policy for $v^s$, that is $\sigma^{s+1} \in \Sigma$ such that $F(v^s) = F^{(\sigma^{s+1})}(v^s)$, or equivalently:

$$\sigma_i^{s+1} \in \operatorname*{argmin}_{a \in A_i} \Big\{ \max_{b \in B_i} \Big( \sum_{j \in [n]} M_{ij}^{ab} v_j^s + r_i^{ab} \Big) \Big\}, \quad i \in [n].$$

(28)


(29)

Policy iterations for “irreducible” mean-payoff games: monotone convergence

• The sequence $(\rho^s)_{s \ge 0}$ is nonincreasing;

• If $\rho^s = \rho^{s+1}$, then $v^s - v^{s+1}$ is constant and $v^s = v$.

• So the sequence $(\rho^s, v^s)_s$ is stationary after a finite time (at most $\#\Sigma$), up to an additive constant, and converges towards the solution $(\rho, v)$ of $\rho + v = F(v)$.

• When $s$ is fixed, the sequence $(\rho^{s,l})_l$ is nondecreasing;

• If $\rho^{s,l} = \rho^{s,l+1}$, then $v^{s,l} - v^{s,l+1}$ is constant and $v^{s,l} = v^s$.

• Hence, the sequence $(\delta^{s,l})_l$ does not visit the same policy twice until it becomes stationary;

• So the sequence $(\rho^{s,l}, v^{s,l})_l$ is stationary after a finite time (at most $\#\Delta$), and converges towards the solution $(\rho^s, v^s)$ of $\rho + v = F^{(\sigma^s)}(v)$.

(30)

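For a Markov matrix $M$ and states $i$, $j$, denote:

$$T_{ij}(M) = E[\inf\{k \ge 1 \mid X_k = j\} \mid X_0 = i],$$

the expected first return (or hitting) time in state $j$, starting from $i$.

Note that $T_{i i_0}(M) < +\infty$ for all $i \in [n]$ if and only if $M$ has a unique recurrent (final) class and $i_0$ belongs to it.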
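These expected times satisfy the linear system $t = \mathbf{1} + Q t$, where $Q$ is $M$ with its $i_0$-th column set to zero. A minimal sketch that solves it by fixed point iteration is given below; the function name `hitting_times`, the iteration count, and the two-state chain are illustrative assumptions, and the iteration is only meaningful when all the times are finite.

```python
def hitting_times(M, i0, iters=10_000):
    """Approximate T_{i i0}(M) for all i by iterating t <- 1 + Q t.

    Q is the matrix M with its i0-th column replaced by zeros, so that paths
    are stopped when they reach i0. Indices are 0-based in this sketch.
    """
    n = len(M)
    t = [0.0] * n
    for _ in range(iters):
        t = [1.0 + sum(M[i][j] * t[j] for j in range(n) if j != i0)
             for i in range(n)]
    return t

# Two-state chain: from state 0, stay or move with probability 1/2 each;
# from state 1, return to state 0 with probability 1.
M = [[0.5, 0.5], [1.0, 0.0]]
print(hitting_times(M, 0))  # [1.5, 1.0]
```

For this chain the invariant measure gives weight $2/3$ to state 0, which agrees with the expected return time $T_{00} = 3/2$.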

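Theorem (A., Gaubert, arXiv:1310.4953)

Let us fix $K > 0$ and a state $i_0$. The policy iteration algorithm for the class of 2-player mean-payoff games such that

$$T_{i i_0}(M^{(\sigma\delta)}) \le K \quad \forall \sigma \in \Sigma,\ \delta \in \Delta,\ i \in [n]$$

is strongly polynomial. More precisely, the number of external iterations $s_{\max}$ satisfies:

$$s_{\max} \le (m_1 - n)\Big(1 + \Big\lfloor \frac{\log(K)}{\log(K/(K-1))} \Big\rfloor\Big) = O((m_1 - n) K \log K),$$

with $m_1$ = the total number of actions of the first player.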
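This bound has the same shape as the bound for games with spectral radii at most $\lambda$, evaluated at $\lambda = 1 - 1/K$ (the value used in the proof). The following short computation, added here as a reading aid, checks the substitution and the order of magnitude:

```latex
\begin{align*}
\frac{\log(1-\lambda)}{\log(\lambda)}\bigg|_{\lambda = 1 - 1/K}
  &= \frac{\log(1/K)}{\log\big((K-1)/K\big)}
   = \frac{\log(K)}{\log\big(K/(K-1)\big)}, \\
\log\frac{K}{K-1} &= -\log\Big(1 - \frac{1}{K}\Big) \ge \frac{1}{K}
  \quad\Longrightarrow\quad
  \frac{\log(K)}{\log\big(K/(K-1)\big)} \le K \log K .
\end{align*}
```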

(31)

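Sketch of the proof.

• Let $\phi \in (\mathbb{R}^*_+)^n$ be defined by:

$$\phi_i = \max_{\sigma \in \Sigma} \max_{\delta \in \Delta} T_{i i_0}(M^{(\sigma\delta)}).$$

• Let $Q^{(\sigma\delta)}$ be obtained from $M^{(\sigma\delta)}$ by putting its $i_0$-th column to zero. Then $\phi = \mathbf{1} + \max_{\sigma \in \Sigma} \max_{\delta \in \Delta} (Q^{(\sigma\delta)} \phi)$.

• Let $N^{(\sigma\delta)}$ be obtained from $M^{(\sigma\delta)}$ by replacing its $i_0$-th column by the nonnegative vector $(\phi - \mathbf{1} - Q^{(\sigma\delta)} \phi)/\phi_{i_0}$.

• $N^{(\sigma\delta)}$ has nonnegative entries and satisfies:

$$N^{(\sigma\delta)} \phi = \phi - \mathbf{1} \le \lambda \phi \ \text{ with } \lambda = 1 - 1/K \quad \Rightarrow \quad r(N^{(\sigma\delta)}) \le \lambda.$$

• Then the map

$$G(v) = \min_{\sigma \in \Sigma} \max_{\delta \in \Delta} \big( N^{(\sigma\delta)} v + r^{(\sigma\delta)} \big), \quad v \in \mathbb{R}^n,$$

satisfies the assumptions of the theorem for discounted games.

• If $v_{i_0} = 0$, then $\rho + v = F(v) \Leftrightarrow \rho\phi + v = G(\rho\phi + v)$.

• Hence, the sequences of policies $\sigma^s$ and $\delta^{s,l}$ for $F$ and $G$ are the same.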

(32)

[Figure: a web graph with nodes numbered 1 to 21.]

Nodes = web pages; arcs = hyperlinks.

21: spammer page.

1: non-controlled page.

Associated Markov matrix $S$: $S_{ij} = 1/N_i$ if $(i,j)$ is a hyperlink, $S_{ij} = 0$ otherwise; $N_i$ = number of hyperlinks from $i$.

The PageRank is the invariant measure $\pi$ of $S$.
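As an illustration of these definitions, the sketch below builds $S$ from a small link structure and approximates the invariant measure by power iteration. The three-page graph, the 0-based page indices, and the function names `link_matrix` and `invariant_measure` are assumptions made for this example; plain power iteration is only expected to converge when $S$ is irreducible and aperiodic, which holds for the graph chosen here.

```python
def link_matrix(links, n):
    """Build S with S[i][j] = 1/N_i if page i links to page j, and 0 otherwise."""
    S = [[0.0] * n for _ in range(n)]
    for i, targets in links.items():
        for j in targets:
            S[i][j] = 1.0 / len(targets)
    return S

def invariant_measure(S, iters=1000):
    """Approximate the row vector pi with pi S = pi by power iteration."""
    n = len(S)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * S[i][j] for i in range(n)) for j in range(n)]
    return pi

# Hypothetical 3-page web: page 0 links to pages 1 and 2, page 1 links to
# page 2, and page 2 links back to page 0.
links = {0: [1, 2], 1: [2], 2: [0]}
S = link_matrix(links, 3)
pi = invariant_measure(S)
print(pi)  # close to [0.4, 0.2, 0.4]
```

Pages 0 and 2 receive the same rank because every visit to page 2 is followed by a visit to page 0.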

(33)

• Let $v$ be the preference probability vector of the Web search engine.

• Let $\alpha$ be a damping factor: the probability for a Web surfer to use the Web search engine.

• Usually, one replaces $S$ by $\alpha S + (1-\alpha) \mathbf{1} v$, with $\mathbf{1} = (1 \cdots 1)^T$.

• This is similar to considering the Markov matrix of the Web with the Web search engine:

$$M = \begin{pmatrix} 0 & v \\ \alpha \mathbf{1} & (1-\alpha) S \end{pmatrix}.$$

• If $r$ is an instantaneous reward such that $r_i = 1$ for $i = s$ and $0$ otherwise, then the mean-payoff is the PageRank (frequency of visit) $\pi_s$ of the spammer site.

• Optimizing the spammer site is a 1-player game with mean-payoff (see for instance (Fercoq, A., Bouhtou, Gaubert, IEEE TAC 2013)).

A zero-sum game problem:

• $\sigma \in \Sigma$ is the policy of the Web search engine: it controls $v$ and wants to minimize the PageRank of the spammer site;

• $\delta \in \Delta$ is the policy of the spammer: it controls the rows of $S$ with index in its site, and wants to maximize its PageRank.

• All final classes of $M^{(\sigma\delta)}$ contain state 1 (the Web search engine).

(34)

and to find a complexity result.

(35)

Related recent results for 1-player discounted games

• (Post, Ye, 2012) show that the simplex algorithm for deterministic MDP (1-player games) is strongly polynomial independently of the discount factor: it stops after $O(n^5 m^2 \log^2 n)$ iterations, where $m$ is the number of possible actions by state (thus $m_1 = nm$).

• (Scherrer, 2013) generalizes this result to stochastic MDP which satisfy a bound which may be seen (and is equal when the discount factor $\gamma$ tends to 1) as a bound $\tau_r$ on the expected first return time to recurrent states and a bound $\tau_t$ on the expected exit time from transient states. Under these conditions the simplex algorithm stops after $O(n^3 m^2 \tau_r \tau_t \log^2(n \tau_r \tau_t))$ iterations.

• (Scherrer, 2013) shows a similar result for the policy iteration algorithm for stochastic MDP (1-player games), when the set of transient states is independent of the strategy. Under these conditions the policy iteration algorithm stops after $n(m-1)\big(\lceil \tau_r \log(n\tau_r) \rceil + \lceil \tau_t \log(n\tau_t) \rceil\big)$ iterations.

• However, this assumption implies that the recurrent classes are independent of the strategy.

(36)

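Let us fix $K > 0$ and a state $i_0$. The policy iteration algorithm for the class of 2-player discounted games with fixed discount factor, $M^{(\sigma\delta)} = \gamma P^{(\sigma\delta)}$ with $\gamma < 1$, such that

$$T_{i i_0}(P^{(\sigma\delta)}) \le K \quad \forall \sigma \in \Sigma,\ \delta \in \Delta,\ i \in [n]$$

is strongly polynomial. More precisely, the number of external iterations $s_{\max}$ satisfies:

$$s_{\max} \le (m_1 - n)\Big(1 + \Big\lfloor \frac{\log(K)}{\log(K/(K-1))} \Big\rfloor\Big) = O((m_1 - n) K \log K),$$

with $m_1$ = the total number of actions of the first player.

Hence the bound does not depend on $\gamma$.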

(37)

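For a Markov matrix $M$, a state $i$ and a set $C$ of states, denote:

$$T_{iC}(M) = E[\inf\{k \ge 1 \mid X_k \in C\} \mid X_0 = i],$$

the expected first return (or hitting) time in set $C$, starting from $i$.

Theorem (A., Gaubert, 2014)

Let us fix $K > 0$ and a subset $C$ of states with cardinality $s$. The policy iteration algorithm for the class of 2-player multichain mean-payoff games such that for all $\sigma \in \Sigma$, $\delta \in \Delta$, each final class of $M^{(\sigma\delta)}$ contains exactly one element of $C$ and

$$T_{iC}(M^{(\sigma\delta)}) \le K \quad \forall i \in [n]$$

is strongly polynomial. More precisely, the number of external iterations $s_{\max}$ satisfies:

$$s_{\max} \le (m_1 - n)\Big(1 + \Big\lfloor \frac{\log(sK)}{\log(sK/(sK-1))} \Big\rfloor\Big) = O((m_1 - n)\, sK \log(sK)),$$

with $m_1$ = the total number of actions of the first player.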

(38)

• In general, $F$ may not have an additive eigenvalue and eigenvector, that is $\rho$ and $v$ such that $\rho + v = F(v)$.

• If the action spaces $A_i$ and $B_i$ are finite for all $i \in [n]$, then $F$ is polyhedral, and since it is also nonexpansive, by the theorem of Kohlberg (1980), there exist $\eta$ and $v$ in $\mathbb{R}^n$ such that

$$F(t\eta + v) = (t+1)\eta + v, \quad \text{for } t \text{ large enough}.$$

• $(\eta, v)$ is called an invariant half-line.

• Then $\eta$ is the value of the game with mean-payoff.

• Moreover, there exist $\hat{F}$ and $\acute{F}_{\cdot}$ such that $(\eta, v)$ is an invariant half-line if and only if it satisfies the system:

$$\begin{cases} \eta = \hat{F}(\eta), \\ \eta + v = \acute{F}_\eta(v). \end{cases}$$

• However, $v$ is not unique.

(39)

Policy iterations for multichain mean-payoff games

Construct a sequence of policies $\sigma^s$, values $\eta^s$ and biases $v^s$. They were introduced and proved to converge by

• (Howard, 1960) and (Denardo and Fox, 1968) for 1-player multichain mean-payoff games,

• (Vöge and Jurdziński, 2000) for parity games,

• (Cochet-Terrasson, Gaubert, Gunawardena, 1998 and 1999), (Bjorklund, Sandberg, Vorobyov, 2004), (Jurdziński, Paterson, Zwick, 2006) for 2-player deterministic games,

• (Cochet-Terrasson and Gaubert, 2006), (A., Cochet-Terrasson, Detournay, and Gaubert, arXiv:1208.0446, and CDC 2013), (Detournay, PIGAMES library, 2012), (Bourque, Raghavan, preprint, 2012) for general multichain 2-player stochastic games.

(Detournay, 2012).

To avoid cycling, one needs to add some constraints on $v^s$, for instance:

• fix the value $v_i^s = 0$ at one point $i$ of each final class of $M^{(\sigma\delta)}$ (Howard, and Denardo and Fox, for one-player games);

• by a nonlinear projection (Cochet-Terrasson and Gaubert);

and to choose optimal policies in a conservative way.

(40)

• The policy iteration algorithm for 2-player zero-sum stochastic games is strongly polynomial when restricted to the class of games such that the spectral radii of all $M^{(\sigma\delta)}$ are bounded by $\lambda < 1$. This result is invariant by diagonal scaling.

• The policy iteration algorithm for ergodic mean-payoff games is strongly polynomial when restricted to the class of ergodic games such that the expected first return (or hitting) time in some fixed state $i_0$ of the Markov chain associated to any $M^{(\sigma\delta)}$ and any initial state is bounded by $K < \infty$.

• Same result for discounted games.

• Same result for multichain mean-payoff games, when $i_0$ is replaced by a set of states $C$, and each recurrence class contains exactly one element of $C$.

Open:

• Is the policy iteration algorithm for multichain stochastic games strongly polynomial, under some more general constraints on the $M^{(\sigma\delta)}$ (only)?