Policy iteration for stochastic zero-sum games

HAL Id: hal-01024097

https://hal.inria.fr/hal-01024097

Submitted on 15 Jul 2014

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

Marianne Akian

To cite this version:

Marianne Akian. Policy iteration for stochastic zero-sum games. NETCO 2014, 2014, Tours, France.

⟨hal-01024097⟩

Marianne Akian

INRIA Saclay - Île-de-France and CMAP, École Polytechnique

NETCO 2014, 23-27 June 2014, Tours

Joint work with Stéphane Gaubert, see arXiv:1310.4953

Hamilton-Jacobi-Bellman-Isaacs equations

The stationary equation:

$$-H(x, Dv(x)) = 0, \quad x \in X \subset \mathbb{R}^d, \quad \text{+ a boundary condition},$$

with

$$H(x,p) = \min_{a \in A(x)} \max_{b \in B(x)} \big[ f(x,a,b) \cdot p + g(x,a,b) \big], \quad x \in X, \ p \in \mathbb{R}^d,$$

is the dynamic programming equation satisfied by the (upper) value function of the zero-sum game problem:

$$v(x) = \inf_{(\alpha_t)_{t \ge 0}} \ \sup_{(\beta_t)_{t \ge 0}} \int_0^{\infty} g(x_t, \alpha_t, \beta_t) \, dt,$$

where $\dot{x}_t = f(x_t, \alpha_t, \beta_t)$ for all $t \ge 0$, and the inf and sup are taken over nonanticipating strategies of the first and second player (where the second player knows the current action of the first player).

Example: pursuit evasion games.

$v = F(v)$, the fixed point equation of the dynamic programming or Shapley operator $F$ of a discrete time zero-sum two player stochastic game problem with finite state space.

Same for: discounted problems, optimal stopping time problems, stochastic games.

Discrete time and state zero-sum stochastic games

Let $F : \mathbb{R}^n \to \mathbb{R}^n$ be defined by:

$$[F(v)]_i := \min_{a \in A_i} \max_{b \in B_i} \Big( \sum_{j \in [n]} M_{ij}^{ab} v_j + r_i^{ab} \Big), \quad i \in [n],$$

with $M_{ij}^{ab} \ge 0$ for all $i, j \in [n]$, $a \in A_i$, $b \in B_i$.

The map $F$ is the dynamic programming or Shapley operator of a discrete time zero-sum two player game problem with perfect information on the finite state space $X := [n] := \{1, \dots, n\}$, with:

$A_i$, $B_i$: sets of actions of the 1st and 2nd player (MIN and MAX) when in state $i$;

$r_i^{ab}$: reward paid by MIN to MAX at each time;

$M_{ij}^{ab} := \gamma_i^{ab} P_{ij}^{ab} \ge 0$;

$\gamma_i^{ab} := \sum_{j \in [n]} M_{ij}^{ab} \ge 0$: discount factor ($< 1$, or $\le 1$, or $= 1$);

$P_{ij}^{ab}$: transition probability from $i$ to $j$ ($\sum_{j \in [n]} P_{ij}^{ab} = 1$).
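The definition of $F$ can be made concrete with a minimal Python sketch. The nested-list encoding (`game[i][a][b] = (row, reward)`, where `row` holds the $M_{ij}^{ab}$ and `reward` is $r_i^{ab}$), the function name `shapley_operator` and the toy data are illustrative assumptions, not part of the talk.

```python
def shapley_operator(game, v):
    """Apply [F(v)]_i = min_a max_b (sum_j M_ij^ab v_j + r_i^ab).

    game[i][a][b] is a pair (row, reward): row[j] = M_ij^ab, reward = r_i^ab.
    This encoding is a hypothetical choice made for the sketch.
    """
    return [
        min(                                   # MIN chooses the action a
            max(                               # MAX chooses the reply b
                sum(m * vj for m, vj in zip(row, v)) + reward
                for row, reward in replies
            )
            for replies in state
        )
        for state in game
    ]


# Toy 2-state game (made-up data); every discount factor equals 0.5.
game = [
    [   # state 0: two actions for MIN, two replies for MAX
        [([0.5, 0.0], 1.0), ([0.0, 0.5], 2.0)],
        [([0.25, 0.25], 3.0), ([0.5, 0.0], 0.5)],
    ],
    [   # state 1: a single action for each player
        [([0.0, 0.5], 0.0)],
    ],
]

print(shapley_operator(game, [0.0, 0.0]))   # one application of F at v = 0
print(shapley_operator(game, [2.0, 0.0]))   # [2.0, 0.0] is a fixed point here
```

On this toy game, $v = (2, 0)$ satisfies $v = F(v)$, which is the fixed point equation studied in the rest of the talk.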

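Let $F : \mathbb{R}^n \to \mathbb{R}^n$ be defined as above:

$$[F(v)]_i := \min_{a \in A_i} \max_{b \in B_i} \Big( \sum_{j \in [n]} M_{ij}^{ab} v_j + r_i^{ab} \Big), \quad i \in [n],$$

with $M_{ij}^{ab} \ge 0$ for all $i, j \in [n]$, $a \in A_i$, $b \in B_i$. Denote $\gamma_i^{ab} := \sum_{j \in [n]} M_{ij}^{ab}$. Then:

$F$ is order preserving: $u \le v \Rightarrow F(u) \le F(v)$, for all $u, v \in \mathbb{R}^n$;

if $\gamma_i^{ab} \le 1$ for all $i \in [n]$, $a \in A_i$, $b \in B_i$, then $F$ is additively sub-homogeneous: $F(\lambda + u) \le \lambda + F(u)$, for all $\lambda \ge 0$ and $u \in \mathbb{R}^n$; thus $F$ is sup-norm nonexpansive.

If $\gamma_i^{ab} = 1$ for all $i \in [n]$, $a \in A_i$, $b \in B_i$, then $F$ is additively homogeneous: $F(\lambda + u) = \lambda + F(u)$, for all $\lambda \in \mathbb{R}$ and $u \in \mathbb{R}^n$.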

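Let the value function of the game with infinite horizon be given by:

$$v_x = \inf_{(\alpha_k)_{k \ge 0}} \ \sup_{(\beta_k)_{k \ge 0}} \mathbb{E} \Big[ \sum_{k=0}^{\infty} \Big( \prod_{\ell=0}^{k-1} \gamma_{X_\ell}^{\alpha_\ell \beta_\ell} \Big) r_{X_k}^{\alpha_k \beta_k} \ \Big| \ X_0 = x \Big],$$

where $\alpha_k$ and $\beta_k$ are possible strategies of both players of the game (at time $k$), and $X_k \in [n]$ is the state process of the game satisfying

$$P(X_{k+1} = j \mid X_k = i, \alpha_k = a, \beta_k = b) = P_{ij}^{ab}.$$

If $\gamma_x^{ab} \le \bar{\gamma} < 1$, then $F$ is a sup-norm contraction:

$$\| F(v) - F(w) \|_\infty \le \bar{\gamma} \, \| v - w \|_\infty,$$

and $v$ is the unique solution of

$$v = F(v).$$

Moreover the optimal actions in $F(v)$ give the optimal stationary strategies of the game.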

Problem: compute $v \in \mathbb{R}^n$ such that $F(v) = v$, when such a solution is unique, and bound the complexity of this computation.

When $\gamma_i^{ab} \le \bar{\gamma} < 1$ for all $i \in [n]$, $a \in A_i$, $b \in B_i$, the value iterations coincide with fixed point iterations:

$$v^{k+1} = F(v^k),$$

and with the finite horizon approximations with $T = k$ and $\varphi = v^0$. They converge geometrically towards $v$ with factor $\bar{\gamma}$:

$$\limsup_{k \to \infty} \| v^k - v \|^{1/k} \le \bar{\gamma}.$$

However, the value iteration algorithm is only pseudopolynomial.

Also the existence of a polynomial algorithm is an open problem.
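The value iteration $v^{k+1} = F(v^k)$ can be sketched in a few lines of Python. The game encoding (`game[i][a][b] = (row, reward)`), the stopping rule on the sup-norm of two successive iterates, and the toy data are assumptions made for illustration only.

```python
def shapley_operator(game, v):
    """[F(v)]_i = min_a max_b (sum_j M_ij^ab v_j + r_i^ab)."""
    return [
        min(max(sum(m * vj for m, vj in zip(row, v)) + reward
                for row, reward in replies)
            for replies in state)
        for state in game
    ]


def value_iteration(game, v0, tol=1e-10, max_iter=10_000):
    """Iterate v <- F(v) until two successive iterates differ by at most tol."""
    v = list(v0)
    for k in range(1, max_iter + 1):
        w = shapley_operator(game, v)
        gap = max(abs(a - b) for a, b in zip(w, v))   # sup-norm of w - v
        v = w
        if gap <= tol:
            return v, k
    raise RuntimeError("no convergence within max_iter iterations")


# Toy 2-state game (made-up data) with all discount factors equal to 0.5.
game = [
    [
        [([0.5, 0.0], 1.0), ([0.0, 0.5], 2.0)],
        [([0.25, 0.25], 3.0), ([0.5, 0.0], 0.5)],
    ],
    [
        [([0.0, 0.5], 1.0)],
    ],
]

value, iterations = value_iteration(game, [0.0, 0.0])
print(value, iterations)   # value is close to the fixed point (3, 2)
```

The number of iterations needed for a given accuracy grows like $\log(1/\text{tol}) / \log(1/\bar{\gamma})$, which is why the method degrades as $\bar{\gamma}$ approaches 1.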

What about the policy iteration?

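Policy iterations for discounted games

Assume: $A_i$ and $B_i$ are finite sets, and $\gamma_i^{ab} \le \bar{\gamma} < 1$ for all $i \in [n]$, $a \in A_i$, $b \in B_i$.

Denote by $\Sigma := \{ \sigma : i \in [n] \mapsto \sigma_i \in A_i \}$ and $\Delta := \{ \delta : i \in [n] \mapsto \delta_i \in B_i \}$ the sets of policies, and for $\sigma \in \Sigma$ and $\delta \in \Delta$, define the matrices and vectors:

$$M^{(\sigma\delta)} = \big( M_{ij}^{\sigma_i \delta_i} \big)_{i,j = 1, \dots, n}, \qquad r^{(\sigma\delta)} = \big( r_i^{\sigma_i \delta_i} \big)_{i = 1, \dots, n},$$

and the affine maps

$$F^{(\sigma\delta)}(v) = M^{(\sigma\delta)} v + r^{(\sigma\delta)}, \quad v \in \mathbb{R}^n.$$

Then, $F$ can be written as:

$$F(v) = \min_{\sigma \in \Sigma} F^{(\sigma)}(v), \quad \text{with } F^{(\sigma)}(v) := \max_{\delta \in \Delta} F^{(\sigma\delta)}(v), \quad v \in \mathbb{R}^n,$$

where minima and maxima are for the partial order of $\mathbb{R}^n$.

The maps $F^{(\sigma\delta)}$, $F^{(\sigma)}$ and $F$ are all order preserving and contracting for the sup-norm with contraction factor $\bar{\gamma}$.

Important: the infimum and supremum are attained because the sets $\{ F^{(\sigma)}(v) \mid \sigma \in \Sigma \}$ and $\{ F^{(\sigma\delta)}(v) \mid \delta \in \Delta \}$ are rectangular.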

(Howard, 1960) for 1-player games, (Denardo, 1967) for 2-player games.

Using operators:

Given an initial policy $\sigma^0 \in \Sigma$, apply successively the two following steps for $s \ge 0$ until $\sigma^{s+1} = \sigma^s$:

1. Compute the fixed point $v^s$ of $F^{(\sigma^s)}$;

2. Improve the policy: choose an optimal policy for $v^s$, that is $\sigma^{s+1} \in \Sigma$ such that $F(v^s) = F^{(\sigma^{s+1})}(v^s)$, with $\sigma^{s+1} = \sigma^s$ as soon as this is possible.

Step 1 is solved by using policy iteration for the (one-player) game with fixed policy $\sigma^s$, which constructs $v^{s,l}$ and $\delta^{s,l}$ from $\delta^{s,0}$.

Policy iterations for discounted games

(Howard, 1960) for 1-player games, (Denardo, 1967) for 2-player games.

With control terminology:

Given an initial policy $\sigma^0 \in \Sigma$, apply successively the two following steps for $s \ge 0$ until $\sigma^{s+1} = \sigma^s$:

1. Compute the value $v^s$ of the game with fixed policy $\sigma^s$, that is the solution of $v = F^{(\sigma^s)}(v)$;

2. Improve the policy: choose an optimal policy for $v^s$, that is $\sigma^{s+1} \in \Sigma$ such that $F(v^s) = F^{(\sigma^{s+1})}(v^s)$, or equivalently:

$$\sigma_i^{s+1} \in \operatorname*{argmin}_{a \in A_i} \Big\{ \max_{b \in B_i} \Big( \sum_{j \in [n]} M_{ij}^{ab} v_j^s + r_i^{ab} \Big) \Big\}, \quad i \in [n],$$

with $\sigma^{s+1} = \sigma^s$ as soon as this is possible.
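The two nested loops can be sketched as follows in plain Python. This is a sketch under assumptions, not the implementation used by the authors: the game encoding (`game[i][a][b] = (row, reward)`), the small Gauss-Jordan solver, the tolerance `1e-12` used to keep the current policy when no strict improvement exists, and the toy data are all illustrative choices.

```python
def solve_linear(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(n):
            if row != col:
                f = aug[row][col] / aug[col][col]
                aug[row] = [x - f * y for x, y in zip(aug[row], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def q(game, i, a, b, v):
    """sum_j M_ij^ab v_j + r_i^ab."""
    row, reward = game[i][a][b]
    return sum(m * vj for m, vj in zip(row, v)) + reward


def evaluate(game, sigma, delta):
    """Solve the linear system v = M^(sigma delta) v + r^(sigma delta)."""
    n = len(game)
    rows = [game[i][sigma[i]][delta[i]] for i in range(n)]
    A = [[(1.0 if i == j else 0.0) - rows[i][0][j] for j in range(n)]
         for i in range(n)]
    return solve_linear(A, [rows[i][1] for i in range(n)])


def inner_policy_iteration(game, sigma, eps=1e-12):
    """Internal loop (player MAX): fixed point of F^(sigma)."""
    delta = [0] * len(game)
    while True:
        v = evaluate(game, sigma, delta)
        new_delta = list(delta)
        for i in range(len(game)):
            a = sigma[i]
            best = max(range(len(game[i][a])), key=lambda b: q(game, i, a, b, v))
            if q(game, i, a, best, v) > q(game, i, a, delta[i], v) + eps:
                new_delta[i] = best          # change only on strict improvement
        if new_delta == delta:
            return v
        delta = new_delta


def policy_iteration(game, eps=1e-12):
    """External loop (player MIN): returns the value and the final policy."""
    sigma = [0] * len(game)
    while True:
        v = inner_policy_iteration(game, sigma)

        def cost(i, a):
            return max(q(game, i, a, b, v) for b in range(len(game[i][a])))

        new_sigma = list(sigma)
        for i in range(len(game)):
            best = min(range(len(game[i])), key=lambda a: cost(i, a))
            if cost(i, best) < cost(i, sigma[i]) - eps:
                new_sigma[i] = best          # keep sigma_i when already optimal
        if new_sigma == sigma:
            return v, sigma
        sigma = new_sigma


# Toy 2-state game (made-up data), all discount factors equal to 0.5.
game = [
    [
        [([0.25, 0.25], 3.0), ([0.5, 0.0], 0.5)],
        [([0.5, 0.0], 1.0), ([0.0, 0.5], 2.0)],
    ],
    [
        [([0.0, 0.5], 1.0)],
    ],
]
value, sigma = policy_iteration(game)
print(value, sigma)
```

On this toy game the external loop changes the policy of MIN once (from action 0 to action 1 in state 0) and then stops, with a value close to $(3, 2)$.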

(Howard, 1960) for 1-player games, (Denardo, 1967) for 2-player games.

Simplex algorithm for 1-player games with Dantzig pivoting:

Given an initial policy $\sigma^0 \in \Sigma$, apply successively the two following steps for $s \ge 0$ until $\sigma^{s+1} = \sigma^s$:

1. Compute the value $v^s$ of the game with fixed policy $\sigma^s$, that is the solution of $v = F^{(\sigma^s)}(v)$;

2. Improve the policy: choose a policy $\sigma^{s+1} \in \Sigma$ such that

$$\sigma_i^{s+1} \in \operatorname*{argmin}_{a \in A_i} \Big\{ \max_{b \in B_i} \Big( \sum_{j \in [n]} M_{ij}^{ab} v_j^s + r_i^{ab} \Big) \Big\}, \quad i \in [n],$$

for one $i$ such that $\big( F^{(\sigma^s)}(v^s) - F(v^s) \big)_i$ is maximal.

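Policy iterations for discounted games

[Figure: nested policy iterations. The external iteration (PI external) moves from $\sigma^0$ through $\sigma^{s-1}$ and $\sigma^s$ to $\sigma^{s+1}$, with values $v^{s-2}$, $v^{s-1}$, $v^s$ tending to $v$. For each $\sigma^s$, the internal iteration (PI internal) moves from $\delta^{s,0}$ through $\delta^{s,l-1}$ to $\delta^{s,l}$, with values $v^{s,l-1}$, $v^{s,l}$ tending to $v^s$. Panel labels: (Linear Equations), (Nonlinear Equations).]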

Policy iterations for discounted games: monotone convergence

The sequence $(v^s)_{s \ge 0}$ is nonincreasing;

Hence, the sequence $(\sigma^s)_{s \ge 0}$ does not visit the same policy twice, until it becomes stationary;

So the sequence $(v^s)_s$ is stationary after a finite time (at most $\sharp\Sigma$), and converges towards the solution $v$ of $v = F(v)$.


When $s$ is fixed, the sequence $(v^{s,l})_l$ is nondecreasing;

Hence, the sequence $(\delta^{s,l})_l$ does not visit the same policy twice, until it becomes stationary;

So the sequence $(v^{s,l})_l$ is stationary after a finite time (at most $\sharp\Delta$), and converges towards the solution $v^s$ of $v = F^{(\sigma^s)}(v)$.

Policy iterations for discounted games: well known properties

The policy iterations converge faster than the value iterations: for all $s \ge 0$, $v \le v^{s+1} \le F(v^s) \le v^s$, so $v \le v^s \le F^s(v^0) \le v^0$, where $F^s$ denotes the $s$-th iterate of $F$.

If the discount factor is uniformly bounded by some constant $\bar{\gamma} < 1$, then for the sup-norm, we have:

$$\| v^{s+1} - v \| \le \| F(v^s) - v \| \le \bar{\gamma} \, \| v^s - v \|.$$

For 1-player games with an infinite number of actions and under regularity conditions, policy iterations coincide with the Newton algorithm, and have a super-linear convergence.

However, in general, the number of (external) iterations is bounded by $\sharp\Sigma \ge 2^n$ if $\sharp A_i \ge 2$ for all $i \in [n]$.

(Friedmann, 2009) showed a 2-player deterministic game problem with $\gamma \simeq 1$ and an exponential number of iterations.

(Fearnley, 2010) and (Andersson, 2009) showed the same for a 1-player stochastic game.

Policy iterations for discounted games: recent results

(Ye, 2011) showed that the policy iteration algorithm and the simplex algorithm solve 1-player discounted games with fixed discount factor $\gamma < 1$ in strongly polynomial time.

(Hansen, Miltersen and Zwick, 2011) extended and improved this result to the policy iteration algorithm for 2-player games. They show that the number of iterations $s_{\max}$ (to obtain stationarity) satisfies:

$$s_{\max} \le (m+1) \Big( 1 + \frac{\log(n^2/(1-\gamma))}{-\log(\gamma)} \Big) = O\Big( \frac{m}{1-\gamma} \log \frac{n}{1-\gamma} \Big),$$

with $m$ = the total number of actions: the number of $(i,a,b)$ with $i \in [n]$, $a \in A_i$ and $b \in B_i$.

(Feinberg, Huang, 2013): Same for a one-player game with mean-payoff, and a state $i_0$ such that $P_{i,i_0}^{a} \ge 1 - \gamma$, for all $i \in [n]$, $a \in A_i$.

Question: What remains true when the discount factors $\gamma_i^{ab}$ are not uniformly bounded by a constant $< 1$? Or for games with mean-payoff?

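The policy iteration algorithm for the class of 2-player games satisfying

$$r(M^{(\sigma\delta)}) \le \lambda \quad \forall \sigma \in \Sigma, \ \delta \in \Delta,$$

where $r$ denotes the spectral radius and $\lambda < 1$ is fixed, is strongly polynomial. More precisely, the number of external iterations $s_{\max}$ satisfies:

$$s_{\max} \le (m_1 - n) \Big( 1 + \Big\lfloor \frac{\log(1-\lambda)}{\log(\lambda)} \Big\rfloor \Big) = O\Big( \frac{m_1 - n}{1-\lambda} \log \frac{1}{1-\lambda} \Big),$$

with $m_1$ = the total number of actions of the first player: the number of $(i,a)$ with $i \in [n]$ and $a \in A_i$.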
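To get a feeling for how this bound grows as $\lambda \to 1$, the following Python sketch evaluates only the $\lambda$-dependent factor $1 + \lfloor \log(1-\lambda) / \log(\lambda) \rfloor$ and compares it with $\frac{1}{1-\lambda} \log \frac{1}{1-\lambda}$. The function names are hypothetical, and the factor counting the actions of the first player is deliberately left out.

```python
import math


def lambda_factor(lam):
    """1 + floor(log(1 - lam) / log(lam)) for 0 < lam < 1."""
    if not 0.0 < lam < 1.0:
        raise ValueError("lam must lie strictly between 0 and 1")
    return 1 + math.floor(math.log(1.0 - lam) / math.log(lam))


def asymptotic_factor(lam):
    """(1 / (1 - lam)) * log(1 / (1 - lam)), the order of growth of the bound."""
    return math.log(1.0 / (1.0 - lam)) / (1.0 - lam)


for lam in (0.9, 0.99, 0.999):
    print(lam, lambda_factor(lam), round(asymptotic_factor(lam), 1))
```

Since $-\log(\lambda) \ge 1 - \lambda$, the first quantity never exceeds the second by more than 1, which is the content of the $O(\cdot)$ estimate.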

Proof. Adapt the proof of (Hansen, Miltersen and Zwick, 2011) by using sup-norms instead of $\ell_1$ norms and the nonlinear maps $F^{(\delta)}$ to obtain the above bound when the discount factors are $\le \lambda$. A similar bound is obtained by (Scherrer, 2013) in the one-player case with fixed discount factor.

Using nonlinear spectral theory, show that for all $\lambda < \mu < 1$, there exists $\varphi \in \mathbb{R}^n$ such that $\varphi_i > 0$, $i \in [n]$, and $M^{(\sigma\delta)} \varphi \le \mu \varphi$.

Let $G(v) = \varphi^{-1} F(\varphi v)$ with $\varphi v = (\varphi_i v_i)_{i \in [n]}$. Then $G$ is the dynamic programming operator of a game with discount factors $\le \mu$, and the sequences of policies $\sigma^s$ for $F$ and $G$ are the same, and so is $s_{\max}$.

Equivalently, $F$ is contracting on $\mathbb{R}^n$ with contraction factor $\mu$, for the weighted sup-norm $\| \cdot \|_\varphi$ defined by:

$$\| v \|_\varphi := \max_{i \in [n]} \Big| \frac{v_i}{\varphi_i} \Big| \quad \forall v \in \mathbb{R}^n.$$

Take the infimum of the bound over all $\mu$.
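One way to see that such a $\varphi$ exists is to solve $\varphi = \mathbf{1} + h(\varphi)/\mu$ with $h(v)_i := \max_{a,b} \sum_j M_{ij}^{ab} v_j$: any solution satisfies $h(\varphi) = \mu(\varphi - \mathbf{1}) \le \mu\varphi$. The Python sketch below does this by fixed point iteration. This construction, the function name `scaling_vector`, the encoding `row_choices[i]` (the list of rows $(M_{ij}^{ab})_j$ available in state $i$) and the data are assumptions for illustration; the talk itself relies on nonlinear spectral theory.

```python
def scaling_vector(row_choices, mu, tol=1e-9, max_iter=100_000):
    """Return phi > 0 with (row . phi) <= mu * phi_i for every row of state i.

    Iterates phi <- 1 + h(phi) / mu starting from phi = 1, where
    h(v)_i is the maximum of (row . v) over the rows available in state i.
    The iteration converges when the spectral radius of h is below mu.
    """
    def h(v):
        return [max(sum(m * vj for m, vj in zip(row, v)) for row in rows)
                for rows in row_choices]

    phi = [1.0] * len(row_choices)
    for _ in range(max_iter):
        nxt = [1.0 + x / mu for x in h(phi)]
        if max(abs(a - b) for a, b in zip(nxt, phi)) <= tol:
            # h(phi) = mu * (nxt - 1) <= mu * phi once nxt - phi <= 1
            return phi
        phi = nxt
    raise RuntimeError("no convergence: is the spectral radius below mu?")


# Made-up data: the row of state 0 sums to 1.2 > 1, so F is not a
# sup-norm contraction, yet both policy matrices have spectral radius < 0.9.
row_choices = [
    [[0.0, 1.2]],
    [[0.5, 0.0], [0.0, 0.3]],
]
mu = 0.9
phi = scaling_vector(row_choices, mu)
print(phi)   # close to (9, 6)
```

With this $\varphi$, every row satisfies $\sum_j M_{ij} \varphi_j \le \mu \varphi_i$, so the rescaled game has discount factors at most $\mu$ even though the original one does not.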


Definition (Nonlinear spectral radii (Nussbaum, Mallet-Paret, 1998)). Let $h$ be a nonlinear continuous positively homogeneous map on a closed convex cone $C$ of $\mathbb{R}^n$ ($h(\lambda v) = \lambda h(v)$ for all $\lambda > 0$ and $v \in C$):

The cone eigenvalue spectral radius of $h$, $\hat{r}_C(h)$, is the maximal modulus of an eigenvalue of $h$ in $C$, where $\lambda$ is an eigenvalue associated to $v \in C \setminus \{0\}$ if $h(v) = \lambda v$.

The Collatz-Wielandt number $\mathrm{cw}_C(h)$ is the infimum of the super-eigenvalues of $h$, where $\lambda > 0$ is a super-eigenvalue if there exists $v$ in the interior of $C$ such that $h(v) \le \lambda v$.

Bonsall's spectral radius of $h$ is defined as:

$$r_C(h) := \inf_{k \ge 1} \| h^k \|_C^{1/k}, \quad \text{with } \| h \|_C := \sup_{x \in C, \ \| x \| = 1} \| h(x) \|,$$

for any given norm $\| \cdot \|$ on $\mathbb{R}^n$.

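When $C = \mathbb{R}^n_+$, all the above spectral radius notions of $h$ coincide:

$$r(h) = \inf_{k \ge 1} \| h^k \|_{\mathbb{R}^n_+}^{1/k} = \max \{ \lambda \in \mathbb{R} \mid \exists v \in \mathbb{R}^n_+ \setminus \{0\}, \ h(v) = \lambda v \} = \inf \{ \lambda > 0 \mid \exists v \in \operatorname{int}(\mathbb{R}^n_+), \ h(v) \le \lambda v \}.$$

Proposition (A., Gaubert, Nussbaum, arXiv 2011)

Assume that $h$ and $h_\pi$ are continuous, positively homogeneous, order preserving selfmaps of $\mathbb{R}^n_+$, for all $\pi \in \Pi$, and that $h(v) = \max_{\pi \in \Pi} h_\pi(v)$ for all $v \in \mathbb{R}^n_+$. Then

$$r(h) = \max_{\pi \in \Pi} r(h_\pi).$$

Applying the proposition to $h(v) := \max_{\sigma \in \Sigma} \max_{\delta \in \Delta} (M^{(\sigma\delta)} v)$, we get that $r(h) \le \lambda < \mu$ and so, by the theorem, there exists $\varphi \in \operatorname{int}(\mathbb{R}^n_+)$ such that $M^{(\sigma\delta)} \varphi \le h(\varphi) \le \mu \varphi$, for all $\sigma \in \Sigma$, $\delta \in \Delta$.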

Consider the value function of the game with mean-payoff:

$$\eta_x = \inf_{(\alpha_k)_{k \ge 0}} \ \sup_{(\beta_k)_{k \ge 0}} \ \limsup_{T \to \infty} \frac{1}{T} \, \mathbb{E} \Big[ \sum_{k=0}^{T-1} r_{X_k}^{\alpha_k \beta_k} \ \Big| \ X_0 = x \Big].$$

Let $F$ be the dynamic programming operator such that $\gamma_i^{ab} \equiv 1$.

$F$ is additively homogeneous. We say that $v \in \mathbb{R}^n$ is a (nonlinear additive) eigenvector or bias of $F$ with eigenvalue $\rho \in \mathbb{R}$ if $F(v) = \rho + v$.

If $\rho$ exists, then $\eta_x = \rho$ for all $x \in [n]$.

If all the matrices $M^{(\sigma\delta)}$ are irreducible, then $\rho$ exists and the eigenvector $v$ is unique up to an additive constant.

Other existence results for $\rho$: Bather, 1973; Gaubert, Gunawardena, 2001.

(Hoffman and Karp, 1966) We have to solve $\rho + v = F(v)$.

Using operators:

Given an initial policy $\sigma^0 \in \Sigma$, apply successively the two following steps for $s \ge 0$ until $\sigma^{s+1} = \sigma^s$:

1. Compute the additive eigenvalue and eigenvector $\rho^s$ and $v^s$ of $F^{(\sigma^s)}$, that is the solution of $\rho + v = F^{(\sigma^s)}(v)$;

2. Improve the policy: choose an optimal policy for $v^s$, that is $\sigma^{s+1} \in \Sigma$ such that $F(v^s) = F^{(\sigma^{s+1})}(v^s)$, with $\sigma^{s+1} = \sigma^s$ as soon as this is possible.

Step 1 is solved by using policy iteration for the (one-player) game with fixed policy $\sigma^s$, which constructs $\rho^{s,l}$, $v^{s,l}$ and $\delta^{s,l}$ from $\delta^{s,0}$.

Policy iterations for "irreducible" mean-payoff games

(Hoffman and Karp, 1966) We have to solve $\rho + v = F(v)$.

With control terminology:

Given an initial policy $\sigma^0 \in \Sigma$, apply successively the two following steps for $s \ge 0$ until $\sigma^{s+1} = \sigma^s$:

1. Compute the value $\rho^s$ and the bias $v^s$ of the game with fixed policy $\sigma^s$, that is the solution of $\rho + v = F^{(\sigma^s)}(v)$;

2. Improve the policy: choose an optimal policy for $v^s$, that is $\sigma^{s+1} \in \Sigma$ such that $F(v^s) = F^{(\sigma^{s+1})}(v^s)$, or equivalently:

$$\sigma_i^{s+1} \in \operatorname*{argmin}_{a \in A_i} \Big\{ \max_{b \in B_i} \Big( \sum_{j \in [n]} M_{ij}^{ab} v_j^s + r_i^{ab} \Big) \Big\}, \quad i \in [n],$$

with $\sigma^{s+1} = \sigma^s$ as soon as this is possible.


Policy iterations for "irreducible" mean-payoff games: monotone convergence

The sequence $(\rho^s)_{s \ge 0}$ is nonincreasing;

If $\rho^s = \rho^{s+1}$, then $v^s - v^{s+1}$ is constant and $v^s = v$.

Hence, the sequence $(\sigma^s)_{s \ge 0}$ does not visit the same policy twice, until it becomes stationary;

So the sequence $(\rho^s, v^s)_s$ is stationary after a finite time (at most $\sharp\Sigma$), up to an additive constant, and converges towards the solution $(\rho, v)$ of $\rho + v = F(v)$.

When $s$ is fixed, the sequence $(\rho^{s,l})_l$ is nondecreasing;

If $\rho^{s,l} = \rho^{s,l+1}$, then $v^{s,l} - v^{s,l+1}$ is constant and $v^{s,l} = v^s$.

Hence, the sequence $(\delta^{s,l})_l$ does not visit the same policy twice, until it becomes stationary;

So the sequence $(\rho^{s,l}, v^{s,l})_l$ is stationary after a finite time (at most $\sharp\Delta$), and converges towards the solution $(\rho^s, v^s)$ of $\rho + v = F^{(\sigma^s)}(v)$.

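For a Markov matrix $M$ and states $i$, $j$, denote:

$$T_{ij}(M) = \mathbb{E} \big[ \inf \{ k \ge 1 \mid X_k = j \} \mid X_0 = i \big],$$

the expected first return (or hitting) time in state $j$, starting from $i$.

Note that $T_{i i_0}(M) < +\infty$ for all $i \in [n]$ if and only if $M$ has a unique recurrent (final) class and $i_0$ belongs to it.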
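The times $T_{i i_0}(M)$ satisfy the linear system $T_{i i_0} = 1 + \sum_{j \ne i_0} M_{ij} T_{j i_0}$, which can be solved by a simple fixed point iteration when they are finite. The following Python sketch does so; the function name `expected_hitting_times`, the stopping tolerance and the 3-state chain are illustrative assumptions.

```python
def expected_hitting_times(M, target, tol=1e-12, max_iter=100_000):
    """Return [T_{i,target}(M) for all i] for a row-stochastic matrix M.

    Iterates t <- 1 + Q t from t = 0, where Q is M with the column of the
    target state set to zero. The iterates increase towards the hitting
    times when these are finite.
    """
    n = len(M)
    t = [0.0] * n
    for _ in range(max_iter):
        nxt = [1.0 + sum(M[i][j] * t[j] for j in range(n) if j != target)
               for i in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, t)) <= tol:
            return nxt
        t = nxt
    raise RuntimeError("no convergence: some hitting time may be infinite")


# Made-up 3-state Markov chain; state 0 plays the role of i_0.
M = [
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [1.0, 0.0, 0.0],
]
times = expected_hitting_times(M, target=0)
print(times)   # close to [2.5, 3.0, 1.0]
```

Here the largest entry, 3, is the smallest constant $K$ bounding the times $T_{i i_0}(M)$ for this particular chain.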

Theorem (A., Gaubert, arXiv:1310.4953)

Let us fix $K > 0$ and a state $i_0$. The policy iteration algorithm for the class of 2-player mean-payoff games such that

$$T_{i i_0}(M^{(\sigma\delta)}) \le K \quad \forall \sigma \in \Sigma, \ \delta \in \Delta, \ i \in [n]$$

is strongly polynomial. More precisely, the number of external iterations $s_{\max}$ satisfies:

$$s_{\max} \le (m_1 - n) \Big( 1 + \Big\lfloor \frac{\log(K)}{\log(K/(K-1))} \Big\rfloor \Big) = O\big( (m_1 - n) K \log K \big),$$

with $m_1$ = the total number of actions of the first player.

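Sketch of the proof. Let $\varphi \in (\mathbb{R}_+)^n$ be defined by:

$$\varphi_i = \max_{\sigma \in \Sigma} \max_{\delta \in \Delta} T_{i i_0}(M^{(\sigma\delta)}).$$

Let $Q^{(\sigma\delta)}$ be obtained from $M^{(\sigma\delta)}$ by putting its $i_0$-th column to zero.

Then $\varphi = \mathbf{1} + \max_{\sigma \in \Sigma} \max_{\delta \in \Delta} (Q^{(\sigma\delta)} \varphi)$.

Let $N^{(\sigma\delta)}$ be obtained from $M^{(\sigma\delta)}$ by replacing its $i_0$-th column by the nonnegative vector $(\varphi - \mathbf{1} - Q^{(\sigma\delta)} \varphi) / \varphi_{i_0}$.

$N^{(\sigma\delta)}$ has nonnegative entries and satisfies:

$$N^{(\sigma\delta)} \varphi = \varphi - \mathbf{1} \le \lambda \varphi \quad \text{with } \lambda = 1 - 1/K \quad \Rightarrow \quad r(N^{(\sigma\delta)}) \le \lambda.$$

Then the map

$$G(v) = \min_{\sigma \in \Sigma} \max_{\delta \in \Delta} \big( N^{(\sigma\delta)} v + r^{(\sigma\delta)} \big), \quad v \in \mathbb{R}^n,$$

satisfies the assumptions of the theorem for discounted games.

If $v_{i_0} = 0$, then $\rho + v = F(v) \Leftrightarrow \rho \varphi + v = G(\rho \varphi + v)$.

Hence, the sequences of policies $\sigma^s$ and $\delta^{s,l}$ for $F$ and $G$ are the same.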

[Figure: a Web graph with nodes numbered 1 to 21.]

Nodes = web pages. Arcs = hyperlinks.

21: spammer page.

1: non controlled page.

Associated Markov matrix $S$: $S_{ij} = 1/N_i$ if $(i,j)$ is a hyperlink, $S_{ij} = 0$ otherwise; $N_i$ = number of hyperlinks from $i$.

The PageRank is the invariant measure $\pi$ of $S$.

Let $v$ be the preference probability vector of the Web search engine.

Let $\alpha$ be a damping factor: the probability for a Web surfer to use the Web search engine.

Usually, one replaces $S$ by $\alpha S + (1-\alpha) \mathbf{1} v$, where $\mathbf{1} = (1 \cdots 1)^T$.
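The invariant measure of $\alpha S + (1-\alpha) \mathbf{1} v$ can be approximated by power iteration, since $\pi \mathbf{1} = 1$ gives $\pi = \alpha \, \pi S + (1-\alpha) v$. The Python sketch below takes $\alpha$ as the weight of $S$, exactly as in that formula; the function name `pagerank`, the stopping tolerance and the 3-page graph are assumptions made for illustration.

```python
def pagerank(S, v, alpha, tol=1e-12, max_iter=10_000):
    """Invariant measure pi of alpha * S + (1 - alpha) * 1 v (pi as a row vector).

    Iterates pi <- alpha * pi S + (1 - alpha) * v, starting from pi = v.
    """
    n = len(S)
    pi = list(v)
    for _ in range(max_iter):
        nxt = [alpha * sum(pi[i] * S[i][j] for i in range(n)) + (1.0 - alpha) * v[j]
               for j in range(n)]
        if sum(abs(a - b) for a, b in zip(nxt, pi)) <= tol:
            return nxt
        pi = nxt
    raise RuntimeError("no convergence within max_iter iterations")


# Made-up graph with 3 pages: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
S = [
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
]
v = [1.0 / 3.0] * 3          # uniform preference vector
alpha = 0.85
pi = pagerank(S, v, alpha)
print(pi)
```

The iteration contracts by a factor $\alpha$ in the $\ell_1$ norm at each step, so a few hundred iterations suffice for $\alpha = 0.85$.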

This is similar to considering the Markov matrix of the Web together with the Web search engine:

$$M = \begin{pmatrix} 0 & v \\ \alpha & (1-\alpha) S \end{pmatrix}.$$

If $r$ is an instantaneous reward such that $r_i = 1$ for $i = s$ and $0$ otherwise, then the mean-payoff is the PageRank (frequency of visit) $\pi_s$ of the spammer sites.

Optimizing the spammer site is a 1-player game with mean-payoff (see for instance (Fercoq, A., Bouhtou, Gaubert, IEEE TAC 2013)).

A zero-sum game problem:

$\sigma \in \Sigma$ is the policy of the Web search engine; it controls $v$ and wants to minimize the PageRank of the spammer site;

$\delta \in \Delta$ is the policy of the spammer; it controls the rows of $S$ with index in his site, and wants to maximize its PageRank.

All final classes of $M^{(\sigma\delta)}$ contain state $1$ (the Web search engine).

and to find a complexity result.

Related recent results for 1-player discounted games

(Post, Ye, 2012) show that the simplex algorithm for deterministic MDP (1-player games) is strongly polynomial independently of the discount factor: it stops after $O(n^5 m^2 \log^2 n)$ iterations, where $m$ is the number of possible actions by state (thus $m_1 = nm$).

(Scherrer, 2013) generalizes this result to stochastic MDP which satisfy a bound which may be seen (and is equal when the discount factor $\gamma$ tends to $1$) as a bound $\tau_r$ on the expected first return time to recurrent states and a bound $\tau_t$ on the expected exit time from transient states. Under these conditions the simplex algorithm stops after $O(n^3 m^2 \tau_r \tau_t \log^2(n \tau_r \tau_t))$ iterations.

(Scherrer, 2013) shows a similar result for the policy iteration algorithm for stochastic MDP (1-player games), when the set of transient states is independent of the strategy. Under these conditions the policy iteration algorithm stops after $n(m-1) \big( \lceil \tau_r \log(n \tau_r) \rceil + \lceil \tau_t \log(n \tau_t) \rceil \big)$ iterations.

However, this assumption implies that the recurrent classes are independent of the strategy.

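The policy iteration algorithm for the class of 2-player discounted games with fixed discount factor, $M^{(\sigma\delta)} = \gamma P^{(\sigma\delta)}$ with $\gamma < 1$, such that

$$T_{i i_0}(P^{(\sigma\delta)}) \le K \quad \forall \sigma \in \Sigma, \ \delta \in \Delta, \ i \in [n]$$

is strongly polynomial. More precisely, the number of external iterations $s_{\max}$ satisfies:

$$s_{\max} \le (m_1 - n) \Big( 1 + \Big\lfloor \frac{\log(K)}{\log(K/(K-1))} \Big\rfloor \Big) = O\big( (m_1 - n) K \log K \big),$$

with $m_1$ = the total number of actions of the first player.

Hence the bound does not depend on $\gamma$.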

For a Markov matrix $M$, a state $i$ and a set $C$ of states, denote:

$$T_{iC}(M) = \mathbb{E} \big[ \inf \{ k \ge 1 \mid X_k \in C \} \mid X_0 = i \big],$$

the expected first return (or hitting) time in set $C$, starting from $i$.

Theorem (A., Gaubert, 2014)

Let us fix $K > 0$ and a subset $C$ of states with cardinality $s$. The policy iteration algorithm for the class of 2-player multichain mean-payoff games such that for all $\sigma \in \Sigma$, $\delta \in \Delta$, each final class of $M^{(\sigma\delta)}$ contains exactly one element of $C$ and

$$T_{iC}(M^{(\sigma\delta)}) \le K \quad \forall i \in [n]$$

is strongly polynomial. More precisely, the number of external iterations $s_{\max}$ satisfies:

$$s_{\max} \le (m_1 - n) \Big( 1 + \Big\lfloor \frac{\log(sK)}{\log(sK/(sK-1))} \Big\rfloor \Big) = O\big( (m_1 - n) \, sK \log(sK) \big),$$

with $m_1$ = the total number of actions of the first player.

In general, $F$ may not have an additive eigenvalue and eigenvector, that is $\rho$ and $v$ such that $\rho + v = F(v)$.

If the action spaces $A_i$ and $B_i$ are finite for all $i \in [n]$, then $F$ is polyhedral, and since it is also nonexpansive, by the Kohlberg (1980) theorem, there exist $\eta$ and $v$ in $\mathbb{R}^n$ such that

$$F(t\eta + v) = (t+1)\eta + v, \quad \text{for } t \text{ large enough.}$$

$(\eta, v)$ is called an invariant half-line.

Then $\eta$ is the value of the game with mean-payoff.

Moreover, there exist $\hat{F}$ and $\acute{F}_{\cdot}$ such that $(\eta, v)$ is an invariant half-line if and only if it satisfies the system:

$$\begin{cases} \eta = \hat{F}(\eta), \\ \eta + v = \acute{F}_\eta(v). \end{cases}$$

However $v$ is not unique.

Policy iterations for multichain mean-payoff games

Construct a sequence of policies $\sigma^s$, values $\eta^s$ and bias $v^s$. They were introduced and proved to converge by

(Howard, 1960) and (Denardo and Fox, 1968) for 1-player multichain mean-payoff games,

(Vöge and Jurdziński, 2000) for parity games,

(Cochet-Terrasson, Gaubert, Gunawardena, 1998 and 1999), (Bjorklund, Sandberg, Vorobyov, 2004), (Jurdziński, Paterson, Zwick, 2006) for 2-player deterministic games,

(Cochet-Terrasson and Gaubert, 2006), (A., Cochet-Terrasson, Detournay, and Gaubert, arXiv:1208.0446, and CDC 2013),

(Detournay, PIGAMES library, 2012), (Bourque, Raghavan, preprint, 2012) for general multichain 2-player stochastic games.

(Detournay, 2012).

To avoid cycling, one needs to add some constraints on $v^s$, for instance:

fix the value $v_i^s = 0$ at one point $i$ of each final class of $M^{(\sigma\delta)}$ (Howard, and Denardo and Fox, for one-player games);

by a nonlinear projection (Cochet-Terrasson and Gaubert);

and to choose optimal policies in a conservative way.

The policy iteration algorithm is strongly polynomial when restricted to the class of games such that the spectral radii of all $M^{(\sigma\delta)}$ are bounded by $\lambda < 1$. This result is invariant by diagonal scaling.

The policy iteration algorithm for ergodic mean-payoff games is strongly polynomial when restricted to the class of ergodic games such that the expected first return (or hitting) time in some fixed state $i_0$ of the Markov chain associated to any $M^{(\sigma\delta)}$ and initial state is bounded by $K < \infty$.

Same result for discounted games.

Same result for multichain mean-payoff games, when $i_0$ is replaced by a set of states $C$, and each recurrence class contains exactly one element of $C$.

Open:

Is the policy iteration algorithm for multichain stochastic games strongly polynomial, under some more general constraints on the $M^{(\sigma\delta)}$ (only)?
