**8.4 Parametrized complexity bounds for stochastic mean payoﬀ games**

**8.4.2 Finding the states with maximal value**

In this section, we consider games that may not have a constant value. We show that a modification of the algorithms presented before can decide if a game has constant value or not and, in the latter case, find the states that have the maximal (or minimal) value. In order to do so, we need to prove an analogue of Corollary 8.23 for games that do not have a constant value. This requires using the specific properties of Shapley operators given in Section 6.2. We let χ̄ := max_{w} *χ*_{w} and χ̲ := min_{w} *χ*_{w}. Moreover, even though we do not suppose that the value is constant, we still use the quantities

*D*_{ub} := (n+*m*)*M*^{min{s,n+m}−1}, *R*_{ub} := 10(n+*m*)^{2}*W M*^{min{s,n+m}−1}

in our complexity bounds. Our method is based on the next lemma.

**Lemma 8.62.** *Let* *W* *⊂* [n]*⊎*[m] *denote the set of all states of the game with maximal value. Then, for every* *N* ⩾ 1 *and* *w* *∈* *W* *we have*

(T^{N}(0))_{w}/N ⩾ χ̄ − *R*_{ub}/N.

*Similarly, if* *W*^{′} *⊂* [n]*⊎*[m] *denotes the set of all states of the game with minimal value, then for all* *N* ⩾ 1 *and* *w*^{′} *∈* *W*^{′} *we have*

(T^{N}(0))_{w′}/N ⩽ χ̲ + *R*_{ub}/N.

*Proof.* Let *W* *⊂* [n]*⊎*[m] denote the set of all states of the game with maximal value. As shown in Lemma 6.14, *W* is a dominion (for Player Max), and the game induced by *W* is a constant value game with value equal to χ̄. Let *T̃*: T^{W} *→* T^{W} denote the Shapley operator of the induced game. Then, Lemma 8.55 shows that for every *ε* > 0 there exists a bias vector *ũ* *∈* R^{W} of *T̃*, *T̃*(ũ) = χ̄ + *ũ*, such that ∥ũ∥_{H} ⩽ *R*_{ub} + *ε*. Let *u* *∈* T^{n+m} be the vector defined as *u*_{w} := *ũ*_{w} for every *w* *∈* *W* and *u*_{w} := −∞ otherwise. As in the proof of Theorem 6.16, observe that for all *w* *∈* *W* we have χ̄ + *u*_{w} = (*T̃*(ũ))_{w} = (T(u))_{w} (because *W* is a dominion for Player Max) and χ̄ + *u*_{w} = −∞ ⩽ (T(u))_{w} for *w* ∉ *W*. Hence χ̄ + *u* ⩽ *T*(u). Furthermore, we have 0 ⩾ −**t(u)** + *u*. Hence *T*^{N}(0) ⩾ −**t(u)** + *T*^{N}(u) ⩾ Nχ̄ − **t(u)** + *u* for all *N* ⩾ 1. Hence, for every state *w* *∈* *W* we have

(T^{N}(0))_{w}/N ⩾ χ̄ + (−**t(u)** + *u*_{w})/N ⩾ χ̄ − ∥ũ∥_{H}/N ⩾ χ̄ − *R*_{ub}/N − *ε*/N.

Since *ε* > 0 was arbitrary, we obtain the claim. The proof of the other inequality is analogous.^{2}


Figure 8.9: Procedure that checks if a stochastic mean payoﬀ game has constant value.

Thanks to Lemma 8.62, we can now give a procedure that checks if a game has constant value. This is given in the procedure ConstantValue presented in Fig. 8.9.
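Since the content of Fig. 8.9 is characterized here only through Lemma 8.63 and Remark 8.64, the following Python sketch is a hypothetical rendering of ConstantValue, not the thesis's exact procedure: it replaces the approximate Oracle(f, ν, 8K²R_ub) calls by exact applications of the Shapley operator, and the name `constant_value` and the argument `n_states` are our own.

```python
def constant_value(T, n_states, K, R_ub):
    """Hypothetical sketch of ConstantValue(f, K) with f := T.

    The thesis performs N := 2*K*R_ub approximate value-iteration steps
    with precision governed by delta := (8*K^2*R_ub)**-1; this sketch
    iterates T exactly instead.
    """
    N = int(2 * K * R_ub)
    nu = [0.0] * n_states
    for _ in range(N):  # value iteration: nu approximates T^N(0)
        nu = T(nu)
    # For a constant-value game the spread t(nu) - b(nu) stays below
    # 2*R_ub + 1 (Remark 8.35); otherwise the values differ by at least
    # 1/D_ub^2 and the spread grows linearly in N.
    if max(nu) - min(nu) < 2 * R_ub + 1:
        return True, None, None
    z = min(range(n_states), key=lambda w: nu[w])        # not of maximal value
    z_prime = max(range(n_states), key=lambda w: nu[w])  # not of minimal value
    return False, z, z_prime
```

On a toy operator with two independent states, one earning 1 per step and one earning 0, the sketch reports a non-constant value and returns the slow state as the witness `z`.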

**Lemma 8.63.** *If we put* *f* := *T* *and* *K* := 3D^{2}_{ub}*, then the procedure* ConstantValue(f, K) *is correct.*

*Proof.* Let *N* := 2KR_{ub}, *δ* := (8K^{2}*R*_{ub})^{−1}, and take the vector *ν* obtained at the end of the loop. If the game has constant value, then Remark 8.35 shows that **t(ν)** − **b(ν)** < 2R_{ub} + 1. Conversely, if χ̄ ≠ χ̲, then by Lemma 8.54 we have χ̄ − χ̲ ⩾ 1/D_{ub}^{2}. Furthermore, by Lemmas 8.29

Hence, the procedure correctly decides if the game has constant value. Moreover, if *w* is such that *ν*_{w} = **t(ν)**, then by Lemma 8.29 we have

^{2}The proof requires dualizing the notion of a dominion, as discussed in Remark 6.19.

168 Chapter 8. Condition number of stochastic mean payoﬀ games

1: **procedure** Extend(Z)

2: *▷ Z* *⊂V*_{Min}*⊎V*_{Max} *is a nonempty subset of states of a stochastic mean payoﬀ game*

3: **while** True **do**

4: **if** exists *k∈V*Min*\Z,a∈A*^{(k)},*z∈Z* such that*p*^{a}_{kz}*>*0 **then**

5: *Z* :=*Z∪ {k}*

6: **else**

7: **if** exists *i∈V*_{Max}*\Z* such that for all *b∈B*^{(i)} there is *z*_{b}*∈Z* with *p*^{b}_{iz_{b}} > 0 **then**

8: *Z* :=*Z∪ {i}*

9: **else**

10: **return** *Z*

11: **end**

12: **end**

13: **done**

14: **end**

Figure 8.10: Procedure that extends the set lying outside some dominion.

*Remark* 8.64. It is immediate to see that the complexity of the procedure above is the same as the complexity of ApproximateValue described in Theorem 8.58. Indeed, it makes *O*(KR_{ub}) calls to the Oracle(f, ν, 8K^{2}*R*_{ub}) and every such call uses *O*(*a*_{ct}(n+*m*)) arithmetic operations.

If the game does not have a constant value, then ConstantValue outputs a state that does not attain the maximal value. We now want to show how to find all such states. To this end we introduce the procedure Extend that may be used to enlarge the set of states that do not attain the maximal value. The following lemma explains the behavior of this procedure.

We denote by *V* :=*V*_{Min}*⊎V*_{Max} the set of states of a stochastic mean payoﬀ game.

**Lemma 8.65.** *Procedure* Extend *has the following properties. If* *W* *⊂V* *is a dominion (for*
*Player Max) and* *Z∩W* =*∅, then*Extend(Z)*∩W* =*∅. Furthermore, the set* *V* *\*Extend(Z)
*is a dominion.*

*Proof.* To prove the first claim, suppose that *Z∩W* = *∅*. If *k* *∈* *V*_{Min}*∩W* is a state belonging to *W* then, by definition, Player Min cannot leave *W*, i.e., ∑_{w∈W} *p*^{a}_{kw} = 1 for every action *a* *∈* *A*^{(k)}. In particular, *k* does not satisfy the condition of the first conditional statement of Extend. Likewise, if *i* *∈* *V*_{Max}*∩W* is a state belonging to *W*, then Player Max has a possibility to stay in *W*, i.e., there is an action *b* *∈* *B*^{(i)} such that ∑_{w∈W} *p*^{b}_{iw} = 1. In particular, *i* does not satisfy the condition of the second conditional statement of Extend. Hence, we have Extend(Z) *∩* *W* = *∅*. To prove the second statement, let *W̃* := *V* *\* Extend(Z). Note that the procedure Extend stops only when both of its conditional statements are not satisfied. Therefore, for every state *k* *∈* *V*_{Min}*∩W̃* and every action *a* *∈* *A*^{(k)} we have ∑_{w̃∈W̃} *p*^{a}_{kw̃} = 1. Moreover, for every state *i* *∈* *V*_{Max}*∩W̃* there is an action *b* *∈* *B*^{(i)} such that ∑_{w̃∈W̃} *p*^{b}_{iw̃} = 1. Hence, *W̃* is a dominion.

*Remark* 8.66. We point out that Extend can be implemented to work in *O*(*a*_{ct}(n+*m*)^{2}) complexity. Indeed, checking if either of the conditional statements is satisfied can be done by listing all the actions and, for every action, checking if the states to which this action may lead belong to *Z*. Furthermore, the procedure stops after at most *n*+*m* of these verifications.
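For concreteness, the pseudocode of Fig. 8.10 can be implemented as follows. This is a sketch under an assumed data layout (each action is encoded as a dictionary mapping successor states to transition probabilities; the names are ours). Instead of restarting the scan after each added state, it performs full passes until a fixpoint is reached; the resulting set is the same.

```python
def extend(Z, min_actions, max_actions):
    """Sketch of Extend(Z) from Fig. 8.10.

    min_actions[k] / max_actions[i] list the actions of each Min/Max
    state as {successor: probability} dictionaries.
    """
    Z = set(Z)
    while True:
        grew = False
        # first condition: a Min state with SOME action entering Z
        for k, actions in min_actions.items():
            if k not in Z and any(p.get(z, 0) > 0 for p in actions for z in Z):
                Z.add(k)
                grew = True
        # second condition: a Max state ALL of whose actions enter Z
        for i, actions in max_actions.items():
            if i not in Z and all(any(p.get(z, 0) > 0 for z in Z) for p in actions):
                Z.add(i)
                grew = True
        if not grew:
            return Z
```

On a four-state toy game, a Min state whose only action is a self-loop is never added: the complement of the returned set is then a dominion, in line with Lemma 8.65.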

*Remark* 8.67. A more abstract, but equivalent, way of thinking about the procedure Extend is the following. Given *Z*, we define a vector *u* *∈* T^{n+m} as *u*_{w} := −∞ for *w* *∈* *Z* and *u*_{w} := 0

1: **procedure** TopClass(T)

2: *▷ V* *is the set of states of the game with operatorT*

3: **while** True **do**

4: (ConstVal, z, z^{′}) := ConstantValue(T, 3D_{ub}^{2})

5: **if** ConstVal = True **then**

6: **return** *V* *▷ V* *is the set of states that have the maximal value*

7: **end**

8: *V* :=*V* *\*Extend(z)

9: let *T*˜ denote the Shapley operator of the game induced by*V*

10: *T* := ˜*T*

11: **done**

12: **end**

Figure 8.11: Procedure that finds the set of states with maximal value.

otherwise. We compute *T*(u) and, if there is a state *w* ∉ *Z* such that (T(u))_{w} = −∞, then we add *w* to *Z*.
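A single step of this reformulation can be sketched as follows (a sketch under the same assumed encoding of actions as dictionaries of transition probabilities; the names are ours). Since all payoffs are finite, (T(u))_{w} = −∞ exactly when the optimizing choice at *w* is forced through a −∞ coordinate of *u*, so the payoffs can be omitted from the computation.

```python
import math

def tropical_step(Z, min_actions, max_actions):
    """One step of the Remark: u_w = -inf on Z and 0 elsewhere; a state
    outside Z joins Z when (T(u))_w = -inf."""
    def exp_u(dist):
        # the expectation of u is -inf iff the action charges Z with positive probability
        return -math.inf if any(p > 0 and w in Z for w, p in dist.items()) else 0.0
    new = set(Z)
    for k, actions in min_actions.items():   # Min: minimum over actions is -inf
        if k not in Z and min(exp_u(d) for d in actions) == -math.inf:
            new.add(k)                       # iff SOME action enters Z
    for i, actions in max_actions.items():   # Max: maximum over actions is -inf
        if i not in Z and max(exp_u(d) for d in actions) == -math.inf:
            new.add(i)                       # iff ALL actions enter Z
    return new
```

Iterating `tropical_step` until the set stabilizes computes the same fixpoint as Extend(Z).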

Having the procedure Extend, we can now obtain an algorithm that finds the set of states having the maximal value. This is done by the procedure TopClass presented in Fig. 8.11.
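The loop of Fig. 8.11 can be rendered as a self-contained toy sketch. Everything below is illustrative rather than the thesis's procedure: the data layout is assumed, exact value iteration stands in for ConstantValue, and `n_iter` and `threshold` are ad hoc placeholders for the calibrated quantities N = 2KR_ub and 2R_ub + 1.

```python
def top_class(game, n_iter=60, threshold=10.0):
    """Toy sketch of TopClass (Fig. 8.11).

    game maps each state to ("min" | "max", [(payoff, {succ: prob}), ...]).
    """
    def induced_step(V, v):
        # Shapley operator of the game induced by V: keep only the
        # actions whose support stays inside V
        out = {}
        for s in V:
            player, actions = game[s]
            vals = [pay + sum(p * v[t] for t, p in dist.items())
                    for pay, dist in actions
                    if all(t in V for t in dist)]
            out[s] = min(vals) if player == "min" else max(vals)
        return out

    def extend(Z, V):
        # Fig. 8.10 restricted to the current state set V
        Z = set(Z)
        while True:
            grew = False
            for s in V - Z:
                player, actions = game[s]
                kept = [d for _, d in actions if all(t in V for t in d)]
                enters = [any(d.get(z, 0) > 0 for z in Z) for d in kept]
                if (player == "min" and any(enters)) or \
                   (player == "max" and all(enters)):
                    Z.add(s)
                    grew = True
            if not grew:
                return Z

    V = set(game)
    while True:
        v = {s: 0.0 for s in V}
        for _ in range(n_iter):            # value iteration, as in ConstantValue
            v = induced_step(V, v)
        if max(v.values()) - min(v.values()) < threshold:
            return V                       # constant value: V is the top class
        z = min(v, key=v.get)              # a state not of maximal value
        V = V - extend({z}, V)             # keep a dominion containing the top class
```

On a two-state example where Max earns 1 per turn in a self-loop while Min can stay at payoff 0, the sketch returns only the Max state.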

**Theorem 8.68.** *The procedure* TopClass *is correct and it finds the set of states with maximal value of a stochastic mean payoﬀ game. Moreover, it performs* *O*(*a*_{ct}(n+m)^{6}*W M*^{3(min{s,n+m}−1)}) *arithmetic operations and every number that appears during these operations can be represented using* *O*((n+*m*) log((n+*m*)*M W*)) *bits of memory.*

*Proof.* Let*W* denote the set of all states with the maximal value. By Lemma 8.63, the procedure
ConstantValue(T,3D_{ub}^{2} ) correctly decides if the game has constant value and, if not, it
outputs a state *z* *∈* *V* such that *z /∈* *W*. Furthermore, by Lemma 6.14, *W* is a dominion.

Hence, by Lemma 8.65, the set *V*˜ :=*V* *\*Extend(z) is a dominion and *W* *⊂V*˜. Furthermore,
note that the game induced by *V*˜ has the same set of states with the maximal value as the
original game. Indeed, by Lemma 6.13, the value of every state in the game induced by *V*˜
is not greater than the value of the same state in the original game. Moreover, *W* is still
a dominion in the game induced by *V*˜, and the game induced by *W* does not change when
passing from the original game to the game induced by *V*˜. Hence, by Lemma 6.14, the value
of every state in *W* is the same in all of these three games. Therefore, the problem of finding
*W* reduces to the problem of finding the states with maximal value in the game induced by*V*˜.
This shows that TopClass is correct. To show the complexity bound, note that the calls to ConstantValue are the most expensive operations in TopClass because a call to Extend and reducing the game can be done in *O*(*a*_{ct}(n+*m*)^{2}) complexity. Moreover, as noted in Remark 8.64, ConstantValue(T, 3D^{2}_{ub}) has the same asymptotic complexity as the problem of solving a stochastic mean payoﬀ game presented in Theorem 8.58. Finally, TopClass does at most *n*+*m* calls to ConstantValue(T, 3D^{2}_{ub}), hence the claimed bound.

*Remark* 8.69. Given the procedure TopClass, one can easily find the maximal value of the
game. Indeed, it is enough to first use TopClass to determine the set of states with maximal
value, then restrict the game to this set of states (using the fact that this set is a dominion)
and solve the remaining game by Theorem 8.58. The most expensive operation is TopClass,
hence the asymptotic complexity of finding the maximal value is the same as the complexity

of TopClass indicated above. Furthermore, one can dualize the procedure Extend to the dominions of Player Min. This leads to the procedure BottomClass that finds the minimal value and the set of states that attain this value in the same complexity as TopClass.

*Remark* 8.70. In [BEGM15], it is shown that given an oracle solving TopClass, and another oracle that solves deterministic mean payoﬀ games, one can solve arbitrary stochastic mean payoﬀ games with a pseudopolynomial number of calls to these oracles, provided that the number of nondeterministic actions of the game is fixed. The authors of [BEGM15] present an oracle solving TopClass that is based on the pumping algorithm. Our analysis shows that this oracle can be replaced by an oracle based on value iteration.

*Remark* 8.71. As in the previous section (Remark 8.60), the results presented here extend to the case of bipartite games described by bipartite Shapley operators. Indeed, if *F*: T^{V} *→* T^{V} is such an operator, and the quantities *D*_{ub}, *R*_{ub} are chosen according to Example 8.38, then for any *N* ⩾ 1 and any state *k* *∈* *V*_{Min} with maximal value χ̄ we have

(F^{N}(0))_{k}/N ⩾ 2χ̄ − *R*_{ub}/N.

Similarly, for any state *l* *∈* *V*_{Min} with minimal value χ̲ we have

(F^{N}(0))_{l}/N ⩽ 2χ̲ + *R*_{ub}/N.

The proof of these inequalities proceeds in the same way as the proof of Lemma 8.62. More precisely, if *W* is the set of states of the game with maximal value, then *W̃* := *V*(W *∩* *V*_{Min}) is a dominion and every state in the game induced by *W̃* has value χ̄.^{3} Therefore, if *F̃*: T^{W∩V_{Min}} *→* T^{W∩V_{Min}} is the bipartite Shapley operator of the induced game, then Lemmas 6.22 and 8.55 show that there is a bias vector *ũ* *∈* R^{W∩V_{Min}} such that *F̃*(ũ) = 2χ̄ + *ũ*. Hence, for every *ε* > 0 we can take a bias vector *ũ* such that ∥ũ∥_{H} ⩽ *R*_{ub} + *ε*. As in the proof of Lemma 8.62, we define *u* *∈* T^{V_{Min}} as *u*_{k} := *ũ*_{k} if *k* *∈* *W* *∩* *V*_{Min} and *u*_{k} := −∞ otherwise, and we observe that 2χ̄ + *u* ⩽ *F*(u). This gives the desired inequality. Furthermore, by applying Theorem 8.37 to *F̃* we get that the denominator of 2χ̄ is bounded by *D*_{ub} and, similarly, the denominator of 2χ̲ is bounded by *D*_{ub}. Hence, we can use the procedure ConstantValue for *F* to decide if the game has constant value. We can also easily adapt the procedure Extend to *F* (e.g., by using the observation given in Remark 8.67). As a result, we can find the maximal (or minimal) value of the game and all the states controlled by Player Min that attain it in *O*(nD_{ub}^{2}*R*_{ub}*c*_{eval}) arithmetic operations, where *c*_{eval} and *D*_{ub}, *R*_{ub} are as in Remark 8.60, and all the numbers that occur during these operations can be represented using *O*(log(M D_{ub}*R*_{ub}*W*)) bits of memory.