# Finding the states with maximal value

In the document The DART-Europe E-theses Portal (pages 168-172)

## 8.4 Parametrized complexity bounds for stochastic mean payoff games

### 8.4.2 Finding the states with maximal value

In this section, we consider games that may not have a constant value. We show that a modification of the algorithms presented before can decide whether a game has constant value and, if it does not, find the states that have the maximal (or minimal) value. To do so, we need to prove an analogue of Corollary 8.23 for games that do not have a constant value. This requires using the specific properties of Shapley operators given in Section 6.2. We let $\overline{\chi} := \max_w \chi_w$ and $\underline{\chi} := \min_w \chi_w$. Moreover, even though we do not suppose that the value is constant, we still use the quantities $D_{\mathrm{ub}} := (n+m)M^{\min\{s,n+m-1\}}$ and $R_{\mathrm{ub}} := 10(n+m)^2 W M^{\min\{s,n+m-1\}}$ in our complexity bounds. Our method is based on the next lemma.

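Lemma 8.62. Let $\overline{W} \subset [n] \uplus [m]$ denote the set of all states of the game with maximal value. Then, for every $N \geq 1$ and $w \in \overline{W}$ we have

$$\frac{(T^N(0))_w}{N} \geq \overline{\chi} - \frac{R_{\mathrm{ub}}}{N}.$$

Similarly, if $\underline{W} \subset [n] \uplus [m]$ denotes the set of all states of the game with minimal value, then for all $N \geq 1$ and $w \in \underline{W}$ we have

$$\frac{(T^N(0))_w}{N} \leq \underline{\chi} + \frac{R_{\mathrm{ub}}}{N}.$$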

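Proof. Let $\overline{W} \subset [n] \uplus [m]$ denote the set of all states of the game with maximal value. As shown in Lemma 6.14, $\overline{W}$ is a dominion (for Player Max), and the game induced by $\overline{W}$ is a constant value game with value equal to $\overline{\chi}$. Let $\tilde{T} \colon \mathbb{T}^{\overline{W}} \to \mathbb{T}^{\overline{W}}$ denote the Shapley operator of the induced game. Then, Lemma 8.55 shows that for every $\varepsilon > 0$ there exists a bias vector $\tilde{u} \in \mathbb{R}^{\overline{W}}$ of $\tilde{T}$, $\tilde{T}(\tilde{u}) = \overline{\chi} + \tilde{u}$, such that $\|\tilde{u}\|_H \leq R_{\mathrm{ub}} + \varepsilon$. Let $u \in \mathbb{T}^{n+m}$ be the vector defined as $u_w := \tilde{u}_w$ for every $w \in \overline{W}$ and $u_w := -\infty$ otherwise. As in the proof of Theorem 6.16, observe that for all $w \in \overline{W}$ we have $\overline{\chi} + u_w = (\tilde{T}(\tilde{u}))_w = (T(u))_w$ (because $\overline{W}$ is a dominion for Player Max) and $\overline{\chi} + u_w = -\infty \leq (T(u))_w$ for $w \notin \overline{W}$. Hence $\overline{\chi} + u \leq T(u)$. Furthermore, we have $0 \geq -\mathbf{t}(u) + u$. Hence

$$T^N(0) \geq -\mathbf{t}(u) + T^N(u) \geq N\overline{\chi} - \mathbf{t}(u) + u$$

for all $N \geq 1$. Hence, for every state $w \in \overline{W}$ we have

$$\frac{(T^N(0))_w}{N} \geq \overline{\chi} + \frac{-\mathbf{t}(u) + u_w}{N} \geq \overline{\chi} - \frac{\|\tilde{u}\|_H}{N} \geq \overline{\chi} - \frac{R_{\mathrm{ub}}}{N} - \frac{\varepsilon}{N}.$$

Since $\varepsilon > 0$ was arbitrary, we obtain the claim. The proof of the other inequality is analogous.²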
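To make the two inequalities of Lemma 8.62 concrete, the following Python sketch runs value iteration on a small hypothetical game and checks them numerically. The game itself, its dictionary encoding, and the form of the Shapley operator used here (a minimum over the actions of Player Min and a maximum over the actions of Player Max of payoff plus expected continuation) are assumptions made for illustration. So is the reading of $n, m$ as the numbers of Min and Max states, $W$ as a bound on the absolute payoffs, $M$ as a common denominator of the probabilities, and $s$ as the number of nondeterministic actions.

```python
from fractions import Fraction

# Hypothetical toy game. Each action is (payoff, {successor: probability}).
MIN_ACTIONS = {
    "k1": [(0, {"k2": Fraction(1)})],
    "k2": [(0, {"k2": Fraction(1)})],
}
MAX_ACTIONS = {
    "i1": [(1, {"i1": Fraction(1)}), (0, {"k1": Fraction(1)})],
    "i2": [(2, {"k1": Fraction(1)}),
           (0, {"k2": Fraction(1, 2), "i1": Fraction(1, 2)})],
}

def shapley(x):
    """One application of the assumed Shapley operator T to the vector x."""
    def ev(action):
        payoff, dist = action
        return payoff + sum(p * x[w] for w, p in dist.items())
    out = {k: min(ev(a) for a in acts) for k, acts in MIN_ACTIONS.items()}
    out.update({i: max(ev(a) for a in acts) for i, acts in MAX_ACTIONS.items()})
    return out

def iterate(steps):
    """Return T^steps(0), computed exactly with rational arithmetic."""
    x = {w: Fraction(0) for w in list(MIN_ACTIONS) + list(MAX_ACTIONS)}
    for _ in range(steps):
        x = shapley(x)
    return x

# Parameters under the assumed reading of the symbols.
n, m, W, M, s = 2, 2, 2, 2, 1
R_ub = 10 * (n + m) ** 2 * W * M ** min(s, n + m - 1)

# By inspection, the values are 1 at i1, 1/2 at i2 and 0 at k1 and k2.
chi_max, top_states = Fraction(1), {"i1"}
chi_min, bottom_states = Fraction(0), {"k1", "k2"}

def lemma_holds(N):
    """Check both inequalities of Lemma 8.62 for a given N >= 1."""
    x = iterate(N)
    lower = all(x[w] / N >= chi_max - Fraction(R_ub, N) for w in top_states)
    upper = all(x[w] / N <= chi_min + Fraction(R_ub, N) for w in bottom_states)
    return lower and upper
```

On this example the bound is very loose, since $R_{\mathrm{ub}}$ is a worst-case quantity; the sketch only illustrates what is being compared, not how tight the lemma is.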


Figure 8.9: Procedure that checks if a stochastic mean payoff game has constant value.

Thanks to Lemma 8.62, we can now give a procedure that checks if a game has constant value. This is given in the procedure ConstantValue presented in Fig. 8.9.

Lemma 8.63. If we put $f := T$ and $K := 3D_{\mathrm{ub}}^2$, then the procedure ConstantValue$(f, K)$ is correct.

Proof. Let $N := 2KR_{\mathrm{ub}}$, $\delta := (8K^2R_{\mathrm{ub}})^{-1}$, and take the vector $\nu$ obtained at the end of the loop. If the game has constant value, then Remark 8.35 shows that $\mathbf{t}(\nu) - \mathbf{b}(\nu) < 2R_{\mathrm{ub}} + 1$.

Conversely, if $\overline{\chi} \neq \underline{\chi}$, then by Lemma 8.54 we have $\overline{\chi} - \underline{\chi} \geq 1/D_{\mathrm{ub}}^2$. Furthermore, by Lemmas 8.29

Hence, the procedure correctly decides if the game has constant value. Moreover, if $w$ is such that $\nu_w = \mathbf{t}(\nu)$, then by Lemma 8.29 we have

² The proof requires dualizing the notion of a dominion, as discussed in Remark 6.19.


    procedure Extend(Z)
        ▷ Z ⊂ V_Min ⊎ V_Max is a nonempty subset of states of a stochastic mean payoff game
        while True do
            if there exist k ∈ V_Min \ Z, a ∈ A(k), z ∈ Z such that p^a_{kz} > 0 then
                Z := Z ∪ {k}
            else
                if there exists i ∈ V_Max \ Z such that for all b ∈ B(i) there is z_b ∈ Z with p^b_{i z_b} > 0 then
                    Z := Z ∪ {i}
                else
                    return Z
                end
            end
        done
    end

Figure 8.10: Procedure that extends the set lying outside some dominion.

Remark 8.64. The complexity of the procedure ConstantValue is the same as the complexity of ApproximateValue described in Theorem 8.58. Indeed, it makes $O(KR_{\mathrm{ub}})$ calls to Oracle$(f, \nu, 8K^2R_{\mathrm{ub}})$, and every such call uses $O(\mathrm{act}\,(n+m))$ arithmetic operations.
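To get a feeling for the size of these quantities, the following Python sketch evaluates $D_{\mathrm{ub}}$, $R_{\mathrm{ub}}$, $K = 3D_{\mathrm{ub}}^2$, $N = 2KR_{\mathrm{ub}}$ and $\delta = (8K^2R_{\mathrm{ub}})^{-1}$ from the formulas of this section. The function name and the returned dictionary are hypothetical; the inputs are simply the symbols $n, m, s, M, W$ appearing in the definitions of $D_{\mathrm{ub}}$ and $R_{\mathrm{ub}}$.

```python
from fractions import Fraction

def value_iteration_parameters(n, m, s, M, W):
    """Evaluate the quantities used by ConstantValue(T, 3 * D_ub**2)."""
    e = min(s, n + m - 1)                    # exponent shared by D_ub and R_ub
    D_ub = (n + m) * M ** e
    R_ub = 10 * (n + m) ** 2 * W * M ** e
    K = 3 * D_ub ** 2
    N = 2 * K * R_ub                         # N from the proof of Lemma 8.63
    delta = Fraction(1, 8 * K ** 2 * R_ub)   # delta from the same proof
    return {"D_ub": D_ub, "R_ub": R_ub, "K": K, "N": N, "delta": delta}
```

For instance, `value_iteration_parameters(1, 1, 0, 1, 1)` gives $D_{\mathrm{ub}} = 2$, $R_{\mathrm{ub}} = 40$, $K = 12$ and $N = 960$, which shows how quickly the bounds grow even for tiny inputs.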

If the game does not have a constant value, then ConstantValue outputs a state that does not attain the maximal value. We now want to show how to find all such states. To this end we introduce the procedure Extend that may be used to enlarge the set of states that do not attain the maximal value. The following lemma explains the behavior of this procedure.

We denote by $V := V_{\mathrm{Min}} \uplus V_{\mathrm{Max}}$ the set of states of a stochastic mean payoff game.

Lemma 8.65. Procedure Extend has the following properties. If $W \subset V$ is a dominion (for Player Max) and $Z \cap W = \emptyset$, then Extend$(Z) \cap W = \emptyset$. Furthermore, the set $V \setminus$ Extend$(Z)$ is a dominion.

Proof. To prove the first claim, suppose that $Z \cap W = \emptyset$. If $k \in V_{\mathrm{Min}} \cap W$ is a state belonging to $W$ then, by definition, Player Min cannot leave $W$, i.e., $\sum_{w \in W} p^a_{kw} = 1$ for every action $a \in A(k)$. In particular, $k$ does not satisfy the condition of the first conditional statement of Extend. Likewise, if $i \in V_{\mathrm{Max}} \cap W$ is a state belonging to $W$, then Player Max has a possibility to stay in $W$, i.e., there is an action $b \in B(i)$ such that $\sum_{w \in W} p^b_{iw} = 1$. In particular, $i$ does not satisfy the condition of the second conditional statement of Extend. Hence no state of $W$ is ever added to $Z$, and we have Extend$(Z) \cap W = \emptyset$. To prove the second statement, let $\tilde{W} := V \setminus$ Extend$(Z)$. Note that the procedure Extend stops only when both of its conditional statements are not satisfied. Therefore, for every state $k \in V_{\mathrm{Min}} \cap \tilde{W}$ and every action $a \in A(k)$ we have $\sum_{\tilde{w} \in \tilde{W}} p^a_{k\tilde{w}} = 1$. Moreover, for every state $i \in V_{\mathrm{Max}} \cap \tilde{W}$ there is an action $b \in B(i)$ such that $\sum_{\tilde{w} \in \tilde{W}} p^b_{i\tilde{w}} = 1$. Hence, $\tilde{W}$ is a dominion.

Remark 8.66. We point out that Extend can be implemented to work in $O(\mathrm{act}\,(n+m)^2)$ complexity. Indeed, checking if either of the conditional statements is satisfied can be done by listing all the actions and, for every action, checking if the states to which this action may lead belong to $Z$. Furthermore, the procedure stops after at most $n+m$ of these verifications.

    procedure TopClass(T)
        ▷ V is the set of states of the game with operator T
        while True do
            (ConstVal, z̲, z̄) := ConstantValue(T, 3·D_ub²)
            if ConstVal = True then
                return V    ▷ V is the set of states that have the maximal value
            end
            V := V \ Extend(z̲)
            let T̃ denote the Shapley operator of the game induced by V
            T := T̃
        done
    end

Figure 8.11: Procedure that finds the set of states with maximal value.

Remark 8.67. A more abstract, but equivalent, way of thinking about the procedure Extend is the following. Given $Z$, we define a vector $u \in \mathbb{T}^{n+m}$ as $u_w := -\infty$ for $w \in Z$ and $u_w := 0$ otherwise. We compute $T(u)$ and, if there is a state $w \notin Z$ such that $(T(u))_w = -\infty$, then we add $w$ to $Z$.
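The operator-based view of Remark 8.67 can be sketched in Python with floating-point infinities. As before, the toy game, its dictionary encoding and the min/max form of the Shapley operator are assumptions made for illustration, and the names are hypothetical. The sketch adds all states with $(T(u))_w = -\infty$ in one pass; since the two conditions of Extend are monotone in $Z$, this yields the same final set as adding them one at a time.

```python
import math

# Hypothetical toy game. Each action is (payoff, {successor: probability}),
# and only successors with positive probability are listed.
MIN_ACTIONS = {
    "k1": [(0, {"k2": 1.0})],
    "k2": [(0, {"k2": 1.0})],
}
MAX_ACTIONS = {
    "i1": [(1, {"i1": 1.0}), (0, {"k1": 1.0})],
    "i2": [(2, {"k1": 1.0}), (0, {"k2": 0.5, "i1": 0.5})],
}
STATES = list(MIN_ACTIONS) + list(MAX_ACTIONS)

def shapley(x):
    """Assumed Shapley operator: min over Min actions, max over Max actions."""
    def ev(action):
        payoff, dist = action
        return payoff + sum(p * x[w] for w, p in dist.items())
    out = {k: min(ev(a) for a in acts) for k, acts in MIN_ACTIONS.items()}
    out.update({i: max(ev(a) for a in acts) for i, acts in MAX_ACTIONS.items()})
    return out

def extend_via_operator(Z):
    """Grow Z by repeatedly applying T to the vector that is -inf on Z and 0 elsewhere."""
    Z = set(Z)
    while True:
        u = {w: (-math.inf if w in Z else 0.0) for w in STATES}
        Tu = shapley(u)
        new = {w for w in STATES if w not in Z and Tu[w] == -math.inf}
        if not new:
            return Z
        Z |= new
```

A Min state receives $-\infty$ as soon as one of its actions may enter $Z$, whereas a Max state receives $-\infty$ only if all of its actions may enter $Z$, which is exactly the pair of tests made by Extend.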

Having the procedure Extend, we can now obtain an algorithm that finds the set of states having the maximal value. This is done by the procedure TopClass presented in Fig. 8.11.
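The control flow of TopClass can be sketched in Python as follows. This is only an illustration under several assumptions: the game encoding and the toy game are hypothetical, the induced game is taken to keep the states of $V$ together with the actions whose successors all lie in $V$, and `constant_value_stub` is a crude floating-point stand-in that runs a fixed number of value iteration steps. It is not the procedure ConstantValue of Fig. 8.9 and carries none of its guarantees.

```python
# Hypothetical toy game. Each action is (payoff, {successor: probability}).
MIN_ACTIONS = {
    "k1": [(0, {"k2": 1.0})],
    "k2": [(0, {"k2": 1.0})],
}
MAX_ACTIONS = {
    "i1": [(1, {"i1": 1.0}), (0, {"k1": 1.0})],
    "i2": [(2, {"k1": 1.0}), (0, {"k2": 0.5, "i1": 0.5})],
}

def shapley(x, min_actions, max_actions):
    """Assumed Shapley operator: min over Min actions, max over Max actions."""
    def ev(action):
        payoff, dist = action
        return payoff + sum(p * x[w] for w, p in dist.items())
    out = {k: min(ev(a) for a in acts) for k, acts in min_actions.items()}
    out.update({i: max(ev(a) for a in acts) for i, acts in max_actions.items()})
    return out

def constant_value_stub(min_actions, max_actions, steps=400, tol=0.1):
    """Stand-in for ConstantValue (NOT the thesis procedure): compare T^N(0)/N across states."""
    states = list(min_actions) + list(max_actions)
    x = dict.fromkeys(states, 0.0)
    for _ in range(steps):
        x = shapley(x, min_actions, max_actions)
    low, high = min(states, key=x.get), max(states, key=x.get)
    return (x[high] - x[low]) / steps < tol, low, high

def reaches(dist, Z):
    """True if an action with distribution dist enters Z with positive probability."""
    return any(p > 0 and z in Z for z, p in dist.items())

def extend(Z, min_actions, max_actions):
    """Mirror of Extend: grow Z until neither conditional statement applies."""
    Z = set(Z)
    while True:
        k = next((k for k, acts in min_actions.items()
                  if k not in Z and any(reaches(d, Z) for _, d in acts)), None)
        if k is not None:
            Z.add(k)
            continue
        i = next((i for i, acts in max_actions.items()
                  if i not in Z and all(reaches(d, Z) for _, d in acts)), None)
        if i is not None:
            Z.add(i)
            continue
        return Z

def induced(actions, V):
    """Assumed induced game: keep the states of V and the actions that stay inside V."""
    return {w: [(r, d) for r, d in acts if set(d) <= V]
            for w, acts in actions.items() if w in V}

def top_class(min_actions, max_actions):
    """Mirror of TopClass: shrink the game until the remaining one looks constant."""
    while True:
        V = set(min_actions) | set(max_actions)
        is_constant, low, _high = constant_value_stub(min_actions, max_actions)
        if is_constant:
            return V
        V -= extend({low}, min_actions, max_actions)
        min_actions, max_actions = induced(min_actions, V), induced(max_actions, V)
```

On the toy game the loop runs three times: it first removes `k1`, then `k2` together with `i2`, and finally returns `{"i1"}`, the only state of value 1.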

Theorem 8.68. The procedure TopClass is correct and it finds the set of states with maximal value of a stochastic mean payoff game. Moreover, it performs $O(\mathrm{act}\,(n+m)^6 W M^{3\min\{s,n+m-1\}})$ arithmetic operations and every number that appears during these operations can be represented using $O((n+m)\log((n+m)MW))$ bits of memory.

Proof. Let $\overline{W}$ denote the set of all states with the maximal value. By Lemma 8.63, the procedure ConstantValue$(T, 3D_{\mathrm{ub}}^2)$ correctly decides if the game has constant value and, if not, it outputs a state $\underline{z} \in V$ such that $\underline{z} \notin \overline{W}$. Furthermore, by Lemma 6.14, $\overline{W}$ is a dominion. Hence, by Lemma 8.65, the set $\tilde{V} := V \setminus$ Extend$(\underline{z})$ is a dominion and $\overline{W} \subset \tilde{V}$. Furthermore, note that the game induced by $\tilde{V}$ has the same set of states with the maximal value as the original game. Indeed, by Lemma 6.13, the value of every state in the game induced by $\tilde{V}$ is not greater than the value of the same state in the original game. Moreover, $\overline{W}$ is still a dominion in the game induced by $\tilde{V}$, and the game induced by $\overline{W}$ does not change when passing from the original game to the game induced by $\tilde{V}$. Hence, by Lemma 6.14, the value of every state in $\overline{W}$ is the same in all of these three games. Therefore, the problem of finding $\overline{W}$ reduces to the problem of finding the states with maximal value in the game induced by $\tilde{V}$. This shows that TopClass is correct.

To show the complexity bound, note that the calls to ConstantValue are the most expensive operations in TopClass, because a call to Extend and the reduction of the game can be done in $O(\mathrm{act}\,(n+m)^2)$ complexity. Moreover, as noted in Remark 8.64, ConstantValue$(T, 3D_{\mathrm{ub}}^2)$ has the same asymptotic complexity as the problem of solving a stochastic mean payoff game presented in Theorem 8.58. Finally, TopClass makes at most $n+m$ calls to ConstantValue$(T, 3D_{\mathrm{ub}}^2)$, hence the claimed bound.
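The arithmetic behind the claimed bound can be spelled out as follows, using only the count of oracle calls and the cost per call stated in Remark 8.64 together with the definitions of $D_{\mathrm{ub}}$ and $R_{\mathrm{ub}}$. The abbreviation $e := \min\{s, n+m-1\}$ is introduced here for readability and is not part of the original notation.

```latex
\begin{align*}
\text{cost of one call to ConstantValue}
  &= O(K R_{\mathrm{ub}}) \cdot O\bigl(\mathrm{act}\,(n+m)\bigr)
     && \text{with } K = 3D_{\mathrm{ub}}^{2} \\
  &= O\bigl(D_{\mathrm{ub}}^{2}\, R_{\mathrm{ub}}\, \mathrm{act}\,(n+m)\bigr) \\
  &= O\bigl((n+m)^{2} M^{2e} \cdot (n+m)^{2}\, W M^{e} \cdot \mathrm{act}\,(n+m)\bigr) \\
  &= O\bigl(\mathrm{act}\,(n+m)^{5}\, W M^{3e}\bigr), \\[4pt]
\text{cost of TopClass}
  &\leq (n+m) \cdot O\bigl(\mathrm{act}\,(n+m)^{5}\, W M^{3e}\bigr)
   = O\bigl(\mathrm{act}\,(n+m)^{6}\, W M^{3e}\bigr).
\end{align*}
```

The $O(\mathrm{act}\,(n+m)^2)$ cost of each call to Extend is dominated by the cost of the accompanying call to ConstantValue, so it does not change the total.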

Remark 8.69. Given the procedure TopClass, one can easily find the maximal value of the game. Indeed, it is enough to first use TopClass to determine the set of states with maximal value, then restrict the game to this set of states (using the fact that this set is a dominion) and solve the remaining game by Theorem 8.58. The most expensive operation is TopClass, hence the asymptotic complexity of finding the maximal value is the same as the complexity of TopClass indicated above. Furthermore, one can dualize the procedure Extend to the dominions of Player Min. This leads to the procedure BottomClass that finds the minimal value and the set of states that attain this value in the same complexity as TopClass.

Remark 8.70. In [BEGM15], it is shown that given an oracle solving TopClass, and another oracle that solves deterministic mean payoff games, one can solve arbitrary stochastic mean payoff games with a pseudopolynomial number of calls to these oracles, provided that the number of nondeterministic actions of the game is fixed. The authors of [BEGM15] present an oracle solving TopClass that is based on the pumping algorithm. Our analysis shows that this oracle can be replaced by an oracle based on value iteration.

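Remark 8.71. As in the previous section (Remark 8.60), the results presented here extend to the case of bipartite games described by bipartite Shapley operators. Indeed, if $F \colon \mathbb{T}^{V_{\mathrm{Min}}} \to \mathbb{T}^{V_{\mathrm{Min}}}$ is such an operator, and the quantities $D_{\mathrm{ub}}$, $R_{\mathrm{ub}}$ are chosen according to Example 8.38, then for any $N \geq 1$ and any state $k \in V_{\mathrm{Min}}$ with maximal value $\overline{\chi}$ we have

$$\frac{(F^N(0))_k}{N} \geq 2\overline{\chi} - \frac{R_{\mathrm{ub}}}{N}.$$

Similarly, for any state $l \in V_{\mathrm{Min}}$ with minimal value $\underline{\chi}$ we have

$$\frac{(F^N(0))_l}{N} \leq 2\underline{\chi} + \frac{R_{\mathrm{ub}}}{N}.$$

The proof of these inequalities proceeds in the same way as the proof of Lemma 8.62. More precisely, if $\overline{W}$ is the set of states of the game with maximal value, then $\tilde{W} := V(\overline{W} \cap V_{\mathrm{Min}})$ is a dominion and every state in the game induced by $\tilde{W}$ has value $\overline{\chi}$. Therefore, if $\tilde{F} \colon \mathbb{T}^{\overline{W} \cap V_{\mathrm{Min}}} \to \mathbb{T}^{\overline{W} \cap V_{\mathrm{Min}}}$ is the bipartite Shapley operator of the induced game, then Lemmas 6.22 and 8.55 show that there is a bias vector $\tilde{u} \in \mathbb{R}^{\overline{W} \cap V_{\mathrm{Min}}}$ such that $\tilde{F}(\tilde{u}) = 2\overline{\chi} + \tilde{u}$. Moreover, for every $\varepsilon > 0$ we can take a bias vector $\tilde{u}$ such that $\|\tilde{u}\|_H \leq R_{\mathrm{ub}} + \varepsilon$. As in the proof of Lemma 8.62, we define $u \in \mathbb{T}^{V_{\mathrm{Min}}}$ as $u_k := \tilde{u}_k$ if $k \in \overline{W} \cap V_{\mathrm{Min}}$ and $u_k := -\infty$ otherwise, and we observe that $2\overline{\chi} + u \leq F(u)$. This gives the desired inequality.

Furthermore, by applying Theorem 8.37 to $\tilde{F}$ we get that the denominator of $2\overline{\chi}$ is bounded by $D_{\mathrm{ub}}$ and, similarly, the denominator of $2\underline{\chi}$ is bounded by $D_{\mathrm{ub}}$. Hence, we can use the procedure ConstantValue for $F$ to decide if the game has constant value. We can also easily adapt the procedure Extend to $F$ (e.g., by using the observation given in Remark 8.67). As a result, we can find the maximal (or minimal) value of the game and all the states controlled by Player Min that attain it in $O(n D_{\mathrm{ub}}^2 R_{\mathrm{ub}}\, c_{\mathrm{eval}})$ arithmetic operations, where $c_{\mathrm{eval}}$ and $D_{\mathrm{ub}}$, $R_{\mathrm{ub}}$ are as in Remark 8.60, and all the numbers that occur during these operations can be represented using $O(\log(M D_{\mathrm{ub}} R_{\mathrm{ub}} W))$ bits of memory.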
