
Andriy BURKOV

Leveraging Repeated Games for Solving Complex Multiagent Decision Problems

Thesis presented to the Faculté des études supérieures of Université Laval as part of the doctoral program in computer science, for the degree of Philosophiæ Doctor (Ph.D.)

Département d'informatique et de génie logiciel
Faculté des sciences et de génie

UNIVERSITÉ LAVAL
QUÉBEC

2011

© Andriy Burkov, 2011


Résumé

Making good decisions in multiagent environments is a difficult task insofar as the presence of several decision makers implies conflicts of interest, a lack of coordination, and a multiplicity of possible decisions. If, moreover, the decision makers interact repeatedly over time, they must decide not only what to do now, but also how their current decisions may affect the behavior of the others in the future.

Game theory is a mathematical tool that aims to model this type of interaction via strategic games with several players. Multiagent decision problems are therefore often studied using game theory. In this context, and if we restrict ourselves to dynamic games, complex multiagent decision problems can be approached algorithmically.

The contribution of this thesis is three-fold. First, it contributes an algorithmic framework for distributed planning in non-cooperative dynamic games. The multiplicity of possible plans is the source of serious complications for any planning approach. We propose a novel approach based on the notion of learning in repeated games. Such an approach makes it possible to overcome these complications by means of communication between the players.

We then propose a learning algorithm for repeated games in self-play. Our algorithm allows players to converge, in initially unknown repeated games, to a joint behavior that is optimal in a certain well-defined sense, without any communication between the players.

Finally, we propose a family of algorithms for approximately solving dynamic games and for extracting the players' strategies. In this context, we first propose a method for computing a nonempty subset of the approximate subgame-perfect equilibria of a repeated game. We then show how this method can be extended to approximate all subgame-perfect equilibria of a repeated game, and also to solve more complex dynamic games.


Abstract

Making good decisions in multiagent environments is a hard problem in the sense that the presence of several decision makers implies conflicts of interest, a lack of coordination, and a multiplicity of possible decisions. If, moreover, the same decision makers interact repeatedly through time, they have to decide not only what to do in the present, but also how their present decisions may affect the behavior of the others in the future.

Game theory is a mathematical tool that aims to model such interactions as strategic games of multiple players. Therefore, multiagent decision problems are often studied using game theory. In this context, and restricting attention to dynamic games, complex multiagent decision problems can be approached algorithmically.

The contribution of this thesis is three-fold. First, this thesis contributes an algorithmic framework for distributed planning in non-cooperative dynamic games. The multiplicity of possible plans is a source of serious complications for any planning approach. We propose a novel approach based on the concept of learning in repeated games. Our approach permits overcoming the aforementioned complications by means of communication between players.

We then propose a learning algorithm for repeated-game self-play. Our algorithm allows players to converge, in an initially unknown repeated game, to a joint behavior optimal in a certain, well-defined sense, without communication between players.

Finally, we propose a family of algorithms for approximately solving dynamic games and for extracting equilibrium strategy profiles. In this context, we first propose a method to compute a nonempty subset of approximate subgame-perfect equilibria in repeated games. We then demonstrate how to extend this method to approximate all subgame-perfect equilibria in repeated games, and also to solve more complex dynamic games.


Acknowledgements

This thesis is entirely dedicated to my family: to my parents, Valeriy and Tatiana, to whom I owe all my achievements, and who, even being far away, were always there to support me in my adventures; to my dear wife Maria and my loved daughters, Catherine and Eva, without whom my life would not be complete; and also to my brother Dmitriy, and to my uncle Viktor.

I am heartily thankful to my supervisor, Prof. Brahim Chaib-draa, whose encouragement, guidance and support from the initial to the final stage enabled me to accomplish this work. I am grateful to the members of the jury, and especially to Profs. Michael Wellman (University of Michigan), Robin Cohen (University of Waterloo), and Pascal Lang (Université Laval) for the care with which they reviewed the original manuscript.

I owe my deepest gratitude to Profs. Mario Marchand, François Laviolette, and Nadia Tawbi (Université Laval) for their outstanding classes and support in my efforts, as well as to Mme Lynda Goulet for her constant availability and enthusiasm in assisting me in various administrative questions.

Last, but not least, I would like to thank my lab mates and friends, Abdeslam, Camille, Patrick, Julien, Pierre-Luc, Charles, Sébastien, Alireza, Hamid, Guillaume, Pierre, Maxime, Jean-Samuel, Edi, Jilles, Pierrick, and Stéphane, for providing a fun and friendly environment.


Contents

Résumé ii
Abstract iii
1 Introduction 1
1.1 Objective . . . 3
1.2 Approach . . . 4
1.3 Contributions . . . 5
1.4 Thesis Outline . . . 6
2 A Formal Framework 7
2.1 Introduction . . . 7

2.2 Multiagent Systems: A General Framework . . . 8

2.2.1 The Agent . . . 8

2.2.2 Markov Decision Processes . . . 9

2.2.3 Multiagent Frameworks . . . 15

2.2.4 Solution Concepts in Games . . . 23

2.2.5 Solving Dynamic Games: Learning and Planning Approaches . . . 32

2.3 The General Assumption . . . 33

2.4 Conclusion . . . 33

3 Categorization of Previous Work 35
3.1 Introduction . . . 35

3.2 Value Iteration-Based Algorithms . . . 37

3.2.1 Shapley’s Algorithm . . . 37

3.2.2 Fictitious Play . . . 39

3.2.3 The FiniteVI Algorithm . . . 39

3.2.4 Other Value Iteration-Based Algorithms . . . 41

3.3 Equilibrium Learning Algorithms . . . 41

3.3.1 Minimax-Q . . . 42

3.3.2 Nash-Q . . . 43

3.3.3 Friend-or-Foe Q-learning . . . 44

3.4 Adaptive Learning Algorithms . . . 44

3.4.1 Adaptive Play Q-learning . . . 45


3.4.3 Policy Hill-Climbing . . . 48

3.4.4 Other Adaptive Learners . . . 49

3.5 Adaptivity Modeling Algorithms . . . 50

3.5.1 PHC-Exploiter . . . 50

3.5.2 Hyper-Q . . . 50

3.5.3 Adaptive Dynamics Learner . . . 52

3.6 Other Algorithms for Dynamic Games . . . 53

3.7 Limitations of the Previous Work . . . 54

3.7.1 Strong Game Structure Requirements . . . 54

3.7.2 Convergence to the Stationary Equilibrium . . . 55

3.8 Conclusion . . . 56

4 Distributed Planning in Stochastic Games 57
4.1 Introduction . . . 57

4.2 Additional Assumptions . . . 58

4.3 Finite-Horizon Planning . . . 59

4.4 Planning with Communication . . . 61

4.5 Finite Value Iteration with Communication . . . 62

4.5.1 Equilibrium Selection as a Game . . . 64

4.5.2 Distributed Equilibrium Computation . . . 65

4.6 Results . . . 67

4.6.1 Theoretical Results . . . 67

4.6.2 Experimental Results . . . 73

4.7 Conclusion . . . 75

5 Anytime Self-play Learning in Repeated Games 77
5.1 Introduction . . . 77

5.2 Additional Assumptions . . . 79

5.2.1 Information and Communication . . . 79

5.2.2 Assumptions . . . 80

5.3 Criteria of Solution . . . 80

5.4 Self-Play . . . 81

5.5 Anytime Self-Play Learner . . . 83

5.5.1 Internal Variables of ASPL . . . 83

5.5.2 The Exploration Phase . . . 84

5.5.3 The Exploitation Phase . . . 84

5.6 Explore, Exploit, and Coordinate . . . 85

5.6.1 Coordination Problem . . . 86

5.6.2 Explore-Then-Exploit . . . 86

5.6.3 Mixed Exploration and Exploitation . . . 87

5.7 Experimental Results . . . 87

5.8 Extensions . . . 91


6 Equilibria in Repeated Games 94

6.1 Introduction . . . 94

6.2 Theoretical Foundations of Repeated Games . . . 96

6.2.1 Introduction . . . 96

6.2.2 Repeated Game Model Revisited . . . 98

6.2.3 An Overview of Repeated Game Settings . . . 101

6.2.4 Assumptions . . . 102

6.2.5 Nash and Subgame-Perfect Equilibria . . . 102

6.2.6 Strategy Profiles as Finite Automata . . . 105

6.2.7 Folk Theorems . . . 107

6.2.8 Conclusion . . . 112

6.3 Techniques for Constructing Equilibria . . . 114

6.3.1 Nash Reversion . . . 115

6.3.2 Simple Strategies and Penal Codes . . . 118

6.3.3 Self-Generation . . . 121

6.3.4 Our ASPECT Algorithm . . . 123

6.4 Conclusion . . . 128

7 ASPECT: Analysis, Validation, and Extensions 130
7.1 Introduction . . . 130

7.2 Theoretical Analysis of ASPECT . . . 131

7.3 Extensions to the ASPECT Algorithm . . . 140

7.3.1 Approximating All Repeated Game Equilibria . . . 140

7.3.2 Solving Markov Chain Games . . . 143

7.3.3 Solving Stochastic Games . . . 143

7.4 Experimental Validation . . . 144

7.4.1 Results in Repeated Games . . . 144

7.4.2 Results in Markov Chain Games . . . 149

7.5 Conclusion . . . 150

8 Conclusions and Future Work 153
8.1 Contributions . . . 153

8.1.1 Distributed Planning with Communication . . . 154

8.1.2 Learning with Functional Criteria of Solution . . . 154

8.1.3 Computing Approximate Equilibria . . . 155

8.2 Directions for Future Work . . . 156


List of Tables

4.1 Observed distribution over action profiles in the 3 × 3 grid with horizon 4 for each time step. . . 75

5.1 FCS values of the strategy profiles learned by ASPL for different functional criteria of solution compared to the values of other solutions. . . 89

5.2 The smallest iteration numbers, after which the value of ASPL becomes 5%-close to the optimal FCS value in Random Games M. . . 90

7.1 The performance of ASPECT in the repeated Battle of the Sexes for different values of the approximation factor ε. The second column represents the hypercube side length l at the end of the algorithm's execution; the third column contains the number of iterations until convergence; the last column contains the overall execution time in seconds. . . 149


List of Figures

2.1 Agent framework (adapted from Russell and Norvig (2009)). . . 8

2.2 An example of payoff functions in Prisoner’s Dilemma: (a) the payoff matrix of Player 1; (b) the payoff matrix of Player 2. . . 16

2.3 An example of payoff matrix that contains payoffs for both players in Prisoner’s Dilemma. . . 17

2.4 Payoff matrix of a strictly cooperative stage-game. . . 18

2.5 Payoff matrix of Rock, Paper, Scissors: a strictly adversarial stage-game. . . 18

2.6 Payoff matrix of a symmetric stage-game. . . 18

2.7 The process of game play in a repeated game. . . 19

2.8 The process of game play in a stochastic game. . . 21

2.9 Dynamic game models. . . 22

2.10 Payoff matrix of Prisoner’s Dilemma. . . 29

3.1 A payoff matrix of Player 1 in the game G^t(s) built from the multiagent Q-values. . . 38
4.1 An example of communication game payoff matrix for player i. . . 65

4.2 An example of communication game payoff matrix for player i. . . 71

4.3 Grid world (Hu and Wellman, 2003). . . 73

4.4 Equilibria in a deterministic 3 × 3 grid world game. . . 74

5.1 Main steps of ASPL for player i. . . 83

5.2 A payoff matrix. . . 86

5.3 The evolution of FCS value of the learned strategy profile in Random Games M according to the max FCS. The X axis represents learning iterations (×103); the Y axis represents FCS value. . . 90

5.4 The evolution of FCS value of the learned strategy profile in Random Games M according to the sum FCS. . . 91

5.5 The evolution of FCS value of the learned strategy profile in Random Games M according to the product FCS. . . 92


6.2 Sets F, F†, F†* and F†+ in the Prisoner's Dilemma from Figure 6.1. Set F includes four bold dots denoted as r(C, D), r(C, C), r(D, C) and r(D, D). Set F† is the whole diamond-shaped area formed by the four dots and the bold lines that connect them. Set F†* is shown as the shaded sector inside this diamond-shaped area, including the bounds. Set F†+ is a subset of F†* that excludes the axes. . . 100

6.3 A game in which a profile of two grim trigger strategies is not a subgame-perfect equilibrium if the long-term payoff criterion is the average payoff. . . 103

6.4 An example of an automaton implementing a profile of two grim trigger strategies. Circles are states of the automaton; they are labeled with the action profiles prescribed by the profiles of decision functions. Arrows are transitions between the corresponding states; they are labeled with outcomes. . . 106

6.5 A generic stage-game. . . 116

6.6 An augmented game for the generic stage-game from Figure 6.5. . . 116

6.7 An augmented game for Prisoner’s Dilemma from Figure 6.1. . . 117

6.8 An augmented game for Prisoner’s Dilemma from Figure 6.1. . . 117

6.9 Payoff matrix of both players in the Duopoly. . . 118

6.10 A finite automaton representing a carrot-and-stick based subgame-perfect equilibrium strategy for Player 1 in the Duopoly game from Figure 6.9. The circles are the states of the automaton; they are labeled with the action prescribed by the decision function. The arrows are the transitions between the corresponding states; they are labeled with outcomes. . . 121

6.11 An example of SPE set refinement using hypercubes. . . 125

7.1 Equilibrium graph for player i. The graph represents the initial state followed by a non-cyclic sequence of states (nodes 1 to Z) followed by a cycle of X states (nodes Z + 1 to Z + X). The labels over the nodes are the immediate expected payoffs collected by player i in the corresponding states. . . 133

7.2 Deviation graphs for player i. (a) A generic deviation graph for player i. The graph represents the initial deviation state (node 0) followed by a transition into the punishment state (node 1) followed by a number of in-equilibrium (or otherwise, inside-the-support deviation) states (nodes 1 to L − 1) followed by the subsequent out-of-the-support deviation state (node L). (b) A particular, one-state deviation graph, where the only deviation state is the punishment state for player i. The labels over the nodes are the immediate expected payoffs collected by player i in the corresponding states. . . 135

7.3 Four payoff matrices: (a) Duopoly game, (b) Rock, Paper, Scissors, (c) Battle of the Sexes, and (d) A game with no pure Nash equilibrium in stage-game. . . 144

7.4 The evolution of the set of SPE payoff profiles computed by ASPECT with public correlation in the repeated Prisoner’s Dilemma. The numbers under the graphs reflect the algorithm’s iterations. The red (darker) regions denote the hypercubes that remain in the set C by the end of the corresponding iteration. 146


7.5 The evolution of the set of SPE payoff profiles computed by ASPECT in the repeated Rock, Paper, Scissors with γ = 0.7 and ε = 0.01. . . 147

7.6 The evolution of the set of SPE payoff profiles computed by ASPECT without public correlation in the repeated Battle of the Sexes. . . 147

7.7 SPE payoff profiles in the repeated Duopoly game computed by ASPECT limited to pure action strategies with γ = 0.6 and ε = 0.01. (a) The evolution of the set of SPE payoff profiles through different iterations of the algorithm. (b) Abreu's optimal penal code solution is contained within the set of SPE payoff profiles returned by our algorithm. . . 148

7.8 The sets of SPE payoff profiles computed in the repeated game from Figure 7.3d with ε = 0.01 for different values of the discount factor. . . 149

7.9 The evolution of the set of SPE payoff profiles in a three-state Markov chain game. . . 151

7.10 Scalability of our algorithm in Markov chain games in terms of the average time until convergence as a function of the number of actions available to each player. . . 152

7.11 Scalability of our algorithm in Markov chain games in terms of the average time until convergence as a function of the number of states of the Markov chain game. . . 152


List of Algorithms

2.1 The value iteration algorithm for MDPs. . . 12

2.2 The finite value iteration algorithm for MDPs. . . 13

2.3 The Q-learning algorithm for MDPs. . . 14

3.1 Algorithm of Shapley (1953) for zero-sum stochastic games. . . 38

3.2 The Fictitious Play algorithm for stochastic games by Vrieze (1987). . . 40

4.1 The FiniteVI algorithm of Kearns et al. (2000) for planning in finite horizon stochastic games. . . 60

4.2 FiniteVICom, a value iteration algorithm for planning in finite horizon stochastic games with communication. . . 63

4.3 An algorithm to verify a stage-game equilibrium. . . 67

6.1 The basic structure of ASPECT. . . 124

6.2 The CubeSupported procedure. . . 126

6.3 The ConstructAutomaton procedure. . . 127


Chapter 1

Introduction

This introductory chapter describes the general context of the research presented in this thesis. The notion of an agent playing a game is first introduced as a model for decision making in multiagent systems. Game theory is then proposed as a set of tools for solving decision problems in such models. The objective of this thesis is then presented, as well as the approach taken to reach this objective. We then briefly review the main contributions, and outline the main parts of this thesis.

Multiagent systems (MAS) have become widely applied for modeling and solving real world practical problems. Problems such as robotic soccer, disaster rescue operations, automated driving, bargaining, stock market trading, electronic commerce, information gathering and sharing, as well as many others involving both human and artificial agents can all be modeled as a multiagent system.

An (artificial) agent is seen as an independent entity that can perceive the environment that surrounds it, make actions that change the state of this environment, and receive feedback, both in the form of a certain observation and in the form of a numerical “payoff”.

When an environment contains more than one agent, one can already talk about a multiagent system. Game theory regards agents in a multiagent system as game players. Accordingly, a multiagent environment, in which those agents can make their actions, is considered as a multiplayer game, whose outcomes are often determined by the joint action of all agents/players¹. A multiplayer game can describe the outcomes of all possible interactions of players in terms of a real-valued numerical payoff of an outcome to a player. Furthermore, such a game can also describe the evolution of the multiagent environment in time. This evolution can depend on the joint behavior of players, and can be non-deterministic in its nature. For instance,

¹ To facilitate reading, we will interchangeably use the terms "agent", "player", "robot", and "opponent" to denote an agent in a multiagent system.


there can be a stochastic function describing the change in possible outcomes of the game, given the previous actions of players.

If a certain property of a multiagent environment² cannot be taken into account by a specific game formulation, such a property is considered irrelevant to the problem and can be ignored. Such an approach allows a high level of abstraction: the decision maker can concentrate its attention only on those properties of the environment that are relevant to its multiagent, i.e. interactive, component.

Game theory provides a rich set of mathematical abstractions and frameworks suitable for modeling various multiagent decision problems. For instance, stage-games (or matrix games) are capable of representing multiagent decision problems where the decision is made only once by each player, and all decisions are made by the players simultaneously. Repeated games extend stage-games by allowing them to model relatively simple sequential (or dynamic) decision problems. Finally, more complex dynamic games, the stochastic games, allow modeling complex multiagent decision problems, those that can arise in multiagent systems evolving in time and space. Being reduced to a well-defined mathematical model, such as dynamic games, complex multiagent decision problems can then be approached algorithmically.

Making good decisions in multiagent environments is hard. Indeed, the presence of multiple decision makers implies conflicts of interests, a lack of coordination, and a multiplicity of possible decisions. Additionally, if the same decision makers interact continuously through time, they have to decide not only what to do in the present, but also (and this is often more important) how their present decisions may affect the behavior of the others in the future. Certain game theoretic models, such as repeated games (Aumann, 1981; Mailath and Samuelson, 2006; Gossner and Tomala, 2009), are especially suited for studying the long-term aspects of multiagent interactions.

Not only can one player's present decisions affect the future behavior of the other players, but also the environment itself can evolve under the influence of the players. The model of stochastic games (Shapley, 1953; Bowling and Veloso, 2002; Shoham et al., 2007) reflects this aspect by assuming that a game can have:

(i) multiple states, and

(ii) stochastic inter-state transitions that can be jointly controlled by the players.

Once a game is "extracted" from the original multiagent environment, the next problem is to solve that game. For instance, a solution of a multiplayer game, from the point of view of one player, can provide an answer to the following question: What actions should I take to perform best in this game? Different answers can be given, according to:

² Like, for instance, the physical conditions of the environment, the limitations of agents in terms of their computational resources, and so on.


(i) what is best for that agent,

(ii) what is known about the game and about the other players before the game starts, and

(iii) what can become known during the game play.

Certain game models (Abreu et al., 1990; Bowling and Veloso, 2002; Emery-Montemerlo et al., 2004) are built on the assumption of imperfect or incomplete information, which players can have about certain game properties. For instance, the players can only have a partial knowledge of their payoffs for different game outcomes; or, they can imprecisely observe the game states or the actions played by the other players. Such games have to be treated differently from the games where this information is perfect and complete. Furthermore, whether communication between players is available before or during game play can also have an important impact on the number of game solutions and their particular form.

Another important aspect of those multiagent decision problems that involve artificial agents is the boundedness (or finiteness) of the computational resources and memory available to artificial agents. To reflect such aspects, certain game models (Abreu and Rubinstein, 1988; Ben-Porath, 1993; Kalai, 1990) augment the standard game theoretic model of repeated games with the notion of "finite rationality".

In the next section, we formally define the objective of this thesis.

1.1 Objective

The objective of this thesis is to find algorithmic ways to answer certain questions that are usually raised when one deals with decision problems represented as a multiplayer game played by artificial agents. By "algorithmic ways" we mean that any way of answering the questions we deal with in this thesis has to be implementable as program code embodying a certain algorithm. Furthermore, to find an answer to a given question, this algorithm has to have a finite running time.

By "certain questions" we mean that we do not intend to answer all questions that can arise in a multiagent system. More specifically, we are interested in the following three types of questions, given a multiplayer game of a certain form and assuming a certain criterion of effective performance of the agents in this game:

1. How can agents, by interacting with each other, learn to perform effectively in this game? That is, when several agents that cannot communicate are put into an unknown environment and left on their own, what should they do in order to start behaving well in a certain sense?


2. How can agents compute an effective joint plan in a distributed way? That is, when all necessary properties of the environment are known and the agents can communicate, how can they compute, in a distributed way, a sound set of individual plans, such that their joint behavior according to this set of individual plans can be considered good in a certain sense?

3. How can agents compute all possible joint plans? That is, when all necessary properties of the environment are known, how can one agent compute all possible ways to behave well in this environment?

Finally, by saying "played by artificial agents" we mean that we do not consider multiagent systems populated by humans or animals, or by humans and animals coexisting with artificial agents. We rather foresee multi-robot physical environments, or virtual environments jointly controlled by a set of software agents. Furthermore, we acknowledge that those artificial agents have limitations, such as limited computational resources, memory, life span, and so on.

In the next section, we specify the approach we use to achieve the objective stated above.

1.2 Approach

A game can be solved in a centralized or in a distributed way. Solving a game in a centralized way is interesting in a variety of situations. One example is when one wants to investigate some properties of the game in order to predict the behavior of players in this game. Another example is when there is a certain mediating party whose responsibility is to provide the players with a way to behave, such that all players would accept it. This mediator therefore plays the role of a centralized game solver.

On the other hand, a multiagent decision problem may need to be solved by the agents in a distributed way when a centralized distribution of behaviors is impossible. For example, in the absence of a reliable mediator, the players can independently solve the game and then negotiate an acceptable joint behavior by means of communication. Or, in a fully collaborative setting without communication, the players can independently compute a unique joint behavior that satisfies all players.

In this thesis, we assume that the actual multiagent system has been reduced beforehand to an abstract game model. Once a particular game is given, we are interested in leveraging and extending the existing theory of repeated games for solving complex multiagent decision problems.


One aspect studied in this thesis is how an available communication channel between players can help in solving decision problems in more complex game models, such as stochastic games.

Another aspect, also studied here, is the question of structural similarity between players. More precisely, we are interested in leveraging the fact that the players can all be the same, i.e., we consider the so-called "self-play" setting. For instance, similar players can share similar a priori information; they can also have aligned optimality criteria and behave predictably towards the other, similar players. In the absence of explicit communication between players, a player can base its decisions on the fact that the other players are similar to it and share the same a priori information and beliefs about future events. Such players can send and understand indirect "signals" by doing specific actions at specific moments of time in order to implicitly exchange acquired knowledge about the surrounding world and effectively coordinate their future actions.

Finally, we explore the computational aspect of finding solutions in games. More specifically, we attempt to algorithmically solve the problem of approximately computing a solution in a given game. To accomplish that, we focus on the aspects of game repeatability and finiteness of player strategies. Again, we start by finding a method for solving the simpler model of repeated games, and then extend the method to more complex models.

The next section lists the main contributions of this thesis.

1.3 Contributions

The key contributions of this thesis are three-fold:

A non-cooperative distributed planning algorithm for stochastic games

Stochastic games can have multiple solutions called equilibria. The multiplicity of equilibria seriously complicates planning in games. We propose a planning approach that permits overcoming this complexity by means of unlimited free communication between players during the planning process. This contribution resulted in a publication (Burkov and Chaib-draa, 2008) nominated for the Best Paper Award at The Seventh International Conference on Machine Learning and Applications (ICMLA'08).

A cooperative learning algorithm for repeated games

Our contribution is a novel algorithm that allows players to converge, in the self-play learning scenario, to a strategy profile that is optimal in a certain sense. Our approach contrasts with the classical learning approaches for dynamic games: namely, equilibrium learning, Pareto-efficient strategy learning, and their variants. This contribution resulted in a publication (Burkov and Chaib-draa, 2009a) at The First International Conference on Algorithmic Decision Theory (ADT'09).


A family of algorithms for approximately solving dynamic games

This is the main contribution of the present thesis. Here, we focus on the problem of computing equilibria in dynamic games. We approach this by first proposing a method to compute a nonempty subset of approximate subgame-perfect equilibria in repeated games. We then demonstrate how to extend this method to approximate all subgame-perfect equilibria in repeated games, and also to solve more complex dynamic games. This contribution resulted in a publication (Burkov and Chaib-draa, 2010a) at The Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI'10).

We conclude this chapter by outlining the remaining parts of this thesis.

1.4 Thesis Outline

This thesis is organized as follows:

• Chapter 2 explains how a multiagent environment can be reduced to a game, and how the interactive aspect of this environment can be effectively encapsulated within that mathematical model. In this chapter, we formulate the notion of solution of a game, and present some important solution concepts.

• Chapter 3 is devoted to a review of the literature relevant to the domain of this thesis. More precisely, we present different algorithms for solving dynamic games, based on the concepts of learning and planning.

• Chapters 4–7 are devoted to the three key contributions of this thesis, enumerated in the previous section. The main contribution is presented in Chapters 6 and 7: a family of approximate algorithms for solving dynamic games.

• Chapter 8 concludes this thesis with a summary of our contributions and considerations for future research.


Chapter 2

A Formal Framework

In this chapter, we present the formal framework for this thesis. This framework is characterized by an environment containing a set of artificial agents, capable of executing actions, perceiving the state of their environment, and obtaining a certain payoff as a result of their actions. There exist several models to represent such environments. Each model reflects different properties of the real world: number of agents, number of environment states, and preferences of agents. For each of these models, we give a definition of a solution for a decision problem stated using this model, and outline certain approaches to find these solutions.

2.1 Introduction

Multiagent systems (MAS) can be modeled in a variety of ways. Different paradigms have been used to model MAS, such as modal logic, contract nets, ant colonies, evolutionary computations, and others (Shoham and Leyton-Brown, 2009). In the present thesis, we consider a game theoretic paradigm that has received considerable attention in the past decade. According to a game approach to multiagent systems, the most important part of a MAS — interactions between agents — is abstracted from the overall problem and investigated separately. This permits focusing attention on solving this problem (very difficult on its own) without taking into account the many other factors proper to a typical MAS. These factors (such as, for example, agents' physical parameters, computational limits, programming aspects), though important, can often be considered as non-pertinent to the interactive aspect of decision making proper to MAS. Indeed, interaction between decision makers is what fundamentally distinguishes multiagent decision problems from single-agent ones.

According to a game theoretic approach, the abstraction, which any MAS is reduced to, is a multiplayer game, where agents act as players and the environment reacts by giving the players some observations and numerical payoffs. This multiplayer game can, for example, be

(from less to more complex models) a stage-game, a repeated game, or a stochastic game. We now start defining the multiagent framework from the single agent.

Figure 2.1: Agent framework (adapted from Russell and Norvig (2009)).

2.2 Multiagent Systems: A General Framework

2.2.1 The Agent

An agent, according to Russell and Norvig (2009), is anything that can be viewed as perceiving its environment through sensors and acting upon that environment through actuators. This idea is illustrated in Figure 2.1. At each moment in time, the environment can be in one of a finite number of states. Each state of the environment is defined by a set of numerical variables representing certain characteristics of the world around the agent. The agent perceives these characteristics through its sensors. The agent is supposed to be capable of changing the environment's state by taking actions through its actuators. These actions can trigger a transition between two environment states.

It is also assumed that the agent has a goal. This goal is determined by the payoffs the agent obtains from the environment as a result of the executed actions. Positive (high) payoffs usually define something "good" for the agent, while negative (low) payoffs define something "bad". Informally speaking, a rational agent always prefers higher payoffs to lower payoffs:

A rational agent is an agent whose goal is to maximize a certain non-decreasing function of the payoffs accumulated during her life cycle.


Such a function can be, for example, the sum of the payoffs collected during a certain period of activity of the agent. When there is an environment and a single agent that interacts with it, such an environment can be modeled as a Markov Decision Process.

2.2.2 Markov Decision Processes

Markov Decision Processes (or MDPs) are a popular way to model single-agent environments. This model can represent a single-agent environment only if the Markov assumption is justified. As we said above, an agent can make actions that trigger transitions between the states of the environment. An environment is called Markovian (or stationary) if the next state of the environment depends only on the current state and the action executed by the agent in the current state. If an environment has a finite number of states, possesses the property of stationarity (i.e., satisfies the Markov assumption), and has a properly defined payoff function, such an environment can be called a Markov Decision Process.

Thus, Markov Decision Processes (Russell and Norvig, 2009) describe environments that have a Markovian inter-state transition model and additive payoffs, and have the property that the current state of the environment is fully observable by the agent. The latter means that at any moment of time, the agent is certain about which state the environment is in. More formally, an MDP is defined as a tuple (S, A, P, r), where

• S is the finite set of states of the environment,

• A is the finite set of actions available to the agent,

• P is the transition function, P : S × A × S → [0, 1], where P(s, a, s′) defines the probability that the environment will transit to state s′ ∈ S when the agent takes action a ∈ A in the current state s ∈ S, and

• r is the payoff function (a.k.a. reward function), r : S × A → ℝ, where r(s, a) is the payoff the agent obtains by taking action a in state s.
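To make the model concrete, the tuple (S, A, P, r) can be written down directly as plain data structures. The sketch below is only an illustration: the two states, the two actions, and all numerical values are hypothetical and not taken from the thesis.

# A minimal, hypothetical encoding of an MDP (S, A, P, r) as plain Python data.
# States and actions are strings; P maps a (state, action) pair to a probability
# distribution over next states; r maps a (state, action) pair to an immediate payoff.
S = ["s0", "s1"]
A = ["stay", "move"]
P = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s0": 0.9, "s1": 0.1},
}
r = {
    ("s0", "stay"): 0.0,
    ("s0", "move"): -1.0,
    ("s1", "stay"): 2.0,
    ("s1", "move"): -1.0,
}
# Sanity check: every transition distribution sums to one.
assert all(abs(sum(dist.values()) - 1.0) < 1e-9 for dist in P.values())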

As we said above, every agent has a goal; for a rational agent, that goal is to maximize the payoff it obtains from the environment throughout its life span. A horizon T of an MDP (or, a horizon of an agent in an MDP) is the number of time steps (or transitions of the environment between states) after which the agent becomes indifferent to the payoffs it obtains from the environment. An MDP can be of a finite or of an infinite horizon. The definition of the long-term payoff criterion crucially depends on the finiteness of the horizon. Below, we introduce two different long-term payoff criteria for MDPs. But first, we define the notion of the agent's strategy:


A stationary strategy in an MDP, σ, is a mapping σ : S → A, where σ(s) is the recommended action to execute when the MDP is in state s.

A (non-stationary) T-horizon strategy in an MDP, σ^T, is a mapping σ^T : S → A, such that σ^T(s) is the recommended action to execute when the current state of the MDP is s and T time steps are remaining in this MDP.

When the horizon is infinite, the discounted cumulative payoff function is usually used as a criterion characterizing the agent's long-term performance. If we let s^t be the state of the environment after the agent has followed the stationary strategy σ during t steps, the (long-term expected) discounted cumulative payoff u(σ(s)) of executing strategy σ starting in state s is given by

u(σ(s)) ≡ E[ ∑_{t=0}^{∞} γ^t r(s^t, σ(s^t)) | s^0 = s ],   (2.1)

where γ ∈ [0, 1) is the discount factor, a number between 0 and 1, which reflects the preference of the agent for current payoffs over future payoffs. In Equation (2.1), the expectation is taken over all possible infinite sequences of states that can be visited by the agent if it uses the strategy σ starting from the state s.

When an MDP is of a finite horizon, a finite horizon cumulative payoff function is usually used. More precisely, the (expected) T-horizon cumulative payoff u(σ^T(s)) of executing a T-horizon strategy σ^T starting in state s is given by

u(σ^T(s)) ≡ E[ ∑_{t=0}^{T−1} r(s^t, σ^{T−t}(s^t)) | s^0 = s ],   (2.2)

where T ∈ N+ is the length of the horizon.
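As a concrete illustration of the two criteria, the short sketch below evaluates them on a fixed stream of immediate payoffs; the constant unit payoff, the discount factor of 0.9, and the horizon of 5 are arbitrary choices made only for the example.

# Illustrative computation of the two long-term payoff criteria on a fixed
# payoff sequence (the agent is assumed to receive a payoff of 1 at every step).
gamma = 0.9                # discount factor, gamma in [0, 1)
T = 5                      # finite horizon length
payoffs = [1.0] * 1000     # a long payoff stream standing in for an infinite one

# Discounted cumulative payoff (Equation 2.1), truncated after 1000 steps.
discounted = sum(gamma**t * r_t for t, r_t in enumerate(payoffs))
print(discounted)          # close to 1 / (1 - gamma) = 10 for a constant unit payoff

# T-horizon cumulative payoff (Equation 2.2): the undiscounted sum of the first T payoffs.
finite_horizon = sum(payoffs[:T])
print(finite_horizon)      # equals T = 5 here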

It is worth noting that in the literature, the long-term expected payoff criterion is often called a utility function. In this thesis, when there is no ambiguity, we will sometimes use the terms utility and utility function to mean, respectively, the long-term expected payoff and the long-term expected payoff criterion, to save space.

We will denote the utility of a strategy as u(σ(·)), meaning either Equation (2.1) or (2.2), when the long-term payoff criterion of the agent is clear from the context or is inessential. Let us now specify what it means to solve an MDP:

To solve an infinite horizon MDP with the discounted cumulative payoff is to find an optimal strategy σ* such that ∀s ∈ S:

σ*(s) ≡ argmax_σ u(σ(s)).


To solve a finite horizon MDP with the T-horizon cumulative payoff is, in turn, to find an optimal strategy σ^{T*} such that ∀s ∈ S:

σ^{T*}(s) ≡ argmax_{σ^T} u(σ^T(s)).

MDPs have been extensively studied by the agent research community (Sutton and Barto, 1998; Russell and Norvig, 2009; Sigaud and Buffet, 2010). In infinite horizon MDPs with discounted cumulative payoff, there is always an optimal stationary strategy. In finite horizon MDPs, in turn, optimal strategies can often be non-stationary.

There exist several ways to find an optimal strategy for a given MDP:

(i) by solving a linear program (Littman et al., 1995),

(ii) by value iteration (Puterman, 1994), if the payoff and transition functions are known, or

(iii) by reinforcement learning (Sutton and Barto, 1998), such as Q-learning (Watkins and Dayan, 1992), if either the payoff function, or the transition function, or both, are unknown.

Value iteration (Puterman, 1994) is an algorithm for computing an optimal strategy for a given MDP when the model (i.e., the payoff and transition functions) is known. We now detail it.

2.2.2.1 Value Iteration

The value iteration algorithm assigns utilities to the states. If the payoff criterion is the discounted cumulative payoff (Equation 2.1), each state utility u*(s) reflects the long-term payoff of starting to execute an optimal strategy σ* in the state s and then following that strategy infinitely often. At the beginning, each state utility is estimated with an arbitrary initial value u^0(s). Then, the utilities of each state are updated using a certain function of the utilities of the neighboring states¹. An update of the utility of the state s is made using the following equation (called the Bellman backup):

u^{t+1}(s) ← max_{a∈A} [ r(s, a) + γ ∑_{s′∈S} P(s, a, s′) u^t(s′) ],   (2.3)

where u^t(s) and u^{t+1}(s) denote the current and the updated values of the utility of state s.

¹ A state s′ of the MDP is a neighbor of another state s if there is a non-zero transition probability P(s, a, s′) for a certain action a ∈ A.


The utilities in the different states of the MDP are updated one by one in a loop; the process of value iteration converges to the optimal utilities u*(s) of the states (Russell and Norvig, 2009). The optimal strategy for the agent is then obtained as

σ*(s) ≡ argmax_{a∈A} [ r(s, a) + γ ∑_{s′∈S} P(s, a, s′) u*(s′) ].

Algorithm 2.1 represents the main steps of the value iteration algorithm.

Algorithm 2.1 The value iteration algorithm for MDPs.
Input: r, a payoff function; P, a transition function; γ, a discount factor; ε, a precision parameter.
1: t ← 0;
2: ∀s ∈ S, u^t(s) ← random value;
3: repeat
4:   for each s ∈ S do
5:     u^{t+1}(s) ← max_{a∈A} [ r(s, a) + γ ∑_{s′∈S} P(s, a, s′) u^t(s′) ];
6:   end for
7:   t ← t + 1;
8: until ∀s ∈ S, |u^t(s) − u^{t−1}(s)| < ε.
9: for each s ∈ S do
10:   σ*(s) ← argmax_{a∈A} [ r(s, a) + γ ∑_{s′∈S} P(s, a, s′) u^t(s′) ];
11: end for
12: return An optimal strategy σ*.

In this algorithm, the utilities of the states are first initialized with random values in line 2. Then, iteratively, these utilities are updated in all states (line 5) until convergence of the current values to the optimal ones in all states (line 8). The optimal strategy is then computed in line 10. Having computed an optimal stationary strategy σ*, the agent always executes the action a ≡ σ*(s) if the current state of the MDP is s.
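A minimal Python sketch of the procedure in Algorithm 2.1 is given below; the tiny two-state MDP, the state and action names, and the precision threshold are illustrative assumptions rather than anything prescribed by the thesis.

# A sketch of value iteration (Algorithm 2.1) on a tiny, hypothetical MDP.
S = ["s0", "s1"]
A = ["stay", "move"]
P = {("s0", "stay"): {"s0": 1.0}, ("s0", "move"): {"s0": 0.2, "s1": 0.8},
     ("s1", "stay"): {"s1": 1.0}, ("s1", "move"): {"s0": 0.9, "s1": 0.1}}
r = {("s0", "stay"): 0.0, ("s0", "move"): -1.0,
     ("s1", "stay"): 2.0, ("s1", "move"): -1.0}
gamma, eps = 0.9, 1e-6

def backup(u, s, a):
    # One Bellman backup term: r(s, a) + gamma * sum over s' of P(s, a, s') * u(s').
    return r[(s, a)] + gamma * sum(p * u[s2] for s2, p in P[(s, a)].items())

u = {s: 0.0 for s in S}                       # line 2: arbitrary initial utilities
while True:                                   # lines 3-8: update until convergence
    u_new = {s: max(backup(u, s, a) for a in A) for s in S}
    converged = all(abs(u_new[s] - u[s]) < eps for s in S)
    u = u_new
    if converged:
        break

# lines 9-11: extract an optimal stationary strategy from the converged utilities
sigma = {s: max(A, key=lambda a: backup(u, s, a)) for s in S}
print(u, sigma)

The extracted dictionary sigma plays the role of the stationary strategy σ*: it maps each state to the action with the largest backed-up value.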

If, in an MDP, the long-term payoff criterion is the T-horizon cumulative payoff, a finite value iteration algorithm can be used to solve that MDP (Algorithm 2.2). This finite value iteration algorithm is similar to the value iteration given by Algorithm 2.1. The difference is that in finite value iteration, the utilities of the different states for horizon 1 are initialized with the optimal immediate payoffs in the corresponding states (line 2 of Algorithm 2.2). This is because when the horizon is 1, only one action remains to be executed until the end of the agent's life cycle; therefore, the optimal action is the one that maximizes the immediate expected payoff. The algorithm stops (line 10) when an optimal strategy is computed for horizon T. Having computed an optimal non-stationary T-horizon strategy σ^{T*} using Algorithm 2.2, the agent executes the action a ≡ σ^t(s) if the current state is s and the remaining number of steps until the horizon is reached is t.


Algorithm 2.2 The finite value iteration algorithm for MDPs.
Input: r, a payoff function; P, a transition function; T, a horizon length.
1: t ← 1;
2: ∀s ∈ S, u^t(s) ← max_{a∈A} r(s, a);
3: ∀s ∈ S, σ^t(s) ← argmax_{a∈A} r(s, a);
4: repeat
5:   for each s ∈ S do
6:     u^{t+1}(s) ← max_{a∈A} [ r(s, a) + ∑_{s′∈S} P(s, a, s′) u^t(s′) ];
7:     σ^{t+1}(s) ← argmax_{a∈A} [ r(s, a) + ∑_{s′∈S} P(s, a, s′) u^t(s′) ];
8:   end for
9:   t ← t + 1;
10: until t = T
11: return An optimal strategy σ^{T*} ≡ σ^t.
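The same toy MDP can be pushed through a sketch of the finite-horizon variant; here the strategy is indexed by the number of steps remaining, mirroring Algorithm 2.2, and all numerical values are again hypothetical.

# A sketch of finite value iteration (Algorithm 2.2) on a tiny, hypothetical MDP.
S = ["s0", "s1"]
A = ["stay", "move"]
P = {("s0", "stay"): {"s0": 1.0}, ("s0", "move"): {"s0": 0.2, "s1": 0.8},
     ("s1", "stay"): {"s1": 1.0}, ("s1", "move"): {"s0": 0.9, "s1": 0.1}}
r = {("s0", "stay"): 0.0, ("s0", "move"): -1.0,
     ("s1", "stay"): 2.0, ("s1", "move"): -1.0}
T = 4

def q(u, s, a):
    # Undiscounted backup: r(s, a) + sum over s' of P(s, a, s') * u(s').
    return r[(s, a)] + sum(p * u[s2] for s2, p in P[(s, a)].items())

# Horizon 1: only the immediate payoff matters (lines 2-3 of Algorithm 2.2).
u = {s: max(r[(s, a)] for a in A) for s in S}
sigma = {1: {s: max(A, key=lambda a: r[(s, a)]) for s in S}}

for t in range(2, T + 1):                     # lines 4-10: horizons t = 2, ..., T
    u_prev = u
    sigma[t] = {s: max(A, key=lambda a: q(u_prev, s, a)) for s in S}
    u = {s: max(q(u_prev, s, a) for a in A) for s in S}

# sigma[t][s] is the action to execute in state s when t steps remain.
print(sigma[T], sigma[1])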

2.2.2.2 Q-Learning

In the value iteration algorithm, instead of state utilities u(s), one can also use state–action utilities, often called Q-values. An optimal Q-value Q*(s, a) reflects the long-term payoff of taking action a in state s, and then, starting from the next state s′, following an optimal strategy infinitely often. The criterion of optimality of Q-values is given by the following Bellman equation:

Q*(s, a) = r(s, a) + γ ∑_{s′∈S} P(s, a, s′) u*(s′),   (2.4)

where u*(s) ≡ max_{a∈A} Q*(s, a).

When the transition function P(s, a, s′) or the payoff function r(s, a), or both, are unknown, it is generally impossible to use the above equation in an algorithm such as, for example, value iteration for finding optimal Q-values. Reinforcement learning (Kaelbling et al., 1996) is a method for determining optimal utilities of the states, and the corresponding optimal strategies, by interacting with an unknown, or only partially known, environment.

Q-learning (Watkins and Dayan, 1992) is a reinforcement learning algorithm for MDPs that consists of updating the long-term payoffs of state–action pairs by interacting with the environment. Because the transition function, P, is not known by the learning agent, Q-learning consists of iteratively re-estimating the true value of Q*(s, a) by executing action a in state s of the MDP, and by observing the immediate payoff r(s, a) and the next state s′. The update is made using the following update rule:

Q^{t+1}(s, a) ← (1 − λ(t)) Q^t(s, a) + λ(t) [ r(s, a) + γ max_{a′∈A} Q^t(s′, a′) ],   (2.5)

where Q^t(s, a) is the current estimate of Q(s, a), and λ(t) ∈ [0, 1] is the learning rate function decreasing as t grows.


Let Q-learning be executed during t time steps and let the current state of the MDP be s. The agent selects the next action to execute as follows:

argmax_{a∈A} Q^t(s, a),

by allowing for a certain exploration of the environment. Such exploration can be done by executing random actions at certain moments of time in certain states (Singh et al., 2000a). Q-learning stops after a predefined number of learning steps, when the learned strategy becomes sufficiently close to the optimal one. In turn, the estimated Q-values converge to the optimal ones under the following conditions (known as the Robbins-Monro conditions):

∑_{t=0}^{∞} λ(t) = ∞,   ∑_{t=0}^{∞} λ(t)² < ∞.

The formal definition of the Q-learning algorithm is presented in Algorithm 2.3.

Algorithm 2.3 The Q-learning algorithm for MDPs.
Input: r, a payoff function; P, a transition function; γ, a discount factor; s^0 ∈ S, the initial state.
1: Set current learning step t ← 0;
2: ∀s ∈ S and ∀a ∈ A, initialize Q^t(s, a) ← random value;
3: Set current state s ← s^0;
4: repeat
5:   Select an action a ← argmax_{a′∈A} Q^t(s, a′), by allowing for a certain exploration;
6:   Execute a, observe the payoff r(s, a) and the next state s′;
7:   Update Q^{t+1}(s, a) ← (1 − λ(t)) Q^t(s, a) + λ(t) [ r(s, a) + γ max_{a′∈A} Q^t(s′, a′) ];
8:   For all state–action pairs (s′, a′) ≠ (s, a), set Q^{t+1}(s′, a′) ← Q^t(s′, a′);
9:   Set current state s ← s′;
10:   Increment learning step t ← t + 1;
11: until the predefined number of learning steps is reached.
12: Define an optimal strategy σ*(s) ← argmax_{a∈A} Q^t(s, a), ∀s ∈ S;
13: return An optimal strategy σ*.

In this algorithm, the Q-values in all states for all actions are initialized with random values in line 2. In lines 5 and 6, the learning agent selects an action, executes it, and observes the payoff and the next state of the environment. In line 7, the agent updates the Q-value in the previous state corresponding to the executed action. For all other state–action pairs, the Q-values keep their old values (line 8). The learning step is then incremented (line 10), and the process continues until the predefined number of steps is reached. Then the agent computes the optimal strategy (line 12) and executes it infinitely often.
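A compact Python sketch in the spirit of Algorithm 2.3 follows; the simulated environment, the ε-greedy exploration scheme, and the 1/(number of visits) learning rate are illustrative choices made for the example, not prescriptions from the thesis.

import random

# A sketch of tabular Q-learning (Algorithm 2.3) on a tiny, hypothetical MDP.
# The learner interacts with the environment only through step(); the model
# (P and r) is hidden inside it, since Q-learning does not assume it is known.
S, A = ["s0", "s1"], ["stay", "move"]
P = {("s0", "stay"): {"s0": 1.0}, ("s0", "move"): {"s0": 0.2, "s1": 0.8},
     ("s1", "stay"): {"s1": 1.0}, ("s1", "move"): {"s0": 0.9, "s1": 0.1}}
r = {("s0", "stay"): 0.0, ("s0", "move"): -1.0,
     ("s1", "stay"): 2.0, ("s1", "move"): -1.0}

def step(s, a):
    next_states = list(P[(s, a)])
    s_next = random.choices(next_states, weights=[P[(s, a)][x] for x in next_states])[0]
    return r[(s, a)], s_next

gamma, epsilon = 0.9, 0.1
Q = {(s, a): 0.0 for s in S for a in A}        # line 2: initial Q-values
visits = {(s, a): 0 for s in S for a in A}
s = "s0"                                       # line 3: initial state

for t in range(50000):                         # lines 4-11: the learning loop
    if random.random() < epsilon:              # line 5: greedy choice with exploration
        a = random.choice(A)
    else:
        a = max(A, key=lambda a_: Q[(s, a_)])
    payoff, s_next = step(s, a)                # line 6
    visits[(s, a)] += 1
    lam = 1.0 / visits[(s, a)]                 # a decreasing learning rate lambda(t)
    target = payoff + gamma * max(Q[(s_next, a_)] for a_ in A)
    Q[(s, a)] = (1 - lam) * Q[(s, a)] + lam * target   # line 7: the update rule (2.5)
    s = s_next                                 # line 9

sigma = {s_: max(A, key=lambda a_: Q[(s_, a_)]) for s_ in S}   # line 12
print(sigma)

The 1/(number of visits) schedule satisfies the two Robbins-Monro conditions above; any other schedule with the same two properties would also be admissible.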

When there are several agents in a common environment, and these agents are aware of each other and can jointly influence the environment's state, such a framework is called a multiagent environment. In a multiagent environment, different agents can have different,


often conflicting goals. In such situations, a solution is often a compromise of personal interests of all agents. Thus, the problem of finding a “good” behavior for each agent is not simple.

2.2.3 Multiagent Frameworks

Game theory proposes its own payoff-based models to represent multiagent environments. The simplest such model is called a stage-game.

2.2.3.1 Stage-Games

A stage-game is a tuple (N, {Ai}i∈N, {ri}i∈N), where

• N, |N| ≡ n ∈ ℕ⁺, is a finite set of individual players that act (play, or make their moves in the game) simultaneously;

• Player i ∈ N has a finite set Ai of pure actions (or, simply, actions) at its disposal. When each player i among N chooses a certain action ai ∈ Ai, the resulting vector a ≡ (a1, . . . , an) forms an action profile, which is then played; the corresponding stage-game outcome is then said to be realized.

• Each action profile belongs to the set of action profiles A ≡ ×i∈NAi. Here and below, the notation ×i∈NXi means the cross product of the sets Xi, i = 1 . . . |N |.

• A player specific payoff function ri specifies player i’s numerical payoff for different stage-game outcomes.

In a standard stage-game formulation, a bijection is typically assumed between the set of action profiles and the set of stage-game outcomes. In this case, the payoff function of player i can be defined as the mapping ri : A → ℝ; this assumption also permits us to use the notions of action profile and stage-game outcome interchangeably, with no ambiguity.

Given an action profile a, r(a) ≡ (ri(a))i∈N is called a payoff profile. Here and below, the notation (xi)i∈N means the following vector of dimension n = |N|: (x1, x2, . . . , xn) ∈ ×i∈N Xi, where xi ∈ Xi, ∀i ∈ N. Sometimes, a player may prefer to randomize over its actions. We now define the notions of mixed action and mixed action profile:

A mixed action αi of player i is a probability distribution over its actions, i.e., αi ∈ ∆(Ai). A mixed action profile is a vector α ≡ (αi)i∈N that contains the mixed actions of all players.

(a) Player 1's payoff matrix:
                    Player 2
                     C     D
    Player 1    C    2    −1
                D    3     0

(b) Player 2's payoff matrix:
                    Player 2
                     C     D
    Player 1    C    2     3
                D   −1     0

Figure 2.2: An example of payoff functions in Prisoner’s Dilemma: (a) the payoff matrix of Player 1; (b) the payoff matrix of Player 2.

We denote by αi(ai) and α(a) respectively the probability that player i plays action ai according to the mixed action αi, and the probability that the outcome a will be realized when α is played simultaneously by all players, i.e., α(a) ≡ ∏_{i∈N} αi(ai).

Payoff functions can be extended to mixed action profiles by taking expectations:

r(α) ≡ ∑_{a∈A} r(a) α(a).   (2.6)
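As a small illustration of Equation (2.6), the sketch below encodes the Prisoner's Dilemma payoffs of Figure 2.2 and computes the expected payoff profile of a mixed action profile; the mixing probabilities chosen here are arbitrary.

from itertools import product

# Prisoner's Dilemma payoff functions r1 and r2 from Figure 2.2,
# indexed by the action profile (a1, a2).
actions = ["C", "D"]
r1 = {("C", "C"): 2, ("C", "D"): -1, ("D", "C"): 3, ("D", "D"): 0}
r2 = {(a1, a2): r1[(a2, a1)] for (a1, a2) in r1}   # the game is symmetric

# A mixed action profile alpha = (alpha1, alpha2); the probabilities are arbitrary.
alpha1 = {"C": 0.6, "D": 0.4}
alpha2 = {"C": 0.3, "D": 0.7}

def expected_payoffs(alpha1, alpha2):
    # r(alpha) = sum over outcomes a of r(a) * alpha(a),
    # where alpha(a) = alpha1(a1) * alpha2(a2).
    e1 = e2 = 0.0
    for a in product(actions, repeat=2):
        prob = alpha1[a[0]] * alpha2[a[1]]
        e1 += r1[a] * prob
        e2 += r2[a] * prob
    return e1, e2

print(expected_payoffs(alpha1, alpha2))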

A common notational convention adopted in this thesis is that vectors, such as action or payoff profiles, are denoted as a simple variable. When the same variable has a subscript i, this refers to i’s component of the vector, i.e., related to player i. For example, if a is an action profile, then ai is the action of player i in the action profile a. Similarly, if r(a) is the payoff profile, then ri(a) is the payoff of player i in the payoff profile r(a), and so on. Furthermore, in this thesis, we often utilize the so-called “−i notation”:

The notation −i stands for “all players in N except i”. For example,

a−i ≡ (a1, a2, . . . , ai−1, ai+1, . . . , an), α−i ≡ (α1, α2, . . . , αi−1, αi+1, . . . , αn), and so on. We will also use the following or similar notation: (a′i, a−i), which is equivalent to (a1, a2, . . . , ai−1, a′i, ai+1, . . . , an).

We are now able to define the notion of the support of a mixed action:

The support of a mixed action αi of player i is a set A_i^{αi} ⊆ Ai that contains all pure actions to which αi assigns a non-zero probability. When a mixed action αi has only one certain action ai ∈ Ai in its support, we can simply write ai instead of αi to denote this mixed action.

Since action profiles are vectors, payoff functions can be represented as matrices containing numerical payoffs. As a consequence, stage-games are often called matrix games and payoff functions are called payoff matrices. An example of a stage-game and the corresponding player-specific payoff functions (represented as matrices) is given in Figure 2.2. This example describes the famous Prisoner's Dilemma game (Poundstone, 1992). For simplicity, one can put the payoffs of both players into the same matrix, as shown in Figure 2.3. More generally,


Prisoner’s Dilemma is a matrix game in which:

(1) There are two players, called Player 1 and Player 2. Each player has a choice between two actions: C (for cooperation) and D (for defection, i.e., non-cooperation).

(2) The payoff matrix has the following structure:

                    Player 2
                     C       D
    Player 1    C   R, R    S, T
                D   T, S    P, P

(3) The following inequalities hold: T > R > P > S.

For example, for the Prisoner's Dilemma from Figure 2.2, when Player 1 plays action C and Player 2 plays action D, the action profile is (C, D) and the corresponding payoffs of the players are respectively −1 and 3. To the so-called cooperative outcome (C, C) there corresponds the payoff profile (2, 2); and to the non-cooperative outcome (D, D) there corresponds the payoff profile (0, 0).

                    Player 2
                     C        D
    Player 1    C   2, 2    −1, 3
                D   3, −1    0, 0

Figure 2.3: An example of payoff matrix that contains payoffs for both players in Prisoner’s Dilemma.

Stage-games can differ according to the payoff structure of the game. In the multiagent literature, three major classes of games are usually distinguished:

1. Strictly cooperative games (a.k.a. team games), where all players obtain the same payoff when a certain action profile is played.

2. Strictly adversarial games (strictly competitive, or zero-sum games): two-player games in which, for all action profiles, the payoff of one player is the negative of the payoff of the other player (i.e., for all a ∈ A, r1(a) = −r2(a)).

3. The most general (and hardest to deal with) class of stage-games, general-sum games, which includes the previous two classes as special cases.

An example of a strictly cooperative game is shown in Figure 2.4. As one can see, independently of the outcome of the game, the payoff of Player 1 is always the same as the payoff of Player 2.

An example of a strictly adversarial game is shown in Figure 2.5, which represents the payoff matrix of the famous Rock, Paper, Scissors game.

                    Player 2
                     C          D
    Player 1    C   2, 2       3, 3
                D  −1, −1      0, 0

Figure 2.4: Payoff matrix of a strictly cooperative stage-game.

                        Player 2
                     R        P        S
    Player 1    R   0, 0    −1, 1    1, −1
                P   1, −1    0, 0   −1, 1
                S  −1, 1    1, −1    0, 0

Figure 2.5: Payoff matrix of Rock, Paper, Scissors: a strictly adversarial stage-game.

Finally, the Prisoner's Dilemma from Figure 2.3 can serve as an example of a general-sum stage-game.

There is also a subclass of general-sum games, called symmetric games, whose payoff matrix has a special structure. More precisely, a symmetric game is a game where the payoff for playing a particular action depends only on the other actions employed, not on who is playing them. If one can change the identities of the players without changing the payoff to the actions, then a game is symmetric. Figure 2.6 gives an example of a symmetric game. Rock, Paper, Scissors from Figure 2.5 and Prisoner's Dilemma from Figure 2.3 are both examples of symmetric stage-games.

                    Player 2
                     C       D
    Player 1    C   1, 1    2, 3
                D   3, 2    4, 4

Figure 2.6: Payoff matrix of a symmetric stage-game.
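The three payoff-structure classes, together with the symmetry property, translate into simple checks on the two payoff matrices. The following sketch is illustrative only (the helper names are not from the thesis), assuming the same numpy representation as above:

```python
import numpy as np

def is_team_game(R1, R2):
    """Strictly cooperative: both players get the same payoff for every profile."""
    return np.array_equal(R1, R2)

def is_zero_sum(R1, R2):
    """Strictly adversarial: r1(a) = -r2(a) for every action profile a."""
    return np.array_equal(R1, -R2)

def is_symmetric(R1, R2):
    """Symmetric (two players): swapping the players' identities leaves the
    payoffs unchanged, i.e., r2(a1, a2) = r1(a2, a1)."""
    return np.array_equal(R2, R1.T)

# Rock, Paper, Scissors (Figure 2.5): zero-sum and symmetric.
R1_rps = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]])
R2_rps = -R1_rps
print(is_zero_sum(R1_rps, R2_rps), is_symmetric(R1_rps, R2_rps))  # True True
```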

Repeated games (Mailath and Samuelson, 2006; Burkov and Chaib-draa, 2010b) are built upon stage-games and extend them to environments that evolve in (discrete) time.

2.2.3.2 Repeated Games

In a repeated game, the same stage-game is played in periods t = 0, 1, 2, . . ., also called stages, iterations, or time-steps. At the beginning of each period, the players choose their actions. For instance, if player i wishes to play a mixed action at period t of the repeated game, it uses its mixed action distribution α_i^t to draw a pure action at random. (We denote an action a_i^t drawn from a distribution α_i^t over the actions as a_i^t ∼ α_i^t.)


The collection of pure actions chosen by the players forms the action profile a^t. The players then simultaneously play this action profile, and collect the stage-game payoffs corresponding to the resulting stage-game outcome. Then the repeated game transits to the next stage, where the same stage-game is played again, though the actions played by the agents can be different.

Hereinafter, the superscript t over the actions, action profiles, payoffs, payoff profiles, and so on, will usually reflect time-steps, iterations, periods, etc., unless stated otherwise. For example, a^t is the action profile played at time t; r_i^3 is the payoff obtained by player i at iteration 3, etc.

A diagram illustrating the process of game play in a repeated game is shown in Figure 2.7. As one can see, it is just one stage-game, always the same, played repeatedly throughout time. At each period t of the repeated game play, the agents play a certain action profile a^t = (a_1^t, . . . , a_n^t), and then obtain a certain payoff profile r^t = (r_1^t, . . . , r_n^t) that corresponds to the played action profile a^t.

Figure 2.7: The process of game play in a repeated game.
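A minimal sketch of this game-play loop (Python; the stage-game, the stationary mixed actions, and the number of periods are all made up for the example) draws a_i^t ∼ α_i^t for each player at every period and records the resulting action and payoff profiles:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stage-game: the Prisoner's Dilemma of Figure 2.3 (index 0 = C, 1 = D).
R = {1: np.array([[2, -1], [3, 0]]),
     2: np.array([[2,  3], [-1, 0]])}

# Hypothetical (stationary) mixed actions alpha_i used at every period t.
alpha = {1: np.array([0.9, 0.1]),   # Player 1 mostly cooperates
         2: np.array([0.5, 0.5])}   # Player 2 randomizes uniformly

history = []
for t in range(5):                                              # periods t = 0, ..., 4
    a_t = tuple(int(rng.choice(2, p=alpha[i])) for i in (1, 2)) # a_i^t ~ alpha_i^t
    r_t = tuple(int(R[i][a_t]) for i in (1, 2))                 # stage-game payoffs
    history.append((t, a_t, r_t))

for t, a_t, r_t in history:
    print(f"t={t}: a^t={a_t}, r^t={r_t}")
```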

Stochastic games inherit the properties of both stage-games and MDPs, and extend them to multiagent environments that evolve in both time and space.

2.2.3.3 Stochastic Games

The definitions of stage-games and repeated games implicitly assume that the environment, represented by these models, has only one state that never changes. This state is fully described by the payoff matrices of the players. For instance, we assumed that the payoffs of the players for the same outcomes at different periods of the repeated Prisoner's Dilemma are the same. In a more general case, however, a multiagent environment, at any moment of time, can be in one of a set of states. For example, when talking about embodied agents, such as robots, a local state of the environment can be characterized by a robot's coordinates and speed. Consequently, a state of a multi-robot environment could be characterized by a vector of local states of individual robots.


As we have seen, in a single-agent setting, multistate environments can be modeled as Markov Decision Processes (MDPs). There is also a formalism capable of representing multistate multiagent systems. This formalism, called stochastic games (Shapley, 1953; Littman, 1994; Burkov and Chaib-draa, 2010c), inherits many properties from MDPs. Indeed, a stochastic game can simply be seen as a multiplayer MDP. Each state of that MDP can, in turn, be seen as a certain stage-game, and to each action in that MDP there corresponds a certain action profile. In an MDP, there is a transition function, a law describing how the state of the environment changes as a result of the executed actions. For a multiplayer environment, such a transition function can be defined as a mapping from the set of pairs "stage-game–action profile" into the set of stage-games.

The principal difference between stochastic games and MDPs is that stochastic games assume multiple decision makers, while in an MDP there is only one agent whose decisions trigger payoffs and change the state of the environment.

More formally, a stochastic game is a tuple (N, S, P, {A_i}_{i∈N}, {r_i}_{i∈N}, s^0), where

• N is the finite set of players, |N| ≡ n ∈ ℕ+;

• S is the finite set of states (stage-games), with s ∈ S being a particular stage-game or state;

• A_i is the finite set of actions available to agent i, with a_i ∈ A_i being a particular action of player i; let A ≡ ×_{i∈N} A_i denote the set of action profiles, with a ∈ A being a certain action profile;

• P is the transition function: S × A × S ↦ [0, 1], s.t. ∀s ∈ S, ∀a ∈ A: Σ_{s' ∈ S} P(s, a, s') = 1; P(s, a, s') defines the probability that the next state of the stochastic game will be s' if the action profile a is played in the current state s;

• r_i is the payoff function of player i: S × A ↦ ℝ; r_i(s, a) specifies the numerical payoff player i collects if action profile a is played in state s;

• s^0 is the initial state of the stochastic game. Unless stated otherwise, we assume that the stochastic game play always starts in the initial state s^0.

Note that in stochastic games, saying “the environment is in a state s ∈ S” is equivalent to saying “agents play a stage-game s ∈ S”.
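As an illustration of the tuple just defined, one might represent a small stochastic game in Python as follows; the two states, the payoffs, and the transition probabilities are invented for the example and are not taken from the thesis:

```python
from dataclasses import dataclass

@dataclass
class StochasticGame:
    n: int            # number of players |N|
    states: list      # S
    actions: dict     # actions[i] = A_i
    P: dict           # P[(s, a)] = {s': probability}, summing to 1 over s'
    r: dict           # r[(i, s, a)] = payoff of player i for profile a in state s
    s0: str           # initial state

# A toy two-player, two-state game with actions {C, D} in both states.
A = {1: ["C", "D"], 2: ["C", "D"]}
profiles = [(a1, a2) for a1 in A[1] for a2 in A[2]]
states = ["s_good", "s_bad"]

# Stage payoffs: the Prisoner's Dilemma of Figure 2.3 in "s_good",
# the same payoffs shifted down by 2 in "s_bad" (made-up numbers).
pd = {("C", "C"): (2, 2), ("C", "D"): (-1, 3),
      ("D", "C"): (3, -1), ("D", "D"): (0, 0)}

game = StochasticGame(
    n=2,
    states=states,
    actions=A,
    # Mutual cooperation keeps the game (mostly) in "s_good";
    # any other profile pushes it towards "s_bad".
    P={(s, a): ({"s_good": 0.9, "s_bad": 0.1} if a == ("C", "C")
                else {"s_good": 0.2, "s_bad": 0.8})
       for s in states for a in profiles},
    r={(i, s, a): pd[a][i - 1] - (0 if s == "s_good" else 2)
       for i in (1, 2) for s in states for a in profiles},
    s0="s_good",
)
```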

A diagram illustrating the process of game play in a stochastic game is shown in Figure 2.8. In the beginning of the stochastic game play, the game is in the initial state s^0. The agents play the corresponding stage-game by choosing and executing a certain action profile. For instance, given an action profile a^0 ≡ (a_1^0, . . . , a_n^0) played in state s^0 at time t = 0, player i collects the payoff r_i^0 = r_i(s^0, a^0); the game then can randomly transit to another state, say s^1, with probability P(s^0, a^0, s^1).


In this case, the players will then play another stage-game, the one corresponding to the state s^1, and so on.

Figure 2.8: The process of game play in a stochastic game.
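Continuing in the same illustrative spirit (all names and numbers below are made up), a single step of this game play, i.e., collect the payoffs r_i(s, a) and then sample the next state from P(s, a, ·), could be sketched as:

```python
import numpy as np

rng = np.random.default_rng(0)

def step(s, a, P, r, n, rng):
    """Play action profile a in state s: return the payoff profile
    (r_1(s, a), ..., r_n(s, a)) and a next state s' sampled from P(s, a, .)."""
    payoffs = tuple(r[(i, s, a)] for i in range(1, n + 1))
    next_states = list(P[(s, a)].keys())
    probs = [P[(s, a)][s2] for s2 in next_states]
    s_next = str(rng.choice(next_states, p=probs))
    return payoffs, s_next

# Tiny example with a single action profile per state, just to exercise `step`.
P = {("s0", ("C", "C")): {"s0": 0.5, "s1": 0.5},
     ("s1", ("C", "C")): {"s1": 1.0}}
r = {(1, "s0", ("C", "C")): 2, (2, "s0", ("C", "C")): 2,
     (1, "s1", ("C", "C")): 0, (2, "s1", ("C", "C")): 0}

s = "s0"
for t in range(3):
    (r1, r2), s = step(s, ("C", "C"), P, r, n=2, rng=rng)
    print(f"t={t}: payoffs=({r1}, {r2})")
```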

2.2.3.4 Markov Chain Games

Markov chain games (Renault et al., 2006) are a special case of stochastic games in which transitions between states do not depend on the action profiles played in these states. A Markov chain game can also be seen as an extension of repeated games to multistate environments. A repeated game transforms into a Markov chain game if there is a set S and a transition function P : S × S ↦ [0, 1], such that each state s ∈ S is a certain stage-game, and P(s, s') is the probability that the players will play stage-game s' after playing stage-game s.


Figure 2.9: Dynamic game models.

As in the more general stochastic games, the payoff function of player i in a Markov chain game is augmented with the state space, i.e., r_i : S × A ↦ ℝ.
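A hedged sketch of this construction (toy probabilities, not from the thesis): the stochastic-game transition function of a Markov chain game simply ignores the action profile it receives.

```python
import numpy as np

# A Markov chain over two stage-games; each row of Q sums to 1.
states = ["s_A", "s_B"]
Q = np.array([[0.7, 0.3],
              [0.4, 0.6]])

def P(s, a, s_next):
    """Stochastic-game transition function that does not depend on the profile a."""
    return Q[states.index(s), states.index(s_next)]

print(P("s_A", ("C", "D"), "s_B"))  # 0.3, whatever action profile is played
```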

Throughout this thesis, we will often refer to the term “dynamic games”. Let us informally specify it below.

Dynamic games: If a certain statement, notion, or result is true for all multi-stage games (including repeated games, Markov chain games, and stochastic games), we say that the statement, notion, or result holds in dynamic games. Hereinafter, we will use the term "dynamic game" to denote all multi-stage games (Başar and Olsder, 1999; Haurie and Zaccour, 2005).

As a consequence, one can see stage-games as static games. The relations between different dynamic game models are shown in Figure 2.9.

It is worth noting that in dynamic games, the notions of stage, time, period, and time-step mean the same thing, i.e., a moment of discrete time at which decisions are allowed to be taken. It is assumed that anything that can happen in a dynamic game happens during a certain time-step, and nothing happens between two subsequent time-steps.

Dynamic games can belong to different classes of games according to the structure of their payoff functions. Similarly to static games, they can be strictly cooperative, strictly adversarial, or general-sum. In multistate models, such as stochastic or Markov chain games, this means that the necessary relations between the players' payoffs hold in each state.

Now that we have defined the principal models of games, let us introduce the notion of a solution in a game.


2.2.4 Solution Concepts in Games

In static games, i.e., games that consist of only one stage, such as stage-games, only one outcome is generated before the game ends. In such a context, the long-term payoff of a certain game play for player i coincides with the immediate payoff of the outcome generated during that game play. This immediate payoff is given by the payoff function of player i. In dynamic games, one needs a richer definition of the long-term payoff. Indeed, the result of a game play is a sequence of outcomes and inter-state transitions. For certain players, not every such sequence would result in desirably high payoffs.

Let us assume that each player knows its long-term payoff criterion. The question is then to define, for each player, a way to choose its actions. Similarly to MDPs, such a way to choose the actions is called a strategy:

A strategy in games describes the way by which a player chooses actions to execute. A solution in games is an assignment of strategies to players, such that all players prefer to follow this assignment.

Until now, we have only presented solution concepts and algorithmic ways to solve single-agent models, such as MDPs. Now, let us proceed to the basic notions of solution in multiagent decision problems.

2.2.4.1 Solutions in Stage-Games

As previously, let us start with a simpler model of stage-games. Recall that in stage-games, the long-term payoff of player i coincides with the immediate payoff given by the payoff function of player i. Let us now specify the notion of strategy in stage-games:

A strategy of player i in a stage-game is therefore simply a mixed action. Player i first chooses a mixed action and then uses it as its strategy, i.e., as a way to choose actions.

By maximizing its immediate expected payoff given the mixed actions of the other players, player i is said to play its best response to these mixed actions:

A best response of player i to a mixed action profile α_{−i} of the opponent players is a (mixed) action α_i^* such that

    α_i^* ≡ argmax_{α_i' ∈ Δ(A_i)} r_i(α_i', α_{−i}).


We will denote the set of all best responses of player i to the action profile α−i as BRi(α−i). The notion of best response allows us, in turn, to define the notion of Nash equilibrium in stage-games:

A (Nash) equilibrium in a stage-game is a mixed action profile α ≡ (αi)i∈N with the property that for each player i ∈ N , αi is a best response of player i to α−i.

In other words, an equilibrium in stage-games is a mixed action profile with the property that when the other players stick to their respective mixed actions, the mixed action of player i maximizes its expected payoff. Assuming that each agent playing the game is rational, an equilibrium is therefore a solution of a stage-game, because it is a way to assign strategies to players such that the players will prefer to follow the assignment. Nash (1950a) showed that there is an equilibrium in any stage-game.
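These definitions translate into a direct check. The sketch below (illustrative helper names; a numerical tolerance is an assumption) computes each player's best pure-action payoff against the opponent's fixed mixed action and uses it to verify whether a profile is a Nash equilibrium of a two-player stage-game; since expected payoffs are linear in a player's own mixed action, it suffices to compare against pure-action deviations.

```python
import numpy as np

def expected_payoffs(R1, R2, alpha1, alpha2):
    """Expected payoff profile under mixed actions alpha1, alpha2."""
    return float(alpha1 @ R1 @ alpha2), float(alpha1 @ R2 @ alpha2)

def is_nash(R1, R2, alpha1, alpha2, tol=1e-9):
    """(alpha1, alpha2) is an equilibrium iff no player gains by deviating
    to a pure action (no mixed deviation can do better than the best pure one)."""
    u1, u2 = expected_payoffs(R1, R2, alpha1, alpha2)
    best1 = np.max(R1 @ alpha2)   # best pure-action payoff of Player 1 vs alpha2
    best2 = np.max(alpha1 @ R2)   # best pure-action payoff of Player 2 vs alpha1
    return best1 <= u1 + tol and best2 <= u2 + tol

# Prisoner's Dilemma of Figure 2.3: (D, D) is a pure action equilibrium, (C, C) is not.
R1 = np.array([[2, -1], [3, 0]])
R2 = np.array([[2,  3], [-1, 0]])
C = np.array([1.0, 0.0])
D = np.array([0.0, 1.0])
print(is_nash(R1, R2, D, D))  # True
print(is_nash(R1, R2, C, C))  # False: each player gains by deviating to D
```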

If, in an equilibrium α, for each player i, α_i is such that α_i(a_i) = 1 for a certain a_i ∈ A_i, then the equilibrium α is called a pure action equilibrium. If α is a pure action equilibrium, for simplicity, it can be denoted as an action profile a ∈ A, where each a_i is the action played by player i with probability 1. Every non-pure action equilibrium is a mixed action equilibrium. One property of any mixed action equilibrium α ≡ (α_i)_{i∈N} is that for all i ∈ N: r_i(a_i, α_{−i}) = r_i(α) if a_i ∈ A_i^{α_i}, and r_i(a_i, α_{−i}) ≤ r_i(α) otherwise. Stated in words, this means that in an equilibrium, each rational player i is indifferent (in terms of the expected payoff) between the pure actions in the support of its mixed action α_i; at the same time, player i cannot do better by playing any action outside the support.

We say that one mixed action profile α Pareto-dominates another mixed action profile α' if r_i(α) ≥ r_i(α') for all i ∈ N, and there exists i ∈ N such that r_i(α) > r_i(α'). If a certain action profile α is not Pareto-dominated by any other action profile, then α is called a Pareto-optimal (or Pareto-efficient) action profile. The same can be said about equilibria in stage-games: they can be Pareto-optimal and can Pareto-dominate other equilibria of the game.
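A small illustrative helper (not from the thesis) makes the dominance check explicit:

```python
def pareto_dominates(r, r_prime):
    """True iff payoff profile r Pareto-dominates r_prime: no player is worse
    off and at least one player is strictly better off."""
    return (all(x >= y for x, y in zip(r, r_prime))
            and any(x > y for x, y in zip(r, r_prime)))

# In the Prisoner's Dilemma, the cooperative payoff profile (2, 2)
# Pareto-dominates the equilibrium payoff profile (0, 0).
print(pareto_dominates((2, 2), (0, 0)))   # True
print(pareto_dominates((-1, 3), (2, 2)))  # False: Player 1 is worse off
```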

Let us now talk about finding equilibrium solutions in stage-games. Zero-sum games have an interesting property: all of their Nash equilibria yield the same payoff profile, called the value of the game, and the equilibria correspond to the maxmin solutions of the game. A mixed action α_i of player i is called maxmin if it maximizes the worst-case expected payoff of player i. A maxmin equilibrium is a mixed action profile α̂ ≡ (α̂_i)_{i∈{1,2}} such that each α̂_i is a maxmin action of player i. Equilibria of this kind can be found in zero-sum games by solving a certain linear program for each player i ∈ {1, 2}. For instance, for player 1, the linear program has |A_1| decision variables α_1(a_1), one for each action a_1 ∈ A_1. These decision variables specify the probability of playing each action. The overall linear program can be written as follows:

    Maximize:     min_{a_2 ∈ A_2} Σ_{a_1 ∈ A_1} α_1(a_1) r_1(a_1, a_2)
    Subject to:   Σ_{a_1 ∈ A_1} α_1(a_1) = 1                                  (2.7)
                  α_1(a_1) ≥ 0,   ∀ a_1 ∈ A_1
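Program (2.7) can be handed to any off-the-shelf LP solver after the standard reformulation that introduces an auxiliary variable v for the inner minimum: maximize v subject to v ≤ Σ_{a_1} α_1(a_1) r_1(a_1, a_2) for every a_2 ∈ A_2. The sketch below relies on scipy.optimize.linprog (an assumption about the available tooling, not something prescribed by the thesis) and computes Player 1's maxmin mixed action in Rock, Paper, Scissors:

```python
import numpy as np
from scipy.optimize import linprog

def maxmin_action(R1):
    """Player 1's maxmin mixed action and game value for payoff matrix R1
    (rows: Player 1's actions, columns: Player 2's actions)."""
    m, k = R1.shape
    # Variables: x = (alpha_1(a_1), ..., alpha_1(a_m), v); maximize v = minimize -v.
    c = np.zeros(m + 1)
    c[-1] = -1.0
    # For every opponent action a_2: v - sum_{a_1} alpha_1(a_1) r_1(a_1, a_2) <= 0.
    A_ub = np.hstack([-R1.T, np.ones((k, 1))])
    b_ub = np.zeros(k)
    # The probabilities sum to one.
    A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])
    b_eq = np.array([1.0])
    bounds = [(0, None)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    return res.x[:m], res.x[-1]

# Rock, Paper, Scissors (Figure 2.5): the maxmin action is uniform, the value is 0.
R1_rps = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]], dtype=float)
alpha1, value = maxmin_action(R1_rps)
print(np.round(alpha1, 3), round(value, 3))  # roughly [0.333 0.333 0.333] and 0.0
```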
