Multi-agent learning and repeated matrix games

(1)

Bruno Bouzy

Université Paris Descartes, France

(2)

Outline

The five agendas of Multi-Agent Learning (MAL)

Game theory

– {Repeated} matrix games (RMG)

– Equilibria (Nash, Correlated)

MAL algorithms

– Equilibria learning

– Best response learning

– Competition & cooperation

– Regret oriented algorithms

– Leader algorithms

Discussion and experiments

– Algorithms evaluation

– Some results

Conclusion

(3)

Multi-agent learning (MAL)

•

From single-agent Reinforcement Learning to MAL

– Other agents change the environment

– « non stationarity »

•

Game theory

– Players = learners

– Equilibria

•

Multi-agent background

– Adaptive and autonomous agent

(4)

The five agendas of MAL (1/2)

•

AI Journal (2007)

– Shoham, Powers & Grenager, "If multi-agent learning is the answer, what is the question?"

• Game theory, fictitious play, equilibria

• Artificial Intelligence, single-agent learning

• The MAL literature has been growing since 2004

•

Computational agenda

– MAL algorithms used to compute equilibria

•

Descriptive agenda

– How do natural agents learn in the context of other learners?

– Investigate models that fit human or population behavior

•

Normative agenda

– How to account for equilibria reached by given learning rules ?

(5)

The five agendas of MAL (2/2)

•

2 “prescriptive” agendas

– How should agents learn?

•

“Cooperative” prescriptive agenda

– Decentralized control in real world applications

• Control theory, distributed computing

• Joint policy

– Communication between agents is allowed

•

“Non Cooperative” prescriptive agenda (NCPA)

– How to obtain high reward in repeated games ?

– Convergence is not a goal

– Communication between agents is forbidden

– Set of players, tournaments

(6)

Cooperative or non cooperative ?

•

Nash meaning

– Non cooperative games (Annals Math, 1951)

– Two person cooperative games (Econometrica 1953)

– = communication allowed or not

•

Common meaning

– Cooperative = friendly

– Non cooperative = competitive

•

Repeated matrix game meaning

– Non cooperative agenda = No communication between players

– Cooperative game = team game

• all agents receive the same reward

– Competitive game = zero-sum 2-person game

(7)

Game theory (outline)



{Repeated} matrix game examples

Nash Equilibria

Correlated Equilibria

(8)

2-player matrix games

Each cell lists the payoffs as (row player, column player).

•

Coordination

Battle of Sexes
            Foot    Theater
Foot        2, 1    0, 0
Theater     0, 0    1, 2

        c       d
a       2, 2    0, 0
b       0, 0    1, 1

•

Competition

Prisoner Dilemma
            Coop    Defect
Coop        3, 3    1, 4
Defect      4, 1    2, 2

        c       d
a       -3, 3   0, 0
b       -1, 1   -2, 2

•

Stackelberg

Chicken
            Chicken Dare
Chicken     6, 6    2, 7
Dare        7, 2    0, 0

        c       d
a       1, 0    3, 2
b       2, 1    4, 0

(9)

Nash equilibrium

•

Pure strategy = an action

•

Mixed strategy = a probability distribution over actions

•

Best response to other agent strategies

•

Nash equilibrium:

– Every strategy is a best response to other strategies

– No one wants to change its strategy

•

Pure (resp. mixed) equilibrium

– The strategies are pure (resp. mixed)

•

Property

– Every matrix game has at least one mixed equilibrium
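The definition above can be checked mechanically for pure strategies. The following is a minimal sketch, not part of the original slides: the function name `pure_nash_equilibria` and the list-of-lists payoff layout are assumptions made for illustration, and the two games are the Prisoner Dilemma and Battle of Sexes matrices used in this deck.

```python
from itertools import product


def pure_nash_equilibria(row_payoff, col_payoff):
    """Return every joint action (i, j) where no player gains by deviating alone.

    row_payoff[i][j] and col_payoff[i][j] are the payoffs of the row and
    column players when the row player picks i and the column player picks j.
    """
    n_rows, n_cols = len(row_payoff), len(row_payoff[0])
    equilibria = []
    for i, j in product(range(n_rows), range(n_cols)):
        # i must be a best response to j, and j a best response to i.
        row_best = all(row_payoff[i][j] >= row_payoff[k][j] for k in range(n_rows))
        col_best = all(col_payoff[i][j] >= col_payoff[i][k] for k in range(n_cols))
        if row_best and col_best:
            equilibria.append((i, j))
    return equilibria


# Prisoner Dilemma: action 0 = Coop, action 1 = Defect.
PD_ROW = [[3, 1], [4, 2]]
PD_COL = [[3, 4], [1, 2]]

# Battle of Sexes: action 0 = Foot, action 1 = Theater.
BOS_ROW = [[2, 0], [0, 1]]
BOS_COL = [[1, 0], [0, 2]]

# Zero-sum example of the deck: rows a, b and columns c, d.
ZS_ROW = [[-3, 0], [-1, -2]]
ZS_COL = [[3, 0], [1, 2]]
```

With these matrices the search returns (Defect, Defect) for the Prisoner Dilemma, the two coordinated outcomes for Battle of Sexes, and nothing for the zero-sum game, whose only equilibrium is mixed.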

(10)

Nash equilibria

Each cell lists the payoffs as (row player, column player).

•

Coordination

Battle of Sexes
            Foot    Theater
Foot        2, 1    0, 0
Theater     0, 0    1, 2

        c       d
a       2, 2    0, 0
b       0, 0    1, 1

•

Competition

Prisoner Dilemma
            Coop    Defect
Coop        3, 3    1, 4
Defect      4, 1    2, 2

        c       d
a       -3, 3   0, 0
b       -1, 1   -2, 2

Mixed equilibrium of this zero-sum game: p(a)=1/4, p(c)=1/2

•

Stackelberg

Chicken
            Chicken Dare
Chicken     6, 6    2, 7
Dare        7, 2    0, 0

        c       d
a       1, 0    3, 2
b       2, 1    4, 0

(11)

Correlated Equilibrium (1/3)

•

Correlated Equilibrium (CE) (Aumann 1974)

– Probability distribution D over joint actions

– The players know D

– A joint action is drawn from D by a « referee » or « device »

– The referee tells each player only its own action from the drawn joint action

– If no player wants to deviate, then D is a CE

•

Property

– Every mixed NE is a CE

– Regret minimization methods converge to CE

– Since every NE is a CE, the best CE gives an expected reward >= that of the best NE

– The players communicate (or cooperate) through the referee

(12)

Correlated Equilibrium (2/3)

•

Battle of Sexes

•

Nash Equilibrium

– Pure: (F, F) and (T, T)

– Mixed: over (Foot, Theater), the row player plays (2/3, 1/3) and the column player plays (1/3, 2/3)

– Expected rewards = (2/3, 2/3)

•

Correlated Equilibrium

– { ((F,F), ½), ((T,T), ½) }

            Foot    Theater
Foot        2, 1    0, 0
Theater     0, 0    1, 2
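The mixed equilibrium and its expected rewards can be recomputed from the indifference conditions: each player mixes so that the other player is indifferent between its two actions. The sketch below is an illustration only; the names `mixed_nash_2x2` and `expected_rewards` are hypothetical, it handles 2x2 games only, and it uses exact fractions to avoid rounding.

```python
from fractions import Fraction


def mixed_nash_2x2(row_payoff, col_payoff):
    """Fully mixed equilibrium of a 2x2 game, or None if there is none.

    Returns (p, q): p is the probability that the row player plays action 0,
    q the probability that the column player plays action 0.
    """
    A, B = row_payoff, col_payoff
    den_p = B[0][0] - B[0][1] - B[1][0] + B[1][1]
    den_q = A[0][0] - A[0][1] - A[1][0] + A[1][1]
    if den_p == 0 or den_q == 0:
        return None
    p = Fraction(B[1][1] - B[1][0], den_p)  # makes the column player indifferent
    q = Fraction(A[1][1] - A[0][1], den_q)  # makes the row player indifferent
    if not (0 < p < 1 and 0 < q < 1):
        return None
    return p, q


def expected_rewards(row_payoff, col_payoff, p, q):
    """Expected (row, column) rewards when both players mix independently."""
    probs = {(0, 0): p * q, (0, 1): p * (1 - q),
             (1, 0): (1 - p) * q, (1, 1): (1 - p) * (1 - q)}
    row = sum(pr * row_payoff[i][j] for (i, j), pr in probs.items())
    col = sum(pr * col_payoff[i][j] for (i, j), pr in probs.items())
    return row, col


# Battle of Sexes: action 0 = Foot, action 1 = Theater.
BOS_ROW = [[2, 0], [0, 1]]
BOS_COL = [[1, 0], [0, 2]]
# Prisoner Dilemma: action 0 = Coop, action 1 = Defect.
PD_ROW = [[3, 1], [4, 2]]
PD_COL = [[3, 4], [1, 2]]
```

For Battle of Sexes this gives p = 2/3 and q = 1/3 with expected rewards (2/3, 2/3), below the 3/2 that each player gets from the correlated equilibrium of this slide. For the Prisoner Dilemma it returns None, in line with the absence of a mixed equilibrium.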

(13)

Correlated Equilibrium (3/3)

•

Chicken Game

•

Nash Equilibrium

– Pure: (D, C) and (C, D)

– Mixed: (2/3, 2/3)

•

Correlated Equilibrium

– { ((C,C), 1/3), ((C,D), 1/3), ((D,C), 1/3) }

– Expected rewards = (5, 5)

            Chicken Dare
Chicken     6, 6    2, 7
Dare        7, 2    0, 0
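Whether a distribution D is a correlated equilibrium can be verified directly: for every recommended action, no player should gain on average by switching to another action. The following sketch is an assumption-laden illustration (the names `is_correlated_equilibrium` and `expected_rewards` and the dictionary encoding of D are not from the slides), applied to the Chicken distribution given above.

```python
from fractions import Fraction


def is_correlated_equilibrium(dist, row_payoff, col_payoff):
    """dist maps joint actions (i, j) to probabilities; missing pairs count as 0."""
    n_rows, n_cols = len(row_payoff), len(row_payoff[0])

    def prob(i, j):
        return dist.get((i, j), 0)

    # Row player: when told i, switching to k must not increase the payoff
    # (the conditional gain is left unnormalised, the sign is what matters).
    for i in range(n_rows):
        for k in range(n_rows):
            gain = sum(prob(i, j) * (row_payoff[k][j] - row_payoff[i][j])
                       for j in range(n_cols))
            if gain > 0:
                return False
    # Column player: same test on recommended column j against alternative k.
    for j in range(n_cols):
        for k in range(n_cols):
            gain = sum(prob(i, j) * (col_payoff[i][k] - col_payoff[i][j])
                       for i in range(n_rows))
            if gain > 0:
                return False
    return True


def expected_rewards(dist, row_payoff, col_payoff):
    """Expected (row, column) rewards under the joint distribution."""
    row = sum(p * row_payoff[i][j] for (i, j), p in dist.items())
    col = sum(p * col_payoff[i][j] for (i, j), p in dist.items())
    return row, col


# Chicken game: action 0 = Chicken, action 1 = Dare.
CHICKEN_ROW = [[6, 2], [7, 0]]
CHICKEN_COL = [[6, 7], [2, 0]]
THIRD = Fraction(1, 3)
CHICKEN_CE = {(0, 0): THIRD, (0, 1): THIRD, (1, 0): THIRD}
```

The distribution passes the test and yields expected rewards (5, 5), whereas always recommending (Chicken, Chicken) fails because either player would rather dare.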

(14)

MAL algorithms (outline)



Equilibria learning



Best response learning



Dealing with cooperation and competition



Regret oriented algorithms



Leader algorithms

(15)

Equilibria learning (1/4)



Minimax-Q (Littman 1994)

– Action values are based on joint actions:

$$V_i(s) = \max_{a \in A_i} \min_{o \in A_{-i}} Q_i(s, (a, o))$$

with a the learning agent's action and o the opponent's action

– The agent can observe the opponent's actions

– The values do not depend on the opponents' strategies

– Converges to the game-theoretic optimal value in 2-player zero-sum games

(16)

Equilibria learning (2/4)



Nash-Q (Hu & Wellman 1998)

– Minimax-Q extended to 2-player general-sum games:

$$V_1(s) = \mathrm{Nash}(Q_1(s), Q_2(s)) = Q_1(s, \pi_1(s), \pi_2(s))$$

where (π₁(s), π₂(s)) is the NE of the matrix game (Q₁(s), Q₂(s))

– Nash-Q computes mixed strategies

– Converges under some (restrictive) conditions:

• There exists only one equilibrium for the entire game and for every game defined by the Q functions during learning

• All equilibria must be of the same type: adversarial or coordination

(17)

Equilibria learning (3/4)



Friend-and-Foe-Q (FF-Q) (Littman 2001)

– Converges under less restrictive conditions than Nash-Q, but needs to classify the opponent as "friend" or "foe"

– Two algorithms, one per class of opponents

• Foe opponent (Foe-Q):

$$V_i(s) = \max_{a \in A_i} \min_{o \in A_{-i}} Q_i(s, (a, o))$$

• Friend opponent (Friend-Q):

$$V_i(s) = \max_{(a, o) \in A_i \times A_{-i}} Q_i(s, (a, o))$$

– How to classify the opponents?

• Not answered in Littman's paper
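The two state values differ only in how the opponent's action is treated. The sketch below follows the pure-action formulas of this slide; the published Minimax-Q and Foe-Q algorithms optimise over mixed strategies with a linear program, which is deliberately left out here. The function names and the `q_values[a][o]` layout are assumptions for illustration.

```python
def foe_value(q_values):
    """Foe value: best worst case, max over own actions of min over opponent actions."""
    return max(min(row) for row in q_values)


def friend_value(q_values):
    """Friend value: best entry over all joint actions (the opponent is assumed to help)."""
    return max(max(row) for row in q_values)


# Row player's payoffs in the zero-sum example of the deck,
# used here as a stand-in for the Q-values of a single state.
Q_EXAMPLE = [[-3, 0], [-1, -2]]
```

On this table the foe value is -2 (play b and expect the worst) while the friend value is 0 (play a and hope for d), which shows how much the classification of the opponent changes the target of learning.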

(18)

Equilibria learning (4/4)



Correlated-Q (CE-Q) (Greenwald & Hall 2003)

– Generalization of Nash-Q and FF-Q

– Allows several equilibria in the game

– Looks for correlated equilibria

– Four classes of opponents are defined => 4 CE types => 4 algorithms

• "utilitarian" (uCE-Q)

• "egalitarian" (eCE-Q)

• "republican" (rCE-Q)

• "libertarian" (lCE-Q)

– Equilibria are computed with linear programming

– The algorithm must compute the opponent's Q-values

(19)

From equilibria to best-response



Equilibrium learners focus on asymptotically reaching an equilibrium in self-play

Finding an equilibrium is not always the answer

– All agents have to play the equilibrium

– Bounded rationality

– One agent may not play the equilibrium

– Multiple equilibria

– Agents have to cooperate to play the same equilibrium

Long-term best-response play

– Learning strategies that play the best response to the opponent's observed strategy

(20)

Best-response learning



Learn to play the best response to the opponent's observed behavior

Naïve approach

– Single-agent learning

Take into account the possibility that other agents' strategies might change

– Fictitious play

– Q-learning

(21)

Q-learning (Watkins 1989, 1992)



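Single-agent learning

The Q-learning update (standard form, with learning rate α, discount factor γ, reward r and next state s'):

$$Q(s, a) \leftarrow (1 - \alpha)\, Q(s, a) + \alpha \left[ r + \gamma \max_{a' \in A} Q(s', a') \right]$$

The greedy policy:

$$\pi(s) = \arg\max_{a \in A} Q(s, a)$$

Q-learning is guaranteed to converge in stationary environments if:

– every state-action pair continues to be visited,

– and the learning rate is decreased appropriately over time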
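As a minimal sketch, the standard tabular Q-learning update and the greedy policy can be written as follows. The table layout (a dictionary of dictionaries), the function names and the default values of `alpha` and `gamma` are assumptions for illustration; exploration and the learning-rate schedule required for convergence are not shown.

```python
def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Apply one Q-learning update in place and return the new Q(state, action)."""
    best_next = max(q[next_state].values())
    q[state][action] = ((1 - alpha) * q[state][action]
                        + alpha * (reward + gamma * best_next))
    return q[state][action]


def greedy_action(q, state):
    """pi(s) = argmax_a Q(s, a)."""
    return max(q[state], key=q[state].get)


# A repeated matrix game can be treated as a single-state problem.
q_table = {"s": {"a": 0.0, "b": 0.0}}
```

In a repeated matrix game the reward of an action depends on what the other learner does, so the stationarity assumption behind the convergence guarantee above no longer holds.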

(22)

Fictitious play (FP)



(Brown 1951)

The agent observes the time-averaged frequency of the other players' action choices and models:

$$\mathrm{prob}(a_k) = \frac{\#\ \text{times } a_k \text{ observed}}{\#\ \text{total observations}}$$

The agent then plays a best response to this model

Converges in two-player zero-sum games

When it converges, the convergence point is a Nash equilibrium
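One step of fictitious play can be sketched as below: build the empirical model from observed counts, then pick the action with the highest expected payoff against it. The function names, the uniform model used before any observation and the tie-breaking toward the lowest index are assumptions, not choices stated in the slides.

```python
def empirical_model(opp_counts):
    """prob(a_k) = (# times a_k observed) / (# total observations)."""
    total = sum(opp_counts)
    if total == 0:
        # Assumption: start from a uniform model before any observation.
        return [1.0 / len(opp_counts)] * len(opp_counts)
    return [count / total for count in opp_counts]


def fictitious_play_action(payoff, opp_counts):
    """Best response of the row player to the empirical model of the opponent.

    payoff[a][o] is the row player's payoff for own action a and opponent action o.
    """
    model = empirical_model(opp_counts)
    expected = [sum(p * u for p, u in zip(model, row)) for row in payoff]
    return max(range(len(payoff)), key=expected.__getitem__)


# Row player of Battle of Sexes: action 0 = Foot, action 1 = Theater.
BOS_ROW = [[2, 0], [0, 1]]
```

If the opponent was seen choosing Foot three times out of four, the best response is Foot; if it chose Theater nine times out of ten, the best response switches to Theater.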

(23)

Previous experimental works

– MALT (Zawadzki 2005)

– Q-learning (QL) is better than Nash-based algorithms

– GAMUT (Nudelman et al. 2004)

– (Airiau 2007)

– FP is ranked first

– Specific representative RMG

(24)

Cooperation and competition (CC 1/3)

•

Prisoner Dilemma

•

Nash Equilibria

– Pure (D, D)

– Mixed: no

– Expected reward = 2

•

Pareto-optimal state that is better than (D,D)

– (C, C)

– Expected reward = 3

•

Best response algorithms, and regret oriented algorithms

– Converge to (D, D)

            Coop    Defect
Coop        3, 3    1, 4
Defect      4, 1    2, 2

(25)

S algorithm (CC 2/3)

•

Satisficing (S) algorithm (Stimpson & Goodrich 2003)

•

At instant t:

•

Receive the reward R(t)

•

Select action depending on the « aspiration » of agent i

– If R(t) >= Aspiration(i, t), then Action(i, t+1) = Action(i, t)

– Else Action(i, t+1) = random choice

•

Update the aspiration of agent i

– Aspiration(i, t+1) = λ Aspiration(i, t) + (1-λ) R(t)

– Aspiration(i, 0) = Rmax

•

In self-play, the S algorithm converges to the Pareto-optimal states
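The steps above can be sketched as a small agent class. This is an illustration under assumptions: the class name `SatisficingAgent`, the default λ of 0.99, the uniformly random first action and the injectable random generator are not specified in the slides.

```python
import random


class SatisficingAgent:
    """Keeps its action while the reward meets its aspiration, else picks at random."""

    def __init__(self, n_actions, r_max, lam=0.99, rng=None):
        self.n_actions = n_actions
        self.aspiration = r_max              # Aspiration(i, 0) = Rmax
        self.lam = lam
        self.rng = rng or random.Random()
        self.action = self.rng.randrange(n_actions)  # assumed: random first action

    def step(self, reward):
        """Observe R(t), choose Action(i, t+1), then update the aspiration."""
        if reward < self.aspiration:
            # Not satisfied: the next action is a uniformly random choice.
            self.action = self.rng.randrange(self.n_actions)
        # Aspiration(i, t+1) = lam * Aspiration(i, t) + (1 - lam) * R(t)
        self.aspiration = self.lam * self.aspiration + (1 - self.lam) * reward
        return self.action
```

Starting the aspiration at Rmax means the agent is initially dissatisfied with almost every outcome and keeps exploring; as the aspiration relaxes toward the rewards actually received, a mutually acceptable joint action eventually becomes stable.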

(26)

M-Qubed (CC 3/3)

•

(Crandall & Goodrich 2005)

•

Combination of S algorithm and Q learning

•

Use Max or MiniMax (M³) strategies according to the situation (friendly or inimical)

•

Goals reached

– Max level against friendly agents

– Security level against inimical agents

(27)

Regret oriented algorithms

•

HMC (Hart & Mas-Colell 2000)

– Regret of action A over action B (internal regret)

– Probability of action X linear in its regret over last action

– Convergence to CE

•

Bandit algorithms: UCB (Auer & al 2002)

•

Exp3 and derived
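The HMC rule can be sketched as follows: the probability of switching from the last action to another action is proportional to the average regret for not having played that other action whenever the last action was played. The function name, the history encoding and the normalisation constant `mu` (which the caller must choose large enough for the probabilities to stay valid) are assumptions for illustration.

```python
def regret_matching_probs(payoff, history, mu):
    """Distribution over the next action given the play history.

    payoff[a][o] is the learner's payoff, history is a list of
    (own_action, opponent_action) pairs, mu is a normalisation constant.
    """
    n_actions = len(payoff)
    t = len(history)
    last = history[-1][0]
    probs = [0.0] * n_actions
    for k in range(n_actions):
        if k == last:
            continue
        # Average gain had k replaced `last` every time `last` was played.
        regret = sum(payoff[k][o] - payoff[a][o]
                     for a, o in history if a == last) / t
        probs[k] = max(regret, 0.0) / mu
    # Remaining probability mass stays on the last action.
    probs[last] = 1.0 - sum(probs)
    return probs


# Row player of the Prisoner Dilemma: action 0 = Coop, action 1 = Defect.
PD_ROW = [[3, 1], [4, 2]]
```

After cooperating twice in the Prisoner Dilemma the regret for not defecting is positive, so some probability moves to Defect; after defecting there is no regret toward Coop and the learner stays on Defect, consistent with the convergence to (D, D) noted earlier.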

(28)

Exp3 (Auer et al. 2002)



"Exponential-weight algorithm for Exploration and Exploitation"

Non-stochastic multi-armed bandit problems

Action selection scheme

– Selection probability of action i:

$$p_i(t) = (1 - \gamma) \frac{w_i(t)}{\sum_{j=1}^{K} w_j(t)} + \frac{\gamma}{K}$$

– γ: mixture rate
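The selection probability above, together with the usual Exp3 weight update, can be sketched as below. The update rule (importance-weighted reward estimate and exponential weight increase) is the standard one from the Exp3 literature rather than something shown on this slide, rewards are assumed to lie in [0, 1], and the function names are hypothetical.

```python
import math


def exp3_probabilities(weights, gamma):
    """p_i = (1 - gamma) * w_i / sum_j w_j + gamma / K."""
    total = sum(weights)
    k = len(weights)
    return [(1 - gamma) * w / total + gamma / k for w in weights]


def exp3_update(weights, probs, action, reward, gamma):
    """Return new weights after playing `action` and receiving `reward` in [0, 1]."""
    k = len(weights)
    estimate = reward / probs[action]   # unbiased estimate of the action's reward
    new_weights = list(weights)
    new_weights[action] *= math.exp(gamma * estimate / k)
    return new_weights
```

With equal weights every action is selected with probability 1/K whatever the value of γ; only the played action has its weight changed, and dividing by its selection probability compensates for actions that are rarely tried.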

(29)

UCB



(Auer et al. 2002)

Selects an action among a set of actions so as to both exploit and explore

$$\text{Action} = \arg\max_i \left( m(i) + \sqrt{\log(T) / t(i)} \right)$$

Used in UCT for computer Go and Amazons

UCT = UCB for Trees
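A minimal sketch of this selection rule follows. It assumes that m(i) is the mean reward observed for action i, t(i) the number of times action i was played and T the total number of plays, which is the usual reading of the formula but is not spelled out on the slide; playing each untried action once first is also an assumption, needed to avoid dividing by zero.

```python
import math


def ucb_action(mean_rewards, play_counts):
    """Pick argmax_i of m(i) + sqrt(log(T) / t(i))."""
    # Assumption: any action that was never played is tried first.
    for i, count in enumerate(play_counts):
        if count == 0:
            return i
    total = sum(play_counts)
    scores = [m + math.sqrt(math.log(total) / n)
              for m, n in zip(mean_rewards, play_counts)]
    return max(range(len(scores)), key=scores.__getitem__)
```

The square-root term shrinks as an action is played more often, so a rarely played action with a slightly lower mean can still be selected, which is how the rule balances exploration and exploitation.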

(30)

Leader algorithms

•

Implicit negotiation in repeated games (Littman & Stone 2001)

•

Bully

– Being a leader

• Action = argmax _i m1(i, J*(i))

– Assume the opponent is a best response algorithm

• J*(i) = argmax _j m2(i, j)

        a       b
a       1, 0    3, 2
b       2, 1    4, 0
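The Bully rule can be sketched directly from the two formulas above: compute the follower's best response J*(i) to each leader action i, then pick the leader action with the best resulting payoff. The function names and the tie-breaking toward the lowest index are assumptions; M1 and M2 are the leader and follower payoffs of the matrix on this slide.

```python
def best_response(follower_payoff, i):
    """J*(i) = argmax_j m2(i, j): the follower's best reply to leader action i."""
    row = follower_payoff[i]
    return max(range(len(row)), key=row.__getitem__)


def bully_action(leader_payoff, follower_payoff):
    """Action = argmax_i m1(i, J*(i))."""
    return max(range(len(leader_payoff)),
               key=lambda i: leader_payoff[i][best_response(follower_payoff, i)])


# Rows: leader actions a, b. Columns: follower actions a, b.
M1 = [[1, 3], [2, 4]]   # leader payoffs m1(i, j)
M2 = [[0, 2], [1, 0]]   # follower payoffs m2(i, j)
```

On this matrix Bully commits to a, the follower replies b, and the leader earns 3. Playing the dominant action b instead would lead the follower to reply a and leave the leader with only 2, which is why committing first pays off in this Stackelberg-type game.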

(31)

Discussion



Theoretical MAL criteria



Experimental evaluation criteria



Some experimental results

(32)

Theoretical MAL criteria



Bowling and Veloso (2001) proposed criteria for MAL algorithms.

Learning should:

– (1) always converge to a stationary policy

– (2) only terminate with a best response to the play of other agents

Conitzer and Sandholm (2003) added another criterion:

– (3) converge to a Nash equilibrium in self-play

(33)

Experimental evaluation in the NCP Agenda

•

Players' confrontation

– One-to-one confrontations

– All-against-all tournaments with one-to-one confrontations

•

Set of games

– Cooperative games

– Competitive games

– General sum games

– GAMUT

(34)

Tournaments settings



Parameters

– #actions = 3

– -9 <= R <= +9

– #repetitions = 100,000

– #MG = 100

For each MG, an all-against-all tournament is set up

(35)

Tournament results on general-sum MG

– 1: M3

– 2: S

– 3: UCB

– 4: FP

– 5: Exp3

– 6: Bully

– 7: HMC

– 8: QL

– 9: MinMax

– 10: Rand

(36)

Tournament results on team games

– 1: Exp3

– 2: M3

– 3: Bully

– 4: S

– 5: HMC

– 6: FP

– 7: UCB

– 8: QL

– 9: MinMax

– 10: Rand

(37)

Tournament results on zero-sum games

– 1: Exp3

– 2: M3

– 3: MinMax

– 4: FP

– 5: S

– 6: UCB

– 7: QL

– 8: HMC

– 9: Bully

– 10: Rand

(38)

Conclusions and future work

MAL Non Cooperative Prescriptive Agenda

– Maximize the cumulative returns against many players in many RMG

– Intersection of Game Theory, AI, Reinforcement Learning

– Experimental results

• UCB and M3 work well in general-sum games

• Exp3 works well in both zero-sum games and team games

– Current work, two important features

• Averaging on the near past, and forgetting the far past

• Use states where a state corresponds to the past joint action

– Future work

• Hedging algorithms, or expert algorithms

• Algorithm game