Multi-agent learning and repeated matrix games
Bruno Bouzy
Université Paris Descartes, France
Outline
The five agendas of Multi-Agent Learning (MAL)
Game theory
{Repeated} matrix games (RMG)
Equilibria (Nash, Correlated)
MAL algorithms
Equilibria learning
Best response learning
Competition & cooperation
Regret oriented algorithms
Leader algorithms
Discussion and experiments
Algorithms evaluation
Some results
Conclusion
Multi-agent learning (MAL)
•
From single-agent Reinforcement Learning to MAL
– Other agents change the environment
– "Non-stationarity"
•
Game theory
– Players = learners
– Equilibria
•
Multi-agent background
– Adaptive and autonomous agent
The five agendas of MAL (1/2)
•
AI Journal (2007)
– Shoham, Powers & Grenager, "If multi-agent learning is the answer, what is the question?"
• Game theory, fictitious play, equilibria
• Artificial Intelligence, single-agent learning
• The MAL literature has grown since 2004
•
Computational agenda
– MAL algorithms used to compute equilibria
•
Descriptive agenda
– How do natural agents learn in the context of other learners?
– Investigate models that fit human or population behavior
•
Normative agenda
– How to account for equilibria reached by given learning rules ?
The five agendas of MAL (2/2)
•
2 “prescriptive” agendas
– How should agents learn?
•
“Cooperative” prescriptive agenda
– Decentralized control in real world applications
• Control theory, distributed computing
• Joint policy
– Communication between agents is allowed
•
“Non Cooperative” prescriptive agenda (NCPA)
– How to obtain high reward in repeated games ?
– Convergence is not a goal
– Communication between agents is forbidden
– Set of players, tournaments
Cooperative or non cooperative ?
•
Nash meaning
– Non cooperative games (Annals Math, 1951)
– Two person cooperative games (Econometrica 1953)
– The distinction: whether communication is allowed or not
•
Common meaning
– Cooperative = friendly
– Non cooperative = competitive
•
Repeated matrix game meaning
– Non cooperative agenda = No communication between players
– Cooperative game = team game
• all agents receive the same reward
– Competitive game = zero-sum 2-person game
Game theory (outline)
{Repeated} Matrix game examples
Nash Equilibria
Correlated Equilibria
2-player matrix games
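Each cell lists the row player's reward, then the column player's reward.
•
Coordination: Battle of Sexes
           Foot     Theater
Foot       2, 1     0, 0
Theater    0, 0     1, 2
•
Competition: Prisoner Dilemma
           Coop     Defect
Coop       3, 3     1, 4
Defect     4, 1     2, 2
•
Stackelberg: Chicken
           Chicken  Dare
Chicken    6, 6     2, 7
Dare       7, 2     0, 0
Further examples (rows a, b; columns c, d):
     c        d
a    2, 2     0, 0
b    0, 0     1, 1

     c        d
a    -3, 3    0, 0
b    -1, 1    -2, 2

     c        d
a    1, 0     3, 2
b    2, 1     4, 0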
Nash equilibrium
•
Pure strategy = an action
•
Mixed strategy = a probability distribution over actions
•
Best response to other agent strategies
•
Nash equilibrium:
– Every player's strategy is a best response to the other players' strategies
– No player can gain by unilaterally changing its strategy
•
Pure (resp. mixed) equilibrium
– The strategies are pure (resp. mixed)
•
Property
– Every finite matrix game has at least one equilibrium, possibly in mixed strategies
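The best-response definition above can be checked mechanically for pure strategies. The following is a minimal sketch, not taken from the slides: the function name `pure_nash_equilibria` and the list-of-lists payoff encoding are assumptions made for illustration, and the payoffs are those of the three named games of the deck.

```python
def pure_nash_equilibria(row_payoff, col_payoff):
    """Return every cell (i, j) where neither player gains by deviating alone."""
    n_rows, n_cols = len(row_payoff), len(row_payoff[0])
    equilibria = []
    for i in range(n_rows):
        for j in range(n_cols):
            # Row player: action i must be a best response to column j.
            row_ok = all(row_payoff[i][j] >= row_payoff[k][j] for k in range(n_rows))
            # Column player: action j must be a best response to row i.
            col_ok = all(col_payoff[i][j] >= col_payoff[i][k] for k in range(n_cols))
            if row_ok and col_ok:
                equilibria.append((i, j))
    return equilibria


# Payoffs of the named games (index 0 = first action, 1 = second action).
battle_of_sexes = ([[2, 0], [0, 1]], [[1, 0], [0, 2]])
prisoner_dilemma = ([[3, 1], [4, 2]], [[3, 4], [1, 2]])
chicken = ([[6, 2], [7, 0]], [[6, 7], [2, 0]])

bos_ne = pure_nash_equilibria(*battle_of_sexes)
pd_ne = pure_nash_equilibria(*prisoner_dilemma)
chicken_ne = pure_nash_equilibria(*chicken)
```

Under this encoding the Prisoner Dilemma has a single pure equilibrium, (Defect, Defect), while Battle of Sexes and Chicken each have two.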
Nash equilibria
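Each cell lists the row player's reward, then the column player's reward.
•
Coordination: Battle of Sexes
           Foot     Theater
Foot       2, 1     0, 0
Theater    0, 0     1, 2
•
Competition: Prisoner Dilemma
           Coop     Defect
Coop       3, 3     1, 4
Defect     4, 1     2, 2
•
Stackelberg: Chicken
           Chicken  Dare
Chicken    6, 6     2, 7
Dare       7, 2     0, 0
Further examples (rows a, b; columns c, d):
     c        d
a    2, 2     0, 0
b    0, 0     1, 1

     c        d
a    -3, 3    0, 0
b    -1, 1    -2, 2
Mixed equilibrium of this zero-sum game: p(a) = 1/4, p(c) = 1/2

     c        d
a    1, 0     3, 2
b    2, 1     4, 0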
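The values p(a) = 1/4 and p(c) = 1/2 can be recovered from the indifference conditions of a 2x2 game: each player mixes so that the other is indifferent between its two actions. The sketch below is an illustration under that assumption (a fully mixed equilibrium exists); the function name `mixed_equilibrium_2x2` is hypothetical.

```python
from fractions import Fraction


def mixed_equilibrium_2x2(row_payoff, col_payoff):
    """Fully mixed equilibrium of a 2x2 game from the indifference conditions.

    Returns (p, q): p is the probability that the row player plays its first
    action, q the probability that the column player plays its first action.
    Assumes such an equilibrium exists (nonzero denominators, values in (0, 1)).
    """
    (a11, a12), (a21, a22) = row_payoff
    (b11, b12), (b21, b22) = col_payoff
    # p makes the column player indifferent between its two columns.
    p = Fraction(b22 - b21, b11 - b12 - b21 + b22)
    # q makes the row player indifferent between its two rows.
    q = Fraction(a22 - a12, a11 - a12 - a21 + a22)
    return p, q


# Zero-sum example of the slides: rows a, b and columns c, d.
zero_sum = ([[-3, 0], [-1, -2]], [[3, 0], [1, 2]])
p_a, p_c = mixed_equilibrium_2x2(*zero_sum)
```

Exact fractions are used so that the result can be compared directly with the values printed on the slide.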
Correlated Equilibrium (1/3)
•
Correlated Equilibrium (CE) (Aumann 1974)
– Probability distribution D over joint actions
– The players know D
– A joint action is drawn from D by a "referee" or "device"
– The referee tells each player only its own component of the joint action
– If no player can gain by deviating from its recommended action, then D is a CE
•
Property
– Every mixed NE is a CE
– Regret minimization methods converge to CE
– The highest expected reward achievable at a CE >= the highest achievable at a NE (every NE is a CE)
– The players communicate (or cooperate) through the referee
Correlated Equilibrium (2/3)
•
Battle of Sexes
•
Nash Equilibrium
– Pure: (F, F) and (T, T)
– Mixed: row player (2/3, 1/3), column player (1/3, 2/3) over (Foot, Theater)
– Expected rewards = (2/3, 2/3)
•
Correlated Equilibrium
– { ((F,F), ½), ((T,T), ½) }
– Expected rewards = (3/2, 3/2)
           Foot     Theater
Foot       2, 1     0, 0
Theater    0, 0     1, 2
Correlated Equilibrium (3/3)
•
Chicken Game
•
Nash Equilibrium
– Pure: (D, C) and (C, D)
– Mixed: each player plays Chicken with probability 2/3
– Expected rewards = (14/3, 14/3)
•
Correlated Equilibrium
– { ((C,C), 1/3), ((C,D), 1/3), ((D,C), 1/3) }
– Expected rewards = (5, 5)
           Chicken  Dare
Chicken    6, 6     2, 7
Dare       7, 2     0, 0
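The "no player wants to deviate" condition can be verified numerically for the distributions given on the last two slides. This is a minimal sketch: the dictionary encoding of D and the names `is_correlated_equilibrium` and `expected_rewards` are assumptions made for illustration.

```python
from fractions import Fraction


def is_correlated_equilibrium(dist, row_payoff, col_payoff):
    """dist maps joint actions (i, j) to probabilities.

    For every recommended action and every alternative action, the expected
    gain from switching (weighted by dist) must not be positive.
    """
    n_rows, n_cols = len(row_payoff), len(row_payoff[0])
    for i in range(n_rows):          # row player told to play i, considers k
        for k in range(n_rows):
            gain = sum(dist.get((i, j), 0) * (row_payoff[k][j] - row_payoff[i][j])
                       for j in range(n_cols))
            if gain > 0:
                return False
    for j in range(n_cols):          # column player told to play j, considers k
        for k in range(n_cols):
            gain = sum(dist.get((i, j), 0) * (col_payoff[i][k] - col_payoff[i][j])
                       for i in range(n_rows))
            if gain > 0:
                return False
    return True


def expected_rewards(dist, row_payoff, col_payoff):
    """Expected reward of each player under the joint distribution dist."""
    row = sum(p * row_payoff[i][j] for (i, j), p in dist.items())
    col = sum(p * col_payoff[i][j] for (i, j), p in dist.items())
    return row, col


# Chicken game: action 0 = Chicken (C), action 1 = Dare (D).
chicken_row = [[6, 2], [7, 0]]
chicken_col = [[6, 7], [2, 0]]
third = Fraction(1, 3)
chicken_ce = {(0, 0): third, (0, 1): third, (1, 0): third}

ce_ok = is_correlated_equilibrium(chicken_ce, chicken_row, chicken_col)
ce_rewards = expected_rewards(chicken_ce, chicken_row, chicken_col)
```

With this distribution the check succeeds and both expected rewards equal 5, whereas putting all the mass on (C, C) fails because each player would rather dare.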
MAL algorithms (outline)
Equilibria learning
Best response learning
Dealing with cooperation and competition
Regret oriented algorithms
Leader algorithms
Equilibria learning (1/4)
Minimax-Q (Littman 1994)
Action values are defined over joint actions:
$$V_i(s) = \max_{a \in A_i} \min_{o \in A_{-i}} Q_i(s, (a, o))$$
where a is the learning agent's action and o the opponent's action
The agent can observe the opponent's actions
The values do not depend on the opponents' strategies
Converges to the game-theoretic optimal value in 2-player zero-sum games
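For a single state, the max-min operator of the slide reduces to picking the row whose worst case is best. The sketch below follows the pure-action form written above; Littman's original formulation maximizes over mixed policies with a linear program, which is not reproduced here. The function name `maximin_action` is hypothetical.

```python
def maximin_action(q_values):
    """q_values[a][o] holds Q(s, (a, o)) for one fixed state s.

    Returns (value, action): the best worst-case value and the action
    that secures it, following V(s) = max_a min_o Q(s, (a, o)).
    """
    worst_case = [min(row) for row in q_values]
    best_action = max(range(len(q_values)), key=lambda a: worst_case[a])
    return worst_case[best_action], best_action


# Row player's rewards in the zero-sum example of the deck (rows a, b; columns c, d).
zero_sum_row = [[-3, 0], [-1, -2]]
security_value, security_action = maximin_action(zero_sum_row)
```

Here action b (index 1) guarantees at least -2, while action a can yield -3.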
Equilibria learning (2/4)
Nash-Q (Hu & Wellman 1998)
Minimax-Q extended to 2-player general-sum games
$$V_1(s) = \mathrm{Nash}(Q_1(s), Q_2(s)) = Q_1(s, \pi_1(s), \pi_2(s))$$
where (π1(s), π2(s)) is the NE of the matrix game (Q1(s), Q2(s))
Nash-Q computes mixed strategies
Converges under some (restrictive) conditions
There exists only one equilibrium for the entire game and for every game defined by the Q functions during learning
All equilibria must be of the same type: adversarial or coordination
Equilibria learning (3/4)
Friend-and-Foe-Q (FF-Q) (Littman 2001)
Converges under less restrictive conditions than Nash-Q, but needs to classify the opponent as "friend" or "foe"
Two algorithms, one per class of opponents
Foe opponent (Foe-Q):
$$V_i(s) = \max_{a \in A_i} \min_{o \in A_{-i}} Q_i(s, (a, o))$$
Friend opponent (Friend-Q):
$$V_i(s) = \max_{(a, o) \in A_i \times A_{-i}} Q_i(s, (a, o))$$
How to classify the opponents?
Not answered in Littman's paper
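The two value operators differ only in how the opponent's action is treated: minimized over for a foe, maximized over jointly for a friend. A minimal sketch on one state, with hypothetical function names `foe_value` and `friend_value` and the row player's Chicken rewards standing in for Q-values:

```python
def foe_value(q_values):
    """Foe-Q: the opponent is assumed to minimize the agent's value."""
    return max(min(row) for row in q_values)


def friend_value(q_values):
    """Friend-Q: the opponent is assumed to help, so take the best joint action."""
    return max(max(row) for row in q_values)


# q_values[a][o] for a single state; here the row player's Chicken rewards.
q_chicken = [[6, 2], [7, 0]]
foe_v = foe_value(q_chicken)
friend_v = friend_value(q_chicken)
```

The gap between the two values (2 versus 7 here) shows how much the friend-or-foe classification matters for the learned value.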
Equilibria learning (4/4)
Correlated-Q (CE-Q) (Greenwald & Hall 2003)
Generalization of Nash-Q and FF-Q
Allows several equilibria in the game
Looks for correlated equilibria
Four classes of opponents are defined => 4 CE types => 4 algorithms
"utilitarian" (uCE-Q),
"egalitarian" (eCE-Q),
"republican" (rCE-Q),
"libertarian" (lCE-Q)
Equilibria are computed with linear programming
The algorithm must compute the opponents' Q-values
From equilibria to best-response
Equilibrium learners focus on asymptotically achieving an equilibrium in self-play
Finding an equilibrium is not always the answer
All agents have to play the equilibrium
Bounded rationality
One agent may not play the equilibrium
Multiple equilibria
Agents have to cooperate to play the same equilibrium
Long-term best-response play
Learning strategies that play the best-response to the opponent's observed strategy.
Best-response learning
Learn to play the best response to opponent’s observed behavior
Naïve approach
Single-agent learning
Take into account the possibility that other agents' strategies might change
Fictitious play
Q-learning
Q-learning (Watkins 1989, 1992)
Single-agent learning
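The Q-learning update:
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left( r + \gamma \max_{a' \in A} Q(s', a') - Q(s, a) \right)$$
The greedy policy:
$$\pi(s) = \arg\max_{a \in A} Q(s, a)$$
Q-learning is guaranteed to converge in stationary environments if:
every state-action pair continues to be visited,
and the learning rate is decreased appropriately over time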
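A tabular version of the standard Q-learning update and of the greedy policy can be sketched as follows. The dictionary-based table, the function names, and the default values of `alpha` and `gamma` are assumptions made for illustration; in a repeated matrix game the state can simply be a single constant.

```python
from collections import defaultdict


def q_learning_update(Q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.9):
    """Standard tabular update: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])


def greedy_action(Q, state, actions):
    """pi(s) = argmax_a Q(s, a)."""
    return max(actions, key=lambda a: Q[(state, a)])


# Single-state usage, as in a repeated matrix game.
Q = defaultdict(float)          # unseen pairs default to 0.0
actions = ["coop", "defect"]
q_learning_update(Q, "s", "defect", 4.0, "s", actions, alpha=0.5)
chosen = greedy_action(Q, "s", actions)
```

In practice the greedy choice is combined with some exploration so that every state-action pair keeps being visited, as the convergence condition requires.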
Fictitious play (FP)
(Brown 1951)
The agent observes the time-average frequency of the other players' action choices, and models:
$$\mathrm{prob}(a_k) = \frac{\#\,\text{times } a_k \text{ observed}}{\#\,\text{total observations}}$$
The agent then plays a best response to this model
Converges in two-player zero-sum games
When it converges, the convergence point is a Nash equilibrium
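One step of fictitious play can be sketched as: turn the observed counts into the empirical model above, then pick the action with the highest expected reward against it. The function name `fictitious_play_action` and the count-list encoding are assumptions; at least one observation is assumed so that the frequencies are defined.

```python
def fictitious_play_action(opponent_counts, payoff):
    """Best response to the empirical distribution of the opponent's actions.

    opponent_counts[k]: number of times the opponent played action k (sum > 0).
    payoff[i][k]: own reward for own action i against opponent action k.
    """
    total = sum(opponent_counts)
    probs = [count / total for count in opponent_counts]
    expected = [sum(p * r for p, r in zip(probs, row)) for row in payoff]
    return max(range(len(payoff)), key=lambda i: expected[i])


# Row player of Battle of Sexes: action 0 = Foot, action 1 = Theater.
bos_row = [[2, 0], [0, 1]]
against_mostly_theater = fictitious_play_action([1, 3], bos_row)
against_mostly_foot = fictitious_play_action([3, 1], bos_row)
```

The agent follows whichever activity the opponent has chosen most often, weighted by its own payoffs.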
Previous experimental works
MALT (Zawadzki 2005)
Q learning (QL) is better than Nash-based algorithms
GAMUT (Nudelman & al 2004)
(Airiau 2007)
FP is ranked first
Specific representative RMG
Cooperation and competition (CC 1/3)
•
Prisoner Dilemma
•
Nash Equilibria
– Pure (D, D)
– Mixed: none
– Expected reward = 2
•
Pareto-optimal state that is better than (D,D)
– (C, C)
– Expected reward = 3
•
Best response algorithms, and regret oriented algorithms
– Converge to (D, D)
           Coop     Defect
Coop       3, 3     1, 4
Defect     4, 1     2, 2
S algorithm (CC 2/3)
•
Satisficing (S) algorithm (Stimpson & Goodrich 2003)
•
At instant t:
•
Receive the reward R(t)
•
Select action depending on the "aspiration" of agent i
– If R(t) >= Aspiration(i, t), then Action(i, t+1) = Action(i, t)
– Else Action(i, t+1) = random choice
•
Compute Aspiration (i, t)
– Aspiration(i, t+1) = λ Aspiration(i, t) + (1-λ) R(t)
– Aspiration(i, 0) = Rmax
•
In self-play, the S algorithm converges to the Pareto-optimal states
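The steps of the S algorithm listed above can be sketched as a small agent class. This is an illustration only: the class name `SatisficingAgent`, the default value of `lam`, and the uniformly random initial action are assumptions, not details from the slides.

```python
import random


class SatisficingAgent:
    """Keeps its action while the reward meets its aspiration, else picks at random."""

    def __init__(self, n_actions, r_max, lam=0.99, rng=None):
        self.n_actions = n_actions
        self.aspiration = r_max            # Aspiration(i, 0) = Rmax
        self.lam = lam
        self.rng = rng or random.Random()
        self.action = self.rng.randrange(n_actions)   # assumed: random first action

    def step(self, reward):
        """Process the reward R(t) and return the action for t+1."""
        if reward < self.aspiration:
            # Not satisfied: random choice among all actions.
            self.action = self.rng.randrange(self.n_actions)
        # Aspiration(i, t+1) = lam * Aspiration(i, t) + (1 - lam) * R(t)
        self.aspiration = self.lam * self.aspiration + (1 - self.lam) * reward
        return self.action


agent = SatisficingAgent(n_actions=2, r_max=4, lam=0.5, rng=random.Random(0))
```

Starting the aspiration at Rmax makes the agent explore early on, and the aspiration then relaxes toward the rewards actually received.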
M-Qubed (CC 3/3)
•
(Crandall & Goodrich 2005)
•
Combination of S algorithm and Q learning
•
Use Max or MiniMax (M3) strategies according to the situation (friendly or inimical)
•
Goals reached
– Max level against friendly agents
– Security level against inimical agents
Regret oriented algorithms
•
HMC (Hart & Mas-Colell 2000)
– Regret of action A over action B (internal regret)
– Probability of action X linear in its regret over last action
– Convergence to CE
•
Bandit algorithms: UCB (Auer & al 2002)
•
Exp3 and derived
Exp3 (Auer et al. 2002)
"Exponential-weight algorithm for Exploration and Exploitation"
Non-stochastic multi-armed bandit problems
Action selection scheme
Selection probability of action i
$$p_i(t) = (1 - \gamma) \frac{w_i(t)}{\sum_{j=1}^{K} w_j(t)} + \frac{\gamma}{K}$$
γ: mixture rate
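The selection probabilities above mix the normalized weights with a uniform distribution. The sketch below computes them and also applies the usual Exp3 weight update, which is not shown on the slide and is included here as an assumption (rewards scaled to [0, 1], importance-weighted estimate for the chosen action only). Function names are hypothetical.

```python
import math


def exp3_probabilities(weights, gamma):
    """p_i = (1 - gamma) * w_i / sum_j w_j + gamma / K."""
    K = len(weights)
    total = sum(weights)
    return [(1 - gamma) * w / total + gamma / K for w in weights]


def exp3_update(weights, probs, chosen, reward, gamma):
    """Assumed standard update: only the chosen action's weight changes.

    reward is expected in [0, 1]; dividing by the selection probability
    gives an unbiased estimate of the reward of the chosen action.
    """
    K = len(weights)
    estimated_reward = reward / probs[chosen]
    weights[chosen] *= math.exp(gamma * estimated_reward / K)


weights = [1.0, 1.0, 1.0]
uniform_probs = exp3_probabilities(weights, gamma=0.3)
exp3_update(weights, uniform_probs, chosen=0, reward=1.0, gamma=0.3)
updated_probs = exp3_probabilities(weights, gamma=0.3)
```

Because of the γ/K term, every action keeps a selection probability of at least γ/K, which is what guarantees continued exploration.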
UCB
(Auer & al 2002)
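Selects an action from a set of actions so as to both exploit and explore
Action = argmax_i ( m(i) + sqrt( log(T) / t(i) ) )
with m(i) the mean reward of action i, t(i) the number of times it was played, and T the total number of plays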
Used in UCT for computer Go and Amazons
UCT = UCB for Trees
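The UCB selection rule can be sketched directly from the formula of the slide. The function name `ucb_action` is hypothetical, and playing every untried action once before applying the formula is an assumption added to avoid a division by zero.

```python
import math


def ucb_action(mean_reward, play_count, total_plays):
    """Return argmax_i of m(i) + sqrt(log(T) / t(i)).

    mean_reward[i] is m(i), play_count[i] is t(i), total_plays is T.
    """
    # Assumed convention: try each action once before using the bound.
    for i, count in enumerate(play_count):
        if count == 0:
            return i
    scores = [m + math.sqrt(math.log(total_plays) / count)
              for m, count in zip(mean_reward, play_count)]
    return max(range(len(scores)), key=lambda i: scores[i])


# A rarely played action gets a large exploration bonus.
explore_choice = ucb_action([0.5, 0.4], [10, 1], total_plays=11)
# With equal counts, the bonus cancels out and the best mean wins.
exploit_choice = ucb_action([0.9, 0.1], [100, 100], total_plays=200)
```

The first call selects the under-sampled action despite its lower mean; the second selects the action with the higher mean.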
Leader algorithms
•
Implicit negotiation in repeated games (Littman & Stone 2001)
•
Bully
– Being a leader
• Action = argmax_i m1(i, J*(i))
– Assume the opponent is a best response algorithm
• J*(i) = argmax_j m2(i, j)
Example game, cells are (m1, m2):
     a        b
a    1, 0     3, 2
b    2, 1     4, 0
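The Bully rule can be applied to the example game above: for each of its own actions the leader predicts the follower's best response, then commits to the action that pays best under that prediction. In this sketch `m1` and `m2` are read as the leader's and the follower's payoff matrices (an assumption about the notation), and `bully_action` is a hypothetical name.

```python
def bully_action(m1, m2):
    """Leader's action: argmax_i m1[i][J*(i)] with J*(i) = argmax_j m2[i][j]."""

    def best_response(i):
        # Follower's best reply when the leader commits to action i.
        return max(range(len(m2[i])), key=lambda j: m2[i][j])

    return max(range(len(m1)), key=lambda i: m1[i][best_response(i)])


# Example game: rows are the leader's actions, columns the follower's.
m1 = [[1, 3], [2, 4]]   # leader's rewards
m2 = [[0, 2], [1, 0]]   # follower's rewards
leader_action = bully_action(m1, m2)
```

Row b dominates row a for the leader, yet committing to row a yields 3 (the follower answers with the second column) whereas row b only yields 2 once the follower best-responds.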
Discussion
Theoretical MAL criteria
Experimental evaluation criteria
Some experimental results
Theoretical MAL criteria
Bowling and Veloso (2001) proposed criteria for MAL algorithms.
Learning should :
(1) always converge to a stationary policy
(2) only terminate with a best-response to the play of other agents
Conitzer and Sandholm (2003) added another criterion :
(3) converge to Nash equilibrium in self-play
Experimental evaluation in the NCP Agenda
•
Players' confrontation
– One-to-one confrontations
– All-against-all tournaments with one-to-one confrontations
•
Set of games
– Cooperative games
– Competitive games
– General sum games
– GAMUT
Tournaments settings
Parameters
– Number of actions = 3
– Rewards: -9 <= R <= +9
– Number of repetitions = 100,000
– Number of matrix games (MG) = 100
For each MG, an all-against-all tournament is set up
Tournament results on general-sum MG
– 1: M3
– 2: S
– 3: UCB
– 4: FP
– 5: Exp3
– 6: Bully
– 7: HMC
– 8: QL
– 9: MinMax
– 10: Rand
Tournament results on team games
– 1: Exp3
– 2: M3
– 3: Bully
– 4: S
– 5: HMC
– 6: FP
– 7: UCB
– 8: QL
– 9: MinMax
– 10: Rand
Tournament results on zero-sum games
– 1: Exp3
– 2: M3
– 3: MinMax
– 4: FP
– 5: S
– 6: UCB
– 7: QL
– 8: HMC
– 9: Bully
– 10: Rand
Conclusions and future work
MAL Non Cooperative Prescriptive Agenda
Maximize the cumulative returns against many players in many RMG
Intersection of Game Theory, AI, Reinforcement Learning
Experimental results
UCB and M3 work well in general-sum games
Exp3 works well in both zero-sum games and team games
Current work, two important features
Averaging on the near past, and forgetting the far past
Use states where a state corresponds to the past joint action
Future work
Hedging algorithms, or expert algorithms
Algorithm game