Multi-agent learning and repeated matrix games

Academic year: 2022

Bruno Bouzy

Université Paris Descartes, France

(2)

Outline

The five agendas of Multi-Agent Learning (MAL)

Game theory

{Repeated} matrix games (RMG)

Equilibria (Nash, Correlated)

MAL algorithms

Equilibria learning

Best response learning

Competition & cooperation

Regret oriented algorithms

Leader algorithms

Discussion and experiments

Algorithms evaluation

Some results

Conclusion

(3)

Multi-agent learning (MAL)

Single-agent Reinforcement Learning → MAL

Other agents change the environment

"Non-stationarity"

Game theory

Players = learners

Equilibria

Multi-agent background

Adaptive and autonomous agent

(4)

The five agendas of MAL (1/2)

AI Journal (2007)

Shoham, Powers & Grenager, "If multi-agent learning is the answer, what is the question?"

Game theory, fictitious play → equilibria

Artificial Intelligence, single-agent learning

Since 2004, the MAL literature has been growing

Computational agenda

MAL algorithms used to compute equilibria

Descriptive agenda

How do natural agents learn in the context of other learners?

Investigate models that fit human or population behavior

Normative agenda

How to account for the equilibria reached by given learning rules?

(5)

The five agendas of MAL (2/2)

2 “prescriptive” agendas

How should agents learn?

“Cooperative” prescriptive agenda

Decentralized control in real world applications

Control theory, distributed computing

Joint policy

Communication between agents is allowed

“Non Cooperative” prescriptive agenda (NCPA)

How to obtain high rewards in repeated games?

Convergence is not a goal

Communication between agents is forbidden

Set of players, tournaments

(6)

Cooperative or non-cooperative?

Nash meaning

Non-cooperative games (Annals of Mathematics, 1951)

Two-person cooperative games (Econometrica, 1953)

Cooperative or not = communication allowed or not

Common meaning

Cooperative = friendly

Non cooperative = competitive

Repeated matrix game meaning

Non cooperative agenda = No communication between players

Cooperative game = team game

all agents receive the same reward

Competitive game = zero-sum 2-person game

(7)

Game theory (outline)

{Repeated} Matrix game examples

Nash Equilibria

Correlated Equilibria

(8)

2-player matrix games

Each cell gives (row player's reward, column player's reward).

Coordination
         c        d
a      2, 2     0, 0
b      0, 0     1, 1

Battle of Sexes
            Foot     Theater
Foot        2, 1      0, 0
Theater     0, 0      1, 2

Competition
         c        d
a     -3, 3     0, 0
b     -1, 1    -2, 2

Prisoner's Dilemma
           Coop     Defect
Coop       3, 3      1, 4
Defect     4, 1      2, 2

Stackelberg
         c        d
a      1, 0     3, 2
b      2, 1     4, 0

Chicken
            Chicken    Dare
Chicken      6, 6      2, 7
Dare         7, 2      0, 0

(9)

Nash equilibrium

Pure strategy = an action

Mixed strategy = a probability distribution over actions

Best response to other agent strategies

Nash equilibrium:

Every strategy is a best response to other strategies

No one wants to change its strategy

Pure (resp. mixed) equilibrium

The strategies are pure (resp. mixed)

Property

Every matrix game has at least one mixed equilibrium
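The best-response definition above can be checked mechanically for pure strategies. The following is a minimal sketch, assuming a game is stored as a nested list of (row reward, column reward) pairs; the function name `pure_nash_equilibria` and the constant names are illustrative and do not come from the slides. The payoffs are those of the Prisoner's Dilemma, Chicken and Competition games used in this deck.

```python
from itertools import product


def pure_nash_equilibria(game):
    """Return all pure-strategy Nash equilibria of a 2-player matrix game.

    game[i][j] is a pair (row player's reward, column player's reward).
    """
    n_rows, n_cols = len(game), len(game[0])
    equilibria = []
    for i, j in product(range(n_rows), range(n_cols)):
        r_row, r_col = game[i][j]
        # The row player must not gain by switching rows (column fixed).
        row_ok = all(game[k][j][0] <= r_row for k in range(n_rows))
        # The column player must not gain by switching columns (row fixed).
        col_ok = all(game[i][k][1] <= r_col for k in range(n_cols))
        if row_ok and col_ok:
            equilibria.append((i, j))
    return equilibria


# Action indices: 0 = first row/column, 1 = second row/column.
PRISONER = [[(3, 3), (1, 4)], [(4, 1), (2, 2)]]         # 0 = Coop, 1 = Defect
CHICKEN = [[(6, 6), (2, 7)], [(7, 2), (0, 0)]]          # 0 = Chicken, 1 = Dare
COMPETITION = [[(-3, 3), (0, 0)], [(-1, 1), (-2, 2)]]   # zero-sum game
```

With these inputs the function finds (Defect, Defect) for the Prisoner's Dilemma, the two asymmetric outcomes for Chicken, and no pure equilibrium for the Competition game, which therefore only has a mixed one.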

(10)

Nash equilibria

Each cell gives (row player's reward, column player's reward).

Coordination
         c        d
a      2, 2     0, 0
b      0, 0     1, 1

Battle of Sexes
            Foot     Theater
Foot        2, 1      0, 0
Theater     0, 0      1, 2

Competition
         c        d
a     -3, 3     0, 0
b     -1, 1    -2, 2

Prisoner's Dilemma
           Coop     Defect
Coop       3, 3      1, 4
Defect     4, 1      2, 2

Stackelberg
         c        d
a      1, 0     3, 2
b      2, 1     4, 0

Chicken
            Chicken    Dare
Chicken      6, 6      2, 7
Dare         7, 2      0, 0

Mixed equilibrium of the Competition game: p(a) = 1/4, p(c) = 1/2
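The mixed equilibrium p(a) = 1/4, p(c) = 1/2 follows from the indifference conditions: each player mixes so that the other player is indifferent between its two actions. Below is a small sketch of that computation for 2x2 games, assuming an interior (fully mixed) equilibrium exists; the function name `mixed_equilibrium_2x2` is hypothetical.

```python
from fractions import Fraction


def mixed_equilibrium_2x2(game):
    """Fully mixed Nash equilibrium of a 2x2 game via indifference conditions.

    game[i][j] = (row reward, column reward).
    Returns (p, q): p = probability of the first row, q = probability of the
    first column. Assumes an interior equilibrium exists (non-zero denominators).
    """
    (a11, b11), (a12, b12) = game[0]
    (a21, b21), (a22, b22) = game[1]
    # Row player's p makes the column player indifferent:
    #   p*b11 + (1-p)*b21 = p*b12 + (1-p)*b22
    p = Fraction(b22 - b21, b11 - b21 - b12 + b22)
    # Column player's q makes the row player indifferent:
    #   q*a11 + (1-q)*a12 = q*a21 + (1-q)*a22
    q = Fraction(a22 - a12, a11 - a12 - a21 + a22)
    return p, q


COMPETITION = [[(-3, 3), (0, 0)], [(-1, 1), (-2, 2)]]
CHICKEN = [[(6, 6), (2, 7)], [(7, 2), (0, 0)]]
```

Exact fractions are used so that the results can be compared directly with the values quoted on the slides.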

(11)

Correlated Equilibrium (1/3)

Correlated Equilibrium (CE) (Aumann 1974)

Probability distribution D over joint actions

The players know D

A joint action is drawn from D by a "referee" or "device"

The referee tells each player only its own action in the drawn joint action

If no player wants to deviate, then D is a CE

Property

Every mixed NE is a CE

Regret minimization methods converge to CE

Expected reward of the best CE >= expected reward of the best NE (every NE is a CE)

The players communicate (or cooperate) through the referee

(12)

Correlated Equilibrium (2/3)

Battle of Sexes

Nash Equilibrium

Pure: (F, F) and (T, T)

Mixed: (1/3, 2/3) and (2/3, 1/3)

Expected rewards = (2/3, 2/3)

Correlated Equilibrium

{ ((F,F), ½), ((T,T), ½) }

Expected rewards = (3/2, 3/2)

            Foot     Theater
Foot        2, 1      0, 0
Theater     0, 0      1, 2

(13)

Correlated Equilibrium (3/3)

Chicken Game

Nash Equilibrium

Pure: (D, C) and (C, D)

Mixed: (2/3, 2/3)

Expected rewards = (14/3, 14/3)

Correlated Equilibrium

{ ((C,C), 1/3), ((C,D), 1/3), ((D,C), 1/3) }

Expected rewards = (5, 5)

            Chicken    Dare
Chicken      6, 6      2, 7
Dare         7, 2      0, 0
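The "no player wants to deviate" condition of a correlated equilibrium can be verified directly: for each recommended action, switching to another action must not raise the player's expected reward, conditioned on the recommendation. The sketch below assumes a distribution stored as a dictionary from joint actions to probabilities; the function names are illustrative.

```python
from fractions import Fraction


def expected_rewards(game, dist):
    """Expected (row, column) rewards under a distribution over joint actions."""
    row = sum(p * game[i][j][0] for (i, j), p in dist.items())
    col = sum(p * game[i][j][1] for (i, j), p in dist.items())
    return row, col


def is_correlated_equilibrium(game, dist):
    """Check the CE incentive constraints for a 2-player matrix game."""
    n_rows, n_cols = len(game), len(game[0])

    def prob(i, j):
        return dist.get((i, j), 0)

    # Row player: told to play i, deviating to k must not pay off.
    for i in range(n_rows):
        for k in range(n_rows):
            gain = sum(prob(i, j) * (game[k][j][0] - game[i][j][0])
                       for j in range(n_cols))
            if gain > 0:
                return False
    # Column player: told to play j, deviating to k must not pay off.
    for j in range(n_cols):
        for k in range(n_cols):
            gain = sum(prob(i, j) * (game[i][k][1] - game[i][j][1])
                       for i in range(n_rows))
            if gain > 0:
                return False
    return True


CHICKEN = [[(6, 6), (2, 7)], [(7, 2), (0, 0)]]   # 0 = Chicken, 1 = Dare
THIRD = Fraction(1, 3)
CHICKEN_CE = {(0, 0): THIRD, (0, 1): THIRD, (1, 0): THIRD}
```

Applied to the Chicken game, the distribution from this slide passes the check and yields expected rewards (5, 5), whereas always recommending (Chicken, Chicken) fails because each player would rather dare.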

(14)

MAL algorithms (outline)

Equilibria learning

Best response learning

Dealing with cooperation and competition

Regret oriented algorithms

Leader algorithms

(15)

Equilibria learning (1/4)

Minimax-Q (Littman 1994)

Action values are based on joint actions:

$$V_i(s) = \max_{a_i \in A_i} \min_{o \in A_{-i}} Q_i(s, (a_i, o))$$

with a_i the learning agent's action and o the opponent's action

The agent can observe the opponent's actions

The values do not depend on the opponents' strategies

Converges to the game-theoretic optimal value in 2-player zero-sum games

(16)

Equilibria learning (2/4)

Nash-Q (Hu & Wellman 1998)

Minimax-Q extended to 2-player general-sum games

$$V_1(s) = \mathrm{Nash}(Q_1(s), Q_2(s)) = Q_1(s, (\pi_1(s), \pi_2(s)))$$

where (π1(s), π2(s)) is the NE of the matrix game (Q1(s), Q2(s))

Nash-Q computes mixed strategies

Converges under some (restrictive) conditions

There exists only one equilibrium for the entire game and for every game defined by the Q functions during learning

All equilibria must be of the same type: adversarial or coordination

(17)

Equilibria learning (3/4)

Friend-and-Foe-Q (FF-Q) (Littman 2001)

Converges under less restrictive conditions than Nash-Q, but needs to classify the opponent as "friend" or "foe"

Two algorithms, one per class of opponents

Foe opponent (Foe-Q):

$$V_i(s) = \max_{a_i \in A_i} \min_{o \in A_{-i}} Q_i(s, (a_i, o))$$

Friend opponent (Friend-Q):

$$V_i(s) = \max_{(a_i, o) \in A_i \times A_{-i}} Q_i(s, (a_i, o))$$

How to classify the opponents?

Not answered in Littman's paper
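The two value definitions on this slide differ only in how the opponent's action is treated: a foe minimizes the agent's value, a friend helps maximize it. The sketch below evaluates both on a single state, assuming the Q-values of that state are stored as a table `q_s[a][o]`; the function names are hypothetical. It keeps to pure actions, as the formulas on the slide do, whereas Littman's foe case maximizes over mixed strategies (solved by linear programming).

```python
def foe_value(q_s):
    """Foe value over pure actions: best worst-case row of the Q-table.

    q_s[a][o] is the agent's Q-value for its action a and opponent action o.
    """
    return max(min(row) for row in q_s)


def friend_value(q_s):
    """Friend value: the best entry over all joint actions."""
    return max(max(row) for row in q_s)


# Row player's rewards in the Competition game, used here as Q-values
# of a single state (an illustrative choice).
Q_STATE = [[-3, 0], [-1, -2]]
```

On this table the foe value is -2 (the agent secures at least -2 by playing the second action) and the friend value is 0.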

(18)

Equilibria learning (4/4)

Correlated-Q (CE-Q) (Greenwald & Hall 2003)

Generalization of Nash-Q and FF-Q

Allows several equilibria in the game

Looks for correlated equilibria

Four classes of opponents are defined => 4 CE types => 4 algorithms

"utilitarian" (uCE-Q),

"egalitarian" (eCE-Q),

"republican" (rCE-Q),

"libertarian" (lCE-Q)

Equilibria are computed with linear programming

The algorithm must compute opponent’s Q-values

(19)

From equilibria to best-response

Focus on asymptotically achieving an equilibrium in self-play

Finding an equilibrium is not always the answer

All agents have to play the equilibrium

Bounded rationality

One agent may not play the equilibrium

Multiple equilibria

Agents have to cooperate to play the same equilibrium

Long-term best-response play

Learning strategies that play the best-response to the opponent's observed strategy.

(20)

Best-response learning

Learn to play the best response to opponent’s observed behavior

Naïve approach

Single-agent learning

Take into account the possibility that other agents' strategies might change

Fictitious play

Q-learning

(21)

Q-learning (Watkins 1989, 1992)

Single-agent learning

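The Q-learning update (α: learning rate, γ: discount factor):

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left( r + \gamma \max_{a' \in A} Q(s', a') - Q(s, a) \right)$$

The greedy policy:

$$\pi(s) = \arg\max_{a \in A} Q(s, a)$$

Q-learning is guaranteed to converge in stationary environments if:

every state-action pair continues to be visited,

and the learning rate is decreased appropriately over time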
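As an illustration of the single-agent learner described on this slide, here is a minimal tabular sketch of the standard Q-learning update and the greedy policy π(s) = argmax Q(s, a). The table layout (a dictionary keyed by (state, action)), the function names, and the default values of the learning rate `alpha` and discount factor `gamma` are assumptions made for the example.

```python
from collections import defaultdict


def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One Q-learning step for the transition (s, a, r, s_next)."""
    best_next = max(Q[(s_next, b)] for b in actions)
    # Move Q(s, a) towards the bootstrapped target r + gamma * max_b Q(s', b).
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])


def greedy_policy(Q, s, actions):
    """pi(s) = argmax_a Q(s, a); ties go to the first action listed."""
    return max(actions, key=lambda a: Q[(s, a)])


# Unvisited state-action pairs start at 0.
Q = defaultdict(float)
```

In a repeated matrix game there is a single state, so `s` and `s_next` can simply be the same constant; exploration and a decreasing learning rate, required for the convergence guarantee, are left out of this sketch.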

(22)

Fictitious play (FP)

(Brown 1951)

The agent observes the time-averaged frequency of the other players' action choices, and models:

$$\mathrm{prob}(a_k) = \frac{\#\,\text{times } a_k \text{ observed}}{\#\,\text{total observations}}$$

The agent then plays the best response to this model

Converges in two-player zero-sum games

When it converges, the convergence point is a Nash equilibrium
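The two steps of fictitious play (model the opponent by empirical frequencies, then best-respond to that model) can be sketched as follows. This is an illustrative two-player version: the function names, the uniform model used before any observation, and the first-index tie-breaking are assumptions, not details from the slides.

```python
from collections import Counter


def best_response(my_rewards, opp_counts):
    """Best response to the opponent's empirical action frequencies.

    my_rewards[i][j]: my reward when I play i and the opponent plays j.
    opp_counts: Counter of the opponent's observed actions.
    """
    total = sum(opp_counts.values())
    n_opp = len(my_rewards[0])
    # prob(a_k) = #times a_k observed / #total observations (uniform if none).
    freq = [opp_counts[j] / total if total else 1 / n_opp for j in range(n_opp)]
    expected = [sum(f * r for f, r in zip(freq, row)) for row in my_rewards]
    return max(range(len(my_rewards)), key=lambda i: expected[i])


def fictitious_play(row_rewards, col_rewards, rounds=1000):
    """Both players run fictitious play; returns their action counts.

    col_rewards[j][i]: column player's reward when it plays j and row plays i.
    """
    seen_by_row, seen_by_col = Counter(), Counter()
    for _ in range(rounds):
        i = best_response(row_rewards, seen_by_row)
        j = best_response(col_rewards, seen_by_col)
        seen_by_row[j] += 1   # the row player observes the column action
        seen_by_col[i] += 1   # the column player observes the row action
    return seen_by_col, seen_by_row   # (row action counts, column action counts)


# Prisoner's Dilemma from each player's own point of view (0 = Coop, 1 = Defect).
PD = [[3, 1], [4, 2]]
```

In the Prisoner's Dilemma, Defect is a best response to every model, so both players defect in every round, which matches the pure Nash equilibrium (D, D).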

(23)

Previous experimental works

MALT (Zawadzki 2005)

Q learning (QL) is better than Nash-based algorithms

GAMUT (Nudelman et al. 2004)

(Airiau 2007)

FP is ranked first

Specific representative RMG

(24)

Cooperation and competition (CC 1/3)

Prisoner's Dilemma

Nash Equilibria

Pure: (D, D)

Mixed: none

Expected reward = 2

Pareto-optimal state that is better than (D,D)

(C, C)

Expected reward = 3

Best response algorithms, and regret oriented algorithms

Converge to (D, D)

           Coop     Defect
Coop       3, 3      1, 4
Defect     4, 1      2, 2

(25)

S algorithm (CC 2/3)

Satisficing (S) algorithm (Stimpson & Goodrich 2003)

At instant t:

Receive the reward R(t)

Select action depending on the « aspiration » of agent i

If R(t) >= Aspiration(i, t), then Action(i, t+1) = Action(i, t)

Else Action(i, t+1) = random choice

Compute Aspiration(i, t+1)

Aspiration(i, t+1) = λ Aspiration(i, t) + (1-λ) R(t)

Aspiration(i, 0) = Rmax

In self-play, the S algorithm converges to the Pareto-optimal states
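The rules of the S algorithm listed on this slide translate almost line by line into code. The sketch below is one possible reading: the class name `SatisficingAgent`, the helper `self_play`, the default λ = 0.99 and the random initial action are assumptions made for illustration.

```python
import random


class SatisficingAgent:
    """Keeps its action while rewards meet its aspiration level."""

    def __init__(self, n_actions, r_max, lam=0.99, rng=None):
        self.n_actions = n_actions
        self.aspiration = r_max            # Aspiration(i, 0) = Rmax
        self.lam = lam
        self.rng = rng if rng is not None else random.Random()
        self.action = self.rng.randrange(n_actions)

    def observe(self, reward):
        # If R(t) >= Aspiration(i, t) keep the action, else choose at random.
        if reward < self.aspiration:
            self.action = self.rng.randrange(self.n_actions)
        # Aspiration(i, t+1) = lam * Aspiration(i, t) + (1 - lam) * R(t)
        self.aspiration = self.lam * self.aspiration + (1 - self.lam) * reward


def self_play(game, rounds=10000, lam=0.99, seed=0):
    """Two satisficing agents play a matrix game; returns the last joint action."""
    rng = random.Random(seed)
    r_max = max(max(pair) for row in game for pair in row)
    row = SatisficingAgent(len(game), r_max, lam, rng)
    col = SatisficingAgent(len(game[0]), r_max, lam, rng)
    for _ in range(rounds):
        r_row, r_col = game[row.action][col.action]
        row.observe(r_row)
        col.observe(r_col)
    return row.action, col.action
```

Calling `self_play` on the Prisoner's Dilemma payoffs is a quick way to observe the behavior claimed for self-play; the outcome depends on λ, the seed and the number of rounds.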

(26)

M-Qubed (CC 3/3)

(Crandall & Goodrich 2005)

Combination of S algorithm and Q learning

Use Max or MiniMax (M3) strategies according to the situation (friendly or inimical)

Goals reached

Max level against friendly agents

Security level against inimical agents

(27)

Regret oriented algorithms

HMC (Hart & Mas-Colell 2000)

Regret of action A over action B (internal regret)

The probability of playing action X is linear in its regret over the last action played

Convergence to CE

Bandit algorithms: UCB (Auer et al. 2002)

Exp3 and derived algorithms

(28)

Exp3 (Auer et al. 2002)

"Exponential-weight algorithm for Exploration and Exploitation"

Non-stochastic multi-armed bandit problems

Action selection scheme

Selection probability of action i among K actions with weights w_j(t):

$$p_i(t) = (1 - \gamma) \frac{w_i(t)}{\sum_{j=1}^{K} w_j(t)} + \frac{\gamma}{K}$$

γ: mixture rate
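The selection probability above mixes the normalized weights with a uniform distribution at rate γ. A compact sketch of an Exp3 player follows. The selection rule is the one on the slide; the weight update (exponential in the importance-weighted reward) is the standard rule from Auer et al. and is not shown on the slide, and rewards are assumed to be rescaled to [0, 1]. The class and method names are illustrative.

```python
import math
import random


class Exp3:
    def __init__(self, n_actions, gamma=0.1, rng=None):
        self.k = n_actions
        self.gamma = gamma                 # mixture rate
        self.weights = [1.0] * n_actions   # w_i(1) = 1
        self.rng = rng if rng is not None else random.Random()

    def probabilities(self):
        # p_i = (1 - gamma) * w_i / sum_j w_j + gamma / K
        total = sum(self.weights)
        return [(1 - self.gamma) * w / total + self.gamma / self.k
                for w in self.weights]

    def select(self):
        return self.rng.choices(range(self.k), weights=self.probabilities())[0]

    def update(self, action, reward):
        """Update the weight of the played action; reward assumed in [0, 1]."""
        p = self.probabilities()[action]
        estimate = reward / p              # importance-weighted reward estimate
        self.weights[action] *= math.exp(self.gamma * estimate / self.k)
```

Because of the γ/K term, every action keeps a selection probability of at least γ/K, which is what lets the algorithm keep exploring against a non-stationary opponent. For the reward range used in the tournaments (-9 to +9), rewards would first need to be mapped to [0, 1].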

(29)

UCB

(Auer & al 2002)

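Select an action from a set of actions so as to both exploit and explore

$$\text{Action} = \arg\max_i \left( m(i) + \sqrt{\frac{\log T}{t(i)}} \right)$$

where m(i) is the mean reward observed for action i, t(i) the number of times action i was selected, and T the total number of selections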

Used in UCT for computer Go and Amazons

UCT = UCB for Trees
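The UCB selection rule on this slide can be sketched as a small class. The names `UCB`, `select` and `update` are illustrative, and the choice to play each action once before applying the formula (so that t(i) > 0) is an assumption of this sketch. The exploration term follows the slide, sqrt(log(T) / t(i)), without any extra constant.

```python
import math


class UCB:
    def __init__(self, n_actions):
        self.counts = [0] * n_actions      # t(i): times action i was played
        self.means = [0.0] * n_actions     # m(i): mean reward of action i
        self.total = 0                     # T: total number of plays

    def select(self):
        # Play every action once first so that t(i) > 0 in the formula.
        for i, count in enumerate(self.counts):
            if count == 0:
                return i
        return max(
            range(len(self.counts)),
            key=lambda i: self.means[i]
            + math.sqrt(math.log(self.total) / self.counts[i]),
        )

    def update(self, action, reward):
        self.total += 1
        self.counts[action] += 1
        # Incremental update of the running mean m(action).
        self.means[action] += (reward - self.means[action]) / self.counts[action]
```

The bonus term shrinks for actions that are played often and grows slowly with T for neglected ones, so an action that looks poor is still retried from time to time.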

(30)

Leader algorithms

Implicit negotiation in repeated games (Littman & Stone 2001)

Bully

Being a leader

• Action = argmax_i m1(i, J*(i))

Assume the opponent is a best-response algorithm

• J*(i) = argmax_j m2(i, j)

Each cell gives (m1, m2), the rewards of the leader (rows) and of the follower (columns):

         a        b
a      1, 0     3, 2
b      2, 1     4, 0
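Bully's choice can be computed in two steps: predict the follower's best reply J*(i) to each leader action i, then pick the leader action whose predicted outcome is best for the leader. The sketch below applies this to the matrix on this slide; the function name `bully_action` and the split of the matrix into `M1` (leader rewards) and `M2` (follower rewards) are illustrative.

```python
def bully_action(m1, m2):
    """Return (leader action, predicted follower reply) for the Bully strategy.

    m1[i][j]: leader's reward, m2[i][j]: follower's reward,
    for leader action i and follower action j.
    """
    def follower_reply(i):
        # J*(i) = argmax_j m2(i, j): the follower best-responds to action i.
        return max(range(len(m2[i])), key=lambda j: m2[i][j])

    # Action = argmax_i m1(i, J*(i))
    leader = max(range(len(m1)), key=lambda i: m1[i][follower_reply(i)])
    return leader, follower_reply(leader)


# Rewards from the matrix above (index 0 = action a, 1 = action b).
M1 = [[1, 3], [2, 4]]   # leader (row player)
M2 = [[0, 2], [1, 0]]   # follower (column player)
```

Here Bully commits to row a, to which the follower's best reply is column b, giving the leader a reward of 3. Row b is the leader's dominant action, but a best-responding follower would answer it with column a, leaving the leader with only 2.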

(31)

Discussion

Theoretical MAL criteria

Experimental evaluation criteria

Some experimental results

(32)

Theoretical MAL criteria

Bowling and Veloso (2001) proposed criteria for MAL algorithms.

Learning should:

(1) always converge to a stationary policy

(2) only terminate with a best-response to the play of other agents

Conitzer and Sandholm (2003) added another criterion:

(3) converge to Nash equilibrium in self-play

(33)

Experimental evaluation in the NCP Agenda

Players' confrontation

One-to-one confrontations

All-against-all tournaments with one-to-one confrontations

Set of games

Cooperative games

Competitive games

General sum games

GAMUT

(34)

Tournament settings

Parameters

– #actions = 3
– -9 <= R <= +9
– #repetitions = 100,000
– #MG = 100

For each MG, an all-against-all tournament is set up

(35)

Tournament results on general-sum MG

– 1: M3
– 2: S
– 3: UCB
– 4: FP
– 5: Exp3
– 6: Bully
– 7: HMC
– 8: QL
– 9: MinMax
– 10: Rand

(36)

Tournament results on team games

– 1: Exp3
– 2: M3
– 3: Bully
– 4: S
– 5: HMC
– 6: FP
– 7: UCB
– 8: QL
– 9: MinMax
– 10: Rand

(37)

Tournament results on zero-sum games

– 1: Exp3
– 2: M3
– 3: MinMax
– 4: FP
– 5: S
– 6: UCB
– 7: QL
– 8: HMC
– 9: Bully
– 10: Rand

(38)

Conclusions and future work

MAL Non Cooperative Prescriptive Agenda

Maximize the cumulative returns against many players in many RMG

Intersection of Game Theory, AI, Reinforcement Learning

Experimental results

UCB, M3 work well in general-sum games

Exp3 works well in both zero-sum games and team games

Current work, two important features

Averaging on the near past, and forgetting the far past

Use states where a state corresponds to the past joint action

Future work

Hedging algorithms, or expert algorithms

Algorithm game
