Research/Blog
Multi Agent Reinforcement Learning
- June 17, 2020
- Posted by: Shubha Manikarnike
- Category: Reinforcement Learning
#CellStratAILab #disrupt4.0 #WeCreateAISuperstars
I presented a session on Multi-Agent RL recently at the CellStrat AI Lab.
Introduction :-
In the standard Reinforcement Learning setup, you have a single agent which interacts with the environment. It uses observations from the environment, performs actions, and observes the rewards.
In real life, many applications will involve several agents interacting with the environment.
Examples include:
- Chess, which requires two players.
- Stock markets, where transactions by multiple agents affect the environment.
- Multiplayer games like Dota 2 and StarCraft II.
Let's look at another example of a multi-agent scenario:
Consider a traffic control setting where multiple controllable entities (e.g., traffic lights, autonomous vehicles) work together to reduce highway congestion. Here, each of these agents can act at a different time-scale (i.e., act asynchronously).
Agents can come and go from the environment as time progresses.
Such a setting produces a separate observation and reward stream for each agent.
Why current algorithms find Multi-Agent situations intractable :
Unfortunately, traditional reinforcement learning approaches such as Q-Learning or policy gradient are poorly suited to multi-agent environments.
One issue is that each agent’s policy is changing as training progresses, and the environment becomes non-stationary from the perspective of any individual agent (in a way that is not explainable by changes in the agent’s own policy). This presents learning stability challenges and prevents the straightforward use of past experience replay, which is crucial for stabilizing deep Q-learning.
Policy gradient methods, on the other hand, usually exhibit very high variance when coordination of multiple agents is required.
Approaches to Multi-Agent RL :-
- One approach would be to train all agents independently.
- Here, each agent considers all the other agents to be part of the environment and learns its own policy.
- Since all agents are learning simultaneously, the environment as seen from the perspective of any single agent changes dynamically.
- This condition is called non-stationarity of the environment.
- Most single-agent algorithms assume the environment is stationary, which is what allows them to converge.
- Another approach is to learn a single policy for all agents.
- It takes in the state of the environment and returns an action for each agent in the form of a single joint action vector.
- The drawback here is that the joint action space grows exponentially with the number of agents.
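As a quick back-of-the-envelope sketch (not from the original post) of how that joint action space blows up, assuming an illustrative 5 discrete actions per agent:

```python
# Joint action space size for N agents that each choose from 5 discrete actions.
for num_agents in (1, 2, 5, 10):
    actions_per_agent = 5                         # illustrative value
    joint_actions = actions_per_agent ** num_agents
    print(f"{num_agents} agents -> {joint_actions} joint actions")
# 1 -> 5, 2 -> 25, 5 -> 3125, 10 -> 9765625
```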
Markov Games :-
We are familiar with Markov Decision Process (MDP) for most basic RL environments.
Specifically, in an MDP, the agent observes the state s and receives a reward from the system after outputting the action a.
In a Markov game, on the other hand, all agents choose their actions a_i simultaneously after observing the system state s, and each receives its own individual reward r_i.
- A partially observable Markov game is the multi-agent extension of a Markov Decision Process.
- A Markov game for N agents is defined by a set of states S describing the possible configurations of all agents, a set of actions A_1, …, A_N and a set of observations O_1, …, O_N, one for each agent.
- To choose actions, each agent i uses a stochastic policy π_θi : O_i × A_i → [0, 1]; the joint action then produces the next state according to the state transition function T : S × A_1 × … × A_N → S.
- Each agent i obtains rewards as a function of the state and the agent's action, r_i : S × A_i → R, and receives a private observation correlated with the state, o_i : S → O_i.
- The initial states are determined by a distribution ρ : S → [0, 1].
- Each agent i aims to maximize its own total expected return R_i = E[ Σ_{t=0}^{T} γ^t r_i^t ], where γ is a discount factor and T is the time horizon.
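As a small worked example of this per-agent objective (the reward values and discount factor below are made up for illustration):

```python
# Per-agent discounted return R_i = sum_{t=0}^{T} gamma^t * r_i^t, for two agents.
gamma = 0.95
rewards = {                     # r_i^t for t = 0..3 (illustrative values)
    "agent_1": [1.0, 0.0, 1.0, 1.0],
    "agent_2": [0.0, 1.0, 0.0, 0.0],
}
returns = {
    agent: sum(gamma ** t * r for t, r in enumerate(rs))
    for agent, rs in rewards.items()
}
print(returns)                  # each agent maximizes only its own expected return
```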
Please note that Multi-Agent RL can operate in two modes:
- Competitive: two or more agents try to beat each other in order to maximize their own rewards.
- Collaborative: a group of agents works jointly to reach a shared goal.
Multi Agent Actor Critic for mixed environments :-
OpenAI developed a new algorithm, MADDPG, for centralized learning and decentralized execution in multi-agent environments, allowing agents to learn to collaborate and compete with each other.
In OpenAI's demo, MADDPG is used to train four red agents to chase two green agents. The red agents learn to team up with one another to chase a single green agent, gaining a higher reward. The green agents, meanwhile, learn to split up: while one is being chased, the other tries to approach the water (blue circle) while avoiding the red agents.
The green agents maximize rewards by getting to the water and avoiding the red agents.
Training vs Execution :-
Centralized Planning : Each agent only has direct access to local observations. These observations can be many things: an image of the environment, relative positions to landmarks, or even relative positions of other agents. Also, during learning, all agents are guided by a centralized module or critic.
Even though each agent only has local information and local policies to train, there is an entity overlooking the entire system of agents, advising them on how to update their policies. This reduces the effect of non-stationarity. All agents learn with the help of a module with global information.
Decentralized Execution : Then, during testing, the centralized module is removed, leaving only the agents, their policies, and local observations. This reduces the problem of increasing state and action space because joint policies need not be explicitly learned here. Instead, we hope that the central module has given enough information to guide local policy training such that it is optimal for the entire system once test time comes around.
MADDPG Architecture :-
Every agent has an observation space and continuous action space. Also, each agent has three components:
- An actor-network that uses local observations (represented as o) for deterministic actions (represented as a)
- A target actor-network with identical functionality for training stability
- A critic-network that uses joint state-action pairs to estimate Q-values.
As the critic learns the joint Q-value function over time, it sends appropriate Q-value approximations to the actor to help training.
MADDPG – Critic :-
MADDPG uses an experience replay buffer for efficient off-policy training. At each time step, the agents store the following joint transition: (x, a_1, …, a_N, r_1, …, r_N, x′), where x = (o_1, …, o_N) collects all agents' observations and x′ is the joint observation after the actions are applied.
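Since the original transition snippet appears as an image in the blog, here is a minimal sketch of what such a shared buffer entry might look like; the field names and buffer size are illustrative, not from the original post:

```python
# Sketch of the shared replay buffer: each entry holds the joint observation x,
# every agent's action and reward, and the next joint observation x'.
from collections import deque, namedtuple

Transition = namedtuple("Transition", ["x", "actions", "rewards", "x_next", "dones"])
replay_buffer = deque(maxlen=100_000)

def store(x, actions, rewards, x_next, dones):
    """Append one joint transition (x, a_1..a_N, r_1..r_N, x') to the buffer."""
    replay_buffer.append(Transition(x, actions, rewards, x_next, dones))
```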
Critic Update: To update an agent's centralized critic, we minimize a one-step lookahead TD error:

L(θ_i) = E[ ( Q_i^μ(x, a_1, …, a_N) − y )² ],  where  y = r_i + γ Q_i^{μ′}(x′, a_1′, …, a_N′) with a_j′ = μ_j′(o_j′).

Here μ denotes the actors and μ′ the target actors. Keep in mind that this is a centralized critic, meaning it uses joint information to update its parameters. The primary motivation is that knowing the actions taken by all agents makes the environment stationary even as the policies change.
Here Q_i^μ(x, a_1, …, a_N) is a centralized action-value function that takes as input the actions of all agents, a_1, …, a_N, in addition to some state information x, and outputs the Q-value for agent i.
MADDPG – Actor :-
Actor Updates: Similar to single-agent DDPG, we use the deterministic policy gradient to update each agent's actor parameters (μ denotes an agent's actor):

∇_θi J(μ_i) = E_{x, a ∼ D}[ ∇_θi μ_i(a_i | o_i) ∇_{a_i} Q_i^μ(x, a_1, …, a_N) |_{a_i = μ_i(o_i)} ]

In this update we take the gradient with respect to the actor's parameters, using the central critic to guide us. The most important thing to notice is that even though the actor only uses local observations and actions, a centralized critic is used during training, providing information about how good its actions are for the entire system. This reduces the effect of non-stationarity while keeping each policy's input small.
Policy Inference :-
MADDPG suggests inferring other agents’ policies to make learning even more independent. In effect, each agent adds N-1 more networks to estimate the true policy of each of the other agents. We use a probabilistic network and maximize the log probability of outputting another agent’s observed action.
L(φ_i^j) = −E_{o_j, a_j}[ log μ̂_i^j(a_j | o_j) + λ H(μ̂_i^j) ],

where this is the loss for the i-th agent estimating the j-th agent's policy μ̂_i^j, with an entropy regularizer H. As a result, the Q-value target changes slightly, since we replace the other agents' true actions with the actions predicted by these approximate policies.
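As a sketch of what one such approximation network might look like (the Gaussian parameterization, layer sizes, and regularizer weight λ are my assumptions, not from the original post):

```python
# Agent i fits a small network to agent j's observed (o_j, a_j) pairs by
# maximizing log-likelihood plus an entropy bonus.
import torch
import torch.nn as nn

class PolicyApprox(nn.Module):
    def __init__(self, obs_size, action_size, hidden=64):
        super().__init__()
        self.mean = nn.Sequential(nn.Linear(obs_size, hidden), nn.ReLU(),
                                  nn.Linear(hidden, action_size))
        self.log_std = nn.Parameter(torch.zeros(action_size))

    def loss(self, obs_j, actions_j, lam=1e-3):
        dist = torch.distributions.Normal(self.mean(obs_j), self.log_std.exp())
        log_prob = dist.log_prob(actions_j).sum(dim=-1)
        entropy = dist.entropy().sum(dim=-1)
        # Negative log-likelihood with an entropy regularizer, as in the text.
        return -(log_prob + lam * entropy).mean()
```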
Policy Ensembles :-
There’s one big issue with the approach above. In many multi-agent settings, especially in competitive ones, agents can craft policies that overfit to other agents’ behaviors. This makes policies brittle, unstable, and typically suboptimal.
To compensate for that, MADDPG trains a collection of K sub-policies for each agent. At the start of each episode, an agent randomly selects one of its sub-policies and executes it for that episode.
Suppose that policy μ_i is an ensemble of K different sub-policies, with sub-policy k denoted by μ_θi^(k) (abbreviated μ_i^(k)).
Since different sub-policies will be executed in different episodes, we maintain a replay buffer D_i^(k) for each sub-policy μ_i^(k) of agent i. Accordingly, we can derive the gradient of the ensemble objective with respect to θ_i^(k) as follows:

∇_{θ_i^(k)} J_e(μ_i) = (1/K) E_{x, a ∼ D_i^(k)}[ ∇_{θ_i^(k)} μ_i^(k)(a_i | o_i) ∇_{a_i} Q^{μ_i}(x, a_1, …, a_N) |_{a_i = μ_i^(k)(o_i)} ]
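A minimal sketch of the ensemble mechanics (the value of K, network sizes, and buffer size are illustrative choices, not from the original post):

```python
# K sub-policies per agent, one picked at random per episode, each with its own buffer.
import random
from collections import deque
import torch.nn as nn

K = 3                                   # number of sub-policies (illustrative)
OBS_SIZE, ACT_SIZE = 24, 2              # illustrative sizes

sub_policies = [nn.Sequential(nn.Linear(OBS_SIZE, 64), nn.ReLU(),
                              nn.Linear(64, ACT_SIZE), nn.Tanh())  # mu_i^(k)
                for _ in range(K)]
sub_buffers = [deque(maxlen=100_000) for _ in range(K)]            # D_i^(k)

def start_episode():
    """Pick which sub-policy this agent executes for the whole episode."""
    return random.randrange(K)

k = start_episode()
# During the episode, act with sub_policies[k] and store its transitions only in
# sub_buffers[k]; the gradient for theta_i^(k) is estimated from D_i^(k) alone.
```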
MADDPG vs DDPG :-
Red agents trained with MADDPG exhibit more complex behaviors than those trained with DDPG.
In the animation, agents trained with MADDPG (left) and DDPG (right) attempt to chase green agents through green forests and around black obstacles. The MADDPG agents catch more green agents and visibly coordinate more than those trained with DDPG.
MADDPG outperforms the other algorithms compared in the original paper's experiments.
Unity ML Agents – Tennis Environment :-
In this environment, two Agents play a game of tennis.
The environment is constructed in such a way that if an agent hits the ball over the net, it receives a reward of +0.1. If an agent lets the ball hit the ground or hits it out of bounds, it receives a reward of -0.01.
The observation space consists of 8 variables corresponding to the position and velocity of the ball and racket. Each agent receives its own local observation from the environment.
Each agent can take two actions, corresponding to movement toward (or away from) the net and jumping. The actions are defined in a continuous domain, which has implications for the algorithm design.
The goal of each agent is to keep the ball in play as long as possible.
The task is episodic, and in order to solve the environment (fulfil the project requirement), the agents must get an average score of +0.5 (over 100 consecutive episodes, after taking the maximum over both agents).
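That solve criterion can be checked with a short helper like the one below (a sketch that is not part of the original post; function and variable names are my own):

```python
# Keep the per-episode maximum score over the two agents, then require a
# 100-episode moving average of at least +0.5.
from collections import deque
import numpy as np

recent_scores = deque(maxlen=100)

def record_episode(agent_scores):
    """agent_scores: list of the two agents' undiscounted episode returns."""
    recent_scores.append(max(agent_scores))
    return len(recent_scores) == 100 and np.mean(recent_scores) >= 0.5
```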
Code Snippet : Model
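The original snippet appears as an image in the blog, so here is a minimal PyTorch sketch of what such a model file plausibly contains: an actor that maps a local observation to tanh-bounded actions, and a centralized critic over the joint observation-action pair. The layer widths are placeholder choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Actor(nn.Module):
    """Maps a single agent's local observation to a continuous action."""
    def __init__(self, obs_size, action_size, hidden=128):
        super().__init__()
        self.fc1 = nn.Linear(obs_size, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.fc3 = nn.Linear(hidden, action_size)

    def forward(self, obs):
        x = F.relu(self.fc1(obs))
        x = F.relu(self.fc2(x))
        return torch.tanh(self.fc3(x))           # actions bounded in [-1, 1]

class Critic(nn.Module):
    """Centralized critic: scores the joint observation-action pair."""
    def __init__(self, joint_obs_size, joint_action_size, hidden=128):
        super().__init__()
        self.fc1 = nn.Linear(joint_obs_size + joint_action_size, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.fc3 = nn.Linear(hidden, 1)

    def forward(self, joint_obs, joint_actions):
        x = torch.cat([joint_obs, joint_actions], dim=-1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)                        # scalar Q-value
```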
Code Snippet : DDPG Agent
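In place of the original image, a hedged sketch of a per-agent DDPG wrapper, assuming the Actor and Critic classes sketched above; learning rates, tau, and noise scale are illustrative hyperparameters.

```python
import copy
import numpy as np
import torch

class DDPGAgent:
    def __init__(self, obs_size, action_size, joint_obs_size, joint_action_size,
                 lr_actor=1e-4, lr_critic=1e-3, tau=1e-2):
        self.actor = Actor(obs_size, action_size)
        self.actor_target = copy.deepcopy(self.actor)
        self.critic = Critic(joint_obs_size, joint_action_size)
        self.critic_target = copy.deepcopy(self.critic)
        self.actor_opt = torch.optim.Adam(self.actor.parameters(), lr=lr_actor)
        self.critic_opt = torch.optim.Adam(self.critic.parameters(), lr=lr_critic)
        self.tau = tau

    def act(self, obs, noise_scale=0.1):
        """Deterministic action from the local observation, plus exploration noise."""
        with torch.no_grad():
            action = self.actor(obs).cpu().numpy()
        action += noise_scale * np.random.randn(*action.shape)
        return np.clip(action, -1.0, 1.0)

    def soft_update(self):
        """Polyak-average the target networks toward the online networks."""
        for target, online in ((self.actor_target, self.actor),
                               (self.critic_target, self.critic)):
            for t_param, o_param in zip(target.parameters(), online.parameters()):
                t_param.data.copy_(self.tau * o_param.data +
                                   (1.0 - self.tau) * t_param.data)
```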
Code Snippet : Update Critic
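Again in place of the original image, a sketch of what the centralized critic update can look like under the TD target above; `agents` is assumed to be a list of the DDPGAgent objects sketched earlier, and the per-agent batch tensors come from the shared replay buffer.

```python
import torch
import torch.nn.functional as F

def update_critic(agents, i, obs, actions, rewards, next_obs, dones, gamma=0.99):
    """One centralized critic update for agent i.

    obs, actions, rewards, next_obs, dones are lists with one batch tensor per agent."""
    agent = agents[i]

    with torch.no_grad():
        # Target actions from every agent's *target* actor: a_j' = mu_j'(o_j').
        next_actions = [a.actor_target(next_obs[j]) for j, a in enumerate(agents)]
        q_next = agent.critic_target(torch.cat(next_obs, dim=-1),
                                     torch.cat(next_actions, dim=-1))
        # One-step lookahead TD target: y = r_i + gamma * Q_i'(x', a_1', ..., a_N').
        y = rewards[i] + gamma * q_next * (1.0 - dones[i])

    q = agent.critic(torch.cat(obs, dim=-1), torch.cat(actions, dim=-1))
    critic_loss = F.mse_loss(q, y)

    agent.critic_opt.zero_grad()
    critic_loss.backward()
    agent.critic_opt.step()
    return critic_loss.item()
```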
Code Snippet : Update Actor
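A matching sketch of the actor update: the deterministic policy gradient flows through the centralized critic, while the other agents' sampled actions are held fixed. As above, this assumes the DDPGAgent objects and batch layout sketched earlier.

```python
import torch

def update_actor(agents, i, obs, actions):
    """Deterministic policy-gradient update for agent i's actor, guided by the
    centralized critic."""
    agent = agents[i]

    # Re-evaluate only agent i's action with its current actor; detach the rest.
    current_actions = [a.detach() for a in actions]
    current_actions[i] = agent.actor(obs[i])

    # Ascend the critic's value of the joint action => minimize its negative.
    actor_loss = -agent.critic(torch.cat(obs, dim=-1),
                               torch.cat(current_actions, dim=-1)).mean()

    agent.actor_opt.zero_grad()
    actor_loss.backward()
    agent.actor_opt.step()
    return actor_loss.item()
```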
Code Snippet : MADDPG
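Finally, in place of the original MADDPG snippet, a sketch of how the pieces above might be tied together for the two-agent Tennis task. `collate` is a hypothetical helper that stacks the sampled transitions into lists of per-agent tensors, and all hyperparameters are illustrative.

```python
import random
from collections import deque

class MADDPG:
    def __init__(self, obs_size, action_size, num_agents=2, buffer_size=100_000):
        joint_obs = obs_size * num_agents
        joint_act = action_size * num_agents
        # One DDPG agent (actor + centralized critic) per player.
        self.agents = [DDPGAgent(obs_size, action_size, joint_obs, joint_act)
                       for _ in range(num_agents)]
        self.buffer = deque(maxlen=buffer_size)

    def act(self, all_obs, noise_scale=0.1):
        # Decentralized execution: each actor sees only its own observation.
        return [agent.act(obs, noise_scale)
                for agent, obs in zip(self.agents, all_obs)]

    def step(self, transition, batch_size=256, gamma=0.99):
        # Centralized training: store the joint transition, then update every agent.
        self.buffer.append(transition)
        if len(self.buffer) < batch_size:
            return
        batch = random.sample(list(self.buffer), batch_size)
        # `collate` is a hypothetical helper that stacks the batch into lists of
        # per-agent tensors: obs, actions, rewards, next_obs, dones.
        obs, actions, rewards, next_obs, dones = collate(batch)
        for i, agent in enumerate(self.agents):
            update_critic(self.agents, i, obs, actions, rewards, next_obs, dones, gamma)
            update_actor(self.agents, i, obs, actions)
            agent.soft_update()
```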
Summary :-
- Multi-agent algorithms help train systems with multiple agents which compete or collaborate to achieve certain goals.
- Multi-Agent DDPG (MADDPG) is one such algorithm. It has one actor per agent, plus a centralized critic per agent that sees global information during training.
- MADDPG defines actors for agents that only use local observations. This helps curb the effects of an exponentially increasing state and action space.
- It defines a centralized critic for each agent that uses joint information. This helps reduce the effects of non-stationarity and guides the actor toward actions that are optimal for the global system.
- It defines policy inference networks to estimate other agents' policies. This helps limit agent interdependence and removes the need for agents to have perfect information.
- It defines policy ensembles to reduce the effects and possibility of overfitting to other agents’ policies.
CellStrat Training Course on “Deep Reinforcement Learning” :-
Learn advanced RL with CellStrat’s hands-on course on “Deep Reinforcement Learning”.
Details and Enrollment : https://bit.ly/CSDRLC
Questions? Please contact us at +91-9742800566!