RL with Actor-Critic Methods

March 19, 2020
Posted by: vsinghal
Category: Reinforcement Learning Robotics

#CellStratAILab #disrupt4.0 #WeCreateAISuperstars #AlwaysUpskilling

Minutes from Saturday 14th March 2020 AI Lab Workshop at BLR :-

Session Presenter : SHUBHA M., Deep Reinforcement Learning Researcher, CellStrat AI Lab

Last Saturday, our Reinforcement Learning Team Lead Shubha M. gave a fantastic presentation and workshop on the Actor-Critic method used in RL. She also demonstrated this technique applied to stock market prediction.

Reinforcement Learning broadly involves Value-based methods and Policy-based Methods.

VALUE-BASED LEARNING :-

An RL agent operates within a state, action, reward paradigm: in a particular state, the agent takes a certain action, for which the environment grants it a reward.

An MDP or Markov Decision Process provides the mathematical framework to solve the RL problem.

The Markov property states that the future depends only on the present state and not on past states. In other words, the current state captures all past dependencies. The Markov Reward Process (MRP) models this paradigm by attaching rewards to the states of a Markov process.

**Markov Reward Process for a Student’s journey**

The cumulative reward at time step t is the sum of the current reward and the discounted future rewards. Future rewards are exponentially discounted by the factor γ (gamma), which normally lies between 0 and 1. The agent’s goal is to learn a policy π which maximizes this cumulative reward.
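As a minimal sketch of this definition, the discounted sum can be computed from a list of rewards. The function name `discounted_return` and the sample reward sequence are illustrative assumptions, not taken from the workshop material.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards, each discounted by gamma per time step of delay."""
    g = 0.0
    # Work backwards so that each step applies G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Hypothetical reward sequence for one episode.
rewards = [-2.0, -2.0, -2.0, 10.0]
print(discounted_return(rewards, gamma=0.5))
```

With γ close to 0 the agent mostly cares about the immediate reward; with γ close to 1 later rewards count almost as much as early ones.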

The state-value function V_π(s) measures how good it is to be in a certain state when following the policy π.

The sample returns from Student Markov Reward Process can be depicted as :-

In this way, a value can be calculated for each state.

The Bellman Equation for MRP is written as :-
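In the usual textbook notation (an assumption here, since these symbols are not defined in the text above), with $\mathcal{R}_s$ the expected immediate reward in state $s$ and $\mathcal{P}_{ss'}$ the probability of moving from $s$ to $s'$, the equation reads:

$$
v(s) = \mathcal{R}_s + \gamma \sum_{s'} \mathcal{P}_{ss'} \, v(s')
$$

That is, the value of a state is its immediate reward plus the discounted, probability-weighted value of its successor states.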

The Bellman Equation for Student MRP can be depicted as follows :-

A Markov Decision Process is a Markov Reward Process with decisions.

The state-value function captures the value starting from a particular state s. The action-value function captures the value starting from a particular state s and taking action a, as per policy π.

The state-value function and action-value functions may be decomposed as follows :-
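In standard notation (lowercase $v$ and $q$, as in David Silver's lectures, denote the same functions written V and Q elsewhere in this post), the decomposition into immediate reward plus discounted value of the successor is:

$$
v_\pi(s) = \mathbb{E}_\pi\left[ R_{t+1} + \gamma \, v_\pi(S_{t+1}) \mid S_t = s \right]
$$

$$
q_\pi(s, a) = \mathbb{E}_\pi\left[ R_{t+1} + \gamma \, q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a \right]
$$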

The optimal value functions are found by maximizing over all policies :-
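Written out in the standard textbook form, with $v_\pi$ and $q_\pi$ the state-value and action-value functions of policy π:

$$
v_*(s) = \max_\pi v_\pi(s), \qquad q_*(s, a) = \max_\pi q_\pi(s, a)
$$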

The optimal policy π_* is found by maximizing over q_*(s,a) :-
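A standard way to write this, assuming a deterministic greedy policy, is:

$$
\pi_*(a \mid s) =
\begin{cases}
1 & \text{if } a = \arg\max_{a'} q_*(s, a') \\
0 & \text{otherwise}
\end{cases}
$$

So once $q_*(s, a)$ is known, acting optimally only requires picking the action with the highest value in each state.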

Deep Q-Learning :-

Q-learning is a model-free reinforcement learning algorithm to learn a policy telling an agent what action to take under what circumstances [from Wikipedia].

**Q-learning : An off-policy TD control algorithm**
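The core update of tabular Q-learning can be sketched as follows. This is a minimal illustration under assumed names (`q_learning_update`, a dictionary-based Q-table, string-valued states and actions), not the code used in the workshop.

```python
from collections import defaultdict

def q_learning_update(q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.99, done=False):
    """Apply one Q-learning update to the table q and return the TD error."""
    # Off-policy target: bootstrap from the best next action,
    # regardless of which action the behaviour policy takes next.
    best_next = 0.0 if done else max(q[(next_state, a)] for a in actions)
    td_error = reward + gamma * best_next - q[(state, action)]
    q[(state, action)] += alpha * td_error
    return td_error

# Hypothetical two-action problem; unseen entries default to 0.
q = defaultdict(float)
q[("s1", "right")] = 2.0
q_learning_update(q, "s0", "right", reward=1.0, next_state="s1",
                  actions=["left", "right"], alpha=0.5, gamma=0.9)
print(q[("s0", "right")])
```

The `max` over next actions is what makes the method off-policy: the target assumes greedy behaviour even while the agent explores.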

Arriving at an optimal policy involves a trade-off between Exploration and Exploitation. Exploitation is about going with the best-known choice (e.g. visit your favorite restaurant). Exploration involves trying other choices in order to discover better long-term rewards (e.g. try a new restaurant).

We use techniques such as ε-greedy to make a call on exploration vs exploitation at each step.
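A minimal sketch of ε-greedy action selection is shown here; the function name `epsilon_greedy` and the list-of-Q-values representation are assumptions made for illustration.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    # Exploit: index of the largest Q-value (the first one on ties).
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Hypothetical Q-values for three actions in the current state.
action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.1)
```

In practice ε is often started high and decayed over training, so the agent explores early and exploits later.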

Q-Learning does have limitations; e.g. the total number of system states may be enormous (such as the pixel values of an Atari game screen), which makes a Q-table impractical. Also, methods built on Q(s, a) and V(s) work with discrete action spaces, and are not suitable for continuous control spaces such as the angle of a steering wheel or the temperature of a heater.

Then comes Deep Q-Learning, which replaces the Q-table with a neural network that approximates the Q-values :-

For an Atari game a DQN might look like :-

**High Level Architecture for an Atari game DQN**

A DQN for an Atari game takes the pixel states as input and predicts a Q-value for each possible action; the agent then chooses its action from these values. The change in game score is fed back to the network as the reward at each time step.

The Q-values are updated as per this formula :-
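In its standard form, with learning rate α, reward r, next state s' and discount factor γ, the Q-learning update is:

$$
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
$$

The bracketed term is the TD error: the gap between the TD target $r + \gamma \max_{a'} Q(s', a')$ and the current estimate.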

For additional information on Q-value update, click here.

We also employ an Experience Replay technique in order to avoid forgetting previous experiences and to reduce correlations between experiences.
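A replay buffer can be sketched in a few lines. The class name `ReplayBuffer` and the tuple layout of a transition are assumptions for illustration; the buffer used in the demo may differ.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of transitions, sampled uniformly at random."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest transition when full.
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling breaks up the temporal correlation
        # between consecutive transitions.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)
```

Training on random mini-batches from such a buffer lets each experience be reused several times instead of being discarded after one update.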

In normal DQN learning, the same weights are used for estimating both the target and the Q value, so the target shifts every time the weights are updated. The weight adjustment is given by :-
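In the usual function-approximation notation (assumed here), with $\hat{Q}(s, a, w)$ the network's estimate under weights $w$:

$$
\Delta w = \alpha \left[ r + \gamma \max_{a'} \hat{Q}(s', a', w) - \hat{Q}(s, a, w) \right] \nabla_w \hat{Q}(s, a, w)
$$

Note that $w$ appears both in the TD target and in the estimate being corrected.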

A technique called Fixed Q Targets (introduced by DeepMind) allows us to avoid this problem of chasing moving targets. Here we use a separate network with fixed weights w⁻ for estimating the TD target. Every τ (tau) steps, we copy the parameters from our DQN network to the target network, so the Target-Q Network’s weights are updated less often than those of the primary Q-Network.
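The periodic copy can be sketched as below. The helper `sync_target` and the representation of weights as a plain dictionary are hypothetical stand-ins; a deep learning framework would copy its own parameter objects instead.

```python
import copy

def sync_target(online_weights, target_weights, step, tau):
    """Copy online weights into the target network every tau steps."""
    if step % tau == 0:
        return copy.deepcopy(online_weights)  # hard update
    return target_weights  # otherwise keep the frozen copy

# Hypothetical weights stored as a plain dict of lists.
online = {"layer1": [0.0, 0.0]}
target = copy.deepcopy(online)
for step in range(1, 7):
    online["layer1"][0] += 1.0  # stand-in for a gradient step
    target = sync_target(online, target, step, tau=3)
    print(step, online["layer1"][0], target["layer1"][0])
```

Between copies the target weights stay frozen, so the TD target does not move while the online network is being trained towards it.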

POLICY-BASED LEARNING :-

Here the system learns an optimal policy directly without storing action-values. Unlike value-based methods, policy-based methods can learn true stochastic policies. Also policy-based methods are suitable for continuous action spaces.

The policy gradient is always of the form (for details and derivation of this equation, please check our prior post on Policy Gradients here) :-
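In the notation commonly used for this result (assumed here), with τ a trajectory sampled from the policy $\pi_\theta$ and $R(\tau)$ its total reward:

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \left( \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right) R(\tau) \right]
$$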

The central term is the log likelihood of the policy. In our context, it measures how likely the trajectory is under the current policy. We multiply this by the rewards, so that trajectories with highly positive rewards become more likely under the policy and trajectories with negative rewards become less likely.

The REINFORCE algorithm is given by :-

*Image Credit : UC Berkeley DRL Course CS294-112*

REINFORCE algorithm can be stated as follows :-

1) Perform a trajectory roll-out using the current policy
2) Store log probabilities (of policy) and reward values at each step
3) Calculate discounted cumulative future reward at each step
4) Compute policy gradient and update policy parameter
5) Repeat 1–4
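The steps above can be sketched on a toy problem. Everything specific here is an assumption made for illustration: a stateless environment with two actions (action 0 pays 1, action 1 pays 0), a softmax policy with one preference per action, and hand-picked hyperparameters. Instead of storing log probabilities, the sketch stores the actions and uses the analytic gradient of the log softmax.

```python
import math
import random

def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def discounted_returns(rewards, gamma):
    """Discounted cumulative future reward G_t for every step t."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return returns[::-1]

def reinforce(episodes=2000, horizon=3, gamma=0.9, lr=0.05, seed=0):
    rng = random.Random(seed)
    theta = [0.0, 0.0]       # policy parameters: one preference per action
    arm_reward = [1.0, 0.0]  # hypothetical toy rewards: action 0 is better
    for _ in range(episodes):
        # 1) roll out a trajectory with the current policy
        probs = softmax(theta)
        actions, rewards = [], []
        for _ in range(horizon):
            a = 0 if rng.random() < probs[0] else 1
            actions.append(a)      # 2) store what is needed for log pi
            rewards.append(arm_reward[a])
        # 3) discounted cumulative future reward at each step
        returns = discounted_returns(rewards, gamma)
        # 4) d/d theta[b] of log softmax(theta)[a] is 1[b == a] - probs[b]
        grad = [0.0, 0.0]
        for a, g in zip(actions, returns):
            for b in range(2):
                grad[b] += g * ((1.0 if b == a else 0.0) - probs[b])
        theta = [t + lr * gb for t, gb in zip(theta, grad)]  # gradient ascent
    return softmax(theta)

print(reinforce())
```

After training, the policy should place most of its probability on action 0, the action with the higher reward.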

The Gradient tries to :-

increase probability of paths with positive R
decrease probability of paths with negative R

A trajectory is a sequence of states and actions in one particular episode.

In the REINFORCE algorithm, we update the policy parameters through Monte Carlo updates (i.e. using returns from sampled trajectories). Because trajectories differ widely from one sample to the next, the log probabilities (of the policy distribution) and the cumulative reward values have high variance, leading to noisy gradients. This causes unstable learning or policies skewing in non-optimal directions.

One way to reduce variance and increase stability is to subtract a baseline from the cumulative reward.

Recall that the policy gradient is given by :-

We establish a Reward baseline as follows :-
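In a commonly used form (the symbols $G_t$ and $b$ are assumptions of this sketch), with $G_t = \sum_{t' \ge t} \gamma^{t'-t} r_{t'}$ the discounted future reward from step t and $b(s_t)$ the baseline:

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, \big( G_t - b(s_t) \big) \right]
$$

As long as the baseline does not depend on the action, subtracting it leaves the expected gradient unchanged and only affects the variance. Typical choices are the average return or the state value $V^\pi(s_t)$.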

The Advantage function A is defined as :-
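In standard notation, the advantage is the action-value minus the state-value:

$$
A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t)
$$

A positive advantage means the action is better than what the policy achieves on average from that state; a negative one means it is worse.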

The Advantage function provides a measure of how each action compares to a certain baseline. Using A^π(s_t, a_t) centers the learning signal and reduces the variance significantly.

A Vanilla Policy Gradient algorithm or VPG is given by :-

Another policy-based method is Proximal Policy Optimization (PPO), which is described here.

Actor-Critic :-

Here we combine Value-based and Policy-based methods. The Actor is policy-based and Critic is value-based.

First, let’s focus on choosing the right baseline. Actor-critic methods use the value function as a baseline for policy gradients; the only fundamental difference between actor-critic methods and other baseline methods is that actor-critic methods utilise a learned value function :-

=> In effect, we increase the log probability of an action in proportion to how much its return is better than the expected return V(s) under the current policy.

So we can rewrite the policy gradient using the advantage function:
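In standard notation (assumed here), the gradient with the advantage as the learning signal is:

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, A^\pi(s_t, a_t) \right]
$$

With a learned value function, the advantage is commonly estimated by the one-step TD error:

$$
A^\pi(s_t, a_t) \approx r_t + \gamma V^\pi(s_{t+1}) - V^\pi(s_t)
$$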

The Advantage function provides a measure of how each action compares to the average performance at the state s_t, which is given by V_π(s_t).

The Actor-Critic architecture consists of two neural networks, the Actor and the Critic.

The Actor network takes the state as input and outputs the probabilities of the actions.
The Critic network receives the state and the reward resulting from the previous interaction. The critic uses the TD error calculated from this information to update itself and the actor.
The Actor network is trained to maximize the expected reward using gradient ascent.
The Critic network is trained to minimize the mean squared TD error between its state-value estimates and their targets.
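The interplay of the two updates can be sketched for a single transition. This is a simplified tabular illustration, not the neural-network model from the demo: the function `actor_critic_step`, the table `values` standing in for the Critic and the preference table `theta` standing in for the Actor are all assumptions.

```python
import math

def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def actor_critic_step(theta, values, s, a, r, s_next, done,
                      gamma=0.99, lr_actor=0.1, lr_critic=0.1):
    """One-step actor-critic update for a single transition (s, a, r, s_next)."""
    # Critic: TD error between the bootstrapped target and the current estimate.
    target = r if done else r + gamma * values[s_next]
    td_error = target - values[s]
    values[s] += lr_critic * td_error  # moves V(s) towards the target
    # Actor: gradient ascent on log pi(a|s), weighted by the TD error,
    # which serves as the advantage estimate.
    probs = softmax(theta[s])
    for b in range(len(theta[s])):
        grad_log = (1.0 if b == a else 0.0) - probs[b]
        theta[s][b] += lr_actor * td_error * grad_log
    return td_error

# Hypothetical problem with two states and two actions.
theta = [[0.0, 0.0], [0.0, 0.0]]
values = [0.0, 0.0]
delta = actor_critic_step(theta, values, s=0, a=1, r=1.0, s_next=1, done=True)
print(delta, values, theta)
```

A positive TD error raises both the value estimate of the state and the preference for the action just taken; a negative one lowers them.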

*Image Credit :* *https://arxiv.org/pdf/1708.05866.pdf*

Here is a summary of PG Algorithms :-

The Q Actor Critic algorithm is :-

Two different neural networks may be used for Actor and Critic Networks. Sometimes the base network can be common :-

After this extensive discussion, Shubha demonstrated the use of Actor-Critic for price prediction on the S&P 500 US equity market index. Her demo used an Actor-Critic model with Fixed Q Targets and an Experience Replay Buffer.

Best Regards,

Vivek Singhal
Co-Founder & Chief Data Scientist, CellStrat
+91-9742800566

References :-

David Silver, lectures on DRL at UCL
CMU DRL lectures for the 10703 DRL course
John Schulman and Pieter Abbeel, lectures at UC Berkeley
Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto