Policy Gradients – An Introduction
- February 6, 2020
- Posted by: Shubha Manikarnike
- Category: Reinforcement Learning
I conducted an introductory session on Reinforcement Learning Policy Gradients (PG) at CellStrat AI Lab on 1st Feb 2020. The goal of this session was to explain the basic principle underlying Policy Gradients.
The session started off with a quick recap of Reinforcement Learning, so that the audience was familiar with the definitions of State, Action, Reward and Policy. We then briefly discussed Q-Learning and Deep Q-Networks, and reviewed DeepMind's DQN algorithm, which uses Experience Replay and Fixed Q-Targets.
Moving on to the main topic – ‘Policy Gradients‘ – these are a class of algorithms which compute the Policy directly. For a discrete Action space, the underlying neural network outputs the probabilities of the Actions, whereas for a continuous Action space it outputs the action values directly.
I then discussed how this differs from Value-based methods. In DQN, the neural network computes a Q-Value for every Action in a given State, and we pick the Action with the maximum Q-Value. In a Policy-based method, by contrast, the neural network outputs the Actions directly.
![](http://www.cellstrat.com/wp-content/uploads/2020/02/PG1-1024x377.png)
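To make the contrast concrete, here is a minimal sketch (illustrative, not the session's exact code) of a policy network for a discrete Action space. It ends in a softmax over Actions, whereas a DQN head of the same shape would output raw Q-Values and act by taking their argmax:

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Maps a state to a probability distribution over discrete actions."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
            nn.Softmax(dim=-1),  # action probabilities, not Q-values
        )

    def forward(self, state):
        return self.net(state)

# A DQN head would omit the softmax, output one Q-value per action,
# and the agent would act greedily via argmax instead of sampling.
```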
The advantages of Policy-based methods are:
1) They compute the Policy directly, instead of calculating Q-Values and then deriving a Policy from them.
2) They can learn Stochastic Policies efficiently.
3) They can handle continuous Action spaces (see the sketch after this list).
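A common way to realize advantages (2) and (3) together is a Gaussian policy: the network outputs the mean (and a learned log-standard-deviation) of a Normal distribution, and actions are sampled from it. A minimal sketch, with illustrative names:

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Stochastic policy for continuous actions: outputs mean and std."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, action_dim)
        self.log_std = nn.Parameter(torch.zeros(action_dim))  # learned spread

    def forward(self, state):
        h = self.body(state)
        dist = torch.distributions.Normal(self.mu(h), self.log_std.exp())
        action = dist.sample()                       # stochastic by design
        return action, dist.log_prob(action).sum(-1)
```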
Gradient Ascent vs Gradient Descent:
With this basic understanding, we proceeded to how the neural network in a PG algorithm is trained. The main goal of an RL algorithm is to maximize Reward, so the weights of the neural network should be adjusted to maximize the Reward function. Hence we use Gradient Ascent.
This is in contrast to regular neural networks, where we use Gradient Descent because the weights are adjusted to minimize a Loss function (e.g. MSE in the case of Linear Regression).
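In practice, deep learning frameworks only ship with minimizers, so Gradient Ascent on the Reward is implemented as Gradient Descent on the negated objective. A minimal sketch in PyTorch (the tensors here are dummy stand-ins for quantities collected during a rollout):

```python
import torch

# Stand-ins for rollout data: log pi(a_t | s_t) at each step,
# and the return G_t earned from each step onwards.
log_probs = torch.randn(10, requires_grad=True)
returns = torch.ones(10)

# Gradient *ascent* on E[log pi * G] == gradient *descent* on its negative.
loss = -(log_probs * returns).sum()
loss.backward()  # an optimizer.step() now moves weights up the reward surface
```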
Derivation of the Reward Function: To derive the objective, consider a Trajectory (represented by the symbol ‘Tau’), which is a sequence of States and Actions; for episodic tasks this can be an entire episode. The objective is the expected Reward: the Probability of each Trajectory multiplied by the Reward fetched by that Trajectory, summed over all Trajectories.
![](http://www.cellstrat.com/wp-content/uploads/2020/02/PG2.png)
![](http://www.cellstrat.com/wp-content/uploads/2020/02/PG3.png)
![](http://www.cellstrat.com/wp-content/uploads/2020/02/PG4.png)
![](http://www.cellstrat.com/wp-content/uploads/2020/02/PG5.png)
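In symbols, the derivation the slides above walk through is the standard REINFORCE (likelihood-ratio) derivation:

```latex
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[R(\tau)\big]
          = \sum_{\tau} P(\tau;\theta)\, R(\tau)

\nabla_\theta J(\theta)
  = \sum_{\tau} \nabla_\theta P(\tau;\theta)\, R(\tau)
  = \sum_{\tau} P(\tau;\theta)\, \nabla_\theta \log P(\tau;\theta)\, R(\tau)
  = \mathbb{E}_{\tau \sim \pi_\theta}\Big[ R(\tau) \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \Big]
```

The environment's transition probabilities inside log P(Tau; theta) do not depend on theta, so only the policy's log-probabilities survive the gradient. This is what lets us estimate the gradient purely from sampled Trajectories, without knowing the environment's dynamics.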
Code Demo: I showed a hands-on code demo of training on the ‘Acrobot-v1’ environment available with OpenAI Gym. The acrobot system includes two joints and two links, where the joint between the two links is actuated. Initially, the links hang downwards, and the goal is to swing the end of the lower link up to a given height.
Acrobot-v1 has a continuous State Space with six parameters and a discrete Action Space with three values. I walked through PyTorch code implementing a PG algorithm to solve ‘Acrobot-v1’.
![](http://www.cellstrat.com/wp-content/uploads/2020/02/Acrobot-v1.gif)
Code (GitHub): https://github.com/Shubha-Manikarnike/Policy-Gradients
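For reference, here is a minimal, self-contained REINFORCE sketch for ‘Acrobot-v1’ (a simplified illustration written against the classic Gym API of the time, not the exact code in the repo):

```python
import gym
import torch
import torch.nn as nn

env = gym.make("Acrobot-v1")          # 6-dim continuous state, 3 discrete actions
policy = nn.Sequential(
    nn.Linear(6, 64), nn.ReLU(),
    nn.Linear(64, 3), nn.Softmax(dim=-1),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
gamma = 0.99

for episode in range(500):
    state = env.reset()
    log_probs, rewards, done = [], [], False
    while not done:
        probs = policy(torch.as_tensor(state, dtype=torch.float32))
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        state, reward, done, _ = env.step(action.item())
        rewards.append(reward)

    # Discounted returns, computed backwards from the end of the episode
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)

    loss = -(torch.stack(log_probs) * returns).sum()  # ascent via negated loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Normalizing the returns is a common variance-reduction trick rather than part of the core algorithm; the loop works without it, just more noisily.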