A Summary of Model-Free RL Algorithms

April 13, 2020
Posted by: vsinghal
Category: Reinforcement Learning Robotics

Reinforcement Learning (RL) refers to training agents through interaction with incentive-driven environments, where desirable behaviour earns rewards.

RL is typically framed around a <state, action, reward> tuple: the agent has action choices to make in various states, and each action entails a potential reward. This also means that each state has a “value” associated with it. The agent selects actions according to a “policy” π, and the goal is to find the policy that attains the maximum reward.

With the advent of Deep Learning in the mid-2000s and its subsequent progress, RL started using Deep Neural Networks for policy optimization. Modern RL is generally separated into two types: “model-free” and “model-based” (MBRL).

Model-free methods learn directly from experience: they perform actions either in the real world (e.g. robots) or in a computer simulation (e.g. games), collect the reward from the environment, whether positive or negative, and update their value functions.

This is the key difference from the model-based approach: model-free methods act in the real environment in order to learn.

Conversely, a model-based algorithm uses a reduced number of interactions with the real environment during the learning phase. Its aim is to construct a model from these interactions and then use this model to simulate further episodes, applying actions to the constructed model rather than the real environment and using the results that the model returns.

This has the advantage of speeding up learning, since there is no need to wait for the environment to respond or to reset the environment to some state in order to resume learning.

On the downside however, if the model is inaccurate, we risk learning something completely different from the reality.

(Source Credit – https://towardsdatascience.com/model-based-reinforcement-learning-cb9e41ff1f0d)

Two dimensions of RL algorithms, based on the backups used to learn or construct a policy. At the extremes of these dimensions are (a) dynamic programming, (b) exhaustive search, (c) one-step TD learning and (d) pure Monte Carlo approaches. Bootstrapping extends from (c) 1-step TD learning to n-step TD learning methods, with (d) pure Monte Carlo approaches not relying on bootstrapping at all. Another possible dimension of variation is choosing to (c, d) sample actions versus (a, b) taking the expectation over all choices.
Image Credit : https://arxiv.org/pdf/1708.05866.pdf

We will discuss model-free RL in the rest of this article. This is a more mature area in terms of research and practicality. Model-based RL lies more in the research domain so far.

Model-free RL has two types of algorithms – value-based and policy-based.

Value-based algorithms :-

Value-based algorithms iteratively update the perceived value of a state to finally learn an optimal policy. These algorithms are generally “off-policy” (with the exception of a few, like SARSA, which is “on-policy”).

Value-based algorithms solve a Markov Decision Process (MDP). The Markov property states that the future depends only on the present state and not on past states. In other words, the current state captures all past dependencies. A Markov Reward Process models this paradigm.

**Markov Reward Process for a Student’s journey**

The cumulative reward at time step t is the sum of current and future discounted rewards. The agent’s goal is to learn a policy π which maximizes this cumulative reward.
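As a minimal sketch of the cumulative reward described above, the function below sums a finite list of rewards with discounting. The name `discounted_return` and the discount factor `gamma` are illustrative assumptions, not taken from the article.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum r_1 + gamma*r_2 + gamma^2*r_3 + ... over a finite reward list."""
    g = 0.0
    # Work backwards so that each step applies G = r + gamma * G_next.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Three rewards of 1 with gamma = 0.5 give 1 + 0.5 + 0.25.
example = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

A smaller `gamma` makes the agent short-sighted, while a value close to 1 weights distant rewards almost as much as immediate ones.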

The state-value function v_π(s) measures how good it is to be in a certain state using the policy π.
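In symbols, one standard way to write this is the expected discounted return when starting in state s and following π. The discount factor γ and the reward symbol R are assumed notation here.

```latex
v_\pi(s) = \mathbb{E}_\pi\left[ G_t \mid S_t = s \right],
\qquad
G_t = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}
```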

The major types of value-based algorithms are :-

Q-Learning :

Here the agent stores a perceived value for each <state, action> pair, called a Q value, which then decides the policy action.

We use Dynamic Programming to define the Q value, using the recursive Bellman Equation :
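One standard form of the Bellman equation for Q_π, as it commonly appears in deep RL surveys, is sketched below. The discount factor γ and the time-indexed symbols are assumed notation.

```latex
Q_\pi(s_t, a_t) = \mathbb{E}_{s_{t+1}}\left[ r_{t+1} + \gamma \, Q_\pi\big(s_{t+1}, \pi(s_{t+1})\big) \right]
```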

This means that Q_π can be improved by bootstrapping, i.e., we can use the current values of our estimate of Q_π to improve our estimate.

The Q-values for the various states are updated iteratively using a Temporal Difference algorithm, which compares the old and new Q values after each action.
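A common way to write the Q-learning update is shown below, with γ as an assumed symbol for the discount factor:

```latex
Q_\pi(s_t, a_t) \leftarrow Q_\pi(s_t, a_t) + \alpha \, \delta,
\qquad
\delta = r_{t+1} + \gamma \max_{a} Q_\pi(s_{t+1}, a) - Q_\pi(s_t, a_t)
```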

where α is the learning rate and δ is the temporal difference (TD) error.

The table below summarizes the Q learning algorithm.

**Q-learning : An off-policy TD control algorithm**
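The algorithm can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the environment is a hypothetical five-cell corridor where moving right from the last-but-one cell reaches the goal and earns a reward of 1, and all hyperparameter values are arbitrary.

```python
import random


def q_learning(n_states=5, episodes=200, alpha=0.5, gamma=0.9,
               epsilon=0.1, seed=0):
    """Tabular Q-learning on a toy corridor (hypothetical environment)."""
    rng = random.Random(seed)
    actions = (0, 1)                      # 0 = left, 1 = right
    goal = n_states - 1
    q = [[0.0, 0.0] for _ in range(n_states)]

    def step(state, action):
        nxt = max(0, state - 1) if action == 0 else state + 1
        reward = 1.0 if nxt == goal else 0.0
        return nxt, reward, nxt == goal

    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Behaviour policy: epsilon-greedy, ties broken at random.
            if rng.random() < epsilon:
                action = rng.choice(actions)
            else:
                best = max(q[state])
                action = rng.choice([a for a in actions if q[state][a] == best])
            nxt, reward, done = step(state, action)
            # Off-policy target: greedy value of the next state.
            target = reward if done else reward + gamma * max(q[nxt])
            q[state][action] += alpha * (target - q[state][action])
            state = nxt
    return q


q_table = q_learning()
```

The update uses the maximum Q value of the next state regardless of which action the behaviour policy takes next, which is what makes the method off-policy.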

The limitation of Q-learning is that a system may have an enormous number of states and actions, making a table of Q values unviable. This makes it unsuitable for continuous action spaces such as the steering wheel of a car.

Deep-Q Learning :

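Deep-Q Learning uses neural networks to predict Q-values for various state-action combinations instead of storing them in a table, allowing an expansion to very large or continuous state spaces while saving computational resources. (Continuous action spaces generally need further extensions such as DDPG, mentioned later in this article.)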
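To show the idea of replacing the Q table with a parametric function, here is a rough sketch that uses a linear model as a stand-in for the neural network. It is an assumption-laden simplification: a real Deep-Q setup uses a deep network plus techniques such as experience replay and a target network, none of which appear here, and the function names are hypothetical.

```python
def q_values(weights, features):
    """One linear output per action: Q(s, a) = w_a . phi(s)."""
    return [sum(w * x for w, x in zip(w_a, features)) for w_a in weights]


def semi_gradient_update(weights, features, action, reward, next_features,
                         done, alpha=0.1, gamma=0.9):
    """Apply one Q-learning step to the weights of the chosen action."""
    if done:
        target = reward
    else:
        target = reward + gamma * max(q_values(weights, next_features))
    td_error = target - q_values(weights, features)[action]
    # For a linear model the gradient of Q(s, a) w.r.t. w_a is phi(s).
    weights[action] = [w + alpha * td_error * x
                       for w, x in zip(weights[action], features)]
    return td_error


weights = [[0.0, 0.0], [0.0, 0.0]]       # two actions, two state features
error = semi_gradient_update(weights, [1.0, 0.0], action=1, reward=1.0,
                             next_features=[0.0, 1.0], done=True)
```

Because the weights are shared across states with similar features, an update for one state also changes the estimates for states the agent has never visited.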

A common technique in value-based methods is to balance exploration against exploitation. Exploration means deliberately trying actions which have lower perceived value in order to achieve a better overall result over the lifetime of the episode (e.g. a salesperson explores new markets in anticipation of achieving higher overall sales). Exploitation means sticking to “safe bets”, in other words taking actions with higher reward potential (e.g. a salesperson sticks to proven markets for the sales hunt).
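One widely used way to trade off the two is an epsilon-greedy rule, sketched below. The function name and the `epsilon` parameter are illustrative assumptions; the article does not prescribe a specific exploration scheme.

```python
import random


def epsilon_greedy(q_row, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))                    # explore
    return max(range(len(q_row)), key=q_row.__getitem__)    # exploit


greedy_choice = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0)
random_choice = epsilon_greedy([0.1, 0.9, 0.3], epsilon=1.0)
```

In practice `epsilon` is often started high and decayed over training, so the agent explores early and exploits once its estimates are more reliable.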

More details on value-based methods are available here.

Policy-based algorithms :-

Policy-based methods update the policy directly without storing state values. These typically use Policy Gradient (PG) algorithms. PG algos modify an agent’s policy based on which actions bring it higher rewards. These algos are considered “on-policy”.

One key point to note here is that there are two probabilities involved – the policy (π) probability and environment probability (p).

The Policy recommends what actions (a1, a2 etc.) to take with what probability. E.g. in a game of Chess, the policy predicts probabilities for the multiple moves possible at any point (e.g. move Pawn A forward with 20% probability and move Pawn B forward with 80% probability). Suppose we move Pawn B forward (as it has the higher recommendation); once we do that, the Environment probability is 100% that it lands in the desired slot.

In a game of Frozen Lake, by contrast, the Policy recommends probabilities for which slot to move to, but after the agent takes the action with the highest probability, the Environment dictates the probability of the agent landing in the targeted slot (the slippery ice means that there is a chance that the agent may not land in the targeted slot).

Policy methods have these advantages compared to value-based methods – (i) better convergence, (ii) suitability for continuous action spaces (having infinitely many possible actions), (iii) ability to learn stochastic policies.
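The two probabilities can be kept apart in code as two separate sampling steps. The sketch below is hypothetical: the action names, the cell names and the 70/30 slip probabilities are made up for illustration and are not the actual Frozen Lake dynamics.

```python
import random


def sample(probabilities, rng):
    """Draw one key from a {outcome: probability} dict."""
    outcomes = list(probabilities)
    weights = [probabilities[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


# pi(a | s): the agent's choice of action in the current state.
policy = {"left": 0.2, "right": 0.8}

# p(s' | s, a): the environment's response to each action (illustrative).
slippery = {
    "left": {"left_cell": 0.7, "stay": 0.3},
    "right": {"right_cell": 0.7, "stay": 0.3},
}

rng = random.Random(0)
action = sample(policy, rng)                  # policy probability
next_state = sample(slippery[action], rng)    # environment probability
```

In a deterministic game such as Chess the second table would put probability 1.0 on a single outcome for each action.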

A Policy can be of two types :

(1) A deterministic policy maps each state to an action. You give it a state and the function returns an action to take [π(s) -> a]. The action clearly determines the outcome. E.g. making a three-dimensional robot walk forward as fast as possible, without falling over.

*Image Credit : https://gym.openai.com/envs*

(2) A stochastic policy gives a probabilistic output [π(a|s) -> P(a_t|s_t)]. It is used when the environment is uncertain: the policy outputs a probability distribution over actions rather than a single concrete action. E.g. in the Rock Paper Scissors game, you play Rock, Paper and Scissors with an equi-probable random policy of about 33% each. In Chess, you can move Pawn A or Pawn B with different probabilities to different slots.

A basic PG algorithm is the “REINFORCE” technique, which implements the core idea of “reinforcing” (in other words, increasing the likelihood of) actions which lead to higher rewards.
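The two policy types can be contrasted in a short sketch. Representing the policy as a dict of action preferences and turning it into probabilities with a softmax are assumptions made for illustration; the article does not fix a parameterization.

```python
import math
import random


def deterministic_policy(preferences):
    """pi(s) -> a: always return the action with the highest preference."""
    return max(preferences, key=preferences.get)


def stochastic_policy(preferences):
    """pi(a | s): a softmax over preferences gives one probability per action."""
    top = max(preferences.values())
    exps = {a: math.exp(v - top) for a, v in preferences.items()}
    total = sum(exps.values())
    return {a: e / total for a, e in exps.items()}


# Equal preferences give the equi-probable Rock Paper Scissors policy.
prefs = {"rock": 0.0, "paper": 0.0, "scissors": 0.0}
probs = stochastic_policy(prefs)
move = random.choices(list(probs), weights=list(probs.values()), k=1)[0]
```

A deterministic policy in Rock Paper Scissors would be trivially exploitable, which is why a stochastic policy is the better fit for that game.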

**CartPole trained with Policy Gradients**
*Image Credit :* *https://www.freecodecamp.org/news/an-introduction-to-policy-gradients-with-cartpole-and-doom-495b5ef2207f/*

The REINFORCE algorithm is given by :-

*Image Credit : UC Berkeley DRL Course CS294-112*
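As a rough sketch of the idea, REINFORCE nudges the policy parameters θ along G · ∇ log π_θ(a|s), where G is the return that followed the action. The example below applies this to a hypothetical two-armed bandit with a softmax policy; the bandit, the step size `alpha` and the function names are assumptions chosen to keep the code short, and each episode is a single step so the return equals the reward.

```python
import math
import random


def softmax(theta):
    top = max(theta)
    exps = [math.exp(t - top) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(arm_rewards=(0.0, 1.0), steps=500, alpha=0.1, seed=0):
    """REINFORCE on a toy bandit; returns the final action probabilities."""
    rng = random.Random(seed)
    arms = list(range(len(arm_rewards)))
    theta = [0.0] * len(arms)              # one preference per action
    for _ in range(steps):
        probs = softmax(theta)
        action = rng.choices(arms, weights=probs, k=1)[0]
        ret = arm_rewards[action]          # one-step episode: return = reward
        # Gradient of log softmax w.r.t. theta_k is 1[k == action] - probs[k].
        for k in arms:
            grad_log = (1.0 if k == action else 0.0) - probs[k]
            theta[k] += alpha * ret * grad_log
    return softmax(theta)


final_probs = reinforce_bandit()
```

Actions followed by a high return have their probability increased, and the policy gradually concentrates on the rewarding arm.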

More details on Value-based and Policy-based algorithms are available here and here.

Actor-Critic Algorithms :-

These algorithms combine the value-based and policy-based methods. The Actor is a policy-based network and recommends an action based on the policy. The Critic is a value-based network: it takes the reward attained by the Actor’s action, along with the state, and uses a TD mechanism to update both itself and the Actor.
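The interplay can be sketched as a single update step. This is a simplified tabular version under stated assumptions: lists stand in for the two networks, the Actor uses a softmax over per-state preferences, and the learning rates and function names are hypothetical.

```python
import math


def softmax(prefs):
    top = max(prefs)
    exps = [math.exp(p - top) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def actor_critic_step(actor, critic, state, action, reward, next_state, done,
                      actor_lr=0.1, critic_lr=0.1, gamma=0.9):
    """One-step tabular actor-critic update; returns the TD error."""
    target = reward if done else reward + gamma * critic[next_state]
    td_error = target - critic[state]
    # Critic: move V(state) toward the TD target.
    critic[state] += critic_lr * td_error
    # Actor: shift action preferences in proportion to the TD error.
    probs = softmax(actor[state])
    for a in range(len(probs)):
        grad_log = (1.0 if a == action else 0.0) - probs[a]
        actor[state][a] += actor_lr * td_error * grad_log
    return td_error


actor = [[0.0, 0.0], [0.0, 0.0]]     # preferences: 2 states x 2 actions
critic = [0.0, 0.0]                  # state values
td = actor_critic_step(actor, critic, state=0, action=1, reward=1.0,
                       next_state=1, done=True)
```

A positive TD error means the action turned out better than the Critic expected, so the Actor makes that action more likely.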

More details on Actor-Critic are available here.

*Image Credit :* *https://arxiv.org/pdf/1708.05866.pdf*

Policy gradients suffer from noisy (high-variance) gradients. Recent algorithms attempt to mitigate some of these issues. These include TRPO (Trust Region Policy Optimization, TRPO paper), PPO (Proximal Policy Optimization, PPO paper), DDPG (Deep Deterministic Policy Gradient), TD3 (Twin Delayed Deep Deterministic Policy Gradient) and SAC (Soft Actor Critic). On the value-based side, Rainbow combines several improvements to DQN for a better result, such as prioritized experience replay, Dueling DQN, multi-step returns (as used in A3C), Distributional DQN and Noisy DQN.

Model-free RL is improving rapidly with modern State-Of-The-Art algorithms. It will enable future applications like robotics.

Best Regards,

Vivek Singhal
Co-Founder & Chief Data Scientist, CellStrat
+91-9742800566

References :-

Gists of Recent Deep RL Algorithms by Nathan Lambert
A Brief Survey of Deep Reinforcement Learning by Kai Arulkumaran et al.
An introduction to Policy Gradients with Cartpole and Doom by Thomas Simonini