Deep networks are everywhere these days, from Google search and recommendation systems to identifying cancerous cells. We are moving into an era where everything we use will involve some form of machine learning. Understanding how these systems work lets us appreciate and further explore these new research areas. Most classification tasks, such as image classification, use supervised learning. However, there are other kinds of learning, and this post is about one of them in particular: “**Reinforcement Learning**”.

There are three main kinds of learning:

- **Supervised**: We are given a training set of labelled data, where every input has a corresponding output.
- **Unsupervised**: We are given inputs but no corresponding outputs.
- **Reinforcement**: We are given inputs along with occasional rewards that tell us whether or not we are moving in the right direction.

Recently, another kind of learning called **zero-shot learning** has started gaining popularity as well, but that is a topic for a future post.

An example of supervised learning is a standard classification task, where you are given a set of inputs and are required to predict the class of each one. Here every input has a fixed output, i.e. every input belongs to one class or another.

Unsupervised learning usually requires you to find natural divisions in the data; an example is clustering the given inputs into a certain number of clusters.

Reinforcement learning, however, is different: it lies somewhere in between the other two types of learning. An agent is placed in a world and is required to act upon that world, based on its observations, in a way that maximizes the reward it gains.

A classic example: place an agent at a certain cell of a grid, and on each turn the agent chooses one of the actions UP, DOWN, LEFT or RIGHT. Some cells contain walls, and the agent cannot move onto them. One cell gives a large positive reward and another gives a large negative reward (killing the agent). Each move gives the agent a small negative reward, so we would expect the agent to find the shortest path to the cell with the large positive reward.
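The grid world above can be sketched in a few lines of Python. The layout, reward values, and function names below are all illustrative assumptions, not something from a specific library:

```python
# A minimal grid-world sketch (hypothetical layout):
# '.' = free cell, '#' = wall, 'G' = large positive reward, 'X' = large negative reward
GRID = [
    "..G",
    ".#X",
    "...",
]
STEP_REWARD = -0.04  # small negative reward for every move

ACTIONS = {"UP": (-1, 0), "DOWN": (1, 0), "LEFT": (0, -1), "RIGHT": (0, 1)}

def step(state, action):
    """Return (next_state, reward, done) for one move on the grid."""
    r, c = state
    dr, dc = ACTIONS[action]
    nr, nc = r + dr, c + dc
    # Bumping into a wall or the grid edge leaves the agent in place
    if not (0 <= nr < len(GRID) and 0 <= nc < len(GRID[0])) or GRID[nr][nc] == "#":
        nr, nc = r, c
    cell = GRID[nr][nc]
    if cell == "G":
        return (nr, nc), 1.0, True   # large positive reward, episode ends
    if cell == "X":
        return (nr, nc), -1.0, True  # large negative reward, agent dies
    return (nr, nc), STEP_REWARD, False
```

With a per-move penalty like this, any policy that maximizes total reward is forced to reach the goal cell in as few moves as possible.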

Reinforcement learning seems like a great way for agents to learn about a completely unknown environment by exploring it and collecting feedback. Initially when an agent is placed in a world, it would know nothing about it and would perform actions randomly. However once it starts collecting feedback, it would perform better and make the right choices. This is what we call the exploration vs exploitation trade-off. Initially we allow the agent to explore a lot by randomly picking actions but as time passes, the probability of choosing a random action is reduced and we force the agent to take actions that seem most favorable given the state of the world.
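The exploration vs. exploitation trade-off described above is commonly implemented as an epsilon-greedy policy with a decaying epsilon. The decay schedule and constants here are illustrative choices, not prescribed values:

```python
import random

def choose_action(q_values, epsilon):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))  # explore: pick a random action
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit: best action

# Start fully random, then let exploration give way to exploitation over time
epsilon, epsilon_min, decay = 1.0, 0.05, 0.995
for episode in range(1000):
    # ... run one episode, calling choose_action(q_values, epsilon) each step ...
    epsilon = max(epsilon_min, epsilon * decay)
```

Keeping a small floor (`epsilon_min`) ensures the agent never stops exploring entirely, which helps it adapt if its early estimates were wrong.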

**Q-Learning** is one of the most popular reinforcement learning methods. It defines a quality function, or “**Q-Function**”, which takes the current state of the world as input and predicts the final estimated reward the agent would get if it performed a certain action. For now, assume this function magically exists. The problem then becomes quite simple:

- Observe the world and capture its current state
- Calculate the value of the Q-Function for each action from the current state.
- Choose the action that corresponds to the highest Q-value.

We can see that once we have this function, the problem is essentially solved, and our agent can perform actions that maximize its reward. One approach is to model the Q-Function as a table, where each entry is indexed by a (state-s, action-a) pair and gives the expected final reward if the agent performs action “a” in state “s”. This table becomes intractably large for big or continuous state spaces, which is why we have moved on to using neural networks to model the Q-Function. For more information on how neural networks work, take a look here.

This is what we call **Deep Q-Learning**. The Q-Function is defined by a neural network that takes the current state as input and predicts the expected reward for each possible action. The input layer of the network matches the shape of the agent's state, and the output layer has one node per possible action. Note that since we are predicting expected rewards, the output should not use activation functions like sigmoid or tanh, which saturate outside a certain range; a linear output is appropriate.

In order to train this network, we need (input, output) pairs for it to use during backpropagation. Consider an agent initially in state ‘s’ that performs an action ‘a’ chosen via its Q-Function (the neural network), reaching state ‘s1’ and receiving a reward ‘r’. The way we perform updates in Deep Q-Learning is:

- We collect the four variables (s, s1, a, r) for each transition, along with gamma, the discount factor applied to future rewards.
- From these we generate (input, output) pairs, where the input is the state ‘s’ and the output is the network's current prediction of the final reward for each possible action. We then set output[a] = r + gamma * max(Q(s1)), use this output as the label for the input state ‘s’, and perform standard supervised training on the network.
- This update nudges the network so that the predicted final reward for action ‘a’ from state ‘s’ equals the reward it gets at the current time step, ‘r’, plus the discounted future expected reward, which is gamma times the maximum final expected reward obtainable from state ‘s1’.
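The steps above can be written as a small helper that builds one training pair from a transition. The function name and gamma value are illustrative; `model.predict` is assumed to return a `(1, n_actions)` array of Q-values, as a Keras model would:

```python
import numpy as np

gamma = 0.95  # discount factor (illustrative value)

def make_target(model, s, a, r, s1):
    """Build the (input, label) pair for one (s, a, r, s1) transition."""
    target = model.predict(s)  # start from the network's current predictions
    # Overwrite only the taken action's entry with reward + discounted future value
    target[0][a] = r + gamma * np.max(model.predict(s1)[0])
    return s, target
```

Only the entry for the action actually taken is changed; the other outputs keep their current predictions, so the supervised loss only pushes on the Q-value we have new evidence about.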

Generally, we let the network run for some iterations, then perform updates, then let it run again, and so on. While the network runs, we store the four variables s, s1, a and r for each iteration. After a fixed number of iterations, we randomly select some of the stored experiences and use them to perform the network update. This is called experience replay.

Now that we finally know how **Deep Q-Learning** works, let’s apply it to a simple game of **Cart-Pole on OpenAI’s RL Gym in Python!**

First off, get **OpenAI Universe and Gym** installed by following their respective installation instructions.

For the rest of the post, we will be working with Cart-Pole, where we are required to balance a pole on a cart by constantly moving the cart either to the left or to the right.


To start with, we need to import OpenAI’s universe and gym, which give us access to a variety of games on which we can apply reinforcement learning. We then need to set up a template loop that lets the game progress.

```python
import gym
import universe
import numpy as np

env = gym.make('CartPole-v0')  # Use the cart pole environment
observation_n = env.reset()    # Get the first observation from the environment

while True:
    env.render()
    # Randomly perform an action, move either left or right
    action_n = np.random.randint(2)
    observation_n, reward_n, done_n, info = env.step(action_n)
```

We then define the model we will be using. In CartPole-v0, there are 4 inputs to the model and two outputs (left or right). Our network has a shape of (4, 128, 128, 2), with two hidden layers of 128 nodes each.

```python
# Reinforcement Learning - Deep-Q learning
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD

model = Sequential()
model.add(Dense(128, input_dim=4, activation='relu'))
model.add(Dense(128, activation='relu'))
model.add(Dense(2))  # linear output: raw expected rewards for each action
model.compile(loss='mse', optimizer=SGD(lr=0.0001))
```

We then let our agent play the game in a loop. At each time step, we take the observation as input, use the network to predict a suitable action, perform that action, and store our variables for experience replay.

```python
epsilon = 0.1       # exploration probability (decayed over time)
replay_memory = []  # stores (s, a, r, s1) transitions
observation = np.reshape(env.reset(), [1, 4])

for time_t in range(5000):
    env.render()
    action = model.predict(observation)
    action = np.argmax(action[0])
    if np.random.uniform(0, 1) < epsilon:
        # Sample the action randomly: either 0 or 1
        action = np.random.randint(2)
    observation_old = observation
    observation, reward, done, info = env.step(action)
    observation = np.reshape(observation, [1, 4])
    replay_memory.append([observation_old, action, reward, observation])
```

After a certain number of iterations (technically, an episode), we choose some of the stored transitions, create prediction labels as described above, and use them to train the network.

```python
# gamma (the discount factor) is assumed to be defined above
indices = np.random.choice(len(replay_memory), min(500, len(replay_memory)))
for mem_idx in indices:
    observation_old, action, reward, observation = replay_memory[mem_idx]
    target = reward
    # For all but the final transition, add the discounted future reward
    if mem_idx != len(replay_memory) - 1:
        target = reward + gamma * np.amax(model.predict(observation)[0])
    target_f = model.predict(observation_old)
    target_f[0][action] = target
    model.fit(observation_old, target_f, nb_epoch=1, verbose=0)
```

You would need to write a few more lines to save/load models for backups after episodes, upload your scores to OpenAI Gym, etc., but that shouldn’t be hard to understand.

And that’s all there is to it! If you followed along, then you have successfully trained an agent using Deep Q-Learning to play Cart-Pole!

**The entire code for the project can be found here (my submission on OpenAI Gym).**

**Further Reading:**

- Policy Gradients
- Deep Deterministic Policy Gradients (DDPG) – Deep Q-Learning works only when the action space is discrete, i.e. the outputs to be predicted are classes. If you instead need to predict continuous values (any floating point numbers – regression), DDPG comes in handy.