Deep Q-Learning with Neural Networks (on Cart-Pole)

Deep networks are present everywhere these days, from Google search and recommendation systems to identifying cancerous cells. We are moving into an era where everything we use will involve some sort of machine learning. Understanding how these systems work allows us to appreciate and further explore these new research areas. Most classification tasks, such as image classification, use supervised learning. However, there are other kinds of learning, which we will be talking about in this post, more specifically "Reinforcement Learning".

There are three main kinds of learning,

  1. Supervised: We are given a training set which has labelled data where every input has a corresponding output.
  2. Unsupervised: We are given inputs but no corresponding outputs.
  3. Reinforcement: Here we are given multiple inputs with occasional rewards which help us understand if we are moving in the right direction or not.

Recently, another kind of learning called zero-shot learning has started gaining popularity as well, but that would be a topic for a future post.

An example of supervised learning could be a standard classification task where you are given a bunch of inputs and are required to predict the possible class output. Here every input has a certain fixed output, i.e. every input belongs to some class or the other.

Unsupervised learning usually requires you to find the natural division among data, an example is clustering given input data into a certain number of clusters.

Reinforcement learning, however, is different: it lies somewhere in-between the other two types of learning. An agent is set in a world and is required to act upon the world, based on its observations, in a way that maximizes the reward it gains.

A classic example would be, place an agent in a grid at a certain cell and in each turn the agent chooses an action UP, DOWN, LEFT or RIGHT. Some of the cells have walls and the agent cannot move onto those cells. One of the cells would give a large positive reward and another cell would give a large negative reward (kill the agent). Each move gives a small negative reward to the agent, so we would expect the agent to find the shortest path to the cell with the large positive reward.

Reinforcement learning seems like a great way for agents to learn about a completely unknown environment by exploring it and collecting feedback. Initially when an agent is placed in a world, it would know nothing about it and would perform actions randomly. However once it starts collecting feedback, it would perform better and make the right choices. This is what we call the exploration vs exploitation trade-off. Initially we allow the agent to explore a lot by randomly picking actions but as time passes, the probability of choosing a random action is reduced and we force the agent to take actions that seem most favorable given the state of the world.
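The exploration vs exploitation trade-off described above is usually implemented as an epsilon-greedy policy: pick a random action with probability epsilon, otherwise pick the best-looking one, and anneal epsilon over time. A minimal sketch (the decay schedule and the placeholder Q-values here are illustrative, not the ones used later in this post):

```python
import random

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon act randomly, otherwise act greedily."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Anneal epsilon from pure exploration towards mostly exploitation.
epsilon, epsilon_min, decay = 1.0, 0.05, 0.995
for step in range(1000):
    action = epsilon_greedy([0.1, 0.9], epsilon)  # q_values are placeholders
    epsilon = max(epsilon_min, epsilon * decay)
```

By the end of the loop, epsilon has bottomed out at its minimum and the agent acts greedily almost all the time.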

Q-Learning is one of the most popular reinforcement learning methods. It defines a quality function, or "Q-Function", which takes the current state of the world as input and predicts the final estimated reward the agent would get if it performed a certain action. For now, assume that this function magically exists. The problem then becomes quite simple,

  1. Observe the world and capture its current state
  2. Calculate the value of the Q-Function for each action from the current state.
  3. Choose the action that corresponds to the highest Q-value.

We can see that once we have this function, the problem is essentially solved and our agent can perform actions that would maximize its reward. One approach to the Q-Function is to model it as a table where each entry is indexed by the (state-s, action-a) pair and gives us the expected final reward if the agent performs the action "action-a" in the state "state-s". This has a few issues (the table grows with the number of possible states and cannot handle continuous state spaces), which is why we have moved on to using neural networks to model this Q-Function. For more information on how neural networks work, take a look here.
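The table-based version is still worth seeing once. Below is a minimal sketch on a made-up 1-D world (the environment, learning rate and reward values are illustrative assumptions, not part of the Cart-Pole setup used later):

```python
import random

random.seed(0)

# A toy 1-D world: states 0..4, actions 0 (left) / 1 (right), state 4 is the goal.
n_states, n_actions, gamma, alpha = 5, 2, 0.9, 0.5
Q = [[0.0] * n_actions for _ in range(n_states)]

def step(s, a):
    s1 = max(0, min(n_states - 1, s + (1 if a == 1 else -1)))
    return s1, (1.0 if s1 == n_states - 1 else -0.01)

for episode in range(200):
    s = 0
    while s != n_states - 1:
        a = random.randrange(n_actions)  # act randomly; the table still learns
        s1, r = step(s, a)
        # The standard tabular Q-learning update rule.
        Q[s][a] += alpha * (r + gamma * max(Q[s1]) - Q[s][a])
        s = s1

# The greedy policy read off the table should always move right.
policy = [max(range(n_actions), key=lambda a: Q[s][a]) for s in range(n_states - 1)]
```

For this tiny world a table is perfect; the trouble starts when states are continuous (like Cart-Pole's pole angle) or too numerous to enumerate.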

This is what we call Deep Q-Learning. A Q-Function can be defined by a neural network which takes the current state as the input and then predicts the expected rewards for each of the possible actions. The input layer of the network would have the shape of the state of the agent. The output layer would have the shape of the number of possible actions. Do note that since we want to predict expected rewards, which can grow without bound, we should make sure the output layer doesn't use activation functions like sigmoid or tanh, which saturate after a certain range.

In order to train this network, we would need to give it some (input, output) pairs that it would use while performing back propagation to update the network. Consider an agent 'A' initially in a state 's' which then performs an action 'a' based on its Q-Function (the neural network), reaching state 's1' and getting a reward 'r'. The way we perform updates in Deep Q-Learning is,

  1. We collect our sets of 4 variables (s, s1, a, r), along with gamma, the discount factor that reduces the weight of future rewards.
  2. Essentially, we generate pairs of (inputs, outputs) where the input is the state ‘s’ and the output is the predicted final reward for each of the possible actions. We then update output[a] = r + gamma * max(Q(s1)) and then use this output as the label for the input state ‘s’ and perform standard supervised training on the neural network.
  3. The update we perform ensures that the network modifies itself so that the predicted final reward for action 'a' from state 's' is the reward 'r' it gets at the current time step plus the discounted future expected reward, which is gamma times the maximum final expected reward obtainable from state 's1'.
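The target construction in step 2 fits in a few lines. Here q_s and q_s1 stand in for the network's predictions Q(s) and Q(s1); the helper name and the numbers are illustrative:

```python
import numpy as np

gamma = 0.95  # illustrative discount factor

def q_targets(q_s, q_s1, action, reward, done):
    """Build the supervised target vector for one transition (s, a, r, s1)."""
    target = np.array(q_s, dtype=float)  # keep the net's own predictions...
    if done:
        target[action] = reward          # ...no future reward from a terminal state
    else:
        target[action] = reward + gamma * np.max(q_s1)
    return target

target = q_targets(q_s=[0.2, 0.5], q_s1=[1.0, 0.3], action=0, reward=1.0, done=False)
# target is now [1.95, 0.5]: only the taken action's entry was replaced.
```

Note that only the entry for the action actually taken is changed; the other outputs keep the network's own predictions, so the loss only pushes on what we observed.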

Generally, we let the network run for some iterations, then perform updates, then let it run again, and so on. While the network runs, we store the four variables s, s1, a and r for each iteration. After a fixed number of iterations, we randomly select some of the stored experiences and use them to perform the network update. This is called experience replay.
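A replay memory of this kind can be sketched with a bounded deque (the capacity and batch size below are illustrative choices, not the ones used in the Cart-Pole code later):

```python
import random
from collections import deque

class ReplayMemory:
    """A fixed-size buffer of (s, a, r, s1) transitions with uniform sampling."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # old experiences fall off the left

    def store(self, s, a, r, s1):
        self.buffer.append((s, a, r, s1))

    def sample(self, batch_size):
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))

memory = ReplayMemory(capacity=100)
for t in range(250):          # overflow the buffer on purpose
    memory.store(t, 0, -0.01, t + 1)
batch = memory.sample(32)     # a random mini-batch for one training update
```

Capping the buffer and sampling uniformly breaks the correlation between consecutive experiences, which is the whole point of the trick.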

Now that we finally know how Deep Q-Learning works, let’s apply it on a simple game of Cart-Pole on OpenAI’s RL Gym in Python!

First off, get OpenAI Universe and Gym installed by following their respective installation instructions.

For the rest of the post, we would be working with Cart-Pole where we are required to balance a pole on a cart by constantly moving the cart either to the left or to the right.


To start with, we need to include OpenAI’s universe and gym which gives us access to a variety of games on which we can apply reinforcement learning. We would then need to setup a template loop which lets the game progress.

import gym
import universe
import numpy as np

env = gym.make('CartPole-v0') # Use the cart pole environment
observation_n = env.reset() # Get the first observation from the environment

while True:
  # Randomly perform an action, move either left (0) or right (1)
  action_n = np.random.randint(2)
  observation_n, reward_n, done_n, info = env.step(action_n)
  if done_n:
    observation_n = env.reset() # The pole fell, start a new episode

We would then define our model that we would be using. In CartPole-v0, there are 4 inputs to the model with two outputs (left or right). Our network would have a shape of (4,128,128,2) with two hidden layers each having 128 nodes.

# Reinforcement Learning - Deep-Q learning
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import sgd

model = Sequential()
model.add(Dense(128, input_dim=4, activation='relu'))
model.add(Dense(128, activation='relu'))
model.add(Dense(2, activation='linear')) # Linear outputs: one Q-value per action
model.compile(loss='mse', optimizer=sgd(lr=0.0001))

We would then let our agent play the game in a loop. At each time step, we take in the observations as inputs, use our network to predict a suitable action and then perform that action while storing all our variables for experience replay.

epsilon = 1.0 # Exploration probability, reduced over time
replay_memory = [] # Stores (s, a, r, s1) tuples for experience replay
observation = np.reshape(env.reset(), [1, 4])

for time_t in range(5000):
  action = model.predict(observation)
  action = np.argmax(action[0])
  if np.random.uniform(0, 1) < epsilon:
    # Either 0 or 1, sample the action randomly
    action = np.random.randint(2)
  observation_old = observation
  observation, reward, done, info = env.step(action)
  observation = np.reshape(observation, [1, 4])
  replay_memory.append([observation_old, action, reward, observation])
  if done:
    observation = np.reshape(env.reset(), [1, 4])

After a certain number of iterations (or, technically, an episode), we choose some of the stored experiences, create prediction labels as mentioned above, and use them to train the network.

indices = np.random.choice(len(replay_memory), min(500, len(replay_memory)))
for mem_idx in indices:
  mem = replay_memory[mem_idx]
  observation_old = mem[0]
  action = mem[1]
  reward = mem[2]
  observation = mem[3]
  target = reward
  if mem_idx != len(replay_memory) - 1:
    target = reward + gamma * np.amax(model.predict(observation)[0])
  target_f = model.predict(observation_old)
  target_f[0][action] = target
  model.fit(observation_old, target_f, nb_epoch=1, verbose=0)

You would need to write a few more lines to save/load models for backups after episodes, upload your scores to OpenAI Gym, etc., but that shouldn't be hard to figure out.

And that’s all there is to it! If you followed along, then you have successfully trained an agent using Deep Q-Learning to play Cart-Pole!

The entire code for the project can be found here (My submission on OpenAI gym)

Further Reading:

  1. Policy Gradients
  2. Deep Deterministic Policy Gradients (DDPG) – Deep Q-Learning works only when the set of possible actions is discrete. If you instead need to predict continuous action values (any floating point numbers, i.e. regression), this comes in handy.

Neuro-Evolution with Flappy Bird (Genetic Evolution on Neural Networks)

The human brain has been a great area of research for a very long time, and recently we have come up with neural networks, which loosely approximate the brain and its function to the best of our knowledge. A neural network, simply put, is a bunch of nodes (representing the neurons present in the brain) arranged in layers which are connected by weights (to simulate the connections that are formed between the neurons).

Every layer has a fixed number of nodes and layers are stacked one on top of another where each node in a given layer is connected to all the nodes in the previous layer and all the nodes in the next layer by weights. The first layer in the network represents the inputs and the last layer represents the outputs. Given certain inputs we would like our neural network to predict the outputs. Initially the weights of a neural network are randomly initialized and we train the network by a process called back propagation where we modify these weights so that the network becomes better at predicting outputs for given inputs. If you want to learn more about how neural networks work and how they can be trained, take a look at this. However, in the genetic sense of things, we don’t really need to perform back-propagation. I’ll get to the details later on in the post.
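As a concrete toy illustration of the forward pass just described, here is a tiny fully connected network in plain numpy, with the same (3, 7, 1) shape as the Flappy Bird networks used later in this post (the weights are just random placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Random placeholder weights for a (3, 7, 1) network.
W1, b1 = rng.standard_normal((3, 7)), np.zeros(7)
W2, b2 = rng.standard_normal((7, 1)), np.zeros(1)

def forward(x):
    hidden = sigmoid(x @ W1 + b1)      # layer 1 (inputs) -> layer 2 (hidden)
    return sigmoid(hidden @ W2 + b2)   # layer 2 (hidden) -> layer 3 (output)

out = forward(np.array([0.1, -0.3, 0.4]))  # a single probability-like output
```

Since we never call back-propagation here, "training" in the genetic setting will mean directly replacing and perturbing W1 and W2, as the rest of the post explains.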

Now you might be thinking, what exactly is “Neuro-Evolution” and how does genetics come into play?

Let’s talk about the first part of the question. “Genetic Evolution” is the process by which we start out with a number of random species, which then change and evolve over time following a certain set of principles, so that by the end we have an entirely new set of species which are better than the ones we started out with. In genetics, we define fitness as the rating of a certain species, and with evolution we transform existing species in a manner that results in species with higher fitness values. This fitness value could be defined in any manner; in real life it could be survival capacity, which is why we have developed the features we have as humans.

“Survival of the fittest”

Coming to the second part, “Neuro”. We take all that we have learned from genetics and apply that on neural networks. In our definition of Genetic Evolution above, replace “species” with “neural network” and we have Neuro-Evolution. We would start from a pool of neural networks (a few hundreds) and then perform a few evolutionary steps on them multiple times to end up with a pool of different neural networks which have a higher fitness value compared to the ones we started out with. We have talked about what Neuro-Evolution is, but let’s get to the specific steps that take place in an evolution.

Consider we have a pool of neural networks at time (t). We then apply genetic crossover followed by random mutation to end up with another pool of neural networks at time (t+1). We repeat this for multiple time steps to get our final set of networks. What we expect from genetic evolution is that,

∑ (fitness(network[i])){t} < ∑ (fitness(network[i])){t+1}, where the sum runs over all networks i in the pool.

At each time-step, we start out with a pool of neural networks of size (n) and we assign each network a selection probability based on their fitness, higher the fitness, more the selection probability. We perform this selection step (n/2) times where we select two networks at each selection step. We then perform genetic crossover and random mutation with each selected pair of networks to get a new pair of networks. In the end, we would have (n) networks which would move on to the next generation. Let’s talk about the genetic crossover and random mutation steps in detail.
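The fitness-proportional selection step can be sketched as a roulette wheel (the fitness values below are made up for illustration):

```python
import random

random.seed(1)

def select_pair(fitness):
    """Fitness-proportional (roulette-wheel) selection of two parent indices."""
    total = sum(fitness)
    weights = [f / total for f in fitness]
    # random.choices samples with replacement according to the given weights.
    i, j = random.choices(range(len(fitness)), weights=weights, k=2)
    return i, j

fitness = [1.0, 5.0, 20.0, 2.0]  # made-up scores; network 2 is the fittest
counts = [0, 0, 0, 0]
for _ in range(1000):
    i, j = select_pair(fitness)
    counts[i] += 1
    counts[j] += 1
# Over many selections, the fittest network gets picked far more often.
```

Sampling with replacement means a very fit network can parent many pairs in one generation, which is exactly the selection pressure we want.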

  1. Genetic Crossover
    1. The crossover could be of any form as long as we swap something or the other from one of the selected networks to the other. We could select a certain layer and swap the weights, or select a bunch of nodes and swap their weights, etc. Any of these would fit the description of genetic crossover.
    2. Performing this step with 2 networks (parents) gives us 2 networks as outputs (children).
  2. Random Mutation
    1. All the children that we get from the previous step are then subject to random mutation.
    2. With a certain probability, select weights randomly in the network and tweak their values a little. This could be adding a small value, flipping the sign, etc.

All this is quite interesting, but are there any practical uses for this? Let’s take a look at how we can use Neuro-Evolution to make a bot that plays Flappy Bird in Keras (a beginner friendly deep learning library for Python)!

In Flappy Bird the user has only one action that they can perform, clicking a button to flap. Let us consider a very simple 3 layered neural network of the shape (3, 7, 1) which we use to predict the action. In our example here, each bird is represented by a neural network.

The inputs to the network are the three parameters that the bird sees and the output gives us the probability of choosing the ‘flap’ action.

from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD

# Initialize all models
current_pool = []
for i in range(total_models):
  model = Sequential()
  model.add(Dense(output_dim=7, input_dim=3))
  model.add(Dense(output_dim=1, activation='sigmoid')) # Output: flap probability
  model.compile(loss="mse", optimizer=SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True), metrics=["accuracy"])
  current_pool.append(model)

In Keras, a simple way of creating neural networks is by using its Sequential() model. The Sequential() call returns an empty container to which a user can add the layers they want. Here we add Dense layers, which are essentially the fully connected layers we mentioned earlier. Neural networks often have an activation function attached to ensure non-linearity, and here we use the sigmoid activation on the output. Normally you would also define a loss function and an optimizer to train the network via back-propagation, but since we don't perform back-propagation here, that part is effectively ignored.

We create a pool of size 'total_models', all initialized randomly. Since we need a large pool to start with, it is recommended to use at least around a hundred networks.

Let us now define our crossover function,

def model_crossover(model_idx1, model_idx2):
  global current_pool
  weights1 = current_pool[model_idx1].get_weights()
  weights2 = current_pool[model_idx2].get_weights()
  weightsnew1 = list(weights1) # Copy the lists so the swap below doesn't alias them
  weightsnew2 = list(weights2)
  weightsnew1[0] = weights2[0]
  weightsnew2[0] = weights1[0]
  return [weightsnew1, weightsnew2]

Here, we select the set of weights in our network that connect the 3 nodes in the first layer to the 7 nodes in the second layer and then swap these weights for the given two networks from the pool.

Moving on to random mutation,

def model_mutate(weights):
  for xi in range(len(weights)):
    for yi in range(len(weights[xi])):
      if random.uniform(0, 1) > 0.85:
        change = random.uniform(-0.5, 0.5)
        weights[xi][yi] += change
  return weights

We select weights randomly with a 0.15 probability and then add a random value between -0.5 and +0.5 to them.

Here, the fitness function is as simple as the amount of time that the bird stays alive in the game! The higher our score is for a given neural network, the fitter it is compared to the others. We update the fitness of each of our neural networks in the game loop.

for idxPlayer in range(total_models):
  # Check if the bird is alive and increment its fitness
  if playersState[idxPlayer] == True:
    fitness[idxPlayer] += 1

We also boost the fitness a bit if the bird crosses a pipe (this ensures that we favor crossing a pipe much more than just staying alive)

Let us now define the action prediction function using our neural network.

def predict_action(height, dist, pipe_height, model_num):
  global current_pool
  # The height, dist and pipe_height must be between 0 to 1 (Scaled)
  height = min(SCREENHEIGHT, height) / SCREENHEIGHT - 0.5
  dist = dist / 450 - 0.5 # Max pipe distance from player will be 450
  pipe_height = min(SCREENHEIGHT, pipe_height) / SCREENHEIGHT - 0.5
  neural_input = np.asarray([height, dist, pipe_height])
  neural_input = np.atleast_2d(neural_input)
  output_prob = current_pool[model_num].predict(neural_input, 1)[0]
  if output_prob[0] <= 0.5:
    # Perform the jump action
    return 1
  return 2

This function is called for each neural network in the game loop as long as the bird corresponding to the neural network is alive. Based on this, we perform the flap action on the birds.

With all that done, we only have one more thing to do, initialize a pool of neural networks to represent the set of birds (100 in this case) and then after each game is over, perform neural evolution so that they perform better in the future.

def showGameOverScreen():
  """Perform genetic updates here"""
  global current_pool
  global fitness
  global generation
  new_weights = []
  total_fitness = 0
  for select in range(total_models):
    total_fitness += fitness[select]
  # Turn the fitness values into a cumulative probability distribution
  for select in range(total_models):
    fitness[select] /= total_fitness
    if select > 0:
      fitness[select] += fitness[select-1]
  for select in range(int(total_models/2)):
    parent1 = random.uniform(0, 1)
    parent2 = random.uniform(0, 1)
    idx1 = -1
    idx2 = -1
    # Pick the first network whose cumulative probability covers the sample
    for idxx in range(total_models):
      if fitness[idxx] >= parent1:
        idx1 = idxx
        break
    for idxx in range(total_models):
      if fitness[idxx] >= parent2:
        idx2 = idxx
        break
    new_weights1 = model_crossover(idx1, idx2)
    updated_weights1 = model_mutate(new_weights1[0])
    updated_weights2 = model_mutate(new_weights1[1])
    new_weights.append(updated_weights1)
    new_weights.append(updated_weights2)
  # Load the children into the pool and reset the fitness scores
  for select in range(len(new_weights)):
    fitness[select] = -100
    current_pool[select].set_weights(new_weights[select])
  generation = generation + 1

This is called when the game ends and all the birds have died. Here, we select pairs of parents based on their fitness scores from the previous run of the game and call the crossover and mutation functions to get pairs of children which would be used as our new pool.

Apart from this, we would also need to perform some bookkeeping here and there, and then our code would be ready to run. After a few hundred generations, we can expect the genetically trained networks to do exceptionally well at the game. (Much better than humans, in fact!)

And that’s all there is to it!

Some sample results:

Sample 1: When training just started, at generation 10.


Sample 2: When training is over, at generation 1000.


The entire code for the project and some sample runs of the game using trained networks can be found here.

Further Reading:

  1. NEAT (NeuroEvolution of Augmenting Topologies) – The original paper by Kenneth O. Stanley and Risto Miikkulainen
  2. NeuroEvolution with MarI/O – A neural evolution based algorithm which enables a bot to play Mario by SethBling

Neural Style Transfer + Sketches?

Style transfer, put simply, is taking two images (a style image and a content image) and creating a new image which captures the texture and color of the style image and the edges and finer details of the content image. We all know how mesmerizing the outputs of style transfer are; apps like Prisma thrive on this.

We use deep networks that have been trained on object detection and localization. These networks develop an intrinsic understanding of how to distinguish between the main content in the image, which is used to perform the detection, and the rest of the style variations in the image. This is essential, since object detection should ideally ignore the style/color and concentrate on the semantic meaning the image is trying to portray. We exploit this by probing into the network and copying the style of one image onto another to generate fascinating results.

Neural Style Transfer has two main steps,

  1. Network Creation – We load a pre-trained network and then add some custom modules to it.
  2. De-convolutional image learning – Based on the modified network, we pass a random noise image X as input and get back gradients which we use to update X to X'. We then set X = X' and repeat this step for many (thousands of) iterations.
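Step 2 boils down to gradient descent on the image itself rather than on the network weights. The toy numpy sketch below mimics that loop on made-up "feature maps" with a simple content term and a Gram-matrix style term; the shapes, loss weights and learning rate are illustrative assumptions, not the actual VGG-based setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def gram(F):
    return F @ F.T  # the "style" statistic: correlations between feature channels

# Toy feature maps (channels x positions) standing in for one conv layer's output.
F_content = rng.standard_normal((4, 16))  # features of the content image
F_style = rng.standard_normal((4, 16))    # features of the style image

X = rng.standard_normal((4, 16))  # start from random noise, as in step 2
alpha, beta, lr = 1.0, 1e-3, 0.01

for it in range(500):
    # Analytic gradients of ||X - F_content||^2 and ||gram(X) - gram(F_style)||^2.
    grad_content = 2.0 * (X - F_content)
    grad_style = 4.0 * (gram(X) - gram(F_style)) @ X
    X = X - lr * (alpha * grad_content + beta * grad_style)  # X = X' each iteration

content_loss = np.sum((X - F_content) ** 2)
```

In the real algorithm the feature maps are recomputed by forward passes through the pre-trained network at every iteration, and the gradients flow back to the pixels; the loop structure is the same.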

We used Torch running on Lua-JIT for coding this up. For most of the results we used the 19-layer VGG_ILSVRC network; we also tested a bit on smaller networks that we trained ourselves, but the results were not that great.

Overall, we can see that the content image is essentially used to capture the edges and structure-based information of the given input. Sketches generally tend to capture this exact information in a very minimalistic format. Based on this fact, we wanted to use sketches as input content images and images corresponding to the sketch as the style images. These would be retrieved using a Sketch-Based-Image-Retrieval system. This would give us an automatic coloring of sketches.

Automatic Sketch Coloring has five main steps,

  1. Image Retrieval – Using a standard SBIR system we retrieve the best match image.
  2. Sketch Segmentation – The sketch would have to be segmented into multiple foreground segments and the background.
  3. Image Segmentation – The image would also have to be segmented into multiple foreground regions and the background.
  4. Segment Matching – Each segment in the sketch would have to be matched to its most similar segment in the image segment.
  5. Local Style Transfer – Apply style transfer to each pair of segments at a time to get a local style transfer effect.

The main reason these images are not perfect is that the networks we used were trained on normal images, not on sketches. Another issue is that the segmentation is imperfect most of the time, and because of this, incorrect regions get matched up and the style transfer does not work well. We are still working on this, but the initial results seem quite promising.

The project is hosted at,

Playing around with Google Deep Dream

One of the simplest ways to get started with convolutional neural networks and deep learning is to take images and observe how they change as we modify various parameters within the network or optimize on various factors. It’s quite simple to get started and helps us appreciate how amazing these things are.

Google Deep Dream essentially takes an image as an input and then modifies it to generate various patterns/shapes/objects within the image. It takes whatever it can find in the image which seems like something it knows and then amplifies it and adds it to the image. Put in simple terms, suppose it sees a dog in the image, then it would just try to keep adding more dogs everywhere in the image.

This is quite interesting since we can select layers within the network where we want to dream. If we choose one of the initial layers of the network then more of the patterns that the net recognizes at that layer would be amplified and added back to the input image. Generally, the first few layers capture simple things like horizontal/vertical lines, small patches or some kinds of simple patterns or textures. So dreaming at these layers would generate lots of these patches/textures and add them to the input image. However, if we dream at higher layers in the network they often recognize much broader features like the eyes or maybe the ears of a dog. Now if we dream at this level, then a lot of eyes/ears would start to appear vaguely in the input image as we keep iterating.

The features that the network identifies mainly depends on the network we use. Suppose we use a network which was mainly trained on various images of dogs, then most of the features the network identifies would be related to the features of dogs such as their eyes, ears, fur, tail etc.

Doing this is quite simple: suppose we want to dream at a layer i, then at that layer we just set grad(i) = output(i), backpropagate, and perform gradient ascent on the input image instead of the standard gradient descent. We do this by modifying the input image using the gradients that accumulate at the input layer. In this manner, after a few iterations, we change the input so that we start to see more of what the network currently sees at the dream layer.

In a nutshell, we select a layer in the network where we want to dream. Set grad(i) = output(i) and then backpropagate and modify the input image. After a few iterations, we would be able to see how the input image changes to generate various shapes/patterns/objects etc.
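In toy form, the whole loop fits in a few lines. Here a single random linear filter bank plus ReLU stands in for "layer i" (an illustrative stand-in, not a real trained network): we set the layer's gradient to its own output, backpropagate by hand, and ascend on the input:

```python
import numpy as np

rng = np.random.default_rng(0)

W = rng.standard_normal((8, 8))  # one random linear "filter bank" + ReLU

def layer(x):
    return np.maximum(0.0, W @ x)

x = rng.standard_normal(8) * 0.1  # a tiny random "input image"
lr = 0.01
before = np.sum(layer(x))

for it in range(100):
    a = layer(x)
    # Set the layer's gradient to its own output (grad(i) = output(i)),
    # then backpropagate it through the ReLU mask and the linear map by hand.
    grad_x = W.T @ (a * (a > 0))
    x = x + lr * grad_x  # gradient ascent: change the input, not the weights

after = np.sum(layer(x))
# The layer's activations grow: the input now excites the layer more strongly.
```

In a real deep dream run the only differences are that the "layer" is deep inside a trained network and the input is a full image, with some jitter and smoothing thrown in to keep the result visually pleasant.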

I have used the Torch framework to code up a simple implementation of this, Deep Dream Torch. This is just a starting point, there are a whole variety of things that we can do with this and play around to achieve all sorts of unique transformations of a given image.