Q Learning Explained with Example Code
Intro to Reinforcement Learning and Q Learning
Q Learning is an off-policy TD reinforcement learning algorithm.
Reinforcement learning consists of an agent, an environment, and a reward system.
The agent performs an action in its current environment and receives a reward for it.
The goal of the agent is to maximize the expected cumulative reward.
This can be visualized as an agent-environment loop: the agent takes an action, the environment returns a new state and a reward, and the cycle repeats.
(Diagram source: https://devblogs.nvidia.com/train-reinforcement-learning-agents-openai-gym/)
Q learning is a type of TD, or Temporal Difference learning, algorithm.
This means that our algorithm learns at every time step (each pass through the loop in the diagram above) by updating its estimate of the value of each state-action pair it encounters.
In fact, the Q in Q learning refers to a function that takes in a state “s” and an action “a” and returns the expected cumulative reward for taking that action in that state.
Q learning seeks to approximate the optimal action-value function Q*.
Another important detail to note about Q learning is that it is off-policy.
The policy is a function that maps observed states to actions to be taken.
The fact that Q learning is off-policy means that it finds the optimal action-value function independent of the policy being followed.
However, it is important to note that Q learning still uses a policy to determine what action to take given a state.
It's just that Q learning doesn't depend on that policy to approximate the optimal action-value function Q*.
The way Q learning approximates the optimal action-value function Q* is by running many episodes of the agent performing actions to complete a task and learning from each episode.
Q learning "learns" by saving the reward corresponding to each state and action into a table.
Eventually, Q will converge to approximate Q*.
For each episode, Q learning applies an iterative update formula that adjusts the value stored for each state-action pair based on its estimated future reward.
The main idea to keep in mind is that we are constantly trying to update values in the table with better predictions as we explore more and more of the environment.
The update rule is:
Q(s, a) ← Q(s, a) + α · [ r + γ · max_a' Q(s', a') − Q(s, a) ]
where s' is the next state, r is the reward received, α is the learning rate, and γ is the discount factor. (Source: https://en.wikipedia.org/wiki/Q-learning)
There are several terms in the equation worth explaining.
One is the reward, which simply signifies the reward given to the agent by the environment at that particular state.
The discount factor (gamma), a value between 0 and 1, is a variable used to place more value on rewards closer to the current state.
A higher discount factor means that we place more value on states farther out into the future while a smaller discount factor means that we place less value on those distant states.
The learning rate (alpha), also a value between 0 and 1, determines how much new information should override old information in the Q table.
A higher value means that we rely more on the more recent information while a lower value means that we place more value on prior knowledge.
The discount factor and learning rate variables appear to be quite similar at first glance, and it is worth taking the time to distinguish between the two.
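To make the update concrete, here is a small worked example (the numbers are arbitrary and chosen only to illustrate the arithmetic): suppose Q(s, a) = 2, the reward r = 1, the best value available from the next state is max_a' Q(s', a') = 3, gamma = 0.95, and alpha = 0.5. Then the new value is Q(s, a) = 2 + 0.5 · (1 + 0.95 · 3 − 2) = 2 + 0.5 · 1.85 = 2.925, nudging the old estimate toward the newly observed evidence.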
Implementation:
All the code we are implementing from here on out can be found in this repo.
It is recommended to have the code up in a python environment next to this article.
Our code will use OpenAI Gym, a toolkit for testing reinforcement learning algorithms such as Q learning.
To use OpenAI Gym, make sure you have Python 3.5 or later installed.
Then run "pip install gym".
If more detailed instructions are needed, follow the steps outlined in the Gym documentation: https://gym.openai.com/docs/.
Before we begin implementing our Q learning algorithm, we need to know some background knowledge on Gym.
In order to use gym, we will have to import it.
This can be done with a simple import statement:
import gym
Next, we need to specify the environment we want to test our algorithm in.
We will be using Taxi-v2 first.
We specify our environment and assign it to a variable to be used using the following line of code:
env = gym.make("Taxi-v2")
To initialize the environment, we use the following line of code:
env.reset()
If we were to print the value returned by the above line of code, "print(env.reset())", we would get a number.
This number signifies the starting state.
Every environment has a certain number of distinct states representing different possible ways our environment can look.
To see how many possible states our current environment has, we can print the following:
print(env.observation_space.n)
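Although not strictly needed yet, it is also handy to know the size of the action space (we will use it later when building our Q table); for a discrete action space like Taxi-v2's, it can be printed the same way:
print(env.action_space.n)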
To see what the current state looks like, we can use the following line of code:
env.render()
Using this command, we can see a visualization of the state of our current environment.
In this specific environment, there are four locations: R, G, Y and B.
The taxi is represented by a yellow block.
The goal of the taxi is to carry a passenger from one location to another.
The taxi will change color to green when it is carrying a passenger.
The reward is 20 points for a successful drop-off, -10 points for an incorrect drop-off, and -1 for every time step.
To take a random step in Gym, we use the function "env.step()" with "env.action_space.sample()" to get a random action:
state, reward, done, info = env.step(env.action_space.sample())
Notice that the function "env.step()" returns four values, which we have assigned to four variables.
The first value represents the new state of the environment after our action.
The second value represents the reward we get from the new state.
The third value is a boolean representing whether we have reached a terminal state or not.
The fourth value contains information useful for debugging purposes.
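To make the step API concrete, here is a minimal random-agent loop (a sketch for illustration, not the repo's code) that plays one episode and prints the four returned values at each step:

import gym

env = gym.make("Taxi-v2")
state = env.reset()
done = False
while not done:
    # sample a random action and unpack the four values returned by step()
    state, reward, done, info = env.step(env.action_space.sample())
    print(state, reward, done, info)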
Armed with our new knowledge of Gym and the environment we are trying to solve, we can begin to implement our Q learning algorithm.
Because Q learning "learns" by saving the reward corresponding to each state and action into a table, we will create a table using numpy that has dimensions "number of possible states" by "number of possible actions".
Besides our Q table, we will have a variable G to keep track of our total accumulated reward per episode.
We will also have a variable avg to keep track of our average every few loops so we can see how our algorithm is faring.
Our code will have two loops.
The outer loop runs our Q learning algorithm over a set number of episodes, while the inner loop steps through a single episode: each iteration of the outer loop is one episode, and each iteration of the inner loop is one step within that episode.
For every step in the inner loop, we will follow a greedy policy: at each state, we pick the action with the maximum estimated value in our Q table for the current state:
action = np.argmax(Q[state])
Then, at every step, we will update our Q table with our action-value equation.
We will then update our overall reward G per episode and update our state variable.
We include the line "avg+=G" as well as the last if statement to print the average of the total accumulated reward per episode every 50 iterations.
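Putting the pieces together, the training loop might look something like the following sketch (the number of episodes, the learning rate alpha, and the exact bookkeeping around avg are assumptions, since the repo's code may differ; note there is no explicit discount factor here yet):

import gym
import numpy as np

env = gym.make("Taxi-v2")

# Q table: one row per state, one column per action, initialized to zero
Q = np.zeros([env.observation_space.n, env.action_space.n])

alpha = 0.618     # learning rate (assumed value)
episodes = 5000   # number of episodes (assumed value)
avg = 0           # accumulates per-episode rewards for the printed average

for episode in range(1, episodes + 1):
    state = env.reset()
    done = False
    G = 0  # total accumulated reward for this episode
    while not done:
        action = np.argmax(Q[state])  # greedy action for the current state
        next_state, reward, done, info = env.step(action)
        # action-value update (gamma is implicitly 1 here)
        Q[state, action] += alpha * (reward + np.max(Q[next_state]) - Q[state, action])
        G += reward
        state = next_state
    avg += G
    if episode % 50 == 0:
        print(avg / 50)  # average total reward over the last 50 episodes
        avg = 0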
Upon running the code (python3 <file-name>.py), we will see that the printed numbers increase and gradually converge to a single value.
Extensions to Q Learning
There are extensions to Q learning that we can add, but first we will show why our current implementation may not be sufficient. In our code, let's switch the environment to FrozenLake.
We can do this using the statement:
env = gym.make("FrozenLake-v0")
In FrozenLake, the agent starts at a spot and tries to reach the goal spot while avoiding holes.
When we run our code, we can see that we get 0.0 every single time.
This is because in FrozenLake, the reward system works by giving 1 if the agent reaches the goal and 0 otherwise.
This problem is known as "sparse rewards".
If we think about what our algorithm is doing, we are choosing the action that yields the maximum reward.
Since our table is full of zeroes and numpy.argmax returns the first occurrence of the max, we only ever choose the index of the first action.
We can see this by adding a "print(action)" statement after our action=np.argmax(Q[state]).
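For example, calling argmax on a row of zeros always returns index 0 (a tiny illustration, independent of the repo's code):

import numpy as np
print(np.argmax(np.zeros(4)))  # prints 0: the first occurrence of the maximum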
To alleviate this problem, we are going to have to improve how we choose actions.
We will do so using a strategy called epsilon-greedy.
Essentially, we move in the greedy direction (what we're currently doing) most of the time and with a small probability, move in a random direction.
This will cause our agent to explore the environment as opposed to being stuck in the same spot.
To implement epsilon-greedy, we are going to declare a variable "e" (epsilon) outside our loops.
This will initially be set to 0.1.
We will move in our intended direction 90% of the time and a random direction 10% of the time.
This code will look something like this:
prob = np.random.uniform(0, 1)  # random number between 0 and 1
if prob > e:
    action = np.argmax(Q[state])  # greedy action
else:
    action = env.action_space.sample()  # random action
Before we run our code, we are also going to add another feature we covered at the very beginning: the discount factor, or gamma.
As a reminder, the idea behind the discount factor is that we value rewards from future states less than immediate rewards, since we are less certain they will actually occur.
Although it might seem we left this factor out when we were solving Taxi-v2, gamma was effectively just 1 the entire time.
Now however, we are going to edit our action-value function to include an explicit gamma parameter.
We will also declare gamma outside our loops.
It will initially be assigned to 0.95.
The update is the same equation as before, but now with gamma set explicitly to 0.95 instead of 1. (Source: https://en.wikipedia.org/wiki/Q-learning)
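Putting the FrozenLake version together, a sketch of the updated loop might look like this (e = 0.1 and gamma = 0.95 follow the text above; the episode count and learning rate alpha are assumptions):

import gym
import numpy as np

env = gym.make("FrozenLake-v0")

Q = np.zeros([env.observation_space.n, env.action_space.n])

e = 0.1           # exploration probability (epsilon)
gamma = 0.95      # discount factor
alpha = 0.618     # learning rate (assumed value)
episodes = 5000   # number of episodes (assumed value)
avg = 0

for episode in range(1, episodes + 1):
    state = env.reset()
    done = False
    G = 0
    while not done:
        if np.random.uniform(0, 1) > e:
            action = np.argmax(Q[state])        # greedy action
        else:
            action = env.action_space.sample()  # random exploratory action
        next_state, reward, done, info = env.step(action)
        # action-value update with an explicit discount factor gamma
        Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
        G += reward
        state = next_state
    avg += G
    if episode % 50 == 0:
        print(avg / 50)  # average total reward over the last 50 episodes
        avg = 0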
Now when we run our algorithm, we see that by the end, we converge to rewards between 0.3 and 0.5.
Note that because of the nature of randomness in our algorithm, it might be necessary to run the program a couple more times if the output is still a list of 0.0's.
This concludes this post on Q Learning and a few of its extensions.
Hopefully by the end of this post you understand Q learning as well as the basics of OpenAI Gym. Thanks for reading!