An Introduction to Q-Learning Part 2/2


Published May 20, 2022



Thomas Simonini

Unit 2, part 2 of the Deep Reinforcement Learning Class with Hugging Face 🤗

โš ๏ธ A new updated version of this article is available here ๐Ÿ‘‰ https://huggingface.co/deep-rl-course/unit1/introduction

This article is part of the Deep Reinforcement Learning Class, a free course from beginner to expert. Check the syllabus here.


In the first part of this unit, we learned about value-based methods and the difference between Monte Carlo and Temporal Difference Learning.

So, in this second part, we'll study Q-Learning, implement our first RL agent from scratch (a Q-Learning agent), and train it in two environments:

  1. Frozen Lake v1 ❄️: where our agent will need to go from the starting state (S) to the goal state (G) by walking only on frozen tiles (F) and avoiding holes (H).
  2. An autonomous taxi 🚕: where the agent will need to learn to navigate a city to transport its passengers from point A to point B.

This unit is fundamental if you want to be able to work on Deep Q-Learning (Unit 3).

So let's get started! 🚀

  • Introducing Q-Learning
    • What is Q-Learning?
    • The Q-Learning algorithm
    • Off-policy vs. On-policy
  • A Q-Learning example

Introducing Q-Learning

What is Q-Learning?

Q-Learning is an off-policy, value-based method that uses a TD approach to train its action-value function:

  • Off-policy: we'll talk about that at the end of this chapter.
  • Value-based method: it finds the optimal policy indirectly by training a value or action-value function that tells us the value of each state or each state-action pair.
  • Uses a TD approach: it updates its action-value function at each step instead of at the end of the episode.

Q-Learning is the algorithm we use to train our Q-Function, an action-value function that determines the value of being at a particular state and taking a specific action at that state.


The Q comes from "the Quality" of that action at that state.

Internally, our Q-Function has a Q-table, a table where each cell corresponds to a state-action pair value. Think of this Q-table as the memory or cheat sheet of our Q-Function.

If we take this maze example:


The Q-table has just been initialized, which is why all the values are equal to 0. This table contains, for each state, the four state-action values.


Here we see that the state-action value of the initial state and going up is 0:


Therefore, the Q-Function contains a Q-table that has the value of each state-action pair. And given a state and an action, our Q-Function will search its Q-table to output the value.


If we recap, Q-Learning is the RL algorithm that:

  • Trains a Q-Function (an action-value function), which internally is a Q-table that contains all the state-action pair values.
  • Given a state and action, our Q-Function will search its Q-table for the corresponding value.
  • When the training is done, we have an optimal Q-Function, which means we have an optimal Q-table.
  • And if we have an optimal Q-Function, we have an optimal policy, since we know, for each state, the best action to take.

But, in the beginning, our Q-table is useless since it gives arbitrary values for each state-action pair (most of the time, we initialize the Q-table to 0). But as we explore the environment and update our Q-table, it will give us better and better approximations.
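As a tiny illustration of this idea, here is a minimal sketch of a Q-table stored as a 2D array indexed by state and action. The grid size and action count are assumptions chosen to match the maze picture above, not code from the course notebook:

```python
import numpy as np

n_states = 6    # illustrative: one state per cell of the small maze
n_actions = 4   # up, down, left, right

# The Q-table: one row per state, one column per action, initialized to 0
q_table = np.zeros((n_states, n_actions))

def q_value(state: int, action: int) -> float:
    """The Q-Function lookup: given a state and an action, read the value from the Q-table."""
    return float(q_table[state, action])

print(q_value(0, 0))  # 0.0 before any training
```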


So now that we understand what Q-Learning, the Q-Function, and the Q-table are, let's dive deeper into the Q-Learning algorithm.

The Q-Learning algorithm

This is the Q-Learning pseudocode; let's study each part and see how it works with a simple example before implementing it. Don't be intimidated by it, it's simpler than it looks! We'll go over each step.


Step 1: We initialize the Q-Table


We need to initialize the Q-table for each state-action pair. Most of the time, we initialize with values of 0.

Step 2: Choose action using Epsilon Greedy Strategy


Epsilon Greedy Strategy is a policy that handles the exploration/exploitation trade-off.

The idea is that we define epsilon ɛ = 1.0:

  • With probability 1 - ɛ: we do exploitation (i.e., our agent selects the action with the highest state-action pair value).
  • With probability ɛ: we do exploration (trying a random action).

At the beginning of the training, the probability of doing exploration will be huge since ɛ is very high, so most of the time we'll explore. But as the training goes on, and consequently our Q-table gets better and better in its estimations, we progressively reduce the epsilon value, since we will need less and less exploration and more exploitation.
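As a minimal sketch of this strategy (the random generator, the function names, and the linear decay schedule are illustrative assumptions, not the course notebook's exact code):

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy_action(q_table: np.ndarray, state: int, epsilon: float) -> int:
    """With probability epsilon, explore (random action); otherwise exploit (greedy action)."""
    if rng.random() < epsilon:
        return int(rng.integers(q_table.shape[1]))   # exploration: random action
    return int(np.argmax(q_table[state]))            # exploitation: best known action

def decay_epsilon(epsilon: float, min_epsilon: float = 0.05, decay: float = 0.005) -> float:
    """Progressively reduce epsilon so we explore less and less as training goes on."""
    return max(min_epsilon, epsilon - decay)
```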


Step 3: Perform action $A_t$, get reward $R_{t+1}$ and next state $S_{t+1}$


Step 4: Update $Q(S_t, A_t)$

Remember that in TD Learning, we update our policy or value function (depending on the RL method we choose) after one step of the interaction.

To produce our TD target, we use the immediate reward $R_{t+1}$ plus the discounted value of the best state-action pair of the next state (we call that bootstrapping).


Therefore, our $Q(S_t, A_t)$ update formula goes like this:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \right]$$

where $\alpha$ is the learning rate and $\gamma$ is the discount rate.

It means that to update our $Q(S_t, A_t)$:

  • We need $S_t$, $A_t$, $R_{t+1}$, and $S_{t+1}$.
  • To update our Q-value at a given state-action pair, we use the TD target.

How do we form the TD target?

  1. We obtain the reward $R_{t+1}$ after taking the action $A_t$.
  2. To get the best next-state-action pair value, we use a greedy policy to select the next best action. Note that this is not an epsilon-greedy policy: it always takes the action with the highest state-action value.

Then, when the update of this Q-value is done, we start in a new state and select our action using our epsilon-greedy policy again.

This is why we say that Q-Learning is an off-policy algorithm.
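Putting steps 3 and 4 together, a single Q-Learning update can be sketched as follows (a minimal sketch reusing the illustrative `q_table` array from above; the function name and the handling of terminal states are assumptions, not the notebook's exact code). Notice the greedy `max` in the target, while the action that was actually performed came from the epsilon-greedy policy; that is the off-policy part discussed next:

```python
import numpy as np

def q_learning_update(q_table, state, action, reward, next_state, done,
                      learning_rate=0.1, gamma=0.99):
    """One Q-Learning update: move Q(state, action) a little toward the TD target."""
    # Greedy (max) value over next actions; 0 if the episode ended at next_state.
    best_next_value = 0.0 if done else np.max(q_table[next_state])
    td_target = reward + gamma * best_next_value
    td_error = td_target - q_table[state, action]
    q_table[state, action] += learning_rate * td_error
```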

Off-policy vs On-policy

The difference is subtle:

  • Off-policy: using a different policy for acting and updating.

For instance, with Q-Learning, the epsilon-greedy policy (acting policy) is different from the greedy policy that is used to select the best next-state-action value to update our Q-value (updating policy).

The acting policy (epsilon-greedy) is therefore different from the policy we use during the training (updating) part.

  • On-policy: using the same policy for acting and updating.

For instance, with Sarsa, another value-based algorithm, it is the epsilon-greedy policy that selects the next state-action pair, not a greedy policy.
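To make the contrast concrete, here is a sketch of the two TD targets side by side (illustrative code following the same conventions as the snippets above):

```python
import numpy as np

def q_learning_target(q_table, reward, next_state, gamma=0.99):
    """Q-Learning (off-policy): bootstrap with the greedy max over next actions,
    regardless of which action the acting policy will actually take next."""
    return reward + gamma * np.max(q_table[next_state])

def sarsa_target(q_table, reward, next_state, next_action, gamma=0.99):
    """Sarsa (on-policy): bootstrap with the action actually chosen by the
    acting (epsilon-greedy) policy in the next state."""
    return reward + gamma * q_table[next_state, next_action]
```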


A Q-Learning example

To better understand Q-Learning, let's take a simple example:

  • You're a mouse in this tiny maze. You always start at the same starting point.
  • The goal is to eat the big pile of cheese at the bottom right-hand corner and avoid the poison. After all, who doesn't like cheese?
  • The episode ends if we eat the poison, eat the big pile of cheese, or if we take more than five steps.
  • The learning rate is 0.1
  • The gamma (discount rate) is 0.99

The reward function goes like this:

  • +0: Going to a state with no cheese in it.
  • +1: Going to a state with a small cheese in it.
  • +10: Going to the state with the big pile of cheese.
  • -10: Going to the state with the poison and thus dying.
  • +0: If we take more than five steps.

To train our agent to have an optimal policy (that is, a policy that goes right, right, down), we will use the Q-Learning algorithm.

Step 1: We initialize the Q-Table


So, for now, our Q-table is useless; we need to train our Q-Function using the Q-Learning algorithm.

Let's do it for 2 training timesteps:

Training timestep 1:

Step 2: Choose action using Epsilon Greedy Strategy

Because epsilon is big (= 1.0), I take a random action. In this case, I go right.


Step 3: Perform action $A_t$, get $R_{t+1}$ and $S_{t+1}$

By going right, I've got a small cheese, so $R_{t+1} = 1$, and I'm in a new state.


Step 4: Update $Q(S_t, A_t)$

We can now update $Q(S_t, A_t)$ using our formula.

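Plugging the numbers into the update formula (the Q-table is still all zeros, the learning rate is 0.1, gamma is 0.99, and the reward is +1; the labels $S_0$ and $S_1$ are just illustrative names for the start state and the state reached by going right):

$$Q(S_0, \text{right}) \leftarrow 0 + 0.1 \times \left[ 1 + 0.99 \times \max_a Q(S_1, a) - 0 \right] = 0.1 \times (1 + 0) = 0.1$$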

Training timestep 2:

Step 2: Choose action using Epsilon Greedy Strategy

I take a random action again, since epsilon is still big (0.99: we decayed it a little bit, because as the training progresses we want less and less exploration).

I took the action down. Not a good action, since it leads me to the poison.


Step 3: Perform action $A_t$, get $R_{t+1}$ and $S_{t+1}$

Because I go to the poison state, I get $R_{t+1} = -10$, and I die.


Step 4: Update $Q(S_t, A_t)$

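Plugging the numbers in again (this Q-value is still 0, the reward is -10, and since the episode ends here the bootstrapped value of the next state is taken to be 0):

$$Q(S_1, \text{down}) \leftarrow 0 + 0.1 \times \left[ -10 + 0 - 0 \right] = -1$$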

Because we're dead, we start a new episode. But what we see here is that, with two exploration steps, my agent already became smarter.

As we continue exploring and exploiting the environment and updating Q-values using the TD target, the Q-table will give us better and better approximations. And thus, at the end of the training, we'll get an estimate of the optimal Q-Function.

Now that we've studied the theory of Q-Learning, let's implement it from scratch: a Q-Learning agent that we will train in two environments (a skeleton of the training loop is sketched after the list):

  1. Frozen-Lake-v1 ❄️ (non-slippery version): where our agent will need to go from the starting state (S) to the goal state (G) by walking only on frozen tiles (F) and avoiding holes (H).
  2. An autonomous taxi 🚕: where the agent will need to learn to navigate a city to transport its passengers from point A to point B.
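
Before jumping into the notebook, here is a rough skeleton of how the pieces above fit together for the non-slippery FrozenLake-v1 environment. This is a sketch under assumptions, not the notebook's exact code: the hyperparameters are illustrative, and depending on whether you use `gym` or `gymnasium`, the `reset()`/`step()` return values differ slightly (see the comments).

```python
import gym          # or: import gymnasium as gym (reset/step signatures differ slightly)
import numpy as np

env = gym.make("FrozenLake-v1", is_slippery=False)
q_table = np.zeros((env.observation_space.n, env.action_space.n))

learning_rate, gamma = 0.7, 0.95                   # illustrative hyperparameters
epsilon, min_epsilon, decay = 1.0, 0.05, 0.0005
n_episodes, max_steps = 10_000, 100

for episode in range(n_episodes):
    state = env.reset()                            # gymnasium: state, info = env.reset()
    for _ in range(max_steps):
        # Step 2: choose an action with the epsilon-greedy strategy
        if np.random.random() < epsilon:
            action = env.action_space.sample()       # explore
        else:
            action = int(np.argmax(q_table[state]))  # exploit

        # Step 3: perform the action, observe the reward and the next state
        next_state, reward, done, info = env.step(action)  # gymnasium also returns `truncated`

        # Step 4: Q-Learning update toward the TD target
        best_next = 0.0 if done else np.max(q_table[next_state])
        q_table[state, action] += learning_rate * (reward + gamma * best_next - q_table[state, action])

        state = next_state
        if done:
            break

    # Progressively reduce exploration as training goes on
    epsilon = max(min_epsilon, epsilon - decay)
```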

Start the tutorial here 👉 https://colab.research.google.com/github/huggingface/deep-rl-class/blob/main/unit2/unit2.ipynb

The leaderboard 👉 https://huggingface.co/spaces/chrisjay/Deep-Reinforcement-Learning-Leaderboard

Congrats on finishing this chapter! There was a lot of information. And congrats on finishing the tutorials. You've just implemented your first RL agent from scratch and shared it on the Hub 🥳.

Implementing from scratch when you study a new architecture is important to understand how it works.

It's normal if you still feel confused by all these elements. This was the same for me and for everyone who has studied RL.

Take time to really grasp the material before continuing.

And since the best way to learn and to avoid the illusion of competence is to test yourself, we wrote a quiz to help you find where you need to reinforce your study. Check your knowledge here 👉 https://github.com/huggingface/deep-rl-class/blob/main/unit2/quiz2.md

It's essential to master these elements and have solid foundations before entering the fun part. Don't hesitate to modify the implementation, try ways to improve it, and change environments; the best way to learn is to try things on your own!

We published additional readings in the syllabus if you want to go deeper 👉 https://github.com/huggingface/deep-rl-class/blob/main/unit2/README.md

In the next unit, we're going to learn about Deep Q-Learning.

And don't forget to share with your friends who want to learn 🤗!

Finally, we want to improve and update the course iteratively with your feedback. If you have some, please fill this form 👉 https://forms.gle/3HgA7bEHwAmmLfwh9

Keep learning, stay awesome,
