Introducing Q-Learning - Hugging Face Deep RL Course (2024)

What is Q-Learning?

Q-Learning is an off-policy value-based method that uses a TD approach to train its action-value function:

  • Off-policy: we’ll talk about that at the end of this unit.
  • Value-based method: finds the optimal policy indirectly by training a value or action-value function that will tell us the value of each state or each state-action pair.
  • TD approach: updates its action-value function at each step instead of at the end of the episode.

Q-Learning is the algorithm we use to train our Q-function, an action-value function that determines the value of being at a particular state and taking a specific action at that state.


The Q comes from “the Quality” (the value) of that action at that state.

Let’s recap the difference between value and reward:

  • The value of a state, or of a state-action pair, is the expected cumulative reward our agent gets if it starts at this state (or state-action pair) and then acts according to its policy.
  • The reward is the feedback the agent gets from the environment after performing an action at a state.

Internally, our Q-function is encoded by a Q-table, a table where each cell corresponds to a state-action pair value. Think of this Q-table as the memory or cheat sheet of our Q-function.

Let’s go through an example of a maze.


The Q-table is initialized with zeros, which is why all of its values are 0. This table contains, for each state and action, the corresponding state-action value. For this simple example, the state is defined only by the position of the mouse. Therefore, we have 2*3 = 6 rows in our Q-table, one row for each possible position of the mouse. In more complex scenarios, the state could contain more information than the position of the actor.


For example, the state-action value of the initial state and going up is 0.

So: the Q-function uses a Q-table that has the value of each state-action pair. Given a state and action, our Q-function will search inside its Q-table to output the value.
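
As a minimal sketch of that lookup, assume the 2*3 maze from above with positions numbered row by row and four actions (up, down, left, right). The action set and the helper names `state_index` and `q_value` are assumptions made for illustration; the text only names "going up".

```python
# Hypothetical layout: 2 rows x 3 columns of cells, 4 possible moves.
N_ROWS, N_COLS = 2, 3
ACTIONS = ["up", "down", "left", "right"]  # assumed action set

# One row per mouse position, one column per action, all values start at 0.
q_table = [[0.0 for _ in ACTIONS] for _ in range(N_ROWS * N_COLS)]


def state_index(row, col):
    """Map a mouse position to its row in the Q-table."""
    return row * N_COLS + col


def q_value(table, state, action):
    """The Q-function: look up the value of taking `action` in `state`."""
    return table[state][ACTIONS.index(action)]


initial_state = state_index(0, 0)
print(q_value(q_table, initial_state, "up"))  # 0.0
```

The Q-function here does no computation of its own: it only reads the cell that belongs to the given state-action pair.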


If we recap, Q-Learning is the RL algorithm that:

  • Trains a Q-function (an action-value function), which internally is a Q-table that contains all the state-action pair values.
  • Given a state and action, our Q-function will search its Q-table for the corresponding value.
  • When the training is done, we have an optimal Q-function, which means we have an optimal Q-table.
  • And if we have an optimal Q-function, we have an optimal policy since we know the best action to take at each state.
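
The last point can be sketched in a few lines: once the table is trained, the policy is read off by taking, in each state, the action with the highest value. The table contents below are made up for illustration, and `greedy_action` and `extract_policy` are hypothetical helper names.

```python
def greedy_action(q_table, state):
    """Return the index of the action with the highest Q-value in `state`."""
    values = q_table[state]
    # On ties, max() keeps the first (lowest-index) action.
    return max(range(len(values)), key=lambda a: values[a])


def extract_policy(q_table):
    """Read a policy off the table: the best action for every state."""
    return [greedy_action(q_table, s) for s in range(len(q_table))]


# Hypothetical trained table with 3 states and 2 actions.
trained_q_table = [
    [0.1, 0.5],
    [0.9, 0.2],
    [0.0, 0.3],
]
print(extract_policy(trained_q_table))  # [1, 0, 1]
```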


In the beginning, our Q-table is useless since it gives arbitrary values for each state-action pair (most of the time, we initialize the Q-table to 0). As the agent explores the environment and we update the Q-table, it will give us a better and better approximation to the optimal policy.


Now that we understand what Q-Learning, Q-functions, and Q-tables are, let’s dive deeper into the Q-Learning algorithm.

The Q-Learning algorithm

The Q-Learning algorithm has four steps; let’s study each one and see how it works with a simple example before implementing it. Don’t be intimidated by it, it’s simpler than it looks! We’ll go over each step.

Step 1: We initialize the Q-table


We need to initialize the Q-table for each state-action pair. Most of the time, we initialize with values of 0.
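
A minimal sketch of this step, using plain Python lists. The function name `initialize_q_table` and the sizes (6 states, 4 actions) are assumptions chosen to match the maze example.

```python
def initialize_q_table(n_states, n_actions, initial_value=0.0):
    """Build an n_states x n_actions table filled with `initial_value`."""
    # Build each row separately so that rows do not share the same list object.
    return [[initial_value] * n_actions for _ in range(n_states)]


q_table = initialize_q_table(n_states=6, n_actions=4)
q_table[0][2] = 1.0  # changing one cell must not affect any other row
print(q_table[0])  # [0.0, 0.0, 1.0, 0.0]
print(q_table[1])  # [0.0, 0.0, 0.0, 0.0]
```

Writing `[[0.0] * n_actions] * n_states` instead would repeat one row object, so an update to a single state would silently show up in every state.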

Step 2: Choose an action using the epsilon-greedy strategy


The epsilon-greedy strategy is a policy that handles the exploration/exploitation trade-off.

The idea is that, with an initial value of ɛ = 1.0:

  • With probability 1 - ɛ: we do exploitation (that is, our agent selects the action with the highest state-action pair value).
  • With probability ɛ: we do exploration (trying a random action).
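
The two cases above can be sketched as a small function. This is an illustrative version, not the course's implementation; the name `epsilon_greedy_action` and the optional `rng` parameter are assumptions.

```python
import random


def epsilon_greedy_action(q_table, state, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    n_actions = len(q_table[state])
    if rng.random() < epsilon:
        return rng.randrange(n_actions)  # exploration: uniform random action
    values = q_table[state]
    return max(range(n_actions), key=lambda a: values[a])  # exploitation


# Hypothetical one-state table with three actions.
q_table = [[0.0, 1.0, 0.5]]
rng = random.Random(0)
print(epsilon_greedy_action(q_table, 0, epsilon=0.0, rng=rng))  # 1
```

With ɛ = 0.0 the function never explores, and with ɛ = 1.0 it always does, because `rng.random()` returns a value in [0, 1).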

At the beginning of the training, the probability of doing exploration will be huge since ɛ is very high, so most of the time, we’ll explore. But as the training goes on, and consequently our Q-table gets better and better in its estimations, we progressively reduce the epsilon value since we will need less and less exploration and more exploitation.
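
The text does not say how ɛ is reduced. One common choice, shown here purely as an assumption, is an exponential decay from a maximum to a minimum value; the function name and all default parameters are hypothetical.

```python
import math


def decayed_epsilon(episode, max_epsilon=1.0, min_epsilon=0.05, decay_rate=0.005):
    """Exponentially decay epsilon from max_epsilon toward min_epsilon."""
    span = max_epsilon - min_epsilon
    return min_epsilon + span * math.exp(-decay_rate * episode)


# Epsilon starts at its maximum and shrinks as the episodes go by.
for episode in (0, 100, 1000):
    print(episode, round(decayed_epsilon(episode), 3))
```

Keeping a small positive minimum means the agent never stops exploring entirely.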


Step 3: Perform action A_t, get reward R_{t+1} and next state S_{t+1}
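
To make this step concrete, here is a sketch of a toy environment for the 2*3 maze. Everything in it is an assumption for illustration: the goal position, the reward of 1 for reaching it, the action numbering, and the `step` function itself.

```python
N_ROWS, N_COLS = 2, 3
GOAL_STATE = 5  # hypothetical: the bottom-right cell ends the episode
MOVES = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}  # up, down, left, right


def step(state, action):
    """Apply `action` in `state`; return (next_state, reward, done)."""
    row, col = divmod(state, N_COLS)
    d_row, d_col = MOVES[action]
    # Clamp to the grid so that bumping into a wall leaves the mouse in place.
    new_row = min(max(row + d_row, 0), N_ROWS - 1)
    new_col = min(max(col + d_col, 0), N_COLS - 1)
    next_state = new_row * N_COLS + new_col
    done = next_state == GOAL_STATE
    reward = 1.0 if done else 0.0  # assumed reward scheme
    return next_state, reward, done


print(step(0, 3))  # (1, 0.0, False)
```

The agent only needs what `step` returns: the next state S_{t+1}, the reward R_{t+1}, and whether the episode is over.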


Step 4: Update Q(S_t, A_t)

Remember that in TD Learning, we update our policy or value function (depending on the RL method we choose) after one step of the interaction.

To produce our TD target, we use the immediate reward R_{t+1} plus the discounted value of the next state, computed by finding the action that maximizes the current Q-function at the next state. (We call that bootstrapping.) With γ as the discount factor, the TD target is R_{t+1} + γ max_a Q(S_{t+1}, a).

Therefore, our Q(S_t, A_t) update formula goes like this:

Q(S_t, A_t) ← Q(S_t, A_t) + α [R_{t+1} + γ max_a Q(S_{t+1}, a) - Q(S_t, A_t)]

where α is the learning rate and γ is the discount factor.

This means that to update our Q(S_t, A_t):

  • We need S_t, A_t, R_{t+1}, S_{t+1}.
  • To update our Q-value at a given state-action pair, we use the TD target.

How do we form the TD target?

  1. We obtain the reward R_{t+1} after taking the action A_t.
  2. To get the best state-action pair value for the next state, we use a greedy policy to select the next best action. Note that this is not an epsilon-greedy policy: it always takes the action with the highest state-action value.

Then, when the update of this Q-value is done, we start in a new state and select our action using an epsilon-greedy policy again.
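
The two steps above, followed by the update itself, can be sketched as below. Here `alpha` is the learning rate and `gamma` the discount factor. The function names and the `done` flag are assumptions added for illustration: `done` drops the bootstrap term when the next state is terminal.

```python
def td_target(q_table, reward, next_state, gamma, done=False):
    """Reward plus the discounted value of the best action in the next state."""
    if done:
        return reward  # assumption: a terminal state has no future value
    return reward + gamma * max(q_table[next_state])  # greedy, not epsilon-greedy


def q_learning_update(q_table, state, action, reward, next_state,
                      alpha, gamma, done=False):
    """Move Q(state, action) a fraction alpha of the way toward the TD target."""
    target = td_target(q_table, reward, next_state, gamma, done)
    td_error = target - q_table[state][action]
    q_table[state][action] += alpha * td_error
    return q_table[state][action]


# Hypothetical table: 2 states, 2 actions.
q_table = [[0.0, 0.0], [0.0, 2.0]]
new_value = q_learning_update(q_table, state=0, action=1, reward=1.0,
                              next_state=1, alpha=0.1, gamma=0.99)
print(new_value)  # about 0.298: target is 1.0 + 0.99 * 2.0 = 2.98
```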

This is why we say that Q-Learning is an off-policy algorithm.
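
Putting the four steps together, here is a sketch of a complete training loop. It runs on a hypothetical five-cell corridor instead of the maze, and the environment, the hyperparameters, and the exponential epsilon decay are all assumptions made for illustration.

```python
import math
import random

N_STATES = 5  # hypothetical corridor: states 0..4, goal at state 4
LEFT, RIGHT = 0, 1
GOAL = N_STATES - 1


def step(state, action):
    """Toy environment: move left or right; reward 1 on reaching the goal."""
    next_state = max(state - 1, 0) if action == LEFT else min(state + 1, GOAL)
    done = next_state == GOAL
    return next_state, (1.0 if done else 0.0), done


def greedy(q_row):
    """Index of the highest value in one row of the Q-table."""
    return max(range(len(q_row)), key=lambda a: q_row[a])


def train(n_episodes=500, max_steps=100, alpha=0.5, gamma=0.9,
          max_eps=1.0, min_eps=0.05, decay_rate=0.01, seed=0):
    rng = random.Random(seed)
    # Step 1: initialize the Q-table with zeros.
    q_table = [[0.0, 0.0] for _ in range(N_STATES)]
    for episode in range(n_episodes):
        epsilon = min_eps + (max_eps - min_eps) * math.exp(-decay_rate * episode)
        state = 0
        for _ in range(max_steps):
            # Step 2: choose an action with the epsilon-greedy acting policy.
            if rng.random() < epsilon:
                action = rng.randrange(2)
            else:
                action = greedy(q_table[state])
            # Step 3: perform the action, observe reward and next state.
            next_state, reward, done = step(state, action)
            # Step 4: update toward the TD target, which uses the greedy max.
            best_next = 0.0 if done else max(q_table[next_state])
            target = reward + gamma * best_next
            q_table[state][action] += alpha * (target - q_table[state][action])
            state = next_state
            if done:
                break
    return q_table


q_table = train()
policy = [greedy(row) for row in q_table[:GOAL]]
print(policy)  # learned action for each non-goal state
```

The action is chosen by the epsilon-greedy rule, but the target always uses `max` over the next state's values. That mismatch is the off-policy property.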

Off-policy vs On-policy

The difference is subtle:

  • Off-policy: using a different policy for acting (inference) and updating (training).

For instance, with Q-Learning, the epsilon-greedy policy (the acting policy) is different from the greedy policy that is used to select the best next-state action value to update our Q-value (the updating policy).

  • On-policy: using the same policy for acting and updating.

For instance, with Sarsa, another value-based algorithm, the epsilon-greedy policy selects the next state-action pair, not a greedy policy.
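
The difference shows up only in how the target is computed. The sketch below puts the two targets side by side; the function names and the example table are assumptions, and `gamma` is the discount factor.

```python
def q_learning_target(q_table, reward, next_state, gamma):
    """Off-policy: bootstrap from the best action in the next state."""
    return reward + gamma * max(q_table[next_state])


def sarsa_target(q_table, reward, next_state, next_action, gamma):
    """On-policy: bootstrap from the action the acting policy actually chose."""
    return reward + gamma * q_table[next_state][next_action]


# Hypothetical table: 2 states, 2 actions.
q_table = [[0.0, 0.0], [1.0, 3.0]]

# Suppose the epsilon-greedy policy explored and picked action 0 in state 1.
print(q_learning_target(q_table, reward=0.0, next_state=1, gamma=0.5))  # 1.5
print(sarsa_target(q_table, reward=0.0, next_state=1, next_action=0, gamma=0.5))  # 0.5
```

When the acting policy happens to pick the greedy action, both targets are identical. They differ only on exploratory steps.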
