Q-Learning is a **model-free, off-policy** reinforcement learning algorithm.
- **Model-free** means the agent learns the optimal policy from observed rewards and actions without explicitly modelling the environment's transition function $\mathcal{T}(s,a)=P(s^{\prime}|s,a)$.
- **Off-policy** means the agent follows a separate **behaviour policy** $\mu$ during training (for exploration) that differs from the **optimal policy** $\pi$ it is learning.
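The gap between the two policies can be as small as this minimal sketch, where `q_row` is a hypothetical row of Q-values for a single state:

```python
import numpy as np

rng = np.random.default_rng(0)
q_row = np.array([0.1, 0.9, 0.3])  # hypothetical Q-values for one state

def behaviour_policy():
    # mu: explores, e.g. by sampling an action uniformly at random
    return int(rng.integers(len(q_row)))

def target_policy():
    # pi: exploits the learned values greedily
    return int(np.argmax(q_row))
```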
At the core of Q-Learning is the **Q-table**, which stores the **quality** for every possible state/action pair, denoted $Q(s, a)$. One can think of the Q-table as the agent's knowledge, quantifying the maximum expected future return for taking a specific action in a specific state.
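For a small discrete environment, the Q-table is just a 2-D array; the sizes below match the Taxi environment used later:

```python
import numpy as np

n_states, n_actions = 500, 6  # Taxi's discrete state and action counts
q_table = np.zeros((n_states, n_actions))  # one quality estimate per (s, a) pair
```

Initialising every entry to zero means the agent starts with no knowledge and fills in values as it observes rewards.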
Using the Q-table, the optimal policy is simply to choose the action with the highest quality for the current state:
$\pi(s)=\arg \max_{a}{Q(s, a)}$
```python
import numpy as np

class QLearningAgent:
    def get_optimal_action(self, state):
        return np.argmax(self._q_table[state, :])
```
## Bellman Optimality Equation (Simplified)
The Bellman Optimality Equation is the mechanism for updating the Q-table: it recursively defines the optimal state-action value in terms of the optimal value of the successor state.
The base case is a **terminating state** $s$, where the Q-value is defined to be zero, since no further reward can be collected.
$Q(s,a) = 0$
```python
class QLearningAgent:
    def learn_termination(self, state):
        self._q_table[state, :] = 0
```
For all non-terminating states, a simplified version for a **deterministic** environment and un-discounted returns is
$Q(s,a)=R(s,a)+\max_{a'}{Q(s', a')}$
where $R(s,a)$ is the **step reward** of performing action $a$ in state $s$.
```python
class QLearningAgent:
    def learn(self, state, action, new_state, step_reward):
        # Keep the best estimate seen so far; safe because the environment
        # is deterministic and returns are un-discounted.
        self._q_table[state, action] = max(
            self._q_table[state, action],
            step_reward + np.max(self._q_table[new_state, :]),
        )
```
## Training
Training aims to make the Q-table converge to the true optimal values. To do this, we let the agent explore its actions randomly.
A simple recursive implementation is as follows.
```python
def train(agent: QLearningAgent, state, info):
    action = env.action_space.sample(info["action_mask"])
    new_state, step_reward, terminated, truncated, new_info = env.step(action)
    if terminated:
        agent.learn_termination(new_state)
    if not terminated and not truncated:
        train(agent, new_state, new_info)
    agent.learn(state, action, new_state, step_reward)
```
### Memory Efficient Training
The primary advantage of the recursive backtracking approach is that when $Q(s,a)$ is updated, the successor value $Q(s', a')$ has already been updated with the most accurate information. However, this requires memory to store all the states and actions leading to the terminating state.
A memory-efficient approach updates $Q(s,a)$ based on potentially stale information from the successor state. This trade-off leads to less efficient Q-value propagation, requiring more training for convergence.
This implementation is only for conceptual purposes and assumes tail-call optimisation, which Python does not perform. A practical memory-efficient trainer in Python should be implemented with an iterative loop.
```python
def train(agent: QLearningAgent, state, info):
    action = env.action_space.sample(info["action_mask"])
    new_state, step_reward, terminated, truncated, new_info = env.step(action)
    if terminated:
        agent.learn_termination(new_state)
    agent.learn(state, action, new_state, step_reward)
    if not terminated and not truncated:
        train(agent, new_state, new_info)
```
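To make the iterative version concrete, here is a self-contained sketch of a loop-based, memory-efficient trainer. The 4-state `CorridorEnv` is a hypothetical stand-in for a real Gymnasium environment (states $0\dots3$ on a line, action 0 = left, 1 = right; moving into state 3 terminates with reward $+10$, every other step costs $-1$):

```python
import numpy as np

class CorridorEnv:
    """Hypothetical deterministic corridor: walk right from state 0 to the
    terminal state 3. Moving into state 3 yields +10; any other step, -1."""

    n_states, n_actions = 4, 2  # actions: 0 = left, 1 = right

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        new_state = max(self.state - 1, 0) if action == 0 else self.state + 1
        terminated = new_state == self.n_states - 1
        self.state = new_state
        return new_state, (10 if terminated else -1), terminated

class QLearningAgent:
    def __init__(self, n_states, n_actions):
        self._q_table = np.zeros((n_states, n_actions))

    def get_optimal_action(self, state):
        return int(np.argmax(self._q_table[state, :]))

    def learn_termination(self, state):
        self._q_table[state, :] = 0

    def learn(self, state, action, new_state, step_reward):
        self._q_table[state, action] = max(
            self._q_table[state, action],
            step_reward + np.max(self._q_table[new_state, :]),
        )

def train_iterative(agent, env, rng):
    # Constant stack depth: Q(s, a) is updated immediately from the
    # successor's current (possibly stale) estimate.
    state, terminated = env.reset(), False
    while not terminated:
        action = int(rng.integers(env.n_actions))  # random behaviour policy
        new_state, step_reward, terminated = env.step(action)
        if terminated:
            agent.learn_termination(new_state)
        agent.learn(state, action, new_state, step_reward)
        state = new_state

rng = np.random.default_rng(0)
env = CorridorEnv()
agent = QLearningAgent(env.n_states, env.n_actions)
for _ in range(200):
    train_iterative(agent, env, rng)
```

After a few hundred random episodes, the greedy policy chooses "right" in every non-terminal state, even though each update only ever saw the successor's current estimate rather than its final value.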
## Evaluation Comparison
Let's compare the performance of the two training methods on the [Taxi Gymnasium environment](https://gymnasium.farama.org/environments/toy_text/taxi/).
![[q.png]]
As expected from the design trade-off, the memory efficient trainer requires more episodes to achieve convergence compared to the normal trainer.
A less obvious observation is: if the Q-table is only getting more accurate, why would the evaluation score ever decrease?
*(This is left as an exercise for the reader :))*
## Demo
![[q_demo.mp4]]
View the code in [Google Colab](https://colab.research.google.com/drive/1T93EE99ql_ju3qZaEG5BE2cGbS9je-tN?usp=sharing)
## Stochastic Environment
The original Taxi environment is deterministic; performing action $a$ in state $s$ always results in a specific next state $s'$. We can introduce stochasticity by setting parameters like:
- `is_rainy=True`. The taxi moves in the intended direction with 80% probability, but moves left or right of the target direction with an equal 10% probability in both directions.
- `fickle_passenger=True`. The passenger has a 30% chance of changing destinations once the cab has moved one square away from the pickup location.
Before modifying the agents, let's evaluate them in this new environment.
![[q2.png]]
Is it surprising that the agents can solve the stochastic environment even though they were only trained in the deterministic one?
![[q_demo2.mp4]]
What we just demonstrated is a form of [[transfer learning]]. The agent was trained in one environment (deterministic) and performed in a different one (stochastic). This doesn't always work and depends on the similarity between the two environments, which can be quantified by measuring the distance between their Markov Decision Processes (MDPs).
Let's look at a seemingly similar problem where transfer learning does not work well.
## Cliff Walking
[Cliff Walking](https://gymnasium.farama.org/environments/toy_text/cliff_walking/) is a similar problem where the agent's goal is to move from the bottom-left corner to the bottom-right corner without falling off the cliff (the bottom strip).
![[cliff_walking.gif]]
We can introduce stochasticity to this environment using the `is_slippery` setting. When set to true, the agent may occasionally move perpendicular to the intended direction. Let's see how the agent, trained only in the deterministic environment, performs in the stochastic one.
![[q3.png]]
It solves the deterministic environment quite well, finding the perfect solution.
![[q_demo3.mp4]]
However, we can intuitively tell that this perfect, cliff-hugging solution won't work in the stochastic environment because the agent is walking too close to the edge. A slip could cause the agent to fall off the cliff. The agent failed to transfer its learning to the stochastic environment.