## Policy Learning
### And Topic Wrapup
Bernhard Firner
2026-04-28
---
## Review
* Last time we zoomed through *action-utility estimation*
* Q is a function that estimates the utility/value of taking an action in a given state
* Usually written $Q(s, a)$
* A good estimate of Q can be used to craft a good policy
* $\pi(s) \leftarrow \underset{a}{\arg\max}\, Q(s, a)~ \forall s \in S$
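
A minimal sketch of the argmax above, assuming a small tabular Q stored as a dictionary. The states, actions, and values are made up for illustration:

```python
# Hypothetical tabular Q: Q[state][action] -> estimated utility.
Q = {
    "s0": {"left": 0.1, "right": 0.7},
    "s1": {"left": 0.4, "right": 0.2},
}

def greedy_policy(Q):
    """Return pi(s) = argmax_a Q(s, a) for every state in the table."""
    return {s: max(actions, key=actions.get) for s, actions in Q.items()}

pi = greedy_policy(Q)
print(pi)
```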
---
## Review Speedrun
* Just going to run through what we covered last time
* Then will just introduce the concepts of policy learning
* Focus on the key ideas
* RL has been gaining a more prominent role as we build more interactive agents
* Finish with some general remarks about deep learning
---
## Deep Learning?
* What does this have to do with deep learning?
* The states of many things we care about cannot be handled by a lookup table
* The state could be continuous
* Or the state could be so large that we are forced to estimate
* As long as adjacent states have similar utility, using a DNN as the estimator will work
---
## Using Q
* What is Q actually estimating?
* It depends upon how we train it
* In SARSA, the future reward is estimated from the next action the exploring policy actually takes, which could be any possible action
* $Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha[R + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t,a_t)]$
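
A rough sketch of one tabular SARSA step, assuming the next action `a_next` is whatever the behavior policy chose in the next state. The function name, the state and action labels, and the values of `alpha` and `gamma` are illustrative assumptions:

```python
from collections import defaultdict

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """One tabular SARSA step; a_next is the action actually chosen in s_next."""
    td_target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
    return Q[(s, a)]

Q = defaultdict(float)       # unseen (state, action) pairs start at 0
Q[("s1", "right")] = 1.0     # made-up estimate for the next state
new_value = sarsa_update(Q, "s0", "left", r=0.5, s_next="s1", a_next="right")
print(new_value)
```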
---
## Q-Learning
* That is unrealistic
* Our policy won't take *all* actions
* Q-Learning only uses the "best" action to estimate future rewards
* $Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha[R + \gamma \max_{a}Q(s_{t+1},a) - Q(s_t,a_t)]$
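
A rough sketch of the same step for tabular Q-Learning. The only change from SARSA is that the target takes the maximum over next actions. All names and numbers here are assumptions for illustration:

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-Learning step; the target uses the best next action."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q[(s, a)]

ACTIONS = ["left", "right"]
Q = defaultdict(float)
Q[("s1", "left")] = -5.0     # a bad outcome the greedy policy would avoid
Q[("s1", "right")] = 1.0
new_value = q_learning_update(Q, "s0", "left", 0.5, "s1", ACTIONS)
print(new_value)
```

The bad outcome at `("s1", "left")` never enters the target, because the max ignores it.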
---
## Advantage
* Q-Learning will learn that we can push close to a bad outcome
* As long as the policy won't actually do the bad thing, it won't be reflected in Q
* Many games (and real-world activities) require a less conservative policy than SARSA
* But what if there is no "best" action?
---
## Expected SARSA
* One way to fix SARSA is to take the expected value of future rewards
* This works even with a stochastic policy
* $Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha\left[R + \gamma \mathbb{E}_\pi[Q(s_{t+1}, a_{t+1}) \mid s_{t+1}] - Q(s_t,a_t)\right]$
* If actions are discrete, this is $Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha\left[R + \gamma \sum_{a}\pi(a|s_{t+1})Q(s_{t+1},a) - Q(s_t,a_t)\right]$
* It is the same as Q-Learning when the policy is deterministic and greedy
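
A small sketch of the discrete-action target, assuming the Q-values and policy probabilities for the next state are given as dictionaries. The numbers are made up:

```python
def expected_sarsa_target(q_next, pi_next, r, gamma=0.9):
    """R + gamma * sum_a pi(a | s_next) * Q(s_next, a)."""
    expected_q = sum(pi_next[a] * q_next[a] for a in q_next)
    return r + gamma * expected_q

q_next = {"left": -5.0, "right": 1.0}
stochastic = {"left": 0.1, "right": 0.9}   # hypothetical stochastic policy
greedy = {"left": 0.0, "right": 1.0}       # all probability on the best action

target_stochastic = expected_sarsa_target(q_next, stochastic, r=0.5)
target_greedy = expected_sarsa_target(q_next, greedy, r=0.5)
print(target_stochastic, target_greedy)
```

With the greedy probabilities the sum collapses to the maximum Q-value, which is the Q-Learning target.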
---
## Fighting Bias
* Store two estimates of the Q function, $Q_1$ and $Q_2$
* Use the other Q to estimate the value of the actions in the next state
* When one estimator is updated, it picks the best next action and the other estimator supplies the value of that action
* Both estimates still converge over time, but separating selection from evaluation removes the maximization bias, speeding learning
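
A sketch of one update in the style of Double Q-Learning, assuming tabular estimators and a coin flip to choose which one learns. Function names, states, and values are illustrative assumptions:

```python
import random
from collections import defaultdict

def double_q_update(q_update, q_other, s, a, r, s_next, actions,
                    alpha=0.1, gamma=0.9):
    """Update q_update; q_other scores the action that q_update prefers."""
    best = max(actions, key=lambda a2: q_update[(s_next, a2)])
    target = r + gamma * q_other[(s_next, best)]
    q_update[(s, a)] += alpha * (target - q_update[(s, a)])

ACTIONS = ["left", "right"]
Q1, Q2 = defaultdict(float), defaultdict(float)
Q1[("s1", "left")], Q1[("s1", "right")] = 2.0, 1.0
Q2[("s1", "left")], Q2[("s1", "right")] = 0.5, 3.0

# Flip a coin to decide which estimator learns from this transition.
if random.random() < 0.5:
    double_q_update(Q1, Q2, "s0", "left", 0.0, "s1", ACTIONS)
else:
    double_q_update(Q2, Q1, "s0", "left", 0.0, "s1", ACTIONS)
```

Each estimator's overly optimistic favorite gets scored by the other estimator, so a single lucky overestimate is less likely to propagate.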
---
## Exploration
* We need to actually try actions from different states to fill in Q
* But our initial policy won't be good
* Solution: take a random action with probability $\epsilon$
* Called an $\epsilon$-greedy policy
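
A minimal sketch of action selection under this scheme, assuming the Q-values for the current state are in a dictionary. The values and the choice of `epsilon` are made up:

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a uniformly random action, else the best one."""
    if rng.random() < epsilon:
        return rng.choice(list(q_values))
    return max(q_values, key=q_values.get)

q_values = {"left": 0.1, "right": 0.7}
rng = random.Random(0)  # seeded only so the sketch is repeatable
picks = [epsilon_greedy(q_values, 0.1, rng) for _ in range(1000)]
greedy_fraction = picks.count("right") / len(picks)
print(greedy_fraction)
```

Note that the random branch can also land on the greedy action, so the greedy action is chosen slightly more often than $1-\epsilon$ of the time.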
---
## Data Limiting
* When learning Q with a DNN, we need to carefully manage our batches
* For cartpole, we can simply keep a history, sample a subset of experiences, and stop early
* With $\epsilon\text{-greedy}$ exploration, the real policy will be robust
Cartpole after 1200 episodes.
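
One common way to keep a history and sample from it is a fixed-size replay buffer. This is a sketch under assumptions: the class name, capacity, and transition layout are hypothetical and not the code used for the cartpole result.

```python
import random
from collections import deque

class ReplayBuffer:
    """Keep only the most recent transitions and sample random minibatches."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # old entries fall off the front

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size, rng=random):
        k = min(batch_size, len(self.buffer))
        return rng.sample(list(self.buffer), k)

buf = ReplayBuffer(capacity=100)
for t in range(250):
    # Hypothetical (state, action, reward, next_state) tuple.
    buf.add((t, "action", 1.0, t + 1))
batch = buf.sample(32, rng=random.Random(0))
```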
---
## Otherwise
* If we take all samples and make a dataset, the performance will collapse
* The data becomes self-similar, and rare states are forgotten
---
## Effectiveness
* Q-Learning is great, especially when combined with rollout or Monte Carlo Tree Search
* In 2016, a Go playing program called AlphaGo beat one of the world's best players
* The state space of Go is enormous, so this showed that a DNN-based Q function was sufficient
* Combined with another network to suggest policies to explore via MCTS
* The next year, AlphaGo Zero replaced AlphaGo, using policy learning on its own
---
## Why Learn Policies?
* What should we do if the action space is continuous?
* A robot, for example
* If you can, dividing the output into discrete steps is usually easier
* Now we have a discrete problem, and Q-Learning works great
* But quantized outputs aren't always feasible
---
## Human-Level Play
* A deep Q network combined with rollout or MCTS is sufficient to reach human level play
* On Go
* On many Atari games
* Games like Pong
* It is surprisingly insufficient for a game like [Pitfall](https://ale.farama.org/environments/pitfall/)
* Notice that jumping changes the state
* So we had better try jumping in every position to see if it is good
The Atari game, Pitfall.
---
## Policy Learning Advantages
* The formulation of policy learning is simple, but making it work in practice is quite difficult
* Why use it?
* A 19x19 Go board has so many options that enumerating them is burdensome
---
## Policy Learning Advantages
* Imagine playing a few moves in advance
* We begin with $19\times 19=361$ possible first moves
* If we wanted to simulate 5 moves in advance, how many values of Q do we explore?
---
## Policy Learning Advantages
* If we wanted to simulate 5 moves in advance, how many values of Q do we explore?
* $361\times360\times359\times358\times357=\text{5,962,870,725,840}$
* Haha, no.
* Exploration will be inefficient
* Rewards can be similar and have high variance
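
The count above is the number of ordered ways to fill 5 distinct points out of 361. A quick check, assuming (as the product does) that no stones are captured and no point is reused:

```python
import math

board_points = 19 * 19                    # 361 intersections
sequences = math.perm(board_points, 5)    # 361 * 360 * 359 * 358 * 357
print(f"{sequences:,}")                   # 5,962,870,725,840
```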
---
## Exploration
* How is a policy more efficient?
* Rollout makes random plays, which is time consuming
* MCTS is better if we have a stochastic policy that can direct the search
* We can make a stochastic policy from Q-values, right?
* Randomness, with an $\epsilon\text{-greedy}$ policy, is inefficient
* And leads to bad estimates of final rewards
---
## Preferences Vs Values
* A learned policy is simply a probability distribution over all actions
* Imagine two moves have the same expected outcome, but one has higher variance
* The Q estimates would have to be the same
* But a policy may prefer the more certain choice
* A learned policy only indicates preferences, not numerical superiority
* This could be a simpler function to learn than one that predicts exact rewards
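
One way to see "preferences, not values" is a softmax over per-action preference scores. The softmax parameterization and the numbers are assumptions for illustration, since the slides do not commit to a particular form:

```python
import math

def softmax(prefs):
    """Turn preference scores into a probability distribution over actions."""
    m = max(prefs.values())                      # subtract max for stability
    exps = {a: math.exp(p - m) for a, p in prefs.items()}
    total = sum(exps.values())
    return {a: e / total for a, e in exps.items()}

prefs = {"safe": 2.0, "risky": 1.0}
shifted = {a: p + 100.0 for a, p in prefs.items()}  # same ordering, new scale

pi = softmax(prefs)
pi_shifted = softmax(shifted)
print(pi, pi_shifted)
```

Adding a constant to every preference leaves the policy unchanged, so the scores do not need to match any actual reward.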
---
## Strong Policy Exploration
* A policy directly assigns probabilities to actions
* As long as a better move has a higher probability than a worse one, the exact numbers don't matter
* These preferences direct search into useful actions, making exploration efficient
* Since exploration is also used during training, this makes training efficient
* It is a virtuous cycle
---
## Gradient Policy Updates
* Updates depend upon two things:
* The probability of the policy choosing an action
* The reward observed from that action
* The basic update equation starts with this:
* $\theta \leftarrow \theta + \alpha \cdot \frac{\partial \log \pi(a|s_t, \theta)}{\partial \theta}G_t$
---
## Policy Update
* $\theta \leftarrow \theta + \alpha \cdot \frac{\partial \log \pi(a|s_t, \theta)}{\partial \theta}G_t$
* This says that we update the parameters along the gradient of the log likelihood of the action, scaled by the return $G_t$
* Actions with high positive returns should become more likely
* Actions with high negative returns should become less likely
* Sounds fine, but won't work without some help
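
A toy sketch of this update on a two-armed bandit (a single state), assuming a softmax policy whose parameters are one preference per action. For that policy the gradient of the log probability is $1 - \pi_k$ for the chosen action and $-\pi_k$ for the others. The bandit rewards and step size are made up:

```python
import math
import random

def softmax(theta):
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_step(theta, action, G, alpha):
    """theta <- theta + alpha * grad log pi(action) * G for a softmax policy."""
    pi = softmax(theta)
    return [t + alpha * ((1.0 if k == action else 0.0) - p) * G
            for k, (t, p) in enumerate(zip(theta, pi))]

one_step = reinforce_step([0.0, 0.0], action=1, G=1.0, alpha=0.1)

rng = random.Random(0)
theta = [0.0, 0.0]
mean_reward = [0.2, 1.0]          # hypothetical bandit: arm 1 pays more
for _ in range(2000):
    pi = softmax(theta)
    action = rng.choices([0, 1], weights=pi)[0]
    G = mean_reward[action] + rng.gauss(0.0, 0.1)   # noisy return
    theta = reinforce_step(theta, action, G, alpha=0.1)
final = softmax(theta)
print(final)
```

This toy problem is easy enough to learn without help. The variance issue shows up once returns are noisy and delayed.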
---
## Dealing With Variance
* Imagine training a Go playing program from scratch
* The rewards will be all over the place
* That will make gradient descent difficult
* So we can learn with the *advantage* of an action instead of its actual return
* The *advantage* is how surprised we are by the reward
---
## Advantage
* Intuitively, if we are taking an action and we get the expected reward, then we do not need to learn anything
* We estimate the advantage using a deep Q network
* The loss becomes the squared difference between the predicted and observed reward, multiplied by the action probability
* This doesn't solve *all* of our variance problems
* And we can quickly go down a rabbit hole of exponential moving averages, Actor-Critic models, and other solutions
* But you get the general idea, so let's leave this topic here
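
A tiny sketch of the "surprise" idea, using a running-average baseline as a stand-in for the learned estimator. The baseline rule, `beta`, and the returns are assumptions for illustration:

```python
def advantage(G, baseline):
    """How much better (or worse) the return was than expected."""
    return G - baseline

def update_baseline(baseline, G, beta=0.1):
    """Exponential moving average of observed returns."""
    return baseline + beta * (G - baseline)

returns = [10.0, 10.0, 10.0, 12.0]   # made-up episode returns
baseline = 10.0                      # what we currently expect
advantages = []
for G in returns:
    advantages.append(advantage(G, baseline))
    baseline = update_baseline(baseline, G)
print(advantages, baseline)
```

The first three returns match the expectation, so their advantage is zero and they would produce no policy update. Only the surprising return carries a learning signal.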
---
## Your Takeaway
* Reinforcement learning is a very active area of research
* Right now, many policy learning techniques underwhelm
* Especially compared to the success of Q-Learning
* Learning in latent spaces
* Learning on a continuous space with continuous actions is difficult
* Operating on a compressed latent representation is more feasible
---
## Some Links
* If you are interested, here are some papers/videos:
* DeepMind's Dreamer 4 mines diamonds in Minecraft:
* [Dreamer 4 video](https://youtu.be/oDlBtTcX0g0?si=64xF-EEQc36XFy7k)
* [Dreamer 4 site](https://danijar.com/project/dreamer4/)
* Plans in the latent space, but must be trained on actions
* New work, [https://dino-wm.github.io](https://dino-wm.github.io) and [LeWorldModel](https://arxiv.org/abs/2603.19312), attempt to give a DNN an image of the desired outcome
* A trajectory is found through the latent space, Z, that could accomplish it
---
## Discussion
* Next class will be a review of problems
* But we should close with some high-level discussion of AI
* RL (and similar techniques) are important, because current AI systems do not plan
* You should all be aware that an LLM is just a token predictor
---
## Open or Closed?
* Was the shop closed when I took the photo?
---
## According to ChatGPT
---
## According to Gemini
---
## Probing Gemini
---
## Nonsense
* The fact that signs point out is noted and then ignored
* Notice that the pull handle is on both sides
* There is no reflection on the glass, despite claims otherwise
---
## Probing Gemini
---
## DL Problems
* Deep learning makes predictions based upon sample statistics
* I've tested all of our exams on LLMs
* When there are multiple answers available, they like to pick the "common" sounding one
* Basically, if everyone likes to answer a question incorrectly, then so will an LLM
---
## One Simple Trick
* Even with CS211, it is easy to trip up LLMs
* Use this one simple trick to fool AI!
* Make three answers that mean the same thing, but one form is more common
* Make the correct answer "all of the above"
* LLMs love to pick the common phrasing
---
## Statistics
* Absent an algorithm like rollout or MCTS (or the latent space exploration research), DNNs are just statistical models
* LLMs use policy search to produce "better" outputs
* But evaluating those better outputs is still up to a person
---
## 99% Reliable
* One of AI's biggest hurdles is that 99% reliable is terrible
* If I put you into a robotaxi and tell you that it is 99% safe, should you jump out of the window?
---
## 99% Reliable
* One of AI's biggest hurdles is that 99% reliable is terrible
* If I put you into a robotaxi and tell you that it is 99% safe, should you jump out of the window?
### Yes!!!
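
A back-of-the-envelope sketch of why, assuming "99% safe" means a 99% chance that each trip goes fine and that trips are independent (both are assumptions, not claims about any real system):

```python
def prob_no_failure(per_trip_reliability, trips):
    """Chance of zero failures across independent trips."""
    return per_trip_reliability ** trips

p100 = prob_no_failure(0.99, 100)
print(f"{p100:.3f}")   # roughly 0.366
```

Under those assumptions, a regular rider has only about a one in three chance of getting through 100 trips without an incident.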
---
## Reliability
* It is too easy to miss rare data
* And we are easily fooled by our own biases
* For example, many studies are done on the people who live around universities
* Why? Because that is where the graduate students who do the studies put up posters
---
## Rare Cases
* Rare events are, by definition, rare
---
* We want confidence that our models won't behave in unexpected ways
* But without test data, how do we know?
* I've seen roadside fires, flying tires, paper-coated roadways, and hundreds of meters of unspooling toilet paper
* But I've never decided to crash my car
* Humans *plan* and evaluate potential outcomes; we can build that into AI, but it is difficult
---
## Explainability $\neq$ Reliability
* Being able to interpret a model's outputs is a good start
* It allows us to go back and fix problems
* But knowing why it did something bad doesn't mean it won't do it
* Building a reliable system takes huge effort
* Collecting data, cleaning data, testing the system, searching for failures
* Failure to do so leads to mistakes and misrepresentations
---
## Example
* The UDL book lists an example of flawed work in section 21.4
* Several papers described using an AI to detect if someone was gay from a photograph
* The claim was that something in the face identified sexuality
* But it turned out that the dataset was highly biased, and contained other context clues
* The authors sensationalized their results rather than rigorously testing them
---
## Not just AI
* I don't want to pick on AI
* Mistakes (intentional or otherwise) pop up all the time in research
* A great example was the study of magnetic field sensing in insects
---
## Magnetic Field Sensing
* Birds may have the ability to sense magnetic fields
* But studying them is difficult
* So scientists searched for an easier model organism
* Eventually they converged upon fruit flies (Drosophila)
* And people did studies on insect magnetic field sensing for years!
---
## It doesn't exist
* In 2023, "[No evidence for magnetic field effects on the behaviour of Drosophila](https://www.nature.com/articles/s41586-023-06397-7)" was published
* The authors tested 97,658 flies in a maze and 10,960 on an escape task
* No evidence of magnetic sensitivity was found
* It is easy to start a misconception, and difficult to kill one
---
## Important for AI
* That being said, any of you could go out and create the next awesome DNN
* Please do!
* Deep learning is both powerful and fun!
* But it can easily fool you into thinking it is flawless.
* The most difficult part is the rigorous work of evaluating a model for bias and accuracy
---
## Next Class
* Next class will be a review of our major topics
* The recitations will go over the most commonly incorrect questions from the exams
* You should expect to see similar versions of these again!