# CS 462 - Lecture 24

<table class="clean">
<tr class="clean"><td class="e"> $\leftarrow$ </td> <td > $\longleftarrow$ </td> <td class="a d"></td></tr>
<tr class="clean"><td class="b"> $\rightarrow$ </td> <td class="c"> $\rightarrow$ </td> <td class="a d"> $\rightarrow$ </td></tr>
</table>

## Policy Learning

### And Topic Wrapup

Bernhard Firner

2026-04-28

---

## Review

* Last time we zoomed through *action-utility estimation*
  * Q is a function that estimates the utility/value of taking an action in a given state
  * Usually written $Q(s, a)$
* A good estimate of Q can be used to craft a good policy
  * $\pi(s) \leftarrow \underset{a}{argmax} Q(s, a)~ \forall s \in S$
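
As a minimal sketch of the argmax rule above, assuming a small tabular Q stored as nested dictionaries (the states, actions, and values are made up for illustration):

```python
# Hypothetical toy Q-table: Q[state][action] -> estimated value.
Q = {
    "s0": {"left": 0.1, "right": 0.7},
    "s1": {"left": 0.4, "right": 0.2},
}

def greedy_policy(q_table):
    """Return pi(s) = argmax_a Q(s, a) for every state in the table."""
    return {s: max(actions, key=actions.get) for s, actions in q_table.items()}

pi = greedy_policy(Q)
print(pi)
```

With a DNN instead of a table, the same idea applies: evaluate Q for each action and pick the largest.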

---

## Review Speedrun

* Just going to run through what we covered last time
* Then we will just introduce the concepts of policy learning
* Focus on the key ideas
  * RL has been gaining a more prominent role as we build more interactive agents
* Finish with some general remarks about deep learning

---

## Deep Learning?

* What does this have to do with deep learning?
* The states of many things we care about cannot be handled by a lookup table
  * The state could be continuous
  * Or the state could be so large that we are forced to estimate
* As long as adjacent states have similar utility, using a DNN as the estimator will work

---

## Using Q

* What is Q actually estimating?
* It depends upon how we train it
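* In SARSA, the future reward is estimated from the next action actually taken, $a_{t+1}$, so with exploration every possible action eventually feeds into the estimate
  * $Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha[R + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t,a_t)]$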
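
A minimal tabular sketch of the SARSA update, assuming Q is a dictionary keyed by `(state, action)` and that the next action is the one the agent actually takes. The state names and the values of `alpha` and `gamma` are illustrative assumptions:

```python
from collections import defaultdict

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """One tabular SARSA step: the target uses the action actually taken next."""
    td_target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
    return Q[(s, a)]

Q = defaultdict(float)          # unseen (state, action) pairs start at 0
Q[("s1", "right")] = 1.0        # pretend we already learned this value
new_value = sarsa_update(Q, "s0", "right", r=0.5, s_next="s1", a_next="right")
```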

---

## Q-Learning

* That is unrealistic
  * Our policy won't take *all* actions
* Q-Learning only uses the "best" action to estimate future rewards
* $Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha[R + \gamma \underset{a}{max}Q(s_{t+1},a) - Q(s_t,a_t)]$
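
The same kind of tabular sketch for Q-Learning, where the target takes the max over next actions regardless of what the agent does next. The dictionary layout, action names, and hyperparameter values are assumptions for illustration:

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-Learning step: the target uses the best next action."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q[(s, a)]

ACTIONS = ["left", "right"]
Q = defaultdict(float)
Q[("s1", "left")] = -1.0        # a bad action the greedy policy will avoid
Q[("s1", "right")] = 2.0
new_value = q_learning_update(Q, "s0", "left", r=0.0, s_next="s1", actions=ACTIONS)
```

Note that the bad value of `("s1", "left")` has no effect on the update, because only the best next action is used.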

---

## Advantage

* Q-Learning will learn that we can push close to a bad outcome
  * As long as the policy won't actually do the bad thing, it won't be reflected in Q
* Many games (and real-world activities) require a less conservative policy than the one SARSA learns
* But what if there is no "best" action?

---

## Expected SARSA

* One way to fix SARSA is to take the expected value of future rewards
  * This works even with a stochastic policy
* $Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha\left[R + \gamma \mathbb{E}_\pi[Q(s_{t+1}, a_{t+1}) | s_{t+1}] - Q(s_t,a_t)\right]$
  * If actions are discrete, this is $Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha\left[R + \gamma \sum_{a}\pi(a|s_{t+1})Q(s_{t+1},a) - Q(s_t,a_t)\right]$
* With a deterministic greedy policy, it is the same as Q-Learning
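
A sketch of the discrete-action form, assuming the policy is given as a table of probabilities `pi[state][action]` (a hypothetical layout chosen for this example):

```python
from collections import defaultdict

def expected_sarsa_update(Q, pi, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One tabular Expected SARSA step for discrete actions."""
    # Weight each next-state value by how likely the policy is to choose it.
    expected_next = sum(pi[s_next][a2] * Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * expected_next - Q[(s, a)])
    return Q[(s, a)]

ACTIONS = ["left", "right"]
Q = defaultdict(float)
Q[("s1", "right")] = 2.0
pi = {"s1": {"left": 0.25, "right": 0.75}}   # stochastic policy in state s1
new_value = expected_sarsa_update(Q, pi, "s0", "left", r=1.0, s_next="s1", actions=ACTIONS)
```

If `pi` put all of its probability on the best action, `expected_next` would equal the max used by Q-Learning.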

---

## Fighting Bias

* Store two estimates of the Q function, $Q_1$ and $Q_2$
* Use the other Q to estimate the value of the actions in the next state

<p style="font-size: 70%">

$$\begin{aligned}
Q_1(s_t,a_t) & \leftarrow Q_1(s_t,a_t) + \alpha\left[R + \gamma Q_2(s_{t+1}, \underset{a}{argmax}Q_1(s_{t+1},a)) - Q_1(s_t,a_t)\right] \\
Q_2(s_t,a_t) & \leftarrow Q_2(s_t,a_t) + \alpha\left[R + \gamma Q_1(s_{t+1}, \underset{a}{argmax}Q_2(s_{t+1},a)) - Q_2(s_t,a_t)\right]
\end{aligned}$$

</p>

* Each estimator picks the next action with its own argmax, then consults the other estimator for that action's value
* Both estimates still converge over time, but this reduces the overestimation bias, speeding learning
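
A tabular sketch of one of the two updates, assuming dictionary-based Q tables. The helper name `double_q_step` and the example values are hypothetical. In practice one would typically choose at random which estimator to update on each step, which is an assumption not stated on the slide:

```python
from collections import defaultdict

def double_q_step(q_update, q_other, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Update q_update in place; q_other only supplies the next-state value."""
    # The estimator being updated chooses the action...
    best_next = max(actions, key=lambda a2: q_update[(s_next, a2)])
    # ...but the other estimator says how much that action is worth.
    target = r + gamma * q_other[(s_next, best_next)]
    q_update[(s, a)] += alpha * (target - q_update[(s, a)])
    return q_update[(s, a)]

ACTIONS = ["left", "right"]
Q1, Q2 = defaultdict(float), defaultdict(float)
Q1[("s1", "right")] = 5.0   # Q1 overestimates this action
Q2[("s1", "right")] = 1.0   # Q2 has a more modest estimate
new_q1 = double_q_step(Q1, Q2, "s0", "left", r=0.0, s_next="s1", actions=ACTIONS)
```

Here the overestimate in `Q1` selects the action, but the smaller value from `Q2` is what gets backed up.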

---

## Exploration

* We need to actually try actions from different states to fill in Q
* But our initial policy won't be good
* Solution: take a random action with probability $\epsilon$
  * Called an $\epsilon$-greedy policy
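
A minimal sketch of $\epsilon$-greedy action selection, assuming the Q-values for the current state are available as a dictionary (the action names and values are illustrative):

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """q_values maps each action to Q(s, a) for the current state."""
    if rng.random() < epsilon:
        # Explore: ignore Q and pick uniformly at random.
        return rng.choice(list(q_values))
    # Exploit: pick the action with the highest estimate.
    return max(q_values, key=q_values.get)

q_values = {"left": 0.1, "right": 0.9}
rng = random.Random(0)  # seeded so the run is repeatable
picks = [epsilon_greedy(q_values, epsilon=0.1, rng=rng) for _ in range(1000)]
greedy_fraction = picks.count("right") / len(picks)
```

With two actions and $\epsilon = 0.1$, the greedy action should be chosen roughly 95% of the time, since half of the random picks also land on it.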

---

## Data Limiting

* When learning Q with a DNN, we need to carefully manage our batches
* For cartpole, we can simply keep a history, sample a subset of experiences, and stop early
  * With $\epsilon\text{-greedy}$ exploration, the real policy will be robust

<div class="col">
<img style="width: 40%" class="r-stretch" src="./figures/cartpole-q-learning12ksteps.gif" />
<br/>
<small>Cartpole after 1200 episodes.</small>
</div>
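
One way to "keep a history and sample a subset of experiences" is a bounded replay buffer. This is a sketch under assumptions: the class name, tuple layout, and capacity are illustrative, not the exact setup used for the cartpole demo:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size history of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity):
        # A bounded deque silently drops the oldest experience when full.
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        # Sample without replacement; never ask for more than is stored.
        k = min(batch_size, len(self.buffer))
        return rng.sample(list(self.buffer), k)

    def __len__(self):
        return len(self.buffer)

history = ReplayBuffer(capacity=3)
for step in range(5):
    history.add(state=step, action=0, reward=1.0, next_state=step + 1, done=False)
batch = history.sample(batch_size=2, rng=random.Random(0))
```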

---

## Otherwise

* If we take all samples and make a dataset, the performance will collapse
* The data becomes self-similar, and rare states are forgotten

---

## Effectiveness

* Q-Learning is great, especially when combined with rollout or Monte Carlo Tree Search
* In 2016, a Go-playing program called AlphaGo beat one of the world's best players
  * The state space of Go is enormous, so this showed that a DNN-based Q function was sufficient
  * Combined with another network to suggest policies to explore via MCTS
* The next year, AlphaGo Zero replaced AlphaGo, using policy learning on its own

---

## Why Learn Policies?

* What should we do if the action space is continuous?
  * A robot, for example
* If you can, dividing the output into discrete steps is usually easier
  * Now we have a discrete problem, and Q-Learning works great
* But quantized outputs aren't always feasible
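
A small sketch of dividing a continuous output into discrete steps. The torque range and bin count are made-up values for illustration:

```python
def discretize_actions(low, high, n_bins):
    """Evenly spaced action values covering [low, high], endpoints included."""
    if n_bins < 2:
        raise ValueError("need at least two bins")
    step = (high - low) / (n_bins - 1)
    return [low + i * step for i in range(n_bins)]

# Hypothetical example: a motor torque in [-1, 1] split into 5 choices.
torques = discretize_actions(-1.0, 1.0, 5)
print(torques)
```

Each entry can then be treated as one discrete action for Q-Learning. Finer control needs more bins, and the number of actions multiplies across every output dimension.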

---

## Human-Level Play

* A deep Q network combined with rollout or MCTS is sufficient to reach human-level play
  * On Go
  * On many Atari games
  * Games like Pong
* It is surprisingly insufficient for a game like [Pitfall](https://ale.farama.org/environments/pitfall/)
* Notice that jumping changes the state
  * So we had better try jumping in every position to see if it is good

<div class="col">

<img style="width: 55%" class="r-stretch" src="./figures/pitfall.gif" />
<br/>
<small>The Atari game, Pitfall.</small>

</div>

---

## Policy Learning Advantages

* Policy learning is simple to formulate, but quite difficult in practice
* Why use it?
* A 19x19 Go board has so many options that enumerating them is burdensome

<div class="col">
<img style="width: 70%" class="r-stretch" src="./figures/go_example.png" />
</div>

---

## Policy Learning Advantages

* Imagine playing a few moves in advance
  * We begin with $19\times 19=361$ possible first moves
* If we wanted to simulate 5 moves in advance, how many values of Q do we explore?

<div class="col">
<img style="width: 70%" class="r-stretch" src="./figures/go_example.png" />
</div>

---

## Policy Learning Advantages

* If we wanted to simulate 5 moves in advance, how many values of Q do we explore?
  * $361\times360\times359\times358\times357=\text{5,962,870,725,840}$
  * Haha, no.
* Exploration will be inefficient
  * Rewards can be similar and have high variance

<div class="col">
<img style="width: 70%" class="r-stretch" src="./figures/go_example.png" />
</div>
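
The five-move count can be checked with a quick calculation. This simply counts ordered sequences of five distinct board points, ignoring captures and other rules of the game:

```python
import math

# Ordered choices of 5 distinct points out of 361: 361 * 360 * 359 * 358 * 357
sequences = math.perm(361, 5)
print(f"{sequences:,}")  # 5,962,870,725,840
```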

---

## Exploration

* How is a policy more efficient?
* Rollout makes random plays, which is time consuming
  * MCTS is better if we have a stochastic policy that can direct the search
  * We can make a stochastic policy from Q-values, right?
* Randomness, with an $\epsilon\text{-greedy}$ policy, is inefficient
  * And leads to bad estimates of final rewards
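
One common way to turn Q-values into a stochastic policy is a softmax with a temperature. This is an illustrative assumption, not necessarily the construction intended on the slide:

```python
import math

def softmax_policy(q_values, temperature=1.0):
    """Map {action: Q(s, a)} to {action: probability} using a softmax."""
    # Subtracting the max keeps exp() from overflowing without changing the result.
    m = max(q_values.values())
    exps = {a: math.exp((q - m) / temperature) for a, q in q_values.items()}
    total = sum(exps.values())
    return {a: e / total for a, e in exps.items()}

q = {"a": 1.0, "b": 2.0, "c": 2.0}
probs = softmax_policy(q)
sharper = softmax_policy(q, temperature=0.1)
```

Lower temperatures concentrate probability on the highest-valued actions. The probabilities depend on the exact Q-values, so errors in Q carry straight into the search.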

---

## Preferences Vs Values

* A learned policy is simply a probability distribution over all actions
* Imagine two moves have the same expected outcome, but one has higher variance
  * The Q estimates would have to be the same
  * But a policy may prefer the more certain choice
* A learned policy only indicates preferences, not numerical superiority
  * This could be a simpler function to learn than one that predicts exact rewards

---

## Strong Policy Exploration

* A policy directly assigns probabilities to actions
* As long as a better move has a higher probability than a worse one, the exact numbers don't matter
  * These preferences direct search into useful actions, making exploration efficient
  * Since exploration is also used during training, this makes training efficient
* It is a virtuous cycle

---

## Gradient Policy Updates

* Updates depend upon two things:
  * The probability of the policy choosing an action
  * The reward observed from that action
* The basic update equation starts with this:
  * $\theta \leftarrow \theta + \alpha \cdot \frac{\partial \log \left[ \pi(a_t|s_t, \theta) \right]}{\partial \theta}G_t$

---

## Policy Update

* $\theta \leftarrow \theta + \alpha \cdot \frac{\partial \log \left[ \pi(a_t|s_t, \theta) \right]}{\partial \theta}G_t$
* This says that we move the parameters along the gradient of the log-likelihood of the action taken, scaled by the observed return $G_t$
  * Actions with high positive returns should become more likely
  * Actions with high negative returns should become less likely
* Sounds fine, but won't work without some help
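
A minimal sketch of this update for a single state, assuming the policy is a softmax over one learnable preference per action. That parameterization is chosen here only because its log-gradient has a simple closed form:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_step(theta, action, G, alpha=0.1):
    """theta: list of action preferences (logits) for a single state."""
    probs = softmax(theta)
    # For a softmax policy: d/d theta_k log pi(action) = 1[k == action] - pi(k)
    grad_log_pi = [(1.0 if k == action else 0.0) - p for k, p in enumerate(probs)]
    return [t + alpha * G * g for t, g in zip(theta, grad_log_pi)]

theta = [0.0, 0.0]                                   # both actions equally likely
theta_pos = reinforce_step(theta, action=0, G=+1.0)  # good outcome after action 0
theta_neg = reinforce_step(theta, action=0, G=-1.0)  # bad outcome after action 0
```

A positive return raises the probability of action 0 and a negative return lowers it. The step size scales directly with $G_t$, which is why noisy returns cause trouble.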

---

## Dealing With Variance

* Imagine training a Go playing program from scratch
* The rewards will be all over the place
  * That will make gradient descent difficult
* So we can learn with the *advantage* of an action instead of its actual return
  * The *advantage* is how surprised we are by the reward

---

## Advantage

* Intuitively, if we are taking an action and we get the expected reward, then we do not need to learn anything
* We estimate the advantage using a deep Q network
* The loss becomes the squared difference between the prediction and the reward, multiplied by the action probability
* This doesn't solve *all* of our variance problems
  * And we can quickly go down a rabbit hole of exponential moving averages, Actor-Critic models, and other solutions
* But you get the general idea, so let's leave this topic here
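
A sketch of the idea in code, using one common actor-critic style arrangement. The function name and the split into a policy term and a critic term are assumptions, and may differ from the exact loss described above:

```python
import math

def actor_critic_losses(action_prob, observed_return, predicted_value):
    """Return (advantage, policy_loss, critic_loss) for a single step."""
    # Advantage: how much better or worse the outcome was than predicted.
    advantage = observed_return - predicted_value
    # Policy term: push up the log-probability of actions with positive advantage.
    policy_loss = -math.log(action_prob) * advantage
    # Critic term: squared error of the value prediction.
    critic_loss = advantage ** 2
    return advantage, policy_loss, critic_loss

as_expected = actor_critic_losses(action_prob=0.5, observed_return=1.0, predicted_value=1.0)
surprise = actor_critic_losses(action_prob=0.5, observed_return=3.0, predicted_value=1.0)
```

When the observed return matches the prediction, both terms are zero and nothing is learned, which matches the intuition in the first bullet.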

---

## Your Takeaway

* Reinforcement learning is a very active area of research
* Right now, many policy learning techniques underwhelm
  * Especially compared to the success of Q-Learning
* Learning in latent spaces
  * Learning on a continuous space with continuous actions is difficult
  * Operating on a compressed latent representation is more feasible

---

## Some Links

* If you are interested, here are some papers/videos:
* DeepMind's Dreamer 4 mines diamonds in Minecraft:
  * [Dreamer 4 video](https://youtu.be/oDlBtTcX0g0?si=64xF-EEQc36XFy7k)
  * [Dreamer 4 site](https://danijar.com/project/dreamer4/)
  * Plans in the latent space, but must be trained on actions
* Newer work, [DINO-WM](https://dino-wm.github.io) and [LeWorldModel](https://arxiv.org/abs/2603.19312), attempts to give a DNN an image of the desired outcome
  * A trajectory is found through the latent space, Z, that could accomplish it

---

## Discussion

* Next class will be a review of problems
  * But we should close with some high-level discussion of AI
* RL (and similar techniques) are important, because current AI systems do not plan
  * You should all be aware that an LLM is just a token predictor

---

## Open or Closed?

* Was the shop closed when I took the photo?


---

## According to ChatGPT


---

## According to Gemini


---

## Probing Gemini


---

## Nonsense

* The fact that signs point out is noted and then ignored
* Notice that the pull handle is on both sides
* There is no reflection on the glass, despite claims otherwise


---

## Probing Gemini


---

## DL Problems

* Deep learning makes predictions based upon sample statistics
* I've tested all of our exams on LLMs
  * When there are multiple answers available, they like to pick the "common" sounding one
* Basically, if everyone likes to answer a question incorrectly, then so will an LLM

---

## One Simple Trick

* Even with CS211, it is easy to trip up LLMs
* Use this one simple trick to fool AI!
  * Make three answers that mean the same thing, but one form is more common
  * Make the correct answer all of the above
  * LLMs love to pick the common phrasing

---

## Statistics

* Absent an algorithm like rollout or MCTS (or the latent space exploration research), DNNs are just statistical models
* LLMs use policy search to produce "better" outputs
  * But evaluating those better outputs is still up to a person

---

## 99% Reliable

* One of AI's biggest hurdles is that 99% reliable is terrible
* If I put you into a robotaxi and tell you that it is 99% safe, should you jump out of the window?

### Yes!!!

---

## Reliability

* It is too easy to miss rare data
  * And we are easily fooled by our own biases
* For example, many studies are done on the people who live around universities
* Why? Because that is where the graduate students who do the studies put up posters

---

## Rare Cases

* Rare events are, by definition, rare

---

* We want confidence that our models won't behave in unexpected ways
  * But without test data, how do we know?
* I've seen roadside fires, flying tires, paper-coated roadways, and hundreds of meters of unspooling toilet paper
  * But I've never decided to crash my car
* Humans *plan* and evaluate potential outcomes; we can build that into AI, but it is difficult

---

## Explainability $\neq$ Reliability

* Being able to interpret a model's outputs is a good start
  * It allows us to go back and fix problems
* But knowing why it did something bad doesn't mean it won't do it again
* Getting a reliable system takes a huge effort
  * Collecting data, cleaning data, testing the system, searching for failures
  * Failure to do so leads to mistakes and misrepresentations

---

## Example

* The UDL book lists an example of flawed work in section 21.4
  * Several papers described using an AI to detect if someone was gay from a photograph
* The claim was that something in the face identified sexuality
* But it turned out that the dataset was highly biased, and contained other context clues
  * The authors sensationalized their results rather than rigorously testing them

---

## Not just AI

* I don't want to pick on AI
* Mistakes (intentional or otherwise) pop up all the time in research
* A great example was the study of magnetic field sensing in insects

---

## Magnetic Field Sensing

* Birds may have the ability to sense magnetic fields
  * But studying them is difficult
* So scientists searched for an easier model organism
* Eventually they converged upon fruit flies (Drosophila)
  * And people did studies on insect magnetic field sensing for years!

---

## It doesn't exist

* In 2023, "[No evidence for magnetic field effects on the behaviour of Drosophila](https://www.nature.com/articles/s41586-023-06397-7)" was published
* The authors tested 97,658 flies in a maze and 10,960 on an escape task
  * No evidence of magnetic sensitivity was found
* It is easy to start a misconception, and difficult to kill one

---

## Important for AI

* That being said, any of you could go out and create the next awesome DNN
  * Please do!
* Deep learning is both powerful and fun!
* But it can easily fool you into thinking it is flawless.
  * The most difficult part is the rigorous work of evaluating a model for bias and accuracy

---

## Next Class

* Next class will be a review of our major topics
* The recitations will go over the most commonly incorrect questions from the exams
  * You should expect to see similar versions of these again!