# CS 530 - Lecture 25

## Review

Bernhard Firner

2026-04-28

---

## Topics

* The final is going to be cumulative
* But will have a strong focus on the RL topics
  * Roughly lectures 12 and onwards

---

## Pre-Midterm Topics

* That doesn't mean the pre-midterm topics aren't important
* State estimation in discrete and continuous spaces is key to many tasks
  * And identifying what kind of environment you are working in narrows down the techniques you can apply
* Importantly, the sensor fusion in a Kalman filter is robust and widely used
  * This is a tool that many scientists and engineers reach for frequently
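
A minimal sketch of the fusion step, assuming a one-dimensional state with Gaussian noise. The function name `kalman_update` and the example numbers are illustrative assumptions, not taken from the lectures.

```python
def kalman_update(mean, var, z, z_var):
    """Fuse a Gaussian prior (mean, var) with a measurement z of variance z_var."""
    gain = var / (var + z_var)            # Kalman gain: how much to trust the measurement
    new_mean = mean + gain * (z - mean)   # move the estimate toward the measurement
    new_var = (1.0 - gain) * var          # fused variance is never larger than the prior's
    return new_mean, new_var


# Two equally uncertain sources: the estimate lands halfway between them.
fused_mean, fused_var = kalman_update(mean=0.0, var=4.0, z=2.0, z_var=4.0)
print(fused_mean, fused_var)  # prints 1.0 2.0
```

With equal variances the fused estimate is the midpoint and the variance is halved, which is why fusing even two noisy sensors helps.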

---

## Filtering and Smoothing

* Please remember the difference between filtering and smoothing
  * Smoothing sees the future, filtering does not
* And also remember the techniques in discrete and continuous domains
  * Markov models in discrete spaces
  * Kalman filters are the most common in continuous spaces
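
The difference can be sketched for a discrete hidden Markov model. This is a rough illustration under assumed numbers (a two-state model in the style of the textbook umbrella example); `forward_filter` and `smooth` are hypothetical helper names.

```python
def normalize(v):
    total = sum(v)
    return [x / total for x in v]


def forward_filter(prior, trans, emit, observations):
    """Filtering: P(X_t | e_1..t), using only evidence up to time t."""
    belief, filtered = prior, []
    n = len(prior)
    for obs in observations:
        # Predict one step with the transition model, then weight by the evidence.
        predicted = [sum(belief[i] * trans[i][j] for i in range(n)) for j in range(n)]
        belief = normalize([emit[j][obs] * predicted[j] for j in range(n)])
        filtered.append(belief)
    return filtered


def smooth(prior, trans, emit, observations):
    """Smoothing: P(X_t | e_1..T), which also uses evidence after time t."""
    filtered = forward_filter(prior, trans, emit, observations)
    n = len(prior)
    backward = [1.0] * n
    smoothed = [None] * len(observations)
    for t in range(len(observations) - 1, -1, -1):
        smoothed[t] = normalize([filtered[t][i] * backward[i] for i in range(n)])
        obs = observations[t]
        # Fold observation t into the backward message used at step t - 1.
        backward = [sum(trans[i][j] * emit[j][obs] * backward[j] for j in range(n))
                    for i in range(n)]
    return smoothed


# Assumed model: state 0 = rain, observation 0 = umbrella seen.
prior = [0.5, 0.5]
trans = [[0.7, 0.3], [0.3, 0.7]]
emit = [[0.9, 0.1], [0.2, 0.8]]
observations = [0, 0]
filtered = forward_filter(prior, trans, emit, observations)
smoothed = smooth(prior, trans, emit, observations)
```

The smoothed belief for the first step is more confident than the filtered one because it also sees the second observation. At the final step the two agree, since there is no future evidence left.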

---

## Deterministic Policies

* We also discussed other, non-RL solutions early in the semester
* If we can find an optimal action with value iteration, why use RL?
  * Not everything needs to be a learning and estimation problem
* That being said, many interesting problems are interesting because they have no solution (yet!)

---

## Evaluating Policies

* Sometimes we cannot exactly (or easily) calculate an optimal policy
* We can use Monte Carlo prediction to estimate a policy's expected return, $v_{\pi}$
* The policy itself may not explore every state though
  * So we begin each episode with a random state-action pair, called *exploring starts*
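
A minimal sketch of first-visit Monte Carlo prediction, assuming episodes are already collected as lists of `(state, reward)` pairs. The episode data and the name `mc_prediction` are made up for illustration.

```python
from collections import defaultdict


def mc_prediction(episodes, gamma=1.0):
    """First-visit Monte Carlo estimate of v_pi from complete episodes.

    Each episode is a list of (state, reward) pairs, where reward is the
    reward received after leaving that state.
    """
    returns = defaultdict(list)
    for episode in episodes:
        g = 0.0
        first_visit_return = {}
        # Walk backwards so g accumulates the discounted return from each step.
        for state, reward in reversed(episode):
            g = reward + gamma * g
            first_visit_return[state] = g  # earlier visits overwrite later ones
        for state, g_first in first_visit_return.items():
            returns[state].append(g_first)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}


# Hypothetical episodes generated by following some fixed policy.
episodes = [[("A", -1), ("B", -1), ("C", 10)],
            [("B", -1), ("C", 0)]]
values = mc_prediction(episodes)
```

State `A` only appears in the first episode, so its estimate rests on a single return. This is the coverage problem that exploring starts is meant to address.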

---

## Monte Carlo Control

* If we can evaluate a policy, then we should be able to improve the policy
  * This is policy estimation, $\pi \approx \pi_*$
* We estimate $Q(s, a)$ by averaging the returns of each Monte Carlo simulation
  * Using returns over multiple simulations improves the estimate in stochastic environments
* Then update $\pi(s_t) \leftarrow \underset{a}{\operatorname{argmax}}\,Q(s_t, a)$
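
The evaluate-then-improve loop can be sketched as follows. This is an every-visit variant with an incremental mean, chosen here for brevity; the episode format and the name `mc_control_update` are assumptions.

```python
from collections import defaultdict


def mc_control_update(q, counts, policy, episode, gamma=1.0):
    """Average each (state, action) return into Q, then make the policy greedy."""
    g = 0.0
    for state, action, reward in reversed(episode):
        g = reward + gamma * g
        counts[(state, action)] += 1
        # Incremental mean of the returns seen for this pair.
        q[(state, action)] += (g - q[(state, action)]) / counts[(state, action)]
    for state in {s for s, _, _ in episode}:
        # Only actions that have been tried at least once can be compared.
        actions = [a for (s, a) in q if s == state]
        policy[state] = max(actions, key=lambda a: q[(state, a)])


q, counts, policy = defaultdict(float), defaultdict(int), {}
# Two hypothetical episodes of (state, action, reward) triples.
mc_control_update(q, counts, policy, [("s0", "left", -1.0), ("s1", "left", -5.0)])
mc_control_update(q, counts, policy, [("s0", "right", -1.0), ("s1", "right", 2.0)])
```

After the second episode the policy switches to `right` in both states, because those returns average higher than the ones observed for `left`.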

---

## Without Exploring Starts

* We cannot guarantee our initial policy will explore all states
  * So we would still need to start with random state-action pairs
* An $\epsilon\text{-greedy}$ policy is a simple solution
  * With probability $\epsilon$, we take a random action, otherwise we use the greedy best action
* As we converge to the optimal policy, we can gradually decrease $\epsilon$
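
A minimal sketch of $\epsilon$-greedy action selection, assuming the Q values of the current state are held in a list indexed by action. The decay schedule is a hypothetical choice, not one prescribed in the lectures.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index from the Q values of the current state."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore: uniform random action
    return max(range(len(q_values)), key=q_values.__getitem__)  # exploit


rng = random.Random(0)
epsilon = 1.0
chosen = []
for episode in range(5):
    chosen.append(epsilon_greedy([0.1, 0.9, 0.3], epsilon, rng))
    epsilon = max(0.05, epsilon * 0.5)  # assumed schedule: halve, with a floor
```

Keeping a small floor on $\epsilon$ retains some exploration. Letting it reach 0 makes the policy fully greedy.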

---

## On or Off Policy

* $\epsilon\text{-greedy}$ policy exploration optimizes for a policy with random actions
  * This is on-policy, meaning that the exploration policy is the same as the policy being updated
  * Converging $\epsilon$ to 0 is possible, but could be slow
* So can we learn an optimal policy, $\pi_*$ with a different behavior policy, $b$?

---

## Importance Sampling Ratio

$ \rho_{t:T-1} = \prod_{k=t}^{T-1}\frac{\pi(A_k|S_k)}{b(A_k|S_k)}$

* The importance sampling ratio gives a weight to anything discovered by off-policy search
* If $\pi(A_k|S_k)$ is 0, then the target policy would never choose that action, so the importance is 0
  * For example, if $\pi$ would never jump off of the cliff, and $b$ does, then we ignore that sequence
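
The ratio can be sketched directly from its definition. The two policies below are hypothetical stand-ins: a target that always moves right and a behavior policy that picks between two actions uniformly.

```python
def importance_ratio(trajectory, target, behavior):
    """Product of pi(a|s) / b(a|s) over the (state, action) pairs of a trajectory."""
    rho = 1.0
    for state, action in trajectory:
        rho *= target(state, action) / behavior(state, action)
    return rho


def target(state, action):
    """Assumed target policy: deterministic, always chooses 'right'."""
    return 1.0 if action == "right" else 0.0


def behavior(state, action):
    """Assumed behavior policy: uniform over two actions."""
    return 0.5


safe_path = [("s0", "right"), ("s1", "right")]
cliff_path = [("s0", "right"), ("s1", "left")]
rho_safe = importance_ratio(safe_path, target, behavior)    # 4.0
rho_cliff = importance_ratio(cliff_path, target, behavior)  # 0.0
```

A single action the target would never take zeroes out the whole product, so that trajectory contributes nothing to the estimate.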

---

## Behavior Policy

* Question:
  * The behavior policy must satisfy this constraint:
  * if $\pi(a|s) > 0$ then $b(a|s) > 0$
* Why?

---

## Weighting

$ \rho_{t:T-1} = \prod_{k=t}^{T-1}\frac{\pi(A_k|S_k)}{b(A_k|S_k)}$

* The denominator can be very small, leading to instabilities
  * In fact, if a loop is present then our variance can be infinite
* So we bias our gain estimates towards b's rewards instead
  * The statistics are better, even if they are technically incorrect

---

## Weighted Importance Sampling

* $ U(s) \triangleq \frac{\sum_{t\in \tau(s)} \rho_{t:T(t)-1}G_t}{\sum_{t\in \tau(s)} \rho_{t:T(t)-1}}$
  * If the denominator is 0 we still treat the estimate as 0
  * Otherwise we normalize by the total importance of the chains from each $t$ to $T(t)$
* With this, off-policy Monte Carlo control estimation works
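
A small sketch comparing the ordinary and weighted estimators, assuming the ratios and returns for each visit to a state have already been computed. The numbers are made up for illustration.

```python
def ordinary_is(rhos, returns):
    """Ordinary importance sampling: divide by the number of samples."""
    return sum(r * g for r, g in zip(rhos, returns)) / len(returns)


def weighted_is(rhos, returns):
    """Weighted importance sampling: divide by the total importance weight."""
    total = sum(rhos)
    if total == 0:
        return 0.0  # no trajectory was compatible with the target policy
    return sum(r * g for r, g in zip(rhos, returns)) / total


rhos = [4.0, 0.0, 1.0]        # hypothetical importance ratios
returns = [10.0, -30.0, 5.0]  # hypothetical returns observed under b
ordinary = ordinary_is(rhos, returns)  # 15.0
weighted = weighted_is(rhos, returns)  # 9.0
```

The ordinary estimate (15) exceeds every observed return, while the weighted estimate (9) is a convex combination of the returns and always stays inside their range.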

---

## Bootstrapping

* What if we want to learn as we go?
  * If the state space is large, Monte Carlo methods rely upon random actions for exploration
  * This is because Q does not update until the end of the episode
* Exploration is thus quite inefficient at first
* We learned that *Temporal-Difference* methods fix this problem

---

## TD-Learning

* SARSA estimates the value of $Q(S,A)$ based upon its current estimate of Q at the next state
  * $Q(S,A) \leftarrow Q(S,A) + \alpha[R + \gamma Q(S', A') - Q(S,A)]$
* This works, but what if the next action was a random one rather than the greedy optimal?
  * If the result is better than expected, $Q(S,A)$ should be updated
  * But what if the reward is negative?
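
A minimal sketch of a single SARSA update, assuming Q is stored in a dictionary keyed by `(state, action)`. The states, actions, and values are hypothetical.

```python
def sarsa_update(q, s, a, r, s_next, a_next, alpha=0.5, gamma=1.0):
    """One on-policy TD update; a_next is the action the agent will actually take."""
    td_target = r + gamma * q[(s_next, a_next)]
    q[(s, a)] += alpha * (td_target - q[(s, a)])


q_sarsa = {("s", "right"): 0.0, ("s2", "up"): 4.0, ("s2", "down"): -30.0}
# The exploration policy happened to pick the bad action "down" in s2.
sarsa_update(q_sarsa, "s", "right", -1.0, "s2", "down")
```

Here `Q(s, right)` drops to -15.5 even though a better action was available in `s2`, because the update follows the action that was actually chosen.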

---

## Q-Learning

* SARSA generates an overly-conservative policy
* Q-Learning only looks at the best-case reward from the next state
  * $Q(S,A) \leftarrow Q(S,A) + \alpha[R + \gamma \max_{a}Q(S',a) - Q(S,A)]$
* Even if we are using an $\epsilon\text{-greedy}$ exploration policy, that won't affect $\pi_*$
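
A minimal sketch of a single Q-Learning update, assuming Q is stored in a dictionary keyed by `(state, action)`. The states, actions, and values are hypothetical.

```python
def q_learning_update(q, s, a, r, s_next, actions, alpha=0.5, gamma=1.0):
    """One off-policy TD update: bootstrap from the best action in s_next."""
    best_next = max(q[(s_next, a2)] for a2 in actions)
    q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])


q_table = {("s", "right"): 0.0, ("s2", "up"): 4.0, ("s2", "down"): -30.0}
# Whatever the exploration policy does next in s2, the update uses max_a Q(s2, a).
q_learning_update(q_table, "s", "right", -1.0, "s2", ["up", "down"])
```

`Q(s, right)` moves to 1.5 because the target uses `Q(s2, up) = 4`. The poor value of `down` has no effect on the update.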

---

## Question

* Moving to a tile other than E gives a reward of -1
* Falling into the pit gives a reward of -30 and returns you to B
* With SARSA trained with $\epsilon=0.1$, what is Q(3,2)?
  * You can express it in terms of other Q values

---

## Question

* Moving to a tile other than E gives a reward of -1
* Falling into the pit gives a reward of -30 and returns you to B
* With Q-Learning trained with $\epsilon=0.1$, what is Q(3,2)?
  * You can express it in terms of other Q values

---

## Question

* Q values are all initialized to -1
* $\pi(s_t) \leftarrow \underset{a}{\operatorname{argmax}}\,Q(s_t, a)~\forall s_t \in S$
  * If multiple actions are tied, they are preferred in the following order: left, up, right, and down
* Starting at B, the following trajectory is collected:
  * (3,2), Pit (which brings us back to B), (2,1), Pit, (3,2), (3,3), (2,3), E

---

## Question

* (3,2), Pit (which brings us back to B), (2,1), Pit, (3,2), (3,3), (2,3), E
* Let $\alpha = 0.5, \gamma = 1.0$
* What are the estimates of Q if we were using Monte Carlo?
  * $Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \frac{W}{C(S_t,A_t)}[G - Q(S_t,A_t)]$
    * Where $C(S_t,A_t)$ is the total weights observed at that state and action
    * W is updated from the last step to the first: $W \leftarrow W\frac{\pi(A_t|S_t)}{b(A_t|S_t)}$

---

## Question

* (3,2), Pit (which brings us back to B), (2,1), Pit, (3,2), (3,3), (2,3), E
* Let $\alpha = 0.5, \gamma = 1.0$
* What are the estimates of Q if we were using Q-Learning?
  * $Q(S,A) \leftarrow Q(S,A) + \alpha[R + \gamma \max_{a}Q(S',a) - Q(S,A)]$

---

## Expected SARSA

* SARSA can be improved by taking the expected value for the update based upon the target policy rather than the behavior policy
* $Q(S,A) \leftarrow Q(S,A) + \alpha[R + \gamma \mathbb{E}_\pi[Q(S', A') | S'] - Q(S,A)]$
  * With discrete actions, this becomes $Q(S,A) \leftarrow Q(S,A) + \alpha[R + \gamma \sum_{a}\pi(a|S')Q(S',a) - Q(S,A)]$
  * Equivalent to Q-Learning if the target policy is greedy (it only selects the single best action)
* We can make one more improvement by using double-Q or double-expected SARSA
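
A minimal sketch of the Expected SARSA update for discrete actions, assuming the target policy is $\epsilon$-greedy with respect to the current Q. The dictionary layout and example values are hypothetical.

```python
def expected_sarsa_update(q, s, a, r, s_next, actions, epsilon, alpha=0.5, gamma=1.0):
    """TD update using the expectation of Q(s', .) under an epsilon-greedy policy."""
    next_q = [q[(s_next, a2)] for a2 in actions]
    greedy = max(range(len(actions)), key=next_q.__getitem__)
    # epsilon is spread uniformly; the remaining mass goes to the greedy action.
    probs = [epsilon / len(actions)] * len(actions)
    probs[greedy] += 1.0 - epsilon
    expected = sum(p * v for p, v in zip(probs, next_q))
    q[(s, a)] += alpha * (r + gamma * expected - q[(s, a)])


def fresh_q():
    return {("s", "right"): 0.0, ("s2", "up"): 4.0, ("s2", "down"): -30.0}


q_explore = fresh_q()
expected_sarsa_update(q_explore, "s", "right", -1.0, "s2", ["up", "down"], epsilon=0.5)

q_greedy = fresh_q()
expected_sarsa_update(q_greedy, "s", "right", -1.0, "s2", ["up", "down"], epsilon=0.0)
```

With `epsilon=0.0` the expectation collapses onto the greedy action, so the result (1.5) matches what the Q-Learning rule would give for the same transition.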

---

## Combining Exploration and Planning

* Model-based planning quickly converges to optimal policies, but cannot always be used
  * It requires that we fully know the environment
* Model-free learning can explore and learn in any environment
  * But convergence is slow
* The two approaches can be combined, with a model slowly being learned as we explore

---

## Dyna-Q

* Randomly choose previously explored states and simulate those actions again
* Going from $s_t$ to $s_{t+1}$ led to an initial Q estimate, but now we may know more about $s_{t+1}$
* That quickly improves estimates of Q

<div class="container">
<div class="col">

<img style="width: 60%" class="r-stretch" src="./figures/double_q_complex_maze.gif" />
<br/>
Double Q

</div>
<div class="col">

<img style="width: 60%" class="r-stretch" src="./figures/dyna_q_complex_maze.gif" />
<br/>
Dyna Q

</div>
</div>
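
The Dyna-Q planning step can be sketched as replaying remembered transitions through the Q-Learning update. This sketch assumes a deterministic environment, so the learned model is a plain dictionary. The two-step chain and the name `dyna_q_planning` are hypothetical.

```python
import random


def dyna_q_planning(q, model, actions, n_steps, alpha=0.5, gamma=1.0, rng=random):
    """Replay n_steps remembered transitions through the Q-learning update."""
    for _ in range(n_steps):
        # model maps (state, action) -> (reward, next_state) for pairs seen so far.
        (s, a), (r, s_next) = rng.choice(list(model.items()))
        # Unseen pairs, including those of the terminal state, default to 0.
        best_next = max(q.get((s_next, a2), 0.0) for a2 in actions)
        old = q.get((s, a), 0.0)
        q[(s, a)] = old + alpha * (r + gamma * best_next - old)


# Transitions observed during real experience: A -> B -> END with a final reward of 1.
model = {("A", "go"): (0.0, "B"), ("B", "go"): (1.0, "END")}
q_plan = {}
dyna_q_planning(q_plan, model, ["go"], n_steps=200, rng=random.Random(0))
```

No new interaction with the environment is needed: the reward observed at `B` propagates back to `A` purely through simulated replays.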

---

## Approximation

* That brings us to the idea of using DNNs in RL
* The simplest approach is to use a DNN as the Q function
* We can make things more interesting by approximating $\pi$ directly

---

## Estimating Q

* Advantages:
  * Works with continuous spaces
  * Works with discrete spaces with a huge number of states
* Disadvantage:
  * Now you are training a neural network

---

## Batch Problems

* Training requires good batch statistics
* But if we simply take all observations, our behavior will collapse

---

## Memory and Bias

* To remove bias during training, we need to discard some data
* Retain a bit from each episode, regardless of duration

<div class="container">
<div class="col">
<img style="width: 80%" class="r-stretch" src="./figures/cartpole-q-learning.gif" />
<br/>
<small>Cartpole after a few episodes.</small>
</div>
<div class="col">
<img style="width: 80%" class="r-stretch" src="./figures/cartpole-q-learning12ksteps.gif" />
<br/>
<small>Cartpole after 1200 episodes.</small>
</div>
</div>
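
One possible reading of "retain a bit from each episode" is a replay buffer that keeps a capped number of random transitions per episode. The class below is a hypothetical sketch of that idea, not the implementation used for the cartpole demos.

```python
import random
from collections import deque


class EpisodeBalancedBuffer:
    """Replay buffer that keeps at most `per_episode` transitions from each episode."""

    def __init__(self, capacity, per_episode, seed=None):
        self.buffer = deque(maxlen=capacity)  # oldest entries fall off when full
        self.per_episode = per_episode
        self.rng = random.Random(seed)

    def add_episode(self, transitions):
        """Store a random subset of one episode (transitions must be a list)."""
        k = min(self.per_episode, len(transitions))
        self.buffer.extend(self.rng.sample(transitions, k))

    def sample(self, batch_size):
        """Draw a training batch without replacement."""
        k = min(batch_size, len(self.buffer))
        return self.rng.sample(list(self.buffer), k)


replay = EpisodeBalancedBuffer(capacity=100, per_episode=10, seed=0)
replay.add_episode(list(range(5)))    # a short, early failure
replay.add_episode(list(range(500)))  # a long, successful episode
batch = replay.sample(8)
```

Capping the contribution of each episode keeps long episodes from crowding short ones out of the training batches.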

---

## Exploration

* A DNN is fast enough to guide exploration
  * Via rollout
  * Or Monte Carlo Tree Search
* And this is sufficient to reach a human level of play in a game like Go

---

## Advanced Learning

<div class="container">
<div class="col">

* It turns out that these techniques are insufficient for other games
* [Pitfall](https://ale.farama.org/environments/pitfall/) has a state space that is too large
* The problem dimensionality makes it too easy to find a "new" state

</div>
<div class="col">

<img style="width: 55%" class="r-stretch" src="./figures/pitfall.gif" />
<br/>
<small>The Atari game, Pitfall.</small>

</div>
</div>

---

## Concepts

* We went over some high-level concepts
  * Rollout and the policy improvement theorem
  * MCTS
  * Policy Learning
  * Latent spaces
* Those are hopefully more fresh in your minds
  * I can't ask you a mathematical question about Z that you can do by hand, so expect concept questions

---

## Policy Learning

* Estimating Q may actually be harder than estimating a policy
  * After all, $\pi$ just needs to know that one thing is preferable to another, not their exact values
* And a policy can be stochastic in a way not supported by a Q value alone
* So policy learning may simplify this space

---

## Policy Learning

* We discussed REINFORCE
* And Actor-Critic training, where training is stabilized by a second model that is a past copy of the one being trained
* Both use the same general form of the policy update
  * $\theta \leftarrow \theta + \alpha \cdot \frac{\partial \log \left[ \pi(a|s_t, \theta) \right]}{\partial \theta}G_t$
  * Actions that led to high positive returns should become more likely
  * Actions that led to high negative returns should become less likely
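
A minimal sketch of this update for a single state with a softmax policy, where $\theta$ holds one logit per action. That parameterization is an assumption made to keep the gradient short: for a softmax, $\partial \log \pi(a) / \partial \theta_i = \mathbb{1}[i = a] - \pi(i)$.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(theta, action, g, alpha=0.1):
    """Return theta + alpha * grad log pi(action) * G for per-action logits."""
    probs = softmax(theta)
    return [t + alpha * ((1.0 if i == action else 0.0) - p) * g
            for i, (t, p) in enumerate(zip(theta, probs))]


theta = [0.0, 0.0]  # two actions, initially equally likely
after_good = reinforce_step(theta, action=0, g=10.0)
after_bad = reinforce_step(theta, action=0, g=-10.0)
p_good = softmax(after_good)[0]
p_bad = softmax(after_bad)[0]
```

A positive return raises the probability of action 0 above 0.5 and a negative return lowers it below 0.5.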

---

## State Compression

* The solution is to use latent representations to compress the observable state into a smaller relevant state
* This is a broad, powerful idea
  * That is difficult to compress into a slide or two
  * So if you did not follow this concept, look over the last four lectures

<!--
lecture02.md:# CS 530 - Lecture 02
lecture02.md-## Environments and Knowledge

lecture03.md:# CS 530 - Lecture 03
lecture03.md-## Planning Under Non-Determinism

lecture04.md:# CS 530 - Lecture 04
lecture04.md-## Probabalistic Reasoning

lecture05.md:# CS 530 - Lecture 05
lecture05.md-## Hidden Markov Models

lecture06.md:# CS 530 - Lecture 06
lecture06.md-## More Discrete Space Predictions

lecture07.md:# CS 530 - Lecture 07
lecture07.md-## State Space Models

lecture08.md:# CS 530 - Lecture 08
lecture08.md-## Noise and Fusion in State Space Models

lecture09.md:# CS 530 - Lecture 09
lecture09.md-## Evaluating Decisions

lecture10.md:# CS 530 - Lecture 10
lecture10.md-## Algorithms for Decisions

lecture11.md:# CS 530 - Lecture 11
lecture11.md-## Applied Machine Learning

lecture12.md:# CS 530 - Lecture 12
lecture12.md-## Monte Carlo

lecture13.md:# CS 530 - Lecture 13
lecture13.md-## Monte Carlo

lecture14.md:# CS 530 - Lecture 14
lecture14.md-## Temporal Difference Learning

lecture15.md:# CS 530 - Lecture 15
lecture15.md-## Planning and Learning

lecture16.md:# CS 530 - Lecture 16
lecture16.md-## Decision-Time Planning

lecture17.md:# CS 530 - Lecture 17
lecture17.md-## Approximation

lecture18.md:# CS 530 - Lecture 18
lecture18.md-## RL with Approximation

lecture19.md:# CS 530 - Lecture 19
lecture19.md-## REINFORCEMENT

lecture20.md:# CS 530 - Lecture 20
lecture20.md-## Actor-Critic Learning

lecture21.md:# CS 530 - Lecture 21
lecture21.md-## Features and Latent Spaces

lecture22.md:# CS 530 - Lecture 22
lecture22.md-## Latent Spaces and Knowledge Compression

lecture23.md:# CS 530 - Lecture 23
lecture23.md-## World Model Reasoning

lecture24.md:# CS 530 - Lecture 24
lecture24.md-## World Model Reasoning

-->