# CS 530 - Lecture 16

## Decision-Time Planning

Bernhard Firner

2026-03-26

---

## Review

* Last time we began unifying model-based planning and model-free planning
* Why did we want to do this?
  * model-based planning quickly converges to optimal policies, but cannot always be used
  * model-free planning can explore and learn in any environment

---

## DynaQ

* DynaQ runs multiple planning iterations after every step during exploration
* This speeds up learning per episode by "filling in" information

</div>

---

## Comparison: Exploration

* The advantage is clear when we look at how nicely the Q function utility estimates are filled in

<img style="width: 90%" class="r-stretch" src="./figures/double_q_complex_maze.gif" />
<br/>
Double Q

</div>
<div class="col">

<img style="width: 90%" class="r-stretch" src="./figures/dyna_q_complex_maze.gif" />
<br/>
Dyna Q

</div>
</div>

---

## Inefficiencies

* We are assuming here that the simulated steps of Dyna-Q are have no cost compared to exploration steps
* That matches with any experience in the real world, but the simulated steps still have computational cost
* A better approach would be directed, for example by using the policy to guide actions
  * An approach like this would be called *trajectory sampling*

---

## Types of Planning

* We used planning to update Q rewards in the background after taking an action
  * This prepares our tabular Q entries for future encounters
  * A type of *backround planning*
* The other way we could use planning is to use it in action selection
  * This is *decision-time planning*

---

## Thinking Ahead

* Decision-time planning means we plan our actions
  * For example, by running a simulation from the current state
* How deep and wide to run the simulation are based upon available time and heuristics
  * Consider a timed game of chess or go where we can simulate games from the current position, but not every possible game

---

## Models and Action-Values

* Obviously we need an accurate model to simulate ahead
* And there is no point in simulating ahead if we already have good estimates for Q
* So this type of planning may be limited
  * This has changed over time
  * If deep networks every grow into high-fidelity future-state simulators, decision-time planning will improve

---

## Why Does This Works

* Why should thinking ahead work?
* DynaQ sped up policy improvements and did a better job filling in Q estimates for infrequently visited states
* Thinking ahead won't do the second, and the only policy being improved is for the current state
  * And often our state space is so large that we won't bother storing our simulated results
* So when does this approach help?

---

## Focus

* If we don't actually have a way of narrowing our search, thinking ahead won't be useful
* This is akin to a beginner attempting to think ahead during a complicated game
  * Their estimates of $Q(s,a)$ are so poor that they'll still play poorly, no matter how long they plan
* A reasonable Q function will allow our agent to explore future states efficiently
  * Or the episodes need to be short enough that we can simulate to the end from the current state

---

## Rollout

* Rollout is a widely used decision-time planning algorithm
* Its advantages are that it is dead simple and that it works suprisingly well
* The algorithm is basically a Monte Carlo *prediction* algorithm
  * It evaluates the average returns of a policy rather than finding a new policy

---

## Rollout Algorithm

* From the current state, s
* Loop
  * Select a random action, a, and next state s'
  * Play to the end state (or some depth), using policy $\pi$
  * Append the final gain to the gains from action $a$
* Select the action with the highest average gain, or the gain the best matches some criteria

---

## Better than $\pi$?

* How is this better than the policy, $\pi$?
* Due to something called the *policy improvement theorem*
* Suppose original policy, $\pi$, results in some utility for the current state: $u_pi(s)$
* Now rollout results in $\pi'(s) = a /neq \pi(s)$, but $q_pi(s, a) \geq u_pi(s)$
  * That means it we find an action better than the current policy at just a single step, we have a better policy

---

## Chosen Policy

* The chosen policy could actually be random and rollout would result in an improvement
* But improving upon a decent policy will result in better outcomes than improving upon a bad policy
* We may still have good reasons to use a random policy
  * For example in games with a great deal of randomness and a large number of states
  * Or if our "good" policy is computationaly expensive, a basic one may allow more simulations

---

## Parallel Advantages

* One important advantage of this approach is that we can run multiple Monte Carlo simulations in parallel
* With other methods, each episode would update Q
  * Having parallel updates would at least make things complicated, and would likely decrease the effective learning rate
* With rollout, we are free to run as many parallel simulations as we can

---

## Simulated Results

* So what do we do with the simulated results?
  * We throw them out
* If this task has a large state space, we are unlikely to visit state $s$ again
* Because of the randomness in rollout, only a small subset of the exploration will follow the trajectory that actually occurs

---

## MCTS

* Monte Carlo Tree Search is rollout, but with an improved focus during search
* If the states being explored better match the actual states to follow, this also allows us to retain information
  * The accumulated states and value estimates are more likely to be revisited
* Exploration is focused by exploring actions that look promising, based upon current simulated returns

---

## MCTS Algorithm

* **Selection**: Select a leaf node in a tree of current value estimates. This is the *tree policy*, which could be $\epsilon\text-greed$ rather than rollout
* **Expansion**: Add child nodes to the leaf node using the policy. This step can be skipped to avoid growing the tree.
* **Simulation**: Simulate to the end from the selected node or one of the new child nodes using rollout.
* **Backup**: Use the final gain to update the leaf node. Don't update the rest of the tree to avoid mixing random rollout and $\pi$.

---

<img style="width: 75%" class="r-stretch" src="./figures/RLBook/ch8_figure10_MCTS.png" />
<br/>
<small>Monte Carlo Tree Search from Sutton and Barto</small>

</div>

---

## Advantages

* Because several of the explored steps actually use our policy, we are likely to reach those states
* We can also retrain part of the tree structure as we continue, so this will be more computationally efficient
* Let's look at a case study that shows the strength of this technique

---

## AlphaGo and AlphaGo Zero

* [AlphaGo](https://www.nature.com/articles/nature16961) combined several techniques to become the first professional strength go playing program
  * It boostrapped some of its components with data from a 10s of millions of human games
* It was followed a year later, in 2017, by [AlphaGo Zero](https://www.nature.com/articles/nature24270)
  * This version used no human knowledge

---

## The game of Go

* Go has a simple set of rules, but a terribly complicated state space
* On a full board, there is a 19x19 grid where players alternate placing stones on any empty intersection
  * Stones of the same player that are touching form a group
  * If any stone or group has no adjacent free spaces it is captured by the opposing player and removed from the board
* The only other important rule for play is that board positions may not be immediately repeated

---

## Difficulties

* So why is this hard?
  * Many "good" moves involve threading a needle between moves that are terrible
  * A random rollout policy will be unlikely to find these long sequences of good play
* The board state can also change drastically, depending upon moves
  * Large groups can be captured or threatened with a small number of moves

---

## Value Estimation

* In summary, estimating Q is difficult
* So how with AlphaGo deal with this?
  * A combination of MCTS, reinforcement learning, and supervised learning of deep neural networks
* The original Alpha Go used a large body of human plays to initial a neural network, but Alpha Go Zero bootstraps itself

---

## Things That Don't Work

* Even a 9x9 board is challenging, so it isn't just the size of the search space
* With a good estimate of the Q function, directed search from MCTS would be successful
  * Yet this was out of reach for many years of Go research
* Alpha Go combined large convolutional networks with reinforcement learning and tree search to solve the problem

---

## Policy Network

* Ideally, a single network could take in the state of the board and predict the next good move
* The AlphaGo authors trained a policy network on human play
  * Called the *SL-policy network* (for supervised learning)
  * Trained on 
* Called $p_theta(a|s)$

---

## Policy Network Training

* 30 million board positions from the KGS Go Server dataset were used
* Expert moves in a test dataset were predicted with an accuracy of 57%
* A smaller net was used to train a rollout policy, $p_\pi(a|s)$

</div>

---

## APV-MCTS

* MCTS uses a tree policy to decide where in the tree to explore with simulation
* Asynchronous policy and value MCTS uses a different strategy
  * The SL-policy network was used to predict human moves
  * Then, instead of rollout alone, the utility is estimated with the help of a value network
* But that value network is the Q function! That was supposed to be hard!

---

## Value Network

* The value network, another 13-layer ConvNet, predicts the value of the ending board position
  * This is the policy network, $p_\rho$
  * It is initialized with the SL-policy network weights and then trained with self-play

---

## Self-Play

* The policy network needs to evaluate lots of terrible moves
  * These will be encountered during rollout
    * Either with a uniform random rollout policy or the miniature $p_\pi$ policy network
* So self-play is used, where the current network $p_\rho$ plays against a previous policy network
* As a result, the $p_\rho$ network was stronger than the SL-policy network (on their own)

---

## Self-Play Dataset

* Self-play was used to create a new dataset of games between the RL policy and itself
* A single position was sampled from each of 30 million games
  * Then $v_\theta(s)$ is trained to predict the game outcome from board positions

---

## Value Prediction Error

* Rollout with the RL policy or SL policy networks are superior to $v_\theta$
* However, they also consume 15,000 times the computation

</div>

---

## MCTS Gain

* Those savings mean that the value network can be used with MCTS
* After rollout with $p_\pi$, the fast rollout policy, we see gain, $G$
* The utility/value of the state is predicted with this equation:
  * $u(s) = (1 - \lambda)v_\theta(s) + \lambda G$
  * $\lambda$ is a mixing parameter to tune the trust between rollout and the value network

---

## APV-MCTS

</div>

---

## Q Updates

* $1(s, a, i)$ indicates if action $a$ was taken from $s$ at simulation $i$
* $Q(s,a) = \frac{\sum_{i=1}^n 1(s,a,i)V(S^i_l)}{\sum_{i=1}^n 1(s,a,i)}$
  * $s^i_l$ is the leaf for simulation $i$
* Plainly, we average the value estimates across all times we encountered $(s, a)$

---

## Alpha Go Zero

* The human-trained SL policy network, gave better results than the RL trained network
  * Why?
  * It probably had to do with the SL policy network's better handling fo "nonoptimal"
* If we only trained on the "best" moves, an amateur who played good but non-optimal would would take us by surprise
* The goal of AlphaGo Zero was to remove the human training data

---

## Updates

* A single network, $f_\theta$, is used to replace the policy and value networks
  * This is intuitive; they should be exploiting the same knowledge
* The MCTS algorithm is updated
  * Remember the *policy improvement theorem*?
  * If $f_\theta$ is strong, then MCTS guided by $f_\theta$ should be stronger

---

## AlphaGo Zero MCTS

* MCTS will be updated to remove random rollout
* Instead, $f_\theta$ will be used to evaluate probabilities of different actions
  * MCTS uses those probabilities to expand the tree search
  * The final value estimates from $f_\theta$ are used to estimate Q
* Q updates are averaged over all encountered with a state, as before

---

## Updated MCTS

</div>

---

## Performance

</div>

---

## Takeaways

* This result surprised a lot of people
* It was a combination of a few things:
  * Progress in convolutional neural networks
  * Increased compute on modern GPUs
  * Sophisticated reinforcement learning algorithms

---

## For Us

* What should we learn from this?
  * Notice that the neural network was used in all steps
* We've been doing what is called *tabular Q learning*
  * Meaning that Q values are simply stored and looked up
* In AlphaGo, Q values are estimates from the neural network

---

## Advantages of DNNs

* There are a few advantages to this
* First, we can estimates $Q(s, a)$ for a state that we haven't actually observed before
  * As long as the DNN generalizes, this will work
* Likewise, if the DNN learns a compressed representation of all states this is an effective approach even when the number of states explodes

---

## Directions

* This gives us a strong signal that DNNs can serve as our Q functions in continuous environments
* Our difficulties will lie along other axes:
  * How can we simulate future states?
  * How can we get enough data to train the DNN?