# CS 530 - Lecture 16 ## Decision-Time Planning Bernhard Firner 2026-03-26 --- ## Review * Last time we began unifying model-based planning and model-free planning * Why did we want to do this? * model-based planning quickly converges to optimal policies, but cannot always be used * model-free planning can explore and learn in any environment --- ## DynaQ * DynaQ runs multiple planning iterations after every step during exploration * This speeds up learning per episode by "filling in" information
--- ## Comparison: Exploration * The advantage is clear when we look at how nicely the Q function utility estimates are filled in
Double Q
Dyna Q
--- ## Inefficiencies * We are assuming here that the simulated steps of Dyna-Q are have no cost compared to exploration steps * That matches with any experience in the real world, but the simulated steps still have computational cost * A better approach would be directed, for example by using the policy to guide actions * An approach like this would be called *trajectory sampling* --- ## Types of Planning * We used planning to update Q rewards in the background after taking an action * This prepares our tabular Q entries for future encounters * A type of *backround planning* * The other way we could use planning is to use it in action selection * This is *decision-time planning* --- ## Thinking Ahead * Decision-time planning means we plan our actions * For example, by running a simulation from the current state * How deep and wide to run the simulation are based upon available time and heuristics * Consider a timed game of chess or go where we can simulate games from the current position, but not every possible game --- ## Models and Action-Values * Obviously we need an accurate model to simulate ahead * And there is no point in simulating ahead if we already have good estimates for Q * So this type of planning may be limited * This has changed over time * If deep networks every grow into high-fidelity future-state simulators, decision-time planning will improve --- ## Why Does This Works * Why should thinking ahead work? * DynaQ sped up policy improvements and did a better job filling in Q estimates for infrequently visited states * Thinking ahead won't do the second, and the only policy being improved is for the current state * And often our state space is so large that we won't bother storing our simulated results * So when does this approach help? --- ## Focus * If we don't actually have a way of narrowing our search, thinking ahead won't be useful * This is akin to a beginner attempting to think ahead during a complicated game * Their estimates of $Q(s,a)$ are so poor that they'll still play poorly, no matter how long they plan * A reasonable Q function will allow our agent to explore future states efficiently * Or the episodes need to be short enough that we can simulate to the end from the current state --- ## Rollout * Rollout is a widely used decision-time planning algorithm * Its advantages are that it is dead simple and that it works suprisingly well * The algorithm is basically a Monte Carlo *prediction* algorithm * It evaluates the average returns of a policy rather than finding a new policy --- ## Rollout Algorithm * From the current state, s * Loop * Select a random action, a, and next state s' * Play to the end state (or some depth), using policy $\pi$ * Append the final gain to the gains from action $a$ * Select the action with the highest average gain, or the gain the best matches some criteria --- ## Better than $\pi$? * How is this better than the policy, $\pi$? * Due to something called the *policy improvement theorem* * Suppose original policy, $\pi$, results in some utility for the current state: $u_pi(s)$ * Now rollout results in $\pi'(s) = a /neq \pi(s)$, but $q_pi(s, a) \geq u_pi(s)$ * That means it we find an action better than the current policy at just a single step, we have a better policy --- ## Chosen Policy * The chosen policy could actually be random and rollout would result in an improvement * But improving upon a decent policy will result in better outcomes than improving upon a bad policy * We may still have good reasons to use a random policy * For example in games with a great deal of randomness and a large number of states * Or if our "good" policy is computationaly expensive, a basic one may allow more simulations --- ## Parallel Advantages * One important advantage of this approach is that we can run multiple Monte Carlo simulations in parallel * With other methods, each episode would update Q * Having parallel updates would at least make things complicated, and would likely decrease the effective learning rate * With rollout, we are free to run as many parallel simulations as we can --- ## Simulated Results * So what do we do with the simulated results? * We throw them out * If this task has a large state space, we are unlikely to visit state $s$ again * Because of the randomness in rollout, only a small subset of the exploration will follow the trajectory that actually occurs --- ## MCTS * Monte Carlo Tree Search is rollout, but with an improved focus during search * If the states being explored better match the actual states to follow, this also allows us to retain information * The accumulated states and value estimates are more likely to be revisited * Exploration is focused by exploring actions that look promising, based upon current simulated returns --- ## MCTS Algorithm * **Selection**: Select a leaf node in a tree of current value estimates. This is the *tree policy*, which could be $\epsilon\text-greed$ rather than rollout * **Expansion**: Add child nodes to the leaf node using the policy. This step can be skipped to avoid growing the tree. * **Simulation**: Simulate to the end from the selected node or one of the new child nodes using rollout. * **Backup**: Use the final gain to update the leaf node. Don't update the rest of the tree to avoid mixing random rollout and $\pi$. ---
Monte Carlo Tree Search from Sutton and Barto
--- ## Advantages * Because several of the explored steps actually use our policy, we are likely to reach those states * We can also retrain part of the tree structure as we continue, so this will be more computationally efficient * Let's look at a case study that shows the strength of this technique --- ## AlphaGo and AlphaGo Zero * [AlphaGo](https://www.nature.com/articles/nature16961) combined several techniques to become the first professional strength go playing program * It boostrapped some of its components with data from a 10s of millions of human games * It was followed a year later, in 2017, by [AlphaGo Zero](https://www.nature.com/articles/nature24270) * This version used no human knowledge --- ## The game of Go * Go has a simple set of rules, but a terribly complicated state space * On a full board, there is a 19x19 grid where players alternate placing stones on any empty intersection * Stones of the same player that are touching form a group * If any stone or group has no adjacent free spaces it is captured by the opposing player and removed from the board * The only other important rule for play is that board positions may not be immediately repeated --- ## Difficulties * So why is this hard? * Many "good" moves involve threading a needle between moves that are terrible * A random rollout policy will be unlikely to find these long sequences of good play * The board state can also change drastically, depending upon moves * Large groups can be captured or threatened with a small number of moves --- ## Value Estimation * In summary, estimating Q is difficult * So how with AlphaGo deal with this? * A combination of MCTS, reinforcement learning, and supervised learning of deep neural networks * The original Alpha Go used a large body of human plays to initial a neural network, but Alpha Go Zero bootstraps itself --- ## Things That Don't Work * Even a 9x9 board is challenging, so it isn't just the size of the search space * With a good estimate of the Q function, directed search from MCTS would be successful * Yet this was out of reach for many years of Go research * Alpha Go combined large convolutional networks with reinforcement learning and tree search to solve the problem --- ## Policy Network * Ideally, a single network could take in the state of the board and predict the next good move * The AlphaGo authors trained a policy network on human play * Called the *SL-policy network* (for supervised learning) * Trained on * Called $p_theta(a|s)$ --- ## Policy Network Training * 30 million board positions from the KGS Go Server dataset were used * Expert moves in a test dataset were predicted with an accuracy of 57% * A smaller net was used to train a rollout policy, $p_\pi(a|s)$
--- ## APV-MCTS * MCTS uses a tree policy to decide where in the tree to explore with simulation * Asynchronous policy and value MCTS uses a different strategy * The SL-policy network was used to predict human moves * Then, instead of rollout alone, the utility is estimated with the help of a value network * But that value network is the Q function! That was supposed to be hard! --- ## Value Network * The value network, another 13-layer ConvNet, predicts the value of the ending board position * This is the policy network, $p_\rho$ * It is initialized with the SL-policy network weights and then trained with self-play --- ## Self-Play * The policy network needs to evaluate lots of terrible moves * These will be encountered during rollout * Either with a uniform random rollout policy or the miniature $p_\pi$ policy network * So self-play is used, where the current network $p_\rho$ plays against a previous policy network * As a result, the $p_\rho$ network was stronger than the SL-policy network (on their own) --- ## Self-Play Dataset * Self-play was used to create a new dataset of games between the RL policy and itself * A single position was sampled from each of 30 million games * Then $v_\theta(s)$ is trained to predict the game outcome from board positions --- ## Value Prediction Error * Rollout with the RL policy or SL policy networks are superior to $v_\theta$ * However, they also consume 15,000 times the computation
--- ## MCTS Gain * Those savings mean that the value network can be used with MCTS * After rollout with $p_\pi$, the fast rollout policy, we see gain, $G$ * The utility/value of the state is predicted with this equation: * $u(s) = (1 - \lambda)v_\theta(s) + \lambda G$ * $\lambda$ is a mixing parameter to tune the trust between rollout and the value network --- ## APV-MCTS
--- ## Q Updates * $1(s, a, i)$ indicates if action $a$ was taken from $s$ at simulation $i$ * $Q(s,a) = \frac{\sum_{i=1}^n 1(s,a,i)V(S^i_l)}{\sum_{i=1}^n 1(s,a,i)}$ * $s^i_l$ is the leaf for simulation $i$ * Plainly, we average the value estimates across all times we encountered $(s, a)$ --- ## Alpha Go Zero * The human-trained SL policy network, gave better results than the RL trained network * Why? * It probably had to do with the SL policy network's better handling fo "nonoptimal" * If we only trained on the "best" moves, an amateur who played good but non-optimal would would take us by surprise * The goal of AlphaGo Zero was to remove the human training data --- ## Updates * A single network, $f_\theta$, is used to replace the policy and value networks * This is intuitive; they should be exploiting the same knowledge * The MCTS algorithm is updated * Remember the *policy improvement theorem*? * If $f_\theta$ is strong, then MCTS guided by $f_\theta$ should be stronger --- ## AlphaGo Zero MCTS * MCTS will be updated to remove random rollout * Instead, $f_\theta$ will be used to evaluate probabilities of different actions * MCTS uses those probabilities to expand the tree search * The final value estimates from $f_\theta$ are used to estimate Q * Q updates are averaged over all encountered with a state, as before --- ## Updated MCTS
--- ## Performance
--- ## Takeaways * This result surprised a lot of people * It was a combination of a few things: * Progress in convolutional neural networks * Increased compute on modern GPUs * Sophisticated reinforcement learning algorithms --- ## For Us * What should we learn from this? * Notice that the neural network was used in all steps * We've been doing what is called *tabular Q learning* * Meaning that Q values are simply stored and looked up * In AlphaGo, Q values are estimates from the neural network --- ## Advantages of DNNs * There are a few advantages to this * First, we can estimates $Q(s, a)$ for a state that we haven't actually observed before * As long as the DNN generalizes, this will work * Likewise, if the DNN learns a compressed representation of all states this is an effective approach even when the number of states explodes --- ## Directions * This gives us a strong signal that DNNs can serve as our Q functions in continuous environments * Our difficulties will lie along other axes: * How can we simulate future states? * How can we get enough data to train the DNN?