# CS 530 - Lecture 12 ## Monte Carlo Bernhard Firner 2026-03-03 --- ## Review * Last time we talked about the limitations of AI * Machine learning is data dependent * Theoretically we should be able to build better agents * But historically, most consistent improvements come from increased compute * Even modern success with deep learning was unlocked with data and compute --- ## Prediction * We are going to need ML for prediction * But let's talk about another non-ML topic first * First, because introducing new stuff right before the midterm is crazy * Second, to discuss the limitations of value iteration and related strategies --- ## Value Iteration * We've been discussing Markov Decision Processes (MDPs) * Value iteration involved looking at the utility of possible future states * If we iterate multiple steps, we are looking farther into the future * This is basically calculating the Bellman Equation and measuring utility * $U(s) = \underset{a\in A(s)}{\mathrm{max}}\Sigma_{s'}P(s'|s,a)[R(s,a,s') + \gamma U(s')]$ --- ## Grid World
* Value iteration worked perfectly on our grid world example * There are a limited number of states, and actions take the agent to an adjacent square
See ch 16 in the book
--- ## Value Iteration Program ```python import torch known_rewards = torch.zeros((1, 1, 3, 4)) known_rewards[0, 0, 0, 3] = 1 known_rewards[0, 0, 1, 3] = -1 # The convolution represents our actions conv = torch.nn.Conv2d(in_channels=1, out_channels=4, kernel_size=(3,3), stride=1, padding=1) with torch.no_grad(): # Up conv.weight[0, 0] = torch.tensor([[0, 0.8, 0], [0.1, 0, 0.1], [0, 0, 0]]) # Right conv.weight[1, 0] = torch.tensor([[0, 0.1, 0], [0, 0, 0.8], [0, 0.1, 0]]) # Down conv.weight[2, 0] = torch.tensor([[0, 0, 0], [0.1, 0, 0.1], [0, 0.8, 0]]) # Left conv.weight[3, 0] = torch.tensor([[0, 0.1, 0], [0.8, 0, 0], [0, 0.1, 0]]) # The weights should then be multiplied by gamma, the discount factor. gamma = 1 # The bias is the cost of each action, e.g. -0.04 with torch.no_grad(): conv.bias.fill_(-0.04) # Select the best possible next state utility maxpool = torch.nn.MaxPool3d(kernel_size = (4,1,1)) torch.set_printoptions(precision=2, sci_mode=False) # Starting state state = known_rewards for i in range(100): state = maxpool(conv(state.view(1, 1, 3, 4))) # Some rewards are fixed, including the obstacle (with 0 reward) state[0, 0, 0, 3] = 1 state[0, 0, 1, 3] = -1 state[0, 0, 1, 1] = 0 print(f"Step {i}") print(state) --- ## Convergence * The algorithm eventually converges, yielding the expected utility at each state * We can then use those to generate an optimal policy * It is "move towards the largest number"
[[ 0.48, 0.61, 0.81, 1.00], [ 0.35, 0.00, 0.51, -1.00], [ 0.27, 0.29, 0.42, 0.19]]
--- ## Problem * But what if there are too many states to explore? * Or what if we don't know the probabilities, $P(s'|s,a)$? * So we cannot simply run value iteration, iterate until convergence, and call it a day --- ## Monte Carlo Methods * Methods of estimation that involve a random component * Section 13.4 in Russel and Norvig * Chapter 5 in Sutton and Barto's Reinforcement Learning, 2nd edition --- ## Exploration * Monte Carlo methods give us a way to *explore* an environment * As we explore, we update our information and adjust the agent's policy, $\pi$ * Let's do a simple exploration example to show how that can work --- ## Multi-Armed Bandits * Slot machines are a favorite of statisticians, but you can also imagine a Gachapon * Let's say you are standing in front of a row of machines with money to burn * When in a group, the one-armed bandits become a multi-armed problem * You want to get the highest return * What is your optimal strategy? --- ## Information Gathering * The first time you put money into a machine and get (or fail to get) a reward, you've gathered information * For each machine, you track their payout, called $\bar{x}_i$ * As you collect more data, you become more confident * More plays on a particular machine shrinks that confidence interval --- ## Interval * $\bar{x}_i \pm \sqrt{\frac{2ln~n}{n_i}}$ * $n_i$ is the number of plays on machine $i$ * $n$ is the total number of plays * In the beginning, every machine has a high interval --- ## Exploration * We can choose a simply policy: always play the machine with the highest upper bound reward * As the number of samples increases, our interval for the machine decreases * And the interval for other machines goes up * $\sqrt{\frac{2ln(1)}{1}} = 0$ * $\sqrt{\frac{2ln(2)}{1}} \approx 1.177$ * $\sqrt{\frac{2ln(3)}{1}} \approx 1.4823$ --- ## Policy * After playing on one machine for a while, its interval decreases and the interval of another machine increases * This shifts our policy to a new machine, and the process repeats --- ## Outcome * We are exploring the environment, learning as we go * "Good" machines will be explored more often * "Bad" machines will be explored less often * Just enough to shrink their confidence intervals * This strategy minimizes our regret, the difference between our actions and only playing on the "best" machine --- ## Back to Monte Carlo * The insight here is that we don't need to explore all parts of the environment equally * Exploration should favor outcomes that look good * Ignore outcomes that seem bad * Early exploration will be more equal, but stick to previous good paths once they've been discovered --- ## Multiple Steps * Consider something similar to the multi-armed bandit, but now each machine unlocks a door to another room with more machines * If we lose in the second room, we get kicked back to the first room * We only win if we make it through a sequence of rooms * This describes most discrete environments --- ## Monte Carlo Prediction * Continuing with games of chance, let's discuss blackjack * Episodic, with rewards of +1 for winning, -1 for losing, and 0 for a draw * There is no discount, $\gamma$, because the game is short --- ## Blackjack * The player and dealer are each dealt 2 cards * 1 face up, 1 face down * The player's action is a decision between "hitting" or "sticking" * This means getting another card * The goal is to have a sum of card values as close as possible to 21 as possible without going over * Whoever is closer wins --- ## States * The ace can count as either 1 or 11 * The state that informs the player's decision is comprised of * The current sum * The dealer's face up card * And whether or not the player has an ace that can be used as 1 or 11 --- ## Total States * Any sum less than 11 doesn't need to be considered because the player will always hit * So the sum is 12-21, the dealer's card is ace-10, and the player may or may not have a usable ace * $10 \times 10\times 2 = 200$ --- ## Prediction * Let's use Monte Carlo for predicting first, then use it to develop a policy * Let's say we want to evaluate a policy of sticking only with a sum or 20 or 21 * What is our predicted reward with this policy? --- ## Brute Force * This is nothing more than a brute-force simulation
--- ## State Values
* These are the estimated state values of our policy * Notice that that back row is higher * That is the player winning! * Most other states don't look very good
From Sutton and Barto
--- ## Exploration * To find the optimal policy, we need to try different things * Imagine that graph of state values, but we have a different one for each policy * We don't want to brute force everything * Is there a good exploration strategy? --- ## Policy Evaluation * Given a policy, $\pi$, we are collecting data to estimate $q_\pi(s, a)$ * Meaning that we are in state s, take action a, and then follow a policy, $\pi$ * But if $\pi$ doesn't explore all states, many states will remain unexplored * For blackjack we can just begin in a random state, guaranteeing all will be visited --- ## Maintaining Exploration * The problem of a policy leaving states unvisited is a problem of maintaining exploration * This was solved with the multi-arm bandits by expanding the confidence ranges of less-explored machines * We could also assign a stochastic behavior to a policy, forcing it to random pick an action (at least some of the time) --- ## Algorithm per Simulation * Generate a state, $S_0$, $A_0$ randomly so that all initial states and actions will be explored * Play to the end state, using policy $\pi$ * Initialize a total reward $G \leftarrow 0$, and time $t \leftarrow T$ * Loop through the episode * $G \leftarrow \gamma G + R_{t+1}$ * Append G to a log of returns for the pair, $S_t, A_t$ * $Q(S_t, A_t) \leftarrow average(Returns(S_t, A_t))$ * $\pi(S_t) \leftarrow \underset{a}{argmax} Q(S_t, a)$ --- ## Efficiencty * It's not efficient, in the sense that a large state space could get stuck in a good, but non-optimal policy * Works fine for something like blackjack --- ## Optimal Policy
From Sutton and Barto
--- ## Better Exploration * Randomly generating starting locations won't scale * Backgammon is something like $10^{20}$ * Go is around $10^{170}$ --- ## Policy * We have two strategies * *on-policy*, where our policy itself also handles exploration * The algorithm just shown is one of these * *off-policy*, where exploration is done with something different than the learned best policy --- ## Continuous Environments * This exploration problem becomes worse in continuous environmnts * Inifinite states leads to infinite sadness * But what if we could group out states? * Maybe they look different, but aren't really? --- ## States * Notice that we grouped all states from 2-11 into one state in blackjack * Grouping information into the same state is a hard problem * And this is where ML comes in --- ## More Practice * We've only had one homework, so we can spent a moment with some more practice problems * Imagine a grid world with an upstairs and a downstairs, with steps marked by S * A hole, marked by H, has will drop you to the first floor with probability p=0.5 * If moving has a cost of -0.04, what is the utility of each grid?
S
-
W
S
1
-
-
W
-1
-
-
-
W
S
-
S
W
W
S
W
-
-
-
H
W
W
W
-
S
W
--- ## More Practice * Let's say you're playing rock-paper-scissors against Bob * You are playing for $10, best two out of three * What is your expected utility if you have already one a single round? --- ## More Practice * Bob always throws a different sign from his previous one * Before you begin a game, a spy offers to sell you information about whether or not Bob last threw paper * What is that information worth? --- ## More Practice * Bob's mood changes what he throws * His moods are happy, hungry, and sleepy * He transitions from happy to hungry with p=0.2 and to sleepy with p=0.2 * He transitions from hungry to happy with p=0.4 and to sleepy with p=0.4 * He transitions from sleepy to happy with p=0.1 and to hungry with p=0.1 * What fraction of the time is Bob asleep?