# CS 530 - Lecture 12

## Monte Carlo

Bernhard Firner

2026-03-03

---

## Review

* Last time we talked about the limitations of AI
* Machine learning is data dependent
* Theoretically we should be able to build better agents
  * But historically, most consistent improvements come from increased compute
* Even modern success with deep learning was unlocked with data and compute

---

## Prediction

* We are going to need ML for prediction
* But let's talk about another non-ML topic first
  * First, because introducing new stuff right before the midterm is crazy
  * Second, to discuss the limitations of value iteration and related strategies

---

## Value Iteration

* We've been discussing Markov Decision Processes (MDPs)
* Value iteration involved looking at the utility of possible future states
  * If we iterate multiple steps, we are looking farther into the future
* This is basically calculating the Bellman Equation and measuring utility
  * $U(s) = \underset{a\in A(s)}{\mathrm{max}}\Sigma_{s'}P(s'|s,a)[R(s,a,s') + \gamma U(s')]$

---

## Grid World

* Value iteration worked perfectly on our grid world example
* There are a limited number of states, and actions take the agent to an adjacent square

</div>
<div class="col">

</div>
</div>

---

## Value Iteration Program

```python
import torch

known_rewards = torch.zeros((1, 1, 3, 4))
known_rewards[0, 0, 0, 3] = 1
known_rewards[0, 0, 1, 3] = -1

# The convolution represents our actions
conv = torch.nn.Conv2d(in_channels=1, out_channels=4, kernel_size=(3,3), stride=1, padding=1)

with torch.no_grad():
    # Up
    conv.weight[0, 0] = torch.tensor([[0, 0.8, 0], [0.1, 0, 0.1], [0, 0, 0]])
    # Right
    conv.weight[1, 0] = torch.tensor([[0, 0.1, 0], [0, 0, 0.8], [0, 0.1, 0]])
    # Down
    conv.weight[2, 0] = torch.tensor([[0, 0, 0], [0.1, 0, 0.1], [0, 0.8, 0]])
    # Left
    conv.weight[3, 0] = torch.tensor([[0, 0.1, 0], [0.8, 0, 0], [0, 0.1, 0]])

# The weights should then be multiplied by gamma, the discount factor.
gamma = 1

# The bias is the cost of each action, e.g. -0.04
with torch.no_grad():
    conv.bias.fill_(-0.04)

# Select the best possible next state utility
maxpool = torch.nn.MaxPool3d(kernel_size = (4,1,1))

torch.set_printoptions(precision=2, sci_mode=False)

# Starting state
state = known_rewards
for i in range(100):
    state = maxpool(conv(state.view(1, 1, 3, 4)))
    # Some rewards are fixed, including the obstacle (with 0 reward)
    state[0, 0, 0, 3] = 1
    state[0, 0, 1, 3] = -1
    state[0, 0, 1, 1] = 0
    print(f"Step {i}")
    print(state)

---

## Convergence

* The algorithm eventually converges, yielding the expected utility at each state
* We can then use those to generate an optimal policy
  * It is "move towards the largest number"

---

## Problem

* But what if there are too many states to explore?
* Or what if we don't know the probabilities, $P(s'|s,a)$?
* So we cannot simply run value iteration, iterate until convergence, and call it a day

---

## Monte Carlo Methods

* Methods of estimation that involve a random component
* Section 13.4 in Russel and Norvig
* Chapter 5 in Sutton and Barto's Reinforcement Learning, 2nd edition

---

## Exploration

* Monte Carlo methods give us a way to *explore* an environment
* As we explore, we update our information and adjust the agent's policy, $\pi$
* Let's do a simple exploration example to show how that can work

---

## Multi-Armed Bandits

* Slot machines are a favorite of statisticians, but you can also imagine a Gachapon
* Let's say you are standing in front of a row of machines with money to burn
  * When in a group, the one-armed bandits become a multi-armed problem
* You want to get the highest return
  * What is your optimal strategy?

---

## Information Gathering

* The first time you put money into a machine and get (or fail to get) a reward, you've gathered information
* For each machine, you track their payout, called $\bar{x}_i$
* As you collect more data, you become more confident
  * More plays on a particular machine shrinks that confidence interval

---

## Interval

* $\bar{x}_i \pm \sqrt{\frac{2ln~n}{n_i}}$
  * $n_i$ is the number of plays on machine $i$
  * $n$ is the total number of plays
* In the beginning, every machine has a high interval

---

## Exploration

* We can choose a simply policy: always play the machine with the highest upper bound reward
* As the number of samples increases, our interval for the machine decreases
  * And the interval for other machines goes up
* $\sqrt{\frac{2ln(1)}{1}} = 0$ 
* $\sqrt{\frac{2ln(2)}{1}} \approx 1.177$ 
* $\sqrt{\frac{2ln(3)}{1}} \approx 1.4823$

---

## Policy

* After playing on one machine for a while, its interval decreases and the interval of another machine increases
  * This shifts our policy to a new machine, and the process repeats

---

## Outcome

* We are exploring the environment, learning as we go
* "Good" machines will be explored more often
* "Bad" machines will be explored less often
  * Just enough to shrink their confidence intervals
* This strategy minimizes our regret, the difference between our actions and only playing on the "best" machine

---

## Back to Monte Carlo

* The insight here is that we don't need to explore all parts of the environment equally
  * Exploration should favor outcomes that look good
  * Ignore outcomes that seem bad
  * Early exploration will be more equal, but stick to previous good paths once they've been discovered

---

## Multiple Steps

* Consider something similar to the multi-armed bandit, but now each machine unlocks a door to another room with more machines
* If we lose in the second room, we get kicked back to the first room
* We only win if we make it through a sequence of rooms
* This describes most discrete environments

---

## Monte Carlo Prediction

* Continuing with games of chance, let's discuss blackjack
  * Episodic, with rewards of +1 for winning, -1 for losing, and 0 for a draw
  * There is no discount, $\gamma$, because the game is short

---

## Blackjack

* The player and dealer are each dealt 2 cards
  * 1 face up, 1 face down
* The player's action is a decision between "hitting" or "sticking"
  * This means getting another card
* The goal is to have a sum of card values as close as possible to 21 as possible without going over
  * Whoever is closer wins

---

## States

* The ace can count as either 1 or 11
* The state that informs the player's decision is comprised of
  * The current sum
  * The dealer's face up card
  * And whether or not the player has an ace that can be used as 1 or 11

---

## Total States

* Any sum less than 11 doesn't need to be considered because the player will always hit
* So the sum is 12-21, the dealer's card is ace-10, and the player may or may not have a usable ace
  * $10 \times 10\times 2 = 200$

---

## Prediction

* Let's use Monte Carlo for predicting first, then use it to develop a policy
* Let's say we want to evaluate a policy of sticking only with a sum or 20 or 21
  * What is our predicted reward with this policy?

---

## Brute Force

* This is nothing more than a brute-force simulation

</div>

---

## State Values

* These are the estimated state values of our policy
* Notice that that back row is higher
  * That is the player winning!
* Most other states don't look very good

</div>
<div class="col">

<img style="width: 80%" class="r-stretch" src="./figures/RLBook/ch5_figure5.1.png" />
<br/>
<small>From Sutton and Barto</small>

</div>
</div>

---

## Exploration

* To find the optimal policy, we need to try different things
* Imagine that graph of state values, but we have a different one for each policy
  * We don't want to brute force everything
  * Is there a good exploration strategy?

---

## Policy Evaluation

* Given a policy, $\pi$, we are collecting data to estimate $q_\pi(s, a)$
  * Meaning that we are in state s, take action a, and then follow a policy, $\pi$
* But if $\pi$ doesn't explore all states, many states will remain unexplored
* For blackjack we can just begin in a random state, guaranteeing all will be visited

---

## Maintaining Exploration

* The problem of a policy leaving states unvisited is a problem of maintaining exploration
* This was solved with the multi-arm bandits by expanding the confidence ranges of less-explored machines
* We could also assign a stochastic behavior to a policy, forcing it to random pick an action (at least some of the time)

---

## Algorithm per Simulation

* Generate a state, $S_0$, $A_0$ randomly so that all initial states and actions will be explored
* Play to the end state, using policy $\pi$
* Initialize a total reward $G \leftarrow 0$, and time $t \leftarrow T$
* Loop through the episode
  * $G \leftarrow \gamma G + R_{t+1}$
  * Append G to a log of returns for the pair, $S_t, A_t$
  * $Q(S_t, A_t) \leftarrow average(Returns(S_t, A_t))$
  * $\pi(S_t) \leftarrow \underset{a}{argmax} Q(S_t, a)$

---

## Efficiencty

* It's not efficient, in the sense that a large state space could get stuck in a good, but non-optimal policy
* Works fine for something like blackjack

---

## Optimal Policy

<img style="width: 60%" class="r-stretch" src="./figures/RLBook/ch5_figure5.2.png" />
<br/>
<small>From Sutton and Barto</small>

</div>

---

## Better Exploration

* Randomly generating starting locations won't scale
  * Backgammon is something like $10^{20}$
  * Go is around $10^{170}$

---

## Policy

* We have two strategies
  * *on-policy*, where our policy itself also handles exploration
    * The algorithm just shown is one of these
  * *off-policy*, where exploration is done with something different than the learned best policy

---

## Continuous Environments

* This exploration problem becomes worse in continuous environmnts
  * Inifinite states leads to infinite sadness
* But what if we could group out states?
  * Maybe they look different, but aren't really?

---

## States

* Notice that we grouped all states from 2-11 into one state in blackjack
* Grouping information into the same state is a hard problem
  * And this is where ML comes in

---

## More Practice

* We've only had one homework, so we can spent a moment with some more practice problems
* Imagine a grid world with an upstairs and a downstairs, with steps marked by S
  * A hole, marked by H, has will drop you to the first floor with probability p=0.5
* If moving has a cost of -0.04, what is the utility of each grid?

---

## More Practice

* Let's say you're playing rock-paper-scissors against Bob
* You are playing for $10, best two out of three
  * What is your expected utility if you have already one a single round?

---

## More Practice

* Bob always throws a different sign from his previous one
* Before you begin a game, a spy offers to sell you information about whether or not Bob last threw paper
  * What is that information worth?

---

## More Practice

* Bob's mood changes what he throws
  * His moods are happy, hungry, and sleepy
* He transitions from happy to hungry with p=0.2 and to sleepy with p=0.2
* He transitions from hungry to happy with p=0.4 and to sleepy with p=0.4
* He transitions from sleepy to happy with p=0.1 and to hungry with p=0.1
* What fraction of the time is Bob asleep?