# CS 530 - Lecture 12

## Monte Carlo

### On and Off Policy

### Prediction and Control

Bernhard Firner

2026-03-10

---

## Review

* We've been talking about Monte Carlo simulations
  * To predict the *utility* of a policy by finding the expected reward from all states
* We'll extend this to finding the policy itself, which will be called the *control* problem
* This sets up reinforcement learning

---

## Reading

* Russel and Norvig
  * Monte Carlo is mentioned several times, but section 13.4 is most relevant
  * RL is chapter 23 in Russel and Norvig
* Or Chapter 5 in Sutton and Barto reinforcement learning book
* Russel and Norvig measure utility (U), the RL book uses value (V), but they are mostly equivalent

---

## Monte Carlo Prediction

* Last time we wanted to evaluation a policy, $\pi$
* This problem formulation is called *prediction*, since we are predicting the utility/value of the policy

---

## Monte Carlo Control

* Then we found the best policy
  * This is called *control*, because the policy guides exploration
* We do need to ensure that we visit every state
  * With blackjack we simply randomly select a state to start

---

## Algorithm per Simulation

* Generate a state, $S_0$, $A_0$ randomly so that all initial states and actions will be explored
* Play to the end state, using policy $\pi$
* Initialize a total reward $G \leftarrow 0$, and time $t \leftarrow T$
* Loop through the episode
  * $G \leftarrow \gamma G + R_{t+1}$
  * Append G to a log of returns for the pair, $S_t, A_t$
  * $Q(S_t, A_t) \leftarrow average(Returns(S_t, A_t))$
  * $\pi(S_t) \leftarrow \underset{a}{argmax} Q(S_t, a)$

---

## Idea

* This estimate rewards:
  * $Q(S_t, A_t) \leftarrow average(Returns(S_t, A_t))$
* And used them to estimate a policy:
  * $\pi(S_t) \leftarrow \underset{a}{argmax} Q(S_t, a)$
* The Q estimates improve as the policy improves, and eventually we converge to our final policy

---

## How We Search

* The optimal policy we discover is also the policy guiding search
  * This is good, it reduces time spent exploring states that we won't encounter
  * This is bad, it means that we have to be able to begin at a random state
    * And *those* may not be common
* So there are benefits to exploration with a different policy

---

## On-Policy, Off-Policy

* On-policy means ising the same optimizing policy to search
* An off-policy search means using something different
* For example, let's accept some error, $\epsilon$, during exploration
  * We'll add some randomness to the off-policy

---

## Soft Policy

* A $\sigma$-soft policy could look like this
* Ensure $\pi(a|s) \geq \frac{\epsilon}{|A(s)|}$
  * Every action will be taken with probability at least $\frac{\epsilon}{|A(s)|}$
* The "best" action is taken greedily with probability $1 - \epsilon + \frac{\epsilon}{|A(s)|}$
  * Where $|A(s)|$ is the total number of actions

---

## Updates

* Now, $Q(S_t, A_t) \leftarrow average(Returns(S_t, A_t))$
* $A^* \leftarrow \underset{a}{argmax}Q(S_t, a)$
* Exploration policy is
  * choose $a = A^*$ with probability $1 - \epsilon + \frac{\epsilon}{|A(s)|}$
  * choose any other action with probability $\frac{\epsilon}{|A(s)|}$

---

## Control Estimate

* When done (after some chosen number of iterations)
  * select $\pi(s) \leftarrow \underset{a}{argmax} Q(s, a)$ for all $s \in S$

---

## Complaints

* We're behaving non-optimally to explore, yet we want to infer the optimal action
* By making $\epsilon$ arbitrarily small, we can approach the optimal policy, but now we're tuning a hyperparameter

---

## Example Problem

* We begin at S and end at G
* If $\epsilon$ is too large, we'll keep falling into the cliff, so safer paths seem better
* Some small value of $\epsilon$ will find the best policy, but now we need to tune it based upon the length of the path

</div>
<div class="col">

<img style="width: 60%" class="r-stretch" src="./figures/RLBook/ch6_cliff_walking_top.png" />
<br/>
<small>Cliff walking example from Sutton and Barto</small>

</div>
</div>

---

## The Solution

* We shouldn't force our optimal policy to be related to our policy used for exploration
* Instead, we'll use two different ones
  * The **target policy** is what we want to learn
  * The **behavior policy** is how we learn about it
* This is **off-policy** search

---

## Behavior Policy

* Let's call this one $b$
* How do we learn about $\pi$ from $b$?
* Coverage: if $\pi(a|s) > 0$ then $b(a|s) > 0$
  * Otherwise we aren't exploring enough to evaluate that action
* $b$ is stochastic, $\pi$ is likely not

---

## Importance Sampling

* Let's say that $\pi$ and $b$ having nothing to do with one another
* $b$ will choose some actions, $A_1, A_2, ...$ and get some reward
  * How does that inform our target policy, $\pi$?
* We will weight rewards by the likelihood of that series of actions coming from $\pi$

---

## Importance Sampling Ratio

$\rho_{t:T-1} = \frac{\prod_{k=t}^{T-1}\pi(A_k|S_k)p(S_{k+1}|S_k,A_k)}{\prod_{k=t}^{T-1}b(A_k|S_k)p(S_{k+1}|S_k,A_k}$

* The probabilities of seeing a state given the previous state and action are the same for $\pi$ or $b$ and cancel

$ \rho_{t:T-1} = \prod_{k=t}^{T-1}\frac{\pi(A_k|S_k)}{b(A_k|S_k)}$

---

## Interpretation

$ \rho_{t:T-1} = \prod_{k=t}^{T-1}\frac{\pi(A_k|S_k)}{b(A_k|S_k)}$

* The importance sampling ratio gives a weight to anything discovered by off-policy search
* If $\pi(A_k|S_k)$ is 0, then the target policy would never choose that action, so the important is 0
  * For example, if $\pi$ would never jump off of the cliff, and $b$ does, then we ignore that sequence

---

## Utility Estimates

* $\rho$ makes it possible to estimate the utility of a policy
  * We complete an episode and use the final reward (called gain, $G$) to update our expectations
  * $\mathbb{E}[\rho_{t:T-1}G_t | S_t = s] = U_{\pi}(s)$

---

## Bookkeeping

* Let's clarify some notation, then we'll estimate the policy
* Let's never reset t to 0 after episodes, just keep counting
* $\tau(s)$ is the number of times we've visited state $s$
* $T(t)$ is the time at the end of the episode that includes $t$
* $\\{G_t\\}_{t\in \tau(s)}$ are the returns from state $s$
* $\\{\rho_{t:T(t)-1}\\}_{t\in \tau(s)}$ are the corresponding important sampling ratios

---

## Target Policy Estimation

$ U(s) \triangleq \frac{\Sigma_{t\in \tau(s)} \rho_{t:T(t)}G_t}{|\tau(s)|}$

* In words, the utility of state $s$ is the sum of the product of the importance ratio and gain for every episode when we have encountered the state, divided by the total number of times the state has been encountered

---

## Variance

* Is that equation safe to use?
  * Whenever we make an estimate, we always run the risk of high-variance problems

* $ U(s) \triangleq \frac{\Sigma_{t\in \tau(s)} \rho_{t:T(t)}G_t}{|\tau(s)|}$
  * The $\rho_{t:T(t)}G_t$ is expected to be $u_\pi(s)$
* What if there is a single sample and the importance ratio is large?
  * Then this single sample is being given a lot of (unearned) importance

---

## Weighted Importance Sampling

* Instead, we can bias our prediction to $u_b(s)$, the exploration policy
  * This is not *correct*, but has less variance and will still converge
* $ U(s) \triangleq \frac{\Sigma_{t\in \tau(s)} \rho_{t:T(t)}G_t}{\rho_{t:T(t)}}$
  * If $\rho > 1$ this will obviously reduce our estimate
  * If $\rho < 1$ it increases it, biasing us towards $u_b$ rather than $u_\pi$
  * But if $\rho = 0$ we still treat this as 0

---

## Why?

* I know that we've just waded through a bunch of math
  * And we'll do a cool example soon, I promise
* We just need to make sure our cool example actually works

---

## Infinite Variance

* Ordinary importance sampling can break down
* Imagine a machine that sometimes gives you a reward if you poke it enough
  * Action "left" pokes it, action right "right" does nothing
  * The target policy we want to use is poke it until you get a reward

</div>
<div class="col">

<img style="width: 60%" class="r-stretch" src="./figures/RLBook/ch5_figure5.4_upper.png" />
<br/>
<small>See example 5.5 from Sutton and Barto</small>

</div>
</div>

---

## Infinite Variance

* Any time the behavior policy chooses right, the importance ratio will be 0
* So we will only estimate $u_\pi$ from episodes where $b$ only chose left
  * What is the variance of the estimate?

</div>
<div class="col">

<img style="width: 60%" class="r-stretch" src="./figures/RLBook/ch5_figure5.4_upper.png" />
<br/>
<small>See example 5.5 from Sutton and Barto</small>

</div>
</div>

---

## Infinite Variance

* We have our equations:
  * $\rho_{t:T-1} = \prod_{k=t}^{T-1}\frac{\pi(A_k|S_k)}{b(A_k|S_k)}$
  * $ U(s) \triangleq \frac{\Sigma_{t\in \tau(s)} \rho_{t:T(t)}G_t}{|\tau(s)|}$
* Recall that:
  * $Var[X] \triangleq \mathbb{E}[(X - \bar{X})^2] = \mathbb[X^2 - 2X\bar{X} + \bar{X}^2] = \mathbb{E}[X^2] - \bar{X}^2$

---

## Expected Values

* So what is $\mathbb{E}\left[\left(\prod_{t=0}^{T-1}\frac{\pi(A_k|S_k)}{b(A_k|S_k)}G_0\right)^2\right]$
  * There is only one possible value for $G$: 1
* The importance of any sequence is going to scale with how much more likely it was from $\pi$ than with $b$

---

## Enumeration

* Consider chains when only left is chosen
* A length 1 episode occurs when b chooses left (p = 0.5) and we win immediately (p = 0.1)
  * The importance sampling ratio is $\frac{1}{0.5}$
* So the first term of our expectation is $0.5\times 0.1\times\left(\frac{1}{0.5}\right)^2$

---

## Enumeration

* The next term will include choosing left (p = 0.5) and not winning (p = 0.9)
  * $0.5\times 0.9 \times 0.5\times 0.1\times\left(\frac{1}{0.5}\right)^2$
* This goes to the limit: $0.1\Sigma_{k=0}^{\infty}0.9^k \times 2^k \times 2 = \infty$
* If that part of the variance is infinite, then so is the entire thing
  * So we don't want to use ordinary (unweighted) importance sampling

---

## Fun Example

* Now we can do off-policy Monte Carlo control

---

## Off-Policy MC control

* Begin by initializing some variables:
  * $Q(s,a) \leftarrow \mathbb{R}$ (give the Q function some random scores)
  * $C(s,a) \leftarrow 0$ (this is the importance weighted count of action a at state s)
  * $\pi(s) \leftarrow \underset{a}{argmax}Q(s, a)$ (we always need a target policy)
* $b$ is going to be any soft policy (a small $\epsilon$-soft policy, for example)

---

## Initial Values

* Note that you can't just put any values into Q
  * If you are using Q in your behavior policy, then it shouldn't be biased towards awful actions
    * Such as actions that would get the agent stuck
* One option is to initialize it with very negative rewards, that will quickly be replaced by anything found through exploration

---

## Off-Policy MC Control

* Here's our learning loop for each episode:
<small>

* $G \leftarrow 0$
  * $W \leftarrow 0$
  * Complete the episode, using $b$
</small>

* Loop from $t=T-1, T-2, ..., 0:$
    * $G \leftarrow \gamma G + R_{t+1}$
    * $C(S_t,A_t) \leftarrow C(S_t,A_t) + W$
    * $Q(S_t,A_t) \leftarrow Q(S_t,A_t) \frac{W}{C(S_t,A_t)}[G - Q(s_t,A_t)]$
    * $\pi(S_t) \leftarrow \underset{a}{argmax}Q(S_t,a)$
    * If $A_t \neq \pi(S_t) $ then break
    * Else $W \leftarrow \frac{1}{b(A_t|S_t)}$

---

## Not So Bad!

* It isn't so bad
* Let's do an example
* Imagine we have a racetrack and a racecar
  * We'll model the track with some ascii art

---

## Racetrack

* The car begins on any S and finishes on any F
  * If it hits the wall, W, it resets to the beginning
* X and Y speeds are controlled independently, and can be changed by (0, -1, or +1) at each step
  * We don't allow both to be 0 (except at the start)

</div>
<div class="col">

<small>

```
WWWWWWWWWWWWWWW
WWW           F
WWW           F
WWW           F
W             F
W        WWWWWW
W        WWWWWW
W        WWWWWW
W        WWWWWW
W        WWWWWW
W      WWWWWWWW
W      WWWWWWWW
W      WWWWWWWW
W      WWWWWWWW
W      WWWWWWWW
W      WWWWWWWW
W      WWWWWWWW
W      WWWWWWWW
W      WWWWWWWW
WSSSSSSWWWWWWWW
```
</small>

</div>
</div>

---

## Racetrack

* The maximum speed is 5, and the car must be moving up and right
* A reward of -1 is applied for every step before the finish

</div>
<div class="col">

<small>

</div>
</div>

---

## Modifications

* That's the setup for the version in Sutton and Barto's book
* I'm going to allow the car to move in any direction, which will make things more complicated
  * But will allow more for interesting tracks

---

## Code

```python
import random
import sys

import numpy
import pickle

def softProbs(target_policy, epsilon, state, actions):
    # Get the probability of the soft policy selecting an action in a state
    # b(A_t|S_t)
    # Our behavioral policy will be similar to the target policy
    # Action probabilities sum to 1, with epsilon distributed over all possible actions
    action_probs = [epsilon / len(actions)]*len(actions)
    action_probs[target_policy[state]] += 1 - epsilon
    return action_probs

def softPolicy(target_policy, epsilon, state, actions):
    # Select from one of the possible actions based upon our behavior policy.
    action_probs = softProbs(target_policy, epsilon, state, actions)
    return random.choices(actions, action_probs, k=1)[0]

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("This program needs a map as its argument!")
        exit(0)

track = []
    with open(sys.argv[1], 'r') as mapfile:
        for line in mapfile:
            track.append(line[:-1])

# Find all of the starting and finish locations
    starts = [(y, x) for y, row in enumerate(track)
                     for x, val in enumerate(row) if val == 'S']
    finishes = [(y, x) for y, row in enumerate(track)
                       for x, val in enumerate(row) if val == 'F']

# Make a list of all possible actions
    actions = []
    for xdelta in [-1, 0, 1]:
        for ydelta in [-1, 0, 1]:
            actions.append((ydelta, xdelta))

MAX_SPEED = 3

# Do we need to train, or was a policy provided?
    if len(sys.argv) < 3:
        # Need to train

# Initialize our starting variables
        epsilon = 0.5
        gamma = 1
        Q = {}
        C = {}
        policy = {}
        for y in range(len(track)):
            for x in range(len(track[0])):
                s = (y,x)
                # The full state is the location and the current speed
                # Enumerate all possible speeds
                # Going to allow speeds from -MAX_SPEED to MAX_SPEED, so allow for 2*MAX_SPEED+1 states.
                # Note: Technically, since the max speed is MAX_SPEED, we don't
                # need this many states because, for example, MAX_SPEED,MAX_SPEED
                # isn't allowed.
                for yspeed in range(2*MAX_SPEED+1):
                    for xspeed in range(2*MAX_SPEED+1):
                        s = (y, x, yspeed - MAX_SPEED, xspeed - MAX_SPEED)
                        # In the book, there are access with Q(s, a), but I'll
                        # store them as an array indexed by s and then a.
                        # This makes it easier to select an action.
                        Q[s] = []
                        C[s] = []
                        for i, action in enumerate(actions):
                            # Make the initial action scores a bit random since we don't
                            # know the true rewards.
                            # Do bias then network to go up (-y) and to the right (+x) so that we don't sit on the start forever
                            bias = -s[2]*10 + s[3]/2
                            Q[s].append(bias + random.choices([-10000, -10001, -10002])[0])
                            C[s].append(0)
                        # Initialize the policy
                        policy[s] = numpy.argmax(Q[s])
        print(f"There are {len(Q)} possible states.")
        print("Beginning episodes.")

## Keep track if we've every actually updated Q values at the starting locations
        start_updated = False

for episode in range(5000):
            if episode % 100 == 0:
                print(f"Beginning episode {episode}")
            # Update epsilon to slowly converge on the target policy
            if episode > 500 and start_updated:
                epsilon = 0.1
            # Run the simulation
            finished = False
            # Pick a random starting location
            location = random.choice(starts)
            speed = (0, 0)
            states = []
            past_actions = []
            rewards = [0]
            while not finished:
                # Remember the state
                state = location + speed
                states.append(state)
                # Get the action from the behavior policy
                action = softPolicy(policy, epsilon, state, actions)
                # Remember the action
                past_actions.append(action)
                # Update the speed. Check that it is within bounds
                new_ys = min(MAX_SPEED, max(-MAX_SPEED, speed[0] + action[0]))
                new_xs = min(MAX_SPEED, max(-MAX_SPEED, speed[1] + action[1]))
                # If the total speed is too high, then don't update
                if abs(new_xs) + abs(new_ys) <= MAX_SPEED:
                    speed = (new_ys, new_xs)
                # Update the location, and check each space along the way
                total_speed = abs(speed[0]) + abs(speed[1])
                for step in range(1, total_speed+1):
                    new_y = location[0] + round( (step/total_speed) * speed[0])
                    new_x = location[1] + round( (step/total_speed) * speed[1])
                    # Handle locations and 0 speeds
                    if new_y < 0 or new_y >= len(track) or new_x < 0 or new_x >= len(track[0]) or track[new_y][new_x] == "W":
                        # Hit a wall. Reset.
                        location = random.choice(starts)
                        speed = (0, 0)
                    elif track[new_y][new_x] == "F":
                        # Finished.
                        finished = True
                        rewards.append(0)
                        break
                # Otherwise we move to the next location
                location = (location[0] + speed[0], location[1] + speed[1])
                # Negative reward for every step that still hasn't finished
                if not finished:
                    rewards.append(-1)
            # Reverse the states and action and then go over them in order
            states.reverse()
            past_actions.reverse()
            # Notice that rewards has one more element that states and actions, so
            # we will be looking at R_{t+1}
            rewards.reverse()
            chain_length = 0
            # The gain and importance ratio
            G = 0
            # The last step must be shared with the target policy network
            W = 1
            for state, action, reward in zip(states, past_actions, rewards):
                chain_length += 1
                if (state[0], state[1]) in starts and not start_updated:
                    start_updated = True
                    print("Reached a starting location.")
                G = gamma * G + reward
                # Update the importance weights
                a_idx = actions.index(action)
                C[state][a_idx] += W
                Q[state][a_idx] = Q[state][a_idx] + W / C[state][a_idx] * (G - Q[state][a_idx])
                # Update the policy the policy
                policy[state] = numpy.argmax(Q[state])
                # If this isn't something that our policy would have done, then we quit
                if a_idx != policy[state]:
                    # You may be curious to see how much learning is happening
                    #print(f"Chain length {chain_length}")
                    break
                else:
                    # Since the policy would take this action, its probability is 1
                    W = W / softProbs(policy, epsilon, state, actions)[a_idx]

# Save the policy
        # TODO Also allow loading a policy
        with open("policy.pkl", "wb") as outfile:
            pickle.dump((policy, Q), outfile)

else:
        with open(sys.argv[2], 'rb') as policyfile:
            policy, Q = pickle.load(policyfile)

# debugging/information
    print(f"Actions are {actions}")
    for start in starts:
        print(f"At location {start} Q has action scores {Q[start + (0, 0)]}")
        print(f"At location {start} policy is {policy[start + (0, 0)]}, which is {actions[policy[start + (0, 0)]]}")

# Demonstrate a race
    finished = False
    # Pick a random starting location
    location = random.choice(starts)
    speed = (0, 0)
    while not finished:
        # Print out the track
        for y in range(len(track)):
            for x in range(len(track[0])):
                if (y, x) == location:
                    print("C", end="")
                else:
                    print(track[y][x], end="")
            print("")
        print(f"\nSpeed: {speed}\n\n")
        # Get the action from the target policy
        a_idx = policy[location+speed]
        action = actions[a_idx]
        # Update the speed. Check that it is within bounds
        new_ys = min(MAX_SPEED, max(-MAX_SPEED, speed[0] + action[0]))
        new_xs = min(MAX_SPEED, max(-MAX_SPEED, speed[1] + action[1]))
        # If the total speed is too high, then don't update
        if abs(new_xs) + abs(new_ys) <= MAX_SPEED:
            speed = (new_ys, new_xs)
        # Update the location, and check each space along the way
        total_speed = abs(speed[0]) + abs(speed[1])
        for step in range(1, total_speed+1):
            new_y = location[0] + round( (step/total_speed) * speed[0])
            new_x = location[1] + round( (step/total_speed) * speed[1])
            # Handle locations and 0 speeds
            if new_y < 0 or new_y >= len(track) or new_x < 0 or new_x >= len(track[0]) or track[new_y][new_x] == "W":
                # Hit a wall. Reset.
                location = random.choice(starts)
                speed = (0, 0)
            elif track[new_y][new_x] == "F":
                # Finished.
                finished = True
        # Otherwise we move to the next location
        location = (location[0] + speed[0], location[1] + speed[1])
```

-v-

## Curses Version

```python
import curses
import random
import sys
import time

import numpy
import pickle

def curses_simulate(stdscr, policy, actions, track, starts, MAX_SPEED):
    # Clear screen and hide the cursor
    curses.curs_set(0)
    stdscr.clear()

# Get screen dimensions, but we're ignoring them for now
    scr_height, scr_width = stdscr.getmaxyx()

# Demonstrate a race
    finished = False
    # Pick a random starting location
    location = random.choice(starts)
    speed = (0, 0)
    while not finished:
        # Print out the track
        for y in range(len(track)):
            for x in range(len(track[0])):
                if (y, x) == location:
                    stdscr.addch(y, x, 'C')
                else:
                    stdscr.addch(y, x, track[y][x])
        # Refresh the screen to show changes
        stdscr.refresh()
        time.sleep(0.5)
        #print(f"\nSpeed: {speed}\n\n")
        # Get the action from the target policy
        a_idx = policy[location+speed]
        action = actions[a_idx]
        # Update the speed. Check that it is within bounds
        new_ys = min(MAX_SPEED, max(-MAX_SPEED, speed[0] + action[0]))
        new_xs = min(MAX_SPEED, max(-MAX_SPEED, speed[1] + action[1]))
        # If the total speed is too high, then don't update
        if abs(new_xs) + abs(new_ys) <= MAX_SPEED:
            speed = (new_ys, new_xs)
        # Update the location, and check each space along the way
        total_speed = abs(speed[0]) + abs(speed[1])
        for step in range(1, total_speed+1):
            new_y = location[0] + round( (step/total_speed) * speed[0])
            new_x = location[1] + round( (step/total_speed) * speed[1])
            # Handle locations and 0 speeds
            if new_y < 0 or new_y >= len(track) or new_x < 0 or new_x >= len(track[0]) or track[new_y][new_x] == "W":
                # Hit a wall. Reset.
                location = random.choice(starts)
                speed = (0, 0)
            elif track[new_y][new_x] == "F":
                # Finished.
                finished = True
        # Otherwise we move to the next location
        location = (location[0] + speed[0], location[1] + speed[1])

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("This program needs a map as its argument!")
        exit(0)

track = []
    with open(sys.argv[1], 'r') as mapfile:
        for line in mapfile:
            track.append(line[:-1])

# Make a list of all possible actions
    actions = []
    for xdelta in [-1, 0, 1]:
        for ydelta in [-1, 0, 1]:
            actions.append((ydelta, xdelta))

MAX_SPEED = 3

# Do we need to train, or was a policy provided?
    if len(sys.argv) < 3:
        # Need to train

## Keep track if we've every actually updated Q values at the starting locations
        start_updated = False

# Save the policy
        with open("policy.pkl", "wb") as outfile:
            pickle.dump((policy, Q), outfile)

else:
        with open(sys.argv[2], 'rb') as policyfile:
            policy, Q = pickle.load(policyfile)

curses.wrapper(curses_simulate, policy, actions, track, starts, MAX_SPEED)

```

-v-

## Track 1

-v-

## Track 2

```
WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW
WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW
WWWW                                  WWWWW
WWWW                                     WW
WWWW                                      W
WW                                        F
WW         WWWWWWWWWWWWWWWWWW             F
WW        WWWWWWWWWWWWWWWWWWWWWW          F
WW        WWWWWWWWWWWWWWWWWWWWWWWW        F
WW        WWWWWWWWWWWWWWWWWWWWWWWWW       F
WW        WWWWWWWWWWWWWWWWWWWWWWWWWWW     F
WW      WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW
WW      WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW
WW      WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW
WW      WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW
WW      WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW
WW      WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW
WW      WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW
WW      WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW
WW      WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW
WWSSSSSSWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW
```

-v-

## Track 3

* For debugging

```
WWWWWWWWWWWWWWW
WWW           F
WW            F
WW            F
WW        WWWWW
WW       WWWWWW
WW      WWWWWWW
WWSSSSWWWWWWWWW
```

-v-

## Track 4

* Difficult

```
WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW
WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW
WWWW                                  WWWWW
WWWW                                     WW
WWWW                                      W
WW                                        W
WW         WWWWWWWWWWWWWWWWWW             W
WW        WWWWWWWWWWWWWWWWWWWWWW          W
WW        WWWWWWWWWWWWWWWWWWWWWWWW        W
WW        WWWWWWWWWWWWWWWWWWWWWWWWW       W
WW        WWWWWWWWWWFFWWWWWWWWWWWWWWW     W
WW      WWWWWWWWWW    WWWWWWWWWWWWWW     WW
WW      WWWWWWWWW    WWWWWWWWWWWWWWW     WW
WW      WWWWWWW     WWWWWWWWWWWWWWWW     WW
WW      WWWWWWW  WWWWWWWWWWWWWWWWWWW     WW
WW      WWWWWWW  WWWWWWWWWWWWWWWWWWW     WW
WW      WWWWWWW                        WWWW
WW      WWWWWWW                       WWWWW
WW      WWWWWWW                       WWWWW
WW      WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW
WWSSSSSSWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW
```

---

## Episodes and Resetting

* Why do we reset? Shouldn't the bad moves be excluded from learning?
  * We need to assign some reward to the bad moves
* Since they must come earlier than the good moves that finally reach the finish, they will have a more negative reward
  * This does penalize a route that almost makes it before hitting a wall, but less of that should happen as the agent improves

---

## Comments

* This works great!
  * Until it doesn't. Try track 4.
* Exploration is basically random, so with a very large track this would take forever
* To speed things up, we could start by training with starting locations closer to the end, and gradually move them away
  * But that starts moving back towards random starts
* There is a better solution, which we'll discuss next time