# CS 530 - Lecture 12 ## Monte Carlo ### On and Off Policy ### Prediction and Control Bernhard Firner 2026-03-10 --- ## Review * We've been talking about Monte Carlo simulations * To predict the *utility* of a policy by finding the expected reward from all states * We'll extend this to finding the policy itself, which will be called the *control* problem * This sets up reinforcement learning --- ## Reading * Russel and Norvig * Monte Carlo is mentioned several times, but section 13.4 is most relevant * RL is chapter 23 in Russel and Norvig * Or Chapter 5 in Sutton and Barto reinforcement learning book * Russel and Norvig measure utility (U), the RL book uses value (V), but they are mostly equivalent --- ## Monte Carlo Prediction * Last time we wanted to evaluation a policy, $\pi$ * This problem formulation is called *prediction*, since we are predicting the utility/value of the policy --- ## Monte Carlo Control * Then we found the best policy * This is called *control*, because the policy guides exploration * We do need to ensure that we visit every state * With blackjack we simply randomly select a state to start --- ## Algorithm per Simulation * Generate a state, $S_0$, $A_0$ randomly so that all initial states and actions will be explored * Play to the end state, using policy $\pi$ * Initialize a total reward $G \leftarrow 0$, and time $t \leftarrow T$ * Loop through the episode * $G \leftarrow \gamma G + R_{t+1}$ * Append G to a log of returns for the pair, $S_t, A_t$ * $Q(S_t, A_t) \leftarrow average(Returns(S_t, A_t))$ * $\pi(S_t) \leftarrow \underset{a}{argmax} Q(S_t, a)$ --- ## Idea * This estimate rewards: * $Q(S_t, A_t) \leftarrow average(Returns(S_t, A_t))$ * And used them to estimate a policy: * $\pi(S_t) \leftarrow \underset{a}{argmax} Q(S_t, a)$ * The Q estimates improve as the policy improves, and eventually we converge to our final policy --- ## How We Search * The optimal policy we discover is also the policy guiding search * This is good, it reduces time spent exploring states that we won't encounter * This is bad, it means that we have to be able to begin at a random state * And *those* may not be common * So there are benefits to exploration with a different policy --- ## On-Policy, Off-Policy * On-policy means ising the same optimizing policy to search * An off-policy search means using something different * For example, let's accept some error, $\epsilon$, during exploration * We'll add some randomness to the off-policy --- ## Soft Policy * A $\sigma$-soft policy could look like this * Ensure $\pi(a|s) \geq \frac{\epsilon}{|A(s)|}$ * Every action will be taken with probability at least $\frac{\epsilon}{|A(s)|}$ * The "best" action is taken greedily with probability $1 - \epsilon + \frac{\epsilon}{|A(s)|}$ * Where $|A(s)|$ is the total number of actions --- ## Updates * Now, $Q(S_t, A_t) \leftarrow average(Returns(S_t, A_t))$ * $A^* \leftarrow \underset{a}{argmax}Q(S_t, a)$ * Exploration policy is * choose $a = A^*$ with probability $1 - \epsilon + \frac{\epsilon}{|A(s)|}$ * choose any other action with probability $\frac{\epsilon}{|A(s)|}$ --- ## Control Estimate * When done (after some chosen number of iterations) * select $\pi(s) \leftarrow \underset{a}{argmax} Q(s, a)$ for all $s \in S$ --- ## Complaints * We're behaving non-optimally to explore, yet we want to infer the optimal action * By making $\epsilon$ arbitrarily small, we can approach the optimal policy, but now we're tuning a hyperparameter --- ## Example Problem
* We begin at S and end at G * If $\epsilon$ is too large, we'll keep falling into the cliff, so safer paths seem better * Some small value of $\epsilon$ will find the best policy, but now we need to tune it based upon the length of the path
Cliff walking example from Sutton and Barto
--- ## The Solution * We shouldn't force our optimal policy to be related to our policy used for exploration * Instead, we'll use two different ones * The **target policy** is what we want to learn * The **behavior policy** is how we learn about it * This is **off-policy** search --- ## Behavior Policy * Let's call this one $b$ * How do we learn about $\pi$ from $b$? * Coverage: if $\pi(a|s) > 0$ then $b(a|s) > 0$ * Otherwise we aren't exploring enough to evaluate that action * $b$ is stochastic, $\pi$ is likely not --- ## Importance Sampling * Let's say that $\pi$ and $b$ having nothing to do with one another * $b$ will choose some actions, $A_1, A_2, ...$ and get some reward * How does that inform our target policy, $\pi$? * We will weight rewards by the likelihood of that series of actions coming from $\pi$ --- ## Importance Sampling Ratio $\rho_{t:T-1} = \frac{\prod_{k=t}^{T-1}\pi(A_k|S_k)p(S_{k+1}|S_k,A_k)}{\prod_{k=t}^{T-1}b(A_k|S_k)p(S_{k+1}|S_k,A_k}$ * The probabilities of seeing a state given the previous state and action are the same for $\pi$ or $b$ and cancel $ \rho_{t:T-1} = \prod_{k=t}^{T-1}\frac{\pi(A_k|S_k)}{b(A_k|S_k)}$ --- ## Interpretation $ \rho_{t:T-1} = \prod_{k=t}^{T-1}\frac{\pi(A_k|S_k)}{b(A_k|S_k)}$ * The importance sampling ratio gives a weight to anything discovered by off-policy search * If $\pi(A_k|S_k)$ is 0, then the target policy would never choose that action, so the important is 0 * For example, if $\pi$ would never jump off of the cliff, and $b$ does, then we ignore that sequence --- ## Utility Estimates * $\rho$ makes it possible to estimate the utility of a policy * We complete an episode and use the final reward (called gain, $G$) to update our expectations * $\mathbb{E}[\rho_{t:T-1}G_t | S_t = s] = U_{\pi}(s)$ --- ## Bookkeeping * Let's clarify some notation, then we'll estimate the policy * Let's never reset t to 0 after episodes, just keep counting * $\tau(s)$ is the number of times we've visited state $s$ * $T(t)$ is the time at the end of the episode that includes $t$ * $\\{G_t\\}_{t\in \tau(s)}$ are the returns from state $s$ * $\\{\rho_{t:T(t)-1}\\}_{t\in \tau(s)}$ are the corresponding important sampling ratios --- ## Target Policy Estimation $ U(s) \triangleq \frac{\Sigma_{t\in \tau(s)} \rho_{t:T(t)}G_t}{|\tau(s)|}$ * In words, the utility of state $s$ is the sum of the product of the importance ratio and gain for every episode when we have encountered the state, divided by the total number of times the state has been encountered --- ## Variance * Is that equation safe to use? * Whenever we make an estimate, we always run the risk of high-variance problems * $ U(s) \triangleq \frac{\Sigma_{t\in \tau(s)} \rho_{t:T(t)}G_t}{|\tau(s)|}$ * The $\rho_{t:T(t)}G_t$ is expected to be $u_\pi(s)$ * What if there is a single sample and the importance ratio is large? * Then this single sample is being given a lot of (unearned) importance --- ## Weighted Importance Sampling * Instead, we can bias our prediction to $u_b(s)$, the exploration policy * This is not *correct*, but has less variance and will still converge * $ U(s) \triangleq \frac{\Sigma_{t\in \tau(s)} \rho_{t:T(t)}G_t}{\rho_{t:T(t)}}$ * If $\rho > 1$ this will obviously reduce our estimate * If $\rho < 1$ it increases it, biasing us towards $u_b$ rather than $u_\pi$ * But if $\rho = 0$ we still treat this as 0 --- ## Why? * I know that we've just waded through a bunch of math * And we'll do a cool example soon, I promise * We just need to make sure our cool example actually works --- ## Infinite Variance
* Ordinary importance sampling can break down * Imagine a machine that sometimes gives you a reward if you poke it enough * Action "left" pokes it, action right "right" does nothing * The target policy we want to use is poke it until you get a reward
See example 5.5 from Sutton and Barto
--- ## Infinite Variance
* Any time the behavior policy chooses right, the importance ratio will be 0 * So we will only estimate $u_\pi$ from episodes where $b$ only chose left * What is the variance of the estimate?
See example 5.5 from Sutton and Barto
--- ## Infinite Variance * We have our equations: * $\rho_{t:T-1} = \prod_{k=t}^{T-1}\frac{\pi(A_k|S_k)}{b(A_k|S_k)}$ * $ U(s) \triangleq \frac{\Sigma_{t\in \tau(s)} \rho_{t:T(t)}G_t}{|\tau(s)|}$ * Recall that: * $Var[X] \triangleq \mathbb{E}[(X - \bar{X})^2] = \mathbb[X^2 - 2X\bar{X} + \bar{X}^2] = \mathbb{E}[X^2] - \bar{X}^2$ --- ## Expected Values * So what is $\mathbb{E}\left[\left(\prod_{t=0}^{T-1}\frac{\pi(A_k|S_k)}{b(A_k|S_k)}G_0\right)^2\right]$ * There is only one possible value for $G$: 1 * The importance of any sequence is going to scale with how much more likely it was from $\pi$ than with $b$ --- ## Enumeration * Consider chains when only left is chosen * A length 1 episode occurs when b chooses left (p = 0.5) and we win immediately (p = 0.1) * The importance sampling ratio is $\frac{1}{0.5}$ * So the first term of our expectation is $0.5\times 0.1\times\left(\frac{1}{0.5}\right)^2$ --- ## Enumeration * The next term will include choosing left (p = 0.5) and not winning (p = 0.9) * $0.5\times 0.9 \times 0.5\times 0.1\times\left(\frac{1}{0.5}\right)^2$ * This goes to the limit: $0.1\Sigma_{k=0}^{\infty}0.9^k \times 2^k \times 2 = \infty$ * If that part of the variance is infinite, then so is the entire thing * So we don't want to use ordinary (unweighted) importance sampling --- ## Fun Example * Now we can do off-policy Monte Carlo control --- ## Off-Policy MC control * Begin by initializing some variables: * $Q(s,a) \leftarrow \mathbb{R}$ (give the Q function some random scores) * $C(s,a) \leftarrow 0$ (this is the importance weighted count of action a at state s) * $\pi(s) \leftarrow \underset{a}{argmax}Q(s, a)$ (we always need a target policy) * $b$ is going to be any soft policy (a small $\epsilon$-soft policy, for example) --- ## Initial Values * Note that you can't just put any values into Q * If you are using Q in your behavior policy, then it shouldn't be biased towards awful actions * Such as actions that would get the agent stuck * One option is to initialize it with very negative rewards, that will quickly be replaced by anything found through exploration --- ## Off-Policy MC Control * Here's our learning loop for each episode:
* $G \leftarrow 0$ * $W \leftarrow 0$ * Complete the episode, using $b$
* Loop from $t=T-1, T-2, ..., 0:$ * $G \leftarrow \gamma G + R_{t+1}$ * $C(S_t,A_t) \leftarrow C(S_t,A_t) + W$ * $Q(S_t,A_t) \leftarrow Q(S_t,A_t) \frac{W}{C(S_t,A_t)}[G - Q(s_t,A_t)]$ * $\pi(S_t) \leftarrow \underset{a}{argmax}Q(S_t,a)$ * If $A_t \neq \pi(S_t) $ then break * Else $W \leftarrow \frac{1}{b(A_t|S_t)}$ --- ## Not So Bad! * It isn't so bad * Let's do an example * Imagine we have a racetrack and a racecar * We'll model the track with some ascii art --- ## Racetrack
* The car begins on any S and finishes on any F * If it hits the wall, W, it resets to the beginning * X and Y speeds are controlled independently, and can be changed by (0, -1, or +1) at each step * We don't allow both to be 0 (except at the start)
``` WWWWWWWWWWWWWWW WWW F WWW F WWW F W F W WWWWWW W WWWWWW W WWWWWW W WWWWWW W WWWWWW W WWWWWWWW W WWWWWWWW W WWWWWWWW W WWWWWWWW W WWWWWWWW W WWWWWWWW W WWWWWWWW W WWWWWWWW W WWWWWWWW WSSSSSSWWWWWWWW ```
--- ## Racetrack
* The maximum speed is 5, and the car must be moving up and right * A reward of -1 is applied for every step before the finish
``` WWWWWWWWWWWWWWW WWW F WWW F WWW F W F W WWWWWW W WWWWWW W WWWWWW W WWWWWW W WWWWWW W WWWWWWWW W WWWWWWWW W WWWWWWWW W WWWWWWWW W WWWWWWWW W WWWWWWWW W WWWWWWWW W WWWWWWWW W WWWWWWWW WSSSSSSWWWWWWWW ```
--- ## Modifications * That's the setup for the version in Sutton and Barto's book * I'm going to allow the car to move in any direction, which will make things more complicated * But will allow more for interesting tracks --- ## Code ```python import random import sys import numpy import pickle def softProbs(target_policy, epsilon, state, actions): # Get the probability of the soft policy selecting an action in a state # b(A_t|S_t) # Our behavioral policy will be similar to the target policy # Action probabilities sum to 1, with epsilon distributed over all possible actions action_probs = [epsilon / len(actions)]*len(actions) action_probs[target_policy[state]] += 1 - epsilon return action_probs def softPolicy(target_policy, epsilon, state, actions): # Select from one of the possible actions based upon our behavior policy. action_probs = softProbs(target_policy, epsilon, state, actions) return random.choices(actions, action_probs, k=1)[0] if __name__ == "__main__": if len(sys.argv) < 2: print("This program needs a map as its argument!") exit(0) track = [] with open(sys.argv[1], 'r') as mapfile: for line in mapfile: track.append(line[:-1]) # Find all of the starting and finish locations starts = [(y, x) for y, row in enumerate(track) for x, val in enumerate(row) if val == 'S'] finishes = [(y, x) for y, row in enumerate(track) for x, val in enumerate(row) if val == 'F'] # Make a list of all possible actions actions = [] for xdelta in [-1, 0, 1]: for ydelta in [-1, 0, 1]: actions.append((ydelta, xdelta)) MAX_SPEED = 3 # Do we need to train, or was a policy provided? if len(sys.argv) < 3: # Need to train # Initialize our starting variables epsilon = 0.5 gamma = 1 Q = {} C = {} policy = {} for y in range(len(track)): for x in range(len(track[0])): s = (y,x) # The full state is the location and the current speed # Enumerate all possible speeds # Going to allow speeds from -MAX_SPEED to MAX_SPEED, so allow for 2*MAX_SPEED+1 states. # Note: Technically, since the max speed is MAX_SPEED, we don't # need this many states because, for example, MAX_SPEED,MAX_SPEED # isn't allowed. for yspeed in range(2*MAX_SPEED+1): for xspeed in range(2*MAX_SPEED+1): s = (y, x, yspeed - MAX_SPEED, xspeed - MAX_SPEED) # In the book, there are access with Q(s, a), but I'll # store them as an array indexed by s and then a. # This makes it easier to select an action. Q[s] = [] C[s] = [] for i, action in enumerate(actions): # Make the initial action scores a bit random since we don't # know the true rewards. # Do bias then network to go up (-y) and to the right (+x) so that we don't sit on the start forever bias = -s[2]*10 + s[3]/2 Q[s].append(bias + random.choices([-10000, -10001, -10002])[0]) C[s].append(0) # Initialize the policy policy[s] = numpy.argmax(Q[s]) print(f"There are {len(Q)} possible states.") print("Beginning episodes.") ## Keep track if we've every actually updated Q values at the starting locations start_updated = False for episode in range(5000): if episode % 100 == 0: print(f"Beginning episode {episode}") # Update epsilon to slowly converge on the target policy if episode > 500 and start_updated: epsilon = 0.1 # Run the simulation finished = False # Pick a random starting location location = random.choice(starts) speed = (0, 0) states = [] past_actions = [] rewards = [0] while not finished: # Remember the state state = location + speed states.append(state) # Get the action from the behavior policy action = softPolicy(policy, epsilon, state, actions) # Remember the action past_actions.append(action) # Update the speed. Check that it is within bounds new_ys = min(MAX_SPEED, max(-MAX_SPEED, speed[0] + action[0])) new_xs = min(MAX_SPEED, max(-MAX_SPEED, speed[1] + action[1])) # If the total speed is too high, then don't update if abs(new_xs) + abs(new_ys) <= MAX_SPEED: speed = (new_ys, new_xs) # Update the location, and check each space along the way total_speed = abs(speed[0]) + abs(speed[1]) for step in range(1, total_speed+1): new_y = location[0] + round( (step/total_speed) * speed[0]) new_x = location[1] + round( (step/total_speed) * speed[1]) # Handle locations and 0 speeds if new_y < 0 or new_y >= len(track) or new_x < 0 or new_x >= len(track[0]) or track[new_y][new_x] == "W": # Hit a wall. Reset. location = random.choice(starts) speed = (0, 0) elif track[new_y][new_x] == "F": # Finished. finished = True rewards.append(0) break # Otherwise we move to the next location location = (location[0] + speed[0], location[1] + speed[1]) # Negative reward for every step that still hasn't finished if not finished: rewards.append(-1) # Reverse the states and action and then go over them in order states.reverse() past_actions.reverse() # Notice that rewards has one more element that states and actions, so # we will be looking at R_{t+1} rewards.reverse() chain_length = 0 # The gain and importance ratio G = 0 # The last step must be shared with the target policy network W = 1 for state, action, reward in zip(states, past_actions, rewards): chain_length += 1 if (state[0], state[1]) in starts and not start_updated: start_updated = True print("Reached a starting location.") G = gamma * G + reward # Update the importance weights a_idx = actions.index(action) C[state][a_idx] += W Q[state][a_idx] = Q[state][a_idx] + W / C[state][a_idx] * (G - Q[state][a_idx]) # Update the policy the policy policy[state] = numpy.argmax(Q[state]) # If this isn't something that our policy would have done, then we quit if a_idx != policy[state]: # You may be curious to see how much learning is happening #print(f"Chain length {chain_length}") break else: # Since the policy would take this action, its probability is 1 W = W / softProbs(policy, epsilon, state, actions)[a_idx] # Save the policy # TODO Also allow loading a policy with open("policy.pkl", "wb") as outfile: pickle.dump((policy, Q), outfile) else: with open(sys.argv[2], 'rb') as policyfile: policy, Q = pickle.load(policyfile) # debugging/information print(f"Actions are {actions}") for start in starts: print(f"At location {start} Q has action scores {Q[start + (0, 0)]}") print(f"At location {start} policy is {policy[start + (0, 0)]}, which is {actions[policy[start + (0, 0)]]}") # Demonstrate a race finished = False # Pick a random starting location location = random.choice(starts) speed = (0, 0) while not finished: # Print out the track for y in range(len(track)): for x in range(len(track[0])): if (y, x) == location: print("C", end="") else: print(track[y][x], end="") print("") print(f"\nSpeed: {speed}\n\n") # Get the action from the target policy a_idx = policy[location+speed] action = actions[a_idx] # Update the speed. Check that it is within bounds new_ys = min(MAX_SPEED, max(-MAX_SPEED, speed[0] + action[0])) new_xs = min(MAX_SPEED, max(-MAX_SPEED, speed[1] + action[1])) # If the total speed is too high, then don't update if abs(new_xs) + abs(new_ys) <= MAX_SPEED: speed = (new_ys, new_xs) # Update the location, and check each space along the way total_speed = abs(speed[0]) + abs(speed[1]) for step in range(1, total_speed+1): new_y = location[0] + round( (step/total_speed) * speed[0]) new_x = location[1] + round( (step/total_speed) * speed[1]) # Handle locations and 0 speeds if new_y < 0 or new_y >= len(track) or new_x < 0 or new_x >= len(track[0]) or track[new_y][new_x] == "W": # Hit a wall. Reset. location = random.choice(starts) speed = (0, 0) elif track[new_y][new_x] == "F": # Finished. finished = True # Otherwise we move to the next location location = (location[0] + speed[0], location[1] + speed[1]) ``` -v- ## Curses Version ```python import curses import random import sys import time import numpy import pickle def curses_simulate(stdscr, policy, actions, track, starts, MAX_SPEED): # Clear screen and hide the cursor curses.curs_set(0) stdscr.clear() # Get screen dimensions, but we're ignoring them for now scr_height, scr_width = stdscr.getmaxyx() # Demonstrate a race finished = False # Pick a random starting location location = random.choice(starts) speed = (0, 0) while not finished: # Print out the track for y in range(len(track)): for x in range(len(track[0])): if (y, x) == location: stdscr.addch(y, x, 'C') else: stdscr.addch(y, x, track[y][x]) # Refresh the screen to show changes stdscr.refresh() time.sleep(0.5) #print(f"\nSpeed: {speed}\n\n") # Get the action from the target policy a_idx = policy[location+speed] action = actions[a_idx] # Update the speed. Check that it is within bounds new_ys = min(MAX_SPEED, max(-MAX_SPEED, speed[0] + action[0])) new_xs = min(MAX_SPEED, max(-MAX_SPEED, speed[1] + action[1])) # If the total speed is too high, then don't update if abs(new_xs) + abs(new_ys) <= MAX_SPEED: speed = (new_ys, new_xs) # Update the location, and check each space along the way total_speed = abs(speed[0]) + abs(speed[1]) for step in range(1, total_speed+1): new_y = location[0] + round( (step/total_speed) * speed[0]) new_x = location[1] + round( (step/total_speed) * speed[1]) # Handle locations and 0 speeds if new_y < 0 or new_y >= len(track) or new_x < 0 or new_x >= len(track[0]) or track[new_y][new_x] == "W": # Hit a wall. Reset. location = random.choice(starts) speed = (0, 0) elif track[new_y][new_x] == "F": # Finished. finished = True # Otherwise we move to the next location location = (location[0] + speed[0], location[1] + speed[1]) def softProbs(target_policy, epsilon, state, actions): # Get the probability of the soft policy selecting an action in a state # b(A_t|S_t) # Our behavioral policy will be similar to the target policy # Action probabilities sum to 1, with epsilon distributed over all possible actions action_probs = [epsilon / len(actions)]*len(actions) action_probs[target_policy[state]] += 1 - epsilon return action_probs def softPolicy(target_policy, epsilon, state, actions): # Select from one of the possible actions based upon our behavior policy. action_probs = softProbs(target_policy, epsilon, state, actions) return random.choices(actions, action_probs, k=1)[0] if __name__ == "__main__": if len(sys.argv) < 2: print("This program needs a map as its argument!") exit(0) track = [] with open(sys.argv[1], 'r') as mapfile: for line in mapfile: track.append(line[:-1]) # Find all of the starting and finish locations starts = [(y, x) for y, row in enumerate(track) for x, val in enumerate(row) if val == 'S'] finishes = [(y, x) for y, row in enumerate(track) for x, val in enumerate(row) if val == 'F'] # Make a list of all possible actions actions = [] for xdelta in [-1, 0, 1]: for ydelta in [-1, 0, 1]: actions.append((ydelta, xdelta)) MAX_SPEED = 3 # Do we need to train, or was a policy provided? if len(sys.argv) < 3: # Need to train # Initialize our starting variables epsilon = 0.5 gamma = 1 Q = {} C = {} policy = {} for y in range(len(track)): for x in range(len(track[0])): s = (y,x) # The full state is the location and the current speed # Enumerate all possible speeds # Going to allow speeds from -MAX_SPEED to MAX_SPEED, so allow for 2*MAX_SPEED+1 states. # Note: Technically, since the max speed is MAX_SPEED, we don't # need this many states because, for example, MAX_SPEED,MAX_SPEED # isn't allowed. for yspeed in range(2*MAX_SPEED+1): for xspeed in range(2*MAX_SPEED+1): s = (y, x, yspeed - MAX_SPEED, xspeed - MAX_SPEED) # In the book, there are access with Q(s, a), but I'll # store them as an array indexed by s and then a. # This makes it easier to select an action. Q[s] = [] C[s] = [] for i, action in enumerate(actions): # Make the initial action scores a bit random since we don't # know the true rewards. # Do bias then network to go up (-y) and to the right (+x) so that we don't sit on the start forever bias = -s[2]*10 + s[3]/2 Q[s].append(bias + random.choices([-10000, -10001, -10002])[0]) C[s].append(0) # Initialize the policy policy[s] = numpy.argmax(Q[s]) print(f"There are {len(Q)} possible states.") print("Beginning episodes.") ## Keep track if we've every actually updated Q values at the starting locations start_updated = False for episode in range(5000): if episode % 100 == 0: print(f"Beginning episode {episode}") # Update epsilon to slowly converge on the target policy if episode > 500 and start_updated: epsilon = 0.1 # Run the simulation finished = False # Pick a random starting location location = random.choice(starts) speed = (0, 0) states = [] past_actions = [] rewards = [0] while not finished: # Remember the state state = location + speed states.append(state) # Get the action from the behavior policy action = softPolicy(policy, epsilon, state, actions) # Remember the action past_actions.append(action) # Update the speed. Check that it is within bounds new_ys = min(MAX_SPEED, max(-MAX_SPEED, speed[0] + action[0])) new_xs = min(MAX_SPEED, max(-MAX_SPEED, speed[1] + action[1])) # If the total speed is too high, then don't update if abs(new_xs) + abs(new_ys) <= MAX_SPEED: speed = (new_ys, new_xs) # Update the location, and check each space along the way total_speed = abs(speed[0]) + abs(speed[1]) for step in range(1, total_speed+1): new_y = location[0] + round( (step/total_speed) * speed[0]) new_x = location[1] + round( (step/total_speed) * speed[1]) # Handle locations and 0 speeds if new_y < 0 or new_y >= len(track) or new_x < 0 or new_x >= len(track[0]) or track[new_y][new_x] == "W": # Hit a wall. Reset. location = random.choice(starts) speed = (0, 0) elif track[new_y][new_x] == "F": # Finished. finished = True rewards.append(0) break # Otherwise we move to the next location location = (location[0] + speed[0], location[1] + speed[1]) # Negative reward for every step that still hasn't finished if not finished: rewards.append(-1) # Reverse the states and action and then go over them in order states.reverse() past_actions.reverse() # Notice that rewards has one more element that states and actions, so # we will be looking at R_{t+1} rewards.reverse() chain_length = 0 # The gain and importance ratio G = 0 # The last step must be shared with the target policy network W = 1 for state, action, reward in zip(states, past_actions, rewards): chain_length += 1 if (state[0], state[1]) in starts and not start_updated: start_updated = True print("Reached a starting location.") G = gamma * G + reward # Update the importance weights a_idx = actions.index(action) C[state][a_idx] += W Q[state][a_idx] = Q[state][a_idx] + W / C[state][a_idx] * (G - Q[state][a_idx]) # Update the policy the policy policy[state] = numpy.argmax(Q[state]) # If this isn't something that our policy would have done, then we quit if a_idx != policy[state]: # You may be curious to see how much learning is happening #print(f"Chain length {chain_length}") break else: # Since the policy would take this action, its probability is 1 W = W / softProbs(policy, epsilon, state, actions)[a_idx] # Save the policy with open("policy.pkl", "wb") as outfile: pickle.dump((policy, Q), outfile) else: with open(sys.argv[2], 'rb') as policyfile: policy, Q = pickle.load(policyfile) # debugging/information print(f"Actions are {actions}") for start in starts: print(f"At location {start} Q has action scores {Q[start + (0, 0)]}") print(f"At location {start} policy is {policy[start + (0, 0)]}, which is {actions[policy[start + (0, 0)]]}") curses.wrapper(curses_simulate, policy, actions, track, starts, MAX_SPEED) ``` -v- ## Track 1 ``` WWWWWWWWWWWWWWW WWW F WWW F WWW F W F W WWWWWW W WWWWWW W WWWWWW W WWWWWW W WWWWWW W WWWWWWWW W WWWWWWWW W WWWWWWWW W WWWWWWWW W WWWWWWWW W WWWWWWWW W WWWWWWWW W WWWWWWWW W WWWWWWWW WSSSSSSWWWWWWWW ``` -v- ## Track 2 ``` WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW WWWW WWWWW WWWW WW WWWW W WW F WW WWWWWWWWWWWWWWWWWW F WW WWWWWWWWWWWWWWWWWWWWWW F WW WWWWWWWWWWWWWWWWWWWWWWWW F WW WWWWWWWWWWWWWWWWWWWWWWWWW F WW WWWWWWWWWWWWWWWWWWWWWWWWWWW F WW WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW WW WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW WW WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW WW WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW WW WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW WW WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW WW WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW WW WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW WW WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW WWSSSSSSWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW ``` -v- ## Track 3 * For debugging ``` WWWWWWWWWWWWWWW WWW F WW F WW F WW WWWWW WW WWWWWW WW WWWWWWW WWSSSSWWWWWWWWW ``` -v- ## Track 4 * Difficult ``` WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW WWWW WWWWW WWWW WW WWWW W WW W WW WWWWWWWWWWWWWWWWWW W WW WWWWWWWWWWWWWWWWWWWWWW W WW WWWWWWWWWWWWWWWWWWWWWWWW W WW WWWWWWWWWWWWWWWWWWWWWWWWW W WW WWWWWWWWWWFFWWWWWWWWWWWWWWW W WW WWWWWWWWWW WWWWWWWWWWWWWW WW WW WWWWWWWWW WWWWWWWWWWWWWWW WW WW WWWWWWW WWWWWWWWWWWWWWWW WW WW WWWWWWW WWWWWWWWWWWWWWWWWWW WW WW WWWWWWW WWWWWWWWWWWWWWWWWWW WW WW WWWWWWW WWWW WW WWWWWWW WWWWW WW WWWWWWW WWWWW WW WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW WWSSSSSSWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW ``` --- ## Episodes and Resetting * Why do we reset? Shouldn't the bad moves be excluded from learning? * We need to assign some reward to the bad moves * Since they must come earlier than the good moves that finally reach the finish, they will have a more negative reward * This does penalize a route that almost makes it before hitting a wall, but less of that should happen as the agent improves --- ## Comments * This works great! * Until it doesn't. Try track 4. * Exploration is basically random, so with a very large track this would take forever * To speed things up, we could start by training with starting locations closer to the end, and gradually move them away * But that starts moving back towards random starts * There is a better solution, which we'll discuss next time