# CS 530 - Lecture 19

## Policy Learning

Bernhard Firner

2026-04-07

---

<div class="container">
<div class="col">
<img style="width: 60%" class="r-stretch" src="./figures/cartpole-qnn.gif" />
<br/>
<small>Cart and pole, DQN.</small>
</div>
<div class="col">
<img style="width: 60%" class="r-stretch" src="./figures/mountain-car-qnn.gif" />
<br/>
<small>Mountain car, DQN, after many attempts.</small>
</div>
<div class="col">
<img style="width: 60%" class="r-stretch" src="./figures/mountain-car-reinforce.gif" />
<br/>
<small>Mountain car, REINFORCE.</small>
</div>
</div>

## Review

* Our algorithms didn't succeed last time?
  * Why not?

</div>
<div class="col">
<img style="width: 60%" class="r-stretch" src="./figures/mountain-car-episode-0.gif" />
<br/>
<small>Result after 100 episodes.</small>

</div>
</div>

---

## Continuous Worlds

* There are a few difficulties with the mountain car scenario
  * The environment is continuous (velocity and position)
  * The actions are continuous (between -1 and 1)
  * The reward is sparse (only received after reaching the goal)

---

## Q-Learning

* Q-Learning measures rewards at each step
  * Good enough to get us out of the hole
* But then what?

</div>
<div class="col">
<img style="width: 60%" class="r-stretch" src="./figures/mountain-car-episode-0.gif" />
<br/>
<small>Result after 100 episodes.</small>

</div>
</div>

---

## Exploration

* Exploration with a $\epsilon\text-greedy$ policy relies upon random actions to explore novel state-action pairs
* That is good in some ways
  * Remember cliff walking?
  * We can explore the effects of cliff jumping, but it won't affect our policy

---

## Cliff Walking

* We know that we can solve with with any tabular method
* And we've seen that a Deep Q Network performs similarly
  * Learning is far more noisy

---

## Tabular Vs DNN

* So why use DQNs?

<div class="container">
<div class="col">
<img style="width: 80%" class="r-stretch" src="./figures/cliff-walking-doubleq-500e.gif" />
<br/>
<small>Double Q, 500 episodes.</small>
</div>
<div class="col">
<img style="width: 80%" class="r-stretch" src="./figures/cliff-walking-doubleq-1000e.gif" />
<br/>
<small>Double Q, 1000 episodes.</small>
</div>
<div class="col">
<img style="width: 80%" class="r-stretch" src="./figures/cliff-walking-dqn-500e.gif" />
<br/>
<small>DQN, 500 episodes.</small>
</div>
<div class="col">
<img style="width: 80%" class="r-stretch" src="./figures/cliff-walking-dqn-1000e.gif" />
<br/>
<small>DQN, 1000 episodes.</small>
</div>
</div>

---

## Cart and Pole

* The environment is continuous
  * Cart position and velocity, pole position and angular velocity
* The actions are discrete (move left or move right)
* The reward is rich, get +1 for each step while the pole is upright

---

## Continuous States

* The state is continuous, so tabular methods won't work
* The actions are simple though
  * We want our DQN to create a (likely) continuous function of state
* It's important to notice that the state space will be explored quickly
  * The pole falls left or right, the cart moves left or right

---

## Gradients

* Let's formally discuss what's being done here
* Loss:
    * $L[\theta] = \left(R + \gamma\underset{a}{max}Q(s_{t+1}, a) - Q(s_t, a_t)\right)^2$
* Updates:
    * $\theta \leftarrow \theta + \alpha\left(R + \gamma\underset{a}{max}Q(s_{t+1}, a) - Q(s_t,a_t)\right)\frac{\delta Q(s_t,a_t)}{\delta \theta}$
    * Where Q is a function of its parameters, $\theta$

---

## DQN Vs Tabular Q-Learning

* Convergence isn't guaranteed
  * We've all messed up some DNN paramters, right?
* More specifically, the maximum future reward changes as we change Q
  * This could lead to instabilities

---

## DQN Code

```python
class NNQ:
    def __init__(self, actions, states, alpha=0.001, gamma=1.0, use_tanh=False, history_p=0.0):
        """
        Arguments:
            actions (bool, int): Tuple of (continuous, number of actions or action outputs)
            states  (bool, int): Tuple of (continuous, number of states or state variables)
            alpha       (float): Learning rate
            gamma       (float): Discount value
            use_tanh     (bool): Use Tanh on the network output.
            history     (float): The probability of retaining past experiences after an update.
        """
        super(NNQ, self).__init__()
        self.gamma = gamma
        self.model = QModel(states=states[1], actions=actions[1], use_tanh=use_tanh)
        self.optimizer = torch.optim.AdamW(self.model.parameters(), lr=0.001, weight_decay=0.01)
        self.criterion = torch.nn.MSELoss()
        self.history_p = history_p

self.actions = actions
        if self.actions[0] is False:
            # We'll fill in the states as we observe them, but we can prefill the action vectors
            self.action_vectors = F.one_hot(torch.arange(0, self.actions[1]))
            self.action_vectors.requires_grad = False
        self.states = states

# Save up batches of 32 for learning
        self.past_state_actions = None
        self.past_rewards = None

def learn(self, force_update=False):
        # Learn at the end of an episode or when there is enough data available
        if force_update or self.past_state_actions is not None and self.past_state_actions.size(0) >= 64:
            # Zero gradients before gradient calculation
            self.optimizer.zero_grad()
            # Forward again to make a gradient
            # Update with the error
            y_hat = self.model.forward(self.past_state_actions.float())
            y = self.past_rewards
            loss = self.criterion(y_hat, y)

# Gradient calculation
            loss.backward()
            # Update weights
            self.optimizer.step()
            # Only learn from examples once
            if self.history_p == 0.0:
                self.past_state_actions = None
                self.past_rewards = None
            else:
                mask = torch.rand(self.past_state_actions.size(0)) < self.history_p
                self.past_state_actions = self.past_state_actions[mask]
                self.past_rewards = self.past_rewards[mask]

def updateQ(self, state, new_state, action, reward):
        # Calculate the expected rewards
        with torch.no_grad():
            # Build one-hot vectors if the action space is not continuous
            if self.actions[0] is False:
                action_vectors = self.action_vectors[action:action+1]
            else:
                action_vectors = torch.tensor([action])
            sa_vectors = self.makeSAVectors(state, action_vectors)
            if self.past_state_actions is None:
                self.past_state_actions = sa_vectors
            else:
                self.past_state_actions = torch.concatenate((self.past_state_actions, sa_vectors))
            # There are no more rewards after the episode is complete
            if new_state is None:
                future_reward = reward
            else:
                future_reward = reward + self.gamma * self.maxReward(new_state)
            if self.past_rewards is None:
                self.past_rewards = torch.tensor([[future_reward]]).float()
            else:
                self.past_rewards = torch.concatenate((self.past_rewards, torch.tensor([[future_reward]]).float()))

self.learn(force_update=new_state is None)

def endEpisode(self, state, action, reward):
        """Learn from the episode and reset any state-dependent variables."""
        with torch.no_grad():
            # Build one-hot vectors if the action space is not continuous
            if self.actions[0] is False:
                action_vectors = self.action_vectors[action:action+1]
            else:
                action_vectors = torch.tensor([action])
            sa_vectors = self.makeSAVectors(state, action_vectors)
            if self.past_state_actions is None:
                self.past_state_actions = sa_vectors
            else:
                self.past_state_actions = torch.concatenate((self.past_state_actions, sa_vectors))
            # There are no more rewards after the episode is complete
            future_reward = reward
            if self.past_rewards is None:
                self.past_rewards = torch.tensor([[future_reward]]).float()
            else:
                self.past_rewards = torch.concatenate((self.past_rewards, torch.tensor([[future_reward]]).float()))
        self.learn(True)

def estimateQs(self, state, action_vectors):
        sa_vectors = self.makeSAVectors(state, action_vectors)
        return self.model.forward(sa_vectors.float())

def bestAction(self, state):
        # Use sampling if the action space is continuous
        if self.actions[0]:
            # Actions are in the range from -1 to 1
            rand_actions = (torch.rand(100*self.actions[1])*2-1).view(-1, self.actions[1])
            q_estimates = self.estimateQs(state, rand_actions)
            return rand_actions[torch.argmax(q_estimates).item()]
        else:
            q_estimates = self.estimateQs(state, self.action_vectors)
            return torch.argmax(q_estimates).item()

def maxReward(self, state):
        # Use sampling if the action space is continuous
        if self.actions[0]:
            # Actions are in the range from -1 to 1
            rand_actions = (torch.rand(100*self.actions[1])*2-1).view(-1, self.actions[1])
            q_estimates = self.estimateQs(state, rand_actions)
        else:
            q_estimates = self.estimateQs(state, self.action_vectors)
        return torch.max(q_estimates).item()

def meanReward(self, state):
        # Use sampling if the action space is continuous
        if self.actions[0]:
            # Actions are in the range from -1 to 1
            rand_actions = (torch.rand(100*self.actions[1])*2-1).view(-1, self.actions[1])
            q_estimates = self.estimateQs(state, rand_actions)
        else:
            q_estimates = self.estimateQs(state, self.action_vectors)
        return torch.mean(q_estimates).item()

def makeAction(self, action):
        if self.actions[0]:
            return torch.tensor([[action]]).float()
        else:
            return self.action_vectors[action:action+1]

def makeSAVectors(self, state, actions):
        with torch.no_grad():
            # If the state space is discrete, then make the state a one hot vector
            if self.states[0] is False and type(state) is int:
                state_vector = F.one_hot(torch.tensor([state]), num_classes=self.states[1]).expand(actions.size(0), -1)
            else:
                state_vector = torch.tensor(state)
                state_vector = state_vector.expand(actions.size(0), state_vector.size(-1))
            return torch.concatenate((state_vector, actions), axis=1)
```

---

## Important Bits

```python
    def learn(self, force_update=False):
        # Learn at the end of an episode or when there is enough data available
        if force_update or self.past_state_actions is not None and self.past_state_actions.size(0) >= 64:
            # Zero gradients before gradient calculation
            self.optimizer.zero_grad()
            # Forward again to make a gradient
            # Update with the error
            y_hat = self.model.forward(self.past_state_actions.float())
            y = self.past_rewards
            loss = self.criterion(y_hat, y)

self.learn(force_update=new_state is None)
```

---

## Data Statistics

* Should we only learn on whatever is current from the last episode?
  * That could be pretty biased
  * But it may also be a better reflection of the data the policy will guide us to see
* Conventional wisdom says that history buffers are important for learning stability

---

## Observations

* Remembering past episodes seems to both help and hurt
  * Clearly, remembering a percentage of data introduces bias after long runs
  * Paradoxically, the DQN gets *worse* after doing extremely well

---

## Success, Eventually

* If we are patient, save checkpoints, and tweak things, we can get a good result
  * Or we could improve our memory system and training selection (the right solution)

---

## Continuous Actions

* The cart and pole had continuous states, but not actions
  * So let's make things more difficult
* The return of the mountain car

<div class="col">
<img style="width: 60%" class="r-stretch" src="./figures/mountain-car-episode-0.gif" />
<br/>
<small>Failed result after 100 episodes.</small>
</div>

---

## Mountain Car

* State:
  * The position is from -1.2 to 0.6
  * The velocity is from -0.07 to 0.07
* Action is a continuous value from -1 to 1 that applies a positive or negative force
* A negative reward of $-0.1 \times action^2$ is applied at each time step
* +100 is rewarded for reaching the goal

---

## What Goes Wrong?

* So what went wrong with DQN?
* We have a curse of dimensionality problem occurring
  * We need sampled data to estimate Q
  * With both the state and the action, the (s, a) input has size $\mathbb{R}^3$
* We will many new states on every episode
  * But those states won't lead to a fundamentally different result

---

* Success involves following a large set of states to the goal
* But there is some stochasticity to this
  * The longer the path, the more likely our model is to become confused

</div>
<div class="col">
<img style="width: 60%" class="r-stretch" src="./figures/mountain-car-episode-0.gif" />
</div>
</div>

---

## Following Rewards

* TD is good at training the model to get out of over-explored areas
  * But in a continuous space, it is always possible to find something unexplored
  * Especially since we use randomness for exploration
* So is there a more efficient approach?
  * Sure! What if we were learning the policy directly?

---

## Discrete and Continuous Differences

* On-policy exploration was hard in the tabular world
  * But that was because our models weren't stochastic
* A learned stochastic policy can do its own exploration
* If you think about AlphaGo, the model was guiding exploration into novel board positions

---

## Continuous, On-Policy Learning

* Q-learning's struggle is in generalizing $Q(s,a)$ values
* For example, we observe a state halfway up the right side of the hill
  * At some velocities, we cannot reach the goal (at 0.45), so the value is low
  * At some velocities we *could* reach the goal, but our policy will fail
    * This is like cliff walking, but with far more cliffs

</div>
<div class="col">
<img style="width: 60%" class="r-stretch" src="./figures/mountain-car-reinforce-2k.gif" />
</div>
</div>

---

## Continuous, On-Policy Learning

* We see that $Q(s,a)$ always rolls down the hill for many values of a
  * But there is a sudden transition point where were are closer to the goal or moving faster
  * Those state have actions that would get a +100 reward
* How can we be persistent enough to get there?

</div>
<div class="col">
<img style="width: 60%" class="r-stretch" src="./figures/mountain-car-reinforce-2k.gif" />
</div>
</div>

---

## Continuous, On-Policy Learning

* So the problem is that we need to "thread a needle" of actions to reach a rare reward
* Going most of the way up the mountain will be penalized, with a reward only occurring at the top
* This seems like a Monte Carlo approach would work
  * Avoids erroneous policy updates that could actually push away from the goal

---

## Monte Carlo Problems

* But there are parts of Monte Carlo that we don't like
  * For example, the importance sampling ratio meant that we ignored some data
  * $ \rho_{t:T-1} = \prod_{k=t}^{T-1}\frac{\pi(A_k|S_k)}{b(A_k|S_k)}$
* But but if our policy is willing to try anything?

---

## Stochastic Policies

* A stochastic policy could select any action, with $p = \pi(a|s, \theta)$
* The policy outputs likelihoods of different actions
* After exploration, we scale gains by the likelihood of that action
* This resolves a problem with Monte Carlo methods, where only the end of an episode had any learning

---

## Intuition

* This is different from action-state value estimation with DQN
* Example:
  * Let's say your current policy is to go to a party whenever you are invited
  * You go to two parties, have fun, and then go to a third party that isn't fun
  * Q-learning would ask you to estimate $Q(party|invitation)$ and $Q(no~party|invitation)$
  * Policy learning just says that $\pi(party|invitation)$ should be $<1$
    * How much less? Does it matter?

---

## Intuition Continued

* You'll keep going to parties, so you'll develop more information about them
  * But you'll also explore more alternatives on a Friday night
* If all other parties are awesome, then $\pi(party|invitation)$ goes up
  * But if alternatives are also awesome, then $\pi(party|invitation)$ stays the same or goes down

---

## Idea Summary

* We are exploring a wide range of actions, never shunning one prematurely
* Advantages:
  * We can explore with only the policy, or inject programmatic random actions
  * Learning simply scales with $\pi(a|s, \theta)$
* So how will backpropagation work?

---

## Policy Gradient Methods

* Let's do the math
* The likelihood of any particular sequence in an episode is:
  * $p(s_1)\prod\limits_{t=1}^T\pi(a_t|s_t,\theta)p(s_{t+1}|s_t,a_t)$
* The expected episodic returns are $\sum\limits_a\pi(a|s)q_\pi(s,a)$
  * $q_\pi(s,a)$ represents the value of an action as $s$ under the current policy

---

## Policy Gradient

* The gradient is "simply" $\nabla v_\pi(s) = \nabla\big[\sum\limits_a\pi(a|s)q_\pi(s,a)\big]$
* Our policy is, of course, a function of its parameters:
  * $\nabla v_\pi(s) = \nabla\big[\sum\limits_a\pi(a|s, \theta) q_\pi(s,a)\big]$
* The gradient ascent update, with learning rate $\alpha$, looks like this:
  * $\theta \leftarrow \theta + \alpha \frac{\delta}{\delta\theta} \big[ \sum\limits_a \pi(a|s)q_\pi(s,a,\theta) \big]$
  * $\theta \leftarrow \theta + \alpha \big[ \sum\limits_a q_\pi(s,a)\frac{\delta}{\delta\theta} \pi(a|s,\theta) \big]$

---

## All-Actions Method

* This form is the *all-actions* method, because it involves summing over all actions
  * $\theta \leftarrow \theta + \alpha \big[ \sum\limits_a q_\pi(s,a)\frac{\delta}{\delta\theta} \pi(a|s,\theta) \big]$
* Also out of reach for us.
* Let's discuss an approximation algorithm, named REINFORCE

---

## Reinforce

* We'll only do updates based upon a single episode
* Now the update should scale with the probability of an observation
* We can accomplish this by multiplying and dividing by that probability
  * $ \mathbb{E}\big[ \sum\limits_a \pi(a|s,\theta)q_\pi(s,a)\frac{\nabla\pi(a|s,\theta)}{\pi(a|s,\theta)} \big]$

---

## Reinforce

* $ \mathbb{E}\big[ \sum\limits_a \pi(a|s,\theta)q_\pi(s,a)\frac{\nabla\pi(a|s,\theta)}{\pi(a|s,\theta)} \big]$
* Let's look at a single step update:
  * $ \mathbb{E}\big[ q_\pi(s,a)\frac{\nabla\pi(a|s,\theta)}{\pi(a|s,\theta)} \big]$
* We don't know $q_\pi(s,a)$, but $\mathbb{E}_\pi[G_t|s_t,a_t] = q_\pi(s_t,a_t)$
  * Meaning we can observe it during the episode

---

## Reinforce Update

* $\theta \leftarrow \theta + \alpha G\frac{\nabla\pi(a|s,\theta)}{\pi(a|s,\theta)}$
* The *likelihood ratio identity* says $\nabla ln(x) = \frac{\nabla x}{x}$
  * We can use this to make a numerically parsimonious equation
* Our update at each time, t, is
  * $\theta \leftarrow \theta + \alpha G_t\nabla ln\big[\pi(a_t|s_t,\theta)\big]$

---

## Per-Step Update

* Let's replace $q$ with an expected return
  * $ \mathbb{E}\big[ \sum\limits_a q_\pi(s,a)\frac{\delta}{\delta\theta} \pi(a|s,\theta) \big]$
* Now the per-sample equation is
  * $ \mathbb{E}\big[ q_\pi(s,a)\frac{\delta}{\delta\theta} \pi(a|s,\theta) \big]$

---

## Interpretation

* $ \mathbb{E}\big[ q_\pi(s,a)\frac{\delta}{\delta\theta} \pi(a|s,\theta) \big]$
* We should interpret a policy model as a classifier
* The loss is the log probability of any action multiplied by its gain

---

## Implementation

* Implementing may not sound easy
  * But it is just some work with cdfs and pdfs
* First, make a model to output the parameters of a distribution
* Second, use those parameters to both choose an action and its probability

---

## Categorical Policies

* If actions are discrete, this is simply a categorical policy

```python
class DiscreteActionModel(torch.nn.Module):
    """A linear neural network for state-action value estimation."""

def __init__(self, states, actions):
        super(DiscreteActionModel, self).__init__()
        # This model takes in the current state and a one-hot vector of actions
        self.net = torch.nn.Sequential(
            torch.nn.Linear(in_features=states, out_features=256),
            torch.nn.ReLU(),
            torch.nn.LayerNorm(normalized_shape=256),
            torch.nn.Linear(in_features=256, out_features=128),
            torch.nn.ReLU(),
            torch.nn.Linear(in_features=128, out_features=actions),
            )
        # Treat as a probability
        self.decision = torch.nn.Softmax(dim=1)

for layer in [0, 3]:
            torch.nn.init.kaiming_normal_(self.net[layer].weight.data, nonlinearity="relu")

def forward(self, x):
        """Forward through the network."""
        y_hat = self.decision(self.net(x))
        return y_hat
```

---

## Continuous Policies

* We can generally use a normal distribution, one for each action variable

```python
class ContinuousActionModel(torch.nn.Module):
    """A linear neural network for state-action value estimation."""

def __init__(self, states, actions):
        super(ContinuousActionModel, self).__init__()
        # This model takes in the current state and a one-hot vector of actions
        self.net = torch.nn.Sequential(
            torch.nn.Linear(in_features=states, out_features=256),
            torch.nn.LayerNorm(normalized_shape=256),
            torch.nn.ReLU(),
            torch.nn.Linear(in_features=256, out_features=128),
            torch.nn.ReLU(),
            )
        self.states = states

self.mean_head = torch.nn.Linear(128, actions)
        self.stddev_head = torch.nn.Linear(128, actions)

for layer in [0, 3]:
            torch.nn.init.kaiming_normal_(self.net[layer].weight.data, nonlinearity="relu")
        torch.nn.init.uniform_(self.mean_head.bias, -0.1, 0.1)
        torch.nn.init.uniform_(self.stddev_head.bias, 0, 1)

def forward(self, x):
        """Forward through the network."""
        x = self.net(x)
        #mean = torch.tanh(self.mean_head(x))
        mean = self.mean_head(x)
        # Ensure that the stddev is positive
        stddev = F.softplus(self.stddev_head(x)) + 1e-6
        return mean, stddev
```

---

## Training REINFORCE

* Pytorch defines all of the common distributions:
  * [https://docs.pytorch.org/docs/stable/distributions.html](https://docs.pytorch.org/docs/stable/distributions.html)
* So we simply take the output of our network, create a distribution, and draw an action
* Once the episode is complete, we go through the network to get the parameters at each state, and use the distributions to get the action probabilities

---

## REINFORCE Code

```python
def actionProb(self, state, action):
    # Get the probability of an action
    # Continuous or discrete action space
    if self.continuous_actions:
        mean, stddev = self.model(state_vector)
        dist = torch.distributions.Normal(mean, stddev)
        log_prob = dist.log_prob(action)
    else:
        probs = self.model(state_vector)
        dist = torch.distributions.Categorical(probs=probs)
        # The output for discrete actions is a one hot vector, so find the index
        a_idx = torch.argmax(action)
        log_prob = dist.log_prob(a_idx)
    return log_prob

dataset = torch.utils.data.TensorDataset(state_vectors, act_vectors, gains_tensor)
loader = torch.utils.data.DataLoader(dataset, batch_size = self.batch_size, shuffle=True)
for states, actions, gains in loader:
    # Compute the probabilities, which will also prepare their gradients
    action_probs, entropies = self.actionProb(states, actions)
    # We just want to make low gain policies have high loss, high gain
    # policies have low loss
    # Learning should come through the probabilities of actions and their gains.
    policy_loss = -action_probs * gains.detach()

# Zero gradients before gradient calculation
    self.optimizer.zero_grad()
    loss = torch.mean(policy_loss)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(self.model.parameters(), 1)
    self.optimizer.step()
```

---

## Results

* If you try this out, you'll get nothing good
  * But why?
* Policy learning with DNNs is incredibly unstable
* Simple improvements:
  * Baseline comparisons
  * Less biased updates with historic samples
  * Add an entropy loss to prevent collapse

---

## Entropy

* This is easy, so we'll look at it first

```python
def actionProb(self, state, action):
    # Get the probability of an action
    # Continuous or discrete action space
    if self.continuous_actions:
        mean, stddev = self.model(state_vector)
        dist = torch.distributions.Normal(mean, stddev)
        log_prob = dist.log_prob(action)
        entropy = dist.entropy()
    else:
        probs = self.model(state_vector)
        dist = torch.distributions.Categorical(probs=probs)
        # The output for discrete actions is a one hot vector, so find the index
        a_idx = torch.argmax(action)
        log_prob = dist.log_prob(a_idx)
        entropy = dist.entropy()
    return log_prob, entropy

# Zero gradients before gradient calculation
    self.optimizer.zero_grad()
    loss = torch.mean(policy_loss) - 0.1 * torch.mean(entropies)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(self.model.parameters(), 1)
    self.optimizer.step()
```

---

## Historic Updates

* I showed retaining a percentage of previous examples after each episode
  * Ideally we should actually throw out most samples
* Retain ones that are "interesting" (usually meaning high error)
  * This is *prioritized experience replay*
  * It's a heuristic, so let's not dig into it

---

## Baseline

* This will give a major boost with slight effort
* The problem with gains is that they are an estimate, and have high variance
  * Sure, over many episodes $\mathbb{E}_\pi[G_t|s_t,a_t] = q_\pi(s_t,a_t)$
  * But that means a very small learning rate and lots of time
* Instead of gain, let's use something called *advantage* as a baseline to improve upon

---

## Advantage

* $A(s_t) = G_t - Q(s_t, a_t)$
* Advantage is how much better (or worse) our gain is than our expected value
* Let's assume that Q is perfect
  * A positive value means that $a_t$ was, on average, better than the other episode actions
    * So we should do more of it
  * A negative value means that $a_t$ was, on average, worse than other episode actions
    * So do less of it

---

## Advantage

* How do we get Q?
  * Just use another network. 
  * We are observing many $(s,a)$ pairs in any case, so this has little overhead cost
* This will stabilize learning, even if our estimates are wrong

---

## Continuing Problems

* The mountain car simulation is cut off at 4000 steps
  * So should step 3999 has a tiny negative reward?
  * No!
* As a matter of bookkeeping, we need to estimate the average return and add that in
  * This is more mathematically sound if we have some discount, $\gamma < 1$
* (As an alternative, I could have run without any time limit)

---

## Results

* First, you need to tweak the network to have enough natural variance so that it could ever succeed
  * Or add in randomness manually
* We want decreasing numbers of episodes on the mountain car, and...

---

## Tuning

* That looks terrible, so let's tune
* Keep 25% of samples to the next episode, only keep max 256/episode
  * Obviously this could stand some more tuning for stability

---

## Results

* Results are surprisingly robust
  * But I did multiple random restarts to find good initial DNN values

<div class="container">
<div class="col">
<img style="width: 80%" class="r-stretch" src="./figures/mountain-car-reinforce.gif" />
<br/>
<small>500 episodes, $\gamma = 1.0$.</small>
</div>
<div class="col">
<img style="width: 80%" class="r-stretch" src="./figures/mountain-car-reinforce-2k.gif" />
<br/>
<small>2000 episodes, $\gamma = 1.0$.</small>
</div>
<div class="col">
<img style="width: 80%" class="r-stretch" src="./figures/mountain-car-reinforce-100tuned.gif" />
<br/>
<small>100 episodes, $\gamma = 1.0$, $p_{keep} = 0.25$, keep max 256/episode.</small>
</div>
</div>

---

## Discussion

* During simulation, we always take a random action based upon the policy
  * *Not* the most likely (the mean for a normal distribution)
* This prevents our model from getting "stuck"
* So even if it looks inferior to a model from a DQN, the nondeterminism can help

---

## Improvements

* There are more (and better) algorithms
* For one thing, we have the Monte Carlo problem of only being able to learn at the end of an episode
* But we've learned enough for one lecture!