Cart and pole, DQN.
---
## Continuous Actions
* The cart and pole had continuous states, but not actions
* So let's make things more difficult
* The return of the mountain car
Failed result after 100 episodes.
---
## Mountain Car
* State:
* The position is from -1.2 to 0.6
* The velocity is from -0.07 to 0.07
* Action is a continuous value from -1 to 1 that applies a positive or negative force
* A negative reward of $-0.1 \times action^2$ is applied at each time step
* +100 is rewarded for reaching the goal
---
## What Goes Wrong?
* So what went wrong with DQN?
* We have a curse of dimensionality problem occurring
* We need sampled data to estimate Q
* With both the state and the action, the (s, a) input has size $\mathbb{R}^3$
* We will many new states on every episode
* But those states won't lead to a fundamentally different result
---
##
* Success involves following a large set of states to the goal
* But there is some stochasticity to this
* The longer the path, the more likely our model is to become confused
---
## Following Rewards
* TD is good at training the model to get out of over-explored areas
* But in a continuous space, it is always possible to find something unexplored
* Especially since we use randomness for exploration
* So is there a more efficient approach?
* Sure! What if we were learning the policy directly?
---
## Discrete and Continuous Differences
* On-policy exploration was hard in the tabular world
* But that was because our models weren't stochastic
* A learned stochastic policy can do its own exploration
* If you think about AlphaGo, the model was guiding exploration into novel board positions
---
## Continuous, On-Policy Learning
* Q-learning's struggle is in generalizing $Q(s,a)$ values
* For example, we observe a state halfway up the right side of the hill
* At some velocities, we cannot reach the goal (at 0.45), so the value is low
* At some velocities we *could* reach the goal, but our policy will fail
* This is like cliff walking, but with far more cliffs
---
## Continuous, On-Policy Learning
* We see that $Q(s,a)$ always rolls down the hill for many values of a
* But there is a sudden transition point where were are closer to the goal or moving faster
* Those state have actions that would get a +100 reward
* How can we be persistent enough to get there?
---
## Continuous, On-Policy Learning
* So the problem is that we need to "thread a needle" of actions to reach a rare reward
* Going most of the way up the mountain will be penalized, with a reward only occurring at the top
* This seems like a Monte Carlo approach would work
* Avoids erroneous policy updates that could actually push away from the goal
---
## Monte Carlo Problems
* But there are parts of Monte Carlo that we don't like
* For example, the importance sampling ratio meant that we ignored some data
* $ \rho_{t:T-1} = \prod_{k=t}^{T-1}\frac{\pi(A_k|S_k)}{b(A_k|S_k)}$
* But but if our policy is willing to try anything?
---
## Stochastic Policies
* A stochastic policy could select any action, with $p = \pi(a|s, \theta)$
* The policy outputs likelihoods of different actions
* After exploration, we scale gains by the likelihood of that action
* This resolves a problem with Monte Carlo methods, where only the end of an episode had any learning
---
## Intuition
* This is different from action-state value estimation with DQN
* Example:
* Let's say your current policy is to go to a party whenever you are invited
* You go to two parties, have fun, and then go to a third party that isn't fun
* Q-learning would ask you to estimate $Q(party|invitation)$ and $Q(no~party|invitation)$
* Policy learning just says that $\pi(party|invitation)$ should be $<1$
* How much less? Does it matter?
---
## Intuition Continued
* You'll keep going to parties, so you'll develop more information about them
* But you'll also explore more alternatives on a Friday night
* If all other parties are awesome, then $\pi(party|invitation)$ goes up
* But if alternatives are also awesome, then $\pi(party|invitation)$ stays the same or goes down
---
## Idea Summary
* We are exploring a wide range of actions, never shunning one prematurely
* Advantages:
* We can explore with only the policy, or inject programmatic random actions
* Learning simply scales with $\pi(a|s, \theta)$
* So how will backpropagation work?
---
## Policy Gradient Methods
* Let's do the math
* The likelihood of any particular sequence in an episode is:
* $p(s_1)\prod\limits_{t=1}^T\pi(a_t|s_t,\theta)p(s_{t+1}|s_t,a_t)$
* The expected episodic returns are $\sum\limits_a\pi(a|s)q_\pi(s,a)$
* $q_\pi(s,a)$ represents the value of an action as $s$ under the current policy
---
## Policy Gradient
* The gradient is "simply" $\nabla v_\pi(s) = \nabla\big[\sum\limits_a\pi(a|s)q_\pi(s,a)\big]$
* Our policy is, of course, a function of its parameters:
* $\nabla v_\pi(s) = \nabla\big[\sum\limits_a\pi(a|s, \theta) q_\pi(s,a)\big]$
* The gradient ascent update, with learning rate $\alpha$, looks like this:
* $\theta \leftarrow \theta + \alpha \frac{\delta}{\delta\theta} \big[ \sum\limits_a \pi(a|s)q_\pi(s,a,\theta) \big]$
* $\theta \leftarrow \theta + \alpha \big[ \sum\limits_a q_\pi(s,a)\frac{\delta}{\delta\theta} \pi(a|s,\theta) \big]$
---
## All-Actions Method
* This form is the *all-actions* method, because it involves summing over all actions
* $\theta \leftarrow \theta + \alpha \big[ \sum\limits_a q_\pi(s,a)\frac{\delta}{\delta\theta} \pi(a|s,\theta) \big]$
* Also out of reach for us.
* Let's discuss an approximation algorithm, named REINFORCE
---
## Reinforce
* We'll only do updates based upon a single episode
* Now the update should scale with the probability of an observation
* We can accomplish this by multiplying and dividing by that probability
* $ \mathbb{E}\big[ \sum\limits_a \pi(a|s,\theta)q_\pi(s,a)\frac{\nabla\pi(a|s,\theta)}{\pi(a|s,\theta)} \big]$
---
## Reinforce
* $ \mathbb{E}\big[ \sum\limits_a \pi(a|s,\theta)q_\pi(s,a)\frac{\nabla\pi(a|s,\theta)}{\pi(a|s,\theta)} \big]$
* Let's look at a single step update:
* $ \mathbb{E}\big[ q_\pi(s,a)\frac{\nabla\pi(a|s,\theta)}{\pi(a|s,\theta)} \big]$
* We don't know $q_\pi(s,a)$, but $\mathbb{E}_\pi[G_t|s_t,a_t] = q_\pi(s_t,a_t)$
* Meaning we can observe it during the episode
---
## Reinforce Update
* $\theta \leftarrow \theta + \alpha G\frac{\nabla\pi(a|s,\theta)}{\pi(a|s,\theta)}$
* The *likelihood ratio identity* says $\nabla ln(x) = \frac{\nabla x}{x}$
* We can use this to make a numerically parsimonious equation
* Our update at each time, t, is
* $\theta \leftarrow \theta + \alpha G_t\nabla ln\big[\pi(a_t|s_t,\theta)\big]$
---
## Per-Step Update
* Let's replace $q$ with an expected return
* $ \mathbb{E}\big[ \sum\limits_a q_\pi(s,a)\frac{\delta}{\delta\theta} \pi(a|s,\theta) \big]$
* Now the per-sample equation is
* $ \mathbb{E}\big[ q_\pi(s,a)\frac{\delta}{\delta\theta} \pi(a|s,\theta) \big]$
---
## Interpretation
* $ \mathbb{E}\big[ q_\pi(s,a)\frac{\delta}{\delta\theta} \pi(a|s,\theta) \big]$
* We should interpret a policy model as a classifier
* The loss is the log probability of any action multiplied by its gain
---
## Implementation
* Implementing may not sound easy
* But it is just some work with cdfs and pdfs
* First, make a model to output the parameters of a distribution
* Second, use those parameters to both choose an action and its probability
---
## Categorical Policies
* If actions are discrete, this is simply a categorical policy
```python
class DiscreteActionModel(torch.nn.Module):
"""A linear neural network for state-action value estimation."""
def __init__(self, states, actions):
super(DiscreteActionModel, self).__init__()
# This model takes in the current state and a one-hot vector of actions
self.net = torch.nn.Sequential(
torch.nn.Linear(in_features=states, out_features=256),
torch.nn.ReLU(),
torch.nn.LayerNorm(normalized_shape=256),
torch.nn.Linear(in_features=256, out_features=128),
torch.nn.ReLU(),
torch.nn.Linear(in_features=128, out_features=actions),
)
# Treat as a probability
self.decision = torch.nn.Softmax(dim=1)
for layer in [0, 3]:
torch.nn.init.kaiming_normal_(self.net[layer].weight.data, nonlinearity="relu")
def forward(self, x):
"""Forward through the network."""
y_hat = self.decision(self.net(x))
return y_hat
```
---
## Continuous Policies
* We can generally use a normal distribution, one for each action variable
```python
class ContinuousActionModel(torch.nn.Module):
"""A linear neural network for state-action value estimation."""
def __init__(self, states, actions):
super(ContinuousActionModel, self).__init__()
# This model takes in the current state and a one-hot vector of actions
self.net = torch.nn.Sequential(
torch.nn.Linear(in_features=states, out_features=256),
torch.nn.LayerNorm(normalized_shape=256),
torch.nn.ReLU(),
torch.nn.Linear(in_features=256, out_features=128),
torch.nn.ReLU(),
)
self.states = states
self.mean_head = torch.nn.Linear(128, actions)
self.stddev_head = torch.nn.Linear(128, actions)
for layer in [0, 3]:
torch.nn.init.kaiming_normal_(self.net[layer].weight.data, nonlinearity="relu")
torch.nn.init.uniform_(self.mean_head.bias, -0.1, 0.1)
torch.nn.init.uniform_(self.stddev_head.bias, 0, 1)
def forward(self, x):
"""Forward through the network."""
x = self.net(x)
#mean = torch.tanh(self.mean_head(x))
mean = self.mean_head(x)
# Ensure that the stddev is positive
stddev = F.softplus(self.stddev_head(x)) + 1e-6
return mean, stddev
```
---
## Training REINFORCE
* Pytorch defines all of the common distributions:
* [https://docs.pytorch.org/docs/stable/distributions.html](https://docs.pytorch.org/docs/stable/distributions.html)
* So we simply take the output of our network, create a distribution, and draw an action
* Once the episode is complete, we go through the network to get the parameters at each state, and use the distributions to get the action probabilities
---
## REINFORCE Code
```python
def actionProb(self, state, action):
# Get the probability of an action
# Continuous or discrete action space
if self.continuous_actions:
mean, stddev = self.model(state_vector)
dist = torch.distributions.Normal(mean, stddev)
log_prob = dist.log_prob(action)
else:
probs = self.model(state_vector)
dist = torch.distributions.Categorical(probs=probs)
# The output for discrete actions is a one hot vector, so find the index
a_idx = torch.argmax(action)
log_prob = dist.log_prob(a_idx)
return log_prob
dataset = torch.utils.data.TensorDataset(state_vectors, act_vectors, gains_tensor)
loader = torch.utils.data.DataLoader(dataset, batch_size = self.batch_size, shuffle=True)
for states, actions, gains in loader:
# Compute the probabilities, which will also prepare their gradients
action_probs, entropies = self.actionProb(states, actions)
# We just want to make low gain policies have high loss, high gain
# policies have low loss
# Learning should come through the probabilities of actions and their gains.
policy_loss = -action_probs * gains.detach()
# Zero gradients before gradient calculation
self.optimizer.zero_grad()
loss = torch.mean(policy_loss)
loss.backward()
torch.nn.utils.clip_grad_norm_(self.model.parameters(), 1)
self.optimizer.step()
```
---
## Results
* If you try this out, you'll get nothing good
* But why?
* Policy learning with DNNs is incredibly unstable
* Simple improvements:
* Baseline comparisons
* Less biased updates with historic samples
* Add an entropy loss to prevent collapse
---
## Entropy
* This is easy, so we'll look at it first
```python
def actionProb(self, state, action):
# Get the probability of an action
# Continuous or discrete action space
if self.continuous_actions:
mean, stddev = self.model(state_vector)
dist = torch.distributions.Normal(mean, stddev)
log_prob = dist.log_prob(action)
entropy = dist.entropy()
else:
probs = self.model(state_vector)
dist = torch.distributions.Categorical(probs=probs)
# The output for discrete actions is a one hot vector, so find the index
a_idx = torch.argmax(action)
log_prob = dist.log_prob(a_idx)
entropy = dist.entropy()
return log_prob, entropy
dataset = torch.utils.data.TensorDataset(state_vectors, act_vectors, gains_tensor)
loader = torch.utils.data.DataLoader(dataset, batch_size = self.batch_size, shuffle=True)
for states, actions, gains in loader:
# Compute the probabilities, which will also prepare their gradients
action_probs, entropies = self.actionProb(states, actions)
# We just want to make low gain policies have high loss, high gain
# policies have low loss
# Learning should come through the probabilities of actions and their gains.
policy_loss = -action_probs * gains.detach()
# Zero gradients before gradient calculation
self.optimizer.zero_grad()
loss = torch.mean(policy_loss) - 0.1 * torch.mean(entropies)
loss.backward()
torch.nn.utils.clip_grad_norm_(self.model.parameters(), 1)
self.optimizer.step()
```
---
## Historic Updates
* I showed retaining a percentage of previous examples after each episode
* Ideally we should actually throw out most samples
* Retain ones that are "interesting" (usually meaning high error)
* This is *prioritized experience replay*
* It's a heuristic, so let's not dig into it
---
## Baseline
* This will give a major boost with slight effort
* The problem with gains is that they are an estimate, and have high variance
* Sure, over many episodes $\mathbb{E}_\pi[G_t|s_t,a_t] = q_\pi(s_t,a_t)$
* But that means a very small learning rate and lots of time
* Instead of gain, let's use something called *advantage* as a baseline to improve upon
---
## Advantage
* $A(s_t) = G_t - Q(s_t, a_t)$
* Advantage is how much better (or worse) our gain is than our expected value
* Let's assume that Q is perfect
* A positive value means that $a_t$ was, on average, better than the other episode actions
* So we should do more of it
* A negative value means that $a_t$ was, on average, worse than other episode actions
* So do less of it
---
## Advantage
* How do we get Q?
* Just use another network.
* We are observing many $(s,a)$ pairs in any case, so this has little overhead cost
* This will stabilize learning, even if our estimates are wrong
---
## Continuing Problems
* The mountain car simulation is cut off at 4000 steps
* So should step 3999 has a tiny negative reward?
* No!
* As a matter of bookkeeping, we need to estimate the average return and add that in
* This is more mathematically sound if we have some discount, $\gamma < 1$
* (As an alternative, I could have run without any time limit)
---
## Results
* First, you need to tweak the network to have enough natural variance so that it could ever succeed
* Or add in randomness manually
* We want decreasing numbers of episodes on the mountain car, and...
---
## Tuning
* That looks terrible, so let's tune
* Keep 25% of samples to the next episode, only keep max 256/episode
* Obviously this could stand some more tuning for stability
---
## Results
* Results are surprisingly robust
* But I did multiple random restarts to find good initial DNN values
500 episodes, $\gamma = 1.0$.
2000 episodes, $\gamma = 1.0$.
100 episodes, $\gamma = 1.0$, $p_{keep} = 0.25$, keep max 256/episode.
---
## Discussion
* During simulation, we always take a random action based upon the policy
* *Not* the most likely (the mean for a normal distribution)
* This prevents our model from getting "stuck"
* So even if it looks inferior to a model from a DQN, the nondeterminism can help
---
## Improvements
* There are more (and better) algorithms
* For one thing, we have the Monte Carlo problem of only being able to learn at the end of an episode
* But we've learned enough for one lecture!