# CS 530 - Lecture 20

## Actor-Critic Learning

Bernhard Firner

2026-04-09

---

## Mountain Car

* We've been trying to conquer this mountain
* State:
  * The position is from -1.2 to 0.6
  * The velocity is from -0.07 to 0.07
* Action is a continuous value from -1 to 1 that applies a positive or negative force
* A negative reward of $-0.1 \times action^2$ is applied at each time step
* +100 is rewarded for reaching the goal
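
A minimal sketch of the reward described above, in plain Python. The name `step_reward` and the clipping of out-of-range actions are assumptions for illustration, not part of any environment's API.

```python
def step_reward(action: float, reached_goal: bool) -> float:
    """Reward for a single time step of the continuous mountain car."""
    # Assumption: actions outside [-1, 1] are clipped before being applied.
    action = max(-1.0, min(1.0, action))
    # Every step pays a small cost proportional to the squared force.
    reward = -0.1 * action ** 2
    # Reaching the goal adds the large terminal bonus.
    if reached_goal:
        reward += 100.0
    return reward
```

Note the scale difference: the worst per-step cost is 0.1, while the goal is worth 100.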

---

## REINFORCE Review

* Last time we used REINFORCE to solve the mountain car task
  * Finishing chapter 13 in Sutton and Barto's RL book

<div class="container">
<div class="col">
<img style="width: 80%" class="r-stretch" src="./figures/mountain-car-reinforce.gif" />
<br/>
<small>500 episodes, $\gamma = 1.0$.</small>
</div>
<div class="col">
<img style="width: 80%" class="r-stretch" src="./figures/mountain-car-reinforce-2k.gif" />
<br/>
<small>2000 episodes, $\gamma = 1.0$.</small>
</div>
<div class="col">
<img style="width: 80%" class="r-stretch" src="./figures/mountain-car-reinforce-100tuned.gif" />
<br/>
<small>100 episodes, $\gamma = 1.0$, $p_{keep} = 0.25$, keep max 256/episode.</small>
</div>
</div>

---

## REINFORCE Review

* With some tuning, it had consistent results
  * Advantage instead of gain
  * Normalized returns
  * Keep a subset of long episodes
  * Discard data slowly
* Something to keep in mind with RL: there is a lot of tuning

---

## The Update

* The original REINFORCE update is:
  * $\theta \leftarrow \theta + \alpha G_t\nabla \ln\big[\pi(a_t|s_t,\theta)\big]$
  * We multiply the gain ($G_t = r_t + \gamma r_{t+1} + \dots + \gamma^n r_{t+n}$) by the gradient of $\ln \pi(a_t|s_t,\theta)$
* The policy network won't know the Q value of an action, but it will learn that one action is preferable to another
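
The gain for every time step can be computed with one backward pass over an episode's rewards. This is a minimal sketch in plain Python; `discounted_gains` is a hypothetical helper name, not something from the lecture code.

```python
def discounted_gains(rewards, gamma):
    """Return G_t = r_t + gamma * r_{t+1} + ... for every step of one episode."""
    gains = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each gain reuses the one that follows it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        gains[t] = running
    return gains
```

For example, `discounted_gains([1, 1, 1], 0.5)` gives `[1.75, 1.5, 1.0]`.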

---

## Advantage

* The absolute value of the gradient is not critical, so we can feel free to rescale it
* Advantage is a way to compare an action to the average reward
  * $A(s_t) = G_t - Q(s_t, a_t)$
* Asks the model to do more of anything above average, less of anything below

---

## Normalize Returns

* Nothing says that our Q and G values will be sane
  * In fact, Q is coming from a DNN estimate, so it could be anything
* Feed a DNN garbage, and you'll get a garbage DNN
  * So we standardize the gains, making them 0 mean and unit variance
  * This resists degenerate updates while preserving relative advantage

```python
# Standardize the gains (zero mean, unit variance) so we do more of the good
# stuff and less of the bad stuff. Divide by the standard deviation, not the variance.
gains = (gains - torch.mean(gains)) / (torch.std(gains) + 1e-6)
```

---

## Data Retention

* Failed episodes are longer, and consequently have more updates
  * With $\gamma$ too high, their loss is enormous
  * With $\gamma$ too low, the model won't be able to follow a long path to a solution
* So we cap how many samples we will take from an episode
  * Also discard older samples at some rate
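
One way to implement the cap and the slow discard is sketched below. The lecture code is not shown, so the helper names and the exact meaning of `p_keep` (each stored sample survives a pruning pass with that probability) are assumptions.

```python
import random


def cap_episode(samples, max_per_episode=256, rng=random):
    """Return at most max_per_episode samples, chosen uniformly at random."""
    samples = list(samples)
    if len(samples) <= max_per_episode:
        return samples
    return rng.sample(samples, max_per_episode)


def decay_buffer(buffer, p_keep, rng=random):
    """Keep each older sample independently with probability p_keep."""
    return [sample for sample in buffer if rng.random() < p_keep]
```

A long failed episode then contributes no more samples than a short successful one.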

---

## Semi-Consistent

* With all of those changes we have semi-consistent results
  * But mountain-car is still a difficult problem
* One of the most challenging parts is how rarely we will see the reward
* We could shape the reward by adding components
  * But let's look for an algorithmic improvement

---

## TD Learning

* Recall that we previously moved from Monte Carlo learning to temporal difference learning
* TD takes a single time step and makes a biased estimate of the gain
* $G_t = R_t + \gamma\hat{V}(s_{t+1})$
  * We can train a network to estimate that value, of course
* Is there a policy version of this?

---

## Actor-Critic RL

* The TD error is
  * $\delta_t = R_t + \gamma V(s_{t+1}) - V(s_t)$
* We will use that in place of advantage
  * The new part here is that we estimate both $V(s_{t+1})$ and $V(s_t)$
* This gives us a signal that has nothing to do with the rest of the episode
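
The TD error needs only one transition, as this small sketch shows. The `done` flag, which zeroes the bootstrap term at a terminal state, is an assumption that the slide does not spell out.

```python
def td_error(reward, value, next_value, gamma, done=False):
    """One-step TD error: R_t + gamma * V(s_{t+1}) - V(s_t)."""
    # Assumption: a terminal next state has no future value to bootstrap from.
    bootstrap = 0.0 if done else gamma * next_value
    return reward + bootstrap - value
```

A positive result means the step went better than the critic expected, and a negative result means it went worse.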

---

## Actor and Critic

* The network that estimates state value is called the *critic*
* The policy network that is being criticized is called the *actor*
* Its weight updates are similar to the previous policy update:
  * $\theta \leftarrow \theta + \alpha \delta_t\nabla \ln\big[\pi(a_t|s_t,\theta)\big]$
* We can update any time we want, so I'll do a few batches every 100 steps or so
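
To see what the update does, here is a toy version with a one-parameter policy. It assumes a Gaussian policy whose mean `mu` is the only learned parameter and whose standard deviation `sigma` is fixed; a real actor would be a network, so treat this as an illustration only.

```python
def actor_step(mu, action, delta, alpha, sigma=1.0):
    """Apply theta <- theta + alpha * delta * grad ln pi(a | theta) with theta = mu."""
    # For a Gaussian policy, d/d(mu) of ln N(action; mu, sigma) is (action - mu) / sigma^2.
    grad_log_prob = (action - mu) / sigma ** 2
    return mu + alpha * delta * grad_log_prob
```

With a positive TD error the mean moves toward the action that was taken, and with a negative TD error it moves away.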

---

## Network Choices

* You *can* make the actor and critic share weights
* If you were processing images (instead of a couple of numbers) this could save time
  * But you will also need to balance the learning rates of two different signals
  * That is challenging in practice, so most people avoid it when possible

---

## Some Details

* We use the critic network twice, but you shouldn't collect gradients for both estimates
* If we did, it would be as though both sides of an equation are updated
  * We would basically make convergence impossible

```python
# Update the critic
value_estimates = self.critic(states)

# Compute the next-state values under torch.no_grad() so no gradients flow
# through the learning target; otherwise the target isn't fixed.
with torch.no_grad():
    next_value_estimates = self.critic(next_states)
# One-step TD target: R_t + gamma * V(s_{t+1})
estimated_rewards = rewards + self.gamma * next_value_estimates
```

---

## Sufficient

* With REINFORCE I threw out most of the data over time
  * This time, I'll keep it around
  * The TD steps mean less self-similarity between examples

<div class="col">
<img style="width: 50%" class="r-stretch" src="./figures/mountain-car-actor-critic-100epochs.gif" />
<br/>
<small>100 episodes, $\gamma = 0.99$, keep max 256/episode.</small>
</div>

---

## Improvements

* [Proximal Policy Optimization Algorithms](https://arxiv.org/abs/1707.06347), from 2017, proposed a modification to learning to allow more frequent updates without losing stability
* Here's the problem being addressed:
  * If RL model training is degenerate, then we need to constrain updates somehow
  * This isn't really a learning rate problem
    * Despite all of our efforts to constrain and normalize gradients, the estimates are noisy, so the learning is noisy

---

## Training Signal

* $\theta \leftarrow \theta + \alpha \hat{A}_t\nabla \ln\big[\pi(a_t|s_t,\theta)\big]$
  * Where $\hat{A}_t$ is the estimate of advantage at time t.
* Policy updates can be huge
  * Often due to value estimate errors, but uneven reward signals contribute
  * Remember, reaching the goal in mountain car rewards +100
    * The largest penalty per time step is only 0.1 (a reward of -0.1)

---

## Trust Regions

* Even if we standardize by mean and variance, some updates will dominate
* So we need something else to constrain updates
  * How about by rate of policy change?
* $\underset{\theta}{\text{maximize}}\ \hat{\mathbb{E}}\left[ \frac{\pi(a_t|s_t,\theta)}{\pi(a_t|s_t,\theta_{old})}\hat{A}_t  \right]$
  * As long as $\hat{\mathbb{E}}\left[KL[\pi(\cdot|s_t,\theta_{old}), \pi(\cdot|s_t,\theta)]\right] \leq \delta$
    * Here $\delta$ is a small bound on the policy change, not the TD error
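
For intuition about the KL term, the sketch below computes it in closed form for two one-dimensional Gaussian policies. Assuming a Gaussian policy is a choice made for this example; the slides do not say which distribution the actor outputs.

```python
import math


def gaussian_kl(mu_old, sigma_old, mu_new, sigma_new):
    """KL[N(mu_old, sigma_old^2) || N(mu_new, sigma_new^2)] for 1-D Gaussians."""
    return (
        math.log(sigma_new / sigma_old)
        + (sigma_old ** 2 + (mu_old - mu_new) ** 2) / (2.0 * sigma_new ** 2)
        - 0.5
    )
```

Identical policies give a divergence of 0, and shifting the mean by one standard deviation gives 0.5, so a small bound only allows small shifts per update.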

---

## Motivation

* That constraint should be part of the loss function, but a hard constraint isn't amenable to plain gradient descent
* The authors of [Proximal Policy Optimization Algorithms](https://arxiv.org/abs/1707.06347) suggest clipping instead

```python
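# action_probs and old_action_probs are assumed to hold log-probabilities,
# so the probability ratio is the exponential of their difference.
ratio = torch.exp(action_probs - old_action_probs.detach())
actor_loss1 = -ratio * deltas.detach()
# Clip the ratio to [0.8, 1.2] so a single update cannot move the policy too far
actor_loss2 = -torch.clamp(ratio, 0.8, 1.2) * deltas.detach()
# We want the minimum of the two objectives, but we've negated them, so take the max
actor_loss = torch.max(actor_loss1, actor_loss2).mean()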
```
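
The effect of clipping is easier to see on single numbers. This plain-Python sketch evaluates the clipped objective (the quantity being maximized, before negation) for one sample; `clipped_surrogate` is a hypothetical helper name, and `eps=0.2` matches the 0.8 to 1.2 range used above.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """Per-sample PPO objective: min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)
```

With a positive advantage, a ratio of 1.5 is capped at 1.2, so there is no gain from pushing the policy further. With a negative advantage, a ratio of 1.5 is not capped and the full penalty of -1.5 applies, so moves that make things worse are still fully discouraged.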

---

## Model Elasticity

* How should we think about this?
* The critic's estimates will be horribly wrong if the actor changes too quickly
  * Recall, it is estimating $V_\pi(s)$, the value of a state under the current policy
* Since those values guide the advantages, and thus the policy, they need to be stable

---

## Performance, PPO

* Actor-critic with PPO is certainly more reliable than REINFORCE
* What else is causing instability?
  * What about the calculated reward?

<div class="container">
<div class="col">
<img style="width: 75%" class="r-stretch" src="./figures/mountain-car-actor-critic-100epochs.gif" />
<br/>
<small>100 episodes.</small>
</div>

<div class="col">
<img style="width: 75%" class="r-stretch" src="./figures/mountain-car-actor-critic-ppo-150epochs-just-barely.gif" />
<br/>
<small>150 episodes.</small>
</div>
</div>

---

## Episodic Rewards

* Look at how many times we nearly reach the goal
* Each time, the value estimate will be horribly wrong
  * That is actually going to push the model *away* from the path that led there and away from the solution

<div class="container">
<div class="col">
<img style="width: 75%" class="r-stretch" src="./figures/mountain-car-actor-critic-ppo-150epochs-just-barely.gif" />
<br/>
<small>150 episodes.</small>
</div>
</div>

---

## TD Advantages

* An advantage of actor-critic models is that we learn with every step
* The model can learn to avoid an area without seeing the goal
  * That implies that we can weaken the depth of the reward signal by decreasing $\gamma$
  * This will actually make training *more* stable
* We don't need to know how long it takes to get to the goal from a standstill, just that we shouldn't stand still

---

## Policy Search Disadvantages

* Notice though, that we are now playing with hyperparameters
* The heart of the problem is that this is difficult
  * A tiny change in action leads to a huge swing in future rewards
* Estimating a Q function is a fairly stable task
  * In comparison to policy search, it is also easy to look at the loss and see if it is improving

---

## Discussion

* This is a real weakness of RL
  * Value iteration and tabular Q learning are more predictable
  * Or we need to carefully craft new reward signals
* For example, reward the car for getting really close to the goal so it is not discouraged
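
A shaped reward of that kind might look like the sketch below. Everything here is hypothetical: `goal_position`, the `near` distance, and `bonus_scale` are illustrative parameters to tune, not values from the lecture.

```python
def shaped_reward(reward, position, goal_position, bonus_scale=1.0, near=0.1):
    """Add a bonus that grows linearly as the car gets within `near` of the goal."""
    distance = max(0.0, goal_position - position)
    if distance < near:
        # Bonus is 0 at the edge of the `near` zone and bonus_scale at the goal.
        reward += bonus_scale * (1.0 - distance / near)
    return reward
```

The bonus should stay small next to the +100 goal reward, or the car may learn to hover near the goal instead of reaching it.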

---

## Discrete Actions

* Many environments can be described with discrete actions
  * For example, we could easily change the mountain car to have three velocity updates: -1, 0, and 1
* Sometimes, things that seem continuous can also have discrete controls
* Such as in [Playing Atari with Deep Reinforcement Learning](https://arxiv.org/abs/1312.5602)
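
As a sketch of the mountain car example, a continuous action can be snapped to the nearest of the three discrete choices. The names `DISCRETE_ACTIONS` and `to_discrete` are hypothetical.

```python
DISCRETE_ACTIONS = (-1.0, 0.0, 1.0)


def to_discrete(action):
    """Map a continuous action in [-1, 1] to the closest discrete action."""
    return min(DISCRETE_ACTIONS, key=lambda choice: abs(choice - action))
```

With only three actions, a value-based method can evaluate $Q(s,a)$ for each one and pick the best.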

---

## Deep Q-Learning

* The Atari playing paper was straightforward:
  * Estimate $Q(s,a)$ while acting with an $\epsilon$-greedy policy
  * Sample training batches from a replay buffer to break up the correlation between consecutive samples
  * Inputs were stacks of four 84x84 grayscale frames
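
The two mechanisms above can be sketched in a few lines of plain Python. This is an illustration, not the paper's implementation; `epsilon_greedy` and `ReplayBuffer` are hypothetical names, and the Q-network itself is left out.

```python
import random
from collections import deque


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the highest-valued one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


class ReplayBuffer:
    """Fixed-size store of transitions; the oldest entries are dropped first."""

    def __init__(self, capacity):
        self.transitions = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.transitions.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        # Uniform sampling mixes transitions from many different episodes.
        return rng.sample(list(self.transitions), batch_size)
```

Training batches drawn this way are far less self-similar than consecutive frames from one game.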

---

## Reward Instability

* As we've seen, episode rewards are not stable
  * Especially in stochastic games

<div class="col">
<img style="width: 80%" class="r-stretch" src="./figures/AtariQLearningF2Left.png" />
<br/>
<small>Figure 2 from Mnih et al.</small>
</div>

---

## Q-Stability

* The estimates of the Q function do trend up
  * The network is becoming more confident that it will get a reward
* Thus it can be better to track the network's confidence and outputs rather than the episode results

<div class="col">
<img style="width: 80%" class="r-stretch" src="./figures/AtariQLearningF2Right.png" />
<br/>
<small>Figure 2 from Mnih et al.</small>
</div>

---

## Progress

* The DQN trained on Atari games was better than humans on some simple games
  * Pong, for example
* Terrible on others, especially exploration-type games
* In the 13 years since then, there has been substantial progress
  * See [https://deepmind.google/blog/agent57-outperforming-the-human-atari-benchmark/](https://deepmind.google/blog/agent57-outperforming-the-human-atari-benchmark/)

---

## Never Give Up

* [Never Give Up: Learning Directed Exploration Strategies](https://arxiv.org/abs/2002.06038) incorporated rewards for exploration into the model
* Self-supervised learning was used to train the model to produce embeddings for observed states
* Those embeddings were then used to group states that are similar
  * Similarity was then used for three things:
    * Discouraging visits to similar states within an episode
    * Gradually discouraging visits to the same states across many episodes
    * Ignoring non-interactive parts of the environment in the state embedding
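
The idea of rewarding states that look unlike anything seen so far can be sketched as below. This is a heavily simplified stand-in and not the formula from the paper: the bonus here is just a squashed mean distance to the `k` nearest stored embeddings, and `novelty_bonus` is a hypothetical name.

```python
import math


def novelty_bonus(embedding, memory, k=3):
    """Return a bonus in [0, 1] that is larger when the embedding is far from memory."""
    if not memory:
        # Nothing has been seen yet, so the state is maximally novel.
        return 1.0
    distances = sorted(math.dist(embedding, stored) for stored in memory)
    nearest = distances[:k]
    mean_distance = sum(nearest) / len(nearest)
    # Squash into [0, 1): 0 for a repeated state, approaching 1 for a distant one.
    return mean_distance / (1.0 + mean_distance)
```

Revisiting a stored state earns no bonus, so the agent is nudged toward states it has not embedded before.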

---

## Embedding Training

* The embedding networks are siamese models: the same network embeds two consecutive observations, and the pair of embeddings is used to predict the action taken between them
* This forces them to have meaningful content
  * And the compression forces out non-interactive information

<div class="container">
<div class="col">
<img style="width: 50%" class="r-stretch" src="./figures/NeverGiveUpFig2Left.png" />
<br/>
<small>Figure 2 from Badia et al.</small>
</div>
</div>

---

## Complicated

* This all becomes very complicated very quickly
* And the best performing techniques now mostly rely upon brute force
  * Distributed training across many GPUs, for example

---

## For You

* Try to make problems discrete
  * And then, ideally, apply value iteration

<!--
Prioritized Experience Replay
https://arxiv.org/abs/1511.05952

Continuous mountain car PPO example:
https://github.com/analista10SPN/AC_CAR_MOUNTAIN

Also see https://github.com/XinJingHao/TD3-Pytorch/blob/main/TD3.py
-->