# CS 530 - Lecture 18

## RL with Approximation

Bernhard Firner

2026-04-02

---

## Review

* What else is there to learn?
* AlphaGo Zero demonstrated that deep learning can do everything
  * Estimate the action-utility function, $Q(s,a)$
  * Estimate the optimal policy, $\pi$
* The fast learning paper demonstrated that we could integrate DNNs into cost maps
  * So we can even support value iteration approaches

---

## Hammer?

* So should we just use deep learning for everything?
* Probably not
  * Other techniques, even other ML techniques, may be better matches for some problems
* Let's look at some examples and some pitfalls

---

## Action-Utility Estimation

* Let's begin with action-utility estimation
  * Specifically, let's use a neural network in TD learning
* The training loss of a DNN, $Q$, with parameters $\theta$, will be
  * $L[\theta] = \left(R + \gamma\max_{a}Q(s_{t+1}, a) - Q(s_t, a_t)\right)^2$
  * Basically, this is a regression task where the DNN estimates Q
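
As a rough illustration of the loss above, here is a minimal sketch for a single transition in plain Python. The function names `td_target` and `td_loss` are made up for this example, and the Q values are hand-picked numbers rather than network outputs.

```python
def td_target(reward, next_q_values, gamma=1.0, terminal=False):
    """Return R + gamma * max_a Q(s_{t+1}, a), or just R when the episode ended."""
    if terminal or not next_q_values:
        return reward
    return reward + gamma * max(next_q_values)


def td_loss(q_sa, reward, next_q_values, gamma=1.0, terminal=False):
    """Squared TD error for one transition."""
    return (td_target(reward, next_q_values, gamma, terminal) - q_sa) ** 2


# One step with reward -1 and four (hypothetical) action values in the next state.
# target = -1 + max(-4, -3, -6, -4.5) = -4, error = -4 - (-5) = 1
loss = td_loss(q_sa=-5.0, reward=-1.0, next_q_values=[-4.0, -3.0, -6.0, -4.5])
print(loss)  # 1.0
```

In the DNN version, `q_sa` is the network output that receives the gradient, and the target is treated as a fixed regression label.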

---

## Discrete Vs Continuous

* Note that this will be the same whether the input is discrete or continuous
* Neural networks *generalize*, so learning will cause estimates of Q to be similar for similar states
* We'll use a discrete setup for our first example, but $s$ could easily be an image, or a mix of different sensors

---

## Using Continuous Inputs

* Practically speaking, learning from input images would be very slow
  * We've seen that filling in Q values or doing policy exploration typically requires hundreds or thousands of episodes
* If we had to learn feature extraction in a convnet at the same time, we would need lots of data
* Note that it is still possible, such as in TrailNet, DAVE, or PilotNet

---

## Pretrained Models

* For your applications, using a pretrained model may be the better approach
* PyTorch has a large number of pretrained weights, making it fairly easy to bootstrap
  * [Models and pre-trained weights](https://docs.pytorch.org/vision/stable/models.html)
* With a good feature extractor, you don't necessarily need the rest of the DNN
  * Extracted features work well with the SVM implementations in scikit-learn (`sklearn.svm`), and SVMs generally require less training data

---

## First Example

* We'll revisit the cliff walking example
* It is static, deterministic, and discrete
  * After mastering this, we'll work our way into some stochastic, dynamic, and continuous examples
* At least some of these should be useful for your projects

---

## Software

* The Farama-Foundation [Gymnasium](https://github.com/Farama-Foundation/Gymnasium) is a good place to look for examples
  * Successor to OpenAI's Gym
  * Has lots of environments already set up for learning
* Install with `pip3 install gymnasium[all]`
* This will get us started quickly

---

## Cliff Walking

* Cliff Walking: [https://gymnasium.farama.org/environments/toy_text/cliff_walking/](https://gymnasium.farama.org/environments/toy_text/cliff_walking/)
* This is a 4x12 grid world
  * Current state is given by: $row \times ncols + column$, with $ncols = 12$
  * Actions are: up, right, down, left
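
A small sketch of the state numbering, assuming row-major ordering over the 12 columns with rows and columns starting at 0. The helper names `to_state` and `to_row_col` are invented for this example.

```python
N_ROWS, N_COLS = 4, 12


def to_state(row, col, n_cols=N_COLS):
    """Flatten a (row, column) grid position into a single state index."""
    return row * n_cols + col


def to_row_col(state, n_cols=N_COLS):
    """Recover the (row, column) grid position from a state index."""
    return divmod(state, n_cols)


print(to_state(3, 0))   # bottom-left corner (start) -> 36
print(to_state(3, 11))  # bottom-right corner (goal) -> 47
print(to_row_col(15))   # (1, 3), directly below state 3
```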

<div class="col">
<img style="width: 60%" class="r-stretch" src="./figures/RLBook/ch6_cliff_walking_top.png" />
<br/>
<small>Cliff walking from Sutton and Barto.</small>

</div>

---

## Setup

* Getting a unique value for each state is actually not great for a neural network
  * But let's begin with that
* We will have to convert the states into a one-hot vector
  * Otherwise the DNN would have to learn that states 3 and 4 are similar, 3 and 15 are similar, but 11 and 12 are not
  * It's a complicated non-linear relationship, which would increase the training data required
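
To make the encoding concrete, here is a minimal plain-Python sketch that builds a one-hot state vector and appends a one-hot action vector. The helpers `one_hot` and `encode_state_action` are hypothetical names for this example; the sizes (48 states, 4 actions) come from the cliff walking setup.

```python
def one_hot(index, size):
    """Return a list of zeros with a single 1.0 at position `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def encode_state_action(state, action, n_states=48, n_actions=4):
    """Concatenate the one-hot state and one-hot action into one input vector."""
    return one_hot(state, n_states) + one_hot(action, n_actions)


x = encode_state_action(state=36, action=0)
print(len(x))        # 52
print(x.index(1.0))  # 36
```

With this encoding, no two states share an input feature, so the network is not pushed toward assuming that numerically close state indices are similar.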

---

## ANN

```python
import torch


class QModel(torch.nn.Module):
    """A small fully connected network for state-action value estimation."""

    def __init__(self, states=48, actions=4):
        super().__init__()
        # The input is the state vector concatenated with the action vector
        # (both are one-hot encoded in the cliff walking example)
        self.net = torch.nn.Sequential(
            torch.nn.Linear(in_features=states + actions, out_features=64),
            torch.nn.ReLU(),
            torch.nn.Linear(in_features=64, out_features=32),
            torch.nn.Linear(in_features=32, out_features=1))
        self.states = states

        # Draw the weights of the three linear layers from a standard normal
        torch.nn.init.normal_(self.net[0].weight.data)
        torch.nn.init.normal_(self.net[2].weight.data)
        torch.nn.init.normal_(self.net[3].weight.data)

    def forward(self, x):
        """Forward through the network."""
        return self.net(x)
```

---

## Learning

* I'll reproduce something similar to the learning class that we've been using
* We don't really want to learn from one update at a time
  * If you remember your SGD basics, a larger ratio of learning rate to batch size leads to better implicit regularization
  * Learning rates that are too large lead to instabilities
* So we want a large enough batch that we can learn quickly without having a numerical problem

---

## Training Bias

* We also have to worry about bias in our training batches
* Taking the last 10 steps will be likely to give us 10 samples from approximately the same location
* For this example, we'll use an $\epsilon$-greedy policy, so the random exploration should help
  * But when you train your models, you may want to subsample your data

---

## Learning Class

```python
import torch
import torch.nn.functional as F


class NNQ:
    def __init__(self, actions=4, states=48, alpha=0.001, gamma=1.0):
        """
        Arguments:
            actions   (int): Number of possible actions
            states    (int): Number of discrete states
            alpha     (float): Learning rate
            gamma     (float): Discount value
        """
        self.actions = actions
        self.states = states
        self.gamma = gamma
        self.model = QModel(states=states, actions=actions)
        self.optimizer = torch.optim.AdamW(self.model.parameters(), lr=alpha, weight_decay=0.01)
        self.criterion = torch.nn.MSELoss()

        # One-hot vectors for every action; we'll fill in the states as we go
        self.action_vectors = F.one_hot(torch.arange(0, actions), num_classes=actions)

        # Save up batches of 32 for learning
        self.past_state_actions = []
        self.past_rewards = []

    def updateQ(self, state, new_state, action_idx, reward):
        # Calculate the expected rewards
        with torch.no_grad():
            self.past_state_actions.append(self.makeSAVectors(state)[action_idx].tolist())
            # There are no more rewards after the episode is complete
            if new_state is None:
                future_reward = float(reward)
            else:
                future_reward = float(reward) + self.gamma * self.maxReward(new_state)
            self.past_rewards.append(future_reward)

        # Learn at the end of an episode or when there is enough data available
        if new_state is None or len(self.past_state_actions) == 32:
            # Zero gradients before gradient calculation
            self.optimizer.zero_grad()
            # Forward again, this time with gradients enabled
            y_hat = self.model.forward(torch.tensor(self.past_state_actions).float())
            y = torch.tensor(self.past_rewards).float().view(-1, 1)
            #print(f"computing loss between {y_hat} and {y}")
            loss = self.criterion(y_hat, y)

            # Gradient calculation
            loss.backward()
            # Update weights
            self.optimizer.step()
            self.past_state_actions = []
            self.past_rewards = []

    def estimateQs(self, state):
        # One Q estimate per action for the given state
        sa_vectors = self.makeSAVectors(state)
        return self.model.forward(sa_vectors.float())

    def bestAction(self, state):
        q_estimates = self.estimateQs(state)
        return torch.argmax(q_estimates).item()

    def maxReward(self, state):
        q_estimates = self.estimateQs(state)
        return torch.max(q_estimates).item()

    def meanReward(self, state):
        q_estimates = self.estimateQs(state)
        return torch.mean(q_estimates).item()

    def makeSAVectors(self, state):
        # Build one row per action: [one-hot state, one-hot action]
        with torch.no_grad():
            state_vector = F.one_hot(torch.tensor([state]), num_classes=self.states).expand(self.actions, -1)
            state_action_vector = torch.cat((state_vector, self.action_vectors), dim=1)
        return state_action_vector
```

---

## Training

```python
# Assumes `random` is imported and that env, qmodel, and epsilon were set up earlier
for episode in range(1000):
    # Enable for visualization
    #if episode % 10 == 9:
    #    vis.plotMax(qmodel, f"{args.algorithm}_learning_maxQ_{episode}.png")
    # Reset the environment to put the agent into an initial state
    state, info = env.reset()

    # Run the simulation
    finished = False
    sim_steps = 0
    while not finished:
        # Epsilon greedy policy
        if random.random() < epsilon:
            action = random.choice([0, 1, 2, 3])
        else:
            action = qmodel.bestAction(state)

        next_state, reward, terminated, truncated, info = env.step(action)

        # A terminal state has no future reward
        if terminated:
            next_state = None

        qmodel.updateQ(state, next_state, action, reward)

        # Update the state
        state = next_state

        sim_steps += 1
        # Also stop if the environment cuts the episode short
        finished = terminated or truncated
    print(f"{episode} {sim_steps}")
env.close()
```

---

## Vs Double Q

* This isn't necessarily better than our tabular approaches

---

## Not Surprising

* That shouldn't be a surprise
* We already know that statistics work well in discrete settings
* There are two advantages though
  * First, we could replace that DNN with a convnet that feeds into a classifier
  * Now we work on image inputs!
* Second, what if the state space is continuous?

---

## Mountain Car Problem

* Imagine your car is stuck at the bottom of a steep hill
* You need to go fast enough to climb out, but you cannot gather enough speed from the bottom
* You can figure out that you'll need to go back and forth, gathering momentum like a swing
  * Can RL learn this as well?

---

## State

* The state is the position of the car, from -1.2 to 0.6, and its velocity, from -0.07 to 0.07
* Actions are from -1 to 1, and correspond to accelerating the vehicle
  * There is an easy version of this where the actions are discrete: $[-1, 0, 1]$

---

## Learning Q

* We are still trying to estimate Q
  * (we'll get to learning the policy directly next time)
* But how can we learn Q when there are an infinite number of states?
* Even more critical: how can we choose which action to take?

---

## Random Probing

```python
import torch


class NNQ:
    def __init__(self, actions=1, states=2, alpha=0.001, gamma=1.0):
        """
        Arguments:
            actions   (int): Number of action dimensions
            states    (int): Number of state dimensions
            alpha     (float): Learning rate
            gamma     (float): Discount value
        """
        self.actions = actions
        self.gamma = gamma
        self.model = QModel(states=states, actions=actions)
        self.optimizer = torch.optim.AdamW(self.model.parameters(), lr=alpha, weight_decay=0.01)
        self.criterion = torch.nn.MSELoss()

        # Save up batches of 32 for learning
        self.past_state_actions = []
        self.past_rewards = []

    def updateQ(self, state, new_state, action, reward):
        # Calculate the expected rewards
        with torch.no_grad():
            # The action is a continuous value here, not an index
            action_vector = torch.as_tensor(action, dtype=torch.float32).view(1, -1)
            self.past_state_actions.append(self.makeSAVectors(state, action_vector)[0].tolist())
            # There are no more rewards after the episode is complete
            if new_state is None:
                future_reward = float(reward)
            else:
                future_reward = float(reward) + self.gamma * self.maxReward(new_state)
            self.past_rewards.append(future_reward)

        # Learn at the end of an episode or when there is enough data available
        if new_state is None or len(self.past_state_actions) == 32:
            # Zero gradients before gradient calculation
            self.optimizer.zero_grad()
            # Forward again, this time with gradients enabled
            y_hat = self.model.forward(torch.tensor(self.past_state_actions).float()).view(-1, 1)
            y = torch.tensor(self.past_rewards).float().view(-1, 1)
            #print(f"computing loss between {y_hat} and {y}")
            loss = self.criterion(y_hat, y)

            # Gradient calculation
            loss.backward()
            # Update weights
            self.optimizer.step()
            self.past_state_actions = []
            self.past_rewards = []

    def randomActions(self, count=100):
        # Actions are in the range from -1 to 1
        return torch.rand(count, self.actions) * 2 - 1

    def estimateQs(self, state, actions):
        sa_vectors = self.makeSAVectors(state, actions)
        return self.model.forward(sa_vectors.float())

    def bestAction(self, state):
        rand_actions = self.randomActions()
        with torch.no_grad():
            q_estimates = self.estimateQs(state, rand_actions)
        # Return the sampled action with the highest estimate, not its index
        return rand_actions[torch.argmax(q_estimates)].numpy()

    def maxReward(self, state):
        q_estimates = self.estimateQs(state, self.randomActions())
        return torch.max(q_estimates).item()

    def meanReward(self, state):
        q_estimates = self.estimateQs(state, self.randomActions())
        return torch.mean(q_estimates).item()

    def makeSAVectors(self, state, actions):
        # Build one row per candidate action: [state, action]
        with torch.no_grad():
            state_vector = torch.as_tensor(state, dtype=torch.float32).expand(actions.size(0), -1)
            state_action_vector = torch.cat((state_vector, actions), dim=1)
        return state_action_vector
```

---

## Random Actions

* We can test our Q estimates with many random actions, and then select the best
* This is obviously inefficient
  * Now we can see that learning the policy directly may be a better approach
* Intuitively, if the model's internal state can estimate Q, it must also be able to select an action
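
The probing step can be sketched without a neural network. In the example below, `toy_q` is a made-up stand-in for the learned Q function (it peaks at an action of 0.5), and `best_action_by_sampling` is a hypothetical helper name. The point is only to show that we return the best sampled action itself, and that the quality of the answer depends on how many samples we draw.

```python
import random


def best_action_by_sampling(q_fn, state, n_samples=100, low=-1.0, high=1.0, rng=random):
    """Evaluate q_fn on random candidate actions and return the best (action, value)."""
    candidates = [rng.uniform(low, high) for _ in range(n_samples)]
    values = [q_fn(state, a) for a in candidates]
    best = max(range(n_samples), key=values.__getitem__)
    return candidates[best], values[best]


def toy_q(state, action):
    # Hypothetical Q function for illustration: highest value at action = 0.5
    return -(action - 0.5) ** 2


random.seed(0)
action, value = best_action_by_sampling(toy_q, state=None, n_samples=1000)
print(action, value)
```

Every action selection costs `n_samples` evaluations of Q, which is the inefficiency noted above.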

---

## Result

<div class="col">
<img style="width: 60%" class="r-stretch" src="./figures/mountain-car-episode-0.gif" />
<br/>
<small>Result after 100 episodes.</small>

</div>

---

## Obviously Not Great

* This is obviously not great
* Why?
  * The velocity is really being modified by multiple previous actions, and we are ignoring their input
* We should really switch to some kind of n-step method
  * Or perhaps learn the policy directly
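
As a sketch of what an n-step target looks like, the function below sums $n$ observed rewards and then bootstraps from a value estimate at the state reached after those steps. The name `n_step_return` and the numbers are illustrative assumptions; in the Mountain Car setup the bootstrap value would come from the Q network.

```python
def n_step_return(rewards, bootstrap_value, gamma=1.0):
    """G = r_0 + gamma*r_1 + ... + gamma^(n-1)*r_(n-1) + gamma^n * bootstrap_value."""
    g = bootstrap_value
    # Work backwards so each reward is discounted the right number of times
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Three steps of reward -1, then a bootstrapped estimate of -10
print(n_step_return([-1.0, -1.0, -1.0], bootstrap_value=-10.0, gamma=0.5))  # -3.0
```

With $n = 1$ this reduces to the one-step TD target used so far.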

---

## Stochastic Policies

* Before digging into those details, let's discuss a side topic related to policy learning
* DNN policy models are often basically classifiers for different actions
  * This turns them into stochastic models; is there an inherent advantage to that?
* This is a good "on the board" example, so I'll do it there
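
To show what "classifier as a stochastic policy" means in practice, here is a minimal sketch that turns per-action scores into probabilities with a softmax and then samples an action. The helper names and the scores are assumptions for illustration; a real policy network would produce the scores.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(logits, rng=random):
    """Sample an action index in proportion to its softmax probability."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Equal scores give a coin flip between the two actions
print(softmax([0.0, 0.0]))  # [0.5, 0.5]
print(sample_action([0.0, 0.0]))
```

A greedy policy over Q values can only ever pick one action per state, while this kind of policy can represent any mixture of actions.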

<div class="col">
<img style="width: 50%" class="r-stretch" src="./figures/RLBook/ch13_switched.png" />
<br/>
<small>Short corridor with switched actions from Sutton and Barto. See chapter 13.</small>

</div>