* Cliff Walking: [https://gymnasium.farama.org/environments/toy_text/cliff_walking/](https://gymnasium.farama.org/environments/toy_text/cliff_walking/)
* This is a 4x12 grid world
* The current state is given by $row \times ncols + column$, with $ncols = 12$
* Actions are: up, right, down, left
Cliff walking from Sutton and Barto.
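
As a quick check of the state formula, here is a minimal sketch that converts between grid coordinates and state indices. It assumes the 4x12 layout described above, with row 0 at the top; the helper names `to_state` and `to_row_col` are hypothetical and not part of Gymnasium.

```python
NROWS, NCOLS = 4, 12  # grid size from the slide above


def to_state(row, col):
    """Flatten (row, col) into a single state index."""
    return row * NCOLS + col


def to_row_col(state):
    """Recover (row, col) from a state index."""
    return divmod(state, NCOLS)


print(to_state(3, 0))   # bottom-left corner
print(to_row_col(47))   # last state in the grid
```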
---
## Setup
* Encoding each state as a single integer index is actually not great for a neural network
* But let's begin with that
* We will have to convert the states into a one-hot vector
* Otherwise the DNN would have to learn that states 3 and 4 are similar, 3 and 15 are similar, but 11 and 12 are not
* It's a complicated non-linear relationship, which would increase the training data required
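
To make the adjacency argument concrete, the sketch below compares the gap between state indices with the actual distance on the grid, and shows a plain-Python one-hot encoding. The helper names `grid_distance` and `one_hot` are hypothetical, and the example assumes the 12-column layout from the first slide.

```python
NCOLS = 12    # columns in the 4x12 Cliff Walking grid
NSTATES = 48  # total number of states


def grid_distance(s1, s2):
    """Manhattan distance between two states on the grid."""
    r1, c1 = divmod(s1, NCOLS)
    r2, c2 = divmod(s2, NCOLS)
    return abs(r1 - r2) + abs(c1 - c2)


def one_hot(state):
    """Encode a state index as a 48-element indicator vector."""
    return [1.0 if i == state else 0.0 for i in range(NSTATES)]


for a, b in [(3, 4), (3, 15), (11, 12)]:
    print(f"states {a},{b}: index gap {abs(a - b)}, grid distance {grid_distance(a, b)}")
```

The index gap and the grid distance disagree for two of the three pairs, which is the relationship the network would otherwise have to learn. With a one-hot input, every state gets its own weights and no ordering is implied.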
---
## ANN
```python
import torch


class QModel(torch.nn.Module):
    """A small feed-forward network for state-action value estimation."""

    def __init__(self, states=48, actions=4):
        super().__init__()
        # The input is the state vector concatenated with the action vector
        self.net = torch.nn.Sequential(
            torch.nn.Linear(in_features=states + actions, out_features=64),
            torch.nn.ReLU(),
            torch.nn.Linear(in_features=64, out_features=32),
            torch.nn.Linear(in_features=32, out_features=1))
        self.states = states
        # Draw the weights of every linear layer from a standard normal
        torch.nn.init.normal_(self.net[0].weight.data)
        torch.nn.init.normal_(self.net[2].weight.data)
        torch.nn.init.normal_(self.net[3].weight.data)

    def forward(self, x):
        """Return one Q estimate per state-action row of x."""
        return self.net(x)
```
---
## Learning
* I'll reproduce something similar to the learning class that we've been using
* We don't really want to learn from one sample at a time
* If you remember your SGD basics, a larger ratio of learning rate to batch size leads to better implicit regularization
* Learning rates that are too large lead to instabilities
* So we want a large enough batch that we can learn quickly without having a numerical problem
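
One way to see why batch size matters here is that the noise in a mini-batch gradient shrinks as the batch grows. The sketch below uses a made-up per-sample gradient (true value 1.0 plus unit Gaussian noise) purely for illustration; the function names are hypothetical and nothing here comes from the Q-learning code.

```python
import random
import statistics

random.seed(0)


def noisy_gradient():
    """Stand-in for a per-sample gradient: true value 1.0 plus unit Gaussian noise."""
    return 1.0 + random.gauss(0.0, 1.0)


def batch_gradient(batch_size):
    """Average the per-sample gradients over one mini-batch."""
    return statistics.fmean([noisy_gradient() for _ in range(batch_size)])


def gradient_spread(batch_size, trials=2000):
    """Standard deviation of the batch gradient across many batches."""
    return statistics.stdev([batch_gradient(batch_size) for _ in range(trials)])


spread_1 = gradient_spread(1)
spread_32 = gradient_spread(32)
print(f"batch size 1: {spread_1:.3f}, batch size 32: {spread_32:.3f}")
```

The spread falls roughly with the square root of the batch size, so a batch of 32 tolerates a larger step than a single sample before the updates become erratic.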
---
## Training Bias
* We also have to worry about bias in our training batches
* Taking the last 10 steps will be likely to give us 10 samples from approximately the same location
* For this example, we'll use an $\epsilon$-greedy policy, so the random exploration should help
* But when you train your models, you may want to subsample your data
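
A common way to subsample is to keep a buffer of past transitions and draw each batch uniformly at random from it. The sketch below is one possible approach, not the method used in the learning class that follows; the buffer size, batch size, and placeholder transitions are all assumptions.

```python
import random
from collections import deque

random.seed(0)

# Keep only the most recent transitions; older ones fall off the left end
buffer = deque(maxlen=1000)

# Fill with placeholder transitions: (state, action, reward, next_state)
for step in range(5000):
    buffer.append((step, step % 4, -1.0, step + 1))

# The last 10 steps are consecutive, so they come from nearly the same place
recent = list(buffer)[-10:]

# A uniform random subsample spreads over the whole stored history
batch = random.sample(list(buffer), k=32)

print("recent states: ", [t[0] for t in recent])
print("sampled states:", sorted(t[0] for t in batch))
```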
---
## Learning Class
```python
import torch
import torch.nn.functional as F


class NNQ:
    def __init__(self, actions=4, states=48, alpha=0.001, gamma=1.0):
        """
        Arguments:
            actions (int): Number of possible actions
            states (int): Number of discrete states
            alpha (float): Learning rate
            gamma (float): Discount value
        """
        self.actions = actions
        self.states = states
        self.gamma = gamma
        self.model = QModel(states=states, actions=actions)
        self.optimizer = torch.optim.AdamW(self.model.parameters(), lr=alpha, weight_decay=0.01)
        self.criterion = torch.nn.MSELoss()
        # One one-hot row per action; paired with the state in makeSAVectors
        self.action_vectors = F.one_hot(torch.arange(0, actions), num_classes=actions)
        # Save up batches of 32 for learning
        self.past_state_actions = []
        self.past_rewards = []

    def updateQ(self, state, new_state, action_idx, reward):
        # Calculate the expected rewards
        with torch.no_grad():
            self.past_state_actions.append(self.makeSAVectors(state)[action_idx].tolist())
            # There are no more rewards after the episode is complete
            if new_state is None:
                future_reward = float(reward)
            else:
                future_reward = float(reward) + self.gamma * self.maxReward(new_state)
            self.past_rewards.append(future_reward)
        # Learn at the end of an episode or when there is enough data available
        if new_state is None or len(self.past_state_actions) == 32:
            # Zero gradients before gradient calculation
            self.optimizer.zero_grad()
            # Forward again, this time tracking gradients
            y_hat = self.model.forward(torch.tensor(self.past_state_actions).float())
            y = torch.tensor(self.past_rewards).float().view(-1, 1)
            #print(f"computing loss between {y_hat} and {y}")
            loss = self.criterion(y_hat, y)
            # Gradient calculation
            loss.backward()
            # Update weights
            self.optimizer.step()
            # Start collecting the next batch
            self.past_state_actions = []
            self.past_rewards = []

    def estimateQs(self, state):
        """Return one Q estimate per action for the given state."""
        sa_vectors = self.makeSAVectors(state)
        return self.model.forward(sa_vectors.float())

    def bestAction(self, state):
        q_estimates = self.estimateQs(state)
        return torch.argmax(q_estimates).item()

    def maxReward(self, state):
        q_estimates = self.estimateQs(state)
        return torch.max(q_estimates).item()

    def meanReward(self, state):
        q_estimates = self.estimateQs(state)
        return torch.mean(q_estimates).item()

    def makeSAVectors(self, state):
        """Pair the one-hot state with every one-hot action, one row per action."""
        with torch.no_grad():
            state_vector = F.one_hot(torch.tensor([int(state)]), num_classes=self.states)
            state_vector = state_vector.expand(self.actions, -1)
            state_action_vector = torch.cat((state_vector, self.action_vectors), dim=1)
        return state_action_vector
```
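
The regression target stored by `updateQ` can be written compactly. This is a reading of the code above rather than an additional method: $r$ is the reward, $s'$ the next state, $\gamma$ the discount, and $N$ the number of stored samples (at most 32).

$$
y = \begin{cases}
r & \text{if the episode ended} \\
r + \gamma \max_{a'} Q(s', a') & \text{otherwise}
\end{cases}
\qquad
L = \frac{1}{N} \sum_{i=1}^{N} \left( Q(s_i, a_i) - y_i \right)^2
$$

The targets $y_i$ are computed without gradients, so only the prediction $Q(s_i, a_i)$ is differentiated when the loss is minimized.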
---
## Training
```python
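import random

import gymnasium as gym

# Assumed setup, not shown on the original slide: the environment id,
# the epsilon value, and the default-constructed NNQ are placeholders
env = gym.make("CliffWalking-v0")
qmodel = NNQ()
epsilon = 0.1

for episode in range(1000):
    # Enable for visualization
    #if episode % 10 == 9:
    #    vis.plotMax(qmodel, f"{args.algorithm}_learning_maxQ_{episode}.png")
    # Reset the environment to put the agent into an initial state
    state, info = env.reset()
    # Run the simulation
    finished = False
    sim_steps = 0
    while not finished:
        # Epsilon greedy policy
        if random.random() < epsilon:
            action = random.choice([0, 1, 2, 3])
        else:
            action = qmodel.bestAction(state)
        next_state, reward, terminated, truncated, info = env.step(action)
        # A terminal state has no future reward, which updateQ signals with None
        if terminated:
            next_state = None
        qmodel.updateQ(state, next_state, action, reward)
        # Update the state
        state = next_state
        sim_steps += 1
        # Stop on termination or on a time-limit truncation
        finished = terminated or truncated
    print(f"{episode} {sim_steps}")
env.close()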
```
---
## Vs Double Q
* This isn't necessarily better than our tabular approaches
---
## Not Surprising
* That shouldn't be a surprise
* We already know that statistics work well in discrete settings
* There are two advantages though
* First, we could replace that DNN with a convnet that feeds into a classifier
* Now we work on image inputs!
* Second, what if the state space is continuous?
---
## Mountain Car Problem
* Imagine your car is stuck at the bottom of a steep hill
* You need to go fast enough to climb out, but you cannot gather enough speed from the bottom
* You can figure out that you'll need to go back and forth, gathering momentum like a swing
* Can RL learn this as well?
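
The swing intuition can be checked with a small hand-written simulation. This is a sketch with assumed constants chosen to resemble the continuous Mountain Car task; it is not the Gymnasium implementation, and the policy names `always_right` and `swing` are hypothetical.

```python
import math

# Assumed constants; treat them as illustrative, not as the environment's exact values
POWER = 0.0015
GRAVITY = 0.0025
MIN_POS, MAX_POS, GOAL_POS = -1.2, 0.6, 0.45
MAX_SPEED = 0.07


def step(position, velocity, action):
    """Advance the car one time step for an action in [-1, 1]."""
    velocity += action * POWER - GRAVITY * math.cos(3 * position)
    velocity = max(-MAX_SPEED, min(MAX_SPEED, velocity))
    position = max(MIN_POS, min(MAX_POS, position + velocity))
    if position == MIN_POS and velocity < 0:
        velocity = 0.0  # the left wall stops the car
    return position, velocity


def steps_to_goal(policy, max_steps=1000):
    """Run one episode from near the valley floor; return the step count or None."""
    position, velocity = -0.5, 0.0
    for t in range(1, max_steps + 1):
        position, velocity = step(position, velocity, policy(position, velocity))
        if position >= GOAL_POS:
            return t
    return None


def always_right(position, velocity):
    """Full throttle toward the goal on every step."""
    return 1.0


def swing(position, velocity):
    """Push in the direction the car is already moving."""
    return 1.0 if velocity >= 0 else -1.0


print("always right:", steps_to_goal(always_right))
print("swing:       ", steps_to_goal(swing))
```

Under these assumptions, pushing right on every step never reaches the goal, while pushing in the direction of travel builds up enough momentum to climb out.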
---
## State
* The state is the car's position, from -1.2 to 0.6, and its velocity, from -0.07 to 0.07
* Actions range from -1 to 1 and set the acceleration applied to the car
* There is an easy version of this where the actions are restricted to $[-1, 0, 1]$
---
## Learning Q
* We are still trying to estimate Q
* (we'll get to learning the policy directly next time)
* But how can we learn Q when there are an infinite number of states?
* Even more critical: how can we choose which action to take?
---
## Random Probing
```python
import torch


class NNQ:
    def __init__(self, actions=1, states=2, alpha=0.001, gamma=1.0):
        """
        Arguments:
            actions (int): Dimension of the continuous action
            states (int): Dimension of the continuous state
            alpha (float): Learning rate
            gamma (float): Discount value
        """
        self.actions = actions
        self.gamma = gamma
        self.model = QModel(states=states, actions=actions)
        self.optimizer = torch.optim.AdamW(self.model.parameters(), lr=alpha, weight_decay=0.01)
        self.criterion = torch.nn.MSELoss()
        # Save up batches of 32 for learning
        self.past_state_actions = []
        self.past_rewards = []

    def updateQ(self, state, new_state, action, reward):
        # Calculate the expected rewards
        with torch.no_grad():
            # A single action becomes a 1x1 tensor so it pairs with one state row
            action_tensor = torch.as_tensor(action).float().view(1, -1)
            self.past_state_actions.append(self.makeSAVectors(state, action_tensor)[0].tolist())
            # There are no more rewards after the episode is complete
            if new_state is None:
                future_reward = float(reward)
            else:
                future_reward = float(reward) + self.gamma * self.maxReward(new_state)
            self.past_rewards.append(future_reward)
        # Learn at the end of an episode or when there is enough data available
        if new_state is None or len(self.past_state_actions) == 32:
            # Zero gradients before gradient calculation
            self.optimizer.zero_grad()
            # Forward again, this time tracking gradients
            y_hat = self.model.forward(torch.tensor(self.past_state_actions).float()).view(-1, 1)
            y = torch.tensor(self.past_rewards).float().view(-1, 1)
            #print(f"computing loss between {y_hat} and {y}")
            loss = self.criterion(y_hat, y)
            # Gradient calculation
            loss.backward()
            # Update weights
            self.optimizer.step()
            # Start collecting the next batch
            self.past_state_actions = []
            self.past_rewards = []

    def estimateQs(self, state, actions):
        """Return one Q estimate per row of actions for the given state."""
        sa_vectors = self.makeSAVectors(state, actions)
        return self.model.forward(sa_vectors.float())

    def bestAction(self, state):
        # Actions are in the range from -1 to 1
        rand_actions = torch.rand(100) * 2 - 1
        q_estimates = self.estimateQs(state, rand_actions.view(-1, 1))
        # Return the best sampled action itself, not its index among the samples
        return rand_actions[torch.argmax(q_estimates)].item()

    def maxReward(self, state):
        rand_actions = torch.rand(100) * 2 - 1
        q_estimates = self.estimateQs(state, rand_actions.view(-1, 1))
        return torch.max(q_estimates).item()

    def meanReward(self, state):
        rand_actions = torch.rand(100) * 2 - 1
        q_estimates = self.estimateQs(state, rand_actions.view(-1, 1))
        return torch.mean(q_estimates).item()

    def makeSAVectors(self, state, actions):
        """Pair the state with every candidate action, one row per action."""
        with torch.no_grad():
            state_vector = torch.as_tensor(state).float().expand(actions.size(0), -1)
            state_action_vector = torch.cat((state_vector, actions), dim=1)
        return state_action_vector
```
---
## Random Actions
* We can test our Q estimates with many random actions, and then select the best
* This is obviously inefficient
* Now we can see that learning the policy directly may be a better approach
* Intuitively, if the model's internal state can estimate Q, it must also be able to select an action
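
The probing idea can be isolated from the network with a toy Q function. The sketch below is an illustration only: `toy_q` is a hypothetical stand-in with a known best action of 0.3, and the probe count of 100 mirrors the code above.

```python
import random

random.seed(0)


def toy_q(state, action):
    """Hypothetical Q function whose best action is 0.3 for every state."""
    return -(action - 0.3) ** 2


def best_action(state, num_probes=100):
    """Probe Q at random actions in [-1, 1] and return the best action found."""
    probes = [random.uniform(-1.0, 1.0) for _ in range(num_probes)]
    return max(probes, key=lambda a: toy_q(state, a))


action = best_action(state=None)
print(f"best probed action: {action:.3f}")
```

Each action choice costs 100 evaluations of Q and still only lands near the optimum, which is the inefficiency noted above.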
---
## Result
Result after 100 episodes.
---
## Obviously Not Great
* This is obviously not great
* Why?
* The velocity is really the result of multiple previous actions, and we are ignoring their contribution
* We should really switch to some kind of n-step method
* Or perhaps learn the policy directly
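
As a rough idea of what an n-step target looks like, the sketch below sums n observed rewards and then bootstraps from a value estimate at the state reached afterwards. The function name and arguments are hypothetical; this is not wired into the `NNQ` class.

```python
def n_step_return(rewards, bootstrap_value, gamma=1.0):
    """Discounted sum of n observed rewards plus the discounted bootstrap.

    rewards: the n rewards observed after taking the action
    bootstrap_value: the Q (or V) estimate at the state reached after n steps,
                     or 0.0 if the episode ended
    """
    total = 0.0
    for k, reward in enumerate(rewards):
        total += (gamma ** k) * reward
    return total + (gamma ** len(rewards)) * bootstrap_value


print(n_step_return([-1.0, -1.0, -1.0], bootstrap_value=-5.0))          # -8.0
print(n_step_return([1.0, 2.0, 4.0], bootstrap_value=8.0, gamma=0.5))   # 4.0
```

With n = 1 this reduces to the one-step target used so far; larger n credits an action with more of the rewards that follow it.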
---
## Stochastic Policies
* Before digging into those details, let's take a short detour that is still on the topic of policy learning
* DNN policy models are often basically classifiers for different actions
* This turns them into stochastic models; is there an inherent advantage to that?
* This is a good "on the board" example, so I'll do it there
Swapped corridors from Sutton and Barto. See chapter 13.
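
For reference, the corridor example can also be worked numerically. The sketch below rests on my reading of the example's dynamics, which is an assumption: three non-terminal states in a row, a reward of -1 per step, "left" in the first state does nothing, the two actions are swapped in the second state, and the policy picks "right" with the same probability `p_right` in every state. Solving the three Bellman equations under those assumptions gives the closed form used in `start_value`.

```python
def start_value(p_right):
    """Expected return from the first state when 'right' has probability p_right."""
    return (2 * p_right - 4) / (p_right * (1 - p_right))


# Search a grid of stochastic policies; p_right = 0 or 1 never reaches the goal
candidates = [i / 100 for i in range(1, 100)]
best_p = max(candidates, key=start_value)

print(f"p_right = 0.50 -> value {start_value(0.5):.2f}")
print(f"p_right = 0.95 -> value {start_value(0.95):.2f}")
print(f"best p_right = {best_p:.2f} -> value {start_value(best_p):.2f}")
```

Under these assumptions the best policy is genuinely stochastic: both deterministic choices loop forever, and a nearly deterministic policy such as `p_right = 0.95` does far worse than the best mixed one.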