* Look at how many times we nearly reach the goal
* Each time, the value estimate will be horribly wrong
* That is actually going to push the model *away* from the path that led there and away from the solution
100 epochs.
---
## TD Advantages
* An advantage of actor-critic models is that we learn with every step
* The model can learn to avoid an area without seeing the goal
* That implies we can shorten how far ahead the reward signal looks by decreasing $\gamma$
* This will actually make training *more* stable
* We don't need to know how long it takes to get to the goal from a standstill, just that we shouldn't stand still
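
A minimal sketch of the one-step TD quantities behind these points. The function names are illustrative, and the reward of -1 per step in the usage note is an assumption about the environment rather than something fixed by these slides.

```python
def td_target(reward, next_value, gamma, done):
    """One-step TD target: r + gamma * V(s'), with no bootstrap at terminal states."""
    return reward + (0.0 if done else gamma * next_value)


def td_advantage(reward, value, next_value, gamma, done):
    """TD error, usable as the advantage estimate in a one-step actor-critic."""
    return td_target(reward, next_value, gamma, done) - value


def effective_horizon(gamma):
    """Rough number of future steps that still matter: 1 / (1 - gamma)."""
    return 1.0 / (1.0 - gamma)
```

With an assumed reward of -1 per step, `td_advantage(-1.0, -10.0, -9.0, 1.0, False)` is zero because the critic is already consistent. Lowering $\gamma$ from 0.99 to 0.9 cuts the effective horizon from roughly 100 steps to roughly 10.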
---
## Policy Search Disadvantages
* Notice, though, that we are now playing with hyperparameters
* The heart of the problem is that policy search is difficult
* A tiny change in action leads to a huge swing in future rewards
* Estimating a Q function is a fairly stable task
* In comparison to policy search, it is also easy to look at the loss and see if it is improving
---
## Discussion
* This is a real weakness of RL
* Value iteration and tabular Q-learning are more predictable
* Otherwise, we need to carefully craft new reward signals
* For example, reward the car for getting really close to the goal so it is not discouraged
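
One way such a shaped reward might look, as a sketch only. The goal position of 0.5 and the left track edge of -1.2 are assumed values, and `shaped_reward` and `bonus_scale` are hypothetical names, not part of any environment's API.

```python
GOAL_POSITION = 0.5   # assumed goal location on the track
MIN_POSITION = -1.2   # assumed left edge of the track


def shaped_reward(env_reward, position, bonus_scale=1.0):
    """Add a bonus that grows linearly as the car gets closer to the goal."""
    span = GOAL_POSITION - MIN_POSITION
    # Clamp the position to the track so closeness stays in [0, 1].
    clamped = max(MIN_POSITION, min(position, GOAL_POSITION))
    closeness = (clamped - MIN_POSITION) / span
    return env_reward + bonus_scale * closeness
```

A near miss now earns more than sitting at the bottom of the valley, so the critic is less likely to push the policy away from it. Shaping of this kind changes the objective, so `bonus_scale` becomes one more hyperparameter to tune.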
---
## Discrete Actions
* Many environments can be described with discrete actions
* For example, we could easily change the mountain car to have three velocity updates: -1, 0, and 1
* Sometimes, things that seem continuous can also have discrete controls
* Such as in [Playing Atari with Deep Reinforcement Learning](https://arxiv.org/abs/1312.5602)
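
A small sketch of the three-action idea for the mountain car. The mapping helpers are hypothetical and only illustrate moving between an action index and the values -1, 0, and 1.

```python
# The three discrete updates mentioned above.
ACTIONS = (-1.0, 0.0, 1.0)


def to_continuous(action_index):
    """Map a discrete action index (0, 1, 2) to its continuous value."""
    return ACTIONS[action_index]


def to_discrete(value):
    """Map a continuous value to the index of the nearest discrete action."""
    return min(range(len(ACTIONS)), key=lambda i: abs(ACTIONS[i] - value))
```

With this mapping, a Q-function only needs one output per entry in `ACTIONS` instead of a continuous policy head.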
---
## Deep Q-Learning
* The Atari playing paper was straightforward:
* Estimate $Q(s,a)$ using an $\epsilon$-greedy policy
* Store transitions in a replay buffer and sample from it at random to reduce the correlation between consecutive training samples
* Inputs were stacks of four 84x84 grayscale frames
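
The first two ingredients can be sketched in plain Python. This is not the paper's implementation: the class and function names are illustrative, and the Q-values are assumed to arrive as a simple list with one entry per action.

```python
import random
from collections import deque


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


class ReplayBuffer:
    """Fixed-size store of transitions; the oldest entries are dropped first."""

    def __init__(self, capacity):
        self.storage = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement breaks up consecutive steps.
        return rng.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)
```

During training, each step would call `epsilon_greedy` on the network's output, `add` the resulting transition, and then fit $Q(s,a)$ on a batch from `sample` rather than on the most recent step alone.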
---
## Reward Instability
* As we've seen, episode rewards are not stable
* Especially in stochastic games