* Don't think that this is just a ReLU problem
* Tanh and sigmoid used to be the popular activation functions
* Weight initialization needed to keep pre-activations roughly within the -2.5 to 2.5 range
* Otherwise tanh saturates: it is approximately constant, with almost no gradient
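
A minimal sketch of the saturation effect, using only Python's standard library. The helper name `tanh_grad` and the sample points are illustrative choices, not part of the original slides; the derivative formula $1 - \tanh^2(x)$ is the standard one.

```python
import math

def tanh_grad(x):
    """Derivative of tanh at x: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

# Sample points are arbitrary; 2.5 is the edge of the range mentioned above.
for x in [0.0, 1.0, 2.5, 5.0]:
    print(f"x = {x:4.1f}  tanh = {math.tanh(x):.4f}  gradient = {tanh_grad(x):.6f}")
```

The gradient is 1 at the origin and falls below 0.03 by $x = 2.5$, so inputs outside that range contribute almost nothing to learning.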
---
## Stability Problems
* Those normal weights didn't work, how about making them all large and positive?
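* Now the ReLUs aren't doing anything: with positive inputs and weights, every pre-activation is positive, so ReLU passes it through unchanged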
* And our intermediate values grow incredibly quickly
* Suppose the average input is 0.5 and the average weight is 0.5
* With a width of 1000, the second layer's output is roughly $0.5\times0.5\times1000 = 250$
---
## Stability Continued
* $f[x] = \beta_k + \Omega_k \mathrm{ReLU}(\beta_{k-1} + \Omega_{k-1}\mathrm{ReLU}(\beta_{k-2} + \Omega_{k-2}x))$
* Each neuron of the second layer outputs an average of 250
* What is the average output of the third layer?
* $250\times0.5\times1000 = 125{,}000$
* This gets unstable very quickly
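
The arithmetic above can be sketched as a short recurrence. This is a simplified model under stated assumptions: every weight equals the average (0.5), biases are zero, and layer 1 is taken to be the input. The function name `mean_output_per_layer` and its parameters are hypothetical, chosen for illustration.

```python
def relu(z):
    """ReLU activation for a single value."""
    return max(0.0, z)

def mean_output_per_layer(input_mean=0.5, weight_mean=0.5, width=1000, n_layers=5):
    """Average neuron output per layer, assuming all weights equal weight_mean
    and all biases are zero. Index 0 is the input (layer 1)."""
    outputs = [input_mean]
    for _ in range(n_layers - 1):
        # Each neuron sums `width` terms of (weight * previous output).
        pre_activation = width * weight_mean * outputs[-1]
        outputs.append(relu(pre_activation))  # positive, so ReLU is the identity
    return outputs

for layer, value in enumerate(mean_output_per_layer(), start=1):
    print(f"layer {layer}: average output = {value:.3g}")
```

Each layer multiplies the average by $0.5\times1000 = 500$, so the values grow geometrically with depth.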
---
## Problems
* Bad initialization makes learning difficult
* Either because training starts near a minimum where SGD makes no progress
* Or because of numerical problems
* The numerical issue is easier to examine, so let's begin there
---
## Backpropagation
* If the values in the forward pass are very small or very large, the gradients in backpropagation are affected too
* Numerical problems have names:
* **vanishing gradient** and **exploding gradient**
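
A toy illustration of both names, assuming a scalar network in which every layer computes $y = w \cdot x$ with positive values (so any ReLU acts as the identity). By the chain rule, the gradient of the output with respect to the input is then $w^{\text{depth}}$. The function name `gradient_through_chain`, the depth of 50, and the weights are illustrative assumptions.

```python
def gradient_through_chain(weight, depth):
    """Gradient of the output w.r.t. the input for `depth` stacked scalar
    layers y = weight * x (chain rule: one factor of `weight` per layer)."""
    grad = 1.0
    for _ in range(depth):
        grad *= weight
    return grad

# Weights below 1 shrink the gradient; weights above 1 blow it up.
for w in [0.5, 1.0, 1.5]:
    print(f"weight = {w}: gradient after 50 layers = {gradient_through_chain(w, 50):.3e}")
```

A weight of 0.5 gives a gradient that vanishes toward zero, 1.5 gives one that explodes, and only a weight of exactly 1 keeps it stable in this simplified setting.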
---
## Gradients