* Parts of the loss surface have zero gradient
* For example, if all inputs to a ReLU are negative
* Good initialization helps us avoid starting in these regions
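
A minimal sketch of the zero-gradient case, assuming the standard ReLU definition. The input values are hypothetical and chosen to be all negative.

```python
def relu(z):
    return max(0.0, z)

def relu_grad(z):
    # Derivative of ReLU: 1 for positive inputs, 0 otherwise
    return 1.0 if z > 0 else 0.0

pre_activations = [-2.0, -0.5, -0.1]  # hypothetical all-negative inputs
outputs = [relu(z) for z in pre_activations]
grads = [relu_grad(z) for z in pre_activations]

# Every output and every gradient is zero, so no learning signal flows back
print(outputs, grads)
```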
---
## More Capacity
* A more complex loss surface tends to be smoother
* This makes local minima less "sharp"
* So we can reduce the impact of a bad initialization by making our network deeper
* Depth is the most efficient way to add more capacity
---
## Exploding Gradients
* The forward and backward passes involve many multiplications
* $f_k = \beta_k + \Omega_k \mathrm{ReLU}(\beta_{k-1} + \Omega_{k-1}\mathrm{ReLU}(\beta_{k-2} + \Omega_{k-2}x))$
* Suppose the average input is 0.5 and the average weight is 0.5
* With a width of 1000, the second layer's output is roughly $0.5\times0.5\times1000 = 250$
* Repeating this across layers in the forward and backward passes can cause outputs and gradients to explode
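
The arithmetic above can be sketched as a loop. This is a rough illustration, not a real network: it assumes every input and weight sits exactly at its positive average, and it ignores biases.

```python
avg_input = 0.5
avg_weight = 0.5
width = 1000

value = avg_input
magnitudes = []
for depth in range(1, 5):
    # Each unit sums `width` products of (input * weight)
    value = value * avg_weight * width
    magnitudes.append(value)
    print(f"after {depth} multiplication(s): ~{value:.3g}")
```

The first step reproduces the $0.5\times0.5\times1000$ estimate, and each further step multiplies the magnitude by another factor of 500.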
---
## Vanishing Gradients
* $f_k = \beta_k + \Omega_k \mathrm{ReLU}(\beta_{k-1} + \Omega_{k-1}\mathrm{ReLU}(\beta_{k-2} + \Omega_{k-2}x))$
* If the weights are zero-mean, roughly half of a layer's ReLU outputs will be 0
* A ReLU with zero-mean normal input with $\sigma=0.1$ has a mean output of around 0.04
* After passing through a 100-width layer, with $\sigma=0.1$ normally distributed weights, the mean output is 0.0016
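
As a quick check of the 0.04 figure, assuming the ReLU input is $x \sim \mathcal{N}(0, \sigma^2)$:

$$
\mathbb{E}[\mathrm{ReLU}(x)] = \int_0^\infty x \,\frac{1}{\sigma\sqrt{2\pi}}\, e^{-x^2/(2\sigma^2)}\,dx = \frac{\sigma}{\sqrt{2\pi}} \approx 0.399\,\sigma
$$

With $\sigma = 0.1$ this gives roughly $0.04$.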
---
## Expected Gradient
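* We can calculate the variance of each layer's output from the previous layer's width $W_{i-1}$, the weight variance $\sigma^2_{\Omega}$, and the previous layer's output variance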
* $\sigma^2_{f_i} = \frac{W_{i-1}\sigma_{\Omega}^2\sigma^2_{f_{i-1}}}{2}$
* If weight variance is too high, multiplications lead to explosive outputs
* If weight variance is too low, multiplications lead to vanishing outputs
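
A small simulation can illustrate the variance recursion. This is a sketch under stated assumptions, not a definitive experiment: zero biases, zero-mean Gaussian weights, a unit-variance Gaussian input, and hypothetical settings of width 100 and depth 8. The function name `preactivation_power` is made up for this example.

```python
import math
import random

def preactivation_power(weight_var, width=100, depth=8, seed=0):
    """Mean squared pre-activation at each layer of a random ReLU network."""
    rng = random.Random(seed)
    weight_std = math.sqrt(weight_var)
    h = [rng.gauss(0.0, 1.0) for _ in range(width)]  # unit-variance input
    history = []
    for _ in range(depth):
        # Zero biases and fresh zero-mean Gaussian weights for every unit
        f = [sum(rng.gauss(0.0, weight_std) * x for x in h) for _ in range(width)]
        history.append(sum(v * v for v in f) / width)
        h = [max(0.0, v) for v in f]  # ReLU
    return history

width = 100
results = {
    "too low": preactivation_power(0.5 / width, width),   # factor W*var/2 = 0.25
    "balanced": preactivation_power(2.0 / width, width),  # factor W*var/2 = 1
    "too high": preactivation_power(8.0 / width, width),  # factor W*var/2 = 4
}
for name, history in results.items():
    print(f"{name:>8}: first layer {history[0]:.3g}, last layer {history[-1]:.3g}")
```

The weight variances are picked so that the per-layer factor $W_{i-1}\sigma^2_{\Omega}/2$ is below, equal to, and above 1.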
---
## Interpretation
* Since we choose the initialization, we can set the weight variance so the output variance stays stable across layers
* $\sigma^2_{\Omega_i} = \frac{2}{W_{i-1}}$
* Called Kaiming or He initialization (read the [arXiv paper](https://arxiv.org/abs/1502.01852))
* PyTorch offers this initialization as an existing function
* [kaiming normal](https://docs.pytorch.org/docs/stable/nn.init.html#torch.nn.init.kaiming_normal_)
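
A minimal usage sketch of `torch.nn.init.kaiming_normal_` on a single linear layer. The layer sizes are hypothetical, and zeroing the bias is an assumption made for this example.

```python
import math

import torch
from torch import nn

torch.manual_seed(0)
fan_in, fan_out = 1000, 1000  # hypothetical layer widths
layer = nn.Linear(fan_in, fan_out)

# mode="fan_in" with nonlinearity="relu" targets std = sqrt(2 / fan_in)
nn.init.kaiming_normal_(layer.weight, mode="fan_in", nonlinearity="relu")
nn.init.zeros_(layer.bias)  # assumption: start with zero biases

empirical_std = layer.weight.std().item()
target_std = math.sqrt(2.0 / fan_in)
print(f"empirical std {empirical_std:.4f}, target {target_std:.4f}")
```

The empirical standard deviation of the weights should land close to $\sqrt{2 / W_{i-1}}$, matching the variance formula above.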
---
## Other Solutions
* There are other solutions
* But an improved initialization is the most elegant
* No added calculations during forward or backward passes
---
## Measuring Performance
* We can (hopefully) train a deep neural network
* How can we tell if we've done a good job?
* And if something goes wrong, how can we diagnose the problem?
---
## Dataset
* We're going to use a simple dataset
* Small enough that you can train on a laptop CPU
* Called "digits"
* Includes written numeric digits, 0-9
---
## Digits