* Less biased. Why?
* During training, paths that used input 1 or input 2 were randomly removed
* So the network does better when it relies on both
---
## Stochastic Depth
* Since randomness is good, how about dropping entire layers randomly?
* Stochastic depth is basically dropout, but for entire layers
* It does not work with just any network
* When a layer is skipped, its input must fit directly into the next layer
* In our current linear network the layer sizes differ, so arbitrary layers cannot be connected
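
A minimal sketch of the idea, assuming a residual-style block whose input and output shapes match, so that skipping the layer leaves a valid signal. The names `residual_block` and `survival_prob` are hypothetical, chosen for illustration.

```python
import numpy as np

def residual_block(x, weight, survival_prob, training, rng):
    """One residual block with stochastic depth (illustrative sketch)."""
    branch = np.maximum(0.0, x @ weight)  # the layer that may be dropped
    if training:
        # Flip one coin for the whole layer: keep it or skip it entirely.
        keep = rng.random() < survival_prob
        return x + branch if keep else x
    # At test time every layer is used, scaled by its survival probability.
    return x + survival_prob * branch

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))               # batch of 4, width 8
weight = rng.normal(size=(8, 8)) * 0.1    # square, so input and output shapes match
out = residual_block(x, weight, survival_prob=0.8, training=True, rng=rng)
```

Because the weight matrix is square, the output has the same shape as the input whether or not the layer was dropped, which is what lets the next layer connect either way.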
---
## Intentional Noise
* If randomly dropping components is good, are other randomizations also helpful?
* If we add noise to the input data, does that add good stochasticity?
* Good in the sense of resembling the noise in SGD
* For regression problems, *label noise* smooths the output
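
Input noise can be sketched as below, assuming zero-mean Gaussian noise that is redrawn for every batch. The helper `add_input_noise` and the value of `noise_std` are hypothetical.

```python
import numpy as np

def add_input_noise(inputs, noise_std, rng):
    """Return a copy of the inputs with zero-mean Gaussian noise added."""
    return inputs + rng.normal(0.0, noise_std, size=inputs.shape)

rng = np.random.default_rng(0)
batch = np.ones((3, 5))
# Apply only during training; evaluation uses the clean inputs.
noisy_batch = add_input_noise(batch, noise_std=0.1, rng=rng)
```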
---
## Other Noise
* Noise can be applied to weights
* This pushes the DNN toward flatter minima, where the noise does less harm
* And since wider minima tend to generalize better, this can be good
* We can also add noise to the labels
* For classifiers, this improves the decision boundaries
* There is no provable guarantee as with SVMs, but it encourages wider margins
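
Both ideas can be sketched in a few lines. This assumes Gaussian weight noise and uses label smoothing, a common deterministic relative of label noise, as a stand-in for the label case. The function names and constants are hypothetical.

```python
import numpy as np

def noisy_weights(weights, noise_std, rng):
    """Perturbed copy of the weights, used for one training forward pass."""
    return weights + rng.normal(0.0, noise_std, size=weights.shape)

def smooth_labels(one_hot, epsilon):
    """Move epsilon of the probability mass from the true class to a uniform spread."""
    num_classes = one_hot.shape[-1]
    return one_hot * (1.0 - epsilon) + epsilon / num_classes

rng = np.random.default_rng(0)
weights = np.zeros((2, 3))
perturbed = noisy_weights(weights, noise_std=0.01, rng=rng)

targets = np.eye(4)[[0, 2]]                # two one-hot labels over 4 classes
soft_targets = smooth_labels(targets, epsilon=0.1)
```

Each smoothed row still sums to 1 and keeps its largest value on the true class, so the classifier is discouraged from becoming overconfident without being told a different answer.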
---
## Data Augmentation
* But why apply meaningless noise when we could do something meaningful?
* Data augmentations are almost always used
* Image examples
* Flipping, rotating, scaling, cropping, changing the color balance, etc
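
Two of these image augmentations, a random horizontal flip and a random crop, can be sketched as follows. This assumes images stored as height x width x channel arrays; `augment` and `crop_size` are hypothetical names.

```python
import numpy as np

def augment(image, rng, crop_size):
    """Random horizontal flip followed by a random crop (H x W x C image)."""
    if rng.random() < 0.5:
        image = image[:, ::-1, :]          # horizontal flip
    height, width, _ = image.shape
    # The upper bound of rng.integers is exclusive, hence the +1.
    top = rng.integers(0, height - crop_size + 1)
    left = rng.integers(0, width - crop_size + 1)
    return image[top:top + crop_size, left:left + crop_size, :]

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))
patch = augment(image, rng, crop_size=28)
```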
---
## Transfer Learning
* If we know that our data is bad, why not spend less time with it?
* Begin training on something else
* Or with a different training goal
* Then cut off the head of that network, add a new layer for our actual task, and retrain
* But use early stopping before we learn too much
* This is an attempt to engineer initial weights near a good minimum
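
The recipe above can be sketched on toy data. This assumes the pretrained body is kept frozen and only the new head is trained, which is one common variant; the random `body_weights` stand in for weights that would really come from another task, and the dataset is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pretrained body: in practice these weights come from another task.
body_weights = rng.normal(size=(10, 16))

def features(x):
    """Frozen pretrained layers with the old head removed."""
    return np.tanh(x @ body_weights)

# Small synthetic dataset for the actual task (hypothetical).
x_train, x_val = rng.normal(size=(64, 10)), rng.normal(size=(32, 10))
true_head = rng.normal(size=(16, 1))
y_train = features(x_train) @ true_head + 0.1 * rng.normal(size=(64, 1))
y_val = features(x_val) @ true_head + 0.1 * rng.normal(size=(32, 1))

head = np.zeros((16, 1))                   # new layer for our task
best_head, best_val, patience, bad_epochs = head.copy(), np.inf, 5, 0
for epoch in range(500):
    f = features(x_train)
    grad = 2.0 * f.T @ (f @ head - y_train) / len(x_train)
    head -= 0.05 * grad                    # only the new head is updated
    val_loss = float(np.mean((features(x_val) @ head - y_val) ** 2))
    if val_loss < best_val:
        best_head, best_val, bad_epochs = head.copy(), val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:         # early stopping
            break
```

The loop keeps the head with the lowest validation loss and stops once that loss has failed to improve for `patience` epochs in a row.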
---
## Book Summary