* Vanilla momentum measures the gradient at 1
* Then combines that local gradient with the momentum
* Conceptually, 1 to 2 is the gradient, 2 to 3 is the momentum
* Nesterov will use the current momentum to calculate the next point (4)
* The gradient at that point guides the next update (from 4 to 5)
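-v-
## Sketch: Momentum vs. Nesterov
* A minimal sketch of the two updates, not a reference implementation
* Assumes the averaged form $m \leftarrow \beta m + (1 - \beta) g$; the toy loss, `alpha`, `beta`, and the function names are made up for illustration
* The only difference is where the gradient is measured

```python
import numpy as np

def grad(phi):
    # Gradient of the toy loss L(phi) = 0.5 * (phi[0]**2 + 10 * phi[1]**2)
    return np.array([1.0, 10.0]) * phi

def momentum_step(phi, m, alpha=0.1, beta=0.9):
    # Vanilla momentum: the gradient is measured at the current point
    m = beta * m + (1 - beta) * grad(phi)
    return phi - alpha * m, m

def nesterov_step(phi, m, alpha=0.1, beta=0.9):
    # Nesterov: first look ahead along the current momentum ...
    lookahead = phi - alpha * beta * m
    # ... then measure the gradient at that look-ahead point
    m = beta * m + (1 - beta) * grad(lookahead)
    return phi - alpha * m, m

phi_v, m_v = np.array([1.0, 1.0]), np.zeros(2)
phi_n, m_n = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(200):
    phi_v, m_v = momentum_step(phi_v, m_v)
    phi_n, m_n = nesterov_step(phi_n, m_n)

# Distance from the minimum at the origin for each method
print(np.linalg.norm(phi_v), np.linalg.norm(phi_n))
```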
---
## Fixing Step Sizes
* Tuning a learning rate is laborious, because training is often slow
* And some parameters may have a "steep" curve while others are shallow
* So one learning rate may be right for some and wrong for others
* One solution is an algorithm named Adam
* Adaptive Moment Estimation
---
## Fixed Step Sizes
* First, let's begin with the idea of fixed step sizes
* We calculate the vector of our update
* Then normalize it, so that every step is the same distance
* This fixes some problems, but we'll oscillate around the minimum
* We really need a small step size to finally settle
---
## Normalized Step Size
* Fixed step size equation:
* $m_{t+1} \leftarrow \frac{\delta L[\phi_t]}{\delta \phi}$
* $v_{t+1} \leftarrow \left(\frac{\delta L[\phi_t]}{\delta \phi}\right)^2$
* Normalize the gradient to fix the step size:
* $\phi_{t+1} \leftarrow \phi_t - \alpha \cdot \frac{m_{t+1}}{\sqrt{v_{t+1}}+\epsilon}$
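-v-
## Sketch: Normalized Step
* A small sketch of the normalized update, assuming the square and square root are applied element by element
* The toy loss, the starting point, and the value of `alpha` are assumptions chosen for illustration

```python
import numpy as np

def grad(phi):
    # Gradient of the toy loss L(phi) = 0.5 * (phi[0]**2 + 100 * phi[1]**2)
    return np.array([1.0, 100.0]) * phi

def normalized_step(phi, alpha=0.1, eps=1e-8):
    m = grad(phi)        # m_{t+1}: the gradient
    v = grad(phi) ** 2   # v_{t+1}: the squared gradient, element by element
    return phi - alpha * m / (np.sqrt(v) + eps)

phi = np.array([1.05, -0.35])
history = [phi]
for _ in range(30):
    phi = normalized_step(phi)
    history.append(phi)
history = np.array(history)

# Each parameter moves by about alpha per step, however steep its gradient
step_sizes = np.abs(np.diff(history, axis=0))
print(step_sizes.min(), step_sizes.max())

# Near the minimum the parameters hop back and forth instead of settling
print(history[-4:])
```

* The steep and the shallow parameter take the same step size, which is the point of normalizing
* Neither one settles at 0, because the step never shrinks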
-v-
## Normalizing Vectors
* If you don't recall, dividing a vector by its magnitude normalizes it
* And the magnitude is the square root of the sum of the squared terms
* Think of a triangle; the length of the hypotenuse is the square root of the sum of the squares of the other sides
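-v-
## Sketch: Normalizing a Vector
* A short NumPy check of the triangle picture, using a 3-4-5 example chosen for illustration
* The last line assumes the element-by-element form of the update equation, where each term is squared and rooted on its own

```python
import numpy as np

g = np.array([3.0, 4.0])

# Magnitude: square root of the sum of the squared terms (the hypotenuse)
magnitude = np.sqrt(np.sum(g ** 2))
unit = g / magnitude

print(magnitude)             # 5.0
print(unit)                  # [0.6 0.8]
print(np.linalg.norm(unit))  # length 1, up to rounding

# Element-by-element variant: only the sign of each term survives
print(g / np.sqrt(g ** 2))   # [1. 1.]
```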
---
## Adam
* Now we add momentum to both the gradient and our normalizing term
* Still use $\beta$ as the momentum of our weight update
* Introduce $\gamma$ as the momentum of the normalizer update
* $m_{t+1} \leftarrow \beta\cdot m_t + (1 - \beta)\frac{\delta L[\phi_t]}{\delta \phi}$
* $v_{t+1} \leftarrow \gamma\cdot v_t + (1 - \gamma)\left(\frac{\delta L[\phi_t]}{\delta \phi}\right)^2$
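-v-
## Sketch: Moment Updates
* A minimal sketch of the two running averages, assuming element-by-element squares
* `update_moments`, the constant gradient, and the values of `beta` and `gamma` are hypothetical choices for illustration

```python
import numpy as np

def update_moments(m, v, g, beta=0.9, gamma=0.99):
    # Momentum on the gradient
    m = beta * m + (1 - beta) * g
    # Momentum on the squared gradient (the normalizer)
    v = gamma * v + (1 - gamma) * g ** 2
    return m, v

g = np.array([0.01, 10.0])   # one shallow and one steep parameter
m, v = np.zeros(2), np.zeros(2)
for _ in range(1000):
    m, v = update_moments(m, v, g)

print(m)               # approaches g
print(v)               # approaches g ** 2
print(m / np.sqrt(v))  # close to 1 for both parameters
```

* With a steady gradient, the ratio is about 1 for both parameters, so they would take similar step sizes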
---
## Adam
* We have bad statistics for the momentum in the beginning
* So a modifier with diminishing effects is then applied
* $\hat{m}_{t+1} \leftarrow \frac{m_{t+1}}{1 - \beta^{t+1}}$
* $\hat{v}_{t+1} \leftarrow \frac{v_{t+1}}{1 - \gamma^{t+1}}$
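* $\gamma$ and $\beta$ are between 0 and 1, so $\gamma^{t+1}$ and $\beta^{t+1}$ go to 0 as $t$ increases, and both denominators approach 1
* In the beginning though, the denominators are small, so they scale up the early estimates, which start near 0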
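-v-
## Sketch: One Adam Step
* A sketch that puts the pieces together, assuming the corrected estimates replace $m$ and $v$ in the normalized update
* `adam_step`, the `correct` flag, the toy loss, and the hyperparameter values are assumptions for illustration, not a library API

```python
import numpy as np

def grad(phi):
    # Gradient of the toy loss L(phi) = 0.5 * (phi[0]**2 + 100 * phi[1]**2)
    return np.array([1.0, 100.0]) * phi

def adam_step(phi, m, v, g, t, alpha=0.01, beta=0.9, gamma=0.999,
              eps=1e-8, correct=True):
    # t is the 0-based step index, so the exponents are t + 1
    m = beta * m + (1 - beta) * g
    v = gamma * v + (1 - gamma) * g ** 2
    if correct:
        m_hat = m / (1 - beta ** (t + 1))
        v_hat = v / (1 - gamma ** (t + 1))
    else:
        m_hat, v_hat = m, v
    phi = phi - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return phi, m, v

# Very first step, with and without the correction
phi0, zeros = np.array([1.0, 1.0]), np.zeros(2)
with_fix, _, _ = adam_step(phi0, zeros, zeros, grad(phi0), t=0)
without_fix, _, _ = adam_step(phi0, zeros, zeros, grad(phi0), t=0,
                              correct=False)
print(phi0 - with_fix)     # about alpha for both parameters
print(phi0 - without_fix)  # about 3.16 * alpha with these beta and gamma

# A longer run on the toy loss
phi, m, v = np.array([1.0, 1.0]), np.zeros(2), np.zeros(2)
for t in range(2000):
    phi, m, v = adam_step(phi, m, v, grad(phi), t)
print(phi)
```

* With the correction, the first step is about `alpha` for the steep and the shallow parameter alike
* Without it, the size of the first step depends on `beta` and `gamma` rather than on `alpha` alone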
---
## Examples

---
## Final Note
* Notice all of the parameters we've introduced?
* Neural network training is full of hyperparameter optimization
* And we haven't even gotten to the complicated stuff yet!
* Often, you can ignore many of these
* Sadly, only experience will tell you when you need to tweak them