* Not guaranteed to point towards global minimum
* Not even guaranteed to point to a local minimum
* Also, large learning rates can jump over minima
-v-
```gnuplot
#! /usr/bin/gnuplot
set terminal png size 1920,1080 font "CourierPrime-Bold" fontscale 3 enhanced
set yrange [-5:5]
set xrange [-5:5]
set xlabel "x"
set ylabel "y"
set grid
#set hidden3d
set output "../figures/saddle_loss_surface.png"
splot x**2-y**2 title "loss surface"
set terminal webp size 1920,1080 font "CourierPrime-Bold" fontscale 3 enhanced rounded animate delay 500 loop 0
set output "../figures/trapped_loss.webp"
# Let's say we begin at x = -2, y = 0
step=0
cur_x = -2
cur_y = 0
error_x(val)=val**2
error_y(val)=-val**2
grad_x(val)=2*val
grad_y(val)=-2*val
learning_rate=0.25
# The gradient is the direction of the fastest increase, so subtract
# to get the fastest decrease in error
next_x = cur_x - learning_rate*error_x(cur_x)*grad_x(cur_x)
next_y = cur_y - learning_rate*error_y(cur_y)*grad_y(cur_y)
while (step < 10) {
set arrow 1 from cur_x, cur_y, cur_x**2-cur_y**2 to next_x, next_y, next_x**2-next_y**2 linewidth 2
splot x**2-y**2 title "loss surface"
step = step + 1
cur_x = next_x
cur_y = next_y
next_x = cur_x - learning_rate*error_x(cur_x)*grad_x(cur_x)
next_y = cur_y - learning_rate*error_y(cur_y)*grad_y(cur_y)
}
```
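
The saddle behaviour can also be checked numerically. The sketch below is a minimal illustration, assuming plain gradient descent (step = learning rate times gradient) on the same surface $x^2-y^2$; this differs from the gnuplot script, whose update also scales by the error value. The names `grad` and `descend` are hypothetical helpers, not part of the script.

```python
def grad(x, y):
    """Gradient of the saddle loss f(x, y) = x**2 - y**2."""
    return 2.0 * x, -2.0 * y


def descend(x, y, learning_rate, steps):
    """Run plain gradient descent and return the final point."""
    for _ in range(steps):
        gx, gy = grad(x, y)
        x -= learning_rate * gx
        y -= learning_rate * gy
    return x, y


# Starting exactly on the ridge (y = 0) the y-gradient is zero,
# so the iterate slides toward the saddle point (0, 0) and stays on the ridge.
trapped = descend(-2.0, 0.0, learning_rate=0.25, steps=10)

# A tiny nudge off the ridge lets |y| grow, so the loss keeps falling.
escaped = descend(-2.0, 1e-3, learning_rate=0.25, steps=30)

# With learning_rate > 1 every x update overshoots the origin and |x| grows.
overshoot = descend(-2.0, 0.0, learning_rate=1.1, steps=10)

print("trapped:", trapped)
print("escaped:", escaped)
print("overshoot:", overshoot)
```

The first run shows a gradient that points to neither a local nor a global minimum, and the third shows a learning rate large enough to jump past the point it is heading for.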
---
## Fixes for learning rate
* There are adaptive learning rate algorithms
* And momentum techniques to prevent oscillation
* We will revisit learning rate more in the future
---
## Dataset memory
* Our logistic regression code attempts to minimize over the entire dataset
* Okay for small amounts of data
* Will fail once the dataset no longer fits in memory
---
## Memory solutions
* We can learn online, using a subset of the data
* Called a minibatch
* Train with current sample, minimize future regret
* Guess $W$ to minimize expected future error, $f(W)=\mathbb{E}_z[f(W,z)]$, over unseen samples $z$
* Adjust $W$ slowly, averaging the results of previous batches
* Assumes future batches will be statistically similar to past ones
* Justified with expectations, so it is stochastic
* Stochastic gradient descent, or SGD
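
The minibatch idea above can be sketched as follows. This is a minimal illustration, not the course's logistic regression code: it assumes a single scalar feature, the mean cross-entropy loss, and a fixed learning rate. The names `sgd_logistic` and `toy`, the batch size, and the toy data are all hypothetical.

```python
import math
import random


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def sgd_logistic(samples, batch_size=8, learning_rate=0.5, epochs=50, seed=0):
    """Fit p(y=1|x) = sigmoid(w*x + b) with minibatch SGD.

    samples is a list of (x, y) pairs with scalar x and y in {0, 1}.
    """
    rng = random.Random(seed)
    data = list(samples)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # visit the data in a new random order each epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Average cross-entropy gradient over this minibatch only,
            # never over the full dataset.
            grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in batch) / len(batch)
            grad_b = sum(sigmoid(w * x + b) - y for x, y in batch) / len(batch)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


# Hypothetical toy data: the label is 1 exactly when x > 0.
data_rng = random.Random(1)
toy = [(x, 1 if x > 0 else 0) for x in (data_rng.uniform(-3, 3) for _ in range(200))]

w, b = sgd_logistic(toy)
accuracy = sum((sigmoid(w * x + b) > 0.5) == (y == 1) for x, y in toy) / len(toy)
print("w =", w, "b =", b, "accuracy =", accuracy)
```

Each update only touches `batch_size` samples, so memory use is independent of the dataset size; in a real setting the batches would be streamed from disk rather than held in a list.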
---
## SGD
* What learning rate should be used?
* Remember, we don't know our future loss, only the current step
* Some conditions (Robbins-Monro conditions) ensure convergence
* $\sum_{k=1}^{\infty}\eta_k=\infty, \sum_{k=1}^{\infty}\eta_{k}^2 < \infty$
* These conditions define a schedule for $\eta_k$; for example, $\eta_k = 1/k$ satisfies both
* Notice that convergence is only guaranteed in the limit, so it may still take arbitrarily many steps
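
As a rough numerical check of the two conditions, the sketch below accumulates partial sums for the schedule $\eta_k = 1/k$, chosen here as an assumed example. The helper names `partial_sums` and `inverse_k` are hypothetical.

```python
import math


def partial_sums(schedule, n_terms):
    """Return (sum of eta_k, sum of eta_k**2) for k = 1..n_terms."""
    total = 0.0
    total_sq = 0.0
    for k in range(1, n_terms + 1):
        eta = schedule(k)
        total += eta
        total_sq += eta * eta
    return total, total_sq


def inverse_k(k):
    """Example schedule eta_k = 1/k."""
    return 1.0 / k


for n in (10, 1_000, 100_000):
    s, s2 = partial_sums(inverse_k, n)
    print(f"n={n:>7}  sum eta={s:8.3f}  sum eta^2={s2:.5f}")

# The limit of the squared sum for this schedule is pi**2 / 6.
print("bound on sum eta^2:", math.pi ** 2 / 6)
```

The first sum keeps growing (the harmonic series diverges), while the sum of squares stays below $\pi^2/6$. A constant learning rate would fail the second condition, since its squared sum diverges.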
---
## SGD and Learning Rates
* In practice, we can use *early stopping* when errors are small
* So converging to 0 error isn't important
* There are a lot of tricks to improve logistic regression
* But the approach is overshadowed by stronger algorithms
* The solutions will appear again
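
Early stopping can be sketched as a training loop with an error threshold. This is a minimal illustration, assuming a hypothetical one-dimensional loss and a stop rule based on the training loss; monitoring a held-out validation loss is a common variant. The function name, `tolerance`, and `max_steps` are assumptions.

```python
def train_with_early_stopping(w, grad_fn, loss_fn, learning_rate=0.1,
                              tolerance=1e-4, max_steps=10_000):
    """Take gradient steps on w until loss_fn(w) < tolerance or max_steps is hit.

    Returns the final w and the number of steps actually taken.
    """
    for step in range(max_steps):
        if loss_fn(w) < tolerance:
            return w, step  # stopped early: the error is already small enough
        w -= learning_rate * grad_fn(w)
    return w, max_steps


def loss(w):
    """Hypothetical 1-D loss with its minimum at w = 3."""
    return (w - 3.0) ** 2


def grad(w):
    return 2.0 * (w - 3.0)


w_final, steps_used = train_with_early_stopping(0.0, grad, loss)
print("w =", w_final, "steps =", steps_used, "loss =", loss(w_final))
```

The loop stops long before `max_steps` because the loss only has to fall below `tolerance`, not reach exactly zero.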
---
## Mean Problem
* Logistic regression is influenced by the population statistics
* Not by individual samples
* Just look at the bias updates