# CS 462 - Lecture 06

## Model Fitting

Bernhard Firner

2026-02-10

---

## Book

* Today we'll be going through chapter 6 of the [book](https://udlbook.github.io/udlbook/)

---

## Review

* Loss Functions
  * Loss functions of probabilities are made using the negative log likelihood
  * MSE is equivalent to learning the mean of a distribution

---

## Logarithms

* Why are we optimizing a logarithm? Why not use the PDF directly?
  * Probabilities are small, and not numerically stable
* $p(x | \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}exp(-\frac{(x - \mu)^2}{2\sigma^2})$
  * That involves dividing by $\sigma^2$, which could be very small
  * We don't want training to fail because of numerical problems, so we take a log
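
A minimal sketch of the stability problem, in plain Python. The dataset and the choice of $\mu = 0$, $\sigma^2 = 1$ are made up for illustration. Multiplying many densities underflows to zero, while summing their logs stays finite.

```python
import math

def normal_pdf(x, mu, sigma2):
    """Density of a normal distribution with mean mu and variance sigma2."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

# Hypothetical dataset: 2000 points between -1.5 and 1.5.
xs = [((i % 7) - 3) * 0.5 for i in range(2000)]

likelihood = 1.0      # product of densities
log_likelihood = 0.0  # sum of log densities
for x in xs:
    p = normal_pdf(x, mu=0.0, sigma2=1.0)
    likelihood *= p
    log_likelihood += math.log(p)

print(likelihood)      # the product underflows in double precision
print(log_likelihood)  # the sum of logs is still usable
```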

---

## Log of a Normal

* $p(x | \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}exp(-\frac{(x - \mu)^2}{2\sigma^2})$
* Taking the log: $ln((2\pi\sigma^2)^{-1/2}) - \frac{(x - \mu)^2}{2\sigma^2}$
* Simplify: $-\frac{1}{2}ln(2\pi) - \frac{1}{2}ln(\sigma^2) - \frac{(x - \mu)^2}{2\sigma^2}$
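
The simplification above can be checked numerically. This is only a sanity check with arbitrary example values for $x$, $\mu$ and $\sigma^2$; the function names are made up for this sketch.

```python
import math

def log_normal_direct(x, mu, sigma2):
    """Log of the normal density, computed by evaluating the PDF first."""
    pdf = math.exp(-((x - mu) ** 2) / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)
    return math.log(pdf)

def log_normal_simplified(x, mu, sigma2):
    """The simplified three-term expression for the same quantity."""
    return (-0.5 * math.log(2.0 * math.pi)
            - 0.5 * math.log(sigma2)
            - ((x - mu) ** 2) / (2.0 * sigma2))

# Arbitrary example values.
direct = log_normal_direct(1.3, mu=0.5, sigma2=2.0)
simplified = log_normal_simplified(1.3, mu=0.5, sigma2=2.0)
print(direct, simplified)  # the two should agree up to rounding
```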

---

## Final Steps

* We need small loss values to be better, so multiply by -1
  * $\frac{1}{2}ln(2\pi) + \frac{1}{2}ln(\sigma^2) + \frac{(x - \mu)^2}{2\sigma^2}$
* Drop the first term, which is a constant
  * $\frac{1}{2}ln(\sigma^2) + \frac{(x - \mu)^2}{2\sigma^2}$
  * If we treat the variance as a constant, only the squared error $(x - \mu)^2$ remains, which is MSE when averaged over the data
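
A small sketch of that last point, assuming a fixed variance of 1 and made-up targets and predictions. With $\sigma^2 = 1$ the log term vanishes and the loss is exactly half of the mean squared error.

```python
import math

def gaussian_nll(x, mu, sigma2):
    """Negative log likelihood of x under N(mu, sigma2), constant term dropped."""
    return 0.5 * math.log(sigma2) + ((x - mu) ** 2) / (2.0 * sigma2)

def mean_gaussian_nll(xs, mus, sigma2):
    """Average of the per-point loss with one shared variance."""
    return sum(gaussian_nll(x, m, sigma2) for x, m in zip(xs, mus)) / len(xs)

def mse(xs, mus):
    """Plain mean squared error."""
    return sum((x - m) ** 2 for x, m in zip(xs, mus)) / len(xs)

# Hypothetical targets and model predictions.
targets = [1.0, 2.0, 4.0]
predictions = [1.5, 1.0, 4.0]

nll = mean_gaussian_nll(targets, predictions, sigma2=1.0)
half_mse = 0.5 * mse(targets, predictions)
print(nll, half_mse)  # equal when the variance is fixed at 1
```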

---

## Model and Assumptions

* Previously we assumed that noise had constant variance across the dataset
* Now we assume it is some function of the input, x

---

## Constraining the Model

* Use [softplus](https://docs.pytorch.org/docs/stable/generated/torch.nn.Softplus.html#softplus) to force $\sigma$ into a proper range
* We also add a tiny value, called an epsilon, for stability
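
A plain-Python sketch of the idea, not PyTorch's implementation. The helper name `predicted_sigma` and the epsilon value of `1e-6` are assumptions chosen for illustration.

```python
import math

def softplus(z):
    """Softplus, log(1 + exp(z)), written to avoid overflow for large |z|."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def predicted_sigma(raw_output, eps=1e-6):
    """Map an unconstrained model output to a strictly positive sigma."""
    return softplus(raw_output) + eps

for raw in (-1000.0, -2.0, 0.0, 2.0, 1000.0):
    print(raw, softplus(raw), predicted_sigma(raw))
```

For very negative inputs softplus rounds to exactly zero in floating point, which is why the epsilon matters.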


---

## Result

---

## Binary Classification

* Use a Bernoulli for a single class
  * $Pr(y|\lambda) = (1 - \lambda)^{1-y}\lambda^y$
* Use sigmoid to force model output to range from 0 to 1
  * $L[\phi] = -(1 - y)ln[1-sig(f[x, \phi])] - (y)ln[sig(f[x, \phi])]$
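
The loss above can be sketched in plain Python. Here `logit` stands in for the raw model output $f[x, \phi]$, and the input values are made up. This version is only meant for moderate logits; very large ones would push the sigmoid to exactly 0 or 1 and break the log.

```python
import math

def sigmoid(z):
    """Logistic sigmoid, written to avoid overflow for large |z|."""
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def binary_cross_entropy(y, logit):
    """Negative log likelihood of label y (0 or 1) under a Bernoulli."""
    lam = sigmoid(logit)
    return -(1 - y) * math.log(1.0 - lam) - y * math.log(lam)

print(binary_cross_entropy(1, 3.0))  # confident and correct: small loss
print(binary_cross_entropy(0, 3.0))  # confident and wrong: large loss
print(binary_cross_entropy(1, 0.0))  # undecided: ln(2)
```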


---

## Multiple Classes

* Add outputs for multiple classes
  * We want to treat them as probabilities, so force them to sum to 1
  * [softmax](https://docs.pytorch.org/docs/stable/generated/torch.nn.Softmax.html#softmax)
  * $Softmax(x_i) = \frac{exp(x_i)}{\sum_jexp(x_j)}$
* PyTorch provides a [Cross Entropy Loss](https://docs.pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html#torch.nn.CrossEntropyLoss) for multiclass classification
  * It handles both the softmax and the negative log likelihood, so the model should output raw scores
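
A plain-Python sketch of what that combination computes for a single example. It is not PyTorch's code; the function names and logits are made up. Subtracting the maximum before exponentiating is a standard trick to avoid overflow.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_index):
    """Negative log of the softmax probability of the target class."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target_index]

# Hypothetical raw model outputs for three classes.
logits = [1.0, 2.0, 3.0]
probs = softmax(logits)
loss = cross_entropy(logits, target_index=2)
print(probs, loss)
```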

---

## Learning

* We've got a few loss functions
* And we know how to make deep models
  * With a few hidden layers, at least
* So how does learning work?

---

## Gradient Descent

* We've seen that learning means traversing a loss surface
  * Gradients point uphill, so subtracting gradients moves downhill
* Backpropagation calculates gradients for each parameter
  * There is no global optimization, so the learning process is iterative and approximate
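
A minimal sketch of the iterative update on a toy one-parameter loss, $L[\phi] = (\phi - 3)^2$. The loss, starting point, and learning rates are made up for illustration.

```python
def loss(phi):
    """Toy loss with its minimum at phi = 3."""
    return (phi - 3.0) ** 2

def grad(phi):
    """Derivative of the toy loss."""
    return 2.0 * (phi - 3.0)

def gradient_descent(phi, learning_rate, steps):
    """Repeatedly step against the gradient."""
    for _ in range(steps):
        phi = phi - learning_rate * grad(phi)
    return phi

phi_good = gradient_descent(phi=0.0, learning_rate=0.1, steps=100)
phi_bad = gradient_descent(phi=0.0, learning_rate=1.1, steps=50)
print(phi_good, loss(phi_good))  # close to the minimum
print(phi_bad, loss(phi_bad))    # learning rate too large: the steps overshoot and grow
```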

---

## Hills and Plateaus

* Loss surfaces tend to have hills and plateaus
  * If there were no plateau, gradient descent would continue

---

## Problems

* We've already seen a problem with the learning rate
  * Can be too large early on, too small later

---

## Complex Loss Surfaces

* The complexity of a two dimensional loss surface pales in comparison to real problems
  * But it at least gives us something to think about
* We'll look at something called a *Gabor model*
* The surface is complex enough that a bad hyperparameter choice will cause learning to fail

---

## Gabor Model

* $f[x, \phi] = sin[\phi_0 + 0.06\phi_1 x]\cdot exp(-\frac{(\phi_0 + 0.06\phi_1 x)^2}{32.0})$
* Training points trace a more complicated surface
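
The model above as a short Python function. The parameter values used to evaluate it are arbitrary examples.

```python
import math

def gabor(x, phi0, phi1):
    """Gabor model: a sine wave multiplied by a Gaussian envelope."""
    z = phi0 + 0.06 * phi1 * x
    return math.sin(z) * math.exp(-(z ** 2) / 32.0)

# Evaluate on a small grid with arbitrary parameters.
xs = [-15.0 + i for i in range(31)]
ys = [gabor(x, phi0=0.0, phi1=16.6) for x in xs]
for x, y in zip(xs[::5], ys[::5]):
    print(x, y)
```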

---

## Gabor Loss Surface

* $f[x, \phi] = sin[\phi_0 + 0.06\phi_1 x]\cdot exp(-\frac{(\phi_0 + 0.06\phi_1 x)^2}{32.0})$
* Training points trace a more complicated surface

---

## Exploring Loss

* We have two constants that also affect the loss complexity:
  * $f[x, \phi] = sin[\phi_0 + 0.06\phi_1 x]\cdot exp(-\frac{(\phi_0 + 0.06\phi_1 x)^2}{c_2})$

---

## Exploring Loss

* We have two constants that also affect the loss complexity:
  * $f[x, \phi] = sin[\phi_0 + c_1\phi_1 x]\cdot exp(-\frac{(\phi_0 + c_1\phi_1 x)^2}{32})$
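
A sketch for experimenting with the two constants. The sum-of-squares loss and the generated training data are assumptions made for this example, not something the slides specify.

```python
import math

def gabor(x, phi0, phi1, c1=0.06, c2=32.0):
    """Gabor model with both constants exposed as arguments."""
    z = phi0 + c1 * phi1 * x
    return math.sin(z) * math.exp(-(z ** 2) / c2)

def sum_of_squares_loss(xs, ys, phi0, phi1, c1=0.06, c2=32.0):
    """Assumed least-squares loss between model outputs and targets."""
    return sum((gabor(x, phi0, phi1, c1, c2) - y) ** 2 for x, y in zip(xs, ys))

# Hypothetical training data generated from known parameters.
xs = [-15.0 + i for i in range(31)]
ys = [gabor(x, 0.0, 16.6) for x in xs]

loss_at_truth = sum_of_squares_loss(xs, ys, 0.0, 16.6)
loss_nearby = sum_of_squares_loss(xs, ys, 1.0, 16.6)
print(loss_at_truth, loss_nearby)

# Scan phi1 with different constants to see how the loss curve changes.
for c1 in (0.03, 0.06, 0.12):
    row = [round(sum_of_squares_loss(xs, ys, 0.0, p, c1=c1), 3) for p in (5.0, 10.0, 16.6, 25.0)]
    print(c1, row)
```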

---

## Local Minima

---

## Local Minima

* Local minima are not global minima
  * Obviously!
* But there are times when gradient descent will become stuck in one

---

## Saddle Points

* There are also locations where the gradient is simply zero
  * A saddle point is a minimum along some directions and a maximum along others
* It is practically impossible to land exactly on one
  * But if we only look at the magnitude of the gradient to adjust our learning rate, we'll make a mistake
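
A tiny illustration with the textbook saddle $f(x, y) = x^2 - y^2$, chosen here as an example. The gradient is zero at the origin, yet the origin is not a minimum.

```python
def f(x, y):
    """Classic saddle surface."""
    return x ** 2 - y ** 2

def grad_f(x, y):
    """Gradient of the saddle surface."""
    return (2.0 * x, -2.0 * y)

print(grad_f(0.0, 0.0))        # zero gradient at the saddle point
print(f(0.1, 0.0), f(0.0, 0.1))  # uphill along x, downhill along y
print(grad_f(1e-3, 1e-3))      # tiny gradient nearby, but we are not at a minimum
```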

---

## Loss Surface

---

## Gradient Descent

* Gradient descent can work

---

## Gradient Descent

* We can go the wrong way from a peak

---

## Gradient Descent

* And we can start too close to a bad minimum

---

## Fitting Techniques

* So we cannot just use gradients with some random learning rate and expect things to work out
  * Could we choose multiple random starting points, train, then take the best?
  * Sure. Do you have a few spare months?
* We need a practical solution that *usually* works *pretty well*

---

## Stochastic Gradient Descent

* SGD is a practical solution that *usually* works
* Attempts to free us from local minima close to our starting location
  * At each learning step, try to learn a slightly different function
  * They should all be related, so a large, smooth minimum always exists
  * Smaller minima won't be consistent, so we'll "bounce" out

---

## Batches and Randomness

* Where does SGD get its randomness?
  * Only train on a subset of points at each step
  * Without replacement, so we'll eventually use them all
* Once we've trained on everything, we call it an epoch and then iterate again
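
A sketch of how one epoch can be split into random batches without replacement. The function name, dataset size, and batch size are made up for illustration.

```python
import random

def epoch_batches(num_points, batch_size, rng):
    """Shuffle the point indices, then cut them into consecutive batches."""
    indices = list(range(num_points))
    rng.shuffle(indices)
    return [indices[i:i + batch_size] for i in range(0, num_points, batch_size)]

rng = random.Random(0)  # seeded only so the example is repeatable
for epoch in range(2):
    batches = epoch_batches(num_points=10, batch_size=4, rng=rng)
    for batch in batches:
        # A real training loop would compute the loss and gradients
        # using only the points in this batch.
        print(epoch, batch)
```

Every point appears exactly once per epoch, and the last batch is smaller when the batch size does not divide the dataset size.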

---

## SGD Intuition

---

## SGD

* SGD isn't worse than gradient descent

---

## SGD

* Will solve some problems

---

## SGD

* Will not fix all problems

---

## SGD Properties

* Can escape from some (shallow) local minima
* Can escape from some saddle points
* Adds noise, but all updates are based upon data
  * And all data is used equally
* Also saves on computation

---

## Learning Rate Schedules

* SGD is often used with a learning rate schedule
* We begin with a large learning rate
  * Helps us escape local minima when we find a high loss batch
  * Then reduce learning rate as we approach solution
* This is a hyperparameter, meaning humans tune it
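
One possible schedule, sketched as step decay. The slides do not name a specific schedule, so the form, the function name, and the default values here are assumptions.

```python
def step_decay(initial_lr, epoch, drop_every=10, factor=0.5):
    """Multiply the learning rate by `factor` once every `drop_every` epochs."""
    return initial_lr * factor ** (epoch // drop_every)

for epoch in (0, 9, 10, 20, 30):
    print(epoch, step_decay(0.1, epoch))
```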

---

## Momentum

* We can make a simple improvement
* Update with momentum, not just the current gradient
  * $m_{t+1} \leftarrow \beta\cdot m_t + (1 - \beta)\frac{\partial L[x, \phi]}{\partial \phi}$
  * $\phi_{t+1} \leftarrow \phi_t - \alpha\cdot m_{t+1}$
* $m$ is the momentum, $\beta$ is the momentum weight, and $\alpha$ is the learning rate
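
The two update rules above, sketched on a toy loss $L[\phi] = (\phi - 3)^2$. The loss and the values of $\alpha$ and $\beta$ are made up for illustration.

```python
def grad(phi):
    """Derivative of the toy loss (phi - 3)^2."""
    return 2.0 * (phi - 3.0)

def momentum_step(phi, m, alpha, beta):
    """One update: blend the new gradient into m, then move along m."""
    m_next = beta * m + (1.0 - beta) * grad(phi)
    phi_next = phi - alpha * m_next
    return phi_next, m_next

phi, m = 0.0, 0.0
for _ in range(500):
    phi, m = momentum_step(phi, m, alpha=0.1, beta=0.9)
print(phi, m)  # phi settles near 3 and the momentum decays toward 0
```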

---

## SGD

* Sometimes SGD can get stuck in a minimum partway to a better minimum

---

## SGD with Momentum

* A high momentum can "blow through" some minima

---

## More Momentum

* A small learning rate with SGD can be noisy
  * Momentum can smooth this out, getting rid of jitter

---

## Nesterov Momentum

* Momentum is averaging across past movement
  * This is similar to predicting where the parameters will be going
* The momentum was calculated from the previous points
  * Then why don't we apply momentum first before calculating loss?

---

## Nesterov Momentum

* $m_{t+1} \leftarrow \beta\cdot m_t + (1 - \beta)\frac{\partial L[x, \phi_t - \alpha\beta\cdot m_t]}{\partial\phi}$
* $\phi_{t+1} \leftarrow \phi_t - \alpha\cdot m_{t+1}$
* The momentum is modified by the gradient at the location where momentum will take us
  * This prevents the current loss from "derailing" momentum's path
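
A sketch of the Nesterov update on the toy loss $L[\phi] = (\phi - 3)^2$. The only change from plain momentum is where the gradient is evaluated. The loss and hyperparameter values are made up for illustration.

```python
def grad(phi):
    """Derivative of the toy loss (phi - 3)^2."""
    return 2.0 * (phi - 3.0)

def nesterov_step(phi, m, alpha, beta):
    """One update with the gradient taken at the look-ahead position."""
    lookahead = phi - alpha * beta * m  # where the momentum alone would take us
    m_next = beta * m + (1.0 - beta) * grad(lookahead)
    phi_next = phi - alpha * m_next
    return phi_next, m_next

phi, m = 0.0, 0.0
for _ in range(500):
    phi, m = nesterov_step(phi, m, alpha=0.1, beta=0.9)
print(phi, m)  # phi settles near 3
```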

---

## Nesterov Example

* Vanilla momentum measures the gradient at point 1
  * Then combines that local gradient with the momentum
  * Conceptually, 1 to 2 is the gradient, 2 to 3 is the momentum


---

## Nesterov Example

* Nesterov will use the current momentum to calculate the next point (4)
* The gradient at that point guides the next update (from 4 to 5)


---

## Fixing Step Sizes

* Tuning a learning rate is laborious, because training is often slow
  * And some parameters may have a "steep" curve while others are shallow
  * So one learning rate may be right for some and wrong for others
* One solution is an algorithm named Adam
  * Adaptive Moment Estimation

---

## Fixed Step Sizes

* First, let's begin with the idea of fixed step sizes
* We calculate the vector of our update
  * Then normalize it, so that every step is the same distance
* This fixes some problems, but we'll oscillate around the minimum
  * We really need a small step size to finally settle

---

## Normalized Step Size

* Fixed step size equation:
  * $m_{t+1} \leftarrow \frac{\partial L[\phi_t]}{\partial \phi}$
  * $v_{t+1} \leftarrow \left(\frac{\partial L[\phi_t]}{\partial \phi}\right)^2$
* Normalize the gradient to fix the step size:
  * $\phi_{t+1} \leftarrow \phi_t - \alpha \cdot \frac{m_{t+1}}{\sqrt{v_{t+1}}+\epsilon}$
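
A sketch of the normalized update, applied to each parameter separately. The parameter values, gradients, and epsilon are made up. Both parameters move by about $\alpha$, even though their gradients differ by five orders of magnitude.

```python
import math

def normalized_step(phi, grads, alpha, eps=1e-8):
    """Move each parameter by roughly alpha, against the sign of its gradient."""
    new_phi = []
    for p, g in zip(phi, grads):
        m = g        # the gradient itself
        v = g ** 2   # the squared gradient
        new_phi.append(p - alpha * m / (math.sqrt(v) + eps))
    return new_phi

phi = [1.0, -2.0]
grads = [100.0, -0.001]  # one steep direction, one shallow direction
stepped = normalized_step(phi, grads, alpha=0.01)
print(stepped)
```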

-v-

## Normalizing Vectors

* If you don't recall, dividing a vector by its magnitude normalizes it
* And the magnitude is the square root of the squared terms
  * Think of a triangle; the length of the hypotenuse is the square root of the sum of the squares of the other sides

---

## Adam

* Now we add momentum to both the gradient and our normalizing term
  * Still use $\beta$ as the momentum of our weight update
  * Introduce $\gamma$ as the momentum of the normalizer update
* $m_{t+1} \leftarrow \beta\cdot m_t + (1 - \beta)\frac{\partial L[\phi_t]}{\partial \phi}$
* $v_{t+1} \leftarrow \gamma\cdot v_t + (1 - \gamma)\left(\frac{\partial L[\phi_t]}{\partial \phi}\right)^2$

---

## Adam

* We have bad statistics for the momentum in the beginning
  * So a modifier with diminishing effects is then applied
* $\hat{m}_{t+1} \leftarrow \frac{m_{t+1}}{1 - \beta^{t+1}}$
* $\hat{v}_{t+1} \leftarrow \frac{v_{t+1}}{1 - \gamma^{t+1}}$
* $\gamma$ and $\beta$ are between 0 and 1, so $\beta^{t+1}$ and $\gamma^{t+1}$ go to 0 as t increases and the correction fades
  * In the beginning though, the denominators are small, so they scale up the early estimates, which start near zero
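
A single-parameter sketch that puts the Adam pieces together: both moving averages, the bias correction, and the normalized step. The toy loss $(\phi - 3)^2$, the function name, and the hyperparameter values are assumptions chosen for illustration.

```python
import math

def grad(phi):
    """Derivative of the toy loss (phi - 3)^2."""
    return 2.0 * (phi - 3.0)

def adam_minimize(grad_fn, phi, alpha=0.1, beta=0.9, gamma=0.999, eps=1e-8, steps=200):
    """Run Adam on one parameter and return its final value."""
    m, v = 0.0, 0.0
    for t in range(steps):
        g = grad_fn(phi)
        m = beta * m + (1.0 - beta) * g          # momentum of the gradient
        v = gamma * v + (1.0 - gamma) * g * g    # momentum of the squared gradient
        m_hat = m / (1.0 - beta ** (t + 1))      # bias correction
        v_hat = v / (1.0 - gamma ** (t + 1))
        phi = phi - alpha * m_hat / (math.sqrt(v_hat) + eps)
    return phi

first = adam_minimize(grad, 0.0, steps=1)
final = adam_minimize(grad, 0.0, steps=200)
print(first)  # thanks to the bias correction, the first step has size close to alpha
print(final)  # ends up near the minimum at 3
```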

---

## Examples

---

## Final Note

* Notice all of the parameters we've introduced?
  * Neural network training is full of hyperparameter optimization
* And we haven't even gotten to the complicated stuff yet!
* Often, you can ignore many of these
  * Sadly, only experience will tell you when you need to tweak them