# CS 462 - Lecture 12

## Review

Bernhard Firner

2026-03-03

---

## Review

* What have we covered so far?
  * Roughly chapters 1-10 in the book
  * Plus some discussion of batch normalization
    * See book pages 203-205 for an in-depth discussion

---

## Data

* You can't discuss deep learning without discussing data
* Datasets are sampled from populations
  * These have statistics, like mean and variance
* Sampling methods can add their own noise and bias

---

## Good Data, Bad Data

* If a population has more variance, then it requires more data to learn
  * Why? DNNs are secretly estimating population statistics
  * Estimates of the mean converge more slowly when the variance is higher
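
As a rough illustration of why variance drives data requirements, the sketch below uses the textbook standard error of a sample mean, $\sigma/\sqrt{n}$, for independent samples. The function names and the target error of 0.25 are chosen here for illustration only.

```python
import math

def standard_error(sigma: float, n: int) -> float:
    """Standard deviation of the mean of n independent samples."""
    return sigma / math.sqrt(n)

def samples_needed(sigma: float, target_error: float) -> int:
    """Smallest n whose standard error is at most target_error."""
    return math.ceil((sigma / target_error) ** 2)

# Same target precision, two populations with different spread.
low_var_n = samples_needed(sigma=1.0, target_error=0.25)
high_var_n = samples_needed(sigma=2.0, target_error=0.25)
print(low_var_n, high_var_n)
```

Under these assumptions, doubling the population standard deviation quadruples the number of samples needed for the same precision.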

---

## Good DNN, Bad DNN

* How do we know if a trained network is good or bad?
  * We need to test it on a holdout set
* If performance on the test set is far worse than on the training set, we often call that **overfitting**

---

## Overfitting?

* If a DNN is overfitting, what exactly is it fitting too closely?
* Imagine that the testing set is a subset of the training set
  * Then we cannot be "overfitting" the training set
* So when do the two datasets not match?
  * High variance population with a small dataset, biased data collection, etc.

---

**Q**. Your manager walks by your desk as you create a learning curve graph. The training loss is nearly 0, but the testing loss only decreased for the first 20 epochs before staying steady for the next 30 epochs. Your manager leans over your desk, takes a deep sip from an oversized coffee mug, and asks, "how are you going to fix this?" A good response could be:  
**a**. I'll use a smaller model to prevent overfitting.  
**b**. I'll collect more data to make a more representative training set.  
**c**. I'll use more regularization to decrease the gap between the curves.  
**d**. Any of the above would satisfy your manager.

---

**Q**. Your manager walks by your desk as you create a learning curve graph. The training loss is nearly 0, but the testing loss only decreased for the first 20 epochs before staying steady for the next 30 epochs. Your manager leans over your desk, takes a deep sip from an oversized coffee mug, and asks, "how are you going to fix this?" A good response could be:  
**a**. I'll use a smaller model to prevent overfitting.  
**b**. **I'll collect more data to make a more representative training set.**  
**c**. I'll use more regularization to decrease the gap between the curves.  
**d**. Any of the above would satisfy your manager.

* Options a and c imply that performance on the test data is getting worse, but it is holding steady. There must already be enough regularization, and the model is not so large that it fits the variance of the training data to the detriment of the test data.

---

## Bias and Variance Tradeoff

* In machine learning, the tradeoff between bias and variance is often discussed
* These are related to the power, or **capacity**, of your learning model
* In general, a high capacity model can overfit to the variance of a dataset
* A simpler model is biased to a simpler output
  * Output here means a decision boundary or regression output

---

## Overcapacity Example

* This model is likely over capacity, but it depends upon the data
  * Is that mislabelled data? Or variance that our dataset doesn't capture?

---

**Q**. Which regularization technique will fix the overfitting model in the previous slide?  
**a**. Stochastic gradient descent.  
**b**. L2 regularization.  
**c**. Batch normalization.  
**d**. None of the above will work.

---

* The loss function itself causes overfitting. Regularization can move decision boundaries and regression outputs away from training samples, but unless it is strong enough to overwhelm the loss function, it cannot prevent some fitting to a high-loss point like this.

---

## Simplicity and Regularization

* *In general* we prefer simpler models, with smoother outputs and loss surfaces
* We presume that these models will outperform more complicated models on **unseen points**.

<div class="col">
<img style="width: 80%" class="r-stretch" src="./figures/UDL/Chap09/RegDropout.svg" />
<small>Dropout, a regularizer, fixed the non-smooth area of the output. That wouldn't have been there if we had training data in that gap.</small>
</div>

---

## Regularization

* Remember, regularizers make our model outputs *worse* on the training set
  * They add a new term to the cost function, in addition to the loss
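
A minimal sketch of that extra term, assuming an L2 penalty with a hypothetical strength `lam`; the weights and loss value are made-up numbers, not results from any model in this lecture.

```python
def l2_penalty(weights, lam):
    """Regularization term: lam times the sum of squared weights."""
    return lam * sum(w * w for w in weights)

def cost(loss, weights, lam):
    """Quantity actually minimized: data loss plus the penalty."""
    return loss + l2_penalty(weights, lam)

weights = [0.5, -2.0, 1.5]      # illustrative values only
training_loss = 0.25            # illustrative value only

unregularized = cost(training_loss, weights, lam=0.0)
regularized = cost(training_loss, weights, lam=0.5)
print(unregularized, regularized)
```

The optimizer now trades training loss against weight size, which is why the fit to the training set gets worse.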

---

## Regularizers

* We've seen plenty
  * Apply an L2 penalty
  * Add noise to the data, to layer outputs, and to labels
  * Dropout
  * SGD with small batch sizes
* Many things *can* act as regularizers, and you don't need to use all of them

---

## Capacity

* So how do we measure simplicity and complexity?
  * Generally, it is a function of the number of parameters in a model, and how they are used
* Linear neural networks, or multi-layer perceptrons (MLPs), create piecewise linear fits to training data
  * The number of pieces they can assemble is their capacity

---

## Universal Approximation Theorem

* With linear neurons alone, a network could only create straight lines
  * $y = \phi_0 + \phi_1x$
* When we add in a nonlinear component, they become capable of approximating any continuous function
  * The number of hidden units determines the precision of the fit
  * This is the *universal approximation theorem*

---

## 3 Lines

---

## 10 Lines

---

## 25 Lines

---

## 100 Lines

---

## Activation Functions


---

## Nonlinear Activation Functions

* We add these functions after each hidden neuron
  * $\phi = [\phi_0, \phi_1, \phi_2, \theta_{10}, \theta_{11}, \theta_{20}, \theta_{21}]$
  * Why do they need to be nonlinear?
* The change in the slope of the output is what creates the separate pieces used in fitting

---

## Capacity Costs

* Each input to a neuron requires a weight, and each neuron has a bias value
* A layer with $w$ neurons and $n$ inputs requires $w(n + 1)$ values
* So is a wider network better, or a deeper one?

---

## Deeper Networks

* A deeper network, with a single input and $k$ hidden layers, has faster capacity growth per parameter
  * Capacity growth is multiplicative with layer number
    * $(w+1)^k$ linear regions
  * Costs $3w + 1 + (k-1)w(w+1)$ parameters
* Notice that with $k=1$ hidden layer, this is the same as before
  * $2w$ is the cost of the first hidden layer, and $w+1$ is the output neuron cost
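
The two formulas on this slide can be compared directly. The sketch below simply evaluates them; the widths and depths are arbitrary choices that happen to give equal parameter counts.

```python
def deep_params(w, k):
    """Parameters for one input, k hidden layers of width w, one output."""
    return 3 * w + 1 + (k - 1) * w * (w + 1)

def max_regions(w, k):
    """Linear region count from the slide: (w + 1)^k."""
    return (w + 1) ** k

# Four hidden layers of width 5 versus one hidden layer of width 35.
deep = (deep_params(5, 4), max_regions(5, 4))
shallow = (deep_params(35, 1), max_regions(35, 1))
print(deep, shallow)
```

Both networks cost the same number of parameters, but the deeper one can form far more linear regions.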

---

## Depth Efficiency

* Adding a layer means adding a new weight for each input, plus a bias parameter
  * This is where the $w(w+1)$ part of the equation comes from
* In reality, every layer won't have the same width, but assuming so simplifies the equation
* We can see that deeper networks can achieve more capacity per weight used

---

## More Efficiency

* Of course, we have just learned that convolutions have an even better ratio of capacity to weights
* But these higher capacity networks, including with convolutions, have a cost
  * Longer to train
  * More memory is required

---

**Q**. Consider a CNN that reduces a 28x28 input image to 128 1x1 feature maps after five stride two convolutions. Which of these statements is **true**?

**a**. The CNN requires more memory storage during gradient descent than a linear network with the same number of parameters.  
**b**. Given the final output size, the CNN must not be using padding.  
**c**. If the kernel size in the last layer is 2x2, the number of weights in the last convolution layer is $4\times128$.  
**d**. None of the above statements are true.

---

**Q**. Consider a CNN that reduces a 28x28 input image to 128 1x1 feature maps after five stride two convolutions. Which of these statements is **true**?

**a**. **The CNN requires more memory storage during gradient descent than a linear network with the same number of parameters.**  
**b**. Given the final output size, the CNN must not be using padding.  
**c**. If the kernel size in the last layer is 2x2, the number of weights in the last convolution layer is $4\times128$.  
**d**. None of the above statements are true.

---

## Backprop Algorithm:

* Group the parameters by layer: the weights and biases from layer $k$ are $\Omega_k$ and $\beta_k$
* Step 1: run each data sample forward, and remember the outputs of each layer
* Step 2: compute the derivatives of the loss with respect to the values saved in the forward pass
* Step 3: compute the derivatives of the loss with respect to the parameters
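
The three steps can be traced by hand on a tiny network. This is a sketch under stated assumptions: one scalar input, one ReLU hidden unit, a squared-error loss, and made-up parameter values. The scalar names `omega0`, `beta0`, `omega1`, `beta1` stand in for $\Omega_k$ and $\beta_k$.

```python
def relu(z):
    return max(0.0, z)

def forward(x, omega0, beta0, omega1, beta1):
    """Step 1: run forward and remember each intermediate value."""
    f0 = beta0 + omega0 * x      # hidden pre-activation
    h1 = relu(f0)                # hidden activation
    f1 = beta1 + omega1 * h1     # network output
    return f0, h1, f1

def backward(x, y, omega0, beta0, omega1, beta1):
    f0, h1, f1 = forward(x, omega0, beta0, omega1, beta1)
    # Step 2: derivatives of the loss with respect to the saved values.
    dl_df1 = 2.0 * (f1 - y)                       # loss = (f1 - y)^2
    dl_dh1 = dl_df1 * omega1
    dl_df0 = dl_dh1 * (1.0 if f0 > 0 else 0.0)    # ReLU passes gradient only when active
    # Step 3: derivatives of the loss with respect to the parameters.
    return {
        "omega0": dl_df0 * x, "beta0": dl_df0,
        "omega1": dl_df1 * h1, "beta1": dl_df1,
    }

grads = backward(x=2.0, y=1.0, omega0=0.5, beta0=0.5, omega1=-1.0, beta1=3.0)
print(grads)
```

Note that step 3 reuses `h1` and `x` from the forward pass, which is why those values must be kept in memory.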

---

## Gradient

* Output values and gradients are stored over the entire input, even for a 3x3 convolution
  * So convolutions reduce parameters compared to linear networks, but they haven't simplified gradient descent
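
A back-of-the-envelope comparison, assuming a single 3x3 convolution with 16 output channels on a 28x28 single-channel input (padding keeps the size) and a hypothetical linear layer sized to have the same parameter count. The counts are of stored output values, not bytes.

```python
def conv2d_params(c_in, c_out, kernel):
    """Kernel weights plus one bias per output channel."""
    return c_out * (c_in * kernel * kernel + 1)

def conv2d_outputs(c_out, height, width):
    """Values saved in the forward pass: one per channel per location."""
    return c_out * height * width

def linear_params(n_in, n_out):
    return n_out * (n_in + 1)

conv_p = conv2d_params(c_in=1, c_out=16, kernel=3)
conv_saved = conv2d_outputs(c_out=16, height=28, width=28)

# Hypothetical linear layer with the same number of parameters.
lin_p = linear_params(n_in=15, n_out=10)
lin_saved = 10   # one saved output per neuron

print(conv_p, conv_saved, lin_p, lin_saved)
```

With equal parameter counts, the convolution keeps over a thousand times more values around for the backward pass in this example.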

---

## Parameter Updates

* Gradient descent, using the gradients from backpropagation, moves parameters away from higher error
  * But this does not mean that the parameters will reach optimal values
* Gradient descent can often lead parameters into a bad local minimum

---

## Gradient Descent

* The parameters fall into an inferior local minimum

---

## Stochastic Gradient Descent

* We've seen that randomness improves many things
* We select random subsets of training data without replacement to serve as batches
* The loss surface from each batch is slightly different
  * But anything similar to a global minimum should remain consistent
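
Sampling without replacement usually means shuffling the indices once per epoch and slicing them into batches. A minimal sketch; the function name, dataset size, and batch size are illustrative.

```python
import random

def epoch_batches(num_samples, batch_size, rng):
    """Shuffle all indices once, then slice them into consecutive batches."""
    indices = list(range(num_samples))
    rng.shuffle(indices)
    return [indices[i:i + batch_size] for i in range(0, num_samples, batch_size)]

rng = random.Random(0)   # seeded only to make the sketch repeatable
batches = epoch_batches(num_samples=10, batch_size=4, rng=rng)
print(batches)
```

Every sample appears exactly once per epoch, and the last batch may be smaller than the rest.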

---

## SGD Illustration

---

## SGD

* Can succeed where GD failed

---

## Momentum

* Stochastic gradient descent is an improvement
  * Still gets stuck in local minima
* So we also add momentum using the past gradient
  * $m_{t+1} \leftarrow \beta\cdot m_t + (1 - \beta)\frac{\partial L[x, \phi_t]}{\partial \phi}$
  * $\phi_{t+1} \leftarrow \phi_t - \alpha\cdot m_{t+1}$
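
The two update rules translate almost line for line into code. The sketch below applies them to a toy one-parameter loss, $L(\phi) = (\phi - 3)^2$, which is an assumption made purely so the result is easy to check; $\alpha = 0.1$ and $\beta = 0.9$ are likewise illustrative.

```python
def momentum_step(phi, m, grad, alpha, beta):
    """One update: blend the new gradient into m, then step along m."""
    m_next = beta * m + (1.0 - beta) * grad
    phi_next = phi - alpha * m_next
    return phi_next, m_next

def grad_quadratic(phi):
    """Gradient of the toy loss L(phi) = (phi - 3)^2."""
    return 2.0 * (phi - 3.0)

phi, m = 0.0, 0.0
for _ in range(500):
    phi, m = momentum_step(phi, m, grad_quadratic(phi), alpha=0.1, beta=0.9)
print(phi)
```

On this loss the parameter overshoots the minimum and oscillates back, settling near 3.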

---

## Minima and Slopes

* Experience shows that global or near-global minima tend to be wider than local minima
* Momentum may carry us past either
  * But we will eventually slow and return to a wide minimum
  * A sharp minimum will be passed quickly

---

## Quotes:

* [On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima](https://arxiv.org/abs/1609.04836)

> ...large-batch methods tend to converge to sharp minimizers of the training and testing functions - and as is well known, sharp minima lead to poorer generalization. In contrast, small-batch methods consistently converge to flat minimizers, and our experiments support a commonly held view that this is due to the inherent noise in the gradient estimation.

---

## SGD Without Momentum

---

## SGD with Momentum

---

## Improvements

* We can tweak the momentum slightly
  * Nesterov: use the predicted next point to calculate the change to momentum
* Adaptive Moment Estimation (Adam): add momentum to the step size normalization
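
For reference, a sketch of one Adam update in its commonly published form, with bias correction. The symbol `gamma` for the squared-gradient decay and the default values shown are conventional choices, not something specified in these slides.

```python
import math

def adam_step(phi, m, v, grad, t, alpha=0.001, beta=0.9, gamma=0.999, eps=1e-8):
    """One Adam update for a single parameter; t counts steps starting at 1."""
    m = beta * m + (1.0 - beta) * grad             # momentum on the gradient
    v = gamma * v + (1.0 - gamma) * grad * grad    # momentum on the squared gradient
    m_hat = m / (1.0 - beta ** t)                  # bias correction for early steps
    v_hat = v / (1.0 - gamma ** t)
    phi = phi - alpha * m_hat / (math.sqrt(v_hat) + eps)
    return phi, m, v

# The first step has nearly the same size whether the gradient is large or small.
after_large, _, _ = adam_step(0.0, 0.0, 0.0, grad=100.0, t=1)
after_small, _, _ = adam_step(0.0, 0.0, 0.0, grad=0.01, t=1)
print(after_large, after_small)
```

Dividing by the running gradient magnitude is the "step size normalization": the step length depends on `alpha`, not on the raw gradient scale.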

---

## Improvements

---

## Initialization Trouble

* Initialization is important
  * We could start training already in a local minimum
* The initial math during the forward pass could fail
  * Values explode higher; **exploding gradients**
  * Values disappear to 0; **vanishing gradients**

---

## Fixing Initialization

* If we are using ReLU, we want layer outputs to be 0 mean
  * If all positive, then ReLU does nothing
  * If all negative, then there is no output
* That means the variance of layer outputs will determine stability

---

## He Kaiming Initialization

* $\sigma^2_{f_i} = \frac{W_{i-1}\sigma^2_{\Omega_i}\sigma^2_{f_{i-1}}}{2}$
* We are optimizing initialization, so we adjust the weights to leave the input variance unchanged
  * $\sigma^2_{\Omega_i} = \frac{2}{W_{i-1}}$
* PyTorch supports [kaiming normal](https://docs.pytorch.org/docs/stable/nn.init.html#torch.nn.init.kaiming_normal_)
* You can read more in the [arxiv paper](https://arxiv.org/abs/1502.01852)
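
A plain-Python stand-in for the idea (not PyTorch's implementation): draw weights from a zero-mean normal distribution with variance $2/W_{i-1}$, where $W_{i-1}$ is the number of inputs to the layer. The layer sizes and helper names are illustrative.

```python
import math
import random

def he_std(fan_in):
    """Standard deviation that gives weight variance 2 / fan_in."""
    return math.sqrt(2.0 / fan_in)

def he_normal(fan_in, fan_out, rng):
    """fan_out rows of fan_in weights drawn from N(0, 2 / fan_in)."""
    std = he_std(fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]

rng = random.Random(0)   # seeded only to make the sketch repeatable
weights = he_normal(fan_in=512, fan_out=20, rng=rng)
flat = [w for row in weights for w in row]
sample_var = sum(w * w for w in flat) / len(flat)
print(he_std(512), sample_var)
```

The measured variance should land close to $2/512$; wider layers get proportionally smaller initial weights.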

---

## Other Solutions and Problems

* There are other solutions to the initialization problem
  * This one assumed ReLUs, for example
* But this solution only applies to initialization
  * What happens as the parameters drift during training?

---

## Batch Normalization

* Normalize the layer outputs with their batch mean and variance
  * This adds noise, which is a regularizer!
* Keep the running mean and variance to apply during inference
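
A simplified sketch of both bullets for a single feature. It leaves out the learnable scale and shift, and it uses the same (biased) batch variance for normalization and for the running estimate, so it is an approximation of what a framework layer does. The momentum of 0.1 matches the `BatchNorm2d` default; the batch values are made up.

```python
def batch_norm(batch, eps=1e-5):
    """Normalize one feature across the batch using batch statistics."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n
    normalized = [(x - mean) / (var + eps) ** 0.5 for x in batch]
    return normalized, mean, var

def update_running(running, batch_stat, momentum=0.1):
    """Exponential moving average kept for use at inference time."""
    return (1.0 - momentum) * running + momentum * batch_stat

batch = [2.0, 4.0, 6.0, 8.0]
normalized, mean, var = batch_norm(batch)
running_mean = update_running(0.0, mean)
running_var = update_running(1.0, var)
print(normalized, running_mean, running_var)
```

Because each batch has slightly different statistics, the same input is normalized slightly differently from batch to batch, which is the source of the noise.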

---

## Evaluating a Network

* [Limits on Learning Machine Accuracy Imposed by Data Quality](https://papers.nips.cc/paper_files/paper/1994/hash/1e056d2b0ebd5c878c550da6ac5d3724-Abstract.html)
  * Paper from 1994
* Increasing capacity learns the training set
  * Only good if the training and testing sets match!

---

## Error Vs Capacity

* If our training and testing sets match, adding more training data is the solution
  * Capacity is only necessary if we aren't learning the training set
* Excess capacity (without regularization) learns training set features that aren't in the test data
* So we want network structures that naturally generalize more

---

## Loss Curves

* The test loss on digits matches the training loss fairly well
  * A sign that our training and test data match
* If there were a mismatch, simplifying the model would likely improve test performance
* But what about in a harder dataset?

---

## Side Note

* Training on a small dataset should **always** work
  * Meaning you converge to 0 errors
* If it doesn't, your data must be contradictory
  * Or you have a problem with initialization, the learning algorithm, etc.

---

## CIFAR Comparison

* CIFAR-10 has 10 classes
* Each input is a 32x32 image with RGB color
* Some examples are harder than others
  * So we won't find a regularization that is "just right" for all of them

---

## Baseline Linear

* This model has 1,645,950 trainable parameters.

```python
Linear(
  (net): Sequential(
    (0): Flatten(start_dim=1, end_dim=-1)
    (1): Linear(in_features=3072, out_features=512, bias=True)
    (2): ReLU()
    (3): Linear(in_features=512, out_features=120, bias=True)
    (4): ReLU()
    (5): Linear(in_features=120, out_features=84, bias=True)
    (6): Linear(in_features=84, out_features=10, bias=True)
  )
  (decision): Softmax(dim=1)
)
```
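
The parameter count can be checked by hand from the layer sizes in the listing: each `Linear` layer has one weight per input per output, plus one bias per output.

```python
def linear_params(n_in, n_out):
    """Weights plus biases for a fully connected layer."""
    return n_out * (n_in + 1)

# Feature sizes read off the Linear layers above.
layer_sizes = [3072, 512, 120, 84, 10]
mlp_total = sum(linear_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:]))
print(mlp_total)
```

Almost all of the parameters sit in the first layer, which connects every input pixel value to every one of the 512 hidden units.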

---

## ConvNet with BatchNorm

* Model has 17,020 parameters

```python
DeeperBNConvNet(
  (net): Sequential(
    (0): Conv2d(3, 15, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): ReLU()
    (2): Conv2d(15, 15, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (3): BatchNorm2d(15, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (4): ReLU()
    (5): Conv2d(15, 15, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (6): ReLU()
    (7): Conv2d(15, 15, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (8): BatchNorm2d(15, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (9): ReLU()
    (10): Conv2d(15, 15, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (11): ReLU()
    (12): Conv2d(15, 15, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (13): BatchNorm2d(15, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (14): ReLU()
    (15): Conv2d(15, 15, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (16): ReLU()
    (17): Flatten(start_dim=1, end_dim=-1)
    (18): Linear(in_features=60, out_features=60, bias=True)
    (19): Linear(in_features=60, out_features=10, bias=True)
  )
  (decision): Softmax(dim=1)
)
```
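
The same kind of hand count works for this listing. The sketch assumes the standard output size formula for a padded, strided convolution and counts two parameters per channel (scale and shift) for each `BatchNorm2d` layer.

```python
def conv_params(c_in, c_out, kernel):
    """Kernel weights plus one bias per output channel."""
    return c_out * (c_in * kernel * kernel + 1)

def conv_out_size(size, kernel, stride, padding):
    """Spatial size after a convolution, rounding down."""
    return (size + 2 * padding - kernel) // stride + 1

def linear_params(n_in, n_out):
    return n_out * (n_in + 1)

# (in_channels, out_channels, stride) for each Conv2d in the listing.
convs = [(3, 15, 1), (15, 15, 2), (15, 15, 1), (15, 15, 2),
         (15, 15, 1), (15, 15, 2), (15, 15, 2)]

size = 32    # 32x32 input images
total = 0
for c_in, c_out, stride in convs:
    total += conv_params(c_in, c_out, kernel=3)
    size = conv_out_size(size, kernel=3, stride=stride, padding=1)

total += 3 * 2 * 15                   # three BatchNorm2d layers, scale and shift per channel
flat_features = 15 * size * size      # width of the flattened feature vector
total += linear_params(flat_features, 60) + linear_params(60, 10)
print(size, flat_features, total)
```

The four stride-two convolutions shrink the image from 32x32 to 2x2, which is where the 60 input features of the first `Linear` layer come from.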

---

## MLP Baseline

<div class="container">
<div class="col">
<img style="width: 65%" class="r-stretch" src="./figures/10_cifar10_mlp.png" />
<br/>
<small>MLP Loss</small>
<br/>
<img style="width: 65%" class="r-stretch" src="./figures/11_cifar10_deeperbn_conv.png" />
<br/>
<small>ConvNet Loss</small>
</div>
<div class="col">
<img style="width: 65%" class="r-stretch" src="./figures/10_cifar10_mlp_acc.png" />
<br/>
<small>MLP Accuracy</small>
<br/>
<img style="width: 65%" class="r-stretch" src="./figures/11_cifar10_deeperbn_conv_acc.png" />
<br/>
<small>ConvNet Accuracy</small>
</div>
</div>

---

**Q**. Why does a convolutional network generalize more effectively than a fully connected network?  
**a**. Spatial invariance means that learned features are common across many inputs and locations.  
**b**. The increase in samples per parameter means that the network is biased towards a simpler solution.  
**c**. Structure generalizes to new data samples better than neurons that have "memorized" a possibly small number of training samples.  
**d**. All of the above.

---

## Same

* All of those statements are roughly the same
* Imagine that a linear network has 1 neuron per training sample in the first hidden layer
  * Each neuron "matches" a training input, with weights equal to that training input
  * With some wrangling of parameters, we can imagine the class prediction corresponds to the exemplar that has the most pixels in common with an input
* But this means that a green plane would match a green frog more closely than a yellow plane, or the same green plane shifted to the other side of the image
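
A toy version of this exemplar matching, using one-row "images" made of colored pixels. The images and the scoring rule (count of positions with the same non-background color) are invented for illustration and are far simpler than real pixel data.

```python
def shared_pixels(a, b):
    """Count positions where both images have the same non-background color."""
    return sum(1 for p, q in zip(a, b) if p == q and p != ".")

# G = green, Y = yellow, . = background
green_plane = "GGG....."
others = {
    "green frog": "GG.G....",
    "yellow plane": "YYY.....",
    "shifted green plane": ".....GGG",
}

scores = {name: shared_pixels(green_plane, img) for name, img in others.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

The frog wins on raw pixel overlap, while the recolored and shifted planes score zero.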

---

## Convolutions

* Convolutions have a few advantages over linear networks
  * Spatial invariance
  * Discover structure in the input
  * Weight efficiency leads to learning efficiency
* In summary, convolutions are a much more weight-efficient way to analyze data with structure

---

## Open-Ended Topics

* Some topics are particularly important
  * What do we understand about loss surfaces?
  * What made learning reliable?
    * How did those tricks work?
  * Why do convolutions work?
    * In what ways are they superior to linear layers?