# CS 462 - Lecture 09

## Regularization

Bernhard Firner

2026-02-19

---

## Book

* Today we'll be going through chapter 9 of the [book](https://udlbook.github.io/udlbook/)

---

## Review

* Measuring performance is fundamental
  * But sometimes difficult
* Results can be surprisingly stochastic
* Learning curves can be confusing

---

## Things To Analyze

* 1994 paper: [Limits on Learning Machine Accuracy Imposed by Data Quality](https://papers.nips.cc/paper_files/paper/1994/hash/1e056d2b0ebd5c878c550da6ac5d3724-Abstract.html)
* Analysis hasn't changed much
  * Plot error as a function of capacity and training set size
* Rare to do something different, such as attempting to plot the local loss surface

---

## Dataset Similarity

* Plot a) occurs, to some extent, due to a mismatch between the training and testing data
* As our training data grows, eventually the testing set becomes indistinguishable (plot b)

---

## Variance

* The amount of data you need is a function of the variance of the data
  * This should be clear from statistics
* A high-variance data source will require more samples to make a good estimate of the mean
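
As a rough illustration of the statistics behind this: the standard error of a sample mean is $\sigma/\sqrt{n}$, so the number of samples needed for a fixed precision grows with the variance. The sketch below is only that arithmetic; `samples_needed` and `target_se` are names made up for this example.

```python
import math

def samples_needed(std_dev, target_se):
    """Smallest n for which std_dev / sqrt(n) <= target_se."""
    return math.ceil((std_dev / target_se) ** 2)

# Doubling the standard deviation (4x the variance) needs 4x the samples.
low = samples_needed(std_dev=1.0, target_se=0.25)
high = samples_needed(std_dev=2.0, target_se=0.25)
print(low, high)
```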

---

## Remaining Error

* But why don't we expect to reach 0 errors?
* Some tasks are not possible
  * Could be label errors, could be two similar objects that are indistinguishable

---

## Learning Impossible Things

* It is important that we don't try to learn the unlearnable
  * Neural networks can often find a way to cheat
  * Make them large enough, give them enough data, and they'll do something strange
* e.g. spot of dust on the camera during data collection, data augmentations, etc

---

## Bias Vs Variance

* There is a tension between bias and variance
  * Learn the full variance of the observed training data?
  * Or bias towards a simpler solution?
* How much to worry about this, and the techniques we use, are a function of the dataset

---

## Limiting Capacity

* Regularization means simplifying a model
  * So why not just use a simpler model?
* This is valid, but often doesn't work
  * There is variety in a dataset, so a "simple" model for some data may be over-capacity for the rest

---

## Early Stopping

* We can continue with a large model, but stop training when overfitting shows up
* And we can save the model from past epochs
  * We can go back if we notice the increase in test loss isn't just random fluctuation
* But this isn't foolproof
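
A minimal sketch of that bookkeeping, assuming we record one held-out loss per epoch and tolerate a few bad epochs before giving up. The function name and the `patience` parameter are hypothetical; a real training loop would save model weights where the comment says so.

```python
def early_stopping_epoch(held_out_losses, patience=3):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch losses.

    best_epoch is the checkpoint we would roll back to; stop_epoch is the
    epoch at which training halts because the loss has not improved for
    `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(held_out_losses):
        if loss < best_loss:
            # New best: this is where we would save a checkpoint.
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: stop and roll back.
            return best_epoch, epoch
    return best_epoch, len(held_out_losses) - 1

# The bump at epoch 2 is treated as noise; the sustained rise is not.
losses = [1.0, 0.8, 0.85, 0.7, 0.72, 0.75, 0.9, 1.1]
print(early_stopping_epoch(losses, patience=3))
```

The `patience` window is what separates random fluctuation from real overfitting, and choosing it is part of why this isn't foolproof.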

---

## Effective learning rates

* Some digits must be easier than others
  * Imagine that the learning curve is a blend of the curves for each digit
* This means that we effectively learn some faster

* If we try early stopping, we will stop at a different part of the learning curve for each digit

---

## General Regularization

* So we want something that will regularize evenly over different data
* One possible assumption is that simpler models have few "big" weights
  * So we can penalize the model for having large weights
* $L[\phi] = \lambda\sum_i\phi_i^2$
  * (Keep in mind that the derivative will be $2\lambda\phi_i$)
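
To make the derivative concrete, here is a small numeric check in plain Python. The weights and $\lambda$ are made-up values; the point is only that the gradient of the penalty is proportional to each weight, so large weights are pushed down hardest.

```python
def l2_penalty(weights, lam):
    """lambda * sum of squared weights."""
    return lam * sum(w * w for w in weights)

def l2_penalty_grad(weights, lam):
    """Analytic gradient: 2 * lambda * w_i for each weight."""
    return [2.0 * lam * w for w in weights]

weights = [0.5, -2.0, 3.0]
lam = 0.01
eps = 1e-6

# Compare against a central finite difference for each weight.
for i, analytic in enumerate(l2_penalty_grad(weights, lam)):
    up = list(weights)
    up[i] += eps
    down = list(weights)
    down[i] -= eps
    numeric = (l2_penalty(up, lam) - l2_penalty(down, lam)) / (2 * eps)
    print(f"w={weights[i]:+.1f}  analytic={analytic:+.4f}  numeric={numeric:+.4f}")
```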

---

## Ridge Regression

* This is called ridge regression
  * And is used in regression; it is an old technique
  * Only applied to the weights, not the biases
* Also called Frobenius norm regularization when applied to matrices
* Also called L2 regularization, since we are penalizing the (squared) L2 norm of the vector of weights

---

## Visualization

---

## Effects

---

## L2 Discussion

* This won't force weights to 0
  * Not if the parameter has any positive correlation with the desired output
* But it does push down anything high
* This generally results in a "smoother" output
  * Smooth regression lines, or smoother decision boundaries for classification

---

## Side Effects

* Notice that the L2 norm increases error
  * The average network output will be lower
* For regression targets, standardize your target to have 0 mean
  * Now L2 won't be in so much tension with your training samples!
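
A minimal sketch of that preprocessing step, using only the standard library. The target values are made up, and dividing by the standard deviation goes slightly beyond the zero-mean advice above; it is a common companion step, not something this slide requires.

```python
import statistics

def standardize(targets):
    """Shift and scale targets to zero mean and unit standard deviation."""
    mean = statistics.fmean(targets)
    std = statistics.pstdev(targets)
    return [(t - mean) / std for t in targets], mean, std

def unstandardize(predictions, mean, std):
    """Map network outputs back to the original target units."""
    return [p * std + mean for p in predictions]

prices = [310.0, 295.0, 402.0, 350.0]  # made-up regression targets
scaled, mean, std = standardize(prices)
print(scaled)
print(unstandardize(scaled, mean, std))
```

Keep `mean` and `std` from the training set around: predictions have to be mapped back with the same values.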

---

## In PyTorch

* Most PyTorch optimizers support an L2 penalty
  * Called "weight decay"
* `optimizer = torch.optim.Adam(model.parameters(), lr=0.0002, weight_decay=0.001)`
* Let's see how this affects learning curves for digit classification
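
Passing `weight_decay` as above applies it to every parameter handed to the optimizer, biases included. One way to follow the "weights only" convention from ridge regression is to use parameter groups. This is a sketch: the two-layer model is a hypothetical stand-in, and splitting on names that end in `bias` is an assumption that holds for plain `Linear` layers.

```python
import torch

# Hypothetical small classifier, only here to have some parameters.
model = torch.nn.Sequential(
    torch.nn.Linear(784, 100),
    torch.nn.ReLU(),
    torch.nn.Linear(100, 10))

# Split parameters: decay the weight matrices, leave the biases alone.
decay, no_decay = [], []
for name, param in model.named_parameters():
    if name.endswith("bias"):
        no_decay.append(param)
    else:
        decay.append(param)

optimizer = torch.optim.Adam(
    [{"params": decay, "weight_decay": 0.001},
     {"params": no_decay, "weight_decay": 0.0}],
    lr=0.0002)

for group in optimizer.param_groups:
    print(len(group["params"]), group["weight_decay"])
```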

---

## Digits - Baseline

---

## Weight Decay = 0.001

---

## Weight Decay = 0.01

---

## Best Used Sparingly

* We can force performance on training and testing data to match
  * But now they are both worse
* But we need regularizers to prevent overfitting, right?
  * Let's make our training set tiny, just 100 digits
  * And train to 500 epochs, to see any trends

---

## 100 Training, Baseline

---

## 100 Training, L2 norm=0.001

---

## SGD Hidden Benefits

* We might expect the 100 sample training set to be really bad
  * But it doesn't contradict the testing set
* Still, why aren't we seeing some overfitting?
* It turns out that SGD has an implicit regularization

---

## Implicit Regularization

* SGD actually has a preference to go to some minima over others
  * The minima found are wider as the batch size is reduced
  * [On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima](https://arxiv.org/abs/1609.04836)
* Large enough learning rates actually lead to regularization
  * [On the Origin of Implicit Regularization in Stochastic Gradient Descent](https://arxiv.org/abs/2101.12176)

---

## Quotes:

* [On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima](https://arxiv.org/abs/1609.04836)

> ...large-batch methods tend to converge to sharp minimizers of the training and testing functions - and as is well known, sharp minima lead to poorer generalization. In contrast, small-batch methods consistently converge to flat minimizers, and our experiments support a commonly held view that this is due to the inherent noise in the gradient estimation.

---

## Quotes:

* [On the Origin of Implicit Regularization in Stochastic Gradient Descent](https://arxiv.org/abs/2101.12176)

> ...when the batch size is small the scale of the implicit regularization term is proportional to the ratio of the learning rate to the batch size. We verify empirically that explicitly including the implicit regularizer in the loss can enhance the test accuracy when the learning rate is small.

---

## Implicit Regularization

* Change the batch size from 64 to 8

---

## Better!

* The smaller batch is better
* It converged faster
* Has a slightly better testing accuracy
  * 69.8% vs 69.2%

---

## Why Not Size 1?

* Remember that time is a concern
* The forward and backward passes are done in parallel for an entire batch
  * So doing batch size 1 could be 8 times slower than batch size 8
* Learning may be slightly better or faster per epoch with smaller batch sizes, but only to a point

---

## Regularization and Noise

* So do we actually need L2/weight decay?
  * Yes! But it isn't working with this combination of data and network
* Notice that the linear network just goes and learns the wrong labels
  * We'll see that convolutional networks are a bit better at not doing that
* So we'll have to revisit this with a better DNN

---

## Ensembling

* There are, of course, other ways to regularize
  * What if we trained a bunch of small models and then put them together?
  * This is like AdaBoost with decision stumps, if you took CS461
* We could do that with DNNs too
  * But it sounds like it would take a long time

---

## Dropout

* We can simulate ensembling without training 1000s of models
* Every forward pass, only use a subset of the network
  * Just set the outputs of random neurons to 0
  * This excludes them from gradient updates as well
* When done training, use them all, but scale all outputs

---

## Dropout Example

* During training, randomly ignore some neurons
* For example, given neurons a, b, c, and d, drop half at each training step:

1. $f(x) = a + d$ 
1. $f(x) = a + c$ 
1. $f(x) = b + c$ 
1. $f(x) = a + c$

* When training is done, use them all 
* Now there are four numbers instead of two, so divide by two to preserve the average magnitude

$f(x) = (a + b + c + d)/2$
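
The scaling can be checked by brute force. The sketch below assigns made-up values to a, b, c, and d, averages the training-time output over every way of keeping two of the four, and compares that with the halved test-time sum.

```python
from itertools import combinations

neurons = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 6.0}  # made-up activations

# Every way of keeping two of the four neurons during training.
kept_pairs = list(combinations(neurons, 2))
train_outputs = [neurons[x] + neurons[y] for x, y in kept_pairs]
average_train_output = sum(train_outputs) / len(train_outputs)

# At test time, use all four and halve the sum.
test_output = sum(neurons.values()) / 2

print(average_train_output, test_output)
```

Note that `torch.nn.Dropout` does the equivalent scaling during training instead (kept outputs are multiplied by $1/(1-p)$), so nothing needs to be rescaled once the network is in `eval()` mode.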

---

## Graphical Example

---

## Dropout Intuition

* We may have strange behavior where there are no training points
* If one neuron "compensates" for the strange output of another, nothing corrects it
  * But with dropout, if one neuron has a strange output it will be corrected by SGD

---

## Example in Use

Suppose that your training data has two signals that are very correlated.

```python
import random
import torch

# For better repeatability
torch.random.manual_seed(0)

# Imagine that the inputs represent features in an image
net = torch.nn.Sequential(
        torch.nn.Linear(2, 100),
        torch.nn.ReLU(),
        torch.nn.Linear(100, 100),
        torch.nn.ReLU(),
        torch.nn.Linear(100, 1))

# Our training set.
def make_input_outputs(size):
    with torch.no_grad():
        # Make a batch of inputs that is just the same pairs of numbers
        inputs = torch.empty([size, 1]).uniform_(0, 1).repeat(1, 2)
        outputs = inputs[:,0].view((size, 1))
        for idx in range(size):
            # 1/1000 chance that a signal is missing.
            if random.random() < 0.001:
                inputs[idx,0] = 0
            if random.random() < 0.001:
                inputs[idx,1] = 0

    return inputs, outputs

optimizer = torch.optim.SGD(net.parameters(), lr=0.004, momentum=0.05, weight_decay=0.01)
loss_fn = torch.nn.MSELoss(reduction='sum')

net.train()

# Train for a long time.
for step in range(10000):
    x_inputs, y_targets = make_input_outputs(64)
    optimizer.zero_grad()
    output = net.forward(x_inputs)
    loss = loss_fn(output, y_targets)
    loss.backward()
    optimizer.step()

net.eval()

# Probe the network to test how it learned.
print("input a, input b, output")
for a in range(101):
    for b in range(101):
        probe = torch.tensor([a/100, b/100])
        output = net.forward(probe)
        print(f"{a/100}, {b/100}, {output.item()}")
```

---

## Biased Outputs

* The output should change with either input
  * But this mostly uses input 1. Why?
* It arbitrarily used 1 and not 2 because no loss prevented that result.

---

## With Dropout

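```python [|11,14]
import random
import torch

# For better repeatability
torch.random.manual_seed(0)

# Imagine that the inputs represent features in an image
net = torch.nn.Sequential(
        torch.nn.Linear(2, 100),
        torch.nn.ReLU(),
        torch.nn.Dropout(0.5),
        torch.nn.Linear(100, 100),
        torch.nn.ReLU(),
        torch.nn.Dropout(0.5),
        torch.nn.Linear(100, 1))

# Our training set.
def make_input_outputs(size):
    with torch.no_grad():
        inputs = torch.empty([size, 1]).uniform_(0, 1).repeat(1, 2)
        outputs = inputs[:,0].view((size, 1))
        for idx in range(size):
            # 1/1000 chance that a signal is missing.
            if random.random() < 0.001:
                inputs[idx,0] = 0
            if random.random() < 0.001:
                inputs[idx,1] = 0
    return inputs, outputs

optimizer = torch.optim.SGD(net.parameters(), lr=0.004, momentum=0.05, weight_decay=0.01)
loss_fn = torch.nn.MSELoss(reduction='sum')

# Dropout is only active in train() mode.
net.train()
for step in range(10000):
    x_inputs, y_targets = make_input_outputs(64)
    optimizer.zero_grad()
    loss = loss_fn(net.forward(x_inputs), y_targets)
    loss.backward()
    optimizer.step()

# eval() turns dropout off, so the probe uses every neuron.
net.eval()
print("input a, input b, output")
for a in range(101):
    for b in range(101):
        probe = torch.tensor([a/100, b/100])
        print(f"{a/100}, {b/100}, {net.forward(probe).item()}")
```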

---

## Unbiased Outputs

* Less biased. Why?
  * Randomly, paths that used input 1 or 2 were removed
* Makes it better to rely upon both

---

## Stochastic Depth

* Since randomness is good, how about dropping entire layers randomly?
* Stochastic depth is basically dropout, but for entire layers
  * We can't use it with any arbitrary network
* Our current linear network would have the wrong number of outputs to connect arbitrary layers
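
The usual way around the shape problem is to apply this to residual blocks, where each block adds its output to its input, so skipping a block leaves the shape unchanged. A minimal sketch under that assumption; the function name, `survival_prob`, and the toy "blocks" are all hypothetical, and plain numbers stand in for tensors.

```python
import random

def residual_stack(x, blocks, survival_prob, training, rng=random):
    """Apply residual blocks, randomly skipping each one during training."""
    for block in blocks:
        if training:
            # Skip the whole block with probability 1 - survival_prob.
            if rng.random() < survival_prob:
                x = x + block(x)
        else:
            # Use every block, scaled by how often it was present.
            x = x + survival_prob * block(x)
    return x

# Hypothetical "layers": each maps a number to another number.
blocks = [lambda v: 0.5 * v, lambda v: v + 1.0]

print(residual_stack(2.0, blocks, survival_prob=0.8, training=False))
print(residual_stack(2.0, blocks, survival_prob=0.8, training=True,
                     rng=random.Random(0)))
```

As with dropout, the test-time scaling keeps the expected contribution of each block the same as it was during training.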

---

## Intentional Noise

* If randomly dropping components is good, are other randomizations also helpful?
* If we add noise to the input data, does that add good stochasticity?
  * Good as in similar to the noise in SGD
* For regression problems, *input noise* smooths the output

---

## Other Noise

* Noise can be applied to weights
  * This forces the DNN to areas with flatter minima, where the noise is less harmful
  * And since wider minima tend to be better, this can be good
* We can also add noise to the labels
  * For classifiers, this improves the decision boundaries
  * Nothing provable like SVMs, but it encourages wider margins
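
For classifiers, a common deterministic stand-in for randomly flipping labels is label smoothing: the target distribution keeps most of its mass on the true class and spreads the rest over the others. The sketch below is one way to build such a target; `smooth_labels` and `epsilon` are names chosen for this example.

```python
def smooth_labels(true_class, num_classes, epsilon=0.1):
    """Soft target: 1 - epsilon on the true class, with the remaining
    epsilon spread evenly over the other classes."""
    off_value = epsilon / (num_classes - 1)
    return [1.0 - epsilon if k == true_class else off_value
            for k in range(num_classes)]

target = smooth_labels(true_class=3, num_classes=10, epsilon=0.1)
print(target)
```

In PyTorch, `torch.nn.CrossEntropyLoss` accepts a `label_smoothing` argument. It distributes the mass slightly differently (evenly over all classes, including the true one), but the intent is the same.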

---

## Data Augmentation

* But why apply meaningless noise when we could do something meaningful?
* Data augmentations are almost always used
* Image examples
  * Flipping, rotating, scaling, cropping, changing the color balance, etc
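
A minimal sketch of one such augmentation, a random horizontal flip, on an image stored as a list of rows. The function name and `flip_prob` are made up for this example; in practice a library such as torchvision would do this on tensors.

```python
import random

def random_horizontal_flip(image, rng, flip_prob=0.5):
    """Mirror a row-major image left-to-right with probability flip_prob."""
    if rng.random() < flip_prob:
        return [list(reversed(row)) for row in image]
    return [list(row) for row in image]

image = [[0, 1, 2],
         [3, 4, 5]]
rng = random.Random(0)
for _ in range(3):
    print(random_horizontal_flip(image, rng))
```

An augmentation is only meaningful if it preserves the label. Mirroring a photo of a cat is fine; mirroring a handwritten digit may not be.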

---

## Transfer Learning

* If we know that our data is bad, why not spend less time with it?
* Begin training on something else
  * Or with a different training goal
* Then cut off the head of that network, add a new layer for our actual task, and retrain
  * But use early stopping before we learn too much
* This is an attempt to engineer initial weights near a good minimum
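
The "cut off the head" step might look like the sketch below. The `pretrained` network is a randomly initialized stand-in for a model trained on some other task, and freezing the backbone is an assumption of this example rather than a requirement; fine-tuning everything with early stopping is the other common choice.

```python
import torch

# Stand-in for a network already trained on some other task.
pretrained = torch.nn.Sequential(
    torch.nn.Linear(784, 100),
    torch.nn.ReLU(),
    torch.nn.Linear(100, 1000))   # old head: 1000 outputs

# Cut off the head and keep the rest as a feature extractor.
backbone = torch.nn.Sequential(*list(pretrained.children())[:-1])
for param in backbone.parameters():
    param.requires_grad = False   # freeze, at least to begin with

# New head for our actual task, e.g. 10 digit classes.
model = torch.nn.Sequential(backbone, torch.nn.Linear(100, 10))

# Only the new head is handed to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.01)

print(model(torch.zeros(4, 784)).shape)
```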

---

## Book Summary

---

## My Summary

* There are many ways to regularize
  * Which options work can depend upon your model and data
* Some are low effort
  * L2, implicit regularization, dropout
* Others, like augmentations or transfer learning, are harder

---

## Next Time

* We need to use a better network to use some of these tools
* So we'll be moving on to convolutional networks next