# CS 462 - Lecture 08

## Measuring Performance

Bernhard Firner

2026-02-17

---

## Book

* Today we'll be going over chapter 8 of the [book](https://udlbook.github.io/udlbook/)

---

## Review

* Last time we went over initialization
  * This is the latest part of how we avoid local minima
  * A good initialization makes learning possible
* And once learning starts, SGD and momentum tend to work well

---

## Bad Initialization

* With bad initialization, we get stuck in a local minimum right away

---

## Lack of Gradients

* Parts of the loss surface correspond to 0 gradients
* For example, if all inputs to a ReLU are negative
* Good initialization solves this

<img style="width: 90%" class="r-stretch" src="./figures/03_relu.svg" />

---

## More Capacity

* A more complex loss surface tends to be smoother
* This makes local minima less "sharp"
* So we can avoid bad initialization by making our network deeper
  * Since this is the most efficient way to add more capacity

---

## Exploding Gradients

* The forward and backward passes involve many multiplications
* $f_k = \beta_k + \Omega_k ReLU(\beta_{k-1} + \Omega_{k-1}ReLU(\beta_{k-2} + \Omega_{k-2}x))$
* Suppose the average input is 0.5 and the average weight is 0.5
  * With a width of 1000, the second layer's output is roughly $0.5\times0.5\times1000 = 250$
* The forward and backward pass can cause gradients to explode
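
The arithmetic in these bullets can be checked numerically. This is a minimal sketch, assuming every input and every weight is exactly 0.5 (rather than merely averaging 0.5) and that biases are omitted; the depth of three layers is an illustrative choice.

```python
import torch

width = 1000
x = torch.full((width,), 0.5)               # every input is 0.5 (assumption)
weights = torch.full((width, width), 0.5)   # every weight is 0.5 (assumption)

means = []
h = x
for layer in range(1, 4):
    # One linear layer followed by a ReLU, no bias
    h = torch.relu(weights @ h)
    means.append(h.mean().item())
    print(f"layer {layer}: mean activation {means[-1]:.3e}")
```

Each layer multiplies the mean activation by $0.5 \times 1000 = 500$, so the values grow geometrically with depth.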

---

## Vanishing Gradients

* $f_k = \beta_k + \Omega_k ReLU(\beta_{k-1} + \Omega_{k-1}ReLU(\beta_{k-2} + \Omega_{k-2}x))$
* If the weights are zero-mean, roughly half of a layer's ReLU outputs will be 0
* A ReLU with zero-mean normal input with $\sigma=0.1$ has a mean output of around 0.04
* After passing through a 100-width layer, with $\sigma=0.1$ normally distributed weights, the mean output is 0.0016
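
The 0.04 figure for a single ReLU can be reproduced with a quick simulation. This sketch relies on the closed form $E[\mathrm{ReLU}(z)] = \sigma/\sqrt{2\pi}$ for $z \sim \mathcal{N}(0, \sigma^2)$; the sample count and seed are arbitrary choices.

```python
import math
import torch

torch.manual_seed(0)
sigma = 0.1

# Draw zero-mean normal inputs and pass them through a ReLU
z = torch.randn(1_000_000) * sigma
empirical = torch.relu(z).mean().item()

# Closed form for the mean of a rectified zero-mean normal
closed_form = sigma / math.sqrt(2 * math.pi)

print(f"empirical {empirical:.4f}, closed form {closed_form:.4f}")
```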

---

## Expected Gradient

* We can calculate the variance of each layer's output
  * $\sigma^2_{f_i} = \frac{W_{i-1}\sigma_{\Omega}^2\sigma^2_{f_{i-1}}}{2}$
* If weight variance is too high, multiplications lead to explosive outputs
* If weight variance is too low, multiplications lead to vanishing outputs
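
The variance formula can be checked against a simulated layer. This is a sketch under stated assumptions: $\sigma^2_{f_i}$ is read as the variance of the values entering the ReLU, biases are omitted, and the widths, sample count, and $\sigma_\Omega = 0.1$ are illustrative.

```python
import torch

torch.manual_seed(0)
width_in, width_out, samples = 100, 1000, 5000  # illustrative sizes
sigma_omega = 0.1

f_prev = torch.randn(samples, width_in)                  # previous pre-activations, variance ~1
omega = torch.randn(width_out, width_in) * sigma_omega   # zero-mean normal weights
f_next = torch.relu(f_prev) @ omega.T                    # next pre-activations, no bias

# sigma^2_{f_i} = W_{i-1} * sigma^2_Omega * sigma^2_{f_{i-1}} / 2
predicted = width_in * sigma_omega**2 * f_prev.var().item() / 2
measured = f_next.var().item()

print(f"predicted {predicted:.3f}, measured {measured:.3f}")
```

With these numbers the variance is halved by a single layer, so a deep stack of such layers would shrink the signal geometrically.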

---

## Interpretation

* Since we choose the initialization, we can pick the weight variance that keeps the output variance stable
  * $\sigma^2_{\Omega_i} = \frac{2}{W_{i-1}}$
* Called Kaiming or He initialization (read the [arxiv paper](https://arxiv.org/abs/1502.01852))
* PyTorch offers this initialization as an existing function
  * [kaiming normal](https://docs.pytorch.org/docs/stable/nn.init.html#torch.nn.init.kaiming_normal_)
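
As a quick check of what the PyTorch function does, the sketch below initializes one layer and compares the measured weight standard deviation with $\sqrt{2/W_{i-1}}$. The layer sizes match the first layer used later in this lecture; the function's default `mode="fan_in"` is what makes the input width the relevant count.

```python
import math
import torch

torch.manual_seed(0)
layer = torch.nn.Linear(784, 2048)

# In-place Kaiming (He) initialization with the ReLU gain of sqrt(2)
torch.nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")

fan_in = layer.weight.shape[1]            # number of inputs to the layer
expected_std = math.sqrt(2.0 / fan_in)
measured_std = layer.weight.std().item()

print(f"expected {expected_std:.4f}, measured {measured_std:.4f}")
```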

---

## Other Solutions

* There are other solutions
  * But an improved initialization is the most elegant
  * No added calculations during forward or backward passes

---

## Measuring Performance

* We can (hopefully) train a deep neural network
  * How can we tell if we've done a good job?
* And if something goes wrong, how can we diagnose the problem?

---

## Dataset

* We're going to use a simple dataset
  * Small enough that you can train on a laptop CPU
* Called "digits"
  * Includes written numeric digits, 0-9

---

## Digits

* Datasets were small in the 80s
* So we couldn't tell if neural networks had advantages
* AT&T/Bell Labs assembled the first large digit dataset

<img style="width: 60%" class="r-stretch" src="./figures/Larry_OriginalDigitsCropped.png" />

---

## MNIST Digits

* Handwritten numeric digits converted to 28x28 grayscale images
  * Half drawn by U.S. Census Bureau employees, other half by high school students
* 60,000 training images
* 10,000 test images
* Roughly equally distributed over classes 0-9

---

## First 100 Train and Test


---

## Comments

* This is an easy dataset
  * Labels are correct
  * Images are all the same size
  * Digits are mostly centered
* Excellent for learning, but expect the real world to be more difficult

---

## Example Training Code

```python
import argparse
import os
import numpy as np
import random
from PIL import Image
import torch

import mnist_util

class Linear(torch.nn.Module):
    """A linear neural network."""

    def __init__(self, nonlinearity = torch.nn.ReLU, classes=10):
        super(Linear, self).__init__()
        self.net = torch.nn.Sequential(
                torch.nn.Flatten(),
                torch.nn.Linear(28*28, 2048),
                nonlinearity(),
                torch.nn.Linear(2048, 120),
                nonlinearity(),
                torch.nn.Linear(120, 84),
                torch.nn.Linear(84, classes)
                )
        self.decision = torch.nn.Softmax(dim=1)

        # Kaiming initialization for the first three linear layers
        torch.nn.init.kaiming_normal_(self.net[1].weight.data, nonlinearity="relu")
        torch.nn.init.kaiming_normal_(self.net[3].weight.data, nonlinearity="relu")
        torch.nn.init.kaiming_normal_(self.net[5].weight.data, nonlinearity="relu")

    def forward(self, x):
        """Forward through the network."""
        # Note: this returns softmax probabilities, and torch.nn.CrossEntropyLoss
        # applies its own log-softmax to whatever it is given.
        y_hat = self.decision(self.net(x))
        return y_hat

def preprocess(images, order, device):
    # Normalize and then add a channel dimension
    # This will be required for convolutions, although it doesn't matter for the linear network.
    preprocessed = torch.tensor(images[order]).float()
    # Convert to 0 mean and unit variance (divide by the standard deviation, not the variance)
    var, mean = torch.var_mean(preprocessed)
    preprocessed = (preprocessed - mean) / torch.sqrt(var)

    # Add a channel dimension
    return preprocessed.reshape((-1, 1, 28, 28)).to(device)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--train",
        required=True,
        help="gzip file with mnist data.")
    parser.add_argument(
        "--test",
        required=True,
        help="gzip file with mnist data.")
    parser.add_argument(
        "--train_labels",
        required=True,
        help="gzip file with mnist labels.")
    parser.add_argument(
        "--test_labels",
        required=True,
        help="gzip file with mnist labels.")
    parser.add_argument(
        "--epochs",
        required=False,
        type=int,
        default=50,
        help="Number of epochs to train.")
    parser.add_argument(
        "--train_samples",
        required=False,
        type=int,
        default=60000,
        help="Number of samples to use for training.")
    parser.add_argument(
        "--batch_size",
        required=False,
        type=int,
        default=64,
        help="The batch size.")
    parser.add_argument(
        "--random_seed",
        required=False,
        type=int,
        default=112,
        help="The random seed.")
    parser.add_argument(
        "--device",
        required=False,
        type=str,
        default=None,
        help="Override the automatically determined device (cuda or cpu).")

    args = parser.parse_args()

    # Seed all of the random number generators for repeatability.
    # Keep in mind though, that some algorithms are nondeterministic, so this
    # doesn't guarantee fully repeatable results.
    # np.random.seed seeds the global NumPy generator used by np.random.shuffle below.
    np.random.seed(args.random_seed)
    torch.manual_seed(args.random_seed)
    random.seed(args.random_seed)

    if args.device is None:
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    else:
        device = torch.device(args.device)
    print(f"Using device: {device}")

    print("Loading data")
    X_train_digits = (mnist_util.load_mnist_ubyte(args.train)/255)[:args.train_samples]
    X_test_digits = mnist_util.load_mnist_ubyte(args.test)/255
    Y_train_digits = mnist_util.load_mnist_labels(args.train_labels)[:args.train_samples]
    Y_test_digits = mnist_util.load_mnist_labels(args.test_labels)
    unique_classes = np.unique(Y_test_digits)
    total_classes = len(unique_classes)
    # Make sure the first class is 0
    if min(unique_classes) != 0:
        Y_train_digits = Y_train_digits.copy() - min(unique_classes)
        Y_test_digits = Y_test_digits.copy() - min(unique_classes)

    ## If you want to save some digits
    #for i in range (20):
    #    Image.fromarray((255*X_test_digits[i]).reshape((28, 28)).astype(np.uint8)).save(f"example_test_digit_{i}.png")
    #print(f"Saved images have classes {Y_test_digits[:20]}")

    # Create the model
    model = Linear(classes=total_classes)

    # Don't shuffle the test data, but otherwise treat it the same as the training data.
    X_test = preprocess(X_test_digits, np.arange(X_test_digits.shape[0]), device)
    Y_test = torch.tensor(Y_test_digits).long().to(device)
    test_batch_size = 1000

    # Shuffle and preprocess the training data
    order = np.arange(X_train_digits.shape[0])
    np.random.shuffle(order)

    X_train = preprocess(X_train_digits, order, device)
    Y_train = torch.tensor(Y_train_digits[order]).long().to(device)

    # Move the model to the selected device
    model.to(device)

    optimizer = torch.optim.Adam(model.parameters(), lr=0.0002)

    criterion = torch.nn.CrossEntropyLoss()

    # See how many batches we'll use per epoch
    batches = int(np.ceil(X_train_digits.shape[0]/float(args.batch_size)))
    # We could just do this in one step, but let's assume that memory is finite
    test_batches = int(np.ceil(X_test_digits.shape[0]/float(test_batch_size)))

    print(f"Training on {X_train_digits.shape[0]} examples over {batches} batches")

    for epoch in range(args.epochs):

        total_loss = 0.0
        model.train()
        for batch in range(batches):
            begin = batch*args.batch_size
            end = (batch+1)*args.batch_size

            X_batch = X_train[begin:end]
            Y_batch = Y_train[begin:end]

            # Zero gradients before gradient calculation
            optimizer.zero_grad()

            y_hat = model(X_batch)
            loss = criterion(y_hat, Y_batch)
            total_loss += loss.item() * X_batch.size(0)

            # Gradient calculation
            loss.backward()
            # Update weights
            optimizer.step()

        epoch_loss = total_loss / X_train.size(0)
        print(f"Epoch {epoch} training loss {epoch_loss}")

        # Evaluation
        # Don't calculate gradients during these steps
        model.eval()
        with torch.no_grad():
            total_loss = 0.0
            for batch in range(test_batches):
                begin = batch*test_batch_size
                end = (batch+1)*test_batch_size

                X_batch = X_test[begin:end]
                Y_batch = Y_test[begin:end]

                y_hat = model(X_batch)
                loss = criterion(y_hat, Y_batch)
                total_loss += loss.item() * X_batch.size(0)
            epoch_loss = total_loss / X_test.size(0)
            print(f"Epoch {epoch} testing loss {epoch_loss}")
            ## Accuracy values
            # We can't just run over everything, that takes too much memory. Chop it up.
            matches = 0
            mismatches = 0
            for testbatch in range(int(np.ceil(X_train_digits.shape[0]/float(test_batch_size)))):
                begin = testbatch*test_batch_size
                end = (testbatch+1)*test_batch_size
                y_hat = model(X_train[begin:end])
                classes = torch.argmax(y_hat, dim=1)
                matches += torch.sum(classes == Y_train[begin:end])
                mismatches += torch.sum(classes != Y_train[begin:end])
            train_accuracy = matches/X_train.size(0)
            matches = 0
            mismatches = 0
            for testbatch in range(int(np.ceil(X_test_digits.shape[0]/float(test_batch_size)))):
                begin = testbatch*test_batch_size
                end = (testbatch+1)*test_batch_size
                y_hat = model(X_test[begin:end])
                classes = torch.argmax(y_hat, dim=1)
                matches += torch.sum(classes == Y_test[begin:end])
                mismatches += torch.sum(classes != Y_test[begin:end])
            test_accuracy = matches/X_test.size(0)
            print(f"Epoch {epoch} accuracies are {train_accuracy} {test_accuracy}")

    # Final evaluation
    model.eval()
    with torch.no_grad():
        # DNN classification
        matches = 0
        mismatches = 0
        failed_indices = []
        for testbatch in range(int(np.ceil(X_test_digits.shape[0]/float(test_batch_size)))):
            begin = testbatch*test_batch_size
            end = (testbatch+1)*test_batch_size
            y_hat = model(X_test[begin:end])
            classes = torch.argmax(y_hat, dim=1)
            matches += torch.sum(classes == Y_test[begin:end])
            mismatches += torch.sum(classes != Y_test[begin:end])
            # torch.where returns a tuple; convert in-batch positions to dataset indices
            batch_failures = torch.where(classes != Y_test[begin:end])[0]
            failed_indices.extend((begin + batch_failures).tolist())
        sum_matches = int(matches)
        test_accuracy = sum_matches/X_test.size(0)

        print(f"Final accuracy {sum_matches}/{X_test.size(0)} ({test_accuracy})")
        print(f"Final failures at indices {failed_indices}")

```

---

## Loss Curve

* Loss vs Epoch curves are our most common graph
  * Since every neural network is using a loss function
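
One way to get the numbers for such a curve is to parse the lines printed by the example training script. This is a minimal sketch, assuming the script's output was captured as text; `parse_losses` is a hypothetical helper and the values in `example_log` are made up purely to show the format.

```python
import re

# Matches lines such as "Epoch 3 training loss 1.4875"
LOSS_LINE = re.compile(r"Epoch (\d+) (training|testing) loss ([-+0-9.eE]+)")

def parse_losses(log_text):
    """Return per-split loss lists, in the order the lines appear."""
    curves = {"training": [], "testing": []}
    for match in LOSS_LINE.finditer(log_text):
        _epoch, split, value = match.groups()
        curves[split].append(float(value))
    return curves

# Made-up numbers, only to illustrate the log format
example_log = """Epoch 0 training loss 1.62
Epoch 0 testing loss 1.51
Epoch 1 training loss 1.50
Epoch 1 testing loss 1.49
"""

curves = parse_losses(example_log)
print(curves)
```

The two lists can then be handed to any plotting tool, with the list index as the epoch.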

---

## Reading the Curve

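* The first training loss is higher than the first test loss
  * Training loss is averaged over the epoch, while the weights are still changing
  * Test loss is measured after each epoch completes
  * So the first training loss values will be high
* This curve basically shows that everything is great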


---

## Accuracy Vs Epoch

* Accuracy is measured after each epoch completes

---

## Learning Limit

* Notice that our training loss saturates at some point
  * Once this occurs, our testing loss cannot improve
  * The test set is never seen by the network during training
* Basically, if a DNN is "perfect" on the training data, it cannot improve
  * To conquer the last failures, we likely need more training data

---

## Imperfections

* These results have been basically perfect
* But plenty of things can go wrong
  * Let's begin with things we talked about

---

## Setup

* Our initialization and preprocessing are important

```python
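class Linear(torch.nn.Module):
    """A linear neural network."""

    def __init__(self, nonlinearity=torch.nn.ReLU, classes=10):
        ...  # layers are defined as in the full listing
        torch.nn.init.kaiming_normal_(self.net[1].weight.data, nonlinearity="relu")
        torch.nn.init.kaiming_normal_(self.net[3].weight.data, nonlinearity="relu")
        torch.nn.init.kaiming_normal_(self.net[5].weight.data, nonlinearity="relu")

    def forward(self, x):
        """Forward through the network."""
        y_hat = self.decision(self.net(x))
        return y_hat

def preprocess(images, order, device):
    preprocessed = torch.tensor(images[order]).float()
    # Convert to 0 mean and unit variance
    var, mean = torch.var_mean(preprocessed)
    preprocessed = (preprocessed - mean) / torch.sqrt(var)
    # Pad 2 on every side, changing the 28x28 to 32x32
    #preprocessed = torch.nn.functional.pad(preprocessed, pad=(2,2,2,2))
    # Add a channel dimension
    return preprocessed.reshape((-1, 1, 28, 28)).to(device)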
```

---

## Intentional Mistakes

* Let's see how important those are
  * Change the weight initialization to a zero-mean, unit-variance normal (the `normal_` defaults)

```python
        torch.nn.init.normal_(self.net[1].weight.data)
        torch.nn.init.normal_(self.net[3].weight.data)
        torch.nn.init.normal_(self.net[5].weight.data)
```
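
To get a feel for why this change matters, the sketch below compares the scale of the first layer's outputs under the two initializations. It assumes unit-variance random inputs as a stand-in for a batch of preprocessed images; `first_layer_output_std` is a hypothetical helper written for this illustration.

```python
import torch

torch.manual_seed(0)
# Stand-in for a batch of normalized, flattened 28x28 images
x = torch.randn(256, 28 * 28)

def first_layer_output_std(init_fn):
    """Initialize a 784 -> 2048 layer with init_fn and measure its output scale."""
    layer = torch.nn.Linear(28 * 28, 2048)
    init_fn(layer.weight)
    with torch.no_grad():
        return layer(x).std().item()

std_normal = first_layer_output_std(torch.nn.init.normal_)
std_kaiming = first_layer_output_std(
    lambda w: torch.nn.init.kaiming_normal_(w, nonlinearity="relu"))

print(f"normal_: {std_normal:.1f}, kaiming_normal_: {std_kaiming:.1f}")
```

With unit-variance weights the outputs scale like $\sqrt{784} = 28$, while the Kaiming weights keep them near $\sqrt{2}$.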

---

## Normal Initialization Loss

---

## Normal Initialization Error

---

## SGD Vs Adam

* We are using Adam
  * But what if we use plain SGD without momentum?
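
In the training script this is a one-line change to the optimizer. This sketch uses a tiny stand-in model; the SGD learning rate of 0.01 is an assumed value, since the example script only specifies the Adam setting.

```python
import torch

model = torch.nn.Linear(4, 2)  # stand-in for the network in the example script

# Optimizer used in the example script
adam = torch.optim.Adam(model.parameters(), lr=0.0002)

# Plain SGD: momentum defaults to 0, so no running average of gradients is kept
sgd = torch.optim.SGD(model.parameters(), lr=0.01)  # lr is an assumption

print(sgd.defaults["momentum"])
```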

---

## SGD Loss

---

## SGD Error

---

## Data Quality

* What if our test data has labels that we haven't seen during training?
* Or what if our training data is too small?

---

## Variance

* When our training set is too small, what does that mean?
* Assume the statistics of our training and testing sets are the same
* Failures on the test set mean that our training set is missing something
* The amount of data required is determined by the natural *variance* of the data

---

## 5k Training Samples

---

## Bias To Simplicity

* A small training set is not representative
  * We may be better off not fully learning it
* We could attempt to solve this problem by *biasing* our model to a simpler solution
  * One (non-ideal) way to do this is with a smaller model

---

## Bias-Variance Tradeoff

* There is a tradeoff
* A more complicated model can match the natural variance of a dataset
* But if our dataset is not representative, we can bias the model towards a simpler solution
  * Comes at the cost of being able to match the observed variance
* In deep learning, we tend to address this with regularization

---

## Smaller Models

* We haven't covered regularization yet, so let's look at model size
* The current size was picked arbitrarily
  * It may actually be wrong for this dataset!
* Halve the hidden units of the first layer
  * Go from 2048 to 1024
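
To see how much capacity each step removes, the sketch below counts parameters for the hidden widths tried in the following slides. It assumes the same layer layout as the example script with only the first hidden width changed; `make_net` and `count_parameters` are hypothetical helpers.

```python
import torch

def make_net(hidden, classes=10):
    """Same layer layout as the example script, with a configurable first hidden width."""
    return torch.nn.Sequential(
        torch.nn.Flatten(),
        torch.nn.Linear(28 * 28, hidden),
        torch.nn.ReLU(),
        torch.nn.Linear(hidden, 120),
        torch.nn.ReLU(),
        torch.nn.Linear(120, 84),
        torch.nn.Linear(84, classes),
    )

def count_parameters(net):
    """Total number of weights and biases."""
    return sum(p.numel() for p in net.parameters())

counts = {h: count_parameters(make_net(h)) for h in (2048, 1024, 512, 256, 128)}
for hidden, n in counts.items():
    print(f"{hidden:5d} hidden units: {n:,} parameters")
```

Almost all of the parameters sit in the first layer, so halving its width roughly halves the model.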

---

## 1024 Hidden

---

## Smaller Sometimes Better

* We aren't using explicit regularization
  * So our larger models may be learning weakly correlated features
* Notice that we only improved in the training set though
  * This implies some slightly contradictory training samples

---

## 512 Hidden

---

## 512 Comments

* Practically the same as 1024
  * Previously 98.3% test accuracy vs 98.1% now
* Let's keep going smaller
  * Keep in mind that 28 by 28 images have 784 pixels

---

## 256 Hidden

---

## 128 Hidden

---

## Limits

* We are beginning to see a drop in performance
* Modern deep learning has techniques that let us keep oversizing our models
  * Without using them, we can improve training loss by finding a good size
* But notice that testing loss is unmoved by this tuning

---

## Explanation

* [Limits on Learning Machine Accuracy Imposed by Data Quality](https://papers.nips.cc/paper_files/paper/1994/hash/1e056d2b0ebd5c878c550da6ac5d3724-Abstract.html)
  * Paper from 1994
* Increasing capacity learns the training set
  * Only good if the training and testing sets match!

---

## Error Vs Capacity

* If our training and testing sets match, adding more training data is the solution
  * Capacity is only necessary if we aren't learning the training set
* Excess capacity (without regularization) learns training set features that aren't in the test data

---

## Dataset Matching

* The test loss on digits is well behaved
  * A sign that our training and test data match
* If there was a mismatch, simplifying the model would likely improve test performance

---

## Perfect Learning

* An important note about debugging a training pipeline
* Training on a small dataset should **always** work (the model should fit it perfectly)
  * If it doesn't, your data must be contradictory
  * Or you have a problem with initialization, the learning algorithm, etc.
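
This sanity check can be sketched in a few lines. The example assumes a made-up dataset of 32 random samples with arbitrary labels and a small two-layer network; none of these sizes come from the lecture, and the model here passes raw logits to the loss.

```python
import torch

torch.manual_seed(0)
X = torch.randn(32, 20)            # 32 made-up samples with 20 features each
Y = torch.randint(0, 10, (32,))    # arbitrary labels, one per sample, so nothing is contradictory

model = torch.nn.Sequential(
    torch.nn.Linear(20, 128), torch.nn.ReLU(), torch.nn.Linear(128, 10))
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
criterion = torch.nn.CrossEntropyLoss()

# Full-batch training on the tiny dataset
for step in range(300):
    optimizer.zero_grad()
    loss = criterion(model(X), Y)
    loss.backward()
    optimizer.step()

with torch.no_grad():
    accuracy = (model(X).argmax(dim=1) == Y).float().mean().item()
print(f"final loss {loss.item():.4f}, training accuracy {accuracy:.2f}")
```

If a pipeline cannot memorize a handful of samples like this, the problem is in the data or the training setup rather than in the model's capacity.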

---

## Training on 1000

---

## Bad Data

* Almost any loss curve you'll see in a paper has something these curves don't
  * Bad labels and bad data
* Bad data could be anything
  * Mislabelled data, impossible examples, ambiguous examples, etc.

---

## Bad Data Effects

* Much like a student memorizing something they don't understand, your model *will* memorize answers to ambiguous samples
* But this won't help on the test set!
* Let's test with 900 good labels, 100 bad labels, and 512 hidden units
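
The slides do not say how the bad labels were produced, so the following is only one possible way to corrupt 100 of 1000 labels. The random labels are a stand-in for the first 1000 training labels, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
num_samples, num_bad, num_classes = 1000, 100, 10

# Stand-in for the first 1000 training labels
labels = rng.integers(0, num_classes, size=num_samples)
noisy = labels.copy()

# Pick 100 distinct samples to corrupt
bad_idx = rng.choice(num_samples, size=num_bad, replace=False)
# Adding an offset in 1..9 modulo 10 guarantees each chosen label changes
offsets = rng.integers(1, num_classes, size=num_bad)
noisy[bad_idx] = (labels[bad_idx] + offsets) % num_classes

num_changed = int((noisy != labels).sum())
print(f"{num_changed} of {num_samples} labels corrupted")
```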

---

## 900 + 100 bad

---

## Adding Capacity

* Let's say we looked at this and decided to add capacity
  * Why not, bigger is better, right?
* Return to 2048
  * Or maybe

---

## 2048 Hidden

---

## Classic Learning Curve

* Diverging training and testing loss curves are the classic picture
  * But it happens because of the data
* It appears when the training data doesn't match the testing data
  * Because of ambiguity, errors, or noise (for regression tasks)

---

## Quiz

* Quiz 2 is later this week!
  * Covers lectures 5-8 (today)
* So here are some example questions

---

## Question

* Which statement about backpropagation is **true**?
  * Backpropagation requires fully defined gradients, so it does not work with ReLU.
  * Backpropagation requires us to store intermediate results during the forward pass.
  * Backpropagation globally optimizes all parameters simultaneously.
  * None of the above are true.

---

## Answer

* Which statement about backpropagation is **true**?
  * Backpropagation requires fully defined gradients, so it does not work with ReLU.
  * **Backpropagation requires us to store intermediate results during the forward pass.**
  * Backpropagation globally optimizes all parameters simultaneously.
  * None of the above are true.

---

## Question

* What statement about loss functions is **false**?
  * We use the negative logs of probabilities for numerical stability.
  * The MSE loss is equivalent to the loss function for the mean of a normal.
  * Loss functions must always be positive values.
  * We must calculate the derivative of the loss function for backpropagation.

---

## Answer

* What statement about loss functions is **false**?
  * We use the negative logs of probabilities for numerical stability.
  * The MSE loss is equivalent to the loss function for the mean of a normal.
  * **Loss functions must always be positive values.**
  * We must calculate the derivative of the loss function for backpropagation.

---

## Question

* Stochastic gradient descent accomplishes all of the following **except**:
  * Reduces memory requirements during training.
  * Increases the stochasticity of the training process.
  * Creates multiple different loss surfaces during each epoch.
  * SGD accomplishes all of the above.

---

## Answer

---

## Question

* Consider momentum:
  * $m_{t+1} \leftarrow \beta\cdot m_t + (1 - \beta)\frac{\delta L[x, \phi]}{\delta \phi}$
* What statement is **true**?
  * Momentum can cause the learning process to overshoot a global minimum.
  * The parameters at $\phi_{t+1}$ are updated using a combination of $m_{t+1}$ and $\frac{\delta L[x, \phi]}{\delta \phi}$.
  * Because the momentum has no upper bound, it must be normalized before parameter updates.
  * Momentum guarantees that learning will stop in a local minimum.

---

## Answer

* Consider momentum:
  * $m_{t+1} \leftarrow \beta\cdot m_t + (1 - \beta)\frac{\delta L[x, \phi]}{\delta \phi}$
* What statement is **true**?
  * **Momentum can cause the learning process to overshoot a global minimum.**
  * The parameters at $\phi_{t+1}$ are updated using a combination of $m_{t+1}$ and $\frac{\delta L[x, \phi]}{\delta \phi}$.
  * Because the momentum has no upper bound, it must be normalized before parameter updates.
  * Momentum guarantees that learning will stop in a local minimum.

---

## Question

* What statement about Kaiming initialization is **true**?
  * Weights are initialized as a function of the number of dataset samples.
  * The derivation of Kaiming initialization assumes that weights are uniformly distributed.
  * The optimal weights are calculated to control the variance of layer outputs.
  * None of the above are true.

---

## Answer

* What statement about Kaiming initialization is **true**?
  * Weights are initialized as a function of the number of dataset samples.
  * The derivation of Kaiming initialization assumes that weights are uniformly distributed.
  * **The optimal weights are calculated to control the variance of layer outputs.**
  * None of the above are true.

---

## Question

* What does this graph show?
  * The model is overfitting the training data, causing test loss to increase.
  * The model is too small and needs more capacity to learn the test set.
  * The training and testing datasets are contradictory.
  * The training data does not fully capture the test set variance.

---

## Answer

* What does this graph show?
  * The model is overfitting the training data, causing test loss to increase.
  * The model is too small and needs more capacity to learn the test set.
  * The training and testing datasets are contradictory.
  * **The training data does not fully capture the test set variance.**