# CS 461 - Lecture 24

## Machine Learning Principles

### Avoiding Label Costs

Bernhard Firner

2025-12-1

---

## Topics

* Swapping things around to allow for the quiz and homework
  * Alternatives to labelling today
  * Lessons from ResNext: the magic of hyperparameters next class
  * Quiz on Monday, so this will turn into a wrapup for neural networks
  * Review of topics and major takeaways next Thursday

---

## NNs So Far

* We've seen some of the breakthroughs since LeNet
  * We can train larger, deeper networks
  * Training is more resilient to bad data
* Larger networks handle larger datasets
  * But now we have to worry about label errors

---

## Labelling

* Labels are the bane of anyone working with neural networks
* Labels are always
  * too expensive to generate
  * too slow to generate
  * a little bit wrong
* Having a 6 month turnaround time between data collection and getting training labels is possible

---

## Digits

* Substantial progress on LeNet wasn't made before the digits dataset
* The dataset is highly curated
  * Possibly no labelling errors in the 60k sample training set
  * ImageNet, and any modern dataset, is not as good

---

## Why Not?

* Modern datasets are too large
  * Curating 60k images where the labels are (mostly) clear is one thing
  * What if you need to identify every type of car or bird in a photograph?
    * And what if some of them are camouflaged?
      * And you have 1 million images?

---

## Label Types

* Some labels are too difficult to disambiguate
  * e.g. toilet seats and toilet paper in ImageNet
  * And can you tell the difference between a tissue, a paper towel, and toilet paper?
* Beyond correlated labelling classes, we now have new label types
  * Bounding boxes, 3D cubes, action classes, etc

---

## Label Semantics

* Should bounding boxes and cubes cover an object when it is partially obscured?
* If you are training a driving DNN, and you use bounding boxes of vehicles, does a reflection of a vehicle in a window get a bounding box?
  * Is a stop sign that's been twisted around still a stop sign?
  * Are road lines still road lines if they are actually temporary tape that has become unstuck?

---

## Collection Hardware

* Plenty of problems here
  * Lens caps on cameras, dust covers on lidar
  * Sensors misaligned
  * Firmware version errors

---

## Practical Result

* There are no "perfect" datasets for modern data
  * We want huge datasets that are free from errors
  * We won't get them
* Making a perfect dataset would not only take a huge amount of money, it would also take a huge amount of time

---

## Solutions

* We obviously need solutions
* Three broad groups
  * Automatically generate labels, decreasing cost
  * Use other labels and apply the trained networks to our desired task
  * Train without labels

---

## Detecting Problems

* Let's talk about how we find out that our labels have problems
* We'll use digits, since it is easy to look at
  * And it only takes a few minutes to train a model

---

## Models

* To show the universality of these techniques, we'll use three models
  * Linear model
  * LeNet 5
  * ResNet with 9 residual blocks

-v-

## Linear Model

```python
class Linear(torch.nn.Module):
    """A fully connected network (the "linear model" in these slides)."""

    def __init__(self, nonlinearity = torch.nn.ReLU, classes=10):
        super(Linear, self).__init__()
        self.net = torch.nn.Sequential(
                # Inputs are 32x32 images, flattened to 1024 values
                torch.nn.Flatten(),
                torch.nn.Linear(1024, 2048),
                nonlinearity(),
                torch.nn.Linear(2048, 120),
                nonlinearity(),
                torch.nn.Linear(120, 84),
                torch.nn.Linear(84, classes)
                )
        self.decision = torch.nn.Softmax(dim=1)

        torch.nn.init.uniform_(self.net[1].weight.data, a=-1, b=1)

    def forward(self, x):
        """Forward through the network."""
        y_hat = self.decision(self.net(x))
        return y_hat
```

-v-

## LeNet 5

```python
class SquashingFunction(torch.nn.Module):
    def __init__(self):
        super(SquashingFunction, self).__init__()
        self.squash = torch.nn.Tanh()
        self.const = 1.7159

    def forward(self, x):
        return self.const * self.squash(x)

class LeNet5(torch.nn.Module):
    """A mostly faithful recreation of LeNet 5."""

    def __init__(self, nonlinearity = SquashingFunction):
        super(LeNet5, self).__init__()
        self.net = torch.nn.Sequential(
                # 5x5 convolution with 6 output feature maps
                torch.nn.Conv2d(1, 6, 5),
                # 2x2 subsampling learned bias and weight, called S2 in the paper.
                # We'll use an average pool and then a 1x1 conv with 6 groups to emulate that.
                torch.nn.AvgPool2d(kernel_size=2, stride=2),
                torch.nn.Conv2d(6, 6, kernel_size=1, groups=6),
                nonlinearity(),
                # 5x5 convolution with 16 output feature maps of size 10x10
                torch.nn.Conv2d(6, 16, kernel_size=5),
                # This again, emulating layer S4 from the paper.
                torch.nn.AvgPool2d(kernel_size=2, stride=2),
                torch.nn.Conv2d(16, 16, kernel_size=1, groups=16),
                nonlinearity(),
                # The final convolution reduces features to 1x1
                torch.nn.Conv2d(16, 120, kernel_size=5, stride=1),
                torch.nn.Flatten(),
                torch.nn.Linear(120, 84),
                # We are not going to try to recreate the original exemplar-based function in LeNet5
                #euclidean_rbf(84, 12)
                torch.nn.Linear(84, 10),
                )
        self.decision = torch.nn.Softmax(dim=1)

        torch.nn.init.uniform_(self.net[0].weight.data, a=-1, b=1)

    def forward(self, x):
        """Forward through the network."""
        y_hat = self.decision(self.net(x))
        return y_hat
```

-v-

## ResNet

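```python
class ResidualBlock(torch.nn.Module):
    def __init__(self, in_channels, out_channels, nonlinearity=torch.nn.ReLU, stride=1):
        super(ResidualBlock, self).__init__()
        self.residual = torch.nn.Sequential(
            torch.nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=1, padding=1, bias=False),
            torch.nn.BatchNorm2d(out_channels),
            nonlinearity(),
            torch.nn.Conv2d(out_channels, out_channels, kernel_size=3, stride=stride, padding=1, bias=False),
            torch.nn.BatchNorm2d(out_channels),
        )
        if stride != 1 or in_channels != out_channels:
            self.shortcut = torch.nn.Sequential(
                torch.nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=stride, bias=False),
                torch.nn.BatchNorm2d(out_channels))
        else:
            self.shortcut = torch.nn.Sequential()
        # The nonlinearity after summing the residual and shortcut
        self.nonlinearity = nonlinearity()

    def forward(self, x):
        out = self.residual(x)
        x = self.shortcut(x)
        return self.nonlinearity(out + x)

class ResNet(torch.nn.Module):
    """A small ResNet with 9 residual blocks."""

    def __init__(self, nonlinearity = torch.nn.ReLU):
        super(ResNet, self).__init__()
        self.net = torch.nn.Sequential(
                # 5x5 convolution with 16 output feature maps
                torch.nn.Conv2d(1, 16, kernel_size=5),
                torch.nn.BatchNorm2d(16),
                nonlinearity(),
                ## Now we are working with 28x28 feature maps
                ## 3 blocks per downscale, to 14x14, 7x7, and 4x4
                ResidualBlock(16, 16),
                ResidualBlock(16, 16),
                ResidualBlock(16, 32, stride=2),
                ResidualBlock(32, 32),
                ResidualBlock(32, 32),
                ResidualBlock(32, 64, stride=2),
                ResidualBlock(64, 64),
                ResidualBlock(64, 64),
                ResidualBlock(64, 128, stride=2),
                # A single average pool to reduce all feature channels to 1x1
                torch.nn.AdaptiveAvgPool2d((1, 1)),
                torch.nn.Flatten(),
                torch.nn.Linear(128, 84),
                # We are not going to try to recreate the original exemplar-based function in LeNet5
                #euclidean_rbf(84, 12)
                torch.nn.Linear(84, 10),
                )
        self.decision = torch.nn.Softmax(dim=1)

        torch.nn.init.uniform_(self.net[0].weight.data, a=-1, b=1)

    def forward(self, x):
        """Forward through the network."""
        y_hat = self.decision(self.net(x))
        return y_hat
```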

---

## Ideal Loss Curves

---

## Loss Curve Interpretation

* Notice how training and testing loss track one another
  * Even with the linear model, which grows worse over time
  * So overtraining isn't indicated by a divergence of training and testing loss


---

## Ideal Accuracy Curves

---

## Accuracy Interpretation

* Loss and accuracy track one another
* And with bigger networks, we can train forever
  * In fact, this deep learning stuff looks easy!


---

## Features

* At this point, we can skip the linear layers of the network
* The features present can be used to train an SVM, but its classification accuracy is the same as the DNN
  * We'll come back to this again though

---

## Noisy Labels

* Let's add some noise into the labels
  * Given an error rate, change training labels to a different class
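
A minimal NumPy sketch of this kind of label corruption. The function name `add_label_noise` and its arguments are illustrative assumptions, not names taken from the demo code. It picks a fraction of the training labels without replacement and moves each one to a uniformly chosen *different* class.

```python
import numpy as np

def add_label_noise(labels, error_rate, num_classes=10, seed=0):
    """Return a copy of labels with a fraction error_rate changed to a wrong class."""
    rng = np.random.default_rng(seed)
    noisy = np.array(labels, copy=True)
    total_errors = int(error_rate * len(noisy))
    # Choose distinct samples so that exactly total_errors labels change
    to_change = rng.choice(len(noisy), size=total_errors, replace=False)
    # Adding an offset in [1, num_classes - 1] modulo num_classes never lands
    # on the original class and is uniform over the remaining classes
    offsets = rng.integers(1, num_classes, size=total_errors)
    noisy[to_change] = (noisy[to_change] + offsets) % num_classes
    return noisy

# Placeholder labels standing in for the real training labels
clean = np.arange(1000) % 10
noisy = add_label_noise(clean, error_rate=0.1)
print((noisy != clean).mean())  # fraction of labels that were changed
```

Only the training labels should be corrupted this way; the test labels stay clean so that test accuracy still measures real performance.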

---

## 1% Noise Loss

---

## 1% Noise Accuracy

---

## Discussion

* A small amount of noise raises training error rate
* But, with a good model, the test error rate is about the same
  * The noise is unbiased
  * For every bad label there are 99 good labels
  * If the model is generalizing, it will still make correct classifications
* So is noise okay then?

---

## 10% Noise Loss

---

## 10% Noise Accuracy

---

## Discussion

* At some point, noise does become a problem
  * Maybe the DNN is finding more "interesting" features in the error set to memorize
  * Or maybe the decision boundary becomes too messy
* Of course, if it isn't too bad, and isn't biased, one solution is to ignore it
  * Just be sure to spend every effort to clean up your test set!

---

## Other Solutions

* But what if ignoring it won't work?
  * It's biased, or we don't have any proper labels
* Earlier, we had three groups of solutions
  * Automatically generate labels, decreasing cost
  * Use other labels and apply the trained networks to our desired task
  * Train without labels

---

## Automated Label Generation

* Let's say you had to recreate digits today
  * What would you do?
* One solution would be to find digit data where you knew the answer
  * Forms where people filled in their birth dates, SSNs, etc
  * They are unlikely to be wrong, and you should be able to find a large corpus

---

## Real World

* Autonomous driving has multiple solutions
* If you want road and lane boundaries, you can get map data and fit it to images
  * You have lots of frames, so you can filter it over long time periods to get good matches
  * Within cm; not bad
* Or you could skip intermediate labels and go straight to driving commands

---

## End-To-End Driving

* Images in, driving signal out
* This was the idea behind DAVE
  * DARPA Autonomous VEhicle
  * Project from 2003 to 2004
  * https://cs.nyu.edu/~yann/research/dave/
* Eventually led to this [NVIDIA project](https://developer.nvidia.com/blog/researching-and-developing-an-autonomous-vehicle-lane-following-system/)
* Labelling is as simple as recording what the human did and marking segments to use or drop

<!--
Embedding video doesn't work!

<video data-autoplay src="figures/DAVE_2004_backyward.mkv"></video>
[](./figures/DAVE_2004_backyard.mkv)
-->

---

## Not Possible?

* What if you cannot automate labels or skip the labelling step?
* Two other options:
  * Use other labels and apply the trained networks to our desired task
  * Train without labels

---

## Other Labels?

* Let's say that we don't have digit labels, but do have characters
  * Perhaps we got them from school children who are being forced to copy something over and over
    * So we could use some automated labelling!
  * Not quite the same as digits, but should be similar
* We can use the EMNIST Letters dataset
  * 145,600 characters
  * https://www.nist.gov/itl/products-and-services/emnist-dataset

---

## Procedure

* Train for 20 epochs on the letters data
  * It has more than twice as many labels
  * Train and test accuracy reach 95%/94%
    * But who cares!
* Off with its head!

---

## Feature Extractor

* The original DNN had 26 classes, but we terminate it after the last convolution
* Then we assume that we have some good training data
  * If we take the DNN trained with all of the letters and train an SVM with all 60k digits, we get an accuracy of 98.73%
* But if we had all of those labels, then we could have trained the DNN
  * Let's see how we do with less data
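
A sketch of the "off with its head" step, assuming the network is a `torch.nn.Sequential` whose last two layers are the linear classifier. The tiny `net` below is a hypothetical stand-in with made-up sizes, not the model used for the results. The point is only that the truncated network outputs a feature vector per image, which another classifier (for example an SVM) can then be trained on.

```python
import torch

# Stand-in for a network pretrained on the 26 letter classes (hypothetical sizes)
net = torch.nn.Sequential(
    torch.nn.Conv2d(1, 8, kernel_size=5),
    torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool2d((1, 1)),
    torch.nn.Flatten(),
    torch.nn.Linear(8, 84),
    torch.nn.Linear(84, 26),
)

# Drop the two linear layers; everything up to Flatten is the feature extractor
feature_extractor = net[:-2]
feature_extractor.eval()

images = torch.randn(16, 1, 32, 32)  # placeholder batch of 32x32 images
with torch.no_grad():
    features = feature_extractor(images).numpy()

print(features.shape)  # one 8-dimensional feature vector per image
```

The `features` array, paired with the digit labels for those images, would be the training input for the SVM.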

---

## SVM with Features

---

## SVM with Features

* The SVM with pretrained features was better until around 1000 training samples
  * No effort was spent tuning SVM parameters
* So we pretrained with 140K samples off-target, and that was as good as 1K real samples
  * So is that worth it?
* Yes!

---

## Cost of Labels

* Let me reiterate that the cost of labels is high!
  * You can find a dataset with 1 million labels *for free* and download it in a day!
* How much can you collect and label *correctly* in a day?
  * 100s of unique images? Maybe?
    * It depends upon the classes, doesn't it?
    * If you need pictures of something rare, good luck!

---

## Unsupervised Learning

* Unsupervised learning will look like pretraining
  * but we use a training method that doesn't require labels
    * Mask out part of the image and train a DNN to restore it, for example
* There are some more approaches (GANs, VAEs, etc)
* But this is a big, difficult, rapidly moving target
* So let's talk about image masking and ResNext next class
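
To make the masking idea concrete, here is a minimal NumPy sketch of how one training pair could be built: the input is the image with a square patch zeroed out, and the target is the original image. The helper names `mask_patch` and `masked_mse`, the patch size, and zero as the fill value are all assumptions for illustration; real masked-image methods make these choices differently.

```python
import numpy as np

def mask_patch(image, patch=8, rng=None):
    """Zero out one random patch-by-patch square; return (masked image, boolean mask)."""
    rng = np.random.default_rng() if rng is None else rng
    rows, cols = image.shape
    top = rng.integers(0, rows - patch + 1)
    left = rng.integers(0, cols - patch + 1)
    mask = np.zeros(image.shape, dtype=bool)
    mask[top:top + patch, left:left + patch] = True
    masked = image.copy()
    masked[mask] = 0.0
    return masked, mask

def masked_mse(prediction, target, mask):
    """Reconstruction error measured only on the pixels that were hidden."""
    return float(np.mean((prediction[mask] - target[mask]) ** 2))

rng = np.random.default_rng(0)
image = rng.random((28, 28))  # placeholder for an unlabeled image
masked, mask = mask_patch(image, patch=8, rng=rng)

# A network would be trained to map `masked` back to `image`; no label is needed.
# Here the "prediction" is just the masked input, as a baseline error.
print(mask.sum(), masked_mse(masked, image, mask))
```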

---

## Bonus Topic!

* Let's return to the accuracy vs samples graph
* The ResNet quickly catches up to pretrained features + SVM
  * On this data
  * Would it always?
  * How could we tell?
* By examining [learning curves](https://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html)!


---

## Compare Accuracies

* You can fit a sigmoid (or erf, or any s-shaped curve) to the learning curve
* For this example, I'll fit it with the first 4, 5, or 6 points
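
One way such a fit could be done, shown as a sketch rather than the method behind the plots. If the curve is assumed to be a logistic function of log10(samples) that saturates at 100% accuracy, then the logit of the accuracy is a straight line in log10(samples), and a least-squares line fit recovers the two parameters. The function names and the accuracy values below are made-up placeholders, not measurements from this lecture.

```python
import numpy as np

def fit_logistic(samples, accuracy):
    """Fit accuracy = 1 / (1 + exp(-(a * log10(samples) + b))) via a line fit on the logit."""
    x = np.log10(np.asarray(samples, dtype=float))
    acc = np.asarray(accuracy, dtype=float)
    logit = np.log(acc / (1.0 - acc))
    a, b = np.polyfit(x, logit, deg=1)
    return a, b

def predict_accuracy(samples, a, b):
    """Evaluate the fitted logistic curve at the given numbers of training samples."""
    x = np.log10(np.asarray(samples, dtype=float))
    return 1.0 / (1.0 + np.exp(-(a * x + b)))

# Hypothetical learning-curve points: (training samples, test accuracy)
samples = [100, 200, 500, 1000, 2000, 5000]
accuracy = [0.60, 0.72, 0.84, 0.90, 0.94, 0.97]

# Fit with only the first 4 points, then extrapolate to larger training sets
a, b = fit_logistic(samples[:4], accuracy[:4])
for n in (5000, 60000):
    print(n, round(float(predict_accuracy(n, a, b)), 3))
```

Fixing the ceiling at 100% keeps the fit to two parameters, which matters when only four points are available. A curve with a free asymptote, or an erf, would need a nonlinear fit instead.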

---

## First 4

---

## First 5

---

## First 6

* Pretty pointless by now; usually most useful near the inflection point
* More meaningful when learning is harder

---

## Conclusions

* We can make an educated guess about accuracy vs. number of training samples
  * Even with a much smaller number of training samples
* If we cannot afford to get that many labels, then we need to look elsewhere

-v-

## Code Used for Demos

```python
import argparse
import gzip
import os
import numpy as np
import random
from PIL import Image
import torch
import torchvision.transforms.v2 as transforms
import sklearn
# This is for saving the trained SVMs. We could use onnx for SVMs and DNNs, but that is slightly more work.
import pickle

def load_mnist_ubyte(image_path):
    """
    Loads MNIST images from the raw ubyte files.

    Args:
        image_path (str): Path to the image file (e.g., 'train-images-idx3-ubyte.gz').

    Returns:
        images (numpy.ndarray)
    """
    with gzip.open(image_path, 'rb') as f:
        # Read the header: magic number (4 bytes) + num images (4 bytes) +
        # num rows (4 bytes) + num cols (4 bytes) = 16 bytes.

        # Read the entire file content into a buffer
        image_data = f.read()

        # The image data starts at byte 16.
        images = np.frombuffer(image_data, dtype=np.uint8, offset=16)

        # We need the dimensions to reshape. We can extract them from the header bytes,
        # which are big-endian ('>'). We use struct.unpack if we were being strict,
        # but here we'll assume the standard MNIST format and calculate the dimensions
        # for a clean numpy approach.

        # The number of images is in the 5th to 8th byte (4 bytes)
        num_images = np.frombuffer(image_data, dtype='>i4', offset=4, count=1)[0]
        # Rows and columns are 28x28 for MNIST, stored in bytes 9-12 and 13-16.
        # num_rows = np.frombuffer(image_data, dtype='>i4', offset=8, count=1)[0]
        # num_cols = np.frombuffer(image_data, dtype='>i4', offset=12, count=1)[0]
        num_rows = 28
        num_cols = 28

        # Reshape the 1D array into a 3D array (num_images, rows, columns)
        images = images.reshape(num_images, num_rows, num_cols)
    return images

def load_mnist_labels(label_path):
    """
    Loads MNIST labels from the raw ubyte files.

    Args:
        label_path (str): Path to the label file (e.g., 'train-labels-idx1-ubyte.gz').

    Returns:
        labels (numpy.ndarray)
    """
    with gzip.open(label_path, 'rb') as f:
        # Read the header: magic number (4 bytes) + num items (4 bytes) = 8 bytes.
        # Skip these 8 bytes.

        # Read the entire file content into a buffer
        label_data = f.read()

        # The label data starts at byte 8. The data type is unsigned byte ('B' or np.uint8).
        # Labels are a 1D vector.
        labels = np.frombuffer(label_data, dtype=np.uint8, offset=8)

    return labels

class SquashingFunction(torch.nn.Module):
    def __init__(self):
        super(SquashingFunction, self).__init__()
        self.squash = torch.nn.Tanh()
        self.const = 1.7159

    def forward(self, x):
        return self.const * self.squash(x)

class ResidualBlock(torch.nn.Module):
    def __init__(self, in_channels, out_channels, nonlinearity=torch.nn.ReLU, stride=1):
        super(ResidualBlock, self).__init__()
        self.residual = torch.nn.Sequential(
            torch.nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=1, padding=1, bias=False),
            torch.nn.BatchNorm2d(out_channels),
            nonlinearity(),
            torch.nn.Conv2d(out_channels, out_channels, kernel_size=3, stride=stride, padding=1, bias=False),
            torch.nn.BatchNorm2d(out_channels),
        )
        if stride != 1 or in_channels != out_channels:
            self.shortcut = torch.nn.Sequential(
                torch.nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=stride, bias=False),
                torch.nn.BatchNorm2d(out_channels))
        else:
            self.shortcut = torch.nn.Sequential()
        # The nonlinearity after summing the residual and shortcut
        self.nonlinearity = nonlinearity()

    def forward(self, x):
        out = self.residual(x)
        x = self.shortcut(x)
        return self.nonlinearity(out + x)

class ResNet(torch.nn.Module):
    """A small ResNet with 9 residual blocks."""

    def __init__(self, nonlinearity = torch.nn.ReLU, classes=10):
        super(ResNet, self).__init__()
        self.net = torch.nn.Sequential(
                # 5x5 convolution with 16 output feature maps
                torch.nn.Conv2d(1, 16, kernel_size=5),
                torch.nn.BatchNorm2d(16),
                nonlinearity(),
                ## Now we are working with 28x28 feature maps
                ## 3 blocks per downscale, to 14x14, 7x7, and 4x4
                ResidualBlock(16, 16),
                ResidualBlock(16, 16),
                ResidualBlock(16, 32, stride=2),
                ResidualBlock(32, 32),
                ResidualBlock(32, 32),
                ResidualBlock(32, 64, stride=2),
                ResidualBlock(64, 64),
                ResidualBlock(64, 64),
                ResidualBlock(64, 128, stride=2),
                # A single average pool to reduce all feature channels to 1x1
                torch.nn.AdaptiveAvgPool2d((1, 1)),
                torch.nn.Flatten(),
                torch.nn.Linear(128, 84),
                # We are not going to try to recreate the original exemplar-based function in LeNet5
                #euclidean_rbf(84, 12)
                torch.nn.Linear(84, classes),
                )
        self.decision = torch.nn.Softmax(dim=1)

        torch.nn.init.uniform_(self.net[0].weight.data, a=-1, b=1)

    def features(self, x):
        # Go through the first 14 layers to extract a feature vector of size 128
        for i in range(14):
            x = self.net[i](x)
        return x

    def forward(self, x):
        """Forward through the network."""
        y_hat = self.decision(self.net(x))
        return y_hat

class LeNet5(torch.nn.Module):
    """A mostly faithful recreation of LeNet 5."""

    def __init__(self, nonlinearity = SquashingFunction, classes=10):
        super(LeNet5, self).__init__()
        self.net = torch.nn.Sequential(
                # 5x5 convolution with 6 output feature maps
                torch.nn.Conv2d(1, 6, 5),
                # 2x2 subsampling learned bias and weight, called S2 in the paper.
                # We'll use an average pool and then a 1x1 conv with 6 groups to emulate that.
                torch.nn.AvgPool2d(kernel_size=2, stride=2),
                torch.nn.Conv2d(6, 6, kernel_size=1, groups=6),
                nonlinearity(),
                # 5x5 convolution with 16 output feature maps of size 10x10
                torch.nn.Conv2d(6, 16, kernel_size=5),
                # This again, emulating layer S4 from the paper.
                torch.nn.AvgPool2d(kernel_size=2, stride=2),
                torch.nn.Conv2d(16, 16, kernel_size=1, groups=16),
                nonlinearity(),
                # The final convolution reduces features to 1x1
                torch.nn.Conv2d(16, 120, kernel_size=5, stride=1),
                torch.nn.Flatten(),
                torch.nn.Linear(120, 84),
                # We are not going to try to recreate the original exemplar-based function in LeNet5
                #euclidean_rbf(84, 12)
                torch.nn.Linear(84, classes),
                )
        self.decision = torch.nn.Softmax(dim=1)

        torch.nn.init.uniform_(self.net[0].weight.data, a=-1, b=1)

    def features(self, x):
        # Go through the first 10 layers to extract a feature vector of size 120
        for i in range(10):
            x = self.net[i](x)
        return x

    def forward(self, x):
        """Forward through the network."""
        y_hat = self.decision(self.net(x))
        return y_hat

class Linear(torch.nn.Module):
    """A linear neural network."""

    def __init__(self, nonlinearity = torch.nn.ReLU, classes=10):
        super(Linear, self).__init__()
        self.net = torch.nn.Sequential(
                torch.nn.Flatten(),
                torch.nn.Linear(1024, 2048),
                nonlinearity(),
                torch.nn.Linear(2048, 120),
                nonlinearity(),
                torch.nn.Linear(120, 84),
                torch.nn.Linear(84, classes)
                )
        self.decision = torch.nn.Softmax(dim=1)

        torch.nn.init.uniform_(self.net[1].weight.data, a=-1, b=1)

    def forward(self, x):
        """Forward through the network."""
        y_hat = self.decision(self.net(x))
        return y_hat

def preprocess(X_train, order, device):
    # Normalize and then pad to 32x32.
    # Images are 0 to 1.
    # Rescale to the [-0.1, 1.175] range used in the LeNet paper
    # (background -0.1, foreground 1.175): multiply by 1.275 to expand the
    # range, then subtract from 1.175. This maps 0 to 1.175 and 1 to -0.1.
    preprocessed = torch.tensor(1.175 - (1.275*X_train[order])).float()

    # Pad 2 on every side, changing the 28x28 to 32x32
    preprocessed = torch.nn.functional.pad(preprocessed, pad=(2,2,2,2))
    # Add a channel dimension
    return preprocessed.reshape((-1, 1, 32, 32)).to(device)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--train",
        required=True,
        help="gzip file with mnist data.")
    parser.add_argument(
        "--test",
        required=True,
        help="gzip file with mnist data.")
    parser.add_argument(
        "--train_labels",
        required=True,
        help="gzip file with mnist labels.")
    parser.add_argument(
        "--test_labels",
        required=True,
        help="gzip file with mnist labels.")
    parser.add_argument(
        "--epochs",
        required=False,
        type=int,
        default=5,
        help="Number of epochs to train.")
    parser.add_argument(
        "--train_samples",
        required=False,
        type=int,
        default=60000,
        help="Number of samples to use for training.")
    parser.add_argument(
        "--save_mismatch",
        required=False,
        type=int,
        default=0,
        help="The number of mismatches to save.")
    parser.add_argument(
        "--batch_size",
        required=False,
        type=int,
        default=32,
        help="The batch size.")
    parser.add_argument(
        "--error_rate",
        required=False,
        type=float,
        default=0.0,
        help="The training label error rate.")
    parser.add_argument(
        "--model",
        required=False,
        default="lenet",
        type=str,
        help="Model type")
    parser.add_argument(
        "--save",
        required=False,
        default=None,
        type=str,
        help="Path to save the trained model")
    parser.add_argument(
        "--load",
        required=False,
        default=None,
        type=str,
        help="Path to load the trained model")
    parser.add_argument(
        "--random_seed",
        required=False,
        type=int,
        default=112,
        help="The random seed.")
    ####
    # These are the SVM Options
    parser.add_argument(
        "--use_svm",
        default=False,
        action='store_true',
        help="Use an SVM for final classification after training or model loading.")
    parser.add_argument(
        "--kernel",
        required=False,
        type=str,
        default='rbf',
        help="linear or rbf or poly")
    parser.add_argument(
        "--C",
        required=False,
        type=int,
        default=None,
        help="C value for svm soft margin. Defaults to 1 within scikit's implementation")
    parser.add_argument(
        "--gamma",
        required=False,
        type=float,
        default=0.1,
        help="Gamma for the rbf kernel")
    parser.add_argument(
        "--degree",
        required=False,
        type=int,
        default=None,
        help="Degree for the polynomial kernel (try 2)")
    parser.add_argument(
        "--coef0",
        required=False,
        type=float,
        default=None,
        help="Offset for the polynomial kernel (try 1)")
    parser.add_argument(
        "--save_svm",
        required=False,
        default=None,
        type=str,
        help="Path to save pickle of trained scikit svm.")
    parser.add_argument(
        "--load_svm",
        required=False,
        default=None,
        type=str,
        help="Path to load the pickle of the trained scikit svm.")
    parser.add_argument(
        "--old_dnn_classes",
        required=False,
        default=None,
        type=int,
        help="The number of classes in the DNN being used as a feature extractor for the SVM. Provide if different from the current dataset.")

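    args = parser.parse_args()

    # Seed every random number generator used below (random, numpy, torch).
    # np.random.default_rng() only returns a new generator, so it would not
    # affect the later calls to np.random.shuffle or random.choice.
    random.seed(args.random_seed)
    np.random.seed(args.random_seed)
    torch.manual_seed(args.random_seed)

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(f"Using device: {device}")

    print("Loading data")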
    X_train_digits = (load_mnist_ubyte(args.train)/255)[:args.train_samples]
    X_test_digits = load_mnist_ubyte(args.test)/255
    Y_train_digits = load_mnist_labels(args.train_labels)[:args.train_samples]
    Y_test_digits = load_mnist_labels(args.test_labels)
    unique_classes = np.unique(Y_test_digits)
    total_classes = len(unique_classes)
    # Make sure the first class is 0
    if min(unique_classes) != 0:
        Y_train_digits = Y_train_digits.copy() - min(unique_classes)
        Y_test_digits = Y_test_digits.copy() - min(unique_classes)

    ## Save some digits for the homework
    #for i in range (20):
    #    Image.fromarray((255*X_test_digits[i]).reshape((28, 28)).astype(np.uint8)).save(f"example_test_digit_{i}.png")
    #print(f"Example classes test are {Y_test_digits[:20]}")
    ## Save some digits for the homework
    #for i in range (20):
    #    Image.fromarray((255*X_train_digits[i]).reshape((28, 28)).astype(np.uint8)).save(f"example_train_digit_{i}.png")
    #print(f"Example classes are {Y_train_digits[:20]}")

    # Create the model
    if args.old_dnn_classes:
        total_classes = args.old_dnn_classes
    if args.model == "lenet":
        model = LeNet5(classes=total_classes)
    elif args.model == "lenet_relu":
        model = LeNet5(nonlinearity=torch.nn.ReLU, classes=total_classes)
    elif args.model == "resnet":
        model = ResNet(classes=total_classes)
    elif args.model == "linear":
        model = Linear(classes=total_classes)

    # Don't shuffle the test data, but otherwise treat it the same as the training data.
    X_test = preprocess(X_test_digits, np.arange(X_test_digits.shape[0]), device)
    Y_test = torch.tensor(Y_test_digits).long().to(device)
    test_batch_size = 1000

    if args.error_rate > 0.0:
        # Insert errors into the training data at the given error rate
        total_errors = int(args.error_rate * len(Y_train_digits))
        # Sample without replacement so that exactly total_errors distinct labels change
        to_change = random.sample(range(len(Y_train_digits)), k=total_errors)
        possible_labels = []
        for original in np.arange(10):
            # The possible wrong labels are every value but the correct one
            possible_labels.append(list(np.arange(original)) + list(np.arange(original+1, 10)))
        # This is read only, so make a writeable copy
        Y_train_digits = Y_train_digits.copy()
        for idx in to_change:
            original = Y_train_digits[idx]
            Y_train_digits[idx] = random.choice(possible_labels[original])

    # Shuffle and preprocess the training data
    order = np.arange(X_train_digits.shape[0])
    np.random.shuffle(order)

    X_train = preprocess(X_train_digits, order, device)
    Y_train = torch.tensor(Y_train_digits[order]).long().to(device)

    # Are we doing training, or just reloading?
    if args.load is not None:
        model.load_state_dict(torch.load(args.load, map_location=torch.device("cpu"), weights_only=True))
        model.to(device)
    else:
        model.to(device)

        # Authors used 0.0005 for two epochs, 0.0002 for the next 2, 0.0001 for the
        # next 3, 0.00005 for the next 4, and 0.00001 after.
        optimizer = torch.optim.Adam(model.parameters(), lr=0.0005)
        lr_steps = [2, 5, 9, 13]
        lr_schedule = [0.0002, 0.0001, 0.00005, 0.00001]

        criterion = torch.nn.CrossEntropyLoss()

        # See how many batches we'll use per epoch
        batches = int(np.ceil(X_train_digits.shape[0]/float(args.batch_size)))
        # We could just do this in one step, but let's assume that memory is finite
        test_batches = int(np.ceil(X_test_digits.shape[0]/float(test_batch_size)))

        print(f"Training on {X_train_digits.shape[0]} examples over {batches} batches")

        for epoch in range(args.epochs):

            # Update the learning rate (as described in the original paper)
            if epoch in lr_steps:
                lr_idx = lr_steps.index(epoch)
                # The learning rate lives in the optimizer's param_groups;
                # assigning to optimizer.lr would have no effect.
                for param_group in optimizer.param_groups:
                    param_group['lr'] = lr_schedule[lr_idx]
            total_loss = 0.0
            model.train()
            for batch in range(batches):
                begin = batch*args.batch_size
                end = (batch+1)*args.batch_size

                X_batch = X_train[begin:end]
                Y_batch = Y_train[begin:end]

                # Zero gradients before gradient calculation
                optimizer.zero_grad()

                y_hat = model(X_batch)
                loss = criterion(y_hat, Y_batch)
                total_loss += loss.item() * X_batch.size(0)

                # Gradient calculation
                loss.backward()
                # Update weights
                optimizer.step()

            epoch_loss = total_loss / X_train.size(0)
            print(f"{args.train_samples} Epoch {epoch} training loss {epoch_loss}")

            # Evaluation
            # Don't calculate gradients during these steps
            model.eval()
            with torch.no_grad():
                total_loss = 0.0
                for batch in range(test_batches):
                    begin = batch*test_batch_size
                    end = (batch+1)*test_batch_size

                    X_batch = X_test[begin:end]
                    Y_batch = Y_test[begin:end]

                    y_hat = model(X_batch)
                    loss = criterion(y_hat, Y_batch)
                    total_loss += loss.item() * X_batch.size(0)
                epoch_loss = total_loss / X_test.size(0)
                print(f"{args.train_samples} Epoch {epoch} testing loss {epoch_loss}")
                ## Accuracy values
                # We can't just run over everything, that takes too much memory. Chop it up.
                matches = 0
                mismatches = 0
                for testbatch in range(int(np.ceil(X_train_digits.shape[0]/float(test_batch_size)))):
                    begin = testbatch*test_batch_size
                    end = (testbatch+1)*test_batch_size
                    y_hat = model(X_train[begin:end])
                    classes = torch.argmax(y_hat, dim=1)
                    matches += torch.sum(classes == Y_train[begin:end])
                    mismatches += torch.sum(classes != Y_train[begin:end])
                train_accuracy = matches/X_train.size(0)
                matches = 0
                mismatches = 0
                for testbatch in range(int(np.ceil(X_test_digits.shape[0]/float(test_batch_size)))):
                    begin = testbatch*test_batch_size
                    end = (testbatch+1)*test_batch_size
                    y_hat = model(X_test[begin:end])
                    classes = torch.argmax(y_hat, dim=1)
                    matches += torch.sum(classes == Y_test[begin:end])
                    mismatches += torch.sum(classes != Y_test[begin:end])
                test_accuracy = matches/X_test.size(0)
                print(f"{args.train_samples} Epoch {epoch} accuracies are {train_accuracy} {test_accuracy}")

    if args.save is not None:
        torch.save(model.state_dict(), args.save)

    # Final evaluation
    model.eval()
    with torch.no_grad():
        if args.use_svm:
            if args.load_svm:
                with open(args.load_svm, 'rb') as infile:
                    svm = pickle.load(infile)
            else:
                svm_args = {}
                for arg in ['kernel', 'gamma', 'degree', 'coef0', 'C']:
                    if getattr(args, arg) is not None:
                        svm_args[arg] = getattr(args, arg)
                svm = sklearn.svm.SVC(**svm_args)
                # Create feature vectors for training
                print("Building SVM inputs.")
                features = None
                for testbatch in range(int(np.ceil(X_train_digits.shape[0]/float(test_batch_size)))):
                    begin = testbatch*test_batch_size
                    end = (testbatch+1)*test_batch_size
                    vectors = model.features(X_train[begin:end]).cpu().numpy()
                    if features is None:
                        features = vectors
                    else:
                        features = np.concatenate((features, vectors))

                print("Training the SVM.")
                svm.fit(features, Y_train.cpu().numpy())

            if args.save_svm:
                with open(args.save_svm, 'wb') as out:
                    pickle.dump(svm, out)

            print("Building test inputs.")
            features = None
            for testbatch in range(int(np.ceil(X_test_digits.shape[0]/float(test_batch_size)))):
                begin = testbatch*test_batch_size
                end = (testbatch+1)*test_batch_size
                vectors = model.features(X_test[begin:end]).cpu().numpy()
                if features is None:
                    features = vectors
                else:
                    features = np.concatenate((features, vectors))

            print("Inference with the SVM.")
            results = svm.predict(features)
            Y_test = Y_test.cpu().numpy()
            matches = (results == Y_test)
            sum_matches = np.sum(matches)
            test_accuracy = sum_matches / len(Y_test)
        else:
            # DNN classification
            matches = 0
            mismatches = 0
            for testbatch in range(int(np.ceil(X_test_digits.shape[0]/float(test_batch_size)))):
                begin = testbatch*test_batch_size
                end = (testbatch+1)*test_batch_size
                y_hat = model(X_test[begin:end])
                classes = torch.argmax(y_hat, dim=1)
                matches += torch.sum(classes == Y_test[begin:end])
                mismatches += torch.sum(classes != Y_test[begin:end])
            sum_matches = matches
            test_accuracy = sum_matches/X_test.size(0)

    print(f"{args.train_samples} Final accuracy {sum_matches}/{X_test.size(0)} ({test_accuracy})")
```
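
The training and evaluation loops above multiply each batch loss by the batch size before dividing by the dataset size. The sketch below shows why, using plain Python lists in place of tensors. The helper names `batch_bounds` and `weighted_mean_loss` and the toy loss values are illustrative assumptions, not part of the lecture script.

```python
import math


def batch_bounds(num_examples, batch_size):
    """Yield (begin, end) index pairs that cover num_examples in order.

    Mirrors the ceil-based batch count in the listing; the last batch is
    clamped, so it may be shorter than batch_size.
    """
    num_batches = math.ceil(num_examples / batch_size)
    for batch in range(num_batches):
        begin = batch * batch_size
        end = min((batch + 1) * batch_size, num_examples)
        yield begin, end


def weighted_mean_loss(per_example_losses, batch_size):
    """Average per-batch mean losses, weighting each by its batch size."""
    total = 0.0
    for begin, end in batch_bounds(len(per_example_losses), batch_size):
        batch = per_example_losses[begin:end]
        batch_mean = sum(batch) / len(batch)  # what a mean-reduced loss reports
        total += batch_mean * len(batch)      # undo the mean before summing
    return total / len(per_example_losses)


# Toy per-example losses (made up for illustration): 5 examples, batch size 2
losses = [1.0, 1.0, 1.0, 1.0, 6.0]
bounds = list(batch_bounds(len(losses), 2))
weighted = weighted_mean_loss(losses, 2)
# Naive alternative: average the batch means without weighting
naive = sum(sum(losses[b:e]) / (e - b) for b, e in bounds) / len(bounds)
print(bounds, weighted, naive)
```

When the final batch is short, the unweighted average over-counts it: here the naive value is 8/3 while the true mean is 2.0. `torch.nn.CrossEntropyLoss` averages over the batch by default, which is why the listing scales `loss.item()` by `X_batch.size(0)` before accumulating.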

<!--

Labels are the problem
  * Too expensive
  * Always wrong

With LeNet, big progress wasn't made until they had a digits dataset.

Digits was highly curated, with possibly no labelling errors in the 60k dataset.

ImageNet is not that. It is likely that no dataset of ImageNet's size will ever be made without labelling errors.

Even if you collect your own data, as carefully as possible, you will find errors in your data.
  (Some AV examples;
   data: lens caps, camera configuration wrong, vehicle sensors not working)

Many problems come from the label classes themselves; toilet paper and toilet seats are too correlated
From AV: forgot to label every frame, label instructions changed, label instructions impossible (is a garbage bin with wheels a carriage? is a knocked down sign still a sign?)

Solutions:
  autolabelling; with maps, for example; new problems when maps don't align, but easier to check
  end-to-end training; labels are the behaviors
  transfer learning: learn with the labels you've got
  unsupervised learning: learn without labels, then use a tiny set of labels to specialize

autolabelling and end-to-end are engineering solutions
  and work really well

figures/DAVE_2004_backyard.mkv
figures/DAVE_2003_leg_dodging.wmv

Unsupervised learning is the ML solution
  * Once again, a call back to image reconstruction and RBMs

Plot some graphs showing how we should think about data
  Train vs test errors as dataset size increases
    allows us to guess if we will ever converge
  Train and test loss over epochs (allows us to see if our model is inappropriately sized or our regularizers are too weak)

Show how adding in errors changes the figures

Focus on the graphs of results over increasing data
  Need more data to get a better model, but we don't have enough
  Augmentations are sometimes nonsense
    What then? Unsupervised learning.

Unsupervised learning:
  Objective function doesn't require a label

Transfer learning:
  Use a different objective function to learn our thing
    Example: train on mnist characters, then use an SVM to classify
    Make the same accuracy vs training samples graph, show how SVM compares

Command to run everything:
python3 24_mnist_unsupervised.py --train /slowdata2/MNIST/raw/train-images-idx3-ubyte.gz --train_labels /slowdata2/MNIST/raw/train-labels-idx1-ubyte.gz  --test /slowdata2/MNIST/raw/t10k-images-idx3-ubyte.gz --test_labels /slowdata2/MNIST/raw/t10k-labels-idx1-ubyte.gz  --save resnet_digits_15epochs.pyt --model resnet --epochs 15 --use_svm --kernel rbf --gamma 0.01 --C 100 --save_svm svm_resnet_digits_rbf_gamma_0_01_C_100.pkl

-->