<!--
Abstract:

CS 461
Introduction to Deep Learning
Lecture 11
-->

# CS 461 - Lecture 20

## Machine Learning Principles

### Intro to Neural Networks

Bernhard Firner

2025-11-12

---

## Example Plotting Code

```python
import numpy as np
from sklearn import neural_network
from sklearn.inspection import DecisionBoundaryDisplay

import matplotlib.pyplot as plt
import matplotlib as mpl

# For repeatability
np.random.seed(100)

y = np.array([1]*5 + [-1]*5)
y_zeros = (y+1)/2

# Something similar to two moons
# Arc is class 0
R = 10
random_angles = np.random.uniform(np.pi/10, 7*np.pi/10, 20)
moon = np.stack((R*np.cos(random_angles), R*np.sin(random_angles)), axis=1) + np.stack((np.random.normal(0, 1, 20), np.random.normal(0, 1, 20)), axis=1)
sun = np.stack((np.random.uniform(0, 3, 20), np.random.normal(2, 2, 20)), axis=1)

X = np.concatenate((moon, sun))
y = np.array([0]*20 + [1]*20)

def fitWithLayers(X, y, sizes, solver, activation='logistic'):

    clf = neural_network.MLPClassifier(hidden_layer_sizes=sizes, solver=solver, activation=activation, max_iter=5000)
    clf.fit(X, y)
    print(f"ITER WAS {clf.n_iter_}")

    plt.scatter(X[:, 0], X[:, 1], c=y, s=30, cmap=plt.cm.Paired)

    # plot the decision function
    ax = plt.gca()

    # Plot the decision boundary
    step = 0.2
    xx, yy = np.meshgrid(np.arange(X[:, 0].min()-0.5, X[:, 0].max()+0.5, step), np.arange(X[:, 1].min()-0.5, X[:, 1].max()+0.5, step))
    if hasattr(clf, "decision_function"):
        Z = clf.decision_function(np.column_stack([xx.ravel(), yy.ravel()]))
    else:
        Z = clf.predict_proba(np.column_stack([xx.ravel(), yy.ravel()]))[:, 1]
    Z = Z.reshape(xx.shape)
    # Colormap with Z
    # Use one of the diverging maps (https://matplotlib.org/stable/users/explain/colors/colormaps.html)
    ax.contour(xx, yy, Z, cmap=mpl.colormaps['coolwarm'], alpha=0.8)

    nonlinearity = activation
    if nonlinearity == "logistic":
        nonlinearity = "sigmoid"

    size_string = "_".join([str(val) for val in sizes])
    title_str = f"{size_string} neurons, {solver} solver, {nonlinearity}"
    plt.title(title_str, fontsize=20)
    fname_str = f"{size_string}_neurons_{solver}_solver_{nonlinearity}"
    plt.savefig(f"../figures/19_multi_layer_nn_{fname_str}.png", dpi=2*96)
    plt.show()

fitWithLayers(X, y, (10, 10), 'sgd')
fitWithLayers(X, y, (10, 10, 10), 'sgd')
fitWithLayers(X, y, (10, 10, 10, 10), 'sgd')
fitWithLayers(X, y, (10, 10, 10, 10, 10), 'sgd')
fitWithLayers(X, y, (100, 100, 100), 'sgd')
fitWithLayers(X, y, (1000, 1000), 'sgd')
fitWithLayers(X, y, (10, 10), 'adam')
fitWithLayers(X, y, (10, 10, 10), 'adam')
fitWithLayers(X, y, (10, 10, 10, 10), 'adam')
fitWithLayers(X, y, (10, 10, 10, 10, 10), 'adam')
fitWithLayers(X, y, (100, 100, 100), 'adam')
fitWithLayers(X, y, (1000, 1000), 'adam')
fitWithLayers(X, y, (10, 10), 'sgd', 'tanh')
fitWithLayers(X, y, (10, 10, 10), 'sgd', 'tanh')
fitWithLayers(X, y, (10, 10, 10, 10), 'sgd', 'tanh')
fitWithLayers(X, y, (10, 10), 'sgd', 'relu')
fitWithLayers(X, y, (10, 10, 10), 'sgd', 'relu')
fitWithLayers(X, y, (10, 10, 10, 10), 'sgd', 'relu')
```

---

## Neural Networks

* Important concepts from last time
  * Universal function approximation
  * Difference between "can approximate" and "will approximate"
* The explosion of hyperparameters

---

## Hyperparameters (so far!)

* What nonlinearity should we use?
* What learning rate? Should that change over time?
* What learning algorithm? SGD? Adam?
* How deep? How wide?

---

## References

* FYI and enjoyment, not assigned reading
* [Gradient-Based Learning Applied to Document Recognition](https://cs.nyu.edu/~yann/2010f-G22-2565-001/diglib/lecun-98.pdf)
  * Great example of how much work goes into making a product
* [Handwritten Digit Recognition with a Back-Propagation Network](https://proceedings.neurips.cc/paper/1989/file/53c3bce66e43be4f209556518c2fcb54-Paper.pdf)
  * If that 45 page paper was too much for you
* Two papers on how people were thinking about models and data
  * [Limits on Learning Machine Accuracy Imposed by Data Quality](https://proceedings.neurips.cc/paper_files/paper/1994/file/1e056d2b0ebd5c878c550da6ac5d3724-Paper.pdf)
  * [Structural Risk Minimization for Character Recognition](https://proceedings.neurips.cc/paper/1991/file/10a7cdd970fe135cf4f7bb55c0e3b59f-Paper.pdf)

---

## Prolegomena

* There are some key concepts from the LeNet work
  * System is end-to-end trainable
  * Weight sharing via convolutions
  * Interpretability built into the network
  * Dataset analysis through training results
  * Dataset augmentation

---

## Key Points for You

* Exact network details aren't key
* Concepts will be applicable across different datasets and ML techniques
* Let's begin with the end-to-end training

---

## System View

* LeNet was part of a writing recognition system
  * Feature extraction, classification, segmentation, and token generation
* LeNet handled feature extraction and classification
* Token generation was handled by a graph transformer network
  * Not what it sounds like today; more like an HMM

---

## Features and Classification

* We'll only discuss the LeNet part of the system
* Want to quote from where the authors explain why learning is feasible

---

---

## Trends

* These three trends haven't changed
  * Better hardware
  * Larger, more specific datasets
  * Techniques that work on large datasets
* So how did LeNet improve on large datasets?

---

## Linear Layer Approach

* You can build a NN to learn a weight matrix
  * This is similar to a perceptron, with each pixel having a weight at each neuron
  * With proper loss functions, it might have margins like an SVM
* Any changes in the input would lead to drastic output changes
  * But image features are similar even if characters are shifted or slightly rotated

---

## Weight Sharing

* Image features have some structure
  * Independent of where they are in the image
* We'll use a special neural network layer to take advantage of that

---

## Convolution

* Normally a neuron's output is $bias + \sum w_i x_i$
  * Images are flattened, but we can think of them as $bias + \sum\sum w_{ij} x_{ij}$
* If image structure is invariant to its location, could a neuron be a feature detector?
* Let's say we think features exist in 5x5 pixel areas
  * $f(x) = bias + \sum_{i=1}^5\sum_{j=1}^5 w_{ij} x_{ij}$
* Then we just "slide" that over the image

---

## Convolution Output

* A 5x5 kernel has 25 weights and 1 bias
  * Those are learned across an entire image
* Sliding it across a 32x32 image yields a 28x28 feature map
  * 784 outputs means that learning happens over those 784 values
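
The arithmetic can be checked with a short sketch. The fully connected comparison is an added illustration: it assumes one separate weight per input pixel, plus a bias, for each of the 784 outputs.

```python
kernel_size = 5
image_side = 32

# One shared kernel: 25 weights and 1 bias
conv_params = kernel_size * kernel_size + 1

# Valid positions for the kernel along each dimension
out_side = image_side - kernel_size + 1
outputs_per_map = out_side * out_side

# Hypothetical alternative without weight sharing:
# every output gets its own weight for every pixel, plus a bias
dense_params = outputs_per_map * (image_side * image_side + 1)

print(conv_params)      # 26
print(outputs_per_map)  # 784
print(dense_params)     # 803600
```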

---

## Advantages

* Learning is automatically smoothed, as with regularization
* Location invariant features are boosted
* Weight sharing makes a network smaller, and thus easier to train

---

## Preprocessing

* Images weren't used as they are
* A few modifications were made to enhance learning

---

## Padding

* Inputs to LeNet are 32x32, but the images are 28x28
  * This keeps the interesting features in the center of the receptive fields
    * Rather than at the edges
* The extra pixels are padded out
* Why?

---

## Edges

We want filters to be invariant to position, but without padding, features near the image border are "out of reach" of a filter's center.

---

## Normalization

* In addition to padding, pixel values were normalized
  * The background (white) became -0.1
  * Foreground (black) became 1.175
* This was done to make the data roughly zero mean and unit variance
  * There is generally more empty space than digit
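
A small sketch of the mapping, assuming pixels are already scaled to [0, 1] with 0 as background and 1 as foreground. The two-level (purely black and white) image is a simplification added here to show why the background value sits so close to zero.

```python
def normalize(pixel):
    """Map a pixel in [0, 1] to the range [-0.1, 1.175]."""
    return 1.275 * pixel - 0.1

background = normalize(0.0)
foreground = normalize(1.0)

# For a two-level image, the mean is zero when this fraction of pixels is foreground
foreground_fraction = 0.1 / 1.275
mean = (1 - foreground_fraction) * background + foreground_fraction * foreground

print(background, foreground)
print(f"zero mean at {100 * foreground_fraction:.1f}% foreground pixels")
```

Under these assumptions the mean is zero when a little under 8% of the pixels are foreground, which fits the observation that digits cover only a small part of the image.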

---

## The Network

* The NN itself is not too complicated

---

## The Network

* There are three 5x5 convolutions
  * Each convolution reduces the image size by 4
    * So the 32x32 input is 28x28 after the first convolution
* To further reduce the size, 2x2 subsampling is used
  * After the first two convolutions
* So the feature map size is quickly reduced
  * 32x32 -> 28x28 -> 14x14 -> 10x10 -> 5x5 -> 1x1
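
The size chain can be reproduced with a few lines. The helper names `conv_out` and `subsample_out` are illustrative only.

```python
def conv_out(size, kernel=5):
    """Output side length of an unpadded, stride-1 convolution."""
    return size - kernel + 1

def subsample_out(size, factor=2):
    """Output side length after non-overlapping subsampling."""
    return size // factor

sizes = [32]
for layer in ["conv", "subsample", "conv", "subsample", "conv"]:
    if layer == "conv":
        sizes.append(conv_out(sizes[-1]))
    else:
        sizes.append(subsample_out(sizes[-1]))

print(" -> ".join(f"{s}x{s}" for s in sizes))
# 32x32 -> 28x28 -> 14x14 -> 10x10 -> 5x5 -> 1x1
```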

---

## Subsampling

* Adjacent features are probably correlated
* So subsampling makes sense
  * Average or take the max or min in the area
* We need to shrink the image quickly
  * If the network is too deep, we'll run into the vanishing gradient problem
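
A minimal sketch of 2x2 subsampling with NumPy, assuming the feature map has even side lengths. The function name `subsample_2x2` is made up for this example, and it leaves out the learned weight and bias that the paper's subsampling layers also apply.

```python
import numpy as np

def subsample_2x2(feature_map, reduce=np.mean):
    """Reduce each non-overlapping 2x2 block to one value."""
    rows, cols = feature_map.shape
    # Axis 1 and axis 3 index the position inside each 2x2 block
    blocks = feature_map.reshape(rows // 2, 2, cols // 2, 2)
    return reduce(blocks, axis=(1, 3))

feature_map = np.arange(16, dtype=float).reshape(4, 4)
averaged = subsample_2x2(feature_map)           # [[2.5, 4.5], [10.5, 12.5]]
maximum = subsample_2x2(feature_map, np.max)    # [[5., 7.], [13., 15.]]
print(averaged)
print(maximum)
```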

---

## Receptive Fields

* The first 5x5 convolution clearly detects 5x5 pixel features
  * What about the last 5x5 convolution?
* It is effectively searching for a feature over the entire (transformed) input image
  * So we can say that its `receptive field` is actually 32x32 pixels
    * Measured in the original space
* Individual convolutions cannot see the entire image, but stacked convolutions achieve the same effect
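
The 32x32 figure can be checked with the usual receptive field recurrence: each layer grows the field by (kernel - 1) times the current spacing between neighboring outputs, measured in input pixels. This is a sketch using the layer list from the slides; the variable names are illustrative.

```python
# (name, kernel size, stride) for each layer, in order
layers = [
    ("conv 5x5", 5, 1),
    ("subsample 2x2", 2, 2),
    ("conv 5x5", 5, 1),
    ("subsample 2x2", 2, 2),
    ("conv 5x5", 5, 1),
]

receptive_field = 1  # a single input pixel sees itself
jump = 1             # spacing between neighboring outputs, in input pixels
history = []
for name, kernel, stride in layers:
    receptive_field += (kernel - 1) * jump
    jump *= stride
    history.append(receptive_field)
    print(f"{name}: {receptive_field}x{receptive_field}")

# The final convolution reports 32x32
```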

---

## What about overfitting?

* Authors never use the word, although they mention "over training"
* They also mention that a large learning rate makes it difficult to get stuck in a sharp minimum
  * Large movements cause oscillations, preventing a tight fit
  * Also biases the network toward ending up in wide plateaus of the loss surface
* Can you see how this is an intuitive field?

---

## Classification

* The image is finally reduced to a single pixel across 120 feature maps
* So we've transformed the images into 120 dimensional feature space
* A linear NN is then used for classification

---

## Euclidean Kernels

* You may expect the network to output 10 values
  * 1 for each digit, each is the probability of matching that digit
* To make a product out of this, the authors did something different
* They had the network produce an 84 value vector
  * Then compared that to "exemplar" digits

---

## Exemplars

7x12 exemplar images

---

## Exemplar Idea

* For a product, they needed special outputs
  * Had to be interpretable
  * Had to encode uncertainty
* So they asked LeNet to produce an "ideal" character from the input
  * Notice that this isn't so different from Boltzmann machines

---

## Interpretability

* If two characters were similar (0 and O, for example)
  * This naturally encodes their similarity in the loss function
  * An uncertain output would have a small Euclidean distance to either
* Good?
  * Prevents collapsing into an "all one or the other" output
  * Typical problem in neural networks

---

## Implementation

* I don't have the time to hand-craft bespoke exemplar 7x12 digits
  * Sorry!
* So we'll just have the network output 10 numbers

---

## Number to Gradient

* We need the loss (our error) to exist for each output
  * Not the hinge loss from SVMs
* Consider each NN output a probability and use the negative log likelihoods
  * Just like in clustering!
* That gives us a loss value for each output
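
A minimal sketch of the negative log likelihood for one sample, using only the standard library. The `scores` values are invented, and the softmax step is an assumption about how the raw outputs are turned into probabilities.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def negative_log_likelihood(probabilities, label):
    """Loss is small when the true label has high probability."""
    return -math.log(probabilities[label])

# Hypothetical network outputs for the 10 digits; digit 1 scores highest
scores = [0.5, 2.0, -1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
probabilities = softmax(scores)

loss_if_label_is_1 = negative_log_likelihood(probabilities, 1)
loss_if_label_is_2 = negative_log_likelihood(probabilities, 2)
print(loss_if_label_is_1, loss_if_label_is_2)
```

The loss is lower when the label matches the highest-scoring output and grows as the probability assigned to the true label shrinks.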

---

## PyTorch

* For any serious training we'll have to use PyTorch
* It's the simplest training framework that works

---

## Network definition

```python
self.net = torch.nn.Sequential(
        # 5x5 convolution with 6 output feature maps
        torch.nn.Conv2d(1, 6, 5),
        # 2x2 subsampling learned bias and weight, called S2 in the paper.
        # We'll use an average pool and then a 1x1 conv with 6 groups to emulate that.
        torch.nn.AvgPool2d(kernel_size=2, stride=2),
        torch.nn.Conv2d(6, 6, kernel_size=1, groups=6),
        SquashingFunction(),
        # 5x5 convolution with 16 output feature maps of size 10x10
        torch.nn.Conv2d(6, 16, kernel_size=5),
        # This again, emulating layer S4 from the paper.
        torch.nn.AvgPool2d(kernel_size=2, stride=2),
        torch.nn.Conv2d(16, 16, kernel_size=1, groups=16),
        SquashingFunction(),
        # The final convolution reduces features to 1x1
        torch.nn.Conv2d(16, 120, kernel_size=5, stride=1),
        torch.nn.Flatten(),
        torch.nn.Linear(120, 84),
        # We are not going to try to recreate the original exemplar-based function in LeNet5
        #euclidean_rbf(84, 12)
        torch.nn.Linear(84, 10),
        )
```

---

## Program

```python
import argparse
import gzip
import os
import numpy as np
from PIL import Image
import torch

def load_mnist_ubyte(image_path):
    """
    Loads MNIST images from the raw ubyte files.

    Args:
        image_path (str): Path to the image file (e.g., 'train-images-idx3-ubyte.gz').

    Returns:
        images (numpy.ndarray)
    """
    with gzip.open(image_path, 'rb') as f:
        # The header is: magic number (4 bytes) + num images (4 bytes) +
        # num rows (4 bytes) + num cols (4 bytes) = 16 bytes.

        # Read the entire file content into a buffer
        image_data = f.read()

        # The image data starts at byte 16.
        images = np.frombuffer(image_data, dtype=np.uint8, offset=16)

        # We need the dimensions to reshape. We can extract them from the header bytes,
        # which are big-endian ('>'). We use struct.unpack if we were being strict,
        # but here we'll assume the standard MNIST format and calculate the dimensions
        # for a clean numpy approach.

        # The number of images is in the 5th to 8th byte (4 bytes)
        num_images = np.frombuffer(image_data, dtype='>i4', offset=4, count=1)[0]
        # Rows and columns are 28x28 for MNIST, stored in bytes 9-12 and 13-16.
        # num_rows = np.frombuffer(image_data, dtype='>i4', offset=8, count=1)[0]
        # num_cols = np.frombuffer(image_data, dtype='>i4', offset=12, count=1)[0]
        num_rows = 28
        num_cols = 28

        # Reshape the 1D array into a 3D array (num_images, rows, columns)
        images = images.reshape(num_images, num_rows, num_cols)
    return images

def load_mnist_labels(label_path):
    """
    Loads MNIST labels from the raw ubyte files.

    Args:
        label_path (str): Path to the label file (e.g., 'train-labels-idx1-ubyte.gz').

    Returns:
        labels (numpy.ndarray)
    """
    with gzip.open(label_path, 'rb') as f:
        # The header is: magic number (4 bytes) + num items (4 bytes) = 8 bytes.
        # We skip these 8 bytes below.

        # Read the entire file content into a buffer
        label_data = f.read()

        # The label data starts at byte 8. The data type is unsigned byte ('B' or np.uint8).
        # Labels are a 1D vector.
        labels = np.frombuffer(label_data, dtype=np.uint8, offset=8)

    return labels

class SquashingFunction(torch.nn.Module):
    def __init__(self):
        super(SquashingFunction, self).__init__()
        self.squash = torch.nn.Tanh()
        self.const = 1.7159

    def forward(self, x):
        return self.const * self.squash(x)

class LeNet5(torch.nn.Module):
    """A mostly faithful recreation of LeNet 5."""

    def __init__(self):
        super(LeNet5, self).__init__()
        self.net = torch.nn.Sequential(
                # 5x5 convolution with 6 output feature maps
                torch.nn.Conv2d(1, 6, 5),
                # 2x2 subsampling learned bias and weight, called S2 in the paper.
                # We'll use an average pool and then a 1x1 conv with 6 groups to emulate that.
                torch.nn.AvgPool2d(kernel_size=2, stride=2),
                torch.nn.Conv2d(6, 6, kernel_size=1, groups=6),
                SquashingFunction(),
                # 5x5 convolution with 16 output feature maps of size 10x10
                torch.nn.Conv2d(6, 16, kernel_size=5),
                # This again, emulating layer S4 from the paper.
                torch.nn.AvgPool2d(kernel_size=2, stride=2),
                torch.nn.Conv2d(16, 16, kernel_size=1, groups=16),
                SquashingFunction(),
                # The final convolution reduces features to 1x1
                torch.nn.Conv2d(16, 120, kernel_size=5, stride=1),
                torch.nn.Flatten(),
                torch.nn.Linear(120, 84),
                # We are not going to try to recreate the original exemplar-based function in LeNet5
                #euclidean_rbf(84, 12)
                torch.nn.Linear(84, 10),
                )
        # Softmax turns the 10 raw scores into class probabilities when they are needed.
        # It is not applied in forward() because torch.nn.CrossEntropyLoss expects raw
        # scores and applies log-softmax itself.
        self.decision = torch.nn.Softmax(dim=1)

        torch.nn.init.uniform_(self.net[0].weight.data, a=-1, b=1)

    def forward(self, x):
        """Forward through the network, returning one raw score per class."""
        y_hat = self.net(x)
        return y_hat

def preprocess(X_train, order, device):
    # Normalize and then pad to 32x32.
    # Images are 0 to 1. MNIST stores the background as 0 and the digit strokes as 1.
    # Change so the background (white) becomes -0.1 and the foreground (black) becomes 1.175:
    # multiply by 1.275 to expand the range, then subtract 0.1.
    preprocessed = torch.tensor(1.275*X_train[order] - 0.1).float()

    # Pad 2 on every side with the background value, changing the 28x28 to 32x32
    preprocessed = torch.nn.functional.pad(preprocessed, pad=(2,2,2,2), value=-0.1)
    # Add a channel dimension
    return preprocessed.reshape((-1, 1, 32, 32)).to(device)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--train",
        required=True,
        help="gzip file with mnist data.")
    parser.add_argument(
        "--test",
        required=True,
        help="gzip file with mnist data.")
    parser.add_argument(
        "--train_labels",
        required=True,
        help="gzip file with mnist labels.")
    parser.add_argument(
        "--test_labels",
        required=True,
        help="gzip file with mnist labels.")
    parser.add_argument(
        "--epochs",
        required=False,
        type=int,
        default=5,
        help="Number of epochs to train.")
    parser.add_argument(
        "--train_samples",
        required=False,
        type=int,
        default=60000,
        help="Number of samples to use for training (to reduce memory consumption).")
    parser.add_argument(
        "--save_mismatch",
        required=False,
        type=int,
        default=0,
        help="The number of mismatches to save.")
    parser.add_argument(
        "--batch_size",
        required=False,
        type=int,
        default=32,
        help="The batch size for stochastic gradient descent.")
    parser.add_argument(
        "--random_seed",
        required=False,
        type=int,
        default=112,
        help="The random seed.")

    args = parser.parse_args()

    # Seed NumPy's global generator (used by np.random.shuffle) and PyTorch
    np.random.seed(args.random_seed)
    torch.manual_seed(args.random_seed)

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(f"Using device: {device}")

    print("Loading data")
    X_train_digits = (load_mnist_ubyte(args.train)/255)[:args.train_samples]
    X_test_digits = load_mnist_ubyte(args.test)/255
    Y_train_digits = load_mnist_labels(args.train_labels)[:args.train_samples]
    Y_test_digits = load_mnist_labels(args.test_labels)
    Image.fromarray((255*X_train_digits[2]).astype(np.uint8)).save("example_digit.png")

    model = LeNet5().to(device)
    
    # Authors used 0.0005 for two epochs, 0.0002 for the next 2, 0.0001 for the
    # next 3, 0.00005 for the next 4, and 0.00001 after.
    optimizer = torch.optim.Adam(model.parameters(), lr=0.0005)
    lr_steps = [2, 5, 9, 13]
    lr_schedule = [0.0002, 0.0001, 0.00005, 0.00001]

    criterion = torch.nn.CrossEntropyLoss()

    # See how many batches we'll use per epoch
    batches = int(np.ceil(X_train_digits.shape[0]/float(args.batch_size)))
    # We could just do this in one step, but let's assume that memory is finite
    test_batch_size = 1000
    test_batches = int(np.ceil(X_test_digits.shape[0]/float(test_batch_size)))

    # Shuffle and preprocess the training data
    order = np.arange(X_train_digits.shape[0])
    np.random.shuffle(order)

    X_train = preprocess(X_train_digits, order, device)
    Y_train = torch.tensor(Y_train_digits[order]).long().to(device)

    # Don't shuffle the test data, but otherwise treat it the same as the training data.
    X_test = preprocess(X_test_digits, np.arange(X_test_digits.shape[0]), device)
    Y_test = torch.tensor(Y_test_digits).long().to(device)

    for epoch in range(args.epochs):

        # Update the learning rate (as described in the original paper)
        if epoch in lr_steps:
            lr_idx = lr_steps.index(epoch)
            # The optimizer reads the learning rate from its parameter groups
            for param_group in optimizer.param_groups:
                param_group['lr'] = lr_schedule[lr_idx]
        total_loss = 0.0
        model.train()
        for batch in range(batches):
            begin = batch*args.batch_size
            end = (batch+1)*args.batch_size

            X_batch = X_train[begin:end]
            Y_batch = Y_train[begin:end]

            # Zero gradients before gradient calculation
            optimizer.zero_grad()

            y_hat = model(X_batch)
            loss = criterion(y_hat, Y_batch)
            total_loss += loss.item() * X_batch.size(0)

            # Gradient calculation
            loss.backward()
            # Update weights
            optimizer.step()

        epoch_loss = total_loss / X_train.size(0)
        print(f"{args.train_samples} Epoch {epoch} training loss {epoch_loss}")

        # Evaluation
        # Don't calculate gradients during these steps
        model.eval()
        with torch.no_grad():
            total_loss = 0.0
            for batch in range(test_batches):
                begin = batch*test_batch_size
                end = (batch+1)*test_batch_size

                X_batch = X_test[begin:end]
                Y_batch = Y_test[begin:end]

                y_hat = model(X_batch)
                loss = criterion(y_hat, Y_batch)
                total_loss += loss.item() * X_batch.size(0)
        epoch_loss = total_loss / X_test.size(0)
        print(f"{args.train_samples} Epoch {epoch} testing loss {epoch_loss}")

    # Final evaluation
    model.eval()
    with torch.no_grad():
        y_hat = model(X_test)
        classes = torch.argmax(y_hat, dim=1)
        matches = (classes == Y_test)
        mismatches = (classes != Y_test)

    correct = torch.sum(matches).item()
    print(f"{args.train_samples} Final accuracy {correct}/{y_hat.size(0)} ({correct/y_hat.size(0)})")
```

---

## Loss

---

## Accuracy

---

## Discussion

* We don't see any divergence between training and testing loss
* And we also don't see any sign that the network has saturated
  * Should still have the capacity to learn more
* If only we had more data!

---

## Augmentation

* Neural networks like augmented data
* Why?
  * They are data hungry
  * Also, they don't guarantee any margins
  * Augmentations help to fill out the search space
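
A rough sketch of a stretch-and-rotate augmentation in plain NumPy, using nearest-neighbor sampling. The function `augment`, its parameters, and the test image are all invented for illustration; this is not the augmentation code used to produce the lecture's figures.

```python
import numpy as np

def augment(image, angle_deg=0.0, stretch_x=1.0, stretch_y=1.0, fill=0.0):
    """Rotate and stretch a 2D image about its center."""
    rows, cols = image.shape
    center_r, center_c = (rows - 1) / 2, (cols - 1) / 2
    theta = np.deg2rad(angle_deg)
    out_r, out_c = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
    # Work backwards: for each output pixel, find the source pixel it came from
    dr = (out_r - center_r) / stretch_y
    dc = (out_c - center_c) / stretch_x
    src_r = np.rint(center_r + np.cos(theta) * dr - np.sin(theta) * dc).astype(int)
    src_c = np.rint(center_c + np.sin(theta) * dr + np.cos(theta) * dc).astype(int)
    # Output pixels whose source falls outside the image keep the fill value
    inside = (src_r >= 0) & (src_r < rows) & (src_c >= 0) & (src_c < cols)
    result = np.full(image.shape, fill, dtype=image.dtype)
    result[inside] = image[src_r[inside], src_c[inside]]
    return result

# A vertical bar as a stand-in for a digit
image = np.zeros((28, 28))
image[4:24, 13:15] = 1.0

unchanged = augment(image)
augmented = augment(image, angle_deg=10, stretch_x=1.1)
print(np.array_equal(unchanged, image), np.array_equal(augmented, image))
```

The angle and stretch ranges are the kind of values that need tuning: too small and nothing new is learned, too large and the digits stop looking like digits.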

---

## Augmentation Example

Not huge, just some stretching and light rotation.

---

## Loss

---

## Accuracy

---

## No good!

* From the figures, we can see that the testing set performance is unchanged
* Augmentation is another hyperparameter!
  * We have to tune it to be useful!

---

## Next Time

* There's more to say about NNs
  * But some lessons repeat
  * We'll return to augmentations with the ImageNet dataset and AlexNet
* Accuracy is capping near SVM accuracy
  * Both slightly above 97%

---

## Lessons

* The lesson here is that NNs are flexible
* Final output could be images, could be classes, could be anything
* This could have been clear in the 90s, but the capability wasn't heavily used for years
* Also, where is that overfitting?
  * It shouldn't happen, unless you mess something up
    * Like bad augmentations

---

## Sample Questions

<div style="text-align: left;">
Which of the following is an advantage of SVMs?

1) They maximize the margin between classes
2) They make $\alpha$ sparse, reducing computation compared to perceptrons.
3) They support soft margins, allowing users to gracefully tune model complexity to the data.
4) All of the above.

</div>

---

## Sample Questions

<div style="text-align: left;">
What is true about the kernel trick?

1) In SVMs, the kernel trick replaces a weight matrix with a function of the training data.
2) The kernel trick allows for faster inference than matrix multiplication.
3) The kernel trick guarantees a large margin in SVMs.
4) None of the above.

</div>

---

## Sample Questions

<div style="text-align: left;">
What is true about SVMs?

1) Sparsity in $\alpha$ makes it possible to work with large training sets where perceptrons struggled.
2) The large margin improves generalization to unseen points
3) Kernel trick allows the SVM to operate on non-linear space
4) All of the above

</div>

---

## Sample Questions

<div style="text-align: left;">
What can we say about SVM's complexity parameter, C?

1) A low C value simplifies the model.
2) A high C value simplifies the model.
3) A low C value makes the SVM similar to one using a hard margin.
4) The complexity parameter is only useful on very high-dimensional data.

</div>

---

## Sample Questions

<div style="text-align: left;">
What is true about a NN's ability to be a universal function approximator?

1) NNs are not universal function approximators.
2) NNs can approximate any function, but not when trained with SGD.
3) With enough width or depth, and nonlinear activation functions, NNs can approximate any function.
4) None of the above are true.

</div>

<!--
Cover convolutions
* the structure
* weight sharing
* Shift invariance
* Smoother learning with stronger signal

* Weight sharing via convolutions
  * Interpretability built into the network
  * Dataset analysis through training results
  * Dataset augmentation

Go over the network structure, including the RBF kernels at the output. Then the loss functions.

Show the graph of train and test error as a function of the training set size.

Show the loss curves.

That's probably all that will fit into one lecture.
-->