# CS 462 - Lecture 10

## Convolutions

Bernhard Firner

2026-02-24

---

## Book

* Today we'll be going through chapter 10 of the [book](https://udlbook.github.io/udlbook/)
* We'll also go over some outside material
  * Because convolutions are important!

---

## Linear Limitations

* Linear networks work fine
  * But deep learning with linear networks isn't state of the art
* Why?
  * Memory and compute intensive
  * Tend to learn incorrect rules

---

## Decision Boundaries

* "Tend to learn incorrect rules" sounds squishy
  * So let's take a moment to talk about decision boundaries
* Consider data points that have two attributes, x and y
* When we train a classifier, there is a line where the linear network predicts either class as equally likely
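
A minimal sketch of the simplest case: a single linear unit followed by a sigmoid. The weights `W` and bias `B` are made-up values for illustration, not taken from a trained model. The boundary is the set of points where the unit's output is exactly 0.5.

```python
import math

# Hypothetical weights and bias, chosen only for illustration.
W = (1.0, -2.0)
B = 0.5


def predict(x, y):
    """Probability of class 1 from one linear unit followed by a sigmoid."""
    z = W[0] * x + W[1] * y + B
    return 1.0 / (1.0 + math.exp(-z))


def boundary_y(x):
    """Solve W[0]*x + W[1]*y + B = 0 for y; both classes are equally likely there."""
    return -(W[0] * x + B) / W[1]


for x in (-1.0, 0.0, 2.0):
    y = boundary_y(x)
    print(f"x={x:5.1f}  y={y:5.2f}  p(class 1)={predict(x, y):.2f}")
```

Every printed probability is 0.5, and the points `(x, boundary_y(x))` all lie on one straight line. Hidden layers with ReLU or Tanh bend that line.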

---

## ReLU Decision Boundary

---

## Tanh Decision Boundary

---

## Boundary Quality

* The decision boundaries are different, but both have 0 error on the training set
* Can we tell if they are good boundaries?
* We can look at the **margins**, the space between the classes
  * The ReLU sticks close to the lower class
  * The Tanh is probably too close to the upper class

---

## Variance

* Let's add in a training point that looks like it's in the wrong area
  * This could represent a mislabelled data point
  * Or it could represent natural variance of the actual data
  * The source of the problem isn't important

---

## ReLU Decision Boundary

* Our current model isn't "solving" the problem, so we add layers

---

## ReLU Decision Boundary

* Now we have a solution, but is this good?
* A human might look at that point and let it go, but that's not how SGD works

---

## Regularization

* Let's increase L2 regularization to $\lambda=0.01$
  * Not really better

---

## Why?

* Why does the linear network contort to make such complicated decision boundaries?
  * It has no implicit bias towards smooth solutions or boundaries
* Linear neural networks match any observed variance in the data, whenever possible
  * A statistical approach would set a sane boundary, but DNNs are iterative
  * A bit of regularization can't stop this

---

## Lack of Structure

* Imagine if we shuffled all of the pixels in the digits training set
  * And the same pixel shuffle is applied to each image
* Would this change linear network results?
  * No!
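
A small sketch of why the shuffle does not matter to a fully connected layer, using random made-up weights. Shuffling the pixels and shuffling each neuron's weights in the same way gives the same outputs, so every solution on the original images has an equivalent solution on the shuffled ones.

```python
import random


def dense(x, weights, bias):
    """One fully connected layer: each output is a weighted sum over every input."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


random.seed(0)
n_pixels, n_out = 16, 3  # a tiny 4x4 "image" and 3 neurons, for illustration
x = [random.random() for _ in range(n_pixels)]
weights = [[random.uniform(-1, 1) for _ in range(n_pixels)] for _ in range(n_out)]
bias = [random.uniform(-1, 1) for _ in range(n_out)]

# One fixed shuffle, applied to the pixels and to every neuron's weights.
perm = list(range(n_pixels))
random.shuffle(perm)
x_shuffled = [x[p] for p in perm]
weights_shuffled = [[row[p] for p in perm] for row in weights]

original = dense(x, weights, bias)
shuffled = dense(x_shuffled, weights_shuffled, bias)
print(all(abs(a - b) < 1e-9 for a, b in zip(original, shuffled)))
```

The layer has no notion of which pixels are neighbors, which is the lack of structure this slide is about.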

---

## Image Structure

* Images (and other types of data) have structure
* Ignoring it is bad
  * Networks learn specific rules at each pixel rather than general structural rules
  * There is no contradiction when a linear network learns a mislabel on very similar data

---

## Efficiency

* Our linear network generalizes somewhat (otherwise it wouldn't work)
* But if a digit is shifted over a few pixels, we need training data with the same shift
* Similarity is only found in exact pixel locations, not features
  * This is probably inefficient!

---

## Alternative

* Let's say we make a small linear network of 9 weights
* Instead of "seeing" the entire image, we use a 3x3 block of pixels as "x"
  * $f[x] = \beta + \sum_{i=0}^{2}\sum_{j=0}^2x_{ij}\omega_{3\times i + j}$
* Then we take the same set of 9 weights, and apply them to the entire image
  * Let's call this output a **feature map**
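
The formula above can be sketched in plain Python. The function name `conv3x3`, the toy image, and the edge-detecting weights are all made up for illustration. The kernel is applied at every position where it fits (stride 1, no padding).

```python
def conv3x3(image, weights, bias=0.0):
    """Slide one shared set of 9 weights over a 2D image to build a feature map."""
    rows, cols = len(image), len(image[0])
    feature_map = []
    for r in range(rows - 2):
        out_row = []
        for c in range(cols - 2):
            total = bias
            for i in range(3):
                for j in range(3):
                    # Same indexing as the formula: weight 3*i + j pairs with x_ij.
                    total += image[r + i][c + j] * weights[3 * i + j]
            out_row.append(total)
        feature_map.append(out_row)
    return feature_map


# Toy 4x4 image: dark on the left, bright on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hand-picked weights: right column minus left column (responds to vertical edges).
weights = [-1, 0, 1,
           -1, 0, 1,
           -1, 0, 1]

feature_map = conv3x3(image, weights)
print(feature_map)
```

The 4x4 input yields a 2x2 feature map, and every entry was produced by the same 9 weights.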

---

## What happens?

* Feed the feature map to a linear layer and train a classifier
  * The gradient in every position in the feature map applies to the same 9 weights
* Learning can be faster because of the increased number of gradients
  * But only if those gradients are consistent
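
A 1D sketch of the shared gradient, assuming a width 3 kernel and made-up upstream gradients (the `upstream` values stand in for whatever the later layers pass back). Each feature map position adds its own contribution to the same 3 weights.

```python
def weight_gradient(x, upstream, kernel_size=3):
    """Gradient of the loss w.r.t. a shared 1D kernel.

    Position p of the feature map is f_p = sum_k w_k * x[p + k], so it
    contributes upstream[p] * x[p + k] to the gradient of w_k.
    """
    grad = [0.0] * kernel_size
    for p, g in enumerate(upstream):
        for k in range(kernel_size):
            grad[k] += g * x[p + k]
    return grad


x = [1.0] * 5  # a flat input, so every patch looks the same

# Every position agrees on the direction: the contributions add up.
consistent = weight_gradient(x, upstream=[1.0, 1.0, 1.0])

# Positions disagree: the contributions cancel.
cancelling = weight_gradient(x, upstream=[1.0, -1.0, 0.0])

print(consistent, cancelling)
```

With agreement the gradient is three times larger than a single position would give; with disagreement it vanishes.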

---

## Similarities

* Recall that SGD pushes towards a good minimum because good minima should be shared across all data points
* Our 9 weights are pushed in a similar way
  * If there is some common feature that is useful for classification, it produces a consistent gradient
* That should rapidly push our 9 weights in a useful direction

---

## Spatial Correlation

* Are there similar features found across images?
  * Of course! Otherwise we would be looking at noise!
* Our set of 9 weights is convolved across the image, so we call it a **convolution**
  * They are searching for spatial correlations in groups of pixels

---

## Spatial Invariance

* The weights do not change as they are moved over the image
* A linear network is happy to learn noise, or special rules for something mislabelled
* A convolution should not know where in the image it is being applied
  * So the features are consistent, regardless of the source location
* This idea is called **spatial invariance**
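
A 1D sketch with made-up numbers: the same pattern placed at two different locations produces the same feature values, just at shifted positions in the feature map.

```python
def conv1d(x, weights):
    """Apply one shared kernel at every position where it fits (stride 1, no padding)."""
    k = len(weights)
    return [sum(x[p + i] * weights[i] for i in range(k)) for p in range(len(x) - k + 1)]


weights = [1, 2, 1]  # hand-picked kernel, for illustration only

pattern_left = [0, 5, 9, 5, 0, 0, 0, 0]
pattern_right = [0, 0, 0, 5, 9, 5, 0, 0]  # the same bump, moved 2 steps right

left = conv1d(pattern_left, weights)
right = conv1d(pattern_right, weights)

print(left)   # response to the bump on the left
print(right)  # the same response, 2 positions later
```

A fully connected layer would need separate weights, and separate training examples, to recognize the bump at each location.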

---

## Weight Efficiency

* The 28x28 digits have 784 pixels
  * And every neuron in the first linear layer must have 784 weights
* A 3x3 convolution has 9 weights
  * We could have 87 convolutions with the same number of weights
  * So convolutions are memory efficient too!

---

## Kernel Size

* Our example of 9 pixels would be called a 3x3 convolution
  * The entire set of weights is called a kernel (or, sometimes, filter)
* How is the kernel applied to the input?
  * If it is evaluated at every possible location, that is called stride 1
  * We can also skip pixels between evaluations
    * The adjacent pixels were already evaluated in the 3x3 case, for example

---

## Padding

* What happens at the edge of an image?
* The 3x3 kernel can be evaluated in 26x26 locations on the 28x28 digits
  * Since the edges have no more pixels
* But should we really skip evaluation near the edges? What if our kernel is 5x5 or 7x7?
* We can **pad** the edge of the image, adding enough 0s that the convolution can be evaluated anywhere
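
The number of positions where a kernel fits follows a simple formula. This sketch assumes the same padding on both sides and no dilation; the helper name `output_size` is made up for illustration.

```python
def output_size(input_size, kernel_size, stride=1, padding=0):
    """Number of positions along one dimension where the kernel can be evaluated."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


print(output_size(28, 3))             # 3x3 kernel, no padding
print(output_size(28, 3, padding=1))  # one ring of zeros keeps the size
print(output_size(28, 5, padding=2))  # a 5x5 kernel needs two rings
print(output_size(28, 3, stride=2))   # skipping every other position
```

Without padding the 28x28 digits shrink to 26x26; a padding of `(kernel_size - 1) // 2` keeps the feature map the same size as the input for odd kernel sizes at stride 1.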

---

## Visualization

* Consider a 3x1 kernel run over adjacent values
* Padding increases the output feature map size

---

## Image Visualization

* An image has another dimension, and our convolution becomes 3x3

---

## Channels Visualization

* With color channels, or a feature map, the convolution gains a new dimension

---

## 1D, 2D, 3D, etc

* You can think of the hidden layer convolutions as 3D
  * Since they have 3 dimensions (x, y, channels)
* We can match the size of our convolutions to anything
  * Sounds, images, text, etc
  * As long as it has structure

---

## Stacking Convolutions

* Convolutions learn features
  * Should we use them once and return to linear layers?
  * Or do we build features on top of features?
* We stack them, feature maps becoming the inputs to the next convolution
  * Stacked convolutions mean that input pixels affect larger areas of the feature maps

---

## Receptive Fields

* Consider the 3x1 kernel, with input X and feature map f
  * Input pixels $[x_1, x_2, x_3][\omega_1, \omega_2, \omega_3]^T$ becomes $f_1$, the first value in the feature map
  * $[x_2, x_3, x_4][\omega_1, \omega_2, \omega_3]^T = f_2$ and $[x_3, x_4, x_5][\omega_1, \omega_2, \omega_3]^T = f_3$
* The first output of a second convolution is $[f_1, f_2, f_3][\omega_1', \omega_2', \omega_3']^T$
  * The second kernel uses values derived from $x_1 ... x_5$, so it has a receptive field of 5

---

## Receptive Field & Depth

* Successive width 3 convolutions have wider receptive fields with greater depth
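
A short sketch of how the receptive field grows with depth. The helper `receptive_field` is made up for illustration; it tracks how far apart neighboring outputs are in input pixels (`jump`), which is what stride changes.

```python
def receptive_field(layers):
    """Receptive field, in input pixels, of one output of a stack of 1D convolutions.

    layers is a list of (kernel_size, stride) pairs, first layer first.
    """
    field, jump = 1, 1
    for kernel_size, stride in layers:
        field += (kernel_size - 1) * jump  # each layer widens the view
        jump *= stride                     # strides spread later kernels further apart
    return field


for depth in range(1, 5):
    print(depth, receptive_field([(3, 1)] * depth))

# Stride 2 makes the receptive field grow much faster.
print(receptive_field([(3, 2)] * 3))
```

Two width 3 layers at stride 1 give the receptive field of 5 from the worked example, and each further layer adds 2.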

---

## Input Channels

* If we run 16 convolutions on our input image, we have 16 feature maps
* Do the next convolutions choose a single feature map?
* They could, but that isn't how they are generally implemented
  * Usually, each convolution takes all of the input feature maps as input
  * Generally, we add feature maps at later hidden layers
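
A 1D sketch of a convolution over several input feature maps, with made-up values. Each output map has one kernel per input map, and the results are summed, so every output sees every input.

```python
def conv1d_multichannel(inputs, kernels, bias):
    """inputs: [in_channels][length]; kernels: [out_channels][in_channels][k]."""
    k = len(kernels[0][0])
    length = len(inputs[0])
    outputs = []
    for out_kernel, b in zip(kernels, bias):
        feature_map = []
        for p in range(length - k + 1):
            total = b
            # Sum over every input feature map, not just one of them.
            for channel, weights in zip(inputs, out_kernel):
                for i in range(k):
                    total += channel[p + i] * weights[i]
            feature_map.append(total)
        outputs.append(feature_map)
    return outputs


inputs = [[1, 2, 3, 4], [10, 20, 30, 40]]  # 2 input feature maps
kernels = [
    [[1, 0, 0], [0, 0, 1]],  # output 0 mixes both inputs
    [[0, 1, 0], [0, 0, 0]],  # output 1 learned to ignore input 1
    [[1, 1, 1], [1, 1, 1]],  # output 2 sums everything in view
]
outputs = conv1d_multichannel(inputs, kernels, bias=[0, 0, 0])
print(outputs)
```

A kernel can still learn to ignore an input map by driving those weights to 0, as output 1 does here.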

---

## Datasets

* We are going to stick with images
  * Because they go well with presentations
* We've already seen digits
* Let's move on to the slightly more difficult CIFAR
  * [https://www.cs.toronto.edu/~kriz/cifar.html](https://www.cs.toronto.edu/~kriz/cifar.html)

---

## CIFAR-10

* 10 classes, 6000 images per class
  * 32x32 RGB color images
* 50000 training images and 10000 testing images
* Classes are airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck

---

## CIFAR-10 Examples

---

## More Variety

* What are these?

---

## Baseline

* Let's use the linear network from last time
* This model has 1,645,950 trainable parameters.

```python
Linear(
  (net): Sequential(
    (0): Flatten(start_dim=1, end_dim=-1)
    (1): Linear(in_features=3072, out_features=512, bias=True)
    (2): ReLU()
    (3): Linear(in_features=512, out_features=120, bias=True)
    (4): ReLU()
    (5): Linear(in_features=120, out_features=84, bias=True)
    (6): Linear(in_features=84, out_features=10, bias=True)
  )
  (decision): Softmax(dim=1)
)
```

---

## MLP Baseline

---

## ConvNet

```python
import torch


class ConvNet(torch.nn.Module):
    """A mostly faithful recreation of LeNet 5."""

    def __init__(self, nonlinearity=torch.nn.ReLU, classes=10):
        super(ConvNet, self).__init__()
        self.net = torch.nn.Sequential(
            # Basic ConvNet as in the 20xx years
            # Each stride 2 convolution halves the feature map: 32 -> 16 -> 8 -> 4
            torch.nn.Conv2d(in_channels=3, out_channels=15, kernel_size=(3,3), padding=1, stride=2),
            nonlinearity(),
            torch.nn.Conv2d(in_channels=15, out_channels=15, kernel_size=(3,3), padding=1, stride=2),
            nonlinearity(),
            torch.nn.Conv2d(in_channels=15, out_channels=15, kernel_size=(3,3), padding=1, stride=2),
            nonlinearity(),
            torch.nn.Flatten(),
            # 15 channels * 4 * 4 positions = 240 inputs
            torch.nn.Linear(240, 60),
            torch.nn.Linear(60, classes),
        )
        self.decision = torch.nn.Softmax(dim=1)

        # Kaiming initialization for the convolutions and the first linear layer
        for layer in [0, 2, 4, 7]:
            torch.nn.init.kaiming_normal_(self.net[layer].weight.data, nonlinearity="relu")

    def forward(self, x):
        """Forward through the network."""
        y_hat = self.decision(self.net(x))
        return y_hat
```

---

## Convnet Size

* Model has 19,570 parameters
  * Two orders of magnitude smaller

```python
ConvNet(
  (net): Sequential(
    (0): Conv2d(3, 15, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (1): ReLU()
    (2): Conv2d(15, 15, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (3): ReLU()
    (4): Conv2d(15, 15, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (5): ReLU()
    (6): Flatten(start_dim=1, end_dim=-1)
    (7): Linear(in_features=240, out_features=60, bias=True)
    (8): Linear(in_features=60, out_features=10, bias=True)
  )
  (decision): Softmax(dim=1)
)
```
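
The parameter counts can be checked by hand. A sketch with two made-up helpers, using the layer sizes from the printed models: a convolution has one k x k kernel per (input channel, output channel) pair plus one bias per output channel.

```python
def conv2d_params(in_channels, out_channels, kernel_size):
    """Weights plus biases of a square 2D convolution."""
    return in_channels * out_channels * kernel_size * kernel_size + out_channels


def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


convnet = [
    conv2d_params(3, 15, 3),
    conv2d_params(15, 15, 3),
    conv2d_params(15, 15, 3),
    linear_params(240, 60),
    linear_params(60, 10),
]
mlp = [
    linear_params(3072, 512),
    linear_params(512, 120),
    linear_params(120, 84),
    linear_params(84, 10),
]

print(convnet, sum(convnet))
print(mlp, sum(mlp))
print(round(sum(mlp) / sum(convnet)))
```

Most of the ConvNet's parameters sit in its first linear layer (14,460 of 19,570); the three convolutions together use only 4,500.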

---

## Conv Baseline

---

## Comparison

* MLP on left, convnet on the right
* 2 orders of magnitude fewer parameters, similar performance
  * And notice that the training performance is more indicative of testing

---

## Doing Better

* We are using so few parameters
* So let's pump up those numbers with a larger ConvNet
* We'll try wider first, and then deeper

---

## Wider

```python
import torch


class WideConvNet(torch.nn.Module):
    """A wider variant of ConvNet: channels double at each convolution (15 -> 30 -> 60)."""

    def __init__(self, nonlinearity=torch.nn.ReLU, classes=10):
        super(WideConvNet, self).__init__()
        self.net = torch.nn.Sequential(
            # Basic ConvNet as in the 20xx years
            # Each stride 2 convolution halves the feature map: 32 -> 16 -> 8 -> 4
            torch.nn.Conv2d(in_channels=3, out_channels=15, kernel_size=(3,3), padding=1, stride=2),
            nonlinearity(),
            torch.nn.Conv2d(in_channels=15, out_channels=30, kernel_size=(3,3), padding=1, stride=2),
            nonlinearity(),
            torch.nn.Conv2d(in_channels=30, out_channels=60, kernel_size=(3,3), padding=1, stride=2),
            nonlinearity(),
            torch.nn.Flatten(),
            # 60 channels * 4 * 4 positions = 960 inputs
            torch.nn.Linear(4*240, 60),
            torch.nn.Linear(60, classes),
        )
        self.decision = torch.nn.Softmax(dim=1)

        # Kaiming initialization for the convolutions and the first linear layer
        for layer in [0, 2, 4, 7]:
            torch.nn.init.kaiming_normal_(self.net[layer].weight.data, nonlinearity="relu")

    def forward(self, x):
        """Forward through the network."""
        y_hat = self.decision(self.net(x))
        return y_hat
```

---

## Wider Params

* Model has 79030 parameters
  * Where are the parameters?
  * The first linear layer has $960\times60 = 57600$ weights 
* Reducing the final feature map to 1x1 reduces the linear layer size

```python
WideConvNet(
  (net): Sequential(
    (0): Conv2d(3, 15, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (1): ReLU()
    (2): Conv2d(15, 30, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (3): ReLU()
    (4): Conv2d(30, 60, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (5): ReLU()
    (6): Flatten(start_dim=1, end_dim=-1)
    (7): Linear(in_features=960, out_features=60, bias=True)
    (8): Linear(in_features=60, out_features=10, bias=True)
  )
  (decision): Softmax(dim=1)
)
```
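
A rough sketch of the last bullet's arithmetic. It assumes the 4x4 feature maps are reduced to 1x1 by something that adds no parameters (for example, averaging each map); the slides do not say how the reduction would be done, so treat that as an assumption.

```python
def head_weights(channels, height, width, out_features):
    """Weights of the first linear layer after flattening the final feature maps."""
    return channels * height * width * out_features


total_params = 79030  # WideConvNet, from the slide

weights_4x4 = head_weights(60, 4, 4, 60)  # current model: 960 inputs
weights_1x1 = head_weights(60, 1, 1, 60)  # hypothetical: 60 inputs

saved = weights_4x4 - weights_1x1
print(weights_4x4, weights_1x1, saved)
print(total_params - saved)  # total if only the linear layer's input shrinks
```

Under that assumption the first linear layer drops from 57,600 weights to 3,600.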

---

## Wider ConvNet

* Back to the 55% accuracy of the large linear net

---

## Feature Map Sizes

* Remember, deeper networks should be more efficient
  * But now we need to worry about feature map sizes
* With stride 2, we are dropping the dimensions in half with each layer
  * But with stride 1 and proper padding, they would remain the same size
  * We'll alternate between stride 1 and stride 2, going 5 convolutions deep
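
A sketch that traces the feature map size through a stack of 3x3 convolutions with padding 1, starting from the 32x32 CIFAR images. The helper names are made up for illustration.

```python
def output_size(input_size, kernel_size=3, stride=1, padding=1):
    """Feature map size along one dimension after one convolution."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


def trace_sizes(input_size, strides):
    """Feature map size after each convolution in the stack."""
    sizes = []
    for stride in strides:
        input_size = output_size(input_size, stride=stride)
        sizes.append(input_size)
    return sizes


print(trace_sizes(32, [2, 2, 2]))        # three stride 2 convolutions
print(trace_sizes(32, [1, 2, 1, 2, 2]))  # five convolutions, mixed strides
```

Both stacks end at 4x4, so with 15 channels the flattened input to the linear layer stays at 15 x 4 x 4 = 240 even though the second stack is deeper.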

---

## Deeper

```python
import torch


class DeepConvNet(torch.nn.Module):
    """A deeper variant of ConvNet: five convolutions, mixing stride 1 and stride 2."""

    def __init__(self, nonlinearity=torch.nn.ReLU, classes=10):
        super(DeepConvNet, self).__init__()
        self.net = torch.nn.Sequential(
            # Basic ConvNet as in the 20xx years
            # Feature map sizes: 32 -> 32 -> 16 -> 16 -> 8 -> 4
            torch.nn.Conv2d(in_channels=3, out_channels=15, kernel_size=(3,3), padding=1, stride=1),
            nonlinearity(),
            torch.nn.Conv2d(in_channels=15, out_channels=15, kernel_size=(3,3), padding=1, stride=2),
            nonlinearity(),
            torch.nn.Conv2d(in_channels=15, out_channels=15, kernel_size=(3,3), padding=1, stride=1),
            nonlinearity(),
            torch.nn.Conv2d(in_channels=15, out_channels=15, kernel_size=(3,3), padding=1, stride=2),
            nonlinearity(),
            torch.nn.Conv2d(in_channels=15, out_channels=15, kernel_size=(3,3), padding=1, stride=2),
            nonlinearity(),
            torch.nn.Flatten(),
            # 15 channels * 4 * 4 positions = 240 inputs
            torch.nn.Linear(240, 60),
            torch.nn.Linear(60, classes),
        )
        self.decision = torch.nn.Softmax(dim=1)

        # Kaiming initialization for the five convolutions
        for layer in [0, 2, 4, 6, 8]:
            torch.nn.init.kaiming_normal_(self.net[layer].weight.data, nonlinearity="relu")

    def forward(self, x):
        """Forward through the network."""
        y_hat = self.decision(self.net(x))
        return y_hat
```

---

## Deeper

* Model has 23650 parameters
  * The first linear layer has $240\times60 = 14400$ weights

```python
DeepConvNet(
  (net): Sequential(
    (0): Conv2d(3, 15, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): ReLU()
    (2): Conv2d(15, 15, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (3): ReLU()
    (4): Conv2d(15, 15, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (5): ReLU()
    (6): Conv2d(15, 15, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (7): ReLU()
    (8): Conv2d(15, 15, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (9): ReLU()
    (10): Flatten(start_dim=1, end_dim=-1)
    (11): Linear(in_features=240, out_features=60, bias=True)
    (12): Linear(in_features=60, out_features=10, bias=True)
  )
  (decision): Softmax(dim=1)
)
```

---

## Deeper Results

* Still reaching 55%
  * And now it looks like we could go higher with more epochs

---

## ConvNet Details

* There is a lot of performance to squeeze from ConvNets
  * 3 more lectures, at least
* And obviously many design choices
  * How many channels? When do we downscale?
* And the feature maps from convolutions also unlock new applications

---

## Efficiency

* Convolutions are *efficient*
  * If you only remember one thing, that should be it
* Linear layers are powerful (thanks to the Universal Approximation Theorem)
  * But they are parameter hungry

---

## Training Data

* Each parameter requires training
  * So it follows that a huge linear network requires lots of training data
  * At least if we want it to converge to a good solution; you can always train with 3 samples and get a bad network
* Convolutional networks are a bit less data hungry than linear networks

---

## Generalization

* That data efficiency means that we expect convolutional networks to generalize better with less data
* Getting good data is hard, so this is a big deal!

---

## Next Topic

* How can we squeeze more out of convolutional networks?
* We'll look at a brief history of the state of the art
  * And we'll stick with CIFAR-10 for practical considerations