# CS 462 - Lecture 11

## Convolutions

Bernhard Firner

2026-02-24

---

# Convolutions
## Convolutions
### Convolutions
#### Convolutions
##### Convolutions
###### Convolutions

---

## Review: Mechanics

* 1D Convolutions look at adjacent inputs
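
As a reminder of the mechanics, here is a minimal pure-Python sketch of a 1D convolution with no padding and a stride of 1. The function name `conv1d` and the example kernel are illustrative assumptions. Like PyTorch, it does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
def conv1d(signal, kernel):
    """Slide `kernel` across `signal` (no padding, stride 1)."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


signal = [1, 2, 4, 7, 11]
# Hypothetical kernel: right neighbor minus left neighbor.
difference_kernel = [-1, 0, 1]

# Each output value only depends on three adjacent inputs.
print(conv1d(signal, difference_kernel))  # [3, 5, 7]
```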

---

## Image Visualization

* 2D input, like digits, has structure in another dimension
  * Our convolution becomes 3x3
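
The same idea in two dimensions can be sketched with nested lists. This is only an illustration under simple assumptions (no padding, stride 1, one channel); the `conv2d` helper and the edge kernel are made up for the example.

```python
def conv2d(image, kernel):
    """2D convolution over nested lists (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(kh)
                for j in range(kw)
            )
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# A 4x4 image with a vertical edge between columns 1 and 2.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hypothetical 3x3 kernel that responds to left-to-right brightness increases.
edge_kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

print(conv2d(image, edge_kernel))  # [[3, 3], [3, 3]]
```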

---

## Channels Visualization

* With color channels, such as in CIFAR, the convolution gains a new dimension

---

## Other Dimensions

* We could add more dimensions
  * e.g. for time
* But it is common to simply append those into the channel dimension
  * If you look, you'll see that PyTorch's convolution layers only go up to 3D (`Conv3d`)
  * And hardware support tends to stop at a 3D matrix
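
A rough sketch of what "appending into the channel dimension" could look like for a short clip of frames. The layout `[time][channel][row][col]` and both helper names are assumptions for illustration only.

```python
def constant_frame(values, height=2, width=2):
    """Build a tiny frame with one constant-valued plane per channel."""
    return [[[v] * width for _ in range(height)] for v in values]


def stack_frames_as_channels(frames):
    """Turn [time][channel][row][col] into [time * channel][row][col]."""
    return [channel for frame in frames for channel in frame]


# Two hypothetical 2x2 RGB frames.
frames = [constant_frame([1, 2, 3]), constant_frame([4, 5, 6])]
stacked = stack_frames_as_channels(frames)

# 2 frames x 3 channels become 6 channels of 2x2 pixels.
print(len(stacked), len(stacked[0]), len(stacked[0][0]))  # 6 2 2
```

A 2D convolution applied to `stacked` would then simply use 6 input channels instead of 3.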

---

## CIFAR-10

* [https://www.cs.toronto.edu/~kriz/cifar.html](https://www.cs.toronto.edu/~kriz/cifar.html)
* 10 classes, 6000 images per class
  * 32x32 RGB color images
* 50000 training images and 10000 testing images
* Classes are airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck

---

## CIFAR-10 Examples

---

## More Variety

* What are these?

---

## Convolutions

* Why are they a big deal?
  * A linear network with 1.6 million parameters is outperformed by a convolutional network with 24 thousand.

---

## Huge Win!

* Reduces memory requirements
* Requires less data and gets better performance
* Generalizes better
  * Testing loss is closer to training loss compared to linear networks

---

## Why?

* Why does this work so well?
  * Real data has spatial relationships
    * Meaning structure
  * Convolutions are spatially invariant
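
One way to see this property: the same kernel is applied at every position, so moving a pattern in the input moves the response in the output without changing it. The sketch below assumes a 1D signal and a made-up kernel.

```python
def conv1d(signal, kernel):
    """Slide `kernel` across `signal` (no padding, stride 1)."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


kernel = [1, 2, 1]  # hypothetical kernel
pattern_left = [0, 0, 5, 9, 5, 0, 0, 0, 0]
pattern_right = [0, 0, 0, 0, 5, 9, 5, 0, 0]  # same pattern, shifted right by 2

left = conv1d(pattern_left, kernel)
right = conv1d(pattern_right, kernel)

print(left)   # [5, 19, 28, 19, 5, 0, 0]
print(right)  # [0, 0, 5, 19, 28, 19, 5]
```

The response to the pattern is identical in both cases; only its position changes.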

---

## Spatial Invariance

* Convolutions (generally) don't learn sample specific solutions
  * Whereas linear layers love them
* Gradients point to minima that must be consistent across samples and locations
* Intuitive: convolutions use fewer weights, so they must be biased towards simpler solutions

<div class="col">
<img style="width: 80%" class="r-stretch" src="./figures/10_multi_layer_nn_highvar_L2_100_100_100_neurons_adam_solver_relu.svg" />
<small>Here's an example of linear layers fitting to a single data point, even with L2 regularization.</small>
</div>

---

## Regularization

* What does this mean for regularization?
* Regularization tends to "improve" models in areas missing datapoints
* So we still need to regularize our convolutional networks
  * And (most) convolutional networks still use linear layers for classifiers

---

## Recall

<div class="col">
<img style="width: 80%" class="r-stretch" src="./figures/UDL/Chap09/RegDropout.svg" />
<small>Dropout, a regularizer, fixed the non-smooth area of the output. That wouldn't have been there if we had training data in that gap.</small>
</div>

---

## Efficiency

* Those are nice properties, but how is this weight-efficient?
* CIFAR images are 32 by 32 for 1024 pixels
  * A fully connected layer would need 1024 weights per neuron for each color channel (3072 for RGB)
* A 3 by 3 kernel is 9 weights per input channel, regardless of the image size
  * But that kernel can only see small features
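
The arithmetic can be spelled out as follows. The variable names are illustrative, and biases are ignored for simplicity.

```python
height, width, channels = 32, 32, 3  # CIFAR-10 image shape

pixels = height * width
# A fully connected neuron needs one weight per input value.
dense_weights_per_neuron = pixels * channels

kernel_size = 3
# A convolution filter needs one weight per kernel cell per input channel.
conv_weights_per_channel = kernel_size * kernel_size
conv_weights_per_filter = conv_weights_per_channel * channels

print(pixels)                    # 1024
print(dense_weights_per_neuron)  # 3072
print(conv_weights_per_channel)  # 9
print(conv_weights_per_filter)   # 27
```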

---

## Receptive Fields

* A convolution cannot "see" an entire image
  * A 3x3 or 5x5 convolution is limited to just that area of pixels
* But stacked convolutions have increased **receptive fields**
* After two 3x3 filters, each pixel in the feature map is influenced by a 5x5 area of the original input
  * Another 3x3 makes it 7x7, and so on
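
The growth of the receptive field can be tracked with a short helper. This is a sketch under the usual assumptions (square kernels, no dilation); the function name and the `(kernel_size, stride)` layer description are made up for the example.

```python
def receptive_field(layers):
    """Receptive field after each layer.

    `layers` is a list of (kernel_size, stride) pairs, first layer first.
    """
    field = 1  # one output value starts out "seeing" one input pixel
    jump = 1   # distance in input pixels between adjacent outputs
    sizes = []
    for kernel_size, stride in layers:
        field += (kernel_size - 1) * jump
        jump *= stride
        sizes.append(field)
    return sizes


# Three stacked 3x3 convolutions with a stride of 1.
print(receptive_field([(3, 1)] * 3))  # [3, 5, 7]

# A stride of 2 in the middle makes later layers grow the field faster.
print(receptive_field([(3, 1), (3, 2), (3, 1)]))  # [3, 5, 9]
```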

---

## Feature Maps

* You may wonder what the features in the feature maps mean
  * After one or two convolutions, we can describe them as lines or curves
* But after more convolutions, features become conceptual
  * "trainness" or "catness"
  * The network has projected the original pixel information along new axes

---

## Features and Applications

* People realized right away that features could be used for more than regression and classification
* Back in the 1990s, people were already using them for image reconstruction

---

## LeNet

* Named for Yann LeCun
* Early DNN product: digital document scanning
* 45-page description of product development: [Gradient-Based Learning Applied to Document Recognition](https://cs.nyu.edu/~yann/2010f-G22-2565-001/diglib/lecun-98.pdf)
* More approachable read: [Handwritten Digit Recognition with a Back-Propagation Network](https://proceedings.neurips.cc/paper/1989/file/53c3bce66e43be4f209556518c2fcb54-Paper.pdf)

---

## LeNet Output

* You would expect this to be a classifier
* But instead, the output was a 7x12 pixel image
* Then Euclidean distance to the ideals was used both to classify and quantify uncertainty

---

## Exemplar Idea

* For a product, they needed special outputs
  * Had to be interpretable
  * Had to encode uncertainty
* So LeNet was trained to produce "ideal" characters from the input

---

## Interpretability

* Similar characters (0 and O, for example) have similar "ideal" images
  * This naturally encodes their similarity in the loss function
  * An uncertain output would be about the same Euclidean distance from both
* Prevents collapsing into "all one or the other" decisions
  * Allows Markov model probabilities to decode letters using contextual information
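
To make the distance idea concrete, here is a toy sketch. The 3x3 "ideal" bitmaps, the labels, and the `classify` helper are all hypothetical (LeNet used 7x12 images); only the use of Euclidean distance to the ideals comes from the slides.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical 3x3 "ideal" characters, flattened row by row.
ideals = {
    "I": [0, 1, 0,
          0, 1, 0,
          0, 1, 0],
    "L": [1, 0, 0,
          1, 0, 0,
          1, 1, 1],
    "T": [1, 1, 1,
          0, 1, 0,
          0, 1, 0],
}


def classify(output):
    """Return (label, distance) pairs, closest ideal first."""
    distances = {label: euclidean(output, ideal) for label, ideal in ideals.items()}
    return sorted(distances.items(), key=lambda item: item[1])


# A made-up network output that sits halfway between "I" and "T".
ambiguous = [0.5, 1, 0.5,
             0, 1, 0,
             0, 1, 0]

for label, distance in classify(ambiguous):
    print(label, round(distance, 3))
```

The output is equally close to "I" and "T" and far from "L", which is the kind of graded signal a downstream decoder can combine with context.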

---

## Modern Uses

* Now our models are bigger
* Convolutions are part of any model that does:
  * Image segmentation, modification, and generation
  * Gaze tracking, pose recognition, expression or pose transfer
* Even in transformer models (a later topic) there is still a convolution

---

## Classification + Regression

* We can combine classification and regression outputs
  * Predict *both* where something is and what it is
* This is the idea behind the well-known YOLO paper
  * [YOLO, CVPR 2016 Paper](https://www.cv-foundation.org/openaccess/content_cvpr_2016/html/Redmon_You_Only_Look_CVPR_2016_paper.html)
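
A minimal sketch of what combining the two outputs in one loss could look like. This is not the loss from the YOLO paper; the function names, the box encoding, and the `box_weight` parameter are assumptions chosen for illustration.

```python
import math


def softmax(logits):
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def combined_loss(class_logits, true_class, predicted_box, true_box, box_weight=1.0):
    """Cross-entropy for "what it is" plus mean squared error for "where it is"."""
    probabilities = softmax(class_logits)
    classification = -math.log(probabilities[true_class])
    regression = sum((p - t) ** 2 for p, t in zip(predicted_box, true_box)) / len(true_box)
    return classification + box_weight * regression


# Hypothetical prediction: three class scores and a box as (x, y, width, height).
loss = combined_loss(
    class_logits=[2.0, 0.5, -1.0],
    true_class=0,
    predicted_box=[0.4, 0.5, 0.2, 0.3],
    true_box=[0.5, 0.5, 0.2, 0.2],
)
print(loss)
```

Both terms produce gradients for the shared convolutional features, so the network is pushed to learn features useful for both tasks.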

---

## Object Detection

---

## Semantic Segmentation

---

## Convolution Backwards

* Tasks like semantic segmentation effectively run convolutions in reverse
  * Collapse image to features, then expand back to an image
  * Called deconvolutions, but are really [transposed convolutions](https://docs.pytorch.org/docs/stable/generated/torch.nn.modules.conv.ConvTranspose2d.html)
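
A 1D sketch of a transposed convolution may help: each input value "stamps" a scaled copy of the kernel into a larger output. The function name is made up, and padding is assumed to be zero.

```python
def conv_transpose1d(signal, kernel, stride=1):
    """Transposed 1D convolution with no padding."""
    k = len(kernel)
    out = [0] * ((len(signal) - 1) * stride + k)
    for i, value in enumerate(signal):
        for j, weight in enumerate(kernel):
            # Overlapping stamps are summed.
            out[i * stride + j] += value * weight
    return out


# Three inputs expand to seven outputs with a stride of 2.
print(conv_transpose1d([1, 2, 3], [1, 1, 1], stride=2))  # [1, 1, 3, 2, 5, 3, 3]
```

Where a strided convolution shrinks its input, the transposed version grows it, which is how features are expanded back to image size.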

---

## Interpreting Features

* There are many ways to interpret meaning from convolutions
  * For a broad survey, see: [How Convolutional Neural Networks See the World](https://arxiv.org/abs/1804.11191)
* Different visualizations have different levels of utility
  * Some are more amusing than useful, but as a whole they increase visibility into what ConvNets are doing

---

## Phone Optimized

* Modern ConvNets are so optimized that we can easily run them on our phones
  * Example: [MobileNetV3](https://arxiv.org/abs/1905.02244)
* Training is resource intensive, inference isn't
  * This is obviously an example of technique and hardware growing together
* But how much of our ability to advance is actually from better theory?

---

## AlexNet Quote, 2012

> In the end, the network's size is limited mainly by the amount of memory available on current GPUs and by the amount of training time that we are willing to tolerate. Our network takes between five and six days to train on two GTX 580 3GB GPUs. All of our experiments suggest that our results can be improved simply by waiting for faster GPUs and bigger datasets to become available.

---

## AlexNet

* AlexNet was the paper that kicked off the modern machine learning craze
  * It's important to mention that other groups were also close
* What pushed this work ahead was the lead author's existing familiarity with writing GPU code
* So were they right? Can we just add data and faster GPUs to get better results?

---

## Limits

* Okay, great, DL is a hammer, let's swing it
  * Let's begin with our baseline

```python
DeepConvNet(
  (net): Sequential(
    (0): Conv2d(3, 15, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): ReLU()
    (2): Conv2d(15, 15, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (3): ReLU()
    (4): Conv2d(15, 15, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (5): ReLU()
    (6): Conv2d(15, 15, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (7): ReLU()
    (8): Conv2d(15, 15, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (9): ReLU()
    (10): Flatten(start_dim=1, end_dim=-1)
    (11): Linear(in_features=240, out_features=60, bias=True)
    (12): Linear(in_features=60, out_features=10, bias=True)
  )
  (decision): Softmax(dim=1)
)
```
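
The parameter count of this baseline can be checked by hand. The two helper functions below are illustrative and assume square kernels and a bias on every layer, as in the printout above.

```python
def conv_params(in_channels, out_channels, kernel_size=3):
    """Weights plus one bias per output channel."""
    return in_channels * out_channels * kernel_size * kernel_size + out_channels


def linear_params(in_features, out_features):
    """Weights plus one bias per output feature."""
    return in_features * out_features + out_features


baseline = (
    conv_params(3, 15)            # first convolution
    + 4 * conv_params(15, 15)     # remaining four convolutions
    + linear_params(240, 60)
    + linear_params(60, 10)
)
print(baseline)  # 23650
```

Note that the first linear layer alone accounts for 14,460 of those parameters.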

---

## Deep Baseline

---

## Two More Layers

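* Model parameters decrease from 23,650 to 16,930.
  * The extra stride 2 convolution shrinks the flattened features from 240 to 60, so the first linear layer is much smaller.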

```python
DeeperConvNet(
  (net): Sequential(
    (0): Conv2d(3, 15, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): ReLU()
    (2): Conv2d(15, 15, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (3): ReLU()
    (4): Conv2d(15, 15, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (5): ReLU()
    (6): Conv2d(15, 15, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (7): ReLU()
    (8): Conv2d(15, 15, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (9): ReLU()
    (10): Conv2d(15, 15, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (11): ReLU()
    (12): Conv2d(15, 15, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (13): ReLU()
    (14): Flatten(start_dim=1, end_dim=-1)
    (15): Linear(in_features=60, out_features=60, bias=True)
    (16): Linear(in_features=60, out_features=10, bias=True)
  )
  (decision): Softmax(dim=1)
)
```
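
It may seem odd that a deeper model has fewer parameters. Tracing the feature map size through each stack shows why. The helpers are a sketch that assumes 3x3 kernels with a padding of 1, as in the printouts.

```python
def conv_output_size(size, kernel_size=3, stride=1, padding=1):
    return (size + 2 * padding - kernel_size) // stride + 1


def flattened_features(strides, input_size=32, channels=15):
    """Number of values handed to the first linear layer."""
    size = input_size
    for stride in strides:
        size = conv_output_size(size, stride=stride)
    return channels * size * size


baseline_strides = [1, 2, 1, 2, 2]
deeper_strides = [1, 2, 1, 2, 1, 2, 2]

print(flattened_features(baseline_strides))  # 240 (15 channels of 4x4)
print(flattened_features(deeper_strides))    # 60  (15 channels of 2x2)

# The smaller first linear layer saves more than the two new convolutions cost.
saved = (240 * 60 + 60) - (60 * 60 + 60)
added = 2 * (15 * 15 * 3 * 3 + 15)
print(saved, added, saved - added)  # 10800 4080 6720
```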

---

## Deeper Results

---

## Disappointment!

* It turns out that the AlexNet authors weren't 100% correct
* We *do* need some improvements to our technique
  * Otherwise training is too slow
  * Or we may never converge to a good result
* Let's throw in one new layer: batch normalization

---

## Batch Normalization

* Proposed by Ioffe and Szegedy in 2015
  * [Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift](https://proceedings.mlr.press/v37/ioffe15.html)
* ReLU became popular because it enabled faster training than Tanh
  * But when a unit's input gets stuck in the < 0 area, its gradient is zero and learning stops
  * This kind of problem should be partially fixed by He initialization, but learning is still random

---

## Batch Norm Mechanics

* In PyTorch: [BatchNorm2d](https://docs.pytorch.org/docs/stable/generated/torch.nn.BatchNorm2d.html#batchnorm2d)
* Keeps running estimates of the mean and variance of its inputs
* Normalizes to 0 mean and unit variance
  * This is what He initialization achieved at initialization
  * This enforces it, even after the parameters change
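
The mechanics can be sketched for a single channel of scalar activations. This is a simplified illustration, not PyTorch's implementation: the class name is made up, and it uses the biased variance everywhere, whereas PyTorch stores an unbiased estimate in its running variance.

```python
class SingleChannelBatchNorm:
    """Batch normalization for one channel (simplified sketch)."""

    def __init__(self, eps=1e-5, momentum=0.1):
        self.eps = eps
        self.momentum = momentum
        self.weight = 1.0  # learnable scale
        self.bias = 0.0    # learnable shift
        self.running_mean = 0.0
        self.running_var = 1.0

    def forward(self, batch, training=True):
        if training:
            # Use this batch's statistics and update the running estimates.
            mean = sum(batch) / len(batch)
            var = sum((x - mean) ** 2 for x in batch) / len(batch)
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mean
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            # At inference time, rely on the running estimates instead.
            mean, var = self.running_mean, self.running_var
        scale = self.weight / (var + self.eps) ** 0.5
        return [(x - mean) * scale + self.bias for x in batch]


bn = SingleChannelBatchNorm()
out = bn.forward([2.0, 4.0, 6.0, 8.0])
print(out)                              # roughly zero mean, unit variance
print(bn.running_mean, bn.running_var)  # moved 10% of the way toward 5.0 and 5.0
```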

---

## Also a Regularizer

* Because the estimates change from batch to batch, this adds noise directly to the layer outputs
* If you recall from our discussion of regularizers, this is also a regularization technique
  * For free!
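
A small sketch of where that noise comes from: the same activation is normalized differently depending on which other samples share its batch. The `normalize` helper and the batch values are made up for illustration.

```python
def normalize(batch, eps=1e-5):
    """Normalize a batch of scalars to zero mean and unit variance."""
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [(x - mean) / (var + eps) ** 0.5 for x in batch]


sample = 3.0
batch_a = [sample, 1.0, 2.0, 4.0]   # sample is above this batch's mean
batch_b = [sample, 5.0, 6.0, 10.0]  # sample is below this batch's mean

print(normalize(batch_a)[0])  # positive
print(normalize(batch_b)[0])  # negative
```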

---

## New Model

* Deeper model now has 17,020 parameters, up from 16,930
  * Each of the three BatchNorm layers also has a weight and a bias parameter per channel
  * Each layer is applied to 15 channels, so $15 \times 2 \times 3 = 90$

```python
DeeperBNConvNet(
  (net): Sequential(
    (0): Conv2d(3, 15, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): ReLU()
    (2): Conv2d(15, 15, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (3): BatchNorm2d(15, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (4): ReLU()
    (5): Conv2d(15, 15, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (6): ReLU()
    (7): Conv2d(15, 15, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (8): BatchNorm2d(15, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (9): ReLU()
    (10): Conv2d(15, 15, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (11): ReLU()
    (12): Conv2d(15, 15, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (13): BatchNorm2d(15, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (14): ReLU()
    (15): Conv2d(15, 15, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (16): ReLU()
    (17): Flatten(start_dim=1, end_dim=-1)
    (18): Linear(in_features=60, out_features=60, bias=True)
    (19): Linear(in_features=60, out_features=10, bias=True)
  )
  (decision): Softmax(dim=1)
)
```

---

## Deeper With BatchNorm

---

## Accuracy Comparisons

<div class="col">
<img style="width: 65%" class="r-stretch" src="./figures/10_cifar10_mlp_acc.png" />
<br/>
<small>Linear</small>
<br/>
<img style="width: 65%" class="r-stretch" src="./figures/10_cifar10_conv_baseline_acc.png" />
<br/>
<small>Baseline ConvNet</small>
</div>

<div class="col">
<img style="width: 65%" class="r-stretch" src="./figures/11_cifar10_deepernobn_conv_acc.png" />
<br/>
<small>Deeper ConvNet</small>
<br/>
<img style="width: 65%" class="r-stretch" src="./figures/11_cifar10_deeperbn_conv_acc.png" />
<br/>
<small>Deeper with Batch Norm</small>
</div>

---

## Schedule

* We're magically a day ahead of the syllabus
* So we can use the next lecture as review

<table>
<tr><td>           </td><td> 26 </td><td> Thursday  </td><td>  L11       </td><td>  Convnets                </td><td> HW3: Load and multiply convnet weights (mini HW) </td></tr>
<tr><td>March      </td><td> 3  </td><td> Tuesday   </td><td>  L12       </td><td>  ConvNets                </td><td> </td></tr>
<tr><td>           </td><td> 5  </td><td> Thursday  </td><td>            </td><td>                          </td><td> Midterm (in class) </td></tr>
<tr><td>           </td><td> 10 </td><td> Tuesday   </td><td>  L13       </td><td>  Advanced convnets       </td><td> </td></tr>
<tr><td>           </td><td> 12 </td><td> Thursday  </td><td>  L14       </td><td>  Advanced convnets       </td><td> HW4: Convnet Bad Apples </td></tr>
</table>