# CS 462 - Lecture 10

## Convolutions

Bernhard Firner

2026-02-24

---

## Book

* Today we'll be going through chapter 10 of the [book](https://udlbook.github.io/udlbook/)
* We'll also go over some outside material
* Because convolutions are important!

---

## Linear Limitations

* Linear networks (networks built from fully connected linear layers) work fine
* But deep learning with linear networks isn't state of the art
* Why?
* Memory and compute intensive
* Tend to learn incorrect rules

---

## Decision Boundaries

* "Tend to learn incorrect rules" sounds squishy
* So let's take a moment to talk about decision boundaries
* Consider data points that have two attributes, x and y
* When we train a classifier, there is a line where the linear network predicts either class as equally likely

---

## ReLU Decision Boundary
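A minimal sketch of how such a boundary can be located, assuming a tiny two-input ReLU network with made-up weights. This is not the model behind the lecture's plots; `predict`, `boundary_y`, and every weight value are hypothetical. The boundary is the set of points where the predicted probability is exactly 0.5, so for a fixed `x` we can bisect along `y` to find it.

```python
import math

# Hypothetical weights for a 2-input, 2-hidden-unit ReLU network.
W1 = [[1.0, -1.0], [0.5, 1.0]]  # hidden weights, one row per hidden unit
B1 = [0.0, -1.0]                # hidden biases
W2 = [-1.0, 2.0]                # output weights
B2 = -1.0                       # output bias


def predict(x, y):
    """Probability of the positive class for the point (x, y)."""
    hidden = [max(0.0, w[0] * x + w[1] * y + b) for w, b in zip(W1, B1)]
    logit = sum(w * h for w, h in zip(W2, hidden)) + B2
    return 1.0 / (1.0 + math.exp(-logit))


def boundary_y(x, lo=-10.0, hi=10.0, steps=60):
    """Bisect along y for the point where predict(x, y) crosses 0.5.

    Assumes predict(x, lo) < 0.5 < predict(x, hi), which holds for these
    weights because the logit never decreases as y grows.
    """
    for _ in range(steps):
        mid = (lo + hi) / 2.0
        if predict(x, mid) < 0.5:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


for x in [0.0, 1.0, 2.0, 3.0]:
    print(x, round(boundary_y(x), 3))
```

With these particular weights the boundary is piecewise linear with a kink at `x = 1`, which is typical of ReLU networks: each hidden unit switching on or off changes the slope.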
---

## Tanh Decision Boundary
---

## Boundary Quality

* The decision boundaries are different, but both have 0 error on the training set
* Can we tell if they are good boundaries?
* We can look at the **margins**, the space between the boundary and the nearest points of each class
* The ReLU boundary sticks close to the lower class
* The Tanh boundary is probably too close to the upper class

---

## Variance

* Let's add in a training point that looks like it's in the wrong area
* This could represent a mislabelled datapoint
* Or it could represent natural variance of the actual data
* The source of the problem isn't important

---

## ReLU Decision Boundary

* Our current model isn't "solving" the problem, so we add layers
---

## ReLU Decision Boundary

* Now we have a solution, but is this good?
* A human might look at that point and let it go, but that's not how SGD works
---

## Regularization

* Let's increase L2 regularization to $\lambda=0.01$
* Not really better
---

## Why?

* Why does the linear network contort to make such complicated decision boundaries?
* It has no implicit bias towards smooth solutions or boundaries
* Linear neural networks match any observed variance in the data, whenever possible
* A statistical approach would set a sane boundary, but DNNs are iterative
* A bit of regularization can't stop this

---

## Lack of Structure

* Imagine if we shuffled all of the pixels in the digits training set
* And the same pixel shuffle is applied to each image
* Would this change linear network results?
* No!

---

## Image Structure

* Images (and other types of data) have structure
* Ignoring it is bad
* Linear networks learn specific rules at each pixel rather than general structural rules
* There is no contradiction when a linear network learns a mislabel on very similar data

---

## Efficiency

* Our linear network generalizes somewhat (otherwise it wouldn't work)
* But if a letter is shifted over a few pixels, we need training data with the same shift
* Similarity is only found in exact pixel locations, not features
* This is probably inefficient!

---

## Alternative

* Let's say we make a small linear network of 9 weights
* Instead of "seeing" the entire image, we use a 3x3 block of pixels as "x"
* $f[x] = \beta + \sum_{i=0}^{2}\sum_{j=0}^{2} x_{ij}\,\omega_{3\times i + j}$
* Then we take the same set of 9 weights, and apply them to the entire image
* Let's call this output a **feature map**

---

## What happens?
* Feed the feature map to a linear layer and train a classifier
* The gradient in every position in the feature map applies to the same 9 weights
* Learning can be faster because of the increased number of gradients
* But only if those gradients are consistent

---

## Similarities

* Recall that SGD pushes towards a good minimum because good minima should be shared across all data points
* Our 9 weights are pushed in a similar way
* If there is some common feature that is useful for classification, it produces a consistent gradient
* That should rapidly push our 9 weights in a useful direction

---

## Spatial Correlation

* Are there similar features found across images?
* Of course! Otherwise we would be looking at noise!
* Our set of 9 weights is convolved across the image, so we call it a **convolution**
* Convolutions search for spatial correlations in groups of pixels

---

## Spatial Invariance

* The weights do not change as they are moved over the image
* A linear network is happy to learn noise, or special rules for something mislabelled
* A convolution should not know where in the image it is being applied
* So the features are consistent, regardless of the source location
* This idea is called **spatial invariance** (more precisely, translation equivariance: shift the input and the feature map shifts with it)

---

## Weight Efficiency

* The 28x28 digits have 784 pixels
* And every neuron in the first linear layer must have 784 weights
* A 3x3 convolution has 9 weights
* We could have 87 convolutions with the same number of weights as a single neuron
* So convolutions are memory efficient too!

---

## Kernel Size

* Our example of 9 pixels would be called a 3x3 convolution
* The entire set of weights is called a kernel (or, sometimes, filter)
* How is the kernel applied to the input?
* If it is evaluated at every possible location, that is called stride 1
* We can also skip pixels between evaluations (a stride greater than 1)
* A 3x3 kernel still covers the skipped pixels, for example, because they are adjacent to an evaluated location

---

## Padding

* What happens at the edge of an image?
* The 3x3 kernel can be evaluated in 26x26 locations on the 28x28 digits
* Any closer to the edge, the kernel would need pixels that don't exist
* But should we really skip evaluation near the edges? What if our kernel is 5x5 or 7x7?
* We can **pad** the edge of the image, adding enough 0s that the convolution can be evaluated anywhere

---

## Visualization

* Consider a 3x1 kernel run over adjacent values
* Padding increases the output feature map size
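The 3x1 case can be sketched in a few lines of plain Python. The function name `conv1d`, the input values, and the kernel weights are all made up for illustration; the point is only how stride and zero padding change the number of outputs.

```python
def conv1d(values, kernel, bias=0.0, stride=1, padding=0):
    """Slide `kernel` over `values`, zero-padding both ends by `padding`."""
    padded = [0.0] * padding + list(values) + [0.0] * padding
    k = len(kernel)
    out = []
    # Evaluate the kernel at every `stride`-th position where it fully fits.
    for start in range(0, len(padded) - k + 1, stride):
        window = padded[start:start + k]
        out.append(bias + sum(x * w for x, w in zip(window, kernel)))
    return out


values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
kernel = [1.0, 0.0, -1.0]  # hypothetical weights

no_pad = conv1d(values, kernel)                        # kernel never leaves the input
same = conv1d(values, kernel, padding=1)               # one zero added on each side
strided = conv1d(values, kernel, padding=1, stride=2)  # skip every other position

print(len(no_pad), len(same), len(strided))
```

Without padding the 6 inputs give 4 outputs; one zero on each side brings that back to 6, and stride 2 on the padded input gives 3.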
---

## Image Visualization

* An image has another dimension, and our convolution becomes 3x3
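A minimal sketch of the 3x3 case, using the same indexing as the formula $f[x] = \beta + \sum_i\sum_j x_{ij}\omega_{3\times i + j}$ from earlier. The helper names (`conv2d_3x3`, `blank`), the edge-detecting weights, and the single-pixel test images are assumptions made for illustration, not part of the lecture code.

```python
def conv2d_3x3(image, weights, bias=0.0):
    """Apply one 3x3 kernel (9 weights, row-major) at every valid location."""
    rows, cols = len(image), len(image[0])
    feature_map = []
    for r in range(rows - 2):
        out_row = []
        for c in range(cols - 2):
            total = bias
            for i in range(3):
                for j in range(3):
                    total += image[r + i][c + j] * weights[3 * i + j]
            out_row.append(total)
        feature_map.append(out_row)
    return feature_map


def blank(size):
    """A size x size image of zeros."""
    return [[0.0] * size for _ in range(size)]


# Hypothetical kernel: positive weights on the left column, negative on the right.
weights = [1.0, 0.0, -1.0,
           1.0, 0.0, -1.0,
           1.0, 0.0, -1.0]

# A 28x28 image with one bright pixel, and the same image shifted by (2, 3).
image = blank(28)
image[10][10] = 1.0
shifted = blank(28)
shifted[12][13] = 1.0

fmap = conv2d_3x3(image, weights)
fmap_shifted = conv2d_3x3(shifted, weights)

print(len(fmap), len(fmap[0]))  # 26 26
```

Because the same 9 weights are used everywhere, shifting the input by (2, 3) shifts the feature map by (2, 3) and changes nothing else: `fmap_shifted[r + 2][c + 3] == fmap[r][c]` wherever both indices are in range.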
---

## Channels Visualization

* With color channels, or a feature map, the convolution gains a new dimension
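One way to picture the extra dimension, as a sketch: a single output value sums over every channel of the patch, so a 3x3 kernel on an RGB image holds 3 x 3 x 3 = 27 weights. The names `conv_at` and `kernel_weight_count`, and the patch and kernel values, are hypothetical.

```python
def conv_at(patch, kernel, bias=0.0):
    """One output value from a multi-channel 3x3 patch.

    `patch` and `kernel` are both indexed [channel][row][col].
    """
    total = bias
    for patch_c, kernel_c in zip(patch, kernel):
        for patch_row, kernel_row in zip(patch_c, kernel_c):
            for x, w in zip(patch_row, kernel_row):
                total += x * w
    return total


def kernel_weight_count(in_channels, height=3, width=3):
    """Weights in one kernel that spans every input channel (bias excluded)."""
    return in_channels * height * width


# A hypothetical RGB patch: red channel all ones, green and blue all zeros.
ones = [[1.0] * 3 for _ in range(3)]
zeros = [[0.0] * 3 for _ in range(3)]
rgb_patch = [ones, zeros, zeros]

# A hypothetical kernel that averages the red channel and ignores the others.
kernel = [[[1.0 / 9.0] * 3 for _ in range(3)], zeros, zeros]

value = conv_at(rgb_patch, kernel)
print(kernel_weight_count(3), kernel_weight_count(16))
```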
---

## 1D, 2D, 3D, etc

* You can think of the hidden layer convolutions as 3D
* Since they have 3 dimensions (x, y, channels)
* We can match the dimensions of our convolutions to any kind of data
* Sounds, images, text, etc
* As long as it has structure

---

## Stacking Convolutions

* Convolutions learn features
* Should we use them once and return to linear layers?
* Or do we build features on top of features?
* We stack them, feature maps becoming the inputs to the next convolution
* Stacked convolutions mean that input pixels affect larger areas of the feature maps

---

## Receptive Fields

* Consider the 3x1 kernel, with input X and feature map f
* Input pixels $[x_1, x_2, x_3][\omega_1, \omega_2, \omega_3]^T$ become $f_1$, the first value in the feature map
* $[x_2, x_3, x_4][\omega_1, \omega_2, \omega_3]^T = f_2$ and $[x_3, x_4, x_5][\omega_1, \omega_2, \omega_3]^T = f_3$
* The first output of a second convolution is $[f_1, f_2, f_3][\omega_1', \omega_2', \omega_3']^T$
* The second kernel uses values derived from $x_1 \dots x_5$, so it has a receptive field of 5

---

## Receptive Field & Depth

* Successive width 3 convolutions have wider receptive fields with greater depth
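The growth can be computed directly. The sketch below uses the standard recurrence for stacked 1D convolutions: each layer adds `(kernel_size - 1)` times the current spacing between adjacent values, and a stride multiplies that spacing. The function name `receptive_field` is made up for this example.

```python
def receptive_field(layers):
    """Receptive field of one output value after a stack of 1D convolutions.

    `layers` is a list of (kernel_size, stride) pairs, input side first.
    """
    field = 1  # a single input value sees only itself
    jump = 1   # spacing, in input pixels, between adjacent values at this depth
    for kernel_size, stride in layers:
        field += (kernel_size - 1) * jump
        jump *= stride
    return field


# Width 3, stride 1: the field grows by 2 with every added layer.
for depth in range(1, 5):
    print(depth, receptive_field([(3, 1)] * depth))
```

Two stacked width 3 convolutions give the receptive field of 5 worked out above. With stride 2 the field grows much faster: three such layers already cover 15 input values.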
---

## Input Channels

* If we run 16 convolutions on our input image, we have 16 feature maps
* Do the next convolutions choose a single feature map?
* They could, but that isn't how they are generally implemented
* Usually, each convolution takes all of the input feature maps as input
* Generally, we add more feature maps at later hidden layers

---

## Datasets

* We are going to stick with images
* Because they go well with presentations
* We've already seen digits
* Let's move on to the slightly more difficult CIFAR
* [https://www.cs.toronto.edu/~kriz/cifar.html](https://www.cs.toronto.edu/~kriz/cifar.html)

---

## CIFAR-10

* 10 classes, 6000 images per class (5000 for training, 1000 for testing)
* 32x32 RGB color images
* 10000 testing images
* Classes are airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck

---

## CIFAR-10 Examples
---

## More Variety

* What are these?
---

## Baseline

* Let's use the linear network from last time
* This model has 1,645,950 trainable parameters.

```python
Linear(
  (net): Sequential(
    (0): Flatten(start_dim=1, end_dim=-1)
    (1): Linear(in_features=3072, out_features=512, bias=True)
    (2): ReLU()
    (3): Linear(in_features=512, out_features=120, bias=True)
    (4): ReLU()
    (5): Linear(in_features=120, out_features=84, bias=True)
    (6): Linear(in_features=84, out_features=10, bias=True)
  )
  (decision): Softmax(dim=1)
)
```

---

## MLP Baseline
---

## ConvNet

```python
import torch


class ConvNet(torch.nn.Module):
    """A mostly faithful recreation of LeNet 5."""

    def __init__(self, nonlinearity=torch.nn.ReLU, classes=10):
        super(ConvNet, self).__init__()
        self.net = torch.nn.Sequential(
            # Basic ConvNet as in the 20xx years
            # Each stride-2 convolution halves the map: 32 -> 16 -> 8 -> 4
            torch.nn.Conv2d(in_channels=3, out_channels=15, kernel_size=(3,3), padding=1, stride=2),
            nonlinearity(),
            torch.nn.Conv2d(in_channels=15, out_channels=15, kernel_size=(3,3), padding=1, stride=2),
            nonlinearity(),
            torch.nn.Conv2d(in_channels=15, out_channels=15, kernel_size=(3,3), padding=1, stride=2),
            nonlinearity(),
            torch.nn.Flatten(),
            # 15 channels * 4 * 4 = 240 flattened features
            torch.nn.Linear(240, 60),
            torch.nn.Linear(60, classes),
        )
        self.decision = torch.nn.Softmax(dim=1)
        # Kaiming initialization for the three convolutions and the first linear layer
        for layer in [0, 2, 4, 7]:
            torch.nn.init.kaiming_normal_(self.net[layer].weight.data, nonlinearity="relu")

    def forward(self, x):
        """Forward through the network."""
        y_hat = self.decision(self.net(x))
        return y_hat
```

---

## Convnet Size

* Model has 19,570 parameters
* Roughly two orders of magnitude smaller than the 1,645,950-parameter linear baseline (about 84x)

```python
ConvNet(
  (net): Sequential(
    (0): Conv2d(3, 15, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (1): ReLU()
    (2): Conv2d(15, 15, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (3): ReLU()
    (4): Conv2d(15, 15, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (5): ReLU()
    (6): Flatten(start_dim=1, end_dim=-1)
    (7): Linear(in_features=240, out_features=60, bias=True)
    (8): Linear(in_features=60, out_features=10, bias=True)
  )
  (decision): Softmax(dim=1)
)
```

---

## Conv Baseline
---

## Comparison

* MLP on left, convnet on the right
* 2 orders of magnitude fewer parameters, similar performance
* And notice that the training performance is more indicative of testing performance
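The two parameter counts can be reproduced by hand from the layer shapes printed earlier. The helper names `linear_params` and `conv_params` are made up for this sketch; the layer sizes come from the two model printouts.

```python
def linear_params(in_features, out_features):
    """Weights plus one bias per output."""
    return in_features * out_features + out_features


def conv_params(in_channels, out_channels, kernel=3):
    """One kernel x kernel x in_channels filter, plus a bias, per output channel."""
    return in_channels * out_channels * kernel * kernel + out_channels


# The linear baseline: 3072 -> 512 -> 120 -> 84 -> 10
mlp = (linear_params(3072, 512) + linear_params(512, 120)
       + linear_params(120, 84) + linear_params(84, 10))

# The ConvNet: three 3x3 convolutions, then 240 -> 60 -> 10
convnet = (conv_params(3, 15) + conv_params(15, 15) + conv_params(15, 15)
           + linear_params(240, 60) + linear_params(60, 10))

print(mlp, convnet, round(mlp / convnet))  # 1645950 19570 84
```

Most of the ConvNet's parameters (14,460 of 19,570) still sit in its first linear layer; the three convolutions together use only 4,500.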
---

## Doing Better

* We are using so few parameters
* So let's pump up those numbers with a larger ConvNet
* We'll try wider first, and then deeper

---

## Wider

```python
import torch


class WideConvNet(torch.nn.Module):
    """A wider variant of ConvNet: channels grow 15 -> 30 -> 60."""

    def __init__(self, nonlinearity=torch.nn.ReLU, classes=10):
        super(WideConvNet, self).__init__()
        self.net = torch.nn.Sequential(
            # Basic ConvNet as in the 20xx years
            torch.nn.Conv2d(in_channels=3, out_channels=15, kernel_size=(3,3), padding=1, stride=2),
            nonlinearity(),
            torch.nn.Conv2d(in_channels=15, out_channels=30, kernel_size=(3,3), padding=1, stride=2),
            nonlinearity(),
            torch.nn.Conv2d(in_channels=30, out_channels=60, kernel_size=(3,3), padding=1, stride=2),
            nonlinearity(),
            torch.nn.Flatten(),
            # 60 channels * 4 * 4 = 960 flattened features
            torch.nn.Linear(4*240, 60),
            torch.nn.Linear(60, classes),
        )
        self.decision = torch.nn.Softmax(dim=1)
        # Kaiming initialization for the three convolutions and the first linear layer
        for layer in [0, 2, 4, 7]:
            torch.nn.init.kaiming_normal_(self.net[layer].weight.data, nonlinearity="relu")

    def forward(self, x):
        """Forward through the network."""
        y_hat = self.decision(self.net(x))
        return y_hat
```

---

## Wider Params

* Model has 79,030 parameters
* Where are the parameters?
* The first linear layer has $960\times60 = 57600$ weights
* Reducing the final feature map to 1x1 would reduce the linear layer size

```python
WideConvNet(
  (net): Sequential(
    (0): Conv2d(3, 15, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (1): ReLU()
    (2): Conv2d(15, 30, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (3): ReLU()
    (4): Conv2d(30, 60, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (5): ReLU()
    (6): Flatten(start_dim=1, end_dim=-1)
    (7): Linear(in_features=960, out_features=60, bias=True)
    (8): Linear(in_features=60, out_features=10, bias=True)
  )
  (decision): Softmax(dim=1)
)
```

---

## Wider ConvNet

* Back to the 55% accuracy of the large linear net
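To see where the 79,030 parameters sit, and what a 1x1 final feature map would save, here is a rough per-layer breakdown. The function `wide_convnet_params` and its `final_map_side` argument are hypothetical; the 1x1 variant is a what-if, not a network trained in this lecture.

```python
def linear_params(in_features, out_features):
    """Weights plus one bias per output."""
    return in_features * out_features + out_features


def conv_params(in_channels, out_channels, kernel=3):
    """One kernel x kernel x in_channels filter, plus a bias, per output channel."""
    return in_channels * out_channels * kernel * kernel + out_channels


def wide_convnet_params(final_map_side):
    """Per-layer counts; `final_map_side` is the side length of the last feature map."""
    flat = 60 * final_map_side * final_map_side
    return {
        "conv 3->15": conv_params(3, 15),
        "conv 15->30": conv_params(15, 30),
        "conv 30->60": conv_params(30, 60),
        "linear flat->60": linear_params(flat, 60),
        "linear 60->10": linear_params(60, 10),
    }


as_built = wide_convnet_params(4)  # 4x4 final map, 960 flattened inputs
shrunk = wide_convnet_params(1)    # hypothetical 1x1 final map, 60 flattened inputs

for name, count in as_built.items():
    print(f"{name:>16}: {count}")
print(sum(as_built.values()), sum(shrunk.values()))  # 79030 25030
```

The first linear layer accounts for 57,660 of the 79,030 parameters (weights plus biases), so shrinking the final map is the cheapest way to cut the total.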
---

## Feature Map Sizes

* Remember, deeper networks should be more efficient
* But now we need to worry about feature map sizes
* With stride 2, we are cutting each spatial dimension in half with each layer
* But with stride 1 and proper padding, they would remain the same size
* We'll alternate between stride 1 and stride 2, going 5 convolutions deep

---

## Deeper

```python
import torch


class DeepConvNet(torch.nn.Module):
    """A deeper variant of ConvNet: five convolutions mixing stride 1 and stride 2."""

    def __init__(self, nonlinearity=torch.nn.ReLU, classes=10):
        super(DeepConvNet, self).__init__()
        self.net = torch.nn.Sequential(
            # Basic ConvNet as in the 20xx years
            # Feature map sides: 32 -> 32 -> 16 -> 16 -> 8 -> 4
            torch.nn.Conv2d(in_channels=3, out_channels=15, kernel_size=(3,3), padding=1, stride=1),
            nonlinearity(),
            torch.nn.Conv2d(in_channels=15, out_channels=15, kernel_size=(3,3), padding=1, stride=2),
            nonlinearity(),
            torch.nn.Conv2d(in_channels=15, out_channels=15, kernel_size=(3,3), padding=1, stride=1),
            nonlinearity(),
            torch.nn.Conv2d(in_channels=15, out_channels=15, kernel_size=(3,3), padding=1, stride=2),
            nonlinearity(),
            torch.nn.Conv2d(in_channels=15, out_channels=15, kernel_size=(3,3), padding=1, stride=2),
            nonlinearity(),
            torch.nn.Flatten(),
            # 15 channels * 4 * 4 = 240 flattened features
            torch.nn.Linear(240, 60),
            torch.nn.Linear(60, classes),
        )
        self.decision = torch.nn.Softmax(dim=1)
        # Kaiming initialization for the five convolutions
        for layer in [0, 2, 4, 6, 8]:
            torch.nn.init.kaiming_normal_(self.net[layer].weight.data, nonlinearity="relu")

    def forward(self, x):
        """Forward through the network."""
        y_hat = self.decision(self.net(x))
        return y_hat
```

---

## Deeper

* Model has 23,650 parameters
* The first linear layer has $240\times60 = 14400$ weights

```python
DeepConvNet(
  (net): Sequential(
    (0): Conv2d(3, 15, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): ReLU()
    (2): Conv2d(15, 15, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (3): ReLU()
    (4): Conv2d(15, 15, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (5): ReLU()
    (6): Conv2d(15, 15, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (7): ReLU()
    (8): Conv2d(15, 15, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (9): ReLU()
    (10): Flatten(start_dim=1, end_dim=-1)
    (11): Linear(in_features=240, out_features=60, bias=True)
    (12): Linear(in_features=60, out_features=10, bias=True)
  )
  (decision): Softmax(dim=1)
)
```

---

## Deeper Results

* Still reaching 55%
* And now it looks like we could go higher with more epochs
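The feature map sizes that determine the 240-input linear layer can be traced with the usual output-size formula, `(size + 2 * padding - kernel) // stride + 1`. The helper names `conv_out_size` and `trace_sizes` are made up for this sketch; the strides are the ones used by DeepConvNet.

```python
def conv_out_size(size, kernel=3, stride=1, padding=1):
    """Output side length of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_sizes(size, strides):
    """Side length after each layer in a stack of padded 3x3 convolutions."""
    sizes = [size]
    for stride in strides:
        sizes.append(conv_out_size(sizes[-1], stride=stride))
    return sizes


deep = trace_sizes(32, [1, 2, 1, 2, 2])
print(deep)                # [32, 32, 16, 16, 8, 4]
print(15 * deep[-1] ** 2)  # 240 flattened inputs to the first linear layer
```

Stride 1 with padding 1 keeps the size, and each stride 2 layer halves it, so the 32x32 CIFAR image ends as a 4x4 map with 15 channels.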
---

## ConvNet Details

* There is a lot of performance to squeeze from ConvNets
* 3 more lectures, at least
* And obviously many design choices
* How many channels? When do we downscale?
* And the feature maps from convolutions also unlock new applications

---

## Efficiency

* Convolutions are *efficient*
* If you only remember one thing, that should be it
* Linear layers are powerful (thanks to the Universal Approximation Theorem)
* But they are parameter hungry

---

## Training Data

* Each parameter requires training
* So it follows that a huge linear network requires lots of training data
* At least to converge to a good solution; you can always train with 3 samples and end up with a bad network
* Convolutional networks are a bit less data hungry than linear networks

---

## Generalization

* That data efficiency means that we expect convolutional networks to generalize better with less data
* Getting good data is hard, so this is a big deal!

---

## Next Topic

* How can we squeeze more out of convolutional networks?
* We'll look at a brief history of state of the art
* And we'll stick with CIFAR-10 for practical considerations