# CS 461 - Lecture 22

## Machine Learning Principles

### ResNets and Feature Vectors

Bernhard Firner

2025-11-17

---

## AlexNet
---

## Capacity

* Call model capacity $h$
* We train with $l$ samples
* If $h \ll l$ the training error is high
  * This is underfitting
* If $h \gg l$ there should be no training error
  * Whether we should call that overfitting depends upon the quality of our dataset

---

## ImageNet
* 14 million images
* Many bad and ambiguous labels
  * Such as the water snake at right
---

## Overfitting
* Most datasets have "noise"
  * Errors in labels
  * Impossible to learn examples
* We need high capacity to learn hard examples
* But that will also learn "bad" rules
  * The "bad" outweighs the "good" at capacity $h^*$ in the figure
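
The trade-off above can be sketched with a toy experiment. This is a minimal illustration under assumptions that are not from the lecture: the data are points on a line, the true rule is a threshold at 50, four training labels are flipped to stand in for label noise, a 1-nearest-neighbor lookup plays the role of a very high capacity model, and the low-capacity threshold is fixed by hand rather than trained.

```python
# Toy sketch: a high-capacity memorizer vs. a low-capacity threshold rule.
# All data, names, and the choice of 1-nearest-neighbor are illustrative assumptions.

def clean_label(x):
    """The 'good' rule we would like a model to learn."""
    return 1 if x >= 50 else 0

# Training set: even integers, with a few labels flipped to simulate noise.
NOISY_POINTS = {10, 30, 70, 90}
train_x = list(range(0, 100, 2))
train_y = [1 - clean_label(x) if x in NOISY_POINTS else clean_label(x)
           for x in train_x]

# Test set: points between the training points, with clean labels.
test_x = [x + 0.5 for x in train_x]
test_y = [clean_label(x) for x in test_x]

def memorizer(x):
    """High capacity: return the label of the nearest training point."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]

def threshold_rule(x):
    """Low capacity: a single threshold, which cannot fit the flipped labels."""
    return 1 if x >= 50 else 0

def count_errors(model, xs, ys):
    """Count how many predictions disagree with the given labels."""
    return sum(model(x) != y for x, y in zip(xs, ys))

results = {
    "memorizer": (count_errors(memorizer, train_x, train_y),
                  count_errors(memorizer, test_x, test_y)),
    "threshold": (count_errors(threshold_rule, train_x, train_y),
                  count_errors(threshold_rule, test_x, test_y)),
}
for name, (train_err, test_err) in results.items():
    print(f"{name}: {train_err} training errors, {test_err} test errors")
```

The memorizer fits every training label, including the flipped ones, and repeats those mistakes on nearby test points. The threshold rule cannot fit the flipped labels, so its training error is not zero, yet it matches the clean test labels.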
---

## Regularization in AlexNet

* Overlapping pooling layers
* Random image crops, scales, and flips
* Color augmentations
* Most importantly, Dropout

---

## Dropout

* Randomly zero the outputs of some layers during training
  * 50% in AlexNet
* This becomes similar to training an ensemble of random subsets of the original weights
  * Each forward-backward pass only uses a subset of the network

---

## Biased Outputs
* Two inputs have the same random value
  * With 0.01% chance they are set to 0
* The output is again a repeat of the same random value
* But a linear network decides to only look at one of them
---

## Less Biased Outputs
* Training with dropout improves things
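
A toy version of this comparison can be sketched in plain Python. This is an assumed reconstruction, not the experiment behind the figures: the model is two weights with no bias, both inputs always carry the same value, the starting weights (0.9, 0.0) deliberately favor one input, and dropout is the "inverted" variant that rescales surviving inputs by $1/(1-p)$. The function and parameter names are made up for illustration.

```python
import random

def train(use_dropout, steps=20000, lr=0.005, p_drop=0.5, seed=0):
    """SGD on squared error for y = w1*x1 + w2*x2 with x1 == x2 == target."""
    rng = random.Random(seed)
    w1, w2 = 0.9, 0.0  # assumed biased starting point
    for _ in range(steps):
        x = rng.uniform(-1.0, 1.0)
        x1 = x2 = x
        if use_dropout:
            # Inverted dropout: zero each input with probability p_drop and
            # rescale survivors so the expected input is unchanged.
            keep = 1.0 - p_drop
            x1 = x1 / keep if rng.random() < keep else 0.0
            x2 = x2 / keep if rng.random() < keep else 0.0
        error = (w1 * x1 + w2 * x2) - x
        # Gradient of error**2 with respect to each weight.
        w1 -= lr * 2.0 * error * x1
        w2 -= lr * 2.0 * error * x2
    return w1, w2

plain = train(use_dropout=False)
dropped = train(use_dropout=True)
print("without dropout: w1=%.3f w2=%.3f" % plain)
print("with dropout:    w1=%.3f w2=%.3f" % dropped)
```

Without dropout both weights receive identical gradients because the inputs are identical, so the initial gap between them never closes. With dropout the model is sometimes forced to predict from one input alone, which pushes the two weights toward each other. In this linear toy, dropout also shrinks the weights, so they need not sum to one, which is one reason to treat it only as a sketch of the idea.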
---

## AlexNet impacts

* Model training and selection
  * Accept that your dataset isn't good
  * Bigger DNNs, higher capacity, stronger regularization
* Weaker attempts to justify the approaches used
* Analysis of the feature vector reveals remarkable universality

---

## Example Similarity
* Column 1 is from the test set
* Other columns have the nearest feature vectors

---

## Improvements

* Once people saw what AlexNet could do, they tried to push farther
* 2014 saw VGG and GoogLeNet
  * This was a "go big or go home" moment
* 2015 brought ResNets
  * These have had more lasting impacts
  * Also enabled a new regularization, called Stochastic Depth
* 2016 saw interesting variations in SqueezeNet and DenseNet
* There were more variations on network design as the field took off

---

## GoogLeNet

* [GoogLeNet](https://arxiv.org/abs/1409.4842) pushed to 22 layers deep
* Uses fewer parameters than AlexNet
* 1x1 convolutions are used as dimensionality reduction to remove compute bottlenecks
  * This opens them up to using wider layers (more feature maps)

> The biggest gains in object-detection have not come from the utilization of deep networks alone or bigger models, but from the synergy of deep architectures and classical computer vision, like the R-CNN algorithm by Girshick et al.

---

## Inception Modules
* GoogLeNet used Inception modules instead of single convolutions
* 1x1 convolutions reduce the number of features before 3x3 or 5x5 convolutions
* These are pretty complicated
  * Unnecessarily so, it turns out
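
The saving from a 1x1 reduction is simple arithmetic. The sketch below counts convolution weights (ignoring biases) for a 5x5 convolution applied directly and applied after a 1x1 reduction. The channel counts (192 in, 16 after reduction, 32 out) are illustrative assumptions chosen to resemble an early Inception module.

```python
def conv_weights(kernel_size, in_channels, out_channels):
    """Number of weights in a square 2D convolution, ignoring biases."""
    return kernel_size * kernel_size * in_channels * out_channels

in_channels, reduced_channels, out_channels = 192, 16, 32  # assumed sizes

# A 5x5 convolution applied straight to all input feature maps.
direct = conv_weights(5, in_channels, out_channels)

# A 1x1 convolution first shrinks the feature maps, then the 5x5 runs on fewer.
reduced = (conv_weights(1, in_channels, reduced_channels)
           + conv_weights(5, reduced_channels, out_channels))

print(f"direct 5x5:       {direct} weights")
print(f"1x1 then 5x5:     {reduced} weights")
print(f"reduction factor: {direct / reduced:.1f}x")
```

With these assumed sizes the reduced path needs roughly a tenth of the weights, which is the budget that can then be spent on wider layers.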
---

## VGG

* [VGG](https://arxiv.org/abs/1409.1556) pushed 3x3 convolutions to 16-19 layers deep
  * Also slipped in a 1x1 convolution in places
* Back to 2x2 stride 2 pooling
* Image preprocessing was simple RGB mean subtraction
* Local Response Normalization is out
  * They reported that it did not help and just slowed things down

---

## Interpreting VGG

* We can view this as a rejection of the changes AlexNet made to LeNet
  * ReLU and Dropout are sufficient to train good deep networks
* 1x1 convolutions aren't used for dimensionality reduction
  * Instead, they pass through a ReLU, increasing nonlinearity
* They also had one more trick, which is informative

---

## VGG Initialization

* The "one simple trick" of VGG is in weight initialization
* We know that deep networks have a vanishing gradient problem
* So VGG begins by training a shallow network
* Then those weights are used to initialize a larger network

---

## Vanishing Gradients

* We believe that deep networks project inputs into an embedded space
  * Later layers then decode that embedding
* The vanishing gradient problem occurs when we are too far from a good solution
  * Gradients are small and point in every direction, so learning doesn't happen

---

## A Better Starting Point

* If we begin from a partial solution, then the gradients are better
  * A suboptimal projection into an embedded space is still better than the raw images
  * It is similar in concept to starting with easier images
* That's an intuition for why this works

---

## A Step Farther: ResNets

* VGG was annoying to train
  * Why not train the small network at the same time as the large one?
* [ResNets](https://arxiv.org/abs/1512.03385) have two big improvements:
  * Shortcut connections
  * Batch norm to deal with vanishing gradients and replace dropout
    * Introduced in the [Batch Normalization](https://arxiv.org/abs/1502.03167) paper

---

## Shortcut Layers
* We add the input of a block (the original image or earlier feature maps) back into the block's output feature maps
* When the shortcut goes over an increase in feature maps, use a 1x1 convolution to add dimensions
  * Or save parameters and use a zero-padded identity
* When the shortcut goes over dimensionality reduction, increase stride to match the reduction
  * e.g. stride 2 to cut feature map size in half
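
The shortcut idea can be sketched on plain vectors. This is a simplified, assumed stand-in for a real block: fully connected layers replace convolutions, batch normalization and the ReLU after the addition are omitted, and a projection matrix plays the role of the 1x1 convolution when dimensions change. All function names are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def relu(vector):
    return [max(0.0, v) for v in vector]

def residual_block(x, w1, w2, projection=None):
    """Compute shortcut(x) + F(x), where F(x) = w2 @ relu(w1 @ x).

    With projection=None the shortcut is the identity, which requires F(x)
    to have the same length as x. Otherwise the projection matrix maps x to
    the output size, similar in spirit to a 1x1 convolution on the shortcut.
    """
    residual = matvec(w2, relu(matvec(w1, x)))
    shortcut = x if projection is None else matvec(projection, x)
    return [s + r for s, r in zip(shortcut, residual)]

x = [1.0, -2.0, 3.0]

# If the learned layers output zero, the block passes its input through.
zeros = [[0.0] * 3 for _ in range(3)]
passthrough = residual_block(x, zeros, zeros)

# A small F(x) nudges the input instead of replacing it.
w1 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
w2 = [[0.1, 0.0, 0.0], [0.0, 0.1, 0.0], [0.0, 0.0, 0.1]]
nudged = residual_block(x, w1, w2)

# Going from 3 to 4 features needs a projection on the shortcut.
# This particular matrix just copies x and pads one zero.
w2_wide = [[0.0] * 3 for _ in range(4)]
project = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
widened = residual_block(x, w1, w2_wide, projection=project)

print(passthrough, nudged, widened)
```

The first case is the useful property: a block whose learned layers contribute nothing still hands its input to the next block unchanged, so extra depth does not have to hurt.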
---

## Improvements?
* Able to train a 1202-layer deep model
  * Although 110 layers was better, and state of the art
* The filters learn only differences from the inputs passed along the shortcuts
  * Hence "residuals"
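
One way to build intuition for why such deep stacks remain trainable is a scalar toy model. This is a sketch under strong assumptions that do not come from the lecture or the paper: every layer is a single multiplication by the same weight, with no nonlinearity. A plain layer computes $y = w x$ and a residual layer computes $y = x + w x$, so the derivative of the output with respect to the input is $w^n$ or $(1 + w)^n$ after $n$ layers.

```python
def plain_gradient(weight, depth):
    """d(output)/d(input) for `depth` stacked layers y = weight * x."""
    gradient = 1.0
    for _ in range(depth):
        gradient *= weight
    return gradient

def residual_gradient(weight, depth):
    """d(output)/d(input) for `depth` stacked layers y = x + weight * x."""
    gradient = 1.0
    for _ in range(depth):
        gradient *= 1.0 + weight
    return gradient

weight, depth = 0.01, 50  # assumed values for illustration
print("plain:   ", plain_gradient(weight, depth))
print("residual:", residual_gradient(weight, depth))
```

With small weights the plain stack's gradient shrinks toward zero, while the shortcut keeps it close to one. Real networks are not this simple, and large residual weights would make the second product grow instead of staying stable.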
---

## Skip Layers and SGD

* Skip layers present SGD a pathway directly to the original image or intermediate layers
  * A pathway to a good solution space in any convolution can be taken directly
  * In other networks, a good convolution could exist in layer 1, but unless layers 2 and 3 had identity functions, SGD wouldn't "see" that possible solution
* This is sadly hand-wavy
  * The only evidence from the authors (other than results) is a smaller variance of layer outputs
  * Measurements showed smoother layer outputs

---

## Batch Normalization

* Remember how large learning rates were a regularizer in LeNet?
* AlexNet and the following networks had to use low learning rates to fit their data
* The [Batch Normalization](https://arxiv.org/abs/1502.03167) authors point out that this is due to huge shifts in the statistics of each layer's inputs between batches as earlier layers change
  * So normalize the layer inputs, not just the input to the network!
* Batch normalization can feel like magic
  * If your model isn't training, try throwing in some batch normalization

---

## Other Cool Ideas

* 2016
  * [SqueezeNet](https://arxiv.org/abs/1602.07360)
  * [DenseNet](https://arxiv.org/abs/1608.06993)
* 2019
  * [EfficientNet](https://arxiv.org/abs/1905.11946)

---

## Using a DNN

* Most datasets aren't immediately useful
* And most organizations cannot afford to make their own datasets
* So what can we do?
  * Re-use the learned embedding

---

## Pretrained Models

* Pretrained models for different tasks are online
  * [PyTorch pretrained models](https://docs.pytorch.org/vision/stable/models.html)
* Ideally we would save the models as static descriptions rather than running them in dynamic PyTorch, but this is fine for a demo
* Demo time!
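
Re-using an embedding usually comes down to comparing feature vectors. The sketch below ranks stored vectors by cosine similarity to a query, in the spirit of the Example Similarity slide. The vectors and image names are made-up placeholders; in practice they would come from a late layer of a pretrained network, and the choice of cosine similarity is an assumption (Euclidean distance is another common option).

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query, gallery, k=2):
    """Return the names of the k gallery entries most similar to the query."""
    ranked = sorted(gallery,
                    key=lambda name: cosine_similarity(query, gallery[name]),
                    reverse=True)
    return ranked[:k]

# Hypothetical 4-dimensional feature vectors; real ones are much longer.
gallery = {
    "cat_1.jpg": [0.9, 0.1, 0.0, 0.2],
    "cat_2.jpg": [0.8, 0.2, 0.1, 0.1],
    "truck_1.jpg": [0.0, 0.9, 0.8, 0.1],
    "snake_1.jpg": [0.1, 0.0, 0.2, 0.9],
}
query = [1.0, 0.1, 0.0, 0.1]  # stands in for the features of a new image

matches = nearest(query, gallery)
print(matches)
```

No retraining is involved here: the network is only used to produce the vectors, and the task-specific work happens in the comparison step.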