<!--
Abstract:

CS 461
Introduction to Deep Learning
Lecture 11
-->

# CS 461 - Lecture 22

## Machine Learning Principles

### ResNets and Feature Vectors

Bernhard Firner

2025-11-17

---

## AlexNet

---

## Capacity

* Call model capacity $h$
* We train with $l$ samples
* If $h \ll l$ the training error is high
  * This is underfitting
* If $h \gg l$ there should be no training error
  * Whether we should call that overfitting depends upon the quality of our dataset

---

## ImageNet

* 14 million images
* Many bad and ambiguous labels
  * Such as the water snake at right


---

## Overfitting

* Most datasets have "noise"
  * Errors in labels
  * Impossible to learn examples
* We need high capacity to learn hard examples
  * But that will also learn "bad" rules
  * "bad" outweighs the "good" at capacity $h^*$ in the figure


---

## Regularization in AlexNet

* Overlapping pooling layers
* Random image crops, scales, and flips
* Color augmentations
* Most importantly, Dropout

---

## Dropout

* Randomly zero the outputs of some units in a layer during training
  * 50% in AlexNet
* This is similar to training an ensemble of networks built from random subsets of the original weights
  * Each forward-backward pass only uses a subset of the network
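
A minimal sketch of the idea in plain Python, assuming "inverted" dropout (survivors are scaled up during training, the convention in current frameworks; AlexNet instead scaled outputs by 0.5 at test time). The `dropout` helper and its parameters are illustrative names, not from any library.

```python
import random

def dropout(values, p=0.5, training=True, rng=random):
    """Inverted dropout on a list of activations (hypothetical helper)."""
    if not training or p == 0.0:
        return list(values)
    keep = 1.0 - p
    # Zero each activation with probability p and scale the survivors
    # by 1/keep so the expected activation is unchanged.
    return [v / keep if rng.random() < keep else 0.0 for v in values]

rng = random.Random(0)  # fixed seed so the demo is repeatable
activations = [1.0] * 10
dropped = dropout(activations, p=0.5, rng=rng)
print(dropped)  # a mix of 0.0 and 2.0 entries
```

At test time (`training=False`) the activations pass through untouched, so every unit contributes.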

---

## Biased Outputs

* Two inputs have the same random value
  * With 0.01% chance they are set to 0
* The output is again a repeat of the same random value
* But a linear network decides to only look at one of them


---

## Less Biased Outputs

* Training with dropout improves things
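
A rough sketch of the effect, under simplifying assumptions: a two-weight linear model sees the same input twice ($x_1 = x_2 = 1$, target $1$), dropout is applied to the inputs with probability 0.5, and we run gradient descent on the loss averaged over all four dropout masks rather than sampling them. The starting weights and function names are made up for illustration.

```python
from itertools import product

def expected_grad(w, p_drop):
    """Gradient of the squared error averaged over all dropout masks."""
    keep = 1.0 - p_drop
    grad = [0.0, 0.0]
    for mask in product([0, 1], repeat=2):
        # Probability of seeing this particular mask.
        prob = 1.0
        for m in mask:
            prob *= keep if m else p_drop
        # Inverted dropout: kept inputs (value 1) are scaled by 1/keep.
        inputs = [m / keep for m in mask]
        error = w[0] * inputs[0] + w[1] * inputs[1] - 1.0
        for i in range(2):
            grad[i] += prob * 2.0 * error * inputs[i]
    return grad

def train(p_drop, steps=500, lr=0.1):
    w = [0.8, 0.0]  # biased start: the model mostly looks at input 1
    for _ in range(steps):
        g = expected_grad(w, p_drop)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w

w_plain = train(p_drop=0.0)
w_drop = train(p_drop=0.5)
print(w_plain, w_drop)
```

Without dropout both weights receive identical gradients, so the initial imbalance never goes away (the weights end near 0.9 and 0.1). With dropout the weights are pulled to equal values (near 1/3 each); they also shrink in total, which is the regularization at work.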


---

## AlexNet impacts

* Model training and selection
  * Accept that your dataset isn't good
  * Bigger DNNs, higher capacity, stronger regularization
    * Weaker attempts to justify the approaches used
* Analysis of the feature vector reveals remarkable universality

---

## Example Similarity

* Column 1 from the test set
* Other columns have nearest feature vectors
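
A small sketch of how such a lookup could work, assuming each image has already been reduced to a feature vector and that "nearest" means smallest Euclidean distance. The three-dimensional vectors here are toy values, not real network outputs.

```python
import math

def nearest(query, gallery, k=2):
    """Indices of the k gallery vectors closest to the query vector."""
    ranked = sorted(range(len(gallery)),
                    key=lambda i: math.dist(query, gallery[i]))
    return ranked[:k]

# Toy "feature vectors" standing in for training-set images.
gallery = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query = [1.0, 0.05, 0.0]  # stands in for a test-set image
print(nearest(query, gallery))
```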

---

## Improvements

* Once people saw what AlexNet could do, they tried to push farther
  * 2014 saw VGG and GoogLeNet
    * This was a "go big or go home" moment
  * 2015 brought ResNets
    * These have had more lasting impacts
    * Also enabled a new regularization, called Stochastic Depth
  * 2016 saw interesting variations in SqueezeNet and DenseNet
* There were more variations on network design as the field took off

---

## GoogLeNet

* [GoogLeNet](https://arxiv.org/abs/1409.4842) pushed to 22 layers deep
* Uses fewer parameters than AlexNet
  * 1x1 convolutions are used as dimensionality reduction to remove compute bottlenecks
  * This opens them up to using wider layers (more feature maps)

> The biggest gains in object-detection have not come from the utilization of deep
networks alone or bigger models, but from the synergy of deep architectures and classical computer
vision, like the R-CNN algorithm by Girshick et al.

---

## Inception Modules

* GoogLeNet used Inception modules instead of single convolutions
* 1x1 convolutions reduce the number of features before 3x3 or 5x5 convolutions
* These are pretty complicated
  * Unnecessarily so, it turns out
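
To see why the 1x1 reduction matters, here is a quick weight count for one 5x5 branch. The channel counts (192 in, 16 after reduction, 32 out) are illustrative assumptions, and biases are ignored.

```python
def conv_weights(kernel, in_ch, out_ch):
    """Number of weights in a kernel x kernel convolution (no bias)."""
    return kernel * kernel * in_ch * out_ch

in_ch, reduced, out_ch = 192, 16, 32

# 5x5 convolution applied directly to all input feature maps.
direct = conv_weights(5, in_ch, out_ch)

# 1x1 convolution first shrinks the feature maps, then the 5x5 runs.
bottleneck = conv_weights(1, in_ch, reduced) + conv_weights(5, reduced, out_ch)

print(direct, bottleneck)  # 153600 15872
```

With these numbers the reduced branch needs roughly a tenth of the weights, which is what frees up budget for wider layers.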


---

## VGG

* [VGG](https://arxiv.org/abs/1409.1556) pushed 3x3 convolutions to 16-19 layers deep
  * Also slipped in a 1x1 convolution in places
* Back to 2x2 stride 2 pooling
* Image preprocessing was simple RGB mean subtraction
* Local Response Normalization is out
  * They reported that it did not improve accuracy and only added memory use and compute time

---

## Interpreting VGG

* We can view this as a rejection of the changes AlexNet made to LeNet
  * With ReLU and Dropout sufficient to train good deep networks
* 1x1 convolutions aren't used for dimensionality reduction
  * Instead, they pass through a ReLU, increasing nonlinearity
* They also had one more trick, which is informative

---

## VGG Initialization

* The "one simple trick" of VGG is in weight initialization
* We know that deep networks have a vanishing gradient problem
* So VGG begins by training a shallow network
  * Then take those weights and use them to initialize a large network
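
A minimal PyTorch sketch of this kind of warm start. The layer sizes and the `conv_relu` helper are hypothetical and far smaller than VGG, and the shallow network is only pretended to be trained.

```python
import torch
from torch import nn

def conv_relu(in_ch, out_ch):
    """3x3 convolution followed by ReLU (hypothetical helper)."""
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                         nn.ReLU())

# Shallow network; assume it has already been trained.
shallow = nn.Sequential(conv_relu(3, 8), conv_relu(8, 16))

# Deeper network with extra layers added around the shallow ones.
deep = nn.Sequential(conv_relu(3, 8), conv_relu(8, 8),
                     conv_relu(8, 16), conv_relu(16, 16))

# Copy the trained weights into the layers whose shapes match.
# The remaining layers keep their random initialization.
deep[0].load_state_dict(shallow[0].state_dict())
deep[2].load_state_dict(shallow[1].state_dict())
```

Training then continues on `deep`, starting from a partial solution instead of from scratch.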

---

## Vanishing Gradients

* We believe that deep networks project inputs into an embedded space
  * Later layers then decode that embedding
* The vanishing gradient problem occurs when we are too far from a good solution
  * Gradients are small and point in every direction, so learning doesn't happen

---

## A Better Starting Point

* If we begin from a partial solution, then the gradients are better
* A suboptimal projection into an embedded space is still better than the raw images
  * It is similar in concept to starting with easier images
* That's an intuition for why this works

---

## A Step Farther: ResNets

* VGG was annoying to train
  * Why not train the small network at the same time as the large one?
* [ResNets](https://arxiv.org/abs/1512.03385) have two big improvements:
  * Shortcut connection
  * Batch norm to deal with vanishing gradients and replace dropout
    * Introduced in the [Batch Normalization](https://arxiv.org/abs/1502.03167) paper.

---

## Shortcut Layers

* We add the input of a block back onto its output feature maps, so the convolutions only learn a residual
* When the shortcut goes over an increase in feature maps, use 1x1 convolution to add dimensions
  * Or save parameters and use an identity with zero padding for the extra feature maps
* When the shortcut goes over dimensionality reduction, increase stride to match the reduction
  * e.g. stride 2 to cut feature map size in half
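
A sketch of a basic residual block in PyTorch, assuming the 1x1 convolution option for the shortcut whenever the shape changes. The class name and layer layout are a simplified illustration, not the exact block from the paper.

```python
import torch
from torch import nn

class ResidualBlock(nn.Module):
    """Two 3x3 convolutions with a shortcut added around them."""

    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        if stride != 1 or in_ch != out_ch:
            # Shape changes: a strided 1x1 convolution matches the shortcut
            # to the new number and size of feature maps.
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )
        else:
            # Shape is unchanged: the shortcut is a plain identity.
            self.shortcut = nn.Identity()

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # Add the block input back in before the final nonlinearity.
        return torch.relu(out + self.shortcut(x))

x = torch.randn(2, 16, 32, 32)
same = ResidualBlock(16, 16)(x)            # identity shortcut
down = ResidualBlock(16, 32, stride=2)(x)  # 1x1, stride 2 shortcut
print(same.shape, down.shape)
```

The second block doubles the feature maps and halves their width and height, so its shortcut uses stride 2 to keep the two paths the same shape.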


---

## Improvements?

* Able to train a 1202 layer deep model
  * Although 110 layers was better, and state of the art
* The filters learn only differences from the input to each block
  * Hence "residuals"


---

## Skip Layers and SGD

* Skip layers present SGD a pathway directly to the original image or intermediate layers
* A pathway to a good solution in any convolution can be taken directly
  * In other networks, a good convolution could exist in layer 1, but unless layers 2 and 3 acted as identity functions, SGD wouldn't "see" that possible solution
* This is sadly hand-wavy
  * The only evidence from the authors (other than results) is a smaller variance of layer outputs
  * Measurements showed smoother layer outputs

---

## Batch Normalization

* Remember how large learning rates were a regularizer in LeNet?
* AlexNet and the following networks had to use low learning rates to fit their data
  * The [Batch Normalization](https://arxiv.org/abs/1502.03167) authors attribute this to shifts in the statistics of each layer's inputs as earlier layers change during training (internal covariate shift)
* So normalize the layer inputs, not just the input to the network!
* Batch normalization can feel like magic
  * If your model isn't training, try throwing in some batch normalization
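
A plain-Python sketch of the normalization step for a single feature across one mini-batch. `gamma` and `beta` stand for the learned scale and shift; the values used here are just their usual starting points, and the running statistics used at inference time are left out.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a mini-batch, then scale and shift."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n  # biased batch variance
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]

activations = [10.0, 12.0, 14.0, 16.0]  # toy layer inputs for one feature
normalized = batch_norm(activations)
print(normalized)  # roughly zero mean and unit variance
```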

---

## Other Cool Ideas

* 2016
  * [SqueezeNet](https://arxiv.org/abs/1602.07360)
  * [DenseNet](https://arxiv.org/abs/1608.06993)
* 2019
  * [EfficientNet](https://arxiv.org/abs/1905.11946)

---

## Using a DNN

* Most datasets aren't immediately useful
* And most organizations cannot afford to make their own datasets
* So what can we do?
  * Re-use the learned embedding

---

## Pretrained Models

* Pretrained models for different tasks are online
  * [PyTorch pretrained models](https://docs.pytorch.org/vision/stable/models.html)
* Ideally we would save the models as static descriptions rather than running them in dynamic PyTorch, but this is fine for a demo
* Demo time!

<!--
Go over some networks that came after AlexNet
Talk about that residual skip layer trick
Mention how that unlocks several more tricks
  Stochastic depth
  Idea of feature modulation
Talk specifics about resnet and a couple of other networks
  ResNet 2015
  Squeezenet 2016
  Densenet 2016
  EfficientNet 2019

-->