# CS 462 - Lecture 14

## Advanced Convnets

Bernhard Firner

2026-03-12

---

# More Advanced Convnets

---

## Review

* Last time, we introduced the ResNet
  * Transformed what the network learns into a residual
  * Instead of learning $F(x) = H(x)$, learn $F(x) = x + H(x)$
  * We "give" the DNN the original x, so it just learns a difference, which is conceptually simpler
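
A minimal sketch of the residual idea in plain Python. The scalar "layer" `h` and its weight `w` are hypothetical stand-ins for the learned layers, chosen only to make the example runnable.

```python
import math

def h(x, w):
    # Hypothetical stand-in for the learned layers H(x): a scaled tanh.
    return w * math.tanh(x)

def residual_block(x, w):
    # The skip connection adds the input back in: F(x) = x + H(x).
    return x + h(x, w)

# With zero weights the block is exactly the identity, so training only
# has to learn how the output should differ from the input.
print(residual_block(2.0, w=0.0))  # 2.0
```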

---

## Residual Block

* Whether or not the argument is true, this connection does improve learning


---

## Shattered Gradients

* A more technical argument is that deep networks suffer from large swings in their outputs from minor changes in parameters
* This causes a decorrelation in parameters and high variance even with small adjustments

---

## The Explanation

* As the shattered gradient paper [explains](https://proceedings.mlr.press/v70/balduzzi17b.html):

> [T]he correlation between gradients in standard feedforward networks decays exponentially with depth resulting in gradients that resemble white noise

---

## Skip Connections

* The skip connections directly connect the gradient to early layers, providing a strong learning signal
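
A sketch of why, using the review notation $F(x) = x + H(x)$ and a loss $\mathcal{L}$ (a symbol introduced here only for illustration):

$$
\frac{\partial \mathcal{L}}{\partial x} = \frac{\partial \mathcal{L}}{\partial F}\left(1 + \frac{\partial H}{\partial x}\right)
$$

The $1$ comes from the skip path, so part of the gradient reaches $x$ unchanged no matter how small or noisy $\partial H / \partial x$ is.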

---

## Additional Utility

* This innovation proved fruitful for many applications
* [U-Net](https://arxiv.org/abs/1505.04597) was designed for biomedical image segmentation
* [Stacked hourglass networks](https://arxiv.org/abs/1603.06937) used residual-like skip layers within inner pyramid blocks
  * Used for joint identification on images, which are then used for pose reconstruction

---

## Competition

* Eventually, another technique became "hot": transformers
  * We'll talk about those after the break though
* In short: transformers capture structure similarly to a memory system, such as Markov Models
  * This does not match images as intuitively as a convolution does, so they can be awkward in comparison

---

## The Problem with Popular

* When something is popular, more people work on it
* Rapid progress was made with transformers, and soon they were state of the art, supplanting ResNets in many areas
* But were they really better?

---

## A ConvNet for the 2020s

* [A ConvNet for the 2020s](https://arxiv.org/abs/2201.03545) argued that it was the *techniques* developed around training that yielded improved results with transformers
  * When they brought those techniques to a modern ResNet, they also saw improvements
* They called the new architecture ConvNeXt
  * And there is a [ConvNeXt v2](https://arxiv.org/abs/2301.00808) now as well

---

## Improvements

* So what were the improved techniques?
* Some of them are going to be a bit esoteric without a lot of background
  * So let's just treat a couple in detail and concentrate on the main ideas

---

## ConvNeXt

* Let's begin with the first ConvNeXt architecture.

> ConvNeXts [...] compete favorably with
Transformers in terms of accuracy, scalability and robustness across all
major benchmarks. ConvNeXt maintains the efficiency of standard ConvNets, and
the fully-convolutional nature for both training and testing makes it extremely
simple to implement

---

## Relative Performance

---

## Improvement Procedure

* The authors created this "roadmap", showing how each step improves a baseline ResNet
* Notice that some techniques must be combined, and lead to drops in performance on their own

<div class="col">
<img style="width: 70%" class="r-stretch" src="./figures/convnext_fig2_roadmap.png" />
</div>

---

## Procedural Improvements

* Let's go over some other improvements
  * Unrelated to the architecture itself

---

## Training Recipe

* Switch to AdamW
* Train for 300 epochs instead of 90
  * With a cosine decay learning rate schedule and layer-wise learning rate decay
* Data augmentation:
  * Mixup, Cutmix, RandAugment, RandomErasing
* Regularization:
  * Stochastic Depth, Label Smoothing, Exponential Moving Average
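
As one concrete piece of the recipe, here is a rough sketch of a cosine decay schedule in plain Python. The function name, `base_lr`, and `min_lr` are illustrative assumptions, not the values used in the paper; only the 300-epoch length comes from the list above.

```python
import math

def cosine_lr(epoch, total_epochs=300, base_lr=1e-3, min_lr=0.0):
    # Fraction of training completed, clamped to [0, 1].
    progress = min(max(epoch / total_epochs, 0.0), 1.0)
    # Half a cosine wave: starts at base_lr and ends at min_lr.
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for epoch in (0, 75, 150, 225, 300):
    print(epoch, round(cosine_lr(epoch), 6))
```

The rate falls slowly at first, fastest in the middle of training, and flattens out again near the end.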

---

## Side Note

* You won't need to memorize this training recipe
* But there are some important points to understand
* Let's go over the critical ones and gloss over the rest

---

## Adam and $L_2$

* Adaptive moment estimation (Adam) improved learning over plain SGD
  * It includes momentum and normalizes learning steps, effectively using a different learning rate for each parameter
* But what happens when we also use $L_2$ regularization?
  * That component of the cost function is included in Adam's optimization
  * But it turns out that this hurts generalization, because the penalty gets rescaled along with the gradient

---

## AdamW

* Researchers continued to outperform Adam with plain SGD, $L_2$, and hand-tuned learning rate schedules
* The [AdamW](https://arxiv.org/abs/1711.05101) authors proposed a fix to Adam to:
  * decouple $L_2$ from the gradient-based update
  * thus improving generalization of the trained model

---

## The Problem

* Weight decay should be a separate term from the gradient-based update:
  * $\theta_{t+1} = (1 - \lambda)\theta_t - \alpha\nabla f_t(\theta_t)$
* What $L_2$ actually does is add $\lambda'\theta_t^2$ to the error
  * For plain SGD that's mathematically equivalent (with $\lambda' = \lambda/(2\alpha)$)
    * ...until the optimizer starts rescaling the gradient update, as Adam does
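
For plain SGD the equivalence can be checked numerically. This is a small sketch with a scalar parameter and a hypothetical quadratic loss; the values of `alpha` and `lam` are arbitrary. With the penalty written as $\lambda'\theta^2$, its gradient is $2\lambda'\theta$, so the two updates match when $\lambda' = \lambda/(2\alpha)$.

```python
def grad_f(theta):
    # Gradient of a hypothetical loss f(theta) = (theta - 3)^2.
    return 2.0 * (theta - 3.0)

def sgd_weight_decay(theta, alpha, lam):
    # Decoupled form: shrink the weight, then take the gradient step.
    return (1.0 - lam) * theta - alpha * grad_f(theta)

def sgd_l2(theta, alpha, lam_prime):
    # L2 form: the penalty lam_prime * theta^2 adds 2 * lam_prime * theta
    # to the gradient before the step is taken.
    return theta - alpha * (grad_f(theta) + 2.0 * lam_prime * theta)

theta, alpha, lam = 1.5, 0.1, 0.01
lam_prime = lam / (2.0 * alpha)
print(sgd_weight_decay(theta, alpha, lam), sgd_l2(theta, alpha, lam_prime))
```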

---

## Decoupled

* With weight decay kept separate, Adam's update should look something like this (where $M_t$ is Adam's per-parameter rescaling):
  * $\theta_{t+1} = (1 - \lambda)\theta_t - \alpha M_t\nabla f_t(\theta_t)$
* But with $L_2$ inside of the new mechanics, the gradient step becomes $\alpha M_t[\nabla f_t(\theta_t) + 2\lambda'\theta_t]$, and there is no possible value of $\lambda'$ to make them equivalent
* So the authors of AdamW decouple the weight decay from the training loss
  * $\theta_{t+1} \leftarrow \theta_t - \eta_{t+1}\left(\frac{\alpha\hat{m}_{t+1}}{\sqrt{\hat{v}_{t+1}}+\epsilon}+\lambda\theta_{t}\right)$
  * Where $\eta$ is the scheduled learning rate multiplier, $\frac{\hat{m}}{\sqrt{\hat{v}}}$ is the normalized momentum, and weight decay is added after the other calculations
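
To make the difference concrete, here is a rough single-parameter sketch in plain Python. The function names and all numeric values are illustrative assumptions; this is not the reference implementation from the paper or from any library.

```python
import math

def adam_moments(grad, m, v, t, b1=0.9, b2=0.999):
    # Standard Adam running averages with bias correction.
    m = b1 * m + (1.0 - b1) * grad
    v = b2 * v + (1.0 - b2) * grad * grad
    m_hat = m / (1.0 - b1 ** t)
    v_hat = v / (1.0 - b2 ** t)
    return m, v, m_hat, v_hat

def adam_l2_step(theta, grad, m, v, t, alpha, lam, eps=1e-8):
    # L2 inside the loss: the penalty gradient 2 * lam * theta is
    # normalized together with the loss gradient.
    m, v, m_hat, v_hat = adam_moments(grad + 2.0 * lam * theta, m, v, t)
    return theta - alpha * m_hat / (math.sqrt(v_hat) + eps), m, v

def adamw_step(theta, grad, m, v, t, alpha, lam, eta=1.0, eps=1e-8):
    # Decoupled: only the loss gradient is normalized; decay is added after.
    m, v, m_hat, v_hat = adam_moments(grad, m, v, t)
    step = alpha * m_hat / (math.sqrt(v_hat) + eps) + lam * theta
    return theta - eta * step, m, v

# First step from theta = 1.0 with a loss gradient of 0.5,
# for two different decay strengths.
for lam in (0.01, 0.1):
    l2, _, _ = adam_l2_step(1.0, 0.5, 0.0, 0.0, t=1, alpha=1e-3, lam=lam)
    w, _, _ = adamw_step(1.0, 0.5, 0.0, 0.0, t=1, alpha=1e-3, lam=lam)
    print(lam, l2, w)
```

On this first step the Adam plus $L_2$ update moves by about `alpha` for both decay strengths, because the normalization absorbs the penalty. The AdamW update shrinks the weight in proportion to `lam`, as weight decay is supposed to.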

---

## Improvements

* The [AdamW paper](https://arxiv.org/abs/1711.05101) shows (with more graphs than this) that generalization is improved on CIFAR-10

---

## Discussion

* So AdamW is part of what enabled ConvNeXt to train longer and generalize better
* It was in use in late 2018, so it is interesting that it was not the default in some research communities

---

## Stochastic Depth

* The next interesting improvement comes from [stochastic depth](https://arxiv.org/abs/1603.09382)
* This is another idea to improve gradients, although it does a bit more than that
* If the residual blocks are learning to output $x + f(x)$, what if we randomly turn off the $f(x)$?
  * It would be as if the layer, and its residual, didn't happen
* Why is that good?

---

## Stochastic Depth

* Gradients should be improved
  * Similar to DenseNet and ResNet, there can be shorter pathways to layers
  * And they will have less noise without the later layers
* This is similar to dropout; we are creating an implicit ensemble, which improves generalization
  * But unlike dropout, the forward pass is also faster when entire layers are removed
* We can also think of this as forcing earlier layers of the network to learn more
  * But sometimes they are given a helping hand by the later layers

---

## Mechanics

* We don't drop all layers with the same probability (dropping the first one would be disastrous)
* Instead, drop probabilities increase with depth

<div class="col">
<img style="width: 50%" class="r-stretch" src="./figures/stochastic_depth_figure2.png" />
<br/>
<small>Here is the visualization from the paper.</small>
</div>
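
The depth-dependent drop rule can be sketched in plain Python. This assumes the linear decay rule and test-time scaling described in the stochastic depth paper; the function names and the final survival probability `p_last=0.5` are illustrative choices.

```python
import random

def survival_prob(layer, num_layers, p_last=0.5):
    # Linear decay: layer 0 (the input) always survives, and the last
    # residual block survives with probability p_last.
    return 1.0 - (layer / num_layers) * (1.0 - p_last)

def stochastic_residual(x, f, p_survive, training, rng=random):
    if training:
        # Keep the residual branch with probability p_survive,
        # otherwise the block reduces to the identity.
        if rng.random() < p_survive:
            return x + f(x)
        return x
    # At test time every block is active, with its branch scaled by the
    # probability that it was present during training.
    return x + p_survive * f(x)

for layer in (0, 5, 10):
    print(layer, survival_prob(layer, num_layers=10))
```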

---

## Gradients

* Does it do what the authors said it would?

<div class="col">
<img style="width: 50%" class="r-stretch" src="./figures/stochastic_depth_figure7_gradientmagnitude.png" />
<br/>
<small>The authors showed that gradients remain strong even after learning rate adjustments.</small>
</div>

---

## Results

* Of course we always want to show our improvements

<div class="col">
<img style="width: 50%" class="r-stretch" src="./figures/stochastic_depth_figure4_cifar10.png" />
<br/>
<small>CIFAR10 results from the paper.</small>
</div>

---

## Label Smoothing

* This wasn't a new technique; it was [introduced in 2016](https://www.cv-foundation.org/openaccess/content_cvpr_2016/html/Szegedy_Rethinking_the_Inception_CVPR_2016_paper.html)
* This is a regularization technique that reduces model overconfidence
  * And is included in the [PyTorch CrossEntropyLoss](https://docs.pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html)
* We sometimes replace training labels with new, random labels from a uniform distribution
  * You can think of this as forcing a prior of an equal class distribution upon the model
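
In expectation, randomly swapping in uniform labels is the same as mixing the one-hot target with a uniform distribution. A small sketch in plain Python, with illustrative function names and an assumed smoothing strength of `epsilon=0.1`:

```python
import math

def smooth_labels(true_class, num_classes, epsilon=0.1):
    # Every class gets epsilon / K; the true class keeps the remaining mass.
    target = [epsilon / num_classes] * num_classes
    target[true_class] += 1.0 - epsilon
    return target

def cross_entropy(probs, target):
    # Cross entropy between a soft target and predicted probabilities.
    return -sum(t * math.log(p) for t, p in zip(target, probs))

# A very confident (and correct) prediction over 4 classes.
confident = [0.001, 0.997, 0.001, 0.001]
hard = smooth_labels(1, 4, epsilon=0.0)
soft = smooth_labels(1, 4, epsilon=0.1)
print(cross_entropy(confident, hard), cross_entropy(confident, soft))
```

With hard labels the loss on the confident prediction is nearly zero. With smoothed labels it is not, so the model is no longer rewarded for pushing probabilities all the way to 1. In PyTorch, the `label_smoothing` argument of `CrossEntropyLoss` plays the role of `epsilon`.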

---

## Image Augmentations

* There are a lot of image augmentations used
* I don't want to get hung up on them
  * The best augmentations may vary with the data being used
  * Just remember that image augmentation is important

---

## Architecture

* Finally, let's talk about how the architecture was changed in ConvNeXt
* The PyTorch Torchvision module has several [pretrained versions](https://docs.pytorch.org/vision/main/models/convnext.html), if you're interested
* Many of the changes are taken directly from transformers, but some come from other convnet work

---

## Stem Convolution Sizes

* The initial convolution in modern networks is different from the rest
* Its goal is to rapidly reduce the input size without sacrificing features
* ResNets had been using 7x7 convolutions with stride 2 followed by a max pooling layer
  * Pooling takes the maximum value in a region and discards the rest
* ConvNeXt changed to a 4x4 convolution with stride 4
  * The downscaling is the same, but is done in a single layer
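
The downscaling arithmetic can be checked with the usual output-size formula. The 224-pixel input, the padding values, and the 3x3 stride-2 pooling window are assumptions based on the common ResNet configuration; they are not stated on this slide.

```python
def conv_out_size(size, kernel, stride, padding=0):
    # Standard output-size formula for convolution and pooling layers.
    return (size + 2 * padding - kernel) // stride + 1

# ResNet-style stem: 7x7 stride-2 convolution, then 3x3 stride-2 max pooling.
after_conv = conv_out_size(224, kernel=7, stride=2, padding=3)
resnet_stem = conv_out_size(after_conv, kernel=3, stride=2, padding=1)

# Single 4x4 stride-4 convolution with no overlap between windows.
patchify_stem = conv_out_size(224, kernel=4, stride=4)

print(after_conv, resnet_stem, patchify_stem)  # 112 56 56
```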

---

## Grouping

* Back in LeNet, each feature map didn't feed into every convolution in the next layer
  * So if we had 16 feature maps, a convolution in the next layer might only pick 4 of them
  * This grouped convolution reduces feature combinations, but also reduces computation and weights
* ConvNeXt uses as many groups as there are channels (a depthwise convolution), meaning that each kernel has size 1xHxW
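
A quick weight count shows how much grouping saves. This sketch ignores bias terms, and the 96 channels and 7x7 kernel are example values taken from the ConvNeXt block printout later in these slides.

```python
def conv_weight_count(in_channels, out_channels, kernel, groups=1):
    # Each output channel only sees in_channels / groups input channels.
    return out_channels * (in_channels // groups) * kernel * kernel

dense = conv_weight_count(96, 96, kernel=7)                 # every channel sees all 96
depthwise = conv_weight_count(96, 96, kernel=7, groups=96)  # every channel sees 1

print(dense, depthwise)  # 451584 4704
```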

---

## DepthWise Operation

* We need some operation across channels
  * High-level features are built up from smaller ones, right?
* So after the channel-wise operation there is a 1x1 convolution
  * This is the same as a linear layer run on each $(x,y)$ location across all feature maps
* Because the grouped convolution loses capacity compared to ungrouped convolutions, this operation quadruples the number of features
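
To see why a 1x1 convolution is a per-pixel linear layer, here is a tiny sketch using nested Python lists indexed as `[channel][y][x]`. The function name and the weights are made up for illustration, and for brevity the example expands 2 channels to 4 rather than quadrupling them.

```python
def pointwise_conv(feature_maps, weights):
    # feature_maps: [in_channels][height][width]
    # weights:      [out_channels][in_channels]
    in_channels = len(feature_maps)
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])
    out = []
    for w_row in weights:
        # The same weighted sum over channels is applied at every (x, y).
        out.append([
            [sum(w_row[c] * feature_maps[c][y][x] for c in range(in_channels))
             for x in range(width)]
            for y in range(height)
        ])
    return out

maps = [[[1.0, 2.0]], [[3.0, 4.0]]]            # 2 channels, 1x2 pixels
weights = [[1, 0], [0, 1], [1, 1], [1, -1]]    # mixes 2 channels into 4
print(pointwise_conv(maps, weights))
```

No pixel ever looks at its neighbours here; all of the mixing happens across channels.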

---

## Inverted Bottleneck

* That quadrupling of features is then followed by a reduction back to the original number
  * Thus an inverted bottleneck
* It isn't as costly a computation as you would think since these are just 1x1 convolutions
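
A rough per-pixel cost comparison, counting multiply-adds and ignoring biases. The 96 channels and the 7x7 kernel are example values from the ConvNeXt block printout later in these slides; the dense 7x7 convolution is only a hypothetical point of comparison.

```python
def pointwise_macs(in_channels, out_channels):
    # A 1x1 convolution does one multiply-add per channel pair per pixel.
    return in_channels * out_channels

def dense_conv_macs(in_channels, out_channels, kernel):
    # An ungrouped convolution also pays for every kernel position.
    return in_channels * out_channels * kernel * kernel

channels, expansion = 96, 4
bottleneck = (pointwise_macs(channels, channels * expansion)
              + pointwise_macs(channels * expansion, channels))
dense_7x7 = dense_conv_macs(channels, channels, 7)

print(bottleneck, dense_7x7, dense_7x7 / bottleneck)  # 73728 451584 6.125
```

Even with the 4x expansion, the pair of 1x1 convolutions costs about a sixth of what a single ungrouped 7x7 convolution would.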

---

## Kernel Size

* So what should the size of the first kernel be?
  * Experiments showed that 7x7 worked best
* There isn't much to say about this
* Whether or not justifications are true, this structure does improve learning


---

## Micro Improvements

* Now let's change other components
* We like ReLU, right?
  * But it isn't smooth
* So, with no very strong justification, the authors change ReLU to GeLU, which is smoother
  * [GeLU in PyTorch](https://docs.pytorch.org/docs/stable/generated/torch.nn.GELU.html#gelu)
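
For reference, the exact (non-approximated) form of the two activations in plain Python. GELU multiplies the input by the standard normal CDF, which is what gives it a smooth transition around zero.

```python
import math

def relu(x):
    # Hard cutoff at zero: the slope jumps from 0 to 1.
    return max(0.0, x)

def gelu(x):
    # x * Phi(x), where Phi is the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(x, relu(x), round(gelu(x), 4))
```

Unlike ReLU, GELU lets small negative inputs through with a small negative output instead of clamping them to zero.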


---

## Removing Activations

* Next, all but one of the GeLUs are removed from each block
  * Why? Again, experiments showed it was better
* But perhaps we can think of the within-feature operation of the convolution and the across-feature operation of the 1x1 convolution as part of a single transformation
  * In that case, there should only be one activation function between them
* And the next 1x1 convolution merely projects back into the original feature space, so maybe it doesn't need an activation
  * We're hand-waving now

---

## Changing Normalization

* If the activations can be done away with, how about normalization?
  * Sure
* And let's swap out BatchNorm for Layer Norm as well
* [Layer Norm PyTorch](https://docs.pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html#torch.nn.LayerNorm)


---

## Why Layer Norm?

* First off, [layer normalization](https://arxiv.org/abs/1607.06450) was introduced in 2016
  * But it performed worse than Batch Norm in earlier convolutional networks
* It was developed because batch statistics don't work well with recurrent networks
* There are other circumstances where batch statistics could be poor
  * Small batches or data with disparate, non-gaussian covariance of features across different classes

---

## What is Layer Norm?

* Layer Norm is generally used to normalize a single pixel across all channels before going into the 1x1 convolution
* But in ConvNeXt, it is used to normalize each individual channel
  * So it is acting very similarly to Batch Norm
* It could be that the noise from Batch Norm is unnecessary (or even harmful) with the other regularization and depth present
  * Whatever the reason, Layer Norm gives better results in ConvNeXt
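
The per-pixel form from the first bullet can be sketched in plain Python: take the values of one pixel across all channels and normalize them using only their own mean and variance. The learnable scale and shift are left out, and `eps=1e-6` is just a small constant to avoid division by zero.

```python
import math

def layer_norm(values, eps=1e-6):
    # Statistics come from this one list of values only; nothing is
    # shared across the batch, unlike Batch Norm.
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

pixel_channels = [1.0, 2.0, 3.0, 4.0]   # one pixel, four channels
print(layer_norm(pixel_channels))
```

Because nothing depends on the other examples in the batch, the result is the same for a batch of 1 as for a batch of 1000.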

---

## ConvNeXt Block

* Here is the PyTorch printout of the block:

```
  (1): CNBlock(
    (block): Sequential(
      (0): Conv2d(96, 96, kernel_size=(7, 7), stride=(1, 1), padding=(3, 3), groups=96)
      (1): Permute()
      (2): LayerNorm((96,), eps=1e-06, elementwise_affine=True)
      (3): Linear(in_features=96, out_features=384, bias=True)
      (4): GELU(approximate='none')
      (5): Linear(in_features=384, out_features=96, bias=True)
      (6): Permute()
    )
```

---

## Downsampling

* Downsampling isn't done inside of the ConvNeXt block as it was in ResNets
* Instead, it is done via 2x2 kernels with normalized channels

```
(2): Sequential(
  (0): LayerNorm2d((96,), eps=1e-06, elementwise_affine=True)
  (1): Conv2d(96, 192, kernel_size=(2, 2), stride=(2, 2))
)
```

---

## Takeaways

* Don't feel overwhelmed
  * This is a lot, but most of it isn't important
* The key point here is that we still don't fully understand why some things are better than others
  * Experiments say they are (and explanations may follow later; there is a ConvNeXt v2 as well)

---

## Important Parts

* AdamW is a clear improvement over Adam
* Stochastic depth is a regularizer that speeds up training
* Sometimes good ideas need to be revisited
* Researchers in one community should pay attention to advances made in other communities