# CS 462 - Lecture 14

## Advanced Convnets

Bernhard Firner

2026-03-12

---

# More Advanced Convnets

---

## Review

* Last time, we introduced the ResNet
* Transformed what the network learns into a residual
* Instead of learning $F(x) = H(x)$, learn $F(x) = x + H(x)$
* We "give" the DNN the original $x$, so it just learns a difference, which is conceptually simpler

---

## Residual Block
* Whether or not the argument is true, this connection does improve learning
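
The residual idea from the review can be sketched in a few lines of plain Python. This is only an illustration: `residual_block`, `zero_branch`, and `small_branch` are hypothetical names, and the "layer" is a toy elementwise function rather than a real convolution.

```python
def residual_block(x, branch):
    """Return x + branch(x), elementwise, for a list of floats."""
    fx = branch(x)
    return [xi + fi for xi, fi in zip(x, fx)]


def zero_branch(x):
    # A branch that has learned nothing yet (think: all weights are zero).
    return [0.0 for _ in x]


def small_branch(x):
    # A hypothetical learned correction: a small adjustment to each input.
    return [0.1 * xi for xi in x]


x = [1.0, -2.0, 3.0]
identity_out = residual_block(x, zero_branch)   # the block passes x through unchanged
adjusted_out = residual_block(x, small_branch)  # the block outputs x plus a difference
```

With a branch that outputs zeros the block is exactly the identity, which is why the network only has to learn the difference from its input.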
---

## Shattered Gradients

* A more technical argument is that deep networks suffer from large swings in their outputs from minor changes in parameters
* This causes a decorrelation in parameters and high variance even with small adjustments
---

## The Explanation

* As the shattered gradient paper [explains](https://proceedings.mlr.press/v70/balduzzi17b.html):

> [T]he correlation between gradients in standard feedforward networks decays exponentially with depth resulting in gradients that resemble white noise

---

## Skip Connections

* The skip connections directly connect the gradient to early layers, providing a strong learning signal
---

## Additional Utility

* This innovation proved fruitful for many applications
* [U-Net](https://arxiv.org/abs/1505.04597) was designed for biomedical image segmentation
* [Stacked hourglass networks](https://arxiv.org/abs/1603.06937) used residual-like skip layers within inner pyramid blocks
* Used to identify joints in images, which are then used for pose reconstruction

---

## Competition

* Eventually, another technique became "hot": transformers
* We'll talk about those after the break though
* In short: transformers capture structure similarly to a memory system, such as Markov Models
* This does not match images as intuitively as a convolution does, so they can be awkward in comparison

---

## The Problem with Popular

* When something is popular, more people work on it
* Rapid progress was made with transformers, and soon they were state of the art, supplanting ResNets in many areas
* But were they really better?

---

## A ConvNet for the 2020s

* [A ConvNet for the 2020s](https://arxiv.org/abs/2201.03545) argued that it was the *techniques* developed around training that yielded improved results with transformers
* When they brought those techniques to a modern ResNet, they also saw improvements
* They called the new architecture ConvNeXt
* And there is a [ConvNeXt v2](https://arxiv.org/abs/2301.00808) now as well

---

## Improvements

* So what were the improved techniques?
* Some of them are going to be a bit esoteric without a lot of background
* So let's just treat a couple in detail and concentrate on the main ideas

---

## ConvNeXt

* Let's begin with the first ConvNeXt architecture.

> ConvNeXts [...] compete favorably with Transformers in terms of accuracy, scalability and robustness across all major benchmarks. ConvNeXt maintains the efficiency of standard ConvNets, and the fully-convolutional nature for both training and testing makes it extremely simple to implement

---

## Relative Performance
---

## Improvement Procedure

* The authors created this "roadmap", showing how each step improves a baseline ResNet
* Notice that some techniques lead to drops in performance on their own and only help when combined with others
---

## Procedural Improvements

* Let's go over some other improvements
* Unrelated to the architecture itself

---

## Training Recipe

* Switch to AdamW
* Train for 300 epochs instead of 90
* With a cosine decay learning rate schedule and layer-wise learning rate decay
* Data augmentation:
  * Mixup, Cutmix, RandAugment, RandomErasing
* Regularization:
  * Stochastic Depth, Label Smoothing, Exponential Moving Average

---

## Side Note

* You won't need to memorize this training recipe
* But there are some points that are important to understand
* Let's go over the critical ones and gloss over the rest

---

## Adam and $L_2$

* Adaptive moment estimation (Adam) improved learning over plain SGD
* It includes momentum and normalizes learning steps, effectively using a different learning rate for each parameter
* But what happens when we also use $L_2$ regularization?
* That component of the cost function is included in Adam's optimization
* But it turns out that this is bad

---

## AdamW

* Researchers continued to outperform Adam with plain SGD, $L_2$, and hand-tuned learning rate schedules
* The [AdamW](https://arxiv.org/abs/1711.05101) authors proposed a fix to Adam to:
  * decouple $L_2$ from the gradient-based update
  * thus improving generalization of the trained model

---

## The Problem

* Weight decay should be a separate term from the gradient-based update:
  * $\theta_{t+1} = (1 - \lambda)\theta_t - \alpha\nabla f_t(\theta_t)$
* What $L_2$ actually does is add $\lambda'\theta_t^2$ to the error
* With plain SGD that's mathematically equivalent (with $\lambda' = \lambda/(2\alpha)$)
* ...until we begin changing the gradient-based updates

---

## Decoupled

* Adam should modify the update to be something like this:
  * $\theta_{t+1} = (1 - \lambda)\theta_t - \alpha M_t\nabla f_t(\theta_t)$
* But with $L_2$ inside of the new mechanics, $\alpha M_t[\nabla f_t(\theta_t) + 2\lambda'\theta_t]$, there is no possible value of $\lambda'$ to make them equivalent
* So the authors of AdamW decouple the weight decay from the training loss
  * $\theta_{t+1} \leftarrow \theta_t - \eta_{t+1}\left(\frac{\alpha\hat{m}_{t+1}}{\sqrt{\hat{v}_{t+1}}+\epsilon}+\lambda\theta_{t}\right)$
* Where $\eta$ is the scheduled learning rate multiplier, $\frac{\hat{m}}{\sqrt{\hat{v}}}$ is the normalized momentum, and weight decay is added after other calculations

---

## Improvements

* The [AdamW paper](https://arxiv.org/abs/1711.05101) shows (with more graphs than this) that generalization is improved on CIFAR-10
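
To make the difference between coupled $L_2$ and decoupled weight decay concrete, here is a rough scalar sketch in plain Python. It is not the paper's code or PyTorch's implementation (PyTorch provides `torch.optim.AdamW`). The function names `adam_step` and `first_step` are hypothetical, the defaults `beta1=0.9`, `beta2=0.999`, `eps=1e-8` are the usual Adam defaults, and the loss gradient is set to zero so that only the regularization term moves the weight.

```python
import math


def adam_step(theta, grad, m, v, t, alpha=1e-3, eta=1.0, beta1=0.9,
              beta2=0.999, eps=1e-8, l2=0.0, weight_decay=0.0):
    """One Adam step on a single scalar weight.

    l2:           coupled penalty l2 * theta**2, added to the gradient.
    weight_decay: decoupled decay, applied outside the adaptive scaling.
    """
    g = grad + 2.0 * l2 * theta              # the L2 term enters the gradient
    m = beta1 * m + (1.0 - beta1) * g        # first moment (momentum)
    v = beta2 * v + (1.0 - beta2) * g * g    # second moment
    m_hat = m / (1.0 - beta1 ** t)           # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - eta * (alpha * m_hat / (math.sqrt(v_hat) + eps)
                           + weight_decay * theta)
    return theta, m, v


def first_step(l2=0.0, weight_decay=0.0):
    # Start at theta = 1 with zero loss gradient and take the first step (t = 1).
    theta, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, 1, l2=l2, weight_decay=weight_decay)
    return theta


coupled_small = first_step(l2=0.01)
coupled_large = first_step(l2=0.1)
decoupled_small = first_step(weight_decay=0.001)
decoupled_large = first_step(weight_decay=0.01)
```

In this toy setting the two coupled runs end up at almost the same weight, because Adam's normalization divides out the size of the $L_2$ gradient. The decoupled runs shrink the weight in proportion to the decay coefficient, which is the behavior weight decay is supposed to have.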
---

## Discussion

* So AdamW is part of what enabled ConvNeXt to train longer and generalize better
* It was in use in late 2018, so it is interesting that it was not the default in some research communities

---

## Stochastic Depth

* The next interesting improvement comes from [stochastic depth](https://arxiv.org/abs/1603.09382)
* This is another idea to improve gradients, although it does a bit more than that
* If the residual blocks are learning to output $x + f(x)$, what if we randomly turn off the $f(x)$?
* It would be as if the layer, and its residual, didn't happen
* Why is that good?

---

## Stochastic Depth

* Gradients should be improved
* Similar to DenseNet and ResNet, there can be shorter pathways to layers
* And they will have less noise without the later layers
* This is similar to dropout; we are creating an implicit ensemble, which improves generalization
* But unlike dropout, the forward pass is also faster when entire layers are removed
* We can also think of this as forcing earlier layers of the network to learn more
* But sometimes they are given a helping hand by the later layers

---

## Mechanics

* We don't drop all layers with the same probability (dropping the first one would be disastrous)
* Instead, drop probabilities increase with depth
Here is the visualization from the paper.
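
The mechanics can be sketched in plain Python. The linear rule below, $p_\ell = 1 - \frac{\ell}{L}(1 - p_L)$ with a final survival probability of 0.5, follows the stochastic depth paper. Everything else is an illustrative assumption: the names `survival_probabilities` and `stochastic_depth_forward` are hypothetical, and each residual branch is a toy scalar function instead of a convolutional block.

```python
import random


def survival_probabilities(num_blocks, final_survival=0.5):
    """Survival probability per block, decreasing linearly with depth."""
    return [1.0 - (l / num_blocks) * (1.0 - final_survival)
            for l in range(1, num_blocks + 1)]


def stochastic_depth_forward(x, branches, survival, training, rng):
    """Apply residual blocks, randomly skipping branches while training."""
    for branch, p in zip(branches, survival):
        if training:
            if rng.random() < p:
                x = x + branch(x)      # block survives: x + f(x)
            # otherwise the block is skipped and x passes through unchanged
        else:
            x = x + p * branch(x)      # at test time, scale f(x) by its survival rate
    return x


survival = survival_probabilities(4)
branches = [lambda x: 0.1 * x] * 4     # toy residual branches
rng = random.Random(0)                 # seeded so that runs are repeatable
train_out = stochastic_depth_forward(1.0, branches, survival, True, rng)
eval_out = stochastic_depth_forward(1.0, branches, survival, False, rng)
```

For four blocks the survival probabilities are 0.875, 0.75, 0.625, and 0.5, so the deepest block is dropped half of the time while the early blocks are almost always kept.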
---

## Gradients

* Does it do what the authors said it would?
The authors showed that gradients remain strong even after learning rate adjustments.
---

## Results

* Of course we always want to show our improvements
CIFAR10 results from the paper.
---

## Label Smoothing

* This wasn't a new technique; it was [introduced in 2016](https://www.cv-foundation.org/openaccess/content_cvpr_2016/html/Szegedy_Rethinking_the_Inception_CVPR_2016_paper.html)
* This is a regularization technique that reduces model overconfidence
* And is included in the [PyTorch CrossEntropyLoss](https://docs.pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html)
* We sometimes replace training labels with new, random labels from a uniform distribution
* You can think of this as forcing a prior of an equal class distribution upon the model

---

## Image Augmentations

* There are a lot of image augmentations used
* I don't want to get hung up on them
* The best augmentations may vary with the data being used
* Just remember that image augmentation is important

---

## Architecture

* Finally, let's talk about how the architecture was changed in ConvNeXt
* The PyTorch Torchvision modules have several [pretrained versions](https://docs.pytorch.org/vision/main/models/convnext.html), if you're interested
* Many of the changes are taken directly from transformers, but some come from other convnet work

---

## Stem Convolution Sizes

* The initial convolution in modern networks is different from the rest
* Its goal is to rapidly reduce the input size without sacrificing features
* ResNets had been using 7x7 convolutions with stride 2 followed by a max pooling layer
* Pooling takes the maximum value in a region and discards the rest
* ConvNeXt changed to a 4x4 convolution with stride 4
* The downscaling is the same, but is done in a single layer

---

## Grouping

* Back in LeNet, not every feature map fed into every convolution of the next layer
* So if we had 16 feature layers, the convolution in the next layer may only pick 4 of them
* This grouped convolution reduces feature combinations but also reduces computation and weights
* ConvNeXt will use as many groups as there are channels, meaning that each convolution kernel has size 1xHxW

---

## DepthWise Operation

* We need some operation across channels
* High-level features are built up from smaller ones, right?
* So after the per-channel operation there is a 1x1 convolution
* This is the same as a linear layer run on each $(x,y)$ location across all feature maps
* Because grouped convolutions lose capacity compared to ungrouped convolutions, this operation quadruples the number of features

---

## Inverted Bottleneck

* That quadrupling of features is then followed by a reduction back to the original number
* Thus an inverted bottleneck
* It isn't as costly a computation as you would think since these are just 1x1 convolutions

---

## Kernel Size
* So what should the size of the first kernel be?
* Experiments showed that 7x7 worked best
* There isn't much to say about this
* Whether or not justifications are true, this structure does improve learning
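
The cost argument behind the depthwise 7x7 convolution and the inverted bottleneck can be checked by counting weights. The sketch below uses the 96-channel sizes that appear in the ConvNeXt block printout in these slides. The helper `conv2d_params` is a hypothetical name, and the count covers only convolution weights and biases, leaving out normalization and any other per-block parameters.

```python
def conv2d_params(in_ch, out_ch, kernel, groups=1, bias=True):
    """Number of weights (plus optional biases) in a 2D convolution."""
    weights = out_ch * (in_ch // groups) * kernel * kernel
    return weights + (out_ch if bias else 0)


channels = 96

# Depthwise 7x7: one group per channel, so each kernel sees a single channel.
depthwise = conv2d_params(channels, channels, 7, groups=channels)

# Inverted bottleneck: 1x1 expansion to 4x the channels, then 1x1 projection back.
expand = conv2d_params(channels, 4 * channels, 1)
project = conv2d_params(4 * channels, channels, 1)

block_total = depthwise + expand + project

# For comparison: a single ungrouped 7x7 convolution with 96 input and output channels.
dense_7x7 = conv2d_params(channels, channels, 7)
```

Under these assumptions the whole block has 79,008 parameters while a single ungrouped 7x7 convolution would have 451,680, which is why the large kernel and the 4x expansion are affordable.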
---

## Micro Improvements

* Now let's change other components
* We like ReLU, right?
* But it isn't smooth
* So, with no very strong justification, the authors change ReLU to GeLU, which is smoother
* [GeLU in PyTorch](https://docs.pytorch.org/docs/stable/generated/torch.nn.GELU.html#gelu)
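
To see what "smoother" means here, the two activations can be compared with a short sketch in plain Python. It assumes the exact (erf-based) form of GeLU, $x\,\Phi(x)$, where $\Phi$ is the standard normal CDF. The function names `relu` and `gelu` are just local helpers, not library calls.

```python
import math


def relu(x):
    # Hard cutoff at zero: the slope jumps from 0 to 1 at x = 0.
    return max(0.0, x)


def gelu(x):
    # x times the standard normal CDF, written with the error function.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


points = [-2.0, -1.0, 0.0, 1.0, 2.0]
relu_values = [relu(x) for x in points]
gelu_values = [gelu(x) for x in points]
```

ReLU is exactly zero for every negative input, while GeLU dips slightly below zero there and approaches the identity for large positive inputs without a sharp corner at the origin.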
---

## Removing Activations

* Next, all but one of the GeLUs are removed from each block
* Why? Again, experiments showed it was better
* But perhaps we can think of the within-feature operation of the convolution and the across-feature operation of the 1x1 convolution as part of a single transformation
* In that case, there should only be one activation function between them
* And the next 1x1 convolution merely projects back into the original feature space, so maybe it doesn't need an activation
* We're hand-waving now

---

## Changing Normalization

* If the activations can be done away with, how about normalization?
* Sure
* And let's swap out BatchNorm for Layer Norm as well
* [Layer Norm PyTorch](https://docs.pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html#torch.nn.LayerNorm)
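
As a rough illustration of what Layer Norm computes, here is a plain Python sketch that normalizes the channel values at a single pixel. It assumes the common usage of normalizing over the channel dimension, uses the biased variance, and leaves out the learnable scale and shift. The name `layer_norm` is a local helper and the four-channel pixel is made up for the example.

```python
import math


def layer_norm(values, eps=1e-6):
    """Normalize a list of values to zero mean and (nearly) unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


# One pixel with four channels (made-up values).
pixel = [2.0, 4.0, 6.0, 8.0]
normalized = layer_norm(pixel)

out_mean = sum(normalized) / len(normalized)
out_var = sum((v - out_mean) ** 2 for v in normalized) / len(normalized)
```

The statistics come from one sample only, so the result does not change with batch size or with the other images in the batch, unlike Batch Norm.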
---

## Why Layer Norm?

* First off, [layer normalization](https://arxiv.org/abs/1607.06450) was introduced in 2016
* But it performed worse than Batch Norm in earlier convolutional networks
* It was developed because batch statistics don't work well with recurrent networks
* There are other circumstances where batch statistics could be poor
  * Small batches or data with disparate, non-Gaussian covariance of features across different classes

---

## What is Layer Norm?

* Layer Norm is generally used to normalize a single pixel across all channels before going into the 1x1 convolution
* ConvNeXt uses it the same way: in the block printout, `Permute` moves the channels last and `LayerNorm((96,))` normalizes over those 96 channels
* So unlike Batch Norm, its statistics do not depend on the other samples in the batch
* It could be that the noise from Batch Norm is unnecessary (or even harmful) with the other regularization and depth present
* Whatever the reason, Layer Norm gives better results in ConvNeXt

---

## ConvNeXt Block

* Here is the PyTorch printout of the block:

```
(1): CNBlock(
  (block): Sequential(
    (0): Conv2d(96, 96, kernel_size=(7, 7), stride=(1, 1), padding=(3, 3), groups=96)
    (1): Permute()
    (2): LayerNorm((96,), eps=1e-06, elementwise_affine=True)
    (3): Linear(in_features=96, out_features=384, bias=True)
    (4): GELU(approximate='none')
    (5): Linear(in_features=384, out_features=96, bias=True)
    (6): Permute()
  )
```

---

## Downsampling

* Downsampling isn't done inside of the ConvNeXt block as it was in ResNets
* Instead, it is done via 2x2 kernels with normalized channels

```
(2): Sequential(
  (0): LayerNorm2d((96,), eps=1e-06, elementwise_affine=True)
  (1): Conv2d(96, 192, kernel_size=(2, 2), stride=(2, 2))
)
```

---

## Takeaways

* Don't feel overwhelmed
* This is a lot, but most of it isn't important
* The key point here is that we still don't fully understand why some things are better than others
* Experiments say they are (and explanations may follow later; there is a ConvNeXt v2 as well)

---

## Important Parts

* AdamW is a clear improvement over Adam
* Stochastic depth is a regularizer that speeds up training
* Sometimes good ideas need to be revisited
* Researchers in one community should pay attention to advances made in other communities