---
## ConvNeXt V2
* The next year, [ConvNeXt V2](https://arxiv.org/abs/2301.00808) used a self-supervised technique called masked autoencoders to reach 88.9% top-1 accuracy, vs 87.8% for ConvNeXt
Again, see Figure 1
---
## Image Masking
* Image masking is what it sounds like
Now see ConvNeXt V2, Figure 2
---
## Trivial?
* Pretraining is usually tricky to get working
* For example, image masking isn't really as effective as token masking
* Why?
* Tokens are more discrete and information dense
* Images likely have more degrees of freedom in unseen sections, and high spatial redundancy
* To get masking to work, a large fraction of the image must be masked (60% or more)
---
## More About Masking
* Masking was done in transformers before convnets
* The tokenized images are easier to mask
* See [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) from 2021
See Figure 1
---
## Examples
* Masking generalizes across image datasets
* Here are some examples from the masked autoencoders paper
ImageNet validation image
COCO validation image
---
## Transformer Vs Convolution
* As mentioned, masking is trivial to do in the attention modules of the transformers
* Tokenize the input image, performing a linear projection and adding a position encoding
* Choose a random subset and send them into the encoder
* At the decoder, add in mask tokens and unshuffle everything to its original position
* Notice that this is nondestructive, as position information is in the tokens
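The steps above can be sketched in a few lines. This is a toy illustration, not the MAE authors' code: the string "tokens", the `upper()` call standing in for the encoder, and the helper names `mae_mask` and `unshuffle` are all hypothetical.

```python
import random

def mae_mask(tokens, mask_ratio, rng):
    """Randomly keep a subset of tokens; return them with their original indices."""
    n = len(tokens)
    n_keep = n - int(n * mask_ratio)
    order = list(range(n))
    rng.shuffle(order)
    keep_idx = sorted(order[:n_keep])  # sorted only to make the output easy to read
    kept = [tokens[i] for i in keep_idx]
    return kept, keep_idx

def unshuffle(encoded, keep_idx, n, mask_token):
    """Put encoded tokens back at their original positions; fill the rest with mask tokens."""
    full = [mask_token] * n
    for tok, i in zip(encoded, keep_idx):
        full[i] = tok
    return full

rng = random.Random(0)
# Stand-in for linearly projected patches with position encodings added.
tokens = [f"patch{i}" for i in range(16)]
kept, keep_idx = mae_mask(tokens, mask_ratio=0.75, rng=rng)
# Stand-in for the encoder: it only ever sees the visible tokens.
encoded = [t.upper() for t in kept]
decoder_input = unshuffle(encoded, keep_idx, len(tokens), mask_token="[MASK]")
print(decoder_input)
```

The encoder processes only 4 of the 16 tokens here, which is where the compute savings come from.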
---
## Advantages
* With 70% of the tokens dropped, the MAE authors could pretrain with a batch size of 4096 for 800 epochs
* This kind of seeming overkill is typical for self-supervised learning
* Overfitting isn't really possible, as there are no labels, just statistics and a weak training signal
* Think of how many ways there are to remove 70% of an image
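To put a rough number on that last point, here is a small count of the possible masks. It assumes a 14x14 grid of patches (a 224x224 image cut into 16x16 patches), which is an illustrative choice and not something stated on this slide.

```python
import math

num_patches = 14 * 14                   # assumed patch grid
num_masked = int(0.70 * num_patches)    # patches removed at a 70% mask ratio
ways = math.comb(num_patches, num_masked)

print(f"{num_masked} of {num_patches} patches masked")
print(f"distinct masks per image: a number with {len(str(ways))} digits")
```

Even for a single image, the model will almost never see the same mask twice.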
---
## More Examples
* Because there are multiple plausible decoded images depending upon the masking, it is difficult to overfit
* Basically, this task is too difficult for that to happen
More examples from the MAE paper, with different mask amounts applied.
---
## More Examples
* It's also important to point out that the DNN is not copying the inputs to the outputs and then drawing around them
* Rather, it is using the input tokens to predict something about the entire image
More examples from the MAE paper, with different mask amounts applied.
---
## What is Being Learned?
* It is important to ask what is being learned by these masking techniques
* The reconstructions imply that the model has an internal representation of objects and structures
* That makes sense
* It has to learn structures to reproduce an image
* Since objects are correlated with themselves, learning objects is a good approach to reconstructing pixels
---
## Context Information
* In vision transformers, the CLS token begins every sequence
* From text sequences: `[CLS] The world is square. [SEP]`
* `[CLS]` can attend to all following tokens
* So it seems that its hidden state holds a representation of the entire sequence
* What does the token correspond to in an image transformer?
---
## Structure and Self-Supervision
* [Emerging Properties in Self-Supervised Vision Transformers](https://openaccess.thecvf.com/content/ICCV2021/html/Caron_Emerging_Properties_in_Self-Supervised_Vision_Transformers_ICCV_2021_paper) demonstrates some interesting results of attention and the [CLS] token
* It turns out that the [CLS] token pays attention (across multiple heads) to objects
See Figure 1 of the paper. Images show the attention paid by the [CLS] token to 8x8 patches, across the heads of the last layer's multi-headed attention in a self-supervised transformer.
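As a rough sketch of how such a map could be produced for one head, the snippet below takes the [CLS] query, scores it against the patch keys, and reshapes the softmax weights into the patch grid. The vectors are made-up toy values and `cls_attention_map` is a hypothetical helper; a real ViT would also let [CLS] attend to itself, which is left out here.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cls_attention_map(q_cls, patch_keys, grid_w):
    """Scaled dot-product attention of one [CLS] query over patch keys, as a 2-D grid."""
    d = len(q_cls)
    scores = [sum(q * k for q, k in zip(q_cls, key)) / math.sqrt(d)
              for key in patch_keys]
    weights = softmax(scores)
    return [weights[r:r + grid_w] for r in range(0, len(weights), grid_w)]

# Toy example: 4 patches laid out as a 2x2 grid, 2-dimensional keys.
q_cls = [1.0, 0.0]
patch_keys = [[3.0, 0.0], [0.0, 1.0], [0.0, -1.0], [-1.0, 0.0]]
attn = cls_attention_map(q_cls, patch_keys, grid_w=2)
for row in attn:
    print([round(w, 3) for w in row])
```

The top-left patch, whose key is most aligned with the query, receives the largest weight.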
---
## Some Context
* To be clear, feature segmentation naturally emerges from many tasks
* For example, picture a network for self-driving that learned to predict egomotion
* If you examine the features, you will find it reacts to lane lines, road edges, and other vehicles
* The fact that we can see such strong features from a self-supervised model is nevertheless interesting
---
## Improvements
* Surprisingly, the attention masks from a self-supervised network appear to be *superior* to those from a supervised network, even with the same architecture
* This is a hint that supervised learning teaches networks to "grasp at straws" to find any contextual hints
See Figure 4 of the paper. Self-attention maps are thresholded to drop the lower 40% of the mass.
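Dropping the lower 40% of the mass is the same as keeping the largest weights until they add up to 60% of the total. A minimal sketch, with `keep_top_mass` as a hypothetical helper and a made-up five-entry attention vector:

```python
def keep_top_mass(attn, mass=0.6):
    """Return a 0/1 mask over the largest entries whose sum first reaches `mass` of the total."""
    total = sum(attn)
    order = sorted(range(len(attn)), key=lambda i: attn[i], reverse=True)
    mask = [0] * len(attn)
    running = 0.0
    for i in order:
        if running >= mass * total:
            break
        mask[i] = 1
        running += attn[i]
    return mask

attn = [0.05, 0.40, 0.10, 0.30, 0.15]   # flattened attention weights (toy values)
mask = keep_top_mass(attn, mass=0.6)
print(mask)  # [0, 1, 0, 1, 0]
```

Only the two strongest positions survive, since together they already hold 70% of the mass.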
---
## In ConvNets
* Are there similar results with pretraining in convolutional networks?
* Let's return to the [ConvNeXt V2](https://arxiv.org/abs/2301.00808) paper
* A masked autoencoder task is also used
* Masking is done on 60% of the positions at the last stage of the encoder
* The decoder then recreates the missing data
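One way to read "masking at the last stage" is that the mask is drawn on the coarsest feature grid and then upsampled so it lines up with the input pixels. The sketch below assumes a 7x7 last-stage grid and a 32x upsampling factor for a 224x224 image; those numbers and the helper names are illustrative assumptions, not details taken from this slide.

```python
import random

def coarse_mask(grid, mask_ratio, rng):
    """Draw a random 0/1 mask on a grid x grid map (1 = masked)."""
    n = grid * grid
    masked = set(rng.sample(range(n), int(n * mask_ratio)))
    return [[1 if r * grid + c in masked else 0 for c in range(grid)]
            for r in range(grid)]

def upsample(mask, factor):
    """Nearest-neighbour upsampling: each cell becomes a factor x factor block."""
    out = []
    for row in mask:
        wide = [v for v in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out

rng = random.Random(0)
mask7 = coarse_mask(grid=7, mask_ratio=0.6, rng=rng)   # assumed last-stage resolution
mask224 = upsample(mask7, factor=32)                   # assumed total downsampling factor
print(len(mask224), len(mask224[0]))
```

Masking whole 32x32 blocks keeps the mask consistent across every stage of a hierarchical convnet.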
---
## Results
* Let's skip over some details (sparse convolutions and the exact training recipe)
* Instead, let's see what the authors learned
* They observed that feature maps in the original ConvNeXt tended to saturate or collapse
See Figure 3 of the paper.
---
## Good?
* It is likely that the cross entropy loss encourages features to go to extremes
* That works well with the softmax function
* But obviously means that the features are not preserving information
* This can be measured by the cosine distance between features computed on random images
* In the original ConvNeXt, cosine distance drops in the later layers
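A small sketch of this kind of measurement: treat each channel's feature map as a flat vector and average the pairwise cosine distance over all channel pairs. The `(1 - cos) / 2` scaling is one common convention and an assumption here, and the toy feature maps are made up. Channels are assumed to be nonzero.

```python
import math

def cosine_distance(u, v):
    """Cosine distance scaled to [0, 1]; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return (1.0 - dot / (nu * nv)) / 2.0

def mean_pairwise_distance(channels):
    """Average distance over all ordered channel pairs, including self-pairs."""
    c = len(channels)
    total = sum(cosine_distance(channels[i], channels[j])
                for i in range(c) for j in range(c))
    return total / (c * c)

diverse = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
collapsed = [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0], [0.5, 0.5, 0.5]]
print(mean_pairwise_distance(diverse))
print(mean_pairwise_distance(collapsed))
```

Channels that are scaled copies of each other give a distance near zero, which is the collapse signature described above.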
---
## Solution
* To encourage different features to have more diverse outputs, they change the model slightly
* Each channel is normalized by its activation relative to the other channels
* $\mathcal{N}(||X_i||) \triangleq \frac{||X_i||}{\sum\limits_{j=1,\dots,C}||X_j||}$
* Basically, the features of each channel are scaled by the relative strength of the channel
* Called Global Response Normalization (GRN)
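The formula on this slide can be sketched directly. This covers only the normalization step as written above; any learnable scale, bias, or residual connection in the published layer is omitted, and the small `eps` in the denominator is an added assumption to avoid dividing by zero.

```python
import math

def grn_scale(x, eps=1e-6):
    """x: list of C channels, each a flat list of H*W activations."""
    norms = [math.sqrt(sum(v * v for v in ch)) for ch in x]   # ||X_i||
    total = sum(norms) + eps                                  # sum_j ||X_j|| (eps is an assumption)
    shares = [n / total for n in norms]                       # N(||X_i||)
    scaled = [[v * s for v in ch] for ch, s in zip(x, shares)]
    return scaled, shares

# Toy input: 3 channels with norms 5, 1, and 1.
x = [[3.0, 4.0], [0.0, 1.0], [0.6, 0.8]]
scaled, shares = grn_scale(x)
print([round(s, 3) for s in shares])
```

The strong channel receives most of the weight while the two weak ones each keep about a seventh, so channels now compete with one another for response.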
---
## Does it Work?
* Of course! Final accuracy numbers are also improved
See Figure 4 of the paper. Feature diversity is improved, and comparable to the transformer trained with MAE.
---
## More Details
* If self-supervised learning is superior, in some way, to supervised learning then that is a big deal
* So is there any more proof?
* [Concept Generalization in Visual Representation Learning](https://openaccess.thecvf.com/content/ICCV2021/html/Sariyildiz_Concept_Generalization_in_Visual_Representation_Learning_ICCV_2021_paper.html) examines generalization
* Took ImageNet1K (a subset of ImageNet with 1K class labels)
* Compared the labels to ImageNet-21K, and selected 5 groups of 1000 labels
* Each group was chosen to be a different "semantic distance" from ImageNet1K
---
## Generalization
* Generalization was then tested from a model trained on ImageNet1K to the 5 new groups
* Each group has decreasing similarity to the original ImageNet1K data
* Benchmark
    * Pretrain a model on ImageNet1K
    * Extract features
    * Learn a linear classifier with those features with no new data
    * Add data (1, 2, 4, ..., 128 samples) to see how quickly it adapts
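The few-shot part of this protocol can be mocked up on synthetic data. Everything here is an assumption for illustration: the "frozen features" are 2-D Gaussians, there are only two classes, and a nearest-class-mean rule (itself a simple linear classifier) stands in for whatever linear probe the paper trains.

```python
import random

def class_means(features, labels):
    """Average the feature vectors of each class."""
    sums, counts = {}, {}
    for f, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * len(f))
        for k, v in enumerate(f):
            acc[k] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def predict(means, f):
    """Assign the class whose mean is closest in squared Euclidean distance."""
    return min(means, key=lambda y: sum((a - b) ** 2 for a, b in zip(f, means[y])))

def sample(rng, center, n):
    return [[rng.gauss(c, 1.0) for c in center] for _ in range(n)]

rng = random.Random(0)
centers = {0: [0.0, 0.0], 1: [5.0, 5.0]}   # synthetic stand-in for frozen features
test_x = [f for y in centers for f in sample(rng, centers[y], 100)]
test_y = [y for y in centers for _ in range(100)]

accuracy = {}
for n in [1, 2, 4, 8, 16, 32, 64, 128]:    # samples per class
    train_x = [f for y in centers for f in sample(rng, centers[y], n)]
    train_y = [y for y in centers for _ in range(n)]
    means = class_means(train_x, train_y)
    hits = sum(predict(means, f) == y for f, y in zip(test_x, test_y))
    accuracy[n] = hits / len(test_y)
print(accuracy)
```

The backbone is never updated; only the small classifier on top changes as more samples are added.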
---
## Tested Models
* There are many
* Read [the paper](https://openaccess.thecvf.com/content/ICCV2021/html/Sariyildiz_Concept_Generalization_in_Visual_Representation_Learning_ICCV_2021_paper.html) for full details
Table 1 from the paper.
---
## Results
* All of the results are here
* But there are 31 models
* Mostly we can gain some trust in the concept generalization idea
Figure 3a from the paper.
---
Figure 3 b,c,d,e from the paper.
---
## Conclusion
* Self-supervised learning is a strong regularizer
---
## Following Up
* The self-supervised models actually gained performance relative to a baseline ResNet-50
* The best of this group is *s-DINO*, which used a contrastive learning technique
Figure 3b from the paper.
---
## s-DINO is anomalously good
Figure 4 from the paper.
---
## Next Time
* Which paper proposed DINO?
* [Emerging Properties in Self-Supervised Vision Transformers](https://openaccess.thecvf.com/content/ICCV2021/html/Caron_Emerging_Properties_in_Self-Supervised_Vision_Transformers_ICCV_2021_paper)
* The one with the very clean features
* There are also two follow-ups:
    * [DINOv2: Learning Robust Visual Features without Supervision](https://arxiv.org/abs/2304.07193)
    * [Vision Transformers Need Registers](https://arxiv.org/abs/2309.16588)
* We'll leave the details of contrastive learning to next lecture
* And we'll go over a couple other interesting unsupervised results
* We care about more than just image classification, right?