# CS 462 - Lecture 19

Unsupervised/Self-Supervised Learning

Bernhard Firner

2026-04-02

---

## A Fun, Relaxed Topic

* Today we will talk more theory than detail
  * Should be a relaxed change of pace from the transformer
* But we should quickly review everything to ensure we don't forget it

---

## Review

<div class="container">
<div class="col">

* We finally made it through modern encoder-decoder networks!
* We needed
  * embeddings
  * attention (scaled and multi-headed)
  * transformer architecture
  * masked attention
  * cross attention

</div>
<div class="col">
<img style="width: 75%" class="r-stretch" src="./figures/UDL/Chap12/TransformerEncoderDecoder.svg" />
<br/>
<small>Figure 12.13 from UDL</small>
</div>
</div>

---

## Vocabulary

* **Token**
  * An input to a transformer, LSTM, or RNN
  * Could be a word, a phrase, a song in a playlist, a part of an image, etc.

* **Embedding**
  * A vector whose values capture intrinsic semantics of a token

---

## Vocabulary

* **Hidden State**
  * An unobservable variable that describes the probabilities of future observations
* **Context** (not to be confused with a context window)
  * This can mean a single vector that "summarizes" the input tokens
    * For example, the vector could capture the overall meaning of an input sentence
  * There is also a hidden state generated as each token is observed
    * As each hidden state $h_t$ is generated, a context $c_t$ is also generated
    * In transformers these intermediate states and contexts are less visible

---

## Vocabulary

* **Linear transformation/linear projection**
  * This means that we use a single linear layer to transform an input
  * This is how we get the queries, keys, and values used in attention
    * Each comes from the original input
      * Except in cross attention where the values and keys come from the encoder
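
A minimal NumPy sketch of the linear projections described above. The sizes, the random weights, and the `linear` helper are illustrative assumptions, not values from any particular model.

```python
import numpy as np

def linear(x, weight, bias):
    """A single linear layer: y = x W + b, applied to every token (row) of x."""
    return x @ weight + bias

rng = np.random.default_rng(0)
num_tokens, d_model, d_head = 5, 8, 4  # illustrative sizes (assumed)
x = rng.normal(size=(num_tokens, d_model))  # stand-in for token embeddings

# Three separate sets of parameters, one per role. In a real model these are learned.
params = {name: (rng.normal(size=(d_model, d_head)), np.zeros(d_head))
          for name in ("query", "key", "value")}

# The same input is projected three different ways.
q, k, v = (linear(x, *params[name]) for name in ("query", "key", "value"))
print(q.shape, k.shape, v.shape)
```

All three outputs come from the same `x`; only the weights differ.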

---

## Vocabulary

* **Encoder**
  * This model "encodes" the tokens of an input
  * A good encoder will capture the context of the input sequence
    * This is generally given to a decoder for a downstream task
* **Decoder**
  * A model that consumes the context and hidden states of the encoder
    * In a transformer with attention, those (conceptually) take the form of values and keys in cross attention

---

## Masked Attention

* Attention is restricted to be temporally plausible
  * Tokens cannot observe tokens from the future
* The ground truth is input to the decoder during training, so masked attention in the decoder prevents cheating

<div class="col">
<img style="width: 65%" class="r-stretch" src="./figures/UDL/Chap12/TransformerDecoder.svg" />
<br/>
<small>Figure 12.12 from UDL</small>
</div>
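
A small sketch of how a causal mask can be applied to attention scores, assuming a single head and NumPy arrays. The function name is hypothetical.

```python
import numpy as np

def causal_attention_weights(scores):
    """Softmax over attention scores with future positions masked out.

    scores: (T, T) array, scores[i, j] is how strongly query i matches key j.
    """
    T = scores.shape[0]
    future = np.triu(np.ones((T, T), dtype=bool), k=1)    # True where j > i
    masked = np.where(future, -np.inf, scores)            # future keys get -inf
    masked = masked - masked.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(masked)                              # exp(-inf) is exactly 0
    return weights / weights.sum(axis=-1, keepdims=True)

# With all scores equal, token i spreads its attention evenly over tokens 0..i.
weights = causal_attention_weights(np.zeros((4, 4)))
print(np.round(weights, 2))
```

Everything above the diagonal is zero, so no token receives information from a later position.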

---

## Cross-Attention

<div class="container">
<div class="col">

* The embeddings from the encoder and decoder are combined through cross attention in a subsequent attention layer
* The keys and values come from the encoder
  * Picture translation: the context and hidden states should come from the original language

</div>
<div class="col">
<img style="width: 55%" class="r-stretch" src="./figures/AttentionIsAllYouNeedFigure1.png" />
<br/>
<small> <a href="https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html">Attention is All You Need</a></small>
</div>
</div>
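
A single-head sketch of cross attention, assuming NumPy and random stand-in weights. The point is only where each input comes from: queries from the decoder, keys and values from the encoder.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(decoder_x, encoder_out, w_q, w_k, w_v):
    q = decoder_x @ w_q      # queries from the decoder (target side)
    k = encoder_out @ w_k    # keys from the encoder (source side)
    v = encoder_out @ w_v    # values from the encoder (source side)
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # scaled dot product
    return weights @ v, weights

rng = np.random.default_rng(0)
d_model = 8                                  # illustrative size (assumed)
encoder_out = rng.normal(size=(6, d_model))  # 6 source-language tokens
decoder_x = rng.normal(size=(3, d_model))    # 3 target-language tokens so far
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

out, weights = cross_attention(decoder_x, encoder_out, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

Each of the 3 decoder tokens gets one row of weights over the 6 encoder tokens, so the two sequences do not need to be the same length.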

---

## Encoder+Decoder

* A full transformer uses multi-headed attention to support multiple features

<div class="col">
<img style="width: 55%" class="r-stretch" src="./figures/UDL/Chap12/TransformerEncoderDecoder.svg" />
<br/>
<small>Figure 12.13 from UDL</small>
</div>

---

## Pretraining With Context

* Remember that we started with a pretraining phase
* The 'modern' version was introduced with [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) in 2018

<div class="col">
<img style="width: 55%" class="r-stretch" src="./figures/UDL/Chap12/TransformerEncoder.svg" />
<br/>
<small>See Figure 12.10 of the UDL book</small>
</div>

---

## BERT

* **B**idirectional **E**ncoder **R**epresentations from **T**ransformers
* BERT looked at context both before and after the token being predicted, so it captures structure
* Training begins by learning to predict masked words
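
A simplified sketch of building a masked-word training example. The function name and the `mask_prob` values are assumptions; BERT itself selects about 15% of tokens and applies extra replacement rules that are left out here.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Hide a random subset of tokens and return (inputs, targets).

    targets[i] holds the original token where inputs[i] was masked, else None,
    so a loss would only be computed at the masked positions.
    """
    rng = random.Random(seed)
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            targets.append(tok)
        else:
            inputs.append(tok)
            targets.append(None)
    return inputs, targets

sentence = "the quick brown fox jumps over the lazy dog".split()
inputs, targets = mask_tokens(sentence, mask_prob=0.3, seed=0)
print(inputs)
print(targets)
```

The targets come from the sentence itself, which is why no labelled data is needed.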

---

## BERT Training

* In the second phase, a new downstream task is used
  * The BERT paper generally refers to a task with question and answer sentences as inputs
* BERT adds an additional embedding, this one indicating whether a token is in sentence 1 or 2
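
A sketch of how a sentence pair could be packed with per-token sentence indicators. The helper name is hypothetical, and the ids 0 and 1 (for sentence 1 and 2) are an assumed convention.

```python
def build_pair_input(sentence_a, sentence_b):
    """Pack two token lists as [CLS] A [SEP] B [SEP] with a sentence id per token."""
    tokens = ["[CLS]"] + sentence_a + ["[SEP]"] + sentence_b + ["[SEP]"]
    # [CLS], sentence A, and the first [SEP] get id 0; the rest get id 1.
    segment_ids = [0] * (len(sentence_a) + 2) + [1] * (len(sentence_b) + 1)
    return tokens, segment_ids

tokens, segment_ids = build_pair_input(["where", "is", "it"], ["over", "there"])
for tok, seg in zip(tokens, segment_ids):
    print(seg, tok)
```

Each segment id would index a small learned embedding table whose output is added to the token's embedding.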

---

## What is Pretraining?

* What is with all of this preparing? Why not just do the thing we want?
* It turns out that data is plentiful
  * But labelled data is not
* This leads us to today's topic: **self-supervised learning**

---

## Pretraining in Real Life

* When you encounter a new object for the first time, you can make guesses about it
  * Is it reflective? Maybe it is metallic.
  * Does it have a seam around the outside? It could be foam or plastic shaped by an injection mold.
* When you make an inference of that type, you are applying past knowledge to a new situation

---

## Task Specifics

* Let's say that you've spent a lifetime (or at least a childhood) grabbing random objects
* If you suddenly need something to prop open a door, what do you do?
  * You'll estimate the force applied by the door and the weight required to prop it open
  * Then you'll grab something of sufficient weight, or wedge something into the gap between the door and the frame
* Although you've never trained to prop open a door, you'll likely do a decent job at it

---

## Unsupervised/Self-Supervised Training

* The nomenclature has changed over time, but the idea is the same
  * We want to learn something where we have a lot of data
  * Then we want to apply what we've learned to a new task
* Importantly, it is ideal if the first step does not need information about the second

---

## Distinctions

* Self-supervised learning is not quite the same as transfer learning
* Transfer learning example
  * AlexNet was trained on a large number of image labels from ImageNet
  * Then the learned features were useful for new applications
* In self-supervised learning, information is extracted from the structure of the unlabelled data
  * We may then use transfer learning to apply the model to a new task

---

## Language Modelling

* Token masking is an example of self-supervised learning
  * The structure of the sentences themselves *is* the data being learned
  * No additional training target is required
* Pretraining is obviously very useful in language modelling
  * BERT demonstrated that we can pretrain a powerful encoder for words, mostly by masking
* Does the same approach work with images?

---

## ConvNeXt

<div class="container">
<div class="col">

* Remember the [ConvNeXt](https://openaccess.thecvf.com/content/CVPR2022/html/Liu_A_ConvNet_for_the_2020s_CVPR_2022_paper.html) paper?
  * The authors showed results, with and without pretraining on a larger dataset

</div>
<div class="col">
<img style="width: 50%" class="r-stretch" src="./figures/convnext_fig1_improvements.png" />
<br/>
<small> See Figure 1 </small>
</div>
</div>

---

## ConvNeXt V2

<div class="container">
<div class="col">

* The next year, [ConvNeXt V2](https://arxiv.org/abs/2301.00808) used a self-supervised technique called masked autoencoders to reach 88.9% top-1 accuracy, vs. 87.8% for ConvNeXt

</div>
<div class="col">
<img style="width: 40%" class="r-stretch" src="./figures/convnextv2_fig1_improvements.png" />
<br/>
<small> Again, see Figure 1 </small>
</div>
</div>

---

## Image Masking

<div class="container">
<div class="col">

* Image masking is what it sounds like

</div>
<div class="col">
<img style="width: 40%" class="r-stretch" src="./figures/convnextv2_fig2_FCMAE.png" />
<br/>
<small> Now see ConvNeXt V2, Figure 2</small>
</div>
</div>

---

## Trivial?

* Pretraining is usually tricky to get working
  * For example, image masking isn't really as effective as token masking
* Why?
  * Tokens are more discrete and information dense
  * Images likely have more degrees of freedom in unseen sections, and high spatial redundancy
* To get masking to work, a large fraction of the image must be masked (60% or more)

---

## More About Masking

* Masking was done in transformers before convnets
  * The tokenized images are easier to mask
* See [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) from 2021

<div class="col">
<img style="width: 50%" class="r-stretch" src="./figures/MAE_figure1.png" />
<br/>
<small> See Figure 1</small>
</div>

---

## Examples

* Masking generalizes across image datasets
* Here are some examples from the masked autoencoders paper

<div class="container">
<div class="col">
<img style="width: 75%" class="r-stretch" src="./figures/MAE_imagenet_decode.png" />
<br/>
<small>ImageNet validation image</small>
</div>
<div class="col">
<img style="width: 75%" class="r-stretch" src="./figures/MAE_coco_decode.png" />
<br/>
<small>COCO validation image</small>
</div>
</div>

---

## Transformer Vs Convolution

* As mentioned, masking is trivial to do with transformers, because attention works on any subset of the tokens
  * Tokenize the input image, performing a linear projection and adding a position encoding
  * Choose a random subset and send them into the encoder
  * At the decoder, add in mask tokens and unshuffle everything to its original position
* Notice that this is nondestructive, as position information is in the tokens
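
The shuffle, drop, and unshuffle steps can be sketched in NumPy as follows. This is an illustration under assumptions, not the MAE authors' code: the function names are hypothetical, and the encoder is replaced by the identity so the bookkeeping is easy to check.

```python
import numpy as np

def random_masking(tokens, mask_ratio, rng):
    """Keep a random subset of tokens; also return the index needed to undo the shuffle."""
    num_tokens = tokens.shape[0]
    num_keep = int(round(num_tokens * (1 - mask_ratio)))
    shuffle = rng.permutation(num_tokens)   # random order of token indices
    restore = np.argsort(shuffle)           # inverse permutation
    kept = tokens[shuffle[:num_keep]]       # only these go through the encoder
    return kept, restore

def unshuffle_with_mask_tokens(encoded, restore, mask_token):
    """Append mask tokens for the dropped positions, then restore the original order."""
    num_masked = restore.shape[0] - encoded.shape[0]
    padding = np.tile(mask_token, (num_masked, 1))
    full = np.concatenate([encoded, padding], axis=0)  # still in shuffled order
    return full[restore]

rng = np.random.default_rng(0)
tokens = np.arange(16 * 3, dtype=float).reshape(16, 3)  # 16 stand-in patch tokens
kept, restore = random_masking(tokens, mask_ratio=0.75, rng=rng)
encoded = kept                                          # identity stand-in for the encoder
full = unshuffle_with_mask_tokens(encoded, restore, mask_token=np.full(3, -1.0))
print(kept.shape, full.shape)
```

After unshuffling, every visible token is back in its original row and every dropped row holds the mask token.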

---

## Advantages

* With 75% of the tokens dropped, the encoder only sees a quarter of each image, so the MAE authors could afford to pretrain with a batch size of 4096 for 800 epochs
  * This kind of seeming overkill is typical for self-supervised learning
  * Overfitting is unlikely, as there are no labels, just statistics and a weak training signal
    * Think of how many ways there are to remove 75% of an image
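
To put a rough number on how many masks are possible, here is a quick count. The 14x14 patch grid (a 224x224 image cut into 16x16 patches) and the mask ratio passed in are assumptions chosen for illustration.

```python
import math

def count_masks(num_patches, mask_ratio):
    """Number of distinct ways to choose which patches to hide."""
    num_masked = round(num_patches * mask_ratio)
    return math.comb(num_patches, num_masked)

num_patches = 14 * 14             # assumed patch grid
n = count_masks(num_patches, 0.75)
print(f"{len(str(n))} decimal digits")
```

Even for a single image, the count is far larger than the number of training steps, so the network essentially never sees the same masked input twice.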

---

## More Examples

* Because there are multiple plausible decoded images depending upon the masking, it is difficult to overfit
  * Basically, this task is too difficult for that to happen

<div class="col">
<img style="width: 75%" class="r-stretch" src="./figures/MAE_figure4.png" />
<br/>
<small>More examples from the <a href="https://arxiv.org/abs/2111.06377">MAE paper</a>, with different mask amounts applied.</small>
</div>

---

## More Examples

* It's also important to point out that the DNN is not copying the inputs to the outputs and then drawing around them
  * Rather, it is using the input tokens to predict something about the entire image

---

## What is Being Learned?

* It is important to ask what is being learned by these masking techniques
* The reconstructions imply that the model has an internal representation of objects and structures
* That makes sense
  * It has to learn structures to reproduce an image
  * Since the pixels of an object are correlated with each other, learning objects is a good approach to reconstructing pixels

---

## Context Information

* In vision transformers, the CLS token begins every sequence
  * From text sequences: `[CLS] The world is square. [SEP]`
    * `[CLS]` can attend to all following tokens
    * So it seems that its hidden state has a representation of the entire sequence
* What does the token correspond to in an image transformer?

---

## Structure and Self-Supervision

* [Emerging Properties in Self-Supervised Vision Transformers](https://openaccess.thecvf.com/content/ICCV2021/html/Caron_Emerging_Properties_in_Self-Supervised_Vision_Transformers_ICCV_2021_paper) demonstrates some interesting results of attention and the [CLS] token
* It turns out that the [CLS] token pays attention (across multiple heads) to objects

<div class="col">
<img style="width: 45%" class="r-stretch" src="./figures/EmergingProperties_figure1.png" />
<br/>
<small>See Figure 1 of the paper. Images show the attention paid to 8x8 patches by the [CLS] token, across the attention heads in the last layer of a self-supervised transformer.</small>
</div>
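
A sketch of how such a map could be pulled out of an attention tensor, assuming the [CLS] token sits at index 0 and the patches follow in row-major order. The random attention values are a stand-in for a real model's last layer.

```python
import numpy as np

def cls_attention_maps(attn, grid_h, grid_w):
    """attn: (num_heads, 1 + grid_h*grid_w, 1 + grid_h*grid_w), each row sums to 1.

    Returns (num_heads, grid_h, grid_w): attention from [CLS] to every patch.
    """
    cls_to_patches = attn[:, 0, 1:]  # query = [CLS], keys = patches only
    return cls_to_patches.reshape(attn.shape[0], grid_h, grid_w)

rng = np.random.default_rng(0)
num_heads, grid_h, grid_w = 6, 4, 4  # illustrative sizes (assumed)
raw = rng.random(size=(num_heads, 1 + grid_h * grid_w, 1 + grid_h * grid_w))
attn = raw / raw.sum(axis=-1, keepdims=True)  # normalize rows like a softmax output

maps = cls_attention_maps(attn, grid_h, grid_w)
print(maps.shape)
```

Each head yields one small image that can be upsampled and overlaid on the input.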

---

## Some Context

* To be clear, feature segmentation naturally emerges from many tasks
  * For example, picture a network for self-driving that learned to predict egomotion
  * If you examine the features, you will find it reacts to lane lines, road edges, and other vehicles
* The fact that we can see such strong features from a self-supervised model is nevertheless interesting

---

## Improvements

* The surprising thing is that the attention masks from a self-supervised network appear to be *superior* to those from a supervised network, even with the same architecture
* This is a hint that supervised learning teaches networks to "grasp at straws" to find any contextual hints

<div class="col">
<img style="width: 45%" class="r-stretch" src="./figures/EmergingProperties_figure4.png" />
<br/>
<small>See Figure 4 of the paper. Self-attention maps are thresholded to drop the lower 40% of the mass. </small>
</div>
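
One plausible way to threshold an attention map by mass, sketched in NumPy. The exact procedure used for the figure is not given here, so the function below is an assumption: it keeps the largest values until they account for 60% of the total.

```python
import numpy as np

def keep_top_mass(attn_map, keep=0.6):
    """Boolean mask of the largest values that together hold `keep` of the total mass."""
    flat = attn_map.ravel()
    order = np.argsort(flat)[::-1]                    # indices, largest value first
    cumulative = np.cumsum(flat[order]) / flat.sum()
    # Keep every entry up to and including the one that first reaches `keep`.
    num_keep = int(np.searchsorted(cumulative, keep)) + 1
    mask = np.zeros(flat.shape, dtype=bool)
    mask[order[:num_keep]] = True
    return mask.reshape(attn_map.shape)

attn_map = np.array([[0.4, 0.3],
                     [0.2, 0.1]])
mask = keep_top_mass(attn_map, keep=0.6)
print(mask)
```

In the toy map, 0.4 alone is below 60% of the mass, so 0.3 is kept as well and the two smallest entries are dropped.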

---

## In ConvNets

* Are there similar results with pretraining in convolutional networks?
  * Let's return to the [ConvNeXt V2](https://arxiv.org/abs/2301.00808) paper
* A masked autoencoder task is also used
  * Masking is done on 60% of the positions at the last stage of the encoder
  * The decoder then recreates the missing data

---

## Results

* Let's skip over some details (sparse convolutions and the exact training recipe)
* Instead, let's see what the authors learned
* They observed that feature maps in the original ConvNeXt tended to saturate or collapse

<div class="col">
<img style="width: 45%" class="r-stretch" src="./figures/convnextv2_fig3.png" />
<br/>
<small>See Figure 3 of the paper.</small>
</div>

---

## Good?

* It is likely that the cross entropy loss encourages features to go to extremes
  * That works well with the softmax function
  * But obviously means that the features are not preserving information
* This can be measured in the cosine distance of features in random images
  * In the original ConvNeXt, cosine distance drops in the later layers
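
A sketch of one such measurement: the average pairwise cosine distance between the channels of a feature map. The `(1 - cos) / 2` scaling and the averaging over all channel pairs are assumptions about the exact definition.

```python
import numpy as np

def mean_pairwise_cosine_distance(features, eps=1e-12):
    """features: (H, W, C). Average of (1 - cos(X_i, X_j)) / 2 over all channel pairs."""
    h, w, c = features.shape
    channels = features.reshape(h * w, c).T  # (C, H*W), one row per channel
    norms = np.linalg.norm(channels, axis=1, keepdims=True)
    unit = channels / np.maximum(norms, eps)
    cosine = unit @ unit.T                   # (C, C) cosine similarities
    return float(np.mean((1.0 - cosine) / 2.0))

rng = np.random.default_rng(0)
base = rng.random(size=(4, 4))
collapsed = np.repeat(base[..., None], 8, axis=2)  # every channel identical
diverse = rng.normal(size=(4, 4, 8))               # unrelated channels

print(mean_pairwise_cosine_distance(collapsed))
print(mean_pairwise_cosine_distance(diverse))
```

Collapsed features give a distance near zero, while unrelated channels give a clearly larger value.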

---

## Solution

* To encourage different features to have more diverse outputs, they change the model slightly
  * Each channel is normalized by its activation relative to the other channels
* $\mathcal{N}(||X_i||) \triangleq \frac{||X_i||}{\sum\limits_{j=1,\ldots,C}||X_j||}$
  * Basically, the features of each channel are scaled by the relative strength of the channel
  * Called Global Response Normalization (GRN)
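
A NumPy sketch of GRN that follows the formula on this slide. The learnable `gamma` and `beta`, the residual connection, and the small `eps` are assumptions added to make a complete layer, and a reference implementation may normalize by the mean of the channel norms rather than the sum.

```python
import numpy as np

def global_response_norm(x, gamma, beta, eps=1e-6):
    """x: (H, W, C) feature map; gamma, beta: (C,) scale and shift (assumed learnable)."""
    g = np.sqrt((x ** 2).sum(axis=(0, 1)))  # ||X_i||: L2 norm of each channel, shape (C,)
    n = g / (g.sum() + eps)                 # N(||X_i||) = ||X_i|| / sum_j ||X_j||
    return gamma * (x * n) + beta + x       # rescale each channel, plus a residual path

x = np.ones((2, 2, 3))                      # three equally strong channels
out = global_response_norm(x, gamma=np.ones(3), beta=np.zeros(3))
print(out[0, 0])
```

With three equally strong channels, each one gets a relative strength of about 1/3, so a strong channel is boosted more than a weak one.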

---

## Does it Work?

* Of course! Final accuracy numbers are also improved

<div class="col">
<img style="width: 55%" class="r-stretch" src="./figures/convnextv2_fig4.png" />
<br/>
<small>See Figure 4 of the paper. Feature diversity is improved, and comparable to the transformer trained with MAE.</small>
</div>

---

## More Details

* If self-supervised learning is superior, in some way, to supervised learning then that is a big deal
  * So is there any more proof?
* [Concept Generalization in Visual Representation Learning](https://openaccess.thecvf.com/content/ICCV2021/html/Sariyildiz_Concept_Generalization_in_Visual_Representation_Learning_ICCV_2021_paper.html) examines generalization
  * Took ImageNet1K (a subset of ImageNet with 1K class labels)
  * Compared the labels to ImageNet-21K, and selected 5 groups of 1000 labels
  * Each group was chosen to be a different "semantic distance" from ImageNet1K

---

## Generalization

* Generalization was then tested from a model trained on ImageNet1K to the 5 new groups
  * Each group has decreasing similarity to the original ImageNet1K data
* Benchmark
  * Pretrain a model on ImageNet1K
  * Extract features
  * Learn a linear classifier with those features with no new data
  * Add data (1, 2, 4, ..., 128 samples) to see how quickly it adapts
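
The shape of this benchmark can be sketched with synthetic data. Everything here is a stand-in: the "features" are random clusters instead of real backbone outputs, the sample counts are treated as per class, and the ridge-regularized least-squares classifier is just one simple linear classifier, not necessarily what the paper used.

```python
import numpy as np

def add_bias(features):
    return np.hstack([features, np.ones((features.shape[0], 1))])

def fit_linear_classifier(features, labels, num_classes, ridge=1e-3):
    """Least-squares fit to one-hot targets on frozen features."""
    x = add_bias(features)
    y = np.eye(num_classes)[labels]
    return np.linalg.solve(x.T @ x + ridge * np.eye(x.shape[1]), x.T @ y)

def predict(weights, features):
    return np.argmax(add_bias(features) @ weights, axis=1)

def few_shot_accuracy(train_x, train_y, test_x, test_y, shots, num_classes, rng):
    """Train on `shots` samples per class, evaluate on the full test set."""
    idx = np.concatenate([
        rng.choice(np.flatnonzero(train_y == c), size=shots, replace=False)
        for c in range(num_classes)
    ])
    weights = fit_linear_classifier(train_x[idx], train_y[idx], num_classes)
    return float(np.mean(predict(weights, test_x) == test_y))

# Synthetic stand-in for features extracted from a frozen, pretrained model.
rng = np.random.default_rng(0)
num_classes, dim = 3, 16
centers = 5.0 * rng.normal(size=(num_classes, dim))

def sample(per_class):
    labels = np.repeat(np.arange(num_classes), per_class)
    return centers[labels] + rng.normal(size=(labels.size, dim)), labels

train_x, train_y = sample(128)
test_x, test_y = sample(50)

results = {shots: few_shot_accuracy(train_x, train_y, test_x, test_y,
                                    shots, num_classes, rng)
           for shots in (1, 2, 4, 8, 16, 32, 64, 128)}
for shots, acc in results.items():
    print(shots, round(acc, 3))
```

The pretrained model is never updated. Only the small linear layer on top is fit, so the accuracy curve reflects how useful the frozen features are.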

---

## Tested Models

<div class="container">
<div class="col">

* There are many
* Read [the paper](https://openaccess.thecvf.com/content/ICCV2021/html/Sariyildiz_Concept_Generalization_in_Visual_Representation_Learning_ICCV_2021_paper.html) for full details

</div>
<div class="col">
<img style="width: 75%" class="r-stretch" src="./figures/ConceptGeneralization_table1.png" />
<br/>
<small>Table 1 from the paper.</small>
</div>
</div>

---

## Results

<div class="container">
<div class="col">

* All of the results are here
  * But there are 31 models
* Mostly we can gain some trust in the concept generalization idea

</div>
<div class="col">
<img style="width: 75%" class="r-stretch" src="./figures/ConceptGeneralization_fig3a.png" />
<br/>
<small>Figure 3a from the paper.</small>
</div>
</div>

---

<div class="col">
<img style="width: 55%" class="r-stretch" src="./figures/ConceptGeneralization_fig3bcde.png" />
<br/>
<small>Figure 3 b,c,d,e from the paper.</small>
</div>

---

## Conclusion

* Self-supervised learning is a strong regularizer

---

## Following Up

<div class="container">
<div class="col">

* The self-supervised models actually gained performance relative to a baseline ResNet50
* The best of this group is *s-DINO*, which used a contrastive learning technique

</div>
<div class="col">
<img style="width: 75%" class="r-stretch" src="./figures/ConceptGeneralization_fig3b.png" />
<br/>
<small>Figure 3b from the paper.</small>
</div>
</div>

---

## s-DINO is anomalously good

<div class="col">
<img style="width: 65%" class="r-stretch" src="./figures/ConceptGeneralization_fig4.png" />
<br/>
<small>Figure 4 from the paper.</small>
</div>

---

## Next Time

* Which paper proposed DINO?
  * [Emerging Properties in Self-Supervised Vision Transformers](https://openaccess.thecvf.com/content/ICCV2021/html/Caron_Emerging_Properties_in_Self-Supervised_Vision_Transformers_ICCV_2021_paper)
    * The one with the very clean features
  * There are also two follow-ups:
    * [DINOv2: Learning Robust Visual Features without Supervision](https://arxiv.org/abs/2304.07193)
    * [Vision Transformers Need Registers](https://arxiv.org/abs/2309.16588)
* We'll leave the details of contrastive learning to next lecture
* And we'll go over a couple of other interesting unsupervised results
  * We care about more than just image classification, right?

<!--

It would be nice to take it easy on the students after transformers, showing a few examples of unsupervised training.

########### Motion and real world structure
2017
"Unsupervised Learning of Depth and Ego-Motion from Video" by Zhou et al.
https://openaccess.thecvf.com/content_cvpr_2017/html/Zhou_Unsupervised_Learning_of_CVPR_2017_paper.html

2017
Unsupervised Deep Homography: A Fast and Robust Homography Estimation Model
Nguyen et al.
https://arxiv.org/abs/1709.03966

########### Masking
#Done
2018
BERT Masking

#Done
2021
Masked Autoencoders Are Scalable Vision Learners
https://arxiv.org/abs/2111.06377

2021
Emerging Properties in Self-Supervised Vision Transformers
https://openaccess.thecvf.com/content/ICCV2021/html/Caron_Emerging_Properties_in_Self-Supervised_Vision_Transformers_ICCV_2021_paper

2022
A ConvNet for the 2020s
https://openaccess.thecvf.com/content/CVPR2022/html/Liu_A_ConvNet_for_the_2020s_CVPR_2022_paper.html

2023
ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders
https://arxiv.org/abs/2301.00808

########### Contrastive and Similarity Learning:
2020
Momentum Contrast for Unsupervised Visual Representation Learning
https://openaccess.thecvf.com/content_CVPR_2020/html/He_Momentum_Contrast_for_Unsupervised_Visual_Representation_Learning_CVPR_2020_paper.html
2020
A Simple Framework for Contrastive Learning of Visual Representations
https://arxiv.org/abs/2002.05709
2020
Bootstrap Your Own Latent - A New Approach to Self-Supervised Learning, by Grill et al.
https://proceedings.neurips.cc/paper/2020/hash/f3ada80d5c4ee70142b17b8192b2958e-Abstract.html
2020
Unsupervised Learning of Visual Features by Contrasting Cluster Assignments
https://proceedings.neurips.cc/paper/2020/hash/70feb62b69f16e0238f741fab228fec2-Abstract.html
2021
Concept generalization in visual representation learning
https://openaccess.thecvf.com/content/ICCV2021/html/Sariyildiz_Concept_Generalization_in_Visual_Representation_Learning_ICCV_2021_paper.html
2021
Emerging Properties in Self-Supervised Vision Transformers
https://openaccess.thecvf.com/content/ICCV2021/html/Caron_Emerging_Properties_in_Self-Supervised_Vision_Transformers_ICCV_2021_paper
2024
Understanding the Benefits of SimCLR Pre-Training in Two-Layer Convolutional Neural Networks
https://arxiv.org/abs/2409.18685

DINOv2: Learning Robust Visual Features without Supervision
https://arxiv.org/abs/2304.07193

TODO:
Add a vocabulary section to the review: token, linear projection, token (including MASK token, CLS) etc
Talk about the layer cosine distances from self-supervised vs supervised
DINO and ConvNeXtV2 both talk about this, and it is interesting.

-->