# CS 462 - Lecture 21

## <div class="clone-word"> **Generative Models** </div>

Bernhard Firner

2026-04-14

---

## The Power of DNNs

* We have seen many examples of the power and expressiveness of DNNs throughout the course
* Starting from solutions to simple regression and classification to complicated predictive models, we've run through a blistering set of applications
* One thing that has hopefully been made clear about deep learning is that it revolves around *transformations*

---

## Transformation

* Deep learning has the incredible ability to take an image, words, or other information and project it into a latent space
* This space tends to be more efficient, essentially compressing the original input into a smaller number of dimensions
  * That latent information is a superior place to do classification and prediction
* From the features observed with self-supervised learning, well-trained models must have information about real-world structure incorporated into their weights and biases

---

## Latent Spaces

* We often call inputs 'x' and outputs 'y'
* What is the intermediate state?
  * We call those unseen variables 'z'
* z is a set of unseen, latent variables
  * z represents x, and can sometimes be thought of as a compressed x

---

## Latent Spaces and Embeddings

* You can think of our word embeddings as a type of latent space
* A good latent space vector should have independence between columns
  * Something we never specifically cared about in embeddings, and didn't try to enforce
* With word tokens the embedding is "clean" because tokens have distinct values
  * With images or sound there will be a noise component to $z$

---

## Generation

* If that information is hidden away in a DNN's internals, can we possibly reverse the transformation of input to answer, running "cat" backwards through AlexNet and getting a picture of a cat?
* Things are not quite so simple, but that basic idea is supported by *generative models*
  * Like the self-supervised models we have discussed, generative models must also learn structure without explicit labels

---

## Transformers?

* Wasn't text prediction generative?
  * Not in the same way
* A trained translation transformer, for example, takes tokens from one language and predicts tokens from a second
  * The input in one language is the thing causing the generation
* Even if translation is between text and images, the DNN still requires a meaningful input to produce an output

---

## Generative Models

* But what if we created that latent input directly, somehow?
  * We should be able to generate an output only from that, right?
  * Or what if we had a way to randomly find a value for z that resulted in reasonable outputs?
* A **generative model** requires no structured input to create an output
* That doesn't mean it won't require any input, just nothing structured

---

## Qualities

* At its most basic, a generative model could simply recreate a single real piece of data, $x_i$
  * That wouldn't be very useful
* We would say that the model had poor **coverage**, meaning that its outputs didn't cover the space of X

---

## Qualities Continued

* We care about more than coverage
* **Quality**: Is the data generated indistinguishable from real samples?
* **Efficiency**: Can the data be generated quickly, with small compute requirements?
* **Smooth response to z**: The generative output should change slightly with small changes to z
* **Disentangled latent space**: The axes of z should be uncorrelated, or as uncorrelated as possible
* **Likelihood computation**: Can we calculate the probabilities of seeing a sample like the one generated?

---

## Inception Score

* We can also describe quality of generative outputs by the classification results of a pretrained model
  * The model should predict $p(y_i|x)$ with low entropy, corresponding to values close to 0 or 1 for any class, $y_i$
    * Basically, our generator should fool the pretrained model to confidently predict $x$ is one and only one class
* The class score should be unbiased as well, so over the entire set of synthetic images the class probabilities should be even
* This was called the Inception score in [Improved Techniques for Training GANs](https://proceedings.neurips.cc/paper_files/paper/2016/hash/8a3363abe792db2d8761d6403605aeb7-Abstract.html)
  * Named for a particular model
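
As a rough sketch of how the score is computed: the Inception score is the exponential of the average KL divergence between each image's class prediction, $p(y|x)$, and the class distribution over the whole synthetic set. The probability rows below are made-up values standing in for the output of a pretrained classifier.

```python
import math

def inception_score(probs):
    """Inception score from per-image class probabilities p(y|x).

    probs is a list of rows, one per synthetic image, each summing to 1.
    """
    n = len(probs)
    num_classes = len(probs[0])
    # Marginal class distribution p(y), averaged over the synthetic set.
    marginal = [sum(row[k] for row in probs) / n for k in range(num_classes)]
    # Mean KL(p(y|x) || p(y)); classes with zero probability contribute nothing.
    mean_kl = 0.0
    for row in probs:
        mean_kl += sum(p * math.log(p / m) for p, m in zip(row, marginal) if p > 0)
    mean_kl /= n
    return math.exp(mean_kl)

# Hypothetical classifier outputs for two synthetic images and two classes.
confident_and_even = [[1.0, 0.0], [0.0, 1.0]]  # one clear class each, classes balanced
unsure = [[0.5, 0.5], [0.5, 0.5]]              # high-entropy predictions
one_class_only = [[1.0, 0.0], [1.0, 0.0]]      # confident, but biased to one class

print(inception_score(confident_and_even))
print(inception_score(unsure))
print(inception_score(one_class_only))
```

Only the first set is both confident and balanced, so it gets the highest score (about the number of classes). The other two sets fall to the minimum score of 1.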

---

## Fréchet Inception Distance

* The Inception score is not a great metric, and is heavily influenced by the model used
  * The Fréchet Inception Distance in [GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium](https://proceedings.neurips.cc/paper/2017/hash/8a1d694707eb0fefe65871369074926d-Abstract.html) is slightly better
* Here, the features from the last pooling layer of a pretrained Inception network are compared
  * The distance between the means and covariances of the features of synthetic samples and real samples is used as the metric
* This works well, in general, but is also very sensitive to the model used
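
A small sketch of the Fréchet distance between two sets of features. To stay dependency-free it assumes each feature dimension is independent (a diagonal covariance), which is a simplification of the full-covariance metric. The feature values are made up for illustration.

```python
import math
from statistics import fmean, pvariance

def frechet_distance_diagonal(real_feats, fake_feats):
    """Squared Frechet distance between two Gaussians fitted per dimension.

    Assumes a diagonal covariance, so each dimension contributes
    (mu_r - mu_s)^2 + var_r + var_s - 2 * sqrt(var_r * var_s).
    """
    dims = len(real_feats[0])
    total = 0.0
    for k in range(dims):
        real_k = [f[k] for f in real_feats]
        fake_k = [f[k] for f in fake_feats]
        mu_r, mu_s = fmean(real_k), fmean(fake_k)
        var_r, var_s = pvariance(real_k), pvariance(fake_k)
        total += (mu_r - mu_s) ** 2 + var_r + var_s - 2.0 * math.sqrt(var_r * var_s)
    return total

# Hypothetical 2-D features: the synthetic set is shifted by 1 in the first dimension.
real = [[0.0, 0.0], [2.0, 2.0]]
fake = [[1.0, 0.0], [3.0, 2.0]]
print(frechet_distance_diagonal(real, real))
print(frechet_distance_diagonal(real, fake))
```

Identical feature sets give a distance of zero, and the distance grows as the means or spreads of the two sets drift apart.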

---

## Methods

* Generative models can be made in different ways, with different advantages
* We're just going to look at two different ones
  * GANs: Generative Adversarial Networks
  * Diffusion Models

---

## Generative Adversarial Networks

* GANs consist of two networks
  * A *generator*, which creates synthetic data samples
  * A *discriminator*, which judges if an input is synthetic or real
* The pair of networks could work with any data and use any architecture
  * We'll stick with image examples because they are easy to put into slides

---

## Training a GAN

* The idea is simple
  * $D$, the discriminator, will classify an image as real or synthetic
  * $G$, the generator, will generate an image
* The discriminator is a classifier, and should minimize cross entropy loss
* $\hat{\phi} = \underset{\phi}{argmin}\Bigg[\sum\limits_j -log\bigg[1 - sig\big[D[x_j^*,\phi ]\big]\bigg]  -\sum\limits_i log\bigg[ sig\big[D[x_i, \phi ]\big]\bigg]\Bigg]$
  * Where $x^*_j$ are the synthetic examples from the generator

---

## Training a GAN

* We can add the generator to this
* The generator wants to optimize its own parameters, $\Theta$
* <small>$\hat{\Theta} = \underset{\Theta}{argmax}\left[\underset{\phi}{min}\Bigg[\sum\limits_j -log\bigg[1 - sig\big[D[G[z_j,\Theta],\phi ]\big]\bigg]  -\sum\limits_i log\bigg[ sig\big[D[x_i, \phi ]\big]\bigg]\Bigg]\right]$</small>
* This is complicated; $\phi$ and $\Theta$ are in competition, trying to minimize and maximize the same equation

---

## Loss

* This makes the loss complicated
  * We can think of it coming in two terms
* $\mathcal{L}(\phi) = \sum\limits_j -log\bigg[1 - sig\big[D[G[z_j,\Theta],\phi ]\big]\bigg]  -\sum\limits_i log\bigg[ sig\big[D[x_i, \phi ]\big]\bigg]$
* $\mathcal{L}(\Theta) = \sum\limits_j log\bigg[1 - sig\big[D[G[z_j,\Theta],\phi ]\big]\bigg]$
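
The two losses can be sketched directly from raw discriminator outputs. The function names and the `log_sig` helper (a numerically stable $log[sig[\cdot]]$) are illustrative choices, not taken from the UDL text. Note that $\mathcal{L}(\Theta)$ is just the negated synthetic term of $\mathcal{L}(\phi)$.

```python
import math

def log_sig(a):
    """log(sig(a)), written to avoid overflow for large |a|."""
    if a >= 0:
        return -math.log1p(math.exp(-a))
    return a - math.log1p(math.exp(a))

def gan_losses(real_outputs, fake_outputs):
    """Return (L(phi), L(Theta)) given raw outputs D[x_i] and D[G[z_j]]."""
    # log(1 - sig(d)) equals log(sig(-d)).
    fake_term = sum(-log_sig(-d) for d in fake_outputs)
    real_term = sum(-log_sig(d) for d in real_outputs)
    loss_phi = fake_term + real_term   # discriminator minimizes this
    loss_theta = -fake_term            # generator minimizes this
    return loss_phi, loss_theta

# With all outputs at zero the discriminator is guessing (probability 0.5).
print(gan_losses([0.0], [0.0]))
# A synthetic sample that D scores as "real" lowers the generator's loss.
print(gan_losses([0.0], [2.0])[1], gan_losses([0.0], [-2.0])[1])
```

Because the synthetic term appears in both losses with opposite signs, any step that helps one network on that term hurts the other.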

---

## Visualized

<div class="col">
<img style="width: 50%" class="r-stretch" src="./figures/UDL/Chap15/GanDiscrimGen.svg" />
<br/>
<small>See the UDL text, Figure 15.2.</small>
</div>

---

## Questions

* You should have two (or more!) questions
  * Where is $z$ coming from?
  * Does that loss function work?

---

## Z, the latent variable

* $z$ is sampled as random numbers
  * Generally either a vector of uniform or Gaussian values
* The loss function will enforce that $G[z, \Theta]$ maps into the space of real images
  * In DCGAN, the latent vector is projected to a 4x4 feature space via a single linear layer

<div class="col">
<img style="width: 50%" class="r-stretch" src="./figures/UDL/Chap15/GanDCGANArch.svg" />
<br/>
<small>Deep Convolutional GAN. See the UDL text, Figure 15.3.</small>
</div>
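
A minimal sketch of where $z$ comes from and of the first projection step. The latent size, channel count, and weight initialization here are toy values chosen for illustration, not the ones used by DCGAN.

```python
import random

def sample_z(dim, kind="gaussian", rng=random):
    """Draw a latent vector of independent Gaussian or uniform values."""
    if kind == "gaussian":
        return [rng.gauss(0.0, 1.0) for _ in range(dim)]
    if kind == "uniform":
        return [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    raise ValueError("kind must be 'gaussian' or 'uniform'")

def project_to_4x4(z, weights, biases, channels):
    """Apply one linear layer to z, then reshape to [channels][4][4]."""
    flat = [sum(w * zi for w, zi in zip(row, z)) + b
            for row, b in zip(weights, biases)]
    return [[[flat[(c * 4 + r) * 4 + col] for col in range(4)]
             for r in range(4)]
            for c in range(channels)]

rng = random.Random(0)
z = sample_z(8, "gaussian", rng)
channels = 2  # toy value; a real generator would use many more
weights = [[rng.gauss(0.0, 0.02) for _ in z] for _ in range(4 * 4 * channels)]
biases = [0.0] * (4 * 4 * channels)
features = project_to_4x4(z, weights, biases, channels)
print(len(features), len(features[0]), len(features[0][0]))
```

The rest of the generator would upsample these 4x4 feature maps with convolutions until they reach the image size.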

---

## Loss Meaning

* If things converge correctly, we would expect that G will fool D about half of the time
* There is no guarantee of such a thing, but it would imply that D is guessing
  * Generally, the generator is not quite so good
  * In fact, the loss of the generator may go up as the discriminator improves
* The random $z$ values provided during training *should* provide G with enough randomness to "surprise" D, tricking it into believing that a synthetic image is real
  * Thus G must actually be using z, since it has no other source of entropy

---

## Effects

* If $G[z, \Theta]$ creates images that map into the real domain, then the values in $z$ should correspond to qualities of the image
  * Right?
  * Sounds tough.
* Here's a problem: a trivial solution for the generator is to pick a subset of real images and just recreate them
  * This is an extreme version of a problem known as **mode collapse**

---

## Oscillation

* The loss function prevents this from being stable if the generator chooses only a single image
  * Because the discriminator is fine classifying one real image incorrectly
  * But imagine if, during training, G rotates between a few images
  * The discriminator will always be chasing the tail of the generator, the behavior oscillates, and nothing actually improves
* This is why we must consider the coverage of our generator
  * In the case of G making a small number of copies, the coverage is abysmal

---

## Mode Collapse

* If we move past an oscillatory behavior, we may still have a mode problem
* We could have a generator that competes fairly with the discriminator, resulting in a confidence of 50% for images
* But what if 10% of the images are too difficult and complicated for G?
  * The generator simply won't bother trying to compete on those images
* The generated images will have lost some modalities, and G has no incentive to generate them

---

## Mode Collapse Problems

* The Generator need only find a solution that works across all images to fool the Discriminator
  * That means that the loss will push G towards the location where D is most confident that something is *not* synthetic
  * In effect, we have a single goal line towards which all synthetic outputs move
* If any data modalities are biased towards being classified as synthetic, G will ignore them

---

## Gradients

* Looking at the gradients can reveal what is happening
* If we freeze the generator of a deep convolutional GAN, but keep training the discriminator, G's gradients vanish
* This means that if the discriminator "gets ahead" of the generator, there is no gradient, meaning no way for G to improve

</div>

<div class="col">
<img style="width: 75%" class="r-stretch" src="./figures/UDL/Chap15/GanGradients.svg" />
<br/>
<small>See the UDL text, Figure 15.7.</small>
</div>
</div>
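
One contributing mechanism can be sketched with the generator's loss term, $log[1 - sig[d]]$, where $d$ is the discriminator's raw output for a synthetic sample. Its derivative with respect to $d$ is $-sig[d]$, which shrinks towards zero as the discriminator becomes confident that the sample is synthetic. This is a one-variable illustration under that assumption, not a reproduction of the figure.

```python
import math

def sig(a):
    return 1.0 / (1.0 + math.exp(-a))

def generator_loss_grad(d):
    """Derivative of log(1 - sig(d)) with respect to d, which is -sig(d)."""
    return -sig(d)

# More negative d means the discriminator is more certain the sample is synthetic.
for d in (0.0, -2.0, -5.0, -10.0):
    print(d, generator_loss_grad(d))
```

The gradient magnitude starts at 0.5 when the discriminator is guessing and decays roughly exponentially as it becomes more confident.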

---

## Fixing The Problem

* GANs had a bad reputation for being difficult to train
  * Careful tuning was required
  * "Tricks" in initialization and the start of training
  * Methods to control gradients and keep progress balanced between the generator and the discriminator
* But advances since their beginnings have simplified the difficulties
* Let's go through one improvement

---

## GAN Loss

* Averaging over all inputs, we can rewrite the loss
* $\mathcal{L}(\phi) = -\mathbb{E}_{x^*}\Bigg[ log\bigg[1 - sig\big[D[x^*,\phi ]\big]\bigg]\Bigg]  -\mathbb{E}_x\Bigg[ log\bigg[ sig\big[D[x, \phi ]\big]\bigg]\Bigg]$
* Once we realize that we are working with expectations, this should motivate some improvements
* There is a lot of math to unpack, but let's concentrate on one part:
  * The loss function for the generator currently ignores the second half of the equation

---

## Different Loss

* The right side is this:
  * $-\mathbb{E}_x\Bigg[ log\bigg[ sig\big[D[x, \phi ]\big]\bigg]\Bigg]$
* The authors of [The Relativistic Discriminator: a Key Element Missing from Standard GAN](https://arxiv.org/pdf/1807.00734) point out that the generator should affect that term
  * By the way, this is a good paper to read for some background, in addition to the UDL book
* Why? Because, if G is truly fooling D, then D should become *less certain* that real data is truly real

---

## Prior Knowledge

* There are three reasons this should be true
* First, GAN training provides a mix of real and fake images to the discriminator
* The mix is 50% to 50%, so, in effect, if G is not making D's job harder on real images then D has the advantage
  * To make up for it, we would need to decrease D's learning rate, over-use regularization, etc

---

## Divergence Minimization

* $\mathbb{E}_{x^*}\Bigg[ log\bigg[1 - sig\big[D[x^*,\phi ]\big]\bigg]\Bigg]$
  * From the generator's point of view, the goal is $sig\big[D[x_j^*, \phi]\big] \rightarrow 1 \; \forall j$
* But that is unreasonable; if real and synthetic images were indistinguishable then the discriminator should be predicting 0.5 for all of them
* We can make this argument using information theory and the Jensen-Shannon divergence
  * But I will leave that as an exercise for the committed student

---

## Gradients

* If, at any point, $D(x_i) = 1$ for all real $x_i$, then real inputs have no impact upon the loss
* D will only focus upon identifying features of fake data, and will not actually learn what is real data
  * Adversarial training still pits G and D against one another, but nothing will force fake images to look more real
  * Training will become stuck
* If increasing $D(G(z_i))$ always decreased some $D(x_i)$ for any real image, then the features of the fake data would have a gradient to guide them "towards" looking real

---

## Solution

* So we can build a more stable, improved GAN equation
* Let's say $D(x) = sigmoid(C(x))$
  * Our new, relativistic discriminator output can be $D(\tilde{x}) = sigmoid(C(x_{real}) - C(x_{fake}))$
* The discriminator is estimating if a given real sample is more realistic than a random fake sample
* The generator uses a similar loss, maximizing: $sigmoid(C(x_{fake}) - C(x_{real}))$
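
A tiny sketch of the relativistic comparison for a single real/fake pair. The critic scores passed in are made-up numbers standing in for $C(x_{real})$ and $C(x_{fake})$.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def relativistic_pair(c_real, c_fake):
    """Return the quantities D and G each want to push towards 1."""
    d_view = sigmoid(c_real - c_fake)  # "the real sample looks more real"
    g_view = sigmoid(c_fake - c_real)  # "the fake sample looks more real"
    return d_view, g_view

print(relativistic_pair(2.0, -1.0))  # D is winning this pair
print(relativistic_pair(0.5, 0.5))   # indistinguishable: both are 0.5
```

The two values always sum to 1, so the generator can only gain by making the real sample look relatively less real, which is the coupling that the standard loss lacked.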

---

## Global Version

* We don't want to train a pair at a time
* <small>$\mathcal{L}_D(\phi) = \mathbb{E}_r\big[sigmoid(C(x_r) - \mathbb{E}_f C(x_f))\big] + \mathbb{E}_f\big[1 - sigmoid(C(x_f) - \mathbb{E}_r C(x_r))\big]$</small>
* <small>$\mathcal{L}_G(\Theta) = \mathbb{E}_r\big[1 - sigmoid(C(x_r) - \mathbb{E}_f C(x_f))\big] + \mathbb{E}_f\big[sigmoid(C(x_f) - \mathbb{E}_r C(x_r))\big]$</small>
  * Written this way, each network maximizes its own expression; the paper takes the negative log of each bracketed term to get a loss to minimize
* And that is The Relativistic average GAN (RaGAN) as described in the [previous paper](https://arxiv.org/pdf/1807.00734)

* In terms of decision boundaries, the generator has snuck all of its outputs inside of the discriminator's boundary
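
The averaged expressions for $\mathcal{L}_D$ and $\mathcal{L}_G$ can be sketched over a small batch, treating each as a quantity its own network tries to increase. The critic scores are made-up numbers, and the function name is an illustrative choice.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def mean(values):
    return sum(values) / len(values)

def ragan_objectives(critic_real, critic_fake):
    """Return (L_D, L_G) from raw critic scores C(x_r) and C(x_f)."""
    mean_real = mean(critic_real)
    mean_fake = mean(critic_fake)
    # Each real sample is compared with the average fake score, and vice versa.
    real_vs_fake = [sigmoid(c - mean_fake) for c in critic_real]
    fake_vs_real = [sigmoid(c - mean_real) for c in critic_fake]
    l_d = mean(real_vs_fake) + mean([1.0 - s for s in fake_vs_real])
    l_g = mean([1.0 - s for s in real_vs_fake]) + mean(fake_vs_real)
    return l_d, l_g

print(ragan_objectives([2.0, 3.0], [-2.0, -1.0]))  # discriminator ahead
print(ragan_objectives([0.0, 0.0], [0.0, 0.0]))    # balanced: (1.0, 1.0)
```

In this form the two values always sum to 2, so a gain for one network is exactly a loss for the other.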

---

## Modern GAN Training

* The GAN recipe was updated again in 2024
* [The GAN is dead; long live the GAN! A Modern GAN Baseline](https://proceedings.neurips.cc/paper_files/paper/2024/hash/4e2acb1e1c8e297d394ae29ed9535172-Abstract-Conference.html)
  * [https://github.com/brownvc/R3GAN](https://github.com/brownvc/R3GAN)
* The previous reformulation improves the loss surface, but it does not guarantee that a minimum is actually found
* Two gradient penalties can improve training consistency
  * Basically, these try to prevent "overlearning" that would cause oscillation
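
The slide does not spell out the penalties. A common form, and to my understanding the one used in the R3GAN paper, is a zero-centered gradient penalty, $\frac{\gamma}{2}\mathbb{E}\big[||\nabla_x D(x)||^2\big]$, applied once over real samples and once over synthetic samples. The sketch below estimates it with finite differences on a toy linear critic; `toy_critic`, `gamma`, and the sample points are hypothetical, and real code would use automatic differentiation.

```python
def grad_norm_sq(critic, x, eps=1e-5):
    """Squared norm of the critic's gradient at x, by central differences."""
    total = 0.0
    for k in range(len(x)):
        hi = list(x)
        lo = list(x)
        hi[k] += eps
        lo[k] -= eps
        g = (critic(hi) - critic(lo)) / (2.0 * eps)
        total += g * g
    return total

def gradient_penalty(critic, samples, gamma=1.0):
    """(gamma / 2) times the mean squared gradient norm over the samples."""
    return 0.5 * gamma * sum(grad_norm_sq(critic, x) for x in samples) / len(samples)

def toy_critic(x):
    # Hypothetical linear critic with weights (3, 4); its gradient norm is 5.
    return 3.0 * x[0] + 4.0 * x[1]

real_samples = [[0.0, 0.0], [1.0, -1.0]]
print(gradient_penalty(toy_critic, real_samples))
```

Penalizing large gradients keeps the discriminator from becoming arbitrarily steep around the samples it has seen, which is one way to read the "overlearning" mentioned above.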

---

## More Details

* GANs still require more tweaking
  * Momentum tends to be bad for training
  * Since two models are being co-optimized, this leads to oscillations

---

## Not People Examples

<div class="col">
<img style="width: 80%" class="r-stretch" src="./figures/r3gan-not-people.png" />
<br/>
<small>FFHQ2 data. See <a href="https://github.com/brownvc/R3GAN">the github</a> for instructions.</small>
</div>

---

## Not Birds Examples

<div class="col">
<img style="width: 80%" class="r-stretch" src="./figures/r3gan-not-birds.png" />
<br/>
<small>ImageNet data. See <a href="https://github.com/brownvc/R3GAN">the github</a> for instructions.</small>
</div>

---

## Conditional Generation

* There are other ways to train a GAN, of course
* For example, what if we want to control the output?
* We can introduce a second vector, $c$, to condition the output

</div>
<div class="col">
<img style="width: 80%" class="r-stretch" src="./figures/UDL/Chap15/GanConditional.svg" />
<br/>
<small>See the UDL text, Figure 15.13.</small>
</div>
</div>

---

## ACGAN

* Auxiliary classifier GAN provides the condition vector along with z
  * The discriminator predicts the class as well as real/fake, so the generator must fool that as well
* Here, c is simply one of the class labels in a one-hot vector
* This allows the GAN to generate data from a class, as requested

---

## InfoGAN

* InfoGAN has the discriminator estimate c
  * We want c to end up with any latent information about the image
* Why? The latent variable, z, should be left with random noise and will be impossible to guess
  * So whatever is predicted in c must correspond to structured attributes
* By mixing continuous and discrete variables in $c$, they can be used to predict different facets of the data
* And you can [train your own](https://github.com/Natsu6767/InfoGAN-PyTorch)

-v-

## Aside

* Learning i.i.d. noise and learning structure have different bounds
* [Shannon's Source Coding Theorem](https://en.wikipedia.org/wiki/Shannon's_source_coding_theorem) tells us that, when we compress i.i.d. data, the minimal compression of information scales with the entropy
* Conversely, if what we are compressing is not i.i.d, [Kolmogorov complexity](https://en.wikipedia.org/wiki/Kolmogorov_complexity) is a better estimate
  * The Wikipedia summary is that the Kolmogorov complexity is the length of a shortest computer program that produces the object
  * Just insert "DNN" for "computer program" and that describes what we are doing

---

## InfoGAN Examples

<div class="col">
<img style="width: 80%" class="r-stretch" src="./figures/UDL/Chap15/GanInfoGAN.svg" />
<br/>
<small>See the UDL text, Figure 15.15.</small>
</div>

---

## Images from Images

* Now we come to methods that begin with images
  * These are cool, and *very* fast, but there are more modern methods
* Pix2Pix is trained to "stylize" an image
  * For example, translating segmentation maps, edge drawings, or black-and-white images to the original image
* CycleGAN trains a second generator that translates generated images back to their original, ensuring information wasn't destroyed
* Implementations of CycleGAN and pix2pix can be found on [GitHub](https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix)
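
The cycle-consistency idea can be sketched as a reconstruction error after a round trip through both generators. The L1 distance and the toy "generators" below are assumptions for illustration; real ones are neural networks.

```python
def l1(a, b):
    """Mean absolute difference between two equally sized lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def cycle_consistency_loss(x, forward, backward):
    """Translate x to the other domain and back, then compare with x."""
    return l1(backward(forward(x)), x)

# Toy stand-ins for the two generators, acting on a list of pixel values.
def brighten(img):
    return [v + 0.2 for v in img]

def darken(img):
    return [v - 0.2 for v in img]

def destroy(img):
    return [0.0 for _ in img]  # throws all information away

image = [0.1, 0.5, 0.9]
print(cycle_consistency_loss(image, brighten, darken))   # close to zero
print(cycle_consistency_loss(image, brighten, destroy))  # large
```

A backward generator that cannot recover the input is penalized, so the forward generator is discouraged from discarding information.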

---

## Pix2Pix

<div class="col">
<img style="width: 40%" class="r-stretch" src="./figures/UDL/Chap15/GANPix2Pix.svg" />
<br/>
<small>See the UDL text, Figure 15.16.</small>
</div>

---

## CycleGAN

<div class="col">
<img style="width: 45%" class="r-stretch" src="./figures/UDL/Chap15/GANCycleGAN.svg" />
<br/>
<small>See the UDL text, Figure 15.18.</small>
</div>

---

## Latent Space Control

* Conditional generation allows rough control of the latent space
  * But only as much as the training process exposes
* pix2pix and CycleGAN both allow for controlled transformations, but only along a single axis
* You have already been spoiled by word embeddings and multi-headed attention
  * All of the axes of a latent space should be accessible

---

## StyleGAN

* [StyleGAN](https://openaccess.thecvf.com/content_CVPR_2019/html/Karras_A_Style-Based_Generator_Architecture_for_Generative_Adversarial_Networks_CVPR_2019_paper.html) separates high-level attributes of the latent space in an unsupervised manner
  * This improves disentangling, which was one of the metrics that we cared about
* This is accomplished by training a new set of control vectors, w
* How?
  * The initial layer begins with a constant representation, rather than noise and $z$
  * The $z$ is then mapped to multiple styles, via an intermediate vector, $w$

---

## Learning Styles

* $z$ is mapped to a new, multisegment vector, $w$:
  * $f(z) \rightarrow w$
* After each convolution, noise is added to the partial synthetic image along with a style vector, taken from $w$
* So how is meaning imparted to $w$ through the loss?

</div>
<div class="col">
<img style="width: 80%" class="r-stretch" src="./figures/UDL/Chap15/GanStyleGANArch.svg" />
<br/>
<small>See the UDL text, Figure 15.19.</small>
</div>
</div>

---

## Learning Styles

* $w$ is converted into multiple style vectors, $y = (y_s, y_b)$
* Those influence the current model representation through an adaptive instance normalization layer
  * $AdaIN(x_i, y) = y_{s,i}\frac{x_i - \mu(x_i)}{\sigma(x_i)} + y_{b,i}$
* Thus the style vectors control the strength of various features

---

## Mixing Regularization

* To teach the generator that styles are separate, sometimes two latent $z$ values are created, leading to two style vectors, $w_1$ and $w_2$
* Some styles are taken from $w_1$ and others from $w_2$ during generation, so G must separate styles at different depths
* This doesn't tell you *what* the style vectors will influence
  * But styles closer to the end of the network can only change small scale features
  * And styles closer to the beginning influence large features of the image
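
Style mixing can be sketched as choosing a crossover layer: layers before it take their style from $w_1$ and layers after it from $w_2$. The layer count and the tiny style vectors are toy values, and a single random crossover point is an assumption about how the mixing is done.

```python
import random

def mix_styles(w1, w2, num_layers, rng=random):
    """Return one style vector per layer, switching from w1 to w2 at a random layer."""
    crossover = rng.randrange(1, num_layers)  # between 1 and num_layers - 1
    styles = [w1 if layer < crossover else w2 for layer in range(num_layers)]
    return styles, crossover

w1 = [1.0, 1.0]  # toy style vector from the first z
w2 = [2.0, 2.0]  # toy style vector from the second z
styles, crossover = mix_styles(w1, w2, num_layers=8, rng=random.Random(0))
print(crossover, styles)
```

Because the crossover moves around during training, the generator cannot assume that neighboring layers share a style.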

---

## Style GAN

* The components of $w$ can be thought of as coarse, medium, and fine, depending upon where they influence the output
* This leads to great control over the output images, as can be seen from example images in the [paper](https://openaccess.thecvf.com/content_CVPR_2019/html/Karras_A_Style-Based_Generator_Architecture_for_Generative_Adversarial_Networks_CVPR_2019_paper.html)

</div>
<div class="col">
<img style="width: 100%" class="r-stretch" src="./figures/UDL/Chap15/GanStyleGANArch.svg" />
<br/>
<small>See the UDL text, Figure 15.19.</small>
</div>
</div>

---

## More Modern StyleGAN

* But what if we want more control?
* StyleGAN is from late 2018/early 2019, but it has been updated multiple times
* From 2022: [Third Time's the Charm? Image and Video Editing with StyleGAN3](https://arxiv.org/abs/2201.13433)
  * [Demo/github page](https://yuval-alaluf.github.io/stylegan3-editing/)
  * [Code github](https://github.com/yuval-alaluf/stylegan3-editing)

---

## Advances: Alignment

* Aligned Vs Unaligned Images
  * Some operations are more easily learned when all images are aligned
  * So there are models to convert alignment and then apply the GAN
  * Then, rotation and translation are encoded into $z$ so that the original alignment can be restored

---

## Advances: Editing Directions

* StyleGAN uses a vector, S, to insert a style into the synthetic image
* [StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery](https://openaccess.thecvf.com/content/ICCV2021/html/Patashnik_StyleCLIP_Text-Driven_Manipulation_of_StyleGAN_Imagery_ICCV_2021_paper.html), from 2021, showed that S could be manipulated using text prompts and CLIP
  * The goal is to get a new latent code, $w$, given a source code, $w_s$
* Here comes some more math!

---

## Advances: Editing Directions

* So optimize this:
  * $\underset{w \in \mathcal{W}+}{argmin}D_{CLIP}(G(w), t) + \lambda_{L2} ||w - w_s|| + \lambda_{ID} \mathcal{L}_{ID}(w)$
  * $D_{CLIP}(G(w), t)$ is the cosine distance between the embedding of the Generator's output and the text embedding
  * $\mathcal{L}_{ID}$ is an identity loss that penalizes dissimilarity between the original and edited output images
* Basically, find a new $w$ that changes the image as little as possible but brings the CLIP embedding as close as possible
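
The objective can be sketched as a plain function once the embeddings are available. Here the image embedding, text embedding, and identity loss are passed in as precomputed numbers; in the real method they would come from CLIP applied to $G(w)$ and from a separate identity network. All values and weights below are made up.

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def editing_objective(image_emb, text_emb, w, w_s, id_loss, lambda_l2, lambda_id):
    """D_CLIP + lambda_L2 * ||w - w_s|| + lambda_ID * L_ID, with terms precomputed."""
    return (cosine_distance(image_emb, text_emb)
            + lambda_l2 * l2_distance(w, w_s)
            + lambda_id * id_loss)

w_s = [0.0, 0.0]
# An unchanged code whose image already matches the text costs nothing.
print(editing_objective([1.0, 0.0], [2.0, 0.0], w_s, w_s, 0.0, 0.5, 0.1))
# Moving w away from w_s is only worthwhile if the CLIP term drops enough.
print(editing_objective([0.0, 1.0], [1.0, 0.0], [3.0, 4.0], w_s, 0.2, 0.5, 0.1))
```

The two $\lambda$ weights set how far the edit may wander from the source code in exchange for a better match to the text.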

---

## Examples

* Again, the paper, github, and demo website have plenty of examples
* From 2022: [Third Time's the Charm? Image and Video Editing with StyleGAN3](https://arxiv.org/abs/2201.13433)
  * [Demo/github page](https://yuval-alaluf.github.io/stylegan3-editing/)
  * [Code github](https://github.com/yuval-alaluf/stylegan3-editing)
* But how is this useful?

---

## Applications

* Meme-worthy
  * [GAN-Supervised Dense Visual Alignment](https://openaccess.thecvf.com/content/CVPR2022/html/Peebles_GAN-Supervised_Dense_Visual_Alignment_CVPR_2022_paper.html) from 2022
  * Demos [here](https://github.com/wpeebles/gangealing)
* And more serious
  * [Expert-Guided StyleGAN2 Image Generation Elevates AI Diagnostic Accuracy for Maxillary Sinus Lesions](https://www.nature.com/articles/s43856-025-00907-6)
  * [Assessing the Efficacy of StyleGAN 3 in Generating Realistic Medical Images with Limited Data Availability](https://dl.acm.org/doi/abs/10.1145/3651781.3651810)

---

## Current Events

* There is a newer version of pix2pix and CycleGAN [on GitHub](https://github.com/GaParmar/img2img-turbo)
* If you look, you'll see that they mention a bunch of new techniques
  * We'll talk about diffusion models next lecture

---

## What To Remember?

* We've reached the point of deep learning topics where there is an explosion of techniques and applications
* What is important about all of this stuff?
* Look at the latent space:
  * GANs began by choosing random values for the latents
  * This was basically their source of entropy

---

## Latent Spaces

* InfoGAN (and other conditional GANs) added more information
  * The discriminator had to be able to deduce a "realistic" context vector
  * That imposed meaning into the formerly random latent input
* With StyleGAN, the location where the context is inserted constrains what it can affect
* Finally, we can use an external model to provide additional constraints

---

## Quiz This Week

* We have the fourth and final quiz this week, covering lectures 18-21
* We'll blow through diffusion models on Thursday as well
* Next week I'll try to give a whirlwind tour of deep neural networks for reinforcement learning
  * RL is its own subject with its own books, so just expect an overview

---

## Example Questions

The mechanism that prevents a decoder from learning from future tokens during training is called:

a. Strided Attention  
b. Bound Attention  
c. Cross Attention  
d. None of the above.
</div>

---

The mechanism that prevents a decoder from learning from future tokens during training is called:

a. Strided Attention  
b. Bound Attention  
c. Cross Attention  
d. **None of the above.**
</div>

* The correct answer is masked attention

---

What is an advantage of transformers?

a. They are very memory efficient compared to convolutions.  
b. They can learn and infer with inputs that have variable token length.  
c. They are computationally simple.  
d. All of the above.  
e. None of the above.
</div>

---

What is an advantage of transformers?

a. They are very memory efficient compared to convolutions.  
b. **They can learn and infer with inputs that have variable token length.**  
c. They are computationally simple.  
d. All of the above.  
e. None of the above.
</div>

* Transformers have many desirable qualities, but memory for attention weights grows as $n^2$ with the number of tokens, $n$, and they are compute heavy

---

An image transformer converts the initial image into tokens with 96 feature maps through an initial 14x14 convolution.
How many weight and bias parameters are there for the first linear projection to create the query vectors?

</div>

---

The only important information is that tokens have 96 values. The linear projection will also result in 96 values. There will be 96 bias parameters and $96^2 = 9216$ weights, for 9312 parameters in total.
</span>

</div>

---

What could explain why self-supervised learning results in higher-quality features than supervised learning?

a. It can use larger DNNs.  
b. Self-supervised learning can use transformers.  
c. Supervised learning provides incentives to use spuriously correlated signals.  
d. All of the above.  
e. None of the above.
</div>

---

What could explain why self-supervised learning results in higher-quality features than supervised learning?

a. It can use larger DNNs.  
b. Self-supervised learning can use transformers.  
c. **Supervised learning provides incentives to use spuriously correlated signals.**  
d. All of the above.  
e. None of the above.
</div>

* It isn't so much that self-supervised learning is great, but that regular supervised learning leads to issues. If the DNN needs to predict a bird class but cannot see a bird, it may recognize that birds are correlated with the sky. Similarly, boats are correlated with ripples on the water.

---

Why does contrastive learning of images work?

a. In order to project similar images into similar feature spaces, the model must capture semantics and structure of objects.  
b. The loss function has strong regularization.  
c. Because labelled image pairs of the same class have the same latent representation.  
d. All of the above.  
e. None of the above.
</div>

---

Why does contrastive learning of images work?

a. **In order to project similar images into similar feature spaces, the model must capture semantics and structure of objects.**  
b. The loss function has strong regularization.  
c. Because labelled image pairs of the same class have the same latent representation.  
d. All of the above.  
e. None of the above.
</div>

* Just because two images have the same class, that doesn't mean that their latent vectors are the same. They must have similar components, but that is it. The class label is something that a human gave to the images, it likely does not capture every part of the image.

---

Which training method does not have the risk of instability and collapse?

a. A neural network is trained to generate an image embedding given an image and a desired embedding.  
b. A generator and discriminator are co-trained, one to minimize a function and the other to maximize it.  
c. Training uses a contrastive loss function, $\ell(i, j) = -log\frac{exp(s_{i,j})}{\sum\limits_{k=1}^{2N}\mathbb{1}_{[k\neq i]}exp(s_{i,k})}$  
d. All of the above.  
e. None of the above.
</div>

---

Which training method does not have the risk of instability and collapse?

a. **A neural network is trained to generate an image embedding given an image and a desired embedding.**  
b. A generator and discriminator are co-trained, one to minimize a function and the other to maximize it.  
c. Training uses a contrastive loss function, $\ell(i, j) = -log\frac{exp(s_{i,j})}{\sum\limits_{k=1}^{2N}\mathbb{1}_{[k\neq i]}exp(s_{i,k})}$  
d. All of the above.  
e. None of the above.
</div>

* Option a is the distillation training procedure used in DINO. The teacher network is updated slowly, so the student has a stable target for learning. Option c is regular contrastive learning, where the model may overshoot itself, and option b is GAN training.

---

Why do the generators from GANs require a latent vector, $z$, as input? Why not use only style inputs, as in StyleGAN, or control vectors, as in InfoGAN?

a. Without entropy in the model generation, the discriminator would quickly learn to determine the fake image.  
b. The entropy of the learned space is incompressible noise which cannot be learned by the model and must be provided by $z$.  
c. Styles and control signals are discovered during training, so training must begin with a random vector, $z$.  
d. All of the above.  
e. None of the above.
</div>

---

Why do the generators from GANs require a latent vector, $z$, as input? Why not use only style inputs, as in StyleGAN, or control vectors, as in InfoGAN?

* <small>a, b, and c are variations of the same statement. Without entropy, the generator could only create the same image so training wouldn't work. Similarly, the generator does not know anything about the latent space at the beginning, but learns to map the random values of $z$ into it. Some components may simply be noise while others are structure, but meaning is only assigned through backpropagation.</small>

<!--

InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets
https://arxiv.org/abs/1606.03657
from 2016
https://github.com/Natsu6767/InfoGAN-PyTorch

GAN Repository
https://github.com/lukemelas/pytorch-pretrained-gans
For unsupervised image segmentation:
https://github.com/lukemelas/unsupervised-image-segmentation
From:
https://arxiv.org/abs/2105.08127
published 2021

StudioGAN: A Taxonomy and Benchmark of GANs for Image Synthesis
https://arxiv.org/abs/2206.09479
https://github.com/POSTECH-CVLab/PyTorch-StudioGAN

Deep Spectral Methods: A Surprisingly Strong Baseline for Unsupervised Semantic Segmentation and Localization
CVPR 2022
https://openaccess.thecvf.com/content/CVPR2022/html/Melas-Kyriazi_Deep_Spectral_Methods_A_Surprisingly_Strong_Baseline_for_Unsupervised_Semantic_CVPR_2022_paper.html
https://github.com/lukemelas/deep-spectral-segmentation

The spectral methods paper refers to:
https://arxiv.org/abs/2109.14279
Localizing Objects with Self-Supervised Transformers and no Labels
from 2021

GAN-Supervised Dense Visual Alignment (GANgealing)
https://openaccess.thecvf.com/content/CVPR2022/html/Peebles_GAN-Supervised_Dense_Visual_Alignment_CVPR_2022_paper.html
from 2022
https://github.com/wpeebles/gangealing

Expert-Guided StyleGAN2 Image Generation Elevates AI Diagnostic Accuracy for Maxillary Sinus Lesions
https://www.nature.com/articles/s43856-025-00907-6

CoordGAN: Self-Supervised Dense Correspondences Emerge from GANs
https://jitengmu.github.io/CoordGAN/
CVPR 2022

The GAN is dead; long live the GAN! A Modern GAN Baseline
https://proceedings.neurips.cc/paper_files/paper/2024/hash/4e2acb1e1c8e297d394ae29ed9535172-Abstract-Conference.html
Neurips 2024
https://github.com/brownvc/R3GAN

##### Diffusion

Diffusion Models for Open-Vocabulary Segmentation
https://arxiv.org/abs/2306.09316
https://link.springer.com/chapter/10.1007/978-3-031-72652-1_18
published ECCV 2024
Code:
https://github.com/karazijal/ovdiff

Unsupervised Part Discovery from Contrastive Reconstruction
https://proceedings.neurips.cc/paper/2021/hash/ec8ce6abb3e952a85b8551ba726a1227-Abstract.html
NeurIPS 2021

https://github.com/brownvc/R3GAN

-->