# CS 462 - Lecture 26

## Final Review

Bernhard Firner

2026-04-30

---

## Review

* We've had a midterm and quizzes for all content up to lecture 21
  * The recitation will go over those questions and common mistakes
* We'll review the last few weeks today
  * Mostly generative models

---

## Data Generation

* Generative models map from a latent space, $Z$, into a larger space, $X$
* We can think of the transformation from $X$ to $Z$ as compression and the reverse as decompression
  * In fact, there is work that uses [latent conversion as compression](https://openaccess.thecvf.com/content/CVPR2024/html/Jia_Generative_Latent_Coding_for_Ultra-Low_Bitrate_Image_Compression_CVPR_2024_paper.html)

</div>
<div class="col">

<img style="width: 55%" class="r-stretch" src="./figures/big_z_mapping.svg" />
<br/>
<small>Z and X should map onto one another.</small>

</div>
</div>

---

## Generative Models

* Generative models begin by finding a conversion between $X$ and $Z$
* Then, any arbitrary $z$ can be converted into $X$, generating a new observation

</div>
<div class="col">

<img style="width: 55%" class="r-stretch" src="./figures/z_mapping.svg" />
<br/>
<small>An example latent vector, z, can be mapped into X to generate a new sample.</small>

</div>
</div>

---

## Methods

* We covered two approaches
  * GANs: A generator attempts to fool a discriminator into believing fake samples are real
  * Diffusion models: A denoiser successively removes noise from noise until a sample emerges
* There are others, but these showcase the variability of approaches

---

## Difficulties

* The $z \rightarrow x^*$ operation is simple to describe, but difficult to learn
* Does $x^*$ have enough *quality* to fool a person?
  * We usually use a DNN to evaluate, for practical reasons
* Is generation computationally *efficient*?
* Does the output change smoothly with $z$, giving us fine-grained control?
* Is the latent space disentangled?
  * Does changing a component of $z$, $z_i$, change one aspect of $x^*$?

---

## Domain Mapping

* Most importantly, does our mapping actually cover all of $X$?
* Generating synthetic values is simple if we only synthesize a few different samples
  * This is a symptom of mode collapse
  * The problem isn't really avoidable

---

## Mode Collapse Joke

> An economist, a statistician, and a mathematician are travelling together and cross into a new country. They look around and see a black cow grazing in a field. The economist turns to the others and says, "In this country, the cows must all be black."

> The statistician quickly disagrees. "All you can say is that this country has at least one black cow."

> "You're both wrong!" shouts the mathematician. The other two turn in surprise. "All you can say is that there exists one cow which has one black side!"

---

## Latent Conversion

* Let's say that you were asked to generate a second cow
  * You are told that you are in the new country
  * How should the cow look?
* Your dataset has 1 million cows from other countries, and one from this country

---

## Training Method

* Depending upon our training method, the generative model may be forced to create only black cows
* With a GAN, the discriminator could trivially learn that any cow from the country must be black
  * This forces the generator to follow suit
* In effect, two different components of $z$ have become entangled

---

## Manual Disentanglement

* In any given dataset, it is inevitable that some attributes will be correlated
* So generative techniques have methods to guide the transformation from $z$ to $X$
* In GANs this takes the form of conditioning upon the components of $Z$
* In diffusion, we use the gradients from another model to influence the direction of the decoding process

---

## Compression and Information

* However we learn the mapping, it is important that it is *information preserving*
* With conditioned GANs (like StyleGAN), we could clearly see this
  * Some part of $z$ is noise, required for low-level dynamic features
  * Other components of $z$ control qualities of the output
* When converting from $X$ into $Z$ we can predict anything but the noise

</div>
<div class="col">

<img style="width: 55%" class="r-stretch" src="./figures/big_z_mapping.svg" />
<br/>
<small>Z loses the noise in X.</small>

</div>
</div>

---

## Noise

* Noise is not compressible
  * Take a course on information theory!
  * In short: there is no correlation, and thus no possible compression
* But noise also has no information, so compression can discard it
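
A rough way to see this for yourself, assuming `zlib` as a stand-in for "a compressor" and seeded pseudo-random bytes as a stand-in for noise (both are illustrative choices, not part of the lecture material):

```python
import random
import zlib

def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size (lower means more compressible)."""
    return len(zlib.compress(data, 9)) / len(data)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
n = 100_000

# Pseudo-random bytes stand in for noise: no byte helps predict another.
noise = bytes(rng.getrandbits(8) for _ in range(n))
# A repeating pattern is highly correlated with itself.
pattern = bytes(i % 16 for i in range(n))

noise_ratio = compression_ratio(noise)
pattern_ratio = compression_ratio(pattern)
print(f"noise:   {noise_ratio:.3f}")
print(f"pattern: {pattern_ratio:.3f}")
```

The noise should come out at roughly its original size, while the correlated pattern shrinks to a small fraction of it.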

---

## Unsupervised Learning

* This means that generative models and unsupervised models are related
  * The latent space they both use should be similar or the same
* Unsupervised models attempt to learn information preserving representations of their inputs
  * So they should discard noise and preserve everything else
* In many ways, the latent space representation is one of the most interesting results of modern deep learning

---

## Reinforcement Learning

* That brings us to reinforcement learning
* RL explores a space, building knowledge as it explores
* Knowledge can come in two forms
  * The "value" of being in a particular state and taking a particular action
  * The preference of one action over another

---

## Action-Utility

* The estimate of the value of taking action $a$ in state $s$ is the Q function
  * $Q(s, a)$ is the estimate of the value/utility
* In traditional RL this is stored in a lookup table
  * But a neural network, or a deep Q network in this case, also works
  * In fact, DQNs are well-suited to large state spaces

---

## Policies

* If we attempt to learn preferences for actions, that is called policy learning
* The function $\pi(a|s)$ should return the probability of taking action $a$ in state $s$
  * That could be a categorical distribution if actions are discrete
  * Or a normal (or other) distribution if actions are continuous
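
A minimal sketch of the two cases, assuming the policy network has already produced its outputs for one state. The logits, mean, and standard deviation below are made-up numbers, and the function names are hypothetical:

```python
import math
import random

def softmax(logits):
    """Turn arbitrary scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_discrete(logits, rng):
    """Categorical policy: pi(a|s) over a finite set of actions."""
    probs = softmax(logits)
    action = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return action, probs

def sample_continuous(mean, std, rng):
    """Gaussian policy: pi(a|s) is a normal density over a real-valued action."""
    return rng.gauss(mean, std)

rng = random.Random(0)
# Hypothetical network outputs for one state.
action, probs = sample_discrete([2.0, 0.5, -1.0], rng)
torque = sample_continuous(mean=0.1, std=0.05, rng=rng)
```

In both cases the network outputs the parameters of a distribution, and the action is sampled from it rather than chosen directly.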

---

## Why Mix NNs with RL?

* Anything interesting has an enormous state space
* DNNs effectively compress that space into a smaller one
  * This could mean effectively interpolating between observed rewards in similar states
  * Or decreasing the search space itself, reducing a large X to a smaller Z

---

## Example Questions

* Now that everything is fresh in your minds, let's do some example questions
* The recitation will go over some frequently incorrect examples from the midterm and quizzes
* But you haven't seen any questions on the last few topics

---

Q. StyleGAN breaks the latent vector into components that are inserted into the beginning, middle, and end of the decoding process. Which statement is true about this process?

a. The component used at the end causes the greatest change in the image because there are few layers left to undo its changes.  
b. The component used at the end causes the greatest change in the image because it can undo the effects of the previous components.  
c. The component used at the end causes the least change in the image because there are few layers left and global features are already set.  
d. All of the above.  
e. None of the above.
</div>

---

Q. StyleGAN breaks the latent vector into components that are inserted into the beginning, middle, and end of the decoding process. Which statement is true about this process?

a. The component used at the end causes the greatest change in the image because there are few layers left to undo its changes.  
b. The component used at the end causes the greatest change in the image because it can undo the effects of the previous components.  
c. **The component used at the end causes the least change in the image because there are few layers left and global features are already set.**  
d. All of the above.  
e. None of the above.
</div>

---

Q. ACGAN has the discriminator assign class probabilities to both real and fake images. The one-hot class vector, $c$, is also given to the generator. Why does this condition the GAN to generate images of the class matching the one-hot vector, $c$?

a. This method of training does not work.  
b. The generator must share weights with the discriminator. The shared weights bias the generator to match the class vector, $c$.  
c. The only nonrandom information in generated images comes from $c$, so the one-hot vector must correspond to classes.  
d. All of the above.  
e. None of the above.
</div>

---

<small>
None of them are correct. Answer 'c' is the nearest to making sense, but there is other non-noise information in the images. For example, digits have 10 classes, but also have tilts and line thickness. The values in the vector $c$ could become entangled, and one component could smoothly shift from 1s to 7s, for example, while another shifts between thin 0s and thick 9s. The real answer is that the generator must fool the classification part of the discriminator. Since the discriminator "sees" $c$ through its loss function, the generator must match the discriminator's expectations in order to fool it.
</small>

---

Q. Diffusion models successively "clean up" a degraded image, eventually yielding a "clean" output. What can be used as the initial z used for image diffusion?

a. An image with an undesirable area erased.  
b. A downscaled image.  
c. An image with Gaussian noise added.  
d. All of the above.  
e. None of the above.
</div>

---

Q. Diffusion models successively "clean up" a degraded image, eventually yielding a "clean" output. What can be used as the initial z used for image diffusion?

a. An image with an undesirable area erased.  
b. A downscaled image.  
c. An image with Gaussian noise added.  
d. **All of the above.**  
e. None of the above.
</div>

---

Q. What accurately describes the advantages of diffusion models and GANs?

a. Diffusion models have a simpler loss function than GANs.  
b. GANs produce an output in a single step, making them naturally faster than diffusion models.  
c. Diffusion models can be guided by any source of a gradient on the image, unlike GANs which need to be trained with preconditioning or style vectors for guidance.  
d. All of the above.  
e. None of the above.
</div>

---

Q. What accurately describes the advantages of diffusion models and GANs?

---

Q. What is a cause of mode collapse?

a. There is no training data for parts of X.  
b. The generative model finds a few small parts of X that satisfy its loss function.  
c. Insufficient regularization is the source of mode collapse.  
d. All of the above.  
e. None of the above.
</div>

---

Q. What is a cause of mode collapse?

a. There is no training data for parts of X.  
b. **The generative model finds a few small parts of X that satisfy its loss function.**  
c. Insufficient regularization is the source of mode collapse.  
d. All of the above.  
e. None of the above.
</div>

If there is no training data for parts of X then we won't see synthetic data that matches it, but that doesn't describe mode collapse. A good loss function should force the generative model to learn variety, but it may not guarantee it.

---

Q. When using a classifier to guide a generative diffusion model, why must the classifier be trained on noisy images?

a. Without being trained on noisy images, the signal from the gradient will be weak or meaningless.  
b. The classifier does not need to be trained on noisy images if it is properly regularized.  
c. Noise changes the means of the image pixels, so the classifier needs to be trained for different average input values to prevent vanishing or exploding gradients.  
d. All of the above.  
e. None of the above.
</div>

---

Q. When using a classifier to guide a generative diffusion model, why must the classifier be trained on noisy images?

a. **Without being trained on noisy images, the signal from the gradient will be weak or meaningless.**  
b. The classifier does not need to be trained on noisy images if it is properly regularized.  
c. Noise changes the means of the image pixels, so the classifier needs to be trained for different average input values to prevent vanishing or exploding gradients.  
d. All of the above.  
e. None of the above.
</div>

---

Q. In your last homework, the denoising model generates outputs for $\mu$ and $\sigma^2$. What is the correct way to apply gradient guidance?

a. Add the gradients directly to $\mu$.  
b. Multiply the gradients by $\mu$ and add back to $\mu$.  
c. Multiply the gradients by $\sigma^2$ and add back to $\mu$.  
d. All of the above.  
e. None of the above.
</div>

---

Q. In your last homework, the denoising model generates outputs for $\mu$ and $\sigma^2$. What is the correct way to apply gradient guidance?

a. Add the gradients directly to $\mu$.  
b. Multiply the gradients by $\mu$ and add back to $\mu$.  
c. **Multiply the gradients by $\sigma^2$ and add back to $\mu$.**  
d. All of the above.  
e. None of the above.
</div>

The amount of nudging needs to scale with the current amount of noise present. If too small a nudge is applied early on, it won't actually change the final output. If too large a nudge is applied near the end of diffusion, the output could become malformed.
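
A minimal sketch of that scaling, assuming per-pixel lists for $\mu$, $\sigma^2$, and the guidance gradient. The `scale` knob and the function name are hypothetical additions for illustration, not the homework's API:

```python
def guided_mean(mu, sigma2, grad, scale=1.0):
    """Shift the denoiser's mean by the guidance gradient, weighted by the variance.

    `grad` is assumed to be the gradient of the guiding model's score
    with respect to the current noisy image.
    """
    return [m + scale * s2 * g for m, s2, g in zip(mu, sigma2, grad)]

# Early step: large variance, so the same gradient moves the mean a lot.
early = guided_mean(mu=[0.0, 1.0], sigma2=[0.5, 0.5], grad=[2.0, -1.0])
# Late step: small variance, so the nudge is gentle.
late = guided_mean(mu=[0.0, 1.0], sigma2=[0.01, 0.01], grad=[2.0, -1.0])
```

With the same gradient, the early step moves each mean fifty times further than the late step, because the variance is fifty times larger.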

---

Q. During training of AlphaGoZero, the researchers used only a single board position from each simulated game for training. Why?

a. They had too much data, which made training slow.  
b. The model was linear, so it had a small number of parameters and could only make use of a small dataset.  
c. The model was strong enough to predict the outcome from a single board position, so more data was unnecessary.  
d. All of the above.  
e. None of the above.
</div>

---

Q. During training of AlphaGoZero, the researchers used only a single board position from each simulated game for training. Why?

Adjacent board positions would lack variety and lead to biased data (biased toward moves that end up creating long games rather than short ones, for example). Avoiding bias is one of the first hurdles of training DNNs for reinforcement learning.

---

Q. Consider these two formulations of Q-Learning:  
$Q(S_t,A) \leftarrow Q(S_t,A) + \alpha[R + \gamma \sum\limits_{a\in \mathcal{A}}\frac{Q(S_{t+1},a)}{|\mathcal{A}|} - Q(S_t,A)]$.  
$Q(S_t,A) \leftarrow Q(S_t,A) + \alpha[R + \gamma \max\limits_{a}Q(S_{t+1},a) - Q(S_t,A)]$.  
What is the difference between them?

a. The second can highly overestimate the utility of $S_{t+1}$ until it is properly trained.  
b. The first will underestimate the possible utility of $S_{t+1}$.  
c. The second will find an optimal policy that does not correspond to the behavior during training.  
d. All of the above.  
e. None of the above.
</div>

---

<small>
The first is a form of SARSA (strictly, Expected SARSA under a uniformly random behavior policy, since it averages over all next actions), which learns the best policy that matches the exploratory behavior (which could include random actions). Q-Learning, the second equation, can be biased at first, but learns an optimal policy free from the inefficiency of the exploratory behavior.
</small>
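
To make the difference concrete, here is a small tabular sketch of the two update rules from the question. The two-state table, the `alpha` and `gamma` defaults, and the function names are illustrative assumptions:

```python
def q_update_mean(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """First rule: bootstrap from the average value over next actions."""
    target = r + gamma * sum(Q[s_next]) / len(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])

def q_update_max(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Second rule (Q-learning): bootstrap from the best next action."""
    target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])

def fresh_table():
    # Two states, two actions; state 1 has one good and one bad action.
    return [[0.0, 0.0], [1.0, -1.0]]

Q_mean, Q_max = fresh_table(), fresh_table()
q_update_mean(Q_mean, s=0, a=0, r=0.0, s_next=1)
q_update_max(Q_max, s=0, a=0, r=0.0, s_next=1)
```

After one update, the averaging rule leaves the entry for state 0, action 0 at zero because the good and bad next actions cancel, while the max rule raises it by assuming the best next action will be taken.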

---

Q. Cartpole's state is described as two numbers: the lateral position of the cart and the current angle of the pole. You decide to feed this into a neural network, projecting from those two inputs to a hidden layer with 1024 features. How many weight and bias values are in the initial hidden layer?

</div>

---

$2 \times 1024 = 2048$ weight parameters  
$1024$ bias parameters

</div>
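
The same count as a short sketch, assuming a standard fully connected layer with one bias per output feature (the helper name is hypothetical):

```python
def dense_param_count(n_inputs, n_outputs):
    """Return (weights, biases) for one fully connected layer."""
    weights = n_inputs * n_outputs  # one weight per input-output pair
    biases = n_outputs              # one bias per output feature
    return weights, biases

weights, biases = dense_param_count(2, 1024)
print(weights, biases, weights + biases)  # 2048 1024 3072
```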

---

Q. What advantage does policy learning have over Q-Learning?

a. It is less biased.  
b. The policy may be a simpler function than estimating the value in each state.  
c. Policy learning is easier to implement than Q-Learning.  
d. All of the above.  
e. None of the above.
</div>

---

Q. What advantage does policy learning have over Q-Learning?

a. It is less biased.  
b. **The policy may be a simpler function than estimating the value in each state.**  
c. Policy learning is easier to implement than Q-Learning.  
d. All of the above.  
e. None of the above.
</div>

Policy learning only needs to express a preference for one action over another. Estimating numerical values that describe the outcomes of different actions may be much more difficult.