# CS 530 - Lecture 21

## Features and Latent Spaces

Bernhard Firner

2026-04-14

---

## What is AI?

* Let's get philosophical for a moment
  * What is it that makes artificial intelligence a good solution to our problems?
  * And how can we make better use of our existing techniques?
* We have a few lectures left, which should be just enough time to dig into this
* Today we'll survey a few ideas, and then we'll dig into them more

---

## What is Knowledge?

* Applying the parameters of a trained neural network means transforming the input into a desired output
* Outputs are varied
  * Sometimes we get a single value
  * Other times we get class probabilities
  * And we could even transform one image into another image
* The weights and biases of the neural network have knowledge, but what does that mean?

---

## Knowledge as Compression

* [Shannon's Source Coding Theorem](https://en.wikipedia.org/wiki/Shannon's_source_coding_theorem) tells us that, when we compress i.i.d. data, the best achievable average code length per symbol is the entropy of the source
* Conversely, if what we are compressing is not i.i.d., [Kolmogorov complexity](https://en.wikipedia.org/wiki/Kolmogorov_complexity) is a better measure
  * The Wikipedia summary is that the Kolmogorov complexity is the length of a shortest computer program that produces the object
* We can make a decent definition by saying knowledge ignores noise and compresses non-noise
  * Learning noise is, after all, futile
  * It would be equivalent to learn its statistics and simply make new noise later
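
A minimal sketch of the gap between those two measures, assuming Python's `zlib` as a stand-in for an ideal compressor (true Kolmogorov complexity is not computable). Both byte strings below have a flat, or nearly flat, symbol histogram, so a per-symbol entropy estimate sees about 8 bits per byte in each. Only the structured one shrinks.

```python
import math
import random
import zlib
from collections import Counter


def entropy_bits_per_byte(data: bytes) -> float:
    """Empirical Shannon entropy of the byte histogram, treating bytes as i.i.d."""
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())


rng = random.Random(0)
# Incompressible noise: 4096 independent, uniformly random bytes.
noise = bytes(rng.getrandbits(8) for _ in range(4096))
# Structure: 0..255 repeated 16 times, the output of a very short program.
pattern = bytes(range(256)) * 16

for name, data in [("noise", noise), ("pattern", pattern)]:
    compressed_size = len(zlib.compress(data, 9))
    print(name, round(entropy_bits_per_byte(data), 2), compressed_size)
```

The histogram view cannot tell the two apart, while a general-purpose compressor finds the repetition right away.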

---

## 3 Examples

* Let's look at three examples in which neural networks demonstrate compression of non-i.i.d. information
* From simplest to most complicated (and perhaps least convincing to most convincing)
  * Feature Map Segmentation
  * Generative Models
  * Puzzle solving

---

## Feature Maps

* If you train a classifier and cut off its head, you will find a feature map
  * This could be from a convnet, a transformer, or some other architecture
* Those features are amenable to k-nearest-neighbor (KNN) lookup, and can work to classify new, unseen image types
  * As in [Fast Incremental Learning for Off-Road Robot Navigation](https://arxiv.org/abs/1606.08057)
* What information goes into a feature map? And what information is dropped?
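
A minimal sketch of nearest-neighbor classification on top of a feature extractor. The "features" here are synthetic Gaussian blobs standing in for the output of a headless network, and the cosine metric and `k=3` are illustrative assumptions, not the settings of the cited paper.

```python
import numpy as np


def knn_classify(train_feats, train_labels, query_feats, k=3):
    """Label each query by majority vote among its k most cosine-similar neighbors."""
    a = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    b = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    sims = b @ a.T                               # [n_query, n_train]
    nearest = np.argsort(-sims, axis=1)[:, :k]   # indices of the k best matches
    votes = train_labels[nearest]                # [n_query, k]
    return np.array([np.bincount(v).argmax() for v in votes])


# Stand-in features: two well-separated blobs in a 16-dimensional space.
rng = np.random.default_rng(0)
centers = np.zeros((2, 16))
centers[0, 0], centers[1, 1] = 10.0, 10.0
train_labels = np.repeat([0, 1], 20)
train_feats = centers[train_labels] + rng.normal(size=(40, 16))
query_feats = centers[[0, 1]] + rng.normal(size=(2, 16))
preds = knn_classify(train_feats, train_labels, query_feats)
print(preds)
```

Adding a new class only requires storing a few labeled feature vectors; nothing is retrained.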

---

## Single Features

* We can look at single feature maps (or the features from a self-attention head)
  * These are recognizable as single object types
* Notice how the rest of the image has most features removed?
  * The "signal" is preserved, the "noise" is removed
  * Noise removal is a basic part of compression

<div class="col">
<img style="width: 80%" class="r-stretch" src="./figures/dino_attention_maps.png" />
<br/>
<small>See DINO results <a href="https://github.com/facebookresearch/dino">github page</a>.</small>
</div>

---

## Training Method

* Maybe you aren't impressed, but those features came from a model trained without any labels
* [Emerging Properties in Self-Supervised Vision Transformers](https://openaccess.thecvf.com/content/ICCV2021/html/Caron_Emerging_Properties_in_Self-Supervised_Vision_Transformers_ICCV_2021_paper) (from 2021)
* This paper introduced a self-supervised learning technique named DINO
* The authors also pointed out that:
  * 1: unsupervised features contain scene layout and object boundaries
  * 2: unsupervised features can be used directly in a k-nearest-neighbor classifier

---

## Self-Distillation

* Why would the model learn such good semantic features without any labels?
* This learning process asked the model to produce a K dimensional feature vector
  * The loss was how well the features matched another vector made from a different view of the image
* The second features came from a teacher network
  * The teacher itself is a slow exponential moving average of the student

---

## Knowledge Distillation

* The feature vector comparison is a normalized dot product (cosine similarity) of the projected vectors
  * $similarity(u, v) = \frac{u^Tv}{||u||~||v||}$
  * $s_{u,v}$ for short
* The full loss is a softmax run over all similarities within the batch (two views of each of $N$ images)
  * $\ell(i, j) = -\log\frac{\exp(s_{i,j}/\tau)}{\sum\limits_{k=1}^{2N}\mathbb{1}_{[k\neq i]}\exp(s_{i,k} / \tau)}$
  * $\tau$ is a temperature (hyperparameter) that controls how sharply the softmax favors the best match
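
A minimal NumPy sketch of $\ell(i, j)$ exactly as written on this slide. The random feature vectors, the noise level, and $\tau = 0.5$ are assumptions made only for the demonstration.

```python
import numpy as np


def pair_loss(feats, i, j, tau=0.5):
    """l(i, j): negative log-softmax of s[i, j] over every k != i."""
    unit = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    s = unit @ unit.T                      # s[u, v] = cosine similarity
    logits = s[i] / tau
    mask = np.arange(len(feats)) != i      # the indicator 1[k != i]
    return -(logits[j] - np.log(np.sum(np.exp(logits[mask]))))


rng = np.random.default_rng(0)
views_a = rng.normal(size=(4, 8))                     # one view per image
views_b = views_a + 0.05 * rng.normal(size=(4, 8))    # a slightly different view
feats = np.concatenate([views_a, views_b])            # 2N = 8 feature vectors

matched = pair_loss(feats, 0, 4)      # image 0 paired with its other view
mismatched = pair_loss(feats, 0, 5)   # image 0 paired with a different image
print(matched, mismatched)
```

The matched pair gets the lower loss because its similarity dominates the softmax denominator.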

---

## Analogy and Result

* Think of that loss as being asked to pick someone out of a lineup
  * You may have seen the person before, but they will not look the same
  * In order to make a match, you need to know something fundamental about human faces and bodies
* The NN version is a bit harder; you may be asked to match people given only an image of a foot and an image of a hand
* The only way to do that well is to "project" features into a space where all of that noise is ignored and only the signal is present

---

## Feature Quality

* The features have no labels
* The best way to minimize loss is for the model to distinguish between object types
  * The authors find this leads to clean features with no spurious correlations
* What is a spurious correlation?
  * There are waves, so the image has a boat
  * There is sky, so the image has a bird

---

## Example Features

<div class="col">
<img style="width: 80%" class="r-stretch" src="./figures/EmergingProperties_figure4.png" />
<br/>
<small>See <a href="https://openaccess.thecvf.com/content/ICCV2021/papers/Caron_Emerging_Properties_in_Self-Supervised_Vision_Transformers_ICCV_2021_paper.pdf">the DINO paper</a>, figure 4.</small>
</div>

---

## Direct Applications

* We could push that farther, using the raw feature maps for different applications
  * Semantic segmentation
  * Localization (and thus tracking)
* Tracking
  * Once we know that an object is present, we can build a tracker for it
  * Since the self-supervised features should be robust to occlusion and rotation, the tracker can be as well

---

## Segmentation

* Eigenvectors of a graph Laplacian built from the feature maps plus the original image immediately yield segmentation
  * See [Deep Spectral Methods: A Surprisingly Strong Baseline for Unsupervised Semantic Segmentation and Localization](https://openaccess.thecvf.com/content/CVPR2022/html/Melas-Kyriazi_Deep_Spectral_Methods_A_Surprisingly_Strong_Baseline_for_Unsupervised_Semantic_CVPR_2022_paper.html) from CVPR in 2022
  * Also the [github page](https://github.com/lukemelas/deep-spectral-segmentation)
* It is fairly intuitive that feature masks can be used for segmentation
  * Eigenvectors are also used in PCA
  * Which can also be used as a type of compression, discarding components with less information
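
A minimal sketch of the spectral idea, assuming a small set of per-patch feature vectors. It builds the affinity matrix from feature similarity only (the paper also mixes in color information from the image) and splits the patches with the eigenvector of the second-smallest Laplacian eigenvalue. The synthetic patches are made up for the demonstration.

```python
import numpy as np


def spectral_bipartition(feats):
    """Split patches into two groups using the Fiedler vector of a feature graph."""
    unit = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    affinity = np.clip(unit @ unit.T, 0.0, None)   # keep only positive similarity
    laplacian = np.diag(affinity.sum(axis=1)) - affinity
    eigvals, eigvecs = np.linalg.eigh(laplacian)   # eigenvalues in ascending order
    fiedler = eigvecs[:, 1]                        # second-smallest eigenvalue
    return fiedler > 0                             # the sign picks the segment


rng = np.random.default_rng(0)
direction_a, direction_b = np.eye(8)[0], np.eye(8)[1]
patches = np.concatenate([
    direction_a + 0.1 * rng.normal(size=(6, 8)),   # six "object" patches
    direction_b + 0.1 * rng.normal(size=(6, 8)),   # six "background" patches
])
segments = spectral_bipartition(patches)
print(segments.astype(int))
```

Which group comes out as `True` is arbitrary, since an eigenvector is only defined up to sign.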

---

## Generative Models

* So feature maps capture structure and give us an easy way to discard noise
  * But is what is left behind a compressed version of the original dataset?
* Let's briefly discuss a type of generative model
  * Generative Adversarial Network, or GAN

---

## GANs

* Usually, GANs consist of two networks
  * A *generator*, which creates synthetic data samples
  * A *discriminator*, which judges if an input is synthetic or real
* That generator should sound reminiscent of [Kolmogorov complexity](https://en.wikipedia.org/wiki/Kolmogorov_complexity)
  * Kolmogorov complexity is the length of a shortest computer program that produces the object

---

## Loss Equation

* [The Relativistic Discriminator: a Key Element Missing from Standard GAN](https://arxiv.org/pdf/1807.00734) introduces an improved GAN loss equation
* Let's say $D(x) = sigmoid(C(x))$
  * D is the discriminator, which classifies x as real or fake
  * C is a critic, just outputting a continuous value of the input's "realness"

---

## Learning

* The relativistic loss is $D(\tilde{x}) = sigmoid(C(x_{real}) - C(x_{fake}))$, where $\tilde{x}$ is a (real, fake) pair
  * The discriminator attempts to maximize this
* The generator attempts to maximize the opposite
  * $D(\tilde{x}) = sigmoid(C(x_{fake}) - C(x_{real}))$
* Thus the discriminator and generator compete over the critic's output: one trains C to rank real above fake, the other tries to fool it
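
A minimal sketch of the two objectives, written as losses to minimize (the negative log of the quantities above). The critic scores are made-up numbers; in practice they would come from the network C.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def relativistic_losses(c_real, c_fake):
    """Return (discriminator_loss, generator_loss) for paired critic scores."""
    d_loss = -np.mean(np.log(sigmoid(c_real - c_fake)))  # wants real > fake
    g_loss = -np.mean(np.log(sigmoid(c_fake - c_real)))  # wants fake > real
    return d_loss, g_loss


# Critic scores for three real and three generated samples (made-up numbers).
c_real = np.array([2.0, 1.5, 3.0])
c_fake = np.array([-1.0, 0.0, 0.5])
d_loss, g_loss = relativistic_losses(c_real, c_fake)
print(d_loss, g_loss)
```

When the critic cannot separate the two sets (equal scores), both losses sit at $\log 2$.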

---

## Generator as Compressor

* Imagine if the generator could create any possible image of a certain class
  * It would be an incredibly effective compressor of that class, right?
* Generators from GANs are not quite that powerful, but they are impressive
  * And what do they need as input? Usually just a "latent vector"
  * In early GANs this was just a uniformly random vector

---

## Noise and Latent Codes

* [InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets](https://arxiv.org/abs/1606.03657) was published way back in 2016

> In this paper, rather than using a single unstructured noise vector, we propose to decompose the input
noise vector into two parts: (i) z, which is treated as source of incompressible noise; (ii) c, which we
will call the latent code and will target the salient structured semantic features of the data distribution.
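
A minimal sketch of how such a generator input could be assembled. The sizes (62 noise dimensions, one 10-way categorical code, two continuous codes) are assumptions chosen for illustration, and the function name is hypothetical.

```python
import numpy as np


def sample_generator_input(rng, batch, noise_dim=62, n_categories=10, n_continuous=2):
    """Concatenate incompressible noise z with a structured latent code c."""
    z = rng.normal(size=(batch, noise_dim))
    category = rng.integers(n_categories, size=batch)
    c_discrete = np.eye(n_categories)[category]              # one-hot code
    c_continuous = rng.uniform(-1.0, 1.0, size=(batch, n_continuous))
    return np.concatenate([z, c_discrete, c_continuous], axis=1), category


rng = np.random.default_rng(0)
inputs, category = sample_generator_input(rng, batch=4)
print(inputs.shape)   # (4, 74)
```

Holding $z$ fixed and sweeping one entry of $c$ is how the effect of a single latent code is usually inspected.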

---

## Variations

* There are a few ways to learn latent codes
* Generally, gradient descent can compress information into a DNN
* Is that memorization, or does it represent learning as we know it?

<div class="col">
<img style="width: 90%" class="r-stretch" src="./figures/UDL/Chap15/GanConditional.svg" />
<br/>
<small>See <a href="https://udlbook.github.io/udlbook/">Understanding Deep Learning</a>, Figure 15.15.</small>
</div>

---

## Latent Code Impact

<div class="col">
<img style="width: 80%" class="r-stretch" src="./figures/InfoGan_Figure2.png" />
<br/>
<small>Read the <a href="https://arxiv.org/abs/1606.03657">InfoGAN paper</a>, see Figure 2.</small>
</div>

---

## Meaning

* By manipulating the latent code, we can force the generator to explore some dimension of the possible feature space
* That sounds like we have a small program (the DNN) compressing a much larger space of images
* In the 10 years since then, we have gotten better at controlling the exact meaning of the generator inputs
  * See the latest version of StyleGAN from 2022: [Third Time's the Charm? Image and Video Editing with StyleGAN3](https://arxiv.org/abs/2201.13433)

---

## Style GAN

* The components of the intermediate latent $w$ (produced from $z$ by a mapping network) can be thought of as coarse, medium, and fine, depending upon where they influence the output
* This leads to great control over the output images, as can be seen from example images in the [paper](https://openaccess.thecvf.com/content_CVPR_2019/html/Karras_A_Style-Based_Generator_Architecture_for_Generative_Adversarial_Networks_CVPR_2019_paper.html)

<div class="col">
<img style="width: 100%" class="r-stretch" src="./figures/UDL/Chap15/GanStyleGANArch.svg" />
<br/>
<small>See the UDL text, Figure 15.19.</small>
</div>

---

## More Control

* We can add more control over the latent code by using other methods

* StyleGAN uses a vector, S, to insert a style into the synthetic image
* [StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery](https://openaccess.thecvf.com/content/ICCV2021/html/Patashnik_StyleCLIP_Text-Driven_Manipulation_of_StyleGAN_Imagery_ICCV_2021_paper.html), from 2021, showed that S could be manipulated using text prompts and CLIP
  * The goal is to get a new latent code, $w$, given a source code, $w_s$

---

## Advances: Editing Directions

* So optimize this:
  * $\underset{w \in \mathcal{W}+}{\arg\min}\ D_{CLIP}(G(w), t) + \lambda_{L2} ||w - w_s||_2 + \lambda_{ID} \mathcal{L}_{ID}(w)$
  * $D_{CLIP}(G(w), t)$ is the cosine distance between the CLIP embedding of the generator's output and the CLIP embedding of the text prompt $t$
  * $\mathcal{L}_{ID}$ is an identity loss, measuring how much the output image differs from the source image
* Basically, find a new $w$ that changes the image as little as possible but brings the CLIP embedding as close as possible to the text
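
A minimal sketch that only evaluates this objective for one candidate $w$. It assumes the embeddings of $G(w)$ and of the text, as well as the identity loss value, were computed elsewhere; `editing_objective` and the $\lambda$ values are hypothetical placeholders, not the settings from the paper.

```python
import numpy as np


def cosine_distance(a, b):
    return 1.0 - float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))


def editing_objective(w, w_s, image_emb, text_emb, identity_loss,
                      lambda_l2, lambda_id):
    """D_CLIP + lambda_L2 * ||w - w_s|| + lambda_ID * L_ID for one candidate w."""
    clip_term = cosine_distance(image_emb, text_emb)
    l2_term = np.linalg.norm(w - w_s)
    return clip_term + lambda_l2 * l2_term + lambda_id * identity_loss


w_s = np.ones(4)                      # source latent code (toy size)
text_emb = np.array([1.0, 0.0])       # toy stand-in for the text embedding
# Unedited latent whose image already matches the text: nothing to pay.
print(editing_objective(w_s, w_s, text_emb, text_emb, 0.0, 1.0, 1.0))
# Edited latent whose image embedding is orthogonal to the text embedding.
print(editing_objective(w_s + 1.0, w_s, np.array([0.0, 1.0]), text_emb, 0.5, 1.0, 1.0))
```

An optimizer would follow the gradient of this scalar with respect to $w$, which requires a differentiable generator and image encoder.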

---

## Generators Overview

* So generators transform a latent variable into an output
  * The latent input has both semantic meaning and noise components
* Let's say we had an output that didn't involve noise; what would that be?
  * Just a semantic meaning vector that produces a fixed output
* Sounds like compression, right?
  * Let's move on to our last, and most explicit, example

---

## Puzzle Solving

* In 2019, [On the Measure of Intelligence](https://arxiv.org/abs/1911.01547) suggested that "superhuman" performance of DNNs on various benchmarks was overselling progress
  * Humans didn't evolve to classify 224x224 pixel images and the DNNs get to train with millions of examples
  * AlphaGo Zero played tens of millions of games, but a human playing 10 games a day for 100 years will only reach 365,000
    * Is that really a proper comparison?
* The author suggests testing the *acquisition* of new problem solving abilities

---

> That is to say, intelligence is the rate at which a learner turns its experience and priors into new skills at valuable tasks that involve uncertainty and adaptation.

> If an AI system has access to extensive, task-specific prior knowledge that is not available to a human, its performance on that task becomes a measure of the developer's cleverness in encoding that knowledge, not the AI's inherent intelligence.

---

## ARC

* ARC is the Abstraction and Reasoning Corpus
* It is designed to measure *skill acquisition efficiency*, meaning how well an agent can learn to solve puzzles
* Puzzles expect that an intelligence can infer the rule from a small number of examples
* They are now on iteration three of the [ARC-AGI challenge](https://arcprize.org/arc-agi)

---

## Examples

<div class="col">
<img style="width: 80%" class="r-stretch" src="./figures/ARC_fig8.png" />
<br/>
<small>See Figure 8 of the <a href="https://arxiv.org/abs/1911.01547">ARC paper</a>.</small>
</div>

---

## Examples

<div class="col">
<img style="width: 80%" class="r-stretch" src="./figures/ARC_fig9.png" />
<br/>
<small>See Figure 9 of the <a href="https://arxiv.org/abs/1911.01547">ARC paper</a>.</small>
</div>

---

## Examples

<div class="col">
<img style="width: 80%" class="r-stretch" src="./figures/ARC_fig10.png" />
<br/>
<small>See Figure 10 of the <a href="https://arxiv.org/abs/1911.01547">ARC paper</a>.</small>
</div>

---

## Types of Priors

* What knowledge is being tested?
  * Ideas of shapes and objects being formed from sub-units
  * Object persistence in spite of noise or occlusion
  * Translation, rotation, scaling, etc
  * Counting, repetition
* These are very general, and we know a DNN is generally capable of doing these tasks

---

## ARC Progress

* The first version had 400 tasks in the training set and 400 in a public evaluation set
* A hold-out set of 100 tasks was used in competition from 2020 through 2024
* At the end of 2024, OpenAI used the ARC-AGI challenge as a demonstration of their latest reasoning model
  * It eventually scored 87% with high compute

---

## Cost of Progress

* The cost on the x-axis is computed from the number of tokens consumed

<div class="col">
<img style="width: 90%" class="r-stretch" src="./figures/arc-prize-leaderboard-arc1.png" />
<br/>
<small>From the <a href="https://arcprize.org/leaderboard">ARC leaderboard</a>.</small>
</div>

---

## Version 2

<div class="col">
<img style="width: 80%" class="r-stretch" src="./figures/arc-prize-leaderboard-arc2.png" />
<br/>
<small>From the <a href="https://arcprize.org/leaderboard">ARC leaderboard</a>.</small>
</div>

---

## Version 3 (new for 2026)

<div class="col">
<img style="width: 90%" class="r-stretch" src="./figures/arc-prize-leaderboard-arc3.png" />
<br/>
<small>From the <a href="https://arcprize.org/leaderboard">ARC leaderboard</a>.</small>
</div>

---

## Comparison

* Humans run at around 100 watts (all of us, not just the brainy parts)
  * Similar to a laptop, but we can also actuate things
  * Both susceptible to viruses
* Could you pay a human to do these tasks?
  * Yeah, people would probably do [these](https://arcprize.org/tasks/tn36) for free

---

## Without Brute Force

* Using one of those large reasoning models is overkill
  * It is like applying all of the world's text and images to solve a children's puzzle book
* Is there a more efficient way to approach these problems?
* Let's look at two papers
  * In [ARC-AGI Without Pretraining](https://arxiv.org/abs/2512.06104) the authors suggest that compressing the problems is a better approach
  * In [Less is More: Recursive Reasoning with Tiny Networks](https://arxiv.org/abs/2510.04871) the author outperforms several LLMs with less than 0.01% of the parameters using a simplified hierarchical reasoning model

---

## CompressARC

* [ARC-AGI Without Pretraining](https://arxiv.org/abs/2512.06104) attempts to learn solutions to the ARC-AGI problems without *any* pretraining, using only the test puzzle itself for training

> CompressARC tries to solve the problem of compressing the data into as short a program as possible,
to obtain the puzzle solutions while keeping the program search feasible. In this case, the code
must be entirely self-contained, receive no inputs, and must print out the entire ARC-AGI dataset
of puzzles with any solutions filled in.

---

## Approach

* Setup:
  * Puzzles are represented by tensors of shape [n_examples, n_colors, width, height, 2]
  * Make a DNN classifier with outputs [n_examples, n_colors, width, height, 2]
    * The neural network must be equivariant to simple augmentations
  * Pick a mean and variance ($\mu$ and $\Sigma$) for $z$, the network input
* Training:
  * Optimize network parameters $\theta$, $\mu$, and $\Sigma$
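
A minimal sketch of the tensor layout, assuming the trailing axis of size 2 separates the input grid from the output grid and that smaller grids are zero-padded to a common size. Both are assumptions for illustration, and `encode_puzzle` is a hypothetical helper rather than the authors' code.

```python
import numpy as np


def encode_puzzle(pairs, n_colors=10):
    """One-hot encode (input, output) grid pairs into one zero-padded tensor."""
    width = max(g.shape[0] for pair in pairs for g in pair)
    height = max(g.shape[1] for pair in pairs for g in pair)
    tensor = np.zeros((len(pairs), n_colors, width, height, 2))
    for e, pair in enumerate(pairs):
        for side, grid in enumerate(pair):          # side 0 = input, 1 = output
            rows, cols = np.indices(grid.shape)
            tensor[e, grid, rows, cols, side] = 1.0  # mark each cell's color
    return tensor


pairs = [
    (np.array([[0, 1], [2, 0]]), np.array([[1, 1], [2, 2]])),
    (np.array([[3]]), np.array([[3, 3, 3]])),
]
puzzle = encode_puzzle(pairs)
print(puzzle.shape)   # (2, 10, 2, 3, 2)
```

A classifier with the same output shape can then be scored with a per-cell cross-entropy over the color axis.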

---

## What is This?

* The random input is acting like a latent input to a GAN or VAE
* There is no noise in the output, so $z$ should become a compressed representation of the puzzle
* But the reconstruction itself has errors
  * So the authors find the most common output and use that as the answer
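
A minimal sketch of picking the most common output, assuming each sampled reconstruction is a small integer grid. `most_common_grid` is a hypothetical helper, and the three sample grids are made up.

```python
from collections import Counter

import numpy as np


def most_common_grid(samples):
    """Return the grid that appears most often among the sampled outputs."""
    def key(grid):
        return (grid.shape, grid.tobytes())   # hashable identity of a grid

    winner, _ = Counter(key(g) for g in samples).most_common(1)[0]
    return next(g for g in samples if key(g) == winner)


good = np.array([[1, 2], [2, 1]])
noisy = np.array([[1, 2], [2, 0]])            # one cell reconstructed wrongly
answer = most_common_grid([good, noisy, good.copy()])
print(answer)
```

Voting over whole grids rather than single cells keeps the answer an output that the network actually produced.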

---

## Visualization

* The authors have a [website](https://iliao2345.github.io/blog_posts/arc_agi_without_pretraining/arc_agi_without_pretraining.html) with some visualizations

<div class="col">
<img style="width: 90%" class="r-stretch" src="./figures/CompressARCSite_learning.png" />
<br/>
<small>CompressARC learning, as visualized by the authors.</small>
</div>

---

## Learning and Compression

* Each trained model is a generator for a particular type of puzzle
* Training *is* the learning process
  * Occurs as puzzles are encountered rather than beforehand
* So the compression, via random input and network parameters, is the learning

---

## Results

* This solved about 20% of the puzzles
  * Which isn't bad, considering all it does is compression on each one individually
  * So there was no generalization and a limited number of samples

---

## Better Learning, Better AI

* For the final set of course topics, we'll focus on this view of learning as compression
  * Knowledge is then a decompressor
* In a biological context, this seems relevant
  * Being able to represent things efficiently means fewer calories and resources invested
* So maybe this is the right way for us to go with artificial intelligence

<!--

ARC-AGI Without Pretraining
https://arxiv.org/pdf/2512.06104
https://iliao2345.github.io/blog_posts/arc_agi_without_pretraining/arc_agi_without_pretraining.html

https://arcprize.org/arc-agi/2

The advisor has lots of interesting papers on sequence modeling and state spaces:
https://scholar.google.com/citations?user=DVCHv1kAAAAJ&hl=en&oi=ao
Basically, he wants to bring back recurrent networks that operate upon very long sequences.

Others have combined the Mamba-2 recursive structure into tiny reasoning models

-->

<!--

Deep Spectral Methods: A Surprisingly Strong Baseline for Unsupervised Semantic Segmentation and Localization
CVPR 2022
https://openaccess.thecvf.com/content/CVPR2022/html/Melas-Kyriazi_Deep_Spectral_Methods_A_Surprisingly_Strong_Baseline_for_Unsupervised_Semantic_CVPR_2022_paper.html
https://github.com/lukemelas/deep-spectral-segmentation

The spectral methods paper refers to:
https://arxiv.org/abs/2109.14279
Localizing Objects with Self-Supervised Transformers and no Labels
from 2021

-->