# CS 530 - Lecture 21

## Latent Spaces and Knowledge Compression

Bernhard Firner

2026-04-16

---

## Knowledge Compression

* Last time we introduced the idea that compression and knowledge are related
* Argument:
  * Compressing something must indicate a fundamental understanding of its structure
  * That understanding may allow an encoding that is vastly smaller than the source
  * Solving problems in a smaller space is easier than in a larger one, so this is good
* For an extreme example, consider the [Mandelbrot set](https://en.wikipedia.org/wiki/Mandelbrot_set)
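
To make the Mandelbrot point concrete, here is a minimal sketch of how little "knowledge" is needed to regenerate the whole set: a few lines of code stand in for an image of unlimited detail. The function names, the iteration cap of 100, and the ASCII rendering are illustrative choices, not part of the lecture.

```python
def in_mandelbrot(c: complex, max_iter: int = 100) -> bool:
    """Return True if c has not escaped after max_iter steps of z -> z*z + c."""
    z = 0j
    for _ in range(max_iter):
        z = z * z + c
        if abs(z) > 2.0:  # once |z| > 2 the orbit is guaranteed to diverge
            return False
    return True


def render(width: int = 60, height: int = 24) -> str:
    """Draw a coarse ASCII view of the region [-2, 1] x [-1.2, 1.2]."""
    rows = []
    for j in range(height):
        y = -1.2 + 2.4 * j / (height - 1)
        row = ""
        for i in range(width):
            x = -2.0 + 3.0 * i / (width - 1)
            row += "#" if in_mandelbrot(complex(x, y)) else "."
        rows.append(row)
    return "\n".join(rows)


if __name__ == "__main__":
    print(render())
```

The trade is time for space: the encoding is tiny, but every pixel has to be recomputed when the image is decoded.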

---

## Why Care?

* There are practical and hypothetical reasons to care
* Practically, this impacts Q-learning and policy prediction
  * We've observed that continuous spaces make learning difficult
  * The curse of dimensionality leads to poorer decision boundaries between observed and uncertain states
* It is easy to argue that embeddings learned by many different systems are a type of compression

---

## Hypothetically

* There are interesting results that imply compression is a good direction in general
* We saw some interesting results from CompressARC in [ARC-AGI Without Pretraining](https://arxiv.org/abs/2512.06104)
  * This isn't enough to make us drop everything, but it is interesting
* Why is it that compression can simplify problems?

---

## Dimensionality

* Most interesting problems have features that exist in some high-dimensional space
  * An RGB image of size 1000x1000 has 3x1000x1000 different "features"
  * But that doesn't mean visual problems have dimensionality that high
* The [Johnson-Lindenstrauss Lemma](https://en.wikipedia.org/wiki/Johnson%E2%80%93Lindenstrauss_lemma) tells us that high dimensional problems may be convertible to low dimensions
  * In brief, we can project points from a high-dimensional space into a low-dimensional one with low distortion of the distances between them
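
A minimal sketch of the kind of projection the lemma describes, assuming a Gaussian random matrix (one common construction, not the only one). The sizes 1000, 256, and 5 are arbitrary illustrative values, and the code is pure Python to stay dependency-free.

```python
import math
import random


def random_projection_matrix(high_dim: int, low_dim: int, rng: random.Random):
    """Gaussian matrix scaled so squared lengths are preserved in expectation."""
    scale = 1.0 / math.sqrt(low_dim)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(high_dim)]
            for _ in range(low_dim)]


def project(matrix, point):
    """Multiply the (low_dim x high_dim) matrix by a high_dim point."""
    return [sum(m * p for m, p in zip(row, point)) for row in matrix]


rng = random.Random(0)
HIGH_DIM, LOW_DIM, NUM_POINTS = 1000, 256, 5  # illustrative sizes only

points = [[rng.uniform(-1.0, 1.0) for _ in range(HIGH_DIM)]
          for _ in range(NUM_POINTS)]
matrix = random_projection_matrix(HIGH_DIM, LOW_DIM, rng)
projected = [project(matrix, p) for p in points]

# Ratio of projected distance to original distance for every pair of points.
ratios = [
    math.dist(projected[i], projected[j]) / math.dist(points[i], points[j])
    for i in range(NUM_POINTS) for j in range(i + 1, NUM_POINTS)
]
print(f"distance ratios range from {min(ratios):.3f} to {max(ratios):.3f}")
```

Ratios close to 1 mean the pairwise geometry survived the drop from 1000 to 256 dimensions.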

---

## DNN Dimensionality Reduction

* Obviously dimensionality reduction is its own field
* DNNs must do some dimensionality reduction automatically, even if it is ignored
  * It is less explicit than, for example, the kernel trick in SVMs, but present
* It is difficult to understand how many dimensions exist in real world data
  * So when a DNN successfully classifies, how much dimensionality reduction took place?

---

## Example

* Let's revisit this paper that was mentioned last class:
  * [Deep Spectral Methods: A Surprisingly Strong Baseline for Unsupervised Semantic Segmentation and Localization](https://openaccess.thecvf.com/content/CVPR2022/html/Melas-Kyriazi_Deep_Spectral_Methods_A_Surprisingly_Strong_Baseline_for_Unsupervised_Semantic_CVPR_2022_paper.html) by Melas-Kyriazi et al, from CVPR in 2022
  * And the accompanying [github page](https://lukemelas.github.io/deep-spectral-segmentation/)

---

## Deep Spectral Methods

* This work begins with a self-supervised model for images
* The goal is to partition the image so that the similarity between different partitions is minimized
* An image is thought of as a graph
  * Each pixel is a node
  * The weight of an edge between two pixels is their affinity

---

## Affinities

* In some basic compression techniques, the affinity comes from KNN using color channels
* The authors add features into that
  * $W = W_{features} + \lambda_{knn}W_{knn}$
  * $\lambda_{knn}$ is a weighting factor chosen by hand
* Using that W, the eigenvectors of the Laplacian are used to segment the image
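
A toy sketch of this pipeline on a six-node graph, not the authors' implementation. The affinity values and `lam=0.1` are made up for illustration, and a pure-Python power iteration stands in for a proper eigensolver. The sign of the Laplacian eigenvector with the second-smallest eigenvalue (the Fiedler vector) gives a two-way split.

```python
def combine_affinities(w_features, w_knn, lam):
    """W = W_features + lam * W_knn, element by element."""
    n = len(w_features)
    return [[w_features[i][j] + lam * w_knn[i][j] for j in range(n)]
            for i in range(n)]


def fiedler_vector(w, iterations=500):
    """Approximate the Laplacian eigenvector with the second-smallest eigenvalue.

    Runs power iteration on (c*I - L) while projecting out the constant
    vector, which is the eigenvector of L = D - W with eigenvalue zero.
    """
    n = len(w)
    degree = [sum(row) for row in w]
    c = 2.0 * max(degree)  # upper bound on the largest eigenvalue of L
    v = [float(i) for i in range(n)]  # arbitrary non-constant starting vector
    for _ in range(iterations):
        mean = sum(v) / n
        v = [x - mean for x in v]  # remove the constant component
        # Apply (c*I - L) v = c*v - D v + W v.
        v = [c * v[i] - degree[i] * v[i] + sum(w[i][j] * v[j] for j in range(n))
             for i in range(n)]
        norm = sum(x * x for x in v) ** 0.5
        v = [x / norm for x in v]
    return v


# Six "pixels": 0-2 share one feature cluster, 3-5 share another (toy values).
w_features = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
# Assumption: neighbouring pixels along a line get a colour-based KNN affinity.
w_knn = [[1 if abs(i - j) == 1 else 0 for j in range(6)] for i in range(6)]

w = combine_affinities(w_features, w_knn, lam=0.1)
segments = [0 if x < 0 else 1 for x in fiedler_vector(w)]
print(segments)
```

The only link between the two groups is the weak KNN edge between pixels 2 and 3, so that is where the cut lands.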

---

## Approach

<div class="col">
<img style="width: 70%" class="r-stretch" src="./figures/DeepSpectralMethods_figure2.png" />
<br/>
<small>Read the <a href="https://arxiv.org/abs/2205.07839">Deep Spectral Methods paper</a>, see Figure 2.</small>
</div>

---

## Semantic Segmentation

* Going one step further, the authors attempt semantic segmentation and grouping across a dataset
* They begin with segmentation on each image, as described
* Then a feature vector is computed for each segment
* Finally, K-means is used to cluster feature vectors across the dataset
  * Here the authors used some a-priori knowledge that their dataset had 20 classes
  * They made 21 clusters, one for each class and one for backgrounds
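
The clustering step can be sketched with a plain K-means loop. The two-dimensional "segment features" and `k=2` below are toy stand-ins for the real feature vectors and 21 clusters, and the farthest-point initialisation is an assumption made here for repeatability, not something taken from the paper.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm with a deterministic farthest-point start."""
    # Assumption: farthest-point initialisation, chosen for repeatability.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each feature vector joins its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids


# Hypothetical per-segment feature vectors from several images.
segment_features = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
                    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = kmeans(segment_features, k=2)
print(labels)
```

Segments that land in the same cluster are treated as the same semantic class across the dataset.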

---

## Dimensionality

* Because the feature maps are the output of a DNN, one way to view this is as a compression of image data to 5-bit pixels
  * $\log_2(21) \approx 4.392$, which we round up to 5 bits
* A decoder could theoretically look at the shape of the segment and do a decent job recreating the original pixels
  * Although for most tasks (classification, for example) reconstruction isn't necessary
* If we also added a spatial component to compression, we could achieve further dimensionality reduction
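
The bit count is just the base-2 logarithm of the number of cluster labels. A quick check, assuming raw pixels cost 24 bits (8 bits per RGB channel):

```python
import math

NUM_CLUSTERS = 21        # 20 classes plus one background cluster
RAW_BITS_PER_PIXEL = 24  # assumption: 8 bits for each of the R, G, B channels

exact_bits = math.log2(NUM_CLUSTERS)
whole_bits = math.ceil(exact_bits)

print(f"log2({NUM_CLUSTERS}) = {exact_bits:.3f} bits, rounded up to {whole_bits}")
print(f"reduction versus raw RGB: {RAW_BITS_PER_PIXEL / whole_bits:.1f}x")
```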

---

## Actual Dimensionality

* But is that actually the dimensionality of the problem space?
* Consider [InfoGAN](https://arxiv.org/abs/1606.03657), a network trained to produce synthetic images
  * Several latent control variables are used to direct the output
  * And one of those control variables was identified as corresponding to the digit class

---

## Compression and Dimensionality

* If InfoGAN had compressed the dataset with 10 classes into a representation with dimensionality 1 that would be great
  * It would also mean that the semantically important part of the dataset could be represented with a single value
* And if this worked in general, then we could go around compressing everything into a minimal representation
* We could choose the minimal required network sizes for each problem
  * Or make good estimates about how much more capacity might be required
  * For example, how much larger and more complicated would CompressARC need to be to solve 50% of ARC-AGI? 75%? 100%?

---

## InfoGAN Results

* But that control variable did not *always* correspond to the class
  * The problem space dimensionality was reduced, but is still greater than 1

<div class="col">
<img style="width: 55%" class="r-stretch" src="./figures/InfoGan_Figure2.png" />
<br/>
<small>Read the <a href="https://arxiv.org/abs/1606.03657">InfoGAN paper</a>, see Figure 2.</small>
</div>

---

## Dataset Dimensionality

* Do we have any expectation of what dimensionality we should have seen?
* Is image classification inherently 1 dimensional?
  * 2 dimensional?
  * 3?
* How much does it vary based upon the image classes?
* If we knew, we could judge the "solving power" of current techniques

---

## Dataset Dimensionality

* There are formal methods to measure dimensionality, of course
* VC dimensionality has been studied and successfully applied in other areas
* But we don't have a good grasp of the dimensionality of problems solved with DNNs
* This gap is widely recognized, but not well studied
  * [A Theoretical-Empirical Approach to Estimating Sample Complexity of DNNs](https://openaccess.thecvf.com/content/CVPR2021W/TCV/html/Bisla_A_Theoretical-Empirical_Approach_to_Estimating_Sample_Complexity_of_DNNs_CVPRW_2021_paper.html) took a shot at it

---

## Data Complexity

* Asking about data compressibility is similar to asking how much uniqueness is present in a dataset
* Let's assume that DNNs ameliorate the curse of dimensionality by compressing without relevant information loss
* We don't know the dimensionality of the embedding created by the DNN, but we could measure it empirically
  * Just add data and see how quickly the embedded space is filled
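
A minimal sketch of that "add data and watch the space fill" measurement, under the assumption that the nearest-neighbour distance shrinks like $N^{-1/d}$. The two samplers are hypothetical stand-ins for an embedding: both return 3-D vectors, but one only varies along a line and the other along a plane.

```python
import math
import random


def mean_nn_distance(points):
    """Average distance from each point to its nearest other point (brute force)."""
    total = 0.0
    for i, p in enumerate(points):
        total += min(math.dist(p, q) for j, q in enumerate(points) if j != i)
    return total / len(points)


def estimate_dimension(sample, small_n, large_n, trials=8):
    """Estimate d from how the nearest-neighbour distance shrinks as N grows.

    Assumes distance ~ N**(-1/d), so d = log(large_n / small_n) / log(ratio).
    """
    small = sum(mean_nn_distance([sample() for _ in range(small_n)])
                for _ in range(trials)) / trials
    large = mean_nn_distance([sample() for _ in range(large_n)])
    return math.log(large_n / small_n) / math.log(small / large)


rng = random.Random(0)


def on_a_line():
    """Points in a 3-D embedding that really only vary along one direction."""
    t = rng.random()
    return (t, 2.0 * t, -t)


def on_a_plane():
    """Points in a 3-D embedding that vary along two directions."""
    u, v = rng.random(), rng.random()
    return (u, v, u + v)


d_line = estimate_dimension(on_a_line, 64, 1024)
d_plane = estimate_dimension(on_a_plane, 64, 1024)
print(f"line:  d is roughly {d_line:.2f}")
print(f"plane: d is roughly {d_plane:.2f}")
```

The estimate tracks how many directions the data actually uses, not the three coordinates it is stored in.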

---

## Example

* Consider support vector machines (just for a moment!)
* They are generally used with a kernel for classification
  * The support vectors are data points that will be used for classification
* The radial basis function (RBF, or Gaussian) kernel measures similarity between an unknown point and the support vectors
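
For reference, a minimal sketch of the RBF similarity itself. The `gamma` width parameter and the two support vectors are illustrative values, not taken from any trained model.

```python
import math


def rbf_kernel(x, support_vector, gamma=1.0):
    """Gaussian similarity: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, support_vector))
    return math.exp(-gamma * squared_distance)


support_vectors = [(0.0, 0.0), (3.0, 0.0)]  # hypothetical support vectors
unknown_point = (1.0, 0.0)

for sv in support_vectors:
    print(sv, round(rbf_kernel(unknown_point, sv), 4))
```

The similarity is 1 when the point sits on a support vector and decays toward 0 with distance, which is why the kernel behaves like a soft cluster membership.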

---

## SVM and RBF Kernels

* An RBF kernel is akin to clustering
  * And with an arbitrary number of support vectors it can "shatter" problems of arbitrary dimensionality


---

## Model Complexity

* If we used an infinite number of support vectors to shatter our dataset, then the model has not successfully simplified things
* But we could find a correlation between the current number of training points and the expectation that the next unseen point will be misclassified
* That relationship should be correlated with model capacity

---

## Capacity and Data Dimensionality

* By measuring the distance of new points to existing points, we can estimate an error probability
* The result of the [Estimating Sample Complexity](https://openaccess.thecvf.com/content/CVPR2021W/TCV/html/Bisla_A_Theoretical-Empirical_Approach_to_Estimating_Sample_Complexity_of_DNNs_CVPRW_2021_paper.html) paper was this equation:
  * $\mathcal{O} \left( \frac{1}{\delta N^{1/d}} \right)$

---

## Equation

* $\mathcal{O} \left( \frac{1}{\delta N^{1/d}} \right)$
  * $N$ is the dataset size
  * $d$ is the dimensionality of the dataset
    * As experienced by the DNN
  * $\delta$ is the radius where prediction error saturates
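
To get a feel for the expression, here is a small sketch that evaluates it directly. The constant hidden by the big-O is ignored, so only the trend is meaningful, and the function names are made up for this example.

```python
def error_scale(n_samples: int, dimension: float, delta: float) -> float:
    """Evaluate 1 / (delta * N**(1/d)), ignoring the constant hidden by big-O."""
    return 1.0 / (delta * n_samples ** (1.0 / dimension))


def samples_to_halve_error(dimension: float) -> float:
    """Factor by which N must grow to halve the expression: 2**d."""
    return 2.0 ** dimension


for d in (1, 2, 4, 8):
    print(d, error_scale(10_000, d, delta=1.0), samples_to_halve_error(d))
```

With $N$ fixed, the expression grows quickly with $d$, and halving it costs $2^d$ times as much data.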

---

## Example

* What is that saying?
* Suppose that some intermediate step of our DNN has projected samples to a lower dimensional space
  * Each projection is then compared to some template, $\tilde{x}$
  * This presupposes that a DNN "stores" a copy of either individual examples or averages of them to use for comparison
* $P_{error}(x) \triangleq \min\left(1, \frac{\|f(x) - \tilde{x}\|_2}{\delta}\right)$
* $f$ is the feature extraction component of the DNN and $\delta$ is a radius where samples are considered similar
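
The definition translates almost directly into code. In this sketch the feature vector is passed in already extracted, so $f$ is left out, and the template and radii are hypothetical values.

```python
import math


def p_error(features, template, delta):
    """min(1, ||f(x) - x_tilde||_2 / delta) for an already-extracted feature vector."""
    return min(1.0, math.dist(features, template) / delta)


template = (0.0, 0.0)  # hypothetical stored template x_tilde

print(p_error((3.0, 4.0), template, delta=10.0))  # distance 5, inside the radius
print(p_error((3.0, 4.0), template, delta=2.0))   # distance 5, beyond the radius
```

Inside the radius the error probability grows linearly with distance, and beyond it the value is clamped to 1.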

---

## Errors

* Errors will always occur when the distance between sample features and the template is greater than $\delta$
* If we assume that the model has infinite capacity, then we can just use the nearest point as the template
* How close is the nearest point?
  * That is clearly a function of the number of points, $N$

---

## Dimensionality

* The distance to the nearest point is also a function of the dimensionality of the space, $d$
  * Why? The same argument as the curse of dimensionality
* In one dimensional space, the expected distance varies inversely with $N$
  * In 2D space, inversely with $N^{1/2}$
  * Generally, inversely with $N^{1/d}$
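
One way to see where the exponent comes from is a rough packing argument (a sketch, not the paper's derivation). Suppose $N$ points are spread evenly through a unit volume in $d$ dimensions, and each point "owns" a ball of radius $r$. Those balls together roughly fill the space, so with $c_d$ the volume of the unit ball in $d$ dimensions:

$$
N \, c_d \, r^d \approx 1
\quad \Longrightarrow \quad
r \approx \left( c_d N \right)^{-1/d} \propto \frac{1}{N^{1/d}}
$$

Dividing that nearest-point distance by the radius $\delta$ gives the $\frac{1}{\delta N^{1/d}}$ form, and halving $r$ takes $2^d$ times as many points.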

---

## Testing That Theory

* Of course that is a theory of how dimensionality, learning, and dataset size could be related
  * Since that equation predicts a specific curve though, it can be tested experimentally
* We can control $N$, but we still have two unknowns, $d$ and $\delta$
  * Solution: bottleneck the network by removing most features, see if it still works

---

## Results

* Saturation indicates the dimensionality required

<div class="col">
<img style="width: 70%" class="r-stretch" src="./figures/DNNComplexity_fig2.png" />
<br/>
<small>Read the <a href="https://openaccess.thecvf.com/content/CVPR2021W/TCV/html/Bisla_A_Theoretical-Empirical_Approach_to_Estimating_Sample_Complexity_of_DNNs_CVPRW_2021_paper.html">sample complexity paper</a>, see Figure 2.</small>
</div>

---

## Dimensionality

* So MNIST and CIFAR10 have dimensionality 2?
  * Sounds reasonable, actually
* Imagenet is 3-4? Maybe?
* If we plug those back into the equation, do the error rates match expectations?

---

## Error Curves

* The authors searched for a best fit $\delta$ after training on a subset of data

<div class="col">
<img style="width: 70%" class="r-stretch" src="./figures/DNNComplexity_fig3.png" />
<br/>
<small>Read the <a href="https://openaccess.thecvf.com/content/CVPR2021W/TCV/html/Bisla_A_Theoretical-Empirical_Approach_to_Estimating_Sample_Complexity_of_DNNs_CVPRW_2021_paper.html">sample complexity paper</a>, see Figure 3.</small>
</div>

---

## Knowledge and Compression

* This study is interesting because it attempts to describe the complexity of a decision space
  * This is rare in today's literature
* Imagine if the ARC-AGI challenge had an accurate way to measure complexity
  * Then we could just find the dimensionality of human-solvable tasks, compare them to machine solvable tasks, and properly measure current distance to AGI
* In general, if we knew the size of the latent space required to represent something, many tasks would be simplified

---

## Never Give Up

* Remember the [Never Give Up: Learning Directed Exploration Strategies](https://arxiv.org/abs/2002.06038) paper?
* The authors used a predictive task to force a DNN to learn the embedding

<div class="col">
<img style="width: 70%" class="r-stretch" src="./figures/NeverGiveUpFig2Left.png" />
<br/>
<small>See Figure 2 in the paper.</small>
</div>

---

## Embeddings and Dimensionality

* We use embeddings all the time
  * And then we use them for clustering and distance measures
* But is that meaningful?
* If each dimension has the same amount of information, then it is
  * But what if multiple features are entangled into one embedding?
* DNNs are doing compression, but we don't have a good way to measure what is actually happening

---

## Explicit Compression

* Researchers may be making progress on this problem
  * In [Generative Latent Coding for Ultra-Low Bitrate Image Compression](https://openaccess.thecvf.com/content/CVPR2024/html/Jia_Generative_Latent_Coding_for_Ultra-Low_Bitrate_Image_Compression_CVPR_2024_paper.html) the authors use generative models for image compression
* Generative models use a latent vector, $z$, as a basis for data generation
* Since $z$ can lead to an entire image, as long as $z$ takes fewer bits to store than the image it also serves as a compressed encoding

---

## Framework

* The Generative Latent Coding (GLC) model is trained in three stages
  * First, train an auto-encoder to make visually correct images
  * Second, train a module to predict the latent code for images
  * Third, co-train the auto-encoder and code predictor together for fine-tuning
* This glosses over many details, but such high compression indicates a better estimate of the information in an image

---

> [T]hese methods often lack a careful consideration of the
correlation among the latents, resulting in a insufficient redundancy reduction
and consequently a high bit cost. In GLC, we introduce a transform coding
module to compress the latent, replacing the vector-quantization step for more
effective reduction of latent redundancy.

---

## Results

* It's worth looking at the paper's pdf [Generative Latent Coding for Ultra-Low Bitrate Image Compression](https://openaccess.thecvf.com/content/CVPR2024/html/Jia_Generative_Latent_Coding_for_Ultra-Low_Bitrate_Image_Compression_CVPR_2024_paper.html) so you can zoom in on the individual results
* The authors reported compression at 0.04 bpp on natural images and 0.01 bpp on faces with seemingly high quality
  * Assume 24 bit pixels
* High quality for JPEG is around 2.7:1, or about 9 bpp
  * Good quality is around 23:1, or about 1 bpp
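
Bits per pixel and compression ratio are two views of the same number once a raw pixel size is fixed. A small sketch of the conversion, using the 24-bit assumption from above (the helper names are made up here):

```python
RAW_BITS_PER_PIXEL = 24  # 8 bits for each of R, G, B, as assumed on the slide


def ratio_from_bpp(bpp: float) -> float:
    """Compression ratio relative to raw 24-bit pixels."""
    return RAW_BITS_PER_PIXEL / bpp


def bpp_from_ratio(ratio: float) -> float:
    """Bits per pixel implied by a compression ratio."""
    return RAW_BITS_PER_PIXEL / ratio


print(f"GLC natural images, 0.04 bpp: {ratio_from_bpp(0.04):.0f}:1")
print(f"GLC faces, 0.01 bpp: {ratio_from_bpp(0.01):.0f}:1")
print(f"JPEG at 23:1: {bpp_from_ratio(23):.2f} bpp")
```

Under that assumption the reported GLC bitrates correspond to ratios in the hundreds to thousands, against tens for JPEG.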

---

## Experience

* The ability of GLC to compress images is based upon its *experience*, in the form of the encoder-decoder parameters
  * Meaning that if they were trained on faces, they should do a good job
  * But if they were trained on only faces and then used to compress flowers, the results would not be good
* Bringing this back around to knowledge compression, humans who are good at games are known to effectively compress the current game states
  * This is known as [Chunking Theory](https://en.wikipedia.org/wiki/Chunking_(psychology)) in psychology

---

## Learning and Compression

* We know that the latent space learned through various training schemes can form a compressed representation of a scene
* Compression applies to more than just pixels
  * In CompressARC this is enough to solve logic problems
  * In Never Give Up, it was used to identify the novelty of game states
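
As a rough illustration of novelty in an embedding space, the sketch below scores a state by its mean distance to the `k` nearest embeddings already in memory. This is a simplified stand-in, not the intrinsic reward defined in the Never Give Up paper, and the embeddings are hypothetical.

```python
import math


def novelty(embedding, memory, k=3):
    """Mean distance from an embedding to its k nearest neighbours in memory."""
    if not memory:
        return float("inf")  # nothing seen yet, so everything is novel
    distances = sorted(math.dist(embedding, m) for m in memory)
    nearest = distances[:k]
    return sum(nearest) / len(nearest)


# Hypothetical embeddings of states visited so far.
memory = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]

print(novelty((0.05, 0.05), memory))  # close to visited states
print(novelty((2.0, 2.0), memory))    # far from anything in memory
```

The score only makes sense if distances in the embedding are meaningful, which is exactly what the compression argument is asking for.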

---

## World Models

* So here is a thought to bounce around your brain:
  * Should problem solving be happening in the latent space itself?
* The latent space is basically a model of the current world state
  * And if we can estimate how the world state changes with actions, we should be able to do planning in that space
* It turns out that this works!
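
A toy sketch of planning inside a latent space. The latent state is a single integer, and `transition` and `value` are hand-written placeholders for what would be learned networks, so this only shows the shape of the idea.

```python
from itertools import product

ACTIONS = (-1, 0, 1)


def transition(z: int, action: int) -> int:
    """Hypothetical latent dynamics: stands in for a learned world model."""
    return z + action


def value(z: int, goal: int) -> float:
    """Hypothetical critic: latent states nearer the goal are worth more."""
    return -abs(goal - z)


def plan(z0: int, goal: int, horizon: int):
    """Try every action sequence in imagination and keep the best one."""
    best_score, best_sequence = float("-inf"), None
    for sequence in product(ACTIONS, repeat=horizon):
        z = z0
        for action in sequence:
            z = transition(z, action)  # no real environment step is taken
        score = value(z, goal)
        if score > best_score:
            best_score, best_sequence = score, sequence
    return best_sequence


print(plan(z0=0, goal=3, horizon=3))
```

Every rollout happens on the compact state, so the cost of planning depends on the size of the latent space rather than the size of the observations.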

---

## Latent Space "Imagination"

* [DreamerV3](https://www.nature.com/articles/s41586-025-08744-2) is an approach to reinforcement learning where simulation is done in the latent space

> The algorithm consists of three neural networks: the world model predicts
the outcomes of potential actions, the critic judges the value of each
outcome, and the actor chooses actions to reach the most valuable outcomes.

---

## Thoughts

* High level view
  * World model predictions are in $z$, a discrete encoding of the state
  * Actor-Critic learning is performed on both observed and estimated states
* We may be able to do operations using the latent space without fully understanding it
  * But we could probably do a better job if we had more control over it
* It's something to think about!

<!--
Things to discuss:
The eigenvector-based PCA of feature space to produce segmentation results.
The "Never Give Up" paper's 
Actual compression:
  * Generative Latent Coding for Ultra-Low Bitrate Image Compression
    * https://openaccess.thecvf.com/content/CVPR2024/html/Jia_Generative_Latent_Coding_for_Ultra-Low_Bitrate_Image_Compression_CVPR_2024_paper.html

Trading time (as in compression and mandelbrot) for comprehension:
  * This is the trick done by large reasoning models

The advisor has lots of interesting papers on sequence modeling and state spaces:
* https://scholar.google.com/citations?user=DVCHv1kAAAAJ&hl=en&oi=ao
Basically, he wants to bring back recurrent networks that operate upon very long sequences.
* Also
  * Tiny Recursive Reasoning with Mamba-2 Attention Hybrid
  * https://arxiv.org/abs/2602.12078

Others have combined the Mamba-2 recursive structure into tiny reasoning models

* Deep Spectral Methods: A Surprisingly Strong Baseline for Unsupervised Semantic Segmentation and Localization
  * CVPR 2022
  * https://openaccess.thecvf.com/content/CVPR2022/html/Melas-Kyriazi_Deep_Spectral_Methods_A_Surprisingly_Strong_Baseline_for_Unsupervised_Semantic_CVPR_2022_paper.html
  * https://github.com/lukemelas/deep-spectral-segmentation

-->