# CS 530 - Lecture 24

## World Model Reasoning

Bernhard Firner

2026-04-21

---

## Review

* We've been exploring the idea of a 'world model'
* This typically means a latent representation of the environment
* A latent space should be compressed and more information dense than images
  * It should also be information preserving
  * But how can we be sure that the latent representation is good for learning?

---

## Approaches

* We are going to cover a few examples of this
* We covered the first example; the other two remain
  * Using diffusion
    * Dreamer 4
  * Using JEPA
    * DINO-WM
    * LeWorldModel

---

## Example: Dreamer v4

* DeepMind recently published Dreamer v4, by Hafner, Yan, and Lillicrap
  * [Dreamer 4 video](https://youtu.be/oDlBtTcX0g0?si=64xF-EEQc36XFy7k)
  * [Dreamer 4 site](https://danijar.com/project/dreamer4/)
  * [Dreamer 4 paper](https://arxiv.org/pdf/2509.24527)
* Using a diffusion-based latent representation, the world model is good enough for human play
* Are we satisfied?

---

## Not Satisfied

* From 2025, [DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning](https://arxiv.org/abs/2411.04983)

> Predicting in pixel space, however,
is computationally expensive due to the need for image reconstruction and the overhead of using diffusion models (Ko
et al., 2023). On the other hand, latent-space prediction is
typically tied to objectives of reconstructing images (Hafner
et al., 2019; Micheli et al., 2023; Hafner et al., 2024), which
raises concerns about whether the learned features contain
sufficient information about the task.

---

## Action Agnosticism

> Moreover, many of
these models incorporate reward prediction (Micheli et al.,
2023; Robine et al., 2023; Hafner et al., 2024), or use reward
prediction as auxiliary objective to learn the latent representation (Hansen et al., 2022; 2024), inherently making the
world model task-specific.

* Remember how Dreamer v4 embedded tasks directly into the tokens?
  * That means we can only do tasks that we have labels for

---

## Goal

* A major goal of this work, and a difference from Dreamer v4, is that the world model is general
  * Both for encoding and for action reasoning
* This isn't the JEPA model yet, but we'll get there
* First, where do we get a general world model?

---

## Reconstruction

* Obviously if a latent encoding retains *everything*, then it must be sufficient
  * But how can we be sure that there aren't edge cases that won't work?
  * And if we retain everything (meaning decoding from the latent losslessly reconstructs images), aren't we retaining too much?
* Recall, a goal of working in the latent space was to compress the search space
* Summary: a world model with in-game data could be too large, or could miss edge cases

---

## DINOv2

* First iteration of a general world model uses a self-supervised model
  * [DINOv2: Learning Robust Visual Features without Supervision](https://arxiv.org/abs/2304.07193)
* Begin with a small curated dataset, then add a large uncurated set
  * Feature embeddings are used to detect and remove duplicates from the uncurated set
  * Then clustering is used to select new data similar to, but not a duplicate of, curated data
* Then discriminative learning is used to train a model

---

## Discriminative Training

* DINOv2 uses multiple types of discriminative training
  * Masking
  * Student and teacher, with the teacher being an exponential moving average of the student
  * Other augmentations and losses
* KNN classifiers using the DINOv2 features have 83.5% accuracy classifying ImageNet1k
  * And their PCA features look good too!
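
To make the student/teacher bullet concrete, here is a minimal sketch of an exponential moving average (EMA) update. The function name `ema_update`, the `momentum` value, and the use of plain lists as stand-ins for network weights are all assumptions for illustration, not DINOv2's actual code.

```python
def ema_update(teacher, student, momentum=0.99):
    """Move each teacher weight a small step toward the matching student weight."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


# Toy "weights": plain lists of floats stand in for network parameters.
teacher = [0.0, 0.0]
student = [1.0, -2.0]

# The student would normally change by gradient descent between updates;
# it is held fixed here so the teacher's drift toward it is easy to see.
for _ in range(3):
    teacher = ema_update(teacher, student, momentum=0.5)

print(teacher)  # [0.875, -1.75]
```

The teacher never receives gradients; it only trails the student, which is why it gives a slowly changing target.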

---

## Feature Quality

* Since the model generalizes well, it can be used directly for world model reasoning
* We need more information though:
  * We don't want to decode back into image space
  * So how do actions affect the latent states?

---

## DINO World Models

* Observation model: $z_t \sim \text{enc}_\theta(z_t \mid o_t)$
* Transition model: $z_{t+1} \sim p_\theta(z_{t+1} \mid z_{t-H:t}, a_{t-H:t})$
* Decoder model: $\hat{o}_t \sim q_\theta(o_t \mid z_t)$
  * The decoder is optional, just for humans
  * [https://dino-wm.github.io](https://dino-wm.github.io)

---

## Transition model

* If we want to do *visual representation learning*, then we need action information
* Unlike Dreamer, DINO was not pretrained with actions, so we learn action transitions separately

<div class="col">
<img style="width: 40%" class="r-stretch" src="./figures/DINOWM_fig1left.png" />
<br/>
<small>See Figure 1 of the paper.</small>
</div>

---

## Transition Model Training

* This transition model requires history
  * Since DINO itself has no temporal features
* Feed the transition model both past latent states and actions
* This is trained with causal attention, similar to token prediction in sentences
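
A rough sketch of what "causal attention" means for the history window: step $i$ may look at steps $0..i$ but never at later ones. Treating each time step as a single `(z, a)` token is a simplification, and the helper names `causal_mask` and `visible_history` are hypothetical.

```python
def causal_mask(num_steps):
    """mask[i][j] is True when step i is allowed to attend to step j (j <= i)."""
    return [[j <= i for j in range(num_steps)] for i in range(num_steps)]


def visible_history(tokens, mask, i):
    """Return the (latent, action) tokens that step i is allowed to see."""
    return [tok for tok, allowed in zip(tokens, mask[i]) if allowed]


# Placeholder strings stand in for latent states and actions.
tokens = [("z0", "a0"), ("z1", "a1"), ("z2", "a2")]
mask = causal_mask(len(tokens))

for row in mask:
    print(row)
# [True, False, False]
# [True, True, False]
# [True, True, True]

print(visible_history(tokens, mask, 1))  # [('z0', 'a0'), ('z1', 'a1')]
```

This is the same lower-triangular pattern used for next-token prediction in text, applied to time steps instead of words.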

---

## Zero Shot Planning

* The transition model has past and current information and predicts the future
  * How can we go from a desired future to a current action?
* This seems tricky, especially since we do not know how many actions will be required
* Essentially, this is zero-shot planning, where we need to make a plan with no exploration or additional information

---

## MPC Planning

* DINO-WM solves tasks purely through optimization over the latent vectors
* This will use MPC: Model Predictive Control
  * MPC is a multi-step planning method that minimizes a given cost function
  * The solution yields multiple states and actions that lead to our desired result

---

## Minimization

* Solving MPC means solving for multiple steps of cost and action
  * $x_t$ is the state at time t, $u_t$ is the control at time t
  * $x_{1:T}^\star, u_{1:T}^\star = \mathrm{argmin}_{x_{1:T} \in \mathcal{X}, u_{1:T} \in \mathcal{U}} \sum_{t=1}^T C_t(x_t, u_t)$
  * Subject to $x_{t+1} = f(x_t, u_t)$ and $x_1 = x_{\text{init}}$

---

## Iterative Control

* With gradients, steps look like this:
* $u_{k+1} = u_k - \alpha \nabla_u C$
* The changes to the control inputs are considered as individual steps
  * An entire path of updates up to a time horizon is calculated
  * Then only a few steps are taken and the updates are re-evaluated
* If we have equations describing our environment, and a goal, this can iteratively optimize many problems
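
The loop above can be sketched on a toy problem. Everything here is an assumption chosen for illustration: the 1D dynamics `f`, the quadratic cost and its `effort_weight`, the horizon, and the step size `alpha`. Gradients are taken by central finite differences to keep the example dependency-free; a real system would use automatic differentiation.

```python
def f(x, u):
    """Toy dynamics (an assumption): a 1D point that moves by the control."""
    return x + u


def rollout_cost(x_init, controls, goal, effort_weight=0.1):
    """Sum of per-step costs C_t(x_t, u_t) along the rollout, plus a terminal cost."""
    x, total = x_init, 0.0
    for u in controls:
        total += (x - goal) ** 2 + effort_weight * u ** 2
        x = f(x, u)
    return total + (x - goal) ** 2


def gradient(x_init, controls, goal, eps=1e-5):
    """Central finite-difference gradient of the cost w.r.t. each control."""
    grads = []
    for i in range(len(controls)):
        up = controls[:i] + [controls[i] + eps] + controls[i + 1:]
        down = controls[:i] + [controls[i] - eps] + controls[i + 1:]
        diff = rollout_cost(x_init, up, goal) - rollout_cost(x_init, down, goal)
        grads.append(diff / (2 * eps))
    return grads


def plan(x_init, goal, horizon=5, iters=200, alpha=0.05):
    """Gradient descent on the whole control sequence: u <- u - alpha * grad."""
    controls = [0.0] * horizon
    for _ in range(iters):
        grads = gradient(x_init, controls, goal)
        controls = [u - alpha * g for u, g in zip(controls, grads)]
    return controls


def run_mpc(x_init, goal, steps=10):
    """Plan over the horizon, apply only the first control, then replan."""
    x = x_init
    trajectory = [x]
    for _ in range(steps):
        controls = plan(x, goal)
        x = f(x, controls[0])
        trajectory.append(x)
    return trajectory


trajectory = run_mpc(x_init=0.0, goal=1.0)
print([round(x, 3) for x in trajectory])
```

Replanning after every applied control is what makes this MPC rather than one-shot trajectory optimization: errors in the model get corrected at the next step.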

---

## MPC

* You can find some examples [here](https://locuslab.github.io/mpc.pytorch/)

<div class="col">
<img style="width: 40%" class="r-stretch" src="./figures/mpc-pendulum.gif" />
<br/>
<small>Gym pendulum solved with MPC.</small>
<br/>
<small>Example code at <a href="https://github.com/andreaostuni/mpc.pytorch">https://github.com/andreaostuni/mpc.pytorch</a></small>
</div>

---

## MPC on Latent Spaces

* The cost we want to minimize can be described as the distance between our desired $z^*$ and the current $z$
* And the learned transition model gives us an update equation to go from $z_t$ to $z_{t+1}$
* So we can use MPC to find a set of states and control actions that the model believes will take us to $z^*$
* Examples: [https://dino-wm.github.io](https://dino-wm.github.io)
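
One way to write this down, reusing the MPC notation from the earlier slide with latents $z$ as states and actions $a$ as controls. The terminal-only squared distance and the planning horizon symbol $T$ are assumptions made here for illustration; other cost choices fit the same template.

$$
a_{t:t+T-1}^\star = \mathrm{argmin}_{a_{t:t+T-1}} \left\lVert \hat{z}_{t+T} - z^* \right\rVert_2^2
$$

$$
\text{subject to } \hat{z}_{k+1} \sim p_\theta(\hat{z}_{k+1} \mid \hat{z}_{k-H:k}, a_{k-H:k}), \quad \hat{z}_t \sim \text{enc}_\theta(z_t \mid o_t)
$$

Here $z^*$ is the encoding of the goal observation, and the transition model $p_\theta$ plays the role of the dynamics $f$.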

---

## Better Steps

* That's fantastic
  * But let's talk about this world model
* It isn't always correct in predicting state changes
  * This is a generalization problem; we are always going to encounter unseen states

---

## Contrastive Methods

* This is a problem with contrastive methods
  * The image augmentations and permutations used during training are "exploring" the feature space
* With high dimensional data, it is inevitable that our decision boundary won't be smooth
* So we should prefer a global regularization term

---

## Regularization

* But how can we regularize something like contrastive loss?
* Answer: Don't regularize the loss, regularize the latent vector
  * If the space of $z$ is restricted, that will smooth the loss surface
  * This will make the embedding generalize better, resulting in an improved world model

---

## SIGReg

* [LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels](https://arxiv.org/abs/2603.19312) is similar to the DINO-WM work, but uses a regularized world model
* The chosen regularization is [Sketched-Isotropic-Gaussian Regularizer (SIGReg)](https://arxiv.org/abs/2511.08544)

* $\text{SIGReg}(\mathbf{Z}) \triangleq \frac{1}{M} \sum\limits_{m=1}^{M} T(\boldsymbol{h}^{(m)})$
* These papers are from November 2025 and March 2026, so we're at the bleeding edge
  * As is the [GitHub repo](https://github.com/galilai-group/lejepa)

---

## SIGReg Results

* SIGReg penalizes the model when the latent embeddings do not follow an isotropic Gaussian distribution
* Imagine individual training points that are pushed into the embedding
  * What shape do they form?
* If they are evenly distributed and compact, we should expect a sphere
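
A minimal sketch of the "sketching" idea: project the embeddings $\mathbf{Z}$ onto $M$ random directions, score each 1D projection $\boldsymbol{h}^{(m)}$ with a statistic $T$, and average. The paper defines its own test statistic; the `moment_statistic` below (mean near 0, variance near 1) is a deliberately simple stand-in, and all function names are hypothetical.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a direction uniformly from the unit sphere."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def moment_statistic(h):
    """Stand-in for T: zero when the 1D sample has mean 0 and variance 1."""
    n = len(h)
    mean = sum(h) / n
    var = sum((x - mean) ** 2 for x in h) / n
    return mean ** 2 + (var - 1.0) ** 2


def sketched_penalty(Z, num_directions=64, seed=0):
    """Average the statistic over M random 1D projections of the embeddings."""
    rng = random.Random(seed)
    dim = len(Z[0])
    total = 0.0
    for _ in range(num_directions):
        u = random_unit_vector(dim, rng)
        h = [sum(ui * zi for ui, zi in zip(u, z)) for z in Z]
        total += moment_statistic(h)
    return total / num_directions


rng = random.Random(1)
isotropic = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(2000)]
skewed = [[5.0 * z[0]] + z[1:] for z in isotropic]  # stretch one axis

print(sketched_penalty(isotropic))  # small: every projection looks standard normal
print(sketched_penalty(skewed))     # large: projections along the stretched axis are too wide
```

Checking many 1D projections is much cheaper than testing the full high-dimensional distribution, and a stretched axis shows up in any direction that overlaps it.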

---

## Why Gaussian?

* The [SIGReg](https://arxiv.org/abs/2511.08544) paper has some mathematical arguments
  * Which you can and should read
* But it boils down to Z being best when points are distributed over a sphere
* Let's try to build an intuition for this

---

## Intuition

* Recall the attempt to quantify DNN predictive quality based upon problem dimensionality and similarity to observed data points?
  * Similarity was described as a point lying within a sphere of some radius to another point
* Contrastive methods demand that similar points be close within Z, but other points should be distant

---

## Intuition

* A skewed latent space can lead to large distances along some dimensions
  * Contrastive loss does nothing to stop this
* But large gaps between points make the latent representation worse
  * Empty regions are risky to traverse, as they may not correspond to a well-behaved state-action mapping
* The expected distance between points is minimized if the latent points are mapped into a sphere

---

## Compact Representation

* We wanted a compact representation so that walking through the latents is well-behaved
  * We need this for MPC
* This change in representation means that planning on a SIGRegularized world model is faster and more successful than with DINO-WM

---

## Proof

* The proof of progress is in the results
* [LeWorldModel](https://arxiv.org/abs/2603.19312) takes a SIGReg trained world model and applies it to control tasks
  * This world model is trained with 2 orders of magnitude less data than DINO-WM
* The authors don't report *better* results than with DINO-WM, but they do report less expensive results

---

## Future Directions

* Generally, replicating results with more stability and fewer resources unlocks other approaches
* The ARC-AGI challenge should motivate simplified reasoning models as well
  * If an approach doesn't scale, we have to look for something different
* Long-term planning must require an efficient representation in a latent space
  * SIGReg world models and compression-based puzzle solving are interesting steps in that direction

---

## Project Presentations

* I'll go through an example so you know what I'm asking for

---

## Example Presentation: PilotNet

* Sections
  * Goal
  * Platform and environment
  * Labels
  * Agent & algorithms
  * Simulation
* Despite being janky and 10 years old, this will be more professional than what I expect from you

---

## Goal

* To have a neural network do lane keeping for a car
* The platform is a car with a motorized steering wheel
* The environment is public roads
  * Dynamic, partially observable, multi-agent, adversarial

---

## Data and Labelling

* Video is collected from several cameras attached to a car via suction cups

---

## Data and Labelling

* Steering wheel and speed data are read by a Bluetooth enabled OBDII sensor

<div class="container">
<div class="col">
<img style="width: 40%" class="r-stretch" src="./figures/PilotNetExample/obd_bluetooth.png" />
<br/>
<small>The sensor.</small>
</div>

<div class="col">
<img style="width: 30%" class="r-stretch" src="./figures/PilotNetExample/smartphone_diagnostics.png" />
<br/>
<small>The phone app.</small>
</div>
</div>

---

## Data and Labelling

* The smartphone is used to synchronize times between the videos and the OBDII data
  * The clock is held in front of the cameras
  * Data is manually synchronized later
* Labels are the steering wheel angles and speeds
  * These can be later converted into relative motion between frames

---

## Agent

* The agent learns to drive by predicting the human steering labels
  * That's how it was in the initial system
* For this class, it might be good to train using RL on a simulation
* Let's say that we want to use the data and train an actor critic policy model
  * Rewards for driving smoothly and penalties for leaving the lane/road
  * But how do we simulate?

---

## Augmented Simulation

* We can augment the data with perspective transforms
* This is useful for training, but we can also use this to simulate
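
A perspective transform can be written as a 3x3 homography applied to pixel coordinates. The sketch below shows only that math; the matrices `shift` and `tilt` are made-up examples, not the transforms used in the original system, and the helper `apply_homography` is hypothetical.

```python
def apply_homography(H, point):
    """Map a pixel (x, y) through a 3x3 homography using homogeneous coordinates."""
    x, y = point
    xh = H[0][0] * x + H[0][1] * y + H[0][2]
    yh = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return (xh / w, yh / w)


# A pure sideways shift of 10 pixels, as if the camera were mounted off-center.
shift = [
    [1.0, 0.0, 10.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

# A small non-zero bottom-row entry gives a true perspective change:
# points farther to the right are scaled down more.
tilt = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.001, 0.0, 1.0],
]

print(apply_homography(shift, (100.0, 50.0)))  # (110.0, 50.0)
print(apply_homography(tilt, (100.0, 50.0)))
```

Warping every pixel of a frame this way yields a view from a slightly different position or heading, which is what lets recorded video stand in for a simulator.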

---

## Augmented Resimulation

<div class="container">
<div class="col">
<img style="width: 70%" class="r-stretch" src="./figures/PilotNetExample/ximulator1.png" />
<br/>
<img style="width: 70%" class="r-stretch" src="./figures/PilotNetExample/ximulator3.png" />
<br/>
<small>The original video is in the upper left.</small>
</div>

<div class="col">
<img style="width: 70%" class="r-stretch" src="./figures/PilotNetExample/ximulator2.png" />
<br/>
<img style="width: 70%" class="r-stretch" src="./figures/PilotNetExample/ximulator4.png" />
<br/>
<small>The wheel shows predicted steering.</small>
</div>
</div>

---

## Videos!

* I like videos!

<!--

Concepts from:

Learning and Leveraging World Models in Visual Representation Learning
https://arxiv.org/abs/2403.00504
2024

DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning
https://arxiv.org/abs/2411.04983
2025

stable-worldmodel-v1: Reproducible World Modeling Research and Evaluation
https://arxiv.org/abs/2602.08968
2026

LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels
https://arxiv.org/abs/2603.19312

Then a sample project presentation based upon PilotNet

https://locuslab.github.io/mpc.pytorch/
https://github.com/andreaostuni/mpc.pytorch

https://deepwiki.com/locuslab/mpc.pytorch
https://en.wikipedia.org/wiki/Model_predictive_control

-->