# CS 530 - Lecture 23

## World Model Reasoning

Bernhard Firner

2026-04-21

---

## World Models

* We've been discussing the idea of a world model
* The latent space becomes a model of transitions in the agent's environment
  * If the world model can predict state changes from actions, it should allow for planning
* That's the theory
* This idea dates back to at least the 90s, from Schmidhuber
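
As a toy illustration of the planning claim above, the sketch below assumes a hand-written one-dimensional transition function (`predict_next`, a hypothetical stand-in for a learned world model) and searches action sequences entirely inside that model. The action set, the clipping range, and the exhaustive search are all assumptions chosen to keep the example small.

```python
from itertools import product

ACTIONS = (-1, 0, 1)  # hypothetical discrete actions: move left, stay, move right


def predict_next(state, action):
    """Stand-in for a learned world model: predicts the next state."""
    # Assumption: positions are clipped to the range [0, 10].
    return max(0, min(10, state + action))


def plan(start, goal, horizon):
    """Search action sequences using only model predictions, never the real world."""
    best_seq, best_dist = None, float("inf")
    for seq in product(ACTIONS, repeat=horizon):
        state = start
        for action in seq:
            state = predict_next(state, action)
        dist = abs(goal - state)
        if dist < best_dist:
            best_seq, best_dist = seq, dist
    return best_seq, best_dist


sequence, distance = plan(start=2, goal=5, horizon=4)
print(sequence, distance)
```

Exhaustive search only works for tiny action spaces and short horizons, which is one reason the size of the latent state matters so much later in the lecture.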

---

## Modern Interest

* Ha and Schmidhuber published another paper, called [World Models](https://arxiv.org/abs/1803.10122) in 2018

> We explore building generative neural network models of popular reinforcement learning environments. Our world model can be trained quickly in an unsupervised manner to learn a compressed spatial and temporal representation of the environment. By using features extracted from the world model as inputs to an agent, we can train a very compact and simple policy that can solve the required task. We can even train our agent entirely inside of its own hallucinated dream generated by its world model, and transfer this policy back into the actual environment.

---

## Interactive Demo

* [https://worldmodels.github.io](https://worldmodels.github.io)
* We are going to play with this for a moment
  * Pay attention to the latent spaces

---

## 8 Years Old?

* If this idea works so well (and makes such cool visualizations) why didn't it take off 8 years ago?
* The technique used trains two networks
  * First, train a large network to model the world (unsupervised)
  * Second, train a small controller to perform tasks using the world model (the paper optimizes it for reward with an evolution strategy, CMA-ES, not with supervised labels)
* Straightforward, so why didn't this displace other ideas immediately?

---

## Memory and Recurrence

* First, it uses RNNs
* Predicts the next latent vector: $P(z_{t} | a_{t-1}, z_{t-1}, h_{t-1})$

<div class="col">
<img style="width: 60%" class="r-stretch" src="./figures/WorldModels_Figure4_world_model_overview.svg" />
<br/>
<small>Read the <a href="https://arxiv.org/abs/1803.10122">World Models paper</a>, see Figure 4.</small>
<br/>
<small>See <a href="https://github.com/worldmodels/worldmodels.github.io">the author github</a> for code, figures, and license.</small>
</div>

---

## Function Inputs

* The RNN modules need three inputs
  * $h_{t-1}$, the hidden state from the previous time step
  * $a_{t-1}$, the previous action
  * $z_{t-1}$, the latent representation of the previous state
* They output two things
  * The probability distribution of $z$
  * The new hidden state
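
A minimal sketch of one recurrent step with these inputs and outputs. The paper uses an LSTM with a mixture-density output; this version swaps in a single tanh layer and one diagonal Gaussian to stay short. `TinyMemoryModel`, `matvec`, `random_matrix`, and all sizes are hypothetical, and the weights are random rather than trained.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng, scale=0.1):
    """Small random weights, standing in for trained parameters."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class TinyMemoryModel:
    """Maps (z, a, h) from the previous step to a distribution over z and a new h."""

    def __init__(self, z_dim, a_dim, h_dim, seed=0):
        rng = random.Random(seed)
        in_dim = z_dim + a_dim + h_dim
        self.w_h = random_matrix(h_dim, in_dim, rng)          # recurrent update
        self.w_mu = random_matrix(z_dim, h_dim, rng)          # mean of z
        self.w_log_sigma = random_matrix(z_dim, h_dim, rng)   # log std of z

    def step(self, z_prev, a_prev, h_prev):
        x = list(z_prev) + list(a_prev) + list(h_prev)
        h_new = [math.tanh(v) for v in matvec(self.w_h, x)]
        mu = matvec(self.w_mu, h_new)
        sigma = [math.exp(v) for v in matvec(self.w_log_sigma, h_new)]
        return (mu, sigma), h_new


model = TinyMemoryModel(z_dim=4, a_dim=2, h_dim=8)
h = [0.0] * 8
z = [0.1, -0.2, 0.3, 0.0]
for action in ([1.0, 0.0], [0.0, 1.0]):
    (mu, sigma), h = model.step(z, action, h)
    z = mu  # feed the predicted mean back in; a sampled z could be used instead
print(len(mu), len(sigma), len(h))
```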

---

## Controller

* The controller predicts actions
  * $a_t = W_c[z_t h_t] + b_c$
  * Just a linear projection
* Notice that they use more than just the latent vector for action prediction
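
The linear projection can be written out directly. This is a sketch with made-up sizes and weights (a 2-D latent, a 3-D hidden state, and two action outputs), where `[z_t h_t]` is read as concatenation.

```python
def controller(z, h, w_c, b_c):
    """Linear controller: a = W_c [z h] + b_c, where [z h] is concatenation."""
    features = list(z) + list(h)
    return [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(w_c, b_c)
    ]


# Hypothetical values, chosen only to make the arithmetic easy to follow.
z_t = [1.0, 2.0]
h_t = [0.5, 0.0, -1.0]
W_c = [
    [1.0, 0.0, 0.0, 0.0, 1.0],
    [0.0, 0.5, 2.0, 0.0, 0.0],
]
b_c = [0.1, -0.1]

a_t = controller(z_t, h_t, W_c, b_c)
print(a_t)
```

The controller has only `len(a) * (len(z) + len(h) + 1)` parameters, which is what makes it cheap to optimize.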

---

## Improvements

* The approach still feels modern
* Although some components feel outdated
* The latent vector encoder, V, is trained purely on static frames
  * This means it cannot encode the full state, including things like velocity
  * Also means that it could be "too big", including features that aren't task-relevant
* This makes memory (in the form of a hidden state, h) a required input

---

## Dream Training

* The authors trained a controller to play VizDoom purely using $z$ predictions
  * Basically, the memory model served as the simulator
* Did that transfer into the real game engine?
  * Yes, with the agent surviving an average of 1100 time steps

---

## Problems?

* One problem: the agent learned to exploit mistakes in the $z$ prediction
* Basically, it found a way to move that made the predictor stop producing fireballs
* The fix is to have the predictor output a distribution of $z$
  * This makes "hacks" unreliable and discourages the agent from attempting them
* That addresses some parts of $z$ prediction, but not all of them
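
A rough sketch of why a stochastic predictor makes exploits less reliable: if the next $z$ is sampled rather than fixed, the same action no longer leads to the same imagined outcome every time. The single Gaussian, the `temperature` parameter, and the rule that it scales the variance are assumptions for illustration, not details taken from the slides.

```python
import math
import random
import statistics


def sample_next_z(mu, sigma, temperature, rng):
    """Sample a next latent from a diagonal Gaussian, widened by a temperature."""
    scale = math.sqrt(temperature)  # assumption: temperature scales the variance
    return [rng.gauss(m, s * scale) for m, s in zip(mu, sigma)]


rng = random.Random(0)
mu, sigma = [0.0], [1.0]  # hypothetical predictor output for a 1-D latent

spread = {}
for temperature in (0.1, 1.0):
    samples = [sample_next_z(mu, sigma, temperature, rng)[0] for _ in range(2000)]
    spread[temperature] = statistics.pstdev(samples)
    print(temperature, round(spread[temperature], 2))
```

A higher temperature gives a wider spread of imagined outcomes, so a policy that depends on one precise prediction error is rewarded less consistently.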

---

## Z State

* From [Generative Latent Coding for Ultra-Low Bitrate Image Compression](https://openaccess.thecvf.com/content/CVPR2024/html/Jia_Generative_Latent_Coding_for_Ultra-Low_Bitrate_Image_Compression_CVPR_2024_paper.html) we saw that $z$ can store the state of an image with high fidelity
* Is that the kind of $z$ that we need for RL?
  * Maybe not. We wanted something with a small enough state space to make exploration feasible
* Not only that, we also need $z$ to work generally well across all scenes that will be encountered
  * How can we guarantee that?

---

## Encoder Training

* The encoder in this work was trained from random exploration *offline* from the controller
* That provides a good dataset of states and actions for CarRacing and VizDoom
  * But for Minecraft, will random actions ever take us anywhere near a diamond?
  * I had to watch some YouTube videos to figure that out, and the answer is no
* Getting a good latent representation is vital to solving hard problems

---

## Multiple Options

* The latest papers with the most interesting results all compress a state into z differently
  * Using diffusion
    * Dreamer 4
  * Using JEPA
    * DINO-WM
    * LeWorldModel

---

## Similarities

* All of these approaches use offline world models
  * Why?
* If latent compression is learned with specific tasks, it won't be able to generalize
  * So training should happen without task-specific inputs
  * That enables offline training

---

## Dreamer Mines Diamonds

* We looked at this super briefly
* [DreamerV3](https://www.nature.com/articles/s41586-025-08744-2)

> The algorithm consists of three neural networks: the world model predicts
the outcomes of potential actions, the critic judges the value of each
outcome, and the actor chooses actions to reach the most valuable outcomes.

---

## More Dreaming

* DeepMind recently published Dreamer v4
  * [Dreamer 4 video](https://youtu.be/oDlBtTcX0g0?si=64xF-EEQc36XFy7k)
  * [Dreamer 4 site](https://danijar.com/project/dreamer4/)
  * [Dreamer 4 paper](https://arxiv.org/pdf/2509.24527)
* Now the policy and reward estimation are combined into one network

---

## Dreamer v4

* This model learns *only* from offline simulation within its world model
* Motivation? No serious group would train a million-dollar robot or autonomous vehicle with online RL
* To trust only offline data, the world model in Dreamer v4 must be fantastic, remaining consistent from frame to frame
  * What else needs that level of consistency?
  * Video generation with diffusion models!

---

## Practicalities

* Diffusion used to be slow, but some modern techniques (forcing functions) make it suitable for real-time
* In general, diffusion models iteratively remove noise from a degraded input until an image is produced
* For Dreamer, the conversion is from a corrupted frame to the current one, where the previous frame is mixed with noise
  * $x_\tau = (1 - \tau) x_0 + \tau x_1$, $x_0 \sim \mathcal{N}(0, I)$
* Training has a simple loss: $\mathcal{L}(\theta) = \|f_\theta(x_\tau, \tau) - (x_1 - x_0)\|^2$
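
The interpolation and loss can be checked numerically. In this sketch `x1` is a made-up three-value "frame", `x0` is sampled noise, and the network $f_\theta$ is replaced by hand-written predictions (hypothetical), so only the formulas themselves are being exercised.

```python
import random


def corrupt(x0, x1, tau):
    """Interpolate between noise x0 (tau = 0) and the clean frame x1 (tau = 1)."""
    return [(1.0 - tau) * n + tau * c for n, c in zip(x0, x1)]


def flow_loss(prediction, x0, x1):
    """Squared error between the predicted and true velocity x1 - x0."""
    return sum((p - (c - n)) ** 2 for p, n, c in zip(prediction, x0, x1))


rng = random.Random(0)
x1 = [0.2, 0.8, 0.5]                      # hypothetical clean frame
x0 = [rng.gauss(0.0, 1.0) for _ in x1]    # noise sample
tau = 0.25
x_tau = corrupt(x0, x1, tau)

perfect = [c - n for n, c in zip(x0, x1)]  # what an ideal f_theta would output
print(flow_loss(perfect, x0, x1))          # zero for the ideal prediction
print(flow_loss([0.0] * 3, x0, x1))        # positive for a poor prediction

# Following the velocity from x_tau for the remaining (1 - tau) recovers x1.
recovered = [x + (1.0 - tau) * v for x, v in zip(x_tau, perfect)]
print(recovered)
```

The last step shows why predicting the velocity $x_1 - x_0$ is enough to generate a frame: stepping along it from any $x_\tau$ lands back on the clean data.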

---

## Training Phases

* Training looks similar to something like BERT
* Begin by training a tokenizer using masked autoencoding loss
* Train the world model on the tokenized videos
* Train policy and reward heads, but continue to train video reconstruction
  * This prevents degradation in encoding performance, or loss of capabilities

---

## Task Encoding

* A subset of the training data has task information embedded
  * Task meaning the current goal the player was working towards
* This is similar to positional information embedded into language models
  * Each task is added to the token embeddings
  * If no task is present, no encoding is added
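
A minimal sketch of the additive task encoding described above. The task names, the embedding table, and the vector sizes are hypothetical; in a real model the task vectors would be learned.

```python
def add_task_embedding(tokens, task_id, task_table):
    """Add a task vector to every token; leave tokens unchanged if task_id is None."""
    if task_id is None:
        return [list(tok) for tok in tokens]
    task_vec = task_table[task_id]
    return [[t + e for t, e in zip(tok, task_vec)] for tok in tokens]


# Hypothetical 2-token sequence with 3-D embeddings and two known tasks.
tokens = [[0.1, 0.2, 0.3], [1.0, 1.0, 1.0]]
task_table = {"mine_log": [1.0, 0.0, 0.0], "craft_table": [0.0, 1.0, 0.0]}

with_task = add_task_embedding(tokens, "mine_log", task_table)
without_task = add_task_embedding(tokens, None, task_table)
print(with_task)
print(without_task)
```

Because the encoding is additive and optional, the same network can be trained on both labeled and unlabeled video.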

---

## RL

* After all of that, Dreamer can use RL in the imagined space to learn
* Rollouts begin from context taken from the original dataset
  * With each context sampled a single time for data diversity
* Mouse actions are treated as categorical, discretized into 121 actions
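
One way to arrive at 121 categories is an 11 x 11 grid over the two mouse axes. The sketch below assumes exactly that, with uniform bins and a made-up maximum movement (`MAX_DELTA`); the actual binning scheme may differ.

```python
BINS_PER_AXIS = 11   # assumption: an 11 x 11 grid gives the 121 categories
MAX_DELTA = 10.0     # hypothetical largest mouse movement per step


def axis_bin(delta):
    """Map one continuous mouse delta to a bin index in [0, BINS_PER_AXIS - 1]."""
    clipped = max(-MAX_DELTA, min(MAX_DELTA, delta))
    fraction = (clipped + MAX_DELTA) / (2.0 * MAX_DELTA)
    return min(BINS_PER_AXIS - 1, int(fraction * BINS_PER_AXIS))


def mouse_to_category(dx, dy):
    """Combine the two per-axis bins into a single categorical action id."""
    return axis_bin(dx) * BINS_PER_AXIS + axis_bin(dy)


print(mouse_to_category(0.0, 0.0))      # centre bin on both axes
print(mouse_to_category(-10.0, -10.0))  # smallest id
print(mouse_to_category(10.0, 10.0))    # largest id
```

Treating the result as a single categorical variable lets the policy use an ordinary softmax head instead of regressing continuous mouse deltas.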

---

## Diamond Results

* Gets diamonds in 0.7% of cases with 1 hour episodes!
* Apparently it takes humans around 20 minutes to mine a diamond
  * Involves mouse and keyboard sequences of around 24,000 actions
* Tasks are embedded, like positional encodings, into the DNN's tokens

---

## Reconstruction Quality

* How good are the image reconstructions?
  * Good enough that a human can play it

<div class="col">
<img style="width: 60%" class="r-stretch" src="./figures/DreamerV4_Fig5.png" />
<br/>
<small>See Figure 5 of the <a href="https://arxiv.org/pdf/2509.24527">Dreamer V4 paper</a>.</small>
</div>

---

## Differences with Dreamer

* Can't we train our encoding without task-specific information?
* Why should the encoding function have anything to do with image reconstruction?
  * This implies that the model must retain visual information
  * But is all of that necessary?

---

## JEPA

* In 2022 LeCun wrote [A Path Towards Autonomous Machine Intelligence](https://openreview.net/pdf?id=BZ5a1r-kVsf&utm_source=pocket_mylist)
* It summarized the state of the field at that time, and listed some wants
* Advocates for more general world models
  * Notice that Dreamer needed to be trained with task embeddings
  * What if we didn't want to do that?

---

## Other Wants

* LeCun also suggested a different baseline for the world model embeddings
* Using regularization methods for training energy-based models
    * Those are models that encode and decode states
* He advocated for Joint Embedding Predictive Architecture (JEPA)

---

## Types of Self-Supervision

* We have multiple choices for types of self-supervision
* Unfortunately, the simplest ones have stability problems

<div class="col">
<img style="width: 60%" class="r-stretch" src="./figures/PathTowardsAMI_Fig10.png" />
<br/>
<small>See Figure 10 of <a href="https://openreview.net/pdf?id=BZ5a1r-kVsf&utm_source=pocket_mylist">A Path Towards Autonomous Machine Intelligence</a>.</small>
</div>

---

## Contrastive Methods

* Contrastive methods usually involve doing some operation to an image and asking a DNN to conserve its embedding
  * e.g. $F(x, \hat{y})$ should be similar to $F(x, y)$
  * The contrast is between x and y, and similarity between y and $\hat{y}$
  * $L(w, x, y, \hat{y}) = \left[ F_w(x, y) - F_w(x, \hat{y}) + \mu \|y - \hat{y}\|^2 \right]^+$
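
The hinge loss above can be evaluated by hand. This sketch assumes a hypothetical energy function (plain squared distance, with no learned weights) so that the behaviour of the margin term $\mu \|y - \hat{y}\|^2$ is easy to see.

```python
def energy(x, y):
    """Hypothetical energy F_w(x, y): squared distance, low when x and y agree."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def contrastive_hinge(x, y, y_hat, mu):
    """[F(x, y) - F(x, y_hat) + mu * ||y - y_hat||^2]^+"""
    margin = mu * sum((a - b) ** 2 for a, b in zip(y, y_hat))
    return max(0.0, energy(x, y) - energy(x, y_hat) + margin)


x = [0.0, 0.0]
y = [0.1, 0.0]      # observed sample paired with x
far = [2.0, 0.0]    # contrastive sample whose energy is already much higher
near = [0.0, 0.1]   # contrastive sample with the same energy as y

loss_far = contrastive_hinge(x, y, far, mu=0.5)
loss_near = contrastive_hinge(x, y, near, mu=0.5)
print(loss_far, loss_near)
```

The loss is zero once $F_w(x, \hat{y})$ exceeds $F_w(x, y)$ by at least the margin, so only contrastive samples that are not yet separated contribute a gradient.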

---

## Contrastive Problems

* We are picking some transformation on $y$
  * Is that arbitrary? Is one transformation better suited to some tasks than another?
* Are we actually filling the dimensionality of the data points?
  * If we aren't, then the contrastive loss will leave bad decision boundaries unexplored
  * This is almost inevitable with high dimensional image data

---

## Regularization

* The solution to ill-formed decision boundaries should be regularization
  * But how can we regularize something like contrastive loss?
* Answer: Don't regularize the loss, regularize the latent vector
  * If the space of $z$ is restricted, that will smooth the loss surface
  * This will make the embedding generalize better, resulting in an improved world model

---

## Arbitrary Goals

* That improved embedding space enables something really interesting
* Dreamer had to be trained with specific tasks
* But if we want a truly general world model, it won't have all tasks encoded
  * What is the alternative?
* How about just showing the DNN a picture of what you want?

---

## DINO-WM

* From 2025, [DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning](https://arxiv.org/abs/2411.04983)
* DINO-WM solves tasks purely through optimization over the latent vectors
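
A toy sketch of goal-conditioned planning in a latent space: embed the goal observation, imagine action sequences with a latent dynamics model, and keep the sequence that ends nearest the goal embedding. Everything here is a stand-in: `encode` and `latent_dynamics` are hypothetical hand-written functions rather than pretrained networks, and plain random shooting is swapped in as the optimizer for brevity.

```python
import random


def encode(observation):
    """Hypothetical frozen encoder: here the 'image' is already a 2-D point."""
    return list(observation)


def latent_dynamics(z, action):
    """Hypothetical learned predictor of the next latent given an action."""
    return [zi + 0.5 * ai for zi, ai in zip(z, action)]


def plan_to_goal(z_start, z_goal, horizon, candidates, rng):
    """Random shooting: keep the sequence whose final latent is nearest the goal."""
    best_cost, best_actions = float("inf"), None
    for _ in range(candidates):
        actions = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(horizon)]
        z = z_start
        for action in actions:
            z = latent_dynamics(z, action)
        cost = sum((a - b) ** 2 for a, b in zip(z, z_goal))
        if cost < best_cost:
            best_cost, best_actions = cost, actions
    return best_actions, best_cost


rng = random.Random(0)
z_start = encode([0.0, 0.0])
z_goal = encode([1.0, -1.0])  # the "picture of what you want", embedded
actions, cost = plan_to_goal(z_start, z_goal, horizon=4, candidates=500, rng=rng)
initial_cost = sum((a - b) ** 2 for a, b in zip(z_start, z_goal))
print(cost < initial_cost)
```

No reward function or task label appears anywhere: the goal embedding alone defines the objective, which is what makes this style of planning zero-shot.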

<div class="col">
<img style="width: 40%" class="r-stretch" src="./figures/DINOWM_fig1left.png" />
<br/>
<small>See Figure 1 of the paper.</small>
</div>

---

## Next Topic

* Advanced reasoning in the latent space will be our last topic
  * Next week will be reviews and presentations
* I'll send out feedback and presentation information tonight
* Each group should tell me which day they want to present next week

<!--
Goal for lecture 23:
Just go over why training on the pixels isn't necessary
  * But why did they do it in the first place?
  * Because z can be meaningless?

But how do we get the encoder?
* If pretraining, it will learn components uncorrelated with actions, making search more difficult

* If training, how do we know that the embedding is complete?

Papers to discuss:

The advisor has lots of interesting papers on sequence modeling and state spaces:
* https://scholar.google.com/citations?user=DVCHv1kAAAAJ&hl=en&oi=ao
Basically, he wants to bring back recurrent networks that operate upon very long sequences.
* Also
  * Tiny Recursive Reasoning with Mamba-2 Attention Hybrid
  * https://arxiv.org/abs/2602.12078

Others have combined the Mamba-2 recursive structure into tiny reasoning models

* [DreamerV3](https://www.nature.com/articles/s41586-025-08744-2) is an approach to reinforcement learning where simulation is done in the latent space

Deep learning, reinforcement learning, and world models
https://dl.acm.org/doi/abs/10.1016/j.neunet.2022.03.037
https://www.sciencedirect.com/science/article/pii/S0893608022001150?via%3Dihub
2022

Learning and Leveraging World Models in Visual Representation Learning
https://arxiv.org/abs/2403.00504
2024

DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning
https://arxiv.org/abs/2411.04983
2025

stable-worldmodel-v1: Reproducible World Modeling Research and Evaluation
https://arxiv.org/abs/2602.08968
2026

LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels
https://arxiv.org/abs/2603.19312

-->