# CS 530 - Lecture 09

## Evaluating Decisions

Bernhard Firner

2026-02-19

---

## Reading

* Chapters 15 and the beginning of 16 today
* And some papers at the end of today's lecture

---

## Decisions

* So far, we've been talking about how an agent can estimate its environment
  * Is the coin being flipped fair?
  * Where am I?
* Once the agent has knowledge of the environment, what should it do?

---

## Actions and Outcomes

* Obviously we need a desired outcome and a set of actions to achieve it
  * This means the return of execution monitoring
* But first, how can we rank different outcomes?
  * Is there an all or nothing result? Or multiple?

---

## Expected Utility

* If we assign a value to every state it simplifies decisions
  * Now we can build a function, $U(s)$, that gives a number to any future state, $s$
* There are multiple possible outcomes to any action, so we enumerate them
  * This is the *expected utility* of any action
  * $EU(a) = \Sigma_{s'}P(Result(a)=s')U(s')$

---

## Quantifying Utility

* Do we need to quantify every state?
  * That sounds difficult!
* What if we can only express preference and indifference?
  * This is a much more realistic case
* And it means our agent only needs to quantify a small set of states

---

## Preferences

* A is better than B: $A \succ B$
* A may be better or the same as B: $A \succsim B$
* There is no preference between A and B: $A \sim B$
* Now we don't need to quantify outcomes, but can still compare them

---

## Axioms of Utility Theory

* We can formalize this system with some rules
* **Orderablity**: The agent must make a decision
  * $A \succ B$, $A \succsim B$, or $A \sim B$
* **Transitivity**: This enforces consistency
  * $(A \succ B) \land (B \succ C) \Longrightarrow (A \succ C)$

---

## Axioms of Utility Theory

* **Continuity**: if $(A \succ B \succ C)$, an action that yields A or C could be equivalent to an action yielding B
  * $(A \succ B \succ C) \Longrightarrow \exists p [p,A; 1-p,C] \sim B$
* **Substitutability**: Even in a complex set of outcomes, if the only difference is swapping our A for B and $A \sim B$, then we are indifferent to the outcomes
  * $(A \sim B) \Longrightarrow [p,A; 1-p,C] \sim [p,B; 1-p,C]$

---

## Axioms of Utility Theory

* **Monotonicity**: if $A \succ B$, then an action that has higher likelihood of A is preferred
  * $(A \succ B) \Longrightarrow (p > q \Leftrightarrow [p,A; 1-p,B] \succ [q,A; 1-q,B]$
* **Decomposability**: Compound outcomes can be reduced to simpler ones
  * $[p,A; 1-p, [q,B; 1-q,C] \sim [p,A; (1-p)q,B; (1-p)(1-q),C]$

---

## Importance

* If our agent does not follow these axioms, it will have some irrational behavior
* Example: An AI vending machine has goods A, B, and C
  * If $(A \succ B \succ C \succ A)$ then we can cheat the AI
  * Trade $C$ for $A + \epsilon$ money, then trade $A$ for $B + \epsilon$ money, etc
  * Soon the vending machine is broke

---

## Implication

* AI agents do not need a function that can rate every possible state
  * Only need a function that can express preference or indifference
* Furthermore, if we do simplify things by assigning a value, the actual number assigned is arbitrary
  * Because $U'(S) = aU(S) + b$ would not change any ordering of preferences
* The function our agent uses is called its **value function** or **ordinal utility function**

---

## In Practice

* Can a human fill in arbitrary values for all states?
  * No
* So instead we often give values to final states and deduce the values of the previous ones
  * This trick is used advanced techniques, like reinforcement learning
* We can also train a neural network to compare states, without ever seeking justification or quantification

---

## Realistic Estimates

* The most logical approach is to use known probabilities
* But there is a risk to using sample statistics in decisions
  * Expectations ignore variance
* This is called the optimizer's curse

---

## Optimizer's Curse

* Which treatment is better
  * Treatment A cured 9 out of 10 patients
    * Best from a trial of 100 different options tested on 1000 people
  * Treatment B cured 800 out of 1000 patients in a standalone trial
* Expectations say pick A, but we are less likely to be disappointed by B

---

## Bayes

* An agent (or person) would be better off using a Bayesian approach
  * Estimate the utility $P(\widehat{EU}|EU)$ considering errors and variance
* When we insert ML (mostly DNNs) into our utility estimates, remember that they are basically using frequentist statistics
  * If that is not explicitly considered, expect them to be data hungry

---

## Information Gathering

* Sometimes (often!) we lack information
  * But we are an agent: we can collect more
  * What is the value of that?
* At first, this seem unquantifiable without knowing the information
  * But we can enumerate states and probabilities to estimate its worth

---

## Example

* Suppose there are $n$ seemingly identical Gashopon machines
  * We have heard that one of them has a super rare prize, worth $C$ dollars
  * And the rest are worthless

</div>
<div class="col">

<img style="width: 100%" class="r-stretch" src="./figures/Gashapon_machines_(Hong_Kong)_by_Mk2010.jpg" />
<small>Credit: <a href="https://commons.wikimedia.org/wiki/File:Gashapon_machines_(Hong_Kong).jpg">Mk2010, wikimedia</a></small>

</div>
</div>

---

## Example

* It costs $C/n$ dollars to empty a machine
  * So emptying all of the machines will break even
* So this isn't worth our time

</div>
<div class="col">

</div>
</div>

---

## Example

* Now let's say that there is a child there who watched one of the machines being filled
  * They don't know what we're looking for, but they can confirm if it is in machine A
* How much should we pay them for this information?

</div>
<div class="col">

</div>
</div>

---

## The Value of Information

* There are two possibilities: A has the rare prize, or it doesn't
  * And we already know that the rare prize probability is $1/n$
* If we learn that A has the rare prize we will empty it for $C/n$ and sell the prize for $C$
  * A profit of $C - C/n = (n-1)C/n$

---

## The Value of Information

* With probability $(n - 1)/n$ there will be no prize in A
  * Now the other machines have a $1/(n-1)$ chance to have the prize, but still cost C to empty
* Emptying a different machine has expected value $C/(n-1) - C/n$
  * $\frac{Cn}{n(n-1)} - \frac{C(n-1)}{n(n-1)} = \frac{C}{n(n-1)}$

---

## Enumerating possibilities

* $\frac{1}{n}\frac{(n-1)C}{n} + \frac{n-1}{n}\frac{C}{n(n-1)} = \frac{(n-1)C}{n^2}+\frac{C}{n^2} = C/n$
* So we should be willing to pay the observant, gashapon-obsessed child something less than C/n

---

## General Formula

* If we have all information, choosing an action is easy
* We want the expected utility of the best action, $\alpha$
  * $EU(\alpha) = \underset{a}{\mathrm{max}}\Sigma_{s'}P(Result(A)=s')U(s')$
* In words, pick the action that maximizes the future state utility

---

## Imperfect Information

* Lacking information, we add in every possible outcome
* From this we can estimate the VPI for new evidence, E
  * Value of Perfect Information
* $VPI(E_j) = (\Sigma_{e'}P(E_j = e_j)EU(\alpha_{ej}|E_j=ej)) - EU(\alpha)$
* In words, the value of information is how much it could change the utility of our current best action

---

## Sequential Decision Problems

* Let's work towards a practical example
* Let's say our agent is in a 4x3 room
  * There are two goals, worth $+1$ and $-1$
* The agent can move up, down, left, or right

</div>
<div class="col">

</div>
</div>

---

## Sequential Decision Problems

* If the environment is fully observable and there is nothing random this is easy
* So let's say that the agent's movement is stochastic
  * with p=0.8 the agent moves as desired
  * with p=0.1 it moves at one of the right angles

</div>
<div class="col">

</div>
</div>

---

## Sequential Decision Problems

* A previous solution, [up, up, right, right, right] now only succeeds with probability $0.8^5 \approx 0.33$
* It could also go [right, right, up, up, right] with negligible probability $0.1^4 0.8 = $

</div>
<div class="col">

</div>
</div>

---

## Evaluating Rewards

* We want to pick a best strategy given the robot's current state
* The two end positions already have their rewards, what about the other squares?
  * We need to make something up that encourages the robot to eventually exit
* We'll pick a reward of -0.04 for all transitions that aren't an exit

---

## Why Negative?

* The negative reward will encourage the robot to exit
* The total utility will be the length of the path multiple by -0.04 plus the exit reward
* This combination of sequential decisions, stochastic state transitions, and additive rewards is a Markov decision process

---

## Markov Decision Processes

* We can solve a MDP with a dynamic programming solution
  * Similar to the Viterbi algorithm
* Once found, the optimal policy is called $\pi^*$
* Given a current state, $\pi^*(s)$ gives the best action

---

## Solutions

</div>

---

## Time Horizons

* The solution was for an *infinite horizon*
  * So when $r > 0$ the robot would wander forever
* But we could have made decisions over a *finite horizon*
  * This this case, any states reached past the time horizon have 0 reward
* This alters our behavior; with little time left, the agent must take the risky right path

---

## Discounting

* Most tasks are considered with infinite time horizons
* But we also discount future rewards
* So we often apply a discount to future reward:
  * $U_i = \gamma^0R_i(a_i) + \gamma R_{i+1}(a_{i+1}) + \gamma^2R_{i+2}(a_{i+2})$
  * $\gamma$ is the discount factor

---

## Why Discount?

* The future is often unreliable
  * We have a current estimate of a reward
  * Will it remain true?
* So our agents should prefer more immediate rewards
* This also fixes some disastrous math on the road to infinity

---

## More Reasons

* Most tasks involve some terminal states
  * If that is the case, we shouldn't look for rewards out to infinity
* We can work around infinities by finding average rewards
  * Such as the stationary state in Markov Models
  * But usually this is difficult

---

## Maximizing Utility

* Now we can give an expected utility to each state
* $U^\pi(s) = E[\Sigma_{t=0}^\infty\gamma^tR(S_t,\pi(S_t),S_{t+1})]$
* Our policy comes from the actions that maximize the rewards of the next state and subsequent states
* $\pi^*(s) = \underset{a\in A(s)}{\mathrm{argmax}}\Sigma_{s'}P(s'|s,a)[R(s,a,s')+\gamma U(s')]$

---

## Bellman Equation

* The utility of a state is the expected transition reward, plus the discounted utility of the next state, assuming the agent chooses the optimal action
  * $U(s) = \underset{a\in A(s)}{\mathrm{max}}\Sigma_{s'}P(s'|s,a)[R(s,a,s') + \gamma U(s')]$
* This is named the Bellman Equation

---

## 4x3 Rewards

* Now we can find the utility of the 4x3 world states
* There are four actions in U(1,1), so evaluate all four to solve
  * Of course this requires solving for U(1,2) and U(2,1)
  * And so on, so we'll need to fill in each state, to a horizon where $\gamma^n < \epsilon$

</div>
<div class="col">

</div>
</div>

---

## Q-Function

* Also called the action-utility function
* This is the expected utility of an action in a given state
  * $U(s) = \underset{a}{\mathrm{max}}Q(s,a)$
* We can create an optimal policy from this:
  * $\pi^*(s) = \underset{a}{\mathrm{argmax}}Q(s,a)$

---

## Q-Function

* Why is that important?
* An expectation is sometimes easier to make than exhaustive calculations
  * Q can be anything, including a neural network

---

## Algorithms

* That brings up algorithms
  * We need a way to solve our Markov Decision Processes
* There are a few options, but let's talk about **Value Iteration**

---

## Value Iteration

* We can simply expand the Bellman equation
  * Need states, $S$, actions, $A(s)$, transition models, $P(s'|s,a)$, $Rewards(s,a,s')$, and $\gamma$
* At each time, loop over every state in S
  * $U'[s] \leftarrow max_{a\in A(s)} Q(mdp, s, a U)$
  * $if |U'[s] - U[s]| > \delta$ then $\delta \leftarrow | U'[s] - U[s]|$
* That finds the highest utility state

---

## Where is Q?

* The q-function is obviously important
* And this doesn't tell us how to get it
  * But some researchers made a differentiable version, so that means deep learning!
  * [Value Iteration Networks](https://proceedings.neurips.cc/paper_files/paper/2016/hash/c21002f464c5fc5bee3b98ced83963b8-Abstract.html)
* Let's take a real-world example

---

## Example: Robot Navigation

* [Cognitive Mapping and Planning for Visual Navigation](https://openaccess.thecvf.com/content_cvpr_2017/html/Gupta_Cognitive_Mapping_and_CVPR_2017_paper.html)
  * From CVPR 2017
* [https://sites.google.com/view/cognitive-mapping-and-planning/](https://sites.google.com/view/cognitive-mapping-and-planning/)

---

## Goal

* Train an agent to navigate a real or simulated environment to achieve specific goals
  * "walk forward 5 steps"
  * "reach this coordinate"
  * "find a chair"

---

## Agent

* Simulated and real
* Uses RGB camera and a depth camera
* Actions are discrete
  * Stay
  * Rotate left
  * Rotate right
  * Move forward

---

## Trainable Value Iteration

* Trainable, differentiable, hierarchical
* Trainable helps with ambiguity and missing information
* Differentiable so that loss can be passed through to mapping system
* Hierarchical at different levels of map scale

---

> Our planner is based on value iteration networks proposed by Tamar et al. [56], who observed that a particular type of planning algorithm called value iteration [7]
can be implemented as a neural network with alternating
convolutions and channel-wise max pooling operations, allowing the planner to be differentiated with respect to its inputs.

---

> Value iteration can be thought of as a generalization of Dijkstra’s algorithm, where the value of each state
is iteratively recalculated at each iteration by taking a max
over the values of its neighbors plus the reward of the transition to those neighboring states. This plays nicely with
2D grid world navigation problems, where these operations
can be implemented with small 3 × 3 kernels followed by
max-pooling over channels.

---

## Summary

* The value iteration equation goes into the DNN's loss function
* Now the DNN outputs the estimated optimal action for an observed state