# CS 530 - Lecture 11

## Applied Machine Learning

Bernhard Firner

2026-02-26

---

## Schedule

* Going to introduce the next topics
  * Machine learning! Yay!
* Then go over expectations for the midterm

---

## Machine Learning

* Machine learning is often confused with AI as a topic
* It is strictly a subset of AI


---

## Data Driven

* What separates ML from the rest of AI is a data requirement
  * E.g. value iteration, as an algorithm, doesn't need training data
* If we want to estimate a Q function using some ML, we need training data first

---

## Frequentist Statistics

* It is critical to remember that most ML relies upon frequentist statistics
  * Training data is always a sample rather than the full population
* We "patch" this with models and assumptions about noise and error
  * But nothing is perfect
* When ML has a catastrophic failure it is often a problem in the data

---

## Frequentist Isn't Wrong

* You may recall that Bayesian statistics can outpredict frequentist statistics
  * See the [German Tank Problem](https://en.wikipedia.org/wiki/German_tank_problem) for an example
* But that only applies if we actually know the distributions involved
* It's been pointed out that we often don't, but we like to think that we do
  * See "[The Two Cultures](https://projecteuclid.org/journals/statistical-science/volume-16/issue-3/Statistical-Modeling--The-Two-Cultures-with-comments-and-a/10.1214/ss/1009213726.full)" by Leo Breiman, with comments and rejoinder

---

## The Bitter Lesson

* [The Bitter Lesson](http://incompleteideas.net/IncIdeas/BitterLesson.html) is an essay by Richard Sutton
  * Author of the well-regarded book, "Reinforcement Learning"
* It's a short essay, read it
* First line:

> The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin.

---

> We have to learn the bitter lesson that building in how we think we think does not work in the long run. The bitter lesson is based on the historical observations that 1) AI researchers have often tried to build knowledge into their agents, 2) this always helps in the short term, and is personally satisfying to the researcher, but 3) in the long run it plateaus and even inhibits further progress, and 4) breakthrough progress eventually arrives by an opposing approach based on scaling computation by search and learning. The eventual success is tinged with bitterness, and often incompletely digested, because it is success over a favored, human-centric approach.

---

## Principles of AI

* This class is about the principles of AI
  * So the relationship between AI and data is important
* This raises an important topic: how is data used in machine learning?

---

## Data Usage

* We can roughly divide data consumption into three categories
  * Supervised learning
  * Unsupervised learning
  * Flavors of directed data acquisition


---

## Technique Explosion

* That sounds tidy, but the number of techniques is overwhelming
  * And in practice, techniques are generally stapled together like a rushed thesis
* So what should we learn in this course?

---

## Making Progress

* Read the LeNet papers from the 90s, and they say success in ML comes from three things
  * More compute, more data, powerful ML techniques that work on large datasets
* Read the [AlexNet](https://proceedings.neurips.cc/paper/2012/hash/c399862d3b9d6b76c8436e924a68c45b-Abstract.html) paper from 2012 and you'll see this quote:

> All of our experiments suggest that our results can be improved simply by waiting for faster GPUs and bigger datasets to become available.

---

## Expectations

* A researcher from the 90s wouldn't be confused by AlexNet
  * ReLU replaced tanh as the activation function, and they added dropout
* So there was an assumption that the real breakthrough was in compute and datasets
* That hasn't remained true

---

## Data Efficient AI

* Compute has increased dramatically since 2012
* Yet people still train on the 2012 ImageNet dataset
* ML techniques have improved
  * But the largest advances are in data management

---

## Supervised Learning

* Supervised techniques reach accuracy in the mid-80% range on ImageNet-1K, and around 90% with external data
* To an agent "in the wild", 85% or 90% accuracy aren't dramatically different

<small>From the [UDL Book](https://udlbook.github.io/udlbook/), Creative Commons CC-BY-NC-ND</small>


---

## Datasets

* ImageNet is old
* We have newer datasets, like [COCO](https://cocodataset.org/#home)
  * But they haven't dramatically changed the supervised learning landscape
* [A ConvNet for the 2020s](https://openaccess.thecvf.com/content/CVPR2022/html/Liu_A_ConvNet_for_the_2020s_CVPR_2022_paper.html) pointed out that state of the art owes more to hyperparameters and techniques than any single architecture

---

## Limited Dataset

* It is inconvenient, but true, that we are not getting drastically larger **labelled** datasets
* Supervised learning requires labels
* Let's update our expectations:
  * ~~Constantly Increasing Data~~
  * Constantly Increasing Compute
  * More Powerful Algorithms

---

## Compute

* But GPUs will keep getting faster, right?
* We now have [FP8](https://developer.nvidia.com/blog/floating-point-8-an-introduction-to-efficient-lower-precision-ai-training/) and [FP4](https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/)!
  * And after 4 comes... uh-oh!

---

## Compute Limitations

* At some point, compute is limited by bandwidth and clock speeds
  * Look at CPU clock speeds: they haven't changed much in the last 10 years
* Let's update our expectations again:
  * ~~Constantly Increasing Data~~
  * ~~Constantly Increasing Compute~~
  * More Powerful Algorithms

---

## Algorithms

* At some point, all we'll be able to improve are the algorithms
* So how can we do more with less data?
* It isn't really the *data* that's the problem, it's the *labelling*

---

## Unsupervised Learning

* Collecting new data is easy
  * Remember TrailNet and DAVE? They worked by reducing labelling time
  * Those were still supervised learning, but unsupervised techniques work without any labels at all
* For unsupervised learning, we train a model to do some difficult task present in the data itself
  * e.g. remove a word and then predict what it was
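
The "remove a word and predict it" task can be sketched in a few lines. This is only an illustration of how labels come for free from unlabelled text; the function name, the `[MASK]` token, and the example sentence are assumptions made for the sketch, not part of any particular model.

```python
def make_masked_examples(sentence, mask_token="[MASK]"):
    """Turn one unlabelled sentence into (masked_sentence, target_word) pairs."""
    words = sentence.split()
    examples = []
    for position, target in enumerate(words):
        masked = words.copy()
        masked[position] = mask_token  # hide one word; the hidden word is the label
        examples.append((" ".join(masked), target))
    return examples


sentence = "the drone follows the forest trail"  # hypothetical unlabelled text
pairs = make_masked_examples(sentence)
for masked_sentence, target in pairs:
    print(f"{masked_sentence} -> {target}")
```

Every sentence yields as many training pairs as it has words, and nobody had to label anything by hand.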

---

## Unsupervised Example

* [Unsupervised Deep Homography: A Fast and Robust Homography Estimation Model](https://arxiv.org/abs/1709.03966)
  * With code: [https://github.com/tynguyen/unsupervisedDeepHomographyRAL2018](https://github.com/tynguyen/unsupervisedDeepHomographyRAL2018)
* Transform an image, ask the DNN to guess the parameters of the transformation
  * This is synthetically generated
* With an IMU, we can learn on real-world data
  * These are real perspective transforms

---

## Transfer Learning

* We may not actually have much data in our target domain
  * So we use something called *transfer learning*
* Learn one task well (for example, with unsupervised learning)
  * Then transfer our model to a separate task
* Doing this successfully requires a clever use of ML algorithms

---

## Data

* So we've solved our data problem, right?
  * It depends
* New video data can be recorded
* How about new written content?
  * Not all data is infinite

---

## Fake Data

* If we run out of data we can just make more
  * In simulation!
* This works for many types of data (not writing)
* But what if we were trying to learn a behavior from data?
  * How do we know what behavior is optimal?

---

## Utility

* Generating "good" synthetic data requires us to have an expectation of how good an action will be
  * Goodness sounds like utility
  * And a goodness estimate for every action in every state is a Q function
* If we can figure out a Q function, then we can use it directly
  * Or train a policy with the synthetic data
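
Both options can be sketched with a small lookup table. The states, actions, and Q values below are made up for illustration; in practice the Q function would be estimated rather than written by hand.

```python
# Hypothetical Q table: Q[state][action] is the estimated utility of that action.
Q = {
    "near_wall": {"forward": -1.0, "turn": 0.3},
    "open_path": {"forward": 0.8, "turn": 0.1},
}


def greedy_action(q_table, state):
    """Pick the action with the highest estimated utility in this state."""
    actions = q_table[state]
    return max(actions, key=actions.get)


# Option 1: use the Q function directly to act.
action = greedy_action(Q, "near_wall")

# Option 2: label states with their best action to get synthetic
# (state, action) pairs that a policy could be trained on.
synthetic_data = [(state, greedy_action(Q, state)) for state in Q]
print(action, synthetic_data)
```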

---

## Reinforcement Learning

* Utility-driven simulation is the most common method of reinforcement learning
  * Sometimes people also use real life robots, but this is slow
* This can completely solve our data problem
  * **if** the simulation transfers to the real world
* Example [cartpole in PyTorch](https://docs.pytorch.org/rl/main/tutorials/getting-started-5.html)
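
For a sense of what learning from simulation looks like without any library, here is a minimal tabular Q-learning sketch. The five-state corridor, its rewards, and the constants are all assumptions made for illustration. To keep the result deterministic, it sweeps over every state and action instead of exploring randomly, which a real agent usually cannot do.

```python
N_STATES = 5          # states 0..4 on a line; state 4 is the goal
ACTIONS = (-1, +1)    # step left or step right
GAMMA, ALPHA = 0.9, 0.5


def simulate(state, action):
    """Hypothetical simulator: returns (next_state, reward, done)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else -0.04
    return next_state, reward, done


Q = {s: {a: 0.0 for a in ACTIONS} for s in range(N_STATES)}
for _ in range(200):
    for state in range(N_STATES - 1):  # the goal state is terminal
        for action in ACTIONS:
            next_state, reward, done = simulate(state, action)
            future = 0.0 if done else GAMMA * max(Q[next_state].values())
            # Q-learning update: move the estimate toward reward + discounted future
            Q[state][action] += ALPHA * (reward + future - Q[state][action])

policy = {s: max(Q[s], key=Q[s].get) for s in range(N_STATES - 1)}
print(policy)
```

All of the training data here came from `simulate`, so the learned policy is only as good as the simulator.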

---

## Our Topics

* Again, this is a huge field
* We will focus on the data side of things
  * How can we make an agent with a minimal amount of data?
* We will use several techniques, but will mostly focus on use rather than implementation

---

## Midterm Content

* What are environments?
  * Important attributes: continuous vs discrete, etc
* What are agents?
  * What is the difference between a reflexive and non-reflexive agent?
  * How does an agent gain knowledge of its environment?

---

## Example Q

* What is an example of a partially observable, multiagent, sequential environment?
* What about a partially observable, multiagent, continuous environment?

---

## Execution Monitoring

* Refers to reassessment of action feasibility and progress towards a goal
* Q: What was an example of execution monitoring in the TrailNet drone?

---

## Markov Models

* In what environment can we use Markov Models?
* And what can we use them for?

---

## Filtering and Smoothing

* What's the difference?
* And how do we do each one with a Markov model?

---

## Filtering and Smoothing

* Filtering means estimating the most likely state at the current time, $t$, given observations up to $t$
* Smoothing means estimating the most likely state at an earlier time, $t$, given all observations up to some later time, $T$
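
The difference is easy to see on a tiny two-state HMM. The sketch below uses the textbook umbrella-world numbers as an assumed example (state 0 is rain, observation 1 is "umbrella seen"); the function names are made up for illustration.

```python
def normalize(v):
    total = sum(v)
    return [x / total for x in v]


def forward(pi, A, B, obs):
    """Filtering: result[t][i] = p(z_t = i | x_1..x_t)."""
    n = len(pi)
    alphas = []
    belief = pi
    for t, x in enumerate(obs):
        if t > 0:  # predict one step ahead with the transition matrix
            belief = [sum(alphas[-1][i] * A[i][j] for i in range(n)) for j in range(n)]
        alphas.append(normalize([B[j][x] * belief[j] for j in range(n)]))
    return alphas


def smooth(pi, A, B, obs):
    """Smoothing: result[t][i] = p(z_t = i | x_1..x_T), using later evidence too."""
    n = len(pi)
    alphas = forward(pi, A, B, obs)
    beta = [1.0] * n
    gammas = [None] * len(obs)
    for t in range(len(obs) - 1, -1, -1):
        gammas[t] = normalize([alphas[t][i] * beta[i] for i in range(n)])
        # fold observation t into the backward message used at time t - 1
        beta = normalize([sum(A[i][j] * B[j][obs[t]] * beta[j] for j in range(n))
                          for i in range(n)])
    return gammas


pi = [0.5, 0.5]                 # initial state distribution
A = [[0.7, 0.3], [0.3, 0.7]]    # A[i][j] = p(z_t = j | z_{t-1} = i)
B = [[0.1, 0.9], [0.8, 0.2]]    # B[i][x] = p(x | z = i)
obs = [1, 1]                    # umbrella seen on both days

filtered = forward(pi, A, B, obs)
smoothed = smooth(pi, A, B, obs)
print(filtered[0][0], smoothed[0][0])
```

The filtered belief in rain on day 1 is about 0.82, but once the second umbrella is taken into account the smoothed belief for day 1 rises to about 0.88.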

---

## Most Likely Predictions

* We used a Markov model to predict both the most likely sequence of states, and the most likely states at each time
  * The first is the *maximum a posteriori*
  * The second is the maximum posterior marginals
* The first maximizes all states together; the second maximizes each state separately

---

## MAP vs MPM

* The Viterbi algorithm computes the *maximum a posteriori* (MAP)
  $z^* = \underset{z_{1:T}}{\mathrm{argmax}}\,p(z_{1:T}|x_{1:T})$
* $\gamma$ gives an estimate that maximizes the posterior marginals (MPM):
  $\hat{z} = (\underset{z_1}{\mathrm{argmax}}\,p(z_1|x_{1:T}), ..., \underset{z_T}{\mathrm{argmax}}\,p(z_T|x_{1:T}))$
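
The two estimates can disagree. The sketch below uses a made-up three-state chain with uninformative observations, chosen so that the per-time argmax picks a sequence the model considers impossible. The matrices and function names are assumptions for illustration only.

```python
def viterbi(pi, A, B, obs):
    """Most likely state sequence (MAP), computed in the probability domain."""
    n = len(pi)
    delta = [pi[i] * B[i][obs[0]] for i in range(n)]
    backpointers = []
    for x in obs[1:]:
        pointers, new_delta = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: delta[i] * A[i][j])
            pointers.append(best_i)
            new_delta.append(delta[best_i] * A[best_i][j] * B[j][x])
        backpointers.append(pointers)
        delta = new_delta
    state = max(range(n), key=lambda i: delta[i])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


def posterior_marginals(pi, A, B, obs):
    """gamma[t][i] = p(z_t = i | x_1..x_T) via forward-backward."""
    n, T = len(pi), len(obs)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    for x in obs[1:]:
        prev = alpha[-1]
        alpha.append([B[j][x] * sum(prev[i] * A[i][j] for i in range(n))
                      for j in range(n)])
    beta = [[1.0] * n for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(n))
                   for i in range(n)]
    gammas = []
    for t in range(T):
        unnorm = [alpha[t][i] * beta[t][i] for i in range(n)]
        total = sum(unnorm)
        gammas.append([u / total for u in unnorm])
    return gammas


pi = [0.4, 0.6, 0.0]
A = [[1.0, 0.0, 0.0],
     [0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0]]
B = [[1.0], [1.0], [1.0]]   # one observation symbol, so observations carry no information
obs = [0, 0]

map_path = viterbi(pi, A, B, obs)
gammas = posterior_marginals(pi, A, B, obs)
mpm_path = [max(range(len(pi)), key=lambda i: g[i]) for g in gammas]
print(map_path, mpm_path)
```

Here the MAP sequence is `[0, 0]`, while the MPM estimate is `[1, 0]`, which needs a transition from state 1 to state 0 that has probability zero. Each MPM entry is the best guess for its own time step, but the entries need not form a plausible path.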

---

## Path Prediction Error


---

## EM Algorithm

* We also used Markov Models to estimate the emission and transition matrices, as well as the initial state distribution, $\pi_0$
* This is quite complicated to do by hand, but the EM algorithm itself is simple
  * Calculate the expectation of the hidden states given the current parameter estimates
  * Then maximize the likelihood of the observations by adjusting the transition and emission probabilities
* For clustering, the hidden states would be the cluster assignments, and the cluster centers would be the parameters
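
The same two steps are easiest to see on clustering rather than on a full HMM. The sketch below runs EM on one-dimensional points, assuming Gaussian clusters with a fixed, shared variance and equal weights so that only the centers are learned. The data, starting centers, and function name are made up for illustration.

```python
import math


def em_cluster_means(points, means, steps=20, sigma=1.0):
    """EM for cluster centers, assuming equal-weight Gaussians with fixed sigma."""
    for _ in range(steps):
        # E-step: soft membership of every point in every cluster,
        # given the current centers
        memberships = []
        for x in points:
            w = [math.exp(-((x - m) ** 2) / (2 * sigma ** 2)) for m in means]
            total = sum(w)
            memberships.append([wi / total for wi in w])
        # M-step: move each center to the membership-weighted mean of the points
        means = [
            sum(r[k] * x for r, x in zip(memberships, points))
            / sum(r[k] for r in memberships)
            for k in range(len(means))
        ]
    return means


points = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]   # hypothetical data with two obvious groups
means = em_cluster_means(points, means=[0.0, 6.0])
print(means)
```

The centers settle close to 1 and 5. Baum-Welch for HMMs follows the same pattern, with forward-backward providing the expectations.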

---

## Stationary Distributions

* The last trick from HMMs is the stationary distribution
* If certain conditions are met (for example, the chain is irreducible and aperiodic), there is a special distribution:
  * $\pi = \pi A$
* This tells us directly what fraction of time is spent in each state in the long run
  * And all we need in order to solve for it is the transition matrix
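
One simple way to find $\pi$ is to keep multiplying a starting distribution by $A$ until it stops changing. This is a sketch using power iteration on a made-up two-state transition matrix; solving the linear system directly would work as well.

```python
def stationary_distribution(A, iters=1000, tol=1e-12):
    """Repeatedly apply pi <- pi A until pi stops changing."""
    n = len(A)
    pi = [1.0 / n] * n
    for _ in range(iters):
        new_pi = [sum(pi[i] * A[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(pi, new_pi)) < tol:
            return new_pi
        pi = new_pi
    return pi


# Hypothetical transition matrix: A[i][j] = p(next state = j | current state = i)
A = [[0.9, 0.1],
     [0.5, 0.5]]
pi = stationary_distribution(A)
print(pi)
```

For this matrix the balance condition $0.1\,\pi_0 = 0.5\,\pi_1$ gives $\pi = (5/6, 1/6)$, which is what the iteration converges to.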

---

## Continuous Models

* That brought us to continuous environments
* Filtering is more complicated
  * Our samples of the environment are discrete
  * So there is likely to be more noise in our observations

---

## Kalman Filter

* What is A in this equation?
  * $z_t = A_tz_{t-1} + \epsilon_t$

---

## Kalman Filter

* $z_t = A_tz_{t-1} + \epsilon_t$
  * $A$ holds the state transitions, which must be linear
  * The noise is also assumed to be Gaussian
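
A scalar version shows the predict and update steps. The slide only gives the transition model, so the observation model $x_t = c z_t + \delta_t$, the noise variances `q` and `r`, and the measurements below are assumptions made for this sketch.

```python
def kalman_step(mean, var, x, a=1.0, q=0.01, c=1.0, r=0.25):
    """One predict + update step of a 1-D Kalman filter.

    Assumed model: z_t = a * z_{t-1} + noise (variance q)
                   x_t = c * z_t     + noise (variance r)
    """
    # Predict: push the belief through the linear transition
    pred_mean = a * mean
    pred_var = a * var * a + q
    # Update: blend the prediction with the new observation
    gain = pred_var * c / (c * pred_var * c + r)
    new_mean = pred_mean + gain * (x - c * pred_mean)
    new_var = (1 - gain * c) * pred_var
    return new_mean, new_var


mean, var = 0.0, 1.0                      # vague initial belief
for x in [1.1, 0.9, 1.05, 0.98]:          # hypothetical noisy measurements
    mean, var = kalman_step(mean, var, x)
    print(round(mean, 3), round(var, 3))
```

The belief stays Gaussian at every step, which is why a mean and a variance are all the filter has to track.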

---

## Particle Filter

* Everyone has already implemented these, so you know them 100%, right?

---

## Axioms of Utility

* Can you identify if something is breaking one of the axioms?
* Remember, if an agent does not follow the axioms, then it will be possible for the agent to behave irrationally

---

## Value of Perfect Information

* Even if we don't know something, we can quantify our gain from learning it
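  * Enumerate all possible outcomes, weight the best expected utility under each outcome by its probability, and subtract the best expected utility without the information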
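
As a sketch of that calculation, with made-up probabilities and utilities (these are not the garden bot numbers):

```python
def value_of_perfect_information(prior, utility):
    """prior[o] = P(outcome o); utility[a][o] = utility of action a if o is true."""
    actions = list(utility)
    # Without the information: commit to one action for all outcomes
    best_uninformed = max(
        sum(prior[o] * utility[a][o] for o in prior) for a in actions
    )
    # With the information: choose the best action separately for each outcome
    best_informed = sum(
        prior[o] * max(utility[a][o] for a in actions) for o in prior
    )
    return best_informed - best_uninformed


# Hypothetical numbers for illustration only
prior = {"passable": 0.5, "blocked": 0.5}
utility = {
    "up": {"passable": 0.8, "blocked": 0.4},
    "right": {"passable": 0.6, "blocked": 0.6},
}
vpi = value_of_perfect_information(prior, utility)
print(round(vpi, 3))
```

Both actions have expected utility 0.6 on their own, but knowing the outcome in advance lets the agent earn 0.7 on average, so the information is worth 0.1.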

---

## Garden Bot World

* One space is impassable, and there are two endpoints with utility scores
* Moving each space has a utility of -0.4

---

## Garden Bot World

* Let's say that the world is deterministic
* We can use value iteration to give each space a utility
  * Here we assume that $\gamma=1$, so each space's utility is the utility of the best exit minus 0.4 times the length of the shortest path to it
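
A minimal value iteration sketch for a deterministic grid looks like this. The 4x3 layout, the wall position, and the exit utilities of +1 and -1 are assumptions made for illustration, since the figure is not reproduced here; only the -0.4 step reward and $\gamma=1$ come from the slides.

```python
STEP_REWARD = -0.4
WIDTH, HEIGHT = 4, 3
WALLS = {(1, 1)}                          # hypothetical impassable space
EXITS = {(3, 2): 1.0, (3, 1): -1.0}       # hypothetical endpoints and utilities
MOVES = [(0, 1), (0, -1), (1, 0), (-1, 0)]


def value_iteration(sweeps=50):
    cells = [(x, y) for x in range(WIDTH) for y in range(HEIGHT)
             if (x, y) not in WALLS]
    U = {c: EXITS.get(c, 0.0) for c in cells}
    for _ in range(sweeps):
        new_U = dict(U)
        for (x, y) in cells:
            if (x, y) in EXITS:
                continue                  # exit utilities are fixed
            candidates = []
            for dx, dy in MOVES:
                nxt = (x + dx, y + dy)
                if nxt not in U:          # off the grid or into the wall: stay put
                    nxt = (x, y)
                candidates.append(STEP_REWARD + U[nxt])   # gamma = 1
            new_U[(x, y)] = max(candidates)
        U = new_U
    return U


U = value_iteration()
for y in reversed(range(HEIGHT)):
    print([round(U[(x, y)], 1) if (x, y) in U else None for x in range(WIDTH)])
```

In this layout the space next to the +1 exit ends up at 0.6, and the far corner, five moves away, ends up at -1.0.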

---

## Garden Bot Information

* How much is it worth to learn what is in the hidden square?

---

## Garden Bot Information

* If the tile is passable, up is 0.08 better than right

---

## Garden Bot Information

* If the tile is impassable, right is 0.08 better than left

---

## Question Types

* Mostly fill in
* A few multiple choice, just to cover some breadth of topics