<!--
Abstract:

CS 461
Introduction to Deep Learning
Lecture 06
-->

# CS 461 - Lecture 06

## Machine Learning Principles

Bernhard Firner

2025-09-24

---

## Reading

* K-Nearest Neighbors
* Recommended reading
  * Machine Learning: A Probabilistic Perspective by Murphy
    * Section 11 (Mixture models and the EM algorithm)
  * Machine Learning: The Art and Science of Algorithms that Make Sense of Data by Flach
    * Chapter 8.4 (distance-based clustering)

---

## Unsupervised Learning

* Sometimes our data comes without labels
* We saw with Hierarchical Agglomerative Clustering (HAC) that doing unsupervised clustering without *any* assumptions is difficult
* Making some assumptions about distributions of data and noise will simplify our problem
  * And if our assumptions are correct, improve our solutions!

---

## Latent Variables

* Most observations arise due to a set of unobserved variables
* Those unknown variables are `latent variables`
  * The ones we see are `visible variables`
* Models that fit unseen variables are `latent variable models`

---

## Latent Variables

* In a way, when we estimated parameters with linear regression, we were estimated the latent variables that caused y given x
  * For example, rock falls increase with mining in a linear relationship, but there are *tons* of variables factors that combine to create the full system

---

## Latent Variables of Complex Systems

* In a complex system, latent variables may have differing distributions and correlations
* Let's hypothesizing that our observations are caused by a mixture of different latent variables
  * This allows us to leverage our data to estimate those variables
  * Our estimates will improve with more observations, which is what we need for unsupervised learning

---

## K-Means Clustering

* Let's begin with an assumption that everything is gaussian
  * Outliers are rare, so clusters of points coming from the same class should be compact
* This is going to result in the simplest version of our algorithm

---

## K-Means Cost function

* We'll be minimizing the `distortion`, $J$
* $J(M,Z) = \sum_{n=1}{N}\lVert x_n - \mu_{z_n} \rVert^2 = \lVert X - ZM^T \rVert_{F}^2$
  * $X$ is the vector of points
  * $M$ holds the cluster centroids
  * $Z$ indicates which centroid a point belongs to
* In practice, we can just loop over each centroid

---

## The algorithm: setup

* Begin with continuous data, $D \subseteq \mathbb{R}^d$
* Choose a number of clusters, $K$
* Randomly choose $K$ starting cluster centers, $\mu_1,...,\mu_K$
  * Just pick them from the sample points
  * This isn't the best, but we'll do better in a moment

---

## The algorithm: iteration

* Assign each point to a cluster, adjust the cluster centers, and repeat

```python
mus = numpy.array(random.choices(X, k=num_clusters))
change = float('inf')
threshold = 0.1
while (change > threshold):
    distances = [numpy.linalg.norm(point - mus, ord=2, axis=1) for point in X]
    print(distances)
    closest_clusters = numpy.argmin(distances, axis=1)
    old_mus = mus.copy()
    for i in range(num_clusters):
        cluster = X[numpy.where(closest_clusters == i)]
        mus[i] = cluster.mean(axis=0)
    change = numpy.sum(mus - old_mus)
```

---

## Stationary Points

* The algorithm will converge on something
  * Not guaranteed to be the best split
* The starting points have a large impact on the end result
* We can't handle categorical inputs
  * And we handle non-gaussian inputs poorly
  * Think of a long, noodle-like clusters with lots of bends

---

## Results

Looks good, but not actually good

---

## Actual Penguins

Real penguins

---

## K-Means++

* Instead of picking random points, we could choose them sequentially
  * That way, future points can't end up right next to earlier ones
  * Scale the probability with distance, $D$ to closest centroid
* The probability of picking a point at step $t$ becomes
  * $p(u_t = x_n) = \frac{D_{t-1}(x_n)}{\sum_{n'=1}^ND_{t-1}(x_{n'}}$

---

## How about non-gaussian inputs?

* Slight improvements won't fix out problem though
* First, is this assuming gaussian?
  * That biases the system to "spherical" clusters
  * What is K-means actually doing?
* If we change our probability distributions, can we use categorical inputs?

---

## Special Case

* K-Means is actually a special case of a more general approach
* We minimized the scatter of the points using a distance metric that works for gaussians
  * Namely, the euclidean distance
  * With guassians, the most likely cluster for a point to belong to is the one with the closest center
* To put this another way, the clusters have no shape

---

## Expectation Maximization

* Let's go back to maximizing log likelihoods
* $l(\Theta) = \sum_{n=1}^{N}logp(y_n|\Theta)=\sum_{n=1}^N\left[\sum_{z_n}p(y_n,z_n|\Theta)\right]$
* Find parameters, $\Theta$, to maximize $p$ given observations, $y$, and some assumed hidden variables, $z$
* Difficult to actually optimize

---

## With Observations

* Let's rewrite with observations
* $l(\Theta) = \sum_{n=1}^{N}logp(y_n|\Theta)=\sum_{n=1}^N\left[\sum_{z_n}q_n(z_n)\frac{p(y_n,z_n|\Theta)}{q_n(z_n)}\right]$
* $l(\Theta) \geq \sum_{n}\mathbb{E}_{q_n}[logp(y_n|\Theta)] + \mathbb{H}(q_n)$
* The right side is the entropy of the distribution, q
* Left side is the **evidence lower bound** (ELBO)
  * $y_n$ are observations, or evidence, of parameters $\Theta$

---

## Useful how?

* In clustering, that expectation is the responsibility of a cluster, k, for a point, n
* Here, our $\Theta$ is an estimate, as is the responsibility
  * $\pi_k$ is the mean of the kth parameter (the center of cluster k)$
* $r_{nk} = p(z_n=k|y_n,\Theta)=\frac{\pi_kp(y_n|\Theta_k)}{\sum_{k'}\pi_kp(y_n|\Theta_{k'})}$

---

## Hard Clustering

* That equation assigned a responsibility to each cluster for each point
* In K-means, we decided that each point is only the responsibility of a single cluster
  * Called `hard clustering`
* So we just need to maximize, for all clusters:
  * $logp(x_n|z_n=k,\Theta) + logp(z_n=k|\Theta)$
* With a uniform prior and gaussian distributions, that just means minimizing the distances of the points to their cluster centers

---

## Other Priors

* The most obvious prior to use here is a Bernoulli
  * So that we can fit the categorical inputs
* $p(y|z=k,\Theta) = \prod_{d=1}^DBer(y_d|\mu_{dk}) = \prod_{d=1}^D\mu_{dk}^{y_d}(1 - \mu_{dk})^{1 - yd}$