# CS 461 - Lecture 06 ## Machine Learning Principles Bernhard Firner 2025-09-24 --- ## Reading * K-Nearest Neighbors * Recommended reading * Machine Learning: A Probabilistic Perspective by Murphy * Section 11 (Mixture models and the EM algorithm) * Machine Learning: The Art and Science of Algorithms that Make Sense of Data by Flach * Chapter 8.4 (distance-based clustering) --- ## Unsupervised Learning * Sometimes our data comes without labels * We saw with Hierarchical Agglomerative Clustering (HAC) that doing unsupervised clustering without *any* assumptions is difficult * Making some assumptions about distributions of data and noise will simplify our problem * And if our assumptions are correct, improve our solutions! --- ## Latent Variables * Most observations arise due to a set of unobserved variables * Those unknown variables are `latent variables` * The ones we see are `visible variables` * Models that fit unseen variables are `latent variable models` --- ## Latent Variables * In a way, when we estimated parameters with linear regression, we were estimated the latent variables that caused y given x * For example, rock falls increase with mining in a linear relationship, but there are *tons* of variables factors that combine to create the full system --- ## Latent Variables of Complex Systems * In a complex system, latent variables may have differing distributions and correlations * Let's hypothesizing that our observations are caused by a mixture of different latent variables * This allows us to leverage our data to estimate those variables * Our estimates will improve with more observations, which is what we need for unsupervised learning --- ## K-Means Clustering * Let's begin with an assumption that everything is gaussian * Outliers are rare, so clusters of points coming from the same class should be compact * This is going to result in the simplest version of our algorithm --- ## K-Means Cost function * We'll be minimizing the `distortion`, $J$ * $J(M,Z) = \sum_{n=1}{N}\lVert x_n - \mu_{z_n} \rVert^2 = \lVert X - ZM^T \rVert_{F}^2$ * $X$ is the vector of points * $M$ holds the cluster centroids * $Z$ indicates which centroid a point belongs to * In practice, we can just loop over each centroid --- ## The algorithm: setup * Begin with continuous data, $D \subseteq \mathbb{R}^d$ * Choose a number of clusters, $K$ * Randomly choose $K$ starting cluster centers, $\mu_1,...,\mu_K$ * Just pick them from the sample points * This isn't the best, but we'll do better in a moment --- ## The algorithm: iteration * Assign each point to a cluster, adjust the cluster centers, and repeat ```python mus = numpy.array(random.choices(X, k=num_clusters)) change = float('inf') threshold = 0.1 while (change > threshold): distances = [numpy.linalg.norm(point - mus, ord=2, axis=1) for point in X] print(distances) closest_clusters = numpy.argmin(distances, axis=1) old_mus = mus.copy() for i in range(num_clusters): cluster = X[numpy.where(closest_clusters == i)] mus[i] = cluster.mean(axis=0) change = numpy.sum(mus - old_mus) ``` --- ## Stationary Points * The algorithm will converge on something * Not guaranteed to be the best split * The starting points have a large impact on the end result * We can't handle categorical inputs * And we handle non-gaussian inputs poorly * Think of a long, noodle-like clusters with lots of bends --- ## Results
Looks good, but not actually good --- ## Actual Penguins
Real penguins --- ## K-Means++ * Instead of picking random points, we could choose them sequentially * That way, future points can't end up right next to earlier ones * Scale the probability with distance, $D$ to closest centroid * The probability of picking a point at step $t$ becomes * $p(u_t = x_n) = \frac{D_{t-1}(x_n)}{\sum_{n'=1}^ND_{t-1}(x_{n'}}$ --- ## How about non-gaussian inputs? * Slight improvements won't fix out problem though * First, is this assuming gaussian? * That biases the system to "spherical" clusters * What is K-means actually doing? * If we change our probability distributions, can we use categorical inputs? --- ## Special Case * K-Means is actually a special case of a more general approach * We minimized the scatter of the points using a distance metric that works for gaussians * Namely, the euclidean distance * With guassians, the most likely cluster for a point to belong to is the one with the closest center * To put this another way, the clusters have no shape --- ## Expectation Maximization * Let's go back to maximizing log likelihoods * $l(\Theta) = \sum_{n=1}^{N}logp(y_n|\Theta)=\sum_{n=1}^N\left[\sum_{z_n}p(y_n,z_n|\Theta)\right]$ * Find parameters, $\Theta$, to maximize $p$ given observations, $y$, and some assumed hidden variables, $z$ * Difficult to actually optimize --- ## With Observations * Let's rewrite with observations * $l(\Theta) = \sum_{n=1}^{N}logp(y_n|\Theta)=\sum_{n=1}^N\left[\sum_{z_n}q_n(z_n)\frac{p(y_n,z_n|\Theta)}{q_n(z_n)}\right]$ * $l(\Theta) \geq \sum_{n}\mathbb{E}_{q_n}[logp(y_n|\Theta)] + \mathbb{H}(q_n)$ * The right side is the entropy of the distribution, q * Left side is the **evidence lower bound** (ELBO) * $y_n$ are observations, or evidence, of parameters $\Theta$ --- ## Useful how? * In clustering, that expectation is the responsibility of a cluster, k, for a point, n * Here, our $\Theta$ is an estimate, as is the responsibility * $\pi_k$ is the mean of the kth parameter (the center of cluster k)$ * $r_{nk} = p(z_n=k|y_n,\Theta)=\frac{\pi_kp(y_n|\Theta_k)}{\sum_{k'}\pi_kp(y_n|\Theta_{k'})}$ --- ## Hard Clustering * That equation assigned a responsibility to each cluster for each point * In K-means, we decided that each point is only the responsibility of a single cluster * Called `hard clustering` * So we just need to maximize, for all clusters: * $logp(x_n|z_n=k,\Theta) + logp(z_n=k|\Theta)$ * With a uniform prior and gaussian distributions, that just means minimizing the distances of the points to their cluster centers --- ## Other Priors * The most obvious prior to use here is a Bernoulli * So that we can fit the categorical inputs * $p(y|z=k,\Theta) = \prod_{d=1}^DBer(y_d|\mu_{dk}) = \prod_{d=1}^D\mu_{dk}^{y_d}(1 - \mu_{dk})^{1 - yd}$