<!--
Abstract:

CS 461
Introduction to Deep Learning
Lecture 06
-->

# CS 461 - Lecture 06

## Machine Learning Principles

Bernhard Firner

2025-09-30

---

## Reading

* Recommended reading
  * Machine Learning: A Probabilistic Perspective by Murphy
    * Section 11.2.3 (Clustering with Mixture models)
    * Section 11.4.3 (Expectation Maximization with a mixture of experts)
    * Section 11.5 (Model selection)
    * Section 14.7.3 (Training a mixture model)
  * Machine Learning: The Art and Science of Algorithms that Make Sense of Data by Flach
    * Chapter 8.4 (distance-based clustering)
    * Chapter 9.4 (probabilistic models with hidden variables)

<div hidden>
<img style="width: 1%" class="r-stretch" src="./figures/mixture_of_experts_penguins.png" />
<img style="width: 100%" class="r-stretch" src="./figures/mixture_of_experts_error.png" />
<img style="width: 100%" class="r-stretch" src="./pictures/SA20211224-0032_chinstrap_HalfMoonIsland.jpg" />
<img style="width: 100%" class="r-stretch" src="./figures/mixture_of_experts_test_error.png" />
</div>

---

## Dataset

* Still talking about penguins
* Images courtesy of Ivan Seskar ([seskar.site](https://seskar.site))

</div>
<div class="col">
<img style="width: 75%" class="r-stretch" src="./pictures/SA20211225-0062_gentoo_RongeIsland.jpg" />

</div>
</div>

---

## Last Time

* K-Means
  * Simple clustering algorithm
    * Iterative, not guaranteed to be optimal
* Equivalent to assuming all data columns are generated by independent, equal-variance Gaussian sources
  * The most likely cluster for a point is thus the closest by Euclidean distance
* We iteratively maximize likelihood of the data given our clusters
  * Which means adjusting the cluster center to the mean of the cluster's points

---

## Expectation Maximization

* First select random starting cluster centers
* Then calculate the expectations for cluster assignment
  * Just assign each data point to the nearest cluster center in K-means
* Now maximize the likelihood of each cluster by adjusting their centers
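
A minimal NumPy sketch of that loop, assuming Euclidean distance and a fixed number of iterations. The name `kmeans_em` and the toy data are illustrative, not the course implementation.

```python
import numpy as np

def kmeans_em(X, k, iters=20, seed=0):
    """Hard-assignment EM (k-means). X has shape (N, D)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # E-step: assign each point to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # M-step: move each center to the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):  # an empty cluster keeps its old center
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

# Toy data: two well-separated blobs (hypothetical, not the penguin data)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, (50, 2)), rng.normal(5.0, 0.5, (50, 2))])
centers, labels = kmeans_em(X, k=2)
print(centers)
```

The E-step is a hard assignment here; the mixture model later in the lecture replaces it with soft responsibilities.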

---

## Unsupervised

* K-Means is unsupervised, meaning that there are no training labels
* We make an assumption about the number and type of latent variables
  * In the penguin case, 3, one for each species
  * But this could be wrong; maybe 6, one for each species and sex would be better
  * Or maybe one for each species and each island

---

## K-Means Complexity

* K-means is fast
  * With $i$ iterations, $k$ clusters, and $n$ samples
  * $O(nki)$

---

## K-Means Weaknesses

* Assumed independence
* Didn't deal with categorical inputs

---

## Categorical Inputs

* The penguin dataset has both continuous and categorical columns
* To combine them, we'll have to somehow find a distance that works for both continuous values and discrete categories
* That distance metric will be the probability

---

## Categorical Distribution

* Those probabilities are from the `categorical distribution`
  * Number of classes is `k`
  * Bernoulli is a special case where $k = 2$
  * Categorical is in turn a special case of the `multinomial distribution` with 1 trial
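
A small sketch of those relationships using NumPy's random generator. The probabilities below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# A categorical distribution over k = 3 classes (illustrative probabilities)
p = np.array([0.5, 0.3, 0.2])

# One multinomial trial yields a one-hot vector: a single categorical draw
one_hot = rng.multinomial(1, p)
category = int(np.argmax(one_hot))

# Bernoulli is the k = 2 case of the same draw
coin = rng.multinomial(1, [0.3, 0.7])

print(one_hot, category, coin)
```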

---

## Using Naive Bayes

* One could misunderstand the columns and think that we should use Bayes rule
  * To describe $P(species|island)$, for example
* Recall Bayes Rule:
  * $P(A|B) = \frac{P(B|A)P(A)}{P(B)}$

---

## Penguin example

* There are 146 Adelie penguins out of 333
  * $P(Adelie) \approx 43.7 \\%$
* They were observed in all three islands:
  * 55 in Dream ($37.7\\%$ of Adelie)
  * 47 in Torgersen ($32.1\\%$ of Adelie)
  * 44 in Biscoe ($30.1\\%$ of Adelie)

---

## P(Adelie | Island)

* So if we observe a penguin on Dream, what's the probability it is an Adelie?
  * $P(Adelie | island = Dream)$
  * $\frac{P(Dream|Adelie)\times P(Adelie)}{P(Dream)} = \frac{37.7\\% \times 43.7 \\%}{37\\%} = 44.5\\%$
* $P(Adelie | Biscoe) = \frac{30.1\\% \times 43.7\\%}{48.9\\%} = 26.9\\%$
* $P(Adelie | Torgersen) = \frac{32.1\\% \times 43.7\\%}{14.1\\%} = 99.5\\%$
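
The same arithmetic can be sketched from raw counts. The Adelie counts come from the previous slide; the per-island totals (123, 47, 163) are an assumption taken from the standard 333-row Palmer penguins data, consistent with the island percentages used above.

```python
# Adelie counts per island (from the slide) and assumed island totals
adelie_on = {"Dream": 55, "Torgersen": 47, "Biscoe": 44}
island_total = {"Dream": 123, "Torgersen": 47, "Biscoe": 163}  # assumption

n_adelie = sum(adelie_on.values())
n_total = sum(island_total.values())
p_adelie = n_adelie / n_total

posterior = {}
for island in adelie_on:
    p_island_given_adelie = adelie_on[island] / n_adelie
    p_island = island_total[island] / n_total
    # Bayes rule: P(Adelie | island) = P(island | Adelie) P(Adelie) / P(island)
    posterior[island] = p_island_given_adelie * p_adelie / p_island
    print(f"P(Adelie | {island}) = {posterior[island]:.3f}")
```

With unrounded counts the product collapses to the direct ratio of counts (for example 55/123 for Dream), which is a quick sanity check on the rounded percentages.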

---

## Independence

* Only Adelie were observed on Torgersen (Dream also has Chinstrap)
* More importantly, each penguin was only observed once
  * They cannot be tagged with multiple islands
* Don't do crazy things with your data

</div>
</div>

---

## Categorical Probability

* Our categorical probabilities will simply come from the observations
* If we have a cluster with only the Adelie penguins, the probabilities for the island category would be
  * $37.7\\%$ for Dream
  * $32.1\\%$ for Torgersen
  * $30.1\\%$ for Biscoe

---

## Adjusting Categories

* If the cluster is assigned a penguin from Dream, the "distance" along this feature is $1-p_{dream}$
* To update a categorical cluster
  * Sum the frequency of each value, c, in the category, C
  * Set each probability to $frequency_c/\sum_{c' \in C} frequency_{c'}$
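
A sketch of that update for a single categorical feature, assuming the categories are already integer-coded. The function name and the toy cluster are hypothetical.

```python
import numpy as np

def update_categorical(values, n_levels):
    """Estimate category probabilities from the integer-coded values in a cluster."""
    counts = np.bincount(values, minlength=n_levels)  # frequency of each value c
    return counts / counts.sum()                      # normalize to probabilities

# Island codes for a hypothetical cluster: 0 = Biscoe, 1 = Dream, 2 = Torgersen
cluster_islands = np.array([1, 1, 0, 2, 1, 0, 1, 2])
p = update_categorical(cluster_islands, n_levels=3)

# "Distance" of a Dream penguin to this cluster along the island feature
dream_distance = 1.0 - p[1]
print(p, dream_distance)
```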

---

## Combining Things

* We are going to transform the probabilities into negative log likelihoods
  * This transforms multiplications into sums, simplifying the gaussian math
  * And ends up being more numerically stable
    * Who remembers floating point considerations from CS211?
* Also makes it easy to talk about as a loss
  * When a probability is 1, the NLL is 0
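
A short illustration of the numerical point, using made-up likelihoods: multiplying many probabilities underflows in floating point, while summing their negative logs stays finite.

```python
import numpy as np

# 2000 hypothetical per-sample likelihoods of 0.5 each
probs = np.full(2000, 0.5)

product = np.prod(probs)       # 0.5**2000 is too small for float64
nll = -np.sum(np.log(probs))   # the same quantity as a sum of logs

# A probability of 1 contributes zero loss
certain = -np.log(1.0)

print(product, nll)
```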

---

## Unsupervised Loss

* Remember, we want to treat clustering as an unsupervised process
  * So the model will need to deduce $P(species|island)$ by itself
* Going back to responsibility:
  * In the unsimplified form:
  * $r_{nk} = p(z_n=k|y_n,\Theta)=\frac{\pi_kp(y_n|\Theta_k)}{\sum_{k'}\pi_{k'}p(y_n|\Theta_{k'})}$
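
The responsibility can be sketched in log space so that very small likelihoods do not underflow. The function name and the toy log-likelihood values are illustrative assumptions.

```python
import numpy as np

def responsibilities(log_lik, pi):
    """log_lik[n, k] = log p(y_n | Theta_k); pi[k] = gating prior. Returns r[n, k]."""
    log_num = log_lik + np.log(pi)  # log(pi_k * p(y_n | Theta_k))
    # log of the denominator: log-sum-exp over the clusters
    log_den = np.logaddexp.reduce(log_num, axis=1, keepdims=True)
    return np.exp(log_num - log_den)

# Two samples, two clusters; the second row would underflow without logs
log_lik = np.array([[-1.0, -3.0],
                    [-800.0, -802.0]])
pi = np.array([0.5, 0.5])
r = responsibilities(log_lik, pi)
print(r)
```

Both rows give the same responsibilities because only the difference between the log-likelihoods matters.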

---

## Gating

* $\pi_k$ is the gate for expert (cluster) k, which chooses when its parameters should be used
* In k-means (or any hard clustering) it would be 0 or 1, providing a hard gate

---

## Continuous Loss

* Going to estimate both a mean and a variance for the Gaussians
  * We then pick the values that maximize the likelihood of the points in each cluster
* Taking the NLL turns that into an error to minimize
* The expected log likelihood is $l(\mu_k,\Sigma_k) = \sum_k\sum_ir_{ik}\log p(x_i|\Theta_k)$, and the NLL is $-l$
* The result will be the matrix form of the log of the normal PDF

---

## Loss Estimate

* The log of the normal would be
  * $-\log(\sigma) - \frac{(x-\mu)^2}{2\sigma^2} - \frac{1}{2}\log(2\pi)$
* We'll ignore those constants for the loss
* $l(\mu_k,\Sigma_k) = -\frac{1}{2}\sum_ir_{ik}[\log|\Sigma_k|+(x_i-\mu_k)^T\Sigma_k^{-1}(x_i-\mu_k)]$
  * The first part is the log determinant of the covariance
  * The second part is the square of the Mahalanobis distance (a distance measure for the normal)
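
A sketch of the bracketed term for a single point, with the constant dropped. The function name and the toy values are assumptions for illustration.

```python
import numpy as np

def gaussian_log_density_no_const(x, mu, Sigma):
    """-0.5 * (log|Sigma| + squared Mahalanobis distance), constant dropped."""
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)           # log determinant of the covariance
    maha_sq = diff @ np.linalg.solve(Sigma, diff)  # (x - mu)^T Sigma^{-1} (x - mu)
    return -0.5 * (logdet + maha_sq)

mu = np.array([0.0, 0.0])
Sigma = np.array([[4.0, 0.0],
                  [0.0, 1.0]])
x = np.array([2.0, 1.0])
value = gaussian_log_density_no_const(x, mu, Sigma)
print(value)
```

Using `solve` instead of an explicit inverse is a common choice for numerical stability; it is not required by the math.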

---

## Mean and Variance Estimates

* $\mu_k = \frac{\sum_ir_{ik}x_i}{r_k}$, where $r_k = \sum_ir_{ik}$
  * This just says that the mean of cluster k is the average value of the points, $x_i$, weighted by the responsibility of cluster k for those points
* $\Sigma_k = \frac{\sum_ir_{ik}x_ix_i^T}{r_k} - \mu_k\mu_k^T$
  * Covariance is the scatter matrix weighted by responsibility
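
Both updates can be sketched for one cluster as follows. The function name is hypothetical, and the check at the end uses equal weights, where the result should match the ordinary mean and (biased) covariance.

```python
import numpy as np

def weighted_mean_cov(X, r_k):
    """M-step for one cluster. X: (N, D); r_k: (N,) responsibilities of cluster k."""
    total = r_k.sum()  # sum of responsibilities for this cluster
    mu = (r_k[:, None] * X).sum(axis=0) / total
    # Responsibility-weighted average of the outer products x_i x_i^T
    second_moment = (r_k[:, None, None] * X[:, :, None] * X[:, None, :]).sum(axis=0) / total
    Sigma = second_moment - np.outer(mu, mu)
    return mu, Sigma

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
mu, Sigma = weighted_mean_cov(X, np.ones(200))
print(mu)
print(Sigma)
```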

---

## Categorical Loss

* Simpler
  * Loss by feature, i, expert, k, and category, c
* $l(p_C) = \sum_k\sum_ir_{ik}log p(x_i|\Theta_k)$
* This can be calculated directly for each class
  * Straightforward if we assume the classes are independent
  * Island and sex seem independent, so we'll keep that assumption

---

## Cluster Assignment

* In the math, $\pi_k$ determines when the parameters of cluster k will be used
* These are called the gating priors, since they gate each model
* For us, we will always choose the most likely cluster for each point on the update step
  * That updates the priors to the fraction of points they cover

---

## Algorithm

* This is the same expectation maximization that we used with K-means
* Iterative, recalculating cluster responsibilities after each parameter update
* Still light compared to HAC
  * With $i$ iterations, $k$ clusters, and $n$ samples
  * $O(nki)$

---

## Algorithm Setup

1. Split the data into continuous and categorical columns
2. Initialize priors for each cluster with equal probabilities
3. Initialize the continuous parameters as with k-means (random points)
4. Initialize categorical probabilities to be uniform for each category

---

## Algorithm Steps

1. Calculate responsibilities for all points over every expert
2. Maximize likelihoods by adjusting expert parameters
3. Repeat from 1 until parameter change is small

---

## Concepts

* The code for this can be found in textbook samples and regurgitated by LLMs
* So what are the main concepts?
* Compared to HAC or K-Means, we are using more sophisticated probabilities
  * We can mix different types of observational data
* Each cluster is an expert, with parameters trained on different subsets of the data

---

## Mixture of Experts

* Each expert can reveal something about your data
* These are the means of a mixture of 10 Bernoullis
  * Each predicts 0 or 1 for every pixel in the MNIST dataset
* 10 clusters isn't actually sufficient
  * There are multiple ways to write some numbers

</div>
<div class="col">
<img style="width: 80%" class="r-stretch" src="./pictures/ML_Murphy_10BernoulliMnist.png" />

</div>
</div>

---

## Code

* Pretty much all examples from Murphy's book are online
  * [https://github.com/probml/pml-book/tree/main](https://github.com/probml/pml-book/tree/main)
* To demonstrate the world we live in, this example code will mostly come from an LLM

---

## Gaussian Likelihood

* The messiest part

```python
def _multivariate_gaussian_log_likelihood(X, mu, sigma_diag):
    """
    Calculates the log-likelihood of each data point under a multivariate Gaussian 
    distribution, assuming a diagonal covariance matrix.

    Args:
        X (np.ndarray): Data points (N samples, D features).
        mu (np.ndarray): Mean vector (D features).
        sigma_diag (np.ndarray): Diagonal of the covariance matrix (D features).

    Returns:
        np.ndarray: Log-likelihood for each sample (N samples).
    """
    D = X.shape[1]
    # Small epsilon for numerical stability
    epsilon = 1e-6 
    
    # Calculate log determinant (sum of log of diagonal elements)
    # sigma_diag must be positive, adding epsilon for stability
    log_det_sigma = np.sum(np.log(sigma_diag + epsilon))
    
    # Deviation from the mean
    X_mu = X - mu
    
    # Calculate Mahalanobis distance term: -0.5 * (X-mu)^T @ inv(Sigma) @ (X-mu)
    # Since Sigma is diagonal, inv(Sigma) is diag(1/sigma_diag).
    # This simplifies to sum((X-mu)^2 / sigma_diag)
    mahalanobis_term = np.sum((X_mu ** 2) / (sigma_diag + epsilon), axis=1)
    
    # Log-Likelihood formula: -D/2 * log(2*pi) - 0.5 * log(|Sigma|) - 0.5 * Mahalanobis_term
    log_likelihood = -0.5 * D * np.log(2 * np.pi) - 0.5 * log_det_sigma - 0.5 * mahalanobis_term
    
    return log_likelihood
```

---

## Cleaning Up

* `-0.5 * D * np.log(2 * np.pi)` is a constant and is pointless in the loss
  * We can remove it

---

## Categorical loss

* Precomputed when updating clusters

```python
def _categorical_log_likelihood(X_cat, log_probs_cat):
    """
    Calculates the log-likelihood of each categorical feature combination.

    Args:
        X_cat (np.ndarray): Categorical data points (N samples, D_cat features).
        log_probs_cat (np.ndarray): Log probabilities (D_cat features, C max categories).

    Returns:
        np.ndarray: Log-likelihood for each sample (N samples).
    """
    N, D_cat = X_cat.shape
    log_likelihood = np.zeros(N)
    
    # The categorical likelihood assumes independence across features:
    # log P(x_cat | z) = sum_j log P(x_j | z)
    for j in range(D_cat):
        # x_j is the observed category index (0, 1, 2, ...) for feature j
        categories = X_cat[:, j].astype(int)
        
        # Look up the log probability for the observed category in feature j
        log_likelihood += log_probs_cat[j, categories]
        
    return log_likelihood
```

---

## Full Code

```python
import numpy as np
import sys

import tree_funcs
import cluster_common

# --- Helper Functions ---

def _multivariate_gaussian_log_likelihood(X, mu, sigma_diag):
    """
    Calculates the log-likelihood of each data point under a multivariate Gaussian 
    distribution, assuming a diagonal covariance matrix.

    Args:
        X (np.ndarray): Data points (N samples, D features).
        mu (np.ndarray): Mean vector (D features).
        sigma_diag (np.ndarray): Diagonal of the covariance matrix (D features).

    Returns:
        np.ndarray: Log-likelihood for each sample (N samples).
    """
    # Small epsilon for numerical stability
    epsilon = 1e-6 
    
    # Calculate log determinant (sum of log of diagonal elements)
    # sigma_diag must be positive, adding epsilon for stability
    log_det_sigma = np.sum(np.log(sigma_diag + epsilon))
    
    # Deviation from the mean
    X_mu = X - mu
    
    # Calculate Mahalanobis distance term: -0.5 * (X-mu)^T @ inv(Sigma) @ (X-mu)
    # Since Sigma is diagonal, inv(Sigma) is diag(1/sigma_diag).
    # This simplifies to sum((X-mu)^2 / sigma_diag)
    mahalanobis_term = np.sum((X_mu ** 2) / (sigma_diag + epsilon), axis=1)
    
    # Log-likelihood up to a constant: the -D/2 * log(2*pi) term is dropped,
    # leaving -0.5 * log(|Sigma|) - 0.5 * Mahalanobis_term
    log_likelihood = -0.5 * log_det_sigma - 0.5 * mahalanobis_term
    
    return log_likelihood

def _categorical_log_likelihood(X_cat, log_probs_cat):
    """
    Calculates the log-likelihood of each categorical feature combination.

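    Args:
        X_cat (np.ndarray): Categorical data points (N samples, D_cat features).
        log_probs_cat (np.ndarray): Log probabilities (D_cat features, C max categories).

    Returns:
        np.ndarray: Log-likelihood for each sample (N samples).
    """
    N, D_cat = X_cat.shape
    log_likelihood = np.zeros(N)

    # The categorical likelihood assumes independence across features:
    # log P(x_cat | z) = sum_j log P(x_j | z)
    for j in range(D_cat):
        # x_j is the observed category index (0, 1, 2, ...) for feature j
        categories = X_cat[:, j].astype(int)

        # Look up the log probability for the observed category in feature j
        log_likelihood += log_probs_cat[j, categories]

    return log_likelihood
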
class MixtureOfExperts:
    """
    Mixture of Experts (MoE) model combining Gaussian and Categorical distributions,
    trained using the Expectation-Maximization (EM) algorithm.
    """
    def __init__(self, k, n_gauss_features, max_iters=100, random_state=42):
        self.k = k
        self.n_gauss = n_gauss_features
        self.max_iters = max_iters
        self.random_state = random_state
        
        # Model Parameters (to be initialized)
        self.pi = None              # Gating Network (Priors): (K,)
        self.gauss_mu = None        # Gaussian Means: (K, D_gauss)
        self.gauss_sigma_diag = None# Gaussian Diagonals of Covariance: (K, D_gauss)
        self.cat_log_probs = None   # Categorical Log Probabilities: (K, D_cat, C_max)
        self.n_cat = None           # Number of categorical features
        self.cat_max_levels = None  # Max number of categories per categorical feature (D_cat,)

    def _initialize_params(self, X):
        """Initializes model parameters."""
        
        np.random.seed(self.random_state)
        N, D = X.shape
        
        # 1. Split data into Gaussian (continuous) and Categorical (discrete) parts
        X_gauss = X[:, :self.n_gauss]
        X_cat = X[:, self.n_gauss:]
        self.n_cat = X_cat.shape[1]
        
        # Determine number of categories (levels) for each categorical feature
        self.cat_max_levels = np.max(X_cat, axis=0).astype(int) + 1
        
        # 2. Initialize Priors (pi)
        self.pi = np.ones(self.k) / self.k
        
        # 3. Initialize Gaussian Parameters (using random subsamples for centroids)
        initial_indices = np.random.choice(N, self.k, replace=False)
        self.gauss_mu = X_gauss[initial_indices]
        # Initialize diagonal covariance as variance of full data for stability
        data_variance = np.var(X_gauss, axis=0)
        self.gauss_sigma_diag = np.tile(data_variance, (self.k, 1))
        
        # 4. Initialize Categorical Parameters (uniform probability for each category)
        self.cat_log_probs = []
        for j in range(self.n_cat):
            C_j = self.cat_max_levels[j] # Number of unique categories for feature j
            # Create a (K, C_j) array of uniform log probabilities
            log_probs_j = np.tile(np.log(1 / C_j), (self.k, C_j))
            self.cat_log_probs.append(log_probs_j)
        # Note from Human. The magnitude is different for different columns
        #self.cat_log_probs = np.array(self.cat_log_probs, dtype=object) # Array of arrays of shape (K, C_j)

    def _e_step(self, X):
        """
        Expectation Step: Calculates responsibilities (gamma_ik).
        gamma_ik = P(z_ik=k | x_i)
        """
        X_gauss = X[:, :self.n_gauss]
        X_cat = X[:, self.n_gauss:]
        N = X.shape[0]
        
        # log_likelihood_k[i, k] will store log(P(x_i | z_ik=k))
        log_likelihood_k = np.zeros((N, self.k))

        for k in range(self.k):
            # Calculate Gaussian Log Likelihood
            log_gauss_lik = _multivariate_gaussian_log_likelihood(
                X_gauss, self.gauss_mu[k], self.gauss_sigma_diag[k]
            )
            
            # Calculate Categorical Log Likelihood
            # log_probs_cat is indexed: [feature_index][expert_index, category_index]
            log_cat_lik = np.zeros(N)
            for j in range(self.n_cat):
                # Retrieve the log probabilities for expert k and feature j
                log_probs_j_k = self.cat_log_probs[j][k]
                categories = X_cat[:, j].astype(int)
                log_cat_lik += log_probs_j_k[categories]

            # Combined Expert Log Likelihood: log(P(x | z=k))
            log_likelihood_k[:, k] = log_gauss_lik + log_cat_lik
            
        # Total numerator: log(P(x_i | z_ik=k)) + log(pi_k)
        log_pi = np.log(self.pi)
        log_numerator = log_likelihood_k + log_pi

        # Use log-sum-exp trick for numerical stability to compute the denominator:
        # log(P(x_i)) = log(sum_k pi_k * P(x_i | z_k))
        log_denominator = np.logaddexp.reduce(log_numerator, axis=1)

        # Responsibilities (gamma_ik): log(P(z_ik=k | x_i)) = log(Numerator) - log(Denominator)
        log_gamma = log_numerator - log_denominator[:, np.newaxis]
        gamma = np.exp(log_gamma)

        # Enforce responsibilities sum to 1 (due to floating point issues)
        gamma /= np.sum(gamma, axis=1, keepdims=True)
        
        return gamma, np.sum(log_denominator)

    def _m_step(self, X, gamma):
        """
        Maximization Step: Updates parameters (pi, mu, sigma, cat_probs).
        """
        X_gauss = X[:, :self.n_gauss]
        X_cat = X[:, self.n_gauss:]
        N = X.shape[0]
        D_gauss = self.n_gauss
        
        # N_k: Effective number of data points assigned to expert k (sum of responsibilities)
        N_k = np.sum(gamma, axis=0)

        # Add small epsilon to N_k for stability, especially if a cluster is empty
        epsilon_n = 1e-6 
        
        for k in range(self.k):
            # --- 1. Update Priors (pi_k) ---
            self.pi[k] = N_k[k] / N

            # --- 2. Update Gaussian Expert Parameters ---
            
            # Reshape gamma_k for broadcasting: (N, 1)
            gamma_k = gamma[:, k][:, np.newaxis] 
            
            # Update Mean (mu_k): (1/N_k) * sum_i (gamma_ik * x_i)
            self.gauss_mu[k] = np.sum(gamma_k * X_gauss, axis=0) / (N_k[k] + epsilon_n)
            
            # Update Diagonal Covariance (sigma_diag_k)
            # sigma_diag_k = (1/N_k) * sum_i (gamma_ik * (x_i - mu_k)^2)
            X_mu = X_gauss - self.gauss_mu[k]
            variance_k = np.sum(gamma_k * (X_mu ** 2), axis=0) / (N_k[k] + epsilon_n)
            self.gauss_sigma_diag[k] = variance_k
            
            # Ensure variance is not zero (add stability factor)
            self.gauss_sigma_diag[k][self.gauss_sigma_diag[k] < 1e-6] = 1e-6

            # --- 3. Update Categorical Expert Parameters (log_probs) ---
            for j in range(self.n_cat):
                C_j = self.cat_max_levels[j]
                
                # P_j, c, k = (sum_{i: x_{i,j}=c} gamma_ik) / N_k
                probs_j_k = np.zeros(C_j)
                
                for c in range(C_j):
                    # Find indices where categorical feature j has category c
                    indices_c = (X_cat[:, j].astype(int) == c)
                    # Sum responsibilities for data points belonging to category c
                    sum_gamma_c = np.sum(gamma[indices_c, k])
                    probs_j_k[c] = sum_gamma_c
                
                # Normalize by N_k (sum of responsibilities for expert k)
                probs_j_k /= (N_k[k] + epsilon_n)
                
                # Ensure no zero probabilities (add smoothing factor)
                probs_j_k += 1e-6 
                probs_j_k /= np.sum(probs_j_k)
                
                self.cat_log_probs[j][k] = np.log(probs_j_k)

    def fit(self, X):
        """
        Trains the MoE model using the EM algorithm.
        """
        self._initialize_params(X)
        
        print(f"Starting EM training with K={self.k}...")
        
        previous_log_likelihood = -np.inf
        
        for iteration in range(self.max_iters):
            # E-Step: Calculate responsibilities and data log-likelihood
            gamma, log_likelihood = self._e_step(X)
            
            # M-Step: Update parameters
            self._m_step(X, gamma)
            
            print(f"Iteration {iteration+1}/{self.max_iters} | Log-Likelihood: {log_likelihood:.4f}")
            
            # Check for convergence
            if log_likelihood - previous_log_likelihood < 1e-5:
                print("EM converged.")
                break
            previous_log_likelihood = log_likelihood
        print(f"Final log likelihood {log_likelihood}")

    def predict_proba(self, X):
        """
        Returns the responsibilities (posterior probability of belonging to each expert).
        """
        gamma, _ = self._e_step(X)
        return gamma

    def predict(self, X):
        """
        Predicts the expert index for each data point (hard assignment).
        """
        gamma = self.predict_proba(X)
        return np.argmax(gamma, axis=1)

# --- Main Execution ---

def preprocessCSV(csv, y_name, X_names):
    # Load training data
    records, column_names = tree_funcs.read_csv(csv)
    name_to_idx = {name: i for i, name in enumerate(column_names)}

    # Validate requested columns
    if y_name not in name_to_idx:
        raise SystemExit(f"Target column '{y_name}' not found in {csv}.")
    for xn in X_names:
        if xn not in name_to_idx:
            raise SystemExit(f"Feature column '{xn}' not found in {csv}.")

    # Slice columns by name
    y = records[name_to_idx[y_name]]
    # Split X into continuous and categorical
    continuous = []
    categorical = []
    all_cat_names = []
    for xn in X_names:
        if type(records[name_to_idx[xn]][0]) is float:
            continuous.append(records[name_to_idx[xn]])
        else:
            # Encode the categorical types with a class number instead of a class name
            cats = records[name_to_idx[xn]]
            cat_names = list(np.unique(cats))
            cat_column = [cat_names.index(name) for name in cats]
            # Remember the category names for later printing
            all_cat_names.append(cat_names)
            categorical.append(cat_column)
    return y, continuous, categorical, all_cat_names

def main():
    """
    Usage:
      python k-means.py <data_csv> [<test_csv>] <clusters> <y_col> <X_col1> [<X_col2> ...]
        - <data_csv>: path to training CSV
        - <test_csv>: optional path to a test CSV
        - <clusters>: Number of clusters
        - <y_col>: target column name (exactly as in the CSV header)
        - <X_col*>: one or more feature column names (exactly as in the CSV header)
    """
    if len(sys.argv) < 5:
        raise SystemExit(
            "Args: <train_csv> [<test_csv>] <clusters> <y_col> <X_col1> [<X_col2> ...]"
        )

    train_csv = sys.argv[1]
    if sys.argv[2].endswith('.csv'):
        test_csv = sys.argv[2]
        cur_input = 3
    else:
        test_csv = None
        cur_input = 2
    num_clusters = int(sys.argv[cur_input])
    y_name = sys.argv[cur_input+1]
    X_names = sys.argv[cur_input+2:]

    y, continuous, categorical, all_cat_names = preprocessCSV(train_csv, y_name, X_names)
    # The continuous columns come first, then the categorical
    # Transpose so that the first index accesses the row
    X = np.array(continuous + categorical).T

    ## Mostly LLM generated code below
    # Initialize and train the MoE model
    model = MixtureOfExperts(
        k=num_clusters, 
        n_gauss_features=len(continuous),
        max_iters=50,
        random_state=42
    )
    
    model.fit(X)

    print(f"\n--- Final Model Parameters (K={num_clusters}) ---")
    print(f"Gating Priors (pi): {model.pi}")

    print("\nGaussian Expert Means (mu):")
    for k in range(model.k):
        print(f"  Expert {k}: {model.gauss_mu[k]}")

    print("\nCategorical Expert Log-Probabilities:")
    # Log-probs are stored as log(P(Category | Expert))
    for j in range(model.n_cat):
        print(f"  Feature {len(continuous) + j} (Category Levels: {model.cat_max_levels[j]}):")
        for k in range(model.k):
            # Print exponentiated values for human readability (actual probabilities)
            print(f"    Expert {k} Probabilities: {np.exp(model.cat_log_probs[j][k])}")
        print(f"  Category names: {all_cat_names[j]}")

    # Predict hard cluster assignments for the data points
    assignments = model.predict(X)
    cluster_common.printClusterStatistics(y, assignments)

    if test_csv is not None:
        test_y, test_continuous, test_categorical, test_all_cat_names = preprocessCSV(test_csv, y_name, X_names)
        X_test = np.array(test_continuous + test_categorical).T
        test_assignments = model.predict(X_test)
        cluster_common.printClusterStatistics(test_y, test_assignments)
        # Get the log likelihood of the test data
        gamma, log_likelihood = model._e_step(X_test)
        print(f"Test NLL {-log_likelihood}")

if __name__ == "__main__":
    main()
```

---

## Statistics

```python
import numpy

# Let's get the statistics
def printClusterStatistics(y, clusters):
    """
    Arguments:
        y (list): List of class labels
        clusters (list[int]): List of assigned clusters.
    """
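    classes, class_counts = numpy.unique(y, return_counts=True)
    # Work with the cluster ids that actually occur, so an unused id
    # cannot leave rows unlabeled or produce an empty cluster
    clusters = numpy.asarray(clusters)
    cluster_ids = numpy.unique(clusters)
    num_clusters = len(cluster_ids)

    # Remember which rows were assigned to each cluster
    cluster_indices = []
    for cluster_id in cluster_ids:
        cluster_indices.append(numpy.where(clusters == cluster_id)[0])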

    # Assign class labels to the clusters based upon the majority of the class in each cluster
    y_hat = ['none' for _ in range(len(y))]
    for c_idx in range(num_clusters):
        indices = cluster_indices[c_idx]
        actual_classes = [y[index] for index in indices]
        cluster_classes, cluster_counts = numpy.unique(actual_classes, return_counts=True)
        majority_index = numpy.argmax(cluster_counts)

        # Mark everything in the cluster with the majority class
        for idx in indices:
            y_hat[idx] = cluster_classes[majority_index]

    def class_match(a, b):
        return [i for i in range(len(a)) if a[i] == b[i]]

    def class_match_mag(a, b):
        return len(class_match(a, b))

    accuracy = class_match_mag(y, y_hat)/len(y)
    print(f"Accuracy is {accuracy}")

    confusion = []
    # Each row of the confusion matrix is the real class
    for real_class in classes:
        # Each column of the confusion matrix is the predicted class
        for predicted_class in classes:
            total = 0
            for i in range(len(y_hat)):
                if y_hat[i] == predicted_class and y[i] == real_class:
                    total += 1
            confusion.append(total)

    true_positives = confusion[0] + confusion[4] + confusion[8]
    false_positives = sum(confusion[1:4]) + sum(confusion[5:8])
    precision = true_positives / (true_positives + false_positives)

    # Print the confusion matrix
    print("Raw confusion")
    print(f"{len(y)}\t\t{classes[0]}\t{classes[1]}\t{classes[2]}")
    print(f"{classes[0]}\t\t{confusion[0]}\t{confusion[1]}\t\t{confusion[2]}")
    print(f"{classes[1]}\t{confusion[3]}\t{confusion[4]}\t\t{confusion[5]}")
    print(f"{classes[2]}\t\t{confusion[6]}\t{confusion[7]}\t\t{confusion[8]}")
    print("Percentage confusion")
    print(f"{len(y)}\t\t{classes[0]}\t{classes[1]}\t{classes[2]}")
    print(f"{classes[0]}\t\t{confusion[0]/len(y):.3f}\t{confusion[1]/len(y):.3f}\t\t{confusion[2]/len(y):.3f}")
    print(f"{classes[1]}\t{confusion[3]/len(y):.3f}\t{confusion[4]/len(y):.3f}\t\t{confusion[5]/len(y):.3f}")
    print(f"{classes[2]}\t\t{confusion[6]/len(y):.3f}\t{confusion[7]/len(y):.3f}\t\t{confusion[8]/len(y):.3f}")
    print("The diagonal over the sum of each row is the recall")
    print("The diagonal over the sum of each column is the precision")
```

---

## Parameters

* Going to use all of the continuous columns, plus the island
* Let's also try with and without sex
* Going to begin with 3 clusters

---

## Output

<pre>
Gaussian Expert Means (mu):
  Expert 0: [  38.02292033   17.77439908  187.15744389 3446.58789809]
  Expert 1: [  47.56806336   14.9966457   217.23528564 5092.43770281]
  Expert 2: [  45.26693581   18.85898489  195.82302063 3934.40100864]

Categorical Expert Log-Probabilities:
  Feature 4 (Category Levels: 3):
    Expert 0 Probabilities: [0.31865682 0.3915926  0.28975057]
    Expert 1 Probabilities: [9.99998000e-01 1.00005506e-06 1.00039193e-06]
    Expert 2 Probabilities: [0.1129877  0.72483876 0.16217354]
  Category names: ['Biscoe', 'Dream', 'Torgersen']
</pre>

---

## Prediction Results

<pre>
Accuracy is 0.8258258258258259
Raw confusion
333		Adelie	Chinstrap	Gentoo
Adelie		93	53		0
Chinstrap	5	63		0
Gentoo		0	0		119
Percentage confusion
333		Adelie	Chinstrap	Gentoo
Adelie		0.279	0.159		0.000
Chinstrap	0.015	0.189		0.000
Gentoo		0.000	0.000		0.357
</pre>

---

## Performance

* Adelie island probabilities were 30.1%, 37.7%, and 32.1% (Biscoe, Dream, Torgersen). Similar to cluster 0.
* All Gentoo are from Biscoe. Cluster 1 matches.
* All Chinstrap are from Dream. Cluster 2 is close.
* Better than k-means (72%) but we can do better
  * Maybe we've made a bad assumption
  * Let's add in the sex column

---

## New results

<pre>
Gating Priors (pi): [0.3287228 0.3573577 0.3139195]

Gaussian Expert Means (mu):
  Expert 0: [  40.10380632   17.57828757  188.68964113 3414.26111317]
  Expert 1: [  47.56805878   14.99664074  217.23527266 5092.43587451]
  Expert 2: [  43.99517973   19.20019548  195.30382982 4029.34630019]

Categorical Expert Log-Probabilities:
  Feature 4 (Category Levels: 3):
    Expert 0 Probabilities: [0.20989062 0.56509639 0.225013  ]
    Expert 1 Probabilities: [9.99998000e-01 1.00015477e-06 1.00018852e-06]
    Expert 2 Probabilities: [0.20112228 0.58489147 0.21398625]
  Category names: ['Biscoe', 'Dream', 'Torgersen']
  Feature 5 (Category Levels: 2):
    Expert 0 Probabilities: [0.94314441 0.05685559]
    Expert 1 Probabilities: [0.48739524 0.51260476]
    Expert 2 Probabilities: [0.03595822 0.96404178]
  Category names: ['female', 'male']
Accuracy is 0.7957957957957958
</pre>

---

## Statistics

<pre>
Raw confusion
333		Adelie	Chinstrap	Gentoo
Adelie		146	0		0
Chinstrap	68	0		0
Gentoo		0	0		119
Percentage confusion
333		Adelie	Chinstrap	Gentoo
Adelie		0.438	0.000		0.000
Chinstrap	0.204	0.000		0.000
Gentoo		0.000	0.000		0.357
</pre>

---

## Discussion

* Even worse!
* But look at the cluster probabilities for sex
  * Only one cluster is an even mix
  * Of the other two, one is all female and the other is all male

---

## Cluster Count

* Perhaps the number of clusters is wrong?
  * Intuitively, the males and females of a species could have large differences
* If we are starving for experts, some of them will make bad estimates
  * We could say that this model is underparameterized, and is underfitting

---

## Increasing Clusters

</div>
</div>

---

## 6 Clusters, all categories

<pre>
Categorical Expert Log-Probabilities:
  Feature 4 (Category Levels: 3):
    Expert 0 Probabilities: [0.29060101 0.37712167 0.33227733]
    Expert 1 Probabilities: [1.03474350e-06 9.99997860e-01 1.10506899e-06]
    Expert 2 Probabilities: [0.21412767 0.31222466 0.47364767]
    Expert 3 Probabilities: [9.99997999e-01 1.00002301e-06 1.00100198e-06]
    Expert 4 Probabilities: [0.4195082  0.45302936 0.12746244]
    Expert 5 Probabilities: [1.00012925e-06 9.99997999e-01 1.00069388e-06]
  Category names: ['Biscoe', 'Dream', 'Torgersen']
  Feature 5 (Category Levels: 2):
    Expert 0 Probabilities: [9.99998715e-01 1.28523896e-06]
    Expert 1 Probabilities: [0.95618676 0.04381324]
    Expert 2 Probabilities: [2.07100284e-06 9.99997929e-01]
    Expert 3 Probabilities: [0.48739527 0.51260473]
    Expert 4 Probabilities: [0.04048937 0.95951063]
    Expert 5 Probabilities: [1.65357340e-06 9.99998346e-01]
  Category names: ['female', 'male']
Accuracy is 0.993993993993994
</pre>

---

## Statistics

<pre>
Raw confusion
333		Adelie	Chinstrap	Gentoo
Adelie		145	1		0
Chinstrap	1	67		0
Gentoo		0	0		119
Percentage confusion
333		Adelie	Chinstrap	Gentoo
Adelie		0.435	0.003		0.000
Chinstrap	0.003	0.201		0.000
Gentoo		0.000	0.000		0.357
</pre>

---

## Real Progress

* This is far better than k-means, which tops out at 72% accuracy from 3-20 clusters
* Even without any categories we get nearly 99% accuracy
  * This is because each cluster now estimates its own variances (a diagonal covariance in the code)
* Adding in both categories does the best, with 2 incorrect
  * But is poor until we have enough clusters to fit the data

---

## Choosing Cluster Numbers

* Here we could look at the probabilities and see something was wrong
  * The final result had clusters that were male or female
  * With too few clusters the populations were clearly underfit
* Without labels though, how would we know when to stop adding clusters?
  * We should look at the final NLL

---

## Rookie Mistake

* You need a testing set, or the error will keep going down forever
* Even with unsupervised learning, it's still important to keep a holdout set
* We can re-use the randomized penguin train and test set from an earlier class
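
A rough sketch of the holdout idea on toy data, assuming a tiny 1-D Gaussian mixture rather than the full mixed model from the lecture. All names here (`log_joint`, `fit_gmm_1d`, `nll`) and the synthetic two-blob data are hypothetical.

```python
import numpy as np

def log_joint(x, pi, mu, var):
    """log(pi_k * N(x_n | mu_k, var_k)) for every sample n and component k."""
    return (np.log(pi) - 0.5 * np.log(2 * np.pi * var)
            - 0.5 * (x[:, None] - mu) ** 2 / var)

def fit_gmm_1d(x, k, iters=100, seed=0):
    """Tiny 1-D Gaussian mixture fitted with EM; returns (pi, mu, var)."""
    rng = np.random.default_rng(seed)
    pi = np.full(k, 1.0 / k)
    mu = rng.choice(x, size=k, replace=False)
    var = np.full(k, x.var())
    for _ in range(iters):
        # E-step: responsibilities computed in log space
        lj = log_joint(x, pi, mu, var)
        r = np.exp(lj - np.logaddexp.reduce(lj, axis=1, keepdims=True))
        # M-step: update priors, means, and variances
        n_k = r.sum(axis=0) + 1e-9
        pi = n_k / n_k.sum()
        mu = (r * x[:, None]).sum(axis=0) / n_k
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / n_k + 1e-6
    return pi, mu, var

def nll(x, params):
    """Negative log likelihood of x under the fitted mixture."""
    return -np.logaddexp.reduce(log_joint(x, *params), axis=1).sum()

rng = np.random.default_rng(1)

def sample(n):
    # Two synthetic populations, n points each
    return np.concatenate([rng.normal(0.0, 1.0, n), rng.normal(6.0, 1.0, n)])

train, test = sample(150), sample(50)
train_nll, test_nll = {}, {}
for k in range(1, 5):
    params = fit_gmm_1d(train, k)
    train_nll[k] = nll(train, params)
    test_nll[k] = nll(test, params)
    print(k, round(train_nll[k], 1), round(test_nll[k], 1))
```

The number to watch is the held-out NLL: it should drop sharply from one to two components on this data, while further components are not expected to help much on the test set.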

---

## Final NLL

<img style="width: 100%" class="r-stretch" src="./figures/mixture_of_experts_test_error.png" />

---

## Test Predictions

<pre>
Accuracy is 1.0
Raw confusion
33		Adelie	Chinstrap	Gentoo
Adelie		16	0		0
Chinstrap	0	6		0
Gentoo		0	0		11
Percentage confusion
33		Adelie	Chinstrap	Gentoo
Adelie		0.485	0.000		0.000
Chinstrap	0.000	0.182		0.000
Gentoo		0.000	0.000		0.333
The diagonal over the sum of each row is the recall
The diagonal over the sum of each column is the precision
Test NLL 414.0106456498475
</pre>

---

## Final Thoughts

* If columns are correlated, then there are diminishing returns with more features
* That implies that we should be able to reduce columns into "meta columns", distilling features to components
* Next time!