<!--
Abstract:

CS 461
Introduction to Deep Learning
Lecture 08
-->

# CS 461 - Lecture 08

## Machine Learning Principles

Bernhard Firner

2025-10-01

---

## Reading

* Recommended reading
  * Machine Learning: A Probabilistic Perspective by Murphy
    * 12.2-12.3
  * Machine Learning: The Art and Science of Algorithms that Make Sense of Data by Flach
    * 10.3

---

## Review - Mixture Model

* Used a mixture of experts to cluster penguins from the [palmer penguins dataset](https://github.com/allisonhorst/palmerpenguins)
  * Experts were gaussian and categorical
* This was an unsupervised technique
  * Imagined that penguins were generated by some random process
    * a mix of gaussian and categorical
  * Estimated the parameters for those processes and assigned penguins to the process most likely to make them

---

## Latent Variable Estimation

* This was an example of latent variable estimation
* Each cluster was assigned a loss that estimates those unseen parameters from the dataset
* For Gaussian, the cluster means and variances, $\mu_k$ and $\Sigma_k$
  * $l(\mu_k,\Sigma_k) = -\frac{1}{2}\sum_i r_{ik}[\log|\Sigma_k|+(x_i-\mu_k)^T\Sigma_k^{-1}(x_i-\mu_k)]$
* For categorical, the probabilities of each category, $c$
  * $l(p_k) = \sum_i -r_{ik}\log(p_k(c_i))$
* Where $r_{ik}$ assigned each point, $i$, into a cluster, $k$
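
As a rough sketch of how these two quantities could be evaluated for a single cluster, the NumPy snippet below computes the responsibility-weighted Gaussian term and the categorical term. The function names, the synthetic data, and the responsibilities are all assumptions made for illustration; they are not the penguin data or the course code.

```python
import numpy as np

def gaussian_cluster_loglik(X, r_k, mu_k, sigma_k):
    """Responsibility-weighted Gaussian log-likelihood for cluster k (constants dropped)."""
    diff = X - mu_k                                   # (N, D) offsets x_i - mu_k
    _, logdet = np.linalg.slogdet(sigma_k)            # log|Sigma_k|
    # Quadratic form (x_i - mu_k)^T Sigma_k^{-1} (x_i - mu_k) for every point i
    maha = np.einsum("nd,de,ne->n", diff, np.linalg.inv(sigma_k), diff)
    return -0.5 * np.sum(r_k * (logdet + maha))

def categorical_cluster_loss(c, r_k, p_k):
    """Responsibility-weighted negative log-likelihood of the category labels c."""
    return np.sum(-r_k * np.log(p_k[c]))

# Synthetic stand-in data (hypothetical): 6 points, 2 continuous columns, 3 categories
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))
c = np.array([0, 1, 1, 0, 2, 1])
r_k = np.array([1.0, 1.0, 0.5, 0.0, 0.25, 0.5])      # responsibilities of cluster k

# Responsibility-weighted estimates of the cluster parameters
mu_k = (r_k[:, None] * X).sum(axis=0) / r_k.sum()
diff = X - mu_k
sigma_k = (diff * r_k[:, None]).T @ diff / r_k.sum()
p_k = np.bincount(c, weights=r_k, minlength=3) / r_k.sum()

best = gaussian_cluster_loglik(X, r_k, mu_k, sigma_k)
shifted = gaussian_cluster_loglik(X, r_k, mu_k + 0.5, sigma_k)
cat_loss = categorical_cluster_loss(c, r_k, p_k)
print(best, shifted, cat_loss)
```

Moving the mean away from the weighted average lowers the Gaussian term, which is why the weighted average is the estimate used for $\mu_k$.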

---

## Learned Parameters

* Look at the Gaussians
* Each expert (cluster) learned a set of means and variances
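* Cluster 1:

<table>
<tr><td></td><td>bill_length_mm</td><td>bill_depth_mm</td><td>flipper_length_mm</td><td>body_mass_g</td></tr>
<tr><td>Mean</td><td>37.25666712</td><td>17.54889268</td><td>187.7646787</td><td>3355.3715975</td></tr>
<tr><td>Variance</td><td>4.20934023</td><td>0.75753203</td><td>30.45701825</td><td>68664.83768671</td></tr>
</table>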

---

## Advantages over K-Means

* Variances allow clusters to have a shape
  * e.g. bill length variances were from 2 to 9mm for different clusters
* Soft clustering via the responsibilities gives a better estimate of how likely each datapoint is under each cluster
* Support for categorical data

---

## Any Weaknesses?

* Assigned one latent variable to each column of data
  * Remember, we began these expert mixing models to learn the hidden variables of the dataset
* Why is one variable per column bad?
  * It could be an over-complicated description of the data

---

## Reducing Covariant Data

* Often, we see data with hundreds of columns
  * Images are an obvious example
    * Each pixel is another column of data
* If some of those columns are strongly correlated, then we are looking at redundant information

---

## Dependence

---

## Harmful?

* What's the harm in estimating multiple partially dependent variables?
  * In general, the more parameters we have to learn, the more data we'll need
  * We haven't been looking at noisy data, but noisy data is the norm
* Noise will have a greater impact on an unnecessarily complicated model, compared to a simpler one
  * This was the idea behind regularization
  * And remember that simple linear regression resisted noise without a problem

---

## Idea: Projection

---

## Combining Redundant Columns

* If x and y are related, reorient the graph along the axis with the most variation
* Additionally, if there is little variation in one dimension, drop it

---

## Dimensionality Reduction

* What we are doing is dimensionality reduction
  * Find a way to map the initial features onto a new set of features
  * The `visible space` is the original set of axes
  * The `latent space` is the new one, made of hidden axes
* The latent space may not capture everything in the visible space

---

## Projection

* When we converted the x,y points into a single variable, we were projecting them into a new space
* If there is uncorrelated signal that we didn't include in the projection, that would be the `reconstruction error` or `distortion`
* We should be able to add variables until that reconstruction error is 0
  * Usually this means going back to the original dimensionality
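
A small sketch of this idea on synthetic 2D data, where `y` is nearly a multiple of `x`. The data, the two hand-picked directions, and the helper name `reconstruction_error` are assumptions for illustration only. Projecting onto one direction leaves a small reconstruction error, and adding the second direction brings it to zero.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 2.0 * x + rng.normal(scale=0.1, size=500)    # y is almost exactly 2x
points = np.column_stack([x, y])
points -= points.mean(axis=0)                    # center the data

# Two orthonormal directions: along the trend and perpendicular to it
along = np.array([1.0, 2.0]) / np.sqrt(5.0)
across = np.array([-2.0, 1.0]) / np.sqrt(5.0)

def reconstruction_error(points, basis):
    """Mean squared distance between the points and their reconstruction."""
    latent = points @ basis          # visible space -> latent space
    restored = latent @ basis.T      # latent space -> visible space
    return np.mean(np.sum((points - restored) ** 2, axis=1))

err_one = reconstruction_error(points, along[:, None])
err_two = reconstruction_error(points, np.column_stack([along, across]))
print(f"one latent variable: {err_one:.5f}, two latent variables: {err_two:.2e}")
```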

---

## Approach

* We could use linear regression to reduce our dimensions one by one
  * Reorient the data along the best-fit vector, reducing dimensions by 1 at each step
* Too hacky
* No guarantees

---

## Actual Approach

* Instead, we want each new vector to be orthogonal to the others
  * The first vector will capture as much variance as possible along its dimension
  * Subsequent vectors won't find anything there and should be oriented along a new dimension

---

## Illustration

---

## Principal Component Analysis

* PCA is the most widely used dimensionality reduction technique
* Based upon eigenvalues and eigenvectors of the covariance matrix

---

## Covariance Matrix

* Covariance of $X_i$ and $X_j$ is $\mathbb{E}[(X_i-\mathbb{E}[X_i])(X_j-\mathbb{E}[X_j])]$
* We estimate with $Cov_{jk} = \frac{1}{N}\sum_{i=1}^N(X_{ij}-\mu(X_j))(X_{ik} - \mu(X_k))$
* The covariance matrix has $Cov_{jk}$ at position j,k
  * The diagonal is filled with the variances of each column
* How is this useful?
  * If two columns are correlated, then they have some redundancy
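
The estimate can be written in a few lines of NumPy. This is a minimal sketch on synthetic data; the function name `covariance_matrix` and the three made-up columns are assumptions. It uses the $\frac{1}{N}$ normalization from the formula, which corresponds to `bias=True` in `np.cov`.

```python
import numpy as np

def covariance_matrix(X):
    """Covariance estimate with 1/N normalization; X has one row per sample."""
    centered = X - X.mean(axis=0)            # subtract each column's mean
    return centered.T @ centered / X.shape[0]

# Synthetic data (hypothetical): column 1 depends on column 0, column 2 is independent
rng = np.random.default_rng(1)
a = rng.normal(size=200)
X = np.column_stack([a, 3.0 * a + rng.normal(size=200), rng.normal(size=200)])

cov = covariance_matrix(X)
print(cov)
```

The off-diagonal entry for columns 0 and 1 is large relative to the entries involving column 2, which is the redundancy signal the slide refers to.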

---

## Correlation Coefficient

* Before we use the covariance matrix, it will be more interpretable if we normalize it into the correlation matrix
* This is the same as standardizing the data first, making each axis 0 mean and unit variance, and then computing the covariance
* After that, the matrix will have values in the range -1 to 1
  * -1 is perfectly anti-correlated
  * 1 is perfectly correlated
  * 0 is completely uncorrelated
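
A quick check of that equivalence, sketched on synthetic data (the columns and their scales are made up for illustration): the covariance of the standardized columns matches `np.corrcoef`.

```python
import numpy as np

# Synthetic data (hypothetical): column 1 is anti-correlated with column 0,
# column 2 is independent and on a much larger scale
rng = np.random.default_rng(2)
a = rng.normal(size=300)
X = np.column_stack([a, -5.0 * a + rng.normal(size=300), 100.0 * rng.normal(size=300)])

# Standardize: zero mean and unit variance per column
standardized = (X - X.mean(axis=0)) / X.std(axis=0)

# Covariance of the standardized data (1/N normalization)
corr_from_standardized = standardized.T @ standardized / X.shape[0]
corr_numpy = np.corrcoef(X, rowvar=False)

print(np.round(corr_numpy, 3))
```

The different column scales no longer matter after standardizing, so the entries can be compared directly.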

---

## Examples

* The correlation matrix of this data is:

* X and Y are strongly correlated, which makes this a great candidate for dimensionality reduction.

---

## Examples

* The correlation matrix of this data is:

* Practically only a single dimension

---

## Examples

* The correlation matrix of this data is:

* Not a good candidate for dimensionality reduction

---

## PCA Setup

* We want to approximate each point, $x_n$, with a combination of basis functions, $w_1, ..., w_L$
* The new point obtained by transforming $x_n$ will be called $z_n$
* $x_n \approx \sum_{k=1}^L z_{nk}w_k$

---

## PCA Error

* Error is the average squared Euclidean distance between each point $x_n$ and its reconstruction $Wz_n$
* $L(W, Z) = \frac{1}{N}\sum_{n=1}^{N}\lVert x_n - Wz_n \rVert^2$
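
A minimal sketch of evaluating this loss, assuming an orthonormal basis so that the best code for each point is $z_n = W^T x_n$. The basis here comes from a QR factorization of a random matrix, so it is an arbitrary orthonormal basis and not yet the PCA solution; the data and the name `pca_loss` are also assumptions.

```python
import numpy as np

def pca_loss(X, W, Z):
    """Mean squared distance between each x_n and its reconstruction W z_n."""
    X_hat = Z @ W.T                      # row n holds the reconstruction of x_n
    return np.mean(np.sum((X - X_hat) ** 2, axis=1))

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 3))
X = X - X.mean(axis=0)

# Hypothetical basis: orthonormal columns from a QR factorization
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))

losses = []
for L in (1, 2, 3):
    W = Q[:, :L]          # D x L matrix whose columns are w_1 .. w_L
    Z = X @ W             # with orthonormal W, the best z_n is W^T x_n
    losses.append(pca_loss(X, W, Z))

print(losses)
```

The loss shrinks as basis vectors are added and vanishes once $L$ equals the original dimensionality.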

---

## PCA Algorithm

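* Center the data by subtracting the mean
  * Also divide by the standard deviation if you want to work with the correlation matrix
* Compute the covariance matrix
* Calculate the eigenvalues and eigenvectors of that matrix
* Sort eigenvectors by eigenvalue, largest first
* Choose as many as you want to make a new basis set
* The variance explained by each chosen eigenvector is its eigenvalue (divide by the sum of all eigenvalues to get a fraction)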

---

## Why?

* The covariance matrix is symmetric, so its eigenvectors can be chosen to be orthonormal
  * See the Spectral Theorem
  * Our basis functions should be orthogonal (and therefore linearly independent), and this guarantees it
* Choosing the largest eigenvalue first gets the direction of largest variance
  * This better "explains" the data
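
Both claims can be checked numerically. The sketch below uses synthetic data (the mixing matrix and random direction are made up for illustration) and `np.linalg.eigh`, which is intended for symmetric matrices and returns eigenvalues in ascending order.

```python
import numpy as np

# Synthetic correlated data (hypothetical)
rng = np.random.default_rng(4)
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.2]])
X = rng.normal(size=(500, 3)) @ mixing
X = X - X.mean(axis=0)

cov = X.T @ X / X.shape[0]
eigenvalues, eigenvectors = np.linalg.eigh(cov)   # ascending eigenvalues

# Orthonormal eigenvectors: V^T V should be the identity
gram = eigenvectors.T @ eigenvectors

# The variance of the data along each eigenvector equals its eigenvalue
projected_variance = (X @ eigenvectors).var(axis=0)

# Any other unit direction captures no more variance than the top eigenvector
random_direction = rng.normal(size=3)
random_direction /= np.linalg.norm(random_direction)
random_variance = (X @ random_direction).var()

print(eigenvalues, projected_variance, random_variance)
```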

---

## Implementation

```python
#! /usr/bin/python3

import sys

import numpy as np

# Helper module; only its read_csv function is used here
import tree_funcs

def preprocessCSV(csv, X_names):
    """
    Args:
        csv (str): The path to a file containing columns of data.
        X_names (list[str]): Names of the columns to return.
    Returns:
        list(columns)
    """
    # Load training data
    records, column_names = tree_funcs.read_csv(csv)
    name_to_idx = {name: i for i, name in enumerate(column_names)}

    # Validate requested columns
    for xn in X_names:
        if xn not in name_to_idx:
            raise SystemExit(f"Feature column '{xn}' not found in {csv}.")
    X = np.array([records[name_to_idx[xname]] for xname in X_names])
    return X

def pca(num_basis, X):
    """
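    Performs PCA through an eigendecomposition of the correlation matrix.

    Args:
        num_basis (int): The number of basis functions to keep.
        X     (ndarray): The data, with one column per feature.
    Returns:
        (components, explained_variance): The kept eigenvectors as columns, and the
        fraction of variance explained by every eigenvector, largest first.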
    """
    ## We could manually standardize. Or just call numpy.corrcoef
    #mean = np.mean(X, axis=0)
    #std = np.std(X, axis=0)
    #X_centered = X - mean
    #X_standardized = X_centered / std
    #correlation_matrix = np.cov(X_standardized, rowvar=False)

    # Get the correlation matrix
    correlation_matrix = np.corrcoef(X, rowvar=False)
    
    # Eigenvalue Decomposition
    eigenvalues, eigenvectors = np.linalg.eigh(correlation_matrix)
    
    # Sort and check from greatest to smallest
    sorted_indices = np.argsort(eigenvalues)[::-1]
    sorted_eigenvalues = eigenvalues[sorted_indices]
    sorted_eigenvectors = eigenvectors[:, sorted_indices]
    
    # Get the explained variance
    total_variance = np.sum(eigenvalues)
    explained_variance = sorted_eigenvalues / total_variance

    # Return the components
    components = sorted_eigenvectors[:, :num_basis]
    return components, explained_variance

if __name__ == "__main__":
    if len(sys.argv) < 4:
        print("Provide the data file, the number of basis functions, and the column names as arguments.")
    else:
        X = preprocessCSV(sys.argv[1], sys.argv[3:]).T
        #print(X_centered)
        print(f"The correlation coefficients of X are {np.corrcoef(X, rowvar=False)}")
        components, explained = pca(int(sys.argv[2]), X)
        print(f"Explained {explained} ({explained.sum()})")
        # Print out results
        print(components)
        # Print out new columns
        mean = np.mean(X, axis=0)
        std = np.std(X, axis=0)
        # Handle features with zero variance (to prevent division by zero)
        epsilon = 1e-8
        std[std == 0] = epsilon
        X_standardized = (X - mean) / std
        new_columns = np.dot(X_standardized, components)
        print(f"corrcoef of new columns is {np.corrcoef(new_columns, rowvar=False)}")
        #for i in range(new_columns.shape[0]):
        #    print(",".join([str(val) for val in new_columns[i]]))
```

---

## Penguin Data

* Is PCA worth doing?
* Let's look at the correlation coefficients of the continuous columns

* Some high values, let's try PCA

---

## Basis Vectors

* Each vector transforms an original point
* With 4, we are just rotating the original data
* Here are the coefficients:

<table>
<tr><td>Basis</td><td colspan=4></td></tr>
<tr><td>1</td><td>-0.45375317</td><td>0.6001949 </td><td> 0.64249509</td><td> 0.14516955</td></tr>
<tr><td>2</td><td> 0.39904723</td><td>0.79616951</td><td>-0.42580043</td><td>-0.1599044 </td></tr>
<tr><td>3</td><td>-0.576825  </td><td>0.00578817</td><td>-0.23609516</td><td>-0.78198369</td></tr>
<tr><td>4</td><td>-0.54967471</td><td>0.07646366</td><td>-0.59173738</td><td> 0.58468615</td></tr>
</table>

---

## Explained Variance

* The basis vectors explain decreasing amounts of the variance
  * 0.68633893, 0.19452929, 0.09216063, 0.02697115
  * All four sum to 1 (they reconstruct the dataset)
  * But we could drop some
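
One common way to decide how many to drop is to look at the cumulative explained variance. The sketch below uses the four fractions listed above; the 95% threshold is an arbitrary choice for illustration, not a value from the lecture.

```python
import numpy as np

# Explained variance fractions from the slide
explained = np.array([0.68633893, 0.19452929, 0.09216063, 0.02697115])
cumulative = np.cumsum(explained)

threshold = 0.95   # hypothetical cutoff
# Index of the first cumulative value that reaches the threshold, plus one
num_components = int(np.searchsorted(cumulative, threshold) + 1)

print(cumulative)
print(f"Keep {num_components} basis vectors to explain at least {threshold:.0%}")
```

With these values, the first two basis vectors explain about 88% of the variance and the first three about 97%, so a 95% cutoff keeps three.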

---

## Basis 1v2

---

## Basis 1v3

---

## Basis 2v3

---

## Uses

* Dimensionality reduction is the most common use-case
* Also useful as a precursor to clustering
  * Why? Because distances are approximately preserved
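
A rough illustration of the distance claim, assuming synthetic 10-column data that really only has two underlying dimensions plus a little noise (all values here are made up). After projecting onto the top two principal components, pairwise distances barely change.

```python
import numpy as np

# Synthetic data (hypothetical): 2 hidden dimensions spread over 10 noisy columns
rng = np.random.default_rng(5)
latent = rng.normal(size=(100, 2))
X = latent @ rng.normal(size=(2, 10)) + 0.01 * rng.normal(size=(100, 10))
X = X - X.mean(axis=0)

# Top two principal components of the covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(X.T @ X / X.shape[0])
W = eigenvectors[:, ::-1][:, :2]      # reorder to largest first, keep two
Z = X @ W                             # reduced representation

def pairwise_distances(points):
    """Euclidean distance between every pair of rows."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=-1))

d_full = pairwise_distances(X)
d_reduced = pairwise_distances(Z)
max_gap = np.max(np.abs(d_full - d_reduced))
print(f"mean distance {d_full.mean():.3f}, largest change {max_gap:.4f}")
```

The preservation is only this good because the discarded components carry little variance; dropping components with large eigenvalues would distort distances more.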

---

## More on Clustering

* Mentioned `the curse of dimensionality` before
  * KNN struggles to find neighbors as dimensions increase
* By reducing dimensions, KNN can work again
* This lets us use KNN on anything that produces an embedding

---

## Common Current Use Case

* Take a black box that compresses high dimensional data to something smaller
* Wave the black box around objects of interest
  * This is your feature discovery phase
* Take PCA of those discovered features
  * Now you can use clustering to find similarities

---

## The black box

* What is the black box?
  * Nowadays, a neural network
  * We'll see later that they are excellent at compressing high-dimensional features to something reasonable

<!--
## Exercises

* For practice
  * Use PCA on the penguin data, then see if multiple logistic regression can now separate the classes
-->

---

## Sample Questions

When is it not appropriate to use PCA?

a. When there are only a few columns of data.

b. When all of your data columns are orthogonal.

c. When there are too many columns of data.

d. When the correlation matrix values are high.

---

## Sample Questions

Which of the following is an unsupervised technique?

a. Decision Trees

b. Logistic Regression

c. PCA

d. Least Squares Regression

---

## Sample Questions

Which of the following is an unsupervised technique?

a. Decision Trees

b. Logistic Regression

c. Mixture of Experts

d. Least Squares Regression

---

## Sample Questions

Which of the following statements about unsupervised learning is **true**?

a. Unsupervised learning does not need test and validation sets.

b. Unsupervised learning can be applied to more data than supervised learning.

c. Unsupervised learning algorithms are always slower than supervised learning.

d. Supervised learning techniques have a stronger mathematical foundation for their approach.


---

## Sample Questions

Which statements about the K-Means algorithm are **false**?

a. K-Means is not guaranteed to converge on the best split.

b. K-Means works poorly when variables are correlated.

c. K-Means deals well with overlapping clusters of different classes.

d. K-Means uses hard clustering, assigning each point to the nearest cluster center.


---

## Sample Questions

Which is **false** about soft clustering?

a. Soft clustering is used in mixture of expert models.

b. Soft clustering is used during model training and not during classification.

c. If two cluster means are equidistant, the cluster with the lower variance has higher responsibility for the point.

d. Soft clustering is used in K-Means.
