<!--
Abstract:

CS 461
Introduction to Deep Learning
Lecture 09
-->

# CS 461 - Lecture 09

## Machine Learning Principles

Bernhard Firner

2025-10-06

---

## Reading

* Recommended reading
  * Machine Learning: The Art and Science of Algorithms that Make Sense of Data by Flach
    * 11.1
  * Machine Learning: A Probabilistic Perspective by Murphy
    * 16.2 (but bagging info is sparse)

---

## Review - Latent Spaces

* Latent Spaces
  * Our observed data is the `visible space`
  * Just a reflection of the `latent space`, which is a more meaningful representation
* Mixture of expert models assumed one random variable per data axis
  * But in reality they are correlated

---

## Review - PCA

* We want to find a "more meaningful" version of our data
* Solution: Principal Component Analysis (PCA)
  * `Project` our data along a new set of axes
  * Possibly fewer, leading to dimensionality reduction

---

## Covariance Matrix

* Covariance of $X_i$ and $X_j$ is $\mathbb{E}[(X_i-\mathbb{E}[X_i])(X_j-\mathbb{E}[X_j])]$
* We estimate with $Cov_{jk} = \frac{1}{N}\Sigma_{i=1}^N(X_{ij}-\mu(X_j))(X_{ik} - \mu(X_k))$
* The covariance matrix has $Cov_{jk}$ at position j,k
  * The diagonal is filled with the variances of each column
* How is this useful?
  * If two columns are correlated, then they have some redundancy

---

## Correlation Coefficient

* Before we go about using the correlation matrix, it will be more interpretable if we normalize it
* This is the same as standardizing the data first, making each axis 0 mean and unit variance
* After that, the matrix will have values in the range -1 to 1
  * -1 is perfectly anti-correlated
  * 1 is perfectly correlated
  * 0 is completely uncorrelated

---

## Review - PCA

* PCA algorithm
  * Find the eigenvalues and eigenvectors of the correlation matrix
    * This is the covariance matrix of a standardized dataset
  * Rank the vectors by their eigenvalues
  * Choose a subset
    * The eigenvalues indicate how much of the dataset variance each vector explains

---

## PCA - Visualization

---

## Implementation

```python
#! /usr/bin/python3

import tree_funcs
import math
import numpy as np
import sys

from sklearn.decomposition import PCA

def preprocessCSV(csv, X_names):
    """
    Args:
        csv (str): The path to a file containing columns of data.
        X_names (list[str]): Names of the columns to return.
    Returns:
        list(columns)
    """
    # Load training data
    records, column_names = tree_funcs.read_csv(csv)
    name_to_idx = {name: i for i, name in enumerate(column_names)}

# Validate requested columns
    for xn in X_names:
        if xn not in name_to_idx:
            raise SystemExit(f"Feature column '{xn}' not found in {csv}.")
    X = np.array([records[name_to_idx[xname]] for xname in X_names])
    return X

def pca(num_basis, X):
    """
    Performs PCA
    by manually constructing and solving the normal equations.

Args:
        num_basis   (int): The number of basis functions
        X (list(columns)): The columns of data.
    """
    ## We could manually standardize. Or just call numpy.corrcoef
    #mean = np.mean(X, axis=0)
    #std = np.std(X, axis=0)
    #X_centered = X - mean
    #X_standardized = X_centered / std
    #correlation_matrix = np.cov(X_standardized, rowvar=False)

# Get the correlation matrix
    correlation_matrix = np.corrcoef(X, rowvar=False)
    
    # Eigenvalue Decomposition
    eigenvalues, eigenvectors = np.linalg.eigh(correlation_matrix)
    
    # Sort and check from greatest to smallest
    sorted_indices = np.argsort(eigenvalues)[::-1]
    sorted_eigenvalues = eigenvalues[sorted_indices]
    sorted_eigenvectors = eigenvectors[:, sorted_indices]
    
    # Get the explained variance
    total_variance = np.sum(eigenvalues)
    explained_variance = sorted_eigenvalues / total_variance

# Return the components
    components = sorted_eigenvectors[:, :num_basis]
    return components, explained_variance

if __name__ == "__main__":
    if len(sys.argv) < 3:
        print("Provide the data file, the number of basis functions, and the column names as arguments.")
    else:
        X = preprocessCSV(sys.argv[1], sys.argv[3:]).T
        #print(X_centered)
        print(f"The correlation coefficients of X are {np.corrcoef(X, rowvar=False)}")
        components, explained = pca(int(sys.argv[2]), X)
        print(f"Explained {explained} ({explained.sum()})")
        # Print out results
        print(components)
        # Print out new columns
        mean = np.mean(X, axis=0)
        std = np.std(X, axis=0)
        # Handle features with zero variance (to prevent division by zero)
        epsilon = 1e-8
        std[std == 0] = epsilon
        X_standardized = (X - mean) / std
        new_columns = np.dot(X_standardized, components)
        print(f"corcoef of new columns is {np.corrcoef(new_columns, rowvar=False)}")
        #for i in range(new_columns.shape[0]):
        #    print(",".join([str(val) for val in new_columns[i]]))
```

---

## Lecture 9 Goals

* Return to supervised learning
* Start looking at more realistic datasets
  * Spam, specifically

---

## What is spam?

* Unwanted emails and texts
* Problem?
  * Huge volume, so the approach taken needs to be light
  * Also needs to retrain or patch quickly

---

## How About Naive Bayes?

* The first spam filters were built with naive bayes
  * $p(spam|word)$
  * $p(spam|words) = \prod_I p(spam|word_i)$

---

## Worked Until It Didn't

* Not robust
  * Bayesian poisoning allowed attackers to fool the filters by throwing in some low-correlation words
  * There are always new low-correlation words available, so they could keep evading
* Very costly to classify any non-spam as spam
* Need something with a more robust decision boundary

---

## Fetching Spambase

* https://archive.ics.uci.edu/dataset/94/spambase
  * Download the zip or use code example
* pip install ucimlrepo

```python
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
spambase = fetch_ucirepo(id=94) 
  
# data (as pandas dataframes) 
X = spambase.data.features 
y = spambase.data.targets 
  
# metadata 
print(spambase.metadata) 
  
# variable information 
print(spambase.variables) 
```

---

## Spambase

* Created in June-July 1999
* 4601 entries
* 39.4% legitimate, 60.6% spam
* Data is transformed into 57 continuous variables
  * 48 \[0,100\] word frequency
  * 6 \[0,100\] character frequency
  * capital letter run length average, longest, and total lengths
* Start here

---

## TRECSpam 2007

* 2007: 50199 spam emails, 25220 not spam emails
* 66.5% spam, 33.5% legitimate
* https://www.kaggle.com/datasets/imdeepmind/preprocessed-trec-2007-public-corpus-dataset
* 4 Data Columns
  * subject
  * email to
  * email from
  * message body
* Want to get here

---

## Let's Begin with Spambase

* What words frequencies?
  * address, all, 3dc, our, over, remove, internet, order, mail, receive, will, ...
  * also "george"
    * The dataset comes from an individual's inbox, so it probably won't generalize
* What character frequencies?
  * Brackets, ;, !, \$, #

---

## Some Observations

* Not going to visualize this
* We could make gigantic correlation matrices, but they won't tell us too much

---

## Loading the data

```python
import argparse
import numpy
import os

import pca

def column_names():
    """Return an array of all data column names and the spam label."""
    return ["word_freq_make", "word_freq_address", "word_freq_all", "word_freq_3d", "word_freq_our",
            "word_freq_over", "word_freq_remove", "word_freq_internet", "word_freq_order", "word_freq_mail",
            "word_freq_receive", "word_freq_will", "word_freq_people", "word_freq_report", "word_freq_addresses",
            "word_freq_free", "word_freq_business", "word_freq_email", "word_freq_you", "word_freq_credit",
            "word_freq_your", "word_freq_font", "word_freq_000", "word_freq_money", "word_freq_hp",
            "word_freq_hpl", "word_freq_george", "word_freq_650", "word_freq_lab", "word_freq_labs",
            "word_freq_telnet", "word_freq_857", "word_freq_data", "word_freq_415", "word_freq_85",
            "word_freq_technology", "word_freq_1999", "word_freq_parts", "word_freq_pm", "word_freq_direct",
            "word_freq_cs", "word_freq_meeting", "word_freq_original", "word_freq_project", "word_freq_re",
            "word_freq_edu", "word_freq_table", "word_freq_conference", "char_freq_;", "char_freq_(",
            "char_freq_[", "char_freq_!", "char_freq_$", "char_freq_#", "capital_run_length_average",
            "capital_run_length_longest", "capital_run_length_total", "spam"]

def read_spambase(path):
    """Read the spambase data.
    Arguments:
        path (str): Path to the spambase directory.
    Returns:
        (X, y): The data columns and the class labels
    """
    data_file = os.path.join(path, 'spambase.data')
    try:
        data = numpy.loadtxt(data_file, delimiter=',', skiprows=0, unpack=True)
    except FileNotFoundError:
        print(f"Error: The file '{data_file}' was not found.")
        sys.exit(1)
    except ValueError:
        print(f"Error: The file '{data_file}' does not contain valid numeric data.")
        sys.exit(1)

return data[:-1], data[-1:]

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "path",
        type=str)

args = parser.parse_args()

# TODO Parse unknown arguments to send to ML techniques
    ## args, other_args = parser.parse_known_args()
    # Get the data
    X, y = read_spambase(args.path)

print(X.shape, y.shape)
```

---

## First Steps

* Begin with dimensionality reduction
  * Some of the columns are clearly correlated
    * e.g. average and max run length
    * Max correlation is actually 0.996

---

## PCA Components

---

## Not Great

* Top 3 components are
  * 0.1156, 0.0573, 0.0351
* Bottom 3 components are
  * 0.0046, 0.0033, 0.0000676
* With 57 components, perfectly independent columns would get 0.175 each
  * Only see that after the $20^{th}$ component
* 99\% explained with the $54^{th}$ column

---

## Next Step

* Try logistic regression
  * With and without PCA
  * It's simple and we may learn something

---

## Testing and Validation

* If we're going to try some techniques, we need to measure error
* That means breaking up our dataset into training and test
  * We could also do n-fold validation
    * Break data into N chunks
    * Do N experiments, with n=1,n=2,... held out each time

---

## Pitfalls

* Be aware that this is too naive
  * *we*, the humans will overfit our techniques eventually
  * In addition to the test set, we should have a validation set that we never get to see

---

## Train and Test

```python
    # Split X into training and testing.
    shuffled_order = random.sample(list(range(y.shape[0])), y.shape[0])
    X_test = X.T[shuffled_order[:601]].T
    y_test = y[shuffled_order[:601]]
    X_train = X.T[shuffled_order[601:]].T
    y_train = y[shuffled_order[601:]]
```

---

## Logistic Regression

* Without pca, accuracy is around 0.6 for train and test
* With pca, train and test accuracy are around 0.9
  * learning rate 0.1, 20 epochs, $\lambda=1$
* Fiddling with the parameters doesn't improve much

---

## Logistic Regression

```python
import argparse
import numpy as np
import os
import random
import sys

# ML algorithms that we've used in CS461
import our_ml

def read_spambase(path):
    """Read the spambase data.
    Arguments:
        path (str): Path to the spambase directory.
    Returns:
        (X, y): The data columns and the class labels
    """
    data_file = os.path.join(path, 'spambase.data')
    try:
        data = np.loadtxt(data_file, delimiter=',', skiprows=0, unpack=True)
    except FileNotFoundError:
        print(f"Error: The file '{data_file}' was not found.")
        sys.exit(1)
    except ValueError:
        print(f"Error: The file '{data_file}' does not contain valid numeric data.")
        sys.exit(1)

return data[:-1], data[-1]

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "path",
        type=str)
    parser.add_argument(
        "--pca",
        required=False,
        type=int,
        default=None,
        help="The number of components to take from PCA.")
    parser.add_argument(
        "--model",
        required=False,
        choices=["logistic"],
        default="logistic",
        help="Model to use for classification.")
    parser.add_argument(
        "--learning_rate",
        required=False,
        type=float,
        default=0.01,
        help="The learning rate for the selected algorithm.")
    parser.add_argument(
        "--epochs",
        required=False,
        type=int,
        default=20,
        help="The number of epochs for the selected algorithm.")
    parser.add_argument(
        "--lam",
        required=False,
        type=float,
        default=1.0,
        help="The l2 norm for the selected algorithm.")

args = parser.parse_args()

# TODO Parse unknown arguments to send to ML techniques
    ## args, other_args = parser.parse_known_args()
    # Get the data
    X, y = read_spambase(args.path)

# TODO FIXME Split into training and testing

print(X.shape, y.shape)

if (args.pca is not None):
        # Get all of the components, then we'll choose how many to keep
        # Transpose so each sample is in the first axis
        X = X.T
        first_corr_matrix = np.corrcoef(X, rowvar=False)
        print(f"Maximum column correlation pre PCA is {np.max(first_corr_matrix - np.eye(X.shape[1]))}")
        components, explained = our_ml.pca(args.pca, X)
        for i, expl in enumerate(explained):
            print(f"PCA: {i},{expl},{np.sum(explained[:i+1])/np.sum(explained)}")

mean = np.mean(X, axis=0)
        std = np.std(X, axis=0)
        # Handle features with zero variance (to prevent division by zero)
        epsilon = 1e-8
        std[std == 0] = epsilon
        X_standardized = (X - mean) / std
        new_columns = np.dot(X_standardized, components)
        second_corr_matrix = np.corrcoef(new_columns, rowvar=False)
        print(f"Maximum column correlation post PCA is {np.max(second_corr_matrix - np.eye(args.pca))}")

# Put X back into feature first orientation
        X = new_columns.T
    else:
        # Always standardize at least
        mean = np.mean(X, axis=0)
        std = np.std(X, axis=0)
        # Handle features with zero variance (to prevent division by zero)
        epsilon = 1e-8
        std[std == 0] = epsilon
        X = (X - mean) / std

# Split X into training and testing.
    shuffled_order = random.sample(list(range(y.shape[0])), y.shape[0])
    X_test = X.T[shuffled_order[:601]].T
    y_test = y[shuffled_order[:601]]
    X_train = X.T[shuffled_order[601:]].T
    y_train = y[shuffled_order[601:]]

if (args.model == "logistic"):
        # Do logistic regression with the given learning rate and lambda
        W, b = our_ml.logistic_regression(X_train, y_train, args.learning_rate, args.epochs, lam=args.lam)
        _, train_predictions = our_ml.logistic_predict(W, b, X_train)
        _, test_predictions = our_ml.logistic_predict(W, b, X_test)

# Calculate statistics with the train and test statistics
    print("Train statistics.")
    our_ml.printClassStatistics(y_train, train_predictions)
    print("Test statistics.")
    our_ml.printClassStatistics(y_test, test_predictions)

```

---

## What have we learned?

* The data isn't linearly separable
  * Multiple spam sources
  * Multiple goals of spam
* And we call our inputs continuous
  * But is word frequency really a continuous variable?
* Need something more exemplar

---

## What about Mixture Models?

* This looks like clustering, right?
* Could be, but what distributions do these columns follow?
  * Just because it's a continuous number doesn't mean it's gaussian

---

## Mixture of Guassians

* Ran a bunch of times
* Better with fewer experts (meaning 2)
* Training accuracy 0.662

---

## Mixture of Gaussians

* Test accuracy 0.679

---

## Why so bad?

* Mixture models are cool, but still unsupervised
* And we have no idea what distribution to use for the data
* Need something more exemplar, less statistical

---

## Decision Tree

* Our old friend
  * We've really only covered two supervised learning techniques
* Turn PCA off
  * The tree won't care about statistics
  * Float operations are faster with the original numbers

---

## Decision Tree Results

* Training
* Accuracy is 0.99925

---

## Decision Tree Results

* Testing
* Accuracy is 0.9184692179700499

---

## Decision Tree

---

## Performance

* Building a tree is going to get slow with growing numbers of rows and columns
  * Made parts parallel and it's snappy now
* If we were happy with 90% accuracy, then logistic regression's simplicity is attractive
  * Unfortunately, that's far too low for spam detection
* The tree is better, but it's failing to generalize

---

## More on Trees

* Decision trees can always get near 100% accuracy on their training set
  * Model-free and exemplar-based
* But they are very sensitive to the pivots chosen
  * Meaning that small changes in the input can lead to large changes in the model
  * But maybe this is a good thing?

---

## Tree Troubles

* The problem with trees is that they fit **too** well
  * High variance models
  * Conceptually, their decision boundary weaves around every bit of data
  * We need some way to regularize them
* Making a tree simpler is possible, but it sacrifices some advantages

---

## Tree Solutions

* So let's keep the complexity
  * We want that sweet, sweet 0.999 accuracy number
* Just need a way to generalize

---

## Tree Ensembles

* So long as we can guarantee that our individual trees are diverse, an ensemble will add robustness
* One simple method for this is `bagging`
  * Short for `bootstrap aggregating`
* Train trees on different subsamples of the data
  * The diversity of the subsamples creates diversity in the models
* Just take the majority vote to choose a class

---

## Voting

* Why will majority voting work?
* Let $Y_m$ by the m'th model's prediction
* Sum predictions and divide by M, then threshold
  * $p = Pr(\sum_{m=1}^M Y_m > M/2) = 1 - B(M/2,M,\Theta)$
  * This is the cdf of the binomial
  * $\Theta$ is the model accuracy; let's say it's 0.51
    * 1000 models: $p=0.73$
    * 10000 models: $p=0.97$

---

## Voting Example

```python
# Build k trees
trees = []
train_predictions = []
for i in range(args.ktrees):
    X_bagged, y_bagged = more_tree_funcs.bag_data(X_train, y_train)
    root = more_tree_funcs.build_decision_tree(X_bagged, y_bagged, args.maxdepth)
    train_predictions.append(root.decide(X_train))
# Now vote the predictions
train_stack = np.stack(train_predictions, axis=1)
train_sum = np.sum(train_stack, axis=1)
train_predictions = ((train_sum / args.ktrees) > 0.5).astype(int)
```

---

## Independence Fallacy

* Of course our assumptions will only be true if the models are independent
* If we had enough data (and time) to train an arbitrary number of models that would work
* We don't have infinite data
  * How can be use what we've got to get models that act differently?

---

## Bagging

* Sample from the dataset with replacement
* Take N samples, where N is the original dataset size
* This results on each model seeing about 63% of the unique inputs
  * Because the chance of missing one item out of N over N draws is
  * $(1-1/N)^N \approx 0.37$ as N approaches infinity

---

## Bagging Code

```python
def bag_data(X, y):
    n_samples = random.choices(list(range(y.shape[0])), k=y.shape[0])
    bagged_x = X.T[n_samples].T
    bagged_y = y[n_samples]
    return bagged_x, bagged_y
```

---

## Testing Note

* Because about a third of the data is left out of each tree, it can be used to test that tree
  * This is an efficient alternative to cross validation
* For consistency though, we'll still keep our train/test split

---

## Accuracy

---

## Discussion

* Training improvement looks monotonic and smooth
* Not so with testing accuracy
  * It does improve over a single tree
  * But it's noisy
* Went from 92% accuracy to 95-96% accuracy though
  * About half the error rate; huge

---

## 21 Tree Training

* Training
* Accuracy is 0.999

---

## 21 Tree Testing

* Testing
* Accuracy is 0.962

---

## Notes

* In a real application, we would find that pathway that misclassified one non-spam email and adjust it
  * Since getting some spam is far worse than missing a real email
* We could also mark some tree pathways as uncertain and hold those emails back for review
  * This is basically what is done today, delaying some messages while waiting for better statistics

---

## Modifying Trees

* Trees are very simple to modify
* As recursive structures, any part can be changed without changing the rest
* This, and other nice properties, make trees a desirable model type

---

## Bagging Advantages

* The advantage of bagging lies in lowering the influence of individual training points
* The approach is also simple, and takes advantage of the unstable nature of decision trees

---

## Bagging Drawbacks

* The approach will not always work
  * Depends upon data subsets creating variety in trees
* Basically we have to trust that diversity will happen after subsampling
  * Look at the noisy test accuracy
  * A bit like k-means initialization

---

## Better Methods

* The ensemble works better when the individual models are decorrelated
  * So we could build maximize that by construction
* Two techniques are `random forests` and `boosting`

---

## Random Forests

* Still look at a subset of the data for each tree
* Also look at a subset of the input columns at each tree node
  * This will force each tree to come up with different pivot values

---

## Boosting

* What if we didn't weight each model the same?
* And what if we add trees to compensate for the failures of the previous one?
* `Boosting` grows the ensemble by reweighting the data samples based upon the previous errors
  * Several variants are possible

---

## Decision Trees

* Decision trees are powerful learners because they are "model free"
  * Gives us a lot of freedom to play with their construction
* We'll stick with them for next lecture as well

---

## Homework 2

* Homework 2 is online
  * [https://firner.com/presentations/cs461/homework2.html](https://firner.com/presentations/cs461/homework2.html)
  * Use clustering to compress images

<!--

Outline:
Initiali analysis
Cluster and PCA
Multivariate linear regression
  Calculate number of comparisons, time to retrain
Individual tree
  Numer of comparisons, time to retrain
Ensemble tree with bagging

-->