# CS 461 - Lecture 09 ## Machine Learning Principles Bernhard Firner 2025-10-06 --- ## Reading * Recommended reading * Machine Learning: The Art and Science of Algorithms that Make Sense of Data by Flach * 11.1 * Machine Learning: A Probabilistic Perspective by Murphy * 16.2 (but bagging info is sparse) --- ## Review - Latent Spaces * Latent Spaces * Our observed data is the `visible space` * Just a reflection of the `latent space`, which is a more meaningful representation * Mixture of expert models assumed one random variable per data axis * But in reality they are correlated --- ## Review - PCA * We want to find a "more meaningful" version of our data * Solution: Principal Component Analysis (PCA) * `Project` our data along a new set of axes * Possibly fewer, leading to dimensionality reduction --- ## Covariance Matrix * Covariance of $X_i$ and $X_j$ is $\mathbb{E}[(X_i-\mathbb{E}[X_i])(X_j-\mathbb{E}[X_j])]$ * We estimate with $Cov_{jk} = \frac{1}{N}\Sigma_{i=1}^N(X_{ij}-\mu(X_j))(X_{ik} - \mu(X_k))$ * The covariance matrix has $Cov_{jk}$ at position j,k * The diagonal is filled with the variances of each column * How is this useful? * If two columns are correlated, then they have some redundancy --- ## Correlation Coefficient * Before we go about using the correlation matrix, it will be more interpretable if we normalize it * This is the same as standardizing the data first, making each axis 0 mean and unit variance * After that, the matrix will have values in the range -1 to 1 * -1 is perfectly anti-correlated * 1 is perfectly correlated * 0 is completely uncorrelated --- ## Review - PCA * PCA algorithm * Find the eigenvalues and eigenvectors of the correlation matrix * This is the covariance matrix of a standardized dataset * Rank the vectors by their eigenvalues * Choose a subset * The eigenvalues indicate how much of the dataset variance each vector explains --- ## PCA - Visualization
--- ## Implementation ```python #! /usr/bin/python3 import tree_funcs import math import numpy as np import sys from sklearn.decomposition import PCA def preprocessCSV(csv, X_names): """ Args: csv (str): The path to a file containing columns of data. X_names (list[str]): Names of the columns to return. Returns: list(columns) """ # Load training data records, column_names = tree_funcs.read_csv(csv) name_to_idx = {name: i for i, name in enumerate(column_names)} # Validate requested columns for xn in X_names: if xn not in name_to_idx: raise SystemExit(f"Feature column '{xn}' not found in {csv}.") X = np.array([records[name_to_idx[xname]] for xname in X_names]) return X def pca(num_basis, X): """ Performs PCA by manually constructing and solving the normal equations. Args: num_basis (int): The number of basis functions X (list(columns)): The columns of data. """ ## We could manually standardize. Or just call numpy.corrcoef #mean = np.mean(X, axis=0) #std = np.std(X, axis=0) #X_centered = X - mean #X_standardized = X_centered / std #correlation_matrix = np.cov(X_standardized, rowvar=False) # Get the correlation matrix correlation_matrix = np.corrcoef(X, rowvar=False) # Eigenvalue Decomposition eigenvalues, eigenvectors = np.linalg.eigh(correlation_matrix) # Sort and check from greatest to smallest sorted_indices = np.argsort(eigenvalues)[::-1] sorted_eigenvalues = eigenvalues[sorted_indices] sorted_eigenvectors = eigenvectors[:, sorted_indices] # Get the explained variance total_variance = np.sum(eigenvalues) explained_variance = sorted_eigenvalues / total_variance # Return the components components = sorted_eigenvectors[:, :num_basis] return components, explained_variance if __name__ == "__main__": if len(sys.argv) < 3: print("Provide the data file, the number of basis functions, and the column names as arguments.") else: X = preprocessCSV(sys.argv[1], sys.argv[3:]).T #print(X_centered) print(f"The correlation coefficients of X are {np.corrcoef(X, rowvar=False)}") components, explained = pca(int(sys.argv[2]), X) print(f"Explained {explained} ({explained.sum()})") # Print out results print(components) # Print out new columns mean = np.mean(X, axis=0) std = np.std(X, axis=0) # Handle features with zero variance (to prevent division by zero) epsilon = 1e-8 std[std == 0] = epsilon X_standardized = (X - mean) / std new_columns = np.dot(X_standardized, components) print(f"corcoef of new columns is {np.corrcoef(new_columns, rowvar=False)}") #for i in range(new_columns.shape[0]): # print(",".join([str(val) for val in new_columns[i]])) ``` --- ## Lecture 9 Goals * Return to supervised learning * Start looking at more realistic datasets * Spam, specifically --- ## What is spam? * Unwanted emails and texts * Problem? * Huge volume, so the approach taken needs to be light * Also needs to retrain or patch quickly --- ## How About Naive Bayes? * The first spam filters were built with naive bayes * $p(spam|word)$ * $p(spam|words) = \prod_I p(spam|word_i)$ --- ## Worked Until It Didn't * Not robust * Bayesian poisoning allowed attackers to fool the filters by throwing in some low-correlation words * There are always new low-correlation words available, so they could keep evading * Very costly to classify any non-spam as spam * Need something with a more robust decision boundary --- ## Fetching Spambase * https://archive.ics.uci.edu/dataset/94/spambase * Download the zip or use code example * pip install ucimlrepo ```python from ucimlrepo import fetch_ucirepo # fetch dataset spambase = fetch_ucirepo(id=94) # data (as pandas dataframes) X = spambase.data.features y = spambase.data.targets # metadata print(spambase.metadata) # variable information print(spambase.variables) ``` --- ## Spambase * Created in June-July 1999 * 4601 entries * 39.4% legitimate, 60.6% spam * Data is transformed into 57 continuous variables * 48 \[0,100\] word frequency * 6 \[0,100\] character frequency * capital letter run length average, longest, and total lengths * Start here --- ## TRECSpam 2007 * 2007: 50199 spam emails, 25220 not spam emails * 66.5% spam, 33.5% legitimate * https://www.kaggle.com/datasets/imdeepmind/preprocessed-trec-2007-public-corpus-dataset * 4 Data Columns * subject * email to * email from * message body * Want to get here --- ## Let's Begin with Spambase * What words frequencies? * address, all, 3dc, our, over, remove, internet, order, mail, receive, will, ... * also "george" * The dataset comes from an individual's inbox, so it probably won't generalize * What character frequencies? * Brackets, ;, !, \$, # --- ## Some Observations * Not going to visualize this * We could make gigantic correlation matrices, but they won't tell us too much --- ## Loading the data ```python import argparse import numpy import os import pca def column_names(): """Return an array of all data column names and the spam label.""" return ["word_freq_make", "word_freq_address", "word_freq_all", "word_freq_3d", "word_freq_our", "word_freq_over", "word_freq_remove", "word_freq_internet", "word_freq_order", "word_freq_mail", "word_freq_receive", "word_freq_will", "word_freq_people", "word_freq_report", "word_freq_addresses", "word_freq_free", "word_freq_business", "word_freq_email", "word_freq_you", "word_freq_credit", "word_freq_your", "word_freq_font", "word_freq_000", "word_freq_money", "word_freq_hp", "word_freq_hpl", "word_freq_george", "word_freq_650", "word_freq_lab", "word_freq_labs", "word_freq_telnet", "word_freq_857", "word_freq_data", "word_freq_415", "word_freq_85", "word_freq_technology", "word_freq_1999", "word_freq_parts", "word_freq_pm", "word_freq_direct", "word_freq_cs", "word_freq_meeting", "word_freq_original", "word_freq_project", "word_freq_re", "word_freq_edu", "word_freq_table", "word_freq_conference", "char_freq_;", "char_freq_(", "char_freq_[", "char_freq_!", "char_freq_$", "char_freq_#", "capital_run_length_average", "capital_run_length_longest", "capital_run_length_total", "spam"] def read_spambase(path): """Read the spambase data. Arguments: path (str): Path to the spambase directory. Returns: (X, y): The data columns and the class labels """ data_file = os.path.join(path, 'spambase.data') try: data = numpy.loadtxt(data_file, delimiter=',', skiprows=0, unpack=True) except FileNotFoundError: print(f"Error: The file '{data_file}' was not found.") sys.exit(1) except ValueError: print(f"Error: The file '{data_file}' does not contain valid numeric data.") sys.exit(1) return data[:-1], data[-1:] if __name__ == "__main__": parser = argparse.ArgumentParser() parser.add_argument( "path", type=str) args = parser.parse_args() # TODO Parse unknown arguments to send to ML techniques ## args, other_args = parser.parse_known_args() # Get the data X, y = read_spambase(args.path) print(X.shape, y.shape) ``` --- ## First Steps * Begin with dimensionality reduction * Some of the columns are clearly correlated * e.g. average and max run length * Max correlation is actually 0.996 --- ## PCA Components
--- ## Not Great * Top 3 components are * 0.1156, 0.0573, 0.0351 * Bottom 3 components are * 0.0046, 0.0033, 0.0000676 * With 57 components, perfectly independent columns would get 0.175 each * Only see that after the $20^{th}$ component * 99\% explained with the $54^{th}$ column --- ## Next Step * Try logistic regression * With and without PCA * It's simple and we may learn something --- ## Testing and Validation * If we're going to try some techniques, we need to measure error * That means breaking up our dataset into training and test * We could also do n-fold validation * Break data into N chunks * Do N experiments, with n=1,n=2,... held out each time --- ## Pitfalls * Be aware that this is too naive * *we*, the humans will overfit our techniques eventually * In addition to the test set, we should have a validation set that we never get to see --- ## Train and Test ```python # Split X into training and testing. shuffled_order = random.sample(list(range(y.shape[0])), y.shape[0]) X_test = X.T[shuffled_order[:601]].T y_test = y[shuffled_order[:601]] X_train = X.T[shuffled_order[601:]].T y_train = y[shuffled_order[601:]] ``` --- ## Logistic Regression * Without pca, accuracy is around 0.6 for train and test * With pca, train and test accuracy are around 0.9 * learning rate 0.1, 20 epochs, $\lambda=1$ * Fiddling with the parameters doesn't improve much --- ## Logistic Regression ```python import argparse import numpy as np import os import random import sys # ML algorithms that we've used in CS461 import our_ml def column_names(): """Return an array of all data column names and the spam label.""" return ["word_freq_make", "word_freq_address", "word_freq_all", "word_freq_3d", "word_freq_our", "word_freq_over", "word_freq_remove", "word_freq_internet", "word_freq_order", "word_freq_mail", "word_freq_receive", "word_freq_will", "word_freq_people", "word_freq_report", "word_freq_addresses", "word_freq_free", "word_freq_business", "word_freq_email", "word_freq_you", "word_freq_credit", "word_freq_your", "word_freq_font", "word_freq_000", "word_freq_money", "word_freq_hp", "word_freq_hpl", "word_freq_george", "word_freq_650", "word_freq_lab", "word_freq_labs", "word_freq_telnet", "word_freq_857", "word_freq_data", "word_freq_415", "word_freq_85", "word_freq_technology", "word_freq_1999", "word_freq_parts", "word_freq_pm", "word_freq_direct", "word_freq_cs", "word_freq_meeting", "word_freq_original", "word_freq_project", "word_freq_re", "word_freq_edu", "word_freq_table", "word_freq_conference", "char_freq_;", "char_freq_(", "char_freq_[", "char_freq_!", "char_freq_$", "char_freq_#", "capital_run_length_average", "capital_run_length_longest", "capital_run_length_total", "spam"] def read_spambase(path): """Read the spambase data. Arguments: path (str): Path to the spambase directory. Returns: (X, y): The data columns and the class labels """ data_file = os.path.join(path, 'spambase.data') try: data = np.loadtxt(data_file, delimiter=',', skiprows=0, unpack=True) except FileNotFoundError: print(f"Error: The file '{data_file}' was not found.") sys.exit(1) except ValueError: print(f"Error: The file '{data_file}' does not contain valid numeric data.") sys.exit(1) return data[:-1], data[-1] if __name__ == "__main__": parser = argparse.ArgumentParser() parser.add_argument( "path", type=str) parser.add_argument( "--pca", required=False, type=int, default=None, help="The number of components to take from PCA.") parser.add_argument( "--model", required=False, choices=["logistic"], default="logistic", help="Model to use for classification.") parser.add_argument( "--learning_rate", required=False, type=float, default=0.01, help="The learning rate for the selected algorithm.") parser.add_argument( "--epochs", required=False, type=int, default=20, help="The number of epochs for the selected algorithm.") parser.add_argument( "--lam", required=False, type=float, default=1.0, help="The l2 norm for the selected algorithm.") args = parser.parse_args() # TODO Parse unknown arguments to send to ML techniques ## args, other_args = parser.parse_known_args() # Get the data X, y = read_spambase(args.path) # TODO FIXME Split into training and testing print(X.shape, y.shape) if (args.pca is not None): # Get all of the components, then we'll choose how many to keep # Transpose so each sample is in the first axis X = X.T first_corr_matrix = np.corrcoef(X, rowvar=False) print(f"Maximum column correlation pre PCA is {np.max(first_corr_matrix - np.eye(X.shape[1]))}") components, explained = our_ml.pca(args.pca, X) for i, expl in enumerate(explained): print(f"PCA: {i},{expl},{np.sum(explained[:i+1])/np.sum(explained)}") mean = np.mean(X, axis=0) std = np.std(X, axis=0) # Handle features with zero variance (to prevent division by zero) epsilon = 1e-8 std[std == 0] = epsilon X_standardized = (X - mean) / std new_columns = np.dot(X_standardized, components) second_corr_matrix = np.corrcoef(new_columns, rowvar=False) print(f"Maximum column correlation post PCA is {np.max(second_corr_matrix - np.eye(args.pca))}") # Put X back into feature first orientation X = new_columns.T else: # Always standardize at least mean = np.mean(X, axis=0) std = np.std(X, axis=0) # Handle features with zero variance (to prevent division by zero) epsilon = 1e-8 std[std == 0] = epsilon X = (X - mean) / std # Split X into training and testing. shuffled_order = random.sample(list(range(y.shape[0])), y.shape[0]) X_test = X.T[shuffled_order[:601]].T y_test = y[shuffled_order[:601]] X_train = X.T[shuffled_order[601:]].T y_train = y[shuffled_order[601:]] if (args.model == "logistic"): # Do logistic regression with the given learning rate and lambda W, b = our_ml.logistic_regression(X_train, y_train, args.learning_rate, args.epochs, lam=args.lam) _, train_predictions = our_ml.logistic_predict(W, b, X_train) _, test_predictions = our_ml.logistic_predict(W, b, X_test) # Calculate statistics with the train and test statistics print("Train statistics.") our_ml.printClassStatistics(y_train, train_predictions) print("Test statistics.") our_ml.printClassStatistics(y_test, test_predictions) ``` --- ## What have we learned? * The data isn't linearly separable * Multiple spam sources * Multiple goals of spam * And we call our inputs continuous * But is word frequency really a continuous variable? * Need something more exemplar --- ## What about Mixture Models? * This looks like clustering, right? * Could be, but what distributions do these columns follow? * Just because it's a continuous number doesn't mean it's gaussian --- ## Mixture of Guassians * Ran a bunch of times * Better with fewer experts (meaning 2) * Training accuracy 0.662
4000
Ham
Spam
Ham
1668
764
Spam
588
980
--- ## Mixture of Gaussians * Test accuracy 0.679
601
Ham
Spam
Ham
248
108
Spam
85
160
--- ## Why so bad? * Mixture models are cool, but still unsupervised * And we have no idea what distribution to use for the data * Need something more exemplar, less statistical --- ## Decision Tree * Our old friend * We've really only covered two supervised learning techniques * Turn PCA off * The tree won't care about statistics * Float operations are faster with the original numbers --- ## Decision Tree Results * Training * Accuracy is 0.99925
4000
Ham
Spam
Ham
2413
0
Spam
3
1584
--- ## Decision Tree Results * Testing * Accuracy is 0.9184692179700499
601
Ham
Spam
Ham
345
30
Spam
19
207
--- ## Decision Tree
--- ## Performance * Building a tree is going to get slow with growing numbers of rows and columns * Made parts parallel and it's snappy now * If we were happy with 90% accuracy, then logistic regression's simplicity is attractive * Unfortunately, that's far too low for spam detection * The tree is better, but it's failing to generalize --- ## More on Trees * Decision trees can always get near 100% accuracy on their training set * Model-free and exemplar-based * But they are very sensitive to the pivots chosen * Meaning that small changes in the input can lead to large changes in the model * But maybe this is a good thing? --- ## Tree Troubles * The problem with trees is that they fit **too** well * High variance models * Conceptually, their decision boundary weaves around every bit of data * We need some way to regularize them * Making a tree simpler is possible, but it sacrifices some advantages --- ## Tree Solutions * So let's keep the complexity * We want that sweet, sweet 0.999 accuracy number * Just need a way to generalize --- ## Tree Ensembles * So long as we can guarantee that our individual trees are diverse, an ensemble will add robustness * One simple method for this is `bagging` * Short for `bootstrap aggregating` * Train trees on different subsamples of the data * The diversity of the subsamples creates diversity in the models * Just take the majority vote to choose a class --- ## Voting * Why will majority voting work? * Let $Y_m$ by the m'th model's prediction * Sum predictions and divide by M, then threshold * $p = Pr(\sum_{m=1}^M Y_m > M/2) = 1 - B(M/2,M,\Theta)$ * This is the cdf of the binomial * $\Theta$ is the model accuracy; let's say it's 0.51 * 1000 models: $p=0.73$ * 10000 models: $p=0.97$ --- ## Voting Example ```python # Build k trees trees = [] train_predictions = [] for i in range(args.ktrees): X_bagged, y_bagged = more_tree_funcs.bag_data(X_train, y_train) root = more_tree_funcs.build_decision_tree(X_bagged, y_bagged, args.maxdepth) train_predictions.append(root.decide(X_train)) # Now vote the predictions train_stack = np.stack(train_predictions, axis=1) train_sum = np.sum(train_stack, axis=1) train_predictions = ((train_sum / args.ktrees) > 0.5).astype(int) ``` --- ## Independence Fallacy * Of course our assumptions will only be true if the models are independent * If we had enough data (and time) to train an arbitrary number of models that would work * We don't have infinite data * How can be use what we've got to get models that act differently? --- ## Bagging * Sample from the dataset with replacement * Take N samples, where N is the original dataset size * This results on each model seeing about 63% of the unique inputs * Because the chance of missing one item out of N over N draws is * $(1-1/N)^N \approx 0.37$ as N approaches infinity --- ## Bagging Code ```python def bag_data(X, y): n_samples = random.choices(list(range(y.shape[0])), k=y.shape[0]) bagged_x = X.T[n_samples].T bagged_y = y[n_samples] return bagged_x, bagged_y ``` --- ## Testing Note * Because about a third of the data is left out of each tree, it can be used to test that tree * This is an efficient alternative to cross validation * For consistency though, we'll still keep our train/test split --- ## Accuracy
--- ## Discussion * Training improvement looks monotonic and smooth * Not so with testing accuracy * It does improve over a single tree * But it's noisy * Went from 92% accuracy to 95-96% accuracy though * About half the error rate; huge --- ## 21 Tree Training * Training * Accuracy is 0.999
4000
Ham
Spam
Ham
2424
1
Spam
3
1572
--- ## 21 Tree Testing * Testing * Accuracy is 0.962
601
Ham
Spam
Ham
354
9
Spam
14
224
--- ## Notes * In a real application, we would find that pathway that misclassified one non-spam email and adjust it * Since getting some spam is far worse than missing a real email * We could also mark some tree pathways as uncertain and hold those emails back for review * This is basically what is done today, delaying some messages while waiting for better statistics --- ## Modifying Trees * Trees are very simple to modify * As recursive structures, any part can be changed without changing the rest * This, and other nice properties, make trees a desirable model type --- ## Bagging Advantages * The advantage of bagging lies in lowering the influence of individual training points * The approach is also simple, and takes advantage of the unstable nature of decision trees --- ## Bagging Drawbacks * The approach will not always work * Depends upon data subsets creating variety in trees * Basically we have to trust that diversity will happen after subsampling * Look at the noisy test accuracy * A bit like k-means initialization --- ## Better Methods * The ensemble works better when the individual models are decorrelated * So we could build maximize that by construction * Two techniques are `random forests` and `boosting` --- ## Random Forests * Still look at a subset of the data for each tree * Also look at a subset of the input columns at each tree node * This will force each tree to come up with different pivot values --- ## Boosting * What if we didn't weight each model the same? * And what if we add trees to compensate for the failures of the previous one? * `Boosting` grows the ensemble by reweighting the data samples based upon the previous errors * Several variants are possible --- ## Decision Trees * Decision trees are powerful learners because they are "model free" * Gives us a lot of freedom to play with their construction * We'll stick with them for next lecture as well --- ## Homework 2 * Homework 2 is online * [https://firner.com/presentations/cs461/homework2.html](https://firner.com/presentations/cs461/homework2.html) * Use clustering to compress images