# CS 461 - Lecture 10 ## Machine Learning Principles Bernhard Firner 2025-10-08 --- ## Reading * Recommended reading * Machine Learning: The Art and Science of Algorithms that Make Sense of Data by Flach * 11 * Machine Learning: A Probabilistic Perspective by Murphy * 16.2.5 (random forest) * 16.4 (boosting --- ## Review - Spambase * Created in June-July 1999 * 4601 entries * 39.4% legitimate, 60.6% spam * Data is transformed into 57 continuous variables * 48 \[0,100\] word frequency * 6 \[0,100\] character frequency * capital letter run length average, longest, and total lengths --- ## Results * Logistic regression * Train and test 0.6 * With PCA improves to 0.9 * Decision Tree * Train accuracy is 0.99925 * Test accuracy is 0.9184692179700499 * Bagged Trees * Train accuracy is 0.999 * Test accuracy is 0.962 --- ## What is Bagging? * Bootstrap aggregating * Sample N times from the dataset of size N with replacement * This results on each model seeing about 63% of the unique inputs * Because the chance of missing one item out of N over N draws is * $(1-1/N)^N \approx 0.37$ as N approaches infinity --- ## Bagging Code ```python def bag_data(X, y): n_samples = random.choices(list(range(y.shape[0])), k=y.shape[0]) bagged_x = X.T[n_samples].T bagged_y = y[n_samples] return bagged_x, bagged_y ``` --- ## Why Does this Help? * Tree decision boundaries are very noisy * Small changes in pivot selection can cause major changes * Bagging takes advantage of that, generating many trees that make slightly different decisions * This diversity makes the ensemble of trees more robust --- ## Ensemble Voting * In fact, any group of learners, even weak ones, are improved with aggregation * Only if they are independent though! * So can we make our bagged trees even more independent? --- ## Random Forest * What if we drop some data columns? * Now the different trees are trained on different columns, they must be different! * We'll also continue to use bagging --- ## Drop How Much? * How can we know how many columns to drop? * If one column is very important, dropping it could be risky * Let's look at our PCA results to inform this decision --- ## PCA Results
--- ## Interpretation * None of the columns in the original data can be more important than the PCA's first column * Slightly more than 0.1 * In fact, dropping 15 columns shouldn't lose more than 50% of the dataset variance * So let's drop 15 each time --- ## Implementation * We don't want to have to remember which trees are dropping which columns * So we'll add the dropping into the root node ```python class DropNode(): def __init__(self, columns, child_node): """columns is a list of columns to keep.""" self.keep = columns self.child = child_node def decide(self, data): return self.child.decide(data[self.keep]) def select_columns(X, K): k_columns = random.sample(list(range(X.shape[0])), k=K) resampled_x = X[k_columns] return resampled_x, k_columns ``` --- ## Improvements * This can be a slight improvement over bagging alone * On this dataset, the results are roughly similar though * We could find a "best" value for the number of trees and the number of columns to drop * But now we're optimizing hyperparameters --- ## Less Heuristic Approach * Can't we search for ensemble members? * What if we guarantee that each learner is different from the last by making it do better where the previous learning failed? * This is called boosting --- ## Boosting * Train a model, calculate its error rate, $\epsilon$ * Improve the next model by *boosting* the samples with errors * We could do this by duplicating them in the training set * Or we could weight the importance of getting any sample right or wrong * Weight adjusting is finer than duplication --- ## Boosting Weight Adjustment * We want to assign half the weight to misclassified examples * Half the weight goes to the rest * Biased towards being correct on a small number of errors * Will make successive models in the ensemble very different --- ## Weight Adjustment * Initialize weights to 1/N for N samples * After training each classified * Multiply misclassified weights by $1/(2\epsilon)$ * Multiply correctly classified weights by $1/(1 - 2\epsilon)$ * As long as $\epsilon < 0.5$, this increased the weight of misclassified samples * If the error rate is over 0.5 something is wrong --- ## Confidence Factor * To improve the ensemble, we also want to assign a confidence, $\alpha$, to each model * AdaBoost, the most popular boosting method * Applies an exponential penalty for misclassification * $ln\sqrt\frac{1-\epsilon_t}{\epsilon_t}$ for the model at step t * We use this confidence to weight each model in the ensemble --- ## The Learner * What learner should we be using? * It should be *weak* so that errors can guide our search * We want it to easily conform to the heavily weighted samples * Can't be trees, they fit their training data too well * But what if we use trees, but set their max depth to 1? --- ## Decision Stumps * Just choose a single pivot * Don't use purity, just check immediately what the classification error will be * If the error is greater than 0.5, reverse the polarity and predict the opposite * Thus each stump stores a `pivot column`, a `boundary` value, and a `polarity` * It will also gain a confidence, $\alpha$, once its error rate it measured --- ## Stump ```python class DecisionStump: """Also called univariate decision tree. Has depth 1, making a single decision. Returns either -1 or 1 upon classification. """ def __init__(self, feature_index, boundary, polarity): # The column of the chosen feature self.feature_index = feature_index self.boundary = boundary # Says if the output is -1 or 1 beyond the boundary self.polarity = polarity # The certainty weighting from AdaBoost self.alpha = None # The error of the stump self.error = None def predict(self, X): # The default polarity predictions = self.polarity * np.ones(X.shape[1]) # Reverse polarity for anything below the decision boundary predictions[X[self.feature_index] < self.boundary] *= -1 return predictions ``` --- ## Training with Weight ```python def train_stump(X, y, weights): """ Arguments: X: Column first samples y: Class labels weights: Weights to rows in X """ columns, samples = X.shape min_error = float('inf') best_stump = None for fidx, feature in enumerate(X): # Just like regular trees, check each unique value as the decision boundary boundaries = np.unique(feature) for boundary in boundaries: for polarity in [1, -1]: stump = DecisionStump(fidx, boundary, polarity) predictions = stump.predict(X) # Find the errors and then weight them errors = (predictions != y) weighted_error = np.sum(weights[errors]) if weighted_error < min_error: min_error = weighted_error best_stump = stump best_stump.error = weighted_error return best_stump ``` --- ## Stump Example * Initial weights for points are all $1/6$
--- ## Stump Example * First stump has alpha 0.35 * It has errors, so weights are updated: * 0.25 0.125 0.25 0.125 0.125 0.125
--- ## Stump Example * Second stump has alpha 0.549 * High because it fixed high weight errors from before * New weights: 0.167 0.083 0.167 0.083 0.25 0.25
--- ## Stump Example * Third stump has alpha 0.346 * It's going back and repeating the first line * New weights: 0.25 0.0625 0.25 0.0625 0.1875 0.1875
--- ## Stump Example * Fourth stump has alpha 0.394
--- ## Stump Example * Fifth stump has alpha 0.381
--- ## Stump Example * Sixth stump has alpha 0.472
--- ## Visualizing large datasets * 1 decision boundary
--- ## Visualizing large datasets * 3 decision boundaries
--- ## Visualizing large datasets * 120 decision boundaries
--- ## Complexity * The algorithm is no more complicated than deciding on a single split in a decision tree * $C*D^2$, with C columns and D rows * If each row value is unique, each will be compared as a pivot * We will use many trunks * Creating each trunk is fast * But they cannot be created in parallel --- ## Robustness * AdaBoost uses very weak learners with decision trunks * Each one carves out a tiny chunk of the decision space * This makes it resistant to overfitting to the data * The trunks aren't really capable of overfitting * Trunks become redundant as the model fits * errors will begin to pull decision boundaries back and forth --- ## Results
--- ## Training Statistics * Accuracy is 0.95
601
Ham
Spam
Ham
2354
75
Spam
124
1447
--- ## Test Statistics * Accuracy is 0.968
601
Ham
Spam
Ham
355
4
Spam
15
227
--- ## Bias and Variance * How did these ensemble techniques impact the bias and variance tradeoff? * Bagging reduced the variance of the training dataset * By removing data, the model cannot overfit to it * Results are then majority voted, smoothing the decision boundary * Boosting reduces bias, making a complicated decision boundary from a simple one * Thus bagging works well with high-variance decision trees, while boosting works well with high-bias stumps --- ## Improving Spam Filtering * We aren't going to do much better on Spambase * Improvements require more data * Want to move on to TRECSpam 2007 * 50199 spam emails, 25220 not spam emails * All text fields * But how will we process text rather than summary features?