<!--
Abstract:

CS 461
Introduction to Deep Learning
Lecture 09
-->

# CS 461 - Lecture 10

## Machine Learning Principles

Bernhard Firner

2025-10-08

---

## Reading

* Recommended reading
  * Machine Learning: The Art and Science of Algorithms that Make Sense of Data by Flach
    * 11
  * Machine Learning: A Probabilistic Perspective by Murphy
    * 16.2.5 (random forest)
    * 16.4 (boosting

---

## Review - Spambase

* Created in June-July 1999
* 4601 entries
* 39.4% legitimate, 60.6% spam
* Data is transformed into 57 continuous variables
  * 48 \[0,100\] word frequency
  * 6 \[0,100\] character frequency
  * capital letter run length average, longest, and total lengths

---

## Results

* Logistic regression
  * Train and test 0.6
  * With PCA improves to 0.9
* Decision Tree
  * Train accuracy is 0.99925
  * Test accuracy is 0.9184692179700499
* Bagged Trees
  * Train accuracy is 0.999
  * Test accuracy is 0.962

---

## What is Bagging?

* Bootstrap aggregating
* Sample N times from the dataset of size N with replacement
* This results on each model seeing about 63% of the unique inputs
  * Because the chance of missing one item out of N over N draws is
  * $(1-1/N)^N \approx 0.37$ as N approaches infinity

---

## Bagging Code

```python
def bag_data(X, y):
    n_samples = random.choices(list(range(y.shape[0])), k=y.shape[0])
    bagged_x = X.T[n_samples].T
    bagged_y = y[n_samples]
    return bagged_x, bagged_y
```

---

## Why Does this Help?

* Tree decision boundaries are very noisy
  * Small changes in pivot selection can cause major changes
* Bagging takes advantage of that, generating many trees that make slightly different decisions
  * This diversity makes the ensemble of trees more robust

---

## Ensemble Voting

* In fact, any group of learners, even weak ones, are improved with aggregation
  * Only if they are independent though!
* So can we make our bagged trees even more independent?

---

## Random Forest

* What if we drop some data columns?
  * Now the different trees are trained on different columns, they must be different!
* We'll also continue to use bagging

---

## Drop How Much?

* How can we know how many columns to drop?
  * If one column is very important, dropping it could be risky
* Let's look at our PCA results to inform this decision

---

## PCA Results

---

## Interpretation

* None of the columns in the original data can be more important than the PCA's first column
  * Slightly more than 0.1
* In fact, dropping 15 columns shouldn't lose more than 50% of the dataset variance
* So let's drop 15 each time

---

## Implementation

* We don't want to have to remember which trees are dropping which columns
  * So we'll add the dropping into the root node

```python
class DropNode():
    def __init__(self, columns, child_node):
        """columns is a list of columns to keep."""
        self.keep = columns
        self.child = child_node

def decide(self, data):
        return self.child.decide(data[self.keep])

def select_columns(X, K):
    k_columns = random.sample(list(range(X.shape[0])), k=K)
    resampled_x = X[k_columns]
    return resampled_x, k_columns
```

---

## Improvements

* This can be a slight improvement over bagging alone
* On this dataset, the results are roughly similar though
* We could find a "best" value for the number of trees and the number of columns to drop
  * But now we're optimizing hyperparameters

---

## Less Heuristic Approach

* Can't we search for ensemble members?
* What if we guarantee that each learner is different from the last by making it do better where the previous learning failed?
  * This is called boosting

---

## Boosting

* Train a model, calculate its error rate, $\epsilon$
* Improve the next model by *boosting* the samples with errors
  * We could do this by duplicating them in the training set
  * Or we could weight the importance of getting any sample right or wrong
* Weight adjusting is finer than duplication

---

## Boosting Weight Adjustment

* We want to assign half the weight to misclassified examples
* Half the weight goes to the rest
  * Biased towards being correct on a small number of errors
  * Will make successive models in the ensemble very different

---

## Weight Adjustment

* Initialize weights to 1/N for N samples
* After training each classified
  * Multiply misclassified weights by $1/(2\epsilon)$
  * Multiply correctly classified weights by $1/(1 - 2\epsilon)$
* As long as $\epsilon < 0.5$, this increased the weight of misclassified samples
  * If the error rate is over 0.5 something is wrong

---

## Confidence Factor

* To improve the ensemble, we also want to assign a confidence, $\alpha$, to each model
* AdaBoost, the most popular boosting method
  * Applies an exponential penalty for misclassification
  * $ln\sqrt\frac{1-\epsilon_t}{\epsilon_t}$ for the model at step t
* We use this confidence to weight each model in the ensemble

---

## The Learner

* What learner should we be using?
  * It should be *weak* so that errors can guide our search
  * We want it to easily conform to the heavily weighted samples
* Can't be trees, they fit their training data too well
  * But what if we use trees, but set their max depth to 1?

---

## Decision Stumps

* Just choose a single pivot
  * Don't use purity, just check immediately what the classification error will be
  * If the error is greater than 0.5, reverse the polarity and predict the opposite
* Thus each stump stores a `pivot column`, a `boundary` value, and a `polarity`
* It will also gain a confidence, $\alpha$, once its error rate it measured

---

## Stump

```python
class DecisionStump:
    """Also called univariate decision tree.
    Has depth 1, making a single decision.
    Returns either -1 or 1 upon classification.
    """

def __init__(self, feature_index, boundary, polarity):
        # The column of the chosen feature
        self.feature_index = feature_index
        self.boundary = boundary
        # Says if the output is -1 or 1 beyond the boundary
        self.polarity = polarity
        # The certainty weighting from AdaBoost
        self.alpha = None
        # The error of the stump
        self.error = None

def predict(self, X):
        # The default polarity
        predictions = self.polarity * np.ones(X.shape[1])
        # Reverse polarity for anything below the decision boundary
        predictions[X[self.feature_index] < self.boundary] *= -1
        return predictions
```

---

## Training with Weight

```python
def train_stump(X, y, weights):
    """
    Arguments:
       X: Column first samples
       y: Class labels
       weights: Weights to rows in X
    """
    columns, samples = X.shape
    min_error = float('inf')
    best_stump = None

for fidx, feature in enumerate(X):
        # Just like regular trees, check each unique value as the decision boundary
        boundaries = np.unique(feature)

for boundary in boundaries:
            for polarity in [1, -1]:
                stump = DecisionStump(fidx, boundary, polarity)
                predictions = stump.predict(X)

# Find the errors and then weight them
                errors = (predictions != y)
                weighted_error = np.sum(weights[errors])

if weighted_error < min_error:
                    min_error = weighted_error
                    best_stump = stump
                    best_stump.error = weighted_error
    return best_stump
```

---

## Stump Example

* Initial weights for points are all $1/6$

---

## Stump Example

* First stump has alpha 0.35
* It has errors, so weights are updated:
  * 0.25 0.125 0.25 0.125 0.125 0.125

---

## Stump Example

* Second stump has alpha 0.549
  * High because it fixed high weight errors from before
* New weights: 0.167 0.083 0.167 0.083 0.25 0.25

---

## Stump Example

* Third stump has alpha 0.346
  * It's going back and repeating the first line
* New weights: 0.25 0.0625 0.25 0.0625 0.1875 0.1875

---

## Stump Example

* Fourth stump has alpha 0.394

---

## Stump Example

* Fifth stump has alpha 0.381

---

## Stump Example

* Sixth stump has alpha 0.472

---

## Visualizing large datasets

* 1 decision boundary

---

## Visualizing large datasets

* 3 decision boundaries

---

## Visualizing large datasets

* 120 decision boundaries

---

## Complexity

* The algorithm is no more complicated than deciding on a single split in a decision tree 
  * $C*D^2$, with C columns and D rows
    * If each row value is unique, each will be compared as a pivot
* We will use many trunks
* Creating each trunk is fast
  * But they cannot be created in parallel

---

## Robustness

* AdaBoost uses very weak learners with decision trunks
  * Each one carves out a tiny chunk of the decision space
* This makes it resistant to overfitting to the data
  * The trunks aren't really capable of overfitting
* Trunks become redundant as the model fits
  * errors will begin to pull decision boundaries back and forth

---

## Results

---

## Training Statistics

* Accuracy is 0.95

---

## Test Statistics

* Accuracy is 0.968

---

## Bias and Variance

* How did these ensemble techniques impact the bias and variance tradeoff?
* Bagging reduced the variance of the training dataset
  * By removing data, the model cannot overfit to it
  * Results are then majority voted, smoothing the decision boundary
* Boosting reduces bias, making a complicated decision boundary from a simple one
* Thus bagging works well with high-variance decision trees, while boosting works well with high-bias stumps

---

## Improving Spam Filtering

* We aren't going to do much better on Spambase
  * Improvements require more data
* Want to move on to TRECSpam 2007
  * 50199 spam emails, 25220 not spam emails
  * All text fields
* But how will we process text rather than summary features?