<!--
Abstract:

CS 461
Introduction to Deep Learning
Lecture 04
-->

# CS 461 - Lecture 04

## Machine Learning Principles

Bernhard Firner

2025-09-17

---

## Reading

* Recommended reading
  * Machine Learning: A Probabilistic Perspective by Murphy
    * Section 8.5 (online learning, SGD, and the perceptron)
    * Section 16.2 (trees, stop at 16.2.4)
  * Machine Learning: The Art and Science of Algorithms that Make Sense of Data by Flach
    * Section 7.2 (perceptron)
    * Section 5 (trees)

---

## Review

* Started looking at a real (tiny) dataset
  * 150 irises, 3 species with 50 examples each

---

## Collab code

```python
#! /usr/bin/python3

import math
import numpy as np
import sys

def sigmoid(eta):
    """
    The sigmoid function.
    """
    return 1 / (1 + np.exp(-eta))

def logistic_regression(file_path, x_labels, y_label, y_target, learning_rate, epochs):
    """
    Performs logistic regression to draw a decision boundary through
    (hopefully) linearly separable data.

    Args:
        file_path (str): The path to a file containing x and y data.
        x_labels (list[str]): Names of the x column data
        y_label (str): Name of the y column data
        y_target (int): Target y class
        learning_rate (float): Update rate
        epochs (int): Number of times to iterate through the data
    """
    # Get the first line of the file with the column names
    with open(file_path) as f:
        header_row = f.readline().strip('\n')
    column_names = [name.strip(' ') for name in header_row.split(',')]
    # Load data
    columns = np.loadtxt(file_path, delimiter=',', skiprows=1, unpack=True)
    x_indices = [column_names.index(x_label) for x_label in x_labels]
    y_index = column_names.index(y_label)

    features = len(x_labels)

    X = columns[x_indices]
    y = columns[y_index]
    # Set everything with the target class to 1
    y = y == y_target

    # L2 regularization strength
    lam = 0.01

    # What happens here?
    # NOTE: It would probably be simpler to leave w with shape (features) rather than (features,1)
    # Most of the transposing comes from this
    w = np.zeros((features, 1))
    b = 0

    for i in range(epochs):
        # In training loop
        y_hat = sigmoid(w.T @ X + b)

        # Gradients
        dw = (1/len(y)) * (y_hat - y) @ X.T + lam*w.T
        db = (1/len(y)) * np.sum(y_hat - y) + lam*b

        # Update
        w -= learning_rate * dw.T
        b -= learning_rate * db

    # Out of the loop
    y_hat = sigmoid(w.T @ X + b)
    predictions = (y_hat > 0.5).astype(int)
    accuracy = np.mean(predictions == y)

    print(f"Final accuracy is {accuracy}")
    np.set_printoptions(precision=3, suppress=True)
    print(f"Linear model was {w.flatten()}X + {b}")
    return w, b, X, y_hat, y

if __name__ == "__main__":
    if len(sys.argv) < 7:
        print("Provide the data file, the y column, the y target class, the learning rate, the epochs, and then the x columns")
    else:
        columns = sys.argv[6:]
        W, b, x, y_hat, y = logistic_regression(sys.argv[1], columns, sys.argv[2], int(sys.argv[3]), float(sys.argv[4]), int(sys.argv[5]))
```

---

## Separability

* Some classes overlap along some feature dimension
  * e.g. along sepal length
  * These classes are not separable using that feature
* If we *can* draw a straight line (a hyperplane in higher dimensions) between the classes, then they are `linearly separable`
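
A quick check for overlap along a single feature can be sketched as follows. The function name `separable_along_feature` and the measurements are illustrative assumptions, not values taken from the iris data.

```python
import numpy as np

def separable_along_feature(values_a, values_b):
    """Return True if a single threshold on this feature separates the two classes."""
    a = np.asarray(values_a, dtype=float)
    b = np.asarray(values_b, dtype=float)
    # The classes are separable along one feature when their ranges do not overlap
    return a.max() < b.min() or b.max() < a.min()

# Hypothetical measurements for two classes
short_petals = [1.4, 1.5, 1.3, 1.7]
long_petals = [4.7, 4.5, 5.1, 4.0]
print(separable_along_feature(short_petals, long_petals))    # True: the ranges do not overlap

overlapping_a = [5.0, 5.5, 6.1]
overlapping_b = [5.8, 6.4, 7.0]
print(separable_along_feature(overlapping_a, overlapping_b))  # False: 5.8 falls below 6.1
```

Two classes that fail this test on every single feature may still be linearly separable when several features are combined.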

---

## Guessing and Collapse

* Classification models may "collapse" to guessing the most common answer
* This generally happens when no signal is found in the data
* Accuracy becomes the frequency of the most common class
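
As a rough illustration, the accuracy of a collapsed model can be computed directly from the labels. The helper name `majority_baseline_accuracy` is made up for this sketch; the label counts mirror a one-vs-rest problem on a balanced three-class dataset of 150 samples.

```python
import numpy as np

def majority_baseline_accuracy(labels):
    """Accuracy of a model that always predicts the most common class."""
    _, counts = np.unique(labels, return_counts=True)
    return counts.max() / len(labels)

# One class against the rest: 50 positives and 100 negatives
labels = np.array([1] * 50 + [0] * 100)
print(majority_baseline_accuracy(labels))  # about 0.667
```

An accuracy close to this value is a hint that the model found no signal and is only guessing the most common class.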

---

## Problems with Logistic Regression

* Gradients and learning rate
* Memory
  * We need the entire dataset
* Population mean shifts the decision boundary

---

## The Gradient

* Gradient points in direction of fastest increase
  * We try to minimize error, so increases are bad
  * So we subtract it, scaled by some learning rate
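
A minimal sketch of this update on a one-dimensional toy error function; the function $f(x) = (x-3)^2$, the starting point, and the learning rate are all assumptions chosen for illustration.

```python
def grad(x):
    """Derivative of the toy error function f(x) = (x - 3)**2."""
    return 2 * (x - 3)

x = 0.0
learning_rate = 0.1
for step in range(50):
    # Step against the gradient, scaled by the learning rate
    x -= learning_rate * grad(x)

print(round(x, 3))  # 3.0, the location of the minimum
```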

---

## Problems

<div class="row">
<div class="col">

* Not guaranteed to point towards global minimum
* Not even guaranteed to point to a local minimum
* Also, large learning rates jump over minima

</div>
<div class="col">
<img style="width: 95%" class="r-stretch" src="./figures/trapped_loss.webp" />

</div>
</div>

-v-

```gnuplot
#! /usr/bin/gnuplot

set terminal png size 1920,1080 font "CourierPrime-Bold" fontscale 3 enhanced
set yrange [-5:5]
set xrange [-5:5]
set xlabel "x"
set ylabel "y"

set grid
#set hidden3d
set output "../figures/saddle_loss_surface.png"
splot x**2-y**2 title "loss surface"

set terminal webp size 1920,1080 font "CourierPrime-Bold" fontscale 3 enhanced rounded animate delay 500 loop 0
set output "../figures/trapped_loss.webp"

# Let's say we begin at x = -2, y = 0

step=0
cur_x = -2
cur_y = 0
error_x(val)=val**2
error_y(val)=-val**2
grad_x(val)=2*val
grad_y(val)=-2*val
learning_rate=0.25
# The gradient is the direction of the fastest increase, so subtract
# to get the fastest decrease in error
next_x = cur_x - learning_rate*error_x(cur_x)*grad_x(cur_x)
next_y = cur_y - learning_rate*error_y(cur_y)*grad_y(cur_y)
while (step < 10) {
    set arrow 1 from cur_x, cur_y, cur_x**2-cur_y**2 to next_x, next_y, next_x**2-next_y**2 linewidth 2
    splot x**2-y**2 title "loss surface"
    step = step + 1
    cur_x = next_x
    cur_y = next_y
    next_x = cur_x - learning_rate*error_x(cur_x)*grad_x(cur_x)
    next_y = cur_y - learning_rate*error_y(cur_y)*grad_y(cur_y)
}
```

---

## Fixes for learning rate

* There are adaptive learning rate algorithms
* And momentum techniques to prevent oscillation
* We will revisit learning rate more in the future
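
To give a flavor of what a momentum technique looks like, here is a sketch of one common form (classical momentum) on a toy quadratic loss. The loss, the starting point, and the values of `learning_rate` and `beta` are assumptions for illustration only.

```python
import numpy as np

def grad(w):
    """Gradient of the toy loss f(w) = 0.5 * (w[0]**2 + 25 * w[1]**2)."""
    return np.array([w[0], 25.0 * w[1]])

w = np.array([2.0, 1.0])
velocity = np.zeros_like(w)
learning_rate = 0.03
beta = 0.9  # fraction of the previous update that is kept

for step in range(300):
    # Blend the previous update with the new gradient step
    velocity = beta * velocity - learning_rate * grad(w)
    w = w + velocity

print(np.linalg.norm(w))  # small: the iterate ends up near the minimum at (0, 0)
```

The velocity term averages recent gradients, so components that flip sign from step to step partly cancel while consistent components accumulate.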

---

## Dataset memory

* Our logistic regression code attempts to minimize over the entire dataset
* Okay for small amounts of data
  * Fails once the dataset no longer fits in memory

---

## Memory solutions

* We can learn online, using a subset of the data
  * Called a minibatch
* Train with current sample, minimize future regret
  * Guess $W$ to minimize future error, $f(W)=\mathbb{E}[f(W,z)]$
  * Adjust $W$ slowly, averaging the results of previous batches
    * Assumes future batches will be statistically similar to past ones
* Justified with expectations, so it is stochastic
  * Stochastic gradient descent, or SGD
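
A minimal sketch of minibatch SGD applied to logistic regression. It assumes `X` is laid out as (samples, features), which differs from the lecture code, and uses synthetic data; the function name and default parameters are illustrative.

```python
import numpy as np

def sigmoid(eta):
    return 1 / (1 + np.exp(-eta))

def sgd_logistic_regression(X, y, learning_rate=0.1, epochs=50, batch_size=8, seed=0):
    """Minibatch SGD for logistic regression. X has shape (samples, features)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        # Visit the samples in a new random order every epoch
        order = rng.permutation(len(y))
        for start in range(0, len(y), batch_size):
            batch = order[start:start + batch_size]
            y_hat = sigmoid(X[batch] @ w + b)
            # Gradient estimated from this minibatch only
            error = y_hat - y[batch]
            w -= learning_rate * (X[batch].T @ error) / len(batch)
            b -= learning_rate * error.mean()
    return w, b

# Synthetic, well separated data (an assumption for the demonstration)
rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(-2, 0.5, (40, 2)), rng.normal(2, 0.5, (40, 2))])
y = np.concatenate([np.zeros(40), np.ones(40)])
w, b = sgd_logistic_regression(X, y)
accuracy = np.mean((sigmoid(X @ w + b) > 0.5) == y)
print(f"accuracy: {accuracy}")
```

Only `batch_size` rows are touched per update, so the full dataset never needs to be in the gradient computation at once.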

---

## SGD

* What learning rate should be used?
  * Remember, we don't know our future loss, only the current step
* Some conditions (Robbins-Monro conditions) ensure convergence
  * $\sum_{k=1}^{\infty}\eta_k=\infty, \sum_{k=1}^{\infty}\eta_{k}^2 < \infty$
* These conditions constrain the learning rate schedule $\eta_k$
  * Notice that we may still require near infinite time to converge
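
One family of schedules that satisfies both conditions is $\eta_k = (\tau_0 + k)^{-\kappa}$ with $\kappa \in (0.5, 1]$. The sketch below checks the two sums numerically for the simplest case, $\eta_k = 1/k$; the function name and the number of steps are assumptions.

```python
import numpy as np

def learning_rate_schedule(k, tau0=0.0, kappa=1.0):
    """eta_k = (tau0 + k) ** -kappa, for step k >= 1 and kappa in (0.5, 1]."""
    return (tau0 + k) ** -kappa

steps = np.arange(1, 100001)
etas = learning_rate_schedule(steps)
# The partial sums of eta_k keep growing (slowly, like log k) ...
print(etas.sum())
# ... while the partial sums of eta_k**2 level off (towards pi**2 / 6 for eta_k = 1/k)
print((etas ** 2).sum())
```

The first sum must diverge so the steps can reach any point; the second must converge so the steps eventually become small enough to settle.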

---

## SGD and Learning Rates

* In practice, we can use *early stopping* when errors are small
  * So converging to 0 error isn't important
* There are a lot of tricks to improve logistic regression
  * But the approach is overshadowed by stronger algorithms
* The solutions will appear again

---

## Mean Problem

* Logistic regression is influenced by the population statistics
  * Not by individual samples
  * Just look at the bias updates

* Only cares about average error
* It is possible for a population to move, leaving an outlier on the wrong side of the boundary

---

## Problem?

* Ignoring outliers and caring about the population is how logistic regression works
  * It's a high bias, low variance algorithm
* An alternative would be to *only* update upon errors, and ignore samples otherwise
  * This is the perceptron algorithm

---

## The perceptron algorithm

* Classify into -1 or 1 (instead of 0 and 1)
  * So just output the sign of $W^Tx$
  * This replaces the sigmoid
  * $\hat{y}_i = \text{sign}(W^Tx_i)$
* If the sign is wrong, then update the weights
  * Two possible errors:
  * $\hat{y}_i = 1, y_i = -1$
  * $\hat{y}_i = -1, y_i = 1$

---

## Perceptron Gradient

* Gradient is $(\hat{y}_i-y_i)x_i$
* $\hat{y}_i = 1, y_i = -1$
  * negative gradient is $-2x_i$
* $\hat{y}_i = -1, y_i = 1$
  * negative gradient is $2x_i$
* So we merely adjust $W$ by a set learning rate every time it is wrong
* High variance, low bias
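
The update rule can be sketched as below. Appending a constant 1 to each sample so that a bias is learned inside $W$, treating a score of exactly 0 as class -1, and the toy data are all assumptions of this sketch.

```python
import numpy as np

def train_perceptron(X, y, learning_rate=1.0, max_epochs=100):
    """Perceptron training. X has shape (samples, features), y holds -1 or 1."""
    # Append a constant 1 to each sample so the bias is learned as part of W
    X = np.hstack([X, np.ones((len(X), 1))])
    W = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            y_hat = 1 if W @ x_i > 0 else -1
            if y_hat != y_i:
                # Negative gradient is (y_i - y_hat) * x_i, which is +/- 2 * x_i
                W += learning_rate * (y_i - y_hat) * x_i
                mistakes += 1
        if mistakes == 0:
            # Every sample is classified correctly, so no more updates will happen
            break
    return W

# Toy linearly separable data (hypothetical values)
X = np.array([[2.0, 1.0], [3.0, 2.5], [-1.0, -2.0], [-2.5, -0.5]])
y = np.array([1, 1, -1, -1])
W = train_perceptron(X, y)
predictions = np.where(np.hstack([X, np.ones((len(X), 1))]) @ W > 0, 1, -1)
print(predictions)
```

Correctly classified samples never change $W$, which is the key difference from the logistic regression update.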

---

## Perceptron Algorithm

* Created in 1958
* Shown to converge if the data is linearly separable
  * Meaning a $w$ exists such that $\text{sign}(w^Tx_i)$ achieves 0 error
  * Eventually
* Historically important, but SGD with a different model is simply better

---

## One more problem

* Let's say we wanted to use the flower type to predict another variable
  * We can't! The flower type isn't a continuous numeric input!
* We've always assumed that $X$ is a vector of numbers!

---

## Solution: Decision Trees

* Build a tree of decisions that eventually classify an input
* Decisions partition inputs by category (for categorical inputs) or by numeric tests
  * e.g. with $x_i > 20$ into one group, everything else into the other

---

## Expressivity

* A small number of questions is powerful
* Depth only grows roughly logarithmically with the number of possibilities if questions are good
  * Think of the "20 questions" game
  * Need questions that evenly divide the possibilities
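
The arithmetic behind the "20 questions" intuition, as a small sketch. It assumes every question splits the remaining possibilities exactly in half; `questions_needed` is a name made up for this example.

```python
import math

def questions_needed(num_possibilities):
    """Depth of a perfectly balanced yes/no tree that isolates one possibility."""
    return math.ceil(math.log2(num_possibilities))

print(questions_needed(150))  # 8, since 2**8 = 256 >= 150
print(2 ** 20)                # 1048576 possibilities covered by 20 questions
```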

---

## Tree Advantages

* Elegantly handles tabular data
* Like the perceptron, is capable of modelling every input value
  * User can restrict the model if needed
* The resulting tree is also human-interpretable
* No real assumptions about the underlying model

---

## Classification Vs Regression

* Also called Classification and Regression Trees (CART)
* We could make regression trees
  * Sort into leaf nodes and then run regression on that data subset
* Going to focus on decision trees for classification though

---

## Growing a Tree

* The tree will be recursive
  * Once we split in the first node, we repeat for the child nodes
* Only need two algorithms:
  * How do we split?
  * When do we stop?

---

## Good Splits

* For example, let's say we see 15 samples
  * 10 of class A and 5 of class B
* The split divides that into two groups
  * If a condition puts 10 A's in one group and 5 B's in another, it is perfect
  * How about 5 A's in one group and 5 of each in the other?

---

## Impurity tests

* A pure split divides a mix of classes into groups that only contain one class
* Our error is now the `impurity`
* We will call our scoring an `impurity` test
  * As in, how pure is the class distribution after this split?

---

## Entropy

* We could use entropy
* With $\hat{p}$ as our class probability estimate:
* $\mathbb{H}(\hat{p}) = -\sum_{c=1}^{C}\hat{p}_{c}\log(\hat{p}_c)$
* A good split is one where we maximize information gain
  * The drop in entropy from the parent to the size-weighted average of its children
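
A sketch of entropy and information gain for a group of 10 A's and 5 B's under two candidate splits. Using base-2 logarithms (bits) is an assumption, as are the function names.

```python
import numpy as np

def entropy(classlist):
    """Entropy (in bits) of the class distribution in classlist."""
    _, counts = np.unique(classlist, return_counts=True)
    p_hats = counts / len(classlist)
    return -np.sum(p_hats * np.log2(p_hats))

def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - children

parent = ['A'] * 10 + ['B'] * 5
# A perfect split: all A's on one side, all B's on the other
print(information_gain(parent, ['A'] * 10, ['B'] * 5))
# A weaker split: 5 A's on one side, 5 A's and 5 B's on the other
print(information_gain(parent, ['A'] * 5, ['A'] * 5 + ['B'] * 5))
```

The perfect split recovers all of the parent's entropy as gain, while the weaker split recovers only part of it.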

---

## Gini Impurity

* We could also use the expected error rate
* Observe each class $c$ with probability $\hat{p}_c$
  * We should guess that class with the same probability
  * So we guess it isn't that class with probability $1 - \hat{p}_c$
* Error rate for one class is $\hat{p}_c(1-\hat{p}_c)$
* For all classes: $\sum_{c=1}^{C}\hat{p}_c(1-\hat{p}_c)$

---

## Simplifying

* $\sum_{c=1}^{C}\hat{p}_c(1-\hat{p}_c)$
* $\sum_{c=1}^{C}\hat{p}_c - \sum_{c=1}^{C}\hat{p}_{c}^2$
* $1 - \sum_{c=1}^{C}\hat{p}_{c}^2$
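
A quick numeric check that the error-rate form and the simplified form agree, using class probabilities from a group of 10 A's and 5 B's. The function names are made up for this sketch.

```python
def gini_as_error_rate(p_hats):
    """Sum over classes of p_c * (1 - p_c)."""
    return sum(p * (1 - p) for p in p_hats)

def gini_simplified(p_hats):
    """1 minus the sum over classes of p_c ** 2."""
    return 1 - sum(p ** 2 for p in p_hats)

p_hats = [10 / 15, 5 / 15]
# Both forms give 4/9, about 0.444
print(gini_as_error_rate(p_hats), gini_simplified(p_hats))
# A pure group has zero impurity
print(gini_simplified([1.0, 0.0]))  # 0.0
# An even mix is the worst case for two classes
print(gini_simplified([0.5, 0.5]))  # 0.5
```

The simplification works because the class probabilities sum to 1.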

---

## $\sqrt{\text{Gini}}$

* Flach recommends using $\sqrt{\text{Gini}}$
  * $\sqrt{1 - \sum_{c=1}^{C}\hat{p}_{c}^2}$
* It is robust to changes in class distribution
  * For example, if your class representations are unbalanced
* I'll defer to Flach here
  * But read section 5.2 if you want to form an opinion

---

## Scoring function

```python
import numpy as np

def gini_impurity(classlist):
    # We can't test with 0 instances
    if len(classlist) == 0:
        return 0.

    # Get the unique classes and the counts for each class
    classes, counts = np.unique(classlist, return_counts=True)
    p_hats = counts/len(classlist)
    gini = 1.0 - np.sum(p_hats**2)
    return np.sqrt(gini)
```

---

## Splitting

* Start with a dataset, $D$
  * It could be a subset of the full data
  * Has $n$ columns/features, $x_1$ ... $x_n$
* Loop through each feature, $x_i$
  * Test each split for gini impurity
    * Weight impurity of each side by number of values on each side

---

## Categorical Values

* Gini impurity works with both continuous and categorical data
* The place to split differs though
* Loop over all categories in $x_i$
  * $x_{ij}$
  * Check split statistics if the split condition is equality

---

## Categorical Values

```python
for feature in range(num_features):
    # Find unique values
    values = np.unique(X[:, feature])

    for value in values:
        left_indices = np.where(X[:, feature] == value)[0]
        right_indices = np.where(X[:, feature] != value)[0]
        # Now find gini impurity
```

---

## Continuous Values

* For each unique value in $x_i$
  * $x_{ij}$
  * Check statistics with that value as split condition

```python
for feature in range(num_features):
    # Find unique values
    values = np.unique(X[:, feature])

    for value in values:
        left_indices = np.where(X[:, feature] <= value)[0]
        right_indices = np.where(X[:, feature] > value)[0]
        # Now find gini impurity
```

---

## Choosing a split

* Check impurities
  * Find the lowest impurity split within each column
  * Find the column with the lowest impurity split
* Select the split with the lowest and create two new subtrees
* Repeat for the new left and right subtrees

---

## Check Each Column

```python
def str_equal(left, right):
    return left == right

def str_nequal(left, right):
    return left != right

def num_lt(left, right):
    return left < right

def num_gte(left, right):
    return left >= right

def get_split(columns, classlist):
    """Search the columns to find the lowest impurity split of the classlist.
    Arguments:
        columns (list[list]): Columns of numeric and string data.
        classlist (list[str]): Class labels for each row

    Returns:
        (int, float or str): Tuple of the best feature index and its split threshold.
    """
    best_impurity = float('inf')
    best_index = None
    best_threshold = None

    for index, column in enumerate(columns):
        # Numeric columns are split by threshold, string columns by equality
        if isinstance(column[0], float):
            impurity, threshold = get_best_impurity(column, classlist, num_lt, num_gte)
        else:
            impurity, threshold = get_best_impurity(column, classlist, str_equal, str_nequal)
        if impurity < best_impurity:
            best_impurity = impurity
            best_index = index
            best_threshold = threshold

    return best_index, best_threshold
```

---

## Checking within a column

```python
def get_best_impurity(column, classlist, comparison_left, comparison_right):
    """Search the values in the column to find the lowest impurity split of the classlist.
    Arguments:
        column (list[str or float]): Column of numeric or string data.
        classlist (list[str]): Class labels for each row
        comparison_left  ((left, right) -> bool): Called as (split value, column value); True puts the row in the left group.
        comparison_right ((left, right) -> bool): Called as (split value, column value); True puts the row in the right group.

    Returns:
        (float, float or str): Tuple of the best impurity and its split threshold.
    """

    best_impurity = float('inf')
    best_threshold = None

    # Find unique values
    values = np.unique(column)

    # Test unique values for split effectiveness
    for value in values:
        left_comparisons = [comparison_left(value, x) for x in column]
        right_comparisons = [comparison_right(value, x) for x in column]
        left_indices = np.asarray(left_comparisons).nonzero()[0]
        right_indices = np.asarray(right_comparisons).nonzero()[0]
        # Now find gini impurity
        left_impurity = gini_impurity([classlist[index] for index in left_indices])
        right_impurity = gini_impurity([classlist[index] for index in right_indices])
        weighted_impurity = (len(left_indices)/len(classlist))*left_impurity + \
                            (len(right_indices)/len(classlist))*right_impurity
        if weighted_impurity < best_impurity:
            best_impurity = weighted_impurity
            best_threshold = value
    return best_impurity, best_threshold
```

---

## Stopping

* Whenever a subtree has only one class present
* If multiple classes have exactly the same values and cannot be split
  * You'll see this as a split that has 0 elements on one side
* Can also stop at a given depth or when too few elements are in a leaf
  * Basic kinds of regularization
  * Accept this as a user input, as with $\lambda$
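
One way these checks might be combined is sketched below. The function name `should_stop` and the default limits `max_depth` and `min_samples` are hypothetical; the check for an unsplittable node (a split with 0 elements on one side) is left out because it depends on how the split search reports its result.

```python
def should_stop(classlist, depth, max_depth=5, min_samples=4):
    """Decide whether a node should become a leaf instead of being split again."""
    # Only one class is present, so the node is already pure
    if len(set(classlist)) <= 1:
        return True
    # Regularization: limit the depth of the tree
    if depth >= max_depth:
        return True
    # Regularization: do not split nodes that hold too few samples
    if len(classlist) < min_samples:
        return True
    return False

print(should_stop(['A', 'A', 'A', 'A'], depth=1))  # True: pure node
print(should_stop(['A', 'B', 'A', 'B'], depth=1))  # False: mixed, shallow, and large enough
print(should_stop(['A', 'B', 'A', 'B'], depth=5))  # True: depth limit reached
print(should_stop(['A', 'B'], depth=1))            # True: too few samples
```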

---

## Weaknesses

* The decision thresholds are picked greedily from the training values, not derived from a model of the data
  * Which means that a single decision tree can be a weak classifier
* Small input changes can also vastly change the model
  * One change to an early split has huge impacts
* High variance: the tree may respond too much to the sample data

---

## Solutions

* There are many ways to improve trees
* One of the simplest is to put them together, averaging many estimates
* But we'll stop here for now and revisit trees later
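
For classification, "averaging" many trees can be as simple as a majority vote. A sketch with made-up predictions from five hypothetical trees:

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-tree predictions for one sample by picking the most common label."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical predictions from five separately trained trees for one penguin
tree_predictions = ['Adelie', 'Gentoo', 'Adelie', 'Adelie', 'Chinstrap']
print(majority_vote(tree_predictions))  # Adelie
```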

---

## Fun Dataset

* Penguins
  * https://github.com/allisonhorst/palmerpenguins
* Filter out records with missing data:
  * `grep -v NA palmerpenguins/inst/extdata/penguins.csv > filtered_penguins.csv`

---

## Contents

* There are three categorical columns
  * Species
  * Island
  * Sex
* Continuous columns are for bill dimensions, flipper length, and body mass

---

## Dealing with strings

```python
def read_csv(path):
    with open(path, 'r') as datafile:
        header_row = datafile.readline().strip('\n')
        column_names = [name.strip(' ') for name in header_row.split(',')]
        data = [[] for _ in column_names]
        for line in datafile:
            row_data = [entry.strip(' ') for entry in line.strip('\n').split(',')]
            for idx, entry in enumerate(row_data):
                try:
                    # Numerical value
                    number = float(entry)
                    data[idx].append(number)
                except ValueError:
                    # String value (for categorical data)
                    data[idx].append(entry)
            # Error checking
            assert all([len(data[0]) == len(data[i]) for i in range(1, len(data))])
    return data, column_names
```

---

## Homework 1

* Finish the decision tree algorithm
  * Basically, add a node class and assemble the tree
* Traverse your tree *backwards*, from the leaves up, to randomly generate new penguins
* More details when assignment is posted on Canvas
* TA will also go over some probability details