<!--
Abstract:

CS 461
Introduction to Deep Learning
Lecture 03
-->

# CS 462 - Lecture 03

## Machine Learning Principles

Bernhard Firner

2025-09-15

---

## Least Squares Review

<p><math display="block" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mrow><mo stretchy="true" form="prefix">(</mo><mtable><mtr><mtd columnalign="center" style="text-align: center"><msub><mi>y</mi><mn>1</mn></msub></mtd></mtr><mtr><mtd columnalign="center" style="text-align: center"><mi>⋮</mi></mtd></mtr><mtr><mtd columnalign="center" style="text-align: center"><msub><mi>y</mi><mi>n</mi></msub></mtd></mtr></mtable><mo stretchy="true" form="postfix">)</mo></mrow><mo>=</mo><mrow><mo stretchy="true" form="prefix">(</mo><mtable><mtr><mtd columnalign="center" style="text-align: center"><mn>1</mn></mtd><mtd columnalign="center" style="text-align: center"><msub><mi>x</mi><mn>1</mn></msub></mtd></mtr><mtr><mtd columnalign="center" style="text-align: center"><mi>⋮</mi></mtd><mtd columnalign="center" style="text-align: center"><mi>⋮</mi></mtd></mtr><mtr><mtd columnalign="center" style="text-align: center"><mn>1</mn></mtd><mtd columnalign="center" style="text-align: center"><msub><mi>x</mi><mi>n</mi></msub></mtd></mtr></mtable><mo stretchy="true" form="postfix">)</mo></mrow><mrow><mo stretchy="true" form="prefix">(</mo><mtable><mtr><mtd columnalign="center" style="text-align: center"><msub><mi>β</mi><mn>0</mn></msub></mtd></mtr><mtr><mtd columnalign="center" style="text-align: center"><msub><mi>β</mi><mn>1</mn></msub></mtd></mtr></mtable><mo stretchy="true" form="postfix">)</mo></mrow><mo>+</mo><mrow><mo stretchy="true" form="prefix">(</mo><mtable><mtr><mtd columnalign="center" style="text-align: center"><msub><mi>ϵ</mi><mn>1</mn></msub></mtd></mtr><mtr><mtd columnalign="center" style="text-align: center"><mi>⋮</mi></mtd></mtr><mtr><mtd columnalign="center" style="text-align: center"><msub><mi>ϵ</mi><mi>n</mi></msub></mtd></mtr></mtable><mo stretchy="true" form="postfix">)</mo></mrow></mrow><annotation encoding="application/x-tex">\begin{pmatrix}
    y_1\\
    \vdots\\
    y_n
  \end{pmatrix} =
  \begin{pmatrix}
    1 & x_1\\
    \vdots & \vdots\\
    1 & x_n
  \end{pmatrix}
  \begin{pmatrix}
    \beta_0\\
    \beta_1
  \end{pmatrix} + 
  \begin{pmatrix}
    \epsilon_1\\
    \vdots\\
    \epsilon_n
  \end{pmatrix}</annotation></semantics></math></p>

* We wanted to minimize the squared error
  * This is a type of linear regression

---

## Sum of Squares

* Error is still $(y - X\beta)$
* Squared error of the samples is $(y-X\beta)^T(y-X\beta)$
* After taking the derivative and solving for 0:
  * $X^TX\beta = X^Ty$
    * Known as the normal equation
  * $\beta = (X^TX)^{-1}X^Ty$
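
The normal equation can be sketched in a few lines of NumPy. This is a minimal illustration, not the course's reference code: the synthetic data, the seed, and the variable names are all assumptions made for the example.

```python
import numpy as np

# Hypothetical data: y = 2 + 3x plus Gaussian noise (values chosen for illustration)
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
y = 2.0 + 3.0 * x + rng.normal(scale=0.1, size=x.shape)

# Design matrix with a column of ones for the intercept
X = np.column_stack([np.ones_like(x), x])

# Solve X^T X beta = X^T y without forming the inverse explicitly
beta = np.linalg.solve(X.T @ X, X.T @ y)

# np.linalg.lstsq minimizes the same squared error and should agree
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta, beta_lstsq)
```

Solving the linear system directly is usually preferred over computing $(X^TX)^{-1}$ and multiplying, since it does less work and loses less precision.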

---

## Instability

* If $X^TX$ is nearly singular (its determinant is near 0), inverting it becomes perilous
  * $\beta = (X^TX)^{-1}X^Ty$
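
One way to see the problem is the condition number of $X^TX$. The sketch below is an assumed example (the nearly duplicated third column is invented for illustration) comparing a well-behaved design matrix to one with almost linearly dependent columns.

```python
import numpy as np

# Hypothetical example: two nearly identical columns make X^T X nearly singular
x = np.linspace(0.0, 1.0, 20)
X_good = np.column_stack([np.ones_like(x), x])
X_bad = np.column_stack([np.ones_like(x), x, x + 1e-4 * x**2])

# The condition number bounds how much errors are amplified when solving
cond_good = np.linalg.cond(X_good.T @ X_good)
cond_bad = np.linalg.cond(X_bad.T @ X_bad)
print(f"well conditioned: {cond_good:.3g}")
print(f"nearly singular:  {cond_bad:.3g}")
```

A very large condition number means tiny changes in the data can produce large changes in $\beta$.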

---

## Residuals

* The distances from our fit to the data points are henceforth known as residuals
  * $\hat{y} - y$
* We don't consider these errors because we have chosen this fit
* Instead, we hope that they represent the original noise
* Thus `we never expect error to be 0`

---

## What is Overfitting?

* ML works on *sample statistics*
  * These are only an estimate of the true population statistics
* If you do not trust your dataset, then you should *bias* your model
  * For example, by forcing estimates of the mean closer to 0
  * This makes your estimates more robust in the face of *dataset variance*
* Overfitting is thus an improperly exact fitting to your data
* This is the *bias-variance* tradeoff

---

## Regularization

* Biasing your model towards a "simpler" representation is called *regularization*
  * In different ML approaches the details will be different
  * Could force sparsity, could penalize large parameter values
* In linear regression we often use $l_{2}$ regularization

---

## Ridge Regression

* Add the squared $l_2$ norm of $\beta$ as a shrinkage term
  * $Error = \sum_{i=1}^{n}(y_i - x_i^T\beta)^2 + \lambda \left\Vert \beta \right\Vert _{2}^{2}$
  * Changes $\beta = (X^TX)^{-1}X^Ty$
  * into $\beta = (X^TX + \lambda I)^{-1}X^Ty$
* This is called *ridge regression*
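
The shrinking effect can be checked numerically. The following is a small sketch under assumed data (a noisy sine fit with a degree 5 polynomial); `ridge_fit` is a hypothetical helper name, not part of the lecture code.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution: (X^T X + lam * I)^-1 X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.identity(d), X.T @ y)

# Hypothetical noisy data fit with a degree 5 polynomial
rng = np.random.default_rng(1)
x = np.linspace(-1.0, 1.0, 15)
y = np.sin(np.pi * x) + rng.normal(scale=0.2, size=x.shape)
X = np.column_stack([x**i for i in range(6)])

# Larger lambda pulls the coefficient vector toward 0
for lam in (0.0, 0.1, 10.0):
    beta = ridge_fit(X, y, lam)
    print(f"lambda={lam:5.1f}  ||beta||_2 = {np.linalg.norm(beta):.3f}")
```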

---

## Code, with normalization

```python[|24-29|48-54|72-81]
#! /usr/bin/python3

import math
import numpy as np
import sys

def least_squares(file_path, num_records, degree, lam=0.5, normalize=True):
    """
    Performs least squares regression to fit a polynomial of the given degree to data
    by manually constructing and solving the (ridge-regularized) normal equations.

    Args:
        file_path (str): The path to a file containing x and y data.
                         The file should have two columns of numbers.
    """
    # Load data
    x, y = np.loadtxt(file_path, unpack=True)
    x = x[:num_records]
    y = y[:num_records]

    y_mean = np.mean(y)
    y_std = np.std(y)
    # Normalize y, if requested
    if normalize:
        y = y - y_mean
        # Divide by the standard deviation so that the variance becomes 1
        y = y / y_std
        # Now the y data has 0 mean and unit variance

    # Construct X, representing the requested polynomial
    X = np.column_stack([x**i for i in range(0, degree+1)])

    # Left side matrix of the normal equation
    left = X.T @ X + lam*np.identity(degree+1)

    # Right side vector of the normal equation
    right = X.T @ y

    # Solve the system of linear equations to find the coefficients
    try:
        beta = np.linalg.solve(left, right)
    except np.linalg.LinAlgError:
        print("Error: The matrix is singular (meaning linearly dependent rows or columns).")
        return

    # If we normalized, turn everything back to their original ranges
    if normalize:
        y = y * y_std
        beta = beta * y_std
        y = y + y_mean
        # Only the intercept is shifted by the mean
        beta[0] += y_mean

    # Find y_hat.
    y_hat = sum([beta[i] * x**i for i in range(degree+1)])
    # Compute error information
    # The coefficient of determination (R^2) is the fraction of the variance in y explained by the fit
    ss_total = np.sum((y - y_mean)**2)

    residuals = y - y_hat
    ssquares_residual = np.sum(residuals**2)

    r_squared = 1 - (ssquares_residual / ss_total)

    np.set_printoptions(precision=3, suppress=True)
    print(f"Beta is {beta}")
    print(f"Residuals are {residuals}")
    print(f"Mean of squares from residuals is {ssquares_residual/len(residuals)}")
    print(f"R^2 is {r_squared}")
    # Also get results outside of the training range
    min_x = min(x)
    max_x = max(x)
    x_range = max_x - min_x
    x_samples = np.concatenate((
        np.linspace(min_x - 0.1*x_range, min_x, num=10, endpoint=False),
        x,
        np.linspace(max_x + 0.01*x_range, max_x + 0.1*x_range, num=10)))
    y_hat = sum([beta[i] * x_samples**i for i in range(degree+1)])
    return x_samples, y_hat

if __name__ == "__main__":
    if len(sys.argv) < 5:
        print("Provide the data file, the number of datapoints to consume, the degree, and lambda as arguments.")
    else:
        xs, y_hat = least_squares(sys.argv[1], int(sys.argv[2]), int(sys.argv[3]), float(sys.argv[4]))
        # Print out results
        for x, y in zip(xs, y_hat):
            print(f"Fit: {x} {y}")
```

---

## Normalization

* Rescales the data so that it has 0 mean and unit variance
* Going to do this for most data from now on
* Why?
  * Floating point representation is most precise around the -1 to 1 range
  * Simplifies assumptions when initializing some algorithms
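
As a quick illustration, here is a minimal sketch of standardizing a small array and undoing it afterwards. The measurement values are invented for the example.

```python
import numpy as np

# Hypothetical measurements with a large offset and scale
values = np.array([1012.0, 998.0, 1005.0, 1020.0, 990.0])

mean = values.mean()
std = values.std()

# Standardize: subtract the mean, then divide by the standard deviation
normalized = (values - mean) / std
print(normalized.mean(), normalized.var())

# Undo the transformation to return to the original units
restored = normalized * std + mean
```

Note that the division uses the standard deviation, not the variance; dividing by the variance would not give unit variance.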

---

## Results - sin

---

## Results - cos

---

## Results - both

---

## Summary

* Linear regression with least squares is useful
  * $y = \beta_{0} + \beta_{1}x$
* *Polynomial regression* on more complex functions is also possible
* Regression works if
  * There is enough data for the signal to overcome noise
  * And that noise is not correlated with anything
    * independent and identically distributed, iid

---

## Logistic Regression

* Looks like we almost have a decision boundary
* Can we turn this into a classifier?
  * Yes, with some work

---

## Reading

* Recommended reading
  * Machine Learning: A Probabilistic Perspective by Murphy
    * Section 8.1-8.3, 8.5.1-8.5.4
  * Machine Learning: The Art and Science of Algorithms that Make Sense of Data by Flach
    * Section 7.4, 9.3, 7.2

---

## Assumptions

* Previously assumed Gaussian-distributed outputs (Gaussian noise)
* Now we'll assume Bernoulli, but what function do we model?
  * $p(y|x, \beta) = Ber(y|\mu(x))$
  * "what is the probability that $y=1$ given we have observed $x$?"
* Outputs must be between 0 and 1, so we will use a new function
  * $\mu(x) = sigm(\beta^{T}x)$

---

## sigm

* The *sigmoid* function
  * Also called the logistic function (its inverse is the logit)
* $sigm(\eta) \triangleq \frac{1}{1+exp(-\eta)} = \frac{e^\eta}{e^\eta + 1}$
* We will be seeing this function again
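
A direct translation of the formula overflows in `exp` for very negative $\eta$. The sketch below shows one common way to avoid that; the function name `sigm` simply mirrors the slide's notation, and the branching trick is an illustrative choice rather than something the slides prescribe.

```python
import numpy as np

def sigm(eta):
    """Sigmoid written to avoid overflow in exp for large |eta|."""
    eta = np.asarray(eta, dtype=float)
    # exp(-|eta|) is always in (0, 1], so it never overflows
    z = np.exp(-np.abs(eta))
    # For eta >= 0 use 1 / (1 + e^-eta); otherwise use e^eta / (1 + e^eta)
    return np.where(eta >= 0, 1.0 / (1.0 + z), z / (1.0 + z))

print(sigm(0.0))
print(sigm(np.array([-1000.0, 1000.0])))
# Symmetry: sigm(-eta) = 1 - sigm(eta)
print(np.allclose(sigm(-2.0), 1.0 - sigm(2.0)))
```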

<img style="width: 95%" class="r-stretch" src="./figures/sigmoid.png" />

---

## Intuition

* For $x$ on the decision boundary, we expect $p(y=1|x)$ to be near 0.5
* If the classes are linearly separable (meaning separable with a single line)
  * Everything with p > 0.5 must lie on one side of the boundary
  * Everything with p < 0.5 must lie on the other side

---

## Notation

* Going to break $\beta$ into two parts
  * bias ($\beta_0$)
  * weight (everything else)
* Will just look at weights, $w$, next

---

## Formulation

* Gauss made least squares easily formulated
  * Not so for logistic regression
* Probability mass for Bernoulli is
  * $f(k;p) = p^k(1 - p)^{1-k}$
  * k is 0 or 1
* Becomes
  * $P(y_i|x_i)=\hat{p}(x_i)^{y_i}(1-\hat{p}(x_i))^{1-y_i}$

---

## Optimization

* The conditional likelihood is taken over all samples
  * $CL(w, t)=\prod_{i}\hat{p}(x_i)^{y_i}(1-\hat{p}(x_i))^{1-y_i}$
* Common practice to take the log and look at the log conditional likelihood
  * $LCL(w, t)=\sum_{i}y_{i}\ln\hat{p}(x_i)+(1-y_i)\ln(1-\hat{p}(x_i))$
  * It's also common to take the negative log likelihood
* It turns out that this has no analytic solution
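
The negative log conditional likelihood is easy to evaluate even though it cannot be minimized in closed form. This is a minimal sketch for the one-feature case; the toy data, the clipping constant, and the use of the mean instead of the sum are assumptions for illustration.

```python
import numpy as np

def negative_log_likelihood(w, b, x, y):
    """Mean negative log conditional likelihood for 1D logistic regression."""
    p_hat = 1.0 / (1.0 + np.exp(-(w * x + b)))
    # Clip to keep log() finite when a prediction saturates at 0 or 1
    p_hat = np.clip(p_hat, 1e-12, 1.0 - 1e-12)
    return -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

# Hypothetical toy data: class 1 tends to have larger x
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
y = np.array([0, 0, 0, 1, 1, 1])

print(negative_log_likelihood(0.0, 0.0, x, y))   # every p_hat is 0.5
print(negative_log_likelihood(2.0, 0.0, x, y))   # a better fit gives a lower value
```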

---

## Solutions

* See reading for interpretations
  * We can look at the gradients to intuit the solution space
* Brute force solution, at step $k$:
  * $w_{k+1} = w_{k} - \eta_{k}*g_{k}$
  * where $g_k$ is the gradient over all samples at step k
  * $\eta_k$ is the learning rate at step k
  * Called "steepest descent"
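
The update rule itself is only a few lines. Below is a sketch on a made-up one-dimensional objective with a constant learning rate; `steepest_descent` and the quadratic are hypothetical names and choices for illustration.

```python
def steepest_descent(grad, w0, learning_rate, steps):
    """Repeatedly apply w_{k+1} = w_k - eta * g_k with a constant eta."""
    w = w0
    for _ in range(steps):
        w = w - learning_rate * grad(w)
    return w

# Hypothetical objective f(w) = (w - 3)^2, whose gradient is 2 (w - 3)
w_final = steepest_descent(lambda w: 2.0 * (w - 3.0), w0=0.0, learning_rate=0.1, steps=100)
print(w_final)
```

With this learning rate the distance to the minimum at $w=3$ shrinks by a constant factor each step; a learning rate that is too large would instead oscillate or diverge.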

---

## Gradients

* Let's focus on a linear solution
  * $y = wx + b$
  * We somehow end up with boring gradients again
    * $\frac{d}{dw}f(w) = \sum_{i}(\mu_i - y_i)x_i=X^{T}(\mu - y)$
* Every implementation I've seen also normalizes by the length of y
  * That's the number of samples
* For $b$, the $x$ vector is just 1s
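
A finite-difference check is a handy way to confirm a gradient formula like this one. The sketch below assumes random synthetic data and folds the bias into the weight vector using a column of ones; the helper names are illustrative.

```python
import numpy as np

def sigmoid(eta):
    return 1.0 / (1.0 + np.exp(-eta))

def loss(w, X, y):
    """Mean negative log likelihood."""
    mu = sigmoid(X @ w)
    return -np.mean(y * np.log(mu) + (1 - y) * np.log(1 - mu))

def gradient(w, X, y):
    """Analytic gradient X^T (mu - y), averaged over the samples."""
    mu = sigmoid(X @ w)
    return X.T @ (mu - y) / len(y)

# Hypothetical data; the first column of ones plays the role of the bias b
rng = np.random.default_rng(2)
X = np.column_stack([np.ones(30), rng.normal(size=30)])
y = (rng.random(30) < 0.5).astype(float)
w = np.array([0.1, -0.3])

# Central finite differences should match the analytic gradient
eps = 1e-6
numeric = np.array([
    (loss(w + eps * e, X, y) - loss(w - eps * e, X, y)) / (2 * eps)
    for e in np.identity(2)
])
print(gradient(w, X, y), numeric)
```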

---

## Example

```python
#! /usr/bin/python3

import math
import numpy as np
import sys

def sigmoid(eta):
    """
    The sigmoid function.
    """
    return 1 / (1 + np.exp(-eta))

def logistic_regression(file_path, x_label, y_label, y_target, learning_rate, epochs):
    """
    Performs logistic regression to draw a decision boundary through
    (hopefully) linearly separable data.

    Args:
        file_path (str): The path to a file containing x and y data.
        x_label (str): Name of the x column data
        y_label (str): Name of the y column data
        y_target (int): Target y class
        learning_rate (float): Update rate
        epochs (int): Number of times to iterate through the data
    """
    # Get the first line of the file with the column names
    with open(file_path) as f:
        header_row = f.readline().strip('\n')
    column_names = [name.strip(' ') for name in header_row.split(',')]
    # Load data
    columns = np.loadtxt(file_path, delimiter=',', skiprows=1, unpack=True)
    x_index = column_names.index(x_label)
    y_index = column_names.index(y_label)

    x = columns[x_index]
    y = columns[y_index]
    # Set everything with the target class to 1 and everything else to 0
    y = (y == y_target).astype(float)

    # We are going to call our parameter w, for weight, and b, for bias
    w = 0
    b = 0

    # Gradient descent
    for i in range(epochs):
        # Get the predicted probabilities
        y_hat = sigmoid(w * x.T + b)

        # Calculate gradients w.r.t. w and b
        # Average over the size of the dataset
        dw = (1 / len(x)) * (y_hat - y) @ x
        db = (1 / len(x)) * np.sum(y_hat - y)

        # Update w and b
        w -= learning_rate * dw
        b -= learning_rate * db

        print(f"epoch {i} error is {np.mean(y_hat-y):.4f}")

    # Determine the accuracy
    y_hat = sigmoid(x * w + b)
    predictions = (y_hat > 0.5).astype(int)
    accuracy = np.mean(predictions == y)

    print(f"Final accuracy is {accuracy}")
    np.set_printoptions(precision=3, suppress=True)
    print(f"Linear model was {w} * x + {b:.3f}")
    return w, b, x, y_hat, y

if __name__ == "__main__":
    if len(sys.argv) < 7:
        print("Provide the data file, the x column, the y column, the y target class, the learning rate, and the epochs")
    else:
        w, b, x, y_hat, y = logistic_regression(sys.argv[1], sys.argv[2], sys.argv[3], int(sys.argv[4]), float(sys.argv[5]), int(sys.argv[6]))
        # Print out results
        for x, y_hat, y in zip(x, y_hat, y):
            print(f"{x} {y_hat} {y}")
```

---

## Random dataset

```python
#! /usr/bin/python3

import math
import numpy as np
import sys

num_samples = int(sys.argv[1])
mean_one = float(sys.argv[2])
stddev_one = float(sys.argv[3])
mean_two = float(sys.argv[4])
stddev_two = float(sys.argv[5])

rng = np.random.default_rng()

# Class 1 and class 2 samples
class_one = rng.normal(loc=mean_one, scale=stddev_one, size=(num_samples, 1))
class_two = rng.normal(loc=mean_two, scale=stddev_two, size=(num_samples, 1))

class_one = np.concatenate((class_one, np.ones((num_samples, 1))), axis=1)
class_two = np.concatenate((class_two, 2*np.ones((num_samples, 1))), axis=1)

samples = np.concatenate((class_one, class_two), axis=0)
rng.shuffle(samples)

print("x, class")
for i in range(2*num_samples):
    print(f"{samples[i][0]}, {int(samples[i][1])}")
```

---

## Classes

---

## Results

---

## Q: How to implement $l_2$?
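
One possible answer, offered as a sketch rather than the intended solution: add the derivative of $\lambda w^2$ to the weight gradient. The toy data, the hyperparameters, and the choice to leave the bias unpenalized are all assumptions.

```python
import numpy as np

def regularized_gradients(w, b, x, y, lam):
    """Gradients of the mean negative log likelihood plus lam * w^2."""
    y_hat = 1.0 / (1.0 + np.exp(-(w * x + b)))
    dw = np.mean((y_hat - y) * x) + 2.0 * lam * w   # penalty term shrinks w
    db = np.mean(y_hat - y)                          # bias left unpenalized
    return dw, db

# Hypothetical, perfectly separable toy data
x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([0.0, 0.0, 1.0, 1.0])

def train(lam, learning_rate=0.5, epochs=2000):
    w, b = 0.0, 0.0
    for _ in range(epochs):
        dw, db = regularized_gradients(w, b, x, y, lam)
        w -= learning_rate * dw
        b -= learning_rate * db
    return w, b

w_plain, _ = train(lam=0.0)
w_ridge, _ = train(lam=0.1)
print(w_plain, w_ridge)
```

On separable data the unregularized weight keeps growing with more epochs, while the penalized weight settles at a finite value.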

---

## Dataset

* Random datasets aren't going to cut it forever
* Not going to generate noise to demonstrate everything
* We'll use a famous dataset from the 1930s
  * Iris dataset
* https://huggingface.co/datasets/scikit-learn/iris

---

## Iris Dataset

* 3 Iris species
  * Setosa
  * Versicolor
  * Virginica
* 4 attributes
  * sepal length
  * sepal width
  * petal length
  * petal width

<img style="width: 95%" class="r-stretch" src="././figures/Irissetosa1.jpg" />

---

## Setosa by Sepal Length

---

## Versicolor by Sepal Length

---

## Virginica by Sepal Length

---

## Separability

* There is overlap between classes
* This means that a single line cannot classify them perfectly
  * In other words, they are not *separable* with a single line

---

## Guessing and Collapse

* Models "collapse" when they cannot find a solution
* Q: If you were guessing randomly, what is your expected accuracy?

---

## Using Multiple Features

* Can we do better with multiple features?
  * 1000 epochs with lr = 2:
    * Class 1: 1.0
    * Class 2: 0.68
    * Class 3: 0.973
  * 2000 epochs with lr = 0.2 gets class 2 to 0.75
* How can we adapt our code?

---

## Multiple Inputs

```python
#! /usr/bin/python3

import math
import numpy as np
import sys

def sigmoid(eta):
    """
    The sigmoid function.
    """
    return 1 / (1 + np.exp(-eta))

def logistic_regression(file_path, x_labels, y_label, y_target, learning_rate, epochs):
    """
    Performs logistic regression to draw a decision boundary through
    (hopefully) linearly separable data.

    Args:
        file_path (str): The path to a file containing x and y data.
        x_labels (list of str): Names of the x column data
        y_label (str): Name of the y column data
        y_target (int): Target y class
        learning_rate (float): Update rate
        epochs (int): Number of times to iterate through the data
    """
    # Get the first line of the file with the column names
    with open(file_path) as f:
        header_row = f.readline().strip('\n')
    column_names = [name.strip(' ') for name in header_row.split(',')]
    # Load data
    columns = np.loadtxt(file_path, delimiter=',', skiprows=1, unpack=True)
    x_indices = [column_names.index(x_label) for x_label in x_labels]
    y_index = column_names.index(y_label)

    # What happens here?

if __name__ == "__main__":
    if len(sys.argv) < 7:
        print("Provide the data file, the y column, the y target class, the learning rate, the epochs, and then the x columns")
    else:
        columns = sys.argv[6:]
        W, b, x, y_hat, y = logistic_regression(sys.argv[1], columns, sys.argv[2], int(sys.argv[3]), float(sys.argv[4]), int(sys.argv[5]))
```
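
One way the missing step could look, shown on synthetic data instead of a file so that it runs on its own. This is a sketch under assumptions (the function name `fit_logistic`, the Gaussian blobs, and the hyperparameters are invented here), not the expected solution.

```python
import numpy as np

def sigmoid(eta):
    return 1.0 / (1.0 + np.exp(-eta))

def fit_logistic(X, y, learning_rate, epochs):
    """Gradient descent for logistic regression with several features.

    X has shape (num_samples, num_features); y holds 0/1 labels.
    """
    num_samples, num_features = X.shape
    W = np.zeros(num_features)
    b = 0.0
    for _ in range(epochs):
        y_hat = sigmoid(X @ W + b)
        # Same gradients as the single-feature case, with a vector of weights
        dW = X.T @ (y_hat - y) / num_samples
        db = np.mean(y_hat - y)
        W -= learning_rate * dW
        b -= learning_rate * db
    return W, b

# Hypothetical two-feature data: two Gaussian blobs
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-1.0, 0.5, size=(50, 2)),
               rng.normal(+1.0, 0.5, size=(50, 2))])
y = np.concatenate([np.zeros(50), np.ones(50)])

W, b = fit_logistic(X, y, learning_rate=0.5, epochs=500)
accuracy = np.mean((sigmoid(X @ W + b) > 0.5) == y)
print(W, b, accuracy)
```

The only real change from the single-input code is that `w * x` becomes the matrix product `X @ W`, so the gradient for the weights becomes a vector.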

---

## Problems with Logistic Regression

* The learning rate
* Memory
  * We need the entire dataset
* Mean shift

---

## Solutions for learning rate

* There are adaptive learning rate algorithms
* And momentum techniques to prevent oscillation
* We will revisit learning rate more in the future

---

## Solutions for Memory

* We can learn online, using a subset of the data
  * Called a minibatch
* Train on current samples, minimize regret for future samples
  * $f(W)=\mathbb{E}[f(W,z)]$
* Because this uses expectations, it is stochastic
  * Stochastic gradient descent, or SGD
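
A minibatch version of the earlier training loop might look like the sketch below. The batch size, learning rate, and synthetic data are assumptions; in a truly out-of-memory setting the batches would be streamed from disk rather than indexed from an array.

```python
import numpy as np

def sigmoid(eta):
    return 1.0 / (1.0 + np.exp(-eta))

def sgd_logistic(x, y, learning_rate, epochs, batch_size, rng):
    """Logistic regression where each update sees only one minibatch."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        order = rng.permutation(len(x))          # reshuffle every epoch
        for start in range(0, len(x), batch_size):
            batch = order[start:start + batch_size]
            y_hat = sigmoid(w * x[batch] + b)
            # Noisy estimate of the full gradient from this minibatch only
            w -= learning_rate * np.mean((y_hat - y[batch]) * x[batch])
            b -= learning_rate * np.mean(y_hat - y[batch])
    return w, b

# Hypothetical 1D data: class 0 centered at -1, class 1 centered at +1
rng = np.random.default_rng(4)
x = np.concatenate([rng.normal(-1.0, 0.5, 200), rng.normal(1.0, 0.5, 200)])
y = np.concatenate([np.zeros(200), np.ones(200)])

w, b = sgd_logistic(x, y, learning_rate=0.1, epochs=20, batch_size=16, rng=rng)
accuracy = np.mean((sigmoid(w * x + b) > 0.5) == y)
print(w, b, accuracy)
```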

---

## SGD

* What learning rate, $\eta$, should be used?
  * Remember, we don't know our future loss, only the current step
* Robbins-Monro conditions
  * $\sum_{k=1}^{\infty}\eta_k=\infty, \sum_{k=1}^{\infty}\eta_{k}^2 < \infty$
* These conditions constrain the learning rate schedule
  * Notice that we may still require near-infinite time to converge
* In practice, we can use *early stopping*
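
A schedule of the form $\eta_k = \eta_0(\tau + k)^{-\kappa}$ with $0.5 < \kappa \le 1$ satisfies both conditions. The sketch below just prints partial sums for one assumed setting of the constants to show the two behaviours.

```python
import numpy as np

def learning_rate(k, eta0=1.0, tau=0.0, kappa=0.75):
    """Schedule eta_k = eta0 * (tau + k)^(-kappa); constants are illustrative."""
    return eta0 * (tau + k) ** (-kappa)

steps = np.arange(1, 100001)
eta = learning_rate(steps)

# Partial sums: the first keeps growing, the second levels off
for n in (100, 10000, 100000):
    print(n, eta[:n].sum(), (eta[:n] ** 2).sum())
```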

---

## Mean Problem

* The decision boundary moves with the mean
  * Just look at the bias term
* It is possible for a population to move, leaving an outlier on the wrong side of the boundary
* This is what the algorithms do, but it may not be what we want

---

## The perceptron algorithm

* Uses SGD to solve the mean problem
* $g_i \approx (\hat{y}_i - y_i)x_i$
* If $y \in \{-1,+1\}$ then:
  * $\hat{y}_i = sign(w^Tx_i)$
* Crucially, where there is no error there is no update
  * This prevents the mean of a population from shifting the decision boundary
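
A compact sketch of these updates follows. With labels in $\{-1,+1\}$, $(\hat{y}_i - y_i)x_i$ is $-2y_ix_i$ on a mistake and 0 otherwise, so the update below corresponds to a learning rate of 0.5. The toy data and the choice to fold the bias into `w` via a column of ones are assumptions.

```python
import numpy as np

def perceptron(X, y, epochs):
    """Perceptron updates for labels y in {-1, +1}.

    X is assumed to include a column of ones so that w also holds the bias.
    """
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            # Treat a score of exactly 0 as a mistake so training can start from w = 0
            if y_i * (w @ x_i) <= 0:
                w += y_i * x_i       # move the boundary toward the misclassified point
                mistakes += 1
        if mistakes == 0:            # no errors, so no further updates would occur
            break
    return w

# Hypothetical linearly separable data (first column is the constant 1)
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.5], [1.0, 3.0]])
y = np.array([-1, -1, 1, 1])

w = perceptron(X, y, epochs=100)
print(w, np.sign(X @ w))
```

Correctly classified points never change `w`, no matter how far from the boundary they lie, which is the property the slide highlights.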

---

## Perceptron Algorithm

* Created in 1958
* Shown to converge if the data is linearly separable
  * Meaning a $w$ exists such that $sign(w^Tx_i)$ achieves 0 error
* Historically important, and one of many uses of SGD

---

## Next: More classifiers

* Clustering approaches next
* Recitation today to review some probability

<!--
TODO Show weights with and without regularization
TODO What if the noise is uniform instead of Gaussian?
TODO Show that signal can be picked out of extreme noise (stdev = 10 or something)
TODO Wrap up with differences between generative and discriminative, but I haven't talked about discriminative approach (i.e. Gaussian discriminant analysis)

TODO Logistic Regression. Description from 7.4 in Flach's book is simplest.

TODO Walk through equation, discuss that it cannot be solved in closed form. X^{T}X has time complexity O(n^{2}d) for construction and O(d^3) for inverting, which is rough.
TODO That bring us to steps using the gradient
TODO Discuss convergence problems
TODO Online variant when not all training data fits into memory.
TODO Works with linearly separable data
TODO What is linearly separable data?
TODO What if data isn't linearly separable?
TODO Logistic regression doesn't work well here, as it shifts with the mean of the distributions
     An outlier can lead to terrible results.
     Notice that this is a feature of data, but is possible in the real world.
     For example, cat vs Not Cat will have incredible sampling bias in the "not cat" data.
     Or, a cat could be doing something close to the decision boundary--perhaps it is being walked with a leash and collar, or is wearing a sweater.
TODO What if we use our linear classifier (\beta_0 and \beta_{1}x), but we iteratively update and only upon error?
TODO What is a good fit? What is ROC?
-->