<!--
Abstract:

CS 461
Introduction to Deep Learning
Lecture 01: Introduction
-->

# CS 462 - Lecture 02

## Machine Learning Principles

Bernhard Firner

2025-09-10

---

## Syllabus Updates

* Recommended reading
  * Machine Learning: A Probabilistic Perspective by Murphy
    * Sections 7.1-7.5
  * Machine Learning: The Art and Science of Algorithms that Make Sense of Data by Flach
    * Section 7

---

## Topics

* A bit longer with least squares
  * Get an intuition for data and statistics
  * Justify least squares
  * Do some matrix math in Numpy
* Demonstrate fundamentals that are still present in modern deep learning

---

## Dealing With Noise

* Our first goal with ML should be to defeat noise with data
  * How?
* Last time we solved for the slope and intercept using two points

---

## Recap

* Model: $y = \beta_0 + \beta_{1}x$
* $error = Y - (\beta_0 + \beta_{1}X)$
* $MSE = \frac{1}{n}\sum_{i=1}^{n}(Y_i - (\beta_0 + \beta_{1}X_i))^2$

---

## Optimize

* Use the derivative with respect to $\beta_0$ and $\beta_1$
* Then set each to 0
  * $0 = 2\beta_0 + 2\beta_{1}X - 2Y$
  * $0 = 2\beta_{0}X + 2\beta_{1}X^2 - 2XY$
* Minimized MSE, got a line
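
A minimal sketch of solving those two equations once they are summed over all samples. The function name `fit_line` and the two data points are made up for illustration.

```python
import numpy as np

def fit_line(x, y):
    """Solve the two derivative equations (summed over samples) for beta_0, beta_1."""
    n = len(x)
    sx, sy = np.sum(x), np.sum(y)
    sxx, sxy = np.sum(x * x), np.sum(x * y)
    # 0 = n*beta_0 + beta_1*sx - sy  and  0 = beta_0*sx + beta_1*sxx - sxy
    beta_1 = (n * sxy - sx * sy) / (n * sxx - sx**2)
    beta_0 = (sy - beta_1 * sx) / n
    return beta_0, beta_1

# Hypothetical data: two points on y = 1 + 2x, so the line passes through both
x = np.array([0.0, 1.0])
y = np.array([1.0, 3.0])
beta_0, beta_1 = fit_line(x, y)
print(beta_0, beta_1)
```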

---

## Solution

* Passes through the two sample points
  * Ignores the rest of the target function, Y
* Two problems:
  * Need a higher degree polynomial
  * Need more data if we add noise

<img style="width: 95%" class="r-stretch" src="./figures/ex1_solution.png" />

---

## Justification

* Let's justify adding more data and minimizing MSE
* $MSE = \frac{1}{n}\sum_{i=1}^{n}(Y_i - (\beta_0 + \beta_{1}X_i))^2$
  * This is an estimate of the true error
* So as $n \rightarrow \infty$, our error estimate converges on reality
* But it is important to remember that our error is always an estimate!
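
A small simulation sketch of that convergence. The line, the noise level, and the sample sizes are all assumed values chosen for illustration; with Gaussian noise of standard deviation `sigma`, the true expected squared error of the true line is `sigma**2`.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5                      # assumed noise standard deviation
beta_0, beta_1 = 1.0, 2.0        # assumed "true" line, for illustration only

def sample_mse(n):
    """MSE of the true line measured on n noisy samples."""
    x = rng.uniform(0.0, 1.0, size=n)
    y = beta_0 + beta_1 * x + rng.normal(0.0, sigma, size=n)
    return np.mean((y - (beta_0 + beta_1 * x)) ** 2)

estimates = {n: sample_mse(n) for n in (10, 1_000, 100_000)}
for n, mse in estimates.items():
    # The estimate should drift towards sigma**2 = 0.25 as n grows
    print(n, mse)
```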

---

## Samples > Parameters

* Let's rewrite our equation with matrices

<p><math display="block" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mrow><mo stretchy="true" form="prefix">(</mo><mtable><mtr><mtd columnalign="center" style="text-align: center"><msub><mi>y</mi><mn>1</mn></msub></mtd></mtr><mtr><mtd columnalign="center" style="text-align: center"><mi>⋮</mi></mtd></mtr><mtr><mtd columnalign="center" style="text-align: center"><msub><mi>y</mi><mi>n</mi></msub></mtd></mtr></mtable><mo stretchy="true" form="postfix">)</mo></mrow><mo>=</mo><mrow><mo stretchy="true" form="prefix">(</mo><mtable><mtr><mtd columnalign="center" style="text-align: center"><mn>1</mn></mtd><mtd columnalign="center" style="text-align: center"><msub><mi>x</mi><mn>1</mn></msub></mtd></mtr><mtr><mtd columnalign="center" style="text-align: center"><mi>⋮</mi></mtd><mtd columnalign="center" style="text-align: center"><mi>⋮</mi></mtd></mtr><mtr><mtd columnalign="center" style="text-align: center"><mn>1</mn></mtd><mtd columnalign="center" style="text-align: center"><msub><mi>x</mi><mi>n</mi></msub></mtd></mtr></mtable><mo stretchy="true" form="postfix">)</mo></mrow><mrow><mo stretchy="true" form="prefix">(</mo><mtable><mtr><mtd columnalign="center" style="text-align: center"><msub><mi>β</mi><mn>0</mn></msub></mtd></mtr><mtr><mtd columnalign="center" style="text-align: center"><msub><mi>β</mi><mn>1</mn></msub></mtd></mtr></mtable><mo stretchy="true" form="postfix">)</mo></mrow><mo>+</mo><mrow><mo stretchy="true" form="prefix">(</mo><mtable><mtr><mtd columnalign="center" style="text-align: center"><msub><mi>ϵ</mi><mn>1</mn></msub></mtd></mtr><mtr><mtd columnalign="center" style="text-align: center"><mi>⋮</mi></mtd></mtr><mtr><mtd columnalign="center" style="text-align: center"><msub><mi>ϵ</mi><mi>n</mi></msub></mtd></mtr></mtable><mo stretchy="true" form="postfix">)</mo></mrow></mrow><annotation encoding="application/x-tex">\begin{pmatrix}
    y_1\\
    \vdots\\
    y_n
  \end{pmatrix} =
  \begin{pmatrix}
    1 & x_1\\
    \vdots & \vdots\\
    1 & x_n
  \end{pmatrix}
  \begin{pmatrix}
    \beta_0\\
    \beta_1
  \end{pmatrix} + 
  \begin{pmatrix}
    \epsilon_1\\
    \vdots\\
    \epsilon_n
  \end{pmatrix}</annotation></semantics></math></p>

* The $\epsilon$ term admits that we have errors

* Take the derivative and solve for $\beta$

---

## Sum of Squares

* Error is still $(y - X\beta)$
* Squared error of the samples is $(y-X\beta)^T(y-X\beta)$
* After taking the derivative and solving for 0:
  * $X^TX\beta = X^Ty$
    * Known as the normal equation
    * AKA what you pass to `numpy.linalg.solve`
  * $\beta = (X^TX)^{-1}X^Ty$
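
For reference, a sketch of the steps between the squared error and the normal equation. The name $S(\beta)$ is introduced here only as shorthand for the squared error, and the steps rely on the standard identities $\frac{\partial (\beta^Ta)}{\partial \beta} = a$ and $\frac{\partial (\beta^TA\beta)}{\partial \beta} = 2A\beta$ for symmetric $A$:

* $S(\beta) = (y-X\beta)^T(y-X\beta) = y^Ty - 2\beta^TX^Ty + \beta^TX^TX\beta$
* $\frac{\partial S}{\partial \beta} = -2X^Ty + 2X^TX\beta$
* Setting this to 0 gives $X^TX\beta = X^Ty$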

-v-

## Interpretation

* The derivation of the normal equation is often skipped in texts
* There is an intuition, though
  * To minimize error, we want the hyperplane from $\beta$ to be as close as possible to each datapoint
  * "as close as possible" means orthogonal
* The normal equation makes the residual, $y - X\beta$, orthogonal to every column of $X$
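
The orthogonality can be checked numerically. This is a sketch on made-up noisy data: after solving the normal equation, the dot product of the residual with each column of $X$ should be zero up to floating point error.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 20)
y = np.sin(x) + rng.normal(0.0, 0.1, size=x.size)   # made-up noisy data

X = np.column_stack([np.ones_like(x), x])
beta = np.linalg.solve(X.T @ X, X.T @ y)
residual = y - X @ beta

# Each entry is the dot product of the residual with one column of X
print(X.T @ residual)
```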

---

## With Numpy

```python
#! /usr/bin/python3

import numpy as np
import sys

def least_squares(file_path, num_records):
    """
    Performs least squares regression to fit a degree 2 polynomial to data
    by manually constructing and solving the normal equations.

    Args:
        file_path (str): The path to a file containing x and y data.
                         The file should have two columns of numbers.
        num_records (int): The number of leading rows of the file to use.
    """
    # Load data
    x, y = np.loadtxt(file_path, unpack=True)
    x = x[:num_records]
    y = y[:num_records]

    # Construct X, representing a degree two polynomial
    X = np.column_stack([np.ones(len(x)), x, x**2])

    # Left side matrix of the normal equation
    left = X.T @ X

    # Right side vector of the normal equation
    right = X.T @ y

    # Solve the system of linear equations to find the coefficients
    try:
        beta = np.linalg.solve(left, right)
    except np.linalg.LinAlgError:
        print("Error: The matrix is singular (meaning linearly dependent rows or columns).")
        return

    # Find y_hat
    y_hat = beta[0] + beta[1] * x + beta[2] * x**2
    # Compute error information
    # The coefficient of determination (R^2) is the square of the sample
    # correlation coefficient between y and y_hat
    y_mean = np.mean(y)
    ss_total = np.sum((y - y_mean)**2)

    residuals = y - y_hat
    ssquares_residual = np.sum(residuals**2)

    r_squared = 1 - (ssquares_residual / ss_total)

    np.set_printoptions(precision=3, suppress=True)
    print(f"Beta is {beta[0]}, {beta[1]}, {beta[2]}")
    print(f"Residuals are {residuals}")
    print(f"Mean of squares from residuals is {ssquares_residual/len(residuals)}")
    print(f"R^2 is {r_squared}")

if __name__ == "__main__":
    if len(sys.argv) < 3:
        print("Provide the data file and the number of datapoints to consume as arguments.")
    else:
        least_squares(sys.argv[1], int(sys.argv[2]))
```

---

## Add Noisy Data

```python
#! /usr/bin/python3

import math
import numpy as np
import sys

num_samples = int(sys.argv[1])

# x values in the range 0 to math.pi/4
xs = np.linspace(0, math.pi/4, num=num_samples)
# Gaussian noise with standard deviation 0.5
noise = np.random.normal(loc=0, scale=0.5, size=(num_samples))
ys = np.sin(xs) + noise

for x, y in zip(xs, ys):
    print(x, y)
```

---

## Fairly good with 100 points

---

## $R^2$ Error

* We need a way to describe our fit
* In this case, it will be the coefficient of determination: the squared correlation between observations and predictions
* ss_residual = $\sum_{i}(y_i - \hat{y}_i)^2$
* ss_total = $\sum_{i}(y_i - \bar{y})^2$
* $R^2 = 1 - \frac{ss_{residual}}{ss_{total}}$
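
For a least squares fit that includes an intercept, $R^2$ computed this way equals the squared correlation between observations and predictions. A quick check on made-up data (the noise level and sample count are assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0.0, np.pi / 4, 100)
y = np.sin(x) + rng.normal(0.0, 0.5, size=x.size)   # made-up noisy data

# Degree 2 least squares fit through the normal equation
X = np.column_stack([np.ones_like(x), x, x**2])
beta = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta

ss_residual = np.sum((y - y_hat) ** 2)
ss_total = np.sum((y - np.mean(y)) ** 2)
r_squared = 1 - ss_residual / ss_total

# Sample correlation coefficient between observations and predictions
corr = np.corrcoef(y, y_hat)[0, 1]
print(r_squared, corr**2)
```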

---

## $R^2$

---

## Some Notes

* As $X^TX$ gets close to singular (determinant near 0), inverting it becomes perilous
  * $\beta = (X^TX)^{-1}X^Ty$
* The edges of the data range have a different behavior
  * You will see this all the way through to DNNs
* More data is better-ish
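
One way to see the first point is the condition number of $X^TX$, which grows quickly with polynomial degree. This is a sketch; the helper `normal_matrix_condition` and the chosen degrees are illustrative only.

```python
import numpy as np

x = np.linspace(0.0, np.pi / 4, 100)

def normal_matrix_condition(degree):
    """Condition number of X^T X for a polynomial design matrix."""
    X = np.column_stack([x**i for i in range(degree + 1)])
    return np.linalg.cond(X.T @ X)

conds = {d: normal_matrix_condition(d) for d in (1, 2, 5, 10)}
for d, c in conds.items():
    # Larger values mean solving the normal equation amplifies rounding error more
    print(d, f"{c:.3e}")
```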

---

## Residuals

* The distances from our fit to the data points are henceforth known as residuals
  * $y - \hat{y}$
* We don't consider these errors because we have chosen this fit
* Instead, we hope that they represent the original noise
* Thus `we never expect error to be 0`

---

## Let's do a ML

* But what if we want 0 error?
* Fit isn't perfect, add more parameters!
* Keep training at 1000 points

---

## Updated Code

```python
#! /usr/bin/python3

import math
import numpy as np
import sys

def least_squares(file_path, num_records, degree):
    """
    Performs least squares regression to fit a polynomial of the given degree
    to data by manually constructing and solving the normal equations.
    """
    # Load data
    x, y = np.loadtxt(file_path, unpack=True)
    x = x[:num_records]
    y = y[:num_records]

    # Construct X, representing the requested polynomial
    X = np.column_stack([x**i for i in range(0, degree+1)])

    # Left side matrix of the normal equation
    left = X.T @ X

    # Right side vector of the normal equation
    right = X.T @ y

    # Solve the system of linear equations to find the coefficients
    beta = np.linalg.solve(left, right)

    # Find y_hat
    y_hat = sum([beta[i] * x**i for i in range(degree+1)])
    # Compute error information
    # R^2 is the squared sample correlation between y and y_hat
    y_mean = np.mean(y)
    ss_total = np.sum((y - y_mean)**2)

    residuals = y - y_hat
    ssquares_residual = np.sum(residuals**2)

    r_squared = 1 - (ssquares_residual / ss_total)

    np.set_printoptions(precision=3, suppress=True)
    print(f"Beta is {beta}")
    print(f"Residuals are {residuals}")
    print(f"Mean of squares from residuals is {ssquares_residual/len(residuals)}")
    print(f"R^2 is {r_squared}")
    return x, y_hat

if __name__ == "__main__":
    if len(sys.argv) < 4:
        print("Provide the data file, the number of datapoints to consume, and the degree as arguments.")
    else:
        xs, y_hat = least_squares(sys.argv[1], int(sys.argv[2]), int(sys.argv[3]))
        # Print out results
        for x, y in zip(xs, y_hat):
            print(f"Fit: {x} {y}")
```

---

## Results!

---

## More Parameters are bad?

---

## Oh no!

* Gain from additional parameters goes away quickly
* Oftentimes, a simpler model is better
  * Sometimes even if we know it is too simple
* We can construct simpler models through `regularization`
* This gives us the freedom to build a model with more parameters
  * The algorithm will clean up our mess

---

## Bias Vs Variance Tradeoff

* Recall $\frac{1}{(n+1)}\sum_{i=1}^{n}X_{i}$
  * We biased our mean estimate towards 0 to prevent impacts from high variance data
* Regularization is the same approach
  * We push our parameters closer to 0
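
A small simulation sketch of that tradeoff for the mean estimate. The true mean, noise level, and sample size are assumed values, picked so that the data is noisy relative to the mean.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5
true_mean, sigma = 0.2, 2.0      # assumed values: small mean, noisy data
trials = rng.normal(true_mean, sigma, size=(20_000, n))

plain = trials.sum(axis=1) / n          # unbiased sample mean
shrunk = trials.sum(axis=1) / (n + 1)   # biased towards 0

mse_plain = np.mean((plain - true_mean) ** 2)
mse_shrunk = np.mean((shrunk - true_mean) ** 2)
print(f"plain:  variance {plain.var():.3f}, MSE {mse_plain:.3f}")
print(f"shrunk: variance {shrunk.var():.3f}, MSE {mse_shrunk:.3f}")
```

In this setup the shrunk estimate accepts a small bias in exchange for lower variance. With a large true mean or low noise the bias would dominate instead.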

---

## Outliers

* Another problem is outliers
  * Points several $\sigma$ away
* They're a problem because we assumed Gaussian noise
  * Being so far from the mean would be unlikely in a Gaussian
  * Thus, our estimate of $\beta_0$ shifts with the mean

---

## More On Outliers

* We could also go back and change our assumptions
  * Maybe this isn't Gaussian
* It turns out that these approaches are also regularizers

---

## Ridge Regression

* Add the squared $l_2$ norm of $\beta$ as a shrinking term
  * $Error = \frac{1}{n}\sum_{i=1}^{n}(Y_i - X_i\beta)^2 + \lambda \left\Vert \beta \right\Vert _{2}^{2}$
  * Changes $\beta = (X^TX)^{-1}X^Ty$
  * into $\beta = (X^TX + \lambda I)^{-1}X^Ty$ (absorbing the factor of $n$ into $\lambda$)
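
A minimal sketch of the shrinking effect on made-up data. The helper `ridge_fit`, the degree, and the $\lambda$ values are illustrative assumptions; the point is only that the norm of $\beta$ drops as $\lambda$ grows.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Solve (X^T X + lam * I) beta = X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.identity(X.shape[1]), X.T @ y)

rng = np.random.default_rng(4)
x = np.linspace(0.0, np.pi / 4, 50)
y = np.sin(x) + rng.normal(0.0, 0.5, size=x.size)   # made-up noisy data
X = np.column_stack([x**i for i in range(6)])       # degree 5 polynomial

norms = {lam: np.linalg.norm(ridge_fit(X, y, lam)) for lam in (0.001, 0.1, 10.0)}
for lam, norm in norms.items():
    # The l2 norm of beta shrinks as lambda grows
    print(lam, norm)
```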

---

## Why?

* This biases our guesses closer to zero
  * Changes our assumption about the distribution of $\beta$
  * Previously we had no assumption, which is a uniform distribution
  * Now that we want it closer to 0, we are assuming a Gaussian
* The optimal value of $\lambda$ should be determined by population statistics
  * But, in reality, we can just guess
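
One common way to make that guess less arbitrary is to try several values of $\lambda$ and compare errors on held-out data. The sketch below assumes made-up data, a degree 5 polynomial, and an arbitrary grid of $\lambda$ values; none of these come from the lecture's data files.

```python
import numpy as np

rng = np.random.default_rng(5)

def make_data(n):
    """Noisy samples of sin(x) on [0, pi/4] (made-up data)."""
    x = rng.uniform(0.0, np.pi / 4, size=n)
    return x, np.sin(x) + rng.normal(0.0, 0.5, size=n)

def design(x, degree=5):
    return np.column_stack([x**i for i in range(degree + 1)])

x_train, y_train = make_data(30)
x_val, y_val = make_data(1000)    # held-out data, never used for fitting

val_mse = {}
for lam in (1e-6, 1e-4, 1e-2, 1.0, 100.0):
    X = design(x_train)
    beta = np.linalg.solve(X.T @ X + lam * np.identity(X.shape[1]), X.T @ y_train)
    val_mse[lam] = np.mean((y_val - design(x_val) @ beta) ** 2)

# Keep the lambda with the lowest error on the held-out data
best_lam = min(val_mse, key=val_mse.get)
print(val_mse, best_lam)
```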

---

## Ridge Regression Code

```python[7,21]
#! /usr/bin/python3

import math
import numpy as np
import sys

def least_squares(file_path, num_records, degree, lam=0.5):
    """
    Performs ridge regression to fit a polynomial of the given degree to data
    by manually constructing and solving the regularized normal equations.
    """
    # Load data
    x, y = np.loadtxt(file_path, unpack=True)
    x = x[:num_records]
    y = y[:num_records]

    # Construct X, representing the requested polynomial
    X = np.column_stack([x**i for i in range(0, degree+1)])

    # Left side matrix of the normal equation, with the ridge term added
    left = X.T @ X + lam*np.identity(degree+1)

    # Right side vector of the normal equation
    right = X.T @ y

    # Solve the system of linear equations to find the coefficients
    beta = np.linalg.solve(left, right)

    # Find y_hat and compute error information
    y_hat = sum([beta[i] * x**i for i in range(degree+1)])
    y_mean = np.mean(y)
    ss_total = np.sum((y - y_mean)**2)

    residuals = y - y_hat
    ssquares_residual = np.sum(residuals**2)

    r_squared = 1 - (ssquares_residual / ss_total)

    print(f"Beta is {beta}")
    print(f"R^2 is {r_squared}")
    return x, y_hat

if __name__ == "__main__":
    if len(sys.argv) < 5:
        print("Provide the data file, the number of datapoints to consume, the degree, and lambda as arguments.")
    else:
        xs, y_hat = least_squares(sys.argv[1], int(sys.argv[2]), int(sys.argv[3]), float(sys.argv[4]))
        # Print out results
        for x, y in zip(xs, y_hat):
            print(f"Fit: {x} {y}")
```

---

## Ridge Results

---

## Lasso, etc

* Least Absolute Shrinkage and Selection Operator
* This assumes a Laplace prior on $\beta$
  * A sharper peak at 0 than a Gaussian
  * Can actually force some parameters to 0
* We could keep regularizing, but let's move on
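
To see how an $l_1$ penalty can force parameters to exactly 0, consider the special case where the columns of $X$ are orthonormal. There the lasso solution is the soft-thresholding of the least squares coefficients, while ridge only rescales them. This is a sketch of that special case, not a general lasso solver, and the coefficient values are hypothetical.

```python
import numpy as np

def soft_threshold(beta_ols, lam):
    """Lasso solution for orthonormal X, objective 0.5*||y - X beta||^2 + lam*||beta||_1."""
    return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam, 0.0)

beta_ols = np.array([3.0, -0.4, 0.05, 1.2])   # hypothetical least squares coefficients
lam = 0.5

beta_lasso = soft_threshold(beta_ols, lam)
# Ridge for orthonormal X, objective ||y - X beta||^2 + lam*||beta||^2
beta_ridge = beta_ols / (1.0 + lam)

print(beta_lasso)   # small coefficients become exactly zero
print(beta_ridge)   # every coefficient shrinks but stays nonzero
```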

---

## Zooming Out

* We've only been looking at a small slice of the curve, from $0$ to $\pi/4$
* Let's zoom out and look at more of our curve
  * We'll go from $0$ to $\pi$

---

## Sin

---

## How About Cos?

---

## Why Not Both?

---

## Contradictory Data

* We are drawing from two populations (sin and cos)
* We can't fit a line to both of them
* Errors are minimized if we draw a line between them

---

## Decision Boundaries

* That line is a `decision boundary`
* Let's assume that our data is balanced between the populations
* If our model minimizes its error, our estimates get caught in the middle
  * $\mathbb{E}(\hat{Y}|x_i) \rightarrow \frac{\mathbb{E}(\hat{Y}_{sin}|x_i) + \mathbb{E}(\hat{Y}_{cos}|x_i)}{2}$
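
A sketch of this on noise-free data: when every $x$ appears once with a sin label and once with a cos label, the least squares fit to the mixed data matches the fit to the average of the two curves. The degree 3 polynomial and the sample count are arbitrary choices for illustration.

```python
import numpy as np

x = np.linspace(0.0, np.pi, 200)
A = np.column_stack([x**i for i in range(4)])   # degree 3 polynomial, chosen arbitrarily

def fit(X, y):
    """Solve the normal equation for beta."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Balanced mix: every x appears once with a sin label and once with a cos label
X_mixed = np.vstack([A, A])
y_mixed = np.concatenate([np.sin(x), np.cos(x)])

beta_mixed = fit(X_mixed, y_mixed)
beta_average = fit(A, (np.sin(x) + np.cos(x)) / 2)
print(beta_mixed)
print(beta_average)
```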

---

## Next Topics

* Classification
  * Logistic regression
  * Bayes classification

---

## Digression on lambda values

* $\lambda$ (in ridge regression) will tend to flatten curves, but only to an extent
* High errors will shape $\beta$ as well, and the two reach equilibrium
  * $\lambda$ may have more impact where you have little data
    * like at the edges in our examples

---

## Bias Vs Variance Tradeoff

* Larger values of $\lambda$ will `bias` your model towards smoother curves
  * Meaning that you won't have singularly large parameters in $\beta$
  * The $l_2$ norm penalizes the Euclidean distance of all $\beta$ from 0
* You cannot force your curve into the correct shape with $\lambda$
  * Need more data to do that
* Typically, regularization shouldn't be used like a hammer
  * Just a bit is fine

---

## Hyperparameter tuning
<img style="width: 95%" class="r-stretch" src="./figures/degree_n_mixed_fit_ridge_to_2pi.webp" />