# CS 461 - Lecture 26

## Machine Learning Principles

### Course Review

Bernhard Firner

2025-12-10

---

## Course Recap

* ML has been driven by two things
  * Data
  * Computation
* In the early 1800s a least squares fit was done with a few observations and worked out on paper
* Nowadays datasets are too large for a human to even look at all of the data points

---

## Function Approximation

* At their heart, many of the techniques we've learned are function approximations
  * Given inputs, X, and observations, Y, deduce $f(x) = y$
* We began with the least squares method
  * Solve directly for a set of parameters of a polynomial, $\beta$
    * $\hat{y} = \beta_0 + \beta_1x + \beta_2x^2$
  * $\beta = (X^TX)^{-1}X^Ty$
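
As a minimal sketch of the direct solve above, assuming NumPy and a made-up noiseless quadratic dataset (the data and variable names are illustrative, not from the lecture):

```python
import numpy as np

# Hypothetical toy data: y = 1 + 2x + 3x^2 exactly, so the fit should recover (1, 2, 3).
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = 1.0 + 2.0 * x + 3.0 * x**2

# Design matrix with columns [1, x, x^2], matching beta_0 + beta_1 x + beta_2 x^2.
X = np.column_stack([np.ones_like(x), x, x**2])

# Normal equations: solve (X^T X) beta = X^T y instead of forming the inverse explicitly.
beta = np.linalg.solve(X.T @ X, X.T @ y)
print(beta)
```

Solving the linear system gives the same $\beta$ as the inverse formula, and is usually the more numerically stable way to write it.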

---

## Model-Based Prediction

* We decide on the polynomial degree, basically asserting knowledge about the correct solution
* This introduces the idea of the bias-variance tradeoff
  * Should we create a complicated model to capture all dataset variance?
  * Or should we create a biased model, that assumes a simpler solution is the correct one?

---

## Over Parameterized Fit

---

## What is Overfitting?

* ML works on *population statistics*
  * But our sample population may not be representative
* So we *bias* our models to be simpler
  * This makes your estimates more robust in the face of *dataset variance*
* Overfitting occurs when we fit closely to training data that is a poor representative of the overall population
* This is the *bias-variance* tradeoff

---

## Regularization

* Add the squared $\ell_2$ norm of $\beta$ as a shrinkage term
  * $Error = \frac{1}{n}\sum_{i=1}^{n}(y_i - x_i^T\beta)^2 + \lambda \left\Vert \beta \right\Vert _{2}^{2}$
* Originally $\beta = (X^TX)^{-1}X^Ty$
  * Now $\beta = (X^TX + \lambda I)^{-1}X^Ty$
* This is our first regularizer, and penalizes large values in $\beta$
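
A small sketch of the regularized closed form, assuming NumPy, a hypothetical noisy line as data, and degree-5 polynomial features. The helper name `ridge_fit` is made up for illustration, and it penalizes every entry of $\beta$, including the intercept, exactly as the formula above does:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form solution (X^T X + lam * I)^{-1} X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Hypothetical data: a noisy line, deliberately fit with too many polynomial terms.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 8)
y = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=x.shape)
X = np.vander(x, 6, increasing=True)  # columns 1, x, ..., x^5

beta_ols = ridge_fit(X, y, lam=0.0)    # lam = 0 is plain least squares
beta_ridge = ridge_fit(X, y, lam=1.0)  # lam > 0 shrinks the coefficients

print(np.linalg.norm(beta_ols), np.linalg.norm(beta_ridge))
```

The norm of the regularized solution is smaller than the unregularized one, which is the shrinking effect the penalty is meant to produce.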

---

## Fit With Regularization

---

## Regularization

* Notice how consistent this will be with neural networks
* We can increase the representational power of our model if we have appropriately strong regularization

---

## Different Sources

* Fitting to different sources ends up drawing a boundary line

---

## Regression to Classification

* That boundary can be called a "decision boundary"
  * Anything above is more likely to be from population A
  * Anything below is more likely to be from population B
* We can turn the distance from the boundary into a class with the sigmoid function
  * Transforms what we're doing into logistic regression
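
To illustrate turning a signed score into a class probability, here is a minimal sketch. The weights, bias, and test points are hypothetical, and `predict_proba` is an illustrative helper name:

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(weights, bias, x):
    """Probability of population A from the signed score bias + w . x."""
    score = bias + sum(w * xi for w, xi in zip(weights, x))
    return sigmoid(score)

# Hypothetical decision boundary: the line x0 = x1.
weights, bias = [1.0, -1.0], 0.0
p_on = predict_proba(weights, bias, [2.0, 2.0])     # on the boundary
p_above = predict_proba(weights, bias, [3.0, 1.0])  # positive side
p_below = predict_proba(weights, bias, [1.0, 3.0])  # negative side
print(p_on, p_above, p_below)
```

A point on the boundary gets probability 0.5, and the probability moves toward 1 or 0 as the point moves away on either side.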

---

## Sigmoid

---

## Problems with Logistic Regression

* Regression is influenced by population statistics
  * Not by individual points
* Which is great! Unless it isn't!
* We can have drastically different datasets that will have the same regression line

---

## Example 1

---

## Example 2

---

## Problem

* Decision boundaries will likewise only care about population statistics
  * Which means that the algorithms won't "care" if a point is wrong
* One solution is to keep adjusting our decision boundary iteratively
  * Choose some increment, and adjust in the right direction whenever we are wrong
  * This is the perceptron algorithm

---

## Perceptron

* The perceptron is a line that estimates a class using the sign function
  * $\hat{y} = sign(bias + \sum_i weight_i x_i)$
* Every time the perceptron is wrong, we adjust the bias and weights in the "correct" direction by a learning rate, $\lambda$
  * $weight_i \leftarrow weight_i + \lambda y x_i$
  * $bias \leftarrow bias + \lambda y$
* Predictions are
  * $\hat{y} = sign(bias + WX)$
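
The update rule can be sketched in a few lines of plain Python. The dataset (a logical AND with labels in $\{-1, +1\}$), the `max_epochs` cap, and the convention that a score of exactly zero maps to $+1$ are assumptions made for this example:

```python
def sign(v):
    # Assumption: a score of exactly zero is treated as the positive class.
    return 1 if v >= 0 else -1

def predict(weights, bias, x):
    return sign(bias + sum(w * xi for w, xi in zip(weights, x)))

def train_perceptron(points, labels, lr=1.0, max_epochs=100):
    weights = [0.0] * len(points[0])
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            if predict(weights, bias, x) != y:
                # Move the line toward the misclassified point's correct side.
                weights = [w + lr * y * xi for w, xi in zip(weights, x)]
                bias += lr * y
                mistakes += 1
        if mistakes == 0:  # a full pass with no errors: converged
            break
    return weights, bias

# Hypothetical linearly separable data: logical AND.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
labels = [-1, -1, -1, 1]
weights, bias = train_perceptron(points, labels)
predictions = [predict(weights, bias, x) for x in points]
print(weights, bias, predictions)
```

Only the current weights and bias are kept between updates, which is why the method works online.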

---

## Advantages

* If the data is linearly separable, then this will always find a solution with no misclassification
* It is also online, so we can create a classifier from datapoint 1
  * If we discard the training points and just remember our current weight and bias, it is quite light
  * Used for branch prediction in CPUs, for example

---

## Separability

* What if the data isn't linearly separable?
  * Then we iterate forever, and the line keeps moving

---

## The Kernel Trick

* $\hat{y} = sign(\sum_{i=1}^{N}\alpha_i y_i \mathcal{K}(x_i,t))$
  * Now we don't look at the distance of a point, $t$, from our line
  * We look at some function, $\mathcal{K}$ instead
* As a drawback, we need to remember all of our training points
  * Also, the number of times each was incorrect, $\alpha$
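
A minimal kernel perceptron sketch follows, assuming an RBF kernel with $\gamma = 1$ and XOR-style data that no straight line can separate. The function names are illustrative. Note that the model is just the stored training points plus the mistake counts $\alpha$:

```python
import math

def rbf_kernel(a, b, gamma=1.0):
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * sq_dist)

def kernel_predict(train_x, train_y, alpha, t, kernel):
    # sign( sum_i alpha_i * y_i * K(x_i, t) ), with zero mapped to +1
    total = sum(a * y * kernel(x, t) for a, y, x in zip(alpha, train_y, train_x))
    return 1 if total >= 0 else -1

def train_kernel_perceptron(train_x, train_y, kernel, max_epochs=100):
    alpha = [0] * len(train_x)  # number of mistakes made on each training point
    for _ in range(max_epochs):
        mistakes = 0
        for i, (x, y) in enumerate(zip(train_x, train_y)):
            if kernel_predict(train_x, train_y, alpha, x, kernel) != y:
                alpha[i] += 1
                mistakes += 1
        if mistakes == 0:
            break
    return alpha

# Hypothetical XOR data: not linearly separable in the original space.
train_x = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
train_y = [-1, 1, 1, -1]
alpha = train_kernel_perceptron(train_x, train_y, rbf_kernel)
predictions = [kernel_predict(train_x, train_y, alpha, x, rbf_kernel) for x in train_x]
print(alpha, predictions)
```

Every prediction loops over all stored training points, which is the memory and compute drawback noted above.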

---

## Support Vector Machines

* It turns out that those requirements make kernel perceptrons prohibitively expensive to train
* If only we could make the same decision, but only using the points that we need to define the boundary
  * And while we're at it, let's optimize the boundary, maximizing the distance from the points on either side
* This is the support vector machine (SVM)

---

## Support Vectors

* The support vectors are the points nearest to the decision boundary
* If $m^+$ and $m^-$ are maximized, then there should be at least one support vector from each class


---

## Solving SVMs

* Maximize $-\frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n\alpha_i \alpha_j y_i y_j x^T_i x_j + \sum_{i=1}^n\alpha_i$
* Subject to $\sum_{i=1}^n\alpha_i y_i = 0$
* And $\alpha_i \geq 0$ for all $i \in \{1, \dots, n\}$
* Looks scary, but it isn't so bad
  * For a computer at least, when we're dealing with large datasets

---

## Non-Separable Data

* But what about non-separable data?
  * We add a "soft margin"
  * This allows our solution to be somewhat wrong
* We'll call the parameter C, for complexity
  * C governs how large $\alpha$ can be for any support vector

---

## Complexity

* Without C, two linearly inseparable points would be sampled infinitely
  * Like a non-converging perceptron
* With large C, a single point can serve as the sole support vector for a class
  * If the data is linearly separable
* With small C, we force more points to be support vectors
  * *Every* point becomes a support vector if C is small enough

---

## C and $\alpha$

* The $\alpha_i$ value in relation to C informs us about $x_i$
  * If $\alpha_i = 0$ then the point is ignored (no error; it is not a support vector)
  * If $0 \lt \alpha_i \lt C$ then $\xi_i = 0$ and the point is on the margin
  * If $\alpha_i = C$ then this point is on or inside of the margin, and may be misclassified
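
The three cases can be written as a small lookup. This is a sketch: the function name and the numerical tolerance `tol` are assumptions for illustration, since solvers rarely return exact zeros or exact values of C:

```python
def describe_point(alpha_i, C, tol=1e-8):
    """Interpret a dual coefficient alpha_i relative to the soft-margin bound C."""
    if alpha_i <= tol:
        return "ignored: not a support vector"
    if alpha_i < C - tol:
        return "support vector on the margin"
    return "support vector on or inside the margin, possibly misclassified"

# Hypothetical coefficients for C = 1.0
for a in (0.0, 0.3, 1.0):
    print(a, "->", describe_point(a, C=1.0))
```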

---

## Kernels

* We saw a few
  * Dot product, a simple linear kernel
  * Polynomial kernel
  * Radial basis function kernel
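
For reference, the three kernels can be sketched as plain Python functions of two points. The parameter names `degree`, `coef0`, and `gamma` are illustrative choices here, not taken from the slides:

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def linear_kernel(a, b):
    return dot(a, b)

def poly_kernel(a, b, degree=2, coef0=1.0):
    # (a . b + coef0)^degree; coef0 = 0 gives the homogeneous polynomial kernel
    return (dot(a, b) + coef0) ** degree

def rbf_kernel(a, b, gamma=1.0):
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * sq_dist)

a, b = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(a, b))  # 1*3 + 2*4
print(poly_kernel(a, b))    # (11 + 1)^2
print(rbf_kernel(a, a))     # identical points give the maximum value
```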

---

## SVM - Linear Kernel

$\mathcal{K}(x_1, x_2)=x_1^Tx_2$

---

## SVM - Poly Kernel

$\mathcal{K}(x_1, x_2)=(x_1^Tx_2)^2$

---

## SVM - Poly Kernel

$\mathcal{K}(x_1, x_2)=(x_1^Tx_2 + 1)^2$

---

## SVM - Poly Kernel

$\mathcal{K}(x_1, x_2)=(x_1^Tx_2 + 10)^2$

---

## SVM - Poly Kernel

$\mathcal{K}(x_1, x_2)=(x_1^Tx_2)^3$

---

## SVM - Poly Kernel

$\mathcal{K}(x_1, x_2)=(x_1^Tx_2 + 1)^3$

---

## SVM - Poly Kernel

$\mathcal{K}(x_1, x_2)=(x_1^Tx_2 + 10)^3$

---

## SVM - Poly Kernel

$\mathcal{K}(x_1, x_2)=(x_1^Tx_2)^4$

---

## SVM - Poly Kernel

$\mathcal{K}(x_1, x_2)=(x_1^Tx_2)^5$

---

## SVM - RBF Kernel

$\mathcal{K}(x_1, x_2)=\exp(-\gamma \lVert x_1 - x_2 \rVert^2)$

---

## SVM - Harder margin

$\mathcal{K}(x_1, x_2)=\exp(-\gamma \lVert x_1 - x_2 \rVert^2)$

---

## Linear NNs

* The next classifier, in terms of power, is a linear neural network
  * Linear neural networks can use any number of parameters, so they are powerful in that sense
* On their own, though, linear neural networks are not really better than an SVM
  * And they don't offer any guarantees
* So neural networks really don't come into their own until we look at structured data
  * And SVMs should be your default classifier for numeric data

---

## Before Structured Data

* What about other kinds of data?
* If we have categorical features, or a mix of categorical and numeric features, we can classify with decision trees
* If we don't have any labels, then we can use clustering
  * Probability models, like a Gaussian mixture model, can be used with categorical and continuous data
  * Otherwise, K-Means works for continuous data that is normally distributed

---

## Trees

* Trees make a decision boundary at every node
  * Partitioning the data with the "best" metric each time
  * Which metric is "best"?
* There are some options, but let's use Gini Impurity
  * Sum the probability of being wrong for each class
    * $1 - \sum_{c=1}^{C}\hat{p}_{c}^2$
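
A short sketch of the impurity formula, where $\hat{p}_c$ is estimated as the fraction of labels in class $c$. The function name and the example label lists are made up for illustration:

```python
from collections import Counter

def gini_impurity(labels):
    """1 - sum_c p_c^2, where p_c is the fraction of labels in class c."""
    n = len(labels)
    if n == 0:
        return 0.0  # convention chosen here for an empty node
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())

pure = gini_impurity(["A", "A", "A", "A"])   # one class only
mixed = gini_impurity(["A", "A", "B", "B"])  # evenly split between two classes
print(pure, mixed)
```

A pure node scores 0, and an even two-class split scores 0.5, the maximum for two classes.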

---

## Building a Decision Tree

* To build a tree, check every value in every column
  * Find the smallest impurity, then use that as the pivot
  * Divide the data, and recursively repeat
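
The split search for a single node can be sketched as below. It assumes numeric columns, a "less than or equal" threshold test, and the size-weighted Gini impurity of the two children as the score. The toy rows and the `best_split` name are hypothetical, and the recursion into each side is left out:

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Try every value in every column; return (column, threshold, impurity)."""
    best = None
    n = len(rows)
    for col in range(len(rows[0])):
        for threshold in sorted({row[col] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[col] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[col] > threshold]
            if not left or not right:
                continue  # a split must send data to both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (col, threshold, score)
    return best

# Hypothetical data: column 0 separates the classes, column 1 does not.
rows = [[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0]]
labels = ["A", "A", "B", "B"]
split = best_split(rows, labels)
print(split)
```

A full tree builder would partition the rows at the returned pivot and call the same search on each side until the nodes are pure or a stopping rule applies.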

---

## Tree Drawbacks

* Trees will always find a solution to separate the data if it is possible
  * This isn't always good; it makes a complicated model
* So we regularize
  * Bagging, Forests, and AdaBoost
* AdaBoost is the most important, and works by adjusting weights of data

---

## Boosting

* The weighting of sample points *boosts* the models in the ensemble
* This forces them to be different, improving the ensemble
* Better concept than bagging
  * Weight adjusting is finer than duplication
  * Guarantees that the next model in the ensemble is focusing on different datapoints than the last

---

## Weight Adjustment

* Initialize weights to 1/N for N samples
* We train multiple classifiers, each seeing a different set of data weights
* Weights adjust each time, based upon error rate, $\epsilon$
  * Increase weights of misclassified data
  * Decrease weights of correctly classified data

---

## Confidence Factor

* Weights are scaled by $e^{+\alpha}$ for misclassification, $e^{-\alpha}$ for correct classification, then renormalized
  * $\alpha = \frac{1}{2} \ln\frac{1-\epsilon_t}{\epsilon_t}$
* $\alpha$ is also the confidence of the model at step t
  * When the models vote, their prediction is multiplied by their confidence
* AdaBoost is the most popular boosting ensemble method
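
One round of reweighting can be sketched as follows. This assumes the standard AdaBoost form $\alpha = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$ with multiplicative updates $e^{\pm\alpha}$ followed by renormalization. The function name and the four-sample example are hypothetical:

```python
import math

def adaboost_round(weights, correct):
    """One reweighting step.

    weights: current sample weights, summing to 1
    correct: booleans, True where the current weak learner was right
    Assumes the weighted error is strictly between 0 and 1.
    """
    eps = sum(w for w, ok in zip(weights, correct) if not ok)
    alpha = 0.5 * math.log((1.0 - eps) / eps)
    scaled = [w * math.exp(-alpha if ok else alpha)
              for w, ok in zip(weights, correct)]
    total = sum(scaled)
    return alpha, [w / total for w in scaled]

# Hypothetical round: N = 4 samples with weight 1/N, one of them misclassified.
weights = [0.25, 0.25, 0.25, 0.25]
correct = [True, True, True, False]
alpha, new_weights = adaboost_round(weights, correct)
print(alpha, new_weights)
```

After renormalization the misclassified points hold half of the total weight, so the next weak learner cannot do well by repeating the same mistakes.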

---

## What to Ensemble?

* We could use AdaBoost with regression
* But we tend to use it with decision stumps
  * This is a tree with a single node
* Why? Because AdaBoost adds the ability to fit
  * So we want to start with a very high bias, simple classifier

---

## AdaBoost Example

---

## Classifiers

* In general
  * SVMs are your best numeric classifier
  * AdaBoost is best for anything with categories
  * Structured data goes to DNNs

---

## Small Structured Data

* But neural networks are a big hammer
* So if we can use a Markov Model, for example, we would prefer that
  * Emission, transition, and stationary state probabilities give more information

---

## Clustering

* Sometimes we don't have labels!
  * Clustering doesn't need labels!
  * K-Means
    * K-Means++
  * Mixture Models
* These work well with feature vectors as well, and help humans analyze data
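
A bare-bones K-Means loop is sketched below. It assumes squared Euclidean distance and caller-supplied starting centers, so it does not implement the K-Means++ seeding. The points and starting centers are made up for illustration:

```python
def kmeans(points, centers, max_iters=100):
    """Alternate between assigning points and recomputing centers."""
    clusters = [[] for _ in centers]
    for _ in range(max_iters):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[dists.index(min(dists))].append(p)
        # Update step: each center moves to the mean of its cluster.
        new_centers = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else c
            for cl, c in zip(clusters, centers)
        ]
        if new_centers == centers:  # nothing moved: converged
            break
        centers = new_centers
    return centers, clusters

# Hypothetical data: two well-separated groups of three points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, clusters = kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)])
print(centers)
```

No labels are used anywhere. The result depends on the starting centers, which is the problem K-Means++ seeding is meant to reduce.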

---

## Final Format

* Similar to quizzes
  * With more open-ended or fill-in questions
  * Still using multiple choice
* Solutions to all quizzes will be posted for review

<!--
bias and variance tradeoff

Model-based to model-free.
Least squares regression, Perceptron, SVM, Linear NN
Classification from regression
Linear separability, the kernel trick, universal approximation theorem

Categorical Data
Decision trees, bagging, forests, and boosting

Unsupervised
Clustering with knn, kmeans, gmm, other mixture models

Structured Data
Markov models, convolutional neural networks, RNNs, LSTMs, and attention models