<!--
Abstract:

CS 461
Introduction to Deep Learning
Lecture 01: Introduction
-->

# CS 461 - Lecture 01

## Machine Learning Principles

Bernhard Firner

2025-09-08

---

## Course Details

* My email: `bfirner@cs.rutgers.edu`
* Canvas
  * Not yet
* Office: Hill 273
* Office hours: TBD

---

## Syllabus: Grading

* Current plan
  * 20% Quizzes
  * 50% Assignments
  * 30% Final Exam

---

## Syllabus: Tools

* Python
* Numpy
* Maybe Pandas or PyTorch for specific tasks
  * For example, to show how SVMs can be used with feature vectors from a neural network
* In general, we want you to learn how to implement things on your own

---

## Academic Integrity

* [CS Academic Integrity Policy](https://www.cs.rutgers.edu/academics/undergraduate/academic-integrity-policy)
* Feel free to use any LLM in this course
  * Record all prompts and responses
  * We will use them as part of a discussion at the end of the course

---

## Absences & Late assignments

* Use the self-reporting tool
  * [https://sims.rutgers.edu/ssra/](https://sims.rutgers.edu/ssra/)
* Late assignments
  * 20% deduction for up to 2 days late
  * 50% deduction for up to 7 days late
  * Exceptions for major illnesses
* Missed exams
  * You must have a valid excuse to schedule a make-up exam

---

## Exceptions

* Inform me ahead of time of special circumstances
  * Meaning non-emergency situations
* If you have an emergency, please don't stop to email me
  * You can self-report after the fact

---

## Appropriate Conversations

* I can help with problems related to this class
* Can probably help with professional questions
* I'm not a psychologist. I probably can't help with other issues.
  * I am also required to report some topics, so don't assume everything you tell me will remain private

---

## My Background

* 8 years doing autonomous driving at NVIDIA
  * Hundreds of miles between highway failures after 4 years
* Other industry experience in avionics, embedded systems, and autonomous drones
* You may think that we stopped using "old" machine learning techniques
  * Not true; I saw them used constantly

---

## Course Goals

* My goals
  * Teach you enough that you can use ML in real life
  * Give you a strong foundation for future courses
* Your goals?

---

## Material

* CS440 is AI
* CS461 is ML, a subset of AI
* CS462 is DL, a subset of ML

---

## Machine Learning

* Nowadays we do this with computers
  * Origins are in hand calculations
* Important distinctions from other approaches
  * Data-driven
  * Make assumptions about statistics of data
    * About noise, bias, and measurement errors
  * Techniques are formulaic

---

## Rough Course Outline

1. Least squares fitting, regression, classification, and statistical models
2. "Model free" machine learning
3. Introduction to deep learning

---

## Course Philosophy

* Try to tell a story from early techniques through today
* Neural networks may overshadow other techniques
  * But NNs are not always appropriate
  * And some techniques work well *with* NNs

---

## Early Motivations

* Astronomers were the rock stars of their day
  * Early astronomers used movements of stellar objects to predict the future
  * Later astronomers used math to predict future movements of objects
  * Many techniques were created in the 1700s
* There is a lot to be learned from a brief history lesson

---

## Epicycles

* Kepler's [laws of planetary motion](https://en.wikipedia.org/wiki/Kepler%27s_laws_of_planetary_motion) were published in the early 1600s
  * Finally supplanted [epicycles](https://en.wikipedia.org/wiki/Deferent_and_epicycle) for good
* Epicycles had planets moving in circular orbits and smaller cycles overlaid on top of the larger orbit
  * They matched observations from Earth
    * Other planets' relative speed isn't constant

</div>
<div class="col">

</div>
</div>

---

## Early Overfitting

* It turns out that epicycles can be used to match *any* orbit
  * The model was too powerful!
* This is a fundamental problem with many ML techniques
  * If you decide upon a model, how do you know it is the right one?
  * If you can't make predictions using a model, what is the point?

---

## Predictions

* Kepler's laws are relatively simple
  * The equations governing motion are not
  * Especially not when combined with measurement error
* This had practical impacts when attempting predictions
  * Ceres is a well known example

---

## Gauss

* Gauss predicted Ceres' reemergence from behind the Sun's glare
  * Big result at the time
* That name should be familiar; he discovered the Gaussian distribution
  * Happened during work on estimation

---

## Predicting a World

* Gauss solved for Ceres' orbit by combining days' worth of observations
* The number of observations outnumbered the unknowns in his equations
* However, each observation had some error
  * This is where statistics comes into play
  * As the number of samples grows, the mean of the errors approaches 0
    * If the errors are unbiased and uncorrelated

---

## Morals

* Machine learning is all about models and statistics
  * Is your model capable of solving the problem?
  * Do you have enough data to make your statistics favorable?
  * Do you understand the noise and errors in your data?

---

## Notation

* This should be a review
* Let's say we want to predict the value of a variable
  * $Y$ is the thing we want to predict
  * $X$ is the information we use
* The thing we use to predict is the model
  * It is a function
  * Technically, $\hat{f}(X) = m$
    * The hat means that $f$ is an estimator

---

## Talking About Error

* How wrong is a model?
* *Mean squared error* is our typical error function
  * Going to write that as *MSE*
* The expected error of a model using MSE is
  * $ \mathbb{E}[(Y - m)^2] $
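
In practice we estimate this expectation by averaging over samples. A minimal sketch, assuming NumPy; the helper name `mse` and the sample values are made up for illustration:

```python
import numpy as np

def mse(y, m):
    """Mean squared error between observations y and predictions m."""
    y = np.asarray(y, dtype=float)
    m = np.asarray(m, dtype=float)
    # Broadcasting lets m be a single constant guess or one guess per sample
    return np.mean((y - m) ** 2)

y = np.array([1.0, 2.0, 3.0])  # hypothetical observations
print(mse(y, 2.0))             # error of guessing 2 for every sample
```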

---

## Bias and Variance

* We can split the error into a bias term and a variance term
  * $\mathbb{E}[(Y - m)^2] = (\mathbb{E}[Y-m])^2 + Var[Y-m]$
  * The first term is now the squared bias of our estimating function
  * The second is the variance, and reduces to $Var[Y]$
    * Since our estimate has no impact on $Var[Y]$
* $\mathbb{E}[(Y - m)^2] = (\mathbb{E}[Y] - m)^2 + Var[Y]$
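
The decomposition can be checked numerically. This is only a sketch: the distribution of $Y$ and the constant guess `m` are arbitrary choices for illustration, and sample averages stand in for the expectations.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical samples of Y; any distribution with finite variance would do
y = rng.normal(loc=3.0, scale=2.0, size=100_000)
m = 2.5  # an arbitrary constant guess

lhs = np.mean((y - m) ** 2)              # sample estimate of E[(Y - m)^2]
rhs = (np.mean(y) - m) ** 2 + np.var(y)  # squared bias + variance
print(lhs, rhs)
```

With `np.var` using its default population form, the two sides agree up to floating point error.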

---

## Bias

* Bias is $\mathbb{E}[\hat{f}(X) - Y(X)]$
* An unbiased estimator has 0 expected error
  * Not 0 error!
  * The sum of all errors is 0!
  * E.g., $\hat{f}(X) = 0$ is an unbiased estimate of $\sin$ over a full period
    * But a terrible one; the variance never goes to 0
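
A quick numerical check of the zero estimator, assuming we evaluate $\sin$ on an evenly spaced grid over one full period (the grid size is an arbitrary choice):

```python
import numpy as np

# One full period, excluding the endpoint so the grid is exactly periodic
x = np.linspace(0.0, 2.0 * np.pi, 1000, endpoint=False)
y = np.sin(x)
f_hat = np.zeros_like(x)  # the "always guess 0" estimator

bias = np.mean(f_hat - y)                # averages out to (numerically) zero
mean_sq_err = np.mean((f_hat - y) ** 2)  # but the squared error does not
print(bias, mean_sq_err)
```

The bias is zero to within rounding, while the mean squared error stays near one half.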

---

## Best Predictor

* It should be obvious that the best predictor is an unbiased one
* If you can only make one guess, $m$, for all values of $x$, guess the mean of $Y$
  * Regardless of the variance in $Y$
* If $Y$ still has nonzero variance at a given $x$, then guess $m(x) = \mathbb{E}[Y|x]$
* How do we get the best estimate of the mean?
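
To see why the mean is the best single guess, we can scan a grid of candidate constants and pick the one with the lowest MSE. This is a sketch with made-up data; the distribution and the grid are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(loc=1.5, scale=1.0, size=10_000)  # hypothetical samples of Y

# Try many constant guesses and measure the MSE of each
candidates = np.linspace(0.0, 3.0, 301)
errors = np.array([np.mean((y - m) ** 2) for m in candidates])
best = candidates[np.argmin(errors)]

print(best, y.mean())
```

The winning candidate lands on the grid point closest to the sample mean.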

---

## Central Limit Theorem

* If every dataset required new analytics, life would be hard
  * But we are lucky!
* Let's say $Y = signal + noise$
  * As $n \rightarrow \infty$, the mean estimate approaches the true mean
    * $\frac{1}{n}\sum_{i=1}^{n}Y_{i} \rightarrow \mathcal{N}(\mathbb{E}[Y], Var[Y]/n)$
  * True if samples of $Y$ are independent, come from the same distribution, and have finite variance
    * This is `independent and identically distributed`, `IID`
---

## Reducing Variance

* Sometimes we want to reduce the variance of our predictions
  * Even at the cost of increasing error
  * One example is $\frac{1}{(n+1)}\sum_{i=1}^{n}X_{i}$
  * This is a biased estimator
    * Biased towards 0
  * But still converges as $n \rightarrow \infty$
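
The trade-off can be simulated by comparing the plain mean with the $\frac{1}{n+1}$ version over many small datasets. The true mean, noise level, and sizes below are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
true_mean, n, trials = 2.0, 10, 50_000

# Each row is one dataset of n samples
x = rng.normal(loc=true_mean, scale=1.0, size=(trials, n))
plain = x.sum(axis=1) / n         # unbiased sample mean
shrunk = x.sum(axis=1) / (n + 1)  # biased towards 0, lower variance

print(plain.mean() - true_mean, plain.var())
print(shrunk.mean() - true_mean, shrunk.var())
```

The second estimator is consistently low, but its estimates are more tightly clustered.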

---

## Challenges in ML

* So more data means a better estimate
* However, adding more data isn't always easy
  * Data is going to be contradictory
  * Data can grow stale
  * Distributions may actually be long-tailed

---

## Starting Example

* Let's say we just want to guess a single number
  * Predict $y_i = \hat{f}(x_i)$
  * Observe some $y_i$ given samples, $x_1, x_2, ..., x_n$
* Goal is to minimize $MSE$

---

## Fitting a simple curve

* Let's fit a polynomial to the `sin` function
  * Range from $0$ to $\pi/4$
* We'll use a first degree polynomial to predict $y$ given $x$
  * $y = \beta_0 + \beta_{1}x$
    * This is our `model`

</div>
<div class="col">
<img style="width: 95%" class="r-stretch" src="./figures/ex1_sin.png" />

</div>
</div>

---

## Without Noise

* We'll have observations of the `sin` function from 0 to $\pi/4$
  * This gives us `x`, `y` pairs
  * With no noise and a proper model we can easily minimize error
* But what happens with noise?
  * $Y = \beta_0 + \beta_{1}x + \epsilon_i$
  * That's next
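
Generating the `x`, `y` pairs takes only a few lines. In this sketch the number of samples and the noise standard deviation are hypothetical values, not ones fixed by the lecture.

```python
import numpy as np

rng = np.random.default_rng(4)

# Noise-free observations of sin on [0, pi/4]
x = np.linspace(0.0, np.pi / 4, 20)
y_clean = np.sin(x)

# The same observations with additive Gaussian noise (assumed std of 0.05)
noise = rng.normal(loc=0.0, scale=0.05, size=x.shape)
y_noisy = y_clean + noise
```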

---

## Estimating with Least Squares

* Let's minimize the $MSE$
  * $error = Y - (\beta_0 + \beta_{1}X)$
  * $MSE = \frac{1}{n}\sum_{i=1}^{n}(Y_i - (\beta_0 + \beta_{1}X_i))^2$
  * For a single sample, $error^2 = \beta_{0}^2 + 2\beta_{0}\beta_{1}X - 2\beta_{0}Y + \beta_{1}^{2}X^2 - 2\beta_{1}XY + Y^2$

---

## More Math...

* Take the derivative of the squared error with respect to $\beta_0$ and $\beta_1$
* Then set each to 0
  * $0 = 2\beta_0 + 2\beta_{1}X - 2Y$
  * $0 = 2\beta_{0}X + 2\beta_{1}X^2 - 2XY$
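
With many samples, averaging these two equations over the data (and dividing by 2) gives a 2x2 linear system in $\beta_0$ and $\beta_1$. A minimal sketch, assuming noise-free samples of `sin` on $[0, \pi/4]$; the sample count is arbitrary.

```python
import numpy as np

x = np.linspace(0.0, np.pi / 4, 50)
y = np.sin(x)

# Averaged derivative equations:
#   beta0 * 1       + beta1 * mean(x)   = mean(y)
#   beta0 * mean(x) + beta1 * mean(x^2) = mean(x * y)
A = np.array([[1.0, x.mean()],
              [x.mean(), (x ** 2).mean()]])
b = np.array([y.mean(), (x * y).mean()])

beta0, beta1 = np.linalg.solve(A, b)
print(beta0, beta1)
```

The result matches what `np.polyfit(x, y, 1)` returns, with the slope listed first in that function's output.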

---

## Samples

* Let's say we have two samples:
  * $f(Y|x=0) = 0$
  * $f(Y|x=0.5) \approx 0.48$

---

## 2 Equations, 2 Unknowns

* Plug in $f(X=0)=0$
  * From equation 1
  * $0 = 2\beta_{0}$
    * $\beta_{0} = 0$

---

## 1 Equation, 1 Unknown

* Plug in $f(X=0.5) \approx 0.48$
  * $0 = 2\beta_{0}X + 2\beta_{1}X^2 - 2XY$
  * $0 = 0.5\beta_{1} - 0.48$
  * $\beta_{1} = 0.96$

</div>
<div class="col">
<img style="width: 95%" class="r-stretch" src="./figures/ex1_solution.png" />

</div>
</div>
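
The two-sample solution can be reproduced by requiring the line to pass through both points. A sketch assuming NumPy; it uses `np.sin(0.5)` rather than the rounded value 0.48, so the slope differs slightly in the third decimal place.

```python
import numpy as np

x = np.array([0.0, 0.5])
y = np.sin(x)  # approximately [0, 0.48]

# Each row is [1, x_i], so X @ [beta0, beta1] = y
X = np.column_stack([np.ones_like(x), x])
beta0, beta1 = np.linalg.solve(X, y)

print(beta0, round(beta1, 2))
```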

---

## Observations

* We just solved for the intercept and slope
  * But things will get complicated with more complex equations
* Which is too bad, because our model is too weak
* And what if there is noise?
  * How do we combine samples?
  * And how can we tell that our model is too weak when samples are noisy?

---

## Machine Learning

* Matrix math is the beating heart of ML
  * And I guess data is the air it breathes
* Both made possible with computers
  * So we've tackled harder problems as a result
* Next class: least squares with Gaussian noise
  * A solution that improves as we add data