* Kepler's [laws of planetary motion](https://en.wikipedia.org/wiki/Kepler%27s_laws_of_planetary_motion) were published in the early 1600s
* Finally supplanted [epicycles](https://en.wikipedia.org/wiki/Deferent_and_epicycle) for good
* Epicycles had planets moving in circular orbits, with smaller circles overlaid on top of the larger orbit
* They matched observations from Earth
* As seen from Earth, other planets' relative speed isn't constant
---
## Early Overfitting
* It turns out that epicycles can be used to match *any* orbit
* The model was too powerful!
* This is a fundamental problem with many ML techniques
* If you decide upon a model, how do you know it is the right one?
* If you can't make predictions using a model, what is the point?
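
A minimal sketch of the same failure mode, swapping epicycles for polynomials. Everything here is an assumption for illustration: the true relationship is $y = x$, the noise alternates between $+0.1$ and $-0.1$, and the helper names are hypothetical. A polynomial through every training point matches the data exactly but predicts badly between points, while a straight line does not.

```python
def lagrange_predict(xs, ys, x):
    """Evaluate the polynomial that passes through every (xs, ys) point at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        weight = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                weight *= (x - xj) / (xi - xj)
        total += yi * weight
    return total


def fit_line(xs, ys):
    """Closed-form least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mse(truth, preds):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(truth, preds)) / len(truth)


# Assumed setup: truth is y = x, observed with alternating +/-0.1 noise.
train_x = [float(i) for i in range(10)]
train_y = [x + (0.1 if i % 2 == 0 else -0.1) for i, x in enumerate(train_x)]
test_x = [i + 0.5 for i in range(9)]   # points between the training points
test_y = test_x[:]                     # noise-free truth at those points

slope, intercept = fit_line(train_x, train_y)
line_test = mse(test_y, [slope * x + intercept for x in test_x])
interp_train = mse(train_y, [lagrange_predict(train_x, train_y, x) for x in train_x])
interp_test = mse(test_y, [lagrange_predict(train_x, train_y, x) for x in test_x])

print("line, held-out MSE:        ", line_test)
print("interpolant, training MSE: ", interp_train)
print("interpolant, held-out MSE: ", interp_test)
```

The interpolant's training error is zero, which is exactly why training error alone cannot tell you whether the model is the right one.
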
---
## Predictions
* Kepler's laws are relatively simple
* The equations governing motion are not
* Especially not when combined with measurement error
* This had practical impacts when attempting predictions
* Ceres is a well-known example
---
## Gauss
* Gauss predicted Ceres' reemergence from behind the Sun's glare
* Big result at the time
* That name should be familiar; the Gaussian distribution is named after him
* It came out of his work on estimation
---
## Predicting a World
* Gauss solved for Ceres' orbit by combining days' worth of observations
* The number of observations outnumbered the unknowns in his equations
* However, each observation had some error
* This is where statistics comes into play
* As the number of samples grows, the mean of the errors approaches 0
* If the errors are unbiased and uncorrelated
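
The averaging claim can be checked with a small simulation. This is a sketch under assumptions, not Gauss's method: the errors are drawn as independent zero-mean Gaussians, and `mean_error`, `noise_sd`, and `seed` are hypothetical names.

```python
import random


def mean_error(n, noise_sd=1.0, seed=0):
    """Average of n simulated measurement errors (assumed zero-mean, independent)."""
    rng = random.Random(seed)
    errors = [rng.gauss(0.0, noise_sd) for _ in range(n)]
    return sum(errors) / n


# The average error should shrink toward 0 as the sample count grows.
for n in (10, 1_000, 100_000):
    print(n, mean_error(n))
```

If the errors were biased (nonzero mean), the average would settle on that bias instead of 0, no matter how many samples were collected.
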
---
## Morals
* Machine learning is all about models and statistics
* Is your model capable of solving the problem?
* Do you have enough data to make your statistics favorable?
* Do you understand the noise and errors in your data?
---
## Notation
* This should be a review
* Let's say we want to predict the value of a variable
* $Y$ is the thing we want to predict
* $X$ is the information we use
* The thing we use to predict is the model
* It is a function
* Technically, $\hat{f}(X) = m$
* The hat means that $f$ is an estimator
---
## Talking About Error
* How wrong is a model?
* *Mean squared error* is our typical error function
* Going to write that as *MSE*
* The expected error of a model using MSE is
* $ \mathbb{E}[(Y - m)^2] $
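
On a finite sample, the expectation is replaced by an average. A minimal sketch, with a hypothetical helper name and made-up numbers:

```python
def mse(ys, predictions):
    """Mean squared error: the average of (y - prediction)^2 over the sample."""
    if len(ys) != len(predictions):
        raise ValueError("ys and predictions must have the same length")
    return sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)


ys = [1.0, 2.0, 3.0, 6.0]      # made-up observations
m = 3.0                        # a single constant guess
print(mse(ys, [m] * len(ys)))  # (4 + 1 + 0 + 9) / 4
```
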
---
## Bias and Variance
* We can split the expected error into a bias term and a variance term
* $\mathbb{E}[(Y - m)^2] = (\mathbb{E}[Y-m])^2 + Var[Y-m]$
* The first term is now the squared bias of our estimating function
* The second is the variance, and reduces to $Var[Y]$
* Since our estimate has no impact on $Var[Y]$
* $\mathbb{E}[(Y - m)^2] = (\mathbb{E}[Y] - m)^2 + Var[Y]$
---
## Bias
* Bias is $\mathbb{E}[\hat{f}(X) - Y(X)]$
* An unbiased estimator has 0 expected error
* Not 0 error!
* The errors average out to 0!
* E.g., $\hat{f}(X) = 0$ is an unbiased estimate of $\sin$ over a full period
* But a terrible one; the variance never goes to 0
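
A quick numerical check of the $\sin$ example, assuming the inputs are spread evenly over one full period (that choice of inputs is an assumption of this sketch):

```python
import math

n = 1000
xs = [2 * math.pi * k / n for k in range(n)]  # evenly spaced over [0, 2*pi)

# Errors of the constant predictor f_hat(x) = 0 against sin(x).
errors = [0.0 - math.sin(x) for x in xs]

bias = sum(errors) / n               # positive and negative errors cancel
mse = sum(e ** 2 for e in errors) / n  # squared errors do not cancel

print("bias:", bias)
print("MSE: ", mse)
```

The bias comes out at essentially 0 while the mean squared error stays near 0.5, so an unbiased estimator can still be a poor one.
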
---
## Best Predictor
* It should be obvious that the best predictor is an unbiased one
* If you must make one guess $m$ for all values of $x$, guess the mean of $Y$
* Regardless of the variance in $Y$
* If the guess can depend on $x$, then guess $m(x) = \mathbb{E}[Y|x]$
* How do we get the best estimate of the mean?
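
To see that the mean is the best single guess under MSE, one can simply try many constant guesses. A brute-force sketch with made-up data and a hypothetical helper name:

```python
import statistics


def mse_of_constant(ys, m):
    """Mean squared error when the same value m is guessed for every y."""
    return sum((y - m) ** 2 for y in ys) / len(ys)


ys = [1, 2, 3, 6]                          # made-up observations
candidates = [i / 10 for i in range(71)]   # guesses 0.0, 0.1, ..., 7.0
best = min(candidates, key=lambda m: mse_of_constant(ys, m))

print("best constant guess:", best)
print("sample mean:        ", statistics.mean(ys))
```

The grid search is only for illustration; the squared-bias term in the decomposition is zero exactly when $m = \mathbb{E}[Y]$, which gives the same answer without searching.
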
---
## Central Limit Theorem
* If every dataset required new analytics, life would be hard
* But we are lucky!
* Let's say $Y = signal + noise$
* As $n \rightarrow \infty$, the mean estimate approaches the true mean
* For large $n$, $\frac{1}{n}\sum_{i=1}^{n}Y_{i}$ is approximately distributed as $\mathcal{N}(\mathbb{E}[Y], Var[Y]/n)$
* True if samples of $Y$ are independent, share the same distribution, and have finite variance
* This is *independent and identically distributed*, or *IID*
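
A small simulation of the variance claim, assuming uniform samples on $[0, 1)$ (mean $1/2$, variance $1/12$). The distribution, sample sizes, and function name are all choices made for this sketch.

```python
import random
import statistics


def sample_means(n, trials, seed=0):
    """Return `trials` sample means, each computed from n IID uniform draws."""
    rng = random.Random(seed)
    return [sum(rng.random() for _ in range(n)) / n for _ in range(trials)]


n = 30
means = sample_means(n, trials=10_000)

observed_var = statistics.pvariance(means)
predicted_var = (1 / 12) / n   # Var[Y] / n for uniform(0, 1)

print("mean of sample means:", statistics.mean(means))
print("observed variance:   ", observed_var)
print("predicted variance:  ", predicted_var)
```

A histogram of `means` should also look roughly bell-shaped, even though the individual draws are uniform.
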
---
## Reducing Variance
* Sometimes we want to reduce the variance of our predictions
* Even at the cost of increasing bias
* One example is $\frac{1}{(n+1)}\sum_{i=1}^{n}Y_{i}$
* This is a biased estimator
* Biased towards 0
* But still converges as $n \rightarrow \infty$
---
## Challenges in ML
* So more data means a better estimate
* However, adding more data isn't always easy
* Data is going to be contradictory
* Data can grow stale
* Distributions may actually be long-tailed
---
## Starting Example
* Let's say we just want to guess a single number
* Predict $\hat{y}_i = \hat{f}(x_i)$
* Observe some $y_i$ given samples $x_1, x_2, \ldots, x_n$
* Goal is to minimize $MSE$
---
## Fitting a Simple Curve