* The support vectors are the points nearest to the decision boundary
* If $m^+$ and $m^-$ are maximized, then there should be at least one vector from each class
---
## Solving SVMs
* Maximize $-\frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n\alpha_i \alpha_j y_i y_j x^T_i x_j + \sum_{i=1}^n\alpha_i$
* $\sum_{i=1}^N\alpha_i y_i = 0$
* $0 \leq \alpha_i \forall i \in N$
* Looks scary, but it isn't so bad
* For a computer at least, when we're dealing with large datasets
---
## Non-Separable Data
* But what about non-separable data?
* We add a "soft margin"
* This allows our solution to be somewhat wrong
* We'll call the parameter C, for complexity
* C governs how large $\alpha$ can be for any support vector
---
## Complexity
* Without C, two linearly inseparable points would be sampled infinitely
* Like a non-converging perceptron
* With large C, a single point can serve as the sole support vector for a class
* If the data is linearly separable
* With small C, we force more points to be support vectors
* *Every* point becomes a support vector if C is small enough
---
## C and $\alpha$
* The $\alpha_i$ value in relation to C informs us about $x_i$
* If $\alpha_i = 0$ then the point is ignored (no error)
* If $0 \leq \alpha_i \lt C$ then $\xi_i = 0$ and the point is on the margin
* If $\alpha = C$ then this point is on or inside of the margin, and may be misclassified
---
## Kernels
* We saw a few
* Dot product, a simple linear kernel
* Polynomial kernel
* Radial basis function kernel
---
## SVM - Linear Kernel
$\mathcal{k}=x^Tx$