# CS 462 - Lecture 02

## Supervised Learning

Bernhard Firner

2026-01-22

---

## Review

* Data can determine success or failure of a machine learning task
* But let's say we've gotten some data, and it's not too bad
  * Deep learning is a big hammer, how do we swing it?

---

## Simplest Example

* Let's say we want to predict a value of $y$ given $x$
  * So a simple regression task
* What is our model?
* What is our loss function?
* What is training?

---

## Example Model

* The simplest building block of a neural network is a line
  * $y = \phi_0 + \phi_1x$
  * This is called a **neuron**
  * $\phi$ is the set of all model parameters

</div>
<div class="col">
<img style="width: 100%" class="r-stretch" src="./figures/02_just_lines.svg" />

</div>
</div>

---

## What is a Model?

* $y = mx + b = \phi_1x + \phi_0$ is a simple model
* Maps from an input to an output
  * For example, maybe we want to predict how far a bird can fly given its wingspan
* Of course we can make predictions using multiple inputs or multiple outputs

---

## Notation

* Input will be $x$
* Output will be $y$
* Model is $y = f[x]$
* When inputs or outputs are vectors or matrices I will capitalize them
  * $Y = f[X]$
  * $Y_0$ would be the first output

---

## Linear Regression Model

* This model is a function of two parameters, $\phi_0$ and $\phi_1$
  * We could rewrite the output as $y = f[x, \phi]$
* Given a vector of labels $Y$ and a vector of training inputs $X$, we **train** the model by finding "good" values of $\phi$

</div>
<div class="col">
<img style="width: 100%" class="r-stretch" src="./figures/02_just_lines.svg" />

</div>
</div>

---

## Defining "Good"

* Rather than making a "goodness" metric, we generally do the opposite and define a **loss function**
* For regression, a common choice is the sum of the squared errors over all $I$ training points
  * $L[\phi] = \sum_{i=1}^I(f[x_i,\phi] - y_i)^2$
  * This is called the *least-squares loss*
* For our linear regression example, this is
  * $L[\phi] = \sum_{i=1}^I(\phi_0 + \phi_1 x_i - y_i)^2$
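
As a minimal sketch, the least-squares loss above can be written in a few lines of plain Python. The data points are the ones used in this lecture's plotting script. The names `least_squares_loss`, `flat_line`, and `tilted_line` are illustrative, and the two candidate parameter settings are arbitrary values chosen only for comparison.

```python
# Data points reused from this lecture's plotting script
x = [0.03, 0.19, 0.34, 0.46, 0.78, 0.81, 1.08, 1.18, 1.39, 1.60, 1.65, 1.90]
y = [0.67, 0.85, 1.05, 1.0, 1.40, 1.5, 1.3, 1.54, 1.55, 1.68, 1.73, 1.6]

def f(x_i, phi):
    """1D linear model: phi[0] is the intercept, phi[1] is the slope."""
    return phi[0] + phi[1] * x_i

def least_squares_loss(xs, ys, phi):
    """Sum of squared errors between predictions and labels."""
    return sum((f(x_i, phi) - y_i) ** 2 for x_i, y_i in zip(xs, ys))

# Two arbitrary candidate lines (hypothetical values, for illustration only)
flat_line = (1.0, 0.0)
tilted_line = (0.8, 0.5)

print(f"flat line loss:   {least_squares_loss(x, y, flat_line):.3f}")
print(f"tilted line loss: {least_squares_loss(x, y, tilted_line):.3f}")
```

The tilted line follows the trend in the data, so its loss should come out lower than that of the flat line.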

---

## Loss Examples

* Let's add some points to the graph
* Notice that the points don't all lie on a single line

</div>
<div class="col">
<img style="width: 100%" class="r-stretch" src="./figures/02_line_error_points.svg" />

</div>
</div>

---

## Loss Examples

* $L[\phi] = \sum_{i=1}^I(\phi_0 + \phi_1 x_i - y_i)^2$
* The loss is the sum of the squared vertical distances from every point to our line

</div>
<div class="col">
<img style="width: 100%" class="r-stretch" src="./figures/02_line_error_1.svg" />

</div>
</div>

---

## Loss Examples

* $L[\phi] = \sum_{i=1}^I(\phi_0 + \phi_1 x_i - y_i)^2$
* Some lines are obviously better than others

</div>
<div class="col">
<img style="width: 100%" class="r-stretch" src="./figures/02_line_error_2.svg" />

</div>
</div>

---

## Loss Examples

* $L[\phi] = \sum_{i=1}^I(\phi_0 + \phi_1 x_i - y_i)^2$
* We can minimize the error by solving for
  * $\hat{\phi} = \underset{\phi}{\mathrm{argmin}}\left[L[\phi]\right]$

</div>
<div class="col">
<img style="width: 100%" class="r-stretch" src="./figures/02_line_error_3.svg" />

</div>
</div>

---

## Minimizing Loss

* What does it mean to minimize the loss function?
* One way to think of this is to plot loss as a function of $\phi$
  * This is called the **loss surface**
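
One rough way to see the loss surface without plotting is to evaluate the loss on a grid of parameter values and look for the lowest sampled point. The sketch below assumes the same data and the same grid ranges as this lecture's plotting script. The names `surface`, `best_phi0`, and `best_phi1` are illustrative.

```python
# Data points reused from this lecture's plotting script
x = [0.03, 0.19, 0.34, 0.46, 0.78, 0.81, 1.08, 1.18, 1.39, 1.60, 1.65, 1.90]
y = [0.67, 0.85, 1.05, 1.0, 1.40, 1.5, 1.3, 1.54, 1.55, 1.68, 1.73, 1.6]

def loss(phi0, phi1):
    """Least-squares loss of the line y = phi0 + phi1 * x on the data."""
    return sum((phi0 + phi1 * x_i - y_i) ** 2 for x_i, y_i in zip(x, y))

# Grid of candidate parameters: phi0 in [0, 2) and phi1 in [-1, 1), step 0.02
phi0_values = [i * 0.02 for i in range(100)]
phi1_values = [-1.0 + j * 0.02 for j in range(100)]

# surface[i][j] is the loss at (phi0_values[i], phi1_values[j])
surface = [[loss(p0, p1) for p1 in phi1_values] for p0 in phi0_values]

# The lowest point of the sampled surface approximates the argmin
best_loss, best_phi0, best_phi1 = min(
    (surface[i][j], phi0_values[i], phi1_values[j])
    for i in range(len(phi0_values))
    for j in range(len(phi1_values))
)
print(f"lowest sampled loss {best_loss:.3f} at phi0={best_phi0:.2f}, phi1={best_phi1:.2f}")
```

A grid like this is only practical with a handful of parameters, since the number of grid points grows exponentially with the number of parameters.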

---

## Loss Surface

---

## Finding the Minimum

* Let's say we begin with a randomly initialized $\phi$
  * We can measure our error, and see how that compares to nearby values of $\phi$
* How do we get to the "best" parameters from there?
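  * In this case, the loss surface is convex, so any minimum we find is the global minimum
  * We can set the derivatives of $L[\phi]$ with respect to $\phi_0$ and $\phi_1$ to 0 and solve directly
  * This closed-form solution is how linear regression is usually solved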

---

## Learning Code

```python
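# One gradient descent step: solve for the next value of phi
y_hats = x * phi[1] + phi[0]
error = y_hats - y
# Calculate gradients w.r.t. phi
# Average over the size of the dataset. The product with x must be
# element-wise: `error @ x` is a dot product (a sum), not a mean.
dw = torch.mean(error * x)
db = torch.mean(error)

# Update; the learning rate is raised after the first few steps
learning_rate = 0.3
if len(phis) > 5:
    learning_rate = 1
next_phi1 = phi[1] - learning_rate * dw
next_phi0 = phi[0] - learning_rate * db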
```

---

## Different Visualization

---

## Training a Neural Network

* Not all loss surfaces are as simple as this one
  * Usually we cannot solve directly for $\hat{\phi} = \underset{\phi}{\mathrm{argmin}}\left[L[\phi]\right]$
* The solution is to "walk" from our current estimate of $\hat{\phi}$ to a better one
  * At each estimate, we check the local gradient and walk "downhill" towards a minimum
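
To illustrate why we say "a minimum" rather than "the minimum", here is a small sketch of the downhill walk on a made-up, non-convex 1D loss. The loss function, the starting points, the learning rate, and the name `walk_downhill` are all assumptions chosen for illustration. None of them come from the lecture's dataset.

```python
def loss(phi):
    """A made-up non-convex loss with two separate minima."""
    return phi**4 - 3 * phi**2 + phi

def gradient(phi):
    """Derivative of the loss above with respect to phi."""
    return 4 * phi**3 - 6 * phi + 1

def walk_downhill(phi, learning_rate=0.05, steps=200):
    """Repeatedly step against the local gradient."""
    for _ in range(steps):
        phi = phi - learning_rate * gradient(phi)
    return phi

# Two nearby starting points on either side of the bump between the minima
left = walk_downhill(-0.5)
right = walk_downhill(0.5)
print(f"start -0.5 -> phi = {left:.3f}, loss = {loss(left):.3f}")
print(f"start  0.5 -> phi = {right:.3f}, loss = {loss(right):.3f}")
```

The two runs settle in different minima with different losses, so on a surface like this the starting point affects where the walk ends.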

---

## Training

---

## Training Animated

-v-

## Code

```python
#! /usr/bin/python3

import torch
import matplotlib.pyplot as plt
import sys

"""
Adapted from the UDL notebook by Simon D Prince.
https://github.com/udlbook/udlbook/blob/main/Notebooks/Chap02/2_1_Supervised_Learning.ipynb
"""

# Some points, matching what is in the book.
x = torch.tensor([0.03, 0.19, 0.34, 0.46, 0.78, 0.81, 1.08, 1.18, 1.39, 1.60, 1.65, 1.90])
y = torch.tensor([0.67, 0.85, 1.05, 1.0, 1.40, 1.5, 1.3, 1.54, 1.55, 1.68, 1.73, 1.6 ])

# Define 1D linear regression model
def f(x, phi0, phi1):
    return phi0 + phi1*x

# Restrict the axes to the desired range and label them
def init_axes(ax):
    ax.set_xlim([0, 2])
    ax.set_ylim([0, 2])
    ax.set_xlabel('Input, $x$', fontsize=16)
    ax.set_ylabel('Output, $y$', fontsize=16)

def draw_errors(ax, xs, ys, phi):
    for x, y in zip(xs, ys):
        prediction = f(x, phi[0], phi[1])
        ax.plot([x, x], [y, prediction], markevery=[0], color='#070707', marker='o', linestyle='dashed')

# Function to calculate the loss
def compute_loss(x, y, phi):
    loss = torch.sum((phi[0] + phi[1]*x - y)**2)
    return loss

figure, [ax_left, ax_right] = plt.subplots(nrows=1, ncols=2, sharey=False)
figure.set_size_inches(w=12, h=7)

######################
# Draw the loss surface

# Make a 2D grid of possible phi0 and phi1 values
# meshgrid creates a grid of coordinates, expanding the two provided tensors
# See https://docs.pytorch.org/docs/stable/generated/torch.meshgrid.html#torch-meshgrid
xrange = torch.arange(0.0,2.0,0.02)
yrange = torch.arange(-1.0,1.0,0.02)
phi0_mesh, phi1_mesh = torch.meshgrid(xrange, yrange, indexing='ij')

def draw_contour(ax):
    ax.clear()
    # Make a 2D array for the losses
    all_losses = torch.zeros_like(phi1_mesh)
    # Run through each 2D combination of phi0, phi1 and compute loss
    for xidx in range(phi0_mesh.size(0)):
        for yidx in range(phi0_mesh.size(1)):
            indices = xidx, yidx
            all_losses[indices] = compute_loss(x,y, [phi0_mesh[indices], phi1_mesh[indices]])

    # Filled contours for the surface, then a sparser set of contour lines on top
    levels = 256
    ax.contourf(phi0_mesh, phi1_mesh, all_losses, levels)
    levels = 40
    ax.contour(phi0_mesh, phi1_mesh, all_losses, levels, colors=['#80808080'])
    # Flip the vertical axis so that phi_1 increases downward
    ax.set_ylim([1,-1])

    # Raw strings keep the LaTeX backslashes from being read as escapes
    ax.set_ylabel(r'$\phi_1$', fontsize=16)
    ax.set_xlabel(r'$\phi_0$', fontsize=16)
######################

# Default starting location
phis=[[0.1, 0.1]]

# Base output path; "_{step}.svg" is appended when saving, so no trailing underscore here
outbase="../figures/02_learning"

# If the user provided a different starting point, begin from there and change the plot filenames
if len(sys.argv) == 3:
    phis = [[float(sys.argv[1]), float(sys.argv[2])]]
    outbase=f"../figures/02_learning_{phis[0][0]}_{phis[0][1]}"

for step in range(15):
    phi = phis[-1]
    ax_right.clear()
    init_axes(ax_right)

    # Draw the points
    ax_right.scatter(x,y)

    # Draw the current line with a label rotated to match its slope
    x_line = torch.arange(0,2,0.01)
    y_line = f(x_line, phi[0], phi[1])
    ax_right.plot(x_line, y_line, 'g-', lw=2, marker='none')
    angle = torch.rad2deg(torch.atan(torch.tensor(float(phi[1])))).item()
    ax_right.text(1, phi[0]+phi[1], rf'$\phi_0={phi[0]:.2f}$,$\phi_1={phi[1]:.2f}$', ha='left', va='top',
                  transform_rotates_text=True, rotation=angle, rotation_mode='anchor', fontsize=14)

    draw_errors(ax_right, x, y, phi)

    loss = compute_loss(x,y,phi)
    ax_right.text(0.25, 1.8, f'Loss = {loss:.2f}', fontsize=16)

    # Left graph with the loss surface
    # Draw the surface and the path we have taken so far
    phi_zeros = [p[0] for p in phis]
    phi_ones = [p[1] for p in phis]
    draw_contour(ax_left)
    ax_left.plot(phi_zeros, phi_ones, color='green', marker='o', markersize=10, linestyle='solid')

    figure.savefig(f"{outbase}_{step}.svg", dpi=2*96)

    ##### This is the actual learning!!!!
    # Solve for the next value of phi
    y_hats = x * phi[1] + phi[0]
    error = y_hats - y
    # Calculate gradients w.r.t. phi
    # Average over the size of the dataset (element-wise product, not a dot product)
    dw = torch.mean(error * x)
    db = torch.mean(error)

    # Update
    learning_rate = 0.3
    if len(phis) > 5:
        learning_rate = 1
    next_phi1 = phi[1] - learning_rate * dw
    next_phi0 = phi[0] - learning_rate * db
    # Store plain floats so the path plots and labels stay simple
    phis.append([next_phi0.item(), next_phi1.item()])

# Animate with
# ffmpeg -framerate 3 -i '../figures/02_learning_%d.svg' -filter_complex "[0:v] split [a][b];[a] palettegen [p];[b][p] paletteuse" ../02_learning_animated.gif
```

---

## Different Start

---

## Gradient Descent

* This downhill walk is called **gradient descent**
  * It is the basis for learning with neural networks
* Why?
  * Closed-form solutions generally do not exist for more complicated models
  * The number of parameters makes any kind of random search impractical

---

## Some Details

* Notice that the distance covered with each step goes down
  * Each step is the learning rate times the gradient, and the gradient shrinks as the error grows smaller
* I actually increase the learning rate after the first few steps to compensate:

```python
    # Update
    learning_rate = 0.3
    if len(phis) > 5:
        learning_rate = 1
```
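
The shrinking steps can be checked numerically. The sketch below assumes gradients averaged over the dataset, as the comments in the lecture code describe, and holds the learning rate fixed at 0.3 so the effect is not masked. The helper name `gradients` and the list `step_lengths` are illustrative.

```python
import math

# Data points reused from this lecture's plotting script
x = [0.03, 0.19, 0.34, 0.46, 0.78, 0.81, 1.08, 1.18, 1.39, 1.60, 1.65, 1.90]
y = [0.67, 0.85, 1.05, 1.0, 1.40, 1.5, 1.3, 1.54, 1.55, 1.68, 1.73, 1.6]

def gradients(phi0, phi1):
    """Gradients of the halved mean squared error w.r.t. phi0 and phi1."""
    errors = [phi0 + phi1 * x_i - y_i for x_i, y_i in zip(x, y)]
    d_phi0 = sum(errors) / len(errors)
    d_phi1 = sum(e * x_i for e, x_i in zip(errors, x)) / len(errors)
    return d_phi0, d_phi1

phi0, phi1 = 0.1, 0.1    # default starting point from the script
learning_rate = 0.3      # held fixed for the whole run
step_lengths = []
for step in range(10):
    d_phi0, d_phi1 = gradients(phi0, phi1)
    # Distance moved in parameter space = learning rate * gradient magnitude
    step_lengths.append(learning_rate * math.hypot(d_phi0, d_phi1))
    phi0 -= learning_rate * d_phi0
    phi1 -= learning_rate * d_phi1

for step, length in enumerate(step_lengths):
    print(f"step {step}: moved {length:.4f}")
```

With the learning rate held constant, each printed distance should be smaller than the one before it, because the gradient magnitude falls as the line approaches the best fit.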

---

## More Details

* Notice also that the steps zig-zag back and forth
  * We'll see in a later class that this simple update rule is not the most efficient
* In fact, there have been quite a few algorithmic advances in the update step

---

## Other Loss Functions

* We need to train more than regression models
  * As long as our loss function is differentiable, we can use it
  * This opens up a wide variety of functions
* We can also minimize other terms in addition to the loss function
  * The L2 norm of the parameters, for example
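
As a sketch of swapping in a different differentiable loss, the example below fits the same line with a log-cosh loss instead of squared error. Log-cosh is only an illustrative choice here and is not a loss used in this lecture. Its derivative with respect to the error is `tanh(error)`, which is all gradient descent needs. The function names and the step count are assumptions.

```python
import math

# Data points reused from this lecture's plotting script
x = [0.03, 0.19, 0.34, 0.46, 0.78, 0.81, 1.08, 1.18, 1.39, 1.60, 1.65, 1.90]
y = [0.67, 0.85, 1.05, 1.0, 1.40, 1.5, 1.3, 1.54, 1.55, 1.68, 1.73, 1.6]

def log_cosh_loss(phi0, phi1):
    """Sum of log(cosh(error)) over the training points."""
    return sum(math.log(math.cosh(phi0 + phi1 * x_i - y_i)) for x_i, y_i in zip(x, y))

def log_cosh_gradients(phi0, phi1):
    """Gradients of the loss above, averaged over the dataset.

    The derivative of log(cosh(e)) with respect to e is tanh(e).
    """
    t = [math.tanh(phi0 + phi1 * x_i - y_i) for x_i, y_i in zip(x, y)]
    d_phi0 = sum(t) / len(t)
    d_phi1 = sum(t_i * x_i for t_i, x_i in zip(t, x)) / len(t)
    return d_phi0, d_phi1

phi0, phi1 = 0.1, 0.1
learning_rate = 0.3
start_loss = log_cosh_loss(phi0, phi1)
for _ in range(500):
    d_phi0, d_phi1 = log_cosh_gradients(phi0, phi1)
    phi0 -= learning_rate * d_phi0
    phi1 -= learning_rate * d_phi1

print(f"loss {start_loss:.3f} -> {log_cosh_loss(phi0, phi1):.3f}")
print(f"phi0 = {phi0:.3f}, phi1 = {phi1:.3f}")
```

Only the loss and its gradient changed. The update loop is the same downhill walk used for the least-squares loss.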

---

## Loss Vs Cost

* We've been using the least-squares loss as our only criterion
* If you are familiar with regression, you may remember regularization
  * E.g. ridge regression adds an L2 penalty
    * Just the sum of the squares of the parameters, scaled by a weight
* The **cost function** is the loss function and whatever else is being minimized (such as a regularization term)
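
A minimal sketch of the distinction, assuming a ridge-style cost built on the least-squares loss. The penalty weight `weight` and the function names are hypothetical. For simplicity the penalty here covers both parameters, although ridge regression often leaves the intercept unpenalized.

```python
# Data points reused from this lecture's plotting script
x = [0.03, 0.19, 0.34, 0.46, 0.78, 0.81, 1.08, 1.18, 1.39, 1.60, 1.65, 1.90]
y = [0.67, 0.85, 1.05, 1.0, 1.40, 1.5, 1.3, 1.54, 1.55, 1.68, 1.73, 1.6]

def least_squares_loss(phi):
    """Sum of squared errors of the line y = phi[0] + phi[1] * x."""
    return sum((phi[0] + phi[1] * x_i - y_i) ** 2 for x_i, y_i in zip(x, y))

def l2_penalty(phi):
    """Sum of the squared parameter values."""
    return sum(p ** 2 for p in phi)

def cost(phi, weight):
    """Loss plus a weighted L2 penalty; weight=0 recovers the plain loss."""
    return least_squares_loss(phi) + weight * l2_penalty(phi)

phi = (0.8, 0.5)  # arbitrary parameters, for illustration only
for weight in (0.0, 0.1, 1.0):
    print(f"weight={weight}: loss={least_squares_loss(phi):.3f}, cost={cost(phi, weight):.3f}")
```

The loss stays the same across the three lines of output while the cost grows with the weight, which is the sense in which the cost is the loss plus whatever else is being minimized.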

---

## Next Topics

* There are several topics to cover next
  * More complicated network structures
  * Methods to increase the representational power of neural networks
  * More about loss surfaces and loss minimization
* We can't do them all at once, so we'll first increase network complexity
  * After we see how that affects the loss surface, we'll talk about minimization