# CS 462 - Lecture 03

## Shallow Networks

Bernhard Firner

2026-01-29

---

## Review

* The simplest building block of a neural network is a line
  * $y = \phi_0 + \phi_1x$
  * This is called a **neuron**
  * $\phi$ is the set of all model parameters

</div>
<div class="col">
<img style="width: 100%" class="r-stretch" src="./figures/02_just_lines.svg" />

</div>
</div>

---

## Loss

* The loss is the sum of the squared vertical distances from every point to our line
* We train a network by minimizing the loss over $I$ training samples
* For a single neuron that is
  * $L[\phi] = \sum_{i=1}^I(\phi_0 + \phi_1 x_i - y_i)^2$

</div>
<div class="col">
<img style="width: 90%" class="r-stretch" src="./figures/02_line_error_2.svg" />

</div>
</div>
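
-v-

A minimal NumPy sketch of this loss. The sample points and the helper name `line_loss` are made up for illustration and are not part of the lecture code.

```python
import numpy as np

def line_loss(phi, x, y):
    """Sum of squared differences between the line phi[0] + phi[1]*x and y."""
    predictions = phi[0] + phi[1] * x
    return np.sum((predictions - y) ** 2)

# Made-up training points that lie exactly on y = 1 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x

perfect = line_loss([1.0, 2.0], x, y)   # the true parameters give zero loss
shifted = line_loss([0.0, 2.0], x, y)   # intercept off by one at all 4 points
print(perfect, shifted)
```

With the intercept off by one, each of the four points contributes a squared error of 1, so the loss is 4.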

---

## Loss Surface

---

## Different Visualization

---

## Training a Neural Network

* Training just means finding a minimum on the loss surface
  * Although not all surfaces are as simple as the previous one
* The solution is to "walk" from our current estimate of $\hat{\phi}$ to a better one
  * At each estimate, we check the local gradient and walk "downhill" towards a minimum
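
-v-

The "walk downhill" idea can be sketched for the single-neuron line model. This is a rough illustration under stated assumptions: the data, starting point, learning rate, and step count are made up, and the squared error is averaged rather than summed, which only rescales the step size.

```python
import numpy as np

# Made-up data on the line y = 1 + 2x (illustrative only)
x = np.linspace(0.0, 1.0, 20)
y = 1.0 + 2.0 * x

phi = np.array([0.0, 0.0])   # initial estimate of [phi_0, phi_1]
learning_rate = 0.5

for step in range(2000):
    error = phi[0] + phi[1] * x - y
    # Gradient of the mean squared error with respect to phi_0 and phi_1
    gradient = np.array([2 * np.mean(error), 2 * np.mean(error * x)])
    # Step against the gradient, i.e. downhill on the loss surface
    phi = phi - learning_rate * gradient

print(phi)
```

The estimate ends up very close to the true parameters `[1, 2]`.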

---

## More Complicated Example

* Let's say we wanted to match a sine function
  * $y = \sin(\phi_0 + \phi_1 x)$
* Previous approach still works
  * As long as our parameters begin "close enough" to the solution
  * More about this later

</div>
<div class="col">
<img style="width: 100%" class="r-stretch" src="./figures/03_learning_sin_29.svg" />

</div>
</div>

-v-

```python
#! /usr/bin/python3

import numpy as np
import matplotlib.pyplot as plt

# Keep things consistent by setting a random seed
np.random.seed(100)

# Points from a sin function + noise
true_params = [0.2, 0.4]

# Ten random x points for training, from 0 to 6
x = np.random.uniform(0, 6, 10)
# The y points, with a tiny bit of noise (up to 0.05 noise)
y = np.sin(true_params[0] + true_params[1]*x) + np.random.uniform(-0.05, 0.05, 10)

# Define the model
def f(x, params):
  return np.sin(params[0] + x*params[1])

# Get current axes and set them to only the desired range
ax = plt.gca()

def init_axes(ax):
    ax.set_xlim([0, 6])
    ax.set_ylim([-1.05, 1.05])
    ax.set_xlabel('Input, $x$', fontsize=16)
    ax.set_ylabel('Output, $y$', fontsize=16)

def draw_errors(ax, xs, ys, phi):
    for x, y in zip(xs, ys):
        prediction = f(x, phi)
        ax.plot([x, x], [y, prediction], markevery=[0], color='#070707', marker='o', linestyle='dashed')

# Function to calculate the loss
def compute_loss(x, y, phi):
  loss = np.sum((f(x, phi) - y)**2)
  return loss

figure, [ax_left, ax_right] = plt.subplots(nrows=1, ncols=2, sharey=False)
figure.set_size_inches(w=12, h=7)

######################
# Draw the loss surface

# Make a 2D grid of possible phi0 and phi1 values
phi0_mesh, phi1_mesh = np.meshgrid(np.arange(-1.0,1.0,0.02), np.arange(-1.0,1.0,0.02))

def draw_contour(ax):
    ax.clear()
    # Make a 2D array for the losses
    all_losses = np.zeros_like(phi1_mesh)
    # Run through each 2D combination of phi0, phi1 and compute loss
    for indices,temp in np.ndenumerate(phi1_mesh):
        all_losses[indices] = compute_loss(x,y, [phi0_mesh[indices], phi1_mesh[indices]])

    levels = 256
    ax.contourf(phi0_mesh, phi1_mesh, all_losses, levels)
    levels = 40
    ax.contour(phi0_mesh, phi1_mesh, all_losses, levels, colors=['#80808080'])
    ax.set_xlim([-1, 1])
    ax.set_ylim([-1, 1])

    ax.set_ylabel(r'$\phi_1$', fontsize=16)
    ax.set_xlabel(r'$\phi_0$', fontsize=16)
######################

phis=[[0.5, 0.5]]
for step in range(30):
    phi = phis[-1]
    ax_right.clear()
    init_axes(ax_right)

    # Draw the points
    ax_right.scatter(x,y)

    # Draw some lines with labels
    x_line = np.arange(0,6,0.01)
    y_line = f(x_line, phi)
    plt.plot(x_line, y_line, 'g-', lw=2, marker='none')
    plt.text(1, f(1, phi), rf'$\phi_0={phi[0]:.2f}$,$\phi_1={phi[1]:.2f}$', ha='left', va='top',
             transform_rotates_text=True, rotation_mode='anchor', fontsize=14)

    draw_errors(ax_right, x, y, phi)

    loss = compute_loss(x,y,phi)
    plt.text(0.25, 1.8, f'Loss = {loss:.2f}', fontsize=16)

    # Left graph with the loss surface
    # Draw the surface and the path we have taken so far
    phi_zeros = [p[0] for p in phis]
    phi_ones = [p[1] for p in phis]
    draw_contour(ax_left)
    ax_left.plot(phi_zeros, phi_ones, color='green', marker='o', markersize=10, linestyle='solid')

    plt.savefig(f"../figures/03_learning_sin_{step}.svg", dpi=2*96)

    # Solve for the next value of phi
    y_hats = f(x, phi)
    error = y_hats - y
    # Calculate gradients w.r.t. phi
    # The equation is y = sin(phi_0 + phi_1 * x)
    # dy/dphi_0 = cos(phi_0 + phi_1 * x)
    # dy/dphi_1 = x * cos(phi_0 + phi_1 * x)
    # Average over the size of the dataset (the constant factor of 2 from the
    # squared error is absorbed into the learning rate)
    dp1 = (1 / len(y)) * error @ (x * np.cos(phi[0] + phi[1] * x))
    dp0 = (1 / len(y)) * error @ np.cos(phi[0] + phi[1] * x)
    print(f"dp1 is {dp1}, dp0 is {dp0}")

    # Update
    learning_rate = 0.1
    next_phi1 = phi[1] - learning_rate * dp1
    next_phi0 = phi[0] - learning_rate * dp0
    phis.append([next_phi0, next_phi1])

# Animate with
# ffmpeg -framerate 3 -i '../figures/03_learning_sin_%d.svg' -filter_complex "[0:v] split [a][b];[a] palettegen [p];[b][p] paletteuse" ../03_learning_sin_animated.gif
```
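
-v-

A descent step depends on the derivatives of the model with respect to each parameter. By the chain rule, $\frac{\partial}{\partial\phi_0}\sin(\phi_0 + \phi_1 x) = \cos(\phi_0 + \phi_1 x)$ and $\frac{\partial}{\partial\phi_1}\sin(\phi_0 + \phi_1 x) = x\cos(\phi_0 + \phi_1 x)$. The sketch below compares those derivatives against a finite-difference estimate. The helper names, the noiseless data, and the step size `eps` are illustrative assumptions.

```python
import numpy as np

def f(x, phi):
    return np.sin(phi[0] + phi[1] * x)

def loss(phi, x, y):
    return np.sum((f(x, phi) - y) ** 2)

def analytic_gradient(phi, x, y):
    error = f(x, phi) - y
    inner = np.cos(phi[0] + phi[1] * x)
    # d/dphi_0 sin(phi_0 + phi_1 x) = cos(phi_0 + phi_1 x)
    # d/dphi_1 sin(phi_0 + phi_1 x) = x * cos(phi_0 + phi_1 x)
    return np.array([np.sum(2 * error * inner), np.sum(2 * error * x * inner)])

def numeric_gradient(phi, x, y, eps=1e-6):
    # Central differences: nudge one parameter at a time
    grad = np.zeros(2)
    for k in range(2):
        step = np.zeros(2)
        step[k] = eps
        grad[k] = (loss(phi + step, x, y) - loss(phi - step, x, y)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
x = rng.uniform(0, 6, 10)
y = np.sin(0.2 + 0.4 * x)        # noiseless targets for simplicity
phi = np.array([0.5, 0.5])

analytic = analytic_gradient(phi, x, y)
numeric = numeric_gradient(phi, x, y)
print(analytic, numeric)
```

A check like this is a cheap way to catch a mistake in a hand-derived gradient before using it in an update.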

---

## Descent

---

## Neural Networks

* The technique works, but we said that neural networks would be built with lines
  * $y = \phi_0 + \phi_1 x$
* We used a sin function in our code

```python
# Define the model
def f(x, params):
  return np.sin(params[0] + x*params[1])
```

* So how do we match a sin function with lines?

---

## The Solution

1. Piecewise linear fit
2. Lots of pieces
3. Activation functions

---

## Adding More Lines

* $y = f[x, \phi]$
* With two lines:
  * $y = \phi_0 + \phi_1 x + \phi_2 x$
* But that doesn't work
  * We could just replace $\phi_1$ and $\phi_2$ with a single parameter
  * So we also need a way to make the activations of those parameters different
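
-v-

A quick numerical check of why two plain lines are no better than one. The parameter values are arbitrary and chosen only for illustration.

```python
import numpy as np

x = np.linspace(-2.0, 2.0, 9)
phi_0, phi_1, phi_2 = 0.5, 1.5, -0.25   # arbitrary illustrative values

two_lines = phi_0 + phi_1 * x + phi_2 * x
one_line = phi_0 + (phi_1 + phi_2) * x   # a single slope does the same job

print(np.allclose(two_lines, one_line))
```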

---

## Activation

* Think about a piecewise linear fit
  * We want a line with a different slope at different values of x
* So we will use something called an **activation function** to turn parameters on and off

---

## Nonlinear Activation

* There are many options
  * We need the function to be *nonlinear*, meaning that its output is more than a constant multiple of its input
* Why?
  * We need to have some way to turn parameters of the network "off" and "on"
  * Or at least some way to make parameters act differently with different inputs
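
-v-

To see why the activation must be nonlinear, the sketch below feeds the same two-unit model an activation that only multiplies by a constant, and then ReLU. The parameter values and the helper names are assumptions made for illustration.

```python
import numpy as np

def shallow(x, a):
    # Two hidden units with arbitrary illustrative parameters
    return 0.5 + 2.0 * a(-3.0 + 1.0 * x) - 0.3 * a(-3.0 + 2.0 * x)

def linear_activation(z):
    return 3.0 * z          # just multiplies by a constant

def relu(z):
    return np.maximum(z, 0.0)

x = np.linspace(0.0, 6.0, 13)
# Second differences are all zero exactly when the points lie on one straight line
bend_linear = np.diff(shallow(x, linear_activation), n=2)
bend_relu = np.diff(shallow(x, relu), n=2)
print(np.allclose(bend_linear, 0.0), np.allclose(bend_relu, 0.0))
```

With the linear activation the output never bends, so the model is still a single line. With ReLU it bends.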

---

## Rectified Linear Unit

* One popular function is ReLU
  * Rectified Linear Unit
* The output is "off" (zero) when the input is below 0
* Otherwise it is "on" and equal to the input

</div>
<div class="col">
<img style="width: 90%" class="r-stretch" src="./figures/03_relu.svg" />

</div>
</div>
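
-v-

One possible NumPy version of ReLU, shown as a sketch. The input values are made up.

```python
import numpy as np

def relu(z):
    """Return z where z is positive and 0 elsewhere."""
    return np.maximum(z, 0.0)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
outputs = relu(z)
print(outputs)
```

Negative inputs come out as 0 and positive inputs pass through unchanged.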

---

## Some Alternatives

</div>
<div class="col">

</div>
</div>

---

## Some Alternatives

</div>
<div class="col">

</div>
</div>

---

## Using Activation Functions

* Let's call our activation function $a$
  * For now, assume that it is ReLU
    * It doesn't need to be, but it simplifies explanations
* $y = f[x, \phi]$
* We'll use our function to turn lines with different slopes off or on
  * $y = \phi_0 + \phi_1 a[\theta_{10} + \theta_{11} x] + \phi_2a[\theta_{20} + \theta_{21}x]$

---

## Example

* $y = \phi_0 + \phi_1 a[\theta_{10} + \theta_{11} x] + \phi_2a[\theta_{20} + \theta_{21}x]$
  * Notice that there are 7 variables
  * $\phi = [\phi_0, \phi_1, \phi_2, \theta_{10}, \theta_{11}, \theta_{20}, \theta_{21}]$
* The two terms controlled by $\phi_1$ and $\phi_2$ are the **hidden units**

---

## Example

* Let's set some values and see what we get
  * $\phi = [\phi_0, \phi_1, \phi_2, \theta_{10}, \theta_{11}, \theta_{20}, \theta_{21}]$
  * $\phi = [0, 2, -0.3, -3, 1, -3, 2]$
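
-v-

These values can be plugged into the equation directly. The function name `shallow_network` and the three test inputs are illustrative choices, and ReLU is assumed as the activation.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def shallow_network(x, phi):
    """phi = [phi_0, phi_1, phi_2, theta_10, theta_11, theta_20, theta_21]."""
    phi_0, phi_1, phi_2, theta_10, theta_11, theta_20, theta_21 = phi
    hidden_1 = relu(theta_10 + theta_11 * x)
    hidden_2 = relu(theta_20 + theta_21 * x)
    return phi_0 + phi_1 * hidden_1 + phi_2 * hidden_2

phi = [0, 2, -0.3, -3, 1, -3, 2]
x = np.array([0.0, 2.0, 4.0])
y = shallow_network(x, phi)
print(y)
```

At $x=0$ both hidden units are off and $y=0$. At $x=2$ only the second unit is on, giving $y=-0.3$. At $x=4$ both are on, giving $y = 2 - 1.5 = 0.5$.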

---

## Example

* $y = \phi_0 + \phi_1 a[\theta_{10} + \theta_{11} x] + \phi_2a[\theta_{20} + \theta_{21}x]$
* The values inside of the ReLUs control when those terms become non-zero
  * If $\theta_{10}$, the bias, is a large negative number, and $\theta_{11}$ is relatively small, the first term won't "activate" until $x$ is large
  * $\theta_{11}$ could also be negative, in which case there will be a positive output when x is very negative and that "deactivates" when x increases
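
-v-

The point where a ReLU term switches is where its input crosses zero, $x = -\theta_0 / \theta_1$. A small sketch follows. The helper name `elbow` and the third set of values are illustrative.

```python
def elbow(theta_0, theta_1):
    """Input x where theta_0 + theta_1 * x crosses zero (requires theta_1 != 0)."""
    return -theta_0 / theta_1

first = elbow(-3, 1)     # 3.0
second = elbow(-3, 2)    # 1.5
late = elbow(-30, 0.5)   # 60.0: large negative bias, small slope
print(first, second, late)
```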

---

## Example

* $y = \phi_0 + \phi_1 a[\theta_{10} + \theta_{11} x] + \phi_2a[\theta_{20} + \theta_{21}x]$
* The outer variables, $\phi_1$ and $\phi_2$, control the slope of each ReLU's output
* The important concept here is that we can create a line with 3 segments (2 elbows) using 2 hidden units

---

## Fitting

* Let's see how we can approximate a function with this
  * We can approximate the sin curve from earlier
  * But without using a sin function
* For simplicity, let's use a known equation: $\sin(0.5\pi + x)$

---

## First Step

* We only care about the range from 0 to 6
* So next, we'll set $\phi_0 = 1$, since that is where the sine curve begins at $x = 0$

---

## Second Step

* With $\phi_0 = 1$ the fit begins in the correct place
* Let's add a negative slope from the first term.
  * What slope do we use? Which variable is set?

---

## Second Step

* This is with $\phi_1 = \frac{-2}{\pi}$, $\theta_{10} = 0$, $\theta_{11} = 1$
* The solution is not unique
  * It works as long as $\theta_{11}*\phi_1 = \frac{-2}{\pi}$ and $\theta_{11} > 0$
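
-v-

The solution is not unique because ReLU scales with positive constants: multiplying the inner parameters by $c > 0$ and dividing the outer one by $c$ leaves the output unchanged. A sketch, with `c = 4` as an arbitrary illustrative choice:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

x = np.linspace(0.0, 6.0, 25)
original = (-2 / np.pi) * relu(0.0 + 1.0 * x)

c = 4.0  # any positive constant (illustrative choice)
rescaled = (-2 / np.pi / c) * relu(c * 0.0 + c * 1.0 * x)

print(np.allclose(original, rescaled))
```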

---

## Third Step

* Now we make the second elbow at $\pi$
  * We want the output to climb with slope $\frac{2}{\pi}$
  * So what value works for $\theta_{21}*\phi_2$?

---

## Third Step

* $\theta_{21}*\phi_2$ must cancel out $\theta_{11}*\phi_1 = \frac{-2}{\pi}$ and still leave a positive slope of $\frac{2}{\pi}$
  * That means $\theta_{21}*\phi_2$ should be $\frac{4}{\pi}$
  * Begin going up at $x=\pi$, so $\theta_{20}=-\pi$, $\theta_{21}=1$, $\phi_2 = \frac{4}{\pi}$
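
-v-

A quick check that a hand-built fit of this shape meets the target where the pieces join. It assumes $\phi_0 = 1$, a first term with slope $\frac{-2}{\pi}$ starting at $x=0$, and a second term with slope $\frac{4}{\pi}$ switching on at $x=\pi$. The helper name `hand_fit` is illustrative, and the last knot at $2\pi$ lies just past the plotted range.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def hand_fit(x):
    first = (-2 / np.pi) * relu(0.0 + 1.0 * x)       # falls from x = 0
    second = (4 / np.pi) * relu(-np.pi + 1.0 * x)    # switches on at x = pi
    return 1.0 + first + second

knots = np.array([0.0, np.pi, 2 * np.pi])
fit_at_knots = hand_fit(knots)
target_at_knots = np.sin(0.5 * np.pi + knots)
print(fit_at_knots, target_at_knots)
```

Both rows come out as 1, -1, 1 (up to rounding): the fit touches the curve at the knots and cuts straight across in between.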

-v-

```python
#! /usr/bin/python3

import numpy as np
import matplotlib.pyplot as plt

my_colors = [ '#2E2585', '#337538', '#5DA899', '#94CBEC' ]

# Activation function
def relu(x):
    y = x.copy()
    y[x < 0] = 0
    return y

# Set up x points
x = np.linspace(0,6,201)

def init_axes(ax):
    ax.set_xlim([0, 6])
    ax.set_ylim([-1.5, 1.5])
    ax.set_xlabel('Input, $x$', fontsize=16)
    ax.set_ylabel('Output, $y$', fontsize=16)

phi = np.array([0, 2, -0.3, -3, 1, -3, 2])
# $\phi = [\phi_0, \phi_1, \phi_2, \theta_{10}, \theta_{11}, \theta_{20}, \theta_{21}]$

figure, [ax_left, ax_right, ax_full] = plt.subplots(nrows=1, ncols=3, sharey=True)
figure.set_size_inches(w=14, h=6)

ax_left.set_title(r'$\phi_1ReLU[\theta_{10} + \theta_{11}x]$', fontsize=16)
ax_right.set_title(r'$\phi_2ReLU[\theta_{20} + \theta_{21}x]$', fontsize=16)
ax_full.set_title(r'$y = \phi_0 + \phi_1 a[\theta_{10} + \theta_{11} x] + \phi_2a[\theta_{20} + \theta_{21}x]$', fontsize=14)

def solveAndPlot(ax_left, ax_right, ax_full, x, phi):
    for ax in [ax_left, ax_right, ax_full]:
        ax.clear()
        init_axes(ax)

    y1 = phi[1] * relu(phi[3] + phi[4] * x)
    y2 = phi[2] * relu(phi[5] + phi[6] * x)
    y3 = phi[0] + y1 + y2
    sin_y = np.sin(0.5*np.pi + x)

    ax_left.plot(x, y1, linestyle='solid', marker=None, color=my_colors[0], label=f'{phi[1]:.2f}ReLU({phi[3]:.2f} + {phi[4]:.2f}x)', linewidth=3)
    ax_left.legend(loc='lower left', fontsize=16)
    ax_right.plot(x, y2, linestyle='solid', marker=None, color=my_colors[1], label=f'{phi[2]:.2f}ReLU({phi[5]:.2f} + {phi[6]:.2f}x)', linewidth=3)
    ax_right.legend(loc='upper left', fontsize=16)
    ax_full.plot(x, y3, linestyle='solid', marker=None, label="fit", color=my_colors[2], linewidth=3)
    ax_full.plot(x, sin_y, linestyle='solid', marker=None, label="sin", color=my_colors[3], linewidth=3)
    ax_full.legend(loc='best', fontsize=16)

# First version
solveAndPlot(ax_left, ax_right, ax_full, x, phi)
plt.savefig(f"../figures/03_fit_network_a.svg", dpi=2*96)

# Second version
phi[0] = 1
solveAndPlot(ax_left, ax_right, ax_full, x, phi)
plt.savefig(f"../figures/03_fit_network_b.svg", dpi=2*96)

# Third version
# Go down with a slope of -2/pi beginning at 0
phi[1] = -2/np.pi
phi[3] = 0
phi[4] = 1
solveAndPlot(ax_left, ax_right, ax_full, x, phi)
plt.savefig(f"../figures/03_fit_network_c.svg", dpi=2*96)

# Fourth version
phi[2] = 4/np.pi
phi[5] = -np.pi
phi[6] = 1
solveAndPlot(ax_left, ax_right, ax_full, x, phi)
plt.savefig(f"../figures/03_fit_network_d.svg", dpi=2*96)
```

---

## Observations

* Doing this by hand isn't fun
* But this is easy to optimize
  * Everything in the equation has an easy derivative
  * ReLU(x) has derivative 0 if $x<0$, and 1 if $x>0$
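
-v-

The slope of ReLU is 0 for negative inputs and 1 for positive inputs. The sketch below compares that rule with a finite-difference estimate. The helper name `relu_derivative`, the choice of 0 at exactly $x=0$, and the sample inputs are assumptions for illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_derivative(z):
    """1 where z > 0 and 0 where z < 0 (0 is used at z == 0 by convention here)."""
    return (z > 0).astype(float)

z = np.array([-2.0, -0.5, 0.5, 2.0])
eps = 1e-6
# Central-difference estimate of the slope at each input
numeric = (relu(z + eps) - relu(z - eps)) / (2 * eps)
analytic = relu_derivative(z)
print(analytic, numeric)
```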

---

## Hidden Units

* The inner terms are called **hidden units**
  * Their details are hidden from an observer who sees the network as $y = f[x]$
* Adding more hidden units increases the model's ability to approximate a curve

---

## Universal Approximation Theorem

* There is no bound on the complexity we can model with a shallow NN given an arbitrary number of hidden units
  * As long as the curve being modelled is continuous
  * The fit can be made arbitrarily precise by adding more hidden units
* Let's revisit that sin curve
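
-v-

One way to get a feel for this claim without any training: build a shallow ReLU network by hand that passes through the target at evenly spaced knots, then add knots. This is an illustrative construction, not a proof and not how a network is normally fit. The function names and the knot counts are assumptions.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def target(x):
    return np.sin(0.5 * np.pi + x)

def relu_interpolant(x, num_hidden):
    """Shallow ReLU network that matches the target at evenly spaced knots."""
    knots = np.linspace(0.0, 6.0, num_hidden + 1)
    slopes = np.diff(target(knots)) / np.diff(knots)
    # Each hidden unit switches on at its knot and changes the slope there
    slope_changes = np.diff(slopes, prepend=0.0)
    y = np.full_like(x, target(knots[0]))
    for knot, change in zip(knots[:-1], slope_changes):
        y += change * relu(x - knot)
    return y

x = np.linspace(0.0, 6.0, 1001)
max_errors = {}
for num_hidden in [4, 16, 64]:
    fit = relu_interpolant(x, num_hidden)
    max_errors[num_hidden] = np.max(np.abs(fit - target(x)))
    print(num_hidden, max_errors[num_hidden])
```

The worst-case error shrinks as hidden units are added, which is the behavior the theorem describes.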

---

## Training a Model

* We're going to use more hidden units
  * But we're going to let PyTorch train the model this time

---

## Declaring the Model

```python
    # Make a model
    net = torch.nn.Sequential(
            torch.nn.Linear(1, num_hidden),
            torch.nn.ReLU(),
            torch.nn.Linear(num_hidden, 1))
```
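
-v-

For intuition, the same stack of operations can be written out in plain NumPy: a linear map to `num_hidden` values, a ReLU, then a linear map to one output. This is a sketch with random illustrative parameters. The array names `W1`, `b1`, `W2`, `b2` are hypothetical, and PyTorch's internal storage layout differs (it keeps each weight matrix as outputs by inputs).

```python
import numpy as np

rng = np.random.default_rng(0)
num_hidden = 5

# Illustrative random parameters; a trained network would learn these
W1 = rng.normal(size=(1, num_hidden))   # input -> hidden weights
b1 = rng.normal(size=num_hidden)        # one bias per hidden unit
W2 = rng.normal(size=(num_hidden, 1))   # hidden -> output weights
b2 = rng.normal(size=1)                 # output bias

def forward(x):
    hidden = np.maximum(x @ W1 + b1, 0.0)   # linear layer, then ReLU
    return hidden @ W2 + b2                 # final linear layer

x = np.linspace(0, 6, 201).reshape(201, 1)
y_hat = forward(x)
print(y_hat.shape)
```

Each column of `hidden` is one hidden unit, so this is the earlier sum of weighted ReLU terms written as matrix products.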

-v-

## Parameter Updates

* We don't want to get ahead of ourselves, but this is the parameter update

```python
        # This is the forward pass
        y_hat = net(xt)
        # This computes the loss
        loss = loss_fn(y_hat, yt)
        # This computes the gradients (derivatives of the loss w.r.t. each parameter)
        loss.backward()
        # Update the parameters
        for param in net.parameters():
            param.data -= param.grad * learning_rate
            # Zero the gradient before the next pass
            param.grad = None
```

-v-

## The Code

```python
#! /usr/bin/python3

import numpy as np
import matplotlib.pyplot as plt
import torch

my_colors = [ '#2E2585', '#337538', '#5DA899', '#94CBEC' ]

# Activation function
def relu(x):
    y = x.copy()
    y[x < 0] = 0
    return y

# Set up x points
x = np.linspace(0,6,201)

def init_axes(ax):
    ax.set_xlim([0, 6])
    ax.set_ylim([-1.5, 1.5])
    ax.set_xlabel('Input, $x$', fontsize=16)
    ax.set_ylabel('Output, $y$', fontsize=16)

figure, [ax_left, ax_full] = plt.subplots(nrows=1, ncols=2, sharey=True)
figure.set_size_inches(w=12, h=7)

def solveAndPlot(ax_left, ax_full, x, num_hidden):
    for ax in [ax_left, ax_full]:
        ax.clear()
        init_axes(ax)

    ax_left.set_title('Hidden Units', fontsize=16)
    ax_full.set_title('NN Output', fontsize=16)
    sin_y = np.sin(0.5*np.pi + x)

    # Make a model
    net = torch.nn.Sequential(
            torch.nn.Linear(1, num_hidden),
            torch.nn.ReLU(),
            torch.nn.Linear(num_hidden, 1))
    # The output isn't normalized, so we'll initialize the bias so that more of
    # the lines are used
    torch.nn.init.uniform_(net[0].bias, -5, 5)

    learning_rate = 0.01

    # Train the model
    # Convert the data to torch tensors first
    xt = torch.tensor(x.reshape(201, 1)).float()
    yt = torch.tensor(sin_y.reshape(201, 1)).float()
    loss_fn = torch.nn.MSELoss()
    for epoch in range(int(500 * np.sqrt(num_hidden))):
        # Ensure that there are no gradients stored
        net.zero_grad()
        # This is the forward pass
        y_hat = net(xt)
        # This computes the loss
        loss = loss_fn(y_hat, yt)
        # This computes the gradients (derivatives of the loss w.r.t. each parameter)
        loss.backward()
        # Update the parameters
        for param in net.parameters():
            param.data -= param.grad * learning_rate
            # Zero the gradient before the next pass
            param.grad = None

    # Get a final estimate of y_hat and all of the intermediate hidden layer outputs
    with torch.no_grad():
        hidden_outputs = []
        halfway = net[1](net[0](xt))
        y_hat = net[2](halfway)

        for idx in range(num_hidden):
            ax_left.plot(xt[:,0], net[2].weight[0,idx]*halfway[:,idx], linestyle='solid', marker=None, color=my_colors[0], linewidth=3)
    ax_full.plot(xt, y_hat, linestyle='solid', marker=None, label="fit", color=my_colors[2], linewidth=3)
    ax_full.plot(x, sin_y, linestyle='solid', marker=None, label="sin", color=my_colors[3], linewidth=3)
    ax_full.legend(loc='best', fontsize=16)

for pieces in [5, 10, 25, 100]:
    solveAndPlot(ax_left, ax_full, x, pieces)
    plt.savefig(f"../figures/03_fit_pytorch_{pieces}.svg", dpi=2*96)
```

---

## 3 Lines

---

## 10 Lines

---

## 25 Lines

---

## 100 Lines

---

## Discussion

* We'll go over the code in more detail later
* For now, let's notice a couple of things
  * First, more lines can make a smoother fit
  * Second, results aren't optimal
    * And, if you run this multiple times, they are inconsistent as well
* It turns out that training a neural network isn't trivial

---

## Representative Power

* How many hidden layers should we use?
  * We need to discuss representative power
* Important to do before digging into the details of training
  * After all, making our network too large or complicated may make training more difficult as well

---

## Layer Type

* The hidden layer we have seen is an example of a **linear** or **fully connected** layer
  * Linear because it responds linearly to its input
    * $y = mx + b$
  * Fully connected because every line in the layer uses each input
* Notice that the first layer has a single input and makes more outputs
  * The last layer takes all of those inputs and makes one output

---

## Inputs and Outputs

* We could just as easily make a layer with multiple inputs *and* multiple outputs
  * Then a hidden layer could feed into a second hidden layer
* A network with one hidden layer is called a shallow neural network
* For most people, anything with two or more hidden layers is a deep neural network

---

## Capacity

* Adding more lines increases the model's **capacity**
* Why do we say capacity?
  * The network is **memorizing** the slope of the target function in small segments
  * The more segments it has, the more capacity it has to memorize parts of the overall function

---

## 2D Capacity

* In two dimensions, the elbows in the graph are the model's capacity
  * Each additional neuron defines a new region and we can model another part of a curve

---

## Network Size

* What does it cost to add capacity?
* Each input to a neuron requires a weight, and each neuron has a bias value
  * $w$ hidden units with $n$ inputs require $w(n + 1)$ values
  * The output layer uses 1 bias value plus $w$ weights

---

## Shallow Network Costs

* With a single input and a single hidden layer, each new elbow "costs" 3 additional values
* With $w$ hidden units there are up to $w+1$ linear regions in the network output, at a cost of $w(n+1) + (1 + w)$ values
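
-v-

The cost formula is easy to tabulate. The function name `shallow_parameter_count` and the list of widths are illustrative.

```python
def shallow_parameter_count(num_inputs, num_hidden):
    """Weights and biases in a one-hidden-layer network with a single output."""
    hidden_layer = num_hidden * (num_inputs + 1)   # n weights + 1 bias per unit
    output_layer = num_hidden + 1                  # w weights + 1 bias
    return hidden_layer + output_layer

for w in [2, 5, 10, 25, 100]:
    print(w, shallow_parameter_count(1, w))
```

With one input and two hidden units this gives 7, matching the seven parameters in the earlier two-unit example, and each extra hidden unit adds 3.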

---

## Network Efficiency

* Each parameter requires training, so we want to get the most out of them
* This leads to an important question:
  * Is it better to make networks wider, or deeper?

---

## Wider Vs Deeper

* What is a deeper network?
  * Anything with more than one hidden layer
* We'll get into their mechanics, and advantages, next class