# CS 462 - Lecture 03

## Shallow Networks

Bernhard Firner

2026-01-29

---

## Review

* The simplest building block of a neural network is a line
* $y = \phi_0 + \phi_1x$
* This is called a **neuron**
* $\phi$ is the set of all model parameters
---

## Loss

* The loss is the sum of the squared distances from every point to our line
* We train a network by minimizing the loss over $I$ training samples
* For a single neuron that is
* $L[\phi] = \sum_{i=1}^I(\phi_0 + \phi_1 x_i - y_i)^2$
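-v-

A minimal sketch of the loss formula for a single neuron, assuming NumPy arrays for the training samples. The function name `line_loss` and the four sample points are illustrative and not part of the lecture code. With points that lie exactly on $y = 1 + 2x$, the true parameters give a loss of 0, and shifting the intercept by 1 gives a loss of 4 (an error of 1 at each of the 4 points).

```python
import numpy as np

def line_loss(phi, x, y):
    """Sum of squared errors for the single neuron y = phi[0] + phi[1] * x."""
    predictions = phi[0] + phi[1] * x
    return np.sum((predictions - y) ** 2)

# Hypothetical training samples that lie exactly on y = 1 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x

print(line_loss([1.0, 2.0], x, y))  # exact fit
print(line_loss([0.0, 2.0], x, y))  # intercept off by 1 at every point
```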
---

## Loss Surface

---

## Different Visualization
---

## Training a Neural Network

* Training just means finding a minimum on the loss surface
* Although not all surfaces are as simple as the previous one
* The solution is to "walk" from our current estimate of $\hat{\phi}$ to a better one
* At each estimate, we check the local gradient and walk "downhill" towards a minimum

---

## More Complicated Example

* Let's say we wanted to match a sin function
* $y = \sin(\phi_0 + \phi_1 x)$
* Previous approach still works
* As long as our parameters begin "close enough" to the solution
* More about this later
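-v-

A small sketch, under stated assumptions, of the gradient used to walk downhill on the sin model. By the chain rule, with $e_i = \sin(\phi_0 + \phi_1 x_i) - y_i$, the squared-error loss has $\partial L / \partial \phi_0 = 2\sum_i e_i \cos(\phi_0 + \phi_1 x_i)$ and $\partial L / \partial \phi_1 = 2\sum_i e_i x_i \cos(\phi_0 + \phi_1 x_i)$. The helper names `analytic_grad` and `numeric_grad` and the noise-free sample data are assumptions for illustration. The two printed gradients should agree closely.

```python
import numpy as np

def f(x, phi):
    return np.sin(phi[0] + phi[1] * x)

def loss(x, y, phi):
    return np.sum((f(x, phi) - y) ** 2)

def analytic_grad(x, y, phi):
    """Gradient of the squared-error loss from the chain rule."""
    error = f(x, phi) - y
    c = np.cos(phi[0] + phi[1] * x)
    return np.array([2 * np.sum(error * c), 2 * np.sum(error * x * c)])

def numeric_grad(x, y, phi, eps=1e-6):
    """Central finite differences, used only as a sanity check."""
    grad = np.zeros(2)
    for k in range(2):
        step = np.zeros(2)
        step[k] = eps
        grad[k] = (loss(x, y, phi + step) - loss(x, y, phi - step)) / (2 * eps)
    return grad

# Hypothetical noise-free samples from sin(0.2 + 0.4x)
x = np.linspace(0, 6, 10)
y = np.sin(0.2 + 0.4 * x)
phi = np.array([0.5, 0.5])

print(analytic_grad(x, y, phi))
print(numeric_grad(x, y, phi))
```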
-v-

```python
#! /usr/bin/python3

import numpy as np
import matplotlib.pyplot as plt

# Keep things consistent by setting a random seed
np.random.seed(100)

# Points from a sin function + noise
true_params = [0.2, 0.4]
# Ten random x points for training, from 0 to 6
x = np.random.uniform(0, 6, 10)
# The y points, with a tiny bit of noise (up to 0.05 noise)
y = np.sin(true_params[0] + true_params[1]*x) + np.random.uniform(-0.05, 0.05, 10)

# Define the model
def f(x, params):
    return np.sin(params[0] + x*params[1])

# Get current axes and set them to only the desired range
ax = plt.gca()
def init_axes(ax):
    ax.set_xlim([0, 6])
    ax.set_ylim([-1.05, 1.05])
    ax.set_xlabel('Input, $x$', fontsize=16)
    ax.set_ylabel('Output, $y$', fontsize=16)

def draw_errors(ax, xs, ys, phi):
    for x, y in zip(xs, ys):
        prediction = f(x, phi)
        ax.plot([x, x], [y, prediction], markevery=[0], color='#070707', marker='o', linestyle='dashed')

# Function to calculate the loss
def compute_loss(x, y, phi):
    loss = np.sum((f(x, phi) - y)**2)
    return loss

figure, [ax_left, ax_right] = plt.subplots(nrows=1, ncols=2, sharey=False)
figure.set_size_inches(w=12, h=7)

######################
# Draw the loss surface
# Make a 2D grid of possible phi0 and phi1 values
phi0_mesh, phi1_mesh = np.meshgrid(np.arange(-1.0,1.0,0.02), np.arange(-1.0,1.0,0.02))

def draw_contour(ax):
    ax.clear()
    # Make a 2D array for the losses
    all_losses = np.zeros_like(phi1_mesh)
    # Run through each 2D combination of phi0, phi1 and compute loss
    for indices,temp in np.ndenumerate(phi1_mesh):
        all_losses[indices] = compute_loss(x,y, [phi0_mesh[indices], phi1_mesh[indices]])

    levels = 256
    ax.contourf(phi0_mesh, phi1_mesh, all_losses, levels)
    levels = 40
    ax.contour(phi0_mesh, phi1_mesh, all_losses, levels, colors=['#80808080'])
    ax.set_xlim([-1, 1])
    ax.set_ylim([-1, 1])
    ax.set_ylabel(r'$\phi_1$', fontsize=16)
    ax.set_xlabel(r'$\phi_0$', fontsize=16)

######################

phis=[[0.5, 0.5]]
for step in range(30):
    phi = phis[-1]
    ax_right.clear()
    init_axes(ax_right)

    # Draw the points
    ax_right.scatter(x,y)

    # Draw some lines with labels
    x_line = np.arange(0,6,0.01)
    y_line = f(x_line, phi)
    plt.plot(x_line, y_line, 'g-', lw=2, marker='none')
    plt.text(1, f(1, phi), rf'$\phi_0={phi[0]:.2f}$,$\phi_1={phi[1]:.2f}$', ha='left', va='top', transform_rotates_text=True, rotation_mode='anchor', fontsize=14)
    draw_errors(ax_right, x, y, phi)
    loss = compute_loss(x,y,phi)
    plt.text(0.25, 1.8, f'Loss = {loss:.2f}', fontsize=16)

    # Left graph with the loss surface
    # Draw the surface and the path we have taken so far
    phi_zeros = [p[0] for p in phis]
    phi_ones = [p[1] for p in phis]
    draw_contour(ax_left)
    ax_left.plot(phi_zeros, phi_ones, color='green', marker='o', markersize=10, linestyle='solid')

    plt.savefig(f"../figures/03_learning_sin_{step}.svg", dpi=2*96)

    # Solve for the next value of phi
    y_hats = f(x, phi)
    error = y_hats - y
    # Calculate gradients w.r.t. phi
    # The equation is y = sin(phi_0 + phi_1 * x)
    # dy/dphi_0 = cos(phi_0 + phi_1 * x)
    # dy/dphi_1 = x * cos(phi_0 + phi_1 * x)
    # Average over the size of the dataset
    # (the constant factor of 2 from the squared error is folded into the learning rate)
    dp1 = (1 / len(y)) * error @ (x * np.cos(phi[0] + phi[1] * x))
    dp0 = (1 / len(y)) * error @ np.cos(phi[0] + phi[1] * x)
    print(f"dp1 is {dp1}, dp0 is {dp0}")
    # Update
    learning_rate = 0.1
    next_phi1 = phi[1] - learning_rate * dp1
    next_phi0 = phi[0] - learning_rate * dp0
    phis.append([next_phi0, next_phi1])

# Animate with
# ffmpeg -framerate 3 -i '../figures/03_learning_sin_%d.svg' -filter_complex "[0:v] split [a][b];[a] palettegen [p];[b][p] paletteuse" ../03_learning_sin_animated.gif
```

---

## Descent
---

## Neural Networks

* The technique works, but we said that neural networks would be built with lines
* $y = \phi_0 + \phi_1 x$
* We used a sin function in our code

```python
# Define the model
def f(x, params):
    return np.sin(params[0] + x*params[1])
```

* So how do we match a sin function with lines?

---

## The Solution

1. Piecewise linear fit
2. Lots of pieces
3. Activation functions

---

## Adding More Lines

* $y = f[x, \phi]$
* With two lines:
* $y = \phi_0 + \phi_1 x + \phi_2 x$
* But that doesn't work
* We could just replace $\phi_1$ and $\phi_2$ with a single parameter, since $\phi_1 x + \phi_2 x = (\phi_1 + \phi_2)x$
* So we also need a way to make the activations of those parameters different

---

## Activation

* Think about a piecewise linear fit
* We want a line with a different slope at different values of $x$
* So we will use something called an **activation function** to turn parameters on and off

---

## Nonlinear Activation

* There are many options
* We need the function to be *nonlinear*, meaning that it is more than just multiplying the input by a constant
* Why?
* We need to have some way to turn parameters of the network "off" and "on"
* Or at least some way to make parameters act differently with different inputs

---

## Rectified Linear Unit

* One popular function is ReLU
* Rectified Linear Unit
* Output is "off" (zero) when the input is below 0
* "On," equal to the input, otherwise
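-v-

A minimal sketch of ReLU in NumPy, assuming array or scalar inputs (the sample values are illustrative). Negative entries become 0 and the rest pass through unchanged. The last two lines show why it counts as nonlinear: a purely linear function $g$ would satisfy $g(a + b) = g(a) + g(b)$, and ReLU does not.

```python
import numpy as np

def relu(z):
    """Rectified Linear Unit: 0 for negative inputs, the input itself otherwise."""
    return np.maximum(z, 0.0)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))

# Nonlinearity check with a = -1 and b = 2
print(relu(-1.0 + 2.0))        # ReLU of the sum
print(relu(-1.0) + relu(2.0))  # sum of the ReLUs, which is different
```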
---

## Some Alternatives

---

## Some Alternatives
---

## Using Activation Functions

* Let's call our activation function $a$
* For now, assume that it is ReLU
* It doesn't need to be, but it simplifies explanations
* $y = f[x, \phi]$
* We'll use our function to turn lines with different slopes off or on
* $y = \phi_0 + \phi_1 a[\theta_{10} + \theta_{11} x] + \phi_2a[\theta_{20} + \theta_{21}x]$

---

## Example

* $y = \phi_0 + \phi_1 a[\theta_{10} + \theta_{11} x] + \phi_2a[\theta_{20} + \theta_{21}x]$
* Notice that there are 7 variables
* $\phi = [\phi_0, \phi_1, \phi_2, \theta_{10}, \theta_{11}, \theta_{20}, \theta_{21}]$
* The two terms controlled by $\phi_1$ and $\phi_2$ are the **hidden units**

---

## Example

* Let's set some values and see what we get
* $\phi = [\phi_0, \phi_1, \phi_2, \theta_{10}, \theta_{11}, \theta_{20}, \theta_{21}]$
* $\phi = [0, 2, -0.3, -3, 1, -3, 2]$
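-v-

A short sketch that evaluates the example values by hand-coding the two-hidden-unit equation with ReLU as the activation. The function name `shallow_net` and the chosen test inputs are assumptions for illustration. Each hidden unit switches on where its ReLU input crosses zero, which is at $x = -\theta_{k0} / \theta_{k1}$.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def shallow_net(x, phi):
    """y = phi0 + phi1*a[th10 + th11*x] + phi2*a[th20 + th21*x] with a = ReLU."""
    phi0, phi1, phi2, th10, th11, th20, th21 = phi
    h1 = relu(th10 + th11 * x)
    h2 = relu(th20 + th21 * x)
    return phi0 + phi1 * h1 + phi2 * h2

phi = [0, 2, -0.3, -3, 1, -3, 2]

# Where each ReLU input crosses zero
elbow_1 = -phi[3] / phi[4]  # first hidden unit turns on at x = 3
elbow_2 = -phi[5] / phi[6]  # second hidden unit turns on at x = 1.5
print(elbow_1, elbow_2)

# Output is flat until x = 1.5, slopes down until x = 3, then slopes up
for x in [0.0, 1.5, 3.0, 4.0]:
    print(x, shallow_net(x, phi))
```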
---

## Example

* $y = \phi_0 + \phi_1 a[\theta_{10} + \theta_{11} x] + \phi_2a[\theta_{20} + \theta_{21}x]$
* The values inside of the ReLUs control when those terms become non-zero
* If $\theta_{10}$, the bias, is a large negative number, and $\theta_{11}$ is relatively small, the first term won't "activate" until $x$ is large
* $\theta_{11}$ could also be negative, in which case there will be a positive output when $x$ is very negative and that "deactivates" when $x$ increases

---

## Example

* $y = \phi_0 + \phi_1 a[\theta_{10} + \theta_{11} x] + \phi_2a[\theta_{20} + \theta_{21}x]$
* The outer variables, $\phi_1$ and $\phi_2$, control the slope of each ReLU's output
* The important concept here is that we can create a line with 2 elbows, and so up to 3 linear segments, using 2 hidden units

---

## Fitting

* Let's see how we can approximate a function with this
* We can approximate the sin curve from earlier
* But without using a sin function
* For simplicity, let's use a known equation: $\sin(0.5\pi + x)$

---

## First Step

* We only care about the range from 0 to 6
* So next, we'll set $\phi_0 = 1$, since that is where the sin begins at $x = 0$
---

## Second Step

* With $\phi_0 = 1$ the fit begins in the correct place
* Let's add a negative slope from the first term.
* What slope do we use? Which variable is set?
---

## Second Step

* This is with $\phi_1 = \frac{-2}{\pi}$, $\theta_{10} = 0$, $\theta_{11} = 1$
* The solution is not unique
* It works as long as $\theta_{11} \cdot \phi_1 = \frac{-2}{\pi}$ and $\theta_{11} > 0$

---

## Third Step

* Now we make the second elbow at $\pi$
* We want the output to climb with slope $\frac{2}{\pi}$
* So what value works for $\theta_{21} \cdot \phi_2$?
---

## Third Step

* $\theta_{21} \cdot \phi_2$ must cancel out $\theta_{11} \cdot \phi_1$ and still leave a positive slope
* That means $\theta_{21} \cdot \phi_2$ should be $\frac{4}{\pi}$, so the total slope is $\frac{-2}{\pi} + \frac{4}{\pi} = \frac{2}{\pi}$
* Begin going up at $x=\pi$, so $\theta_{20}=-\pi$, $\theta_{21}=1$
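-v-

A quick numerical check of the hand-built fit, assuming ReLU activations and the values chosen in the three steps. The function name `hand_fit` is an illustrative assumption. The fit should hit the target exactly at $x = 0$ and $x = \pi$, and stay within a modest error in between because it uses straight segments.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def hand_fit(x):
    """Two-hidden-unit fit to sin(0.5*pi + x) using the hand-picked values."""
    phi0, phi1, phi2 = 1.0, -2 / np.pi, 4 / np.pi
    th10, th11 = 0.0, 1.0        # first elbow at x = 0
    th20, th21 = -np.pi, 1.0     # second elbow at x = pi
    return phi0 + phi1 * relu(th10 + th11 * x) + phi2 * relu(th20 + th21 * x)

xs = np.linspace(0, 6, 201)
target = np.sin(0.5 * np.pi + xs)
max_error = np.max(np.abs(hand_fit(xs) - target))

print(hand_fit(0.0), hand_fit(np.pi))  # matches the target at both elbows
print(max_error)                       # worst-case gap over the range 0 to 6
```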
-v-

```python
#! /usr/bin/python3

import numpy as np
import matplotlib.pyplot as plt

my_colors = [ '#2E2585', '#337538', '#5DA899', '#94CBEC' ]

# Activation function
def relu(x):
    y = x.copy()
    y[x < 0] = 0
    return y

# Set up x points
x = np.linspace(0,6,201)

def init_axes(ax):
    ax.set_xlim([0, 6])
    ax.set_ylim([-1.5, 1.5])
    ax.set_xlabel('Input, $x$', fontsize=16)
    ax.set_ylabel('Output, $y$', fontsize=16)

phi = np.array([0, 2, -0.3, -3, 1, -3, 2])
# $\phi = [\phi_0, \phi_1, \phi_2, \theta_{10}, \theta_{11}, \theta_{20}, \theta_{21}]$

figure, [ax_left, ax_right, ax_full] = plt.subplots(nrows=1, ncols=3, sharey=True)
figure.set_size_inches(w=14, h=6)

def solveAndPlot(ax_left, ax_right, ax_full, x, phi):
    for ax in [ax_left, ax_right, ax_full]:
        ax.clear()
        init_axes(ax)
    # Set the titles after clearing, since ax.clear() also removes titles
    ax_left.set_title(r'$\phi_1ReLU[\theta_{10} + \theta_{11}x]$', fontsize=16)
    ax_right.set_title(r'$\phi_2ReLU[\theta_{20} + \theta_{21}x]$', fontsize=16)
    ax_full.set_title(r'$y = \phi_0 + \phi_1 a[\theta_{10} + \theta_{11} x] + \phi_2a[\theta_{20} + \theta_{21}x]$', fontsize=14)

    y1 = phi[1] * relu(phi[3] + phi[4] * x)
    y2 = phi[2] * relu(phi[5] + phi[6] * x)
    y3 = phi[0] + y1 + y2
    sin_y = np.sin(0.5*np.pi + x)

    ax_left.plot(x, y1, linestyle='solid', marker=None, color=my_colors[0], label=f'{phi[1]:.2f}ReLU({phi[3]:.2f} + {phi[4]:.2f}x)', linewidth=3)
    ax_left.legend(loc='lower left', fontsize=16)
    ax_right.plot(x, y2, linestyle='solid', marker=None, color=my_colors[1], label=f'{phi[2]:.2f}ReLU({phi[5]:.2f} + {phi[6]:.2f}x)', linewidth=3)
    ax_right.legend(loc='upper left', fontsize=16)
    ax_full.plot(x, y3, linestyle='solid', marker=None, label="fit", color=my_colors[2], linewidth=3)
    ax_full.plot(x, sin_y, linestyle='solid', marker=None, label="sin", color=my_colors[3], linewidth=3)
    ax_full.legend(loc='best', fontsize=16)

# First version
solveAndPlot(ax_left, ax_right, ax_full, x, phi)
plt.savefig(f"../figures/03_fit_network_a.svg", dpi=2*96)

# Second version
phi[0] = 1
solveAndPlot(ax_left, ax_right, ax_full, x, phi)
plt.savefig(f"../figures/03_fit_network_b.svg", dpi=2*96)

# Third version
# Go down with a slope of -2/pi beginning at 0
phi[1] = -2/np.pi
phi[3] = 0
phi[4] = 1
solveAndPlot(ax_left, ax_right, ax_full, x, phi)
plt.savefig(f"../figures/03_fit_network_c.svg", dpi=2*96)

# Fourth version
phi[2] = 4/np.pi
phi[5] = -np.pi
phi[6] = 1
solveAndPlot(ax_left, ax_right, ax_full, x, phi)
plt.savefig(f"../figures/03_fit_network_d.svg", dpi=2*96)
```

---

## Observations

* Doing this by hand isn't fun
* But this is easy to optimize
* Everything in the equation has an easy derivative
* ReLU(x) has derivative 0 if $x<0$, and 1 if $x>0$

---

## Hidden Units

* The inner terms are called **hidden units**
* Their details are hidden from an observer who sees the network as $y = f[x]$
* Adding more hidden units increases the model's ability to approximate a curve

---

## Universal Approximation Theorem

* There is no bound on the complexity we can model with a shallow NN given an arbitrary number of hidden units
* As long as the curve being modelled is continuous
* The fit can be made as precise as we like by adding more hidden units
* Let's revisit that sin curve

---

## Training a Model

* Going to use more hidden units
* But we're going to let PyTorch train the model this time

---

## Declaring the Model

```python
# Make a model
net = torch.nn.Sequential(
    torch.nn.Linear(1, num_hidden),
    torch.nn.ReLU(),
    torch.nn.Linear(num_hidden, 1))
```

-v-

## Parameter Updates

* I don't really want us to get ahead of ourselves, but this is the parameter update

```python
# This is the forward pass
y_hat = net(xt)
# This computes the loss
loss = loss_fn(y_hat, yt)
# This computes the gradients (derivatives of the error w.r.t each parameter)
loss.backward()
# Update the parameters
for param in net.parameters():
    param.data -= param.grad * learning_rate
    # Zero the gradient before the next pass
    param.grad = None
```

-v-

## The Code

```python
#! /usr/bin/python3

import numpy as np
import matplotlib.pyplot as plt
import torch

my_colors = [ '#2E2585', '#337538', '#5DA899', '#94CBEC' ]

# Activation function
def relu(x):
    y = x.copy()
    y[x < 0] = 0
    return y

# Set up x points
x = np.linspace(0,6,201)

def init_axes(ax):
    ax.set_xlim([0, 6])
    ax.set_ylim([-1.5, 1.5])
    ax.set_xlabel('Input, $x$', fontsize=16)
    ax.set_ylabel('Output, $y$', fontsize=16)

figure, [ax_left, ax_full] = plt.subplots(nrows=1, ncols=2, sharey=True)
figure.set_size_inches(w=12, h=7)

def solveAndPlot(ax_left, ax_full, x, num_hidden):
    for ax in [ax_left, ax_full]:
        ax.clear()
        init_axes(ax)
    ax_left.set_title('Hidden Units', fontsize=16)
    ax_full.set_title('NN Output', fontsize=16)

    sin_y = np.sin(0.5*np.pi + x)

    # Make a model
    net = torch.nn.Sequential(
        torch.nn.Linear(1, num_hidden),
        torch.nn.ReLU(),
        torch.nn.Linear(num_hidden, 1))

    # The output isn't normalized, so we'll initialize the bias so that more of
    # the lines are used
    torch.nn.init.uniform_(net[0].bias, -5, 5)

    learning_rate = 0.01

    # Train the model
    # Convert the data to torch tensors first
    xt = torch.tensor(x.reshape(201, 1)).float()
    yt = torch.tensor(sin_y.reshape(201, 1)).float()
    loss_fn = torch.nn.MSELoss()
    for epoch in range(int(500 * np.sqrt(num_hidden))):
        # Ensure that there are no gradients stored
        net.zero_grad()
        # This is the forward pass
        y_hat = net(xt)
        # This computes the loss
        loss = loss_fn(y_hat, yt)
        # This computes the gradients (derivatives of the error w.r.t each parameter)
        loss.backward()
        # Update the parameters
        for param in net.parameters():
            param.data -= param.grad * learning_rate
            # Zero the gradient before the next pass
            param.grad = None

    # Get a final estimate of y_hat and all of the intermediate hidden layer outputs
    with torch.no_grad():
        hidden_outputs = []
        halfway = net[1](net[0](xt))
        y_hat = net[2](halfway)

        for idx in range(num_hidden):
            ax_left.plot(xt[:,0], net[2].weight[0,idx]*halfway[:,idx], linestyle='solid', marker=None, color=my_colors[0], linewidth=3)
        ax_full.plot(xt, y_hat, linestyle='solid', marker=None, label="fit", color=my_colors[2], linewidth=3)
    ax_full.plot(x, sin_y, linestyle='solid', marker=None, label="sin", color=my_colors[3], linewidth=3)
    ax_full.legend(loc='best', fontsize=16)

for pieces in [5, 10, 25, 100]:
    solveAndPlot(ax_left, ax_full, x, pieces)
    plt.savefig(f"../figures/03_fit_pytorch_{pieces}.svg", dpi=2*96)
```

---

## 3 Lines
---

## 10 Lines

---

## 25 Lines

---

## 100 Lines
---

## Discussion

* We'll go over the code in more detail later
* For now, let's notice a couple of things
* First, more lines can make a smoother fit
* Second, results aren't optimal
* And, if you run this multiple times, they are inconsistent as well
* It turns out that training a neural network isn't trivial

---

## Representative Power

* How many hidden layers should we use?
* We need to discuss representative power
* Important to do before digging into the details of training
* After all, making our network too large or complicated may make training more difficult as well

---

## Layer Type

* The hidden layer we have seen is an example of a **linear** or **fully connected** layer
* Linear because it responds linearly to its input
* $y = mx + b$
* Fully connected because every line in the layer uses each input
* Notice that the first layer has a single input and makes more outputs
* The last layer takes all of those outputs as its inputs and makes one output

---

## Inputs and Outputs

* We could just as easily make a layer with multiple inputs *and* multiple outputs
* Then a hidden layer could feed into a second hidden layer
* A network with one hidden layer is called a shallow neural network
* For most people, anything with two or more hidden layers is a deep neural network

---

## Capacity

* Adding more lines increases the model's **capacity**
* Why do we say capacity?
* The network is **memorizing** the slope of the target function in small segments
* The more segments it has, the more capacity it has to memorize parts of the overall function

---

## 2D Capacity

* In two dimensions, the elbows in the graph are the model's capacity
* Each additional neuron defines a new region and we can model another part of a curve

---

## Network Size

* What does it cost to add capacity?
* Each input to a neuron requires a weight, and each neuron has a bias value
* $w$ hidden units with $n$ inputs require $w(n + 1)$ values
* The output layer uses 1 bias value plus $w$ weights
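-v-

A small sketch of the parameter count for a shallow network with one output, following the two bullets above. The function name `shallow_param_count` is an illustrative assumption. With one input and two hidden units it gives 7, matching the 7 variables in the earlier two-hidden-unit example, and with one input each extra hidden unit adds 3 values.

```python
def shallow_param_count(num_inputs, num_hidden):
    """Weights and biases in a one-hidden-layer network with a single output."""
    hidden = num_hidden * (num_inputs + 1)  # n weights + 1 bias per hidden unit
    output = num_hidden + 1                 # w weights + 1 bias
    return hidden + output

for w in [1, 2, 3, 100]:
    print(w, shallow_param_count(1, w))
```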
---

## Shallow Network Costs

* With a single input and a single hidden layer, each new elbow "costs" 3 additional values
* There are up to $w+1$ regions in the network output with a cost of $w(n+1) + (1 + w)$

---

## Network Efficiency

* Each parameter requires training, so we want to get the most out of them
* This leads to an important question:
* Is it better to make networks wider, or deeper?

---

## Wider Vs Deeper

* What is a deeper network?
* Anything with more than one hidden layer
* We'll get into their mechanics, and advantages, next class