# CS 461 - Lecture 19

## Machine Learning Principles

### Intro to Neural Networks

Bernhard Firner

2025-11-10

---

## Neural Networks

* Starting with some perspective
* Moving on to technical topics
  * NN structure
  * SGD
  * Their weaknesses

---

## The Hype

Neural network hype is not new

---

## Deserved?

* The hype around neural networks has always been "deserved"
* But it is easy to underestimate how hard it will be to make progress
* The problem in the 1990s?
  * Data.

---

## Early Digits
---

## Obvious Advantages?

It wasn't immediately obvious that neural networks would be great

---

## Maybe Good?

---

## Some problems

This is training time in days on a Sparc 10

---

## Intuition Vs Understanding

Nobody won these bets

---

## Problems to Overcome

* Which goes up faster?
  * Training time goes up with data
  * Performance goes up with data
* What structure is best?
  * More parameters or fewer?
  * Deeper or wider?

---

## Intro Done

* Let's move on to the mechanics of neural networks
* But remember, all other techniques are *easy* compared to NNs
  * Not because of their complexity (NNs are so simple!) but because of the freedom in the NN approach
  * Limitless options means limitless choices
  * Impossible to know which approach is "optimal"

---

## Building Blocks

* Each element of a neural network has a weight for each of the N input features, and a bias
  * $f(x) = \sum_{j=1}^{N} w_j x_j + b$
  * or $y = W^T X + b$
* If we use a single neuron we are doing linear regression
* Wrap it in a sigmoid and it is logistic regression

---

## What's In a Name?

* We could have called these functions "linear regressors" or "regression nodes" or something meaningful
* Instead they are called neurons
* We will use multiple neurons for each input to create a layer

---

## Expressive Freedom

* Previous models scaled with their inputs
  * Think of the Gram matrix
* Here, each neuron has a weight for each feature, but the number of neurons is completely free
* What does that mean for us?

---

## Wider is better?

* First, why do we want multiple neurons on the input layer?
* Think back to AdaBoost
  * Remember the ensemble of linear regression lines?
  * That was equivalent to a wide, single layer neural network
* Instead of adding the regression lines individually, we start with all of them
  * Let SGD sort it out

---

## 1-Layer Examples
---

## 1-Layer Examples

---

## 1-Layer Examples

---

## 1-Layer Examples
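The building blocks described earlier can be sketched in a few lines of plain Python: one neuron is a weighted sum plus a bias, wrapping it in a sigmoid gives logistic regression, and a layer is several neurons reading the same inputs. The function names (`neuron`, `sigmoid`, `layer`) and all weights below are illustrative assumptions, not values from the lecture.

```python
import math

def neuron(weights, bias, x):
    """One neuron: f(x) = sum_j w_j * x_j + bias (a linear regressor)."""
    return sum(w * xj for w, xj in zip(weights, x)) + bias

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(weight_rows, biases, x):
    """A layer is several neurons that all read the same input features."""
    return [neuron(w, b, x) for w, b in zip(weight_rows, biases)]

x = [2.0, -1.0]                               # N = 2 input features
linear_out = neuron([0.5, 1.5], 0.25, x)      # linear regression output
logistic_out = sigmoid(linear_out)            # logistic regression output
# A "wide" layer: three neurons, each with its own weights and bias
wide_out = layer([[0.5, 1.5], [1.0, 0.0], [0.0, -2.0]], [0.25, 0.0, 1.0], x)
print(linear_out, logistic_out, wide_out)
```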
---

## Wider!

* With a wide enough neural network we could approximate any continuous function
* This is called the *universal approximation theorem*
  * States that a NN with sufficient structure can approximate any continuous function on a bounded input range to arbitrary accuracy

---

## More Ingredients Required

* To achieve universal function approximation, we need more
* Right now, this is no better than AdaBoost with regression or stumps

---

## What is Missing?

* So a single layer neural network has no advantage
* What about multiple layers?
  * $f(x) = Sigmoid(f_3(f_2(f_1(x))))$
  * We'll call $f_2$ a hidden layer, since it is not exposed to the input or output

---

## Chain Rule

* This is now more complicated than linear regression
* How do we update weights?
* We need to take the derivative through each layer by applying the *chain rule*
  * Which I'm sure everyone remembers from calculus
  * Given $h(x) = f(g(x))$
  * $h'(x) = f'(g(x))g'(x)$

---

## Gradients

* The chain rule gives us a gradient at each NN parameter
* We then multiply by a learning rate
  * (unchanged from linear regression)
* When we train over batches and iteratively converge, this is *stochastic gradient descent*

---

## What is our function?

* Let's consider a 2 layer NN with width 1 and 2 inputs, x and y
  * The first layer has one neuron with parameters $w_{11}, w_{12}, b_1$
  * The second layer has one neuron with parameters $w_{21}, b_2$
* $f(x, y) = w_{21}(w_{11}x + w_{12}y + b_1) + b_2$
* Just a scaled linear combination of x and y
* So what's the point?

---

## It's just addition

* That network creates a linear combination of x and y
* So it isn't very useful
* We need to make the network response *nonlinear*
* But how?

---

## Nonlinear Activation Functions

* Let's stick sigmoid functions in between the layers
  * There are other options, but this is okay for now
* Sigmoid is nonlinear, so the function outputs will be as well
* Notice that it is also differentiable, which is important for gradient descent

---

## Recall Sigmoid
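As a reminder, here is a small sketch of the sigmoid and its derivative, together with a numerical check of the chain rule for $h(x) = f(g(x))$, where $f$ is the sigmoid and $g(x) = wx + b$. The values of `w`, `b`, and `x` are arbitrary assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid: s(z) * (1 - s(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

w, b, x = 3.0, -1.0, 0.5   # illustrative values only

def h(x):
    return sigmoid(w * x + b)

# Chain rule: h'(x) = f'(g(x)) * g'(x), and g'(x) = w
analytic = sigmoid_grad(w * x + b) * w

# Central finite difference as an independent check
eps = 1e-6
numeric = (h(x + eps) - h(x - eps)) / (2 * eps)
print(analytic, numeric)
```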
---

## Nonlinear, meaning what?

* Let's say we want to detect negative values
  * Output 1 if negative, 0 otherwise
* We want a rapid transition, so multiply the input by a large negative weight
  * $sigmoid(x * -10000)$
* Add in a bias value in the previous layer, and now we are thresholding at particular values

---

## Remapping

* How about preserving the original value of x as it passes through the sigmoid?
* Just multiply by a small value, pushing it into the linear area of sigmoid
  * $x \approx 40000*(Sigmoid(x * 0.0001) - 0.5)$
* So multiple Sigmoid layers can implement logic functions and preserve input values

---

## Weird Example

```python
import torch

make_relu = torch.nn.Sequential(
    torch.nn.Linear(1, 2),
    torch.nn.Sigmoid(),
    torch.nn.Linear(2, 2),
    torch.nn.Sigmoid(),
    torch.nn.Linear(2, 1))

with torch.no_grad():
    # Negative detector: positive values are mapped onto 0, negative values onto 1
    make_relu[0].weight[0][0] = -100000000.
    make_relu[0].bias[0] = 0.
    # Pass x to the next layer, but scale it into the sigmoid's linear section
    make_relu[0].weight[1][0] = 0.0001
    make_relu[0].bias[1] = 0.

    # Now use the negative value detector to suppress any negative values
    # Also keep the x value in the linear space of the sigmoid
    make_relu[2].weight[0][0] = -100000000.
    make_relu[2].weight[0][1] = 1
    make_relu[2].bias[0] = -0.5
    # Second unit: 1 when x is negative, 0 otherwise
    make_relu[2].weight[1][0] = 100000000.
    make_relu[2].weight[1][1] = 0
    make_relu[2].bias[1] = -1000

    # Finally, expand the x value back to linear space.
    # 40000 undoes the first sigmoid (see the Remapping slide) and the extra
    # factor of 4 undoes the second, since a sigmoid has slope 1/4 near 0.
    make_relu[4].weight[0][0] = 4*40000
    # Cancels the output bias when x is negative, so the output is 0
    make_relu[4].weight[0][1] = 80000
    # Remove the 0.5 sigmoid offset as we convert the x value back to linear space
    make_relu[4].bias.fill_(-0.5*4*40000)

# From:
# >>> 40000*(sigmoid(torch.tensor([0*0.0001, 1*0.0001, 2*0.0001])) - 0.5)
# tensor([0.0000, 1.0014, 2.0003])

for i in range(-20, 20):
    with torch.no_grad():
        x = torch.tensor([i]).float()
        # Run the layers one at a time so every intermediate value can be printed
        x1 = make_relu[0](x)
        x2 = make_relu[1](x1)
        x3 = make_relu[2](x2)
        x4 = make_relu[3](x3)
        x5 = make_relu[4](x4)
        print(f"{i} {x1} {x2} {x3} {x4} {x5}")
```

---

## Approximation
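The two tricks the network above relies on, a step-like negative detector and a near-linear remapping, can be checked on their own without PyTorch. This is a sketch in plain Python; the helper names `is_negative` and `remap` are assumptions for illustration and do not appear in the lecture code.

```python
import math

def sigmoid(z):
    """Numerically stable sigmoid (avoids overflow for large |z|)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def is_negative(x):
    # A large negative weight turns the sigmoid into a step:
    # about 1 for x < 0 and about 0 for x > 0
    return sigmoid(-10000.0 * x)

def remap(x):
    # Near 0 the sigmoid is about 0.5 + z/4, so with z = 0.0001 * x
    # scaling by 4 / 0.0001 = 40000 recovers x
    return 40000.0 * (sigmoid(0.0001 * x) - 0.5)

for x in [-20.0, -1.0, 1.0, 20.0]:
    print(x, is_negative(x), remap(x))
```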
---

## Universal Approximation Theorem

* Now we can see the power of the NN
* But at what cost?
* Neural network parameters are optimized with SGD
  * That means we calculate the gradient of the error w.r.t. every parameter
  * Then update by some small step

---

## Deeper Solutions

* Let's solve those points again
* But with our fancy network structure that solves anything
* We'll also try out an optimized version of SGD that only costs 3x the memory
  * Called Adam

---

## Spaghetti Solutions
---

## Spaghetti Solutions

---

## Spaghetti Solutions

---

## Spaghetti Solutions

---

## Spaghetti Solutions

---

## Spaghetti Solutions
---

## What Gives?

* The simple, wide network learned this just fine
* What's happening?
* The loss surface, whose gradient we descend, is more complicated
  * That makes it more likely to get stuck in a local minimum
* So let's flail around! Maybe we can change from vanilla SGD to something fancy called Adam!
  * We read somewhere that it's the cool thing nowadays!

---

## Spaghetti Solutions
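For reference, here is a sketch of the Adam update for a list of scalar parameters, written in plain Python. The function name `adam_step` and the toy problem are assumptions for illustration; the hyperparameter defaults are the commonly used ones. The point to notice is that Adam keeps two extra buffers per parameter, `m` and `v`, alongside the parameters themselves, which is presumably where the extra memory mentioned earlier comes from.

```python
import math

def adam_step(params, grads, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return updated (params, m, v) after step t, where t starts at 1."""
    new_params, new_m, new_v = [], [], []
    for p, g, mi, vi in zip(params, grads, m, v):
        mi = beta1 * mi + (1 - beta1) * g        # running mean of gradients
        vi = beta2 * vi + (1 - beta2) * g * g    # running mean of squared gradients
        m_hat = mi / (1 - beta1 ** t)            # bias correction for early steps
        v_hat = vi / (1 - beta2 ** t)
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(mi)
        new_v.append(vi)
    return new_params, new_m, new_v

# Toy problem (an assumption): minimize f(p) = p^2, whose gradient is 2p
params, m, v = [1.0], [0.0], [0.0]
for t in range(1, 2001):
    grads = [2.0 * p for p in params]
    params, m, v = adam_step(params, grads, m, v, t, lr=0.01)
print(params)
```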
---

## Spaghetti Solutions

---

## Spaghetti Solutions

---

## Spaghetti Solutions

---

## Spaghetti Solutions

---

## Spaghetti Solutions
---

## Now what?

* There are too many options with neural networks
* You cannot simply flail around
* If we think that our loss surface is too complicated, let's change that

---

## Hyperbolic Tangent

* The Sigmoid nonlinearity *can* make a universal function approximator
* But we just saw how complicated it could be
  * It was made for regression to 0 or 1, not learning
* Why not try a different function that goes to -1 or 1?
  * Allows for a simpler representation of activation or suppression

---

## Hyperbolic Tangent
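A small sketch comparing the two activations. It relies on the identity $\tanh(x) = 2\,\mathrm{sigmoid}(2x) - 1$, so tanh is a rescaled sigmoid whose output is centered on 0 and spans (-1, 1) rather than (0, 1). The sample points are arbitrary.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

for x in [-3.0, -1.0, 0.0, 1.0, 3.0]:
    # tanh written as a shifted and scaled sigmoid
    rescaled = 2.0 * sigmoid(2.0 * x) - 1.0
    print(f"x={x:+.1f} sigmoid={sigmoid(x):.4f} "
          f"tanh={math.tanh(x):+.4f} rescaled={rescaled:+.4f}")
```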
---

## Spaghetti Solutions

---

## Spaghetti Solutions

---

## Spaghetti Solutions
---

## Wow!

* That was a subtle change to the activation function, and so to the loss surface, for a large change to the results
* Which is typical of neural network training

---

## ReLU

* Remember that function we approximated with sigmoids before?
  * f(x) = x if x > 0, 0 otherwise
* That's called a rectified linear unit (or ReLU)
* It's another popular nonlinearity

---

## ReLU
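ReLU and its gradient are simple enough to write out directly. This sketch uses plain Python; treating the gradient at exactly 0 as 0 is a common convention and an assumption here. Unlike the sigmoid, the gradient does not shrink for large positive inputs.

```python
def relu(x):
    """f(x) = x if x > 0, 0 otherwise."""
    return x if x > 0 else 0.0

def relu_grad(x):
    """Slope is 1 on the positive side and 0 elsewhere (0 at x == 0 by convention)."""
    return 1.0 if x > 0 else 0.0

for x in [-2.0, -0.5, 0.0, 0.5, 2.0]:
    print(x, relu(x), relu_grad(x))
```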
---

## Spaghetti Solutions

---

## Spaghetti Solutions

---

## Spaghetti Solutions
---

## What does this mean?

* A NN *can* approximate any function
* That doesn't mean that it *will*
* The struggle with NNs is making their loss surface amenable to learning with gradient descent
* Unfortunately, loss surfaces are difficult to visualize

---

## Hyperparameters

* We could probably tweak learning for all of those networks and eventually make them work
  * If the network with 2 hidden layers of size 10 worked, the rest should too
* This is often the struggle with NNs
  * They *should* work, but sometimes they don't

---

## Guarantees of Gradient Descent

* Convergence
  * Theoretically, if step sizes are arbitrarily small, floating point numbers were real valued, and a few other assumptions hold
* Margin
  * Only if expressed in the loss function
* Sparsity
  * Only if expressed in the loss function

---

## Missing Gradients

* Parameter updates rely upon a gradient existing
* That is the derivative calculation performed on every operation
* Remember HMM forward and backward pass calculations?
  * Probabilities get small over long chains
* A similar thing happens to gradients over deep networks
* This is called the *vanishing gradient problem*

---

## Initialization

* Neural network parameters can also start in a bad spot
* Let's say a neuron started with a bias of 100000 and is fed into a sigmoid
  * The output will be effectively 1, regardless of the input
  * It will change with enough iterations
  * But if another part of the network uses it as a constant, then it's stuck
* In general, we use *too many* parameters, hoping that some start in a good place

---

## Other problems with SGD

* Gradient descent requires that we save gradients
  * So we need memory for 2x the network parameters
* But we'll also want to keep track of momentum, to get over irregular loss surfaces
  * And probably some more stuff
* So this is all very memory and compute expensive

---

## Advantages

* Gradient descent works with any differentiable loss function
  * L2 norm is easy to add (recall linear regression)
* The chain rule lets us work with (nearly) arbitrary structures and functions
* This allows us to craft NNs for particular problem classes
  * And then tweak the structure until it works

---

## Images

* That brings us to LeNet, an early NN for image analysis
* It popularized what we now call convolutions, which were originally called local receptive fields
  * [Handwritten Digit Recognition with a Back-Propagation Network](https://proceedings.neurips.cc/paper_files/paper/1989/file/53c3bce66e43be4f209556518c2fcb54-Paper.pdf)
* We'll go through LeNet and apply it to Digits next class