<!--
Abstract:

CS 461
Introduction to Deep Learning
Lecture 11
-->

# CS 461 - Lecture 19

## Machine Learning Principles

### Intro to Neural Networks

Bernhard Firner

2025-11-10

---

## Neural Networks

* Starting with some perspective
* Moving on to technical topics
  * NN structure
  * SGD
  * Their weaknesses

---

## The Hype

Neural network hype is not new

---

## Deserved?

* The hype around neural networks has always been "deserved"
* But it is easy to underestimate how hard it will be to make progress
* The problem in the 1990s?
  * Data.

---

## Early Digits

---

## Obvious Advantages?

It wasn't immediately obvious that neural networks would be great

---

## Maybe Good?

---

## Some problems

This is training time in days on a Sparc 10

---

## Intuition Vs Understanding

Nobody won these bets

---

## Problems to Overcome

* Which goes up faster?
  * Training time goes up with data
  * Performance goes up with data
* What structure is best?
  * More parameters or fewer?
  * Deeper or wider?

---

## Intro Done

* Let's move on to the mechanics of neural networks
* But remember, all other techniques are *easy* compared to NNs
  * Not because of their complexity (NNs are so simple!) but because of the freedom in the NN approach
  * Limitless options means limitless choices
    * Impossible to know which approach is "optimal"

---

## Building Blocks

* Each element of a neural network has a weight for the N input features, and a bias
  * $f(x) = \sum_{j=1}^{N} w_jx_j + bias$
    * or, in vector form, $y = w^Tx + b$
    * If we use a single neuron we are doing linear regression
    * Wrap it in a sigmoid and it is logistic regression
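
The single neuron above can be sketched in a few lines of plain Python. This is only an illustration: the helper names `neuron` and `sigmoid` and the example weights are assumptions, not part of the lecture material.

```python
import math

def neuron(weights, bias, x):
    """Weighted sum of the inputs plus a bias: f(x) = sum_j w_j * x_j + bias."""
    return sum(w * xj for w, xj in zip(weights, x)) + bias

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Hypothetical parameters for a neuron with N = 3 input features.
weights = [0.5, -1.0, 2.0]
bias = 0.25
x = [1.0, 2.0, 3.0]

linear_output = neuron(weights, bias, x)    # what linear regression predicts
logistic_output = sigmoid(linear_output)    # what logistic regression predicts
print(linear_output, logistic_output)
```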

---

## What's In a Name?

* We could have called these functions "linear regressors" or "regression nodes" or something meaningful
  * Instead they are called neurons
* We will apply several neurons to the same input to create a layer
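
As a rough sketch, a layer is just several neurons reading the same input vector. The function name `layer` and the weights below are made up for illustration; note that the number of rows (neurons) is chosen freely and does not depend on the number of input features.

```python
def layer(weight_rows, biases, x):
    """One output per neuron; every neuron sees the same input vector x."""
    return [sum(w * xj for w, xj in zip(row, x)) + b
            for row, b in zip(weight_rows, biases)]

# Hypothetical layer: 3 neurons, each with 2 input weights and a bias.
W = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, -1.0]]
b = [0.0, 0.5, -1.0]

outputs = layer(W, b, [2.0, 3.0])
print(outputs)
```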

---

## Expressive Freedom

* Previous models scaled with their inputs
  * Think of the Gram matrix
* Here, each neuron has a weight for each feature, but the number of neurons is completely free
* What does that mean for us?

---

## Wider is better?

* First, why do we want multiple neurons on the input layer?
* Think back to AdaBoost
  * Remember the ensemble of linear regression lines?
  * That was equivalent to a wide, single layer neural network
* Instead of adding the regression lines individually, we start with all of them
  * Let SGD sort it out

---

## 1-Layer Examples

---

## 1-Layer Examples

---

## 1-Layer Examples

---

## 1-Layer Examples

---

## Wider!

* With a wide enough neural network we could approximate any continuous function
* This is called the *universal approximation theorem*
  * States that a NN with sufficient structure can approximate any continuous function to arbitrary accuracy

---

## More Ingredients Required

* To achieve universal function approximation, we need more
  * Right now, this is no better than AdaBoost with regression or stumps

---

## What is Missing?

* So a single layer neural network has no advantage
* What about multiple layers?
  * $f(x) = Sigmoid(f_3(f_2(f_1(x))))$
* We'll call $f_2$ a hidden layer, since it is not exposed to the input or output

---

## Chain Rule

* This is now more complicated than linear regression
  * How do we update weights?
* We need to take the derivative over each layer by applying the *chain rule*
  * Which I'm sure everyone remembers from calculus
  * Given $h(x) = f(g(x))$
  * $h'(x) = f'(g(x))g'(x)$
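
One way to convince yourself of the chain rule is to compare it with a numerical derivative. The sketch below assumes a sigmoid outer function and an arbitrary linear inner function; both choices, and all of the names, are illustrative.

```python
import math

def g(x):
    return 3.0 * x + 1.0            # inner function (like a neuron's weighted sum)

def g_prime(x):
    return 3.0

def f(u):
    return 1.0 / (1.0 + math.exp(-u))   # outer function: sigmoid

def f_prime(u):
    s = f(u)
    return s * (1.0 - s)

def h(x):
    return f(g(x))

def h_prime(x):
    # Chain rule: h'(x) = f'(g(x)) * g'(x)
    return f_prime(g(x)) * g_prime(x)

x0 = 0.2
eps = 1e-6
numeric = (h(x0 + eps) - h(x0 - eps)) / (2 * eps)   # central difference
analytic = h_prime(x0)
print(analytic, numeric)
```

The two printed values should agree to many decimal places, which is the same check that is often used to debug hand-written gradients.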

---

## Gradients

* The chain rule gives us a gradient at each NN parameter
* We then multiply by a learning rate
  * (unchanged from linear regression)
* When we train over batches and iteratively converge, this is *stochastic gradient descent*
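
A minimal sketch of stochastic gradient descent, using one-feature linear regression so that the gradients can be written by hand. The data, learning rate, batch size, and step count are assumptions chosen for the example.

```python
import random

random.seed(0)

# Hypothetical noiseless data drawn from y = 2x + 1.
xs = [i / 10.0 for i in range(-20, 21)]
data = [(x, 2.0 * x + 1.0) for x in xs]

w, b = 0.0, 0.0
learning_rate = 0.1
batch_size = 8

for step in range(2000):
    # "Stochastic": each step only looks at a random batch of the data.
    batch = random.sample(data, batch_size)
    # Gradient of the mean squared error with respect to w and b.
    grad_w = sum(2.0 * (w * x + b - y) * x for x, y in batch) / batch_size
    grad_b = sum(2.0 * (w * x + b - y) for x, y in batch) / batch_size
    # Step against the gradient, scaled by the learning rate.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(w, b)
```

With a neural network the only difference is that the gradients come from the chain rule applied through every layer instead of from a hand-written formula.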

---

## What is our function?

* Let's consider a 2 layer NN with width 1 and 2 inputs, x and y
  * The first layer has one neuron with parameters  $w_{11}, w_{12}, b_1$
  * The second layer has one neuron with parameters $w_{21}, b_2$
* $f(x, y) = w_{21}(w_{11}x + w_{12}y + b_1) + b_2$
  * Just a scaled linear combination of x and y
  * So what's the point?
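
To make the collapse concrete, here is a small check that the two-layer network equals a single neuron with weights $w_{21}w_{11}$, $w_{21}w_{12}$ and bias $w_{21}b_1 + b_2$. The parameter values and function names are hypothetical.

```python
def two_layer(x, y, w11, w12, b1, w21, b2):
    hidden = w11 * x + w12 * y + b1     # first layer, no activation
    return w21 * hidden + b2            # second layer

def one_layer(x, y, wx, wy, b):
    return wx * x + wy * y + b

# Hypothetical parameter values.
w11, w12, b1, w21, b2 = 0.5, -2.0, 1.0, 3.0, -0.25

# Folding the second layer into the first gives one equivalent neuron.
wx, wy, b = w21 * w11, w21 * w12, w21 * b1 + b2

points = [(0.0, 0.0), (1.0, 2.0), (-3.0, 0.5)]
for x, y in points:
    print(two_layer(x, y, w11, w12, b1, w21, b2), one_layer(x, y, wx, wy, b))
```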

---

## It's just addition

* That network creates a linear combination of x and y
  * So it isn't very useful
* We need to make the network response *nonlinear*
  * But how?

---

## Nonlinear Activation Functions

* Let's stick sigmoid functions in between the layers
  * There are other options, but this is okay for now
* Sigmoid is nonlinear, so the function outputs will be as well
  * Notice that it is also differentiable, which is important for gradient descent

---

## Recall Sigmoid

---

## Nonlinear, meaning what?

* Let's say we want to detect negative values
  * Output 1 if negative, 0 otherwise
* We want a rapid transition, so multiply the input by a large weight
  * $sigmoid(x * -10000)$
* Add in a bias value in the previous layer, and now we are thresholding at particular values
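
A sketch of the detector described above, in plain Python. The names `is_negative` and `below` and the `sharpness` parameter are illustrative; the subtraction inside `below` plays the role of the bias.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def is_negative(x, sharpness=10000.0):
    # A large negative weight turns the sigmoid into a near step function.
    return sigmoid(-sharpness * x)

def below(x, threshold, sharpness=10000.0):
    # Shifting the input moves the step from 0 to `threshold`.
    return sigmoid(-sharpness * (x - threshold))

for x in [-1.0, -0.01, 0.01, 1.0]:
    print(x, is_negative(x))
print(below(2.5, 3.0), below(3.5, 3.0))
```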

---

## Remapping

* How about preserving the original value of x as it passes through the sigmoid?
  * Just multiply by a small value, pushing it into the linear area of sigmoid
  * $x \approx 40000*(Sigmoid(x * 0.0001) - 0.5)$
* So multiple Sigmoid layers can implement logic functions and preserve input values
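
The remapping works because the sigmoid is approximately $0.5 + z/4$ for small $z$, so an input scaled by 0.0001 is recovered with a factor of $4 / 0.0001 = 40000$. A quick numerical check, with `remap` as a made-up helper name:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))   # fine here: z stays close to 0

def remap(x, scale=0.0001):
    # Undo the 0.5 offset and the slope of 1/4 that the sigmoid has near 0.
    return (4.0 / scale) * (sigmoid(scale * x) - 0.5)

for x in [-20, -1, 0, 1, 20]:
    print(x, remap(x))
```

The approximation gets worse as `scale * x` moves away from zero, which is why the scale has to be small relative to the range of inputs.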

---

## Weird Example

```python
import torch

make_relu = torch.nn.Sequential(
        torch.nn.Linear(1, 2),
        torch.nn.Sigmoid(),
        torch.nn.Linear(2, 2),
        torch.nn.Sigmoid(),
        torch.nn.Linear(2, 1))

with torch.no_grad():
    # Positive values are mapped onto 0
    make_relu[0].weight[0][0] = -100000000.
    make_relu[0].bias[0] = 0.
    # Pass x to the next layer, but scale it into the sigmoids linear section
    make_relu[0].weight[1][0] = 0.0001
    make_relu[0].bias[1] = 0.
    # Now use the negative value detector to suppress any negative values
    # Also keep the x value in the linear space of the sigmoid
    make_relu[2].weight[0][0] = -100000000.
    make_relu[2].weight[0][1] = 1
    make_relu[2].bias[0] = -0.5
    make_relu[2].weight[1][0] = 100000000.
    make_relu[2].weight[1][1] = 0
    make_relu[2].bias[1] = -1000
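    # Finally, expand the x value back to linear space.
    # Near 0, sigmoid(z) is about 0.5 + z/4, so each sigmoid shrinks x by 4;
    # the first layer also scaled x by 0.0001, hence the factor 4*40000.
    make_relu[4].weight[0][0] = 4*40000
    # For negative inputs the second unit outputs 1, which cancels the bias
    make_relu[4].weight[0][1] = 80000
    # Subtract the 0.5 sigmoid offset, scaled by the same 4*40000 factor
    make_relu[4].bias.fill_(-0.5*4*40000)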
    # From:
    # >>> 40000*(sigmoid(torch.tensor([0*0.0001, 1*0.0001, 2*0.0001])) - 0.5)
    # tensor([0.0000, 1.0014, 2.0003])

for i in range(-20, 20):
    with torch.no_grad():
        x = torch.tensor([i]).float()
        x1 = make_relu[0].forward(x)
        x2 = make_relu[1].forward(x1)
        x3 = make_relu[2].forward(x2)
        x4 = make_relu[3].forward(x3)
        x5 = make_relu[4].forward(x4)
        print(f"{i} {x1} {x2} {x3} {x4} {x5}")
```

---

## Approximation

---

## Universal Approximation Theorem

* Now we can see the power of the NN
  * But at what cost?
* Neural network parameters are optimized with SGD
  * That means we calculate the gradient w.r.t. error for every parameter
  * Then update by some small step

---

## Deeper Solutions

* Let's solve those points again
  * But with our fancy network structure that solves anything
* We'll also try out an optimized version of SGD that only costs 3x the memory
  * Called Adam

---

## Spaghetti Solutions

---

## Spaghetti Solutions

---

## Spaghetti Solutions

---

## Spaghetti Solutions

---

## Spaghetti Solutions

---

## Spaghetti Solutions

---

## What Gives?

* The simple, wide network learned this just fine
* What's happening?
  * The loss surface, whose gradient we descend, is more complicated
  * That makes it more likely to get stuck in a local minimum
* So let's flail around! Maybe we can change from vanilla SGD to something fancy called Adam!
  * We read somewhere that it's the cool thing nowadays!

---

## Spaghetti Solutions

---

## Spaghetti Solutions

---

## Spaghetti Solutions

---

## Spaghetti Solutions

---

## Spaghetti Solutions

---

## Spaghetti Solutions

---

## Now what?

* There are too many options with neural networks
  * You cannot simply flail around
* If we think that our loss surface is too complicated, let's change that

---

## Hyperbolic Tangent

* The Sigmoid nonlinearity *can* make a universal function approximator
  * But we just saw how complicated it could be
  * It was made for regression to 0 or 1, not learning
* Why not try a different function that goes to -1 or 1?
  * Allows for a simpler representation of activation or suppression

---

## Hyperbolic Tangent

---

## Spaghetti Solutions

---

## Spaghetti Solutions

---

## Spaghetti Solutions

---

## Wow!

* That was a subtle change to the activation function, and so to the loss surface, for a large change to the results
  * Which is typical of neural network training

---

## ReLU

* Remember that function we approximated with sigmoids before?
  * f(x) = x if x > 0, 0 otherwise
* That's called a rectified linear unit (or ReLU)
  * It's another popular nonlinearity
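
Written directly, ReLU and its derivative are one line each. Treating the derivative at exactly 0 as 0 is a convention assumed here, since the function has a kink at that point.

```python
def relu(x):
    """f(x) = x if x > 0, 0 otherwise."""
    return x if x > 0 else 0.0

def relu_grad(x):
    """Slope of ReLU: 1 for positive inputs, 0 otherwise (0 at x == 0 by convention)."""
    return 1.0 if x > 0 else 0.0

for x in [-2.0, -0.5, 0.0, 0.5, 2.0]:
    print(x, relu(x), relu_grad(x))
```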

---

## ReLU

---

## Spaghetti Solutions

---

## Spaghetti Solutions

---

## Spaghetti Solutions

---

## What does this mean?

* A NN *can* approximate any continuous function
  * That doesn't mean that it *will*
* The struggle with NNs is making their loss surface amenable to learning with gradient descent
  * Unfortunately, loss surfaces are difficult to visualize

---

## Hyperparameters

* We could probably tweak learning for all of those networks and eventually make them work
  * If the network with 2 hidden layers of size 10 worked, the rest should too
* This is often the struggle with NNs
  * They *should* work, but sometimes they don't

---

## Guarantees of Gradient Descent

* Convergence
  * Theoretically, if step sizes are arbitrarily small, floating point numbers were real valued, and a few other assumptions hold
* Margin
  * Only if expressed in the loss function
* Sparsity
  * Only if expressed in the loss function

---

## Missing Gradients

* Parameter updates rely upon a gradient existing
  * That is the derivative calculation performed on every operation
* Remember HMM forward and backward pass calculations?
  * Probabilities get small over long chains
  * A similar thing happens to gradients over deep networks
  * This is called the *vanishing gradient problem*
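
A toy illustration of the effect, assuming a chain of single-neuron sigmoid layers with weight 1 and no bias (a deliberately simplified, hypothetical network). The sigmoid's slope is at most 0.25, so the chain rule multiplies the gradient by at most 0.25 per layer.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def chain_gradient(x, depth, weight=1.0):
    """d(output)/d(input) through `depth` stacked sigmoid neurons."""
    grad = 1.0
    activation = x
    for _ in range(depth):
        z = weight * activation
        grad *= sigmoid_prime(z) * weight   # chain rule: multiply local slopes
        activation = sigmoid(z)
    return grad

for depth in [1, 5, 10, 20]:
    print(depth, chain_gradient(0.0, depth))
```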

---

## Initialization

* Neural network parameters can also start in a bad spot
  * Let's say a neuron started with a bias of 100000 and is fed into a sigmoid
  * The output will be 1, regardless of the input
* It will change with enough iterations
  * but if another part of the network uses it as a constant, then it's stuck
* In general, we use *too many* parameters, hoping that some start in a good place

---

## Other problems with SGD

* Gradient descent requires that we save gradients
* So we need memory for 2x the network parameters
  * But we'll also want to keep track of momentum, to get over irregular loss surfaces
  * And probably some more stuff
* So this is all very memory and compute expensive
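
A back-of-the-envelope count of the stored values, assuming a small fully connected network with 2 inputs, two hidden layers of 10, and 1 output. The layer sizes are hypothetical, and the count ignores the activations that also have to be kept for the backward pass.

```python
def parameter_count(layer_sizes):
    """Weights plus biases for a fully connected network."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

layer_sizes = [2, 10, 10, 1]          # hypothetical network shape
n_params = parameter_count(layer_sizes)

floats_sgd = 2 * n_params             # parameters + one gradient per parameter
floats_momentum = 3 * n_params        # + one momentum value per parameter
print(n_params, floats_sgd, floats_momentum)
```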

---

## Advantages

* Gradient descent works with any loss function
  * L2 norm is easy to add (recall linear regression)
* The chain rule lets us work with (nearly) arbitrary structures and functions
* This allows us to craft NNs for particular problem classes
  * And then tweak the structure until it works
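
As a small sketch of how a penalty slots into gradient descent: an L2 term $\lambda \sum_j w_j^2$ adds $2\lambda w_j$ to each weight's gradient. The function names, the value of `lam`, and the zero data gradient are assumptions chosen to isolate the penalty's effect.

```python
def l2_penalty(weights, lam):
    return lam * sum(w * w for w in weights)

def l2_penalty_grad(weights, lam):
    return [2.0 * lam * w for w in weights]

def sgd_step(weights, data_grads, lam, learning_rate):
    """One update using the data-loss gradient plus the L2 penalty gradient."""
    penalty_grads = l2_penalty_grad(weights, lam)
    return [w - learning_rate * (g + p)
            for w, g, p in zip(weights, data_grads, penalty_grads)]

weights = [1.0, -2.0]
penalty = l2_penalty(weights, lam=0.1)
# With a zero data gradient, the penalty alone shrinks the weights toward 0.
new_weights = sgd_step(weights, [0.0, 0.0], lam=0.1, learning_rate=0.5)
print(penalty, new_weights)
```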

---

## Images

* That brings us to LeNet, an early NN for image analysis
* It popularized what we now call convolutions, which were originally called local receptive fields
  * [Handwritten Digit Recognition with a Back-Propagation Network](https://proceedings.neurips.cc/paper_files/paper/1989/file/53c3bce66e43be4f209556518c2fcb54-Paper.pdf)
* We'll go through LeNet and apply it to Digits next class

<!--
Example of many lines making a smoother curve
Piecewise linear fits are arbitrarily smooth with more pieces. This allows us to carve something out, even from the middle of a circle.

Then go on to depth. Show how multiple successive layers can model the polynomial kernel.
Show how they can project 2D points onto a separable space, even when the layers are not wide. This allows us to project the middle of a circle into a separable space.

How? Look at the gradients for each example. Print them out. Calculate some by hand.

When doesn't gradient descent work?
Talk about parameter initialization.
Add in normalization/weight decay

-->