# CS 461 - Lecture 25

## Machine Learning Principles

### Neural Network Review

Bernhard Firner

2025-12-3

---

## Quick Review

* I've put a quick review of all the DNN lectures here
  * Except for yesterday's
* Sample questions come at the end

---

## DNNs are Different

* Neural networks, especially deep neural networks, behave differently from the other models we have studied
* We use models that are far over-capacity, and this improves results
* We must use multiple strong regularizing techniques
  * L2 norm on its own isn't good enough
* Many improvements come from hyperparameters rather than network structure

---

## Unpredictability

* *Why* DNNs work so well is unclear
  * The phrase "unreasonable effectiveness" shows up in the literature
* But the fact that better hardware would yield improved results was easily predicted
* That being said, more improvements have been made besides making larger models

---

## Universal Approximation Theorem

* A model with a nonlinear activation function (like ReLU) can produce a piecewise linear function
* That function can be fit to a target of arbitrary complexity, assuming the layer has enough width
  * Increasing the width leads to a smoother fit: a better approximation
* Thus, for any *linear function*, a sufficiently wide neural network can approximate it to arbitrary precision
* This result has been extended to deep architectures as well
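
As a rough illustration of the width argument, the sketch below builds a one-hidden-layer ReLU network by hand (no training) so that it interpolates a target at evenly spaced knots. The target `math.sin`, the interval, and the helper names are assumptions chosen for this example, not something from the lecture.

```python
import math

def relu(x):
    return max(0.0, x)

def build_relu_interpolant(target, lo, hi, width):
    """Return (bias, knots, weights) for a one-hidden-layer ReLU network
    with `width` hidden units that interpolates `target` on [lo, hi]."""
    knots = [lo + (hi - lo) * i / width for i in range(width)]
    ends = knots[1:] + [hi]
    slopes = [(target(b) - target(a)) / (b - a) for a, b in zip(knots, ends)]
    # Each hidden unit changes the slope of the output at its knot.
    weights = [slopes[0]] + [slopes[i] - slopes[i - 1] for i in range(1, width)]
    return target(lo), knots, weights

def evaluate(net, x):
    bias, knots, weights = net
    return bias + sum(w * relu(x - k) for k, w in zip(knots, weights))

def max_error(target, net, lo, hi, samples=1000):
    xs = [lo + (hi - lo) * i / samples for i in range(samples + 1)]
    return max(abs(target(x) - evaluate(net, x)) for x in xs)

errors = {}
for width in (4, 16, 64):
    net = build_relu_interpolant(math.sin, 0.0, 2 * math.pi, width)
    errors[width] = max_error(math.sin, net, 0.0, 2 * math.pi)
    print(width, errors[width])
```

The printed error should shrink as the width grows, which is the "more pieces, smoother fit" behavior described above.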

---

## Practical Result?

* Does this give us anything actionable?
  * Not really; it merely tells us that neural networks have strong representation power
  * It doesn't guarantee that they will work

---

## Fitting a Neural Network

* We went through a copious number of examples
* Let's see them again!
* Let's begin with a single layer network
  * We will train with SGD, and make the network wider until it solves the set of points to our satisfaction

---

## Universal Approximation

* Notice that our network creates a smoother decision boundary with more elements?
* That is the magic of a large number of pieces in a piecewise linear fit
  * But we shouldn't be excited; AdaBoost could do the same

---

## Nonlinear Functions

* So far, we have only talked about approximating linear functions
  * What about nonlinear functions?
* We need to add nonlinearities into the network for that

---

## Activation Functions

* The nonlinearities that we insert into the neural network's hidden layers are called nonlinear activation functions
* Sigmoid, Tanh (hyperbolic tangent), and ReLU are examples
* We saw that a DNN with three linear layers and sigmoids in between the layers could emulate ReLU
  * We didn't implement it, but we could have learned a Sigmoid from a DNN with ReLUs as well
* So these nonlinearities are, conceptually, equivalent
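
A small sketch of the three activation functions in plain Python. It also checks one exact relationship between two of them, tanh(x) = 2·sigmoid(2x) − 1. This is a simpler case than the learned sigmoid-to-ReLU emulation mentioned above; the function names are chosen for this example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    return max(0.0, x)

def tanh_from_sigmoid(x):
    # Exact identity: tanh is a shifted and rescaled sigmoid.
    return 2.0 * sigmoid(2.0 * x) - 1.0

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(x, sigmoid(x), math.tanh(x), tanh_from_sigmoid(x), relu(x))
```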

---

## Deeper Results

* The next thing to explore is solving our earlier dataset with deeper networks
* We'll use sigmoids between all of the layers
* A notation of AxBxC means three layers of size A, B, and C
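
To get a feel for how large these networks are, here is a hedged sketch that turns the AxBxC notation into a parameter count. It assumes the sizes name fully connected hidden layers with biases, and it assumes 2 inputs and 1 output; those input and output sizes are placeholders, not taken from the lecture.

```python
def parse_layers(spec):
    """Turn a string like '10x10x10' into [10, 10, 10]."""
    return [int(n) for n in spec.split("x")]

def count_parameters(spec, n_inputs=2, n_outputs=1):
    """Count weights and biases, assuming fully connected layers.
    n_inputs and n_outputs are assumed values for illustration."""
    sizes = [n_inputs] + parse_layers(spec) + [n_outputs]
    return sum(fan_in * fan_out + fan_out
               for fan_in, fan_out in zip(sizes, sizes[1:]))

for spec in ("10x10", "10x10x10", "100x100x100", "1000x1000"):
    print(spec, count_parameters(spec))
```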

---

## 10x10 with sigmoid

---

## 10x10x10 with sigmoid

---

## 10x10x10x10 with sigmoid

---

## 10x10x10x10x10 with sigmoid

---

## 100x100x100 with sigmoid

---

## 1000x1000 with sigmoid

---

## Gradient Descent

* Our struggle here is not with the DNN's representational power, but with learning
* The sigmoid only has a narrow range where its output changes substantially
  * In other places, it changes slowly, so its gradient is small
  * The gradient is the derivative of the loss function with respect to our model parameters
* Plain SGD uses a single learning rate for all parameters
  * Parameters at different parts of a sigmoid could need different learning rates, but that isn't possible
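
The narrow active range of the sigmoid can be checked numerically. The sketch below evaluates its derivative, s(x)(1 − s(x)), at a few points, and then shows an upper bound on the product of derivatives across several stacked sigmoids (ignoring the weights, which is a simplification).

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# The derivative peaks at 0.25 when x = 0 and falls off quickly.
for x in (0.0, 2.0, 5.0, 10.0):
    print(x, sigmoid_grad(x))

# Best case for a chain of sigmoids: every derivative is at its peak.
for depth in (2, 5, 10):
    print(depth, 0.25 ** depth)
```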

---

## Options

* But if Sigmoid is giving us trouble, let's try some other options
  * Tanh
  * ReLU

---

## Hyperbolic Tangent

---

## 10x10 with Tanh

---

## 10x10x10 with Tanh

---

## 10x10x10x10 with Tanh

---

## ReLU

---

## 10x10 with ReLU

---

## 10x10x10 with ReLU

---

## 10x10x10x10 with ReLU

---

## Configuring DNNs

* Configuring a DNN isn't just about making it larger
* Most optimizations are actually about changing the gradient and loss surface
* Adding parameters to a DNN is easy; adding them and maintaining learning is hard

---

## Solutions

* First, what is our fitting algorithm?
  * LeNet-5 used the Levenberg-Marquardt algorithm
  * Most others have used stochastic gradient descent (SGD)
  * Now Adam (and AdamW) are popular

---

## More Fitting

* Modern fitting algorithms always use momentum
  * Track the previous weight updates, and continue in those directions with some decay rate
  * In both SGD and Adam
* Momentum helps learning move past sharp valleys
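
A minimal sketch of one common form of the momentum update, applied to the toy loss f(w) = w². The learning rate, the decay rate of 0.9, and the function name are assumptions for illustration; libraries differ in the exact formulation.

```python
def sgd_momentum_step(weights, grads, velocity, lr=0.1, momentum=0.9):
    """One SGD step with momentum. The velocity is a decayed running sum
    of previous updates, so steps keep moving in earlier directions."""
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity

# Toy problem: minimize f(w) = w^2, whose gradient is 2w.
w, v = [5.0], [0.0]
for _ in range(200):
    w, v = sgd_momentum_step(w, [2.0 * w[0]], v)
print(w[0])
```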

---

## Adam

* We haven't really covered Adam
  * But progress in learning algorithms is one of the lessons from ConvNeXt
* Basically, Adam takes a fixed-size step along the loss surface, scaled by the learning rate
  * So on a steep section of the loss, it will take a smaller step than SGD
  * On a flat section, it will take a larger step than SGD
* This can sometimes solve SGD's problem, where it ceases to learn where gradients are small and oscillates where gradients are steep
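
The "fixed-size step" behavior can be seen in a scalar sketch of the Adam update. The hyperparameter defaults below are the commonly used ones and the function names are chosen for this example. On the first step, the Adam update has a magnitude close to the learning rate whether the gradient is huge or tiny, while the plain SGD step scales with the gradient.

```python
import math

def adam_step(w, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter at step t (t >= 1)."""
    m = beta1 * m + (1 - beta1) * g        # running mean of gradients
    v = beta2 * v + (1 - beta2) * g * g    # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)           # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

def sgd_step(w, g, lr=0.001):
    return w - lr * g

for g in (1000.0, 1.0, 0.001):
    w_adam, _, _ = adam_step(0.0, g, 0.0, 0.0, t=1)
    w_sgd = sgd_step(0.0, g)
    print(g, abs(w_adam), abs(w_sgd))
```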

---

## Regularization

* What happens if we fit *too well*?
  * This is prevented with regularization
* L2 worked in the past, and it works now
* But now there are more options

---

## Other Regularizers

* Large learning rates are their own regularizer
  * From LeNet
  * They push through sharp cliffs and make it more likely to settle in broad plateaus of the loss surface
* Convolutions
  * They are location invariant and weight-sharing
  * Very difficult to overfit

---

## New Regularizers

* Dropout
  * Drop out some nodes in linear layers, features from convolution outputs
* Stochastic depth
  * Skip entire layers
* Anything that enables higher learning rates
  * Batch norm, larger batch sizes
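
As a sketch of dropout on a layer's outputs, the code below uses the "inverted dropout" variant, which rescales the surviving activations so that their expected value is unchanged. That variant, the function signature, and the seeded generator are assumptions for this example.

```python
import random

def dropout(activations, drop_prob, training=True, rng=random):
    """Zero each activation with probability drop_prob during training,
    scaling survivors by 1 / keep_prob. At test time, pass values through."""
    if not training or drop_prob == 0.0:
        return list(activations)
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]

rng = random.Random(0)
layer_output = [1.0] * 10000
dropped = dropout(layer_output, drop_prob=0.5, rng=rng)
print(sum(dropped) / len(dropped))  # close to the original mean of 1.0
```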

---

## What have we learned?

* AlexNet demonstrated the power of deep learning
  * Even on an objectively awful dataset
* The feature embedding in later layers could be used for clustering
  * Or combined with an SVM to classify objects that AlexNet never trained on
* This led to an explosion in the field

---

## Deeper Networks

* VGG
  * One of the early deep networks
  * Showed that weight initialization held back AlexNet
  * Trained on a shorter network first, then added new layers to the middle
* ResNets identified the same problem, but fixed it better

---

## ResNets

* Fix gradient problems by adding batch normalization in between layers
  * Automatically normalize layer outputs, which controls gradients
  * That fixes vanishing and exploding gradients, but training a 1000-layer network isn't possible
* Now add in skip connections, so that the convolutions only learn a residual to add to the input
  * Now 1000-layer networks are possible
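
The two ingredients can be sketched on plain Python lists. This is a simplified illustration: the normalization omits the learned scale and shift and the running statistics that real batch norm layers keep, and `residual_block` takes any function as its layer rather than a convolution. All names here are assumptions for this example.

```python
import math

def batch_norm(values, eps=1e-5):
    """Normalize one feature across a batch to zero mean and unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def residual_block(x, layer):
    """Output is the input plus whatever the layer computes, so the
    layer only needs to learn a residual."""
    return [xi + fi for xi, fi in zip(x, layer(x))]

batch = [100.0, 200.0, 300.0, 400.0]
normalized = batch_norm(batch)
print(normalized)

# A layer that outputs zeros leaves the input untouched.
zero_layer = lambda x: [0.0 for _ in x]
print(residual_block([1.0, 2.0], zero_layer))
```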

---

## Progress

* ResNet-adjacent architectures are still dominant for image tasks
* Transformers, using self-attention, are dominant with other structured data
  * Quickly supplanted RNNs/LSTMs
  * Easier to train with an arbitrarily sized context window
  * LSTMs are more justifiable, but sequence training is too difficult
* This is a huge simplification; there are a lot of models and applications out there

---

## Question 1

<div style="text-align: left;">
Which of the following is not an advantage of deep neural networks?

1) They can have many more parameters than training points but converge to a good solution anyway.
2) Any randomly initialized DNN is guaranteed to converge on the global minimum during training.
3) Deep neural networks trained on images seem to find nearly universal features, even for unseen object classes.
4) All of the above.

</div>

---

## Question 2

<div style="text-align: left;">
What can prevent unstable gradients?

1) Careful weight and bias initialization.
2) Batch normalization.
3) Replacing multiplications with addition in the network structure.
4) All of the above.

</div>

---

## Question 3

<div style="text-align: left;">
What is true about the vanishing gradient problem?

1) The vanishing gradient problem only exists because floating point numbers are approximations.
2) A network is so wide that the gradient is shared between too many weights.
3) A network multiplies too many values, forcing us to scale the gradient to a smaller value.
4) None of the above.

</div>

---

## Question 4

<div style="text-align: left;">
Why does an SVM work with the features from the final convolution of a neural network?

1) They do not.
2) Because the features are guaranteed to be linearly separable.
3) Feature universality seems to be an emergent property of DNNs trained on large bodies of images.
4) None of the above.

</div>

---

## Question 5

<div style="text-align: left;">
Which of these is not a method to escape the high cost of labelling?

1) Automated label generation
2) Unsupervised learning
3) Use a pretrained feature extractor to vastly reduce label needs.
4) All of the above are valid methods.
</div>

---

## Question 6

* Given the figure on the following slide, evaluate if the following statements are true or false. Assume that the test data is high quality.
  1. There is a mismatch in the training and testing data.
  2. Decreasing the model capacity to match the test data will result in a better generalizing model.
  3. The upward trajectory of the testing set error as capacity increases is unavoidable with DNNs.

---

## Question 6


---

## Question 7

* Given the figure on the following slide, evaluate if the following statements are true or false. Assume that the test data is high quality.
  1. This figure is unrealistic because test error cannot be lower than train error.
  2. The model may be failing to converge to 0 error because of errors in the training labels.
  3. The model may be failing to converge to 0 error because of a lack of capacity.

---

## Question 7

---

## Question 8

* Evaluate the veracity of the following statements about the decision boundaries in the following picture.
  1. If you want them to be smoother, you can change to a smoother nonlinear function in the DNN.
  2. If you want them to be smoother, you can increase the number of hidden layers in the DNN.
  3. If you want them to be smoother, you can feed the features from a hidden layer into an SVM that uses a smooth kernel.

---

## Question 8


---

## Answer 1

Which of the following is not an advantage of deep neural networks?

2) Any randomly initialized DNN is guaranteed to converge on the global minimum during training.

Convergence is not guaranteed. If it were, we wouldn't worry about exploding gradients.

---

## Answer 2

What can prevent unstable gradients?

4) All of the above.

Any of the answers can help with exploding or vanishing gradients.

---

## Answer 3

What is true about the vanishing gradient problem?

1) The vanishing gradient problem only exists because floating point numbers are approximations.

If we were using real numbers, it wouldn't matter how small they were. The multiplication issue makes our floating point numbers unstable, so removing multiplications can help (as in ResNet), but scaling (as in BatchNorm) doesn't cause vanishing gradients.
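
A small sketch of the floating point point made above: repeatedly multiplying by 0.25 (the largest possible sigmoid derivative, used here as an illustrative factor) underflows to exactly zero in double precision, while exact rational arithmetic stays positive.

```python
from fractions import Fraction

float_product = 1.0
exact_product = Fraction(1)
for _ in range(1000):
    float_product *= 0.25
    exact_product *= Fraction(1, 4)

print(float_product)       # 0.0: the float has underflowed
print(exact_product > 0)   # True: the exact value is tiny but nonzero
```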

---

## Answer 4

Why does an SVM work with the features from the final convolution of a neural network?

3) Feature universality seems to be an emergent property of DNNs trained on large bodies of images.

No guarantees, but, observationally, this seems to be true.

---

## Answer 5

Which of these is not a method to escape the high cost of labelling?

4) All of the above are valid methods.

---

## Answer 6

* Given the figure at right, evaluate if the following statements are true or false. Assume that the test data is high quality.
  1. True, or the curves would closely match
  2. True, even if it is not a solution we like
  3. False, we can find a model that has stronger regularizers


---

## Answer 7

* Given the figure at right, evaluate if the following statements are true or false. Assume that the test data is high quality.
  1. False. Test accuracy can be higher than train accuracy (the training data can be harder, noisier, augmented, etc.)
  2. True. Though if it seems to be preventing the test set from converging to 0, the errors must be biased in some way.
  3. True. This could also be true.


---

## Answer 8

* Evaluate the veracity of the following statements about the decision boundaries in the picture.
  1. True. These harder edges are something you see with ReLU.
  2. True. I plotted more versions and can make it arbitrarily smooth.
  3. True. Weird thing to do just for smoothness, but it is possible.
