# CS 462 - Lecture 07

## Initialization

Bernhard Firner

2026-02-12

---

## Book

* Today we'll be going through 7.5-7.6 of the [book](https://udlbook.github.io/udlbook/)

---

## Review

* We wanted to optimize our learning process
* Goals:
  * make gradient descent more robust
  * remove need to fiddle with learning rates

---

## Stochastic Gradient Descent

* We add randomness to the learning process
* Select random subsets of the training data without replacement
  * called batches
* The loss surface from each batch is slightly different
  * But anything similar to a global minimum should remain consistent
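
A minimal sketch of batching, assuming the training set can be addressed by integer indices. The helper name `make_batches` and the sizes are illustrative only. Shuffling once per epoch and then slicing gives random batches without replacement.

```python
import random

def make_batches(num_samples, batch_size, rng):
    """Split sample indices into random batches, without replacement."""
    indices = list(range(num_samples))
    rng.shuffle(indices)  # one shuffle per epoch
    return [indices[i:i + batch_size]
            for i in range(0, num_samples, batch_size)]

rng = random.Random(0)
batches = make_batches(num_samples=10, batch_size=4, rng=rng)
for batch in batches:
    print(batch)
```

Every index appears in exactly one batch, and the last batch is smaller when the batch size does not divide the dataset size.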

---

## SGD Illustration

---

## GD and Saddle Points

* Gradient descent can go the wrong way at saddle points

---

## SGD

* Can succeed where GD failed

---

## Momentum

* Stochastic gradient descent is an improvement
  * Still gets stuck in local minima
  * Even when there are better options nearby
* Momentum helps with that

---

## Avoiding Poor Minima

* We add momentum using the past gradient
  * $m_{t+1} \leftarrow \beta\cdot m_t + (1 - \beta)\frac{\partial L[x, \phi]}{\partial \phi}$
  * $\phi_{t+1} \leftarrow \phi_t - \alpha\cdot m_{t+1}$
* What if we go past the global minimum?
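
The two update rules above can be sketched on a toy problem. The loss $(\phi - 3)^2$, the starting point, and the values of $\alpha$ and $\beta$ are assumptions chosen for illustration. The history is kept so we can see whether momentum carries us past the minimum.

```python
def grad(phi):
    """Gradient of the toy loss L(phi) = (phi - 3)^2."""
    return 2.0 * (phi - 3.0)

alpha = 0.1   # learning rate
beta = 0.9    # momentum coefficient
phi = 0.0     # parameter
m = 0.0       # momentum

history = []
for step in range(200):
    m = beta * m + (1.0 - beta) * grad(phi)  # m_{t+1}
    phi = phi - alpha * m                    # phi_{t+1}
    history.append(phi)

print("final phi:", phi)
print("largest phi seen:", max(history))
```

In this sketch the parameter overshoots the minimum at 3, then slows down and settles back toward it.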

---

## Minima and Slopes

* Experience shows that global or near-global minima tend to be wider than local minima
* Momentum may carry us past either
  * But we will eventually slow and return to a wide minimum
  * A sharp minimum will be passed quickly

---

## SGD Without Momentum

---

## SGD with Momentum

---

## Improvements

* We can tweak the momentum slightly
  * Nesterov: use the predicted next point to calculate the change to momentum
  * Automatically adjust learning rate with normalized step size
* Adaptive Moment Estimation (Adam)
  * Add momentum to the step size normalization too!
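
A sketch of one common formulation of Adam, assuming the same toy loss $(\phi - 3)^2$ and typical default-style hyperparameters. The variable names and values here are illustrative, not taken from the lecture. It keeps one running average of the gradient and one of the squared gradient, and uses the second to normalize the step.

```python
import math

def grad(phi):
    """Gradient of the toy loss L(phi) = (phi - 3)^2."""
    return 2.0 * (phi - 3.0)

alpha, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
phi, m, v = 0.0, 0.0, 0.0

for t in range(1, 501):
    g = grad(phi)
    m = beta1 * m + (1.0 - beta1) * g       # momentum of the gradient
    v = beta2 * v + (1.0 - beta2) * g * g   # momentum of the squared gradient
    m_hat = m / (1.0 - beta1 ** t)          # bias correction for early steps
    v_hat = v / (1.0 - beta2 ** t)
    phi -= alpha * m_hat / (math.sqrt(v_hat) + eps)  # normalized step

print(phi)
```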

---

## Improvements

---

## Initialization

* We're still left with this
  * It's an initialization problem

---

## Initialization Woes

* Consider a network with two layers:
  * $f[x] = \beta_k + \Omega_k ReLU(\beta_{k-1} + \Omega_{k-1}x)$
* Imagine that x is in the range from 0 to 1
  * You initialize your network with a normal distribution
* What happens?

---

## Zero Gradients

* A zero-mean normal distribution has roughly half of its values < 0
  * So half of $ReLU(\beta_{k-1} + \Omega_{k-1}x)$ will be 0
* The output of that layer is half 0, and otherwise positive
* What happens if the network gets deeper?
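
A quick check of the "half are 0" claim, as a sketch. It assumes a scalar input in $[0, 1)$, a hidden width of 1000, and standard normal weights and biases, all of which are illustrative choices.

```python
import random

def relu(z):
    return max(0.0, z)

rng = random.Random(0)
width = 1000
x = rng.random()  # a scalar input in [0, 1)

# Zero-mean normal weights and biases for one hidden layer.
omega = [rng.gauss(0.0, 1.0) for _ in range(width)]
beta = [rng.gauss(0.0, 1.0) for _ in range(width)]

hidden = [relu(b + w * x) for b, w in zip(beta, omega)]
zero_fraction = sum(h == 0.0 for h in hidden) / width
print(zero_fraction)
```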

---

## Depth Problems

* $f[x] = \beta_k + \Omega_k ReLU(\beta_{k-1} + \Omega_{k-1}ReLU(\beta_{k-2} + \Omega_{k-2}x))$
* The 0 inputs to the second layer stay 0
* If half of $\Omega_{k-1}$ is negative (because it is normally distributed)
  * Now $\frac{3}{4}$ of the output is 0

---

## Not Just ReLU

* Don't think that this is just a ReLU problem
* Tanh and sigmoid used to be the popular activation functions
  * Weight initialization needed to trap values in the -2.5 to 2.5 range
  * Otherwise Tanh is approximately constant, with little gradient
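
To see how quickly the Tanh gradient fades, here is a small sketch that evaluates the derivative $1 - \tanh^2(z)$ at a few points. The sample points are arbitrary.

```python
import math

def tanh_grad(z):
    """Derivative of tanh: 1 - tanh(z)^2."""
    return 1.0 - math.tanh(z) ** 2

for z in (0.0, 1.0, 2.5, 5.0, 10.0):
    print(z, tanh_grad(z))
```

The gradient is 1 at the origin and already below 0.03 at $z = 2.5$.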


---

## Stability Problems

* Those normal weights didn't work. How about making them all large and positive?
  * Now the ReLUs aren't doing anything
* And our intermediate values grow incredibly quickly
* If the average input was 0.5 and the average weight was 0.5
  * With a width of 1000, the second layer's output is roughly $0.5\times0.5\times1000 = 250$

---

## Stability Continued

* $f[x] = \beta_k + \Omega_k ReLU(\beta_{k-1} + \Omega_{k-1}ReLU(\beta_{k-2} + \Omega_{k-2}x))$
* Each neuron of the second layer outputs an average of 250
  * What is the average output of the third layer?
  * $250\times0.5\times1000 = 125000$
  * This gets unstable very quickly
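
The same back-of-the-envelope arithmetic, continued for a few more layers. This sketch only tracks averages (0.5 input, 0.5 weight, width 1000, as above) and ignores biases.

```python
avg_input = 0.5
avg_weight = 0.5
width = 1000

value = avg_input
values = []
for layer in range(2, 7):
    # Each neuron sums `width` products of (average input) * (average weight).
    value = value * avg_weight * width
    values.append(value)
    print(f"layer {layer}: {value:g}")
```

Each layer multiplies the average by 500, so the values pass $10^{13}$ within five layers.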

---

## Problems

* Bad initialization makes learning difficult
  * From starting in minima where SGD won't work
  * Or from numerical problems
* The numerical issue is easier to examine, so let's begin there

---

## Backpropagation

* If the forward pass is very small or large, it affects backpropagation too
* Numerical problems have names:
  * **vanishing gradient** and **exploding gradient**

---

## Gradients

---

## Real Numbers

* Far away from 0, Tanh still has a nonzero gradient, but it is very small
  * So theoretically it's fine, SGD should still work
* And we could say the same thing about vanishing or exploding gradients
* But we aren't working with real numbers
  * We are using floating point values!

---

## Any Activation

* In fact, these problems occur with any activation function
* The real problem is the multiplications of weights
  * Forget about the biases and activations for a moment
  * The neurons on the kth layer have an output something like this:
  * $(\prod_{i=1}^k \omega_i)x$
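
A sketch of that product with a single scalar weight repeated at every layer. The weights 0.5 and 1.5 and the depths are assumptions for illustration. Python floats are 64-bit, so an extreme depth is needed to hit the limits; 32-bit floats run out of range much sooner.

```python
import math

def weight_product(weight, depth):
    """Multiply `depth` copies of the same scalar weight together."""
    product = 1.0
    for _ in range(depth):
        product *= weight
    return product

for depth in (10, 100, 2000):
    print(depth, weight_product(0.5, depth), weight_product(1.5, depth))
```

At depth 2000 the small product underflows to exactly `0.0` and the large one overflows to `inf`, so the gradient information is gone rather than merely small or large.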

---

## Why Deeper?

* But why do we want to make such a deep network?
  * To make it less likely we start in a bad local minimum!
* Increasing model capacity smooths the loss surface 
  * And the most efficient way to add more capacity is with depth

---

## Good Initialization

* There must be something in between too large and too small
* Let's again assume that bias is 0 to simplify things
* Assume that weights are initialized with a normal distribution
  * 0 mean, and variance $\sigma^2$
  * Can we choose parameters that keep variance constant through all layers?

---

## Expectations

* What is the expected output of a neuron?
  * Let's say the input to the neuron on layer $i$ is $h$
  * It is made of multiple values, $h_1$, $h_2$, and so on
* The expected output of that neuron is
  * $E[f_i] = E[\beta_i + \sum_{j=1}^{W_{i-1}}\Omega_{ij}h_j]$

---

## Simplifying

* $E[f_i] = E[\beta_i + \sum_{j=1}^{W_{i-1}}\Omega_{ij}h_j]$
* $E[f_i] = E[\beta_i] + \sum_{j=1}^{W_{i-1}}E[\Omega_{ij}]E[h_j]$
  * If we assume independence between $h$ and $\Omega_i$
* We said that we would zero the bias and use a 0 mean normal for the weights, so
  * $E[f_i] = 0 + \sum_{j=1}^{W_{i-1}}0\cdot E[h_j] = 0$

---

## Variance

* The mean is 0, but post ReLU that won't be true
  * Will the numbers be stable?
  * It depends upon the variance
* $\sigma^2_{f_i} = E[f_i^2] - E[f_i]^2$

---

## Simplifying

$$
\begin{aligned}
\sigma^2_{f_i} & = E[f_i^2] - E[f_i]^2 \\
               & = E[(\beta_i + \sum_{j=1}^{W_{i-1}}\Omega_{ij}h_j)^2] - 0 \\
               & = E[(\sum_{j=1}^{W_{i-1}}\Omega_{ij}h_j)^2] \\
               & = \sum_{j=1}^{W_{i-1}}E[\Omega_{ij}^2]E[h_j^2] \\
               & = \sum_{j=1}^{W_{i-1}}\sigma_{\Omega}^2E[h_j^2] \\
               & = \sigma_{\Omega}^2\sum_{j=1}^{W_{i-1}}E[h_j^2]
\end{aligned}
$$

---

## Cleaning up

* Let's assume the previous layer's output $f_{i-1}$ goes through a ReLU to produce $h$
* With 0 mean, half of the values are clipped to 0, so $E[h_j^2] = \frac{\sigma^2_{f_{i-1}}}{2}$
  * $\sigma^2_{f_i} = \sigma_{\Omega}^2\sum_{j=1}^{W_{i-1}}\frac{\sigma^2_{f_{i-1}}}{2}$
* That finally simplifies to
  * $\sigma^2_{f_i} = \frac{W_{i-1}\sigma_{\Omega}^2\sigma^2_{f_{i-1}}}{2}$
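
We can sanity check this formula with a small Monte Carlo sketch. It assumes zero bias, normally distributed pre-activations on the previous layer, and arbitrary illustrative values for the width and the two standard deviations. The helper `sample_preactivation` is hypothetical.

```python
import random

def relu(z):
    return max(0.0, z)

def sample_preactivation(width, sigma_omega, sigma_prev, rng):
    """One sample of f_i = sum_j Omega_ij * ReLU(f_{i-1,j}), with zero bias."""
    total = 0.0
    for _ in range(width):
        h = relu(rng.gauss(0.0, sigma_prev))  # previous layer after ReLU
        total += rng.gauss(0.0, sigma_omega) * h
    return total

rng = random.Random(0)
width, sigma_omega, sigma_prev = 50, 0.3, 1.5
samples = [sample_preactivation(width, sigma_omega, sigma_prev, rng)
           for _ in range(5000)]

mean = sum(samples) / len(samples)
empirical = sum((s - mean) ** 2 for s in samples) / len(samples)
predicted = width * sigma_omega ** 2 * sigma_prev ** 2 / 2.0
print(empirical, predicted)
```

The empirical variance should land close to the predicted value, up to sampling noise.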

---

## Interpretation

* The expected variance depends upon
  * the width of the previous layer, $W_{i-1}$, the variance of the weights, and the variance of the previous layer
* Since we are choosing the initialization, we can set the weight variance so that the variance is unchanged from one layer to the next
  * $\sigma^2_{\Omega_i} = \frac{2}{W_{i-1}}$
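
A sketch of what this choice buys us with depth. It pushes one random vector through 10 ReLU layers of width 100 with zero bias, using weight variance $c / W$ for $c = 1, 2, 4$. The function name, width, depth, and seed are assumptions for illustration. Because the seed is shared, the three runs use the same random draws and differ only in scale.

```python
import random

def relu(z):
    return max(0.0, z)

def final_variance(weight_variance_scale, width=100, depth=10, seed=0):
    """Push one random input through `depth` ReLU layers with zero bias.

    Weights are drawn from N(0, weight_variance_scale / width).
    Returns the mean squared pre-activation of the last layer.
    """
    rng = random.Random(seed)
    sigma = (weight_variance_scale / width) ** 0.5
    f = [rng.gauss(0.0, 1.0) for _ in range(width)]  # unit-variance start
    for _ in range(depth):
        h = [relu(v) for v in f]
        f = [sum(rng.gauss(0.0, sigma) * hj for hj in h) for _ in range(width)]
    return sum(v * v for v in f) / width

results = {scale: final_variance(scale) for scale in (1.0, 2.0, 4.0)}
for scale, value in results.items():
    print(scale, value)
```

With $c = 2$ the result stays on the order of 1, while $c = 1$ and $c = 4$ shrink or grow by roughly a factor of 2 per layer.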

---

## Using it

* Called Kaiming or He initialization
* PyTorch offers this initialization as an existing function
  * [kaiming normal](https://docs.pytorch.org/docs/stable/nn.init.html#torch.nn.init.kaiming_normal_)
* You can read more in the [arxiv paper](https://arxiv.org/abs/1502.01852)
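
A minimal usage sketch, assuming PyTorch is installed. The layer sizes are arbitrary, and zeroing the bias is a choice made here to match the derivation, not something PyTorch does by default.

```python
import math
import torch
from torch import nn

torch.manual_seed(0)
layer = nn.Linear(in_features=1000, out_features=500)

# mode="fan_in" uses the number of inputs, matching sigma^2 = 2 / W_{i-1}.
nn.init.kaiming_normal_(layer.weight, mode="fan_in", nonlinearity="relu")
nn.init.zeros_(layer.bias)  # the derivation assumed zero bias

expected_std = math.sqrt(2.0 / 1000)
actual_std = layer.weight.std().item()
print(expected_std, actual_std)
```

The measured standard deviation of the weights should be close to $\sqrt{2/1000}$.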

---

## Fan in, fan out

* In PyTorch, there is an option for the fan mode (`mode='fan_in'` or `mode='fan_out'`)
  * $\sigma^2_{\Omega_i} = \frac{2}{W_{i-1}}$ stabilizes the variance in the forward direction
* We could instead stabilize the backward direction with $\sigma^2_{\Omega_i} = \frac{2}{W_{i}}$
* Or we could average the two
  * $\sigma^2_{\Omega_i} = \frac{4}{W_{i-1}+W_{i}}$
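
The three options side by side, as a small sketch. The helper `he_variances` and the example layer shape are hypothetical. Here `fan_in` plays the role of $W_{i-1}$ and `fan_out` the role of $W_i$.

```python
def he_variances(fan_in, fan_out):
    """Return the three weight variances discussed above for one layer."""
    return {
        "fan_in": 2.0 / fan_in,                # stabilizes the forward pass
        "fan_out": 2.0 / fan_out,              # stabilizes the backward pass
        "average": 4.0 / (fan_in + fan_out),   # compromise between the two
    }

variances = he_variances(fan_in=1000, fan_out=250)
for name, value in variances.items():
    print(name, value)
```

When the two widths are equal, all three formulas give the same variance.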

---

## Bias

* We've been ignoring bias, but it is also important
  * Although He only mentions the word "bias" twice in the paper
  * Bound the bias value by a constant multiplied by the gain of the nonlinear units
* [PyTorch documentation](https://docs.pytorch.org/docs/stable/nn.init.html#torch.nn.init.kaiming_uniform_)

---

## Inputs

* Notice that we didn't mention $x$
  * But obviously it can cause problems too
* So we usually standardize our input
  * Depending upon the problem, either make it 0 mean or uniform in a small range
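
A sketch of both options on a handful of made-up pixel values. The helper names are hypothetical, and rescaling to $[0, 1]$ is just one reading of "a small range".

```python
import statistics

def standardize(values):
    """Shift to zero mean and scale to unit standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def rescale(values, low=0.0, high=1.0):
    """Linearly map values into the range [low, high]."""
    v_min, v_max = min(values), max(values)
    return [low + (v - v_min) * (high - low) / (v_max - v_min) for v in values]

pixels = [0, 32, 64, 128, 255]  # made-up 8-bit pixel intensities
standardized = standardize(pixels)
rescaled = rescale(pixels)
print(standardized)
print(rescaled)
```

In practice the mean and standard deviation would be computed once on the training set and reused for validation and test data.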

---

## Other Solutions

* Remember, the goal here was to unlock deep networks
  * Those tend to solve the problems of bad, wide local minima
* There are other solutions
  * We could manually adjust outputs to keep them stable
    * These are techniques called *[batch normalization](https://docs.pytorch.org/docs/stable/generated/torch.nn.BatchNorm2d.html#torch.nn.BatchNorm2d)* and *[layer normalization](https://docs.pytorch.org/docs/stable/generated/torch.nn.modules.normalization.LayerNorm.html#torch.nn.modules.normalization.LayerNorm)*

---

## More Solutions

* A much earlier initialization scheme was used in LeNet
  * [Handwritten Digit Recognition with a Back-Propagation Network](https://proceedings.neurips.cc/paper_files/paper/1989/file/53c3bce66e43be4f209556518c2fcb54-Paper.pdf)
  * This was used with the sigmoid activation

---

## Other Activations

* As with everything in neural networks, there are many options
  * The He et al. paper proposed *PReLU*
* And a later paper called [Self-Normalizing Neural Networks](https://arxiv.org/abs/1706.02515) proposed SELUs

---

## Best Solution

* There is no "best" solution to initialization, normalization, and activation function choice
* In general, start with ReLUs and He initialization
  * But if you are working on something specific, see what is being used

---

## Goals

* We've covered all of the basics
  * So now we can begin training models!
* I've uploaded the digits dataset to Canvas

---

## Digits

* One of the first "big" datasets
  * Only 60k samples

---

## Learning on Images

* The digits are obviously images
  * 28x28 pixels
* Linear layers will take each pixel as an input
* We will pick up there next class
* Be sure to go to recitation!

<!--

TODO Demonstrate digits with different initializations

Graph results from
../CS461/examples/25_mnist_demos.py
with different initializations and with and without normalizing inputs.

Maybe cite the lenet paper about initialization, and resnet too?
-->