# CS 462 - Lecture 13

## Advanced Convnets

Bernhard Firner

2026-03-10

---

# Advanced Convnets

* Chapter 11!

---

## Advanced Convnets

* What makes a convnet "advanced"?
* I'm going to draw an arbitrary line at a particular innovation
  * This "one simple trick" allowed us to go from tens of layers to 1000
  * For reference, LeNet had around 5 layers and AlexNet had 8
* Let's pick up where we were before and review Batch Normalization

---

## Batch Normalization

* Proposed by Ioffe and Szegedy in 2015
  * [Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift](https://proceedings.mlr.press/v37/ioffe15.html)
* ReLU became popular because it enabled faster training than Tanh
  * But when a unit's input gets stuck in the < 0 area, its gradient is 0 and learning stops
  * He initialization gives us a good starting point, but it cannot control where the gradient updates take our parameters afterwards

---

## Batch Norm Mechanics

* In PyTorch: [BatchNorm2d](https://docs.pytorch.org/docs/stable/generated/torch.nn.BatchNorm2d.html#batchnorm2d)
* Normalizes each feature map (channel) to 0 mean and unit variance, using the statistics of the current batch
  * Enforced as the parameters change
* Keeps running estimates of the mean and variance of its inputs to use after training is complete
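
The two modes can be seen on a toy batch. This is a minimal sketch, assuming a made-up input of 8 images with 15 feature maps; the tensor sizes and the shift and scale applied to the input are illustrative choices, not values from the lecture.

```python
import torch
from torch import nn

torch.manual_seed(0)

# A batch of 8 "images" with 15 feature maps of 16x16, deliberately shifted and scaled.
x = 3.0 + 2.0 * torch.randn(8, 15, 16, 16)

bn = nn.BatchNorm2d(15)  # defaults: eps=1e-05, momentum=0.1, affine=True

# Training mode: normalize with the statistics of the current batch.
bn.train()
y_train = bn(x)
channel_mean = y_train.mean(dim=(0, 2, 3))  # one value per feature map
channel_var = y_train.var(dim=(0, 2, 3), unbiased=False)

# With momentum=0.1 the running mean moved 10% of the way from its
# initial value (0) toward the batch mean.
running_mean = bn.running_mean.clone()

# Evaluation mode: normalize with the stored running estimates instead.
bn.eval()
y_eval = bn(x)

print(channel_mean.abs().max().item())  # close to 0
print(channel_var.mean().item())        # close to 1
print(running_mean.mean().item())       # roughly 0.1 * 3.0
print(y_eval.mean().item())             # not close to 0 after a single batch
```

After only one training batch the running estimates are still far from the true statistics, so the evaluation-mode output is not yet normalized. They converge as more batches are seen.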

---

## Also a Regularizer

* Because the estimates change from batch to batch, this adds noise directly to the layer outputs
* If you recall from our discussion of regularizers, adding noise like this is also a regularization technique
  * For free!
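
The noise is easy to see with plain Python. The sketch below places the same activation value into many different small batches and normalizes it with each batch's own statistics. The population, the probe value, and the batch size of 8 are all assumptions chosen for illustration.

```python
import random
import statistics

random.seed(0)

# Hypothetical pre-normalization activations for one feature map.
population = [random.gauss(3.0, 2.0) for _ in range(8000)]


def normalize_with_batch(value, batch, eps=1e-5):
    """Normalize one activation using the mean and variance of its batch."""
    mean = statistics.fmean(batch)
    var = statistics.pvariance(batch)
    return (value - mean) / (var + eps) ** 0.5


probe = 4.0      # the same activation value, placed in different batches
batch_size = 8
outputs = []
for start in range(0, 800, batch_size):
    batch = [probe] + population[start:start + batch_size - 1]
    outputs.append(normalize_with_batch(probe, batch))

spread = statistics.pstdev(outputs)
print(f"normalized value ranges from {min(outputs):.2f} to {max(outputs):.2f}")
print(f"standard deviation across batches: {spread:.2f}")
```

The input never changes, yet its normalized value does, because the batch statistics are themselves noisy estimates. Larger batches shrink this noise, and with it the regularization effect.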

---

## Batch Norm

* We had just added batch norm and went to 7 convolution layers

```
DeeperBNConvNet(
  (net): Sequential(
    (0): Conv2d(3, 15, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): ReLU()
    (2): Conv2d(15, 15, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (3): BatchNorm2d(15, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (4): ReLU()
    (5): Conv2d(15, 15, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (6): ReLU()
    (7): Conv2d(15, 15, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (8): BatchNorm2d(15, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (9): ReLU()
    (10): Conv2d(15, 15, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (11): ReLU()
    (12): Conv2d(15, 15, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (13): BatchNorm2d(15, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (14): ReLU()
    (15): Conv2d(15, 15, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (16): ReLU()
    (17): Flatten(start_dim=1, end_dim=-1)
    (18): Linear(in_features=60, out_features=60, bias=True)
    (19): Linear(in_features=60, out_features=10, bias=True)
  )
  (decision): Softmax(dim=1)
)
```

---

## Adding More

* Notice that convolutions can preserve the image size
  * A 3x3 convolution with stride 1 and padding 1, for example
* So can we stack any number of convolutions here?
  * BatchNorm keeps things stable, right?
  * Let's just replicate each stride 1 convolution 10 times

---

## The Model

```
VariableBNConvNet(
  (net): Sequential(
    (0): Conv2d(3, 15, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): ReLU()
    (2): Conv2d(15, 15, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (3): BatchNorm2d(15, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (4): ReLU()
    (5): Conv2d(15, 15, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (6): BatchNorm2d(15, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (7): ReLU()
    (8): Conv2d(15, 15, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (9): BatchNorm2d(15, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (10): ReLU()
    (11): Conv2d(15, 15, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (12): BatchNorm2d(15, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (13): ReLU()
    (14): Conv2d(15, 15, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (15): BatchNorm2d(15, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (16): ReLU()
    (17): Conv2d(15, 15, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (18): BatchNorm2d(15, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (19): ReLU()
    (20): Conv2d(15, 15, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (21): BatchNorm2d(15, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (22): ReLU()
    (23): Conv2d(15, 15, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (24): BatchNorm2d(15, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (25): ReLU()
    (26): Conv2d(15, 15, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (27): BatchNorm2d(15, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (28): ReLU()
    (29): Conv2d(15, 15, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (30): BatchNorm2d(15, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (31): ReLU()
    (32): Conv2d(15, 15, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (33): BatchNorm2d(15, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (34): ReLU()
    (35): Conv2d(15, 15, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (36): BatchNorm2d(15, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (37): ReLU()
    (38): Conv2d(15, 15, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (39): BatchNorm2d(15, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (40): ReLU()
    (41): Conv2d(15, 15, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (42): BatchNorm2d(15, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (43): ReLU()
    (44): Conv2d(15, 15, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (45): BatchNorm2d(15, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (46): ReLU()
    (47): Conv2d(15, 15, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (48): BatchNorm2d(15, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (49): ReLU()
    (50): Conv2d(15, 15, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (51): BatchNorm2d(15, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (52): ReLU()
    (53): Conv2d(15, 15, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (54): BatchNorm2d(15, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (55): ReLU()
    (56): Conv2d(15, 15, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (57): BatchNorm2d(15, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (58): ReLU()
    (59): Conv2d(15, 15, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (60): BatchNorm2d(15, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (61): ReLU()
    (62): Conv2d(15, 15, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (63): BatchNorm2d(15, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (64): ReLU()
    (65): Conv2d(15, 15, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (66): BatchNorm2d(15, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (67): ReLU()
    (68): Conv2d(15, 15, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (69): BatchNorm2d(15, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (70): ReLU()
    (71): Conv2d(15, 15, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (72): BatchNorm2d(15, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (73): ReLU()
    (74): Conv2d(15, 15, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (75): BatchNorm2d(15, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (76): ReLU()
    (77): Conv2d(15, 15, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (78): BatchNorm2d(15, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (79): ReLU()
    (80): Conv2d(15, 15, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (81): BatchNorm2d(15, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (82): ReLU()
    (83): Conv2d(15, 15, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (84): BatchNorm2d(15, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (85): ReLU()
    (86): Conv2d(15, 15, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (87): BatchNorm2d(15, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (88): ReLU()
    (89): Conv2d(15, 15, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (90): ReLU()
    (91): Flatten(start_dim=1, end_dim=-1)
    (92): Linear(in_features=60, out_features=60, bias=True)
    (93): Linear(in_features=60, out_features=10, bias=True)
  )
  (decision): Softmax(dim=1)
)
Model has 66760 parameters.
```
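
A layout like the one printed above does not have to be typed out by hand. The following is a sketch of one way to generate it, not the code used for the lecture: `build_stacked_bn_convnet` and its arguments are hypothetical names, the input is assumed to be a 3-channel 32x32 image, and the `Softmax` decision module is left out so the function returns logits.

```python
import torch
from torch import nn


def build_stacked_bn_convnet(width=15, stride1_per_stage=9, num_classes=10):
    """Sketch of a Sequential with the same layer layout as the printout.

    Assumes 3-channel 32x32 inputs, so four stride-2 convolutions
    leave a 2x2 feature map in front of the linear layers.
    """
    def conv(in_ch, stride):
        return nn.Conv2d(in_ch, width, kernel_size=3, stride=stride, padding=1)

    # Input convolution: stride 1, no batch norm, as in the printout.
    layers = [conv(3, 1), nn.ReLU()]
    for stage in range(3):
        # The input convolution counts as the first stride-1 layer of stage 0.
        repeats = stride1_per_stage - 1 if stage == 0 else stride1_per_stage
        for _ in range(repeats):
            layers += [conv(width, 1), nn.BatchNorm2d(width), nn.ReLU()]
        # Each stage ends by halving the spatial size.
        layers += [conv(width, 2), nn.BatchNorm2d(width), nn.ReLU()]
    # Final downsampling convolution (no batch norm) and the classifier head.
    layers += [
        conv(width, 2),
        nn.ReLU(),
        nn.Flatten(),
        nn.Linear(width * 2 * 2, 60),
        nn.Linear(60, num_classes),
    ]
    return nn.Sequential(*layers)


model = build_stacked_bn_convnet(width=15, stride1_per_stage=9)
n_params = sum(p.numel() for p in model.parameters())
logits = model(torch.randn(4, 3, 32, 32))
print(len(model), n_params, tuple(logits.shape))  # 94 66760 (4, 10)
```

With `stride1_per_stage=9` the sketch reproduces the 94 modules and 66760 parameters of the printout, which has 9 stride 1 convolutions in each stage.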

---

## Results

---

## Wider

* That's better
  * The testing loss and accuracy are close to the training results
* But we're still bumping around the mid 60s
* Let's increase the width from 15 to 30 feature maps

---

## Next Iteration

* Jumped about 4x to 250k parameters

```
Training on 50000 examples over 6250 batches
VariableBNConvNet(
  (net): Sequential(
    (0): Conv2d(3, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): ReLU()
    (2): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (3): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (4): ReLU()
    (5): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (6): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (7): ReLU()
    (8): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (9): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (10): ReLU()
    (11): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (12): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (13): ReLU()
    (14): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (15): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (16): ReLU()
    (17): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (18): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (19): ReLU()
    (20): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (21): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (22): ReLU()
    (23): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (24): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (25): ReLU()
    (26): Conv2d(30, 30, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (27): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (28): ReLU()
    (29): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (30): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (31): ReLU()
    (32): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (33): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (34): ReLU()
    (35): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (36): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (37): ReLU()
    (38): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (39): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (40): ReLU()
    (41): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (42): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (43): ReLU()
    (44): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (45): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (46): ReLU()
    (47): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (48): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (49): ReLU()
    (50): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (51): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (52): ReLU()
    (53): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (54): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (55): ReLU()
    (56): Conv2d(30, 30, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (57): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (58): ReLU()
    (59): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (60): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (61): ReLU()
    (62): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (63): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (64): ReLU()
    (65): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (66): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (67): ReLU()
    (68): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (69): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (70): ReLU()
    (71): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (72): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (73): ReLU()
    (74): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (75): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (76): ReLU()
    (77): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (78): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (79): ReLU()
    (80): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (81): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (82): ReLU()
    (83): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (84): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (85): ReLU()
    (86): Conv2d(30, 30, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (87): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (88): ReLU()
    (89): Conv2d(30, 30, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (90): ReLU()
    (91): Flatten(start_dim=1, end_dim=-1)
    (92): Linear(in_features=120, out_features=60, bias=True)
    (93): Linear(in_features=60, out_features=10, bias=True)
  )
  (decision): Softmax(dim=1)
)
Model has 254350 parameters.
```
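
The parameter counts printed for the two models can be reproduced by hand. The sketch below counts weights and biases for the 31 convolutions, 29 batch norm layers, and 2 linear layers in the printouts; the helper names are made up for this example.

```python
def conv_params(in_ch, out_ch, k=3):
    """Weights of a k x k convolution plus one bias per output channel."""
    return in_ch * out_ch * k * k + out_ch


def linear_params(n_in, n_out):
    return n_in * n_out + n_out


def total_params(width, n_convs=31, n_batchnorms=29):
    # One 3 -> width input convolution, the rest are width -> width.
    convs = conv_params(3, width) + (n_convs - 1) * conv_params(width, width)
    # Each BatchNorm2d learns a scale and a shift per feature map.
    batchnorms = n_batchnorms * 2 * width
    # The final 2x2 feature map is flattened to 4 * width values.
    head = linear_params(4 * width, 60) + linear_params(60, 10)
    return convs + batchnorms + head


narrow = total_params(15)
wide = total_params(30)
print(narrow, wide, round(wide / narrow, 2))  # 66760 254350 3.81
```

Almost all of the parameters sit in the `width -> width` convolutions, whose size grows with the square of the width. That is why doubling the width roughly quadruples the total.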

---

## Results

* At 100 epochs testing accuracy eventually reaches 75%

---

## Efficiency

* The wider network is better
* But adding width is less efficient
  * Remember that 250k parameters at 4 bytes per float is about 1 MB
* Deeper is more efficient, so let's make it deeper before making it even wider
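
A quick calculation shows the difference in cost. This sketch only counts parameters for a single 3x3 convolution at a few widths, then converts the 254350 parameter model to bytes, assuming 32-bit floats.

```python
BYTES_PER_FLOAT32 = 4


def conv_layer_params(width, k=3):
    """Parameters of one width -> width convolution with a k x k kernel."""
    return width * width * k * k + width


# Doubling the width roughly quadruples the cost of every layer.
for width in (15, 30, 60):
    print(width, conv_layer_params(width))  # 2040, then 8130, then 32460

# Adding one more layer at a fixed width only adds one more layer's worth.
extra_layer = conv_layer_params(30)

# Storage for the weights of the 254350 parameter model.
weights_bytes = 254_350 * BYTES_PER_FLOAT32
print(f"{weights_bytes / 1e6:.2f} MB")  # 1.02 MB
```

Width is paid for quadratically in every layer, while depth is paid for linearly.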

---

## 30x Stride 1s

* Keeping feature maps at 30, but with 3 times as many stride 1 convolutions as the last model
* This bumps us up to 750k parameters

```
VariableBNConvNet(
  (net): Sequential(
    (0): Conv2d(3, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): ReLU()
    (2): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (3): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (4): ReLU()
    (5): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (6): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (7): ReLU()
    (8): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (9): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (10): ReLU()
    (11): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (12): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (13): ReLU()
    (14): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (15): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (16): ReLU()
    (17): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (18): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (19): ReLU()
    (20): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (21): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (22): ReLU()
    (23): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (24): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (25): ReLU()
    (26): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (27): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (28): ReLU()
    (29): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (30): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (31): ReLU()
    (32): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (33): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (34): ReLU()
    (35): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (36): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (37): ReLU()
    (38): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (39): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (40): ReLU()
    (41): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (42): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (43): ReLU()
    (44): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (45): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (46): ReLU()
    (47): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (48): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (49): ReLU()
    (50): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (51): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (52): ReLU()
    (53): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (54): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (55): ReLU()
    (56): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (57): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (58): ReLU()
    (59): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (60): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (61): ReLU()
    (62): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (63): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (64): ReLU()
    (65): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (66): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (67): ReLU()
    (68): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (69): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (70): ReLU()
    (71): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (72): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (73): ReLU()
    (74): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (75): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (76): ReLU()
    (77): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (78): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (79): ReLU()
    (80): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (81): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (82): ReLU()
    (83): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (84): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (85): ReLU()
    (86): Conv2d(30, 30, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (87): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (88): ReLU()
    (89): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (90): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (91): ReLU()
    (92): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (93): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (94): ReLU()
    (95): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (96): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (97): ReLU()
    (98): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (99): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (100): ReLU()
    (101): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (102): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (103): ReLU()
    (104): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (105): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (106): ReLU()
    (107): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (108): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (109): ReLU()
    (110): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (111): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (112): ReLU()
    (113): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (114): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (115): ReLU()
    (116): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (117): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (118): ReLU()
    (119): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (120): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (121): ReLU()
    (122): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (123): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (124): ReLU()
    (125): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (126): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (127): ReLU()
    (128): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (129): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (130): ReLU()
    (131): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (132): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (133): ReLU()
    (134): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (135): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (136): ReLU()
    (137): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (138): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (139): ReLU()
    (140): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (141): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (142): ReLU()
    (143): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (144): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (145): ReLU()
    (146): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (147): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (148): ReLU()
    (149): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (150): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (151): ReLU()
    (152): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (153): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (154): ReLU()
    (155): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (156): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (157): ReLU()
    (158): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (159): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (160): ReLU()
    (161): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (162): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (163): ReLU()
    (164): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (165): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (166): ReLU()
    (167): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (168): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (169): ReLU()
    (170): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (171): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (172): ReLU()
    (173): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (174): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (175): ReLU()
    (176): Conv2d(30, 30, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (177): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (178): ReLU()
    (179): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (180): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (181): ReLU()
    (182): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (183): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (184): ReLU()
    (185): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (186): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (187): ReLU()
    (188): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (189): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (190): ReLU()
    (191): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (192): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (193): ReLU()
    (194): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (195): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (196): ReLU()
    (197): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (198): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (199): ReLU()
    (200): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (201): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (202): ReLU()
    (203): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (204): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (205): ReLU()
    (206): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (207): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (208): ReLU()
    (209): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (210): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (211): ReLU()
    (212): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (213): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (214): ReLU()
    (215): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (216): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (217): ReLU()
    (218): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (219): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (220): ReLU()
    (221): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (222): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (223): ReLU()
    (224): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (225): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (226): ReLU()
    (227): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (228): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (229): ReLU()
    (230): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (231): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (232): ReLU()
    (233): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (234): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (235): ReLU()
    (236): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (237): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (238): ReLU()
    (239): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (240): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (241): ReLU()
    (242): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (243): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (244): ReLU()
    (245): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (246): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (247): ReLU()
    (248): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (249): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (250): ReLU()
    (251): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (252): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (253): ReLU()
    (254): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (255): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (256): ReLU()
    (257): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (258): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (259): ReLU()
    (260): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (261): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (262): ReLU()
    (263): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (264): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (265): ReLU()
    (266): Conv2d(30, 30, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (267): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (268): ReLU()
    (269): Conv2d(30, 30, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (270): ReLU()
    (271): Flatten(start_dim=1, end_dim=-1)
    (272): Linear(in_features=120, out_features=60, bias=True)
    (273): Linear(in_features=60, out_features=10, bias=True)
  )
  (decision): Softmax(dim=1)
)
Model has 745750 parameters.
```

---

## Results

* Could it catch up? Maybe, but this is really slow

---

## Problems

* This problem with deeper networks is common and well-known
  * Here's the figure from the book showing the same issue on CIFAR-10
  * Without batch normalization things are worse

---

## Why?

* Both the training and test errors increase
  * So don't say that this is overfitting
* One explanation has to do with the loss surface
* We know that each layer is separately optimized
  * Is something different occurring in the earlier layers?

---

## Gradient Woes

* A change in layer 1 leads to a large change in layer 20
  * This may cause difficulty learning in that layer
* Recall, $\frac{\partial y}{\partial f_1} = \frac{\partial f_2}{\partial f_1} \frac{\partial f_3}{\partial f_2} \cdots \frac{\partial f_n}{\partial f_{n-1}}$, where $y = f_n$
  * That gradient is correct, but we change all of the layers at once
    * After changing layer 1, is the gradient still correct for $f_2, f_3, ..., f_n$?
    * Maybe not, unless the update is infinitesimally small
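
A minimal sketch of this effect, assuming a small tanh MLP and random data (the helper names `make_mlp` and `relative_prediction_error` are made up for illustration). The gradient predicts a loss change of $-\eta \lVert g \rVert^2$ for a step of size $\eta$; the sketch measures how far the actual change is from that prediction when every layer is updated at once.

```python
import copy
import torch

torch.manual_seed(0)

def make_mlp(depth=10, width=16):
    """A plain stack of Linear + Tanh layers (hypothetical toy network)."""
    layers = []
    for _ in range(depth):
        layers += [torch.nn.Linear(width, width), torch.nn.Tanh()]
    layers.append(torch.nn.Linear(width, 1))
    # Double precision keeps rounding error out of the comparison.
    return torch.nn.Sequential(*layers).double()

def relative_prediction_error(net, x, target, lr):
    """Compare the first-order predicted loss change with the actual change."""
    net = copy.deepcopy(net)  # leave the caller's network untouched
    loss = torch.nn.functional.mse_loss(net(x), target)
    params = list(net.parameters())
    grads = torch.autograd.grad(loss, params)
    # First-order prediction for a gradient step: -lr * ||g||^2
    predicted = -lr * sum((g * g).sum() for g in grads)
    with torch.no_grad():
        for p, g in zip(params, grads):
            p -= lr * g  # update all layers simultaneously
        new_loss = torch.nn.functional.mse_loss(net(x), target)
    actual = new_loss - loss
    return (abs(actual - predicted) / abs(predicted)).item()

net = make_mlp()
x = torch.randn(32, 16, dtype=torch.float64)
target = torch.randn(32, 1, dtype=torch.float64)
small_step_error = relative_prediction_error(net, x, target, lr=1e-3)
large_step_error = relative_prediction_error(net, x, target, lr=1.0)
print(f"relative error, lr=1e-3: {small_step_error:.2e}")
print(f"relative error, lr=1.0:  {large_step_error:.2e}")
```

The prediction should be close for the tiny step and noticeably worse for the large one, which is the sense in which the gradient is only "correct" for infinitesimal updates.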

---

## Shattered Gradients

* The loss surface is not smooth enough to traverse
  * Instead, it is like navigating a chaotic surface, with no consistent direction
* This problem is called **shattered gradients**

---

## Shattered Gradients

* If we plot the gradient and its autocorrelation, we see that the correlation between nearby gradients disappears for deep networks
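
A rough sketch of how such a measurement could be set up. Everything here is an assumption for illustration: plain ReLU networks with He initialization and no biases, inputs swept around the unit circle, and the made-up helpers `make_plain_net` and `gradient_autocorrelation`. It is not the exact experiment behind the figure.

```python
import math
import torch

torch.manual_seed(0)

def make_plain_net(depth, width=200):
    """Plain (no skip connections) ReLU network mapping 2-D inputs to a scalar."""
    layers, in_features = [], 2
    for _ in range(depth):
        linear = torch.nn.Linear(in_features, width, bias=False)
        torch.nn.init.kaiming_normal_(linear.weight, nonlinearity="relu")
        layers += [linear, torch.nn.ReLU()]
        in_features = width
    layers.append(torch.nn.Linear(in_features, 1, bias=False))
    return torch.nn.Sequential(*layers)

def gradient_autocorrelation(net, points=256, lag=1):
    """Circular autocorrelation of d(output)/d(theta) along a sweep of inputs."""
    theta = (torch.arange(points) * (2 * math.pi / points)).requires_grad_(True)
    x = torch.stack([torch.cos(theta), torch.sin(theta)], dim=1)
    # Each output depends only on its own theta, so one backward pass
    # yields the per-point gradient.
    (grad,) = torch.autograd.grad(net(x).sum(), theta)
    g = grad - grad.mean()
    return ((g * torch.roll(g, lag)).sum() / (g * g).sum()).item()

shallow_corr = gradient_autocorrelation(make_plain_net(depth=1))
deep_corr = gradient_autocorrelation(make_plain_net(depth=50))
print(f"lag-1 gradient correlation, 1 hidden layer:   {shallow_corr:.3f}")
print(f"lag-1 gradient correlation, 50 hidden layers: {deep_corr:.3f}")
```

One would expect the shallow network's gradients to stay highly correlated between neighbouring inputs and the deep network's correlation to be lower, though the exact numbers depend on the seed, width, and depth.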

---

## The Explanation

* This [explanation](https://proceedings.mlr.press/v70/balduzzi17b.html) came about *after* the solution
  * Like many things in DL, we found and used a solution that worked

> [T]he correlation between gradients in standard feedforward networks decays exponentially with depth resulting in gradients that resemble white noise

---

## A Solution

* There was a moment after AlexNet where many strange, complicated architectures were tried
* The one that emerged with a popular solution to the depth issue was the *residual network*
* [ResNets](https://arxiv.org/abs/1512.03385) have two big improvements:
  * Batch Norm (which we've seen)
  * Residual layers

---

## Residual Layer

* This is the residual block, as first described
* Preserves original input
  * So even if one block learns nothing useful, the next block still receives a usable input
* Layers are only learning *diffs* to apply, not entirely new feature maps
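
A minimal sketch of the "preserves original input" point, assuming a single-convolution residual branch (the class name `TinyResidual` is made up for illustration). When the branch outputs zero, the block is exactly the identity.

```python
import torch

class TinyResidual(torch.nn.Module):
    """Hypothetical one-convolution residual block: y = x + f(x)."""
    def __init__(self, channels):
        super().__init__()
        self.f = torch.nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        return x + self.f(x)

block = TinyResidual(15)
# Force the branch to "do nothing" by zeroing its parameters.
torch.nn.init.zeros_(block.f.weight)
torch.nn.init.zeros_(block.f.bias)

x = torch.randn(4, 15, 16, 16)
with torch.no_grad():
    y = block(x)
passes_input_through = torch.equal(x, y)
print(passes_input_through)
```

A plain convolution with the same zeroed weights would output all zeros instead, leaving later layers nothing to work with.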


---

## Details

* When the skip goes over an increase in feature maps, use a 1x1 convolution to add channels
  * Or save parameters and use an identity with zero-padded extra channels
* When the skip goes over dimensionality reduction, increase stride to match the reduction
  * e.g. stride 2 to cut feature map size in half
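
A small shape check for both shortcut options, assuming a block that goes from 15 to 30 channels while halving the feature map (the sizes are chosen only to match the layer widths used earlier in the lecture).

```python
import torch
import torch.nn.functional as F

x = torch.randn(4, 15, 16, 16)

# Option 1: learned 1x1 projection; stride 2 matches the spatial reduction.
projection = torch.nn.Conv2d(15, 30, kernel_size=1, stride=2)
projected = projection(x)

# Option 2: parameter-free shortcut. Subsample spatially, then zero-pad
# the channel dimension. F.pad lists pads starting from the last dimension:
# (W_left, W_right, H_top, H_bottom, C_front, C_back).
subsampled = x[:, :, ::2, ::2]
padded = F.pad(subsampled, (0, 0, 0, 0, 0, 15))

print(projected.shape, padded.shape)
```

Both shortcuts produce tensors that can be added to the residual branch output; the projection costs extra parameters, the padded identity costs none.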


---

## Unravelled

---

## Motivation

* But if we only fully understood the benefits after using it, what was the original motivation?
* Let's read [the ResNet paper](https://arxiv.org/abs/1512.03385)

---

## The Observed Problem

> We conjecture that the deep plain nets may
have exponentially low convergence rates, which impact the
reducing of the training error.

* So the authors tried to find something easier to learn

---

> Let us consider H(x) as an underlying mapping to be
fit by a few stacked layers (not necessarily the entire net), with x denoting
the inputs to the first of these layers. If one hypothesizes that multiple
nonlinear layers can asymptotically approximate complicated functions, then it
is equivalent to hypothesize that they can asymptotically approximate the
residual functions, i.e., H(x) − x (assuming that the input and output are of
the same dimensions).

---

## Translation

* We are asking the network to learn $H(x)$
* But maybe $H(x)$ is a *transformation* of the original input, $x$

---

>  So
rather than expect stacked layers to approximate H(x), we explicitly let these
layers approximate a residual function F(x) := H(x) − x. The original function
thus becomes F(x)+x.

---

## Translation

* We will *give* the network x, so it just needs to learn the residual $F(x)$
  * The output function is still the same, but maybe now it is easier
  * The magnitude of the change is certainly less
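
A short worked derivation may help here. It is not taken from the paper; it only applies the chain rule to $H(x) = F(x) + x$, with $I$ denoting the identity matrix:

$$
H(x) = F(x) + x \quad\Rightarrow\quad \frac{\partial H}{\partial x} = \frac{\partial F}{\partial x} + I
$$

For two stacked blocks, $x_1 = x_0 + F_1(x_0)$ and $x_2 = x_1 + F_2(x_1)$:

$$
\frac{\partial x_2}{\partial x_0} = \left(I + \frac{\partial F_2}{\partial x_1}\right)\left(I + \frac{\partial F_1}{\partial x_0}\right) = I + \frac{\partial F_1}{\partial x_0} + \frac{\partial F_2}{\partial x_1} + \frac{\partial F_2}{\partial x_1}\frac{\partial F_1}{\partial x_0}
$$

The $I$ term means there is always a direct path for the gradient, even when the residual branches contribute very little.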

---

## Residual Implementation

```python
import torch


class ResBlock(torch.nn.Module):
    """Simplifies using a residual block."""
    def __init__(self, nonlinearity, in_channels, out_channels, kernel_size, padding, stride):
        super().__init__()
        # Residual branch F(x): conv -> batch norm -> nonlinearity -> conv -> batch norm.
        # Only the second convolution changes the channel count and applies the stride.
        self.net = torch.nn.Sequential(
                torch.nn.Conv2d(in_channels=in_channels, out_channels=in_channels,
                                kernel_size=kernel_size, padding=padding, stride=1),
                torch.nn.BatchNorm2d(in_channels),
                nonlinearity(),
                torch.nn.Conv2d(in_channels=in_channels, out_channels=out_channels,
                                kernel_size=kernel_size, padding=padding, stride=stride),
                torch.nn.BatchNorm2d(out_channels),
                )
        torch.nn.init.kaiming_normal_(self.net[3].weight.data, nonlinearity="relu")
        self.a = nonlinearity()
        # Either preserve the original input or use a 1x1 convolution to
        # increase channels or decrease dimensions. This is consistent with the original paper.
        if in_channels == out_channels and stride == 1:
            self.identity = torch.nn.Identity()
        else:
            self.identity = torch.nn.Conv2d(in_channels=in_channels, out_channels=out_channels,
                                            kernel_size=1, padding=0, stride=stride)

    def forward(self, x):
        y = self.net(x)
        x_prime = self.identity(x)
        # Add the (possibly projected) input back in, then apply the nonlinearity.
        return self.a(y + x_prime)
```

---

## Paper Results

* Plain ConvNets with batch norm (left) and ResNets (right)

---

## Paper Results

* ResNets have lower variance in weights overall
  * So they may indeed have learned an easier function

---

## Some Details

* The residual blocks won't work on their own
  * They would struggle without BatchNorm
* Why? Because every block keeps adding its input back into its output
  * This increases variance!
  * We could rescale the outputs, shrinking the values from He initialization
  * But BatchNorm magics the problem away
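
A sketch of the variance argument, assuming fully connected residual branches with He initialization and fresh random weights per block (the helper names are made up for illustration). Without normalization the activation variance roughly doubles per block; with BatchNorm at the end of each branch it only grows by about one per block.

```python
import torch

torch.manual_seed(0)
width, blocks = 256, 10
x0 = torch.randn(1024, width)  # unit-variance input batch

def make_branch(use_batchnorm):
    """Residual branch: ReLU -> He-initialized Linear (-> BatchNorm)."""
    linear = torch.nn.Linear(width, width, bias=False)
    torch.nn.init.kaiming_normal_(linear.weight, nonlinearity="relu")
    layers = [torch.nn.ReLU(), linear]
    if use_batchnorm:
        # Modules start in training mode, so batch statistics are used.
        layers.append(torch.nn.BatchNorm1d(width))
    return torch.nn.Sequential(*layers)

def variance_after_blocks(use_batchnorm):
    x = x0
    with torch.no_grad():
        for _ in range(blocks):
            x = x + make_branch(use_batchnorm)(x)  # skip connection
    return x.var().item()

plain_var = variance_after_blocks(use_batchnorm=False)
bn_var = variance_after_blocks(use_batchnorm=True)
print(f"variance after {blocks} blocks without BatchNorm: {plain_var:.1f}")
print(f"variance after {blocks} blocks with BatchNorm:    {bn_var:.1f}")
```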

---

## BatchNorm + Residuals

* BatchNorm enables higher learning rates
  * Because it smooths changes in the loss surface
* BatchNorm also injects noise, acting as a regularizer
* Residuals simplify the target function, so there is theoretically less to learn
* Putting them together reached state of the art and put many more complicated architectures to rest

---

## Loss Visualization

<div class="col">
<img style="width: 65%" class="r-stretch" src="./figures/UDL/Chap11/ResidualSurface.svg" />
<br/>
<small>
A projection of the parameter space onto 2D shows that residual connections smooth the loss surface.
The minimum was first found with training, then the surface was traversed with skip connections and without.
</small>
</div>

---

## Other Cool Ideas

* ResNets did inspire other interesting twists on the idea
* 2016
  * [SqueezeNet](https://arxiv.org/abs/1602.07360)
  * [DenseNet](https://arxiv.org/abs/1608.06993)
* 2019
  * [EfficientNet](https://arxiv.org/abs/1905.11946)
* Some modern training recipes restrict our network structure though
  * After the initial burst of strange networks, we've settled into a few common types

---

## DenseNet

* Why only forward the input of each block?
  * How about we send all of the feature maps from previous layers?
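
A minimal sketch of the idea, assuming feature maps are combined by channel-wise concatenation as in the DenseNet paper (the class name `TinyDenseBlock` and the sizes are made up for illustration). Each layer sees the block input plus every earlier layer's output.

```python
import torch

class TinyDenseBlock(torch.nn.Module):
    """Hypothetical dense block: layer i sees the input and all earlier outputs."""
    def __init__(self, in_channels, growth, layers):
        super().__init__()
        self.convs = torch.nn.ModuleList([
            torch.nn.Conv2d(in_channels + i * growth, growth, kernel_size=3, padding=1)
            for i in range(layers)
        ])

    def forward(self, x):
        features = [x]
        for conv in self.convs:
            # Concatenate everything produced so far along the channel dimension.
            new_maps = torch.relu(conv(torch.cat(features, dim=1)))
            features.append(new_maps)
        return torch.cat(features, dim=1)

block = TinyDenseBlock(in_channels=15, growth=12, layers=3)
x = torch.randn(2, 15, 8, 8)
with torch.no_grad():
    out = block(x)
print(out.shape)  # channels: 15 + 3 * 12
```

The input appears unchanged in the first 15 output channels, so later layers (and the loss) keep direct access to it.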

<div class="col">
<img style="width: 65%" class="r-stretch" src="./figures/UDL/Chap11/ResidualDensenet.svg" />
<br/>
<small>DenseNet drawing from the UDL book</small>
</div>

---

## DenseNet

* DenseNet has many of the positive ResNet qualities
* And gradients are passed fairly directly to all layers

<div class="col">
<img style="width: 55%" class="r-stretch" src="./figures/DenseNet-5-layer.png" />
<br/>
<small>DenseNet drawing from the <a href="https://arxiv.org/abs/1608.06993">original paper</a></small>
</div>

---

## DenseNet Efficiency

* DenseNet had competitive errors on ImageNet with few parameters
* Modern training recipes have left this network behind, but sometimes it is good to remember good ideas from the past

<div class="col">
<img style="width: 65%" class="r-stretch" src="./figures/DenseNet-ParameterAndFlopsVResnet.png" />
<br/>
<small>Efficiency plots from the <a href="https://arxiv.org/abs/1608.06993">original paper</a></small>
</div>

---

## Other Skip Connections

* Many networks eventually produce an output of the same size as their inputs
* [U-Net](https://arxiv.org/abs/1505.04597) segments the original image
  * This is similar to taking the original and recoloring it
  * The original use-case was on biomedical images

---

## U-Net

* Upconvolutions convert the learned features back into the original image space
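
A one-level sketch of the down, up, and skip pattern, assuming a transposed convolution for the upconvolution and channel concatenation for the skip (the layer sizes are made up for illustration).

```python
import torch

encoder_features = torch.randn(2, 15, 16, 16)

down = torch.nn.Conv2d(15, 30, kernel_size=3, stride=2, padding=1)  # 16x16 -> 8x8
up = torch.nn.ConvTranspose2d(30, 15, kernel_size=2, stride=2)      # 8x8 -> 16x16
fuse = torch.nn.Conv2d(30, 15, kernel_size=3, padding=1)

with torch.no_grad():
    bottleneck = torch.relu(down(encoder_features))
    upsampled = up(bottleneck)
    # Skip connection: concatenate encoder features with the upsampled features.
    merged = torch.cat([encoder_features, upsampled], dim=1)
    output = fuse(merged)

print(bottleneck.shape, upsampled.shape, output.shape)
```

The skip hands the decoder the fine spatial detail that the downsampling step discarded.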

<div class="col">
<img style="width: 65%" class="r-stretch" src="./figures/UDL/Chap11/ResidualUNet.svg" />
<br/>
<small>Modern drawing of UNet from the UDL book</small>
</div>

---

## Segmentation

* Input on left, output on right with random color masks
  * Yellow outlines show the ground truth

<div class="col">
<img style="width: 65%" class="r-stretch" src="./figures/UNetFigure4.png" />
<br/>
<small>Figure 4 from the original paper, showing results on the ISBI cell tracking challenge.</small>
</div>

---

## Pose Estimation

* Pose estimation with [stacked hourglass networks](https://arxiv.org/abs/1603.06937) also used residual-like skip layers within inner pyramid blocks
* Their motivation is to capture features at different scales
  * But this ends up looking a bit like repeated U-Nets

---

## Parting Thoughts

* Residual networks made it practical to train much deeper networks
  * This was useful for image classification
  * And also useful in many other complicated tasks, where shattered gradients may have prevented learning
* Despite their applicability, Residual blocks and their ilk are not at all complicated