# CS 462 - Lecture 05

## Loss Functions

Bernhard Firner

2026-02-05

---

## Book

* Today we'll be going through chapter 5 of the [book](https://udlbook.github.io/udlbook/)

---

## Review

* Depth Efficiency
* Backpropagation

---

## Depth Efficiency

* Quick summary:
  * Deeper networks have more regions per parameter
  * K is the depth, D is the width

---

## Deeper Vs Wider

* The Universal Approximation Theorem applies to wide networks
  * With enough hidden units and a nonlinear activation function, we can approximate any continuous function to arbitrary accuracy
* There is a similar [depth version](https://en.wikipedia.org/wiki/Universal_approximation_theorem#Arbitrary-depth_case) of the theorem
* So deeper networks can achieve the same representational power as shallow (wide) networks

---

## Parameters

* It is important to remember how many parameters are used in a neural network
  * And if I wasn't restricting myself to a single-sided page for quiz 1, I would give you a bunch of network pictures and ask for the parameters
* You should be able to look at this and quickly determine the parameters
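-v-

* A minimal sketch of the counting, assuming fully connected layers that each have a bias
  * A layer with `n_in` inputs and `n_out` outputs has `n_in * n_out` weights and `n_out` biases
  * `count_parameters` is a hypothetical helper, not something from the book

```python
def count_parameters(layer_sizes):
    """Count weights and biases in a fully connected network.

    layer_sizes lists the width of every layer, input and output included,
    e.g. [1, 3, 3, 1] is one input, two hidden layers of 3 units, one output.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        weights = n_in * n_out
        biases = n_out
        total += weights + biases
    return total

print(count_parameters([1, 3, 1]))     # (3 + 3) + (3 + 1) = 10
print(count_parameters([1, 3, 3, 1]))  # (3 + 3) + (9 + 3) + (3 + 1) = 22
```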

---

## Depth Drawbacks

* There are a few costs that come with depth
  * First, layer outputs are computed sequentially
    * So a deeper network has a longer propagation delay from beginning to end
  * That is a cost during inference
* There are more costs during training

---

## Depth Training Drawbacks

* Parameter updates are computed via the backpropagation algorithm
  * For even a shallow network with a single hidden layer, this involves multiple steps

* We apply the **chain rule** over each layer
  * $\frac{\partial\ell}{\partial f_1} = 2(f_1 - y)$
  * $\frac{\partial\ell}{\partial relu} = \frac{\partial f_1}{\partial relu} \frac{\partial\ell}{\partial f_1}$
  * $\frac{\partial\ell}{\partial f_0} = \frac{\partial relu}{\partial f_0}\frac{\partial f_1}{\partial relu} \frac{\partial\ell}{\partial f_1}$
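
-v-

* A small check of these three gradients against autograd
  * Assumes a scalar network $f_0 = w_0 x + b_0$, $f_1 = w_1 relu(f_0) + b_1$ with loss $\ell = (f_1 - y)^2$
  * The data and parameter values are made up for illustration

```python
import torch

# Made-up scalar data and parameters
x, y = torch.tensor(1.5), torch.tensor(2.0)
w0 = torch.tensor(0.8, requires_grad=True)
b0 = torch.tensor(0.1, requires_grad=True)
w1 = torch.tensor(-1.2, requires_grad=True)
b1 = torch.tensor(0.3, requires_grad=True)

# Forward pass, keeping every intermediate output
f0 = w0 * x + b0
h = torch.relu(f0)
f1 = w1 * h + b1
loss = (f1 - y) ** 2

# Ask autograd to keep gradients for the intermediate tensors too
for t in (f0, h, f1):
    t.retain_grad()
loss.backward()

# The same gradients by hand, one chain rule step per layer
with torch.no_grad():
    dl_df1 = 2 * (f1 - y)
    dl_dh = w1 * dl_df1                # d f1 / d relu = w1
    dl_df0 = (f0 > 0).float() * dl_dh  # d relu / d f0 is 1 where f0 > 0, else 0

# Each comparison should agree with autograd
print(torch.allclose(dl_df1, f1.grad))
print(torch.allclose(dl_dh, h.grad))
print(torch.allclose(dl_df0, f0.grad))
```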


---

## Backpropagation Costs

* We must store all intermediate outputs during the forward pass
* We also hope that a gradient exists connecting early layers to the loss
  * But this doesn't always happen!
  * Think of the 0 gradient section of a ReLU
* Add in numerical stability problems, and training a deep network is tough!
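
-v-

* A tiny sketch of the zero-gradient problem, assuming a single weight feeding a ReLU
  * `weight_gradient` is a hypothetical helper; the inputs are made up

```python
import torch

def weight_gradient(x_value):
    """Gradient of relu(w * x) with respect to w, evaluated at w = 1."""
    w = torch.tensor(1.0, requires_grad=True)
    out = torch.relu(w * torch.tensor(x_value))
    out.backward()
    return w.grad.item()

print(weight_gradient(2.0))   # ReLU is active, so the gradient is x
print(weight_gradient(-2.0))  # ReLU is inactive, so the gradient is zero and w gets no update
```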

---

## Learning Details

* To demonstrate problems (and solutions!) with training, we should use real data
* But right now, we don't really know how to train a model
  * Not everything is just mean squared error curve fitting
  * So let's spend a lecture on loss functions

---

## Loss Functions

* The loss function determines what our function models
  * So what is MSE doing?
  * Fitting our data, right?
* Let's take an example where our training data is contradictory

---

## Contradictory Data

* We'll generate data for two sine curves, and try to learn them both at once
  * We end up with predictions in between the two curves

-v-

```python
#! /usr/bin/python3

import matplotlib.pyplot as plt
import torch

my_colors = [ '#2E2585', '#337538', '#5DA899', '#94CBEC' ]

torch.random.manual_seed(1)

# Set up x points
x = torch.linspace(-3,3,200)

def init_axes(ax):
    ax.set_xlim([-3, 3])
    ax.set_ylim([-1.5, 1.5])
    ax.set_xlabel('Input, $x$', fontsize=16)
    ax.set_ylabel('Output, $y$', fontsize=16)

# Plot the training input and the current output
ax = plt.gca()

# Train on the GPU only when one is actually available at runtime
if torch.cuda.is_available():
    train_device = 'cuda'
else:
    train_device = 'cpu'

def solveAndPlot(ax, x, num_hidden):
    ax.clear()
    init_axes(ax)

    # Two contradictory targets for every x: the same curve shifted up and down
    sin_y_plus = torch.sin(0.5*torch.pi + x) + 0.2
    sin_y_neg = torch.sin(0.5*torch.pi + x) - 0.2

    xt = x.repeat(2)
    yt = torch.cat((sin_y_plus, sin_y_neg))
    # Add a feature dimension so that each sample is a row
    xt = xt.reshape(xt.size(0), 1).float()
    yt = yt.reshape(yt.size(0), 1).float()

    # Shuffle our data before training
    indices = torch.randperm(xt.size(0))
    xt = xt[indices].to(train_device)
    yt = yt[indices].to(train_device)

    # Make a model
    net = torch.nn.Sequential(
            torch.nn.Linear(1, num_hidden),
            torch.nn.Tanh(),
            torch.nn.Linear(num_hidden, 1))
    net.to(train_device)

    net.train()

    # Train the model
    loss_fn = torch.nn.MSELoss()
    optimizer = torch.optim.Adagrad(net.parameters(), lr=0.0001)
    batch_size = 25

    for epoch in range(10000):
        for batch in range(xt.size(0) // batch_size):
            # Slice out the current batch
            xb = xt[batch*batch_size:(batch+1)*batch_size]
            yb = yt[batch*batch_size:(batch+1)*batch_size]
            # Clear the gradients left over from the previous step
            optimizer.zero_grad()
            output = net(xb)
            loss = loss_fn(output, yb)
            loss.backward()
            optimizer.step()

    net.eval()
    # No gradients are needed for plotting
    with torch.no_grad():
        y_hat = net(xt)
    ax.scatter(xt.cpu(), yt.cpu(), linestyle='None', alpha=0.5, marker="o", color=my_colors[0], linewidth=0)
    ax.scatter(xt.cpu(), y_hat.cpu(), linestyle='None', marker="o", label="fit", color=my_colors[2], linewidth=0)

solveAndPlot(ax, x, num_hidden = 5000)
plt.savefig("../figures/05_fit_between.svg", dpi=2*96)
```

---

## Predicting Distributions

* When we use MSE, we are really asking the model to learn the mean
  * $\hat{y} = f[x, \phi]$
  * $\hat{y}$ is an estimate of the mean at point $x$
* We could add variance to our loss function
* Or even predict *one of n* classes at input $x$
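
-v-

* A quick sketch of why MSE learns the mean, assuming two contradictory labels for one input
  * The label values and the grid of candidate predictions are made up for illustration

```python
targets = [1.2, 0.8]  # two contradictory labels for the same input x

def mse(prediction, targets):
    """Mean squared error of a single constant prediction."""
    return sum((prediction - t) ** 2 for t in targets) / len(targets)

# Try constant predictions 0.50, 0.51, ..., 1.50 and keep the one with the lowest MSE
candidates = [i / 100 for i in range(50, 151)]
best = min(candidates, key=lambda c: mse(c, targets))
mean = sum(targets) / len(targets)

print(best, mean)  # the best prediction sits at the mean, between the two labels
```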

---

* By changing our loss function, we can create new predictions


---

## NLL Loss

* NLL stands for negative log likelihood
  * This is a very common loss
* Used to estimate the parameters of probability distributions
  * Basically, we evaluate the probability of each observation, given the predicted distribution parameters
  * e.g. $\mu$ and $\sigma^2$ if we are estimating a gaussian
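
-v-

* A minimal sketch of the idea, using `torch.distributions.Normal` to score observations
  * The observations and the two candidate parameter sets are made up
  * `gaussian_nll` is a hypothetical helper name

```python
import torch

observations = torch.tensor([0.9, 1.1, 1.0, 1.3])

def gaussian_nll(mu, sigma):
    """Summed negative log likelihood of the observations under N(mu, sigma^2)."""
    dist = torch.distributions.Normal(loc=mu, scale=sigma)
    return -dist.log_prob(observations).sum()

good = gaussian_nll(mu=1.0, sigma=0.2)  # parameters close to the data
bad = gaussian_nll(mu=3.0, sigma=0.2)   # mean far from the data

print(good.item(), bad.item())  # the better parameters give the lower loss
```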

---

## Logarithms

* Why are we optimizing a logarithm? Why not use the PDF directly?
  * Probabilities are small, and multiplying many of them together is not numerically stable
* $p(x | \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}exp(-\frac{(x - \mu)^2}{2\sigma^2})$
  * That involves dividing by $\sigma^2$, which could be very small
  * We don't want training to fail because of numerical problems, so we take a log

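-v-

* A small illustration of the stability problem, assuming 100 independent observations
  * The likelihood value of `1e-5` per observation is made up

```python
import math

probs = [1e-5] * 100  # likelihood of each observation

# Multiplying the likelihoods: 1e-500 is too small for a 64-bit float
product = 1.0
for p in probs:
    product *= p

# Summing the log likelihoods stays well within range
log_sum = sum(math.log(p) for p in probs)

print(product)  # 0.0, the product has underflowed
print(log_sum)  # roughly -1151.29
```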
---

## Log of a Normal

* $p(x | \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}exp(-\frac{(x - \mu)^2}{2\sigma^2})$
* Taking the log: $ln((2\pi\sigma^2)^{-1/2}) - \frac{(x - \mu)^2}{2\sigma^2}$
* Simplify: $-\frac{1}{2}ln(2\pi) - \frac{1}{2}ln(\sigma^2) - \frac{(x - \mu)^2}{2\sigma^2}$
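
-v-

* A sanity check of the simplified expression against `torch.distributions.Normal.log_prob`
  * `log_normal_pdf` is a hypothetical helper; the test values are made up

```python
import math
import torch

def log_normal_pdf(x, mu, sigma_sq):
    """The simplified log of the normal PDF, term by term."""
    return (-0.5 * math.log(2 * math.pi)
            - 0.5 * torch.log(sigma_sq)
            - (x - mu) ** 2 / (2 * sigma_sq))

x = torch.tensor([-1.0, 0.0, 2.5])
mu = torch.tensor(0.5)
sigma_sq = torch.tensor(0.7)

# Normal takes the standard deviation, so pass the square root of the variance
reference = torch.distributions.Normal(mu, sigma_sq.sqrt()).log_prob(x)
matches = torch.allclose(log_normal_pdf(x, mu, sigma_sq), reference, atol=1e-6)
print(matches)
```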

---

## Negative Log

* We need small loss values to be better, so multiply by -1
  * Right now, the PDF gives a higher value if the observation was more likely
  * $\frac{1}{2}ln(2\pi) + \frac{1}{2}ln(\sigma^2) + \frac{(x - \mu)^2}{2\sigma^2}$
* The first term is a constant and doesn't matter for the loss
  * Drop it

---

## Equivalence to MSE

* $\frac{1}{2}ln(\sigma^2) + \frac{(x - \mu)^2}{2\sigma^2}$
* Notice that if $\sigma$ is a constant as well, the first term is constant and the second is just a scaled squared error, so we end up with MSE loss
  * Thus, optimizing for MSE is like learning the mean of a gaussian
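
-v-

* A numerical sketch of the equivalence, assuming a fixed variance of 0.25
  * With constant $\sigma^2$ the loss is the MSE scaled by $\frac{1}{2\sigma^2}$ plus a constant, so both have the same minimizer
  * The random targets and predictions are made up

```python
import math
import torch

torch.manual_seed(0)
target = torch.randn(100)
mu = torch.randn(100)
sigma_sq = 0.25  # constant variance, not predicted by the model

# Gaussian negative log likelihood without the 0.5 * ln(2 * pi) constant
nll = (0.5 * math.log(sigma_sq) + (target - mu) ** 2 / (2 * sigma_sq)).mean()

# The same value, built from the ordinary MSE loss
mse = torch.nn.functional.mse_loss(mu, target)
rescaled_mse = mse / (2 * sigma_sq) + 0.5 * math.log(sigma_sq)

print(nll.item(), rescaled_mse.item())
```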

---

## Example

* Let's learn the variance of a distribution as well as the mean
* We'll use a sine again, inserting noise that depends upon x

---

## Model and Assumptions

* Previously we assumed that noise had constant variance across the dataset
* Now we assume it is some function of the input, x

---

## Numerical Stability

* Taking the log removes the exponential, so a very unlikely observation no longer underflows to 0
* There are still some constraints on $\sigma^2$ though
  * We cannot take the log of 0, and we still divide by $\sigma^2$
  * $\sigma^2$ must be positive
---

## Fixes

* We apply a [softplus](https://docs.pytorch.org/docs/stable/generated/torch.nn.Softplus.html#softplus) function to the predicted variance $\sigma^2$
  * Basically a smooth ReLU
* We also add a tiny value, called an epsilon, for stability
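
-v-

* A short sketch of both fixes, assuming some made-up raw network outputs
  * The epsilon of `1e-6` is an illustrative choice

```python
import torch

raw = torch.tensor([-20.0, -1.0, 0.0, 1.0, 20.0])  # unconstrained network outputs

relu = torch.relu(raw)                         # exactly 0 for negative inputs
softplus = torch.nn.functional.softplus(raw)   # ln(1 + exp(x)), a smooth ReLU
sigma_sq = softplus + 1e-6                     # epsilon keeps the log and the division safe

print(relu)
print(sigma_sq)
```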


---

## Code

```python
import torch
import matplotlib.pyplot as plt

my_colors = [ '#2E2585', '#337538', '#5DA899', '#94CBEC' ]

torch.random.manual_seed(10)

# Set up x points
x = torch.linspace(-3, 3, 500)
# Add a batch dimension
x = x.reshape(x.size(0), 1).float()

# y values without noise
y_original = torch.sin(x)

# x-dependent noise, higher magnitude near the center
noise_std = 0.1 * (3 - torch.abs(x))
noise = torch.randn_like(x) * noise_std

# Training data
y_noisy = y_original + noise

class NormalModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = torch.nn.Sequential(
            torch.nn.Linear(1, 512),
            torch.nn.ReLU(),
            torch.nn.Linear(512, 512),
            torch.nn.ReLU()
        )
        self.mean_head = torch.nn.Linear(512, 1)
        self.var_head = torch.nn.Linear(512, 1)

    def forward(self, x):
        # Predict both a mean and a variance
        features = self.layers(x)
        mu = self.mean_head(features)
        # Softplus ensures variance > 0; we add a small epsilon for stability
        sigma_sq = torch.nn.functional.softplus(self.var_head(features)) + 1e-6
        return mu, sigma_sq

def gaussian_nll_loss(mu, sigma_sq, target):
    # Use negative log likelihood for numerical reasons
    # This is from the normal PDF
    # p(x | \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}exp(-\frac{(x - \mu)^2}{2\sigma^2})
    # Taking the negative log: -1[ ln((2\pi\sigma^2)^{-1/2}) - \frac{(x - \mu)^2}{2\sigma^2} ]
    # = -1[ -\frac{1}{2}ln(2\pi) - \frac{1}{2}ln(\sigma^2) - \frac{(x - \mu)^2}{2\sigma^2} ]
    # = \frac{1}{2}ln(2\pi) + \frac{1}{2}ln(\sigma^2) + \frac{(x - \mu)^2}{2\sigma^2}
    # The first term is a constant and doesn't matter for the loss.
    # Notice that if sigma is a constant as well, we end up with MSE loss
    return (torch.log(sigma_sq) / 2 + (target - mu)**2 / (2 * sigma_sq)).mean()

net = NormalModel()

learning_rate = 0.001
for epoch in range(500):
    # Ensure that there are no gradients stored
    net.zero_grad()
    # This is the forward pass
    mu, variance = net(x)
    # This computes the loss
    loss = gaussian_nll_loss(mu, variance, y_noisy)
    # This computes the gradients (derivatives of the error w.r.t. each parameter)
    loss.backward()
    # Update the parameters
    for param in net.parameters():
        param.data -= param.grad * learning_rate
        # Zero the gradient before the next pass
        param.grad = None

with torch.no_grad():
    mu, variance = net(x)

ax = plt.gca()
ax.scatter(x[:,0], y_noisy[:,0], linestyle='None', marker='o', label="target data", color=my_colors[1], linewidth=3)
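# Shade mu +/- 2 standard deviations; the model predicts a variance, so take the square root
std = variance.sqrt()
ax.fill_between(x[:,0], (mu+2*std)[:,0], (mu-2*std)[:,0], alpha=0.5, color=my_colors[2], linewidth=0)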
ax.plot(x[:,0], mu[:,0], linestyle='solid', marker=None, label="mu", color=my_colors[2], linewidth=3)
plt.savefig(f"../figures/05_nll.svg", dpi=2*96)
```

---

## Result

---

## Assumptions

* We are assuming that the noise in the training data is really gaussian
  * This means independent and unbiased noise
  * But many ML techniques already assume i.i.d. gaussian data

---

## Benefits

* We can actually use any probability distribution
  * So if we actually know that something isn't gaussian, we can still use a DNN

---

## Binary Classification

* Classification is obviously something that we want to do
* What distribution is this?
  * Bernoulli, with the DNN predicting $\lambda$
  * Use the [probability mass function of the Bernoulli](https://en.wikipedia.org/wiki/Bernoulli_distribution)
  * $Pr(y|\lambda) = (1 - \lambda)^{1-y}\lambda^y$
* $\lambda$ needs to be between 0 and 1 though

---

## Logistic Sigmoid

* We'll just pass the network output through a sigmoid
  * That will force it to be between 0 and 1
  * $\lambda = sig(f[x, \phi])$
  * Problem solved!

---

## Negative Log

* We need the negative log of the Bernoulli
* $L[\phi] = -(1 - y)ln[1-sig(f[x, \phi])] - (y)ln[sig(f[x, \phi])]$
  * $sig(f[x, \phi]) = \lambda$ is the probability for classification
* PyTorch calls this the [Binary Cross Entropy loss](https://docs.pytorch.org/docs/stable/generated/torch.nn.BCELoss.html#torch.nn.BCELoss)
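
-v-

* A sketch comparing the formula above with PyTorch's built-in versions
  * The raw network outputs and labels are made up
  * `binary_cross_entropy` expects probabilities, while `binary_cross_entropy_with_logits` applies the sigmoid itself

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, -1.0, 0.5])  # raw network outputs f[x, phi]
y = torch.tensor([1.0, 0.0, 1.0])        # binary labels

# Negative log of the Bernoulli, written out by hand
lam = torch.sigmoid(logits)
manual = (-(1 - y) * torch.log(1 - lam) - y * torch.log(lam)).mean()

bce = F.binary_cross_entropy(lam, y)
bce_logits = F.binary_cross_entropy_with_logits(logits, y)

print(manual.item(), bce.item(), bce_logits.item())  # all three should agree
```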

---

## Multiple Classes

* Why classify just one thing?
  * We could train separate networks for each class, but that is inefficient
* Instead, train a network with one output for each class
* We want to treat them as probabilities, so force them to sum to 1
  * [softmax](https://docs.pytorch.org/docs/stable/generated/torch.nn.Softmax.html#softmax)

---

## Softmax

* $Softmax(x_i) = \frac{exp(x_i)}{\sum_jexp(x_j)}$
* So we just divide the exponential of each output by the sum of the exponentials of all of the NN outputs
  * The results must sum to 1
  * We use exponentials to guarantee outputs are all positive
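
-v-

* A minimal softmax sketch, checked against `torch.softmax`
  * Subtracting the maximum first is an addition here, not part of the formula above; it leaves the result unchanged and avoids overflow in `exp`
  * The network outputs are made up

```python
import torch

def softmax(x):
    """Softmax over a 1-D tensor of network outputs."""
    shifted = x - x.max()      # same result, but exp() can no longer overflow
    exps = torch.exp(shifted)  # every entry is now positive
    return exps / exps.sum()   # normalize so the entries sum to 1

outputs = torch.tensor([2.0, -1.0, 0.5])
probs = softmax(outputs)

print(probs, probs.sum())
print(torch.allclose(probs, torch.softmax(outputs, dim=0)))
```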

---

## PyTorch

* There are plenty of loss functions already defined in PyTorch
  * [https://docs.pytorch.org/docs/stable/nn.html#loss-functions](https://docs.pytorch.org/docs/stable/nn.html#loss-functions)
  * Including the [Gaussian NLL Loss](https://docs.pytorch.org/docs/stable/generated/torch.nn.GaussianNLLLoss.html#torch.nn.GaussianNLLLoss)
* So we generally don't need to do this work by ourselves
  * It isn't difficult though, so remember that you can do it if required
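
-v-

* A sketch comparing a hand-written gaussian NLL with `torch.nn.GaussianNLLLoss`
  * Assumes the default settings, which also drop the constant $\frac{1}{2}ln(2\pi)$ term
  * The predictions and targets are made up

```python
import torch

mu = torch.tensor([[0.1], [0.5], [-0.3]])       # predicted means
sigma_sq = torch.tensor([[0.2], [0.05], [1.0]])  # predicted variances
target = torch.tensor([[0.0], [0.6], [0.4]])

# Hand-written version: 0.5 * ln(sigma^2) + (x - mu)^2 / (2 * sigma^2)
custom = (torch.log(sigma_sq) / 2 + (target - mu) ** 2 / (2 * sigma_sq)).mean()

# Built-in version; the argument order is (input, target, var)
builtin = torch.nn.GaussianNLLLoss()(mu, target, sigma_sq)

print(custom.item(), builtin.item())  # the two should agree
```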

---

## Cross Entropy Loss

* PyTorch provides a [Cross Entropy Loss](https://docs.pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html#torch.nn.CrossEntropyLoss) for multiclass classification
  * It applies both the softmax and the negative log loss internally, so the network should output raw scores
  * If you train a model with this, remember to apply a softmax yourself during inference
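
-v-

* A sketch of what the loss does internally, and of the softmax at inference
  * The raw network outputs and class labels are made up

```python
import torch

logits = torch.tensor([[2.0, -1.0, 0.5],
                       [0.1, 0.2, 3.0]])  # raw network outputs, one row per sample
labels = torch.tensor([0, 2])             # correct class index for each sample

# Training: pass the raw outputs straight to the loss
loss = torch.nn.CrossEntropyLoss()(logits, labels)

# The same value by hand: log softmax, then the negative log of the correct class
log_probs = torch.log_softmax(logits, dim=1)
manual = -log_probs[torch.arange(2), labels].mean()

# Inference: apply the softmax ourselves to get class probabilities
probs = torch.softmax(logits, dim=1)
predicted = probs.argmax(dim=1)

print(loss.item(), manual.item())
print(predicted)
```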

---

## What is Cross Entropy?

* This ends up being equivalent to the negative log-likelihood of a distribution
  * But what is it?
* Alternative to maximizing the likelihood of a probability mass function
  * Instead, we find the $\phi$ that minimizes the distance between the distribution of $y$ and the model's predicted distribution
  * This is the Kullback-Leibler divergence ([also in PyTorch](https://docs.pytorch.org/docs/stable/generated/torch.nn.KLDivLoss.html#torch.nn.KLDivLoss))

---

## Skipping Math

* We don't need to dig into the internals, but once we simplify we end up with the Cross Entropy
  * Which is defined as the information required to distinguish one distribution from the other
* In the end, it simplifies to the negative log-likelihood from before
* The name makes it sound like something different, but it is not
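
-v-

* A numerical sketch of the skipped math, assuming three classes
  * KL divergence equals cross entropy minus the entropy of the true distribution
  * For a one-hot label the entropy is 0, so all that remains is the negative log-likelihood of the correct class
  * The helper functions and the example distributions are hypothetical

```python
import torch
import torch.nn.functional as F

q = torch.softmax(torch.tensor([2.0, -1.0, 0.5]), dim=0)  # model's predicted distribution
p_soft = torch.tensor([0.7, 0.1, 0.2])                     # a made-up true distribution
p_onehot = torch.tensor([1.0, 0.0, 0.0])                   # a label: class 0 is correct

def cross_entropy(p, q):
    return -(p * torch.log(q)).sum()

def entropy(p):
    nonzero = p[p > 0]  # 0 * ln(0) is treated as 0
    return -(nonzero * torch.log(nonzero)).sum()

def kl_divergence(p, q):
    mask = p > 0
    return (p[mask] * torch.log(p[mask] / q[mask])).sum()

# KL = cross entropy - entropy, and it matches PyTorch's kl_div (which takes log q)
print(kl_divergence(p_soft, q).item(), (cross_entropy(p_soft, q) - entropy(p_soft)).item())
print(F.kl_div(q.log(), p_soft, reduction='sum').item())

# One-hot label: KL, cross entropy and negative log-likelihood are the same number
print(kl_divergence(p_onehot, q).item(), cross_entropy(p_onehot, q).item(), -torch.log(q[0]).item())
```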

---

## Quiz Preparation

* First quiz today!
  * So we can be happy with a slightly short lecture
* Let's talk about things you should know!

---

## Topics

* Don't forget that we started with a review of data
  * Data has statistics
  * There is a difference between noise and variance
  * And bias is another thing altogether

---

## Learning

* Do you remember what is happening here?

---

## Parameters

* Why do we want to add more parameters to our neural networks?
* Is there anything a neural network *can't* do?
* Why do we use ReLU? Is it our only option?

---

## Question Style

* Given the image below, which of these statements is **true**?
  * The output layer has two bias parameters.
  * The first hidden layer has two bias parameters.
  * The second hidden layer has two weight parameters.
  * None of the above are true.

---

## Answer

* Given the image below, which of these statements is **true**?
  * The output layer has two bias parameters.
  * **The first hidden layer has two bias parameters.**
  * The second hidden layer has two weight parameters.
  * None of the above are true.

---

## Panic!

* Any other last-minute questions?
* This quiz is just multiple choice
  * Later quizzes may have more variety