# CS 462 - Lecture 16

## Attention

Bernhard Firner

2026-03-31

---

## Review

* We've been working up to transformers
* First, we talked about the struggles unique to NLP
  * How do we transform discrete tokens into input and outputs of neural networks?
  * The answer is embeddings
* And it turns out that we can use embeddings for any discrete input
  * Songs, products from a store, and so on

---

## NLP Training

* Natural language has a problem of unpredictable input and output sizes
* For Digits and CIFAR10, all inputs were the same size
  * But sentences are variable length
* Recurrent networks (RNNs) solve the problem of encoding variable length inputs, but cause other issues

---

## Recurrent Network Problems

* One problem is simply the pathway taken by backpropagation
* The LSTM solution, similar to ResNets, enables gradient descent through a long pathway

---

## LSTM

* We change the update of the hidden state to use additions
  * This technique was introduced [in 1997](https://ieeexplore.ieee.org/abstract/document/6795963)!
* LSTM: Long Short-Term Memory
  * Either retain the current hidden state
  * Or update it with an addition of a function of the input token
  * $s = s + gy$
* $g$ selects whether or not, and how much, of the update $y$ to use

---

## Training Trouble

* Training a recurrent network still presents challenges
  * How should the hidden state be initialized?
  * How do we do batch training? Must all inputs be the same length?
* Will it be more difficult to learn a corpus of short string?
  * Is Anna Karenina easier to learn than social media posts?

---

## Note on Recurrence

* We're about to blow past LSTMs and talk about attention and transformers
* This may give you the impression that LSTMs are outdated
* Theoretically, LSTMs are more powerful than a transformer with attention
  * But we currently struggle to train them effectively

---

## Context

* The linear neural network always had context because individual weights corresponded to a token at a particular position
  * i.e. we cannot "forget" a token just because it is far away
* The recurrent networks must learn to encode and retain information from past tokens into their hidden states
  * But long-term dependencies are difficult to learn

---

## Attention

* Researchers tried to simplify the task, adding shortcut paths to ease training
* This was originally called [alignment or soft-search](https://arxiv.org/abs/1409.0473) but the modern approach is called attention
* The [Attention is All You Need](https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html) paper did away with the recurrent network, showing that attention with a *tranformer architecture* could do the heavy lifting

---

## Encoder-Decoders

* Let's consider the motivation behind this development
* Given an input, $X = (x_1, ..., x_{T_x})$ we want to predict $Y = (y_1, ..., y_{T_y})$
* We're being vague here, but this description matches many things
  * Translation from one language to another
  * A prompt that determines a new output
  * Future songs in your playlist

---

## Encoder-Decoder Continued

* The encoder is two functions, $f$ and $q$
* Hidden states come from the encoder, $h_t = f(x_t, h_{t-1})$
* A context is generated by $c = q({h_1, ..., h_{T_x}})$
  * The context could be a fixed-length vector or variable, depending upon the setup

---

## Decoding

* Typically next token probabilities will be generated by the decoder
  * $p(y_t | {y_1, ..., y_{t-1}}, c) = g(y_{t-1}, s_t, c)$
* The next token is predicated upon $y_0, ...y_{t-1}$, but we cannot read all of those at once
  * Instead, previous tokens are captured in another hidden state, $s_t$

---

## Alignment

* The [alignment](https://arxiv.org/abs/1409.0473) paper proposed a small change
  * $p(y_t | {y_1, ..., y_{t-1}}, x) = g(y_{t-1}, s_t, c_t)$
* Now the context vector itself is position dependent, and comes from an *alignment model*
* The alignment model is tasked with learning a weight to assign to each encoder hidden state
  * $c_t = \sum_{j=1}^{T_x}\alpha_{tj}h_j$

---

## Alignment

* $c_t = \sum_{j=1}^{T_x}\alpha_{tj}h_j$
* Each $\alpha$ assigns a proportion of importance to each position
  * So they are made to sum to 1
  * $\alpha_{tj} = \frac{exp(e_{tj})}{\sum_{k=1}^{T_x}exp(e_{tk})}$
* The alignment model is the function $e_{tj} = a(s_{t-1},h_j)$

---

## What is This?

* This looks like a classifier
  * Different input tokens "compete" to be more involved at token $t$
* The alignment model uses the previous state vector and hidden state for the input token
  * Notice that this always has a fixed size
  * The number of operations grows linearly with the length of the input

---

## Results

* The technique, trained with sentences of length 50, improved translation performance.
* And translation quality held as sentences grew.

<div class="col">
<img style="width: 55%" class="r-stretch" src="./figures/NeuralMachineTranslationFig2.png" />
<br/>
<small>See <a href="https://arxiv.org/abs/1409.0473">Neural Machine Translation by Jointly Learning to Align and Translate</a> by Bahdanau, Cho, and Bengio for full results.</small>
</div>

---

## Alignments

* We can also see how the neural network aligns words during translation

<div class="col">
<img style="width: 35%" class="r-stretch" src="./figures/NeuralMachineTranslationFig3a.png" />
<br/>
<small>See <a href="https://arxiv.org/abs/1409.0473">Neural Machine Translation by Jointly Learning to Align and Translate</a> by Bahdanau, Cho, and Bengio, Figure 3.</small>
</div>

---

## Improvements

* Let's look at those equations:
  * $p(y_t | {y_1, ..., y_{t-1}}, x) = g(y_{t-1}, s_t, c_t)$
  * $s_t = f(s_{t-1}, y_{s-1}, c_t)$
  * $c_t = \sum_{j=1}^{T_x}\alpha_{tj}h_j$
  * $h_t = f(x_t, h_{t-1})$
* We are still using a recurrent relationship to find hidden states at the input and output

---

## Why Recurrence?

* But is recurrence necessary?
* Remember, a linear network with the embeddings from 50 words would use an incredible number of weights
  * The input will be multiplied by the size of the hidden layer, so this quickly becomes bad
  * 50 words, with an embedding size of 512, and a hidden unit with 1000 elements will have more than 25 million weights
* A recurrent network merely takes, as its input, the embedding of 1 word and the current hidden state

---

## Alternative: Convolutions

* We've already seen one alternative: convolutions
  * With convolutions, successive filters have increasing receptive fields, but not growing numbers of weights
  * Depth buys the same advantages with a larger initial linear layer
* But learning relationships between distant tokens is still difficult

---

## Removing Past Dependency

* What if we don't need a hidden state?
  * Alignment found relationships between words but only grew linearly with input length
    * Recall: $c_t = \sum_{j=1}^{T_x}\alpha_{tj}h_j$
    * $h_t = f(x_t, h_{t-1})$
* Since all past tokens are combined in $c_t$, do we really need to iteratively find $h_t$?
  * Can we just make a function of token $x_t$ and $t$, its distance?

---

## Dot-Product Self-Attention

* Each token, $x_1, ..., x_N$ is dimension $D\times 1$
* A linear value is determined from each input, $v_m = \beta_v + \omega_vx_m$
* The self-attention at each position, $sa_n$ is a sum of those values
  * $a$ is the attention, similar to alignment from before, and sums to 1 for each $n$
  * $sa_n[x_1, ..., x_N] = \sum_{m=1}^{N}a[x_m,x_n]v_m$

---

## Details

* $sa_n[x_1, ..., x_N] = \sum_{m=1}^{N}a[x_m,x_n]v_m$
* We translate each input into a vector, decide how much attention to pay to each one, and sum them together

<div class="col">
<img style="width: 65%" class="r-stretch" src="./figures/UDL/Chap12/TransformerRouting.svg" />
<br/>
<small>See Figure 12.1 of the UDL book</small>
</div>

---

## In Matrices

* Remember that the weights used are the same for each input
* Notice that the actual computation done scales linearly with the sequence length
  * And the number of attention weights are quadratic with sequence length

<div class="col">
<img style="width: 65%" class="r-stretch" src="./figures/UDL/Chap12/TransformerSA1.svg" />
<br/>
<small>See Figure 12.2 of the UDL book</small>
</div>

---

## Semantics

* The attention weights are both an output of one part of the network, and the weights used in another part
  * This can feel confusing since we are using the word "weights" twice
* The operation is no more complicated than any other matrix multiplications

---

## Computing Attention

* We know what it is, but we need to actually compute attention
* For each input, we compute a "query" and a "key", multiply them, and take the softmax
  * This can be thought of a measure of similarity
* Queries and keys are both obtained from linear transformations:
  * $q_n = \beta_q + \omega_qx_n$
  * $k_m = \beta_k + \omega_kx_m$

---

## Computing Attention

* Given the query and key for each combination of tokens, $(m, n)$
  * $q_n = \beta_q + \omega_qx_n$
  * $k_m = \beta_k + \omega_kx_m$
* We solve for attention that we pay to token $m$ at position $n$
  * $a[x_m,x_n] = \frac{exp([k_m^Tq_n])}{\sum_{i=1}^{N}exp[k_i^Tq_n]}$

---

<div class="col">
<img style="width: 60%" class="r-stretch" src="./figures/UDL/Chap12/TransformerSA2.svg" />
<br/>
<small>See Figure 12.3 of the UDL book</small>
</div>

---

## Self-Attention Parameters

* In total, there are three linear transformations:
  * $phi = {\beta_v, \omega_v, \beta_q, \omega_q, \beta_k, \omega_k}$
* They are independent of the sequence length, N
  * So we can use the same weights with sentences of different lengths
  * This makes training much simpler than with recurrent networks

---

## Matrix Form

<div class="col">
<img style="width: 65%" class="r-stretch" src="./figures/UDL/Chap12/TransformerBlockSA.svg" />
<br/>
<small>See Figure 12.4 of the UDL book</small>
</div>

---

## Position

* You may be wondering how this can completely replace the context
  * Context had an ordering to it, this does not
  * Remember, we are just adding vectors together
* To fix this, we will also add a positional encoding

---

## Positional Encoding

* The positional encoding is just a vector added to the input embeddings
* This may seem too easy, but we are leaning on neural networks to sort it out
* As long as the value is unique at each position

</div>

<div class="col">
<img style="width: 65%" class="r-stretch" src="./figures/UDL/Chap12/TransformerPE.svg" />
<br/>
<small>See Figure 12.4 of the UDL book</small>
</div>
</div>

---

## Positional Encoding

* The encoding could be learned, but alternating sin and cos values are generally used
* The encoding will be relative to the distance from the current token
  * Even indices are summed with: $sin\left(\frac{pos}{10000^{2i/D}}\right)$
  * Odd indices are summed with: $cos\left(\frac{pos}{10000^{2i/D}}\right)$

```python
# Create a positional encoder, following the Vaswami method.
# Each token will have a slightly different value
# based upon its location within the context window
pe = torch.zeros(context_length, num_tokens + num_tokens%2)
position = torch.arange(0, context_length, dtype=torch.float).unsqueeze(1)
div_term = torch.exp(torch.arange(0, num_tokens, 2).float() * (-math.log(10000.0) / context_length))
# The positional encoding changes based upon the index being odd or even
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)
# We may have added an extra 1 here since the above math didn't work with odd numbers
pe = pe[:, :num_tokens]
```

---

## Scaled Attention

* We (of course!) need to worry about gradients and stability
  * It turns out that initial values of $k_m^Tq_n$ can be enormous
  * This will make the gradients of other values disappear
  * $Attention(Q,K,V) = Softmax(Q^TK)V$
* To resolve that problem, we scale by the root of the query size
  * $Attention(Q,K,V) = Softmax(\frac{Q^TK}{\sqrt{|D|}})V$

---

## Multi-Headed Attention

* Finally, we want multiple projections from attention modules
  * Similar to convolutional networks producing multiple feature maps
* This allows different attention "heads" to attend to different "features"
  * Hypothetically, the attention for rhyming and attention for meaning should be different
  * Such a distinction would be lost when everything is averaged together

---

## Multi-Headed Attention

* Think of this as being roughly equivalent to multiple features maps in convnets

<div class="col">
<img style="width: 50%" class="r-stretch" src="./figures/UDL/Chap12/TransformerBlockSAMultiHead.svg" />
<br/>
<small>Multi-Headed version of attention.</small>
</div>

---

## The Transformer

* After going through all of that, we are finally ready for the transformer
* Context is now created in one shot, solving out training problems
* But let's review for this week's quiz topics first

</div>
<div class="col">
<img style="width: 55%" class="r-stretch" src="./figures/AttentionIsAllYouNeedFigure1.png" />
<br/>
<small> <a href="https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html">Attention is All You Need</a></small>
</div>
</div>

---

## Quiz Topics

* From lecture 13 up to today
* ResNets
* Lessons from ConvNeXt
* Transformer prerequisites
  * Embeddings
  * Attention

---

What describes the problem of shattered gradients?  
a. Gradients lack a consistent direction, preventing learning.  
b. Too many gradients are 0 or close to 0.  
c. The variance between network parameters and network output is too small.  
d. All of the above.  
e. None of the above.
</div>

---

What describes the problem of shattered gradients?  
a. **Gradients lack a consistent direction, preventing learning.**  
b. Too many gradients are 0 or close to 0.  
c. The variance between network parameters and network output is too small.  
d. All of the above.  
e. None of the above.
</div>

---

What solution was used to resolve the problem of shattered gradients?  
a. AdamW.  
b. Label smoothing.  
c. Layer Norms in place of Batch Norms.  
d. All of the above.  
e. None of the above.
</div>

---

What solution was used to resolve the problem of shattered gradients?  
a. AdamW.  
b. Label smoothing.  
c. Layer Norms in place of Batch Norms.  
d. All of the above.  
e. **None of the above.**
</div>

* Residual layers with their skip connections create a smoother loss surface.

---

Which of these is **not** a regularization technique?  
a. AdamW with an L2 penalty.  
b. Label smoothing.  
c. Layer Norms in place of Batch Norms.  
d. All of the above.  
e. None of the above.
</div>

---

Which of these is **not** a regularization technique?  
a. AdamW with an L2 penalty.  
b. Label smoothing.  
c. **Layer Norms in place of Batch Norms.**  
d. All of the above.  
e. None of the above.
</div>

* The layer norm will apply a consistent normalization per input, so it no longer has the noisy property of batch normalization.

---

Which of these statements about stochastic depth is **true**?  
a. It makes training faster.  
b. It improves DNN performance.  
c. It can only be used when there are skip connections.  
d. All of the above.  
e. None of the above.
</div>

---

---

What did AdamW fix?  
a. Nonlinearity in the activation function.  
b. The L2 penalty no longer being equivalent to weight decay.  
c. An unstable learning rate after many epochs.  
d. All of the above.  
e. None of the above.
</div>

---

What did AdamW fix?  
a. Nonlinearity in the activation function.  
b. **The L2 penalty no longer being equivalent to weight decay.**  
c. An unstable learning rate after many epochs.  
d. All of the above.  
e. None of the above.
</div>

---

What is the receptive field of two successive 5x5 stride 1 convolutions?

---

What is the receptive field of two successive 5x5 stride 1 convolutions?

</div>

* The second 5x5 must take, as its input, the outputs of 25 5x5 convolutions from the first layer
  * We can worry about one dimension at a time. How many source pixels correspond to 5 5x5 convolutions?

---

What is the advantage of one-hot vectors over embeddings when training NLP models?  
a. Embeddings are not interpretable.  
b. One-hot vectors can be easily combined with other one-hot vectors, unlike embeddings.  
c. One-hot vectors require less space.  
d. All of the above are advantages.  
d. None of the above are true.
</div>

---

---

What is true about the number of attention weights, $phi = {\beta_v, \omega_v, \beta_q, \omega_q, \beta_k, \omega_k}$  
a. The number of weights in $\omega_q$ grow with the number of token inputs.  
b. The number of weights in $\omega_v$ grow with the embedding size of the input.  
c. The number of weights in $\omega_k$ are equal to the number of weights in $\omega_q$ plus the weights in $\omega_v$.  
d. All of the above are true.  
d. None of the above are true.
</div>

---

What is true about the number of attention weights, $phi = {\beta_v, \omega_v, \beta_q, \omega_q, \beta_k, \omega_k}$?  
a. The number of weights in $\omega_q$ grow with the number of token inputs.  
b. **The number of weights in $\omega_v$ grow with the embedding size of the input.**  
c. The number of weights in $\omega_k$ are equal to the number of weights in $\omega_q$ plus the weights in $\omega_v$.  
d. All of the above are true.  
d. None of the above are true.
</div>

---

* Independence from the input length is the big advantage that we get with attention.
* The total number of attention weights does grow with sequence length, though.
* With embedding size $D$ and sequence length $n$
  * Each attention weight has size $D$
  * There are $n^2$ attention weights