# CS 462 - Lecture 18

<div class='a'>
Transformers
</div>
<div class='b'>
Transformers
</div>
<br/>
<div class='c'>
Transformers
</div>
<br/>
<div class='d'>
Transformers
</div>
<br/>

Bernhard Firner

2026-04-02

---

## Review

* Transformers! We are finally here!
* But what path did we follow?
  * Embeddings
  * Alignment
  * Multi-Headed Scaled Dot Product Attention
    * Which is so cool that it has its own [PyTorch implementation](https://docs.pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html#torch.nn.functional.scaled_dot_product_attention)
  * We also added position encoding at the last minute

---

## Ouch!

* That's a lot of stuff
  * And it seemed to be a bit confusing, too!
* So let's carefully go over these advances

---

## Why Alignment?

* When translating, or doing any `seq2seq` transformation, we need to find some "hidden state"
  * `seq2seq` could mean words to other words, as in translation, question and answer, suggesting a playlist, etc
  * Hidden state here means some vector that captures the semantics of the token sequence
    * $h_t$ should capture the "meaning" of all tokens from $x_0, ..., x_{t-1}$
* The preceding tokens can be of variable length, but we want the hidden state to be fixed-length so that it can be a network input

---

## Hidden State Problems

* Squeezing a whole sequence into one fixed-length vector turns out to be too difficult
* So in the [alignment paper](https://arxiv.org/abs/1409.0473), they suggested keeping all of the hidden states separate
* During translation, a learned scoring operation is applied to each hidden state, individually, to determine how important it is
* Those scores go through a softmax, and the resulting weights blend the hidden states into a version of the hidden state for our current position in the decoder
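
A minimal sketch of that weighting step, in plain Python so the arithmetic stays visible. The hidden states and the scoring weights `w` are made-up values, and the single linear scorer is a simplification for illustration (the paper's scorer also looks at the decoder state).

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Three encoder hidden states, each of size 2 (made-up values)
hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Hypothetical scoring weights: one linear operation applied to each state
w = [2.0, 1.0]

# One importance score per hidden state
scores = [sum(wi * hi for wi, hi in zip(w, h)) for h in hidden_states]
weights = softmax(scores)

# The context for the current decoder position is the weighted sum of the states
context = [sum(a * h[d] for a, h in zip(weights, hidden_states)) for d in range(2)]
print(weights)
print(context)
```

The third state gets the highest score here, so it dominates the blended context.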

---

## Alignments

<div class="col">
<img style="width: 35%" class="r-stretch" src="./figures/NeuralMachineTranslationFig3a.png" />
<br/>
<small>See <a href="https://arxiv.org/abs/1409.0473">Neural Machine Translation by Jointly Learning to Align and Translate</a> by Bahdanau, Cho, and Bengio, Figure 3.</small>
</div>

---

## Attention

* Attention takes that one step further
* Previously, a recurrent network was used to find those hidden states
  * What if we could get rid of that recurrent network, and replace it with some simple linear operations?
  * The only thing we lose is the position information

---

## What Attention Will Be

* A linear transformation will be run on each embedding, producing $V$
* Those vectors are then blended at each output position

<div class="col">
<img style="width: 65%" class="r-stretch" src="./figures/UDL/Chap12/TransformerRouting.svg" />
<br/>
<small>See Figure 12.1 of the UDL book</small>
</div>

---

## Dot Product Attention

* With alignment, computing those weights required an RNN
* Dot product attention computes them in one operation

<div class="col">
<img style="width: 50%" class="r-stretch" src="./figures/UDL/Chap12/TransformerSA2.svg" />
<br/>
<small>See Figure 12.2 of the UDL book</small>
</div>

---

## Computing Attention

* The dot product computes a value for each key and query pair
  * This captures the context over the entire window
* We solve for the attention that position $n$ pays to token $m$
  * $v_m = \beta_v + \omega_v x_m$
  * $q_n = \beta_q + \omega_q x_n$
  * $k_m = \beta_k + \omega_k x_m$
  * $a[x_m,x_n] = \frac{\exp\left[k_m^Tq_n\right]}{\sum_{i=1}^{N}\exp\left[k_i^Tq_n\right]}$
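
These equations can be traced by hand on a tiny example. This is a sketch in plain Python; the 2-dimensional embeddings, the weight matrices, and the zero biases are made-up values chosen so the numbers are easy to follow.

```python
import math

def linear(beta, omega, x):
    # beta + omega @ x for a single vector x
    return [b + sum(w * xi for w, xi in zip(row, x)) for b, row in zip(beta, omega)]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

# N = 3 tokens with D = 2 embeddings (made-up values)
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
zero = [0.0, 0.0]
omega_v = [[1.0, 0.0], [0.0, 1.0]]
omega_q = [[1.0, 0.0], [0.0, 1.0]]
omega_k = [[2.0, 0.0], [0.0, 1.0]]

V = [linear(zero, omega_v, x) for x in X]
Q = [linear(zero, omega_q, x) for x in X]
K = [linear(zero, omega_k, x) for x in X]

# attention[n][m] is a[x_m, x_n]: how much position n attends to token m
attention = [softmax([dot(k_m, q_n) for k_m in K]) for q_n in Q]

# Each output is a blend of the value vectors using those weights
outputs = [[sum(a * v[d] for a, v in zip(row, V)) for d in range(2)] for row in attention]

for row in attention:
    print([round(a, 3) for a in row])
```

Each printed row sums to 1, since the softmax normalizes over all keys for a fixed query.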

---

## Matrix Form

<div class="col">
<img style="width: 65%" class="r-stretch" src="./figures/UDL/Chap12/TransformerBlockSA.svg" />
<br/>
<small>See Figure 12.4 of the UDL book</small>
</div>

---

## Look at those Operations!

* There are three linear networks involved
  * And their weights and biases depend only upon the embedding size
  * So this will easily scale with the sequence length!
* In fact, only the number of attention weights (the blending weights, not learned parameters) will increase
  * With the square of the context size, but even a context of 100 isn't too much
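
A quick count makes the scaling concrete. The sizes below are arbitrary example values, and the sketch assumes each of the three linear maps keeps the embedding size (a $D \times D$ weight plus a bias of size $D$).

```python
def qkv_parameter_count(embedding_dim):
    # Three linear maps (values, queries, keys), each with a D x D weight and a D bias
    return 3 * (embedding_dim * embedding_dim + embedding_dim)

def attention_weight_count(context_size):
    # One attention weight for every (query position, key position) pair
    return context_size * context_size

embedding_dim = 64
for context_size in (10, 100, 1000):
    # The parameter count stays fixed while the attention weights grow quadratically
    print(context_size, qkv_parameter_count(embedding_dim), attention_weight_count(context_size))
```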

---

## Details

* Previously we used an RNN, which learned position implicitly
* We've lost that, so we add it back in
  * Even indices are summed with: $\sin\left(\frac{pos}{10000^{2i/D}}\right)$
  * Odd indices are summed with: $\cos\left(\frac{pos}{10000^{2i/D}}\right)$
* $pos$ is the token's position in the sequence, and $i$ counts the sin/cos pairs along the embedding (indices $2i$ and $2i+1$)

</div>

<div class="col">
<img style="width: 65%" class="r-stretch" src="./figures/UDL/Chap12/TransformerPE.svg" />
<br/>
<small>See Figure 12.4 of the UDL book</small>
</div>
</div>

-v-

```python
import math
import torch

embedding_dim = 10
num_tokens = 4

# Create a positional encoder, following the Vaswani method.
# Each token will have a slightly different value
# based upon its location within the context window
pe = torch.zeros(embedding_dim, num_tokens + num_tokens%2)
position = torch.arange(0, embedding_dim, dtype=torch.float).unsqueeze(1)
div_term = torch.exp(torch.arange(0, num_tokens, 2).float() * (-math.log(10000.0) / embedding_dim))
# The positional encoding changes based upon the index being odd or even
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)
# Drop the padding column that was added above when num_tokens is odd
pe = pe[:, :num_tokens]
print(pe)
```

-v-

* Each row is similar to a column in the example image

```
tensor([[ 0.0000,  1.0000,  0.0000,  1.0000],
        [ 0.8415,  0.5403,  0.1578,  0.9875],
        [ 0.9093, -0.4161,  0.3117,  0.9502],
        [ 0.1411, -0.9900,  0.4578,  0.8891],
        [-0.7568, -0.6536,  0.5923,  0.8057],
        [-0.9589,  0.2837,  0.7121,  0.7021],
        [-0.2794,  0.9602,  0.8140,  0.5809],
        [ 0.6570,  0.7539,  0.8954,  0.4452],
        [ 0.9894, -0.1455,  0.9545,  0.2983],
        [ 0.4121, -0.9111,  0.9896,  0.1439]])
```

</div>

<div class="col">
<img style="width: 65%" class="r-stretch" src="./figures/UDL/Chap12/TransformerPE.svg" />
<br/>
<small>See Figure 12.4 of the UDL book</small>
</div>
</div>

---

## Sensible?

* We add the position encoding to the embeddings
* How does that make sense?
  * The position and the original embedding are linearly separable
  * So, if something is important, then $\omega_v$, $\omega_k$, and $\omega_q$ can learn it
* We are just going to have to trust gradient descent

---

## Two More Details

* To make learning stable, we cannot allow initial values to be huge
  * Once they go through the softmax, they may drive other values too close to 0
  * So we scale everything going into the softmax
* Change $Attention(Q,K,V) = Softmax(QK^T)V$
* Into $Attention(Q,K,V) = Softmax\left(\frac{QK^T}{\sqrt{D}}\right)V$
  * Each row of $Q$, $K$, and $V$ is one token's vector, and $D$ is the size of the query and key vectors
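
A small sketch of why the scaling matters, in plain Python. The query and keys are made-up vectors with entries of magnitude 1, so their dot products grow with the dimension $D$.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

D = 64
query = [1.0] * D

def key_agreeing_on(n):
    # A key that matches the query on n dimensions and opposes it on the rest
    return [1.0] * n + [-1.0] * (D - n)

keys = [key_agreeing_on(40), key_agreeing_on(32), key_agreeing_on(24)]

raw = [sum(k * q for k, q in zip(key, query)) for key in keys]   # [16.0, 0.0, -16.0]
scaled = [s / math.sqrt(D) for s in raw]                          # [2.0, 0.0, -2.0]

print(softmax(raw))     # nearly one-hot: the second weight is around 1e-7
print(softmax(scaled))  # the second weight is around 0.12
```

Without the scaling, the softmax saturates and almost no gradient reaches the smaller scores.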

---

## Feature Diversity

* The output of attention is similar to the feature maps output from convolutions
  * And we know that just one feature isn't enough!
* So we use multiple heads

</div>
<div class="col">
<img style="width: 90%" class="r-stretch" src="./figures/UDL/Chap12/TransformerBlockSAMultiHead.svg" />
<br/>
<small>Multi-Headed version of attention.</small>
</div>
</div>
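
A rough sketch of the multi-head idea in plain Python. It assumes $Q$, $K$, and $V$ have already been computed (the values are made up), gives each head an equal slice of every vector, and leaves out the final linear layer that normally mixes the concatenated heads.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    # Scaled dot product attention; rows of Q, K, V are per-token vectors
    d = len(Q[0])
    out = []
    for q in Q:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K])
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out

def split_heads(M, num_heads):
    # Cut each row into num_heads equal chunks: one list of rows per head
    size = len(M[0]) // num_heads
    return [[row[h * size:(h + 1) * size] for row in M] for h in range(num_heads)]

# N = 3 tokens, D = 4 (made-up values)
Q = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
K = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0], [1.0, 1.0, 1.0, 1.0]]
V = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0], [9.0, 10.0, 11.0, 12.0]]

num_heads = 2
head_outputs = [
    attention(q, k, v)
    for q, k, v in zip(split_heads(Q, num_heads), split_heads(K, num_heads), split_heads(V, num_heads))
]

# Concatenate the heads back into one D-sized vector per token
combined = [sum((head[n] for head in head_outputs), []) for n in range(len(Q))]
print(combined)
```

Each head computes its own attention weights, so the two halves of every output vector can blend the tokens differently.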

---

## Now: The Transformer

* You will recognize much of this from ConvNeXt
* Skip connections, stochastic depth, label smoothing, layer norm, etc.
* This is a compact view of both an encoder and decoder

</div>
<div class="col">
<img style="width: 55%" class="r-stretch" src="./figures/AttentionIsAllYouNeedFigure1.png" />
<br/>
<small> <a href="https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html">Attention is All You Need</a></small>
</div>
</div>

---

## Book Version

* The book has an easier-to-follow diagram
* Notice that we are using skip connections
  * This allows us to train a deeper network, avoid shattered gradients, use stochastic depth, etc.

<div class="col">
<img style="width: 65%" class="r-stretch" src="./figures/UDL/Chap12/TransformerBlock.svg" />
<br/>
<small>See Figure 12.10 of the UDL book</small>
</div>

---

## Not So Complicated

* Before you groan at how complicated this looks, it is actually very similar to ConvNeXt
  * Remember, ConvNeXt took inspiration from *this*
  * So if you followed that, this will be simple

---

## ConvNeXt Block

* Recall ConvNeXt:

```
  (1): CNBlock(
    (block): Sequential(
      (0): Conv2d(96, 96, kernel_size=(7, 7), stride=(1, 1), padding=(3, 3), groups=96)
      (1): Permute()
      (2): LayerNorm((96,), eps=1e-06, elementwise_affine=True)
      (3): Linear(in_features=96, out_features=384, bias=True)
      (4): GELU(approximate='none')
      (5): Linear(in_features=384, out_features=96, bias=True)
      (6): Permute()
    )
```

---

## Transformer Block

* The Multi-Head Attention module replaces the convolution
* The layer norm is the same
* The pair of linear layers with a nonlinearity in between is the same
  * These can be either a permutation followed by linear layers, or 1x1 convolutions
    * The two are equivalent
* The dimensionality is different; Attention is All You Need used 512 features going into the first linear layer and 2048 coming out of it, with the second layer mapping back to 512
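
The sizes quoted here give a feel for where the parameters sit. A quick count, assuming each linear layer has a bias (as in the ConvNeXt printout on the earlier slide):

```python
def linear_parameter_count(in_features, out_features):
    # Weight matrix plus one bias per output feature
    return in_features * out_features + out_features

d_model, d_hidden = 512, 2048
expand = linear_parameter_count(d_model, d_hidden)    # 512 -> 2048
contract = linear_parameter_count(d_hidden, d_model)  # 2048 -> 512
print(expand, contract, expand + contract)

# The ConvNeXt block's pair of linear layers, for comparison: 96 -> 384 -> 96
convnext_pair = linear_parameter_count(96, 384) + linear_parameter_count(384, 96)
print(convnext_pair)
```

Both use the same 4x expansion; the transformer's pair is simply much wider.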

<div class="col">
<img style="width: 45%" class="r-stretch" src="./figures/UDL/Chap12/TransformerBlock.svg" />
<br/>
<small>See Figure 12.10 of the UDL book</small>
</div>

---

## Using the Transformer

* That was the transformer, but how do we use it?
* In the world of NLP models, we still need an embedding and an encoder and a decoder to work with it
* We can begin with a pretrained word embedding
  * Using CBOW or skip-grams or some other unsupervised technique, such as predicting missing words

<div class="col">
<img style="width: 45%" class="r-stretch" src="./figures/UDL/Chap12/TransformerEncoder.svg" />
<br/>
<small>See Figure 12.10 of the UDL book</small>
</div>

---

## Pretraining With Context

* Pretraining the embedding with the transformer is convenient
* It captures context in a way that CBOW and skip-grams did not
* Introduced in [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) in 2018

<div class="col">
<img style="width: 55%" class="r-stretch" src="./figures/UDL/Chap12/TransformerEncoder.svg" />
<br/>
<small>See Figure 12.10 of the UDL book</small>
</div>

---

## BERT

* **B**idirectional **E**ncoder **R**epresentations from **T**ransformers
* BERT looked at context both before and after the token being predicted
  * But unlike CBOW or skip-grams, it also captures structure
* This advancement was enabled by the transformer architecture, and swiftly followed its rise to popularity

---

## BERT Training

* There are two phases of training with the BERT approach
* First, pretraining by learning to predict masked words
  * Similar in operation to CBOW, but with structure and possibly multiple masked words

---

## BERT Training

* In the second phase, a new downstream task is used
  * The BERT paper generally refers to a task with question and answer sentences as inputs
* BERT adds an additional embedding to each token, this one indicating whether the token is in sentence 1 or 2

---

## Some Details

* Just in case you get into an argument with someone about BERT training
* 15% of words are selected for masking
  * The selected word is replaced with the mask token only 80% of the time
  * 10% of the time the selected word is replaced with a random word
  * 10% of the time the original word is left alone
    * That is sufficient to bias the model towards real sentences
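
The selection rule can be sketched in a few lines of plain Python. This is a simplified illustration: the function name `mask_tokens`, the tiny vocabulary, and the whole-word tokens are assumptions for the example, and special tokens such as `[CLS]` and `[SEP]` are not handled.

```python
import random

def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (corrupted tokens, labels); labels are None where nothing is predicted."""
    corrupted, labels = [], []
    for token in tokens:
        if rng.random() >= select_prob:
            # Not selected: the token is left as-is and is not a prediction target
            corrupted.append(token)
            labels.append(None)
            continue
        labels.append(token)  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            corrupted.append("[MASK]")           # 80%: replace with the mask token
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))  # 10%: replace with a random word
        else:
            corrupted.append(token)              # 10%: leave unchanged, still predicted
    return corrupted, labels

rng = random.Random(0)
sentence = "the man went to the store".split()
vocab = ["the", "man", "went", "to", "store", "penguin", "milk"]
corrupted, labels = mask_tokens(sentence, vocab, rng)
print(corrupted)
print(labels)
```

With a sentence this short, many runs select nothing at all; the 15% rate only shows up over a large corpus.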

---

## Some Examples

* Next sentence prediction examples from the BERT paper
* Input:
  * [CLS] the man went to [MASK] store [SEP] he bought a gallon [MASK] milk [SEP]
  * Label: IsNext
  * [CLS] the man [MASK] to the store [SEP] penguin [MASK] are flight ##less birds [SEP]
  * Label: NotNext

---

## Other Downstream Tasks

<div class="col">
<img style="width: 70%" class="r-stretch" src="./figures/UDL/Chap12/TransformerEncoderFineTune.svg" />
<br/>
<small>Figure 12.11 from UDL</small>
</div>

---

## Decoding

* We can train all sorts of tasks
  * This approach (from BERT) means that researchers could grab a pretrained NLP model and use it for anything
* Similar to how people used pretrained AlexNet for just about any image task
  * Part of that excitement and ease-of-use led to today's language models
* Great, but how do we train a decoder?

---

## Masked Attention

* We want to predict an entire output sentence (or paragraph), but we want to train on an entire sentence at a time
  * Learning one word at a time is inefficient, right?
* So we use the transformer as usual, predicting all of the output tokens
  * The inputs only interact during the dot product self-attention
  * So we simply set attention weights to 0 for any token "ahead" of the token being predicted (by setting those scores to $-\infty$ before the softmax)
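
A minimal sketch of the masking step in plain Python. The score matrix is made up; in a real model it would come from the query and key dot products.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up pre-softmax scores: scores[n][m] is query position n against key position m
scores = [
    [0.5, 1.0, 2.0],
    [1.5, 0.2, 0.7],
    [0.3, 0.9, 0.1],
]

# Position n may only look at positions m <= n; later ones get -infinity
masked = [
    [s if m <= n else -math.inf for m, s in enumerate(row)]
    for n, row in enumerate(scores)
]
weights = [softmax(row) for row in masked]

print(weights[0])  # → [1.0, 0.0, 0.0]
for row in weights:
    print([round(w, 3) for w in row])
```

The weights above the diagonal are exactly 0, so each position's output depends only on itself and earlier tokens.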

---

## Masked Attention

* Not pictured: there are multiple transformer layers
* Later tokens will attend to the vectors from earlier tokens at each layer, so the output should be a coherent sentence

<div class="col">
<img style="width: 65%" class="r-stretch" src="./figures/UDL/Chap12/TransformerDecoder.svg" />
<br/>
<small>Figure 12.12 from UDL</small>
</div>

---

## Encoder+Decoder

* That was a decoder on its own, so how about an actual seq2seq task?
  * Like translation
* Unlike with the RNN, we don't need to process the sentence one word at a time
* But we need to capture the context of the source sequence in the encoder and give it to the decoder
  * The mechanism for this is called *cross-attention*

---

## Cross-Attention

<div class="col">
<img style="width: 70%" class="r-stretch" src="./figures/UDL/Chap12/TransformerBlockSACross.svg" />
<br/>
<small>Figure 12.12 from UDL</small>
</div>

---

## Cross-Attention In Use

* In the decoder, we will let the keys and values come from the encoder's embeddings
  * The query vectors still come from the decoder's embeddings
* During training, the ground truth is made available to the decoder so that it learns to predict future tokens from earlier ones
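
A sketch of cross-attention in plain Python, with made-up vectors standing in for the already-projected queries, keys, and values. The function name `cross_attention` is only for this example.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(decoder_queries, encoder_keys, encoder_values):
    # Same scaled dot product as self-attention; only the sources of Q and K/V differ
    d = len(decoder_queries[0])
    outputs = []
    for q in decoder_queries:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in encoder_keys])
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, encoder_values))
            for j in range(len(encoder_values[0]))
        ])
    return outputs

# 4 source tokens from the encoder, 2 target tokens so far in the decoder
encoder_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
encoder_values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [0.0, 0.0]]
decoder_queries = [[1.0, 0.0], [0.0, 1.0]]

out = cross_attention(decoder_queries, encoder_keys, encoder_values)
print(len(out), len(out[0]))  # → 2 2
```

There is one output row per decoder position, each a blend of the encoder's value vectors, so the source and target sequences can have different lengths.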

---

## Cross-Attention In Use

* With the ground truth at the input, we need masked attention in the decoder to prevent cheating
* The embeddings from the encoder and decoder are combined through cross-attention in a subsequent attention layer

---

## Encoder+Decoder

<div class="col">
<img style="width: 55%" class="r-stretch" src="./figures/UDL/Chap12/TransformerEncoderDecoder.svg" />
<br/>
<small>Figure 12.13 from UDL</small>
</div>

---

## Online Examples

* PyTorch has a training and inference demo if you want to see all of this end to end:
  [https://github.com/pytorch/examples/tree/main/word_language_model](https://github.com/pytorch/examples/tree/main/word_language_model)
* Transformers have the bad habit of being rather complicated
  * So let's end here and leave some brain space for the quiz