* Previously we used an RNN, which learned position implicitly
* We've lost that, so we add it back in
* Even indices are summed with: $\sin\left(\frac{pos}{10000^{2i/D}}\right)$
* Odd indices are summed with: $\cos\left(\frac{pos}{10000^{2i/D}}\right)$
* $pos$ is the token's position in the context window, $i$ indexes each sine/cosine pair of embedding dimensions, and $D$ is the embedding dimension
See Figure 12.4 of the UDL book
-v-
```python
import math
import torch
embedding_dim = 10
num_tokens = 4
# Create a positional encoder, following the Vaswani et al. method.
# Each token will have a slightly different value
# based upon its location within the context window
pe = torch.zeros(embedding_dim, num_tokens + num_tokens%2)
position = torch.arange(0, embedding_dim, dtype=torch.float).unsqueeze(1)
div_term = torch.exp(torch.arange(0, num_tokens, 2).float() * (-math.log(10000.0) / embedding_dim))
# The positional encoding changes based upon the index being odd or even
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)
# Drop the padding column that was added above when num_tokens is odd
pe = pe[:, :num_tokens]
print(pe)
```
-v-
* Each row is similar to a column in the example image
```
tensor([[ 0.0000, 1.0000, 0.0000, 1.0000],
[ 0.8415, 0.5403, 0.1578, 0.9875],
[ 0.9093, -0.4161, 0.3117, 0.9502],
[ 0.1411, -0.9900, 0.4578, 0.8891],
[-0.7568, -0.6536, 0.5923, 0.8057],
[-0.9589, 0.2837, 0.7121, 0.7021],
[-0.2794, 0.9602, 0.8140, 0.5809],
[ 0.6570, 0.7539, 0.8954, 0.4452],
[ 0.9894, -0.1455, 0.9545, 0.2983],
[ 0.4121, -0.9111, 0.9896, 0.1439]])
```
See Figure 12.4 of the UDL book
---
## Sensible?
* We add the position encoding to the embeddings
* How does that make sense?
* The position and the original embedding are linearly separable
* So, if something is important, then $\omega_v$, $\omega_k$, and $\omega_q$ can learn it
* We are just going to have to trust gradient descent
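A minimal sketch of the addition described above, assuming one row per token position and one column per embedding dimension, with an even embedding dimension. The helper name `sinusoidal_encoding` and the random stand-in embeddings are illustrative assumptions, not part of the lecture code.

```python
import math
import torch


def sinusoidal_encoding(num_tokens: int, embedding_dim: int) -> torch.Tensor:
    """Hypothetical helper: sinusoidal encoding of shape (num_tokens, embedding_dim)."""
    # Assumes embedding_dim is even so sine and cosine columns pair up.
    position = torch.arange(num_tokens, dtype=torch.float).unsqueeze(1)  # (num_tokens, 1)
    # One frequency per sine/cosine pair: 1 / 10000^(2i / D)
    div_term = torch.exp(
        torch.arange(0, embedding_dim, 2, dtype=torch.float)
        * (-math.log(10000.0) / embedding_dim)
    )
    pe = torch.zeros(num_tokens, embedding_dim)
    pe[:, 0::2] = torch.sin(position * div_term)  # even embedding dimensions
    pe[:, 1::2] = torch.cos(position * div_term)  # odd embedding dimensions
    return pe


torch.manual_seed(0)
num_tokens, embedding_dim = 4, 10
# Random stand-in for learned token embeddings (an assumption for this sketch)
embeddings = torch.randn(num_tokens, embedding_dim)
pe = sinusoidal_encoding(num_tokens, embedding_dim)
x = embeddings + pe  # same shape, so this is an element-wise sum
print(x.shape)
```

Because the sum keeps the embedding shape, the rest of the network does not need to change; whether it separates position from content is left to the learned projections.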
---
## Two More Details
* To make learning stable, we cannot allow initial values to be huge
* Once they go through the softmax, a few large values can push the other values too close to 0
* So we scale everything going into the softmax
* Change $\text{Attention}(Q,K,V) = \text{Softmax}\left(Q^TK\right)V$
* Into $\text{Attention}(Q,K,V) = \text{Softmax}\left(\frac{Q^TK}{\sqrt{|D|}}\right)V$
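A small sketch of why the scaling helps, assuming random queries and keys with unit-variance entries. It stores one token per row, so the scores are computed as `Q @ K.T`; that layout is an assumption of this sketch and may differ from the slide's notation.

```python
import math
import torch

torch.manual_seed(0)
num_tokens, D = 6, 512
Q = torch.randn(num_tokens, D)
K = torch.randn(num_tokens, D)
V = torch.randn(num_tokens, D)

# Dot products of D unit-variance terms have a standard deviation near sqrt(D)
scores = Q @ K.T  # (num_tokens, num_tokens)

unscaled = torch.softmax(scores, dim=-1)
scaled = torch.softmax(scores / math.sqrt(D), dim=-1)

# Largest attention weight in each row, without and with scaling
print(unscaled.max(dim=-1).values)
print(scaled.max(dim=-1).values)

output = scaled @ V  # (num_tokens, D)
```

Dividing by $\sqrt{D}$ can only flatten each softmax row, so the largest weight per row is never bigger after scaling, and the smaller weights stay further from 0.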
---
## Feature Diversity
* You will recognize much of this from ConvNeXt
* Skip layers, stochastic depth, label smoothing, layer norm, etc.
* This is a compactified view of both an encoder and decoder
---
## Book Version
* The book has an easier-to-follow diagram
* Notice that we are using skip connections
* This allows us to train a deeper network, mitigate shattered gradients, use stochastic depth, etc.
* In the decoder, we will let the keys and values come from the encoder's embeddings
* The query vectors still come from the decoder's embeddings
* During training, the ground truth is made available to the decoder so that it learns to predict future tokens from earlier ones
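A minimal single-head sketch of the cross-attention described above, assuming one token per row. The projection layers `w_q`, `w_k`, `w_v` and the random tensors standing in for the encoder's and decoder's embeddings are hypothetical names chosen for illustration.

```python
import math
import torch
from torch import nn

torch.manual_seed(0)
D = 16
src_len, tgt_len = 5, 3
encoder_out = torch.randn(src_len, D)  # stand-in for the encoder's embeddings
decoder_in = torch.randn(tgt_len, D)   # stand-in for the decoder's embeddings

# Hypothetical learned projections for a single attention head
w_q = nn.Linear(D, D, bias=False)
w_k = nn.Linear(D, D, bias=False)
w_v = nn.Linear(D, D, bias=False)

Q = w_q(decoder_in)   # queries come from the decoder
K = w_k(encoder_out)  # keys come from the encoder
V = w_v(encoder_out)  # values come from the encoder

# Each decoder token gets one weight per encoder token
weights = torch.softmax(Q @ K.T / math.sqrt(D), dim=-1)  # (tgt_len, src_len)
cross = weights @ V                                       # (tgt_len, D)
print(weights.shape, cross.shape)
```

The output has one row per decoder token, so it can be added back through the skip connection just like the output of self-attention.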
---
## Cross-Attention In Use