<!--
Abstract:

CS 461
Introduction to Deep Learning
Lecture 23
-->

# CS 461 - Lecture 23

## Machine Learning Principles

### Temporal Neural Networks

Bernhard Firner

2025-11-24

---

## Temporal DNNs

### A Very Brief Overview

---

## What is Temporal?

* Things like documents, sound, etc.
  * Structure that flows along one axis
  * Not images, where structure can exist between arbitrary groups of pixels
* We could simply run a 1-D CNN over a numerically encoded input
  * But that has some limitations

---

## Topics

* A brief overview that brings us to modern techniques
  * Cover where earlier techniques succeed
  * And where they fall short, and why
* Caution!
  * More modern does not necessarily mean "better"
  * Just that results look better given today's hardware and datasets

---

## Techniques

* Recurrent Neural Networks
* Long Short Term Memory
* Attention
* Transformers

---

## Recurrent Neural Networks

* Let's say that you want to use neural networks in your hidden Markov models
  * Congratulations, you've got a recurrent neural network!

---

## Graphical Representation

* Each emitted token, y, comes from the hidden state
* Each hidden state is a function of the previous hidden state and the observed token
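
A minimal sketch of one recurrent step may help make the two bullets concrete. The weight values, the sizes, and the `tanh` nonlinearity are assumptions chosen for illustration, not something fixed by the slides.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def rnn_step(h_prev, x_t, w_h, w_x, w_y):
    """New hidden state from (h_prev, x_t), then token probabilities from that state."""
    pre = [a + b for a, b in zip(matvec(w_h, h_prev), matvec(w_x, x_t))]
    h_t = [math.tanh(p) for p in pre]   # assumed nonlinearity
    y_t = softmax(matvec(w_y, h_t))     # emitted token probabilities
    return h_t, y_t

# Hypothetical sizes: a hidden state of 2 numbers and a vocabulary of 3 tokens.
w_h = [[0.5, -0.1], [0.2, 0.4]]
w_x = [[0.3, 0.0, -0.2], [0.1, 0.6, 0.0]]
w_y = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]

h = [0.0, 0.0]
for token in [0, 2, 1]:
    x = [1.0 if i == token else 0.0 for i in range(3)]  # one-hot observed token
    h, y = rnn_step(h, x, w_h, w_x, w_y)
    print(token, [round(p, 3) for p in y])
```

The same `rnn_step` is reused at every position; only `h` carries information forward.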

---

## Representation

* Arrows going into a node represent fully connected inputs and outputs
* A hidden state is just a vector of numbers
* Token outputs are probabilities and carry meaning
  * But hidden state vectors only have meaning within the model

---

## Conditioning Outputs

* We probably want to condition the outputs somehow
* Based upon
  * past tokens
  * future tokens
  * initial prompt
  * image embedding

---

## NN Structure

---

## Conditioned States

* Use an input vector, x, to represent the conditioning signal
* Use it as an input when creating each hidden state so it cannot be forgotten

---

## Vec2Seq

* Called Vec2Seq
  * The conditioning input is a vector
  * The output is a sequence
* How about going the opposite direction?

---

## Seq2Vec

* Sometimes we want to predict a vector given a sequence of tokens
  * If we want to classify a sequence
  * Or predict a feature vector from a description
    * Could be used in image search

---

## Seq2Vec Predictions

* Inputs are fed in one token at a time
* A prediction, y, is created at each step
  * We could ignore all but the last, or average them somehow
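
As a rough sketch of the two options, the loop below feeds tokens in one at a time and either keeps the last prediction or averages all of them. The scalar hidden state, the weights, and the `reduce` parameter are hypothetical simplifications for readability.

```python
import math

def seq2vec(tokens, w_h, w_x, w_y, reduce="last"):
    """Feed tokens one at a time; return the last prediction or the mean of all of them."""
    h = 0.0
    predictions = []
    for x in tokens:
        h = math.tanh(w_h * h + w_x * x)  # scalar hidden state for readability
        predictions.append(w_y * h)       # a prediction y at every step
    if reduce == "last":
        return predictions[-1]
    return sum(predictions) / len(predictions)

tokens = [1.0, -0.5, 0.25]
last = seq2vec(tokens, w_h=0.8, w_x=1.0, w_y=2.0, reduce="last")
mean = seq2vec(tokens, w_h=0.8, w_x=1.0, w_y=2.0, reduce="mean")
print(last, mean)
```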

---

## Improving Seq2Vec

* From HMM, we know that going in both directions improves our predictions
  * So we can make hidden states for the forward and backward directions

---

## Directional Hidden States

* The bidirectional hidden states predict new states and y values
* The final state (or an average of multiple) predicts y

---

## Seq2Seq

* What about predicting sequences from other sequences?
  * Translate from one language to another
  * Generate a response to a query
* This looks just like seq2vec
  * But now each y prediction is a single token, not a class

---

## Seq2Seq

---

## So Simple?

* No
  * What if the output has a different number of tokens than the input?
* We need to parse the entire sentence first
  * Either in a forward pass, or with forward and backward passes
* The final hidden state is called the context
  * Will be used to predict every output, as in vec2seq

---

## Seq2Seq with Context

* Notice that this is just seq2vec followed by vec2seq

---

## Everything

* And that covers everything to do with sequence predictions
  * Except for how to train them
* Let's talk about backpropagation

---

## Backprop Woes

* Consider vec to seq
  * We compare NN output to target value, compute error, and backprop
* If the output is short, that should be okay


---

## Backprop Woes

* Seq2seq makes one huge chain
  * We backpropagate through the entire thing, like unrolling a loop
* But inputs and outputs are often long
  * What goes wrong in long backprop pathways?


---

## Calculating Hidden States

* Theoretically the hidden states contain information from any point in the past
* Each hidden state comes from the last state and the input token
  * Just look at predicting the hidden states
  * $h_{t+1} = w_h h_t + w_x x_t$

---

## Vanishing Gradient

* Forget about the input for a moment
  * Unrolling several steps gives $h_4 = w_h (w_h (w_h (w_h h_0)))$
  * Is that likely to be stable? (e.g. nonzero and non-infinite)
    * No
* In practice, we can avoid things going to infinity
  * But going to 0 is common
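
A quick numeric check, treating $w_h$ as a scalar and ignoring the input term as above, shows how sensitive the repeated multiplication is. The specific values of `w_h` and the step count are arbitrary choices for illustration.

```python
def unroll(w_h, h0, steps):
    """Apply h <- w_h * h repeatedly, ignoring the input term."""
    h = h0
    for _ in range(steps):
        h = w_h * h
    return h

# Slightly below 1 shrinks toward 0; slightly above 1 blows up.
for w_h in (0.9, 1.0, 1.1):
    print(w_h, unroll(w_h, h0=1.0, steps=100))
```

The gradient of the final state with respect to $h_0$ is the same product of `w_h` factors, so it shrinks or grows in the same way.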

---

## Observed Result

* $h_{t+1} = w_h h_t + w_x x_t$ theoretically can encode long history
  * But in practice it is heavily influenced by $x_t$ and a short history
* Where have we seen a solution to long chains of gradient descent?

---

## Solution?

* ResNets added a residual to the current feature map
  * This created the next feature map
  * The residual was an identity or downscaled version of the current map
* What if we did something like that?
  * Preserve the current hidden state
  * Calculate something to add to it
  * No more multiplication

---

## LSTMs

* This is a technique named long short term memory
  * Actually predates ResNets by around 20 years
* Replaces the hidden state's multiplicative update with an addition
* Also adds gating functions (sigmoid gates, with tanh shaping the candidate update)
  * Either retain the current hidden state
  * Or update it with an addition of a function of the input token

---

## LSTM Mechanics

* This is a great topic
  * For CS 462
* For 461 I wanted to point out the relationship between problems in seemingly different areas
  * The vanishing gradient problem in sequence learning has a similar solution to deeper networks for image recognition

---

## High Level Perspective

* RNNs and LSTMs use some nonlinearity and a weight multiplication to update hidden states
  * Something like $h_{t+1} = \mathrm{ReLU}(w_x^T X + w_h^T h_t)$
    * At the input, X holds input observations
    * In intermediate layers, X holds features
    * You can think of those as fully connected layers, by the way
* But what if the transformation depended upon X?

---

## Key-Value Store

* Recall the restricted Boltzmann machines from lecture 18
  * Created a key-value store with a multilayer network
  * Key can be noisy or corrupted
  * Returned value is corrected
* This idea is similar

---

## Queries and Lookups

* Attention is like a key-value store
  * Query with a vector, `q`
  * That query is compared to the keys, `K`
* This has to be a differentiable function, so key comparison goes through a softmax
  * This is the similarity of `q` to each key in `K`
  * The output value is the similarity weighted sum of values
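
A small sketch of this soft lookup, using plain lists so each step is visible. The keys, values, and query below are made-up numbers, and the function name is hypothetical.

```python
import math

def attention_lookup(q, keys, values):
    """Return the similarity-weighted sum of values for query q, plus the weights."""
    # Dot product of the query with every key.
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    # Softmax turns the scores into differentiable similarity weights.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the values.
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return output, weights

keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0], [20.0], [30.0]]
output, weights = attention_lookup([5.0, 0.0], keys, values)
print(weights, output)
```

The query points along the first key, so nearly all of the weight lands on the first value, but every value still contributes a little.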

---

## Details

* Again, going into the details is more of a 462 topic
  * But recall the last key-value retrieval we discussed
  * Keys were partial images, and we attempted to retrieve full images
* In this case, query can be the hidden states
  * Hidden states while parsing a sentence, for example

---

## Why is that Useful?

* Without memory, we can train on fixed-length lists of tokens
  * So we are back to experts deciding how large the window should be
  * But training can easily fill batches of the same size

---

## Utility

* One type of translating neural network, for example, uses the current hidden state as the query
  * The query response is an alignment vector
  * It indicates what part of the input should be translated at each output step

---

## Details

* See https://arxiv.org/abs/1508.04025 for full details
* Interesting results with a training set of 4.5M sentences
  * English <-> German
  * Not something I can demonstrate in a day

---

## Input Feeding

* This is similar to earlier approaches, but attention is fed back to the predictors to provide alignment information
* Consider ABCD = "I am a student" and XYZ = "Je suis étudiant"

---

## Self-Attention

* One way to create that vector is via the dot product of the current token with the keys of all other words
  * We can think of this as some kind of similarity between the words
  * Or just remember that this was how we attempted to reconstruct an input with an RBM
* Running through a fully connected layer will have the same result, but look more mysterious

---

## Note

* A note about the matrix multiplication on the next slide
  * This is a gross simplification
* Just showing the outline of this approach
  * This way you can see that it is a "simple" multiplication

---

## Example

```python
import math

import torch


class BasicAttention(torch.nn.Module):
    def __init__(self, num_tokens, embedding_size, context_length):
        super().__init__()
        # The forward pass uses the dot product attention output directly as per-token
        # scores, so the embedding size must equal the number of tokens.
        # If embedding_size != num_tokens, then use torch.nn.Linear layers.
        embedding_size = num_tokens
        # The embedding table is a learnable parameter, so it will be trained.
        # If we want to predict continuous numbers, the embedding is replaced with a nn.Linear layer
        self.embeddings = torch.nn.Embedding(num_embeddings=num_tokens, embedding_dim=embedding_size)
        self.num_tokens = num_tokens
        self.embedding_size = embedding_size

    def context(self, tokens):
        """Return a context embedding from the tokens (which can be in a batch)."""
        return self.embeddings(tokens)

    def forward(self, query):
        """Query the embeddings."""
        # See Murphy: Probabilistic Machine Learning section 15.4.1 (or 15.44)
        # See also https://docs.pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html#torch.nn.functional.scaled_dot_product_attention
        # We could also just call torch.nn.functional.scaled_dot_product_attention, but this
        # way we can see the operation
        key = query
        value = query
        attn_weight = query @ key.transpose(-2, -1) / math.sqrt(query.size(-1))
        attn_output = torch.softmax(attn_weight, dim=-1) @ value
        # The non dot product version would use a linear layer to map from the original tokens to a query, keys, and values.

        # Do a projection from attention
        # If we average for simplicity we lose positional information
        attn_output = torch.mean(attn_output, dim=-2)
        # So in reality we want a linear network here
        # attn_output = self.out_proj(attn_output.flatten(-2))
        return attn_output
```

-v-

## Training

```python
import argparse
import sys

import nltk
import numpy as np
import torch

# Note: makeBasicAttention is a helper that is not shown on these slides.
# From its use below, it returns the BasicAttention model, the array of unique
# words, and the corpus converted to token indices.

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--corpus",
        required=True,
        help="Text file containing the training corpus")
    parser.add_argument(
        "--save",
        required=False,
        default=None,
        type=str,
        help="Path to save the trained model")
    parser.add_argument(
        "--load",
        required=False,
        default=None,
        type=str,
        help="Path to load the trained model")
    parser.add_argument(
        "--prompt",
        default="The",
        type=str,
        help="A prompt to use for text generation.")
    parser.add_argument(
        "--embedding_size",
        default=15,
        type=int,
        help="The embedding size per token.")
    parser.add_argument(
        "--context_length",
        default=10,
        type=int,
        help="The context window length.")

    args = parser.parse_args()

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(f"Using device: {device}")

    #########
    # Note: You need the nltk parser
    # Open a python terminal and run:
    # import nltk
    # nltk.download('punkt_tab')

    try:
        with open(args.corpus, 'r') as file:
            #corpus_sentences = nltk.sent_tokenize(file.read())
            flat_words = np.array(nltk.word_tokenize(file.read()))
    except IOError as e:
        print(f"Error: Could not open or read file at {args.corpus}: {e}")
        sys.exit(1)
    except Exception as e:
        print(f"An unexpected error occurred during parsing: {e}")
        sys.exit(1)

    # Initialize random embeddings for each word
    # Just a fancy way of saying random numbers for each word
    # Training will make the embeddings more meaningful
    attention, unique_words, sentence_indices = makeBasicAttention(flat_words, args.embedding_size, args.context_length)
    print(f"There are {len(unique_words)} unique words")

    ## Generate nonoverlapping subsequences of context_length words from the data for training
    ## Remember that these words have been converted into indices into the attention's embedding tensor
    context_length = args.context_length
    # Grab windows of context_length+1 so that we have a token to predict
    context_windows = np.array([sentence_indices[window:window+context_length+1] for window in np.arange(0, len(sentence_indices)-context_length-1, step=context_length+1)])
    print(f"Working with {len(context_windows)} context windows")
    print(context_windows[0])
    np.random.shuffle(context_windows)
    context_window_X = torch.tensor(context_windows[:,:context_length])
    context_window_Y = torch.tensor(context_windows[:,-1])
    batch_size = 32
    batches = len(context_windows) // batch_size

    # This is a classifier output
    criterion = torch.nn.CrossEntropyLoss()

    if args.load is not None:
        attention.load_state_dict(torch.load(args.load, weights_only=True))
        attention.to(device)
    else:
        attention.to(device)
        attention.train()
        context_window_X = context_window_X.to(device)
        context_window_Y = context_window_Y.to(device)

        #optimizer = torch.optim.SGD(attention.parameters(), lr=1e-2)
        optimizer = torch.optim.Adam(attention.parameters(), lr=1e-2)

        # Now loop through the text corpus, training
        for epoch in range(10):
            epoch_loss = 0
            for batch in range(batches):
                begin = batch*batch_size
                end = begin+batch_size

                # Get ready to learn
                attention.zero_grad()

                # A minibatch of word indices
                minibatch = context_window_X[begin:end]
                Y_batch = context_window_Y[begin:end]

                # Create a context window of the token embeddings
                context = attention.context(minibatch)
                y_hat = attention(context)

                loss = criterion(y_hat, Y_batch)
                epoch_loss += loss.item() * minibatch.size(0)

                # Gradient calculation
                loss.backward()
                # Update weights
                optimizer.step()
            # Average loss per training example (note the parentheses)
            epoch_loss = epoch_loss / (batches*batch_size)
            print(f"Epoch {epoch} training loss {epoch_loss}")

        if args.save is not None:
            torch.save(attention.state_dict(), args.save)

    # Now try to predict something
    prompt_words = np.array(nltk.word_tokenize(args.prompt))
    # Convert prompt words to token values
    prompt_tokens = [np.where(unique_words == word)[0][0] for word in prompt_words]

    attention.eval()

    print(prompt_words)

    for i in range(20):
        # Create a context window of the token embeddings
        context = attention.context(torch.tensor([prompt_tokens]).to(device))

        # Do the query
        # Find the next token probabilities
        y_hat = attention(context)
        next_word_index = torch.argmax(y_hat).item()
        next_word = unique_words[next_word_index]
        print(' ' + next_word)
        next_token = np.where(unique_words == next_word)[0][0]
        prompt_tokens.append(next_token)
        # Slide the window so it never exceeds the context length
        if len(prompt_tokens) > context_length:
            prompt_tokens = prompt_tokens[1:]
```

---

## Not Straightforward

* Try training this on Romeo and Juliet and you will get a bunch of repetition
* It's a one-layer network
  * Just like the example from lecture 18, it isn't very powerful
* We are also getting into that sparse data problem, where we need more data
* There are some examples of deeper (but still reasonable) models like this:
  * https://github.com/karpathy/nanoGPT/tree/master

---

## Embedding

* The embedding's job is to say which other tokens are related to observed tokens
  * That's nice; it makes the model interpretable and imposes some structure
* Since we haven't converted the input words into some undecipherable features, we can even see which words are related to others

---

## Advantage?

* So what is the advantage of doing things this way?
  * It's called self-attention
  * Notice how we could chop up the input tokens any way we wanted to train?
* The context window replaced the history of our RNN or LSTM
  * And if we believed that we were only getting, say 50 tokens anyway, this is easier

---

## How Much Easier?

* We can store our context windows ahead of time
  * Load and train them in one shot, no iterating through each token
* Now, we've removed iteration over the sequence at the encoder and decoder
  * This is called a transformer

---

## Comparison - steps

* For a sequence length of n,
  * Recurrent networks require $n$ sequential steps
  * Convolutions take 1 step
  * Transformers take 1 step

---

## Comparison - minimum depth

* For a sequence length of n,
  * Recurrent networks require $n$ sequential steps
    * Unrolling them leads to a depth $n$ network
  * Convolutions take $O(\log_k n)$, where $k$ is the kernel size
    * This is number of convolutions to reach a receptive field of desired size
  * Transformers just require 1 layer
    * Usually ends up being large, and maybe more than 1 to learn better
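
The depth counts can be sketched as small helper functions. The function names are hypothetical, and the convolution case assumes dilated convolutions whose receptive field grows by a factor of $k$ per layer, which is what gives the logarithm.

```python
def recurrent_depth(n):
    """Unrolled recurrent network: one layer per token."""
    return n

def dilated_conv_depth(n, k):
    """Layers needed for a receptive field of at least n when each layer multiplies it by k."""
    layers, field = 0, 1
    while field < n:
        field *= k
        layers += 1
    return layers

def attention_depth(n):
    """Self-attention connects every pair of tokens in a single layer."""
    return 1

for n in (10, 1000, 100000):
    print(n, recurrent_depth(n), dilated_conv_depth(n, k=3), attention_depth(n))
```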

---

## Comparison - complexity

* By complexity, we mean computation time
  * Recurrent networks require $O(nd^2)$ computations
    * $d$ is the size of the hidden state, which should be the dimensionality of the input features
  * Convolutions take $O(knd^2)$
    * The feature channel is like the hidden state
  * Transformers require $O(n^2d)$ compute
    * The fetch from the embedding is d steps, but done once
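
Plugging numbers into these expressions shows where each approach is cheaper. The function name and the sample sizes are assumptions for illustration, and constant factors are ignored.

```python
def per_layer_cost(n, d, k):
    """Rough operation counts per layer from the big-O expressions, constants dropped."""
    return {
        "recurrent": n * d * d,
        "convolution": k * n * d * d,
        "self_attention": n * n * d,
    }

short_context = per_layer_cost(n=50, d=512, k=3)
long_context = per_layer_cost(n=4096, d=512, k=3)
print(short_context)
print(long_context)
```

Comparing $n^2 d$ with $n d^2$, self-attention needs fewer operations than a recurrent layer while $n < d$ and more once the context window grows past $d$.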

---

## Advantages

* So self-attention and transformers make it easier to train
  * That's their big advantage
* As the context window grows, they become heavier to train
  * But there are ways to keep matrices sparse
  * And the hidden states and sequential training of RNNs are still worse

---

## Takeaways

* Transformers are very hot right now
  * They present solutions to hard problems in sequence learning
  * But notice that RNNs/LSTMs do theoretically have more expressive power
    * Just difficult to train
* We see limitations of context windows in today's LLMs

---

## Your Takeaways

* That's great
  * So what should you remember?

---

## ResNets and LSTMs

* Problems repeat
* Sometimes solutions do too
* Lessons learned by changing a multiplication to an addition could have been taken from LSTMs to convolutional networks, but weren't

---

## Some Solutions are Hard

* RNNs (including LSTMs) are the "correct" solution
  * But they are too hard to train
* Transformers are a neat trick to solve a problem with RNNs

---

## Transformers

* By removing the RNN-type sequential training, we unlock more traditional batch training
  * This has led to rapid advances in sequence models

---

## Context Windows are Limiting

* Transformers are limited by their context windows
  * Which means they're limited by how much model fits onto hardware