# CS 461 - Lecture 23

## Machine Learning Principles

### Temporal Neural Networks

Bernhard Firner

2025-11-24

---

## Temporal DNNs

### A Very Brief Overview

---

## What is Temporal?

* Things like documents, sound, etc.
* Structure that flows along one axis
* Not images, where structure can exist between arbitrary groups of pixels
* We could simply run a 1-D CNN over a numerically encoded input
* But that has some limitations

---

## Topics

* A brief overview that brings us to modern techniques
* Cover where earlier techniques succeed
* And where they fall short, and why
* Caution!
  * More modern does not necessarily mean "better"
  * Just that results look better given today's hardware and datasets

---

## Techniques

* Recurrent Neural Networks
* Long Short Term Memory
* Attention
* Transformers

---

## Recurrent Neural Networks

* Let's say that you want to use neural networks in your hidden Markov models
* Congratulations, you've got a recurrent neural network!

---

## Graphical Representation
* Each emitted token, y, comes from the hidden state
* Each hidden state is a function of the previous hidden state and the observed token

---

## Representation

* Arrows going into a node represent fully connected inputs and outputs
* A hidden state is just a vector of numbers
* Token outputs are probabilities and carry meaning
* But hidden state vectors only have meaning within the model

---

## Conditioning Outputs

* We probably want to condition the outputs somehow
* Based upon
  * past tokens
  * future tokens
  * initial prompt
  * image embedding

---

## NN Structure
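
The structure described on the preceding slides can be sketched in a few lines of plain Python. This is a minimal illustration, not the lecture's implementation: the names `rnn_step`, `emit`, `W_h`, `W_x`, and `W_y`, the tanh nonlinearity, and the tiny hand-picked weights are all assumptions made for the example.

```python
import math


def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def softmax(z):
    """Turn scores into probabilities that sum to 1."""
    m = max(z)
    exps = [math.exp(s - m) for s in z]
    total = sum(exps)
    return [e / total for e in exps]


def rnn_step(h, x, W_h, W_x):
    """New hidden state from the previous hidden state h and the observed token x."""
    pre = [a + b for a, b in zip(matvec(W_h, h), matvec(W_x, x))]
    return [math.tanh(p) for p in pre]  # tanh is an assumed nonlinearity


def emit(h, W_y):
    """Token probabilities, y, computed from the hidden state alone."""
    return softmax(matvec(W_y, h))


# Hypothetical sizes: 2-dimensional hidden state, 3-token vocabulary (one-hot inputs).
W_h = [[0.5, -0.2], [0.1, 0.4]]
W_x = [[0.3, 0.0, -0.3], [0.0, 0.6, 0.2]]
W_y = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]

h = [0.0, 0.0]
for token in ([1, 0, 0], [0, 1, 0], [0, 0, 1]):
    h = rnn_step(h, token, W_h, W_x)  # state depends on previous state and token
    y = emit(h, W_y)                  # output depends only on the state
    print([round(p, 3) for p in y])
```

Here `h` is just a list of numbers with no meaning outside the model, while each `y` is a probability distribution over the three tokens.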

---

## Conditioned States
* Use an input vector, x, to represent the conditioning signal
* Use it as an input when creating each hidden state so it cannot be forgotten

---

## Vec2Seq

* Called Vec2Seq
* The conditioning input is a vector
* The output is a sequence
* How about going the opposite direction?

---

## Seq2Vec

* Sometimes we want to predict a vector given a sequence of tokens
* If we want to classify a sequence
* Or predict a feature vector from a description
* Could be used in image search

---

## Seq2Vec Predictions
* Inputs are fed in one token at a time
* A prediction, y, is created at each step
* We could ignore all but the last, or average them somehow

---

## Improving Seq2Vec

* From HMMs, we know that going in both directions improves our predictions
* So we can make hidden states for the forward and backward directions

---

## Directional Hidden States
* The bidirectional hidden states predict new states and y values
* The final state (or an average of multiple) predicts y

---

## Seq2Seq

* What about predicting sequences from other sequences?
* Translate from one language to another
* Generate a response to a query
* This looks just like seq2vec
* But now each y prediction is a single token, not a class

---

## Seq2Seq

---

## So Simple?

* No
* What if the output has a different number of tokens than the input?
* We need to parse the entire sentence first
* Either in a forward pass, or with forward and backward passes
* The final hidden state is called the context
* Will be used to predict every output, as in vec2seq

---

## Seq2Seq with Context
* Notice that this is just seq2vec followed by vec2seq

---

## Everything

* And that covers everything to do with sequence predictions
* Except for how to train them
* Let's talk about backpropagation

---

## Backprop Woes
* Consider vec to seq
* We compare NN output to target value, compute error, and backprop
* If the output is short, that should be okay

---

## Backprop Woes
* seq2seq makes one huge chain
* We backpropagate through the entire thing, like unrolling a loop
* But inputs and outputs are often long
* What goes wrong in long backprop pathways?
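
One way to explore that question is to look at what a long chain of multiplications does to a gradient. The sketch below assumes a scalar, purely linear recurrence $h_{t+1} = w_h h_t$ (a simplification introduced here for illustration), and `chain_gradient` is a hypothetical helper name.

```python
def chain_gradient(w_h: float, steps: int) -> float:
    """d h_steps / d h_0 for the assumed linear update h_{t+1} = w_h * h_t.

    Backprop through the unrolled chain multiplies by w_h once per step.
    """
    grad = 1.0
    for _ in range(steps):
        grad *= w_h
    return grad


for w_h in (0.9, 1.0, 1.1):
    grads = [chain_gradient(w_h, n) for n in (1, 10, 100)]
    print(f"w_h={w_h}: " + ", ".join(f"{g:.3e}" for g in grads))
```

Only a weight of exactly 1 keeps the gradient unchanged over 100 steps; slightly smaller weights shrink it toward zero and slightly larger ones blow it up.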

---

## Calculating Hidden States

* Theoretically the hidden states contain information from any point in the past
* Each hidden state comes from the last state and the input token
* Just look at predicting the hidden states
* $h_{t+1} = w_h h_t + w_x x_t$

---

## Vanishing Gradient

* Forget about the input for a moment
* Unrolling several steps gives $h_4 = w_h (w_h (w_h (w_h h_0)))$
* Is that likely to be stable? (e.g. nonzero and non-infinite)
* No
* In practice, we can avoid things going to infinity
* But going to 0 is common

---

## Observed Result

* $h_{t+1} = w_h h_t + w_x x_t$ theoretically can encode a long history
* But practically it is heavily influenced by $x_t$ and a short history
* Where have we seen a solution to long chains of gradient descent?

---

## Solution?

* ResNets added a residual to the current feature map
* This created the next feature map
* The residual was an identity or downscaled version of the current map
* What if we did something like that?
* Preserve the current hidden state
* Calculate something to add to it
* No more multiplication

---

## LSTMs

* This is a technique named long short-term memory
* Actually predates ResNets by around 20 years
* Replaces the hidden state's multiplicative update with an addition
* Also adds gating functions (sigmoid for the gates, tanh for the candidate update)
* Either retain the current hidden state
* Or update it with an addition of a function of the input token

---

## LSTM Mechanics

* This is a great topic
* For CS 462
* For 461 I wanted to point out the relationship between problems in seemingly different areas
* The vanishing gradient problem in sequence learning has a similar solution to deeper networks for image recognition

---

## High Level Perspective

* RNNs and LSTMs use some nonlinearity and a weight multiplication to update hidden states
* Something like $h_{t+1} = \mathrm{ReLU}(w_x^T X + w_h^T h_t)$
* At the input, X holds input observations
* In intermediate layers, X holds features
* You can think of those as fully connected layers, by the way
* But what if the transformation depended upon X?

---

## Key-Value Store

* Recall the restricted Boltzmann machines from lecture 18
* Created a key-value store with a multilayer network
* Key can be noisy or corrupted
* Returned value is corrected
* This idea is similar

---

## Queries and Lookups

* Attention is like a key-value store
* Query with a vector, `q`
* That query is compared to the keys, `K`
* This has to be a differentiable function, so key comparison goes through a softmax
* This is the similarity of `q` to each key in `K`
* The output value is the similarity-weighted sum of values

---

## Details

* Again, going into the details is more of a 462 topic
* But recall the last key-value retrieval we discussed
* Keys were partial images, and we attempted to retrieve full images
* In this case, the query can be the hidden states
* Hidden states while parsing a sentence, for example

---

## Why is that Useful?

* Without memory, we can train on fixed-length lists of tokens
* So we are back to experts deciding how large the window should be
* But training can easily fill batches of the same size

---

## Utility

* One type of translating neural network, for example, uses the current hidden state as the query
* The query response is an alignment vector
* It indicates what part of the input should be translated at each output step

---

## Details

* See https://arxiv.org/abs/1508.04025 for full details
* Interesting results with a training set of 4.5M sentences
* English <-> German
* Not something I can demonstrate in a day

---

## Input Feeding
* This is similar to earlier approaches, but attention is fed back to the predictors to provide alignment information
* Consider ABCD = "I am a student" and XYZ = "Je suis étudiant"

---

## Self-Attention

* One way to create that vector is via the dot product of the current token with the keys of all other words
* We can think of this as some kind of similarity between the words
* Or just remember that this was how we attempted to reconstruct an input with an RBM
* Running through a fully connected layer will have the same result, but look more mysterious

---

## Note

* A note about the matrix multiplication on the next slide
* This is a gross simplification
* Just showing the outline of this approach
* This way you can see that it is a "simple" multiplication

---

## Example

```python
import math

import torch


class BasicAttention(torch.nn.Module):
    def __init__(self, num_tokens, embedding_size, context_length):
        super(BasicAttention, self).__init__()
        # For the dot product attention we use in the forward pass, the embedding size must be d, the number of tokens
        # if embedding_size != num_tokens, then use torch.nn.Linear layers
        embedding_size = num_tokens
        # Make embeddings a parameter so that it will learn.
        # If we want to predict continuous numbers, the embedding is replaced with a nn.Linear layer
        self.embeddings = torch.nn.Embedding(num_embeddings=num_tokens, embedding_dim=embedding_size)
        self.num_tokens = num_tokens
        self.embedding_size = embedding_size

    def context(self, tokens):
        """Return a context embedding from the tokens (which can be in a batch)."""
        return self.embeddings(tokens)

    def forward(self, query):
        """Query the embeddings."""
        # See Murphy: Probabilistic Machine Learning section 15.4.1 (or 15.44)
        # See also https://docs.pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html#torch.nn.functional.scaled_dot_product_attention
        # We could also just call torch.nn.functional.scaled_dot_product_attention, but this
        # way we can see the operation
        key = query
        value = query
        attn_weight = query @ key.transpose(-2, -1) / math.sqrt(query.size(-1))
        attn_weight = torch.softmax(attn_weight, dim=-1) @ value
        # The non dot product version would use a linear layer to map from the original tokens to a query, keys, and values.
        # Do a projection from attention
        # If we average for simplicity we lose positional information
        attn_weight = torch.mean(attn_weight, dim=-2)
        # So in reality we want a linear network here
        #attn_weight = self.out_proj(attn_weight.flatten(-2))
        return attn_weight
```

-v-

## Training

```python
import argparse
import sys

import nltk
import numpy as np
import torch

# Continues the file from the Example slide.
# makeBasicAttention is used below but is not shown on these slides.

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--corpus", required=True,
        help="Text file with class A data")
    parser.add_argument(
        "--save", required=False, default=None, type=str,
        help="Path to save the trained model")
    parser.add_argument(
        "--load", required=False, default=None, type=str,
        help="Path to load the trained model")
    parser.add_argument(
        "--prompt", default="The", type=str,
        help="A prompt to use for text generation.")
    parser.add_argument(
        "--embedding_size", default=15, type=int,
        help="The embedding size per token.")
    parser.add_argument(
        "--context_length", default=10, type=int,
        help="The context window length.")
    args = parser.parse_args()

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(f"Using device: {device}")

    #########
    # Note: You need the nltk parser
    # Open a python terminal and run:
    # import nltk
    # nltk.download('punkt_tab')
    try:
        with open(args.corpus, 'r') as file:
            #corpus_sentences = nltk.sent_tokenize(file.read())
            flat_words = np.array(nltk.word_tokenize(file.read()))
    except IOError as e:
        print(f"Error: Could not open or read file at path {args.corpus}: {e}")
        sys.exit(1)
    except Exception as e:
        print(f"An unexpected error occurred during parsing: {e}")
        sys.exit(1)

    # Initialize random embeddings for each word
    # Just a fancy way of saying random numbers for each word
    # Training will make the embeddings more meaningful
    attention, unique_words, sentence_indices = makeBasicAttention(flat_words, args.embedding_size, args.context_length)
    print(f"There are {len(unique_words)} unique words")

    # Generate nonoverlapping context_length-word subsequences from the data for training
    # Remember that these words have been converted into indices into the attention's embedding tensor
    context_length = args.context_length
    # Grab windows of context_length+1 so that we have a token to predict
    context_windows = np.array([sentence_indices[window:window+context_length+1] for window in
                                np.arange(0, len(sentence_indices)-context_length-1, step=context_length+1)])
    print(f"Working with {len(context_windows)} context windows")
    print(context_windows[0])
    np.random.shuffle(context_windows)
    context_window_X = torch.tensor(context_windows[:,:context_length])
    context_window_Y = torch.tensor(context_windows[:,-1])
    batch_size = 32
    batches = len(context_windows) // batch_size

    # This is a classifier output
    criterion = torch.nn.CrossEntropyLoss()

    if args.load is not None:
        attention.load_state_dict(torch.load(args.load, weights_only=True))
        attention.to(device)
    else:
        attention.to(device)
        attention.train()
        context_window_X = context_window_X.to(device)
        context_window_Y = context_window_Y.to(device)
        #optimizer = torch.optim.SGD(attention.parameters(), lr=1e-2)
        optimizer = torch.optim.Adam(attention.parameters(), lr=1e-2)

        # Now loop through the text corpus, training
        for epoch in range(10):
            epoch_loss = 0
            for batch in range(batches):
                begin = batch*batch_size
                end = begin+batch_size
                # Get ready to learn
                attention.zero_grad()
                # A minibatch of word indices
                minibatch = context_window_X[begin:end]
                Y_batch = context_window_Y[begin:end]
                # Create a context window of the token embeddings
                context = attention.context(minibatch)
                y_hat = attention(context)
                loss = criterion(y_hat, Y_batch)
                epoch_loss += loss.item() * minibatch.size(0)
                # Gradient calculation
                loss.backward()
                # Update weights
                optimizer.step()
            # Average loss per training example (parentheses matter here)
            epoch_loss = epoch_loss / (batches*batch_size)
            print(f"Epoch {epoch} training loss {epoch_loss}")
        if args.save is not None:
            torch.save(attention.state_dict(), args.save)

    # Now try to predict something
    prompt_words = np.array(nltk.word_tokenize(args.prompt))
    # Convert prompt words to token values
    prompt_tokens = [np.where(unique_words == word)[0][0] for word in prompt_words]
    attention.eval()
    print(prompt_words)
    for i in range(20):
        # Create a context window of the token embeddings
        context = attention.context(torch.tensor([prompt_tokens]).to(device))
        # Do the query
        # Find the next token probabilities
        y_hat = attention(context)
        next_word_index = torch.argmax(y_hat).item()
        next_word = unique_words[next_word_index]
        print(' ' + next_word)
        next_token = np.where(unique_words == next_word)[0][0]
        prompt_tokens.append(next_token)
        if len(prompt_tokens) > context_length:
            prompt_tokens = prompt_tokens[1:]
```

---

## Not Straightforward

* Try training this on Romeo and Juliet and you will get a bunch of repetition
* It's a one layer network
* Just like the example from lecture 18, it isn't very powerful
* We are also getting into that sparse data problem, where we need more data
* There are some examples of deeper (but still reasonable) models like this:
  * https://github.com/karpathy/nanoGPT/tree/master

---

## Embedding

* The embedding's job is to say which other tokens are related to observed tokens
* That's nice; it makes the model interpretable and imposes some structure
* Since we haven't converted the input words into some undecipherable features, we can even see which words are related to others

---

## Advantage?

* So what is the advantage of doing things this way?
* It's called self-attention
* Notice how we could chop up the input tokens any way we wanted to train?
* The context window replaced the history of our RNN or LSTM
* And if we believed that we were only getting, say, 50 tokens anyway, this is easier

---

## How Much Easier?
* We can store our context windows ahead of time
* Load and train them in one shot, no iterating through each token
* Now, we've removed iteration over the sequence at the encoder and decoder
* This is called a transformer

---

## Comparison - steps

* For a sequence length of $n$,
* Recurrent networks require $n$ sequential steps
* Convolutions take 1 step
* Transformers take 1 step

---

## Comparison - minimum depth

* For a sequence length of $n$,
* Recurrent networks require $n$ sequential steps
* Unrolling them leads to a depth $n$ network
* Convolutions take $O(\log_k n)$ layers, where $k$ is the kernel size
* This is the number of convolutions needed to reach a receptive field of the desired size
* Transformers just require 1 layer
* Usually ends up being large, and maybe more than 1 to learn better

---

## Comparison - complexity

* By complexity, we mean computation time
* Recurrent networks require $O(nd^2)$ computations
* $d$ is the size of the hidden state, which should be the dimensionality of the input features
* Convolutions take $O(knd^2)$
* The feature channel is like the hidden state
* Transformers require $O(n^2d)$ compute
* The fetch from the embedding is $d$ steps, but done once

---

## Advantages

* So self-attention and transformers make it easier to train
* That's their big advantage
* As the context window grows, they become heavier to train
* But there are ways to keep matrices sparse
* And the hidden states and sequential training of RNNs are still worse

---

## Takeaways

* Transformers are very hot right now
* They present solutions to hard problems in sequence learning
* But notice that RNNs/LSTMs do theoretically have more expressive power
* Just difficult to train
* We see limitations of context windows in today's LLMs

---

## Your Takeaways

* That's great
* So what should you remember?

---

## ResNets and LSTMs

* Problems repeat
* Sometimes solutions do too
* Lessons learned by changing a multiplication to an addition could have been taken from LSTMs to convolutional networks, but weren't

---

## Some Solutions are Hard

* RNNs (including LSTMs) are the "correct" solution
* But they are too hard to train
* Transformers are a neat trick to solve a problem with RNNs

---

## Transformers

* By removing the RNN-type sequential training, we unlock more traditional batch training
* This has led to rapid advances in sequence models

---

## Context Windows are Limiting

* Transformers are limited by their context windows
* Which means they're limited by how much model fits onto hardware
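
To put rough numbers on that limit, the sketch below plugs a few context lengths into the operation counts quoted in the complexity comparison ($nd^2$ for recurrent networks, $n^2d$ for self-attention). The function names and the choice of $d = 512$ are assumptions for illustration, and constant factors are ignored.

```python
def recurrent_cost(n: int, d: int) -> int:
    """Rough operation count for a recurrent network: n * d^2."""
    return n * d * d


def self_attention_cost(n: int, d: int) -> int:
    """Rough operation count for self-attention: n^2 * d."""
    return n * n * d


d = 512  # assumed hidden/embedding size
for n in (512, 2048, 8192):
    print(f"n={n}: recurrent={recurrent_cost(n, d):,} "
          f"self-attention={self_attention_cost(n, d):,}")
```

Doubling the context length doubles the recurrent count but quadruples the self-attention count, which is why the context window cannot simply keep growing on fixed hardware.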