# CS 462 - Lecture 15

## Transformers

Bernhard Firner

2026-03-24

---

## Review

* Before the break, we talked about ConvNeXt
* Which took a bunch of improvements from transformers and put them into convolutional networks
* We'll see *why* some of those improvements came from transformers in a bit

---

## Big Changes

* There were a few large changes to convnets
  * Adam improved to AdamW and L2 penalty increased
  * BatchNorm replaced with LayerNorm
  * ReLU replaced by GELU
  * Convolution sizes changed, reverse bottleneck used
  * Stochastic Depth instead of dropout

---

## Changes on CIFAR10

* We can't use all of the ConvNeXt changes on CIFAR10
  * Dataset is too small
* But here is a selection of improvements

---

## AdamW Impacts
*Figure panels: Adam Loss, AdamW Loss, Adam Accuracy, AdamW Accuracy*
---

## AdamW Impacts

* AdamW decouples weight decay from the gradient update; in Adam, an L2 penalty is rescaled by the momentum and variance estimates, so it does not act as proper weight decay
* Results in a smoother loss; regularization is working properly
*Figure panels: Adam Accuracy, AdamW Accuracy*
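---

## Sketch: Decoupled Weight Decay

To make the difference between an L2 penalty inside Adam and AdamW's weight decay concrete, here is a minimal single-weight sketch. The function names, learning rate, and decay strength are illustrative assumptions, not the settings used for the plots. The loss gradient is set to zero so that only the regularizer moves the weight.

```python
import math

def adam_l2_step(w, grad, m, v, t, lr=0.1, wd=0.1, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam step with the L2 penalty folded into the gradient."""
    g = grad + wd * w                      # penalty passes through the moment estimates
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)              # bias correction
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

def adamw_step(w, grad, m, v, t, lr=0.1, wd=0.1, b1=0.9, b2=0.999, eps=1e-8):
    """One AdamW step: shrink the weight directly, then apply the Adam update."""
    w = w - lr * wd * w                    # decoupled weight decay
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad * grad
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

# Zero loss gradient: only the regularizer acts on each weight.
for w0 in (5.0, 0.1):
    w_l2, _, _ = adam_l2_step(w0, 0.0, 0.0, 0.0, t=1)
    w_dec, _, _ = adamw_step(w0, 0.0, 0.0, 0.0, t=1)
    print(f"w0={w0}: Adam+L2 -> {w_l2:.4f}, AdamW -> {w_dec:.4f}")
```

In this sketch the Adam+L2 step moves both weights by roughly the learning rate regardless of their size, because the penalty is divided by its own magnitude. The AdamW step shrinks each weight in proportion to its value, which is what weight decay is supposed to do.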
---

## Label Smoothing Impacts
*Figure panels: Adam Loss, AdamW+Labelsmoothing Loss, Adam Accuracy, AdamW+Labelsmoothing Accuracy*
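---

## Sketch: Label Smoothing

As a reminder of what label smoothing does to the training target, here is a small sketch. It assumes the common convention of mixing the one-hot target with a uniform distribution over all classes (which, as far as I know, is what the `label_smoothing` argument of PyTorch's `CrossEntropyLoss` does). The logits and the smoothing value are made up for illustration.

```python
import math

def smoothed_targets(label, num_classes, smoothing):
    """Mix a one-hot target with a uniform distribution over all classes."""
    uniform = smoothing / num_classes
    return [(1.0 - smoothing) * (1.0 if k == label else 0.0) + uniform
            for k in range(num_classes)]

def cross_entropy(logits, targets):
    """Cross-entropy between softmax(logits) and a target distribution."""
    log_norm = math.log(sum(math.exp(z) for z in logits))
    return -sum(t * (z - log_norm) for z, t in zip(logits, targets))

logits = [4.0, 1.0, 0.0]                 # hypothetical network outputs for 3 classes
hard = smoothed_targets(0, 3, 0.0)       # plain one-hot target
soft = smoothed_targets(0, 3, 0.1)       # smoothed target
print("hard target:", hard, "loss:", cross_entropy(logits, hard))
print("soft target:", soft, "loss:", cross_entropy(logits, soft))
```

The smoothed target still sums to one, but a confident prediction is now penalized slightly, which discourages the network from pushing its logits toward infinity.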
---

## Stochastic Depth
*Figure panels: AdamW+Labelsmoothing Loss, AdamW+Labelsmoothing+Stochastic Depth Loss, AdamW+Labelsmoothing Accuracy, AdamW+Labelsmoothing+Stochastic Depth Accuracy*
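---

## Sketch: Stochastic Depth

A minimal scalar sketch of stochastic depth, assuming one common convention: during training the residual branch is skipped with probability `drop_prob`, and a surviving branch is rescaled by `1 / (1 - drop_prob)`. Other implementations rescale at evaluation time instead. The `branch` function is a hypothetical stand-in for the convolution, normalization, and activation layers of a real block.

```python
import random

def residual_block(x, branch, drop_prob, training, rng=random):
    """Residual block whose branch is randomly skipped during training."""
    if training and drop_prob > 0.0:
        if rng.random() < drop_prob:
            return x                                  # branch dropped: identity only
        return x + branch(x) / (1.0 - drop_prob)      # rescale the surviving branch
    return x + branch(x)                              # evaluation: always use the branch

def branch(v):
    """Hypothetical stand-in for the layers inside a residual block."""
    return 0.5 * v

rng = random.Random(0)
outputs = [residual_block(1.0, branch, 0.2, True, rng) for _ in range(10000)]
train_mean = sum(outputs) / len(outputs)
eval_output = residual_block(1.0, branch, 0.2, False)
print("average training output:", train_mean)
print("evaluation output:", eval_output)
```

With this rescaling, the average training output matches the evaluation output, so the rest of the network sees consistent statistics in both modes.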
---

## Random Flip Augmentations
*Figure panels: AdamW+Labelsmoothing+Stochastic Depth Loss, All That + Random Image Flipping Loss, AdamW+Labelsmoothing+Stochastic Depth Accuracy, All That + Random Image Flipping Accuracy*
---

## More Improvements

* We haven't tuned any of the training hyperparameters
  * Batch size? Learning rate? Other augmentations?
* We could probably use LayerNorm in place of BatchNorm, maybe replace ReLU with GELU
* These have all come "for free" by making (relatively effortless) changes to old ResNets

---

## Improvements from where?

* Where did these improvements come from?
* Learning language models is difficult
  * Training samples are *wildly* different, so LayerNorm makes more sense than BatchNorm
  * Proper regularization is required and SGD is too hard to hand-tune, so Adam had to be fixed
  * More training data is available for text, which pushed the training pipeline in a different direction

---

## Precursors to Transformers

* The book treats transformers in a single chapter
  * Chapter 12
* But this jumps over some other concepts that are worth covering
* First, how is learning languages different from learning images?
* Second, what models can capture temporal structure?

---

## Language Models

* Transformers come from the world of *natural language processing*
  * This could be written, spoken, or any symbolic language
* So how is language different from images?
---

## Similarities

* Language has structure to it
  * Not a bag of words, the same way an image isn't a bag of pixels
* Semantics change meanings
  * An eye shape could be camouflage or could be a real eye
  * "Orange" could be a color or a fruit

---

## Differences

* Language is discrete and sparse, with vastly different token frequencies
* Here are the least and most common words from last lecture

```
  1 accuracy
  1 across-feature
  1 acting
  1 actually
  1 Adaptive
  1 add
  1 added
 13 learning
 13 The
 14 be
 15 from
 16 as
 17 are
 18 it
 20 and
 20 that
 20 with
 30 in
 32 of
 35 to
 37 a
 44 is
 84 the
```

---

## Challenges

* Every language corpus is going to be a biased dataset
* Some "features" will be unexplored, others used in strange ways
  * If we only had my slides, learning how to use the word "acting" would be tough
  * And there are plenty of words that will never show up
* Other words are so common that we might guess that they will show up everywhere

---

## Data Preprocessing

* This makes data cleaning difficult
* Should we discard some "boring" sentences?
* Or perhaps we should begin learning on "boring" sentences and move on to difficult ones?
* How should we treat words that are incredibly rare?

---

## Learning Rates

* Each sample is going to be vastly different
* So we want huge batches to actually sample our loss surface
* Statistics are divergent, so eventually we'll use LayerNorm instead of BatchNorm
* We can expect training to take a while as well

---

## Simplifying

* Let's introduce transformers by going through the building blocks that lead to them
* We'll begin from the perspective of *natural language processing*
* And what follows is a discussion of *concepts* first, then implementation details

---

## Statistical Models

* Before we were training DNNs, we had Markov models
* These eventually gave rise to current neural networks:
  * Recurrent Neural Networks, LSTMs, GRUs, Transformers
* Maybe (hopefully) you've heard of these

---

## Markov Models
* Markov models make some assumptions:
  * We are making observations, and those come from some hidden state
  * The hidden states change with some (perhaps unknown) probabilities
---

## Markov Models
* This seems pointless if, for example, each hidden state has a single *emission*, or observation
* But what if each hidden state has multiple observable tokens?
---

## Markov Models
* Learning the probabilities of seeing different tokens allows us to answer new questions
* Example: if we read the word "cqt", what was the probability that the writer meant "cat"?
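---

## Sketch: Did "cqt" Mean "cat"?

The "cqt" question can be answered with Bayes' rule once we have emission probabilities. The sketch below is a toy under stated assumptions: the candidate words, their prior frequencies, the keyboard neighbor map, and every probability are invented for illustration, and only character substitutions are modeled (no skips or transpositions).

```python
# Toy noisy-typing model; every number and the keyboard map are illustrative assumptions.
NEIGHBORS = {"a": "qwsz", "o": "iklp", "u": "yihj"}   # hypothetical adjacent keys
PRIORS = {"cat": 0.5, "cot": 0.2, "cut": 0.3}          # hypothetical word frequencies

def emission(typed, intended):
    """P(typed character | intended character), substitutions only."""
    if typed == intended:
        return 0.9
    near = NEIGHBORS.get(intended, "")
    if not near:
        return 0.1 / 25                    # no neighbor data: spread errors evenly
    if typed in near:
        return 0.08 / len(near)            # slips to adjacent keys are more likely
    return 0.02 / (25 - len(near))         # remaining letters share the rest

def posterior(observed):
    """P(intended word | observed string) over the candidate words."""
    joint = {}
    for word, prior in PRIORS.items():
        likelihood = 1.0
        for typed, intended in zip(observed, word):
            likelihood *= emission(typed, intended)
        joint[word] = prior * likelihood
    total = sum(joint.values())
    return {word: p / total for word, p in joint.items()}

result = posterior("cqt")
for word, prob in sorted(result.items(), key=lambda item: -item[1]):
    print(f"P({word} | cqt) = {prob:.3f}")
```

Because 'q' sits next to 'a' in the assumed keyboard map, "cat" takes most of the posterior mass even though "cot" and "cut" are also one substitution away.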
---

## Natural Language Processing

* Markov models have been widely used in natural language processing
  * Written, spoken, and other visual systems; anything that involves discrete tokens
* Consider a person attempting to type the word "cat"
  * The hidden state will be a tuple of the intended word and the current position
  * Each hidden state is most likely to produce the correct character, but could produce a skip, substitution, transposition, etc.
  * The possible *emitted* tokens are represented per state, in an emission matrix

---

## Using Markov Models

* Filtering: What is the user trying to type now?
* Smoothing: What did the user mean to type?
* Simulating: Generate text as a human would.

---

## Learning

* Learning transition and emission probabilities is done with the observations
* Could be easy
  * What is the transition probability from one hidden state to another?
  * What is the likelihood of different observations given the hidden state?
  * E.g. hitting 'q' instead of 'a' should be more likely than hitting 'y' since 'q' and 'a' are nearby on the keyboard

---

## Learning is Tough

* But language tends to be sparse
  * Some words are rarely used, requiring huge amounts of data to learn
* Meanings change with context
  * And context requires multiplying long chains of probabilities
* And we actually want to learn semantics as part of the hidden states
  * More than just "what word is being typed?"

---

## Semantic Problems

* Different sentence structures map to similar (or the same) meanings
  * "The sunset is a breathtaking orange."
  * "The orange sunset took my breath away."
* Some similar words map to different meanings
  * "The sunset is orange."
  * "The best flavor is orange."

---

## Embeddings

* Learning "orange" as the same token in every situation is wrong
* So instead we want to learn an *embedding* that maps *words* into *semantics*
* We can use that embedding to compare sentences, translate phrases, etc.
* But what quality should this embedding have?
---

## Similarity Example

* Let's say we are building a word predictor
* The currently observed tokens are:
  * "Those fluffy bunnies make me"
* We guess "smile" but then observe "happy"
* What error goes into our loss function?

---

## Similarity

* We should represent each word not as a single value, but as a vector of its qualities
  * That vector is the embedding
* So what qualities should it have?
* A logical choice is similarity
* $Similarity(A, B) = \frac{f(A)\cdot f(B)}{\|f(A)\|~\|f(B)\|} = \cos(\phi)$
  * Here $f$ is the embedding and $\phi$ is the angle between the two embedded vectors
* This is the *cosine similarity* between A and B

---

## Example

* Consider three vectors, A, B, and C
  * $A = [0, 0, 1, 2]$
  * $B = [1, 0.5, 0, 0]$
  * $C = [0, 1, 1, 0]$
* $Similarity(A, B) = \frac{0}{\sqrt{5}\times \sqrt{1.25}} = 0$
* $Similarity(A, C) = \frac{1}{\sqrt{5}\times \sqrt{2}} \approx 0.316$
* $Similarity(A, A) = \frac{1+4}{\sqrt{5}\times \sqrt{5}} = 1$

---

## Example Continued

* If these two sentences occur with equal regularity:
  * "Those fluffy bunnies make me smile"
  * "Those fluffy bunnies make me happy"
* Then 'smile' and 'happy' should end up with a small distance
* So they both end up with high likelihood when we observe "Those fluffy bunnies make me"

---

## Learning Embeddings

* We want to transform our inputs into embeddings through a loss function
* The exact transformation should be learned from the data
* But wait, didn't we just say that context will affect embeddings?
  * "I had spent a week wandering the barren wilderness with nothing to eat. Those fluffy bunnies made me"
* Now what word could come next? And are "smile" and "happy" still nearly equivalent?

---

## Hidden State Information

* So the hidden state needs to include past words in order to have context
* But how many past words?
* Does it depend?
  * Is this social media, or Herman Melville?
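---

## Sketch: Cosine Similarity in Code

The three similarity values from the earlier example with vectors A, B, and C can be checked with a few lines of plain Python. The helper name `cosine_similarity` is my own; the vectors are the ones from the slide, used directly as if they were already embeddings.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

A = [0.0, 0.0, 1.0, 2.0]
B = [1.0, 0.5, 0.0, 0.0]
C = [0.0, 1.0, 1.0, 0.0]

for name, other in (("B", B), ("C", C), ("A", A)):
    print(f"Similarity(A, {name}) = {cosine_similarity(A, other):.3f}")
```

Note that the result depends only on direction: scaling A by any positive constant leaves all three values unchanged.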
---

## Hidden State Information

* Adding difficulty, order may or may not matter
  * "Seeing those fluffy bunnies, after a week wandering the wilderness without food, made me"
  * "After a week wandering the wilderness without food, seeing those fluffy bunnies made me"
* We want the hidden states after those two phrases to be similar or the same
* One of the tricks to natural language processing is to separate the *embeddings* from *context*

---

## Context

* Context (our hidden state) will combine multiple word embeddings
* This results in a more continuous hidden state
  * Better captures the soft, inexact nature of concepts and language
  * Could change slightly or abruptly as we add tokens
* Now we have two tasks:
  * learn word embeddings
  * learn hidden states from observed embeddings

---

## Note

* Before we dive into learning embeddings
* This is a precursor to learning a more complicated model
* Remember, we don't want to base our model upon words
  * It should learn a hidden state represented by a function of word embeddings
* So remember that this is all step 1 of N

---

## Intuitions About Embeddings

* So what is an embedding, exactly?
* Just a lookup table from our word (a one-hot vector) to some numbers
* In PyTorch, it also tracks gradients so those numbers can change
* If "smile" and "happy" should be similar, then "Seeing fluffy bunnies make me" should be enough to predict the embedding of either word

---

## Linearity

* We want our embeddings to be linearly transformable, representing a continuous space
* If "frown" and "smile" express something opposite, then one of the values in their embeddings should be the negation of the other
  * Similarity goes from 0 to 1, anything from 0 to -1 is a dissimilarity
* If another value in the embedding represents that they are facial expressions, those numbers may be the same
* So if we add the embeddings of "smile" and "frown", the emotion part should cancel and the expression part should remain

---

## Learning Missing Words

* Call $V = [v_1, v_2, v_3, v_4, v_5]$ the vector representation of the words in "Seeing fluffy bunnies make me"
* We should be able to find a function $f(V)$ that predicts "smile" and "happy" with high probability
* We don't want to worry about structure yet, so we will treat the vectors as a *continuous bag-of-words*
* This means that we'll predict a token given some group of vectors
* The vectors will be averaged together, so the exact window size and ordering won't matter

---

## CBOW Example

* Let's take this sentence from *Anna Karenina* and a context window of 5:

> Happy families are all alike; every unhappy family is unhappy in its own way.
* A bag of words might get the words "happy families all alike"
  * It would predict "are" since that is the word most associated with the others
* If given "unhappy in own way" we would predict "its"

---

## Skip-Gram

* An alternative method for training embeddings is the *Skip-gram*
* Both CBOW and Skip-grams were introduced in the [same paper](https://arxiv.org/abs/1301.3781)
* Skip-grams have us predicting *any* word in a sentence given any other word
* We'll stick with CBOW because it probably makes more intuitive sense
* To train a good context we actually need billions of examples anyway

---

## Skip-Gram Example

> Happy families are all alike; every unhappy family is unhappy in its own way.

* Now we take a single word and predict the others
* Given "are" we would predict, with equal probabilities: "happy," "families," "all," and "alike"
* Obviously the exact prediction will depend upon the training corpus
* And common words, such as "is", will be associated with many other words
  * And being equally associated with many things will end up being a weak association

---

## CBOW Model

```python
import math

import torch


class CBOWEmbedder(torch.nn.Module):
    def __init__(self, num_tokens, embedding_size, context_length):
        """
        Initialize a predictor.

        Arguments:
            num_tokens (int): The number of unique tokens in the vocabulary.
            embedding_size (int): The number of features to represent the hidden state.
            context_length (int): The number of tokens of context.
        """
        super(CBOWEmbedder, self).__init__()
        self.embedding = torch.nn.Embedding(num_tokens, embedding_size)
        # Use the hidden state to predict the next token
        lin_size = 6*math.ceil(math.sqrt(num_tokens))
        self.predictor = torch.nn.Sequential(
            torch.nn.Linear(embedding_size, lin_size),
            torch.nn.Linear(lin_size, num_tokens))
        self.decision = torch.nn.Softmax(dim=1)
        self.embedding.weight.data.uniform_(-0.1, 0.1)

    def forward(self, observation):
        embeds = self.embedding(observation)
        # By taking the mean we make sentences easier to learn from embeddings
        # since they no longer have any ordering.
        return self.decision(self.predictor(torch.mean(embeds, dim=1)))
```

---

## Simple Model

* Notice that this is a simple linear model
* We want the target word to be a linear transformation from other words
* So a word with a similar meaning must have similar values in the embedding
* This will lead to interesting mathematical properties once the embedding is learned

---

## Learning Examples

* Let's learn from something familiar
* Going to use [Romeo and Juliet, from Project Gutenberg](https://www.gutenberg.org/ebooks/1513)
  * It turns out that Anna Karenina is long and uses difficult words
* Note that we'll get the copyright notice at the top, just like any good model

---

## Learning N-Grams

* Let's make our target task word prediction
  * Given previous n tokens, predict the next one
* We will learn the embedding first, and then the task
* Remember, we are just doing something to learn and test out an embedding
* Then we'll talk about bigger concepts

---

## N-Gram Model

* Notice that increasing the context eats up memory
* Linear layer grows by embedding_size * lin_size with each additional token

```python
class NGramPredictor(torch.nn.Module):
    def __init__(self, num_tokens, embedding_size, context_length):
        """
        Initialize a predictor.

        Arguments:
            num_tokens (int): The number of unique tokens in the vocabulary.
            embedding_size (int): The number of features to represent the hidden state.
            context_length (int): The number of tokens of context.
        """
        super(NGramPredictor, self).__init__()
        self.embedding = torch.nn.Embedding(num_tokens, embedding_size)
        # Use the hidden state to predict the next token
        lin_size = 6*math.ceil(math.sqrt(num_tokens))
        self.predictor = torch.nn.Sequential(
            torch.nn.Flatten(),
            torch.nn.Linear(context_length * embedding_size, lin_size),
            torch.nn.LayerNorm(lin_size),
            torch.nn.GELU(),
            torch.nn.Linear(lin_size, num_tokens))
        self.decision = torch.nn.Softmax(dim=1)
        self.embedding.weight.data.uniform_(-0.1, 0.1)

    def forward(self, observation):
        embeds = self.embedding(observation)
        return self.decision(self.predictor(embeds))
```

---

## Preprocessing

* Going to use the nltk (natural language toolkit) for preprocessing
* Will also remove stage directions
* It is also common to replace rare words with a stand-in, such as "UNK"
  * We'll keep things simple

```python
import re

import nltk
import numpy as np

with open(args.corpus, 'r') as file:
    text = file.read()

# Remove stage directions from Shakespeare's corpus
text = re.sub(r'^[A-Z]+:$', '', text, flags=re.MULTILINE)
text = re.sub(r'^[A-Z]+.$', '', text, flags=re.MULTILINE)
text = re.sub(r'^ Enter.*$', '', text, flags=re.MULTILINE)
text = re.sub(r'\[.*?\]', '', text, flags=re.MULTILINE)
text = re.sub(r'\*\*\*.*\*\*\*', '', text)
# Tokenize the cleaned, lower-cased text
words = nltk.word_tokenize(text.lower())
flat_words = np.array(words)

# First, count the number of unique words
unique_tokens = np.unique(flat_words)
num_tokens = len(unique_tokens)

print("Converting all words to numbers.")
word_to_index = {}
for idx, token in enumerate(unique_tokens):
    word_to_index[token] = idx

# Convert the words to numbers. These will look up a vector embedding in our model's embedding.
sentence_indices = [word_to_index[word] for word in flat_words]
```

---

## Preprocessing

* Preprocessing reduces Romeo and Juliet to 4119 unique tokens
* The length of the training corpus is 35,098
* So words are used on average less than 9 times
  * We know that a few prepositions are going to grab most of that
* This will make things challenging

---

## Pretraining

```python
# Make a continuous bag of words model
cbow_model = CBOWEmbedder(num_tokens, args.embedding_size, 2*args.context_length)
print("Building model.")
print(cbow_model)
optimizer = torch.optim.AdamW(cbow_model.parameters(), lr=0.001, weight_decay=args.weight_decay)
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.001)

# Pretrain the embedding with CBOW
if args.load is None:
    cbow_model.to(device)
    cbow_model.train()
    rng = np.random.default_rng()
    # Track an exponentially smoothed loss, reported every 1000 steps
    running_loss = 0.0
    alpha = 0.01
    for step in range(args.steps):
        # Make a batch
        begins = rng.integers(low=0, high=len(sentence_indices) - 2*args.context_length - 1,
                              size=args.batch_size)
        # For continuous bag of words CBOW
        # This is generally used for pretraining
        groups = [sentence_indices[begin:begin+2*args.context_length+1] for begin in begins]
        left = torch.tensor([gr[:args.context_length] for gr in groups], dtype=torch.long)
        labels = torch.tensor([gr[args.context_length] for gr in groups], dtype=torch.long).to(device)
        right = torch.tensor([gr[args.context_length+1:] for gr in groups], dtype=torch.long)
        batch = torch.cat((left, right), dim=1).to(device)
        # Get ready to learn
        cbow_model.zero_grad()
        # Create a context window of the token embeddings
        y_hat = cbow_model(batch)
        # Note: CrossEntropyLoss expects unnormalized logits, but the model
        # returns softmax probabilities, so softmax is applied a second time here.
        loss = criterion(y_hat, labels)
        running_loss += alpha*(loss.item() - running_loss)
        # Gradient calculation
        loss.backward()
        torch.nn.utils.clip_grad_norm_(cbow_model.parameters(), 1.0)
        # Update weights
        optimizer.step()
```

---

## Embedding Training

```python
# Now train the NGRAM predictor
model = NGramPredictor(num_tokens, args.embedding_size,
                       args.context_length)
# Use the pretrained result from the CBOW
with torch.no_grad():
    model.embedding.weight.copy_(cbow_model.embedding.weight)
# If we trusted the embedding, we could turn off learning
#model.embedding.weight.requires_grad = False
print("Building model.")
print(model)
if args.load is not None:
    model.load_state_dict(torch.load(args.load, weights_only=True))
    model.to(device)
else:
    optimizer = torch.optim.AdamW(model.parameters(), lr=0.0005, weight_decay=args.weight_decay/10)
    criterion = torch.nn.CrossEntropyLoss()
    model.to(device)
    model.train()
    rng = np.random.default_rng()
    # Track an exponentially smoothed loss, reported every 1000 steps
    running_loss = 0.0
    alpha = 0.01
    for step in range(args.steps):
        # Make a batch
        begins = rng.integers(low=0, high=len(sentence_indices) - args.context_length - 1,
                              size=args.batch_size)
        # For next word prediction
        batch = torch.tensor([sentence_indices[begin:begin+args.context_length] for begin in begins],
                             dtype=torch.long).to(device)
        # The labels are the following token
        labels = torch.tensor([sentence_indices[begin+args.context_length] for begin in begins],
                              dtype=torch.long).to(device)
        # Get ready to learn
        model.zero_grad()
        # Create a context window of the token embeddings
        y_hat = model(batch)
        # Note: as in pretraining, the model output is already a softmax
        # while CrossEntropyLoss expects unnormalized logits.
        loss = criterion(y_hat, labels)
        running_loss += alpha*(loss.item() - running_loss)
        # Gradient calculation
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        # Update weights
        optimizer.step()
```

---

## Copying the Embedding

* Notice that we can simply copy over the pretrained embedding
* Remember that question about what we can do with initialization?
  * This should start us off nearer to a good minimum

```python
# Use the pretrained result from the CBOW
with torch.no_grad():
    model.embedding.weight.copy_(cbow_model.embedding.weight)
```

---

## Prediction

```python
# Now try to predict something.
# Make 5 predictions, just to see how much variety we can get.
print(f"Initial prompt is '{args.prompt}'")
for _ in range(5):
    prompt_words = np.array(nltk.word_tokenize(args.prompt.lower()))
    # Convert prompt words to token values
    prompt_tokens = [word_to_index[word] for word in prompt_words]
    model.eval()
    # Left-pad short prompts with the '.' token
    if len(prompt_tokens) < args.context_length:
        padding = [np.where(unique_tokens == '.')[0][0]] * (args.context_length - len(prompt_tokens))
        prompt_tokens = padding + prompt_tokens
    for i in range(20):
        # Create a context window of the token embeddings
        X = torch.tensor([prompt_tokens[-args.context_length:]]).to(device)
        # Find the next token probabilities
        y_hat = model(X)
        next_word_index = np.random.choice(np.arange(len(unique_tokens)),
                                           p=y_hat[0].cpu().detach().numpy())
        prompt_tokens.append(next_word_index)
    prompt_words = [unique_tokens[idx] for idx in prompt_tokens]
    # Clean up the tokens.
    for idx in range(len(prompt_words)):
        if idx+1 < len(prompt_words) and prompt_words[idx+1] in ['.', '!', '?', ';', ',']:
            punctuation = prompt_words[idx+1]
            prompt_words[idx] = prompt_words[idx] + punctuation
            prompt_words[idx+1] = ''
            # Make the word after a sentence ending upper case.
            if punctuation in ['.', '?', '!'] and idx+2 < len(prompt_words):
                prompt_words[idx+2] = prompt_words[idx+2][:1].upper() + prompt_words[idx+2][1:]
        if idx+2 < len(prompt_words) and prompt_words[idx+1] == "’" and len(prompt_words[idx+2]) == 1:
            prompt_words[idx] = prompt_words[idx] + prompt_words[idx+1] + prompt_words[idx+2]
            prompt_words[idx+1] = ''
            prompt_words[idx+2] = ''
    print(re.sub(r' +', ' ', ' '.join(prompt_words)))
```

---

## Note About Training

* The loss will look bad
* In fact, the model will be mostly wrong
* Of course!
  This is language; guessing what word comes next will often fail
* But, as long as the predictions are reasonable we are doing okay

---

## Sanity Test

* If you want to be sure that things are working, begin with a subset of the training corpus
* This model will learn a few hundred lines perfectly, regurgitating everything in the original
  * (assuming a large enough context window)
* With that, we can proceed with actual training

---

## Results

```
Initial prompt is 'do you know the cause'
do you know the cause? I will not thy lips. O, my romeo, friar lawrence. Romeo, alack the day
do you know the cause? I have you love, that i must to my thee. Scene ii. Friar lawrence’s
do you know the cause? I will not thy lips. Cheeks, and lie with her, but let me take it.
do you know the cause? I will you and to bed. Ah, juliet’s to; for i will not away
do you know the cause? I would you are? What, shall i groan! Romeo, and would i can not love
```

---

## More Results

```
Initial prompt is 'into this bed of death'
into this bed of death? They romeo in juliet? I will not thy lips. Put thee in, and, and bid
into this bed of death, they romeo in juliet? I will not away. What, the here? I will not well
into this bed of death, they romeo in juliet? I will not away. What, the pox of such a lisping,
into this bed of death? They romeo, friar lawrence. Romeo, go you to bed. Ah, the’s my
into this bed of death, they romeo in juliet? I will not away. What, the here i s. Go and
```

---

## Preposition Problems

* What is most likely after "of"?
* Romeo, it turns out

```
Initial prompt is 'into this bed of'
. into this bed of romeo? No, not he that tybalt is’s to bed. Ah, juliet’s to
. into this bed of romeo? No, i he not his the that of must, and convoy in the fearful of do
. into this bed of enter? Where here comes romeo? Friar lawrence. I hear some noise. If i come, ,
. into this bed of romeo? No, i he not his the tybalt of the, and i will give me occasion.
. into this bed of awhile, for romeo and in a montague, the only lacks a cover : the fish of if thou
```

---

## Embedding Influence

* The word "descend" is associated with "death" here

```
Initial prompt is 'descend into this bed of'
descend into this bed of death, they romeo in juliet, i will; for it is wisely, and; i come,
descend into this bed of death? They romeo, friar lawrence. Go with me to the fresh i will give you, sir
descend into this bed of death? They romeo, friar lawrence. Ay, forsooth. Well, he may chance to do some
descend into this bed of death? They romeo, friar lawrence. Romeo, there? What is it in that? I will
descend into this bed of death, they romeo in juliet, he is there lies, tybalt. First and, and you will
```

---

## Embedding Influence

* The other words may not even matter

```
Initial prompt is 'descend into this rose of'
descend into this rose of death, and romeo in juliet, he is it is my man, but i ’ ll be hanged
descend into this rose of death, and romeo in juliet, he is not, o, she was well? She’s
descend into this rose of death, and romeo in juliet, he is not, spited, grief you shall be a with sweet
descend into this rose of death, and romeo in juliet, he is not away. What, the pox of such a lisping
descend into this rose of death, and romeo in juliet, he is it is my man, but i ’ ll be hanged
```

---

## Good Embeddings

* Suppose we had a good embedding
* We should be able to do something cool, like math on concepts in embedded space
  * $\text{king} - \text{man} + \text{woman} \approx \text{queen}$
  * $\text{Paris} - \text{France} + \text{Italy} \approx \text{Rome}$
* And what does it take to get such good embeddings?
* Just a dataset of hundreds of billions of words

---

## Comparison With Images

* The embedding here is similar to the initial large convolution in ResNeXt
* But learning the embedding is more troublesome
* Obviously everything afterwards depends upon it

---

## Improving Embeddings

* Training just the embedding is difficult
* But we've had good methods for more than 10 years
* For an example, see [word2vec](https://arxiv.org/abs/1301.3781)

---

## Improvements

* When training CBOW or Skip-grams, the ordering of words doesn't matter
* Since structure has been tossed out of the window, we can also toss the over-common words
  * So if "of" shows up too often, just drop it
* The sentences will be mangled, but training the embedding will improve without so much redundant data
* We can also search for common phrases and combine them into single tokens

---

## Negative Sampling

* Notice that the model predicts a single word out of thousands, and learns wrong guesses quickly
* So the majority of the math in the loss function is pushing the same weights to 0 repeatedly
  * And may erroneously push a weight to 0 too quickly
* We can change the loss to only look at the target word and $K$ negative samples
* $L(\theta) = \log \mathrm{Sigmoid}(u_{c}^{T} v_{w})+\sum_{k = 1}^{K} \log \mathrm{Sigmoid}(-u_{k}^{T} v_{w})$
  * $u_c$ is the context vector
  * $v_w$ is the target word
  * $u_k$ is the vector of the kth negative sample
  * This quantity is maximized during training, so the loss to minimize is its negative

---

## Using The Embedding

* Let's say we bothered to get everything required to train a great embedding
* What next?
* The embedding itself only tells us what words are related to other words
* Our simple linear n-gram predictor needs more context to make good predictions
* But longer context windows will take huge amounts of memory

---

## Recurrent Networks

* This problem motivates the development of more interesting models
* One powerful idea is that of a *recurrent* neural network
* That means that the network remembers its own hidden state
* That's where we'll pick up next class!
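---

## Appendix: Negative Sampling Sketch

For reference, here is a small plain-Python sketch of the negative sampling expression from the earlier slide, $\log \mathrm{Sigmoid}(u_c^T v_w) + \sum_k \log \mathrm{Sigmoid}(-u_k^T v_w)$. The two-dimensional vectors and the function names are made up for illustration; a real implementation would use learned embeddings and sample the negatives from the corpus.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(1 / (1 + exp(-x)))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def negative_sampling_objective(u_c, v_w, negatives):
    """log Sigmoid(u_c . v_w) + sum over k of log Sigmoid(-u_k . v_w)."""
    value = log_sigmoid(dot(u_c, v_w))
    for u_k in negatives:
        value += log_sigmoid(-dot(u_k, v_w))
    return value

v_w = [1.0, 0.5]                          # hypothetical target word vector
negatives = [[-0.5, 0.2], [0.1, -1.0]]    # hypothetical negative samples (K = 2)
good = negative_sampling_objective([1.0, 0.5], v_w, negatives)    # context aligned with target
bad = negative_sampling_objective([-1.0, -0.5], v_w, negatives)   # context opposed to target
print("aligned context:", good)
print("opposed context:", bad)
```

The value is always at most zero and rises toward zero as the context vector aligns with the target and the negatives point away from it. Only the target and the $K$ negatives contribute, so each step touches a handful of vectors instead of the whole vocabulary.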