# CS 461 - Lecture 18 ## Machine Learning Principles ### Memory Networks Bernhard Firner 2025-11-05 --- ## SVM & Kernel Review * SVMs with soft margins are a huge improvement over perceptrons * Sparsity in $\alpha$ makes them usable with large datasets * Complexity parameter, C, give you control over the model complexity * SVMs are constructed to maximize the margin between examples of different classes * Improves generalization * Kernels allow them to learn functions in nonlinear spaces --- ## Interpretation * The weight vector is still a linear combination of the training data * With RBF kernel, we clearly built a similarity search * But with an optimal set of support vectors * Better than using KNN for clustering, at least --- ## Hyperparameter Problems * But we still need to optimize hyperparemeters * Which kernels? * Or which combination of kernels? * What about their parameters? * And how about C? --- ## Digits - Linear Kernel Accuracy 87.22% with C=1
--- ## Digits - RBF Kernel Accuracy 87.22% with C=1
--- ## Digits - RBF Kernel Accuracy 97.56% with $\gamma=0.01$ and C=1000
--- ## What about Structure? * It's fair to ask how a human might recognize a digit * Probably similarity to some ideal, right? * We probably have some kind of exemplar floating around in our heads * But the exemplar is conceptual, not an image --- ## What is a concept? * We've seen one technique that projects from image space into another space * PCA * That conceptual space could be useful * For example, comparisons in the conceptual space would be more meaningful --- ## Story * This will be a "math light" lecture * Trying to convey an intuition about our next topic, neural networks * Also trying to make it seem like they are a logical thing to do * Which they aren't! * Going to name different techniques, but that's to satisfy curiosity * Not expecting you to memorize every technique along the way --- ## Warning * I'm going to present a series of concept that will (hopefully) logically lead to neural networks * In reality, progress was more lumpy * However, the concepts will help you understand neural networks as more than black boxes --- ## Reading * Murphy * Chapters 19.1 - 19.4.2 * Markov random fields to Boltzman machines * Chapter 27.7 * Restricted Boltzman Machines --- ## Projections * Let's say we want to project to and from conceptual space * If $g$ projects into it, and $f$ projects out of it, then $f(g(x))=x$ * There are a few ways to think of $g$ * Maybe it's a compressor * Or maybe an associative memory --- ## Let's talk Images * A compressor or an associative memory needs to find correlations between components * Then we need to be able to query it * For example, reproduce the entire image given a small piece or a downscaled version * Then we could use this for classification * Given an unknown image, fetch the nearest exemplar stored in our model --- ## The Model * So what do we know that could guess one pixel's value given some other pixel values? * This implies structure, so maybe a Markov model * The relationships can go in *any* direction through the image, though * This *undirected graphical model* is called a *Markov random field* or *Markov network* --- ## Modelling * How do we represent this? * Any pixel's value is predicted from the value of every other pixel * Sounds like a matrix * Flatten our image to get a sample, $x$ * The outer product, $xx^T$ creates a row for each pixel --- ## Simple example * Let's say we have a 4 pixel image
0.5
1
0
0.5
--- ## Weight Matrix * $xx^T$ has a row to predict each pixel's value * And note, this is $xx^t$ if x is 4x1, $x^Tx$ is 1x4, so how you store the image matters
0.25
0
0.25
0.25
0.5
1
0.
0.5
0.
0
0.
0
0.25
0.5
0.
0.25
--- ## Removing the diagonal * We don't want to use each pixel to represent itself * Set the diagonal to 0
0.
0
0.25
0.25
0.5
0.
0.
0.5
0.
0
0.
0
0.25
0.5
0.
0.
--- ## Prediction * Let's say we only get the top row of pixels in $X_{t}$ * Can we retrieve the bottom row with $w^TX_{t}$?
0.5
1
?
?
--- ## Prediction * This doesn't *quite* work * What if we simplify all weights to 1 or -1? * This basically changes this to a classification problem
0.625
0.5
0
0.625
--- ## Code ```python import argparse import gzip import os import numpy as np from PIL import Image def load_mnist_ubyte(image_path): """ Loads MNIST images from the raw ubyte files. Args: image_path (str): Path to the image file (e.g., 'train-images-idx3-ubyte.gz'). Returns: images (numpy.ndarray) """ with gzip.open(image_path, 'rb') as f: # Read the header: magic number (4 bytes) + num images (4 bytes) + # num rows (4 bytes) + num cols (4 bytes) = 16 bytes. # Read the entire file content into a buffer image_data = f.read() # The image data starts at byte 16. images = np.frombuffer(image_data, dtype=np.uint8, offset=16) # We need the dimensions to reshape. We can extract them from the header bytes, # which are big-endian ('>'). We use struct.unpack if we were being strict, # but here we'll assume the standard MNIST format and calculate the dimensions # for a clean numpy approach. # The number of images is in the 5th to 8th byte (4 bytes) num_images = np.frombuffer(image_data, dtype='>i4', offset=4, count=1)[0] # Rows and columns are 28x28 for MNIST, stored in bytes 9-12 and 13-16. # num_rows = np.frombuffer(image_data, dtype='>i4', offset=8, count=1)[0] # num_cols = np.frombuffer(image_data, dtype='>i4', offset=12, count=1)[0] num_rows = 28 num_cols = 28 # Reshape the 1D array into a 3D array (num_images, rows, columns) images = images.reshape(num_images, num_rows, num_cols) return images def flattenAndPolarize(X): """Convert the image into value -1 and 1""" flattened = X.reshape(-1, 1) positive_mask = np.sign(flattened - 0.5) == 1 polarized = np.where(positive_mask, 1, -1) return polarized def createWeightMatrix(X): # This is the outer product weight_matrix = np.dot(X, X.T) # We remove the identity because predictions for pixel i shouldn't use pixel i np.fill_diagonal(weight_matrix, 0) return weight_matrix def retrieve(weights, X, max_steps=100): """Retrieve a sample from the Hopfield weights and the partial image.""" cur_state = X.copy() num_features = X.size for _ in range(max_steps): # Remember the previous state so we can check for convergence next_state = np.sign(weights.T @ cur_state) # Check for convergence (no change in state) if np.array_equal(cur_state, next_state): # print(f"Converged after {_ + 1} iterations.") break cur_state = next_state return cur_state if __name__ == "__main__": parser = argparse.ArgumentParser() parser.add_argument( "--train", required=True, help="gzip file with mnist data.") parser.add_argument( "--index", required=False, type=int, default=0, help="Index to convert and retrieve") args = parser.parse_args() print("Loading data") X_train = load_mnist_ubyte(args.train)/255 # Polarize the first sample polarized = flattenAndPolarize(X_train[args.index]) W = createWeightMatrix(polarized) print(f"Image shape is {X_train[args.index].shape}") print(f"Weight matrix size is {W.shape}") masked = polarized.copy() print(f"masked size is {masked.size}") masked[int(masked.size/2):].fill(0) retrieved = retrieve(W, masked) Image.fromarray((255*((polarized+1)/2).reshape((28, 28))).astype(np.uint8)).save(f"hopfield_test_image_{args.index}.png") Image.fromarray((255*((masked+1)/2).reshape((28, 28))).astype(np.uint8)).save(f"hopfield_masked_image_{args.index}.png") Image.fromarray((255*((retrieved+1)/2).reshape((28, 28))).astype(np.uint8)).save(f"hopfield_retrieved_image_{args.index}.png") ``` --- ## Examples
--- ## Combined Weights * We want to mix weights for multiple images * And then query with unseen inputs * So we will simply sum the weights * The reconstruction will be wrong after a single multiply, so we'll just repeat it and hopefully it will converge * $sign(w^Tsign(w^Tsign(w^Tsign(w^Tsign(w^TX_{t})))))$ --- ## Test with 2 Works great!
--- ## Test with 10 * Some garbage * It turns out that this doesn't work if the images aren't linearly separable
--- ## Internals * Starts off looking like a 3 * But some of those pixels are correlated with a 5 * After several rounds of multiplications, it is a combination of the two
--- ## Solution * The solution used in this community wasn't kernel functions * Instead, since we are already doing a bunch of matrix multiplication, why not use multiple matrices * These "hidden" states that aren't directly applicable to the input make the model more powerful * Called a restricted Boltzman machine (RBM) --- ## Problem * How do we train those other matrices? * For each training sample we can demand * $X_i = sign(w_5^Tsign(w_4^Tsign(w_3^Tsign(w_2^Tsign(w_1^TX_{i})))))$ * Learning all those matrix values is tough * So lean on stochastic gradient descent, maybe with an $\mathcal{l}_2$ norm --- ## Pause * We motivated this with classification as our task * And posited that a model that could retrieve a full image from part of the image has a good internal representation * But why bother with the reconstruction? --- ## Just Predict * $X_i = sign(w_5^Tsign(w_4^Tsign(w_3^Tsign(w_2^Tsign(w_1^TX_{i})))))$ * If some internal step, represented by those $w$ matrices, is good, why not use it? * So replace those $sign$ functions with something else, maybe sigmoid, and stop early * $Class_i = sigm(w_3^Tsigm(w_2^Tsigm(w_1^TX_{i})))$ --- ## Neural Networks * And that is basically a neural network * This isn't how we got there, but it is a way to think about them * We project from an observed space (X) into some new feature space * And then we make predictions * We'll pick this up next week