<!--
Abstract:

CS 461
Introduction to Deep Learning
Lecture 11
-->

# CS 461 - Lecture 18

## Machine Learning Principles

### Memory Networks

Bernhard Firner

2025-11-05

---

## SVM & Kernel Review

* SVMs with soft margins are a huge improvement over perceptrons
  * Sparsity in $\alpha$ makes them usable with large datasets
  * Complexity parameter, C, give you control over the model complexity
* SVMs are constructed to maximize the margin between examples of different classes
  * Improves generalization
* Kernels allow them to learn functions in nonlinear spaces

---

## Interpretation

* The weight vector is still a linear combination of the training data
* With RBF kernel, we clearly built a similarity search
  * But with an optimal set of support vectors
  * Better than using KNN for clustering, at least

---

## Hyperparameter Problems

* But we still need to optimize hyperparemeters
  * Which kernels?
  * Or which combination of kernels?
  * What about their parameters?
  * And how about C?

---

## Digits - Linear Kernel

Accuracy 87.22% with C=1

---

## Digits - RBF Kernel

Accuracy 87.22% with C=1

---

## Digits - RBF Kernel

Accuracy 97.56% with $\gamma=0.01$ and C=1000

---

## What about Structure?

* It's fair to ask how a human might recognize a digit
  * Probably similarity to some ideal, right?
* We probably have some kind of exemplar floating around in our heads
* But the exemplar is conceptual, not an image

---

## What is a concept?

* We've seen one technique that projects from image space into another space
  * PCA
* That conceptual space could be useful
  * For example, comparisons in the conceptual space would be more meaningful

---

## Story

* This will be a "math light" lecture
  * Trying to convey an intuition about our next topic, neural networks
  * Also trying to make it seem like they are a logical thing to do
    * Which they aren't!
* Going to name different techniques, but that's to satisfy curiosity
  * Not expecting you to memorize every technique along the way

---

## Warning

* I'm going to present a series of concept that will (hopefully) logically lead to neural networks
  * In reality, progress was more lumpy
* However, the concepts will help you understand neural networks as more than black boxes

---

## Reading

* Murphy
  * Chapters 19.1 - 19.4.2
    * Markov random fields to Boltzman machines
  * Chapter 27.7
    * Restricted Boltzman Machines

---

## Projections

* Let's say we want to project to and from conceptual space
  * If $g$ projects into it, and $f$ projects out of it, then $f(g(x))=x$
* There are a few ways to think of $g$
  * Maybe it's a compressor
  * Or maybe an associative memory

---

## Let's talk Images

* A compressor or an associative memory needs to find correlations between components
* Then we need to be able to query it
  * For example, reproduce the entire image given a small piece or a downscaled version
* Then we could use this for classification
  * Given an unknown image, fetch the nearest exemplar stored in our model

---

## The Model

* So what do we know that could guess one pixel's value given some other pixel values?
  * This implies structure, so maybe a Markov model
  * The relationships can go in *any* direction through the image, though
* This *undirected graphical model* is called a *Markov random field* or *Markov network*

---

## Modelling

* How do we represent this?
  * Any pixel's value is predicted from the value of every other pixel
  * Sounds like a matrix
* Flatten our image to get a sample, $x$
* The outer product, $xx^T$ creates a row for each pixel

---

## Simple example

* Let's say we have a 4 pixel image

---

## Weight Matrix

* $xx^T$ has a row to predict each pixel's value
  * And note, this is $xx^t$ if x is 4x1, $x^Tx$ is 1x4, so how you store the image matters

---

## Removing the diagonal

* We don't want to use each pixel to represent itself
  * Set the diagonal to 0

---

## Prediction

* Let's say we only get the top row of pixels in $X_{t}$
  * Can we retrieve the bottom row with $w^TX_{t}$?

---

## Prediction

* This doesn't *quite* work
  * What if we simplify all weights to 1 or -1?
  * This basically changes this to a classification problem

---

## Code

```python
import argparse
import gzip
import os
import numpy as np
from PIL import Image

def load_mnist_ubyte(image_path):
    """
    Loads MNIST images from the raw ubyte files.

Args:
        image_path (str): Path to the image file (e.g., 'train-images-idx3-ubyte.gz').

Returns:
        images (numpy.ndarray)
    """
    with gzip.open(image_path, 'rb') as f:
        # Read the header: magic number (4 bytes) + num images (4 bytes) +
        # num rows (4 bytes) + num cols (4 bytes) = 16 bytes.

# Read the entire file content into a buffer
        image_data = f.read()

# The image data starts at byte 16.
        images = np.frombuffer(image_data, dtype=np.uint8, offset=16)

# We need the dimensions to reshape. We can extract them from the header bytes,
        # which are big-endian ('>'). We use struct.unpack if we were being strict,
        # but here we'll assume the standard MNIST format and calculate the dimensions
        # for a clean numpy approach.

# The number of images is in the 5th to 8th byte (4 bytes)
        num_images = np.frombuffer(image_data, dtype='>i4', offset=4, count=1)[0]
        # Rows and columns are 28x28 for MNIST, stored in bytes 9-12 and 13-16.
        # num_rows = np.frombuffer(image_data, dtype='>i4', offset=8, count=1)[0]
        # num_cols = np.frombuffer(image_data, dtype='>i4', offset=12, count=1)[0]
        num_rows = 28
        num_cols = 28

# Reshape the 1D array into a 3D array (num_images, rows, columns)
        images = images.reshape(num_images, num_rows, num_cols)
    return images

def flattenAndPolarize(X):
    """Convert the image into value -1 and 1"""
    flattened = X.reshape(-1, 1)
    positive_mask = np.sign(flattened - 0.5) == 1
    polarized = np.where(positive_mask, 1, -1)
    return polarized

def createWeightMatrix(X):
    # This is the outer product
    weight_matrix = np.dot(X, X.T)
    # We remove the identity because predictions for pixel i shouldn't use pixel i
    np.fill_diagonal(weight_matrix, 0)
    return weight_matrix

def retrieve(weights, X, max_steps=100):
    """Retrieve a sample from the Hopfield weights and the partial image."""
    cur_state = X.copy()
    num_features = X.size

for _ in range(max_steps):
        # Remember the previous state so we can check for convergence
        next_state = np.sign(weights.T @ cur_state)

# Check for convergence (no change in state)
        if np.array_equal(cur_state, next_state):
            # print(f"Converged after {_ + 1} iterations.")
            break
        cur_state = next_state

return cur_state

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--train",
        required=True,
        help="gzip file with mnist data.")
    parser.add_argument(
        "--index",
        required=False,
        type=int,
        default=0,
        help="Index to convert and retrieve")

args = parser.parse_args()

print("Loading data")
    X_train = load_mnist_ubyte(args.train)/255

# Polarize the first sample
    polarized = flattenAndPolarize(X_train[args.index])
    W = createWeightMatrix(polarized)
    print(f"Image shape is {X_train[args.index].shape}")
    print(f"Weight matrix size is {W.shape}")
    masked = polarized.copy()
    print(f"masked size is {masked.size}")
    masked[int(masked.size/2):].fill(0)
    retrieved = retrieve(W, masked)

Image.fromarray((255*((polarized+1)/2).reshape((28, 28))).astype(np.uint8)).save(f"hopfield_test_image_{args.index}.png")
    Image.fromarray((255*((masked+1)/2).reshape((28, 28))).astype(np.uint8)).save(f"hopfield_masked_image_{args.index}.png")
    Image.fromarray((255*((retrieved+1)/2).reshape((28, 28))).astype(np.uint8)).save(f"hopfield_retrieved_image_{args.index}.png")
```

---

## Examples

---

## Combined Weights

* We want to mix weights for multiple images
  * And then query with unseen inputs
  * So we will simply sum the weights
* The reconstruction will be wrong after a single multiply, so we'll just repeat it and hopefully it will converge
  * $sign(w^Tsign(w^Tsign(w^Tsign(w^Tsign(w^TX_{t})))))$

---

## Test with 2

Works great!

---

## Test with 10

* Some garbage
  * It turns out that this doesn't work if the images aren't linearly separable

---

## Internals

* Starts off looking like a 3
  * But some of those pixels are correlated with a 5
  * After several rounds of multiplications, it is a combination of the two

---

## Solution

* The solution used in this community wasn't kernel functions
* Instead, since we are already doing a bunch of matrix multiplication, why not use multiple matrices
  * These "hidden" states that aren't directly applicable to the input make the model more powerful
  * Called a restricted Boltzman machine (RBM)

---

## Problem

* How do we train those other matrices?
* For each training sample we can demand
  * $X_i = sign(w_5^Tsign(w_4^Tsign(w_3^Tsign(w_2^Tsign(w_1^TX_{i})))))$
* Learning all those matrix values is tough
  * So lean on stochastic gradient descent, maybe with an $\mathcal{l}_2$ norm

---

## Pause

* We motivated this with classification as our task
* And posited that a model that could retrieve a full image from part of the image has a good internal representation
* But why bother with the reconstruction?

---

## Just Predict

* $X_i = sign(w_5^Tsign(w_4^Tsign(w_3^Tsign(w_2^Tsign(w_1^TX_{i})))))$
  * If some internal step, represented by those $w$ matrices, is good, why not use it?
* So replace those $sign$ functions with something else, maybe sigmoid, and stop early
* $Class_i = sigm(w_3^Tsigm(w_2^Tsigm(w_1^TX_{i})))$

---

## Neural Networks

* And that is basically a neural network
* This isn't how we got there, but it is a way to think about them
* We project from an observed space (X) into some new feature space
  * And then we make predictions
* We'll pick this up next week