# CS 211 - Lecture 19

## Computer Architecture

### Memory Usage Optimization

Bernhard Firner

2025-11-11

---

## Review

* Cache anatomy
  * Block: the fixed-size packet of information
  * Line: the block and associated information
    * Valid bit, tag, dirty bit, etc
  * Set: A group of lines

---

## Cache Misses

* `cold miss`: When the valid bit is unset (also called *compulsory miss*)
* `capacity miss`: When our cache is full and the miss couldn't be avoided
  * For example, you are looping repeatedly through a 1MB array on a 64KB cache
* `conflict miss`: When blocks that map to the same set keep evicting each other
  * There is available space elsewhere in the cache, but the access pattern does not use it
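
To see how lines can compete for one set while the rest of the cache sits empty, here is a minimal sketch of how an address maps to a set. It assumes the common scheme in which the set index is taken from the address bits just above the block offset; `set_index` is a hypothetical helper written for this slide, not a library function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper: which set does an address map to?
// Assumes the set index comes from the address bits just above the block offset.
size_t set_index(uintptr_t address, size_t block_bytes, size_t num_sets) {
    assert(block_bytes > 0 && num_sets > 0);
    return (size_t)((address / block_bytes) % num_sets);
}
```

With 64-byte blocks and 64 sets, addresses that are 4096 bytes apart land in the same set. A loop that keeps touching more such addresses than there are lines in a set will evict its own data, even though other sets are unused.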

---

## AMD Zen 4 (2023)

<table>
<tr> <th>Cache</th><th>Access Time (cycles)</th><th>Cache Size (C)</th><th>Assoc. (E)</th><th>Block size (B)</th><th>Sets (S)</th></tr>
<tr> <td> Op cache</td><td> 1 (9 ops/cycle)    </td><td> 6.75 K ops    </td><td> 12 </td><td> 9 macro ops  </td><td> 64 </td></tr>
<tr> <td> L1 i-cache</td><td> 4               </td><td> 32 KB    </td><td> 8 </td><td> 64 B  </td><td> 64 </td></tr>
<tr> <td> L1 d-cache</td><td> 4-5 int, 7-8 fp </td><td> 32 KB    </td><td> 8 </td><td> 64 B  </td><td> 64 </td></tr>
<tr> <td> L2 unified cache</td><td> 14   </td><td> 1 MB    </td><td> 8 </td><td> 64 B  </td><td> 2048 </td></tr>
<tr> <td> L3 unified cache</td><td> 50   </td><td> 96 MB    </td><td> 16 </td><td> 64 B  </td><td> 98304 </td></tr>
</table>

<p style='font-size:13pt'>$^*$ Source: [AMD Zen4 Software Optimization Guide](https://docs.amd.com/v/u/en-US/57647)</p>

---

## AMD Zen 4 Diagram

<p style='font-size:13pt'>$^*$ Source: [AMD Zen4 Software Optimization Guide](https://docs.amd.com/v/u/en-US/57647)</p>

---

## Prefetching

* Modern CPUs detect multiple access patterns for prefetching
* Zen 4 has five prefetching strategies:
  * L1 Stream: assumes sequential lines are accessed
  * L1 Stride: looks for constant offsets in access patterns
  * L1 Region: looks for locally correlated access patterns
  * L2 Stream: assumes sequential lines are accessed
  * L2 Up/Down: uses history to predict access of the next or previous line

---

## Matrix Multiplication

* We used matrix multiplication as an example
* The order that we traversed the matrices had a large impact upon performance

---

## Memory Usage

* A matrix is usually stored as one long contiguous block of memory
  * As you did in HW3
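
As a reminder of what that layout looks like, here is a minimal sketch. The `DemoMatrix` type and its two helpers are hypothetical names made up for this slide; the `Matrix` type from the homework may be organized differently.

```c
#include <stddef.h>
#include <stdlib.h>

// Hypothetical layout; the course's Matrix type may differ.
typedef struct {
    size_t height;
    size_t width;
    double *data;   // height * width doubles in one contiguous block
    double **rows;  // rows[r] points at the start of row r inside data
} DemoMatrix;

// Allocate a zero-filled matrix; returns an empty matrix on failure.
DemoMatrix demo_new_matrix(size_t height, size_t width) {
    DemoMatrix m = {0, 0, NULL, NULL};
    if (height == 0 || width == 0) return m;
    m.data = calloc(height * width, sizeof(double));
    m.rows = malloc(height * sizeof(double *));
    if (m.data == NULL || m.rows == NULL) {
        free(m.data);
        free(m.rows);
        m.data = NULL;
        m.rows = NULL;
        return m;
    }
    m.height = height;
    m.width = width;
    for (size_t r = 0; r < height; ++r) {
        m.rows[r] = m.data + r * width;  // row r starts r * width doubles in
    }
    return m;
}

void demo_free_matrix(DemoMatrix m) {
    free(m.rows);
    free(m.data);
}
```

Because the rows sit back to back, `m.rows[r][c]` and `m.data[r * m.width + c]` name the same double, and the last element of one row is immediately followed by the first element of the next.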

---

## Loop Locality

* The outer loop counter changes infrequently
  * So a cache miss has less impact
* The inner loop counter changes frequently
  * So we try to access memory that can take advantage of spatial locality

```C
    for (size_t r = 0; r < product.height; ++r) {
        for (size_t c = 0; c < product.width; ++c) {
            for (size_t e = 0; e < num_elem; ++e) {
                product.rows[r][c] += left.rows[r][e] * right.rows[e][c];
            }
        }
    }
```

---

## Loop Locality

* So which order was best?
  * Putting `r` or `e` in the outer loop avoids frequent strides over the left or right matrices
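
Here is a minimal sketch of one such ordering (`r`, then `e`, then `c`). It assumes square row-major matrices stored in flat arrays rather than the `Matrix` type, and `multiply_rec` is a name invented for this example.

```c
#include <stddef.h>

// Multiply two n x n row-major matrices with loop order r, e, c.
// The inner loop walks right and product one double at a time.
void multiply_rec(const double *left, const double *right, double *product, size_t n) {
    for (size_t i = 0; i < n * n; ++i) {
        product[i] = 0.0;
    }
    for (size_t r = 0; r < n; ++r) {
        for (size_t e = 0; e < n; ++e) {
            // This value is constant for the whole inner loop.
            const double left_value = left[r * n + e];
            for (size_t c = 0; c < n; ++c) {
                product[r * n + c] += left_value * right[e * n + c];
            }
        }
    }
}
```

With `c` innermost, both `right` and `product` are read sequentially, and `left` only advances once per full pass of the inner loop.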

---

## AMD Ryzen 5 Results

An older desktop CPU.

---

## AMD Ryzen 9 Results

Newer chip, but in a laptop.

---

## Gains

* 40% of original CPU time at 1500x1500 matrices
  * And runtime growth looks nonlinear
* So hopefully that drives home how important cache aware code can be

---

## Mechanics

* So what was happening?
* The CPU cache has a set capacity
  * B bytes per block in a line
  * E lines per set
  * S sets per cache
  * Cache size = $B \times E \times S$
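
As a quick check, plugging the values from the Zen 4 table into this formula reproduces the listed cache sizes:

$$
\begin{aligned}
\text{L1 d-cache: } & 64 \times 8 \times 64 = 32{,}768\ \text{B} = 32\ \text{KB} \\
\text{L2: } & 64 \times 8 \times 2048 = 1{,}048{,}576\ \text{B} = 1\ \text{MB} \\
\text{L3: } & 64 \times 16 \times 98304 = 100{,}663{,}296\ \text{B} = 96\ \text{MB}
\end{aligned}
$$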

---

## Matrix Multiply Costs

* To multiply two doubles, we need 16 bytes
  * That is 1/4 of a 64-byte L1 cache block, so that is comfortable
    * Space to prefetch the next double while we work on multiplying
* 8 way associative, so prefetching shouldn't cause conflict misses
* But can the CPU predict the next memory location?

---

## Strides

* The inner loop steps over `e`
  * The memory in `left.rows[r][e]` is accessed 1 double at a time
    * Prefetching will work until the end of the row
  * Then, since the memory is contiguous, the next row immediately follows
    * So this loop is perfect for `left.rows`; no cache misses are likely

```C
    for (size_t c = 0; c < product.width; ++c) {
        for (size_t r = 0; r < product.height; ++r) {
            for (size_t e = 0; e < num_elem; ++e) {
                product.rows[r][c] += left.rows[r][e] * right.rows[e][c];
            }
        }
    }
```

---

## Strides

* The situation for `right.rows[e][c]` is different
  * We jump by a full row at each increment of the inner loop
  * The CPU will *probably* identify this as a read with stride equal to the width
  * And then we increment `c`

---

## Strides

* After incrementing `c`, the address `right.rows[e][c]` is shifted by 1 double
  * The block size in the d-cache is 64 bytes, so this can happen 8 times before a miss
    * Remember that doubles are 8 bytes
  * Prefetching *could* solve this, depending upon sophistication
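
To put numbers on the difference between the two walks, here is a rough sketch that counts how many distinct cache blocks a strided walk touches. It is a simplified model: it assumes the array starts on a block boundary, that doubles are 8 bytes, and that nothing is evicted or prefetched. `blocks_touched` is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

// Count the distinct cache blocks touched when reading `count` doubles,
// stepping `stride_in_doubles` elements between reads.
size_t blocks_touched(size_t count, size_t stride_in_doubles, size_t block_bytes) {
    assert(stride_in_doubles > 0 && block_bytes >= sizeof(double));
    const size_t doubles_per_block = block_bytes / sizeof(double);
    if (count == 0) {
        return 0;
    }
    if (stride_in_doubles >= doubles_per_block) {
        return count;  // every read lands in a different block
    }
    // Small strides never skip a block, so count up to the last block reached.
    const size_t last_index = (count - 1) * stride_in_doubles;
    return last_index / doubles_per_block + 1;
}
```

For a 1500-element row and 64-byte blocks, the stride-1 walk over `left.rows[r][e]` touches 188 blocks, while the column walk over `right.rows[e][c]` touches 1500, about eight times as many.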

---

## Strides

* The access pattern of `product.rows[r][c]` is similar to that of `right.rows`
  * It jumps by a row with every increment of `r`, and then by 8 bytes with every increment of `c`
  * But this happens rarely; if every 4 increments of `c` cause a cache miss, then that is `c`/4 misses in total

---

## Other Improvements

* We also went over block calculations, where we work on a small section of memory at a time
  * With proper tuning, maybe it could be faster than the other loops
  * Blocking helps more when the CPU does no prefetching
* Anything else?
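
As a reminder of the blocking idea, here is a minimal sketch. It is a generic tiled multiply, not the exact code from the earlier lecture; it assumes square row-major matrices in flat arrays, and the `tile` parameter is a tuning knob whose best value depends on the cache.

```c
#include <assert.h>
#include <stddef.h>

// Smaller of two sizes; used to clip a tile at the matrix edge.
static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

// Blocked (tiled) multiply of two n x n row-major matrices.
// `tile` is the tile edge length in elements.
void multiply_blocked(const double *left, const double *right, double *product,
                      size_t n, size_t tile) {
    assert(tile > 0);
    for (size_t i = 0; i < n * n; ++i) {
        product[i] = 0.0;
    }
    for (size_t r0 = 0; r0 < n; r0 += tile) {
        for (size_t e0 = 0; e0 < n; e0 += tile) {
            for (size_t c0 = 0; c0 < n; c0 += tile) {
                // Work on one small tile of each matrix at a time.
                for (size_t r = r0; r < min_size(r0 + tile, n); ++r) {
                    for (size_t e = e0; e < min_size(e0 + tile, n); ++e) {
                        const double left_value = left[r * n + e];
                        for (size_t c = c0; c < min_size(c0 + tile, n); ++c) {
                            product[r * n + c] += left_value * right[e * n + c];
                        }
                    }
                }
            }
        }
    }
}
```

The three outer loops pick a tile and the three inner loops do an ordinary multiply inside it, so the working set stays small enough to be reused before it is evicted.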

---

## Data Transformation

* What if we could use the same index for everything?
* These are square, so we could take the right matrix and transpose it
  * `transposed[i][j] = right[j][i]` for each position
* Is that slow?
  * $O(n^2)$. The multiply is $O(n^3)$, so that is nothing.
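
To put rough numbers on that, take $n = 1500$ as in the earlier timing results, and count one copy per element for the transpose and one multiply-add per inner-loop step (a simplification that ignores memory effects):

$$
\frac{n^2}{n^3} = \frac{1}{n} = \frac{1}{1500} \approx 0.07\%
$$

That is about $2.25 \times 10^6$ copies against $3.375 \times 10^9$ multiply-adds.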

---

## Transposing Code

```C
Matrix dotProduct_transpose_rce(Matrix left, Matrix right) {
    // If the sizes don't match, return a matrix with NULL data and 0 size.
    if (left.width != right.height) {
        Matrix empty = newMatrix(0, 0);
        return empty;
    }
    // Make a transposed matrix
    Matrix transpose_right = newMatrix(right.width, right.height);
    for (size_t i = 0; i < right.height; ++i) {
        for (size_t j = 0; j < right.width; ++j) {
            transpose_right.rows[j][i] = right.rows[i][j];
        }
    }
    Matrix product = newMatrix(left.height, right.width);
    // Now we can go across the rows of the right matrix rather than down the columns
    // Go row by row on the left and column by column on the right
    // Assume that all matrices are square, so the number of ops is either the height or the width
    size_t num_elem = left.height;
    for (size_t r = 0; r < product.height; ++r) {
        for (size_t c = 0; c < product.width; ++c) {
            double sum = 0.0;
            for (size_t e = 0; e < num_elem; ++e) {
                // This was previously sum += left.rows[r][e] * right.rows[e][c];
                // The transposition puts e into the second index
                sum += left.rows[r][e] * transpose_right.rows[c][e];
                // This is now the same as left.data[r*width+e] * transpose_right.data[c*width + e]
            }
            product.rows[r][c] += sum;
        }
    }
    // We are done with the transposed matrix
    freeMatrix(transpose_right);
    return product;
}
```

---

## Transposing Results (Ryzen5)

---

## Transposing Results (Ryzen9)

---

## Data Transformation

* Transforming data is often worth it
  * Changing images from CxHxW to HxWxC for some pixel operations
* In our case we can improve cache performance, so it is worth it
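
For the image example, the layout change can be sketched as below. This assumes dense `float` images with no padding between rows; `chw_to_hwc` is a name made up for this slide.

```c
#include <stddef.h>

// Copy an image from channel-major (C x H x W) to pixel-major (H x W x C) layout.
// After the copy, all channels of one pixel are adjacent in memory.
void chw_to_hwc(const float *chw, float *hwc,
                size_t channels, size_t height, size_t width) {
    for (size_t h = 0; h < height; ++h) {
        for (size_t w = 0; w < width; ++w) {
            for (size_t c = 0; c < channels; ++c) {
                hwc[(h * width + w) * channels + c] = chw[(c * height + h) * width + w];
            }
        }
    }
}
```

As with the transpose, the copy itself has poor locality on one side, but it is paid once, and every later per-pixel operation reads its channels from a single cache block.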

---

## Idle Cores

* Here's another way to improve results
* Most of your CPU cores are idle most of the time
  * So let's make them busy!
* We can multiply in parallel using threads

---

## If you don't know

* Quick primer on threads
  * Our programs are loaded onto the CPU and execute, one instruction at a time
    * Well, multiple at a time in a pipeline
* When we create a `thread`, we tell the CPU that we want to run a second thing
  * Start at a given function, share memory with the current program

---

## Threads

* A thread always shares memory with our program, even if we "detach" it
  * Detaching (`thrd_detach`) means we can no longer join the thread, so it becomes harder for our program to talk to it
    * And it is dangerous to access the same variable at the same time!
* In return, the thread can go off and do some work while we do something else
* In this example, we won't detach it, so we can wait for it with `thrd_join`

---

## Threads in C

* Functions and types defined in "threads.h"
  * [https://cppreference.com/w/c/header/threads.html](https://cppreference.com/w/c/header/threads.html)
* Call `thrd_create` to make a thread
  * We also pass it a function to run and the arguments to that function
  * It fills in an ID for that thread and returns a status code
* Call `thrd_join` to wait for a thread to finish

---

## thrd_create

* `int thrd_create( thrd_t *thr, thrd_start_t func, void *arg );`
  * The first argument is a pointer to fill in with the thread information
  * The second argument is a pointer to a function
  * The third is a pointer to the function's argument (of unknown type, so it is a `void*`)
* Not going to quiz you about the function, but threads are too important to ignore in modern CPUs

---

## Example

```C
#include <stdio.h>
#include <threads.h>

// A thread start function must have the type int (*)(void*).
int printingFunction(void* args) {
    // We have to know what type is being passed.
    size_t thread_id = *(size_t*)(args);
    printf("Thread %zu is running!\n", thread_id);
    return 0;
}

int main(void) {
    // Let's make some threads!
    thrd_t threads[10];
    size_t numbers[10];
    // Start some threads.
    for (int i = 0; i < 10; ++i) {
        numbers[i] = i;
        int result = thrd_create(&threads[i], printingFunction, &numbers[i]);
        if (result != thrd_success) {
            printf("Error creating thread %zu\n", numbers[i]);
        }
    }
    // Wait for all threads to finish.
    for (int i = 0; i < 10; ++i) {
        int res = thrd_join(threads[i], NULL);
        if (res != thrd_success) {
            printf("Error: thread %zu failed to join\n", numbers[i]);
        }
    }
    return 0;
}
```

---

## Output

<pre>
$ ./a.out
Thread 1 is running!
Thread 2 is running!
Thread 4 is running!
Thread 0 is running!
Thread 3 is running!
Thread 6 is running!
Thread 5 is running!
Thread 7 is running!
Thread 8 is running!
Thread 9 is running!
</pre>

---

## Thread Execution

* Notice that the order of the threads is different from when we started them
* Any synchronization is up to the programmer
* This makes threaded programming tricky

---

## Memory Safety

* There are some ways to coordinate between threads
  * But I'll leave those for your OS or parallel computing class
* Just know that we'll need to avoid having threads write to the same place
  * Reading is okay though

---

## Parallelizing Matrix Multiply

* Let's make a thread for each row
  * So the outer loop will go away
* Each thread operates on a row of the left matrix and output
  * That will replace the `r` variable and loop in the code

```c
size_t num_elem = left.height;
for (size_t r = 0; r < product.height; ++r) {
    for (size_t c = 0; c < product.width; ++c) {
        double sum = 0.0;
        for (size_t e = 0; e < num_elem; ++e) {
            sum += left.rows[r][e] * right.rows[e][c];
        }
        product.rows[r][c] = sum;
    }
}
return product;
```

---

## Avoid Contention

* The threads are going to all be writing simultaneously
* So we need to pick a version of matrix multiply that only writes into a single index of the destination at a time

```c
    size_t num_elem = left.height;
    for (size_t r = 0; r < product.height; ++r) {
        for (size_t c = 0; c < product.width; ++c) {
            double sum = 0.0;
            for (size_t e = 0; e < num_elem; ++e) {
                sum += left.rows[r][e] * right.rows[e][c];
            }
            product.rows[r][c] += sum;
        }
    }
```

---

## Contention Free

* Each thread will write to `product.rows[r][c]`
  * One thread per `r`
  * Each thread will loop over `c`
* Forget about locality for a moment

---

## Thread Function

```C
// Thread start functions must match thrd_start_t: int (*)(void*)
int sum_into_ce(void* void_args) {
    void ** args = (void**)void_args;
    // The row index; unused in the math, but handy when debugging
    size_t thread_id = (size_t)(args[0]);
    double* row = (double*)(args[1]);
    double* left_row = (double*)(args[2]);
    Matrix* right = (Matrix*)(args[3]);
    int width = *(int*)(args[4]);
    int height = *(int*)(args[5]);
    for (size_t c = 0; c < width; ++c) {
        // Accumulate locally so each output element is written exactly once
        double sum = 0.0;
        for (size_t e = 0; e < height; ++e) {
            sum += left_row[e] * right->rows[e][c];
        }
        row[c] = sum;
    }
    return 0;
}
```

---

## Calling Function

```c
Matrix dotProduct_thread_rce(Matrix left, Matrix right) {
    // If the sizes don't match, return a matrix with NULL data and 0 size.
    if (left.width != right.height) {
        Matrix empty = newMatrix(0, 0);
        return empty;
    }
    Matrix product = newMatrix(left.height, right.width);
    // Go row by row on the left and column by column on the right
    // Assume that all matrices are square, so the number of ops is either the height or the width
    size_t num_elem = left.height;
    size_t r, c, e;
    // We must send the outer loop to the threads, so the outer loop must be 'r'
    // Will create one thread per row
    thrd_t threads[product.height];
    void* args[product.height][6];
    for (size_t r = 0; r < product.height; ++r) {
        args[r][0] = (void*)r;
        args[r][1] = (void*)(product.rows[r]);
        args[r][2] = (void*)(left.rows[r]);
        args[r][3] = (void*)(&right);
        args[r][4] = (void*)(&product.width);
        args[r][5] = (void*)(&product.height);
        int res = thrd_create(&threads[r], (thrd_start_t)sum_into_ce, args[r]);
        if (res == thrd_error) {
            printf("Error: thread create failed on %lu\n", r);
        }
    }
    for (size_t r = 0; r < product.height; ++r) {
        int res = thrd_join(threads[r], NULL);
        if (res == thrd_error) {
            printf("Error: thread %lu failed to join\n", r);
        }
    }
    return product;
}
```

---

### CPU Time

Slower? Sure, in total CPU time. The random access is causing cache misses.

---

## Wall Time

But the pipeline is so full, we run faster in wall time.

---

### CPU Time (Ryzen9)

---

## Wall Time (Ryzen9)

---

## Threading Vs Cache

* There is a tension between parallelism and locality
* It is harder to guess access patterns and prefetch when lots of threads hit the memory
  * Instructions coming from multiple places
  * Data accesses all over the place
* But if a thread is waiting for data to load, we can just load another thread
  * And if the threads share data, it should end up in the shared L3 cache

---

## Where is L3?

L3 isn't per-CPU, it's shared. Notice how it doesn't appear in the CPU block diagram.

<p style='font-size:13pt'>$^*$ Source: https://docs.amd.com/v/u/en-US/58455_1.00</p>

---

## L3 Cache

* In a modern, multi-core CPU the L3 cache is optimized for threading
* The L3 is populated by L2 victims (lines evicted from L2 to make room for new ones)
* In Zen5, L3 hits move the line back to L2 on writes or if hits are from a single core
  * Lines remain in L3 after reads or hits from multiple cores
* In addition, if there is an L2 and L3 miss, L3 will grab the data from another L2 if possible

<p style='font-size:13pt'>$^*$ Source: https://docs.amd.com/v/u/en-US/58455_1.00</p>

---

## Improving Thread Locality

* We can get the best of both worlds if we are clever
* Notice that reducing the number of threads improved cache miss rates
  * At least until the matrix was too large and the threads operated on distant data
    * Around 1250x1250 matrix
* How could we improve locality?
  * Maybe with blocking
  * Or how about transposing the matrix first?
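
One further option, not used in the code on these slides, is to give each thread a contiguous chunk of rows so that its accesses stay close together. Below is a minimal sketch of the index arithmetic only; `RowRange` and `rows_for_thread` are hypothetical names, and the thread creation itself is left out.

```c
#include <assert.h>
#include <stddef.h>

// A half-open range of rows: [begin, end)
typedef struct {
    size_t begin;
    size_t end;
} RowRange;

// Hypothetical helper: split `num_rows` rows into `num_threads` contiguous
// chunks whose sizes differ by at most one row.
RowRange rows_for_thread(size_t num_rows, size_t num_threads, size_t thread_index) {
    assert(num_threads > 0 && thread_index < num_threads);
    const size_t base = num_rows / num_threads;
    const size_t extra = num_rows % num_threads;  // the first `extra` threads get one more row
    RowRange range;
    range.begin = thread_index * base + (thread_index < extra ? thread_index : extra);
    range.end = range.begin + base + (thread_index < extra ? 1 : 0);
    return range;
}
```

With 10 rows and 4 threads, the chunks are rows 0-2, 3-5, 6-7, and 8-9. Each thread would then loop over its own `[begin, end)` range instead of handling a single row.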

---

## Threaded Transposition

```c
// Thread start functions must match thrd_start_t: int (*)(void*)
int sum_into_transposed(void* void_args) {
    void ** args = (void**)void_args;
    size_t thread_id = (size_t)(args[0]);
    double* row = (double*)(args[1]);
    double* left_row = (double*)(args[2]);
    // This must point to the transposed right matrix
    Matrix* right = (Matrix*)(args[3]);
    size_t width = *(int*)(args[4]);
    size_t elements = *(int*)(args[5]);
    for (size_t c = 0; c < width; ++c) {
        double sum = 0.0;
        for (size_t e = 0; e < elements; ++e) {
            // Unthreaded: sum += left.rows[r][e] * transpose_right.rows[c][e];
            sum += left_row[e] * right->rows[c][e];
        }
        // Unthreaded: product.rows[r][c] += sum;
        row[c] += sum;
    }
    return 0;
}

Matrix dotProduct_thread_transpose_rce(Matrix left, Matrix right) {
    // If the sizes don't match, return a matrix with NULL data and 0 size.
    if (left.width != right.height) {
        Matrix empty = newMatrix(0, 0);
        return empty;
    }
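    // max_threads and transpose_row are defined elsewhere (not shown here)
    thrd_t threads[max_threads];
    void* t_args[max_threads][5];
    // Make a transposed matrix, filling in one row per thread
    Matrix transpose_right = newMatrix(right.width, right.height);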
    for (size_t r = 0; r < right.height; r+=max_threads) {
        // Making this threaded: transpose_right.rows[r][c] = right.rows[c][r];
        // Start a new batch of threads
        for (size_t t = 0; t < max_threads && r+t<right.height; ++t) {
            t_args[t][0] = (void*)t;
            t_args[t][1] = (void*)(transpose_right.rows[r+t]);
            t_args[t][2] = (void*)(&right);
            t_args[t][3] = (void*)(&transpose_right.width);
            t_args[t][4] = (void*)(&r);
            int res = thrd_create(&threads[t], (thrd_start_t)transpose_row, t_args[t]);
            if (res == thrd_error) {
                printf("Error: thread create failed on %lu\n", r+t);
            }
        }
        // Wait for this batch of threads to finish
        for (size_t t = 0; t < max_threads && r+t<right.height; ++t) {
            int res = thrd_join(threads[t], NULL);
            if (res == thrd_error) {
                printf("Error: thread %lu failed to join\n", r+t);
            }
        }
    }
    Matrix product = newMatrix(left.height, right.width);
    // Now we can go across the rows of the right matrix rather than down the columns
    // Go row by row on the left and column by column on the right
    // Assume that all matrices are square, so the number of ops is either the height or the width
    void* args[max_threads][6];
    size_t num_elem = left.height;
    for (size_t r = 0; r < product.height; r+=max_threads) {
        // Start a new batch of threads
        for (size_t t = 0; t < max_threads && r+t<product.height; ++t) {
            args[t][0] = (void*)t;
            args[t][1] = (void*)(product.rows[r+t]);
            args[t][2] = (void*)(left.rows[r+t]);
            // The thread function indexes rows[c][e], so it needs the transposed matrix
            args[t][3] = (void*)(&transpose_right);
            args[t][4] = (void*)(&product.width);
            args[t][5] = (void*)(&product.height);
            int res = thrd_create(&threads[t], (thrd_start_t)sum_into_transposed, args[t]);
            if (res == thrd_error) {
                printf("Error: thread create failed on %lu\n", r+t);
            }
        }
        // Wait for this batch of threads to finish
        for (size_t t = 0; t < max_threads && r+t<product.height; ++t) {
            int res = thrd_join(threads[t], NULL);
            if (res == thrd_error) {
                printf("Error: thread %lu failed to join\n", r+t);
            }
        }
    }
    // We are done with the transposed matrix
    freeMatrix(transpose_right);
    return product;
}

```

---

## Threaded Transpose CPU

---

## Threaded Transpose Wall

---

## Threaded Transpose CPU (Ryzen9)

---

## Threaded Transpose Wall (Ryzen9)

---

## Threads and Multicore

* Desktop and datacenter architectures are all multicore
* To get value from them
  * Run multiple processes
  * Run multithreaded programs

---

## Parallelism

* It's come up before!
* Speedups from longer pipelines eventually reached their limits
* Then architectures became more parallel

---

## Not Just GPUs

* Currently, GPUs are the obvious example of parallel computation
* However, CPUs also have parallel features
  * L3 came out after multi-core chips
  * Some instructions are explicitly parallel
* We'll go over some more examples next time