# CS 211 - Lecture 19

## Computer Architecture

### Cache Details

Bernhard Firner

2025-11-06

---

## Review

* Cache anatomy
  * Block: the fixed-size packet of information
  * Line: the block and associated information
    * Valid bit, tag, dirty bit, etc
  * Set: A group of lines

---

## Associativity

* Direct-mapped: each set holds a single line
* n-way set associative: n lines per set
* Fully Associative: A single set with all lines

---

## Trade-offs

* A direct-mapped cache is simple and fast, but will have many collisions
* A fully associative cache is expensive, but minimizes collisions
  * Only practical when the total size is small

---

## Cache Misses

* `cold miss`: When the valid bit is unset (also called *compulsory miss*)
* `capacity miss`: When our cache is full and the miss couldn't be avoided
  * For example, you are looping repeatedly through a 1MB array on a 64KB cache
* `conflict miss`: When the program has used the cache poorly
  * There is available space, but locality has not been exploited to use it

---

## Eviction

* In misses other than a cold miss, the other memory must be removed
* In a direct-mapped cache there is no choice
* In an associative cache, there are options
  * random, least recently used, least frequently used

---

## Writing

* *Write-through*
  * Write immediately to the underlying memory
* *Write-back*
  * Defer writing until the block is evicted
  * This means that we need to add a dirty bit to mark modifications

---

## Write-misses

* *Write-allocate*
  * Load into cache and write to it there
  * Used with the write-back strategy
* *No-write-allocate*
  * Same as write through, send the new value directly to memory

---

## Cache Types

* A *d-cache* holds data
* An *i-cache* holds instructions
* A *unified cache* holds both
* It makes sense to handle instructions and data differently at the CPU, but less sense farther away

---

## Performance

<table>
<tr> <hr><td>Cache</td><td>Access Time (cycles)</td><td>Cache Size (C)</td><td> Assoc. (E)</td><td>Block size (B)</td><td>Sets (S)</td></hr></tr>
<tr> <td> L1 i-cache</td><td> 4   </td><td> 32 KB    </td><td> 8 </td><td> 64 B  </td><td> 64 </td></tr>
<tr> <td> L1 d-cache</td><td> 4   </td><td> 32 KB    </td><td> 8 </td><td> 64 B  </td><td> 64 </td></tr>
<tr> <td> L2 unified cache</td><td> 10   </td><td> 256 KB    </td><td> 8 </td><td> 64 B  </td><td> 512 </td></tr>
<tr> <td> L3 unified cache</td><td> 40-75   </td><td> 8 MB    </td><td> 16 </td><td> 64 B  </td><td> 8192 </td></tr>
<tr> <td> Main memory</td><td> 200   </td><td>     </td><td>  </td><td>       </td><td>      </td></tr>
</table>

<p style='font-size:13pt'>$^*$ According to CS:App, Core i7 cache</p>

---

## Updates

* Cache has changed over time
  * AMD switched L1 from write-through to write-back in 2017
    * Probably because increased density allow for the more complex option
  * Intel now uses heterogenous cores (performance and efficiency), with different cache types

---

## AMD Zen 4 (2023)

<table>
<tr> <hr><td>Cache</td><td>Access Time (cycles)</td><td>Cache Size (C)</td><td> Assoc. (E)</td><td>Block size (B)</td><td>Sets (S)</td></hr></tr>
<tr> <td> Op cache</td><td> 1 (9 ops/cycle)    </td><td> 6.75 K ops    </td><td> 12 </td><td> 9 macro ops  </td><td> 64 </td></tr>
<tr> <td> L1 i-cache</td><td> 4               </td><td> 32 KB    </td><td> 8 </td><td> 64 B  </td><td> 64 </td></tr>
<tr> <td> L1 d-cache</td><td> 4-5 int, 7-8 fp </td><td> 32 KB    </td><td> 8 </td><td> 64 B  </td><td> 64 </td></tr>
<tr> <td> L2 unified cache</td><td> 14   </td><td> 1 MB    </td><td> 8 </td><td> 64 B  </td><td> 2048 </td></tr>
<tr> <td> L3 unified cache</td><td> 50   </td><td> 96 MB    </td><td> 16 </td><td> 64 B  </td><td> 98304 </td></tr>
</table>

<p style='font-size:13pt'>$^*$ Source: [AMD Zen4 Software Optimization Guide](https://docs.amd.com/v/u/en-US/57647)</p>

---

## AMD Zen 4 Diagram

<p style='font-size:13pt'>$^*$ Source: [AMD Zen4 Software Optimization Guide](https://docs.amd.com/v/u/en-US/57647)</p>

---

## Stability

* Notice that L1 is very stable from old to new generations
  * Drastic changes are difficult to fit
  * There are more of them, since desktop chips are multicore
    * Interestingly, latest ARM chips have 64KB L1, but fewer cores
  * Parallelism chosen before overpowered single chips
* Even if the L1 cache size doubled, your job doesn't change much

---

## Locality

* So you need to know how to exploit locality
  * But the best way to exploit it can change

---

## Example

```c
#include <stdio.h>
#include <stdlib.h>

typedef struct BigMem BigMem;
struct BigMem {
    char important_stuff;
    char some_cruft[254];
    char other_stuff;
};

int main(int argc, char** argv) {
    if (argc < 3) {
        printf("This program expects a number of elements and a number of loops.\n");
        printf("With an L1 cache size of 32KB, 125 BigMems fit.\n");
        printf("With an L2 cache size of 512KB, 2000 BigMems fit.\n");
        printf("With an L3 cache size of 32MB, 125000 BigMems fit.\n");
        return 0;
    }
    int units = atoi(argv[1]);
    int loops = atoi(argv[2]);
    if (units < 1 || loops < 1) {
        printf("This program needs positive values.\n");
    }

// 32 KB in the L1 cache, 512 KB in L2
    // BigMem is 256 B, so 125 fit into L1, 2000 fit into L2, 125000 fit into L3
    // (for some CPUs)
    BigMem* data = calloc(units, sizeof(BigMem));

for (int i = 0; i < loops; ++i) {
        for (int j = 0; j < units; ++j) {
            // Twiddle the bits
            data[j].important_stuff ^= 1;
            data[j].other_stuff ^= 1;
        }
    }

free(data);

return 0;
}
```

---

## Quick Tests

* Does cacheing work?
  * Yes. A second loop over 10 million elements is barely slower than 1 loop

<pre>
$ time ./lec19_struct_cache 10000000 1

real	0m2.030s
user	0m0.164s
sys	0m1.866s
$ time ./lec19_struct_cache 10000000 2

real	0m2.186s
user	0m0.340s
sys	0m1.846s
</pre>

---

## Cache Time

* How much of that time is taken up by cache misses?
* It's easy to test
  * Just make a single element and loop over it 10 million times

---

## Cache Time

* About 98.5% of the time was cache stuff
  * Ouch!

<pre>
$ time ./lec19_struct_cache 1 10000000

real	0m0.032s
user	0m0.030s
sys	0m0.001s
</pre>

---

## Prefetching

* I'd like to show more optimizations
  * but the CPU foils my attempts with `prefetching`
* Prefetching is when the CPU anticipates what data will be read next
  * Generally just reads more ahead of what you are currently accessing
  * So if your memory access is totally random then prefetching won't help

---

## More Prefetching

* Zen 4 has five prefecthing strategies:
  * L1 Stream: assumes sequential lines are accessed
  * L1 Stride: looks for constant offsets in access patterns
  * L1 Region: looks for locally correlated access patterns
  * L2 Stream: assumes sequential lines are accessed
  * L2 Up/Down: uses history to predict access of the next or previous line

---

## Optimizing for an Architecture

* The cache size, prefetching, etc are set per architecture
  * Tricks that work on one may not work on another
* But companies publish guides to make this easier:
  * [AMD Zen4 Software Optimization Guide](https://docs.amd.com/v/u/en-US/57647)

---

## The Memory Mountain

* The CS:APP authors made an interesting visualization
  * 3D plot of read throughput vs the stride and size of a memory block
  * Stride means how many elements we skip between reads
* Code here: [https://www.cs.cmu.edu/afs/cs/academic/class/15213-f06/www/code/mem/](https://www.cs.cmu.edu/afs/cs/academic/class/15213-f06/www/code/mem/)

---

### Code Snippet

```C
int data[MAXELEMS];         /* The array we'll be traversing */

void test(int elems, int stride) /* The test function */
{
    int result = 0;
    volatile int sink;

for (int i = 0; i < elems; i += stride) {
        result += data[i];
    }
    sink = result; /* So compiler doesn't optimize away the loop */
}
```

---

## Core i7 Mountain

---

## Gallery

* The authors have a gallery of other results:
  * [https://csappbook.blogspot.com/2017/05/a-gallery-of-memory-mountains.html](https://csappbook.blogspot.com/2017/05/a-gallery-of-memory-mountains.html)
* Modern CPUs do automatic frequency scaling, automatic parallelism, etc
  * Their method may not give neat results on a modern machine
    * Thanks stride-aware prefetching!
  * Just a warning that you always need a way to measure performance
  * But the lessons hold true

---

## Practical Advice

* So do you have to measure everything? Are there no general rules?
* There are a few of things to avoid
  * Pointers to pointers to pointers...
  * Overstuffed classes
  * If statements
* Easiest optimization is memory usage in loops

---

## Detailed example: matrix multiply

* You did a matrix dot product in hw 3
* It is a loop and data access heavy operation
* For each $c_{ij}$, need to loop over rows of a and columns of b

<p><math display="block" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mrow><mo stretchy="true" form="prefix">[</mo><mtable><mtr><mtd columnalign="center" style="text-align: center"><msub><mi>c</mi><mn>11</mn></msub></mtd><mtd columnalign="center" style="text-align: center"><msub><mi>c</mi><mn>12</mn></msub></mtd></mtr><mtr><mtd columnalign="center" style="text-align: center"><msub><mi>c</mi><mn>21</mn></msub></mtd><mtd columnalign="center" style="text-align: center"><msub><mi>c</mi><mn>22</mn></msub></mtd></mtr></mtable><mo stretchy="true" form="postfix">]</mo></mrow><mo>=</mo><mrow><mo stretchy="true" form="prefix">[</mo><mtable><mtr><mtd columnalign="center" style="text-align: center"><msub><mi>a</mi><mn>11</mn></msub></mtd><mtd columnalign="center" style="text-align: center"><msub><mi>a</mi><mn>12</mn></msub></mtd></mtr><mtr><mtd columnalign="center" style="text-align: center"><msub><mi>a</mi><mn>21</mn></msub></mtd><mtd columnalign="center" style="text-align: center"><msub><mi>a</mi><mn>22</mn></msub></mtd></mtr></mtable><mo stretchy="true" form="postfix">]</mo></mrow><mrow><mo stretchy="true" form="prefix">[</mo><mtable><mtr><mtd columnalign="center" style="text-align: center"><msub><mi>b</mi><mn>11</mn></msub></mtd><mtd columnalign="center" style="text-align: center"><msub><mi>b</mi><mn>12</mn></msub></mtd></mtr><mtr><mtd columnalign="center" style="text-align: center"><msub><mi>b</mi><mn>21</mn></msub></mtd><mtd columnalign="center" style="text-align: center"><msub><mi>b</mi><mn>22</mn></msub></mtd></mtr></mtable><mo stretchy="true" form="postfix">]</mo></mrow></mrow><annotation encoding="application/x-tex">
\begin{bmatrix}
c_{11} & c_{12} \\
c_{21} & c_{22}
\end{bmatrix}
=
\begin{bmatrix}
a_{11} & a_{12} \\
a_{21} & a_{22}
\end{bmatrix}
\begin{bmatrix}
b_{11} & b_{12} \\
b_{21} & b_{22}
\end{bmatrix}</annotation></semantics></math></p>

---

## Matrix Multiply

```C
// Dot product of two matrices
Matrix dotProduct(Matrix left, Matrix right) {
    // If the sizes don't match, return a matrix with NULL data and 0 size.
    if (left.width != right.height) {
        Matrix empty = newMatrix(0, 0);
        return empty;
    }
    Matrix product = newMatrix(left.height, right.width);
    // Go row by row on the left and column by column on the right
    // Assume that all matrices are square, so the number of ops is either the height or the width
    size_t num_elem = left.height;
    for (size_t lrow = 0; lrow < product.height; ++lrow) {
        for (size_t rcolumn = 0; rcolumn < product.width; ++rcolumn) {
            for (size_t elem = 0; elem < num_elem; ++elem) {
                product.rows[lrow][rcolumn] += left.rows[lrow][elem] * right.rows[elem][rcolumn];
            }
        }
    }
    return product;
}
```

---

## Timing

* Let's do some timing tests
* For simplicity, let's rename the row, column, and element counter $r, c, e$

```C
    for (size_t r = 0; r < product.height; ++r) {
        for (size_t c = 0; c < product.width; ++c) {
            for (size_t e = 0; e < num_elem; ++e) {
                product.rows[r][c] += left.rows[r][e] * right.rows[e][c];
            }
        }
    }
```

---

## Alternatives

* Notice that the loops over $r, c, e$ can be in any order
  * The multiplication statement inside of the loop remains the same

```C
    for (size_t e = 0; e < num_elem; ++e) {
        for (size_t r = 0; r < product.height; ++r) {
            for (size_t c = 0; c < product.width; ++c) {
                product.rows[r][c] += left.rows[r][e] * right.rows[e][c];
            }
        }
    }
```

---

## Access Optimization

* We will also change the innermost loop to access the destination matrix a single time
  * Sum into a register first, then write to the matrix
  * We can only do this when r and c are the outter loops

```C
    for (size_t c = 0; c < product.width; ++c) {
        for (size_t r = 0; r < product.height; ++r) {
            double sum = 0.0;
            for (size_t e = 0; e < num_elem; ++e) {
                sum += left.rows[r][e] * right.rows[e][c];
            }
            product.rows[r][c] = sum;
        }
    }
```

---

## Options

* How many options are there?
  * $3\times 2 \times 1=6$
  * rce, rec, cre, cer, erc, ecr
* So let's make 6 functions and a program to test them

---

## Access Patterns

* Each destination cell is accessed once
* Each source is accessed n-times
  * Where n is the width or height
    * (since these are square matrices)
  * So there should be locality in the source access patterns
  * Locality in the destination depends up the write-back mechanism

---

## Row Vs Column

* Data in a row is consectutive
  * So there is spatial locality to exploit
* Traversing a column means skipping n elements
  * A stride prefetcher may be able to figure it out

---

## Innermost loop

* Which ordering do we expect is best?
* The indices on the innermost loop change the most
* e and c access the rows of the matrices
  * Since the rows are spatially local, they may be best on the inside

<pre>
sum += left.rows[r][e] * right.rows[e][c];
</pre>

---

## Outermost loop

* The column accesses are controlled by r and e
* So putting them on the outside means that they only change n times
  * That's the best we could do

<pre>
sum += left.rows[r][e] * right.rows[e][c];
</pre>

---

## Writing

* r outside of c looks better
* But the performance will depend upon the write-through or write-back behavior

<pre>
product.rows[r][c] = sum;
</pre>

---

## Prediction

* e or c on the inside to exploit locality
* r or e on the outside to avoid large strides
* maybe r outside of c for writing
* So rec or erc will be faster?
* And r on the outside is clearly worst: cer or ecr

---

## Timing

* We'll use functions in time.h for timing
  * We'll record both cpu time and wall time
  * Wall means the time that passes for you
  * CPU time could be more, if the CPU splits processing over multiple cores

---

## Code

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#include "hw3_matrix_lib.h"

// Converts the timespect to milliseconds
typedef struct timespec timespec;
double timespecToMs(timespec t) {
    return 1000.0 * t.tv_sec + 1e-6 * t.tv_nsec;
}

int main(int argc, char** argv) {

if (argc != 3) {
        printf("Provide an option for which loop to run. Options are 1-6.\n");
        printf("Provide a matrix size greater than or equal to 10.\n");
        return 0;
    }
    int opt = atoi(argv[1]);
    if (opt < 1 || opt > 6) {
        printf("Provide an option for which loop to run. Options are 1-6.\n");
        return 0;
    }
    int size = atoi(argv[2]);
    if (size < 10) {
        printf("Provide a matrix size greater than or equal to 10.\n");
        return 0;
    }
    // This is a function type. Don't worry about it.
    typedef Matrix (*mm_func)(Matrix, Matrix);
    mm_func mmul = NULL;
    switch (opt) {
        case 1:
            mmul = dotProduct_rce;
            printf("Testing rce\n");
            break;
        case 2:
            mmul = dotProduct_rec;
            printf("Testing rec\n");
            break;
        case 3:
            mmul = dotProduct_cre;
            printf("Testing cre\n");
            break;
        case 4:
            mmul = dotProduct_cer;
            printf("Testing cer\n");
            break;
        case 5:
            mmul = dotProduct_erc;
            printf("Testing erc\n");
            break;
        case 6:
            mmul = dotProduct_ecr;
            printf("Testing ecr\n");
            break;
    }

Matrix m1 = newMatrix(size, size);
    Matrix m2 = newMatrix(size, size);
    for (int i = 0; i < size*size; ++i) {
        // Fill the matrices with random numbers. Be sure to avoid dividing by 0.
        m1.data[i] = rand()/ ((double)rand()+1);
        m2.data[i] = rand()/ ((double)rand()+1);
    }

// Set up timers and run
    struct timespec tw_begin;
    clock_gettime(CLOCK_MONOTONIC, &tw_begin);
    struct timespec ts_begin;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts_begin);

Matrix out = mmul(m1, m2);

struct timespec tw_end;
    clock_gettime(CLOCK_MONOTONIC, &tw_end);
    struct timespec ts_end;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts_end);

double posix_wall = timespecToMs(tw_end) - timespecToMs(tw_begin);
    double cpu_time = timespecToMs(ts_end) - timespecToMs(ts_begin);
    // Print the result
    switch (opt) {
        case 1:
            printf("rce wall time used for size %i is %.2f ms\n", size, posix_wall);
            printf("rce CPU time used for size %i is %.2f ms\n", size, cpu_time);
            break;
        case 2:
            printf("rec wall time used for size %i is %.2f ms\n", size, posix_wall);
            printf("rec CPU time used for size %i is %.2f ms\n", size, cpu_time);
            break;
        case 3:
            printf("cre wall time used for size %i is %.2f ms\n", size, posix_wall);
            printf("cre CPU time used for size %i is %.2f ms\n", size, cpu_time);
            break;
        case 4:
            printf("cer wall time used for size %i is %.2f ms\n", size, posix_wall);
            printf("cer CPU time used for size %i is %.2f ms\n", size, cpu_time);
            break;
        case 5:
            printf("erc wall time used for size %i is %.2f ms\n", size, posix_wall);
            printf("erc CPU time used for size %i is %.2f ms\n", size, cpu_time);
            break;
        case 6:
            printf("ecr wall time used for size %i is %.2f ms\n", size, posix_wall);
            printf("ecr CPU time used for size %i is %.2f ms\n", size, cpu_time);
            break;
    }

freeMatrix(m1);
    freeMatrix(m2);
    freeMatrix(out);

return 0;
}

```

---

## Matrix functions

```c
Matrix dotProduct_rce(Matrix left, Matrix right) {
    // If the sizes don't match, return a matrix with NULL data and 0 size.
    if (left.width != right.height) {
        Matrix empty = newMatrix(0, 0);
        return empty;
    }
    Matrix product = newMatrix(left.height, right.width);
    // Go row by row on the left and column by column on the right
    // Assume that all matrices are square, so the number of ops is either the height or the width
    size_t num_elem = left.height;
    for (size_t r = 0; r < product.height; ++r) {
        for (size_t c = 0; c < product.width; ++c) {
            double sum = 0.0;
            for (size_t e = 0; e < num_elem; ++e) {
                sum += left.rows[r][e] * right.rows[e][c];
            }
            product.rows[r][c] = sum;
        }
    }
    return product;
}
Matrix dotProduct_rec(Matrix left, Matrix right) {
    // If the sizes don't match, return a matrix with NULL data and 0 size.
    if (left.width != right.height) {
        Matrix empty = newMatrix(0, 0);
        return empty;
    }
    Matrix product = newMatrix(left.height, right.width);
    // Go row by row on the left and column by column on the right
    // Assume that all matrices are square, so the number of ops is either the height or the width
    size_t num_elem = left.height;
    for (size_t r = 0; r < product.height; ++r) {
        for (size_t e = 0; e < num_elem; ++e) {
            for (size_t c = 0; c < product.width; ++c) {
                product.rows[r][c] += left.rows[r][e] * right.rows[e][c];
            }
        }
    }
    return product;
}
Matrix dotProduct_cre(Matrix left, Matrix right) {
    // If the sizes don't match, return a matrix with NULL data and 0 size.
    if (left.width != right.height) {
        Matrix empty = newMatrix(0, 0);
        return empty;
    }
    Matrix product = newMatrix(left.height, right.width);
    // Go row by row on the left and column by column on the right
    // Assume that all matrices are square, so the number of ops is either the height or the width
    size_t num_elem = left.height;
    for (size_t c = 0; c < product.width; ++c) {
        for (size_t r = 0; r < product.height; ++r) {
            double sum = 0.0;
            for (size_t e = 0; e < num_elem; ++e) {
                sum += left.rows[r][e] * right.rows[e][c];
            }
            product.rows[r][c] = sum;
        }
    }
    return product;
}
Matrix dotProduct_cer(Matrix left, Matrix right) {
    // If the sizes don't match, return a matrix with NULL data and 0 size.
    if (left.width != right.height) {
        Matrix empty = newMatrix(0, 0);
        return empty;
    }
    Matrix product = newMatrix(left.height, right.width);
    // Go row by row on the left and column by column on the right
    // Assume that all matrices are square, so the number of ops is either the height or the width
    size_t num_elem = left.height;
    for (size_t c = 0; c < product.width; ++c) {
        for (size_t e = 0; e < num_elem; ++e) {
            for (size_t r = 0; r < product.height; ++r) {
                product.rows[r][c] += left.rows[r][e] * right.rows[e][c];
            }
        }
    }
    return product;
}
Matrix dotProduct_erc(Matrix left, Matrix right) {
    // If the sizes don't match, return a matrix with NULL data and 0 size.
    if (left.width != right.height) {
        Matrix empty = newMatrix(0, 0);
        return empty;
    }
    Matrix product = newMatrix(left.height, right.width);
    // Go row by row on the left and column by column on the right
    // Assume that all matrices are square, so the number of ops is either the height or the width
    size_t num_elem = left.height;
    for (size_t e = 0; e < num_elem; ++e) {
        for (size_t r = 0; r < product.height; ++r) {
            for (size_t c = 0; c < product.width; ++c) {
                product.rows[r][c] += left.rows[r][e] * right.rows[e][c];
            }
        }
    }
    return product;
}
Matrix dotProduct_ecr(Matrix left, Matrix right) {
    // If the sizes don't match, return a matrix with NULL data and 0 size.
    if (left.width != right.height) {
        Matrix empty = newMatrix(0, 0);
        return empty;
    }
    Matrix product = newMatrix(left.height, right.width);
    // Go row by row on the left and column by column on the right
    // Assume that all matrices are square, so the number of ops is either the height or the width
    size_t num_elem = left.height;
    for (size_t e = 0; e < num_elem; ++e) {
        for (size_t c = 0; c < product.width; ++c) {
            for (size_t r = 0; r < product.height; ++r) {
                product.rows[r][c] += left.rows[r][e] * right.rows[e][c];
            }
        }
    }
    return product;
}

Matrix dotProduct_block(Matrix left, Matrix right) {
    // If the sizes don't match, return a matrix with NULL data and 0 size.
    if (left.width != right.height) {
        Matrix empty = newMatrix(0, 0);
        return empty;
    }
    size_t block = 64 / sizeof(double);
    Matrix product = newMatrix(left.height, right.width);
    // Go row by row on the left and column by column on the right
    // Assume that all matrices are square, so the number of ops is either the height or the width
    size_t num_elem = left.height;
    for (size_t r = 0; r < product.height; r+=block) {
        for (size_t c = 0; c < product.width; c+=block) {
            for (size_t e = 0; e < num_elem; e+=block) {
                // B x B block matrix multiplications
                for (size_t r1 = r; r < r+block; ++r1) {
                    for (size_t c1 = c; c < c+block; ++c1) {
                        for (size_t e1 = e; e < e+block; ++e1) {
                            product.rows[r][c] += left.rows[r][e] * right.rows[e][c];
                        }
                    }
                }
            }
        }
    }
    return product;
}
```

---

## Making Data

<pre>
$ for s in `seq 50 50 950` `seq 1000 100 1400`; do for e in `seq 1 6`; do ./a.out $e $s >> mm_results.dat; done; done
</pre>

---

## Results

---

## Interpretation

* Break into 3 distinct groups
* 'r' on the inside is the worst, as predicted
  * Jumps a column n times in the inner loop
* 'c' on the inside is the best, as predicted
  * Exploits row locality for one source matrix and the destination
* 'rce' is in the middle, but 'cre' is somehow as good as 'erc' and 'rec'
  * This is probably a strided prefetch on the source matrices from 'r' and 'e'

---

## CS:app version

---

## Explanation

* The Core i7 probably had worse prefetching
* The differences on the i7 are greater than on the Ryzen 3600
  * So a bigger L3 and L2 cache does help
    * Notice how smooth the lines are on the newer CPU as well

---

## Ryzen 9

---

## Ryzen 9 Explanation

* 'c' on the inside is still best, as predicted
* 'cre' still surprisingly good, but not as good as 'c' on the inside
* You can see now that 'rce' is similar to 'cre', but there's a jump as matrix size 1200
  * It corresponds to step from matrix size of 92MB to 108MB, or row size 76KB to 83KB

---

## Further Optimization

* There is another way to speed things up, called *blocking*
  * But it would require more tuning to specific hardware
  * And may not actually help!

---

## Blocking

* Add another triple loop inside of the outer one
  * Operate on blocks of memory from the three arrays
  * If we choose the right block size for the CPU, it can prefetch efficiently into each cache

---

## Blocking

```C
Matrix dotProduct_block(Matrix left, Matrix right) {
    // If the sizes don't match, return a matrix with NULL data and 0 size.
    if (left.width != right.height) {
        Matrix empty = newMatrix(0, 0);
        return empty;
    }
    //size_t block = 64 / sizeof(double);
    size_t block = 128;
    Matrix product = newMatrix(left.height, right.width);
    // Go row by row on the left and column by column on the right
    // Assume that all matrices are square, so the number of ops is either the height or the width
    size_t num_elem = left.height;
    size_t r, c, e;
    for (r = 0; r < product.height; r+=block) {
        for (c = 0; c < product.width; c+=block) {
            for (e = 0; e < num_elem; e+=block) {
                // B x B block matrix multiplications
                for (size_t r1 = r; r1 < r+block; ++r1) {
                    for (size_t c1 = c; c1 < c+block; ++c1) {
                        double sum = 0;
                        for (size_t e1 = e; e1 < e+block; ++e1) {
                            sum += left.rows[r][e] * right.rows[e][c];
                        }
                        product.rows[r][c] += sum;
                    }
                }
            }
        }
    }
    // Anything we missed
    for (; r < product.height; ++r) {
        for (; e < num_elem; ++e) {
            for (; c < product.width; ++c) {
                product.rows[r][c] += left.rows[r][e] * right.rows[e][c];
            }
        }
    }
    return product;
}
```

---

## Results

---

## Block Results

* The CPU is doing a good enough job prefetching that our blocking isn't generally helping
  * But there are signs that it could, once the matrix gets really large
* You can see that the blocks sit at some performance level sometimes
  * That is a sign that they aren't filling the cache for a few steps
  * So if we tuned the block size for each matrix size, we could squeeze out more perf

---

## Ryzen 9

---

## Ryzen 9 Block Size

* Clearly our block size is wrong for a different CPU
* So you can't just blindly take code from one system and toss it onto another
  * But hopefully you already knew that!

---

## Applying lessons

* As a group, you will work on many different hardware platforms
  * Desktop, mobile, embedded
  * Intel, AMD, Arm, etc
  * CPU, GPU, Asic
* You will have to read a bit of documentation
  * It's a good skill to have

---

## Takeaways

* So what should you optimize?
  * The common case
    * Most widely used data structures
    * Algorithms that go through the most data
* Sometimes you can transform an algorithm to make it more efficient
  * Such as blocking
  * This is very different from the big-O notation that you saw in algorithms