typedef struct BigMem BigMem;
struct BigMem {
char important_stuff;
char some_cruft[254];
char other_stuff;
};
int main(int argc, char** argv) {
if (argc < 3) {
printf("This program expects a number of elements and a number of loops.\n");
printf("With an L1 cache size of 32KB, 125 BigMems fit.\n");
printf("With an L2 cache size of 512KB, 2000 BigMems fit.\n");
printf("With an L3 cache size of 32MB, 125000 BigMems fit.\n");
return 0;
}
int units = atoi(argv[1]);
int loops = atoi(argv[2]);
if (units < 1 || loops < 1) {
printf("This program needs positive values.\n");
}
// 32 KB in the L1 cache, 512 KB in L2
// BigMem is 256 B, so 125 fit into L1, 2000 fit into L2, 125000 fit into L3
// (for some CPUs)
BigMem* data = calloc(units, sizeof(BigMem));
for (int i = 0; i < loops; ++i) {
for (int j = 0; j < units; ++j) {
// Twiddle the bits
data[j].important_stuff ^= 1;
data[j].other_stuff ^= 1;
}
}
free(data);
return 0;
}
```
---
## Quick Tests
* Does cacheing work?
* Yes. A second loop over 10 million elements is barely slower than 1 loop
$ time ./lec19_struct_cache 10000000 1
real 0m2.030s
user 0m0.164s
sys 0m1.866s
$ time ./lec19_struct_cache 10000000 2
real 0m2.186s
user 0m0.340s
sys 0m1.846s
---
## Cache Time
* How much of that time is taken up by cache misses?
* It's easy to test
* Just make a single element and loop over it 10 million times
---
## Cache Time
* About 98.5% of the time was cache stuff
* Ouch!
$ time ./lec19_struct_cache 1 10000000
real 0m0.032s
user 0m0.030s
sys 0m0.001s
---
## Prefetching
* I'd like to show more optimizations
* but the CPU foils my attempts with `prefetching`
* Prefetching is when the CPU anticipates what data will be read next
* Generally just reads more ahead of what you are currently accessing
* So if your memory access is totally random then prefetching won't help
---
## More Prefetching
* Zen 4 has five prefecthing strategies:
* L1 Stream: assumes sequential lines are accessed
* L1 Stride: looks for constant offsets in access patterns
* L1 Region: looks for locally correlated access patterns
* L2 Stream: assumes sequential lines are accessed
* L2 Up/Down: uses history to predict access of the next or previous line
---
## Optimizing for an Architecture
* The cache size, prefetching, etc are set per architecture
* Tricks that work on one may not work on another
* But companies publish guides to make this easier:
* [AMD Zen4 Software Optimization Guide](https://docs.amd.com/v/u/en-US/57647)
---
## The Memory Mountain
* The CS:APP authors made an interesting visualization
* 3D plot of read throughput vs the stride and size of a memory block
* Stride means how many elements we skip between reads
* Code here: [https://www.cs.cmu.edu/afs/cs/academic/class/15213-f06/www/code/mem/](https://www.cs.cmu.edu/afs/cs/academic/class/15213-f06/www/code/mem/)
---
### Code Snippet
```C
int data[MAXELEMS]; /* The array we'll be traversing */
void test(int elems, int stride) /* The test function */
{
int result = 0;
volatile int sink;
for (int i = 0; i < elems; i += stride) {
result += data[i];
}
sink = result; /* So compiler doesn't optimize away the loop */
}
```
---
## Core i7 Mountain
---
## Gallery
* The authors have a gallery of other results:
* [https://csappbook.blogspot.com/2017/05/a-gallery-of-memory-mountains.html](https://csappbook.blogspot.com/2017/05/a-gallery-of-memory-mountains.html)
* Modern CPUs do automatic frequency scaling, automatic parallelism, etc
* Their method may not give neat results on a modern machine
* Thanks stride-aware prefetching!
* Just a warning that you always need a way to measure performance
* But the lessons hold true
---
## Practical Advice
* So do you have to measure everything? Are there no general rules?
* There are a few of things to avoid
* Pointers to pointers to pointers...
* Overstuffed classes
* If statements
* Easiest optimization is memory usage in loops
---
## Detailed example: matrix multiply
* You did a matrix dot product in hw 3
* It is a loop and data access heavy operation
* For each $c_{ij}$, need to loop over rows of a and columns of b
---
## Matrix Multiply
```C
// Dot product of two matrices
Matrix dotProduct(Matrix left, Matrix right) {
// If the sizes don't match, return a matrix with NULL data and 0 size.
if (left.width != right.height) {
Matrix empty = newMatrix(0, 0);
return empty;
}
Matrix product = newMatrix(left.height, right.width);
// Go row by row on the left and column by column on the right
// Assume that all matrices are square, so the number of ops is either the height or the width
size_t num_elem = left.height;
for (size_t lrow = 0; lrow < product.height; ++lrow) {
for (size_t rcolumn = 0; rcolumn < product.width; ++rcolumn) {
for (size_t elem = 0; elem < num_elem; ++elem) {
product.rows[lrow][rcolumn] += left.rows[lrow][elem] * right.rows[elem][rcolumn];
}
}
}
return product;
}
```
---
## Timing
* Let's do some timing tests
* For simplicity, let's rename the row, column, and element counter $r, c, e$
```C
for (size_t r = 0; r < product.height; ++r) {
for (size_t c = 0; c < product.width; ++c) {
for (size_t e = 0; e < num_elem; ++e) {
product.rows[r][c] += left.rows[r][e] * right.rows[e][c];
}
}
}
```
---
## Alternatives
* Notice that the loops over $r, c, e$ can be in any order
* The multiplication statement inside of the loop remains the same
```C
for (size_t e = 0; e < num_elem; ++e) {
for (size_t r = 0; r < product.height; ++r) {
for (size_t c = 0; c < product.width; ++c) {
product.rows[r][c] += left.rows[r][e] * right.rows[e][c];
}
}
}
```
---
## Access Optimization
* We will also change the innermost loop to access the destination matrix a single time
* Sum into a register first, then write to the matrix
* We can only do this when r and c are the outter loops
```C
for (size_t c = 0; c < product.width; ++c) {
for (size_t r = 0; r < product.height; ++r) {
double sum = 0.0;
for (size_t e = 0; e < num_elem; ++e) {
sum += left.rows[r][e] * right.rows[e][c];
}
product.rows[r][c] = sum;
}
}
```
---
## Options
* How many options are there?
* $3\times 2 \times 1=6$
* rce, rec, cre, cer, erc, ecr
* So let's make 6 functions and a program to test them
---
## Access Patterns
* Each destination cell is accessed once
* Each source is accessed n-times
* Where n is the width or height
* (since these are square matrices)
* So there should be locality in the source access patterns
* Locality in the destination depends up the write-back mechanism
---
## Row Vs Column
* Data in a row is consectutive
* So there is spatial locality to exploit
* Traversing a column means skipping n elements
* A stride prefetcher may be able to figure it out
---
## Innermost loop
* Which ordering do we expect is best?
* The indices on the innermost loop change the most
* e and c access the rows of the matrices
* Since the rows are spatially local, they may be best on the inside
sum += left.rows[r][e] * right.rows[e][c];
---
## Outermost loop
* The column accesses are controlled by r and e
* So putting them on the outside means that they only change n times
* That's the best we could do
sum += left.rows[r][e] * right.rows[e][c];
---
## Writing
* r outside of c looks better
* But the performance will depend upon the write-through or write-back behavior
product.rows[r][c] = sum;
---
## Prediction
* e or c on the inside to exploit locality
* r or e on the outside to avoid large strides
* maybe r outside of c for writing
* So rec or erc will be faster?
* And r on the outside is clearly worst: cer or ecr
---
## Timing
* We'll use functions in time.h for timing
* We'll record both cpu time and wall time
* Wall means the time that passes for you
* CPU time could be more, if the CPU splits processing over multiple cores
---
## Code
```c
#include
#include
#include
#include "hw3_matrix_lib.h"
// Converts the timespect to milliseconds
typedef struct timespec timespec;
double timespecToMs(timespec t) {
return 1000.0 * t.tv_sec + 1e-6 * t.tv_nsec;
}
int main(int argc, char** argv) {
if (argc != 3) {
printf("Provide an option for which loop to run. Options are 1-6.\n");
printf("Provide a matrix size greater than or equal to 10.\n");
return 0;
}
int opt = atoi(argv[1]);
if (opt < 1 || opt > 6) {
printf("Provide an option for which loop to run. Options are 1-6.\n");
return 0;
}
int size = atoi(argv[2]);
if (size < 10) {
printf("Provide a matrix size greater than or equal to 10.\n");
return 0;
}
// This is a function type. Don't worry about it.
typedef Matrix (*mm_func)(Matrix, Matrix);
mm_func mmul = NULL;
switch (opt) {
case 1:
mmul = dotProduct_rce;
printf("Testing rce\n");
break;
case 2:
mmul = dotProduct_rec;
printf("Testing rec\n");
break;
case 3:
mmul = dotProduct_cre;
printf("Testing cre\n");
break;
case 4:
mmul = dotProduct_cer;
printf("Testing cer\n");
break;
case 5:
mmul = dotProduct_erc;
printf("Testing erc\n");
break;
case 6:
mmul = dotProduct_ecr;
printf("Testing ecr\n");
break;
}
Matrix m1 = newMatrix(size, size);
Matrix m2 = newMatrix(size, size);
for (int i = 0; i < size*size; ++i) {
// Fill the matrices with random numbers. Be sure to avoid dividing by 0.
m1.data[i] = rand()/ ((double)rand()+1);
m2.data[i] = rand()/ ((double)rand()+1);
}
// Set up timers and run
struct timespec tw_begin;
clock_gettime(CLOCK_MONOTONIC, &tw_begin);
struct timespec ts_begin;
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts_begin);
Matrix out = mmul(m1, m2);
struct timespec tw_end;
clock_gettime(CLOCK_MONOTONIC, &tw_end);
struct timespec ts_end;
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts_end);
double posix_wall = timespecToMs(tw_end) - timespecToMs(tw_begin);
double cpu_time = timespecToMs(ts_end) - timespecToMs(ts_begin);
// Print the result
switch (opt) {
case 1:
printf("rce wall time used for size %i is %.2f ms\n", size, posix_wall);
printf("rce CPU time used for size %i is %.2f ms\n", size, cpu_time);
break;
case 2:
printf("rec wall time used for size %i is %.2f ms\n", size, posix_wall);
printf("rec CPU time used for size %i is %.2f ms\n", size, cpu_time);
break;
case 3:
printf("cre wall time used for size %i is %.2f ms\n", size, posix_wall);
printf("cre CPU time used for size %i is %.2f ms\n", size, cpu_time);
break;
case 4:
printf("cer wall time used for size %i is %.2f ms\n", size, posix_wall);
printf("cer CPU time used for size %i is %.2f ms\n", size, cpu_time);
break;
case 5:
printf("erc wall time used for size %i is %.2f ms\n", size, posix_wall);
printf("erc CPU time used for size %i is %.2f ms\n", size, cpu_time);
break;
case 6:
printf("ecr wall time used for size %i is %.2f ms\n", size, posix_wall);
printf("ecr CPU time used for size %i is %.2f ms\n", size, cpu_time);
break;
}
freeMatrix(m1);
freeMatrix(m2);
freeMatrix(out);
return 0;
}
```
---
## Matrix functions
```c
Matrix dotProduct_rce(Matrix left, Matrix right) {
// If the sizes don't match, return a matrix with NULL data and 0 size.
if (left.width != right.height) {
Matrix empty = newMatrix(0, 0);
return empty;
}
Matrix product = newMatrix(left.height, right.width);
// Go row by row on the left and column by column on the right
// Assume that all matrices are square, so the number of ops is either the height or the width
size_t num_elem = left.height;
for (size_t r = 0; r < product.height; ++r) {
for (size_t c = 0; c < product.width; ++c) {
double sum = 0.0;
for (size_t e = 0; e < num_elem; ++e) {
sum += left.rows[r][e] * right.rows[e][c];
}
product.rows[r][c] = sum;
}
}
return product;
}
Matrix dotProduct_rec(Matrix left, Matrix right) {
// If the sizes don't match, return a matrix with NULL data and 0 size.
if (left.width != right.height) {
Matrix empty = newMatrix(0, 0);
return empty;
}
Matrix product = newMatrix(left.height, right.width);
// Go row by row on the left and column by column on the right
// Assume that all matrices are square, so the number of ops is either the height or the width
size_t num_elem = left.height;
for (size_t r = 0; r < product.height; ++r) {
for (size_t e = 0; e < num_elem; ++e) {
for (size_t c = 0; c < product.width; ++c) {
product.rows[r][c] += left.rows[r][e] * right.rows[e][c];
}
}
}
return product;
}
Matrix dotProduct_cre(Matrix left, Matrix right) {
// If the sizes don't match, return a matrix with NULL data and 0 size.
if (left.width != right.height) {
Matrix empty = newMatrix(0, 0);
return empty;
}
Matrix product = newMatrix(left.height, right.width);
// Go row by row on the left and column by column on the right
// Assume that all matrices are square, so the number of ops is either the height or the width
size_t num_elem = left.height;
for (size_t c = 0; c < product.width; ++c) {
for (size_t r = 0; r < product.height; ++r) {
double sum = 0.0;
for (size_t e = 0; e < num_elem; ++e) {
sum += left.rows[r][e] * right.rows[e][c];
}
product.rows[r][c] = sum;
}
}
return product;
}
Matrix dotProduct_cer(Matrix left, Matrix right) {
// If the sizes don't match, return a matrix with NULL data and 0 size.
if (left.width != right.height) {
Matrix empty = newMatrix(0, 0);
return empty;
}
Matrix product = newMatrix(left.height, right.width);
// Go row by row on the left and column by column on the right
// Assume that all matrices are square, so the number of ops is either the height or the width
size_t num_elem = left.height;
for (size_t c = 0; c < product.width; ++c) {
for (size_t e = 0; e < num_elem; ++e) {
for (size_t r = 0; r < product.height; ++r) {
product.rows[r][c] += left.rows[r][e] * right.rows[e][c];
}
}
}
return product;
}
Matrix dotProduct_erc(Matrix left, Matrix right) {
// If the sizes don't match, return a matrix with NULL data and 0 size.
if (left.width != right.height) {
Matrix empty = newMatrix(0, 0);
return empty;
}
Matrix product = newMatrix(left.height, right.width);
// Go row by row on the left and column by column on the right
// Assume that all matrices are square, so the number of ops is either the height or the width
size_t num_elem = left.height;
for (size_t e = 0; e < num_elem; ++e) {
for (size_t r = 0; r < product.height; ++r) {
for (size_t c = 0; c < product.width; ++c) {
product.rows[r][c] += left.rows[r][e] * right.rows[e][c];
}
}
}
return product;
}
Matrix dotProduct_ecr(Matrix left, Matrix right) {
// If the sizes don't match, return a matrix with NULL data and 0 size.
if (left.width != right.height) {
Matrix empty = newMatrix(0, 0);
return empty;
}
Matrix product = newMatrix(left.height, right.width);
// Go row by row on the left and column by column on the right
// Assume that all matrices are square, so the number of ops is either the height or the width
size_t num_elem = left.height;
for (size_t e = 0; e < num_elem; ++e) {
for (size_t c = 0; c < product.width; ++c) {
for (size_t r = 0; r < product.height; ++r) {
product.rows[r][c] += left.rows[r][e] * right.rows[e][c];
}
}
}
return product;
}
Matrix dotProduct_block(Matrix left, Matrix right) {
// If the sizes don't match, return a matrix with NULL data and 0 size.
if (left.width != right.height) {
Matrix empty = newMatrix(0, 0);
return empty;
}
size_t block = 64 / sizeof(double);
Matrix product = newMatrix(left.height, right.width);
// Go row by row on the left and column by column on the right
// Assume that all matrices are square, so the number of ops is either the height or the width
size_t num_elem = left.height;
for (size_t r = 0; r < product.height; r+=block) {
for (size_t c = 0; c < product.width; c+=block) {
for (size_t e = 0; e < num_elem; e+=block) {
// B x B block matrix multiplications
for (size_t r1 = r; r < r+block; ++r1) {
for (size_t c1 = c; c < c+block; ++c1) {
for (size_t e1 = e; e < e+block; ++e1) {
product.rows[r][c] += left.rows[r][e] * right.rows[e][c];
}
}
}
}
}
}
return product;
}
```
---
## Making Data
$ for s in `seq 50 50 950` `seq 1000 100 1400`; do for e in `seq 1 6`; do ./a.out $e $s >> mm_results.dat; done; done
---
## Results
---
## Interpretation
* Break into 3 distinct groups
* 'r' on the inside is the worst, as predicted
* Jumps a column n times in the inner loop
* 'c' on the inside is the best, as predicted
* Exploits row locality for one source matrix and the destination
* 'rce' is in the middle, but 'cre' is somehow as good as 'erc' and 'rec'
* This is probably a strided prefetch on the source matrices from 'r' and 'e'
---
## CS:app version
---
## Explanation
* The Core i7 probably had worse prefetching
* The differences on the i7 are greater than on the Ryzen 3600
* So a bigger L3 and L2 cache does help
* Notice how smooth the lines are on the newer CPU as well
---
## Ryzen 9
---
## Ryzen 9 Explanation
* 'c' on the inside is still best, as predicted
* 'cre' still surprisingly good, but not as good as 'c' on the inside
* You can see now that 'rce' is similar to 'cre', but there's a jump as matrix size 1200
* It corresponds to step from matrix size of 92MB to 108MB, or row size 76KB to 83KB
---
## Further Optimization
* There is another way to speed things up, called *blocking*
* But it would require more tuning to specific hardware
* And may not actually help!
---
## Blocking
* Add another triple loop inside of the outer one
* Operate on blocks of memory from the three arrays
* If we choose the right block size for the CPU, it can prefetch efficiently into each cache
---
## Blocking
```C
Matrix dotProduct_block(Matrix left, Matrix right) {
// If the sizes don't match, return a matrix with NULL data and 0 size.
if (left.width != right.height) {
Matrix empty = newMatrix(0, 0);
return empty;
}
//size_t block = 64 / sizeof(double);
size_t block = 128;
Matrix product = newMatrix(left.height, right.width);
// Go row by row on the left and column by column on the right
// Assume that all matrices are square, so the number of ops is either the height or the width
size_t num_elem = left.height;
size_t r, c, e;
for (r = 0; r < product.height; r+=block) {
for (c = 0; c < product.width; c+=block) {
for (e = 0; e < num_elem; e+=block) {
// B x B block matrix multiplications
for (size_t r1 = r; r1 < r+block; ++r1) {
for (size_t c1 = c; c1 < c+block; ++c1) {
double sum = 0;
for (size_t e1 = e; e1 < e+block; ++e1) {
sum += left.rows[r][e] * right.rows[e][c];
}
product.rows[r][c] += sum;
}
}
}
}
}
// Anything we missed
for (; r < product.height; ++r) {
for (; e < num_elem; ++e) {
for (; c < product.width; ++c) {
product.rows[r][c] += left.rows[r][e] * right.rows[e][c];
}
}
}
return product;
}
```
---
## Results
---
## Block Results
* The CPU is doing a good enough job prefetching that our blocking isn't generally helping
* But there are signs that it could, once the matrix gets really large
* You can see that the blocks sit at some performance level sometimes
* That is a sign that they aren't filling the cache for a few steps
* So if we tuned the block size for each matrix size, we could squeeze out more perf
---
## Ryzen 9
---
## Ryzen 9 Block Size
* Clearly our block size is wrong for a different CPU
* So you can't just blindly take code from one system and toss it onto another
* But hopefully you already knew that!
---
## Applying lessons
* As a group, you will work on many different hardware platforms
* Desktop, mobile, embedded
* Intel, AMD, Arm, etc
* CPU, GPU, Asic
* You will have to read a bit of documentation
* It's a good skill to have
---
## Takeaways
* So what should you optimize?
* The common case
* Most widely used data structures
* Algorithms that go through the most data
* Sometimes you can transform an algorithm to make it more efficient
* Such as blocking
* This is very different from the big-O notation that you saw in algorithms