# CS 211 - Lecture 21

## Computer Architecture

### Data Parallelism

Bernhard Firner

2026-04-13

---

## Review (Again!)

* Read chapter 6 in the book!
* Cache anatomy
  * Block: the fixed-size packet of information
  * Line: the block and associated information
    * Valid bit, tag, dirty bit, etc
  * Set: A group of lines

<div hidden>
<img style="width: 100%" class="r-stretch" src="./figures/lec21_vmulpd_figure.png" />
<img style="width: 1%" class="r-stretch" src="./figures/mm_results_ryzen5_final_loops.png" />
<img style="width: 1%" class="r-stretch" src="./figures/mm_results_ryzen5_final_transpose.png" />
<img style="width: 1%" class="r-stretch" src="./figures/mm_results_ryzen5_final_transpose_threads.png" />
<img style="width: 1%" class="r-stretch" src="./figures/mm_results_ryzen5_final_transpose_threads_wall.png" />
<img style="width: 1%" class="r-stretch" src="./figures/mm_results_ryzen5_final_transpose_threads_simd.png" />
<img style="width: 1%" class="r-stretch" src="./figures/mm_results_ryzen5_final_transpose_threads_simd_wall.png" />
<img style="width: 1%" class="r-stretch" src="./figures/mm_results_ryzen5_final_comparison_wall.png" />
<img style="width: 1%" class="r-stretch" src="./figures/mm_results_ryzen5_final_comparison_wall_log.png" />
<img style="width: 1%" class="r-stretch" src="./figures/mm_results_ryzen5_final_transpose_threads_wall_O2.png" />
<img style="width: 1%" class="r-stretch" src="./figures/mm_results_ryzen5_final_comparison_wall_O2.png" />
<img style="width: 1%" class="r-stretch" src="./figures/mm_results_ryzen5_final_comparison_wall_log_O2.png" />
</div>

---

## Cache Misses

* `cold miss`: When the valid bit is unset (also called *compulsory miss*)
* `capacity miss`: When our cache is full and the miss couldn't be avoided
  * For example, you are looping repeatedly through a 1MB array on a 64KB cache
* `conflict miss`: When blocks compete for the same set and evict each other
  * There is available space elsewhere in the cache, but the access pattern has not exploited locality to use it

---

## Cereal Analogy

* There are millions of boxes of cereal at the factory
  * This is like your disk, which is crammed full of data
* How does the cereal get to the consumer?

---

## High Bandwidth, High Latency

* Once a month, 100,000 boxes ship via train to a distribution center
* Once a week, a truck takes 1,000 boxes and delivers them to your store
* A pallet sits in the back, and only 10 boxes are taken to the shelf
* If you want a box of cereal, you go into the store, go to the location of the shelf, and grab a box

---

## A Cache Miss

* Let's say you ask for Dr. Firner's Bits and Bytes
  * The store owner has never heard of it
* This is a cold, or compulsory, miss
  * There was nothing the store owner could do
* A new order will take a week if the cereal is at the distribution center, or a month if the center also needs to order it
  * That latency is our *miss penalty*

---

## Cache Benefits

* Let's say that a different store had gotten a request for Bits and Bytes
  * That request would have already populated the distribution center
* This is analogous to the L3 cache, which is shared between multiple cores
  * In this way, multithreaded programs can actually reduce cache misses on other cores

---

## Prefetching

* The store manager tracks the rate at which new cereal needs to be ordered
  * Rather than waiting for all of the cereal to run out, future demand is anticipated
  * As long as cereal is ordered a week in advance from the distribution center, no customer will go without cereal
* Similarly, the distribution center also tracks usage and orders more train shipments
* Each level of cache in your CPU also does its own prefetching

---

## CPU Prefetching

* CPU prefetching is a bit different from cereal, since data isn't all the same
* Instead, the addresses of data are used to predict how they will be used
* This is called data locality
  * Respecting how the CPU prefetches data allows us to write better programs

---

## Matrix Multiplication

* Example of a (possibly) cache intensive function
* Went from a 25 second 1500x1500 matrix multiply to 2 seconds
  * Transposed a matrix first to change locality
  * Then used threads (but only six) to gain speedup
    * This was as fast as a thread per row, but far more efficient!

---

## Initial Versions

---

## CPU Time

---

## Wall Time

---

## Locality

* Locality matters
  * Changing the loop order or transposing the matrix reduced L1 misses
    * 32KB L2 cache wasn't being pushed
* Threads shared L3 cache
  * On an L2 miss, L3 will try to find the missing data in another core's L2 cache
  * We actually saw improved locality with threads on one CPU

---

## Prefetching

* L1 is 32KB
  * A 1500x1500 array of 8-byte doubles is around 18MB
  * More than 500 times too large for L1
  * Far too large for L2 as well
* To avoid constant cache misses, we need successful prefetching
  * But the CPU only understands access patterns that follow locality

---

## Threading and Locality

* Threading, with a ton of threads, usually has bad locality
  * Like a swarm of uncoordinated bees
* But, if we plan things out, we can vastly reduce contention between threads
  * Even leverage prefetching from one thread to help another thread
* In the end, that led to the best version of our multiply algorithm

---

## Other types of parallelism

* We created parallelism in the instructions with our threads
* There is another kind of parallelism
  * We remember SIMD, right?
    * single instruction, multiple data
  * Also called vector processing

---

## Advantages

* We reduce the number of instructions
  * Denser code also leads to fewer cache misses
  * Replaces multiple loads, adds, multiplies, xors, etc
* Can also eliminate loops
* The hardware is optimized for them

---

## SIMD Review

* We saw these before, but let's review again briefly
* GPU instructions are similar, so we need to be sure this idea makes sense

---

## The Big Idea

* `vmulp` (vector multiply) operates over multiple values in the registers
  * `vmulps` for single precision, `vmulpd` for double
* `vmulpd %ymm0, %ymm1, %ymm2` (AT&T syntax, destination last)
  * Multiplies each value in ymm0 by the matching value in ymm1 and stores the products in ymm2
  * This could allow us to leave either ymm0 or ymm1 unaltered
    * Which we've seen in our matrix multiply loops

---

## Graphically

---

## Moving the Data

* We obviously don't want to have multiple load instructions for each array element
  * And the ymm registers have a single name but hold multiple values
* So there are also vector load instructions
  * `vmovups` and `vmovupd` for single and double precision
* And addition
  * `vaddps` and `vaddpd`

---

## Important Details

* Remember that we can drop assembly into our C code
  * `__asm__` and then the assembly in strings
  * % means a user-provided variable, and %% refers to an actual register
* Also, the first three arguments passed to a function use known registers
  * %rdi, %rsi, and %rdx
  * Assuming they're the size of a pointer

---

## Code

```c
#include <stdio.h>
#include <stdlib.h>

// Note that we are going to assume that these each have four doubles.
void showMultiply(double a[4], double b[4], double c[4]) {
    // Going to take advantage of the fact that the arguments are in rdi, rsi, and rdx by default.
    // Don't turn on compiler optimizations.
    // If those registers are going to be used, then we should store our arguments on the stack:
    /*
        movq    %rdi, -8(%rbp)
        movq    %rsi, -16(%rbp)
        movq    %rdx, -24(%rbp)
    */
    // (gcc actually does that in any case)
    // But if we only use assembly in this function, they should be safe to use.
    // The double %% indicates a register, a single % means a user-provided variable.
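    __asm__(
            "vmovupd (%%rdi), %%ymm0\n\t"
            "vmovupd (%%rsi), %%ymm1\n\t"
            "vmulpd %%ymm0, %%ymm1, %%ymm2\n\t"
            "vmovupd %%ymm2, (%%rdx)\n\t"
            : // No outputs (the result is written through the pointer in rdx)
            : // No inputs (we assume rdi, rsi, and rdx still hold a, b, and c)
            : "ymm0", "ymm1", "ymm2", "memory" // Registers and memory this block changes
           );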
    return;
}

int main(void) {
    double a[4] = {3.0, 2.0, 3.0, 4.0};
    double b[4] = {4.0, -2.0, 3.0, -1.0};
    double c[4] = {0.0};

    showMultiply(a, b, c);

    printf("%lf, %lf, %lf, %lf\n", a[0], a[1], a[2], a[3]);
    printf("%lf, %lf, %lf, %lf\n", b[0], b[1], b[2], b[3]);
    printf("%lf, %lf, %lf, %lf\n", c[0], c[1], c[2], c[3]);

    return 0;
}
```

---

## Results

<pre>
$ gcc lec21_show_vmulp.c
$ ./a.out 4
3.000000, 2.000000, 3.000000, 4.000000
4.000000, -2.000000, 3.000000, -1.000000
12.000000, -4.000000, 9.000000, -4.000000
</pre>

Wow, so cool!

---

## Options

* We want to use this to speed up matrix multiplication
* Each output value comes from the multiplication of a row and a column
  * Isn't this already parallel with threads?

---

## Loading with Threads

* Loading 4 values from the two source matrices involves 8 mov instructions
  * Let's say that they are all hits
* We saw before that the latency of any mov was 7-8 clock cycles for floating point values
  * Even if they are all hits!
* A single SIMD instruction will still pay that penalty, but will load 4 values at a time

---

## Combine with Threads

* Doing 4 rows and columns simultaneously may overwhelm our bandwidth
  * How would we calculate that?
    * Compare with latency for 2 vmovupd loads from memory, 1 vmovupd back to memory, 1 vmulpd, 1 vaddpd
* We already have the threaded transpose code
  * Changing to 1/4 the instructions should at least speed up that loading

---

## Threaded Code

```c
void* asm_sum_into_transposed(void* void_args) {
    void ** args = (void**)void_args;
    size_t thread_id = (size_t)(args[0]);
    double* row = (double*)(args[1]);
    double* left_row = (double*)(args[2]);
    Matrix* right = (Matrix*)(args[3]);
    size_t width = (size_t)(args[4]);
    size_t elements = (size_t)(args[5]);
    for (size_t c = 0; c < width; ++c) {
        // Sum into register ymm2
        // Anything xor with itself is 0.
        __asm__ (
                "vxorpd %%ymm2, %%ymm2, %%ymm2\n\t"
                : // No outputs
                : // No inputs
                : "ymm2" // Tell gcc to stay away from this register (if we turned on optimizations)
               );
        // Use <= so the last full group of four is included when elements is a multiple of 4
        for (size_t e = 0; e+4 <= elements; e+=4) {
            // Unthreaded: sum += left.rows[r][e] * transpose_right.rows[c][e];
            // Turn this next instruction into assembly
            // sum += left_row[e] * right->rows[c][e];
            // Note that we could try to preload registers, but it's safer to use them in the instruction.
            //register double* l asm ("r10") = left_row[e];
            //register double* r asm ("r11") = right->rows[c][e];
            __asm__ (
                    // %0 and %1 will come from left_row[e] and right->rows[c][e]
                    "vmovupd (%0), %%ymm0\n\t"
                    "vmovupd (%1), %%ymm1\n\t"
                    "vmulpd %%ymm0, %%ymm1, %%ymm1\n\t"
                    "vaddpd %%ymm1, %%ymm2, %%ymm2\n\t"
                    : // No outputs (we assume rax is safe)
                    : "r" (left_row+e), "r" (right->rows[c]+e)
                    : "ymm0", "ymm1", "ymm2"
                   );
        }
        double block[4];
        __asm__ (
                // %0 is the address of block
                "vmovupd %%ymm2, (%0)\n\t"
                :
                : "r" (block)
                : "memory" // This writes to block through the pointer
               );
        // Finish whatever didn't fit into our 4 double registers
        double sum = 0.0;
        size_t left = elements%4;
        for (size_t e = elements-left; e < elements; ++e) {
            // Unthreaded: sum += left.rows[r][e] * transpose_right.rows[c][e];
            sum += left_row[e] * right->rows[c][e];
        }
        // Unthreaded: product.rows[r][c] += sum;
        row[c] = sum + block[0] + block[1] + block[2] + block[3];
    }
    return NULL;
}
```

---

## Final Results

* Let's look at all of our results
* And then compare them to the compiler with -O2!
* First, here's the final (debugged) code

---

## Final Code (matrix)

```c
Matrix dotProduct_rce(Matrix left, Matrix right) {
    // If the sizes don't match, return a matrix with NULL data and 0 size.
    if (left.width != right.height) {
        Matrix empty = newMatrix(0, 0);
        return empty;
    }
    Matrix product = newMatrix(left.height, right.width);
    // Go row by row on the left and column by column on the right
    // Assume that all matrices are square, so the number of ops is either the height or the width
    size_t num_elem = left.height;
    for (size_t r = 0; r < product.height; ++r) {
        for (size_t c = 0; c < product.width; ++c) {
            double sum = 0.0;
            for (size_t e = 0; e < num_elem; ++e) {
                sum += left.rows[r][e] * right.rows[e][c];
            }
            product.rows[r][c] += sum;
        }
    }
    return product;
}
Matrix dotProduct_rec(Matrix left, Matrix right) {
    // If the sizes don't match, return a matrix with NULL data and 0 size.
    if (left.width != right.height) {
        Matrix empty = newMatrix(0, 0);
        return empty;
    }
    Matrix product = newMatrix(left.height, right.width);
    // Go row by row on the left and column by column on the right
    // Assume that all matrices are square, so the number of ops is either the height or the width
    size_t num_elem = left.height;
    for (size_t r = 0; r < product.height; ++r) {
        for (size_t e = 0; e < num_elem; ++e) {
            for (size_t c = 0; c < product.width; ++c) {
                product.rows[r][c] += left.rows[r][e] * right.rows[e][c];
            }
        }
    }
    return product;
}
Matrix dotProduct_cre(Matrix left, Matrix right) {
    // If the sizes don't match, return a matrix with NULL data and 0 size.
    if (left.width != right.height) {
        Matrix empty = newMatrix(0, 0);
        return empty;
    }
    Matrix product = newMatrix(left.height, right.width);
    // Go row by row on the left and column by column on the right
    // Assume that all matrices are square, so the number of ops is either the height or the width
    size_t num_elem = left.height;
    for (size_t c = 0; c < product.width; ++c) {
        for (size_t r = 0; r < product.height; ++r) {
            double sum = 0.0;
            for (size_t e = 0; e < num_elem; ++e) {
                sum += left.rows[r][e] * right.rows[e][c];
            }
            product.rows[r][c] += sum;
        }
    }
    return product;
}
Matrix dotProduct_cer(Matrix left, Matrix right) {
    // If the sizes don't match, return a matrix with NULL data and 0 size.
    if (left.width != right.height) {
        Matrix empty = newMatrix(0, 0);
        return empty;
    }
    Matrix product = newMatrix(left.height, right.width);
    // Go row by row on the left and column by column on the right
    // Assume that all matrices are square, so the number of ops is either the height or the width
    size_t num_elem = left.height;
    for (size_t c = 0; c < product.width; ++c) {
        for (size_t e = 0; e < num_elem; ++e) {
            for (size_t r = 0; r < product.height; ++r) {
                product.rows[r][c] += left.rows[r][e] * right.rows[e][c];
            }
        }
    }
    return product;
}
Matrix dotProduct_erc(Matrix left, Matrix right) {
    // If the sizes don't match, return a matrix with NULL data and 0 size.
    if (left.width != right.height) {
        Matrix empty = newMatrix(0, 0);
        return empty;
    }
    Matrix product = newMatrix(left.height, right.width);
    // Go row by row on the left and column by column on the right
    // Assume that all matrices are square, so the number of ops is either the height or the width
    size_t num_elem = left.height;
    for (size_t e = 0; e < num_elem; ++e) {
        for (size_t r = 0; r < product.height; ++r) {
            for (size_t c = 0; c < product.width; ++c) {
                product.rows[r][c] += left.rows[r][e] * right.rows[e][c];
            }
        }
    }
    return product;
}
Matrix dotProduct_ecr(Matrix left, Matrix right) {
    // If the sizes don't match, return a matrix with NULL data and 0 size.
    if (left.width != right.height) {
        Matrix empty = newMatrix(0, 0);
        return empty;
    }
    Matrix product = newMatrix(left.height, right.width);
    // Go row by row on the left and column by column on the right
    // Assume that all matrices are square, so the number of ops is either the height or the width
    size_t num_elem = left.height;
    for (size_t e = 0; e < num_elem; ++e) {
        for (size_t c = 0; c < product.width; ++c) {
            for (size_t r = 0; r < product.height; ++r) {
                product.rows[r][c] += left.rows[r][e] * right.rows[e][c];
            }
        }
    }
    return product;
}

size_t min(size_t a, size_t b) {
    if (a < b) {
        return a;
    }
    else {
        return b;
    }
}

Matrix dotProduct_block(Matrix left, Matrix right) {
    // If the sizes don't match, return a matrix with NULL data and 0 size.
    if (left.width != right.height) {
        Matrix empty = newMatrix(0, 0);
        return empty;
    }
    //size_t block = 64 / sizeof(double);
    size_t block = 128;
    Matrix product = newMatrix(left.height, right.width);
    // Go row by row on the left and column by column on the right
    // Assume that all matrices are square, so the number of ops is either the height or the width
    size_t num_elem = left.height;
    size_t r = 0;
    size_t c = 0;
    size_t e = 0;
    for (r = 0; r < product.height; r+=block) {
        for (c = 0; c < product.width; c+=block) {
            for (e = 0; e < num_elem; e+=block) {
                // B x B block matrix multiplications
                size_t r_end = min(r+block, product.height);
                size_t c_end = min(c+block, product.width);
                size_t e_end = min(e+block, num_elem);
                for (size_t r1 = r; r1 < r_end; ++r1) {
                    for (size_t c1 = c; c1 < c_end; ++c1) {
                        double sum = 0;
                        for (size_t e1 = e; e1 < e_end; ++e1) {
                            sum += left.rows[r1][e1] * right.rows[e1][c1];
                        }
                        product.rows[r1][c1] += sum;
                    }
                }
            }
        }
    }
    return product;
}

Matrix dotProduct_transpose_rce(Matrix left, Matrix right) {
    // If the sizes don't match, return a matrix with NULL data and 0 size.
    if (left.width != right.height) {
        Matrix empty = newMatrix(0, 0);
        return empty;
    }
    // Make a transposed matrix
    Matrix transpose_right = newMatrix(right.width, right.height);
    for (size_t i = 0; i < right.height; ++i) {
        for (size_t j = 0; j < right.width; ++j) {
            transpose_right.rows[j][i] = right.rows[i][j];
        }
    }
    Matrix product = newMatrix(left.height, right.width);
    // Now we can go across the rows of the right matrix rather than down the columns
    // Go row by row on the left and column by column on the right
    // Assume that all matrices are square, so the number of ops is either the height or the width
    size_t num_elem = left.height;
    for (size_t r = 0; r < product.height; ++r) {
        for (size_t c = 0; c < product.width; ++c) {
            double sum = 0.0;
            for (size_t e = 0; e < num_elem; ++e) {
                // This was previously sum += left.rows[r][e] * right.rows[e][c];
                // The transposition puts e into the second index
                sum += left.rows[r][e] * transpose_right.rows[c][e];
                // This is now the same as left.data[r*width+e] * transpose_right.data[c*width + e]
            }
            product.rows[r][c] += sum;
        }
    }
    // We are done with the transposed matrix
    freeMatrix(transpose_right);
    return product;
}

void* sum_into_ce(void* void_args) {
    void ** args = (void**)void_args;
    size_t thread_id = (size_t)(args[0]);
    double* row = (double*)(args[1]);
    double* left_row = (double*)(args[2]);
    Matrix* right = (Matrix*)(args[3]);
    size_t width = *(size_t*)(args[4]);
    size_t height = *(size_t*)(args[5]);
    for (size_t c = 0; c < width; ++c) {
        double sum = 0.0;
        for (size_t e = 0; e < height; ++e) {
            sum += left_row[e] * right->rows[e][c];
        }
        row[c] += sum;
    }
    return NULL;
}

Matrix dotProduct_thread_rce(Matrix left, Matrix right) {
    // If the sizes don't match, return a matrix with NULL data and 0 size.
    if (left.width != right.height) {
        Matrix empty = newMatrix(0, 0);
        return empty;
    }
    Matrix product = newMatrix(left.height, right.width);
    // Go row by row on the left and column by column on the right
    // Assume that all matrices are square, so the number of ops is either the height or the width
    size_t num_elem = left.height;
    // We must send the outer loop to the threads, so the outer loop must be 'r'
    // Will create one thread per row
    thrd_t threads[product.height];
    void* args[product.height][6];
    for (size_t r = 0; r < product.height; ++r) {
        args[r][0] = (void*)r;
        args[r][1] = (void*)(product.rows[r]);
        args[r][2] = (void*)(left.rows[r]);
        args[r][3] = (void*)(&right);
        args[r][4] = (void*)(&product.width);
        args[r][5] = (void*)(&product.height);
        int res = thrd_create(&threads[r], (thrd_start_t)sum_into_ce, args[r]);
        if (res == thrd_error) {
            printf("Error: thread create failed on %lu\n", r);
        }
    }
    for (size_t r = 0; r < product.height; ++r) {
        int res = thrd_join(threads[r], NULL);
        if (res == thrd_error) {
            printf("Error: thread %lu failed to join\n", r);
        }
    }
    return product;
}

size_t max_threads = 6;
Matrix dotProduct_thread_rce_max(Matrix left, Matrix right) {
    // If the sizes don't match, return a matrix with NULL data and 0 size.
    if (left.width != right.height) {
        Matrix empty = newMatrix(0, 0);
        return empty;
    }
    Matrix product = newMatrix(left.height, right.width);
    // Go row by row on the left and column by column on the right
    // Assume that all matrices are square, so the number of ops is either the height or the width
    size_t num_elem = left.height;
    // We must send the outer loop to the threads, so the outer loop must be 'r'
    // Will run batches of at most max_threads threads, one row per thread
    thrd_t threads[max_threads];
    void* args[max_threads][6];
    size_t r = 0;
    for (; r < product.height; r+=max_threads) {
        // Start a new batch of threads
        for (size_t t = 0; t < max_threads && r+t<product.height; ++t) {
            args[t][0] = (void*)t;
            args[t][1] = (void*)(product.rows[r+t]);
            args[t][2] = (void*)(left.rows[r+t]);
            args[t][3] = (void*)(&right);
            args[t][4] = (void*)(&product.width);
            args[t][5] = (void*)(&product.height);
            int res = thrd_create(&threads[t], (thrd_start_t)sum_into_ce, args[t]);
            if (res == thrd_error) {
                printf("Error: thread create failed on %lu\n", r+t);
            }
        }
        // Wait for this batch of threads to finish
        for (size_t t = 0; t < max_threads && r+t<product.height; ++t) {
            int res = thrd_join(threads[t], NULL);
            if (res == thrd_error) {
                printf("Error: thread %lu failed to join\n", r+t);
            }
        }
    }
    return product;
}

void* sum_into_re(void* void_args) {
    void ** args = (void**)void_args;
    size_t thread_id = (size_t)(args[0]);
    Matrix* product = (Matrix*)(args[1]);
    Matrix* left = (Matrix*)(args[2]);
    Matrix* right = (Matrix*)(args[3]);
    // The column index arrives by value, so each thread keeps its own copy.
    size_t c = (size_t)(args[4]);
    for (size_t r = 0; r < product->height; ++r) {
        double sum = 0.0;
        for (size_t e = 0; e < left->width; ++e) {
            sum += left->rows[r][e] * right->rows[e][c];
        }
        // Each thread writes only to its own column, so no mutex is needed here.
        product->rows[r][c] += sum;
    }
    return NULL;
}

Matrix dotProduct_thread_cre(Matrix left, Matrix right) {
    // If the sizes don't match, return a matrix with NULL data and 0 size.
    if (left.width != right.height) {
        Matrix empty = newMatrix(0, 0);
        return empty;
    }
    Matrix product = newMatrix(left.height, right.width);
    // Go column by column on the right; each thread computes one column of the product
    // We must send the outer loop to the threads, so the outer loop must be 'c'
    // Will create one thread per column
    thrd_t threads[product.width];
    void* args[product.width][5];
    for (size_t c = 0; c < product.width; ++c) {
        args[c][0] = (void*)c;
        args[c][1] = (void*)(&product);
        args[c][2] = (void*)(&left);
        args[c][3] = (void*)(&right);
        // Pass c by value; passing &c would race with the loop increment
        args[c][4] = (void*)c;
        int res = thrd_create(&threads[c], (thrd_start_t)sum_into_re, args[c]);
        if (res == thrd_error) {
            printf("Error: thread create failed on %lu\n", c);
        }
    }
    for (size_t c = 0; c < product.width; ++c) {
        int res = thrd_join(threads[c], NULL);
        if (res == thrd_error) {
            printf("Error: thread %lu failed to join\n", c);
        }
    }
    return product;
}

void* transpose_row(void* void_args) {
    void ** args = (void**)void_args;
    size_t thread_id = (size_t)(args[0]);
    double* row = (double*)(args[1]);
    Matrix* source = (Matrix*)(args[2]);
    size_t width = *(size_t*)(args[3]);
    size_t column = (size_t)(args[4]);
    // Copy from the column into the row
    for (size_t r = 0; r < width; ++r) {
        row[r] = source->rows[r][column];
    }
    return NULL;
}

void* sum_into_transposed(void* void_args) {
    void ** args = (void**)void_args;
    size_t thread_id = (size_t)(args[0]);
    double* row = (double*)(args[1]);
    double* left_row = (double*)(args[2]);
    Matrix* right = (Matrix*)(args[3]);
    size_t width = (size_t)(args[4]);
    size_t elements = (size_t)(args[5]);
    for (size_t c = 0; c < width; ++c) {
        double sum = 0.0;
        for (size_t e = 0; e < elements; ++e) {
            // Unthreaded: sum += left.rows[r][e] * transpose_right.rows[c][e];
            sum += left_row[e] * right->rows[c][e];
        }
        // Unthreaded: product.rows[r][c] += sum;
        row[c] += sum;
    }
    return NULL;
}

Matrix dotProduct_thread_transpose_rce(Matrix left, Matrix right) {
    // If the sizes don't match, return a matrix with NULL data and 0 size.
    if (left.width != right.height) {
        Matrix empty = newMatrix(0, 0);
        return empty;
    }
    // Make a transposed matrix
    thrd_t t_threads[right.height];
    void* t_args[max_threads][5];
    Matrix transpose_right = newMatrix(right.width, right.height);
    for (size_t r = 0; r < right.height; r+=max_threads) {
        // Making this threaded: transpose_right.rows[r][c] = right.rows[c][r];
        // Start a new batch of threads
        for (size_t t = 0; t < max_threads && r+t<right.height; ++t) {
            t_args[t][0] = (void*)t;
            t_args[t][1] = (void*)(transpose_right.rows[r+t]);
            t_args[t][2] = (void*)(&right);
            t_args[t][3] = (void*)(&transpose_right.width);
            t_args[t][4] = (void*)(r+t);
            int res = thrd_create(&t_threads[t], (thrd_start_t)transpose_row, t_args[t]);
            if (res == thrd_error) {
                printf("Error: thread create failed on %lu\n", r+t);
            }
        }
        // Wait for this batch of threads to finish
        for (size_t t = 0; t < max_threads && r+t<right.height; ++t) {
            int res = thrd_join(t_threads[t], NULL);
            if (res == thrd_error) {
                printf("Error: thread %lu failed to join\n", r+t);
            }
        }
    }
    Matrix product = newMatrix(left.height, right.width);
    // Now we can go across the rows of the right matrix rather than down the columns
    // Go row by row on the left and column by column on the right
    // Assume that all matrices are square, so the number of ops is either the height or the width
    void* args[max_threads][6];
    size_t num_elem = left.height;
    thrd_t threads[max_threads];
    void* mm_args[max_threads][5];
    for (size_t r = 0; r < right.height; r+=max_threads) {
        // Start a new batch of threads
        for (size_t t = 0; t < max_threads && r+t<product.height; ++t) {
            args[t][0] = (void*)t;
            args[t][1] = (void*)(product.rows[r+t]);
            args[t][2] = (void*)(left.rows[r+t]);
            args[t][3] = (void*)(&transpose_right);
            args[t][4] = (void*)(product.width);
            args[t][5] = (void*)(product.height);
            int res = thrd_create(&threads[t], (thrd_start_t)sum_into_transposed, args[t]);
            if (res == thrd_error) {
                printf("Error: thread create failed on %lu\n", r+t);
            }
        }
        // Wait for this batch of threads to finish
        for (size_t t = 0; t < max_threads && r+t<product.height; ++t) {
            int res = thrd_join(threads[t], NULL);
            if (res == thrd_error) {
                printf("Error: thread %lu failed to join\n", r+t);
            }
        }
    }
    // We are done with the transposed matrix
    freeMatrix(transpose_right);
    return product;
}

void* asm_sum_into_transposed(void* void_args) {
    void ** args = (void**)void_args;
    size_t thread_id = (size_t)(args[0]);
    double* row = (double*)(args[1]);
    double* left_row = (double*)(args[2]);
    Matrix* right = (Matrix*)(args[3]);
    size_t width = (size_t)(args[4]);
    size_t elements = (size_t)(args[5]);
    for (size_t c = 0; c < width; ++c) {
        // Sum into register ymm2
        // Anything xor with itself is 0.
        __asm__ (
                "vxorpd %%ymm2, %%ymm2, %%ymm2\n\t"
                : // No outputs
                : // No inputs
                : "ymm2" // Tell gcc to stay away from this register (if we turned on optimizations)
               );
        // Use <= so the last full group of four is included when elements is a multiple of 4
        for (size_t e = 0; e+4 <= elements; e+=4) {
            // Unthreaded: sum += left.rows[r][e] * transpose_right.rows[c][e];
            // Turn this next instruction into assembly
            // sum += left_row[e] * right->rows[c][e];
            // Note that we could try to preload registers, but it's safer to use them in the instruction.
            //register double* l asm ("r10") = left_row[e];
            //register double* r asm ("r11") = right->rows[c][e];
            __asm__ (
                    // %0 and %1 will come from left_row[e] and right->rows[c][e]
                    "vmovupd (%0), %%ymm0\n\t"
                    "vmovupd (%1), %%ymm1\n\t"
                    "vmulpd %%ymm0, %%ymm1, %%ymm1\n\t"
                    "vaddpd %%ymm1, %%ymm2, %%ymm2\n\t"
                    : // No outputs (we assume rax is safe)
                    : "r" (left_row+e), "r" (right->rows[c]+e)
                    : "ymm0", "ymm1", "ymm2"
                   );
        }
        double block[4];
        __asm__ (
                // %0 is the address of block
                "vmovupd %%ymm2, (%0)\n\t"
                :
                : "r" (block)
                : "memory" // This writes to block through the pointer
               );
        // Finish whatever didn't fit into our 4 double registers
        double sum = 0.0;
        size_t left = elements%4;
        for (size_t e = elements-left; e < elements; ++e) {
            // Unthreaded: sum += left.rows[r][e] * transpose_right.rows[c][e];
            sum += left_row[e] * right->rows[c][e];
        }
        // Unthreaded: product.rows[r][c] += sum;
        row[c] = sum + block[0] + block[1] + block[2] + block[3];
    }
    return NULL;
}

Matrix dotProduct_thread_transpose_asm_rce(Matrix left, Matrix right) {
    // If the sizes don't match, return a matrix with NULL data and 0 size.
    if (left.width != right.height) {
        Matrix empty = newMatrix(0, 0);
        return empty;
    }
    // Make a transposed matrix
    // Only max_threads threads are alive at once, so that is all the handles we need
    thrd_t t_threads[max_threads];
    void* t_args[max_threads][5];
    Matrix transpose_right = newMatrix(right.width, right.height);
    for (size_t r = 0; r < right.height; r+=max_threads) {
        // Making this threaded: transpose_right.rows[r][c] = right.rows[c][r];
        // Start a new batch of threads
        for (size_t t = 0; t < max_threads && r+t<right.height; ++t) {
            t_args[t][0] = (void*)t;
            t_args[t][1] = (void*)(transpose_right.rows[r+t]);
            t_args[t][2] = (void*)(&right);
            t_args[t][3] = (void*)(&transpose_right.width);
            t_args[t][4] = (void*)(r+t);
            int res = thrd_create(&t_threads[t], (thrd_start_t)transpose_row, t_args[t]);
            if (res == thrd_error) {
                printf("Error: thread create failed on %zu\n", r+t);
            }
        }
        // Wait for this batch of threads to finish
        for (size_t t = 0; t < max_threads && r+t<right.height; ++t) {
            int res = thrd_join(t_threads[t], NULL);
            if (res == thrd_error) {
                printf("Error: thread %zu failed to join\n", r+t);
            }
        }
    }
    Matrix product = newMatrix(left.height, right.width);
    // Now we can go across the rows of the right matrix rather than down the columns
    // Go row by row on the left and column by column on the right
    // Assume that all matrices are square, so the number of ops is either the height or the width
    void* args[max_threads][6];
    thrd_t threads[max_threads];
    for (size_t r = 0; r < product.height; r+=max_threads) {
        // Start a new batch of threads
        for (size_t t = 0; t < max_threads && r+t<product.height; ++t) {
            args[t][0] = (void*)t;
            args[t][1] = (void*)(product.rows[r+t]);
            args[t][2] = (void*)(left.rows[r+t]);
            args[t][3] = (void*)(&transpose_right);
            args[t][4] = (void*)(product.width);
            // Each dot product covers one full row of left (same as product.height when square)
            args[t][5] = (void*)(left.width);
            int res = thrd_create(&threads[t], (thrd_start_t)asm_sum_into_transposed, args[t]);
            if (res == thrd_error) {
                printf("Error: thread create failed on %zu\n", r+t);
            }
        }
        // Wait for this batch of threads to finish
        for (size_t t = 0; t < max_threads && r+t<product.height; ++t) {
            int res = thrd_join(threads[t], NULL);
            if (res == thrd_error) {
                printf("Error: thread %zu failed to join\n", r+t);
            }
        }
    }
    // We are done with the transposed matrix
    freeMatrix(transpose_right);
    return product;
}
```
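
The same four-doubles-at-a-time pattern can also be written without inline assembly. The following is only a sketch, not the lecture's implementation: it assumes a compiler that supports the GCC/Clang `vector_size` extension, and the names `v4d` and `dot4` are made up for illustration. It shows the same structure as the assembly version: a vector accumulator, a main loop over full groups of four, and a scalar loop for the leftovers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Four doubles packed into one 256-bit value (GCC/Clang vector extension).
// The type name v4d is hypothetical.
typedef double v4d __attribute__((vector_size(32)));

// Dot product of two arrays of n doubles, four lanes at a time.
double dot4(const double* a, const double* b, size_t n) {
    assert(a != NULL && b != NULL);
    v4d acc = {0.0, 0.0, 0.0, 0.0};
    size_t e = 0;
    // <= so that the last full group is included when n is a multiple of 4.
    for (; e + 4 <= n; e += 4) {
        v4d va, vb;
        memcpy(&va, a + e, sizeof va);  // unaligned load, like vmovupd
        memcpy(&vb, b + e, sizeof vb);
        acc += va * vb;                 // lane-wise multiply, then add
    }
    // Combine the four partial sums.
    double sum = acc[0] + acc[1] + acc[2] + acc[3];
    // Scalar loop for the 0 to 3 elements that did not fill a vector.
    for (; e < n; ++e) {
        sum += a[e] * b[e];
    }
    return sum;
}
```

Whether the compiler turns this into `vmulpd`/`vaddpd` depends on the target and flags (for example `-mavx` on x86-64); without them it falls back to narrower instructions but computes the same result.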

---

## Final Code (main)

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#include "hw3_matrix_lib.h"

// Converts the timespec to milliseconds
typedef struct timespec timespec;
double timespecToMs(timespec t) {
    return 1000.0 * t.tv_sec + 1e-6 * t.tv_nsec;
}

int main(int argc, char** argv) {

    if (argc != 3 && argc != 4) {
        printf("Provide an option for which loop to run. Options are 1-13.\n");
        printf("Provide a matrix size greater than or equal to 10.\n");
        printf("For algorithm 11, also provide a number of threads (defaults to 6).\n");
        return 0;
    }
    int opt = atoi(argv[1]);
    if (opt < 1 || opt > 13) {
        printf("Provide an option for which loop to run. Options are 1-13.\n");
        return 0;
    }
    int size = atoi(argv[2]);
    if (size < 10) {
        printf("Provide a matrix size greater than or equal to 10.\n");
        return 0;
    }
    if (argc > 3) {
        // This variable is declared in hw3_matrix_lib.h
        // Global variables like this should generally be avoided as they aren't clear.
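        // A thread count below 1 would make the batch loops never advance.
        int requested = atoi(argv[3]);
        if (requested < 1) {
            printf("Provide a thread count of at least 1.\n");
            return 0;
        }
        max_threads = requested;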
    }

    // mm_func is a pointer to a function that takes two matrices and returns one.
    typedef Matrix (*mm_func)(Matrix, Matrix);
    mm_func mmul = NULL;
    switch (opt) {
        case 1:
            mmul = dotProduct_rce;
            printf("Testing rce\n");
            break;
        case 2:
            mmul = dotProduct_rec;
            printf("Testing rec\n");
            break;
        case 3:
            mmul = dotProduct_cre;
            printf("Testing cre\n");
            break;
        case 4:
            mmul = dotProduct_cer;
            printf("Testing cer\n");
            break;
        case 5:
            mmul = dotProduct_erc;
            printf("Testing erc\n");
            break;
        case 6:
            mmul = dotProduct_ecr;
            printf("Testing ecr\n");
            break;
        case 7:
            mmul = dotProduct_block;
            printf("Testing block\n");
            break;
        case 8:
            mmul = dotProduct_transpose_rce;
            printf("Testing transpose rce\n");
            break;
        case 9:
            mmul = dotProduct_thread_rce;
            printf("Testing thread rce\n");
            break;
        case 10:
            mmul = dotProduct_thread_cre;
            printf("Testing thread cre\n");
            printf("Note! CRE should be using a mutex, but currently is not.\n");
            break;
        case 11:
            mmul = dotProduct_thread_rce_max;
            printf("Testing thread rce_max\n");
            break;
        case 12:
            mmul = dotProduct_thread_transpose_rce;
            printf("Testing thread_transpose rce\n");
            break;
        case 13:
            mmul = dotProduct_thread_transpose_asm_rce;
            printf("Testing thread_transpose_asm rce\n");
            break;
    }

    Matrix m1 = newMatrix(size, size);
    Matrix m2 = newMatrix(size, size);
    for (int i = 0; i < size*size; ++i) {
        // Fill the matrices with random numbers. Be sure to avoid dividing by 0.
        m1.data[i] = rand()/ ((double)rand()+1);
        m2.data[i] = rand()/ ((double)rand()+1);
    }

    // Set up timers and run
    struct timespec tw_begin;
    clock_gettime(CLOCK_MONOTONIC, &tw_begin);
    struct timespec ts_begin;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts_begin);

    Matrix out = mmul(m1, m2);

    struct timespec tw_end;
    clock_gettime(CLOCK_MONOTONIC, &tw_end);
    struct timespec ts_end;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts_end);

    double posix_wall = timespecToMs(tw_end) - timespecToMs(tw_begin);
    double cpu_time = timespecToMs(ts_end) - timespecToMs(ts_begin);
    // Print the result
    switch (opt) {
        case 1:
            printf("rce wall time used for size %i is %.2f ms\n", size, posix_wall);
            printf("rce CPU time used for size %i is %.2f ms\n", size, cpu_time);
            break;
        case 2:
            printf("rec wall time used for size %i is %.2f ms\n", size, posix_wall);
            printf("rec CPU time used for size %i is %.2f ms\n", size, cpu_time);
            break;
        case 3:
            printf("cre wall time used for size %i is %.2f ms\n", size, posix_wall);
            printf("cre CPU time used for size %i is %.2f ms\n", size, cpu_time);
            break;
        case 4:
            printf("cer wall time used for size %i is %.2f ms\n", size, posix_wall);
            printf("cer CPU time used for size %i is %.2f ms\n", size, cpu_time);
            break;
        case 5:
            printf("erc wall time used for size %i is %.2f ms\n", size, posix_wall);
            printf("erc CPU time used for size %i is %.2f ms\n", size, cpu_time);
            break;
        case 6:
            printf("ecr wall time used for size %i is %.2f ms\n", size, posix_wall);
            printf("ecr CPU time used for size %i is %.2f ms\n", size, cpu_time);
            break;
        case 7:
            printf("block wall time used for size %i is %.2f ms\n", size, posix_wall);
            printf("block CPU time used for size %i is %.2f ms\n", size, cpu_time);
            break;
        case 8:
            printf("transpose rce wall time used for size %i is %.2f ms\n", size, posix_wall);
            printf("transpose rce CPU time used for size %i is %.2f ms\n", size, cpu_time);
            break;
        case 9:
            printf("thread rce wall time used for size %i is %.2f ms\n", size, posix_wall);
            printf("thread rce CPU time used for size %i is %.2f ms\n", size, cpu_time);
            break;
        case 10:
            printf("thread cre wall time used for size %i is %.2f ms\n", size, posix_wall);
            printf("thread cre CPU time used for size %i is %.2f ms\n", size, cpu_time);
            break;
        case 11:
            printf("thread rce_max wall time used for size %i is %.2f ms\n", size, posix_wall);
            printf("thread rce_max CPU time used for size %i is %.2f ms\n", size, cpu_time);
            break;
        case 12:
            printf("thread_transpose rce wall time used for size %i is %.2f ms\n", size, posix_wall);
            printf("thread_transpose rce CPU time used for size %i is %.2f ms\n", size, cpu_time);
            break;
        case 13:
            printf("thread_transpose_asm rce wall time used for size %i is %.2f ms\n", size, posix_wall);
            printf("thread_transpose_asm rce CPU time used for size %i is %.2f ms\n", size, cpu_time);
            break;
    }

    freeMatrix(m1);
    freeMatrix(m2);
    freeMatrix(out);

    return 0;
}
```

---

## Where we started

---

## With transposed matrix

---

## With threads (CPU time)

---

## With threads (wall time)

---

## With SIMD (CPU time)

---

## With SIMD (wall time)

---

## Final comparison (wall time)

---

## Final comparison (wall time)

---

## With gcc's -O2 (wall time)

---

## With gcc's -O2 (wall time)

---

## With gcc's -O2 (wall time)

---

## Moral

* The moral of the story is that optimization is hard, but rewarding!
  * We could probably borrow some of gcc's optimizations to squeeze out a little more
* Hardware is generally adapted to allow better implementations
  * But you may need to know that those optimizations exist
* And if you ever need to design hardware, you'll have to know all the most common use cases!
  * This is why common trig functions have hardware instructions (for example, x87's `fsin` and `fcos`)

---

## Today's Workloads

* There is a special instruction, called a fused multiply-add (FMA)
  * Multiply one vector (or matrix) by another and then add a constant
    * $w^Tx + b$
* It's used in machine learning
  * And, since ML is everywhere, so is that instruction
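
The $w^Tx + b$ computation above can be sketched in plain C. This is a minimal illustration, and the function name `affine` is made up here. Each loop iteration is one multiply followed by one add, which is exactly the pair of operations a fused multiply-add instruction performs in a single step; whether the compiler actually emits one depends on the target and flags.

```c
#include <assert.h>
#include <stddef.h>

// Computes w^T x + b for vectors of length n.
// The name affine is hypothetical, chosen for this sketch.
double affine(const double* w, const double* x, size_t n, double b) {
    assert(w != NULL && x != NULL);
    double acc = b;  // start from the constant term
    for (size_t i = 0; i < n; ++i) {
        // One multiply and one add per element: the pattern that a
        // fused multiply-add instruction handles in a single step.
        acc += w[i] * x[i];
    }
    return acc;
}
```

A single neuron in a neural network layer evaluates this expression, so a large model repeats it an enormous number of times.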

---

## Result

* You may think that ML only runs on GPUs
  * But everyone wants to run ML models
* From your phone to your laptop to your smart fridge
* Inefficient implementations cost energy and time
  * So we are seeing wider support for vector and matrix operations in hardware

---

## Next Topics

* There are two non-CPU architectures that are increasingly important for modern data workflows
  * GPUs (Graphics Processing Units)
  * TPUs (Tensor Processing Units)
* And we have a quiz next recitation!

---

## Example Questions

* Note, in the initially uploaded version I somehow copied the first question twice and half changed each version, leading to confusion
* The following version is corrected

---

## Example Questions

<div style="text-align: left;">

Which of the following statements about cache misses is always **true**?

a. If two addresses have the same tag, they will not cause a conflict miss.  

b. If two addresses map onto different sets in a directly mapped cache, they can still cause a conflict miss.  

c. If two addresses map onto the same block, they will cause a conflict miss.  

d. None of the above.
</div>

---

<div style="text-align: left;">

Which of the following statements about cache misses is always **true**?

a. **If two addresses have the same tag, they will not cause a conflict miss.**  

b. If two addresses map onto different sets in a directly mapped cache, they can still cause a conflict miss.  

c. If two addresses map onto the same block, they will cause a conflict miss.  

d. None of the above.
</div>

* For (a), two addresses with the same tag are either in the same block (if they map to the same set) or in different sets, so they cannot conflict. This is true.
* For (b), two addresses that map to different sets can never conflict, so this is false.
* For (c), two addresses that map onto the same block will not conflict. This is locality working in our favor: loading one address also fetches the data for the other.

---

<div style="text-align: left;">
In a four-way associative cache, which of the following is definitely true?

a. There are four blocks per line.  

b. There are four sets per block.  

c. There are four blocks per set.  

d. There are four sets in the cache.  
</div>

---

<div style="text-align: left;">
In a four-way associative cache, which of the following is definitely true?

a. There are four blocks per line.  

b. There are four sets per block.  

c. **There are four blocks per set.**  

d. There are four sets in the cache.  
</div>

---

<div style="text-align: left;">
A load instruction copies a value from memory into a register. If the L1 hit latency is 4 cycles and the miss penalty if we find our data in the L2 cache is 10 cycles, which of the following is true?

a. If the data is in the L1 cache then it will be available after four cycles have passed.

b. If the data is in the L2 cache, then it must have already been requested at the L1 cache.  

c. Data cannot be copied from memory into registers.

d. None of the above.
</div>

---

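<div style="text-align: left;">
A load instruction copies a value from memory into a register. If the L1 hit latency is 4 cycles and the miss penalty if we find our data in the L2 cache is 10 cycles, which of the following is true?

a. **If the data is in the L1 cache then it will be available after four cycles have passed.**  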

b. If the data is in the L2 cache, then it must have already been requested at the L1 cache.  

c. Data cannot be copied from memory into registers.  

d. None of the above.
</div>

---

<div style="text-align: left;">
L1 cache stores data using what technology?

a. SRAM  

b. DRAM  

c. Laser activated magnetic platters.  

d. Non-volatile flash
</div>

---

<div style="text-align: left;">
L1 cache stores data using what technology?

a. **SRAM**  

b. DRAM  

c. Laser activated magnetic platters.  

d. Non-volatile flash
</div>

---

<div style="text-align: left;">
In a cold cache, what is the latency of the instruction
<pre>add %rax, %rbx</pre>

in the ID cycle? Latency means the number of cycles before it completes. The L1 hit latency is 4 cycles, the miss penalty to go to L2 is 10 cycles, and the miss penalty for L3 is 40 cycles.

a. 1 cycle.  

b. 10 cycles.  

c. 40 cycles.  

d. Longer than the above options.

</div>

---

<div style="text-align: left;">
In a cold cache, what is the latency of the instruction
<pre>add %rax, %rbx</pre>

in the ID cycle? Latency means the number of cycles before it completes. The L1 hit latency is 4 cycles, the miss penalty to go to L2 is 10 cycles, and the miss penalty for L3 is 40 cycles.

a. **1 cycle**.  

b. 10 cycles.  

c. 40 cycles.  

d. Longer than the above options.

</div>

---

Without threading, which method is **slowest**?

a. r in the outer loop.  

b. r in the middle loop.  

c. r in the inner loop.  


d. All are the same.

</div>
<div class="col">
Given
<pre>
product.rows[r][c] += left.rows[r][e] * right.rows[e][c];
</pre>
and
<pre>
for (size_t r = 0; r < product.height; ++r)
for (size_t e = 0; e < num_elem; ++e)
for (size_t c = 0; c < product.width; ++c)
</pre>

</div>
</div>
</div>

---

Without threading, which method is **slowest**?

a. r in the outer loop.  

b. r in the middle loop.  

c. **r in the inner loop.**  


d. All are the same.

</div>
</div>
</div>
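
To see why the loop order matters, the two extremes can be put side by side. This is a sketch under assumptions: it uses flat row-major arrays instead of the lecture's `Matrix` type, and the function names `mm_rec` and `mm_cer` are made up. Both functions compute the same product; only the order in which memory is touched differs.

```c
#include <assert.h>
#include <stddef.h>

// Row-major n x n matrices stored as flat arrays: element (i, j) is m[i*n + j].
// The caller must zero product before calling either function.

// r outer, e middle, c inner: the inner loop walks along rows of
// product and right, so consecutive accesses are adjacent in memory.
void mm_rec(const double* left, const double* right, double* product, size_t n) {
    assert(left != NULL && right != NULL && product != NULL);
    for (size_t r = 0; r < n; ++r)
        for (size_t e = 0; e < n; ++e)
            for (size_t c = 0; c < n; ++c)
                product[r*n + c] += left[r*n + e] * right[e*n + c];
}

// c outer, e middle, r inner: the inner loop walks down columns of
// product and left, jumping n doubles between consecutive accesses.
void mm_cer(const double* left, const double* right, double* product, size_t n) {
    assert(left != NULL && right != NULL && product != NULL);
    for (size_t c = 0; c < n; ++c)
        for (size_t e = 0; e < n; ++e)
            for (size_t r = 0; r < n; ++r)
                product[r*n + c] += left[r*n + e] * right[e*n + c];
}
```

With `r` in the inner loop, both `product` and `left` are accessed with a stride of a full row, so for large matrices almost every access lands in a different cache block.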

---

* We are using a 2-way associative cache with block size 2
* Memory address 1010 is requested.
  * What are the Tag, Set and Block?
  * How will the current state of the cache change?

<br/>
<table>
<tr> <th colspan="3"> Request </th></tr>
<tr> <td>Tag </td><td> Set </td><td> Block </td></tr>
<tr> <td> ?</td><td> ? </td><td> ? </td></tr>
</table>
<br/>

</div>
<div class="col">

<table>
<tr> <td></td><th> valid  </th> <th>Tag</th> <th colspan="2"> Block </th><th> valid  </th> <th>Tag</th> <th colspan="2"> Block </th></tr>
<tr> <td><b>Set 0</b></td><td> 1</td><td>00 </td><td>m[0]</td><td>m[1]</td><td> 1</td><td>10</td><td>m[8]</td><td>m[9]</td></tr>
<tr> <td><b>Set 1</b></td><td> 1</td><td>01 </td><td>m[6]</td><td>m[7]</td><td> 0</td><td> </td><td> </td><td> </td></tr>
</table>
</div>
</div>

---

* We are using a 2-way associative cache with block size 2
* Memory address 1010 is requested.
  * What are the Tag, Set and Block?
  * How will the current state of the cache change?

<br/>
<table>
<tr> <th colspan="3"> Request </th></tr>
<tr> <td>Tag </td><td> Set </td><td> Block </td></tr>
<tr> <td> 10 </td><td> 1 </td><td> 0 </td></tr>
</table>
<br/>

</div>
<div class="col">

<table>
<tr> <td></td><th> valid  </th> <th>Tag</th> <th colspan="2"> Block </th><th> valid  </th> <th>Tag</th> <th colspan="2"> Block </th></tr>
<tr> <td><b>Set 0</b></td><td> 1</td><td>00 </td><td>m[0]</td><td>m[1]</td><td> 1</td><td>10</td><td>m[8]</td><td>m[9]</td></tr>
<tr> <td><b>Set 1</b></td><td> 1</td><td>01 </td><td>m[6]</td><td>m[7]</td><td> 1</td><td>10 </td><td> m[10]</td><td> m[11]</td></tr>
</table>
</div>
</div>
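
The tag, set, and block offset in the example above can be computed mechanically. The following is a small sketch, with the struct `CacheAddr` and the function `splitAddress` invented for illustration. It assumes addresses count whole elements, as in the `m[...]` notation on this slide.

```c
#include <assert.h>
#include <stddef.h>

// The three fields of a cache request. Names are hypothetical.
typedef struct {
    size_t tag;
    size_t set;
    size_t offset;
} CacheAddr;

// Splits an address into tag, set index, and offset within the block.
// block_size is the number of elements per block; num_sets is the number of sets.
CacheAddr splitAddress(size_t addr, size_t block_size, size_t num_sets) {
    assert(block_size > 0 && num_sets > 0);
    CacheAddr out;
    out.offset = addr % block_size;              // lowest bits: position in the block
    out.set = (addr / block_size) % num_sets;    // middle bits: which set
    out.tag = addr / (block_size * num_sets);    // remaining high bits
    return out;
}
```

For the request on this slide, `splitAddress(10, 2, 2)` gives offset 0, set 1, and tag 2 (binary 10), which matches the table.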

---

## Cache Simulator

* There is a nifty online simulator
  * [https://courses.cs.washington.edu/courses/cse351/cachesim/](https://courses.cs.washington.edu/courses/cse351/cachesim/)
* There are others, but that one is easy to use
* If the cache is confusing, try it out

<!--
TODO:
Go through some example vector operations
Apply to the same matrix multiply code
ymm registers hold 256-bits, which is 4 doubles. So we should be able to multiply 4 pairs of doubles in a single instruction

Use MOVDQA or MOVQ for vector moves into the vector registers (VMOVDQA64, I guess?)
Then MULPS for vector multiply
https://asm-docs.microagi.org/x86/mulps.html

Or if we want to use the ymm registers
VMASKMOVPD ymm,ymm,m256 does a masked move

Some interesting stuff on this stackoverflow, not really useful for us:
https://stackoverflow.com/questions/40623773/best-way-to-load-store-from-to-general-purpose-registers-to-from-xmm-ymm-registe

This is probably the best example:
https://wiki.osdev.org/AVX2
If we are staying with the same value for one side of the multiply, we could save some loads
; extern void Mul8floats(float* Dest, float* Src)
Mul8floats:
; RCX contains pointer to Src
; RDX contains pointer of Dest
     vmovups ymm0, [rcx]
     vmovups ymm1, [rdx]
     vmulps ymm0, ymm0, ymm1 ; Packed multiply of 8 floats across ymm0 by 8 floats in ymm1 and store result in ymm0
     vmovups [rcx], ymm0 ; Store the result in memory (float* Dest)
     ret
Use vmulpd for double precision instead of vmulps for single. Only 4 floats at a time. And vmovupd to load?

https://www.felixcloutier.com/x86/mulpd

That takes advantage of the operators being pushed into particular registers, as in this example:
https://web.engr.oregonstate.edu/~mjb/cs575/Handouts/simd.vector.1pp.pdf

If we do our own moves within the for loop things should be faster.
We can try both ways, I guess.

Then go through some example quiz2 questions
-->