# CS 211 - Lecture 21

## Computer Architecture

### Single Instruction, Multiple Data

Bernhard Firner

2025-11-11

---

## Review (Again!)

* Cache anatomy
  * Block: the fixed-size packet of information
  * Line: the block and associated information
    * Valid bit, tag, dirty bit, etc
  * Set: A group of lines

---

## Cache Misses

* `cold miss`: When the valid bit is unset (also called *compulsory miss*)
* `capacity miss`: When our cache is full and the miss couldn't be avoided
  * For example, you are looping repeatedly through a 1MB array on a 64KB cache
* `conflict miss`: When the program has used the cache poorly
  * There is available space, but locality has not been exploited to use it

---

## Matrix Multiplication

* Example of a (possibly) cache intensive function
* Went from a 25 second 1500x1500 matrix multiply to 2 seconds
  * Transposed a matrix first to change locality
  * Then used threads (but only six) to gain  speedup
    * This was as fast as a thread per row, but far more efficient!

---

## Initial Versions

---

## CPU Time

---

## Wall Time

---

## Locality

* Locality matters
  * Changing the loop order or transposing the matrix reduced L1 misses
    * 32KB L2 cache wasn't being pushed
* Threads shared L3 cache
  * on an L2 miss, L3 will try to find missing data from other L2 cache
  * We actually saw improved locality with threads on one CPU

---

## Prefetching

* L1 is 32KB
  * A 1500x1500 array is around 2MB
  * About 70 times too large for L1
  * Twice as large as L2
* To avoid constant cache misses, we need successful prefetching
  * But the CPU only understands access patterns that follow locality

---

## Threading and Locality

* Threading, with a ton of threads, usually has bad locality
  * Like a swarm of uncoordinated bees
* But, if we plan things out, we can vastly reduce contention between threads
  * Even leverage prefetching from one thread to help another thread
* In the end, that lead to the best version of our multiply algorithm

---

## Other types of parallelism

* We created parallelism in the instructions with out threads
* There is another kind of parallelism
  * Some instructions are "single instruction, multiple data"
    * SIMD
  * Also called vector processing

---

## Advantages

* We reduce the number of instructions
  * Denser code also leads to few cache misses
  * Replaces multiple loads, adds, multiplies, xors, etc
* Can also eliminate loops
* The hardware is optimized for them

---

## Many Versions

* The first vector instructions supported on x85 were the MMX instructions
  * Introduced in 1997, MM is for multimedia
  * This introduced the MM0-MM7 registers
* See what your CPU supports, just `cat /proc/cpuinfo`
  * Try using grep to find mmx, sse, or avx

---

## What For?

* These operations are single instructions that operations upon multiple data registers
  * We saw gcc trying to use these to speed up a loop earlier
* Newer versions support new operations (such as single-instruction multiply add)
* Newer versions also support more registers
  * AVX-512 (from 2016) support 512-bit wide registers

---

## 2023 Additions

* AMX introduced in Intel's Sapphire Rapics microarchitecture
* Tile registers are 8192 bits wide (16 rows of 64 bytes)
  * Made for in-hardware matrix instructions
* Even if your matrix is too large, you can break it into blocks
  * The actual operation will be fully parallel
  * Delay will be fetching and storing the data
* 2048 INT8 or 1024 BF16 ops per cycle

---

## iLab3 CPU flags

<pre>
$ cat /proc/cpuinfo |grep flags
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sev sev_es
<pre>

---

## Support

* iLab cpus only support up to AVX2
  * These are 256-bit vector instructions
* My laptop is sadly newer :(

---

## Laptop Flags

So many avx options

<pre>
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl xtopology nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold vgif x2avic v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid overflow_recov succor smca fsrm flush_l1d
</pre>

---

## Compiler Support

* Compilers generally add support to use fancy new instructions
  * But there can be a delay
* So if you want to use something cutting edge, you may need to do it yourself
* So it's a good thing that we know some assembly!

---

## Features

* We'll stick with the ymm...ymm15 registers so you can test on iLab
* Each is 256-bits
  * Doubles are 64-bits, so these hold 4
* That means we can do 4 simultaneous multiplications with a single instruction

---

## The Big Idea

* `vmulp` (vector multiply) operates over multiple values in the registers
  * `vmulps` for single precision, `vmulpd` for double
* `vmulpd ymm0, ymm1, ymm2`
  * Multiplies everything in ymm0 by everything in ymm1 and stores in ymm2
  * This could allow us to leave either ymm0 or ymm1 unaltered
    * Which we've seen in our matrix multiply loops

---

## Graphically

---

## Moving the Data

* We obviously don't want to have multiple load instructions for each array element
  * And the ymm registers have a single name but hold multiple values
* So there are also a vector load instructions
  * `vmovups` and `vmovupd` for single and double precision
* And addition
  * `vaddps` and `vaddpd`

---

## Important Details

* Remember that we can drop assembly into our C code
  * `__asm__` and then the assembly in strings
  * % means a user-provided variable, and %% refers to an actual register
* Also, the first three arguments passed to a function use known registers
  * %rdi, %rsi, and %rdx
  * Assuming they're the size of a pointer

---

## Code

```c
#include <stdio.h>
#include <stdlib.h>

// Note that we are going to assume that these each have four doubles.
void showMultiply(double a[4], double b[4], double c[4]) {
    // Going to take advantage of the fact that the arguments are in rdi, rsi, and rdx by default.
    // Don't turn on compiler optimizations.
    // If those registers are going to be used, then we should store our arguments on the stack:
    /*
        movq    %rdi, -8(%rbp)
        movq    %rsi, -16(%rbp)
        movq    %rdx, -24(%rbp)
    */
    // (gcc actually does that in any case)
    // But if we only use assembly in this function, they should be safe to use.
    // The double %% indicates a register, a single % means a user-provided variable.
    __asm__(
            "vmovupd (%%rdi), %%ymm0\n\t"
            "vmovupd (%%rsi), %%ymm1\n\t"
            "vmulpd %%ymm0, %%ymm1, %%ymm2\n\t"
            "vmovupd %%ymm2, (%%rdx)\n\t"
            : // No outputs (we assume rax is safe)
            : // No inputs (we assume rdi and rsi are safe)
            :
           );
    return;
}

int main(void) {
    double a[4] = {3.0, 2.0, 3.0, 4.0};
    double b[4] = {4.0, -2.0, 3.0, -1.0};
    double c[4] = {0.0};

showMultiply(a, b, c);

printf("%lf, %lf, %lf, %lf\n", a[0], a[1], a[2], a[3]);
    printf("%lf, %lf, %lf, %lf\n", b[0], b[1], b[2], b[3]);
    printf("%lf, %lf, %lf, %lf\n", c[0], c[1], c[2], c[3]);

return 0;
}
```

---

## Results

<pre>
$ gcc lec21_show_vmulp.c
$ ./a.out 4
3.000000, 2.000000, 3.000000, 4.000000
4.000000, -2.000000, 3.000000, -1.000000
12.000000, -4.000000, 9.000000, -4.000000
</pre>

Wow, so cool!

---

## Useful?

* You all know what's coming next
* Let's revisit matrix multiplication!
* But first, let's discuss

---

## Options

* Each output value comes from the multiplication of a row and a column
  * Do we want to want to calculate 4 outputs at a time?
    * Involves reading 4 rows and columns simultaneously
  * Or should we calculate a single value in 1/4 the instructions?

---

## Combine with Threads

* Doing 4 rows and columns simultaneously may overwhelm our load bandwidth
  * How would we calculate that?
    * Compare with latency for 2 movupd loads from memory, 1 movupd back to memory, 1 vmulpd, 1 vaddpd
* We already have the threaded transpose code
  * Changing to 1/4 the instructions should have the same instruction speedup as 4 simultaneous rows and columns
  * But all memory loads will be contiguous

---

## Threaded Code

```c
void* asm_sum_into_transposed(void* void_args) {
    void ** args = (void**)void_args;
    size_t thread_id = (size_t)(args[0]);
    double* row = (double*)(args[1]);
    double* left_row = (double*)(args[2]);
    Matrix* right = (Matrix*)(args[3]);
    size_t width = (size_t)(args[4]);
    size_t elements = (size_t)(args[5]);
    for (size_t c = 0; c < width; ++c) {
        // Sum into register ymm2
        // Anything xor with itself is 0.
        __asm__ (
                "vxorpd %%ymm2, %%ymm2, %%ymm2\n\t"
                : // No outputs
                : // No inputs
                : "ymm2" // Tell gcc to stay away from this register (if we turned on optimizations)
               );
        for (size_t e = 0; e+4 < elements; e+=4) {
            // Unthreaded: sum += left.rows[r][e] * transpose_right.rows[c][e];
            // Turn this next instruction into assembly
            // sum += left_row[e] * right->rows[c][e];
            // Note that we could try to preload registers, but it's safer to use them in the instruction.
            //register double* l asm ("r10") = left_row[e];
            //register double* r asm ("r11") = right->rows[c][e];
            __asm__ (
                    // %0 and %1 will come from left_row[e] and right->rows[c][e]
                    "vmovupd (%0), %%ymm0\n\t"
                    "vmovupd (%1), %%ymm1\n\t"
                    "vmulpd %%ymm0, %%ymm1, %%ymm1\n\t"
                    "vaddpd %%ymm1, %%ymm2, %%ymm2\n\t"
                    : // No outputs (we assume rax is safe)
                    : "r" (left_row+e), "r" (right->rows[c]+e)
                    : "ymm0", "ymm1", "ymm2"
                   );
        }
        double block[4];
        __asm__ (
                // %0 and %1 will come from left_row[e] and right->rows[c][e]
                "vmovupd %%ymm2, (%0)\n\t"
                :
                : "r" (block)
                :
               );
        // Finish whatever didn't fit into our 4 double registers
        double sum = 0.0;
        size_t left = elements%4;
        for (size_t e = elements-left; e < elements; ++e) {
            // Unthreaded: sum += left.rows[r][e] * transpose_right.rows[c][e];
            sum += left_row[e] * right->rows[c][e];
        }
        // Unthreaded: product.rows[r][c] += sum;
        row[c] = sum + block[0] + block[1] + block[2] + block[3];
    }
    return NULL;
}
```

---

## Final Results

* Let's look at all of our results
* And the compare to the compiler with -O2!
* First, here's the final (debugged) code

---

## Final Code (matrix)

```c
Matrix dotProduct_rce(Matrix left, Matrix right) {
    // If the sizes don't match, return a matrix with NULL data and 0 size.
    if (left.width != right.height) {
        Matrix empty = newMatrix(0, 0);
        return empty;
    }
    Matrix product = newMatrix(left.height, right.width);
    // Go row by row on the left and column by column on the right
    // Assume that all matrices are square, so the number of ops is either the height or the width
    size_t num_elem = left.height;
    for (size_t r = 0; r < product.height; ++r) {
        for (size_t c = 0; c < product.width; ++c) {
            double sum = 0.0;
            for (size_t e = 0; e < num_elem; ++e) {
                sum += left.rows[r][e] * right.rows[e][c];
            }
            product.rows[r][c] += sum;
        }
    }
    return product;
}
Matrix dotProduct_rec(Matrix left, Matrix right) {
    // If the sizes don't match, return a matrix with NULL data and 0 size.
    if (left.width != right.height) {
        Matrix empty = newMatrix(0, 0);
        return empty;
    }
    Matrix product = newMatrix(left.height, right.width);
    // Go row by row on the left and column by column on the right
    // Assume that all matrices are square, so the number of ops is either the height or the width
    size_t num_elem = left.height;
    for (size_t r = 0; r < product.height; ++r) {
        for (size_t e = 0; e < num_elem; ++e) {
            for (size_t c = 0; c < product.width; ++c) {
                product.rows[r][c] += left.rows[r][e] * right.rows[e][c];
            }
        }
    }
    return product;
}
Matrix dotProduct_cre(Matrix left, Matrix right) {
    // If the sizes don't match, return a matrix with NULL data and 0 size.
    if (left.width != right.height) {
        Matrix empty = newMatrix(0, 0);
        return empty;
    }
    Matrix product = newMatrix(left.height, right.width);
    // Go row by row on the left and column by column on the right
    // Assume that all matrices are square, so the number of ops is either the height or the width
    size_t num_elem = left.height;
    for (size_t c = 0; c < product.width; ++c) {
        for (size_t r = 0; r < product.height; ++r) {
            double sum = 0.0;
            for (size_t e = 0; e < num_elem; ++e) {
                sum += left.rows[r][e] * right.rows[e][c];
            }
            product.rows[r][c] += sum;
        }
    }
    return product;
}
Matrix dotProduct_cer(Matrix left, Matrix right) {
    // If the sizes don't match, return a matrix with NULL data and 0 size.
    if (left.width != right.height) {
        Matrix empty = newMatrix(0, 0);
        return empty;
    }
    Matrix product = newMatrix(left.height, right.width);
    // Go row by row on the left and column by column on the right
    // Assume that all matrices are square, so the number of ops is either the height or the width
    size_t num_elem = left.height;
    for (size_t c = 0; c < product.width; ++c) {
        for (size_t e = 0; e < num_elem; ++e) {
            for (size_t r = 0; r < product.height; ++r) {
                product.rows[r][c] += left.rows[r][e] * right.rows[e][c];
            }
        }
    }
    return product;
}
Matrix dotProduct_erc(Matrix left, Matrix right) {
    // If the sizes don't match, return a matrix with NULL data and 0 size.
    if (left.width != right.height) {
        Matrix empty = newMatrix(0, 0);
        return empty;
    }
    Matrix product = newMatrix(left.height, right.width);
    // Go row by row on the left and column by column on the right
    // Assume that all matrices are square, so the number of ops is either the height or the width
    size_t num_elem = left.height;
    for (size_t e = 0; e < num_elem; ++e) {
        for (size_t r = 0; r < product.height; ++r) {
            for (size_t c = 0; c < product.width; ++c) {
                product.rows[r][c] += left.rows[r][e] * right.rows[e][c];
            }
        }
    }
    return product;
}
Matrix dotProduct_ecr(Matrix left, Matrix right) {
    // If the sizes don't match, return a matrix with NULL data and 0 size.
    if (left.width != right.height) {
        Matrix empty = newMatrix(0, 0);
        return empty;
    }
    Matrix product = newMatrix(left.height, right.width);
    // Go row by row on the left and column by column on the right
    // Assume that all matrices are square, so the number of ops is either the height or the width
    size_t num_elem = left.height;
    for (size_t e = 0; e < num_elem; ++e) {
        for (size_t c = 0; c < product.width; ++c) {
            for (size_t r = 0; r < product.height; ++r) {
                product.rows[r][c] += left.rows[r][e] * right.rows[e][c];
            }
        }
    }
    return product;
}

size_t min(size_t a, size_t b) {
    if (a < b) {
        return a;
    }
    else {
        return b;
    }
}

Matrix dotProduct_block(Matrix left, Matrix right) {
    // If the sizes don't match, return a matrix with NULL data and 0 size.
    if (left.width != right.height) {
        Matrix empty = newMatrix(0, 0);
        return empty;
    }
    //size_t block = 64 / sizeof(double);
    size_t block = 128;
    Matrix product = newMatrix(left.height, right.width);
    // Go row by row on the left and column by column on the right
    // Assume that all matrices are square, so the number of ops is either the height or the width
    size_t num_elem = left.height;
    size_t r = 0;
    size_t c = 0;
    size_t e = 0;
    for (r = 0; r < product.height; r+=block) {
        for (c = 0; c < product.width; c+=block) {
            for (e = 0; e < num_elem; e+=block) {
                // B x B block matrix multiplications
                size_t r_end = min(r+block, product.height);
                size_t c_end = min(c+block, product.width);
                size_t e_end = min(e+block, num_elem);
                for (size_t r1 = r; r1 < r_end; ++r1) {
                    for (size_t c1 = c; c1 < c_end; ++c1) {
                        double sum = 0;
                        for (size_t e1 = e; e1 < e_end; ++e1) {
                            sum += left.rows[r1][e1] * right.rows[e1][c1];
                        }
                        product.rows[r1][c1] += sum;
                    }
                }
            }
        }
    }
    return product;
}

Matrix dotProduct_transpose_rce(Matrix left, Matrix right) {
    // If the sizes don't match, return a matrix with NULL data and 0 size.
    if (left.width != right.height) {
        Matrix empty = newMatrix(0, 0);
        return empty;
    }
    // Make a transposed matrix
    Matrix transpose_right = newMatrix(right.width, right.height);
    for (size_t i = 0; i < right.height; ++i) {
        for (size_t j = 0; j < right.width; ++j) {
            transpose_right.rows[j][i] = right.rows[i][j];
        }
    }
    Matrix product = newMatrix(left.height, right.width);
    // Now we can go across the rows of the right matrix rather than down the columns
    // Go row by row on the left and column by column on the right
    // Assume that all matrices are square, so the number of ops is either the height or the width
    size_t num_elem = left.height;
    for (size_t r = 0; r < product.height; ++r) {
        for (size_t c = 0; c < product.width; ++c) {
            double sum = 0.0;
            for (size_t e = 0; e < num_elem; ++e) {
                // This was previously sum += left.rows[r][e] * right.rows[e][c];
                // The transposition puts e into the second index
                sum += left.rows[r][e] * transpose_right.rows[c][e];
                // This is now the same as left.data[r*width+e] * transpose_right.data[c*width + e]
            }
            product.rows[r][c] += sum;
        }
    }
    // We are done with the transposed matrix
    freeMatrix(transpose_right);
    return product;
}

void* sum_into_ce(void* void_args) {
    void ** args = (void**)void_args;
    size_t thread_id = (size_t)(args[0]);
    double* row = (double*)(args[1]);
    double* left_row = (double*)(args[2]);
    Matrix* right = (Matrix*)(args[3]);
    int width = *(int*)(args[4]);
    int height = *(int*)(args[5]);
    for (size_t c = 0; c < width; ++c) {
        double sum = 0.0;
        for (size_t e = 0; e < height; ++e) {
            sum += left_row[e] * right->rows[e][c];
        }
        row[c] += sum;
    }
    return NULL;
}

Matrix dotProduct_thread_rce(Matrix left, Matrix right) {
    // If the sizes don't match, return a matrix with NULL data and 0 size.
    if (left.width != right.height) {
        Matrix empty = newMatrix(0, 0);
        return empty;
    }
    Matrix product = newMatrix(left.height, right.width);
    // Go row by row on the left and column by column on the right
    // Assume that all matrices are square, so the number of ops is either the height or the width
    size_t num_elem = left.height;
    size_t r, c, e;
    // We must send the outer loop to the threads, so the outer loop must be 'e'
    // Will create one thread per row
    thrd_t threads[product.height];
    void* args[product.height][6];
    for (size_t r = 0; r < product.height; ++r) {
        args[r][0] = (void*)r;
        args[r][1] = (void*)(product.rows[r]);
        args[r][2] = (void*)(left.rows[r]);
        args[r][3] = (void*)(&right);
        args[r][4] = (void*)(&product.width);
        args[r][5] = (void*)(&product.height);
        int res = thrd_create(&threads[r], (thrd_start_t)sum_into_ce, args[r]);
        if (res == thrd_error) {
            printf("Error: thread create failed on %lu\n", r);
        }
    }
    for (size_t r = 0; r < product.height; ++r) {
        int res = thrd_join(threads[r], NULL);
        if (res == thrd_error) {
            printf("Error: thread %lu failed to join\n", r);
        }
    }
    return product;
}

size_t max_threads = 6;
Matrix dotProduct_thread_rce_max(Matrix left, Matrix right) {
    // If the sizes don't match, return a matrix with NULL data and 0 size.
    if (left.width != right.height) {
        Matrix empty = newMatrix(0, 0);
        return empty;
    }
    Matrix product = newMatrix(left.height, right.width);
    // Go row by row on the left and column by column on the right
    // Assume that all matrices are square, so the number of ops is either the height or the width
    size_t num_elem = left.height;
    // We must send the outer loop to the threads, so the outer loop must be 'e'
    // Will create one thread per row
    thrd_t threads[max_threads];
    void* args[max_threads][6];
    size_t r = 0;
    for (; r < product.height; r+=max_threads) {
        // Start a new batch of threads
        for (size_t t = 0; t < max_threads && r+t<product.height; ++t) {
            args[t][0] = (void*)t;
            args[t][1] = (void*)(product.rows[r+t]);
            args[t][2] = (void*)(left.rows[r+t]);
            args[t][3] = (void*)(&right);
            args[t][4] = (void*)(&product.width);
            args[t][5] = (void*)(&product.height);
            int res = thrd_create(&threads[t], (thrd_start_t)sum_into_ce, args[t]);
            if (res == thrd_error) {
                printf("Error: thread create failed on %lu\n", r+t);
            }
        }
        // Wait for this batch of threads to finish
        for (size_t t = 0; t < max_threads && r+t<product.height; ++t) {
            int res = thrd_join(threads[t], NULL);
            if (res == thrd_error) {
                printf("Error: thread %lu failed to join\n", r+t);
            }
        }
    }
    return product;
}

void* sum_into_re(void* void_args) {
    void ** args = (void**)void_args;
    size_t thread_id = (size_t)(args[0]);
    Matrix* product = (Matrix*)(args[1]);
    Matrix* left = (Matrix*)(args[2]);
    Matrix* right = (Matrix*)(args[3]);
    int width = *(int*)(args[4]);
    int height = *(int*)(args[5]);
    int c = *(int*)(args[6]);
    for (size_t r = 0; r < product->height; ++r) {
        double sum = 0.0;
        for (size_t e = 0; e < left->width; ++e) {
            sum += left->rows[r][e] * right->rows[e][c];
        }
        // TODO FIXME This would need a mutex since the c variable is shared over threads.
        product->rows[r][c] += sum;
    }
    return NULL;
}

// TODO FIXME This would need a mutex
Matrix dotProduct_thread_cre(Matrix left, Matrix right) {
    // If the sizes don't match, return a matrix with NULL data and 0 size.
    if (left.width != right.height) {
        Matrix empty = newMatrix(0, 0);
        return empty;
    }
    Matrix product = newMatrix(left.height, right.width);
    // Go row by row on the left and column by column on the right
    // Assume that all matrices are square, so the number of ops is either the height or the width
    size_t num_elem = left.height;
    size_t r, c, e;
    // We must send the outer loop to the threads, so the outer loop must be 'e'
    // Will create one thread per row
    thrd_t threads[product.height];
    void* args[product.height][7];
    for (size_t c = 0; c < product.width; ++c) {
        args[c][0] = (void*)c;
        args[c][1] = (void*)(&product);
        args[c][2] = (void*)(&left);
        args[c][3] = (void*)(&right);
        args[c][4] = (void*)(&product.width);
        args[c][5] = (void*)(&product.height);
        args[c][6] = (void*)(&c);
        int res = thrd_create(&threads[c], (thrd_start_t)sum_into_re, args[c]);
        if (res == thrd_error) {
            printf("Error: thread create failed on %lu\n", r);
        }
    }
    for (size_t c = 0; c < product.width; ++c) {
        int res = thrd_join(threads[c], NULL);
        if (res == thrd_error) {
            printf("Error: thread %lu failed to join\n", r);
        }
    }
    return product;
}

void* transpose_row(void* void_args) {
    void ** args = (void**)void_args;
    size_t thread_id = (size_t)(args[0]);
    double* row = (double*)(args[1]);
    Matrix* source = (Matrix*)(args[2]);
    size_t width = *(size_t*)(args[3]);
    size_t column = (size_t)(args[4]);
    // Copy from the column into the row
    for (size_t r = 0; r < width; ++r) {
        row[r] = source->rows[r][column];
    }
    return NULL;
}

void* sum_into_transposed(void* void_args) {
    void ** args = (void**)void_args;
    size_t thread_id = (size_t)(args[0]);
    double* row = (double*)(args[1]);
    double* left_row = (double*)(args[2]);
    Matrix* right = (Matrix*)(args[3]);
    size_t width = (size_t)(args[4]);
    size_t elements = (size_t)(args[5]);
    for (size_t c = 0; c < width; ++c) {
        double sum = 0.0;
        for (size_t e = 0; e < elements; ++e) {
            // Unthreaded: sum += left.rows[r][e] * transpose_right.rows[c][e];
            sum += left_row[e] * right->rows[c][e];
        }
        // Unthreaded: product.rows[r][c] += sum;
        row[c] += sum;
    }
    return NULL;
}

Matrix dotProduct_thread_transpose_rce(Matrix left, Matrix right) {
    // If the sizes don't match, return a matrix with NULL data and 0 size.
    if (left.width != right.height) {
        Matrix empty = newMatrix(0, 0);
        return empty;
    }
    // Make a transposed matrix
    thrd_t t_threads[right.height];
    void* t_args[max_threads][5];
    Matrix transpose_right = newMatrix(right.width, right.height);
    for (size_t r = 0; r < right.height; r+=max_threads) {
        // Making this threaded: transpose_right.rows[r][c] = right.rows[c][r];
        // Start a new batch of threads
        for (size_t t = 0; t < max_threads && r+t<right.height; ++t) {
            t_args[t][0] = (void*)t;
            t_args[t][1] = (void*)(transpose_right.rows[r+t]);
            t_args[t][2] = (void*)(&right);
            t_args[t][3] = (void*)(&transpose_right.width);
            t_args[t][4] = (void*)(r+t);
            int res = thrd_create(&t_threads[t], (thrd_start_t)transpose_row, t_args[t]);
            if (res == thrd_error) {
                printf("Error: thread create failed on %lu\n", r+t);
            }
        }
        // Wait for this batch of threads to finish
        for (size_t t = 0; t < max_threads && r+t<right.height; ++t) {
            int res = thrd_join(t_threads[t], NULL);
            if (res == thrd_error) {
                printf("Error: thread %lu failed to join\n", r+t);
            }
        }
    }
    Matrix product = newMatrix(left.height, right.width);
    // Now we can go across the rows of the right matrix rather than down the columns
    // Go row by row on the left and column by column on the right
    // Assume that all matrices are square, so the number of ops is either the height or the width
    void* args[max_threads][6];
    size_t num_elem = left.height;
    thrd_t threads[max_threads];
    void* mm_args[max_threads][5];
    for (size_t r = 0; r < right.height; r+=max_threads) {
        // Start a new batch of threads
        for (size_t t = 0; t < max_threads && r+t<product.height; ++t) {
            args[t][0] = (void*)t;
            args[t][1] = (void*)(product.rows[r+t]);
            args[t][2] = (void*)(left.rows[r+t]);
            args[t][3] = (void*)(&transpose_right);
            args[t][4] = (void*)(product.width);
            args[t][5] = (void*)(product.height);
            int res = thrd_create(&threads[t], (thrd_start_t)sum_into_transposed, args[t]);
            if (res == thrd_error) {
                printf("Error: thread create failed on %lu\n", r+t);
            }
        }
        // Wait for this batch of threads to finish
        for (size_t t = 0; t < max_threads && r+t<product.height; ++t) {
            int res = thrd_join(threads[t], NULL);
            if (res == thrd_error) {
                printf("Error: thread %lu failed to join\n", r+t);
            }
        }
    }
    // We are done with the transposed matrix
    freeMatrix(transpose_right);
    return product;
}

void* asm_sum_into_transposed(void* void_args) {
    void ** args = (void**)void_args;
    size_t thread_id = (size_t)(args[0]);
    double* row = (double*)(args[1]);
    double* left_row = (double*)(args[2]);
    Matrix* right = (Matrix*)(args[3]);
    size_t width = (size_t)(args[4]);
    size_t elements = (size_t)(args[5]);
    for (size_t c = 0; c < width; ++c) {
        // Sum into register ymm2
        // Anything xor with itself is 0.
        __asm__ (
                "vxorpd %%ymm2, %%ymm2, %%ymm2\n\t"
                : // No outputs
                : // No inputs
                : "ymm2" // Tell gcc to stay away from this register (if we turned on optimizations)
               );
        for (size_t e = 0; e+4 < elements; e+=4) {
            // Unthreaded: sum += left.rows[r][e] * transpose_right.rows[c][e];
            // Turn this next instruction into assembly
            // sum += left_row[e] * right->rows[c][e];
            // Note that we could try to preload registers, but it's safer to use them in the instruction.
            //register double* l asm ("r10") = left_row[e];
            //register double* r asm ("r11") = right->rows[c][e];
            __asm__ (
                    // %0 and %1 will come from left_row[e] and right->rows[c][e]
                    "vmovupd (%0), %%ymm0\n\t"
                    "vmovupd (%1), %%ymm1\n\t"
                    "vmulpd %%ymm0, %%ymm1, %%ymm1\n\t"
                    "vaddpd %%ymm1, %%ymm2, %%ymm2\n\t"
                    : // No outputs (we assume rax is safe)
                    : "r" (left_row+e), "r" (right->rows[c]+e)
                    : "ymm0", "ymm1", "ymm2"
                   );
        }
        double block[4];
        __asm__ (
                // %0 and %1 will come from left_row[e] and right->rows[c][e]
                "vmovupd %%ymm2, (%0)\n\t"
                :
                : "r" (block)
                :
               );
        // Finish whatever didn't fit into our 4 double registers
        double sum = 0.0;
        size_t left = elements%4;
        for (size_t e = elements-left; e < elements; ++e) {
            // Unthreaded: sum += left.rows[r][e] * transpose_right.rows[c][e];
            sum += left_row[e] * right->rows[c][e];
        }
        // Unthreaded: product.rows[r][c] += sum;
        row[c] = sum + block[0] + block[1] + block[2] + block[3];
    }
    return NULL;
}

Matrix dotProduct_thread_transpose_asm_rce(Matrix left, Matrix right) {
    // If the sizes don't match, return a matrix with NULL data and 0 size.
    if (left.width != right.height) {
        Matrix empty = newMatrix(0, 0);
        return empty;
    }
    // Make a transposed matrix
    thrd_t t_threads[right.height];
    void* t_args[max_threads][5];
    Matrix transpose_right = newMatrix(right.width, right.height);
    for (size_t r = 0; r < right.height; r+=max_threads) {
        // Making this threaded: transpose_right.rows[r][c] = right.rows[c][r];
        // Start a new batch of threads
        for (size_t t = 0; t < max_threads && r+t<right.height; ++t) {
            t_args[t][0] = (void*)t;
            t_args[t][1] = (void*)(transpose_right.rows[r+t]);
            t_args[t][2] = (void*)(&right);
            t_args[t][3] = (void*)(&transpose_right.width);
            t_args[t][4] = (void*)(r+t);
            int res = thrd_create(&t_threads[t], (thrd_start_t)transpose_row, t_args[t]);
            if (res == thrd_error) {
                printf("Error: thread create failed on %lu\n", r+t);
            }
        }
        // Wait for this batch of threads to finish
        for (size_t t = 0; t < max_threads && r+t<right.height; ++t) {
            int res = thrd_join(t_threads[t], NULL);
            if (res == thrd_error) {
                printf("Error: thread %lu failed to join\n", r+t);
            }
        }
    }
    Matrix product = newMatrix(left.height, right.width);
    // Now we can go across the rows of the right matrix rather than down the columns
    // Go row by row on the left and column by column on the right
    // Assume that all matrices are square, so the number of ops is either the height or the width
    void* args[max_threads][6];
    size_t num_elem = left.height;
    thrd_t threads[max_threads];
    void* mm_args[max_threads][5];
    for (size_t r = 0; r < right.height; r+=max_threads) {
        // Start a new batch of threads
        for (size_t t = 0; t < max_threads && r+t<product.height; ++t) {
            args[t][0] = (void*)t;
            args[t][1] = (void*)(product.rows[r+t]);
            args[t][2] = (void*)(left.rows[r+t]);
            args[t][3] = (void*)(&transpose_right);
            args[t][4] = (void*)(product.width);
            args[t][5] = (void*)(product.height);
            int res = thrd_create(&threads[t], (thrd_start_t)asm_sum_into_transposed, args[t]);
            if (res == thrd_error) {
                printf("Error: thread create failed on %lu\n", r+t);
            }
        }
        // Wait for this batch of threads to finish
        for (size_t t = 0; t < max_threads && r+t<product.height; ++t) {
            int res = thrd_join(threads[t], NULL);
            if (res == thrd_error) {
                printf("Error: thread %lu failed to join\n", r+t);
            }
        }
    }
    // We are done with the transposed matrix
    freeMatrix(transpose_right);
    return product;
}
```

---

## Final Code (main)

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#include "hw3_matrix_lib.h"

// Converts the timespect to milliseconds
typedef struct timespec timespec;
double timespecToMs(timespec t) {
    return 1000.0 * t.tv_sec + 1e-6 * t.tv_nsec;
}

int main(int argc, char** argv) {

if (argc != 3 && argc != 4) {
        printf("Provide an option for which loop to run. Options are 1-12.\n");
        printf("Provide a matrix size greater than or equal to 10.\n");
        printf("For algorithm 11, also provide a number of threads (defaults to 6).\n");
        return 0;
    }
    int opt = atoi(argv[1]);
    if (opt < 1 || opt > 13) {
        printf("Provide an option for which loop to run. Options are 1-12.\n");
        return 0;
    }
    int size = atoi(argv[2]);
    if (size < 10) {
        printf("Provide a matrix size greater than or equal to 10.\n");
        return 0;
    }
    if (argc > 3) {
        // This variable is declared in hw3_matrix_lib.h
        // Global variables like this should generally be avoided as they aren't clear.
        max_threads = atoi(argv[3]);
    }

// This is a function type. Don't worry about it.
    typedef Matrix (*mm_func)(Matrix, Matrix);
    mm_func mmul = NULL;
    switch (opt) {
        case 1:
            mmul = dotProduct_rce;
            printf("Testing rce\n");
            break;
        case 2:
            mmul = dotProduct_rec;
            printf("Testing rec\n");
            break;
        case 3:
            mmul = dotProduct_cre;
            printf("Testing cre\n");
            break;
        case 4:
            mmul = dotProduct_cer;
            printf("Testing cer\n");
            break;
        case 5:
            mmul = dotProduct_erc;
            printf("Testing erc\n");
            break;
        case 6:
            mmul = dotProduct_ecr;
            printf("Testing ecr\n");
            break;
        case 7:
            mmul = dotProduct_block;
            printf("Testing block\n");
            break;
        case 8:
            mmul = dotProduct_transpose_rce;
            printf("Testing transpose rce\n");
            break;
        case 9:
            mmul = dotProduct_thread_rce;
            printf("Testing thread rce\n");
            break;
        case 10:
            mmul = dotProduct_thread_cre;
            printf("Testing thread cre\n");
            printf("Note! CRE should be using a mutex, but currently is not.\n");
            break;
        case 11:
            mmul = dotProduct_thread_rce_max;
            printf("Testing thread rce_max\n");
            break;
        case 12:
            mmul = dotProduct_thread_transpose_rce;
            printf("Testing thread_transpose rce\n");
            break;
        case 13:
            mmul = dotProduct_thread_transpose_asm_rce;
            printf("Testing thread_transpose_asm rce\n");
            break;
    }

Matrix m1 = newMatrix(size, size);
    Matrix m2 = newMatrix(size, size);
    for (int i = 0; i < size*size; ++i) {
        // Fill the matrices with random numbers. Be sure to avoid dividing by 0.
        m1.data[i] = rand()/ ((double)rand()+1);
        m2.data[i] = rand()/ ((double)rand()+1);
    }

// Set up timers and run
    struct timespec tw_begin;
    clock_gettime(CLOCK_MONOTONIC, &tw_begin);
    struct timespec ts_begin;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts_begin);

Matrix out = mmul(m1, m2);

struct timespec tw_end;
    clock_gettime(CLOCK_MONOTONIC, &tw_end);
    struct timespec ts_end;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts_end);

double posix_wall = timespecToMs(tw_end) - timespecToMs(tw_begin);
    double cpu_time = timespecToMs(ts_end) - timespecToMs(ts_begin);
    // Print the result
    switch (opt) {
        case 1:
            printf("rce wall time used for size %i is %.2f ms\n", size, posix_wall);
            printf("rce CPU time used for size %i is %.2f ms\n", size, cpu_time);
            break;
        case 2:
            printf("rec wall time used for size %i is %.2f ms\n", size, posix_wall);
            printf("rec CPU time used for size %i is %.2f ms\n", size, cpu_time);
            break;
        case 3:
            printf("cre wall time used for size %i is %.2f ms\n", size, posix_wall);
            printf("cre CPU time used for size %i is %.2f ms\n", size, cpu_time);
            break;
        case 4:
            printf("cer wall time used for size %i is %.2f ms\n", size, posix_wall);
            printf("cer CPU time used for size %i is %.2f ms\n", size, cpu_time);
            break;
        case 5:
            printf("erc wall time used for size %i is %.2f ms\n", size, posix_wall);
            printf("erc CPU time used for size %i is %.2f ms\n", size, cpu_time);
            break;
        case 6:
            printf("ecr wall time used for size %i is %.2f ms\n", size, posix_wall);
            printf("ecr CPU time used for size %i is %.2f ms\n", size, cpu_time);
            break;
        case 7:
            printf("block wall time used for size %i is %.2f ms\n", size, posix_wall);
            printf("block CPU time used for size %i is %.2f ms\n", size, cpu_time);
            break;
        case 8:
            printf("transpose rce wall time used for size %i is %.2f ms\n", size, posix_wall);
            printf("transpose rce CPU time used for size %i is %.2f ms\n", size, cpu_time);
            break;
        case 9:
            printf("thread rce wall time used for size %i is %.2f ms\n", size, posix_wall);
            printf("thread rce CPU time used for size %i is %.2f ms\n", size, cpu_time);
            break;
        case 10:
            printf("thread cre wall time used for size %i is %.2f ms\n", size, posix_wall);
            printf("thread cre CPU time used for size %i is %.2f ms\n", size, cpu_time);
            break;
        case 11:
            printf("thread rce_max wall time used for size %i is %.2f ms\n", size, posix_wall);
            printf("thread rce_max CPU time used for size %i is %.2f ms\n", size, cpu_time);
            break;
        case 12:
            printf("thread_transpose rce wall time used for size %i is %.2f ms\n", size, posix_wall);
            printf("thread_transpose rce CPU time used for size %i is %.2f ms\n", size, cpu_time);
            break;
        case 13:
            printf("thread_transpose_asm rce wall time used for size %i is %.2f ms\n", size, posix_wall);
            printf("thread_transpose_asm rce CPU time used for size %i is %.2f ms\n", size, cpu_time);
            break;
    }

freeMatrix(m1);
    freeMatrix(m2);
    freeMatrix(out);

return 0;
}
```

---

## Where we started

---

## With transposed matrix

---

## With threads (CPU time)

---

## With threads (wall time)

---

## With SIMD (CPU time)

---

## With SIMD (wall time)

---

## Final comparison (wall time)

---

## Final comparison (wall time)

---

## With gcc's -O2 (wall time)

---

## With gcc's -O2 (wall time)

---

## With gcc's -O2 (wall time)

---

## Moral

* The moral of the story is that optimization is hard!
  * We could probably take some of gcc's optimization to squeeze out a little more
* Hardware is generally adapted to allow better implementations
  * But you may need to know that those optimizations exist
* And if you ever need to design hardware, you'll have to know all the most common use cases!
  * This is why all of the trig functions are supported in hardware

---

## Today's Workloads

* There is a special instruction, called a fused matrix multiply
  * Multiply one vector (or matrix) by another and then add a constant
    * $w^Tx + b$
* It's used in machine learning
  * And, since ML is everywhere, so is that instruction

---

## Result

* You may think that ML is only on GPUs
  * But everyone wants to run them
* From your phone, your laptop, to your smart fridge
* Inefficient implementations cost energy and time
  * So we are seeing wider support for vector and matrix operations in hardware

---

## Next week

* Second in-recitation quiz begins on Wednesday

---

## Example Questions

<div style="text-align: left;">
Which of the following statements about cache misses is true? (see lecture 18)

a. If two addresses do not have the same tag, they cannot cause a conflict miss.

b. If two addresses map onto different sets in a directly mapped cache, they can still cause a conflict miss.

c. If two addresses map onto the same block, they will cause a conflict miss.

d. None of the above.
</div>

---

## Example Questions

<div style="text-align: left;">
In a four-way associative cache, which of the following is definitely true? (see lecture 18)

a. There are four blocks per line.

b. There are four sets per block.

c. There are four blocks per set.

d. There are four sets in the cache.
</div>

---

## Example Questions

<div style="text-align: left;">
A load instruction copies a value from memory into a register. If the L1 hit latency is 4 cycles and the L2 hit latency is 10 cycles, which of the following is true?

a. If the data is in the L1 cache then it will be available after four cycles have passed.

b. If the data is in the L2 cache, then it will be available after 14 cycles have passed.

c. Data cannot be copied from memory into registers.

d. None of the above.
</div>

---

## Example Questions

<div style="text-align: left;">
L1 cache stores data using what technology? (see lecture 17)

a. SRAM

b. DRAM

c. Laser activated magnetic platters.

d. Non-volatile flash
</div>

---

## Example Questions

<div style="text-align: left;">
In a cold cache, what the latency of the instruction `add $rax $rbx` in the ID cycle? Latency means the cycles before it completes. L1 hit latency is 4 cycle, L2 hit latency is 10 cycles, and L3 hit latency if 40 cycles.

a. 1 cycle.

b. 10 cycles.

c. 40 cycles.

d. Longer than the above options.

</div>

---

## Example Questions

Without threading, Which method is **slowest**?

a. r in the outer loop.

b. r in the middle loop.

c. r in the inner loop.


d. All are the same.

</div>
<div class="col">
Given
<pre>
product.rows[r][c] += left.rows[r][e] * right.rows[e][c];
</pre>
and
<pre>
for (size_t r = 0; r < product.height; ++r)
for (size_t e = 0; e < num_elem; ++e)
for (size_t c = 0; c < product.width; ++c)
</pre>

</div>
</div>
</div>

---

## Other topics

* Study the example questions from recitation
* Don't forget how CPUs work

---

## Cache Simulator

* There is a nifty online simulator
  * [https://courses.cs.washington.edu/courses/cse351/cachesim/](https://courses.cs.washington.edu/courses/cse351/cachesim/)
* There are others, but that one is easy to use

<!--
TODO:
Go through some example vector operations
Apply to the same matrix multiply code
ymm registers hold 256-bits, which is 4 doubles. So we should be able to multiply 4 pairs of doubles in a single instruction

Use MOVDQA or MOVQ for vector moves into the vector registers (VMOVDQA64, I guess?)
Then MULPS for vector multiply
https://asm-docs.microagi.org/x86/mulps.html

Or if we want to use the ymm registers
VMASKMOVPD ymm,ymm,m256 does a masked move

Some interesting stuff on this stackoverflow, not really useful for us:
https://stackoverflow.com/questions/40623773/best-way-to-load-store-from-to-general-purpose-registers-to-from-xmm-ymm-registe

This is probably the best example:
https://wiki.osdev.org/AVX2
If we are staying with the same value for one side of the multiply, we could save some loads
; extern void Mul8floats(float* Dest, float* Src)
Mul8floats:
; RCX contains pointer to Src
; RDX contains pointer of Dest
     vmovups ymm0, [rcx]
     vmovups ymm1, [rdx]
     vmulps ymm0, ymm0, ymm1 ; Packed multiply of 8 floats across ymm0 by 8 floats in ymm1 and store result in ymm0
     vmovups [rcx], ymm0 ; Store the result in memory (float* Dest)
     ret
Use vmulpd for double precision instead of vmulps for single. Only 4 floats at a time. And vmovupd to load?

https://www.felixcloutier.com/x86/mulpd

That takes advantage of the operators being pushed into particular registers, as in this example:
https://web.engr.oregonstate.edu/~mjb/cs575/Handouts/simd.vector.1pp.pdf

If we do our own moves within the for loop things should be faster.
We can try both ways, I guess.

Then go through some example quiz2 questions
-->