$ cat /proc/cpuinfo |grep flags flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sev sev_es --- ## Support * iLab cpus only support up to AVX2 * These are 256-bit vector instructions * My laptop is sadly newer :( --- ## Laptop Flags So many avx options flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl xtopology nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold vgif x2avic v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid overflow_recov succor smca fsrm flush_l1d --- ## Compiler Support * Compilers generally add support to use fancy new instructions * But there can be a delay * So if you want to use something cutting edge, you may need to do it yourself * So it's a good thing that we know some assembly! --- ## Features * We'll stick with the ymm...ymm15 registers so you can test on iLab * Each is 256-bits * Doubles are 64-bits, so these hold 4 * That means we can do 4 simultaneous multiplications with a single instruction --- ## The Big Idea * `vmulp` (vector multiply) operates over multiple values in the registers * `vmulps` for single precision, `vmulpd` for double * `vmulpd ymm0, ymm1, ymm2` * Multiplies everything in ymm0 by everything in ymm1 and stores in ymm2 * This could allow us to leave either ymm0 or ymm1 unaltered * Which we've seen in our matrix multiply loops --- ## Graphically --- ## Moving the Data * We obviously don't want to have multiple load instructions for each array element * And the ymm registers have a single name but hold multiple values * So there are also a vector load instructions * `vmovups` and `vmovupd` for single and double precision * And addition * `vaddps` and `vaddpd` --- ## Important Details * Remember that we can drop assembly into our C code * `__asm__` and then the assembly in strings * % means a user-provided variable, and %% refers to an actual register * Also, the first three arguments passed to a function use known registers * %rdi, %rsi, and %rdx * Assuming they're the size of a pointer --- ## Code ```c #include #include // Note that we are going to assume that these each have four doubles. void showMultiply(double a[4], double b[4], double c[4]) { // Going to take advantage of the fact that the arguments are in rdi, rsi, and rdx by default. // Don't turn on compiler optimizations. // If those registers are going to be used, then we should store our arguments on the stack: /* movq %rdi, -8(%rbp) movq %rsi, -16(%rbp) movq %rdx, -24(%rbp) */ // (gcc actually does that in any case) // But if we only use assembly in this function, they should be safe to use. // The double %% indicates a register, a single % means a user-provided variable. __asm__( "vmovupd (%%rdi), %%ymm0\n\t" "vmovupd (%%rsi), %%ymm1\n\t" "vmulpd %%ymm0, %%ymm1, %%ymm2\n\t" "vmovupd %%ymm2, (%%rdx)\n\t" : // No outputs (we assume rax is safe) : // No inputs (we assume rdi and rsi are safe) : ); return; } int main(void) { double a[4] = {3.0, 2.0, 3.0, 4.0}; double b[4] = {4.0, -2.0, 3.0, -1.0}; double c[4] = {0.0}; showMultiply(a, b, c); printf("%lf, %lf, %lf, %lf\n", a[0], a[1], a[2], a[3]); printf("%lf, %lf, %lf, %lf\n", b[0], b[1], b[2], b[3]); printf("%lf, %lf, %lf, %lf\n", c[0], c[1], c[2], c[3]); return 0; } ``` --- ## Results $ gcc lec21_show_vmulp.c $ ./a.out 4 3.000000, 2.000000, 3.000000, 4.000000 4.000000, -2.000000, 3.000000, -1.000000 12.000000, -4.000000, 9.000000, -4.000000 Wow, so cool! --- ## Useful? * You all know what's coming next * Let's revisit matrix multiplication! * But first, let's discuss --- ## Options * Each output value comes from the multiplication of a row and a column * Do we want to want to calculate 4 outputs at a time? * Involves reading 4 rows and columns simultaneously * Or should we calculate a single value in 1/4 the instructions? --- ## Combine with Threads * Doing 4 rows and columns simultaneously may overwhelm our load bandwidth * How would we calculate that? * Compare with latency for 2 movupd loads from memory, 1 movupd back to memory, 1 vmulpd, 1 vaddpd * We already have the threaded transpose code * Changing to 1/4 the instructions should have the same instruction speedup as 4 simultaneous rows and columns * But all memory loads will be contiguous --- ## Threaded Code ```c void* asm_sum_into_transposed(void* void_args) { void ** args = (void**)void_args; size_t thread_id = (size_t)(args[0]); double* row = (double*)(args[1]); double* left_row = (double*)(args[2]); Matrix* right = (Matrix*)(args[3]); size_t width = (size_t)(args[4]); size_t elements = (size_t)(args[5]); for (size_t c = 0; c < width; ++c) { // Sum into register ymm2 // Anything xor with itself is 0. __asm__ ( "vxorpd %%ymm2, %%ymm2, %%ymm2\n\t" : // No outputs : // No inputs : "ymm2" // Tell gcc to stay away from this register (if we turned on optimizations) ); for (size_t e = 0; e+4 < elements; e+=4) { // Unthreaded: sum += left.rows[r][e] * transpose_right.rows[c][e]; // Turn this next instruction into assembly // sum += left_row[e] * right->rows[c][e]; // Note that we could try to preload registers, but it's safer to use them in the instruction. //register double* l asm ("r10") = left_row[e]; //register double* r asm ("r11") = right->rows[c][e]; __asm__ ( // %0 and %1 will come from left_row[e] and right->rows[c][e] "vmovupd (%0), %%ymm0\n\t" "vmovupd (%1), %%ymm1\n\t" "vmulpd %%ymm0, %%ymm1, %%ymm1\n\t" "vaddpd %%ymm1, %%ymm2, %%ymm2\n\t" : // No outputs (we assume rax is safe) : "r" (left_row+e), "r" (right->rows[c]+e) : "ymm0", "ymm1", "ymm2" ); } double block[4]; __asm__ ( // %0 and %1 will come from left_row[e] and right->rows[c][e] "vmovupd %%ymm2, (%0)\n\t" : : "r" (block) : ); // Finish whatever didn't fit into our 4 double registers double sum = 0.0; size_t left = elements%4; for (size_t e = elements-left; e < elements; ++e) { // Unthreaded: sum += left.rows[r][e] * transpose_right.rows[c][e]; sum += left_row[e] * right->rows[c][e]; } // Unthreaded: product.rows[r][c] += sum; row[c] = sum + block[0] + block[1] + block[2] + block[3]; } return NULL; } ``` --- ## Final Results * Let's look at all of our results * And the compare to the compiler with -O2! * First, here's the final (debugged) code --- ## Final Code (matrix) ```c Matrix dotProduct_rce(Matrix left, Matrix right) { // If the sizes don't match, return a matrix with NULL data and 0 size. if (left.width != right.height) { Matrix empty = newMatrix(0, 0); return empty; } Matrix product = newMatrix(left.height, right.width); // Go row by row on the left and column by column on the right // Assume that all matrices are square, so the number of ops is either the height or the width size_t num_elem = left.height; for (size_t r = 0; r < product.height; ++r) { for (size_t c = 0; c < product.width; ++c) { double sum = 0.0; for (size_t e = 0; e < num_elem; ++e) { sum += left.rows[r][e] * right.rows[e][c]; } product.rows[r][c] += sum; } } return product; } Matrix dotProduct_rec(Matrix left, Matrix right) { // If the sizes don't match, return a matrix with NULL data and 0 size. if (left.width != right.height) { Matrix empty = newMatrix(0, 0); return empty; } Matrix product = newMatrix(left.height, right.width); // Go row by row on the left and column by column on the right // Assume that all matrices are square, so the number of ops is either the height or the width size_t num_elem = left.height; for (size_t r = 0; r < product.height; ++r) { for (size_t e = 0; e < num_elem; ++e) { for (size_t c = 0; c < product.width; ++c) { product.rows[r][c] += left.rows[r][e] * right.rows[e][c]; } } } return product; } Matrix dotProduct_cre(Matrix left, Matrix right) { // If the sizes don't match, return a matrix with NULL data and 0 size. if (left.width != right.height) { Matrix empty = newMatrix(0, 0); return empty; } Matrix product = newMatrix(left.height, right.width); // Go row by row on the left and column by column on the right // Assume that all matrices are square, so the number of ops is either the height or the width size_t num_elem = left.height; for (size_t c = 0; c < product.width; ++c) { for (size_t r = 0; r < product.height; ++r) { double sum = 0.0; for (size_t e = 0; e < num_elem; ++e) { sum += left.rows[r][e] * right.rows[e][c]; } product.rows[r][c] += sum; } } return product; } Matrix dotProduct_cer(Matrix left, Matrix right) { // If the sizes don't match, return a matrix with NULL data and 0 size. if (left.width != right.height) { Matrix empty = newMatrix(0, 0); return empty; } Matrix product = newMatrix(left.height, right.width); // Go row by row on the left and column by column on the right // Assume that all matrices are square, so the number of ops is either the height or the width size_t num_elem = left.height; for (size_t c = 0; c < product.width; ++c) { for (size_t e = 0; e < num_elem; ++e) { for (size_t r = 0; r < product.height; ++r) { product.rows[r][c] += left.rows[r][e] * right.rows[e][c]; } } } return product; } Matrix dotProduct_erc(Matrix left, Matrix right) { // If the sizes don't match, return a matrix with NULL data and 0 size. if (left.width != right.height) { Matrix empty = newMatrix(0, 0); return empty; } Matrix product = newMatrix(left.height, right.width); // Go row by row on the left and column by column on the right // Assume that all matrices are square, so the number of ops is either the height or the width size_t num_elem = left.height; for (size_t e = 0; e < num_elem; ++e) { for (size_t r = 0; r < product.height; ++r) { for (size_t c = 0; c < product.width; ++c) { product.rows[r][c] += left.rows[r][e] * right.rows[e][c]; } } } return product; } Matrix dotProduct_ecr(Matrix left, Matrix right) { // If the sizes don't match, return a matrix with NULL data and 0 size. if (left.width != right.height) { Matrix empty = newMatrix(0, 0); return empty; } Matrix product = newMatrix(left.height, right.width); // Go row by row on the left and column by column on the right // Assume that all matrices are square, so the number of ops is either the height or the width size_t num_elem = left.height; for (size_t e = 0; e < num_elem; ++e) { for (size_t c = 0; c < product.width; ++c) { for (size_t r = 0; r < product.height; ++r) { product.rows[r][c] += left.rows[r][e] * right.rows[e][c]; } } } return product; } size_t min(size_t a, size_t b) { if (a < b) { return a; } else { return b; } } Matrix dotProduct_block(Matrix left, Matrix right) { // If the sizes don't match, return a matrix with NULL data and 0 size. if (left.width != right.height) { Matrix empty = newMatrix(0, 0); return empty; } //size_t block = 64 / sizeof(double); size_t block = 128; Matrix product = newMatrix(left.height, right.width); // Go row by row on the left and column by column on the right // Assume that all matrices are square, so the number of ops is either the height or the width size_t num_elem = left.height; size_t r = 0; size_t c = 0; size_t e = 0; for (r = 0; r < product.height; r+=block) { for (c = 0; c < product.width; c+=block) { for (e = 0; e < num_elem; e+=block) { // B x B block matrix multiplications size_t r_end = min(r+block, product.height); size_t c_end = min(c+block, product.width); size_t e_end = min(e+block, num_elem); for (size_t r1 = r; r1 < r_end; ++r1) { for (size_t c1 = c; c1 < c_end; ++c1) { double sum = 0; for (size_t e1 = e; e1 < e_end; ++e1) { sum += left.rows[r1][e1] * right.rows[e1][c1]; } product.rows[r1][c1] += sum; } } } } } return product; } Matrix dotProduct_transpose_rce(Matrix left, Matrix right) { // If the sizes don't match, return a matrix with NULL data and 0 size. if (left.width != right.height) { Matrix empty = newMatrix(0, 0); return empty; } // Make a transposed matrix Matrix transpose_right = newMatrix(right.width, right.height); for (size_t i = 0; i < right.height; ++i) { for (size_t j = 0; j < right.width; ++j) { transpose_right.rows[j][i] = right.rows[i][j]; } } Matrix product = newMatrix(left.height, right.width); // Now we can go across the rows of the right matrix rather than down the columns // Go row by row on the left and column by column on the right // Assume that all matrices are square, so the number of ops is either the height or the width size_t num_elem = left.height; for (size_t r = 0; r < product.height; ++r) { for (size_t c = 0; c < product.width; ++c) { double sum = 0.0; for (size_t e = 0; e < num_elem; ++e) { // This was previously sum += left.rows[r][e] * right.rows[e][c]; // The transposition puts e into the second index sum += left.rows[r][e] * transpose_right.rows[c][e]; // This is now the same as left.data[r*width+e] * transpose_right.data[c*width + e] } product.rows[r][c] += sum; } } // We are done with the transposed matrix freeMatrix(transpose_right); return product; } void* sum_into_ce(void* void_args) { void ** args = (void**)void_args; size_t thread_id = (size_t)(args[0]); double* row = (double*)(args[1]); double* left_row = (double*)(args[2]); Matrix* right = (Matrix*)(args[3]); int width = *(int*)(args[4]); int height = *(int*)(args[5]); for (size_t c = 0; c < width; ++c) { double sum = 0.0; for (size_t e = 0; e < height; ++e) { sum += left_row[e] * right->rows[e][c]; } row[c] += sum; } return NULL; } Matrix dotProduct_thread_rce(Matrix left, Matrix right) { // If the sizes don't match, return a matrix with NULL data and 0 size. if (left.width != right.height) { Matrix empty = newMatrix(0, 0); return empty; } Matrix product = newMatrix(left.height, right.width); // Go row by row on the left and column by column on the right // Assume that all matrices are square, so the number of ops is either the height or the width size_t num_elem = left.height; size_t r, c, e; // We must send the outer loop to the threads, so the outer loop must be 'e' // Will create one thread per row thrd_t threads[product.height]; void* args[product.height][6]; for (size_t r = 0; r < product.height; ++r) { args[r][0] = (void*)r; args[r][1] = (void*)(product.rows[r]); args[r][2] = (void*)(left.rows[r]); args[r][3] = (void*)(&right); args[r][4] = (void*)(&product.width); args[r][5] = (void*)(&product.height); int res = thrd_create(&threads[r], (thrd_start_t)sum_into_ce, args[r]); if (res == thrd_error) { printf("Error: thread create failed on %lu\n", r); } } for (size_t r = 0; r < product.height; ++r) { int res = thrd_join(threads[r], NULL); if (res == thrd_error) { printf("Error: thread %lu failed to join\n", r); } } return product; } size_t max_threads = 6; Matrix dotProduct_thread_rce_max(Matrix left, Matrix right) { // If the sizes don't match, return a matrix with NULL data and 0 size. if (left.width != right.height) { Matrix empty = newMatrix(0, 0); return empty; } Matrix product = newMatrix(left.height, right.width); // Go row by row on the left and column by column on the right // Assume that all matrices are square, so the number of ops is either the height or the width size_t num_elem = left.height; // We must send the outer loop to the threads, so the outer loop must be 'e' // Will create one thread per row thrd_t threads[max_threads]; void* args[max_threads][6]; size_t r = 0; for (; r < product.height; r+=max_threads) { // Start a new batch of threads for (size_t t = 0; t < max_threads && r+theight; ++r) { double sum = 0.0; for (size_t e = 0; e < left->width; ++e) { sum += left->rows[r][e] * right->rows[e][c]; } // TODO FIXME This would need a mutex since the c variable is shared over threads. product->rows[r][c] += sum; } return NULL; } // TODO FIXME This would need a mutex Matrix dotProduct_thread_cre(Matrix left, Matrix right) { // If the sizes don't match, return a matrix with NULL data and 0 size. if (left.width != right.height) { Matrix empty = newMatrix(0, 0); return empty; } Matrix product = newMatrix(left.height, right.width); // Go row by row on the left and column by column on the right // Assume that all matrices are square, so the number of ops is either the height or the width size_t num_elem = left.height; size_t r, c, e; // We must send the outer loop to the threads, so the outer loop must be 'e' // Will create one thread per row thrd_t threads[product.height]; void* args[product.height][7]; for (size_t c = 0; c < product.width; ++c) { args[c][0] = (void*)c; args[c][1] = (void*)(&product); args[c][2] = (void*)(&left); args[c][3] = (void*)(&right); args[c][4] = (void*)(&product.width); args[c][5] = (void*)(&product.height); args[c][6] = (void*)(&c); int res = thrd_create(&threads[c], (thrd_start_t)sum_into_re, args[c]); if (res == thrd_error) { printf("Error: thread create failed on %lu\n", r); } } for (size_t c = 0; c < product.width; ++c) { int res = thrd_join(threads[c], NULL); if (res == thrd_error) { printf("Error: thread %lu failed to join\n", r); } } return product; } void* transpose_row(void* void_args) { void ** args = (void**)void_args; size_t thread_id = (size_t)(args[0]); double* row = (double*)(args[1]); Matrix* source = (Matrix*)(args[2]); size_t width = *(size_t*)(args[3]); size_t column = (size_t)(args[4]); // Copy from the column into the row for (size_t r = 0; r < width; ++r) { row[r] = source->rows[r][column]; } return NULL; } void* sum_into_transposed(void* void_args) { void ** args = (void**)void_args; size_t thread_id = (size_t)(args[0]); double* row = (double*)(args[1]); double* left_row = (double*)(args[2]); Matrix* right = (Matrix*)(args[3]); size_t width = (size_t)(args[4]); size_t elements = (size_t)(args[5]); for (size_t c = 0; c < width; ++c) { double sum = 0.0; for (size_t e = 0; e < elements; ++e) { // Unthreaded: sum += left.rows[r][e] * transpose_right.rows[c][e]; sum += left_row[e] * right->rows[c][e]; } // Unthreaded: product.rows[r][c] += sum; row[c] += sum; } return NULL; } Matrix dotProduct_thread_transpose_rce(Matrix left, Matrix right) { // If the sizes don't match, return a matrix with NULL data and 0 size. if (left.width != right.height) { Matrix empty = newMatrix(0, 0); return empty; } // Make a transposed matrix thrd_t t_threads[right.height]; void* t_args[max_threads][5]; Matrix transpose_right = newMatrix(right.width, right.height); for (size_t r = 0; r < right.height; r+=max_threads) { // Making this threaded: transpose_right.rows[r][c] = right.rows[c][r]; // Start a new batch of threads for (size_t t = 0; t < max_threads && r+trows[c][e]; // Note that we could try to preload registers, but it's safer to use them in the instruction. //register double* l asm ("r10") = left_row[e]; //register double* r asm ("r11") = right->rows[c][e]; __asm__ ( // %0 and %1 will come from left_row[e] and right->rows[c][e] "vmovupd (%0), %%ymm0\n\t" "vmovupd (%1), %%ymm1\n\t" "vmulpd %%ymm0, %%ymm1, %%ymm1\n\t" "vaddpd %%ymm1, %%ymm2, %%ymm2\n\t" : // No outputs (we assume rax is safe) : "r" (left_row+e), "r" (right->rows[c]+e) : "ymm0", "ymm1", "ymm2" ); } double block[4]; __asm__ ( // %0 and %1 will come from left_row[e] and right->rows[c][e] "vmovupd %%ymm2, (%0)\n\t" : : "r" (block) : ); // Finish whatever didn't fit into our 4 double registers double sum = 0.0; size_t left = elements%4; for (size_t e = elements-left; e < elements; ++e) { // Unthreaded: sum += left.rows[r][e] * transpose_right.rows[c][e]; sum += left_row[e] * right->rows[c][e]; } // Unthreaded: product.rows[r][c] += sum; row[c] = sum + block[0] + block[1] + block[2] + block[3]; } return NULL; } Matrix dotProduct_thread_transpose_asm_rce(Matrix left, Matrix right) { // If the sizes don't match, return a matrix with NULL data and 0 size. if (left.width != right.height) { Matrix empty = newMatrix(0, 0); return empty; } // Make a transposed matrix thrd_t t_threads[right.height]; void* t_args[max_threads][5]; Matrix transpose_right = newMatrix(right.width, right.height); for (size_t r = 0; r < right.height; r+=max_threads) { // Making this threaded: transpose_right.rows[r][c] = right.rows[c][r]; // Start a new batch of threads for (size_t t = 0; t < max_threads && r+t #include #include #include "hw3_matrix_lib.h" // Converts the timespect to milliseconds typedef struct timespec timespec; double timespecToMs(timespec t) { return 1000.0 * t.tv_sec + 1e-6 * t.tv_nsec; } int main(int argc, char** argv) { if (argc != 3 && argc != 4) { printf("Provide an option for which loop to run. Options are 1-12.\n"); printf("Provide a matrix size greater than or equal to 10.\n"); printf("For algorithm 11, also provide a number of threads (defaults to 6).\n"); return 0; } int opt = atoi(argv[1]); if (opt < 1 || opt > 13) { printf("Provide an option for which loop to run. Options are 1-12.\n"); return 0; } int size = atoi(argv[2]); if (size < 10) { printf("Provide a matrix size greater than or equal to 10.\n"); return 0; } if (argc > 3) { // This variable is declared in hw3_matrix_lib.h // Global variables like this should generally be avoided as they aren't clear. max_threads = atoi(argv[3]); } // This is a function type. Don't worry about it. typedef Matrix (*mm_func)(Matrix, Matrix); mm_func mmul = NULL; switch (opt) { case 1: mmul = dotProduct_rce; printf("Testing rce\n"); break; case 2: mmul = dotProduct_rec; printf("Testing rec\n"); break; case 3: mmul = dotProduct_cre; printf("Testing cre\n"); break; case 4: mmul = dotProduct_cer; printf("Testing cer\n"); break; case 5: mmul = dotProduct_erc; printf("Testing erc\n"); break; case 6: mmul = dotProduct_ecr; printf("Testing ecr\n"); break; case 7: mmul = dotProduct_block; printf("Testing block\n"); break; case 8: mmul = dotProduct_transpose_rce; printf("Testing transpose rce\n"); break; case 9: mmul = dotProduct_thread_rce; printf("Testing thread rce\n"); break; case 10: mmul = dotProduct_thread_cre; printf("Testing thread cre\n"); printf("Note! CRE should be using a mutex, but currently is not.\n"); break; case 11: mmul = dotProduct_thread_rce_max; printf("Testing thread rce_max\n"); break; case 12: mmul = dotProduct_thread_transpose_rce; printf("Testing thread_transpose rce\n"); break; case 13: mmul = dotProduct_thread_transpose_asm_rce; printf("Testing thread_transpose_asm rce\n"); break; } Matrix m1 = newMatrix(size, size); Matrix m2 = newMatrix(size, size); for (int i = 0; i < size*size; ++i) { // Fill the matrices with random numbers. Be sure to avoid dividing by 0. m1.data[i] = rand()/ ((double)rand()+1); m2.data[i] = rand()/ ((double)rand()+1); } // Set up timers and run struct timespec tw_begin; clock_gettime(CLOCK_MONOTONIC, &tw_begin); struct timespec ts_begin; clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts_begin); Matrix out = mmul(m1, m2); struct timespec tw_end; clock_gettime(CLOCK_MONOTONIC, &tw_end); struct timespec ts_end; clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts_end); double posix_wall = timespecToMs(tw_end) - timespecToMs(tw_begin); double cpu_time = timespecToMs(ts_end) - timespecToMs(ts_begin); // Print the result switch (opt) { case 1: printf("rce wall time used for size %i is %.2f ms\n", size, posix_wall); printf("rce CPU time used for size %i is %.2f ms\n", size, cpu_time); break; case 2: printf("rec wall time used for size %i is %.2f ms\n", size, posix_wall); printf("rec CPU time used for size %i is %.2f ms\n", size, cpu_time); break; case 3: printf("cre wall time used for size %i is %.2f ms\n", size, posix_wall); printf("cre CPU time used for size %i is %.2f ms\n", size, cpu_time); break; case 4: printf("cer wall time used for size %i is %.2f ms\n", size, posix_wall); printf("cer CPU time used for size %i is %.2f ms\n", size, cpu_time); break; case 5: printf("erc wall time used for size %i is %.2f ms\n", size, posix_wall); printf("erc CPU time used for size %i is %.2f ms\n", size, cpu_time); break; case 6: printf("ecr wall time used for size %i is %.2f ms\n", size, posix_wall); printf("ecr CPU time used for size %i is %.2f ms\n", size, cpu_time); break; case 7: printf("block wall time used for size %i is %.2f ms\n", size, posix_wall); printf("block CPU time used for size %i is %.2f ms\n", size, cpu_time); break; case 8: printf("transpose rce wall time used for size %i is %.2f ms\n", size, posix_wall); printf("transpose rce CPU time used for size %i is %.2f ms\n", size, cpu_time); break; case 9: printf("thread rce wall time used for size %i is %.2f ms\n", size, posix_wall); printf("thread rce CPU time used for size %i is %.2f ms\n", size, cpu_time); break; case 10: printf("thread cre wall time used for size %i is %.2f ms\n", size, posix_wall); printf("thread cre CPU time used for size %i is %.2f ms\n", size, cpu_time); break; case 11: printf("thread rce_max wall time used for size %i is %.2f ms\n", size, posix_wall); printf("thread rce_max CPU time used for size %i is %.2f ms\n", size, cpu_time); break; case 12: printf("thread_transpose rce wall time used for size %i is %.2f ms\n", size, posix_wall); printf("thread_transpose rce CPU time used for size %i is %.2f ms\n", size, cpu_time); break; case 13: printf("thread_transpose_asm rce wall time used for size %i is %.2f ms\n", size, posix_wall); printf("thread_transpose_asm rce CPU time used for size %i is %.2f ms\n", size, cpu_time); break; } freeMatrix(m1); freeMatrix(m2); freeMatrix(out); return 0; } ``` --- ## Where we started --- ## With transposed matrix --- ## With threads (CPU time) --- ## With threads (wall time) --- ## With SIMD (CPU time) --- ## With SIMD (wall time) --- ## Final comparison (wall time) --- ## Final comparison (wall time) --- ## With gcc's -O2 (wall time) --- ## With gcc's -O2 (wall time) --- ## With gcc's -O2 (wall time) --- ## Moral * The moral of the story is that optimization is hard! * We could probably take some of gcc's optimization to squeeze out a little more * Hardware is generally adapted to allow better implementations * But you may need to know that those optimizations exist * And if you ever need to design hardware, you'll have to know all the most common use cases! * This is why all of the trig functions are supported in hardware --- ## Today's Workloads * There is a special instruction, called a fused matrix multiply * Multiply one vector (or matrix) by another and then add a constant * $w^Tx + b$ * It's used in machine learning * And, since ML is everywhere, so is that instruction --- ## Result * You may think that ML is only on GPUs * But everyone wants to run them * From your phone, your laptop, to your smart fridge * Inefficient implementations cost energy and time * So we are seeing wider support for vector and matrix operations in hardware --- ## Next week * Second in-recitation quiz begins on Wednesday --- ## Example Questions Which of the following statements about cache misses is true? (see lecture 18) a. If two addresses do not have the same tag, they cannot cause a conflict miss. b. If two addresses map onto different sets in a directly mapped cache, they can still cause a conflict miss. c. If two addresses map onto the same block, they will cause a conflict miss. d. None of the above. --- ## Example Questions In a four-way associative cache, which of the following is definitely true? (see lecture 18) a. There are four blocks per line. b. There are four sets per block. c. There are four blocks per set. d. There are four sets in the cache. --- ## Example Questions A load instruction copies a value from memory into a register. If the L1 hit latency is 4 cycles and the L2 hit latency is 10 cycles, which of the following is true? a. If the data is in the L1 cache then it will be available after four cycles have passed. b. If the data is in the L2 cache, then it will be available after 14 cycles have passed. c. Data cannot be copied from memory into registers. d. None of the above. --- ## Example Questions L1 cache stores data using what technology? (see lecture 17) a. SRAM b. DRAM c. Laser activated magnetic platters. d. Non-volatile flash --- ## Example Questions In a cold cache, what the latency of the instruction `add $rax $rbx` in the ID cycle? Latency means the cycles before it completes. L1 hit latency is 4 cycle, L2 hit latency is 10 cycles, and L3 hit latency if 40 cycles. a. 1 cycle. b. 10 cycles. c. 40 cycles. d. Longer than the above options. --- ## Example Questions Without threading, Which method is **slowest**? a. r in the outer loop. b. r in the middle loop. c. r in the inner loop. d. All are the same. Given product.rows[r][c] += left.rows[r][e] * right.rows[e][c]; and for (size_t r = 0; r < product.height; ++r) for (size_t e = 0; e < num_elem; ++e) for (size_t c = 0; c < product.width; ++c) --- ## Other topics * Study the example questions from recitation * Don't forget how CPUs work --- ## Cache Simulator * There is a nifty online simulator * [https://courses.cs.washington.edu/courses/cse351/cachesim/](https://courses.cs.washington.edu/courses/cse351/cachesim/) * There are others, but that one is easy to use
--- ## Support * iLab cpus only support up to AVX2 * These are 256-bit vector instructions * My laptop is sadly newer :( --- ## Laptop Flags So many avx options flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl xtopology nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold vgif x2avic v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid overflow_recov succor smca fsrm flush_l1d --- ## Compiler Support * Compilers generally add support to use fancy new instructions * But there can be a delay * So if you want to use something cutting edge, you may need to do it yourself * So it's a good thing that we know some assembly! --- ## Features * We'll stick with the ymm...ymm15 registers so you can test on iLab * Each is 256-bits * Doubles are 64-bits, so these hold 4 * That means we can do 4 simultaneous multiplications with a single instruction --- ## The Big Idea * `vmulp` (vector multiply) operates over multiple values in the registers * `vmulps` for single precision, `vmulpd` for double * `vmulpd ymm0, ymm1, ymm2` * Multiplies everything in ymm0 by everything in ymm1 and stores in ymm2 * This could allow us to leave either ymm0 or ymm1 unaltered * Which we've seen in our matrix multiply loops --- ## Graphically --- ## Moving the Data * We obviously don't want to have multiple load instructions for each array element * And the ymm registers have a single name but hold multiple values * So there are also a vector load instructions * `vmovups` and `vmovupd` for single and double precision * And addition * `vaddps` and `vaddpd` --- ## Important Details * Remember that we can drop assembly into our C code * `__asm__` and then the assembly in strings * % means a user-provided variable, and %% refers to an actual register * Also, the first three arguments passed to a function use known registers * %rdi, %rsi, and %rdx * Assuming they're the size of a pointer --- ## Code ```c #include #include // Note that we are going to assume that these each have four doubles. void showMultiply(double a[4], double b[4], double c[4]) { // Going to take advantage of the fact that the arguments are in rdi, rsi, and rdx by default. // Don't turn on compiler optimizations. // If those registers are going to be used, then we should store our arguments on the stack: /* movq %rdi, -8(%rbp) movq %rsi, -16(%rbp) movq %rdx, -24(%rbp) */ // (gcc actually does that in any case) // But if we only use assembly in this function, they should be safe to use. // The double %% indicates a register, a single % means a user-provided variable. __asm__( "vmovupd (%%rdi), %%ymm0\n\t" "vmovupd (%%rsi), %%ymm1\n\t" "vmulpd %%ymm0, %%ymm1, %%ymm2\n\t" "vmovupd %%ymm2, (%%rdx)\n\t" : // No outputs (we assume rax is safe) : // No inputs (we assume rdi and rsi are safe) : ); return; } int main(void) { double a[4] = {3.0, 2.0, 3.0, 4.0}; double b[4] = {4.0, -2.0, 3.0, -1.0}; double c[4] = {0.0}; showMultiply(a, b, c); printf("%lf, %lf, %lf, %lf\n", a[0], a[1], a[2], a[3]); printf("%lf, %lf, %lf, %lf\n", b[0], b[1], b[2], b[3]); printf("%lf, %lf, %lf, %lf\n", c[0], c[1], c[2], c[3]); return 0; } ``` --- ## Results $ gcc lec21_show_vmulp.c $ ./a.out 4 3.000000, 2.000000, 3.000000, 4.000000 4.000000, -2.000000, 3.000000, -1.000000 12.000000, -4.000000, 9.000000, -4.000000 Wow, so cool! --- ## Useful? * You all know what's coming next * Let's revisit matrix multiplication! * But first, let's discuss --- ## Options * Each output value comes from the multiplication of a row and a column * Do we want to want to calculate 4 outputs at a time? * Involves reading 4 rows and columns simultaneously * Or should we calculate a single value in 1/4 the instructions? --- ## Combine with Threads * Doing 4 rows and columns simultaneously may overwhelm our load bandwidth * How would we calculate that? * Compare with latency for 2 movupd loads from memory, 1 movupd back to memory, 1 vmulpd, 1 vaddpd * We already have the threaded transpose code * Changing to 1/4 the instructions should have the same instruction speedup as 4 simultaneous rows and columns * But all memory loads will be contiguous --- ## Threaded Code ```c void* asm_sum_into_transposed(void* void_args) { void ** args = (void**)void_args; size_t thread_id = (size_t)(args[0]); double* row = (double*)(args[1]); double* left_row = (double*)(args[2]); Matrix* right = (Matrix*)(args[3]); size_t width = (size_t)(args[4]); size_t elements = (size_t)(args[5]); for (size_t c = 0; c < width; ++c) { // Sum into register ymm2 // Anything xor with itself is 0. __asm__ ( "vxorpd %%ymm2, %%ymm2, %%ymm2\n\t" : // No outputs : // No inputs : "ymm2" // Tell gcc to stay away from this register (if we turned on optimizations) ); for (size_t e = 0; e+4 < elements; e+=4) { // Unthreaded: sum += left.rows[r][e] * transpose_right.rows[c][e]; // Turn this next instruction into assembly // sum += left_row[e] * right->rows[c][e]; // Note that we could try to preload registers, but it's safer to use them in the instruction. //register double* l asm ("r10") = left_row[e]; //register double* r asm ("r11") = right->rows[c][e]; __asm__ ( // %0 and %1 will come from left_row[e] and right->rows[c][e] "vmovupd (%0), %%ymm0\n\t" "vmovupd (%1), %%ymm1\n\t" "vmulpd %%ymm0, %%ymm1, %%ymm1\n\t" "vaddpd %%ymm1, %%ymm2, %%ymm2\n\t" : // No outputs (we assume rax is safe) : "r" (left_row+e), "r" (right->rows[c]+e) : "ymm0", "ymm1", "ymm2" ); } double block[4]; __asm__ ( // %0 and %1 will come from left_row[e] and right->rows[c][e] "vmovupd %%ymm2, (%0)\n\t" : : "r" (block) : ); // Finish whatever didn't fit into our 4 double registers double sum = 0.0; size_t left = elements%4; for (size_t e = elements-left; e < elements; ++e) { // Unthreaded: sum += left.rows[r][e] * transpose_right.rows[c][e]; sum += left_row[e] * right->rows[c][e]; } // Unthreaded: product.rows[r][c] += sum; row[c] = sum + block[0] + block[1] + block[2] + block[3]; } return NULL; } ``` --- ## Final Results * Let's look at all of our results * And the compare to the compiler with -O2! * First, here's the final (debugged) code --- ## Final Code (matrix) ```c Matrix dotProduct_rce(Matrix left, Matrix right) { // If the sizes don't match, return a matrix with NULL data and 0 size. if (left.width != right.height) { Matrix empty = newMatrix(0, 0); return empty; } Matrix product = newMatrix(left.height, right.width); // Go row by row on the left and column by column on the right // Assume that all matrices are square, so the number of ops is either the height or the width size_t num_elem = left.height; for (size_t r = 0; r < product.height; ++r) { for (size_t c = 0; c < product.width; ++c) { double sum = 0.0; for (size_t e = 0; e < num_elem; ++e) { sum += left.rows[r][e] * right.rows[e][c]; } product.rows[r][c] += sum; } } return product; } Matrix dotProduct_rec(Matrix left, Matrix right) { // If the sizes don't match, return a matrix with NULL data and 0 size. if (left.width != right.height) { Matrix empty = newMatrix(0, 0); return empty; } Matrix product = newMatrix(left.height, right.width); // Go row by row on the left and column by column on the right // Assume that all matrices are square, so the number of ops is either the height or the width size_t num_elem = left.height; for (size_t r = 0; r < product.height; ++r) { for (size_t e = 0; e < num_elem; ++e) { for (size_t c = 0; c < product.width; ++c) { product.rows[r][c] += left.rows[r][e] * right.rows[e][c]; } } } return product; } Matrix dotProduct_cre(Matrix left, Matrix right) { // If the sizes don't match, return a matrix with NULL data and 0 size. if (left.width != right.height) { Matrix empty = newMatrix(0, 0); return empty; } Matrix product = newMatrix(left.height, right.width); // Go row by row on the left and column by column on the right // Assume that all matrices are square, so the number of ops is either the height or the width size_t num_elem = left.height; for (size_t c = 0; c < product.width; ++c) { for (size_t r = 0; r < product.height; ++r) { double sum = 0.0; for (size_t e = 0; e < num_elem; ++e) { sum += left.rows[r][e] * right.rows[e][c]; } product.rows[r][c] += sum; } } return product; } Matrix dotProduct_cer(Matrix left, Matrix right) { // If the sizes don't match, return a matrix with NULL data and 0 size. if (left.width != right.height) { Matrix empty = newMatrix(0, 0); return empty; } Matrix product = newMatrix(left.height, right.width); // Go row by row on the left and column by column on the right // Assume that all matrices are square, so the number of ops is either the height or the width size_t num_elem = left.height; for (size_t c = 0; c < product.width; ++c) { for (size_t e = 0; e < num_elem; ++e) { for (size_t r = 0; r < product.height; ++r) { product.rows[r][c] += left.rows[r][e] * right.rows[e][c]; } } } return product; } Matrix dotProduct_erc(Matrix left, Matrix right) { // If the sizes don't match, return a matrix with NULL data and 0 size. if (left.width != right.height) { Matrix empty = newMatrix(0, 0); return empty; } Matrix product = newMatrix(left.height, right.width); // Go row by row on the left and column by column on the right // Assume that all matrices are square, so the number of ops is either the height or the width size_t num_elem = left.height; for (size_t e = 0; e < num_elem; ++e) { for (size_t r = 0; r < product.height; ++r) { for (size_t c = 0; c < product.width; ++c) { product.rows[r][c] += left.rows[r][e] * right.rows[e][c]; } } } return product; } Matrix dotProduct_ecr(Matrix left, Matrix right) { // If the sizes don't match, return a matrix with NULL data and 0 size. if (left.width != right.height) { Matrix empty = newMatrix(0, 0); return empty; } Matrix product = newMatrix(left.height, right.width); // Go row by row on the left and column by column on the right // Assume that all matrices are square, so the number of ops is either the height or the width size_t num_elem = left.height; for (size_t e = 0; e < num_elem; ++e) { for (size_t c = 0; c < product.width; ++c) { for (size_t r = 0; r < product.height; ++r) { product.rows[r][c] += left.rows[r][e] * right.rows[e][c]; } } } return product; } size_t min(size_t a, size_t b) { if (a < b) { return a; } else { return b; } } Matrix dotProduct_block(Matrix left, Matrix right) { // If the sizes don't match, return a matrix with NULL data and 0 size. if (left.width != right.height) { Matrix empty = newMatrix(0, 0); return empty; } //size_t block = 64 / sizeof(double); size_t block = 128; Matrix product = newMatrix(left.height, right.width); // Go row by row on the left and column by column on the right // Assume that all matrices are square, so the number of ops is either the height or the width size_t num_elem = left.height; size_t r = 0; size_t c = 0; size_t e = 0; for (r = 0; r < product.height; r+=block) { for (c = 0; c < product.width; c+=block) { for (e = 0; e < num_elem; e+=block) { // B x B block matrix multiplications size_t r_end = min(r+block, product.height); size_t c_end = min(c+block, product.width); size_t e_end = min(e+block, num_elem); for (size_t r1 = r; r1 < r_end; ++r1) { for (size_t c1 = c; c1 < c_end; ++c1) { double sum = 0; for (size_t e1 = e; e1 < e_end; ++e1) { sum += left.rows[r1][e1] * right.rows[e1][c1]; } product.rows[r1][c1] += sum; } } } } } return product; } Matrix dotProduct_transpose_rce(Matrix left, Matrix right) { // If the sizes don't match, return a matrix with NULL data and 0 size. if (left.width != right.height) { Matrix empty = newMatrix(0, 0); return empty; } // Make a transposed matrix Matrix transpose_right = newMatrix(right.width, right.height); for (size_t i = 0; i < right.height; ++i) { for (size_t j = 0; j < right.width; ++j) { transpose_right.rows[j][i] = right.rows[i][j]; } } Matrix product = newMatrix(left.height, right.width); // Now we can go across the rows of the right matrix rather than down the columns // Go row by row on the left and column by column on the right // Assume that all matrices are square, so the number of ops is either the height or the width size_t num_elem = left.height; for (size_t r = 0; r < product.height; ++r) { for (size_t c = 0; c < product.width; ++c) { double sum = 0.0; for (size_t e = 0; e < num_elem; ++e) { // This was previously sum += left.rows[r][e] * right.rows[e][c]; // The transposition puts e into the second index sum += left.rows[r][e] * transpose_right.rows[c][e]; // This is now the same as left.data[r*width+e] * transpose_right.data[c*width + e] } product.rows[r][c] += sum; } } // We are done with the transposed matrix freeMatrix(transpose_right); return product; } void* sum_into_ce(void* void_args) { void ** args = (void**)void_args; size_t thread_id = (size_t)(args[0]); double* row = (double*)(args[1]); double* left_row = (double*)(args[2]); Matrix* right = (Matrix*)(args[3]); int width = *(int*)(args[4]); int height = *(int*)(args[5]); for (size_t c = 0; c < width; ++c) { double sum = 0.0; for (size_t e = 0; e < height; ++e) { sum += left_row[e] * right->rows[e][c]; } row[c] += sum; } return NULL; } Matrix dotProduct_thread_rce(Matrix left, Matrix right) { // If the sizes don't match, return a matrix with NULL data and 0 size. if (left.width != right.height) { Matrix empty = newMatrix(0, 0); return empty; } Matrix product = newMatrix(left.height, right.width); // Go row by row on the left and column by column on the right // Assume that all matrices are square, so the number of ops is either the height or the width size_t num_elem = left.height; size_t r, c, e; // We must send the outer loop to the threads, so the outer loop must be 'e' // Will create one thread per row thrd_t threads[product.height]; void* args[product.height][6]; for (size_t r = 0; r < product.height; ++r) { args[r][0] = (void*)r; args[r][1] = (void*)(product.rows[r]); args[r][2] = (void*)(left.rows[r]); args[r][3] = (void*)(&right); args[r][4] = (void*)(&product.width); args[r][5] = (void*)(&product.height); int res = thrd_create(&threads[r], (thrd_start_t)sum_into_ce, args[r]); if (res == thrd_error) { printf("Error: thread create failed on %lu\n", r); } } for (size_t r = 0; r < product.height; ++r) { int res = thrd_join(threads[r], NULL); if (res == thrd_error) { printf("Error: thread %lu failed to join\n", r); } } return product; } size_t max_threads = 6; Matrix dotProduct_thread_rce_max(Matrix left, Matrix right) { // If the sizes don't match, return a matrix with NULL data and 0 size. if (left.width != right.height) { Matrix empty = newMatrix(0, 0); return empty; } Matrix product = newMatrix(left.height, right.width); // Go row by row on the left and column by column on the right // Assume that all matrices are square, so the number of ops is either the height or the width size_t num_elem = left.height; // We must send the outer loop to the threads, so the outer loop must be 'e' // Will create one thread per row thrd_t threads[max_threads]; void* args[max_threads][6]; size_t r = 0; for (; r < product.height; r+=max_threads) { // Start a new batch of threads for (size_t t = 0; t < max_threads && r+theight; ++r) { double sum = 0.0; for (size_t e = 0; e < left->width; ++e) { sum += left->rows[r][e] * right->rows[e][c]; } // TODO FIXME This would need a mutex since the c variable is shared over threads. product->rows[r][c] += sum; } return NULL; } // TODO FIXME This would need a mutex Matrix dotProduct_thread_cre(Matrix left, Matrix right) { // If the sizes don't match, return a matrix with NULL data and 0 size. if (left.width != right.height) { Matrix empty = newMatrix(0, 0); return empty; } Matrix product = newMatrix(left.height, right.width); // Go row by row on the left and column by column on the right // Assume that all matrices are square, so the number of ops is either the height or the width size_t num_elem = left.height; size_t r, c, e; // We must send the outer loop to the threads, so the outer loop must be 'e' // Will create one thread per row thrd_t threads[product.height]; void* args[product.height][7]; for (size_t c = 0; c < product.width; ++c) { args[c][0] = (void*)c; args[c][1] = (void*)(&product); args[c][2] = (void*)(&left); args[c][3] = (void*)(&right); args[c][4] = (void*)(&product.width); args[c][5] = (void*)(&product.height); args[c][6] = (void*)(&c); int res = thrd_create(&threads[c], (thrd_start_t)sum_into_re, args[c]); if (res == thrd_error) { printf("Error: thread create failed on %lu\n", r); } } for (size_t c = 0; c < product.width; ++c) { int res = thrd_join(threads[c], NULL); if (res == thrd_error) { printf("Error: thread %lu failed to join\n", r); } } return product; } void* transpose_row(void* void_args) { void ** args = (void**)void_args; size_t thread_id = (size_t)(args[0]); double* row = (double*)(args[1]); Matrix* source = (Matrix*)(args[2]); size_t width = *(size_t*)(args[3]); size_t column = (size_t)(args[4]); // Copy from the column into the row for (size_t r = 0; r < width; ++r) { row[r] = source->rows[r][column]; } return NULL; } void* sum_into_transposed(void* void_args) { void ** args = (void**)void_args; size_t thread_id = (size_t)(args[0]); double* row = (double*)(args[1]); double* left_row = (double*)(args[2]); Matrix* right = (Matrix*)(args[3]); size_t width = (size_t)(args[4]); size_t elements = (size_t)(args[5]); for (size_t c = 0; c < width; ++c) { double sum = 0.0; for (size_t e = 0; e < elements; ++e) { // Unthreaded: sum += left.rows[r][e] * transpose_right.rows[c][e]; sum += left_row[e] * right->rows[c][e]; } // Unthreaded: product.rows[r][c] += sum; row[c] += sum; } return NULL; } Matrix dotProduct_thread_transpose_rce(Matrix left, Matrix right) { // If the sizes don't match, return a matrix with NULL data and 0 size. if (left.width != right.height) { Matrix empty = newMatrix(0, 0); return empty; } // Make a transposed matrix thrd_t t_threads[right.height]; void* t_args[max_threads][5]; Matrix transpose_right = newMatrix(right.width, right.height); for (size_t r = 0; r < right.height; r+=max_threads) { // Making this threaded: transpose_right.rows[r][c] = right.rows[c][r]; // Start a new batch of threads for (size_t t = 0; t < max_threads && r+trows[c][e]; // Note that we could try to preload registers, but it's safer to use them in the instruction. //register double* l asm ("r10") = left_row[e]; //register double* r asm ("r11") = right->rows[c][e]; __asm__ ( // %0 and %1 will come from left_row[e] and right->rows[c][e] "vmovupd (%0), %%ymm0\n\t" "vmovupd (%1), %%ymm1\n\t" "vmulpd %%ymm0, %%ymm1, %%ymm1\n\t" "vaddpd %%ymm1, %%ymm2, %%ymm2\n\t" : // No outputs (we assume rax is safe) : "r" (left_row+e), "r" (right->rows[c]+e) : "ymm0", "ymm1", "ymm2" ); } double block[4]; __asm__ ( // %0 and %1 will come from left_row[e] and right->rows[c][e] "vmovupd %%ymm2, (%0)\n\t" : : "r" (block) : ); // Finish whatever didn't fit into our 4 double registers double sum = 0.0; size_t left = elements%4; for (size_t e = elements-left; e < elements; ++e) { // Unthreaded: sum += left.rows[r][e] * transpose_right.rows[c][e]; sum += left_row[e] * right->rows[c][e]; } // Unthreaded: product.rows[r][c] += sum; row[c] = sum + block[0] + block[1] + block[2] + block[3]; } return NULL; } Matrix dotProduct_thread_transpose_asm_rce(Matrix left, Matrix right) { // If the sizes don't match, return a matrix with NULL data and 0 size. if (left.width != right.height) { Matrix empty = newMatrix(0, 0); return empty; } // Make a transposed matrix thrd_t t_threads[right.height]; void* t_args[max_threads][5]; Matrix transpose_right = newMatrix(right.width, right.height); for (size_t r = 0; r < right.height; r+=max_threads) { // Making this threaded: transpose_right.rows[r][c] = right.rows[c][r]; // Start a new batch of threads for (size_t t = 0; t < max_threads && r+t #include #include #include "hw3_matrix_lib.h" // Converts the timespect to milliseconds typedef struct timespec timespec; double timespecToMs(timespec t) { return 1000.0 * t.tv_sec + 1e-6 * t.tv_nsec; } int main(int argc, char** argv) { if (argc != 3 && argc != 4) { printf("Provide an option for which loop to run. Options are 1-12.\n"); printf("Provide a matrix size greater than or equal to 10.\n"); printf("For algorithm 11, also provide a number of threads (defaults to 6).\n"); return 0; } int opt = atoi(argv[1]); if (opt < 1 || opt > 13) { printf("Provide an option for which loop to run. Options are 1-12.\n"); return 0; } int size = atoi(argv[2]); if (size < 10) { printf("Provide a matrix size greater than or equal to 10.\n"); return 0; } if (argc > 3) { // This variable is declared in hw3_matrix_lib.h // Global variables like this should generally be avoided as they aren't clear. max_threads = atoi(argv[3]); } // This is a function type. Don't worry about it. typedef Matrix (*mm_func)(Matrix, Matrix); mm_func mmul = NULL; switch (opt) { case 1: mmul = dotProduct_rce; printf("Testing rce\n"); break; case 2: mmul = dotProduct_rec; printf("Testing rec\n"); break; case 3: mmul = dotProduct_cre; printf("Testing cre\n"); break; case 4: mmul = dotProduct_cer; printf("Testing cer\n"); break; case 5: mmul = dotProduct_erc; printf("Testing erc\n"); break; case 6: mmul = dotProduct_ecr; printf("Testing ecr\n"); break; case 7: mmul = dotProduct_block; printf("Testing block\n"); break; case 8: mmul = dotProduct_transpose_rce; printf("Testing transpose rce\n"); break; case 9: mmul = dotProduct_thread_rce; printf("Testing thread rce\n"); break; case 10: mmul = dotProduct_thread_cre; printf("Testing thread cre\n"); printf("Note! CRE should be using a mutex, but currently is not.\n"); break; case 11: mmul = dotProduct_thread_rce_max; printf("Testing thread rce_max\n"); break; case 12: mmul = dotProduct_thread_transpose_rce; printf("Testing thread_transpose rce\n"); break; case 13: mmul = dotProduct_thread_transpose_asm_rce; printf("Testing thread_transpose_asm rce\n"); break; } Matrix m1 = newMatrix(size, size); Matrix m2 = newMatrix(size, size); for (int i = 0; i < size*size; ++i) { // Fill the matrices with random numbers. Be sure to avoid dividing by 0. m1.data[i] = rand()/ ((double)rand()+1); m2.data[i] = rand()/ ((double)rand()+1); } // Set up timers and run struct timespec tw_begin; clock_gettime(CLOCK_MONOTONIC, &tw_begin); struct timespec ts_begin; clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts_begin); Matrix out = mmul(m1, m2); struct timespec tw_end; clock_gettime(CLOCK_MONOTONIC, &tw_end); struct timespec ts_end; clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts_end); double posix_wall = timespecToMs(tw_end) - timespecToMs(tw_begin); double cpu_time = timespecToMs(ts_end) - timespecToMs(ts_begin); // Print the result switch (opt) { case 1: printf("rce wall time used for size %i is %.2f ms\n", size, posix_wall); printf("rce CPU time used for size %i is %.2f ms\n", size, cpu_time); break; case 2: printf("rec wall time used for size %i is %.2f ms\n", size, posix_wall); printf("rec CPU time used for size %i is %.2f ms\n", size, cpu_time); break; case 3: printf("cre wall time used for size %i is %.2f ms\n", size, posix_wall); printf("cre CPU time used for size %i is %.2f ms\n", size, cpu_time); break; case 4: printf("cer wall time used for size %i is %.2f ms\n", size, posix_wall); printf("cer CPU time used for size %i is %.2f ms\n", size, cpu_time); break; case 5: printf("erc wall time used for size %i is %.2f ms\n", size, posix_wall); printf("erc CPU time used for size %i is %.2f ms\n", size, cpu_time); break; case 6: printf("ecr wall time used for size %i is %.2f ms\n", size, posix_wall); printf("ecr CPU time used for size %i is %.2f ms\n", size, cpu_time); break; case 7: printf("block wall time used for size %i is %.2f ms\n", size, posix_wall); printf("block CPU time used for size %i is %.2f ms\n", size, cpu_time); break; case 8: printf("transpose rce wall time used for size %i is %.2f ms\n", size, posix_wall); printf("transpose rce CPU time used for size %i is %.2f ms\n", size, cpu_time); break; case 9: printf("thread rce wall time used for size %i is %.2f ms\n", size, posix_wall); printf("thread rce CPU time used for size %i is %.2f ms\n", size, cpu_time); break; case 10: printf("thread cre wall time used for size %i is %.2f ms\n", size, posix_wall); printf("thread cre CPU time used for size %i is %.2f ms\n", size, cpu_time); break; case 11: printf("thread rce_max wall time used for size %i is %.2f ms\n", size, posix_wall); printf("thread rce_max CPU time used for size %i is %.2f ms\n", size, cpu_time); break; case 12: printf("thread_transpose rce wall time used for size %i is %.2f ms\n", size, posix_wall); printf("thread_transpose rce CPU time used for size %i is %.2f ms\n", size, cpu_time); break; case 13: printf("thread_transpose_asm rce wall time used for size %i is %.2f ms\n", size, posix_wall); printf("thread_transpose_asm rce CPU time used for size %i is %.2f ms\n", size, cpu_time); break; } freeMatrix(m1); freeMatrix(m2); freeMatrix(out); return 0; } ``` --- ## Where we started --- ## With transposed matrix --- ## With threads (CPU time) --- ## With threads (wall time) --- ## With SIMD (CPU time) --- ## With SIMD (wall time) --- ## Final comparison (wall time) --- ## Final comparison (wall time) --- ## With gcc's -O2 (wall time) --- ## With gcc's -O2 (wall time) --- ## With gcc's -O2 (wall time) --- ## Moral * The moral of the story is that optimization is hard! * We could probably take some of gcc's optimization to squeeze out a little more * Hardware is generally adapted to allow better implementations * But you may need to know that those optimizations exist * And if you ever need to design hardware, you'll have to know all the most common use cases! * This is why all of the trig functions are supported in hardware --- ## Today's Workloads * There is a special instruction, called a fused matrix multiply * Multiply one vector (or matrix) by another and then add a constant * $w^Tx + b$ * It's used in machine learning * And, since ML is everywhere, so is that instruction --- ## Result * You may think that ML is only on GPUs * But everyone wants to run them * From your phone, your laptop, to your smart fridge * Inefficient implementations cost energy and time * So we are seeing wider support for vector and matrix operations in hardware --- ## Next week * Second in-recitation quiz begins on Wednesday --- ## Example Questions Which of the following statements about cache misses is true? (see lecture 18) a. If two addresses do not have the same tag, they cannot cause a conflict miss. b. If two addresses map onto different sets in a directly mapped cache, they can still cause a conflict miss. c. If two addresses map onto the same block, they will cause a conflict miss. d. None of the above. --- ## Example Questions In a four-way associative cache, which of the following is definitely true? (see lecture 18) a. There are four blocks per line. b. There are four sets per block. c. There are four blocks per set. d. There are four sets in the cache. --- ## Example Questions A load instruction copies a value from memory into a register. If the L1 hit latency is 4 cycles and the L2 hit latency is 10 cycles, which of the following is true? a. If the data is in the L1 cache then it will be available after four cycles have passed. b. If the data is in the L2 cache, then it will be available after 14 cycles have passed. c. Data cannot be copied from memory into registers. d. None of the above. --- ## Example Questions L1 cache stores data using what technology? (see lecture 17) a. SRAM b. DRAM c. Laser activated magnetic platters. d. Non-volatile flash --- ## Example Questions In a cold cache, what the latency of the instruction `add $rax $rbx` in the ID cycle? Latency means the cycles before it completes. L1 hit latency is 4 cycle, L2 hit latency is 10 cycles, and L3 hit latency if 40 cycles. a. 1 cycle. b. 10 cycles. c. 40 cycles. d. Longer than the above options. --- ## Example Questions Without threading, Which method is **slowest**? a. r in the outer loop. b. r in the middle loop. c. r in the inner loop. d. All are the same. Given product.rows[r][c] += left.rows[r][e] * right.rows[e][c]; and for (size_t r = 0; r < product.height; ++r) for (size_t e = 0; e < num_elem; ++e) for (size_t c = 0; c < product.width; ++c) --- ## Other topics * Study the example questions from recitation * Don't forget how CPUs work --- ## Cache Simulator * There is a nifty online simulator * [https://courses.cs.washington.edu/courses/cse351/cachesim/](https://courses.cs.washington.edu/courses/cse351/cachesim/) * There are others, but that one is easy to use
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl xtopology nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold vgif x2avic v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid overflow_recov succor smca fsrm flush_l1d
$ gcc lec21_show_vmulp.c $ ./a.out 4 3.000000, 2.000000, 3.000000, 4.000000 4.000000, -2.000000, 3.000000, -1.000000 12.000000, -4.000000, 9.000000, -4.000000
product.rows[r][c] += left.rows[r][e] * right.rows[e][c];
for (size_t r = 0; r < product.height; ++r) for (size_t e = 0; e < num_elem; ++e) for (size_t c = 0; c < product.width; ++c)