# CS 211 - Lecture 15

## Computer Architecture

###  Architecture Aware Code

Bernhard Firner

2025-10-21

---

## Review: Parallel Execution

* More throughput
  * The clock can now tick as fast as the slowest stage allows
  * More pipeline stages, faster clock rates!
* But each stage must be fully isolated from the others
  * Pipeline registers store results in between pipeline stages


---

## Pipeline Stages$^*$

1. Instruction fetch (IF)
2. Instruction decode/register fetch (ID)
3. Execute (math or address calculation) (EX)
4. Memory access (MA)
5. Write back (WB)

<p style='font-size:25pt'>$^*$ For our super-simplified example pipeline</p>

---

## Pipeline Costs

* Adding five pipeline stages doesn't give 5x throughput
  * The slowest stage is probably more than 1/5 of the original time
* Pipeline registers also take time to buffer outputs
  * Let's say you cut a 500ps single stage into five 100ps stages
  * If the buffers take 10ps, then each stage takes 110ps
  * $500/110 \approx 4.55$ speedup

---

## Diminishing Returns

* If you cut that same 500ps into ten 50ps stages, what happens?
  * $500/(50+10) \approx 8.33$ speedup
* So our returns diminish quickly
* Not only that, longer pipelines are also harder to fill with instructions

---

## Filling a Pipeline

* Deeper (more stages) and wider (superscalar) pipelines are only good if capacity is used
* Why wouldn't we be able to fill a pipeline?
  * Data hazards (e.g. using a value right after the memory fetch that loads it)
  * Control hazards (jmp, call, ret, etc.)

---

## Data Hazard Mitigations

* If there is a dependency that cannot be resolved, no-op instructions fill the space
  * Called a `bubble`
* There are two ways to mitigate this:
  * Forward the result from one stage to where it is needed, skipping intermediate steps
  * Re-order instructions

---

## Bubble

* Waiting for data means doing nothing
  * Here we are loading something into rax and then we want to use it
* No-ops are inserted into the pipeline, causing a bubble


---

## Data Hazards

* Any time an instruction needs a value there is the potential for a hazard
  * Forwarding may be able to resolve the hazard


---

## Forwarding
<img style="width: 100%" class="r-stretch" src="./figures/execution_alu_alu_forwarding4.png" />

---

## Other Mitigations

* No-ops are a waste of time, but forwarding can't solve all problems
* So use `out of order execution`
  * If there is bandwidth, prefetch extra instructions, find some without dependencies, and squeeze them in
* A re-order buffer (ROB) puts the results back into program order


---

## Control Hazards

* Control hazards have the greatest performance penalty
  * Come from control elements (`if`, `while`, `for`, function calls, etc)
  * Translate to assembly jumps, `call`, `ret`

---

## Control Hazard Difficulties

* For conditional jumps (`jne`, `ja`, etc.)
  * We do not know the result of the comparison before the status register is ready
* We will not know the jump target address until it is calculated
  * Which is only available after EX (stage 3)

---

## What Goes Wrong?

* We can't wait for the EX step to finish
  * May as well load some instructions and hope they are the correct ones
  * This is `speculative execution`

---

## Branch Prediction

* What happens if the guessed branch isn't taken?
  * All progress lost
  * New instructions need to be loaded
* So we want really good `branch prediction`

---

## Branch Prediction Continued

* Modern CPUs achieve prediction rates in the high 90% range
  * They use simple ML, like perceptrons, implemented in hardware

---

## Improved Code

* Now that we know all of this, how can we put it to use?

---

## Turn on Compiler Optimizations

* Compilers are really good
  * But only if you turn on their options!
* For gcc
  * `-O2` optimizes for speed and program size
  * `-O3` optimizes for speed, sacrificing program size
* But your compiler doesn't know how your code will be used

---

## Example

* Which is faster?
  * twiddle1 is 4 memory reads, two additions, and two memory writes
  * twiddle2 is 2 memory reads, a left shift, an addition, and a memory write
* Can the compiler optimize *twiddle1* into *twiddle2*?

</div>
<div class="col">

```c
void twiddle1(int *xp, int *yp) {
    *xp += *yp;
    *xp += *yp;
}

void twiddle2(int *xp, int *yp) {
    *xp += 2 * (*yp);
}
```

</div>
</div>

---

## Example Continued

* The compiler cannot make this optimization.
  * What if *xp* and *yp* point to the same location? Then `twiddle1` quadruples the value while `twiddle2` only triples it
* As the writer of the code, you may know that they will always be different
  * So it is still up to you to write faster code

</div>
<div class="col">

```c
void twiddle1(int *xp, int *yp) {
    *xp += *yp;
    *xp += *yp;
}

void twiddle2(int *xp, int *yp) {
    *xp += 2 * (*yp);
}
```

</div>
</div>

---

## Optimization Not Obfuscation

* Correct and slow is better than fast and wrong
* We've already seen that variables don't exist
  * Just registers
* So feel free to use more variables and make your code more clear

---

## Constants That We Know

* Don't call functions in your loop conditions
* The loop condition exists on a single line, but it is executed multiple times
  * Storing the result in a variable has no cost

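```c
// strlen is re-evaluated on every pass through the loop
for (size_t i = 0; i < strlen(argv[1]); ++i) {
    putchar(argv[1][i]);
}
```

</div>
<div class="col">

```c
// strlen is evaluated once, before the loop starts
size_t len = strlen(argv[1]);
for (size_t i = 0; i < len; ++i) {
    putchar(argv[1][i]);
}
```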

</div>
</div>

---

## Structs Vs Bags of Arguments

* Passing a pointer to a struct always uses 1 register
* Using too many arguments will force the compiler to push them onto the stack
* You generally won't notice, except if this function is called many times
  * But this is true of most optimizations

---

## Program Optimization

* Consider this program:

```c
#include <stdio.h>
#include <stdlib.h>

void addSum(long long* buf, int max) {
    long long sum = 0;
    for (int i = 1; i < max; ++i) {
        sum += i;
    }
    *buf += sum;
}

int main(int argc, char** argv) {
    if (argc < 2) {
        printf("This program requires a positive number as its second argument.\n");
        return 0;
    }
    long long buffer = 0;
    int sum_to = atoi(argv[1]);
    addSum(&buffer, sum_to);
    printf("%lli\n", buffer);
}
```

---

## Optimizations

* Let's see how well the compiler optimizes it with default optimization, `-O2`, and `-O3`

<table>
<tr><th>Optimization level</th><th>default</th><th>-O2</th><th>-O3</th></tr>
<tr><td>user time</td><td>2.228s</td><td>0.302s</td><td>0.195s</td></tr>
</table>

---

## Optimizations

* Now consider this version

```c
#include <stdio.h>
#include <stdlib.h>

void addSum(long long* buf, int max) {
    long long sum = 0;
    for (int i = 1; i < max-1; i+=2) {
        sum += i;
        sum += i+1;
    }
    // If max is even then we missed a number
    if (max % 2 == 0) {
        sum += max-1;
    }
    *buf += sum;
}

```

---

## Optimizations

* Or how about this one?

```C
#include <stdio.h>
#include <stdlib.h>

void addSum(long long* buf, int max) {
    long long sum = 0;
    long long sum2 = 0;
    for (int i = 1; i < max-1; i+=2) {
        sum += i;
        sum2 += i+1;
    }
    // If max is even then we missed a number
    if (max % 2 == 0) {
        sum += max-1;
    }
    *buf += sum + sum2;
}

```

---

## Progress

* We halved the number of loop iterations in versions 2 and 3
  * Summing into an additional variable doesn't help
* If the branch were often mispredicted, fewer iterations would let us fill the pipeline better, but this loop branch is predictable

---

## More Versions

* What if we pull the math out of the for loop?

```c
#include <stdio.h>
#include <stdlib.h>

void addSum(long long* buf, int max) {
    long long sum = 0;
    long long sum2 = 0;
    int end = max - 1;
    for (int i = 1; i < end; i+=2) {
        sum += i;
        sum2 += i+1;
    }
    // If max is even then we missed a number
    if (max % 2 == 0) {
        sum += max-1;
    }
    *buf += sum + sum2;
}
```

---

## Run Times

* Faster without optimization flags, but this is an optimization the compiler can do itself

---

## More Versions?

* Loop unrolling

```c
#include <stdio.h>
#include <stdlib.h>

void addSum(long long* buf, int max) {
    long long sums[] = {0, 0, 0, 0};
    int i = 1;
    int end = max - 3;
    for (; i < end; i+=4) {
        sums[0] += i;
        sums[1] += i+1;
        sums[2] += i+2;
        sums[3] += i+3;
    }
    // Fill in the rest of the values
    for (; i < max; ++i) {
        sums[0] += i;
    }
    *buf += sums[0] + sums[1] + sums[2] + sums[3];
}
```

---

## More Versions?

* Even more unrolling

```c
#include <stdio.h>
#include <stdlib.h>

void addSum(long long* buf, int max) {
    long long sums[] = {0, 0, 0, 0, 0, 0, 0, 0};
    int i = 1;
    int end = max - 7;
    for (; i < end; i+=8) {
        sums[0] += i;
        sums[1] += i+1;
        sums[2] += i+2;
        sums[3] += i+3;
        sums[4] += i+4;
        sums[5] += i+5;
        sums[6] += i+6;
        sums[7] += i+7;
    }
    // Fill in the rest of the values
    for (; i < max; ++i) {
        sums[0] += i;
    }
    *buf += sums[0] + sums[1] + sums[2] + sums[3] +  sums[4] + sums[5] + sums[6] + sums[7];
}

```

---

## Run Times

* Obviously loop unrolling is known to the compiler
  * Turned on in `-O3`

<table>
<tr><th>Optimization level</th><th>default</th><th>-O2</th><th>-O3</th></tr>
<tr><td>version 1</td><td>2.228s</td><td>0.302s</td><td>0.195s</td></tr>
<tr><td>version 2</td><td>1.303s</td><td>0.294s</td><td>0.183s</td></tr>
<tr><td>version 3</td><td>1.298s</td><td>0.296s</td><td>0.180s</td></tr>
<tr><td>version 4</td><td>0.759s</td><td>0.294s</td><td>0.180s</td></tr>
<tr><td>version 5</td><td>0.607s</td><td>0.196s</td><td>0.186s</td></tr>
<tr><td>version 6</td><td>0.435s</td><td>0.175s</td><td>0.173s</td></tr>
</table>

---

## Pipeline Filling

* Did we need the extra variables to enable wider execution?

```C
#include <stdio.h>
#include <stdlib.h>

void addSum(long long* buf, int max) {
    long long sum = 0;
    int i = 1;
    int end = max - 7;
    for (; i < end; i+=8) {
        sum += i;
        sum += i+1;
        sum += i+2;
        sum += i+3;
        sum += i+4;
        sum += i+5;
        sum += i+6;
        sum += i+7;
    }
    // Fill in the rest of the values
    for (; i < max; ++i) {
        sum += i;
    }
    *buf += sum;
}

```

---

## O2 Anomaly

* The version 7 `-O2` code is different from version 6
* It is faster at `-O2`, but then it gets slower at `-O3`
  * `-O3` isn't always better.
  * It applies many optimizations that increase code size, which can slow down instruction loading

---

## Version 6, -O2

* See how it puts values into 8 registers? So smart!

```asm
	.file	"add_sum6.c"
	.text
	.p2align 4
	.globl	addSum
	.type	addSum, @function
addSum:
.LFB39:
	.cfi_startproc
	endbr64
	pushq	%r12
	.cfi_def_cfa_offset 16
	.cfi_offset 12, -16
	movq	%rdi, %r8
	movl	%esi, %r12d
	pushq	%rbp
	.cfi_def_cfa_offset 24
	.cfi_offset 6, -24
	leal	-7(%rsi), %ebp
	pushq	%rbx
	.cfi_def_cfa_offset 32
	.cfi_offset 3, -32
	cmpl	$1, %ebp
	jle	.L6
	movl	$1, %eax
	xorl	%r10d, %r10d
	xorl	%r11d, %r11d
	xorl	%ebx, %ebx
	xorl	%r9d, %r9d
	xorl	%edi, %edi
	xorl	%esi, %esi
	xorl	%ecx, %ecx
	xorl	%edx, %edx
	.p2align 4,,10
	.p2align 3
.L3:
	addq	%rax, %rdx
	leaq	1(%rcx,%rax), %rcx
	leaq	2(%rsi,%rax), %rsi
	leaq	3(%rdi,%rax), %rdi
	leaq	4(%r9,%rax), %r9
	leaq	5(%rbx,%rax), %rbx
	leaq	6(%r11,%rax), %r11
	leaq	7(%r10,%rax), %r10
	addq	$8, %rax
	cmpl	%eax, %ebp
	jg	.L3
	leal	-9(%r12), %ebp
	andl	$-8, %ebp
	addl	$9, %ebp
.L2:
	cmpl	%ebp, %r12d
	jle	.L4
	subl	%ebp, %r12d
	movslq	%ebp, %rax
	leaq	(%r12,%rax), %rbp
	andl	$1, %r12d
	je	.L5
	addq	%rax, %rdx
	addq	$1, %rax
	cmpq	%rax, %rbp
	je	.L4
	.p2align 4,,10
	.p2align 3
.L5:
	leaq	1(%rdx,%rax,2), %rdx
	addq	$2, %rax
	cmpq	%rax, %rbp
	jne	.L5
.L4:
	leaq	(%rdx,%rcx), %rax
	addq	%rsi, %rax
	addq	%rdi, %rax
	addq	%r9, %rax
	addq	%rbx, %rax
	popq	%rbx
	.cfi_remember_state
	.cfi_def_cfa_offset 24
	popq	%rbp
	.cfi_def_cfa_offset 16
	addq	%r11, %rax
	popq	%r12
	.cfi_def_cfa_offset 8
	addq	%r10, %rax
	addq	%rax, (%r8)
	ret
	.p2align 4,,10
	.p2align 3
.L6:
	.cfi_restore_state
	xorl	%r10d, %r10d
	xorl	%r11d, %r11d
	xorl	%ebx, %ebx
	xorl	%r9d, %r9d
	xorl	%edi, %edi
	xorl	%esi, %esi
	xorl	%ecx, %ecx
	xorl	%edx, %edx
	movl	$1, %ebp
	jmp	.L2
	.cfi_endproc
.LFE39:
	.size	addSum, .-addSum
	.section	.rodata.str1.8,"aMS",@progbits,1
	.align 8
.LC0:
	.string	"This program requires a positive number as its second argument."
	.section	.rodata.str1.1,"aMS",@progbits,1
.LC1:
	.string	"%lli\n"
	.section	.text.startup,"ax",@progbits
	.p2align 4
	.globl	main
	.type	main, @function
main:
.LFB40:
	.cfi_startproc
	endbr64
	subq	$24, %rsp
	.cfi_def_cfa_offset 32
	movq	%fs:40, %rax
	movq	%rax, 8(%rsp)
	xorl	%eax, %eax
	cmpl	$1, %edi
	jle	.L21
	movq	8(%rsi), %rdi
	movl	$10, %edx
	xorl	%esi, %esi
	movq	$0, (%rsp)
	call	strtol@PLT
	movq	%rsp, %rdi
	movl	%eax, %esi
	call	addSum
	movq	(%rsp), %rdx
	movl	$2, %edi
	xorl	%eax, %eax
	leaq	.LC1(%rip), %rsi
	call	__printf_chk@PLT
.L18:
	movq	8(%rsp), %rax
	subq	%fs:40, %rax
	jne	.L22
	xorl	%eax, %eax
	addq	$24, %rsp
	.cfi_remember_state
	.cfi_def_cfa_offset 8
	ret
.L21:
	.cfi_restore_state
	leaq	.LC0(%rip), %rdi
	call	puts@PLT
	jmp	.L18
.L22:
	call	__stack_chk_fail@PLT
	.cfi_endproc
.LFE40:
	.size	main, .-main
	.ident	"GCC: (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0"
	.section	.note.GNU-stack,"",@progbits
	.section	.note.gnu.property,"a"
	.align 8
	.long	1f - 0f
	.long	4f - 1f
	.long	5
0:
	.string	"GNU"
1:
	.align 8
	.long	0xc0000002
	.long	3f - 2f
2:
	.long	0x3
3:
	.align 8
4:
```

---

## Version 7, -O2

```asm
	.file	"add_sum7.c"
	.text
	.p2align 4
	.globl	addSum
	.type	addSum, @function
addSum:
.LFB39:
	.cfi_startproc
	endbr64
	leal	-7(%rsi), %ecx
	cmpl	$1, %ecx
	jle	.L6
	movl	$1, %eax
	xorl	%edx, %edx
	.p2align 4,,10
	.p2align 3
.L3:
	leaq	28(%rdx,%rax,8), %rdx
	addq	$8, %rax
	cmpl	%eax, %ecx
	jg	.L3
	leal	-9(%rsi), %ecx
	andl	$-8, %ecx
	addl	$9, %ecx
.L2:
	cmpl	%ecx, %esi
	jle	.L4
	subl	%ecx, %esi
	movslq	%ecx, %rax
	leaq	(%rsi,%rax), %rcx
	andl	$1, %esi
	je	.L5
	addq	%rax, %rdx
	addq	$1, %rax
	cmpq	%rax, %rcx
	je	.L4
	.p2align 4,,10
	.p2align 3
.L5:
	leaq	1(%rdx,%rax,2), %rdx
	addq	$2, %rax
	cmpq	%rax, %rcx
	jne	.L5
.L4:
	addq	%rdx, (%rdi)
	ret
	.p2align 4,,10
	.p2align 3
.L6:
	movl	$1, %ecx
	xorl	%edx, %edx
	jmp	.L2
	.cfi_endproc
.LFE39:
	.size	addSum, .-addSum
	.section	.rodata.str1.8,"aMS",@progbits,1
	.align 8
.LC0:
	.string	"This program requires a positive number as its second argument."
	.section	.rodata.str1.1,"aMS",@progbits,1
.LC1:
	.string	"%lli\n"
	.section	.text.startup,"ax",@progbits
	.p2align 4
	.globl	main
	.type	main, @function
main:
.LFB40:
	.cfi_startproc
	endbr64
	subq	$24, %rsp
	.cfi_def_cfa_offset 32
	movq	%fs:40, %rax
	movq	%rax, 8(%rsp)
	xorl	%eax, %eax
	cmpl	$1, %edi
	jle	.L20
	movq	8(%rsi), %rdi
	movl	$10, %edx
	xorl	%esi, %esi
	movq	$0, (%rsp)
	call	strtol@PLT
	movq	%rsp, %rdi
	movl	%eax, %esi
	call	addSum
	movq	(%rsp), %rdx
	movl	$2, %edi
	xorl	%eax, %eax
	leaq	.LC1(%rip), %rsi
	call	__printf_chk@PLT
.L17:
	movq	8(%rsp), %rax
	subq	%fs:40, %rax
	jne	.L21
	xorl	%eax, %eax
	addq	$24, %rsp
	.cfi_remember_state
	.cfi_def_cfa_offset 8
	ret
.L20:
	.cfi_restore_state
	leaq	.LC0(%rip), %rdi
	call	puts@PLT
	jmp	.L17
.L21:
	call	__stack_chk_fail@PLT
	.cfi_endproc
.LFE40:
	.size	main, .-main
	.ident	"GCC: (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0"
	.section	.note.GNU-stack,"",@progbits
	.section	.note.gnu.property,"a"
	.align 8
	.long	1f - 0f
	.long	4f - 1f
	.long	5
0:
	.string	"GNU"
1:
	.align 8
	.long	0xc0000002
	.long	3f - 2f
2:
	.long	0x3
3:
	.align 8
4:
```

---

## Math Optimization

* Once the results were in a single variable, the compiler recognized that the eight additions collapse into one expression
* `leaq` (load effective address) is an address calculation that does addition and multiplication in one step
  * `28(%rdx,%rax,8)` calculates `rdx+28+8*rax`
  * `%rdx` is `sum` and `%rax` is `i`
  * Equivalent to `sum += 8*i + 28`

</div>
<div class="col">

```asm
.L3:
	leaq	28(%rdx,%rax,8), %rdx
	addq	$8, %rax
	cmpl	%eax, %ecx
	jg	.L3
	leal	-9(%rsi), %ecx
	andl	$-8, %ecx
	addl	$9, %ecx
```

</div>
</div>
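
The constant 28 is what remains after the eight unrolled additions of one iteration are added together. As a short worked derivation, with $i$ the loop counter at the start of the iteration:

$$
\sum_{k=0}^{7} (i + k) = 8i + \sum_{k=0}^{7} k = 8i + \frac{7 \cdot 8}{2} = 8i + 28
$$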

---

## O3 Alternative

* `-O3` applies a different set of optimizations
  * Uses the special `xmm` (SIMD multimedia) registers
  * Not an improvement in this case

```asm
	.file	"add_sum7.c"
	.text
	.p2align 4
	.globl	addSum
	.type	addSum, @function
addSum:
.LFB39:
	.cfi_startproc
	endbr64
	leal	-7(%rsi), %r9d
	movl	%esi, %ecx
	cmpl	$1, %r9d
	jle	.L9
	leal	-9(%rsi), %eax
	movl	%eax, %r8d
	shrl	$3, %r8d
	leal	1(%r8), %esi
	cmpl	$23, %eax
	jbe	.L10
	pxor	%xmm3, %xmm3
	movl	%esi, %edx
	pxor	%xmm10, %xmm10
	xorl	%eax, %eax
	shrl	$2, %edx
	movaps	%xmm3, -88(%rsp)
	movdqa	%xmm3, %xmm15
	movdqa	.LC0(%rip), %xmm11
	.p2align 4,,10
	.p2align 3
.L4:
	movdqa	%xmm11, %xmm2
	movdqa	%xmm10, %xmm1
	addl	$1, %eax
	movdqa	.LC2(%rip), %xmm0
	pcmpgtd	%xmm2, %xmm1
	movdqa	%xmm2, %xmm13
	movdqa	.LC3(%rip), %xmm8
	movdqa	.LC4(%rip), %xmm7
	paddd	%xmm2, %xmm0
	movdqa	%xmm10, %xmm14
	movdqa	.LC5(%rip), %xmm6
	movdqa	.LC6(%rip), %xmm5
	paddd	%xmm2, %xmm8
	movdqa	%xmm0, %xmm9
	paddd	%xmm2, %xmm7
	movdqa	.LC7(%rip), %xmm4
	punpckldq	%xmm1, %xmm13
	movaps	%xmm1, -72(%rsp)
	paddd	%xmm2, %xmm6
	paddd	%xmm2, %xmm5
	movdqa	%xmm13, %xmm1
	movdqa	%xmm10, %xmm13
	paddd	%xmm2, %xmm4
	movdqa	.LC8(%rip), %xmm3
	pcmpgtd	%xmm0, %xmm13
	pcmpgtd	%xmm8, %xmm14
	paddq	%xmm15, %xmm1
	movdqa	%xmm10, %xmm15
	movdqa	%xmm10, %xmm12
	paddd	%xmm2, %xmm3
	pcmpgtd	%xmm7, %xmm15
	pcmpgtd	%xmm4, %xmm12
	punpckhdq	-72(%rsp), %xmm2
	punpckldq	%xmm13, %xmm9
	movaps	%xmm14, -40(%rsp)
	paddq	-88(%rsp), %xmm2
	paddd	.LC1(%rip), %xmm11
	paddq	%xmm9, %xmm1
	movdqa	%xmm8, %xmm9
	movaps	%xmm13, -56(%rsp)
	punpckhdq	-56(%rsp), %xmm0
	punpckldq	%xmm14, %xmm9
	punpckhdq	-40(%rsp), %xmm8
	movaps	%xmm15, -24(%rsp)
	paddq	%xmm9, %xmm1
	movdqa	%xmm7, %xmm9
	paddq	%xmm2, %xmm0
	punpckldq	%xmm15, %xmm9
	punpckhdq	-24(%rsp), %xmm7
	paddq	%xmm8, %xmm0
	paddq	%xmm9, %xmm1
	movdqa	%xmm10, %xmm9
	movdqa	%xmm3, %xmm15
	pcmpgtd	%xmm6, %xmm9
	paddq	%xmm7, %xmm0
	movdqa	%xmm9, %xmm14
	movdqa	%xmm6, %xmm9
	punpckldq	%xmm14, %xmm9
	punpckhdq	%xmm14, %xmm6
	paddq	%xmm9, %xmm1
	movdqa	%xmm10, %xmm9
	paddq	%xmm0, %xmm6
	pcmpgtd	%xmm5, %xmm9
	movdqa	%xmm9, %xmm13
	movdqa	%xmm5, %xmm9
	punpckldq	%xmm13, %xmm9
	punpckhdq	%xmm13, %xmm5
	paddq	%xmm9, %xmm1
	movdqa	%xmm4, %xmm9
	paddq	%xmm6, %xmm5
	punpckldq	%xmm12, %xmm9
	punpckhdq	%xmm12, %xmm4
	paddq	%xmm9, %xmm1
	movdqa	%xmm10, %xmm9
	paddq	%xmm5, %xmm4
	pcmpgtd	%xmm3, %xmm9
	punpckhdq	%xmm9, %xmm3
	punpckldq	%xmm9, %xmm15
	paddq	%xmm3, %xmm4
	paddq	%xmm1, %xmm15
	movaps	%xmm4, -88(%rsp)
	cmpl	%edx, %eax
	jne	.L4
	paddq	-88(%rsp), %xmm15
	movdqa	%xmm15, %xmm0
	psrldq	$8, %xmm0
	paddq	%xmm0, %xmm15
	movq	%xmm15, %rax
	testb	$3, %sil
	je	.L6
	andl	$-4, %esi
	leal	1(,%rsi,8), %edx
.L3:
	movslq	%edx, %rsi
	addq	%rax, %rsi
	leal	1(%rdx), %eax
	cltq
	addq	%rsi, %rax
	leal	2(%rdx), %esi
	movslq	%esi, %rsi
	addq	%rax, %rsi
	leal	3(%rdx), %eax
	cltq
	addq	%rsi, %rax
	leal	4(%rdx), %esi
	movslq	%esi, %rsi
	addq	%rax, %rsi
	leal	5(%rdx), %eax
	cltq
	addq	%rsi, %rax
	leal	6(%rdx), %esi
	movslq	%esi, %rsi
	addq	%rax, %rsi
	leal	7(%rdx), %eax
	cltq
	addq	%rsi, %rax
	leal	8(%rdx), %esi
	cmpl	%esi, %r9d
	jle	.L6
	movslq	%esi, %rsi
	addq	%rax, %rsi
	leal	9(%rdx), %eax
	cltq
	addq	%rsi, %rax
	leal	10(%rdx), %esi
	movslq	%esi, %rsi
	addq	%rax, %rsi
	leal	11(%rdx), %eax
	cltq
	addq	%rsi, %rax
	leal	12(%rdx), %esi
	movslq	%esi, %rsi
	addq	%rax, %rsi
	leal	13(%rdx), %eax
	cltq
	addq	%rsi, %rax
	leal	14(%rdx), %esi
	movslq	%esi, %rsi
	addq	%rax, %rsi
	leal	15(%rdx), %eax
	cltq
	addq	%rsi, %rax
	leal	16(%rdx), %esi
	cmpl	%esi, %r9d
	jle	.L6
	movslq	%esi, %rsi
	addq	%rax, %rsi
	leal	17(%rdx), %eax
	cltq
	addq	%rax, %rsi
	leal	18(%rdx), %eax
	cltq
	addq	%rsi, %rax
	leal	19(%rdx), %esi
	movslq	%esi, %rsi
	addq	%rax, %rsi
	leal	20(%rdx), %eax
	cltq
	addq	%rsi, %rax
	leal	21(%rdx), %esi
	movslq	%esi, %rsi
	addq	%rax, %rsi
	leal	22(%rdx), %eax
	addl	$23, %edx
	cltq
	movslq	%edx, %rdx
	addq	%rsi, %rax
	addq	%rdx, %rax
.L6:
	leal	9(,%r8,8), %edx
.L2:
	cmpl	%edx, %ecx
	jle	.L7
	movslq	%edx, %rsi
	addq	%rsi, %rax
	leal	1(%rdx), %esi
	cmpl	%ecx, %esi
	jge	.L7
	movslq	%esi, %rsi
	addq	%rsi, %rax
	leal	2(%rdx), %esi
	cmpl	%esi, %ecx
	jle	.L7
	movslq	%esi, %rsi
	addq	%rsi, %rax
	leal	3(%rdx), %esi
	cmpl	%ecx, %esi
	jge	.L7
	movslq	%esi, %rsi
	addq	%rsi, %rax
	leal	4(%rdx), %esi
	cmpl	%ecx, %esi
	jge	.L7
	movslq	%esi, %rsi
	addq	%rsi, %rax
	leal	5(%rdx), %esi
	cmpl	%esi, %ecx
	jle	.L7
	movslq	%esi, %rsi
	addq	%rsi, %rax
	leal	6(%rdx), %esi
	cmpl	%esi, %ecx
	jle	.L7
	movslq	%esi, %rsi
	addl	$7, %edx
	addq	%rsi, %rax
	movslq	%edx, %rsi
	addq	%rax, %rsi
	cmpl	%edx, %ecx
	cmovg	%rsi, %rax
.L7:
	addq	%rax, (%rdi)
	ret
	.p2align 4,,10
	.p2align 3
.L9:
	movl	$1, %edx
	xorl	%eax, %eax
	jmp	.L2
.L10:
	movl	$1, %edx
	xorl	%eax, %eax
	jmp	.L3
	.cfi_endproc
.LFE39:
	.size	addSum, .-addSum
	.section	.rodata.str1.8,"aMS",@progbits,1
	.align 8
.LC9:
	.string	"This program requires a positive number as its second argument."
	.section	.rodata.str1.1,"aMS",@progbits,1
.LC10:
	.string	"%lli\n"
	.section	.text.startup,"ax",@progbits
	.p2align 4
	.globl	main
	.type	main, @function
main:
.LFB40:
	.cfi_startproc
	endbr64
	subq	$24, %rsp
	.cfi_def_cfa_offset 32
	movq	%fs:40, %rax
	movq	%rax, 8(%rsp)
	xorl	%eax, %eax
	cmpl	$1, %edi
	jle	.L22
	movq	8(%rsi), %rdi
	movl	$10, %edx
	xorl	%esi, %esi
	movq	$0, (%rsp)
	call	strtol@PLT
	movq	%rsp, %rdi
	movl	%eax, %esi
	call	addSum
	movq	(%rsp), %rdx
	movl	$2, %edi
	xorl	%eax, %eax
	leaq	.LC10(%rip), %rsi
	call	__printf_chk@PLT
.L19:
	movq	8(%rsp), %rax
	subq	%fs:40, %rax
	jne	.L23
	xorl	%eax, %eax
	addq	$24, %rsp
	.cfi_remember_state
	.cfi_def_cfa_offset 8
	ret
.L22:
	.cfi_restore_state
	leaq	.LC9(%rip), %rdi
	call	puts@PLT
	jmp	.L19
.L23:
	call	__stack_chk_fail@PLT
	.cfi_endproc
.LFE40:
	.size	main, .-main
	.section	.rodata.cst16,"aM",@progbits,16
	.align 16
.LC0:
	.long	1
	.long	9
	.long	17
	.long	25
	.align 16
.LC1:
	.long	32
	.long	32
	.long	32
	.long	32
	.align 16
.LC2:
	.long	1
	.long	1
	.long	1
	.long	1
	.align 16
.LC3:
	.long	2
	.long	2
	.long	2
	.long	2
	.align 16
.LC4:
	.long	3
	.long	3
	.long	3
	.long	3
	.align 16
.LC5:
	.long	4
	.long	4
	.long	4
	.long	4
	.align 16
.LC6:
	.long	5
	.long	5
	.long	5
	.long	5
	.align 16
.LC7:
	.long	6
	.long	6
	.long	6
	.long	6
	.align 16
.LC8:
	.long	7
	.long	7
	.long	7
	.long	7
	.ident	"GCC: (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0"
	.section	.note.GNU-stack,"",@progbits
	.section	.note.gnu.property,"a"
	.align 8
	.long	1f - 0f
	.long	4f - 1f
	.long	5
0:
	.string	"GNU"
1:
	.align 8
	.long	0xc0000002
	.long	3f - 2f
2:
	.long	0x3
3:
	.align 8
4:
```

---

## Doing Math On Our Own

```C
#include <stdio.h>
#include <stdlib.h>

void addSum(long long* buf, int max) {
    long long sum = 0;
    long long i = 1;
    int end = max - 7;
    for (; i < end; i+=8) {
        sum += 8*i + 1+2+3+4+5+6+7;
    }
    // Fill in the rest of the values
    for (; i < max; ++i) {
        sum += i;
    }
    *buf += sum;
}
```

---

## Doing Better Math

```C
#include <stdio.h>
#include <stdlib.h>

void addSum(long long* buf, int max) {
    // Version 1 adds every integer from 1 to max-1. That sum is the number of
    // terms (max-1) times their average value (max/2), which is max*(max-1)/2.
    // The numerator is always even since either max or max-1 is even.
    long long sum = (long long)max*(max-1)/2;
    *buf += sum;
}
```

---

## Run Times

* We rely upon humans for fundamental optimizations

<table>
<tr><th>Optimization level</th><th>default</th><th>-O2</th><th>-O3</th></tr>
<tr><td>version 1</td><td>2.228s</td><td>0.302s</td><td>0.195s</td></tr>
<tr><td>version 4</td><td>0.759s</td><td>0.294s</td><td>0.180s</td></tr>
<tr><td>version 5</td><td>0.607s</td><td>0.196s</td><td>0.186s</td></tr>
<tr><td>version 6</td><td>0.435s</td><td>0.175s</td><td>0.173s</td></tr>
<tr><td>version 7</td><td>0.532s</td><td>0.076s</td><td>0.188s</td></tr>
<tr><td>version 8</td><td>0.142s</td><td>0.039s</td><td>0.038s</td></tr>
<tr><td>version 9</td><td>0.001s</td><td>0.000s</td><td>0.000s</td></tr>
</table>

---

## Premature Optimization

* So we can, at times, out-think the compiler with a good algorithm
* If we had spent time stressing about that function though, it likely would have been pointless
  * Unless we called this function *a lot*

---

## Best Practices

* In reality, you can't stress about every piece of code
* So first write for correctness
* Then let your compiler optimize
  * [https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html](https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html)
* Then use a tool to profile your code
  * `gprof` or similar
  * Other options for non-C languages

---

## Example Question

**Q** Which of the following will **not** resolve a read-after-write hazard between two instructions?

a. A no-op bubble in between them.

b. Data forwarding from the first to the second.

c. An out of order executed instruction in between them.

d. Branch prediction.

</div>

---

## Example Question

**Q** Consider an fp-8 type with 1 bit sign, 4 bits exponent, and 3 bits significand. Whenever all four exponent bits are set, the value is either NaN or +/- infinity. Exponent bias is -8. What is the maximum value this type can store? You may write the equation rather than reducing it to a single number.

\_\_\_\_\_\_\_\_\_\_\_\_
</div>

---

## Example Question Answer

* The stored exponent ranges over 0-15, and the bias shifts it so that there is one more negative value
  * So -8 to +7
  * The all-ones exponent (+7) is reserved for NaN and infinity, so the largest usable exponent is +6
* The significand is 3 bits, so it is treated as a fraction out of 8
  * The maximum is $\frac{7}{8}$
* The answer is $2^6 \times (1+\frac{7}{8})$, which is 120

---

## Example Question

**Q** Incrementing the largest floating point value less than 1 will:

a. Overflow from the significand and increment the exponent.

b. Result in an exponent value equal to the bias.

c. Leave the value in the significand equal to 0.

d. All of the above.
</div>

---

## Example Question

**Q** What is immediate addressing?

a. When an operand loads data from an offset into the memory referred to by a register.

b. When the operand is specified within the instruction itself.

c. When an operand is within one of the general purpose registers.

d. None of the above.
</div>

---

## Example Question

**Q** Which of these statements is **false** about the x86-64 instruction set?

a. It supports C-style array access through indirect memory access instructions.

b. Compared to other instruction sets, the x86-64 ISA has a large number of instructions.

c. The x86-64 ISA only supports 64-bit operands.

d. The same registers can be referred to with multiple names to use different amounts of the registers.
</div>

<!--

Review of hazards

Review of mitigations
 * forwarding
 * bubbles
 * out of order execution, with reorder buffer at the end
 * speculative execution
 * branch prediction

* Practical improvements:
  * Switch statements
  * inline functions
  * loop improvements
    * iterators instead of offsets?
  * loop unrolling

-->