\n\tOp is one of + - x /\n\tcount is the number of repetitions\n", argv[0]);
return 1;
}
int count = atoi(argv[2]);
char op = argv[1][0];
switch(op) {
case('+'):
add(count);
break;
case('-'):
subtract(count);
break;
case('x'):
multiply(count);
break;
case('/'):
divide(count);
break;
case('>'):
gt(count);
break;
case('='):
equality(count);
break;
default:
break;
}
return 0;
}
```
---
## Note on Timing
* The `time` command measures how long something runs
* Gives `real`, `user`, and `sys` times
* **real**: wall clock time
* **user**: time this took in "userspace"
* Doesn't count time the OS took doing things, like opening files
* Also doesn't count time the process isn't running
* **sys**: Time during operating system calls
* We can just look at user
---
## Differences
$ time ./timing_ints "+" 1000000000
real 0m2.833s
user 0m2.831s
sys 0m0.001s
$ time ./timing_ints "-" 1000000000
real 0m2.841s
user 0m2.840s
sys 0m0.001s
$ time ./timing_floats "+" 1000000000
real 0m3.165s
user 0m3.163s
sys 0m0.002s
$ time ./timing_floats "-" 1000000000
real 0m3.164s
user 0m3.162s
sys 0m0.002s
---
## Multiplication and division
$ time ./timing_ints "x" 1000000000
real 0m2.812s
user 0m2.810s
sys 0m0.002s
$ time ./timing_ints "/" 1000000000
real 0m34.541s
user 0m34.538s
sys 0m0.001s
$ time ./timing_floats "x" 1000000000
real 0m3.199s
user 0m3.197s
sys 0m0.002s
$ time ./timing_floats "/" 1000000000
real 0m34.273s
user 0m34.270s
sys 0m0.001s
---
## Comparisons
$ time ./timing_ints ">" 1000000000
real 0m4.017s
user 0m4.015s
sys 0m0.001s
$ time ./timing_ints "=" 1000000000
real 0m3.995s
user 0m3.994s
sys 0m0.002s
$ time ./timing_floats ">" 1000000000
real 0m4.021s
user 0m4.019s
sys 0m0.002s
$ time ./timing_floats "=" 1000000000
real 0m3.997s
user 0m3.992s
sys 0m0.005s
---
## Magic?
* The floating point operations seem like they should be slower than the integer ones
* More steps, more fields, special cases, etc
* So why not?
---
## ALUs and FPUs
* That wasn't (and isn't) always the case
* An ALU is an Arithmetic Logic Unit
* Piece of hardware that does integer calculations
* It does the adding from before, plus other operations
* Cannot handle floating point operations though
---
## FPU
* There is an ALU equivalent for floating point operations
* The FPU
* Floating Point Unit
* Ints and floats are too different to optimize together
* Originally, FPUs were treated as an optional "add on"
* The GPU is just about the only co-processor that remains in current systems
---
## Emulating Floating Point
* If we time our fixed point implementation, we'll find it to be horribly slow
* Why? No hardware support.
* Steps are done separately, the total time is the sum of those steps.
* An actual FPU would be faster
---
## Hardware Vs Software
$ time ./timing_fixed "+" 1000000000
real 0m9.988s
user 0m9.982s
sys 0m0.002s
$ time ./timing_fixed "x" 1000000000
real 1m0.209s
user 1m0.201s
sys 0m0.002s
---
## FPU
* The floating point unit combines the steps required for floating point math into one unit
* Anything that can be done in parallel is
* e.g. adding exponents and multiplying significand don't need to be sequential
---
## The same speed?
* Why would they be the same speed though?
* There is a minimum unit of time in your system
* The clock tick
* This is the speed of a CPU
* 2200 MHz is 2,200,000 clock ticks per second
---
## Practical Limits
* We could try to make the time for a `char` addition the base unit, but why?
* Other operations would need to be chopped up, increasing complexity
* Also, on 64-bit systems memory operations will need to be 64-bit
* And we want them to be 1 clock cycle
* Gives us a lower bound on the system clock cycles
---
## Silicon to Speed
* As technology has improved, engineers have squeezed more stuff into your CPU
* This includes ALUs and FPUs
* Improvements have also decreased clock cycle latency for operations
---
## Real Example
* On Zen5 (2024), integer addition and subtraction take 1 clock cycle
* And multiple can be done in parallel per core!
* Integer multiplication has a latency of ~3, and division is ~11-14
* On the AMD K7 (1999) division latency was 40 cycles for 32 bit operations
* Although CPU speeds have been stable, practical execution times have improved
---
## Max Speed
* Most of the integer operations are already at maximum speed
* Do they have their result before the next clock cycle ticks?
* Possibly. But to use the result, the clock would have to be faster
* If you've ever overclocked something, you know going faster isn't always stable
* So the system is limited to the most unstable parts
* If the FPU can finish its ops in less time, they'll take 1 cycle
---
## Actual Costs
* The slowest floating point operation is actually FBSTP
* Converting the float to a binary coded decimal
* Other high latency operations are FSIN, FCOS, FPATAN, etc
* Yes, your CPU is the thing doing the trig functions
---
## What is Supported?
* How do we know what the hardware supports?
* By looking at the opcodes of our instruction set
---
## For Next Time
* Don't want to start diving into instruction sets today
* So take some time to do homework 2 and study
* Next class we'll being looking at common instructions and how they execute on a CPU
* For today, let's introduce a tiny bit of assembly with whatever time is left
---
## Why?
* Why learn assembly and architecture?
* Knowing architecture informs your programming decisions
* Usually, the bridge between your code and the hardware is your compiler
* So let's begin learning how the compiler treats your code
---
## Recursion
* Recursion is often the most elegant way to express an idea
* Searching, sorting, etc
* How does that translate to hardware?
* We saw before that functions can cause a `stack overflow` if the call stack is deep enough
* Let's revisit this
---
## No Recursion
```python
#include
#include
// Calculate the nth fibonocci number and return it.
unsigned long long fibonocci(unsigned long long n) {
if (n == 0 || n == 1) {
return n;
}
return fibonocci(n-1) + fibonocci(n-2);
}
int main(int argc, char** argv) {
if (argc < 2) {
printf("Give me the nth fibonocci number that you want to see.");
return 0;
}
int n = atoi(argv[1]);
unsigned long long nth = fibonocci(n);
printf("The %i fibonocci number is %llu\n", n, nth);
return 0;
}
```
---
## Calculating the fibonocci
* Calculating fibonocci numbers takes about $1.6^n$ steps
* Plenty of calls, and we only avoid a stack overflow because it takes nearly no memory
* However, this kind of call is what causes stack overflows
* Compile with `gcc -S fib.c -o fib.s`
---
## Assembly View
```asm[|22,25]
.file "example25.c"
.text
.globl fibonocci
.type fibonocci, @function
fibonocci:
.LFB39:
.cfi_startproc
endbr64
movq %rdi, %rax
cmpq $1, %rdi
jbe .L5
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
pushq %rbx
.cfi_def_cfa_offset 24
.cfi_offset 3, -24
subq $8, %rsp
.cfi_def_cfa_offset 32
movq %rdi, %rbx
leaq -1(%rdi), %rdi
call fibonocci
movq %rax, %rbp
leaq -2(%rbx), %rdi
call fibonocci
addq %rbp, %rax
addq $8, %rsp
.cfi_def_cfa_offset 24
popq %rbx
.cfi_def_cfa_offset 16
popq %rbp
.cfi_def_cfa_offset 8
ret
.L5:
.cfi_restore 3
.cfi_restore 6
ret
.cfi_endproc
.LFE39:
.size fibonocci, .-fibonocci
.section .rodata.str1.8,"aMS",@progbits,1
.align 8
.LC0:
.string "Give me the nth fibonocci number that you want to see."
.align 8
.LC1:
.string "The %i fibonocci number is %llu\n"
.text
.globl main
.type main, @function
main:
.LFB40:
.cfi_startproc
endbr64
pushq %rbx
.cfi_def_cfa_offset 16
.cfi_offset 3, -16
cmpl $1, %edi
jle .L12
movq 8(%rsi), %rdi
movl $10, %edx
movl $0, %esi
call strtol@PLT
movl %eax, %ebx
movslq %eax, %rdi
call fibonocci
movq %rax, %rcx
movl %ebx, %edx
leaq .LC1(%rip), %rsi
movl $2, %edi
movl $0, %eax
call __printf_chk@PLT
.L10:
movl $0, %eax
popq %rbx
.cfi_remember_state
.cfi_def_cfa_offset 8
ret
.L12:
.cfi_restore_state
leaq .LC0(%rip), %rsi
movl $2, %edi
movl $0, %eax
call __printf_chk@PLT
jmp .L10
.cfi_endproc
.LFE40:
.size main, .-main
.ident "GCC: (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0"
.section .note.GNU-stack,"",@progbits
.section .note.gnu.property,"a"
.align 8
.long 1f - 0f
.long 4f - 1f
.long 5
0:
.string "GNU"
1:
.align 8
.long 0xc0000002
.long 3f - 2f
2:
.long 0x3
3:
.align 8
4:
```
---
## CALL
* The call instructions are assembly's way of calling a function
* Each call means more memory on the stack
* What if we want to avoid this?
---
## Changing Calls into Loops
* Recursive functions can be converted into loops
* Use a stack to manually store the temporary variables
---
## Recursion to Loop Example
```C
#include
#include
#include "llu_stack.h"
int main(int argc, char** argv) {
if (argc < 2) {
printf("Give me the nth fibonocci number that you want to see.");
return 0;
}
int n = atoi(argv[1]);
Stack s = newStack(10);
push(&s, n);
unsigned long long sum = 0;
while (len(s) > 0) {
unsigned long long next = pop(&s);
if (next == 0 || next == 1) {
sum += next;
}
else {
push(&s, next - 1);
push(&s, next - 2);
}
}
freeStack(s);
printf("The %i fibonocci number is %llu\n", n, sum);
return 0;
}
```
---
## Is One Superior?
* Yes; the compiler knows how to optimize the recursive call better
* First, recompile with `-O3` to enable optimizations
* We can time it with the `time` command
$ time ./example25 50
The 50 fibonocci number is 12586269025
real 0m17.635s
user 0m17.630s
sys 0m0.003s
$ time ./example26 50
The 50 fibonocci number is 12586269025
real 3m0.281s
user 3m0.260s
sys 0m0.005s
---
## Compiler Tricks
* The compiler knows tricks for your hardware and OS
* And it will execute them better than you will
* So we are not going to be doing assembly for speed (usually)
* A combination of C and assembly will allow us to access the hardware in ways other languages cannot
---
## Tail Recursion
* There are times where there is a faster version of a recursive function
* `Tail recursion` occurs when the value being returned is only the recursive call
* That means that nothing else in the function will be used again
* So memory can be freed, even before the next call completes
* Also means that actually calling the function is unnecessary
* Must tell gcc to use some optimization level, (usually -O2 or -O3)
---
## Tail Recursion Example
```C
#include
#include
// Calculate the sum n + (n - 1) + (n - 2) ... + 1
unsigned long long sequence(unsigned long long n, unsigned long long current) {
if (n < 2) {
return 1 + current;
}
// This is now tail recursive
return sequence(n-1, current+n);
}
int main(int argc, char** argv) {
if (argc < 2) {
printf("Give me the number that you want to see.");
return 0;
}
int n = atoi(argv[1]);
printf("The %i sequence number is %llu\n", n, sequence(n, 0));
return 0;
}
```
---
## Assembly
```asm[|14,21-25]
.file "example25_c.c"
.text
.p2align 4
.globl sequence
.type sequence, @function
sequence:
.LFB39:
.cfi_startproc
endbr64
cmpq $1, %rdi
jbe .L3
leaq -1(%rdi), %rax
testb $1, %dil
jne .L2
addq %rdi, %rsi
movq %rax, %rdi
cmpq $1, %rax
je .L3
.p2align 4,,10
.p2align 3
.L2:
leaq -1(%rsi,%rdi,2), %rsi
subq $2, %rdi
cmpq $1, %rdi
jne .L2
.L3:
leaq 1(%rsi), %rax
ret
.cfi_endproc
.LFE39:
.size sequence, .-sequence
.section .rodata.str1.8,"aMS",@progbits,1
.align 8
.LC0:
.string "Give me the nth fibonocci number that you want to see."
.align 8
.LC1:
.string "The %i sequence number is %llu\n"
.section .text.startup,"ax",@progbits
.p2align 4
.globl main
.type main, @function
main:
.LFB40:
.cfi_startproc
endbr64
subq $8, %rsp
.cfi_def_cfa_offset 16
cmpl $1, %edi
jle .L35
movq 8(%rsi), %rdi
movl $10, %edx
xorl %esi, %esi
call strtol@PLT
xorl %ecx, %ecx
movl %eax, %edx
cltq
cmpq $1, %rax
jbe .L21
leaq -1(%rax), %rsi
testb $1, %al
jne .L20
movq %rax, %rcx
movq %rsi, %rax
cmpq $1, %rsi
je .L21
.p2align 4,,10
.p2align 3
.L20:
leaq -1(%rcx,%rax,2), %rcx
subq $2, %rax
cmpq $1, %rax
jne .L20
.L21:
addq $1, %rcx
leaq .LC1(%rip), %rsi
movl $2, %edi
xorl %eax, %eax
call __printf_chk@PLT
.L19:
xorl %eax, %eax
addq $8, %rsp
.cfi_remember_state
.cfi_def_cfa_offset 8
ret
.L35:
.cfi_restore_state
leaq .LC0(%rip), %rsi
movl $2, %edi
xorl %eax, %eax
call __printf_chk@PLT
jmp .L19
.cfi_endproc
.LFE40:
.size main, .-main
.ident "GCC: (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0"
.section .note.GNU-stack,"",@progbits
.section .note.gnu.property,"a"
.align 8
.long 1f - 0f
.long 4f - 1f
.long 5
0:
.string "GNU"
1:
.align 8
.long 0xc0000002
.long 3f - 2f
2:
.long 0x3
3:
.align 8
4:
```
---
## Assembly and Architecture
* Programming C requires a bit of hardware knowledge
* Programming in assembly requires more
* If you find the idea daunting, preemptively find a resource that you like
* Try searching for "intro to x86 assembly programming"
* We won't be programming assembly from scratch, but you'll need to understand it