\n\tOp is one of + - x / > =\n\tcount is the number of repetitions\n", argv[0]);
        return 1;
    }
    int count = atoi(argv[2]);
    char op = argv[1][0];
    switch (op) {
        case '+':
            add(count);
            break;
        case '-':
            subtract(count);
            break;
        case 'x':
            multiply(count);
            break;
        case '/':
            divide(count);
            break;
        case '>':
            gt(count);
            break;
        case '=':
            equality(count);
            break;
        default:
            break;
    }
    return 0;
}
```
---
## Note on Timing
* The `time` command measures how long something runs
* Gives `real`, `user`, and `sys` times
* **real**: wall clock time
* **user**: time this took in "userspace"
* Doesn't count time the OS took doing things, like opening files
* Also doesn't count time the process isn't running
* **sys**: Time during operating system calls
* We can just look at user
---
## Differences
$ time ./timing_ints "+" 1000000000
real 0m2.833s
user 0m2.831s
sys 0m0.001s
$ time ./timing_ints "-" 1000000000
real 0m2.841s
user 0m2.840s
sys 0m0.001s
$ time ./timing_floats "+" 1000000000
real 0m3.165s
user 0m3.163s
sys 0m0.002s
$ time ./timing_floats "-" 1000000000
real 0m3.164s
user 0m3.162s
sys 0m0.002s
---
## Multiplication and division
$ time ./timing_ints "x" 1000000000
real 0m2.812s
user 0m2.810s
sys 0m0.002s
$ time ./timing_ints "/" 1000000000
real 0m34.541s
user 0m34.538s
sys 0m0.001s
$ time ./timing_floats "x" 1000000000
real 0m3.199s
user 0m3.197s
sys 0m0.002s
$ time ./timing_floats "/" 1000000000
real 0m34.273s
user 0m34.270s
sys 0m0.001s
---
## Comparisons
$ time ./timing_ints ">" 1000000000
real 0m4.017s
user 0m4.015s
sys 0m0.001s
$ time ./timing_ints "=" 1000000000
real 0m3.995s
user 0m3.994s
sys 0m0.002s
$ time ./timing_floats ">" 1000000000
real 0m4.021s
user 0m4.019s
sys 0m0.002s
$ time ./timing_floats "=" 1000000000
real 0m3.997s
user 0m3.992s
sys 0m0.005s
---
## Magic?
* The floating point operations seem like they should be slower than the integer ones
* More steps, more fields, special cases, etc
* So why aren't they?
---
## ALUs and FPUs
* That wasn't (and isn't) always the case
* An ALU is an Arithmetic Logic Unit
* Piece of hardware that does integer calculations
* It does the adding from before, plus other operations
* Cannot handle floating point operations though
---
## FPU
* There is an ALU equivalent for floating point operations
* The FPU
* Floating Point Unit
* Integer operations and floating point are too different, so they are different pieces of hardware
* When we couldn't cram so many transistors into the same place, the FPUs were either not present, or only present in a co-processor
* Like a GPU or external sound card
* The GPU is just about the only co-processor that remains in general computers
---
## Emulating Floating Point
* If we time our fixed point implementation, we'll find it to be horribly slow
* Why? No hardware support.
* Steps are done separately, so the total time is the sum of those steps.
* An actual FPU would be faster
---
## Hardware Vs Software
$ time ./timing_fixed "+" 1000000000
real 0m9.988s
user 0m9.982s
sys 0m0.002s
$ time ./timing_fixed "x" 1000000000
real 1m0.209s
user 1m0.201s
sys 0m0.002s
---
## FPU
* The floating point unit combines the steps required for floating point math into one unit
* Anything that can be done in parallel is
* e.g. adding exponents and multiplying significands don't need to be sequential
---
## The same speed?
* Why would they be the same speed though?
* There is a minimum unit of time in your system
* The clock tick
* The clock rate is the advertised speed of a CPU
* 2200 MHz is 2,200,000,000 clock ticks per second
---
## Practical Limits
* We could try to make the time for a `char` addition the base unit, but why?
* Other operations would need to be chopped up, increasing complexity
* Also, on 64-bit systems memory operations will need to be 64-bit
* And we want them to be 1 clock cycle
* This gives us a lower bound on how short a clock cycle can be
---
## Silicon to Speed
* As technology has improved, engineers have squeezed more stuff into your CPU
* This includes ALUs and FPUs
* Improvements have also decreased clock cycle latency for operations
---
## Real Example
* On Zen5 (2024), integer addition and subtraction take 1 clock cycle
* And multiple can be done in parallel per core!
* Integer multiplication has a latency of ~3, and division is ~11-14
* On the AMD K7 (1999) division latency was 40 cycles for 32-bit operations
* Although CPU clock speeds have been fairly stable, practical execution times have improved
---
## Max Speed
* Most of the integer operations are already at maximum speed
* Do they have their result before the next clock cycle ticks?
* Possibly. But to use the result, the clock would have to be faster
* If you've ever overclocked something, you know going faster isn't always stable
* We'll get into why later
* So the clock speed is limited by the least stable parts of the system
* If the FPU can finish its ops in less time, they'll still take 1 cycle
---
## Actual Costs
* The slowest floating point operation is actually FBSTP
* Converting the float to a binary coded decimal
* Other high latency operations are FSIN, FCOS, FPATAN, etc
* Yes, your CPU is the thing doing the trig functions
---
## What is Supported?
* How do we know what the hardware supports?
* By looking at the opcodes of our instruction set
---
## For Next Time
* Don't want to start diving into instruction sets today
* So take some time to do homework 2 and study
* Next class we'll be looking at common instructions and how they execute on a CPU