# CS 211 - Lecture 09 - Recursion and Loops in Assembly

Bernhard Firner

2026-02-18

---

## Review!

* We'll spend most of today going over more float facts
* With a bit of time introducing some assembly
  * Just an appetizer!
* The drawings from last week are in the uploaded files on canvas

---

## Drawings

---


## Floating Point Numbers

* Three sections
  * Sign: 0 or 1
  * Exponent
  * Significand (also called mantissa)
* 32 bit (single precision) and 64 bit (double precision) are the most common

<table>
<tr><td>Sign</td><td colspan="8">Exponent</td><td colspan="23">Significand</td></tr>
<tr><td>31</td><td>30</td><td colspan="6">...</td><td>23</td><td colspan="21">22</td><td>...</td><td>0</td></tr>
</table>

---

## Floating Point Values

* Can divide floating point numbers into different types
  * Two special cases
    * NaN (not a number)
    * Infinity
  * normalized
  * subnormal
    * Including 0, when all bits are set to 0

---

## NaN and Infinity

* Positive and negative infinity
  * All exponent bits set to 1
  * All significand bits set to 0
* NaN
  * All exponent bits set to 1
  * Any significand bit set to 1

---

## Exponent in IEEE FP32

* Always subtract bias from the exponent
  * bias = $2^{k-1}$ without NaN and inf values, $2^{k-1} - 1$ with them
  * k is the number of exponent bits, so the IEEE 754 bias is 127 for 32 bit floats
  * For 32 bit floats, the normalized exponent range is $2^{1-127}$ to $2^{254-127}$ (255 is reserved for NaN and inf)
* That translates to exponents of $2^{-126}$ to $2^{127}$
  * Or 1.1754943508222875e-38
  * 170141183460469231731687303715884105728

---

## Limits

* All of those values are stored in float.h
  * FLT_MIN, FLT_MAX
* FLT_RADIX
  * The base that is raised to the exponent. 2 for us.
* FLT_ROUNDS and FLT_EVAL_METHOD
  * Read (or set) rounding mode and evaluation precision
  * Will revisit later

---

## Why Bias?

* The bias allows the exponent field to act like it's part of a regular number
* The sign+exponent+significand can be compared directly to another float
  * If it were 2's complement, it would require special processing

---

## Subnormal Numbers

* value = $(-1)^{sign} \times 2^{1-bias} \times (\frac{Significand}{2^{23}})$
* Allows us to represent 0
* Also matches the increments when exponent=1 rather than getting tiny
  * This keeps math more stable in a few situations

---

## Normalized Numbers

* So when the exponent bits are > 0 (and not all 1s), we are in the normal number range
  * Begin with an implied 1 in the equation
* value = $(-1)^{sign} \times 2^{exponent - 0x7F} \times (1+\frac{Significand}{2^{23}})$

---

## Sane increments

* Incrementing a float's bit pattern by 1 is sane
  * Keep increasing the significand and eventually it overflows into the exponent
  * This works out
  * $2^2(1+\frac{2^{23}-1}{2^{23}})$ is the largest value less than 8
    * 0x7F+2 in the exponent, 0x7FFFFF in the significand
  * Incrementing by 1 yields 8
    * $2^3(1+\frac{0}{2^{23}})$ is 8 exactly

---

## Proof

```c
#include <stdio.h>
#include <stdlib.h>

typedef struct FloatBits FloatBits;
struct FloatBits {
    unsigned int significand : 23;
    unsigned int exponent : 8;
    unsigned int sign : 1;
};

typedef union FloatIntBits FloatIntBits;
union FloatIntBits {
    float the_float;
    int the_int;
    FloatBits the_bits;
};

int main(void) {
    FloatIntBits fib = {.the_bits.exponent = 0x7F+2, .the_bits.significand = 0x7FFFFF};
    printf("The float is %.20f\n", fib.the_float);

    fib.the_int++;
    printf("The float is %f\n", fib.the_float);
    return 0;
}
```

---

## Output

<pre>
The float is 7.99999952316284179688
The float is 8.000000
</pre>

---

## Arithmetic

* Incrementing and decrementing are sane
* Does comparison work the same way as integers?
  * Almost
  * Comparing infinity values works
    * They use the highest value of the exponent
    * Sign bit still means negative
  * Need to check for NaN values

---

## Addition and Subtraction

* Obviously these involve matching up the exponents somehow
  * Need to convert the numbers to have the same range
  * Then handle rounding

---

## Rounding and Addition

```c
#include <stdio.h>
#include <float.h>

int main(void) {
    printf("Float rounding method is %i and the eval precision is %i\n", FLT_ROUNDS, FLT_EVAL_METHOD);
    float a = 3.14;
    float b = 1e10;
    float c = -1e10;
    printf("a, b, and c are %f, %f, and %f\n", a, b, c);
    printf("a + b + c is %f\n", a + b + c);
    printf("c + b + a is %f\n", c + b + a);
    return 0;
}
```

---

## Outputs

<pre>
Float rounding method is 1 and the eval precision is 0
a, b, and c are 3.140000, 10000000000.000000, and -10000000000.000000
a + b + c is 0.000000
c + b + a is 3.140000
</pre>

---

## What does this mean?

* FLT_ROUNDS == 1 means rounding to the nearest value when there is imprecision
* FLT_EVAL_METHOD == 0 means that operations are evaluated in the precision of their own type, so `float` math is done with 32 bits
* When we add 3.14 and 10000000000, what are the bits?
  * The exponents of the two numbers are 1 and 33
  * Their minimum increments are 2.384185791015625e-07 and 1024
    * We can't represent 3.14 in the second float
    * Also, 512.0 + 1e10 - 1e10 == 1024

---

## Rounding

* So any operation on floating point numbers will probably involve rounding
* Why round to nearest?
  * Statistical bias
  * If always up or down then calculations would be consistently large or small
  * This way they are wrong, but not in a biased way

---

## Rounding

```c
#include <stdio.h>
#include <float.h>

int main(void) {
    printf("Float rounding method is %i and the eval precision is %i\n", FLT_ROUNDS, FLT_EVAL_METHOD);
    float a = 1e10;
    float b = 512;
    float c = -1e10;
    printf("a, b, and c are %f, %f, and %f\n", a, b, c);
    printf("a + b + c is %f\n", a + b + c);
    printf("c + b + a is %f\n", c + b + a);
    return 0;
}
```

---

## Rounding

```c
#include <stdio.h>
#include <float.h>

int main(void) {
    printf("Float rounding method is %i and the eval precision is %i\n", FLT_ROUNDS, FLT_EVAL_METHOD);
    float a = 1e10;
    float b = 1024+512;
    float c = -1e10;
    printf("a, b, and c are %f, %f, and %f\n", a, b, c);
    printf("a + b + c is %f\n", a + b + c);
    printf("c + b + a is %f\n", c + b + a);
    return 0;
}
```

---

## Multiplication and Division

* Multiplication involves adding the exponents and multiplying the significands
  * So does it take the time of one multiplication and one addition?
* And how long does division take?
* What even is multiplication? A bunch of additions?
* What even is division on a computer? Is it successive subtractions?

---

## Addition

* Addition is done one bit at a time

---

## Handling Carry

* More wires for the carry in and carry out signals
* Multiple adders to make up an entire integer
* Some slight delays, but fairly fast

---

## Multiplication

* Booth's algorithm, compressors, and hardware techniques
* Complicated
  * But parts can be made parallel
  * That just means more wires
  * We are good at tiny wires

---

## Timing: integers

```c
/*
 * Test the time it takes to do an operation on some number of ints.
 */

#include <stdio.h>
#include <stdlib.h>

const int randints[] = {
    71501472,-406930519,167198576,-506476827,677574771,
    -216019958,-896939258,587778791,-921524727,-979857950,
    937022794,-1000917168,697685341,857223902,222195556,
    257300523,-44621541,-798482123,-436348977,-634093188,
    -642097513,-92061249,-801071892,-484722028,-177026294,
    181769787,-25895578,-788014379,-846172028,-302776229,
    868440015,-643364568,-163554712,-1025734220,-109980683,
    -489093616,-74256562,759298958,949504440,-262565140};

void add(int repeats) {
    // Do a bunch of additions.
    // Use different random numbers, just in case some ops take
    // different times with different values.
    int left_idx = 0;
    int right_idx = 0;
    for (;repeats > 0; --repeats) {
        int result;
        result = randints[0] + randints[10];
        result = randints[1] + randints[11];
        result = randints[2] + randints[12];
        result = randints[3] + randints[13];
        result = randints[4] + randints[14];
        result = randints[5] + randints[15];
        result = randints[6] + randints[16];
        result = randints[7] + randints[17];
        result = randints[8] + randints[18];
        result = randints[9] + randints[19];
    }
}

void subtract(int repeats) {
    // Do a bunch of subtractions.
    // Use different random numbers, just in case some ops take
    // different times with different values.
    int left_idx = 0;
    int right_idx = 0;
    for (;repeats > 0; --repeats) {
        int result;
        result = randints[0] - randints[10];
        result = randints[1] - randints[11];
        result = randints[2] - randints[12];
        result = randints[3] - randints[13];
        result = randints[4] - randints[14];
        result = randints[5] - randints[15];
        result = randints[6] - randints[16];
        result = randints[7] - randints[17];
        result = randints[8] - randints[18];
        result = randints[9] - randints[19];
    }
}

void multiply(int repeats) {
    // Do a bunch of multiplications.
    // Use different random numbers, just in case some ops take
    // different times with different values.
    int left_idx = 0;
    int right_idx = 0;
    for (;repeats > 0; --repeats) {
        int result;
        result = randints[0] * randints[10];
        result = randints[1] * randints[11];
        result = randints[2] * randints[12];
        result = randints[3] * randints[13];
        result = randints[4] * randints[14];
        result = randints[5] * randints[15];
        result = randints[6] * randints[16];
        result = randints[7] * randints[17];
        result = randints[8] * randints[18];
        result = randints[9] * randints[19];
    }
}

void divide(int repeats) {
    // Do a bunch of divisions.
    // Use different random numbers, just in case some ops take
    // different times with different values.
    int left_idx = 0;
    int right_idx = 0;
    for (;repeats > 0; --repeats) {
        int result;
        result = randints[0] / randints[10];
        result = randints[1] / randints[11];
        result = randints[2] / randints[12];
        result = randints[3] / randints[13];
        result = randints[4] / randints[14];
        result = randints[5] / randints[15];
        result = randints[6] / randints[16];
        result = randints[7] / randints[17];
        result = randints[8] / randints[18];
        result = randints[9] / randints[19];
    }
}

void gt(int repeats) {
    // Do a bunch of greater-than comparisons.
    // Use different random numbers, just in case some ops take
    // different times with different values.
    int left_idx = 0;
    int right_idx = 0;
    for (;repeats > 0; --repeats) {
        int result;
        result = randints[0] > randints[10];
        result = randints[1] > randints[11];
        result = randints[2] > randints[12];
        result = randints[3] > randints[13];
        result = randints[4] > randints[14];
        result = randints[5] > randints[15];
        result = randints[6] > randints[16];
        result = randints[7] > randints[17];
        result = randints[8] > randints[18];
        result = randints[9] > randints[19];
    }
}

void equality(int repeats) {
    // Do a bunch of equality comparisons.
    // Use different random numbers, just in case some ops take
    // different times with different values.
    int left_idx = 0;
    int right_idx = 0;
    for (;repeats > 0; --repeats) {
        int result;
        result = randints[0] == randints[10];
        result = randints[1] == randints[11];
        result = randints[2] == randints[12];
        result = randints[3] == randints[13];
        result = randints[4] == randints[14];
        result = randints[5] == randints[15];
        result = randints[6] == randints[16];
        result = randints[7] == randints[17];
        result = randints[8] == randints[18];
        result = randints[9] == randints[19];
    }
}

int main(int argc, char** argv) {
    if (argc < 3) {
        printf("Usage: %s <op> <count>\n\tOp is one of + - x / > =\n\tcount is the number of repetitions\n", argv[0]);
        return 1;
    }

    int count = atoi(argv[2]);
    char op = argv[1][0];
    switch(op) {
        case('+'):
            add(count);
            break;
        case('-'):
            subtract(count);
            break;
        case('x'):
            multiply(count);
            break;
        case('/'):
            divide(count);
            break;
        case('>'):
            gt(count);
            break;
        case('='):
            equality(count);
            break;
        default:
            break;
    }
    return 0;
}
```

---

## Timing: floats

```c
/*
 * Test the time it takes to do an operation on some number of floats.
 */

#include <stdio.h>
#include <stdlib.h>

const float randfloats[] = {
    17089592.442231685,42650262.76743865,-53497819.84984845,-116691867.6590221,24809873.623072624,
    108600008.32102972,125398338.0416575,-120378736.75773332,116731913.37504315,77312670.38473937,
    -71530409.76935619,80182138.68649063,-47645384.879797846,-88263059.38595527,48201775.93114355,
    92453382.04082373,15758869.336459607,92874705.66463462,-114804327.49872798,6557254.501784831,
    -101537030.2437717,-92741851.55613247,-34323721.76140547,-86950365.81892607,133378638.29638937,
    18108168.7427693,-128727443.24053451,-71597764.37060487,-38187532.21805462,31284076.22326246,
    101780333.4527517,123052312.94178456,94335178.55253151,33028173.831077695,34771174.536762565,
    76130639.93920115,-60319246.15152177,-90615915.30101988,78392121.87562653,-59348261.877902895
};

void add(int repeats) {
    // Do a bunch of additions.
    // Use different random numbers, just in case some ops take
    // different times with different values.
    int left_idx = 0;
    int right_idx = 0;
    for (;repeats > 0; --repeats) {
        float result;
        result = randfloats[0] + randfloats[10];
        result = randfloats[1] + randfloats[11];
        result = randfloats[2] + randfloats[12];
        result = randfloats[3] + randfloats[13];
        result = randfloats[4] + randfloats[14];
        result = randfloats[5] + randfloats[15];
        result = randfloats[6] + randfloats[16];
        result = randfloats[7] + randfloats[17];
        result = randfloats[8] + randfloats[18];
        result = randfloats[9] + randfloats[19];
    }
}

void subtract(int repeats) {
    // Do a bunch of subtractions.
    // Use different random numbers, just in case some ops take
    // different times with different values.
    int left_idx = 0;
    int right_idx = 0;
    for (;repeats > 0; --repeats) {
        float result;
        result = randfloats[0] - randfloats[10];
        result = randfloats[1] - randfloats[11];
        result = randfloats[2] - randfloats[12];
        result = randfloats[3] - randfloats[13];
        result = randfloats[4] - randfloats[14];
        result = randfloats[5] - randfloats[15];
        result = randfloats[6] - randfloats[16];
        result = randfloats[7] - randfloats[17];
        result = randfloats[8] - randfloats[18];
        result = randfloats[9] - randfloats[19];
    }
}

void multiply(int repeats) {
    // Do a bunch of multiplications.
    // Use different random numbers, just in case some ops take
    // different times with different values.
    int left_idx = 0;
    int right_idx = 0;
    for (;repeats > 0; --repeats) {
        float result;
        result = randfloats[0] * randfloats[10];
        result = randfloats[1] * randfloats[11];
        result = randfloats[2] * randfloats[12];
        result = randfloats[3] * randfloats[13];
        result = randfloats[4] * randfloats[14];
        result = randfloats[5] * randfloats[15];
        result = randfloats[6] * randfloats[16];
        result = randfloats[7] * randfloats[17];
        result = randfloats[8] * randfloats[18];
        result = randfloats[9] * randfloats[19];
    }
}

void divide(int repeats) {
    // Do a bunch of divisions.
    // Use different random numbers, just in case some ops take
    // different times with different values.
    int left_idx = 0;
    int right_idx = 0;
    for (;repeats > 0; --repeats) {
        float result;
        result = randfloats[0] / randfloats[10];
        result = randfloats[1] / randfloats[11];
        result = randfloats[2] / randfloats[12];
        result = randfloats[3] / randfloats[13];
        result = randfloats[4] / randfloats[14];
        result = randfloats[5] / randfloats[15];
        result = randfloats[6] / randfloats[16];
        result = randfloats[7] / randfloats[17];
        result = randfloats[8] / randfloats[18];
        result = randfloats[9] / randfloats[19];
    }
}

void gt(int repeats) {
    // Do a bunch of greater-than comparisons.
    // Use different random numbers, just in case some ops take
    // different times with different values.
    int left_idx = 0;
    int right_idx = 0;
    for (;repeats > 0; --repeats) {
        int result;
        result = randfloats[0] > randfloats[10];
        result = randfloats[1] > randfloats[11];
        result = randfloats[2] > randfloats[12];
        result = randfloats[3] > randfloats[13];
        result = randfloats[4] > randfloats[14];
        result = randfloats[5] > randfloats[15];
        result = randfloats[6] > randfloats[16];
        result = randfloats[7] > randfloats[17];
        result = randfloats[8] > randfloats[18];
        result = randfloats[9] > randfloats[19];
    }
}

void equality(int repeats) {
    // Do a bunch of equality comparisons.
    // Use different random numbers, just in case some ops take
    // different times with different values.
    int left_idx = 0;
    int right_idx = 0;
    for (;repeats > 0; --repeats) {
        int result;
        result = randfloats[0] == randfloats[10];
        result = randfloats[1] == randfloats[11];
        result = randfloats[2] == randfloats[12];
        result = randfloats[3] == randfloats[13];
        result = randfloats[4] == randfloats[14];
        result = randfloats[5] == randfloats[15];
        result = randfloats[6] == randfloats[16];
        result = randfloats[7] == randfloats[17];
        result = randfloats[8] == randfloats[18];
        result = randfloats[9] == randfloats[19];
    }
}

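int main(int argc, char** argv) {
    if (argc < 3) {
        printf("Usage: %s <op> <count>\n\tOp is one of + - x / > =\n\tcount is the number of repetitions\n", argv[0]);
        return 1;
    }

    // Same dispatch as the integer version.
    int count = atoi(argv[2]);
    char op = argv[1][0];
    switch(op) {
        case('+'):
            add(count);
            break;
        case('-'):
            subtract(count);
            break;
        case('x'):
            multiply(count);
            break;
        case('/'):
            divide(count);
            break;
        case('>'):
            gt(count);
            break;
        case('='):
            equality(count);
            break;
        default:
            break;
    }
    return 0;
}
```
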
---

## Note on Timing

* The `time` command measures how long something runs
  * Gives `real`, `user`, and `sys` times
  * **real**: wall clock time
  * **user**: time this took in "userspace"
    * Doesn't count time the OS took doing things, like opening files
    * Also doesn't count time the process isn't running
  * **sys**: Time during operating system calls
* We can just look at user

---

## Differences

<pre>
$ time ./timing_ints "+" 1000000000

real	0m2.833s
user	0m2.831s
sys	0m0.001s
$ time ./timing_ints "-" 1000000000

real	0m2.841s
user	0m2.840s
sys	0m0.001s
$ time ./timing_floats "+" 1000000000

real	0m3.165s
user	0m3.163s
sys	0m0.002s
$ time ./timing_floats "-" 1000000000

real	0m3.164s
user	0m3.162s
sys	0m0.002s
</pre>

---

## Multiplication and division

<pre>
$ time ./timing_ints "x" 1000000000

real	0m2.812s
user	0m2.810s
sys	0m0.002s
$ time ./timing_ints "/" 1000000000

real	0m34.541s
user	0m34.538s
sys	0m0.001s
$ time ./timing_floats "x" 1000000000

real	0m3.199s
user	0m3.197s
sys	0m0.002s
$ time ./timing_floats "/" 1000000000

real	0m34.273s
user	0m34.270s
sys	0m0.001s
</pre>

---

## Comparisons

<pre>
$ time ./timing_ints ">" 1000000000

real	0m4.017s
user	0m4.015s
sys	0m0.001s
$ time ./timing_ints "=" 1000000000

real	0m3.995s
user	0m3.994s
sys	0m0.002s
$ time ./timing_floats ">" 1000000000

real	0m4.021s
user	0m4.019s
sys	0m0.002s
$ time ./timing_floats "=" 1000000000

real	0m3.997s
user	0m3.992s
sys	0m0.005s
</pre>

---

## Magic?

* The floating point operations seem like they should be slower than the integer ones
* More steps, more fields, special cases, etc
* So why not?

---

## ALUs and FPUs

* That wasn't (and isn't) always the case
* An ALU is an Arithmetic Logic Unit
  * Piece of hardware that does integer calculations
  * It does the adding from before, plus other operations
  * Cannot handle floating point operations though

---

## FPU

* There is an ALU equivalent for floating point operations
  * The FPU
  * Floating Point Unit
* Ints and floats are too different to optimize together
* Originally, FPUs were treated as an optional "add on"
  * The GPU is just about the only co-processor that remains in current systems

---

## Emulating Floating Point

* If we time our fixed point implementation, we'll find it to be horribly slow
  * Why? No hardware support.
  * Steps are done separately, the total time is the sum of those steps.
* An actual FPU would be faster

---

## Hardware Vs Software

<pre>
$ time ./timing_fixed "+" 1000000000

real	0m9.988s
user	0m9.982s
sys	0m0.002s
$ time ./timing_fixed "x" 1000000000

real	1m0.209s
user	1m0.201s
sys	0m0.002s
</pre>

---

## FPU

* The floating point unit combines the steps required for floating point math into one unit
* Anything that can be done in parallel is
  * e.g. adding exponents and multiplying significand don't need to be sequential

---

## The same speed?

* Why would they be the same speed though?
  * There is a minimum unit of time in your system
  * The clock tick
* This is the speed of a CPU
  * 2200 MHz is 2,200,000,000 clock ticks per second

---

## Practical Limits

* We could try to make the time for a `char` addition the base unit, but why?
  * Other operations would need to be chopped up, increasing complexity
* Also, on 64-bit systems memory operations will need to be 64-bit
  * And we want them to be 1 clock cycle
  * Gives us a lower bound on the system clock cycles

---

## Silicon to Speed

* As technology has improved, engineers have squeezed more stuff into your CPU
  * This includes ALUs and FPUs
* Improvements have also decreased clock cycle latency for operations

---

## Real Example

* On Zen5 (2024), integer addition and subtraction take 1 clock cycle
  * And multiple can be done in parallel per core!
* Integer multiplication has a latency of ~3, and division is ~11-14
  * On the AMD K7 (1999) division latency was 40 cycles for 32 bit operations
* Although CPU clock speeds have been stable, practical execution times have improved

---

## Max Speed

* Most of the integer operations are already at maximum speed
* Do they have their result before the next clock cycle ticks?
  * Possibly. But to use the result, the clock would have to be faster
  * If you've ever overclocked something, you know going faster isn't always stable
* So the system is limited by its slowest, least stable parts
  * If the FPU can finish its ops in less time, they'll take 1 cycle

---

## Actual Costs

* The slowest floating point operation is actually FBSTP
  * Converting the float to a binary coded decimal
* Other high latency operations are FSIN, FCOS, FPATAN, etc
  * Yes, your CPU is the thing doing the trig functions

---

## What is Supported?

* How do we know what the hardware supports?
* By looking at the opcodes of our instruction set

---

## For Next Time

* Don't want to start diving into instruction sets today
  * So take some time to do homework 2 and study
* Next class we'll begin looking at common instructions and how they execute on a CPU
* For today, let's introduce a tiny bit of assembly with whatever time is left

---

## Why?

* Why learn assembly and architecture?
* Knowing architecture informs your programming decisions
* Usually, the bridge between your code and the hardware is your compiler
  * So let's begin learning how the compiler treats your code

---

## Recursion

* Recursion is often the most elegant way to express an idea
  * Searching, sorting, etc
  * How does that translate to hardware?
* We saw before that functions can cause a `stack overflow` if the call stack is deep enough
* Let's revisit this

---

## No Recursion

```c
#include <stdio.h>
#include <stdlib.h>

// Calculate the nth fibonocci number and return it.
unsigned long long fibonocci(unsigned long long n) {
    if (n == 0 || n == 1) {
        return n;
    }
    return fibonocci(n-1) + fibonocci(n-2);
}

int main(int argc, char** argv) {
    if (argc < 2) {
        printf("Give me the nth fibonocci number that you want to see.");
        return 0;
    }

    int n = atoi(argv[1]);

    unsigned long long nth = fibonocci(n);

    printf("The %i fibonocci number is %llu\n", n, nth);
    return 0;
}
```

---

## Calculating Fibonacci Numbers

* Calculating Fibonacci numbers this way takes about $1.6^n$ steps
* Plenty of calls, and we only avoid a stack overflow because it takes nearly no memory
  * However, this kind of call is what causes stack overflows
* Compile with `gcc -S fib.c -o fib.s`

---

## Assembly View

```asm[|22,25]
	.file	"example25.c"
	.text
	.globl	fibonocci
	.type	fibonocci, @function
fibonocci:
.LFB39:
	.cfi_startproc
	endbr64
	movq	%rdi, %rax
	cmpq	$1, %rdi
	jbe	.L5
	pushq	%rbp
	.cfi_def_cfa_offset 16
	.cfi_offset 6, -16
	pushq	%rbx
	.cfi_def_cfa_offset 24
	.cfi_offset 3, -24
	subq	$8, %rsp
	.cfi_def_cfa_offset 32
	movq	%rdi, %rbx
	leaq	-1(%rdi), %rdi
	call	fibonocci
	movq	%rax, %rbp
	leaq	-2(%rbx), %rdi
	call	fibonocci
	addq	%rbp, %rax
	addq	$8, %rsp
	.cfi_def_cfa_offset 24
	popq	%rbx
	.cfi_def_cfa_offset 16
	popq	%rbp
	.cfi_def_cfa_offset 8
	ret
.L5:
	.cfi_restore 3
	.cfi_restore 6
	ret
	.cfi_endproc
.LFE39:
	.size	fibonocci, .-fibonocci
	.section	.rodata.str1.8,"aMS",@progbits,1
	.align 8
.LC0:
	.string	"Give me the nth fibonocci number that you want to see."
	.align 8
.LC1:
	.string	"The %i fibonocci number is %llu\n"
	.text
	.globl	main
	.type	main, @function
main:
.LFB40:
	.cfi_startproc
	endbr64
	pushq	%rbx
	.cfi_def_cfa_offset 16
	.cfi_offset 3, -16
	cmpl	$1, %edi
	jle	.L12
	movq	8(%rsi), %rdi
	movl	$10, %edx
	movl	$0, %esi
	call	strtol@PLT
	movl	%eax, %ebx
	movslq	%eax, %rdi
	call	fibonocci
	movq	%rax, %rcx
	movl	%ebx, %edx
	leaq	.LC1(%rip), %rsi
	movl	$2, %edi
	movl	$0, %eax
	call	__printf_chk@PLT
.L10:
	movl	$0, %eax
	popq	%rbx
	.cfi_remember_state
	.cfi_def_cfa_offset 8
	ret
.L12:
	.cfi_restore_state
	leaq	.LC0(%rip), %rsi
	movl	$2, %edi
	movl	$0, %eax
	call	__printf_chk@PLT
	jmp	.L10
	.cfi_endproc
.LFE40:
	.size	main, .-main
	.ident	"GCC: (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0"
	.section	.note.GNU-stack,"",@progbits
	.section	.note.gnu.property,"a"
	.align 8
	.long	1f - 0f
	.long	4f - 1f
	.long	5
0:
	.string	"GNU"
1:
	.align 8
	.long	0xc0000002
	.long	3f - 2f
2:
	.long	0x3
3:
	.align 8
4:
```

---

## CALL

* The call instructions are assembly's way of calling a function
* Each call means more memory on the stack
  * What if we want to avoid this?

---

## Changing Calls into Loops

* Recursive functions can be converted into loops
* Use a stack to manually store the temporary variables

---

## Recursion to Loop Example

```C
#include <stdio.h>
#include <stdlib.h>

#include "llu_stack.h"

int main(int argc, char** argv) {
    if (argc < 2) {
        printf("Give me the nth fibonocci number that you want to see.");
        return 0;
    }

    int n = atoi(argv[1]);

    Stack s = newStack(10);
    push(&s, n);

    unsigned long long sum = 0;
    while (len(s) > 0) {
        unsigned long long next = pop(&s);
        if (next == 0 || next == 1) {
            sum += next;
        }
        else {
            push(&s, next - 1);
            push(&s, next - 2);
        }
    }

    freeStack(s);

    printf("The %i fibonocci number is %llu\n", n, sum);
    return 0;
}
```

---

## Is One Superior?

* Yes; the compiler knows how to optimize the recursive call better
* First, recompile with `-O3` to enable optimizations
* We can time it with the `time` command

<pre>
$ time ./example25 50
The 50 fibonocci number is 12586269025

real	0m17.635s
user	0m17.630s
sys	0m0.003s
$ time ./example26 50
The 50 fibonocci number is 12586269025

real	3m0.281s
user	3m0.260s
sys	0m0.005s
</pre>

---

## Compiler Tricks

* The compiler knows tricks for your hardware and OS
  * And it will execute them better than you will
* So we are not going to be doing assembly for speed (usually)
* A combination of C and assembly will allow us to access the hardware in ways other languages cannot

---

## Tail Recursion

* There are times where there is a faster version of a recursive function
* `Tail recursion` occurs when the value being returned is only the recursive call
  * That means that nothing else in the function will be used again
    * So memory can be freed, even before the next call completes
  * Also means that actually calling the function is unnecessary
* Must tell gcc to use some optimization level (usually -O2 or -O3)

---

## Tail Recursion Example

```C
#include <stdio.h>
#include <stdlib.h>

// Calculate the sum n + (n - 1) + (n - 2) ... + 1
unsigned long long sequence(unsigned long long n, unsigned long long current) {
    if (n < 2) {
        return 1 + current;
    }
    // This is now tail recursive
    return sequence(n-1, current+n);
}

int main(int argc, char** argv) {
    if (argc < 2) {
        printf("Give me the number that you want to see.");
        return 0;
    }

    int n = atoi(argv[1]);

    printf("The %i sequence number is %llu\n", n, sequence(n, 0));
    return 0;
}
```

---

## Assembly

```asm[|14,21-25]
	.file	"example25_c.c"
	.text
	.p2align 4
	.globl	sequence
	.type	sequence, @function
sequence:
.LFB39:
	.cfi_startproc
	endbr64
	cmpq	$1, %rdi
	jbe	.L3
	leaq	-1(%rdi), %rax
	testb	$1, %dil
	jne	.L2
	addq	%rdi, %rsi
	movq	%rax, %rdi
	cmpq	$1, %rax
	je	.L3
	.p2align 4,,10
	.p2align 3
.L2:
	leaq	-1(%rsi,%rdi,2), %rsi
	subq	$2, %rdi
	cmpq	$1, %rdi
	jne	.L2
.L3:
	leaq	1(%rsi), %rax
	ret
	.cfi_endproc
.LFE39:
	.size	sequence, .-sequence
	.section	.rodata.str1.8,"aMS",@progbits,1
	.align 8
.LC0:
	.string	"Give me the nth fibonocci number that you want to see."
	.align 8
.LC1:
	.string	"The %i sequence number is %llu\n"
	.section	.text.startup,"ax",@progbits
	.p2align 4
	.globl	main
	.type	main, @function
main:
.LFB40:
	.cfi_startproc
	endbr64
	subq	$8, %rsp
	.cfi_def_cfa_offset 16
	cmpl	$1, %edi
	jle	.L35
	movq	8(%rsi), %rdi
	movl	$10, %edx
	xorl	%esi, %esi
	call	strtol@PLT
	xorl	%ecx, %ecx
	movl	%eax, %edx
	cltq
	cmpq	$1, %rax
	jbe	.L21
	leaq	-1(%rax), %rsi
	testb	$1, %al
	jne	.L20
	movq	%rax, %rcx
	movq	%rsi, %rax
	cmpq	$1, %rsi
	je	.L21
	.p2align 4,,10
	.p2align 3
.L20:
	leaq	-1(%rcx,%rax,2), %rcx
	subq	$2, %rax
	cmpq	$1, %rax
	jne	.L20
.L21:
	addq	$1, %rcx
	leaq	.LC1(%rip), %rsi
	movl	$2, %edi
	xorl	%eax, %eax
	call	__printf_chk@PLT
.L19:
	xorl	%eax, %eax
	addq	$8, %rsp
	.cfi_remember_state
	.cfi_def_cfa_offset 8
	ret
.L35:
	.cfi_restore_state
	leaq	.LC0(%rip), %rsi
	movl	$2, %edi
	xorl	%eax, %eax
	call	__printf_chk@PLT
	jmp	.L19
	.cfi_endproc
.LFE40:
	.size	main, .-main
	.ident	"GCC: (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0"
	.section	.note.GNU-stack,"",@progbits
	.section	.note.gnu.property,"a"
	.align 8
	.long	1f - 0f
	.long	4f - 1f
	.long	5
0:
	.string	"GNU"
1:
	.align 8
	.long	0xc0000002
	.long	3f - 2f
2:
	.long	0x3
3:
	.align 8
4:
```

---

## Assembly and Architecture

* Programming in C requires a bit of hardware knowledge
* Programming in assembly requires more
* If you find the idea daunting, preemptively find a resource that you like
  * Try searching for "intro to x86 assembly programming"
* We won't be programming assembly from scratch, but you'll need to understand it