# CS 211 - Lecture 08 - Fixed and Floating Point

Bernhard Firner

2026-02-16

---

# Reading

* Chapter 2.4 in the book

---

## Review

* Stacks, bit fields, and unions
* Fractional values

---

## Why Stacks?

* We spent some time on stacks
  * Stacks are constantly visible as we program in assembly
  * Also useful to model several parts of the architecture
* So don't forget how stacks work

---

## Bit Fields

* We can specify the length of a struct field, in bits

```C
struct Bits {
    int a : 5;
};
```

* Bits.a is a 5 bit type

---

## Bool packing

* The compiler automatically packs `bool` into 1 bit
  * But not other types

```C
typedef struct Bits Bits;
struct Bits {
    bool a : 1;
    bool b : 1;
    bool c : 1;
    bool d : 1;
    bool e : 1;
    bool f : 1;
    bool g : 1;
    bool h : 1;
};
```

* `sizeof(Bits)` will be 1 (all 8 one-bit fields share a single byte)

---

## Bit packing

```C
typedef struct Numbers Numbers;
struct Numbers {
    int a : 10;
    int b : 10;
    int c : 10;
};
```

* `sizeof(Numbers)` will be 4
  * 30 bits fits into 4 bytes
  * Size will always be a whole number of bytes

---

## Memory Alignment

```C
typedef struct Aligned Aligned;
struct Aligned {
    int a : 10;
    int b : 10;
    int c : 10;
    long long int d : 64;
};
```

* `sizeof(Aligned)` will be 16
  * On this architecture at least
  * Why? Should be 4 + 8 = 12, right?

---

## Words

* Memory is fetched in fixed sizes called **blocks**
* And memory is organized into those blocks as well
  * So if a piece of data is spread over two blocks, it takes 2 memory requests
* Aligning memory wastes space, but runs faster

---

## Packing

* We can tell the compiler to pack things together to save space

```C
#pragma pack(push, 4)
typedef struct Unaligned Unaligned;
struct Unaligned {
    int a : 10;
    int b : 10;
    int c : 10;
    long long int d : 64;
};
#pragma pack(pop)
```

* This tells the compiler to align to 4 byte (double word) boundaries
* Now the size is 12

---

## Unions

* In assembly, data is loaded into a register before use
* The type of register changes the way the data is treated
  * Not the source of the data!
  * All data is just data, only the registers have types!
* So we could treat memory any way we want, or in multiple ways
  * C calls this a Union

---

## Union Syntax

* Declaring a union looks like declaring a struct
* But every member is treated as having overlapping memory

---

## Union Example

```C
union together {
    int number;
    char chars[4];
};
```

* Equivalent to having an int and a char* to it.

---

## Fixed Points

* Let's put those together to build a type for fractional values
* We want to split up an integer into two fields
  * Whole part
  * Fractional part
* Called fixed point numbers

---

## Fixed Point Values

* Let's say you had to represent values with $1/2$ steps
  * 0.5, 1, 1.5, 2, ...
* What's the easiest way?
  * Just take an int and use the smallest bit for $1/2$

```C
int value;
...
if (value & 0x1) {
    printf("Value is %i.5\n", value>>1);
}
else {
    printf("Value is %i\n", value>>1);
}
```

---

## 4 bits of fraction

* Let's make this slightly useful
  * Take a 32 bit int and dedicate 4 to the fraction
* What is the smallest increment?
  * $1 / 2^4$, or 0.0625
* The other 28 bits are used as a regular int
  * $-2^{27}$ to $2^{27}-1$

---

## Fixed Point Type

```C
typedef struct FixedFourStorage FixedFourStorage;
struct FixedFourStorage {
    unsigned int fraction : 4;
    unsigned int value : 28;
};

// If 0b1000 is 0.5, then 0b0001 must be 0.0625
// Always print with width 4 when using printf
const unsigned int ffour_fraction = 625;

typedef union FixedFour FixedFour;
union FixedFour {
    FixedFourStorage storage;
    // Used for math
    unsigned int raw;
};
```

---

## Math

* We can add through the *raw* field, and math works for addition and subtraction

```C
#include <stdio.h>
#include <stdlib.h>

typedef struct FixedFourStorage FixedFourStorage;
struct FixedFourStorage {
    unsigned int fraction : 4;
    unsigned int value : 27;
    unsigned int sign : 1;
};

typedef union FixedFour FixedFour;
union FixedFour {
    FixedFourStorage storage;
    // Used for math
    unsigned int raw;
};

FixedFour FixedFourAdd(FixedFour a, FixedFour b) {
    FixedFour result;
    result.raw = a.raw + b.raw;
    return result;
}

int getFFValue(FixedFour a) {
    if (a.storage.sign) {
        return -1 * a.storage.value;
    }
    else {
        return a.storage.value;
    }
}

// Returns the fraction in units of 0.0001, so it matches the %04u format used below
unsigned int getFFFraction(FixedFour a) {
    // If 0b1000 is 0.5, then 0b0001 must be 0.0625
    // Always print with width 4 when using printf
    const unsigned int ffour_fraction = 625;
    return a.storage.fraction * ffour_fraction;
}

int main(void) {
    FixedFour ff = {.storage.value = 10, .storage.fraction = 1};

    // Print with %04u means pad with leading zeros to width 4
    printf("ff value is %i.%04u\n", getFFValue(ff), getFFFraction(ff));

    FixedFour other = {.storage.value = 10, .storage.fraction = 1};
    ff = FixedFourAdd(ff, other);
    printf("Adding with another value.\n");
    printf("ff value is %i.%04u\n", getFFValue(ff), getFFFraction(ff));

    return 0;
}
```

---

## Math

* But things go wrong if we multiply or divide
  * 0.5 * 0.5 should be 0.25
  * But 0b1000 * 0b1000 won't be 0b0100 (it is 0b1000000, which reads as 4.0)
* Why? Because multiply treats the lowest bit as the 1s place, but it should be the 0.0625s place
* Solution? Right shift by 4 after multiplication

---

## Fixed Point Multiply

```C
// Create 10.0625 and 0.5. When multiplying, rescale afterwards
FixedFour ff = {.storage.value = 10, .storage.fraction = 1};
FixedFour other = {.storage.value = 0, .storage.fraction = 0x1<<3};
ff.raw = (ff.raw * other.raw) >> 4;
```

* Division is similar, but we need to upscale first before dividing or we lose precision
  * `result.raw = ((a.raw << 4) / b.raw);`

---

## Negative Divisions

* When we divide, we don't want the math to use the MSB
* It changes the number, and thus the math
* So we have to mask it out before dividing
* **You don't need to memorize this code, we are proving a point with this example**

---

## Fixed

```C
#include <stdio.h>
#include <stdlib.h>

typedef struct FixedFourStorage FixedFourStorage;
struct FixedFourStorage {
    unsigned int fraction : 4;
    unsigned int value : 27;
    unsigned int sign : 1;
};

typedef union FixedFour FixedFour;
union FixedFour {
    FixedFourStorage storage;
    // Used for math
    unsigned int raw;
};

FixedFour FixedFourAdd(FixedFour a, FixedFour b) {
    FixedFour result;
    result.raw = a.raw + b.raw;
    return result;
}

FixedFour FixedFourMultiply(FixedFour a, FixedFour b) {
    FixedFour result;
    // Mask out the sign bit
    unsigned int mask = 0x7FFFFFFF;
    result.raw = ((a.raw&mask) * (b.raw&mask)) >> 4;
    // Use xor to get the sign. Overwrite whatever the multiply operation put there
    result.storage.sign = a.storage.sign ^ b.storage.sign;
    return result;
}

FixedFour FixedFourDivide(FixedFour a, FixedFour b) {
    FixedFour result;
    // Mask out the sign bit
    unsigned int mask = 0x7FFFFFFF;
    // Upscale before dividing to keep precision; the shift also pushes a's sign bit out
    result.raw = ((a.raw << 4) / (mask & b.raw));
    // Use xor to get the sign. Overwrite whatever the divide operation put there
    result.storage.sign = a.storage.sign ^ b.storage.sign;
    return result;
}

int getFFValue(FixedFour a) {
    if (a.storage.sign) {
        return -1 * a.storage.value;
    }
    else {
        return a.storage.value;
    }
}

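unsigned int getFFFraction(FixedFour a) {
    // If 0b1000 is 0.5, then 0b0001 must be 0.0625
    // Always print with width 4 when using printf
    const unsigned int ffour_fraction = 625;
    return a.storage.fraction * ffour_fraction;
}

int main(void) {
    FixedFour ff = {.storage.value = 10, .storage.fraction = 1};

    // Print with %04u means pad with leading zeros to width 4
    printf("ff value is %i.%04u\n", getFFValue(ff), getFFFraction(ff));

    // Same addition step as the previous program, so the outputs line up
    FixedFour other = {.storage.value = 10, .storage.fraction = 1};
    ff = FixedFourAdd(ff, other);
    printf("Adding with another value.\n");
    printf("ff value is %i.%04u\n", getFFValue(ff), getFFFraction(ff));

    printf("Tripling.\n");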
    other.storage.value = 3;
    other.storage.fraction = 0;
    ff = FixedFourMultiply(ff, other);
    printf("ff value is %i.%04u\n", getFFValue(ff), getFFFraction(ff));

    printf("Dividing by 7.\n");
    other.storage.value = 7;
    other.storage.fraction = 0;
    ff = FixedFourDivide(ff, other);
    printf("ff value is %i.%04u\n", getFFValue(ff), getFFFraction(ff));

    printf("Multiplying by -1.\n");
    other.storage.sign = 1;
    other.storage.value = 1;
    other.storage.fraction = 0;
    ff = FixedFourMultiply(ff, other);
    printf("ff value is %i.%04u\n", getFFValue(ff), getFFFraction(ff));

    printf("Dividing by -0.5\n");
    other.storage.sign = 1;
    other.storage.value = 0;
    other.storage.fraction = 0x1<<3;
    ff = FixedFourDivide(ff, other);
    printf("ff value is %i.%04u\n", getFFValue(ff), getFFFraction(ff));

    return 0;
}
```

---

## Outputs

```text
ff value is 10.0625
Adding with another value.
ff value is 20.1250
Tripling.
ff value is 60.3750
Dividing by 7.
ff value is 8.6250
Multiplying by -1.
ff value is -8.6250
Dividing by -0.5
ff value is 17.2500
```

---

## Fixed Point Disadvantages

* Poor range; less than an int
* Poor precision; gave up 4 bits for increments of 0.0625
* Instead, we use floating point numbers
  * Precision is dynamic
  * High precision with small numbers, low precision at high numbers

---

## Hardware Support

* Notice that fixed point arithmetic had extra steps
  * Floating point values are similar, requiring more work than integers
* To make things faster, we need hardware support
* All values in your CPU are treated as integers or floating point

---

## Floating Point Numbers

* [IEEE 754 Standard](https://en.wikipedia.org/wiki/IEEE_754)
  * Imagine the nightmare if different machines represented numbers differently?
  * That is how people lived before the 1980s
* Three sections
  * Sign: 0 or 1
  * Exponent
  * Significand (also called mantissa)
* 32 bit (single precision) and 64 bit (double precision) are most common

<table>
<tr><td>Sign</td><td colspan="8">Exponent</td><td colspan="23">Significand</td></tr>
<tr><td>31</td><td>30</td><td colspan="6">...</td><td>23</td><td colspan="21">22</td><td>...</td><td>0</td></tr>
</table>

---

## FP Representation

* Numbers won't be as straightforward as int or fixed points
* Going to stick with 32 bit (single precision) for examples
  * Consistent with other precisions, just add or remove bits
* In general, value = $(-1)^{sign} \times 2^{exponent} \times (1 + Significand)$
  * There are some intricacies to the exponent and significand values

---

## Why?

* The $2^{exponent}$ part allows us to choose the increment of the significand
* This trades off precision for range
  * As the exponent grows, we lower precision but gain range
  * So representations are more precise at lower exponents
  * Going to allow the exponent to be negative, for values between 0 and 1

---

## Special Cases

* We also want to have some "special" numbers
* Can divide floating point numbers into different types
  * Two special cases
    * NaN (not a number)
    * Infinity
  * normalized
  * subnormal (values nearest to 0)

---

## NaNs and Infinity

* Positive and negative infinity
  * Set all exponent bits to 1
  * Set all significand bits to 0

---

## Infinities

```C
#include <stdio.h>
#include <stdlib.h>

typedef struct FloatBits FloatBits;
struct FloatBits {
    unsigned int significand : 23;
    unsigned int exponent : 8;
    unsigned int sign : 1;
};

typedef union FloatIntBits FloatIntBits;
union FloatIntBits {
    float the_float;
    int the_int;
    FloatBits the_bits;
};

int main(void) {
    FloatIntBits fib = {.the_int = 0};

    printf("All 0s is %f\n", fib.the_float);

    fib.the_bits.significand = 0;
    fib.the_bits.exponent = 0xFF;
    fib.the_bits.sign = 0;
    printf("Fields are 0x%x, 0x%x, 0x%x; float is %f\n", fib.the_bits.sign, fib.the_bits.exponent, fib.the_bits.significand, fib.the_float);

    fib.the_bits.significand = 0;
    fib.the_bits.exponent = 0xFF;
    fib.the_bits.sign = 1;
    printf("Fields are 0x%x, 0x%x, 0x%x; float is %f\n", fib.the_bits.sign, fib.the_bits.exponent, fib.the_bits.significand, fib.the_float);
    // Dividing a nonzero value by zero also yields infinity
    fib.the_float = 1.0 / 0.0;
    printf("Fields are 0x%x, 0x%x, 0x%x; float is %f\n", fib.the_bits.sign, fib.the_bits.exponent, fib.the_bits.significand, fib.the_float);
    return 0;
}
```

---

## Output

<pre>
All 0s is 0.000000
Fields are 0x0, 0xff, 0x0; float is inf
Fields are 0x1, 0xff, 0x0; float is -inf
Fields are 0x0, 0xff, 0x0; float is inf
</pre>

---

## Not a Number (NaN)

* All exponent bits set
  * At least one significand bit set
* Why?
  * Need a way to indicate math errors
  * 0/0
  * $\sqrt{-1}$

---

## NaNs

```C
#include <stdio.h>
#include <stdlib.h>

typedef struct FloatBits FloatBits;
struct FloatBits {
    unsigned int significand : 23;
    unsigned int exponent : 8;
    unsigned int sign : 1;
};

typedef union FloatIntBits FloatIntBits;
union FloatIntBits {
    float the_float;
    int the_int;
    FloatBits the_bits;
};

int main(void) {
    FloatIntBits fib = {.the_int = 0};

    printf("All 0s is %f\n", fib.the_float);

    fib.the_bits.significand = 1;
    fib.the_bits.exponent = 0xFF;
    fib.the_bits.sign = 0;
    printf("Fields are 0x%x, 0x%x, 0x%x; float is %f\n", fib.the_bits.sign, fib.the_bits.exponent, fib.the_bits.significand, fib.the_float);

    fib.the_bits.significand = 0x7FFFFF;
    fib.the_bits.exponent = 0xFF;
    fib.the_bits.sign = 1;
    printf("Fields are 0x%x, 0x%x, 0x%x; float is %f\n", fib.the_bits.sign, fib.the_bits.exponent, fib.the_bits.significand, fib.the_float);

    fib.the_float = 0.0 / 0.0;
    printf("Fields are 0x%x, 0x%x, 0x%x; float is %f\n", fib.the_bits.sign, fib.the_bits.exponent, fib.the_bits.significand, fib.the_float);
    return 0;
}
```

---

## Output

<pre>
All 0s is 0.000000
Fields are 0x0, 0xff, 0x1; float is nan
Fields are 0x1, 0xff, 0x7fffff; float is -nan
Fields are 0x1, 0xff, 0x400000; float is -nan
</pre>

---

## Signalling NaN

* Sometimes you may want your program to stop when you hit NaNs
* CPUs can set a flag when this happens, allowing you to stop
* For "quiet" NaNs, the most significant bit of the significand is set
  * 0x1 << 22 for 32 bit floats
* For signalling NaNs, that bit is clear and at least one of the other bits is set
  * Any nonzero value within 0x3FFFFF for 32 bit floats

---

## Checking exception registers

* We could do this in assembly
  * But C already has a library for it in `<fenv.h>`
  * Need to link to a new library when compiling
  * `gcc program.c -lm -o program`
  * `-lm` is for "libmath"

---

## Register Example

```C
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <fenv.h>

typedef struct FloatBits FloatBits;
struct FloatBits {
    unsigned int significand : 23;
    unsigned int exponent : 8;
    unsigned int sign : 1;
};

typedef union FloatIntBits FloatIntBits;
union FloatIntBits {
    float the_float;
    int the_int;
    FloatBits the_bits;
};

typedef struct FPStatus FPStatus;
struct FPStatus {
    bool invalid : 1;
    bool subnormal : 1;
    bool div_by_0 : 1;
    bool overflow : 1;
    bool underflow : 1;
    bool fp_inexact : 1;
};

typedef union FPStatBuffer FPStatBuffer;
union FPStatBuffer {
    // The floating point status register is 16 bits long
    unsigned short value;
    FPStatus status;
};

int main(void) {
    FloatIntBits fib = {.the_int = 0};
    fib.the_bits.significand = 0x1 << 22;
    fib.the_bits.exponent = 0xFF;
    fib.the_bits.sign = 0;

    printf("%f + %f is %f\n", fib.the_float, 3.14, fib.the_float + 3.14);

    // Check for floating point exception
    if (fetestexcept(FE_INVALID)) {
        printf("A floating point exception occurred.\n");
    }
    // Clear any exceptions
    feclearexcept(FE_ALL_EXCEPT);

    // lower 22 bits set in significand for signalling NaN
    fib.the_bits.significand = 0x3FFFFF;
    fib.the_bits.exponent = 0xFF;
    fib.the_bits.sign = 1;

    printf("%f + %f is %f\n", fib.the_float, 3.14, fib.the_float + 3.14);

    // Check for floating point exception
    if (fetestexcept(FE_INVALID)) {
        printf("A floating point exception occurred.\n");
    }
    // Clear any exceptions
    feclearexcept(FE_ALL_EXCEPT);

    return 0;
}
```

---

## Output

<pre>
nan + 3.140000 is nan
-nan + 3.140000 is -nan
A floating point exception occurred.
</pre>

---

## Pause

* You don't need to memorize every detail of floating point numbers
* But recognize that floating point numbers are far more complicated than integers
* The way we represent numbers (coming next!) is the important part

---

## Details

* We want comparisons and increments to work as normal
  * e.g. if the significand bits are all set and we increment the raw bits by 1, the exponent goes up by 1
* That means that the exponent must be stored as an unsigned number

---

## Bias

* To store the exponent as an unsigned value and still have negative exponents, we always subtract a **bias**
* Bias is $2^{k-1}$, or $2^{k-1} - 1$ if using NaN and inf values
  * Where $k$ is the number of bits in the exponent field
  * 0b01111111 (or 0x7F) for 32 bit floats
* So setting the exponent field to 0x7F gives an exponent value of 0
* For normalized numbers, the exponent value begins at $2 - 2^{k-1}$ and goes to $2^{k-1} - 1$
  * -126 to 127 for 32 bit floats

---

## Subnormal Numbers

* When the exponent bits are 0, this is like a fixed point format
  * You would expect the exponent to be $0 - bias$
  * However, the equation is different than for normal numbers
* Each step in the significand is a step of the minimum increment
* value = $(-1)^{sign} \times 2^{1-bias} \times (\frac{Significand}{2^{23}})$

---

## Tiny Numbers

* value = $(-1)^{sign} \times 2^{1-bias} \times (\frac{Significand}{2^{23}})$
* If the exponent bits are zero, the scale factor is
  * $2^{1 - 0x7F} = 2^{-126} = 1.1754943508222875e-38$
* Each step of the significand is worth
  * $\frac{1}{2^{23}} = 1.1920928955078125e-07$ of that scale factor
* So we'll add precision to printf

---

## Subnormal Example

```C
#include <stdio.h>
#include <stdlib.h>

typedef struct FloatBits FloatBits;
struct FloatBits {
    unsigned int significand : 23;
    unsigned int exponent : 8;
    unsigned int sign : 1;
};

typedef union FloatIntBits FloatIntBits;
union FloatIntBits {
    float the_float;
    int the_int;
    FloatBits the_bits;
};

int main(void) {
    FloatIntBits fib = {.the_int = 0};

    // Special 0 values
    fib.the_bits.significand = 0;
    fib.the_bits.exponent = 0;
    fib.the_bits.sign = 0;
    printf("Fields are B=0x%x, E=%i, 0x%x; float is %f\n", fib.the_bits.sign, fib.the_bits.exponent - 0x7F, fib.the_bits.significand, fib.the_float);

    fib.the_bits.significand = 0;
    fib.the_bits.exponent = 0;
    fib.the_bits.sign = 1;
    printf("Fields are B=0x%x, E=%i, 0x%x; float is %f\n", fib.the_bits.sign, fib.the_bits.exponent - 0x7F, fib.the_bits.significand, fib.the_float);

    // Smallest value above 0
    fib.the_bits.sign = 0;
    fib.the_bits.significand = 1;
    printf("Fields are B=0x%x, E=%i, 0x%x; float is %.50f\n", fib.the_bits.sign, fib.the_bits.exponent - 0x7F, fib.the_bits.significand, fib.the_float);

    // Maximum value for a subnormal number
    fib.the_bits.significand = 0x7FFFFF;
    fib.the_bits.sign = 0;
    printf("Fields are B=0x%x, E=%i, 0x%x; float is %.50f\n", fib.the_bits.sign, fib.the_bits.exponent - 0x7F, fib.the_bits.significand, fib.the_float);

    return 0;
}
```

---

## Outputs

<pre>
Fields are B=0x0, E=-127, 0x0; float is 0.000000
Fields are B=0x1, E=-127, 0x0; float is -0.000000
Fields are B=0x0, E=-127, 0x1; float is 0.00000000000000000000000000000000000000000000140130
Fields are B=0x0, E=-127, 0x7fffff; float is 0.00000000000000000000000000000000000001175494210692
</pre>

* Note that I increased the printing precision
  * `%.50f` in printf

---

## Normalized Numbers

* When exponent bits > 0, we are in the normal number range
  * Begin with an implied 1 in the equation
* The smallest normal numbers use the same exponent value as the subnormals ($1 - bias$)
  * This makes the first normal number exactly one minimum increment ($2^{1-bias} \times \frac{1}{2^{23}}$ for 32 bit floats) greater than the largest subnormal
  * Simplifies comparisons, too, as we'll see
  * Also efficient! The range after each increment of the exponent won't overlap with the previous range

---

## Normalized Numbers

* For 32 bit floats:
  * value = $(-1)^{sign} \times 2^{exponent - 0x7F} \times (1+\frac{Significand}{2^{23}})$

---

## Some values

```C
#include <stdio.h>
#include <stdlib.h>

typedef struct FloatBits FloatBits;
struct FloatBits {
    unsigned int significand : 23;
    unsigned int exponent : 8;
    unsigned int sign : 1;
};

typedef union FloatIntBits FloatIntBits;
union FloatIntBits {
    float the_float;
    int the_int;
    FloatBits the_bits;
};

int main(void) {
    FloatIntBits fib = {.the_int = 0};

    // Powers of 2; the format prints "1+" to show the implied leading 1
    fib.the_bits.significand = 0;
    fib.the_bits.exponent = 1;
    fib.the_bits.sign = 0;
    printf("Fields are B=0x%x, E=%i, 1+0x%x; float is %f\n", fib.the_bits.sign, fib.the_bits.exponent - 0x7F, fib.the_bits.significand, fib.the_float);

    fib.the_float = 1.0;
    printf("Fields are B=0x%x, E=%i, 1+0x%x; float is %f\n", fib.the_bits.sign, fib.the_bits.exponent - 0x7F, fib.the_bits.significand, fib.the_float);

    fib.the_float = 2.0;
    printf("Fields are B=0x%x, E=%i, 1+0x%x; float is %f\n", fib.the_bits.sign, fib.the_bits.exponent - 0x7F, fib.the_bits.significand, fib.the_float);

    fib.the_float = 8.0;
    printf("Fields are B=0x%x, E=%i, 1+0x%x; float is %f\n", fib.the_bits.sign, fib.the_bits.exponent - 0x7F, fib.the_bits.significand, fib.the_float);

    fib.the_float = 1024.0;
    printf("Fields are B=0x%x, E=%i, 1+0x%x; float is %f\n", fib.the_bits.sign, fib.the_bits.exponent - 0x7F, fib.the_bits.significand, fib.the_float);

    return 0;
}
```

---

## Output

<pre>
Fields are B=0x0, E=-126, 1+0x0; float is 0.000000
Fields are B=0x0, E=0, 1+0x0; float is 1.000000
Fields are B=0x0, E=1, 1+0x0; float is 2.000000
Fields are B=0x0, E=3, 1+0x0; float is 8.000000
Fields are B=0x0, E=10, 1+0x0; float is 1024.000000
</pre>

---

## Now with Significand

```C

#include <stdio.h>
#include <stdlib.h>

typedef struct FloatBits FloatBits;
struct FloatBits {
    unsigned int significand : 23;
    unsigned int exponent : 8;
    unsigned int sign : 1;
};

typedef union FloatIntBits FloatIntBits;
union FloatIntBits {
    float the_float;
    int the_int;
    FloatBits the_bits;
};

int main(void) {
    FloatIntBits fib = {.the_int = 0};

    // Powers of 2 (the first is the smallest normal number, so print extra digits)
    fib.the_bits.significand = 0;
    fib.the_bits.exponent = 1;
    fib.the_bits.sign = 0;
    printf("Fields are (-1)^%u * 2^%i * 1+(%u/2^23); float is %.50f\n", fib.the_bits.sign, fib.the_bits.exponent - 0x7F, fib.the_bits.significand, fib.the_float);

    fib.the_float = 1.0;
    printf("Fields are (-1)^%u * 2^%i * 1+(%u/2^23); float is %f\n", fib.the_bits.sign, fib.the_bits.exponent - 0x7F, fib.the_bits.significand, fib.the_float);

    fib.the_float = 2.0;
    printf("Fields are (-1)^%u * 2^%i * 1+(%u/2^23); float is %f\n", fib.the_bits.sign, fib.the_bits.exponent - 0x7F, fib.the_bits.significand, fib.the_float);

    fib.the_float = 8.0;
    printf("Fields are (-1)^%u * 2^%i * 1+(%u/2^23); float is %f\n", fib.the_bits.sign, fib.the_bits.exponent - 0x7F, fib.the_bits.significand, fib.the_float);

    fib.the_float = 1024.0;
    printf("Fields are (-1)^%u * 2^%i * 1+(%u/2^23); float is %f\n", fib.the_bits.sign, fib.the_bits.exponent - 0x7F, fib.the_bits.significand, fib.the_float);

    // Adding 1
    fib.the_float = 3.0;
    printf("Fields are (-1)^%u * 2^%i * 1+(%u/2^23); float is %f\n", fib.the_bits.sign, fib.the_bits.exponent - 0x7F, fib.the_bits.significand, fib.the_float);

    fib.the_float = 9.0;
    printf("Fields are (-1)^%u * 2^%i * 1+(%u/2^23); float is %f\n", fib.the_bits.sign, fib.the_bits.exponent - 0x7F, fib.the_bits.significand, fib.the_float);

    fib.the_float = 1025.0;
    printf("Fields are (-1)^%u * 2^%i * 1+(%u/2^23); float is %f\n", fib.the_bits.sign, fib.the_bits.exponent - 0x7F, fib.the_bits.significand, fib.the_float);

    return 0;
}
```

---

## Output

<pre>
Fields are (-1)^0 * 2^-126 * 1+(0/2^23); float is 0.00000000000000000000000000000000000001175494350822
Fields are (-1)^0 * 2^0 * 1+(0/2^23); float is 1.000000
Fields are (-1)^0 * 2^1 * 1+(0/2^23); float is 2.000000
Fields are (-1)^0 * 2^3 * 1+(0/2^23); float is 8.000000
Fields are (-1)^0 * 2^10 * 1+(0/2^23); float is 1024.000000
Fields are (-1)^0 * 2^1 * 1+(4194304/2^23); float is 3.000000
Fields are (-1)^0 * 2^3 * 1+(1048576/2^23); float is 9.000000
Fields are (-1)^0 * 2^10 * 1+(8192/2^23); float is 1025.000000
</pre>

---

## Integers and Floats

* Integer math and floating point math are different
* Special hardware exists for both
  * ALUs (Arithmetic Logic Units) for integers 
  * FPUs (Floating Point Units) for floats 
* These are the only real "types" in your computer architecture

---

## Takeaways

* Floating point math is more complicated
  * And we'll also have to worry about rounding
    * Next time!
* But floats are elegantly crafted

---

## Equations

* Subnormals
  * exponent is 0
  * value = $(-1)^{sign} \times 2^{1-bias} \times (\frac{Significand}{2^{Sig~Bits}})$
* Normals
  * value = $(-1)^{sign} \times 2^{exponent-bias} \times (1 + \frac{Significand}{2^{Sig~Bits}})$