# CS 211 - Lecture 23

## Computer Architecture

### Parallel and GPU Architecture

Bernhard Firner

2025-11-18

---

## Course Logistics

* Not really enough time to cram in 2 more homeworks
  * So just 1 more
* Calculating your grade
  * Canvas doesn't have the correct scaling
  * Homeworks are 50%, quizzes are 5% each, midterm is 15%, final is 25%

---

## Will there be a curve?

* Right now, if you've scored the average on every exam and diligently done your homework, your grade is on track to be in the mid 80s
  * Doesn't look like we need a curve
* If scores drop (if everyone bombs quiz 2 and the final, for example) that could change
  * So don't ask me to tell you where any grade cutoffs are, I'll have to see how everyone does on the final
* Let's get to more floating points so you don't bomb your exam

---

## Review: GPU Architectures

* GPUs are similar to CPUs
  * Instruction in L0 and L1 cache (per core)
  * Data in L1 cache (one per core) and shared L2 cache
* Both use DRAM for off-chip memory, but GPUs come with their own

---

## Differences

* Unlike CPUs, GPUs are built for parallelism
  * Higher bandwidth to load more data
  * Threads are the default execution type
* Bandwidth and threading allow GPUs to tackle problems that CPUs handle poorly

---

## GPU Advantages

---

## What is Parallelism?

* Most of the code you've looked at is single-threaded
  * Only one instruction is being executed at a time
  * Program counter has one value
* In multithreaded execution, there are multiple instructions happening simultaneously
  * Do those threads share memory? How do they synchronize?
    * These are the details that distinguish different systems

---

## Parallelism on the CPU

* CPU cores are multiple instructions wide
  * But that doesn't necessarily mean that running multiple threads on a core is good
  * If they are accessing different data this is a disaster for locality
* More likely that each thread ends up on a different core
  * But what if they do share some memory?

---

## L3 Cache

* The L3 cache is built to enhance parallelism on CPUs
* On an L2 miss, it searches the L2 caches of other cores
  * This seems slow, but it is faster than going to RAM
* L3 is large, so it can afford to store data as it is evicted from L2

---

## Cache on the GPU

* Data is processed in huge chunks
* But that sometimes leads to instruction cache misses
  * See an [NVIDIA writeup](https://developer.nvidia.com/blog/improving-gpu-performance-by-reducing-instruction-cache-misses-2/)
* This is why there is an L0 instruction cache on the GPU

---

## Why Instruction Misses?

* As threads become desynchronized, they run different parts of a program
  * This is completely different from a CPU core, which has 1 or 2 threads
* NVIDIA's streaming multiprocessors (think GPU core) can have 1024 threads
  * Imagine trying to run 1024 different parts of a program
    * Maybe all with branch misses! Ugh!

---

## Improving Instruction Misses

* Get rid of branches
* Manually tune loop unrolling
  * Reduces instructions
  * Also reduces register usage, preventing thread contention for registers
* Make threads fully independent, allowing them to run on different cores

---

## Parallelism Performance

* Threaded systems are poorly described by a clock rate
  * An option with 1/10 the speed but with 1000 times the threads could be better
* So manufacturers prefer to report OPS
  * Operations Per Second
* Usually FLOPS (floating point OPS)
  * Most important workloads are floating point

---

## Results

* From AMD MI300x datasheet

---

## Chasing Performance

* A GPU manufacturer wants to report a bigger number
  * So they want you to use FP16 instead of FP32 or FP64
  * But they also need to show that its accuracy is sufficient
* So what limitations come with smaller numbers?
  * Let's review floating point values

---

## Sign, Exponent, Significand

* [IEEE 754 Standard](https://en.wikipedia.org/wiki/IEEE_754)
* Three sections
  * Sign: 0 or 1
  * Exponent
  * Significand (also called mantissa)
* 32 bit and 64 bit values are called single and double precision
* 16 bit floats are called half precision

<table>
<tr><td>Sign</td><td colspan="8">Exponent</td><td colspan="23">Significand</td></tr>
<tr><td>31</td><td>30</td><td colspan="6">...</td><td>23</td><td colspan="21">22</td><td>...</td><td>0</td></tr>
</table>

---

## Representation

* In general, for a format with $s$ significand bits:
* Subnormal
  * value = $(-1)^{sign} \times 2^{1-bias}\frac{Significand}{2^{s}}$
* Normal
  * value = $(-1)^{sign} \times 2^{exponent}(1 + \frac{Significand}{2^{s}})$

---

## Bias

* The exponent field isn't a signed int
  * Instead, it is a value with a `bias` subtracted
* There are two general forms, one where values are reserved for NaN and infinity, and one without

---

## Without NaN and Infinity

* Given an exponent field of $k$ bits, and a value $e$ in the exponent field
  * $bias = 2^{k-1}$
  * $exponent = e - bias$
* There is one more value on the negative side than the positive side
  * Which means we represent more numbers < 1 than > 1

---

## Some scribbles

* The next slide is some scribbles from class
* To explain, we are working with an 8 bit float
  * 1 bit sign, 3 exponent (marked "e"), 4 significand ("sig")
* The example begins with the value for 0 and increments up to the first normal number
  * That occurs when the significand is incremented from 15
  * The significand value overflows, setting the exponent to 1 and the significand back to 0

---

---

## A Note

* When I've asked questions so far, I've been doing them for a simplified floating point number
  * No subnormal range
  * No bits reserved for NaN and Infinity
* I'm assuming that if you understand the basics, the additional complication won't be confusing
  * Please, at least understand the basics, who knows what could show up on the final

---

## With NaN and Infinity

* Infinity
  * All bits in $e$ set, significand is 0
* NaN
  * All bits in $e$ set, significand != 0
* Bias is one smaller
  * $bias = 2^{k-1} - 1$
  * $exponent = e - bias$
* So now the exponent range for normal values is $[1-bias, bias]$

---

## IEEE 754

* The standard floating point values on your CPU are the IEEE version
* For normal 32 bit floating point values, with $e \in [1, 254]$ and $s = 23$
  * value = $(-1)^{sign} \times 2^{e - 127}(1 + \frac{Significand}{2^{s}})$

---

## Subnormal

* With $e = 0$ we are in the subnormal range
  * We don't use the normal equation
    * value = $(-1)^{sign} \times 2^{-bias}(1 + \frac{Significand}{2^{s}})$
  * That would leave a "large" gap between 0 and the smallest value
  * Would also make it awkward to represent 0
* Instead, we divide the space between 0 and $2^{1-bias}$ evenly
  * value = $(-1)^{sign} \times 2^{1-bias}\frac{Significand}{2^{s}}$

---

## More Subnormals

* When the value in the exponent field is 0, we change the equation
  * $2^{1-bias}$, not $2^{0-bias}$ as you would expect
  * And remove the 1 from $(1 + \frac{Significand}{2^{s}})$
* This gives evenly spaced steps of size $\frac{2^{1-bias}}{2^s}$ from 0 to $2^{1-bias}$

---

## Example

```c
#include <stdio.h>
#include <stdlib.h>

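/* Bit-field order is implementation-defined. Listing the low bits first
 * matches GCC and Clang on little-endian targets such as x86-64. */
typedef struct FloatBits FloatBits;
struct FloatBits {
    unsigned int significand : 23;
    unsigned int exponent : 8;
    unsigned int sign : 1;
};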

typedef union FloatIntBits FloatIntBits;
union FloatIntBits {
    float the_float;
    int the_int;
    FloatBits the_bits;
};

int main(void) {
    FloatIntBits value = {.the_bits.sign = 0x0, .the_bits.exponent = 0x0, .the_bits.significand = 0x0};
    printf("These are subnormal values.\n");
    printf("The bias is 2^(8-1)-1 = 127.\n");
    printf("The smallest exponent is 1 - bias = -126.\n");
    printf("The exp and sig are %u and %07u, and the float is %.50f\n", value.the_bits.exponent, value.the_bits.significand, value.the_float);
    value.the_bits.significand = 1;
    printf("The exp and sig are %u and %07u, and the float is %.50f\n", value.the_bits.exponent, value.the_bits.significand, value.the_float);
    value.the_bits.significand = 2;
    printf("The exp and sig are %u and %07u, and the float is %.50f\n", value.the_bits.exponent, value.the_bits.significand, value.the_float);
    value.the_bits.significand = 0x7FFFFF;
    printf("The exp and sig are %u and %u, and the float is %.50f\n", value.the_bits.exponent, value.the_bits.significand, value.the_float);
    printf("These are normal values.\n");
    for (int i = 0; i < 8; ++i) {
        value.the_bits.exponent = 1 << i;
        value.the_bits.significand = 0;
        printf("The exp and sig are %u and %07u, and the float is %.50f\n", value.the_bits.exponent, value.the_bits.significand, value.the_float);
        value.the_bits.significand = 1;
        printf("The exp and sig are %u and %07u, and the float is %.50f\n", value.the_bits.exponent, value.the_bits.significand, value.the_float);
        value.the_bits.significand = 0x7FFFFF;
        printf("The exp and sig are %u and %u, and the float is %.50f\n", value.the_bits.exponent, value.the_bits.significand, value.the_float);
    }
    printf("These are special values.\n");
    value.the_bits.exponent = 0xFF;
    value.the_bits.significand = 0;
    printf("The exp and sig are %u and %07u, and the float is %.50f\n", value.the_bits.exponent, value.the_bits.significand, value.the_float);
    value.the_bits.significand = 1;
    printf("The exp and sig are %u and %07u, and the float is %.50f\n", value.the_bits.exponent, value.the_bits.significand, value.the_float);
    value.the_bits.significand = 0x7FFFFF;
    printf("The exp and sig are %u and %u, and the float is %.50f\n", value.the_bits.exponent, value.the_bits.significand, value.the_float);
    return 0;
}
```

---

## Example Output

```
These are subnormal values.
The bias is 2^(8-1)-1 = 127.
The smallest exponent is 1 - bias = -126.
The exp and sig are 0 and 0000000, and the float is 0.00000000000000000000000000000000000000000000000000
The exp and sig are 0 and 0000001, and the float is 0.00000000000000000000000000000000000000000000140130
The exp and sig are 0 and 0000002, and the float is 0.00000000000000000000000000000000000000000000280260
The exp and sig are 0 and 8388607, and the float is 0.00000000000000000000000000000000000001175494210692
These are normal values.
The exp and sig are 1 and 0000000, and the float is 0.00000000000000000000000000000000000001175494350822
The exp and sig are 1 and 0000001, and the float is 0.00000000000000000000000000000000000001175494490952
The exp and sig are 1 and 8388607, and the float is 0.00000000000000000000000000000000000002350988561515
The exp and sig are 2 and 0000000, and the float is 0.00000000000000000000000000000000000002350988701645
The exp and sig are 2 and 0000001, and the float is 0.00000000000000000000000000000000000002350988981904
The exp and sig are 2 and 8388607, and the float is 0.00000000000000000000000000000000000004701977123029
The exp and sig are 4 and 0000000, and the float is 0.00000000000000000000000000000000000009403954806578
The exp and sig are 4 and 0000001, and the float is 0.00000000000000000000000000000000000009403955927617
The exp and sig are 4 and 8388607, and the float is 0.00000000000000000000000000000000000018807908492118
The exp and sig are 8 and 0000000, and the float is 0.00000000000000000000000000000000000150463276905253
The exp and sig are 8 and 0000001, and the float is 0.00000000000000000000000000000000000150463294841873
The exp and sig are 8 and 8388607, and the float is 0.00000000000000000000000000000000000300926535873885
The exp and sig are 16 and 0000000, and the float is 0.00000000000000000000000000000000038518598887744717
The exp and sig are 16 and 0000001, and the float is 0.00000000000000000000000000000000038518603479519525
The exp and sig are 16 and 8388607, and the float is 0.00000000000000000000000000000000077037193183714626
The exp and sig are 32 and 0000000, and the float is 0.00000000000000000000000000002524354896707237777318
The exp and sig are 32 and 0000001, and the float is 0.00000000000000000000000000002524355197633791587823
The exp and sig are 32 and 8388607, and the float is 0.00000000000000000000000000005048709492487921744129
The exp and sig are 64 and 0000000, and the float is 0.00000000000000000010842021724855044340074528008699
The exp and sig are 64 and 0000001, and the float is 0.00000000000000000010842023017324751454180269995275
The exp and sig are 64 and 8388607, and the float is 0.00000000000000000021684042157240381566043314030823
The exp and sig are 128 and 0000000, and the float is 2.00000000000000000000000000000000000000000000000000
The exp and sig are 128 and 0000001, and the float is 2.00000023841857910156250000000000000000000000000000
The exp and sig are 128 and 8388607, and the float is 3.99999976158142089843750000000000000000000000000000
These are special values.
The exp and sig are 255 and 0000000, and the float is inf
The exp and sig are 255 and 0000001, and the float is nan
The exp and sig are 255 and 8388607, and the float is nan
```

---

## Rationale

* Computation becomes imprecise around 0
  * Dividing by small numbers is fraught with peril
* This is why the rounding method used is also strange
  * Round to nearest, with ties going to the even neighbor, so ties alternate between rounding up and down

---

## Other Floating Point Formats

* But this is designed for certain workloads
  * Are there alternatives?
    * Yes, obviously, or this slide wouldn't be here
* Let's talk about tiny floating point numbers
  * Specifically, the ones that will pump up our TOPS numbers

---

## FP16 (half precision)

* Let's begin with FP16
  * This is used in graphics
    * Humans can't actually distinguish colors very well
    * And there are only 256 values (0 to 255) per 8 bit color channel
  * Also in machine learning
    * Faster calculations! Outputs are approximations anyway!

---

## FP16 Bias

* Do we have a special exponent for inf and NaN?
  * Let's use $SE = 1$ if we do
  * And use $SE = 0$ if we don't
* $bias = 2^{k-1} - SE$, where $k$ is the number of exponent bits
* With 5 exponent bits, as in the IEEE 754-2008 standard
  * bias is either 16 or 15
  * IEEE reserves the special exponent, so its bias is 15

---

## Naming

* There are other formats, specialized for different hardware and industries
* To instantly indicate what bits are used, we name them in the type
  * EYSX
    * where X is the significand bits
    * and Y is the exponent bits
* IEEE fp16 is FP16 E5S10 (often written E5M10, with M for mantissa, as in E4M3 later on)

---

## Formats

* ARM processors have their own FP16 type
* For them, they don't encode inf or NaN
* This variability is why you should learn the fundamentals
  * then read the documentation for platform specifics

---

## bfloat16 (BF16)

* brain floating point
  * Named by Google Brain
* The same as IEEE fp32, but with the trailing 16 bits of significand truncated
  * 8 bits exponent, 7 bits significand
  * So don't add these together, small addends get rounded away
* When multiplying, this preserves the range as well as IEEE fp32
  * Which is the common case in ML

---

## Smaller

* FP8
  * See [NVIDIA's AI-centric description](https://developer.nvidia.com/blog/floating-point-8-an-introduction-to-efficient-lower-precision-ai-training/)
* Two common formats:
  * E4M3
  * E5M2
* One has more range, one has more precision

---

## Why Both?

* In neural networks, the range of outputs tends to be small
  * But precision can be important for the actual result
  * So E4M3 is better
* During backpropagation, derivatives are used to assign changes to parameters based upon the output error
  * These can be tiny, so E5M2 is better
* See [https://arxiv.org/abs/2209.05433](https://arxiv.org/abs/2209.05433) for an exhaustive study

---

## Ranges

* Recall:
  * $bias = 2^{k-1} - SE$
  * $exponent = e - bias$
* FP16 (k=5) max value: $2^{15}(1 + \frac{2^{10}-1}{2^{10}}) = 65504$
* BF16 (k=8) max value: $2^{127}(1 + \frac{2^7-1}{2^7}) \approx 3.39e+38$
* E4M3 (k=4) max value: $2^{8}(1 + \frac{2^3-2}{2^3}) = 448$
  * Note that there is no infinity, and only one pattern for NaN
  * So the all-ones exponent is still usable, and only the all-ones significand is lost
* E5M2 (k=5) max value: $2^{15}(1 + \frac{2^{2}-1}{2^{2}}) = 57344$

---

## Even Smaller

* 4 bits! Let's forget about NaN and infinity
* See [NVIDIA's AI-centric FP4 comparison](https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/)
* "Standard" FP4 is E2M1
  * One significand bit (steps of 1/2) and two exponent bits give the magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6
  * Range of 0 to 6 is too small, so a constant scale is applied to all FP4 values
* Big loss of accuracy when doing calculations

---

## MXFP4

* Same bit format (E2M1)
* But now stored in blocks of 32
  * That's 128 bits
* Each block gets an additional scale by a power of 2
* This is called "block floating point" format

---

## Scaling

* Scaling restores our range
  * The scale for a block is taken from the largest value in the block
  * E8M0 (8 bits of exponent) is used as the scale
* You can kind of think of this as E10M1
  * But if the values are very different, some will lose a lot of precision

---

## Scaling Example

* Let's say you want to store the values -3.54, -0.07, 1.38, and 10.1
  * Our values available with E2M1 are 0, 0.5, 1, 1.5, 2, 3, 4, 6
* Choose the scale from the largest value (10.1)
  * We have to scale by a power of 2, so our best options are $2\times4$ and $2\times6$
  * $2\times6$ is slightly better, so our scale will be 2
* Values become $2\times(-2, -0, 0.5, 6)$
  * Absolute errors are $0.46, 0.07, 0.38, 1.9$

---

## NVFP4

* Same bit format (E2M1)
* But now the block size is 16
  * Less likely to have large differences in a smaller group
* Scale factor is now E4M3
  * The mantissa gives more fine-grained control
* Then scale once more with FP32 E8M23 per full matrix or vector

---

## NVFP4 Scaling

* Again, our values are -3.54, -0.07, 1.38, and 10.1
  * Our values available with E2M1 are 0, 0.5, 1, 1.5, 2, 3, 4, 6
* Choose the scale from the largest value (10.1)
  * E4M3 scaling, values are: $2^{e-7}(1 + \frac{sig}{2^3})$
* This turns into an approximate factoring problem, but the mechanics aren't important to us

---

## NVFP4 Scaling Example

* Original values are -3.54, -0.07, 1.38, and 10.1
* Our values available with E2M1 are 0, 0.5, 1, 1.5, 2, 3, 4, 6
  * Choose scale 1.75, from e=7, sig=6
* Values become $1.75\times(-2, -0, 1.0, 6)$
  * Absolute errors are $0.04, 0.07, 0.37, 0.4$

---

## Stochastic Rounding

* NVFP4 does one more trick
  * Replace deterministic rounding
    * Rounding to nearest even is the default
  * This sounds good, but has bias with some distributions
* Stochastic rounding rounds to each neighbor with probability proportional to how close it is
  * e.g. 0.1 has to round to 0 or 0.5
    * it rounds to 0 with p = 0.8,
    * it rounds to 0.5 with p = 0.2
  * On average the result is $0.5 \times 0.2 = 0.1$, so no bias is introduced

---

## Why care?

* [NVIDIA reports](https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/) a drop in energy consumption with FP4 and their latest architecture
* Also roughly doubles OPS compared to FP8
  * Not to mention better locality!
* In a world where several percent of our energy grid is spent on ML, this is good

---

## What Should you Remember?

* You should know how floating point representation works
  * And remember that numeric conversions lose precision
* The exponent gives more dynamic range
* The significand/mantissa gives more precision

---

## Normal Vs Subnormal

* Representation of infinity and NaN can vary between formats
* But the subnormal representation is fixed
  * So remember how it works
* With $e = 0$ we are in the subnormal range
* Divide the space between 0 and $2^{1-bias}$ evenly
  * value = $(-1)^{sign} \times 2^{1-bias}\frac{Significand}{2^{s}}$

---

## Next Topics

* For the rest of the semester we'll cover digital logic design
* A combination of hardware and boolean logic
  * We've brushed up against it in other topics
* We won't dive deep
  * But you will hopefully get some perspective to think about what's happening on the silicon