# CS 211 - Lecture 14

## Threaded Execution

Bernhard Firner

2026-03-11

---

## Review Concepts

* x86-64 is CISC (complex instruction set computer)
  * But internally the pipeline has reduced complexity
  * On the inside it looks like RISC
* But the instruction handling has grown in complexity

---

## Pipelines

* Pipelines increase CPU throughput
  * Each functional unit can work on a new instruction every clock cycle
* Buffers, called `pipeline registers`, store each stage's outputs until the next clock cycle


---

## Pipeline Stages$^*$

1. Instruction fetch (IF)
2. Instruction decode/register fetch (ID)
3. Execute (math or address calculation) (EX)
4. Memory access (MA)
5. Write back (WB)

<p style='font-size:25pt'>$^*$ For our super-simplified example pipeline</p>

---

## Pipeline Execution

---

## Pipeline Problems

* Data hazards
  * read after write: when we need to read data that a previous instruction is writing
    * The write could come from a `mov` of an immediate, from an instruction's EX stage, or from a `mov` that reads memory
* Control hazards
  * When an instruction may update the instruction pointer
    * `jmp`, `call`, `ret`

---

## Hazard

* When a dependency *may* cause a problem, it is called a `hazard`
* Hazards can be mitigated
  * Either by stalling the pipeline with a bubble of no-ops
  * Or by using hardware to forward results between pipeline stages

---

## Bubble

* Waiting for data means doing nothing
  * Here we are loading something into rax and then we want to use it
* No-ops are inserted into the pipeline, causing a stall


---

## Data Hazards

* Any time an instruction needs a value there is the potential for a hazard
  * Forwarding may be able to resolve the hazard


---

## Other Mitigations

* Some hazards require a delay
  * Memory read whose result is used next cycle
  * Forwarding saves one cycle
* But no-ops are a waste of time
* So use `out of order execution`
  * CPU is always prefetching instructions, and will squeeze some in if they have no dependencies
* The re-order buffer (ROB) puts the results back into program order


---

## Control Hazards

* Control hazards involve the instruction pointer
* Two problem cases in our example pipeline
  * Memory read results used to set instruction pointer
  * Math ALU results that modify the instruction pointer

---

## Example

* Branching code will cause slowdowns
* The *if* statement is turned into a `cmpq` and a `ja`

```C
// Calculate the sum n + (n - 1) + (n - 2) ... + 1
unsigned long long sequence(unsigned long long n, unsigned long long current) {
    if (n < 2) {
        return 1 + current;
    }
    return sequence(n-1, current+n);
}
```

</div>
<div class="col">

<pre>
sequence:
	pushq	%rbp
	movq	%rsp, %rbp
	subq	$16, %rsp
	movq	%rdi, -8(%rbp)
	cmpq	$1, -8(%rbp)
	ja	.L2
	movl	$1, %eax
	jmp	.L3
.L2:
	movq	-8(%rbp), %rax
	subq	$1, %rax
	movq	%rax, %rdi
	call	sequence
	movq	-8(%rbp), %rdx
	addq	%rdx, %rax
.L3:
	leave
	.cfi_def_cfa 7, 8
	ret
	.cfi_endproc
</pre>

</div>
</div>

---

## Speaking of cmpq

* The `cmpq` instruction has trouble in our pipeline
* `cmpq` wants to compare an immediate value to a value at a calculated memory address
* So it needs several operations
  * Calculate the address
  * Fetch memory
  * Do the comparison

---

## Microcode

* That is too many things at once
  * A modern CPU rewrites the instruction
    * Into a `movq` with indirect addressing, loading memory into a register
    * Into a `cmpq`, now comparing the immediate against that register

---

## Execution

* Let's see how this executes
  * The `movq` is issued first, moving memory into a register
  * `cmpq` follows, but it needs to wait until after the MA stage
    * So we could IF at cycle 2, but cannot EX until the `movq` has finished MA in cycle 4 (with forwarding)


---

## Bubble

* We are forced to bubble, inserting no-ops to delay the `cmpq` by one cycle
* But what about the `ja` instruction?


---

## Bubble

* `ja` needs the cmpq result from the ALU operation in EX
  * Forwarding resolves this


---

## Speculative Execution

* `ja` will only determine the jump address after EX
* The CPU loads more instructions anyway, assuming that it can guess the branch
  * This is `speculative execution`


---

## Correct Branch Prediction

* If branch prediction was correct, then things continue happily


---

## Incorrect Branch Prediction

* When branch prediction fails
  * The executed instructions are invalidated and their results ignored
* The instruction pointer is updated to the correct value
  * This means that new instructions must be fetched
  * In reality, these instructions may already be in the prefetch cache


---

## Performance Analysis

* So what is the cost of a bubble?
  * Or of a branch misprediction?
* And how much speedup do we actually get from a pipeline?

---

## Pipeline Benefits

* A non-pipelined CPU must have a clock cycle long enough for every stage of an instruction to finish
* A pipelined CPU only needs a clock cycle long enough for a single stage, not an entire instruction
* Deeper pipelines allow for faster clock rates

---

## Pipeline Costs

* If we cut a single-stage system into five stages, you would expect 5x throughput
* However, pipeline registers take time to buffer outputs
  * Let's say you cut a 500ps single stage into five 100ps stages
  * If the buffers take 10ps, then each stage takes 110ps
  * $500/110 \approx 4.55$ speedup

---

## Diminishing Returns

* If you cut that same 500ps into ten 50ps stages, what happens?
  * $500/(50+10) \approx 8.33$ speedup
* So our returns diminish quickly
* Not only that, longer pipelines are also harder to fill with instructions

---

## Filling a Pipeline

* A pipeline that isn't filled isn't fast
* Why wouldn't we be able to fill a pipeline?
  * jmp, call, ret, etc
* It's quite common that there are multiple jumps in a row
  * e.g. a series of `if` statements

---

## Misprediction costs

* The data dependency between `movq` and `cmpq` cost 1 cycle
* What did a mispredicted branch cost?
  * We lose 2 cycles when we discard the results at time 6
  * If we didn't already have the jump address loaded, that could be several cycles
  * So 2 cycles with preloading, 3 cycles if it takes one cycle to fetch, more cycles if the new instructions aren't ready


---

## Prefetching and Caching

* A modern instruction fetch step is complicated
  * Predict what to load and cache ahead of time
  * Try to pull data from the cache preemptively

---

## Prefetch Why?

* There are multiple levels of memory
  * Registers are the fastest, being accessed within a clock cycle
* Other delays depend upon the architecture
  * For the Arm Cortex-A510 (cell phones)
    * L1 cache is 2-3 cycles
    * L2 cache is 9-11 cycles
  * Some CPUs also have L3 cache
* Going to disk means going out to lunch (metaphorically)

---

## Modern Branch Prediction

* Predicts whether a branch will be taken and the target address
  * This unit actually has memory, storing whether the branch at a memory address was previously taken
* Current branch prediction actually uses basic ML, built directly into the hardware

---

## Instruction Fetch Translations

* Remember how your memory addresses in C were fake?
  * That memory is called **virtual memory**
  * Memory always looked like it belonged to your program alone
  * In reality, shared between processes

---

## Virtual Memory Translation

* When instructions are fetched, their addresses are translated from the virtual addresses used in program memory into real, physical locations
  * The recent translations are stored in a **translation lookaside buffer**, or TLB

---

## TLB

* If an address hasn't been translated recently, its translation is absent from the TLB
* This incurs a time penalty the first time that memory is accessed
  * Your CPU has to get the translation from main memory
* This means that there is a context-switching cost when swapping programs, as each program has to refill the TLB with its own translations

---

## Instruction Decode

* Decoding doesn't just mean moving a number from an input to an output
  * Instructions are rewritten
    * Split into micro-operations, optimized for hardware
  * Registers are reassigned
    * to support parallelism
    * to remove register dependencies

---

## Superscalar

* With those advances, wider, slower pipelines can be better than deeper, faster ones
  * Instead of being faster, why not load 2 (or more!) instructions at once?
  * Any pipeline that is more than one instruction wide is **superscalar**
* Current CPUs focus on running multiple programs or threads
  * Don't need to be faster, just more efficient

---

## 2 Instructions Wide

* Deeper pipelines have diminishing returns
* So we go wider with superscalar pipelines
* This is a consequence of Moore's law, the increase of transistors per unit area


---

## 3 Instructions Wide

* So what is happening inside of the CPU?
* There are actually 100s of registers, not just the ones you see
  * The CPU uses these to issue multiple sets of instructions in parallel
* Is this all automatic? Do you have to do anything?


---

## Threads

* Superscalar architectures can run multiple parallel instructions
  * This is called **instruction level parallelism**
* But there is only so much rearranging the CPU can do
  * Sequential programs are inherently dependent upon previous steps
* So we introduce **thread level parallelism**

---

## Motivation

* Run `cat /proc/cpuinfo` on a Linux machine and you'll see all of its cores
  * You can find multiple 24-core CPUs on iLab
* On a shared machine you may have 24 users running 24 different programs
  * But on a personal machine, your processes mostly come from a web browser
* If your current CPU is mostly idle, you aren't going to buy a new one

---

## Feeding the CPU

* If CPUs aren't getting better, we aren't buying new ones
* So modern CPUs are optimized for threading
  * A thread lets a single program run different parts of itself in parallel
* But the threads' memory may remain shared, so things can get complicated

---

## Threads

* Let's do a simple example, where multiple threads execute within the same program
* Memory management with threads is tricky, so let's not worry about that for now

---

## Threads in C

* Functions and types are defined in `<threads.h>`
  * [https://cppreference.com/w/c/header/threads.html](https://cppreference.com/w/c/header/threads.html)
* Call `thrd_create` to make a thread
  * We also pass it a function to run and the argument to that function
  * It fills in an ID for that thread and returns a status code
* Call `thrd_join` to wait for a thread to finish

---

## thrd_create

* `int thrd_create( thrd_t *thr, thrd_start_t func, void *arg );`
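  * The first argument is a pointer to fill in with the thread information
  * The second argument is a pointer to the function to run, which must have the type `int func(void *)`
  * The third is a single pointer that is passed to that function (to unknown things, so it is a `void*`); to pass several values, point it at an array or struct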
* Not going to quiz you about the function, but threads are too important to ignore in modern CPUs

---

## Parallel Example

```C
#include <stdio.h>
#include <threads.h>

// thrd_start_t is int (*)(void*), so the thread function must return an int.
int printingFunction(void* args) {
    // We have to know what type is being passed.
    size_t thread_id = *(size_t*)(args);
    printf("Thread %zu is running!\n", thread_id);
    return 0;
}

int main(void) {
    // Let's make some threads!
    thrd_t threads[10];
    size_t numbers[10];
    // Start some threads.
    for (int i = 0; i < 10; ++i) {
        numbers[i] = i;
        int result = thrd_create(&threads[i], printingFunction, &numbers[i]);
        if (result != thrd_success) {
            printf("Error creating thread %zu\n", numbers[i]);
        }
    }
    // Wait for all threads to finish.
    for (int i = 0; i < 10; ++i) {
        int res = thrd_join(threads[i], NULL);
        if (res != thrd_success) {
            printf("Error: thread %zu failed to join\n", numbers[i]);
        }
    }

    return 0;
}
```

---

## Output

<pre>
$ ./a.out
Thread 1 is running!
Thread 2 is running!
Thread 4 is running!
Thread 0 is running!
Thread 3 is running!
Thread 6 is running!
Thread 5 is running!
Thread 7 is running!
Thread 8 is running!
Thread 9 is running!
</pre>

---

## Thread Execution

* Notice that the order in which the threads ran is different from the order in which we started them
  * Any synchronization is up to the programmer, making things tricky
* But this is like an extreme version of out of order execution
  * The CPU can easily fill its superscalar pipelines
  * And it is up to *us*, the programmers, to guarantee safety

---

## Larger Example

* We'll write an example program that sums numbers
  * This is just giving our CPU busywork
* Imagine that we are doing one of the divide and conquer sorting algorithms instead

---

## Summing with Threads

```C
#include <stdio.h>
#include <stdlib.h>
#include <threads.h>
#include <time.h>

// Converts the timespec to milliseconds
typedef struct timespec timespec;
double timespecToMs(timespec t) {
    return 1000.0 * t.tv_sec + 1e-6 * t.tv_nsec;
}

// Return the sum from begin up to, but not including, end
size_t sumFunction(size_t begin, size_t end) {
    // We could obviously do (begin + end - 1) * (end - begin) / 2, but this is for illustration
    size_t sum = 0;
    for (size_t i = begin; i < end; ++i) {
        sum += i;
    }
    return sum;
}

// thrd_start_t is int (*)(void*), so the thread function returns an int.
int sumFunctionThread(void* void_args) {
    size_t* args = (size_t*)void_args;
    size_t thread_id = args[0];
    printf("Thread %zu is running!\n", thread_id);
    size_t begin = args[1];
    size_t end = args[2];
    // Fill in the result value
    args[3] = sumFunction(begin, end);
    return 0;
}

int main(int argc, char** argv) {
    if (argc < 3) {
        printf("Usage: %s max threads\n", argv[0]);
        printf("\tmax: value to sum up to.\n");
        printf("\tthreads: the number of threads to be used for summing.\n");
        return 1;
    }

    // Use size_t for max so that i*max below cannot overflow an int.
    size_t max = strtoull(argv[1], NULL, 10);
    int nthreads = atoi(argv[2]);
    if (nthreads < 1) {
        printf("threads must be at least 1\n");
        return 1;
    }

    // Set up timers and run
    struct timespec tw_begin;
    clock_gettime(CLOCK_MONOTONIC, &tw_begin);
    struct timespec ts_begin;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts_begin);

    // Let's make some threads!
    thrd_t threads[nthreads];
    // Each args array will have the thread ID, the begin, end, and the return value
    size_t args[nthreads][4];

    // Start some threads.
    for (int i = 0; i < nthreads; ++i) {
        args[i][0] = i;
        args[i][1] = i*max/nthreads;
        args[i][2] = (i+1)*max/nthreads;
        args[i][3] = 0;
        int result = thrd_create(&threads[i], sumFunctionThread, args[i]);
        if (result != thrd_success) {
            printf("Error creating thread %zu\n", args[i][0]);
        }
    }
    size_t sum = 0;
    // Wait for all threads to finish.
    for (int i = 0; i < nthreads; ++i) {
        int res = thrd_join(threads[i], NULL);
        if (res != thrd_success) {
            printf("Error: thread %zu failed to join\n", args[i][0]);
        }
        sum += args[i][3];
    }
    printf("Sum is %zu\n", sum);

    struct timespec tw_end;
    clock_gettime(CLOCK_MONOTONIC, &tw_end);
    struct timespec ts_end;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts_end);

    double posix_wall = timespecToMs(tw_end) - timespecToMs(tw_begin);
    double cpu_time = timespecToMs(ts_end) - timespecToMs(ts_begin);

    printf("wall time used is %.2f ms\n", posix_wall);
    printf("CPU time used is %.2f ms\n", cpu_time);

    return 0;
}
```

---

## Speedup

* We'll sum 100 million numbers
* Time is measured in "wall" and "CPU" time
  * "wall" time is the elapsed time as we experience it
  * "CPU" time is the total time spent on the CPUs, summed over all threads

<pre>
$ ./a.out 100000000 1
Thread 0 is running!
Sum is 4999999950000000
wall time used is 193.23 ms
CPU time used is 192.79 ms
$ ./a.out 100000000 2
Thread 0 is running!
Thread 1 is running!
Sum is 4999999950000000
wall time used is 98.23 ms
CPU time used is 195.57 ms
</pre>

---

## Timing

* The CPU time is about the same for each
  * Makes sense, the same work is being done
* But the wall time dropped from 193.23ms to 98.23ms
  * About half, and we used 2 threads
* Can we keep going forever?

---

## Performance

* Why does the performance plateau?
* And what's with that strange jump at 5 threads?
  * We'll have to learn more architecture to find out!


---

## Modern Architectures

* Because of changing complexity, some modern CPU diagrams will break a CPU into
  * Front end: the instruction fetch and decode steps
  * Execution engine
  * Back end: memory and write-back
* Designers also care more about *total* throughput and the energy used per unit of work than about raw speed

---

## Real World

---

## Real World

---

## Still Recognizable

* Despite the modern CPU pipeline becoming wider and more sophisticated, the stages are the same
  * The fetch and decode stages put in a lot of work
  * They identify optimizations, rewrite instructions, predict branches, etc
* **The next few slides are for context, I don't expect you to memorize the internals of specific CPUs**

---

## Real World

---

## Real World

---

## Real World

---

## Some Are Simple

* The Cortex-A510 Architecture from ARM
* 10 stages
  * 3 fetch
  * 3 decode
  * 1 issue
  * 2 execute
  * 1 write


---

## So What?

* What does this mean for you?
  * Knowing your processor will affect your code
* Programming for a smart phone?
  * Use less floating point, don't bother with parallelism
* Programming on current AMD or Intel?
  * Make more threads
* Using a GPU?
  * Turn everything into matrices

---

## Biggest Impact

* On any architecture though, two delays dominate:
  * Delays from incorrect branch prediction
  * Memory usage patterns

---

## Major Topics

* Two major topics to look forward to
  * Control flow optimizations
  * Introduction of cache and memory aware programming