int main(int argc, char** argv) {
int a = 260;
char b = (char)a;
printf("a and b are %i and %i\n", a, b);
return 0;
}
```
* What prints out?
---
## Loss of Precision
* Converting an int (32 bits) into a char (8 bits) means we lose some values
* 260 is 0x104, from 256 + 4
* The char cannot hold the 256, so that is discarded
* b is left with the value 4
---
## ASM
* Can you read this code?
```asm
.file "check.c"
.text
.section .rodata
.LC0:
.string "a and b are %i and %i\n"
.text
.globl main
.type main, @function
main:
.LFB0:
.cfi_startproc
endbr64
pushq %rbp
movq %rsp, %rbp
subq $32, %rsp
movl %edi, -20(%rbp)
movq %rsi, -32(%rbp)
movl $260, -4(%rbp)
movl -4(%rbp), %eax
movb %al, -5(%rbp)
movsbl -5(%rbp), %edx
movl -4(%rbp), %eax
movl %eax, %esi
leaq .LC0(%rip), %rax
movq %rax, %rdi
movl $0, %eax
call printf@PLT
movl $0, %eax
leave
ret
.LFE0:
.size main, .-main
.ident "GCC: (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0"
.section .note.GNU-stack,"",@progbits
.section .note.gnu.property,"a"
.align 8
.long 1f - 0f
.long 4f - 1f
.long 5
0:
.string "GNU"
1:
.align 8
.long 0xc0000002
.long 3f - 2f
2:
.long 0x3
3:
.align 8
4:
```
---
## ASM
* Familiar from the bomb lab, right?
* This loads the immediate 260 value:
* `movl $260, -4(%rbp)`
* This moves the value into %eax
* `movl -4(%rbp), %eax`
* This moves the lower byte of that register on a stack variable
* `movb %al, -5(%rbp)`
---
## Addressing Types
* You remember these, right?
* Immediate
* Register
* Direct
* Indirect
---
## Immediate Addressing
* We can use constant values with registers
* e.g. `$5` refers to the constant value 5
* This sets the destination to 5
* `movq $5 dst`
---
## Register Addressing
* More common to work with register values
* `movq %rax, %rdi`
* Moves the value in %rax into %rdi
---
## Direct and Indirect
* These are pointer equivalents
* Direct uses a hard-coded address
* `movq 0x40abcd, dst`
* Would move `memory[0x40abcd]` to destination
* `moveq (%rax), dst`
* Would move `memory[%rax]` to destination
* The first is `direct`, the second is `indirect`
---
## Displacement and Scaling
* Arrays are very common, so syntax supports them
* `8(%rax)` is `memory[%rax + 8]`
* like `char_ptr[8]`
* But most displacement has a scaling factor
* E.g. ints have a scale of 4 bytes
* `8(%rax, %rcx, 4)` is `memory[%rax + 8 + 4*%rcx]`
* Can skip the displacement for familiar array access
* `(%rax, %rcx, 4)` is `memory[%rax + 4*%rcx]`
---
## Floating Point Numbers
* These seem to trouble you
Consider a 4-bit floating point value with 1 bit of sign, 2 bits of exponent, and 1 bit of signficand. Bias = 2. Infinity and NaN are not used in this format. What is the smallest (largest negative) value that this format can represent?
---
## Floating Point Numbers
* Slight change.
Consider a 4-bit floating point value with 1 bit of sign, 1 bit of exponent, and 2 bits of signficand. Bias = 1. Infinity and NaN are not used in this format. What is the smallest (largest negative) value that this format can represent?
---
## Float Values
* Often support special patterns for inf and NaN
* Generally with all exponent bits set
* For fp32
* value = $(-1)^{sign} \times 2^{exponent - bias} \times (1+\frac{Significand}{2^{23}})$
* If exponent bits all 0
* value = $(-1)^{sign} \times 2^{1-bias} \times (\frac{Significand}{2^{23}})$
* These are subnormal numbers
---
## Rounding
* How do floats round?
* What if we always chose up or down?
* Then we get biased results
* So we could round up half the time and down the other half
* For example, round towards even
---
## Stochastic Rounding
* For machine learning, that isn't good enough
* Rounding to nearest even has bias with some distributions
* Instead, round with probability proportional to the distance
* e.g. 0.1 has to round to 0 or 0.5
* it rounds to 0 with p = 0.2,
* it rounds to 0.5 with p = 0.8
---
## Digital Math
* All math happens in the CPU
* Integer math in ALUs
* Arithmetic logic units
* Floating point math in FPUs
* Floating Point Units
---
## Idyllic Architecture
* Real hardware is complex, so we made up a simple system
---
## Pipelines
* Here's our idyllic version of a 5-stage CPU pipeline
* Each gray bar is a buffer, called a `pipeline register`
* The pipeline stages increase throughput roughly 5x
---
## Pipeline Stages$^*$
1. Instruction fetch (IF)
2. Instruction decode/register fetch (ID)
3. Execute (math or address calculation) (EX)
4. Memory access (MA)
5. Write back (WB)
$^*$ For our super-simplified example pipeline
---
## Where is the Cache?
* Notice that there are multiple pieces with memory
* Instruction cache in the IF stage
* Registers in ID
* L1 cache in MA
* And other memory if there is a miss in L1
* So our idyllic version of the pipeline is hiding many details
---
## Dependencies
* Our idyllic pipeline is most useful for revealing dependencies
* If we need to load data into a register for some math, that is a data dependency
Q: A load instruction issued on cycle 0 completes at the end of cycle 6. The next instruction requires that load result in the EX stage. How many no-ops instructions are issued to insert a bubble in the pipeline until the data is ready? Assume that we can forward directly from the MA stage to the EX stage.
---
## Dependency
* If the first instruction is issued on cycle 0, the next is issued on cycle 1
* EX is the third stage, so EX for the second instruction occurs on cycle 3
* But the data is not ready until cycle 6
* If the result is forwarded directly to the EX state, we still need to stall for 3 cycles
---
## Dealing with Dependencies
* Data hazards
* Data forwarding
* Out of order execution
* Control Hazards
* Use speculative execution with branch prediction
* Instructions are prefetched so that different branch instructions are available
---
## Prefetching
* It takes time to load data into a cache
* Whether the instruction cache, or the data cache
* So modern CPUs predict what data you want
* They do this by guessing the data access pattern
* We call this locality
* Good locality leads to better prefetching and fewer misses, and more efficient data usage
---
## Locality
* Programs use data near what they are currently using
* `Temporal locality`: programs keep using the data they've got
* `Spatial locality`: when programs read new data, it is near what they recently addressed
---
## Locality
```c
float mean(float* data, size_t size) {
float sum = 0;
for (size_t i = 0; i < size; ++i) {
sum += data[i];
}
return sum/size;
}
```
* Data:
* Spatial locality in the array access
* Temporal locality in the sum variable
---
## Locality continued
```c
float mean(float* data, size_t size) {
float sum = 0;
for (size_t i = 0; i < size; ++i) {
sum += data[i];
}
return sum/size;
}
```
* Instructions:
* Spatial locality: the instructions could have been prefetched before the function was called
* Temporal locality: repeatedly use the loop instructions
---
```c
float mean(float* data, size_t size) {
float sum = 0;
for (size_t i = 0; i < size; ++i) {
sum += data[i];
}
return sum/size;
}
```
With no prefetching and a block size of 16 bytes, what is the L1 cache miss rate?
---
```c
float mean(float* data, size_t size) {
float sum = 0;
for (size_t i = 0; i < size; ++i) {
sum += data[i];
}
return sum/size;
}
```
Each float is 4 bytes or one quarter of a block. The first in a block is a miss, but the rest will be hits. The overall miss rate is $1/4$.
---
```c
float mean(float* data, size_t size) {
float sum = 0;
for (size_t i = 0; i < size; ++i) {
sum += data[i];
}
return sum/size;
}
```
How does the answer change if there is sequential prefetching?
---
```c
float mean(float* data, size_t size) {
float sum = 0;
for (size_t i = 0; i < size; ++i) {
sum += data[i];
}
return sum/size;
}
```
Only the first request is a miss since all subsequent reads are sequential through memory. The miss rate is $1/\text{size}$, and approaches 0 as the array grows larger.
---
## Prefetching and Block Size
* A small block is fetched faster, but has a higher miss rate
* Since it may not grab the entire region that will be used
* Large blocks increase hit rate if spatial locality is true
* But decrease the number of cache lines
* This increases misses if data isn't used in groups of that block size
* And contention misses, if there are multiple programs/threads
* Higher latency to load too
---
## Describing a Cache
* For any cache
* Data is fetched into `cache lines` which each hold a single `block`
* Lines are grouped into `sets`
* For CPUs and GPUs there are different cache levels
* They all make different tradeoffs with number of sets and block sizes
---
## Associativity
* Think fast!
* What is the disadvantage of a fully associative cache?
* What is the disadvantage of a directly-mapped cache?
* Consider only the data, not the instructions, of our matrix multiply code
* Which associativity is best? 2-way, 4-way, or 8-way?
---
## 2-Way Associative Example
* Blocks are 2 bytes, there are 4 sets
* Requests are:
* 0000
* 0001
* 1000
* 1001
* 1010
* 1100
* 1110
* 1010
* 1011
* 0100
| Valid 1 | Tag 1 | Valid 2 | Tag 2 | Hit/Miss |
| Set 0 | 0 | | 0 | | |
| Set 1 | 0 | | 0 | | |
| Set 2 | 0 | | 0 | | |
| Set 3 | 0 | | 0 | | |
---
## 2-Way Associative Example
* Blocks are 2 bytes, there are 4 sets
* Requests are:
* **0000**
* 0001
* 1000
* 1001
* 1010
* 1100
* 1110
* 1010
* 1011
* 0100
| Request |
| Tag | Set | Block |
| 0 | 00 | 0 |
| Valid 1 | Tag 1 | Valid 2 | Tag 2 | Hit/Miss |
| Set 0 | 1 | 0 | 0 | | Miss |
| Set 1 | 0 | | 0 | | |
| Set 2 | 0 | | 0 | | |
| Set 3 | 0 | | 0 | | |
---
## 2-Way Associative Example
* Blocks are 2 bytes, there are 4 sets
* Requests are:
* 0000
* **0001**
* 1000
* 1001
* 1010
* 1100
* 1110
* 1010
* 1011
* 0100
| Request |
| Tag | Set | Block |
| 0 | 00 | 1 |
| Valid 1 | Tag 1 | Valid 2 | Tag 2 | Hit/Miss |
| Set 0 | 1 | 0 | 0 | | Hit |
| Set 1 | 0 | | 0 | | |
| Set 2 | 0 | | 0 | | |
| Set 3 | 0 | | 0 | | |
---
## 2-Way Associative Example
* Blocks are 2 bytes, there are 4 sets
* Requests are:
* 0000
* 0001
* **1000**
* 1001
* 1010
* 1100
* 1110
* 1010
* 1011
* 0100
| Request |
| Tag | Set | Block |
| 1 | 00 | o |
| Valid 1 | Tag 1 | Valid 2 | Tag 2 | Hit/Miss |
| Set 0 | 1 | 0 | 1 | 1 | Miss |
| Set 1 | 0 | | 0 | | |
| Set 2 | 0 | | 0 | | |
| Set 3 | 0 | | 0 | | |
---
## 2-Way Associative Example
* Blocks are 2 bytes, there are 4 sets
* Requests are:
* 0000
* 0001
* 1000
* **1001**
* 1010
* 1100
* 1110
* 1010
* 1011
* 0100
| Request |
| Tag | Set | Block |
| 1 | 00 | 1 |
| Valid 1 | Tag 1 | Valid 2 | Tag 2 | Hit/Miss |
| Set 0 | 1 | 0 | 1 | 1 | Hit |
| Set 1 | 0 | | 0 | | |
| Set 2 | 0 | | 0 | | |
| Set 3 | 0 | | 0 | | |
---
## 2-Way Associative Example
* Blocks are 2 bytes, there are 4 sets
* Requests are:
* 0000
* 0001
* 1000
* 1001
* **1010**
* 1100
* 1110
* 1010
* 1011
* 0100
| Request |
| Tag | Set | Block |
| 1 | 01 | 0 |
| Valid 1 | Tag 1 | Valid 2 | Tag 2 | Hit/Miss |
| Set 0 | 1 | 0 | 1 | 1 | |
| Set 1 | 1 | 1 | 0 | | Miss |
| Set 2 | 0 | | 0 | | |
| Set 3 | 0 | | 0 | | |
---
## 2-Way Associative Example
* Blocks are 2 bytes, there are 4 sets
* Requests are:
* 0000
* 0001
* 1000
* 1001
* 1010
* **1100**
* 1110
* 1010
* 1011
* 0100
| Request |
| Tag | Set | Block |
| 1 | 10 | 0 |
| Valid 1 | Tag 1 | Valid 2 | Tag 2 | Hit/Miss |
| Set 0 | 1 | 0 | 1 | 1 | |
| Set 1 | 1 | 1 | 0 | | |
| Set 2 | 1 | 1 | 0 | | Miss |
| Set 3 | 0 | | 0 | | |
---
## 2-Way Associative Example
* Blocks are 2 bytes, there are 4 sets
* Requests are:
* 0000
* 0001
* 1000
* 1001
* 1010
* 1100
* **1110**
* 1010
* 1011
* 0100
| Request |
| Tag | Set | Block |
| 1 | 11 | 0 |
| Valid 1 | Tag 1 | Valid 2 | Tag 2 | Hit/Miss |
| Set 0 | 1 | 0 | 1 | 1 | |
| Set 1 | 1 | 1 | 0 | | |
| Set 2 | 1 | 1 | 0 | | |
| Set 3 | 1 | 1 | 0 | | Miss |
---
## 2-Way Associative Example
* Blocks are 2 bytes, there are 4 sets
* Requests are:
* 0000
* 0001
* 1000
* 1001
* 1010
* 1100
* 1110
* **1010**
* 1011
* 0100
| Request |
| Tag | Set | Block |
| 1 | 01 | 0 |
| Valid 1 | Tag 1 | Valid 2 | Tag 2 | Hit/Miss |
| Set 0 | 1 | 0 | 1 | 1 | |
| Set 1 | 1 | 1 | 0 | | Hit |
| Set 2 | 1 | 1 | 0 | | |
| Set 3 | 1 | 1 | 0 | | |
---
## 2-Way Associative Example
* Blocks are 2 bytes, there are 4 sets
* Requests are:
* 0000
* 0001
* 1000
* 1001
* 1010
* 1100
* 1110
* 1010
* **1011**
* 0100
| Request |
| Tag | Set | Block |
| 1 | 01 | 1 |
| Valid 1 | Tag 1 | Valid 2 | Tag 2 | Hit/Miss |
| Set 0 | 1 | 0 | 1 | 1 | |
| Set 1 | 1 | 1 | 0 | | Hit |
| Set 2 | 1 | 1 | 0 | | |
| Set 3 | 1 | 1 | 0 | | |
---
## 2-Way Associative Example
* Blocks are 2 bytes, there are 4 sets
* Requests are:
* 0000
* 0001
* 1000
* 1001
* 1010
* 1100
* 1110
* 1010
* 1011
* **0100**
| Request |
| Tag | Set | Block |
| 0 | 10 | 0 |
| Valid 1 | Tag 1 | Valid 2 | Tag 2 | Hit/Miss |
| Set 0 | 1 | 0 | 1 | 1 | |
| Set 1 | 1 | 1 | 0 | | |
| Set 2 | 1 | 1 | 1 | 1 | Miss |
| Set 3 | 1 | 1 | 0 | | |
---
## DLD
* We remember the full adder, right?
* The carry out has multiple solutions
* $C_{out} = \bar{A}BC_{in} + AC_{in}\bar{B} + AB\overline{C_{in}} + ABC_{in}$
* This simplifies multiple ways
* Most common is $C_{out} = AB + C_{in}(A \oplus B)$
| Input A | Input B | Carry In | Sum | Carry Out |
| 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 1 | 1 | 0 |
| 0 | 1 | 0 | 1 | 0 |
| 0 | 1 | 1 | 0 | 1 |
| 1 | 0 | 0 | 1 | 0 |
| 1 | 0 | 1 | 0 | 1 |
| 1 | 1 | 0 | 0 | 1 |
| 1 | 1 | 1 | 1 | 1 |
---
## Full Adder
---
## Ripple Carry Adder
---
## Decoding
* Decodes $n$ input bits into $2^n$ outputs
* One output is 1, the rest are 0
* We can implement any sum of products truth table with this
A 2 to 4 decoder
---
## Question
* Implement a full adder with the provided 4 to 8 decoder
---
## Solution
* Bonus question! What is the delay of this circuit if each gate has a delay of $n$?
---
## The Final
* Don't bring a ton of stuff with you
* I'll just be asking you to leave your things up front
* No phone, smart watch, smart glasses, in or out of ear speakers, etc
* Being found with a device will result in exam confiscation and a 0