# CS 211 - Lecture 18 ## Computer Architecture ### Types of Cache Bernhard Firner 2026-04-01 --- ## Review
--- ## Exploiting Locality * Memory nearer to the processing (ALUs and FPUs) is faster, but smaller * In general * As we work our way up, the faster memory exhibits more locality in usage * As we work our way down, the slower memory has higher throughput to satisfy memory closer to the action * The slower memory is generally optimized for relatively higher density and throughput --- ## Locality * Programs use data near what they are currently using * `Temporal locality`: programs keep using the data they've got * `Spatial locality`: when programs read new data, it is near what they recently addressed --- ## Locality ```c float mean(float* data, size_t size) { float sum = 0; for (size_t i = 0; i < size; ++i) { sum += data[i]; } return sum/size; } ``` * Data: * Spatial locality in the array access * Temporal locality in the sum variable --- ## Locality continued ```c float mean(float* data, size_t size) { float sum = 0; for (size_t i = 0; i < size; ++i) { sum += data[i]; } return sum/size; } ``` * Instructions: * Spatial locality: the instructions could have been prefetched before the function was called * Temporal locality: repeatedly use the loop instructions --- ## Locality By Design * Good design is on *you*, the programmer * We should optimize our program for the cache * Cache is non-register memory on the CPU itself * There are generally multiple types of cache --- ## Future Hardware Can't Trivialize * Some problems are solved with future hardware * Not this one! * RAM has grown in density by many orders of magnitude, but latency hasn't kept up * To compensate, we have more levels of cache now, in a desperate race to keep up with modern processing speed --- ## Reading * CS:App Chapters 6.2-6.6 * Outside reading: * [https://gameprogrammingpatterns.com/data-locality.html](https://gameprogrammingpatterns.com/data-locality.html) --- ## Other Resources * Good, practical talks from game developers * [https://youtu.be/SZOr0m-K5PQ?si=JAG3Lx88y8yjkSaZ](https://youtu.be/SZOr0m-K5PQ?si=JAG3Lx88y8yjkSaZ) * Lots of register talk, so fun! * Because they *can* share, and performance matters to them * A good proxy for other industries since games are so varied --- ## Cacheing * Let's talk about how we cache * When is data placed into a cache? * When is data evicted? * What is the cost? --- ## Abstract Cache * Data is organized into *blocks* * A higher cache mirrors blocks from a lower cache * The higher cache is smaller, and holds a small fraction of what is in the lower cache
Higher Cache
Block 12
Block 300
Block 5
Block 8
Lower Cache
Block 1
Block 2
Block 3
Block 4
Block 5
Block 6
Block 7
Block 8
Block 9
Block 10
Block 11
Block 12
Block ........................................................
--- ## Describing Cache * Obviously each cache has dinstinct properties * But how do we describe a cache? How is its memory organized? * For CPU cache, meaning L1, L2, l3: * Data is fetched into `cache lines` * Each line is arranged into a `set` --- ## Visualization
See CS:App, Figure 6.25
--- ## Cache Lines * Cache lines hold three fields * Valid bit * tag * Data * $2^b$ bytes, usually 64-bytes in L1/L2/L3 cache * The address of the original data is translated into the the set and tag --- ## Querying the Cache * Each cache line holds a set number of bytes, called the block size * Memory, as we know from pointers, is described by a number * Remember *aliasing* from C structs? * We saw briefly that SIMD instructions like aliased memory as well * Alignment corresponds to the block being transferred to/from a cache
Full address
tag bits
set bits
block bits
* Addresses with the same tag and set bits live together in the same block --- ## Cache Sizes * The sizes of the bit fields indicate the amount of memory being addressed * Sets: $S = 2^s$ * Lines per set: $E$ * Block size: $B = 2^b$ * Cache size: $C = B \times E \times S$ * Not including overhead for valid and tag bits --- ## Querying the Cache
See CS:App, chapter 6.4
--- ## Hit or Miss? * When the desired data is missing, there is a *cache miss* * When the desired data is present, there is a *cache hit*
See CS:App, chapter 6.4
--- ## Cold Cache * When a device is turned on, all cache is empty * The valid bits are set to false * Every request for memory is a *miss* * We call these *cold misses* or *compulsory misses* --- ## Fetching Data * Whenever there is a cache miss, we must fetch some data * Let's go through an example of how this works * And we'll see that there are different ways to tune a cache * Which is best will depend upon how we expect memory will be used --- ## Direct-Mapped Cache * In a **direct-mapped cache**, there is one line per set * Let's say we have 16 bytes of memory * That's 4 bits of addresses * We set up a cache as follows: * 4 sets * 2 bytes/block * 1 line/set (because this is a direct-mapped cache) --- ## Direct Cache Example
* Starting state * Notice that all valid bits start 0
valid
Tag
Block
Set 0
0
Set 1
0
Set 2
0
Set 3
0
--- ## Direct Cache Example
* Request address 0000
Request
Tag
Set
Block
0
00
0
valid
Tag
Block
Set 0
0
Set 1
0
Set 2
0
Set 3
0
--- ## Direct Cache Example
* Request address 0000 * The valid bit on set 0 isn't set, so this is a *cold miss* * Notice that we always fetch a full block
valid
Tag
Block
Set 0
1
0
m[0]
m[1]
Set 1
0
Set 2
0
Set 3
0
--- ## Direct Cache Example
* Request address 0001
Request
Tag
Set
Block
0
00
1
* That's a **hit**, and we can fetch m[1] from the block
valid
Tag
Block
Set 0
1
0
m[0]
m[1]
Set 1
0
Set 2
0
Set 3
0
--- ## Direct Cache Example
* Request address 0111
Request
Tag
Set
Block
0
11
1
* The valid bit isn't set, so this is a *cold miss*
valid
Tag
Block
Set 0
1
0
m[0]
m[1]
Set 1
0
Set 2
0
Set 3
1
0
m[6]
m[7]
--- ## Direct Cache Example
* Request address 1000
Request
Tag
Set
Block
1
00
0
* The valid bit is set, but the tag doesn't match so this is a miss * We **evict** and replace the current data
valid
Tag
Block
Set 0
1
1
m[8]
m[9]
Set 1
0
Set 2
0
Set 3
1
0
m[6]
m[7]
--- ## Direct Cache Example
* Request address 0000
Request
Tag
Set
Block
0
00
0
* This is the memory we just evicted! * This type of miss is called a **conflict miss** * There is enough capacity * Rather than using it, there is contention over a cache location
valid
Tag
Block
Set 0
1
0
m[0]
m[1]
Set 1
0
Set 2
0
Set 3
1
0
m[6]
m[7]
--- ## Reducing Conflict Misses * Conflict misses are common in a direct cache * This is similar to the thrashing problem in disks, where two memory locations cause a resource contention * Happens in disks, happens in cache, happens everywhere * An obvious solution is to provide multiple lines per set --- ## Set Associative Caches * An associative cache provides multiple blocks of data within each set. * How many? It depends upon the task * Let's say that we choose 2, thinking that maybe data memory and instruction memory could have caused conflicts * Two-way associative (E = 2)
$^*$ From CS:App
--- ## Associative Cache Example * Starting state * Blocks still hold 2 bytes * There are two sets, each with two lines * So the total cache size is the same * 2 bits of tag
valid
Tag
Block
valid
Tag
Block
Set 0
0
0
Set 1
0
0
--- ## Associative Cache Example
* Request address 0000 * tag=00, s=0 * cold miss
Request
Tag
Set
Block
00
0
0
valid
Tag
Block
valid
Tag
Block
Set 0
1
00
m[0]
m[1]
0
Set 1
0
0
--- ## Associative Cache Example
* Request address 0001 * hit
Request
Tag
Set
Block
00
0
1
valid
Tag
Block
valid
Tag
Block
Set 0
1
00
m[0]
m[1]
0
Set 1
0
0
--- ## Associative Cache Example
* Request address 0111 * cold miss
Request
Tag
Set
Block
01
1
1
valid
Tag
Block
valid
Tag
Block
Set 0
1
00
m[0]
m[1]
0
Set 1
1
01
m[6]
m[7]
0
--- ## Associative Cache Example
* Request address 1000 * cold miss
Request
Tag
Set
Block
10
0
0
valid
Tag
Block
valid
Tag
Block
Set 0
1
00
m[0]
m[1]
1
10
m[8]
m[9]
Set 1
1
01
m[6]
m[7]
0
--- ## Associative Cache Example
* Request address 0000 * Hit! * Problem solved!
Request
Tag
Set
Block
00
0
0
valid
Tag
Block
valid
Tag
Block
Set 0
1
00
m[0]
m[1]
1
10
m[8]
m[9]
Set 1
1
01
m[6]
m[7]
0
--- ## Fully Associative Cache * We could go even farther * Just 1 set * Now conflict misses will only happen when capacity is filled * *capacity misses* * Look things up by their tag * Which will now be 3 bits since there is no set address --- ## Fully Associative Drawbacks? * The tag bits need to be compared to every single line * This is hardware expensive * Easy to get either large or fast, not both * A fully associative cache is only used in special, small cache applications * Like the translation lookaside buffer * (that was the translation from virtual memory to actual addresses) --- ## Eviction in an Associative Cache * Obviously associativity doesn't solve all conflict and capacity misses * Occur when a set is fully utilised and new memory needs to be cached * We have to choose which memory to evict * Different algorithms: least recently used, least frequently used, random --- ## Writing * We've only handled fetching * What about writing? * Remember, writing is slow too! * There are different strategies --- ## Write Strategies * We're modifying data cache, when is it updated at the source memory? * For the L1 cache, that means updating L2, for l3 that means RAM, etc * *Write-through* * Write immediately to the underlying memory * *Write-back* * Defer writing until the block is evicted * This means that we need to add a dirty bit to mark modifications --- ## Write-misses * What if we write to memory without reading it? * We could have a miss! * More strategies * *Write-allocate* * Load into cache and update there * *No-write-allocate* * Same as write through, go directly to memory --- ## Typical Settings * Write-through + no-write-allocate * Write-back + Write-allocate * These can make better use of locality --- ## Real-World Example Core i7 Cache
$^*$ From CS:App
--- ## Performance Impacts
Cache
Access Time (cycles)
Cache Size (C)
Assoc. (E)
Block size (B)
Sets (S)
L1 i-cache
4
32 KB
8
64 B
64
L1 d-cache
4
32 KB
8
64 B
64
L2 unified cache
10
256 KB
8
64 B
512
L3 unified cache
40-75
8 MB
16
64 B
8192
Main memory
200
$^*$ According to CS:App, Core i7 cache
--- ## Cache Types * A *d-cache* holds data * An *i-cache* holds instructions * A *unified cache* holds both * It makes sense to handle instructions and data differently at the CPU, but less sense farther away --- ## Performance of Block Size * A small block size is generally faster, but has a higher miss rate * Large blocks increase hit rate if spatial locality is true * But decrease the number of cache lines * This increases misses if data isn't used in groups of that block size * And large blocks take longer to load too --- ## Performance of Associativity * Low associativity can lead to thrashing from conflict misses * High associativity is expensive and slow * Longer to search the memory * More difficult decisions about which memory to evict * Generally lower associativity closer to the CPU registers, higher associativity farther away * In i7, L1 and L2 are 8-way associative, L3 is 16-way --- ## Write Strategy * Write-through is simpler * Writing can also be shifted to another system, called a write buffer * Since the memory is already written, an eviction from a read won't also cause a write, slowing it down farther * Write-back saves on bandwidth, though * Becomes more likely as we move down the hierarchy --- ## Example Read Latenecies * A i7 program has a hit rate of 97% * Hit time is 4 cycles for L1 * Miss penalties (additional time beyond L1's 4 cycles) * 10 cycles for L2 * 200 cycles from main memory * Assume that we check L1 and L2 in parallel --- ## Example Read Latencies * L4: 4 cycles, L2 10 cycles, Main memory: 200 cycles * If data is either in L1 or isn't in the cache: * Average access time with 97% hit: * $4\~cycles + 0.03\times 200\~cycles = 10\~cycles$ * Average access time with 99% hit * $4\~cycles + 0.01\times 200\~cycles = 6\~cycles$ * Notice that the miss rate is what we care about, rather than the hit rate --- ## Example Read Latencies * L4: 4 cycles, L2 10 cycles, Main memory: 200 cycles * 3% miss rate in L1, 50% miss rate in L2: * $4\~cycles + 0.03\times(0.5\times 10\~cycles + 0.5\times200\~cycles) = 7.15\~cycles$ * 1% miss rate in L1, 50% miss rate in L2: * $4\~cycles + 0.01\times(0.5\times 10\~cycles + 0.5\times200\~cycles) = 5.05\~cycles$ * Even a low hit rate, an L2 cache will drastically decrease our expected latency --- ## Next Time * More cache details * Examples of cache friendly code * Fun topics, but we'll save them for next lecture!