# CS 211 - Lecture 18

## Computer Architecture

###  Types of Cache

Bernhard Firner

2026-04-01

---

## Review

---

## Exploiting Locality

* Memory nearer to the processing (ALUs and FPUs) is faster, but smaller
* In general
  * As we work our way up, the faster memory exhibits more locality in usage
  * As we work our way down, the slower memory has higher throughput to satisfy memory closer to the action
* The slower memory is generally optimized for relatively higher density and throughput

---

## Locality

* Programs use data near what they are currently using
* `Temporal locality`: programs keep using the data they've got
* `Spatial locality`: when programs read new data, it is near what they recently addressed

---

## Locality

```c
float mean(float* data, size_t size) {
    float sum = 0;
    for (size_t i = 0; i < size; ++i) {
        sum += data[i];
    }
    return sum/size;
}
```

* Data:
  * Spatial locality in the array access
  * Temporal locality in the sum variable

---

## Locality continued

```c
float mean(float* data, size_t size) {
    float sum = 0;
    for (size_t i = 0; i < size; ++i) {
        sum += data[i];
    }
    return sum/size;
}
```

* Instructions:
  * Spatial locality: the instructions could have been prefetched before the function was called
  * Temporal locality: repeatedly use the loop instructions

---

## Locality By Design

* Good design is on *you*, the programmer
* We should optimize our program for the cache
  * Cache is non-register memory on the CPU itself
  * There are generally multiple types of cache

---

## Future Hardware Can't Trivialize

* Some problems are solved with future hardware
  * Not this one!
* RAM has grown in density by many orders of magnitude, but latency hasn't kept up
* To compensate, we have more levels of cache now, in a desperate race to keep up with modern processing speed

---

## Reading

* CS:App Chapters 6.2-6.6
* Outside reading:
  * [https://gameprogrammingpatterns.com/data-locality.html](https://gameprogrammingpatterns.com/data-locality.html)

---

## Other Resources

* Good, practical talks from game developers
  * [https://youtu.be/SZOr0m-K5PQ?si=JAG3Lx88y8yjkSaZ](https://youtu.be/SZOr0m-K5PQ?si=JAG3Lx88y8yjkSaZ)
    * Lots of register talk, so fun!
  * Because they *can* share, and performance matters to them
* A good proxy for other industries since games are so varied

---

## Cacheing

* Let's talk about how we cache
  * When is data placed into a cache?
  * When is data evicted?
  * What is the cost?

---

## Abstract Cache

* Data is organized into *blocks*
* A higher cache mirrors blocks from a lower cache
  * The higher cache is smaller, and holds a small fraction of what is in the lower cache

<div class="container">
<table>
<tr> <b><th colspan="4"> Higher Cache </th></b></tr>
<tr> <td>Block 12 </td><td> Block 300 </td><td> Block 5 </td><td> Block 8 </td></tr>
</table>

<table>
<tr> <th colspan="4"> Lower Cache  </th></tr>
<tr> <td>Block 1 </td><td> Block 2</td><td> Block 3 </td><td> Block 4</td></tr>
<tr> <td>Block 5 </td><td> Block 6</td><td> Block 7 </td><td> Block 8</td></tr>
<tr> <td>Block 9 </td><td> Block 10</td><td> Block 11</td><td> Block 12</td></tr>
<tr> <td colspan="4" rowspan="4">Block ........................................................</td></tr>
</table>
</div>

---

## Describing Cache

* Obviously each cache has dinstinct properties
* But how do we describe a cache? How is its memory organized?
* For CPU cache, meaning L1, L2, l3:
  * Data is fetched into `cache lines`
  * Each line is arranged into a `set`

---

## Visualization

<img style="width: 80%" class="r-stretch" src="./CSAppFigures/12_6_GeneralCacheOrganization.png" />
<br/>
<small> See CS:App, Figure 6.25</small>

---

## Cache Lines

* Cache lines hold three fields
  * Valid bit
  * tag
  * Data
    * $2^b$ bytes, usually 64-bytes in L1/L2/L3 cache
* The address of the original data is translated into the the set and tag

---

## Querying the Cache

* Each cache line holds a set number of bytes, called the block size
* Memory, as we know from pointers, is described by a number
* Remember *aliasing* from C structs?
  * We saw briefly that SIMD instructions like aliased memory as well
  * Alignment corresponds to the block being transferred to/from a cache

<table>
<tr> <td colspan="12">Full address</td></tr>
<tr> <td colspan="4">tag bits</td><td colspan="4">set bits</td><td colspan="4">block bits</td></tr>
</table>

* Addresses with the same tag and set bits live together in the same block

---

## Cache Sizes

* The sizes of the bit fields indicate the amount of memory being addressed
  * Sets: $S = 2^s$
  * Lines per set: $E$
  * Block size: $B = 2^b$
  * Cache size: $C = B \times E \times S$
    * Not including overhead for valid and tag bits

---

## Querying the Cache

<img style="width: 80%" class="r-stretch" src="./CSAppFigures/12_7_CacheReading.png" />
<br/>
<small> See CS:App, chapter 6.4</small>

---

## Hit or Miss?

* When the desired data is missing, there is a *cache miss*
* When the desired data is present, there is a *cache hit*

<img style="width: 40%" class="r-stretch" src="./CSAppFigures/12_8_CacheReading.png" />
<small> See CS:App, chapter 6.4</small>

---

## Cold Cache

* When a device is turned on, all cache is empty
* The valid bits are set to false
* Every request for memory is a *miss*
  * We call these *cold misses* or *compulsory misses*

---

## Fetching Data

* Whenever there is a cache miss, we must fetch some data
* Let's go through an example of how this works
  * And we'll see that there are different ways to tune a cache
  * Which is best will depend upon how we expect memory will be used

---

## Direct-Mapped Cache

* In a **direct-mapped cache**, there is one line per set
* Let's say we have 16 bytes of memory
  * That's 4 bits of addresses
* We set up a cache as follows:
  * 4 sets
  * 2 bytes/block
  * 1 line/set (because this is a direct-mapped cache)

---

## Direct Cache Example

* Starting state
* Notice that all valid bits start 0

</div>
<div class="col">

<table>
<tr> <td></td><th> valid  </th> <th>Tag</th> <th colspan="2"> Block </th></tr>
<tr> <td><b>Set 0</b></td><td> 0</td><td> </td><td> </td><td> </td></tr>
<tr> <td><b>Set 1</b></td><td> 0</td><td> </td><td> </td><td> </td></tr>
<tr> <td><b>Set 2</b></td><td> 0</td><td> </td><td> </td><td> </td></tr>
<tr> <td><b>Set 3</b></td><td> 0</td><td> </td><td> </td><td> </td></tr>
</table>
</div>
</div>

---

## Direct Cache Example

* Request address 0000
<br/>
<table>
<tr> <b><th colspan="4"> Request </th></b></tr>
<tr> <td>Tag </td><td> Set </td><td> Block </td></tr>
<tr> <td>0 </td><td> 00 </td><td> 0 </td></tr>
</table>

</div>
<div class="col">

---

## Direct Cache Example

* Request address 0000
  * The valid bit on set 0 isn't set, so this is a *cold miss*
  * Notice that we always fetch a full block

</div>
<div class="col">

<table>
<tr> <td></td><th> valid  </th> <th>Tag</th> <th colspan="2"> Block </th></tr>
<tr> <td><b>Set 0</b></td><td> 1</td><td>0</td><td>m[0]</td><td>m[1]</td></tr>
<tr> <td><b>Set 1</b></td><td> 0</td><td> </td><td> </td><td> </td></tr>
<tr> <td><b>Set 2</b></td><td> 0</td><td> </td><td> </td><td> </td></tr>
<tr> <td><b>Set 3</b></td><td> 0</td><td> </td><td> </td><td> </td></tr>
</table>
</div>
</div>

---

## Direct Cache Example

* Request address 0001
<br/>
<table>
<tr> <b><th colspan="4"> Request </th></b></tr>
<tr> <td>Tag </td><td> Set </td><td> Block </td></tr>
<tr> <td>0 </td><td> 00 </td><td> 1 </td></tr>
</table>
<br/>

* That's a **hit**, and we can fetch m[1] from the block

</div>
<div class="col">

---

## Direct Cache Example

* Request address 0111
<br/>
<table>
<tr> <b><th colspan="4"> Request </th></b></tr>
<tr> <td>Tag </td><td> Set </td><td> Block </td></tr>
<tr> <td>0 </td><td> 11 </td><td> 1</td></tr>
</table>
<br/>

* The valid bit isn't set, so this is a *cold miss*

</div>
<div class="col">

<table>
<tr> <td></td><th> valid  </th> <th>Tag</th> <th colspan="2"> Block </th></tr>
<tr> <td><b>Set 0</b></td><td> 1</td><td>0</td><td>m[0]</td><td>m[1]</td></tr>
<tr> <td><b>Set 1</b></td><td> 0</td><td> </td><td> </td><td> </td></tr>
<tr> <td><b>Set 2</b></td><td> 0</td><td> </td><td> </td><td> </td></tr>
<tr> <td><b>Set 3</b></td><td> 1</td><td>0</td><td>m[6]</td><td>m[7]</td></tr>
</table>
</div>
</div>

---

## Direct Cache Example

* Request address 1000
<br/>
<table>
<tr> <b><th colspan="4"> Request </th></b></tr>
<tr> <td>Tag </td><td> Set </td><td> Block </td></tr>
<tr> <td>1 </td><td> 00 </td><td> 0 </td></tr>
</table>
<br/>

* The valid bit is set, but the tag doesn't match so this is a miss
* We **evict** and replace the current data

</div>
<div class="col">

<table>
<tr> <td></td><th> valid  </th> <th>Tag</th> <th colspan="2"> Block </th></tr>
<tr> <td><b>Set 0</b></td><td> 1</td><td>1</td><td>m[8]</td><td>m[9]</td></tr>
<tr> <td><b>Set 1</b></td><td> 0</td><td> </td><td> </td><td> </td></tr>
<tr> <td><b>Set 2</b></td><td> 0</td><td> </td><td> </td><td> </td></tr>
<tr> <td><b>Set 3</b></td><td> 1</td><td>0</td><td>m[6]</td><td>m[7]</td></tr>
</table>
</div>
</div>

---

## Direct Cache Example

* Request address 0000
<br/>
<table>
<tr> <b><th colspan="4"> Request </th></b></tr>
<tr> <td>Tag </td><td> Set </td><td> Block </td></tr>
<tr> <td>0 </td><td> 00 </td><td> 0 </td></tr>
</table>

* This is the memory we just evicted!
* This type of miss is called a **conflict miss**
  * There is enough capacity
  * Rather than using it, there is contention over a cache location

</div>
<div class="col">

<table>
<tr> <td></td><th> valid  </th> <th>Tag</th> <th colspan="2"> Block </th></tr>
<tr> <td><b>Set 0</b></td><td> 1</td><td>0</td><td>m[0]</td><td>m[1]</td></tr>
<tr> <td><b>Set 1</b></td><td> 0</td><td> </td><td> </td><td> </td></tr>
<tr> <td><b>Set 2</b></td><td> 0</td><td> </td><td> </td><td> </td></tr>
<tr> <td><b>Set 3</b></td><td> 1</td><td>0</td><td>m[6]</td><td>m[7]</td></tr>
</table>
</div>
</div>

---

## Reducing Conflict Misses

* Conflict misses are common in a direct cache
  * This is similar to the thrashing problem in disks, where two memory locations cause a resource contention
    * Happens in disks, happens in cache, happens everywhere
* An obvious solution is to provide multiple lines per set

---

## Set Associative Caches

* An associative cache provides multiple blocks of data within each set.
  * How many? It depends upon the task
* Let's say that we choose 2, thinking that maybe data memory and instruction memory could have caused conflicts
* Two-way associative (E = 2)

---

## Associative Cache Example

* Starting state
  * Blocks still hold 2 bytes
  * There are two sets, each with two lines
    * So the total cache size is the same
  * 2 bits of tag

<table>
<tr> <td></td><th> valid  </th> <th>Tag</th> <th colspan="2"> Block </th><th> valid  </th> <th>Tag</th> <th colspan="2"> Block </th></tr>
<tr> <td><b>Set 0</b></td><td> 0</td><td> </td><td> </td><td> </td><td> 0</td><td> </td><td> </td><td> </td></tr>
<tr> <td><b>Set 1</b></td><td> 0</td><td> </td><td> </td><td> </td><td> 0</td><td> </td><td> </td><td> </td></tr>
</table>

---

## Associative Cache Example

* Request address 0000
  * tag=00, s=0
  * cold miss

<br/>
<table>
<tr> <b><th colspan="4"> Request </th></b></tr>
<tr> <td>Tag </td><td> Set </td><td> Block </td></tr>
<tr> <td>00 </td><td> 0 </td><td> 0 </td></tr>
</table>
<br/>

</div>
<div class="col">

<table>
<tr> <td></td><th> valid  </th> <th>Tag</th> <th colspan="2"> Block </th><th> valid  </th> <th>Tag</th> <th colspan="2"> Block </th></tr>
<tr> <td><b>Set 0</b></td><td> 1</td><td>00</td><td>m[0]</td><td>m[1]</td><td> 0</td><td> </td><td> </td><td> </td></tr>
<tr> <td><b>Set 1</b></td><td> 0</td><td> </td><td> </td><td> </td><td> 0</td><td> </td><td> </td><td> </td></tr>
</table>
</div>
</div>

---

## Associative Cache Example

* Request address 0001
  * hit

<br/>
<table>
<tr> <b><th colspan="4"> Request </th></b></tr>
<tr> <td>Tag </td><td> Set </td><td> Block </td></tr>
<tr> <td>00 </td><td> 0 </td><td> 1 </td></tr>
</table>
<br/>

</div>
<div class="col">

<table>
<tr> <td></td><th> valid  </th> <th>Tag</th> <th colspan="2"> Block </th><th> valid  </th> <th>Tag</th> <th colspan="2"> Block </th></tr>
<tr> <td><b>Set 0</b></td><td> 1</td><td>00</td><td>m[0]</td><td>m[1]</td><td> 0</td><td> </td><td> </td><td> </td></tr>
<tr> <td><b>Set 1</b></td><td> 0</td><td> </td><td> </td><td> </td><td> 0</td><td> </td><td> </td><td> </td></tr>
</table>
</div>
</div>

---

## Associative Cache Example

* Request address 0111
  * cold miss

<br/>
<table>
<tr> <b><th colspan="4"> Request </th></b></tr>
<tr> <td>Tag </td><td> Set </td><td> Block </td></tr>
<tr> <td>01 </td><td> 1 </td><td> 1 </td></tr>
</table>
<br/>

</div>
<div class="col">

<table>
<tr> <td></td><th> valid  </th> <th>Tag</th> <th colspan="2"> Block </th><th> valid  </th> <th>Tag</th> <th colspan="2"> Block </th></tr>
<tr> <td><b>Set 0</b></td><td> 1</td><td>00 </td><td>m[0]</td><td>m[1]</td><td> 0</td><td> </td><td> </td><td> </td></tr>
<tr> <td><b>Set 1</b></td><td> 1</td><td>01 </td><td>m[6]</td><td>m[7]</td><td> 0</td><td> </td><td> </td><td> </td></tr>
</table>
</div>
</div>

---

## Associative Cache Example

* Request address 1000
  * cold miss

<br/>
<table>
<tr> <b><th colspan="4"> Request </th></b></tr>
<tr> <td>Tag </td><td> Set </td><td> Block </td></tr>
<tr> <td>10 </td><td> 0 </td><td> 0 </td></tr>
</table>
<br/>

</div>
<div class="col">

<table>
<tr> <td></td><th> valid  </th> <th>Tag</th> <th colspan="2"> Block </th><th> valid  </th> <th>Tag</th> <th colspan="2"> Block </th></tr>
<tr> <td><b>Set 0</b></td><td> 1</td><td>00 </td><td>m[0]</td><td>m[1]</td><td> 1</td><td>10</td><td>m[8]</td><td>m[9]</td></tr>
<tr> <td><b>Set 1</b></td><td> 1</td><td>01 </td><td>m[6]</td><td>m[7]</td><td> 0</td><td> </td><td> </td><td> </td></tr>
</table>
</div>
</div>

---

## Associative Cache Example

* Request address 0000
  * Hit!
  * Problem solved!

<br/>
<table>
<tr> <b><th colspan="4"> Request </th></b></tr>
<tr> <td>Tag </td><td> Set </td><td> Block </td></tr>
<tr> <td>00 </td><td> 0 </td><td> 0 </td></tr>
</table>
<br/>

</div>
<div class="col">

<table>
<tr> <td></td><th> valid  </th> <th>Tag</th> <th colspan="2"> Block </th><th> valid  </th> <th>Tag</th> <th colspan="2"> Block </th></tr>
<tr> <td><b>Set 0</b></td><td> 1</td><td>00 </td><td>m[0]</td><td>m[1]</td><td> 1</td><td>10</td><td>m[8]</td><td>m[9]</td></tr>
<tr> <td><b>Set 1</b></td><td> 1</td><td>01 </td><td>m[6]</td><td>m[7]</td><td> 0</td><td> </td><td> </td><td> </td></tr>
</table>
</div>
</div>

---

## Fully Associative Cache

* We could go even farther
* Just 1 set
  * Now conflict misses will only happen when capacity is filled
    * *capacity misses*
* Look things up by their tag
  * Which will now be 3 bits since there is no set address

---

## Fully Associative Drawbacks?

* The tag bits need to be compared to every single line
  * This is hardware expensive
    * Easy to get either large or fast, not both
* A fully associative cache is only used in special, small cache applications
  * Like the translation lookaside buffer
  * (that was the translation from virtual memory to actual addresses)

---

## Eviction in an Associative Cache

* Obviously associativity doesn't solve all conflict and capacity misses
  * Occur when a set is fully utilised and new memory needs to be cached
  * We have to choose which memory to evict
    * Different algorithms: least recently used, least frequently used, random

---

## Writing

* We've only handled fetching
* What about writing?
  * Remember, writing is slow too!
* There are different strategies

---

## Write Strategies

* We're modifying data cache, when is it updated at the source memory?
  * For the L1 cache, that means updating L2, for l3 that means RAM, etc
* *Write-through*
  * Write immediately to the underlying memory
* *Write-back*
  * Defer writing until the block is evicted
  * This means that we need to add a dirty bit to mark modifications

---

## Write-misses

* What if we write to memory without reading it?
  * We could have a miss!
* More strategies
  * *Write-allocate*
    * Load into cache and update there
  * *No-write-allocate*
    * Same as write through, go directly to memory

---

## Typical Settings

* Write-through + no-write-allocate
* Write-back + Write-allocate
  * These can make better use of locality

---

## Real-World Example

Core i7 Cache

---

## Performance Impacts

<table>
<tr> <hr><td>Cache</td><td>Access Time (cycles)</td><td>Cache Size (C)</td><td> Assoc. (E)</td><td>Block size (B)</td><td>Sets (S)</td></hr></tr>
<tr> <td> L1 i-cache</td><td> 4   </td><td> 32 KB    </td><td> 8 </td><td> 64 B  </td><td> 64 </td></tr>
<tr> <td> L1 d-cache</td><td> 4   </td><td> 32 KB    </td><td> 8 </td><td> 64 B  </td><td> 64 </td></tr>
<tr> <td> L2 unified cache</td><td> 10   </td><td> 256 KB    </td><td> 8 </td><td> 64 B  </td><td> 512 </td></tr>
<tr> <td> L3 unified cache</td><td> 40-75   </td><td> 8 MB    </td><td> 16 </td><td> 64 B  </td><td> 8192 </td></tr>
<tr> <td> Main memory</td><td> 200   </td><td>     </td><td>  </td><td>       </td><td>      </td></tr>
</table>

<p style='font-size:13pt'>$^*$ According to CS:App, Core i7 cache</p>

---

## Cache Types

* A *d-cache* holds data
* An *i-cache* holds instructions
* A *unified cache* holds both
* It makes sense to handle instructions and data differently at the CPU, but less sense farther away

---

## Performance of Block Size

* A small block size is generally faster, but has a higher miss rate
* Large blocks increase hit rate if spatial locality is true
  * But decrease the number of cache lines
  * This increases misses if data isn't used in groups of that block size
  * And large blocks take longer to load too

---

## Performance of Associativity

* Low associativity can lead to thrashing from conflict misses
* High associativity is expensive and slow
  * Longer to search the memory
  * More difficult decisions about which memory to evict
* Generally lower associativity closer to the CPU registers, higher associativity farther away
  * In i7, L1 and L2 are 8-way associative, L3 is 16-way

---

## Write Strategy

* Write-through is simpler
  * Writing can also be shifted to another system, called a write buffer
  * Since the memory is already written, an eviction from a read won't also cause a write, slowing it down farther
* Write-back saves on bandwidth, though
  * Becomes more likely as we move down the hierarchy

---

## Example Read Latenecies

* A i7 program has a hit rate of 97%
  * Hit time is 4 cycles for L1
* Miss penalties (additional time beyond L1's 4 cycles)
  * 10 cycles for L2
  * 200 cycles from main memory
* Assume that we check L1 and L2 in parallel

---

## Example Read Latencies

* L4: 4 cycles, L2 10 cycles, Main memory: 200 cycles
* If data is either in L1 or isn't in the cache:
  * Average access time with 97% hit:
    * $4\~cycles  + 0.03\times 200\~cycles = 10\~cycles$
  * Average access time with 99% hit
    * $4\~cycles  + 0.01\times 200\~cycles = 6\~cycles$
* Notice that the miss rate is what we care about, rather than the hit rate

---

## Example Read Latencies

* L4: 4 cycles, L2 10 cycles, Main memory: 200 cycles
* 3% miss rate in L1, 50% miss rate in L2:
  * $4\~cycles  + 0.03\times(0.5\times 10\~cycles + 0.5\times200\~cycles) = 7.15\~cycles$
* 1% miss rate in L1, 50% miss rate in L2:
  * $4\~cycles  + 0.01\times(0.5\times 10\~cycles + 0.5\times200\~cycles) = 5.05\~cycles$
* Even a low hit rate, an L2 cache will drastically decrease our expected latency

---

## Next Time

* More cache details
* Examples of cache friendly code
* Fun topics, but we'll save them for next lecture!

<!--

Pick up from page 650

Replacement policies: random replacement, least recently used
TODO Show some examples of LRU; keep a counter in the line data

Cache associativity
# lines (of size), # ways, # sets (lines = ways * sets)

do the matmul example https://www.cs.cmu.edu/afs/cs/academic/class/15213-f06/www/code/mem/matmult/