# CS 211 - Lecture 17

## Computer Architecture

###  Types of Memory

Bernhard Firner

2025-10-29

---

## Memory

* Bandwidth and latency are the primary metrics
  * Locality affects access times, decreasing latency
    * Like register memory is close to the ALU
* But the technology used is also a major consideration
* So what other memory is part of our compute systems?

---

## Types

* CPU memory: SRAM
* Main memory: DRAM
* System memory (disks)
* Network memory (network mounted storage or similar)

---

## SRAM

* Static Random-Access Memory
* Uses circuits that are stable as long as power is present
  * Requires multiple transistors to implement
  * Expensive
* Random-access because they are addressable at any location

---

## DRAM

* Dynamic Random-Access Memory
* Usually based upon on transistor and capictor per bit
  * Simple compared to SRAM
  * Order of magnitude slower compared to SRAM
* The capacitor gradually loses its charge, though
  * DRAM requires a memory refresh to maintain state

---

## Comparison

* SRAM is fast, but takes more space
  * Used in register and cache memory in the CPU
* DRAM has higher density
  * Thus it is better when large amount of memory are required
  * This is the colloquial "RAM" in your computer

---

* SRAM is speedy, so it sits where data is actually used
  * Accessible within a single cycle
* DRAM is accessed in larger blocks
  * Data is transported over a bus and then cached locally
  * Access time $\approx$ 10ns

$^*$ From CS:App

---

## Reading Over Busses

* Busses are collections of wires that transport many bits in parallel
* Used when we need new data from RAM in a register
  * Request goes out to RAM
  * Block of memory is read
  * Bits in the block are transported in parallel over the bus
  * The block is placed into cache
  * Finally, the requested data is loaded into the CPU register

---

## Writing over Busses

* During a write operation, memory flows in the opposite direction
  * In reality, this is so slow that we don't bother on every write
  * If we have a local copy in cache, we continue working with that local copy
  * If we are ever going to unload it from the cache, then we write back to RAM

---

## System Memory

* Disks
  * And things that call themselves disks
* Disks have gone through a few iterations
  * But the cheapest memory per TB is spinning disks

---

## Spinning Disks

* Picture a record, but the disk is a shiny, magnetic disk of metal
  * The disk is called a "platter"
* A tiny read head moves back and forth, reading the polarity of sections of metal
* Likely multiple disks, likely multiple read heads

---

## Spinning Disks

* Advantages: inexpensive for the memory you get
* Disadvantages: Very slow
  * Even slower if access isn't sequential
    * Because the read head needs to physically move

---

## Seek Times

* Disks spin around $\approx$ 7,200 RPM
 * So <ms delays waiting for data to come around
* If we read data sequentially (on the disk) then the read head doesn't move
* However, if we want data located at a different radius, the head must seek
 * This seek time can take ms; which is forever

---

## Read/Write Rates

* Sequential reads $\approx$ 500 MB/s
  * But sustained rate is $\approx$ 275MB/s
  * Why? There is an on-board cache that stores data, evening out transfers
  * Once that cache is depleted, we see the true disk read rate
* "average latency" $\approx$ 4ms

$^*$ See https://www.seagate.com/content/dam/seagate/en/content-fragments/products/datasheets/exos-m-v2/exos-m-v2-DS2045-3-2506-en_US.pdf

---

## Is seeking slow?

* ms: $10^{-3}$
* cpu instructions, SRAM: $10^{-9}$
* That's a $10^6$ difference
  * That's how many istructions we lose during a read head seek

---

## What about bandwidth?

* 500MB/s of new data
* Registers are 64 bits
  * This fills 62.5 million registers per second
  * If we need to fill two registers for an op, that's still 30 million per second
  * Enough to fill a 3GHz machine 10 instructions wide
* So this is more than enough -- if we begin loading data in advance and our busses are wide enough

---

## Thrashing

* When we access memory all over the disk, performance degrades rapidly
* Data in the cache is constantly ejected, and then needs to be fetched again
* The system grinds to a halt
* This is called thrashing, from the sound made by the read head frantically thrashing back and forth

---

## Increasing Density

* Hard disk platters are stacked to increase their total density
  * Now there are multiply read heads to read the disk simultaneously
  * Does not necessarily mean there is enough bandwidth to read from all disks at once

---

## Spinning Disk Considerations

* Read rates are actually faster on the outside of the disk
  * Because there is increased angular velocity away from the center
* Data that will be accessed together should be stored sequentially on the disk

---

## Sophisticated

* Don't be fooled by how old-fashioned a spinning disk sounds
* Seagate recently released its newest technology
* HAMR
  * Heat-assisted magnetic recording
  * 30TB disks, 3.2TB/platter (as of 2024)

---

## HAMR

* Lasers on the read head heat tiny sections of disk
  * This means that only that tiny section's polarity will change
* Seagate platters use an iron-platinum alloy for stability
* "Plasmonic" write head heats platter over 800°
  * And then it cools again in two nanoseconds

$^*$ See https://www.seagate.com/innovation/mozaic/

---

## SSDs: The other disk

* The alternative to a spinning disk is a solid-state disk
  * SSD
* Bits are stored in non-volatile flash
* No moving parts, behaves more like RAM
  * Lower latency and (generally) higher bandwidth than spinning disks
  * But lower density than spinning disks

---

## SSD Usage

* Generally the only good option in mobile devices
* Also offers different form-factors than spinning disk devices

---

## Reading and Writing

* Sequential sustained read/write rates $\approx$ 500MB/s
 * Random access *is* still slower, but not as dramatic as with spinning disks
* Latences of < $200 \mu$seconds

$^*$ See https://www.seagate.com/content/dam/seagate/en/content-fragments/products/datasheets/nytro-1370-sata-ssd/nytro-1370-sata-ssd-DS23-1-2508-en_US.pdf

---

## Storage Tradeoffs

* Spinning disks are more durable
* SSD memory "wears out"
  * Sectors are erased to rewrite, and can only be erased a set number of times

---

## Internet Storage

* Fiber latency $5 \mu$seconds/km
  * $c = 299.792\frac{m}{\mu s}$
  * Refractive index $\approx 1.47$
  * latency = $distance\frac{r}{c}$
  * Or 0.1s to go to the other side of the world
* Bandwidth? GB/s

---

## Internet Storage

* How about satellite?
  * Latency of 10s to 100s ms, depending upon satellite orbit
* Bandwidth
  * 100s MB/s

---

## Overview

$^*$ From CS:App

---

## Latency Vs Bandwidth

* Bandwidth must make up for latency

$^*$ From CS:App

---

## So What to Do?

* We need to predict what we need and fetch it before we need it
  * We've seen this paradigm before with branch prediction

---

## Predicting Data Usage

* Unlike branch prediction
  * Branch prediction is "path A or path B"
* Data usage prediction has far more options
  * But a few things remain consistent

---

## Locality

* Programs use data near what they are currently using
* `Temporal locality`: programs keep using the data they've got
* `Spatial locality`: when programs read new data, it is near what they recently addressed

---

## Locality

```c
float mean(float* data, size_t size) {
 float sum = 0;
 for (size_t i = 0; i < size; ++i) {
 sum += data[i];
 }
 return sum/size;
}
```

* Data:
  * Spatial locality in the array access
  * Temporal locality in the sum variable

---

## Locality continued

```c
float mean(float* data, size_t size) {
 float sum = 0;
 for (size_t i = 0; i < size; ++i) {
 sum += data[i];
 }
 return sum/size;
}
```

* Instructions:
  * Spatial locality: the instructions could have been prefetched before the function was called
  * Temporal locality: repeatedly use the loop instructions

---

## Improving Locality

* Consider a matrix
  * If each row is allocated separately, then they may not be spatially local

```C
Matrix newMatrix(size_t height, size_t width) {
 Matrix nm;
 // First we allocate pointers to each row.
 nm.values = calloc(height, sizeof(float*));
 for (size_t i = 0; i < height; ++i) {
 nm.values[i] = calloc(width, sizeof(float));
 }
 nm.height = height;
 nm.width = width;
 return nm;
}
```

---

## Matrix Example

* Worst case scenario: each row of the two matrices are in separate memory areas

```C
// Add two matrices.
Matrix addMatrix(Matrix left, Matrix right) {
 // If the sizes don't match, return a matrix with NULL data and 0 size.
 Matrix sum = {.values = NULL, .width = 0, .height = 0};
 if (left.height != right.height || left.width != right.width) {
 return sum;
 }
 sum = newMatrix(left.height, left.width);
 for (size_t row = 0; row < sum.height; ++row) {
 for (size_t column = 0; column < sum.width; ++column) {
 sum.values[row][column] = left.values[row][column] + right.values[row][column];
 }
 }
 return sum;
}
```C

---

## Single Block Matrix

* What if the entire matrix was allocated in one chunk?
  * Like in your homework!
  * You get better spatial locality!

---

## The Loop

* Which is better?

```c
 for (size_t row = 0; row < sum.height; ++row) {
 for (size_t column = 0; column < sum.width; ++column) {
 sum.values[row][column] = left.values[row][column] + right.values[row][column];
 }
 }
```

```c
 for (size_t column = 0; column < sum.width; ++column) {
 for (size_t row = 0; row < sum.height; ++row) {
 sum.values[row][column] = left.values[row][column] + right.values[row][column];
 }
 }
```

---

## Bigger Example

* Imagine a 1000x1000 RGB image
  * How should we allocate memory?
    * $3 \times 1000^2$ blocks? (CxHxW)
    * $1 \times 1000^2 \times 3$ blocks? (HxWxC)
* Possibly the second, if we operate on one pixel at a time and do operations over all channels
  * The sizes of our cache may mean that the entire image won't fit, but we'll have time to prefetch blocks as we go

---

## The Lesson

* Faster storage is more expensive
* Denser storage is cheaper
  * Often high latency but high bandwidth
* Good programs exploit locality to compensate for latency gap
  * Rely upon computer systems to prefetch data

---

## Prefetch Where?

* This is where cacheing comes into play
  * Too exensive and difficult to add more registers
  * So put a cache next to the CPU
* And another one slightly farther away, and another one after that
* Load from disk into RAM and from the Internet onto disk

---

---

## Exploiting Locality

* Each part of the system assumes:
  * The next higher part is faster, but will exhibit locality
  * The next lower part is slower, but has the bandwidth required

---

## For You

* This means that you need to write cache- and locality-aware code
* This is something that your compiler *cannot* do
  * Your CPU is helpless here too!
* So you need to practice writing locality aware code
  * Helps to understand how cacheing and prefetching work too