$^*$ See https://www.seagate.com/content/dam/seagate/en/content-fragments/products/datasheets/exos-m-v2/exos-m-v2-DS2045-3-2506-en_US.pdf
---
## Is seeking slow?
* ms: $10^{-3}$
* cpu instructions, SRAM: $10^{-9}$
* That's a $10^6$ difference
* That's how many istructions we lose during a read head seek
---
## What about bandwidth?
* 500MB/s of new data
* Registers are 64 bits
* This fills 62.5 million registers per second
* If we need to fill two registers for an op, that's still 30 million per second
* Enough to fill a 3GHz machine 10 instructions wide
* So this is more than enough -- if we begin loading data in advance and our busses are wide enough
---
## Thrashing
* When we access memory all over the disk, performance degrades rapidly
* Data in the cache is constantly ejected, and then needs to be fetched again
* The system grinds to a halt
* This is called thrashing, from the sound made by the read head frantically thrashing back and forth
---
## Increasing Density
* Hard disk platters are stacked to increase their total density
* Now there are multiply read heads to read the disk simultaneously
* Does not necessarily mean there is enough bandwidth to read from all disks at once
---
## Spinning Disk Considerations
* Read rates are actually faster on the outside of the disk
* Because there is increased angular velocity away from the center
* Data that will be accessed together should be stored sequentially on the disk
---
## Sophisticated
* Don't be fooled by how old-fashioned a spinning disk sounds
* Seagate recently released its newest technology
* HAMR
* Heat-assisted magnetic recording
* 30TB disks, 3.2TB/platter (as of 2024)
---
## HAMR
* Lasers on the read head heat tiny sections of disk
* This means that only that tiny section's polarity will change
* Seagate platters use an iron-platinum alloy for stability
* "Plasmonic" write head heats platter over 800°
* And then it cools again in two nanoseconds
$^*$ See https://www.seagate.com/innovation/mozaic/
---
## SSDs: The other disk
* The alternative to a spinning disk is a solid-state disk
* SSD
* Bits are stored in non-volatile flash
* No moving parts, behaves more like RAM
* Lower latency and (generally) higher bandwidth than spinning disks
* But lower density than spinning disks
---
## SSD Usage
* Generally the only good option in mobile devices
* Also offers different form-factors than spinning disk devices
---
## Reading and Writing
* Sequential sustained read/write rates $\approx$ 500MB/s
* Random access *is* still slower, but not as dramatic as with spinning disks
* Latences of < $200 \mu$seconds
$^*$ See https://www.seagate.com/content/dam/seagate/en/content-fragments/products/datasheets/nytro-1370-sata-ssd/nytro-1370-sata-ssd-DS23-1-2508-en_US.pdf
---
## Storage Tradeoffs
* Spinning disks are more durable
* SSD memory "wears out"
* Sectors are erased to rewrite, and can only be erased a set number of times
---
## Internet Storage
* Fiber latency $5 \mu$seconds/km
* $c = 299.792\frac{m}{\mu s}$
* Refractive index $\approx 1.47$
* latency = $distance\frac{r}{c}$
* Or 0.1s to go to the other side of the world
* Bandwidth? GB/s
---
## Internet Storage
* How about satellite?
* Latency of 10s to 100s ms, depending upon satellite orbit
* Bandwidth
* 100s MB/s
---
## Overview
$^*$ From CS:App
---
## Latency Vs Bandwidth
* Bandwidth must make up for latency
$^*$ From CS:App
---
## So What to Do?
* We need to predict what we need and fetch it before we need it
* We've seen this paradigm before with branch prediction
---
## Predicting Data Usage
* Unlike branch prediction
* Branch prediction is "path A or path B"
* Data usage prediction has far more options
* But a few things remain consistent
---
## Locality
* Programs use data near what they are currently using
* `Temporal locality`: programs keep using the data they've got
* `Spatial locality`: when programs read new data, it is near what they recently addressed
---
## Locality
```c
float mean(float* data, size_t size) {
float sum = 0;
for (size_t i = 0; i < size; ++i) {
sum += data[i];
}
return sum/size;
}
```
* Data:
* Spatial locality in the array access
* Temporal locality in the sum variable
---
## Locality continued
```c
float mean(float* data, size_t size) {
float sum = 0;
for (size_t i = 0; i < size; ++i) {
sum += data[i];
}
return sum/size;
}
```
* Instructions:
* Spatial locality: the instructions could have been prefetched before the function was called
* Temporal locality: repeatedly use the loop instructions
---
## Improving Locality
* Consider a matrix
* If each row is allocated separately, then they may not be spatially local
```C
Matrix newMatrix(size_t height, size_t width) {
Matrix nm;
// First we allocate pointers to each row.
nm.values = calloc(height, sizeof(float*));
for (size_t i = 0; i < height; ++i) {
nm.values[i] = calloc(width, sizeof(float));
}
nm.height = height;
nm.width = width;
return nm;
}
```
---
## Matrix Example
* Worst case scenario: each row of the two matrices are in separate memory areas
```C
// Add two matrices.
Matrix addMatrix(Matrix left, Matrix right) {
// If the sizes don't match, return a matrix with NULL data and 0 size.
Matrix sum = {.values = NULL, .width = 0, .height = 0};
if (left.height != right.height || left.width != right.width) {
return sum;
}
sum = newMatrix(left.height, left.width);
for (size_t row = 0; row < sum.height; ++row) {
for (size_t column = 0; column < sum.width; ++column) {
sum.values[row][column] = left.values[row][column] + right.values[row][column];
}
}
return sum;
}
```C
---
## Single Block Matrix
* What if the entire matrix was allocated in one chunk?
* Like in your homework!
* You get better spatial locality!
---
## The Loop
* Which is better?
```c
for (size_t row = 0; row < sum.height; ++row) {
for (size_t column = 0; column < sum.width; ++column) {
sum.values[row][column] = left.values[row][column] + right.values[row][column];
}
}
```
```c
for (size_t column = 0; column < sum.width; ++column) {
for (size_t row = 0; row < sum.height; ++row) {
sum.values[row][column] = left.values[row][column] + right.values[row][column];
}
}
```
---
## Bigger Example
* Imagine a 1000x1000 RGB image
* How should we allocate memory?
* $3 \times 1000^2$ blocks? (CxHxW)
* $1 \times 1000^2 \times 3$ blocks? (HxWxC)
* Possibly the second, if we operate on one pixel at a time and do operations over all channels
* The sizes of our cache may mean that the entire image won't fit, but we'll have time to prefetch blocks as we go
---
## The Lesson
* Faster storage is more expensive
* Denser storage is cheaper
* Often high latency but high bandwidth
* Good programs exploit locality to compensate for latency gap
* Rely upon computer systems to prefetch data
---
## Prefetch Where?
* This is where cacheing comes into play
* Too exensive and difficult to add more registers
* So put a cache next to the CPU
* And another one slightly farther away, and another one after that
* Load from disk into RAM and from the Internet onto disk
---
---
## Exploiting Locality
* Each part of the system assumes:
* The next higher part is faster, but will exhibit locality
* The next lower part is slower, but has the bandwidth required
---
## For You
* This means that you need to write cache- and locality-aware code
* This is something that your compiler *cannot* do
* Your CPU is helpless here too!
* So you need to practice writing locality aware code
* Helps to understand how cacheing and prefetching work too