Oct 17, 2011

Flash Crisis

Robert X. Cringely absolutely nails it in his recent column about some of the consequences of rapidly reducing IO times on programming languages¹. His major point was that slow but expressive² high-level scripting languages such as Ruby and Python have been getting away with their lack of performance due to slow disks. With super-fast seekless flash expected to replace, or at least complement, spinning disks in the storage hierarchy, the long honeymoon of Python and Ruby will come to an end when profiling reveals that IO is fast, and the runtime or interpreter is the bottleneck.

This impending “flash crisis” is well known in systems circles. It’s almost like a mini Y2K.

However, its repercussions are much deeper than just the choice of programming language and runtime³.

It’s depressing to see graphs of things like CPU speed and disk and RAM capacity, which keep going up, up, up, compared with those of disk throughput and seek rates, which for all practical purposes have remained mostly flat. A 100 GB drive from a few years ago does about the same number of seeks per second as a 2 TB drive today. Drive seeks have become the straw through which we’re trying to drink the ocean.

If you look at the current software stack, all the way down to the operating system kernel, you will find that the slowness of disks is a baked-in assumption throughout. Operating systems play intricate games with IO scheduling, on-disk file layout, and caching of disk blocks in memory, and spend millions of CPU cycles doing so, because the millisecond or so those cycles take is small compared to the tens of milliseconds it might take to hit the disk. Pushed to the extreme, this obsession with avoiding the disk leads to schemes like RAMster⁴, because even accessing the memory of a different computer over the network fabric is faster than going to local disk.
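To put rough numbers on that trade, here is a back-of-the-envelope sketch of how many CPU cycles fit inside one storage access. The 3 GHz clock and the 100 µs flash read are my own illustrative assumptions, not measurements; the 10 ms figure is a typical disk seek.

```python
# Back-of-the-envelope: CPU cycles that elapse during one storage access.
# All figures are illustrative assumptions, not benchmarks.

CLOCK_MHZ = 3_000          # assumed 3 GHz core = 3,000 cycles per microsecond

DISK_SEEK_US = 10_000      # ~10 ms for a spinning-disk seek
FLASH_READ_US = 100        # assumed ~100 microseconds for a flash read


def cycles_during(latency_us, clock_mhz=CLOCK_MHZ):
    """Cycles a core could execute while waiting latency_us microseconds."""
    return latency_us * clock_mhz


if __name__ == "__main__":
    print("disk seek :", cycles_during(DISK_SEEK_US), "cycles")   # 30000000
    print("flash read:", cycles_during(FLASH_READ_US), "cycles")  # 300000
```

Under these assumptions, a kernel that burns a million cycles to dodge a seek comes out far ahead on disk, but the same million cycles costs more than three flash reads.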

In essence, we have been trading plentiful resources such as CPU cycles and RAM for precious disk seeks. Most web-ish serving workloads almost demand this, because they tend to be IO-bound rather than CPU-bound.

What happens when you blow up the “slow disk” assumption is that your entire software stack begins to crumble. When the cushion provided by those 10 ms of seek time goes away, CPU becomes your bottleneck. You have to start rethinking a lot of design choices.
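A toy model makes the shift concrete. Suppose a request spends a fixed amount of time in the interpreter plus one storage access. The 2 ms of CPU time per request and the 0.1 ms flash latency below are hypothetical numbers I picked for illustration; only the 10 ms seek comes from the discussion above.

```python
# Toy model of one request: interpreter (CPU) time plus one storage access.
# The CPU cost and the flash latency are hypothetical, for illustration only.

def cpu_share(cpu_ms, io_ms):
    """Fraction of total request time spent on the CPU."""
    return cpu_ms / (cpu_ms + io_ms)


def speedup_from_faster_runtime(cpu_ms, io_ms, factor):
    """Overall speedup if only the CPU part gets `factor` times faster."""
    return (cpu_ms + io_ms) / (cpu_ms / factor + io_ms)


if __name__ == "__main__":
    CPU_MS = 2.0       # hypothetical interpreter time per request
    DISK_MS = 10.0     # one disk seek
    FLASH_MS = 0.1     # assumed flash read latency

    for name, io_ms in (("disk", DISK_MS), ("flash", FLASH_MS)):
        share = cpu_share(CPU_MS, io_ms)
        gain = speedup_from_faster_runtime(CPU_MS, io_ms, factor=10)
        print(f"{name:5s}: CPU is {share:.0%} of the request; "
              f"a 10x faster runtime gives a {gain:.1f}x speedup")
```

With these made-up numbers, the interpreter accounts for about a sixth of the request on disk and nearly all of it on flash, so a faster runtime goes from barely mattering to mattering a great deal.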

There are a few saving graces which will buy us time. Flash hasn’t hit the price point where it can replace spinning disks en masse⁵. And the shakedown will start from the top of the stack. Programmers will first discover bottlenecks in their application code, then in their choice of language and runtime, and only after those layers have been tuned to death will they look at things like filesystems and the operating system kernel.

¹ The Second Coming of Java, by Robert X. Cringely.
² “slow but expressive”: I believe that’s a false dichotomy. You can be fast and expressive too. But that’s a whole other topic.
³ Also, I disagree with Cringely that Java, and the JVM, is the answer. There are good reasons why, even though the JVM has thrived at the application layer, it has yet to make any inroads into the lower layers of the stack.
⁴ Transcendent memory for the Linux kernel
⁵ On NewEgg, the cheapest 1 TB disk was $69.99, whereas I couldn’t even find a 1 TB SSD for desktops, and the cheapest 500 GB SSD was $729.99.