The Poisonous Long Tail
If you are at all curious about where systems are headed, and about the unique challenges of building warehouse-scale (not datacenter-scale; there is a difference) computers, then this talk by Googler Luiz Andre Barroso is a must-see. Storage, disks, flash, energy efficiency, networking: you will learn something about all of these, and more.
One of his points that I want to highlight is why backend engineers spend so much time worrying about long-tail latency and trying to curb it. For an application running in a warehouse-scale computer, every entity that is touched in the process of answering a query has a performance tail that is significantly worse than its “usual” performance, often by an order of magnitude or more.
Using the example from the talk: consider a server that gives you a very quick response (well under 1 second) 99% of the time, but takes more than one second the other 1% of the time. Now imagine that you have to talk to 10 such servers to answer your query, and that the query is only as fast as the slowest of them. If the servers are slow independently of one another, roughly 10% of your queries now take longer than 1 second. With 50 servers, about 40% of your queries are slow. And with 100 servers, nearly two thirds (about 63%) of your queries take longer than one second. It’s not unusual for complicated services to be composed of this many other services.
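The arithmetic behind those numbers is easy to check. Here is a minimal Python sketch, assuming that each server is slow independently with probability 1% and that a query is slow if any one of the servers it touches is slow. The function name `slow_query_fraction` and its parameters are mine for illustration, not something from the talk.

```python
def slow_query_fraction(num_servers: int, p_slow: float = 0.01) -> float:
    """Fraction of queries that hit at least one slow server.

    Assumes every server is slow independently with probability p_slow,
    and that the query must wait for all num_servers responses.
    """
    # Probability that every server answers quickly.
    p_all_fast = (1.0 - p_slow) ** num_servers
    # The query is slow whenever at least one server is slow.
    return 1.0 - p_all_fast


if __name__ == "__main__":
    for n in (1, 10, 50, 100):
        print(f"{n:>3} servers: {slow_query_fraction(n):.1%} of queries are slow")
```

Under these assumptions it comes out to about 9.6% for 10 servers, 39.5% for 50, and 63.4% for 100, which matches the rounded figures above. Real servers are not perfectly independent, so treat this as a back-of-the-envelope model rather than a prediction.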
Thus, a long tail that sits only in the extremely high percentiles of each component rapidly spreads into the bulk of the latency distribution for the entire service. It’s like the one drop of dye that colors an entire cauldron.
(On the brighter side, operating at scale does buy you reliability.)