Vivek Haldar

Tackling tail latency

This will be hard to miss as it is the cover story of the next Communications of the ACM, but here’s a shout out nevertheless to the article titled “The Tail at Scale” by Googlers Jeffrey Dean and Luiz André Barroso.

The abstract:

It is challenging for service providers to keep the tail of latency distribution short for interactive services as the size and complexity of the system scales up or as overall use increases. Temporary high-latency episodes (unimportant in moderate-size systems) may come to dominate overall service performance at large scale. Just as fault-tolerant computing aims to create a reliable whole out of less-reliable parts, large online services need to create a predictably responsive whole out of less-predictable parts; we refer to such systems as “latency tail-tolerant,” or simply “tail-tolerant.” Here, we outline some common causes for high-latency episodes in large online services and describe techniques that reduce their severity or mitigate their effect on whole-system performance.
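
A quick back-of-the-envelope sketch of why rare slow episodes start to dominate as a system grows. It assumes a request fans out to many servers, must wait for all of them, and that each server is slow independently with the same probability. The function name and the 1% figure are illustrative assumptions, not code or measurements quoted from the abstract above.

```python
def p_slow_request(p_slow_server: float, fanout: int) -> float:
    """Chance that at least one of `fanout` servers is slow.

    Assumes servers are slow independently, each with probability
    `p_slow_server` (a simplifying assumption for illustration).
    """
    return 1.0 - (1.0 - p_slow_server) ** fanout


if __name__ == "__main__":
    # Hypothetical: each server is slow on 1% of requests.
    for fanout in (1, 10, 100, 1000):
        print(f"fanout={fanout:5d}  P(slow request)={p_slow_request(0.01, fanout):.3f}")
```

Under these assumptions, a 1-in-100 slow episode on a single server turns into a slow response for roughly 63% of requests once each request touches 100 servers.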

Go read the whole thing.