Oct 18, 2012

How did software get so reliable?

John Regehr raises this question, prompted by a paper of the same title by Tony Hoare. I don’t disagree with the reasons given – management, testing, tools etc. – but I do think there are a number of factors not called out. Here is my take.

Age

Flakiness in software is most noticeable when it is in the foundational layers – the operating system, basic system services such as those for networking, etc. And it is exactly those foundational pieces that by now have had two decades or more of continuous existence. This means they have been exercised and tested in every way imaginable in the real world. It also helps that the responsibilities of almost all foundational infrastructure software have barely changed during that same time.

For example, what an operating system kernel does has not changed in a very long time. The basic networking protocols – TCP/IP, DNS, HTTP and many others – have also been remarkably stable, with mostly incremental improvements. The sheer weight of everything else built on top of this layer has turned these into diamonds.

Server-side software

Most software that is today used by end consumers, i.e. clients, is but a thin pretty shell backed by an array of services running on a large collection of machines off in a data center somewhere. This has several implications for the overall perception of reliability.

- Server-side software is tended by humans. I mean that literally. When error rates go up, humans are woken up by alerts, and they deal with the problem. In most cases, the issue will be taken care of without any significant disruption, or anybody even noticing. For every major outage where a site or service is completely down, there are a thousand such problems.
- Server-side software can be updated under controlled conditions, on demand.
- Embarrassingly parallel loads spread across a large number of machines mean a high level of aggregate reliability for the entire service, since the failure of any one machine affects only a small share of the work.
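
To put rough numbers on that last point, here is a back-of-the-envelope sketch in Python. It assumes machines fail independently of one another, which real data centers only approximate, and the 99% per-machine availability is an illustrative figure, not a measurement. The function names are made up for this example.

```python
def service_availability(machine_availability: float, replicas: int) -> float:
    """Probability that at least one of `replicas` machines is up.

    Assumes failures are independent, which is an idealization.
    """
    if not 0.0 <= machine_availability <= 1.0:
        raise ValueError("machine_availability must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # The service is down only if every replica is down at the same time.
    return 1.0 - (1.0 - machine_availability) ** replicas


def share_of_load_hit_by_one_failure(machines: int) -> float:
    """Fraction of an evenly spread, embarrassingly parallel load
    that is affected when exactly one machine fails."""
    if machines < 1:
        raise ValueError("machines must be at least 1")
    return 1.0 / machines


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, "replicas:", service_availability(0.99, n))
    print("load hit by one failure out of 1000 machines:",
          share_of_load_hit_by_one_failure(1000))
```

Under these assumptions, three replicas that are each up 99% of the time are all down together only about one time in a million, and losing one machine out of a thousand touches a tenth of a percent of the load.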

In the era of heavy client-side software, the kinds of problems described above would all have been major disruptions.

In essence, the move to server-side software allows us to cheat when it comes to reliability. You can deal with problems as they arise. You don’t have to worry about a large existing installation out in the field. This also means you don’t need to front-load hardening and testing.