Saturday, September 29, 2007

Warning: 99.999% Uptime is a Myth, and a Planet Killer

Uptime is a measure of the time a particular computer system has been "up" and running. Not surprisingly, it's the opposite of downtime, when a system is not operational. In geek cant, uptime is sometimes measured in nines; "three nines" gives you 99.9% availability, about 8 hours and 46 minutes of downtime a year. The gold standard is "five nines" - 99.999% availability, which translates to a total downtime of just over five minutes per year.
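For anyone who wants to check the arithmetic, here is a minimal sketch of the nines-to-downtime conversion. It assumes a 365-day year (no leap days), and the function name `allowed_downtime_minutes` is just an illustrative choice, not anything standard.

```python
# Minutes in a 365-day year (assumption: leap years ignored).
MINUTES_PER_YEAR = 365 * 24 * 60


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime budget, in minutes, for an uptime target of N nines.

    Two nines means 99%, three nines means 99.9%, and so on.
    """
    unavailability = 10 ** -nines  # e.g. 3 nines -> 0.001
    return MINUTES_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in range(2, 6):
        minutes = allowed_downtime_minutes(n)
        hours, rem = divmod(minutes, 60)
        print(f"{n} nines: {int(hours)} h {rem:.1f} min of downtime per year")
```

Three nines works out to about 525.6 minutes (8 hours 46 minutes) and five nines to about 5.3 minutes, which matches the figures above.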

At first glance, it would seem that you want to keep your site up as much as possible, lots of nines. Problem is that it turns out to be exponentially expensive to do so, both in cost and in environmental impact, because massive redundancy is required to support more reliable systems. Instead of one server, you need several... instead of one 9 to 5 support employee, you need around-the-clockers... instead of one power line, you need several, each going to a different substation. All this leads to a Malthusian eWaste accumulation, increased labor expenditure, and according to the Uptime Institute, unsustainable power consumption. And one more thing; it's impossible.

Fact is that five nines is simply impossible for a long period of time. Remember, five nines is about five minutes of downtime a year. So, if your site is out for an hour, you have burned through roughly 12 years' worth of downtime budget, and the site must then stay up non-stop for the rest of that stretch to hit your number - wholly unlikely. Good news is that this is hardly a business killer; for example, according to Pingdom, even eBay is only good for three nines over a long period. Seems like they are doing fine.
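The "one bad hour" math can be sketched the same way. This is a rough illustration, again assuming a 365-day year; `years_to_absorb` is a made-up helper name. Using the exact five-nines budget (about 5.26 minutes a year) gives a bit over 11 years, while the rounded five-minute figure gives the 12 quoted above.

```python
# Minutes in a 365-day year (assumption: leap years ignored).
MINUTES_PER_YEAR = 365 * 24 * 60


def years_to_absorb(outage_minutes: float, nines: int = 5) -> float:
    """Years of total service time over which a single outage still
    averages out to the target number of nines, assuming no other
    downtime at all during that period."""
    budget_per_year = MINUTES_PER_YEAR * 10 ** -nines
    return outage_minutes / budget_per_year


if __name__ == "__main__":
    # A one-hour outage against a five-nines target.
    print(f"five nines:  {years_to_absorb(60):.1f} years")
    # The same outage against a three-nines target, for comparison.
    print(f"three nines: {years_to_absorb(60, nines=3):.2f} years")
```

Against a three-nines target the same hour is absorbed in well under two months, which is one way to see why three nines is a far more realistic goal.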

The stark truth is that most companies don't need five nines. It sounds great in boardroom meetings ("If this saves one life, it's worth it!") and fills the pork barrel for those IT directors looking for a new project. But the numbers just don't make sense for most industries; studies show that a three-nines design is fine for most retailing applications. Yeah, if you are Walmart you need it for your transaction processing. Yeah, if you are LAX and a computer glitch leaves 17,000 passengers stranded, you need it. But for the vast majority of cases, there's not enough return on investment (ROI) to justify the cost to your business and the damage to the planet. Good enough is in fact good enough. :: Cnet


Unknown said...

Is 99.99% ok?

BlindManNoFish said...

If this is true, couldn't a savings be arrived at by coordinating down times across sites? If the rationale for minimizing down time is that down time costs 'something' then where does that 'something' go? To other sites? If that's true then coordinating the down times across all sites would result in no losses (combined with no gains.) I suspect that what's actually occurring is that uptime/downtime advantage only exists as a comparison to what your competitors' uptime/downtime is. Incidentally, if a company is properly tracking their traffic, then they ought to be able to effectively minimize the impact of their downtime by scheduling downtimes for minimal traffic times--which is in fact what the majority of them do. This makes % uptime vs. % downtime effectively a moot point; if you're in the business of selling Christmas cookies, and your downtime is all in late November/early December vs. the middle of June & July, then obviously it matters quite a bit less whether you have 99.9%/0.1% (Nov/Dec) vs. 95%/5% (Jun/Jul) or whatever. The former case would likely still result in a higher cost to your business.