A couple of weeks ago my website went down. Nine thousand other web servers also died when The Planet, a large hosting service, experienced an electrical explosion and fire in one of its facilities.
Because it was an electrical fire, the local fire department refused to let the company activate their backup generators for fear of further damage.
The servers were down for almost two days as the company worked around the clock to fix the problems and get the servers back online. The response by customers was mixed, with some quite happy with The Planet's handling of the situation, and others vowing to switch companies as soon as possible.
Hosting companies do have alternate power sources and backup plans in place for such incidents. In this case, the situation was extreme, hence the long delay in getting things back online. Yet as we become ever more dependent upon the internet in our daily lives, interruptions are not just inconveniences - they're critical events. Don't think so? Ask people how they felt when Amazon's cloud service went down.
Unfortunately, the architecture of the internet wasn't really designed to handle what we're doing with it. No, I'm not about to say that the internet will collapse upon itself by 2012, or that we'll be out of bandwidth in two years. But the internet does have a huge single point of failure - the domain name system (DNS).
Every computer connected to the internet has an Internet Protocol (IP) address. Think of it as your computer's phone number. To make things a little more people-friendly, we map these IP addresses to domain names - the addresses such as google.com that you type into your browser.
These mappings are the responsibility of something called a DNS server. Your browser sends the name to the server and gets back the desired IP address. Because the answer can vary depending on where the question comes from, this also makes it possible for you to type google.com and be directed to a server much closer to you than Google's office in California.
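To make the name-to-number lookup concrete, here is a minimal sketch in Python. It relies on the standard library's socket.getaddrinfo, which hands the question to whatever DNS resolver the operating system is configured to use. The resolve function and its lookup parameter are my own illustrative names, not part of any library.

```python
import socket


def resolve(hostname, lookup=socket.getaddrinfo):
    """Return the distinct IP addresses that a hostname maps to.

    `lookup` defaults to the operating system's resolver. It is a parameter
    only so the sketch can be exercised without a network connection.
    """
    results = lookup(hostname, None)
    # Each result is (family, type, proto, canonname, sockaddr);
    # the first element of sockaddr is the IP address as a string.
    return sorted({sockaddr[0] for *_, sockaddr in results})
```

On a connected machine, resolve("google.com") returns whatever addresses your resolver hands back at that moment, and they will likely differ from what someone on another continent sees.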
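The DNS architecture is a bit like a tree. At the top are the root DNS servers (of which there are only a very small number). They don't hold the mappings themselves, but they know which servers are responsible for each top-level domain such as .com. Those servers in turn know which DNS servers are responsible for each domain beneath them, and so on down to the server that holds the actual mapping, just as a branch has leaves on it. A lookup follows this chain of referrals from the root downward until it reaches a server that can answer.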
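The tree can be sketched as a toy model. Everything here is invented for illustration: the server names, the SERVERS table, and the resolve_from_root function are hypothetical, and real DNS involves caching, record types, and network traffic that this ignores. It only shows the idea of walking from the root down to the server that holds the mapping.

```python
# Toy data: each server either holds final records or knows whom to ask next.
SERVERS = {
    "root": {"delegations": {"com": "com-tld"}},
    "com-tld": {"delegations": {"example.com": "ns.example-host"}},
    "ns.example-host": {"records": {"www.example.com": "192.0.2.10"}},
}


def resolve_from_root(name, servers=SERVERS, start="root", max_hops=10):
    """Follow referrals down the tree; return (address, servers_asked)."""
    server = start
    asked = []
    for _ in range(max_hops):
        asked.append(server)
        zone = servers[server]
        records = zone.get("records", {})
        if name in records:
            return records[name], asked
        # Pick the most specific delegated suffix that covers the name.
        delegations = zone.get("delegations", {})
        matches = [s for s in delegations if name == s or name.endswith("." + s)]
        if not matches:
            raise LookupError(f"{server} has no answer or referral for {name}")
        server = delegations[max(matches, key=len)]
    raise LookupError(f"gave up on {name} after {max_hops} referrals")
```

Asking for "www.example.com" visits root, then com-tld, then ns.example-host. Remove that last server from the table and the name can no longer be found, even though the machine at 192.0.2.10 may be running perfectly well.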
When The Planet went down, so did its DNS servers. That meant nothing was left to answer requests for the domains those DNS servers were responsible for. It appeared as if those 9,000 servers never existed. DNS servers are also prime targets for denial of service attacks. If a server is too busy to answer a request, then the servers it knows about effectively don't exist either.
Yet we've accepted the questionable redundancy of just running a couple of DNS servers at the hosting company. Obviously that didn't help at all when The Planet went down. Even if the web servers themselves had come back up, they wouldn't have appeared to be there as long as the DNS servers were still down.
At the very least it should be standard practice to designate multiple, non-co-located DNS servers with failover support, so that even if a subdomain becomes unreachable through one DNS path, those servers can still be reached. The internet may route around damage, but it shouldn't ignore the survivors.
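Here is a rough sketch of what that failover looks like from the point of view of whoever is asking. It is a simulation, not real DNS code: NameserverDown, make_nameserver, and query_with_failover are hypothetical names, and the "facilities" are just labels. The point is only that a lookup survives when at least one nameserver lives somewhere else.

```python
class NameserverDown(Exception):
    """Raised when a simulated nameserver cannot be reached."""


def make_nameserver(label, records, up=True):
    """Build a fake nameserver: a function that answers or fails."""
    def answer(name):
        if not up:
            raise NameserverDown(f"{label} is unreachable")
        return records[name]
    return answer


def query_with_failover(name, nameservers):
    """Ask each nameserver in turn and return the first answer received."""
    failures = []
    for nameserver in nameservers:
        try:
            return nameserver(name)
        except NameserverDown as exc:
            failures.append(str(exc))
    raise NameserverDown("; ".join(failures) or "no nameservers configured")


RECORDS = {"www.example.com": "192.0.2.10"}
# Two nameservers in the facility that lost power, one hosted elsewhere.
same_site_1 = make_nameserver("ns1 (same facility)", RECORDS, up=False)
same_site_2 = make_nameserver("ns2 (same facility)", RECORDS, up=False)
other_site = make_nameserver("ns3 (different facility)", RECORDS, up=True)
```

With only same_site_1 and same_site_2 in the list, the query fails no matter how many copies you run. Add other_site and the same query returns the address.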