Date: July 10, 2010

I recently changed jobs. I started working at Yahoo! as a Service Engineer. Part of my job is to administer the servers that we use to run Apache Hadoop. We currently run 38,000 servers. That number is growing by thousands each quarter. It was 25,000 when I interviewed earlier this year.

When operating servers at this scale, your perspective changes. When considering high availability, you have to consider the system as a whole. Your software, rather than the hardware, needs to ensure the availability of your system. I don't care what brand or configuration of hardware you purchase; you cannot operate 38,000 servers without a hardware failure. Hardware breaks. Fans break. Hard drives fail. Power supplies stop working. Cosmic rays flip bits in memory. (Don't laugh. People are designing systems to detect and work around cosmic rays.) This is a fact of life.

Another impact of this scale is the cost of redundant components. Consider the cost of acquisition of 20,000 servers. If you can save $100 per server by eliminating a redundant power supply and redundant fans, you just saved $2,000,000.

The end result of these forces is that System Administrators must become efficient at managing hardware failure. It is no longer a question of if servers will fail. The question becomes how many will fail and how fast you can get them back in service.

Managing hardware failure is one of the most time-consuming (and therefore one of the most expensive) aspects of my job. Here is a comparison to put it in perspective. If I need to reinstall the operating system on 40 servers, it would take me about 30 minutes. I would write a script to edit some PXE config files and then use IPMI to power cycle the systems. I could then turn to another task for the remainder of the installation. This solution scales as well. If I needed to reinstall 4,000 servers, it would still take me about 30 minutes.
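To make the reinstall workflow concrete, here is a minimal sketch of what such a script might look like. It assumes a pxelinux-style TFTP layout, a kickstart-style installer, and the ipmitool command line client. The kernel and initrd names, the kickstart URL, and all function names are placeholders made up for illustration, not what we actually run.

```python
import subprocess
from pathlib import Path
from typing import List, Tuple


def pxe_config_name(mac: str) -> str:
    # pxelinux looks for a per-host file named "01-" plus the MAC address,
    # lowercase, with dashes instead of colons.
    return "01-" + mac.lower().replace(":", "-")


def render_pxe_config(kickstart_url: str) -> str:
    # Kernel, initrd and the ks= argument are assumptions for illustration;
    # use whatever your installer expects.
    return (
        "DEFAULT install\n"
        "LABEL install\n"
        "  KERNEL vmlinuz\n"
        f"  APPEND initrd=initrd.img ks={kickstart_url}\n"
    )


def ipmi_reinstall_commands(bmc_host: str, user: str, password_file: str) -> List[List[str]]:
    # Ask for a PXE boot on the next start, then power cycle the chassis.
    base = ["ipmitool", "-I", "lanplus", "-H", bmc_host, "-U", user, "-f", password_file]
    return [
        base + ["chassis", "bootdev", "pxe"],
        base + ["chassis", "power", "cycle"],
    ]


def reinstall(hosts: List[Tuple[str, str]], tftp_root: Path, kickstart_url: str,
              user: str, password_file: str, dry_run: bool = True) -> List[List[str]]:
    """hosts is a list of (mac, bmc_host) pairs. Returns the IPMI commands planned or run."""
    planned = []
    for mac, bmc_host in hosts:
        commands = ipmi_reinstall_commands(bmc_host, user, password_file)
        planned.extend(commands)
        if dry_run:
            continue  # only report what would happen
        cfg_dir = tftp_root / "pxelinux.cfg"
        cfg_dir.mkdir(parents=True, exist_ok=True)
        (cfg_dir / pxe_config_name(mac)).write_text(render_pxe_config(kickstart_url))
        for cmd in commands:
            subprocess.run(cmd, check=True)
    return planned
```

The loop is the same whether the host list has 40 entries or 4,000, which is why the human time stays roughly constant. With dry_run left at its default, the function only returns the commands it would run.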

Consider, on the other hand, a failure in the disk that contains the root partition. In order to diagnose and resolve the problem, I would have to log on to the console via IPMI, examine logs, and possibly reboot once or twice to figure out the problem. Then, I would have to fill out a ticket for the site operations team to replace the disk. Overall, this might take 10 minutes or more per server. You can see that managing tens of thousands of servers which have dozens of failures per day would get very cumbersome very fast.

I can hear you now. "My enterprise-grade server hardware includes hardware management that eliminates the diagnostics problem." This is true. For the scale of many enterprises, the combination of enterprise server hardware and the manufacturer's management software minimizes the expense of diagnosis. There are two problems that make this unusable at a larger scale. First, the cost of server-grade hardware is significantly higher than commodity hardware. Sometimes it is as much as double. Second, I have not seen management software that scales to tens of thousands of compute nodes.

Indulge me for a moment while I go off on a tangent. IPMI for large-scale system administration is the best thing since sliced bread. I can remotely connect to a server, get on the console, power cycle it, see a log of some hardware events, and read the temperature and fan speed. It works cross platform. I don't need to worry about which server manufacturer I buy from as long as they follow the IPMI standard.
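As a rough illustration of how scriptable this is, the sketch below maps the operations I just listed onto ipmitool subcommands. The helper names and the QUERIES table are invented for this example, and it assumes the BMC is reachable over the lanplus interface with the password stored in a file.

```python
import subprocess
from typing import List

# Read-only IPMI queries, keyed by a friendly name (the names are made up here).
QUERIES = {
    "power_status": ["chassis", "power", "status"],
    "event_log": ["sel", "list"],                    # hardware event log
    "temperatures": ["sdr", "type", "Temperature"],  # temperature sensors
    "fans": ["sdr", "type", "Fan"],                  # fan speed sensors
}


def ipmi_query_command(bmc_host: str, user: str, password_file: str, query: str) -> List[str]:
    # Build the ipmitool command line for one of the queries above.
    base = ["ipmitool", "-I", "lanplus", "-H", bmc_host, "-U", user, "-f", password_file]
    return base + QUERIES[query]


def run_query(bmc_host: str, user: str, password_file: str, query: str) -> str:
    # Run the query and return its raw text output for further parsing.
    cmd = ipmi_query_command(bmc_host, user, password_file, query)
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
```

The same command shape works against any vendor's BMC that implements the standard, which is the whole appeal.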

So, I've given a long list of the reasons why hardware sucks. Now, what do we do about it? I would like to see hardware diagnostics become a standard part of the hardware management tool set. Right now, each manufacturer has its own method for hardware diagnostics. Some put a separate partition on the disk. Some give you a CD to boot from. Some make it a feature of the fancy remote management card. In order to be able to diagnose hardware at scale, there needs to be an API.

I would like to see hardware diagnostics as ubiquitous as IPMI. Diagnosing server hardware problems needs to be as easy to automate as power cycling a system. S.M.A.R.T. is a decent idea in this area. All the hard drive manufacturers have come together to provide a unified API to access this information from the hard drive. Tools have been written to read this info cross platform as well as cross operating system. Hardware diagnostics might even fit well as an extension to IPMI.
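Here is a small sketch of the kind of automation S.M.A.R.T. already allows, assuming the smartctl tool from smartmontools is installed. The function names are hypothetical, and the parser only looks for the two health summary lines that smartctl -H commonly prints for ATA and SCSI drives; other output formats would need more handling.

```python
import re
import subprocess
from typing import Optional


def parse_smart_health(output: str) -> Optional[bool]:
    """Return True if healthy, False if failing, None if no verdict was found."""
    # ATA drives typically report a line like:
    #   SMART overall-health self-assessment test result: PASSED
    match = re.search(r"SMART overall-health self-assessment test result:\s*(\S+)", output)
    if match:
        return match.group(1) == "PASSED"
    # SCSI/SAS drives typically report a line like:
    #   SMART Health Status: OK
    match = re.search(r"SMART Health Status:\s*(\S+)", output)
    if match:
        return match.group(1) == "OK"
    return None


def disk_is_healthy(device: str) -> Optional[bool]:
    # smartctl encodes status bits in its exit code, so a non-zero exit is
    # not treated as an error here; only the text verdict is inspected.
    result = subprocess.run(["smartctl", "-H", device], capture_output=True, text=True)
    return parse_smart_health(result.stdout)
```

A check like this can run from cron on every node and open a ticket when it returns False. That is the level of automation I would like to have for the rest of the hardware.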

Operating hardware will always be a very manual process, even at a large scale. We need to strive to automate as much of this operation as possible. Hardware manufacturers have been heading in this direction. As sysadmins, we need to continue to push hardware manufacturers to include standards-based diagnostics.