Wanted: Hard drive boys for our new ginormous data center

In November, Google wrote on their official blog about an experiment in which they sorted 1 PB (1,000 TB) of data with MapReduce. The information about the sorting itself was impressive, but one thing that stuck in our minds was the following (emphasis added by us):

An interesting question came up while running experiments at such a scale: Where do you put 1PB of sorted data? We were writing it to 48,000 hard drives (we did not use the full capacity of these disks, though), and every time we ran our sort, at least one of our disks managed to break (this is not surprising at all given the duration of the test, the number of disks involved, and the expected lifetime of hard disks).

Each of the sorting runs that Google did lasted six hours, so four runs fit in a day. That would mean hard drives breaking at least four times a day for every 48,000 hard drives a data center is using.

Interesting, isn’t it? We have discussed this several times around the office here at Pingdom. Data centers are getting huge. How many hard drives are there in one of these new, extremely large data centers? 100,000? 200,000? More?

Add to this the “cloud computing” trend. Since we store more and more data online, data centers will have to keep adding more data storage capacity all the time to be able to accommodate their customers.

To give an example of how enormous some of these new data centers are, Microsoft has stated that it will have 300,000 servers in a new data center it is building in Chicago. We don’t know how many hard drives that will translate into for storage, but we imagine it will be many.

So, let’s assume we have one huge data center with 200,000 hard drives. At least 16 hard drives would break every day. With 400,000 hard drives, one hard drive would break roughly every 45 minutes. (Ok, perhaps we’re getting carried away here, but you get the idea.)
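The arithmetic above can be sketched with a short script. This is a back-of-the-envelope sketch only: it takes Google’s figure of at least one failure per six-hour run across 48,000 drives and assumes the failure rate scales linearly with fleet size, which is a simplification.

```python
# Back-of-the-envelope drive-failure math, extrapolated from the article's
# own numbers: at least one drive failure per six-hour sorting run across
# 48,000 drives, i.e. at least four failures per day per 48,000 drives.
# Linear scaling by fleet size is an assumption, not a measured fact.
BASELINE_DRIVES = 48_000
FAILURES_PER_DAY_BASELINE = 24 / 6  # four 6-hour runs per day, >= 1 failure each

def failures_per_day(num_drives: int) -> float:
    """Expected minimum drive failures per day, scaled linearly by fleet size."""
    return num_drives / BASELINE_DRIVES * FAILURES_PER_DAY_BASELINE

def minutes_between_failures(num_drives: int) -> float:
    """Average minutes between failures at that failure rate."""
    return 24 * 60 / failures_per_day(num_drives)

# The fleet sizes are the hypothetical data-center sizes discussed above.
for fleet in (48_000, 200_000, 400_000):
    print(f"{fleet:>7,} drives: ~{failures_per_day(fleet):.1f} failures/day, "
          f"one every ~{minutes_between_failures(fleet):.0f} minutes")
```

At 400,000 drives this works out to a failure roughly every 43 minutes, in line with the ballpark figure above.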

Does this mean that these huge data centers will basically have a dedicated “hard drive fixer” running around replacing broken hard drives?

Is the “cloud computing era” ushering in a new data center profession? Hard drive boys? 🙂

Maybe this is already happening?

Questions for those in the know…

So, if this is already the situation, or will be in the near future, at least in “mega data centers”, what would be the best way to handle it? Would you organize your data center with this in mind, keeping all storage in close proximity to avoid having to walk all over the place? And what about the containerized data centers that, for example, Microsoft is building? Would you have to visit each separate container to deal with problems as they arise?


  1. We’ll probably do what we’ve always done with repetitive, tedious tasks: batch ’em. If there’s a failure rate of 16 drives/day, place enough hot spares or hot nodes around to sustain the center for a week, and batch-replace them once a day/week/month.

  2. I think you’re right. At 1 failure every 45 minutes, you’d have just enough time to find the physical machine, pull out the HD, replace it, and restore a disk image onto it before another drive failed.

    Especially for a large system with 400,000 drives. That’s a lot of physical space. You could almost have races to see who could get to (or even find) the faulty machine first 😉

  3. Google (and probably anyone running something of that size) don’t replace hard drives. They ignore all failures and leave the dead hardware in place until a sufficient % of machines are dead in a particular sector, then just rip them all out and replace them with new machines.

    Apparently Microsoft is taking an even broader approach for its new data centers – everything will be done at the rack level. New racks full of machines will arrive from the hardware vendors ready to plug in and run, then be left untouched until enough have died to make it worth replacing the whole rack.

  4. Maybe datacenters can implement an intelligent BURN-E solution like on WALL-E (slightly more interesting than a tape jockey)

  5. Fuzzball: You’d be surprised – on average, all brands perform/last just as well. I think a lot of the love for Seagate came from their 5-year warranty, but this is being reduced to 4 years IIRC to cut costs.

    aka: That sounds like the most expensive solution to an easy problem. A complete rack costs a lot of $, and having even 10% of a rack sitting broken/unused is such a waste of space/resources.

  6. @Mike:
    And still, that is how Google is doing it. Hard drive failure? Just let it be. Server failure? We don’t care. A whole rack down? So what?

    They replace the whole thing once enough racks are down. Did you know that Google is even still using some old Pentium 2/3’s today?

    Expensive? Do you know how expensive it is to have some guys in your data center 24/7 replacing spare parts?

    Check this video from Google: http://www.uwtv.org/programs/displayevent.aspx?rID=2879 (Very interesting video!!)
