
Gmail and the Electrical Grid: Looks the same

Gmail had a large-scale cascading failure yesterday:

At about 12:30 pm Pacific a few of the request routers became overloaded and in effect told the rest of the system “stop sending us traffic, we’re too slow!”. This transferred the load onto the remaining request routers, causing a few more of them to also become overloaded, and within minutes nearly all of the request routers were overloaded.
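The load-shifting mechanism in that quote can be sketched as a toy simulation. This is only an illustration under stated assumptions, not a model of Google's actual routers: the function name `simulate_cascade`, the even split of load across healthy routers, and the capacity numbers are all made up for the example.

```python
def simulate_cascade(total_load, capacities):
    """Return the number of healthy routers after each round of load shifting.

    Assumed toy model: load is split evenly across the healthy routers, and
    any router whose share exceeds its capacity drops out, pushing its load
    onto the routers that remain.
    """
    healthy = list(capacities)
    history = [len(healthy)]
    while healthy:
        share = total_load / len(healthy)
        survivors = [c for c in healthy if c >= share]
        if len(survivors) == len(healthy):
            break  # stable: every remaining router can carry its share
        healthy = survivors
        history.append(len(healthy))
    return history


# Hypothetical capacities (requests per second) for ten routers.
capacities = [95, 95, 110, 110, 125, 125, 140, 140, 160, 160]

print(simulate_cascade(900, capacities))   # [10]: everyone copes
print(simulate_cascade(1000, capacities))  # [10, 8, 6, 0]: full cascade
```

In this sketch a roughly 10% increase in load is enough to take the system from fully healthy to fully down, because each router that drops out raises the share carried by the rest.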

This cascading failure sounds exactly like that other universal network we all share: The electrical grid.

Large blackouts are cascading failures compounded by the failure of the “fuses” meant to isolate still functioning parts of the grid from the failed part.

Sounds a lot like what happened to Gmail.

Interestingly, the smart people at Google have recognized exactly that:

we have concluded that request routers don’t have sufficient failure isolation (i.e. if there’s a problem in one datacenter, it shouldn’t affect servers in another datacenter) and do not degrade gracefully (e.g. if many request routers are overloaded simultaneously, they all should just get slower instead of refusing to accept traffic and shifting their load).
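The "just get slower" alternative in that quote can be illustrated with a small sketch. Everything here is an assumption made for the example: the function name `slowdown_factors`, the even split of load, the capacity numbers, and the simplification that an overloaded router's response time grows in proportion to how far its share exceeds its capacity.

```python
def slowdown_factors(total_load, capacities):
    """Return a slowdown factor per router when no router refuses traffic.

    Assumed toy model: load is split evenly, every router keeps accepting
    its share, and a router that is over capacity responds slower by the
    ratio share / capacity. A factor of 1.0 means normal speed.
    """
    share = total_load / len(capacities)
    return [max(1.0, share / c) for c in capacities]


# Hypothetical capacities (requests per second) for ten routers.
capacities = [95, 95, 110, 110, 125, 125, 140, 140, 160, 160]

factors = slowdown_factors(1000, capacities)
print([round(f, 2) for f in factors])
# [1.05, 1.05, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
```

Under these assumptions the two weakest routers run about 5% slower and nothing else changes, since no load is pushed onto the other routers.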

Sounds like a good start. But what happens if a datacenter fails catastrophically? The routers stop accepting requests because they are gone (or can’t respond). Prepare for the next Gmail failure.

As Wikipedia notes:

Modern power systems are designed to be resistant to this sort of cascading failure, but it may be unavoidable (see below). Moreover, since there is no short-term economic benefit to preventing rare large-scale failures, some observers have expressed concern that there is a tendency to erode the resilience of the network over time, which is only corrected after a major failure occurs. It has been claimed that reducing the likelihood of small outages only increases the likelihood of larger ones. In that case, the short-term economic benefit of keeping the individual customer happy increases the likelihood of large-scale blackouts.

Posted in technical.

3 Responses


  1. tom says

    I agree with much of your post until the Wikipedia entry. It states that there is no short-term economic benefit to preventing rare large-scale failures. This may be true for a power company that has an exclusive relationship with its customers. I can’t choose to use another power company. With Gmail or Google Apps I CAN choose to use another provider, which presumably is a motivating factor for Google to get its act together.

  2. patrick says

    @tom —

    You are correct that it is much easier to switch email providers than power providers.

    But it still isn’t easy to switch email providers. Sure, you can move the email … sort of easily. But what about all the people that have your Gmail account? They all have to be notified. There is a large social cost to switching.

    I tend to add email addresses, not abandon them. I still have my Yahoo accounts from years ago, before I moved to Gmail.

    I doubt Gmail lost any customers as a result. So how much incentive is there really?

  3. Dave Doolin says

    The same people who rage against the Microsoft monoculture are cheering the same sort of computing monoculture with Google.

    I’m considering moving off Gmail, in fact.

    But not yet.
