At about 12:30 pm Pacific a few of the request routers became overloaded and in effect told the rest of the system, “stop sending us traffic, we’re too slow!” This shifted their load onto the remaining request routers, causing a few more of them to become overloaded as well, and within minutes nearly all of the request routers were overloaded.
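That mechanism is easy to see in a toy model (a sketch, not Google’s actual system): identical routers share a fixed total load, and any router whose share exceeds its capacity refuses traffic, dumping its share onto the survivors.

```python
# Toy cascade model (hypothetical numbers, not Gmail's real capacities):
# n_routers identical routers evenly share total_load. A router whose
# share exceeds its capacity stops accepting traffic, so the same load
# is split among fewer routers -- which can overload the next one.

def simulate_cascade(n_routers, capacity, total_load):
    """Return how many routers are still accepting traffic once the
    cascade settles (0 means a total outage)."""
    up = n_routers
    while up > 0 and total_load / up > capacity:
        up -= 1  # one overloaded router drops out, shrinking the pool
    return up

# 10 routers, 100 units of capacity each, 950 units of total load:
# 95 per router -- comfortably under capacity, everyone survives.
print(simulate_cascade(10, 100, 950))  # -> 10

# Lose just one router and each survivor gets ~105.6 units, over
# capacity -- the remaining routers then fail one after another.
print(simulate_cascade(9, 100, 950))   # -> 0
```

The point of the sketch: the system is fine at 95% utilization, but removing a single router tips every survivor past capacity, and refusing traffic turns one failure into a total one.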
This cascading failure sounds exactly like that other universal network we all share: the electrical grid.
Large blackouts are cascading failures compounded by the failure of the “fuses” meant to isolate still functioning parts of the grid from the failed part.
Sounds a lot like what happened to Gmail.
Interestingly the smart people at Google have recognized exactly that:
“we have concluded that request routers don’t have sufficient failure isolation (i.e. if there’s a problem in one datacenter, it shouldn’t affect servers in another datacenter) and do not degrade gracefully (e.g. if many request routers are overloaded simultaneously, they all should just get slower instead of refusing to accept traffic and shifting their load).”
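The degrade-gracefully idea from that quote can be contrasted with the refuse-traffic behavior in a small sketch (a toy model with made-up numbers, not Google’s implementation): under the same overload, one policy produces a total outage while the other just makes every request a bit slower.

```python
# Toy comparison of the two policies (hypothetical capacities/loads):
# n routers of fixed capacity evenly share a total load.

def shed_policy(n, capacity, load):
    """Refuse-traffic policy: overloaded routers drop out one by one.
    Returns how many routers are left serving (0 = total outage)."""
    while n > 0 and load / n > capacity:
        n -= 1
    return n

def degrade_policy(n, capacity, load):
    """Degrade-gracefully policy: every router keeps accepting traffic
    but slows down. Returns a slowdown factor >= 1.0 (how much slower
    each request gets relative to an unloaded router)."""
    per_router = load / n
    return max(1.0, per_router / capacity)

# Same overload for both: 9 routers, capacity 100 each, load 950.
print(shed_policy(9, 100, 950))     # -> 0 (cascade, total outage)
print(degrade_policy(9, 100, 950))  # -> ~1.056 (everyone ~6% slower)
```

Under identical conditions, shedding load amplifies the failure while degrading absorbs it: a roughly 6% slowdown for all users instead of an outage for all of them.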
Sounds like a good start. But what happens if a datacenter catastrophically fails? The routers stop accepting requests because they are gone (or can’t respond). Prepare for the next Gmail failure.
Modern power systems are designed to be resistant to this sort of cascading failure, but it may be unavoidable. Moreover, since there is no short-term economic benefit to preventing rare large-scale failures, some observers have expressed concern that the resilience of the network tends to erode over time, and is only restored after a major failure occurs. It has been claimed that reducing the likelihood of small outages only increases the likelihood of larger ones. In that case, the short-term economic benefit of keeping the individual customer happy increases the likelihood of large-scale blackouts.