Cosmic rays are a fact of life, and as transistors get smaller, the amount of energy it takes to spontaneously flip a bit gets smaller, too. By 2023, when exascale computers (ones capable of performing 10^18 operations per second) are predicted to arrive in the United States, transistors will likely be a third the size they are today, making them that much more prone to cosmic ray–induced errors. For this and other reasons, future exascale computers will be prone to crashing much more frequently than today’s supercomputers do. For me and others in the field, that prospect is one of the greatest impediments to making exascale computing a reality.
- A high-profile example affected what was the second fastest supercomputer in the world in 2002, a machine called ASCI Q at Los Alamos National Laboratory. When it was first installed at the New Mexico lab, this computer couldn’t run more than an hour or so without crashing.
- In the summer of 2003, Virginia Tech researchers built a large supercomputer out of 1,100 Apple Power Mac G5 computers. They called it Big Mac. To their dismay, they found that the failure rate was so high it was nearly impossible even to boot the whole system before it would crash.
The problem was that the Power Mac G5 did not have error-correcting code (ECC) memory, and cosmic ray–induced particles were changing so many values in memory that out of the 1,100 Mac G5 computers, one was always crashing.
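To illustrate how an error-correcting code can repair a single flipped bit, here is a minimal sketch of the textbook Hamming(7,4) code. It is an illustration only: the function names are made up for this example, and real ECC memory typically uses wider codes implemented in hardware, not this toy version.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits into a 7-bit codeword.

    Codeword positions are numbered 1..7; parity bits sit at positions 1, 2 and 4.
    """
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(codeword):
    """Correct at most one flipped bit; return (data_bits, error_position).

    error_position is 0 when no error was detected, otherwise the
    1-based position of the bit that was flipped back.
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:
        c[syndrome - 1] ^= 1  # undo the single-bit flip
    return [c[2], c[4], c[5], c[6]], syndrome


if __name__ == "__main__":
    word = hamming74_encode([1, 0, 1, 1])
    word[5] ^= 1  # simulate a particle strike flipping one stored bit
    print(hamming74_decode(word))
```

Without the three extra parity bits, the flipped bit would simply be read back as wrong data, which is the situation the Power Mac G5 machines were in.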
Everything from cosmic rays to weakly radioactive trace lines to failing power regulators could cause a supercomputer crash.
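A rough way to see why large machines crash so often is to look at how failure rates add up. The sketch below assumes that nodes fail independently at a constant rate, so the mean time between failures (MTBF) of the whole system is the per-node MTBF divided by the node count. The five-year per-node figure is purely hypothetical and is not taken from the article.

```python
import math


def system_mtbf_hours(node_mtbf_hours, node_count):
    """MTBF of a system that goes down whenever any single node fails.

    Assumes independent nodes with constant (exponential) failure rates.
    """
    return node_mtbf_hours / node_count


def survival_probability(run_hours, mtbf_hours):
    """Probability that a run of the given length sees no failure."""
    return math.exp(-run_hours / mtbf_hours)


if __name__ == "__main__":
    node_mtbf = 5 * 365 * 24  # hypothetical: each node fails once every 5 years
    mtbf = system_mtbf_hours(node_mtbf, 1100)
    print(f"system MTBF: {mtbf:.1f} hours")
    print(f"chance of a clean 24-hour run: {survival_probability(24, mtbf):.2f}")
```

Under these assumed numbers, a node that fails only once in five years still leaves an 1,100-node machine with an MTBF of roughly 40 hours.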
Fascinating stuff. I do have some questions that I hope to see answered at some point:
- Each core has to be able to run ‘independently’ to some degree (as at Google). What can the supercomputer field borrow from large data centers?
- Why does the total state need to be saved? Why can’t state be preserved core by core?
- Why not have three cores work on each part of the problem and use consensus to determine the correct answer? (This is how the space shuttle operated.)
- What is it about the problems supercomputers solve that prevents the Google mass-of-computers approach from being used?
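The three-core consensus idea in the list above is usually called triple modular redundancy. A minimal sketch of it, with hypothetical function names and a simulated bit flip standing in for a real hardware fault, might look like this:

```python
from collections import Counter


def majority_vote(results):
    """Return the value reported by a strict majority of replicas."""
    value, count = Counter(results).most_common(1)[0]
    if 2 * count <= len(results):
        raise ValueError("no majority among replica results")
    return value


def run_with_redundancy(task, argument, corrupt_replica=None):
    """Run the same task three times and vote on the integer results.

    corrupt_replica is a hypothetical test hook: if set, one bit is
    flipped in that replica's result to simulate a soft error.
    """
    results = [task(argument) for _ in range(3)]
    if corrupt_replica is not None:
        results[corrupt_replica] ^= 1 << 7  # simulated bit flip
    return majority_vote(results)


if __name__ == "__main__":
    print(run_with_redundancy(lambda x: x * x, 12, corrupt_replica=1))
```

The cost is plain to see: every piece of work is done three times, so two thirds of the machine is spent on checking the remaining third.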
These lead to some broader questions:
- How will any meaningful quantum computer operate?
- Biology has even more information and compute density. How does biology deal with errors, and what can CS learn from it?
- Will this prevent humans from having good computers in space? (Assuming that we ever get off this rock in a meaningful way.)
- Is there an information theory in the making that can put a theoretical maximum on reliable information density for a given radiation level? That is, will error-correction logic consume any gains from shrinking bit storage? Will radiation levels impose a minimum trace thickness?