Supercomputers face growing resilience problems

As supercomputers grow more powerful, they’ll also grow more vulnerable to failure, thanks to the increased amount of built-in componentry. A few researchers at the recent SC12 conference, held last week in Salt Lake City, offered possible solutions to this growing problem.

Today’s high-performance computing (HPC) systems can have 100,000 nodes or more, with each node built from multiple components: memory, processors, buses and other circuitry. Statistically speaking, all these components will fail at some point, and they halt operations when they do so, said David Fiala, a Ph.D. student at North Carolina State University, during a talk at SC12.

The problem is not a new one, of course. When Lawrence Livermore National Laboratory’s 600-node ASCI (Accelerated Strategic Computing Initiative) White supercomputer went online in 2001, it had a mean time between failures (MTBF) of only five hours, thanks in part to component failures. Later tuning efforts improved ASCI White’s MTBF to 55 hours, Fiala said.

But as the number of supercomputer nodes grows, so will the problem. “Something has to be done about this. It will get worse as we move to exascale,” Fiala said, referring to how supercomputers of the next decade are expected to have 10 times the computational power that today’s models do.

Today’s techniques for dealing with system failure may not scale very well, Fiala said. He cited checkpointing, in which a running program is temporarily halted and its state is saved to disk. Should the program then crash, the system is able to restart the job from the last checkpoint.
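The checkpoint-and-restart cycle described above can be sketched in a few lines of Python. This is a single-process toy, not how an HPC checkpointing library works: the function names, the `crash_at` parameter used to simulate a failure, and the use of `pickle` for the saved state are all assumptions made for illustration.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state atomically so a crash mid-write keeps the old checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, path)  # atomic rename over the previous checkpoint


def load_checkpoint(path, default):
    """Return the last saved state, or the default if no checkpoint exists."""
    if not os.path.exists(path):
        return default
    with open(path, "rb") as f:
        return pickle.load(f)


def run_job(path, total_steps, interval, crash_at=None):
    """Sum the step numbers 0..total_steps-1, checkpointing every `interval` steps.

    `crash_at` is a hypothetical knob that simulates a node failure.
    """
    state = load_checkpoint(path, {"step": 0, "total": 0})
    while state["step"] < total_steps:
        if crash_at is not None and state["step"] == crash_at:
            raise RuntimeError("simulated node failure")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % interval == 0:
            save_checkpoint(path, state)
    return state["total"]
```

If the job fails at step 7 with a checkpoint interval of 5, calling `run_job` again resumes from step 5 rather than step 0, so only two steps of work are repeated. The time spent in `save_checkpoint` is the overhead Fiala is referring to.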

The problem with checkpointing, according to Fiala, is that as the number of nodes grows, the amount of system overhead needed to do checkpointing grows as well, and grows at an exponential rate. On a 100,000-node supercomputer, for example, only about 35 percent of the system’s activity would go toward useful work. The rest would be taken up by checkpointing and, should a system fail, recovery operations, Fiala estimated.

Because of all the additional hardware needed for exascale systems, which could be built from a million or more components, system reliability will have to be improved by 100 times in order to keep to the same MTBF that today’s supercomputers enjoy, Fiala said.
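The arithmetic behind that 100-times figure can be illustrated with a simple model. The sketch below assumes that components fail independently at a constant rate and that any single failure halts the system, so system MTBF is the per-component MTBF divided by the component count. That model and the sample numbers are assumptions for illustration, not figures from Fiala's talk.

```python
def system_mtbf_hours(component_mtbf_hours, n_components):
    """System MTBF when any one of n independent components can halt the run."""
    return component_mtbf_hours / n_components


def required_component_mtbf_hours(target_system_mtbf_hours, n_components):
    """Per-component MTBF needed to reach a target system MTBF."""
    return target_system_mtbf_hours * n_components


# Hypothetical numbers: 10,000 components today, 1,000,000 at exascale.
today = system_mtbf_hours(1_000_000, 10_000)             # 100.0 hours
exascale = system_mtbf_hours(1_000_000, 1_000_000)       # 1.0 hour
needed = required_component_mtbf_hours(100, 1_000_000)   # 100,000,000 hours
```

Under these assumptions, a hundredfold increase in component count cuts system MTBF a hundredfold, so each component must become 100 times more reliable just to stand still.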

Fiala presented technology that he and fellow researchers developed that may help improve reliability. The technology addresses the problem of silent data corruption, in which systems make undetected errors when writing data to disk.

Basically, the researchers’ approach consists of running multiple copies, or “clones,” of a program simultaneously and then comparing the answers. The software, called RedMPI, is run in conjunction with the Message Passing Interface (MPI), a library for splitting running applications across multiple servers so the different parts of the program can be executed in parallel.

RedMPI intercepts every MPI message that an application sends and forwards copies of the message to the clone (or clones) of the program. If different clones calculate different answers, then the numbers can be recalculated on the fly, which saves the time and resources that rerunning the entire program would take.

“Implementing redundancy is not expensive. It may be high in the number of core counts that are needed, but it avoids the need for rewrites with checkpoint restarts,” Fiala said. “The alternative is, of course, to simply rerun jobs until you think you have the right answer.”

Fiala recommended running two backup copies of each program, for triple redundancy. Though running multiple copies of a program would initially take up more resources, over time it may actually be more efficient, because programs would not need to be rerun to check answers. Also, checkpointing may not be needed when multiple copies are run, which would also save on system resources.
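The reason three copies are more useful than two is that a majority vote can identify which copy is wrong, not merely that the copies disagree. The following is a minimal sketch of that voting step only. It is not RedMPI's API or implementation; the function name `vote` and the idea of comparing whole message payloads directly are assumptions for illustration.

```python
from collections import Counter


def vote(replica_values):
    """Pick the majority value among replicas.

    Returns (value, indices of replicas that disagreed).
    Raises ValueError when no value has a strict majority.
    """
    counts = Counter(replica_values)
    value, n = counts.most_common(1)[0]
    if n * 2 <= len(replica_values):
        raise ValueError("no majority among replicas")
    disagreeing = [i for i, v in enumerate(replica_values) if v != value]
    return value, disagreeing


# Three clones send the "same" message; the third has a corrupted payload.
value, bad = vote([b"3.14159", b"3.14159", b"3.14158"])
```

With two clones, a mismatch can only be detected, and `vote` raises an error because neither value has a majority. With three, the corrupted clone is outvoted and can be corrected without rerunning the job.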

“I think the idea of doing redundancy is actually a great idea. [For] very large computations, involving hundreds of thousands of nodes, there certainly is a chance that errors will creep in,” said Ethan Miller, a computer science professor at the University of California, Santa Cruz, who attended the presentation. But he said the approach may not be suitable given the amount of network traffic that such redundancy might create. He suggested running all the applications on the same set of nodes, which could minimise internode traffic.

In another presentation, Ana Gainaru, a Ph.D. student from the University of Illinois at Urbana-Champaign, presented a technique for analysing log files to predict when system failures would occur.

The work combines signal analysis with data mining. Signal analysis is used to characterise normal behaviour, so when a failure occurs, it can be easily spotted. Data mining looks for correlations between separate reported failures. Other researchers have shown that multiple failures are sometimes correlated with each other, because a failure with one technology may affect performance in others, according to Gainaru. For instance, when a network card fails, it will soon hobble other system processes that rely on network communication.

The researchers found that 70 percent of correlated failures provide a window of opportunity of more than 10 seconds. In other words, when the first sign of a failure has been detected, the system may have 10 seconds or more to save its work, or move the work to another node, before a more critical failure occurs. “Failure prediction can be merged with other fault-tolerance techniques,” Gainaru said.
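The correlation idea, and the lead time it yields, can be illustrated with a toy log scan. This is not Gainaru's method, which also involves signal analysis. It is a minimal sketch that assumes log entries have already been reduced to `(timestamp_seconds, event_type)` pairs sorted by time, and the event names are hypothetical.

```python
from collections import defaultdict


def follow_leads(events, window):
    """Map (earlier_type, later_type) to the observed lead times in seconds.

    `events` is a list of (timestamp_seconds, event_type) sorted by time.
    A pair is recorded when a different event type appears within `window`
    seconds after the earlier event.
    """
    leads = defaultdict(list)
    for i, (t_a, a) in enumerate(events):
        for t_b, b in events[i + 1:]:
            if t_b - t_a > window:
                break  # events are sorted, so later ones are out of range too
            if b != a:
                leads[(a, b)].append(t_b - t_a)
    return dict(leads)


log = [
    (0, "nic_error"),
    (12, "mpi_timeout"),
    (100, "nic_error"),
    (115, "mpi_timeout"),
    (300, "disk_warn"),
]
pairs = follow_leads(log, window=60)
```

Here `pairs` records that `mpi_timeout` followed `nic_error` twice, with lead times of 12 and 15 seconds. A predictor built on such pairs would treat the first event as a warning and use the lead time to checkpoint or migrate work.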
