Blanc SIMI 3 - Information, Matter and Engineering Sciences: Hardware and software for systems, computers and communications

Resilience for exascale scientific computing – RESCUE

Resilience for exascale applications

The advent of exascale machines will help solve new scientific challenges only if the resilience of large scientific applications deployed on these machines can be guaranteed. With 10,000,000 or more processor cores, the time interval between two consecutive failures is anticipated to be smaller than the typical duration of a checkpoint. The main objective of the RESCUE project is to develop new algorithmic techniques and software tools to solve the exascale resilience problem.

This research follows three main thrusts:
(i) checkpoint protocols: scalable and lightweight checkpoint and migration protocols;
(ii) execution models: stochastic models to predict (and, in turn, optimize) the expected performance (execution time or throughput) of large-scale parallel scientific applications;
(iii) parallel algorithms: numerical methods and robust algorithms that still converge in the presence of multiple failures.

Only the combination of these three thrusts (new checkpoint protocols, new execution models, and new parallel algorithms) can solve the exascale resilience problem. We hope to contribute to the solution of this critical problem by providing the community with new protocols, models and algorithms, as well as with a set of freely available public-domain software prototypes.

- Protocols for exascale fault tolerance: the design of new rollback/recovery protocols that are lightweight, distributed and scalable

- Performance and execution models for exascale applications: the aim is to minimize the expected execution time of these applications by optimizing the checkpoint/migration frequency and by selecting resources efficiently

- Robust numerical algorithms for exascale simulation
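
To make the second item concrete, the sketch below uses Young's classical first-order approximation for the checkpoint period, W = sqrt(2 * C * mu), where C is the checkpoint duration and mu the platform mean time between failures. This formula is a textbook result, not a model produced by the project, and it is only meaningful when C is much smaller than mu. The function names and the numerical values are hypothetical, chosen for illustration.

```python
import math

def young_period(checkpoint_s, mtbf_s):
    """Young's first-order checkpoint period: sqrt(2 * C * mu)."""
    return math.sqrt(2.0 * checkpoint_s * mtbf_s)

def first_order_waste(period_s, checkpoint_s, mtbf_s):
    """First-order fraction of time lost for a given period.

    C / T accounts for the checkpoint overhead, and T / (2 * mu) for the
    expected re-execution after a failure. Downtime and recovery costs are
    ignored in this simplified model.
    """
    return checkpoint_s / period_s + period_s / (2.0 * mtbf_s)

# Assumed values: 60-second checkpoint, one failure per day on average.
checkpoint = 60.0
mtbf = 24 * 3600.0

best = young_period(checkpoint, mtbf)
for period in (best / 2, best, best * 2):
    waste = first_order_waste(period, checkpoint, mtbf)
    print(f"period = {period:8.1f} s  waste = {waste:.4f}")
```

Under this simplified model, checkpointing either more or less often than the computed period increases the waste, which is the trade-off that the execution models aim to capture more accurately.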

The project has delivered 31 publications as of today.

The advent of exascale machines will help solve new scientific
challenges only if the resilience of large scientific applications
deployed on these machines can be guaranteed.
With 10,000,000 or more processor cores, the time interval between
two consecutive failures is anticipated to be smaller than the typical duration
of a checkpoint, i.e., the time needed to save all necessary
application and system data. No actual progress can then be expected for a large-scale parallel
application. Current fault-tolerant techniques and tools can no longer be used.
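
The order of magnitude behind this claim can be checked with a short calculation. The sketch below assumes independent components with exponentially distributed failures, so that the platform mean time between failures (MTBF) is the component MTBF divided by the number of components. The 10-year per-core MTBF and the 10-minute checkpoint duration are hypothetical values chosen for illustration, not figures from the project.

```python
import math

SECONDS_PER_YEAR = 365 * 24 * 3600

def platform_mtbf(component_mtbf_s, n_components):
    """MTBF of a platform of n independent components (exponential failures)."""
    return component_mtbf_s / n_components

def prob_checkpoint_succeeds(checkpoint_s, mtbf_s):
    """Probability that no failure strikes during one checkpoint."""
    return math.exp(-checkpoint_s / mtbf_s)

# Hypothetical figures, for illustration only.
component_mtbf = 10 * SECONDS_PER_YEAR  # assumed: 10 years per core
cores = 10_000_000                      # core count quoted in the text
checkpoint = 600.0                      # assumed: 10-minute checkpoint

mu = platform_mtbf(component_mtbf, cores)
p = prob_checkpoint_succeeds(checkpoint, mu)
print(f"platform MTBF: {mu:.1f} s")
print(f"probability that a checkpoint completes: {p:.2e}")
```

With these assumed values the platform fails roughly every half minute, so a checkpoint lasting several minutes almost never completes.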

The main objective of the RESCUE project is to develop new algorithmic techniques and software
tools to solve the "exascale resilience problem". Solving this
problem implies a departure from current approaches,
and calls for yet-to-be-discovered algorithms, protocols and software tools.

The proposed research follows three main thrusts. The first thrust deals with
novel checkpoint protocols. This thrust will include the classification
of relevant fault categories and the development of a software package
for fault injection into application execution at runtime. The main
research activity will be the design and development of scalable
and lightweight checkpoint and migration protocols, with on-the-fly
storing of key data, distributed but coordinated decisions, etc. These
protocols will be validated via a prototype implementation integrated
with the public-domain MPICH project. The second
thrust entails the development of novel execution models, i.e.,
accurate stochastic models to predict (and, in turn, optimize) the
expected performance (execution time or throughput) of large-scale
parallel scientific applications. In the third thrust, we will develop
novel parallel algorithms for scientific numerical kernels. We will
profile a representative set of key large-scale applications to assess their resilience characteristics
(e.g., identify specific patterns to reduce checkpoint overhead). We will also
analyze execution trade-offs based on the replication of crucial kernels and on
decentralized ABFT (Algorithm-Based Fault Tolerance) techniques. Finally, we will
develop new numerical methods and robust algorithms that still converge in
the presence of multiple failures. These algorithms will be implemented as part
of a software prototype, which will be evaluated when confronted with
realistic faults generated via our fault injection techniques.
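
As an illustration of the ABFT idea mentioned above, the following sketch protects a matrix-vector product with two checksum rows (one plain, one weighted) in the style of the classical checksum schemes. It is a generic textbook example, not the decentralized technique studied in the project. All function names are hypothetical, and integer data is used so that checksum comparisons are exact; floating-point data would need a tolerance.

```python
def matvec(a, x):
    """Plain matrix-vector product y = A x."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]

def checksum_rows(a):
    """Return the plain and weighted column-checksum rows of A."""
    n_cols = len(a[0])
    plain = [sum(row[j] for row in a) for j in range(n_cols)]
    weighted = [sum((i + 1) * row[j] for i, row in enumerate(a))
                for j in range(n_cols)]
    return plain, weighted

def verify_and_correct(y, plain, weighted, x):
    """Detect and repair at most one corrupted entry of y = A x.

    Returns the index that was corrected, or None if y is consistent.
    """
    expected_sum = sum(p * xj for p, xj in zip(plain, x))
    expected_wsum = sum(w * xj for w, xj in zip(weighted, x))
    d1 = sum(y) - expected_sum
    d2 = sum((i + 1) * yi for i, yi in enumerate(y)) - expected_wsum
    if d1 == 0 and d2 == 0:
        return None
    if d1 == 0 or d2 % d1 != 0:
        raise ValueError("more than one entry appears to be corrupted")
    index = d2 // d1 - 1   # an error e at row i gives d1 = e, d2 = (i + 1) * e
    if not 0 <= index < len(y):
        raise ValueError("more than one entry appears to be corrupted")
    y[index] -= d1         # remove the error from the faulty entry
    return index

a = [[1, 2], [3, 4], [5, 6]]
x = [1, 1]
plain, weighted = checksum_rows(a)   # computed once, before the product
y = matvec(a, x)
y[1] += 5                            # simulate a silent error in one entry
print("corrected index:", verify_and_correct(y, plain, weighted, x))
print("repaired result:", y)
```

The two checksums cost one extra pass over the matrix and allow a single corrupted entry to be located and repaired without rolling back to a checkpoint.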

We firmly believe that only the combination of these three thrusts
(new checkpoint protocols, new execution models, and new parallel
algorithms) can solve the exascale resilience problem. We hope to
contribute to the solution of this critical problem by providing the
community with new protocols, models and algorithms, as well as with
a set of freely available public-domain software prototypes.

The RESCUE project team comprises well-recognized scientists with
complementary expertise who are working together for the first time.
In addition, the project is conducted in collaboration with a select
team of US leaders: Marc Snir and Bill Gropp at the University of
Illinois at Urbana-Champaign (Blue Waters project), and Henri Casanova
at the University of Hawaii (models for parallel jobs). The collaboration
with Marc Snir and Bill Gropp is conducted under the auspices of the
INRIA-Illinois Joint Laboratory at Urbana-Champaign, co-headed by
Franck Cappello and Marc Snir. The collaboration with Henri Casanova
takes place within a joint INRIA-NSF team. Because these frameworks were
already in place, we did not go through a formal ANR-NSF agreement.

Project coordination

Yves ROBERT (INRIA - Siège) – Yves.Robert@ens-lyon.fr

The author of this summary is the project coordinator, who is responsible for its content. The ANR declines any responsibility for its contents.

Partner

INRIA - Siège

ANR funding: 503,689 euros
Beginning and duration of the scientific project: - 48 Months
