CE25 - Infrastructures de communication hautes performances (réseau, calcul et stockage), Sciences et technologies logicielles

Energy saving in large scale distributed platforms – Energumen

Submission summary

Energy consumption has always been a concern in High Performance Computing (HPC) platforms. Today it becomes even more critical with the transition to the next generation of extreme-scale platforms and their convergence with Cloud, Big Data and the Internet of Things. The goal is to increase computing performance by a significant factor while staying at the current order of magnitude in terms of energy usage. Reaching this target requires a revolution in the way resource management problems are handled. Reducing energy consumption to obtain radically more Flops per watt than today's systems is THE major challenge. In order to address societal challenges such as health, climate change or security, HPC platforms are evolving toward heterogeneous many-cores (hundreds of cores on the same chip) composed of general-purpose components along with specialized cores. The number of computing units will increase drastically, but the I/O and interconnection networks are evolving much more slowly, while the memory hierarchy will be even deeper than today. In addition, more processing capability will obviously lead to more data being produced, further stressing the interconnects both within and between nodes, which already consume more than 70% of the total power of HPC systems.

The ability to build a physical exascale system is no guarantee of being able to run exascale applications. Efficient tools to exploit such platforms at a sustained rate must also be provided. A key element of application design and management will be to better use memory hierarchies and optimize data movements. To the best of our knowledge, no work has so far been devoted to studying explicit methods of saving energy through enhanced allocations and reduced data movements that exploit knowledge extracted from the applications.
ENERGUMEN will design, study and validate efficient and practical tools for managing the allocation of jobs to the various components of an extreme scale HPC platform.

There exist various mechanisms for reducing energy consumption in large-scale HPC platforms, such as Dynamic Voltage and Frequency Scaling (DVFS) or switching nodes on and off. There are many studies in this direction, but most assume idealized and restricted models. Alternatively, energy savings can be obtained by reducing data movements through communication-aware job allocations that increase the locality of communications. These current approaches are limited since they treat the applications as black boxes and do not consider the influence on the applications themselves.
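To make the DVFS trade-off concrete, a minimal sketch follows. It assumes the textbook convex power model P(s) = s^α with α ≈ 3; all names and numbers are hypothetical illustrations, not the project's own model or code.

```python
# Illustrative sketch of the classical DVFS speed-scaling trade-off.
# Assumption: dynamic power grows as P(s) = s**ALPHA (ALPHA ~ 3 is the
# value commonly used in the speed-scaling literature).

ALPHA = 3.0  # hypothetical exponent of the dynamic-power model

def exec_time(work, speed):
    """Time to process `work` operations at `speed` operations/second."""
    return work / speed

def energy(work, speed, alpha=ALPHA):
    """Energy = P(s) * t = s**alpha * (work / s) = work * s**(alpha - 1)."""
    return work * speed ** (alpha - 1)

def slowest_feasible_speed(work, deadline):
    """With a convex power function, the lowest speed that still meets
    the deadline minimizes energy: run as slowly as the deadline allows."""
    return work / deadline

work, deadline = 100.0, 20.0
s = slowest_feasible_speed(work, deadline)
print(exec_time(work, s))       # 20.0 seconds: deadline met exactly
print(energy(work, s))          # 100 * 5**2  = 2500.0 energy units
print(energy(work, 2 * s))      # 100 * 10**2 = 10000.0: doubling speed
                                # halves the time but quadruples energy
```

The last two lines show why idle slack is valuable: any slack converted into a lower frequency yields a super-linear energy gain under this model.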
In this project, we propose two new complementary mechanisms for addressing the energy/performance trade-off in extreme-scale HPC platforms. First, we will revisit the classical speed-scaling and power-down mechanisms using a malleable model, which allows the execution times of jobs to be shrunk or stretched dynamically according to the current energy profile. Second, we will study optimized policies for energy-aware data allocation at the software level. These mechanisms aim to introduce more flexibility into the management of the heterogeneous resources of extreme-scale HPC platforms. Both involve the design of sophisticated models and methodologies for exploiting idle periods efficiently at run time, and determining the best trade-off between the two mechanisms raises hard scientific and technological challenges. The success of designing adequate abstract models and efficient optimization methods depends heavily on collecting and analyzing the huge amount of data produced in HPC platforms. We also propose to develop and test several software products in actual datacenters.
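The shrink/stretch idea behind malleability can be sketched as follows. This is a deliberately simplified illustration, not the project's scheduler: it assumes perfectly malleable jobs (runtime inversely proportional to the core count) and a power budget expressed directly as a core cap; the job names and numbers are invented.

```python
# Hypothetical sketch of malleable scheduling under a power cap.
# Assumptions (not from ENERGUMEN): jobs are perfectly malleable
# (time = work / cores) and each active core draws one power unit,
# so the power budget translates into a cap on total active cores.

def shrink_to_cap(requests, power_cap):
    """Scale per-job core allocations so that the total fits the cap.

    `requests` maps job name -> desired core count. When the cap is
    low (e.g. an expensive-energy window), every job is shrunk
    proportionally, which stretches its execution time; when power is
    plentiful, requests are granted as-is.
    """
    total = sum(requests.values())
    if total <= power_cap:
        return dict(requests)  # enough power: grant every request
    scale = power_cap / total
    # Round down, but keep every job running on at least one core.
    # (Rounding up to 1 can slightly exceed the cap with many tiny
    # jobs; a real scheduler would rebalance, a sketch need not.)
    return {job: max(1, int(cores * scale)) for job, cores in requests.items()}

requests = {"cfd": 64, "genomics": 32, "climate": 32}   # 128 cores wanted
alloc = shrink_to_cap(requests, power_cap=64)           # low-energy window
print(alloc)  # {'cfd': 32, 'genomics': 16, 'climate': 16}
```

When the energy profile improves, calling the same function with a higher cap lets the jobs stretch back out to their full requests, which is exactly the dynamic shrink/stretch flexibility the malleable model provides.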
In the quest for extreme-scale platforms, where a trade-off between performance and energy consumption is necessary, the originality of ENERGUMEN is to revisit the principles of existing resource managers and to investigate new functionalities by harnessing applications' malleability.

Project coordination

Denis Trystram (Laboratoire d'informatique de Grenoble)

The author of this summary is the project coordinator, who is responsible for its content. The ANR declines any responsibility for its contents.

Partners

IRIT Institut de Recherche en Informatique de Toulouse
LIG Laboratoire d'informatique de Grenoble
LIP6 Laboratoire d'informatique de Paris 6

ANR grant: 534,313 euros
Beginning and duration of the scientific project: October 2018 - 48 months
