Scheduling adaptive HPC jobs to navigate through the energy doldrums – EnergyDoldrums
High-performance computing (HPC) is crucial for advancements in science and engineering, and the demand for it has surged with the rise of artificial intelligence (AI). However, challenges such as escalating electricity costs and the need to reduce carbon emissions pose constraints on computing resources globally. The intermittent nature of renewable energy sources further complicates matters, introducing variability on the scale of hours. Consequently, HPC providers may need to dynamically adjust system capacity based on cost, emission and demand tradeoffs or temporarily restrict resources to meet energy constraints. This adaptability introduces a new dimension to HPC systems, making them malleable – a characteristic previously associated with specific job classes.
Malleability allows jobs to dynamically adjust their resource usage in response to scheduler requests, even under fixed-size capacity. While traditional HPC workloads have been slow to adopt malleability due to limited support, AI training workloads, for which malleability can be easily achieved, provide an opportunity for widespread exploitation of this feature. In addition, AI training jobs may alter their resource requirements as training progresses, which classifies them as evolving. For instance, in computer vision tasks, the ideal batch size may grow as the learning progresses, suggesting the redistribution of resources in favor of jobs beyond their early training stages. The concept of adaptive jobs, encompassing both malleable and evolving jobs, is seen as conducive to the efficient operation of systems with dynamic capacity.
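As a rough illustration of what malleability and evolution mean for a scheduler, the sketch below redistributes a changing number of nodes among jobs that each tolerate a range of node counts. It is a toy model under stated assumptions and not the project's algorithm: the `MalleableJob` class, the `reallocate` policy (minimums first, then round-robin) and the linear growth rule in `preferred_max` are all hypothetical names and choices introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass
class MalleableJob:
    """Hypothetical job model: runs on any node count in [min_nodes, max_nodes]."""
    name: str
    min_nodes: int  # assumed to be at least 1
    max_nodes: int
    nodes: int = 0  # current allocation, set by reallocate()


def reallocate(jobs, capacity):
    """Greedily redistribute `capacity` nodes among malleable jobs.

    Jobs receive their minimum in list order while capacity lasts; a job whose
    minimum no longer fits gets 0 nodes (it is suspended). Leftover nodes are
    then handed out one at a time, round-robin, up to each job's maximum.
    """
    remaining = capacity
    for job in jobs:
        if job.min_nodes <= remaining:
            job.nodes = job.min_nodes
            remaining -= job.min_nodes
        else:
            job.nodes = 0

    progress = True
    while remaining > 0 and progress:
        progress = False
        for job in jobs:
            if remaining > 0 and 0 < job.nodes < job.max_nodes:
                job.nodes += 1
                remaining -= 1
                progress = True
    return {job.name: job.nodes for job in jobs}


def preferred_max(base_nodes, progress, growth=2.0):
    """Evolving-job assumption: the useful node count grows linearly
    with training progress in [0, 1]."""
    return int(base_nodes * (1.0 + growth * progress))


jobs = [
    MalleableJob("A", min_nodes=2, max_nodes=8),
    MalleableJob("B", min_nodes=1, max_nodes=4),
    MalleableJob("C", min_nodes=4, max_nodes=16),
]
full = reallocate(jobs, capacity=16)    # plenty of energy available
reduced = reallocate(jobs, capacity=6)  # capacity restricted by the provider
```

When the capacity drops from 16 to 6 nodes, jobs A and B shrink while job C, whose minimum no longer fits, is suspended. A real policy would also weigh reconfiguration costs and job priorities, which this sketch ignores.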
This project focuses on developing scheduling algorithms for adaptive workloads on systems with variable capacity. The initial step involves a comprehensive formalization of the problem, including system modeling and the definition of objective functions. A multi-criteria approach will be pursued, combining system-oriented metrics such as energy efficiency with user-oriented metrics such as quality of service. The algorithm design will be grounded in theoretical analysis, encompassing complexity analysis, approximation or inapproximability results, and lower or upper bounds. Empirical evaluation will be conducted through simulation using ElastiSim, a simulator designed for adaptive workloads, to be extended to account for systems with variable capacity.
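One possible way to read the multi-criteria approach is as a weighted combination of a system-oriented and a user-oriented metric. The sketch below is only an assumption about how such an objective could be written down: the summary does not define the metrics, so total energy and mean slowdown stand in for "energy efficiency" and "quality of service", and the function names, the weight `alpha` and the reference values used for normalization are all hypothetical. It does not use or represent the ElastiSim API.

```python
def energy_kwh(node_counts, power_per_node_kw, step_hours):
    """Energy consumed over a sequence of time steps, given the number of
    powered nodes in each step (assumes constant power draw per node)."""
    return sum(n * power_per_node_kw * step_hours for n in node_counts)


def mean_slowdown(jobs):
    """Mean slowdown over (wait_time, run_time) pairs, where the slowdown of
    a job is (wait_time + run_time) / run_time. Lower is better."""
    return sum((wait + run) / run for wait, run in jobs) / len(jobs)


def weighted_objective(energy, slowdown, alpha, energy_ref, slowdown_ref):
    """Scalarize the two criteria into one value to minimize.

    Each metric is divided by a reference value so that both are
    dimensionless; alpha in [0, 1] sets the weight on energy.
    """
    return alpha * energy / energy_ref + (1.0 - alpha) * slowdown / slowdown_ref


# Three one-hour steps with 10, 20 and 0 active nodes at 0.5 kW per node.
energy = energy_kwh([10, 20, 0], power_per_node_kw=0.5, step_hours=1.0)
# Two jobs: one starts immediately, one waits as long as it runs.
slowdown = mean_slowdown([(0, 2), (2, 2)])
score = weighted_objective(energy, slowdown, alpha=0.5,
                           energy_ref=30.0, slowdown_ref=3.0)
```

A weighted sum is only one scalarization among several; constraining one metric while optimizing the other, or exploring the Pareto front, are equally plausible formulations.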
To exploit our findings in a realistic scenario, the project will implement a simple overlay resource manager for distributed deep learning. This manager will operate atop standard resource managers, orchestrating single-node jobs and allocating resources to multi-node learning tasks on demand. The primary use case of this project involves developing a scheduling algorithm optimizing the speed and efficiency of distributed deep learning on systems with variable capacity by adjusting the resource sets of individual learning tasks.
Project coordination
Anne BENOÎT (Laboratoire d'Informatique du Parallélisme)
The author of this summary is the project coordinator, who is responsible for its content. The ANR declines any responsibility for its contents.
Partnership
TUDa Technical University of Darmstadt
LIP Laboratoire d'Informatique du Parallélisme
ANR funding: 163,440 euros
Beginning and duration of the scientific project: March 2025, 36 months