Scheduling algorithms and stochastic performance models for workflow applications on dynamic grid platforms – StochaGrid
The major objectives of this project are the following:
- Design a new stochastic model that accurately predicts the performance of workflow applications on grid computing platforms.
- Design new, robust scheduling algorithms for executing such applications, and analytically assess their performance through the new model.
- Build and implement a prototype library to validate both the model and the algorithms experimentally, checking that the former is adequate to evaluate the efficiency of the latter.
Grid computing platforms and components are subject to great variability. Statistical models are mandatory to deal with changes in resource performance, such as CPU speeds or link bandwidths. Traditionally, Markov chains are used to capture the inherent uncertainty linked to parameter estimation. Markov chains are also used to model resource degradation and faults. Our approach will definitely incorporate such components. However, as explained in the detailed project description, Markov chains lack a key feature: because they are memoryless, they cannot accurately model the performance of parallel systems that periodically interact through message exchanges in steady-state mode.
In contrast, sophisticated static scheduling strategies have been developed to map workflow applications onto static grid computing platforms (platforms with no variability). Optimal algorithms have been designed to map simple pipeline skeleton kernels onto heterogeneous clusters and grids. Such applications operate in pipeline mode, and standard objective functions include maximizing the throughput and/or minimizing the response time (latency) of each data set. Interactions between cooperating resources precisely define good mappings and are carefully optimized by the scheduler. A major goal of this project is to provide the statistical performance models that robust scheduling strategies currently lack.
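To make the two static objective functions concrete, here is a minimal sketch of how the period (the inverse of the throughput) and the latency of a pipeline mapping could be computed on a static platform. It is an illustration only, not the project's model: it assumes a one-to-one mapping (stage i on processor i), a single uniform link bandwidth, and no overlap between communication and computation. The names `Stage` and `evaluate_mapping` and all numeric values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    work: float      # computation cost of the stage (hypothetical unit: operations)
    out_size: float  # size of the data sent to the next stage


def evaluate_mapping(stages, speeds, bandwidth, in_size=0.0):
    """Return (period, latency) for a one-to-one mapping: stage i runs on processor i.

    Assumptions (not from the project description): uniform bandwidth and
    no overlap of communication with computation.
    """
    period = 0.0
    recv = in_size / bandwidth   # time to receive the input of the first stage
    latency = recv
    for stage, speed in zip(stages, speeds):
        compute = stage.work / speed
        send = stage.out_size / bandwidth
        # A processor handles one data set per cycle: receive, compute, send.
        period = max(period, recv + compute + send)
        # A single data set traverses every stage in sequence.
        latency += compute + send
        recv = send              # the next stage receives what this one sends
    return period, latency


stages = [Stage(4.0, 2.0), Stage(6.0, 2.0), Stage(2.0, 0.0)]
period, latency = evaluate_mapping(stages, speeds=[2.0, 3.0, 1.0], bandwidth=2.0)
print(period, latency, 1.0 / period)  # prints: 4.0 8.0 0.25
```

In this example the second stage is the bottleneck: it bounds the throughput at one data set every 4 time units, while each data set needs 8 time units to traverse the whole pipeline.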
On the one hand, statistical models are mandatory to account for the variability and dynamic behavior of resources. On the other hand, efficient scheduling algorithms only exist for static, dedicated platforms. We need a new stochastic model able to capture the performance of dynamic parallel systems accurately. This new model will be non-Markovian for system interaction but Markov-based for platform characteristics (fault tolerance and variability). The design and evaluation of this new model will be the first key contribution of the project.
New and robust scheduling algorithms will be designed and evaluated on top of this model, thereby providing the first stochastic testbed for workflow applications expressed in terms of algorithmic skeletons on dynamic grid platforms. We will start with simple algorithmic paradigms, such as pipeline skeletons, because their static behavior is well understood and thus provides a solid theoretical foundation for understanding and evaluating their stochastic performance. The design and stochastic evaluation of multi-criteria (throughput, latency, robustness) workflow scheduling algorithms is the second key contribution of this project.
The third key contribution of the project will be the design of a prototype library for deploying workflow applications on computational grids. This library will build on publicly available components such as GridSolve, APST, and the Edinburgh MPI skeleton library eSkel, and will extensively reuse existing environments. The main idea is to extend current tools with a flexible application descriptor and mapping toolkit, which will serve as a common input to the stochastic model evaluator and to the grid launcher. Mapping procedures will be robust and will dynamically adjust assignment decisions to the current processing and communication capabilities of target resources.
In other words, our library will evaluate the scheduling and mapping solutions both analytically through the new stochastic model and experimentally through the execution of MPI library routines on computational platforms.
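The interplay between Markov-based platform variability and the synchronization imposed by the pipeline can be explored with a toy Monte Carlo simulation. The sketch below is not the stochastic model proposed by the project. It assumes that each processor follows an independent two-state (fast/slow) Markov chain advanced once per data set, that communication costs are negligible, and that buffers between stages are unbounded. The function name `simulate_pipeline` and all parameters are hypothetical.

```python
import random


def simulate_pipeline(base_times, slow_factor, p_degrade, p_recover,
                      n_items, seed=0):
    """Estimate the throughput of a linear pipeline by simulation.

    Each processor follows an independent two-state (fast/slow) Markov
    chain; in the slow state its service time is multiplied by slow_factor.
    """
    rng = random.Random(seed)
    n_stages = len(base_times)
    slow = [False] * n_stages      # every processor starts in the fast state
    done_prev = [0.0] * n_stages   # completion time of the previous data set per stage
    for _ in range(n_items):
        ready = 0.0  # time at which the current data set leaves the previous stage
        for i in range(n_stages):
            # Markov transition of processor i (memoryless platform behavior).
            if slow[i]:
                if rng.random() < p_recover:
                    slow[i] = False
            elif rng.random() < p_degrade:
                slow[i] = True
            service = base_times[i] * (slow_factor if slow[i] else 1.0)
            # Stage i starts once its input has arrived AND the processor is free:
            # this max() is the interaction between stages.
            start = max(ready, done_prev[i])
            done_prev[i] = start + service
            ready = done_prev[i]
    return n_items / done_prev[-1]


static = simulate_pipeline([1.0, 2.0, 1.0], 3.0, 0.0, 1.0, n_items=1000)
dynamic = simulate_pipeline([1.0, 2.0, 1.0], 3.0, 0.1, 0.3, n_items=1000)
print(f"static throughput:  {static:.4f}")
print(f"dynamic throughput: {dynamic:.4f}")
```

With `p_degrade=0.0` the simulation reduces to the static case, where the throughput approaches the inverse of the slowest stage time. Comparing such simulated values with analytical predictions is one way a model evaluator could be checked against execution.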
Project coordination
Yves ROBERT (Other higher-education institution)
The author of this summary is the project coordinator, who is responsible for its content. The ANR declines any responsibility for its contents.
Partner
ANR funding: 118,647 euros
Beginning and duration of the scientific project:
- 36 Months