Performance- and Accuracy-aware Data format Optimization in numerical Codes – PADOC
Our approach was to focus on the optimization of iterative programs. Even if they are very present in numerical programs, few automated tools have been designed to adapt the format of data and/or instructions in iterative routines. Our work was thus carried out in two directions.
(1) The first direction aimed to design a tool to quickly isolate vectorizable blocks, that is, blocks of a program that can benefit from vector instructions, once this program is compiled. Once these vectorizable blocks are isolated, this tool allows for reducing the format of their instructions, increasing consequently the vectorization factor, to thus improve the performances of the application, without accuracy constraint at this step.
(2) The second direction is the main direction of the project. It deals with the adaptation of the format of instructions in iterative programs. This process is based on a first basic block whose objective is to analyze the impact of the transformation of the format of certain instructions on the accuracy of the results of the input program. For this, we have designed a tool in the LLVM compilation infrastructure, which is based in particular on (1) the fp2mp tool which allows to instrument a floating-point program with multiple precision computations using MPFR, and (2) a loop splitter which allows to split a loop into several sub-loops iterating on reduced iteration spaces. The first module allows to evaluate the impact of a transformation of the precision of certain instructions of a program, while the second allows to apply the transformations to certain iterations of a loop, only.
Finally we developed an optimization tool based on the delta-debugging algorithm. The latter allows to isolate, in a set of possible transformations, a maximal subset allowing to guarantee that the accuracy constraint on the final result remains satisfied. This Python tool then allows to choose, for a given iterative program, a format for the instructions of each iteration or block of iterations. This is done by splitting the loop of the program into sub-loops iterating on reduced iteration spaces, and by adapting the format of the instructions for each sub-loop. We illustrated the interest of our approach on the example of matrix multiplication on SIMD architectures.
This project led to the development of three programs.
(1) The first tool, called vectoptim (~4000 lines), was developed in the LLVM compilation infrastructure. It allows for isolating instructions that can benefit from vector units, and then for reducing their format.
(2) The second tool is an infrastructure designed to analyze the accuracy in iterative programs. It consists of fp2mp and a loop splitter (~11000 lines). Developed in LLVM, as well, it allows for analyzing the impact of the transformation of the format of certain data and/or instructions in a program on the accuracy of its results, particularly in the case of iterative programs.
(3) Finally, the third tool uses our analysis infrastructure to adapt the format of instructions in iterative programs. This Python tool (~1000 lines), based on fp2mp, uses the delta-debugging algorithm to decide the instructions of the program whose format can be reduced without too much degrading the accuracy of the result, and thus to adapt the format of certain instructions in iterative programs. Unlike existing adaptation tools, it allows for example to reduce the precision of certain instructions at certain iterations of a loop only, or by block of iterations.
The initial objectives of the project were refocused on the optimization of iterative programs on vector architectures. The prospects of the project are therefore structured around their initial motivations, and in particular two main directions.
(1) The first direction concerns the robustness of programs to input data. Indeed, like most existing tools, the tool designed in this project allows the optimization of instruction formats valid for a given input set of data. A first perspective would be to propose optimizations valid not only for a single set of data, but for several sets of data, or even regardless of the input set of data.
(2) The second direction concerns the scalability. Our tool allows for handling simple computation routines. However, the methods used do not allow for handling full computation programs. A second perspective would be to improve the proposed approach to allow for handling larger numerical programs, such as numerical simulation programs from physics, as initially planned in the project.
These two directions still represent today the two main challenges for data format adaptation tools in numerical computation programs.
The development of efficient and accurate numerical codes is a difficult
task. Indeed it requires to know perfectly the targeted micro-architecture, the
behavior of its computational units and its instruction set. But it also needs
to carefully analyze the accuracy constraints of all the data of the code and
the computing precision in order to reach a given accuracy on the
output. Usually these two criteria are handled in a separated way. And mainly
because of the lack of expertise from the developers, we observe that the
computational data are often over-sized. Therefore this does not enable to take
full advantage from the capabilities of the micro-architecture.
To improve the performance of the applications, and to reduce their execution
time, an approach consists in lowering the precision of certain well-chosen
data. This method enables to exploit the parallelism provided by certain modern
architectures, through the use of vector instructions. However deciding the data
to be tuned without impacting too much the accuracy of the result is a
complicated task, even for experts in numerical software. And this difficulty
increases with the size of the applications.
The objective of the PADOC project consists in designing tools to help
developers in adapting in an automated way the format of the floating-point data
in a numerical code. The interest is to exploit at best the computational units
of the targeted architecture, and thus to improve the performance of the code,
while guaranteeing a given accuracy threshold on the result. Some tools already
exist to address this issue. However, these tools may suffer from the following
drawbacks: First they are dependent of some input data, which are not
representative enough of the data the code will be used with. Hence the
transformations may not remain valid for other data. Second they do not consider
the targeted micro-architecture in the tuning process. This may results in codes
that do not fully benefit from its capabilities. Third they perform optimization
on the whole code, considering all the data even those for which the tuning
process will not be relevant. For a large number of data, the combinatorics is
huge, and this leads to a slowing down of the adaptation time. Therefore our
goal is to design tools to produce codes optimized for a given
micro-architecture, and robust with respect to the input data. But above all, we
aim at speeding up the process by isolating the data the most relevant in the
tuning, and thus reducing the barrier.
The design flow of the PADOC project works on an input code, and it returns an
optimized mixed precision code. In this proposal we aim at working around three
directions. The first direction consists in studying the impact of the data
format adaptation step on the performance of the code. More particularly, the
goal is to provide means to isolate in the code some parts, for which lowering
the precision of the data may lead to a high improvement in terms of
performance. The second direction is to study the impact of this adaptation on
the numerical accuracy of the computed result, and the robustness of the
optimized code with respect to the input data. The goal is here to provide means
to isolate the data of the code, whose accuracy suffer from precision adaptation
in greater or lesser extent. By doing this, we will be able to decide which part
is the more relevant for the tuning process. The goal of the third direction is
to develop the global optimization tool, and to validate it using simulation
applications, like CORSIKA. For this purpose, the main idea is to use the
knowledge in terms of accuracy and performance from previous steps, in order to
discriminate between the different parts of this code. The tuning will be thus
performed on the most relevant parts, consequently reducing the
combinatorics. In this way, we hope to demonstrate a time reduction in the
adaptation of scientific numerical codes.
Project coordination
Guillaume Revy (Laboratoire d'Informatique, de Robotique et de Microélectronique de Montpellier)
The author of this summary is the project coordinator, who is responsible for the content of this summary. The ANR declines any responsibility as for its contents.
Partnership
LIRMM Laboratoire d'Informatique, de Robotique et de Microélectronique de Montpellier
Help of the ANR 138,456 euros
Beginning and duration of the scientific project:
February 2019
- 48 Months