Heuristics for Heterogeneous Memory – H2M
In order to achieve the project’s goals, H2M consists of five work packages (WP) carried out by both partners. Both partners are involved in each work package to lay the foundation for the work, but individual tasks may be carried out by a single partner. WP1 will provide a foundation for follow-up work in the other work packages by quantifying the use of heterogeneous memory in scientific-technical applications. Based on this, WP2 will develop abstractions that let programmers use heterogeneous memory in programming systems, while WP3 will develop runtime techniques to exploit these abstractions by mapping memory requests to suitable kinds of memory. WP4 will formulate these concepts as constructs and APIs that extend existing programming systems. The development of prototype implementations of the concepts and methods developed in the project is mainly part of WP5, complemented by the intent to propose project results to standards and APIs to achieve their long-term sustainability.

The design of the work packages is motivated by considering a scientific-technical application to consist of (at least) five components, as displayed in Figure 1, which underlines the need to consider heterogeneous memory in a holistic approach. The application sits on top and is written using a parallel programming system, for example MPI, OpenMP, or any other parallel API. The compiler translates the application, possibly equipped with constructs or directives from the programming system, to machine and device code and links it to the runtime system, which may provide additional APIs. Both the compiler and the runtime employ performance models to optimize the machine and device code for the target architecture and to make decisions at runtime. Finally, the application is executed by the operating system.
In H2M, we will perform research on the exploitation of heterogeneous memory on the levels of the programming system, the runtime and the performance model. In order to prove the implementability, effectiveness and efficiency of our results, we will develop some prototype implementations. These will include:
1. Constructs and directives to abstract memory allocation with type and memory access information. These will be designed in the spirit of OpenMP extensions (hence denoted as OMP-X) and can be implemented by a source-to-source approach based on a modified and extended runtime (see item 3), similar to how we implemented task affinity.
2. APIs to query the architecture and properties of the heterogeneous memory subsystem, as well as information about how the application makes use of that memory. These will similarly be implemented in hwloc.
3. An extended runtime, based on the open-source OpenMP runtime used by the LLVM project, that integrates the information from the memory abstractions and the APIs and thus provides the implementation of the heuristics.
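To illustrate how such an extended runtime might consume allocation hints, the following self-contained C sketch routes a hinted allocation to a fast memory kind when capacity allows and falls back otherwise. All names (`h2m_alloc`, `alloc_hint_t`, the capacity numbers) are hypothetical and only model the heuristic idea; the project's actual runtime is based on the LLVM OpenMP runtime and is not reproduced here.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical memory kinds an extended runtime might distinguish. */
typedef enum { KIND_FAST_SMALL, KIND_SLOW_LARGE } mem_kind_t;

/* Hypothetical allocation hint carrying placement/access information. */
typedef struct {
    mem_kind_t preferred;   /* where the application would like the data */
    size_t     size;        /* requested allocation size in bytes        */
} alloc_hint_t;

/* Toy capacity tracking: a small fast pool backed by a large slow one. */
static size_t fast_capacity = 1024;  /* bytes left in the fast kind */

/* The heuristic: honor the hint if capacity allows, else fall back. */
static void *h2m_alloc(const alloc_hint_t *hint, mem_kind_t *placed) {
    if (hint->preferred == KIND_FAST_SMALL && hint->size <= fast_capacity) {
        fast_capacity -= hint->size;
        *placed = KIND_FAST_SMALL;
    } else {
        *placed = KIND_SLOW_LARGE;  /* fallback: capacity exhausted */
    }
    return malloc(hint->size);      /* a real runtime would bind to the kind */
}
```

A source-to-source approach, as mentioned for the OMP-X constructs, could lower directives to calls of this shape.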
We presented an extended methodology to automatically optimize initial and phase-based data placement on systems with heterogeneous memory. Our approach is based on abstracting or intercepting regular memory allocations so that additional requirements or hints (such as placement decisions) can be provided and exploited by the runtime system to select a suitable storage location. We presented a data-driven workflow that combines memory access profiling of existing applications with a placement optimizer. We formalized the data placement optimization problems for initial and phase-based data placement, enabling metrics and heuristics based on the collected profiling data that consider individual application execution phases. Finally, we evaluated our approaches with six different applications on two recent heterogeneous memory systems, an Intel Ice Lake system (DRAM+NVM) and an Intel Sapphire Rapids system (HBM+DRAM), and demonstrated that our heuristics can efficiently manage data placement on such architectures, outperforming a first-come-first-served approach in most instances. Our IDP-RT-LT strategy presents a viable and stable solution for most codes, while dynamic data migration at run time with PBDP-RT can provide further speedups for applications with coarse-grained execution phases.
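The initial-placement idea can be sketched as a greedy heuristic: rank buffers by profiled access density (accesses per byte) and fill the fast memory until its capacity is exhausted, in contrast to a first-come-first-served policy that fills it in allocation order. This is a simplified illustration, not the paper's actual IDP-RT-LT formulation; the struct and function names are invented for the sketch.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t size;      /* bytes, from the allocation */
    double accesses;  /* access count, from profiling */
    int    in_fast;   /* placement decision: 1 = fast memory */
} buffer_t;

/* Greedy initial placement: repeatedly pick the unplaced buffer with the
 * highest access density that still fits into the remaining fast capacity. */
static void place_greedy(buffer_t *b, int n, size_t fast_cap) {
    for (int i = 0; i < n; i++) b[i].in_fast = 0;
    for (int k = 0; k < n; k++) {
        int best = -1;
        double best_d = -1.0;
        for (int i = 0; i < n; i++) {
            if (b[i].in_fast || b[i].size > fast_cap) continue;
            double d = b[i].accesses / (double)b[i].size;
            if (d > best_d) { best_d = d; best = i; }
        }
        if (best < 0) break;     /* nothing else fits */
        b[best].in_fast = 1;
        fast_cap -= b[best].size;
    }
}
```

A first-come-first-served baseline would instead place buffers in index order until the fast memory is full, which can waste fast capacity on rarely accessed data.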
We also develop ways to discover and expose memory heterogeneity. Our approach focuses on first identifying the existing memory kinds in the platform and then exposing their abilities through a number of convenient attributes, such as bandwidth, latency, and capacity. This identification and characterization step was missing in existing approaches, which hindered productivity and portability by requiring users to benchmark every new platform and guess its memory organization. We implemented these ideas in hwloc, the de facto standard tool for managing hardware topology in HPC software. We then showed that implementing a heterogeneous memory allocator enables applications to easily specify their needs for each allocation. This work focuses on productivity rather than on performance: performance will not be superior to manual tuning of each allocation, but we expect much better portability.
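The attribute-based selection idea can be illustrated with a self-contained sketch: given a list of memory targets annotated with bandwidth, latency, and capacity, pick the highest-bandwidth target that can hold the allocation. hwloc's actual memory-attributes API (available since hwloc 2.3) works on topology objects and differs from this toy struct, which is invented purely for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a memory kind exposed with its attributes. */
typedef struct {
    const char *name;      /* e.g. "DRAM", "HBM" */
    double bandwidth_gbs;  /* bandwidth attribute */
    double latency_ns;     /* latency attribute   */
    size_t capacity_mib;   /* capacity attribute  */
} mem_target_t;

/* Pick the target with the highest bandwidth that can hold `size_mib`.
 * Returns the index of the chosen target, or -1 if nothing fits. */
static int best_bandwidth_target(const mem_target_t *t, int n,
                                 size_t size_mib) {
    int best = -1;
    for (int i = 0; i < n; i++) {
        if (t[i].capacity_mib < size_mib) continue;
        if (best < 0 || t[i].bandwidth_gbs > t[best].bandwidth_gbs)
            best = i;
    }
    return best;
}
```

A latency-sensitive allocation would use the same loop with the latency attribute minimized instead; exposing the attributes once per platform is what removes the need to re-benchmark every new machine.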
We also presented an extended survey of ways to emulate heterogeneous memory in order to pave the way for future memory architectures. Performance emulation consists of modifying the performance of some memory accesses so that the application behaves as on a real heterogeneous system. This is useful for identifying which data buffers and compute kernels are sensitive to memory bandwidth and latency. Environment emulation consists of exposing information about heterogeneous memory to the runtime even if performance is not heterogeneous. This is useful for verifying that the code is able to identify different kinds of memory such as HBM, DRAM, NVDIMMs or CXL, and allocate its sensitive buffers on the right one.
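As a minimal illustration of the performance-emulation idea, the sketch below wraps loads in a function that busy-waits for a configurable number of iterations, mimicking a slower memory kind without changing program results. Real emulators surveyed in this work throttle bandwidth or inject latency at the hardware or system level; this software busy-wait, and all names in it, are hypothetical.

```c
#include <assert.h>

/* Performance emulation sketch: add an artificial per-access delay to
 * mimic a slower memory kind. Results are unchanged; only timing differs. */
static volatile unsigned long emu_spin;   /* volatile defeats optimization */
static unsigned long emu_delay_iters = 0; /* 0 = fast kind, >0 = slow kind */

static long emulated_load(const long *addr) {
    for (unsigned long i = 0; i < emu_delay_iters; i++) emu_spin++;
    return *addr;
}

/* A kernel instrumented with emulated loads: correctness is preserved,
 * so slowdowns observed when raising the delay isolate memory sensitivity. */
static long sum_array(const long *a, int n) {
    long s = 0;
    for (int i = 0; i < n; i++) s += emulated_load(&a[i]);
    return s;
}
```

Comparing run times of the same kernel at different delay settings indicates how sensitive it is to the emulated memory's speed.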
High-performance computing (HPC) is crucial to advance computational science and engineering. In all modern computing systems, the performance gap between compute and memory continues to widen, particularly in the face of multi-core and accelerated systems. In consequence, the memory subsystem is changing: the evolution of the cache hierarchy is followed by new technologies introducing new kinds of memory. In the context of HPC, this has been pioneered by combining traditional main memory with a small fraction of high-bandwidth memory. In systems with accelerators, like GPUs, the heterogeneity in kinds of memory is already higher. Currently, applications have to be heavily modified for specific target platforms and have to employ vendor-specific APIs to exploit heterogeneous memory.
There is a critical need to develop a portable, vendor-neutral view of heterogeneous memory to enable its productive use in scientific and technical applications. This has to come in the form of a hierarchy of abstractions that copes with the variety of existing hardware and enables runtime heuristics to select from the kinds of memory available at runtime. At the moment it is unclear what these abstractions and heuristics should look like, and several fundamental questions have to be answered.
H2M’s research results will define a concrete development roadmap for parallel programming systems. The project will develop a hierarchy of programming abstractions to expose heterogeneous memory at different levels of detail and control, complemented by a set of required, vendor-neutral capabilities to be provided by standards and intelligent runtime systems.
Intelligent runtime systems have to employ various strategies based on these abstractions to place data. H2M will develop runtime heuristics to exploit heterogeneous memory for dynamic, abstract data structures. A memory performance model will help to decide when to bind threads first and place data later, and when to place data first and bind threads accordingly. This will be implemented in heuristics to select a suitable kind of memory based on application needs. Furthermore, H2M will define the deciding factors for whether and when to move allocated application data from one kind or bank of memory to another. Both decisions will be made considering the trade-off between performance and capacity.
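One way such a migration decision could be modeled is a simple cost-benefit comparison: migrate only if the predicted time saved over the buffer's remaining accesses outweighs the cost of copying it. The cost model below is a deliberately simple, hypothetical sketch, not the performance model H2M will develop.

```c
#include <assert.h>

/* Hypothetical inputs to a migration decision: expected future accesses
 * (e.g. from profiling), predicted per-access times on each memory kind,
 * and the one-time cost of copying the buffer. */
typedef struct {
    double remaining_accesses;
    double t_fast, t_slow;      /* predicted seconds per access */
    double migration_cost;      /* seconds to copy the buffer   */
} migration_query_t;

/* Migrate iff the predicted saving exceeds the migration cost. */
static int should_migrate(const migration_query_t *q) {
    double saved = q->remaining_accesses * (q->t_slow - q->t_fast);
    return saved > q->migration_cost;
}
```

The performance-versus-capacity trade-off mentioned above would enter such a model as an additional constraint on where the buffer is allowed to land.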
Based on the performance research, at the end of the project H2M will develop concrete proposals to serve as a basis for submissions to standardization committees.
H2M combines the Inria team’s expertise in exposing low-level runtime functionality and the RWTH group’s ability to leverage this functionality to develop abstractions for HPC programming. The result of this joint work will be a deep understanding of how heterogeneous memory systems have to be programmed, a hierarchy of programming abstractions, and a set of heuristics for use in intelligent runtime systems to serve applications optimized for performance and scalability.
Project coordination
Brice GOGLIN (Centre de Recherche Inria Bordeaux - Sud-Ouest)
The author of this summary is the project coordinator, who is responsible for the content of this summary. The ANR declines any responsibility as to its contents.
Partnership
INRIA Centre de Recherche Inria Bordeaux - Sud-Ouest
RWTH Aachen University
ANR funding: 180,360 euros
Beginning and duration of the scientific project: December 2020 – 36 months