Notifying Memories for Dynamic Data-Flow Application – Nooman
In the current computer architecture, memory is a servile component subject to processor requests, which imposes constant and energy-consuming transfers. However, many of these requests are unnecessary. My goal is to bring intelligence to the memory so as to eliminate these useless requests and halve the energy consumed, which will have a major impact in the context of a growing number of digital devices and the goal of a carbon-neutral society.
No interrupts, no polling: synchronisation in a relaxed way
In systems on a chip, memory has long taken on a role it was not designed for: ensuring synchronization between tasks, whether in the single-processor context of the past or the multi-processor context of today, by storing the variables needed to implement synchronization primitives. There are two main families of synchronization: 1) interrupts and 2) polling. An interrupt causes a context switch on the processor, which is very expensive from a memory point of view. Polling consists of continuously checking the state of a variable until it reaches an expected value, which unnecessarily occupies the processor and, just as unnecessarily, generates useless memory accesses. The Nooman project proposes a new synchronization family, called notification, which is triggered by the memory, like an interrupt, but does not interrupt the processor: the processor simply polls its local notification table when it is available for another task, without issuing requests to the memory as in classic polling. This technique allows flexible synchronization of tasks.
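The contrast between the two classic families and notification can be sketched as follows. This is a hypothetical illustration only: the class names (SharedMemory, NotificationTable) and the request-counting logic are ours, not Nooman's actual hardware interface.

```python
# Illustrative sketch: classic polling issues a shared-memory request on
# every check, while notification polling reads a processor-local table.

class SharedMemory:
    """Shared memory that counts every request made to it."""
    def __init__(self):
        self.flag = 0
        self.requests = 0

    def read_flag(self):
        self.requests += 1          # each poll costs one memory access
        return self.flag

def classic_poll(mem, checks_before_ready):
    """Classic polling: repeated requests until the value is ready."""
    for _ in range(checks_before_ready):
        mem.read_flag()             # useless request: value not ready yet
    mem.flag = 1                    # another task finally writes the value
    return mem.read_flag() == 1     # final, useful request

class NotificationTable:
    """Processor-local table written by the notifying memory."""
    def __init__(self, slots=4):
        self.slots = [False] * slots

    def notify(self, slot):         # initiated by the memory, like an interrupt
        self.slots[slot] = True

    def check(self, slot):          # local read: no memory request, no NoC traffic
        return self.slots[slot]
```

With three failed checks, classic polling costs four shared-memory requests; checking the notification table costs none, and the processor consults it only when it is free for another task.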
The new synchronization technique relies on new hardware that cannot, within the scope of the project, be fabricated in a real chip. The approach must therefore be prototyped virtually, using functional simulation tools at the system level. Since the goal is to evaluate the benefit for a system with several processors, the Sniper tool offers the right properties to carry out the experiments. A model of “notifying memories” has been implemented in the simulation tool. On the application side, the input specification is an application that follows the so-called data-flow model of computation, which has the advantage of explicitly expressing the parallelism of the application as well as the synchronization points between the different tasks (called “actors” in this approach). Code can be generated for different hardware targets from a single data-flow application. In our case, the generated application makes use of the new synchronization mechanisms added to the platform modelled in Sniper.
The results show that, even for static applications, where synchronisation points can be optimised at compile time, our technique improves execution time by 20% and saves up to 15% of energy consumption, at a negligible hardware cost (less than 1%). For reconfigurable data-flow applications, results are so far available for a single application only; they show similar improvements in execution time and energy, but above all they show that our technique helps the platform scale with the number of processors. Our approach works both for high-performance platforms, such as servers, and for processors designed for embedded systems.
The mapping of data in memory remains to be studied. The results obtained for static data-flow applications functionally validate the approach and the idea. Better gains are expected for more dynamic applications.
Studies of the memory behaviour of data-flow applications have led to two publications, one in a conference and the other in an international journal. The article describing the original synchronization technique is under major revision for an international journal.
The era of many-cores is now wide open. These architectures, which combine many cores, should enable continued performance scaling, for both embedded systems and high-performance computing. However, enabling the scaling of smaller systems requires significant research breakthroughs in three key areas: power efficiency, programmability, and execution granularity. Improved technology alone will not be sufficient; improvements in architecture and systems are also needed. The heritage of Von Neumann architectures, where computation units are separated from the memories, hits the well-known memory wall. The use of networks-on-chip (NoC) increases the bandwidth, but at a very high energy cost, and at the same time increases the latency to the data, leading to longer application execution times. The concept of processing in memory, which brings computation close to memory, is gaining renewed interest. This project goes further with the concept of notifying memories, which provides hardware notification capabilities to the memories. The approach breaks the classical architectural organisation, as memories can act as masters and initiate transactions on the NoC. The hardware component sits close to the memory, is programmable so as to fit application demands, and directly sends the information or the data to the processor through notifications. Besides, most applications are developed with programming languages not suited to parallel architectures. Data-flow programming is a paradigm that lets the developer explicitly specify both temporal and spatial parallelism of the application, while completely abstracting the underlying target architecture. A data-flow application is a network of actors, each in charge of part of the computation, that communicate through unbounded FIFOs to transfer the data.
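A minimal sketch of such an actor network may help fix the vocabulary. The Fifo class and the producer/doubler actors below are illustrative, not taken from any specific data-flow framework; in practice the FIFOs are conceptually unbounded and the actors run concurrently.

```python
# Illustrative data-flow network: actors exchange tokens through FIFOs,
# and an actor fires only when its firing rule (enough input tokens) holds.
from collections import deque

class Fifo:
    """FIFO channel between two actors (unbounded in the model)."""
    def __init__(self):
        self.q = deque()
    def push(self, token):
        self.q.append(token)
    def pop(self):
        return self.q.popleft()
    def count(self):
        return len(self.q)

def producer(out_fifo, values):
    """Source actor: emits one token per value."""
    for v in values:
        out_fifo.push(v)

def doubler(in_fifo, out_fifo):
    """Actor whose firing rule is: at least one token on its input."""
    fired = 0
    while in_fifo.count() >= 1:     # firing-rule check
        out_fifo.push(2 * in_fifo.pop())
        fired += 1
    return fired
```

Here the parallelism is explicit: producer and doubler are independent actors, and the only synchronization points are the FIFO accesses between them.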
This project considers so-called dynamic data-flow applications, since their expressiveness allows data-dependent applications to be specified (like the HEVC codec, which is nearly impossible to specify with static models). The main drawback of dynamic applications is the need to check the firing rules of the actors at runtime, which leads to many memory requests, a large share of which are useless. The concept of notifying memories eliminates these useless memory requests, thus reducing NoC traffic and energy consumption, improving application performance, and leaving NoC bandwidth free for useful communication. This concept deeply changes the hardware and must be studied jointly with the compilation chain to make it imperceptible to the application developer. This notification concept does not exist in any related work and we are the first to publish on the topic. Our promising results deserve a more thorough study. The goal is to keep our head start on this idea and to reinforce our results through the development of this new architecture and its associated compilation chain.
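The cost of runtime firing-rule checks, and what notification removes, can be sketched as follows. This is a hypothetical model under our own assumptions: failed checks are counted as useless memory requests, and the notifying memory is assumed to evaluate the rule next to the data and emit a single notification when it becomes true.

```python
# Illustrative cost model: a scheduler that polls a FIFO fill count in
# shared memory versus a memory that tracks the firing rule itself.

class CountingFifo:
    """FIFO whose fill count lives in shared memory; reads are counted."""
    def __init__(self):
        self.tokens = 0
        self.count_reads = 0        # memory requests for the fill count
    def count(self):
        self.count_reads += 1
        return self.tokens
    def push(self):
        self.tokens += 1

def runtime_check_schedule(fifo, needed, arrivals):
    """Classic runtime checking: poll the count before each token arrival."""
    useless = 0
    for _ in arrivals:
        if fifo.count() < needed:   # failed firing-rule check = useless request
            useless += 1
        fifo.push()
    fired = fifo.count() >= needed  # final, successful check
    return fired, useless

def notifying_schedule(needed, arrivals):
    """Notifying memory: the rule is evaluated memory-side; the processor
    issues zero count reads and receives one notification."""
    tokens = 0
    notified = False
    for _ in arrivals:
        tokens += 1                 # memory-side bookkeeping, not a NoC request
        if tokens >= needed and not notified:
            notified = True         # single notification sent to the processor
    return notified
```

With three token arrivals and a firing rule needing three tokens, the classic scheme issues four count reads (three of them useless), while the notifying scheme issues none and sends a single notification.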
Project coordination
Kevin Martin (Laboratoire des Sciences et Techniques de l'Information, de la Communication et de la Connaissance)
The author of this summary is the project coordinator, who is responsible for the content of this summary. The ANR declines any responsibility as to its contents.
Partnership
Lab-STICC - U-Bretagne Sud Laboratoire des Sciences et Techniques de l'Information, de la Communication et de la Connaissance
ANR grant: 272,480 euros
Beginning and duration of the scientific project:
- 42 Months