Speeding up parallel programming with broadcast communications based on a hybrid wireless/wired network
The de facto way of programming multi/manycore chips assumes that memory is shared, and the hardware support for this is cache coherence throughout the memory hierarchy. The question of how to scale the coherence protocol is recurrent, because the protocol requires broadcasting coherence messages to all the caches, or multicasting them to an identified subset of the caches. Similarly, collective synchronizations such as barriers or condition signaling hardly scale. By nature, radio communications provide broadcast capabilities and negligible latency; they thus have the potential to disseminate information very quickly at the scale of a circuit, and so to open a way to solving these issues.
The typical architecture of a manycore using the RAKES project results is composed of n x m clusters of different types connected through a NoC whose routers are also equipped with an RF transmitter and receiver. Clusters consist of p processors and a local L2 cache, and may also include a portion of the distributed last-level cache (LLC) or a DRAM controller.
Wireless links in NoCs have emerged as a solution to reduce the latency of multi-hop paths. In RAKES, we aim to solve the current challenges that impede the exploration of the promised lands expected by the parallel computing community, namely the use of broadcast capabilities for cache coherence protocols and parallel programming mechanisms. The availability of broadcast is a key feature of the project that will allow scientific breakthroughs compared to current solutions.
Wireless transceivers are not free: we believe that the clustered approach is the way to go to benefit best from the local electronic interconnect and the global wireless communication.
In RAKES, we investigate the benefit of broadcast and multicast communication based on a hybrid wireless/wired NoC. The availability of efficient broadcast/multicast opens impressive perspectives in terms of latency reduction for both the shared-memory and message-passing models of computation. In the project we mainly address the optimization of the communication overhead that impacts the performance of cache coherence protocols. The first strong point of our proposal is our approach, since we consider a joint exploration and design of the cache coherence protocol and the multicast NoC. It will also contribute to the improvement of synchronization mechanisms. The second key point is the use of signal processing techniques that take into account realistic radio-channel models and take advantage of low-power implementations.
Conventional NoC architectures support broadcast operations in the form of multiple unicast transmissions, which results in significant system performance penalties concerning network latency and energy consumption overhead.
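As a rough illustration of this penalty, the sketch below counts messages and link traversals when a broadcast is emulated with unicasts. It assumes a 2D mesh with minimal routing and uses hop counts as a crude proxy for latency and energy; the mesh topology, the function names and the single-transmission wireless model are illustrative assumptions, not results or design choices taken from the project.

```python
def unicast_broadcast_cost(n, m, src):
    """Cost of emulating a broadcast with one unicast per destination.

    Assumes an n x m 2D mesh with minimal routing, so a unicast from
    src = (sx, sy) to (x, y) crosses |x - sx| + |y - sy| links
    (a hypothetical model chosen for illustration).
    Returns (messages, total link traversals, hops to farthest node).
    """
    sx, sy = src
    hops = [abs(x - sx) + abs(y - sy)
            for x in range(n) for y in range(m)
            if (x, y) != src]
    return len(hops), sum(hops), max(hops, default=0)


def wireless_broadcast_cost():
    """Idealized wireless broadcast: one transmission heard by every router."""
    return 1, 0, 1  # one message, no wired links crossed, one wireless hop


messages, traversals, farthest = unicast_broadcast_cost(4, 4, (0, 0))
print(messages, traversals, farthest)  # 15 48 6
```

Under these assumptions, a corner node of a 4 x 4 mesh needs 15 separate messages crossing 48 links in total, whereas the idealized wireless model needs a single transmission. Real figures depend on the routing algorithm, on contention and on the radio channel, none of which this sketch models.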
The wireless broadcasting capabilities are very promising. However, one-to-many communication (needed, for example, when an upper-level cache sends information to the caches that belong to its list of sharers) is not the only requirement: many-to-one communication (needed when several lower-level caches concurrently try to access an upper-level cache) and many-to-many communication (needed when, in the same situation, the upper-level cache is physically distributed, in which case the communications must reach all the memory cuts that constitute this cache) must also be supported.
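The three patterns named above can be told apart simply by counting distinct senders and receivers. The following minimal sketch does this; the `Pattern` enum, the `classify` function and the cache names are hypothetical and serve only to make the examples from the text concrete.

```python
from enum import Enum


class Pattern(Enum):
    ONE_TO_ONE = "one-to-one"
    ONE_TO_MANY = "one-to-many"
    MANY_TO_ONE = "many-to-one"
    MANY_TO_MANY = "many-to-many"


def classify(senders, receivers):
    """Classify a communication by its number of distinct senders and receivers."""
    s, r = len(set(senders)), len(set(receivers))
    if s == 0 or r == 0:
        raise ValueError("need at least one sender and one receiver")
    if s == 1:
        return Pattern.ONE_TO_ONE if r == 1 else Pattern.ONE_TO_MANY
    return Pattern.MANY_TO_ONE if r == 1 else Pattern.MANY_TO_MANY


# An upper-level cache notifies the caches on its list of sharers.
print(classify(["L2"], ["L1_0", "L1_3", "L1_5"]).value)
# Several lower-level caches concurrently access one upper-level cache.
print(classify(["L1_0", "L1_1"], ["L2"]).value)
# Same situation, but the upper-level cache is physically distributed.
print(classify(["L1_0", "L1_1"], ["L2_cut0", "L2_cut1"]).value)
```

The first call covers the case that a plain broadcast medium serves directly; the other two are the cases where several transmitters compete for the medium at the same time.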
Currently, we have put in place the simulation infrastructures necessary to gather memory access traces. For some simulation technologies (spider, gem5), the traces include the cache coherence messages, while for others (qemu), this is still ongoing work (in particular, doing it in a scalable manner).
We have also progressed regarding the modeling and implementation of on-chip wireless links. More specifically, we have designed and implemented a wireless transceiver showcasing significant gains in energy compared to electronic links in multicast and broadcast situations.
Demonstration of significant energy gains by using wireless communication
The follow-up work will focus on high-level modeling of wireline and wireless exchanges, in order to inject this information into the cache models, with the aim of evolving the protocols and evaluating their performance in a realistic and convincing way.
A. Faravelon, O. Gruber, F. Pétrot. Removing Load/Store Helpers in Dynamic Binary Translation. Multi-Processor System-on-Chip 1: Architectures, John Wiley & Sons, Inc. pp. 133-160, Chapter 7, 2021.
Sosa, J., Sentieys, O. and Roland, C., Adaptive transceiver for wireless NoC to enhance multicast/unicast communication scenarios. In IEEE Computer Society Annual Symposium on VLSI (ISVLSI) (pp. 592-597), 2019.
Sosa, J., Sentieys, O., Roland, C. and Killian, C., Multi-carrier spread-spectrum transceiver for WiNoC. In Proceedings of the 13th IEEE/ACM International Symposium on Networks-on-Chip, 2019.
Sosa, J., Killian, C., Ammar, H. and Chillet, D., Min/max time limits and energy penalty of communication scheduling in ring-based ONoC. In NoCArc 2020.
N. Chatterjee and K. Martin, Broadcast Communication in Wireless Network-on-Chip, Colloque du GdR SOC2, June 2021.
Efficient exploitation of multi-core architectures by software developers is tricky, especially when the specificity of the machine is visible to the application software. To limit dependencies on the architecture, the generally accepted vision of parallelism assumes a coherent shared memory and a few synchronization primitives, either point-to-point or collective. However, this requires sharing information between many, if not all, of the nodes.
Unfortunately, as soon as the number of cores is around ten, communication can no longer occur on a shared medium, and designs make use of bus hierarchies or Networks on Chip. The latter solution is clean and efficient, but each core sees only the communications it is the target of and, unlike on a shared bus, cannot snoop on what is going on between other cores. This makes implementing cache coherence and collective synchronizations particularly difficult, and a possible solution to overcome this issue is to use on-chip radio communications.
By nature, radio communications provide broadcast capabilities at negligible latency; they thus have the potential to disseminate information very quickly at the scale of a circuit, and so to open a way to solving these issues.
In the RAKES project, we intend to study how RF communication can solve the scalability problems mentioned above for architectures with a large number of cores (>256), by using a mixed wired/RF NoC. We plan to study several alternatives and to provide (a) a virtual platform for evaluating the solutions and (b) an actual implementation.
Mr. Frederic Petrot (Techniques de l'Informatique et de la Microélectronique pour l'Architecture des systèmes intégrés)
The author of this summary is the project coordinator, who is responsible for the content of this summary. The ANR declines any responsibility for its contents.
Inria Rennes - Bretagne Atlantique Centre de Recherche Inria Rennes - Bretagne Atlantique
LAB-STICC Laboratoire des Sciences et Techniques de l'Information, de la Communication et de la Connaissance
Grenoble INP/TIMA Techniques de l'Informatique et de la Microélectronique pour l'Architecture des systèmes intégrés
ANR funding: 617,027 euros
Beginning and duration of the scientific project: May 2019 - 48 Months