CE23 - Intelligence artificielle

Planning and Learning to Act in Systems of Multiple Agents – plasma

Planning and Learning to Act in Systems of Multiple Agents

The main idea presented in this proposal is that it is possible to reduce a multi-agent decision-making problem (such as a partially observable stochastic game) to a fully observable stochastic game, which is solved using generic algorithms based on recent advances in Artificial Intelligence. We focus on generic methods with theoretical guarantees since they are of particular interest in safety-critical systems that can affect human lives.



- Dynamic programming
- Reinforcement Learning
- Deep Reinforcement Learning

1. On continuous-state MDPs w/ hierarchical information
2. Solving Dec-POMDPs as Sequential-Move Continuous-State Multi-Agent MDPs
3. On Lipschitz-continuity of e-optimal value-fn. of zs-POSGs
4. On Lipschitz + convex-concave structure of e-optimal value-fn. for zs-POSGs
5. Planning algorithm for zs-SGs
6. Deep RL algorithm for MAS
7. Planning algorithm for Dec-POMDPs as MILPs
8. SDM'Studio C++ API

- Novel theory for solving general zs-POSGs
- New paradigm: sequential-move central planning for simultaneous-move decentralized execution
- A C++ API for solving POSGs

[1] Y. Xie, J. Dibangoye and O. Buffet. Optimally Solving Two-Agent Decentralized POMDPs Under One-Sided Information Sharing. In: Proceedings of the 37th Inter- national Conference on Machine Learning. Ed. by H. D. III and A. Singh. Vol. 119. Proceedings of Machine Learning Research. PMLR, 2020, pp. 10473–10482.

[2] O. Buffet, J. Dibangoye, A. Delage, A. Saffidine and V. Thomas. On Bellman’s Optimality Principle for zs- POSGs. CoRR abs/2006.16395 (2020).

[3] O. Buffet, J. S. Dibangoye, A. Delage and V. Thomas. Sur le principe d’optimalité de Bellman pour les zs- POSG. In: Actes Journées Francophones sur la Plani- fication, la Décision et l’Apprentissage pour la conduite de systèmes (JFPDA 2020). 2020.

[4] A. Delage, O. Buffet and J. S. Dibangoye. HSVI pour zs-POSG usant de propriétés de convexité, concavité, et Lipschitz-continuité. In: Actes Journées Francophones sur la Planification, la Décision et l’Apprentissage pour la conduite de systèmes (JFPDA 2020). 2020.

[5] A. Delage, O. Buffet and J. Dibangoye. HSVI fo zs- POSGs using Concavity, Convexity and Lipschitz Prop- erties. arXiv (2021).

[6] O. Buffet, J. Dibangoye, A. Saffidine and V. Thomas. Heuristic Search Value Iteration for Zero-Sum Stochas- tic Games. IEEE Transactions on Games 13.3 (2021), pp. 239–248.

[7] G. Bono, J. Dibangoye, O. Simonin, L. Matignon and F. Pereyron. Solving Multi-Agent Routing Problems Us- ing Deep Attention Mechanisms. IEEE Transactions on Intelligent Transportation Systems (2020), pp. 1–10.

[8] J. S. Dibangoye, O. Buffet and A. Kumar. Multi- agent Planning and Learning As MILP. In: Actes Journées Francophones sur la Planification, la Décision et l’Apprentissage pour la conduite de systèmes (JFPDA 2020). 2020.

The holy grail of Artificial Intelligence (AI)---creating an agent (e.g., software or machine) that comes close to mimicking and (possibly) exceeding human intelligence---remains far off. But past years have seen breakthroughs in agents that can gain abilities from experience with the environment: providing significant advances in the society and the industries including health care, autonomous driving, recommender systems; and ultimately influencing many if not all aspects of everyday life. These advances are partly due to single-agent Deep Learning (DL) along with RL and Monte-Carlo Tree Search (MCTS), i.e., AI research subfields in which the agent can describe its world as a Markov decision process. Some stand-alone planning and RL algorithms are guaranteed to converge to the optimal behavior, as long as the environment, the agent is experiencing, is Markovian and stationary, but scalability remains a significant issue. DL along with RL and MCTS methods have emerged as a powerful combination to break the curse of dimensionality in the face of very large-scale domains at the expend of astronomical data and computational resources, but so far their applicability is mainly restricted to either single-agent domains or sequential games.

Today, real-life applications widely use MASs, that is, groups of autonomous, interacting agents sharing a common environment, which they perceive through sensors and upon which they act with actuators. At home, in cities, and almost everywhere, a growing number of sensing and acting machines surround us, sometimes visibly (e.g., robots, drones, cars, power generators) but often imperceptibly (e.g., smartphones, televisions, vacuum cleaners, washing machines). Before long, through the emergence of a new generation of communication networks, most of these machines will be interacting with one another through the internet of things (IoT). Constantly evolving MASs will thus break new ground in coming years, pervading all areas of the society and the industries, including security, medicine, transport, and manufacturing. Although Markov decision processes provide a solid mathematical framework for single-agent planning and RL, they do not offer the same theoretical grounding in MASs. In contrast to single-agent systems, when multiple agents interact with one another, how the environment evolves depends not only upon the action of one agent but also on the actions taken by the other agents, rendering the Markov property invalid and the environment no longer stationary. Also, a centralized (single-agent) control authority is often inadequate because agents cannot (e.g., due to communication cost, latency or noise) or do not want (e.g., in competitive or strategic settings) to share all their information all the time.

As a consequence, the increasing penetration of MASs in the society will require a paradigm shift---from single-agent to multi-agent planning and reinforcement learning algorithms---leveraging on recent breakthroughs. That leads us to the fundamental challenge this proposal addresses: the design of generic algorithms with provable guarantees that can efficiently compute rational strategies for a group of cooperating or competing agents in spite of stochasticity and sensing uncertainty, yet using the same algorithmic scheme. Such algorithms should adapt to changes in the environment; apply to different tasks, and eventually converge to a rational solution for the task at hand. But it needs not to exhibit the fastest convergence rates since there is no free lunch.
Using the same algorithmic scheme for different problems eases knowledge transfer and dissemination in expert as well as practitioner communities.

Project coordinator


The author of this summary is the project coordinator, who is responsible for the content of this summary. The ANR declines any responsibility as for its contents.



Help of the ANR 254,296 euros
Beginning and duration of the scientific project: February 2020 - 42 Months

Useful links

Explorez notre base de projets financés



ANR makes available its datasets on funded projects, click here to find more.

Sign up for the latest news:
Subscribe to our newsletter