BANDITS AGAINST NON-STATIONARITY AND STRUCTURE – BADASS

Submission summary

We strongly believe that understanding the dynamics of complex systems, and how to act optimally in them, can have a large positive impact on aspects of human societies that require careful management of natural, energy, human and computational resources. We seek automatic methods that learn to behave provably optimally from partial observations of, and interactions with, a complex noisy system. Central to this problem is the exploration-exploitation trade-off, formalized by Multi-Armed Bandits (MAB), which over the last decade have produced state-of-the-art answers and key industrial applications for decision making with partial feedback. Since a MAB in which each arm corresponds to a stationary process models a single-state Markov Decision Process (MDP), it is a key building block for understanding how to act optimally in a complex noisy system.
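To make the exploration-exploitation trade-off concrete, here is a minimal sketch of UCB1, a classical index policy for stationary MABs. It is shown only as an illustration of the stationary baseline and is not a strategy proposed by this project; the Bernoulli arm means, horizon and seed are illustrative assumptions.

```python
import math
import random


def ucb1(arm_means, horizon, seed=0):
    """Run UCB1 on Bernoulli arms with the given (assumed) means.

    Returns the number of pulls per arm and the pseudo-regret.
    """
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k
    sums = [0.0] * k
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # pull each arm once to initialise its estimate
        else:
            # optimism: empirical mean plus a bonus that shrinks as an arm is pulled
            arm = max(range(k), key=lambda a: sums[a] / counts[a]
                      + math.sqrt(2 * math.log(t) / counts[a]))
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    best = max(arm_means)
    # pseudo-regret: expected loss from not always pulling the best arm
    regret = sum(c * (best - m) for c, m in zip(counts, arm_means))
    return counts, regret


counts, regret = ucb1([0.3, 0.5, 0.7], horizon=5000)
print(counts, round(regret, 1))
```

The exploration bonus forces occasional pulls of seemingly worse arms (exploration), while the empirical mean drives the policy toward the best arm (exploitation); this relies on each arm's distribution staying fixed over time.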
Motivated by the many modern applications of sequential decision making that require strategies especially robust to changes in the stationarity of the signal, and in order to anticipate and shape the next generation of applications in the field, we intend to push the theory and applications of MAB further by incorporating non-stationary observations while retaining near-optimality with respect to the best, not necessarily constant, decision strategy. Since a non-stationary process typically decomposes into segments associated with some possibly hidden variables (states), each corresponding to a stationary process, handling non-stationarity crucially requires exploiting the (possibly hidden) structure of the decision problem. For the same reason, a MAB whose arms can be arbitrary non-stationary processes is expressive enough to capture MDPs, and even partially observable MDPs, as special cases. It is thus crucial to address non-stationarity and structure jointly.
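The idea that a non-stationary signal splits into stationary segments can be illustrated with sliding-window UCB, a known policy for piecewise-stationary bandits due to Garivier and Moulines. The sketch below is only an assumed illustration of this existing baseline, not the approach developed in this project; the window size, the single change point and the arm means are hypothetical choices.

```python
import math
import random
from collections import deque


def sliding_window_ucb(mean_schedule, horizon, window=400, seed=0):
    """Sliding-window UCB on piecewise-stationary Bernoulli arms.

    mean_schedule(t) returns the list of arm means at round t (a hypothetical
    stand-in for a non-stationary environment). Only the last `window`
    (arm, reward) pairs are used to build the indices.
    Returns the list of arms chosen at each round.
    """
    rng = random.Random(seed)
    k = len(mean_schedule(1))
    history = deque(maxlen=window)  # oldest observations are dropped automatically
    choices = []
    for t in range(1, horizon + 1):
        counts = [0] * k
        sums = [0.0] * k
        for a, r in history:
            counts[a] += 1
            sums[a] += r
        unseen = [a for a in range(k) if counts[a] == 0]
        if unseen:
            arm = unseen[0]  # re-explore any arm absent from the window
        else:
            n_w = len(history)
            arm = max(range(k), key=lambda a: sums[a] / counts[a]
                      + math.sqrt(2 * math.log(n_w) / counts[a]))
        reward = 1.0 if rng.random() < mean_schedule(t)[arm] else 0.0
        history.append((arm, reward))
        choices.append(arm)
    return choices


def schedule(t):
    # hypothetical change point at t = 2000: the best arm switches from 0 to 1
    return [0.8, 0.2] if t <= 2000 else [0.2, 0.8]


choices = sliding_window_ucb(schedule, horizon=4000, window=400)
print(choices[:2000].count(0), choices[2000:].count(1))
```

Forgetting old observations lets the policy track the new best arm after the change, but the fixed window has to be tuned to an unknown rate of change, which illustrates why strategies that adapt to the signal are of interest.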
In order to advance on these two nested challenges from a solid theoretical standpoint, we will focus on the following objectives:
1. To broaden the range of optimal strategies for stationary MABs: current strategies are provably optimal only in a limited range of scenarios in which the class of distributions (the structure) is perfectly known; moreover, recent heuristics that may adapt to the class still need to be analyzed.
2. To strengthen the literature on pure sequential prediction (focusing on a single arm) for non-stationary signals through the construction of adaptive confidence sets and a novel measure of complexity: traditional approaches consider a worst-case scenario and are thus overly conservative and do not adapt to simpler signals.
3. To embed low-rank matrix completion and spectral methods in reinforcement learning (RL), and to further study models of structured environments: promising heuristics in settings such as contextual MABs or Predictive State Representations require stronger theoretical guarantees.
This project will result in a new generation of strategies for handling non-stationarity and structure, which will be evaluated on a number of test beds and validated by rigorous theoretical analysis. Beyond significantly advancing the state of the art in MAB and RL theory, and beyond the mathematical value of the program, this JCJC project BADASS is expected to have a strategic impact on societal and industrial applications, ranging from personalized health care and e-learning to computational sustainability and rain-adaptive river-bank management, to cite a few.
This ambitious program requires acute expertise in MAB theory, especially techniques for controlling the regret and from concentration of measure, together with many tools from model selection and aggregation, universal prediction, MDPs and spectral methods. These requirements are largely met by the team, which will also use this JCJC program to develop a rich scientific activity along these lines (invited researchers, seminars, workshops, tutorials).

Odalric-Ambrym Maillard (Inria Lille - Nord Europe / Equipe SEQUEL)

The author of this summary is the project coordinator, who is responsible for its content. The ANR declines any responsibility for its contents.

Inria, Inria Lille - Nord Europe / Equipe SEQUEL

ANR funding: 181,029 euros
Start and duration of the scientific project: October 2016, 42 months
