DS0705 - Foundations of Digital Technology

Extraction and TRAnsfer of knowledge in reinforcement LEARNing – ExTra-Learn

Challenges and objective

In the near future, intelligent and autonomous systems will become increasingly pervasive in applications such as autonomous robotics, intelligent personal assistants, and the management of energy smart grids. Although very diverse, these applications call for decision-making systems able to interact with and manage open-ended, uncertain, and partially known environments. This will require increasing the autonomy of ICT systems, which will have to continuously learn from data, improve their performance over time, and quickly adapt to changes. EXTRA-LEARN is directly motivated by the evidence that one of the key features that allows humans to accomplish complicated tasks is their ability to build knowledge from past experience and transfer it while learning new tasks. We believe that integrating transfer of learning into machine learning algorithms will dramatically improve their learning performance and enable them to solve complex tasks. We identify the reinforcement learning (RL) framework as the most suitable candidate for this integration. RL formalizes the problem of learning an optimal control policy from experience collected directly from an unknown environment. Nonetheless, practical limitations of current algorithms have encouraged research to focus on how to integrate prior knowledge into the learning process. Although this improves the performance of RL algorithms, it dramatically reduces their autonomy. In this project we pursue a paradigm shift from designing RL algorithms that incorporate prior knowledge to methods able to incrementally discover, construct, and transfer “prior” knowledge in a fully automatic way. More specifically, three main elements of RL algorithms would significantly benefit from transfer of knowledge: exploration of the environment, value function approximation, and hierarchical decomposition of policies.

Methods / Approaches

The methodology followed to develop transfer algorithms differs from the standard design of RL
methods and requires answering three different questions:
• Which “prior” knowledge is effective in improving the learning performance? A first important
step will be to develop different models of knowledge. Optimal policies, subpolicies, features,
sampling strategies, value functions, and raw samples are examples of the elements characterizing
RL problems and solutions. The choice of which knowledge to use will be directed by the
evidence that if a human supervisor were able to provide it in the form of prior knowledge,
then the learning algorithm would be able to significantly improve its performance. While in
many cases such evidence is already available (e.g., features adapted to the target value
functions), there are still many scenarios where even if explicit prior knowledge were
available, no RL algorithm would be able to provably take advantage of it.
• How can “prior” knowledge be automatically learned? While in some cases the knowledge of
interest is immediately available after the interaction with a task (e.g., samples), there are
more sophisticated forms of knowledge (e.g., an underlying low-dimensional representation
of value functions) that require defining a specific learning process.
• How can knowledge be integrated into a transfer learning process? In order to obtain a full
transfer algorithm, we need to both collect useful knowledge from past tasks and transfer it
while solving new tasks. In some cases, this may not be trivial and may require solving a
trade-off between learning the desired knowledge as fast and efficiently as possible (e.g., by
performing a thorough exploration of the environment in many tasks) and exploiting the
current learned knowledge to improve the performance of the learning process.
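
As a rough illustration of the third question, the sketch below collects one simple form of knowledge (tabular Q-values from past tasks) and reuses it as the starting point for a new task. Everything here is an assumption made for illustration: the averaging rule, the names `average_q_prior` and `q_update`, the step size and discount, and the toy Q-tables are hypothetical and are not the project's algorithms.

```python
from typing import Dict, List, Tuple

QTable = Dict[Tuple[int, int], float]  # (state, action) -> estimated value


def average_q_prior(source_tables: List[QTable]) -> QTable:
    """Average Q-values over the source tasks that visited each (state, action) pair."""
    sums: Dict[Tuple[int, int], float] = {}
    counts: Dict[Tuple[int, int], int] = {}
    for table in source_tables:
        for key, value in table.items():
            sums[key] = sums.get(key, 0.0) + value
            counts[key] = counts.get(key, 0) + 1
    return {key: sums[key] / counts[key] for key in sums}


def q_update(q: QTable, state: int, action: int, reward: float,
             next_state: int, actions: List[int],
             alpha: float = 0.5, gamma: float = 0.9) -> None:
    """One tabular Q-learning update; unseen pairs default to 0.0."""
    best_next = max(q.get((next_state, a), 0.0) for a in actions)
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (reward + gamma * best_next - old)


# Hypothetical Q-tables learned on two past tasks.
source_a = {(0, 0): 1.0, (0, 1): 0.0}
source_b = {(0, 0): 0.5, (0, 1): 0.5, (1, 0): 2.0}

# Transfer: start the new task from the averaged prior instead of zeros,
# then keep learning on the new task with ordinary Q-learning updates.
target_q = average_q_prior([source_a, source_b])
q_update(target_q, state=0, action=0, reward=1.0, next_state=1, actions=[0, 1])
```

Averaging is only one possible choice; it helps when the source tasks resemble the new one and can hurt otherwise, which is part of what the questions above are meant to address.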

Results

In the empirical evaluation of the proposed algorithms we will study
different forms of improvements. In particular, we expect different qualitative
improvements from a transfer RL algorithm:
• Jumpstart. This improvement only considers the very initial stages of the learning process and
it is mostly related to the initialization of the algorithm. We can expect to achieve this
improvement whenever it is possible to develop a prior over the solutions of all tasks that, on
average, is more accurate than a standard non-informative initialization.
• Learning speed. In exploration-exploitation problems, the learning speed improvement
corresponds to the fact that the RL algorithm can actually reduce the amount of exploration of
the environment needed to find near-optimal policies. In approximate RL algorithms,
improving the learning speed corresponds to a reduction in the sample complexity, that is,
the number of samples needed to achieve a desired accuracy. In general, a learning
speed improvement does not change the asymptotic performance of the transfer RL algorithm,
which achieves the same level of performance as the no-transfer version.
• Asymptotic performance. Transfer learning and notably feature learning and hierarchical
decomposition may significantly affect the asymptotic performance of an RL algorithm. In fact, if
the approximation scheme changes (e.g., changing the basis functions used in linear value
function approximation), the space of functions and policies that the RL algorithm can learn will
be affected as well. As a result, we expect that, if a transfer algorithm captures the common
structure among different tasks, it could learn features that increase the
asymptotic accuracy of the learning process.
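
The three improvements above can be read off a pair of learning curves. The sketch below is one possible way to compute them, under stated assumptions: the function names, the window sizes, the performance threshold, and the per-episode returns are all made up for illustration and do not come from the project's experiments.

```python
from typing import List, Optional


def jumpstart(transfer: List[float], baseline: List[float], k: int = 1) -> float:
    """Difference in mean performance over the first k episodes."""
    return sum(transfer[:k]) / k - sum(baseline[:k]) / k


def episodes_to_threshold(curve: List[float], threshold: float) -> Optional[int]:
    """Number of episodes needed to first reach the threshold (None if never reached)."""
    for i, value in enumerate(curve):
        if value >= threshold:
            return i + 1
    return None


def asymptotic_performance(curve: List[float], window: int = 2) -> float:
    """Mean performance over the last `window` episodes."""
    tail = curve[-window:]
    return sum(tail) / len(tail)


# Made-up per-episode returns, for illustration only.
baseline = [0.0, 2.0, 4.0, 6.0, 8.0, 8.0]
transfer = [3.0, 6.0, 8.0, 8.0, 8.0, 8.0]

gain_at_start = jumpstart(transfer, baseline)
speed_baseline = episodes_to_threshold(baseline, 8.0)
speed_transfer = episodes_to_threshold(transfer, 8.0)
final_baseline = asymptotic_performance(baseline)
final_transfer = asymptotic_performance(transfer)
```

In this toy case the transfer curve starts higher and reaches the threshold in fewer episodes, while both curves end at the same level, which matches the case of a learning speed improvement without an asymptotic one.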

Perspectives

If the outcome of this research is positive, we expect to contribute to the
development of learning algorithms able to interact with complex environments in a much more
intelligent and autonomous way. This has a clear strategic role in facing the future challenges of a
digital society where intelligent and autonomous systems will be more pervasive and ubiquitous.
In the long term, we envision decision-making support systems where transfer learning takes
advantage of the data available from different tasks (e.g., users) to construct high-level knowledge
that allows sophisticated reasoning and learning in complex domains. For instance, online
education systems will build on automatic schedulers that design specific agendas for each student.
While learning methods are guaranteed to find the most effective sequence of lessons for each user, transfer
algorithms will be able to dramatically reduce the exploration needed to discover the skills of the
student (e.g., preliminary exercises to assess her level and define the best learning strategy),
thanks to effective exploration strategies refined over time and across users. Furthermore,
personal fitness assistants will be able to access training data from thousands of users and
transfer methods will use them to recover the best low-dimensional feature representation, which
will be constantly adapted to make learning the best fitness plan for a new user more accurate.
Finally, complex robotic tasks will be automatically decomposed into a hierarchy of subtasks
whose solutions will be transferred and reused from task to task.

Scientific outputs and patents

The results of the project will be primarily disseminated within the machine learning community.
The theoretical, algorithmic, and empirical results will be published in major national and
international conferences and journals.

Submission summary

In the near future, intelligent and autonomous systems will become increasingly ubiquitous in applications such as autonomous robotics, intelligent personal assistants, and energy management in smart grids. Although very diverse, these applications require the development of decision-making systems able to interact with and manage open-ended, uncertain, and partially known environments. Consequently, the autonomy of ICT systems will have to increase: they will need to learn continuously from data, improve their performance over time, and adapt quickly to changes.
EXTRA-LEARN is motivated by the observation that one of the key features that allows humans to accomplish complex tasks is their ability to build knowledge from past experience and transfer it when learning new tasks. We believe that integrating the theory of transfer learning (TL) into reinforcement learning (RL) algorithms could considerably improve their performance and enable them to solve complex tasks. RL formalizes the problem of learning a control policy from the experience acquired while interacting with an unknown environment. Because of the practical limitations of current algorithms, research has focused on integrating prior knowledge into the learning process. Although this improves the performance of RL algorithms, it considerably reduces their autonomy. In this project, we pursue a paradigm shift: from designing algorithms that incorporate prior knowledge to developing methods able to discover, construct, and transfer knowledge in a fully automatic way. More specifically, the three elements of RL that could benefit from TL are:
(i) RL algorithms need to discover the environment through a long exploration phase, which becomes impractical in large environments. TL would allow algorithms to considerably reduce the exploration of a new task by exploiting its similarity to tasks solved in the past.
(ii) RL algorithms evaluate the quality of a policy by computing its value function. When the number of states is too large, approximation mechanisms must be introduced. Although this is currently done by a domain expert, we propose to define systems that adapt the approximation models to the tasks encountered over time. This would increase the accuracy and stability of RL algorithms.
(iii) To cope with complex environments, hierarchical systems have been proposed, in which policies are organized in a hierarchy of subtasks. This requires a precise definition of the hierarchy, which, if not properly constructed, can reduce learning performance. The goal of TL is to automatically build a hierarchy of skills that can be reused to solve several tasks.
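
To make item (ii) more concrete, the sketch below uses state aggregation, a simple case of linear value function approximation in which each feature is the indicator of a group of states and the best least-squares fit within a group is the group mean. The value function, the two groupings, and the name `aggregation_error` are hypothetical, and the code assumes the groups form a partition of the states.

```python
from typing import List


def aggregation_error(values: List[float], groups: List[List[int]]) -> float:
    """Mean squared error of the best piecewise-constant fit over the given groups."""
    total = 0.0
    for group in groups:
        mean = sum(values[s] for s in group) / len(group)
        total += sum((values[s] - mean) ** 2 for s in group)
    return total / len(values)


# Made-up true value function over six states.
true_values = [0.0, 0.0, 0.0, 10.0, 10.0, 10.0]

# Two hypothetical feature sets with the same number of features (two indicators each).
misaligned = [[0, 1, 2, 3], [4, 5]]  # ignores the structure of the task
aligned = [[0, 1, 2], [3, 4, 5]]     # matches the structure of the task

err_misaligned = aggregation_error(true_values, misaligned)
err_aligned = aggregation_error(true_values, aligned)
```

With the same number of features, the grouping that matches the structure of the value function leaves no approximation error, while the other one does. This is the kind of gain that adapting the approximation model to the tasks encountered could provide.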

The objective of the project is to propose transfer learning (TL) solutions for each of these elements. The main short-term impact will be a significant advance in the state of the art of RL, with the development of a new generation of algorithms that perform better in practice and come with a rigorous theoretical analysis. In the long term, we envision decision-making systems in which TL uses data from many tasks to build high-level knowledge that enables learning in complex domains, with an impact on several fields, from robotics to healthcare, from energy to transportation.

Michal Valko (Institut National de Recherche en Informatique et Automatique)

The author of this summary is the project coordinator, who is responsible for its content. The ANR therefore declines any responsibility for its content.

Inria Institut National de Recherche en Informatique et Automatique

ANR funding: 251,400 euros
Start and duration of the scientific project: September 2014 - 42 months
