CHIST-ERA - 5th Call for Projects of the ERA-NET CHIST-ERA (Call 2014)

Interactive Grounded Language Understanding – IGLU

Through a developmental approach where knowledge grows in complexity while driven by multimodal experience and language interaction with a human, we propose a robotic agent that will incorporate models of dialogues, human emotions and intentions as part of its decision-making process. This will be possible through combining developmental, deep and reinforcement learning to handle large-scale multimodal inputs, and state-of-the-art dialogue systems.

A) Uncertainty in neural networks
B) Policy transfer from simulated robots to physical robots
C) Evaluating the performance of an interactive human-robot object learning scenario

A) To facilitate developmental learning, neural networks have to be aware of whether or not they know a given piece of information, i.e. they require some notion of certainty/uncertainty. No such metric was readily available at the beginning of the project, so one was developed and evaluated within the project.

B) Robots are usually trained in simulation, because that is significantly cheaper and faster than training a physical robot. Very frequently, though, problems arise when the policies learned this way are transferred to the physical system. As many of the project partners do not have the same robotic systems available and must do all their research in simulation, we aim to develop a method that makes the simulation more realistic and its results easier to transfer to the real robot.

A) When the same input is passed through a neural network several times, the output is usually identical each time. By adding either well-defined Poisson noise to the inputs or dropout noise to the intermediate layers, the same input can instead produce a distribution of different outputs. By characterizing this distribution with entropy measures and with the rate at which the output type or classification changes, we obtain a metric of the network's uncertainty on a given input, even without knowing the actual label of the output.

B) The physical robot and a simulated version both explore the sensorimotor space to generate a small dataset. This dataset is used to train a neural network that maps the output of the simulation software to the physical environment. Based on this learned transformation, a deep continuous policy learning algorithm is used to learn the optimal behavior for a few different benchmark scenarios.

C) Ten human tutors were each instructed to select 10 objects out of a pool of everyday household items and to teach each object to the robot in 3 different ways (by showing it, by pointing at it, and by speaking about it without touching it).
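The uncertainty measure described in A) can be illustrated with a minimal sketch. It is not the project's implementation: the tiny two-layer network, its random weights, the dropout probability and the function names `stochastic_passes` and `uncertainty` are all assumptions made for illustration, and only the dropout variant (not the Poisson input noise) is shown. The same input is passed through the network many times with dropout left on, and the resulting outputs are summarized by their entropy and by how often the predicted class changes.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def stochastic_passes(x, w1, w2, n_passes=100, drop_p=0.5, rng=None):
    """Pass one input through a toy MLP n_passes times with dropout active."""
    rng = np.random.default_rng(0) if rng is None else rng
    hidden = np.maximum(0.0, x @ w1)                 # ReLU hidden layer
    probs = []
    for _ in range(n_passes):
        mask = rng.random(hidden.shape) >= drop_p    # keep a unit with prob. 1 - drop_p
        dropped = hidden * mask / (1.0 - drop_p)     # inverted-dropout rescaling
        probs.append(softmax(dropped @ w2))
    return np.stack(probs)                           # shape (n_passes, n_classes)

def uncertainty(probs):
    """Return (entropy of the mean prediction, rate of class changes)."""
    mean_p = probs.mean(axis=0)
    entropy = -np.sum(mean_p * np.log(mean_p + 1e-12))
    votes = probs.argmax(axis=1)                     # predicted class per pass
    counts = np.bincount(votes, minlength=probs.shape[1])
    change_rate = 1.0 - counts.max() / len(votes)    # share of passes off the modal class
    return entropy, change_rate

# Hypothetical toy network: 4 inputs, 16 hidden units, 3 classes.
rng = np.random.default_rng(42)
w1 = rng.normal(size=(4, 16))
w2 = rng.normal(size=(16, 3))
x = rng.normal(size=4)

probs = stochastic_passes(x, w1, w2, rng=rng)
entropy, change_rate = uncertainty(probs)
print(f"entropy={entropy:.3f}, change rate={change_rate:.2f}")
```

Neither quantity needs the true label of the input. Both are zero when every pass yields the same confident prediction and grow as the passes disagree.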

A) We have a working method that can establish whether the model has been trained on given data before, and a method that can select, from a large pool of unlabeled data, the elements that would optimally guide learning progress. It is therefore possible, for example, to pick out interesting objects in a scene and ask a human tutor to name them.

B) This is still work in progress. The project is expected to significantly reduce the time it takes to move to a physical robot after the robot has been trained in simulation.

C) The dataset has been successfully recorded and was made publicly available by the University of Zaragoza, see below (section “Scientific production and patents”).
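The selection step in A) can be sketched as ranking pool items by the uncertainty of the model's prediction and querying the tutor about the top few. The ranking criterion (highest predictive entropy), the function name `select_queries` and the example probabilities are assumptions for illustration; the source does not state which criterion the project uses.

```python
import numpy as np

def select_queries(pool_probs, k):
    """Return indices of the k pool items with the highest predictive entropy."""
    p = np.asarray(pool_probs, dtype=float)
    entropy = -np.sum(p * np.log(p + 1e-12), axis=1)
    order = np.argsort(-entropy, kind="stable")   # most uncertain first
    return order[:k].tolist()

# Hypothetical class probabilities for four unlabeled objects in a scene.
pool = [
    [0.98, 0.01, 0.01],   # confidently recognized
    [0.34, 0.33, 0.33],   # almost uniform: the model has no idea
    [0.70, 0.20, 0.10],
    [0.50, 0.49, 0.01],
]
queries = select_queries(pool, k=2)
print(queries)  # indices of the objects the robot would ask the tutor to name
```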

Teaching robots in simulation is a necessity: simulated training is significantly faster, it cannot damage the physical machine, and it is usually cheaper. But problems frequently arise when the acquired knowledge is transferred to the physical robot. These problems come from a variety of sources, the most dominant being noise, friction and dynamic effects of the environment that either are not reflected in the simulation or take different values there and therefore lead to different behaviors. As a result, significant time is commonly spent adjusting policies learned in simulation to reality. We are developing a method that reduces the amount of adjustment required by learning how any simulation environment can be changed to model realistic settings. Policies learned with our modified simulator should therefore require only minimal adjustments to perform the given task on the real robot. To this end we use state-of-the-art deep learning methods, as they provide sufficient complexity to model highly dynamic environments.
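The idea of learning a correction that makes a simulator behave more like the real system can be illustrated on a toy problem. Everything in this sketch is an assumption: the one-dimensional velocity dynamics, the friction and noise values, and the function names are hypothetical, and a linear least-squares fit stands in for the deep network used in the project. The simulator ignores friction, the "real" system does not, and the correction is fitted on paired exploration data.

```python
import numpy as np

rng = np.random.default_rng(0)

def sim_step(v, a, dt=0.1):
    """Idealized simulator: velocity changes only with the commanded acceleration."""
    return v + a * dt

def real_step(v, a, dt=0.1, friction=0.8, noise=0.01):
    """Stand-in for the physical robot: adds friction and sensor noise."""
    return v + (a - friction * v) * dt + rng.normal(0.0, noise, size=np.shape(v))

def features(v, a):
    return np.column_stack([v, a, np.ones_like(v)])

# 1) Explore the same states and actions in simulation and on the "real" system.
v = rng.uniform(-1.0, 1.0, 500)
a = rng.uniform(-1.0, 1.0, 500)
residual = real_step(v, a) - sim_step(v, a)

# 2) Fit a model of the sim-to-real residual (least squares instead of a neural net).
weights, *_ = np.linalg.lstsq(features(v, a), residual, rcond=None)

def corrected_sim_step(v, a):
    """Simulator output plus the learned correction."""
    return sim_step(v, a) + features(v, a) @ weights

# 3) Compare both simulators against fresh "real" data.
v_test = rng.uniform(-1.0, 1.0, 200)
a_test = rng.uniform(-1.0, 1.0, 200)
target = real_step(v_test, a_test)
raw_error = np.mean(np.abs(sim_step(v_test, a_test) - target))
corrected_error = np.mean(np.abs(corrected_sim_step(v_test, a_test) - target))
print(f"raw: {raw_error:.4f}  corrected: {corrected_error:.4f}")
```

A policy would then be trained against `corrected_sim_step` rather than `sim_step`, so that less adjustment is needed on the physical robot.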

The Baxter dataset is already public, and a description and images are provided with it. We presented a workshop paper about the dataset at NIPS 2016, and a full paper about the dataset and a preliminary method for harnessing its multimodality at IROS 2017.

Language is an ability that develops in young children through joint interaction with their caretakers and their physical environment. At this level, human language understanding can be described as interpreting and expressing semantic concepts (e.g. objects, actions and relations) through what can be perceived (or inferred) from the current context in the environment. Previous work in the field of artificial intelligence has failed to address the acquisition of such perceptually-grounded knowledge in virtual agents (avatars), mainly because of the lack of physical embodiment (ability to interact physically) and of dialogue and communication skills (ability to interact verbally). We believe that robotic agents are more appropriate for this task, and that interaction is such an important aspect of human language learning and understanding that pragmatic knowledge (identifying or conveying intention) must be present to complement semantic knowledge. Through a developmental approach where knowledge grows in complexity while driven by multimodal experience and language interaction with a human, we propose an agent that will incorporate models of dialogues, human emotions and intentions as part of its decision-making process. This will lead it to anticipate and react based not only on its internal state (its own goal and intention, its perception of the environment), but also on the perceived state and intention of the human interactant. This will be possible through the development of advanced machine learning methods (combining developmental, deep and reinforcement learning) to handle large-scale multimodal inputs, while also leveraging the state-of-the-art technological components of a language-based dialogue system available within the consortium. Evaluations of learned skills and knowledge will be performed using an integrated architecture in a culinary use-case, and novel databases enabling research in grounded human language understanding will be released.
IGLU will gather an interdisciplinary consortium composed of committed and experienced researchers in machine learning, neurosciences and cognitive sciences, developmental robotics, speech and language technologies, and multimodal/multimedia signal processing. We expect to have key impacts in the development of more interactive and adaptable systems sharing our environment in everyday life.

The author of this summary is the project coordinator, who is responsible for its content. The ANR accepts no responsibility for its contents.


CRIStAL (Centre de Recherche en Informatique, Signal et Automatique de Lille)
University of Mons
KTH Royal Institute of Technology
Universidad de Zaragoza
Université de Sherbrooke

ANR funding: 293,280 euros
Beginning and duration of the scientific project: October 2015 - 36 months
