CE33 - Interaction, Robotics

Multi-Modal Multi-person Low-Level Learning for Robot Interaction – ML3RI

ML3RI

Multi-person robot interaction in the wild (i.e. unconstrained and using only the robot's own resources) is currently out of reach because of the lack of suitable machine perception and decision-making models. Indeed, current robotic skills are derived from machine learning techniques that operate only in constrained environments.

Understanding multi-person multi-modal interactions -- the lack of data.

Besides the common difficulties caused by the health crisis, with all our meetings held through a screen, which limited interaction and made mutual understanding harder, we were severely constrained by the fact that we could not record data in our laboratory. This made T3.1 (preliminary data collection) impossible, and we had to be creative and rely on datasets available online instead. T3.1 is therefore replaced by the use of data available online.

Because the capacity to understand and react to low-level behavioral cues is crucial for autonomous robot communication, we propose to develop novel Multi-Modal Multi-person Low-Level Learning models for Robot Interaction (ML3RI). We will explore methods combining the flexibility of probabilistic models with the robustness and performance of deep neural architectures. Since training these models requires large annotated datasets, we will develop multi-modal data generation techniques, reducing the required amount of real data. Additional efforts will be devoted to developing demonstrators running on mobile robotic platforms, so as to evaluate our methods outside the laboratory.
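To make the idea of combining deep networks with probabilistic latent-variable models more concrete, below is a minimal, hypothetical sketch (in PyTorch) of a variational autoencoder over speech power-spectrogram frames. The layer sizes, variable names and the Itakura-Saito-style reconstruction term are illustrative assumptions and do not correspond to the project's actual models.

```python
import torch
import torch.nn as nn

class FrameVAE(nn.Module):
    """Toy frame-wise VAE: a deep encoder/decoder around a Gaussian latent."""
    def __init__(self, n_freq=513, n_latent=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_freq, 128), nn.Tanh())
        self.enc_mean = nn.Linear(128, n_latent)
        self.enc_logvar = nn.Linear(128, n_latent)
        self.decoder = nn.Sequential(
            nn.Linear(n_latent, 128), nn.Tanh(), nn.Linear(128, n_freq))

    def forward(self, x_power):
        h = self.encoder(x_power)
        mean, logvar = self.enc_mean(h), self.enc_logvar(h)
        # Reparameterisation trick: sample z ~ q(z | x).
        z = mean + torch.exp(0.5 * logvar) * torch.randn_like(mean)
        # The decoder outputs the log-variance of a zero-mean Gaussian
        # model of the speech STFT coefficients, i.e. the speech power.
        log_power = self.decoder(z)
        return log_power, mean, logvar

def elbo_loss(x_power, log_power, mean, logvar):
    """Negative ELBO: Itakura-Saito-like reconstruction + KL to N(0, I)."""
    ratio = x_power / log_power.exp()
    recon = (ratio - torch.log(ratio) - 1.0).sum(-1)
    kl = -0.5 * (1.0 + logvar - mean.pow(2) - logvar.exp()).sum(-1)
    return (recon + kl).mean()
```

The probabilistic ingredients (latent prior, approximate posterior, ELBO) keep such a model interpretable and composable with classical signal models, while the neural encoder and decoder provide the expressiveness; this is the general flavour of the hybrid approach, not a specific published model.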

We have worked on various robotic perception tasks, such as person re-identification [C3], human body pose estimation [J2, C4] and speech enhancement [J1, J3, C1, C2, C5, C6, P1]. We have also worked towards the generation of interactive data [P2] (see below). While some of these works use visual [J2, C3, C4] or auditory [C1, C6, P1] data alone, others exploit the complementarity of audio and visual data [J1, J3, C2, C5]. This allowed us to make progress on T1.1, T1.2 and T3.2. We also started working on hybrid deep-probabilistic models (T1.3) and on processing a time-varying number of people (T1.4). Similarly, we started studying the use of reinforcement learning in social robotics, for the time being in simulation. Results are expected in the coming months.
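The reinforcement-learning experiments mentioned above run in simulation. Purely as an illustration, the following is a minimal REINFORCE-style loop; the simulated interaction environment `env` (with `reset()` and `step(action)`), the state and action sizes, the reward and the discount factor are hypothetical placeholders, not the project's actual setup.

```python
import torch
import torch.nn as nn

# Placeholder policy mapping an 8-d interaction state to 4 discrete actions
# (e.g. gaze, speak, move, wait); the sizes are purely illustrative.
policy = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 4))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def run_episode(env, max_steps=100):
    """Roll out one episode in the (hypothetical) simulator."""
    state = env.reset()
    log_probs, rewards = [], []
    for _ in range(max_steps):
        logits = policy(torch.as_tensor(state, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        state, reward, done = env.step(action.item())
        log_probs.append(dist.log_prob(action))
        rewards.append(reward)
        if done:
            break
    return log_probs, rewards

def reinforce_update(log_probs, rewards, gamma=0.99):
    """Plain REINFORCE update: maximise the expected discounted return."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    loss = -(torch.stack(log_probs) * torch.tensor(returns)).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```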

We published a comprehensive review on dynamical variational autoencoders [J4]. How to best exploit this methodology within ML3RI remains an open challenge for the coming period.
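To give a flavour of the model family covered by the review, here is a minimal, hypothetical dynamical-VAE sketch in which recurrent networks propagate temporal dependencies through the latent sequence; the dependency structure and the dimensions are illustrative assumptions and do not correspond to any particular model from [J4].

```python
import torch
import torch.nn as nn

class TinyDVAE(nn.Module):
    """Toy DVAE: recurrent encoder q(z_t | x_1:t), recurrent decoder p(x_t | z_1:t)."""
    def __init__(self, n_freq=513, n_latent=16, n_hidden=64):
        super().__init__()
        self.enc_rnn = nn.GRU(n_freq, n_hidden, batch_first=True)
        self.enc_mean = nn.Linear(n_hidden, n_latent)
        self.enc_logvar = nn.Linear(n_hidden, n_latent)
        self.dec_rnn = nn.GRU(n_latent, n_hidden, batch_first=True)
        self.dec_out = nn.Linear(n_hidden, n_freq)

    def forward(self, x):                       # x: (batch, time, n_freq)
        h, _ = self.enc_rnn(x)                  # h_t summarises x_1:t
        mean, logvar = self.enc_mean(h), self.enc_logvar(h)
        z = mean + torch.exp(0.5 * logvar) * torch.randn_like(mean)
        d, _ = self.dec_rnn(z)                  # d_t summarises z_1:t
        log_power = self.dec_out(d)             # frame-wise speech log-variance
        return log_power, mean, logvar
```

Compared with a static, frame-wise VAE, both the approximate posterior and the generative model now depend on past frames through the recurrent states, which is the defining feature of the DVAE family.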

[J1] Mostafa Sadeghi, Xavier Alameda-Pineda. Mixture of Inference Networks for VAE-based Audio-visual Speech Enhancement. IEEE Transactions on Signal Processing, 2021, 69, pp. 1899-1909.
[J2] Xavier Alameda-Pineda, Vincent Drouard, Radu Horaud. Variational Inference and Learning of Piecewise-linear Dynamical Systems. IEEE Transactions on Neural Networks and Learning Systems, 2021.
[J3] Mostafa Sadeghi, Simon Leglaive, Xavier Alameda-Pineda, Laurent Girin, Radu Horaud. Audio-Visual Speech Enhancement Using Conditional Variational Auto-Encoders. IEEE/ACM Transactions on Audio, Speech and Language Processing, 2020, 28, pp. 1788-1800.
[J4] Laurent Girin, Simon Leglaive, Xiaoyu Bie, Julien Diard, Thomas Hueber, Xavier Alameda-Pineda. Dynamical Variational Autoencoders: A Comprehensive Review. Foundations and Trends® in Machine Learning, 2021, 15(1-2), pp. 1-175.

[C1] Xiaoyu Bie, Laurent Girin, Simon Leglaive, Thomas Hueber, Xavier Alameda-Pineda. A Benchmark of Dynamical Variational Autoencoders applied to Speech Spectrogram Modeling. Interspeech, Aug 2021, Brno, Czech Republic.
[C2] Mostafa Sadeghi, Xavier Alameda-Pineda. Switching Variational Auto-Encoders for Noise-Agnostic Audio-visual Speech Enhancement. ICASSP 2021 - IEEE International Conference on Acoustics, Speech and Signal Processing, Jun 2021, Toronto, Canada, pp. 1-5.
[C3] Guillaume Delorme, Yihong Xu, Stéphane Lathuilière, Radu Horaud, Xavier Alameda-Pineda. CANU-ReID: A Conditional Adversarial Network for Unsupervised person Re-IDentification. ICPR 2020 - 25th International Conference on Pattern Recognition, Jan 2021, Milano, Italy, pp. 1-8.
[C4] Wen Guo, Enric Corona, Francesc Moreno-Noguer, Xavier Alameda-Pineda. PI-Net: Pose Interacting Network for Multi-Person Monocular 3D Pose Estimation. WACV 2021 - IEEE Winter Conference on Applications of Computer Vision, Jan 2021, Waikoloa, United States, pp. 1-11.
[C5] Mostafa Sadeghi, Xavier Alameda-Pineda. Robust Unsupervised Audio-visual Speech Enhancement Using a Mixture of Variational Autoencoders. ICASSP 2020 - IEEE International Conference on Acoustics, Speech and Signal Processing, May 2020, Barcelona, Spain.
[C6] Simon Leglaive, Xavier Alameda-Pineda, Laurent Girin, Radu Horaud. A Recurrent Variational Autoencoder for Speech Enhancement. ICASSP 2020 - IEEE International Conference on Acoustics, Speech and Signal Processing, May 2020, Barcelona, Spain, pp. 1-7.

To achieve the project's goals, we will organise the work into four workpackages, each with several tasks and deliverables. The first three workpackages are each devoted to a different scientific challenge (robust perception, pertinent behaviour and data generation). The fourth workpackage is devoted to evaluating the developed methods and their implementations on two robotic platforms, Nao and Pepper. Integration will be done within the ROS framework. Because Nao and Pepper have similar ROS interfaces, there is no need to develop two separate software packages for the evaluation on Nao and on Pepper (see the sketch below). The main scientific impact of ML3RI is to develop new learning methods and algorithms for the aforementioned tasks, thus opening the door to studying multi-party conversations with robots. In addition, the project supports open and reproducible research.
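As a hypothetical illustration of why a single software package can serve both robots, the following rospy node interacts only with ROS topics. The topic name follows the naoqi_driver convention but is an assumption to be checked against the actual bridge configuration, and the callback is a placeholder for the perception stack.

```python
#!/usr/bin/env python
import rospy
from sensor_msgs.msg import Image

def image_callback(msg):
    # Placeholder for the multi-modal perception pipeline (re-identification,
    # pose estimation, audio-visual speech enhancement, ...).
    rospy.loginfo("received a %dx%d frame", msg.width, msg.height)

def main():
    rospy.init_node("ml3ri_perception")
    # Assumed naoqi_driver topic; identical on Nao and Pepper when both are
    # launched through the same ROS bridge.
    rospy.Subscriber("/naoqi_driver/camera/front/image_raw", Image, image_callback)
    rospy.spin()

if __name__ == "__main__":
    main()
```

Since such a node never touches robot-specific APIs directly, switching between Nao and Pepper amounts to pointing it at the corresponding ROS master, which is the design rationale stated above.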

The ANR JCJC grant will allow Xavier to implement his research agenda and to intensify his supervision efforts. In the near future, Xavier plans to submit an ERC StG proposal and his HdR.

Project coordination

Alameda-Pineda XAVIER (Centre de Recherche Inria Grenoble - Rhône-Alpes)

The author of this summary is the project coordinator, who is responsible for its content. The ANR declines all responsibility for its contents.

Partner

Inria GRA Centre de Recherche Inria Grenoble - Rhône-Alpes

ANR funding: 293,328 euros
Start date and duration of the scientific project: February 2019, 48 months
