CE33 - Interaction, Robotics – Artificial Intelligence

ROBOVOX - Robust Vocal Identification for Mobile Security Robots – ROBOVOX


Robust voice recognition for mobile security robots


This project is dedicated to robust voice identification for mobile security robots. It proposes solutions that integrate complementary modalities to support voice recognition, taking advantage of the human-robot interaction context.

The strategy adopted for this project is above all pragmatic. First, it relies on the partners' strong background in voice identification, noise reduction, human-machine dialogue and autonomous robots. This know-how will allow rapid implementation and a reliable estimate of the difficulties to be solved. The second strong point of ROBOVOX's strategy is to make the most of the central element of the project: if the context of the autonomous robot is the major source of difficulty, then let us make the best use of this context to resolve it! Taking into account the different microphones, the location of the intruder and the possibility of placing it in the optimal position is a first example of this strategic orientation. Getting around certain difficulties, such as short message durations, through vocal dialogue skills is a second.



1. Mohammad Mohammadamini, Driss Matrouf, Paul-Gauthier Noé. Denoising x-vectors for Robust Speaker Recognition. Odyssey 2020 The Speaker and Language Recognition Workshop, Nov 2020, Tokyo, Japan. pp. 75-80.

2. Mohammad Mohammadamini, Driss Matrouf. Data augmentation versus noise compensation for x-vector speaker recognition systems in noisy environments. EUSIPCO, Jan 2021, Amsterdam, Netherlands.

3. Pierre-Michel Bousquet and Mickaël Rouvier. The LIA System Description for SdSV Challenge Task. InterSpeech 2020 (Session SdSV Challenge).

4. Mickaël Rouvier and Pierre-Michel Bousquet. Review of different robust x-vector extractors for speaker verification. EUSIPCO 2020.

5. Pierre-Michel Bousquet and Mickaël Rouvier. Adaptation strategy and clustering from scratch for new domains of speaker recognition. Speaker Odyssey 2020.

During periods of inactivity, using an autonomous mobile robot to monitor industrial premises is an efficient solution with an excellent cost-effectiveness ratio. The robot moves around the premises and analyzes the activity in them. When a person is detected, the robot is responsible for verifying their identity. In case of difficulty, the robot then contacts a human operator. The main goal of this project is to take into account the real conditions of use of the robot and their impact on voice identification. This involves conducting experiments with the robot itself and in realistic environments. Voice identification in the context of a mobile security robot faces several challenges related to the use of remote microphones, which can drastically reduce performance: ambient noise and internal noise from the robot's actuators (ego-noise), which lead to a low signal-to-noise ratio; reverberation, which depends on the highly variable configuration of the places in which the robot operates; the location of the speakers in the room with respect to the microphones; and so on. In this project, we propose methods to cope with these issues. The proposed solutions are based on our expertise in acoustic modeling and signal processing, as well as on the use of deep neural networks (DNNs). DNNs are currently heavily used in machine-learning applications and have become the state of the art in many application domains, including speech processing.
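To make the signal-to-noise issue concrete, the following minimal Python sketch computes an SNR in decibels and shows how adding the robot's ego-noise to the ambient noise lowers it. The signal values are hypothetical toy data, not project measurements, and the sketch assumes that the noise sources are uncorrelated so that their powers simply add.

```python
import math

def power(samples):
    """Mean squared amplitude of a sequence of samples."""
    return sum(s * s for s in samples) / len(samples)

def snr_db(speech_power, noise_powers):
    """SNR in dB: 10 * log10(P_speech / P_noise).

    Assumption: the noise sources are uncorrelated, so the total
    noise power is the sum of the individual noise powers.
    """
    return 10.0 * math.log10(speech_power / sum(noise_powers))

# Hypothetical toy signals, used only to obtain illustrative power values.
speech = [0.5, -0.5, 0.5, -0.5]        # power 0.25
ambient = [0.05, -0.05, 0.05, -0.05]   # power 0.0025
ego = [0.15, -0.15, 0.15, -0.15]       # power 0.0225

snr_ambient_only = snr_db(power(speech), [power(ambient)])
snr_with_ego = snr_db(power(speech), [power(ambient), power(ego)])

print(f"SNR, ambient noise only: {snr_ambient_only:.1f} dB")
print(f"SNR, ambient + ego-noise: {snr_with_ego:.1f} dB")
```

With these toy values the SNR drops from about 20 dB to about 10 dB once the ego-noise is included, which illustrates why noise from the robot itself is treated as a separate difficulty.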

Despite the efforts to tackle acoustic difficulties, there are scenarios in which voice identification alone cannot meet the required level of reliability. In applications where a high level of security is required, the use of a single modality is generally too risky, and voice identification is often implemented in conjunction with other identification modalities. We therefore propose that the robot in this project use its ability to interact with the persons it detects. This modality is used when the robot does not have enough evidence to make a reliable decision. On the one hand, it can use its interaction capabilities to acquire more acoustic data in order to consolidate voice identification. On the other hand, the robot can use the interaction module to resolve an ambiguity through a set of simple questions and answers based on knowledge the robot can verify (for example, asking for the first name or surname of the person's direct superior). Finally, information on the emotional state of the speaker and on the acoustic scene will be transmitted to the system so that the robot can adapt its dialogue strategy, its behavior, and the pre-processing and voice identification algorithms.
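The escalation logic described above can be sketched as a small decision function. This is only an illustrative sketch, not the project's actual system: the `Evidence` fields, the `Action` names, the thresholds and the score range are all hypothetical assumptions introduced for the example.

```python
from dataclasses import dataclass
from enum import Enum

class Action(Enum):
    ACCEPT = "accept"
    REQUEST_MORE_SPEECH = "request_more_speech"
    ASK_KNOWLEDGE_QUESTION = "ask_knowledge_question"
    CALL_OPERATOR = "call_operator"

@dataclass
class Evidence:
    voice_score: float      # hypothetical verification score in [0, 1]
    speech_seconds: float   # amount of speech collected so far
    questions_asked: int    # knowledge questions already asked
    questions_correct: int  # how many were answered correctly

def next_action(ev, accept_threshold=0.9, ambiguity_floor=0.5,
                min_speech_seconds=10.0, max_questions=2):
    """Pick the robot's next step (all thresholds are illustrative)."""
    # Voice evidence alone is strong enough.
    if ev.voice_score >= accept_threshold:
        return Action.ACCEPT
    # Too little speech: use dialogue to collect more acoustic data.
    if ev.speech_seconds < min_speech_seconds:
        return Action.REQUEST_MORE_SPEECH
    # Still ambiguous: ask a question the robot can verify.
    if ev.questions_asked < max_questions:
        return Action.ASK_KNOWLEDGE_QUESTION
    # All answers correct and the voice score is not too low: accept.
    if (ev.questions_correct == ev.questions_asked
            and ev.voice_score >= ambiguity_floor):
        return Action.ACCEPT
    # Otherwise hand over to a human operator.
    return Action.CALL_OPERATOR

print(next_action(Evidence(0.6, 3.0, 0, 0)).value)
print(next_action(Evidence(0.6, 12.0, 2, 1)).value)
```

In this sketch the dialogue is used first to lengthen the speech sample and only then to ask verification questions; the human operator remains the last resort, in line with the scenario described earlier.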

In addition to the expected direct scientific and technical benefits, this project will be an opportunity to create and disseminate a unique corpus. During and after the project, this corpus will make it possible to evaluate solutions that aim to tackle acoustic difficulties such as ambient noise, reverberation or short durations. An evaluation plan with an experimental protocol will be defined to ensure that the solutions developed during the project are relevant to both the scientific community and the industrial partner.

Project coordinator

Mr. Driss MATROUF (Laboratoire d'Informatique d'Avignon)

The author of this summary is the project coordinator, who is responsible for its content. The ANR accepts no responsibility for its contents.


Inria Centre de Recherche Inria Nancy - Grand Est
LIA Laboratoire d'Informatique d'Avignon

ANR funding: 665,677 euros
Beginning and duration of the scientific project: January 2019 - 48 Months
