CE23 - Artificial intelligence and data science 2022

Speaker diarization with a unified robust multimodal and spatial audio model – SAROUMANE

Submission summary

Speaker diarization (SD) aims to answer the question "who speaks and when?". It remains a challenging problem because real-world scenarios are varied and complex (propagation environment, large numbers of speakers, moving speakers ...). When at least two speakers are present (meeting, phone conversation, TV show ...), SD is essential for automatic transcription or translation algorithms to perform well. Over the last decade, SD research has focused on deep neural network (DNN) architectures (end-to-end, autoencoder, recurrent neural network, transformer ...) to cope with the highly nonlinear nature of the problem. One popular DNN architecture for SD is the autoencoder, which combines two neural networks: an encoder, which maps the input to a so-called latent space, and a decoder, which transforms the latent variable into output data intended to be identical to the input. In parallel, recent papers use multichannel audio datasets as input to DNN-based SD and report considerably improved performance.
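As a rough illustration of the encoder/decoder structure described above, the following minimal sketch maps synthetic feature frames to a latent space and back. The dimensions, the random untrained weights, and the function names are all assumptions made for this example and are not taken from the project.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 40-dimensional feature frames, 8-dimensional latent space.
INPUT_DIM, LATENT_DIM = 40, 8

# Randomly initialised, untrained weights (illustration only).
W_enc = rng.normal(scale=0.1, size=(INPUT_DIM, LATENT_DIM))
W_dec = rng.normal(scale=0.1, size=(LATENT_DIM, INPUT_DIM))


def encode(x):
    """Encoder: map input frames to latent variables."""
    return np.tanh(x @ W_enc)


def decode(z):
    """Decoder: map latent variables back to the input space."""
    return z @ W_dec


def reconstruction_error(x):
    """Mean squared error between the input and its reconstruction."""
    return float(np.mean((decode(encode(x)) - x) ** 2))


frames = rng.normal(size=(16, INPUT_DIM))  # 16 synthetic feature frames
latent = encode(frames)
print(latent.shape, decode(latent).shape)
print(reconstruction_error(frames))
```

Training an autoencoder would consist of adjusting the two weight matrices to reduce the reconstruction error; that step is omitted here.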
Most of the aforementioned DNN-based SD methods lack interpretability, although they have shown good average performance. This makes the DNN hard to train and limits its adaptability to uncommon SD scenarios not included in the training dataset. The resulting scientific challenges are 1) proposing a robust and interpretable DNN architecture that takes into account 2) multichannel audio input and 3) other multimodal information. One architecture that has been shown to improve the interpretability and performance of the autoencoder is the variational autoencoder (VAE). The VAE assumes a probabilistic model of the input data, which leads to variational techniques for estimating the autoencoder parameters.
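To make the variational idea more concrete, here is a minimal sketch of the objective a standard VAE minimises (the negative evidence lower bound), assuming a Gaussian approximate posterior, a standard normal prior, and a unit-variance Gaussian likelihood. These Gaussian choices, the sizes, and the weight names are assumptions for illustration; they are not the heavy-tailed models targeted by the project.

```python
import numpy as np

rng = np.random.default_rng(0)
INPUT_DIM, LATENT_DIM = 40, 8  # hypothetical sizes

# Untrained illustrative weights (assumed linear encoder and decoder).
W_mu = rng.normal(scale=0.1, size=(INPUT_DIM, LATENT_DIM))
W_logvar = rng.normal(scale=0.1, size=(INPUT_DIM, LATENT_DIM))
W_dec = rng.normal(scale=0.1, size=(LATENT_DIM, INPUT_DIM))


def encode(x):
    """Return the mean and log-variance of the Gaussian posterior q(z|x)."""
    return x @ W_mu, x @ W_logvar


def sample_latent(mu, logvar):
    """Reparameterisation trick: z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps


def kl_to_standard_normal(mu, logvar):
    """Closed-form KL(q(z|x) || N(0, I)), one value per frame."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=1)


def negative_elbo(x):
    """Reconstruction term plus KL term, averaged over frames."""
    mu, logvar = encode(x)
    z = sample_latent(mu, logvar)
    x_hat = z @ W_dec
    # Unit-variance Gaussian likelihood, up to an additive constant.
    recon = 0.5 * np.sum((x - x_hat) ** 2, axis=1)
    return float(np.mean(recon + kl_to_standard_normal(mu, logvar)))


frames = rng.normal(size=(16, INPUT_DIM))  # 16 synthetic feature frames
print(negative_elbo(frames))
```

The KL term penalises posteriors that drift far from the prior, which is one reason the latent variables of a VAE lend themselves to a probabilistic interpretation.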

The SAROUMANE project aims to develop new methodologies for MSD that combine unified heavy-tailed probabilistic models of multichannel audio signals and multimodal data with a VAE architecture.

Mathieu Fontaine (Telecom ParisTech)

The author of this summary is the project coordinator, who is responsible for its content. The ANR accepts no responsibility for its contents.

LTCI Telecom ParisTech

ANR funding: 267,836 euros
Start and duration of the scientific project: March 2023 - 36 months
