In recent years, deep learning (DL) has become the state-of-the-art machine learning (ML) paradigm for supervised tasks on data with a latent structure, notably in image, video, audio and speech processing. To deploy DL solutions on problems for which little or even no annotated data is available, a growing body of research addresses lightly-supervised and unsupervised settings, with methods such as one-shot learning for object categorization in images. This interest also exists in audio processing, but to a much smaller extent. LUDAU will remedy this situation by exploiting the powerful properties of deep neural networks, namely feature and multi-level representation learning, combined with state-of-the-art clustering techniques.
LUDAU proposes to explore and strengthen this active research area within the framework of DL and deep neural networks (DNNs). Two scenarios will be targeted:
1) a lightly-supervised scenario in which coarse manual labels are available. By coarse labels, we refer to labels that globally describe a recording, e.g., a single tag for an entire file rather than time-aligned annotations.
2) a zero-resource or unsupervised scenario in which only raw audio recordings are available.
The main motivation behind LUDAU is to minimize manual labeling effort by relying on coarse labels only, and ultimately to remove the need for even coarse labeling.
To reach this goal, we plan to:
1) propose new methods to extract feature representations to better discriminate between audio units,
2) segment and cluster the audio signal representations into useful and meaningful elementary units.
To tackle these issues, we plan to rely on the representation learning and discriminative power of DNNs. We propose to combine top-down approaches based on the coarse high-level labels with bottom-up approaches based on activation maps extracted from DNNs to automatically infer low-level audio units.
To discover useful audio units, clustering will be applied to activation maps, and an iterative scheme alternating clustering and DNN training on the resulting pseudo-labels is expected to bring improvements. We will consider using saliency detection to identify the segments of the audio input that support the predictions made by a model. An important outcome will be a processing pipeline that automatically annotates raw recordings with pseudo-units in both time and frequency.
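The alternation between clustering and pseudo-label training can be sketched as follows. This is a minimal illustration, not the proposal's actual pipeline: a toy k-means and an identity embedding stand in for the DNN activation maps, and all function names (`discover_units`, `embed`, etc.) are hypothetical.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20, seed=0):
    """Toy Lloyd's k-means: a stand-in for the clustering step that
    the pipeline would apply to DNN activation maps."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: dist2(p, centroids[c]))
                  for p in points]
        # Update step: recompute each centroid as its cluster mean.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = tuple(sum(xs) / len(members)
                                     for xs in zip(*members))
    return labels

def discover_units(frames, embed, k, rounds=3):
    """Alternate clustering and pseudo-label 'training'.
    `embed` stands for the DNN feature extractor; in a real pipeline,
    each round would retrain the DNN on the pseudo-labels and
    re-extract embeddings before re-clustering."""
    labels = []
    for _ in range(rounds):
        embeddings = [embed(f) for f in frames]
        labels = kmeans(embeddings, k)
        # DNN retraining on `labels` as pseudo-targets would happen here.
    return labels
```

Calling `discover_units(frames, lambda f: f, k=2)` on frame-level features returns one pseudo-unit index per frame; with a trainable embedding, each round would sharpen the cluster structure before the next clustering pass.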
Unsupervised speech unit discovery, in particular at the phonemic level, will be the target application. We will also broaden our scope to other audio tasks, namely the detection of audio events in field recordings.
Project coordinator: Mr Thomas PELLEGRINI (Institut de Recherche en Informatique de Toulouse)
The author of this summary is the project coordinator, who is responsible for its content. The ANR declines any responsibility as to its contents.
IRIT (Institut de Recherche en Informatique de Toulouse)
ANR grant: 222,366 euros
Beginning and duration of the scientific project: - 42 months