Online Diarization Enhanced by recent Speaker identification and Sequential learning Approaches – ODESSA
Speaker diarization is an unsupervised process that aims to identify each speaker within an audio stream and to determine when each speaker is active. It assumes that the number of speakers, their identities and their speech turns are all unknown. Speaker diarization has become a key technology in many domains such as content-based information retrieval, voice biometrics, forensics and social-behavioural analysis. Current state-of-the-art systems suffer from many limitations: they are extremely domain-dependent and experience drastically degraded performance when tested on a different type of recording. In recent years, state-of-the-art speaker recognition systems have improved substantially thanks to the emergence of new recognition paradigms such as i-vectors and deep learning. One goal of the project is therefore to adapt those techniques to speaker diarization. Furthermore, most existing work addresses offline speaker diarization, which is not admissible in real-time applications. Since our main application is related to security, designing an online speaker diarization system with low latency is necessary. A third goal of the project is to take into account the inherent temporal structure of interactions between speakers by relying on structured prediction techniques. In the spirit of reproducible research, we will evaluate the proposed algorithms on standard databases (NIST SRE, REPERE, ETAPE, AMI...) and collect a medium-size database suited to our main application, the fight against cyber-criminality.
The project will focus on applying recent advances in speaker recognition, especially i-vectors, deep neural networks and domain adaptation, to speaker diarization. The main goal of these techniques is to reduce the effect of within-class variability, which is mainly due to background noise, channel variability and the state of the speaker, and to overcome the biases introduced by unseen testing datasets.
Many security applications require online approaches to speaker diarization. New online learning algorithms will be investigated in order to process vast datasets with manageable memory and computational demands, while improving the reliability of speaker modelling and segmentation in the face of short segments, channel effects and other unwanted variability.
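To make the idea of low-memory online processing concrete, here is a toy sketch (not the project's actual algorithm) of greedy online speaker clustering: each incoming segment embedding joins the closest existing cluster if its cosine similarity to that cluster's running centroid exceeds a threshold, and otherwise opens a new cluster. The class name, threshold value and embedding format are illustrative assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

class OnlineDiarizer:
    """Greedy online clustering of speaker-segment embeddings.

    Memory grows with the number of speakers, not with the length of the
    audio stream, which is what makes the approach viable online.
    """

    def __init__(self, threshold=0.7):
        self.threshold = threshold
        self.centroids = []  # one running-mean embedding per speaker
        self.counts = []     # number of segments merged into each centroid

    def assign(self, embedding):
        """Return the speaker index assigned to this segment embedding."""
        best, best_sim = None, -1.0
        for k, centroid in enumerate(self.centroids):
            sim = cosine(embedding, centroid)
            if sim > best_sim:
                best, best_sim = k, sim
        if best is None or best_sim < self.threshold:
            # no sufficiently similar speaker yet: start a new cluster
            self.centroids.append(list(embedding))
            self.counts.append(1)
            return len(self.centroids) - 1
        # update the running mean of the matched cluster
        n = self.counts[best]
        self.centroids[best] = [(n * c + e) / (n + 1)
                                for c, e in zip(self.centroids[best], embedding)]
        self.counts[best] = n + 1
        return best
```

A single pass over the stream suffices, but the greedy assignment cannot revise early decisions, which is one reason the project also investigates structured prediction.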
Conversations between several speakers are usually highly structured, and the speech turns of a given person are not uniformly distributed over time. However, state-of-the-art approaches seldom take this intrinsic structure into account. A goal of the project is to demonstrate that structured prediction techniques can be applied to speaker diarization.
Finally, it is our strong belief that reproducible research, in speaker diarization and in general, should be promoted. Hence it is our ambition in this project to have a dedicated task on these aspects, in close connection with dissemination efforts, and to implement the evaluation protocols and figures of merit in open source libraries.
• The EURECOM submission to the international “Albayzin 2016 Speaker Diarization Evaluation”, organized by the Spanish Red Temática en Tecnologías del Habla (RTTH), was ranked first and obtained the evaluation award
• Definition of a protocol for a low-latency speaker spotting task, along with the open source distribution of the related code
• Recording of a database of VoIP conversations
• Integration of i-vector sequence modeling into Bob open source library
• Integration of speaker turn segmentation with LSTM models into pyannote open source library
• Integration of speaker diarization evaluation module into pyannote library
• 11 papers presented at international conferences (ICASSP, Interspeech…)
An important goal for the last term of the project is to disseminate the project results through a journal paper on low-latency speaker spotting and through participation in an international evaluation campaign related to speaker diarization.
1. H. Bredin. “TristouNet: Triplet Loss for Speaker turn embedding”. ICASSP 2017. herve.niderb.fr/download/pdfs/Bredin2017.pdf
2. H. Bredin. “pyannote.metrics: a toolkit for reproducible evaluation, diagnostic, and error analysis of speaker diarization systems”. Interspeech 2017. herve.niderb.fr/download/pdfs/Bredin2017a.pdf
3. M. Cernak, A. Komaty, A. Anjos, S. Marcel. “Bob Speaks Kaldi”. Interspeech 2017. infoscience.epfl.ch/record/229211
4. J. Patino, H. Delgado, N. Evans and X. Anguera, “EURECOM submission to the Albayzin 2016 speaker diarization evaluation,” IberSPEECH 2016.
5. J. Patino, H. Delgado and N. Evans, “Speaker change detection using binary key modelling with contextual information,” SLSP 2017.
6. G. Wisniewski, H. Bredin, G. Gelly, C. Barras. “Combining speaker turn embedding and incremental structure prediction for low-latency speaker diarization”. Interspeech 2017. herve.niderb.fr/download/pdfs/Wisniewski2017.pdf
7. R. Yin, H. Bredin, C. Barras. “Speaker Change Detection in Broadcast TV using Bidirectional Long Short-Term Memory Networks”. Interspeech 2017. github.com/yinruiqing/change_detection/blob/master/doc/change-detection.pdf
8. J. Patino, R. Yin, H. Delgado, H. Bredin, A. Komaty, G. Wisniewski, C. Barras, N. Evans, S. Marcel. “Low-latency speaker spotting with online diarization and detection”. Odyssey 2018. www.isca-speech.org/archive/Odyssey_2018/pdfs/60.pdf
9. R. Yin, H. Bredin, C. Barras. “Neural Speech Turn Segmentation and Affinity Propagation for Speaker Diarization”. Interspeech 2018.
10. J. Patino, H. Delgado and N. Evans. “The EURECOM submission to the first DIHARD Challenge”. Proc. Interspeech 2018.
11. J. Patino, H. Delgado and N. Evans. “Enhanced low-latency speaker spotting using selective cluster enrichment”. Proc. BIOSIG 2018.
Speaker diarization is an unsupervised process that aims to identify each speaker within an audio stream and to determine when each speaker is active. It assumes that the number of speakers, their identities and their speech turns are all unknown. Speaker diarization has become a key technology in many domains such as content-based information retrieval, voice biometrics, forensics and social-behavioural analysis. Examples of applications of speaker diarization include speech and speaker indexing, speaker recognition (in the presence of multiple speakers), speaker role detection, speech-to-text transcription, speech-to-speech translation and document content structuring.
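Diarization output is conventionally scored with the Diarization Error Rate (DER): the fraction of reference speech time that is missed, falsely detected, or attributed to the wrong speaker. As a minimal illustration (the error components would in practice come from an alignment between reference and hypothesis, e.g. as computed by an evaluation toolkit):

```python
def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """Diarization Error Rate: the sum of missed speech, false-alarm speech
    and speaker-confusion time, divided by the total reference speech time.
    All arguments are durations in the same unit (e.g. seconds)."""
    return (missed + false_alarm + confusion) / total_speech

# e.g. 3 s missed + 2 s false alarm + 5 s confusion over 100 s of speech
# gives a DER of 10%
```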
Although speaker diarization has been studied for almost two decades, current state-of-the-art systems suffer from many limitations. Such systems are extremely domain-dependent: for instance, a speaker diarization system trained on radio/TV broadcast news experiences drastically degraded performance when tested on a different type of recording such as radio/TV debates, meetings, lectures, conversational telephone speech or conversational voice-over-IP speech. Overlapping speech, spontaneous speaking style, background noise, music and other non-speech sources (laughter, applause, etc.) are all nuisance factors that adversely affect the quality of speaker diarization.
Furthermore, most existing work addresses the problem of offline speaker diarization, in which the system has full access to the entire audio recording beforehand and no real-time processing is required. Multi-pass processing over the same data is therefore feasible and a range of elegant machine learning tools can be used. Nevertheless, these compromises are not admissible in real-time applications, especially when it comes to public security and the fight against terrorism and cyber-criminality.
Moreover, after an initial step of segmentation into speech turns, most approaches address speaker diarization as a bag-of-speech-turns clustering problem and do not take into account the inherent temporal structure of interactions between speakers. One goal of the project is to integrate this information and rely on structured prediction techniques to improve over standard hierarchical clustering methods.
Since our main application is related to the fight against cyber-criminality and public security, designing an online speaker diarization system is necessary. The focus on industrial research will therefore be supplemented by addressing more fundamental research issues related to structured prediction and methods such as conditional random fields.
Speaker diarization is inherently related to speaker recognition. In recent years, state-of-the-art speaker recognition systems have improved substantially, thanks to the emergence of new recognition paradigms such as i-vectors and deep learning, new session compensation techniques such as probabilistic linear discriminant analysis, and new score normalization techniques such as adaptive symmetric score normalization. However, existing speaker diarization systems do not take full advantage of those new techniques. Therefore, one goal of the project is to adapt those techniques to speaker diarization, and thus fill this gap in the current literature.
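As a sketch of the last of these techniques, adaptive symmetric score normalization (AS-norm) z-normalizes a raw trial score twice, once with cohort statistics from the enrolment side and once from the test side, using only the top-k most competitive cohort scores on each side, and averages the two. The function below is an illustrative implementation; the argument names and the choice of top_k are assumptions, not the project's configuration.

```python
import math

def adaptive_s_norm(raw_score, enrol_cohort_scores, test_cohort_scores, top_k=3):
    """Adaptive symmetric score normalization (AS-norm) of one trial score.

    enrol_cohort_scores: scores of the enrolment model against a cohort set.
    test_cohort_scores:  scores of the test segment against the same cohort.
    Only the top_k highest cohort scores on each side are used to estimate
    the normalization statistics (the "adaptive" part).
    """
    def stats(scores):
        top = sorted(scores, reverse=True)[:top_k]
        mean = sum(top) / len(top)
        var = sum((s - mean) ** 2 for s in top) / len(top)
        return mean, math.sqrt(var)

    mu_e, sigma_e = stats(enrol_cohort_scores)
    mu_t, sigma_t = stats(test_cohort_scores)
    # average of the enrolment-side and test-side z-normalized scores
    return 0.5 * ((raw_score - mu_e) / sigma_e +
                  (raw_score - mu_t) / sigma_t)
```

Normalizing against a cohort in this way makes scores from different speakers and sessions comparable, which is what makes a single detection threshold usable across trials.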
To evaluate the proposed algorithms and ensure their generality, different existing databases will be considered, such as NIST SRE 2008 summed-channel telephone data, NIST RT 2003-2004 conversational telephone data, REPERE TV broadcast data and the AMI meeting corpus. Furthermore, we aim to collect a medium-size database suited to our main application, the fight against cyber-criminality.
Monsieur Claude Barras (Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur)
The author of this summary is the project coordinator, who is responsible for its content. The ANR declines any responsibility as to its contents.
Idiap Idiap Research Institute
LIMSI Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur
ANR grant: 308,405 euros
Beginning and duration of the scientific project: February 2016 - 42 months