JCJC - Young Researchers (Jeunes chercheuses et jeunes chercheurs)

Acoustic-Visual Speech Synthesis by Bimodal Unit Concatenation: Toward a true acoustic-visual speech synthesis – ViSAC

Submission summary

The aim of this project is to contribute to the fields of acoustic speech synthesis and audiovisual speech synthesis by building a bimodal corpus that is as complete as possible: one that covers the sounds of the French language with as much variability as possible in both the acoustic and the visual facets of speech, and that treats the acoustic signal and the corresponding visual signal as a whole. This corpus will be an essential resource and the basis of a text-to-acoustic-visual speech synthesis system, which will be the main result of the project. Developing such a system makes it possible to study several research topics that concern both domains at once, mainly coarticulation and the concatenation of units.
The main originality of this work is to consider the speech signal as bimodal (composed of two channels, acoustic and visual) and 'viewed' from either facet. Thus one facet, even if it is not 'visible', can be exploited to process the other. In this sense, the acoustic signal is a part of the visual information: at each step, speech units will be considered as pairs made of an acoustic speech segment and its corresponding visual segment. This new view of synthesis will help to reformulate some key issues and, consequently, to improve both acoustic and visual speech synthesis.
One important step in this work is the recording of a large bimodal corpus (very large when only the visual facet is considered). This corpus will be composed of motion capture data and acoustic data recorded simultaneously. Acoustic-visual synthesis will be performed in the following steps:
- Bimodal selection of non-uniform units in the corpus
- Concatenation of bimodal units
- Bimodal synthesis system
These tasks combine our expertise in the fields of acoustics, vision and audiovisual speech, and correspond to our long-term research efforts.
The goal of the bimodal selection of non-uniform units is to adapt the non-uniform acoustic unit selection principle used in text-to-speech (TTS) systems to non-uniform bimodal units, in order to improve acoustic-visual speech synthesis. The work will mainly deal with bimodal distance measures for evaluating the join cost of two units. The challenge is to find the best way to combine acoustic and visual features so as to account for the perceptual differences at the boundary between two bimodal units. One main purpose is to minimize the discrepancies between the selected units. However, selection alone cannot be guaranteed to prevent all discrepancies between two adjacent units, especially in the visual modality. Several research directions are therefore envisaged to address this issue, and will be evaluated.
In the next step, the actual concatenation of two bimodal units will be investigated. To preserve the naturalness of the bimodal units, our approach is local, at the boundaries of the units; in the worst cases, it may be extended to a longer part of the units, in particular in the visual space.
The final system takes a text as input and synthesizes it acoustically and visually, producing the corresponding acoustic speech together with the visual articulator commands. These commands can reconstruct the 3D points representing the face (sparse meshes). Since working with this low-resolution head alone cannot be very effective, we will map the sparse meshes to a high-resolution 3D head. Due to the acquisition technique, the lips are not fully visible at all times. This is an important problem since, perceptually, lip gestures are very important for recognizing labial phonemes. It should be stressed that no existing technology is likely to solve this problem alone. Several approaches, with or without a priori geometric models, will be investigated.
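As an illustration only, the bimodal join cost and the selection of units described above could be sketched as follows in Python. The summary does not specify the features, the distance measure or the weights, so everything concrete here is an assumption made for the example rather than the project's method: the Euclidean distance, the weighted sum of the two modalities, the names BimodalUnit, join_cost and select_units, and the omission of any target cost.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class BimodalUnit:
    """Hypothetical candidate unit: acoustic and visual features at both boundaries."""
    name: str
    acoustic_start: Sequence[float]
    acoustic_end: Sequence[float]
    visual_start: Sequence[float]
    visual_end: Sequence[float]


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def join_cost(left: BimodalUnit, right: BimodalUnit,
              w_acoustic: float = 0.5, w_visual: float = 0.5) -> float:
    """Assumed bimodal join cost: weighted sum of the boundary mismatch
    in the acoustic space and in the visual space."""
    acoustic = euclidean(left.acoustic_end, right.acoustic_start)
    visual = euclidean(left.visual_end, right.visual_start)
    return w_acoustic * acoustic + w_visual * visual


def select_units(candidates: List[List[BimodalUnit]],
                 w_acoustic: float = 0.5,
                 w_visual: float = 0.5) -> Tuple[List[BimodalUnit], float]:
    """Dynamic-programming (Viterbi-style) search that picks one candidate per
    position so that the sum of join costs along the sequence is minimal."""
    # best[i][j] = (cumulative cost, index of best predecessor) for candidate j
    best = [[(0.0, -1) for _ in candidates[0]]]
    for i in range(1, len(candidates)):
        row = []
        for unit in candidates[i]:
            costs = [best[i - 1][k][0] + join_cost(prev, unit, w_acoustic, w_visual)
                     for k, prev in enumerate(candidates[i - 1])]
            k_min = min(range(len(costs)), key=costs.__getitem__)
            row.append((costs[k_min], k_min))
        best.append(row)

    # Backtrack from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    path.reverse()
    return path, total


if __name__ == "__main__":
    # Toy one-dimensional features, invented for the example.
    a1 = BimodalUnit("a1", [0.0], [0.0], [0.0], [0.0])
    a2 = BimodalUnit("a2", [0.0], [1.0], [0.0], [1.0])
    b1 = BimodalUnit("b1", [1.0], [1.0], [1.0], [1.0])
    b2 = BimodalUnit("b2", [5.0], [5.0], [5.0], [5.0])
    path, total = select_units([[a1, a2], [b1, b2]])
    print([u.name for u in path], total)
```

Raising w_visual relative to w_acoustic penalizes visible jumps at the unit boundaries more heavily, which is one simple way to explore the trade-off between the two modalities that the text raises.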
Finally, the system will be evaluated using objective and perceptual evaluations of the acoustic output alone, the visual output alone, and the audiovisual output as a bimodal signal. The expected outcomes of our work are:
- A very large bimodal corpus allowing several studies in this field and related ones
- A better understanding of coarticulation
- A deeper study of the correlations between acoustic and visual information
- An improved text-to-speech synthesis system
- A text-to-acoustic-visual speech synthesis system that can be connected to another talking face, after a geometrical adaptation to the target face.
One possible way to exploit our system is to use it as an assistant for hard-of-hearing people or second-language learners, for whom the bimodality of speech is crucial in providing feedback. Furthermore, the bimodal corpus is a valuable resource for our future work on several phenomena related to audiovisual speech.

Project coordinator


The author of this summary is the project coordinator, who is responsible for its content. The ANR declines any responsibility for its contents.



ANR funding: 210,193 euros
Beginning and duration of the scientific project: - 48 Months
