Analysis and tRansformation of Singing style – ARS
For the project’s first central objective, singing style analysis, two approaches were proposed for investigation. The first aimed to leverage advances in deep learning-based voice processing models; for musical analysis, this covers models for F0 estimation, vocal source separation, and lyrics alignment. The second aimed to study interactive speech synthesis algorithms as tools for performative musicological analysis.
For the second objective, vocal style transformation, we developed deep models for voice representation and transformation: a neural vocoder for representation and a constrained autoencoder for transformation.
ars-analysis.fr
The ARS project has made significant progress towards its two initial objectives. Regarding computational support for the musicological analysis of vocal style, the project has developed new software, named ars_analysis, which serves as the backend for a web service developed by the WEB team at IRCAM. This web service has been integrated by the partner Passages XX-XXI into a website, accessible at www.ars-analysis.fr, dedicated to researchers working on musicological studies of singing. The site allows researchers in the community to upload songs together with their lyrics and then download the results: the separated singing voice, its F0 analysis, and a temporal annotation of the lyrics' syllables aligned with the audio.
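For illustration, the sketch below shows how a client might submit a song and its lyrics to a service of this kind and retrieve the results. The base URL, endpoint path, field names, and response format are assumptions made for the example only; they do not document the actual API behind www.ars-analysis.fr.

```python
# Minimal client sketch for a song-analysis web service of this kind.
# Endpoint paths, field names, and the response format are hypothetical.
import requests

BASE_URL = "https://www.ars-analysis.fr/api"  # hypothetical base URL


def analyse_song(audio_path: str, lyrics_path: str) -> dict:
    """Upload a song and its lyrics, then retrieve the analysis results."""
    with open(audio_path, "rb") as audio, open(lyrics_path, "rb") as lyrics:
        response = requests.post(
            f"{BASE_URL}/analyses",                      # hypothetical endpoint
            files={"audio": audio, "lyrics": lyrics},
        )
    response.raise_for_status()
    # Per the project description, the results comprise the separated singing
    # voice, its F0 analysis, and a syllable-level alignment of the lyrics.
    return response.json()


if __name__ == "__main__":
    results = analyse_song("song.wav", "lyrics.txt")
    print(results.keys())
```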
Alignment Between Audio and Lyrics
The main innovation of the ARS project in this software is the Adagio model for aligning lyrics with audio, developed in Yann Teytaut's doctoral thesis. This syllable-level alignment model can align lyrics with the sung voice even without removing the background music. A particular feature of the model is that training can be performed on plain audio-text pairs, without requiring reference alignments. Another is that, although trained solely on English singing, the model works with nearly all Western languages.
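As an illustration of how a model can be trained on plain audio-text pairs without reference alignments, the sketch below uses the Connectionist Temporal Classification (CTC) objective, a standard technique for alignment-free training. It is a generic toy example in PyTorch, not the actual Adagio implementation, and the network, inventory size, and feature dimensions are assumptions.

```python
# Toy illustration of alignment-free training with the CTC objective:
# only the symbol sequence of the lyrics is needed, no frame-level labels.
import torch
import torch.nn as nn

NUM_SYMBOLS = 40          # hypothetical syllable/phoneme inventory + blank (index 0)
FEAT_DIM = 80             # e.g. mel-spectrogram bins

# A toy acoustic model producing frame-wise symbol posteriors from mel frames.
model = nn.Sequential(
    nn.Linear(FEAT_DIM, 256),
    nn.ReLU(),
    nn.Linear(256, NUM_SYMBOLS),
)
ctc_loss = nn.CTCLoss(blank=0)

# One fake batch: 2 utterances of 300 frames with symbol targets of different lengths.
mels = torch.randn(2, 300, FEAT_DIM)
targets = torch.randint(1, NUM_SYMBOLS, (2, 25))
input_lengths = torch.tensor([300, 280])
target_lengths = torch.tensor([25, 18])

log_probs = model(mels).log_softmax(dim=-1)      # (batch, time, symbols)
log_probs = log_probs.transpose(0, 1)            # CTCLoss expects (time, batch, symbols)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()

# At inference time, an alignment can be recovered by a Viterbi (forced-alignment)
# pass over the posteriors, yielding onset/offset times for each symbol.
print(float(loss))
```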
Singing Transformation
The second major outcome of the ARS project is the Circe demonstrator (Circe: the IRCam Voice Encoder), an innovative deep learning model for voice transformation. It comprises two components. The first is the MBExWN vocoder (Multi-Band Excited WaveNet), a universal neural vocoder and one of the first to support both sung and spoken voices, enabling nearly transparent inversion of a given mel-spectrogram into a corresponding vocal signal while remaining competitive in terms of computational requirements. The second is the Circe model itself, an autoencoder that allows high-quality transformation of pitch and intensity for both sung and spoken voices. This approach to voice transformation is rare, even unique, in the current research landscape, which tends to focus on large language models and textual descriptions for all controls; the finer-grained control offered by the Circe autoencoder is highly valued by artists because it allows much more detailed manipulation. We note that intensity transformation was learned without the need for a calibrated intensity recording database, which simplifies producing this type of transformation.
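To illustrate the general principle of an autoencoder constrained by an explicit pitch control, the sketch below conditions a toy decoder on an F0 curve, so that feeding a shifted curve at inference transposes the voice; the resulting mel-spectrogram would then be inverted by a neural vocoder. This is a simplified, hypothetical example and does not reproduce the Circe or MBExWN architectures.

```python
# Generic sketch of a pitch-conditioned autoencoder for voice transformation.
# The encoder compresses the mel-spectrogram into a bottleneck that should not
# carry pitch; the decoder reconstructs the mel-spectrogram conditioned on an
# explicit F0 curve. Architecture and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

FEAT_DIM, LATENT_DIM = 80, 16


class ConditionedAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.GRU(FEAT_DIM, LATENT_DIM, batch_first=True)
        # The decoder sees the latent code plus the (possibly modified) F0 value.
        self.decoder = nn.GRU(LATENT_DIM + 1, 128, batch_first=True)
        self.out = nn.Linear(128, FEAT_DIM)

    def forward(self, mel, f0):
        latent, _ = self.encoder(mel)                        # (B, T, LATENT_DIM)
        decoded, _ = self.decoder(torch.cat([latent, f0], dim=-1))
        return self.out(decoded)                             # reconstructed mel


model = ConditionedAutoencoder()
mel = torch.randn(1, 200, FEAT_DIM)        # 200 frames of mel-spectrogram
f0 = torch.rand(1, 200, 1) * 300 + 100     # fake F0 curve in Hz

reconstruction = model(mel, f0)               # training target: the input mel
transposed = model(mel, f0 * 2 ** (3 / 12))   # inference: transpose up 3 semitones
print(reconstruction.shape, transposed.shape)
```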
Computational and Statistical Study of Singing Style in Musicology
Digital musicology is an emerging branch of musicology. The web service developed by the ARS project will support these expanding activities and provide computational tools to the community.
New Approaches for Voice Transformation
The MBExWN neural vocoder and the Circe autoencoder, developed within the ARS project, illustrate a frugal approach to neural voice processing. These models will be further developed to cover a broader range of vocal qualities while reducing computational costs and latency.
The study of singing style in popular music is an emerging branch of musicology, while singing-style effects have become a central part of most popular music productions, which rely on the few effect plugins available today. ARS aims to establish a mutually beneficial collaboration between musicologists working on singing performance and specialists in signal processing, with the following objectives: 1) to exploit advances in voice signal processing and deep learning for musicological research on singing style, and 2) to develop new algorithms for high-quality, expressive singing voice transformation that diversify and enrich the palette of artistic expression in popular music. Musicologists will contribute their expertise on musically and artistically relevant singing style features to the development of singing effects, while signal processing specialists will establish robust analysis algorithms for musicologists to study singing style in real music performances, as well as innovative singing voice transformation algorithms that allow modification of singing style in music productions.
Project coordination
Axel Roebel (INST RECH COORD ACOUSTIQ MUSIQ)
The author of this summary is the project coordinator, who is responsible for the content of this summary. The ANR declines all responsibility for its contents.
Partnership
FLUX SOFTWARE ENGINEERING
EA4160 PASSAGES XX-XXI
STMS INST RECH COORD ACOUSTIQ MUSIQ
Institut Jean le Rond d'Alembert
ANR grant: 774,197 euros
Beginning and duration of the scientific project: December 2019 - 42 months