CE48 - Foundations of digital technology: computer science, control, signal processing 2020

Tackling hard problems in audio with Data-Efficient Non-linear InverSe mEthods – DENISE

DENISE: Tackling Hard Problems in Audio with Data-Efficient Non-linear Inverse Methods

Audio signal processing is undergoing lasting transformations due to the exceptional success of deep learning methods. However, these methods suffer from well-known limitations: difficulty in interpretation, task-specificity, and significant data and computational requirements. DENISE aims to address these challenges by developing new, efficient, generic, and interpretable tools based on nonlinear inverse methods. The targeted applications are audio inpainting and acoustic parameter estimation.

Main issues and general objectives

The machine learning paradigm, and deep learning methodologies in particular, have profoundly and durably transformed the field of audio signal processing, thanks to their remarkable ability to approximate complex non-linear functions given sufficiently large training datasets. However, this performance comes with a number of known limitations: difficulty of interpretation, lack of theoretical guarantees, task specificity, poor out-of-distribution generalization, and high computational and data demands. The latter increase the financial cost and ecological footprint of developing these methods, from research labs to production, and raise the issue of under-represented sound categories, which could be left behind in audio technologies.

Within this context, DENISE’s central premise is that a number of key sub-problems in audio can be tackled without any learning at all, thanks to recent theoretical and methodological breakthroughs in the field of non-linear inverse problems, which remain largely unnoticed by the audio research community. Its central goal is to develop novel and widely applicable non-linear inverse methods to tackle difficult audio signal processing problems in a data-efficient manner, with the following three objectives:
1. Developing novel non-linear audio signal processing tools with guaranteed, interpretable results that could be generically applied to a variety of tasks;
2. Enabling ecological and cost-efficient development of audio processing methods by drastically reducing data requirements in two key problems;
3. Gaining scientific insights on where to place boundaries between learning-based and analytical solutions within new hybrid audio signal processing frameworks.

Concretely, DENISE focuses on two tasks identified as under-explored, emerging and challenging. Work Package (WP) 1 focuses on audio inpainting, i.e., how to restore completely missing samples or chunks from an audio signal. WP2 focuses on echo-aware multichannel processing, i.e., how to process recordings from a microphone array in the presence of unknown acoustic reflectors. Our central research hypothesis is that non-linear inverse methods can strongly benefit both tasks, yielding new techniques that outperform the state of the art while being less data-intensive and more generalizable.

1. Phase-Aware Audio Inpainting
The first work package focused on audio inpainting, namely, how to reconstruct a completely missing chunk of an audio signal. To this end, we studied a new kind of prior: access to the magnitude spectrum of the signal of interest. We leveraged the fact that short-time Fourier spectra of natural sounds such as speech and music are strongly structured and redundant over time, and hence more amenable to prediction and interpolation. Via a new formulation, we established a connection between this problem and the long-standing research field of phase retrieval, for which many non-linear inverse methods have been developed over the years, but which had not yet permeated audio signal processing. Project DENISE leveraged this connection to obtain a new theoretical result and develop new algorithms for the audio inpainting problem.

2. Echo-Aware Signal Processing
The second work package focused on estimating the amplitudes and timings of early acoustic reflections, also known as echoes, from a sound source impinging on a microphone array. This is a fundamental task in acoustic signal processing, underlying a number of applications from echo-aware signal enhancement to room acoustic diagnosis through audio augmented reality. While the initial plan was to primarily investigate this task in contexts where the source signal is unknown, e.g., speech, we decided to focus instead on the case where a controlled source signal is available.

2.a. From Room Impulse Responses to Image Sources
When a controlled source and multiple microphones are used, it is possible to measure multichannel “room impulse responses” (RIRs), i.e., the signals received when the source emits a perfect time impulse. Through a new formulation combining the wave equation, the image-source model and a discrete low-pass microphone sampling model, we connected the difficult inverse problem of localizing the image sources associated with reflection paths from RIRs to the field of super-resolution, and in particular sparse measure recovery. We adapted the sliding Frank-Wolfe algorithm for this purpose.
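The forward model behind this inverse problem can be illustrated with a minimal Python sketch. It is not the project's implementation: it assumes free-field 1/(4πd) spreading, an ideal sinc low-pass filter at the microphones, and a speed of sound of 343 m/s, and the function name `simulate_rir` and its parameters are hypothetical.

```python
import numpy as np

def simulate_rir(image_sources, amplitudes, mic_positions,
                 fs=16000, n_samples=512, c=343.0):
    """Toy multichannel RIR: one low-pass-filtered spike per image source.

    image_sources: (K, 3) positions, amplitudes: (K,), mic_positions: (M, 3).
    Returns an (M, n_samples) array. All modeling choices are assumptions.
    """
    image_sources = np.asarray(image_sources, dtype=float)
    amplitudes = np.asarray(amplitudes, dtype=float)
    mic_positions = np.asarray(mic_positions, dtype=float)

    # Distance from every microphone to every image source, shape (M, K).
    dists = np.linalg.norm(
        mic_positions[:, None, :] - image_sources[None, :, :], axis=-1)

    # Sampling instants, shape (n_samples,).
    t = np.arange(n_samples) / fs

    # Ideal low-pass sampling: a sinc centered on each propagation delay.
    kernel = np.sinc(fs * (t[None, None, :] - dists[:, :, None] / c))

    # Spherical spreading attenuates each echo with distance.
    gains = amplitudes[None, :] / (4.0 * np.pi * dists)

    # Sum the contributions of all image sources at each microphone.
    return np.einsum("mk,mkn->mn", gains, kernel)
```

Recovering the image-source positions and amplitudes from such signals is the sparse recovery problem described above; the sketch only generates the data and does not attempt the inversion.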

2.b. From Image Sources to Shoebox Room Parameters
We then developed a new geometrical method that recovers the 18 input parameters of a shoebox room acoustic simulation, given the point cloud of localized image sources obtained from the previous contribution. These parameters are the 3D source position, the 3 dimensions of the room, the 6-degrees-of-freedom room translation and orientation, and an absorption coefficient for each of the 6 room boundaries. The approach proceeds by first identifying the three room axes, then labeling all first-order image sources, and finally deducing all of the remaining parameters from their coefficients, their positions and those of the true source.
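The geometric relation between first-order image sources and walls can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the method of the project: it assumes the room axes have already been identified and aligned with the coordinate axes, that the six first-order image sources have already been isolated, and it ignores absorption coefficients. The function names are hypothetical.

```python
import numpy as np

def first_order_images(source, room_dims):
    """Mirror the source across the six walls of an axis-aligned shoebox
    room whose walls lie at 0 and room_dims[axis] along each axis."""
    source = np.asarray(source, dtype=float)
    images = []
    for axis in range(3):
        for wall in (0.0, room_dims[axis]):
            img = source.copy()
            img[axis] = 2.0 * wall - source[axis]  # reflection across the wall
            images.append(img)
    return np.array(images)  # shape (6, 3)

def room_from_first_order_images(source, images):
    """Recover wall positions from unordered first-order image sources.

    Each wall lies halfway between the true source and its image, and the
    source-to-image offset points along the wall normal.
    """
    source = np.asarray(source, dtype=float)
    walls = [[], [], []]
    for img in np.asarray(images, dtype=float):
        offset = img - source
        axis = int(np.argmax(np.abs(offset)))  # axis of the wall normal
        walls[axis].append(source[axis] + 0.5 * offset[axis])
    lower = np.array([min(w) for w in walls])
    upper = np.array([max(w) for w in walls])
    return lower, upper - lower  # room corner and room dimensions
```

For example, `room_from_first_order_images(src, first_order_images(src, dims))` returns the room corner and `dims` for any source strictly inside the room. The hard parts addressed in the project, namely finding the room axes and labeling image sources within a noisy point cloud, are left out here.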

2.c. Beyond the Shoebox Case with Shape Optimization
We also tackled the reconstruction of more general polygonal room shapes using the notion of shape derivatives and the method of fundamental solutions.

1. Audio Inpainting using Fourier Magnitudes
1.a. An Almost-Uniqueness Result
We showed that if the (discrete) Fourier magnitudes are available and the number of consecutive missing samples is strictly less than one third of the total signal length, a random full signal can be exactly recovered with probability 1 by solving our proposed phase-retrieval formulation.
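One possible way to write this setting in symbols is sketched below. The notation is hypothetical and introduced only as a reading aid (signal length $N$, DFT matrix $\mathbf{F}$, observed magnitudes $\mathbf{b}$, set $\mathcal{M}$ of consecutive missing indices, observed samples $y_n$, entrywise modulus $|\cdot|$); it may differ from the exact formulation used in the project's publications.

```latex
\text{find } \mathbf{x} \in \mathbb{R}^N
\quad \text{such that} \quad
|\mathbf{F}\mathbf{x}| = \mathbf{b}
\quad \text{and} \quad
x_n = y_n \ \ \forall n \notin \mathcal{M},
\qquad \text{with } |\mathcal{M}| < N/3 .
```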

1.b. Least-Squares Fitting of Magnitude Spectrum
Using an alternating minimization scheme initialized by a convex relaxation of our formulation, extensive computational experiments on speech signals revealed that when magnitude spectra are available with sufficient precision (>10 dB signal-to-noise ratio), better inpainting performance is achieved than with a more conventional sparsity prior on the spectra. Further experiments showed that in the noiseless case, near-exact recovery is achieved for over 80% of test signals as long as 30% of samples or fewer are missing.
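The iterative fitting of a magnitude spectrum described above can be illustrated with a minimal Python sketch in the spirit of classical Gerchberg-Saxton alternating projections. It is not the project's algorithm: it uses a random initialization in place of the convex relaxation, works on a single DFT frame, and all function and parameter names are hypothetical.

```python
import numpy as np

def inpaint_from_magnitudes(y, missing, magnitudes, n_iter=500, seed=0):
    """Fill missing samples so that the DFT magnitudes match `magnitudes`.

    y: observed real signal (values at missing positions are ignored)
    missing: boolean mask, True where samples are missing
    magnitudes: modulus of the DFT of the full signal, same length as y
    """
    rng = np.random.default_rng(seed)
    x = np.array(y, dtype=float)
    # Assumption: random start instead of a convex-relaxation initialization.
    x[missing] = rng.standard_normal(int(missing.sum()))
    for _ in range(n_iter):
        # Step 1: keep the current phases, impose the known magnitudes.
        spectrum = np.fft.fft(x)
        phase = np.exp(1j * np.angle(spectrum))
        x_proj = np.real(np.fft.ifft(magnitudes * phase))
        # Step 2: keep observed samples, update only the missing ones.
        x[missing] = x_proj[missing]
    return x

def magnitude_error(x, magnitudes):
    """Relative mismatch between |DFT(x)| and the target magnitudes."""
    return (np.linalg.norm(np.abs(np.fft.fft(x)) - magnitudes)
            / np.linalg.norm(magnitudes))
```

Each iteration cannot increase the magnitude mismatch, but the scheme may stall in a local minimum, which is one reason why a careful initialization matters in practice.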

2. Shoebox room parameter recovery
By combining the super-resolution and geometrical techniques developed in WP2, extensive simulated experiments revealed that near-exact recovery of the 18 room parameters considered from a room impulse response is achieved for a 32-element, 8.4-cm-wide spherical microphone array and a sampling rate of 16 kHz, using fully randomized input parameters within rooms ranging in size from 2x2x2 to 10x10x5 meters. Estimation errors decay towards zero when increasing the array size and sampling rate. These results are strictly limited to simulated data and to shoebox-shaped rooms. Nonetheless, they represent, to our knowledge, the first algorithmic demonstration that the infamously difficult inverse problem of “hearing the shape of a room” is in principle fully solvable over a wide range of configurations.

The overarching goal of project DENISE was to make fundamental methodological contributions to the field of audio signal processing by tackling different sub-problems in audio using recent theoretical and methodological breakthroughs in the field of non-linear inverse problems. With four publications in internationally recognized venues and two more in preparation, this goal was largely fulfilled. In particular, significant strides were made in developing new, accurate and generalizable methods for audio inpainting and acoustic reflector analysis from room impulse responses, together with progress on the theoretical understanding of these problems. This work opened up a number of promising research avenues, particularly within WP2, to which most of the project's resources were devoted.

While the results obtained in WP2 are promising, especially thanks to their potential for fully recovering room parameters, the method cannot work on real RIR measurements, because the underlying model assumes idealized, spike-like early acoustic echoes, whereas measured ones are distorted by the directivity and frequency responses of imperfect sources and microphones. In a follow-up work (2026), we proposed Real2Sim diffusion, a framework to reconcile this mismatch. We trained a diffusion Schrödinger-bridge model to translate RIRs generated by a realistic simulator into RIRs generated by a simplistic simulator. Once trained, the model can translate real measured RIRs into simplified, canonical counterparts that are compatible with the physics-driven inverse method developed in DENISE. We demonstrated this by correctly localizing dozens of image sources of order up to 5 from the early part of real 32-channel RIRs.

These results open the door to a new, data-driven way of bridging the gap between measurements and the physics-based inverse methods developed during the DENISE project. They will unlock many applications of strong industrial interest, including room acoustic diagnosis, spatial audio acquisition and calibration, and acoustic-aware speech enhancement. A recent collaboration with the French company Trinnov and the newly started ANR-PRC AWESOME project will directly build on the foundations laid by DENISE to tackle these new challenges.

=== Project Publications ===
The research carried out during the DENISE project led to the publication of articles in 3 international journals and 1 international (peer-reviewed) conference.
[1] T. Sprunck, A. Deleforge, Y. Privat, and C. Foy, “Gridless 3D recovery of image sources from room impulse responses,” IEEE Signal Processing Letters, vol. 29, pp. 2427–2431, 2022.
[2] L. Bahrman, M. Krémé, P. Magron, and A. Deleforge, “Signal inpainting from Fourier magnitudes,” in 31st European Signal Processing Conference (EUSIPCO), IEEE, 2023, pp. 116–120.
[3] T. Sprunck, A. Deleforge, Y. Privat, and C. Foy, “Fully reversing the shoebox image source method: From impulse responses to room parameters,” IEEE Transactions on Audio, Speech and Language Processing, vol. 33, pp. 1023–1033, 2025.
[4] A. Deleforge, C. Foy, Y. Privat, and T. Sprunck, “Hearing the shape of a cuboid room using sparse measure recovery,” Inverse Problems, vol. 41, no. 9, p. 095002, Sep. 2025.

=== Planned Publications ===
The members of DENISE are currently working on 2 further international journal articles based on the project’s findings.
[5] A. Deleforge, C. Foy, A. Lorrain, Y. Privat, and T. Sprunck, “From sound to shape: polygonal room reconstruction via shape optimization,” submitted journal article, 2026.
[6] M. Krémé, P. Magron, and A. Deleforge, “Magnitude-informed signal inpainting,” journal article in preparation, 2026.

DENISE aims to make fundamental methodological contributions to the field of audio signal processing. It promises savings in data requirements and leaps in performance that, in the long run, would bring strong economic and ecological benefits to the fast-growing application field of audio technologies.

The state of affairs is a ubiquitous and successful use of deep learning methods across all areas of audio signal processing. This is justified by their remarkable ability to approximate arbitrary non-linear functions given sufficiently large training datasets to learn from.

DENISE's central premise, however, is that a number of key sub-problems in audio may be tackled without any learning thanks to recent theoretical and methodological breakthroughs in the field of non-linear inverse problems, which have so far gone largely unnoticed by the audio research community. Fundamental research efforts will be carried out to unlock the full potential of these findings in two emerging and challenging applications: audio inpainting, i.e., the recovery of completely missing samples, and echo-aware multichannel processing.

Far from giving up the power of machine learning, project DENISE advocates the development of hybrid approaches that fully leverage the potential of both analytical and learned solutions, with data-efficiency at its core.

Project coordination

Antoine Deleforge (Centre de Recherche Inria Nancy - Grand Est)

The author of this summary is the project coordinator, who is responsible for its content. The ANR declines any responsibility for its contents.

Partnership

INRIA NGE Centre de Recherche Inria Nancy - Grand Est

ANR funding: 210,336 euros
Beginning and duration of the scientific project: March 2021 - 42 Months
