CE23 - Data, Knowledge, Big Data, Multimedia Content, Artificial Intelligence

Distributed, Personalized, Privacy-Preserving Learning for Speech Processing – DEEP-PRIVACY


The project concerns the development of distributed, personalized, and privacy-preserving approaches for speech recognition. We propose a hybrid approach in which the user's device performs private computations locally and never shares the raw speech data, while inter-user computations (such as model optimization) are performed on a server or a peer-to-peer network, using speech data that are shared only after anonymization.

Objectives

Speech recognition is now used in many applications, such as virtual assistants, which collect, process, and store personal speech data on centralized servers, raising serious concerns about the privacy of their users. Embedded speech recognition frameworks have recently been introduced to address privacy during the recognition phase: a (pre-trained) speech recognition model is shipped to the user's device so that processing can be done locally, without the user sharing their data. However, speech recognition technology still performs poorly in adverse conditions (e.g., noisy environments, reverberant speech, strong accents), and improving it requires large speech corpora that are representative of the actual users and of the various usage conditions. There is therefore a strong need to share speech data for improved training that benefits all users, while keeping the speaker identity and voice characteristics private. Users should also retain control over their data, so that they can decide not to transmit data whose semantic content is sensitive.

In this context, DEEP-PRIVACY proposes a new paradigm based on a distributed, personalized, and privacy-preserving approach. Some processing is performed on the user's device, which guarantees privacy and allows the processing to be customized for better performance. Speech data to be shared on a server or a peer-to-peer network must be anonymized beforehand, so that the information communicated does not expose sensitive speaker information at the acoustic level. This defines the two objectives of the project: the first concerns the learning of privacy-preserving representations of the speech signal; the second concerns distributed algorithms and personalization.

The learning of privacy-preserving representations of the speech signal aims to disentangle the features that expose private information (to be kept on the user's device) from generic features useful for the task of interest (which satisfy some notion of privacy and can thus be shared with servers). For speech recognition, these correspond respectively to the speaker-specific information (to be protected) and the phonetic / linguistic information (to be shared) carried by the speech signal. We will explore several directions, all based on deep learning, and, besides traditional speech and speaker recognition measures, we will also use a formal notion of privacy to assess their performance.
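To make this concrete, here is a minimal sketch of one common disentanglement technique, adversarial training with a gradient reversal layer: a speaker classification branch is attached to the encoder so that the shared representation stays useful for phone recognition while becoming uninformative about speaker identity. This is an illustrative PyTorch sketch with assumed dimensions and a deliberately simplified architecture, not the project's actual model.

```python
import torch
import torch.nn as nn

class GradientReversal(torch.autograd.Function):
    """Identity in the forward pass; negates (and scales) gradients backward."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None  # no gradient w.r.t. lambd

class DisentanglingEncoder(nn.Module):
    """Encoder whose output should help ASR but hinder speaker recognition."""
    def __init__(self, n_feats=40, n_hidden=256, n_phones=40, n_speakers=100):
        super().__init__()
        self.encoder = nn.GRU(n_feats, n_hidden, batch_first=True)
        self.asr_head = nn.Linear(n_hidden, n_phones)    # task branch
        self.spk_head = nn.Linear(n_hidden, n_speakers)  # adversary branch

    def forward(self, feats, lambd=1.0):
        h, _ = self.encoder(feats)                     # (batch, time, n_hidden)
        phone_logits = self.asr_head(h)                # gradients flow normally
        h_rev = GradientReversal.apply(h, lambd)       # gradients flipped here
        spk_logits = self.spk_head(h_rev.mean(dim=1))  # utterance-level pooling
        return phone_logits, spk_logits
```

Minimizing the sum of the phone loss and the speaker loss then trains the speaker head to identify the speaker, while the reversed gradients push the encoder to remove speaker information from the shared representation.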

The distributed approach and the associated personalization rely on the design of efficient distributed algorithms that operate in a setting where sensitive user data is kept on-device, with global components running on servers and personalized components running on personal devices. Data transferred to servers should contain information useful for learning or updating the global components (here, speech recognition models) while preserving user privacy. We will study the convergence guarantees of distributed training algorithms and investigate how much speaker information is carried by the information exchanged during training. Moreover, the personalized components allow better speaker-adapted processing and recognition, by introducing speaker-specific transforms and adapting some model parameters to the speaker. We will also consider a peer-to-peer framework, as an alternative to servers, for data sharing and model training.
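As an illustration of this setting, the sketch below implements one round of a federated-averaging-style scheme with personalized components: each device fine-tunes a local copy of the model on its private data, and only the shared (non-personal) parameters are sent back and averaged on the server, while speaker-specific parameters never leave the device. The names (personal_keys, user_loaders) and the plain FedAvg update are assumptions for illustration; this omits the analysis of how much speaker information the exchanged parameters leak, which is precisely what the project investigates.

```python
import torch

def federated_round(global_model, user_models, user_loaders, personal_keys, lr=0.01):
    """One communication round: local training on-device, then server-side
    averaging of the shared parameters only."""
    shared_updates = []
    for model, loader in zip(user_models, user_loaders):
        # start from the current global weights for the shared part only;
        # personal (speaker-specific) parameters keep their local values
        state = model.state_dict()
        for k, v in global_model.state_dict().items():
            if k not in personal_keys:
                state[k] = v.clone()
        model.load_state_dict(state)

        opt = torch.optim.SGD(model.parameters(), lr=lr)
        for feats, targets in loader:  # private data stays on the device
            opt.zero_grad()
            loss = torch.nn.functional.cross_entropy(model(feats), targets)
            loss.backward()
            opt.step()

        # only the shared parameters are ever communicated to the server
        shared_updates.append({k: v for k, v in model.state_dict().items()
                               if k not in personal_keys})

    # server side: average the shared parameters across users (FedAvg)
    avg = {k: torch.stack([u[k] for u in shared_updates]).mean(dim=0)
           for k in shared_updates[0]}
    global_model.load_state_dict(avg, strict=False)  # personal keys untouched
    return global_model
```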

Intermediate results

Initial research focused on two directions: the study of an adversarial deep learning approach to obtain a representation of the speech signal that is useful for speech recognition but uninformative for speaker verification; and the evaluation and further development of anonymization approaches based on voice conversion techniques, which also rely on deep learning.
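For context, a widely used voice-conversion-based anonymization recipe replaces each utterance's speaker embedding (x-vector) with a pseudo-speaker embedding averaged from distant speakers in an external pool before resynthesizing the speech. The sketch below illustrates only that selection step, assuming cosine distance in place of a PLDA-based one and hypothetical pool sizes; it is an illustration of the idea, not the project's exact method.

```python
import numpy as np

def pseudo_speaker_embedding(src, pool, n_far=200, n_avg=100, rng=None):
    """Pick the n_far pool embeddings farthest from the source speaker,
    randomly keep n_avg of them, and average them into a pseudo-speaker."""
    rng = rng or np.random.default_rng()
    src_n = src / np.linalg.norm(src)
    pool_n = pool / np.linalg.norm(pool, axis=1, keepdims=True)
    dist = 1.0 - pool_n @ src_n                # cosine distance to the source
    far = np.argsort(dist)[-n_far:]            # farthest candidate speakers
    chosen = rng.choice(far, size=n_avg, replace=False)
    return pool[chosen].mean(axis=0)

# usage with random stand-ins for 512-dimensional x-vectors
rng = np.random.default_rng(0)
pool = rng.standard_normal((5000, 512))        # external speaker pool
src = rng.standard_normal(512)                 # source speaker embedding
anonymized = pseudo_speaker_embedding(src, pool, rng=rng)
```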

A protocol for evaluating anonymization performance has been developed, and several attack scenarios have been defined, taking into account the different levels of knowledge an attacker might have. Several privacy metrics have been compared and evaluated.
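One standard privacy metric in such a protocol is the equal error rate (EER) of an attacker's speaker verification system on anonymized speech: the closer the EER gets to 50%, the less the attacker can re-identify speakers. Below is a minimal sketch of its computation from attacker scores on synthetic data; the threshold sweep is a simple approximation, not any specific toolkit's implementation.

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """EER: the operating point where the false rejection rate (targets
    scored below the threshold) equals the false acceptance rate
    (non-targets scored above it)."""
    scores = np.concatenate([target_scores, nontarget_scores])
    labels = np.concatenate([np.ones_like(target_scores),
                             np.zeros_like(nontarget_scores)])
    order = np.argsort(scores)                 # sweep thresholds low to high
    labels = labels[order]
    frr = np.cumsum(labels) / labels.sum()
    far = 1.0 - np.cumsum(1 - labels) / (1 - labels).sum()
    i = np.argmin(np.abs(frr - far))           # where the two rates cross
    return (frr[i] + far[i]) / 2

# toy check: overlapping genuine / impostor score distributions
rng = np.random.default_rng(0)
genuine = rng.normal(1.0, 1.0, 1000)
impostor = rng.normal(0.0, 1.0, 1000)
print(f"attacker EER: {equal_error_rate(genuine, impostor):.1%}")
```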

(The project is ongoing.)

Publications are available on the project web site.


Project coordinator

Mr Emmanuel VINCENT (Centre de Recherche Inria Nancy - Grand Est)

This summary was written by the project coordinator, who is responsible for its content. The ANR accepts no responsibility for it.

Partners

Inria Centre de Recherche Inria Nancy - Grand Est
LIUM Laboratoire d'Informatique de l'Université du Mans
MAGNET Machine Learning in Information Networks
LIA Laboratoire Informatique d’Avignon

ANR grant: 611,604 euros
Beginning and duration of the scientific project: - 48 months
