TSIA - Giga-modèles - Thématiques Spécifiques en Intelligence Artificielle (Giga-modèles pour le traitement automatique du langage naturel et des données multimodales)

Construction and evaluation of multimodal and inclusive large language models (written, oral, pictograms) for general and clinical French – Pantagruel

Submission summary

The Pantagruel project is an ambitious initiative that aims to develop and evaluate multimodal (written, spoken, pictograms) and inclusive language models for French. The project draws on the expertise of researchers from different disciplines, including computer science, signal processing, sociology, and linguistics, to ensure a diversity of perspectives as well as the reliability and relevance of the results. The main contribution of the project is the development of freely available self-supervised models for French, each covering one to three of these modalities, for the general and clinical domains. The project will not only produce models but also design test benches to evaluate how well such models generalize, building on the experience gained in the FlauBERT and LeBenchmark projects. Part of the project will be devoted to the biases and stereotypes conveyed in the training corpora and in the downstream models. An ethics committee will help limit the amplification of bias within the training corpora, in particular by working on the demographic characteristics of the speakers (for audio or transcribed speech) and of the authors (for part of the written data). We will thus be able to compare models trained on corpora in which the proportions of these characteristics, such as gender, vary. This study will quantify the extent to which the predictions of the language models faithfully reflect the upstream corpora, and will make it possible to better control how these models can be used as research tools in the social sciences. The project will develop software components that facilitate the integration of language models into various applications and allow the development of innovative solutions that exploit the power of multimodal French language models. These tools are intended in particular for non-computer scientists, such as members of the consortium (sociologists, linguists, doctors, speech therapists), researchers from other fields, or artists.
The Pantagruel project thus has the potential to significantly advance the state of the art in multimodal language models and to disseminate the use of these models across a wide range of applied fields, from healthcare to the humanities and the social sciences.

Project coordination

Didier Schwab (Laboratoire d'Informatique de Grenoble)

The author of this summary is the project coordinator, who is responsible for its content. The ANR declines any responsibility for its contents.

Partners

LLF Laboratoire de Linguistique Formelle
CREST Centre de Recherche en Economie et Statistique
INA Institut national de l'audiovisuel
LIG Laboratoire d'Informatique de Grenoble
LIA Laboratoire d'Informatique d'Avignon

ANR funding: 599,996 euros
Beginning and duration of the scientific project: September 2023 - 36 months
