CE23 - Artificial Intelligence 2021

SEmantic LEXicon INduction for Interpretability and diversity in text processing – SELEXINI

Submission summary

Despite great enthusiasm for deep learning in NLP, concern is rising about its limitations. First, neural models are often black boxes whose behavior is hard to interpret. Second, benchmark-based evaluation overlooks biases, which calls into question the robustness and coverage of the resulting generalisations and yields a landscape of limited overall diversity. The goal of the SELEXINI project is to address these issues by developing **weakly supervised methods to induce semantic lexicons** from raw corpora, which will then be **seamlessly integrated with semantic text processing models**. Lexical units are seen as useful abstractions that can represent complex phenomena (e.g. polysemy, similarity, multiword units) and that are associated with interpretable labels, avoiding the overhead and opacity of contextualized embeddings (one vector per occurrence). Moreover, our lexicon will combine continuous data (embeddings, clusters) and symbolic data (labels). We will model single and multiword units, their senses, and their semantic frames (arguments, roles). Hence, we propose a new "by-construction" view on interpretability, which can be seen as an alternative to methods that try to dissect complex neural models. For extrinsic evaluation of interpretability and diversity, the induced lexicon will be integrated into standard deep learning models in downstream tasks requiring semantic information: machine reading comprehension and multiword expression identification. We will develop an experimental protocol to assess the lexicon-corpus complementarity on diverse linguistic phenomena, and to assess the lexicon's usefulness for non-expert end users requiring interpretable results. We expect that this original approach will increase both the interpretability of models and the coverage of diverse phenomena (e.g. items that are rare or unseen in training data).

Marie CANDITO (Laboratoire de Linguistique Formelle)

The author of this summary is the project coordinator, who is responsible for its content. The ANR declines any responsibility for its contents.

LIFAT Laboratoire d'Informatique Fondamentale et Appliquée de Tours
ATILF Analyse et Traitement Informatique de la Langue Française
LLF Laboratoire de Linguistique Formelle
LISN Laboratoire Interdisciplinaire des Sciences du Numérique
LIS Laboratoire d'Informatique et Systèmes

ANR funding: 678,190 euros
Start and duration of the scientific project: March 2022 - 48 months
