DS0707 - Interactions between the physical world, humans and the digital world

Syntactic parsing and multiword expressions in French – PARSEME-FR

Submission summary

The PARSEME-FR project aims to improve the linguistic representativeness, precision and computational efficiency of Natural Language Processing (NLP) applications, notably parsing. The project focuses on the major bottleneck of these applications: multiword expressions (MWEs), i.e. groups of words with a certain degree of idiomaticity, such as “hot dog”, “to kick the bucket”, “San Francisco 49ers” or “to take a haircut”.
Despite recent advances, the state of the art in MWE representation and processing remains largely unsatisfactory. Current research on MWEs concentrates either on creating MWE lexicons or on the automatic recognition of MWEs in text. Only a few approaches address the links between MWEs and a comprehensive linguistic analysis of text. These approaches confirm that proper treatment of MWEs increases both linguistic precision and robustness, but they are mostly limited to certain MWE classes and to syntactic parsing. This unsatisfactory state is mainly due to a lack of linguistic resources encoding the MWE information needed to feed linguistic analysers (in particular, parsers). For French, such resources exist but are incomplete in terms of syntactic and semantic representation, coverage and/or suitability for use in NLP tools.
In this project, we propose to bridge the gap between linguistic precision and computational efficiency in NLP applications by investigating the syntactic and semantic representation of MWEs in language resources, the integration of MWE analysis into (deep) syntactic parsing, and its links to semantic processing. Expected deliverables include enhanced language resources (lexicons, grammars and annotated corpora), MWE-aware (deep) parsers, and tools linking predicted MWEs to knowledge bases. This proposal is a spin-off of the European IC1207 COST action PARSEME on the same topic.

Mathieu CONSTANT (CNRS DR CENTRE EST)

The author of this summary is the project coordinator, who is responsible for its content. The ANR declines any responsibility for its contents.

LLF Laboratoire de Linguistique Formelle
CNRS DR CENTRE EST
LIF Laboratoire d'Informatique Fondamentale de Marseille
LIFO Laboratoire d'Informatique Fondamentale d'Orléans
Inria Paris - Rocquencourt Centre Inria Paris - Rocquencourt
LI Laboratoire d’Informatique de l’Université de Tours
LIGM Laboratoire d'informatique Gaspard-Monge

ANR funding: 732,025 euros
Beginning and duration of the scientific project: December 2015 - 48 months
