CE27 - Culture, création, patrimoine

Increase the DIgital VITALity and visibility of languages of France: linguistic descriptions and annotated corpora – DIVITAL

Submission summary

Digital resources such as lexicons, dictionaries and text corpora, both raw and enriched with linguistic annotations, play a key role in enabling better inclusion of regional and minority languages in the digital world. Yet the gap between languages with many resources (fewer than ten languages) and ‘low-resource’ languages remains wide. This gap is also documented in France, where regional languages are found to be very poorly equipped with digital resources and tools compared to French. In this project, we will focus on four low-resource languages of France: Alsatian, Corsican, Occitan and Poitevin-Saintongeais.
From a theoretical point of view, the project will integrate and reassess existing linguistic knowledge about these languages, in comparison to other related languages. The goal will be to produce comprehensive and up-to-date descriptions, which can be captured in annotation guidelines.
The project also aims to increase awareness of regional languages of France in the linguistics and computational linguistics research communities, by augmenting, collecting and building much-needed unlabelled and labelled datasets. The corpora will integrate genres that approach or transcribe the oral language, e.g., theatre plays or narrative ethnotexts, as well as parallel translated documents. The labelled datasets will take the form of Universal Dependencies (UD) corpora. The use of the UD framework is motivated by its wide adoption by the natural language processing community and the many tools and guidelines already available.
Finally, the project will investigate how to share and transfer experience and tools between languages. This should allow the less advanced languages to benefit from the experience of the others and so accelerate their development. Beyond the concrete and immediate achievements for the languages represented in this project, the aim is also to build methodologies that can be applied to other less-resourced languages. This is also a means to build a community of researchers who work on less-resourced languages of France and neighbouring regions.

Project coordination

Delphine BERNHARD (Linguistique, Langues et Parole (EA 1339 - UR 1339 since 01.01.2020))

The author of this summary is the project coordinator, who is responsible for its content. The ANR declines any responsibility for its contents.

Partnership

LiLPa Linguistique, Langues et Parole (EA 1339 - UR 1339 since 01.01.2020)
FORELLIS FORMES ET REPRESENTATIONS EN LINGUISTIQUE, LITTERATURE ET DANS LES ARTS DE L'IMAGE ET DE LA SCENE
CLLE COGNITION, LANGUES, LANGAGE, ERGONOMIE
LISA UMR LIEUX, IDENTITES, ESPACES, ACTIVITES

ANR funding: 413,632 euros
Beginning and duration of the scientific project: December 2021 - 48 Months
