CE27 - Culture, creation, heritage 2021

Increase the DIgital VITALity and visibility of languages of France: linguistic descriptions and annotated corpora – DIVITAL


DIVITAL: Giving France's regional languages (Alsatian, Corsican, Occitan, Poitevin-Saintongeais) a place in the digital age. By creating and enriching structured linguistic corpora (parallel corpora and annotated corpora) for these under-resourced languages, the project counters their invisibility and contributes to their vitality by providing digital resources that are essential for the development of Natural Language Processing tools.

Issues and objectives

The DIVITAL project is taking place in a context where regional languages of France, here Alsatian, Corsican, Occitan and Poitevin-Saintongeais, are suffering from marginalisation and a significant digital divide compared to French. The lack of digital resources and tools is a major obstacle to the development of Natural Language Processing (NLP) for these languages. The overall objective of the DIVITAL project is to increase the vitality and digital visibility of these four under-resourced languages. The work aims to produce up-to-date linguistic descriptions, create resources and raise awareness among the NLP community about the specific issues affecting these languages.

The major challenge is the lack of annotated corpora and lexicons, making manual annotation tasks time-consuming and costly in terms of human and financial resources. The absence of stable spelling standards and the high internal (diatopic) variation of Alsatian, Corsican, Occitan and Poitevin-Saintongeais complicate the exploitation of texts and annotation. Furthermore, the absence of standardised codes (such as ISO 639-3) for certain languages (notably Poitevin-Saintongeais) leads to their invisibility in the digital space, complicating their documentation and inclusion in global language inventories.

To meet these challenges, the project implemented a dual strategy of resource creation and methodological innovation (see Methods section). The resources created can be used as training data for the future development of NLP tools for these languages. Furthermore, by creating resources in new text genres (legal, contemporary topics), the project promotes new practices that move away from a simple heritage-based approach. Overall, the project helps to reduce inequalities between languages by filling the gap in digital resources.

Methods

The main approach consists in creating parallel corpora through human translation of various texts (literary, legal, journalistic) from French into the four regional languages. This process enriches the corpora with new non-narrative and contemporary genres. Some of these texts are then annotated according to an international standardised framework called Universal Dependencies, which describes the grammatical structure of sentences (word categories and dependency relationships).
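To make the annotation format more concrete, the sketch below reads a small fragment in CoNLL-U, the tab-separated file format used for Universal Dependencies treebanks, and prints each word with its category and its syntactic head. The sentence is a made-up English example, not taken from the DIVITAL corpora, and the `Token` and `parse_conllu` names are hypothetical helpers introduced only for this illustration.

```python
from typing import NamedTuple

# Illustrative CoNLL-U fragment (hypothetical English example, not project data).
# Columns: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC
CONLLU = """\
# text = The dog barks.
1\tThe\tthe\tDET\t_\t_\t2\tdet\t_\t_
2\tdog\tdog\tNOUN\t_\t_\t3\tnsubj\t_\t_
3\tbarks\tbark\tVERB\t_\t_\t0\troot\t_\t_
4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_
"""


class Token(NamedTuple):
    id: int
    form: str
    upos: str    # word category (universal part-of-speech tag)
    head: int    # ID of the governing word, 0 for the root
    deprel: str  # dependency relation to the head


def parse_conllu(block: str) -> list[Token]:
    """Parse one CoNLL-U sentence, keeping only the columns used here."""
    tokens = []
    for line in block.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue  # skip multiword ranges (e.g. 1-2) and empty nodes (e.g. 1.1)
        tokens.append(Token(int(cols[0]), cols[1], cols[3], int(cols[6]), cols[7]))
    return tokens


tokens = parse_conllu(CONLLU)
for t in tokens:
    head = "ROOT" if t.head == 0 else tokens[t.head - 1].form
    print(f"{t.form} ({t.upos}) --{t.deprel}--> {head}")
```

Each token line carries both layers mentioned above: the `UPOS` column holds the word category and the `HEAD`/`DEPREL` columns hold the dependency relationship.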

To streamline the corpus annotation process, the corpora are pre-annotated automatically and then corrected manually by linguists using annotation guides. The technologies tested for pre-annotation are based on knowledge transfer from closely related languages (such as German for Alsatian or Italian for Corsican). Various strategies are used to handle variation and significantly improve the quality of pre-annotation, including simple standardisation methods and bilingual lexicons.
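One way to picture the combination of simple standardisation and bilingual lexicons is sketched below: each token is normalised, then looked up in a small lexicon that maps it to a form in the related language, so that a tagger or parser trained on that language could be applied afterwards. This is a minimal sketch under stated assumptions, not the project's actual pipeline: the function names `normalise` and `to_pivot` are hypothetical, the two Alsatian-German pairs are illustrative entries rather than items from the project lexicons, and the downstream parsing step is not shown.

```python
import unicodedata


def normalise(token: str) -> str:
    """Very simple standardisation: Unicode NFC form plus case folding."""
    return unicodedata.normalize("NFC", token).casefold()


# Illustrative Alsatian -> German entries (assumption: not from the project lexicons).
RAW_LEXICON = {"Hüs": "Haus", "isch": "ist"}
LEXICON = {normalise(source): target for source, target in RAW_LEXICON.items()}


def to_pivot(tokens: list[str], lexicon: dict[str, str]) -> list[str]:
    """Replace each token by its related-language form when the lexicon has one.

    Tokens missing from the lexicon are returned unchanged, so the output
    always has the same length and order as the input.
    """
    return [lexicon.get(normalise(tok), tok) for tok in tokens]


pivot = to_pivot(["Hüs", "ISCH", "xyz"], LEXICON)
print(pivot)
```

Because the output stays aligned token by token with the input, any labels predicted on the related-language forms could be copied back to the original words before manual correction.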

To manage the complexity and dialectal diversity of these languages, the project emphasises precise and detailed documentation of the collected data (metadata) using a data management system. This system records detailed information on the origin of the texts, the authors, the genres, and the dialectal varieties, which is crucial for languages with extensive geographical and written variation.
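As a rough illustration of the kind of record such a system might store, the sketch below defines one metadata entry per text and filters a collection by language and dialectal variety. The field names, the `TextRecord` class, the `by_variety` helper and the placeholder values are all assumptions made for this example; the source does not describe the actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextRecord:
    """Hypothetical metadata record for one text in the corpus."""
    title: str
    language: str   # e.g. "Alsatian", "Corsican", "Occitan", "Poitevin-Saintongeais"
    variety: str    # dialectal variety, free text in this sketch
    genre: str      # e.g. "literary", "legal", "journalistic"
    author: str
    origin: str     # where the text comes from (publisher, archive, translator)


def by_variety(records: list[TextRecord], language: str, variety: str) -> list[TextRecord]:
    """Return the records for one language and one dialectal variety."""
    return [r for r in records if r.language == language and r.variety == variety]


# Placeholder values only: none of these describe real corpus entries.
records = [
    TextRecord("Text A", "Occitan", "variety-1", "legal", "unknown", "placeholder"),
    TextRecord("Text B", "Occitan", "variety-2", "literary", "unknown", "placeholder"),
    TextRecord("Text C", "Corsican", "variety-1", "journalistic", "unknown", "placeholder"),
]

selected = by_variety(records, "Occitan", "variety-1")
print([r.title for r in selected])
```

Keeping the variety as an explicit field makes it possible to select or compare texts by dialect before annotation, which matters when spelling and vocabulary differ from one variety to another.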

Results

The work carried out as part of the DIVITAL project has produced major concrete achievements.

The main outcome is the creation of structured linguistic resources for these languages, thus filling a gap in relation to French.

The project has created the first syntax-annotated corpora according to the international Universal Dependencies guidelines for Alsatian and Poitevin-Saintongeais. For Alsatian, the corpus comprises 977 sentences, or nearly 20,000 words. For Poitevin-Saintongeais, the corpus contains 239 sentences, or approximately 5,500 words. For Corsican, a corpus of 500 sentences has been annotated in grammatical categories according to the Universal Dependencies guidelines.

A unique translation corpus linking the four languages of the project and French has been compiled and made available via the ParCoLab platform, developed by one of the project teams. This corpus includes contemporary, non-narrative texts such as the Universal Declaration of Human Rights and newspaper opinion pieces. Monolingual corpora have also been made available on ParCoLab, which has greatly increased the volume of texts available. This approach allows regional languages to be used and documented in new areas.

Prospects

The project's work paves the way for numerous practical applications, improvements to existing digital tools, and original avenues of research, particularly for languages with limited resources.

The annotated corpora created are valuable resources for research in syntax, comparative linguistics, and the development of up-to-date linguistic descriptions.

The ParCoLab web platform, which hosts the DIVITAL parallel corpus, is designed to be a practical tool for teachers, language learners, and translators. The corpus can be used in primary, secondary or university education for language comparison activities. The approach of creating new resources by translating modern, non-narrative texts makes it possible to provide new data that goes beyond a simple heritage-based view of these languages.

Annotated corpora are an essential basis for the development of digital applications and resources for these minority languages, such as machine translation, the creation of multilingual lexicons and the training of syntactic parsers.

In general, the methodologies developed can serve as an example for other regional languages with limited resources.

Submission summary

Digital resources such as lexicons, dictionaries and text corpora, both raw and enriched with linguistic annotations, play a key role in enabling better inclusion of regional and minority languages in the digital world. Yet the gap between languages with many resources (fewer than ten languages) and ‘low-resource’ languages remains wide. This gap is also documented in France, where regional languages are found to be very poorly equipped with digital resources and tools compared to French. In this project, we will focus on four low-resource languages of France: Alsatian, Corsican, Occitan and Poitevin-Saintongeais.
From a theoretical point of view, the project will integrate and reassess existing linguistic knowledge about these languages, in comparison to other related languages. The goal will be to produce comprehensive and up-to-date descriptions, which can be captured in annotation guidelines.
The project also aims to increase awareness about regional languages of France in the linguistics and computational linguistics research communities, by augmenting, collecting and building much needed unlabelled and labelled datasets. The corpora will integrate genres that approach or transcribe the oral language, e.g., theatre plays or narrative ethnotexts, as well as parallel translated documents. The labelled datasets will take the form of Universal Dependencies (UD) corpora. The use of the UD framework is motivated by its large adoption by the natural language processing community and the many tools and guidelines already available.
Finally, the project will investigate how to share and transfer experience and tools between languages. This should enable those languages which are less advanced to be pulled upwards and thus benefit from the experience of others to accelerate their development. Beyond the concrete and immediate achievements for the languages represented in this project, the aim is also to build methodologies that can be used and applied to other less-resourced languages. This is also a means to build a community of researchers who work on less-resourced languages of France and neighbouring regions.

Delphine BERNHARD (Linguistique, Langues et Parole (EA 1339 - UR 1339 since 01.01.2020))

The author of this summary is the project coordinator, who is responsible for its content. The ANR accepts no responsibility for its contents.

LiLPa Linguistique, Langues et Parole (EA 1339 - UR 1339 since 01.01.2020)
FORELLIS FORMES ET REPRESENTATIONS EN LINGUISTIQUE, LITTERATURE ET DANS LES ARTS DE L'IMAGE ET DE LA SCENE
CLLE COGNITION, LANGUES, LANGAGE, ERGONOMIE
LISA UMR LIEUX, IDENTITES, ESPACES, ACTIVITES

ANR funding: 413,632 euros
Start date and duration of the scientific project: December 2021, 48 months
