CE38 - Révolution numérique : rapports au savoir et à la culture 2019

Computational Language Documentation by 2025 – CLD2025

Submission summary

The main objective of the CLD2025 project is to facilitate the urgent task of documenting endangered languages by leveraging the potential of computational methods. A breakthrough is now possible: machine learning tools (such as artificial neural networks and Bayesian models) have improved to the point where they can effectively help with linguistic annotation tasks such as automatic transcription of audio recordings, automatic glossing of texts, and automatic word discovery. Thorough documentation of the world’s dwindling linguistic diversity is much more feasible with these tools than under a manual workflow. For instance, manual transcription of 50 hours of speech (a sizeable fieldwork corpus) can take hundreds of hours of work, creating a bottleneck in the language documentation workflow. Another key task, referred to in linguistics as interlinear glossing (in a nutshell: word-by-word translation/annotation), is even more time-consuming, and is moreover difficult to perform manually with the required level of consistency. Models created through machine learning have the potential to aid in these time-consuming and difficult tasks. Yet Natural Language Processing (NLP) remains little used in language documentation, for several reasons: the technology is still new (and evolving rapidly), user-friendly interfaces are still under development, and there are few case studies demonstrating practical usefulness in a low-resource setting. Field linguists typically rely on manual methods throughout the documentation process. The objective of the CLD2025 project is therefore to enable the adoption of these techniques in the mid term (by 2025), through the co-construction of models and tools by field linguists and computational linguists and through the development of interfaces and systems that field linguists can use in practice.
We build on the achievements of the BULB project in terms of corpora and modes of acquisition, as well as the development of models for transcription and segmentation. We are not developing new corpora here, but rather focusing on how to exploit existing ones. We address automatic processing problems (phoneme and tone transcription, unit discovery, automatic glossing), some of which are original (tonal transcription, automatic glossing), and validate our approaches on endangered languages of widely varying types: Mboshi (Bantu C25); Kakabe (Mande); Yongning Na (Mosuo), a Sino-Tibetan language; and three Nakh-Daghestanian languages, Khinalug, Kryz (Kryts) and Budugh. We will also carry the results of the improved automatic processing over to the linguistic work itself: the automatic speech and language processing mechanisms and their results will be used to explore phonetic-phonological issues at the segmental, supra-segmental and tonal levels of the languages addressed in the project.
Finally, from the beginning of the project, the focus will be on the usability of the tools and models developed. This point highlights the fundamentally interdisciplinary nature of the work carried out here by computational scientists and field linguists. To this end, a recognized field linguist will work full-time on the project and will contribute, through her experience and expertise, to the definition, development and evaluation of the different systems developed in the project.

Gilles ADDA (Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur)

The author of this summary is the project coordinator, who is responsible for its content. The ANR declines any responsibility for its contents.

LACITO Laboratoire de Langues & Civilisations à Tradition Orale
LIMSI Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur
LIG Laurent Besacier
KIT Karlsruher Institut für Technologie (KIT) / Institut für Anthropomatik (IFA)
LPP Laboratoire de Phonétique et Phonologie
EmpSprWiss Universität Frankfurt / Institut für Empirische Sprachwissenschaft

ANR funding: 464,667 euros
Beginning and duration of the scientific project: February 2020 - 36 months
