DS0807 - 2016

PRocessing Old French Instrumented TExts for the Representation Of Language Evolution – PROFITEROLE

Submission summary

The PROFITEROLE project has three closely related goals in the fields of linguistics and Natural Language Processing (NLP). First, it seeks to model certain morphological and syntactic aspects of the diachronic evolution of French. Second, it aims to develop a methodology for exploring and annotating heterogeneous linguistic data, while providing automatic analysers for various stages of the French language. Finally, it aims to expand linguistic resources for French by building a large annotated corpus (1 M words) of Medieval French (9th-15th centuries) and morphological lexicons covering several stages of French.
The Medieval period is critical for the study of the evolution of French: it is during this period that most core morphological and syntactic changes were initiated and began to spread throughout the language. Focusing on this chronological span therefore gives us better insight into the evolution of French and a better understanding of certain mechanisms of change that have also taken place in other languages. Material constraints have so far limited data mining and other large-scale analyses of French diachronic text collections, which call for partially automated exploitation of the data. This holds true especially for the Medieval period. The emergence in 2013 of the Syntactic Reference Corpus of Medieval French (SRCMF) has opened up new perspectives for both linguistic and NLP issues. The SRCMF is an Old French (9th-13th centuries) treebank annotated with fine-grained syntactic dependency structures, with each of the 251,000 words carrying a manually checked POS tag and a syntactic function.
Old French is characterized by much greater variation than Modern French on both the grapho-phonetic and syntactic levels. This variation is internal to the language, but it can also be seen as external, since it operates between texts that differ in external variables (such as date, dialect, domain-genre, form, or register), with date being the main parameter of variation. The SRCMF is well suited to the study of variation in both its internal and external dimensions, as well as of the possible correlations between the two, a hitherto understudied domain of syntax, especially with regard to the progressive fixation of word order in the history of French, which will be our main linguistic focus. The complexity of the task requires sophisticated statistical and computational techniques. The strong variation in Medieval French complicates the identification and delimitation of its successive stages, while also being a critical factor in the passage from one stage to the next. A significant increase in the amount of processed data is therefore crucial for a better understanding of language change. Yet this same multi-faceted heterogeneity of the data poses a major challenge for automatic text enrichment.
Building on the SRCMF and on the diachronic morphological lexicons designed throughout the project, we will develop an annotation methodology capable of handling this variability, exploring two distinct approaches: the first relies on manually crafted symbolic parsers, and the second is based on Machine Learning. These automatic analysis tools and resources, accurately and easily configurable for various stages of Medieval French, will then serve to automatically explore and annotate additional Old French plain texts. All the annotated data and tools will provide new and valuable linguistic insights for the study of diachronic variation, and will more generally contribute to our understanding of heterogeneous data and their computational processing, both for past stages of French and for present-day stages of languages.
The collaboration of international specialists in historical linguistics, digital humanities and natural language processing is an important asset for the success of this project.

Sophie Prévost (Langues, Textes, Traitements automatiques, Cognition)

The author of this summary is the project coordinator, who is responsible for its content. The ANR declines any responsibility for its contents.

ICAR Interactions, Corpus, Apprentissages, Représentations
UPD Université Paris Diderot
LaTTiCe Langues, Textes, Traitements automatiques, Cognition

ANR funding: 371,518 euros
Beginning and duration of the scientific project: February 2017 - 42 months
