Access ERC - Access ERC 2023

Modeling Allusions in Voltaire and the Enlightenment with Neural networks – MAVEN

Submission summary

Modeling Allusions in Voltaire and the Enlightenment with Neural networks (MAVEN) aims to close the NLP gap between Latin and well-resourced modern languages and, at the same time, to meet the need for a digital tool capable of tracing classical Latin ideas in French texts. My long-term goal is to create an open-access tool capable of three research tasks: 1) automatically locating classical allusions in Enlightenment literature (French-to-Latin); 2) tracing ideas from the classical period down to the Enlightenment (Latin-to-French); 3) searching the canonical Latin texts by subject matter (Latin-to-Latin). These tasks will be made possible by a novel deep-learning model capable of processing both 18th-century French and classical Latin. At the end of my fellowship I will deliver a proof-of-concept version of this language model and a foundational publication for my project. Through a subsequent ERC Starting Grant, I will expand MAVEN from a proof-of-concept model to an open-access web tool with a user interface, database infrastructure, and an improved search algorithm. For this ERC grant application to be successful, I need to establish my credentials in the field of artificial intelligence, build a network of international collaborators experienced in applying deep learning techniques, and mitigate the risks inherent in my long-term research program by delivering a proof-of-concept model.

I will begin building my proof-of-concept model by training a multilingual neural network in three steps. First, CamemBERT, a transformer-based model of modern French, will be fine-tuned on a hand-corrected corpus of 65 million words of 18th-century French; if necessary, this set of texts can be expanded to billions of words using uncorrected digital scans (OCR data). This data will be supplied by the Observatoire des Textes, Idées, et Corpus at Sorbonne University (ObTIC). Second, the resulting ‘18th-century CamemBERT’ model will be trained in sentence vectorization, that is, to represent each sentence as a single vector. This requires a set of pairs of French sentences rated according to their degree of semantic similarity, a dataset that will be created for this project. Third, this model of 18th-century French will be aligned with an existing model of Latin. This requires approximately 1 million sentences of French translated into Latin, which will be assembled for this project.
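To make the third step more concrete, the following is a minimal sketch of one simple way to align two embedding spaces from paired sentences: fitting a linear map from Latin vectors to French vectors by gradient descent on the mean squared error. It is an illustration under stated assumptions, not the method MAVEN will necessarily use. The function names (`matvec`, `fit_linear_map`), the learning rate, and the two-dimensional toy vectors are all hypothetical; real sentence embeddings would have hundreds of dimensions and come from the trained models.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def fit_linear_map(pairs, dim, lr=0.1, steps=500):
    """Fit W so that W @ latin_vec is close to french_vec for each pair.

    pairs: list of (latin_vec, french_vec) tuples, standing in for the
    embeddings of a Latin sentence and its French counterpart.
    Plain gradient descent on mean squared error (an assumed choice).
    """
    W = [[0.0] * dim for _ in range(dim)]
    n = len(pairs)
    for _ in range(steps):
        grad = [[0.0] * dim for _ in range(dim)]
        for x, y in pairs:
            pred = matvec(W, x)
            for i in range(dim):
                err = pred[i] - y[i]
                for j in range(dim):
                    grad[i][j] += 2.0 * err * x[j] / n
        for i in range(dim):
            for j in range(dim):
                W[i][j] -= lr * grad[i][j]
    return W


# Toy paired "embeddings": the French space is the Latin space rotated by 90 degrees.
pairs = [
    ([1.0, 0.0], [0.0, 1.0]),
    ([0.0, 1.0], [-1.0, 0.0]),
    ([1.0, 1.0], [-1.0, 1.0]),
]
W = fit_linear_map(pairs, dim=2)

# Map an unseen Latin vector into the French space.
mapped = matvec(W, [2.0, 0.0])
```

Once such a map (or a jointly trained model playing the same role) is available, a Latin sentence vector and a French sentence vector can be compared directly in one space.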

The resulting neural network model will be capable of aligning French and Latin sentence embeddings; however, research has shown that a combination of neural networks and topic modeling significantly improves results. The efficacy of traditional topic modeling is limited by two factors: first, a traditional topic model cannot handle words that are absent from its training data, and second, the number of Latin documents available for training a model is limited. By combining French topic models with my neural network model, both of these limitations can be overcome. The resulting ‘cross-language contextualized topic model’ (CLCTM) will be able to align sentences of French with sentences of Latin, as well as French with French and Latin with Latin. This will be the technology at the core of MAVEN. To create this model, I will rely on the expertise of Glenn Roe, who is a leader in digital humanities and 18th-century French, and Nicholas Benoit at the Sorbonne Center for Artificial Intelligence, who is an expert in AI modeling and has offered me the use of the MeSU computing cluster.

Once the MAVEN core technology is complete, I will test its ability to recover allusions to Vergil’s Aeneid in Voltaire’s La Henriade. Because the Henriade makes many allusions to the Aeneid, it is an ideal test case for MAVEN. In a scientific article I will describe 1) the trained MAVEN CLCTM, and 2) its success in finding cross-language allusions that do not depend on direct translation. This article will be submitted to top-tier journals in order to bolster my credentials for an ERC Starting Grant application.
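The retrieval test described above can be pictured as a nearest-neighbor search: a French sentence is embedded, and Latin sentences are ranked by how close their vectors lie to it. The sketch below assumes cosine similarity as the ranking score, which the summary does not specify. The function names, the candidate labels, and the three-dimensional toy vectors are hypothetical placeholders for embeddings that the trained model would produce.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(query_vec, candidates, top_k=3):
    """Return the top_k (score, label) pairs, best match first."""
    scored = [(cosine(query_vec, vec), label) for label, vec in candidates.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Toy stand-ins for embedded Latin sentences (hypothetical labels and values).
latin_sentences = {
    "latin_sentence_a": [0.9, 0.1, 0.0],
    "latin_sentence_b": [0.0, 1.0, 0.0],
    "latin_sentence_c": [0.1, 0.0, 0.9],
}

# Toy stand-in for one embedded French sentence.
french_query = [1.0, 0.2, 0.0]

ranking = rank_candidates(french_query, latin_sentences, top_k=2)
for score, label in ranking:
    print(f"{label}: {score:.3f}")
```

In an evaluation of this kind, the highest-ranked Latin passages for each French line would be compared against allusions already identified by scholars.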

James Gawley (Centre d'étude de la langue et des littératures françaises)

The author of this summary is the project coordinator, who is responsible for its content. The ANR declines any responsibility for its contents.

CELLF Centre d'étude de la langue et des littératures françaises

ANR funding: 169,945 euros
Beginning and duration of the scientific project: - 24 Months
