Our main objective has been to develop a working method for studying texts in 5 languages (including 2 Semitic languages) and processing them in a uniform way. Indexing guides and keywords, the result of a collaborative effort, guide the collaborators, who work on the texts from a common platform. Because all the texts are indexed in the same way in XML-TEI and gathered in one place (the working platform), their processing supports the second objective of our project: calculating the similarities between sapiential statements, which helps break down linguistic boundaries and offers everyone a database in which the results of all the work can be visualized. We have avoided the linguistic difficulty of a heterogeneous corpus by choosing 3 working languages: English, French and Spanish. The database is therefore trilingual, and all information about the sense of the statements is systematically translated. This helps disambiguate the texts and facilitates computational research, since lemmatizers are available for the 3 working languages. The novelty lies in the creation of software for calculating similarities. The alignment of the original texts, their lemmatization and their linguistic structure may be useful for developing dedicated software.
The result is an open-access database with a body of texts that are presented, annotated and searchable; an annotation protocol that standardizes the BSE so that they can be compared; precise annotations on the sense and structure of the BSE to serve as research tools; BSE translations in English, Spanish and French; and a search algorithm to establish parallels between BSE and match them. The general public will be able to identify the source of proverbs in use today and trace the paths that led to the formation of our common way of thinking.
The work done offers new perspectives for paremiologists, computer scientists, linguists, and specialists of medieval literature and of natural language processing (TAL), and we offer computer science researchers a tool for calculating the similarities between sentences. Bringing together sapiential texts written in very different languages, along with our detailed textual annotations, will allow research on the circulation, impact and diffusion of brief sapiential statements in a particular geographical area and at a given time, as well as on their sources, influences and transformation over time. The database that we have developed can also be used for a comparative evaluation of the vectors through which the sapiential statements were disseminated.
By making use of the alignment between the statements, their lemmatization and their linguistic structure, it should be possible to develop automatic lemmatization tools. Not all the possibilities offered by the annotations have been exploited yet. Similarity calculations are carried out independently for each translation language (English, French, Spanish). It should be possible to refine them by treating the three languages simultaneously.
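As a rough illustration of what treating the three translation languages together could look like, the sketch below scores a pair of statements by the Jaccard overlap of their lemma sets in each language and then averages the per-language scores. It is not the ALIENTO similarity software, whose method is not described in this summary; the choice of measure, the function names and the sample lemmas are all assumptions made for illustration.

```python
from statistics import mean

# Working languages named in the summary; the codes themselves are an assumption.
LANGS = ("en", "fr", "es")


def jaccard(lemmas_a, lemmas_b):
    """Jaccard overlap of two lemma collections (0.0 when both are empty)."""
    a, b = set(lemmas_a), set(lemmas_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def combined_similarity(stmt_a, stmt_b, langs=LANGS):
    """Average the per-language Jaccard scores of two statements.

    Each statement is a dict mapping a language code to a list of lemmas.
    Languages missing from either statement are skipped.
    """
    scores = [
        jaccard(stmt_a[lang], stmt_b[lang])
        for lang in langs
        if lang in stmt_a and lang in stmt_b
    ]
    return mean(scores) if scores else 0.0


# Hypothetical lemmatized translations of two statements (invented sample data).
first = {
    "en": ["silence", "be", "gold"],
    "fr": ["silence", "être", "or"],
    "es": ["silencio", "ser", "oro"],
}
second = {
    "en": ["speech", "be", "silver"],
    "fr": ["parole", "être", "argent"],
    "es": ["palabra", "ser", "plata"],
}

score = combined_similarity(first, second)
```

Averaging is only one way to combine the languages. A pair that scores high in one translation but low in the other two may point to a translation artifact rather than a genuine parallel, so the individual per-language scores are worth keeping alongside the combined value.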
We have favored 3 modes of dissemination: conferences (3), study days (3) and scientific publications (5). The conferences and study days are closely linked to our work on the texts. They made it possible to go beyond disciplinary boundaries and share what the texts teach us in terms of continuity and rupture with the oldest cultures of the Mediterranean Basin (Sumer/Akkad), on which we depend. We have established an annotation protocol (guide) and a limited list of keywords, which constitute an ontology (commented list) accessible online.
In the ninth century, the rich Arabic tradition of adab found its way to Spain, to al-Andalus, which then played a central role in the exchange of knowledge coming from the Orient; this knowledge was then relayed to the West by monasteries in the north of the Iberian Peninsula in the 11th and 12th centuries. In al-Andalus, adab literature met the Jewish sapiential tradition of midrashic literature. New collections were composed, including original works in the 10th and 11th centuries, and from the 12th century on, exempla and philosophers’ sayings were translated into Hebrew, Latin and Romance languages. Much of this complex heritage is found in the extensive Spanish paremiological literature, which reached its peak in the 16th and 17th centuries, and in current Spanish, Judeo-Spanish and Maghrebian collections of proverbs.
Although the main lines of these exchanges are known, we lack specific information on the circulation of these short sapiential statements (our basic research unit), on the successive choices made by the translators, on the cultural reinterpretations, and on the weight of one borrowing relative to another. If sapiential textual filiations and translation sequences must be treated cautiously, this is particularly true of the sapiential statements contained in these texts. Because they are difficult to understand, these volatile elements, whose categorization varies with the period and the culture considered, have never been the subject of comprehensive textual studies that would recount their sources, circulation and evolution through the different languages spoken or written by the three cultures of the Iberian Peninsula during the Middle Ages. Paremiological studies have principally produced compilations of proverbs (thesauri), editions, and erudite studies dedicated to a single work, a single language or a single culture, with the exception of D. Gutas’ remarkable groundbreaking work on the Philosophical Quartet (1975). The few existing databases cover contemporary corpora of “paremiae”, most often monolingual or approached from a translation-studies perspective.
Therefore, the aim of the ALIENTO project is to calculate matches, including partial ones, and to identify close or distant connections, in order to reassess intertextual relations by comparing a large quantity of data and cross-referencing encoded texts written in different languages.
This is why the project, which requires close interdisciplinary collaboration between computational researchers (ATILF) and linguists and literary specialists (MSH Lorraine + INALCO and the international network of collaborators), will develop software transferable to other similar texts, using a large reference corpus composed of 8 related texts that circulated in the Iberian Peninsula (in Latin, Arabic, Hebrew, Spanish and Catalan), representing 582 pages and an estimated 9,570 sapiential statements.
The software developed will extract and connect brief sapiential units through matching based on a purpose-built encoding system, which is documented in an XML-TEI encoding manual. The choice and type of annotations result from collaborative reflection, during dedicated scientific sessions, among the members of the project: specialists of linguistic paremiology and of ancient texts, design engineers of textual databases, and computational researchers. The encoding system will evolve collaboratively during the matching processes.
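The extraction and connection of annotated units can be pictured with the minimal sketch below, which reads units out of a TEI document and pairs those that share at least one keyword. The project's actual encoding is defined in its own manual and is not reproduced in this summary, so the markup used here is hypothetical: each unit is assumed to be a TEI `seg` element with `type="bse"`, an `xml:id`, and keyword pointers in `ana`. The sample statements, keywords and function names are likewise invented for illustration.

```python
import xml.etree.ElementTree as ET
from itertools import combinations

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def extract_units(tei_xml, unit_type="bse"):
    """Collect annotated units from a TEI string.

    Assumes (hypothetically) that units are <seg type="bse"> elements whose
    keywords are listed as pointers in the ana attribute.
    """
    root = ET.fromstring(tei_xml)
    units = []
    for seg in root.iter(f"{{{TEI_NS}}}seg"):
        if seg.get("type") != unit_type:
            continue
        keywords = {k.lstrip("#") for k in seg.get("ana", "").split()}
        text = " ".join("".join(seg.itertext()).split())  # normalize whitespace
        units.append({"id": seg.get(XML_ID), "keywords": keywords, "text": text})
    return units


def shared_keyword_pairs(units, min_shared=1):
    """Return (id_a, id_b, shared keywords) for units sharing enough keywords."""
    pairs = []
    for u, v in combinations(units, 2):
        shared = u["keywords"] & v["keywords"]
        if len(shared) >= min_shared:
            pairs.append((u["id"], v["id"], sorted(shared)))
    return pairs


# Invented sample document; not taken from the ALIENTO corpus.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<p><seg type="bse" xml:id="u1" ana="#silence #speech">Silence is golden.</seg></p>
<p><seg type="bse" xml:id="u2" ana="#silence">He who keeps silent is safe.</seg></p>
<p><seg type="bse" xml:id="u3" ana="#wealth">Wealth does not last.</seg></p>
</body></text></TEI>"""

units = extract_units(SAMPLE)
pairs = shared_keyword_pairs(units)
```

Keyword overlap is only a coarse first filter. Candidate pairs found this way would still need a finer comparison of sense and structure before being treated as matches.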
At the end we will have:
- a body of texts belonging to a multilingual corpus, digitized, tagged in XML/TEI and publicly accessible, linked to a set of data on the text and its author.
- a set of brief sapiential units with their XML/TEI annotations, accessible free of charge.
- a trilingual query interface that displays the matched statements contained in these works, with information that can be used to study them regardless of the language.
- an encoding methodology and software for matching data, both transferable to other similar corpora.
Marie-Sol ORTOLA (Maison des Sciences de l'Homme Lorraine)
The author of this summary is the project coordinator, who is responsible for the content of this summary. The ANR declines any responsibility for its contents.
MSHL Maison des Sciences de l'Homme Lorraine
CERMOM Centre de Recherche Moyen-Orient et Méditerranée
ATILF Analyse (linguistique), Traitement Informatique, Langue Française
ANR funding: 239,948 euros
Beginning and duration of the scientific project: December 2013 - 42 Months