Digitizing Armenian Linguistic Heritage: Armenian Multivariational Corpus and Data Processing – DALiH
The project Digitizing Armenian Linguistic Heritage: Armenian Multivariational Corpus and Data Processing (DALiH) aims to build, for the first time, an open-access and open-source unified digital linguistic platform covering the whole spectrum of Armenian language variation. Each language variety will be represented by a comprehensive text database with full morphological annotation. More specifically, DALiH will design 1) a Classical Armenian corpus; 2) a Modern Western Armenian corpus; 3) a pilot corpus of Middle Armenian; 4) three pilot corpora of dialects; and 5) an updated Modern Eastern Armenian annotated corpus. Deep-learning and rule-based natural language processing resources will be designed to process the databases, to develop grammatical annotation and automatic speech recognition models, and to cross-check their value for further corpus enlargement, in a context of multiparameter language variation for an under-resourced language.
Project coordination
Victoria Khurshudyan (Structure et Dynamique des Langues)
The author of this summary is the project coordinator, who is responsible for its content. The ANR declines any responsibility for its contents.
Partners
Russian Academy of Sciences / Institute for Linguistic Studies
SeDyL Structure et Dynamique des Langues
Russian Academy of Sciences / Vinogradov Institute for Russian Language
ERTIM EQUIPE DE RECHERCHE : TEXTES, INFORMATIQUE, MULTILINGUISME
LIPN Laboratoire d'Informatique de Paris-Nord
American University of Armenia / Digital Library of Armenian Literature
ANR funding: 465,494 euros
Beginning and duration of the scientific project: March 2022, 42 months