CE38 - Digital revolution: relations to knowledge and culture 2021

Digitizing Armenian Linguistic Heritage: Armenian Multivariational Corpus and Data Processing – DALiH

The project Digitizing Armenian Linguistic Heritage (DALiH): Armenian Multivariational Corpus and Data Processing aims to build, for the first time, an open-access and open-source unified digital linguistic platform covering the whole spectrum of Armenian language variation. More specifically, it will provide annotated corpora for 1) Classical Armenian; 2) Modern Western Armenian; 3) Middle Armenian (a pilot corpus); 4) dialects (three pilot corpora); and 5) Modern Eastern Armenian (an updated corpus built on the existing one).

Research will be conducted from both natural language processing (NLP) and linguistic perspectives in order to provide full grammatical annotation and automatic speech recognition (ASR) models for the target Armenian varieties. Multi-approach deep-learning and rule-based resources will be designed to process the written and oral databases and to cross-check their value for further corpus enlargement, in a context of multiparameter language variation for an under-resourced language.

NLP-based linguistic research, such as language identification, measurement of the distance between varieties, and lexical and morphological disambiguation, will be carried out to revisit existing research questions and to introduce new ones, backed by the newly available processed written and oral data.
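
As an illustration of what measuring the distance between varieties can look like in its simplest form, the sketch below compares two texts through their character trigram profiles and a cosine distance. This is not the project's method, which the summary does not specify. The function names `char_ngram_profile` and `cosine_distance`, the choice of trigrams, and the two toy sentences are assumptions made for the example.

```python
from collections import Counter
import math


def char_ngram_profile(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams in a lowercased, whitespace-normalized text."""
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def cosine_distance(p: Counter, q: Counter) -> float:
    """Return 1 - cosine similarity between two n-gram count profiles."""
    dot = sum(count * q[gram] for gram, count in p.items())
    norm_p = math.sqrt(sum(c * c for c in p.values()))
    norm_q = math.sqrt(sum(c * c for c in q.values()))
    if norm_p == 0 or norm_q == 0:
        # An empty profile shares nothing with any other profile.
        return 1.0
    return 1.0 - dot / (norm_p * norm_q)


# Toy sentences chosen for illustration only (assumption, not project data).
eastern_sample = "ես գնում եմ տուն"
western_sample = "ես տուն կ'երթամ"

distance = cosine_distance(
    char_ngram_profile(eastern_sample),
    char_ngram_profile(western_sample),
)
print(f"trigram cosine distance: {distance:.3f}")
```

A distance of 0 means identical trigram profiles and 1 means no trigram in common. On real corpora, the profiles would be built from much larger samples per variety, and the same profiles could serve as features for a simple language identification baseline.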

Methodologies for under-resourced languages

Data management in the DALiH project follows the General Data Protection Regulation (GDPR), so that the use of data, particularly oral data (written data are sourced exclusively from published materials), is both ethically sound and legally compliant. Our research team is committed to fostering trust between participants and researchers while safeguarding the integrity of the research process. In practice, this means:
• Informed consent for oral data sharing.
• Ethical considerations in working with displaced language communities.
• Encryption and pseudonymization for sensitive data.
Ethical research practices are also intended to minimize bias in the NLP models developed from these data.
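
Pseudonymization of oral-data metadata can be sketched as follows. The summary does not describe the project's actual procedure, so this is only one common approach: replacing a speaker's name with a keyed hash (HMAC-SHA256), so that the same speaker always receives the same code while the name cannot be recovered without the secret key. The function name `pseudonymize`, the `SPK_` prefix, and the record fields are hypothetical.

```python
import hashlib
import hmac
import secrets


def pseudonymize(speaker_id: str, secret_key: bytes, length: int = 12) -> str:
    """Map a speaker identifier to a stable pseudonym using a keyed hash."""
    digest = hmac.new(secret_key, speaker_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return "SPK_" + digest[:length]


# Assumption: in practice the key is generated once and stored securely,
# separately from the corpus; here it is created on the fly for the demo.
demo_key = secrets.token_bytes(32)

# Hypothetical metadata record for one recording.
record = {"speaker": "Anna Example", "dialect": "Getashen", "file": "rec_001.wav"}

# Replace the direct identifier before the record is shared.
shared_record = {**record, "speaker": pseudonymize(record["speaker"], demo_key)}
print(shared_record)
```

Using a keyed hash rather than a plain hash matters here: without the key, a plain hash of a name could be reversed by hashing a list of candidate names.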
While documenting target dialects is a key objective, the methods for recording displaced languages, especially those arising from armed conflict, present significant academic and human challenges. Due to frequent constraints on conducting on-site fieldwork, the project emphasizes the need for innovative off-site research methods. Several methodological solutions are proposed:
a. Ethical and Sensitive Research Design: Ethical considerations are crucial when conducting participatory research with affected communities. This includes obtaining informed consent, maintaining confidentiality, and respecting cultural values and practices.
b. Community Engagement: Actively involving community members in the research process is vital for effective documentation. By facilitating workshops, interviews, and focus groups, the project aims to create an inclusive environment that values community voices. Making documented dialect data accessible will promote awareness and valorize both the dialects and the community.
c. Utilization of Shared Dialects: Researchers who share similar dialects can enhance documentation efforts through rapport with participants, fostering comfort and encouraging active engagement in the research process. My background as a native speaker of the Goris dialect has been instrumental in facilitating the Getashen data collection.
d. Remote Data Collection and Crowdsourcing: Digital technologies, such as video conferencing and online surveys, enable connection with displaced communities, including Getashen members in Russia. Crowdsourcing will empower local speakers to document their languages and cultural practices, fostering ownership of the preservation process.
By integrating these solutions, the project aims to develop a comprehensive and ethically sound framework for documenting displaced languages, benefiting linguists, anthropologists, and other specialists in related fields.

Results

Methodologies for Data Collection. The written data included in the project come predominantly from published sources and were obtained by downloading, OCR, or manual typing, depending on the type, quality, and genre of the original source. OCR was applied mainly to Classical Armenian and Modern Western Armenian data. Within the framework of the project, an LLM based on Llama-2 was trained using the Armenian OSCAR corpus (Wikipedia in Western and Eastern Armenian) and then re-specialized in Classical, Western, or Eastern Armenian depending on the target task. The LLM was initially developed for post-OCR correction, that is, for cleaning OCR output, but it can also serve as a foundation for standard tasks such as named entity recognition or topic modeling.
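
One way to check whether a post-OCR correction step actually improves the text is to compare the character error rate (CER) of the raw OCR output and of the corrected output against a manually verified reference. The summary does not say how the project evaluates its corrections, so the metric choice is an assumption. The function names and the sample strings are also hypothetical, and the "corrected" string is typed by hand rather than produced by the project's LLM.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the edit distance (insertions, deletions, substitutions) between a and b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(hypothesis: str, reference: str) -> float:
    """Edit distance between hypothesis and reference, divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(hypothesis, reference) / len(reference)


# Hypothetical example: one character misread by OCR, then fixed by correction.
reference = "հայերեն"
ocr_output = "հայերեև"
corrected = "հայերեն"

print("CER before correction:", character_error_rate(ocr_output, reference))
print("CER after correction: ", character_error_rate(corrected, reference))
```

A lower CER after correction than before indicates that the correction step removed more errors than it introduced on that sample.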

The oral data compiled within the project vary in genre and register, and also in their sensitivity with regard to both content and the respondents' personal histories. Three main types of oral data can be distinguished in the DALiH project: task-oriented discourse, public oral discourse, and spontaneous oral discourse.

Prospects

n/a

Scientific productions and patents

n/a

Submission summary

The project Digitizing Armenian Linguistic Heritage: Armenian Multivariational Corpus and Data Processing (DALiH) aims to build, for the first time, an open-access and open-source unified digital linguistic platform for the whole spectrum of Armenian language variation. Each language variety will be represented by a comprehensive text database provided with full morphological annotation. More specifically, DALiH will design 1) a Classical Armenian corpus; 2) a Modern Western Armenian corpus; 3) a pilot corpus of Middle Armenian; 4) three pilot corpora of dialects; and 5) an updated Modern Eastern Armenian annotated corpus. Deep-learning and rule-based natural language processing resources will be designed to process the databases, to develop grammatical annotation and automatic speech recognition models, and to cross-check their value for further corpus enlargement, in a context of multiparameter language variation for an under-resourced language.

Victoria Khurshudyan (Structure et Dynamique des Langues)

The author of this summary is the project coordinator, who is responsible for its content. The ANR declines any responsibility for its contents.

Russian Academy of Sciences / Institute for Linguistic Studies
SeDyL Structure et Dynamique des Langues
Russian Academy of Sciences / Vinogradov Institute for Russian Language
ERTIM EQUIPE DE RECHERCHE : TEXTES, INFORMATIQUE, MULTILINGUISME
LIPN Laboratoire d'Informatique de Paris-Nord
American University of Armenia / Digital Library of Armenian Literature

ANR funding: 465,494 euros
Beginning and duration of the scientific project: March 2022 - 42 Months
