Machine Translation for Open Science – MaTOS
MaTOS is a multidisciplinary project, bringing together teams from a variety of scientific backgrounds. The studies carried out are based, on the one hand, on methods for analyzing specialized corpora to build up inventories of terms and study their usage, whether on the scale of a document, a collection, or a specific time period.
On the other hand, they rely on the development of state-of-the-art machine translation systems capable of handling extended translation contexts (from several sentences to several paragraphs, and beyond) and of exploiting expert resources (specialized dictionaries, translation memories). These systems are specialized for several scientific fields.
The evaluation work relies, on the one hand, on state-of-the-art automatic metrics, including the most recent ones that integrate large-scale neural models; on the other hand, it relies on post-editors specialized in their scientific field, recruited to carry out revision tasks representative of a typical scientific writing activity.
After two years, the project has first produced a set of reports documenting the state of the art, focusing notably on:
- human assessments of translation quality
- automatic evaluation of document translation
- computational architectures for document translation.
Various resources have also been collected, prepared, and formatted. These include terminologies for two specialized domains, as well as various monolingual and bilingual corpora, in particular long documents and their translations (for abstracts and full-text articles) for the same two domains.
Software developments have focused on three aspects:
- the development of tools to identify terms and their variants in corpora; these will be used to thoroughly document the spectrum of acceptable terminological variations in academic documents, and to evaluate the degree of unacceptable variation;
- the study of methods for automatically proposing neologisms to translate emerging terms;
- the development of specialized MT systems for the translation of long scientific documents, based on both encoder/decoder architectures and large multilingual language models.
In terms of evaluation, two pilot studies involving the post-editing of automatically translated abstracts have been carried out with the involvement of specialized translators and members of the academic community, in anticipation of a larger-scale study.
This work has already resulted in a dozen publications, which are available on the project website, and a number of corpora, which are also distributed via the same channel.
Capitalizing on the resource-building work already completed or in the process of being finalized, the main prospects for the end of the project are as follows:
- analyze terminological variation in source texts and its correlates in human and machine translations;
- develop specialized MT systems that translate entire documents, integrate terminological constraints, and also produce coherent, cohesive texts that require minimal proofreading for publication;
- develop automatic methods to characterize and evaluate the ability of MT systems to (a) correctly translate segments involving specialized terms; (b) produce terminologically consistent translations;
- implement very large-scale evaluations of the systems and metrics thus designed, and use them in post-editing protocols involving large communities of users.
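As an illustration of what a measure of terminological consistency might look like, the following is a minimal sketch and not the project's actual metric. For each source term, it computes the share of occurrences in a translated document that use the most frequent target variant. The function name `term_consistency`, the variant lists, and the example sentences are all hypothetical; matching is done by naive lowercase substring counting, which would double-count variants that contain one another.

```python
from collections import Counter


def term_consistency(translations, variants_by_term):
    """Return, per source term, the share of occurrences that use the
    most frequent target variant (1.0 means fully consistent).

    translations: list of translated sentences (strings).
    variants_by_term: dict mapping a source term to a list of acceptable
    target variants (hypothetical resource, assumed non-overlapping).
    """
    scores = {}
    for term, variants in variants_by_term.items():
        counts = Counter()
        for sentence in translations:
            lowered = sentence.lower()
            for variant in variants:
                # Naive substring match; no tokenization or lemmatization.
                counts[variant] += lowered.count(variant.lower())
        total = sum(counts.values())
        if total:  # Terms that never occur are left out of the result.
            scores[term] = max(counts.values()) / total
    return scores


# Hypothetical example: one French source term, two English variants.
doc = [
    "Machine translation has improved.",
    "Automatic translation remains hard.",
    "We evaluate machine translation systems.",
]
variants = {"traduction automatique": ["machine translation", "automatic translation"]}
scores = term_consistency(doc, variants)
```

Here two of the three occurrences use "machine translation", so the score for the term is 2/3. A realistic implementation would need proper term and variant detection (inflection, hyphenation, abbreviations) rather than substring matching.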
The MaTOS (Machine Translation for Open Science) project aims to develop new methods for the machine translation (MT) of complete scientific documents, as well as automatic metrics to evaluate the quality of these translations. Our main application target is the translation of scientific articles between French and English, where linguistic resources can be exploited to obtain more reliable translations, both for publication purposes and for gisting and text mining. However, efforts to improve MT of complete documents are hampered by the inability of existing automatic metrics to detect weaknesses in the systems and to identify the best ways to remedy them. The MaTOS project aims to address both of these issues.
This project is part of a movement to automate the processing of scientific articles; MT is no exception to this trend, particularly in the biomedical field. Applications are numerous: text mining, bibliometric analysis, automatic detection of plagiarism and of articles reporting falsified conclusions, etc. We wish to take advantage of the results of this work, but also to contribute to it in several ways: (a) by developing new open resources for specialized MT; (b) by improving, through the study of terminological variations, the description of textual coherence markers for scientific articles; (c) by studying new methods of multilingual processing for these documents; (d) by proposing metrics dedicated to measuring progress on this type of task. Ultimately, improved translation will facilitate the circulation and dissemination of scientific knowledge.
Project coordination
François Yvon (Institut des Systèmes Intelligents et de Robotique)
The author of this summary is the project coordinator, who is responsible for its content. The ANR declines any responsibility for its contents.
Partnership
CLILLAC-ARP Université de Paris
ISIR Institut des Systèmes Intelligents et de Robotique
Institut national de la recherche en informatique et automatique
INIST Institut de l'information scientifique et technique
ANR funding: 782,530 euros
Beginning and duration of the scientific project: December 2022, 48 months