The goal of the ADDICTE project (Distributional analysis in specialized domains) is to propose an operational solution for distributional semantic analysis in specialized domains, in order to construct semantic and conceptual representations of a domain (domain ontologies, thesauri, terminological resources) that can be used both in knowledge engineering and in documentary applications (document indexing, for instance).
Today, robust distributional analysis models provide "ready-to-wear" resources built from very large corpora of general language. These generic word embeddings are not sufficient to represent semantics in specialized domains, so embeddings must be built from specialized corpora. However, specialized texts have characteristics that are problematic for distributional methods, whose effectiveness is correlated with the quantity of data available. First, these corpora are small (generally below one million words) compared to the very large corpora of general language. Second, terminological units, and complex terms in particular, predominate, and their specificity further reduces the volume of contexts that can be used for semantic computation. Conversely, these data have useful characteristics that a distributional analysis system can exploit: the texts are generally highly structured, the lexicon is restricted, and semantic resources are often available and can be injected into the analysis process.
In this context, the originality of ADDICTE is to question and combine the fundamental approaches to distributional analysis and to specialized texts. In particular, three aspects will be studied: (i) endogenous improvement of distributional contexts, by taking into account the terminological units that convey an important part of the knowledge of a specialized domain; (ii) exogenous improvement of distributional contexts, by enriching them with external resources; and (iii) improvement of the nature of the distributional contexts, by proposing a distributional representation that can take advantage of both endogenous and exogenous information.
The ADDICTE project intends to propose new advances, particularly approaches that better exploit the linguistic and terminological characteristics of the textual material, so that distributional analysis in specialized domains can reach the same level of maturity as it has for large general-language corpora. The new predictive methods developed in the project will be transferred through a domain adaptation software library (under a non-copyleft free license).
Mr. Emmanuel Morin (Laboratoire des Sciences du Numérique de Nantes)
The author of this summary is the project coordinator, who is responsible for its content. The ANR accepts no responsibility for its contents.
CLLE Cognition, Langues, Langages, Ergonomie
CEA LIST Commissariat à l'énergie atomique et aux énergies alternatives
LIMSI Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur
LS2N Laboratoire des Sciences du Numérique de Nantes
ANR funding: 590,885 euros
Start and duration of the scientific project: March 2018 - 42 months