DS07 - Société de l'information et de la communication 2017

Derivational Morphology in Extension – DEMONEXT

Demonext : Derivation in Extension

Demonext aims to build a French morphological database (MDB) that systematically describes the derivational properties of words. The MDB will meet multiple needs, such as the empirical confirmation of morphological hypotheses and the elaboration of new ones, the design of natural language processing (NLP) tools, vocabulary teaching, and the treatment of developmental or acquired language disorders.

General objectives

The lexicon of a language like French is composed mainly of morphologically complex words: prefixed, suffixed, converted or compound. This structural information is generally available in the etymological sections of dictionaries, but the variability of its formulation makes it difficult to exploit. For languages such as English, German, Dutch or Czech, there are morphological databases (MDBs) that describe the derivational properties of words in a systematic way: CELEX, CatVar, DerivBase, etc. This information is essential because much other information can be inferred from it, most importantly the meaning of these words. A prototype of the MDB already exists: the Démonette database, developed by the project's two main partners, which can be regarded as an exploratory study for the present project. A wide-coverage MDB for French with rich and reliable descriptions would meet multiple needs, such as empirical confirmation and hypothesis development in morphology, the development of NLP tools, vocabulary teaching, and the diagnosis and treatment of developmental or acquired lexical disorders.
To meet these challenges, we propose to build the MDB Demonext. This large-scale resource will provide rich descriptions of lexemes (i.e. lexical units), of derivational relationships and of the paradigms in which they fit. It will represent information explicitly and uniformly, ensure systematic traceability of all the information it provides, and be compatible with the main current morphological theories (morpheme-based; lexeme-based; paradigm-based).

Methods and Approaches

The principles underlying this resource will give it an original organization compared to existing MDBs. An entry of Demonext corresponds to a morphological relationship between two lexemes. The set of relationships that a lexeme shares with its morphological "relatives" defines its derivational family. For example, NATION forms a family with NATIONAL, INTERNATIONAL, NATIONALITY, NATIONALIZATION, INTERNATIONALIZATION, etc. An even more original feature of Demonext is that it will describe, on a large scale, the derivational paradigms that structure the lexicon and organize it into interconnected networks (for example, any relation following the X → XAL pattern, where X is a noun, is part of a network that can be generalized as the quadruplet {X, XAL, XALISER, XALISATION}).
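The organization described above can be illustrated with a minimal Python sketch. It is not the Demonext format or implementation: the relation pairs, the function names and the idea of computing a family as a connected component of the relation graph are assumptions made for illustration only, and the pairs are not claimed to be direct derivations. The paradigm function fills the {X, XAL, XALISER, XALISATION} pattern by plain string concatenation, which ignores the stem variation a real resource would have to handle.

```python
from collections import defaultdict

# Hypothetical entries: each one is a relation between two lexemes.
# English forms follow the text's example; the pairs are illustrative only.
RELATIONS = [
    ("NATION", "NATIONAL"),
    ("NATIONAL", "INTERNATIONAL"),
    ("NATIONAL", "NATIONALITY"),
    ("NATIONAL", "NATIONALIZE"),
    ("NATIONALIZE", "NATIONALIZATION"),
    ("INTERNATIONAL", "INTERNATIONALIZATION"),
    ("REGION", "REGIONAL"),
]


def derivational_families(relations):
    """Group lexemes into families (connected components of the relation graph)."""
    neighbours = defaultdict(set)
    for a, b in relations:
        neighbours[a].add(b)
        neighbours[b].add(a)

    families, seen = [], set()
    for start in neighbours:
        if start in seen:
            continue
        family, stack = set(), [start]
        while stack:  # depth-first traversal from an unvisited lexeme
            lexeme = stack.pop()
            if lexeme in family:
                continue
            family.add(lexeme)
            stack.extend(neighbours[lexeme] - family)
        seen |= family
        families.append(family)
    return families


def instantiate_paradigm(base):
    """Fill the {X, XAL, XALISER, XALISATION} pattern by naive concatenation."""
    return tuple(base + suffix for suffix in ("", "al", "aliser", "alisation"))


families = derivational_families(RELATIONS)
nation_family = next(f for f in families if "NATION" in f)
print(sorted(nation_family))
print(instantiate_paradigm("nation"))
```

Treating each entry as an edge keeps the database centred on relations, as the text describes, while families and paradigm cells are derived views over those edges.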
Demonext also differs from existing MDBs in another notable respect: each entry will carry semantic information. Morphological relationships will be semantically annotated, and the words they link will be assigned semantic types. Relationships will be annotated by means of glosses that define one word in terms of the meaning of the other. For example, NATIONALIZATION can be defined relative to NATIONALIZE by the gloss "action of nationalizing". The morpho-semantic typing of the lexemes connected by a relationship (such as CAUSE_CHANGE for NATIONALIZE or ACTION for NATIONALIZATION) will be based on the content of the FrameNet network, which has an extensive set of types.
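As a rough illustration of such a semantically annotated entry, the following Python sketch models one relation with its gloss and the semantic types of the two lexemes. The class names, field names and part-of-speech tags are hypothetical and do not reflect the actual Demonext schema; only the type labels CAUSE_CHANGE and ACTION come from the text, and the gloss wording is illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lexeme:
    lemma: str
    category: str       # part of speech; "V" and "N" are an assumed tag set
    semantic_type: str  # type label, e.g. CAUSE_CHANGE or ACTION


@dataclass(frozen=True)
class Relation:
    source: Lexeme
    target: Lexeme
    gloss: str          # defines the target in terms of the source


nationalize = Lexeme("NATIONALIZE", "V", "CAUSE_CHANGE")
nationalization = Lexeme("NATIONALIZATION", "N", "ACTION")

# One entry = one annotated relation between two lexemes.
entry = Relation(nationalize, nationalization, gloss="action of nationalizing")
print(f"{entry.target.lemma} ({entry.target.semantic_type}): {entry.gloss}")
```

Keeping the gloss on the relation and the type on each lexeme mirrors the split described in the text between annotating relationships and typing the words they link.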
One of the principles that will guide the design of Demonext is that it can be fed by a variety of French lexical resources, as long as they can be freely redistributed. These resources will be cumulatively integrated into Demonext; the format of the knowledge they contain will be unified; important missing information will be calculated automatically when possible.

Results

Demonext will thus be a large-scale MDB with an original structure of interconnected networks, whose edges and vertices will carry a variety of information: morphosemantic, morphophonological, derivational, statistical, etc. Demonext will also be able to offer a wide range of services. A second outcome of the project will be a set of teaching tools and materials, such as collections of exercises and tests. These by-products of Demonext will illustrate its possible uses and its expected societal impact for primary and secondary school teachers, students and higher-education teaching staff, speech-language pathologists, and specialists in derivational morphology and the statistical modelling of the lexicon. Demonext will be distributed under a free Creative Commons license and will be made accessible to the various categories of users through interfaces suited to the intended use: query, editing and visualization interfaces for specialized audiences; simplified and ergonomic access for the general public. It will be available for download via the EQUIPEX Ortolang (www.ortolang.fr/) and the REDAC platform (redac.univ-tlse2.fr/).
Because Demonext is a database hosting an annotated morphological network with derivational, formal, semantic and frequency descriptions, we expect it to have an impact in several scientific and social fields. Demonext will offer linguists (morphologists, psycholinguists, specialists in L1 or L2 teaching) an experimental field with extensive coverage, and a wide range of information, from statistical measurements to semantic properties, morphological decompositions, and categorial and phonological characteristics.

Prospects

In morphology research, Demonext will contribute to the emergence of a more quantitative and experimental morphology, by enabling large-scale testing of hypotheses and the development of new ones. It will also make it possible to improve the visibility of the results of studies on derivation in French and probably lead to more formalized analyses.
The task of statistical modelling of competition between processes will bring not only a better understanding of the structure and dynamics of the French derivation system, but also the tools and methods to explore and model this system.
In higher education, the production of representations in a variety of formalisms will allow the development of exercises for MOOCs.
In NLP, the breadth of its coverage and the richness of its content will favour its integration into processing chains for information retrieval, data mining, sentiment analysis, etc. Semantic descriptions will be useful for building terminologies and exploiting corpora.
In pedagogy, Demonext will help diversify vocabulary teaching techniques for primary school teachers, through the introduction of specific vocabulary acquisition techniques based on research data.
Finally, in speech and language therapy, the resource will enable the development of assessment and therapy materials focused on the morphological level, whether to improve this level of processing when it is impaired, or to draw on it, when it is preserved, in the development of compensatory strategies.

Scientific productions and patents

Papers
Hathout, N. & Namer, F. (2016). Giving Lexical Resources a Second Life: Démonette, a Multi-sourced Morpho-semantic Network for French. LREC 2016, Portorož:ELRA, 1084-1091.
Dal, G. & Namer, F. (2016). Chapter 4: Productivity. The Cambridge Handbook of Morphology. Stump, G. and Hippisley, A. Cambridge, Cambridge University Press: 70-89.
Dal, G. & Namer, F. (2015). La fréquence en morphologie : pour quels usages ?. Langages 197: 47-68.
Hathout, N. & Namer, F. (2014). Démonette, a French derivational morpho-semantic network. Linguistic Issues in Language Technology 11(5): 125-168.
Namer, F. (2013) A Rule-Based Morphosemantic Analyzer for French for a Fine-Grained Semantic Annotation of Texts, Communications in Computer and Information Science, 380, 93-115.
Hathout, N. & Namer, F. (Eds.) (2012) Vers la Morphologie et au-delà. TAL 52.2.
Namer, F. (2012) Nominalisation et composition en français, Lexique 20 : 169-201.
Namer F. (2009). Morphologie, lexique et TAL – Le système DériF: London: Hermès, 448p.
Namer F. & Baud R., (2007) Defining and relating biomedical terms : towards a cross-language morphosemantics-based system. International Journal of Medical Informatics 76: 226-233.
Dal G., Hathout N. & Namer F. (2002) An Experimental Constructional Database: The MorTAL Project. Many Morphologies, Paul Boucher (ed.). Somerville, MA: Cascadilla Press: 178-209.
Resources:
Flemm: Lemmatisation du Français, version 3.1: www.ortolang.fr/market/tools/flemm
DériF: Dérivation en Français, searchable online: www.cnrtl.fr/outils/DeriF/
Démonette (with N. Hathout): Base de Données Morphologique du français, downloadable versions 1.1 and 1.2: www.ortolang.fr

Submission summary

The objective of the project is to build Demonext, a French morphological database (MDB) that systematically describes the derivational properties of words. This database will fill a gap, because no such resource is currently available for French.

This database will be fed by large-scale lexical resources that cover all the morphological processes of French and were created by morphologists as part of academic studies. During the migration of these resources, new information will be automatically inferred from the knowledge they contain.

The MDB will offer an unprecedented combination of information that meets multiple needs, such as empirical confirmation and elaboration of morphological hypotheses, development of NLP tools, vocabulary learning and treatment of developmental and acquired language disorders.

The entries in Demonext will be organized as a network in which morphologically related lexemes will bear morphological, phonological and frequency annotations, an indication of age of acquisition, etc. They will be connected by direct and indirect morphological relations. One innovative feature of Demonext is the importance given to describing the semantic properties of lexemes and relations: relations will be characterized by a representation of their derivational meaning, and lexemes will be typed semantically, drawing on WordNet and FrameNet.

The project will be carried out by a consortium of four partners, bringing together almost all the French morphologists, a large team of NLP specialists (language resources and computational morphology), computer scientists, psycholinguists and speech therapists. All are already used to working together.

The main result of the project will be a large-scale derivational morphological database with an original structure of interconnected networks whose edges and vertices carry rich and varied information: morphosemantic, morphosyntactic, morphophonological, derivational, distributional, statistical, etc.

A second result of the project will be a set of tools and educational materials such as sets of exercises and tests that will exploit Demonext and provide examples of its possible uses and its expected societal impact. The expected recipients of these different products will be: elementary and secondary school teachers, students, university teachers, speech therapists, academics working on derivational morphology and on the statistical modeling of the lexicon.

Demonext will be distributed under a Creative Commons license. It will be accessible to its various audiences through a platform designed for the different uses of the MDB, which will give access to query interfaces and to editing and viewing tools designed for more specialized audiences. Demonext will also provide simplified and ergonomic access for the general public. It will be made available on the EQUIPEX Ortolang and REDAC platforms.

Fiammetta NAMER (Analyse et Traitement Informatique de la Langue Française)

The author of this summary is the project coordinator, who is responsible for its content. The ANR declines any responsibility for its contents.

CLLE Cognition, Langues, Langages, Ergonomie
UDL SHS - STL SAVOIRS, TEXTES, LANGAGE
LLF Laboratoire de Linguistique Formelle
ATILF Analyse et Traitement Informatique de la Langue Française

ANR funding: 592,133 euros
Beginning and duration of the scientific project: April 2018 - 48 Months
