DS0805 - Cultures, patrimoines, création

Describing and Modelling Reference Chains: Tools for Corpus Annotation (including diachronic and comparative language studies) and Automatic Processing – DEMOCRAT

DEMOCRAT, description and modelling of reference chains: tools for corpus annotation (with diachronic and cross-linguistic approaches) and automatic processing

Four steps towards innovative studies of referring expressions and reference chains: (i) a discursive, diachronic and cross-linguistic model; (ii) a manually annotated corpus; (iii) a tool to annotate and to explore the annotations; (iv) an automatic language processing system to pave the way for automated annotation.

Issues and objectives for the study of referring expressions and reference chains

Despite the existence of in-depth descriptions of referring expressions, there does not exist: (i) any integrated description allowing the modelling of reference chains, nor any predictions about their textual behaviour and their typology; (ii) any corpus to apprehend the historical evolution of their composition; (iii) any tool for visualizing, exploring, and analysing correlations in reference chains; (iv) any natural language processing software able to process raw texts written in French so as to extract the referring expressions and the reference chains. The ambition of the DEMOCRAT project is to provide new results for these four aspects, which constitute the four main workpackages and the four main deliverables of the project.

From a theoretical point of view, DEMOCRAT will articulate all available knowledge on isolated referring expressions and on anaphoric chains, as well as verify or refine (for the French language) the hypotheses raised by theories such as Accessibility Theory, Givenness Hierarchy, and Centering Theory.

As regards resources (corpus and tools), DEMOCRAT will make a contribution to the digital humanities by proposing a rich digital corpus for the French language, annotated on the basis of linguistic analyses carried out within an approach that has been little explored to date and involves both semantics and pragmatics. Providing new data on the French language, this corpus and the associated model are intended to: (i) feed all natural language processing applications (the corpus size will allow machine learning applications); (ii) consolidate the role of the French language in the world, particularly through its integration into an international competition; (iii) provide new knowledge to linguistics-related subjects such as psycholinguistics and language teaching.

In order to reach a better understanding of referring expressions and reference chains, the DEMOCRAT approach will combine methods from linguistics (especially diachronic), tool-assisted corpus linguistics, and statistical analysis of textual data. Once the relevant phenomena are defined in the form of markables and annotation schemes, a set of texts will be annotated manually. The texts will be chosen according to a number of periods and textual genres. Some experiments will allow us to refine the choices and write an annotation manual, which will be tested through timed annotation sessions and measurement of inter-annotator agreement. The final corpus, like all the DEMOCRAT productions, will be freely available under Creative Commons licenses (to be specified). In parallel, new methods of qualitative and quantitative analyses will be tested, including measurements that are adapted to reference chains. A prototypical analysis procedure will also be tested, in order to facilitate comparisons. The GUI and the macro library of TXM will evolve with DEMOCRAT. If some texts have been annotated elsewhere with other kinds of linguistic interpretations, cross-analyses will be tested.
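The measurement of inter-annotator agreement mentioned above can be illustrated with a minimal sketch. The summary does not say which agreement measure the project uses; Cohen's kappa is assumed here purely for illustration, and the function name and the category labels ("pron", "np") are hypothetical. Real agreement on reference chains would also have to compare markable boundaries and chain membership, which this sketch leaves out.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items.

    Assumption: each item is a markable that both annotators assigned
    to exactly one category.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: proportion of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's own label distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label everywhere.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for four markables from two annotators.
annotator_1 = ["pron", "np", "np", "pron"]
annotator_2 = ["pron", "np", "pron", "pron"]
print(cohen_kappa(annotator_1, annotator_2))
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when one category (for instance pronouns) dominates the annotations.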

Automatic chain detection will be based on state-of-the-art techniques, not only for French (mainly rule-based systems), but first and foremost for the languages that are regularly considered in international campaigns and competitions (mainly machine learning techniques and hybrid systems). Several techniques will be implemented for French; the first step will consist in separating the detection of referring expressions from that of coreferential pairs, as the two stages involve different techniques. A linguistic analysis of system errors will be carried out, in order to specify dedicated hybridization methods: application of rules before or after machine learning, identification of features that are specific to the French language.
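To make the two-stage idea concrete, here is a minimal sketch of how coreferential pairs, once detected, could be grouped into reference chains. It assumes that the referring expressions and the pairs have already been produced by the two separate stages, and it merges pairs by transitive closure with a union-find structure. This grouping step is a common choice in coreference resolution, not something the summary specifies; the function name and the example mentions are hypothetical.

```python
def build_chains(mentions, coreferent_pairs):
    """Group mentions into chains by transitive closure over pairs.

    Assumptions: `mentions` lists unique mention identifiers in text
    order; `coreferent_pairs` holds (mention, mention) tuples output by
    a pair-detection stage.
    """
    parent = {m: m for m in mentions}

    def find(m):
        # Follow parent links to the representative, compressing the path.
        while parent[m] != m:
            parent[m] = parent[parent[m]]
            m = parent[m]
        return m

    for a, b in coreferent_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    chains = {}
    for m in mentions:
        chains.setdefault(find(m), []).append(m)
    # Keep only real chains: singletons are mentions without coreference.
    return [chain for chain in chains.values() if len(chain) > 1]


# Hypothetical output of the two detection stages.
mentions = ["Marie", "elle", "la jeune femme", "Paul", "il", "le livre"]
pairs = [("Marie", "elle"), ("elle", "la jeune femme"), ("Paul", "il")]
print(build_chains(mentions, pairs))
```

In a real system the mention identifiers would be text spans rather than strings, since the same surface form can occur several times in a text.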

In addition to the deliverables (the model, the corpus, the annotation tool, and the NLP system), the main expected results relate to methods, applications, and scientific enrichments.

Methods: TXM will become a tool for annotating and exploiting textual information, with new possibilities for quantitative and qualitative linguistic analyses. With the annotated corpus, these possibilities will provide further input to enrich the theoretical research carried out by the groups that work within the CORLI consortium (Huma-Num TGIR) and more generally research in the following domains: corpora for older states of the French language, high-level annotation, methods and tools for corpus exploration, scientific quality of corpora, data accessibility. The availability of the annotation tool and the annotated corpus will allow researchers (in linguistics or related disciplines) to test their hypotheses.

Applications: the analyses of reference chains will enrich analyses from linguistics and other social sciences. The system for the automatic detection of reference chains will be promoted to professionals: documentalists, teachers, NLP companies. It will help to improve semantic search engines and make some operations easier, such as topic identification and indexing, summarization, data mining, or machine translation. It may also be of use for text simplification (for young readers or people with special needs).

Scientific enrichments: in addition to advances in the understanding of the expression of reference in texts, DEMOCRAT will integrate all available knowledge on isolated referring expressions and anaphoric chains. The project will propose a set of chain patterns which will provide researchers with bundles of converging evidence to characterize and distinguish between various textual genres and types. Finally, it will enrich existing databases, in particular those for Old French (CoRPTeF, BFM, SCRMF) and repositories (ORTOLANG).

The DEMOCRAT linguistic research remains to be compared to research on spoken French and to psycholinguistic research on how we resolve anaphora and coreferences. Spoken French is characterized by tone units and stresses that emphasize some referring expressions to the detriment of others. Studies along this perspective would take the research carried out with the ANCOR and DEMOCRAT corpora one step further. Psycholinguistic analyses of subjects asked to describe a story from a (controlled) succession of images show that the parameters that characterize the reference chains are numerous and stand in close relation to the concerns of the DEMOCRAT project. This should make it possible to specify new experimental materials.

The DEMOCRAT corpus consists of texts written in French, of different textual genres and from different periods. As such, it does not include the cross-linguistic dimension. Implementing this dimension through a multilingual annotation procedure is an important perspective, not only for a contrastive approach, but also for the automatic processing of multilingual documents and machine translation.

Concerning the annotation tools, the intrinsic nature of a reference chain (which can cover the whole text) raises visualization problems, for which visual metaphors and interaction procedures remain to be refined or even redesigned.

NLP is currently exploring many machine learning techniques, from support vector machines and conditional random fields to neural networks, including recurrent neural networks. Some techniques remain to be adapted to the specificities of the French language, taking into account the diachronic approach and the multiplicity of textual genres. This is the case, for instance, of domain adaptation techniques.

Several publications are planned for each of the four workpackages of the project, in dedicated national and international conferences and journals: (i) articles in linguistics that describe the facets of the DEMOCRAT (co)reference model (e.g. 'Diachro' colloquium, World Congress of French Linguistics (CMLF), journals such as 'Langue Française', 'Meta', 'Discours', Journal of French Language Studies); (ii) articles that describe the annotation methodology and procedure ('Journées de Linguistique de Corpus', 'Corpus' journal, Joint ACL-ISO Workshop on Interoperable Semantic Annotation, International Journal of Corpus Linguistics); (iii) articles that describe the annotation platform and the possible ways to explore annotated data (International Conference on Statistical Analysis of Textual Data, International Conference on Language Resources and Evaluation); (iv) NLP articles (French Conference on Natural Language Processing, 'Traitement Automatique des Langues' journal, International Conference on Computational Linguistics and Intelligent Text Processing). The DEMOCRAT corpus, the extension of the TXM software, and the various NLP tools implemented during the project will also be published.

All these scientific productions will be uploaded to the HAL platform, in their original version or in the authors' draft version depending on copyright terms. The project website will automatically retrieve them from HAL. In due course, the website will also provide additional documents such as the corpus annotation manual. It will also point to public pages dedicated to the DEMOCRAT tools: the TXM platform (with its own user manual and announcements for training sessions) and tools for the automatic detection of coreference chains.

The DEMOCRAT project aims to develop linguistic research on French, and in particular on issues of text structuring, through a detailed and contrastive analysis of reference chains (successive references to the same entity) in a varied corpus of texts covering the entire history of written French (9th-21st centuries). The project will make available to the scientific community: (i) an integrated and discursive model of referring expressions and reference chains, (ii) an annotated corpus that can be used as a reference corpus as well as a training corpus for international evaluation campaigns on coreference, (iii) a tool for manual annotation, computer-aided annotation and annotated data management, and (iv) a system for the automatic identification of coreferences. The corpus to be annotated with reference chains will be one million words long, i.e. about 100 000 annotated units.

Motivations: (i) the need for an integrated model of referring expressions that allows the modelling of reference chains and that is both precise from a linguistic point of view and formal enough to allow computational applications, (ii) the need for attested linguistic data, diachronic in particular, that make it possible on the one hand to assess variations in chain composition, and on the other hand to serve as a reference corpus for the French language, covering semantic data and not only morphological or syntactic data, (iii) the need for a unified platform for corpus management, from visualization to querying and statistical computing, including the annotation of phenomena from various linguistic dimensions, and (iv) the need for a natural language processing tool for the identification of reference chains for the French language.

Model and corpus. Although many descriptions of referring expressions exist, there is no integrated description for modelling coreference chains. Nor is there any prediction about their typology or their textual behaviour. Moreover, there is no corpus allowing an assessment of the historical development of their composition, nor any corpus allowing a cross-linguistic comparison of their modes of composition. One corpus with anaphora annotations (ANCOR) exists for spoken French, but there is no annotated corpus available for written French, i.e. for texts involving long reference chains. Thus, the project aims at collecting a working corpus that is relevant and diversified enough to account for the varied compositional modes of reference chains, and at providing theoretical hypotheses on the notion of reference chain. These hypotheses should make it possible to annotate the documents. They should also facilitate the improvement of existing annotation tools. Copyright-free databases of annotated Old French texts will be used and enriched: the Corpus Représentatif des Premiers Textes Français, the Base de Français Médiéval and the Syntactic Reference Corpus of Medieval French. For contemporary French, we will exploit extracts from the ANR ORFEO corpus.

Tool. The design and implementation of an annotation software platform, based on the TXM platform and enriched with ANALEC’s dynamic annotation functionalities, will lead to a new and unified framework for efficient and ergonomic annotation and for launching experiments on computer-aided annotation.

NLP system. To build a system for the automatic identification of reference chains, we will on the one hand use and optimize CROC (Coreference Resolution for Oral Corpus), a prototype designed and implemented at LATTICE using machine learning techniques, and on the other hand explore the design of hybrid systems, combining several kinds of machine learning techniques and knowledge-based rules such as the ones from RefGen, a tool designed at LILPA. DEMOCRAT will thus provide the first NLP system dedicated to the automatic detection of coreference chains for the French language. This system will participate in international campaigns.

Project coordination

Frederic Landragin (Langues, Textes, Traitements Informatiques, Cognition)

The author of this summary is the project coordinator, who is responsible for its content. The ANR declines any responsibility for its contents.

Partners

LATTICE Langues, Textes, Traitements Informatiques, Cognition
LILPA Linguistique, Langues et Parole
ICAR Interactions, Corpus, Apprentissages, Représentations

ANR funding: 385,736 euros
Beginning and duration of the scientific project: September 2015 - 48 Months
