CE23 - Data, Knowledge, Big Data, Multimedia Content, Artificial Intelligence

Processes of Textualisation: Linguistic, Psycholinguistic, and Machine Learning Modelling – Pro-TEXT

Analyzing Textualization Processes and Modelling Their Dynamics

This research will lead to a comprehensive linguistic analysis of the textualization process, i.e. the real-time progressive construction of a text. During the textualization process, spontaneous language production is interrupted and segmented by pauses. A textual segment produced between two pauses is called a burst (Chenoweth & Hayes 2001): e.g. [pause] une cousine qui [pause] peut venir partager du temps avec elle pendant [pause] le [pause] w [pause] eek [pause] – [pause] end. [pause]
We will study bursts of writing in order to provide insight into the relation between regularities of language performance and cognitive and contextual constraints. The aim is to understand some of the layout mechanisms that allow language to give rise to novelty out of known and prefabricated data.
We contend that a better knowledge of the dynamics of textualization processes will
• make it possible to grasp the mechanisms that connect structure, genre constraints and pragmatic aims;
• help understand the way language unit layouts achieve qualitative leaps;
• unveil the moves that enable a qualitatively new product, the text, to be forged out of available data and structures.

The Pro-TEXT project will develop linguistic and psycholinguistic methods and machine-learning tools to model these regularities and provide evidence about patterns of text processing. The aims are i) to unearth the linguistic strings chosen by writers to build up their texts and the links by which they are interconnected; ii) to identify the types of sequences that constitute the linguistic material for textualization; iii) to establish the rules and layout regularities that support their organization in a formally and semantically valid text, and the combinatorial strategies used by writers in various contexts and text genres; iv) to interpret the pauses of production and the bursts of writing by identifying the cognitive processes underlying them and how variations in cognitive demands affect these pauses and bursts, as well as the linguistic forms and functions of bursts.
To do this, we carry out behavioural analyses (calculation of pauses, duration and speed of production of linguistic units, backtracking on the text), linguistic analyses (description of bursts of writing, modeling of the types of relations that allow them to be articulated to form higher-level units), statistical analyses (calculation of regularities, similarities, comparison of corpora as a function of a series of variables), and we apply machine learning methods to model the process of textualization.
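The behavioural measures listed above (pauses, duration and speed of production) can be illustrated with a minimal sketch. It assumes a keystroke log reduced to hypothetical `(timestamp_ms, character)` pairs and a pause threshold of 2 seconds; the threshold, the `Burst` class and the `segment_bursts` function are illustrative assumptions, not the project's actual tools, and the sketch ignores deletions and backtracking.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Assumption: the summary does not state a pause threshold; 2000 ms is illustrative.
PAUSE_THRESHOLD_MS = 2000


@dataclass
class Burst:
    """A textual segment produced between two pauses."""
    text: str
    start_ms: int
    end_ms: int
    preceding_pause_ms: int  # 0 for the first burst of the log

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms

    @property
    def chars_per_second(self) -> Optional[float]:
        # Speed is undefined for a burst made of a single keystroke.
        if self.duration_ms == 0:
            return None
        return len(self.text) / (self.duration_ms / 1000)


def segment_bursts(events: List[Tuple[int, str]],
                   threshold_ms: int = PAUSE_THRESHOLD_MS) -> List[Burst]:
    """Split a time-ordered keystroke log into bursts at every long pause."""
    bursts: List[Burst] = []
    chars: List[str] = []
    start = prev = None
    pause_before = 0
    for t, ch in events:
        gap = 0 if prev is None else t - prev
        if chars and gap >= threshold_ms:
            # The gap is a pause: close the current burst and open a new one.
            bursts.append(Burst("".join(chars), start, prev, pause_before))
            chars, pause_before = [], gap
        if not chars:
            start = t
        chars.append(ch)
        prev = t
    if chars:
        bursts.append(Burst("".join(chars), start, prev, pause_before))
    return bursts


# Hypothetical log: "le", then "w", then "eek", separated by long pauses.
log = [(0, "l"), (150, "e"), (2500, "w"), (5000, "e"), (5100, "e"), (5200, "k")]
for burst in segment_bursts(log):
    print(burst.text, burst.preceding_pause_ms, burst.duration_ms)
```

A fixed threshold keeps the sketch simple; a study of this kind might instead compare several thresholds or adapt them to each writer's typing speed.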

The expected research results are:
1. A detailed description of the linguistic performance units produced spontaneously during the textualisation process;
2. A categorisation of the types of pauses;
3. A modelling of the textualisation processes.
At mid-term, the results are:
- preparation and formatting of the corpus;
- complete behavioural annotation of the corpus;
- partial linguistic annotation of the corpus (establishment of an annotation guide, automatic annotation and manual correction of part of the corpus);
- ad hoc linguistic analyses targeting specific objects;
- textometric analyses;
- statistical analyses of behavioural data and raw linguistic data.

The next steps at this stage are:
- complete linguistic annotation of the corpus
- dynamic visualisation of the corpus
- making corpora available
- complete linguistic and textometric analysis
- statistical analysis taking into account the annotated data
- theoretical modelling
- articulation of behavioural and linguistic data
- modelling of the textualisation process using machine learning approaches


Achieved
3 journal articles
4 oral presentations
In progress
1 paper proposal accepted
2 international papers planned for July 2021
Delayed
1 inter-ANR study day
1 workshop in an international conference
3 papers in international conferences

Pro-TEXT
Basically, a text is a configuration pertaining to the highest level of linguistic complexity and constituting a communicative unit. But there is still no theoretical model of textualization as both a process and a product, despite the ubiquity and the empirical awareness of texts: in the current state of the art there is no agreed-upon theoretical definition of text, nor are we fully informed about how a text is built, and automatic text-generation approaches have not yet found a satisfactory model. Indeed, texts, and more specifically written texts, are produced under complex constraints, some of which were impossible to capture until recently because the textualization process itself was not accessible to observation.
Real-time recording of the writing process using keystroke logging provides access to the dynamics of the textualization process. Roughly, written or oral language performances are incremental linearizations constrained by temporality, accompanied by disfluencies due to revision. The Pro-TEXT project aims at:
• grasping the mechanisms that connect structure, genre constraints and pragmatic aims;
• helping to understand the way language unit layouts achieve qualitative leaps;
• unveiling the moves that allow a qualitatively new product, the text, to be forged out of available data and structures.
Three research teams, Clesthia (linguistics), CeRCA (cognitive psychology) and LIPN (computer science), will study bursts of writing, which are textual segments produced between two pauses:

e.g.: [pause] une cousine qui [pause] peut venir partager du temps avec elle pendant [pause] le [pause] w [pause] eek [pause] – [pause] end. [pause]

Four types of writing recorded in real time have been collected: educational reports of child protection, academic essays produced by Master's students, student writings (in French) and short French-English translations. An experimental corpus will complete these data.
Linguistic analysis led by Clesthia aims i) to unearth the linguistic strings chosen by the writers to build up their texts and the links by which they are interconnected; ii) to identify the types of sequences that constitute the linguistic material for textualization; iii) to establish the rules and layout regularities that support their organization in a formally and semantically valid text, and the combinatorial strategies used by the writers in various contexts and text genres.
Cognitive psychology analysis will interpret the pauses of production and the bursts of writing by identifying the cognitive processes underlying them and how variations in cognitive demands affect these pauses and bursts, as well as the linguistic forms and functions of bursts.
Based on machine-learning methodologies (collaborative and dynamical clustering), the Pro-TEXT project will elucidate the dynamics of the textualisation process by modeling relations between temporal indices of cognitive processes and the nature of linguistic forms produced in keylogged writing. Machine-learning incremental approaches will fill a gap in the analysis and representation of real-time language performance, while revealing regularities that remain unremarked under the methodologies used previously.
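To give a concrete sense of what an incremental approach to such data can look like, the sketch below uses sequential (online) k-means, a deliberately simple stand-in and not the collaborative and dynamical clustering methods of the project. Each burst is described by two hypothetical features, the preceding pause in seconds and the burst length in characters; the `OnlineKMeans` class and the feature choice are assumptions made for illustration.

```python
import math
from typing import List, Sequence


class OnlineKMeans:
    """Sequential k-means: centroids are updated one observation at a time."""

    def __init__(self, k: int) -> None:
        self.k = k
        self.centroids: List[List[float]] = []
        self.counts: List[int] = []

    def update(self, point: Sequence[float]) -> int:
        """Assign the point to a cluster, update that cluster, return its index."""
        p = [float(x) for x in point]
        # The first k observations seed the centroids.
        if len(self.centroids) < self.k:
            self.centroids.append(p)
            self.counts.append(1)
            return len(self.centroids) - 1
        # Otherwise pick the nearest centroid (Euclidean distance).
        idx = min(range(self.k), key=lambda i: math.dist(self.centroids[i], p))
        # Move it toward the point by a running-mean step.
        self.counts[idx] += 1
        rate = 1.0 / self.counts[idx]
        self.centroids[idx] = [c + rate * (x - c)
                               for c, x in zip(self.centroids[idx], p)]
        return idx


# Hypothetical bursts: (preceding pause in seconds, length in characters).
stream = [(0.5, 3), (4.0, 20), (0.7, 5), (3.0, 18)]
model = OnlineKMeans(k=2)
labels = [model.update(features) for features in stream]
print(labels)
print(model.centroids)
```

Because the model is updated burst by burst, cluster assignments can be tracked as the text unfolds rather than computed once on the finished product. In practice the features would need to be put on comparable scales before distances are computed.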
We will offer structured linguistic and psycholinguistic descriptions of textualization processes, multi-layer (linguistically and behaviorally) annotated dynamic corpora, incremental models of the textualization process, and clustering tools adapted to such types of dynamic data.

Project coordination

Georgeta Cislaru (Langage, systèmes, discours)

The author of this summary is the project coordinator, who is responsible for its content. The ANR declines any responsibility for its contents.

Partners

CLESTHIA Langage, systèmes, discours
LIPN Laboratoire d'Informatique de Paris-Nord
ETIS UMR CNRS 8051
CeRCA Centre de recherches sur la cognition et l'apprentissage

ANR funding: 517,959 euros
Beginning and duration of the scientific project: March 2019 - 42 Months
