FRAL - Programme franco-allemand en Sciences humaines et sociales

Segmentation of oral corpora – SegCor

Submission summary

Although a great variety of segmentation systems have been tried since the beginning of research on talk-in-interaction, a principle-based, knowledge-grounded approach to segmentation is still lacking. This causes problems for the usability and exploitation of spoken language and interaction corpora. The project will therefore develop a method of segmentation for oral/interactional language data that is adequate for analysing data from talk-in-interaction at different levels and for various communities of researchers. It takes as its starting point three large collections of French and German audio and video recordings of various interaction types (the databases CLAPI and ESLO for French and FOLK for German), as well as approaches to segmentation put forward in the literature on conversation analysis, interactional linguistics, pragmatics and corpus linguistics. The project is the first approach to segmentation that is both based on comprehensive treatment of a sufficiently large and diverse empirical basis and takes the cross-linguistic dimension into account. The results will enable better use of the three databases, and the project also aims to establish best practices for oral corpora beyond the databases used in the project. The results will contribute to analyses of structures of talk-in-interaction (based on a retrieval of structural properties of turns at talk), to language teaching, to contrastive analysis of spoken German and French, and to the development of language technology for interaction data.
Methodologically, the project is based on two different perspectives: 1) a qualitative, multidimensional approach which takes into account segmentation indices, problems and criteria and leads to tested and improved segmentation guidelines, and 2) a quantitative, unidimensional approach based on selected criteria, in which possible boundaries are identified automatically and then classified by human annotators according to their relevance for segmentation. Both approaches use a pilot test corpus of 10 excerpts of 10 minutes each for each language, which represents the overall data diversity in terms of situation types. In a second phase of the project, the corpus will be extended to 5 hours and will take into account findings from the initial phase. Contrastive aspects will receive particular attention from the beginning of the project.

Véronique Traverso (ENS de Lyon - laboratoire ICAR)

The author of this summary is the project coordinator, who is responsible for its content. The ANR declines any responsibility for its contents.

ENS Lyon, ICAR: ENS de Lyon - laboratoire ICAR
Institut für Deutsche Sprache, Mannheim
Université d'Orléans-CNRS, LLL: Université d'Orléans-CNRS, Laboratoire Ligérien de Linguistique

ANR funding: 246,329 euros
Beginning and duration of the scientific project: December 2015 - 36 Months
