Tools and Research on Written and Spoken French – ORFEO

Submission summary

Over the last twenty years, corpus linguistics has developed thanks to the construction of reference corpora, which have made a considerable contribution both to linguistic science and to automatic speech processing (ATP). The stakes are high. In theoretical linguistics, the framework that underpins corpus-based investigations is “usage-based grammar”, which rejects the concept of a unique grammar for a language and postulates instead the concept of “multiple grammars”. These account for the diversity of spoken and written usage according to the situations in which discourse is produced. In ATP, it is now held that it is not reasonable to expect a generic “trans-genre” tool to process efficiently language productions that differ according to their conditions of production. ATP tools, like human speakers, need to adapt lexically and grammatically to the inherent variety of usage.
In this respect, France is in a singular position. It has recently set up a digital infrastructure, TGE Adonis, whose objective is to pool resources and technological standards and to ensure data preservation in the Social Sciences and Humanities, in collaboration with CLARIN, the network of resource and technology management centres. France does not, however, have a reference corpus complying with international standards, and it is unrealistic, for political and financial reasons, to consider building one within the framework of an ANR project.
The ORFEO project offers an alternative solution, namely the constitution of a Corpus for the Study of Contemporary French (CEFC).
We therefore propose to:
1. Collect existing free-access corpora with the consent of their original creators.
2. Collect data for non-represented genres with a view to constituting a genre-based corpus totalling 15M words (one fourth spoken, three fourths written). This will cover most of the usages of contemporary spoken and written French: formal/informal, monologues/conversations, etc.
3. Design an Internet portal giving access to the collected data and metadata, in accordance with the rights of the copyright holders and the legal conditions of use specified by the authors.
4. Guarantee the long-term preservation of the documents by storing the annotated corpora in digital resource centres (CNRTL, SLDR, or the forthcoming Equipex project presented by its supporting laboratories in connection with the Universities of Paris-Ouest and Orléans).
5. Automatically annotate the entire corpus, carefully adapting the tools to the different genres. The various layers of annotation will draw on the pilot experiments of the ANR Rhapsodie and Annodis projects, the former for prosodic and macro-syntactic annotation, the latter for discourse annotation. The annotations will also rely on an active learning process that makes work on a larger scale possible. Spoken data will receive an original treatment that takes prosody into account and uses a specific syntactic annotation scheme. Open-source query and analysis tools will be available on the Internet portal so that users can carry out the analyses of their choice.
6. Propose pilot studies in areas such as list effects, attitude markers and clause-combining phenomena. The analyses will follow a constructional approach, which takes into account both the formal and the semantic properties of linguistic units. This approach will help the specialists involved in the project (in syntax, prosody, discourse analysis, and interactional and co-referential analysis) to work together on the same subjects and to produce the first chapters of a usage-based grammar of contemporary French.

