DS10 - Défi des autres savoirs

A Corpus-based Macro-Syntactic Study of Naija (Nigerian Pidgin) – NaijaSynCor

NaijaSynCor. A corpus-based study of Naija (Common Nigerian Pidgin)

NaijaSynCor takes an exhaustive and in-depth look at the structure of Naija (Common Nigerian Pidgin) in Nigeria today. Deuber (2005) showed that Naija, as spoken by educated Nigerians in Lagos, is developing as a discrete language, separate from Nigerian English. This study sets out to assess whether this holds true for the rest of Nigeria, where Naija is spoken by over 75 million people. It examines diachronic, diatopic, diaphasic, diastratic, and genre variation.

NaijaSynCor studies the structure and functions of Naija (Common Nigerian Pidgin) in Nigeria today. It examines diachronic, diatopic, diaphasic, diastratic, and genre variation.

1. Building a 500,000-word reference oral corpus (the Reference Naija Corpus, RNC), collected at 11 survey points across the country, with a deeply annotated and manually checked sub-section of 100,000 words (the Naija prosodic and syntactic Treebank, GTB).
2. Comparing the RNC with the Nigerian component of the International Corpus of English (ICE Nigeria), both qualitatively and quantitatively.
3. Achieving a better understanding of how Naija varies along the formal-informal functional scale by studying its use on university campuses and in the media, more specifically on the radio (news reporting, editorials, information, etc.).
4. Understanding the patterns observed in the prosody of emerging languages, and linking the prosodic description of Naija to that of its grammatical and information structures through the use of NLP tools.

The project is a collaboration between Llacan, which studies lesser-described languages; Modyco, which specialises in the interaction of prosody and syntax in French and in the development of large treebanks; and two leading Nigerian experts on Naija (F. Egbokhare & C. Ofulue). The macrosyntactic framework developed in the ANR Rhapsodie project (Lacheret, Pietrandrea & Tchobanov 2014) has proved particularly effective in dealing with the specificities of oral corpora, e.g. piles stacking, disfluencies, repetitions, discourse markers, overlaps, co-enunciation, false starts, self-repairs and truncations. This method is data-driven, inductive (the relevant units are identified through annotation) and modular.
NaijaSynCor is a highly integrated programme divided into 4 interdependent work packages (WP), going from fieldwork and data collection (WP1) to the final characterization of Naija through the study of the annotated corpus (WP4).
• WP1 produces the RNC (Reference Naija Corpus). WP1 will be conducted in Nigeria, based on data collected during fieldwork. The RNC files will be continuously uploaded to the main database run by the research team at Llacan in Villejuif and made available to the other WPs.
• WP2 automatically annotates the 500 Kw reference corpus (RNC) for syntax and morphology; it will turn 150 Kw of the 500 Kw RNC into a treebank (GTB) with deep, fine-grained macro- and microsyntactic annotations.
• WP3 conducts an instrumental acoustic analysis of the prosodic features of Naija in relation to its information structure.
• WP4 carries out the final analysis of the corpus in terms of the relationship between Naija, Nigerian English and vernacular languages; it will be a collaborative effort between all members of the project. The aim of WP4 is (i) to run a study of the intonosyntax of Naija; (ii) to establish the identity of Naija through its diachronic, diatopic, diaphasic, diastratic and gender variation (Coseriu 1981).

WP2: A 150 Kw gold-standard treebank for Naija (manual correction); a 400 Kw treebank for Naija (automatic annotation); syntactic annotation guidelines for Naija; a tagger and glosser for Naija; a dependency parser for Naija (MATE trained on our gold-standard treebank); a 500 Kw treebank for Nigerian English (ICE Nigeria analysed with the Stanford parser for English).
WP3: Guidelines for the annotation of prosody in Naija. A 150 Kw treebank for Naija with time-aligned tiers containing information about prosodic units at different levels and their corresponding pitch contours, as well as the distributions of prominences and disfluencies. A database containing all tokens and the measurements of their prosodic correlates (mean F0, intensity, pitch excursion, velocity, duration, etc.), comprising both continuous data, such as time-normalized F0 contours and F0 velocity profiles suitable for graphical analysis, and discrete measurements suitable for statistical analysis.
WP4: Statistical analysis of the metadata for patterns and correlations (e.g. the spread of Naija as a first language and the factors influencing that spread), and the same analysis applied to one or more of the examples of variation found in step 1, in order to determine how the sociolinguistic variables recorded in the questionnaire correlate with that variation. Multivariate analysis of the corpus based on morphological, lexical, syntactic and prosodic features.
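The per-token prosodic measurements listed under WP3 can be illustrated with a minimal sketch. It is not the project's actual pipeline: the function names, the 100 Hz semitone reference and the choice of linear interpolation are all assumptions made for illustration. It takes the F0 samples of one token (times in seconds, F0 in Hz, at least two samples with strictly increasing times) and derives discrete measures (duration, mean F0, pitch excursion, peak velocity) together with a time-normalized contour.

```python
import math


def hz_to_semitones(f0_hz, ref_hz=100.0):
    """Convert F0 in Hz to semitones relative to ref_hz (assumed reference)."""
    return 12.0 * math.log2(f0_hz / ref_hz)


def time_normalize(times, values, n_points=10):
    """Resample a contour at n_points equally spaced times (linear interpolation)."""
    start, end = times[0], times[-1]
    resampled = []
    j = 0
    for i in range(n_points):
        t = start + (end - start) * i / (n_points - 1)
        # Advance to the segment [times[j], times[j + 1]] that contains t.
        while j < len(times) - 2 and times[j + 1] < t:
            j += 1
        t0, t1 = times[j], times[j + 1]
        weight = (t - t0) / (t1 - t0)
        resampled.append(values[j] + weight * (values[j + 1] - values[j]))
    return resampled


def velocity(times, values):
    """First difference of a contour, in units of `values` per second."""
    return [
        (values[i + 1] - values[i]) / (times[i + 1] - times[i])
        for i in range(len(values) - 1)
    ]


def token_measures(times, f0_hz):
    """Discrete prosodic measures for one token (hypothetical field names)."""
    semitones = [hz_to_semitones(f) for f in f0_hz]
    return {
        "duration_s": times[-1] - times[0],
        "mean_f0_hz": sum(f0_hz) / len(f0_hz),
        "excursion_st": max(semitones) - min(semitones),
        "max_velocity_st_per_s": max(abs(v) for v in velocity(times, semitones)),
    }


if __name__ == "__main__":
    # Toy rise-fall contour: 100 Hz -> 200 Hz -> 100 Hz over 200 ms.
    times = [0.0, 0.1, 0.2]
    f0 = [100.0, 200.0, 100.0]
    print(token_measures(times, f0))
    print(time_normalize(times, [hz_to_semitones(f) for f in f0], n_points=5))
```

Time normalization makes contours of tokens with different durations directly comparable for plotting, while the dictionary returned by `token_measures` is the kind of per-token row that could feed a statistical analysis.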

This innovative approach to the dynamics of contact and change in the areas of human behaviour and sociology of language will powerfully impact the methodology and technology of research on emerging languages. It is ground-breaking in that, for the first time, it will use new NLP tools that integrate syntax, intonation and information structure on a large, deeply annotated corpus to build a gold-standard benchmarking database.
Last but not least, it is hoped that the project will provide the annotated data and the NLP tools needed to build speech recognition systems for smartphones, opening wide development prospects in a country of over 160 million people, where a large part of the population is illiterate yet has access to modern communication tools used, for example, in paperless banking operations via smartphones.

Expected productions: Conference and journal papers; Open Source databases and treebanks; Open Source applications; a grammar and a dictionary of Naija.

The project is a collaborative effort between two research units that have proved their expertise in corpus annotation in previous programmes, namely Llacan, on lesser-described languages (Corpafroas and Cortypo), and Modyco, on the interaction of prosody and syntax in French (ANR Rhapsodie) and the development of large treebanks (ANR Orféo), together with two leading Nigerian experts on Naija (F. Egbokhare & C. Ofulue). The macrosyntactic framework developed in the ANR Rhapsodie project (Lacheret, Kahane et al. 2014) has proved particularly effective in dealing with the specificities of oral corpora, e.g. piles stacking, disfluencies, repetitions, discourse markers, overlaps, co-enunciation, false starts, self-repairs and truncations. This method is data-driven, inductive (the relevant units are identified through annotation) and modular.
The tools developed by the research team in these previous corpus study programmes are robust and mature enough for the team to focus on the linguistic problem posed by Naija: in its geographical and functional expansion, does Naija maintain its status as a discrete language, separate from Nigerian English, or does it undergo decreolization? While answering this question, the research programme aims to overcome two remaining technological challenges: (i) the automatic identification of illocutionary units using intonation data as a parameter; (ii) the building of a parser that integrates intonation data as a parameter.
Through the creation of a deeply annotated corpus, the project documents the emergence of Naija as a language at the national level, challenging existing theories of the development of creoles and languages in contact. Capitalizing on the latest developments in the area of corpus annotation, this innovative approach to the dynamics of contact and change in the areas of human behaviour and sociology of language will powerfully impact the methodology and technology of research on emerging languages.

Project coordinator

Mr Bernard CARON (Langage, langues et cultures d'Afrique noire)

The author of this summary is the project coordinator, who is responsible for its content. The ANR declines any responsibility for its contents.

Partner

LLACAN Langage, langues et cultures d'Afrique noire
Modyco Modèles, Dynamiques, Corpus, UMR7114

ANR funding: 356,642 euros
Beginning and duration of the scientific project: January 2017 - 42 Months
