DS0806 - Digital revolution and social change

A sociolinguistics of Twitter: social links and linguistic variations – SoSweet


The SoSweet project focuses on the synchronic variation and the diachronic evolution of the variety of French used on Twitter. SoSweet adopts a strongly interdisciplinary position, at the intersection of social media linguistics, sociolinguistics, natural language processing (NLP) and network science. It is based on a dataset containing several hundred million tweets, together with a social network of several million users.

Providing a detailed account of the links between linguistic variation and social structure in Twitter, both synchronically and diachronically.

The recent rise of novel digital services opens up new areas of expression which support new linguistic behaviors. In particular, social media such as Twitter provide channels of communication through which speakers/writers use their language in ways that differ from standard written and oral forms. The result is the emergence of new varieties of languages.
A characteristic of these varieties is that they exhibit large variability among communities of speakers and high innovation rates. A scientific description of such varieties must take this variability into account and explain how social forces and technical constraints regulate its dynamics. The main goal of SoSweet is therefore to provide a detailed account of the links between linguistic variation and social structure on Twitter, both synchronically and diachronically. Through this specific example, and aware of its biases, we aim to provide a more detailed understanding of the dynamic links between individuals, social structure and language variation and change. Because the data are digital and very large in volume, traditional sociolinguistic methods are not sufficient. To achieve our goal, we develop interdisciplinary, computational and data-driven approaches.

The SoSweet project's position is to rely on big data, which requires combining complementary skills. On the linguistic side, we draw on sociolinguistics and corpus linguistics. The former anchors us in a long tradition of the study of variation; the latter provides know-how for the analysis of textual data. On the computer science side, we combine natural language processing, which complements corpus linguistics and allows us to deal with the highly noisy language forms of our data, with network science, which complements sociolinguistics and provides the tools and concepts needed to take the social dimension of the data into account.

All our results indicate that our methodological approaches are relevant to the objectives of the project. By studying the distribution of parts of speech, we showed that several genres coexist on Twitter and are adopted by different communities of users. In parallel, we have shown that known, previously studied sociolinguistic variables of French carry over to Twitter and correlate with the socio-demographic structure of the population.
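As an illustration of what comparing part-of-speech distributions across users can look like, here is a minimal Python sketch. It is not the project's actual pipeline: the user names, the toy tag sequences, the Universal Dependencies style tag labels and the choice of Jensen-Shannon divergence as the distance are all assumptions made for the example.

```python
from collections import Counter
from math import log2

# Hypothetical toy input: each user mapped to the POS tags of their tweets.
# The tag labels are an assumption; the source does not name a tagset.
user_tags = {
    "user_a": ["NOUN", "VERB", "NOUN", "ADJ", "NOUN", "DET"],
    "user_b": ["INTJ", "PRON", "VERB", "INTJ", "PRON", "VERB"],
    "user_c": ["NOUN", "DET", "NOUN", "VERB", "ADJ", "DET"],
}

def pos_distribution(tags):
    """Relative frequency of each POS tag in a sequence of tags."""
    counts = Counter(tags)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two distributions, in [0, 1]."""
    tags = set(p) | set(q)
    # Mixture distribution: the average of p and q over all tags.
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in tags}

    def kl_to_mixture(a):
        # Kullback-Leibler divergence from a to the mixture m.
        return sum(a[t] * log2(a[t] / m[t]) for t in a if a[t] > 0)

    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)

profiles = {user: pos_distribution(tags) for user, tags in user_tags.items()}
d_ac = js_divergence(profiles["user_a"], profiles["user_c"])
d_ab = js_divergence(profiles["user_a"], profiles["user_b"])
print(f"a vs c: {d_ac:.3f}   a vs b: {d_ab:.3f}")
```

In this toy data, user_a and user_c have similar noun-heavy profiles while user_b relies on interjections and pronouns, so the first distance is smaller than the second. On real data, such pairwise distances could feed a clustering step to group users by writing style.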

Our earliest data were collected three years ago. This time span will allow us to tackle diachronic questions and to study how linguistic innovations do or do not spread within a population. In particular, it will let us examine the conditions governing this diffusion at social and temporal resolutions much finer than traditional approaches allow.
Moreover, in the past few months, we have begun to use machine learning methods, particularly at the crossroads of distributional approaches and deep learning, to build representations of the data that highlight the links between linguistic variation and social structure. We will invest further in this direction, which is expected to become a strong aspect of the project.

All of our publications are deposited on HAL with a mention of the project. The full list is available at sosweet.inria.fr/publications
To date there are 18 publications, plus two talks very recently accepted at international conferences.

The SoSweet project focuses on the synchronic variation and the diachronic evolution of the variety of French language used on Twitter.
The Web has entered all areas of our social life. As language is central to our social interactions, it is legitimate to ask how the Web has become a factor acting on language. This question is all the more timely as the recent rise of novel digital services opens up new areas of expression, which support new linguistic behaviors. In particular, social media such as Twitter provide channels of communication through which speakers/writers use their language in ways that differ from standard written and oral forms. The result is the emergence of new varieties of languages.

A characteristic of these varieties is that they exhibit large variability among communities of speakers and high innovation rates. A scientific description must take this variability into account and explain how social forces and technical constraints regulate its dynamics. The main goal of SoSweet is to provide a detailed account of the links between linguistic variation and social structure on Twitter, both synchronically and diachronically. Through this specific example, and aware of its biases, we aim to provide a more detailed understanding of the dynamic links between individuals, social structure and language variation and change.

Traditional methods are not suitable for addressing these questions. On the one hand, Twitter requires redefining fundamental concepts such as “addressee” or the distinction between public and private communication. On the other hand, while sociolinguistic studies are usually based on small samples, we will base our analysis on a corpus of 500 million tweets combined with the social network of the 10 million users who authored these tweets, complemented by socio-demographic data. This large volume of data leads us to rely heavily on computational methods from different areas. The SoSweet project will therefore adopt a strongly interdisciplinary position, at the intersection of social media linguistics, sociolinguistics, natural language processing (NLP) and network science.

Existing NLP tools are designed for standard forms of language and exhibit a drastic loss of accuracy when applied to social media varieties. To define appropriate tools, descriptions of these varieties are needed, yet producing those descriptions itself requires tools. We will address this circularity in an interdisciplinary way, by working simultaneously on linguistic description and on the development of NLP tools. For its part, network science provides us with tools for studying massive data from complex networks of users, through graph theory and computational modeling.
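To make the network side concrete, the following Python sketch measures whether users who are connected tend to use the same linguistic variant more often than chance would predict. It is only a sketch under stated assumptions: the user identifiers, the ties, the binary variable with variants "A" and "B", and the degree-preserving chance baseline are all hypothetical choices for the example, not the project's method.

```python
from collections import Counter

# Hypothetical toy data: undirected social ties between users, and for each
# user the variant of some binary linguistic variable they use most often.
edges = [
    ("u1", "u2"), ("u2", "u3"), ("u1", "u3"),
    ("u4", "u5"), ("u5", "u6"), ("u3", "u4"),
]
variant = {"u1": "A", "u2": "A", "u3": "A", "u4": "B", "u5": "B", "u6": "B"}

def same_variant_share(edges, variant):
    """Observed share of ties that link two users with the same variant."""
    same = sum(1 for u, v in edges if variant[u] == variant[v])
    return same / len(edges)

def chance_share(edges, variant):
    """Expected same-variant share if ties ignored variants.

    Each tie contributes two endpoints. If endpoints were paired at random
    while every user kept their number of ties, the probability that both
    ends of a tie carry variant x is the squared share of endpoints with x.
    """
    endpoint_counts = Counter()
    for u, v in edges:
        endpoint_counts[variant[u]] += 1
        endpoint_counts[variant[v]] += 1
    total = sum(endpoint_counts.values())
    return sum((n / total) ** 2 for n in endpoint_counts.values())

observed = same_variant_share(edges, variant)
expected = chance_share(edges, variant)
print(f"observed: {observed:.3f}   expected by chance: {expected:.3f}")
```

An observed share well above the chance value suggests that the variant is unevenly distributed over the social graph, which is the kind of link between linguistic variation and social structure the project sets out to describe. At the scale of millions of users, a dedicated graph library would replace these plain Python loops.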

The scientific program of SoSweet has been designed to favor interdisciplinary work: each of the four work packages (management, data collection and enrichment, variation and evolution analysis, outreach) involves all partners. The project will last 48 months. It involves four teams, each a leader in its own field of research. The principal investigator, ICAR, specializes in corpus linguistics and computer-mediated interaction. ICAR will carry out the tasks of unifying linguistic evidence (empirical and theoretical) with social cues (extracted from a massive network of sociological relations). Lidilem is in charge of adapting the sociolinguistic framework to the case of variation and communication on Twitter. Alpage, which specializes in natural language processing, takes care of linguistic enrichment, providing the other partners with normalized and structurally enriched forms of text. Alpage is also responsible for providing a distributional analysis of our corpus, by means of various forms of word clustering, in order to define sociolinguistic variants in the tweets. Inria DANTE, which specializes in the exploration of massive graph structures, will lead the crucial network analysis and will work on jointly integrating the sociological network and the linguistic distributional network of lexical relations.

Project coordination

Jean-Philippe Magué (Interactions, Corpus, Apprentissages, Représentations)

The author of this summary is the project coordinator, who is responsible for its content. The ANR accepts no responsibility for its contents.

Partners

Inria Paris-Rocquencourt Centre INRIA Paris - Rocquencourt
LIDILEM Linguistique et Didactique des Langues Etrangères et Maternelles
Inria - DANTE Centre de recherche Inria Grenoble Rhône-Alpes - DANTE
ICAR - CNRS Interactions, Corpus, Apprentissages, Représentations

ANR funding: 635,187 euros
Beginning and duration of the scientific project: September 2015 - 48 Months
