DS0806 -

Big Statistical Data for a Mobile Society – Big_Stat

Centred on the scientific use of French administrative data in the social sciences, the project has three main components: research, feedback to data producers, and user training.

First, the project aims to investigate important topics in sociodemography for which administrative data complement survey data. Three topics form the core of the research: 1) assessment of double counts in surveys and censuses, and description of the family situation of inhabitants, taking into account persons enumerated or surveyed twice because they have two usual dwellings; 2) formation and dissolution of young adults’ couples; 3) analysis of the family and socio-economic situations of children of separated parents who share their time between the two parental homes.

Second, it aims to validate data, in collaboration with producers, in two ways: on the one hand, by comparing sources with sociological survey data (including qualitative interviews) to analyse the concrete situations behind family situations that are poorly identified or misidentified by official statistics; on the other hand, by returning to the initial data when a result looks anomalous. These validations can lead to file enrichments or corrections, as is already the case for the Permanent Demographic Sample of the National Institute of Statistics and Economic Studies (INSEE).

Finally, it aims to disseminate administrative data and make them available, which is reflected in the creation of a project mailing list, websites dedicated to each source, and training activities for users of these data.

For the three research topics, the main idea is to compare results obtained from different data sources and with different methods, in order to enrich our understanding of the behaviours under study. We use a very rich set of data: tax and social administrative data, as well as the census, provide very precise estimates of family situations, each based on its own specific definitions. For example, in tax data the family unit is defined as the taxable family (housing tax data make it possible to identify households and flatmates). Census data are based on the definition of the household as the group of people living in the same dwelling; family units are constructed within households. These data come from the housing form, which was revised in 2018 to include more precise information on family links, as well as on any other usual residence of each household member. Living as a couple is also identified through a question in the individual census form. Most household surveys identify co-resident or partially co-resident couples, while more specific surveys and in-depth interviews describe all couple situations (co-resident or not), as well as situations of co-residence without a life as a couple, or of life as a co-resident couple without long-term commitment.

Regarding children's family situations, children who share their time between the two parental residences after their parents' separation are described in the tax data on alternate care (sharing of tax shares), in the census since 2018, and in household surveys since 2004, based on the question on the presence of another usual residence. Comparison with specific surveys, such as the 2011 Family and Housing Survey, makes it possible to identify the situations of these children. For children present in the same year in both parental homes, the Permanent Demographic Sample makes it possible to describe their family situation based on the composition of their two homes, where their two families live.

The project initially gathered work on INSEE's Permanent Demographic Sample (PDS), which has been collecting census and vital statistics data since 1968. The recent addition of social and tax data makes it an extremely rich data file, but completing its documentation and validation benefits from user feedback. The analysis of the PDS, carried out jointly by INSEE and INED, made it possible to measure the frequency of double counting in the census, at 2.4%, for the first time since the adoption of the annual census surveys in 2004. It also made it possible to discuss the accuracy and scope of this result with the census authorities, particularly for estimating the complex family situations often associated with double counting (children of separated parents, young adults who have more or less left their parents' homes).

A large number of research projects have been launched: observation of same-sex couples in the census and the Family survey; changes in couple situations and transitions between conjugal states in France between 2010 and 2015; residential mobility following a divorce or the dissolution of a PACS, with a particular focus on parents and the role of the type of childcare; and measurement of fertility by birth order in the PDS and in civil registration data. This work has led to feedback to INSEE, the producer of the PDS. Work on data source validation is ongoing.

The work is collaborative, and many research projects are included in the program. We have created a project website (big-stat.site.ined.fr), in English and French, where we present the sources potentially available for research and the means of accessing them, the work carried out on the basis of these data and funded by this project, as well as all the projects. The website also presents a coherent set of theoretical articles on administrative data and big data in the humanities and social sciences, as well as numerous examples of work using such data, in France and elsewhere. The link to www.data.gouv is complemented by links to the main databases in the fields of demography, health, territorial equipment and services, economy and transport. The main contextual databases are referenced as well.

We then created a participatory site for users of the Permanent Demographic Sample (PDS) (https://utiledp.site.ined.fr), where they can consult the data documentation (which is not available elsewhere) and contribute to enriching it by sharing code for variables they have built, which other users can reuse. The site is modelled on the website for users of the ELFE child cohort (https://util-elfe.site.ined.fr/en). It is constantly updated and corrected.

Similar websites have been developed for the Annual Census Survey, the file containing the "common core questions" of INSEE household surveys, and data from the Caisse nationale des allocations familiales (Cnaf), which have been available since November 2018. Others are being considered for the European Survey on Income and Living Conditions (EU-SILC) and the Fidéli tax data file.

In spring 2018, we organized training in big data analysis methods and routines that can be used in the R language, and helped finance an INED training workshop on EU-SILC data. A summer school on the INSEE Permanent Demographic Sample will be organized in the summer of 2020.

This is in line with the original plan.

Toulemon, Laurent. 2017. Undercount of young children and young adults in the new French census. Statistical Journal of the IAOS, Vol. 33, p. 311–316. content.iospress.com/articles/statistical-journal-of-the-iaos/sji1054
Ferrari, G., Bonnet,

New family and demographic behaviours are leading to greater individual and social mobility, making it more difficult to define and observe actual family and housing situations. Simultaneously, big statistical data, i.e. data from administrative files covering the whole population, are now becoming available to the research community. The project aims to extend our knowledge of complex and hard-to-measure situations, using several data sources including big data, and to assess the strengths and weaknesses of several data sources that will be disseminated in 2016 by the French National Institute of Statistics and Economic Studies (INSEE).

Data necessary for complex demographic studies, such as the French Demographic Panel based on censuses and civil registration, tax data and family allowance data, are now becoming widely available. So far, they have rarely been used for research purposes in demography. We therefore propose, in a first and crucial stage, to assess and document for general use the big data sources recently made available for research, in collaboration with INSEE. This collaboration is key to building reliable and well-documented data sources. The knowledge of experts from several backgrounds and institutions will be essential to fully validate and test these data sources for various uses. In this step, we will check the consistency of population estimates based on censuses, surveys and administrative data, in terms of omissions and double counts, and the impact of discrepancies on the estimation of family situations and behaviours.

Two research questions, which are normally difficult to evaluate with standard surveys, will then be addressed, making use of diverse methods and sources: administrative data, censuses, population surveys and qualitative data from semi-structured in-depth interviews. First, intimate relationships at young adult ages are known for their volatility, and are therefore hard to study with standard survey data. The new data sources will make it possible to look at forms of partnership and union stability in relation to income, education, occupation and labour market integration. This will vastly increase our knowledge of the dynamics of early adulthood and will further our understanding of new forms of partnership. Second, we will look more closely at the situation of children whose parents are separated, and who are a major source of double counting in surveys and censuses. New data sources will provide a more accurate picture of the family situation of children, including those in complex living arrangements, in relation to their standards of living and poverty risk. Administrative data are very useful for studying transitions, while retrospective surveys are often complicated by recall bias and panel studies are weakened by attrition.

This project will be placed in a national and international perspective. It will provide an opportunity to combine the strengths of national institutions, while creating links with institutions abroad involved in the analysis of big administrative and census data. We will benefit from their experience and interact with big data networks to improve the quality and efficiency of our assessments and studies.

By making information, documentation and code for data use available on a website, this project will have a significant impact on the scientific community. It will also contribute to the enhancement of data quality and access. The publication of methodological and applied articles in internationally reputed journals will promote the dissemination of the project’s progress and findings.

Members of the project come from the French Institute for Demographic Studies (INED), INSEE, and the universities of Paris 1 Panthéon-Sorbonne, Paris Descartes, Lyons, Nancy and Strasbourg. A major output of the project will be to encourage and facilitate the use of big statistical data amongst scholars working in the humanities and social sciences.

Project coordinator

Mr. Laurent Toulemon (Institut National d'Etudes Démographiques)

The author of this summary is the project coordinator, who is responsible for its content. The ANR accepts no responsibility for its contents.

Partner

INED Institut National d'Etudes Démographiques

ANR funding: 291,584 euros
Beginning and duration of the scientific project: February 2017 - 48 Months

Useful links