DS08 - Sociétés innovantes, intégrantes et adaptatives

Exploitation of Historical Big Data for the Digital Social Sciences: application to financial data – HBDEX

The 2008 financial crisis showed the empirical weaknesses of explanatory economic and financial models, caused by the scarcity of long-term data needed to update the models and to test them in different historical and geographical contexts (especially those concerning structural changes). Through information technology innovations, HBDEX proposes a major contribution to our understanding of the functioning of financial markets and historical events.

Developing technologies and tools to spark the “Big Data Revolution” in the historical sciences

The three goals of the project are: (1) to design innovative and widely applicable technology covering the production chain of big data from historical tabular sources, and so overcome the technological bottleneck impeding the “Big Data Revolution” in the sciences of the past; (2) to integrate into the Equipex DFIH database the daily prices of the Coulisse from 1899 to 1939, producing an efficient tool for understanding how financial markets function; and (3) to exploit comparatively the data for the 1873-1939 period, analyzing data already produced by the Equipex DFIH (the Paris Bourse, 1796 to 1976; the Coulisse, 1873-1898) supplemented by the data produced by HBDEX.

Econometric and time-series analyses will be carried out, and agent-based models will be simulated. Visualization methods will support educational data-based services. A comparative, long-term analysis of the robustness of two types of market organization (auction and over-the-counter) will inform the debate about the reorganization and regulation of financial markets, a major public policy issue.

The developed technology could lay the foundations for a national platform to spark the Big Data Revolution in historical social sciences (the scaling-up in the variety and quantity of available data).

IRISA has developed a preliminary system for the stock prices listed on the Coulisse. For each page of the listings, this technology recognizes and reads the structural organization of the price tables by using a grammatical description written with the DMOS-PI method. This description is made on elements like lines of text (detected by deep learning methods) or vertical and horizontal line segments. This system will be driven by a strategy built at a document collection level.

This global strategy at a document collection level (using several weeks of price lists) is currently being developed by IRISA. It consists of an iterative process in which each iteration checks for the type of data: columns, sections, titles. The first step consists of structural recognition (in order to extract a page’s contents); this is followed by transversally checking the extracted information using a text recognition system.
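As a rough illustration of what transversal checking across a collection can mean, the sketch below takes the titles read at each position on several consecutive days and uses a majority vote to flag readings that disagree with the other days. It is not the IRISA strategy: the function name, the assumption that positions are already aligned from one day to the next, and the sample titles are all hypothetical.

```python
from collections import Counter


def cross_check_titles(pages):
    """Majority-vote check of titles read on several consecutive days.

    pages[d][i] is the title read at position i on day d (hypothetical layout:
    positions are assumed to be aligned across days).
    Returns (consensus, suspects); each suspect is (day, position, read, expected).
    """
    if not pages:
        raise ValueError("at least one page is required")
    n_positions = max(len(page) for page in pages)
    consensus = []
    suspects = []
    for i in range(n_positions):
        # Collect every reading available for this position.
        readings = [page[i] for page in pages if i < len(page)]
        winner, _ = Counter(readings).most_common(1)[0]
        consensus.append(winner)
        # Any day that disagrees with the majority is flagged for review.
        for day, page in enumerate(pages):
            if i < len(page) and page[i] != winner:
                suspects.append((day, i, page[i], winner))
    return consensus, suspects


# Made-up titles: the third day contains one misread character.
days = [
    ["Alpha Mines", "Beta Railways"],
    ["Alpha Mines", "Beta Railways"],
    ["Alpha Mlnes", "Beta Railways"],
]
consensus, suspects = cross_check_titles(days)
```

A real collection-level strategy also has to cope with titles that appear, disappear or move between days, which a fixed-position vote like this one does not handle.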

To take into account the specific knowledge and characteristics of the documents being processed, LITIS has developed a dedicated text recognition system. It is a deep neural network that combines convolutional layers, which extract visual features, with recurrent layers (BLSTM), which take the context into account. This optical model is coupled with a language model that can be configured with knowledge specific to the Coulisse documents (dictionary of listed price titles, specific syntax of dates and amounts, etc.).
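To make the role of document-specific knowledge concrete, here is a minimal sketch of two such constraints applied after recognition: snapping a recognized title to the closest dictionary entry, and checking an amount against an expected syntax. This is a simplified post-processing stand-in, not the language model integrated into the LITIS system; the lexicon entries, the amount pattern and the function names are assumptions made for the example.

```python
import difflib
import re

# Hypothetical title dictionary; the real one would come from the listings.
TITLE_LEXICON = ["Alpha Mines", "Beta Railways", "Gamma Petroleum"]

# Assumed amount syntax: digits, optional space-separated thousands,
# optional decimal part introduced by a comma or a point.
AMOUNT_PATTERN = re.compile(r"\d{1,3}(?: \d{3})*(?:[.,]\d{1,2})?")


def snap_to_lexicon(raw, lexicon=TITLE_LEXICON, cutoff=0.8):
    """Return the closest lexicon entry, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(raw, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else None


def is_valid_amount(raw):
    """Check that a recognized amount matches the assumed syntax."""
    return AMOUNT_PATTERN.fullmatch(raw.strip()) is not None


corrected = snap_to_lexicon("Alpha Mlnes")   # one misread character
unknown = snap_to_lexicon("zzzz")            # no plausible entry
```

Constraining the decoder itself, as the project describes, is generally stronger than correcting its output afterwards, because the constraint can influence which characters are chosen in the first place.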

A web-based user interface developed by LITIS allows users to view and correct the transcriptions by simultaneously viewing the transcribed results and the original documents from which these results originate.

The per-page recognition system for tabular structures has so far been applied to the 1899 and 1924 listings to generate the different field zones, which makes it possible to train the system to recognize textual fields using their syntactic context. Combined with the corrections and annotations made through the web-based user interface, this has allowed the recognition system to be trained on 70 000 samples so far. The results (a Character Error Rate of 0.84%) are promising and compare favorably with similar commercially available systems.
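For readers unfamiliar with the metric, the Character Error Rate is commonly defined as the number of character insertions, deletions and substitutions needed to turn the recognized text into the reference, divided by the number of reference characters. The sketch below implements that usual definition; whether the project computed its figure in exactly this way is an assumption, and the sample strings are made up.

```python
def levenshtein(ref, hyp):
    """Minimum number of character edits turning hyp into ref."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(pairs):
    """CER over (reference, hypothesis) pairs, pooled over all characters."""
    total_edits = sum(levenshtein(ref, hyp) for ref, hyp in pairs)
    total_chars = sum(len(ref) for ref, _ in pairs)
    if total_chars == 0:
        raise ValueError("reference text is empty")
    return total_edits / total_chars


# Made-up examples: one wrong character in each of two fields.
cer = character_error_rate([("125,50", "125,80"), ("Paris", "Parls")])
```

Here two errors over eleven reference characters give a CER of 2/11, about 18%; a CER of 0.84% corresponds to fewer than one erroneous character per hundred.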

The web-based user interface lets users correct the text recognition results by showing the images of the original document and the corresponding transcriptions side by side. The images of the listings for the previous and the following days are also displayed, so that the user can spot new price titles appearing on the Coulisse. The system also highlights ambiguous results to alert the user that intervention is needed. The corrected information is integrated automatically into a database that links it to the corresponding source document image.
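One simple way to decide which results to highlight as ambiguous is to look at the recognizer's confidence scores. The sketch below flags a field when its best candidate has low confidence or is not clearly ahead of the runner-up. The source does not say how ambiguity is detected in the LITIS interface, so the rule, the thresholds and the function name are all assumptions.

```python
def needs_review(candidates, min_conf=0.9, min_margin=0.2):
    """Decide whether a recognized field should be highlighted for the user.

    candidates: list of (text, confidence) pairs for one field, with
    confidence in [0, 1]. Thresholds are illustrative, not tuned values.
    """
    if not candidates:
        raise ValueError("at least one candidate is required")
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    best_conf = ranked[0][1]
    if best_conf < min_conf:
        return True   # the recognizer itself is unsure
    if len(ranked) > 1 and best_conf - ranked[1][1] < min_margin:
        return True   # two readings are almost equally likely
    return False


clear = needs_review([("125,50", 0.97), ("125,80", 0.02)])
unsure = needs_review([("125,50", 0.55), ("125,80", 0.44)])
```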

While the results from the recognition system are promising, they can still be made more reliable. This is being addressed by enlarging the training set: the first recognition system already developed is used to produce further automatic annotations, which are then validated through the user interface.

The overall strategy for the document recognition using the document collection is currently being developed. This introduction of the document collection context (over a period of several weeks) is important in order to extract the data (the listing titles as well as the listing prices) in the most reliable way. The main goal within this strategy is to decrease the errors, as well as human intervention, as much as possible. This work requires the integration of the strategies for data collection, the structural analysis of the pages, as well as the recognition of the pages’ contents (texts).

Conference: “Big-data historique : modélisation de stratégies d'analyse de collections de document”. Symposium International Francophone sur l'Ecrit et le Document (SIFED), 2019

Panel: “Financial centers, agents and transactions on the long run. Towards a multidimensional approach and tools of analysis” at the World Economic History Congress, Boston (July-August 2018)

Conference: “Combination of deep-learning and syntactical approaches for the interpretation of interactions between text-lines and tabular structures in handwritten documents”, International Conference on Document Analysis and Recognition, 2019

The major research trends involve innovative methods of production, processing and analysis throughout the whole value chain of data, as well as the development of original solutions for extracting new knowledge. However, “born digital” Big Data lacks the historical depth required to understand the current dynamics of society. Using a major technological innovation in ICT, HBDEX proposes a major contribution to our understanding of the functioning of financial markets and historical events. The financial crisis of 2008 once again highlighted the weakness of the empirical foundations of explanatory models. The Paris financial market was long organized as two co-existing markets: the centralized and regulated Paris Bourse, and the Coulisse, an unregulated bilateral OTC market. It is likely that the differences between these organizations and their evolution affected economic outcomes and major historical financial events such as the 1929 crisis. One of the bottlenecks to the understanding of financial markets is the scarcity of long-term data. These data are needed to update the stylized facts used in models and to test these models in different historical and geographical contexts, especially models concerning structural transformations. ICT are becoming ever more central to major scientific, economic and social issues, calling for close collaboration with other disciplines in order to design solutions adapted to their specific needs.
By providing a breakthrough in ICT, the interdisciplinary HBDEX project (computer science, economic history and economics) proposes to address a major concern in economics: improving the understanding of how financial markets function. It has three goals: (1) to design innovative and widely applicable technology covering the production chain of big data from historical tabular sources, and so overcome the technological bottleneck impeding the “Big Data Revolution” in the sciences of the past; (2) to integrate into the Equipex DFIH database the daily prices of the Coulisse from 1899 to 1939, producing an efficient tool for understanding how financial markets function; and (3) to exploit comparatively the data for the 1873-1939 period, analyzing data already produced by the Equipex DFIH (Paris Bourse, 1796 to 1976; Coulisse, 1873-1898) supplemented by the data produced by HBDEX. Econometric and time-series analyses will be carried out, agent-based models will be simulated, and visualization methods will support educational data-based services. A comparative, long-term analysis of the robustness of two types of market organization (auction and OTC) will inform the debate about the reorganization and regulation of financial markets, a major public policy issue.
The developed technology could lay the foundations for a national platform to spark the Big Data Revolution in historical social sciences, that is, the scaling-up in the variety and quantity of available data. HBDEX cooperates with the TGIR Progedo and Huma-Num on data dissemination and on making the most of the scanned sources. It relies on the experience, support and data of the Equipex DFIH. It will be supported by the computing power and hosting capacities of the Institute of Complex Systems Paris-Ile de France. It contributes to the momentum that could lead to the creation of a European leader in the field of financial data, CEDEFI, the outcome of the merger of the Equipex DFIH, the Equipex BEDOFI and EUROFIDAI, a research infrastructure project recently proposed to the Ministry of Research. A member of HBDEX is the European coordinator of EURHISFIRM, a project submitted in March 2017 under the Infrastructure Development Program of H2020 that brings together the most significant European experiences of financial data collection.

Project coordination

PIERRE CYRILLE HAUTCOEUR (ECOLE D'ECONOMIE DE PARIS)

The author of this summary is the project coordinator, who is responsible for its content. The ANR declines any responsibility for its contents.

Partner

PSE ECOLE D'ECONOMIE DE PARIS
LITIS LABORATOIRE D'INFORMATIQUE, DE TRAITEMENT DE L'INFORMATION ET DES SYSTÈMES
INSA-IRISA IRISA Institut de recherche en informatique et systèmes aléatoires Unité de recherche
CAMS Centre d'analyses et de mathématiques sociales Unité de recherche

ANR funding: 660,960 euros
Beginning and duration of the scientific project: December 2017 - 48 Months
