The evolution and maturity of information technology have fostered a new kind of research that applies the power of computers to literary and historical studies. Digital Humanities has become a discipline in its own right, whose goal, both societal and technological, is to go beyond the simple scanning of content. A large number of online corpora have appeared in the humanities and social sciences (e.g. Humanum). These corpora form bodies of data that could support future research. However, the heterogeneity of these databases, and the lack of tools for querying the data while knowing and comparing its quality, limit their use for validating scientific hypotheses or extracting knowledge. This is a brake, and sometimes an outright obstacle, on the development of research in the humanities and social sciences.

In this project we focus on historical prosopographical databases. Historians have numerous sources (books, acts, edicts, registers) whose study allows the construction of prosopographical databases: collections of records describing the careers of individuals (education, occupations, degrees, etc.), the places they attended, their teachers, and their scientific or literary output. These databases, whose purpose is the study of social groups, support a methodology of building and then confirming hypotheses. For instance, until the nineteenth century academics were an itinerant population. One may ask whether there are typical career paths depending on the period, and examine how the career path of a given individual differs from the typical path. Answering such questions requires mining, joining and enriching heterogeneous data, in which time is uncertain because dating was not standardized, the names and locations of properties changed frequently, and records are often incomplete.

However, there is currently no easy way to bring to bear measures that would inform users and tell them how they can exploit the data, extract reliable information from it, or insert it into another document.

Confronting hypotheses with the facts, or building rules, is currently done entirely by hand and is tedious. Testing a hypothesis means consulting thousands of records of varying quality to identify, with some degree of certainty, the individuals who support the hypothesis or, conversely, those who refute it. Each historian may thus formulate dozens of hypotheses, leading to hundreds of rules for each prosopographical database. Moreover, because prosopographical databases are multidimensional by nature (space, time, function, work, meetings, influences, etc.), many rules remain hidden from historians. This assessment is all the more essential because the database is intended to be fed by many experts, who have no better understanding of all the data it contains than ordinary users do.

Given this situation, the objectives of the DAPHNE project are (i) to automate the extraction of the knowledge on which historians base their research, (ii) to study the formalization of the validation process of historical research on this type of corpus and to characterize to what extent this process is computational, (iii) to introduce data quality considerations, and (iv) to propose a platform integrating the results obtained. All phases of the project require close collaboration between the historians and computer scientists in the consortium, and will lead to scientific results in both history and computer science.

