NAVigation In DOcument MASSes – Navidomass
There is increasing interest in digitally preserving and providing access to historical document collections held in libraries, museums and archives. Such archives of old documents are a unique public asset, forming the collective and evolving memory of our societies. Ancient documents have historical value not only for their physical appearance but also for their contents. Examples include unique manuscripts written by well-known scientists, artists or writers; letters, trade forms or official documents that help to reconstruct historical sequences in a given place or time; and artwork elements such as stamps, illustrations and covers. There is also a need to preserve the technical heritage belonging to companies or public institutions, for example old engineering drawings or cadastral maps. The challenge, currently widespread in Europe, is to convert such heritage into digital libraries that both preserve it and make it available worldwide through web-based portals. Citizens of the future should be able, through better-designed digital libraries, to gain access to a myriad of forms of knowledge from anywhere and at any time, in an efficient and user-friendly fashion. A number of initiatives focus on creating large digital libraries that are reachable worldwide. Google is running a project to create a global virtual library. A number of European libraries have started a similar joint project (http://www.dw-world.de/dw/article/0,1564,1566717,00.html). DELOS is a European Network of Excellence on digital libraries (http://www.delos.info/). The construction of such libraries poses an additional and important challenge: the analysis of documents and the extraction of knowledge. This goal requires effort in designing and developing semantic-based systems to acquire, organize, share and use the knowledge embedded in documents.
The field of Data Mining, combined with Document Analysis, offers a robust methodological basis for tasks such as descriptive modeling (clustering and segmentation), classification, discovery of patterns and rules, and retrieval by content, applied to document sources and databases. Old documents can be originals (paper, parchment, etc.) or images (already scanned, possibly using now-outdated technology). The key requirement is to be able to process these unique manuscripts, whether they are presented as free-flowing text (treatises, novels, ...) or structured at different levels of correspondence between physical and logical structure (letters, census lists, trade forms, ...). Degradation may be caused by a lifetime of use, and access must also be preserved to user annotations and corrections, stamps and unique artwork. Each class of document requires a different approach throughout the conversion process and lends itself to different levels of information extraction and description. In summary, the aim is to analyze historical documents in order to build the metadata used to access digital libraries. In the knowledge society, the interest goes beyond digitizing documents to creating semantically enriched digital libraries of the digitized documents. Enriching documents means adding semantic annotations to the digital images of the scanned documents. Such metadata is intended to describe, classify and index documents by their content, and would allow natural access to this cultural and scientific heritage from anywhere at any time. Thus, the main research goal of this project is to work in a collaborative framework on the Analysis of Old Documents. This goal consists in developing Pattern Recognition and Image Analysis techniques that extract knowledge from documents and convert them into Digital Libraries containing the scanned pages enriched with semantic information.
The partner groups of this project, Laboratoire Informatique, Image, Interaction - L3i (Université La Rochelle, France), QGar Team of the LORIA (Nancy), Laboratoire d'informatique de Tours (Université de Tours), Laboratoire CRIP5 (Université de Paris 5), Laboratoire LITIS (ex-PSI, Université de Rouen) and IRISA-IMADOC, have extensive and complementary experience in Document Image Analysis (DIA), attested by many publications in this domain over the last two decades and by a significant presence at all the international DIA events. In recent years, as of 2005, the teams total 25 journal publications and 57 papers in international conferences and workshops. These teams are currently working on different R&D projects on cultural heritage preservation, each in its own geographic environment and with local partners. In this "Action de Recherche Amont" dedicated to masses of data, we plan to share insights from the experience gained in these projects and to work together on topics related to DIA applied to old documents.

Expected scientific and technical outcomes

The main innovation of our joint research is the creation of metadata associated with old document images, instead of just digitizing documents. A number of projects exist in the field of cultural heritage preservation. Those related to old documents focus mainly on the early stages of digitization or on the creation of digital libraries of document images. However, the task of automatically extracting knowledge from documents is rarely included in such projects. Our challenge is therefore to draw on the pattern recognition, artificial intelligence and multimodal interface domains to build the components of an interactive framework for digitizing and annotating old documents and, as a consequence, to improve the document retrieval process.
In this domain, previous research projects have tackled specific questions for which mature tools are now available. However, some technological bottlenecks remain and require fundamental research to improve the quality of automatically produced annotations. The project focuses on the following points, which can be grouped into four research topics:

Document layout analysis and structure-based indexing: this part aims at automatically extracting the different layers of the documents (text, graphics, tables, captions, ...) and detecting fundamental structure elements (title, sub-title, page number) that are very important for indexing and navigation.

Information spotting: once the different layers of information have been characterized, this part aims at describing each class of information with relevant features, so that information spotting can be performed within a single layer or across different layers. This requires the development of innovative signatures, since the signatures classically used in recognition are too costly for such a process. The signatures to be determined cover very different layers: text (word spotting), graphics (drawing spotting), ...

Structuring the feature space in order to build an efficient information retrieval system: this point concerns the difficulty of building an efficient search system in a high-dimensional vector space. This problem, which is difficult in many domains, has never really been tackled in document analysis. The idea is to consider techniques that build relevant clusters in the feature space, and to develop a system giving rapid access to the sought information.
Interactive extraction and relevance feedback: in the context of ancient documents, our respective experience highlights the diversity of usages and the difficulty of reconciling the contradictory requirements of building systems that are generic and personalized at the same time. This difficult research point aims at providing the user with some inter
Project coordination
Université
The author of this summary is the project coordinator, who is responsible for its content. The ANR accepts no responsibility for its contents.
Partnership
ANR funding: 553,356 euros
Beginning and duration of the scientific project:
- 36 Months