Recognition of historical Chinese manuscripts
Large-scale digitization projects produce a great number of scanned images of cultural heritage documents every year. The objective of these projects is to make cultural heritage accessible online worldwide. These historical documents must be transcribed into editable text so that researchers and experts can edit, translate, search and browse their contents. The transcription of old documents, and especially of manuscripts, can be carried out only by experts or by a computer vision system trained by experts. Without computer assistance, the manual transcription of major historical documents by a small number of specialists will take several decades. Optical Character Recognition (OCR) is a fully automatic system that reads characters in digital images without human intervention. However, OCR performs poorly on historical documents, and especially on degraded handwritten manuscripts.
The main objective of this project is the development of the very first OCR for Chinese historical manuscripts. It is a very difficult task because the complete ancient Chinese writing system consists of more than 80,000 characters, each representing a one-syllable word. With so many characters, visual complexity is necessary to discriminate between them all, and some characters take up to 30 strokes. Modern documents contain only up to about 8,500 characters, and the most common characters have been simplified to an average of around five strokes. Moreover, ancient manuscripts show many degradations due to aging, which make them difficult to recognize. The spatial variability of handwriting is a further difficulty in this project.
The NSFC and the ANR have not funded any prior research project on the recognition of Chinese manuscripts. Guwenshibie will be the very first project for the development of an automatic recognition system for historical Chinese documents.
All previous work on OCR systems in the literature is based on image segmentation. All steps of the recognition process use information extracted from binary images. This traditional approach is well suited to modern printed or handwritten documents, which contain isolated characters without noise. But the binarization step, which transforms colour images into binary images, is a critical issue for historical documents, especially for degraded manuscripts.
We propose to develop a completely segmentation-free OCR which directly uses greyscale and colour information to describe character patterns and analyse the layout. We want to develop new methodologies in order to overcome the current limits of OCR technology for historical handwritten documents, in particular those written in Chinese. We aim to make significant progress on segmentation-free approaches for a robust Optical Character Recognition engine for historical manuscripts. The pattern recognition engine will be trained on noisy original images without image segmentation. Such an approach improves the performance of pattern recognition systems on images of documents severely degraded by aging. It is the only solution for processing historical documents, and especially ancient Chinese scrolls. OCR for Chinese historical documents is a new challenge for researchers in the document image analysis domain.
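The weakness of the binarization step discussed above can be illustrated with a minimal sketch. It is not part of the project's method: the pixel values, the fixed threshold of 128 and the helper names `binarize` and `count_errors` are all assumptions chosen for illustration. It applies a simple global threshold to a clean strip of pixels and to a hypothetical degraded one, and counts how many pixels end up mislabelled.

```python
# Toy illustration: values are hypothetical grey levels (0 = black, 255 = white).

def binarize(row, threshold=128):
    """Global thresholding: 1 marks ink (darker than threshold), 0 marks paper."""
    return [1 if value < threshold else 0 for value in row]

def count_errors(predicted, expected):
    """Number of pixels whose binary label differs from the ground truth."""
    return sum(p != e for p, e in zip(predicted, expected))

# Ground truth: which pixels actually belong to a stroke.
truth = [0, 0, 1, 1, 0, 0, 1, 0]

# Modern document: dark ink on bright paper.
clean = [250, 250, 20, 20, 250, 250, 30, 250]

# Aged manuscript (invented values): darkened paper, faded ink at
# indices 3 and 6, and a stain at index 5.
degraded = [180, 170, 90, 150, 175, 120, 160, 180]

clean_errors = count_errors(binarize(clean), truth)
degraded_errors = count_errors(binarize(degraded), truth)
print(clean_errors, degraded_errors)
```

On the clean strip the threshold recovers every stroke pixel, while on the degraded strip the faded strokes are lost and the stain is labelled as ink. The grey levels themselves still separate faded ink (150, 160) from paper (170 to 180), which is the kind of information a recognizer working directly on greyscale images can still exploit.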
Tsinghua University in Beijing and INSA Lyon have complementary skills: Tsinghua University brings great expertise in multilingual OCR, and INSA Lyon offers its experience in processing handwritten manuscripts and in the robust extraction of image information from degraded historical documents.
The project is an entirely novel proposal and a real challenge. There is no previous work on handwritten OCR for Chinese historical documents degraded by aging. The expected results are:
• A prototype of a robust OCR system for historical Chinese manuscripts. This prototype will be used for future commercial Chinese OCR.
• Publications about OCR for historical documents and segmentation-free approaches.
• A higher recognition rate than the state of the art, measured on a large test set drawn from various Chinese manuscripts.
• Image datasets with ground truth (transcriptions) for training OCR systems, for international competitions and for research purposes in general.
This research on robust pattern recognition for historical Chinese documents can be extended to manuscripts in other languages and scripts. It may also help improve the robustness of OCR for modern Chinese text captured by camera in natural scenes.
Guwenshibie will provide:
• New segmentation-free methodologies for a more robust OCR
• New robust features to describe complex character patterns without segmentation
• Several joint publications by Tsinghua University and INSA de Lyon in international journals, and communications at international conferences
• The online publication of experimental results
• An online OCR available as a Web service
Large-scale digitization projects (Google Books, Europeana, Gallica, British Library...National Library of China) produce a great number of scanned images of cultural heritage documents every year. The objective of the proposed project is to transcribe historical document images into editable text so that researchers and experts can edit, translate, search and browse their contents. It will focus on recognition methods for the historical Chinese manuscripts of the International Dunhuang Project (IDP), against a background of cross-cultural communication. This project addresses challenges for current OCR research, including degraded image quality, varied layouts and styles, and a large character set with few training samples. The research plan proposed by the two partners, LIRIS and TSINGHUA, tackles the problems of robust processing and recognition. The performance of the final system will be evaluated with standard databases and evaluation tools. The project is a meaningful exploration of historical Chinese manuscript recognition. The expected research results will also be useful for historical documents in other scripts, as both partners have research experience in multilingual document analysis and recognition.
Frank LeBourgeois (Institut National des Sciences Appliquées de Lyon - Laboratoire d'Informatique en Images et Systèmes d'information) – Franck.firstname.lastname@example.org
The author of this summary is the project coordinator, who is responsible for its content. The ANR accepts no responsibility for its contents.
TSINGHUA university Beijing
INSA DE LYON - LIRIS Institut National des Sciences Appliquées de Lyon - Laboratoire d'Informatique en Images et Systèmes d'information
ANR funding: 178,880 euros
Beginning and duration of the scientific project: December 2012 - 36 Months