LabCom V3 - Joint laboratories between public research organizations and SMEs/mid-cap companies 2018

International Document Engineering, Analysis and Security lab – IDEAS

Imagine, invent, design, develop, optimize and train the best algorithms for the automatic processing of business documents

Our vision for the LabCom is to imagine, invent, design, develop, optimize and train the best algorithms for the automatic processing of business documents, in order to offer an artificial intelligence service able to understand as many business documents as possible.

A laboratory built around three axes

To put this vision into practice, three scientific axes are currently under way, each with one lead on the University / L3i side and one lead on the company / Yooz side, along with associated recruitments. The three axes are:
- document classification, which aims to identify every type of document received by companies around the world;
- document mining, which aims to extract all the essential information carried by the document (whether written or inferred) in order to automate business processing;
- document fraud detection, which aims to identify any risk that the recipient of a document captures malicious information.

Results

All of the LabCom's activity is promoted through presentations at trade shows and public events (Fête de la science, open days) on the one hand, and at scientific venues (international conferences or journals) on the other. The IDEAS laboratory started slowly, with the structuring of the team and the definition of the research topics. Activity is now accelerating, and this dissemination work should increase over the next two years.
So far, dissemination to specialized professional networks has been very limited because of the public health situation (no official presentation has yet been possible). Scientific dissemination has mainly taken place at workshops or conferences (national or international). The exhaustive list of these events and publications is given below.
[1] Nadeem Iqbal Kajla, Malik Muhammad Saad Missen, Muhammad Muzzamil Luqman, Mickaël Coustaty, Arif Mehmood, Gyu Sang Choi: Additive Angular Margin Loss in Deep Graph Neural Network Classifier for Learning Graph Edit Distance. IEEE Access 8: 201752-201761 (2020)
[2] Joris Voerman, Aurélie Joseph, Mickaël Coustaty, Vincent Poulain D'Andecy, Jean-Marc Ogier: Evaluation of Neural Network Classification Systems on Document Stream. DAS 2020: 262-276 (rank A international conference)
[3] Ibrahim Souleiman, Joris Voerman, Mickaël Coustaty, Aurélie Joseph, Vincent Poulain D'Andecy, Jean-Marc Ogier: Apprentissage multimodal basé sur des modèles d’attention pour la classification de documents dans un contexte déséquilibré. EGC 2021, to appear (national conference)
In addition, three articles are currently under review at a rank A international conference.

Prospects

Continue the work already begun.

Submission summary

Artificial intelligence applied to digitized documents, i.e. digital versions of documents, makes it possible to automate both the interpretation of documents and the business processes that depend on their content.

Yooz offers a SaaS service for automating purchase requisitions and payments, recently extended to "all documents". Yooz's success, with more than 2,000 customers, rests on its strategy of technological innovation in automatic document understanding. The L3i laboratory of the University of La Rochelle has developed recognized expertise in document analysis algorithms and methodologies, applied in domains as varied as historical, administrative and cultural documents, natural-scene video, and multimedia document security, among others.
Yooz and the L3i have been partners since 2011 in several collaborative research projects focused on administrative documents, including pioneering projects in fraud detection. The IDEAS LabCom continues these collaborations and marks a new step in strengthening the L3i-Yooz partnership.

This work has led to a shared vision that defines the scope and ambition of our joint scientific and technological developments: we wish to invent, develop, optimize and train the best algorithms for the automatic processing of business documents, in order to offer an artificial intelligence service able to understand as many business documents as possible. In concrete terms, this shared vision is divided into three functional themes: document classification, document mining and document fraud detection.

The technological innovation resulting from this vision lies in performance and in coverage of the variety of documents (typologies, languages). Scientifically, it requires going beyond the state of the art in order to learn many classes of documents efficiently from very small training sets, which corresponds to industrial reality (in real-world use cases it is difficult to obtain document samples in advance). A second important innovation, both technological and scientific, is the proposal of fraud detection algorithms for document images, in particular for documents that have undergone print / scan sequences, for which the state of the art remains very limited.

Aware that existing methods and approaches vary in maturity, and in order to exploit technical and scientific innovations as early as possible, we propose to implement the LabCom program following a continuous integration strategy. This strategy combines short-term research on mature methods with more fundamental, medium- or long-term research on less mature issues.
Thus, in the short term, we plan to build on existing technologies to optimize the learning and cooperation of the different methods available, to enrich the Yooz expert document mining system, and to optimize the existing prototypes for detecting modifications in images produced at the end of the Securdoc project. In the medium term, we wish to explore other approaches that may overcome the limits of existing methods. We will focus on incremental deep learning, in order to benefit from the power of these techniques under the constraints of continuous scalability and tolerance to small training corpora, and on generalizing these algorithms to both classification and document retrieval tasks. Finally, in the long term, we want to study other steganographic and printer authentication techniques to detect quality inconsistencies in a document.

Mickael COUSTATY (LABORATOIRE INFORMATIQUE IMAGE INTERACTION)

The project coordinator is the author of this summary and is responsible for its content. The ANR declines any responsibility for its contents.

EA2118 LABORATOIRE INFORMATIQUE IMAGE INTERACTION

ANR funding: 300,000 euros
Beginning and duration of the scientific project: February 2019 - 36 Months
