CE23 - Artificial Intelligence and Data Science

ExtraCtion of LAtent knowledge in Documents by conjointly Analyzing Texts and TAbles – ECLADATTA

Submission summary

Identifying, extracting, structuring, and storing knowledge are major knowledge management tasks. They are challenging for organizations, partly because knowledge is scattered across different types of sources (e.g., databases, spreadsheets, textual documents) and represented heterogeneously. For instance, many data repositories within companies and on Open Data portals take the form of tabular data (spreadsheets), whereas PDF reports and Web pages frequently mix texts and tables. Hence, there is a need to structure and reconcile such scattered knowledge, which can be achieved by automatically extracting knowledge from heterogeneous sources to build and refine knowledge graphs. Such an extraction and refinement process enables mutual correction and completion between texts, tables, and knowledge graphs. Interestingly, texts and tables may be related within the same document or across documents and may complement one another, a complementarity that has been little exploited so far. Building on these observations, the ECLADATTA project aims to leverage this complementarity between tables, texts, and knowledge graphs to propose an end-to-end process that builds corpora of related texts and tables and performs joint knowledge extraction and reconciliation to enrich or update a knowledge graph. Such a process raises several issues that the ECLADATTA project will tackle. For example, assessing the relatedness between knowledge graphs, texts, and tables requires delimiting the exact text portion associated with a table and comparing atomic pieces of information while taking into account temporal validity or aggregates such as means or sums. This process will be evaluated on collections of public documents collected from the Web (e.g., Wikimedia projects such as Wikipedia, with the ambition of scaling to large corpora such as the Common Crawl) to enrich publicly available knowledge graphs such as Wikidata.

Yoan Chabot (ORANGE SA)

The author of this summary is the project coordinator, who is responsible for its content. The ANR declines any responsibility for its contents.

EURECOM EURECOM
Orange ORANGE SA
IRIT Université Toulouse 3 - Paul Sabatier

ANR funding: 601,637 euros
Beginning and duration of the scientific project: January 2023 - 42 Months
