ChairesIA_2019_1 - Research and Teaching Chairs in Artificial Intelligence - wave 1 of the 2019 edition

Intelligent handling of imperfect data – INTENDED

Intelligent Handling of Imperfect Data

Accessing the relevant information contained in real-world data to support informed decision making is difficult, time-consuming, and error-prone due to the need to integrate data across multiple heterogeneous sources. Moreover, even if this first hurdle is overcome, a perhaps even more daunting challenge arises: how to obtain reliable insights from imperfect data? This chair will develop intelligent, knowledge-based methods for handling imperfect data.

Context and Challenges

Our starting point will be the ontology-based data access (OBDA) approach, which employs semantic knowledge and automated reasoning to bridge the gap between users’ information needs and how the relevant data is actually stored. While OBDA systems are growing in maturity, they too often fail to address the data quality issue, aside from issuing warnings when inconsistencies are discovered.

To tackle the data quality challenge, it is essential to equip OBDA systems with appropriate mechanisms for handling imperfect data: how to obtain meaningful answers to queries posed over imperfect data, and how best to generate a high-quality version of the data? While these questions have begun to be explored by myself and other researchers, with some promising first results, we are still quite far from having robust and widely applicable techniques for handling data quality in OBDA. The chair will substantially advance the state of the art (SOA) through a novel integration of sophisticated methods for tackling data quality issues into the OBDA approach.

Research Program

Our research program is structured into the following tasks:

1) Develop pragmatic methods for inconsistency-tolerant OBDA to treat more expressive settings, involving richer ontology languages, mappings, and temporal information, currently beyond the reach of the state of the art

2) Exploit qualitative & quantitative reliability information for facts and constraints (provided e.g. by rule mining and information extraction tools) to refine query results and annotate them with confidence scores

3) Address a wider range of data problems and achieve better overall results by developing a holistic approach that tightly integrates existing data cleaning methods (e.g. entity linking, statistical analysis)

4) Develop a customized user-sensitive approach by bringing users into the process, letting them give direction on how to address some types of errors, based upon their knowledge and how data will be used

5) Explore how the developed approach can be applied in practice, by means of a use case on clinical data

6) Demonstrate and promote the project results via implemented tools and experimental evaluation

Throughout the project, we shall enable confident decision making by ensuring that the developed approaches have clear semantics, that query results can be traced back to see which parts of the data and knowledge contributed to a given answer (and how), and that the confidence scores can be justified (explainability).

Results

The expected results of our foundational research will consist of:

- new formal frameworks for reasoning on imperfect data in the presence of constraints, ontologies, reliability information, and user preferences

- complexity results to pinpoint the difficulty of the reasoning tasks, to inform the development of algorithms

- novel algorithms and optimizations for reasoning on imperfect data and for constructing and analyzing user policies for handling imperfect data

The dissemination of these results will mainly take the form of publications in first-rank conferences and journals. We target the main artificial intelligence conferences (IJCAI, AAAI, ECAI) and the top specialized conference KR, but we may also publish our work in prestigious conferences in other areas related to the project, like database theory or medical informatics.

The applied component of the project will produce:

- a case study examining the utility of our data quality techniques in a hospital data use case

- an implementation and experimental evaluation of the most promising algorithms, and a demo to showcase them

Prospects

We expect that the pragmatic solutions developed in the project will be integrated into existing or future OBDA systems, which would be a huge step forward in making such systems sufficiently robust to tackle the integration of messy real-world data, thereby widening the potential applications for the OBDA approach.

The project will demonstrate within healthcare how diverse data quality techniques, issuing from separate research communities, can be fruitfully combined to improve their efficacy and applicability. Through our work and organized events, the chair will encourage collaborations between AI researchers working on OBDA and inconsistency handling and database researchers working on data quality and consistent query answering. More broadly, the chair will help address the major AI challenge of combining symbolic and numeric methods.

Scientific productions and patents

Meghyn Bienvenu: A Short Survey on Inconsistency Handling in Ontology-Mediated Query Answering. Special Issue on Ontologies and Data Management: Part II. Künstliche Intelligenz 34(4): 443-451, 2020.

Meghyn Bienvenu, Camille Bourgaux: Querying and Repairing Inconsistent Prioritized Knowledge Bases: Complexity Analysis and Links with Abstract Argumentation. Proceedings of the 17th International Conference on Principles of Knowledge Representation and Reasoning (KR), 2020.

Gianluca Cima, Marco Console, Maurizio Lenzerini, Antonella Poggi: Monotone Abstractions in Ontology-based Data Management. Proceedings of the 36th AAAI Conference on Artificial Intelligence (AAAI), 2022.

Submission summary

The huge wealth of data available nowadays holds tremendous potential to improve our lives, whether it be by advancing scientific knowledge, improving patient care, or supporting more informed policymaking. However, obtaining relevant and reliable information from real-world data is difficult due both to the need to integrate data across multiple heterogeneous sources and to the ubiquity of data quality issues (e.g. missing or incorrect facts). The ambition of the INTENDED chair is to take an active part in the paradigm shift towards explainable & trustworthy AI by developing intelligent, knowledge-based methods for handling imperfect data, enabling confident and informed decision making.

The starting point for the INTENDED project is ontology-based data access (OBDA), a promising declarative approach to data integration that exploits semantic knowledge and automated reasoning to bridge the gap between users’ information needs and how the relevant data is actually stored. While OBDA systems are growing in maturity, they too often fail to address the data quality issue.

To tackle this limitation, the INTENDED research program will (i) develop pragmatic methods for inconsistency-tolerant OBDA to treat more expressive settings, involving richer ontology languages, mappings, and temporal information, currently beyond the reach of the state of the art, (ii) exploit qualitative & quantitative reliability information for facts and constraints (provided e.g. by rule mining and information extraction tools) to refine query results and annotate them with confidence scores, (iii) address a wider range of data problems and achieve better overall results by developing a holistic approach that tightly integrates existing data cleaning methods (e.g. entity linking, statistical analysis), and (iv) develop a customized user-sensitive approach by bringing users into the process, letting them give direction on how to address some types of errors, based upon their knowledge and how data will be used. Throughout, we will ensure that the developed approaches have clear semantics and that it is possible to trace back query results to see how different pieces of data and knowledge contributed to a given answer.

To validate our approach, we will implement and test the most promising algorithms and make them available to the scientific community. Moreover, we will demonstrate its practical value through a hospital use case, aimed at adding semantic search facilities and reliability indicators to an interface currently being developed for displaying relevant information on incoming emergency care patients.

INTENDED gathers an interdisciplinary team of experts in all related fields (AI, databases, and medicine) with significant experience in ontology-based data access, inconsistency-tolerant query answering, and data and knowledge integration for public health. The project is perfectly in line with the new “Data and Knowledge” theme, part of the restructuring of the PI’s host team at LaBRI, and will develop new collaborations between the PI and Bordeaux Population Health on health data integration, which is strategic for U Bordeaux.

While healthcare is the primary application, the chair’s results have much broader applicability, and in particular are highly relevant for enterprise data integration, given the increasing interest by companies in using semantics to get more value from their data. Opportunities to exploit the project’s achievements via partnerships with public and private sector organizations will be explored.

INTENDED also includes an ambitious training program, which aims to introduce ontologies and Semantic Web standards (OWL, RDF, SPARQL) to a wide range of student populations, equipping U Bordeaux graduates with unique skills of high relevance to academic research, healthcare, and the private sector.

Bienvenu Meghyn (Laboratoire Bordelais de Recherche en Informatique)

The author of this summary is the project coordinator, who is responsible for its content. The ANR accepts no responsibility for its contents.

LaBRI Laboratoire Bordelais de Recherche en Informatique

ANR funding: 591,192 euros
Beginning and duration of the scientific project: August 2020 - 48 Months
