CE23 - Data, Knowledge, Big Data, Multimedia Content, Artificial Intelligence

QualiHealth: Enhancing the Quality of Healthcare Data – QualiHealth

ANR QualiHealth: Enhancing the Quality of Healthcare Data

Hospitals and life-science institutions produce a tremendous amount of data on a daily basis during the healthcare process and ordinary scientific activity. Such data are highly valuable for improving the process of care delivery and prevention and for prospective clinical research.

This research project focuses on capturing and formalizing the knowledge of data quality from domain experts, enriching the available data with this knowledge and thus exploiting it in quality-aware healthcare.

ANR QualiHealth tackles the data quality issues hampering the usage of healthcare data for medical domain experts and data analysts.

Our project addresses the challenges of conveying data curation results in the clinical and healthcare domain by fulfilling the following concrete research objectives (numbered from 1 to 4).
(O1) Exploratory analysis and collection of anonymized datasets.
(O2) Declarative specification of quality indicators and annotations.
(O3) Quality-aware query answering and refinement.
(O4) Quality Indicators-driven Analytics.

The novelty of the QualiHealth project resides in the design of a full-fledged Quality Indicator (QI)-driven analytical platform that combines specification tasks for quality indicators tailored to clinical and preclinical data with query answering and human-guided query refinement tasks, along with complex analytical and learning tasks. It addresses the needs of computer scientists, data scientists and medical doctors in tandem, by providing a unified framework in which all these actors can rely on automated and semi-automated techniques to build quality-aware analytical tasks. As tangible results of our project, we expect a quality-certified collection of medical and biological datasets, on which quality-certified analytical queries can be formulated. We also envision the design and implementation of a quality-aware query engine. Our objective is also to contribute to advances in data curation, data cleaning, and highly complex analytics for healthcare data, which, to the best of our knowledge, do not currently exist in France.

The project gathers complementary expertise for tackling the problem of quality for healthcare data, and therefore combines data management and data intelligence techniques. The methods and technologies used span from databases to Artificial Intelligence and Machine Learning as well as BioInformatics and Health Informatics. Each task will benefit from cross-collaborations among the partners. The underpinnings of our methodology lie in the typical data curation pipelines and data quality dimensions, such as uniqueness, consistency, freshness and completeness. These dimensions are encoded as quality indicators, computed as annotations for the underlying healthcare data, and used in subsequent analytical and inference processes.
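As an illustration of how such dimensions could be encoded as indicators and attached to records as annotations, here is a minimal sketch. It is not the project's implementation: the record fields, the domain rule used for consistency and the function names are all assumptions made for this example.

```python
from datetime import date

# Hypothetical patient records; field names and values are illustrative only.
records = [
    {"patient_id": 1, "birth_year": 1950, "weight_kg": 72.0, "updated": date(2020, 1, 10)},
    {"patient_id": 2, "birth_year": None, "weight_kg": 65.5, "updated": date(2018, 6, 1)},
    {"patient_id": 2, "birth_year": 1980, "weight_kg": -4.0, "updated": date(2020, 3, 5)},
]

def completeness(record):
    """Fraction of fields that are not missing (None)."""
    values = list(record.values())
    return sum(v is not None for v in values) / len(values)

def is_unique(record, all_records, key="patient_id"):
    """True if no other record shares the same key value."""
    return sum(r[key] == record[key] for r in all_records) == 1

def is_consistent(record):
    """Assumed domain rule: a recorded weight must be positive."""
    weight = record["weight_kg"]
    return weight is None or weight > 0

def freshness_days(record, today):
    """Age of the record in days; smaller means fresher."""
    return (today - record["updated"]).days

def annotate(all_records, today):
    """Return copies of the records, each with a 'quality' annotation."""
    return [
        {**r, "quality": {
            "completeness": completeness(r),
            "unique": is_unique(r, all_records),
            "consistent": is_consistent(r),
            "freshness_days": freshness_days(r, today),
        }}
        for r in all_records
    ]

annotated = annotate(records, today=date(2020, 4, 1))
```

The annotations are stored next to the data rather than used to modify it, so later analytical steps can decide for themselves how to weigh or filter records.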

The project has established a long-term collaboration across disciplines, namely Computer Science (represented by the laboratories LIRIS, LIMOS and LIS), medical and biological partners (represented by HEGP and INSERM) and an industrial IT partner (Gnubila/Almerys). Data collection at the sites of the domain experts (the company and the biomedical domain) is ongoing and will lead to the construction of a reference dataset encompassing healthcare data and processes (in the form of queries and analytical tasks). At the same time, researchers from the different domains are collaborating toward the same objectives and targeting common publication venues in both domains. In its first 18 months, the project has led to two publications in top venues: the Proceedings of the VLDB Endowment (2020) and the International Conference on Extending Database Technology (EDBT). The algorithms and techniques developed are of interest to the respective communities and also support technology transfer to the company involved. In addition, the project coordinator was selected as a French Scholar Awardee 2020 at the Peter Wall Institute for Advanced Studies and the French Embassy in Vancouver (Canada) (https://pwias.ubc.ca/profile/angela-bonifati) in order to strengthen the collaboration between France and Canada. The visit could not take place due to the COVID-19 outbreak, but it has been postponed and will serve to disseminate the results of QualiHealth to a wide audience involving many universities and hospitals in the Vancouver area. It will also foster the collaboration between the coordinator and Prof. Raymond Ng (UBC), who is an external partner in the project.

Our project aims at helping caregivers better exploit their massive clinical and preclinical data repositories and reduce the costs due to erroneous medical diagnoses, with in-depth societal, scientific and economic impact. The future prospects of our project are twofold: giving all the actors involved in medical decision processes access to scientific data; and helping our partners enhance the accuracy of clinical and preclinical data within their own repositories, thus allowing better analyses and diagnoses on more accurate data. Our industrial partner will work closely with us on the potential exploitation of the research results towards the implementation of a ‘Quality-As-A-Service’ workbench. We expect that our project will let us produce high-impact publications, expertly trained students who will contribute to the French economy and technology sector, and research results that will have a direct impact on improving healthcare data quality in France, in Europe and internationally.

The project has led to the following international publications:
Ousmane Issa, Angela Bonifati, Farouk Toumani: Evaluating Top-k Queries with Inconsistency Degrees. Proc. VLDB Endow. 13(11): 2146-2158 (2020). Core Rank: A*; Impact Factor (2019): 3.56. hal.archives-ouvertes.fr/hal-02898931
Ugo Comignani, Noël Novelli, Laure Berti-Équille: Data Quality Checking for Machine Learning with MeSQuaL. EDBT 2020: 591-594 (demonstration). Core Rank: A. hal.archives-ouvertes.fr/hal-02865824
Moreover, the following article, published at BDA (National Conference on Advanced Data Bases) 2019, received the Best Paper Award: Ousmane Issa, Angela Bonifati and Farouk Toumani, "A Relational Framework for Inconsistency-aware Query Answering", Bases de Données Avancées (BDA) 2019.

Hospitals and life-science institutes produce a tremendous amount of data on
a daily basis during the healthcare process and ordinary scientific
activity. Such data are highly valuable as they can be used to improve the
process of care delivery and prevention and can also play a pivotal role in
prospective clinical research. However, clinical, biological and imaging
data are usually gathered by means of diverse data collection channels and
procedures with varying degrees of reliability and trustworthiness. As
a consequence, the collected data are usually scattered over heterogeneous
data sources and suffer from quality problems that hamper their use for
analysis purposes.

Classical data quality issues can be observed, including missing or
erroneous data. More complex problems also arise, for example when data
are reused in contexts other than the ones they were collected for.
Additionally, the distribution of data can evolve over time, creating
“data glitches” that can cause severe interpretation errors.

Today, no system is able to assist the clinicians and researchers in a
quality-aware exploration of their data. Overall, the lack of quality
indicators strongly limits an in-depth use of healthcare data in
translational research. We argue that more analyses of increasing
complexity and more interactions between clinical and pre-clinical medical
research would be feasible if the available data were annotated with
quality indicators, and if such quality indicators were also employed in
the querying and analysis of the available data.

This research proposal is geared toward a system capable of capturing and
formalizing the knowledge of data quality from domain experts, enriching
the available data with this knowledge and thus exploiting this knowledge
in the subsequent quality-aware medical research studies.

We expect a quality-certified collection of medical and biological
datasets, on which quality-certified analytical queries can be formulated.
We envision the conception and implementation of a quality-aware query
engine with query enrichment and answering capabilities. To reach these
ambitious objectives, the following concrete scientific goals must be
fulfilled:

An innovative research approach, which starts from concrete datasets and
expert practices and knowledge to reach formal models and theoretical
solutions, will be employed to elicit new quality dimensions and to
identify, formalize, verify and finally construct quality indicators able
to capture the variety and complexity of medical data.
Those indicators have to be composed, normalized and aggregated when
queries involve data with different granularities (e.g., accuracy
indications on pieces of information at the patient level have to be
composed when one queries a cohort) and of different quality dimensions
(e.g., mixing incomplete and inaccurate data).
In turn, those complex aggregated indicators have to be used to provide new
quality-driven query answering, refinement, enrichment and data analytics
techniques. A key novelty of this project is the handling of data that are
not rectified in the original database but sanitized in a query-driven
fashion: queries will be modified, rewritten and extended to integrate
quality parameters in a flexible and automatic way.
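The following minimal sketch illustrates the two ideas above: composing patient-level indicators into cohort-level ones, and applying quality constraints at query time instead of repairing the stored data. The aggregation functions (mean and minimum), the `min_accuracy` parameter and all field names are assumptions chosen for illustration, not the techniques developed in the project.

```python
# Hypothetical patient-level rows, each already annotated with indicators in [0, 1].
rows = [
    {"patient_id": 1, "age": 70, "accuracy": 1.0, "completeness": 1.0},
    {"patient_id": 2, "age": 66, "accuracy": 0.5, "completeness": 0.75},
    {"patient_id": 3, "age": 40, "accuracy": 1.0, "completeness": 0.5},
    {"patient_id": 4, "age": 81, "accuracy": 0.25, "completeness": 1.0},
]

def cohort_quality(cohort):
    """Compose patient-level indicators into cohort-level ones.

    Assumed aggregation: mean for accuracy, worst case (min) for completeness.
    """
    n = len(cohort)
    return {
        "mean_accuracy": sum(r["accuracy"] for r in cohort) / n,
        "min_completeness": min(r["completeness"] for r in cohort),
    }

def quality_aware_query(data, predicate, min_accuracy=0.0):
    """Extend a query with a quality constraint; the stored rows are not modified."""
    answer = [r for r in data if predicate(r) and r["accuracy"] >= min_accuracy]
    quality = cohort_quality(answer) if answer else None
    return answer, quality

# Cohort of patients aged 65 or more, keeping only sufficiently accurate rows.
answer, quality = quality_aware_query(rows, lambda r: r["age"] >= 65, min_accuracy=0.5)
```

Here the low-accuracy row for patient 4 is excluded from the answer but remains in `rows`, and the answer is returned together with its aggregated quality indicators.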

The adequacy of our declarative specification of quality indicators, and
the efficiency of query refinement and query answering, along with analytical tasks
leveraging such indicators will be assessed by domain experts on real
representative datasets collected by the project consortium.

Project coordinator


The author of this summary is the project coordinator, who is responsible for its content. The ANR declines any responsibility for its contents.


LIMOS Laboratoire d'Informatique, de Modélisation et d'Optimisation des Systèmes
INSERM U1016 Institut Cochin
LIS Laboratoire d'Informatique et Systèmes
UBC University of British Columbia / Department of Computer Science

ANR funding: 744,591 euros
Beginning and duration of the scientific project: January 2019 - 48 Months
