CE23 - Intelligence Artificielle 2020

Interactive constraint elicitation for unsupervised and semi-supervised data mining – InvolvD

Submission summary

Early machine learning (ML) and data mining (DM) research tried to fully automate knowledge discovery processes, reducing human intervention. There were good reasons for this: humans cannot deal with large amounts of (high-dimensional) data, they tend to see patterns everywhere, and technical progress should help save time. Today, Google offers an AutoML service.
In a supervised setting, using labels for parameter tuning and model selection can work, but with only a few labels available (a semi-supervised setting) or none at all (an unsupervised one), the opposite is necessary: putting the user into the loop and letting her react to results improves the mining process. In short, one needs interactive data mining.
This creates new challenges: result presentation needs to help the user make decisions, feedback needs to be intuitive for the user and useful in the mining process, and where AutoML can run for hours, interactive systems need to interact often with the user.
Overcoming these challenges not only improves mining results but adds another benefit: users are more likely to trust results if they understand the process generating them, by participating in it or by understanding how others did. This is relevant in high-stakes settings, where investments (of time or money) or even human lives depend on correct results. Presenting results during interactive mining also helps to interpret final results, supporting hypothesis formation, data acquisition, and real-life experiments. Finally, recent regulations in the EU and the US define citizens’ rights to have explanations for algorithmic decisions affecting them, and require companies or institutions to provide them. Such requirements have motivated research attempting to translate black box models, e.g. in Deep Learning, into interpretable ones via an intermediate step. In such methods, however, user feedback cannot directly influence either the model or the learning process.
To work towards explainable results in unsupervised and semi-supervised DM, we propose the project InvolvD, which addresses several challenges: identifying sense-making visualizations, offering explanations for informed feedback, transforming them into useful constraints, and developing new algorithms exploiting those. Using clustering and symbolic pattern mining, we will study problem settings where user reactions can be fed back directly into the process itself.
The use case guiding our progress throughout the project is chemoinformatics, a prototypical domain for the issues outlined above. In drug design, exploratory data analysis is highly important, molecules need to be understood w.r.t. their structure and/or chemical properties, and experts have knowledge that is hard to exploit before seeing preliminary results.
We structure our project into five work packages treating the different aspects required for successful completion:
1) Translating different forms of user feedback into constraints that can be effectively exploited in structured pattern mining
2) Leveraging user feedback to form clusters that agree with the user's intuition and preferences
3) Identifying candidates for visualizing mining results, and using the latest image processing techniques (e.g. deep learning) to automatically evaluate them
4) Integrating clustering and pattern mining, developing ways to explain patterns in terms of clusters and vice versa, building an interface offering different feedback options, and exploiting feedback on patterns to shape cluster formation and vice versa
5) Ongoing evaluation of the developed solutions in chemoinformatics settings, gauging the usability of the integrated tool and the new insights experts derive from using it
While the use case will influence design decisions made in InvolvD, the general insights developed in the project will be applicable to all settings where trustworthy labels are lacking but understanding the mining process and interpreting the final results are important, such as environmental studies or law enforcement.

Albrecht Zimmermann (Groupe de recherche en Informatique, Image, Automatique et Instrumentation de Caen)

The author of this summary is the project coordinator, who is responsible for its content. The ANR declines any responsibility for its contents.

GREYC Groupe de recherche en Informatique, Image, Automatique et Instrumentation de Caen
CERMN CENTRE D'ETUDES ET DE RECHERCHE SUR LE MEDICAMENT DE NORMANDIE
LIFO EA 4022 LABORATOIRE D'INFORMATIQUE FONDAMENTALE D'ORLÉANS
LaBRI Laboratoire Bordelais de Recherche en Informatique

ANR funding: 575,817 euros
Beginning and duration of the scientific project: January 2021 - 48 Months
