Active and batch Segmentation, Clustering, and seriation: toward unified foundations in AI – ASCAI
Unsupervised learning is one of the most fundamental problems of machine learning and, more generally, of artificial intelligence. In a broad sense, it amounts to learning some unobserved latent structure over data. This structure may be of interest per se, or it may serve as an important stepping stone within a complex data analysis pipeline, since large amounts of unlabeled data are more common than costly labeled data. Arguably, one of the cornerstones of unsupervised learning is clustering, where the aim is to recover a partition of the data into homogeneous groups. Besides vanilla clustering, unsupervised learning encompasses a large variety of related problems: hierarchical clustering, where the group structure is more complex and reveals both the backbone and the fine-grained organization of the data; segmentation, where the shape of the clusters is constrained by side information; and ranking or seriation, where no actual cluster structure exists but there is some implicit ordering of the data. All these problems have already found countless applications, and interest in these methods keeps growing with the amount of available unlabeled data. One example is crowdsourcing, where individuals each answer a subset of questions and where, depending on the context, one might want to cluster them by field of expertise, rank them by performance, or seriate them by affinity. Such problems are highly relevant for recommender systems, where the individuals are users and the questions are items, and for social network analysis.
The analysis of unsupervised learning procedures has a long history rooted in both the computer science and the mathematics communities. Thanks to recent bridges between these two communities, groundbreaking advances have been made in the theoretical foundations of vanilla clustering. We believe that, because clustering is so pervasive, these advances hold the key to a deep impact on the broader field of unsupervised learning. In this proposal, we first aim to propagate these recent advances in vanilla clustering to problems where the latent structure is either more complex or more constrained. We will consider problems of increasing latent structure complexity, starting from hierarchical clustering and heading toward ranking, seriation, and segmentation, and propose new algorithms that build on each other, focusing on the interfaces between these problems. As a result, we expect to provide new methods that are valid under weaker assumptions than those usually made (e.g. parametric assumptions), while also adapting to the unknown intrinsic difficulty of the problem.
Moreover, many modern unsupervised learning applications are essentially online in nature, and sometimes decisions must also be made sequentially. For instance, consider a recommender system that sequentially recommends items to users. In this setting, where recommendations are made sequentially and actively, it is important to leverage the latent structure underlying the individuals. While the fields of unsupervised learning and of sequential, active learning are both thriving, research at their crossroads has mostly been conducted separately by each community, leading to procedures that can be improved. The second aim of this proposal is therefore to bring together unsupervised learning and active learning, in order to propose new algorithms that leverage the unknown latent structure more efficiently in a sequential setting. We will consider the same unsupervised learning problems as in the batch learning part of the proposal, and will focus on developing algorithms that take full advantage of new advances in clustering and of our own future work in batch learning.
Project coordination
Nicolas Verzelen (Mathématiques, Informatique et Statistique pour l'Environnement et l'Agronomie)
The author of this summary is the project coordinator, who is responsible for its content. The ANR accepts no responsibility for its contents.
Partnership
TUM Technical University of Munich
LMO Laboratoire de mathématiques d'Orsay
MISTEA Mathématiques, Informatique et Statistique pour l'Environnement et l'Agronomie
UP Universitaet Potsdam / Institut fuer Mathematik
ANR funding: 272,496 euros
Beginning and duration of the scientific project: January 2022, 36 months