CE40 - Mathématiques, informatique théorique, automatique et traitement du signal

Foundations of Clustering Algorithms – FOCAL

Submission summary

Partitioning a dataset in such a way that data elements in the same part share common
features is a fundamental problem that arises in a broad range of applications.
This is key in various areas: in machine learning or data analysis for example,
this is used to preprocess and analyze a dataset. In the study of social networks,
the dataset is a graph of relationship and finding such a good partitioning leads to identify
communities. When trying to find good electoral districts, the dataset is the map
of all dwellings and a good partitioning will gather in the same district people
leaving close to each other.

Thus, computing a good partitioning is a major problem in a variety of research areas
and so the problem requires a deep understanding of its complexity together with the development
of new techniques that takes into account the applications at hand. Our project
aims at making a significant step in this direction.

To maintain a tight connection with the main clustering applications, we will
focus on specific types of inputs that have been successful in modeling machine learning
inputs, social networks and road and VLSI networks.

First, to model machine learning inputs and social networks we will consider
random graphs generated from the planted partition model. This model has proven
to be a successful model for developing algorithmic technique that have
a strong impact in practice. In the first part,
we will consider one of the most popular heuristic for graph partitioning,
the Kernighan-Lin heuristic, and aim at explaining its success on real-world instances
by proving that it performs well on graphs generated by the
planted partition model. Furthermore, we will consider the scenario where the dataset is
massive. In this context, the algorithm must deal with a stream of data and so
we will aim at designing new techniques for finding good partitioning in on graphs
generated by the planted partition model in the streaming setting.

Second, we will consider the scenario faced by various online service providers.
Given an evolving dataset -- where data elements are inserted and removed --
how to maintain a good clustering of the data? We will
bring together techniques from learning theory and approximation algorithms and
aim at delivering new algorithms that achieves the best possible guarantees.
Our approach has a potential for both fundamental and applied high impact:
We will be considering scenarios that have not yet been considered in
the online optimization community and develop techniques that are tailored
to the application. Our project will deliver prototypes of our algorithms to maximize
the dissemination of our new techniques.

Finally, motivated by the redistricting problem (namely, the problem of
finding good electoral districts), we will study classic graph partitioning
objectives such as sparsest cut or balanced cut. Since finding good districts is often
based on the road network interconnecting the dwellings, we will focus on embedded
graphs. How to partition an embedded graph into equal-size districts where points
in the same district are close to one another? We will tackle this
long-standing open problem using tools that we have developed in a recent
serie of work.

Therefore, our project tackles ambitious, long-standing open problems with
a potential of high impact in a broad range of communities. We believe that
our expertise on clustering, embedded graphs, beyond worst-case analysis
and local search techniques will make this project successful. Finally,
we believe that the project will help build a network of researchers
from various community to work on a central and fundamental problem.

Vincent Cohen-Addad (Laboratoire d'informatique de Paris 6)

The author of this summary is the project coordinator, who is responsible for the content of this summary. The ANR declines any responsibility as for its contents.

LIP6 Laboratoire d'informatique de Paris 6

Help of the ANR 171,720 euros
Beginning and duration of the scientific project: September 2018 - 48 Months

Explorez notre base de projets financés

ANR makes available its datasets on funded projects, click here to find more.