BLANC - Blanc 2009

Unsupervised classification in high dimension: algorithms and applications – CLARA

Submission summary

With the advent of high-technology sensors and efficient computing facilities, massive data sets with complex structures are routinely collected across a wide range of scientific disciplines. To analyze such data, the statistical community faces multiple challenges, among them the development of sound statistical methodologies for high-dimensional data. When analyzing complex data sets, a common practice is to gain some initial intuition about the structure of the data by identifying meaningful groups of observations. The extracted groups can then be interpreted in the scientific context of the application at hand, or serve as a summary of the data for further analysis. This problem of extracting relevant groups of similar items from a data set is known as clustering or unsupervised classification.
The present project grew out of ongoing interdisciplinary collaborative research among the team members, and will focus on two application domains, namely Earth observation science and post-genomic biology. These two research areas typically give rise to complex and high-dimensional data for which standard clustering algorithms are completely ineffective. In view of the limitations of existing algorithms, adaptations and improvements as well as further theoretical developments are greatly needed.
The project contains a mathematical and an applied component. Building upon our previous work, the mathematical component will focus on theoretical properties (consistency, rates of convergence, limit partitioning, etc.) of clustering methods based on pairwise comparisons (typically, kernel k-means and spectral clustering). These techniques allow detection of groups of arbitrary shape in high-dimensional spaces; for this reason, they are receiving increasing attention from the scientific community. The mathematical properties of newly developed algorithms will be studied in the unified framework of the project.
The applied component is focused on specific problems in Earth observation science and post-genomic biology. In Earth observation science, these include the identification and discrimination of phytoplankton functional types for biogeochemistry applications, as well as the determination of aerosol types for radiation budget studies and for remote sensing of aerosol and surface properties. In post-genomic biology, the project will focus on the detection of cancer subtypes from tumor profiling, as well as the detection of groups of genes forming functional pathways from the clustering of gene expression time series. These two application domains give rise to large amounts of complex and high-dimensional data; adapted, improved and even completely new clustering algorithms are urgently needed to answer these scientific questions. To address the mathematical and applied aspects of the proposal, a team of 9 scientists with complementary expertise has been assembled. In groups of two and three, these team members have already collaborated on subjects related to the questions addressed in the project.

The author of this summary is the project coordinator, who is responsible for its content. The ANR declines any responsibility for its contents.

ANR funding: 135,000 euros
Beginning and duration of the scientific project: - 0 Months
