Blanc SIMI 2 - Information, Matter and Engineering Sciences: Information Sciences, Simulation

Classification With a Very Large Number of Categories – Class-Y

Submission summary

Statistical learning has emerged in recent years as a key technology for processing and analyzing large amounts of data, whether they come from foundations, from business data, or from the web. Meanwhile, the growth of such data, their complexity, and the multiplication of needs generate new data processing problems that break the conventional learning framework, which now faces a set of fundamental challenges.

For example, many applications require classification into tens of thousands of classes, and there is currently no answer to this qualitative leap. Research in this area is still at a preliminary stage. One reason is that the basic principles in use are mainly inherited from models developed for recognizing simple patterns with a small number of unrelated categories. Even the most sophisticated models consider taxonomies of categories that are far from reflecting the nature and complexity of the classification problems encountered today.

In this project we propose fundamental work on classification methods with a very large number of classes. The aim is to revisit the foundations of the field and of its algorithms, and to study and develop a set of new methods that lead to truly operational algorithms. The target is the processing of large corpora of data with semantic content. This will be coupled with experimental work conducted as part of an international challenge on very large-scale data, to be organized by the project partners.

The major challenges are:
- The development of algorithms capable of scaling to very large numbers of classes. For example, DMOZ is a large web repository containing over 600,000 categories.
- Taking into account the complex relationships between these categories. For example, the online encyclopedia Wikipedia has more than 20,000 categories related to each other by different types of relationships.

Such challenges are found in many application areas such as:
• Filtering and classification of semantic data
• Annotation of multimedia objects
• Search engines
• Online recommendation
• Targeting of advertising

To meet the challenges of classification with many categories, the project proposes to explore solutions in three families of approaches:
- Models called "Big Bang", which address the problem without exploiting either structural information or the relationships between classes. The aim is to develop sparse, fast classification methods.
- Methods called "Top Down", which exploit a pre-existing taxonomy of classes or concepts. The aims are firstly to develop methods capable of determining the optimal cascade of classifiers from a given hierarchy, and secondly to propose accurate and fast hierarchical classifiers.
- Models that automatically infer relationships between classes from the data, without using a priori knowledge. This line of work is highly exploratory and aims to learn the class structure from data; it may handle situations in which classes are not structured (e.g., annotation tags).
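
As an illustration only, the difference between the first two families can be sketched in a few lines of Python. Everything in this sketch is an assumption made for the example and not part of the project: the toy taxonomy, the keyword sets, and the function names (score, flat_classify, top_down_classify) are hypothetical, and a keyword-overlap count stands in for trained classifiers. The flat routine scores every leaf class, while the top-down routine follows the taxonomy and scores only the children of the node it is currently visiting.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy taxonomy: each internal node maps to its children.
TAXONOMY: Dict[str, List[str]] = {
    "root": ["science", "sports", "arts"],
    "science": ["physics", "biology", "chemistry"],
    "sports": ["football", "tennis", "cycling"],
    "arts": ["painting", "music", "cinema"],
}

# Hypothetical keyword sets standing in for trained per-node classifiers.
KEYWORDS: Dict[str, Set[str]] = {
    "science": {"experiment", "theory", "laboratory"},
    "sports": {"match", "player", "race"},
    "arts": {"artist", "gallery", "concert"},
    "physics": {"quantum", "particle"},
    "biology": {"cell", "gene"},
    "chemistry": {"molecule", "reaction"},
    "football": {"goal", "penalty"},
    "tennis": {"racket", "serve"},
    "cycling": {"bike", "peloton"},
    "painting": {"canvas", "brush"},
    "music": {"melody", "concert"},
    "cinema": {"film", "director"},
}


def score(node: str, tokens: Set[str]) -> int:
    """Toy relevance score: number of the node's keywords found in the document."""
    return len(KEYWORDS[node] & tokens)


def leaves() -> List[str]:
    """Leaf classes are the children that have no children of their own."""
    return [c for kids in TAXONOMY.values() for c in kids if c not in TAXONOMY]


def flat_classify(tokens: Set[str]) -> Tuple[str, int]:
    """Flat approach: score every leaf class, ignoring the taxonomy."""
    candidates = leaves()
    best = max(candidates, key=lambda c: score(c, tokens))
    return best, len(candidates)  # predicted label, number of scorer calls


def top_down_classify(tokens: Set[str]) -> Tuple[str, int]:
    """Top-down approach: descend from the root, picking the best child each time."""
    node, evaluations = "root", 0
    while node in TAXONOMY:  # stop once a leaf is reached
        children = TAXONOMY[node]
        evaluations += len(children)
        node = max(children, key=lambda c: score(c, tokens))
    return node, evaluations


if __name__ == "__main__":
    doc = {"quantum", "particle", "experiment"}
    print(flat_classify(doc))
    print(top_down_classify(doc))
```

On this toy taxonomy the flat routine evaluates all 9 leaf classes, whereas the top-down routine evaluates 3 children at each of 2 levels. The saving grows with the number of classes, but a wrong decision near the root cannot be recovered further down, which is one reason the choice of classifier cascade matters.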

Finally, the project includes an evaluation task on two large corpora that are representative of these different situations. It will be proposed as an international challenge.

Thierry ARTIERES (UNIVERSITE PARIS VI [PIERRE ET MARIE CURIE])

The author of this summary is the project coordinator, who is responsible for its content. The ANR declines any responsibility for its contents.

LIP6-UPMC UNIVERSITE PARIS VI [PIERRE ET MARIE CURIE]
LIG UNIVERSITE GRENOBLE I [Joseph Fourier]

ANR funding: 406,839 euros
Beginning and duration of the scientific project: - 48 months
