ChairesIA_2019_2 - Research and teaching chairs in Artificial Intelligence - wave 2 of the 2019 edition

CooperAtive MachinE Learning and OpTimization – CAMELOT

The goal of the Chaire CaMeLOt is to address the modern challenges of crowd-sourcing and peer-grading, and to provide a unified mathematical framework and efficient algorithms for cooperative and decentralized learning and optimization.

crowd-sourcing (with a focus on species identification), decentralized learning and optimization (with privacy-preserving constraints), peer grading (from a modeling and analysis point of view)

Firstly, we aim at addressing challenges in crowd-sourcing biodiversity identification. Biodiversity informatics is a relatively young discipline (the term was coined in the early 1990's) that typically builds on taxonomic, biogeographic, or ecological information stored in digital form. Pl@ntNet [17], a citizen science project designed to identify plants automatically from (camera or mobile) pictures, will be key in CaMeLOt. The annotation process described earlier tends to remove part of the ambiguity present in plant images collected by Pl@ntNet, right from the beginning. Details on how to address the statistical and learning challenges in this context are given in Axis 1.

Secondly, decentralized learning and optimization has become increasingly critical in recent years. Optimization problems are ubiquitous in machine learning, as many popular algorithms can be interpreted as solutions of an optimization formulation: ridge regression, Support Vector Machines, boosting or neural networks, to name a few of the most popular techniques. Nowadays, with more and more citizens concerned by privacy issues, techniques avoiding the collection and storage of huge (possibly sensitive) datasets are gaining popularity. This requires rethinking standard architectures, from centralized versions to decentralized and distributed ones. An example of such a distributed learning scenario is federated learning, currently used by Google to improve the mobile keyboard experience: to preserve privacy, the device downloads the current model (say a recurrent neural network in this context), improves it by learning from the data on the phone locally, and then summarizes the changes as a light update that can be safely shared. Only this update is sent and averaged with other users' updates to improve the shared model: no training data leaves the user's device (a sketch in code is given below). We will detail in Axis 2 how we plan to handle such constraints, leveraging stochastic and gossip methods.

Last but not least, the development of Massive Open Online Courses (MOOCs) has made peer-grading popular even outside scientific contexts (where researchers are used to this practice). This grading technique, where a student both completes an assignment and corrects those of some peers, has also gained popularity in standard courses. Yet, the spread of peer-grading has been slow in academia (especially in France), and calls for additional theoretical foundations and improved software solutions to reach a wider audience. This challenge will be addressed in Axis 3.

These three Axes call for new mathematical and computational insights, leveraging tools from high-dimensional inference, large-scale (distributed) optimization and software development.
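
To make the federated learning scheme described above concrete, here is a minimal sketch of one round of federated averaging on a least-squares model, in Python/NumPy; the model, client data and step size are illustrative assumptions, not the actual production implementation.

import numpy as np

def local_update(w, X, y, lr=0.1, epochs=5):
    # One client: improve the shared model on local data (least squares)
    # and return only the parameter update; the data never leaves the device.
    w_local = w.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w_local - y) / len(y)
        w_local -= lr * grad
    return w_local - w  # the "light update" that is safe to share

def federated_round(w, clients):
    # Server: average the client updates to improve the shared model.
    updates = [local_update(w, X, y) for X, y in clients]
    return w + np.mean(updates, axis=0)

# Three clients holding private local datasets (synthetic here)
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0])
clients = [(X, X @ w_true + 0.1 * rng.normal(size=50))
           for X in (rng.normal(size=(50, 2)) for _ in range(3))]

w = np.zeros(2)
for _ in range(100):
    w = federated_round(w, clients)
print(w)  # converges near w_true without pooling any raw data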

• Collecting multiple labels per image, predicting sets of species, and incorporating this new information in the training of convolutional neural networks (CNNs).
• Smoothing top-K losses for deep learning in extreme classification (a sketch follows this list).
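
As an illustration of the second point, here is a hedged PyTorch sketch of a top-K hinge loss smoothed by averaging over Gaussian perturbations of the scores; it conveys the idea of stochastic smoothing but is not the exact calibrated construction studied in the project.

import torch
import torch.nn.functional as F

def smoothed_topk_hinge(scores, y, k=5, sigma=0.1, n_samples=10):
    # scores: (batch, n_classes) raw CNN outputs; y: (batch,) true labels.
    # Average a top-k hinge loss over noisy copies of the scores, which
    # yields a smooth, differentiable surrogate of the top-k error.
    s, (n, c) = n_samples, scores.shape
    pert = scores.unsqueeze(0) + sigma * torch.randn(s, n, c)
    wrong = pert.masked_fill(F.one_hot(y, c).bool(), float("-inf"))
    kth_wrong = wrong.topk(k, dim=-1).values[..., -1]  # k-th best wrong class
    idx = y.unsqueeze(0).expand(s, n).unsqueeze(-1)
    true_score = pert.gather(-1, idx).squeeze(-1)
    return torch.clamp(1.0 + kth_wrong - true_score, min=0.0).mean()

# Usage: 1000 classes (e.g. species); gradients flow back to the scores
scores = torch.randn(8, 1000, requires_grad=True)
y = torch.randint(0, 1000, (8,))
smoothed_topk_hinge(scores, y).backward()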

• Differential privacy and gossip
• Sparsity structure and decentralized asynchronous optimization

• Improved modeling and reliable estimation to spread peer-grading.
• Reducing the upward bias in peer grades.
• Quantifying the benefit for skill acquisition.
• Providing a better and simpler user interface to help spread this grading technique.

We have already:

- Provided a new dataset, Pl@ntNet-300K, a plant image dataset with high label ambiguity and a long-tailed distribution. It should be helpful both for biologists and for machine learners training models on realistic problems.

- Introduced a method for top-K classification that improves on the state of the art when combined with deep learning for extreme classification. The solution is compact and easy to incorporate in modern deep learning toolboxes.

- Provided a new framework for hyperparameter tuning that could be useful in a wide variety of the learning problems addressed in this project (sparse regression in centralized/decentralized settings, smoothing of loss functions with various smoothing levels); a toy sketch is given after this list.

- Analyzed and improved known guarantees for coordinate descent algorithms in a differentially private setting. These algorithms are well suited to sparse regression models under strong privacy constraints (such as in medical applications); a sketch is also given below.
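
Two of these results lend themselves to small illustrations. First, a toy sketch of hyperparameter tuning by implicit differentiation, on ridge regression where the inner solution is available in closed form (the cited work handles the harder non-smooth problems such as the Lasso); the data and step size are synthetic assumptions.

import numpy as np

def val_loss_and_grad(lmbda, X_tr, y_tr, X_val, y_val):
    # Ridge solution beta(lambda) = (X'X + lambda I)^{-1} X'y and, by
    # implicit differentiation, d beta / d lambda = -(X'X + lambda I)^{-1} beta.
    p = X_tr.shape[1]
    A = X_tr.T @ X_tr + lmbda * np.eye(p)
    beta = np.linalg.solve(A, X_tr.T @ y_tr)
    dbeta = -np.linalg.solve(A, beta)
    resid = X_val @ beta - y_val
    grad = (X_val.T @ resid / len(y_val)) @ dbeta  # chain rule
    return 0.5 * np.mean(resid ** 2), grad

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = X[:, :3].sum(axis=1) + rng.normal(size=200)
X_tr, X_val, y_tr, y_val = X[:100], X[100:], y[:100], y[100:]

log_lmbda = 0.0                  # optimize log(lambda) to keep lambda > 0
for _ in range(100):
    loss, g = val_loss_and_grad(np.exp(log_lmbda), X_tr, y_tr, X_val, y_val)
    log_lmbda -= 0.1 * g * np.exp(log_lmbda)
print(np.exp(log_lmbda), loss)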
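
Second, a minimal sketch of differentially private coordinate descent on a ridge objective, with Gaussian noise added to each partial gradient; the noise scale sigma is left abstract here, while the cited preprint gives the formal privacy accounting and handles general composite objectives.

import numpy as np

def dp_coordinate_descent(X, y, lmbda=1.0, sigma=0.5, n_iter=1000, seed=0):
    # Each iteration updates a single random coordinate with a noisy
    # partial gradient; per-coordinate step sizes use the coordinate-wise
    # smoothness constants of the ridge objective.
    rng = np.random.default_rng(seed)
    n, p = X.shape
    w = np.zeros(p)
    lipschitz = (X ** 2).sum(axis=0) / n + lmbda
    for _ in range(n_iter):
        j = rng.integers(p)
        grad_j = X[:, j] @ (X @ w - y) / n + lmbda * w[j]
        w[j] -= (grad_j + sigma * rng.normal()) / lipschitz[j]
    return w

# Toy run on synthetic data
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))
y = X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=100)
print(np.round(dp_coordinate_descent(X, y), 2))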

The biggest contribution so far is the introduction of a methodology to improve top-K performance for extreme classification (encountered in applications like Pl@ntNet), together with the release of a dataset (Pl@ntNet-300K) that should be of interest to the community. Having such a benchmark could foster new approaches and improve practical performance on this kind of application.
In parallel, the development of Benchopt, a platform for comparing the performance of optimization solvers, could have a similar impact, though over a wider range of applications.

- Pl@ntNet-300K: a plant image dataset with high label ambiguity and a long-tailed distribution
C. Garcin, A. Joly, P. Bonnet, A. Affouard, J.-C. Lombardo, M. Chouet, M. Servajean, T. Lorieul and J. Salmon (2021)
NeurIPS, Datasets and Benchmarks Track

- Stochastic smoothing of the top-K calibrated hinge loss for deep imbalanced classification
C. Garcin, M. Servajean, A. Joly and J. Salmon (2022)

- Differentially Private Coordinate Descent for Composite Empirical Risk Minimization
P. Mangold, A. Bellet, J. Salmon and M. Tommasi (2021) (preprint)

- Implicit differentiation for fast hyperparameter selection in non-smooth convex learning
Q. Bertrand, Q. Klopfenstein, M. Massias, M. Blondel, S. Vaiter et al. (2021) (hal-03228663)

Open source: Benchopt / GBIF

Statistics is etymologically rooted in the word "state": historically, a strong entity was needed to store a dataset covering a full population in a centralized way.
Data was commonly collected and stored once and for all, in a single place, prior to analysis.
With recent evolutions in storage, computation, mobile devices and privacy concerns, this paradigm has changed tremendously, leading to increasingly decentralized and cooperative ways to perform data analysis or to train learning systems.
In this Chaire, we plan to embrace this paradigm shift by focusing on three axes deeply rooted in this evolution.

The first one is crowd-sourcing.
This technique has contributed tremendously to the popularity of neural networks (now simply referred to as deep learning), though its role is often neglected: the key factors generally mentioned for this success are GPUs, a mature software ecosystem (TensorFlow or PyTorch), and the availability of larger and larger datasets (CIFAR-10 and ImageNet).
Yet, the last point relies entirely on crowd-sourcing for labeling millions of images.
Though already important for such datasets, crowd-sourcing is even more critical in applied fields, where knowledgeable persons to perform the labeling are hard to find.
We plan to address species identification in the context of a large-scale cooperative system, a successful example being Pl@ntNet (a citizen science project that identifies plants automatically from pictures).
We aim at building a theoretical framework and new algorithms for controlling and improving the quality of species identification in such a cooperative system, as this could help monitor biodiversity distribution and evolution.
The lack of quality training data is often a major obstacle to the automated monitoring of crowd-sourcing applications, and a major effort will be devoted to understanding and addressing challenges mostly due to ambiguities in the labeling process.
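
As a first concrete handle on this labeling ambiguity, the hedged sketch below alternates between a weighted consensus label per image and an accuracy weight per annotator, a simplified variant of the classical Dawid-Skene approach; the vote data is synthetic.

import numpy as np

def aggregate_labels(votes, n_classes, n_iter=10):
    # votes[i] maps annotator id -> label proposed for image i.
    # Alternate: consensus by weighted vote, then re-weight annotators
    # by their agreement with the consensus (smoothed accuracy).
    n_annot = 1 + max(a for v in votes for a in v)
    weights = np.ones(n_annot)
    for _ in range(n_iter):
        consensus = []
        for v in votes:
            scores = np.zeros(n_classes)
            for a, lab in v.items():
                scores[lab] += weights[a]
            consensus.append(int(scores.argmax()))
        hits, counts = np.zeros(n_annot), np.zeros(n_annot)
        for v, c in zip(votes, consensus):
            for a, lab in v.items():
                counts[a] += 1
                hits[a] += (lab == c)
        weights = (hits + 1) / (counts + 2)
    return consensus, weights

# Toy run: annotator 2 disagrees often and gets down-weighted
votes = [{0: 1, 1: 1, 2: 0}, {0: 3, 1: 3, 2: 0}, {0: 2, 1: 2, 2: 2}]
print(aggregate_labels(votes, n_classes=4))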

The second axis is decentralized optimization and learning, which emerged in telecommunications (or sensing) but has now spread much wider with the development of cloud services.
In such a context, the data is often stored in various places, for instance on different mobile devices in a network.
In parallel, privacy concerns have risen among the general public, and sharing sensitive data with a centralized entity (say a state or a company) is increasingly perceived as a threat.
Yet, learning on the union of all users' data could benefit the whole population.
To satisfy privacy constraints, new techniques have recently been developed, such as gossip or federated learning.
Their key feature is that users do not share their whole dataset with a (possibly not so) trusted entity, but rather share partial information with other trusted agents.
We plan to accelerate such schemes by reducing the communication between agents, randomly sparsifying the local updates of the training procedure.
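
A hedged sketch of this communication reduction: agents average their parameters by pairwise gossip, but each message only carries a random subset of coordinates; the ring network, sizes and keep_frac parameter below are illustrative assumptions.

import numpy as np

def sparsified_gossip_step(params, edges, keep_frac=0.1, rng=None):
    # One gossip round: each pair of neighbors exchanges and averages
    # only a random fraction of coordinates, cutting communication
    # roughly by a factor keep_frac.
    rng = rng or np.random.default_rng()
    p = params.shape[1]
    k = max(1, int(keep_frac * p))
    for i, j in edges:
        coords = rng.choice(p, size=k, replace=False)
        avg = 0.5 * (params[i, coords] + params[j, coords])
        params[i, coords] = avg
        params[j, coords] = avg
    return params

# Five agents on a ring, each holding a different local vector
rng = np.random.default_rng(0)
params = rng.normal(size=(5, 1000))
edges = [(i, (i + 1) % 5) for i in range(5)]
for _ in range(500):
    sparsified_gossip_step(params, edges, rng=rng)
print(params.std(axis=0).mean())  # near 0: agents reached consensus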

The last axis is peer grading in education.
Rooted in peer review, peer-grading has gained popularity with the spread of MOOCs over the last ten years.
In peer grading, students act as cooperative agents and perform grading on top of their original task of completing an assignment.
The benefits are many for both teachers and students: the repetitive burden is removed from the teacher's side, leaving more time and energy for refined feedback.
From the students' perspective, learning is improved by repetition (say, a student corrects three assignments), and the relevant answers and skills of a few can diffuse through the whole class more quickly.
Yet, there is an inherent bias in this context, as students naturally tend to over-grade their peers (friends).
Hence, we plan to reduce this drawback by leveraging statistical modeling, as well as by providing open source software for peer grading, since this would allow bug fixes from the community and access to logs for our own research.
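
As a hedged illustration of the statistical modeling involved, the sketch below fits the simple additive model grade = assignment quality + grader bias by alternating least squares on synthetic grades; the project considers richer models than this one.

import numpy as np

def debias_grades(grades, n_iter=50):
    # grades[i, j]: grade given by grader i to assignment j (NaN if unseen).
    # Alternate estimates of per-assignment quality and per-grader bias,
    # centering the biases for identifiability.
    quality = np.nanmean(grades, axis=0)
    bias = np.zeros(grades.shape[0])
    for _ in range(n_iter):
        bias = np.nanmean(grades - quality[None, :], axis=1)
        bias -= bias.mean()
        quality = np.nanmean(grades - bias[:, None], axis=0)
    return quality, bias

# Synthetic class where grader 0 over-grades by 2 points
rng = np.random.default_rng(0)
true_q = rng.uniform(8, 18, size=6)
grades = true_q[None, :] + rng.normal(0, 0.5, size=(4, 6))
grades[0] += 2.0
grades[rng.random((4, 6)) < 0.25] = np.nan  # each grader sees a subset
quality, bias = debias_grades(grades)
print(np.round(bias, 2))  # grader 0's upward bias stands out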

Project coordination

Joseph SALMON (Institut Montpelliérain Alexander Grothendieck)

The author of this summary is the project coordinator, who is responsible for the content of this summary. The ANR declines any responsibility as for its contents.

Partner

IMAG Institut Montpelliérain Alexander Grothendieck

Help of the ANR 599,324 euros
Beginning and duration of the scientific project: June 2020 - 48 Months
