TSIA - Giga-models - Specific Topics in Artificial Intelligence (Giga-models for natural language processing and multimodal data) 2023

GEO-ReSeT

Generalized Earth observation with remote sensing and text

A new paradigm for extracting information from remote sensing data

In recent years, remote sensing images have become more available than ever thanks to major efforts from the public and private sectors. An emblematic example is the Sentinel satellites, launched from 2014 onward as part of the Copernicus programme. This mission provides, free of charge, wide image coverage, including Synthetic Aperture Radar (SAR) and multispectral data, with a short revisit time. These images contain information that is already used to track climate change, improve security, and understand and manage the environment. Exploiting the different levels of information provided by different modalities is an active field of research and is used in most remote sensing applications.
Over the last decades, the processing of Earth observation data has shifted strongly towards deep learning-based methods. While it has led to a general improvement in performance, this shift has other consequences.
First, collecting large, annotated and reliable datasets has become more important. As most of the proposed methods are based on supervised learning, their performance depends directly on the quality of the training data. Substantial work has therefore been dedicated to collecting such datasets. However, these datasets are:
- Task specific: the labels are collected for one specific task, e.g. land cover mapping in the case of BigEarthNet.
- Sensor specific: in remote sensing, algorithms generally rely on particularities of the well-calibrated sensors. While developing methods that are robust to sensor changes is an interesting research direction, it remains under-explored.
Second, the shift to supervised algorithms has raised the entry barrier to extracting information from remote sensing data. Using state-of-the-art information extraction methods now often requires finding a training dataset, having the resources to train a model, and finally running inference. While pre-trained models are available in some cases, the resources needed for inference remain a limitation for many users.
These two consequences weigh even more heavily when a user targets a specific task. In this case, the user has to collect a new dataset, come up with a new method and train it to achieve their goal. This greatly restricts the use of remote sensing data by new potential users in sectors such as journalism, urban planning or agriculture. The aim of this project is to provide the Earth observation community with a foundation model that can be used for any task and with any data modality.

Data collection and foundation model training

In this project, we propose a generic method to model the interactions between text and Earth observation data. First, we study multi-modal representations of different Earth observation data. We start by developing models for individual modalities (e.g. multispectral imagery) that can use data coming from sensors with different characteristics (e.g. spectral bands, spatial resolution, revisit time). We approach this by investigating auxiliary embeddings that go beyond spatio-temporal position and also capture other data characteristics such as spatial and spectral resolution. A second part is dedicated to collecting relevant texts that contain geographical semantics. We first propose to collect explicitly geocoded texts (e.g. Wikipedia pages about a geographical area), before studying models able to extract the geographical component from any given text. Finally, we build on these two work packages to create a multi-modal model that can work with any combination of modalities, including multispectral, SAR, vector data and text. This giga-model is called GEO-ReSeT. It represents the main methodological contribution of this project, and we hope that such a multi-modal pretrained model can serve as a foundation for a wide range of geospatial applications.
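To make the idea of auxiliary embeddings for sensor characteristics more concrete, here is a minimal sketch that encodes a band's central wavelength and ground sampling distance as sinusoidal features. The project text does not specify how these embeddings are built, so the function names, the number of frequencies and the normalisation ranges are all assumptions chosen for illustration.

```python
import math


def fourier_features(value, max_value, num_freqs=4):
    """Encode a scalar as sin/cos features at doubling frequencies."""
    x = value / max_value  # normalise to roughly [0, 1]
    feats = []
    for k in range(num_freqs):
        angle = math.pi * (2 ** k) * x
        feats.extend([math.sin(angle), math.cos(angle)])
    return feats


def sensor_embedding(wavelength_nm, resolution_m):
    """Concatenate encodings of spectral and spatial characteristics.

    The ranges (2500 nm, 100 m) are hypothetical normalisation constants.
    """
    return (fourier_features(wavelength_nm, max_value=2500.0)
            + fourier_features(resolution_m, max_value=100.0))


# Two Sentinel-2 bands with the same 10 m sampling but different wavelengths:
# B4 (red, about 665 nm) and B8 (near infrared, about 842 nm).
b4 = sensor_embedding(665.0, 10.0)
b8 = sensor_embedding(842.0, 10.0)
```

In this sketch the two bands share the half of the vector that describes spatial resolution and differ in the half that describes wavelength, which is the kind of information a model could use to handle sensors it was not trained on.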

Results

Initially, we worked on modeling the complementary information carried by the different Earth observation modalities. The idea is to project the information from the different modalities into a common latent space, taking into account the fact that, for each modality, certain dimensions of the latent space will not be relevant. The encoder for each modality therefore predicts a value for each dimension, together with a confidence score in the form of a variance. The latents from different modalities can then be merged using different functions (e.g. maximum confidence for each dimension, confidence-weighted average).
We then explored two complementary approaches for developing models that can encode information from any remote sensing sensor without retraining. This raises many difficulties, as differences in spatial, spectral and temporal resolution must all be taken into account.
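The two merging functions named for the confidence-based latent fusion can be sketched as follows. This is an illustrative reading, not the project's implementation: it assumes that lower variance means higher confidence and interprets the confidence-weighted average as inverse-variance weighting, which the text does not state explicitly. The function names and toy values are hypothetical.

```python
def fuse_max_confidence(means, variances):
    """Per dimension, keep the value from the modality with the lowest variance."""
    fused = []
    for d in range(len(means[0])):
        best = min(range(len(means)), key=lambda m: variances[m][d])
        fused.append(means[best][d])
    return fused


def fuse_weighted_average(means, variances):
    """Per dimension, average the modalities with inverse-variance weights."""
    fused = []
    for d in range(len(means[0])):
        weights = [1.0 / variances[m][d] for m in range(len(means))]
        total = sum(weights)
        fused.append(sum(w * means[m][d] for m, w in enumerate(weights)) / total)
    return fused


# Two modalities, two latent dimensions (toy values).
means = [[1.0, 0.0], [3.0, 4.0]]
variances = [[0.5, 4.0], [2.0, 1.0]]

max_conf = fuse_max_confidence(means, variances)    # [1.0, 4.0]
weighted = fuse_weighted_average(means, variances)  # approximately [1.4, 3.2]
```

In the toy example the first modality is more confident on the first dimension and the second modality on the second, so the maximum-confidence rule takes one value from each, while the weighted average pulls each dimension towards the more confident modality.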
The first approach, called Atomizer, models each pixel as an independent input to the model. This makes it possible to encode the specific characteristics of each sensor accurately and to handle images of different sizes and shapes. However, commonly used transformer-based approaches have a complexity that is quadratic in the number of inputs because of the self-attention operation. We therefore proposed to base this model on a Perceiver.
The second approach, called RAMEN, also works at the pixel level, but at a spatial resolution chosen by the end user. The different input images are interpolated to reach the desired resolution. The advantage of this approach is that it lets the user choose the trade-off between computational cost and spatial accuracy.
Finally, initial work on building a cross-modal image/text database has begun.
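The reason a Perceiver helps with per-pixel inputs can be illustrated with a simplified cross-attention step: a small set of latent vectors queries all input tokens, so the number of attention scores grows as the number of latents times the number of inputs rather than as the square of the number of inputs. The sketch below omits learned projections and is not the Atomizer code; all names and sizes are assumptions.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(latents, tokens):
    """Each latent attends over all input tokens (no learned projections)."""
    dim = len(latents[0])
    out = []
    for q in latents:
        scores = [sum(a * b for a, b in zip(q, t)) / math.sqrt(dim) for t in tokens]
        weights = softmax(scores)
        out.append([sum(w * t[d] for w, t in zip(weights, tokens)) for d in range(dim)])
    return out


def self_attention_scores(num_tokens):
    """Number of pairwise scores in full self-attention over the inputs."""
    return num_tokens * num_tokens


def cross_attention_scores(num_tokens, num_latents):
    """Number of scores when a latent array queries the inputs."""
    return num_tokens * num_latents


# Four toy pixel tokens summarised by two latent vectors.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
latents = [[1.0, 0.0], [0.0, 1.0]]
summary = cross_attend(latents, tokens)

# For 10,000 pixel tokens and 64 latents:
full = self_attention_scores(10_000)          # 100000000
reduced = cross_attention_scores(10_000, 64)  # 640000
```

With the hypothetical sizes above, the latent bottleneck reduces the number of attention scores by a factor of about 156, which is what makes treating every pixel as a separate input tractable.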

Prospects

We are continuing to work on foundation models based solely on visual information in order to improve performance. At the same time, we are continuing to collect textual data so that we can move on to training a vision-language model.

Scientific productions and patents

- de Turckheim, H. R., Lobry, S., Interdonato, R., & Marcos, D. (2025). Atomizer: Generalizing to new modalities by breaking satellite images down to a set of scalars. In BMVC 2025-36th British Machine Vision Conference. (Best presentation award)
- Houdré, N., Marcos, D., de Turckheim, H. R., Ienco, D., Wendling, L., Kurtz, C., & Lobry, S. (2025). RAMEN: Resolution-Adjustable Multimodal Encoder for Earth Observation. arXiv preprint arXiv:2512.05025.
- Houdré, N., Marcos, D., Ienco, D., Wendling, L., Kurtz, C., & Lobry, S. (2025). ProMM-RS: Exploring Probabilistic learning for Multi-Modal Remote Sensing Image Representations. In Proceedings of the Workshops of Winter Conference on Applications of Computer Vision (pp. 554-562).
- Ienco, D., & Dantas, C. F. (2024). DisCoM-KD: Cross-Modal Knowledge Distillation via Disentanglement Representation and Adversarial Learning. In BMVC 2024-35th British Machine Vision Conference. (Best paper award)
- Dantas, C. F., Gaetano, R., & Ienco, D. (2024). Semi-supervised heterogeneous domain adaptation via disentanglement and pseudo-labelling. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases (pp. 440-456). Cham: Springer Nature Switzerland.

Submission summary

This research proposal aims to develop a versatile foundation model for geo-spatial data that can be used for any task and with any data modality. By using location on the Earth's surface as the common link between different modalities, the model will be able to incorporate a variety of data sources, including remote sensing imagery, textual descriptions of places, and features in maps. Through self-supervised learning methods such as contrastive learning or multi-modal masked autoencoders, the model will leverage the large amounts of unlabeled geo-spatial data from these different sources to learn a better representation of any geo-spatial location and convey a semantic representation of the information.
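As an illustration of the contrastive learning option mentioned above, the sketch below computes a symmetric InfoNCE-style loss between image and text embeddings, where the pair at the same index is assumed to describe the same location. It is a generic formulation, not the loss used in the project; the function names, the temperature value and the toy embeddings are assumptions.

```python
import math


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where row k should select column k."""
    total = 0.0
    for k, row in enumerate(logits):
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[k]
    return total / len(logits)


def contrastive_loss(image_embs, text_embs, temperature=0.1):
    """Symmetric InfoNCE-style loss over co-located image/text pairs."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    # Cosine similarity between every image and every text, scaled by temperature.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))


# Toy embeddings: aligned pairs should give a lower loss than swapped pairs.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(images, [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss(images, [[0.0, 1.0], [1.0, 0.0]])
```

Minimising such a loss pulls together the embeddings of an image and a text that refer to the same place and pushes apart those that do not, which is how location can act as the common link between modalities without manual labels.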
The proposed foundation model has the potential to revolutionise Earth observation by enabling few-shot or zero-shot solutions to classical problems such as land-cover and land-use mapping, target detection, and visual question answering. It will also be useful for a wide range of applications with a geo-spatial component, including environmental monitoring, urban planning and agriculture.
By leveraging several data modalities, the foundation model will provide a more comprehensive and accurate understanding of the Earth's surface, enabling more informed decisions and actions. This will be particularly valuable for new potential users in sectors such as journalism, social sciences or environmental monitoring, who may not have the resources or expertise to collect their own training datasets and develop their own methods. The project thus moves beyond open Earth observation data towards democratised access to Earth observation information.

Sylvain Lobry (LABORATOIRE INFORMATIQUE PARIS DESCARTES)

The author of this summary is the project coordinator, who is responsible for its content. The ANR declines any responsibility for its contents.

Inria Centre Inria d'Université Côte d'Azur
LIPADE LABORATOIRE INFORMATIQUE PARIS DESCARTES

ANR funding: 593,269 euros
Beginning and duration of the scientific project: September 2023 - 48 months
