CE23 - Artificial Intelligence and Data Science 2023

Text-to-Video Synthesis for Creative Applications – RENAISSANCE

Submission summary

Synthetic media describes the use of Artificial Intelligence (AI) to generate and manipulate data, most often to automate the creation of entertainment content (images, video, audio). Technological progress in this field fosters the use of AI algorithms in the creative industry by revolutionising the process of content creation. Recently, text-to-image algorithms have progressed dramatically and have been made available to creatives and the general public. It is now possible to generate, in seconds, an image corresponding to a provided textual description, which has massive implications for creative applications.

Video is the most versatile and efficient medium for conveying information. Hence, being able to create video from text would take the AI creative revolution to a whole new level. The gains in cost and time would be several orders of magnitude greater than those text-to-image already brings.

However, several issues hinder the development of text-to-video tools, notably limited data availability and the lack of models suited to long, text-conditioned video generation. Moreover, to put these tools in the hands of artists and creative people, they must be accessible and easy to interact with, which poses a challenge in terms of interfaces and functionalities.

In the RENAISSANCE research project, we will work on creating text-to-video algorithms for creative applications. Our consortium is composed of Obvious, a world-renowned artist trio with a strong research background that works with artificial intelligence models to create art, and MLIA, a major player in the development of deep learning and neural networks in France, studying computer vision and natural language processing. This unique combination of an art collective and an academic laboratory is especially well suited to research on creative applications. Our work will focus on four important research directions.

First, we aim to create high-quality text-video datasets. Such datasets do not currently exist, which prevents the generation of high-definition videos with coherence and rich content. We will perform this task under a strict data use policy: we will only release models that have been trained on data for which we hold the rights (through partnerships, for instance).

Then, we will tackle the difficult problem of generating, from text, videos with complex movements and with spatial and temporal coherence. A hierarchy of increasingly hard problems arises, from simple videos such as a car moving in a straight line to a full tutorial video for a complex cooking recipe. For that, we will leverage the recent tremendous progress in text-to-image research, with the advent of diffusion models and transformer architectures for image/video processing and generation.

We will also work on releasing our models with interfaces specifically developed for creative use, with careful consideration of artists' needs. Obvious will leverage its network of artists and stakeholders in the creative industry to test interfaces and functionalities that will be relevant for final applications, thus increasing the cultural and economic impact of the project.

Finally, to make text-to-video algorithms useful in the creative industry, we will focus on the needs of artists and creatives, providing directly usable tools for non-technical individuals. We envision the development of functionalities such as mask-free editing (modifying an object or subject of the generated video by simply inputting text), personalization (adding yourself or a personal object to the concepts known to the model), or scenario handling (automatically dividing a complete scenario into prompts for different scenes to provide to a text-to-video model).

Obvious (SME, small and medium-sized enterprise)

The author of this summary is the project coordinator, who is responsible for its content. The ANR declines any responsibility for its contents.

Obvious
ISIR Institut des Systèmes Intelligents et de Robotique

ANR funding: 729,795 euros
Beginning and duration of the scientific project: October 2023 - 48 Months
