CE23 - Artificial intelligence and data science 2022

Multimodel Streaming Data Management – POLYFLOW

Submission summary

The need to address data variety paved the way for multidatabases, which the Turing Award winner M. Stonebraker endorsed. Multidatabases include multistores, which expose a unified declarative query interface over heterogeneous data, and polystores, which combine the benefits of multistores with polyglot querying, i.e., they expose multiple query interfaces over heterogeneous data models.
Both aim to have a uniform view over heterogeneous data by reducing the required Extract-Transform-Load (ETL) jobs.

The growing availability of sensor networks, microservices, and edge-computing infrastructures has recently pushed the rise of streaming data as a unifying abstraction. For multistores and polystores, data streams are just a means for fast ingestion that enables low-latency analytics to be executed with a specialized engine, i.e., the Data Stream Management System (DSMS). However, "one size does not fit all" holds for DSMSs too. Processing heterogeneous data streams in real time is possible, yet it requires extensive programming effort to fill the data integration gap. Even when declarative languages exist, the lack of a streaming-data-integration theory imposes strong assumptions for merging different sources.

The progress in multi-model data management and the rise of streaming data suggest that the time is ripe for a paradigmatic change that enables multi-model and polyglot streaming data management (henceforth polystreaming). In support of the timeliness of the proposed investigation, we observe that:

O1) Data streams emerge as a natural abstraction to glue together specialized data systems. Streaming ingestion systems like Kafka are data-model agnostic, i.e., they guarantee end-to-end latency independently of the serialization mechanism. Conversely, DSMSs still depend on the data model. Indeed, data streams have different characteristics, and thus the choice of the data model may simplify the analysis. For example, a relational stream suggests the presence of keys and functional dependencies, while graph streams suggest greater schema flexibility.

O2) As streaming data grows in sophistication, DSMSs evolve to deal with data variety. For example, systems like Flink and Spark are polyglot. They expose a hierarchy of progressively more flexible yet more complex APIs. In Flink SQL, it is possible to process document data like JSON at a high level by flattening nested structures. Lower-level APIs give users much freedom, yet they demand extensive programming work to perform data integration.
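
To make the flattening step mentioned in O2 concrete, the following is a minimal sketch in plain Python. It does not use Flink SQL or any Flink API; the helper names `flatten` and `flatten_stream`, the dotted column naming, and the sample sensor event are assumptions made only for illustration.

```python
import json


def flatten(record, prefix="", sep="."):
    """Flatten nested dictionaries into one level with dotted keys."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the path so far.
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


def flatten_stream(lines):
    """Lazily turn a stream of JSON strings into flat, row-like dicts."""
    for line in lines:
        yield flatten(json.loads(line))


# Hypothetical input: one JSON document per stream element.
events = [
    '{"sensor": {"id": "s1", "loc": {"lat": 45.78, "lon": 4.87}}, "temp": 21.5}'
]
rows = list(flatten_stream(events))
print(rows)
```

Each flattened row has a fixed set of scalar columns, which is the shape a relational query expects. The sketch ignores arrays and schema evolution, which are among the cases where manual integration work tends to appear.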

O3) DSMSs' capabilities go beyond traditional analytics. Some DSMSs provide strong consistency guarantees that can lead to the support of transactions, queryable state, and even stateful functions. For example, Apache Flink and Kafka Streams support exactly-once semantics, which means that even if an event is sent to the sink more than once, the effects are the same as if it had been processed exactly once. Some DSMSs support advanced iterative computations that users can leverage for streaming graph analytics or machine learning.
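
The observable effect described in O3 can be illustrated with a small sketch. It is not how Flink or Kafka Streams implement exactly-once semantics; it only shows the outcome, using a hypothetical sink that ignores redelivered events. The class `IdempotentSink`, the `event_id` field, and the sample deliveries are assumptions introduced for illustration.

```python
class IdempotentSink:
    """Toy sink: applies each event's effect at most once, keyed by event id."""

    def __init__(self):
        self.seen = set()  # identifiers of events already applied
        self.total = 0     # the "effect": a running sum of amounts

    def write(self, event_id, amount):
        """Apply the event unless it was already applied; report whether it was."""
        if event_id in self.seen:
            return False  # duplicate delivery, no additional effect
        self.seen.add(event_id)
        self.total += amount
        return True


sink = IdempotentSink()

# Hypothetical delivery log in which e1 and e2 are each sent twice,
# for example because of a retry after a failure.
deliveries = [("e1", 10), ("e2", 5), ("e1", 10), ("e3", 7), ("e2", 5)]
applied = [sink.write(event_id, amount) for event_id, amount in deliveries]

print(applied)     # which deliveries changed the sink state
print(sink.total)  # same total as if every event had arrived exactly once
```

The total matches what a duplicate-free delivery of e1, e2, and e3 would produce, which is the guarantee the text describes. A real system must also make the deduplication state itself survive failures, which this sketch does not attempt.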

Finally, O4) the emerging industrial and academic initiatives for a unified DSMS interface like Apache Calcite, Beam, and RSP4J signal the need for an integrated solution.

PolyFlow aims to benefit from such opportunities (O1-O4) to build a new generation of data systems, namely polystreaming systems, that elect DSMSs as their processing runtime. The PolyFlow vision calls for foundational and empirical research objectives: OB1) the integration of declarative languages for continuous querying and OB2) the design and efficient management of polystreaming systems.

Riccardo Tommasini (UMR 5205 - LABORATOIRE D'INFORMATIQUE EN IMAGE ET SYSTEMES D'INFORMATION)

The author of this summary is the project coordinator, who is responsible for its content. The ANR declines any responsibility for its contents.

LIRIS UMR 5205 - LABORATOIRE D'INFORMATIQUE EN IMAGE ET SYSTEMES D'INFORMATION

ANR funding: 264,420 euros
Beginning and duration of the scientific project: March 2023 - 42 Months
