As we move to the world of Big Data, single-site processing becomes insufficient: large-scale scientific applications can no longer be accommodated within a single datacenter. Workflows are a prime illustration of such data-driven applications. They describe, in a declarative way, the relationship between individual computational tasks (standalone binaries) and their input and output data, and they exchange data through temporary files. With fast-growing volumes of data to be handled at ever larger scales, geographically distributed workflows are emerging as a natural data processing paradigm. Geographic distribution may bring several benefits: resilience to failures, distribution across partitions (e.g., moving computation close to data or vice versa), elastic scaling to support usage bursts, and user proximity. In this context, sharing, disseminating and analyzing the data sets results in frequent large-scale data movements across widely distributed sites. Studies show that inter-datacenter traffic is expected to triple in the coming years.
As of today, state-of-the-art public clouds do not provide adequate mechanisms for efficient data management across datacenters for scenarios involving masses of geographically distributed data that are stored and processed in multiple sites across the globe. Existing solutions are limited to cloud-provided storage, which offers low performance based on rigid cost schemes. In turn, workflow engines need to improvise substitutes, achieving performance at the cost of complex system configurations, maintenance overheads, reduced reliability and reusability. High throughput, low latencies or cost-related trade-offs are just a few concerns for both cloud providers and users when it comes to handling data across datacenters.
In this project we investigate approaches to data management enabling an efficient execution of such geographically distributed workflows running on multi-site clouds. We focus on a common scenario where workflows generate and process a huge number of small files, which is particularly challenging with respect to data management. As such workloads generate a deluge of small and independent I/O operations, efficient data and metadata handling is critical. We will explore means to better hide latency for data and metadata access and optimise transfers as a way of improving the global performance. The targeted solution leverages both the workflow semantics (e.g. data-access patterns) and the practical tools available on today’s clouds (e.g. caching services for PaaS clouds) to propose several strategies for decentralized data management. The system will be leveraged by real-life applications from bio-informatics, smart cities and nuclear physics.
OverFlow proposes a new, pioneering paradigm: Workflow Data Management as a Service, a general, easy-to-use, cloud-provided service that bridges, for the first time, the gap between single- and multi-site workflow data management. It aims to reap economic benefits from geo-diversity while accelerating scientific discovery through a "democratisation" of access to globally distributed data.
INSA Rennes / Institut de recherche en informatique et systèmes aléatoires (public laboratory)
The author of this summary is the project coordinator, who is responsible for its content. The ANR declines any responsibility for its contents.
ANR funding: 247,216 euros
Beginning and duration of the scientific project: September 2015 - 48 months