This project relies on the formal study of tasks that can be performed efficiently on text documents (structured or not) when they are accessed as streams.
Computer science is built on processing data. Among other forms, data can be text documents, either flat or structured (like XML documents). When these documents are huge, processing them as streams is sometimes more efficient. Some tasks are standard: validation (checking conformance with some criteria), querying (extracting information), and transformation (converting a document into another one). This project aims at characterizing which tasks are suitable for streaming evaluation. More precisely, it focuses on identifying which parameters of these tasks prevent them from being evaluated efficiently in streaming mode.
Our approach is based on formal methods, in particular language theory. These methods have provided numerous results, including algorithms. Our objective is to combine this approach with the constraints that streaming imposes on document processing: efficient processing of each event, and low memory consumption. To achieve this, some results from language theory must be established or extended.
New algorithms have been proposed for each type of task considered. Regarding validation of XML documents, we proposed an algorithm that decides which schemas (such as DTDs) can be repaired in a streaming manner. Concerning queries on XML streams, we contributed a fast algorithm that answers almost at the earliest possible position (and hence with low memory usage). Finally, we improved algorithms that decide whether a transformation (defined by a transducer) can be performed with bounded memory.
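The validation setting can be illustrated with a toy sketch (not the project's repair algorithm): checking a stream of opening/closing events against a simple DTD-like schema, where each element name maps to a regular expression over the names of its children. The schema and the event encoding below are invented for illustration.

```python
from re import fullmatch

# A toy DTD: each element maps to a regular expression over the
# space-separated names of its children. Purely illustrative.
DTD = {
    "library": r"(book )*",
    "book":    r"title (author )+",
    "title":   r"",
    "author":  r"",
}

def validate(events):
    """Validate a stream of ('open', name) / ('close', name) events
    against DTD, reading the stream once, left to right."""
    stack = []  # one (name, children-so-far) frame per open element
    for kind, name in events:
        if kind == "open":
            if stack:
                stack[-1][1].append(name)  # record child under its parent
            stack.append((name, []))
        else:  # close
            if not stack:
                return False               # closing tag without a match
            top, children = stack.pop()
            if top != name:
                return False               # mismatched tags: not well-nested
            word = "".join(c + " " for c in children)
            if fullmatch(DTD[top], word) is None:
                return False               # children violate the content model
    return not stack                       # every open tag must be closed
```

Note that this sketch stores the full child list of every open element, so its memory grows with sibling counts; a genuinely streaming validator would instead run one automaton incrementally per open element, keeping memory proportional to the nesting depth only. Which schemas allow such bounded processing is precisely the kind of question studied in the project.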
An unexpected achievement of this project is a significant result in language theory, on the links between logics and automata. We contributed an algorithm that decides whether a regular language (described, for instance, by an automaton) can be characterized by a first-order formula with three quantifier alternations.
We presented these results at various conferences and published them in journal articles.
The increasing tendency in data management is to consider data as transient, rather than persistently stored, and accordingly to process it on the fly, rather than on demand. Many recent applications, for instance, generate data in the form of continuous and arbitrarily long streams (e.g. log records, email and chat messages, data feeds from sensor applications, financial tickers, genomic data, etc.). This trend is particularly evident in the cloud computing and big data communities, which are actively seeking novel ways of exploiting the scalability of their systems to deal with peaks in the number and size of input streams. The distinguishing features of computational problems in the streaming setting are (i) the fact that the input is serialized, that is, represented as a list of events and usually accessed sequentially from left to right, and (ii) the stringent constraints on space and time complexity (e.g. bounded memory footprint, constant time per event).
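As a minimal illustration of constraints (i) and (ii), the classic Knuth-Morris-Pratt matcher processes a character stream in a single left-to-right pass, in constant time per event (amortized) and with memory bounded by the pattern length. This is textbook material, included only to make the streaming constraints concrete.

```python
def kmp_failure(pattern):
    """Build the KMP failure table: for each prefix length i, the length
    of the longest proper prefix of the pattern that is also its suffix."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail

def stream_matches(pattern, stream):
    """Yield the end positions of pattern occurrences, reading the stream
    once and storing only the automaton state (bounded memory)."""
    fail = kmp_failure(pattern)
    state = 0  # number of pattern characters currently matched
    for pos, c in enumerate(stream):
        while state > 0 and c != pattern[state]:
            state = fail[state - 1]
        if c == pattern[state]:
            state += 1
        if state == len(pattern):
            yield pos                # a full occurrence ends here
            state = fail[state - 1]  # continue for overlapping matches
```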
Due to the particular characteristics of stream processing problems, new standards, algorithms, and languages are sought with the aim of achieving efficient streaming solutions, and new theoretical questions are being posed to the research community. The goal of this research project is to invent new computational models and rigorous mathematical tools that complement the current state of the art in stream processing techniques. The issues covered by this research project thus span different areas of computer science - ranging from databases to complexity theory, from logic to automata, from combinatorics to algebra - and include a number of difficult open problems.
Several computational tasks on streams will be considered, most notably XML-related operations (XPath query answering, XSL transformation, stream validation) and pattern matching. These tasks are formalized as families of problems parametrized by queries, constraints, patterns, etc., and take as input arbitrary sequences of events (streams).
In detail, the scientific program is organized around two main tasks. The first task aims at developing a suite of algorithms for optimally manipulating streams. The suite will include online algorithms for enumerating the answers to a query in a timely manner and for efficiently transforming streams of data, decision procedures for constant-memory realizability of a given query or transformation (the streamability problem), synthesis of finite-state stream validators, and streaming solutions to parametrized pattern matching problems. The second task aims at characterizing those parameters of the processing problems that admit efficient streaming solutions. As an intermediate step, we will study formalisms based on automata and logical hierarchies that allow the formal specification of classes of parameters (e.g. XPath queries). Some emphasis will be put on the ability to reason about equalities between attributes ranging over infinite domains (data equality tests). We will then compare these formalisms in terms of expressive power and succinctness, and finally try to derive effective characterizations for the levels of some logical hierarchies.
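The idea of timely (earliest) answer enumeration can be sketched on a toy query; this is an invented example, not one of the project's algorithms. For a query selecting the child-labeled elements whose parent carries a given label, each answer is already certain when the child's opening event arrives, so it can be output immediately, without buffering any part of the stream.

```python
def earliest_answers(events, parent, child):
    """Stream the positions of opening events of `child` elements whose
    parent is a `parent` element, emitting each answer at the earliest
    possible event (the child's opening tag). Memory is linear in the
    nesting depth; events are ('open', name) or ('close', name) pairs."""
    stack = []  # names of the currently open elements
    for pos, (kind, name) in enumerate(events):
        if kind == "open":
            if stack and stack[-1] == parent and name == child:
                yield pos  # the answer is already certain: output it now
            stack.append(name)
        else:
            stack.pop()
```

For richer query languages (e.g. XPath fragments with filters or horizontal axes), answers may depend on events far in the future, forcing buffering; quantifying that delay and the induced memory cost is one of the parameters the project investigates.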
The issues addressed by the research project are challenging in many respects. On the one hand, they can be seen as variants of problems in verification and formal language theory. On the other hand, they embody algorithmic constraints specific to the streaming setting. The combination of these two aspects leads to highly complex questions, which conceivably require new mathematical models in order to be answered.
Monsieur Olivier GAUWIN (Laboratoire Bordelais de Recherche en Informatique)
The author of this summary is the project coordinator, who is responsible for its content. The ANR declines all responsibility for its contents.
LaBRI - Laboratoire Bordelais de Recherche en Informatique
ANR funding: 187,800 euros
Beginning and duration of the scientific project: March 2014 - 48 Months