CE25 - Infrastructures de communication hautes performances (réseau, calcul et stockage), Sciences et technologies logicielles

Causal debugging for concurrent systems – DCore

Causal Debugging of Concurrent Programs

Causal Debugging of Concurrent Erlang Programs

Two Causal Debugging Engines

As software takes over more and more functionalities in embedded and safety-critical systems, bugs may endanger the safety of human beings and of the environment, or entail heavy financial losses. In spite of the development of verification and testing techniques, debugging still plays a crucial part in the arsenal of the software developer. Unfortunately, usual debugging techniques do not scale to large concurrent and distributed systems: they fail to provide precise and efficient means to inspect and analyze large concurrent executions; they do not provide means to automatically reveal software faults that constitute actual causes for errors; and they do not provide succinct and relevant explanations linking causes (software bugs) to their effects (errors observed during execution).<br />The overall objective of the project is to develop a semantically well-founded, novel form of concurrent debugging, which we call causal debugging, that aims to alleviate the deficiencies of current debugging techniques for large concurrent software systems. Briefly, the causal debugging technology developed by the DCore project will comprise and integrate two main novel engines:<br />1. a reversible execution engine that allows programmers to backtrack and replay a concurrent or distributed program execution, in a way that is both precise and efficient (only the exact threads involved by a return to a target anterior or posterior program state are impacted);<br />2. a causal analysis engine that allows programmers to analyze concurrent executions, by asking questions of the form “what caused the violation of this program property?”, and that allows for the precise and efficient investigation of past and potential program executions.

We will tackle the three main questions:
1. How do we efficiently and flexibly instrument a concurrent program execution that records causal dependencies, while allowing different observation trade-offs (e.g. run-time performance vs observation precision)?
2. How do we develop an efficient and scalable analysis supporting causal queries on large concurrent executions, and providing meaningful, concise explanations?
3. How do we demonstrate the effectiveness of our causal debugging technology?
Accordingly, WP1 “Reversible execution engine” will study the formal semantics underpinnings of reversible execution and replay for concurrent programs, and will develop formal specifications for the reversible execution engine and its realization in WP3. WP2 “Causal analysis engine” will develop formal analyses for identifying actual causes for execution errors and for providing synthetic explanations for said errors. WP3 “Prototypes” will develop prototypes for both the reversible execution engine specified in WP1, and the analysis engine specified in WP2.

1. We have studied causal-consistent reversibility and debugging of Core Erlang, an intermediate representation of Erlang programs.
2. We have defined a semantics based on barbed bisimilarity for the Core Erlang language that allows us to reason abstractly on Core Erlang programs.
3. We have introduced a formal framework for fault ascription, which consists in identifying, within a multi-actor system, the actors or events whose faulty behavior has caused the violation of an expected safety property. Our analysis satisfies a set of formal requirements that are crucial for reasoning about causation via an abstraction.
4. We have proposed two notions of causal explanations that allow us to extract the causally relevant part of the system behavior that entailed a property violation. Both approaches are complementary: while the first one is well suited to ``condense'' the execution trace in sequential portions of a program, the second one is able to explain failures stemming from non-deterministic choices, such as concurrency bugs.
5. We have developed a static analysis that enables reasoning about causation in presence of counters and arithmetics.

TBD

5 articles in international journals, 11 publications at international conferences.

As software takes over more and more functionalities in embedded and safety-critical systems, bugs may endanger the safety of human beings and of the environment, or entail heavy financial losses. In spite of the development of verification and testing techniques, debugging still plays a crucial part in the arsenal of the software developer. Unfortunately, usual debugging techniques do not scale to large concurrent and distributed systems: they fail to provide precise and efficient means to inspect and analyze large concurrent executions; they do not provide means to automatically reveal software faults that constitute actual causes for errors; and they do not provide succinct and relevant explanations linking causes (software bugs) to their effects (errors observed during execution).

The overall objective of the project is to develop a semantically well-founded, novel form of concurrent debugging, which we call "causal debugging'', that aims to alleviate the deficiencies of current debugging techniques for large concurrent software systems.

Briefly, the causal debugging technology developed by the DCore project will comprise and integrate two main novel engines:

1. A reversible execution engine that allows programmers to backtrack and replay a concurrent or distributed program execution, in a way that is both precise and efficient (only the exact threads involved by a return to a target anterior or posterior program state are impacted);

2. a causal analysis engine that allows programmers to analyze concurrent executions, by asking questions of the form "what caused the violation of this program property?'', and that allows for the precise and efficient investigation of past and potential program executions.

The project will build its causal debugging technology on results obtained by members of the team, as part of the past ANR project REVER, on the causal semantics of concurrent languages, and the semantics of concurrent reversible languages, as well as on recent works by members of the project
on abstract interpretation, causal explanations and counterfactual causal analysis.

The project primarily targets multithreaded, multicore and multiprocessor software systems, and functional software errors, that is errors that arise in concurrent executions because of faults (bugs) in software that prevents it to meet its intended function. Distributed systems, which can be impacted by network failures and remote site failures are not an immediate target for DCore, although the technology developed by the project should constitute an important contribution towards full-fledged distributed debugging. Likewise, we do not target performance or security errors, which come with specific issues and require different levels of instrumentation, although the DCore technology should prove a key contribution in these areas as well.

Project coordination

Gregor Goessler (Centre de Recherche Inria Grenoble - Rhône-Alpes)

The author of this summary is the project coordinator, who is responsible for the content of this summary. The ANR declines any responsibility as for its contents.

Partner

IRIF Institut de Recherche en Informatique Fondamentale
Inria Paris Centre de Recherche Inria de Paris
Inria Grenoble Rhône-Alpes Centre de Recherche Inria Grenoble - Rhône-Alpes

Help of the ANR 605,686 euros
Beginning and duration of the scientific project: March 2019 - 48 Months

Useful links

Explorez notre base de projets financés

 

 

ANR makes available its datasets on funded projects, click here to find more.

Sign up for the latest news:
Subscribe to our newsletter