Support for software interrupts in log-based rollback-recovery

J. Hamilton Slye*, E. N. Elnozahy

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

9 Scopus citations

Abstract

The piecewise deterministic execution model is a fundamental assumption in many log-based rollback-recovery protocols. Process execution in this model consists of intervals, each starting with the receipt of a message at an application-defined execution point. Execution within each interval is deterministic and messages are the only source of nondeterminism that affects the computation. This simple model excludes the nondeterminism that results when asynchronous signals or interrupts occur at arbitrary execution points. As a result, a wide range of applications cannot use log-based rollback-recovery in practice. We present a solution that removes this restriction and allows applications to replay interrupts at the same execution points during recovery. The solution relies on using a software counter to compute the number of instructions between the asynchronous signals during normal operation. Should a failure occur, the instruction counts are used to force the replay of these signals at the same execution points. The execution of the application thus can be replayed to recreate the prefailure state while accommodating nondeterminism due to asynchronous signals. We then use the deterministic replay of interrupts to solve another problem, namely tracking nondeterminism due to interleaved shared memory access in multithreaded applications on a single processor. We use the instruction counter solution to implement a user-level thread package in which thread scheduling decisions can be replayed if a failure occurs. By repeating the scheduling decisions during an execution replay, threads access the shared memory in the same order, allowing the execution to be reconstructed. This technique allows multithreaded applications to use log-based rollback-recovery with low overhead, which was not previously possible.
We carried out two prototype implementations that have shown the overhead is no more than a 6 percent slowdown in application execution on the DEC Alpha, and from 6 percent to 18 percent on the Intel Pentium. Thus, restrictions of the piecewise deterministic execution model can be lifted at a reasonable cost.
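
The record-and-replay idea described in the abstract can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not the authors' implementation: the "program" is a hypothetical arithmetic loop, the loop index stands in for the software instruction counter, and the names `execute`, `normal_run`, and `replay_run` are invented for illustration. During the normal run, the gap in instructions between consecutive signals is logged; during replay, the log is used to deliver each signal at the same execution point.

```python
import random

MOD = 1_000_003  # arbitrary modulus for the toy state (assumption)


def execute(num_instructions, deliver_signal_now):
    """Run a toy deterministic program.

    `deliver_signal_now(count)` decides whether a signal handler runs
    just before instruction number `count`.
    """
    state = 1
    for count in range(num_instructions):
        if deliver_signal_now(count):
            state = (state * 31 + 7) % MOD  # side effect of the handler
        state = (state * 3 + count) % MOD   # one deterministic "instruction"
    return state


def normal_run(num_instructions, rng):
    """Normal operation: signals arrive at arbitrary points.

    Logs the number of instructions executed between consecutive signals.
    """
    log = []
    last = 0

    def deliver(count):
        nonlocal last
        if rng.random() < 0.01:        # simulated asynchronous arrival
            log.append(count - last)   # instructions since previous signal
            last = count
            return True
        return False

    return execute(num_instructions, deliver), log


def replay_run(num_instructions, log):
    """Recovery: deliver each signal at the logged execution point."""
    pending = set()
    position = 0
    for gap in log:
        position += gap
        pending.add(position)
    return execute(num_instructions, lambda count: count in pending)
```

Calling `normal_run(10_000, random.Random(42))` and passing the returned log to `replay_run(10_000, log)` should reproduce the same final state, since the signals land on the same instruction counts in both runs.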

Original language: English (US)
Pages (from-to): 1113-1123
Number of pages: 11
Journal: IEEE Transactions on Computers
Volume: 47
Issue number: 10
State: Published - 1998
Externally published: Yes

Keywords

  • Checkpointing
  • Distributed systems
  • Instruction counters
  • Message logging
  • Rollback-recovery

ASJC Scopus subject areas

  • Software
  • Theoretical Computer Science
  • Hardware and Architecture
  • Computational Theory and Mathematics
