The Interplay of Power Management and Fault Recovery in Real-Time Systems

Rami Melhem*, Daniel Mossé, Elmootazbellah Elnozahy

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

130 Scopus citations

Abstract

This paper describes how to exploit the scheduling slack in a real-time system to reduce energy consumption and achieve fault tolerance at the same time. During failure-free operation, a task takes checkpoints to enable recovery from failure. Additionally, the system exploits the slack to conserve energy by reducing the processor speed. If a task fails, it will restart from a saved checkpoint and execute at maximum speed to guarantee that the deadlines are met. The paper shows that the number of checkpoints and their placements interact in subtle ways with the power management policy. We study two checkpoint placement policies for aperiodic tasks and analytically derive the optimal number of checkpoints to conserve energy under each. This optimal number allows the CPU speed to be slowed down to the level that yields minimum energy consumption, while still guaranteeing recoverability of tasks under each checkpointing policy. The results show that traditional periodic checkpointing is not the best policy for the combined purpose of conserving energy and guaranteeing recovery. Instead, better energy savings are possible through a nonuniform distribution of checkpoints that takes into account the energy consumption and reliability factors. Depending on the amount of slack and the checkpointing overhead, energy can be reduced by up to 68 percent under nonuniform checkpointing. We also demonstrate the applicability of these checkpoint placement policies to periodic tasks.
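The trade-off described in the abstract can be illustrated with a small numerical sketch. It is not the paper's derivation; it is a simplified model built on assumptions that do not come from the source: a single task with worst-case execution time `C` at maximum speed, deadline `D`, per-checkpoint overhead `r`, `n` equally spaced checkpoints, at most one fault, recovery at maximum speed, and failure-free energy proportional to work times speed squared. All function and parameter names are hypothetical.

```python
def min_speed(C, D, r, n):
    """Lowest normalized speed (0 < s <= 1) that still leaves room for one
    recovery, assuming n uniform checkpoints. Returns None if infeasible."""
    # Assumed worst case: the fault hits at the end of the last segment, so
    # one segment plus its checkpoint is re-executed at maximum speed.
    recovery = C / n + r
    budget = D - recovery          # time left for the failure-free run
    if budget <= 0:
        return None
    s = (C + n * r) / budget       # total work divided by available time
    return s if s <= 1.0 else None


def energy(C, r, n, s):
    """Failure-free energy under the assumed model: work * speed^2."""
    return (C + n * r) * s ** 2


def best_checkpoint_count(C, D, r, max_n=1000):
    """Brute-force search for the n that minimizes failure-free energy.
    Returns (n, speed, energy), or None if no n is feasible."""
    best = None
    for n in range(1, max_n + 1):
        s = min_speed(C, D, r, n)
        if s is None:
            continue
        e = energy(C, r, n, s)
        if best is None or e < best[2]:
            best = (n, s, e)
    return best


if __name__ == "__main__":
    # Illustrative numbers only; they are not taken from the paper.
    print(best_checkpoint_count(C=100.0, D=200.0, r=1.0))
```

Under these assumptions, too few checkpoints force a long recovery and therefore a high speed, while too many inflate the checkpointing overhead, so an intermediate count minimizes energy. The nonuniform placement that the paper finds superior is not modeled here.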

Original language: English (US)
Pages (from-to): 217-231
Number of pages: 15
Journal: IEEE Transactions on Computers
Volume: 53
Issue number: 2
State: Published - Feb 2004
Externally published: Yes

Keywords

  • Checkpointing
  • Fault tolerance
  • Frequency scaling
  • Power management
  • Real-time systems
  • Reliability
  • Voltage scaling

ASJC Scopus subject areas

  • Software
  • Theoretical Computer Science
  • Hardware and Architecture
  • Computational Theory and Mathematics
