Stochastic Models for Fault Tolerance: Restart, Rejuvenation and Checkpointing


Katinka Wolter

Language: English

Pages: 269

ISBN: 3642112560

Format: PDF / Kindle (mobi) / ePub

As modern society relies on the fault-free operation of complex computing systems, system fault-tolerance has become an indispensable requirement. Therefore, we need mechanisms that guarantee correct service in cases where system components fail, be they software or hardware elements. Redundancy patterns are commonly used, for either redundancy in space or redundancy in time.

Wolter’s book details methods of redundancy in time that need to be issued at the right moment. In particular, she addresses the so-called "timeout selection problem", i.e., the question of choosing the right time for different fault-tolerance mechanisms like restart, rejuvenation and checkpointing. Restart indicates the pure system restart, rejuvenation denotes the restart of the operating environment of a task, and checkpointing includes saving the system state periodically and reinitializing the system at the most recent checkpoint upon failure of the system. Her presentation includes a brief introduction to the methods, their detailed stochastic description, and also aspects of their efficient implementation in real-world systems.
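The timeout selection problem for restart can be illustrated with a small Monte Carlo sketch. It is not taken from the book: the workload `heavy_tailed`, the function names and the assumption that a restart costs no time are all hypothetical, chosen only to show why the choice of timeout matters.

```python
import random


def completion_time_with_restart(sample_duration, timeout, rng):
    """Total time until one attempt finishes within `timeout`.

    Each attempt draws a fresh duration. An attempt that would run longer
    than the timeout is aborted after `timeout` time units and retried.
    Assumption: the restart itself takes no time.
    """
    elapsed = 0.0
    while True:
        duration = sample_duration(rng)
        if duration <= timeout:
            return elapsed + duration
        elapsed += timeout  # time wasted on the aborted attempt


def mean_completion_time(sample_duration, timeout, runs=20000, seed=1):
    """Average completion time over `runs` independent simulated tasks."""
    rng = random.Random(seed)
    total = sum(
        completion_time_with_restart(sample_duration, timeout, rng)
        for _ in range(runs)
    )
    return total / runs


def heavy_tailed(rng):
    """Hypothetical workload: usually fast, occasionally very slow."""
    if rng.random() < 0.9:
        return rng.expovariate(1.0)
    return 50.0 + rng.expovariate(0.1)


if __name__ == "__main__":
    for timeout in (2.0, 5.0, 20.0, float("inf")):
        print(timeout, mean_completion_time(heavy_tailed, timeout))
```

An infinite timeout corresponds to never restarting. A timeout shorter than every possible duration would loop forever, so the sketch is only meaningful when some attempts can finish in time.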

The book is targeted at researchers and graduate students in system dependability, stochastic modeling and software reliability. Readers will find here an up-to-date overview of the key theoretical results, making this the only comprehensive text on stochastic models for restart-related problems.

again for exponentially distributed up- and downtimes and for both failure modes versus the number of tolerated downtimes.

[Fig. 2.6: Probability of task completion P(w) as a function of tolerated downtimes K, for preemptive resume and preemptive repeat, w = 3.0]

[Fig. 2.7: Probability of task completion P(w) as a function of tolerated downtimes K, for preemptive resume and preemptive repeat, w = 5.0]

At the

remarkable decrease in mean time to repair, translating either to higher system availability or more tolerance in the fault detection time. It should be noted that as yet no stochastic models for microreboot systems exist. One may consider the opportunistic micro rejuvenation for embedded systems [142] as closely related. This concept is a combination of software rejuvenation and microreboot as suitable for embedded systems. It has been analysed using a stochastic activity network with the Möbius

comparison with other modelling formalisms. It allows for very complex structures, which are then extremely hard to debug. But the main disadvantage of a Petri net model is that parameter optimisation must be done ‘by hand’ by carrying out sequences of experiments. Software rejuvenation has also been modelled using fluid stochastic Petri nets [12]. Extending stochastic Petri nets with fluid places, which hold a continuous amount of fluid rather than discrete tokens, was first proposed in [155],

[115] combines application-initiated and system-initiated checkpointing. At runtime the operating system uses heuristics that classify the system state to decide whether a checkpoint, which is implemented in the application code, should be executed. Most applications use equidistant checkpoints, which, as we will see later in this chapter, is in many cases the best choice. Cooperative checkpointing then appears to be irregular, since the system can either grant or deny a checkpoint

spaced intervals such that the total work requirement $w$ is covered by checkpoint intervals of length $\tau$, i.e. $w = K\tau$. The sequence of failure times $t_1, t_2, \ldots$ forms a renewal process. Consequently, the sequence of checkpoint locations also forms a renewal process. The expected task completion time with checkpointing $E[T_C(w)]$ is shown in [37] to be

$$E[T_C(w)] = \frac{w}{\tau}\left(C + \left(D + C + \frac{1}{\gamma}\right)\left(e^{\gamma w} - 1\right)\right).$$

Using (2.8) the solution with respect to $w$ or $\gamma$ of the inequality $E[T_C(w)] < E[T(w)]$
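The trade-off behind equidistant checkpointing can be explored numerically with a minimal simulation sketch. It does not reproduce the derivation in [37]. It assumes that failures arrive as a Poisson process with rate γ, that C is the cost of writing one checkpoint and D the repair time after a failure, and that every interval (including the last) ends with a checkpoint. These readings of the symbols and all function names are assumptions of the sketch.

```python
import math
import random


def simulate_checkpointed_task(w, tau, C, D, gamma, rng):
    """One sample of the completion time for work w with checkpoint interval tau.

    Assumptions of this sketch (not taken from the text):
    - failures arrive as a Poisson process with rate gamma while the system
      is working or checkpointing, and never during repair;
    - every interval, including the last one, ends with a checkpoint of cost C;
    - a failure costs a repair time D and rolls back to the last checkpoint.
    """
    intervals = round(w / tau)  # w = K * tau
    elapsed = 0.0
    for _ in range(intervals):
        segment = tau + C  # work plus checkpoint, lost entirely on failure
        while True:
            # Exponential times are memoryless, so a fresh draw per attempt is valid.
            time_to_failure = rng.expovariate(gamma) if gamma > 0 else math.inf
            if time_to_failure >= segment:
                elapsed += segment
                break
            elapsed += time_to_failure + D  # wasted work plus repair
    return elapsed


def mean_checkpointed_time(w, tau, C, D, gamma, runs=2000, seed=1):
    """Average completion time over `runs` independent simulated tasks."""
    rng = random.Random(seed)
    total = sum(
        simulate_checkpointed_task(w, tau, C, D, gamma, rng) for _ in range(runs)
    )
    return total / runs


if __name__ == "__main__":
    # tau = w means a single interval, i.e. effectively no intermediate checkpoints.
    for tau in (1, 2, 5, 10):
        print(tau, mean_checkpointed_time(w=10, tau=tau, C=0.5, D=1.0, gamma=0.5))
```

Short intervals pay the checkpoint cost often, while long intervals lose more work per failure, so the estimated mean typically has a minimum at an intermediate interval length for a given failure rate.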
