[ main page ] [ back ]

2004 : Startup and Recovery of Fault-Tolerant Time-Triggered Communication

Comment
This PhD thesis has been published by VDM Verlag Dr. Mueller e.K. (April 17, 2008) and can be obtained from http://www.amazon.com/Startup-Recovery-Fault-Tolerant-Time-Triggered-Communication/dp/383649521X/ref=sr_1_2?ie=UTF8&s=books&qid=1209974348&sr=8-2
Author(s)
Wilfried Steiner
Abstract
Computer systems are becoming more and more interesting for ultra-high dependable applications as for example flight control systems, where they replace and supplement traditional hydraulic or mechanical mechanisms. Such computer systems inherently require a distributed solution in order that the corruption of a part of the system does not cause the system to fail as a whole. The spatial distribution, however, implies the implementation of a communication infrastructure, such that the participants in the system are able to exchange information. For economic reasons the communication infrastructure is often realized as a shared resource and dedicated communication protocols have to be implemented to coordinate its usage. Such a communication protocol can be based on the synchronization of local clocks of the participants. This thesis addresses the problem of initial synchronizing the local clocks of the participants in the distributed system such that they closely agree on the current point in real-time. Algorithms that solve this problem are known as emph{startup algorithms}. Although there are several ways to construct a startup algorithm, they show a common pattern: after power-on, a participant has first to determine if there already exists an agreement of a sufficient set of participants on the current point in real-time. If so, the participant has to join this set, we say the participant has to emph{integrate}. If such a set does not exist, the participants have to establish such an agreement. This process is called the emph{coldstart}. In this thesis we discuss the startup problem in general and introduce two particular startup algorithms. Furthermore, we show how to protect the shared communication resource against the arbitrary failure of one of its participants, and how a startup algorithm can be used for recovery once the synchronization has been lost. The design of fault-tolerant algorithms is inherently vulnerable to design failures or incomplete specifications. It is, thus, of high importance to use a formal approach during the design phase. The remarkable performance of modern model-checking tools allows the application of formal methods not only for the specification purpose, but also for ``verification-in-the-loop'', to get immediate feedback on the correctness of the algorithm. We use the texttt{SAL} model-checking tool-suite during the design process. The output of these tools contributes significantly to the understanding of the nature of the startup problem and to the correction of algorithmic weaknesses.
Bibtex
@phdthesis{ steiner:2004,
  author =      "Wilfried Steiner",
  title =       "Startup and Recovery of Fault-Tolerant Time-Triggered Communication",
  address =     "Treitlstr. 3/3/182-1, 1040 Vienna, Austria",
  school =      "Technische Universit{\"a}t Wien, Institut f{\"u}r Technische Informatik",
  year =        "2004"
}
Download


[ main page ] [ back ]