TY - JOUR
T1 - Automated rule-based diagnosis through a distributed monitor system
AU - Khanna, Gunjan
AU - Cheng, Mike Yu
AU - Varadharajan, Padma
AU - Bagchi, Saurabh
AU - Correia, Miguel P.
AU - Veríssimo, Paulo J.
N1 - Generated from Scopus record by KAUST IRTS on 2021-03-16
PY - 2007/1/1
Y1 - 2007/1/1
AB - In today's world, where distributed systems form many of our critical infrastructures, dependability outages are becoming increasingly common. In many situations, it is necessary to not only detect a failure but also to diagnose the failure, that is, to identify the source of the failure. Diagnosis is challenging, since high-throughput applications with frequent interactions between the different components allow fast error propagation. It is desirable to consider applications as black boxes for the diagnostic process. In this paper, we propose a Monitor architecture for diagnosing failures in large-scale network protocols. The Monitor only observes the message exchanges between the protocol entities (PEs) remotely and does not access the internal protocol state. At runtime, it builds a causal graph between the PEs based on their communication and uses this together with a rule base of allowed state-transition paths to diagnose the failure. The tests used for the diagnosis are based on the rule base and are assumed to have imperfect coverage. The hierarchical Monitor framework allows distributed diagnosis handling failures at individual Monitors. The framework is implemented and applied to a reliable multicast protocol executing on our campus-wide network. Fault injection experiments are carried out to evaluate the accuracy and latency of the diagnosis. © 2007 IEEE.
UR - http://ieeexplore.ieee.org/document/4358702/
UR - http://www.scopus.com/inward/record.url?scp=36248945561&partnerID=8YFLogxK
U2 - 10.1109/TDSC.2007.70211
DO - 10.1109/TDSC.2007.70211
M3 - Article
SN - 1545-5971
VL - 4
SP - 266
EP - 279
JO - IEEE Transactions on Dependable and Secure Computing
JF - IEEE Transactions on Dependable and Secure Computing
IS - 4
ER -