Assessing the Effects of Communication Faults on Parallel Applications
Joao Carreira, Henrique Madeira, and Joao Gabriel Silva
Proceedings of IPDS'95, International Computer Performance and Dependability Symposium, pp. 214-223, Erlangen, Germany, 1995
Abstract
This paper addresses the problem of injection of faults in the
communication system of disjoint memory parallel computers and presents
fault injection results showing that 5% to 30% of the faults injected in
the communication subsystem of a commercial parallel computer caused undetected
errors that led the application to generate erroneous results. All these
cases correspond to situations in which it would be virtually impossible
to detect that the benchmark output was erroneous, as the size of the results
file was plausible and no system errors were detected. This emphasises
the need for fault tolerant techniques in parallel systems in order to
achieve confidence in the application results. This is especially true
in massively parallel computers, as the probability of faults occurring
increases with the number of processing nodes. Moreover, in disjoint memory
computers, which are the most popular and scalable parallel architecture,
the communication subsystem plays an important role and is also very prone
to errors. CSFI (Communication Software Fault Injector) is a versatile
tool to inject communication faults in parallel computers. Faults injected
with CSFI directly emulate, in software, communication faults and spurious
messages generated by non-fail-silent nodes, allowing the evaluation of
the impact of faults in parallel systems, and the assessment of fault tolerant
techniques. The use of CSFI is nearly transparent to the target application,
as it requires only minor adaptations. Deterministic faults of different natures
can be injected without user intervention, and CSFI also collects fault
injection results.
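
As an illustration of the general idea of software-level communication fault injection, the following minimal sketch wraps an application's send routine and injects one deterministic fault on a chosen message. It is not CSFI's actual interface: the names `Fault`, `InjectingChannel`, the fault kinds, and the message-count trigger are all hypothetical, chosen only to show how a wrapper can stay nearly transparent to the target application while logging what it injected.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Hypothetical signature of the application's send routine: (destination, payload).
SendFn = Callable[[int, bytes], None]


@dataclass
class Fault:
    """One deterministic fault (all fields are illustrative assumptions)."""
    kind: str      # "drop", "corrupt", "duplicate", or "spurious"
    trigger: int   # inject on the n-th message sent (1-based)
    bit: int = 0   # bit position to flip when kind == "corrupt"


class InjectingChannel:
    """Wraps a send routine; the application calls it exactly like the original."""

    def __init__(self, send: SendFn, fault: Optional[Fault] = None) -> None:
        self.send = send
        self.fault = fault
        self.count = 0                        # messages seen so far
        self.log: List[Tuple[int, str]] = []  # (message number, fault kind)

    def __call__(self, dest: int, payload: bytes) -> None:
        self.count += 1
        fault = self.fault
        if fault is None or self.count != fault.trigger:
            self.send(dest, payload)          # fault-free path: pass through
            return

        self.log.append((self.count, fault.kind))
        if fault.kind == "drop":
            return                            # message is lost
        if fault.kind == "corrupt" and payload:
            data = bytearray(payload)
            pos = fault.bit % (len(data) * 8)  # keep the bit index in range
            data[pos // 8] ^= 1 << (pos % 8)   # flip a single bit
            self.send(dest, bytes(data))
        elif fault.kind == "duplicate":
            self.send(dest, payload)
            self.send(dest, payload)          # same message delivered twice
        elif fault.kind == "spurious":
            self.send(dest, payload)
            # Extra message the application never asked to send.
            self.send(dest, bytes(len(payload)))
        else:
            self.send(dest, payload)          # unknown kind or empty payload
```

In this sketch the only adaptation the application needs is to call the wrapper in place of its original send routine; because the trigger is a message count rather than a random draw, a given fault can be repeated exactly across runs.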