Assessing the Effects of Communication Faults on Parallel Applications

Joao Carreira, Henrique Madeira, and Joao Gabriel Silva

Proceedings of IPDS'95, International Computer and Dependability Symposium, pp. 214-223, Erlangen, Germany, 1995

Abstract
This paper addresses the problem of injecting faults into the communication system of disjoint-memory parallel computers. It presents fault injection results showing that 5% to 30% of the faults injected into the communication subsystem of a commercial parallel computer caused undetected errors that led the application to generate erroneous results. All these cases correspond to situations in which it would be virtually impossible to detect that the benchmark output was erroneous, as the size of the results file was plausible and no system errors were detected. This emphasises the need for fault-tolerant techniques in parallel systems in order to achieve confidence in the application results. This is especially true in massively parallel computers, as the probability of a fault occurring increases with the number of processing nodes. Moreover, in disjoint-memory computers, which are the most popular and scalable parallel architecture, the communication subsystem plays an important role and is also very prone to errors. CSFI (Communication Software Fault Injector) is a versatile tool for injecting communication faults into parallel computers. Faults injected with CSFI directly emulate, by software, communication faults and spurious messages generated by non-fail-silent nodes, allowing the evaluation of the impact of faults on parallel systems and the assessment of fault-tolerant techniques. The use of CSFI is nearly transparent to the target application, as it requires only minor adaptations. Deterministic faults of different natures can be injected without user intervention, and CSFI also collects the fault injection results.
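To make the idea of software-emulated communication faults concrete, the following is a minimal sketch and not CSFI's actual interface. It assumes a hypothetical `FaultInjector` wrapper placed between the application and its message-delivery routine. The class name, the fault labels ("drop", "corrupt", "duplicate"), the `trigger` index, and the `seed` parameter are all illustrative assumptions. The fault is deterministic in the sense that the same trigger and seed always affect the same message in the same way.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class FaultInjector:
    """Hypothetical send-path wrapper; an illustration, not CSFI itself."""
    deliver: Callable[[bytes], None]  # the real (or simulated) send routine
    fault: str                        # assumed labels: "drop", "corrupt", "duplicate"
    trigger: int                      # 0-based index of the message to fault
    seed: int = 0                     # fixes which bit is flipped by "corrupt"
    sent: int = 0                     # number of messages seen so far
    log: List[str] = field(default_factory=list)  # record of injected faults

    def send(self, payload: bytes) -> None:
        index = self.sent
        self.sent += 1
        if index != self.trigger:
            # Every message other than the trigger passes through untouched.
            self.deliver(payload)
            return
        self.log.append(f"message {index}: {self.fault}")
        if self.fault == "drop":
            # Emulate a lost message: nothing reaches the receiver.
            return
        if self.fault == "duplicate":
            # Emulate a spurious extra copy of the message.
            self.deliver(payload)
            self.deliver(payload)
            return
        if self.fault == "corrupt":
            if not payload:
                # An empty payload has no bit to flip; deliver it unchanged.
                self.deliver(payload)
                return
            # Flip exactly one bit, chosen reproducibly from the seed.
            bit = random.Random(self.seed).randrange(len(payload) * 8)
            corrupted = bytearray(payload)
            corrupted[bit // 8] ^= 1 << (bit % 8)
            self.deliver(bytes(corrupted))
            return
        raise ValueError(f"unknown fault type: {self.fault}")


if __name__ == "__main__":
    received: List[bytes] = []
    injector = FaultInjector(deliver=received.append, fault="corrupt", trigger=1)
    for message in (b"aaaa", b"bbbb", b"cccc"):
        injector.send(message)
    print(received)
    print(injector.log)
```

Only the call site of the send routine changes in this sketch, which mirrors the abstract's point that the target application needs only minor adaptations. The `log` list stands in for the collection of fault injection results.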