Understanding Communication Faults in Parallel Computers

Joao Carreira, Diamantino Costa, Henrique Madeira, João Gabriel Silva

Fault-Tolerant Systems and Software, Ed. Ravi Mittal, C. Muthukrishana and V. Bhatkar, Narosa Publishing House, 1995, pp. 158-164

Abstract
This paper addresses the evaluation of the dependability properties of distributed memory parallel systems through fault injection. The most popular parallel computers are based on the distributed memory architecture, where loosely coupled processors communicate by message passing. Fault tolerance increasingly concerns manufacturers and end users of these systems, because the probability of a fault grows with the number of components, and parallel machines can have up to thousands of nodes and complex interconnection media. Validating fault tolerance in these systems requires taking into account both the processing nodes and the communication subsystem. This paper focuses on the validation of communication subsystems and reports experiments conducted with the CSFI (Communication Software Fault Injector) tool on a commercial parallel machine with no fault handling mechanisms. Two sets of experiments were performed: one using the original applications, and another using the same applications in conjunction with an application-level CRC mechanism for the messages. The analysis of the outcomes focuses on those faults that caused the application to generate wrong results without any error being detected. These cases correspond to situations in which it would be virtually impossible to tell that the benchmark output was erroneous. The results show the effectiveness of the CRC as an error detection mechanism and emphasise the need for robust communication protocols in parallel machines in order to achieve confidence in application results. They also suggest that the current quest for performance in the parallel computing industry can only be effective if performance is provided along with dependability.
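As an illustration of the application-level CRC idea mentioned above, the following minimal sketch appends a checksum to each message and verifies it on receipt, then emulates a communication fault by flipping a single bit. It is not the CSFI tool nor the code used in the experiments: the function names `wrap_message`, `unwrap_message` and `flip_bit` are hypothetical, and the use of CRC-32 (via Python's `zlib`) with a 4-byte trailer is an assumption, since the abstract does not state which CRC was used.

```python
import struct
import zlib


def wrap_message(payload: bytes) -> bytes:
    """Sender side: append a 4-byte big-endian CRC-32 of the payload (assumed format)."""
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    return payload + struct.pack(">I", crc)


def unwrap_message(frame: bytes) -> bytes:
    """Receiver side: recompute the CRC and raise ValueError if it does not match."""
    if len(frame) < 4:
        raise ValueError("frame too short to contain a CRC")
    payload = frame[:-4]
    (received_crc,) = struct.unpack(">I", frame[-4:])
    if (zlib.crc32(payload) & 0xFFFFFFFF) != received_crc:
        raise ValueError("CRC mismatch: message corrupted in transit")
    return payload


def flip_bit(frame: bytes, bit_index: int) -> bytes:
    """Emulate a communication fault by inverting one bit of the frame."""
    corrupted = bytearray(frame)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


if __name__ == "__main__":
    frame = wrap_message(b"partial result 42")
    print(unwrap_message(frame))  # an intact frame is accepted
    try:
        unwrap_message(flip_bit(frame, 10))
    except ValueError as err:
        print("fault detected:", err)
```

Without the check, the flipped bit would be delivered to the application as valid data, which is the kind of undetected wrong result the experiments focus on. With the check, the receiver can at least signal the error instead of computing on corrupted input.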