Understanding Communication Faults in Parallel Computers
Joao Carreira, Diamantino Costa, Henrique Madeira, João Gabriel Silva
Fault-Tolerant Systems and Software, Ed. Ravi Mittal, C. Muthukrishana and V. Bhatkar, Narosa Publishing House, 1995, pp. 158-164
Abstract
This paper addresses the evaluation of the dependability properties
of distributed memory parallel systems through fault injection. The most
popular parallel computers are based on the distributed memory architecture
where loosely coupled processors communicate by message-passing. Fault
tolerance is an issue which increasingly concerns manufacturers and end
users of these systems as the probability of occurrence of a fault increases
with the number of components, and parallel machines can have up to thousands
of nodes and complex interconnection media. For the purpose of the validation
of fault tolerance in these systems, both the processing nodes and the
communication subsystem should be taken into account. This paper focuses
on the validation of communication subsystems and reports experiments conducted
with the CSFI tool (Communication Software Fault Injector) on a commercial
parallel machine with no fault handling mechanisms. Two sets of experiments
have been performed: one using original applications, and another using
the same applications in conjunction with an application-level CRC mechanism
for the messages. The outcome of the experiments was analysed focusing
on those faults that caused the generation of wrong results by the application
without any error being detected. These cases correspond to situations
in which it would be virtually impossible to detect that the benchmark
output was erroneous. The results obtained show the effectiveness of the
CRC as an error detection mechanism and emphasise the need for robust communication
protocols in parallel machines in order to achieve confidence in the applications'
results. They also suggest that the current quest for performance in the parallel
computing industry can only be effective if it is accompanied by dependability.
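
As an illustration of the application-level CRC idea mentioned in the abstract, the following is a minimal sketch and not the mechanism used in the experiments. It assumes a CRC-32 (via Python's zlib) appended to each message as a 4-byte trailer; the abstract does not state which CRC was used. The names pack_message, unpack_message and flip_bit are hypothetical, and flip_bit only emulates a single-bit communication fault for demonstration purposes.

```python
import struct
import zlib


def pack_message(payload: bytes) -> bytes:
    """Append a CRC-32 of the payload as a 4-byte big-endian trailer.

    Assumption: CRC-32 is used here only because it is readily available;
    the original experiments may have used a different CRC.
    """
    return payload + struct.pack(">I", zlib.crc32(payload))


def unpack_message(frame: bytes) -> bytes:
    """Return the payload, or raise ValueError if the CRC does not match."""
    if len(frame) < 4:
        raise ValueError("frame too short to contain a CRC trailer")
    payload, trailer = frame[:-4], frame[-4:]
    (expected,) = struct.unpack(">I", trailer)
    if zlib.crc32(payload) != expected:
        raise ValueError("CRC mismatch: message corrupted in transit")
    return payload


def flip_bit(frame: bytes, bit_index: int) -> bytes:
    """Emulate a communication fault by inverting one bit of the frame."""
    corrupted = bytearray(frame)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)
```

In this sketch the receiver calls unpack_message on every incoming frame, so a corrupted message raises an error instead of silently feeding a wrong value into the computation, which is the undetected-wrong-result case the abstract singles out.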