Experimental evaluation of the impact of processor faults on parallel applications
D. Costa, F. Moreira, H. Madeira, M. Rela, and J. G. Silva
Proc. of 14th IEEE Symposium on Reliable Distributed Systems, SRDS-14, Bad Neuenahr, Germany, Sept. 13-15, 1995, pp. 10-19
Abstract
This paper addresses the problem of processor faults in distributed
memory parallel systems. It shows that transient faults injected at the
processor pins of one node of a commercial parallel computer, without any
particular fault-tolerant techniques, can cause erroneous application results
for up to 43% of the injected faults (depending on the application). In
addition to these very subtle faults, up to 19% of the injected faults
(almost independent of the application) caused the system to hang. These
results show that fault-tolerant techniques are absolutely required in
parallel systems, not only to ensure the completion of long-running applications
but, more importantly, to achieve confidence in the application results.
The benefits of including some fairly simple behaviour-based error detection
mechanisms in the system were evaluated together with Algorithm Based Fault
Tolerance (ABFT) techniques. The inclusion of such mechanisms in parallel
systems seems to be very important for detecting most of those subtle errors
without greatly affecting the performance and the cost of these systems.
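
As a rough illustration of the kind of check ABFT relies on, the sketch below applies the classic checksum scheme to matrix multiplication. The abstract does not say which ABFT algorithms or applications were evaluated, so the choice of matrix multiplication, the function names checksum_matmul and verify, and the tolerance value are all assumptions made for illustration only.

```python
def checksum_matmul(a, b):
    """Multiply a (n x m) by b (m x p) and return the full-checksum product.

    Hypothetical helper: the result has one extra row and one extra column
    that hold the column sums and row sums of the true product.
    """
    # Column-checksum form of a: append a row holding each column's sum.
    a_c = a + [[sum(col) for col in zip(*a)]]
    # Row-checksum form of b: append each row's sum as an extra column.
    b_r = [row + [sum(row)] for row in b]
    # Ordinary matrix product of the two augmented matrices.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b_r)]
            for row in a_c]


def verify(full, tol=1e-9):
    """Return True if every row and column of the data part matches its checksum."""
    n = len(full) - 1      # number of data rows
    p = len(full[0]) - 1   # number of data columns
    rows_ok = all(abs(sum(full[i][:p]) - full[i][p]) <= tol for i in range(n))
    cols_ok = all(abs(sum(full[i][j] for i in range(n)) - full[n][j]) <= tol
                  for j in range(p))
    return rows_ok and cols_ok


full = checksum_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
clean = verify(full)       # checksums agree on the fault-free result

full[0][1] += 1            # simulate a silent data error in one element
corrupted = verify(full)   # the row and column checksums no longer match
```

A single corrupted element breaks exactly one row checksum and one column checksum, which is what lets this scheme flag wrong results that would otherwise pass unnoticed.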