As the parallelism of computer systems grows to tens or hundreds of processors, the ability of an application to tolerate faults becomes increasingly important. Transient faults, also known as soft errors, are emerging as a critical concern in the reliability of computer systems. This research identifies an intersection of program analysis, operating systems, and application-level characteristics that enables a new approach to fault tolerance. The approach, built within a run-time execution environment, adapts to program phase as well as to system resource utilization. For example, chips may sustain significant error rates only when operating above a certain temperature or under other specific conditions. In this case, thermal feedback can direct the system to dynamically adjust the amount of reliability it provides. The approach shifts the focus on transient faults from ensuring correct hardware execution to ensuring correct software execution. As a result, the system ignores the many benign faults that do not propagate to affect program correctness. Since architecture trends point towards multi-threaded, multi-core designs, the technique leverages the available hardware resources for transient fault tolerance. Redundancy at the process level allows the operating system to freely schedule the redundant processes across all available hardware resources. Experimental results show that, on multi-core systems, an average overhead of only 30% is required to eliminate transient faults in an application.
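The idea of process-level redundancy with checking at the software boundary can be illustrated with a minimal sketch. This is not the system described in the talk; it only assumes that several replicas of a program run as ordinary operating-system processes and that their outputs are compared by majority vote. The names `run_redundant` and `vote`, and the choice of three replicas, are hypothetical.

```python
import subprocess
import sys
from collections import Counter


def vote(outputs):
    """Return the output produced by a strict majority of replicas."""
    winner, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority among replicas")
    return winner


def run_redundant(source, stdin_data, replicas=3):
    """Run `source` in several independent processes and vote on stdout.

    Assumption for illustration: the program is a Python snippet that reads
    stdin and writes stdout. The replicas are started together, so the
    operating system is free to place them on any available core.
    """
    procs = [
        subprocess.Popen(
            [sys.executable, "-c", source],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
        )
        for _ in range(replicas)
    ]
    # Only the externally visible output is compared. A fault that never
    # changes a replica's output is benign and goes unnoticed by design.
    outputs = [p.communicate(stdin_data)[0] for p in procs]
    return vote(outputs)
```

With three replicas, a single replica whose output is corrupted is outvoted by the other two. A production system would compare at finer-grained points such as system calls, which this sketch does not attempt.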
Dan Connors is an Assistant Professor at the University of Colorado at Boulder. His research includes run-time compilation for energy control, temperature management, fault tolerance, and optimization of multi-core systems. He directs the DRACO research group, which integrates compiler techniques, hardware performance monitors, and binary instrumentation tools such as PIN into research projects that evaluate run-time adaptation of code and system resource allocation. He received his Ph.D. in Computer Engineering from the University of Illinois at Urbana-Champaign in 2000.