High Availability White Papers
Highly Reliable Linux HPC Clusters: Self-Awareness Approach
Overview: Current fault-tolerance solutions for HPC systems focus on dealing with the result of a failure. However, most cannot handle runtime system configuration changes caused by transient failures, and instead require a complete restart of the entire machine. The recently released HA-OSCAR software stack is one effort to address this gap. This paper discusses detailed solutions by which HA-OSCAR enhances the high availability and serviceability of clusters through multi-head-node failover and a service-level fault-tolerance mechanism. The solution employs self-configuration and introduces Adaptive Self Healing (ASH) techniques. An analysis of HA-OSCAR's availability improvement was also conducted with various sensitivity factors.
| Field | Value |
|---|---|
| Publisher | Intel |
| Date Published | October 2004 |
| Format | White Papers |



