High Availability White Papers

Highly Reliable Linux HPC Clusters: Self-Awareness Approach

Overview
Current fault-tolerance solutions for HPC systems focus on dealing with the result of a failure. However, most cannot handle runtime system configuration changes caused by transient failures, and they require a complete restart of the entire machine. The recently released HA-OSCAR software stack is one effort to address this gap. This paper discusses detailed solutions for enhancing the high availability and serviceability of clusters with HA-OSCAR through multi-head-node failover and a service-level fault-tolerance mechanism. The solution employs self-configuration and introduces Adaptive Self Healing (ASH) techniques. An analysis of HA-OSCAR's availability improvement was also conducted with various sensitivity factors.

Further White Paper Details
Publisher: Intel
File Format: PDF
Date Published: October 2004
Format: White Papers