High Availability White Papers
A Failure Predictive and Policy-Based High Availability Strategy for Linux High Performance Computing Cluster
Overview
The Open Source Cluster Application Resources (OSCAR) is a fully integrated cluster software stack designed for building and maintaining Linux Beowulf clusters. As OSCAR has become a popular tool for building cost-effective HPC clusters, High Availability (HA) becomes an equally important requirement, since an unavailable cluster delivers no performance. To combine HA and HPC features, the HA-OSCAR solution was created; it eliminates the numerous single points of failure in HPC systems and reduces unplanned downtime through sophisticated self-healing mechanisms and component redundancy. This paper reports newly introduced ideas and experiments on hardware-level failure detection and prediction based on OpenHPI, an open source implementation of the Service Availability Forum's Hardware Platform Interface (HPI).
| Publisher | Louisiana Tech University |
|---|---|
| Date Published | March 2004 |
| Format | White Papers |
| File Format | |
| Topics | |


