High Availability White Papers
Recovery of Memory and Process in DSM Systems
Overview
In this report, we discuss the recovery of memory and processes on a distributed shared-memory (DSM) platform. We divide the problem into recovery of unaffected memory (RUM) and recovery of affected processes (RAP). We point out that specially designed fault-tolerant, non-volatile memory is neither sufficient nor necessary to solve the RUM problem. It is not sufficient because the system can still go down when one node is lost, which can result from many types of faults; power failure is only one of them. Nor is it necessary: because the system is distributed by nature, information redundancy across fault units can be achieved without special memory. We discuss several ways of implementing a fault-tolerant memory system from plain memory by modifying the write-back protocols in DSM systems. The proposed techniques include mirroring and RAIM (Redundant Array of Independent Memory).
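The idea of redundancy across fault units can be illustrated with a small sketch. It assumes, by analogy with RAID, that a RAIM-style scheme keeps an XOR parity block on a separate node and updates it on every write-back; the abstract does not specify the paper's actual encoding or protocol, so the function names (`xor_blocks`, `write_back`, `recover`) and the data layout here are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the bytewise XOR of equally sized byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def write_back(nodes, parity, node_id, new_block):
    """Hypothetical write-back step: store the new block and return updated parity.

    Assumption: parity' = parity XOR old XOR new, so other nodes need not be read.
    """
    old_block = nodes[node_id]
    new_parity = xor_blocks([parity, old_block, new_block])
    nodes[node_id] = new_block
    return new_parity


def recover(nodes, parity, lost_id):
    """Rebuild the block of a lost node from the survivors and the parity block."""
    survivors = [block for i, block in nodes.items() if i != lost_id]
    return xor_blocks(survivors + [parity])


# Three nodes, each holding one memory block; parity lives on a separate fault unit.
nodes = {0: b"\x01\x02", 1: b"\x10\x20", 2: b"\xaa\xbb"}
parity = xor_blocks(list(nodes.values()))

# A write-back to node 1 updates data and parity together.
parity = write_back(nodes, parity, 1, b"\x33\x44")

# If node 1 fails, its block can be rebuilt from plain memory on the other nodes.
assert recover(nodes, parity, 1) == b"\x33\x44"
```

Mirroring is the simpler special case: with a single data node, the parity block is just a copy of that node's block.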
| Publisher | HP Labs | File Format | PDF, requires Acrobat Reader 5 |
|---|---|---|---|
| Date Published | March 2001 | Downloads | 2 |
| Format | White Papers | | |
| Topics | | | |



