OpenSolaris Community: Fault Management
Endorsed projects
About Predictive Self-Healing
Self-healing functionality for users and administrators of a modern operating system provides fine-grained fault isolation and, where possible, restart of any component, hardware or software, that experiences a problem. To do so, the system must include intelligent, automated, proactive diagnosis of the errors observed on the system. The diagnosis system is used to trigger targeted automated responses or guided human intervention that mitigate a specific problem or at least prevent it from getting worse. Finally, these new system capabilities are connected to a new model for system administrators oriented around simpler, higher-level abstractions. Sun's first Predictive Self-Healing features are part of Solaris 10 and OpenSolaris and include the Fault Manager and the Service Manager.

About Fault Management
The Solaris Fault Management effort (originally code-named FMA inside Sun) provides a new architecture for building resilient error handlers, error telemetry, automated diagnosis software, response agents, and a consistent model of system failures for a management stack. Many parts of Solaris already participate in FMA, including the CPU and memory error handling for UltraSPARC III and IV, the UltraSPARC PCI HBAs, and more. A variety of projects are underway, including full support for CPU, memory, and I/O faults on Opteron, conversion of key device drivers, and integration with various management stacks.

The legacy UNIX failure model left error handling up to each subsystem author and provided only the ability to emit a human-readable error message to the system log in a non-standard format. When a subsystem is converted to participate in Fault Management, its error handling is made resilient so that the system can continue to operate despite an underlying failure, and it produces telemetry events that drive automated diagnosis and response.
The Fault Management tools and architecture enable development of self-healing content for software and hardware failures, for both microscopic and macroscopic system resources, all with a unified, simple view for administrators and system management software. Some objectives for the Fault Management Community are to:
Documentation
What's New
The OpenSolaris universe of new fault management activities (projects, ARC cases, diagnosis engines, recovery agents and error handlers).

News
Predictive Self-Healing for x64 Feature Story | sun.com | 08/30/2006
Feature story for the front page of sun.com that describes the new Predictive Self-Healing features for x64 systems.

Benefits of Memory Page Retire | Dependable Systems and Networks 2006 | 07/04/2006
Paper for the Dependable Systems and Networks Conference describing a quantitative model demonstrating the availability benefits of Solaris Memory Page Retire (MPR), driven by the Solaris Fault Manager.

Predictive Self-Healing and DTrace Receive 2005 Innovation Awards | InfoWorld | 08/01/2005
InfoWorld has announced its 2005 Innovator awards and Sun proudly received awards for Predictive Self-Healing and DTrace, two of the breakthrough technologies in Solaris 10, and the senior engineers who designed and led these projects.

A Diagnosis of Self-Healing Systems | Slashdot | 12/21/2004
Slashdot coverage of the ACM Queue article describing Sun's Predictive Self-Healing architecture.

Self-Healing in Modern Operating Systems | ACM Queue | 12/01/2004
ACM Queue article by Mike Shapiro describing Sun's approach to Predictive Self-Healing, including both Fault Management and Service Management in Solaris 10.

Blogs
Scott Davenport - The x86gentopo Project
Jun 30, 5:14 PM
You've all seen this before...actually, maybe you haven't. In which case....good! You've got a healthy system. But, if your box did have issues, FMA would report something like: # fmadm faulty ...

Scott Davenport - Device Driver Integration with Solaris FMA
Jun 3, 3:23 PM
A lot of work has been put into devising a common rule set for PCI/PCIE devices in Solaris FMA. Known as I/O Fault Services, there's a thorough document detailing how a developer goes about hardening ...

Scott Davenport - OpenSolaris 2009.06, the M3000, and FMA
Jun 1, 10:01 AM
OpenSolaris 2009.06 released today. One of the key features is support for the SPARC product line. A colleague and I gave one of the pre-releases a whirl on a midrange system, an M3000. Now, a silent ...

Scott Davenport - Contract Kills on Memory UEs
May 12, 9:19 AM
On SPARC systems, when there's a memory uncorrectable error (UE), Solaris will determine if the affected page is in user space or kernel space. If in user space, the affected user process is killed - ...

Scott Davenport - Solaris 10 Update 7 Available
May 1, 11:11 AM
Solaris 10 Update 7 is now posted and available for download. And there's been 65+ bug fixes and enhancements for FMA. Here's a few of my favorites (can one have favorite bugs? :) fixed in S10U7: ...