OpenSolaris

OpenSolaris Community: Fault Management

About Predictive Self-Healing

Self-healing functionality in a modern operating system gives users and administrators fine-grained fault isolation and, where possible, restart of any component (hardware or software) that experiences a problem. To do so, the system must include intelligent, automated, proactive diagnosis of the errors observed on it. The diagnosis system triggers targeted automated responses or guided human intervention that mitigate a specific problem, or at least prevent it from getting worse. Finally, these capabilities are tied to a new model for system administrators oriented around simpler, higher-level abstractions. Sun's first Predictive Self-Healing features are part of Solaris 10 and OpenSolaris and include the Fault Manager and the Service Manager.

About Fault Management

The Solaris Fault Management effort (originally code-named FMA inside Sun) provides a new architecture for building resilient error handlers, error telemetry, automated diagnosis software, response agents, and a consistent model of system failures for a management stack. Many parts of Solaris already participate in FMA, including the CPU and memory error handling for UltraSPARC III and IV, the UltraSPARC PCI HBAs, and more. A variety of projects are underway, including full support for CPU, memory, and I/O faults on Opteron, conversion of key device drivers, and integration with various management stacks.

The legacy UNIX failure model left error handling up to each subsystem author and provided only the ability to emit an error message, intended for a human, to the system log in a non-standard format. When a subsystem is converted to participate in Fault Management, its error handling is made resilient so that the system can continue to operate despite an underlying failure, and it produces telemetry events that drive automated diagnosis and response. The Fault Management tools and architecture enable development of self-healing content for software and hardware failures, for both microscopic and macroscopic system resources, all with a unified, simple view for administrators and system management software.
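
The flow described above (an error handler produces telemetry, a diagnosis engine turns telemetry into a fault, and a response agent acts on the fault) can be pictured with a small sketch. This is only an illustration of the general idea, not the Fault Manager's actual interfaces: the class names, the event class strings such as "ereport.example.mem.ce", and the rule "three errors on one resource within 60 seconds" are all assumptions made up for this example.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Ereport:
    """A telemetry event emitted by an error handler (hypothetical shape)."""
    event_class: str   # made-up class name, e.g. "ereport.example.mem.ce"
    resource: str      # the component the error was observed on
    timestamp: float   # seconds


class DiagnosisEngine:
    """Diagnoses a fault when `n` errors hit one resource within `window` seconds.

    The threshold rule is an assumption chosen for illustration only.
    """

    def __init__(self, n=3, window=60.0):
        self.n = n
        self.window = window
        self._recent = defaultdict(deque)  # resource -> timestamps of recent errors
        self.faulted = set()

    def receive(self, ereport):
        """Return a fault dict if this event completes a diagnosis, else None."""
        if ereport.resource in self.faulted:
            return None  # already diagnosed; do not report it twice
        times = self._recent[ereport.resource]
        times.append(ereport.timestamp)
        # Drop events that have aged out of the window.
        while times and ereport.timestamp - times[0] > self.window:
            times.popleft()
        if len(times) >= self.n:
            self.faulted.add(ereport.resource)
            return {"class": "fault.example.mem.page", "resource": ereport.resource}
        return None


class RetireAgent:
    """Response agent: records which resources it has isolated (hypothetical)."""

    def __init__(self):
        self.retired = []

    def handle(self, fault):
        self.retired.append(fault["resource"])


def run(events, engine, agent):
    """Feed telemetry through diagnosis and hand any faults to the agent."""
    for ev in events:
        fault = engine.receive(ev)
        if fault is not None:
            agent.handle(fault)
    return agent.retired
```

In this sketch a resource that reports a burst of errors is isolated once, while a resource with only occasional errors is left alone, which is the kind of distinction an automated diagnosis is meant to draw in place of a free-form log message.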

Some objectives for the Fault Management Community are to:

  • Convert hardware and software subsystems to participate in Fault Management
  • Connect Solaris Fault Management to system management protocols and software
  • Evolve and enrich the tools and common architecture for Fault Management
  • Research underlying failure modes and design and develop automated diagnosis software that is able to build effective self-healing for particular resources

Documentation

  • FMA Events and Messages
    • Diagnosis results obtained from the Fault Management software in Solaris contain links to the Knowledge Article Web.
    • The FMA Event Registry is the central repository for all fault management events passed between error handlers, the fault manager and its agents.
  • Writing Device Drivers for FMA. The Writing Device Drivers guide contains a section "Sun Fault Management Architecture I/O Fault Services" in "Chapter 13, Hardening Solaris Drivers." This section describes the steps and techniques used to write an FMA-aware driver.
  • Fault Management Daemon Programmer's Reference Manual. The FMD PRM is a description of the internal architecture of the Sun Fault Management Daemon, fmd(1M), and the programming interfaces exported by the daemon.
    • FMDPRM 1.4 April 2008. Added -b option to the fmtopo command. Changed descriptions of TOPO_WALK_CHILD and TOPO_WALK_SIBLING.
    • FMDPRM 1.3 March 2008. Added repaircode to the table of Fault Management Configuration Properties.
    • FMDPRM 1.2 August 2007. Initial post of this document.

What's New

The OpenSolaris universe of new fault management activities (projects, ARC cases, diagnosis engines, recovery agents and error handlers).

News

Predictive Self-Healing for x64 Feature Story | sun.com | 08/30/2006

Feature story for the front page of sun.com that describes the new Predictive Self-Healing features for x64 systems.

Benefits of Memory Page Retire | Dependable Systems and Networks 2006 | 07/04/2006

Paper for Dependable Systems and Networks Conference describing a quantitative model demonstrating the availability benefits of Solaris Memory Page Retire (MPR), driven by the Solaris Fault Manager.

Predictive Self-Healing and DTrace Receive 2005 Innovation Awards | InfoWorld | 08/01/2005

InfoWorld has announced its 2005 Innovator awards and Sun proudly received awards for Predictive Self-Healing and DTrace, two of the breakthrough technologies in Solaris 10, and the senior engineers who designed and led these projects.

A Diagnosis of Self-Healing Systems | Slashdot | 12/21/2004

Slashdot coverage of the ACM Queue article describing Sun's Predictive Self-Healing architecture.

Self-Healing in Modern Operating Systems | ACM Queue | 12/01/2004

ACM Queue article by Mike Shapiro describing Sun's approach to Predictive Self-Healing, including both Fault Management and Service Management in Solaris 10.

Blogs

Scott Davenport - The FMA Triad: Topology, Telemetry & Diagnosis Rules - Part 2

Apr 29, 11:21 AM

In Part 1 of "The FMA Triad: Topology, Telemetry & Diagnosis Rules" I focused on topology. It's time to unravel the second piece of the triad - telemetry - with a specific focus on how the ...

Scott Davenport - FMA Features & Fixes in Solaris 10 Update 5

Apr 15, 3:49 PM

Solaris 10 Update 5 released today and is available for download. This update contains some great features and fixes for FMA. Here's just a few of my favorite bugs (an oxymoron?) that are included ...

Scott Davenport - Predictive Self Healing (FMA) for T5140/T5240

Apr 9, 9:03 AM

April 9, 2008: Sun announced the T5140 / T5240 platforms centered around the UltraSPARC T2 Plus processor. The T2 Plus extends the capabilities of the UltraSPARC T2 processor, the most obvious being ...

Scott Davenport - Logging Repair Actions

Mar 26, 9:52 AM

Some changes were putback yesterday to have repair events logged. A repair event is either automatically initiated by FMD when it detects a component has been replaced in the system, or via the ...

Scott Davenport - The FMA Triad: Topology, Telemetry & Diagnosis Rules

Mar 11, 12:06 PM

At a high level, Solaris FMA relies on three things to do things right: topology, telemetry, and diagnosis rules. The "FMA Triad", if you will. All three need to be correct. All three need to be in ...