Predictive Failure Analysis - IBM p5 590 System Handbook


Integrated hardware error detection and fault isolation has been a key
component of IBM's UNIX server design strategy since 1997. First Failure Data
Capture (FFDC) check stations are carefully positioned within the server logic
and data paths to ensure that potential errors can be quickly identified and
accurately tracked to an individual field replaceable unit (FRU). These
checkers are collected in a series of Fault Isolation Registers (FIR), where
they can easily be accessed by the service processor. All communication
between the service processor and the FIR is accomplished out of band. That
is, operation of the error detection mechanism is transparent to an operating
system. This entire structure is below the architecture and is not seen, nor
accessed, by system level activities.

In this environment, strategically placed error checkers are continuously
operating to precisely identify error signatures within defined hardware fault
domains. IBM servers are designed so that in the unlikely event that a fatal
hardware error occurs, FFDC, coupled with extensive error analysis and
reporting firmware in the service processor, should allow IBM to isolate a
hardware failure to a single FRU. In this event, the FRU part number will be
included in the extensive error log information captured by the service
processor. In select cases, a set of FRUs will be identified when the fault is
on an interface between two or more FRUs. For example, three FRUs may be
called out when the system cannot differentiate between a failed driver on one
component, the corresponding receiver on a second, or the interconnect fabric.
In either case, it is IBM's maintenance practice for the p5-590 and p5-595
systems to replace all of the identified components as a group. Meeting
rigorous goals for fault isolation requires a reliability, availability, and
serviceability methodology that carefully instruments the entire system logic
design with meticulously placed error checkers.
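
The callout behavior described above can be sketched as a lookup from latched checker bits to FRUs. This is only an illustrative model, not the service processor firmware: the `Checker` table, bit positions, and part numbers such as `FRU-PROC-BOOK-0` are hypothetical names invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Checker:
    """One error checker latched into a bit of a (hypothetical) FIR."""
    bit: int           # bit position in the register
    description: str   # what the checker detects
    frus: tuple        # FRU part numbers to call out when the bit is set


# Hypothetical checker table; real FIR layouts and part numbers differ.
CHECKERS = (
    Checker(0, "L2 cache uncorrectable error", ("FRU-PROC-BOOK-0",)),
    Checker(1, "memory controller parity error", ("FRU-DIMM-3",)),
    # An interface fault: driver, receiver, or interconnect may be at fault,
    # so all three FRUs are called out and replaced as a group.
    Checker(2, "interface error between processor book and I/O hub",
            ("FRU-PROC-BOOK-0", "FRU-IO-HUB-1", "FRU-CABLE-7")),
)


def callout(fir_value: int) -> list:
    """Return the FRUs to replace for the bits set in fir_value, without duplicates."""
    frus = []
    for checker in CHECKERS:
        if fir_value & (1 << checker.bit):
            for fru in checker.frus:
                if fru not in frus:
                    frus.append(fru)
    return frus
```

With this table, `callout(0b010)` yields a single FRU, while `callout(0b100)` yields the three-FRU group for the interface fault.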

6.3.2 Predictive failure analysis

Statistically, there are two main situations where a component has a
catastrophic failure: shortly after being manufactured, and when it has
reached the end of its useful life. Between these two regions, the failure
rate for a given component is generally low, and failure is normally gradual.
A complete failure usually happens after some degradation has occurred, be it
in the form of temporary errors, degraded performance, or degraded function.
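
This failure pattern is commonly drawn as a "bathtub" curve. The handbook does not name a model, so the following is only a sketch under an assumed one: a sum of Weibull hazard rates, with purely illustrative shape and scale parameters (time in hours).

```python
def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Weibull hazard rate h(t) = (shape/scale) * (t/scale)**(shape - 1), for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_hazard(t: float) -> float:
    """Assumed failure rate at age t: early-life + constant + wear-out terms."""
    early = weibull_hazard(t, shape=0.5, scale=2000.0)      # shape < 1: decreasing
    constant = 1e-5                                         # low mid-life rate
    wearout = weibull_hazard(t, shape=5.0, scale=60000.0)   # shape > 1: increasing
    return early + constant + wearout


# Compare the assumed failure rate early in life, mid-life, and late in life.
for hours in (10, 20_000, 100_000):
    print(f"{hours:>7} h: {bathtub_hazard(hours):.2e} failures/hour")
```

Under these assumed parameters the rate is high just after manufacture, lowest in mid-life, and rises again as the component wears out.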

The p5-590 and p5-595 can monitor critical components such as processors,
memory, cache, I/O subsystem, PCI-X slots, adapters, and internal disks, and
detect possible indications of failure. The system monitors these components
continuously and, when an error threshold is reached, can isolate and
deallocate the failing component without a system outage, thereby avoiding a
partition or complete system failure.
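
The threshold-and-deallocate idea can be sketched as follows. This is a minimal model, not the actual firmware logic: the `PredictiveMonitor` class, the component names, and the threshold value are hypothetical, and it assumes a simple count of recoverable errors per component.

```python
from collections import defaultdict


class PredictiveMonitor:
    """Counts recoverable errors per component and deallocates at a threshold."""

    def __init__(self, threshold: int):
        self.threshold = threshold            # assumed: errors tolerated before action
        self.error_counts = defaultdict(int)  # component name -> recoverable errors seen
        self.deallocated = set()              # components taken out of service

    def record_recoverable_error(self, component: str) -> bool:
        """Count one error; return True if this call deallocated the component."""
        if component in self.deallocated:
            return False
        self.error_counts[component] += 1
        if self.error_counts[component] >= self.threshold:
            self.deallocated.add(component)
            return True
        return False

    def active(self, components):
        """Return the components that are still in service, in the given order."""
        return [c for c in components if c not in self.deallocated]
```

For example, with `PredictiveMonitor(threshold=3)`, the third recoverable error reported against a hypothetical `"cpu2"` removes it from service while the other components keep running.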

IBM Eserver p5 590 and 595 System Handbook


