Managing Reliability in Real Time - The Risk Management Grid (RMG)

Jim Fitch, Noria Corporation

In a past article,1 I mentioned that in the world of reliability, risk can be defined as the probability of failure multiplied by the consequence(s) of failure. This simple definition should be a reliability team’s most important daily metric. In this editorial, a two-dimensional matrix is proposed that serves as a real-time moving picture of the health of plant assets (Figure 1).

I don’t manage an in-plant machinery reliability team at Noria (we have no machines), but if I did, I would want to see this matrix at the beginning and end of each day. I would want it to serve as my reliability command post in planning and scheduling maintenance work orders and related activities.

Let’s call it the Risk Management Grid (RMG) for production assets. Its singular purpose is to characterize composite asset risk factors associated with failure probability and consequence. It provides a revealing cross-sectional view of current reliability conditions and leaves a visible trail of the past to analyze and prescribe future proactive improvement initiatives.

Another important feature is that it facilitates the integration of maintenance technologies (vibration analysis, oil analysis, etc.) by formatting and normalizing all technology alarms into a unified real-time view. Note that many of the concepts in the following discussion are rooted in the failure management tools known as reliability-centered maintenance (RCM) and failure modes and effects analysis (FMEA), which were discussed in Root Cause Analysis Techniques for the Lubrication Professional.2 What’s new is the real-time presentation format.

As can be seen in Figure 1, populating the grid squares are the asset numbers (IDs) of machines that have reportable adverse reliability conditions. Each machine that has such a reportable condition is positioned on the grid. However, it is the overall machine population density and distribution on the grid that conveys the pictorial image of plant health. Let’s take a closer look at how the grid is constructed.

Figure 1

Condition Severity

Horizontally across the grid is the condition severity (CS) representing three classes in progression from left to right: S1, S2 and S3. These classes or levels of condition severity are defined below:

An S1 condition is a cautionary alarm level associated with a significant root cause condition or, in some cases, mild incipient (but abnormal) wear. At their best, S1 conditions are root cause-related and do not yet exhibit symptoms of a failure in progress. Examples would be elevated temperatures, marginal misalignment, contaminated oil, etc. Corrective measures can usually be planned and scheduled without lost production or downtime. An undetected or uncorrected S1 machine can advance to an S2 level. An excellent metric for monitoring and controlling machines that breach cautionary limits and become designated S1 is the overall lubrication effectiveness (OLE) discussed in Drew Troyer’s article, “OLE! Rallying for a New Lubrication Performance Metric,” which was published in Machinery Lubrication magazine’s July-August 2002 issue.3

An S2 condition is one that has moved beyond the root cause or incipient wear stage to a point at which the machine’s reliability is increasingly at risk. Although this is not a condition of imminent or even impending failure, it does represent a more advanced fault or wear condition and is a matter of real concern. If left unremedied, the rate of wear and deterioration threatens reliable operation with the potential for lost production and/or an expensive rebuild. Examples of S2 conditions might be wear metals in a fan bearing oil above the 95th percentile, process pump overall vibration creeping above, say, 3.5 mm/sec (0.14 inches/sec), or a hot gearbox coupling from misalignment detected with a heat gun. Predictive maintenance has an important role in detecting S2 faults and machine conditions. At best, S2 conditions are detected early and corrective measures can be scheduled without lost production. A short outage for inspection and/or correction may be required for processes with long production cycles. However, if undetected or left uncorrected, an S2 condition can advance to an S3 level.

Machines that rate an S3 are at an impending to precipitous state of failure - little time is left. Often the corrective action has the potential to disrupt production schedules. For S3 level machines, every effort is made to minimize collateral damage, forced downtime and repair costs. Focus is primarily on damage control and keeping the failure contained to a single bearing or machine component. Correction usually involves a component or unit change-out. For process-critical machines, an undetected or uncorrected S3 condition can lead to a major outage and costly repairs. S3 levels and beyond (events of catastrophic sudden death) define classic breakdown maintenance.

Machine Criticality

Going vertically down the RMG are three levels characterizing machine criticality (MC), corresponding to the consequence(s) of failure. Assigning criticality levels is subjective and is based on the judgment of reliability professionals with knowledge of such factors as:

  1. Current production demand for the process or machine (peak load, base load, etc.)

  2. In the case of a potential forced or even scheduled outage/shutdown, the cost of each downtime hour and the downtime duration

  3. Safety and environmental risks

  4. Availability of skilled craft to make the repairs

  5. Availability of spares

  6. Availability of redundant or standby equipment

  7. Rate at which failures typically progress

  8. Detectability of in-progress failures

For organizations that have already gone through criticality analysis using RCM/FMEA tools, criticality levels have probably been pre-assigned to each machine or machine component. However, the scale will likely have to be adjusted to conform to the three MC levels used in the grid:

C1 machines are generally not process-critical. As such, a failure is more of an inconvenience than a serious maintenance cost or business disruption.

Loss of availability of C2 machines can be tolerated only for short periods of time and the cost of repair and/or downtime is potentially significant.

A C3 machine has the highest degree of criticality, which is usually associated with one or more of the following: high business interruption costs, safety risks, environmental factors and/or high cost of repair/replacement.

Grid Zones

There are five distinct zones running diagonally down and across the grid that integrate condition severity levels with machine criticality levels. These zones are color-coded as seen in Figure 1. While all machines with reportable conditions are located on the RMG, the zones define the seriousness and urgency of the machines by their locations on the grid.

Machines in Zone A are low-priority corrections. Occasionally, no corrective action is taken unless a machine transitions in condition severity, say from S1 to S2, which would move it into Zone B. In such cases, Zone A machines are watched more closely and every reasonable effort is made to control root cause conditions. Machines in Zone B are also low-priority but corrective measures can be planned and scheduled to mitigate the cost impact of the repairs.

Machines in Zone C are higher in risk and corrective work orders should be scheduled as early as possible. Machines in Zones D and E require immediate action, commanding the highest priority and allocation of maintenance focus and resources.
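As a rough illustration, the zone lookup can be sketched in a few lines of Python. Figure 1 is not reproduced in this text, so the mapping below is an assumption: it treats the five zones as the five diagonals of the 3 x 3 grid, so that the zone depends only on the sum of the severity and criticality levels. That is consistent with the statement above that a Zone A machine moving from S1 to S2 lands in Zone B, but the actual color-coding in Figure 1 may differ. The name rmg_zone is hypothetical.

```python
# Hypothetical helper, not taken from the article: map a grid cell to a zone.
# ASSUMPTION: the five zones are the five diagonals of the 3 x 3 grid, so
# S1/C1 is Zone A and S3/C3 is Zone E. Check this against Figure 1 before use.

ZONES = "ABCDE"


def rmg_zone(severity: int, criticality: int) -> str:
    """Return the zone letter for severity S1-S3 and criticality C1-C3."""
    if severity not in (1, 2, 3) or criticality not in (1, 2, 3):
        raise ValueError("severity and criticality must each be 1, 2 or 3")
    # severity + criticality ranges from 2 to 6, giving five diagonals.
    return ZONES[severity + criticality - 2]


if __name__ == "__main__":
    # Print the grid with criticality running down and severity across.
    for c in (1, 2, 3):
        print(f"C{c}:", " ".join(rmg_zone(s, c) for s in (1, 2, 3)))
```

Under this assumption, a C3 machine with only a cautionary S1 alarm and a C1 machine at S3 both fall in Zone C, which is one reason the real zone boundaries should be confirmed against the plant's own grid.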

Trending the RMG Composite Score

Reliability program effectiveness can be monitored by simply calculating a daily composite score from all machines on the RMG. The calculation weights each zone by a power of two, corresponding to the relative reliability risk assigned to that zone. In other words, the risk associated with a Zone B machine is roughly twice that of a Zone A machine, a machine in Zone C represents twice the risk of a Zone B machine, and so on. The following equation should be used to derive the RMG composite score:

RMG Composite = 1A + 2B + 4C + 8D + 16E

The letters A through E represent the number of machines in the zones corresponding to those letters. In the example shown in Figure 1, the composite RMG score is calculated as follows:

Zone A - 11 Machines
Zone B - 10 Machines
Zone C - 11 Machines
Zone D - 5 Machines
Zone E - 1 Machine

RMG Composite = (1 x 11) + (2 x 10) + (4 x 11) + (8 x 5) + (16 x 1) = 131
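For readers who want to automate the daily score, the calculation above can be sketched in Python as follows. The weights come directly from the equation; the function name rmg_composite and the dictionary layout are illustrative assumptions, and the example uses the zone counts that appear in the worked calculation (11, 10, 11, 5 and 1).

```python
# Illustrative sketch of the RMG composite score; names are hypothetical.
# Weights follow the equation in the text: 1A + 2B + 4C + 8D + 16E.

ZONE_WEIGHTS = {"A": 1, "B": 2, "C": 4, "D": 8, "E": 16}


def rmg_composite(zone_counts: dict) -> int:
    """Return the weighted sum of machine counts per zone."""
    unknown = set(zone_counts) - set(ZONE_WEIGHTS)
    if unknown:
        raise ValueError(f"unknown zone(s): {sorted(unknown)}")
    return sum(ZONE_WEIGHTS[zone] * count for zone, count in zone_counts.items())


# Zone counts as used in the worked calculation above.
example_counts = {"A": 11, "B": 10, "C": 11, "D": 5, "E": 1}

if __name__ == "__main__":
    print(rmg_composite(example_counts))  # 11 + 20 + 44 + 40 + 16 = 131
```

Zones with no machines can simply be left out of the dictionary, since they contribute nothing to the sum.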

Figure 2 shows a trend graph of a hypothetical plant. In this case, the daily RMG composite score was reduced by more than 50 percent in 12 months. Strategies for reducing composite scores relate to an organization’s improvements in maintenance standards and practices, especially excellence in proactive and predictive maintenance.

Figure 2. Average Monthly RMG Composite Score

It is worth noting that, for obvious reasons, the RMG will be populated only with machines that have detected reportable conditions. Hence the effectiveness and power of the RMG as a reliability management tool depend on the reliability team’s ability to detect aberrant machine conditions. This brings into question the effectiveness of available condition-monitoring technologies (oil analysis, vibration, motor current, etc.), usage skills of technologists and the frequency of use. Collectively, these factors will all have a marked effect on the likelihood that reportable conditions will be identified and placed on the grid.

Machines with severe fault or wear conditions that go undetected can fail catastrophically without ever appearing on the RMG. Additionally, as plant reliability programs continuously improve in pursuit of world-class performance, alarm levels will likely change. For instance, what was at first an S1 reportable condition may shift in time to an S2 as the program is refined and becomes more performance-sensitive and discriminating.

The score is also affected by the total number of machines in the plant or work site. As machines are retired and others are added, the score will change. This effect can be normalized by reporting the RMG composite score per, say, 100 machines in the system. Other metrics that can be monitored and reported based on the RMG include:

The Payoff

Companies use a variety of systems and methods to direct maintenance activities. Labor, parts and tools are often in such limited supply that the utmost precision is needed in allocating maintenance resources and avoiding unnecessary work orders. In sum, the RMG makes maintenance priorities more conspicuous, helps integrate the use of maintenance technologies and significantly simplifies the planning and scheduling process. Most importantly, it leaves a historic trail of events needed to prescribe new proactive maintenance and reliability strategies to prevent recurrence.


  1. Fitch, J. (2002, September-October). How and Why Machines Wear Out, Machinery Lubrication, pp. 2-6.
  2. Troyer, D. (2002, November-December). Root Cause Analysis Techniques for the Lubrication Professional, Practicing Oil Analysis, pp. 26-39.
  3. Troyer, D. (2002, July-August). OLE! Rallying for a New Lubrication Performance Metric, Machinery Lubrication, pp. 4-9.