How to Make Your Oil Analysis Program Produce More Alerts

Jim Fitch, Noria Corporation

I want bad news fast. Why? Problems tend to compound. Rarely do they heal themselves. Instead, the worse things get, the faster they get worse. As time passes, the cost of repair and lost production can soar exponentially.

Maintenance resources need focus and preemptive timing. The longer you wait to respond to what causes failure, the more machine conditions take over control of your schedule and budget. Fix the roof while the sun is shining. It’s not just about keeping machines running but rather keeping them running at the lowest possible cost.

Condition monitoring is an essential troubleshooting tool. It provides the means to find the cause or source of an impending problem. When issues are caught early, you have the luxury of convenience and the option of simple remedies, usually with minimal (if any) business interruption.

The Wisdom of Early Detection - Not Just-in-Time Detection

Several years ago, I performed a root cause analysis on a rash of gearbox failures in helicopters used for the logging industry (heli-logging). These failures were the cause of serious accidents and human casualties. Because of the high power density of these gearboxes, mild problems could quickly escalate into life-threatening conditions.

A helicopter’s planetary gearbox conveys power from the main drive shaft (driven by the engine) to the main rotor, giving the helicopter aerodynamic lift and thrust. The condition of the lubricant (the right health, cleanliness and supply) is essential for reliable, safe operation. Any failure of the lubricant can quickly start a disastrous chain of events, from lube failure to bearing failure to gear failure to rotor failure to lift failure to helicopter failure (crash and burn).

While the helicopter example may be an extreme case, it is still an important and vivid illustration of the seriousness of early detection and response. What might seem like a mild condition now (rising varnish potential, cloudy oil, elevated particle count, uptick in fuel dilution, out-of-grade viscosity, low oil level, etc.) can abruptly worsen to a runaway catastrophic outcome.

Root Cause Alarms

Proactive maintenance should always be “job one” in condition monitoring. Doing it well requires a solid understanding of root causes that are ranked by likelihood (the Pareto principle). There is also a need to understand the criticality of each machine. What you perceive as “the cause” may be nothing more than a transitional step in a sequence of many en route to functional failure. The real root cause you are seeking is the one that could have been controlled or prevented. That is not to say your purpose is to look for someone to blame.

73% of MachineryLubrication.com visitors use the ISO cleanliness code to set target alarms for system cleanliness levels

I was once surveying a steel mill and noted that the needles on most of the pressure-differential gauges (used on filters) were in the yellow or red zones. My escort quickly commented that the maintenance staff no longer used the gauge readings to schedule filter changes. Instead, all oil filters were changed on six-month intervals regardless of the gauge reading. Needless to say, I was shocked. On-condition maintenance should always take precedence over scheduled maintenance.

See the fault tree in Figure 1 used to troubleshoot the root cause of a large process pump failure. Asking the “repetitive why” takes you to the pump (and not the coupling or motor) that failed. Next, the pump bearings were found destroyed. An examination of the lubricant revealed heavy contamination. An effective functioning filter would have prevented such contamination. In fact, the filter was found to be plugged, as indicated by the red-alert pressure gauge reading that went unnoticed.

Had an operator reported the high gauge reading and followed up with a filter change, the pump might have been saved. A red-alert gauge reading becomes a pump failure only when it goes unreported and uncorrected. If the filter became plugged prematurely, it should have been examined. Analyze the used filter to determine the type and source of particles. How have they invaded the machine and the oil? Don’t just change the filter. Correct the ingression points, i.e., the sources of the particles that plug the filter.

Many root causes can be detected by oil analysis. These include various types of contaminants, wrong oil and degraded oil. If cleanliness targets and other alerts are adjusted to promptly weed out root causes, they won’t advance to even the earliest state of failure.
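A cleanliness target alarm of this kind can be sketched in a few lines. This is only an illustration, assuming the lab reports a three-number ISO cleanliness code such as 19/17/14 and that a target code has already been chosen for the machine. The function names and the sample codes are hypothetical. The only rule encoded is that a higher range number means dirtier oil, so any reported number above its target raises an alert.

```python
def cleanliness_exceedances(reported: str, target: str) -> list[int]:
    """Return, per size channel, how many range numbers the sample is over target.

    Both arguments are ISO-style codes written as "R1/R2/R3".
    A value of 0 means that channel is at or under its target.
    """
    rep = [int(part) for part in reported.split("/")]
    tgt = [int(part) for part in target.split("/")]
    if len(rep) != len(tgt):
        raise ValueError("reported and target codes must have the same number of channels")
    return [max(r - t, 0) for r, t in zip(rep, tgt)]


def cleanliness_alert(reported: str, target: str) -> bool:
    """True if any size channel is dirtier than its target."""
    return any(cleanliness_exceedances(reported, target))


# Hypothetical sample one range number over target on every channel.
print(cleanliness_exceedances("19/17/14", "18/16/13"))
print(cleanliness_alert("17/15/12", "18/16/13"))
```

Comparing channel by channel, rather than only the first number, keeps a rise in one particle size range from being masked by acceptable readings in the others.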

I recently finished a case related to diesel engine failures. The lab was using a 5-percent alarm limit for fuel dilution. Anything less than 5 percent was reported as normal. No wonder the many defective fuel injectors were not detected until catastrophic engine failure had occurred. Coolant leak detection is also poorly deployed in diesel engine oil analysis. In my opinion, even the slightest amount of coolant leak is cause for concern.

Weak Signal Alarms

Long ago, researchers and tribologists discovered that the strength of signals being emitted during machine failure depends on the state of failure. Incipient failure produces weak signals, while precipitous (advanced) failure creates strong signals. This applies to various types of signals, including wear debris generation, vibration, heat and acoustics. Hence, if you want to catch failure early, you need to be good at detecting and responding to weak signals.

Sadly, many oil analysis programs are structured to do the opposite. You can’t fix a problem that you can’t see. Rather than setting tight alarms and limits that are a slight offset from normal conditions, some programs don’t alarm until problems advance to a state of imminent danger, such as two or even three standard deviations over the mean based on data history. Loose alarms equate to failure blindness.
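To make the difference between tight and loose limits concrete, here is a rough sketch of statistical alarms computed from a machine's own data history. The multipliers (one standard deviation for a cautionary alarm, two for a critical alarm), the function names and the iron readings are all assumptions chosen for illustration, not recommended settings. Real limits should reflect each machine's failure modes and criticality.

```python
from statistics import mean, stdev


def alarm_limits(history, caution_k=1.0, critical_k=2.0):
    """Return (caution_limit, critical_limit) as mean + k * standard deviation."""
    if len(history) < 2:
        raise ValueError("need at least two historical readings")
    m = mean(history)
    s = stdev(history)  # sample standard deviation
    return m + caution_k * s, m + critical_k * s


def classify(value, history, caution_k=1.0, critical_k=2.0):
    """Label a new reading as 'normal', 'caution' or 'critical'."""
    caution, critical = alarm_limits(history, caution_k, critical_k)
    if value >= critical:
        return "critical"
    if value >= caution:
        return "caution"
    return "normal"


# Hypothetical iron readings (ppm) from eight previous samples.
iron_history = [10, 12, 11, 13, 9, 11, 12, 10]

print(classify(14, iron_history))                                  # tight limits
print(classify(14, iron_history, caution_k=3.0, critical_k=4.0))   # loose limits
```

With these made-up numbers, a reading of 14 ppm is flagged under the tight limits but passes as normal when the first alarm sits three standard deviations above the mean, which is the failure blindness described above.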

Some of the best oil analysis programs use cautionary alarms to alert condition-based maintenance (CBM) technicians to abnormal lab data or a reportable inspection condition. This way, the alarm results in a more measured or throttled response. People who wrongly push the panic button often discredit the value of condition monitoring. Remember, “if it ain’t broke, don’t fix it.” A “possible problem” needs to be vetted before you tear down a machine and cause even more serious or real problems. Cautionary alarms should start the vetting process.

Nuisance alarms or false positives are always a concern. Oil analysis is far from perfect, and there are many unavoidable data errors, including false negatives. Exception testing can be extremely useful for confirming an alarm condition and obtaining a better understanding of severity. It involves resampling and running more extensive tests by the lab. Because many exception tests are costly or time-consuming, it is not practical to use them with routine samples. Good examples of exception tests are analytical ferrography and scanning electron microscopy/energy dispersive X-ray spectroscopy (SEM/EDS).


Figure 1. Fault tree used to troubleshoot the root cause of a large process pump failure

3 Ways to Produce More Alerts

As discussed many times in Machinery Lubrication, the primary goal in making lubrication and maintenance decisions is to achieve the optimum reference state (ORS). You are trying to optimize reliability, not maximize it. You seek the reliability you need at the lowest possible cost and distraction.

This holds true for producing more alerts. You don’t want “alert overkill” but rather an optimized, effective level of cautionary and critical alarms. This can best be done with three methods working in unison:

1. More Frequent Testing and Inspection

It’s a false promise to expect even the best oil analysis or condition monitoring program to catch incipient faults and root causes if testing or data collection is conducted infrequently. Many machines can fail from start to finish in just a matter of hours. Other failure modes may take weeks, months or years. For instance, if the failure development period is two months, a test or inspection interval of two weeks provides real opportunity for early detection. Conversely, a test or inspection interval of two months may not offer any advance warning at all. Inspection 2.0 is a great strategy for achieving frequent and effective inspection alerts.
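The arithmetic behind the two-week versus two-month comparison can be sketched as follows. This is a simplified model under stated assumptions: the fault is detectable from the very start of the failure development period, samples are taken on a fixed interval, results come back instantly, and "two months" is taken as 60 days. The function names are hypothetical.

```python
import math


def guaranteed_samples(development_days: float, interval_days: float) -> int:
    """Minimum number of samples that fall inside the failure development period."""
    if interval_days <= 0:
        raise ValueError("interval_days must be positive")
    return math.floor(development_days / interval_days)


def worst_case_warning_days(development_days: float, interval_days: float) -> float:
    """Warning time left if the fault starts just after a sample was taken."""
    if interval_days <= 0:
        raise ValueError("interval_days must be positive")
    return max(development_days - interval_days, 0.0)


for interval in (14, 60):
    print(
        interval,
        guaranteed_samples(60, interval),
        worst_case_warning_days(60, interval),
    )
```

In the worst case the fault begins right after a sample, so nearly a full interval passes before the next look. A 14-day interval still leaves about 46 days to respond, while a 60-day interval can leave none.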

2. More Comprehensive Examination

You’ve probably heard the expression, “If all you have is a hammer, everything looks like a nail.” This also relates to oil analysis and condition monitoring in general. You need bandwidth. Many companies pretend to save money by cutting oil analysis to the bone. This includes reducing the number of machines that are sampled as well as conducting fewer and less effective tests. For good detective work, you need to cast a wide net. This can be done by expanding the screening tests used on routine samples but also by unifying oil analysis with a penetrating inspection program and other condition monitoring technologies (e.g., vibration).

3. Pin-drop Sensitive Alarms and Limits

I’ve described the virtues of recasting your alarms and limits for greater early detection sensitivity. In some cases, you can cure this failure blindness by taking control of how alarms and limits are set away from commercial laboratories. This is done through education of CBM personnel and analysis of each machine based on failure modes, criticality and reliability history.

In sum, oil analysis needs to produce more quality alerts. Perhaps 30 percent of all oil analysis reports should have some reportable condition, and certainly no fewer than 10 percent. Of course, this statistic depends greatly on the types of machines and their field of application. Don’t be afraid of alerts. Instead, view them as a real opportunity for continuous improvement. As in most organizations, your resources for improvement (people and budget) are probably lean. In the reliability space, alerts can help you use these resources in the most efficient and effective way possible.
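One simple way to check a program against the 10 percent floor is to compute the share of reports that carried a reportable condition. The sketch below assumes each report has been reduced to a true/false flag. The function names and the batch of 40 reports are hypothetical, and the 10 percent default is taken from the rough guide above rather than from any standard.

```python
def alert_rate(report_flags) -> float:
    """Fraction of reports flagged with a reportable condition."""
    flags = list(report_flags)
    if not flags:
        raise ValueError("no reports supplied")
    return sum(1 for flagged in flags if flagged) / len(flags)


def alarms_look_too_loose(report_flags, floor: float = 0.10) -> bool:
    """True if fewer than `floor` of the reports raised any alert."""
    return alert_rate(report_flags) < floor


# Hypothetical batch: 2 flagged reports out of 40.
batch = [True] * 2 + [False] * 38
print(alert_rate(batch))
print(alarms_look_too_loose(batch))
```

A low rate does not prove the machines are healthy. It may only mean the alarms are set too loosely to see anything, so it is a prompt to review limits, test slates and sampling intervals.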