
In their search for the latest and greatest management and technical innovations, which tend to arrive with glorious trappings and fanfare, organizations often overlook opportunities to correct deficiencies so ingrained that they are accepted as normal. Even simple changes can profoundly affect an organization’s effectiveness and bottom line. Clearly, the organization can’t ignore highly innovative technology and business practices that might profoundly affect its performance. However, management must place priority on eliminating controllable deficiencies that the organization accepts as normal, and it must support this effort with resources.
The problem-solving method by which the roots of these deficiencies are uncovered, described and eliminated is generally referred to as Root Cause Analysis (RCA), or Root Cause Failure Analysis (RCFA). This powerful technique can, and should, be employed by machinery lubrication professionals to seek out and eliminate the underlying reasons for technical problems and/or human errors.
Regrettably, most organizations that depend upon heavy equipment to achieve their mission are riddled with lubrication problems. In some cases, managers incorrectly attribute lubrication-induced machine failures to another cause or vice versa. Human nature is to fit a current failure event into a familiar framework that the individual has seen before, often with only surface knowledge about the specific event. In other cases, the organization has simply accepted poor machine or lubricant life as normal. In either case, removing the deficiencies can translate into significant bottom-line improvements - often with little capital or manpower investment.
Figure 1. Root Cause Analysis Process
Sometimes lubricant failure is detected in time to avoid damage to the host system. In other cases, the host system fails because the lubricant or lubrication system fails to perform effectively. In still other cases, the lubricant and/or lubrication system offers valuable clues about a failure that was caused by some other forcing function. In all cases, the lubrication professional should be well-grounded in the RCA process and its supplementary tools and methods in order to recognize the fundamental reason for the problem and implement a root cause solution.
Root Cause Analysis Process
Several methodologies have been advanced for performing RCA, or RCFA. While the fine details of the various approaches differ, most of them share the most important elements. One excellent (and free) resource is DOE-NE-STD-1004-92, the DOE Guideline for performing RCA. It was developed in 1992 by the U.S. Department of Energy, Office of Nuclear Energy and can be downloaded in PDF format from the Internet. R. Keith Mobley’s book “Root Cause Failure Analysis” is also an excellent resource. Mobley wrote the book for plant engineers, and he utilizes terms and language familiar to lubrication professionals and reliability engineers.
RCA is detective work at its finest. It requires the collection and analysis of evidence for the purpose of solving the crime. It employs deductive reasoning much like detective Sherlock Holmes might have used. To be effective, RCA must be carried out in a systematic manner. Following is a general description of the five-phase process described in the DOE Guideline for performing RCA (Figure 1).
This process is intended to enable the lubrication professional to more ably conduct lubricant failure investigations, or to support machine failure investigations where the lubricant and/or lubrication system is a suspect, or where it might offer valuable clues.
Phase I - Data Collection
Perhaps the most important step in the RCA process is to preserve the crime scene and collect clues about the failure. Unless it compromises safety or the quality of information, begin to collect data immediately after the failure occurs - or while the failure is in progress if possible. Make every effort to preserve all data and physical evidence of the failure. The lubricant and lubrication system offer a tremendous amount of information about the failure. This information is often overlooked or unknowingly discarded. Be sure to collect this vital data and physical evidence, which includes, but is not limited to, the following items:
Lubricants - Often, mechanical failure is in one way or another tied to the lubricant. Collect the lubricant for further evaluation: the lubricant is an unquestionable repository of valuable clues. It can reveal the presence of contaminants or degradation byproducts that may be analyzed. The lubricant may also contain wear particles that can be analyzed to reveal clues about the physical activity that occurred on the component’s surface leading up to the failure event. Remember, a wear particle is the mirror image of a defect left on a component’s surface when wear occurs. Analysis of the particle’s size, shape, color, surface detail and other qualitative information reveals information about the wear mechanism leading up to the failure, and often the root cause.
Filters and Separators - Like the oil, the filter collects debris leading up to and including the catastrophic event. However, unlike the lubricant, which is a snapshot of the system at the time the sample is taken, the filter can serve as a history book of the events since the filter was installed, up until the failure event. Similarly, the effluent from centrifugal separators might also contain clues about the failure. Often, excessive particle contamination causes the wear leading to failure, or causes valves to jam or stick. Inspect the filter, along with its manifold or housing to identify common filter failures, including: media failure or rips, seam failure, end-cap failure, bypass valve failure, etc. Also, inspect vacuum dehydration units, centrifugal separators, coalescing separators, electrostatic separators, ion exchange resins, fuller’s earth, etc., to ensure they are functioning properly.
Gums, Resins, Varnish, Sludge and Other Deposits - If the lubricant was badly degraded or it reacted chemically with coatings, seals, environmental contaminants, process contaminants, etc., the clues about the reaction will often reside within the gum, resin, varnish, sludge or other deposits that are the byproduct of the reaction. Grease lines and bearing housings occasionally contain grease thickener that has separated, a condition that should be noted. Also sample the caked material for analysis. Occasionally, simple tests such as elemental analysis can be performed on this residue to identify its likely sources, such as degraded additives, grease matrix materials or ingressed contaminants.
Oil Analysis - Include all previous oil analysis when collecting condition-monitoring information. Also, be sure to call the oil analysis laboratory to find out whether it held over the last sample. It is common for labs to hold samples for some period of time after performing the initial analysis in case a problem arises and a sample needs to be retested. You may want to run some nonroutine tests on this held-over oil to uncover any evidence of the incipient failure contained in the sample. It is important to call the lab as soon as possible; samples are held over only for a short time.
Tank or Sump Condition - Inspect the sump or tank for rust, varnish, water puddles, missing, damaged or saturated breather, open hatch covers, damaged seals and gaskets, leaks, etc. Be sure to collect information about the lubricant level, including level sensors and low level alarms.
Lubrication System - Evaluate the lubrication system, including the following components:
System Process Parameters and Observations - Assemble previously collected process parameter information including lubricant temperature, flow rate, percent saturation with water, online particle count and any other measured information. For engines, the presence of black or white smoke and other operational abnormalities should be noted. In hydraulic systems, cycle time and cylinder creep should also be noted.
Whenever a component is removed or replaced, it should be carefully scrutinized. It is particularly important to inspect components before and after cleaning because residues found on uncleaned bearings, gears, etc., may offer vital clues to the real root cause of the failure.
Wherever possible, take photographs, and when appropriate, video to supplement the collection of physical evidence. Today’s digital cameras allow the inspector to shoot hundreds of high-resolution photographs without uploading to a computer. Likewise, if parts, wear particles or other physical evidence are analyzed microscopically, capture images where possible. As the old adage states: A picture is worth a thousand words. This is important during a failure investigation when one is attempting to collect physical failure evidence before it becomes spoiled.
In addition to collecting data and physical evidence, it is necessary to interview operators, mechanics, engineers, supervisors and managers to collect the softer information. According to the DOE standard, “interviews must be fact-finding and not fault-finding.” Interview those people most familiar with the machine and the situation; often this means operators and lube technicians who work with the equipment daily. Include a walk-through if possible. Just as it did for Peter Falk as Lt. Columbo in the 1970s TV detective show of the same name, being near the scene of the crime often serves to jog the memory and bring useful information to the conscious mind. It is important to conduct interviews as quickly as possible. Over time, if the incident doesn’t match an individual’s idea of what should have happened, the mind tends to distort one’s perception of the facts to better fit his or her paradigm.
Prepare a list of questions before the interview to guide data collection, but leave room for free-form thoughts and discussion. Generally, the questionnaire should uncover information about the items described in Table 1.
The items listed are by no means an exhaustive list of all possible lubrication and lubrication system data that might be collected during or after a failure, but they are generally applicable to the lubricated machines found in most plants. Mobley suggests that the lack of a formal process to facilitate the reporting and data collection of a failure event severely limits the effectiveness of RCA. It is sensible to create a reporting form specific to the lubricant and lubrication system data that should or could be collected. Again, refer to the items listed in Table 1 for guidance.
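A reporting form of this kind could be captured as a simple record. The following is a minimal sketch only: the class name, field names and asset tag are hypothetical, chosen to mirror the evidence items discussed in Phase I, and are not taken from Table 1 or the DOE guideline.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class LubricationFailureReport:
    """Hypothetical failure-event form; all field names are illustrative."""
    machine_id: str
    event_time: datetime
    description: str
    lubricant_sample_taken: bool = False
    filter_retained: bool = False
    deposits_sampled: bool = False
    lab_contacted_for_held_sample: bool = False
    sump_observations: str = ""
    photos: List[str] = field(default_factory=list)
    interviews: List[str] = field(default_factory=list)

    def missing_evidence(self) -> List[str]:
        """List the Phase I evidence items that have not yet been collected."""
        checks = {
            "lubricant sample": self.lubricant_sample_taken,
            "filter": self.filter_retained,
            "deposits": self.deposits_sampled,
            "held-over lab sample": self.lab_contacted_for_held_sample,
            "photographs": bool(self.photos),
            "interviews": bool(self.interviews),
        }
        return [name for name, done in checks.items() if not done]


report = LubricationFailureReport(
    machine_id="PUMP-101",  # made-up asset tag
    event_time=datetime(2024, 1, 15, 8, 30),
    description="Pump bearing failure",
    lubricant_sample_taken=True,
    filter_retained=True,
    photos=["bearing_before_cleaning.jpg"],
)
print(report.missing_evidence())
```

A form like this makes gaps visible while evidence can still be gathered, for example a reminder to call the lab before the held-over sample is discarded.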
Phase II - Assessment
The assessment phase involves the analysis of all collected data in order to identify causal factors. The major failure cause categories are detailed in Table 1 along with some lubricant or lubrication system related examples. According to DOE-NE-STD-1004-92, the objective of the assessment phase is to identify the problem and its significance, then to work progressively through the possible causes until the fundamental root cause(s) is (are) defined at the highest level of resolution. This means that the event cannot reasonably be reduced any further (Figure 2).

Figure 2. Failure Assessment Process Flow Model
Consider the simplified example of a pump failure in Figure 3. After determining that the event is significant, one begins by identifying possible causes at the first level in the sequence and eliminating nonapplicable causes, leaving only the cause or causes applicable to the situation. In our example, the pump failure might have been attributed to the pump, the motor or the coupling; in this case it turned out to be the pump.
The process then sequences to the next causal level, which in our case is to determine if it was a bearing or impeller failure; in this case it’s the bearing(s).
The next sequential level is to determine if the failure was caused by misalignment, contamination, degraded lubricant, etc. In our example, based on oil analysis data, the bearing failed due to dirt contamination-induced wear. Dirt can enter or remain in the lubricant because the new oil is dirty, the breather is missing or ineffective, the seals are failing to exclude it or the filter is not effectively removing it. In our case, it is a filter failure; this is where our system evaluation and interview process comes into play. A filter can fail to perform because it is damaged or because it is full.
In our example, based on evidence provided by the lube technicians, a full filter was the culprit. A full filter can be attributed to a lack of or ineffective inspection, an incomplete work order to change the filter, a failed pressure differential gauge or the lack of a pressure differential gauge. In our example, again based on a system inspection, a failed differential gauge was the root cause of the pump failure. In this case, the corrective action should be to replace the gauge and to implement routine inspection to test the gauges in the future.
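The level-by-level elimination in the pump example can be sketched as a walk down a cause tree. This is an illustrative sketch, not the DOE procedure itself: the dictionary layout and the `trace_root_cause` helper are hypothetical, and the sketch assumes the evidence supports exactly one cause at each level, whereas a real investigation may have to carry several causes forward.

```python
# Candidate causes at each level, taken from the pump example in the text.
CAUSE_TREE = {
    "pump system failure": ["pump", "motor", "coupling"],
    "pump": ["bearing", "impeller"],
    "bearing": ["misalignment", "dirt contamination", "degraded lubricant"],
    "dirt contamination": [
        "dirty new oil",
        "breather missing or ineffective",
        "seals failing",
        "filter not removing dirt",
    ],
    "filter not removing dirt": ["filter damaged", "filter full"],
    "filter full": [
        "ineffective inspection",
        "incomplete work order",
        "failed differential pressure gauge",
        "no differential pressure gauge",
    ],
}

# Causes that the collected evidence (oil analysis, interviews,
# system inspection) supports in the example.
CONFIRMED = {
    "pump",
    "bearing",
    "dirt contamination",
    "filter not removing dirt",
    "filter full",
    "failed differential pressure gauge",
}


def trace_root_cause(top_event, tree, confirmed):
    """Follow the single supported cause at each level until none remain."""
    path = [top_event]
    current = top_event
    while current in tree:
        supported = [cause for cause in tree[current] if cause in confirmed]
        if len(supported) != 1:
            break  # evidence is missing or ambiguous; stop and investigate
        current = supported[0]
        path.append(current)
    return path


path = trace_root_cause("pump system failure", CAUSE_TREE, CONFIRMED)
print(" -> ".join(path))
```

The last entry in the path is the point where the event cannot reasonably be reduced any further, which in this example is the failed differential pressure gauge.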
Numerous methodologies are available for completing this cause-effect analysis once a failure has been identified and deemed significant and worthy of further investigation. The methods vary in their sophistication, but they all focus on establishing clear cause-effect relationships. Below is a general description of some of the more common techniques:
Fault Tree Analysis - Arguably the most popular and sophisticated technique for analyzing failures, fault tree analysis (FTA) is a deductive reasoning technique that may be employed before the failure as a design tool (usually in conjunction with or in lieu of failure modes effects analysis or FMEA), or after the fact as a failure analysis tool. According to the international standard IEC 1025, “FTA is concerned with the identification and analysis of conditions and factors which cause or contribute to the occurrence of a defined undesirable event, usually one which significantly affects system performance, economy, safety or other required characteristics.” FTA results in a tree that starts with a top event, and progresses logically downward until the limit of resolution is reached, which reveals the root cause or causes. This is the approach used in the example in Figure 3.
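Fault trees are conventionally drawn with AND and OR gates joining lower-level events, a detail not spelled out in the description above. The following minimal sketch shows how such a tree can be evaluated from the bottom up. The `Gate` class, the `occurs` function and the tree itself are hypothetical illustrations loosely based on the pump example, not a reproduction of Figure 3 or of the IEC standard.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class Gate:
    """A fault tree gate: kind is "AND" or "OR"; inputs are gates or event names."""
    kind: str
    inputs: List[Union["Gate", str]]


def occurs(node: Union[Gate, str], basic_events: Dict[str, bool]) -> bool:
    """Return True if the event represented by this node occurs."""
    if isinstance(node, str):
        # Basic event: look up whether the evidence shows it happened.
        return basic_events.get(node, False)
    results = [occurs(child, basic_events) for child in node.inputs]
    if node.kind == "AND":
        return all(results)
    if node.kind == "OR":
        return any(results)
    raise ValueError(f"unknown gate kind: {node.kind}")


# Illustrative assumption: a full filter only goes unnoticed if the
# differential gauge has also failed, so the two are joined by an AND gate.
undetected_full_filter = Gate("AND", ["filter full", "differential gauge failed"])
dirt_in_oil = Gate(
    "OR",
    ["new oil dirty", "breather missing", "seals failing", undetected_full_filter],
)

print(occurs(dirt_in_oil, {"filter full": True, "differential gauge failed": True}))
print(occurs(dirt_in_oil, {"filter full": True}))
```

In this sketch the top event occurs in the first call and not in the second, because a full filter with a working gauge would be caught before dirt reached the bearing.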
Cause-and-Effect Analysis - This technique is widely called fishbone analysis due to the fish-shaped pattern that it produces. The typical cause-and-effect analysis identifies the human and mechanical factors, methods and materials that might have produced the undesirable effect. The cause-and-effect analysis technique has been criticized for lacking a clear description of the sequence of events.
Other techniques described in the literature include: sequence of events analysis, events and causal factor analysis, change analysis, barrier analysis, management oversight and risk tree analysis, human performance evaluation and the Kepner-Tregoe problem-solving and decision-making method.
In addition to implementing processes and procedures for in-house staff to assess failure events, it may be necessary to engage additional expertise in the process. In some cases, especially for complex failure investigations, it is advisable to seek out individuals who have a deep understanding of the failure investigation process itself, and who can guide you in avoiding mistakes. Likewise, it may be necessary to engage individuals knowledgeable of the particular failure type and/or root cause(s) - again, to help shorten the process and avoid possible errors and/or omissions of fact and/or process steps due to a lack of in-house detailed knowledge.
The lubrication professional’s failure investigations will likely fall into one of the following three distinct scenarios:
The first two categories are reactive in nature. Functional failure of the machine has occurred, and the investigation is focused upon gaining an understanding of the event to make corrective actions and avoid its recurrence. The third category, however, is proactive in nature. By detecting a lubricant or lubrication system failure before a functional failure of the lubricated system, one may take preemptive action to eliminate the defect in advance of damage to, or failure of the machine.
The term lubrication failure is widely abused in industry. It is generally applied to any failure in which the lubricant is suspected. In some cases, it is assigned as a matter of convenience simply because no other cause was readily revealed. Ineffective lubrication often lies at the root of mechanical wear and failure, but one must develop a clearer understanding of lubrication failures and investigate them individually. There is no single definition for lubrication failure, rather multiple possible failures with multiple possible causes. Evaluate each significant failure independently of previous failures, avoiding the temptation to casually apply the scenario from a previous failure to the current one.
Common lubrication-related failure modes are described here to provide the lubrication professional with an understanding of the range and breadth of common problems. Lubricant or lubrication system failures might be attributed to material problems, procedural deficiencies, personnel errors, design flaws, training deficiencies, management problems or external phenomena, depending upon the nature of the event.
Table 2 maps common lubricant failures to the various problem areas presented in Table 1.
Insufficient Lubricant Volume - This is a broad category. The condition can be proactively detected, analyzed and corrected before the machine fails, or afterward, depending upon how early the failure is detected. Described below is a partial list of possible scenarios.
Excessive Lubricant Volume - Excessive lubrication is common in machines, particularly greased bearings.
Wrong Lubricant - Again, a common problem that might be attributable to material problems, procedural deficiencies, human error, design flaws, training deficiencies, management problems or external phenomena, depending upon the nature of the event. Described below is a partial list of possible scenarios.
Contaminated Lubricant - This is perhaps the most common cause of machine wear and failure. Contaminants may temporarily affect the performance of the lubricant, or catalyze chemical reactions that materially change the lubricant’s physical, chemical and performance properties. Described below is a partial list of possible scenarios.
Lubricant Failure - Lubricants don’t last forever. They eventually wear out and must be changed or reclaimed. However, if the lubricant degrades faster than normal, either the lubricant was defective when new or a new forcing function has increased the rate of degradation.
Abnormal Wear Debris Generation - Wear debris analysis is an assessment of the machine and may be employed to help evaluate machine failures whether or not the lubricant is suspected as a root cause. The technique’s power lies in the fact that a wear particle generated by a failed or failing machine is the mirror image of the component surface that generated it. The debris can be extracted from the fluid, the filter or the effluent from centrifugal separators to define its metallurgy and analyze numerous qualitative aspects of the particle’s appearance. Common wear debris analysis tests include atomic emission spectroscopy, ferrous density analysis and optical microscopy. However, during failure investigations, it is often advisable to employ specialized X-ray fluorescence spectroscopy, scanning electron microscopy, X-ray crystallography and other detailed metallurgical tests.
Phase III - Corrective Actions
To effectively improve reliability and safety, the root cause analysis process must produce corrective actions. These corrective actions should be geared toward preventing recurrence of the problem yet be feasible to implement and stay within the organization’s mission, without introducing new risks that are deemed unacceptable. Prior to taking corrective action, the organization should consider the consequences of implementing the actions versus the consequences of not implementing them. In addition, the organization should consider capital costs, engineering costs, training costs, operational costs, risk-based costs and other costs relative to the benefits associated with eliminating recurrence of the failure multiplied by the probability that the corrective actions will in fact prove effective.
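The comparison described above, with costs set against benefits weighted by the probability that the fix works, can be written as a simple expected-value calculation. This is a rough sketch under stated assumptions: the function name, the cost categories chosen and all dollar figures are made up for illustration, and a real evaluation would also weigh the non-monetary risks the text mentions.

```python
def expected_net_benefit(costs, benefit_if_effective, probability_effective):
    """Benefit of eliminating recurrence, weighted by the chance the
    corrective action works, minus the total cost of implementing it."""
    if not 0.0 <= probability_effective <= 1.0:
        raise ValueError("probability_effective must be between 0 and 1")
    return benefit_if_effective * probability_effective - sum(costs.values())


# Hypothetical figures for replacing a failed gauge and adding inspections.
costs = {
    "capital": 500.0,
    "engineering": 1000.0,
    "training": 500.0,
    "operational": 250.0,
}

net = expected_net_benefit(
    costs,
    benefit_if_effective=20000.0,  # assumed value of avoiding a repeat failure
    probability_effective=0.75,    # assumed chance the fix prevents recurrence
)
print(net)  # 12750.0
```

A positive result suggests the action is worth pursuing on cost grounds alone; comparing the result across several candidate actions helps rank them.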
Phase IV - Inform
It is necessary to inform all parties of the correction, particularly changes that will affect them. This includes management, supervisors, engineering, operations and maintenance personnel, as well as affected suppliers, consultants and subcontractors. It is also appropriate to notify other locations within the company of the findings and recommendations from the RCA process so they may evaluate the information relative to their unique situation and implement the corrective actions where applicable. If the failure is safety- or environment-critical, other organizations within and outside the industry should be notified of the findings. Yes, even competitors should be notified - it is good corporate citizenship to do so, and the only ethical course.
Phase V - Follow-up
It is necessary to follow up to ensure that the corrective actions were properly implemented, are functioning as intended and have in fact eliminated the problem. Should the problem recur, reevaluate the original occurrence to determine why corrective actions were not effective. Identify and analyze the difference between the first and second events to see what has changed and make any adjustments required to revise the original corrective actions, or add additional corrections to address problems caused by one of the differing variables.
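Comparing the first and second events amounts to listing the conditions that differ between them. A minimal sketch of that comparison follows, assuming each event has been recorded as a set of named observations; the `changed_conditions` helper and the observation values are hypothetical.

```python
def changed_conditions(first_event, second_event):
    """Return {condition: (first value, second value)} for every condition
    that differs between the two events, including ones recorded only once."""
    keys = sorted(set(first_event) | set(second_event))
    return {
        key: (first_event.get(key), second_event.get(key))
        for key in keys
        if first_event.get(key) != second_event.get(key)
    }


# Made-up observations for an original failure and its recurrence.
first = {
    "failed component": "bearing",
    "contaminant": "dirt",
    "filter condition": "full",
    "differential gauge": "failed",
}
second = {
    "failed component": "bearing",
    "contaminant": "dirt",
    "filter condition": "normal",
    "differential gauge": "working",
    "breather": "missing",
}

for condition, (before, after) in changed_conditions(first, second).items():
    print(f"{condition}: {before} -> {after}")
```

Here the gauge fix held, but dirt is still reaching the bearing, and the newly recorded missing breather points to a different entry path that the original corrective action did not address.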
Lubricants and the lubrication system can fail in many different ways. These failures might be attributable to problems with material, procedures, personnel, design, training deficiencies, management and/or external phenomena. Likewise, when a machine fails for other reasons, the lubricant or lubrication system offers clues about the failure and the events that led up to it. In many cases, the lubricant offers the only reasonable way to assess what happened to the component’s surface prior to failure, because the component itself is usually mangled and compromised during the late stages of the failure.
Most organizations that rely upon heavy equipment to achieve their mission will have a difficult time matching the value proposition offered by excellence in precision lubrication. Lubricant and lubrication system failures are typically the chronic, recurring type. Fortunately, they can usually be uncovered, analyzed and corrected with minimal investment of resources, making them a perfect target for performance improvement initiatives.
Lubrication professionals should seek an understanding of root cause analysis processes. The methods are generally simple and intuitive. Standards, books and articles from experts on the topic are widely available, some of which can be found for free on the Internet. Learn these skills and integrate them with your knowledge of lubricants and lubrication systems to become your organization’s top-shelf failure and opportunity detective.
References