Root Cause Analysis - A Practical Guide

John Martinez, Tate and Lyle
Tags: maintenance and reliability

Root cause analysis is an important component of any maintenance department. Its goal is to eliminate the source of equipment failures, not simply the symptoms, in order to prevent those issues from recurring.

Performed correctly, it can reduce problem areas in the plant and allow for more consistent, stable production. Follow these steps for successful root cause analysis.

Setup and Documentation

When beginning a root cause analysis, you will need to be able to capture all failures that require investigation. There should be a system for viewing all of the failures that have occurred over a set period of time.

For larger plants, this information should be reviewed daily. In smaller plants, weekly or monthly may suffice, depending on the number of failures and the frequency of their recurrence.

After evaluating the failures over time, your next step is to determine when a root cause analysis is necessary. A quick way to do this is to establish a trigger or a desired service life for your equipment.

This can be measured in months or years and will be different for each equipment type. An example would be using a baseline of three years for motors and one year for pumps.

With these guidelines, any motor that fails in less than three years and any pump that fails in less than one year would call for an analysis. The lone exception is critical equipment: if an extremely critical piece of equipment fails, a report may be required regardless of how long it was in service.
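The trigger rule can be sketched in a few lines of Python. This is only an illustration: the three-year and one-year baselines come from the example above, while the function names, the `critical` flag and the choice to always analyze critical equipment are assumptions that each plant would set for itself.

```python
from datetime import date

# Desired service life in months, using the example baselines from the text.
# The dictionary and function names are illustrative assumptions.
SERVICE_LIFE_MONTHS = {"motor": 36, "pump": 12}


def months_between(start: date, end: date) -> int:
    """Whole months elapsed between two dates."""
    months = (end.year - start.year) * 12 + (end.month - start.month)
    if end.day < start.day:
        months -= 1  # the final month is not yet complete
    return months


def needs_analysis(equipment_type: str, installed: date, failed: date,
                   critical: bool = False) -> bool:
    """Return True when a failure should trigger a root cause analysis."""
    if critical:
        # Assumption: critical equipment is always analyzed,
        # no matter how long it ran before failing.
        return True
    if equipment_type not in SERVICE_LIFE_MONTHS:
        raise ValueError(f"No service life baseline set for {equipment_type!r}")
    return months_between(installed, failed) < SERVICE_LIFE_MONTHS[equipment_type]


# A motor that fails after 28 months is short of its 36-month baseline.
print(needs_analysis("motor", date(2011, 1, 15), date(2013, 6, 1)))
```

Keeping the baselines in one table makes it easy to add equipment types or tighten a target as reliability improves.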

Create a Root Cause Analysis Database for Tracking Failures

Once you have a list of failures, begin tracking the number of failures and the status of the analysis. To do this, you will need to create a database where all root cause reports can be viewed in one place.

At the very least, the database should include the equipment number or name, the date of the failure, the date of the last failure, the area where the equipment is located, the notification or work order number, a brief explanation of the failure, possible solutions and the name of the person responsible for the solution.

An example of this type of database is shown below. Please note that not all of the information will be readily available at first, and some of it may not be entered into the database until much later in the investigation. However, as much information as possible should be included to help establish which facts are already known.
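For plants that keep this database in software rather than a spreadsheet, one possible record layout is sketched below in Python. The field list follows the minimum set named above; the class name, field names and the sample values (such as "P-101") are hypothetical. Fields that are often unknown early in an investigation are optional so that a record can be created right away and filled in later.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class FailureRecord:
    """One row in the root cause tracking database (illustrative layout)."""
    equipment: str                      # equipment number or name
    failure_date: date
    area: str                           # where the equipment is located
    work_order: str                     # notification or work order number
    last_failure_date: Optional[date] = None
    description: str = ""               # brief explanation of the failure
    possible_solutions: List[str] = field(default_factory=list)
    responsible: Optional[str] = None   # person assigned to the solution

    def missing_fields(self) -> List[str]:
        """Names of the fields that still need to be filled in."""
        missing = []
        if self.last_failure_date is None:
            missing.append("last_failure_date")
        if not self.description:
            missing.append("description")
        if not self.possible_solutions:
            missing.append("possible_solutions")
        if self.responsible is None:
            missing.append("responsible")
        return missing


# Hypothetical record created on the day of the failure.
record = FailureRecord("P-101", date(2013, 5, 2), "Wet mill", "WO-0001")
print(record.missing_fields())
```

A helper like `missing_fields` gives a quick view of which reports are still incomplete before the review meeting.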

Gather Equipment Information and History

The next part of the process may be the most critical. Gather as much information as possible, including what happened during the failure and the equipment’s failure history. Find out what has been tried previously to correct the problem. If these solutions did not work, you will save time by not trying them again.

Utilize all of your resources. Talk to electricians, mechanics, shift personnel, operators, clean-up crews and anyone with knowledge of the equipment. These individuals may offer important clues as to why the problem occurred and possibly even solutions or suggestions for improvements. This is imperative if the failure happened on the weekend or on a specific shift.

Speak to those who worked during the shift for details on how and why the failure happened when it did. Start the process as soon as possible. The sooner you start, the more accurate the information will be and the easier it will be to recover or remember.

Next, check your plant’s system that tracks equipment failures. You should be able to see the frequency of the failures and can then ask questions such as: Does this failure occur at a regular interval or at a specific time of day or year, e.g., every three months, only at night or only in the winter?
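If the failure history can be exported as a list of timestamps, a short script can help answer these questions. The Python sketch below is one way to do it; the function names, the 6 p.m. to 6 a.m. definition of a night shift and the sample timestamps are all assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime
from typing import List


def intervals_in_days(failures: List[datetime]) -> List[int]:
    """Calendar days between consecutive failures, oldest first."""
    ordered = sorted(failures)
    return [(b.date() - a.date()).days for a, b in zip(ordered, ordered[1:])]


def count_by_month(failures: List[datetime]) -> Counter:
    """Number of failures in each calendar month (1 = January)."""
    return Counter(f.month for f in failures)


def count_by_shift(failures: List[datetime],
                   night_start: int = 18, night_end: int = 6) -> Counter:
    """Number of failures on day and night shifts (assumed shift hours)."""
    def shift(moment: datetime) -> str:
        is_night = moment.hour >= night_start or moment.hour < night_end
        return "night" if is_night else "day"
    return Counter(shift(f) for f in failures)


# Hypothetical failure history for one piece of equipment.
history = [
    datetime(2012, 1, 10, 23, 0),
    datetime(2012, 4, 9, 2, 30),
    datetime(2012, 7, 8, 22, 15),
]
print(intervals_in_days(history))
print(count_by_month(history))
print(count_by_shift(history))
```

Evenly spaced intervals point toward a wear-out or maintenance-cycle cause, while a cluster on one shift or in one season suggests an operating or environmental cause worth asking about.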

If the analysis is performed early enough, you may be able to observe the equipment while it is still running and on the verge of failing, e.g., a leaking pump that has not yet been changed. You can then evaluate the equipment’s running conditions, some of which may be a source of damage.

Photographs provide the best way to show the magnitude of a failure.

In addition, always inspect the equipment when it is disassembled to determine which components failed and to look for signs of damage not visible from the outside. These might include indications of overheating, lack of lubrication, misalignment and vibration.

Be sure to take pictures and document everything. Use a notebook or tablet and a digital camera. You will not remember everything that happened or exactly how it looked, especially if you are writing the report days or weeks later.

A photograph is also one of the best ways to show how bad the failure was to those who did not see it.

Writing Root Cause Analysis Reports

When writing the report, remember that you want readers to be able to understand and follow everything being presented. Avoid technical words or overly complicated terminology. Keep it simple and stick to the facts. The report may be read by a large number of people who do not have the same experience or specialized knowledge that you do.

Do not include anyone’s name. Instead, use only job titles unless you need to assign a name to the solution. The report should not become a blame-game or finger-pointing exercise. You also do not want to alienate any individuals because they may not offer you information the next time you are investigating a failure.

Include the photographs taken while gathering information. If someone does not believe a condition or problem exists, there is no denying it when you have a picture of it. Be sure to write captions to help describe what is shown in the photographs in case an object or situation is not easily recognizable.

Every report should at least include the equipment information, the date of the current and last failure, an explanation of the failure and the findings with an idea of the root cause, an explanation of the past history, the proposed solution, an assignment of a person(s) to the solution, and appropriate data to help explain the failure (pictures, graphs, trends, etc.).

Reviewing Reports

Schedule a meeting to review all the reports and to come to an agreement as to what the solution should be. Send the reports in advance of the meeting to give everyone a chance to look over and discuss the issues and possible solutions beforehand. This is better than first presenting the reports during the meeting and not allowing individuals to conduct their own research or investigation.

Attendance at the meeting should be mandatory for area managers and engineers, maintenance managers, maintenance coordinators (both electrical and mechanical), area process supervisors and key process technicians, mechanics and electricians.

Attendance may be optional for the operations manager, plant manager and planner. You can establish the meeting’s importance by having the plant manager and operations manager ask absent managers and engineers why they were not in attendance.

Review each report, even if the failure was small, to make everyone aware of what happened and what is being done to prevent future failures. Come to an agreement on what the next steps should be so it is the entire group’s decision rather than just one person’s idea. Now you have a team of 10 to 20 individuals who are invested in the results.

The frequency of your meetings should depend on how severe the failures are and how many have occurred. The more failures, the more often you should have meetings to discuss the problems.

Implement and Track Changes

Changes should be monitored by a lead person (usually a maintenance or reliability engineer). This person will create a method to track changes and observe what did or did not work.

The team will also need to assign an individual and a date for completing the proposed solution. The lead person will then contact this individual to determine if the change has been made and if it was successful.

An example of a database for tracking equipment failures

A meeting should be scheduled for the lead person to review the solutions with the team. This will allow the group to understand what has been done to solve the problem and if the suggested action worked or if additional time or resources are needed.

Similar to the database created previously, a simple document can be used to chronicle which ideas have been implemented and which remain to be carried out.
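As a rough sketch of such a tracking document, the Python below keeps one entry per agreed solution and lists the entries the lead person should follow up on. The class, the field names and the follow-up rule (past due, or completed but not yet confirmed as successful) are assumptions, not a prescribed method.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ActionItem:
    """One agreed solution from the review meeting (illustrative layout)."""
    work_order: str
    solution: str
    owner: str                          # individual assigned by the team
    due: date                           # agreed completion date
    completed_on: Optional[date] = None
    successful: Optional[bool] = None   # None until the result is known


def needs_follow_up(items: List[ActionItem], today: date) -> List[ActionItem]:
    """Items that are past due, or completed but not yet verified."""
    flagged = []
    for item in items:
        overdue = item.completed_on is None and item.due < today
        unverified = item.completed_on is not None and item.successful is None
        if overdue or unverified:
            flagged.append(item)
    return flagged


# Hypothetical entries for illustration only.
items = [
    ActionItem("WO-0001", "Install redesigned seal", "Mechanic A", date(2013, 3, 1)),
    ActionItem("WO-0002", "Realign motor", "Mechanic B", date(2013, 3, 15),
               completed_on=date(2013, 3, 10)),
    ActionItem("WO-0003", "Change lubricant", "Mechanic C", date(2013, 6, 1)),
]
for item in needs_follow_up(items, today=date(2013, 4, 1)):
    print(item.work_order, item.owner)
```

The same list can serve as the agenda for the follow-up meeting, since it shows which changes are late and which still need their results confirmed.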

Root Cause Analysis Examples

As you continue your root cause analysis program, be sure to use success stories and examples to credit yourself, your team and the overall plant. Not everyone is aware of the changes you have made or the problems that the team has solved.

Making this known to team members will show them how their efforts are making a difference and having a positive impact. Try to spread the word plant-wide in a newsletter or as a topic in a plant meeting.

The more people who are cognizant of the effect that the root cause team has had on their job, the more willing they will be to provide you with information and suggestions to help in your investigations.

Following are a few examples that show how root cause analysis can not only improve a plant’s bottom line but also make workers’ jobs easier.

Wet Mill Sump Pump

A wet mill sump pump failed on average every two to three months over a three-year span. The solution was implemented in July 2011, and no failures have occurred since.

While this was not a huge cost savings, it was a nuisance to both mechanics and technicians. Previously, mechanics had to replace the pump four to six times a year, and technicians had to walk in 4 to 6 inches of wet slop each time the pump failed.

Wet Mill Gearbox

A wet mill gearbox was failing every three to four months over a two-year span. The failures would upset the process system and reduce production by 40 percent whenever the conveyor was down. In February 2011, the maintenance team designed a new seal. The gearbox has not failed since.

Finished Product Pump

A finished product pump failed every three months over a five-year span. After many attempts to fix it, the problem was finally corrected in May 2012. The pump was then replaced in November 2013.

The correction prevented five failures before the pump was replaced, for a savings of more than $400,000 when factoring in the reduced production during the 12 hours the system would have been down after each failure.

Finished Product Recompressor

A finished product recompressor was rebuilt and modified over a five-day shutdown. When the compressor was started, onsite oil analysis found water in the oil.

Root cause analysis showed that the heat exchanger had developed a leak while the compressor was down. The oil analysis saved more than 18 hours of downtime for a savings of $140,000. This figure does not include the possible damage to the new equipment.

Pump and Motor MTBF

Through root cause analysis, better lubrication practices, vibration analysis and oil analysis, an improvement in mean time between failures (MTBF) for both pumps and motors was achieved.

Since MTBF tracking began in 2009, pump life increased from 50 months to 63.2 months as of December 2013. This was an increase of about 26 percent in pump life over five years.

The motors had an even greater increase of 78 percent from 148 months in January 2009 to 263.5 months in December 2013.
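The percentage gains can be recomputed from the reported starting and ending MTBF values. The short Python sketch below does only that arithmetic; the function name is an illustrative choice.

```python
def percent_increase(before: float, after: float) -> float:
    """Relative change from before to after, as a percentage."""
    return (after - before) / before * 100.0


# MTBF values in months, as reported for 2009 and December 2013.
pump_gain = percent_increase(50.0, 63.2)
motor_gain = percent_increase(148.0, 263.5)

print(f"Pump MTBF gain:  {pump_gain:.1f}%")   # 26.4%
print(f"Motor MTBF gain: {motor_gain:.1f}%")  # 78.0%
```

Recomputing the figures this way each time the MTBF data is updated keeps the reported improvement consistent with the underlying numbers.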

These examples show how root cause analysis can lead to improvements in mean time between failures for a plant’s pumps and motors.

Final Thoughts and Suggestions

A root cause analysis program can offer benefits for almost any plant. It eliminates repeating problems and allows you to focus on other issues. Always try to gather as much information as possible. If you don’t solve the problem the first time, the additional information may be useful for a future solution.

Along those same lines, you should include as many people as possible in the root cause meeting. This will enable you to form a team of people who care about the problem and are involved in the decision-making or changes.

Although you may come up with a solution, it doesn’t necessarily mean that it will get accomplished. You must follow up on the suggestions and assign individuals to each of them for accountability. If someone is not tracking the changes, they will never be completed.

Concentrate on the easy wins and the critical big problems. The easy wins will get some buy-in from other areas of the plant, which will cause others to want to be more involved in the solutions. The critical problems are those that will have the most impact on the plant, either the most repeated failures or the most savings if eliminated.

In terms of costs, track the savings from the changes that have been made during the analysis. You may not track every single change but be sure to include the more valuable ones. This will give you the opportunity to justify the program’s value.
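A running total of the larger savings is enough to start with. The Python sketch below uses the two dollar figures given in the examples earlier in this guide, treated as lower bounds because the text describes them as "more than" those amounts. The variable and function names are illustrative.

```python
from typing import Dict, List, Tuple

# Lower-bound savings taken from the examples earlier in this guide.
reported_savings: Dict[str, int] = {
    "Finished product pump": 400_000,
    "Finished product recompressor": 140_000,
}


def total_savings(entries: Dict[str, int]) -> int:
    """Sum of all tracked savings, in dollars."""
    return sum(entries.values())


def ranked(entries: Dict[str, int]) -> List[Tuple[str, int]]:
    """Entries sorted from largest to smallest saving."""
    return sorted(entries.items(), key=lambda pair: pair[1], reverse=True)


total = total_savings(reported_savings)
print(f"Documented savings: at least ${total:,}")  # at least $540,000

# The pump correction prevented five failures, so each avoided
# failure was worth at least this much on average.
per_failure = reported_savings["Finished product pump"] / 5
print(f"Per avoided pump failure: at least ${per_failure:,.0f}")  # $80,000
```

Ranking the entries shows at a glance which fixes carry the program's justification.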

Form a team to help gather information and come up with solutions. If you try to do it all on your own, you will not succeed. There are simply too many items to track and changes to make.

While you might like to have the entire plant as part of your team, start with trusted technicians and maintenance personnel who have shown some interest. Provide feedback on how they have helped, when the changes will be made and whether the changes were successful. The more of an impact they feel they have, the more they will want to help.

As you create your team, don’t focus on whose fault it was but rather on what the problem is and how to reach a solution. In the end, the goal is to eliminate the source of the issues instead of the person who did the wrong thing.

Finally, relay to everyone involved that root cause analysis is performed to help everyone at the plant. The more people believe this and see the difference that it can make, the more they will want to be involved.

More importantly, when they realize how it can help them, they not only will give you more information, but they also will come to you with their problems because they know that you produce results.