Why You Should View Root Cause Analysis as a System and Not Just a Task
Root cause analysis, also known as RCA, is an extremely beneficial process to understand what brings about actual or potential issues. However, RCAs often get a bad rap in the workplace. An organization's "RCA guy" isn't exactly known as the bearer of good news. Instead, they're almost always associated with an unfortunate event.
Most people think that the outcome of RCA is someone taking the blame. The idea is that all the problems go away along with the fall guy. Contrary to popular belief, humans usually aren't the cause of the problem. Instead, they are commonly the victims of deficient systems.
Throughout this text, our goal is to dig deeper into this issue and understand why good people make bad decisions. Along the way, we make the case for regarding RCA as a system and not just a task.
Viewing RCA as a System and Not Just a Task
RCA works by taking in inputs, applying logic to process information, then finally coming up with outputs.
The RCA system starts by identifying which criteria trigger the team to perform an RCA. The main reason RCAs have an unpopular reputation is that these triggers are usually reactions to catastrophic events. Typical examples are production losses, injuries, or even casualties. For a more proactive approach, though, nothing stops companies from performing root cause analyses after identifying significant risks, well before a disaster happens.
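The trigger criteria described above can be sketched as a simple check. This is a minimal illustration, not a prescribed method: the `Event` fields, the `rca_trigger` function, and both threshold values are hypothetical, and each organization would define its own.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Event:
    """A reported or projected event. All fields are hypothetical."""
    description: str
    production_loss_hours: float = 0.0
    injuries: int = 0
    risk_score: float = 0.0  # projected risk for an event that has not happened yet


# Hypothetical thresholds, assumed for illustration only.
MAX_LOSS_HOURS = 4.0
MAX_RISK_SCORE = 0.7


def rca_trigger(event: Event) -> Optional[str]:
    """Return 'reactive', 'proactive', or None when no RCA is triggered."""
    # Something has already gone wrong: the analysis is a reaction.
    if event.injuries > 0 or event.production_loss_hours >= MAX_LOSS_HOURS:
        return "reactive"
    # Nothing has failed yet, but the identified risk is significant.
    if event.risk_score >= MAX_RISK_SCORE:
        return "proactive"
    return None
```

Writing the criteria down this explicitly makes the proactive path visible: a significant risk can start an analysis even when no loss has occurred.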
After identifying the inputs to perform RCA, either reactively or proactively, the rest of the system can run its course. You can think of the system outputs as the recommended actions and measures that respond to the root cause. RCA follows a five-step approach to reach a resolution:
1. Preserve the evidence. In a reactive approach, you want all the information you can gather about the failure event. In a proactive approach, the same logic applies: you want to provide as much detail as possible about the projected failure mode.
2. Instruct the relevant team to perform the analysis. In this step, you determine that RCA is required, select team members for the task, and establish functional guidelines.
3. Analyze the event. This step is where the bulk of the work happens. With the whole team involved and all available data at hand, you develop and verify hypotheses. RCA tools, such as logic trees, are valuable here.
4. Report your findings. After the analysis, the team presents the results of its discussions, along with any recommendations, to the relevant stakeholders.
5. Implement the recommendations and track their effectiveness. Corrective or preventive measures cost money to implement. By tracking the success of these actions, you can compare the dollars spent against the projected returns.
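The tracking in the last step comes down to simple arithmetic. The sketch below assumes one plain definition of return on investment (net savings divided by cost). The function names and the figures in the usage note are hypothetical.

```python
def return_on_investment(cost: float, realized_savings: float) -> float:
    """Net return per dollar spent on a corrective or preventive measure."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return (realized_savings - cost) / cost


def meets_projection(cost: float, realized_savings: float, projected_roi: float) -> bool:
    """True when the measured return is at least the return that was projected."""
    return return_on_investment(cost, realized_savings) >= projected_roi
```

For example, a measure that cost 10,000 and avoided 25,000 in losses gives `return_on_investment(10_000, 25_000)` of 1.5, which would meet a projected return of 1.0.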
The Dunning-Kruger Effect: The Impact of the Illusion of Confidence
In overview, RCA looks straightforward, and in some cases it can be fairly simple. But it's always best to take a cautious approach and equip yourself with as much knowledge and experience as you can. Most processes can be put at risk by an illusion of confidence.
The Dunning-Kruger effect is a cognitive bias in which people with limited knowledge or experience in an area overestimate their ability to perform tasks in it. Picture yourself as a spectator at a high-speed racetrack. You see all these cars passing by at hundreds of miles per hour, making it look so easy. Now imagine that you get the chance to drive one of those performance vehicles for the first time. Chances are, you're going to aim for the speed that you've been watching all this time! You'll then quickly realize that you're in danger of overshooting or spinning out.
The example illustrates how people tend to be overconfident about something they haven't experienced yet. Once reality corrects that first impression, confidence drops. As you gain more experience and overcome more obstacles, you start to build it back up. This time, your assurance rests not on a cognitive bias but on a deeper understanding of your capacity.
Going back to RCA, organizations need to be aware of the pitfalls that stem from this same effect. It's easy for someone who has not gone through enough training to think that they have everything covered. As people gain experience and knowledge through the years, they'll find that the right processes are more challenging than they had initially thought.
Comparison of Common Analysis Tools
You might have heard of the commonly used tools to perform an RCA. Depending on availability or out of habit, your company might choose one or a combination of these. Below, we run through a few of them and see how each one might lead to different root causes.
The Five Whys process, as the name suggests, asks the question, "Why?" until the team uncovers the most basic reason they can find. The strength of this tool is its simplicity. Virtually anyone can join the discussions. However, for more significant events, there might not be a single root issue. Instead, you can find parallel failures that converge at some point. For such cases, the linear approach adopted by Five Whys might not capture the whole picture.
Fishbone Diagrams, also known as Ishikawa Diagrams, are a graphical representation of a problem with several categories for potential causes. These diagrams are especially useful for identifying roots that fall under predetermined categories. This approach is representative of what is known as categorical RCA. While it helps to focus on particular classifications of failures, it does not represent direct cause-and-effect relationships.
Logic trees are a visual reconstruction based on the cause-and-effect relationships of events. The logic tree implies a different way of questioning compared to the previous tools. Instead of answering why something occurs, a logic tree hints more towards how an event could come about.
This seemingly subtle change in the type of question can uncover some useful information for the analysis. For example, in the case of human error, asking from a different angle can shed light on some of the latent root causes of the event.
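A logic tree can be sketched as a small tree data structure in which each node records whether the evidence has verified it. The sketch below is an illustration under stated assumptions, not a standard RCA tool: the `Node` class, the `verified_paths` helper, and the pump example data are all hypothetical. Unlike a single chain of "why" questions, the tree keeps parallel causes side by side.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass
class Node:
    """One block in a logic tree: an event or a hypothesis about how it occurred."""
    statement: str
    verified: Optional[bool] = None  # None means not yet checked against evidence
    children: List["Node"] = field(default_factory=list)

    def add(self, statement: str, verified: Optional[bool] = None) -> "Node":
        """Attach a hypothesis for how this node could have come about."""
        child = Node(statement, verified)
        self.children.append(child)
        return child


def verified_paths(node: Node, prefix: Tuple[str, ...] = ()) -> Iterator[Tuple[str, ...]]:
    """Yield every chain from this node down through verified hypotheses only."""
    path = prefix + (node.statement,)
    confirmed = [c for c in node.children if c.verified is True]
    if not confirmed:
        yield path  # chain ends where no deeper hypothesis has been verified
        return
    for child in confirmed:
        yield from verified_paths(child, path)


# Hypothetical example data, for illustration only.
top = Node("Pump failure", verified=True)
bearing = top.add("Bearing failure", verified=True)
top.add("Shaft failure", verified=False)
bearing.add("Fatigue", verified=True)
bearing.add("Corrosion", verified=True)
bearing.add("Overload", verified=False)

for chain in verified_paths(top):
    print(" <- ".join(chain))
```

In this made-up data, two verified chains survive under the same bearing failure. A strictly linear questioning process would have followed only one of them.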
Analyzing the Event: Germination of a Failure
A failure event does not occur instantaneously, although it may seem that way. Instead, it has to get past the controls and safeguards that you already have in place. Here are the typical stages that a failure goes through as it develops into the outcome that we can perceive.
1. Cultural Norms and System Failures
Cultural norms and organizational systems are the first line of defense against undesirable outcomes. These include policies, procedures, communication and technology systems, training systems, and general management oversight. When they are deficient, they set the stage for failure.
2. Triggering Decision
Next down the line are the decision-makers. The information eventually comes to someone who needs to make a call. This step is where human reasoning comes into play on whether to proceed or not.
3. Observable Consequences
If the action goes ahead, there will be observable consequences in the physical world. These are a series of cause-and-effect relationships that end in detectable outcomes.
4. Undesirable Outcome
Without some form of intervention to break the chain, an undesirable outcome becomes inevitable. The definition of "undesirable" will vary for each company depending on the set threshold for an event. There will be a set of criteria that would trigger an RCA process. These could be in the form of production loss or maybe even a near miss.
Breaking Down the Definition of RCA
As Bob Latino reminds us, "RCA is about establishing logically complete, evidence-based, tightly coupled chains of factors, from the least acceptable consequences to the deepest, significant underlying causes." We can break down this comprehensive definition into some of its key elements:
The term "logically complete" stresses the importance of asking "how" something could occur. By taking a logically complete approach, you acknowledge the multiple possibilities that contribute to an outcome.
"Evidence-based" means that, as you would in court, you present credible evidence for your case. An analysis based on hearsay is not going to be beneficial for your cause.
A tightly coupled arrangement of events emphasizes the importance of identifying cause-and-effect relationships. As we have seen in earlier sections, a logic tree illustrates this best.
To have a better sense of what we've been talking about, let's look at an example. Specifically, we want to look at two things when starting to implement RCA:
- Characterize the event.
- Identify the failure modes or reasons for an event to occur.
Characterizing the Event
Let's say, for instance, that we had a critical pump failure. You might be quick to describe the event plainly as a critical pump failure. While this statement might be true, it doesn't provide a lot of context on why RCA is required. Information on the consequences of the event, such as downtime, would be helpful to get the right amount of attention on the incident. A better version of this statement might be "unexpected downtime for X hours due to a pump failure."
Identifying Failure Modes
Next, you want to identify the ways a failure could have occurred. For example, you might suspect a shaft, bearing, or motor failure. As you proceed with forming your logic tree, you identify failure modes further down the line.
Primer on Component Failure
The basic idea when analyzing failure events is to think about the driver and the driven. Think of components working in a succession of events―one drives the other. Each increment will lead you to the next potential location of the failure.
For instance, imagine a bearing failure, one of the possible modes from our example. There are a few known ways in which these components fail. A quick overview of metallurgy will give you valuable insight into how parts eventually give out.
In simple terms, erosion occurs as particles are shaved off materials that rub against each other.
Corrosion is another type of loss or degradation of a material. This phenomenon occurs as the chemical and electrical properties of a substance interact with other substances or the environment.
Cyclic loading, or the repetitive application and removal of forces, contributes to fatigue stress and eventually failure.
Overload failures occur instantaneously. They typically appear as a quick snap or fracture at the point of impact.
You can simplify further by categorizing failures as material either being lost or overpowered. In either case, the expertise of a metallurgist would prove to be very useful. Evidence and supporting data are essential to identify the root cause.
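The two-way simplification above can be written down as a lookup table. Note the assumption: the text describes erosion and corrosion as loss of material, while placing fatigue and overload under "overpowered" is this sketch's reading rather than something the text states. The names `Category`, `MECHANISMS`, and `categorize` are hypothetical.

```python
from enum import Enum


class Category(Enum):
    MATERIAL_LOST = "material lost"
    MATERIAL_OVERPOWERED = "material overpowered"


# Assumed mapping, for illustration; a metallurgist's findings take precedence.
MECHANISMS = {
    "erosion": Category.MATERIAL_LOST,
    "corrosion": Category.MATERIAL_LOST,
    "fatigue": Category.MATERIAL_OVERPOWERED,
    "overload": Category.MATERIAL_OVERPOWERED,
}


def categorize(mechanism: str) -> Category:
    """Look up the broad category for a failure mechanism, ignoring case."""
    key = mechanism.strip().lower()
    if key not in MECHANISMS:
        raise KeyError(f"unknown failure mechanism: {mechanism!r}")
    return MECHANISMS[key]
```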
At What Point Do We Stop Drilling Down?
You can imagine the logic tree starting to take shape. Exhaust all answers that you can bring out from the collective experience of the team.
Following the example, you would have already identified the physical root of the problem: a tangible component, in this case a bearing, gave out. Shortly after, you might find the human root of the issue. In this instance, let's say it was misalignment.
It’s easy for a lot of companies to stop at this point and tag the whole incident as an effect of human error. However, there might be some deeper reason why someone thought to do what they did at the time.
It's not easy to be sure whether or not you've drilled down enough to get to the real root cause of the problem. However, acknowledging that the problem does not always end at human error can be a good start. There might be a systemic issue such as inadequate training or unclear expectations and accountabilities. If you stop prematurely, you can expect the problem to recur even with a different person performing the same task.
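One way to make the stopping question concrete is to label each finding by the kind of root it represents and check whether the chain reaches a systemic cause. The sketch below is only an illustration: the `RootLevel` names and the `is_shallow` helper are hypothetical, and the example findings are adapted from the bearing scenario above.

```python
from enum import Enum
from typing import List, Tuple


class RootLevel(Enum):
    PHYSICAL = "physical"  # a tangible component gave out
    HUMAN = "human"        # someone's action or decision
    LATENT = "latent"      # a systemic issue behind that decision


Finding = Tuple[str, RootLevel]


def is_shallow(chain: List[Finding]) -> bool:
    """True when the analysis stopped before reaching any latent root."""
    return not any(level is RootLevel.LATENT for _, level in chain)


# Hypothetical findings, for illustration only.
stopped_early = [
    ("Bearing gave out", RootLevel.PHYSICAL),
    ("Misalignment", RootLevel.HUMAN),
]
drilled_down = stopped_early + [("Inadequate training", RootLevel.LATENT)]
```

Here `is_shallow(stopped_early)` is true and `is_shallow(drilled_down)` is false. A check like this cannot prove that the real root cause was found. It only flags an analysis that ended at human error.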
What Is the "Truth"?
There is a common misconception that truth is the absence of lies. However, this fails to recognize the possibility of different perspectives. In other words, various parties might be telling the truth as they see it. Most of the time, these truths differ depending on whom you ask.
Systems are made to work with multiple people. Safeguards are put in place to ensure reliability and accuracy when executing tasks, regardless of who's doing them. If we stop at one version of the truth, we risk missing the true root cause. If we stop at the sight of physical failure or human error, we risk experiencing the same mishap again.
The Difference Between RCA and Shallow Cause Analysis
By now, we've gone through many of the usual scenarios that come up in a typical analysis. We've also highlighted the traps that we might easily fall into, as opposed to what we should be doing.
To summarize, we can think of the RCA process as a series of stages. First, we will notice the physical manifestations of an undesirable outcome. Then, we will arrive at a point where some human intervention appears to have caused the observable failure. If we dig deeper, we then find some of the latent root causes, possibly relating to systemic issues.
In general, shallow cause analysis happens when we stop at the initial stages of the failure. The investigation might stop at the physical or even human causes of the breakdown. A comprehensive root cause analysis does not stop prematurely. It goes further, into the less obvious yet more likely causes of failure.
Root cause analysis is more than just a process that pops up whenever something unfortunate happens. RCA is a valuable tool that uncovers opportunities for improvement, especially in an organization's foundational systems and processes. While it's easy to point to the physical and human roots of failure, it is more beneficial to delve into the latent root causes. Tools such as logic trees, combined with the team's collective expertise, are the basic building blocks of a proper RCA.