Root cause analysis helps you solve the problem—not just its symptoms.

Do you solve some problems only to have them return again and again? If so, you may be treating the symptoms and not the problem. Problems return if you don’t dig deeper for the root cause.

Root cause analysis (RCA) is a rigorous identification of failure mechanisms. Its goal is to identify the item, or set of items, that initiates a failure. It’s methodical, measurement-dependent, persistent, graphical, logical, and inquisitive. Using RCA, you can determine cause-and-effect relationships in a system and isolate one subsystem or component at a time, repeating the process until you uncover the source.


Methods and applications. A typical RCA procedure starts by identifying the inputs and outputs of a problem. Where does the problem begin and end? Narrow it down as much as possible. Next, identify which inputs are variable and which are fixed. Examine the fixed inputs (those an operator can’t change) first, holding the variables constant. Look for cause-and-effect and what-if relationships, especially in the infrastructure. Ask questions like “Why is it this way?”, “Has it always been this way?”, and “What has changed?”

Repeat the above steps for the variables, making sure to record the “before” conditions. Change one input at a time and record the effects, then restore the “before” conditions and note the status of the problem. Then make a change and monitor the status of the problem condition over time.
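The one-change-at-a-time routine can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input names (line_speed, melt_temp), their values, and the scrap_problem_present function are hypothetical stand-ins for whatever you would actually set and observe on the real system.

```python
from typing import Callable, Dict, List


def one_at_a_time(
    baseline: Dict[str, float],
    trial_values: Dict[str, float],
    problem_present: Callable[[Dict[str, float]], bool],
) -> List[dict]:
    """Change one variable input per trial, then restore the "before" conditions."""
    log = []
    for name, new_value in trial_values.items():
        settings = dict(baseline)      # start from the recorded "before" conditions
        settings[name] = new_value     # change exactly one input
        after_change = problem_present(settings)
        after_restore = problem_present(baseline)  # restore and note the status
        log.append({
            "input": name,
            "value": new_value,
            "problem_after_change": after_change,
            "problem_after_restore": after_restore,
        })
    return log


# Hypothetical stand-in for observing the real system:
# here the problem shows up only when line speed exceeds 50.
def scrap_problem_present(settings: Dict[str, float]) -> bool:
    return settings["line_speed"] > 50


baseline = {"line_speed": 60.0, "melt_temp": 210.0}   # assumed "before" conditions
trials = {"line_speed": 45.0, "melt_temp": 200.0}     # assumed trial values

results = one_at_a_time(baseline, trials, scrap_problem_present)
for row in results:
    print(row)
```

An input whose change clears the problem, and whose restoration brings it back, is the one worth a closer look.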

RCA can be performed using three different methods. They differ in complexity, but each involves digging for information. Fault tree analysis traces a failure back to the initiating event or events, while event tree analysis works in the opposite direction, tracing possible initiating events forward to the failure. The third method, failure mode and effects analysis, is the most difficult but also the most conclusive. Forensic engineers and advanced troubleshooters prefer it because it exhaustively identifies component failures along with their causes and effects. In most situations, the first two methods will suffice because most problems are near the surface and don’t require extensive research. The trick is knowing when to keep digging. As you work through your RCA, don’t fix on one possible cause as “the one” and set out to prove it. Instead, categorize your ideas and suspected causes, then analyze them one at a time.

If you were to apply fault tree analysis to a problem like a nuisance breaker trip, you would try to determine what changed just prior to the trip. Did someone turn a machine on? Did some power anomaly occur? Using event tree analysis, you would list the possible explanations and check each one to see whether it was responsible for the problem.
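For readers who like to see the structure written down, here is a minimal Python sketch of a fault tree for the breaker trip. The tree itself is hypothetical: the events and the way they are combined with AND/OR gates are assumptions made for illustration, not a model of any real circuit.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    name: str
    gate: str = "BASIC"                      # "BASIC", "OR", or "AND"
    children: List["Node"] = field(default_factory=list)


def occurs(node: Node, observed: Dict[str, bool]) -> bool:
    """Return True if the observed basic events are enough to produce this event."""
    if node.gate == "BASIC":
        return observed.get(node.name, False)
    results = [occurs(child, observed) for child in node.children]
    return any(results) if node.gate == "OR" else all(results)


def basic_events(node: Node) -> List[str]:
    """List the initiating events at the bottom of the tree, i.e. what to go check."""
    if node.gate == "BASIC":
        return [node.name]
    events: List[str] = []
    for child in node.children:
        events.extend(basic_events(child))
    return events


# Hypothetical tree for a nuisance breaker trip (illustrative only).
trip = Node("breaker trips", "OR", [
    Node("overcurrent", "AND", [
        Node("machine started"),
        Node("circuit already near rated load"),
    ]),
    Node("power anomaly"),
    Node("loose connection"),
])

checklist = basic_events(trip)
observed = {"machine started": True, "circuit already near rated load": True}
explained = occurs(trip, observed)
print(checklist)
print(explained)
```

Walking the tree from the top event down to the basic events gives you the checklist; feeding in what you actually observed tells you whether those observations are sufficient to explain the trip.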

Failure mode and effects analysis would require you to identify every component in the system and then determine whether a failure of a particular component can completely explain what happened. In the case of the nuisance breaker trip, components include the individual breaker mechanisms, specific connections, the grounding system, the load supplied by that breaker, and anything else involved in the supply/breaker/load chain of components. When you find the single component that could have started the chain of events that led to the failure, you’ve found your root cause. This method is time-consuming, and it’s usually overkill. However, when you need a conclusive analysis, it is the only method to use. Regardless of your RCA method, using the right tools for data collection is essential to success.
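The "can this failure completely explain what happened?" test at the heart of failure mode and effects analysis can be expressed as a simple comparison. The sketch below is an assumption-laden illustration: the failure modes, their effects, and the observations are invented for the example and are not engineering data.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical (component, failure mode) -> effects table, for illustration only.
FAILURE_MODES: Dict[Tuple[str, str], Set[str]] = {
    ("breaker mechanism", "worn trip unit"): {"trips below rated current"},
    ("connection", "loose lug"): {"trips below rated current", "heat at terminal"},
    ("grounding system", "intermittent ground fault"): {"trips at random times"},
    ("load", "motor inrush current"): {"trips at machine startup"},
}


def complete_explanations(observed: Set[str]) -> List[Tuple[str, str]]:
    """Return the failure modes whose effects cover every observation."""
    return [key for key, effects in FAILURE_MODES.items() if observed <= effects]


# Assumed observations collected during troubleshooting.
observed = {"trips below rated current", "heat at terminal"}
candidates = complete_explanations(observed)
print(candidates)
```

A failure mode that accounts for only some of the observations is dropped, which is what makes the method conclusive and also what makes it slow: the table has to cover every component in the chain.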

An example of root cause analysis. A manufacturing plant that produced battery separators—thin plastic sheets used inside lead-acid batteries to keep the plates from touching—had a high scrap rate on a newly upgraded plastic extruder. The raw materials were good, the machine checked out per the normal diagnostics, and the operators were competent. What was causing the scrap? The control system was reading a pressure spike and adjusting the machine’s process controls accordingly, but there was nothing in the machine capable of generating that kind of pressure. In theory, the pressure didn’t exist even though the process control system showed it did.

The plant electrician solved this problem by first disconnecting the process controller from its inputs and simulating them at the controller while monitoring the outputs. The problem did not reappear, which ruled out the controller and the entire output side. Next, he ran the process while monitoring the inputs at the controller and saw the spike. He then ran the process while monitoring the inputs at the point of measurement. This second run registered no spike, showing that the spike originated in the wiring between the process and the controller.
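The electrician’s input-side checks amount to monitoring points in order along the signal path and finding the first one where the spike shows up. A minimal Python sketch of that reasoning follows; the function name and the dictionary of readings are hypothetical, with the two readings taken from the account above.

```python
from typing import Dict, List, Optional, Tuple


def locate_fault(points: List[str], spike_seen: Dict[str, bool]) -> Optional[Tuple[str, str]]:
    """Return the pair of adjacent points between which the spike first appears.

    `points` must be ordered from the process toward the controller.
    Returns None if no monitoring point shows the spike.
    """
    previous = "process"
    for point in points:
        if spike_seen[point]:
            return (previous, point)
        previous = point
    return None


# Monitoring points ordered from the process toward the controller.
points = ["point of measurement", "controller input"]

# Readings from the two monitored runs described above.
spike_seen = {"point of measurement": False, "controller input": True}

segment = locate_fault(points, spike_seen)
print(segment)
```

With more monitoring points along the path, the same loop narrows the fault to a shorter stretch of the chain.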

The root cause turned out to be a poor wiring job that induced in the signal wiring a voltage much larger than the pressure sensor could produce. What looked like a machine alignment problem was actually a wiring problem disguised as a pressure spike. But the RCA process can be taken further: Poor project management was at fault for the wiring problem. The civil engineer assigned to do the electrical work made a costly, fundamental error by violating NEC Art. 300.

RCA takes more time upfront than simply addressing symptoms or fixing intermediary causes. However, it allows properly trained and equipped people to save measurable amounts of time and money in the long run—and often sooner.