A short outage may not cause much trouble for a refrigerated storage facility, but unplanned shutdowns may cost millions of dollars and cause a drop in share price for data centers, microchip manufacturers, or e-commerce-related companies. In addition, performance appraisals and salaries of facility engineers and plant managers may be negatively influenced by the extent of downtime of their electrical systems. In fact, system reliability is often a critical performance measure for facility engineers and plant managers, yet many of them misunderstand important concepts about system reliability and how to improve it.

These people, who get paid to know their facilities' systems inside and out, often have trouble answering questions such as “What is the total cost of an outage at the facility?”, “Does the facility meet ‘six nines’ availability criteria?”, or “What would a probabilistic risk assessment (PRA) of the electrical system tell us about downtime?”

Also, terms like “N+2,” “MTBF,” “failure rate,” and “high nines of availability” are often misused or misunderstood. Although PRA techniques have been applied for many years and can now be applied using off-the-shelf software, the details of this technique can easily become overwhelming.

Reliability through good design. Even though system design is typically not the direct cause of equipment failure or system shutdown, design does affect system availability and the length of shutdowns when they occur. A system designed with multiple redundancies can allow for maintenance outages and can ride through equipment failures without an unplanned shutdown. Using common reliability analysis tools, you can calculate the predicted reliability and availability of your electrical system.

To provide continuous operation under all foreseeable circumstances, including utility outages and equipment breakdown, you must design reliability into an electrical system (Photo 1). Investigating the number of redundancies designed into the electrical system is one of the common analytical approaches. It identifies the normal source (N) and any redundant circuits/sources or equipment that would provide alternate paths for electrical power to flow.

A system with one redundant path would be termed an N+1 design. This would allow for one of the paths to be de-energized for maintenance while the other is still energized, allowing maintenance without system shutdown. If the system is designed with a normal path and two alternate paths (N+2 design), one path could be down for maintenance, a failure could occur in a second path, and ideally, the third path would supply power to the load without interruption. An N+1 or N+2 assessment of a system can reveal single points of failure within the system.
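As a rough illustration of why each added path helps, the sketch below estimates availability for N, N+1, and N+2 arrangements. It assumes every path fails independently, every path has the same availability, and any single path can carry the full load. The 0.999 per-path figure is hypothetical, chosen only for illustration, and the function name is not from any standard or software package. Real systems have common-mode failures and single points of failure that this simple model ignores.

```python
def parallel_availability(path_availability: float, paths: int) -> float:
    """Availability of `paths` identical, independent paths,
    where the load is served as long as at least one path is up."""
    if not 0.0 <= path_availability <= 1.0:
        raise ValueError("path_availability must be between 0 and 1")
    if paths < 1:
        raise ValueError("paths must be at least 1")
    # The load is lost only when every path is down at the same time.
    return 1.0 - (1.0 - path_availability) ** paths


if __name__ == "__main__":
    PATH_AVAILABILITY = 0.999  # hypothetical per-path value
    for label, paths in (("N", 1), ("N+1", 2), ("N+2", 3)):
        result = parallel_availability(PATH_AVAILABILITY, paths)
        print(f"{label}: {result:.9f}")
```

Under these assumptions each redundant path multiplies the unavailability by the per-path unavailability, which is why the independence assumption matters so much: a shared bus, control circuit, or human error that takes out two paths at once erases most of the benefit.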

Reliability through probabilistic risk assessment. Performing a probabilistic risk assessment (PRA) is another way to look at system reliability. Table 1 provides a list of some of the terms and formulae used in analyzing reliability.

Describing and detailing PRA would take a book's worth of pages to do justice. However, IEEE Standard 493, Recommended Practice for the Design of Reliable Industrial and Commercial Power Systems (the Gold Book), does provide data and describe a process for assessing system performance based on PRA principles.

Using the typical failure rate for a given type of equipment and the mean time necessary to repair it, PRA looks at the probability of failure of each type of electrical power equipment and, depending on the number of redundancies built into system design, can be used to predict availability, number of failures per year, and annual downtime. Software is commercially available to perform PRA calculations on electrical distribution systems. Books are also available to help explain this concept, such as Probabilistic Risk Assessment and Management for Engineers and Scientists, 2nd Edition, by Hiromitsu Kumamoto and Ernest J. Henley (ISBN: 0-7803-6017-6).
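The basic arithmetic behind such a prediction can be sketched in a few lines. The example below is a minimal sketch, not the method of any commercial package or of IEEE 493 itself. It assumes a simple series system (every component must work for the load to be served), takes the mean time between failures as the reciprocal of the failure rate, and uses an 8,760-hour year. The component names, failure rates, and repair times are hypothetical placeholders; substitute published or site-specific data.

```python
from dataclasses import dataclass

HOURS_PER_YEAR = 8760.0


@dataclass(frozen=True)
class Component:
    name: str
    failures_per_year: float  # failure rate; must be greater than zero
    repair_hours: float       # mean time to repair

    @property
    def availability(self) -> float:
        # Mean time between failures, assumed to be 1 / failure rate.
        mtbf_hours = HOURS_PER_YEAR / self.failures_per_year
        return mtbf_hours / (mtbf_hours + self.repair_hours)


def series_summary(components):
    """Return (availability, failures per year, downtime hours per year)
    for components connected in series with no redundancy."""
    availability = 1.0
    for component in components:
        availability *= component.availability
    failures = sum(c.failures_per_year for c in components)
    downtime_hours = (1.0 - availability) * HOURS_PER_YEAR
    return availability, failures, downtime_hours


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    system = [
        Component("utility feed", 1.0, 2.0),
        Component("transformer", 0.01, 300.0),
        Component("switchgear bus", 0.002, 30.0),
    ]
    availability, failures, downtime = series_summary(system)
    print(f"availability: {availability:.6f}")
    print(f"failures per year: {failures:.3f}")
    print(f"downtime hours per year: {downtime:.2f}")
```

A full PRA also models redundant paths, switching times, and maintenance outages, which is where dedicated software earns its keep.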

Table 2 provides a small sample of the type of data available from IEEE 493. Of course, the results of such reliability studies can only predict performance based on available data. And obviously, if the data used for such a study isn't representative, the results will be unreliable.

Data provided by IEEE 493 is based on failure rates and repair time information gathered from U.S. industrial plants over the past several years, but it may not be very representative of failure rates and repair times for your facility. As such, you would need to modify this data by including more site-specific information or substituting better data, if such data is available.

Quantifying system reliability. To quantify system reliability, it's necessary to first define the term “loss of power.” Many utilities don't keep records of service interruptions shorter than one minute. Some don't keep records of those interruptions shorter than five minutes. But for many critical facilities, even a five- or 10-second outage would qualify as loss of power.

Before performing a reliability analysis, you must understand and agree on the circumstances that qualify as a power failure. Table 3 shows the relationship between downtime and availability. Note that six nines of availability (99.9999%) still represents an average annual downtime of more than 30 seconds. While this may be an acceptable level of availability for many facilities, it would be completely unacceptable for many data centers, intensive care units, and other critical facilities that may expect seven, eight, or nine nines of availability.
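The conversion from "nines" to downtime is simple enough to check yourself. The short sketch below assumes a 365-day year and treats availability as the long-run fraction of time the system is up; the function name is illustrative only.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000 seconds in a 365-day year


def annual_downtime_seconds(nines: int) -> float:
    """Average annual downtime for an availability written with `nines` nines,
    for example nines=6 means 99.9999% availability."""
    if nines < 1:
        raise ValueError("nines must be at least 1")
    unavailability = 10.0 ** -nines
    return unavailability * SECONDS_PER_YEAR


if __name__ == "__main__":
    for nines in range(3, 10):
        seconds = annual_downtime_seconds(nines)
        print(f"{nines} nines: {seconds:.4f} seconds per year")
```

Keep in mind that this is an average. A facility with six nines of availability might see no interruptions for several years and then one event lasting a few minutes.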

If the cost of outages and the estimated costs of various improvement projects are known, you can compare the relative merits of the current system and each of the alternatives. Multiply the probability of failure by the cost of failure to get the expected cost of outages, and then subtract that expected cost from the cost of each improvement project. You can then use this information to evaluate return on investment (ROI).
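One way to put this comparison into numbers is sketched below. It is only one reading of the approach, not a prescribed method: it assumes the expected annual loss is the predicted number of failures per year times the cost of one outage, and it uses simple payback (project cost divided by annual savings) as a stand-in for a fuller ROI analysis. All dollar amounts and failure frequencies are hypothetical.

```python
def expected_annual_loss(failures_per_year: float, cost_per_outage: float) -> float:
    """Expected yearly outage cost: failure frequency times cost per event."""
    return failures_per_year * cost_per_outage


def simple_payback_years(project_cost: float,
                         current_loss: float,
                         improved_loss: float) -> float:
    """Years for the reduction in expected loss to repay the project cost.
    Returns infinity if the project does not reduce the expected loss."""
    annual_savings = current_loss - improved_loss
    if annual_savings <= 0:
        return float("inf")
    return project_cost / annual_savings


if __name__ == "__main__":
    COST_PER_OUTAGE = 1_000_000.0  # hypothetical
    current = expected_annual_loss(0.5, COST_PER_OUTAGE)

    # Hypothetical alternatives: (name, project cost, predicted failures per year)
    alternatives = [
        ("add second feeder", 400_000.0, 0.1),
        ("add standby generator", 900_000.0, 0.05),
    ]
    for name, cost, failures in alternatives:
        improved = expected_annual_loss(failures, COST_PER_OUTAGE)
        payback = simple_payback_years(cost, current, improved)
        print(f"{name}: payback {payback:.2f} years")
```

The failure frequencies for each alternative would come from the reliability analysis of that design, so the quality of this comparison depends directly on the quality of the underlying failure data.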

Reliability through proper maintenance. Maintenance clearly affects reliability. The IEEE 493 Standard provides data showing that failures increase when maintenance is deferred or done poorly. Also, according to NFPA Standard 70B, Recommended Practice for Electrical Equipment Maintenance, “As soon as new equipment is installed, a process of normal deterioration begins. Unchecked, the deterioration process can cause malfunction or an electrical failure.”

With this in mind, it's important to establish an ongoing program designed to maintain an acceptable level of reliability for the facility. You can greatly improve reliability of electrical systems and equipment through proper maintenance practices and procedures, starting with effective system startup and acceptance testing.

When normal acceptance and start-up testing isn't performed (usually to save a few dollars), the results can be disastrous. Perfectly good switchgear, transformers, or other equipment can be “smoked” due to relatively small installation errors. In other cases, the failures don't occur until months after the facility has gone into operation and the warranties have expired. Loose connections (Photo 2) or insulation damage may not show up until more equipment comes online and electrical loads increase.

To implement effective acceptance testing procedures, refer to the recommendations provided by the InterNational Electrical Testing Association (NETA) Acceptance Testing Standards (ATS). Acceptance and start-up testing also provides valuable baseline or benchmark information that can be used later. (See the comments below about trending of test data.)

Several good methods exist for establishing maintenance programs designed to maximize reliability:

  • NETA Maintenance Testing Standard (MTS) recommendations

  • National Fire Protection Association (NFPA) 70B Standard recommendations

  • A reliability centered maintenance (RCM) assessment, which rigorously reviews critical system and equipment failure effects and establishes appropriate condition assessment tasks and maintenance activities for facilities or systems where reliability is critical

To get the full benefit of condition assessment and maintenance testing, you should trend the results. Trending contact resistance, temperature, insulation resistance, and other indicators will warn of deterioration and often provide an opportunity to plan a shutdown and correct the problem before failure.
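A trend can be as simple as a straight line fitted through successive test results. The sketch below fits a least-squares line to a series of readings and estimates when the reading would reach an alarm limit if the trend continued. The readings, units, and limit are hypothetical, and a straight-line fit is an assumption: real deterioration is often nonlinear, so treat the projection as a prompt for investigation rather than a prediction.

```python
def linear_trend(times, values):
    """Least-squares straight line through (time, value) points.
    Returns (slope, intercept)."""
    n = len(times)
    if n < 2 or n != len(values):
        raise ValueError("need at least two points and equal-length inputs")
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    spread = sum((t - mean_t) ** 2 for t in times)
    if spread == 0:
        raise ValueError("all readings were taken at the same time")
    slope = sum((t - mean_t) * (v - mean_v)
                for t, v in zip(times, values)) / spread
    return slope, mean_v - slope * mean_t


def time_to_limit(times, values, limit):
    """Time at which the fitted line reaches `limit`,
    or None if the readings are not rising."""
    slope, intercept = linear_trend(times, values)
    if slope <= 0:
        return None
    return (limit - intercept) / slope


if __name__ == "__main__":
    # Hypothetical contact-resistance readings (micro-ohms) by test year.
    years = [0, 1, 2, 3]
    micro_ohms = [10.0, 12.0, 14.0, 16.0]
    print(time_to_limit(years, micro_ohms, limit=20.0))
```

For indicators that deteriorate by falling rather than rising, such as insulation resistance, the same fit applies with the sign of the slope check reversed.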

Reliability through proper operations. While the relationship between quality of maintenance and resulting system reliability may be clear, the effect on reliability due to operations and other actions of personnel may be less obvious. This area of human interaction and its effect on the electrical system is considered the main source of unavailability (Photo 3). It has been estimated that 70% to 80% of all unplanned shutdowns are due to human error, meaning that only 20% to 30% of unplanned shutdowns are due to equipment malfunction or poor design.

The condition and availability of facility records also influence reliability. Out-of-date or non-existent drawings and instruction manuals can result in unnecessary shutdowns, equipment failures, and even injuries, yet surprisingly few facilities rigorously keep these crucial documents accurate and up to date.

Recognizing that equipment may fail or human error may occur, it's important that you have documents and procedures in place to quickly enable recovery actions and minimize the length of the shutdown. Such documents should include an up-to-date single-line diagram of the system and a list of emergency contact numbers.

Having at least a minimal number of spare parts for critical components is also essential to system availability. Maintaining a spare parts inventory for emergencies requires the implementation of a program that identifies which equipment is critical and which spare parts are needed for emergency conditions. Such a program should also involve periodic condition assessment of each of these spare parts and regular updating of the inventory.

Personnel training and detailed procedures for operation are essential. Many data centers and other similar “critical facilities” or facilities with “critical environments” require very detailed work procedures or scripts. These scripts must be written and then reviewed, revised if necessary, and approved by all the appropriate stakeholders, including engineering, maintenance, information technology, construction, operations, and procurement, before any physical work begins. The step-by-step work procedures must be followed without exception.

In a typical facility, operations usually have a larger effect on system reliability than maintenance or system design. Table 4 provides a basic checklist that can help identify areas that need to be evaluated.

One step ahead. You must design reliability into an electrical system to provide continuous operation under all foreseeable circumstances, including utility outages and equipment breakdown. Analyzing the number of redundancies designed into the electrical system and conducting a PRA are two methods of looking at system reliability. When considering the implications of reliability, you must remember that reliability analysis should examine all three pillars of system reliability: design, operations, and maintenance.

To effectively examine the overall picture, try using an electrical operations and maintenance checklist like the one shown in Table 4. This checklist serves as a basic blueprint to identify areas in need of evaluation, and when used in an actual working environment it should be written and reviewed by all appropriate personnel in the facility.

Vahlstrom is director, technical services, Electrical Reliability Services, Emerson Process Management, San Ramon, Calif.