Evaluating Risk
Probability is a fundamental concept in statistics, often thought of as “the odds.” It is defined as the ratio of the number of favorable outcomes to the total number of possible outcomes. As an example, the probability of picking an ace from a standard deck of cards is 4/52, or 7.7%. The complementary probability, that of not picking an ace (the “failure” in this example), is 48/52, or 92.3%. When discussing the probability of success or failure for an electrical product or system, however, exact projections are much harder to come by.
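The card arithmetic can be checked with a few lines of Python. This is only an illustrative sketch; the variable names are chosen here and are not part of any reliability tool.

```python
from fractions import Fraction

ACES = 4
DECK_SIZE = 52

# Favorable outcomes divided by total possible outcomes.
p_ace = Fraction(ACES, DECK_SIZE)
# The complement: every card that is not an ace.
p_not_ace = 1 - p_ace

print(f"P(ace)     = {p_ace} = {float(p_ace):.1%}")
print(f"P(not ace) = {p_not_ace} = {float(p_not_ace):.1%}")
```

The two probabilities always sum to 1, which is the same relationship used for reliability and probability of failure.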
Reliability is defined as the probability that a product or system will operate properly for a specified period of time under design operating conditions without failure. There can be many factors that affect overall reliability and some types of failure have a much greater impact on production than others. Therefore, determining reliability involves not only how often something fails, but also the failure's impact on other products or systems involved. Each type of failure can be categorized as a particular type of risk.
Assessing the odds. One method of evaluating risks associated with technically complex systems is called probabilistic risk assessment (PRA). Risk in a PRA is defined as a detrimental outcome that is feasible. Each detrimental outcome has an associated magnitude, or severity, and a probability of occurring.
The first step of a PRA is to determine what the risks are for a particular system. Next, identify the risks that would have a major impact on the system's performance, and determine the probability that each will occur. There are a number of ways to perform a statistical analysis of the reliability of a piece of equipment or system. One common approach is called a reliability block diagram (RBD), in which a block diagram is developed of the equipment or system to be modeled. The failure and repair rates for each major component are combined to determine the overall reliability of the system.
As an example, consider a PRA for a data center. For facilities that use computers extensively, one of the most severe detrimental outcomes is the loss of power. To determine the probability of losing power at the output of the UPS module, an RBD is used to model the system. The failure and repair rates for the individual components are combined, based on the configuration of the system being modeled, to determine its reliability and availability.
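As a rough illustration of how block values combine in an RBD, the sketch below uses the textbook series and parallel rules. It assumes the blocks fail independently and ignores repair and standby behavior, which the commercial software handles by simulation. The function names and the numeric reliabilities are hypothetical and are not taken from the system in this article.

```python
from math import prod

def series(*reliabilities):
    """All blocks must work: multiply the block reliabilities."""
    return prod(reliabilities)

def parallel(*reliabilities):
    """At least one block must work: 1 minus the product of the unreliabilities."""
    return 1.0 - prod(1.0 - r for r in reliabilities)

# Hypothetical block reliabilities over some fixed period (illustration only).
utility, breaker, ups = 0.90, 0.999, 0.98

single_path = series(utility, breaker, ups)
redundant_ups = series(utility, breaker, parallel(ups, ups))

print(f"Single UPS path:    {single_path:.4f}")
print(f"Two UPS in parallel: {redundant_ups:.4f}")
```

Blocks in series lower the overall figure, while redundant blocks in parallel raise it, which is why the configuration of the diagram matters as much as the individual component data.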
Sample exercise. Shown in Fig. 1 is the electrical one-line diagram for a simple system used by many facilities that use computers extensively. It consists of a standby generator, automatic transfer switch (ATS), uninterruptible power supply (UPS) module, and a static bypass switch. The UPS module provides power to the switchboard below it. If the UPS module fails, the static bypass switch turns on to provide power to the switchboard directly from the utility.
The utility supplies normal power to the ATS. On loss of utility power, the generator supplies emergency power through the ATS to the UPS module. Both the generator and the UPS module's batteries are on “standby” until the utility power is lost. While the generator starts, the batteries provide power to the inverter, maintaining the power at the output of the UPS module. Modeling the actual standby operation of the generator and batteries requires a software program that performs simulations to calculate the reliability.
Figure 2 shows an RBD model of Fig. 1. The diagram starts at the top left-hand corner. Each block carries a reference designator, which indicates what the block represents. The RBD ends at the bottom, beside the distribution switchboard shown in the RBD as “Dist Swbd.” Each block has a failure rate and a repair rate, which are loaded into the software. The first block, UTL, represents utility power. In normal operation, the utility power goes through a distribution switchboard and an airframe circuit breaker (CB AF) to an ATS. The standby junction above the ATS has priority set to the utility side, and the generator is set to standby. There is a line from the junction to the right of start directly to the generator (GEN) that is obscured by the UTL and Dist Swbd blocks. When the block UTL fails, the generator starts, and the ATS transfers to provide generator power to the switchboard below it. Since it takes approximately 10 seconds for the generator to start, the standby junction has a 10-second time delay before it transfers.
Below the ATS and switchboard is a pair of molded-case circuit breakers. One provides power to the rectifier of the UPS module; the other provides power to the static bypass switch (SBS). Below the rectifier is another standby junction, which connects the battery. The battery charge block (BAT CH) lasts for 15 minutes. The block below the generator is the probability of the generator starting, which is 98.65%. When the utility fails, 98.65% of the time the generator starts. During the 10 seconds that the generator is starting, the battery provides the power to the inverter of the UPS module. If the generator fails to start (1.35% of the time), the battery runs out of power after 15 minutes, and the distribution switchboard below the UPS module loses power.
If the UPS module fails, the static bypass switch turns on, providing utility power directly to the distribution switchboard. Should the utility power fail while the static bypass switch is providing power, the system would fail, since power would be lost while the generator was starting.
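The standby behavior described above can be written out as a small decision function. This is a simplified sketch, not the logic of the modeling software: it assumes every component other than the utility, generator, and battery works, and the function and constant names are invented for illustration.

```python
GENERATOR_START_DELAY_S = 10      # time for the generator to start, from the text
BATTERY_RUNTIME_S = 15 * 60       # battery carries the inverter for 15 minutes
P_GENERATOR_STARTS = 0.9865       # probability the generator starts on demand

def load_keeps_power(outage_s, generator_starts, on_static_bypass=False):
    """Return True if the distribution switchboard stays energized
    through a utility outage lasting outage_s seconds."""
    if outage_s <= 0:
        return True
    if on_static_bypass:
        # No battery in the path: power is lost while the generator starts.
        return False
    if generator_starts:
        # The battery only has to bridge the generator start delay.
        return BATTERY_RUNTIME_S >= min(outage_s, GENERATOR_START_DELAY_S)
    # No generator: the battery must outlast the whole outage.
    return outage_s <= BATTERY_RUNTIME_S

# Chance of losing the load in an outage longer than the battery runtime,
# with the UPS module on line (not on bypass).
p_loss_long_outage = 1.0 - P_GENERATOR_STARTS

print(load_keeps_power(3600, generator_starts=True))    # one-hour outage, generator runs
print(load_keeps_power(3600, generator_starts=False))   # one-hour outage, no generator
print(f"P(loss | long outage) = {p_loss_long_outage:.2%}")
```

A simulation tool evaluates this kind of logic many times over randomly drawn failure and repair times, which is how it arrives at system-level figures.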
For the system in Fig. 2, the probability of losing power to the distribution switchboard at the bottom is 24.7% for a five-year period. Reliability is a function of time. The longer any system runs, the lower the reliability will be. Since the probability of failure is equal to 1 minus the reliability figure, the longer the time, the higher the probability of failure.
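To see how the probability of failure grows with time, the sketch below assumes a constant failure rate (an exponential model), which is a common simplification and not necessarily what the simulation software uses. The rate is backed out of the 24.7% five-year figure purely for illustration, so values at other time spans are extrapolations under that assumption.

```python
import math

P_FAIL_5YR = 0.247   # five-year probability of failure quoted for Fig. 2

# Assumed constant failure rate implied by the five-year figure.
FAILURE_RATE_PER_YEAR = -math.log(1.0 - P_FAIL_5YR) / 5.0

def reliability(years):
    """Probability of running for the given time with no failure."""
    return math.exp(-FAILURE_RATE_PER_YEAR * years)

def probability_of_failure(years):
    """Probability of failure is 1 minus the reliability."""
    return 1.0 - reliability(years)

for years in (1, 5, 10, 20):
    print(f"{years:>2} years: P(failure) = {probability_of_failure(years):.1%}")
```

Under this assumption, doubling the period to 10 years gives 1 - 0.753 × 0.753, or about 43.3%.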
There are two more factors that are very commonly used when determining the reliability of a system. The first is mean time between failures (MTBF). The MTBF, as the name implies, is an average taken over multiple failures. Shown in Fig. 3 is a very common failure distribution called the normal distribution or “bell curve.” The MTBF is the average of the times between all of the failures.
The MTBF for the system in Fig. 2 is 116,750 hours. That is not how long it will be before the first failure, but the average time between failures. For example, let's say there are 100 systems just like the one above in service. There may be a failure of any one system at any time, but the average time between failures will be 116,750 hours.
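A quick calculation puts the MTBF in more familiar units. The fleet estimate below is a rough sketch that assumes the 100 systems fail independently and that repair time is negligible compared with the MTBF; the variable names are illustrative.

```python
HOURS_PER_YEAR = 8760
MTBF_HOURS = 116_750
FLEET_SIZE = 100

# MTBF expressed in years for a single system.
mtbf_years = MTBF_HOURS / HOURS_PER_YEAR

# Average number of failures per year across the whole fleet.
fleet_failures_per_year = FLEET_SIZE * HOURS_PER_YEAR / MTBF_HOURS

print(f"MTBF = {mtbf_years:.1f} years")
print(f"Expected failures per year in a fleet of {FLEET_SIZE}: "
      f"{fleet_failures_per_year:.1f}")
```

So while any one system averages about 13.3 years between failures, a fleet of 100 would be expected to see roughly seven or eight failures a year.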
The other very common term used in reliability modeling is availability. Availability is the long-term average fraction of time that a repairable component or system is in service and satisfactorily performing its intended function. For example, if the electricity is off for one hour in a year, but the rest of the year the electricity is on, the availability of electrical power for that year is 8,759 hours divided by 8,760 hours, which is 0.999886.
An availability of 0.99999 (which is commonly referred to as five 9s) could mean that the system was down for 5.3 minutes (or 315 seconds) per year. It would make no difference in the availability calculation if there was one 5.3-minute outage, or 315 one-second outages. It could also be one outage of 1.77 hours in 20 years. In all three cases, the availability is 0.99999.
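The availability arithmetic can be reproduced directly. The helper names below are chosen for this sketch only. Note that the 20-year figure computes to about 1.75 hours; the 1.77 hours in the text follows from multiplying the rounded 5.3 minutes by 20.

```python
HOURS_PER_YEAR = 8760

def availability(uptime_hours, total_hours):
    """Fraction of time the system is in service."""
    return uptime_hours / total_hours

def downtime_hours(avail, years=1):
    """Total downtime implied by an availability over a number of years."""
    return (1.0 - avail) * years * HOURS_PER_YEAR

one_hour_outage = availability(8759, 8760)
five_nines_minutes = downtime_hours(0.99999) * 60
five_nines_20yr_hours = downtime_hours(0.99999, years=20)

print(f"One hour off in a year: {one_hour_outage:.6f}")
print(f"Five 9s, per year:      {five_nines_minutes:.2f} minutes")
print(f"Five 9s, over 20 years: {five_nines_20yr_hours:.2f} hours")
```

The calculation uses only total downtime, which is why it cannot distinguish one long outage from many short ones.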
The availability for the system shown in Fig. 2 is 0.999986. Since availability is a combination of how often the system fails and how quickly it is repaired, availability alone does not tell us how good this system would be for our application of powering computers. Obviously, 315 one-second outages would be very destructive to the operation. However, one outage of 1.77 hours every 20 years would not be so bad.
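One common way to express that combination is the steady-state relation availability = MTBF / (MTBF + MTTR), where MTTR is the mean time to repair. The article does not state that the modeling software uses exactly this definition, so the sketch below is an assumption: it takes the MTBF of 116,750 hours and the availability of 0.999986 and backs out the average repair time they would imply.

```python
MTBF_HOURS = 116_750
AVAILABILITY = 0.999986

def availability_from(mtbf_hours, mttr_hours):
    """Steady-state availability from mean time between failures and mean time to repair."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def implied_mttr(mtbf_hours, avail):
    """Solve availability = MTBF / (MTBF + MTTR) for MTTR."""
    return mtbf_hours * (1.0 - avail) / avail

mttr_hours = implied_mttr(MTBF_HOURS, AVAILABILITY)

print(f"Implied average repair time: {mttr_hours:.2f} hours")
print(f"Check: availability = {availability_from(MTBF_HOURS, mttr_hours):.6f}")
```

Under that assumed relation, the two figures together describe infrequent failures with an average outage of somewhat more than an hour and a half, a picture that availability alone could not give.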
This is why the probability of failure and MTBF are also used. The MTBF of 116,750 hours tells us the average time between failures is 13.3 years. Therefore, it is very unlikely that our system will experience a lot of very short outages. The probability of failure of 24.7% in five years gives us an indication of the likelihood of experiencing the first failure in a short period of time.
Analysis advantages. As demonstrated in the example above, performing a PRA can provide quantitative data you can use to evaluate and mitigate risk. If a probability of failure of 24.7% were too high, which it would be for a major data center, redundant components would have to be added. A second UPS module or a second generator are two obvious possibilities. Using RBDs to model each would offer insight into which strategy would boost overall system performance.
Gross is CEO and chief technical officer and Schuerger is a principal with EYP Mission Critical Facilities, Inc., in Los Angeles.