
Sunday, March 28, 2010

Reliability and Availability

Reliability and Availability Basics
Real-time and embedded systems are now a central part of our lives, and their reliable functioning is of paramount concern to the millions of users who depend on them every day. Unfortunately, most embedded systems still fall short of users' expectations of reliability.
In this article we discuss basic techniques for measuring and improving the reliability of computer systems, covering failure characteristics, reliability parameters (MTBF, FITS, MTTR), availability, and downtime.
Failure Characteristics
Hardware Failures
Hardware failures are typically characterized by a bathtub curve; an example is shown below. The chance of a hardware failure is high during the initial life of a module, stays fairly low during the rated useful life of the product, and rises again once the end of life is reached.
(Figure: bathtub curve characterizing the hardware failure rate over a module's life)
Hardware failures during a product's life can be attributed to the following causes:
  • Design failures: This class of failures occurs because of inherent design flaws in the system. In a well-designed system this class should make a very small contribution to the total number of failures.
  • Infant mortality: This class of failures causes newly manufactured hardware to fail. Such failures can be attributed to manufacturing problems like poor soldering, leaking capacitors, etc. They should not be present in systems leaving the factory, as these faults show up in factory burn-in tests.
  • Random failures: Random failures can occur during the entire life of a hardware module and can lead to system failures. Redundancy is provided to recover from this class of failures.
  • Wear out: Once a hardware module has reached the end of its useful life, degradation of component characteristics will cause it to fail. This type of fault can be weeded out by preventive maintenance and timely replacement of aging hardware.
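To make the bathtub curve a little more concrete, here is a minimal illustrative sketch (not part of the original article) that models the failure rate in three piecewise regions: a decreasing infant-mortality rate, a roughly constant random-failure rate during the useful life, and an increasing wear-out rate. All thresholds and rates below are hypothetical.

def bathtub_failure_rate(t_hours,
                         infant_end=1_000.0,      # hypothetical end of infant-mortality phase (hours)
                         wearout_start=50_000.0,  # hypothetical start of wear-out phase (hours)
                         base_rate=2e-6):         # hypothetical random-failure rate (failures/hour)
    """Illustrative piecewise bathtub curve: failure rate (failures/hour) versus module age."""
    if t_hours < infant_end:
        # Infant mortality: elevated rate that decays toward the random-failure floor.
        return base_rate * (1 + 9 * (1 - t_hours / infant_end))
    if t_hours < wearout_start:
        # Useful life: roughly constant random-failure rate.
        return base_rate
    # Wear-out: rate grows as component characteristics degrade.
    return base_rate * (1 + (t_hours - wearout_start) / 10_000.0)

for age in (0, 500, 1_000, 25_000, 50_000, 60_000, 80_000):
    print(f"age={age:>6} h   failure rate={bathtub_failure_rate(age):.2e} per hour")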
Software Failures
Software failures can be characterized by the software defect density of the system. This number can be estimated from historical defect data of earlier projects. Defect density depends on the following factors:
  • Software process used to develop the design and code (use of peer level design/code reviews, unit testing)
  • Complexity of the software
  • Size of the software
  • Experience of the team developing the software
  • Percentage of code reused from a previous stable project
  • Rigor and depth of testing before the product is shipped.
Defect density is typically measured in defects per thousand lines of code (defects/KLOC).
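As a quick worked example, defect density is simply the number of defects divided by the code size in KLOC; the numbers below are hypothetical.

def defect_density(defects_found, lines_of_code):
    """Defect density in defects per thousand lines of code (defects/KLOC)."""
    return defects_found / (lines_of_code / 1000.0)

# Hypothetical project: 120 defects found in 200,000 lines of code.
print(defect_density(120, 200_000))   # 0.6 defects/KLOC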

Reliability Parameters
MTBF
Mean Time Between Failures (MTBF) is, as the name suggests, the average time between failures of a hardware module; it is the average time a manufacturer estimates will elapse before a failure occurs in the module.
The MTBF of off-the-shelf hardware modules can be obtained from the vendor. The MTBF of in-house developed hardware modules is calculated by the hardware team developing the board.
A software failure rate can be estimated by multiplying the defect density (defects/KLOC) by the number of KLOCs of code executed per second; the software MTBF is the reciprocal of this failure rate.
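A minimal sketch of that estimate, with hypothetical numbers: defect density multiplied by the code-execution rate gives a failure rate, and the software MTBF is its reciprocal.

def software_mtbf_seconds(defects_per_kloc, klocs_executed_per_second):
    """Estimate software MTBF as the reciprocal of the estimated failure rate."""
    failure_rate = defects_per_kloc * klocs_executed_per_second   # failures per second
    return 1.0 / failure_rate

# Hypothetical values: residual defect density of 0.00002 defects/KLOC that actually
# trigger failures, and 50 KLOCs of code executed per second.
mtbf_s = software_mtbf_seconds(2e-5, 50)
print(f"Estimated software MTBF: {mtbf_s:.0f} seconds (about {mtbf_s / 3600:.2f} hours)")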
FITS
FITS is a more intuitive way of representing MTBF. The FIT rate of a module is simply the expected number of failures in one billion (10^9) hours of operation, so FITs = 10^9 / MTBF, with the MTBF expressed in hours.
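The MTBF-to-FITs conversion is just a reciprocal scaled by 10^9 hours; the sketch below uses a hypothetical MTBF.

BILLION_HOURS = 1_000_000_000   # 10^9 hours

def mtbf_hours_to_fits(mtbf_hours):
    """Expected number of failures per billion hours of operation."""
    return BILLION_HOURS / mtbf_hours

def fits_to_mtbf_hours(fits):
    """MTBF (in hours) implied by a given FIT rate."""
    return BILLION_HOURS / fits

# Hypothetical module with an MTBF of 250,000 hours:
print(mtbf_hours_to_fits(250_000))   # 4000.0 FITs
print(fits_to_mtbf_hours(4000))      # 250000.0 hours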
MTTR
Mean Time To Repair (MTTR) is the time taken to repair a failed hardware module. In an operational system, repair generally means replacing the failed hardware module, so hardware MTTR can be viewed as the mean time to replace a failed hardware module. System designers should aim to meet the system's reliability goals while allowing for a high MTTR, because, as the table below shows, a low MTTR requirement means high operational cost for the system.
Estimating the Hardware MTTR
Where are hardware spares kept? | How is the site manned? | Estimated MTTR
Onsite | Manned 24 hours a day | 30 minutes
Onsite | Operator is on call 24 hours a day | 2 hours
Onsite | Regular working hours on weekdays as well as weekends and holidays | 14 hours
Onsite | Regular working hours on weekdays only | 3 days
Offsite; shipped by courier when a fault condition is encountered | Operator is paged by the system when a fault is detected | 1 week
Offsite; maintained in an operator-controlled warehouse | System is remotely located; an operator needs to be flown in to replace the hardware | 2 weeks
The MTTR of a software module can be computed as the time taken to reboot after a software fault is detected; in other words, software MTTR is the mean time to reboot after a software fault has been detected. The goal of system designers should be to keep the software MTTR as low as possible. MTTR for software depends on several factors.

Availability
Availability of a module is the percentage of time the module is operational. The availability of a hardware/software module can be obtained from its MTBF and MTTR with the formula below:

Availability = MTBF / (MTBF + MTTR)
Availability is typically specified in "nines" notation. For example, 3-nines availability corresponds to 99.9% availability and 5-nines availability corresponds to 99.999% availability.
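A small sketch of the availability formula above, with hypothetical MTBF and MTTR values:

def availability(mtbf, mttr):
    """Steady-state availability from MTBF and MTTR (both in the same time unit)."""
    return mtbf / (mtbf + mttr)

# Hypothetical module: MTBF of 10,000 hours, MTTR of 2 hours (spares onsite, operator on call).
a = availability(10_000, 2)
print(f"Availability: {a:.6f} ({a * 100:.4f}%)")   # roughly 99.98%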
Downtime
Downtime per year is a more intuitive way of understanding availability. The table below maps availability levels to the corresponding downtime per year.
Availability | Downtime per year
90% (1-nine) | 36.5 days
99% (2-nines) | 3.65 days
99.9% (3-nines) | 8.76 hours
99.99% (4-nines) | 52 minutes
99.999% (5-nines) | 5 minutes
99.9999% (6-nines) | 31 seconds
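Each downtime figure in the table is just the unavailability fraction multiplied by one year; the short sketch below reproduces the table.

SECONDS_PER_YEAR = 365 * 24 * 3600   # 31,536,000 seconds in a non-leap year

def downtime_seconds_per_year(availability_percent):
    """Expected downtime per year implied by an availability percentage."""
    return (1.0 - availability_percent / 100.0) * SECONDS_PER_YEAR

for nines, avail in [(1, 90.0), (2, 99.0), (3, 99.9), (4, 99.99), (5, 99.999), (6, 99.9999)]:
    secs = downtime_seconds_per_year(avail)
    print(f"{avail}% ({nines}-nines): {secs / 86400:.2f} days = "
          f"{secs / 3600:.2f} hours = {secs:.0f} seconds per year")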
