
Several formal definitions are needed to quantify VAXcluster availability.

Availability is the proportion of time that service is available from a VAXcluster system to perform a user application.

It is important to remember that this definition of availability is a general one. As the nature of the application, the size of the VAXcluster configuration, and the amount of redundancy change, availability can be defined in more complex ways. For the configurations used in this study, at least one of each type of element must be running for the VAXcluster system to be operational.

Figure 4 Configuration with Redundant Processor and Storage Controller (Model 3). Panels: configuration; reliability block diagram for hardware model; reliability block diagram for reconfiguration model.

Digital Technical Journal No. 5 September 1987

Unavailability is the proportion of time that service is interrupted and that a VAXcluster system cannot perform a user application.

In this study, the related metric of downtime in minutes per year will be used rather than the system unavailability.

Reconfiguration time is the time taken to initially detect a failed element and remove it from the VAXcluster system. For a failed VAX processor, this time also includes the time taken later to re-establish the repaired element's membership in the cluster.

Figure 5 Configuration of Fully Redundant System (Model 4). Panels: configuration; reliability block diagram for hardware model; reliability block diagram for reconfiguration model.


Note that the HSC device employs "warm stand-by" redundancy and therefore does not have any significant reconfiguration time associated with re-establishing membership in the cluster.

VAXcluster reconfiguration activities usually complete in a matter of seconds; however, in extremely rare cases, much longer times are possible.

Overview

The most common approach to modeling complex systems consists of structurally dividing a system into smaller subsystems, such as processors, controllers, and disks [6]. The availability of each subsystem is then analyzed separately, and the individual subsystem solutions are combined to obtain the system solution. One important assumption must be made to achieve a solution: the behavior of each subsystem must be independent from that of any other subsystem.

Furthermore, a decomposition technique can be applied to certain behaviors that cause system outages due to failures in redundant subsystems. In these cases, the recovery to an operational system happens quickly. Similar behavior is also present when the failed subsystem is repaired and is ready to rejoin the system to make it a fully configured system. This type of decomposition is called behavioral decomposition.

With this approach to structural and behavioral decomposition, hardware failures and VAXcluster reconfigurations are modeled separately. Such a decomposition allows the model to analyze both VAXcluster reconfigurations and complete system failures due to hardware failures. It also allows the model to analyze the sensitivity of system availability to each factor.

In this study, availability modeling captured the following factors:

Hard failures requiring a repair call

VAXcluster reconfigurations during which the VAXcluster system was assumed to be unavailable in this analysis

Response time for maintenance personnel

Time-to-repair

The following factors were not considered (except for the impact of reconfigurations due to hardware failures):

Intermittent failures

Transient failures


VAXcluster Systems

VAXcluster Availability Modeling

Quorum disks

Operational errors

Software errors

The following modeling parameters were used:

The mean time-between-failures (MTBF) and mean time-to-repair (MTTR) of each of the following elements:

- VAX processor

- HSC storage controller

- Disk drive

VAXcluster reconfiguration times caused by

- VAX processor failure

- Re-establishment of the repaired VAX processor into the VAXcluster configuration

- HSC storage controller failure

- Disk drive failure

Response time for maintenance

The remainder of this section describes in detail the modeling of the fourth configuration (Model 4).

Analysis of Hardware Failure

Consider the structural decomposition of the VAXcluster configuration. Three subsystems were connected in series, each consisting of two elements in parallel. At least one element in each subsystem had to be operational for the VAXcluster system to be operational. The hardware reliability block diagram is shown in Figure 5.

Repairable systems are those for which an automatic or manual repair can be made if an element fails. Assume that each element is subject to failure and has its own repair facility [7]. If the time-to-failure of element i is exponentially distributed with failure rate $\lambda_i$, and the time-to-repair of element i is exponentially distributed with repair rate $\mu_i$, the instantaneous availability can be obtained by the following equation:
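For reference, a sketch of the standard two-state (up/down) result for such an element is given here. It rests on one added assumption that the surrounding text does not state: element i is operational at t = 0.

```latex
A_i(t) = \frac{\mu_i}{\lambda_i + \mu_i}
       + \frac{\lambda_i}{\lambda_i + \mu_i}\, e^{-(\lambda_i + \mu_i)\, t}
```

The exponential term decays to zero as t grows, so only the constant term $\mu_i/(\lambda_i + \mu_i)$ remains in the long run.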

As t approaches infinity, $A_i(t)$ approaches the steady-state availability, and $A_i$ equals $\mu_i/(\lambda_i + \mu_i)$. The steady-state availability of a single element is given by the following equation:

$$A = \frac{\mu}{\lambda + \mu}$$

in which $\lambda$ is the failure rate of the element and $\mu$ is the repair rate of the element. The time-to-failure and the time-to-repair are assumed to be exponentially distributed.

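The steady-state availability of two elements in parallel is [8]

$$A = 1 - (1 - A_1)(1 - A_2)$$

In Model 4, the elements in each subsystem are two VAX processors, or two HSC storage controllers, or two disk drives. Using the equation above, the availability of the processor subsystem, $A_p$, can be expressed as

$$A_p = 1 - \left\{ \frac{\lambda_p}{\lambda_p + \mu_p} \right\}^2$$

Similarly, the availability of the HSC storage controller subsystem, $A_h$, and the availability of the disk drive subsystem, $A_r$, can be expressed as

$$A_h = 1 - \left\{ \frac{\lambda_h}{\lambda_h + \mu_h} \right\}^2$$

and

$$A_r = 1 - \left\{ \frac{\lambda_r}{\lambda_r + \mu_r} \right\}^2$$

The aggregate availability of the VAXcluster system is

$$A_s = A_p \times A_h \times A_r$$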

For exponentially distributed times, the failure rate, $\lambda$, is 1/MTBF and the repair rate, $\mu$, is 1/MTTR.
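The hardware model can be sketched numerically as follows. This is a minimal illustration, not the study's tooling: the function names and the MTBF/MTTR values are hypothetical, each subsystem is assumed to be a pair of identical, independent elements, and `mttr_hours` is treated as the full time needed to restore an element.

```python
MINUTES_PER_YEAR = 525_600  # 24 hours a day, 365 days per year


def element_availability(mtbf_hours, mttr_hours):
    """Steady-state availability mu / (lambda + mu) of one repairable element."""
    failure_rate = 1.0 / mtbf_hours  # lambda = 1/MTBF
    repair_rate = 1.0 / mttr_hours   # mu = 1/MTTR
    return repair_rate / (failure_rate + repair_rate)


def parallel_pair(availability):
    """Two identical independent elements in parallel: down only if both are down."""
    return 1.0 - (1.0 - availability) ** 2


def hardware_availability(elements):
    """Series combination of redundant subsystems: product of their availabilities."""
    total = 1.0
    for mtbf_hours, mttr_hours in elements.values():
        total *= parallel_pair(element_availability(mtbf_hours, mttr_hours))
    return total


# Hypothetical (MTBF, MTTR) pairs in hours; NOT figures from the study.
EXAMPLE_ELEMENTS = {
    "vax_processor": (5000.0, 4.0),
    "hsc_controller": (20000.0, 4.0),
    "disk_drive": (15000.0, 4.0),
}

a_s = hardware_availability(EXAMPLE_ELEMENTS)
hardware_downtime_minutes = (1.0 - a_s) * MINUTES_PER_YEAR
print(f"A_s = {a_s:.9f}, hardware downtime = {hardware_downtime_minutes:.3f} min/year")
```

Changing one entry in `EXAMPLE_ELEMENTS` at a time gives a rough view of how sensitive the result is to each element's MTBF or MTTR.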

Analysis of Reconfiguration Times

Next, consider the behavioral decomposition caused by the reconfiguration that occurs when one element in a subsystem fails and an automatic failover to a second (redundant) element takes place. During this process, a reconfiguration occurs when a failed element leaves the VAXcluster system. For processors only, another reconfiguration occurs when a repaired processor later rejoins the VAXcluster system. Depending on the user application, the VAXcluster system may be unavailable to perform user applications during these reconfigurations.


For example, consider the following time line:

Figure 6 (time line marking the instants t1, t2, t3, and t4 in order along the TIME axis)

Time t1 to t2 is the VAXcluster reconfiguration time for a failed VAX processor to be detected and removed from the VAXcluster membership. Time t2 to t3 is the repair time for the failed hardware element. Time t3 to t4 is the time for the repaired VAX processor to be re-established in the VAXcluster membership.

Figure 5 includes the reliability block diagram representing the VAXcluster reconfiguration behavior of the Model 4 configuration. Each subsystem is shown as two elements in series. If any single element is not operational, the subsystem can be unavailable due to a VAXcluster reconfiguration.

For two elements in series, the availability is [8]

$$A = A_1 \times A_2$$

In Model 4, the elements in each subsystem are two VAX processors, or two HSC storage controllers, or two disk drives.

Applying the equation above for elements in series, the availability of the processor subsystem, $A_p$, is

$$A_p = \left\{ \frac{\mu_p}{\lambda_p + \mu_p} \right\}^2$$

Note that for the VAX processor, the rate $\mu_p$ is the reciprocal of the sum of the times t1 to t2 and t3 to t4.

Similarly, the availability of the HSC storage controller subsystem, $A_h$, and the availability of the disk drive subsystem, $A_r$, are

$$A_h = \left\{ \frac{\mu_h}{\lambda_h + \mu_h} \right\}^2$$

and

$$A_r = \left\{ \frac{\mu_r}{\lambda_r + \mu_r} \right\}^2$$

The aggregate availability of the VAXcluster system is

$$A_s = A_p \times A_h \times A_r$$

Assuming an operation running 24 hours a day, 365 days per year, the downtime equals $(1 - A_s) \times 525{,}600$ minutes per year. This figure is the downtime caused only by reconfigurations. The total downtime is the sum of the downtime caused by hardware failures and the downtime caused by VAXcluster reconfigurations.
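The reconfiguration model and the downtime sum can be sketched in a few lines. This is only an illustration under stated assumptions: every function name, MTBF, reconfiguration time, and the hardware downtime placeholder below is hypothetical and not taken from the study. Only the processor is given two reconfiguration intervals (leaving and rejoining the cluster).

```python
MINUTES_PER_YEAR = 525_600  # 24 hours a day, 365 days per year


def reconfig_rate_per_hour(*reconfig_seconds):
    """Rate mu: reciprocal of the summed reconfiguration intervals, in 1/hours."""
    return 3600.0 / sum(reconfig_seconds)


def series_pair_availability(mtbf_hours, mu_per_hour):
    """Two identical elements in series: {mu / (lambda + mu)} ** 2."""
    failure_rate = 1.0 / mtbf_hours  # lambda = 1/MTBF
    return (mu_per_hour / (failure_rate + mu_per_hour)) ** 2


def reconfig_downtime_minutes(subsystems):
    """Yearly downtime from A_s, the product of the subsystem availabilities."""
    a_s = 1.0
    for mtbf_hours, mu_per_hour in subsystems:
        a_s *= series_pair_availability(mtbf_hours, mu_per_hour)
    return (1.0 - a_s) * MINUTES_PER_YEAR


# Hypothetical inputs; NOT measurements from the study.
EXAMPLE_SUBSYSTEMS = [
    (5000.0, reconfig_rate_per_hour(30.0, 60.0)),  # VAX processor: t1..t2 plus t3..t4
    (20000.0, reconfig_rate_per_hour(15.0)),       # HSC storage controller: failover only
    (15000.0, reconfig_rate_per_hour(10.0)),       # disk drive: failover only
]

# Placeholder standing in for the hardware model's result, in minutes per year.
HYPOTHETICAL_HARDWARE_DOWNTIME = 0.5

reconfig_downtime = reconfig_downtime_minutes(EXAMPLE_SUBSYSTEMS)
total_downtime = HYPOTHETICAL_HARDWARE_DOWNTIME + reconfig_downtime
print(f"reconfiguration: {reconfig_downtime:.2f} min/year, total: {total_downtime:.2f} min/year")
```

Keeping the two downtime contributions as separate terms mirrors the behavioral decomposition: each can be varied on its own to see which one dominates.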

Extensions to the Models

The simple models considered in this study can be extended in several dimensions.

The complexity of the configurations can be increased either by adding more VAXcluster elements or by extending the bounds of the models to include the Ethernet and its attachments. A complex configuration could include multiple clusters and multiple Ethernet segments. More complex definitions of availability are needed as the configurations increase in complexity. These definitions range from the single-user view to a measure of system productivity.

Only permanent (hard) hardware failures are considered in this study. Intermittent and transient hardware and software failures, as well as operational errors, can be added as extensions to future models. The downtime allocation reported in the literature typically attributes about one third of the total to each of the hardware, software, and operator-induced failures [9]. This result includes the effectiveness of system recovery that can be hardware based, software based, or both.

Certain insidious failures can result in ineffective recovery, even in the presence of hardware or software redundancies. The term "fault coverage" represents the joint probability of fault detection and successful failover to a redundant element. A fault-coverage factor of one is assumed in this study.

This study also assumes that the subsystems of VAX processors, HSC storage controllers, and disk drives are independent. Relaxing this assumption adds to the complexity of the modeling approach. Similarly, a simplistic maintenance strategy is assumed in which each cluster element has its own repair facility.

The extensions described above add more realism to the modeling approach at the expense of added complexity in both model formulation and solution technique. Moreover, the textbook formulae used in this study are limiting and often inappropriate.

Markov modeling is a particularly useful analytic technique for formulating and solving these complex models [7]. Simulation is an alternative but computationally less efficient technique.
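As a small illustration of the Markov approach, the sketch below models one two-element parallel subsystem as a birth-death chain whose states count failed elements. It is a hypothetical example, not a model from the study: the function name and the `repair_crews` parameter are invented here to show how the "own repair facility" assumption could be relaxed to a single shared facility.

```python
def parallel_pair_markov(failure_rate, repair_rate, repair_crews):
    """Steady-state availability of a two-element parallel subsystem.

    States are the number of failed elements (0, 1, 2); the subsystem is
    down only in state 2. repair_crews must be at least 1.
    """
    # Failure transitions: 0 -> 1 at 2*lambda (either element), 1 -> 2 at lambda.
    up_rates = [2.0 * failure_rate, failure_rate]
    # Repair transitions: k -> k-1 at min(k, repair_crews) * mu.
    down_rates = [min(1, repair_crews) * repair_rate,
                  min(2, repair_crews) * repair_rate]

    # Birth-death balance: weight[k+1] = weight[k] * up[k] / down[k].
    weights = [1.0]
    for up, down in zip(up_rates, down_rates):
        weights.append(weights[-1] * up / down)

    both_failed = weights[2] / sum(weights)
    return 1.0 - both_failed


# Hypothetical rates per hour (MTBF 5000 h, MTTR 4 h); NOT figures from the study.
lam, mu = 1.0 / 5000.0, 1.0 / 4.0
own_facility = parallel_pair_markov(lam, mu, repair_crews=2)
shared_facility = parallel_pair_markov(lam, mu, repair_crews=1)
print(f"own repair facility: {own_facility:.9f}, shared: {shared_facility:.9f}")
```

With two repair crews the chain reproduces the closed-form result for two independent elements in parallel, 1 - (lambda / (lambda + mu))^2; with one shared crew the availability is lower, which is the kind of dependency the closed-form product formulas cannot express.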


Another valuable industry-wide tool is the Symbolic Hierarchical Automatic Reliability and Performance Evaluator (SHARPE) software [10]. SHARPE's hierarchical feature allows complex subsystem models to be combined into a system model for efficient solution. SHARPE also employs state-of-the-art matrix-solving routines to solve large and often ill-conditioned problems arising from the Markov model formulation of these complex configurations.