Dependability Enhancement in a Service Oriented Architecture

When developing a dependable system, there are a number of ways by which dependability can be imparted. These are:

• fault prevention,

• fault tolerance,

• fault removal, and

• fault forecasting.

Not every method is applicable in every stage in a service-oriented architecture.

For example, a “time-out” during service discovery cannot be addressed using fault removal techniques, since the fault arises at runtime.

6.4.1 Fault Prevention

Fault prevention is the process of preventing faults from occurring in the first instance. In other words, fault prevention represents the set of actions that can be taken to minimise the number of bugs that are inserted in a given application. One popular fault prevention technique is the use of high-level programming languages such as Java [11]. Another important fault prevention technique is the use of formal methods [4] to ascertain certain guarantees provided by a system. In a service-oriented architecture, formal methods can be used in different ways. For example, formal methods can be used for the specification and verification of service description and discovery [22]. The mathematical notation employed, together with verification, can enable the detection of mismatched assumptions between service providers and clients. Another important line of work in formal specification and verification in service-oriented computing is the verification of service composition [20].

6.4.2 Fault Tolerance

Fault tolerance is the ability of a system to satisfy its specification in spite of faults that can perturb its execution. When designing a fault-tolerant system, the functionality of the fault tolerance mechanisms needed can be factored along two dimensions [2], namely detection and correction.

Detection is needed before triggering the correction part. Detection is achieved using predicates, called detection predicates. A detection predicate captures the conditions that indicate failures in some part of the system. For example, a well-known predicate is the timeout predicate: when a timeout expires, it indicates that a message has not been reliably delivered. Correction mechanisms, on the other hand, attempt to impose a given predicate on the system. A popular correction mechanism is message retransmission, whereby a lost message is retransmitted so that the predicate that captures correct system operation can be reinstated. The use of detection and/or correction mechanisms gives rise to different levels of fault tolerance.

Specifically, the levels of fault tolerance are:

• Fail-safe fault tolerance [2]. It is necessary and sufficient to add detectors to ensure fail-safe fault tolerance. A fail-safe fault-tolerant system is one where the safety of the system is more important than liveness. Several web applications adopt a fail-safe approach in their design.

• Non-masking fault tolerance [2]. It is necessary and sufficient to use correctors to ensure non-masking fault tolerance. A non-masking fault-tolerant system is one where liveness is more important than safety.

• Masking fault tolerance [2]. It is necessary and sufficient to use both detectors and correctors to ensure masking fault tolerance. In masking fault tolerance, both safety and liveness are important. In the domain of web services, masking fault tolerance is especially important if some services become unavailable. Specifically, when a service discovery process times out (detector), another service discovery process can be initiated (corrector). The number of possible retries can be specified as a service parameter.
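The difference between using a detector alone and pairing it with a corrector can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `ScriptedChannel`, `send_fail_safe` and `send_masking` are hypothetical names, the timeout is modelled as a missing acknowledgement rather than a real clock, and the non-masking case (corrector without detector) is not shown.

```python
class ScriptedChannel:
    """Hypothetical channel; acks[i] says whether the i-th send is acknowledged."""

    def __init__(self, acks):
        self.acks = list(acks)
        self.sent = 0
        self.delivered = []

    def send(self, msg):
        # Sends beyond the end of the script are treated as lost.
        ack = self.acks[self.sent] if self.sent < len(self.acks) else False
        self.sent += 1
        if ack:
            self.delivered.append(msg)
        return ack


def timeout_expired(ack_received):
    """Detection predicate: true when no acknowledgement arrived in time."""
    return not ack_received


def send_fail_safe(channel, msg):
    """Detector only: on a timeout, stop instead of continuing."""
    if timeout_expired(channel.send(msg)):
        raise TimeoutError(f"{msg!r} not acknowledged; halting")


def send_masking(channel, msg, max_retries):
    """Detector plus corrector: retransmit while the timeout predicate holds."""
    for attempt in range(1, max_retries + 2):   # first send plus max_retries retries
        if not timeout_expired(channel.send(msg)):
            return attempt                      # number of transmissions used
    raise TimeoutError(f"{msg!r} not delivered after {max_retries} retries")
```

With a channel that loses the first two transmissions, `send_masking(ScriptedChannel([False, False, True]), "m", max_retries=3)` delivers the message on the third attempt, whereas `send_fail_safe` would stop at the first timeout.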

Detectors and correctors can be implemented through connector components.

Given that a connector component can be viewed as a service provider, it needs to export the detection or correction predicate it is implementing. The connector components can be published in a similar way to normal services, and the connectors can be discovered likewise.
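As a rough sketch of such a connector, the Python class below wraps service discovery, lists the predicates it implements, and takes the number of retries as a parameter. All names here (`RetryConnector`, `DiscoveryTimeout`, `exported_predicates`) are hypothetical, and registries are modelled as plain callables rather than any real discovery protocol.

```python
class DiscoveryTimeout(Exception):
    """Raised by a (hypothetical) registry when a lookup times out."""


class RetryConnector:
    """Connector pairing a timeout detector with a re-discovery corrector."""

    # Predicates the connector exports so that it can be published and discovered.
    exported_predicates = ("detects: discovery-timeout", "corrects: re-discovery")

    def __init__(self, registries, max_retries):
        self.registries = list(registries)  # callables: service name -> endpoint
        self.max_retries = max_retries      # retry count given as a service parameter

    def discover(self, service_name):
        # One initial attempt plus at most max_retries further attempts.
        for registry in self.registries[: self.max_retries + 1]:
            try:
                return registry(service_name)
            except DiscoveryTimeout:
                continue                    # detector fired; start another discovery
        raise DiscoveryTimeout(f"no registry answered for {service_name!r}")
```

A client would call `discover` in place of a direct registry lookup; when every permitted attempt times out, the failure is still reported to the caller rather than hidden.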

6.4.3 Fault Removal

During the development process, design faults (also known as bugs) could have been inserted into the system. These bugs, when activated, can cause the system to violate its specification. Thus, it becomes imperative to remove these faults. Fault removal is the process through which these faults are removed. The most popular fault removal technique is testing. During testing, the system is subjected to a range of test cases.

Each test case is designed such that it helps to uncover some bug. The test cases, in general, need to help achieve some test coverage [7]. However, for a carefully developed system, the number of bugs may be very small, making it very difficult to uncover them, even though the system may have been rigorously tested. In other words, the mean time to failure (MTTF) of such a system is high. Because failures are then rarely observed, the high MTTF translates into an inability to guarantee a certain level of dependability of the system. What is then needed is to reduce the MTTF of the system by artificially introducing faults into it. This is achieved through a process called fault injection [15], which we will discuss in the next section. Fault injection introduces artificial faults into the system with the intention of mimicking software bugs. If the program cannot handle the effect of an artificial bug, then it will not be able to handle the effect of the corresponding real bug, should that bug exist.

Hence, it is important to be able to introduce faults that are representative of bugs in the system under test.
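The idea of checking whether a program can handle the effect of an artificial bug can be illustrated with a small Python sketch. The functions below are hypothetical and chosen only for illustration: `inject_empty_input_bug` mimics a bug that loses all input records, and `robust_report` stands for a caller that has an error handler for the resulting failure.

```python
def average(values):
    """Component under test."""
    return sum(values) / len(values)


def inject_empty_input_bug(compute):
    """Hypothetical injector: mimics a bug that loses all input records."""
    def faulty(values):
        return compute([])      # the original input is discarded
    return faulty


def robust_report(compute, values):
    """Caller with an error handler: reports None when the average is undefined."""
    try:
        return compute(values)
    except ZeroDivisionError:
        return None


faulty_average = inject_empty_input_bug(average)

# The guarded caller handles the effect of the injected bug ...
handled = robust_report(faulty_average, [1, 2, 3])

# ... whereas calling the faulty component directly does not.
try:
    faulty_average([1, 2, 3])
    unguarded_failed = False
except ZeroDivisionError:
    unguarded_failed = True
```

How representative such an injected fault is of real bugs is a separate question that this sketch does not address.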

6.4.4 Fault Forecasting: Fault Injection

Fault forecasting is the process during which (i) the number of any residual faults in the system is estimated, and (ii) their impact is analysed. In systems where the runtime execution can be affected by perturbations in the environment, such as embedded systems, the impact of environmental problems also needs to be assessed. However, because the mean time to failure (MTTF) of a system may be very long, it becomes very difficult to have statistically significant confidence in the ability of the system to deliver the required services. Hence, faults have to be artificially introduced to lower the MTTF of the system, which in turn allows us to assess the impact of bugs on the system. Various fault injection techniques and tools have been introduced over the years, and the techniques can generally be divided into three categories:

• simulation-based fault injection,

• physical fault injection, and

• software-implemented fault injection (SWIFI).

In this chapter, we will focus on software-implemented fault injection. We refer the interested reader to [15] for an in-depth discussion of fault injection.

6.4.4.1 Software-Implemented Fault Injection (SWIFI)

Software-implemented fault injection (SWIFI) is by far the most versatile, and possibly the most popular, form of fault injection. The approach uses software to inject faults into physical, and sometimes simulated, systems. Further, errors (the runtime consequence of executing a fault) can also be injected, in which case the state of the system is perturbed at runtime. However, for historical reasons, the process is called fault injection (rather than error injection). There are both advantages and disadvantages to using SWIFI for system validation. In order to generate readouts, and inject faults and errors, a target system has to be instrumented.

The instrumentation process consists of inserting probes for logging variables and events, as well as inserting injection locations for faults and errors. Once faults or errors are injected, data is collected, and later analysed and interpreted for dependability analysis of the system.
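A minimal Python sketch of such instrumentation is given below. The `Instrumentation` class, the `transfer` target and the location name `"transfer.amount"` are all hypothetical; the sketch only shows how probes and an injection location might sit in the code of a target, not how any particular SWIFI tool works.

```python
class Instrumentation:
    """Hypothetical harness holding probe readouts and armed injections."""

    def __init__(self):
        self.log = []          # readouts collected by probes
        self.injections = {}   # location name -> function that perturbs a value

    def probe(self, name, value):
        """Log a variable or event and pass the value through unchanged."""
        self.log.append((name, value))
        return value

    def inject(self, location, value):
        """Perturb the value if an injection is armed at this location."""
        perturb = self.injections.get(location)
        return perturb(value) if perturb else value


def transfer(balance, amount, instr):
    """Instrumented target with one injection location and several probes."""
    amount = instr.inject("transfer.amount", amount)
    instr.probe("amount", amount)
    if amount < 0 or amount > balance:
        instr.probe("rejected", True)
        return balance
    return instr.probe("new_balance", balance - amount)
```

Arming an injection such as `instr.injections["transfer.amount"] = lambda v: -v` and then inspecting `instr.log` shows whether the input check in the target caught the perturbed value.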

In the context of service-oriented architectures, faults can occur at several levels, as discussed in Section 6.3.2. Thus, during fault injection, the faults that are injected need to mimic problems that will lead to failures such as “wrong service called” and “incorrect results”. In a distributed system, faults are injected not only to corrupt variables but also at the network level. For example, faults can be injected by corrupting, dropping or reordering the network packets at the network interface [18]. One way of injecting faults is to instrument the network protocol stack; the problem, however, is that the receiver’s network stack may detect the corruption and reject the packet. Another way is to inject faults at the application level [9, 10], where the types of faults injected include corrupting packet header information and injecting random byte errors into packet payloads.
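The packet-level faults mentioned above can be sketched at the application level in Python as follows. The packet representation (a dictionary of integer header fields and a `bytes` payload) and all function names are assumptions made for this illustration; they are not taken from the tools cited in [9, 10, 18].

```python
import random


def corrupt_payload(payload, n_errors, rng):
    """Inject random byte errors at n_errors distinct positions of the payload."""
    data = bytearray(payload)
    for i in rng.sample(range(len(data)), n_errors):
        data[i] ^= rng.randrange(1, 256)   # XOR with a non-zero value changes the byte
    return bytes(data)


def corrupt_header(header, field, rng, bits=16):
    """Flip one random bit of an integer header field; the input is not modified."""
    faulty = dict(header)
    faulty[field] ^= 1 << rng.randrange(bits)
    return faulty


def drop(packets, index):
    """Drop the packet at the given position."""
    return list(packets[:index]) + list(packets[index + 1:])


def reorder(packets, index):
    """Swap the packets at index and index + 1."""
    out = list(packets)
    out[index], out[index + 1] = out[index + 1], out[index]
    return out
```

Passing a seeded `random.Random` instance as `rng` makes an injection campaign repeatable, which helps when a run that exposed a failure has to be reproduced.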
