
5.4 Data analysis

5.4.1 General approach

Two major assumptions guide our approach to detecting misconfigured machines. First, we assume that the majority of machines in a well maintained pool are properly configured. Second, we assume that misconfigured machines behave differently from other, similar machines. For example, a misconfigured machine might run a job significantly slower, or faster, than that job would run on other, similar, machines. The first of these assumptions limits our approach to systems that are generally operative, and predicts that it would fail if most of the resources are misconfigured. The second assumption limits the usefulness of the GMS to misconfigurations that affect the performance of jobs (and not, e.g., system security).
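As a toy illustration of the second assumption, the following Python sketch flags machines whose mean runtime for one and the same job deviates strongly from the pool median. The machine names, runtimes and the 25% threshold are hypothetical and purely illustrative; the actual GMS analysis is the outlier detection algorithm described below.

    from statistics import median

    # Hypothetical data: mean runtime (in seconds) of the same benchmark
    # job on each machine of the pool.
    runtimes = {'m01': 61.2, 'm02': 59.8, 'm03': 60.5,
                'm04': 112.4,   # suspect: runs the job far slower
                'm05': 60.9}

    med = median(runtimes.values())
    # Flag machines whose mean runtime deviates from the pool median by
    # more than 25% (an arbitrary threshold, for illustration only).
    suspects = {name: t for name, t in runtimes.items()
                if abs(t - med) / med > 0.25}
    print(suspects)   # {'m04': 112.4}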

Our choice of an algorithm is strongly influenced by two computational characteristics of grid systems. First, function shipping in grid systems is, by far, cheaper than data shipping – i.e. it pays to process the data where they reside rather than ship them elsewhere for processing. This, together with the difficulty of storing the data centrally (as discussed in the previous section), motivates a distributed outlier detection algorithm. Second, machines in a grid system are expected to have very low availability. Thus, the algorithm needs to be able to proceed asynchronously and produce results based on the input of only some of the machines.

Finally, our approach is influenced by characteristics of the data itself. It takes many features to accurately describe events which occur in grid systems, and these events are very heterogeneous. For instance, Intel NetBatch reportedly serves more than 700 applications and about 20 000 users. Since every application may put a different load on the system (and furthermore this load may vary greatly according to the input), and since the working habits of users can be very different from one another, the log data generated by NetBatch has a very complex distribution. To simplify the data, one can focus on jobs that are associated with a particular application receiving varied input parameters. For instance, in NetBatch a great percentage of the actual executions are by a single application, which takes in the description of a VLSI circuit and a random input vector and validates that the output of the circuit adheres to its specification. In our implementation, we did not have access to this kind of application. Rather, we emulated it by executing standard benchmarks with random arguments.

Because the data regarding an execution are always generated and stored on the execution machine, the resulting database is horizontally partitioned: a property of which we take advantage in our implementation. On the other hand, sampling, which is popular in knowledge discovery, does not seem to be suited to our particular problem for several reasons. First, sampling is in general less appropriate for sparse distributions, and specifically it is prone to missing outliers. Second, because the distribution of the data is typically highly dependent on the machine from which it was sampled, it seems that uniform sampling would still have to visit each individual machine and would thus not achieve substantial performance gains. Instead, we concentrated on an algorithm that guarantees exact results once all of the machines are available, and which would often yield accurate results even in the absence of many of the well configured machines.
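To illustrate the horizontal partitioning: every machine holds its own rows of one and the same schema, and the global database is the (never materialized) union of the local partitions. The schema and values in this sketch are hypothetical.

    # Each machine stores only the execution records generated locally;
    # all partitions share the same schema (horizontal partitioning).
    S = {
        'm01': [{'job': 'bench-a', 'arg': 17, 'runtime': 61.2, 'mem_mb': 512}],
        'm02': [{'job': 'bench-a', 'arg': 3,  'runtime': 59.8, 'mem_mb': 498}],
    }
    # The global database S_N is the union of the local partitions S_i;
    # the algorithm never actually collects it in one place.
    S_N = [row for rows in S.values() for row in rows]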

5.4.2 Notation

Let $P = \{P_1, P_2, \ldots\}$ be the set of machines participating in the algorithm, and let $S_i = \{x_i^1, x_i^2, \ldots\}$ be the input of machine $P_i$. Each input tuple $x_i^j$ is taken from an arbitrary metric space $D$, on which a metric $d : D \times D \to \mathbb{R}^+$ is defined. We denote by $S_N$ the union of the inputs of all machines.

Throughout the remainder of this chapter we assume that the distances between points in $S_N$ are unique.¹ Among other things, this means that for each $S \subseteq S_N$ the solution of the HilOut outlier detection algorithm is uniquely defined.

¹ This assumption is easily enforced by adding some randomness to the numeric features of each data point.

For an arbitrary tuple $x$ we define the nearest neighbours of $x$, denoted $[x|S]_m$, to be the set of the $m$ points in $S$ which are closest to $x$. For two sets of points $S, R \subset D$ we define the nearest neighbours of $R$ from $S$, $[R|S]_m$, to be the union of the nearest neighbours from $S$ of every point in $R$.

We denote by $\hat{d}(x, S)$ the average distance of $x$ from the points in $S$. Consequently, $\hat{d}(x, [x|S]_m)$ denotes the average distance of $x$ from its $m$ nearest neighbours in $S$. For any set $S$ of tuples from $D$, we define $A_{k,m}(S)$ to be the top $k$ outliers as computed by the (centralized) HilOut algorithm when executed on $S$. By the definition of HilOut, these are the $k$ points of $S$ such that for all $x \in A_{k,m}(S)$ and $y \in S \setminus A_{k,m}(S)$ we have $\hat{d}(x, [x|S]_m) > \hat{d}(y, [y|S]_m)$.
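To make the notation concrete, the following minimal Python sketch renders these definitions for points in $\mathbb{R}^n$ under the Euclidean metric. The function names are illustrative, and we adopt the usual convention that a point is excluded from its own neighbour set; the guard in d_hat only covers the degenerate empty case.

    import math

    def dist(x, y):
        # The metric d on D; here D = R^n with the Euclidean distance.
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

    def nearest_neighbours(x, S, m):
        # [x|S]_m : the m points of S that are closest to x.
        return sorted(S, key=lambda y: dist(x, y))[:m]

    def d_hat(x, S):
        # The average distance of x from the points in S.
        if not S:
            return 0.0
        return sum(dist(x, y) for y in S) / len(S)

    def top_k_outliers(S, k, m):
        # A_{k,m}(S): the k points of S with the largest average distance
        # to their m nearest neighbours, i.e. the centralized HilOut output.
        def score(x):
            others = [y for y in S if y != x]
            return d_hat(x, nearest_neighbours(x, others, m))
        return sorted(S, key=score, reverse=True)[:k]

Under the uniqueness assumption above, the ranking induced by score has no ties, so $A_{k,m}(S)$ is well defined.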

5.4.3 Algorithm

The basic idea of the Distributed HilOut algorithm is to have the machines jointly construct a set of input points $S_G$ from which the solution can be deduced. $S_G$ would have three important qualities. First, it is eventually shared by all of the machines. Second, the solution of HilOut, when calculated from $S_G$, is the same as the one calculated from $S_N$ (i.e. $A_{k,m}(S_G) = A_{k,m}(S_N)$). Third, the nearest neighbours of the solution on $S_G$ from $S_G$ are the same as the nearest neighbours from the entire set of inputs $S_N$ (i.e. $[A_{k,m}(S_G)|S_G]_m = [A_{k,m}(S_G)|S_N]_m$).
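Reusing the functions from the sketch in Section 5.4.2, the second and third qualities can be written as an executable check. This is conceptual only: in the running system $S_N$ is never gathered in one place, so nothing ever evaluates these assertions directly.

    def qualities_hold(S_G, S_N, k, m):
        # Points are assumed hashable, e.g. tuples of numbers.
        A_G = top_k_outliers(S_G, k, m)
        # Second quality: A_{k,m}(S_G) = A_{k,m}(S_N).
        same_solution = set(A_G) == set(top_k_outliers(S_N, k, m))
        # Third quality: [A_{k,m}(S_G)|S_G]_m = [A_{k,m}(S_G)|S_N]_m.
        same_neighbours = all(
            set(nearest_neighbours(x, [y for y in S_G if y != x], m)) ==
            set(nearest_neighbours(x, [y for y in S_N if y != x], m))
            for x in A_G)
        return same_solution and same_neighbours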

Since many of the machines are rarely available, the progression of $S_G$ over time may be slow. Every time a machine $P_i$ becomes available (i.e. a grid resource can accept a job related to the analysis), it receives the latest updates to $S_G$ and has a chance to contribute to $S_G$ from $S_i$. By tracking the contributions of machines to $S_G$, an external observer can compute an ad hoc solution to HilOut at any given time. Besides permitting progress even when resources are temporarily (sometimes lastingly) unavailable, the algorithm has two additional benefits.

First, the size of $S_G$ is often very small relative to $S_N$, and the number of machines contributing to $S_G$ is very small relative to the overall number of machines. Second, $A_{k,m}(S_G)$ often converges quite quickly, and the rest of the computation deals solely with the convergence of $[A_{k,m}(S_G)|S_G]_m$; $A_{k,m}(S_G)$ converges quickly because many of the well configured machines can be used to weed out a non-outlier that is wrongly suspected to be an outlier.

The details of the Distributed HilOut algorithm are given in Algorithms 1–3. The algorithm is executed by a sequence of recursive workflows. The first algorithm, Algorithm 1, is run by the user. It submits a workflow (Algorithm 2) to every resource in the pool and terminates. Afterwards, each of these workflows submits a job (Algorithm 3) to its designated resource and awaits the job's successful termination. If the job returns with an empty output, the workflow terminates. Otherwise, it adds the output to $S_G$ and recursively submits another workflow – similar to itself – to each resource in the pool.

If there are points from $S_i$ that should be added to $S_G$, they are removed from $S_i$ and returned as the output of the job. This happens under one of two conditions: (1) when there are points in the solution of HilOut over $S_i \cup S_G$ which come from $S_i$ and not from $S_G$; (2) when there are nearest neighbours from $S_i \cup S_G$ to the solution as calculated over $S_G$ alone which are part of $S_i$ and not of $S_G$.
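In code, the test behind the two conditions looks roughly as follows (a sketch reusing the Section 5.4.2 functions; the full job logic, given as Algorithm 3 below, additionally iterates this test):

    def points_to_move(S_i, S_G, k, m):
        union = S_G + S_i    # S_G and S_i are disjoint lists of points
        moved = []
        # Condition (1): local points that appear in the HilOut solution
        # over S_i ∪ S_G.
        for x in top_k_outliers(union, k, m):
            if x in S_i:
                moved.append(x)
        # Condition (2): local points among the nearest neighbours, taken
        # from S_i ∪ S_G, of the solution computed over S_G alone.
        for x in top_k_outliers(S_G, k, m):
            for y in nearest_neighbours(x, [z for z in union if z != x], m):
                if y in S_i and y not in moved:
                    moved.append(y)
        return moved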

Moving points from $S_i$ to $S_G$ may change the outcome of HilOut on $S_G$. Thus, the second condition needs to be repeatedly evaluated by $P_i$ until no more points are moved from $S_i$ to $S_G$. Strictly for the sake of efficiency, this repeated evaluation is encapsulated in a while loop; otherwise, the same repeated evaluation would result from the recursive call to Algorithm 3.

ALGORITHM 1. Distributed HilOut – User Side

Input: the number of desired outliers, $k$, and the number of nearest neighbours to be considered for each outlier, $m$

Initialization:
Set $S_G \leftarrow \emptyset$
For every $P_i$ submit a Distributed HilOut workflow with arguments $P_i$, $k$, and $m$

On request for output: provide $A_{k,m}(S_G)$ as the ad hoc output

ALGORITHM 2. Distributed HilOut Workflow

Arguments: $P_i$, $k$, and $m$

Submit a Distributed HilOut job to $P_i$ with arguments $k$, $m$, and $S_G$
Wait for the job to return successfully with output $R$
Set $S_G \leftarrow S_G \cup R$
If $R \neq \emptyset$, submit a Distributed HilOut workflow for every $P_j \neq P_i$ with arguments $P_j$, $k$, and $m$

ALGORITHM 3. Distributed HilOut Job

Job parameters: $k$, $m$, and $S_G$
Input at $P_i$: $S_i$

Job code:
Set $Q \leftarrow A_{k,m}(S_G \cup S_i)$
Do
  $Q \leftarrow Q \cup [A_{k,m}(S_G \cup Q) | S_G \cup S_i]_m$
While $Q$ changes
Set $R \leftarrow Q \setminus S_G$
Set $S_i \leftarrow S_i \setminus R$
Return with $R$ as output

Optimizations One optimization that we found necessary in our implementation is to store at every execution machine the latest version of $S_G$ it has received. In this way, the argument to every Distributed HilOut job can be just the changes in $S_G$ rather than the full set. A second optimization is to purge, at the beginning of a workflow, all of the other workflows intended for the same resource from the Condor queue. This is possible since the purpose of every such workflow is only to propagate some additional changes to the resource; since the first workflow transfers all of the accumulated changes, the rest of the workflows are not needed.
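To see Algorithms 1–3 working end to end, the following single-process sketch replaces Condor submission, the recursive workflows and the delta optimization with a plain round-robin loop over in-memory partitions; it reuses the Section 5.4.2 functions, and all remaining names are illustrative.

    def union(A, B):
        # Set union over lists of hashable points, preserving order.
        return A + [x for x in B if x not in A]

    def hilout_job(S_i, S_G, k, m):
        # Algorithm 3, run 'at' machine i on its local partition S_i.
        Q = top_k_outliers(union(S_G, S_i), k, m)
        while True:                            # 'While Q changes'
            candidates = [y for x in top_k_outliers(union(S_G, Q), k, m)
                          for y in nearest_neighbours(
                              x, [z for z in union(S_G, S_i) if z != x], m)]
            added = [y for y in candidates if y not in Q]
            if not added:
                break
            Q = Q + added
        R = [x for x in Q if x not in S_G]     # R = Q \ S_G
        for x in R:                            # moved points leave S_i
            S_i.remove(x)
        return R

    def distributed_hilout(partitions, k, m):
        # Algorithms 1 and 2 collapsed into a loop: keep visiting the
        # machines until a full pass contributes nothing to S_G.
        S_G = []
        changed = True
        while changed:
            changed = False
            for S_i in partitions:
                R = hilout_job(S_i, S_G, k, m)
                if R:
                    S_G = S_G + R
                    changed = True
        return top_k_outliers(S_G, k, m)       # the ad hoc (here: final) output

    # Toy run: five one-dimensional points spread over three 'machines';
    # the point 112.4 is the clear outlier.
    parts = [[(60.1,), (59.8,)], [(60.5,), (112.4,)], [(61.0,)]]
    print(distributed_hilout(parts, k=1, m=2))   # [(112.4,)]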