
9.3 Some data mining techniques in parallel

In this section, we present parallel versions of some of the data mining techniques discussed in the previous chapters.

A = one sample    S = set of samples

equally divide the samples in S among the p processors
for each processor (in parallel)
   for all the samples s in S and in the processor memory
      compute the distance between s and A
      save each distance in the vector dist
   end for
   partDist = minimum distance value in dist
   send partDist to all the other processors
   receive partDist from the other processors
   minDist = minimum among all the partDist
end for

Fig. 9.2 A parallel algorithm for computing the minimum distance between one sample and a set of samples in parallel.
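For instance, the pseudocode in Figure 9.2 maps naturally onto MPI, where the exchange of the partial minima is a single collective reduction. The following C sketch assumes that each processor already stores its np local samples row-wise in localS and the reference sample in A; the names, the dimension m and the choice of the Euclidean distance are illustrative assumptions, not taken from the book's code in Appendix B.

   #include <math.h>
   #include <float.h>
   #include <mpi.h>

   /* Euclidean distance between two samples of dimension m (illustrative choice) */
   double distance(const double *x, const double *y, int m) {
      double d = 0.0;
      for (int j = 0; j < m; j++) d += (x[j] - y[j]) * (x[j] - y[j]);
      return sqrt(d);
   }

   /* Minimum distance between sample A and the whole set S, whose np local
      samples are stored row-wise in localS on each processor (MPI assumed
      to be already initialized).                                           */
   double min_distance(const double *A, const double *localS, int np, int m) {
      double partDist = DBL_MAX, minDist;
      for (int i = 0; i < np; i++) {
         double d = distance(A, &localS[i * m], m);
         if (d < partDist) partDist = d;       /* local minimum: "partDist" */
      }
      /* one collective reduction replaces the send/receive loop of Figure 9.2 */
      MPI_Allreduce(&partDist, &minDist, 1, MPI_DOUBLE, MPI_MIN, MPI_COMM_WORLD);
      return minDist;                          /* every processor gets minDist */
   }

The MPI_Allreduce call carries out both the send and the receive steps of the pseudocode at once, so that every processor ends up with the same value of minDist.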

9.3.1 k-means

k-means is a data mining technique for clustering. A detailed description of the technique is provided in Chapter 3, where several improvements and variants of the technique are also discussed. A sketch of the standard k-means algorithm is provided in Figure 3.2. An application in C implementing a simple variant of the algorithm, referred to as the h-means algorithm (see Figure 3.9), is presented in Appendix B. The aim of the algorithm is to find a suitable partition of a set of samples into clusters. The main tasks to be performed during the algorithm are the computation of all the distances between the samples and the centers of the clusters, and the computation of the centers themselves. These tasks can be carried out in parallel in order to speed the clustering algorithm up. Some parallel versions of the k-means algorithm can be found in [119, 191, 222, 250]. In the following we present a parallel h-means algorithm based on the ideas presented in [119].

The set of data to be partitioned can be divided among the memories of the processors involved in the computation. If p is the number of processors and n is the size of the set of data, every processor can store in its own memory approximately np = n/p samples. This reduces the quantity of memory that must be devoted to storing the data on each processor, and each processor then works only on the samples allocated to it. Each cluster has a center; the current k centers of the clusters can be stored in the memories of all the processors, because they are frequently needed during the algorithm.
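As a small illustration of this block distribution, each processor can compute the indices of its own samples directly from its rank. The following C/MPI sketch uses hypothetical variable names (n, np, first) and also spreads the remainder n mod p over the first processors when n is not a multiple of p.

   #include <stdio.h>
   #include <mpi.h>

   int main(int argc, char **argv) {
      int rank, p, n = 10000;      /* n: total number of samples (example value) */
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &p);

      /* block distribution: the first n % p processors get one extra sample */
      int np    = n / p + (rank < n % p ? 1 : 0);
      int first = rank * (n / p) + (rank < n % p ? rank : n % p);

      printf("processor %d stores samples %d..%d (np = %d)\n",
             rank, first, first + np - 1, np);
      MPI_Finalize();
      return 0;
   }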

The computation of the centers of the clusters can be performed in parallel as follows. Each processor holds in its own memory only a part of the samples to partition. Hence it does not have all the information needed for computing the centers, because its samples can belong to different clusters. A center is simply computed as the arithmetic mean of all the samples belonging to the same cluster: the sum of all the samples in the cluster is computed, and the obtained value is divided by the number of samples.

for each processor kp in {0, 1, ..., p-1} (in parallel)
   compute the partial sums of the samples belonging to the same cluster
   send the partial sums to the other processors
   receive the partial sums from the other processors
   for each cluster
      sum the partial sums
      divide the total sum by the number of samples in the current cluster
   end for
end for

Fig. 9.3 A parallel algorithm for computing the centers of clusters in parallel.

This simple task must be split into p sub-tasks in order to compute the centers in parallel. The main idea is that each processor exploits all the information it has, while the number of communications is kept to a minimum. The samples stored on a processor can in general belong to any of the k clusters. Hence, only the partial sums of the samples belonging to the same cluster can be computed on each processor; during this phase, no communications are needed. When these partial sums have been computed, the processors need to exchange them in order to compute the centers: each processor sends its partial sums to the others and receives the partial sums from all the other processors. After the communication phase, each processor can add up the p partial sums and then divide the result by the number of samples contained in each of the k clusters. In this way, each processor obtains the k cluster centers. A sketch of this parallel algorithm is given in Figure 9.3.
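In MPI, the exchange of the partial sums in Figure 9.3 can be expressed with collective reductions. The following C sketch is only one possible realization under illustrative assumptions: the local samples are stored row-wise in localS, cluster[i] gives the cluster index of the i-th local sample, and the resulting centers (k vectors of dimension m) end up replicated on every processor; the function and variable names are not taken from the book's Appendix B code.

   #include <stdlib.h>
   #include <mpi.h>

   /* Computes the k cluster centers as in Figure 9.3 (sketch). */
   void compute_centers(const double *localS, const int *cluster,
                        int np, int m, int k, double *centers) {
      double *partSum = calloc((size_t)k * m, sizeof(double));
      double *totSum  = malloc((size_t)k * m * sizeof(double));
      int    *partCnt = calloc(k, sizeof(int));
      int    *totCnt  = malloc(k * sizeof(int));

      /* partial sums and counts: no communication is needed in this phase */
      for (int i = 0; i < np; i++) {
         int c = cluster[i];
         for (int j = 0; j < m; j++) partSum[c * m + j] += localS[i * m + j];
         partCnt[c]++;
      }

      /* exchange the partial sums and the partial counts in one step each */
      MPI_Allreduce(partSum, totSum, k * m, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
      MPI_Allreduce(partCnt, totCnt, k,     MPI_INT,    MPI_SUM, MPI_COMM_WORLD);

      /* divide the total sums by the number of samples in each cluster */
      for (int c = 0; c < k; c++)
         for (int j = 0; j < m; j++)
            centers[c * m + j] = (totCnt[c] > 0) ? totSum[c * m + j] / totCnt[c]
                                                 : 0.0;   /* empty cluster */

      free(partSum); free(totSum); free(partCnt); free(totCnt);
   }

The two MPI_Allreduce calls implement both the send and the receive phases of Figure 9.3 at once; after them, every processor holds the same total sums and counts, and hence the same centers.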

At each step of the h-means algorithm, the distances between each sample and the centers of the clusters are computed, and each sample is assigned to the cluster having the closest center. Since all the processors know the centers, this task can be performed independently on each processor, on the samples stored in its local memory. No communication is needed at all, which makes this phase of the h-means algorithm very efficient in parallel. When all the samples have been re-assigned, new centers need to be computed, and the parallel procedure discussed above can be reused. A communication phase is included in that procedure, but it is the only one needed during an entire iteration of the h-means algorithm. Therefore, this parallel version of the h-means algorithm can be implemented efficiently on parallel computers. A sketch of the parallel h-means algorithm is given in Figure 9.4.
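Putting the two phases together, the main loop of the parallel h-means algorithm can be sketched in C as follows. The sketch reuses the hypothetical distance() and compute_centers() functions shown above, assumes that cluster[] already contains an initial (e.g., random) assignment, and, to keep itself self-contained, adds one small reduction per iteration to detect that the centers have become stable.

   /* Parallel h-means main loop (sketch). */
   void parallel_hmeans(const double *localS, int *cluster,
                        int np, int m, int k, double *centers) {
      int changed, changedGlobal = 1;

      compute_centers(localS, cluster, np, m, k, centers);  /* initial centers */

      while (changedGlobal) {
         changed = 0;
         /* assignment step: purely local, no communication at all */
         for (int i = 0; i < np; i++) {
            int best = 0;
            double dBest = distance(&localS[i * m], &centers[0], m);
            for (int c = 1; c < k; c++) {
               double d = distance(&localS[i * m], &centers[c * m], m);
               if (d < dBest) { dBest = d; best = c; }
            }
            if (best != cluster[i]) { cluster[i] = best; changed = 1; }
         }
         /* the centers are stable only if no processor moved any sample */
         MPI_Allreduce(&changed, &changedGlobal, 1, MPI_INT, MPI_MAX,
                       MPI_COMM_WORLD);
         if (changedGlobal)
            compute_centers(localS, cluster, np, m, k, centers);
      }
   }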

9.3.2 k-NN

k-NN is a method for classification (see Chapter 4). It classifies unknown samples by checking the classification of the k nearest known samples contained in a given training set. The basic k-NN algorithm is provided in Figure 4.2. k-NN is a very simple algorithm, but it can be computationally expensive if the training set and the set of samples to be classified are large. Parallel computing can help in speeding the algorithm up.

equally divide the samples among the p processors
for each processor kp in {0, 1, ..., p-1} (in parallel)
   randomly assign each sample in processor kp to one cluster
end for
compute the cluster centers in parallel
while the centers are not stable
   for all the samples Sample(i) in processor kp
      compute the distances between Sample(i) and all the centers
      find k' such that the k'-th center is the closest to Sample(i)
      assign Sample(i) to the cluster k'
   end for
   recompute the centers of the clusters in parallel
end while

Fig. 9.4 A parallel version of the h-means algorithm.

Since the basic algorithm is very simple, its parallel versions are simple as well.

The simplest parallel solution is presented in [84]. A given unknown sample must be compared to all the samples contained in the training set, but not to any of the other unknown samples. Hence, if the set of unknown samples is divided among the p processors involved in the parallel computation, and the training set is replicated on each of them, the processors can work independently of each other. The standard k-NN algorithm can then be performed on each processor by using the whole training set and only the part of the set of samples to be classified that it stores. This is highly efficient, because no communications at all are needed during the computation. A sketch of this parallel k-NN algorithm is given in Figure 9.5.

equally divide the unknown samples among the p processors
for each processor kp in {0, 1, ..., p-1} (in parallel)
   for all the unknown samples UnSample(i) in processor kp
      for all the known samples Sample(j)
         compute the distance between UnSample(i) and Sample(j)
      end for
      find the k smallest distances and check the related classifications
      assign UnSample(i) to the class which appears most frequently
   end for
end for

Fig. 9.5 A parallel version of the k-NN algorithm.
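As with the previous algorithms, Figure 9.5 translates into C rather directly. The sketch below classifies the local unknown samples against a training set that is assumed to be replicated on every processor; it reuses the hypothetical distance() function of the earlier sketches, and all the variable names (train, unknown, nClasses, and so on) are illustrative. Note that no MPI call appears: with a replicated training set the classification is entirely local.

   #include <stdlib.h>
   #include <float.h>

   /* Classification of the local unknown samples as in Figure 9.5 (sketch). */
   void parallel_knn(const double *train, const int *trainClass, int nT,
                     const double *unknown, int npU, int m, int k,
                     int nClasses, int *outClass) {
      double *bestD = malloc(k * sizeof(double));
      int    *bestC = malloc(k * sizeof(int));
      int    *votes = malloc(nClasses * sizeof(int));

      for (int i = 0; i < npU; i++) {
         for (int q = 0; q < k; q++) { bestD[q] = DBL_MAX; bestC[q] = -1; }

         /* distances to all known samples; keep the k smallest by insertion */
         for (int j = 0; j < nT; j++) {
            double d = distance(&unknown[i * m], &train[j * m], m);
            int q = k - 1;
            if (d >= bestD[q]) continue;
            while (q > 0 && bestD[q - 1] > d) {
               bestD[q] = bestD[q - 1]; bestC[q] = bestC[q - 1]; q--;
            }
            bestD[q] = d; bestC[q] = trainClass[j];
         }

         /* majority vote among the k nearest known samples */
         for (int c = 0; c < nClasses; c++) votes[c] = 0;
         for (int q = 0; q < k; q++) if (bestC[q] >= 0) votes[bestC[q]]++;
         int bestClass = 0;
         for (int c = 1; c < nClasses; c++)
            if (votes[c] > votes[bestClass]) bestClass = c;
         outClass[i] = bestClass;
      }
      free(bestD); free(bestC); free(votes);
   }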

If the training set is much larger than the set of unknown samples, then this parallel algorithm may not be as effective. This is a rather rare situation, though, in which a large amount of known data is available for classifying only a few unknown samples. In any case, in order to reduce the computational time, large training sets can be reduced by using one of the techniques presented in Section 4.2. If the training set is still too large to be stored in every processor memory, it can be split among the p processors. The basic k-NN algorithm could then be carried out locally on each processor, but a communication phase would be needed before an unknown sample could be classified. For this reason, if there are no problems related to the memory available on each processor, the algorithm in Figure 9.5 is the most efficient one. Other studies on parallel versions of the k-NN algorithm are presented in [7].
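One possible way to organize the communication phase just mentioned is sketched below, under illustrative assumptions: for a single unknown sample, each processor has already selected its k local candidates (distances in locD, class labels in locC, for instance with the insertion scheme of the previous sketch), all labels are valid values in 0..nClasses-1, and the k global nearest neighbors are then chosen from the p*k gathered candidates. This is only one of several reasonable designs, not necessarily the scheme used in [84] or [7].

   #include <stdlib.h>
   #include <mpi.h>

   /* One unknown sample, training set split among the p processors (sketch). */
   int classify_split_training(const double *locD, const int *locC,
                               int k, int p, int nClasses) {
      double *allD  = malloc((size_t)p * k * sizeof(double));
      int    *allC  = malloc((size_t)p * k * sizeof(int));
      int    *votes = calloc(nClasses, sizeof(int));

      /* every processor receives the k local candidates of all processors */
      MPI_Allgather(locD, k, MPI_DOUBLE, allD, k, MPI_DOUBLE, MPI_COMM_WORLD);
      MPI_Allgather(locC, k, MPI_INT,    allC, k, MPI_INT,    MPI_COMM_WORLD);

      /* select the k smallest distances among the p*k candidates */
      for (int q = 0; q < k; q++) {
         int best = -1;
         for (int j = 0; j < p * k; j++)
            if (allC[j] >= 0 && (best < 0 || allD[j] < allD[best])) best = j;
         if (best >= 0) { votes[allC[best]]++; allC[best] = -1; /* mark used */ }
      }

      /* majority vote, as in the replicated-training-set version */
      int bestClass = 0;
      for (int c = 1; c < nClasses; c++)
         if (votes[c] > votes[bestClass]) bestClass = c;

      free(allD); free(allC); free(votes);
      return bestClass;
   }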

9.3.3 ANNs

Neural networks can be used for solving classification problems (see Chapter 5). They are inspired by studies of the human brain. The multilayer perceptron is a neural network in which the neurons are organized in layers: the input signal is fed to the network through the input layer and then propagates layer by layer, until the neurons of the output layer provide the network output. The neurons on the same layer process the data they receive simultaneously and then send their outputs to the neurons of the following layer. A general scheme of the multilayer perceptron is given in Figure 5.1.

There is an inherent parallelism in neural networks [206]. Neurons belonging to the same layer work in parallel, and hence they can be distributed over different processors. When the signal passes from one layer to another, each neuron in the first layer must send its information to all the neurons of the second layer. During this phase, therefore, the processors need to communicate with each other in order to receive information from neurons hosted on other processors. Every time a certain input is given to the network, the processors need to communicate a number of times equal to the number of layers before obtaining the corresponding output. The amount of communication is therefore high, and neural networks can be used efficiently in parallel environments only if the number of neurons on each layer is sufficiently large. For small networks, instead, this kind of parallelism would not be so efficient: the computational cost in terms of operations would be much smaller than the computational cost in terms of communications.
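The communication pattern described above can be sketched in C/MPI for a single layer transition. In the sketch, each processor owns nLocal neurons of the current layer (their weights W and biases b), the full output of the previous layer is already known by every processor, and one collective MPI_Allgather per layer assembles the full output vector; the layer size is assumed to be divisible by p, and all names as well as the sigmoid activation are illustrative choices.

   #include <math.h>
   #include <stdlib.h>
   #include <mpi.h>

   /* Forward propagation through one layer with the neurons distributed
      among the p processors (sketch). W is nLocal x nIn, stored row-wise;
      "in" has length nIn; "out" receives the full layer output of length
      nLocal * p on every processor.                                        */
   void layer_forward(const double *W, const double *b, const double *in,
                      int nIn, int nLocal, double *out) {
      double *local = malloc(nLocal * sizeof(double));

      /* each processor computes the outputs of its own neurons:
         no communication is needed during this part of the work           */
      for (int i = 0; i < nLocal; i++) {
         double s = b[i];
         for (int j = 0; j < nIn; j++) s += W[i * nIn + j] * in[j];
         local[i] = 1.0 / (1.0 + exp(-s));        /* sigmoid activation */
      }

      /* one communication per layer: every processor obtains the whole
         output vector, needed as input by the neurons of the next layer   */
      MPI_Allgather(local, nLocal, MPI_DOUBLE,
                    out,   nLocal, MPI_DOUBLE, MPI_COMM_WORLD);
      free(local);
   }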

Another kind of parallelism can also be introduced for neural networks. The training process of a neural network can be formulated as a global optimization problem where the function to be optimized is the function (5.3) (see Section 5.2).

As already pointed out, global optimization problems can be difficult to solve, and the methods designed for solving them may return a solution which is actually only one of the local optima. For more details refer to Section 1.4. When meta-heuristic algorithms are used, different executions of the algorithm can lead to different solutions, because the algorithm is probabilistically driven. In such cases, the algorithm is performed more than once on the exact same problem, and the best solution obtained over a certain number of trials is considered to be the global optimum.

The parallelism can then be introduced at another level. The training phase of the neural network can be considered as sequential and not parallel. However, several training phases can be performed in parallel on a parallel computer. On each processor, one training phase can start by using different initial parameters (such as the
