
Data Mining in a Parallel Environment

9.1 Parallel computing

In this section, we give a very brief introduction to parallel computing, with the aim of giving the reader the basic knowledge needed to understand the parallel version of some of the data mining techniques discussed in this book. A very simple example of a parallel algorithm is presented in Section 9.2. A parallel version of the k-means algorithm, the k-nearest neighbor decision rule, and the training phases of a neural network and a support vector machine are presented in Section 9.3.

When there is the need to analyze a large amount of data, the parallel computing paradigm can be used to carry out this task while reducing both the computational time and the memory requirement. A parallel environment is a machine or a set of machines in which several processors can simultaneously work on the same task.

When working in a parallel environment, the computational time needed for carrying out a standard algorithm is reduced, because the algorithm is performed in parallel on several processors. The basic idea is to split the problem at hand into smaller subproblems that can be solved on different processors simultaneously. Each processor can also have a private memory in which it can store its own data. This reduces the memory requirement on each single processor.

The simplest and cheapest way to build a parallel machine is to interconnect single personal computers by a network and make them work together. The parallel machine obtained in this way is also called a Beowulf cluster of computers. Each computer of the cluster can keep working independently from the others, but they can also work in parallel on a single task. These clusters of computers belong to the group of the MIMD parallel architectures, where MIMD stands for multiple instruction multiple data. As already mentioned, the basic idea in parallel computing is to divide a certain problem into smaller subproblems. On MIMD computers, each subproblem is solved independently from the others on different processors. The instructions are multiple, and therefore each subproblem can be solved by using an algorithm which is completely different from the ones implemented on the other processors. The data are also multiple, and therefore each processor can refer to a set of data different from the ones the other processors refer to.

Thus, each processor on a MIMD computer can work independently from the others. It is very important that the computational load on each processor is as balanced as possible. In other words, it is important that the computational cost of solving each subproblem is similar on each processor. If this aim is reached, the time for solving a problem in parallel on a machine with p processors can be the time for solving it on a sequential machine divided by p. However, this result is quite difficult to reach. Indeed, the subproblems into which a problem is split usually depend on each other. For instance, some variable computed while solving one of these subproblems may be needed for the solution of another subproblem. For this reason, the processors often need to exchange data among them, and this operation may have a relevant computational cost. A good parallel algorithm is one in which the computational load is well distributed among the processors and the number of synchronizations among the processors is limited.
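As a small illustration of the load balancing discussed above, the following C sketch divides N independent work items among p processors so that the chunk sizes differ by at most one item. It is only a minimal example; the variable names N, p and rank are illustrative and are not taken from the book.

#include <stdio.h>

int main(void) {
    long N = 1000000;                  /* total number of work items      */
    int  p = 4;                        /* number of processors            */
    for (int rank = 0; rank < p; rank++) {
        long chunk = N / p;            /* base chunk size                 */
        long rest  = N % p;            /* leftover items                  */
        /* the first 'rest' processors take one extra item each           */
        long start = rank * chunk + (rank < rest ? rank : rest);
        long size  = chunk + (rank < rest ? 1 : 0);
        printf("processor %d: items [%ld, %ld)\n", rank, start, start + size);
    }
    return 0;
}

With N = 10 and p = 3, for instance, the three processors get 4, 3 and 3 items, which is as balanced as possible.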

Nowadays, Beowulf clusters of computers are widely used. They work as a MIMD computer in which each processor can be located on a different personal computer. In particular, a Beowulf cluster is a MIMD parallel computer with distributed memory.

In fact, each personal computer is equipped with its own processor and its own memory. Each processor can then access only its own memory, and not the memories of the others. Therefore, when the processors need to synchronize and exchange data, they actually need to communicate. Some processor may need to send its data to another processor or to all the processors. Some other processor may need to receive these data and store them in its own memory. All these communications have a computational cost, which depends on the particular Beowulf cluster.

It is worth noting that other parallel computers with different architectures exist.

For instance, MIMD computers can also have a shared memory. In this case, all the processors of the parallel machine refer to the same memory. Therefore, this memory must be large enough to contain the data needed by all the processors. Moreover, the processors read and write on the same memory, and hence the data a processor accesses can be modified by another processor. This makes the development of algorithms for MIMD computers with shared memory more complex. On the other hand, there is no need for communications, and therefore there is no communication cost in this case. Figure 9.1 compares the MIMD computers with distributed and shared memory.

Fig. 9.1 A graphic scheme of the MIMD computers with distributed and shared memory.
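For shared-memory MIMD machines, a common programming model (not covered in this book) is OpenMP. The following minimal C sketch lets several threads read the same array from the shared memory and accumulate a sum; the reduction clause is what prevents the threads from modifying the shared variable sum at the same time, which is exactly the kind of conflict mentioned above.

#include <stdio.h>
#include <omp.h>

int main(void) {
    double data[1000], sum = 0.0;
    for (int i = 0; i < 1000; i++) data[i] = 1.0;

    /* the iterations of the loop are distributed among the threads;
       'reduction' gives each thread a private copy of sum and combines
       the copies at the end, avoiding concurrent writes to shared memory */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < 1000; i++)
        sum += data[i];

    printf("sum = %f\n", sum);
    return 0;
}

The sketch can be compiled, for example, with gcc -fopenmp.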

Another kind of parallel machine is the SIMD computer, where SIMD stands for single instruction multiple data. As before, the data are multiple, and hence all the processors of the parallel machine can work on different data. The instructions are single, though. This means that all the processors carry out the same instructions on different data. Differently from the MIMD computers, in this case the processors are always synchronized. A detailed classification of the parallel machines can be found in [77]. In this context, a single personal computer is referred to as a SISD machine, where SISD stands for single instruction single data. Recently, hybrid machines using both MIMD and SIMD architectures have also been developed. Details can be found in [217]. Finally, it is important to note that parallel computing has recently evolved into the so-called grid computing [79]. The main difference between standard parallel computing and grid computing lies in the fact that in grid computing many remote computers with different properties (such as the CPU clock) are used simultaneously. This has consequences which are not discussed in this book.

Let us consider a simple problem and let us try to develop a parallel algorithm for solving it. Given a positive integer number N, it may be desirable to know whether this number is prime or not. The easiest way to find out if N is prime is to try all the possible divisions of N by all the integer numbers smaller than N. The totality of these divisions can then be split among the p processors of a parallel machine, so that each processor can try different divisors in parallel. In theory, the processors can work without knowing anything about the other processors. When all the processors have finished their job, p answers are available. Indeed, each processor can provide as output the divisibility or not of N considering only its part of all the possible divisors. If all the processors give as an answer "not divisible," then N is prime. If at least one processor gives as an answer "divisible," then N is not prime.
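A minimal MPI sketch of this basic parallel primality test may look as follows: each process tries a different subset of the candidate divisors of N (here, every size-th divisor starting from 2 + rank), and the p local answers are then combined on process 0. The value of N is just an example.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, size;
    long N = 104729;                    /* the number to be tested           */
    int divisible = 0;                  /* 1 if this process finds a divisor */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* each process checks only the divisors assigned to it */
    for (long d = 2 + rank; d < N; d += size)
        if (N % d == 0) { divisible = 1; break; }

    /* combine the p answers: N is prime only if nobody found a divisor */
    int any_divisor;
    MPI_Reduce(&divisible, &any_divisor, 1, MPI_INT, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("%ld is %s\n", N, any_divisor ? "not prime" : "prime");

    MPI_Finalize();
    return 0;
}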

A way of improving this parallel algorithm can be the following one. While the processors work simultaneously, one of them may find an integer n̄ such that N is divisible by n̄. This would mean that N is not prime, and then the parallel algorithm can already provide its output: N is not prime. All the further divisions that the processors might perform are completely useless, because the output is already known. In a sequential algorithm, this situation can simply be handled by a loop such as while or repeat..until. In this case, instead, only the processor which finds n̄ can stop making divisions in this way. How can the other processors be told that continuing to work is useless? The processors need to communicate.

On MIMD computers with distributed memory, all the processors have a private memory. In each memory, a variable can be stored and used as a sort of signal for the divisibility. This variable can have value 1 when no divisors have been found, and 0 otherwise. Naturally, if one processor changes its signal variable to 0, the others cannot see the change and therefore cannot stop working. To overcome this problem, the processors can periodically exchange their own signal variables. If at least one of them is 0, all the processors can stop working and all of them can provide as output "the number is not prime." If, instead, all the processors finish working on their divisions, exchange the signal variables and find out that none of them is 0, then the global (or parallel) output is "the number is prime."
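The periodic exchange of the signal variables can be sketched with MPI as follows, assuming the same splitting of the divisors as in the previous example and that MPI has already been initialized. The signal is 1 while no divisor has been found and 0 otherwise; after every block of divisions the processes exchange their signals with MPI_Allreduce (taking the minimum), and all of them stop as soon as one signal becomes 0. The function name and the block size are illustrative choices.

#include <mpi.h>

/* returns 1 if N is prime, 0 otherwise */
int parallel_primality_test(long N, int rank, int size) {
    int signal = 1;                        /* 1: no divisor found so far   */
    long block = 10000;                    /* divisions between exchanges  */
    long first = 2;                        /* first divisor of this block  */

    while (first < N && signal) {
        long last = first + block * (long) size;
        /* try the divisors of the current block assigned to this process */
        for (long d = first + rank; d < last && d < N; d += size)
            if (N % d == 0) { signal = 0; break; }

        /* exchange the signals: the global signal is their minimum */
        int global_signal;
        MPI_Allreduce(&signal, &global_signal, 1, MPI_INT, MPI_MIN,
                      MPI_COMM_WORLD);
        signal = global_signal;
        first = last;
    }
    return signal;
}

Since first and signal evolve in the same way on every process, all the processes perform the same number of exchanges and none of them is left waiting.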

Organizing the communications among the processors of a parallel computer in an efficient way is a crucial point for the success of a parallel procedure. Let us suppose

a variable, say a, needs to be computed somewhere during the algorithm, and that this variable is needed later by all the processors for computing other variables. The variable a should then be located in the memory of each processor, because it is needed for performing a part of the instructions. There is therefore the need to let the processors communicate for exchanging data. An example may be that a single processor computes a and then sends a to another processor, which receives this information. Another example is that a single processor sends a to all the other processors. This kind of communication is called broadcast, because one processor sends its data to all the others involved in the computation. Let us suppose now that another variable, say b, is needed in all the memories because it has to be used in some part of the algorithms performed on each processor. Another communication may be activated for sending b to all the other processors by the time they need it.
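A broadcast like the one just described can be sketched in C with the MPI function MPI_Bcast: process 0 computes the variable a and sends it to all the other processes, so that a copy of a ends up in every private memory. The value assigned to a is just an example.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank;
    double a = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        a = 3.14;                      /* only process 0 computes a */

    /* after the broadcast every process holds the same value of a */
    MPI_Bcast(&a, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    printf("process %d: a = %f\n", rank, a);

    MPI_Finalize();
    return 0;
}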

However, communications among the processors require time, and the time saved in the computation must outweigh the time spent in the communication process. When the computation of a or b is not very expensive, it is preferable to let all the processors compute such variables themselves, in order to avoid communications, which might require more time.

Since clusters of computers are currently used more frequently than other parallel architectures, in the following we will focus on algorithms to be implemented on these kinds of parallel machines. On these machines, each processor can work on a different algorithm by using its own data, which are stored in its own memory. Communications are needed during the execution of the parallel algorithm. The message passing interface (MPI) [168] provides interfaces for allowing processors to communicate with each other. It was originally designed to be used on clusters of computers. It is based on the C and Fortran programming languages and it is available on almost all platforms and operating systems. MPI consists of a library of C functions and Fortran subroutines needed for making the processors communicate. Basic communications are implemented, such as for sending or receiving variables from one processor to another. Other functions or subroutines allow more than two processors to communicate with one another and to perform predetermined tasks simultaneously.
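As an example of the basic communications mentioned above, the following sketch uses the MPI C functions MPI_Send and MPI_Recv to send one variable from one processor to another: process 0 sends a double to process 1, which receives it. It should be run with at least two processes, for instance with mpirun -np 2.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank;
    double value = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42.0;
        /* send one double to process 1 with message tag 0 */
        MPI_Send(&value, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* receive one double from process 0 with message tag 0 */
        MPI_Recv(&value, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("process 1 received %f\n", value);
    }

    MPI_Finalize();
    return 0;
}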
