Parallel Programming with Java

Aamir Shafi

National University of Sciences and Technology (NUST)

http://hpc.seecs.edu.pk/~aamir

http://mpj-express.org

Two Important Concepts

• Two fundamental concepts of parallel programming are:

• Domain decomposition

• Functional decomposition

Domain Decomposition

Image taken from https://computing.llnl.gov/tutorials/parallel_comp/
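The idea can be sketched in plain Java, without MPJ Express. A 1-D array of n elements is split into contiguous blocks, one per process, and each process works only on its own block. The names `BlockDecomposition`, `start` and `count` are hypothetical. Handing the leftover n % p elements to the lowest ranks is one common convention, not something the slide prescribes.

```java
import java.util.Arrays;

// Hypothetical helper: 1-D block decomposition of n elements over p ranks.
// The first (n % p) ranks own one extra element, so block sizes differ by at most one.
class BlockDecomposition {

    // Number of elements owned by rank r.
    static int count(int n, int p, int r) {
        return n / p + (r < n % p ? 1 : 0);
    }

    // Global index of the first element owned by rank r.
    static int start(int n, int p, int r) {
        return r * (n / p) + Math.min(r, n % p);
    }

    public static void main(String[] args) {
        int n = 10, p = 3;
        double[] data = new double[n];
        Arrays.fill(data, 1.0);

        // Each rank sums only its own sub-domain; the partial sums are then combined.
        double total = 0.0;
        for (int r = 0; r < p; r++) {
            int s = start(n, p, r), c = count(n, p, r);
            double partial = 0.0;
            for (int i = s; i < s + c; i++) {
                partial += data[i];
            }
            System.out.println("rank " + r + " owns [" + s + ", " + (s + c) + "), partial sum = " + partial);
            total += partial;
        }
        System.out.println("total = " + total);
    }
}
```

With n = 10 and p = 3 the blocks are [0, 4), [4, 7) and [7, 10).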

Functional Decomposition

Image taken from https://computing.llnl.gov/tutorials/parallel_comp/
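In functional decomposition the work is split by task rather than by data. Here is a minimal plain-Java sketch, assuming three independent tasks (sum, maximum, minimum) over the same input. `FunctionalDecomposition` is a hypothetical name, and a thread pool from `java.util.concurrent` stands in for separate MPI processes.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical example: three different functions run concurrently on the same data.
class FunctionalDecomposition {

    // Returns {sum, max, min}, each computed by a separate task.
    static long[] run(int[] data) {
        ExecutorService pool = Executors.newFixedThreadPool(3);
        try {
            Future<Long> sum = pool.submit(() -> {
                long s = 0;
                for (int x : data) s += x;
                return s;
            });
            Future<Long> max = pool.submit(() -> {
                long m = Long.MIN_VALUE;
                for (int x : data) m = Math.max(m, x);
                return m;
            });
            Future<Long> min = pool.submit(() -> {
                long m = Long.MAX_VALUE;
                for (int x : data) m = Math.min(m, x);
                return m;
            });
            return new long[] {sum.get(), max.get(), min.get()};
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        long[] r = run(new int[] {3, 1, 4, 1, 5});
        System.out.println("sum=" + r[0] + " max=" + r[1] + " min=" + r[2]); // sum=14 max=5 min=1
    }
}
```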

Message Passing Interface (MPI)

• MPI is a standard (an interface or an API):

• It defines a set of methods that are used by application developers to write their applications

• MPI libraries implement these methods

• MPI itself is not a library; it is a specification document that implementations follow

• MPI-1.2 is the most popular specification version

• Reasons for popularity:

• Software and hardware vendors were involved

• Significant contribution from academia

• MPICH served as an early reference implementation

• MPI compilers are simply wrappers around widely used C and Fortran compilers

• History:

• The first draft specification was produced in 1993

• MPI-2.0, introduced in 1997, added many new features to MPI

• Bindings are available for C, C++, and Fortran

• MPI is a success story:

• It is the most widely adopted programming paradigm on IBM Blue Gene systems

• At least two production-quality MPI libraries:

• MPICH2 (http://www-unix.mcs.anl.gov/mpi/mpich2/)

• OpenMPI (http://open-mpi.org)

• There’s even a Java library:

• MPJ Express (http://mpj-express.org)

Message Passing Model

• The message passing model allows processors to communicate by passing messages:

• Processors do not share memory

• Data transfer between processors requires cooperative operations to be performed by each processor:

• One processor sends the message while the other receives it
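The cooperative send/receive pair can be mimicked in plain Java. In this sketch two threads stand in for two processes, and a `java.util.concurrent.SynchronousQueue` acts as the link: `put` blocks until the receiver calls `take`, so neither side completes the transfer alone. Threads in one JVM do share memory, so the payload is copied before sending to imitate separate address spaces. `MessageChannelDemo` is a hypothetical name, not part of MPJ Express.

```java
import java.util.Arrays;
import java.util.concurrent.SynchronousQueue;

class MessageChannelDemo {

    // Sends payload from a "sender" thread to a "receiver" thread and returns what was received.
    static int[] exchange(int[] payload) {
        SynchronousQueue<int[]> link = new SynchronousQueue<>();
        int[][] received = new int[1][];

        Thread sender = new Thread(() -> {
            try {
                link.put(payload.clone()); // send a copy; blocks until the receiver takes it
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        Thread receiver = new Thread(() -> {
            try {
                received[0] = link.take(); // blocks until a message arrives
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        sender.start();
        receiver.start();
        try {
            sender.join();
            receiver.join(); // join() makes received[0] visible to this thread
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return received[0];
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(exchange(new int[] {1, 2, 3}))); // [1, 2, 3]
    }
}
```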

Distributed Memory Cluster

Figure: a cluster with head node barq.niit.edu.pk and compute nodes chenab1 to chenab7

Steps involved in executing the “Hello World!” program

1. Log on to the cluster head node

2. Write the Hello World program

3. Compile the program

4. Write the machines file

5. Start MPJ Express daemons

6. Execute the parallel program

7. Stop MPJ Express daemons

Step 1: Log on to the head node

Step 2: Write the Hello World Program

Step 3: Compile the code

Step 4: Write the machines file

Step 5: Start MPJ Express daemons

Step 6: Execute the parallel program

aamir@barq:~/projects/mpj-user> mpjrun.sh -np 6 -headnodeip 10.3.20.120 -dport 11050 HelloWorld

..

Hi from process <3> of total <6>

…

Step 7: Stop the MPJ Express daemons

COMM_WORLD Communicator

import java.util.*;

import mpi.*;

..

// Initialize MPI

MPI.Init(args); // start up MPI

// Get total number of processes and rank

int size = MPI.COMM_WORLD.Size();

int rank = MPI.COMM_WORLD.Rank();

..

What is size?

• Total number of processes in a communicator:

• In the example run with six processes (-np 6), the size of MPI.COMM_WORLD is 6

import java.util.*;

import mpi.*;

..

// Get total number of processes

int size = MPI.COMM_WORLD.Size();

..

What is rank?

• The unique identifier (id) of a process in a communicator:

• Each of the six processes in MPI.COMM_WORLD has a distinct rank, from 0 to 5

import java.util.*;

import mpi.*;

..

// Get the rank (id) of this process

int rank = MPI.COMM_WORLD.Rank();

..

Single Program Multiple Data (SPMD) Model

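import java.util.*;
import mpi.*;

public class HelloWorld {

  public static void main(String[] args) throws Exception {

    MPI.Init(args); // start up MPI
    int size = MPI.COMM_WORLD.Size();
    int rank = MPI.COMM_WORLD.Rank();

    // Every process runs this same program; the branch taken depends on its rank
    if (rank == 0) {
      System.out.println("I am Process 0");
    }
    else if (rank == 1) {
      System.out.println("I am Process 1");
    }

    MPI.Finalize(); // shut down MPI
  }
}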

Single Program Multiple Data (SPMD) Model

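import java.util.*;
import mpi.*;

public class HelloWorld {

  public static void main(String[] args) throws Exception {

    MPI.Init(args); // start up MPI
    int size = MPI.COMM_WORLD.Size();
    int rank = MPI.COMM_WORLD.Rank();

    // Same program everywhere; even and odd ranks take different branches
    if (rank % 2 == 0) {
      System.out.println("I am an even process");
    }
    else if (rank % 2 == 1) {
      System.out.println("I am an odd process");
    }

    MPI.Finalize(); // shut down MPI
  }
}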
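The SPMD idea, one program whose behaviour depends on its rank, can be tried without a cluster. This plain-Java sketch launches the same method on several threads, each given a rank and the total size, much as mpjrun.sh launches several copies of HelloWorld. Threads stand in for MPJ Express processes here, and `SpmdSketch` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class SpmdSketch {

    // The single program: every rank executes this same code and branches on its rank.
    static String program(int rank, int size) {
        if (rank % 2 == 0) {
            return "Process " + rank + " of " + size + ": I am an even process";
        } else {
            return "Process " + rank + " of " + size + ": I am an odd process";
        }
    }

    // Runs program() once per rank, each on its own thread, and collects the output lines.
    static List<String> launch(int size) {
        List<String> output = Collections.synchronizedList(new ArrayList<>());
        List<Thread> threads = new ArrayList<>();
        for (int r = 0; r < size; r++) {
            final int rank = r;
            Thread t = new Thread(() -> output.add(program(rank, size)));
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return output;
    }

    public static void main(String[] args) {
        // As with real processes, the order of the lines can change from run to run.
        launch(6).forEach(System.out::println);
    }
}
```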

Point to Point Communication

• The most fundamental facility provided by MPI

• Messages are exchanged between two processes:

• One process (the source) sends the message

• The other process (the destination) receives the message

Point to Point Communication

• Messages can carry any of the basic datatypes:

• Floats (MPI.FLOAT), Integers (MPI.INT), Doubles (MPI.DOUBLE) …

• Java Objects (MPI.OBJECT)

• Each message contains a “tag”—an identifier
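The tag, together with the source rank, lets a receiver pick out a particular message. The single-threaded sketch below models only that matching rule: `receive(source, tag)` returns the oldest pending message whose source and tag both match, much as `Recv()` selects a message. The `Mailbox` class is hypothetical. It ignores communicators and wildcards, and it is not how MPJ Express is implemented.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical mailbox that matches incoming messages on (source, tag).
class Mailbox {

    private record Message(int source, int tag, int[] data) {}

    private final List<Message> pending = new ArrayList<>();

    // Called when a message "arrives" from rank `source` carrying tag `tag`.
    void deliver(int source, int tag, int[] data) {
        pending.add(new Message(source, tag, data.clone()));
    }

    // Removes and returns the oldest matching message's data, or null if none has arrived yet.
    int[] receive(int source, int tag) {
        Iterator<Message> it = pending.iterator();
        while (it.hasNext()) {
            Message m = it.next();
            if (m.source() == source && m.tag() == tag) {
                it.remove();
                return m.data();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Mailbox box = new Mailbox();
        box.deliver(0, 1, new int[] {10}); // Tag1
        box.deliver(0, 2, new int[] {20}); // Tag2
        // The receiver can ask for Tag2 first, even though the Tag1 message arrived earlier.
        System.out.println(box.receive(0, 2)[0]); // 20
        System.out.println(box.receive(0, 1)[0]); // 10
    }
}
```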


Point to Point Communication

Figure: eight processes (0 to 7) in COMM_WORLD; one message, carrying integers, is addressed to Process 4 with a tag in the COMM_WORLD communicator

Blocking Send() and Recv() Methods

public void Send(Object buf, int offset, int count, Datatype datatype, int dest, int tag) throws MPIException

public Status Recv(Object buf, int offset, int count, Datatype datatype, int src, int tag) throws MPIException

Blocking and Non-blocking Point-to-Point Comm

• There are blocking and non-blocking versions of the send and receive methods

• Blocking versions:

• A process calls Send() or Recv(); Send() returns once the message buffer can safely be reused, and Recv() returns once the message has been received

• Non-blocking versions:

• A process calls Isend() or Irecv(); these methods return immediately

• The user can check the status of the message by calling Test(), which does not block, or Wait(), which blocks until the operation completes

• Non-blocking versions allow computation to overlap with communication:

• Asynchronous communication
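The overlap of computation and communication can be illustrated with the JDK's `CompletableFuture` in place of MPJ Express. `fakeIrecv` (a hypothetical stand-in) returns immediately like `Irecv()`, `isDone()` plays the role of a non-blocking `Test()`, and `join()` plays the role of `Wait()`. The delay only simulates a message in transit and says nothing about real network timing.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

class OverlapSketch {

    // Hypothetical stand-in for Irecv(): the "message" {1, 2, 3} arrives after delayMs milliseconds.
    static CompletableFuture<int[]> fakeIrecv(long delayMs) {
        return CompletableFuture.supplyAsync(() -> new int[] {1, 2, 3},
                CompletableFuture.delayedExecutor(delayMs, TimeUnit.MILLISECONDS));
    }

    // Does useful work while the message is in transit, then uses the message.
    static long receiveWithOverlap(long delayMs) {
        CompletableFuture<int[]> request = fakeIrecv(delayMs); // returns immediately
        long workDone = 0;
        while (!request.isDone()) {   // like Test(): check progress without blocking
            workDone++;               // placeholder for one unit of computation
            Thread.onSpinWait();
        }
        int[] message = request.join(); // like Wait(): the request is already complete here
        long sum = 0;
        for (int x : message) sum += x;
        System.out.println("units of work overlapped with the transfer: " + workDone);
        return sum;
    }

    public static void main(String[] args) {
        System.out.println("sum of received data = " + receiveWithOverlap(50)); // 6
    }
}
```

In MPJ Express the same pattern would call `Test()` and `Wait()` on the `Request` returned by `Irecv()`.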

Figure: timelines for blocking and non-blocking communication. With blocking Send() and Recv(), the CPU of both sender and receiver waits during the transfer. With non-blocking Isend() and Irecv(), the CPU does computation while the message is in transit and waits only in Wait().

Non-blocking Point-to-Point Comm

public Request Isend(Object buf, int offset, int count, Datatype datatype, int dest, int tag) throws MPIException

public Request Irecv(Object buf, int offset, int count, Datatype datatype, int src, int tag) throws MPIException

// Called on the Request returned by Isend() or Irecv()

public Status Wait() throws MPIException

public Status Test() throws MPIException

Performance Evaluation of Point to Point Communication

• Ping-pong benchmarks are normally used to measure:

• Latency: how long it takes to send N bytes from sender to receiver

• Throughput: how much bandwidth is achieved

• Latency is a useful measure for studying the performance of “small” messages

• Throughput is a useful measure for studying the performance of “large” messages
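A ping-pong test times many round trips: the sender sends N bytes and the receiver sends them straight back. Half of the average round-trip time is taken as the one-way latency, and N divided by that time gives the throughput. The helper below (hypothetical names, with 1 MB taken as 10^6 bytes) only does this arithmetic. The timings in `main` are made-up illustrative numbers, not measurements from the Myrinet results.

```java
// Hypothetical helpers for turning ping-pong timings into latency and throughput.
class PingPongMath {

    // One-way time in seconds, given the total time for `iterations` round trips.
    static double oneWaySeconds(double totalSeconds, int iterations) {
        return totalSeconds / iterations / 2.0;
    }

    // Latency in microseconds.
    static double latencyMicros(double totalSeconds, int iterations) {
        return oneWaySeconds(totalSeconds, iterations) * 1e6;
    }

    // Throughput in MB/s (1 MB = 10^6 bytes) for messages of `bytes` bytes.
    static double throughputMBps(long bytes, double totalSeconds, int iterations) {
        return bytes / oneWaySeconds(totalSeconds, iterations) / 1e6;
    }

    public static void main(String[] args) {
        // Illustrative inputs only: 1000 round trips of a 1-byte message took 0.01 s,
        // and 100 round trips of a 1,000,000-byte message took 2.0 s.
        System.out.printf("latency    = %.1f us%n", latencyMicros(0.01, 1000));             // 5.0 us
        System.out.printf("throughput = %.1f MB/s%n", throughputMBps(1_000_000, 2.0, 100)); // 100.0 MB/s
    }
}
```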

Latency on Myrinet

Throughput on Myrinet

Collective communications

• Provided as a convenience for application developers:

• Save significant development time

• Efficient algorithms may be used

• Stable (tested)

• Built on top of point-to-point communications

• These operations include:

• Broadcast, Barrier, Reduce, Allreduce, Alltoall, Scatter, Gather, Allgather, Scan

• Versions that allow displacements between the data (the “v” variants, e.g. Scatterv and Gatherv)
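As an example of the kind of efficient algorithm a library may use, a broadcast can be built from point-to-point sends arranged as a binomial tree. In each round, every process that already has the data sends it to one that does not, so p processes are covered in ⌈log2 p⌉ rounds instead of p - 1 sequential sends from the root. The sketch only computes the send schedule for root 0. `BinomialBroadcast` is a hypothetical name, and this is not necessarily the algorithm MPJ Express uses.

```java
import java.util.ArrayList;
import java.util.List;

class BinomialBroadcast {

    // Returns, for each round, the {source, destination} pairs of a broadcast rooted at rank 0.
    static List<List<int[]>> schedule(int p) {
        List<List<int[]>> rounds = new ArrayList<>();
        for (int mask = 1; mask < p; mask <<= 1) {
            List<int[]> sends = new ArrayList<>();
            // Ranks 0 .. mask-1 already hold the data at the start of this round.
            for (int src = 0; src < mask; src++) {
                int dst = src + mask;
                if (dst < p) {
                    sends.add(new int[] {src, dst});
                }
            }
            rounds.add(sends);
        }
        return rounds;
    }

    public static void main(String[] args) {
        List<List<int[]>> rounds = schedule(6);
        for (int k = 0; k < rounds.size(); k++) {
            StringBuilder line = new StringBuilder("round " + k + ":");
            for (int[] s : rounds.get(k)) {
                line.append(" ").append(s[0]).append("->").append(s[1]);
            }
            System.out.println(line);
        }
    }
}
```

For six processes this prints three rounds: 0->1, then 0->2 and 1->3, then 0->4 and 1->5.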

Broadcast, scatter, gather, allgather, alltoall

Image from the MPI standard document
Broadcast, scatter, gather, allgather, alltoall

public void Bcast(Object buf, int offset, int count, Datatype type, int root) throws MPIException

public void Scatter(Object sendbuf, int sendoffset, int sendcount, Datatype sendtype, Object recvbuf, int recvoffset, int recvcount, Datatype recvtype, int root) throws MPIException

public void Gather(Object sendbuf, int sendoffset, int sendcount, Datatype sendtype, Object recvbuf, int recvoffset, int recvcount, Datatype recvtype, int root) throws MPIException

public void Allgather(Object sendbuf, int sendoffset, int sendcount, Datatype sendtype, Object recvbuf, int recvoffset, int recvcount, Datatype recvtype) throws MPIException

public void Alltoall(Object sendbuf, int sendoffset, int sendcount, Datatype sendtype, Object recvbuf, int recvoffset, int recvcount, Datatype recvtype) throws MPIException
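What these calls do to the data can be shown without MPI by holding every process's buffer in one 2-D array, where row i is the buffer of rank i. The sketch below uses hypothetical class and method names, assumes all ranks contribute the same count, and takes the root to be rank 0. It reproduces only the data movement of Scatter, Gather, Allgather and Alltoall; in MPJ Express the same movement happens across processes through the calls listed above.

```java
import java.util.Arrays;

class CollectiveSemantics {

    // Scatter: rank i receives elements [i*count, (i+1)*count) of the root's buffer.
    static int[][] scatter(int[] rootBuf, int p, int count) {
        int[][] recv = new int[p][];
        for (int i = 0; i < p; i++) {
            recv[i] = Arrays.copyOfRange(rootBuf, i * count, (i + 1) * count);
        }
        return recv;
    }

    // Gather: the root concatenates every rank's buffer in rank order.
    static int[] gather(int[][] sendBufs) {
        return Arrays.stream(sendBufs).flatMapToInt(Arrays::stream).toArray();
    }

    // Allgather: like Gather, but every rank ends up with the full result.
    static int[][] allgather(int[][] sendBufs) {
        int[] all = gather(sendBufs);
        int[][] recv = new int[sendBufs.length][];
        for (int i = 0; i < recv.length; i++) {
            recv[i] = all.clone();
        }
        return recv;
    }

    // Alltoall: block j of rank i's buffer becomes block i of rank j's buffer.
    static int[][] alltoall(int[][] sendBufs, int count) {
        int p = sendBufs.length;
        int[][] recv = new int[p][p * count];
        for (int i = 0; i < p; i++) {
            for (int j = 0; j < p; j++) {
                System.arraycopy(sendBufs[i], j * count, recv[j], i * count, count);
            }
        }
        return recv;
    }

    public static void main(String[] args) {
        int[][] pieces = scatter(new int[] {1, 2, 3, 4, 5, 6}, 3, 2);
        System.out.println(Arrays.deepToString(pieces));                                    // [[1, 2], [3, 4], [5, 6]]
        System.out.println(Arrays.toString(gather(pieces)));                                // [1, 2, 3, 4, 5, 6]
        System.out.println(Arrays.deepToString(alltoall(new int[][] {{1, 2}, {3, 4}}, 1))); // [[1, 3], [2, 4]]
    }
}
```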

Reduce collective operations

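Figure: five processes hold the values 1, 2, 3, 4 and 5. Combined with MPI.SUM, Reduce delivers 15 to the root process only, while Allreduce delivers 15 to every process.

Predefined reduction operations:

• MPI.PROD

• MPI.SUM

• MPI.MIN

• MPI.MAX

• MPI.LAND

• MPI.BAND

• MPI.LOR

• MPI.BOR

• MPI.LXOR

• MPI.BXOR

• MPI.MINLOC

• MPI.MAXLOC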

Reduce collective operations

public void Reduce(Object sendbuf, int sendoffset, Object recvbuf, int recvoffset, int count, Datatype datatype, Op op, int root) throws MPIException

public void Allreduce(Object sendbuf, int sendoffset, Object recvbuf, int recvoffset, int count, Datatype datatype, Op op) throws MPIException
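The reduction rule itself is simple: element i of the result combines element i of every process's send buffer using the chosen operation. The plain-Java sketch below reproduces the example of five processes holding 1 to 5, which sum to 15, and a MAXLOC-style (value, rank) result. The class and its operator constants are hypothetical stand-ins for MPI.SUM, MPI.PROD, MPI.MAX, MPI.MIN and MPI.MAXLOC, not the MPJ Express `Op` objects.

```java
import java.util.Arrays;
import java.util.function.IntBinaryOperator;

class ReduceSemantics {

    // Hypothetical stand-ins for some predefined MPI operations.
    static final IntBinaryOperator SUM = Integer::sum;
    static final IntBinaryOperator PROD = (a, b) -> a * b;
    static final IntBinaryOperator MAX = Math::max;
    static final IntBinaryOperator MIN = Math::min;

    // Reduce: row r of sendBufs is rank r's buffer; the root receives the element-wise combination.
    static int[] reduce(int[][] sendBufs, IntBinaryOperator op) {
        int[] result = sendBufs[0].clone();
        for (int r = 1; r < sendBufs.length; r++) {
            for (int i = 0; i < result.length; i++) {
                result[i] = op.applyAsInt(result[i], sendBufs[r][i]);
            }
        }
        return result;
    }

    // Allreduce: the same combined result, delivered to every rank.
    static int[][] allreduce(int[][] sendBufs, IntBinaryOperator op) {
        int[] combined = reduce(sendBufs, op);
        int[][] recv = new int[sendBufs.length][];
        for (int r = 0; r < recv.length; r++) {
            recv[r] = combined.clone();
        }
        return recv;
    }

    // MAXLOC-style result for one value per rank: {maximum value, lowest rank holding it}.
    static int[] maxloc(int[] valuePerRank) {
        int best = 0;
        for (int r = 1; r < valuePerRank.length; r++) {
            if (valuePerRank[r] > valuePerRank[best]) {
                best = r;
            }
        }
        return new int[] {valuePerRank[best], best};
    }

    public static void main(String[] args) {
        int[][] data = {{1}, {2}, {3}, {4}, {5}}; // one integer on each of five ranks
        System.out.println(Arrays.toString(reduce(data, SUM)));              // [15]
        System.out.println(Arrays.deepToString(allreduce(data, SUM)));       // [[15], [15], [15], [15], [15]]
        System.out.println(Arrays.toString(maxloc(new int[] {3, 7, 7, 1}))); // [7, 1]
    }
}
```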

Collective Communication Performance
