HAL Id: hal-00426664
https://hal.archives-ouvertes.fr/hal-00426664v2
Submitted on 5 Jul 2010
HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci- entific research documents, whether they are pub- lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.
Parallel Program Performance Debugging with the Pandore II Environment
Cyrille Bareau, Yves Mahéo, Jean-Louis Pazat
To cite this version:
Cyrille Bareau, Yves Mahéo, Jean-Louis Pazat. Parallel Program Performance Debugging with the
Pandore II Environment. International Conference on Parallel Computing (ParCo’93), Sep 1993,
Grenoble, France. pp.241-248. �hal-00426664v2�
Parallel Program Performance Debugging with the Pandore II Environment
C. Bareau, Y. Maheo, J.-L. Pazat
aa
PAMPA Team, IRISA, 35042 RENNES Cedex, France ( pandore@irisa.fr )
In this paper, we present the overall design of Pandore II , an environment dedicated to the experimentation of distribution of sequential programs for their execution on dis- tributed memory parallel architectures. The emphasis is then put on two performance analysis tools integrated in this environment.
1. INTRODUCTION
Programming distributed memory parallel computers ( dmpc s for short) is a dicult task because of the management of parallel processes. The intricateness between compu- tation and interprocess communication is often very tight, leading to very cumbersome programs. So the wish of most users is to be freed from all these \low level" details.
In many cases they do not want to take into account the distributed aspects of their programs, which they also want to be machine independent.
A solution to this problem is given by the use of sequential programming languages extended with data distribution features like High Performance Fortran [7]. In this case, the compiler is in charge of generating communicating parallel processes from the sequen- tial code and the data distribution specication. This solution has been validated [4], and implemented in the Pandore II environment [1]. Yet, eciency issues necessitate to develop specic performance analysis tools.
2. THE PANDORE II ENVIRONMENT
Pandore II is an environment designed for parallel execution of imperative sequential programs on dmpc s. It comprises a compiler, libraries for dierent dmpc s and execution analysis tools including a proler and a trace generator.
2.1. The source language
The source language for this environment is a subset of
Caugmented with features for data decomposition purpose { similar characteristics appear in the recently dened High Performance Fortran language. No specic knowledge of the target machine is required of the user: only the specication of data decomposition is left to his duty.
A Pandore II program is a collection of distributed phases which may be seen like functions. The specication of data distribution is expressed as attributes of the formal parameters. Arrays can be partitioned into rectangular blocks and their mapping on processors can be regular or cyclic. For example :
dist myphase(float A[N][N] by block(N,1) map wrapped(0,1) mode INOUT)
species that the array A is to be partitioned into columns mapped cyclically onto the processors. The INOUT mode species that A is to be read at the beginning of the phase and written back at the end. Figure 1 shows an example of a Pandore II program.
#dene N 128
#dene P 4
oat
A[N][N],
oatV[N];
dist
myphase(
oatA[N][N]
by block(N/P,N)map regular(0,1)modeINOUT,
oat
V[N]
by block(N/P)map regular(0)modeINOUT) {
inti,j;
for
(i=0; i<N; i++) /* */
for
(j=0; j<N; j++) /*
Instrumentation zone 1*/
V[i] = f(V[i],A[i][j]); /* __________________ */
for
(j=0; j<N; j++) /* */
for
(i=1; i<N-1; i++) /*
Instrumentation zone 2*/
A[i][j] = g(A[i+1][j], A[i-1][j]); /* */
} main()
{ myphase(A,V); }
Figure 1. Example of Pandore II source program
2.2. The compiler
The Pandore II compiler generates parallel processes according to the data decom- position specied by the programmer. The compilation scheme is based on the locality of writes, on the host/node and the SPMD model. The host process executes the code of the
main() part of the program, whereas node processes execute the SPMD code produced by the compiler from the distributed phases. A node process is in charge of its local vari- ables: it updates their values and sends them to other processes when needed, according to the original sequential program. For example the assignment
A[
i] =
B[
i+ 1] +
C[
i] is translated into:
refresh(
ftmp1 , tmp2
g,
fB[i+1],C[i]
g, owner(A[i])) exec(owner(A[i]), tmp1 + tmp2 )
free(
ftmp1 , tmp2
g)
When executing the refresh macro, owners of
B[
i+ 1] and
C[
i] send their values to the owner of
A[
i]. This process receives the values into buers tmp1 and tmp2 . The
exec macro is a guarded command that insures that only the owner of
A[
i] executes the statement
A[
i] = tmp1 + tmp2 . If the assigned variable is replicated on all the processors (as scalars for example), distant values are broadcasted through the network.
This basic translation scheme is not very ecient but optimization techniques based
on the same model exist, for example [2] carries out a static domain analysis of loops to
generate ecient code.
2.3. The runtime library
The runtime permits the execution of object code on dierent dmpc s. Its goal is to implement memory and process management, communication of data elements between processes, distributed data management and ecient index translation. It relies on ma- chine dependent libraries and is implemented using macro denitions.
The machine model is a fully connected network of processing elements, communicating through reliable FIFO channels. Sends are non-blocking, whereas receives are blocking.
The design and the implementation of this library increase the system portability and facilitates the instrumentation of the code.
2.4. Need for performance debugging
The performances of the code generated by the Pandore II system are dependent on the appropriateness of the chosen data distribution to the algorithm but also on strategies directing the compiler and the runtime implementation. It appears that evaluating the inuence of these parameters is necessary in order to guide the system designers and to provide the user with tools helping him to distribute his data. A dynamic evaluation { as opposed to a static estimation [3, 6] { oers the advantage of being applicable to every type of program and yields precise results.
Two techniques are used for performance measurement in the Pandore II environ- ment: tracing and proling. These two techniques dier in their aim and in their imple- mentation. Tracing permits the recording of events to which are assigned at least a type and a timestamp. With this method, the parallel activity can be recorded in order to rebuild the program behavior; we present here an extension of usual tracing for extracting information on all the potential behaviors of the program. The counterpart of this method is proling, whose aim is to gather enough statistics for execution analysis. A number of counters related to events are updated during program execution. In Pandore II , an enhancement of mere proling is used: in addition to their occurrences, the durations of events may also be cumulated [9].
3. TRACE GENERATION AND ANALYSIS 3.1. Principle of the Pandore II trace analyzer
The principle of performance analysis by trace generation is to study the causal struc- ture of the program that takes into account the load balancing induced by the chosen data distribution and the compilation scheme. Indeed, from an intuitive point of view, \a distributed program is very parallel when, most of the time, most processors can actually perform an action, i.e. are not blocked waiting for a message". Now these blockings, due to the asynchronism of the communications, happen when an event is causally dependent on an event on a dierent processor; thus, a tool that makes these dependences between processors clearly visible gives an estimation of their importance in terms of eciency.
Moreover, by focusing on events rather than on relations between them, one may get an estimation of the load balance of the program.
Our tool permits to visualize whether an action creates a bottleneck or can be performed
in parallel with many other actions on other processors. For this purpose, we build the
lattice of all the possible behaviors of the program: it is a graph in which each vertex
represents an instant of the execution and the outgoing edges are the actions that may be
performed at this instant by the unblocked processors. Hence each path from the initial vertex to the nal one corresponds to one of the interleaving of the actions of the program.
For instance, from the following program, where x[i] and y[j] are placed on processor P1 and z[i] on processor P2,
Source code execution on P1: on P2:
z[i]:=y[j]+1; send(y[j],P2); (a) receive(y[j],P1); (e)
x[i]:=5; x[i]:=5; (b) z[i]:=y[j]+1; (f)
y[j]:=x[i]+3*z[i]; receive(z[i],P2); (c) send(z[i],P1); (g)
z[i]:=z[i]-3; y[j]:=x[i]+3*z[i]; (d) z[i]:=z[i]-3; (h)
we get, giving a direction to each processor, the lattice of executions of gure 2(a).
a c b
c d
e
g h
h h h
g b
b f f b
e d
b P1 P2
f
h d
b h h b b
f d
(a) (b)
Figure 2. Lattices of executions
The main drawback of the lattice of executions is its size. Furthermore, events are not all interesting: for instance when an assignment always follows a reception, one of these two events could be abstracted without loss of precision. The user may also want to analyze only parts of its program. In that respect, we allow him to include his own observation points either in the Pandore II source program or in the generated code. In the previous example, if we choose to observe only assignments, we obtain the new lattice shown in gure 2(b).
3.2. Construction
The construction of the lattice of executions is based on results of order theory, using the fact that a distributed execution can be seen as a partial order. An algorithm [10] that computes vector timestamps coding this order has been implemented in ECHIDNA [8], a programming environment for execution of Estelle specications on dmpc s, networks of workstations, or by simulation on monoprocessors
1. In order to use this tool, an Estelle-
1
EstelleisanISOlanguageforproto colsp ecication.
code generator has been added as new back-end to the Pandore II environment. The execution timesof the Estelle code are of course verydierent from those obtained with the C code, but the notion of intrusion is here irrelevant for we focus on the very structure of the algorithm, on the causal dependences between events that ensue only from the source, not from the low-level mechanisms.
The lattice of executions is actually the lattice of the ideals of an execution seen as a partial order, therefore it can be created from only one execution. We use an algorithm described in [5]; this algorithm is online: during an execution, the observed events gen- erate traces with timestamps coding the partial order, and the lattice is incrementally constructed from these traces.
3.3. Applications
• •• •
(a)
••••••••••••
••••••••
••
••
••
••
••
••
•
••••••••
••
••
•
••
••
••
••
••
••
•••
•••
••
••
••
••
••
•
••
••
••
••
••
••
• •
••••
•••••••
••••••••
••
••
••
••
••
••
•
••••••••
••
••
•
••
••
••
••
••
••
•••
•••
••
••
••
••
••
•
••
••
••
••
••
••
• •
••••
•••••••
••••••••
••
••
••
••
••
••
•
••••••••
••
••
•
••
••
••
••
••
••
•••
•••
••
••
••
••
••
•
••
••
••
••
••
••
• •
••••
•••••••
••••••••
••
••
••
••
••
••
•
(b)
• •• •• ••••• ••••••
••
••
••
••
• •
••
••
••
••
• •• ••••• ••••••
••
••
••
••
••
••
••
••
••
•••••
••••
••
••
••
•••••
••
••
••
••
••
••
••
••
• •••
••
••
••
•
••
••
••
••
•
••
••
••
••
•
••
••
••
••
••••••
••••••••••••
••
••
• •• •• •• •• •• •
(c)
• •• ••••••••••••••••••••••••••••••••••••••• ••• •••
••••••••••••••••••••••••••••
••
••
••
••
••
••
••
••
•••
••••
•
••
••
••
••
••
••
••
• •••
••
••
••
••
••
••
••
•
Figure 3. Matrix product lattices of executions
With our tool, the user is given an intuitive look over the performance of his distributed program. The lattice of executions highlights the bottlenecks as well as the \fully parallel"
parts of the program, therefore it is possible to have an overall estimation of the perfor- mance, and thus to compare several data distributions. Furthermore, when a bottleneck appears in the lattice, the user immediately knows which events are causing it, and so which part of the program should be reconsidered.
As an example, we show in gure 3 the lattices corresponding to a matrix product on
four processors with three dierent distributions. The observed events are the assignments
of the program. It can be seen that in the distribution (c) there are many points with only one or two outgoing edges: parallelism is here very weak, especially at the beginning and at the end of the execution when only one processor achieves assignments, the others performing only communication actions. On the contrary, in (a), most points correspond to instants when three or four processors are \ready to work", leading to a greater amount of potential parallelism. This approach may seem unrealistic if we consider lattices with many processors; however we observe that such lattices present the same overall shape (for example the juxtaposition of the same pattern in distribution (a)), so the observations made with few processors are pertinent.
4. THE PANDORE II PROFILER 4.1. Instrumentation
The Pandore II proler allows the user to collect a number of quantitative measures on his program's execution with minimal intervention. The use of proling restrains the amount of storage needed; the number of counters to be updated is of the order of the number of variables declared in the source program. This proler has been implementedon a 32-node iPSC/2 but is easily portable. Sensors are inserted in modied versions of some runtime macros, thus the compiler generates a similar code whether an instrumentation is demanded or not. Lapses of time are measured with a software microsecond clock.
As most information for updating counters is available at compile time, the level of intrusion remains low (limited to a few percent execution overhead). Measurements are performed on each node and counters are brought back to the host at the end of the execution and then written down into a le that can be exploited by appropriate tools.
The host code is not instrumented due to the lack of precision of time measurement on a time-shared multi-user system.
The links between the source and the evaluation results are established two dierent ways: rst the user bounds fragments of the distributed phases he wants to be evaluated by dening some instrumentation zones , typically loop nests. Moreover, output gures are associated to objects of the source program such as arrays, scalars, conditional statements or loops.
4.2. Results
Besides execution times and the load balance, the main results produced by a proled execution are related to communications and synchronizations. They may be classied in two categories: measures specic to distributed phases (communication with host at the beginning and at the end of each phase, phase triggering) and measures concerning assignments within instrumentation zones.
These statistics give information about the eciency of the runtime implementation
especially for message passing. Moreover, with the last class of results, data distribution
for a given algorithm can be evaluated. An assignment of a distributed array element in
which another distributed array reference appears in the right hand side may generate a
message from the owner of the right hand side to the owner of the left hand side. The
aim of the measurement is to globally build a directed graph where vertices are array
partitions and arcs describe the trac between partitions. Arcs are valued by the number
of messages, the transferred volume or the waiting time on reception. For example, the
assignment
A[3
;5] =
B[4] will increase the value of the arc (
B1
!A2) if element
A[3
;5]
is on processor 2 and
B[4] on processor 1.
In a similar way, an assignment of a replicated variable in which a distributed array reference appears in the right hand side will generate a broadcast message from the owner of the array element. For the entire execution, broadcasting is described (number of messages, transferred volume, waiting time on reception) for each pair ( var,part ), where the partition part represents the source and var the assigned replicated variable. For example, the assignment
x=
A[3
;5] will increase the value of the counters related to the pair (
x;A1) if
A[3
;5] is located on processor 1.
The produced results may be analyzed as such or treated by a specic tool that can give partial and abstracted views (e.g. by selecting or grouping processors or variables).
V1
V0 V2 V3
A0 A1 A2 A3
A0 A1 A2 A3
A0 A2 A1 A3
V1
V0 V2 V3
(a) Row-wise
(b) Column-wise
(c) Block-wise 126 messages / arc
1024 messages / arc
2048 messages / arc (Ai, Vj) 64 messages / arc (Ai, Aj)
A0 A1 A2 A3
V1
V0 V2 V3
A0 A1
A0 A2 A1 A3
V1 V3
Number of messages Waiting time(1 u = 100 ms)