Implementing Asynchronous Distributed Systems Using the IOA Toolkit


Chryssis Georgiou¹, Panayiotis P. Mavrommatis², Joshua A. Tauber²

¹ University of Cyprus, Department of Computer Science

² MIT Computer Science and Artificial Intelligence Laboratory

September 27, 2004

1 Introduction

This document reports on the capabilities and performance of the IOA Toolkit, in particular the tools that support implementing and running distributed systems (the checker, the composer, and the code generator). The Toolkit compiles distributed systems specified in IOA into Java classes, which run on a network of workstations and communicate using the Message Passing Interface (MPI). To test the Toolkit, we implemented several distributed algorithms, ranging from simple ones such as LCR leader election in a ring network to more complex ones such as the GHS algorithm for computing the minimum spanning tree of an arbitrary graph. All of our experiments completed successfully, and we took several runtime measurements.

2 Experiments

Implementation Platform The machines used are located in the Theory of Computation Group of the MIT Computer Science and Artificial Intelligence Laboratory and form a Local Area Network. All of them run Red Hat Linux on Intel Pentium III to IV processors, with clock speeds ranging from 1 GHz to 3.2 GHz. Even though MPI sets up a connection between every pair of nodes, the algorithms use only the communication channels they need. For example, node i in LCR sends only to node i+1 and receives only from node i-1.

2.1 LCR Leader Election

The algorithm of Le Lann, Chang and Roberts for Leader Election in a ring network was the first experiment in running an IOA program on a network of computers. The automaton definition that appears in [1](Section 15.1) was used, with some modifications. For all the algorithms that follow, the nodes are automatically numbered from 0 to (size - 1).

Automata Definitions The automata LCRProcess, LCRNode, SendMediator and ReceiveMediator were written. The mediator automata are given in Appendix C.1 (these automata implement the channel automata integrated with MPI functionality). The composed automaton (LCR) appears in Appendix C.2 (this is the one that is translated into Java code by the IOA Toolkit).

LCR Leader Election process automaton

type Status = enumeration of idle, voting, elected, announced

automaton LCRProcess(rank: Int, size: Int)
  signature
    output leader(const rank)
  states
    pending: Mset[Int] := {rank},
    status: Status := idle
  transitions
    input vote
      eff status := voting
    input RECEIVE(m, j, i) where m > i
      eff pending := insert(m, pending)
    input RECEIVE(m, j, i) where m < i
    input RECEIVE(i, j, i)
      eff status := elected
    output SEND(m, i, j)
      pre status ≠ idle ∧ m ∈ pending
      eff pending := delete(m, pending)
    output leader(rank)
      pre status = elected
      eff status := announced

LCR Leader Election composition automaton

automaton LCRNode(rank: Int, size: Int)
  components
    P: LCRProcess(rank, size);
    RM[j: Int]: ReceiveMediator(Int, Int, j, rank) where j = mod(rank-1, size);
    SM[j: Int]: SendMediator(Int, Int, rank, j) where j = mod(rank+1, size)
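The where clauses of the composition select, for each node, its single incoming and its single outgoing ring channel. As a minimal illustration in Python (not part of the Toolkit; the function name ring_neighbors is hypothetical), the same index arithmetic can be written as:

```python
def ring_neighbors(rank, size):
    """Return (predecessor, successor) of `rank` on a ring of `size` nodes."""
    if size <= 0 or not 0 <= rank < size:
        raise ValueError("need 0 <= rank < size")
    # Python's % yields a result in [0, size) for a positive modulus,
    # so rank 0 wraps around to size - 1 and the ring closes.
    return (rank - 1) % size, (rank + 1) % size


print(ring_neighbors(0, 8))  # (7, 1)
print(ring_neighbors(5, 8))  # (4, 6)
```

On a ring of 8 nodes, node 0 therefore receives from node 7 and sends to node 1.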

Results The trace of a run on 8 nodes (on 4 machines) can be found in Appendix E.1. A snapshot of the trace, representing the last five transitions, is shown below.

A snapshot of the trace of LCR leader election

transition: output SEND(7, 5, 6) in automaton LCR(5)
  on condor.csail.mit.edu at 7:25:37:280
Modified state variables:
  P → Tuple, modified fields: {[pending -> ()]}
  SM → Map, modified entries: {[6 -> Tuple, modified fields: {[toSend -> Sequence, elements added: {7} Elements removed: {}]}]}
transition: output RECEIVE(7, 5, 6) in automaton LCR(6)
  on parrot.csail.mit.edu at 7:25:37:755
Modified state variables:
  P → Tuple, modified fields: {[pending -> (7)]}
transition: output SEND(7, 6, 7) in automaton LCR(6)
  on parrot.csail.mit.edu at 7:25:37:770
Modified state variables:
  P → Tuple, modified fields: {[pending -> ()]}
  Sequence, elements added: {7} Elements removed: {}]}]}
transition: output RECEIVE(7, 6, 7) in automaton LCR(7)
  on tui.csail.mit.edu at 7:25:37:872
Modified state variables:
  P → Tuple, modified fields: {[status -> elected]}
transition: output leader(7) in automaton LCR(7)
  on tui.csail.mit.edu at 7:25:37:874
Modified state variables:
  P → Tuple, modified fields: {[status -> announced]}

As the trace indicates, node 7's message made its way around the ring and eventually returned to node 7. At that point, node 7 announced itself as the leader. Figure 2.1 shows the messages sent by the nodes. All messages were sent from node i to node (i + 1) mod 8, and each node i first sent the message carrying its own value i. The message with value 7 followed and made the round of the ring, electing node 7 as the leader.

Figure 2.1: Running the LCR leader election algorithm on a ring of 8 nodes. The white squares represent the nodes, while the shaded parallelograms represent the messages sent.
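The behavior described above can be reproduced in a small sequential simulation. The following Python sketch is not Toolkit output. It assumes that every node has already received the vote input, models each ring link as a FIFO queue, and lets a seeded random scheduler stand in for asynchrony. The names run_lcr, pending and channel are illustrative only.

```python
import random
from collections import deque


def run_lcr(size, seed=0):
    """Simulate LCR on a ring 0 -> 1 -> ... -> size-1 -> 0.

    Returns (leader, number_of_messages_sent).
    """
    rng = random.Random(seed)
    pending = [[rank] for rank in range(size)]   # multiset of UIDs to forward
    channel = [deque() for _ in range(size)]     # channel[i]: link i -> (i+1) mod size
    status = ["voting"] * size                   # assumption: vote already delivered
    sent = 0
    while "elected" not in status:
        # Collect every enabled SEND and RECEIVE, then fire one at random.
        enabled = [("send", i) for i in range(size) if pending[i]]
        enabled += [("recv", i) for i in range(size) if channel[i]]
        kind, i = rng.choice(enabled)
        if kind == "send":
            m = pending[i].pop(rng.randrange(len(pending[i])))
            channel[i].append(m)
            sent += 1
        else:
            m = channel[i].popleft()
            j = (i + 1) % size                   # receiving node
            if m > j:
                pending[j].append(m)             # larger UID: forward it later
            elif m == j:
                status[j] = "elected"            # own UID came back around
            # m < j: the message is discarded
    return status.index("elected"), sent


leader, messages = run_lcr(8)
print(leader, messages)
```

With 8 nodes numbered in ring order, every schedule elects node 7. The number of messages lies between 8 and 15, because each value below 7 is discarded after a single hop while the value 7 travels over all 8 links.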

2.2 Asynchronous Spanning Tree

The Asynchronous Spanning Tree algorithm (see Section 15.3 of [1]) was the next test for the Toolkit. The algorithm is still very simple: given a general graph, it computes a spanning tree of the graph. This was the first test of the Toolkit on arbitrary graphs, where each node has more than one incoming and outgoing communication channel.


Asynchronous Spanning Tree process automaton

type Message = enumeration of search, null

automaton sTreeProcess(i: Int)
  signature
    input RECEIVE(m: Message, const i: Int, j: Int)
    output SEND(m: Message, const i: Int, j: Int)
    output PARENT(j: Int)
  states
    nbrs: Set[Int] := {},
    parent: Int := -1,   % -1 for null
    reported: Bool := false,
    send: Map[Int, Message]
  transitions
    input RECEIVE(m, i, j)
      eff
        if i ≠ 0 ∧ parent = -1 then
          parent := j;
          for k: Int in nbrs - {j} do
            send[k] := search
          od
        fi
    output SEND(m, i, j)
      pre send[j] = search
      eff send[j] := null
    output PARENT(j)
      pre parent = j ∧ reported = false
      eff reported := true

Asynchronous Spanning Tree composition automaton

type Message = enumeration of search, null

automaton sTreeNode(i: Int)
  components
    P: sTreeProcess(i);
    RM[j: Int]: ReceiveMediator(Message, Int, i, j);
    SM[j: Int]: SendMediator(Message, Int, i, j)
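The RECEIVE rule of the process automaton (adopt the first sender as parent, then forward search to all other neighbors) can be exercised outside the Toolkit. The following Python sketch is only an illustration under stated assumptions: the helper grid_neighbors builds a 4 × 4 grid numbered row by row, and asynchrony is modeled by delivering in-flight search messages in a seeded random order. Neither function name comes from the report.

```python
import random


def grid_neighbors(rows, cols):
    """Neighbor sets of a rows x cols grid, nodes numbered row by row from 0."""
    nbrs = {}
    for r in range(rows):
        for c in range(cols):
            adj = set()
            if r > 0:
                adj.add((r - 1) * cols + c)
            if r < rows - 1:
                adj.add((r + 1) * cols + c)
            if c > 0:
                adj.add(r * cols + c - 1)
            if c < cols - 1:
                adj.add(r * cols + c + 1)
            nbrs[r * cols + c] = adj
    return nbrs


def asynch_spanning_tree(nbrs, source=0, seed=0):
    """Return a parent map (None for the source) computed by flooding search."""
    rng = random.Random(seed)
    parent = {i: None for i in nbrs}
    # Each entry (j, i) is a search message sent by j and not yet received by i.
    in_flight = [(source, k) for k in sorted(nbrs[source])]
    while in_flight:
        j, i = in_flight.pop(rng.randrange(len(in_flight)))
        if i != source and parent[i] is None:
            parent[i] = j                                    # first search wins
            in_flight.extend((i, k) for k in sorted(nbrs[i] - {j}))
    return parent


tree = asynch_spanning_tree(grid_neighbors(4, 4), source=0, seed=1)
print(tree)
```

Different seeds give different spanning trees, which mirrors the observation that different runs on the same grid may compute different trees.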

Results Figure 2.2 shows a graph of 16 nodes connected in a 4 × 4 grid that was used in some of our tests. The source node was node 0. Some of the spanning trees computed are also shown in Figure 2.2. A snapshot of the trace follows, showing nodes 1 and 4 sending their messages to node 5. The message of node 1 arrives first, and thus node 5 announces node 1 as its parent. The complete trace of this run can be found in Appendix E.2.

Figure 2.2: The 4 × 4 grid used to test the Asynchronous Spanning Tree Algorithm, along with two different spanning trees computed in two different runs of the algorithm.

Trace snapshot of the Asynchronous Spanning Tree algorithm

transition: output SEND(search, 1, 5) in automaton sTreeNode(1) on blackbird.csail.mit.edu
  P → [nbrs: (0 2 5), parent: 0, reported: true, send: ioa.runtime.adt.MapSort]
  RM → ioa.runtime.adt.MapSort
  SM → ioa.runtime.adt.MapSort
transition: output SEND(search, 4, 5) in automaton sTreeNode(4) on parrot.csail.mit.edu
  P → [nbrs: (0 5 8), parent: 0, reported: true, send: ioa.runtime.adt.MapSort]
transition: output RECEIVE(search, 5, 1) in automaton sTreeNode(5) on parrot.csail.mit.edu
  P → [nbrs: (1 4 6 9), parent: 1, reported: false, send: ioa.runtime.adt.MapSort]
  RM → ioa.runtime.adt.MapSort
transition: output PARENT(1) in automaton sTreeNode(5) on parrot.csail.mit.edu
  P → [nbrs: (1 4 6 9), parent: 1, reported: true, send: ioa.runtime.adt.MapSort]

2.3 Asynchronous Broadcast Convergecast

This algorithm is essentially an extension of the previous one: along with the construction of a spanning tree, a broadcast and a convergecast take place (using the computed spanning tree). The source node was node 0, and the message 99 (a dummy message) was broadcast on the network. Compared to the previous algorithm, some complexity is added by the fact that different kinds of messages are exchanged.

Automata Definitions The AsynchBcastAck automaton, as defined in [1] (Section 15.3), was used. The process automaton and the composition automaton are shown below. The expanded automaton is given in Appendix C.4.

Asynchronous Broadcast Convergecast process automaton

type Kind = enumeration of bcast, ack
type BCastMsg = tuple of kind: Kind, w: Int
type Message = union of msg: BCastMsg, kind: Kind

automaton bcastProcess(rank: Int, nbrs: Set[Int])
  signature
    input RECEIVE(m: Message, const rank, j: Int)
    output SEND(m: Message, const rank, j: Int)
    internal report(const rank)
  states
    val: Int := -1,   % -1 = special value denoting null
    parent: Int := -1,
    reported: Bool := false,
    acked: Set[Int] := {},
    send: Map[Int, Seq[Message]]
  initially
    rank = 0 ⇒
      (val = 99 ∧   %% 99 = the value to be broadcast
       (∀ j: Int
         ((j ∈ nbrs) ⇒ send[j] = {} ⊢ msg([bcast, val]))))
  transitions
    output SEND(m, rank, j)
      pre m = head(send[j])
      eff send[j] := tail(send[j])
    input RECEIVE(m, rank, j)
      eff
        if m = kind(ack) then
          acked := acked ∪ {j}
        else
          if val = -1 then
            val := m.msg.w;
            parent := j;
            for k: Int in nbrs - {j} do
              send[k] := send[k] ⊢ m
            od
          else
            send[j] := send[j] ⊢ kind(ack)
          fi
        fi
    internal report(rank) where rank = 0
      pre acked = nbrs;
          reported = false
      eff reported := true
    internal report(rank) where rank ≠ 0
      pre parent ≠ -1;
          acked = nbrs - {parent};
          reported = false
      eff send[parent] := send[parent] ⊢ kind(ack);
          reported := true;

Asynchronous Broadcast Convergecast composition automaton

type Message = union of msg: BCastMsg, kind: Kind

automaton bcastNode(i: Int)
  components
    P: bcastProcess(i);
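The interplay of the RECEIVE and report transitions can be checked with a short sequential simulation. The following Python sketch is an illustration, not the generated Java code. It assumes FIFO channels between neighbors, a seeded random scheduler, and that a node performs its report step as soon as it becomes enabled. The names bcast_convergecast, chan and try_report are hypothetical.

```python
import random
from collections import deque


def bcast_convergecast(nbrs, source=0, value=99, seed=0):
    """Return (val, parent, report_order) after the network becomes quiet."""
    rng = random.Random(seed)
    val = {i: None for i in nbrs}
    parent = {i: None for i in nbrs}
    acked = {i: set() for i in nbrs}
    chan = {(i, j): deque() for i in nbrs for j in nbrs[i]}  # FIFO link i -> j
    report_order = []

    def try_report(i):
        # Mirrors the two report transitions (source and non-source).
        if i in report_order or val[i] is None:
            return
        if i == source and acked[i] == nbrs[i]:
            report_order.append(i)
        elif i != source and acked[i] == nbrs[i] - {parent[i]}:
            chan[(i, parent[i])].append("ack")
            report_order.append(i)

    val[source] = value
    for j in nbrs[source]:
        chan[(source, j)].append(("bcast", value))
    try_report(source)  # only fires if the source has no neighbors
    while True:
        busy = [edge for edge, queue in chan.items() if queue]
        if not busy:
            break
        j, i = rng.choice(busy)          # deliver one message from j to i
        m = chan[(j, i)].popleft()
        if m == "ack":
            acked[i].add(j)
        elif val[i] is None:             # first bcast: adopt sender as parent
            val[i], parent[i] = m[1], j
            for k in nbrs[i] - {j}:
                chan[(i, k)].append(m)
        else:                            # duplicate bcast: acknowledge at once
            chan[(i, j)].append("ack")
        try_report(i)
    return val, parent, report_order


square = {0: {1, 2}, 1: {0, 3}, 2: {0, 3}, 3: {1, 2}}
print(bcast_convergecast(square)[2])
```

In every schedule the source is the last node to report, since it needs an acknowledgment from each of its neighbors.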

Results The algorithm was tested on several graphs. One of them is shown in Figure 2.3 (a). Figure 2.3 (b) and (c) depict some of the spanning trees that were computed and the communication sequence; the numbers next to the nodes give the order in which the nodes reported done (through the internal report action). A snapshot of the trace is listed below, showing nodes 1 and 4 reporting that they are done. At that point, node 0 is enabled, and after some communication between these nodes, node 0 also reports done. The complete trace of this run is shown in Appendix E.3.

Figure 2.3: The grid shown in (a) was used to test the algorithm. (b) and (c) show the spanning trees computed in two different runs; the numbers next to the nodes indicate the order in which the nodes reported done.

Trace snapshot of the Asynchronous Broadcast Convergecast algorithm

transition: internal report(1) in automaton bcastNode
Modified state variables:
  P → [val: 99, acked: (2 5), nbrs: (0 2 5), parent: 0, reported: true, send: ioa.runtime.adt.MapSort, temp: [kind: bcast, w: 87]]
transition: output SEND(kind(ack), 1, 0) in automaton bcastNode
Modified state variables:
  SM → ioa.runtime.adt.MapSort
transition: output RECEIVE(kind(ack), 4, 8) in automaton bcastNode
Modified state variables:
  P → [val: 99, acked: (5 8), nbrs: (0 5 8), parent: 0, reported: false, send: ioa.runtime.adt.MapSort, temp: [kind: bcast, w: 87]]
transition: output SEND(kind(ack), 4, 0) in automaton bcastNode
Modified state variables:
  P → [val: 99, acked: (1), nbrs: (1 4), parent: -1, reported: false, send: ioa.runtime.adt.MapSort, temp: [kind: bcast, w: 99]]
  P → [val: 99, acked: (1 4), nbrs: (1 4), parent: -1, reported: false, send: ioa.runtime.adt.MapSort, temp: [kind: bcast, w: 99]]
  RM → ioa.runtime.adt.MapSort
transition: internal report(0) in automaton bcastNode
Modified state variables:
  P → [val: 99, acked: (1 4), nbrs: (1 4), parent: -1, reported: true, send: ioa.runtime.adt.MapSort, temp: [kind: bcast, w: 99]]

2.4 Leader Election Using Broadcast Convergecast

On page 500 of [1], the author describes how the Asynchronous Broadcast Convergecast algorithm can be used to implement leader election on a general graph. The main idea is to have every node act as a source node: it creates its own spanning tree, broadcasts its UID over this spanning tree, and hears from all the other nodes via a convergecast. During this convergecast, along with the acknowledgment message, the children also send what they consider to be the maximum UID in the network. Each parent gathers the maximum UIDs from its children, compares them with its own UID, and sends the maximum to its own parent. Thus, each source node learns the maximum UID in the network, and the node whose UID equals the maximum announces itself as the leader.

Automata Definitions The process and composition automata are shown below. The expanded automaton is deﬁned in Appendix C.5.

Leader Election Using Broadcast Convergecast process automaton

type Kind = enumeration of bcast, ack
type BCastMsg = tuple of kind: Kind, w: Int
type AckMsg = tuple of kind: Kind, mx: Int
type MSG = union of bmsg: BCastMsg, amsg: AckMsg, kind: Kind
type Message = tuple of msg: MSG, source: Int

automaton bcastLeaderProcess(rank: Int)
  signature
    input RECEIVE(m: Message, i: Int, j: Int)
    output SEND(m: Message, i: Int, j: Int)
    internal report(i: Int, source: Int)
    internal finished
    output LEADER
  states
    nbrs: Set[Int],
    val: Map[Int, Int],                  % initially -1 (null) for all nodes
    parent: Map[Int, Int],               % initially -1 (null) for all nodes
    reported: Map[Int, Bool],            % initially false for all nodes
    acked: Map[Int, Set[Int]],           % initially {} for all nodes
    send: Map[Int, Int, Seq[Message]],   % First variable: source.
    max: Map[Int, Int],                  % The max value found in the network, initially i
    elected: Bool := false,
    announced: Bool := false
  initially
    (∀ j: Int
      ((0 ≤ j ∧ j < 16) ⇒
        (rank = j ⇒ val[j] = rank
         ∧ rank ≠ j ⇒ val[j] = -1
         ∧ parent[j] = -1
         ∧ acked[j] = {}
         ∧ max[j] = rank
         ∧ (∀ k: Int
             (send[j, k] = {} ∧
    output SEND(m, i, j)
      pre m = head(send[m.source, j])
      eff send[m.source, j] := tail(send[m.source, j])
    input RECEIVE(m, i, j)
      eff
        if m.msg = kind(ack) then
          acked[m.source] := acked[m.source] ∪ {j}
        elseif tag(m.msg) = amsg then
          if max[m.source] < m.msg.amsg.mx then
            max[m.source] := m.msg.amsg.mx;
          fi;
          acked[m.source] := acked[m.source] ∪ {j}
        else   % BcastMsg
          if val[m.source] = -1 then
            val[m.source] := m.msg.bmsg.w;
            parent[m.source] := j;
            for k: Int in nbrs - {j} do
              send[m.source, k] := send[m.source, k] ⊢ m
            od
          else
            send[m.source, j] := send[m.source, j] ⊢ [kind(ack), m.source]
          fi
        fi
    internal finished
      pre acked[rank] = nbrs ∧ reported[rank] = false
      eff reported[rank] := true;
          if (max[rank] = rank) then
            elected := true
          fi;
    output LEADER
      pre elected = true ∧ announced = false
      eff announced := true
    internal report(i, source) where i ≠ source
      pre parent[source] ≠ -1 ∧
          acked[source] = nbrs - {parent[source]} ∧
          reported[source] = false
      eff send[source, parent[source]] :=
            send[source, parent[source]] ⊢ [amsg([ack, max[source]]), source];
          reported[source] := true;

Leader Election Using Broadcast Convergecast composition automaton

automaton bcastLeaderNode(i: Int)
  components
    P: bcastLeaderProcess(i);
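The way the maximum UID travels up each spanning tree can be sketched in a few lines of Python. This is a simplified illustration, not the automaton itself: the automaton runs all broadcast-convergecast instances concurrently and tags every message with its source, whereas the sketch runs one instance per source in turn, which is possible because all per-source state is independent. The function names learn_max_uid and elect_leader, and the tuple encoding of messages, are assumptions made for the example.

```python
import random
from collections import deque


def learn_max_uid(nbrs, source, rng):
    """Run one broadcast-convergecast rooted at `source`; return the max UID it learns."""
    has_val = {i: i == source for i in nbrs}
    parent = {i: None for i in nbrs}
    acked = {i: set() for i in nbrs}
    best = {i: i for i in nbrs}          # each node starts with its own UID
    reported = set()
    chan = {(i, j): deque() for i in nbrs for j in nbrs[i]}
    for j in nbrs[source]:
        chan[(source, j)].append(("bcast", None))
    while True:
        busy = [edge for edge, queue in chan.items() if queue]
        if not busy:
            break
        j, i = rng.choice(busy)          # deliver one message from j to i
        kind, mx = chan[(j, i)].popleft()
        if kind == "ack":                # plain ack from a non-child
            acked[i].add(j)
        elif kind == "amsg":             # ack from a child, carrying its maximum
            best[i] = max(best[i], mx)
            acked[i].add(j)
        elif not has_val[i]:             # first bcast: join the tree
            has_val[i], parent[i] = True, j
            for k in nbrs[i] - {j}:
                chan[(i, k)].append(("bcast", None))
        else:                            # duplicate bcast
            chan[(i, j)].append(("ack", None))
        if (i != source and has_val[i] and i not in reported
                and acked[i] == nbrs[i] - {parent[i]}):
            chan[(i, parent[i])].append(("amsg", best[i]))
            reported.add(i)
    return best[source]


def elect_leader(nbrs, seed=0):
    """Return the nodes whose own UID equals the maximum they learned."""
    rng = random.Random(seed)
    return [i for i in sorted(nbrs) if learn_max_uid(nbrs, i, rng) == i]


path = {0: {1}, 1: {0, 2}, 2: {1}}
print(elect_leader(path))  # [2]
```

Exactly one node, the one with the largest UID, finds that the maximum it learned equals its own UID.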

Results This algorithm was also tested on the 4 × 4 grid that was used for the two previous algorithms (see Figure 2.3 (a)). The algorithm terminated correctly and announced node 15 as the leader. A snapshot of the trace is shown below. The complete trace of this run can be found on the web, at http://theory.csail.mit.edu:/~pmavrom/ioaReport.html


Trace snapshot of the Leader Election Using Broadcast Convergecast algorithm

transition: output RECEIVE([msg: amsg([kind: ack, mx: 13]), source: 14], 14, 10) in automaton bcastLeader(14)
  on drake.csail.mit.edu at 9:07:20:455
Modified state variables:
  P → Tuple, modified fields: {[acked -> Map, modified entries: {[14 -> (10 13 15)]}]}
  SM → Map, modified entries: {[10 -> Tuple, modified fields: {[toSend -> Sequence, elements added: {} Elements removed: {[msg: amsg([kind: ack, mx: 14]), source: 11]}] [sent -> Sequence, elements added: {[msg: amsg([kind: ack, mx: 14]), source: 11]} Elements removed: {}]}]}
  c2 → 7
transition: internal finished() in automaton bcastLeader(14)
  on drake.csail.mit.edu at 9:07:20:464
  P → Tuple, modified fields: {[reported -> Map, modified entries: {[14 -> true]}]}
  c2 → 14
  k → 15
transition: output RECEIVE([msg: amsg([kind: ack, mx: 14]), source: 15], 15, 11) in automaton bcastLeader(15)
  on loon.csail.mit.edu at 9:07:20:684
Modified state variables:
  P → Tuple, modified fields: {[acked -> Map, modified entries: {[15 -> (11 14)]}]}
  SM → Map, modified entries: {[11 -> Tuple, modified fields: {[toSend -> Sequence, elements added: {} Elements removed: {[msg: amsg([kind: ack, mx: 15]), source: 4]}] [sent -> Sequence, elements added: {[msg: amsg([kind: ack, mx: 15]), source: 4]} Elements removed: {}]}]}
  c2 → 7
transition: internal finished() in automaton bcastLeader(15)
  on loon.csail.mit.edu at 9:07:20:687
  P → Tuple, modified fields: {[reported -> Map, modified entries: {[15 -> true]}] [elected -> true]}
  c2 → 15
transition: output LEADER() in automaton bcastLeader(15)
  on loon.csail.mit.edu at 9:07:20:689
  P → Tuple, modified fields: {[announced -> true]}

2.5 Unrooted Spanning Tree to Leader Election

The algorithm STtoLeader of [1] (page 501) was implemented as the next test for the Toolkit. The algorithm takes as input an unrooted spanning tree and returns a leader. The automaton listed below was written according to the description of the algorithm.


Unrooted Spanning Tree to Leader Election process automaton

type Status = enumeration of idle, elected, announced
type Message = enumeration of elect

axioms ChoiceSet(Int for E)

automaton sTreeLeaderProcess(rank: Int, nbrs: Set[Int])
  signature
    input RECEIVE(m: Message, const rank: Int, j: Int)
    output SEND(m: Message, const rank: Int, j: Int)
    output leader
  states
    receivedElect: Set[Int],
    sentElect: Set[Int],
    status: Status,
    send: Map[Int, Seq[Message]]
  initially
    true ⇒
      receivedElect = {}
      ∧ sentElect = {}
      ∧ status = idle
      ∧ (size(nbrs) = 1 ⇒
          send[chooseRandom(nbrs)]
            = send[chooseRandom(nbrs)] ⊢ elect)
  transitions
    input RECEIVE(m, i, j)
      eff
        receivedElect := insert(j, receivedElect);
        if size(receivedElect) = size(nbrs) - 1 then
          send[chooseRandom(nbrs - receivedElect)] :=
            send[chooseRandom(nbrs - receivedElect)] ⊢ elect;
          sentElect :=
            insert(chooseRandom(nbrs - receivedElect), sentElect)
        elseif receivedElect = nbrs then
          if j ∈ sentElect then
            if i > j then status := elected fi
          else
            status := elected
          fi
        fi
    output SEND(m, i, j)
      pre m = head(send[j])
      eff send[j] := tail(send[j])
    output leader
      pre status = elected
      eff status := announced

Unrooted Spanning Tree to Leader Election composition automaton

automaton sTreeLeaderNode(rank: Int, nbrs: Set[Int])
  components
    P: sTreeLeaderProcess(rank, nbrs);
    RM[j: Int]: ReceiveMediator(Message, Int, rank, j);
    SM[j: Int]: SendMediator(Message, Int, rank, j)
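The election rule can be tried out on small trees with the following Python sketch. It follows the textual description of STtoLeader rather than the IOA listing line by line: every node records the one neighbor to which it sent elect (including the leaves, which send immediately), and a node that receives elect back over that same edge is elected if it has the larger UID. The scheduler is a seeded random choice among in-flight messages, and the names st_to_leader, sent_to and in_flight are hypothetical.

```python
import random


def st_to_leader(nbrs, seed=0):
    """Return the list of elected nodes for the tree given by neighbor sets."""
    rng = random.Random(seed)
    received = {i: set() for i in nbrs}
    sent_to = {i: None for i in nbrs}    # target of this node's single elect
    in_flight = []                       # (sender, receiver) pairs in transit
    leaders = []

    def maybe_send(i):
        # Send elect once exactly one neighbor has not been heard from.
        remaining = nbrs[i] - received[i]
        if sent_to[i] is None and len(remaining) == 1:
            (k,) = remaining
            sent_to[i] = k
            in_flight.append((i, k))

    for i in nbrs:
        if not nbrs[i]:
            leaders.append(i)            # a single isolated node elects itself
        maybe_send(i)                    # leaves start the algorithm
    while in_flight:
        j, i = in_flight.pop(rng.randrange(len(in_flight)))
        received[i].add(j)
        if received[i] == nbrs[i]:
            # The edge {i, j} carried elect in both directions: larger UID wins.
            if sent_to[i] == j and i > j:
                leaders.append(i)
        else:
            maybe_send(i)
    return leaders


path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
print(st_to_leader(path, seed=3))
```

Which node wins depends on the schedule, because the schedule decides which edge ends up carrying elect messages in both directions, but exactly one node is elected in every run.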

Results The algorithm was tested on several graphs, with different spanning trees as input. A spanning tree on a 4 × 4 grid is shown in Figure 2.4. The algorithm worked correctly, announcing only one node as the leader. A snapshot of the trace is listed below. In that specific run, the edge between nodes 5 and 6 had elect messages sent in both directions, and node 6, having the larger UID, was elected as the leader. The complete trace of this run can be found in Appendix E.4.

Figure 2.4: (a) shows the unrooted spanning tree used in one run. In (b), the arrows show the direction of the messages in a specific run, where the edge between nodes 5 and 9 had elect messages sent in both directions. Node 9, with the larger UID, was elected as the leader.

Trace Snapshot of the Unrooted Spanning Tree to Leader Election

transition: output SEND(elect, 6, 5) in automaton sTreeLeader(6)
  on drake.csail.mit.edu at 9:29:15:112
  P → Tuple, modified fields: {[send -> Map, modified entries: {[5 -> Sequence, elements added: {} Elements removed: {elect}]}]}
  Sequence, elements added: {elect} Elements removed: {}]}]}
  k → 5
  tempNbrs2 → (10 2)
transition: output SEND(elect, 5, 6) in automaton sTreeLeader(5)
  on loon.csail.mit.edu at 9:29:16:182
  P → Tuple, modified fields: {[send -> Map, modified entries: {[6 -> Sequence, elements added: {} Elements removed: {elect}]}]}
  Sequence, elements added: {elect} Elements removed: {}]}]}
  k → 6
  tempNbrs2 → (1 4 9)
transition: output RECEIVE(elect, 5, 6) in automaton sTreeLeader(5)
  on loon.csail.mit.edu at 9:29:16:494
  P → Tuple, modified fields: {[receivedElect -> (1 4 6 9)]}
transition: output RECEIVE(elect, 6, 5) in automaton sTreeLeader(6)
  on drake.csail.mit.edu at 9:29:16:554
  P → Tuple, modified fields: {[receivedElect -> (10 2 5)] [status -> elected]}
  Sequence, elements added: {} Elements removed: {elect}] [sent -> Sequence, elements added: {elect} Elements removed: {}]}]}
transition: output leader() in automaton sTreeLeader(6)
  on drake.csail.mit.edu at 9:29:16:559
Modified state variables:
  P → Tuple, modified fields: {[status -> announced]}

2.6 GHS Minimum Spanning Tree

The last algorithm we ran using the Toolkit was the algorithm of Gallager, Humblet and Spira for finding the minimum-weight spanning tree in an arbitrary graph. Welch, Lamport and Lynch give an I/O automaton definition of the GHS algorithm in [2], which is used as a basis for our automata definitions. One technical detail in running this version of the algorithm is that the edge weights needed to be unique. This algorithm is significantly longer than the previous ones, with 7 different messages exchanged and 12 state variables for each automaton.
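Because the edge weights are unique, the minimum spanning tree is unique as well, so the set of edges reported as in the tree by a distributed run can be compared against the result of any sequential algorithm. The following Python sketch uses Kruskal's algorithm for that purpose. It is not part of the Toolkit or of the GHS automaton; the function name kruskal_mst and the edge format (u, v, weight) are assumptions made for illustration.

```python
def kruskal_mst(num_nodes, weighted_edges):
    """Return the MST as a set of (min(u, v), max(u, v)) pairs.

    weighted_edges is a list of (u, v, weight) with pairwise distinct weights.
    """
    weights = [w for _, _, w in weighted_edges]
    if len(set(weights)) != len(weights):
        raise ValueError("edge weights must be unique")
    root = list(range(num_nodes))        # union-find forest

    def find(x):
        while root[x] != x:
            root[x] = root[root[x]]      # path halving
            x = root[x]
        return x

    tree = set()
    for u, v, _ in sorted(weighted_edges, key=lambda e: e[2]):
        ru, rv = find(u), find(v)
        if ru != rv:                     # edge joins two different fragments
            root[ru] = rv
            tree.add((min(u, v), max(u, v)))
    return tree


edges = [(0, 1, 1), (1, 2, 2), (2, 3, 3), (0, 3, 4), (0, 2, 5)]
print(sorted(kruskal_mst(4, edges)))  # [(0, 1), (1, 2), (2, 3)]
```

A reference result of this kind gives a simple correctness check for a run on a connected graph: every undirected edge in the returned set should be reported as a tree edge by its endpoints, and every other edge as a non-tree edge.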

Automata Definitions The automaton below defines the algorithm, and the composition with the mediator automata is shown after that. The expanded automaton can be found in Appendix C.7.

GHS process automaton

type Nstatus = enumeration of sleeping, find, found
type Edge = tuple of s: Int, t: Int
type Link = tuple of s: Int, t: Int
type Lstatus = enumeration of unknown, branch, rejected

% Message Types
type Msg = enumeration of CONNECT, INITIATE, TEST, REPORT, ACCEPT, REJECT, CHANGEROOT
type ConnMsg = tuple of msg: Msg, l: Int
type Status = enumeration of find, found
type InitMsg = tuple of msg: Msg, l: Int, c: Null[Edge], st: Status
type TestMsg = tuple of msg: Msg, l: Int, c: Null[Edge]
type ReportMsg = tuple of msg: Msg, w: Int
type Message = union of connMsg: ConnMsg, initMsg: InitMsg, testMsg: TestMsg, reportMsg: ReportMsg, msg: Msg

uses ChoiceSet(Link)

%%
% automaton GHSProcess: Process of GHS Algorithm for min. spanning tree
% rank: The UID of the automaton, automatically initialized by MPI
% size: The size of the network, automatically initialized by MPI
% links: Set of Links with source = rank (L_p(G))
% weight: Maps the Links ∈ links to their weight
%%
automaton GHSProcess(rank: Int, size: Int, links: Set[Link], weight: Map[Link, Int])
  signature
    input startP
    input RECEIVE(m: Message, const rank, i: Int)
    output InTree(l: Link)
    output NotInTree(l: Link)
    output SEND(m: Message, const rank, j: Int)
    internal ReceiveConnect(qp: Link, l: Int)
    internal ReceiveInitiate(qp: Link, l: Int, c: Null[Edge], st: Status)
    internal ReceiveTest(qp: Link, l: Int, c: Null[Edge])
    internal ReceiveAccept(qp: Link)
    internal ReceiveReject(qp: Link)
    internal ReceiveReport(qp: Link, w: Int)
    internal ReceiveChangeRoot(qp: Link)
  states
    nstatus: Nstatus,
    nfrag: Null[Edge],
    nlevel: Int,
    bestlink: Null[Link],
    bestwt: Int,
    testlink: Null[Link],
    inbranch: Link,
    findcount: Int,
    lstatus: Map[Link, Lstatus],
    queueOut: Map[Link, Seq[Message]],
    queueIn: Map[Link, Seq[Message]],
    answered: Map[Link, Bool],
    % Temporary variables
    min: Int,
    minL: Null[Link],
    S: Set[Link]
  initially
    true ⇒ nstatus = sleeping
      ∧ nfrag = nil
      ∧ nlevel = 0
      ∧ bestlink = embed(chooseRandom(links))
      ∧ bestwt = weight[chooseRandom(links)]
      ∧ testlink = nil
      ∧ inbranch = chooseRandom(links)
      ∧ findcount = 0
      ∧ ∀ l: Link (l ∈ links ⇒
          lstatus[l] = unknown
          ∧ answered[l] = false
          ∧ queueOut[l] = {}
          ∧ queueIn[l] = {})
  transitions

    % INPUT ACTIONS
    input startP
      eff
        if nstatus = sleeping then
          % WakeUp
          minL := embed(chooseRandom(links));
          min := weight[minL.val];
          lstatus[minL.val] := branch;
          nstatus := found;
          queueOut[minL.val] := queueOut[minL.val] ⊢ connMsg([CONNECT, 0]);
          % EndWakeUp
        fi
    input RECEIVE(m: Message, i: Int, j: Int)
      eff queueIn[[i, j]] := queueIn[[i, j]] ⊢ m

    % OUTPUT ACTIONS
    output InTree(l: Link)
      pre answered[l] = false ∧ lstatus[l] = branch
      eff answered[l] := true
    output NotInTree(l: Link)
      pre answered[l] = false ∧ lstatus[l] = rejected
      eff answered[l] := true
    output SEND(m: Message, i: Int, j: Int)
      pre m = head(queueOut[[i, j]])
      eff queueOut[[i, j]] := tail(queueOut[[i, j]])

    % INTERNAL ACTIONS
    internal ReceiveConnect(qp: Link, l: Int)
      pre head(queueIn[qp]) = connMsg([CONNECT, l])
      eff queueIn[qp] := tail(queueIn[qp]);
          if nstatus = sleeping then
            % WakeUp
            for tempL: Link in links do
              if weight[tempL] < min then
                minL := embed(tempL);
                min := weight[tempL]
              fi;
            od;
            % EndWakeUp
          fi;
          if l < nlevel then
            lstatus[[qp.t, qp.s]] := branch;
            if testlink = nil then
              queueOut[[qp.t, qp.s]] := queueOut[[qp.t, qp.s]] ⊢ initMsg([INITIATE, nlevel, nfrag, find]);
              findcount := findcount + 1
            else
              queueOut[[qp.t, qp.s]] := queueOut[[qp.t, qp.s]] ⊢ initMsg([INITIATE, nlevel, nfrag, found])
            fi;
          else
            if lstatus[[qp.t, qp.s]] = unknown then
              queueIn[qp] := queueIn[qp] ⊢ connMsg([CONNECT, l])
            else
              queueOut[[qp.t, qp.s]] := queueOut[[qp.t, qp.s]] ⊢ initMsg([INITIATE, nlevel+1, embed([qp.t, qp.s]), find])
            fi
          fi

    internal ReceiveInitiate(qp: Link, l: Int, c: Null[Edge], st: Status)
      pre head(queueIn[qp]) = initMsg([INITIATE, l, c, st])
      eff queueIn[qp] := tail(queueIn[qp]);
          nlevel := l;
          nfrag := c;
          if st = find then nstatus := find else nstatus := found fi;
          % - Let S = {[p, q] : lstatus[[p, r]] = branch, r ≠ q} -
          S := {};
          for pr: Link in links do
            if pr.t ≠ qp.s ∧ lstatus[pr] = branch then
              S := S ∪ {pr}
            fi
          od;
          for k: Link in S do
            queueOut[k] := queueOut[k] ⊢ initMsg([INITIATE, l, c, st])
          od;
          if st = find then
            inbranch := [qp.t, qp.s];
            bestlink := nil;
            bestwt := 10000000; % Infinity
            % TestP
            minL := nil;
            min := 10000000; % Infinity
            for tempL: Link in links do
              if weight[tempL] < min ∧ lstatus[tempL] = unknown then
                minL := embed(tempL);
                min := weight[tempL]
              fi;
            od;
            if minL ≠ nil then
              testlink := minL;
              queueOut[minL.val] := queueOut[minL.val] ⊢ testMsg([TEST, nlevel, nfrag]);
            else
              testlink := nil;
              % Report
              if findcount = 0 ∧ testlink = nil then
                nstatus := found;
                queueOut[inbranch] := queueOut[inbranch] ⊢ reportMsg([REPORT, bestwt])
              fi
              % EndReport
            fi;
            % EndTestP
            findcount := size(S)
          fi

    internal ReceiveTest(qp: Link, l: Int, c: Null[Edge])
      pre head(queueIn[qp]) = testMsg([TEST, l, c])
      eff queueIn[qp] := tail(queueIn[qp]);
          if nstatus = sleeping then
            % WakeUp
            for tempL: Link in links do
              if weight[tempL] < min then
                minL := embed(tempL);
                min := weight[tempL]
              fi;
            od;
            % EndWakeUp
          fi;
          if l > nlevel then
            queueIn[qp] := queueIn[qp] ⊢ testMsg([TEST, l, c]);
          else
            if c ≠ nfrag then
              queueOut[[qp.t, qp.s]] := queueOut[[qp.t, qp.s]] ⊢ msg(ACCEPT)
              queueOut[[qp.t, qp.s]] := queueOut[[qp.t, qp.s]] ⊢ msg(REJECT)
            else
              % Test
              fi; od;
            else
              % Report
              % EndReport
            fi;
            % EndTest
            fi;
          fi; fi;

    internal ReceiveAccept(qp: Link)
      pre head(queueIn[qp]) = msg(ACCEPT)
      eff queueIn[qp] := tail(queueIn[qp]);
          if weight[[qp.t, qp.s]] < bestwt then
            bestlink := embed([qp.t, qp.s]);
            bestwt := weight[[qp.t, qp.s]];
          fi;
          % Report
          % EndReport
    internal ReceiveReject(qp: Link)
      pre head(queueIn[qp]) = msg(REJECT)
      eff queueIn[qp] := tail(queueIn[qp]);
          if lstatus[[qp.t, qp.s]] = unknown then lstatus[[qp.t, qp.s]] := rejected fi;
          % Test
          fi; od;
          else
            % Report
            % EndReport
          fi
          % EndTest

    internal ReceiveReport(qp: Link, w: Int)
      pre head(queueIn[qp]) = reportMsg([REPORT, w])
      eff queueIn[qp] := tail(queueIn[qp]);
          if [qp.t, qp.s] ≠ inbranch then
            findcount := findcount - 1;
            if w < bestwt then
              bestwt := w;
              bestlink := embed([qp.t, qp.s])
            fi;
            % Report
            % EndReport
          else
            if nstatus = find then
              queueIn[qp] := queueIn[qp] ⊢ reportMsg([REPORT, w])
            elseif w > bestwt then
              % ChangeRoot
              if lstatus[bestlink.val] = branch then
                queueOut[bestlink.val] := queueOut[bestlink.val] ⊢ msg(CHANGEROOT)
              else
                queueOut[bestlink.val] := queueOut[bestlink.val] ⊢ connMsg([CONNECT, nlevel]);
                lstatus[bestlink.val] := branch
              fi
              % EndChangeRoot
            fi
          fi

    internal ReceiveChangeRoot(qp: Link)
      pre head(queueIn[qp]) = msg(CHANGEROOT)
      eff queueIn[qp] := tail(queueIn[qp]);
          % ChangeRoot
          if lstatus[bestlink.val] = branch then
            queueOut[bestlink.val] := queueOut[bestlink.val] ⊢ msg(CHANGEROOT)
          else
            queueOut[bestlink.val] := queueOut[bestlink.val] ⊢ connMsg([CONNECT, nlevel]);
            lstatus[bestlink.val] := branch
          fi
          % EndChangeRoot

automaton GHSNode(rank: Int, size: Int, links: Set[Link], weight: Map[Link, Int])
  components
    P: GHSProcess(rank, size, links, weight);
    RM[j: Int]: ReceiveMediator(Message, Int, rank, j);
    SM[j: Int]: SendMediator(Message, Int, rank, j)

Results One of the graphs used to test the algorithm is shown in Figure 2.5(a). The (unique) minimum spanning tree computed by the algorithm is shown in Figure 2.5(b). A snapshot of the trace follows, showing the beginning of the algorithm: node 0 wakes up (after a startP action), sends a CONNECT message over its minimum-weight outgoing edge (to node 1), and reports this edge as part of the minimum spanning tree. The complete trace of this run can be found at http://theory.csail.mit.edu:/~pmavrom/ioaReport.html. The algorithm ran correctly, returning the unique minimum spanning tree in all cases.

Figure 2.5: (a) One of the graphs used to test the GHS algorithm. (b) The (unique) minimum spanning tree computed by the algorithm.

Trace snapshot of the GHS algorithm

Initialization starts (0) on loon.csail.mit.edu at 7:05:55:830
Modified state variables:
P → [nstatus: sleeping, nfrag: nil, nlevel: 87, bestlink: nil, bestwt: 87, testlink: nil, inbranch: [s: 87, t: 87], findcount: 87, lstatus: Map{}, queueOut: Map{}, queueIn: Map{}, answered: Map{}, min: 87, minL: nil, S: ()]
RM → Map{}
SM → Map{}
links → ([s: 0, t: 1] [s: 0, t: 4])
lnk → [s: 87, t: 87]
lnks → ()
rank → null
size → null
tempL → [s: 87, t: 87]
tempLinks → ()
weight → Map{[[s: 0, t: 1] -> 8] [[s: 0, t: 4] -> 12]}
Initialization ends

transition: input startP() in automaton GHS(0) on loon.csail.mit.edu at 7:05:55:852
P → [nstatus: found, nfrag: nil, nlevel: 0, bestlink: embed([s: 0, t: 4]), bestwt: 12, testlink: nil, inbranch: [s: 0, t: 1], findcount: 0, lstatus: Map{[[s: 0, t: 1] -> branch] [[s: 0, t: 4] -> unknown]}, queueOut: Map{[[s: 0, t: 1] -> {connMsg([msg: CONNECT, l: 0])}] [[s: 0, t: 4] -> {}]}, queueIn: Map{[[s: 1, t: 0] -> {}] [[s: 4, t: 0] -> {}]}, answered: Map{[[s: 0, t: 1] -> false] [[s: 0, t: 4] -> false]}, min: 8, minL: embed([s: 0, t: 1]), S: ()]
RM → Map{[1 -> [status: idle, toRecv: {}, ready: false]] [4 -> [status: idle, toRecv: {}, ready: false]]}
SM → Map{[1 -> [status: idle, toSend: {}, sent: {}, handles: {}]] [4 -> [status: idle, toSend: {}, sent: {}, handles: {}]]}
links → ([s: 0, t: 1] [s: 0, t: 4])
lnk → [s: 87, t: 87]
lnks → ()
rank → 0
size → 16
tempL → [s: 0, t: 4]
tempLinks → ()
weight → Map{[[s: 0, t: 1] -> 8] [[s: 0, t: 4] -> 12]}

transition: output SEND(connMsg([msg: CONNECT, l: 0]), 0, 1) in automaton GHS(0) on loon.csail.mit.edu at 7:05:55:864
P → Tuple, modified fields: {[queueOut -> Map, modified entries: {[[s: 0, t: 1] -> Sequence, elements added: {} Elements removed: {connMsg([msg: CONNECT, l: 0])}]}]}
Sequence, elements added: {connMsg([msg: CONNECT, l: 0])} Elements removed: {}]}]}
lnk → Tuple, modified fields: {[s -> 0] [t -> 1]}

transition: output InTree([s: 0, t: 1]) in automaton GHS(0) on loon.csail.mit.edu at 7:05:55:915
P → Tuple, modified fields: {[answered -> Map, modified entries: {[[s: 0, t: 1] -> true]}]}
Sequence, elements added: {} Elements removed: {connMsg([msg: CONNECT, l: 0])}]
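The state recorded after the startP transition in the trace above can be reproduced with a small sketch of the WakeUp step. This is an illustration in Python, not the generated Java code. The function name wake_up and the tuple encoding of links are hypothetical, and the sketch assumes that WakeUp selects the minimum-weight incident link, as the values min: 8 and minL: embed([s: 0, t: 1]) in the trace suggest.

```python
def wake_up(links, weight):
    """Sketch of WakeUp: mark the cheapest link as a branch and queue CONNECT(0)."""
    # Assumption: the minimum-weight incident link is selected (weights distinct).
    min_link = min(links, key=lambda link: weight[link])
    lstatus = {link: "unknown" for link in links}
    queue_out = {link: [] for link in links}
    lstatus[min_link] = "branch"
    queue_out[min_link].append(("CONNECT", 0))
    nstatus = "found"
    return nstatus, lstatus, queue_out


# Links and weights of node 0, as they appear in the trace.
links = [(0, 1), (0, 4)]
weight = {(0, 1): 8, (0, 4): 12}
nstatus, lstatus, queue_out = wake_up(links, weight)
print(nstatus, lstatus, queue_out)
```

With these inputs the link (0, 1) becomes a branch and carries the single queued CONNECT message, which matches the lstatus and queueOut entries of the trace.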


3 Results

Figures 3.6, 3.7 and 3.8 depict the runtime results of the algorithms with one node per machine, with a constant number of nodes, and with a constant number of machines, respectively. Runtimes were measured in seconds, from the moment the first node started initialization until the algorithm terminated, and each value is an average over 10 runs. The tables in Appendix D show the raw data collected from our experiments. In Figures 3.6(a) to (b), the relationship between runtime and number of nodes is not very clear, because of the short running times and the small number of nodes and messages involved. Figure 3.6(d), however, where the number of messages exchanged is large, shows a clear linear relation.

4 Conclusions

During June and July 2004 we had the opportunity to test the capabilities and the performance of the version of the Toolkit that was then available. We started by manually modifying most of the Java code the Toolkit generated in order to get LCR up and running. The Toolkit was then gradually modified to automate the manual changes we had made in Java. By the time we reached the Spanning Tree to Leader algorithm, the Toolkit produced, without manual intervention, Java code that compiled and ran successfully. After that, and after making sure that simple algorithms such as a simple broadcast and convergecast could be run, we implemented a more complicated algorithm, the GHS algorithm for computing the minimum spanning tree in an arbitrary graph. The successful implementation of this algorithm makes us very confident that the Toolkit can be used to implement complex distributed systems.

Performance We now comment on some performance issues:

1. Scalability: As Figure 3.6 suggests, the Toolkit is scalable. Doubling the number of nodes increases the runtime, but by no more than a factor of two. Moreover, Figure 3.6(e) shows that the running time grows more slowly as the number of messages increases.

2. Setup time: The time required for MPI to set up all the connections and enable the nodes to initialize is not included in the runtime results. However, when the number of nodes was large, this time was also quite significant (around 5-10 minutes).

3. Resource usage: The generated programs used all the processing power available to them during the whole run. This was mainly because there was no pause between successive tests for incoming or outgoing messages.
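The setup time in item 2 is plausibly related to the number of connections involved: MPI connects every pair of nodes, whereas LCR only uses the links of the ring. The following Python sketch simply counts both. The function names ring_links and all_pairs are illustrative and are not part of the Toolkit or of MPI.

```python
def ring_links(n):
    """Directed links used by LCR: node i sends only to node (i + 1) mod n."""
    return [(i, (i + 1) % n) for i in range(n)]


def all_pairs(n):
    """Unordered node pairs, one per connection when every pair is connected."""
    return [(i, j) for i in range(n) for j in range(i + 1, n)]


for n in (4, 8, 16):
    # The ring grows linearly in n, the all-pairs count quadratically.
    print(n, len(ring_links(n)), len(all_pairs(n)))
```

The all-pairs count is n(n-1)/2, so doubling the number of nodes roughly quadruples the number of connections, while the number of ring links only doubles.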

5 Comments and Suggestions on the Toolkit

• The NDR language, used both in the scheduler and in the initialization block, could be extended to support the for v:T in s:Set[T] statement, which is already available in IOA. This would make the code much cleaner: with the current syntax, such loops have to be written as a while loop with two or more extra variables.

Figure 3.7: (a) to (c) show the time taken to terminate as a function of the number of physical machines used.

Figure 3.8: (a) to (c) show the time taken to terminate as a function of the number of nodes.

• Initialization of Map (or Array) entries. The toolkit allows the use of possibly uninitialized Map/Array entries. As a result, when the compiled code runs, the Java environment throws a NullPointerException as soon as such an entry is accessed. The following example illustrates the problem; possible solutions are discussed after it:

type Tup = tuple of a: Int, b: Bool
automaton Test
  signature ...
  states
    SM: Map[Tup]
  initially
    (true ⇒ SM[1].a = 3) det do
      SM[1].a := 3
  ...

Here the user tries to initialize a field of the tuple before the tuple itself has been constructed, which is done with a statement such as SM[1] := [3, true]. One possible solution would be for the toolkit to check that the initially block contains an initialization statement of the form SM[1] := ... before any access of the form SM[1].a, and to print a warning if none exists. Another solution would be to initialize all Map/Array entries arbitrarily. The toolkit already supports this kind of initialization, but again, if it does not occur, a warning message could be printed.

• In GHS, some parts of the IOA code were used more than once. The code given in [2] makes use of procedures, i.e. blocks of statements that can be declared once and then called as many times as necessary. Given the complexity of GHS, such a feature would be quite useful during design, coding and debugging. For GHS, there was no need to pass parameters to the procedures. Therefore, as a first step, simple support for procedures without parameter passing would be enough.

• The use of MPI for communication has both advantages and disadvantages. Implementing the Toolkit was probably easier with MPI, since no low-level communication programming was necessary. Furthermore, MPI has been in use and tested for a long time, and it does work.

However, it introduces some issues that could have been avoided. First, most of the error messages coming from MPI are far from descriptive (see Troubleshooting #4). Moreover, MPI sets up a connection between all pairs of nodes, even if these connections are not needed: an n-node LCR needs only n connections, while MPI sets up Θ(n²) connections. This is probably why MPI takes a long time to set up when the number of nodes is large.

We suggest that the possibility of using another communication interface that gives more control over these issues (e.g. Java RMI) be examined in the future.

• Finally, we have tested the possibility of pausing between consecutive tests for incoming and outgoing messages (MPI's test and Iprobe). Currently a node may spin in a loop probing for messages for tens of seconds, using up all the processing power available to it. Forcing the nodes to sleep for a few milliseconds before probing again improved performance when the number of nodes per physical machine is large. We measured the runtime of LCR leader election with 20 nodes on a single machine, using different sleep times. The results are shown below:

Sleep time (milliseconds)    Runtime of LCR on 20 nodes, 1 machine (seconds)
100                          40
50                           25
25                           13
10                           8
1                            13
0                            121
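The pausing strategy measured in the table can be sketched as a polling loop that sleeps between probes. The sketch below is in Python rather than the generated Java. The names poll_until_ready and probe are hypothetical, with probe standing in for a non-blocking test such as MPI's Iprobe, and the max_probes bound is an addition that only keeps the example from running forever.

```python
import time


def poll_until_ready(probe, sleep_ms, max_probes=1000):
    """Call probe() until it returns True, sleeping sleep_ms between attempts.

    Returns the number of probes that were needed.
    """
    probes = 0
    while probes < max_probes:
        probes += 1
        if probe():
            return probes
        if sleep_ms > 0:
            # Yield the processor instead of spinning; with sleep_ms == 0
            # the loop degenerates into the busy wait described above.
            time.sleep(sleep_ms / 1000.0)
    raise TimeoutError("no message arrived within max_probes attempts")


def make_probe(ready_after):
    """Build a stand-in probe that reports a message on its ready_after-th call."""
    state = {"calls": 0}

    def probe():
        state["calls"] += 1
        return state["calls"] >= ready_after

    return probe


print(poll_until_ready(make_probe(5), sleep_ms=10))
```

The measurements above suggest a trade-off in choosing sleep_ms: with no sleep the nodes compete for the processor (121 seconds), while long sleeps delay the delivery of messages that are already available (40 seconds at 100 ms).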

References

[1] Nancy Lynch. Distributed Algorithms. Morgan Kaufmann Publishers, San Mateo, CA, 1996.

[2] L. Lamport, J. Welch, and N. Lynch. A lattice-structured proof of a minimum spanning tree algorithm. In Proceedings of the 7th ACM Symposium on Principles of Distributed Computing, pages 28–43, August 1988.


A Troubleshooting

1. Sometimes, during the composition of the process and mediator automata, the composer produced two SEND and/or two RECEIVE transitions. This must be fixed before generating Java code, because otherwise the generated code will not compile. To prevent it, make sure that the where clauses on the SEND and RECEIVE actions in the process automaton match the where clauses for the SendMediator and ReceiveMediator component automata in the node composition.

2. Send/Receive convention. In LCR, we used the convention of [1] for the Send and Receive transitions: Send(m, i, j) and Receive(m, j, i). This means that i is always the sender and j is always the receiver. This convention sometimes caused problems that were hard to debug. We found that a different convention made problems easier to understand and debug: Send(m, i, j) and Receive(m, i, j), meaning that node i sends to node j and node i receives from node j, respectively.

3. Currently the NDR language, used in the schedule and initially blocks, does not support the for v:T in S:Set[T] statement. Such loops can be written as a while loop of the form:

uses ChoiceSet ...
schedule
  states
    tempNbrs: Set[Int],
    k: Int
  do
    tempNbrs := P.nbrs;
    while (¬isEmpty(tempNbrs)) do
      k := chooseRandom(tempNbrs);
      tempNbrs := delete(k, tempNbrs);
      % fire action on neighbor k
    od;
  od

4. MPI raises a SIGSEGV error if the program tries to address a node using a negative number as the rank.

5. Several times when running an algorithm we encountered a java.lang.NullPointerException during the initialization or the first transitions of the automata. Almost always, this was due to incorrect initialization of the state variables in IOA.
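The while-loop workaround of item 3 follows a general pattern: copy the set, then repeatedly choose an element, delete it from the copy, and act on it. A minimal Python sketch of the same pattern is given below. The names visit_all and fire are hypothetical, and fire stands for whatever action the schedule would fire for neighbor k.

```python
import random


def visit_all(nbrs, fire, rng=random):
    """Visit every element of nbrs exactly once, in random order."""
    temp_nbrs = set(nbrs)                  # tempNbrs := P.nbrs (work on a copy)
    while temp_nbrs:                       # while ¬isEmpty(tempNbrs)
        k = rng.choice(sorted(temp_nbrs))  # k := chooseRandom(tempNbrs)
        temp_nbrs.discard(k)               # tempNbrs := delete(k, tempNbrs)
        fire(k)                            # fire action on neighbor k


visited = []
visit_all({1, 4, 7}, visited.append)
print(sorted(visited))
```

The loop terminates because each iteration removes one element from the copy, and the original set is left untouched. These are the two properties that the extra variables tempNbrs and k provide in the NDR version.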

B Bug Reports

1. In the LCRNode automaton some types, such as Status and States[LCRProcess], were not declared automatically, and had to be declared manually. The latest version of the toolkit generates these type definitions automatically. [Fixed]

2. The transitions that connect to MPI have different signatures, which the toolkit did not support (e.g. Iprobe(i_v1, j_v2) and resp_Iprobe(i_v3, j_v4) were not recognized as an MPI pair). We manually changed them to have exactly the same signature, for example Iprobe(i_v1, j_v2) and resp_Iprobe(i_v1, j_v2). The latest version of the toolkit recognizes them as pairs even if they have different names. [Fixed]

3. In LCR a statement (const 0) was produced that was not recognized by the code generator. We manually changed the IL file produced to (const v999), and declared the variable v999 as something like (v999 zero s8 (scope 1)), where s8 was the sort corresponding to NatSort. [Pending]

4. Formal parameters were not translated into Java. We manually declared the field in the generated code, and initialized it to the correct value. The toolkit now translates all formal parameters and automatically initializes three special parameters: rank, size (of type Int or Nat) and hostName (of type String). [Fixed]

5. The initialization of the arguments from MPI (the statement MPI.Init(args);) had to happen in the main() method of the class LCRNode and not in ioa.runtime.Automaton. Otherwise, the environment would not recognize the correct number of processes running. The toolkit has been modified so that this now happens automatically. [Fixed]

6. The composer tool (incorrectly) removes all the preconditions of internal actions during composition. [Pending]

7. In the NDR language translation, no break; statement was produced after a fire block. This caused unexpected termination in some algorithms. [Fixed]

8. IntSort's modulo operator is not compatible with its specification. [Pending]

9. The syntax for declaring a map is v: Map[D1, ..., Dn, R], where n ≥ 1. If only one type is given (e.g. map1: Map[Int]), the checker tool throws a java.lang.InternalError instead of a more descriptive syntax error message.

C Automata Definitions and Input Files

C.1 SendMediator and ReceiveMediator

These automata serve as the channel between two nodes and connect the node automata to MPI.

SendMediator

type sCall = enumeration of idle, Isend, test

automaton SendMediator(Msg, Node: Type, i: Node, j: Node)
  assumes Infinite(Handle)
  signature
    input SEND(m: Msg, const i, const j)
    output Isend(m: Msg, const i, const j)
    input resp_Isend(handle: Handle, const i, const j)
    output test(handle: Handle, const i, const j)
    input resp_test(flag: Bool, const i, const j)
  states
    status: sCall := idle,
    toSend: Seq[Msg] := {},
    sent: Seq[Msg] := {},
    handles: Seq[Handle] := {}
  transitions
    input SEND(m, i, j)
      eff toSend := toSend ⊢ m
    output Isend(m, i, j)
      pre head(toSend) = m;
          status = idle
      eff toSend := tail(toSend);
          sent := sent ⊢ m;
          status := Isend
    input resp_Isend(handle, i, j)
      eff handles := handles ⊢ handle;
          status := idle
    output test(handle, i, j)
      pre status = idle;
          handle = head(handles)
      eff status := test
    input resp_test(flag, i, j)
      eff if (flag = true) then
            handles := tail(handles);
            sent := tail(sent)
          fi;
          status := idle
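To make the handshake encoded by SendMediator easier to follow, the automaton can be mimicked by a small Python class. This is an illustrative sketch only: it is not the code the Toolkit generates, the indices i and j are dropped, and the _require helper is a hypothetical way of expressing preconditions.

```python
class SendMediator:
    """Python mock-up of the SendMediator automaton for one fixed pair (i, j)."""

    def __init__(self):
        self.status = "idle"
        self.to_send = []   # messages accepted from the process, not yet handed to MPI
        self.sent = []      # messages handed to MPI, completion not yet confirmed
        self.handles = []   # handles returned by MPI for the pending sends

    @staticmethod
    def _require(condition, action):
        if not condition:
            raise RuntimeError("precondition of " + action + " does not hold")

    def SEND(self, m):                     # input action: always enabled
        self.to_send.append(m)

    def Isend(self):                       # output action: m = head(toSend)
        self._require(self.status == "idle" and self.to_send, "Isend")
        m = self.to_send.pop(0)
        self.sent.append(m)
        self.status = "Isend"
        return m

    def resp_Isend(self, handle):          # input action: MPI returns a handle
        self.handles.append(handle)
        self.status = "idle"

    def test(self):                        # output action: handle = head(handles)
        self._require(self.status == "idle" and self.handles, "test")
        self.status = "test"
        return self.handles[0]

    def resp_test(self, flag):             # input action: flag says whether the send completed
        if flag:
            self.handles.pop(0)
            self.sent.pop(0)
        self.status = "idle"


sm = SendMediator()
sm.SEND("hello")
sm.Isend()
sm.resp_Isend("h1")
sm.test()
sm.resp_test(True)
print(sm.status, sm.sent, sm.handles)
```

A message therefore stays in sent until a test on its handle reports completion, which is why sent and handles always shrink together in resp_test.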

ReceiveMediator

type rCall = enumeration of idle, receive, Iprobe

automaton ReceiveMediator(Msg, Node: Type, i: Node, j: Node)
  assumes Infinite(Handle)
  signature
    output RECEIVE(m: Msg, const i, const j)
    output Iprobe(const i, const j)
    input resp_Iprobe(flag: Bool, const i, const j)
    output receive(const i, const j)
    input resp_receive(m: Msg, const i, const j)
  states
    status: rCall := idle,
    toRecv: Seq[Msg] := {},
    ready: Bool := false
  transitions
    output RECEIVE(m, i, j)
      pre m = head(toRecv)
      eff toRecv := tail(toRecv)
    output Iprobe(i, j)
      pre status = idle;
          ready = false
      eff status := Iprobe
    input resp_Iprobe(flag, i, j)
      eff ready := flag;
          status := idle
    output receive(i, j)
      pre ready = true;
          status = idle
      eff status := receive
    input resp_receive(m, i, j)
      eff toRecv := toRecv ⊢ m;
          ready := false;
          status := idle
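The probe-then-receive cycle of ReceiveMediator can be mimicked in the same way. Again this is only a Python sketch of the automaton above, not the generated code. The indices i and j are dropped and the _require helper is hypothetical.

```python
class ReceiveMediator:
    """Python mock-up of the ReceiveMediator automaton for one fixed pair (i, j)."""

    def __init__(self):
        self.status = "idle"
        self.to_recv = []    # messages received from MPI, not yet delivered
        self.ready = False   # True once a probe has reported a waiting message

    @staticmethod
    def _require(condition, action):
        if not condition:
            raise RuntimeError("precondition of " + action + " does not hold")

    def Iprobe(self):                      # output action: ask MPI whether a message waits
        self._require(self.status == "idle" and not self.ready, "Iprobe")
        self.status = "Iprobe"

    def resp_Iprobe(self, flag):           # input action: answer to the probe
        self.ready = flag
        self.status = "idle"

    def receive(self):                     # output action: only after a successful probe
        self._require(self.ready and self.status == "idle", "receive")
        self.status = "receive"

    def resp_receive(self, m):             # input action: MPI hands over the message
        self.to_recv.append(m)
        self.ready = False
        self.status = "idle"

    def RECEIVE(self):                     # output action: deliver head(toRecv) to the process
        self._require(bool(self.to_recv), "RECEIVE")
        return self.to_recv.pop(0)


rm = ReceiveMediator()
rm.Iprobe()
rm.resp_Iprobe(True)
rm.receive()
rm.resp_receive("hello")
print(rm.RECEIVE())
```

Since Iprobe is enabled whenever the mediator is idle and no message is ready, a scheduler that fires it repeatedly produces the busy probing discussed in Section 5.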

C.2 LCR

Expanded automaton with initialization and scheduling

uses ChoiceSet(Int)

automaton LCR(rank: Int, size: Int)
  signature
    output receive(N6: Int, N7: Int)
      where N7 = rank ∧ N6 = mod((rank + size) - 1, size)
    output SEND(m: Int, I2: Int, I3: Int)
      where I3 = mod(rank + 1, size) ∧ I2 = rank
    input resp_Iprobe(flag: Bool, N4: Int, N5: Int)
      where N5 = rank ∧ N4 = mod((rank + size) - 1, size)
    input resp_test(flag: Bool, N18: Int, N19: Int)
      where N19 = mod(rank + 1, size) ∧ N18 = rank
