
HAL Id: hal-03134314

https://hal.inria.fr/hal-03134314

Submitted on 8 Feb 2021

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.


Programming at the edge of synchrony

Cezara Dragoi, Josef Widder, Damien Zufferey

To cite this version:

Cezara Dragoi, Josef Widder, Damien Zufferey. Programming at the edge of synchrony. SPLASH 2020 - ACM SIGPLAN conference on Systems, Programming, Languages, and Applications: Software for Humanity, Oct 2020, Chicago / Virtual, United States. ⟨10.1145/3428281⟩. ⟨hal-03134314⟩


Programming at the Edge of Synchrony

CEZARA DRĂGOI, INRIA, ENS, CNRS, PSL*, France

JOSEF WIDDER, Informal Systems, Austria

DAMIEN ZUFFEREY, MPI SWS, Germany

Synchronization primitives for fault-tolerant distributed systems that ensure an effective and efficient cooperation among processes are an important challenge in the programming languages community. We present a new programming abstraction, ReSync, for implementing benign and Byzantine fault-tolerant protocols. ReSync has a new round structure that offers a simple abstraction for group communication, as is customary in synchronous systems, but also allows messages to be received one by one, as in asynchronous systems. This extension allows implementing network- and algorithm-specific policies for message reception, which is not possible in classic round models.

The execution of ReSync programs is based on a new generic round switch protocol that generalizes the famous theoretical result of ?. We evaluate experimentally the performance of ReSync’s execution platform, by comparing consensus implementations in ReSync with LibPaxos3, etcd, and Bft-SMaRt, three consensus libraries tolerant to benign, resp. byzantine faults.

CCS Concepts: • Software and its engineering → Domain specific languages; Control structures; Runtime environments; • Theory of computation → Distributed computing models;

Additional Key Words and Phrases: Distributed systems, Fault-tolerance, Round Model, Synchrony

ACM Reference Format:
Cezara Drăgoi, Josef Widder, and Damien Zufferey. 2020. Programming at the Edge of Synchrony. Proc. ACM Program. Lang. 1, OOPSLA, Article 1 (January 2020), 31 pages.

1 INTRODUCTION

Fault-tolerant distributed algorithms are notorious for being hard to design and implement. One needs to take into account features of a non-deterministic network, behaviors of faulty processes, as well as the asynchrony of the actions performed by correct processes. These algorithms underpin many distributed applications and critical services, e.g., Zab [?], Zyzzyva [?], or more recent blockchains like Tendermint [?] or Libra [?]. These systems are implemented in general-purpose programming languages that support the asynchronous programming paradigm. This programming paradigm is error-prone and facilitates bug occurrences [???] even when implementing protocols that have been formalized and proved, like Paxos [?]. These bugs show that going from (verified) protocols to implementations is a difficult task. The asynchronous programming paradigm, despite its pitfalls, is preferred because it enables important performance optimizations, but also because other suitable programming abstractions and synchronization primitives are lacking. Therefore, our goal is to identify new synchronization primitives that simplify programming distributed systems tolerant to benign and byzantine faults.

The seminal work [?] defines the basic round model, the first abstract computational model that offers rounds as a synchronous primitive. The model established its notoriety because it can solve consensus, i.e., n processes trying to agree on a value, provided that the network satisfies the partial synchrony assumption. Consensus is not solvable over asynchronous networks in the presence of faults [?], and partial synchrony is one of the weakest theoretical assumptions that allows us to solve it.

Authors’ addresses: Cezara Drăgoi, INRIA, ENS, CNRS, PSL*, France, cezara.dragoi@ens.fr; Josef Widder, Informal Systems, Austria, josef@informal.systems; Damien Zufferey, MPI SWS, Germany, zufferey@mpi-sws.org.



Partial synchrony means that the network eventually becomes synchronous, i.e., there exists a bound ∆ on the message delay and a bound Φ on the relative speed of processes; however, these bounds hold only from some time on, called the global stabilization time GST. (There are two other definitions of partial synchrony in [?], which are less used.) The basic round model, and the follow-up round-based approaches [???], rely on an abstract global clock. The abstract global clock synchronizes all processes, that is, all processes execute the code of a round in lockstep (at the same time), where a round is a sequence of sending messages, receiving (some or all) messages sent in that round in one atomic step, followed by some local updates based on the received messages. To “execute” a consensus-solving algorithm in a round model, one has to implement an algorithm that computes when the round switch should locally happen such that processes switch rounds at the same time. To this end, ? use an algorithm that implements a distributed clock, exploiting ∆ and Φ from the partially synchronous network assumption. The distributed clock determines the round switch at each process. Roughly, before GST the system is in an arbitrary state (different processes may have different clock values), but after GST, the distributed clock is guaranteed to simulate the abstract global clock and implicitly synchronizes all the correct processes. This is a purely theoretical result, since it requires multiple synchronization steps (clock steps) within a round, and it makes simplistic assumptions regarding the time spent by processes sending and receiving messages or performing local computation. In general, round-based solutions are infamous for their prohibitive performance, although most existing solutions for consensus rely on the partial synchrony assumption and implicitly on rounds.

Among the systems that rely on partial synchrony are PBFT [?], BftSmart [?], ViewStamped Replication [?], Zookeeper’s atomic broadcast protocol [?], Apache Cassandra [?], libPaxos [?], or the more recent works emerging from blockchain research like Tendermint [?], RedBelly [?], Lumière [?], HotStuff [?], or Libra [?]. Each of these systems uses rounds implicitly and proposes a new algorithm for the round switch that exploits partial synchrony in a different way, thus resulting in a new solution for consensus (state machine replication) either in the benign or in the Byzantine case. Although solving the same problem under similar network assumptions, these are stand-alone systems, sharing no high-level control structure. They are implemented in error-prone asynchronous programming paradigms (as shown by numerous bug reports), which are preferred for facilitating performance optimizations. These systems exploit the synchronicity of the network after GST differently, and they guarantee that soon after GST, correct processes are synchronized (a property called eventual synchrony [?]), and reach consensus. To implement eventual synchrony, these systems use asynchronous features like properties on the set of received messages (e.g., a message with a certain payload or from a certain sender has been received, or a certain number of messages have been received) that trigger a local round switch, or timeouts on the local duration of a round, which ensure that after GST the processes get synchronized. These features are opaque in the basic round model as they are hidden in the theoretical implementation of the distributed clock. In the basic round model, and in all follow-up round-based programming abstractions, processes have no control over the set of received messages and no control over the round switch.

Problem statement. In this paper we ask the following question: is there a programming abstraction that systematizes the optimization principles used in implementations of consensus protocols for partially synchronous networks? More widely, is there a synchronous round-based model that allows protocol-specific optimizations, which today are possible only in asynchronous programming frameworks?

We propose ReSync, a new programming language with a novel round structure which preserves the abstraction level of theoretical round-based models and reasonably matches the performance demands of large-scale systems like Bft-SMaRt [?] or LibPaxos3 [?]. ReSync is a parametric framework that accommodates various implementations of the eventual synchrony property, rather than a fixed implementation, as is the case in all previous works. We systematize the optimization principles used in large-scale systems that solve consensus and wrap them into a domain-specific language that facilitates future consensus implementations under partial synchrony and, more generally, facilitates exploiting the network specifics.

ReSync is addressed to the algorithm designer who wants to prototype algorithms without worrying about network code and data management, and to the programmer less versed in reasoning about interleavings and partial synchrony.

Key insights. In synchronous (or round-based) models, computations are structured in a sequence of rounds, where in each round all processes send messages and then update their local state based on the set of received messages. Rounds are communication-closed [?] and are a powerful synchronization primitive: messages are not received outside the round they were sent in. The message reception of a round is hidden from the protocol’s perspective: a non-deterministically chosen subset of the messages sent is received in one atomic step. Round structures are known for their poor performance due to an excessive need for synchronization. Any execution framework for the round structure must implement a round switch algorithm. All known approaches are either too general [?], and hence purely theoretical due to their poor performance, or they are restricted to a sub-class of distributed protocols, like PSync [?], which implements one solution for eventual synchrony (that is not customizable). It is well established that there is no unique round switch algorithm that fits the performance needs of all protocols. Therefore, to design a more flexible way to implement the round switch we turn to asynchronous computational models.

Asynchronous computational models are preferred for implementing systems due to their better performance. Still, algorithm designers present their asynchronous solutions in terms of rounds, phases, epochs, ballots, etc., e.g., [???]. These notions encode some logical time that allows structuring and decomposing distributed computations along the timeline. Although presented in rounds, the system is implemented under the asynchronous semantics because it enables network-specific performance optimizations: messages are received one by one, which allows the system to react quickly to incoming messages, as opposed to the round structure that would wait for all the messages of the current round before reacting. For example, under the asynchronous semantics a process could switch rounds fast, as soon as it receives a message that contains the information it was waiting for, e.g., the first message it receives comes from the leader and contains the value all processes agree on. In contrast, when executing the synchronous round-based model, a process would wait for all the messages sent in that round, even if the followers (the processes that are not leaders) only relay the message from the leader for fault-tolerance purposes.

The main insight is that we want a round structure where messages are received one by one, instead of all in one atomic step. Therefore, we define a new round structure where a round combines:

• sending messages,

• a message accumulator, a new round component that allows programming the message reception within the round-based model and computes the sequence of messages received in a round, and

• a round computation that updates the local state, depending on the mailbox of the current round computed by the message accumulator.

The executions of the new round model preserve the lockstep semantics at the round boundaries, but, within a round, messages are received asynchronously by the message accumulators of different processes. In this way, messages are received one by one and the accumulator decides when to wait for more messages or transition to the next round. This removes the need for extra synchronization, i.e., synchronization barriers needed to satisfy the (conservative) guarantees of the round model but which are not required for the protocol-specific progress condition.

The implementation of the message accumulator, more precisely the termination of a message accumulator, uses new features that are not available in any other round-based programming abstraction. First, the language provides constructs to control the termination of a message accumulator using timeouts, or using properties satisfied by the messages received for the current round. These features are sufficient to program message accumulators for distributed protocols whose specification is solvable in asynchronous networks.
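As an illustration of these two termination features, a message accumulator for a benign protocol can combine a timeout with a property of the mailbox, here a simple majority quorum. This is a minimal sketch only; the concrete API and full examples appear in Section 2, and the timeout value to is a placeholder:

    def receiveInit() {
      Progress.timeout(to)                    // end the round after to milliseconds at the latest
    }
    def receive(sender: ProcessID) = {
      if (mbox.size > N / 2) Progress.goAhead // or earlier, once a majority has been heard from
    }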

However, timeouts and conditions on the set of messages received for the current round are not sufficient for implementing a round switch in consensus protocols. The implementations of consensus exploit the partially synchronous nature of the network to synchronize processes after GST. For that, they use features that are not communication-closed, i.e., they cannot be implemented directly in a round-based model, e.g., a process reacts to messages sent in rounds different from the process's current round. The key to ensuring synchronization is to fine-tune the moment when processes locally switch rounds, such that after GST all correct processes are roughly in the same round. The number of correct processes is determined by the system being tolerant to byzantine or benign faults. Byzantine systems are parametrized by the number of tolerated byzantine processes, denoted F. Instead of using a distributed clock to define the round switch, like in [?], we use hardware clocks and we introduce two synchronization primitives that trigger the termination of the message accumulator:

• Catch-up. The message accumulator terminates and the process switches to a future round r when it receives F + 1 messages sent in rounds greater than or equal to r, where F is a parameter of the system, the maximal number of byzantine processes.

• Synchronize F + 1 processes. The message accumulator terminates and the process switches to the next round only if it observes that at least F + 1 correct processes switched to the next round, where F is again the number of byzantine processes.

Each round is parametrized by the enabling/disabling of these primitives for protocol-specific values of F. To compute a round switch, the runtime of ReSync integrates the implementation of these primitives (when enabled) with the timeouts (or other conditions) defined in the message accumulator. We show that the execution model implemented by the runtime of ReSync matches the theoretical guarantees for eventual synchrony provided in [?], under similar assumptions.

Evaluation. We evaluated ReSync on one benign and one byzantine implementation of Paxos and compared it against LibPaxos3 [?], an open source implementation of the Paxos algorithm, etcd, an implementation of consensus based on Raft [?], and Bft-SMaRt [?], an open source implementation of byzantine agreement. ReSync's performance is at most 30% worse than LibPaxos3 for a small number of replicas, and 10% faster with more replicas. The results show a 3.5× improvement of ReSync over PSync, and ReSync is 25× faster than Goolong [?].

Contributions. We propose a new programming abstraction, called ReSync, embedded into the Scala programming language. We are the first to propose a language that lies between the synchronous and the asynchronous paradigms. Our main contributions are:

• We propose a new round structure that lets programmers use asynchronous optimizations without breaking the synchronous structure.

• ReSync is suitable for both benign and byzantine fault-tolerant protocols.

• ReSync allows the algorithm designer to write custom code that locally controls the round boundaries, i.e., the message accumulator with synchronization primitives, expressing a wide range of protocols.


• ReSync compiles to efficient asynchronous code that can be executed over any asynchronous network. The runtime environment ensures a sound and efficient round structure for benign and byzantine faults, based on seminal work by ?.

2 OVERVIEW

A program in ReSync is parametric in N, the number of processes, and F, the number of byzantine ones among them. All (correct) processes execute the same code. A program is a sequence of rounds, called a phase. In a round, processes send and receive messages, and then update their local state (depending on the set of received messages). ReSync has a round-based semantics, which facilitates writing programs, and an equivalent asynchronous runtime semantics to execute programs efficiently. In the following, we focus on the round-based semantics of ReSync, where rounds proceed in lockstep. Thus, there is no interleaving of actions performed by processes in different rounds.

Example. As an example protocol, we consider a round-based version of ViewChange [?], a protocol used in PBFT [?]. The goal of PBFT is to receive requests from a client, and to store all of them on N replicas, in the same order, tolerating F byzantine faulty processes, where N = 3F + 1. Ideally, all replicas have the same history of client requests; however, due to faults this is not the case. ViewChange ensures that a majority of correct replicas—i.e., at least F + 1 correct replicas—agree on the most recent history of requests. To this end, it elects a leader, which makes sure that all the replicas in its quorum agree on the most recent history.

Fig. 1. Lock-step executions for N = 4 and F = 1.

The protocol is structured in three rounds, which form a phase. Fig. 1 shows an execution of ViewChange. In the first round, called DoVC, all replicas broadcast their current history (all-to-all communication). We assume authenticated communication channels between any two processes [??], i.e., the receiver of a message can be sure which process sent the message. All processes, including the leader, wait for 2F + 1 messages. However, only F + 1 of these messages may come from correct replicas, which does not provide sufficient information to compute the most recent history of the system. The leader's ID is a function of the phase number (rotating leader), so the protocol uses the same leader in the next two rounds.

In byzantine systems, receiving one message is not sufficient to trust the received value, because byzantine processes might send different (possibly faulty) values to different processes. Therefore, to help the leader trust the received histories, in the second round, called Forward, the replicas that receive at least 2F + 1 messages in the first round forward these messages to the leader. The leader trusts a history h sent by process p if F + 1 processes acknowledge that they received h from p. If the leader receives 2F + 1 trusted values, it computes from them, in the second round, the most recent trusted history in the system.
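To make the trust check concrete, the following is a hypothetical helper (our own sketch, not code from the paper; the types ProcessID and Hist are the ones used in the listings below) that the leader could run on its Forward-round mailbox, where each entry pairs a sender with the DoVC mailbox it forwarded:

    // A (process, history) pair is trusted once at least F+1 distinct forwarded
    // mailboxes contain it: at most F forwarders are byzantine, so at least one
    // correct replica vouches for the pair.
    def trustedHistories(fwdMbox: List[(ProcessID, List[(ProcessID, Hist)])],
                         F: Int): Map[ProcessID, Hist] =
      fwdMbox
        .flatMap { case (_, doVCMbox) => doVCMbox }   // all reported (p, h) pairs
        .groupBy(identity)                            // group identical reports
        .collect { case ((p, h), acks) if acks.size >= F + 1 => p -> h }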

In the third round, called NewView, the leader broadcasts the previously computed history and the set of trusted pairs that the computation was based on. Also in the third round, because the leader might be byzantine, the other processes echo to the entire network the acknowledgments sent to the leader (in the second round). A process p that receives the leader's message checks that the latest history was correctly computed, using the messages from the other replicas to trust the values communicated by the leader.


 1 val DoVC_mbox: List[(ProcessID, Hist)]
 2 val skip_next_round: Boolean
 3 val history: Hist
 4
 5 new EventRound[Hist]{ // Round: DoVC
 6   def send(): Map[ProcessID,Hist] = {
 7     broadcast(history)
 8   }
 9   def receiveInit() {
10     Progress.timeout(to)
11     Progress.catchUp(true)
12     Progress.sync(F+1)
13   }
14   def receive(sender: ProcessID) = {
15     if (mbox.size >= 2*F+1) Progress.goAhead
16   }
17   def finishRound(mbox: List[(ProcessID, Hist)]) = {
18     if (mbox.size >= 2*F+1)
19       DoVC_mbox = mbox
20     else // timeout
21       skip_next_round = true
22   } }

Fig. 3. The 1st round of ViewChange (PBFT) in ReSync.

def receiveInit() {
  Progress.waitMessages
  Progress.catchUp(false)
  Progress.sync(0)
}
def receive(sender: ProcessID) = {
  if (mbox.size >= 2*F+1)
    Progress.goAhead
}

Fig. 4. Another message accumulator for the round in Fig. 3.

Fig. 2. Lock-step executions for N = 4 and F = 1.

The protocol executes these three rounds in a loop, because the execution of one phase does not guarantee that a majority of correct replicas are up to date, e.g., if the leader is byzantine. Fig. 2 shows an execution of a phase where, although the leader is not byzantine, there are too many faults in the system to update sufficiently many replicas.

The algorithm is safe if less than a third of the processes are faulty, and it is live if eventually there is a good period where a quorum can communicate within itself and with the leader. This is an instance of partial synchrony, which states that eventually all correct processes can talk to each other.

Encoding in ReSync. The code of the first round, i.e., DoVC, is given in Fig. 3. The round implements four methods: send, defining the messages to be sent; receiveInit and receive, which compute the set of messages received in the round, i.e., the message accumulator; and finishRound, which defines the round computation based on the received messages. In this round, all processes broadcast their current histories. Therefore, send uses the primitive broadcast and returns, at line 7, a (key, value) map, where the keys are all the process identities (the recipients) and the value is the history of the sender, i.e., a sequence of requests (type Hist). The method send is executed synchronously by all processes.

Message accumulator. In classic round-based models [?], the set of received messages is a subset of the sent messages, non-deterministically chosen by an adversarial environment. We enhance the round-based model with a message accumulator (specific to each round) that receives messages one by one. This gives the programmer the ability to react to a non-deterministic execution environment. The message accumulator is defined by a Progress object and two methods that use it, receiveInit (line 9) and receive (line 14). The main bottleneck of round-based models is an inefficient round-switch policy that requires synchronizing correct processes. Using the methods of Progress, called progress conditions, the programmer guides the execution platform of ReSync, speeding up or slowing down the default round-switch policy. They are high-level constructs implemented by the execution platform. For example, using Progress.timeout(to) the programmer fixes, in line 10, the time interval the execution engine waits for the messages of the current round, i.e., it waits for to milliseconds. Two other progress conditions are enabled.

Table 1. Network assumptions.

                                        Lossy links      Reliable links
  benign process crash                  F = 0 and UDP    F = 0 and TCP
  authenticated byzantine processes     F > 0 and UDP    F > 0 and TCP

A message accumulator is designed taking into account the underlying network. In this paper we consider the different fault models given in Table 1. Regarding process failures, processes can crash (and not recover) or they can be byzantine. A byzantine process does not follow the protocol; it may send arbitrary messages, up to forging the identity of other processes. We assume authenticated byzantine processes, which cannot impersonate other processes; this is enforced using digital signatures. The number of byzantine processes, denoted F, is a parameter of the program. If F equals zero, then the program tolerates only benign faults. The size of the network is fixed by the parameter N. For the link failures, we consider either lossy links, where messages can be lost, reordered, or duplicated, or reliable links, where every message sent is eventually received. In practice, lossy links correspond to UDP and reliable links to TCP. The design of a message accumulator can take advantage of any other network properties the programmer is aware of; it is not restricted to the network assumptions discussed in this paper.

Synchronization primitives. ReSync has two synchronization primitives, Progress.catchUp and Progress.sync. They are implemented by the execution platform for the types of network given in Table 1. Note that a network with lossy links is a weaker network assumption than reliable links, and it is one of the weakest network assumptions that one can consider. These primitives do not have a round-based definition for lossy links (Progress.catchUp does not have a round-based implementation even over networks that ensure reliable links).

The primitive Progress.catchUp(true), Fig. 3, line 11, enables a process to switch rounds faster at runtime, as soon as it has received F + 1 messages with a higher round from different processes. It is used to resynchronize processes by advancing a slow process to the round of a faster correct one.

The primitive Progress.sync(F+1), Fig. 3, line 12, requires that F + 1 correct processes switch rounds together at runtime. Its implementation forces processes to receive at least 2F + 1 messages for the current round before switching to the next one, i.e., cr + 1. Note that the network might drop more than F messages in a round (we recall N = 3F + 1). If the timeout expired and less than 2F + 1 messages are received, the execution platform first increases the timeout, but if this is not sufficient, it resends the messages of the current round (if the round does not use an all-to-all communication, the execution platform inserts dummy messages to implement the synchronization requirement). Progress.sync(F+1) overrules the timeout constraints set by Progress.timeout(to). Byzantine systems cannot resynchronize (after an asynchronous period under partial synchrony) unless at least F + 1 processes are within the same round [?]. Therefore, even if Progress.sync(F+1) potentially slows down the round switch, it is necessary for byzantine consensus protocols.
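The thresholds behind these two primitives follow from simple counting; the snippet below is our own illustration of that arithmetic (the value names are ours, not part of the ReSync API):

    val F = 1                         // tolerated byzantine processes
    val N = 3 * F + 1                 // 4 processes in total
    val syncThreshold = 2 * F + 1     // 3 messages: at least F+1 = 2 come from correct processes
    val catchUpThreshold = F + 1      // 2 messages from a higher round: at least one comes from a
                                      // correct process that already reached that round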

For each round, the Progress object is configured in the method receiveInit. Messages are received one by one: the method receive takes as input a received message and implements the conditions under which this should be the last message received in the current round. Each process, in each round, has a fresh mailbox that contains the queue of messages received (inputs to receive) by the process in that round. This mailbox is populated by the execution platform. The execution of the message accumulator starts with receiveInit, continues with multiple calls to receive, and stops when the progress conditions are met. The accumulator in Fig. 3 stops either due to Progress.catchUp, or because Progress.sync(F+1) holds and either the timeout expired or the control reached Progress.goAhead in Fig. 3, line 15. Since Progress.sync(F+1) is enabled, the message accumulator continues to collect messages until the progress condition is met even if the timeout expired. The message accumulators of the same round execute asynchronously across processes.

Round computation. The method finishRound implements the computation done by a process in a round, based on the set of received messages. All (non-byzantine) processes execute the same code; however, due to different mailboxes and branching, processes have different behaviors. finishRound is executed when a process stops receiving messages. The only computation done in the first round of ViewChange is to store the mailbox in the variable DoVC_mbox, in line 19. Across rounds, ReSync respects the classic round-based semantics: at the end of a round the mailbox of each process is purged and the computation moves in lockstep to the next round.

Different message accumulators. To solve byzantine consensus, it is sufficient that only some rounds force F + 1 correct processes to be synchronized. Therefore, in ViewChange at least one of the three rounds should enable Progress.sync, but not necessarily the first one. If we consider the message accumulator from Fig. 3 with Progress.sync(0), then the message accumulator might stop due to an expired timeout, before receiving 2F + 1 messages. Therefore, when finishRound is executed, line 21 is reachable and skip_next_round is true. In ViewChange, processes that do not receive 2F + 1 messages in the first round move on to the next phase. In the round-based structure processes cannot skip rounds. However, they are not forced to send messages or do any computation in a round. The flag skip_next_round is used to encode the behavior of these processes, which go through the second and third round of the current phase without sending any messages or doing any computation.
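For instance, a later round can simply test the flag; the sketch below is our own illustration of this encoding (the round body is hypothetical; only DoVC_mbox, skip_next_round, leader, and id come from the examples in this section):

    new EventRound[List[(ProcessID, Hist)]]{ // Round: Forward (sketch)
      def send(): Map[ProcessID, List[(ProcessID, Hist)]] = {
        if (skip_next_round) Map.empty        // skipped: send nothing, do nothing
        else Map( leader -> DoVC_mbox )       // forward the DoVC mailbox to the leader
      }
      def finishRound(mbox: List[(ProcessID, List[(ProcessID, Hist)])]) = {
        if (!skip_next_round && id == leader) {
          // the leader would compute the most recent trusted history here
        }
      }
    }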

Let us consider the message accumulator in Fig. 4, which uses Progress.waitMessages instead of Progress.timeout. The semantics of Progress.waitMessages states that calls to receive are made until Progress.goAhead is reached. Since Progress.catchUp is disabled, the only way for a process to switch rounds is if 2F + 1 messages are received. This message accumulator can get stuck if the program is executed over a network that provides unreliable communication. However, if executed with reliable communication between correct processes, e.g., TCP, it is an implementation (in the new round model) of the synchronization primitive Progress.sync(F+1).

 1 // Commit request
 2 new EventRound[Int]{
 3   def send(): Map[ProcessID,Int] = {
 4     if (id == leader) {
 5       broadcast(transactionID)
 6     } else { Map.empty }
 7   }
 8   def receiveInit = {
 9     Progress.waitMessages
10     Progress.catchUp(false)
11   }
12   def receive(sender: ProcessID, payload: Int) = {
13     Progress.goAhead
14   }
15   def finishRound(mbox: List[(ProcessID, Int)]) = {
16     commit = check(localState, mbox)
17 } }

 1 // Vote (Yes/No)
 2 new EventRound[Boolean]{
 3   def send(): Map[ProcessID,Boolean] = {
 4     Map( leader -> commit )
 5   }
 6   def receiveInit = {
 7     if (id != leader) Progress.goAhead
 8     else Progress.waitMessages
 9   }
10   def receive(sender: ProcessID, payload: Boolean) = {
11     if (!payload || mbox.size == n) Progress.goAhead
12   }
13   def finishRound(mbox: List[(ProcessID, Boolean)]) = {
14     if (id == leader) {
15       decision = head(mbox)._2
16 } } }

Fig. 5. Voting phase of the Two-phase commit protocol in ReSync.

The message accumulator can also depend on the value in the payload of messages. Let us look at the two-phase commit protocol. Fig. 5 presents the first two rounds — i.e., the first phase — of the protocol. It is an atomic commitment protocol, executed by n benign processes (that is, all processes follow the protocol). This protocol assumes crash recovery, i.e., the nodes have stable storage and they can resume operation after a crash. Therefore, the processes block until they receive the messages they expect.

In the example, the transaction identifier is an integer that the leader proposes (at line 5) to all processes. Processes block until they receive the message from the leader (lines 9 and 13). Each process checks whether it can locally commit the transaction (without violating the consistency of the database) at line 16. In the second round of Fig. 5, processes send (at line 4) their vote to the leader. The two-phase commit protocol says that a transaction can commit only if all the participants agree. Therefore, the leader waits for N “yes” votes or a single “no” vote (line 11). In the next two rounds, which we omit for brevity, the coordinator sends a commit or abort message to all processes and, after it receives acknowledgments from all of them, it informs the client of the decision and restarts with a fresh transaction.
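The paper omits the remaining rounds; purely as an illustration of what the description above could look like in the same style, a commit/abort round might be written as follows (our own hypothetical sketch — applyTransaction and abortTransaction are made-up placeholders, not part of the protocol code):

    new EventRound[Boolean]{ // Commit or abort (hypothetical sketch)
      def send(): Map[ProcessID, Boolean] = {
        if (id == leader) broadcast(decision) else Map.empty
      }
      def receiveInit = {
        Progress.waitMessages            // crash-recovery model: block until the outcome arrives
        Progress.catchUp(false)
      }
      def receive(sender: ProcessID, payload: Boolean) = { Progress.goAhead }
      def finishRound(mbox: List[(ProcessID, Boolean)]) = {
        if (mbox.nonEmpty && head(mbox)._2) applyTransaction() else abortTransaction()
      }
    }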

In our example, we do not give concrete values for the timeouts. The actual timeout values might vary across rounds, as they are influenced by the amount of computation required for handling messages. For example, if we consider networks with non-authenticated communication channels, messages are signed using public-key cryptography, and the signatures are checked by the recipients. The timeout then depends on the size of the information to be signed and on the signature algorithm. By choosing a timeout that takes only the network delay into account, the round switch may happen too fast, and processes could miss messages that are delivered by the network.


Fig. 6. Runtime executions for N = 4 and F = 1.

Runtime executions. Fig. 6(a) and Fig. 6(b) show two runtime executions of ViewChange in ReSync. At runtime, processes are not all in the same round (due to the asynchronous nature of the network). Therefore, processes receive messages from different rounds out of order. The runtime discards the late messages and stores the messages from future rounds. It delivers only the messages for the current round. Since Progress.sync(F+1) is enabled, at runtime a process will not finish DoVC when the timeout expires, unless it has received at least 2F + 1 messages for the current round (or a higher one). However, a process finishes the other two rounds independently of the number of received messages.

Not all processes are forced to wait for 2F + 1 messages. The progress condition catchUp makes processes jump to a future round if they received more than F + 1 messages for that round (or a higher one). For example, in Fig. 6(b), the bottom process is slowed down in the round DoVC of the first phase and jumps to round DoVC of the 20th phase when it receives F + 1 messages for it. This process is not forced to wait for 2F + 1 messages, as the reception of F + 1 messages for a higher round guarantees that there were (other) F + 1 correct processes synchronized. We show that the runtime executions are refinements of the lockstep executions: processes at runtime go through the same sequence of local states as in the round-based semantics. For example, the execution in Fig. 6(a) is a refinement of the execution in Fig. 1.

3 SYNTAX OF RESYNC

The syntax of a ReSync program is given in Fig. 7. A program defines the code executed by one process. In ReSync all processes execute the same code. However, within a round the programmer can encode different roles (e.g., coordinator, follower) by branching. The branching conditions may be over the process identifier or the received messages (which can vary non-deterministically across processes).

A program is defined by an interface and the program's body. The interface is an object with a list of methods that are used to communicate with other distributed programs. The interface has inbound operations, for example to get a new client request, and outbound operations (callbacks), called to deliver some results to the client, e.g., acknowledgments that the request has been processed. The program's body has variable declarations, an initialization function, the main distributed computation part, and sequential pure functions. All variables are local to a process. The communication between processes is done exclusively by message passing.
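As a concrete illustration, an interface for a replicated service might look as follows. This is a hypothetical sketch in the spirit of the listings above; the trait, the method names, and the Request type are ours, not part of ReSync:

    trait ReplicationInterface {
      // inbound: the program pulls the next client request, e.g., in init or finishRound
      def getRequest(): Request
      // outbound (callback): the program notifies the client once a request is decided
      def deliver(decision: Request): Unit
    }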

Variables. All ReSync programs have several predefined variables: N, the total number of processes, F, the number of byzantine processes (both parameters of the program), and r, the round number.


program ::= interface body
body    ::= var_decl∗ pure_fun∗ init phase
phase   ::= round+

EventRound[M]:
    mbox: List[P × M]
    Progress: ProgressC
    send: () → [P ↦ M]
    receiveInit: () → ()
    receive: (s: P, m: M) → ()
    finishRound: (mbox) → ()

Fig. 7. ReSync abstract syntax.

Initialization. Each program implements an init method, which configures the initial state of a process: the initial values are obtained using the inbound interface operations. The method init takes as input interface objects; it does not return any values. It modifies the process's variables without sending or receiving messages. Implicitly, init initializes the round number r to zero.

Rounds. The distributed computation part is structured into a fixed sequence of rounds, called phase. Each round is an instance of an abstract class EventRound, whose structure is described in Fig. 7. Rounds are parameterized by a payload type M and have two attributes: 1) mbox, of type List[(P, M)], a queue of messages, where P is the process type, and 2) a progress object, Progress, where ProgressC is the class of progress conditions. In the code snippets, we use process identifiers (ProcessID) for P. For the semantics, these two types are the same. Rounds contain a few methods which define their behaviors. The methods send and finishRound must be defined by the program. An implementation of the methods receiveInit and receive is optional. If receiveInit and receive are implemented, the round is called an open round, otherwise it is called a closed round. In closed rounds, Progress.catchUp and Progress.sync(F+1) are enabled by default.

The type of messages exchanged in different rounds might differ; however, within a round, all messages have the same payload type. The control flow of a round is a sequence of method calls, starting with send, followed by receiveInit and a loop of calls to receive, if the latter two are present, and lastly finishRound. The round variables mbox and Progress are visible in all methods of the same round.
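For contrast with the open rounds shown so far, here is a minimal closed round (our own sketch; the variable estimate is a placeholder): since receiveInit and receive are left to their defaults, the mailbox is a non-deterministically chosen subset of the sent messages and catchUp and sync(F+1) are enabled by default.

    new EventRound[Int]{ // a closed round: no receiveInit / receive
      def send(): Map[ProcessID, Int] = broadcast(estimate)
      def finishRound(mbox: List[(ProcessID, Int)]) = {
        if (mbox.nonEmpty) estimate = mbox.map(_._2).max  // e.g., keep the largest received estimate
      }
    }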

Sending messages. The method send does not take any inputs and returns a partial map from process identities to a payload type, i.e., [P ↦ M], which associates receivers with payload values, i.e., the sent messages. The method send is side-effect free w.r.t. the algorithm variables.

Message accumulator. A message accumulator collects the messages delivered by the network in a round (to the process that executes it). The message accumulator stops when a certain condition holds. Next, the method finishRound is executed and the control switches to the next round.

The algorithm designer implements the message accumulator by defining receiveInit and receive, and controls when the accumulator terminates using different progress conditions. A progress condition is a method or an attribute of a ProgressC object. They impose different conditions on when the accumulator may stop receiving messages:

• Progress.timeout(to: Milliseconds): the accumulator is allowed to stop receiving messages after a timeout expires, that is, to time units after the round started;

• Progress.goAhead: the accumulator is allowed to stop as soon as the method goAhead is called;

• Progress.waitMessages is a flag; when set, it means that there is no timeout to stop the message reception;


• Progress.catchUp(b: Boolean) allows or disallows the execution platform to (re-)synchronize processes under a partial synchrony assumption;

• Progress.sync(k: Int) is a global synchronization primitive that constrains the message accumulators of k correct processes to end synchronously.

The method receiveInit configures the progress conditions. The programmer may implement different progress conditions for different processes in the same round by branching. The method receive defines the mbox and checks if the message accumulator should terminate. It takes as input an incoming message, that is, a pair (sender, value), where sender is the process which sent the message, and value is a value of the round's payload type. Each execution of receive adds the input pair (sender, value) to the mbox of the current round. In receive the programmer uses Progress.goAhead to terminate the message accumulator when a condition on the mailbox computed so far holds.

When the methods receiveInit and receive are not implemented, the mailbox is a non-deterministic subset of the sent messages.

Round computation. In the method finishRound the programmer defines how the process state changes at the end of a round. The method finishRound takes as input the queue of messages received, i.e., it takes as input mbox. It may call outbound operations from the interface to communicate the results of the computation to a client.

4 LOCK-STEP SEMANTICS OF RESYNC

Given a program P, all (non-byzantine) processes execute, in a loop, the sequence of rounds defined in the program. The semantics of a program is given by the synchronized parallel composition of the transition systems associated with each process. We denote by [[P]] the set of executions defined by the transition system in Fig. 8, Fig. 9, Fig. 10, and Fig. 11.

The execution progresses in lockstep, that is, all processes execute the same round at the same time. Within a round, the methods send and finishRound are executed synchronously by all processes, while the executions of receiveInit, receive, and the network transitions are interleaved across processes within the round boundaries. Locally, a process executes a sequence of calls to round methods corresponding to

init; ( send; ( (receiveInit; receive∗) | NondetNet ); finishRound )∗

where NondetNet represents a network transition. All processes execute the same sequence of rounds.

The state S of a ReSync program over the set of processes P is represented by the tuple ⟨gstep, s, r, sms, dms⟩ where:

• gstep ∈ {Send, Recv, Net, Fin} indicates whether the next method is send, receive, a network transition, or finishRound, respectively;

• s ∈ [P → Vars ∪ {lstep} → D] stores the local states of the processes, where Vars is the union of all variables, D is the domain of values, and lstep is a special variable taking a value in {Send, Recv, Net, Fin};

• r ∈ ℕ is a counter for the executed rounds;

• sms ⊆ P × M × P is the set of messages sent in a round and dms ⊆ P × M × P is an ordered set of delivered messages; all the messages in sms and dms have a different pair of sender/receiver, i.e., a process can only send one message to another process in each round.

To model correct, crashed, and byzantine processes we use the sets CoP, Cr, and ByzP. CoP is the set of correct processes, Cr is the set of crashed processes, and ByzP contains the byzantine processes. These sets form a partition of the processes, i.e., P = CoP ⊎ Cr ⊎ ByzP. N and F take the values N = |P| and F = |ByzP|. For crashed processes, we set their local state to the special ⊥ value, i.e., for a store s we have Cr = {p ∈ P | s(p) = ⊥}. Crashed processes never recover. Pr is a shorthand for the set of running processes, i.e., Pr = P \ Cr.

Fig. 8. The global semantics of ReSync (rules Init, Send-Open, Send-Closed, RecvStart, RecvEnd, and FinishRound). A transition s(p) —phase[r].m, o→ s′(p) denotes that p executes, in local state s(p), the method m of the EventRound phase[r]. The execution produces observable events o, corresponding to calls to methods from the interface.

Fig. 9. Network transitions on the message pools (rules Drop, Crash, and Deliver).

The rules for the round structure where the processes work in lockstep are shown in Fig. 8. Later, we will see additional rules for the network (Fig. 9), closed rounds (Fig. 10), and open rounds (Fig. 11). Global lockstep transitions are written as S =E,I,O⇒ S′, where S, S′ are states, E is a set of labels that identify network transitions, O is a set of labels defined from the names of the methods in the interface, and I is a set of labels not in the interface corresponding to internal transitions. A client of a ReSync program can only observe the transitions with labels in O. Local transitions of a process are written with → and transitions of the network are denoted with ↦.

Init. Initially the state of the system is undefined, denoted by ∗. The first transition of the system, captured by the rule Init in Fig. 8, defines the initial state of the protocol by calling init on all processes. All processes execute init in lockstep (synchronously). The method init does not return a value and defines the local state of a process using inbound operations from the interface. The global transition Init emphasizes the inbound operations called by init, denoted o_p, where o is the operation's name and p is the process executing it. Init sets the round counter to 0, the message pools are empty (there are no messages in the system), and globally the system transitions to a Send state, i.e., gstep = Send. Rounds are executed in a loop. Next, we define the semantics of a round.

Fig. 10. Network transitions for closed rounds (rules Closed-NetworkSteps and Closed-Mailbox).

Send. In every round, the first transition is a global one, which executes locally in lockstep the method send on all processes. Its semantics is formalized by the rules Send-Open and Send-Closed in Fig. 8. Only one of the two rules applies in a round. Send-Open is applied when receiveInit and receive are defined; the control label gstep becomes Recv, i.e., the control goes to the implementation of a message accumulator. Send-Closed is applied when receiveInit and receive have the default implementation, and in this case the control goes to the network, that is, gstep equals Net; otherwise gstep equals Recv. The method send has no process-local side effects and returns a map from receivers to payload values, i.e., the messages sent by the process that executes send. The rules Send-Closed, resp. Send-Open, add the messages sent by processes to the pool of sent messages, sms. The messages in sms are triples of the form (sender, payload, recipient), where the sender and receiver are processes and the payload has type M. The triples are obtained from the map returned by send, to which we add the identity of the process that executed send.

After sending messages the control passes to the network, which non-deterministically decides which messages to deliver and which to alter.

Network transitions. The pool of sent messages sms and the list of delivered messages dms are managed by a special network process that modifies sms and dms (for simplicity the rules omit its identity) and, in the case of open rounds, also controls the progress conditions. The rule Deliver, in Fig. 9, captures how the network takes a message m from the message pool sms and puts it into the delivery set dms. For the byzantine case, we model the malicious behaviours by changing the delivered messages to arbitrary values. This way of modeling means we can keep uniform rules for sending and processing messages. The rule Drop captures how the network can drop sent messages by removing them from the send pool sms. Processes may crash (see Crash) and never recover.

Message reception in closed rounds. When receiveInit and receive are missing (they are not redefined), after Send-Closed, the rule Closed-NetworkSteps is applied a nondeterministic number of times, followed by the rule Closed-Mailbox from Fig. 10. The mailbox of each process is set to be equal to the content of the delivery queue of the process. It is entirely under the control of the network, which populates mbox with a non-deterministically chosen (possibly altered) subset of the sent messages.


Fig. 11. Message accumulator transitions: progress transitions for Progress.goAhead, Progress.waitMessages, Progress.sync, Progress.catchUp, and Progress.timeout; accumulator transitions Acc-Init, Acc-Recv, and Acc-End; and global transitions Acc-NetworkStep and Acc-ProcessStep.

Message accumulator (message reception in open rounds). The implementation of the message accumulator starts after sending messages (after applying Send-Open), by updating the local variable lstep to Recv (see rule RecvStart in Fig. 8). To compute the mailbox of an open round, a process first executes receiveInit (see rule Acc-Init in Fig. 11) and afterwards a loop of calls to receive (see rule Acc-Recv in Fig. 11). Calls to receive stop when the progress conditions configured in receiveInit are met (see rule Acc-End).

ReceiveInit. The progress conditions of a round are set in receiveInit. To formally define them we use auxiliary boolean local variables: goAhead, catchUp, to, and sync. Because the semantics is given at the lockstep level and the progress conditions are hints to control the runtime algorithms, the progress conditions do not have much impact on the semantics. For instance, the timeout is abstracted as a boolean value and the accumulator can non-deterministically continue receiving messages or end.

Receive. The method receive, when executed by process p, takes as input a delivered message in the set dms (see rule Acc-Recv) and adds it to the mailbox of the current round. In each round, at most one message is received from any process, i.e., duplicates, if present, are eliminated. Another call to receive follows only if the progress conditions are not met; see rule Acc-End in Fig. 11. The message accumulator ends because either the catchUp condition holds, or the synchrony condition holds together with either the goAhead or the timeout condition (see rule Acc-End). Note that the catchUp, sync, and timeout progress conditions are, in this lockstep semantics, under the control of the network, and the protocol assumes the network respects their semantics. The execution platform provides an implementation of these primitives.

The accumulator transitions are interleaved with the network transitions defined in Fig. 9. While collecting messages, the local variable lstep equals Recv until it gets updated to Fin at the end of the message accumulator. Globally, the system transitions into a Fin global state when all non-crashed processes have their local variable lstep equal to Fin (rule RecvEnd in Fig. 8).

Round-based algorithm transition. In a FinishRound transition (Fig. 8), all processes execute synchronously (without any interference) the method finishRound of the current round. Locally, on each process p, the set of received messages mbox_p (computed previously) is the input of finishRound. The finishRound operation might produce an observable transition o_p. At the end of the round, sms and dms are purged and r is incremented by 1.

5 RUNTIME OF RESYNC

We introduce an execution platform for ReSync over asynchronous networks. The runtime executions are refinements of the lockstep executions from Section 4. For closed rounds, if the underlying network is partially synchronous, the runtime eventually synchronizes correct processes, ensuring that they can communicate reliably with each other.

5.1 Runtime semantics

The runtime algorithm in Listing 1 defines the code executed at runtime by one process. Listing 1 shows an extract from the actual implementation (which has … lines of code) that highlights the control structure of the runtime algorithm. Given a ReSync program P, the set of runtime executions [[P]]Run is defined by the asynchronous parallel composition of N transition systems corresponding to Listing 1.

The runtime algorithm defines a wrapper for the ReSync code. It has the following structure: declarations of the variables that keep track of the timeout, the round number, and the buffer of delivered messages (line 6), a few auxiliary methods (lines 15-29), and a big while loop where every iteration corresponds to the execution of one round (lines 32-59). All processes go through all rounds in order. Each loop iteration is further split into initialization (lines 37-40), send (line 34), the message accumulator (lines 42-54), and finishing the round (lines 56-58). The message accumulator is a loop where at each iteration one message is received and processed. The exit of this loop triggers a round switch.

The runtime uses phase, the sequence of len rounds defined in ReSync, and the variable currentRound holding the current round number. In Listing 1, phase[currentRound] is the kth EventRound defined in the ReSync program, where k equals currentRound % len. The runtime uses the implementations of send, receiveInit, receive, and finishRound given in ReSync by calling phase[currentRound].send, etc. Each call is wrapped by checkProgress, which implements the semantics of the progress conditions.


Listing 1. ReSync Runtime Algorithm

 1 // Global constants
 2 id: Pid; N, F: ℕ              // own process id, number of processes, malicious processes
 3 round: Array[EventRound[M]]   // Program
 4 // State shared with the program
 5 Progress: ProgressC
 6 mbox: List[Msg] = []          // mailbox of the current round
 7 // Internal state
 8 currentRound, nextRound: ℕ = 0
 9 roundStart, timeout: Time
10 strict, inSync: 𝔹 = false     // flags to control the progress
11 maxRound: Map[Pid, ℕ] = Map[for (p <- Pid) yield (p, 0)]        // keep the max round seen for each process
12 pendingMessages: Map[ℕ, Set[Msg]] = Map[for (n <- ℕ) yield (n, ∅)] // buffered messages (asynchrony)
13
14 // Auxiliary Methods
15 def processReceive(message: Msg) { // Receiving a message
16   if (∀ m ∈ mbox. m.sender ≠ message.sender) { // check duplication (UDP or byzantine)
17     mbox = message :: mbox
18     round[currentRound].receive(message.sender, message.payload)
19     checkProgress()
20 } }
21 def checkProgress() { // Update timeout and strict according to Progress
22   if (Progress.isTimeout) { timeout = Progress.timeout; strict = !Progress.catchUp }
23   else if (Progress.isWaitMessage) { timeout = ∞; strict = true }
24   else if (Progress.isGoAhead) { strict = false; nextRound = max(nextRound, currentRound + 1) }
25   if (Progress.syncNb > 0) {
26     if (maxRound.values.filter( _ >= currentRound).size >= Progress.syncNb + F) { strict = false }
27     else { timeout = ∞; strict = true }
28 } }
29 def catchUpTo() = maxRound.values.sorted[N - F - 1] // highest round excluding byzantine processes
30
31 // Execute the program, each iteration of the loop corresponds to one round
32 while (true) {
33   // SEND
34   for ( (dest, payload) <- round[currentRound].send() )
35     Network.sendTo(dest, Msg(id, currentRound, payload))
36   // RECEIVE INIT
37   strict = false; mbox = []            // clean
38   roundStart = currentTime()           // set time
39   round[currentRound].receiveInit()
40   checkProgress()
41   // RECEIVE
42   for ( message <- pendingMessages[currentRound] ) processReceive(message) // deliver buffered msg
43   while (nextRound == currentRound ∨ strict) { // not enough messages received
44     message = Network.receiveWithTimeout(roundStart + timeout - currentTime())
45     if (message == null) { // timeout
46       strict = false
47       nextRound = max(nextRound, currentRound + 1)
48     } else {
49       maxRound[message.sender] = max(maxRound[message.sender], message.round)
50       if (message.round < currentRound) { }                      // late message, ignore
51       else if (message.round == currentRound) { processReceive(message) }
52       else { pendingMessages[message.round].add(message)         // buffer and
53              nextRound = max(nextRound, catchUpTo()) }            // check if catching-up is needed
54   } }
55   // FINISHROUND
56   round[currentRound].finishRound(mbox)
57   currentRound += 1; maxRound[id] = currentRound
58   nextRound = max(nextRound, currentRound)
59 }


Message reception is implemented using the method receiveWithTimeout in line 44, which either returns a message or null if the timeout (given by the ReSync program) expired. If the progress condition waitMessages is active, receiveWithTimeout is called with an infinite timeout. The messages delivered to a process may come from different rounds. For instance, if two processes p and q are in different rounds and send messages for their current rounds to some process r, then r's message buffer will contain messages from different rounds. The runtime implements a filtering of these messages. To each sent message it adds as metadata the current round number of its sender. When a message is received, the runtime discards it if it was sent in a round smaller than the current round of the receiver, and buffers it in pendingMessages if it comes from a future round. The mailbox contains only messages from the process's current round. The default behavior of receiveInit is to set a timeout, enable catchUp, and enable the synchronization of F + 1 processes. The default behavior of receive is to add a received message to the mailbox when it belongs to the current round. Messages in pendingMessages are moved to the mailbox when the process reaches the round they were sent in. The function checkProgress updates the local variables that gate a round switch at line 43. It is called after the reception of a message (line 51), at the beginning of every round with the call of receiveInit (line 40), and also after delivering the messages buffered for the current round (line 42), which were received while the process was still in a lower round.
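To make the round metadata concrete, the following sketch shows a plausible shape of a runtime message, consistent with Listing 1 (the payload type is an assumption; only the field names used in Listing 1 are taken from the paper):

    case class Msg(sender: Pid, round: N, payload: Array[Byte])  // sender id and round attached as metadata

On send (line 35), the runtime wraps the application payload as Msg(id, currentRound, payload); on receipt, the round field alone decides whether the message is dropped (past round), delivered to the current mailbox (current round), or buffered in pendingMessages (future round).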

CatchUp algorithm. This primitive enables a process to jump over some rounds at runtime, so that it can catch up with processes that made quicker progress. Catch-up is a mechanism used by many asynchronous systems, e.g., Paxos, ViewStamped, PBFT. Implementing this primitive requires breaking communication closure, which is possible only at execution time.

In the benign case, i.e., F = 0, when a message m from a future round fr is received, instead of storing it and continuing with the current round, the runtime jumps to the round fr and starts the message accumulator of fr with m in the mailbox. Note that catch-up is useful to synchronize processes (with the faster ones) under lossy links, because messages of the current round might be lost, but also under reliable links due to message reordering (see Table ??). The implementation of catch-up uses two variables: nextRound, ranging over round numbers, and strict, a boolean mirroring Progress.catchUp.

In the byzantine case, i.e., F > 0, it is not sound for a process to catch up upon a single message reception, because a byzantine process may send arbitrary round numbers. Therefore, the runtime keeps not only the buffer of pending messages but also the map maxRound, which for each process stores the highest round value that process sent in a message. Instead of jumping to the maximal round number, the runtime ignores the F highest round values (line 29) in maxRound and jumps to the (F + 1)-st highest one (at least one correct process made progress until that round).

In case of a jump, the message accumulator of the current round stops, and finishRound is executed for the current round and all the skipped rounds, up to nextRound.
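As a worked example of catchUpTo (Listing 1, line 29), with round values chosen for illustration only, take N = 4, F = 1 and a process whose maxRound map reads:

    // maxRound = { p1 -> 9, p2 -> 3, p3 -> 2, p4 -> 2 }   (p1 might be byzantine)
    // sorted values                   = [2, 2, 3, 9]
    // catchUpTo() = sorted[N - F - 1] = sorted[2] = 3

The process may safely advance to round 3, since at least one correct process must have reached it; jumping to round 9 would be unsound, because that value may come from the single byzantine process.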

Synchronization algorithm. Although all protocols may use Progress.sync(k), except for two-phase commit (where k = N) this primitive is useful mainly in byzantine protocols, where k equals F + 1. It ensures that k correct processes, i.e., non-byzantine and non-crashed, are synchronously switching to the respective round at runtime.

The implementation of Progress.sync(k) forces the runtime to wait for F + k messages, sent by different processes for the current round or any higher round, before terminating the current message accumulator and doing a round switch. This is done using the maxRound map, which stores the highest round number received from each process. Over networks with reliable links, two timeouts ensure that between round switches the waiting time is long enough to let correct processes exchange messages. However, over networks with lossy links timeouts are an underestimation. The transport layer is responsible for resending the messages of the round, and therefore the synchronization algorithm uses an infinite timeout, assuming the transport layer will eventually deliver enough messages. The implementation of this synchronization primitive is possible in the round-based model only under reliable networks, i.e., TCP. Resending the messages of a round breaks the round structure, which requires the number of sent messages to be bounded by the state of the process and the number of processes in the network. Also, over networks with lossy links it is necessary to consider messages from higher rounds. A message from a round fr greater than the current round cr witnesses that the process which sent it was at some point in cr; therefore it is counted among the 2F + 1 messages (from different processes) required to validate the synchronization barrier. This behavior is not directly implementable in the round model.
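As a worked illustration of the barrier check in Listing 1 (lines 25-27), with parameters chosen for this example only, take N = 4, F = 1 and Progress.sync(F + 1), i.e., syncNb = 2:

    // The accumulator may stop only once
    //   |{ p : maxRound[p] >= currentRound }| >= syncNb + F = 3.
    // Among any 3 such processes at most F = 1 is byzantine, hence at least
    // syncNb = F + 1 = 2 correct processes are known to have reached (or passed) the current round.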

The runtime uses an auxiliary method processReceive to check that only one message per process per round is delivered. This deals with malicious processes, which may send multiple messages, and with link faults, e.g., UDP may duplicate packets.

5.2 Runtime correctness

At runtime, the executions are asynchronous, i.e., processes may be in different rounds at the same time, and processes may receive messages coming from rounds that are different from their current one. We show that these asynchronous runtime executions are indistinguishable from executions in the ReSync lockstep semantics.

Definition 5.1 (Indistinguishability). Two executions π and π′ are indistinguishable w.r.t. a set of actions A, denoted π ≃_A π′, iff for every process p, the projection of both executions on p and on the actions in A agree up to finite stuttering.
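In symbols (notation introduced here only for readability), writing π|_{p,A} for the projection of π on process p and the actions in A, and ≡_st for equality up to finite stuttering:

    π ≃_A π′   iff   ∀p. π|_{p,A} ≡_st π′|_{p,A}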

Next, by send, receiveInit, receive, and finishRound, we refer to the execution, at runtime, of the wrappers of those respective methods of the ReSync program.

Theorem 5.2 (Correctness). Given a ReSync program P, for every execution ae ∈ [[(P)]]Run, there exists an indistinguishable execution se ∈ [[P]] with respect to the actions C = {init, send, receiveInit, receive, finishRound}, that is, ae ≃_C se.

Proof sketch. Let us consider an asynchronous execution ae ∈ [[(P)]]Run. The main ingredient is that the runtime only provides messages for the current round to the ReSync program. Thus, reduction arguments [???] due to communication closure allow us to reduce to indistinguishable executions where, globally, actions are ordered with non-decreasing round numbers. Then, we use the fact that send and receive are left and right movers [?], respectively. This generates indistinguishable executions that have all the init and send, as well as the finishRound, for the same round appearing in a block, so that they can be reduced to synchronized actions as required for se ∈ [[P]]. Timeout actions to (and additional sends for byzantine rounds) in the asynchronous executions lead to stuttering invisible events, which we may drop while maintaining indistinguishability. □

Theorem 5.2 implies that the specifications closed under indistinguishability are preserved by the execution platform. These specifications (also called local properties in [?]) form an important class, which includes consensus, k-set agreement, and leader election.

For networks that are partially synchronous we prove a stronger result. Partial synchrony is defined over the physical model of the network, with immediate consequences on the runtime executions, but also on the high-level round-based executions. Let T(m) be the time between the moment a message m is added to the send pool sms and the moment it either gets dropped by the network or is delivered to its recipient. A (physical) network is partially synchronous if there exists a ∆, and a global stabilization time GST, such that the transmission delays of all messages m sent after GST from a correct process to a correct process satisfy T(m) < ∆.


The original definition in [?] also assumes that a bound on the relative speed of processes holds only after GST. However, on modern architectures this bound holds during the entire computation. Note that before GST arbitrarily many messages might be dropped.
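In symbols (restating the definition above):

    ∃∆, GST. ∀m sent at a time t ≥ GST from a correct process to a correct process: T(m) < ∆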

In the round-based model, the existence of GST means that from some round on, called GSR, all messages sent by correct processes to correct processes are received. Prior to GSR arbitrarily many messages may be dropped. Given a program P, the set of partially synchronous lockstep executions of P is a subset of [[P]], consisting only of those executions where there exists a global synchronization round GSR.

Theorem 5.3 (PS Correctness – Closed). Given a ReSync program P with closed rounds, for every execution ae ∈ [[(P)]]Run over a partially synchronous network, there exists an indistinguishable partially synchronous execution pse ∈ [[P]] w.r.t. the common actions C = {send, receiveInit, receive, finishRound}, that is, ae ≃_C pse.

The proof of Theorem 5.3 follows ideas similar to the proofs in [?]. In the following, we emphasize a few crucial differences that make the execution platform of ReSync practical. A distributed clock is constructed [?], which ensures that at least F + 1 correct processes have closely synchronized local estimates of the clock. The distributed clock has two purposes: (i) it ensures that after GST all correct processes will quickly resynchronize and (ii) a process uses the clock ticks to determine how long to wait for the messages of a round. In [?], at every process, the steps of the distributed clock algorithm are interleaved with the steps of the consensus protocol that is executed. During a round of the consensus algorithm, the clock ticks multiple times, as the clock is used to burn time and slow down fast processes (to wait sufficiently long for messages). Each tick is implemented by an instance of a distributed clock synchronization scheme, which involves all-to-all communication. In practice this would flood the network with many messages and slow down the consensus computation (due to interleavings of the clock with the algorithm).

Instead of using a distributed clock to time out messages, the runtime of ReSync uses a hardware clock, and it implements a distributed clock only to synchronize the round start, that is, it uses only one clock value per round, contrary to the many clock ticks in [?]. By replacing many distributed clock ticks with a hardware clock, ReSync's runtime gives the programmer the possibility to use timeouts that 1) estimate the time it takes for the slowest correct process to resynchronize with the fastest correct process and 2) estimate the time to process and transmit a message.
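For instance, a deployment-specific round timeout could be budgeted as the sum of these two estimates; the sketch below is illustrative only, and the concrete values and variable names are not part of ReSync:

    // Illustrative only: both bounds are deployment-specific estimates.
    val resyncBound   = 20  // ms: slowest correct process resynchronizes with the fastest one
    val transmitBound =  5  // ms: transmit and process one message
    val roundTimeout  = resyncBound + transmitBound  // candidate value for the round's Progress.timeout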

Definition 5.4 (Observational refinement). Given two transition systems TS1 and TS2 and a set of actions of the two systems O, called observable actions, TS1 observationally refines TS2, denoted TS1 ▷_O TS2, iff all executions of TS1 projected on the observable actions O are included in the executions of TS2 projected on the observable actions O.
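Equivalently (writing π|_O for the projection of an execution π on O):

    TS1 ▷_O TS2   iff   { π|_O : π ∈ [[TS1]] }  ⊆  { π|_O : π ∈ [[TS2]] }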

The observable actions of a ReSync program are calls to the interface methods. The runtime executions and the lockstep executions are compared w.r.t. the interface actions. In [?] sequential consistency is proven equivalent to observational refinement for commutative clients, i.e., when the interface operations commute. This proof adapts straightforwardly to indistinguishability.

Corollary 5.5. Given a program P, the runtime semantics of P observationally refines the ReSync semantics of P w.r.t. the actions O in the interface of P, if the interface actions commute for the client, that is, [[(P[true])]]Run ▷_O [[P[true]]] and [[(P[PS])]]Run ▷_O [[P[PS]]].

5.3 Runtime Implementation Details

The runtime implements the algorithm in Listing 1. We discuss in this section aspects related to the message structure and memory management, which are not explicitly visible in the algorithm in Section 5, as its presentation focuses on the computation.

