Master Recherche en Informatique - November 2010

## Fault-Tolerance: Agreement Problems in Distributed Asynchronous Systems

Achour Mostéfaoui

Irisa/Ifsic, Université de Rennes achour@irisa.fr

http://www.irisa.fr/asap/

Checkpointing Distributed Computations 1

Some Failure Types

• Software errors

• Process failures

⋆ Crash failure

⋆ Send/Receive omission

⋆ Arbitrary (Byzantine)

• Link failures

⋆ Omission/duplication failure

• Clock/Performance failures

Asynchronous Distributed Systems

• A set Π of n processes: p_{1}, p_{2}, ..., p_{n}

• A reliable communication network

• No bound on message transfer delays

• No upper bound on the time required by a process to execute a step

• Failures: At most t processes may crash

How to Tolerate Failures?

• Debugging/Validation.

• Duplication of processors and memories.

• Fault-tolerant software.

⋆ Fault-tolerant services (Consensus, NBAC, etc.)

⋆ Checkpointing/Rollback-Recovery.

Fault-Tolerant Services: Agreement Services

• Try to continue to compute and make consistent decisions although there are crashed or slow processes.

There is a need for agreement services.

• Consensus

• Atomic broadcast

• Non-blocking atomic commit

• Election

• Renaming, etc.

The Consensus Problem

Each process p_{i} proposes a value v_{i} and tries to decide.

• Termination: Every correct process eventually decides some value.

• Validity: If a process decides v, then v was proposed by some process.

• Agreement: No two correct processes decide differently.

• Uniform Agreement: No two (correct or not) processes decide differently.
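The properties above can be checked mechanically on a finished run. The following is a minimal sketch, not part of the original slides: the function name and the three input structures are hypothetical, and termination (an "eventually" property) is only checked on the final state of the run.

```python
def check_consensus_run(proposals, decisions, correct):
    """Check one finished run against the four properties listed above.

    proposals: dict process id -> proposed value (every process proposes)
    decisions: dict process id -> decided value (only processes that decided)
    correct:   set of ids of the processes that never crashed
    All three arguments are hypothetical inputs of this sketch.
    """
    proposed_values = set(proposals.values())
    decided_by_correct = {decisions[p] for p in correct if p in decisions}
    return {
        # Every correct process has decided.
        "termination": all(p in decisions for p in correct),
        # Every decided value was proposed by some process.
        "validity": all(v in proposed_values for v in decisions.values()),
        # Correct processes decided at most one distinct value.
        "agreement": len(decided_by_correct) <= 1,
        # All processes, correct or not, decided at most one distinct value.
        "uniform_agreement": len(set(decisions.values())) <= 1,
    }
```

A run in which a process decides a different value and then crashes satisfies Agreement but violates Uniform Agreement, which is exactly the difference between the last two properties.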

The Main Theoretical Result

Fischer-Lynch-Paterson's Impossibility Result (1985)

There is no deterministic protocol that solves the consensus problem in an asynchronous system that is subject to even a single process crash failure.

Too bad: this result extends to many other agreement problems.

How to Circumvent the Impossibility Result?

• Randomized Protocols

⋆ The termination property becomes: With probability 1, every correct process eventually decides.

• Equip the system with Additional Properties:

⋆ Partial synchrony

⋆ Failure Detectors

An Always-Safe Consensus Protocol (t < n/2)

p_{i} initially proposes v_{i}

repeat

• send my value to all processes

• wait for n−t values from different processes

• if all received values are equal to the same value v, send v to all processes; otherwise send ⊥

• wait for n−t values from different processes

⋆ if all received values are equal to v: decide v

⋆ if all received values are equal to ⊥: adopt any of the proposed values

⋆ if v and ⊥ are both received: adopt v

endrepeat
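One round of the loop above can be sketched as follows. This is an illustrative simulation, not the author's code: all function names are hypothetical, message delivery is modelled by explicit lists of "who hears whom", and "adopt any of the proposed values" is resolved here by keeping the process's own current value (one admissible choice among others).

```python
BOTTOM = None  # stands for ⊥; proposed values must not be None


def phase1_message(received):
    """Second-step message: v if all n-t received values equal v, else ⊥."""
    first = received[0]
    return first if all(v == first for v in received) else BOTTOM


def phase2_outcome(received, own_estimate):
    """Return (decided, new_estimate) from the n-t second-step messages."""
    values = {v for v in received if v is not BOTTOM}
    # With t < n/2, two different non-⊥ values cannot be sent in one round.
    assert len(values) <= 1
    if not values:
        # Only ⊥ received: "adopt any proposed value"; here, keep our own
        # estimate (an assumption of this sketch).
        return False, own_estimate
    v = values.pop()
    return BOTTOM not in received, v  # decide only if no ⊥ was received


def run_round(estimates, t, heard1, heard2):
    """Run one round; heardK[i] lists the n-t senders process i hears in step K."""
    n = len(estimates)
    assert 2 * t < n
    assert all(len(h) == n - t for h in heard1 + heard2)
    sent = [phase1_message([estimates[j] for j in heard1[i]]) for i in range(n)]
    return [phase2_outcome([sent[j] for j in heard2[i]], estimates[i])
            for i in range(n)]


outcomes = run_round([1, 1, 0], t=1,
                     heard1=[[0, 1], [0, 1], [1, 2]],
                     heard2=[[0, 1], [1, 2], [0, 2]])
```

In this run (processes indexed from 0), process 0 decides 1 and the other two adopt 1, so all three start the next round with the same value.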

A Few Features of this Algorithm

• If any process decides v during a round, then all processes that end that round will either decide v or adopt v.

• If all processes start a round with v then all processes that end the round will decide v (this is not the only case where processes can decide).

This is known as the Abort/Commit algorithm.
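The first feature follows from a counting argument, sketched here under the slide's assumption t < n/2. The symbols Q_p and Q_q are introduced only for this sketch: they denote the sets of n−t senders whose second-step messages are received by a deciding process p and by any other process q.

```latex
\[
|Q_p \cap Q_q| \;\ge\; |Q_p| + |Q_q| - n \;=\; 2(n-t) - n \;=\; n - 2t \;>\; 0
\qquad \text{since } t < n/2 .
\]
```

So q hears at least one sender that p also heard. That sender sent v (p received only v), hence q receives v at least once and either decides v or adopts v. The same intersection argument applied to the first step shows that two different non-⊥ values cannot be sent in the same round.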

Does this Algorithm Work?

• It is always safe but does not always terminate

This algorithm terminates with a very high probability.

• If the forever loop is changed to a fixed number of rounds, the algorithm will always terminate but safety may be violated

The new algorithm ensures safety with a very high probability

How to get a Correct Algorithm?

• Synchrony properties: For example if the system is eventually synchronous

During a reception phase, a process waits for (n−t) messages and for at least some time that keeps increasing from round to round.

• Randomization: replace the statement “any of the proposed values” in the algorithm by the statement “a random value”.
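Both options above can be sketched in a few lines. This is a hedged illustration, not the author's implementation: the function names, the use of a `queue.Queue` of (sender, value) pairs as the inbox, and the `domain` parameter are all assumptions of this sketch.

```python
import queue
import random
import time


def reception_phase(inbox, n, t, min_wait):
    """Collect messages until n-t distinct senders answered AND min_wait elapsed.

    inbox is a queue.Queue of (sender, value) pairs (an assumption of this
    sketch); min_wait is the timer of the current round, in seconds, which
    the caller is expected to increase from round to round.
    """
    deadline = time.monotonic() + min_wait
    received = {}
    while True:
        remaining = deadline - time.monotonic()
        if len(received) >= n - t and remaining <= 0:
            return received
        try:
            # Timer still running: wait only for what is left of it.
            # Timer expired but quorum missing: block until a message arrives.
            timeout = remaining if remaining > 0 else None
            sender, value = inbox.get(timeout=timeout)
        except queue.Empty:
            continue  # timer expired; the loop re-checks both conditions
        received.setdefault(sender, value)  # keep one value per sender


def adopt_when_all_bottom(domain, rng=random):
    """Randomized variant: adopt a random value instead of an arbitrary one."""
    return rng.choice(list(domain))
```

Once the system behaves synchronously and the timer exceeds the actual message delay, every process receives the messages of all non-crashed processes in each reception phase, which is what the first option relies on.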

Byzantine Processes

• How does the proposed algorithm behave if the t faulty processes can exhibit a malicious behavior?

⋆ a malicious process can disseminate wrong information

⋆ a malicious process can send different values to different processes

⋆ the adversary can delay messages and/or processes

• A few examples

How to Deal with Byzantine Processes

• Adapt the specification of the problem

• Have smaller values for t than for crash failures

• Use many more messages

• Use certificates

• Use cryptography

• etc.

The Byzantine Consensus Problem

• Any solution to Byzantine consensus (even in synchronous systems) needs t < n/3.

Each process p_{i} proposes a value v_{i} and tries to decide.

• Termination: Every correct process eventually decides some value.

• Validity: If all correct processes propose v, then only v can be decided.

• Agreement: No two correct processes decide differently.
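To make the resilience bounds concrete, the small sketch below computes the largest number of faulty processes t allowed by a bound of the form t < n/k. The helper name is hypothetical; k = 2 corresponds to the crash-tolerant protocol seen earlier and k = 3 to the Byzantine bound stated above.

```python
def max_tolerated_faults(n, divisor):
    """Largest integer t that satisfies the strict bound t < n / divisor."""
    return (n - 1) // divisor


for n in (4, 7, 10):
    crash = max_tolerated_faults(n, 2)      # bound t < n/2
    byzantine = max_tolerated_faults(n, 3)  # bound t < n/3
    print(n, crash, byzantine)
# prints:
# 4 1 1
# 7 3 2
# 10 4 3
```

For example, tolerating a single Byzantine process requires at least n = 4 processes.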

Adapting the Previous Algorithm

• Consider the binary consensus

• t < n/5

• Replace the statement “any of the proposed values” in the algorithm by the statement “a random value” among 0 and 1.

• Replace the statement “all received values” by the statement “at least n−2t received values”.

We get the Byzantine randomized algorithm of Rabin.
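The threshold n−2t reflects that, among the n−t messages a process waits for, up to t may come from Byzantine processes, so only n−2t are guaranteed to come from correct ones. The sketch below illustrates that test for the first reception phase of the binary case; the function name is hypothetical and this is not a full implementation of the adapted algorithm.

```python
from collections import Counter


def supported_value(received, n, t):
    """Return the bit carried by at least n-2t of the n-t received messages.

    Returns None when no bit reaches the threshold. Under t < n/5 (in fact
    as soon as n > 3t) at most one bit can reach it, because
    2 * (n - 2t) > n - t.
    """
    assert 5 * t < n, "this sketch assumes t < n/5"
    assert len(received) == n - t
    value, count = Counter(received).most_common(1)[0]
    return value if count >= n - 2 * t else None
```

With n = 6 and t = 1, a process waits for 5 values and needs 4 equal ones: `[1, 1, 1, 1, 0]` yields 1, while `[1, 1, 1, 0, 0]` yields no supported value.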
