

Master Recherche en Informatique - November 2010

Fault-Tolerance: Agreement Problems in Distributed Asynchronous Systems

Achour Mostéfaoui

Irisa/Ifsic, Université de Rennes achour@irisa.fr

http://www.irisa.fr/asap/


Some Failure Types

• Software errors

• Process failures

⋆ Crash failure

⋆ Send/Receive omission

⋆ Arbitrary (Byzantine)

• Link failures

⋆ Omission/duplication failure

• Clock/Performance failures

Asynchronous Distributed Systems

• A set Π of n processes: p_1, p_2, ..., p_n

• A reliable communication network

• No bound on message transfer delays

• No upper bound on the time required by a process to execute a step

• Failures: At most t processes may crash

How to Tolerate Failures?

• Debugging/Validation.

• Duplication of processors and memories.

• Fault-tolerant software.

⋆ Fault-tolerant services (Consensus, NBAC, etc.)

⋆ Checkpointing/Rollback-Recovery.


Fault-Tolerant Services: Agreement Services

• Try to continue to compute and make consistent decisions even though some processes are crashed or slow.

This requires agreement services.

• Consensus

• Atomic broadcast

• Non-blocking atomic commit

• Election

• Renaming, etc.

The Consensus Problem

Each process p_i proposes a value v_i and tries to decide.

• Termination: Every correct process eventually decides some value.

• Validity: If a process decides v, then v was proposed by some process.

• Agreement: No two correct processes decide differently.

• Uniform Agreement: No two (correct or not) processes decide differently.
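As a rough illustration, the four properties can be checked on the outcome of one finished run. The sketch below is not part of the original slides: the name `check_consensus` and the dictionary-based representation of a run are assumptions. Since termination is an "eventually" property, the check only tells whether every correct process had decided by the time the run was stopped.

```python
from typing import Dict, Hashable, Optional, Set


def check_consensus(
    proposals: Dict[int, Hashable],
    decisions: Dict[int, Optional[Hashable]],
    correct: Set[int],
) -> Dict[str, bool]:
    """Report which consensus properties hold for one finished run.

    proposals: value proposed by each process (hypothetical representation).
    decisions: value decided by each process, or None if it never decided.
    correct:   identifiers of the processes that did not crash.
    """
    # Keep only the processes that actually decided.
    decided = {p: v for p, v in decisions.items() if v is not None}
    proposed_values = set(proposals.values())
    return {
        "termination": all(p in decided for p in correct),
        "validity": all(v in proposed_values for v in decided.values()),
        "agreement": len({decided[p] for p in correct if p in decided}) <= 1,
        "uniform_agreement": len(set(decided.values())) <= 1,
    }


# Hypothetical run: process 2 decides "b" and then crashes.
result = check_consensus(
    proposals={1: "a", 2: "b", 3: "a"},
    decisions={1: "a", 2: "b", 3: "a"},
    correct={1, 3},
)
```

In this run agreement holds among the correct processes 1 and 3, while uniform agreement does not, because the faulty process 2 decided a different value.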

The Main Theoretical Result

Fischer-Lynch-Paterson’s Impossibility Result (1985)

There is no deterministic protocol that solves the consensus problem in an asynchronous system that is subject to even a single process crash failure.

Too bad: this result extends to many other agreement problems.

How to Circumvent the Impossibility Result?

• Randomized Protocols

⋆ The termination property becomes: With probability 1, every correct process eventually decides.

• Equip the system with Additional Properties:

⋆ Partial synchrony

⋆ Failure Detectors


An Always-Safe Consensus Protocol (t < n/2)

p_i initially proposes v_i

repeat

• send my value to all processes

• wait for n−t values from different processes

• if all received values are equal to a same value v then send v to all processes, otherwise send ⊥ to all processes

• wait for n−t values from different processes

⋆ all received values are equal to v: decide v

⋆ all received values are equal to ⊥: adopt any of the proposed values

⋆ both v and ⊥ are received: adopt v

endrepeat
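One round of the protocol above can be sketched as a small simulation. This is only an illustration under stated assumptions, not the author's implementation: the names `BOT`, `phase1`, `phase2` and `run_round` are hypothetical, no process crashes during the simulated round, asynchrony is modelled by letting each process hear from a random subset of n−t senders, and "any of the proposed values" is taken to mean any value the process received in the first phase.

```python
import random

BOT = None  # stands for the special value ⊥


def phase1(received):
    """Message sent in the second phase: v if all n-t values equal v, else ⊥."""
    first = received[0]
    return first if all(x == first for x in received) else BOT


def phase2(received, candidates, rng):
    """Return (decided, new_estimate) from the n-t second-phase messages."""
    non_bot = [x for x in received if x is not BOT]
    if not non_bot:
        # Only ⊥ received: adopt any candidate value (assumption: one seen in phase 1).
        return False, rng.choice(candidates)
    # With t < n/2 at most one non-⊥ value can circulate in a round.
    v = non_bot[0]
    # Decide only if every received message carries v; otherwise just adopt v.
    return len(non_bot) == len(received), v


def run_round(estimates, t, rng):
    """Simulate one crash-free round; returns (new_estimates, decisions)."""
    ids = list(estimates)
    n = len(ids)
    seen, second = {}, {}
    for p in ids:
        # Phase 1: p hears the estimates of some n-t processes.
        senders = rng.sample(ids, n - t)
        seen[p] = [estimates[q] for q in senders]
        second[p] = phase1(seen[p])
    new_estimates, decisions = {}, {}
    for p in ids:
        # Phase 2: p hears the second-phase messages of some n-t processes.
        senders = rng.sample(ids, n - t)
        decided, v = phase2([second[q] for q in senders], seen[p], rng)
        new_estimates[p] = v
        if decided:
            decisions[p] = v
    return new_estimates, decisions


# Usage: five processes, t = 2, all starting with the same value.
same_est, same_dec = run_round({i: "x" for i in range(5)}, 2, random.Random(0))
```

When every process starts with the same value, all of them decide it in that round, whatever subsets the scheduler picks.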

A Few Features of this Algorithm

• If any process decides v during a round, then all processes that end that round will either decide v or adopt v.

• If all processes start a round with v then all processes that end the round will decide v (this is not the only case where processes can decide).

This is known as the Abort/Commit algorithm.

Does this Algorithm Work?

• It is always safe but does not always terminate.

This algorithm terminates with a very high probability.

• If the forever loop is changed to a fixed number of rounds, the algorithm will always terminate but the safety may be violated.

The new algorithm ensures safety with a very high probability.

How to get a Correct Algorithm?

• Synchrony properties: For example if the system is eventually synchronous

During a reception phase, a process waits for (n−t) messages and for at least some amount of time that steadily increases from round to round.

• Randomization: replace the statement “any of the proposed values” in the algorithm by the statement “a random value”.
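The reception phase with a growing delay, mentioned in the synchrony bullet above, could look like the following sketch. It is an assumption-laden illustration: the function name `receive_phase`, the `(sender, value)` message format, the use of `queue.Queue` as an inbox and the linear growth `base_delay * round_no` are all hypothetical choices, not taken from the slides.

```python
import queue
import time


def receive_phase(inbox, n, t, round_no, base_delay=0.01):
    """Collect messages until BOTH conditions hold:
    values from n-t different senders have arrived, and
    at least base_delay * round_no seconds have elapsed (assumed linear growth).
    """
    deadline = time.monotonic() + base_delay * round_no
    received = {}  # sender id -> value, so duplicates from one sender count once
    while len(received) < n - t or time.monotonic() < deadline:
        remaining = deadline - time.monotonic()
        try:
            # Block until the deadline; after it, poll briefly until the quorum is met.
            sender, value = inbox.get(timeout=max(remaining, 0.001))
            received[sender] = value
        except queue.Empty:
            pass
    return received


# Usage: four messages are already waiting; n = 5, t = 2, third round.
inbox = queue.Queue()
for sender in range(4):
    inbox.put((sender, "v"))
start = time.monotonic()
got = receive_phase(inbox, n=5, t=2, round_no=3)
elapsed = time.monotonic() - start
```

Because the delay grows with the round number, in an eventually synchronous system it eventually exceeds the (unknown) message delay, so that from some round on every process hears from all non-crashed processes rather than from an arbitrary subset of n−t of them.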


Byzantine Processes

• How does the proposed algorithm behave if the t faulty processes can exhibit malicious behavior?

⋆ a malicious process can disseminate wrong information

⋆ a malicious process can send different values to different processes

⋆ the adversary can delay messages and/or processes

• A few examples

How to Deal with Byzantine Processes

• Adapt the specification of the problem

• Have smaller values for t than for crash failures

• Use many more messages

• Use certificates

• Use cryptography

• etc.

The Byzantine Consensus Problem

• Any solution to Byzantine consensus (even in synchronous systems) needs t < n/3.

Each process p_i proposes a value v_i and tries to decide.

• Termination: Every correct process eventually decides some value.

• Validity: If all correct processes propose v, then only v can be decided.

• Agreement: No two correct processes decide differently.
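The resilience bounds quoted in these slides (t < n/2 for the crash-tolerant protocol, t < n/3 for any Byzantine solution, t < n/5 for the adapted algorithm) translate into a maximum number of tolerated faults for a given n. The helper below is a small worked example; the name `max_faults` is hypothetical.

```python
def max_faults(n: int, divisor: int) -> int:
    """Largest integer t such that t < n / divisor.

    t < n/d  <=>  d*t <= n-1  <=>  t <= (n-1) // d.
    """
    return (n - 1) // divisor


# With n = 7 processes:
crash_bound = max_faults(7, 2)      # t < n/2
byzantine_bound = max_faults(7, 3)  # t < n/3
adapted_bound = max_faults(7, 5)    # t < n/5
```

For n = 7 this gives 3 crash failures, 2 Byzantine failures in the best case, and 1 Byzantine failure for the t < n/5 algorithm.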

Adapting the Previous Algorithm

• Consider the binary consensus

• t < n/5

• Replace the statement “any of the proposed values” in the algorithm by the statement “a random value” among 0 and 1.

• Replace the statement “all received values” by the statement “at least n−2t received values”.
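The effect of replacing “all received values” by “at least n−2t received values” can be illustrated with a small threshold test. This sketch covers only that test, not the full Byzantine algorithm; the name `supported_value` is hypothetical, and it is applied here to the binary estimates received in one phase.

```python
from collections import Counter


def supported_value(received, n, t):
    """Return the value carried by at least n - 2t received messages, else None.

    With n - t messages received and n > 3t, at most one value can reach
    the threshold, because 2 * (n - 2t) > n - t.
    """
    if not received:
        return None
    value, count = Counter(received).most_common(1)[0]
    return value if count >= n - 2 * t else None


# n = 6, t = 1: a process waits for n - t = 5 values, threshold n - 2t = 4.
# One Byzantine process reports 1 while the four correct senders report 0.
with_one_liar = supported_value([0, 0, 0, 0, 1], n=6, t=1)
# A genuinely split set of estimates reaches no threshold.
split = supported_value([0, 0, 1, 1, 0], n=6, t=1)
```

Under the original “all received values are equal” test, the single deviating message in the first example would have prevented the process from recognising 0; with the n−2t threshold the lie of one faulty process is absorbed.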

We get the Byzantine randomized algorithm of Rabin.