Master Recherche en Informatique - November 2010
Fault-Tolerance: Agreement Problems in Distributed Asynchronous Systems
Achour Mostéfaoui
Irisa/Ifsic, Université de Rennes achour@irisa.fr
http://www.irisa.fr/asap/
Checkpointing Distributed Computations 1
Some Failure Types
• Software errors
• Process failures
⋆ Crash failure
⋆ Send/Receive omission
⋆ Arbitrary (Byzantine)
• Link failures
⋆ Omission/duplication failure
• Clock/Performance failures
Asynchronous Distributed Systems
• A set Π of n processes: p1, p2, . . . , pn
• A reliable communication network
• No bound on message transfer delays
• No upper bound on the time required by a process to execute a step
• Failures: At most t processes may crash
How to Tolerate Failures?
• Debugging/Validation.
• Duplication of processors and memories.
• Fault-tolerant software.
⋆ Fault-tolerant services (Consensus, NBAC, etc.)
⋆ Checkpointing/Rollback-Recovery.
Fault-Tolerant Services: Agreement Services
• Try to continue computing and to make consistent decisions even though some processes have crashed or are slow.
Agreement services are needed.
• Consensus
• Atomic broadcast
• Non-blocking atomic commit (NBAC)
• Election
• Renaming, etc.
The Consensus Problem
Each process pi proposes a value vi and tries to decide.
• Termination: Every correct process eventually decides some value.
• Validity: If a process decides v, then v was proposed by some process.
• Agreement: No two correct processes decide differently.
• Uniform Agreement: No two (correct or not) processes decide differently.
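The four properties can be checked mechanically on a finished run. The sketch below is only an illustration: the function name `check_consensus` and its dictionary-based inputs are assumptions made here, not part of the course material. Termination is a liveness property, so the check only says whether every correct process had decided by the end of the recorded run.

```python
from typing import Dict, Optional, Set


def check_consensus(proposals: Dict[int, str],
                    decisions: Dict[int, Optional[str]],
                    correct: Set[int]) -> Dict[str, bool]:
    """Check the consensus properties on one recorded run (hypothetical helper).

    proposals: value proposed by each process
    decisions: value decided by each process, or None if it never decided
    correct:   identifiers of the processes that did not crash
    """
    decided = {p: v for p, v in decisions.items() if v is not None}
    return {
        # Every correct process has decided (in this finite run).
        "termination": all(p in decided for p in correct),
        # Every decided value was proposed by some process.
        "validity": all(v in proposals.values() for v in decided.values()),
        # Correct processes decided at most one distinct value.
        "agreement": len({v for p, v in decided.items() if p in correct}) <= 1,
        # All processes, correct or not, decided at most one distinct value.
        "uniform_agreement": len(set(decided.values())) <= 1,
    }


# p2 decides 'b' and then crashes: Agreement holds, Uniform Agreement does not.
report = check_consensus(
    proposals={1: "a", 2: "b", 3: "a"},
    decisions={1: "a", 2: "b", 3: "a"},
    correct={1, 3},
)
```

The example run shows why the two agreement properties differ: a faulty process may decide a different value before crashing without violating plain Agreement.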
The Main Theoretical Result
Fischer-Lynch-Paterson Impossibility Result (1985)
There is no deterministic protocol that solves the consensus problem in an asynchronous system that is subject to even a single process crash failure.
Too bad: this result extends to many other agreement problems.
How to Circumvent the Impossibility Result?
• Randomized Protocols
⋆ The termination property becomes: with probability 1, every correct process eventually decides.
• Equip the system with Additional Properties:
⋆ Partial synchrony
⋆ Failure Detectors
An Always-Safe Consensus Protocol (t < n/2)
pi initially proposes vi
repeat
• send my value to all processes
• wait for n−t values from different processes
• if all received values are equal to the same value v, send v to all processes; otherwise send ⊥
• wait for n−t values from different processes
⋆ if all received values are equal to v: decide v
⋆ if all received values are equal to ⊥: adopt any of the proposed values
⋆ if v and ⊥ are both received: adopt v
endrepeat
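One round of the protocol above can be simulated as follows. This is a minimal sketch under stated assumptions, not the author's code: the name `run_round` is hypothetical, no process crashes, asynchrony is modelled by letting each process hear from a random set of n−t senders, and "adopt any of the proposed values" is implemented as keeping the process's current value.

```python
import random

BOTTOM = None  # stands for ⊥


def run_round(values, t, rng):
    """Simulate one round; values maps each process id to its current value.

    Returns (new_values, decided), where decided holds the processes
    that decide in this round and the value they decide.
    """
    pids = list(values)
    n = len(pids)
    if not 2 * t < n:
        raise ValueError("the protocol assumes t < n/2")

    # Phase 1: everyone sends its value; each process hears n - t of them
    # and relays v if they are all equal to v, otherwise ⊥.
    phase2 = {}
    for p in pids:
        heard = [values[q] for q in rng.sample(pids, n - t)]
        phase2[p] = heard[0] if len(set(heard)) == 1 else BOTTOM

    # Phase 2: each process hears n - t relayed values and decides or adopts.
    new_values, decided = {}, {}
    for p in pids:
        heard = [phase2[q] for q in rng.sample(pids, n - t)]
        seen = {x for x in heard if x is not BOTTOM}
        if not seen:
            new_values[p] = values[p]      # only ⊥ received: keep own value (assumption)
        else:
            v = next(iter(seen))           # at most one non-⊥ value exists when t < n/2
            new_values[p] = v              # v received: adopt v
            if BOTTOM not in heard:
                decided[p] = v             # only v received: decide v
    return new_values, decided


# Five processes, at most two crashes tolerated, mixed proposals.
new_values, decided = run_round({0: "a", 1: "b", 2: "a", 3: "a", 4: "b"}, 2, random.Random(1))
```

Whether anyone decides in a given round depends on which messages each process happens to receive, which is exactly the freedom an asynchronous scheduler has.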
Few Features of this Algorithm
• If any process decides v during a round, then all processes that end that round will either decide v or adopt v.
• If all processes start a round with v then all processes that end the round will decide v (this is not the only case where processes can decide).
This is known as the Abort/Commit algorithm.
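The first feature rests on a counting argument that the slides leave implicit. As a sketch, write Q and Q′ (names introduced here) for the sets of senders from which two processes received their n−t messages in the same waiting phase:

```latex
|Q| = |Q'| = n - t, \qquad Q,\, Q' \subseteq \Pi
\\[4pt]
|Q \cap Q'| \;=\; |Q| + |Q'| - |Q \cup Q'| \;\ge\; 2(n - t) - n \;=\; n - 2t \;>\; 0
\quad \text{since } t < n/2 .
```

So any two processes hear from at least one common sender in each phase. In the first phase this prevents two different non-⊥ values from being relayed. In the second phase it means that if one process receives only v, every other process receives at least one v and therefore decides v or adopts v.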
Does this Algorithm Work?
• It is always safe but does not always terminate
This algorithm terminates with a very high probability.
• If the forever loop is changed to a fixed number of rounds, the algorithm will always terminate but safety may be violated
The new algorithm ensures safety with a very high probability
How to get a Correct Algorithm?
• Synchrony properties: for example, if the system is eventually synchronous
During a reception phase, a process waits for (n−t) messages and for at least a timeout that keeps increasing from round to round.
• Randomization: replace the statement “any of the proposed values” in the algorithm by the statement “a random value”.
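The randomized variant can be tried out in a small simulation. The sketch below makes several assumptions that are not in the slides: values are binary so that a coin flip is always a proposed value whenever it is used, no process crashes, each process hears from a random set of n−t senders, and decided processes keep participating. The name `randomized_consensus` and the `max_rounds` safety cap are hypothetical.

```python
import random

BOTTOM = None  # stands for ⊥


def randomized_consensus(proposals, t, seed=0, max_rounds=10_000):
    """Run rounds until every process has decided; return (decisions, rounds)."""
    rng = random.Random(seed)
    pids = list(proposals)
    n = len(pids)
    if not 2 * t < n:
        raise ValueError("the protocol assumes t < n/2")
    values = dict(proposals)
    decisions = {}
    for round_no in range(1, max_rounds + 1):
        # Phase 1: relay v if the n - t values heard are all equal to v, else ⊥.
        phase2 = {}
        for p in pids:
            heard = [values[q] for q in rng.sample(pids, n - t)]
            phase2[p] = heard[0] if len(set(heard)) == 1 else BOTTOM
        # Phase 2: decide, adopt, or flip a local coin.
        for p in pids:
            heard = [phase2[q] for q in rng.sample(pids, n - t)]
            seen = {x for x in heard if x is not BOTTOM}
            if seen:
                values[p] = next(iter(seen))
                if BOTTOM not in heard:
                    decisions.setdefault(p, values[p])
            else:
                values[p] = rng.randint(0, 1)   # "a random value" (binary assumption)
        if len(decisions) == n:
            return decisions, round_no
    raise RuntimeError("no decision within max_rounds (possible, but very unlikely)")


decisions, rounds = randomized_consensus({0: 0, 1: 1, 2: 0, 3: 1, 4: 1}, t=2, seed=7)
```

The number of rounds varies with the seed; what the coin buys is that the rounds in which nobody can decide do not repeat forever, except with probability 0.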
Byzantine Processes
• How does the proposed algorithm behave if the t faulty processes can exhibit a malicious behavior?
⋆ a malicious process can disseminate wrong information
⋆ a malicious process can send different values to different processes
⋆ the adversary can delay messages and/or processes
• Few examples
How to Deal with Byzantine Processes
• Adapt the specification of the problem
• Have smaller values for t than for crash failures
• Use many more messages
• Use certificates
• Use cryptography
• etc.
The Byzantine Consensus Problem
• Any solution to Byzantine consensus (even in synchronous systems) needs t < n/3.
Each process pi proposes a value vi and tries to decide.
• Termination: Every correct process eventually decides some value.
• Validity: If all correct processes propose v, then only v can be decided.
• Agreement: No two correct processes decide differently.
Adapting the Previous Algorithm
• Consider binary consensus
• t < n/5
• Replace the statement “any of the proposed values” in the algorithm by the statement “a random value” among 0 and 1.
• Replace the statement “all received values” by the statement “at least n−2t received values”.
We get the Byzantine randomized algorithm of Rabin.
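The second replacement can be illustrated on the first waiting phase alone. The helper below is a hypothetical sketch (the name `phase1_message` is not from the slides): it applies the threshold "at least n−2t received values" to the n−t values a process has collected, and does not attempt to reproduce the full Byzantine algorithm.

```python
from collections import Counter

BOTTOM = None  # stands for ⊥


def phase1_message(received, n, t):
    """Return the value to relay after the first waiting phase, or ⊥.

    received: the n - t binary values collected from different processes,
              up to t of which may come from Byzantine processes.
    """
    if not 5 * t < n:
        raise ValueError("the adaptation assumes t < n/5")
    if len(received) != n - t:
        raise ValueError("expected exactly n - t received values")
    value, count = Counter(received).most_common(1)[0]
    # Relay v only if at least n - 2t of the received values equal v.
    return value if count >= n - 2 * t else BOTTOM


# n = 6, t = 1: a process waits for 5 values and needs 4 equal ones.
strong = phase1_message([1, 1, 1, 1, 0], n=6, t=1)   # one stray 0 is tolerated
weak = phase1_message([1, 1, 1, 0, 0], n=6, t=1)     # not enough support for 1
```

Because n−2t is more than half of n−t whenever t < n/3, at most one value can reach the threshold, so the helper never has to break a tie.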