MIT-LCS-TR-994
Byzantine Clients Rendered Harmless
Barbara Liskov and Rodrigo Rodrigues
MIT CSAIL and INESC-ID
32 Vassar Street - Cambridge, MA 02139 - USA
Rua Alves Redol 9 - 1000 Lisboa - Portugal
Abstract
Byzantine quorum systems have been proposed that work properly even when up to f replicas fail arbitrarily.
However, these systems are not so successful when confronted with Byzantine-faulty clients. This paper presents novel
protocols that provide atomic semantics despite Byzantine clients. Our protocols are the first to handle all problems
caused by Byzantine clients. They prevent Byzantine clients from interfering with good clients: bad clients cannot
prevent good clients from completing reads and writes, and they cannot cause good clients to see inconsistencies. In
addition we also prevent bad clients that have been removed from operation from leaving behind more than a bounded
number of writes that could be done on their behalf by a colluder.
Our protocols are designed to work in an asynchronous system like the Internet and they are highly efficient. We
require 3f + 1 replicas, and either two or three phases to do writes; reads normally complete in one phase and require
no more than two phases, no matter what the bad clients are doing.
We also present strong correctness conditions for systems with Byzantine clients that limit what can be done on
behalf of bad clients once they leave the system. Furthermore we prove that our protocols are both safe (they meet
those conditions) and live.
1 Introduction
Quorum systems [4, 14] are valuable tools for building
highly available replicated data services. A quorum sys-
tem can be defined as a set of sets (called quorums) with
certain intersection properties. These systems allow read
and write operations to be performed only at a quorum of
the servers, since the intersection properties ensure that
any read operation will have access to the most recent
value that was written.
The original work on quorum systems assumed that
servers fail benignly, i.e., by crashing or omitting some
steps. More recently, researchers have developed tech-
niques that enable quorum systems to provide data avail-
ability even in the presence of arbitrary (Byzantine) faults [9].
Earlier work provides correct semantics despite server (i.e.,
replica) failures and also handles some of the problems of
Byzantine clients [9, 10, 5, 1, 2, 11, 12].
This paper extends this earlier work in two important
ways. First, it defines new protocols that handle all prob-
lems of Byzantine clients. In addition, our protocols are
more efficient than previous proposals, in terms of both
operation latency and number of replicas. Second, the
paper states correctness conditions for such systems and
proves that our protocols are correct. The correctness
conditions are stronger than what has been stated previously [11]
and what has been guaranteed by previous approaches.
Since a dishonest client can write garbage into the shared
variable, it may seem there is little value in limiting what
bad clients can do. But this is not the case, for two reasons.
First, bad clients can cause a protocol to misbehave so that
good clients are unable to perform operations (i.e., the pro-
tocol is no longer live), or so that good clients observe
incorrect behavior. For example, if the variable is write-once,
a good client might observe that its state changes multiple times.
Second, bad clients can continue to interfere with good
ones even after they have been removed from operation,
e.g., by a system administrator who learns of the misbe-
havior. We would like to limit such interference so that,
after only a limited number of writes by good clients, any
lurking writes left behind by a bad client will no longer be
visible to good clients. A lurking write is a modification
launched by the bad client before it was removed from
operation that will happen (possibly with help from an ac-
complice) after it has left the system. By limiting such
writes we can ensure that the object becomes useful again
after the departure of the bad client, e.g., some invariant
that good clients preserve will hold.
Of course, it is not possible to prevent actions by a bad
client, even if it has been shut down, if the private key it
uses to prove that it is authorized to modify an object can
be used by other nodes; thus, a bad client is in the system
as long as any node knows its private key. If nodes can
expose their private keys, we can make statements about
behavior in the absence of a bad client only when all faulty
nodes have been removed from operation [11] – a condi-
tion that is not very useful in practice. A more practical ap-
proach is to make use of secure cryptographic coprocessors
(like the IBM 4758 [7]) that allow signing without expos-
ing the private key. Our correctness condition is stated in
a way that allows either model about the lifetime of faulty
nodes in the system.
Thus, we make the following contributions:
We present strong correctness conditions for an atomic
read/write object in a system in which both clients
and replicas can be Byzantine. Our conditions en-
sure atomicity for good clients, and also limit the
effects of bad clients that have been removed from
operation: our conditions bound the number of lurk-
ing writes by a constant factor, and prevent lurking
writes after bad clients stop and good clients subse-
quently overwrite the data.
We present the first Byzantine quorum protocols that
satisfy the conditions using only 3f + 1 replicas (to
survive f faulty replicas) and work in an unreliable
asynchronous network like the Internet. Further-
more our protocols are efficient: writes require
either 3 phases (our base protocol) or mostly
2 phases (our optimized protocol). Reads usually
require 1 phase; they sometimes need an additional
phase (to write back the data just read). Our base
protocol ensures that there can be at most one lurk-
ing write after a bad client has left the system; the
optimized protocol ensures slightly weaker behav-
ior: there can be at most two lurking writes.
We are able to manage with a small number of repli-
cas and phases because our protocol makes use of
“proofs”, an approach that we believe can be used
to advantage in other protocols. A proof is a col-
lection of 2f + 1 certificates from different replicas
that vouch for some fact, e.g., that a client has com-
pleted its previous write, or that the state read from
a single replica is valid.
We prove the correctness of our protocols, both safety
and liveness. In addition we describe a variation of
our protocols that supports a still stronger correct-
ness condition, in which we can bound the number
of writes of good clients that it takes to ensure that
modifications of the bad client that has left the sys-
tem will no longer be visible. This variation some-
times requires an additional phase to do a write.
The rest of this paper is organized as follows. We begin
by describing our assumptions about the system. Sec-
tion 3 describes our base protocol. Section 4 describes
our correctness conditions and we prove our base protocol
meets them in Section 5. Section 6 describes our opti-
mized protocol and proves its correctness. Section 7 de-
scribes a variation of our protocols that allows us to bound
the number of overwrites needed to hide effects of lurking
writes. Section 8 discusses related work and we conclude
in Section 9.
2 Model
The system consists of a set C = {c1, ..., cn} of client
processes and a set S = {s1, ..., sn} of server processes.
Client and server processes are classified as either correct
or faulty. Correct processes are constrained to obey their
specification, i.e., they follow the prescribed algorithms.
Faulty processes may deviate arbitrarily from their spec-
ification, i.e., we assume a Byzantine failure model [8].
Note that faulty processes include those that fail benignly,
and those that are initially correct and fail at some point in
the execution of the system.
We refer to the set of faulty clients as Cbad and the
set of correct clients as Cok, and consequently we have
C = Cbad ∪ Cok (and respectively S = Sbad ∪ Sok). Note
that our algorithms do not require knowledge of Cok,
Cbad, Sok, or Sbad. In other words, we do not assume
that we can detect faults.
We assume an asynchronous distributed system where
nodes are connected by a network that may fail to deliver
messages, delay them, duplicate them, corrupt them, or
deliver them out of order, and there are no known bounds
on message delays or on the time to execute operations.
We assume the network is fully connected, i.e., given a
node identifier, any other node can (attempt to) contact
the first node directly by sending it a message.
For liveness, we require that if a message is retransmit-
ted repeatedly it will eventually be delivered. Note that we
only require this for liveness; safety does not depend on
any assumptions about message delivery.
We assume nodes can use unforgeable digital signa-
tures to authenticate communication. More precisely, any
node n can authenticate messages it sends by signing
them. We denote a message m signed by n as ⟨m⟩σn.
And (with high probability) no node can send ⟨m⟩σn (ei-
ther directly or as part of another message) on the network
for any value of m, unless it is repeating a message that
has been sent before or it knows n's private key. We call
the set of possible signatures Σ. These signatures can be
verified with public keys in a set P.
We also assume the existence of a collision-resistant
hash function h, such that any node can compute a digest
h(m) of a message m and (with high probability) it is im-
possible to find two distinct messages m and m′ such that
h(m) = h(m′).
These assumptions are probabilistic but there exist sig-
nature schemes (e.g., RSA [13]) and hash functions (e.g.,
SHA-1 [3]) for which they are believed to hold with very
high probability. Therefore, we will assume they hold
with probability one in the rest of the paper.
To avoid replay attacks we tag certain messages with
nonces that are signed in the replies. We also assume that,
with probability one, clients never pick a repeated nonce.
3 BFT-BC Algorithm
This section presents our construction for a read/write vari-
able that is implemented using Byzantine quorum replication
and tolerates Byzantine-faulty clients. We begin by giv-
ing a brief overview of Byzantine quorums in Section 3.1.
Then we present our base protocol. This protocol requires
3 phases to write; in Section 6 we present an optimization
that requires only 2 phases most of the time.
3.1 Byzantine Quorums Overview
This section gives an overview of how current algorithms
use Byzantine quorums to implement a shared read/write
variable. This presentation follows the original BQS pro-
tocol [9], using their construction for a system that doesn’t
handle Byzantine clients. (We discuss this system further
in Section 8.)
A Byzantine quorum system defines a set of subsets
of a replica group with certain intersection properties. A
typical way to configure such a system is to use groups of
3f + 1 replicas to survive f failures, with quorums of size
2f + 1 replicas. This ensures that any two quorums inter-
sect in at least one non-faulty replica. Each of the replicas
maintains a copy of the data object, along with an asso-
ciated timestamp, and a client signature that authenticates
the data and timestamp.
Two phases are required to write the data. First, the
client contacts a quorum to obtain the highest timestamp
produced so far. The client then picks a timestamp higher
than what was returned in the first phase, signs the new
value and timestamp, and proceeds to the second phase
where the new value is stored at a quorum of replicas.
Replicas allow write requests only from authorized clients.
A replica overwrites what it has stored only if the times-
tamp in the request is greater than what it already has.
The read protocol usually has a single phase where the
client queries a quorum of replicas and returns the value
with the highest timestamp (provided the signature is cor-
rect). An extension of this protocol [10] requires a second
phase that writes back the highest value read to a quorum
of replicas (this ensures atomic semantics for reads).
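To make the quorum arithmetic concrete, the following is a minimal sketch (ours, not from the paper) that exhaustively checks the intersection property for a small f; the function name is our invention.

```python
from itertools import combinations

def check_quorum_intersection(f: int) -> None:
    """With n = 3f+1 replicas and quorums of size 2f+1, any two
    quorums intersect in at least f+1 replicas, so the intersection
    contains at least one correct replica even if f are faulty."""
    n, q = 3 * f + 1, 2 * f + 1
    for a in combinations(range(n), q):
        for b in combinations(range(n), q):
            assert len(set(a) & set(b)) >= f + 1

check_quorum_intersection(1)  # 4 replicas, quorums of 3: intersections >= 2
```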
3.2 BFT-BC Protocol
The protocol just presented is not designed to handle Byzantine-
faulty clients, which can cause damage to the system in
several ways, e.g.:
1. Not follow the protocol by writing different values
associated with the same timestamp.
2. Only carry out the protocol partially, e.g., install a
modification at just one replica.
3. Choose a very large timestamp and exhaust the time-
stamp space.
4. Issue a large number of write requests and hand
them off to a colluder who will run them after the
bad client has been removed from the system. This
colluder could be one of the replicas, or a com-
pletely separate machine.
To avoid these problems, we need to limit what bad
clients can do. But we must also allow good clients to
make progress, in spite of bad clients that might collude
with one another and/or with bad replicas.
Our protocol accomplishes these goals. It uses 3f + 1
replicas, and quorums can be any subset with 2f + 1 repli-
cas. It uses a three-phase protocol to write, consisting of a
read phase to obtain the most recent timestamp, a prepare
phase in which a client announces its intention to write a
particular timestamp and value, and a write phase in which
the client does the write that it prepared previously.
As it moves from one phase to another, however, the
client needs to “prove” that what it is doing is legitimate.
A proof takes the form of a quorum of certificates from
different replicas that vouch for some fact.
For example, the purpose of the prepare phase is for
the client to inform the replicas of what it intends to do,
and this must be acceptable, e.g., the timestamp it pro-
poses can’t be too big. Replicas that approve the prepare
request return certificates that together provide a “prepare
proof”. This proof is needed to carry out the write phase:
a client can only carry out a write that has been approved.
If a write is allowed by a replica it returns a certificate, and
these certificates together form a “write proof”. This proof
is needed for the client to do its next write: a client cannot do
a second write without completing its first one. This con-
straint, plus the fact that proofs cannot be forged or pre-
dicted in advance by bad clients, is what limits the num-
ber of lurking writes a bad client can leave behind when
it stops (to be carried out by some node that colludes with
it).
We now describe our protocol in detail.
As mentioned, each object in BFT-BC is replicated at
a set of 3f + 1 replicas, numbered from 0 to 3f. Quorums
can be any subset with 2f + 1 replicas, and we use a three-
phase protocol to write.
Each replica maintains the following per-object infor-
mation:
data, the value of the object.
current-ts, the associated timestamp.
prepare-proof, a valid prepare proof for the above times-
tamp and a hash of the value.
prepare-list, a list containing the timestamp, hash,
and proposing client of currently prepared writes.
write-ts, the timestamp of the latest write known to
have completed at 2f + 1 replicas.
In the remainder of this paper we simplify the presen-
tation by considering a system containing only a single
object, and therefore we omit object identifiers from the
description of the protocol.
3.2.1 Write Protocol
Our protocols require that different clients choose differ-
ent timestamps, and therefore we construct timestamps by
concatenating a sequence number with a client identifier:
ts = ⟨ts.val, ts.id⟩. We assume that client identifiers are
unique. To increment a timestamp, a client with identifier c
uses the following function: succ(ts, c) = ⟨ts.val + 1, c⟩.
Timestamps can be compared in the usual way, by com-
paring the val parts and if these agree, comparing the
client ids.
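As a concrete sketch (the names are ours, not the paper's), the timestamp rules above map directly onto lexicographic tuple comparison:

```python
from typing import NamedTuple

class Timestamp(NamedTuple):
    """ts = <ts.val, ts.id>; tuples compare by val first, then by
    client id, which is exactly the order described above."""
    val: int
    id: int  # client identifier, assumed unique

def succ(ts: Timestamp, c: int) -> Timestamp:
    """succ(ts, c) = <ts.val + 1, c>."""
    return Timestamp(ts.val + 1, c)

# Distinct clients always produce distinct, totally ordered timestamps.
assert succ(Timestamp(3, 1), 2) == Timestamp(4, 2)
assert Timestamp(3, 1) < Timestamp(3, 2) < Timestamp(4, 1)
```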
Client Processing
To write a data object, clients go through a three-phase
protocol. In all phases, clients retransmit their requests to
account for lost messages; they stop retransmitting once
they collect a quorum of valid replies.
Phase 1. The client, c, sends a READ-TS request to all
replicas; the request contains a nonce to avoid replays.
The replies each include a prepare proof for the timestamp
being returned. The client waits for a quorum of correctly
authenticated replies. It selects the largest timestamp tmax
it received, computes a new timestamp t′ = succ(tmax, c),
and moves to phase 2.
Phase 2. The client sends PREPARE requests to all repli-
cas. The request contains the new timestamp, the prepare
proof obtained in phase 1, a hash of the value it intends
to write, and a write proof (for the most recent write done
by this client, or null if this is the first write done by this
client), all authenticated by the client. The client waits for
a quorum of PREPARE-ACKs, and then moves to phase 3;
the 2f + 1 responses it received in phase 2 constitute a
prepare proof for the write.
Phase 3. The client sends a WRITE request containing the
prepare proof along with the new value, all authenticated
by the client, and waits for a quorum of valid replies.
Each reply contains a signed statement vouching for
the write for that timestamp, plus the replica id of the sender.
A vector of 2f + 1 such statements constitutes a write
proof for the current operation.
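The client side of the write can be summarized in a short sketch. This is our reading of the three phases, not code from the paper: `rpc` and `sign` are hypothetical stand-ins for the transport layer (broadcast, retransmit, and collect a quorum of 2f + 1 valid replies) and the client's signing key, and `succ` is the timestamp function sketched above.

```python
import os
import hashlib
from typing import Callable, List

def client_write(c: int, value: bytes, last_write_proof: object,
                 rpc: Callable[[object], List[dict]],
                 sign: Callable[[object], object]) -> List[dict]:
    # Phase 1: obtain the highest timestamp; each reply carries a
    # prepare proof vouching for the timestamp it returns.
    replies = rpc(("READ-TS", os.urandom(16)))
    best = max(replies, key=lambda r: r["ts"])
    ts_new = succ(best["ts"], c)

    # Phase 2: announce the new timestamp and a hash of the value,
    # together with the prepare proof from phase 1 and the write proof
    # for this client's previous write (None on the first write). The
    # quorum of PREPARE-ACK certificates forms the prepare proof.
    digest = hashlib.sha1(value).digest()
    prepare_proof = rpc(sign(("PREPARE", ts_new, best["prepare_proof"],
                              digest, last_write_proof)))

    # Phase 3: perform the prepared write. The quorum of WRITE-ACK
    # certificates forms the write proof needed for the next write.
    return rpc(("WRITE", prepare_proof, value))
```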
Processing at replicas
Phase 1. Replicas reply to a READ-TS request by sending
the prepare proof for their current timestamp and value,
plus the nonce, all authenticated by the replica.
Phase 2. Phase 2 processing is the crucial part of the al-
gorithm. The replicas check to ensure that the timestamp
being proposed is reasonable, that the client is doing just
one prepare, that the value being proposed doesn’t differ
from a previous request for the same timestamp, and that
the client has completed its previous write.
The replica discards the message if it is invalid: if the
prepare or write proof can’t be validated, the timestamp
doesn’t have the client id in the low-order part, the pro-
posed timestamp isn’t the successor of the timestamp in
the prepare proof, or the client isn’t authorized to write
the object.
Next, the replica uses the write proof to update its write-ts:
it sets this to the larger of the timestamp in the write proof
it just received, and what it already has stored. Then it re-
moves any entry from its prepare-list whose timestamp is
less than or equal to the write-ts.
If there is still an entry in the prepare-list for the client,
the replica discards the request unless that entry matches
what is being proposed (both the timestamp and the hash).
If the replica accepts the request it updates the prepare-
list by adding the timestamp and hash for this client (un-
less this information is already in the list). Then it returns
a PREPARE-ACK message containing a signed certificate
for the new timestamp and hash.
Phase 3. Replicas carry out the request only if it is valid:
the prepare proof must be valid and the new value must
match the hash in the proof. Replicas respond to valid re-
quests by sending a WRITE-ACK containing a signed cer-
tificate for the new timestamp. The replica additionally
modifies its state if the timestamp in the request is greater
than its current timestamp; in this case it stores the pro-
vided information as the current value, timestamp, and as-
sociated prepare proof.
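Since phase 2 processing is the crucial part of the algorithm, a sketch may help. Again this is our reconstruction under assumptions: `Entry`, `verify_prepare_proof`, and `verify_write_proof` are hypothetical names for the prepare-list entries and proof validation, `state` holds the per-object fields listed in Section 3.2, and the client-authorization check is elided.

```python
from typing import NamedTuple

class Entry(NamedTuple):
    ts: Timestamp
    hash: bytes
    client: int

def handle_prepare(state, req) -> bool:
    """Return True to send a PREPARE-ACK, False to discard."""
    # Validity: proofs check out, the timestamp carries the client id
    # and is the successor of the timestamp in the prepare proof.
    if not (verify_prepare_proof(req.prepare_proof)
            and verify_write_proof(req.write_proof)
            and req.ts.id == req.client
            and req.ts == succ(req.prepare_proof.ts, req.client)):
        return False

    # Advance write-ts using the presented write proof, then drop
    # prepare-list entries whose write is known to have completed.
    state.write_ts = max(state.write_ts, req.write_proof.ts)
    state.prepare_list = [e for e in state.prepare_list
                          if e.ts > state.write_ts]

    # At most one prepared write per client; a duplicate is re-acked
    # only if it matches both timestamp and hash exactly.
    for e in state.prepare_list:
        if e.client == req.client:
            return (e.ts, e.hash) == (req.ts, req.hash)
    state.prepare_list.append(Entry(req.ts, req.hash, req.client))
    return True
```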
3.2.2 Read Protocol
The read protocol usually requires just one phase.
Phase 1. The client sends a READ request to all replicas;
the request contains a nonce. A replica replies with its
value, timestamp, prepare proof, and nonce all signed by
it. The client waits for a quorum of valid responses and
chooses the one with the largest timestamp (this is the re-
turn value). If all the timestamps are the same the read
protocol ends.
Phase 2. Otherwise the client goes on to the write-back
phase for the largest timestamp; this is identical to phase 3
when a client is doing a write, except that the client needs
to send only to replicas that are behind, and it must wait
only for enough responses to ensure that 2f + 1 replicas
now have the new information.
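Sketched with the same hypothetical `rpc` hook as before, the read path looks like this (write-back reuses the writer's phase 3):

```python
import os
from typing import Callable, List

def client_read(rpc: Callable[[object], List[dict]],
                write_back: Callable[..., None]) -> bytes:
    # Phase 1: quorum of signed (value, timestamp, prepare proof) replies.
    replies = rpc(("READ", os.urandom(16)))
    best = max(replies, key=lambda r: r["ts"])
    # Phase 2 (only if replicas disagree): write back the chosen value
    # so 2f+1 replicas hold it before the read returns; this is what
    # gives reads atomic semantics.
    if any(r["ts"] != best["ts"] for r in replies):
        write_back(best["ts"], best["value"], best["prepare_proof"])
    return best["value"]
```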
3.3 Discussion
In the above description we mentioned that certain mes-
sages or statements were authenticated, but left the kind
of authentication unspecified. This
issue is important since different techniques have differ-
ent costs: we can authenticate a point-to-point message by
means of symmetric cryptography by establishing session
keys and using message authentication codes (MACs). This
does not work for signing statements that have to be shown
to other parties, in which case we need to rely on more ex-
pensive public key cryptography.
Our protocol requires signing using public key cryp-
tography in two places: the phase 2 and phase 3 responses.
These signatures are needed because they are used as proofs
offered to third parties, e.g., the prepare proof is generated
for one client but then used by a different client to justify
its choice of the next timestamp.
A further point is that only the phase 2 response sig-
nature needs to happen in the foreground. The signature
for the phase 3 response can be done in the background:
a replica can do this after replying to the phase 2 request,
so that it will have the signature ready when the phase 3
request arrives.
4 Correctness Condition
This section defines the correctness conditions for a vari-
able shared by multiple clients that may incur Byzantine
failures. We begin by defining histories (Section 4.1), and
then we give our correctness condition.
4.1 Histories and Stopping
We use a definition of history similar to the one proposed
in [6], extended to handle Byzantine clients.
The execution is modeled by a history, which is a se-
quence of events of the following types:
Invocation of operations.
Response to operation invocations.
Stop events.
An invocation by a client c is written ⟨c : x.op⟩ where
x is an object name and op is an operation name (pos-
sibly including arguments). A response to c is written
⟨c : x.rtval⟩ where rtval is the response value. A re-
sponse matches an invocation if their object names agree
and their client names agree. A stop event by client c is
written ⟨c : stop⟩.
A history is sequential if it begins with an invocation, if
every response is immediately followed by an invocation
(or a stop or no event), and if every invocation is imme-
diately followed by a matching response. A client subhis-
tory H|c of a history H is the subsequence of all events in
H whose client names are c. A history is well-formed if
for each client c, H|c is sequential. We use H to denote
the set of well-formed histories.
An object subhistory H|x is the subsequence of
all events whose object names are x, and a history H is
a single-object history for some object x if H|x = H. A
sequential specification for an object is a prefix-closed set
of single-object sequential histories for that object. A se-
quential history H is legal if each object subhistory H|x
belongs to the sequential specification of x.
An operation o in a history is a pair consisting of an
invocation inv(o) and the next matching response rsp(o).
A history H induces an irreflexive partial order <H on
the operations and stop events in H as follows: o0 <H
o1 if and only if rsp(o0) precedes inv(o1) in H; o0 <H
⟨c : stop⟩ if and only if rsp(o0) precedes ⟨c : stop⟩ in
H; ⟨c : stop⟩ <H o1 if and only if ⟨c : stop⟩ precedes
inv(o1) in H; and ⟨c1 : stop⟩ <H ⟨c2 : stop⟩ if and only
if ⟨c1 : stop⟩ precedes ⟨c2 : stop⟩ in H.
4.1.1 Verifiable Histories
In [6], the notion of linearizability is applied to a concur-
rent computation, which is modeled by a history (a finite
sequence of invocation and response events by all pro-
cesses).
It would be difficult to model such computations in our
environment, since faulty processes do not obey any spec-
ification, and therefore we cannot define what an invoca-
tion or a response means for such processes. However, we
know that a faulty process will halt after its stop event.
Therefore we introduce the concept of a verifiable his-
tory, which is a history that contains the sequence of invo-
cations and responses from correct clients, and stop events
from faulty clients.
The correctness condition we present next is applicable
only to verifiable histories.
4.2 Correctness Condition
We are now ready to define the correctness condition for
variables shared by multiple processes that may incur Byzan-
tine failures.
The idea is that, as in linearizability [6], we require
that the verifiable history look plausible to the correct pro-
cesses, i.e., that there is a sequential history in which all
processes are correct that explains what the correct pro-
cesses observed. Furthermore, we require that once a faulty
client stops, its subsequent effect on the system is limited
in the following way: The number of operations by that
client that are “seen” by correct clients after it stopped is
bounded by some constant.
These concepts can be formalized as follows.
Definition 1 A verifiable history H ∈ H is BFT-linearizable
if there exists some legal sequential history H′ ∈ H such
that
1. ∀p ∈ Cok : H|p = H′|p
2. <H ⊆ <H′
3. ∀c ∈ Cbad : ⟨c : stop⟩ ∈ H ⇒
(∃h1, h2 : H′ = h1 · ⟨c : stop⟩ · h2 ∧
|{o ∈ h2 : o = ⟨c : x.op⟩}| ≤ max-b)
Points 1 and 2 of the definition above state that there
must exist a sequential history that looks the same to cor-
rect processes as the verifiable history in question, and that
sequential history must preserve the <Hordering (in other
words, if an operation or a stop event precedes another op-
eration or stop event the verifiable history, then the prece-
dence must also hold in the sequential history).
Point 3 says that if all faulty clients stop, then when
we look at the sequential history and consider the sub-
history after the last stop event by faulty clients (this is
h2), the number of events by faulty clients in that history
is bounded by max-b.
Note that there is a counter-intuitive notion that a bad
client can execute operations after it has stopped. This cor-
responds to the notion that the bad client left some pending
operations (e.g., with a colluder) before it left, and this is
precisely the effect we want to minimize.
We can now define a BFT-linearizable object to be an
object whose verifiable histories are BFT-linearizable with
respect to its sequential specification.
5 Correctness Proof
This section sketches a proof that the algorithm presented
in Section 3 meets the correctness conditions presented in
Section 4.
The idea is to look at what kind of properties we can
ensure given a certain state of the system when a faulty
client c stops. If we look at the current-ts values stored at
that instant, we can guarantee the following.
Lemma 1. Let tsmax be the (f + 1)st highest timestamp
stored by non-faulty replicas at time tstop (when some
faulty client c stops). Then the following must hold:
1. At any time up to tstop, no node can collect a write
proof for a timestamp t′ such that t′ > tsmax.
2. There are no two timestamps t1, t2 > tsmax such that
client c assembled a prepare proof for both t1 and t2.
3. No two prepare proofs exist for the same timestamp
t > tsmax and different associated values.
Proof.
1. By algorithm construction, a non-faulty replica will not
sign an entry in a write proof vouching for a times-
tamp higher than the one held in the variable current-ts.
Since non-faulty replicas always store increasing times-
tamp values, this means that the number of signatures
that can be held in the system at time tstop for times-
tamps higher than tsmax is at most 2f (i.e., f from
faulty replicas and the f correct replicas that may hold
timestamps higher than tsmax).
2. By contradiction, suppose that client c held prepare
proofs for t1, t2, both greater than tsmax. The two
proofs intersect in at least one non-faulty replica. By
part (1) of this lemma, that replica had its write-ts vari-
able always at a value less than or equal to tsmax (at
all times up to and including tstop). Therefore that
replica could not have signed both proofs, since after
signing the first one (say, for t1) it would insert an
entry for client c and t1 in its prepare-list, and that
entry would not be removed (because of the value of
write-ts), which prevents the replica from signing the
proof for t2.
3. Suppose, by contradiction, that two prepare proofs exist
for timestamp t and values v and v′. By the quorum in-
tersection properties, these two prepare proofs contain
at least one signature from the same correct replica. By
part (1) of this lemma, no write proof was ever assem-
bled for a timestamp greater than tsmax, so these entries
in the prepare list were never removed. This violates
the constraint that a correct replica does not sign a
timestamp that is already in its prepare list with a
different value.
We are now ready to show the correctness of our algo-
rithm.
Theorem 1. The BFT-BC algorithm is BFT-linearizable.
Proof. Consider any correct reader in the case that the
writer is correct. In this case, the quorum initially ac-
cessed in a read operation intersects the quorum written
in the most recently completed write operation in at least
one correct replica. Therefore, the read returns either the
value in the most recently completed write, or a value with
a higher timestamp (which could be written concurrently
with the read). Since a read also writes back its obtained
value and timestamp to a quorum of processes, any subse-
quent read by a correct reader will return that timestamp
value or a later one. So, for any execution, we construct
the history needed to show BFT-linearizability by putting
every read right after the write whose value it returns.
If the writer is faulty, we construct the sequential his-
tory to show BFT-linearizability as follows: for each read
by a correct reader returning v such that the phase 3 re-
quest for v was produced by client cb, insert a write opera-
tion in the history that writes v (by client cb) immediately
before the read.
We insert each stop event before the invocation of the first
operation that follows it in the original (verifiable) history
(i.e., as late as possible while preserving the <H dependencies).
It is left to show that if a faulty process stops, this
history contains a bounded number of operations by that
faulty process after it stops.
To prove this condition we note that Lemma 1 part (2)
says that client c only assembled prepare proofs for a sin-
gle timestamp for a write that would become visible after
it stopped (i.e., with a timestamp greater than tsmax), and
Lemma 1 part (3) implies that if the write were to become
visible, the prepare proof could only correspond to a sin-
gle value.
This means the number of operations by the faulty client
c after it stops is at most one.
5.1 Liveness
As far as liveness is concerned, we guarantee that good
clients can always execute read operations in the time it
takes for two client RPCs to complete at 2f + 1 replicas
(i.e., the requests reaching the replicas and the respective
replies returning to the client). This is so because at least
2f + 1 replicas will provide them with the appropriate an-
swers (client requests are answered unconditionally pro-
vided the requests are well-formed). For the write proto-
col, the operation will complete in the time for three client
RPCs to complete at 2f + 1 replicas; the key is to
ensure that phase 2 requests from good clients are never
refused (since phase 1 requests are answered unconditionally
and phase 3 requests are answered if they contain a valid
prepare proof, which good clients will always send).
Phase 2 requests also get replies, since the client submits
the write proof for its latest write along with the request,
which forces the replica to discard the client's entry in
the prepare list and accept the prepare request.
6 Optimized BFT-BC Algorithm
This section describes our optimized protocol, which usu-
ally requires only two phases to do a write, and proves that
the optimized protocol satisfies our correctness condition.
6.1 Optimized BFT-BC
Our plan is to avoid one of the phases by merging it with
another phase. We cannot merge phases 2 and 3 since the
protocol requires that the prepare information be stored at
f + 1 honest replicas before the client can do phase 3: this
is how we avoid a bad client being able to create many
writes in advance that could be launched after it leaves the
system. Therefore, we will merge phases 1 and 2.
The idea is that the client sends the hash in the phase
1 request and the replica does phase 2 on its behalf: it
predicts the next timestamp, adds the new timestamp with
the hash to the prepare-list, and returns this information,
signed by itself. If the client receives a quorum of re-
sponses all for a particular (highest) new timestamp, it can
move to phase 3 immediately.
This optimization will work well in the normal case
where a write is actually received by all replicas. There-
fore the good client is highly likely to receive a quorum of
acknowledgments for the same timestamp in the responses
in phase 1, and most of the time a write will require two
phases.
However there are problems that must be overcome for
the optimization to be used. They occur when there are
concurrent writes or when other clients are faulty.
A simple example is the following. Suppose a client
does phase 1 for some hash h. It receives PREPARE-ACKs
for a particular timestamp t from replicas R1 and R2, and
PREPARE-ACKs for a larger timestamp t′ from replicas R3
and R4. The client will then be unable to get approval
for either t or t′, because to approve the request a replica
would need to violate our restriction of having at most one
entry per client on the prepare-list.
Clearly we need to find a way to allow clients to do
writes in cases like this one. But given our current con-
straints this is not possible.
To allow the client to make progress we will weaken
our constraint on how many entries a client can have in the
prepare list. In particular we will change the replica state
to have a second prepare list, opt-list, containing entries
produced by the replica, with the constraint that a client
can have at most one entry per list. The two entries for
the same client (one in each list) might be for different
timestamps, or they might be for the same timestamp. In
the latter case, it will be possible for the write with that
timestamp to happen twice (when there is a bad client). In
this case we need a way to order the writes: we will do
this by using the numeric order on their hashes.
6.2 Detailed Write Protocol
Phase 1. The client, c, sends a READ-TS-PREP request
containing the hash of the proposed value, and the client’s
current write proof, to all replicas.
Each replica processes the request in the usual phase-2 way,
including removing entries from the prepare-lists. Then it
does the prepare on behalf of the client for timestamp
t′ = succ(ts, c), unless the client already has an entry in
either prepare list for a different timestamp or hash. If
the prepare is successful, the replica adds an entry to the
opt-list for t′ and the hash (unless that entry is already in
the list) and returns a PREPARE-ACK signed by it; other-
wise it returns a normal phase 1 response.
If the client gets a quorum of PREPARE-ACKS (i.e.,
obtains a prepare proof) for the same new timestamp, it
moves to phase 3.
Phase 2. Otherwise the client chooses the new timestamp
as before and carries out phase 2 for it (as in the normal
protocol). When it has a quorum of signatures for this
choice (obtained either in phase 1 or phase 2), it moves to
phase 3.
Replicas handle phase 2 requests as in the normal pro-
tocol; they ignore the opt-list in this processing.
Phase 3. The client does phase 3 in the normal way.
The replica also does normal processing, except that
the timestamp in the write might match the current times-
tamp, but with a different value: in this case it retains the
value with the larger hash.
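The tie-break is the only change on the replica's write path; a sketch of it follows (assuming, as in the earlier sketches, that `state` keeps the value hash alongside the current timestamp):

```python
def apply_write(state, req) -> None:
    """Optimized phase 3 at a replica: overwrite on a strictly larger
    timestamp as usual, and on an equal timestamp keep the value whose
    hash is numerically larger, so replicas converge even if a bad
    client prepared the same timestamp on both prepare lists."""
    larger_ts = req.ts > state.current_ts
    tie_wins = (req.ts == state.current_ts and
                int.from_bytes(req.hash, "big") >
                int.from_bytes(state.current_hash, "big"))
    if larger_ts or tie_wins:
        state.data, state.current_ts = req.value, req.ts
        state.current_hash = req.hash
        state.prepare_proof = req.prepare_proof
```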
6.3 Discussion
The optimized protocol can lead to a client having entries
on two prepare lists (normal and optimized). A dishonest
client can exploit this to carry out phase 3 twice as part
of the same write request. And as a result, each dishon-
est client can leave two lurking writes behind when it is
removed from the system.
Another point is that it is now possible for honest clients
to see valid responses to a read request that have the same
timestamp but different values. The client protocol re-
solves this situation by returning (and writing back) the
value with the larger hash.
6.4 Optimized BFT-BC Correctness Proof
For the optimized protocol some of the invariants shown
in Section 5 need to be slightly modified. Notably, Lemma
1 parts (2) and (3) no longer hold, since a faulty client can
now collect two distinct prepare proofs (for its entries in
prepare-list and opt-list). Therefore Lemma 1 parts (2)
and (3) become:
Lemma 1’.
(2+3). For the optimized BFT-BC protocol, no more than
two prepare proofs can exist for distinct (timestamp, value)
pairs with timestamps greater than tsmax.
Proof. Same as Lemma 1, but taking into consideration
that the new algorithm allows for one entry in the normal
prepare list, and another in the optimistic prepare list.
This only affects the proof of Theorem 1 in that the
number of operations by a faulty client after it stops be-
comes 2 instead of 1, and therefore the main theorem still
holds:
Theorem 2. The optimized BFT-BC algorithm is BFT-
linearizable.
7 Stronger Correctness Conditions
In this section we propose a stronger correctness condition
than the one introduced in Section 4, and discuss how to
extend the current protocol to meet the new condition.
7.1 New Correctness Conditions
The idea is that we want to further limit the effects a bad
client can have on the system once it has stopped, such that,
if correct clients execute a certain number of write opera-
tions after a faulty client stops, then no more operations by
that client will ever be “seen”. (We can generalize this to
any state-overwriting operation in a set Ooverwrite in case
the variable has an interface other than read and write.)
We formalize the BFT-linearizable+ condition as be-
ing equal to the condition presented in Section 4, except
that point (3) is modified to state that no operations by the
faulty client c appear after the kth state-overwriting oper-
ation in h2 (for some constant k).
7.2 Modified BFT-BC protocol
To meet this stronger correctness condition, we merely
need to modify the BFT-BC protocol to require the client
to submit a write proof in its prepare request (along with
the information it already sends in the prepare). This proof
certifies that the timestamp it is proposing is the successor
of one that corresponds to a write that has already hap-
pened; a replica will discard the request if this condition
doesn’t hold.
The client can easily assemble this proof in phase 1
if all responses to its phase 1 request contain the same
timestamp. If they don’t the client can obtain the proof
by doing a write-back; the difficulty is getting the value to
write back. This could be accomplished by having phase
1 actually be a read, but this is unattractive since values
can be large. So instead, the client fetches the value, if
necessary, in a separate step; effectively it redoes phase
1 as a normal read, although this can be optimized (e.g.,
to fetch from the f + 1 replicas that returned the largest
timestamps).
This scheme guarantees that the timestamp in the lurk-
ing write is the successor of a lower bound on the value
stored by at least f + 1 non-faulty replicas when the bad
client stopped. Consequently, if there were two consecu-
tive writes by correct clients after the bad client stopped,
the lurking write would no longer be seen.
8 Related Work
In this section we discuss the previous work that dealt with
bad clients, and also the previous work on correctness con-
ditions.
Our protocol is more efficient than those proposed pre-
viously. Furthermore it enforces stronger constraints on
the behavior of bad clients: we handle all problems han-
dled by previous protocols plus we limit the number of
lurking writes that a bad client can leave behind after it
has left the system.
In addition we provide stronger safety and liveness guar-
antees than previous protocols: in particular read opera-
tions cannot return null values, and reads terminate in a
constant number of rounds, independently of the behavior
of concurrent writers.
Our use of prepare proofs is what enables us to ensure
liveness, since this allows a client to complete a read using
information it received from just one replica. No previous
protocol uses this technique. As a result they can only en-
sure that a read will return the result of the latest write in
a constant number of rounds provided there are no con-
current writes. Yet concurrent writes are a real issue in a
system with Byzantine clients, since bad clients can use
this to prevent good clients from making progress: they
can either write continually, or they can write partially (not
complete their writes), and either of these causes problems
for readers.
The previous proposals fall into two categories: those
that ensure that at most one value can be associated with
a timestamp, and those that enforce non-skipping times-
tamps.
The initial work on Byzantine quorum systems [9] de-
scribed the quorum consensus protocol discussed in Sec-
tion 3, which used 3f + 1 replicas, one-phase reads, and
two-phase writes, and did not handle Byzantine clients. That
paper also described a protocol for a system of 4f + 1
replicas that prevented malicious clients from associating
different values with the same timestamp; the protocol re-
quired a three-phase communication among the replicas to
be carried out as part of performing a write (in addition to
the client-server communication), where servers guarantee
that each value and timestamp pair is propagated to a quo-
rum. The protocol provides liveness, but at the expense
of weak semantics for reads, which could
return a null value in case of concurrent writes.
The Phalanx system [10] improves on the previous re-
sult in two ways. First it added the write-back phase (to the
simpler protocol) as a way to ensure atomicity for reads.
In addition it presented a more efficient protocol to handle
Byzantine clients. That protocol used 4f + 1 replicas, but
clients carried out a three-phase protocol to do a write, and
the server-to-server communication was no longer needed.
In the new protocol for Byzantine clients, read operations
could return a null value if there was an incomplete or a
concurrent write.
The work by Goodson et al. [5] proposes a solution for
erasure-coded storage that tolerates Byzantine failures of
clients and servers; this system requires 4f + 1 replicas.
This work is focused on integrating erasure coded storage
with Byzantine quorum protocols and they tolerate Byzan-
tine clients that write different “values” to different repli-
cas by having the next reader detect the inconsistency and,
if possible, repair it. In some cases, reads may return a null
value.
The work of Martin et al. [12] proposes a protocol that
only uses 3f + 1 replicas (like our protocol). They re-
quire a quorum of 2f + 1 identical replies for read oper-
ations to succeed, which is difficult to ensure in an asyn-
chronous system. Their solution is to assume a reliable
asynchronous network model, where each message is de-
livered to all correct replicas. This means that infinite re-
transmission buffers are needed in an asynchronous envi-
ronment like the Internet: the failure of a single replica
(which might just have crashed) causes all messages from
that point on to be remembered and retransmitted. In this
protocol concurrent writers can slow down readers. The
authors discuss how to extend their protocols to guaran-
tee that Byzantine clients associate the same value with
the same timestamp by having servers forward the writes
among themselves (again, this is possible due to their net-
work model) and keep the highest value for each times-
tamp. They also discuss how to prevent a faulty client
from exhausting resources at correct servers, but at the ex-
pense of possibly sacrificing liveness for correct readers.
Two papers (Bazzi and Ding [1], Cachin and Tessaro [2])
describe protocols that enforce non-skipping timestamps.
In both cases, the protocols are based on the Martin et
al. work, which uses the reliable asynchronous network
model with the problems we mentioned above. The proto-
col of Bazzi and Ding [1] requires 4f + 1 replicas. Cachin
and Tessaro [2] require 3f + 1 replicas; an added prob-
lem in this system is the use of threshold signatures to
vouch for an increment on the current timestamp (since
these signatures are very expensive). They also employ an
expensive reliable broadcast mechanism for client-server
communication.
Our definition of BFT-linearizability presented in Sec-
tion 4 builds upon a condition (called Byznearizability)
proposed by Malkhi et al. [11]. This work was the first
to point out the problem of lurking writes. However, their
correctness condition is weaker than ours, since they re-
quire only that the number of lurking writes is finite, whereas
we require that the number of lurking writes is bounded
by a constant. In fact, their correctness condition is met
by the variant of the Phalanx protocol [10] that was de-
signed to handle only honest clients. Another point is that
this paper did not consider the use of secure co-processors
to prevent malicious clients from leaking their private key,
and therefore their condition requires all faulty clients and
servers to stop in order to provide any guarantees.
Our condition is also stronger in that we bound the
number of overwrites needed to mask all the lurking
writes (see Section 7).
9 Conclusions
This paper has presented novel protocols for Byzantine
quorum systems that provide atomic semantics despite Byzan-
tine faulty clients. Ours is the first protocol to fully address
the complete range of problems that can arise with Byzan-
tine clients. We prevent bad clients from leaving behind
more than one or two lurking writes and from exhausting
the timestamp space. In addition Byzantine clients are un-
able to interfere with good clients in the sense that they
cannot prevent good clients from completing reads and
writes, and they cannot cause good clients to see incon-
sistencies. Another point is that an extension of our pro-
tocol can additionally ensure that the effects of a Byzan-
tine client are no longer visible to good clients at all after
two successive writes by good clients (or four successive
writes in the optimized protocol).
We also presented strong correctness conditions that
address the above problems: we require that protocols
guarantee atomicity for good clients, and limit what can
be done on behalf of bad clients once they leave the sys-
tem. Furthermore we proved that our protocols are both
safe (they meet those conditions) and live.
Our protocols are designed to work in an asynchronous
system like the Internet and they are highly efficient. Our
base protocol completes writes in three network round-
trips; the optimized protocol reduces this cost so that writes
normally complete in two network round-trips (at the ex-
pense of allowing one more lurking write). In either case
reads normally complete in one phase, and require no more
than two phases, no matter what the bad clients are do-
ing. We achieve these efficiencies because of our use of
“proofs”, which allow clients or replicas to know that in-
formation presented to them is valid, without having to
hear the same thing directly from 2f + 1 replicas.
References
[1] R. Bazzi and Y. Ding. Non-skipping timestamps for byzantine
data storage systems. In Distributed Computing, 18th Interna-
tional Conference, DISC 2004, pages 405–419, Oct. 2004.
[2] C. Cachin and S. Tessaro. Optimal resilience for erasure-coded
byzantine distributed storage. Technical Report RZ 3575, IBM
Research, Feb. 2005.
[3] FIPS 180-1. Secure Hash Standard. U.S. Department of Com-
merce/NIST, National Technical Information Service, Spring-
field, VA, Apr. 1995.
[4] D. K. Gifford. Weighted voting for replicated data. In Proc. 7th
Symposium on Operating Systems Principles (SOSP), Dec. 1979.
[5] G. Goodson, J. Wylie, G. Ganger, and M. Reiter. Efficient
byzantine-tolerant erasure-coded storage. In Proc. of the Inter-
national Conference on Dependable Systems and Networks, June
2004.
[6] M. P. Herlihy and J. M. Wing. Axioms for Concurrent Objects.
In Conference Record of the Fourteenth Annual ACM Symposium
on Principles of Programming Languages, pages 13–26, Munich,
Germany, Jan. 1987.
[7] IBM. http://www.ibm.com/security/cryptocards/, 2005.
[8] L. Lamport, R. Shostak, and M. Pease. The Byzantine Generals
Problem. ACM Transactions on Programming Languages and
Systems, 4(3):382–401, July 1982.
[9] D. Malkhi and M. Reiter. Byzantine Quorum Systems. Journal
of Distributed Computing, 11(4):203–213, 1998.
[10] D. Malkhi and M. Reiter. Secure and scalable replication in pha-
lanx. In Proc. 17th SRDS, Oct. 1998.
[11] D. Malkhi, M. Reiter, and N. Lynch. A Correctness Condi-
tion for Memory Shared by Byzantine Processes. Unpublished
manuscript, Sept. 1998.
[12] J. Martin, L. Alvisi, and M. Dahlin. Minimal Byzantine stor-
age. Technical Report TR-02-38, University of Texas at Austin,
Department of Computer Sciences, Aug. 2002.
[13] R. Rivest, A. Shamir, and L. Adleman. A Method for Obtaining
Digital Signatures and Public-Key Cryptosystems. Communica-
tions of the ACM, 21(2):120–126, Feb. 1978.
[14] R. H. Thomas. A majority consensus approach to concurrency
control for multiple copy databases. ACM Transactions on
Database Systems, 4(2):180–209, June 1979.
