MIT-LCS-TR-994
Byzantine Clients Rendered Harmless
Barbara Liskov and Rodrigo Rodrigues
MIT CSAIL and INESC-ID
32 Vassar Street - Cambridge, MA 02139 - USA
Rua Alves Redol 9 - 1000 Lisboa - Portugal

Abstract
Byzantine quorum systems have been proposed that work properly even when up to f replicas fail arbitrarily.
However, these systems are not so successful when confronted with Byzantine faulty clients. This paper presents novel
protocols that provide atomic semantics despite Byzantine clients. Our protocols are the first to handle all problems
caused by Byzantine clients. They prevent Byzantine clients from interfering with good clients: bad clients cannot
prevent good clients from completing reads and writes, and they cannot cause good clients to see inconsistencies. In
addition, we prevent bad clients that have been removed from operation from leaving behind more than a bounded
number of writes that could be done on their behalf by a colluder.
Our protocols are designed to work in an asynchronous system like the Internet and they are highly efficient. We
require 3f + 1 replicas, and either two or three phases to do writes; reads normally complete in one phase and require
no more than two phases, no matter what the bad clients are doing.
We also present strong correctness conditions for systems with Byzantine clients that limit what can be done on
behalf of bad clients once they leave the system. Furthermore we prove that our protocols are both safe (they meet
those conditions) and live.

1 Introduction

Quorum systems [4, 14] are valuable tools for building highly available replicated data services. A quorum system can be defined as a set of sets (called quorums) with certain intersection properties. These systems allow read and write operations to be performed only at a quorum of the servers, since the intersection properties ensure that any read operation will have access to the most recent value that was written.

The original work on quorum systems assumed that servers fail benignly, i.e., by crashing or omitting some steps. More recently, researchers have developed techniques that enable quorum systems to provide data availability even in the presence of arbitrary (Byzantine) faults [9]. Earlier work provides correct semantics despite server (i.e., replica) failures and also handles some of the problems of Byzantine clients [9, 10, 5, 1, 2, 11, 12].

This paper extends this earlier work in two important ways. First, it defines new protocols that handle all problems of Byzantine clients. In addition, our protocols are more efficient than previous proposals, in terms of both operation latency and number of replicas. Second, the paper states correctness conditions for such systems and proves that the protocols are correct. The correctness conditions are stronger than what has been stated previously [11] and what has been guaranteed by previous approaches.

Since a dishonest client can write garbage into the shared variable, it may seem there is little value in limiting what bad clients can do. But this is not the case, for two reasons. First, bad clients can cause a protocol to misbehave so that good clients are unable to perform operations (i.e., the protocol is no longer live) or observe incorrect behavior. For example, if the variable is write-once, a good client might observe that its state changes multiple times.

Second, bad clients can continue to interfere with good ones even after they have been removed from operation, e.g., by a system administrator who learns of the misbehavior. We would like to limit such interference so that, after only a limited number of writes by good clients, any lurking writes left behind by a bad client will no longer be visible to good clients. A lurking write is a modification launched by the bad client before it was removed from operation that will happen (possibly with help from an accomplice) after it has left the system. By limiting such writes we can ensure that the object becomes useful again after the departure of the bad client, e.g., some invariant that good clients preserve will hold.

Of course, it is not possible to prevent actions by a bad client, even if it has been shut down, if the private key it uses to prove that it is authorized to modify an object can be used by other nodes; thus, a bad client is in the system as long as any node knows its private key. If nodes can expose their private keys, we can make statements about behavior in the absence of a bad client only when all faulty nodes have been removed from operation [11] – a condition that is not very useful in practice. A more practical approach is to make use of secure cryptographic coprocessors (like the IBM 4758 [7]) that allow signing without exposing the private key. Our correctness condition is stated in a way that allows either model about the lifetime of faulty nodes in the system.

Thus, we make the following contributions:

• We present strong correctness conditions for an atomic read/write object in a system in which both clients and replicas can be Byzantine. Our conditions ensure atomicity for good clients, and also limit the effects of bad clients that have been removed from operation: our conditions bound the number of lurking writes by a constant factor, and prevent lurking writes after bad clients stop and good clients subsequently overwrite the data.

• We present the first Byzantine quorum protocols that satisfy the conditions using only 3f + 1 replicas (to survive f faulty replicas) and work in an unreliable asynchronous network like the Internet. Furthermore our protocols are efficient: to do writes requires either 3 phases (our base protocol) or mostly 2 phases (our optimized protocol). Reads usually require 1 phase; they sometimes need an additional phase (to write back the data just read). Our base protocol ensures that there can be at most one lurking write after a bad client has left the system; the optimized protocol ensures slightly weaker behavior: there can be at most two lurking writes.

• We are able to manage with a small number of replicas and phases because our protocol makes use of "proofs", an approach that we believe can be used to advantage in other protocols. A proof is a collection of 2f + 1 certificates from different replicas that vouch for some fact, e.g., that a client has completed its previous write, or that the state read from a single replica is valid.

• We prove the correctness of our protocols, both safety and liveness. In addition we describe a variation of our protocols that supports a still stronger correctness condition, in which we can bound the number of writes of good clients that it takes to ensure that modifications of the bad client that has left the system will no longer be visible. This variation sometimes requires an additional phase to do a write.

The rest of this paper is organized as follows. We begin by describing our assumptions about the system. Section 3 describes our base protocol. Section 4 describes our correctness conditions and we prove our base protocol meets them in Section 5. Section 6 describes our optimized protocol and proves its correctness. Section 7 describes a variation of our protocols that allows us to bound the number of overwrites needed to hide effects of lurking writes. Section 8 discusses related work and we conclude in Section 9.

2 Model
The system consists of a set C = {c1 , ..., cn } of client
processes and a set S = {s1 , ..., sn } of server processes.
Client and server processes are classified as either correct
or faulty. Correct processes are constrained to obey their
specification, i.e., they follow the prescribed algorithms.
Faulty processes may deviate arbitrarily from their specification, i.e., we assume a Byzantine failure model [8].
Note that faulty processes include those that fail benignly,
and those that are initially correct and fail at some point in
the execution of the system.
We refer to the set of faulty clients as Cbad and the
set of correct clients as Cok, and consequently we have
C = Cbad ∪ Cok (and respectively S = Sbad ∪ Sok). Note
that our algorithms do not require knowledge of Cok,
Cbad, Sok, or Sbad. In other words, we do not assume
that we can detect faults.
We assume an asynchronous distributed system where
nodes are connected by a network that may fail to deliver
messages, delay them, duplicate them, corrupt them, or
deliver them out of order, and there are no known bounds
on message delays or on the time to execute operations.
We assume the network is fully connected, i.e., given a
node identifier, any other node can (attempt to) contact
the first node directly by sending it a message.
For liveness, we require that if a message is retransmitted repeatedly it will eventually be delivered. Note that we
only require this for liveness; safety does not depend on
any assumptions about message delivery.
We assume nodes can use unforgeable digital signatures to authenticate communication. More precisely, any
node, n, can authenticate messages it sends by signing
them. We denote a message m signed by n as ⟨m⟩σn.
And (with high probability) no node can send ⟨m⟩σn (either directly or as part of another message) on the network
for any value of m, unless it is repeating a message that
has been sent before or it knows n's private key. We call
the set of possible signatures Σ. These signatures can be
verified with public keys in a set P.
We also assume the existence of a collision-resistant
hash function, h, such that any node can compute a digest
h(m) of a message m and (with high probability) it is impossible to find two distinct messages m and m′ such that
h(m) = h(m′).
These assumptions are probabilistic but there exist signature schemes (e.g., RSA [13]) and hash functions (e.g.,
SHA-1 [3]) for which they are believed to hold with very
high probability. Therefore, we will assume they hold
with probability one in the rest of the paper.
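The hash assumption above can be illustrated concretely. The following is a minimal sketch (not from the paper) using Python's standard hashlib; it substitutes SHA-256 for the SHA-1 of the original text, since SHA-1's collision resistance has since been broken in practice:

```python
import hashlib

def digest(m: bytes) -> bytes:
    # Collision-resistant hash h: it is infeasible to find distinct
    # messages m and m' with h(m) == h(m').
    return hashlib.sha256(m).digest()

assert digest(b"value") == digest(b"value")    # deterministic
assert digest(b"value") != digest(b"value'")   # distinct inputs, distinct digests
assert len(digest(b"x")) == 32                 # 256-bit digest
```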
To avoid replay attacks we tag certain messages with
nonces that are signed in the replies. We also assume that,
with probability one, clients do not pick a repeated nonce.

3 BFT-BC Algorithm
This section presents our construction for a read/write variable implemented using Byzantine quorum replication that
tolerates Byzantine-faulty clients. We begin by giving a brief overview of Byzantine quorums in Section 3.1.
Then we present our base protocol. This protocol requires
3 phases to write; in Section 6 we present an optimization
that requires only 2 phases most of the time.

3.1 Byzantine Quorums Overview
This section gives an overview of how current algorithms
use Byzantine quorums to implement a shared read/write
variable. This presentation follows the original BQS protocol [9], using their construction for a system that doesn’t
handle Byzantine clients. (We discuss this system further
in Section 8.)
A Byzantine quorum system defines a set of subsets
of a replica group with certain intersection properties. A
typical way to configure such a system is to use groups of
3f + 1 replicas to survive f failures with quorums of size
2f + 1 replicas. This ensures that any two quorums intersect in at least one non-faulty replica. Each of the replicas
maintains a copy of the data object, along with an associated timestamp, and a client signature that authenticates
the data and timestamp.
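The intersection property follows from simple counting: two quorums of size 2f + 1 drawn from 3f + 1 replicas overlap in at least (2f + 1) + (2f + 1) − (3f + 1) = f + 1 replicas, and with at most f faulty replicas at least one of those is correct. A small sketch (not from the paper) checking this arithmetic:

```python
from itertools import combinations

def min_quorum_overlap(f: int) -> int:
    # Any two quorums of size 2f+1 out of 3f+1 replicas overlap in at
    # least (2f+1) + (2f+1) - (3f+1) = f+1 replicas.
    return 2 * (2 * f + 1) - (3 * f + 1)

for f in range(1, 5):
    assert min_quorum_overlap(f) == f + 1   # >= 1 correct replica in the overlap

# Exhaustive check for f = 1: every pair of size-3 quorums out of 4 replicas.
quorums = [set(q) for q in combinations(range(4), 3)]
assert min(len(a & b) for a in quorums for b in quorums) == 2
```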
Two phases are required to write the data. First, the

client contacts a quorum to obtain the highest timestamp
produced so far. The client then picks a timestamp higher
than what was returned in the first phase, signs the new
value and timestamp, and proceeds to the second phase
where the new value is stored at a quorum of replicas.
Replicas allow write requests only from authorized clients.
A replica overwrites what it has stored only if the timestamp in the request is greater than what it already has.
The read protocol usually has a single phase where the
client queries a quorum of replicas and returns the value
with the highest timestamp (provided the signature is correct). An extension of this protocol [10] requires a second
phase that writes back the highest value read to a quorum
of replicas (this ensures atomic semantics for reads).
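The two-phase write and one-phase read just described can be sketched as follows. This is a failure-free, sequential simulation with illustrative names, not the protocol's actual code; client signatures and Byzantine behavior are elided:

```python
class Replica:
    def __init__(self):
        self.ts, self.value = 0, None

    def read_ts(self):
        return self.ts

    def write(self, ts, value):
        # A replica overwrites its state only for a strictly larger timestamp.
        if ts > self.ts:
            self.ts, self.value = ts, value

def bqs_write(replicas, q, value):
    quorum = replicas[:q]
    # Phase 1: obtain the highest timestamp produced so far from a quorum.
    new_ts = max(r.read_ts() for r in quorum) + 1
    # Phase 2: store the (signed, in the real protocol) value at a quorum.
    for r in quorum:
        r.write(new_ts, value)

def bqs_read(replicas, q):
    # One phase: query a quorum, return the value with the highest timestamp.
    quorum = replicas[-q:]
    return max(quorum, key=lambda r: r.ts).value

f = 1
replicas = [Replica() for _ in range(3 * f + 1)]
bqs_write(replicas, 2 * f + 1, "v1")
assert bqs_read(replicas, 2 * f + 1) == "v1"   # quorums overlap in a correct replica
```

The two quorums chosen here are deliberately different subsets; the read still sees "v1" because any two quorums of size 2f + 1 intersect.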

The protocol just presented is not designed to handle Byzantine-faulty clients, which can cause damage to the system in several ways, e.g.:

1. Not follow the protocol by writing different values associated with the same timestamp.

2. Only carry out the protocol partially, e.g., install a modification at just one replica.

3. Choose a very large timestamp and exhaust the timestamp space.

4. Issue a large number of write requests and hand them off to a colluder who will run them after the bad client has been removed from the system. This colluder could be one of the replicas, or a completely separate machine.

To avoid these problems, we need to limit what bad clients can do. But we must also allow good clients to make progress, in spite of bad clients that might collude with one another and/or with bad replicas.

Our protocol accomplishes these goals. It uses 3f + 1 replicas, and quorums can be any subset with 2f + 1 replicas. It uses a three-phase protocol to write, consisting of a read phase to obtain the most recent timestamp, a prepare phase in which a client announces its intention to write a particular timestamp and value, and a write phase in which the client does the write that it prepared previously.

As it moves from one phase to another, however, the client needs to "prove" that what it is doing is legitimate. A proof takes the form of a quorum of certificates from different replicas that vouch for some fact.

For example, the purpose of the prepare phase is for the client to inform the replicas of what it intends to do, and this must be acceptable, e.g., the timestamp it proposes can't be too big. Replicas that approve the prepare request return certificates that together provide a "prepare proof". This proof is needed to carry out the write phase: a client can only carry out a write that has been approved. If a write is allowed by a replica it returns a certificate and these certificates together form a "write proof". This proof is needed for the client to do its next write: it cannot do a second write without completing its first one. This constraint, plus the fact that proofs cannot be forged or predicted in advance by bad clients, is what limits the number of lurking writes a bad client can leave behind when it stops (to be carried out by some node that colludes with it).

3.2 BFT-BC Protocol

We now describe our protocol in detail.

As mentioned, each object in BFT-BC is replicated at a set of 3f + 1 replicas, numbered from 0 to 3f. Quorums can be any subset with 2f + 1 replicas and we use a three-phase protocol to write.

Each replica maintains the following per-object information:

• data, the value of the object.
• current-ts, the associated timestamp.
• prepare-proof, a valid prepare proof for the above timestamp and a hash of the value.
• prepare-list, a list containing the timestamp, hash,
and proposing client of currently prepared writes.
• write-ts, the timestamp of the latest write known to
have completed at 2f + 1 replicas.
In the remainder of this paper we simplify the presentation by considering a system containing only a single
object, and therefore we omit object identifiers from the
description of the protocol.
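The per-object replica state listed above might be represented as follows. This is a sketch with assumed field types (tuples for timestamps, lists for proofs), not the authors' code:

```python
from dataclasses import dataclass, field

@dataclass
class PreparedEntry:
    ts: tuple          # proposed timestamp <val, client-id>
    value_hash: bytes  # hash of the value the client intends to write
    client: str        # proposing client

@dataclass
class ObjectState:
    data: bytes = b""                                  # value of the object
    current_ts: tuple = (0, "")                        # its associated timestamp
    prepare_proof: list = field(default_factory=list)  # proof for current_ts and hash
    prepare_list: list = field(default_factory=list)   # outstanding PreparedEntry records
    write_ts: tuple = (0, "")                          # latest write known complete at 2f+1 replicas

st = ObjectState()
st.prepare_list.append(PreparedEntry(ts=(1, "c1"), value_hash=b"h", client="c1"))
assert st.write_ts == (0, "") and st.prepare_list[0].client == "c1"
```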

3.2.1 Write Protocol
Our protocols require that different clients choose different timestamps, and therefore we construct timestamps by
concatenating a sequence number with a client identifier:
ts = ⟨ts.val, ts.id⟩. We assume that client identifiers are
unique. To increment a timestamp, a client with identifier c
uses the following function: succ(ts, c) = ⟨ts.val + 1, c⟩.

Timestamps can be compared in the usual way, by comparing the val parts and if these agree, comparing the
client ids.
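Since Python tuples compare lexicographically, timestamps of the form ⟨val, id⟩ can be modeled directly. A minimal sketch, assuming client identifiers are comparable strings:

```python
def succ(ts, c):
    # succ(ts, c) = <ts.val + 1, c>: bump the sequence number, tag with client id.
    val, _ = ts
    return (val + 1, c)

t0 = (0, "a")
assert succ(t0, "b") == (1, "b")
# Lexicographic tuple comparison matches the rule above: compare val parts,
# and break ties on client ids. Since ids are unique, distinct clients can
# never produce equal timestamps.
assert (1, "b") < (1, "c") < (2, "a")
assert succ(t0, "b") != succ(t0, "c")
```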
Client Processing
To write a data object, clients go through a three-phase
protocol. In all phases, clients retransmit their requests to
account for lost messages; they stop retransmitting once
they collect a quorum of valid replies.
Phase 1. The client, c, sends a READ-TS request to all
replicas; the request contains a nonce to avoid replays.
The replies each include a prepare proof for the timestamp
being returned. The client waits for a quorum of correctly
authenticated replies. It selects the largest timestamp tmax
it received, computes a new timestamp t′ = succ(tmax, c),
and moves to phase 2.
Phase 2. The client sends PREPARE requests to all replicas. The request contains the new timestamp, the prepare
proof obtained in phase 1, a hash of the value it intends
to write, and a write proof (for the most recent write done
by this client, or null if this is the first write done by this
client), all authenticated by the client. The client waits for
a quorum of PREPARE-ACKs, and then moves to phase 3;
the 2f + 1 responses it received in phase 2 constitute a
prepare proof for the write.
Phase 3. The client sends a WRITE request, authenticated by the client, containing the
prepare proof along with the new value, and waits for a
quorum of valid replies.
Each reply contains a signed statement vouching for
the write for that timestamp and the replica id of the sender.
A vector of 2f + 1 such statements constitutes a write
proof for the current operation.
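The client side of the three-phase write can be sketched as follows. This is a toy, failure-free simulation with illustrative names: messages become plain method calls, signatures are elided, and the stub replicas approve everything (real replicas perform the validation described below):

```python
class StubReplica:
    """Toy replica that approves everything; real replicas validate and sign."""
    def __init__(self, i):
        self.i, self.ts, self.value = i, (0, ""), None

    def read_ts(self):
        return self.ts, []                      # (current-ts, its prepare proof)

    def prepare(self, c, ts, h, write_proof):
        return ("prepare-ack", self.i, ts, h)   # certificate (signature elided)

    def write(self, ts, value, prepare_proof):
        if ts > self.ts:
            self.ts, self.value = ts, value
        return ("write-ack", self.i, ts)        # certificate (signature elided)

def client_write(c, replicas, f, value, write_proof):
    q = 2 * f + 1
    # Phase 1: READ-TS -- wait for a quorum of replies, take the largest ts.
    replies = [r.read_ts() for r in replicas][:q]
    t_max = max(ts for ts, _proof in replies)
    t_new = (t_max[0] + 1, c)                   # succ(t_max, c)

    # Phase 2: PREPARE -- announce (ts, hash) plus the proof of this client's
    # last write; a quorum of PREPARE-ACKs forms the prepare proof.
    h = hash(value)
    prepare_proof = [r.prepare(c, t_new, h, write_proof) for r in replicas][:q]

    # Phase 3: WRITE -- the prepare proof authorizes storing the value;
    # a quorum of WRITE-ACKs forms the write proof for the next write.
    new_proof = [r.write(t_new, value, prepare_proof) for r in replicas][:q]
    return t_new, new_proof

f = 1
replicas = [StubReplica(i) for i in range(3 * f + 1)]
ts, proof = client_write("c1", replicas, f, b"v", [])
assert ts == (1, "c1") and len(proof) == 2 * f + 1
```

In the real protocol the client sends to all replicas and waits for the first 2f + 1 valid replies; truncating the reply list stands in for that here.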
Processing at replicas
Phase 1. Replicas reply to a READ-TS request by sending
the prepare proof for their current timestamp and value,
plus the nonce, all authenticated by the replica.
Phase 2. Phase 2 processing is the crucial part of the algorithm. The replicas check to ensure that the timestamp
being proposed is reasonable, that the client is doing just
one prepare, that the value being proposed doesn’t differ
from a previous request for the same timestamp, and that
the client has completed its previous write.
The replica discards the message if it is invalid: if the
prepare or write proof can’t be validated, the timestamp
doesn’t have the client id in the low-order part, the proposed timestamp isn’t the successor of the timestamp in
the prepare proof, or the client isn’t authorized to write
the object.
Next, the replica uses the write proof to update its write-ts:
it sets this to the larger of the timestamp in the write proof

it just received, and what it already has stored. Then it removes any entry from its prepare-list whose timestamp is
less than or equal to the write-ts.
If there is still an entry in the prepare-list for the client,
the replica discards the request unless that entry matches
what is being proposed (both the timestamp and the hash).
If the replica accepts the request it updates the prepare-list by adding the timestamp and hash for this client (unless this information is already in the list). Then it returns
a PREPARE-ACK message containing a signed certificate
for the new timestamp and hash.
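The phase 2 validation just described can be sketched as a standalone check. Proof verification is reduced to placeholder predicates (`valid_proof`, `proof_ts` are stand-ins, not the paper's definitions); this is a sketch of the rules, not the authors' code:

```python
from types import SimpleNamespace

def valid_proof(p):
    # Placeholder: a real replica verifies 2f+1 replica signatures here
    # (a null write proof is allowed for a client's first write).
    return True

def proof_ts(p):
    # Placeholder: the timestamp a proof vouches for (null proof -> initial ts).
    return p["ts"] if p else (0, "")

def phase2_accept(state, c, t_new, h, prepare_proof, write_proof):
    """Return a PREPARE-ACK certificate, or None if the request is discarded."""
    # Discard invalid requests: bad proofs, timestamp not tagged with the
    # client id in the low-order part, or not the successor of the prepared ts.
    if not (valid_proof(prepare_proof) and valid_proof(write_proof)):
        return None
    if t_new != (proof_ts(prepare_proof)[0] + 1, c):
        return None

    # Use the write proof to advance write-ts, then drop prepare-list entries
    # with timestamps less than or equal to write-ts.
    state.write_ts = max(state.write_ts, proof_ts(write_proof))
    state.prepare_list = {cl: e for cl, e in state.prepare_list.items()
                          if e[0] > state.write_ts}

    # A client may have one outstanding prepare; a retry must match it exactly.
    if c in state.prepare_list and state.prepare_list[c] != (t_new, h):
        return None
    state.prepare_list[c] = (t_new, h)
    return ("prepare-ack", t_new, h)   # signed by the replica in the real protocol

st = SimpleNamespace(write_ts=(0, ""), prepare_list={})
assert phase2_accept(st, "c1", (1, "c1"), b"h1", {"ts": (0, "")}, None) is not None
# Same client, a different prepare while the first is outstanding: refused.
assert phase2_accept(st, "c1", (2, "c1"), b"h2", {"ts": (1, "c1")}, None) is None
```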
Phase 3. Replicas carry out the request only if it is valid:
the prepare proof must be valid and the new value must
match the hash in the proof. Replicas respond to valid requests by sending a WRITE - ACK containing a signed certificate for the new timestamp. The replica additionally
modifies its state if the timestamp in the request is greater
than its current timestamp; in this case it stores the provided information as the current value, timestamp, and associated prepare proof.

3.2.2 Read Protocol
The read protocol usually requires just one phase.
Phase 1. The client sends a READ request to all replicas;
the request contains a nonce. A replica replies with its
value, timestamp, prepare proof, and nonce all signed by
it. The client waits for a quorum of valid responses and
chooses the one with the largest timestamp (this is the return value). If all the timestamps are the same the read
protocol ends.
Phase 2. Otherwise the client goes on to the write-back
phase for the largest timestamp; this is identical to phase 3
when a client is doing a write, except that the client needs
to send only to replicas that are behind, and it must wait
only for enough responses to ensure that 2f + 1 replicas
now have the new information.
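The read path can be sketched as follows, reusing the quorum idea. Again a toy, failure-free version with assumed names and signature checks elided:

```python
from types import SimpleNamespace

def client_read(replicas, f):
    q = 2 * f + 1
    # Phase 1: collect a quorum of signed (timestamp, value) replies and
    # pick the reply with the largest timestamp.
    replies = [(r.ts, r.value) for r in replicas][:q]
    best_ts, best_value = max(replies, key=lambda reply: reply[0])
    # Phase 2: write back only if some queried replica is behind, so that
    # 2f+1 replicas are guaranteed to hold the returned value.
    if any(ts < best_ts for ts, _ in replies):
        for r in replicas:                      # in practice, only the laggards
            if r.ts < best_ts:
                r.ts, r.value = best_ts, best_value
    return best_value

f = 1
fresh = lambda: SimpleNamespace(ts=(2, "c"), value="v2")
stale = lambda: SimpleNamespace(ts=(1, "c"), value="v1")
replicas = [fresh(), fresh(), stale(), stale()]
assert client_read(replicas, f) == "v2"
assert all(r.ts == (2, "c") for r in replicas)  # laggards were written back
```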

3.3 Discussion
In the above description we mentioned that certain messages or statements were authenticated, but the kind of
authentication that may be used was unspecified. This
issue is important since different techniques have different costs: we can authenticate a point-to-point message by
means of symmetric cryptography by establishing session
keys and using message authentication codes (MACs). This
does not work for signing statements that have to be shown
to other parties, in which case we need to rely on more expensive public key cryptography.

Our protocol requires signing using public key cryptography in two places: the phase 2 and phase 3 responses.
These signatures are needed because they are used as proofs
offered to third parties, e.g., the prepare proof is generated
for one client but then used by a different client to justify
its choice of the next timestamp.
A further point is that only the phase 2 response signature needs to happen in the foreground. The signature
for the phase 3 response can be done in the background:
a replica can do this after replying to the phase 2 request,
so that it will have the signature ready when the phase 3
request arrives.

of single-object sequential histories for that object. A sequential history H is legal if each object subhistory H|x
belongs to the sequential specification of x.
An operation o in a history is a pair consisting of an
invocation inv(o) and the next matching response rsp(o).
A history H induces an irreflexive partial order <H on
the operations and stop events in H.

1. No write proof is ever assembled for a timestamp t > tsmax.

2. There are no two timestamps t1, t2 > tsmax such that
client c assembled a prepare proof for t1 and t2.

3. No two prepare proofs exist for the same timestamp
t > tsmax and different associated values.
Proof.

1. By algorithm construction, a non-faulty replica will not
sign an entry in a write proof vouching for a timestamp higher than the one held in the variable current-ts.
Since non-faulty replicas always store increasing timestamp values, this means that the number of signatures
that can be held in the system at time tstop for timestamps higher than tsmax is at most 2f (i.e., f from
faulty replicas and the f correct replicas that may hold
timestamps higher than tsmax).
2. By contradiction, suppose that client c held prepare
proofs for t1, t2, both greater than tsmax. The two
proofs intersect in at least one non-faulty replica. By
part (1) of this lemma, that replica had its write-ts variable always at a value less than or equal to tsmax (at
all times up to and including tstop). Therefore that
replica could not have signed both proofs, since after
signing the first one (say, for t1) it would insert an
entry for client c and t1 in its prepare-list, and that
entry would not be removed (because of the value of
write-ts), which prevents the replica from signing the
proof for t2.
3. Suppose, by contradiction, that two prepare proofs exist
for timestamp t and values v and v′. By the quorum intersection properties, these two prepare proofs contain
at least one signature from the same correct replica. By
part (1) of this lemma, no write proof was ever assembled for timestamps greater than or equal to tsmax, and
therefore the corresponding entries in the prepare-list were never removed.
This violates the constraint that correct replicas do not
sign a timestamp that is already in their prepare-list for a
different value.

We are now ready to show the correctness of our algorithm.
Theorem 1. The BFT-BC algorithm is BFT-linearizable.
Proof. Consider any correct reader in the case that the
writer is correct. In this case, the quorum initially accessed in a read operation intersects the quorum written
in the most recently completed write operation in at least
one correct replica. Therefore, the read returns either the
value in the most recently completed write, or a value with
a higher timestamp (which could be written concurrently
with the read). Since a read also writes back its obtained
value and timestamp to a quorum of processes, any subsequent read by a correct reader will return that timestamp
value or a later one. So, for any execution, we construct
the history needed to show BFT-linearizability by putting
every read right after the write whose value it returns.

If the writer is faulty, we construct the sequential history to show BFT-linearizability as follows: for each read
by a correct reader returning v such that the phase 3 request for v was produced by client cb, insert a write operation in the history that writes v (by client cb) immediately
before the read.
Insert a stop event before the invocation of the first operation that succeeded the stop event in the original (verifiable) history (i.e., as late as possible while preserving the
