MIT-LCS-TR-994
Byzantine Clients Rendered Harmless

Barbara Liskov and Rodrigo Rodrigues
MIT CSAIL and INESC-ID
32 Vassar Street - Cambridge, MA 02139 - USA
Rua Alves Redol 9 - 1000 Lisboa - Portugal

Abstract

Byzantine quorum systems have been proposed that work properly even when up to f replicas fail arbitrarily. However, these systems are not so successful when confronted with Byzantine-faulty clients. This paper presents novel protocols that provide atomic semantics despite Byzantine clients. Our protocols are the first to handle all problems caused by Byzantine clients. They prevent Byzantine clients from interfering with good clients: bad clients cannot prevent good clients from completing reads and writes, and they cannot cause good clients to see inconsistencies. In addition, we prevent bad clients that have been removed from operation from leaving behind more than a bounded number of writes that could be done on their behalf by a colluder.

Our protocols are designed to work in an asynchronous system like the Internet and they are highly efficient. We require 3f + 1 replicas, and either two or three phases to do writes; reads normally complete in one phase and require no more than two phases, no matter what the bad clients are doing. We also present strong correctness conditions for systems with Byzantine clients that limit what can be done on behalf of bad clients once they leave the system. Furthermore, we prove that our protocols are both safe (they meet those conditions) and live.

1 Introduction

Quorum systems [4, 14] are valuable tools for building highly available replicated data services. A quorum system can be defined as a set of sets (called quorums) with certain intersection properties. These systems allow read and write operations to be performed only at a quorum of the servers, since the intersection properties ensure that any read operation will have access to the most recent value that was written.

The original work on quorum systems assumed that servers fail benignly, i.e., by crashing or omitting some steps. More recently, researchers have developed techniques that enable quorum systems to provide data availability even in the presence of arbitrary (Byzantine) faults [9]. Earlier work provides correct semantics despite server (i.e., replica) failures and also handles some of the problems of Byzantine clients [9, 10, 5, 1, 2, 11, 12].

This paper extends this earlier work in two important ways. First, it defines new protocols that handle all problems of Byzantine clients. In addition, our protocols are more efficient than previous proposals, in terms of both operation latency and number of replicas. Second, the paper states correctness conditions for such systems and proves that our protocols are correct. The correctness conditions are stronger than what has been stated previously [11] and what has been guaranteed by previous approaches.

Since a dishonest client can write garbage into the shared variable, it may seem there is little value in limiting what bad clients can do. But this is not the case, for two reasons. First, bad clients can cause a protocol to misbehave so that good clients are unable to perform operations (i.e., the protocol is no longer live) or observe incorrect behavior. For example, if the variable is write-once, a good client might observe that its state changes multiple times.

Second, bad clients can continue to interfere with good ones even after they have been removed from operation, e.g., by a system administrator who learns of the misbehavior. We would like to limit such interference so that, after only a limited number of writes by good clients, any lurking writes left behind by a bad client will no longer be visible to good clients. A lurking write is a modification launched by the bad client before it was removed from operation that will happen (possibly with help from an accomplice) after it has left the system. By limiting such writes we can ensure that the object becomes useful again after the departure of the bad client, e.g., some invariant that good clients preserve will hold.

Of course, it is not possible to prevent actions by a bad client, even if it has been shut down, if the private key it uses to prove that it is authorized to modify an object can be used by other nodes; thus, a bad client is in the system as long as any node knows its private key. If nodes can expose their private keys, we can make statements about behavior in the absence of a bad client only when all faulty nodes have been removed from operation [11], a condition that is not very useful in practice. A more practical approach is to make use of secure cryptographic coprocessors (like the IBM 4758 [7]) that allow signing without exposing the private key. Our correctness condition is stated in a way that allows either model of the lifetime of faulty nodes in the system.

Thus, we make the following contributions:

• We present strong correctness conditions for an atomic read/write object in a system in which both clients and replicas can be Byzantine. Our conditions ensure atomicity for good clients, and also limit the effects of bad clients that have been removed from operation: our conditions bound the number of lurking writes by a constant factor, and prevent lurking writes after bad clients stop and good clients subsequently overwrite the data.

• We present the first Byzantine quorum protocols that satisfy the conditions using only 3f + 1 replicas (to survive f faulty replicas) and work in an unreliable asynchronous network like the Internet. Furthermore, our protocols are efficient: writes require either 3 phases (our base protocol) or mostly 2 phases (our optimized protocol). Reads usually require 1 phase; they sometimes need an additional phase (to write back the data just read). Our base protocol ensures that there can be at most one lurking write after a bad client has left the system; the optimized protocol ensures slightly weaker behavior: there can be at most two lurking writes.

• We are able to manage with a small number of replicas and phases because our protocol makes use of “proofs”, an approach that we believe can be used to advantage in other protocols. A proof is a collection of 2f + 1 certificates from different replicas that vouch for some fact, e.g., that a client has completed its previous write, or that the state read from a single replica is valid.

• We prove the correctness of our protocols, both safety and liveness. In addition, we describe a variation of our protocols that supports a still stronger correctness condition, in which we can bound the number of writes of good clients that it takes to ensure that modifications of the bad client that has left the system will no longer be visible. This variation sometimes requires an additional phase to do a write.

The rest of this paper is organized as follows. We begin by describing our assumptions about the system. Section 3 describes our base protocol. Section 4 describes our correctness conditions, and we prove our base protocol meets them in Section 5. Section 6 describes our optimized protocol and proves its correctness. Section 7 describes a variation of our protocols that allows us to bound the number of overwrites needed to hide effects of lurking writes. Section 8 discusses related work, and we conclude in Section 9.

2 Model

The system consists of a set C = {c1, ..., cn} of client processes and a set S = {s1, ..., sn} of server processes. Client and server processes are classified as either correct or faulty. Correct processes are constrained to obey their specification, i.e., they follow the prescribed algorithms.
Faulty processes may deviate arbitrarily from their specification, i.e., we assume a Byzantine failure model [8]. Note that faulty processes include those that fail benignly, and those that are initially correct and fail at some point in the execution of the system. We refer to the set of faulty clients as Cbad and the set of correct clients as Cok, and consequently we have C = Cbad ∪ Cok (and respectively S = Sbad ∪ Sok). Note that our algorithms do not require knowledge of Cok, Cbad, Sok, or Sbad. In other words, we do not assume that we can detect faults.

We assume an asynchronous distributed system where nodes are connected by a network that may fail to deliver messages, delay them, duplicate them, corrupt them, or deliver them out of order, and there are no known bounds on message delays or on the time to execute operations. We assume the network is fully connected, i.e., given a node identifier, any other node can (attempt to) contact the first node directly by sending it a message. For liveness, we require that if a message is retransmitted repeatedly it will eventually be delivered. Note that we only require this for liveness; safety does not depend on any assumptions about message delivery.

We assume nodes can use unforgeable digital signatures to authenticate communication. More precisely, any node n can authenticate messages it sends by signing them. We denote a message m signed by n as ⟨m⟩σn. And (with high probability) no node can send ⟨m⟩σn (either directly or as part of another message) on the network for any value of m, unless it is repeating a message that has been sent before or it knows n's private key. We call the set of possible signatures Σ. These signatures can be verified with public keys in a set P.
We also assume the existence of a collision-resistant hash function, h, such that any node can compute a digest h(m) of a message m and (with high probability) it is impossible to find two distinct messages m and m′ such that h(m) = h(m′).

These assumptions are probabilistic, but there exist signature schemes (e.g., RSA [13]) and hash functions (e.g., SHA-1 [3]) for which they are believed to hold with very high probability. Therefore, we will assume they hold with probability one in the rest of the paper.

To avoid replay attacks we tag certain messages with nonces that are signed in the replies. We also assume that, with probability one, clients do not pick repeated nonces.

3 BFT-BC Algorithm

This section presents our construction for a read/write variable implemented using Byzantine quorum replication, and that tolerates Byzantine-faulty clients. We begin by giving a brief overview of Byzantine quorums in Section 3.1. Then we present our base protocol. This protocol requires 3 phases to write; in Section 6 we present an optimization that requires only 2 phases most of the time.

3.1 Byzantine Quorums Overview

This section gives an overview of how current algorithms use Byzantine quorums to implement a shared read/write variable. This presentation follows the original BQS protocol [9], using their construction for a system that doesn't handle Byzantine clients. (We discuss this system further in Section 8.)

A Byzantine quorum system defines a set of subsets of a replica group with certain intersection properties. A typical way to configure such a system is to use groups of 3f + 1 replicas to survive f failures, with quorums of size 2f + 1 replicas. This ensures that any two quorums intersect in at least one non-faulty replica. Each of the replicas maintains a copy of the data object, along with an associated timestamp and a client signature that authenticates the data and timestamp. Two phases are required to write the data.
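Before turning to the write phases, the replica-group sizing just described can be sanity-checked with a short sketch. This is an illustration, not part of the paper's protocol; the helper names are hypothetical.

```python
# With n = 3f + 1 replicas and quorums of size q = 2f + 1, any two
# quorums overlap in at least 2q - n = f + 1 replicas, so even if all
# f faulty replicas land in the overlap, one correct replica remains.

def min_quorum_overlap(f: int) -> int:
    n = 3 * f + 1          # total replicas
    q = 2 * f + 1          # quorum size
    return 2 * q - n       # worst-case intersection of two quorums

def min_correct_in_overlap(f: int) -> int:
    # Subtract the faulty replicas that could sit in the intersection.
    return min_quorum_overlap(f) - f

for f in range(1, 6):
    assert min_quorum_overlap(f) == f + 1
    assert min_correct_in_overlap(f) == 1
```

This is the arithmetic behind the claim that any two quorums intersect in at least one non-faulty replica.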
First, the client contacts a quorum to obtain the highest timestamp produced so far. The client then picks a timestamp higher than what was returned in the first phase, signs the new value and timestamp, and proceeds to the second phase, where the new value is stored at a quorum of replicas. Replicas allow write requests only from authorized clients. A replica overwrites what it has stored only if the timestamp in the request is greater than what it already has.

The read protocol usually has a single phase where the client queries a quorum of replicas and returns the value with the highest timestamp (provided the signature is correct). An extension of this protocol [10] requires a second phase that writes back the highest value read to a quorum of replicas (this ensures atomic semantics for reads).

The protocol just presented is not designed to handle Byzantine-faulty clients, which can cause damage to the system in several ways, e.g.:

1. Not follow the protocol by writing different values associated with the same timestamp.

2. Only carry out the protocol partially, e.g., install a modification at just one replica.

3. Choose a very large timestamp and exhaust the timestamp space.

4. Issue a large number of write requests and hand them off to a colluder who will run them after the bad client has been removed from the system. This colluder could be one of the replicas, or a completely separate machine.

To avoid these problems, we need to limit what bad clients can do. But we must also allow good clients to make progress, in spite of bad clients that might collude with one another and/or with bad replicas. Our protocol accomplishes these goals. It uses 3f + 1 replicas, and quorums can be any subset with 2f + 1 replicas. It uses a three-phase protocol to write, consisting of a read phase to obtain the most recent timestamp, a prepare phase in which a client announces its intention to write a particular timestamp and value, and a write phase in which the client does the write that it prepared previously.

As it moves from one phase to another, however, the client needs to “prove” that what it is doing is legitimate. A proof takes the form of a quorum of certificates from different replicas that vouch for some fact. For example, the purpose of the prepare phase is for the client to inform the replicas of what it intends to do, and this must be acceptable, e.g., the timestamp it proposes can't be too big. Replicas that approve the prepare request return certificates that together provide a “prepare proof”. This proof is needed to carry out the write phase: a client can only carry out a write that has been approved. If a write is allowed by a replica, it returns a certificate, and these certificates together form a “write proof”. This proof is needed for the client to do its next write: it cannot do a second write without completing its first one. This constraint, plus the fact that proofs cannot be forged or predicted in advance by bad clients, is what limits the number of lurking writes a bad client can leave behind when it stops (to be carried out by some node that colludes with it).

3.2 BFT-BC Protocol

We now describe our protocol in detail. As mentioned, each object in BFT-BC is replicated at a set of 3f + 1 replicas, numbered from 0 to 3f. Quorums can be any subset with 2f + 1 replicas, and we use a three-phase protocol to write. Each replica maintains the following per-object information:

• data, the value of the object.

• current-ts, the associated timestamp.

• prepare-proof, a valid prepare proof for the above timestamp and a hash of the value.

• prepare-list, a list containing the timestamp, hash, and proposing client of currently prepared writes.

• write-ts, the timestamp of the latest write known to have completed at 2f + 1 replicas.

In the remainder of this paper we simplify the presentation by considering a system containing only a single object, and therefore we omit object identifiers from the description of the protocol.
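The per-object state listed above can be sketched as a record; this is only an illustrative rendering with hypothetical Python types, since the paper does not prescribe any concrete representation.

```python
# A sketch of BFT-BC per-object replica state. Timestamps are modeled as
# (val, client_id) pairs and proofs as opaque certificate lists; the real
# protocol uses signed certificates from 2f + 1 distinct replicas.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class PrepareEntry:
    ts: tuple            # timestamp of a currently prepared write
    value_hash: bytes    # hash of the value being prepared
    client: int          # proposing client

@dataclass
class ReplicaObjectState:
    data: bytes = b""                       # value of the object
    current_ts: tuple = (0, 0)              # associated timestamp
    prepare_proof: Optional[list] = None    # prepare proof for current_ts/hash
    prepare_list: list = field(default_factory=list)  # PrepareEntry items
    write_ts: tuple = (0, 0)                # latest write known complete at 2f+1
```

Each replica holds one such record per object (here, a single object, matching the paper's simplification).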
3.2.1 Write Protocol

Our protocols require that different clients choose different timestamps, and therefore we construct timestamps by concatenating a sequence number with a client identifier: ts = ⟨ts.val, ts.id⟩. We assume that client identifiers are unique. To increment a timestamp, a client with identifier c uses the following function: succ(ts, c) = ⟨ts.val + 1, c⟩. Timestamps can be compared in the usual way, by comparing the val parts and, if these agree, comparing the client ids.

Client Processing. To write a data object, clients go through a three-phase protocol. In all phases, clients retransmit their requests to account for lost messages; they stop retransmitting once they collect a quorum of valid replies.

Phase 1. The client, c, sends a READ-TS request to all replicas; the request contains a nonce to avoid replays. The replies each include a prepare proof for the timestamp being returned. The client waits for a quorum of correctly authenticated replies. It selects the largest timestamp tmax it received, computes a new timestamp t′ = succ(tmax, c), and moves to phase 2.

Phase 2. The client sends PREPARE requests to all replicas. The request contains the new timestamp, the prepare proof obtained in phase 1, a hash of the value it intends to write, and a write proof (for the most recent write done by this client, or null if this is the first write done by this client), all authenticated by the client. The client waits for a quorum of PREPARE-ACKs, and then moves to phase 3; the 2f + 1 responses it received in phase 2 constitute a prepare proof for the write.

Phase 3. The client sends a WRITE request containing the prepare proof along with the new value, authenticated by the client, and waits for a quorum of valid replies. Each reply contains a signed statement vouching for the write for that timestamp and the replica id of the sender. A vector of 2f + 1 such statements constitutes a write proof for the current operation.
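The timestamp scheme at the start of this subsection can be captured in a few lines. This is a sketch with hypothetical Python types; the paper specifies only the ⟨val, id⟩ structure, succ, and the comparison rule.

```python
# Timestamps are (val, client_id) pairs. NamedTuples compare
# lexicographically, which matches the paper's rule: compare the val
# parts first, and if these agree, compare the client ids.
from typing import NamedTuple

class TS(NamedTuple):
    val: int
    cid: int   # client identifier, assumed unique per client

def succ(ts: TS, c: int) -> TS:
    # succ(ts, c) = <ts.val + 1, c>: bump the sequence number and
    # stamp it with the caller's id, so distinct clients never collide.
    return TS(ts.val + 1, c)

t0 = TS(0, 0)
t1 = succ(t0, 7)          # TS(1, 7)
t2 = succ(t0, 3)          # TS(1, 3)
assert t1 > t0 and t2 > t0
assert t1 > t2            # equal val: the higher client id wins
assert succ(t2, 7) > t1   # TS(2, 7) dominates on val
```

Because the client id occupies the low-order position, two correct clients that both extend the same tmax still produce distinct, totally ordered timestamps.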
Processing at Replicas.

Phase 1. Replicas reply to a READ-TS request by sending the prepare proof for their current timestamp and value, plus the nonce, all authenticated by the replica.

Phase 2. Phase 2 processing is the crucial part of the algorithm. The replicas check to ensure that the timestamp being proposed is reasonable, that the client is doing just one prepare, that the value being proposed doesn't differ from a previous request for the same timestamp, and that the client has completed its previous write.

The replica discards the message if it is invalid: if the prepare or write proof can't be validated, the timestamp doesn't have the client id in the low-order part, the proposed timestamp isn't the successor of the timestamp in the prepare proof, or the client isn't authorized to write the object.

Next, the replica uses the write proof to update its write-ts: it sets this to the larger of the timestamp in the write proof it just received and what it already has stored. Then it removes any entry from its prepare-list whose timestamp is less than or equal to the write-ts. If there is still an entry in the prepare-list for the client, the replica discards the request unless that entry matches what is being proposed (both the timestamp and the hash).

If the replica accepts the request, it updates the prepare-list by adding the timestamp and hash for this client (unless this information is already in the list). Then it returns a PREPARE-ACK message containing a signed certificate for the new timestamp and hash.

Phase 3. Replicas carry out the request only if it is valid: the prepare proof must be valid and the new value must match the hash in the proof. Replicas respond to valid requests by sending a WRITE-ACK containing a signed certificate for the new timestamp.
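The phase 2 checks just described can be condensed into a runnable sketch. This is a simplification, not the paper's implementation: proof validation is reduced to plain timestamp fields (real proofs carry 2f + 1 replica signatures), and the authorization check is omitted; all type and field names are hypothetical.

```python
# A condensed model of BFT-BC phase 2 processing at a single replica.
from dataclasses import dataclass, field

@dataclass
class Prepare:
    client: int
    new_ts: tuple      # proposed (val, client_id) timestamp
    prev_ts: tuple     # timestamp certified by the enclosed prepare proof
    write_ts: tuple    # timestamp certified by the client's write proof
    value_hash: bytes

@dataclass
class Replica:
    write_ts: tuple = (0, 0)
    prepare_list: dict = field(default_factory=dict)  # client -> (ts, hash)

    def handle_prepare(self, p: Prepare) -> bool:
        # The timestamp must carry the client id in the low-order part
        # and be the successor of the timestamp in the prepare proof.
        if p.new_ts[1] != p.client or p.new_ts[0] != p.prev_ts[0] + 1:
            return False
        # Advance write-ts from the write proof, then prune prepare-list
        # entries at or below it (those writes are known to be complete).
        self.write_ts = max(self.write_ts, p.write_ts)
        self.prepare_list = {c: e for c, e in self.prepare_list.items()
                             if e[0] > self.write_ts}
        # At most one outstanding prepare per client; a retry must match.
        entry = self.prepare_list.get(p.client)
        if entry is not None and entry != (p.new_ts, p.value_hash):
            return False
        self.prepare_list[p.client] = (p.new_ts, p.value_hash)
        return True   # in the real protocol: sign and send a PREPARE-ACK
```

The sketch shows the mechanism that bounds lurking writes: a client cannot get a second, different prepare accepted until its previous write is proven complete.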
In phase 3, the replica additionally modifies its state if the timestamp in the request is greater than its current timestamp; in this case it stores the provided information as the current value, timestamp, and associated prepare proof.

3.2.2 Read Protocol

The read protocol usually requires just one phase.

Phase 1. The client sends a READ request to all replicas; the request contains a nonce. A replica replies with its value, timestamp, prepare proof, and nonce, all signed by it. The client waits for a quorum of valid responses and chooses the one with the largest timestamp (this is the return value). If all the timestamps are the same, the read protocol ends.

Phase 2. Otherwise the client goes on to the write-back phase for the largest timestamp; this is identical to phase 3 when a client is doing a write, except that the client needs to send only to replicas that are behind, and it must wait only for enough responses to ensure that 2f + 1 replicas now have the new information.

3.3 Discussion

In the above description we mentioned that certain messages or statements were authenticated, but the kind of authentication to be used was left unspecified. This issue is important since different techniques have different costs: we can authenticate a point-to-point message by means of symmetric cryptography, by establishing session keys and using message authentication codes (MACs). This does not work for signing statements that have to be shown to other parties, in which case we need to rely on more expensive public key cryptography.

Our protocol requires signing with public key cryptography in two places: the phase 2 and phase 3 responses. These signatures are needed because they are used as proofs offered to third parties, e.g., the prepare proof is generated for one client but then used by a different client to justify its choice of the next timestamp. A further point is that only the phase 2 response signature needs to happen in the foreground.
The signature for the phase 3 response can be done in the background: a replica can do this after replying to the phase 2 request, so that it will have the signature ready when the phase 3 request arrives.

of single-object sequential histories for that object. A sequential history H is legal if each object subhistory H|x belongs to the sequential specification of x. An operation o in a history is a pair consisting of an invocation inv(o) and the next matching response rsp(o). A history H induces an irreflexive partial order <H on the operations and stop events in H as follows: o <H o′ if rsp(o) precedes inv(o′) in H.

1. No write proof can be assembled for any timestamp t > tsmax.

2. There are no two timestamps t1, t2 > tsmax such that client c assembled a prepare proof for t1 and t2.

3. No two prepare proofs exist for the same timestamp t > tsmax and different associated values.

Proof. 1. By algorithm construction, a non-faulty replica will not sign an entry in a write proof vouching for a timestamp higher than the one held in the variable current-ts. Since non-faulty replicas always store increasing timestamp values, this means that the number of signatures that can be held in the system at time tstop for timestamps higher than tsmax is at most 2f (i.e., f from faulty replicas and the f correct replicas that may hold timestamps higher than tsmax). Since a write proof requires 2f + 1 signatures, no such proof can be assembled.

2. By contradiction, suppose that client c held prepare proofs for t1, t2, both greater than tsmax. The two proofs intersect in at least one non-faulty replica. By part (1) of this lemma, that replica had its write-ts variable always at a value less than or equal to tsmax (at all times up to and including tstop). Therefore that replica could not have signed both proofs, since after signing the first one (say, for t1) it would insert an entry for client c and t1 in its prepare-list, and that entry would not be removed (because of the value of write-ts), which prevents the replica from signing the proof for t2.

3.
Suppose, by contradiction, that two prepare proofs exist for timestamp t and values v and v′. By the quorum intersection properties, these two prepare proofs contain at least one signature from the same correct replica. By part (1) of this lemma, no write proof was ever assembled for timestamps greater than or equal to tsmax, and these entries in the prepare list were never removed. This violates the constraint that correct replicas do not sign a timestamp that is already in their prepare list for a different value.

We are now ready to show the correctness of our algorithm.

Theorem 1. The BFT-BC algorithm is BFT-linearizable.

Proof. Consider any correct reader in the case that the writer is correct. In this case, the quorum initially accessed in a read operation intersects the quorum written in the most recently completed write operation in at least one correct replica. Therefore, the read returns either the value in the most recently completed write, or a value with a higher timestamp (which could be written concurrently with the read). Since a read also writes back its obtained value and timestamp to a quorum of processes, any subsequent read by a correct reader will return that timestamp value or a later one. So, for any execution, we construct the history needed to show BFT-linearizability by putting every read right after the write whose value it returns.

If the writer is faulty, we construct the sequential history to show BFT-linearizability as follows: for each read by a correct reader returning v such that the phase 3 request for v was produced by client cb, insert a write operation in the history that writes v (by client cb) immediately before the read. Insert a stop event before the invocation of the first operation that succeeded the stop event in the original (verifiable) history (i.e., as late as possible while preserving the