Information Storage in a Decentralized Computer System

by David K. Gifford

CSL-81-8    June 1981; Revised March 1982

Abstract: See page x.

This report is a slightly revised version of a dissertation submitted to the Department of Electrical Engineering and the Committee on Graduate Studies of Stanford University in partial fulfillment of the requirements for the degree of Doctor of Philosophy.

CR categories: 4.33, 4.35, 3.73

Key words and phrases: file system, information storage, reference, transaction, location, replication, suite, representative, quorum, reconfiguration, protection, cryptographic sealing, key, seal, unseal.

XEROX
Xerox Corporation
Palo Alto Research Centers
3333 Coyote Hill Road
Palo Alto, California 94304

© Copyright 1981, 1982 by David K. Gifford. All Rights Reserved.

Contents

List of Figures
List of Tables
Preface
Abstract

1. Method and Principles
   1.1 Introduction
   1.2 Method
   1.3 System Description
   1.4 Background
   1.5 Architectural Principles
   1.6 Summary

2. Environment
   2.1 Hardware Components
       2.1.1 Processors
       2.1.2 Communication
       2.1.3 Storage Devices
   2.2 Stable Storage
   2.3 Unique Identifiers
   2.4 Reliable Remote Evaluation
       2.4.1 Model
       2.4.2 Algorithm
   2.5 Summary

3. Transactional Storage
   3.1 Model
       3.1.1 References
       3.1.2 Transactions
       3.1.3 Volumes
       3.1.4 Files
       3.1.5 Mutable and Immutable Objects
   3.2 Basic Algorithms
       3.2.1 Single Processor Case
       3.2.2 Decentralized Case
       3.2.3 Reed's Algorithm
       3.2.4 Located References
   3.3 Refinements
       3.3.1 Lock Compatibility
       3.3.2 Lock Granularity
       3.3.3 Lower Degrees of Consistency
       3.3.4 Broken Read Locks
       3.3.5 Releasing Read Locks
       3.3.6 Deadlock
       3.3.7 Checkpoints and Save Points
       3.3.8 Nested Transactions
       3.3.9 Stable Storage Failure
   3.4 Summary

4.
Location
   4.1 Indexes
   4.2 Indirection
       4.2.1 Model
       4.2.2 Basic Algorithm
   4.3 Indirect References
   4.4 Object Location
   4.5 Choice References
   4.6 Summary

5. Replication
   5.1 Suites
   5.2 Basic Algorithm
   5.3 Correctness Argument
   5.4 Refinements
       5.4.1 Weak Representatives
       5.4.2 Lower Degrees of Consistency
       5.4.3 Representative Performance
       5.4.4 Expressive Power of Suites
       5.4.5 Size of Replicated Objects
       5.4.6 Total Replacement
       5.4.7 Simultaneous Suite Update
       5.4.8 Releasing Read Locks
       5.4.9 Updating Representatives in Background
       5.4.10 Guessing a Representative is Current
       5.4.11 Postponed Creation of File Representatives
       5.4.12 An Alternative for Replicated Volumes
   5.5 Related Work
   5.6 Summary

6. Reconfiguration
   6.1 Reconfigurable Objects
   6.2 Basic Algorithm
   6.3 Refinements
       6.3.1 Reducing Data Movement
       6.3.2 Eliminating the Volume Index
   6.4 Summary

7. Protection
   7.1 Preliminaries
       7.1.1 Framework
       7.1.2 Environment
           7.1.2.1 Cryptography
           7.1.2.2 Threshold Scheme
           7.1.2.3 Checksums
   7.2 Cryptographic Sealing
       7.2.1 Model
       7.2.2 Basic Algorithm
       7.2.3 Strength
           7.2.3.1 Correctness Argument
           7.2.3.2 Susceptibility to Cryptanalysis
   7.3 Applications
       7.3.1 Privilege Establishment
           7.3.1.1 Key Rings
           7.3.1.2 Encrypted Objects
           7.3.1.3 Guarded Objects
           7.3.1.4 Protected Volumes
       7.3.2 Common Protection Mechanisms
           7.3.2.1 Capabilities
           7.3.2.2 Access Control Lists
           7.3.2.3 Information Flow Control
       7.3.3 Secure Processors
           7.3.3.1 Limiting Remote Evaluation
           7.3.3.2 Secure Channels
       7.3.4 Revocation
           7.3.4.1 Protected Indirection
           7.3.4.2 Revocation Algorithm
   7.4 Practical Considerations
       7.4.1 Changing Protection Controls
       7.4.2 Authentication in the Large
       7.4.3 Performance
       7.4.4 Comments
       7.4.5 Elimination of Authentication
   7.5 Comparative Analysis
   7.6 Summary

8.
Practical Considerations
   8.1 Implementation
       8.1.1 Prototypes
       8.1.2 Full-Scale Implementations
   8.2 Configuration
       8.2.1 Static Configuration
       8.2.2 Dynamic Configuration
   8.3 Summary

9. Summary of Ideas

Appendix A: Exposition Language: EL
   A.1 Language Extensions
   A.2 Classes
   A.3 Records
   A.4 External Representation
   A.5 Exception Handling
   A.6 Processes
   A.7 Synchronization
   A.8 Sets
   A.9 Byte Arrays
   A.10 Miscellaneous

Appendix B: Storage System Summary
Appendix C: The Open Function
Bibliography
Index

Figures

Figure 2.1 - Structure of a Processor Class
Figure 3.1 - Structure of a Located Class
Figure 4.1 - An Indirect Record
Figure 4.2 - Structure of an Indirect Class
Figure 5.1 - Structure of a Suite Class
Figure 6.1 - Structure of a Reconfigurable File or Index
Figure 6.2 - Structure of a Reconfigurable Volume
Figure 6.3 - File and Index Reconfiguration Algorithm
Figure 6.4 - Step 1 of Volume Reconfiguration: Move Files
Figure 6.5 - Step 2 of Volume Reconfiguration: Move File Index
Figure 6.6 - Step 3 of Volume Reconfiguration: Change Volume Pointer
Figure 7.1 - A Passive Protection Mechanism
Figure 7.2 - An Active Protection Mechanism
Figure 7.3 - Example Sealed Objects
Figure 7.4 - Protection System Internal Structure
Figure 7.5 - Capability Reference
Figure 7.6 - Object Reference: Access Control List
Figure 7.7 - Structure of a Secure Located Class
Figure 7.8 - A Revocable Capability
Figure 8.1 - A Basic Storage Service
Figure 8.2 - Client Defined Protected Volumes

Tables

Table 3.1 - Typical Lock Compatibility Matrix
Table 3.2 - Lock Compatibility Matrix with Intention Locks
Table 5.1 - Sample Suite Configurations

Preface

This paper is organized so that it can be used in several ways.
It can of course be read in its entirety for a treatment of the general problem of decentralized information storage. It is also possible to read an isolated chapter for a treatment of a single topic, such as protection. Each chapter begins with an outline of its organization and ends with a summary of its contents. If the reader wishes to look at a chapter in isolation I would also suggest that they read Sections 1.3 and 3.1. There is a comprehensive index at the end of the paper. The index is useful to locate references to a specific concept. It also indexes the program text, and thus should help a reader find the definition of a function or a type. I have tried to present enough detail so a reader can easily implement the ideas I discuss. However, it is possible to understand the essence of the ideas without reading program text. On a first reading, a casual reader should probably skip sections titled implementation or refinement.

I would like to thank my patient advisers, Dr. Butler Lampson and Prof. Susan Owicki, for their help and good advice over the past four years. In addition to my advisers, Peter Deutsch, Jim Gray, Martin Haeberli, Prof. Larry Manning, Gene McDaniel, Alfred Spector, Larry Stewart, and Howard Sturgis took the time to carefully read chapters of this paper and suggest improvements. I would also like to thank Bob Taylor for making it possible for me to do this work at the Xerox Palo Alto Research Center. The work benefited greatly from daily interactions with colleagues at Xerox. The Fannie and John Hertz Fellowship that I held while a graduate student gave me the freedom to pursue this research. The research was supported in part by the Fannie and John Hertz Foundation, and by the Xerox Corporation.

This work is dedicated to my parents.

Abstract

This paper describes an architecture for shared information storage in a decentralized computer system.
The issues that are addressed include: naming of files and other objects (naming), reliable storage of data (stable storage), coordinated access to shared storage (transactional storage), location of objects (location), use of multiple copies to increase performance, reliability and availability (replication), dynamic modification of object representations (reconfiguration), and storage security and authentication (protection). A complete model of the architecture is presented, which describes the interface to the facilities provided, and describes in detail the proposed mechanisms for implementing them. The model presents new approaches to naming, location, replication, reconfiguration, and protection. To verify the model, three prototypes were constructed, and experience with these prototypes is discussed.

The model names objects with variable length byte arrays called references. References may contain location information, protection guards, cryptographic keys, and other references. In addition, references can be made indirect to delay their binding to a specific object or location.

The replication mechanism is based on assigning votes to each copy of a replicated object. The characteristics of a replicated object can be chosen from a range of possibilities by appropriately choosing its voting configuration. Temporary copies can be easily implemented by introducing copies with no votes.

The reconfiguration mechanism allows the storage that is used to implement an object to change while the system is operating. A client need not be aware that an object has been reconfigured.

The protection mechanism is based on the idea of sealing an object with a key. Sealed objects can only be unsealed with an appropriate set of keys. Complex protection structures can be created by using such operators as Key-Or and Key-And.
The protection mechanism can be employed to create popular protection mechanisms such as capabilities, access control lists, and information flow control.

Chapter 1: Method and Principles

1.1 Introduction

Communication is an essential part of our basic need to cooperate and share with one another. We have been given the freedom to have distant friends and increased knowledge about our world by advances in communication technology such as the post office, the telegraph, and the telephone. Computer systems promise to be another such advance. Large scale community information systems are likely to play a major role in our future ability to create, organize, process, store, and share information.

It was once thought that the problem of building large computer systems was that of building large computers. It is now clear that this is not the case. Instead of employing a single computer, future large scale computer systems will be composed of thousands, or even millions, of computers. The goal of this research is to demonstrate an information storage system architecture that can be used to integrate a collection of computers. By integrate we mean that the architecture permits information to be easily exchanged between the users of the system. We are not suggesting that information storage be centralized. Rather, we propose to organize the storage facilities that would normally exist in a collection of computers into a single decentralized system.

The information storage architecture we present is intended to create a foundation for shared applications. Example applications include office information systems, programming environments, and data base systems. The principles of our architecture are best characterized by the desire to provide fundamental storage facilities that can be flexibly adapted to a wide range of uses.
1.2 Method

Computer system research is more than the invention of new algorithms; part of the work lies in the synthesis of a collection of ideas into a single package. Furthermore, it is important that a synthesis be faithful to a single set of coordinated architectural principles. Conceptual integrity keeps complex interactions from making the system intractable as it increases in size and function. A clear statement of principal design decisions is central to the overall success of a large system. The importance of these ideas has been demonstrated by [Belady and Lehman 77] and [Brooks 75].

We suggest a four stage process for system creation that is intended to promote these concepts. The idea of the process is to emphasize the importance of asking fundamental questions early in the life of a system, and to postpone secondary decisions. In order, the four stages are:

1. Define the system's architectural principles. The architectural principles of a system are a set of primary design decisions that consider technical feasibility [Liddle 76]. These decisions serve to define and delimit the scope of a system. Furthermore, they allow for orderly growth by providing a single conceptual framework that can accommodate extensions in system size and function. For example, what is the nature of the system? Is it intended to provide a general purpose computing environment? Or is it going to be used exclusively for electronic mail? How large will the system be? Are different instances of the system going to be interconnected? What are the capacity and reliability requirements of the system? What are the services that the system will offer? How will these services be presented to a client? What are the requirements for accounting? For protection?

2. Formulate a system model. A system model is a design for the system in line with its architectural principles.
A model describes the system's interfaces and mechanisms in enough detail that it is possible to reason about the correctness of key algorithms. When the system is constructed, the system model is used as a pattern.

3. Implement the system. A system implementation is a concrete set of hardware and software components that realize a system model. The implementation of a system normally starts by making a plan for its construction, testing, and documentation. Naturally, there can be several implementations of a system model. This is important, as over the life of a system new implementations of parts of a system will cause new and old components to coexist.

4. Plan a system configuration. A system configuration is an installed set of components from a system implementation. A system implementation represents a wide spectrum of capacity, reliability, availability, and performance possibilities; a configuration reflects the decisions made to meet specific needs.

Our research has been organized according to these stages. The remainder of this chapter treats the architectural principles of our system and their background. We discuss how time-sharing systems, personal computing systems, and computer networks have influenced our goals. The next six chapters describe our system model by successive refinement. Chapter 2 defines the environment of the system model. Chapter 3 presents a simple storage system. This storage system would be ideal if one made the following assumptions:

1. Files, volumes, and other objects never move.
2. It is never necessary to improve the reliability, availability, or performance characteristics of storage devices.
3. It is never necessary to change the storage that is used to store an object.
4. People are perfectly trustworthy and there is no need for protection.

Chapters 4 through 7 remove these assumptions.
We add location (Chapter 4), replication (Chapter 5), reconfiguration (Chapter 6), and protection (Chapter 7) to describe a practical system. Chapter 8 discusses system implementation and configuration. Three prototypes were built to test the validity of our system model, and experience with these prototypes is discussed. In view of these prototypes, we outline our expectations about full scale implementations of the system. System configuration is discussed briefly, but it is not thoroughly explored. Chapter 9 concludes the paper with a summary of the major ideas introduced in the paper and a review of how we have achieved the architectural principles we set forth.

1.3 System Description

This paper includes a large amount of program text. The intent of the program text is to provide the reader with a detailed understanding of the system model. Considered collectively, the program text is an implementation of the system model. The program text is also intended to be used as a pattern for full-scale implementations. Thus, the code that we present is called a model implementation. The functions and types of the model implementation are indexed at the back of the paper.

The system model is described in EL ("Exposition Language"), a dialect of Lisp 1.5 [McCarthy et al. 62]. Lisp was chosen because it includes an Eval function, which allowed us to define the semantics of operations executed at remote processors. In addition, Lisp lends itself to the transformation of objects to byte strings and back again (see the descriptions of Encode and Decode in the appendix). Appendix A should provide enough information to enable the reader to understand the code in the paper. Much of the technical content of the paper is contained in the program text, and thus we suggest that the reader take the time now to read Appendix A.
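The role that Encode and Decode play in EL — turning an object into a byte array for transmission and back again — has a direct modern analogue in any serialization library. The sketch below uses Python's pickle module purely as an illustration of the round trip; it is not part of the model implementation, and the `person` record is an invented example:

```python
import pickle

# A record-like object, loosely analogous to an EL record instance.
person = {"name": b"Frank", "years": 12}

# "Encode": transform the object into a byte array suitable for
# transmission to a remote processor.
encoded = pickle.dumps(person)
assert isinstance(encoded, bytes)

# "Decode": reconstruct an equivalent object from the byte array.
decoded = pickle.loads(encoded)
assert decoded == person
```

A remote evaluation facility can then be built on top of such a round trip: the caller encodes a request, ships the bytes over a channel, and the remote processor decodes and evaluates it.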
To help the reader understand EL we present a short example program fragment. The fragment shown below is not intended to be useful, but it does demonstrate some key EL constructs. We start by defining three record types: Person, Experience, and Experienced-Person. Experienced-Person is a derived record type that contains the fields and types of Person and Experience. The function Open-Person creates a new class that services the operations Name and Parent. Open-Person takes a person record and a person class as inputs. They respectively represent a person and that person's parent. When a Name request is sent to a class, Person-Name is invoked, and the name of the person is returned. When a Parent request is sent to a class, Parent-Name is invoked. Parent-Name gets the name of the class' parent from its superclass, which was set when the class was created. Every time that we create a class we include a comment that describes the class' instance variables. For example, Open-Person creates a class, and Person-Name and Parent-Name can use the instance variable name. The values of instance variables persist over class activations.
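For readers unfamiliar with Lisp dialects, the same structure — a class created by an "open" function, instance variables that persist across activations, and delegation to a superclass — can be sketched in Python before reading the EL text itself. This is a loose translation under my own naming; none of these identifiers are part of EL:

```python
def create_person(name):
    # Analogue of Create-Person: build a person record.
    return {"name": name}

def open_person(p, parent):
    # Analogue of Open-Person: return a "class" that services the
    # operations Name and Parent.  The instance variable `name` is
    # captured in the closure, so it persists over class activations.
    name = p["name"]

    def handle(request):
        if request == "Name":
            return name              # plays the role of Person-Name
        if request == "Parent":
            return parent("Name")    # Parent-Name: ask the superclass
        raise ValueError("unknown request: " + request)

    return handle

frank_class = open_person(create_person("Frank"), None)
alfred_class = open_person(create_person("Alfred"), frank_class)
dad = alfred_class("Parent")   # "Frank"
```

The closure plays the role of the EL class environment: each call to open_person creates a fresh binding of name, and every later request sent to the returned handler sees that same binding.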
    Person ← Record[name: Byte-Array];
    Experience ← Record[years: Integer];
    Experienced-Person ← Extend[Person, Experience];

    Create-Person[name: Byte-Array / p: Person] ← Prog[
        [];
        p ← Create[Person];
        p.name ← name;
        Return[p];
    ];

    Open-Person[p: Person, parent: Person-Class / c: Person-Class] ← Prog[
        [name: Byte-Array];
        -- Instance variables: name
        -- copy name
        name ← p.name;
        -- create a new class
        c ← Create-Class[List['Name, 'Person-Name, 'Parent, 'Parent-Name], parent];
        Return[c];
    ];

    Person-Name[/n: Byte-Array] ← Prog[
        [];
        -- Person-Name is evaluated in the environment of Open-Person
        Return[name];
    ];

    Parent-Name[/n: Byte-Array] ← Prog[
        [];
        -- ask superclass for its name
        n ← superclass ! Name[];
        Return[n];
    ];

    frank-class ← Open-Person[Create-Person["Frank"], NIL];
    alfred-class ← Open-Person[Create-Person["Alfred"], frank-class];
    -- dad will be "Frank"
    dad ← alfred-class ! Parent[];

We observe the following stylistic conventions in program text. Variables always begin with lower-case letters. Function names and record types are capitalized. Whenever a new object is introduced, we follow the same order of presentation as we did in our example. First, we introduce the operation to create an object instance. Second, we describe an "open" function that returns a class that will service an object instance. Third, we present the functions that actually implement the class' operations.

Let us define some terms that we will use repeatedly throughout this paper. A client is a program that uses the facilities we describe, and a user is a human being that interacts with a client. The reliability of a system is a measure of the probability that the system will not malfunction, and a system's availability is a measure of the probability that it will be operational when it is needed.

1.4 Background

Early in the 1960's time-sharing was introduced as a way of providing the illusion of a personal computer to aid in program debugging.
Time-sharing systems turned out to provide another benefit that was not originally anticipated. Users found they could easily share information that was stored in a time-sharing system. Sharing proved to be easy because it was as if a single file cabinet simultaneously existed in every user's office. Items placed in one file cabinet immediately appeared in the rest of the file cabinets. The facilities for information sharing provided by time-sharing soon found use in large collaborative software projects. As time-sharing matured, sharing was recognized as a basic facility, and made correspondingly convenient. The CTSS system [Corbato et al. 62] pioneered multiple access computers, and provided a simple shared file system. Based on this experience, Multics [Corbato et al. 72] extended its file system to include a tree-structured naming system and advanced protection facilities.

Late in the 1970's hardware became inexpensive enough that users could be provided their own computers [Thacker et al. 79]. Placing a large amount of computational power at the man-machine interface dramatically improved its character. For example, graphics could be introduced, adding an important element to the domain of things that computers could be used for. Communication networks [Metcalfe and Boggs 76] allowed the users of personal computers to share resources such as high-speed printers. However, sharing was not made as convenient as it was in time-sharing systems.

This research proposes to combine the success of time-sharing's shared storage and the success of personal computing's man-machine interface into a single system. Unfortunately, it is not sufficient to simply add conventional storage to personal computing systems. In a decentralized environment such problems as coordination, protection, reliability, availability, and performance become much more complicated.
For example, in a decentralized system when information is transferred between computers over an insecure channel it must be encrypted to provide protection. Furthermore, people's expectations have properly increased. Computers are being used for an increasingly diverse spectrum of applications, and many computer users are no longer computer professionals. Our understanding of these requirements is reflected in the principles of our architecture.

A number of systems have been built that share our primary goal of integrating a collection of computers with a shared information storage system. These systems fall into three broad categories.

1. Existing time-sharing systems have been modified to access remote files. The RSEXEC [Thomas 73] system was an early attempt to join TENEX systems in this manner, and the RSEXEC approach was later adopted by the National Software Works [Forsdick et al. 77]. The Locus project [Popek et al. 81] at UCLA has integrated a number of UNIX systems in a similar manner. Locus includes facilities for mediating concurrent access to information, and there are plans to incorporate replicated data as well. These systems all have major restrictions that are rooted in their time-sharing origins. In addition, they have as a general rule adopted ad hoc solutions to the intrinsic and environmental problems they faced. Thus, they do not provide a general framework of the sort we propose.

2. Data base systems have been extended to operate on several computers that are connected by a network. Examples of such systems are CICS ISC [IBM 80a] and Tandem Computer's Encompass [Tandem 81]. These systems are intended to provide a specialized service. In addition, they have not provided general solutions to many of the problems that we consider.

3. File servers have been constructed for local computer networks, and these file servers have been used by client computers for shared storage.
WFS, and its successor, IFS, are two such file servers [Swinehart et al. 79]. A file server at the University of Cambridge has been successfully used as the only storage service of a time-sharing system [Dion 80]. The Xerox Distributed File System [Israel et al. 78] provides facilities for guaranteeing information consistency across file servers, and sophisticated facilities for failure recovery. These file servers are more general than the time-sharing based efforts, and motivated the system we propose. However, the scope of these servers is limited, and they do not address many of the problems that we consider.

1.5 Architectural Principles

The following twelve architectural principles describe and delimit the scope of our ideal storage system. We will return to this list at the end of the paper to review how they have been satisfied.

1. The system should behave in a well defined way. In a large system design there are many opportunities for selecting a mechanism that works most of the time. Such a mechanism can only be employed in conjunction with a backup mechanism that is expected to work all of the time. For example, certain existing systems will undetectably malfunction in unusual circumstances. We will not consider such designs.

2. The system should provide a basic storage service. The basic unit of storage should be the file, an uninterpreted array of bytes. Read and write primitives should be provided to access files. The notion of a volume should also be provided to model storage media. Files are created and stored on volumes.

3. Storage should be resilient to expected failures. From time to time hardware errors, system errors, or operator errors will occur. The storage system should expect such errors, and recover from them without information loss. Furthermore, if unexpected errors occur, the system should indicate that storage has been damaged instead of providing incorrect information.

4.
Files, volumes, and other objects should be named with unambiguous low-level names. The storage system should not anticipate how clients might use these names or what naming environments will be presented to users.

5. The system should mediate concurrent access to storage to ensure consistency.

6. The system should be decentralized, and the location of storage system objects should be hidden from clients. The system should also allow clients to discover where objects are located.

7. The system should allow modular expansion. The storage capacity of the system should not be limited by any design decision, nor should the design intrinsically limit the number of users that the system can support.

8. It should be possible to improve the performance, reliability, and availability of the storage system by keeping multiple copies of selected storage system objects. This principle includes the idea of making temporary copies of objects for rapid access.

9. It should be possible to reconfigure the system while it is operating. Reconfiguration involves changing the storage resources that are used to implement a storage system object.

10. A mechanism for information secrecy and authentication should be provided. The mechanism should be general enough that clients can use it to implement a variety of protection policies. No one should be able to circumvent the protection mechanism. For example, system administrators should not be able to access information that they are not authorized to see.

11. It should be possible to construct derived volumes by extending existing volumes with replication, reconfiguration, and protection structures. Files created on derived volumes will use the volume's structure as a template. For example, it should be possible to create a volume R from three other volumes A, B, and C. When a file F is created on R, copies of F will be automatically maintained on A, B, and C.
This allows popular classes of storage to be directly represented in the system as volumes. Thus, clients may choose to ignore what facilities are being used to provide the storage they use.

12. A client should be able to select the resources that it uses in such a way that its processor can remain autonomous from the rest of the system.

1.6 Summary

A new information storage system was proposed. It is intended to be the foundation of diverse applications such as office information systems, programming environments, and data base systems. The design of the system was divided into four stages: definition of architectural principles, formulation of a system model, system implementation, and system configuration. The paper is structured according to these stages. The system's architectural principles and their background were then introduced. Chapter 2 begins our consideration of a system model to realize these principles.

Chapter 2: Environment

The first step in defining the system model is to outline its presumed environment. This chapter reviews the characteristics of the hardware components we use, introduces the concepts of stable storage and unique identifiers, and shows how to perform reliable function evaluation at a remote processor. The material in this chapter represents an integration of ideas that have been presented before in various forms. Notably [Lampson and Sturgis 79] previously introduced a number of the concepts reviewed here.

2.1 Hardware Components

Three types of hardware components comprise the system: processors, communication channels, and storage devices. In each of the following sections we discuss the characteristics of each type of hardware component. The hardware model presented is abstract to the extent that it only concerns itself with device characteristics that will influence later design decisions.

2.1.1 Processors

A processor corresponds to the familiar notion of a stored program digital computer.
Processors are the active elements in a system, and they operate independently from one another. Processors only communicate with each other through communication channels. There is absolute protection between processors in the sense that all a malicious processor can do is send messages, which other processors can choose to ignore. Depending on application needs, processors may be connected to a wide variety of peripherals, such as storage devices, bit-map displays, pointing devices, bar-code readers, and laser printers.

Although it is not an absolute necessity, we will assume that all processors in the system have the following ideal capabilities. Every processor is assigned an identifier that is distinct from any other processor's identifier. Processor identifiers can be implemented by including a read-only memory in each processor that contains its identifier. Fixed length processor identifiers offer simplicity, but for some implementations the expansion capability offered by variable length identifiers may be attractive. A processor's identifier can be discovered with the function GetProcessorID.

    GetProcessorID[/id: ProcessorID]

GetProcessorID returns the identifier of the processor that the calling process is using.

In addition, it is assumed that a processor can encrypt and decrypt data, and generate true random numbers. Encryption can be implemented in software, but for efficient operation with high security codes it is likely that special purpose hardware will be required. True random numbers are useful for generating hard to guess cryptographic keys. A true random number is not the output of a pseudo-random number generator, but is derived from a truly random process such as thermal noise, shot noise, or radioactive decay.

2.1.2 Communication

Communication and synchronization are accomplished between processors by sending and receiving messages. A message is an uninterpreted array of bytes that is sent to a destination processor.
Every processor is assigned an address that includes its processor identifier, so different processors will never be assigned the same address. However, we assume that if a processor is physically moved its address will change. The following functions define the interface to the communication system. Note that Processor includes a processor identifier and fields that are private to the communication system.

Processor = Extend[Record[id: ProcessorID], PrivateFields];

Send[destination: Processor, message: Byte-Array]

Send transmits message to destination.

Receive[/message: Byte-Array]

Receive delays until a message arrives addressed to this processor, and then returns the incoming message.

GetMyProcessor[/self: Processor]

GetMyProcessor returns the address of this processor.

The communication system can lose, duplicate, or arbitrarily delay messages. Thus, messages may arrive out of order, more than once, or not at all. We will assume that messages damaged in transit are detected and discarded by the communication system, and will appear to have been lost.

At least one distinguished address, Broadcast, is defined by the communication system so that a processor can discover things about its environment and begin communicating with other processors. One or more processors can ask to receive messages addressed to Broadcast. How far broadcast messages will propagate in the communication system is implementation dependent. Chapter 4 discusses how this facility is used.

Broadcast: Processor

Packet-switched networks [Metcalfe 73] are well adapted to providing a fully connected network at moderate cost, and are an attractive method for implementing the proposed communication system. Local packet-switched networks, such as the Ethernet [Metcalfe and Boggs 76], provide high capacity and low delay at low cost for a local area. A treatment of local networks can be found in [Clark et al. 78].
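The failure semantics above (loss, duplication, reordering, and checksum-based discard of damaged messages) can be sketched with a toy model. Everything here, including the drop and duplication probabilities, is an illustrative assumption rather than the dissertation's design:

```python
import random
import zlib

class LossyChannel:
    # Toy model of the communication system's failure semantics: messages
    # may be lost, duplicated, or reordered, and a message damaged in
    # transit is detected by its checksum and discarded, so to the
    # receiver it appears to have been lost.
    def __init__(self, seed: int = 0):
        self.rng = random.Random(seed)
        self.in_flight = []  # frames awaiting delivery, in no fixed order

    def send(self, message: bytes) -> None:
        frame = (zlib.crc32(message), message)
        r = self.rng.random()
        if r < 0.1:                       # lost outright
            return
        self.in_flight.append(frame)
        if r > 0.9:                       # duplicated
            self.in_flight.append(frame)
        self.rng.shuffle(self.in_flight)  # arbitrary delay => reordering

    def receive(self):
        # Deliver the next undamaged frame, silently discarding damaged ones.
        while self.in_flight:
            crc, message = self.in_flight.pop()
            if crc == zlib.crc32(message):
                return message
        return None  # nothing currently deliverable
```

A receiver built on such a channel must tolerate missing, repeated, and out-of-order messages, which is the burden the reliable remote evaluation machinery of Section 2.4 takes on.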
Local networks can be connected together by gateway processors and communication channels to form an internetwork [Boggs et al. 80]. An internetwork retains the performance and cost advantages of a local network, while extending the communication system to accommodate modular growth. In an internetwork, non-local packets are forwarded through a succession of gateways to eventually arrive at their destination.

2.1.3 Storage Devices

A storage device is a processor peripheral that stores data. Read and write are the two primitive operations that are used to access and update storage devices, respectively. We assume that a storage device will indicate when it has failed to accurately store data. Failure detection is typically implemented with the help of labels and checksums [Lampson and Sproull 79].

Three fundamental characteristics of storage devices are capacity, latency, and transfer rate. The capacity of a storage device is the maximum number of bytes that it can store on a single medium. We have been careful to say maximum, because the amount of data that can be stored is often a function of the type of media in use. The transfer rate is the maximum number of bytes per second that can be transferred to or from the device. The latency of a storage device is the average amount of time required to read or write information, ignoring the time the device is actually transferring data.

Storage devices have four additional important characteristics.

1. Some storage devices are designed for random access to data, and some are designed for serial access. These devices will be called random access and serial access, respectively. A tape drive is an example of a device designed for serial access to data, and an attempt to use it in a random access mode would result in poor performance. Although the storage capacity of random access devices is increasing, serial access devices may continue to have a role in information storage because of their lower cost.

2.
Some storage devices allow data to be read or written, but once data is stored it cannot be overwritten. Such a device will be called write-once. This property is usually due to the storage media in use, for example a write-once optical disk.

3. Some storage devices do not allow data to be written. Such a device will be called read-only. As a safety feature some storage devices can be made temporarily read-only.

4. Some storage devices can record on storage media that are interchangeable with other storage devices of the same type. Such a device will be said to have removable media.

2.2 Stable Storage

Storage that is resilient to a set of expected failures is called stable storage. From time to time hardware errors, system errors, or human errors will occur. A stable storage system expects such errors, and recovers from them without information loss. Furthermore, if unexpected errors occur, a stable storage system indicates that storage has been damaged instead of providing incorrect information. Expected hardware failures include storage device transfers that malfunction, and information on a storage device that decays and becomes unreadable. Transfers that malfunction include transfers that are in progress when their controlling processor fails.

The write-ahead-log protocol [Gray 78] is widely used to implement stable storage. It implements stable storage by carefully maintaining two copies of data on devices with independent failure modes. [Lampson and Sturgis 79] suggest a similar algorithm.

It is difficult to precisely characterize the reliability of stable storage. Manufacturers give bit error rates for their devices, but these figures do not include catastrophic failures. Examples of catastrophic failures are head crashes on disk drives, media dropped on the floor by operators, and media mistakenly erased by users. We shall call storage that is not protected against failures volatile storage.
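The two-copy idea can be sketched in miniature. This is an assumption-laden toy model (checksummed records, copies updated one at a time, read whichever copy is undamaged), not the actual protocol of [Gray 78] or [Lampson and Sturgis 79]:

```python
import zlib

class StableCell:
    # Minimal sketch of stable storage for a single value: two copies on
    # (notionally) independent devices, each guarded by a checksum so a
    # damaged copy is detected rather than silently returned.
    def __init__(self, value: bytes):
        self.copies = [None, None]
        self.write(value)

    @staticmethod
    def _frame(value: bytes):
        return (zlib.crc32(value), value)

    def write(self, value: bytes) -> None:
        # Update the copies one at a time, so a failure mid-write leaves
        # at least one complete, undamaged copy on the other device.
        self.copies[0] = self._frame(value)
        self.copies[1] = self._frame(value)

    def read(self) -> bytes:
        for frame in self.copies:
            if frame is not None:
                crc, value = frame
                if crc == zlib.crc32(value):
                    return value  # first undamaged copy wins
        # Both copies damaged: report damage, never return bad data.
        raise IOError("stable storage damaged")
```

If one copy decays, read still returns the surviving copy; only when both copies are damaged does it report an error instead of providing incorrect information, matching the behavior required of stable storage above.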
Stable storage will be used to record long-term system state, and volatile storage will be used for intermediate results. Volatile storage is naturally much more efficient than stable storage because the same precautions do not have to be observed.

2.3 Unique Identifiers

A unique identifier is defined to be a value that is distinct from every other unique identifier. We will assume for simplicity that unique identifiers are system-generated nonsensical bit strings, but unique identifiers could be client generated and sensible. Unique identifiers do not all have to be the same length. When it is known that a unique identifier is going to be used extensively it may be advantageous to use a shorter identifier. Of course, there are fewer short identifiers than there are long ones.

A common method for generating a unique identifier is to concatenate a processor identifier with a locally unique identifier. A locally unique identifier can be implemented by a counter in stable storage. Whenever a locally unique identifier is required the counter is incremented and returned. An obvious optimization is to withdraw a sequence of locally unique identifiers from stable storage at once to reduce delay.

Another technique for generating locally unique identifiers is as follows. At processor initialization time create a variable nextId and set it equal to a calendar clock. A calendar clock is a clock that holds the current date and time and thus monotonically increases. Every time a locally unique identifier is requested, increment nextId by one and ensure that it is less than the calendar clock by pausing if necessary. Now nextId is guaranteed to be locally unique. Thus, the second scheme does not require stable storage.

Although theoretically unique identifiers are unique, there is a chance that the unique identifier mechanism could fail and issue duplicate identifiers.
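The calendar-clock scheme, combined with concatenating the processor identifier, can be sketched as follows; the class, names, and nanosecond clock are illustrative assumptions, not the dissertation's specification:

```python
import time

class UniqueIDGenerator:
    # Sketch of the calendar-clock scheme described above: nextId starts
    # at the calendar clock, and each request increments it, pausing if it
    # would run ahead of the clock, so no stable-storage counter is needed.
    def __init__(self, processor_id: int):
        self.processor_id = processor_id
        self.next_id = self._calendar_clock()

    @staticmethod
    def _calendar_clock() -> int:
        # A monotonically increasing date-and-time value.
        return time.time_ns()

    def get_unique_id(self):
        self.next_id += 1
        while self.next_id >= self._calendar_clock():
            time.sleep(0)  # pause until the clock passes nextId
        # Pairing the processor identifier with the locally unique
        # identifier yields a globally unique identifier.
        return (self.processor_id, self.next_id)
```

Successive identifiers from one generator are monotonically increasing, matching the property GetUniqueID is specified to have on a single processor.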
Such a failure could result from two processors that were mistakenly assigned the same processor identifier, or from a malicious client. The algorithms we present in many instances check for duplicate unique identifiers. However, to provide a foundation on which to build, we will assume that unique identifiers are in fact unique. The following function provides unique identifiers:

GetUniqueID[/id: UniqueID]

GetUniqueID returns a unique identifier. On a single processor subsequent unique identifiers from GetUniqueID are monotonically increasing.

2.4 Reliable Remote Evaluation

2.4.1 Model

Remote form evaluation is the way one processor requests another processor to evaluate a function and return a result. A remote evaluation is a generalization of what is commonly referred to as a remote procedure call. [Spector 80] provides a taxonomy of remote operations and their semantics. It is possible to provide precise semantics for remote evaluation because evaluation is formally defined with respect to an environment that binds values to free variables and functions. We will assume that remote evaluation is done with respect to a default environment that includes the data types and functions defined in this paper.

To communicate with a remote processor it is first opened. Opening a processor results in a class that will service Eval requests. Eval will evaluate an EL form at the remote processor. To make the remote evaluation of forms easier, we restrict forms to be defined by the following grammar: