Information Storage in a Decentralized
Computer System
David K. Gifford
Information Storage in a Decentralized
Computer System
by David K. Gifford
CSL-81-8
June 1981; Revised March 1982
Abstract: See page x.
This report is a slightly revised version of a dissertation submitted to the Department of
Electrical Engineering and the Committee on Graduate Studies of Stanford University in
partial fulfillment of the requirements for the degree of Doctor of Philosophy.
CR categories: 4.33, 4.35, 3.73
Key words and phrases: file system, information storage, reference, transaction, location,
replication, suite, representative, quorum, reconfiguration, protection, cryptographic sealing,
key, seal, unseal.
XEROX
Xerox Corporation
Palo Alto Research Centers
3333 Coyote Hill Road
Palo Alto, California 94304
© Copyright 1981, 1982 by David K. Gifford
All Rights Reserved.
Contents
List of Figures vii
List of Tables viii
Preface ix
Abstract x
1. Method and Principles 1
1.1 Introduction 1
1.2 Method 1
1.3 System Description 3
1.4 Background 4
1.5 Architectural Principles 6
1.6 Summary 7
2. Environment 8
2.1 Hardware Components 8
2.1.1 Processors 8
2.1.2 Communication 9
2.1.3 Storage Devices 9
2.2 Stable Storage 10
2.3 Unique Identifiers 11
2.4 Reliable Remote Evaluation 11
2.4.1 Model 11
2.4.2 Algorithm 13
2.5 Summary 20
3. Transactional Storage 21
3.1 Model 21
3.1.1 References 21
3.1.2 Transactions 23
3.1.3 Volumes 26
3.1.4 Files 27
3.1.5 Mutable and Immutable Objects 28
3.2 Basic Algorithms 29
3.2.1 Single Processor Case 29
3.2.2 Decentralized Case 30
3.2.3 Reed's Algorithm 31
3.2.4 Located References 32
3.3 Refinements 35
3.3.1 Lock Compatibility 35
3.3.2 Lock Granularity 36
3.3.3 Lower Degrees of Consistency 36
3.3.4 Broken Read Locks 37
3.3.5 Releasing Read Locks 37
3.3.6 Deadlock 37
3.3.7 Checkpoints and Save Points 38
3.3.8 Nested Transactions 38
3.3.9 Stable Storage Failure 38
3.4 Summary 38
4. Location 40
4.1 Indexes 40
4.2 Indirection 41
4.2.1 Model 41
4.2.2 Basic Algorithm 41
4.3 Indirect References 43
4.4 Object Location 46
4.5 Choice References 46
4.6 Summary 48
5. Replication 49
5.1 Suites 49
5.2 Basic Algorithm 52
5.3 Correctness Argument 61
5.4 Refinements 62
5.4.1 Weak Representatives 62
5.4.2 Lower Degrees of Consistency 62
5.4.3 Representative Performance 62
5.4.4 Expressive Power of Suites 62
5.4.5 Size of Replicated Objects 62
5.4.6 Total Replacement 63
5.4.7 Simultaneous Suite Update 63
5.4.8 Releasing Read Locks 63
5.4.9 Updating Representatives in Background 63
5.4.10 Guessing a Representative is Current 63
5.4.11 Postponed Creation of File Representatives 63
5.4.12 An Alternative for Replicated Volumes 64
5.5 Related Work 64
5.6 Summary 64
6. Reconfiguration 65
6.1 Reconfigurable Objects 65
6.2 Basic Algorithm 66
6.3 Refinements 74
6.3.1 Reducing Data Movement 74
6.3.2 Eliminating the Volume Index 74
6.4 Summary 75
7. Protection 76
7.1 Preliminaries 77
7.1.1 Framework 77
7.1.2 Environment 77
7.1.2.1 Cryptography 77
7.1.2.2 Threshold Scheme 81
7.1.2.3 Checksums 82
7.2 Cryptographic Sealing 82
7.2.1 Model 82
7.2.2 Basic Algorithm 85
7.2.3 Strength 93
7.2.3.1 Correctness Argument 93
7.2.3.2 Susceptibility to Cryptanalysis 95
7.3 Applications 97
7.3.1 Privilege Establishment 97
7.3.1.1 Key Rings 97
7.3.1.2 Encrypted Objects 97
7.3.1.3 Guarded Objects 98
7.3.1.4 Protected Volumes 99
7.3.2 Common Protection Mechanisms 100
7.3.2.1 Capabilities 100
7.3.2.2 Access Control Lists 102
7.3.2.3 Information Flow Control 102
7.3.3 Secure Processors 104
7.3.3.1 Limiting Remote Evaluation 104
7.3.3.2 Secure Channels 105
7.3.4 Revocation 107
7.3.4.1 Protected Indirection 109
7.3.4.2 Revocation Algorithm 109
7.4 Practical Considerations 111
7.4.1 Changing Protection Controls 111
7.4.2 Authentication in the Large 112
7.4.3 Performance 112
7.4.4 Comments 113
7.4.5 Elimination of Authentication 113
7.5 Comparative Analysis 113
7.6 Summary 114
8. Practical Considerations 115
8.1 Implementation 115
8.1.1 Prototypes 115
8.1.2 Full-Scale Implementations 116
8.2 Configuration 116
8.2.1 Static Configuration 116
8.2.2 Dynamic Configuration 120
8.3 Summary 120
9. Summary of Ideas 121
Appendix A: Exposition Language: EL 123
A.1 Language Extensions 123
A.2 Classes 123
A.3 Records 125
A.4 External Representation 127
A.5 Exception Handling 127
A.6 Processes 128
A.7 Synchronization 128
A.8 Sets 128
A.9 Byte Arrays 129
A.10 Miscellaneous 129
Appendix B: Storage System Summary 130
Appendix C: The Open Function 133
Bibliography 134
Index 138
Figures
Figure 2.1 - Structure of a Processor Class 18
Figure 3.1 - Structure of a Located Class 34
Figure 4.1 - An Indirect Record 42
Figure 4.2 - Structure of an Indirect Class 45
Figure 5.1 - Structure of a Suite Class 54
Figure 6.1 - Structure of a Reconfigurable File or Index 67
Figure 6.2 - Structure of a Reconfigurable Volume 68
Figure 6.3 - File and Index Reconfiguration Algorithm 70
Figure 6.4 - Step 1 of Volume Reconfiguration: Move Files 71
Figure 6.5 - Step 2 of Volume Reconfiguration: Move File Index 72
Figure 6.6 - Step 3 of Volume Reconfiguration: Change Volume Pointer 73
Figure 7.1 - A Passive Protection Mechanism 78
Figure 7.2 - An Active Protection Mechanism 79
Figure 7.3 - Example Sealed Objects 87
Figure 7.4 - Protection System Internal Structure 88
Figure 7.5 - Capability Reference 101
Figure 7.6 - Object Reference: Access Control List 103
Figure 7.7 - Structure of a Secure Located Class 108
Figure 7.8 - A Revocable Capability 110
Figure 8.1 - A Basic Storage Service 117
Figure 8.2 - Client Defined Protected Volumes 118
Tables
Table 3.1 - Typical Lock Compatibility Matrix 35
Table 3.2 - Lock Compatibility Matrix with Intention Locks 36
Table 5.1 - Sample Suite Configurations 51
Preface
This paper is organized so that it can be used in several ways. It can of course be read in its
entirety for a treatment of the general problem of decentralized information storage. It is also
possible to read an isolated chapter for a treatment of a single topic, such as protection. Each
chapter begins with an outline of its organization and ends with a summary of its contents. If the
reader wishes to look at a chapter in isolation I would also suggest that they read Sections 1.3 and
3.1.
There is a comprehensive index at the end of the paper. The index is useful to locate
references to a specific concept. It also indexes the program text, and thus should help a reader
find the definition of a function or a type.
I have tried to present enough detail so a reader can easily implement the ideas I discuss.
However, it is possible to understand the essence of the ideas without reading program text. On a
first reading, a casual reader should probably skip sections titled implementation or refinement.
I would like to thank my patient advisers, Dr. Butler Lampson and Prof. Susan Owicki, for
their help and good advice over the past four years. In addition to my advisers, Peter Deutsch,
Jim Gray, Martin Haeberli, Prof. Larry Manning, Gene McDaniel, Alfred Spector, Larry Stewart,
and Howard Sturgis took the time to carefully read chapters of this paper and suggest
improvements. I would also like to thank Bob Taylor for making it possible for me to do this work
at the Xerox Palo Alto Research Center. The work benefited greatly from daily interactions with
colleagues at Xerox.
The Fannie and John Hertz Fellowship that I held while a graduate student gave me the
freedom to pursue this research. The research was supported in part by the Fannie and John Hertz
Foundation, and by the Xerox Corporation.
This work is dedicated to my parents.
Abstract
This paper describes an architecture for shared information storage in a decentralized computer
system. The issues that are addressed include: naming of files and other objects (naming), reliable
storage of data (stable storage), coordinated access to shared storage (transactional storage), location
of objects (location), use of multiple copies to increase performance, reliability and availability
(replication), dynamic modification of object representations (reconfiguration), and storage security
and authentication (protection).
A complete model of the architecture is presented, which describes the interface to the facilities
provided, and describes in detail the proposed mechanisms for implementing them. The model
presents new approaches to naming, location, replication, reconfiguration, and protection. To verify
the model, three prototypes were constructed, and experience with these prototypes is discussed.
The model names objects with variable length byte arrays called references. References may
contain location information, protection guards, cryptographic keys, and other references. In
addition, references can be made indirect to delay their binding to a specific object or location.
The replication mechanism is based on assigning votes to each copy of a replicated object. The
characteristics of a replicated object can be chosen from a range of possibilities by appropriately
choosing its voting configuration. Temporary copies can be easily implemented by introducing
copies with no votes.
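The voting idea can be made concrete with a toy sketch (Python is used here purely for illustration; the function names and the particular configurations are ours, not part of the model). In weighted-voting schemes of this kind, a read gathers representatives holding at least r votes and a write at least w votes, with the quorum sizes chosen so that quorums must overlap:

```python
def is_valid_configuration(votes, r, w):
    # votes: the vote assigned to each representative; a copy with
    # zero votes is a temporary copy that can never affect a quorum.
    total = sum(votes)
    # r + w > total: every read quorum intersects every write quorum.
    # 2 * w > total: any two write quorums intersect.
    return r + w > total and 2 * w > total

def quorum_reachable(votes, reachable, needed):
    # Can the reachable representatives muster `needed` votes?
    return sum(votes[i] for i in reachable) >= needed

# Three copies with one vote each, r = w = 2: reads and writes each
# tolerate the loss of any single representative.
config = [1, 1, 1]
```

Adding a fourth representative with zero votes changes nothing in the checks above: it can serve as a fast local copy without ever being needed for correctness.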
The reconfiguration mechanism allows the storage that is used to implement an object to
change while the system is operating. A client need not be aware that an object has been
reconfigured.
The protection mechanism is based on the idea of sealing an object with a key. Sealed objects
can only be unsealed with an appropriate set of keys. Complex protection structures can be created
by using such operators as Key-Or and Key-And. The protection mechanism can be employed to
create popular protection mechanisms such as capabilities, access control lists, and information flow
control.
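The access logic of sealing can be mimicked in a small sketch (Python, illustrative only: it models which key sets suffice to unseal, not the cryptography the mechanism actually uses, and all names here are ours):

```python
# A guard is a predicate over the set of keys a client holds.
def Key(k):
    return lambda held: k in held

def Key_Or(*guards):
    # sealed contents open if any constituent guard is satisfied
    return lambda held: any(g(held) for g in guards)

def Key_And(*guards):
    # sealed contents open only if every constituent guard is satisfied
    return lambda held: all(g(held) for g in guards)

def seal(contents, guard):
    return (guard, contents)

def unseal(sealed, held_keys):
    guard, contents = sealed
    return contents if guard(held_keys) else None

# Either the owner's key alone, or both auditors' keys together,
# suffice to unseal this object.
obj = seal("contents", Key_Or(Key("owner"),
                              Key_And(Key("auditor-1"), Key("auditor-2"))))
```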
Chapter 1: Method and Principles
1.1 Introduction
Communication is an essential part of our basic need to cooperate and share with one another.
We have been given the freedom to have distant friends and increased knowledge about our world
by advances in communication technology such as the post office, the telegraph, and the telephone.
Computer systems promise to be another such advance. Large scale community information
systems are likely to play a major role in our future ability to create, organize, process, store, and
share information.
It was once thought that the problem of building large computer systems was that of building
large computers. It is now clear that this is not the case. Instead of employing a single computer,
future large scale computer systems will be composed of thousands, or even millions, of computers.
The goal of this research is to demonstrate an information storage system architecture that can
be used to integrate a collection of computers. By integrate we mean that the architecture permits
information to be easily exchanged between the users of the system. We are not suggesting that
information storage be centralized. Rather, we propose to organize the storage facilities that would
normally exist in a collection of computers into a single decentralized system.
The information storage architecture we present is intended to create a foundation for shared
applications. Example applications include office information systems, programming environments,
and data base systems. The principles of our architecture are best characterized by the desire to
provide fundamental storage facilities that can be flexibly adapted to a wide range of uses.
1.2 Method
Computer system research is more than the invention of new algorithms; part of the work lies
in the synthesis of a collection of ideas into a single package. Furthermore, it is important that a
synthesis be faithful to a single set of coordinated architectural principles. Conceptual integrity
keeps complex interactions from making the system intractable as it increases in size and function.
A clear statement of principal design decisions is central to the overall success of a large system.
The importance of these ideas has been demonstrated by [Belady and Lehman 77] and [Brooks 75].
We suggest a four stage process for system creation that is intended to promote these concepts.
The idea of the process is to emphasize the importance of asking fundamental questions early in the
life of a system, and to postpone secondary decisions. In order, the four stages are:
1. Define the system's architectural principles. The architectural principles of a system are a
set of primary design decisions that consider technical feasibility [Liddle 76]. These
decisions serve to define and delimit the scope of a system. Furthermore, they allow for
orderly growth by providing a single conceptual framework that can accommodate
extensions in system size and function.
For example, what is the nature of the system? Is it intended to provide a general purpose
computing environment? Or is it going to be used exclusively for electronic mail? How large
will the system be? Are different instances of the system going to be interconnected? What
are the capacity and reliability requirements of the system? What are the services that the
system will offer? How will these services be presented to a client? What are the
requirements for accounting? For protection?
2. Formulate a system model. A system model is a design for the system in line with its
architectural principles. A model describes the system's interfaces and mechanisms in
enough detail that it is possible to reason about the correctness of key algorithms. When
the system is constructed, the system model is used as a pattern.
3. Implement the system. A system implementation is a concrete set of hardware and software
components that realize a system model. The implementation of a system normally starts
by making a plan for its construction, testing, and documentation. Naturally, there can be
several implementations of a system model. This is important, as over the life of a system
new implementations of parts of a system will cause new and old components to coexist.
4. Plan a system configuration. A system configuration is an installed set of components from
a system implementation. A system implementation represents a wide spectrum of capacity,
reliability, availability, and performance possibilities; a configuration reflects the decisions
made to meet specific needs.
Our research has been organized according to these stages.
The remainder of this chapter treats the architectural principles of our system and their
background. We discuss how time-sharing systems, personal computing systems, and computer
networks have influenced our goals.
The next six chapters describe our system model by successive refinement. Chapter 2 defines
the environment of the system model. Chapter 3 presents a simple storage system. This storage
system would be ideal if one made the following assumptions:
1. Files, volumes, and other objects never move.
2. It is never necessary to improve the reliability, availability, or performance characteristics of
storage devices.
3. It is never necessary to change the storage that is used to store an object.
4. People are perfectly trustworthy and there is no need for protection.
Chapters 4 through 7 remove these assumptions. We add location (Chapter 4), replication (Chapter
5), reconfiguration (Chapter 6), and protection (Chapter 7) to describe a practical system.
Chapter 8 discusses system implementation and configuration. Three prototypes were built to
test the validity of our system model, and experience with these prototypes is discussed. In view of
these prototypes, we outline our expectations about full scale implementations of the system.
System configuration is discussed briefly, but it is not thoroughly explored.
Chapter 9 concludes the paper with a summary of the major ideas introduced in the paper and
a review of how we have achieved the architectural principles we set forth.
1.3 System Description
This paper includes a large amount of program text. The intent of the program text is to
provide the reader with a detailed understanding of the system model. Considered collectively, the
program text is an implementation of the system model. The program text is also intended to be
used as a pattern for full-scale implementations. Thus, the code that we present is called a model
implementation. The functions and types of the model implementation are indexed at the back of
the paper.
The system model is described in EL ("Exposition Language"), a dialect of Lisp 1.5 [McCarthy
et al. 62]. Lisp was chosen because it includes an Eval function, which allowed us to define the
semantics of operations executed at remote processors. In addition, Lisp lends itself to the
transformation of objects to byte strings and back again (see the descriptions of Encode and Decode
in the appendix). Appendix A should provide enough information to enable the reader to
understand the code in the paper. Much of the technical content of the paper is contained in the
program text, and thus we suggest that the reader take the time now to read Appendix A.
To help the reader understand EL we present a short example program fragment. The
fragment shown below is not intended to be useful, but it does demonstrate some key EL
constructs. We start by defining three record types: Person, Experience, and Experienced-Person.
Experienced-Person is a derived record type that contains the fields and types of Person and
Experience. The function Open-Person creates a new class that services the operations Name and
Parent. Open-Person takes a person record and a person class as inputs. They respectively
represent a person and that person's parent. When a Name request is sent to a class, Person-Name
is invoked, and the name of the person is returned. When a Parent request is sent to a class,
Parent-Name is invoked. Parent-Name gets the name of the class' parent from its superclass, which
was set when the class was created.
Every time that we create a class we include a comment that describes the class' instance
variables. For example, Open-Person creates a class, and Person-Name and Parent-Name can use
the instance variable name. The values of instance variables persist over class activations.
Person ← Record[name: Byte-Array];
Experience ← Record[years: Integer];
Experienced-Person ← Extend[Person, Experience];

Create-Person[name: Byte-Array / p: Person] ← Prog[ [];
    p ← Create[Person];
    p.name ← name;
    Return[p];
];

Open-Person[p: Person, parent: Person-Class / c: Person-Class] ← Prog[
    [name: Byte-Array];
    -- Instance variables: name
    -- copy name
    name ← p.name;
    -- create a new class
    c ← Create-Class[List[
        'Name, 'Person-Name,
        'Parent, 'Parent-Name], parent];
    Return[c];
];

Person-Name[/n: Byte-Array] ← Prog[ [];
    -- Person-Name is evaluated in the environment of Open-Person
    Return[name];
];

Parent-Name[/n: Byte-Array] ← Prog[ [];
    -- ask superclass for its name
    n ← superclass ! Name[];
    Return[n];
];

frank-class ← Open-Person[Create-Person["Frank"], NIL];
alfred-class ← Open-Person[Create-Person["Alfred"], frank-class];
-- dad will be "Frank"
dad ← alfred-class ! Parent[];
We observe the following stylistic conventions in program text. Variables always begin with
lower-case letters. Function names and record types are capitalized. Whenever a new object is
introduced, we follow the same order of presentation as we did in our example. First, we introduce
the operation to create an object instance. Second, we describe an "open" function that returns a
class that will service an object instance. Third, we present the functions that actually implement
the class' operations.
Let us define some terms that we will use repeatedly throughout this paper. A client is a
program that uses the facilities we describe, and a user is a human being who interacts with a client.
The reliability of a system is a measure of the probability that the system will malfunction, and a
system's availability is a measure of the probability that it will be operational when it is needed.
1.4 Background
Early in the 1960's time-sharing was introduced as a way of providing the illusion of a personal
computer to aid in program debugging. Time-sharing systems turned out to provide another benefit
that was not originally anticipated. Users found they could easily share information that was stored
in a time-sharing system. Sharing proved to be easy because it was as if a single file cabinet
simultaneously existed in every user's office. Items placed in one file cabinet immediately appeared
in the rest of the file cabinets. The facilities for information sharing provided by time-sharing soon
found use in large collaborative software projects.
As time-sharing matured, sharing was recognized as a basic facility, and made correspondingly
convenient. The CTSS system [Corbato et al. 62] pioneered multiple access computers, and provided
a simple shared file system. Based on this experience, Multics [Corbato et al. 72] extended its file
system to include a tree-structured naming system and advanced protection facilities.
Late in the 1970's hardware became inexpensive enough that users could be provided their own
computers [Thacker et al. 79]. Placing a large amount of computational power at the man-machine
interface dramatically improved its character. For example, graphics could be introduced, adding an
important element to the domain of things that computers could be used for. Communication
networks [Metcalfe and Boggs 76] allowed the users of personal computers to share resources such
as high-speed printers. However, sharing was not made as convenient as it was in time-sharing
systems.
This research proposes to combine the success of time-sharing's shared storage and the success
of personal computing's man-machine interface into a single system. Unfortunately, it is not
sufficient to simply add conventional storage to personal computing systems. In a decentralized
environment such problems as coordination, protection, reliability, availability, and performance
become much more complicated. For example, in a decentralized system when information is
transferred between computers over an insecure channel it must be encrypted to provide protection.
Furthermore, people's expectations have properly increased. Computers are being used for an
increasingly diverse spectrum of applications, and many computer users are no longer computer
professionals. Our understanding of these requirements is reflected in the principles of our
architecture.
A number of systems have been built that share our primary goal of integrating a collection of
computers with a shared information storage system. These systems fall into three broad categories.
1. Existing time-sharing systems have been modified to access remote files. The RSEXEC
[Thomas 73] system was an early attempt to join TENEX systems in this manner, and the
RSEXEC approach was later adopted by the National Software Works [Forsdick et al. 77].
The Locus project [Popek et al. 81] at UCLA has integrated a number of UNIX systems in
a similar manner. Locus includes facilities for mediating concurrent access to information,
and there are plans to incorporate replicated data as well. These systems all have major
restrictions that are rooted in their time-sharing origins. In addition, they have as a general
rule adopted ad hoc solutions to the intrinsic and environmental problems they faced.
Thus, they do not provide a general framework of the sort we propose.
2. Data base systems have been extended to operate on several computers that are connected
by a network. Examples of such systems are CICS ISC [IBM 80a] and Tandem Computer's
Encompass [Tandem 81]. These systems are intended to provide a specialized service. In
addition, they have not provided general solutions to many of the problems that we
consider.
3. File servers have been constructed for local computer networks, and these file servers have
been used by client computers for shared storage. WFS, and its successor, IFS, are two
such file servers [Swinehart et al. 79]. A file server at the University of Cambridge has
been successfully used as the only storage service of a time-sharing system [Dion 80]. The
Xerox Distributed File System [Israel et al. 78] provides facilities for guaranteeing
information consistency across file servers, and sophisticated facilities for failure recovery.
These file servers are more general than the time-sharing based efforts, and motivated the
system we propose. However, the scope of these servers is limited, and they do not address
many of the problems that we consider.
1.5 Architectural Principles
The following twelve architectural principles describe and delimit the scope of our ideal storage
system. We will return to this list at the end of the paper to review how they have been satisfied.
1. The system should behave in a well defined way. In a large system design there are many
opportunities for selecting a mechanism that works most of the time. Such a mechanism
can only be employed in conjunction with a backup mechanism that is expected to work all
of the time. For example, certain existing systems will undetectably malfunction in unusual
circumstances. We will not consider such designs.
2. The system should provide a basic storage service. The basic unit of storage should be the
file, an uninterpreted array of bytes. Read and write primitives should be provided to
access files. The notion of a volume should also be provided to model storage media. Files
are created and stored on volumes.
3. Storage should be resilient to expected failures. From time to time hardware errors, system
errors, or operator errors will occur. The storage system should expect such errors, and
recover from them without information loss. Furthermore, if unexpected errors occur, the
system should indicate that storage has been damaged instead of providing incorrect
information.
4. Files, volumes, and other objects should be named with unambiguous low-level names. The
storage system should not anticipate how clients might use these names or what naming
environments will be presented to users.
5. The system should mediate concurrent access to storage to ensure consistency.
6. The system should be decentralized, and the location of storage system objects should be
hidden from clients. The system should also allow clients to discover where objects are
located.
7. The system should allow modular expansion. The storage capacity of the system should not
be limited by any design decision, nor should the design intrinsically limit the number of
users that the system can support.
8. It should be possible to improve the performance, reliability, and availability of the storage
system by keeping multiple copies of selected storage system objects. This principle
includes the idea of making temporary copies of objects for rapid access.
9. It should be possible to reconfigure the system while it is operating. Reconfiguration
involves changing the storage resources that are used to implement a storage system object.
10. A mechanism for information secrecy and authentication should be provided. The
mechanism should be general enough that clients can use it to implement a variety of
protection policies. No one should be able to circumvent the protection mechanism. For
example, system administrators should not be able to access information that they are not
authorized to see.
11. It should be possible to construct derived volumes by extending existing volumes with
replication, reconfiguration, and protection structures. Files created on derived volumes will
use the volume's structure as a template. For example, it should be possible to create a
volume R from three other volumes A, B, and C. When a file F is created on R, copies of
F will be automatically maintained on A, B, and C. This allows popular classes of storage
to be directly represented in the system as volumes. Thus, clients may choose to ignore
what facilities are being used to provide the storage they use.
12. A client should be able to select the resources that it uses in such a way that its processor
can remain autonomous from the rest of the system.
1.6 Summary
A new information storage system was proposed. It is intended to be the foundation of diverse
applications such as office information systems, programming environments, and data base systems.
The design of the system was divided into four stages: definition of architectural principles,
formulation of a system model, system implementation, and system configuration. The paper is
structured according to these stages. The system's architectural principles and their background
were then introduced. Chapter 2 begins our consideration of a system model to realize these
principles.
Chapter 2: Environment
The first step in defining the system model is to outline its presumed environment. This
chapter reviews the characteristics of the hardware components we use, introduces the concepts of
stable storage and unique identifiers, and shows how to perform reliable function evaluation at a
remote processor. The material in this chapter represents an integration of ideas that have been
presented before in various forms. Notably, [Lampson and Sturgis 79] previously introduced a
number of the concepts reviewed here.
2.1 Hardware Components
Three types of hardware components comprise the system: processors, communication channels,
and storage devices. In each of the following sections we discuss the characteristics of each type of
hardware component. The hardware model presented is abstract to the extent that it only concerns
itself with device characteristics that will influence later design decisions.
2.1.1 Processors
A processor corresponds to the familiar notion of a stored program digital computer. Processors
are the active elements in a system, and they operate independently from one another. Processors
only communicate with each other through communication channels. There is absolute protection
between processors in the sense that all a malicious processor can do is send messages, which other
processors can choose to ignore. Depending on application needs, processors may be connected to a
wide variety of peripherals, such as storage devices, bit-map displays, pointing devices, bar-code
readers, and laser printers. Although it is not an absolute necessity, we will assume that all
processors in the system have the following ideal capabilities.
Every processor is assigned an identifier that is distinct from any other processor's identifier.
Processor identifiers can be implemented by including a read-only memory in each processor that
contains its identifier.
Fixed length processor identifiers offer simplicity, but for some
implementations the expansion capability offered by variable length identifiers may be attractive. A
processor's identifier can be discovered with the function GetProcessorID.
GetProcessorID[/id: ProcessorID]
GetProcessorID returns the identifier of the processor that the calling process is using.
In addition, it is assumed that a processor can encrypt and decrypt data, and generate true
random numbers. Encryption can be implemented in software, but for efficient operation with high
security codes it is likely that special purpose hardware will be required. True random numbers are
useful for generating hard to guess cryptographic keys. A true random number is not the output of
a pseudo-random number generator, but is derived from a truly random process such as thermal
noise, shot noise, or radioactive decay.
CHAPTER 2: ENVIRONMENT
2.1.2 Communication
Communication and synchronization are accomplished between processors by sending and
receiving messages. A message is an uninterpreted array of bytes that is sent to a destination
processor. Every processor is assigned an address that includes its processor identifier, so different
processors will never be assigned the same address. However, we assume that if a processor is
physically moved its address will change.
The following functions define the interface to the communication system. Note that Processor
includes a processor identifier and fields that are private to the communication system.
Processor = Extend[Record[id: ProcessorID], PrivateFields];
Send[destination: Processor, message: Byte-Array]
Send transmits message to destination.
Receive[/message: Byte-Array]
Receive delays until a message arrives addressed to this processor, and then returns the
incoming message.
GetMyProcessor[/self: Processor]
GetMyProcessor returns the address of this processor.
The communication system can lose, duplicate, or arbitrarily delay messages. Thus, messages
may arrive out of order, more than once, or not at all. We will assume that messages damaged in
transit are detected and discarded by the communication system, and will appear to have been lost.
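The channel semantics above can be made concrete with a minimal Python sketch of a simulated channel. The class name, the loss and duplication rates, and the use of a digest to model damage detection are illustrative assumptions, not part of the text; the point is only that Send may silently drop or duplicate a message, Receive may deliver messages in arbitrary order, and damaged messages are discarded as if lost.

```python
import hashlib
import random

class UnreliableChannel:
    """Simulated channel: may lose, duplicate, or reorder messages,
    and discards messages damaged in transit (modeled here by
    verifying a digest on receipt)."""

    def __init__(self, loss=0.1, dup=0.1, seed=None):
        self.rng = random.Random(seed)
        self.loss, self.dup = loss, dup
        self.queue = []  # undelivered messages, in arbitrary order

    def send(self, destination, message: bytes):
        if self.rng.random() < self.loss:
            return  # lost in transit, sender is not told
        framed = (destination, hashlib.sha256(message).digest(), message)
        self.queue.append(framed)
        if self.rng.random() < self.dup:
            self.queue.append(framed)  # duplicated in transit

    def receive(self, me):
        # Deliver some queued message for `me` whose digest verifies;
        # shuffling models arbitrary delay and reordering.
        self.rng.shuffle(self.queue)
        for i, (dest, digest, msg) in enumerate(self.queue):
            if dest == me and hashlib.sha256(msg).digest() == digest:
                del self.queue[i]
                return msg
        return None  # nothing deliverable yet
```

A client of this interface must therefore tolerate duplicates and gaps itself, which is exactly the burden the reliable remote evaluation algorithm of Section 2.4 takes on.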
At least one distinguished address, Broadcast, is defined by the communication system so that a
processor can discover things about its environment and begin communicating with other processors.
One or more processors can ask to receive messages addressed to Broadcast. How far broadcast
messages will propagate in the communication system is implementation dependent. Chapter 4
discusses how this facility is used.
Broadcast: Processor
Packet switched networks [Metcalfe 73] are well adapted to providing a full connectivity
network at a moderate cost, and are an attractive method for implementing the proposed
communication system. Local packet switched networks, such as the Ethernet [Metcalfe and Boggs
76], provide high capacity and low delay at low cost for a local area. A treatment of local networks
can be found in [Clark et al. 78].
Local networks can be connected together by gateway processors and communication channels
to form an internetwork [Boggs et al. 80]. An internetwork retains the performance and cost
advantages of a local network, while extending the communication system to accommodate modular
growth. In an internetwork, non-local packets are forwarded through a succession of gateways to
eventually arrive at their destination.
2.1.3 Storage Devices
A storage device is a processor peripheral that stores data. Read and write are the two primitive
operations that are used to access and update storage devices, respectively.
We assume that a storage device will indicate when it has failed to accurately store data. Failure
detection is typically implemented with the help of labels and checksums [Lampson and Sproull 79].
Three fundamental characteristics of storage devices are capacity, latency, and transfer rate. The
capacity of a storage device is the maximum number of bytes that it can store on a single medium.
We have been careful to state maximum, because the amount of data that can be stored is often a
function of the type of media in use. The transfer rate is the maximum number of bytes per
second that can be transferred to or from the device. The latency of a storage device is the average
amount of time required to read or write information, ignoring the time the device is actually
transferring data.
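The two quantities combine into a simple estimate of access time: one average positioning delay plus the transfer itself. The sketch below is a direct restatement of that arithmetic; the disk figures in the example are illustrative and do not come from the text.

```python
def access_time(n_bytes: int, latency_s: float, rate_bytes_per_s: float) -> float:
    """Approximate time to read or write n_bytes: average latency
    plus the time spent actually transferring data."""
    return latency_s + n_bytes / rate_bytes_per_s

# Illustrative numbers: a disk with 30 ms average latency and a
# 1 MB/s transfer rate reading a 4 KB page takes about 34 ms,
# so latency dominates for small transfers.
t = access_time(4096, 0.030, 1_000_000)
```

The example makes the familiar point that for small random transfers latency dominates, which is one reason serial access devices remain attractive for large sequential workloads.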
Storage devices have four additional important characteristics.
1. Some storage devices are designed for random access to data, and some are designed for
serial access. These devices will be called random access and serial access, respectively. A
tape drive is an example of a device designed for serial access to data, and an attempt to
use it in a random access mode would result in poor performance. Although the storage
capacity of random access devices is increasing, serial access devices may continue to have a
role in information storage because of their lower cost.
2. Some storage devices allow data to be read or written, but once data is stored it cannot be
overwritten. Such a device will be called write-once. This property is usually due to the
storage media in use, for example a write-once optical disk.
3. Some storage devices do not allow data to be written. Such a device will be called
read-only. As a safety feature some storage devices can be made temporarily read-only.
4. Some storage devices can record on storage media that are interchangeable among other
storage devices of the same type. Such a device will be said to have removable media.
2.2 Stable Storage
Storage that is resilient to a set of expected failures is called stable storage. From time to time
hardware errors, system errors, or human errors will occur. A stable storage system expects such
errors, and recovers from them without information loss. Furthermore, if unexpected errors occur,
a stable storage system indicates that storage has been damaged instead of providing incorrect
information. Expected hardware failures include storage device transfers that malfunction, and
information on a storage device that decays and becomes unreadable. Transfers that malfunction
include transfers that are in progress when their controlling processor fails.
The write-ahead-log protocol [Gray 78] is widely used to implement stable storage. It
implements stable storage by carefully maintaining two copies of data on devices with independent
failure modes.
[Lampson and Sturgis 79] suggest a similar algorithm.
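The idea of maintaining two copies on devices with independent failure modes can be sketched as follows. This is a minimal illustration in the spirit of [Lampson and Sturgis 79], not the write-ahead-log protocol itself: the file paths, the JSON framing, and the checksum scheme are all assumptions made for the sketch. A torn or decayed copy fails its checksum and is treated as bad, and because the two copies are updated one at a time, a crash between the writes still leaves one intact copy.

```python
import hashlib
import json
import os

def _put(path, data: bytes):
    # Write the record together with a checksum; an interrupted
    # write leaves a record whose checksum fails to verify.
    rec = {"sum": hashlib.sha256(data).hexdigest(), "data": data.hex()}
    with open(path, "w") as f:
        json.dump(rec, f)
        f.flush()
        os.fsync(f.fileno())  # force the copy to the device

def _get(path):
    try:
        with open(path) as f:
            rec = json.load(f)
        data = bytes.fromhex(rec["data"])
        if hashlib.sha256(data).hexdigest() == rec["sum"]:
            return data
    except (OSError, ValueError, KeyError):
        pass
    return None  # missing, decayed, or torn

def stable_write(copy_a, copy_b, data: bytes):
    # Update the copies sequentially, never both at once, so at
    # least one copy survives a crash between the two writes.
    _put(copy_a, data)
    _put(copy_b, data)

def stable_read(copy_a, copy_b):
    data = _get(copy_a)
    return data if data is not None else _get(copy_b)
```

A real implementation must also place the two copies on devices with independent failure modes and periodically scrub them, repairing a bad copy from its good twin before the second copy can decay as well.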
It is difficult to precisely characterize the reliability of stable storage. Manufacturers give bit
error rates for their devices, but these figures do not include catastrophic failures. Examples of
catastrophic failures are head crashes on disk drives, media dropped on the floor by operators, and
media mistakenly erased by users.
We shall call storage that is not protected against failures volatile storage. Stable storage will be
used to record long term system state, and volatile storage will be used for intermediate results.
Volatile storage is naturally much more efficient than stable storage because the same precautions
do not have to be observed.
2.3 Unique Identifiers
A unique identifier is defined to be a value that is distinct from every other unique identifier.
We will assume for simplicity that unique identifiers are system generated nonsensical bit strings,
but unique identifiers could be client generated and sensible.
All unique identifiers do not have to be the same length. When it is known that a unique
identifier is going to be used extensively it may be advantageous to use a shorter identifier. Of
course, there are fewer short identifiers than there are long ones.
A common method for generating a unique identifier is to concatenate a processor identifier
with a locally unique identifier. A locally unique identifier can be implemented by a counter in
stable storage. Whenever a locally unique identifier is required the counter is incremented and
returned. An obvious optimization is to withdraw a sequence of locally unique identifiers from
stable storage at once to reduce delay. Another technique for generating locally unique identifiers is
as follows. At processor initialization time create a variable nextId and set it equal to a calendar
clock. A calendar clock is a clock that holds the current date and time and thus monotonically
increases. Every time a locally unique identifier is requested, increment nextId by one and ensure
that it is less than the calendar clock by pausing if necessary. Now nextId is guaranteed to be
locally unique. Thus, the second scheme does not require stable storage.
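Both steps of the section can be sketched directly: the calendar-clock scheme for locally unique identifiers, and concatenation with a processor identifier to make the result globally unique. The microsecond tick granularity and the 64-bit field widths below are illustrative assumptions, not part of the text.

```python
import time

class ClockUniqueIDs:
    """Locally unique identifiers from a calendar clock: initialize
    nextId from the clock, then increment it on each request, pausing
    if necessary so nextId never runs ahead of the clock. No stable
    storage is needed, because after a restart the clock has already
    passed every previously issued identifier."""

    def __init__(self, clock=time.time):
        self.clock = clock
        self.next_id = int(clock() * 1_000_000)  # microsecond ticks

    def get_unique_id(self) -> int:
        self.next_id += 1
        while self.next_id >= int(self.clock() * 1_000_000):
            time.sleep(0.000001)  # pause until the clock passes nextId
        return self.next_id

def make_unique_id(processor_id: int, local_id: int) -> int:
    # Concatenate a processor identifier with a locally unique
    # identifier, assuming the local part fits in 64 bits.
    return (processor_id << 64) | local_id
```

Successive identifiers from a single generator are monotonically increasing, which is exactly the property GetUniqueID promises below.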
Although unique identifiers are unique in theory, there is a chance that the unique identifier
mechanism could fail and issue duplicate identifiers. Such a failure could result from two
processors that were mistakenly assigned the same processor identifier, or from a malicious client.
The algorithms we present in many instances check for duplicate unique identifiers. However, to
provide a foundation on which to build, we will assume that unique identifiers are in fact unique.
The following function provides unique identifiers:
GetUniqueID[/id: UniqueID]
GetUniqueID returns a unique identifier. On a single processor subsequent unique
identifiers from GetUniqueID are monotonically increasing.
2.4 Reliable Remote Evaluation
2.4.1 Model
Remote form evaluation is the way one processor requests another processor to evaluate a
function and return a result. A remote evaluation is a generalization of what is commonly referred
to as a remote procedure call. [Spector 80] provides a taxonomy of remote operations and their
semantics.
It is possible to provide precise semantics for remote evaluation because evaluation is formally
defined with respect to an environment that binds values to free variables and functions. We will
assume that remote evaluation is done with respect to a default environment that includes the data
types and functions defined in the paper.
To communicate with a remote processor it is first opened. Opening a processor results in a
class that will service Eval requests. Eval will evaluate an EL form at the remote processor. To
make the remote evaluation of forms easier, we restrict forms to be defined by the following
grammar: