CSL 81 8_Information_Storage_in_a_Decentralized_Computer_System 8 Information Storage In A Decentralized Computer System

CSL-81-8_Information_Storage_in_a_Decentralized_Computer_System CSL-81-8_Information_Storage_in_a_Decentralized_Computer_System

User Manual: CSL-81-8_Information_Storage_in_a_Decentralized_Computer_System

Open the PDF directly: View PDF PDF.
Page Count: 153

DownloadCSL-81-8_Information_Storage_in_a_Decentralized_Computer_System CSL-81-8 Information Storage In A Decentralized Computer System
Open PDF In BrowserView PDF
Information Storage in a Decentralized
Computer System
David K. Gifford

Information Storage in a Decentralized
Computer System
by David K. Gifford


June 1981; Revised March 1982

Abstract: See page x.
This report is a slightly revised version of a dissertation submitted to the Department of
Electrical Engineering and the Committee on Graduate Studies of Stanford University in
partial fulfillment of the requirements for the degree of Doctor of Philosophy.

CR categories: 4.33. 4.35. 3.73

Key words and phrases: file system, information storage, reference, transaction, location,
replication, suite, representative, quorum, reconfiguration, protection, cryptographic sealing,
key, seal, unseal.


Xerox Corporation
Palo Alto Research Centers
3333 Coyote Hill Road
Palo Alto, California 94304


Copyright 1981, 1982 by David K. Gifford
All Rights Reserved.

List of Figures vii
List of Tables viii
Preface ix
Abstract x

1. Method and Principles 1
1.1 Introduction 1
1.2 Method 1
1.3 System Description 3
1.4 Background 4
1.5 Architectural Principles 6
1.6 Summary 7

2. Environment 8
2.1 Hardware Components 8
2.1.1 Processors 8
2.1.2 Communication 9
2.1.3 Storage Devices 9
2.2 Stable Storage 10
2.3 Unique Identifiers 11
2.4 Reliable Remote Evaluation 11
2.4.1 Model 11
2.4.2 Algorithm 13
2.5 Summary 20

3. Transactional Storage 21
3.1 Model 21
3.1.1 References 21
3.1.2 Transactions 23
3.1.3 Volumes 26
3.1.4 Files 27
3.1.5 Mutable and Immutable Objects 28
3.2 Basic Algorithms 29
3.2.1 Single Processor Case 29
3.2.2 Decentralized Case 30
3.2.3 Reed's Algorithm 31
3.2.4 Located References 32
3.3 Refinements 35
3.3.1 Lock Compatibility 35
3.3.2 Lock Granularity 36
3.3.3 Lower ,Degrees of Consistency 36
3.3.4 Broken Read Locks 37
3.3.5 Releasing Read Locks 37
3.3.6 Deadlock 37


3.3.7 Checkpoints and Save Points 38
3.3.8 Nested Transactions 38
3.3.9 Stable Storage Failure 38
3.4 Summary 38

4. Location 40
4.1 Indexes 40
4.2 Indirection 41
4.2.1 Model 41
4.2.2 Basic Algorithm 41
4.3 Indirect References 43
4.4 Object Location 46
4.5 Choice References 46
4.6 Summary 48

5. Replication 49
5.1 Suites 49
5.2 Basic Algorithm 52
5.3 Correctness Argument 61
5.4 Refinements 62
5.4.1 Weak Representatives 62
5.4.2 Lower Degrees of Consistency 62
5.4.3 Representative Performance 62
5.4.4 Expressive Power of Suites 62
5.4.5 Size of Replicated Objects 62
5.4.6 Total Replacement 63
5.4.7 Simultaneous Suite Update 63
5.4.8 Releasing Read Locks 63
5.4.9 Updating Representatives in Background 63
5.4.10 Guessing a Representative is Current 63
5.4.11 Postponed Creation of File Representatives 63
5.4.12 An Alternative for Replicated Volumes 64
5.5 Related Work 64
5.6 Summary 64

6. Reconfiguration 65
6.1 Reconfigurable Objects 65
6.2 Basic Algorithm 66
6.3 Refinements 74
6.3.1 Reducing Data Movement 74
6.3.2 Eliminating the Volume Index 74
6.4 Summary 75


7. Protection 76
7.1 Preliminaries 77
7.1.1 Framework 77
7.1.2 Environment 77 Cryptography 77 Threshold Scheme 81 Checksums 82
7.2 Cryptographic Sealing 82
7.2.1 Model 82
7.2.2 Basic Algorithm 85
7.2.3 Strength 93 Correctness Argument 93 Susceptibility to Cryptanalysis 95
7.3 Applications 97
7.3.1 Privilege Establishment 97 Key Rings 97 Encrypted Objects 97 Guarded Objects 98 Protected Volumes 99
7.3.2 Common Protection Mechanisms 100 Capabilities 100 Access Control Lists 102 Information Flow Control 102
7.3.3 Secure Processors 104 Limiting Remote Evaluation 104 Secure Channels 105
7.3.4 Revocation 107 Protected Indirection 109 Revocation Algorithm 109
7.4 Practical Considerations III
7.4.1 Changing Protection Controls III
7.4.2 Authentication in the Large 112
7.4.3 Performance 112
7.4.4 Comments 113
7.4.5 Elimination of Authentication 113
7.5 Comparative Analysis 113
7.6 Summary 114

8. Practical Considerations 115
8.1 Implementation 115
8.1.1 Prototypes 115
8.1.2 Full-Scale Implementations 116
8.2 Configuration 116
8.2.1 Static Configuration 116
8.2.2 Dynamic Configuration 120
8.3 Summary 120


9. Summary of Ideas 121

Appendix A: Exposition Language: EL 123
A.l Language Extensions 123
A.2 Classes 123
A.3 Records 125
AA External Representation 127
A.5 Exception Handling 127
A.6 Processes 128
A.7 Synchronization 128
A.8 Sets 128
A.9 Byte Arrays 129
A.I0 Miscellaneous 129

Appendix B: Storage System Summary 130
Appendix C: The Open Function 133
Bibliography 134
Index 138



Figure 2.1 - Structure of a Processor Class 18
Figure 3.1- Structure ofa Located Class 34
Figure 4.1 - An Indirect Record 42
Figure 4.2 - Structure of an Indirect Class 45
Figure 5.1 - Structure of a Suite Class 54
Figure 6.1 - Structure of a Reconfigurable File or Index 67
Figure 6.2 - Structure ofa Reconfigurable Volume 68
Figure 6.3 - File and Index Reconfiguration Algorithm 70
Figure 6.4 - Step 1 of Volume Reconfiguration: Move Files 71
Figure 6.5 - Step 2 of Volume Reconfiguration: Move File Index 72
Figure 6.6 - Step 3 of Volume Reconfiguration: Change Volume Pointer 73
Figure 7.1- A Passive Protection Mechanism 78
Figure 7.2 - An Active Protection Mechanism 79
Figure 7.3 - Example Sealed Objects 87
Figure 7.4 - Protection System Internal Structure 88
Figure 7.5 - Capability Reference 101
Figure 7.6 - Object Reference: Access Control List 103
Figure 7.7 - Structure of a Secure Located Class 108
Figure 7.8 - A Revocable Capability 110
Figure 8.1 - A Basic Storage Service 117
Figure 8.2 - Client Defined Protected Volumes 118



Table 3.1 - Typical Lock Compatibility Matrix 35
Table 3.2 - Lock Compatibility Matrix with Intention Locks 36
Table 5.1 - Sample Suite Configurations 51



This paper is organized so that it can be used in several ways. It can of course be read in its
entirety for a treatment of the general problem of decentralized information storage. It is also
possible to read an isolated chapter for a treatment of a single topic, such as protection. Each
chapter begins with an outline of its organization and ends with a summary of its contents. If the
reader wishes to look at a chapter in isolation I would also suggest that they read Sections 1.3 and
There is a comprehensive index at the end of the paper. The index is useful to locate
references to a specific concept It also indexes the program text, and thus should help a reader
find the definition of a function or a type.
I have tried to present enough detail so a reader can easily implement the ideas I discuss.
However, it is possible to understand the essence of the ideas without reading program text On a
first reading, a casual reader should probably skip sections titled implementation or refinement
I would like to thank my patient advisers, Dr. Butler Lampson and Prof. Susan Owicki, for
their help and good advice over the past four years. In addition to my advisers, Peter Deutsch,
Jim Gray, Martin Haeberli, Prof. Larry Manning, Gene McDaniel, Alfred Spector, Larry Stewart,
and Howard Sturgis took the time to carefully read chapters of this paper and suggest
improvements. I would also like to thank Bob Taylor for making it possible for me to do this work
at the Xerox Palo Alto Research Center. The work benefited greatly from daily interactions with
colleagues at Xerox.
The Fannie and John Hertz Fellowship that I held while a graduate student gave me the
freedom to pursue this research. The research was supported in part by the Fannie and John Hertz
Foundation, and by the Xerox Corporation.
This work is dedicated to my parents.



This paper describes an architecture for shared information storage in a decentralized computer
system. The issues that are addressed include: naming of files and other objects (naming), reliable
storage of data (stable storage), coordinated access to shared storage (transa~tional storage), location
of objects (location), use of multiple copies to increase performance, reliability and availability
(replication), dynamic modification of object representations (reconfiguration), and storage security
and authentication (protection).
A complete model of the architecture is presented, which describes the interface to the facilities
provided, and describes in detail the proposed mechanisms for implementing them. The model
presents new approaches to naming, location, replication, reconfiguration, and protection. To verify
the model, three prototypes were constructed, and experience with these prototypes is discussed.
The model names objects with variable length byte arrays called references. References may
contain location information, protection guards, cryptographic keys, and other references. In
addition, references can be made indirect to delay their binding to a specific object or location.
The replication mechanism is based on assigning votes to each copy of a replicated object. The
characteristics of a replicated object can be chosen from a range of possibilities by appropriately
choosing its voting configuration. Temporary copies can be easily implemented by introducing
copies with no votes.
The reconfiguration mechanism allows the storage that is used to implement an object to
change while the system is operating. A client need not be aware that an object has been
The protection mechanism is based on the idea of sealing an object with a key. Sealed objects
can only be unsealed with an appropriate set of keys. Complex· protection structures can be created
by using such operators as Key-Or and Key-And. The protection mechanism can be employed to
create popular protection mechanisms such as capabilities, access control lists, and information flow


Chapter 1: Method and Principles

1.1 Introduction
Communication is an essential part of our basic need to cooperate and share with one another.
We have been given the freedom to have distant friends and increased knowledge about our world
by advances in communication technology such as the post office, the telegraph, and the telephone.
Computer systems promise to be another such advance. Large scale community information
systems are likely to playa major role in our future ability to create, organize, process, store, and
share information.
It was once thought that the problem of building large computer systems was that of building
large computers. It is now clear that this is not the case. Instead of employing a single computer,
future large scale computer systems will be composed of thousands, or even millions, of computers.
The goal of this research is to demonstrate an information storage system architecture that can
be used to integrate a collection of computers. By integrate we mean that the architecture permits
information to be easily exchanged between the users of the system. We are not suggesting that
information storage be centralized. Rather, we propose to organize the storage facilities that would
normally exist in a collection of computers into a single decentralized system.
The information storage architecture we present is intended to create a foundation for shared
applications. Example applications include office information systems, programming environments,
and data base systems. The principles of our architecture are best characterized by the desire to
provide fundamental storage facilities that can be flexibly adapted to a wide range of uses.
1.2 Method
Computer system research is more than the invention of new algorithms; part of the work lies
in the synthesis of a collection of ideas into a single package. Furthermore, it is important that a
synthesis be faithful to a single set of coordinated architectural principles. Conceptual integrity
keeps complex interactions from making the system intractable as it increases in size and function.
A clear statement of principal design decisions is central to the overall success of a large system.
The importance of these ideas have been demonstrated by [Belady and Lehman 77] and [Brooks 75].
We suggest a four stage process for system creation that is intended to promote these concepts.
The idea of the process is to emphasize the importance of asking fundamental questions early in the
life of a system, and to postpone secondary decisions. In order, the four stages are:

Define the system's architectural principles. The architectural principles of a system are a
set of primary design decisions that consider technical feasibility [Liddle 76]. These
decisions serve to define and delimit the scope of a system. Furthermore, they allow for
orderly growth by providing a single conceptual framework that can accommodate
extensions in system size and function.
For example, what is the nature of the system? Is it intended to provide a general purpose
computing environment? Or is going to be used exclusively for electronic mail? How large



will the system be? Are different instances of the system going to be interconnected? What
are the capacity and reliability requirements of the system? What are the services that the
system will offer? How will these services be presented to a client? What are the
requirements for accounting? For protection?

Formulate a system model. A system model is a design for the system in line with its
architectural principles. A model describes the system's interfaces and mechanisms in
enough detail that it is possible to reason about the correctness of key algorithms. When
the system is constructed, the system model is used as a pattern.


Implement the system. A system implementation is a concrete set of hardware and software
components that realize a system model. The implementation of a system normally starts
by making a plan for its construction, testing, and documentation. Naturally, there can be
several implementations of a system model. This is important, as over the life of a system
new implementations of parts of a system will cause new and old components to coexist


Plan a system configuration. A system configuration is an installed set of components from
a system implementation. A system implementation represents a wide spectrum of capacity,
reliability, availability, and performance possibilities; a configuration reflects the decisions
made to meet specific needs.

Our research has been organized according to these stages.
The remainder of this chapter treats the architectural principles of our system and their
background. We discuss how time-sharing systems, personal computing systems, and computer
networks have influenced our goals.
The next six chapters describe our system model by successive refinement. Chapter 2 defines
the environment of the system model. Chapter 3 presents a simple storage system. This storage
system would be ideal if one made the following assumptions:

Files, volumes, and other objects never move.


It is never necessary to improve the reliability, availability, or performance characteristics of

storage devices.

It is never necessary to change the storage that is used to store an object.


People are perfectly trustworthy and there is no need for protection.

Chapters 4 through 7 remove these assumptions. We add location (Chapter 4), replication (Chapter
5), reconfiguration (Chapter 6), and protection (Chapter 7) to describe a practical system.
Chapter 8 discusses system implementation and configuration. Three prototypes were built to
test the validity of our system model, and experience with these prototypes is discussed. In view of
these prototypes, we outline our expectations about full scale implementations of the system.
System configuration is discussed briefly, but it is not thoroughly explored.
Chapter 9 concludes the paper with a summary of the major ideas introduced in the paper and
a review of how we have achieved the architectural principles we set forth.



1.3 System Description
This paper includes a large amount of program text The intent of the program text is to
provide the reader with a detailed understanding of the system model. Considered collectively, the
program text is an implementation of the system model. The program text is also intended to be
used as a pattern for full-scale implementations. Thus, the code that we present is called a model
implementation. The functions and types of the model implementation are indexed at the back of
the paper.
The system model is described in EL ("Exposition Language"), a dialect of Lisp 1.5 [McCarthy
et at. 62]. Lisp was chosen because it includes an Eval function, which allowed us to define the
semantics of operations executed at remote processors. In addition, Lisp lends itself to the
transformation of objects to byte strings and back again (see the descriptions of Encode and Decode
in the appendix). Appendix A should provide enough information to enable the reader to
understand the code in the paper. Much of the technical content of the paper is contained in the
program text, and thus we suggest that the reader take the time now to read Appendix A.
To help the reader understand EL we present a short example program fragment The
fragment shown below is not intended to be useful, but it does demonstrate some key EL
constructs. We start by defining three record types: Person, Experience, and Experienced-Person.
Experienced-Person is a derived record type that contains the fields and types of Person and
Experience. The function Open-Person creates a new class that services the operations Name and
Parent Open-Person takes a person record and a person class as inputs. They respectively
represent a person and that person's parent When a Name request is sent to a class, Person-Name
is invoked, and the name of the person is returned. When a Parent request is sent to a class,
Parent-Name is invoked. Parent-Name gets the name of the class' parent from its superclass, which
was set when the class was created.
Every time that we create a class we include a comment that describes the class' instance
variables. For example, Open-Person creates a class, and Person-Name and Parent-Name can use
the instance variable name. The value of instance variables persist over class activations.

Person fa Record[name: Byte-Array];
Experience fa Record[years: Integer];
Experienced-Person fa Extend[person, Experience];
Create-Person[name: Byte-Array / p: Person] +- Prog[ (];
p fa Create[person];
p.name fa name;
Open-Person[p: Person, parent: Person-Class / c: Person-Class]
[name: Byte-Array];
-- copy name
name fa p.name;
create a new class
c fa Create-Class[List[
'Name, 'Person-Name,
'Parent, 'Parent-Name], parent];





-- Instance variables: name
Person-Name[/n: Byte-Array] .- Prog[ [];
-- Person-Name is evaluated in the environment of Open-Person
Parent-Name[/n: Byte-Array] +- Prog[ [];
-- ask superclass for its name
n +- superclass I N ame[];
frank-class .- Open-Person[Create-Person[ttFrank tt ], NIL];
alfred-class .- Open-Person[Create-Person["Alfred tt ], frank-class];
-- dad will be "Frank"
dad +- alfred-class I Parent[];
We observe the following stylistic contentions in program text Variables always begin with
lower-case letters. Function names and record types are capitalized. Whenever a new object is
introduced, we follow the same order of presentation as we did in our example. First, we introduce
the operation to create an object instance. Second, we describe an tt open" function that returns a
class that will service an object instance. Third, we present the functions that actually implement
the class' operations.
Let us define some terms that we will use repeatedly throughout this paper. A client is a
program that uses the facilities we describe, and a user is a human being that interacts with a client
The reliability of a system is a measure of the probability that the system will malfunction, and a
system's availability is a measure of the probability that it will be operational when it is needed.
1.4 Background

Early in the 1960's time-sharing was introduced as a way of providing the illusion of a personal
computer to aid in program debugging. Time-sharing systems turned. out to provide another benefit
that was not originally anticipated. Users found they could easily share information that was stored
in a time-sharing system. Sharing proved to be easy because it was as if a single file cabinet
simultaneously existed in every user's office. Items placed in one file cabinet immediately appeared
in the rest of the file cabinets. The facilities for information sharing provided by time-sharing soon
found use in large collaborative software projects.
As time-sharing matured, sharing was recognized as a basic facility, and made correspondingly
convenient The crss system [Corbato et al. 62] pioneered multiple access computers, and provided
a simple shared file system. Based on this experience, Multics [Corbato et al. 72] extended its file
system to include a tree-structured naming system and advanced protection facilities.
Late in the 1970's hardware became inexpensive enough that users could be provided their own
computers [Thacker et al. 79]. Placing a large amount of computational power at the man-machine



interface dramatically improved its character. For example, graphics could be introduced, adding an
important element to the domain of things that computers could be used for. Communication
networks [Metcalfe and Boggs 76] allowed the users of personal computers to share resources such
as high-speed printers. However, sharing was not made as convenient as it was in time-sharing
This research proposes to combine the success of time-sharing's shared storage and the success
of personal computing's man-machine interface into a single system. Unfortunately, it is not
sufficient to simply add conventional storage to personal computing systems. In a decentralized
environment such problems as coordination, protection, reliability, availability, and performance
become much more complicated. For example, in a decentralized system when information is
transferred between computers iover an insecure channel it must be ~ncrypted to provide protection.
Furthermore, people's expectations have properly increased. Computers are being used for an
increasingly diverse spectrum of applications, and many computer users are no longer computer
professionals. Our understanding of these requirements is reflected in the principles of our
A number of systems have been built that share our primary goal of integrating a collection of
computers with a shared information storage system. These systems fall into three broad categories.

Existing time-sharing systems have been modified to access remote files. The RSEXEC
[Thomas 73] system was an early attempt to join TEN EX systems in this manner, and the
RSEXEC approach was later adopted by the National Software Works [Forsdick et a1. 77].
The Locus project [popek et al. 81] at UCLA has integrated a number of UNIX systems in
a similar manner. Locus includes facilities for mediating concurrent access to information,
and there are plans to incorporate replicated data as well. These systems all have major
restrictions that are rooted in their time-sharing origins. In addition, they have as a general
rule adopted ad hoc solutions to the intrinsic and environmental problems they faced.
Thus, they do not provide a general framework of the sort we propose.


Data base systems have been extended to operate on several computers that are connected
by a network. Examples of such systems are CICS ISC [IBM 80a] and Tandem Computer's
Encompass fTandem 81]. These systems are intended to provide a specialized service. In
addition, they have not provided general solutions to many of the problems that we


File servers have been constructed for local computer networks, and these file servers have
been used by client computers for shared storage. WFS, and its successor, IFS, are two
such file servers [Swinehart et al. 79]. A file server at the University of Cambridge has
been successfully used as the only storage service of a time-sharing system [Dion 80]. The
Xerox Distributed File System [Israel et a1. 1978] provides facilities for guaranteeing
information consistency across file servers, and sophisticated facilities for failure recovery.
These file servers are more general than the time-sharing based efforts, and motivated the
system we propose. However, the scope of these servers is limited, and they do not address
many of the problems that we consider.



1.5 Architectural Principles
The following twelve architectural principles describe and delimit the scope of our ideal storage
system. We will return to this list at the end of the paper to review how they have been satisfied

The system should behave in a well defined way. In a large system design there are many
opportunities for selecting a mechanism that works most of the time. Such a mechanism
can only be employed in conjunction with a backup mechanism that is expected to work all
of the time. For example, certain existing systems will undetectably malfunction in unusual
circumstances. We will not consider such designs.


The system should provide a basic storage service. The basic unit of storage should be the
file, an uninterpreted array of bytes. Read and write primitives should be provided to
access files. The notion of a volume should also be provided to model storage media. Files
are created and stored on volumes.


Storage should be resilient to expected failures. From time to time hardware errors, system
errors, or operator errors will occur. The storage system should expect such errors, and
recover from them without information loss. Furthermore, if unexpected errors occur, the
system should indicate that storage has been damaged instead of providing incorrect
in formation.


Files, volumes, and other objects should be named with unambiguous low-level names. The
storage system should not anticipate how clients might use these names or what naming
environments will be presented to users.


The system should mediate concurrent access to storage to ensure consistency.


The system should be decentralized, and the location of storage system objects should be
hidden from clients. The system should also allow clients to discover where objects are


The system should allow modular expansion. The storage capacity of the system should not
be limited by any design decision, nor should the design intrinsically limit the number of
users that the system can support.


It should be possible to improve the performance, reliability, and availability of the storage
system by keeping multiple copies of selected storage system objects. This principle
includes the idea of making temporary copies of objects for rapid access.


It should be possible to reconfigure the system while it is operating. Reconfiguration
involves changing the storage resources that are used to implement a storage system object

10. A mechanism for information secrecy and authentication should be provided. The
mechanism should be general enough that clients can use it to implement a variety of
protection policies. No one should be able to circumvent the protection mechanism. For
example, system administrators should not be able to access information that they are not
authorized to see.
11. It should be possible to construct derived volumes by extending existing volumes with



replication, reconfiguration, and protection structures. Files created on derived volumes will
use the volume's structure as a template. For example, it should be possible to create a
volume R from three other volumes A, B, and C. When a file F is created on R, copies of
F will be automatically maintained on A, B, and C. This allows popular classes of storage
to be directly represented in the system as volumes. Thus, clients may choose to ignore
what facilities are being used to provide the storage they use.
12. A client should be able to select the resources that it uses in such a way that its processor
can remain autonomous from the rest of the system.
1.6 Summary

A new information storage system was proposed. It is intended to be the foundation of diverse
applications such as office information systems, programming environments, and data base systems.
The design of the system was divided into four stages: definition of architectural principles,
formulation of a system model, system implementation, and system configuration. The paper is
structured according to these stages. The system's architectural principles and their background
were then introduced. Chapter 2 begins our consideration of a system model to realize these


Chapter 2: Environment

The first step in defining the system model is to outline its presumed environment This
chapter reviews the characteristics of the hardware components we use, introduces the concepts of
stable storage and unique identifiers, and shows how to perform reliable function evaluation at a
remote processor. The material in this chapter represents an integration of ideas that have been
presented before in various forms. Notably [Lampson and Sturgis 79] previously introduced a
number of the concepts reviewed here.

Hardware Components

Three types of hardware components comprise the system: processors, communication channels,
and storage devices. In each of the following sections we discuss the characteristics of each type of
hardware component The hardware model presented is abstract to the extent that it only concerns
itself with device characteristics that will influence later design decisions.
2.1.1 Processors
A processor corresponds to the familiar notion of a stored program digital computer. Processors
are the active elements in a system, and they operate independently from one another. Processors
only communicate with each other through communication channels. There is absolute protection
between processors in the sense that all a malicious processor can do is send messages, which other
processors can choose to ignore. Depending on application needs, processors may be connected to a
wide variety of peripherals, such as storage devices, bit-map displays, pointing devices, bar-code
readers, and laser printers. Although it is not an absolute necessity, we will assume that all
processors in the system have the following ideal capabilities.
Every processor is assigned an identifier that is distinct from any other processor's identifier.
Processor identifiers can be implemented by including a read-only memory in each processor that
contains its identifier.
Fixed length processor identifiers offer simplicity, but for some
implementations the expansion capability offered by variable length identifiers may be attractive. A
processor's identifier can be discovered with the function GetProcessorID.
GetProcessorID[/id: ProcessorID]
GetProcessorID returns the identifier of the processor that the calling process is using.
In addition, it is assumed that a processor can encrypt and decrypt data, and generate true
random numbers. Encryption can be implemented in software, but for efficient operation with high
security codes it is likely that special purpose hardware will be required. True random numbers are
useful for generating hard to guess cryptographic keys. A true random number is not the output of
a pseudo-random number generator, but is derived from a truly random process such as thermal
noise, shot noise, or radioactive decay.



2.1.2 Communication

Communication and synchronization are accomplished between processors by sending and
receiving messages. A message is an uninterpreted array of bytes that is sent to a destination
processor. Every processor is assigned an address that includes its processor identifier, so different
processors will never be assigned the same address. However, we assume that if a processor is
physically moved its address will change.
The following functions define the interface to the communication system. Note that Processor
includes a processor identifier and fields that are private to the communication system.
Processor ... Extend[Record(id: ProcessorID], PrivateFields];
Send[destination: Processor, message: Byte-Array]
Send transmits message to destination.
Receive[/message: Byte-Array]
Receive delays until a message arrives addressed to this processor, and then returns the
incoming message.
GetMyProcessor[/self: Processor]
GetMyProcessor returns the address of this processor.
The communication system can lose, duplicate, or arbitrarily delay messages. Thus, messages
may arrive out of order, more than once, or not at all. We will assume that messages damaged in
transit are detected and discarded by the communication system, and will appear to have been lost
At least one distinguished address, Broadcast, is defined by the communication system so that a
processor can discover things about its environment and begin communicating with other processors.
One or more processors can ask to receive messages addressed to Broadcast. How far broadcast
messages will propagate in the communication system is implementation dependent. Chapter 4
discusses how this facility is used.
Broadcast: Processor
Packet switched networks [Metcalfe 73] are well adapted to providing a full connectivity
network at a moderate cost, and are an attractive method for implementing the proposed
communication system. Local packet switched networks, such as the Ethernet [Metcalfe and Boggs
76], provide high capacity and low delay at low cost for a local area. A treatment of local networks
can be found in [Clark et al. 78].
Local networks can be connected together by gateway processors and communication channels
to form an internetwork [Boggs et al. 80]. An internetwork retains the performance and cost
advantages of a local network, while extending the communication system to accommodate modular
growth. In an internetwork, non-local packets are forwarded though a succession of gateways to
eventually arrive at their destination.
2.1.3 Storage devices
A storage device is a processor peripheral that stores data. Read and write are the two primitive

operations that are used to access and update storage devices, respectively.


We assume that a


storage device will indicate when it has failed to accurately store data. Failure detection is typically
implemented with the help of labels and checksums [Lampson and Sproull 79].
Three fundamental characteristics of storage devices are capacity, latency, and transfer rate. The
capacity of a storage device is the maximum number of bytes that it can store on a single medium.
We have been careful to state maximum, because the amount of data that can be stored is often a
function of the type of media in use. The transfer rate is the maximum number of bytes per
second that can be transferred to or from the device. The latency of a storage device is the average
amount of time required to read or write information ignoring the time the device is actually
transferring data.
Storage devices have four additional important characteristics.

Some storage devices are designed for random access to data, and some are designed for
serial access. These devices will be called random access and serial access, respectively. A
tape drive is an example of a device designed for serial access to data, and an attempt to
use it in a random access mode would result in poor performance. Although the storage
capacity of random access devices is increasing, serial access devices may continue to have a
role in information storage because of their lower cost


Some storage devices allow data to be read or written, but once data is stored it can not be
overwritten. Such a device will be called write-once. This property is usually due to the
storage media in use, for example a write-once optical disk.


Some storage devices do not allow data to be written. Such a device will be called readonly. As a safety feature some storage devices can be made temporarily read-only.


Some storage devices can record on storage media that are interchangeable between other
storage devices of the same type. Such a device will be said to have removable media.

2.2 Stable Storage
Storage that is resilient to a set of expected failures is called stable storage. From time to time
hardware errors, system errors, or human errors will occur. A stable storage system expects such
errors, and recovers from them without information loss. Furthermore, if unexpected errors occur,
a stable storage system indicates that storage has been damaged instead of providing incorrect
information. Expected hardware failures include storage device transfers that malfunction, and
information on a storage device that decays and becomes unreadable. Transfers that malfunction
include transfers that are in progress when their controlling processor fails.
The write-ahead-log protocol [Gray 78] is widely used to implement stable storage. It
implements stable storage by carefully maintaining two copies of data on devices with independent
failure modes.
[Lampson and Sturgis 79] suggest a similar. algorithm.
It is difficult to precisely characterize the reliability of stable storage. Manufacturers give bit
error rates for their devices, but these figures do not include catastrophic failures. Examples of
catastrophic failures are head crashes on disk drives, media dropped on the floor by operators, and
media mistakenly erased by users.
We shall call storage that is not protected against failures volatile storage. Stable storage will be
used to record long term system state, and volatile storage will be used for intermediate results.



Volatile storage is naturally much more efficient than stable storage because the same precautions
do not have to be observed
2.3 Unique Identifiers
A unique identifier is defined to be a value that is distinct from every other unique identifier.
We will assume for simplicity that unique identifiers are system generated nonsensical bit strings,
but unique identifiers could be client generated and sensible.
All unique identifiers do not have to be the same length. When it is known that a unique
identifier is going to be used extensively it may be advantageous to use a shorter identifier. Of
course, there .are fewer short identifiers than there are long ones.
A common method for generating a unique identifier is to concatenate a processor identifier
with a locally unique identifier. A locally unique identifier can be implemented by a counter in
stable storage. Whenever a locally unique identifier is required the counter is incremented and
returned. An obvious optimization is to withdraw a sequence of locally unique identifiers from
stable storage at once to reduce delay. Another technique for generating locally unique identifiers is
as follows. At processor initialization time create a variable nextId and set it equal to a calendar
clock. A calendar clock is a clock that holds the current date and time and thus monotonically
increases. Every time a locally unique identifier is requested increment nextld by one and ensure
that it is less than the calendar clock by pausing if necessary. Now nextId is guaranteed to be
locally unique. Thus, the second scheme does not require stable storage.
Although theoretically unique identifiers are unique, there is a chance that the unique identifier
mechanism could fail and issue duplicate identifiers. Such a failure could result from two
processors that were mistakenly assigned the same processor identifier, or from a malicious client
The algorithms we present in many instances check for duplicate unique identifieTh. However, to
provide a foundation on which to build, we will assume that unique identifiers are in fact unique.
The following function provides unique identifiers:
GetUniqueID[/id: UniqueID]
GetUniqueID returns a unique identifier. On a single processor subsequent unique
identifiers from GetUniqueID are monotonically increasing.
2.4 Reliable Remote Evaluation

2.4.1 Model
Remote form evaluation is the way one processor requests another processor to evaluate a
function and return a result A remote evaluation is a generalization of what is commonly referred
to as a remote procedure call. [Spector 80] provides a taxonomy of remote operations and their
It is possible to provide precise semantics for remote evaluation because evaluation is formally
defined with respect to an environment that binds values to free variables and functions. We will
assume that remote evaluation is done with respect to a default environment that includes the data
types and functions defined in the paper.
To communicate with a remote processor it is first opened. Opening a processor results in a



class that will service EWlI requests. Eval will evaluate an EL form at the remote processor. To
make the remote evaluation of fonns easier, we restrict fonns to be defined by the following

Function[ ... 1 Quote[ where is a local variable. and Connection "" ~ address: I Processor 0 I I 0 id: I Unique Identifer sequence: I Integer I I I Connections Set 0 0 HeldResults Set Structure of a Processor Class Figure 2.1 18 CHAPTER 2: ENVIRONMENT SL: Lock +- CreateLock(]; Get-Next-Sequence[c: Connection I next: Connection] +- Critical[SL, 'Prog[ (]; -- increment connection sequence number c.sequence +- c.sequence + 1; Return[c]; ]]; Bad-Sequence[cl: Connection, c2: Connection I bad: Boolean] +- Critical[SL, 'Prog[ (]; IF cl.sequence#c2.sequence THEN Returnrn; Return[NIL]; ]]; The following functions are executed by remote processors in support of rp Connections: Set +- Create-Set[]; OpenConnection[1 c: Connection] +- Prog [(]; -- create a new connection c +- Create[Connection]; c.sequence +- 1; c.id +- GetUniqueID(]; c.address +- GetMyProcessor(]; Set-Insert[Connections, c.id, c]; Return[c]; ]; HeldResults: Set +- Create-Set(]; Result +- Record[value: Any]; Remember-Eval[c: Connection, form: Form I result: Any] +- Prog [ [connection: Connection; rr: Result]; rr +- Set-Lookup[HeldResults, c]; -- if we still have result from previous Eval, return it IF Not[Null[rr]] THEN Return[rr.value]; connection +- Set-Lookup[Connections, c.id]; -- if connection unknown, assume that we have crashed IF NUll[connection] THEN Return[Error['ProcessorFailure]]; -- make sure this is the next in the sequence IF Bad-Sequence[connection, c] THEN Return[Error['Discard]]; -- create a record so we can distinguish the following cases: -- a) a result of NIL -- b) a result that is not in HeldResults rr +- Create[Result]; -- evaluate the form rr.value +- Eval[form]; Set-Insert[HeldResults, c, rr]; Get-Next-Sequence[connection]; 19 I Eval. CHAPTER 2: ENVIRONMENT Retum[rr. value]; ]; DeleteResult[c: Connection] +- Prog [[]; -- originating processor has received result Set-Delete[HeldResults, c]; ]; CloseConnection[c: Connection] +- Prog [[]; -- originating processor closed connection Set-Delete[Connections, c.id]; ]; 2.5 Summary This chapter introduced and characterized three types of hardware components: processors, communication, and storage devices. Stable storage was introduced, and identifiers that are unique over processors and time were demonstrated. It was shown how to perform remote form evaluation exactly once with processor failure detection. Exercises 1. Assume that you have a remote processor class, rp. Give a statement that will determine the value of x in the environment at rp. 2. Assume that your processor provides you with a stream of uncorrelated but biased bits. Give an algorithm that will remove the bias (hint: the algorithm will not produce unbiased bits at a fixed rate) [von Neumann 51]. 3. Give an algorithm for implementing stable storage. Assume that the basic unit of storage is a page, failures are independent, and the probability that a storage device remembers a page after time t is pr(t) = e- At. What is pr(t) for your algorithm? How could you do better? 20 Chapter 3: Transactional Storage This chapter presents a basic set of facilities for shared information storage in three stages. The first section of the chapter introduces transactional storage, the second section discusses algorithms for implementing such storage, and the final section outlines some refinements. The information storage system described in this chapter would be ideal if storage system objects never moved, storage did not need its properties improved, there was no need to reconfigure storage, and people were perfectly trustworthy. Each of these problems is the subject of a subsequent chapter. 3.1 Model Users do not want the inconsistent intermediate results they store to be misinterpreted by others, and thus inherent with the concurrent sharing of storage is the need for coordination. To this end we extend the notion of stable storage to create what we will call transactional storage. Transactional storage solves the problem of coordination by providing a client with the illusion that there is no other activity in the system. This illusion is achieved by the use of transactions, which are a basic unit of concurrency control and error recovery. The following sections define a transactional storage system in terms of an an ideal set of generic objects and operations. We will call objects that store information files, and objects that store files volumes. Volumes are intended to model storage devices. Not all objects have to support all operations, and it is possible to create specialized types of files and volumes. For example, an indexed file could be defined that implemented storage and retrieval by key. Extensions to the basic facilities are provided in subsequent chapters, and we expect that clients will also create facilities tailored to their needs. A further property of our transactional storage system is that it hides the physical location of a resource from its clients. Clients will typically use objects without knowing where they are. The facilities of a transactional storage system are introduced in the following five sections. In order, the topics covered are naming, transactions, volumes, files, and immutable objects. 3.1.1 References All objects· we define are named by references. A reference contains a unique identifier that unambiguously identifies its referent References can be extended to contain such things as location hints, cryptographic keys, indirect pointers, and other useful things. We have already introduced processor references in Chapter 2 with the type Processor. Reference .... Record[id: UniqueID); Using references, clients can implement naming systems tailored to their users' needs. A general model for a naming system is a function that maps context dependent specifications into references. For example, a data base could be employed to resolve queries into references. A query might be "show me the file I created last Tuesday". Alternatively, objects could be selected from a menu of icons on a terminal screen. In such a system screen coordinates would be resolved to references. 21 CHAPTER 3: TRANSACfIONAL STORAGE Records are used to construct references. As described in Appendix A, record instances can be converted to and from arrays of bytes by the operations Encode and Decode. Thus references can be saved in files. Clients can not predict what fields will be in a reference, and thus should store references verbatim. As a matter of convention, the type of a reference will correspond to the type of its referent Thus, the function Is (Appendix A) can be used to determine the type of object that a reference names. Every object has a reference-count, which corresponds to the number of counted references to the object An object can also have uncounted references, which do not affect its reference-count The storage for an object is not released until all of its counted references are destroyed. When the storage for an object is released we say the object is deleted. If a reference is counted, care must be taken to ensure that it is not lost, and that it is properly destroyed when it is no longer needed. Clients, of course, are not bound to use the full generality of the reference-counting mechanism. They could limit the number of counted.references to an object to one, in which case destroying a counted reference would function like delete-object Counted references include the Counted type: Counted +- Record[); Reference-counts are provided as a convenience to help with, but not completely solve, the problem of reclaiming objects that are no longer wanted. Reference-counts have traditionally been somewhat suspect, but there is a good chance that they will be more reliable when they are maintained in transactional storage instead of volatile storage. Garbage collection can also be used to reclaim unwanted objects. An asynchronous garbage collection scheme has been successfully used in a system that keeps systematic track of all counted references [Garnett and Needham 80]. However, as the size of a system grows, it is probably not wise to assume that garbage collection will always be a feasible alternative. Furthennore, when protection is added to a system, a garbage collector may not be authorized to enumerate all of the references in the system. There is a well known accounting problem with reference-counts. An object may exist for an undefined time after the user that is paying for its storage is interested in it Provisions could be made to allow users that fund objects to delete them, but the semantics of reference-counts would be destroyed. It is possible that a configuration will have enough storage capacity that an object will never have to be deleted. As we shall see in Chapter 8, in this case reference-counts are still useful to automatically reclaim unused copies of objects that are stored on the configuration's limited amount of high performance storage. If a read-only reference is used to refer to an object, no operations can be issued that would change the object in any way. Read-only references provide a client with a form of self-protection. Read-only references include the type Read-Only. Add-Type (Appendix A) can be used to add the type Read-Only to a reference. Read-Only +- Record[); All of the operations on objects in the system are implemented by classes. To acquire a class that will service operations for a given object, a reference for the object is opened with the function Open. Open tests the type of its input argument and then evaluates a function that is specialized to 22 CH~R3:TRANSACTIONALSTORAGE deal with the presented type of reference. For example, if Open is applied to a processor reference, Open-Processor will be evaluated. Appendix C contains a complete description of Open. The following operations are supported by all objects. Open[ref: Reference, tc: TC, ring: List[Key], guards: List[Key] I c: Class] Open returns a class that will service requests to ref. All operations that the class performs will be part of transaction class tc, as described in Section 3.1.2. The precise initialization steps that Open takes depend on the reference's type. If a reference's referent can not be found, Error['NotFound] is returned. The third and fourth arguments to Open, ring and guards, are described in the chapter on protection (Chapter 7). c: Class I Close[] Close deactivates a class. Open references should be closed after they are no longer being used. c: Class I GetID[1 id: UniqueID] GetID returns the unique identifier of the object serviced by c. c: Class I GetTransactionClass[1 tc: TC] GetTransactionClass returns the transaction class of class c. c: Class I CopyReference[counted: Boolean I ref: Reference] An open reference can be copied, and a client can specify whether it wishes the copy to be counted or uncounted. Making a counted reference is the only way to increment the reference-count of an object c: Class I DestroyReference[1 deleted: Boolean] When a reference is no longer needed, it should be destroyed. The only way to decrement the reference-count of an object is to destroy one of its counted references. DestroyReference returns T if the object's reference-count went to zero. If the reference count of the object went to zero, the object is deleted when c is closed. 3.1.2 Transactions Function evaluations that access transactional storage, also called actions, are grouped into disjoint sets called transactions. A transaction is defined to have three properties. Totality. Totality ensures that all of the actions in a transaction will either occur exactly once or not at all. Because transactional storage is built from stable storage, totality also guarantees that if a transaction appears to occur its effects will be resilient to hardware failures. Serial Consistency. Serial consistency makes it appear to a transaction that that there is no other simultaneous activity in transactional storage. To formalize the notion of serial consistency we will adapt some ideas from [Eswaran et a1. 76]. As we have stated, a transaction is a set of actions. For simplicity, we will only 23 CHAPTER 3: TRANSACnONAL STORAGE consider read and write actions, and we will assume that a transaction performs at most one read and one write action on a data element The question of serial consistency arises when a transactional storage system concurrently executes actions drawn from several transactions. Imagine two simple transactions: T2 Tl (a2l) ReadX (all) Read Y (a22) Read Y (a12) Write X (a23) Write X (a13) Write Y The order in which the actions of Tl and T2 are processed is called a schedule. A schedule is an arbitrary interleaving of the actions of a set of transactions into a single sequence. A serial schedule results when transactions are executed one at a time to completion. Thus, there are two possible serial schedules for our example: Sa: {all, a12, a13, a2l, a22, a23} Sb: {a2l, a22, a23, all, a12, a13}. A schedule generates a dependency relation. A dependency relation describes transactions that depend on one another. If S is a schedule, then is a member of DEP(S) if: Ta and T b are distinct transactions 3j ETa and a.i ETb 3j and a.i are actions from S 3j occurs before a.i in S 3j is a write action to a data element e there are no other write actions to e between 3j and a.i a.i is a read or write action to e. In other words, is in DEP(S) if Ta updates data element e that T buses. If two schedules Sand S* have identical dependency relations, DEP(S) = DEP(S*), then they provide each transaction with the same inputs and outputs. Thus, two distinct schedules that have identical dependency relations are said to be equivalent. A system provides serial consistency if it guarantees that the schedule it will use to process a set of transactions is equivalent to one of the possible serial schedules. External Consistency. External consistency guarantees that a transaction will always receive current information. Using the concepts we have just introduced, we can provide a formal definition of external consistency. The actual time order in which transactions complete defines a unique serial schedule. This serial schedule is called the external schedule. A system is said to provide external consistency if it guarantees that the schedule it will use to process a set of transactions is equivalent to its external schedule. Let's consider a classical example to see why these properties are important in practice. Imagine that a bank teller requests that the money in a customer's savings account be transferred to his checking account The bank's computer system would begin a transaction, and evaluate functions referenced to this transaction to withdraw the balance of the customer's savings account 24 CHAPTER 3: TRANSACTIONAL STORAGE and deposit it in his checking account Totality ensures that either both the withdrawal and deposit will take place, or neither of them will. Serial consistency eliminates problems that might result from concurrency. For example, it prevents two independent transactions from observing the same balance, and then simultaneously withdrawing it Finally, external consistency ensures that if the customer goes to another teller to cash a check, the teller will see the customer's new checking account balance. A transaction is active until it ends by either committing or aborting. If it commits, the effects of its actions will become permanent and public. If it aborts, the effects of its actions will be undone. If a client does not wait for an action to complete before requesting its transaction to commit, the action mayor may not occur. Furthermore, a transaction can commit with outstanding uncompleted reads. Normally a client controls the destiny of a transaction, but transactions can be spontaneously aborted by the system to cope with exceptional conditions such as hardware failure or the inability to guarantee consistency. In addition to performing actions in transactional storage, a transaction can cause real actions to occur [Gray 78]. A real action is an action that once done, can not be undone. Examples of real actions are launching a missile or dispensing cash from an automated teller. Totality is guaranteed for real actions as well as for updates to transactional storage. For example, assume a withdrawal transaction from an automated teller updates transactional storage (to debit the customer's account) and dispenses money from the teller machine. Totality guarantees that either both of these actions will occur, or neither of them will. Committing a transaction is a real action. The only way to undo the effects of a comitted transaction is to execute a compensating transaction. If compensation is possible, precisely how it can be accomplished is highly application dependant For example, it is very difficult to recall a missile that has been launched. A benefit of the all-or-nothing nature of transactions is that they can be used to insulate remote evaluations from processor failures. We assume that every evaluation of a transactional storage system function is associated with a transaction. If a processor fails while it is executing such an evaluation, it is sufficient to abort the evaluation's transaction. All of the effects of the evaluation will be discarded. Furthermore, if orphans later try to execute actions referenced to the aborted transaction, they will not do any damage. A coordinator is an object that creates and manages transactions. When a coordinator is opened, the transaction parameter to Open is ignored. The interface to coordinators is as follows: Transaction +- Extend[Reference, Record[]]; Coordinator +- Extend[Reference, Record[]]; cc: Coordinator-Class I Create-Transaction[1 t: Transaction] Create-Transaction creates a new transaction and returns a reference for it. The transaction reference returned is not counted. However, if a counted reference for the transaction is made, the transaction will persist after it is committed or aborted. Abort and Commit erase transactions that have a zero reference-count Once a transaction has been created, it can be opened. The transaction parameter to Open is ignored when a transaction is opened. The type of class that results from opening a transaction will 25 CHAPTER 3: TRANSACTIONAL STORAGE be often referred to as a TC, for transaction class. The operations that a transaction provides are as follows: tc: Transaction-Class I Commit[] Commit causes the effects of a transaction to be made permanent and pUblic. tc: Transaction-Class I Abort[] Abort erases any effects of a transaction and makes it appear as if the transaction had never existed. tc: Transaction-Class I GetStatus[/ status: Atom] GetStatus returns C, A, or NIL, for a transaction that is committed, aborted, or active, respectively. Clients that hold information derived from an active transaction are called participants. A buffer manager is an example of a participant. At commit, a participant should ensure that all updated information has been output to transactional storage. At commit or abort, a participant may wish to invalidate information that was derived from the transaction. Because the transaction is no longer active, the derived information is no longer guaranteed to be current. tc: Transaction-Class I AddParticipant[commit, abort: Global-Function / id: UniquelD] A client can be informed when a transaction commits or aborts by calling AddParticipant. The functions provided by the client will be applied to the specified transaction's reference just before the transaction commits or aborts, respectively. The final disposition of a transaction is not guaranteed when commit or abort are called. Thus, the commit function can not be used to indicate success to a user. Only a successful return from tc I Commit indicates that a transaction has sucessfully committed. tc: Transaction-Class I DeleteParticipant[id: UniquelD] D~leteParticipant deletes a participant. The participant to be deleted is specified by a unique identifier that was returned from AddParticipant. It is possible that a client will request a transaction to commit and then its processor will fail before it can find out what happened. A solution to this problem adopted by some systems is to use real actions to indicate success. Alternatively, a client can keep a counted reference for the transaction, and use GetStatus to discover what happened when its processor recovers. 3.1.3 Volumes A volume is a file-storage service that is independent of a specific hardware realization. A volume can be implemented by a fraction of a storage device, one storage device, or more than one storage device. The details of how a physical medium are prepared for use are beyond the scope of this work, and we simply assume that a mechanism exists that will create a new volume and return a reference for it The storage technology used to implement a volume influences its properties. We call the volumes implemented by serial devices and write-once devices serial volumes and write-once 26 CHAPTER 3: TRANSACTIONAL STORAGE volumes, respectively. A volume may be both serial and write-once. Serial and write-once volumes influence the properties of the files stored on them, as described in the next section. A volume is named by a volume reference, and serial and write-once volumes are further distinguished by unique types of volume references. The types that are used to implement these references are: Volume ~ Extend[Reference, Record[]); Serial ~ Record[]; Serial-Volume ~ Extend[Volume, Serial]; Write-Once ~ Record[]; Write-Once-Volume ~ Extend[Volume, Write-Once]; Users may choose not to rely on reference-counts to determine when a volume should be retired, and thus an implementation of volumes may choose not to implement the referenceHowever, automatic volume counting semantics of CopyReference and Destroy Reference. reclamation may be appropriate for some storage devices [IBM 80b). Create-File is the only operation that all volumes must support. However, a volume might choose to implement an operation to return its remaining storage capacity, or an operation to enumerate the files that it holds. 3.1.4 Files A file is a variable-length array of bytes that can be read and written. The length of a file can be set, and the system mayor may not remember information past the declared end of a file. A file is initialized to contain all zero bytes, and an attempt to read past the end of file will return zero bytes. If data is written beyond the end of a file, the file will be automatically extended and have its length adjusted. Like other objects, a file is named with a reference. A file reference contains a reference for its containing volume. A file reference may also contain other things, such as storage device addresses, but as we described in the section on references, the job of a client is simply to store references verbatim. A file reference is defined as: File ~ Extend[Reference, Record[volume: Volume)); Files that are stored on serial and write-once volumes have special properties and reference types. If a file is stored on a serial volume, then non-serial access to the file is very expensive. If a file is stored on a write-once volume, then it can only be written once. It is unlikely that clients will have to deal directly with serial or write-once files. In Chapter 8 we discuss briefly how to obtain the cost and performance benefits of serial and write-once devices while presenting a normal file interface. Serial and write-once file references are defined as: Serial-File ~ Extend[File, Serial]; Write-Once-File ~ Extend[File, Write-Once]; The following functions define the file operations. Files can be created, read, written, and they can have their length changed. 27 CHAPTER 3: TRANSACTIONAL STORAGE vc: Volume-Class I Create-File[name: UniqueID, exists: Boolean / ref: File] Create-File creates a file with a given name on a specific volume, and returns an uncounted reference for the file. If a client wishes to preserve the file, it must make a counted reference for it If a file already exists on the specified volume with the given name, Error['DuplicateIdentifier] is returned unless exists is T. Although Create-File is described in this section, it is intrinsically a volume operation. fc: fc: fc: fc: File-Class File-Class File-Class File-Class Write[startPage: Integer, pages: Integer, data: Byte-Array] Read[startPage: Integer, pages: Integer / data: Byte-Array] Set-Size(pageCount: Integer] Get-Size[/ pageCount: Integer] The fundamental operations for data retrieval and storage are read and write. Files are read and written in units of pages, and the size of a page is implementation dependent Write writes into a file from an array a specified number of pages starting at a given page number. Read reads from a file into an array a specified number of pages starting at a given page number. The length of a file can be set and determined by Set-Size and GetSize, respectively. These operations can result in four exceptional conditions. An attempt to write or set the length of a file on a read-only device returns Error['ReadOnly]. An attempt to write or set the length of an immutable file (see below) returns Error[' Immutable]. An attempt to extend the length of a file, either by writing past the end of it or by an evaluation of SetSize, will return Error['InsufficientCapacity] if the file's containing volume is full. If an unrecoverable failure is detected Error['StorageDamaged] is returned. 3.1.5 Mutable and Immutable Objects After an idea originally suggested by Paul Mclones, files and volumes can be freely modified until they are made immutable. Once an object is niade immutable it becomes permanently readonly, and there is no way of making the object mutable again. An immutable file can not be written or have its length changed, and an immutable volume consists of a permanently fixed set of immutable files. The following two functions set and test the immutable property: or: Class or: Class SetImmutableD IsImmutable[/ immutable: Boolean] Immutable objects afford important practical benefits. A copy of an immutable object can be made without fear that the copy will become obsolete. Furthermore, because immutable objects can not change, the concurrency control and error recovery facilities of transactions are not required for their contents. However, deleting an immutable object is a real action, and thus requires the facilities of transactions. Theoretically a storage system could consist of a large number of immutable files and one mutable file that contained a single reference. In such a scheme, every time an object was updated, a new immutable version of the storage system would be created. The single mutable reference would point to the current version of the storage system. The extent to which immutable objects 28 CHAPTER 3: TRANSACfIONAL STORAGE should be used will depend on the characteristics of an application under consideration, and the relative cost of mutable files. The type of device that is used to store an object and whether the object is immutable or not are completely independent For example, objects that are stored on a write-once storage device do not have to be immutable. Although an object that is stored on a write-once device can not be directly updated, it could be moved to a read-write device, and then updated. 3.2 Basic Algorithms Techniques for implementing single-processor, transactional-storage systems have been well understood for some time. At the time of this writing, a few systems have been sucessfully constructed that implement a transactional storage system across processor boundaries [Eade et ale 77; Israel et ale 1978; LeLann 81; Tandem 81]. Because we are drawing on previous work, we will provide an overall sketch of how a decentralized transactional storage system can be implemented, but the reader should consult the references provided for details. Our survey is given in four sections. Transactional storage is first introduced on a single processor, and then extended to the decentralized case. We then discuss a different algorithm for transactional storage due to Reed. The final section describes how remote objects are accessed in our model. 3.2.1 Single Processor Case The job of a transactional storage system is to implement the three properties of transactions: totality, serial consistency, and external consistency. Existing systems follow a common pattern [Gray 78]. The two types of consistency are implemented by lock management A lock protects objects that a transaction uses from concurrent updates from other transactions, and it notifies other transactions that an object the transaction updates is busy. Locks are typically set implicitly as the result of storage system function evaluations. A client can also use its knowledge of what it is trying to accomplish to directly set and clear locks as an optimization. Some form of time out mechanism is usually provided that will cause a transaction to be aborted if it holds a popular lock too long. Totality is implemented by recovery management In an "undo" implementation, updates are directly applied to objects, and recovery management remembers old values in case a transaction aborts. If a transaction aborts, recovery management undoes the transaction's updates by consulting the preserved old values; hence the designation undo. In a "redo" or "intentions" implementation a transaction does not directly update objects. Requested updates are remembered by recovery management, and if a transaction commits, recovery management applies them to their respective objects. Thus, the list of updates remembered by recovery management are called intentions. If the system fails while intentions are being applied they must be reapplied, or redone, until they have been completely applied. Recovery management requires storage. In an undo scheme the extra storage is for old values, and in a redo scheme the extra storage is for new values. This storage is implemented in existing systems with a log or a shadow mechanism. A log is an append-only file that records the history of the system. In typical log-based systems, updating an object causes a log record to be written that includes the object's name, old 29 CHAPTER 3: TRANSACfIONAL STORAGE and new values, and information that identifies the update's transaction. A shadow mechanism implements storage for new values. New values are written out to fresh pages called shadows. If a transaction commits, any files that were updated have their physical storage description changed to point to their shadow pages instead of to pages that contain old values. The ideas of undo and redo can be combined. For example, in System R performance is increased by not keeping stable storage up to date. If the system fails, updates may have to be undone or redone from the log [Gray et al. 79]. Because most log-based systems remember redo values, they implement stable storage. Even if part of a file is lost, the log will still have a copy of the data. To prevent logs from becoming infinite, condensed snapshots called fuzzy dumps are taken periodically [Gray et al. 79]. Systems that only use shadows must implement stable storage with an additional mechanism. Recovery management keeps track of the real actions that a transaction has requested. These actions are postponed until recovery management is told to commit or abort. If recovery management is told to commit a transaction, then the transaction's real actions are performed. Now that we have introduced the concepts of lock and recovery management we can describe how they are synchronized to implement transactions. Our discussion assumes that a transaction is processed by more than one lock manager and more than one recovery manager. Inside of a single system there may be several lock and recovery managers. This commonly occurs when two independent data base systems are integrated. As we have discussed, a transaction is aborted for one of two reasons. First, a client can request that a transaction be aborted. Second, if a lock manager can not guarantee the consistency of a transaction, the transaction is aborted. When a transaction is aborted, the transaction's coordinator tells the transaction's lock managers to release the transaction's locks, and the transaction's recovery managers to erase the transaction's updates. A transaction is committed only if a client requests that it be committed. When a client requests a transaction to commit, the transaction's coordinator executes the following two step algorithm. 1. The coordinator asks all of the transaction's lock managers to guarantee the transaction's consistency, and all of the transaction's recovery managers to guarantee to commit on demand. If any manager can not make the requested guarantee, the transaction is aborted. 2. After all of the managers have made their promises, the coordinator informs the recovery managers to go ahead and commit the transaction. After the recovery managers have applied the transaction's updates, the lock mangers are told to release the transaction's locks. This two step process is known as the two-phase commit protocol. 3.2.2 Decentralized Case The decentralized case is not substantially different from the single processor case with multiple lock and recovery managers. A transaction that spans processors is implemented by a cooperating set of single processor transaction systems. Each processor, with its own transactional storage system, is called a worker. A worker implements lock and recovery management for the data it stores. 30 CHAPTER 3: TRANSACTIONAL STORAGE A decentralized transaction's coordinator carefully synchronizes the commit and abort processing of the transaction's workers as follows. If a transaction is aborted, either because a lock manager could not guarantee consistency or because the client requested an abort, the coordinator tells every worker to abort. If a client requests commit, the coordinator asks every worker to promise to commit on demand. Before a worker can make such a promise it must ensure that as far as it knows the transaction is consistent, and it has all of the information in stable storage that is required for the transaction to commit If these conditions are met then the worker can make its solemn promise. A worker that has made such a promise is said to be prepared. After making its promise a worker must remain prepared until it is informed to commit or abort by the coordinator. The request and promise sequence is phase one. If all workers promise to commit, then they can all be instructed to go ahead and commit This is phase two. If any worker fails or refuses to make the promise then the coordinator aborts the transaction. The two steps of a decentralized commit, request and receive promises, followed by commit, is the decentralized version of the two-phase commit protocol [Gray 78]. Coordinators must be very reliable so workers are not left prepared for an appreciable length of time. A prepared worker is undesirable because its lock manager can not permit any transactions to be processed that would destroy the consistency of its prepared transaction. If a transaction's coordinator fails a worker may remain prepared for a long time. We now digress for a moment to introduce a necessary idea. Imagine that a value can be replicated at a number of distinct places and manipulated in a way that is reasonably tolerant of place failures. There are two operations to manipulate the value: TryToSet and ReadValue. TryToSet attempts to set the value, but it only sets the value ifit was NIL. Thus, if many TryToSet operations are simultaneously evaluated only one of them will successfully update the value. TryToSet will not return until the value is not NIL (it mayor may not have succeeded). ReadValue returns the current value. Several algorithms have been proposed that solve this problem [Lamport 78a; Lampson 80; Thomas 79]. The following addition to the synchronization algorithm sketched above keeps workers from depending on a coordinator. The coordinator stores the state of a transaction at a number of places using TryToSet. The state of a transaction can be NIL. C. or A. representing active, committed, or aborted. When a transaction is aborted, the coordinator executes TryToSet[' A]. If a client requests that a transaction be commited, and the coordinator has received promises from every worker, then the coordinator executes TryToSet['C]. After it has done this, the coordinator sees if the value of ReadValue is C, and if so. it tells the workers to commit When a worker is asked to get prepared and· make its promise, it is passed the list of places where the state of the transaction is stored. If a worker gets restless or its lock manager notes that there are competing requests for a. prepared transaction's locks it can use ReadValue to determine the state of the transaction. If the state is C, it can go ahead and commit. If the state is A, it can abort. If the state is NIL, it can execute TryToSet[' A). Thus, regardless of when the coordinator might fail the transaction will continue to move towards termination. 3.2.3 Reed's Algorithm Reed has proposed a new way of implementing transactions [Reed 78). We will not describe his scheme in detail, but we briefly sketch how his assumptions differ from tradition. First, he 31 CHAPTER 3: TRANSACTIONAL STORAGE discarded the idea of private storage for recovery management. and instead let new and old values coexist in storage. New values are hidden by a version mechanism until a transaction commits, and old values persist for an implementation dependent period. Second, he abandoned locks as a concurrency control mechanism. Every transaction in his system is assigned a distinct pseudo-time period during which it appears to run. Data items have time stamps acquired from transactions, and rules are enforced that ensure serial consistency. 3.2.4 Located References Open is defined to return a class that will service requests for an object regardless of where the object is located. In this section we will make simplifying assumptions so we can clearly present the mechanics of how remote objects are accessed. In later chapters we add more complex concepts, such as object location. A reference is said to be located if it includes the address of a processor that will service requests for its referent Until Chapter 4 we assume that all references are located, and that the locations in references are always acctfrate. The address of an object is added to its reference as follows: Create-Located[ref: Reference, loc: Processor / Iref: Located] Create-Located creates a located reference. When Open is applied to Iref, control will be transferred to processor loc, and ref will be opened there. Located ~ Record[ref: Reference, loc: Processor]; Create-Located[ref: Reference, loe: Processor / lref: Located] Iref ~ Create[Located]; lref.ref ~ ref; lref.loe ~ loe; Return[Add-Type[lref, Major-Type[ref]]]; ]; ~ Prog[ (]; When we create references that are derived from other references, the new reference is assigned the same major type as the reference it was derived from. For example, if Create-Located is applied to a file reference, it will return a located file reference. Major-Type[ref: Reference / type: Type] ~ Prog[ []; -- return the major type of ref Return[Intersection[ref.type, Union[File, Volume, Coordinator, Transaction, Index, Processor, Counted, Secure-Door, Suite]]; ]; When Open encounters a located reference, it calls Open-Located: Open-Located[ref: Reference, tc: TC, ring: List[Key], guards: List[Key] / c: Class] Open-Located will open ref on the processor that is specified in its reference. The class returned can be used to communicate with the remote object as if it were local. It is not sufficient just to evaluate Open on the remote processor because classes can not be 32 CHAPTER 3: TRANSACfIONAL STORAGE passed between processors. Thus, Open-Located arranges to open a reference on a remote processor, and have the resulting class kept at the remote processor in a set called Doors. An entry in Doors is indexed by a unique identifier, and the unique identifier is returned to the originating processor. We will call an entry in the set Doors a door. The class created by Open-Located can refer to a remote class by specifying the unique identifer of its door. Shown below is the model implementation of Open-Located. Open-Door opens a reference on a remote processor and creates a door for the resulting class. A request directed at a located class is sent to Enter-Door. Enter-Door passes the request and a door identifier to the remote processor, where Door-Eval actually performs the requested operation. Figure 3.1 shows the structure of the class that Open-Located creates. Open-Located[ref: Reference, te: TC, ring: List[Key], guards: List[Key] / class: Class] fProg [ [rp: Processor-Class; d: Door; t: Transaction; r: Reference]; IF ref.loc=GetMyProcessor[] THEN [ -- object is at this processor class f- Open[ref.ref, te, ring, guards]; IF Is[class, Error-Type] THEN Return[class]; ELSE [ -- object is at a remote processor t f- tc I CopyReferenceO; r f- ref.ref; rp f- Open[ref.loc]; -- pass guards but not ring to remote processor (details in Chapter 7) d f- rp I Eval[,Open-Door[r, t, guards]]; IF Is[d, Error-Type] THEN Return[d]; class f- Function[Enter-Door]; ]; Instance variables: rp, d, ref, tc Return [Create-Class[ List[' Copy Reference, 'Default-Copy], class]]; ]; Enter-Door[request: List / result: Any] f- Prog [ 0; result f- rp I Eval[,Door-Eval[d, request]]; -- Eliminate problems with orphans if the processor fails IF result = Error['ProcessorFailure] THEN te I Abort[]; Return[result]; ]; Default-Copy[counted: Boolean/ cref: Reference] f- Prog [ []; -- default implementation of Copy Reference cref f- Copy[ref]; IF counted THEN [ -- if counted, increase reference count of referent Apply[superclass, request]; -- add counted type Add-Type[cref, Counted]; ] 33 located Class rp: d: I I Processor-Class Unique Identifier I , ". Processor Class I I I I ,~ ,,. Remote P rocesso r ,"" 0 0 Doors Structure of a Located Class Figure 3.1 34 Class CHAPTER 3: TRANSACTIONAL STORAGE ELSE Remove-Type[cref. Counted]; Retum[cref]; ]; Open-Door opens a reference and returns the identifier of its door. Door-Eval applies a door to a request Doors ~ Create-Set[]; Open-Door[ref: Reference. t: Transaction, guards: List[Key] / d: Door] [c: Class; dc: Class]; c ~ Open[ref, Open[t]. NIL, guards]; IF Is[c, Error-Type] THEN Retum[c]; d ~ GetUniqueIDD; dc ~ Create-Class[List[·Close. 'Close-Door]. c]; Set-Insert[Doors, d. dc]; -- Instance variables: c, d Retum[d]; ]; ~ Prog [ Door-Eval[d: Door, request: List / result: Any] ~ Prog[ [c: Class]; c ~ Set-Lookup[Doors. d]; IF Null[c] THEN Return[Error['NoSuchDoor]]; Retum[Apply[c. request]]; ]; Close-Door[] ~ Prog [ [tc: Transaction-Class]; c I GetTransactionClassD; c I Close[]; tc I CloseD; Set-Delete[Doors, d]; ]; tc 3.3 ~ Refinements 3.3.1 Lock Compatibility A typical locking structure that is used to guarantee serial consistency has two types of locks, read and update. These locks are set on data items implicitly in response to file operations. The compatibility of the locks is specified by Table 3.1. No Lock Read Update No Lock Yes Yes Yes Read Yes Yes No Update Yes No No Table 3.1: Typical Lock Compatibility Matrix A transaction is suspended if it attempts to set a lock that is incompatible. This matrix corresponds to the familiar rule that either n readers or one updater are permitted to access a file 35 CHAPTER 3: TRANSACfrONAL STORAGE simultaneously. This locking rule potentially can introduce long periods of time when information is unavailable. For example. if a user controls the length of a transaction. he can hold a update lock for a long period of time. This may naturally occur as a user thinks at the keyboard. A property of serial consistency is that all of a transaction's updates appear to occur at transaction commit time. One can take advantage of this property to increase the concurrency in the following way. Updates appear to occur when they are issued. but in fact are buffered until commit time by the stable storage system. Allowances are made so a read following a write will receive the write's data. When a user updates a datum, an I-Update lock is set, for intention to update. At commit time I-Update locks are converted to Commit locks. and the updates are actually reflected in stable storage. Table 3.2 specifies the new lock compatibility matrix. No Lock Read I-Update Commit No Lock Yes Yes Yes Yes Read Yes Yes Yes No I-Update Yes Yes No No Commit Yes No No No Table 3.2: Lock Compatibility Matrix with Intention Locks With this revised locking matrix, information is only unavailable for predictably short periods of time. during commit processing. This results in increased concurrency, but it may cause the later abortion of a transaction. We chose to make multiple I-Update locks incompatible, because eventually one of the two transactions would probably commit, and become incompatible. Thus we chose not to postpone the inevitable. This is one example of what is called lock conversion [Gray 78]. Lock conversion allows clients to predeclare their intentions to help lock management schedule transactions. 3.3.2 Lock Granularity The granularity of a lock [Gray et al. 76] refers to its scope. For example, a fine grained lock might be set on a range of bytes in a file. and a coarse grained lock might be set on an entire file. Choosing the granularity of locks presents a trade off between concurrency and lock manager overhead. Fine grained locks increase concurrency by not locking as much, but lock management will set more locks. On the other hand, if coarser locks are chosen. then concurrency will be reduced. Locking protocols have been devised that provide variable granularity locks [Gray et al. 76]. Variable granularity locks allow clients to decide how much should be locked, and have proven to be very valuable in practice. 3.3.3 Lower Degrees of Consistency There are a number of different levels of consistency, but serial consistency is the highest, and the consensus is that lower degrees of consistency are not sufficient for all applications [Gray 78]. [Gray et al. 76] define four degrees of consistency that can be roughly characterized as follows. Degree 0 protects transactions from overwriting each others uncommitted updates. Degree 1 adds 36 CHAPTER 3: TRANSACfIONAL STORAGE totality. Degree 2 adds protection against reading the uncommitted updates of others. Degree 3 provides serial consistency. Further details are provided in the taxonomy [Gray et al. 76]. The motivation for using lower degrees of consistency is that they permit more concurrency, perform better because fewer locks are set, and are less complex to implement than higher degrees. A transactional storage system could support lower degrees of consistency. However, there are a number of ways of achieving the benefits of lower degrees of consistency with the system as described. For example, if a transaction runs for a long time it could be broken up into many smaller transactions. 3.3.4 Broken Read Locks In some cases when a data item that a transaction has read becomes obsolete it is not necessary to abort the transaction. A common example is a cache manager that holds data as a convenience. If one of the data items it is holding is no longer current it would be content to be informed of that fact so it could remove the item from its cache, and then proceed. Transactional storage could inform clients that a data item they previously received has become obsolete, and give them the option of not aborting their transaction. This is called breaking a read lock, in reference to the read lock the data item held. The advantage of the scheme is that fewer transactions are aborted. 3.3.5 Releasing Read Locks Transactional storage could allow clients to inform it that although they read a data item, it did not influence their computation, and its corresponding read lock could be released. As a general rule, the more read locks a transaction holds, the more likely it is to conflict with another transaction and be aborted. In addition, the system must verify that read locks are still being held at commit time, which increases the amount of communication that must occur. 3.3.6 Deadlock Cyclic dependencies, or deadlocks, can occur when clients dynamically acquire resources. For example, if client A acquires X and wants Y, and client B acquires Y and wants X, a deadlock exists. In order to guarantee progress either A or B must be aborted and the other allowed to proceed. Deadlock prevention requires that all transactions either declare in advance what resources they are going to use, or acquire resources in a fixed order. Deadlock prevention is usually considered too restrictive for general use, and it reduces concurrency by locking too much for too long. Deadlock detection maintains a resource dependency graph to detect when a cyclic dependency occurs. This approach is difficult to efficiently implement in a decentralized system, although some proposals have been made [Gray 78; Obermarck 80]. A time-out strategy resolves deadlocks by aborting a transaction if it holds a resource for too long that another transaction has requested. The problem with a time-out scheme is that in a system with a lot of contention for resources, once a transaction has run longer than the timeout interval it is likely to be aborted. This problem can be alleviated somewhat by specifying the timeout interval for a transaction when it is created. 37 CHAPTER 3: TRANSACfIONAL STORAGE 3.3.7 Checkpoints and Save Points It is possible to commit a transaction and still keep the transaction active for further operations. Such a commit is called a checkpoint. A checkpoint allows a transaction to save a consistent set of results to hedge against the possibility that it might not complete [Gray et al. 79]. A save point is an intermediate place in a transaction that recovery management can return to if the transaction gets into trouble. For example, if a transaction is involved in a deadlock, it may be possible for recovery management to return to the transaction's last save point If this is possible, all of the transaction's work will not be discarded, as would be the case if recovery management could only abort the entire transaction. 3.3.8 Nested Transactions The one level transaction structure that has been presented can be extended to include nested transactions [Davies 73; Reed 78]. Nested transactions are just like normal transactions. except that when they commit their results are only visible to their parent transaction. For example. imagine T has two nested transactions, Tl and T2. running at the same time. Tl and T2 can not detect each others presence, and their results are only visible to T when they commit The results of Tl and T2 are made public when T commits. Nested transactions may provide a performance improvement by making transactions less likely to abort For example. in Section 3.2.4 we described a scheme that forced a transaction to abort if a processor failed. If Enter-Door created a separate nested transaction for each remote evaluation, then on processor failure it could abort the corresponding nested transaction and retry the evaluation. 3.3.9 Stable Storage Failure The fundamental premise that we are operating under is that stable storage will not fail. However, stable storage will fail sometime. If stable storage fails it is the clients' responsibility to reconstruct its contents. Simply using an obsolete copy of the destroyed data will not do. Assume that there is a catastrophic failure of volume V, and all that remains is an out of date copy of V made at time T. One can not simply restore V to its state at time T because consistency would be destroyed. Consider the case where V stores bank account balances, and substantial withdrawals have been made since time T. Restoring V would result in the bank giving away money. Obsolete information can be of use, however. For example, if V and V's log from time T to the present are available stable storage can be totally reconstructed. If a log is not available, a salvaging program could read obsolete data and combine it with information available from outside the system to rebuild storage. 3.4 Summary Transactional storage was introduced as an ideal way of allowing stable storage to be shared between clients. The fundamental characteristics of transactional storage were described, along with a model set of objects and primitives. A survey of existing and proposed algorithms for the implementation of transactional storage was provided. The final section of the chapter outlined some refinements that can be made to the model and its implementation. 38 CHAPTER 3: TRANSACfIONAL STORAGE Exercise 1. Give an algorithm for TryToSet Will your algorithm always terminate in finite time? 39 Chapter 4: Location In many instances objects naturally physically move about in a decentralized system. A removable storage volume is an obvious example. Other objects, such as coordinators, can also move if the processor that implements them changes. This chapter describes the techniques that we employ to determine where an object is implemented. The first section introduces the notion of an index, the second section shows how indexes can be used to implement indirection, the third section introduces indirect references, the fourth section discusses how files, volumes, and other objects are located, and the last section describes choice references. The chapter ends with a brief summary. 4.1 Indexes An index holds a set of entries. An entry is an array of bytes, and each entry is named with an entry-name. Like other objects, indexes are named by references, and Open is evaluated to obtain a class to service an index. Like files and volumes, indexes can be made immutable. The following functions define the interface to indexes: Create-I ndex(storage: File / index: Index] Create-Index takes a file reference and returns an uncounted reference for a new index. It is presumed that storage either contains the data structures that represent an index, or is a zero length file. We will see in Chapter 6 why storage might already contain the data structures of an index. The index will inherit the unique identifier of storage. An index reference is defined to have the following type: Index +- Extend[Record(storage: File], Reference]; ic: Index-Class I Write[entry-name: Byte-Array, value: Byte-Array] Write enters an entry in an index and names it entry-name. If an entry previously existed under the specified entry-name its value is updated. If an entry previously existed and the value argument to Write is NIL, the entry is deleted from the index. ic: Index-Class I Read[entry-name: Byte-Array / value: Byte-Array] Read returns the entry named by entry-name. name, NIL is returned. If there is no entry under the specified Entry +- Record[entry-name: Byte-Array, value: Byte-Array]; ic: Index-Class I Enumerate[last: Entry / next: Entry] Enumerate allows all of the entries in an index to be enumerated. The first entry in an index is returned when last is NIL, and next is NIL when there are no more entries. There are many algorithms for implementing indexed files. Popular approaches are presented by [Knuth 73] and [McCreight 77]. 40 CHAPTER 4: LocATION 4.2 Indirection 4.2.1 Model As the first step to implement location, we introduce processor-independent indirection. Indirection is processor-independent in the sense that an indirect pointer can be made on one processor and dereferenced on another processor. The following four functions define the interface to the indirection mechanism: Create-Indirect[record: Record, index: Index, te: TC / ie: Indirect] Create-Indirect creates an indirect entry, places record in the entry, and returns an indirect pointer to the new entry. Lookup[ie: Indirect, te: TC / record: Record] Lookup returns the contents of the indirect entry specified by ie. Change-Indirect[ie: Indirect, record: Record, te: TC] Change-Indirect places record in the indirect entry specified by ie. indirect entry is deleted. If record is NIL, the Normalize[ie: Any, te: TC / record: Record] If ie is indirect, Normalize dereferences ie using Lookup until the result is not an indirect record. 4.2.2 Basic Algorithm Indexes are used to implement indirection. As shown in Figure 4.1, an indirect record points to the index entry that is used to hold its value. Indirect +- Record[indirect-id: UniqueID, index: Index]; Create-Indirect[record: Record, index: Index, te: TC / ie: Indirect] +- Prog[ [ic: Index-Class]; ie +- Create[Indirect]; ie.indirect-id +- GetUniqueID[]; ie.index +- index; ic +- Open[index, te]; IF Is[ic, Error-Type] THEN Return[ic]; -- check for unique id generator malfunction IF Not[N uIl[ic I Read[ie.indirect-id]]] THEN [ ic I Close[]; . Return[Error['UniqueIDMalfunction]]; ]; ic Write[ie.indirect-id, Encode[record]]; ic I Close[]; Return[Add-Type[ie, Major-Type[record]]]; ]; Lookup[ie: Indirect, te: TC / record: Record] +- Prog[ [ic: Index-Class]; ic +- Open[ie.index, te]; 41 "", Indirect Record index: id: I I I Index Reference I Unique Identifier I I "" ~ Record Index An Indirect Record Figure 4.1 42 CHAPTER 4: LocATION record +- Decode[ic ic I Close[]; Return[record]; ]; I Read[ie.indirect-id]]; Change-Indirect[ie: Indirec~ record: Record, te: TC] +- Prog[ ric: Index-Class; contents: Any]; contents +- IF Null[record] THEN NIL ELSE Encode[record]; ic +- Open[ie.index. te]; ic I Write[ie.indirect-id. contents]; ic I Close[]; ]; Nonnalize[ie: Any. tc: TC / record: Record] +- Prog[ []; record +- ie; UNTIL Not[Is[record. Indirect]] 00 record +- Lookup[record. te]; Return[record]; ]; 4.3 Indirect References With the structure we have described so far once a reference is given to a client it must never change. There are many reasons to eliminate this restriction. For example. if a located reference could change. then the location of its referent could change. At times it may also be desirable to change the referent of a reference, perhaps to cause clients to use a new object To allow references to change, we introduce the notion of an indirect reference. Indirect references are created by applying Create-Indirect to a reference. Indirect references are used just like ordinary references. They are counted or uncounted, depending on the type of their original reference. CopyReference and DestroyReference are used to copy and destroy indirect references. Indirect-References take up less space than nonnal references, and thus can reduce the amount of storage occupied by common references. ChangeReference is used to change' an indirect reference and destroy the old reference that was in its index entry. Change-Indirect can be used to change an indirect reference, but it does not destroy the old reference. ic: Indirect-Class I ChangeReference[nref: Reference] ChangeReference causes the indirect reference represented by ic to be changed to point at nref ic is also changed to service nref When a counted-indirect reference is made for an objec~ the object's reference count is increased by one. When the destruction of an indirect reference results in the destruction of its referent (it was the only counted reference), then the indirect reference's index entry is also destroyed. Thus, if three counted-indirect references share an indirect entry, the indirect entry will be destroyed when the last of the three references is destroyed. A file reference is defined to include a reference for its containing volume. Thus, when a file is created on an indirect volume. a reference for the indirect volume is included in the new file reference. Figure 4.2 shows the class structure that is established when an indirect reference is opened. 43 CHAPTER 4: locATION All requests are first considered by a class specialized for indirect references, which is called an indirect class. If a request is not one of the four operations that the indirect class handles, then it is passed along to a class for the indirect reference's referent A similar class structure will be used for other types of references. Open-Indirect[ref: Reference, te: TC, ring: List[Key], guards: ListIKey] I c: Class] +- Prog[ [d-ref: Reference]; d-ref +- Lookup[ref. te]; c +- Open[d-ref, te, ring, guards]; IF Is[c, Error-Type] THEN Retum[c]; c +- Create-Class[List{ 'CopyReference, 'Default-Copy, 'ChangeReference, 'Indirect-Change, 'DestroyReference, 'Indirect-Destroy, 'Create-File, 'Indirect-Create-File], c]; Instance variables: ref, te, ring, guards Retum[c]; ]; Indirect-Change[nref: Reference] +- Prog[ 1]; -- change reference -- first, destroy old reference superclass I DestroyReferencel]; superclass I Closel]; -- second, update index Change-Indirect[ref, nref, te]; -- third, open new reference and switeh superclass superclass +- Open[nref, te, ring, guards]; IF Is[superclass, Error-Type] THEN Retum[superclass); ]; Indirect-Destroy[1 deleted: Boolean] +- Prog[ 1]; -- destroy reference pointed to deleted +- Apply[superclass, request); -- destroy indirect if object deleted IF deleted THEN Change-Indirect{ref. NIL, te]; Retum[deleted]; ]; Indirect-Create-File[] +- Prog[ 1]; -- substitute an indirect volume ref file-ref +- Apply[superclass, request]; IF Is[file-ref, Error-Type] THEN Return[file-ret]; file-ref.volume +- self I CopyReference[]; ]; Open-Indirect can cache the results of Lookup requests. However, if it uses one of these cached references and an Error results, then it must use Lookup to determine if its cached value is still current If an error is going to occur it will occur at Open time, so it is always safe for OpenIndirect to attempt to use an obsolete cache entry. 44 I Indirect Ref : ,"" ,... Reference Referent . I J.. Reference Structure I I Indirect Class ? Services: ""'- Referent Class Services all other Change Reference operations CopyRefe rence Create-File Dest royRefe rence Class Structu re Structure of an Indirect Class Figure 4.2 45 CHAPrER. 4: locATION 4.4 Object Location Indirect references are used to implement location for volumes, coordinators, and indexes. As a matter of convention, indirect references are always made for objects that might move. When the location of such an object changes, Change-Indirect is used to update the object's indirect entry with a new located reference. Take the case of a removable volume as an example. When the volume is created, an indirect reference for it is made. This indirect reference is what is given to clients and passed around in the system. Whenever the removable volume is mounted, its location is placed in the index specified by its indirect reference. Clients that hold the volume's reference will then be able to access it Indexes can not always be located with indexes because of the obvious problem with recursion. The index-index is introduced to solve this problem. The index-index is named by a fixed, located reference. Because the index-index will be used frequently an implementation should represent its reference compactly. The reference for the index-index is a located reference: Index-Index: Located-Index The location in the index-index reference is Broadcast. This is the only use of the broadcast address we will make. We assume that every processor that listens for broadcasts knows of a processor that will service requests for the index-index. Files and indexes are assumed to be located at the same place as their containing volume, and we can take advantage of this fact to reduce our need for indirection. We simply determine the location of an index's or file's volume, and use it to construct a located reference. Open-File[ref: Reference, tc: TC, ring: List[Key], guards: List[Key] / c: Class] +- Prog [ [volume-ref: Reference]; -- Get located volume reference volume-ref +- Normalize[ref.volume, tc]; IF Not[Is[volume-ref, Located)) THEN Retum[Error['NotFound)); -- Open file at the processor where the volume is c +- Open[Create-Located[ref, volume-ref.loc], te, ring, guards]; IF Is[c, Error-Type] THEN Retum[c]; Retum[Create-Class[List['Copy Reference, 'Default-Copy], c)); ]; Open-Index[ref: Reference, tc: TC, ring: List[Key], guards: List[Key] / c: Class] +- Prog [ [volume-ref: Reference]; -- Get located volume reference volume-ref +- Normalize[Normalize[ref.file, tc].volume, tc]; IF Not[Is[volume-ref, Located)) THEN Retum[Error['NotFound)); -- Open index at the processor where the volume is c +- Open [Create-Located[ref, volume-ref.loc], tc, ring, guards]; IF Is[c, Error-Type] THEN Retum[c]; Retum[Create-Class[List['Copy Reference, 'Default-Copy], c)); ]; 4.5 Choice References Let us return to the way we name objects for a moment We have stated that a reference is an unambiguous name for a single object Here we modify that view slightly. Imagine that we ~anted A£ CHAPrER 4: LocATION to use some object from the list 01 ... On, but we did not care which object was chosen. To implement this idea we introduce the idea of a choice reference. A choice reference is a set of references S, and when a choice reference is opened, the class returned will correspond to one of the elements of S. A strategy could be employed to select which element of a choice reference is selected. For example, in the case of a volume choice reference we could select the volume with the largest remaining capacity. An application for choice references is when anyone of a number of processors can service requests for an object In this case, we do not want to place the address of a single processor in the object's index entry. Instead, we create a processor choice reference that includes all of the processors that can service the object, and use this choice reference as the object's location. A choice reference is created by Create-Choice: Create-Choice[choices: List[Reference] I cr: Choice] Create-Choice creates a reference whose referent can be anyone of the references in choices. Choice +- Record[choices: List[Reference)); Create-Choice[choices: List[Reference] I cr: Choice] +- Prog[ 1]; cr +- Create[Choice]; cr.choices +- choices; Retum[cr]; ]; The model implementation for a choice reference opens all of its possible choices at once, and uses the first object to respond. Objects that are not selected are closed. Open-Choice[ref: Reference. te: TC, ring: List[Key]. guards: List[Key] I class: Class] +- Prog[ [choice-made: Condition-Variable, choice-lock: Lock]; class +- NIL; choice-lock +- Create-Lockl]; choice-made +- Create-Condition-Variablel]; MapCar[ref.choices, 'Choice-Open]; UNTIL Not[Null[Get-Choice-Class)) 00 Wait[choice-made]; Retum[Get-Choice-Classl11; ]; Get-Choice-Class[/c: Class] +- Critical[choice-Iock, 'class]; Set-Choice-Class[c: Class] +- Critical[choice-Iock. 'Prog[ 1]; IF Is[c Error-Type] THEN Retuml]; IF Null[class] THEN [ class +- c; Broadcast[choice-made]; ] ELSE c I Closel]; )); Choice-Open[ref: Reference] +- Fork[,Set-Choice-Class[Open[ref, te. ring, guards]]]; 47 CHAPTER 4: LocATION 4.6 Summary This chapter described a general framework for locating objects. Indexes were defined. and used to implement indirect records. Indirect records in tum allowed indirect references to be created. Indirect references were used to keep the location of an object in a single place that can be conveniently updated. Choice references were introduced to handle a service that is implemented by many objects. Exercise 1. Under what circumstances would you use a choice reference for a volume? 48 Chapter 5: Replication So far we have described a system where only one copy is kept of files, indexes, and volumes. This chapter describes how objects can be replicated to improve their availability, reliability, and performance. The first section describes a model of the replication facilities we provide, the second section shows how our model can be implemented, the third section argues that the replication algorithm preserves the properties of transactions, the fourth section offers some refinements to the basic algorithm, and the final section compares our method of replication with previous work. The chapter ends with a brief summary. 5.1 Suites To introduce what a suite is and how one works let's start with an example. Imagine that we would like to maintain three copies of a file. We will say a copy is current if it has received all of the updates that have been applied to the file. Here are three approaches for maintaining the copies: 1. Always update all three copies. Thus, all three copies will always be current To read the file, we could read anyone of its three copies. Of course, if anyone of the three copies is unavailable we would be unable to update the file. 2. Always update at least two copies. Both of these copies must be current, to guarantee that a current copy always receives all of the updates that are applied to the file. To read, we know that if we examine two copies, one of them is guaranteed to be current If version numbers are kept on the copies, then we can identify a current copy, because it will have the highest version number. 3. Always update at least one copy. Once again, this copy must be current, to guarantee that a current copy always receives all of the updates that are applied to the file. To read. we know that if we examine all three copies, one of them is guaranteed to be current Version numbers are once again required to determine which of the three copies is current These methods all preserve the invariant that every set of copies that they write intersects with every set of copies that they read. In this way they are always guaranteed to provide current information. We will call a collection of copies that implements a single object a suite, and individual copies representatives. Our example described a file suite with three representatives. Furthermore, we will call a set of representatives that is examined to read a suite a read quorum, and a set of representatives that is written in response to a suite write operation a write quorum. As in our example,every read quorum and every write quorum must have a representative in common. The read and write quorums described by the above three methods are respectively: 1. Write Quorum: {I, 2, 3} Read Quorums: {I}, {2}, {3} 49 CHAPTER 5: REPLICATION 2. Write Quorums: {I, 2}, {I, 3}, {2, 3} Read Quorums: {I, 2}, {l, 3}, {2, 3} 3. Write Quorums: {I}, {2}, {3} Read Quorum: {I, 2, 3} The basic idea behind our replication mechanism is a compact encoding of a suite's read and write quorums. Every representative of a suite is assigned a non-negative integer, which represents a number of votes. A read quorum is considered to be any set of representatives whose votes total to r, and a write quorum is considered to be any set of representatives whose votes total to w. Nonnegative integers r and w are chosen such that r + w is greater than the total number' of votes assigned to the suite. A suite's voting configuration is the triple (vote-assignment, r, w). As in our example, when updates are applied to a write quorum every member of the write quorum must be current Thus, every read quorum and every write quorum have a representative in common, and every read quorum will always have a representative that is current Version numbers make it possible to identify this representative. Section 5.3 describes the algorithm in detail. To read a suite, at least r votes worth of representatives must be available. To write a suite, MAX[r,w] votes worth of representatives must be available. This is because before a suite can be written a read quorum of representatives must be assembled to determine what version number a current representative will have. When the contents of a suite are totally replaced only w votes worth of representatives need be available (Section 5.4.6). The algorithm has a number of desirable properties. It starts with and preserves the properties of transactional storage, including totality, serial consistency, and external consistency (Section 3.1.2). It continues to operate even when some representatives are unavailable. It is simple, and can be used to create many types of suites. In addition, all of the copies of an object, including temporary copies that clients create to increase performance, can be incorporated into the framework. A suite's characteristics can be chosen from a range of possibilities by adjusting its voting configuration. For example, heavily weighting high performance representatives will result in a suite with higher performance, and heavily weighting representatives that are very reliable will result in a more reliable suite. A completely decentralized structure results from equally weighting representatives, and a completely centralized scheme results by assigning all of the votes to a single representative. Once the general reliability and performance of a suite is established by its voting configuration, the relative reliability and performance of Read and Write can be controlled by adjusting rand w. As r decreases, reads become more efficient and reliable. As w decreases, writes become more efficient and reliable. The choice of rand w will depend on the read to write ratio expected, the relative costs of reading and writing, and desired suite characteristics. Table 5.1 suggests the diverse mix of properties that can be created by appropriately setting r and w. The blocking probabilities shown in the table represent the probability that a quorum will not be available. We have assumed that the probability that a representative is unavailable is .01. Example 1 is configured for a file with a high read to write ratio in a single server, multiple user environment Replication is used to enhance the performance of the system, not the reliability. 50 CHAPTER 5: REPLICATION There is one server on a local network that can be accessed in 75 milliseconds. Two users have chosen to make copies on their personal disks by creating weak representatives, or representatives with no votes (see Section 5.5.1 for a complete discussion of weak representatives). This allows them to access the copy on their local disk, resulting in lower latency and less traffic to the shared server. Example 2 is configured for a file with a moderate read-to-write ratio that is primarily accessed from one local network. The server on the local network is assigned two votes, with the two servers on remote networks assigned one vote apiece. Reads can be satisfied from the local server, and writes must access the local server and one remote server. The system will continue to operate in read-only mode if the local server fails. Users could create additional weak representatives for lower read latency. Example 3 is configured for a file with a very high read to write ratio, such as a system directory, in a three server environment Users can read from any server, and the probability that the file will be unavailable is very small. Updates must be applied to all copies. Once again, users could create additional weak representatives on their local machines for lower read latency. Exam~le ExamQle 1 ExamQle2 75 65 65 75 100 750 75 750 750 (1,0,0) 1 1 <2,1,1) 2 3 (1,1,1) 1 3 Read Latency (msec) Blocking Probability 65 1.0 X 10-2 75 2.0 X 10-4 75 1.0 X 10-6 Write Latency (msec) Blocking Probability 75 1.0 X 10-2 100 1.0 X 10-2 750 3.0 X 10-2 Latency (msec) Representative 1 Representative 2 Representative 3 Voting Configuration r w 3 Table 5.1 Create-Suite is used to organize a collection of objects into a suite. Create-Suite[id: UniqueID, r: Integer, w: Integer, rep: List[Reference], votes: List[Integer] / suite: Suite] Create-Suite creates a suite reference from a specification that consists of the identifier to be assigned to the suite, r, w, a list of representative references, and a list of the representatives' respective vote assignments. The suite reference returned can be opened and used like an ordinary object The suite's type is determined by the type· of its representatives. All of the references in rep must be of the same type, and they all must have the same version number. If the resulting reference is a volume reference, then files created on suite will be file suites. Suite +- Record[]; 51 CHAPTER 5: REPLICAnON Normal-Suite +- Extend(Suite, Record(id: UniqueID, r: List[Reference], votes: List[IntegerD; Integer, w: Integer, rep: Create-Suite[id, r, w, rep, votes I suite] +- Prog[ [); -- start making up the suite suite +- Create[Normal-Suite); suite.r +- r; suite. w +- w; suite. votes +- votes; suite.id +- id; suite.rep +- rep; Return[Add-Type[suite, Major-Type[car[repDD; ]; When Create-Suite is applied to a set of volumes a volume suite results. A volume suite is a template for creating file suites. When Create-File is applied to a volume suite a file suite will be created that has the same voting configuration as the volume suite. The new file suite will have a set of representatives whose votes total to at least MAX[r,w]. This ensures that the new suite will be fully functional. In order to allow the replication algorithm to keep track of which representatives have current information. and to allow the algorithm to update obsolete representatives. all objects that will be used as representatives must implement the following operations. class: Class I GetVersion[1 version: Integer) The version number of an object is a count of the number of writes that have been performed on the object Consistency is guaranteed for version numbers. In other words. once GetVersion returns the version number of an object, no other transaction will update the object before the the transaction associated with class ends. class: Class I Copy[copy-from: Reference] Copy copies the index or file specified by copy-from to the index or file serviced by class. Copy is implemented in such a way that an object can be copied to itself. After Copy finishes. the original object and its copy are indistinguishable. The object serviced by class receives the following state from copy-from: version number. reference count, length. immutability, and contents. The last thing that is copied is the version number. Thus, if a transaction is committed before Copy finishes. the object serviced by class will still have its old version number. In order to guarantee that an object has a monotonically increasing version number. copy-from must have a version number that is larger or equal to the version number of class. If this is not the case. Copy will return Error['VersionError). 5.2 Basic Algorithm We present the basic algorithm by describing how operations on a suite are transformed into operations· on the suite's representatives. The perspective taken is that of a single transaction. Of course there can be many transactions accessing a given suite at the same time. all performing the algorithm. We also provide a step by step synthesis of a class to implement the algorithm. The class is initialized by Open-Suite, which is called by Open to create a class to service a suite. Before OpenSuite ret IfDS, it gathers a read quorum to determine the suite's version number. Open-Suite 52 CHAPTERs: REPLICATION establishes the correspondence between operations on the suite (e.g. Create-File) and the functions that implement these operations in tenns of the suite's representatives (e.g. Suite-Create). Careful use of critical sections has been made so the class can process simultaneous requests. Figure 5.1 shows the gross structure of a suite cl&ss. Open-Suite[input-ref: Suite, te: TC, ring: List[Key], guards: List[Key] / sc: Class] +- Prog[ [x: List[Representative]; r: Integer; w: Integer; votes: List[Integer]; number-of-representatives: Integer; -- normal form of input-ref is suite-ref suite-ref: Normal-Suite; -- Broadcast[crowd-Iarger] occurs when a new representative becomes available crowd-larger: Condition Variable; -- write-lock must be locked to access write-quorum or raw-votes write-lock: Lock; write-quorum: List[Representative]; raw-votes: Integer; -- suite-lock must be locked to access any of the following: suite-lock: Lock; suite: List[Representative]; current: Integer; a current representative has this version number not-found-votes: Integer; ]; -- let's see what is out there and initialize suite variables Initiate-Inquiries[]; x +- Collect-Read-Quorum[]; -- if suite was not found, return error to client IF Is[x, Error-Type] THEN Return[x]; -- Instance variables: r, w, votes, number-of-representatives, suite-ref, crowd-larger, write-lock, write-quorum, raw-votes, suite-lock, suite, current, not-found-votes, input-ref, te, ring, guards Return[Create-Class[List[ 'Read, 'Suite-Read, 'Copy, 'Suite-Copy, 'Enumerate, 'Suite-Read, 'GetVersion, 'Suite-Read, 'Write, 'Suite-Write, 'IsImmutable, 'Suite-Read, 'SetImmutable, Suite-Write, 'GetSize, 'Suite-Read, 'GetID, 'Suite-GetID 'SetSize, 'Suite-Write, 'GetTransactionClass, 'Suite-Read, 'Copy Reference, 'Suite-Copy Ref, 'DestroyReference, 'Suite-Write, 'Close, 'Suite-Close, 'Create-File, 'Suite-Create], NIL]]; ]; 53 ,"" Representative Class Suite Class [!] w:[!] r: Votes Version 1 47 47 46 1 1 Class ,'" Representative Class suite-lock ,.. St ructu re of a Suite Class Figure 5.1 54 Representative Class CHAPTER 5: REPLICATION Suite-GetID[1 id: UniqueID] ... [suite-ref.id]; A quorum is represented by a list of Representative records. Collect-Quorum and CollectWrite-Quorum collect quorums. They are described in detail later. Representative ... Record[ version: Integer, votes: Integer, class: Class, ref: Reference]; A suite read operation is an operation that examines the state of a suite without changing it The operations Read. Enumerate, IsImmutable, GetSize. GetTransactionClass, and GetVersion are all transformed into Suite-Read. A suite read operation can be performed by any current representative. To find a current representative the algorithm first collects a read quorum. From this quorum a current representative is selected, and the requested read operation is performed by this representative. Ideally, one would like to read from the representative that will respond the fastest If storage is damaged by an unexpected error, Error['StorageDamaged] will be returned by a read request In this event Suite-Read could attempt to find another current representative to read from. Suite-Read[1 result: Any] ... Prog[ []; result ... Apply[Select-Current[Collect-Quorum[r]].class, request]; Return [result]; ]; Select-Current[quorum: List[Representative] I rep: Representative] ... Critical[suite-Iock, 'Prog[ []; FOR rep IN quorum 00 [ -- find a current representative IF rep.version=current THEN Return[rep]; ]; ]]; A suite write operation is an operation that updates the state of a suite and creates a new version. The operations Write, SetImmutable, SetSize, and DestroyReference are transformed into Suite-Write. A suite write operation is performed by a write quorum, all of whose members are current The algorithm first gathers a write quorum, and then every representative in the quorum performs the requested write operation. After all of the members of the quorum have finished, a check is made to ensure that none of them returned an error, and the result from one of the representatives is returned as the result of the write operation. The result of concurrent· writes that update the same portion of a suite is undefined. In the model implementation, if two concurrent writes update the same portion of a suite it is possible that half of the suite will assume one value, and the other half of the suite a different value. When the suite is subsequently read, it will be indeterminate which of these two values will be returned. 55 CHAPTER 5: REPLICATION Suite-Write[/ result: Any] f- Prog[ []; Retum[Apply-Quorum[Collect-Write-Quorum[]]]; ]; Apply-Quorum[quorum: List[Representative] / result: Any] [process: List[Process]; x: Any]; -- start all representatives working on the request process f- MapCar[quorum, 'Fork-Request]; -- wait for all of the representatives to finish result f- MapCar[process, 'Join]; FOR x IN result 00 IF Is[x, Error-Type] THEN Retum[x]; Return[car[result]]; ]; Fork-Request[rep: Representative / p: Process] Fork['Apply[rep.class, request]]; f- Prog[ f- A Copy request is a special write operation. Because the contents of the suite are being replaced, all of the members of the write quorum do not have to be current. A Copy operation is transformed into Suite-Copy. Suite-Copy first replaces the contents of the suite, and then updates cached version numbers. Suite-Copy[/ result: Any] f- Critical[write-Iock, 'Prog[ [rep: Representative]; -- replace a write quorum write-quorum f- Collect-Quorum[w]; result f- Apply-Quorum[write-quorum]; IF Is[result, Error-Type] THEN Retum[result]; Mark-Reps-Current[write-quorum]; Return[result]; ]]; Mark-Reps-Current[quourm: List[Representative]] f- Critical[suite-Iock, 'Prog[ [rep: Representative]; -- update cached version numbers current f- car[write-quorum].class I GetVersion[]; FOR rep IN write-quorum DO rep. version f- current; j); A CopyReference request is a write operation, because it can update the state of the suite by incrementing its reference count. A CopyReference operation is transformed into Suite-CopyRef. Suite-CopyRef{counted: Boolean / nr: Suite] f- Prog[ []; -- copy the references of a write quorum Apply-Quorum[Collect-Write-Quorum[]]; -- return a copy of the reference nr f- Copy[input-ref]; IF counted THEN Add-Type[nr, Counted] ELSE Remove-Type[nr, Counted]; 56 CHAPTER 5: REPLICATION Return[nr]; ]; As we discussed in the last section, when Create-File is applied to a volume suite a file suite is created Representatives for the new file suite are created on a set of volume representatives whose votes total to at least MAX[r,w]. The model implementation of Suite-Create does not allow a volume suite to include representatives that are reconfigurable or protected, concepts covered in Chapters 6 and 7 respectively. We digress for a moment to explain a fine point about the operation of Suite-Create. There are two types of file suite references. A suite reference returned from Create-Suite is fully expanded, and contains a list of a suite's representatives, vote assignments, and so on. This type of suite reference is said to be in normal form. The file suite reference returned from Suite-Create does not contain a reference for each of its representatives. Such references are condensed and consist of the unique identifier of the suite's representatives and a volume suite. This type of file suite reference can be easily reconstituted to normal form. as we shall see later. Suite-Create[id: UniqueID / nr: Suite-Reference] ... Prog[ [result: Any]; result ... Apply-Quorum[Collect-Quorum[Max[r, w]]]; IF Is[resul~ Error-Type] THEN Return[result]; -- create file reference nr ... Create[Extend[File, Suite]]; nr.id ... id; nr.volume ... self I CopyReference[]; -- return the new suite's reference. Return[nr]; ]; This concludes our discussion of how suite operations are transformed into operations on suite representatives. We now turn our attention to how a suite is initialized, and how quorums are gathered. When a suite is opened queries are sent out to determine the version numbers of the suite's representatives. If a representative is available. it responds with its version number. and it can be considered for inclusion in a quorum. As each representative reports its version number the highest version number seen is updated. After a read quorum of representatives have reported their version numbers the highest version number that has been seen is the highest that exists. This in essence is the version number of the suite, and any current representative will have this version number. If a read quorum of a suite's representatives do not exis~ provisions are made so clients are told that the suite does not exist It may be that a version number inquiry will never return because its corresponding representative is unavailable. As stated in Section 3.1.2, outstanding uncompleted version number reads do not affect the ability of a transaction to commit Initiate-Inquiries[] ... Prog [ [i: Integer]; -- initialize suite suite-ref ... Expand-Suite[input-ret]; r ... suite-ref.r; w ... suite-ref.w; 57 CHAPTER 5: REPLICATION votes +- suite-ref. votes; number-of-representatives +- Length[suite-ref.rep]; write-lock +- Create-LockO; write-quorum +- NIL; suite-lock +- Create-LockO; crowd-larger +- Create-Condition VariableO; suite +- NIL; curren t+-O; not-found-votes +- 0; -- send out version number inqmnes FOR i FROM 1 TO number-of-representatives DO Fork[,Open-Representative[i]]; ]; Open-Representative[i: Integer] +- Prog[ [rep: Representative]; rep +- Create[Representative); rep.ref +- Nth[suite-ref.rep, i]; rep. votes +- Nth[suite-ref.votes, i]; rep.class +- Open[rep.ref, te, ring. guards]; IF rep. class = Error['NotFound] THEN Not-Found[rep.votes]; IF Is[rep.class, Error-Type] THEN ReturnO; rep. version +- IF Is[rep.ref, Volume] THEN 0 ELSE class Note-Representative[rep]; ]; Note-Representative[rep: Representative] +- Critical[suite-Iock, 'Prog[ -- a representative has responded to our inquiry IF rep. version > current THEN current +- rep.version; suite +- Append[suite, rep]; Broadcast[crowd-Iarger]; ]]; I GetVersionO; 0; Not-Found[votes: Integer] +- Critical[suite-Iock, 'Prog[ 0; -- a representative was not found not-found-votes +- not-found-votes + votes; IF not-found-votes=w]]; The reference produced by Suite-Create is a file reference that contains a volume suite. Expand-Suite is used by Initiate-Inquiries to convert such references into normal form. 58 CHAFfER 5: REPLICATION Expand-Suite[ref: Suite / expanded: Normal-Suite] +- Prog[ [f: File; v: Volume; nv: Volume; reps: List[Reference]]; IF Is[ref, Normal-Suite] THEN Return[ref]; -- ref is a result of Suite-Create nv +- Normalize[ref.volume, te]; reps +- NIL; FOR v IN nv.reps DO [ fjd +- ref.id; f.volume +- v; f +- Create[File]; -- if volume rep is a volume suite, make file rep a file suite IF Is[v, Suite] THEN Add-Type[f, Suite]; reps +- Append[reps, f]; ]; Return[Create-Suite[refjd, nv.r, nv.w, reps, nv.votes]]; ]; A Close request is applied to every open representative. Suite-Close[] +- Critical[suite-Iock, 'Prog[ []; Return[Apply-Quorum[suite]]; ]]; Collect-Quorum normally gathers a quorum with a specified number of votes and returns immediately to its caller. All of the representatives in the quorum are not guaranteed to be current If a quorum of representatives have not reported their version numbers all that Collect-Quorum can do is wait for a representative to respond. Quorum sizes are the minimum number of votes that must be collected to guarantee correct operation. However, quorums can always be expanded by adding additional representatives. In the model implementation all eligible representatives are included in every quorum. Collect-Quorum[threshold: Integer/ quorum: List[Representative]] +- Prog [ []; oo[ -- if the suite does not exist, return an error IF Suite-Not-Found[] THEN Return[Error['NotFound]]; quorum +- Collect[threshold]; IF Not[Null[quorum]] THEN Return[quorum]; -- if we can't get a quorum, just wait Wait[crowd-Iarger]; ]; ]; Collect[threshold: Integer / quorum: List[Representative]] +- Critical[suite-Iock, 'Prog [ [x: Representative; votes: Integer]; -- returns a quorum or NIL votes +- 0; FOR x IN suite 00 votes +- votes + x.votes; IF votesw THEN Return[quorum]; -- could not get a quorum Return[NIL]; ]]; An obsolete representative must be updated with a consistent version of the suite. The model implementation achieves this by not allowing the suite to change while an obsolete representative is being updated. Alternatively, an obsolete representative could be updated in a separate transaction. Update-Obsolete-RepresentativeO +-, Prog [ [rep: Representative]; rep +- Select-Obsolete[]; IF Not{Null[rep)) THEN [ rep.c1ass I Copy[suite-ref]; U pdate-Done[rep]; ]; ]; Select-Obsolete[1 rep: Representative] +- Critical[suite-Iock, 'Prog [ FOR rep IN suite DO IF rep.version#current THEN -- We found an obsolete representative. 60 0; CHAPTER 5: REPLICATION Retum[rep]; Return[NIL]; ]]; Update-Done[rep: Representative] +- Critical[suite-lock, 'Prog[ []; -- mark representative as being current rep. version +- current; ]]; The above functions comprise the model implementation of the suite algorithm. There are many refinements that can be made to the suite we have described, and some of them are given at the end of this chapter. For example, when a suite is closed inquiry processes can optionally be stopped. The details of how this can be accomplished were not shown. 5.3 Correctness Arguments We argue that if a representative has the highest version number it has received all of the updates to the suite. The set of representatives that we update to make version v+ 1 all have version v. The desired result follows by simple induction. We argue that the replication algorithm preserves the properties of transactions. Representatives are stored in transactional storage, and we base our arguments on the properties transactional storage is guaranteed to exhibit (totality) Suite updates are transformed to representative updates. Representative updates are guaranteed totality. Thus, suite updates are guaranteed totality. (consistency) We show that the suite algorithm preserves serial and external consistency (Section 3.1.2) by showing that it produces schedules that are equivalent to the unreplicated case's schedules. Let S be a schedule, and S' be a schedule that results when we change object 0 to be a suite. We wish to show that DEP(S)= DEP(S'). Reconfigu ra ble Ref fo r I Reconfigurable Ref for(FI} I Indirect Reffor(VX} ""r Ref for (FI) '" Ref for (VX) I I I I r Reconfigurable Ref for I Indirect Ref for (F) I I ! ",. V ~ J """" ,'" \ ,'" Ref for (F) File Index Volume Structure of a Reconfigurable Volume Figure 6.2 68 CHAPTER 6: RECONFIGURATION contents of the object is copied by its new implementor. Second, the indirect reference is changed to point at the new implementor. Reconfigure[new-ref: Reference] ... Prog[ []; for files and indexes only ic ... Open[new-ref, tc, ring]; ic I Copy[ref]; ic I Close[]; superclass I ChangeReference[new-ret]; ]; Whenever a file is created on a reconfigurable volume, it is made reconfigurable. The volume's ,file index is used to implement the necessary indirection. Reconfigure-Create-File[] ... Prog[ [file-ref: File-Reference]; file-ref ... Apply[superclass, request]; IF Is[file-ref, Error-Type] THEN Return[file-ret]; Return[Create-Reconfigurable[file-ref, ref.file-index, tcU; ]; As shown in Figures 6.4 through 6.6, reconfiguring a volume occurs in three steps: 1. All of the files on the volume are moved to their new home. This is accomplished by moving the files that are listed in the file index one by one. 2. The file index is moved to the new volume. 3. The indirect pointer to the current implementor of the reconfigurable volume is switched to point at the new volume. Contention for a reconfigurable 'volume may cause a transaction that is performing a reconfiguration to abort Step 1 can be split up into many smaller transactions, each of which would reconfigure a single file. 69 Reconfigu ra ble Refe rence I Indirect Ref I , lo. I ~ Reference Old Implementor Index V' New Implementor Step 1 : Copy State Reconfigurable Reference I Indirect Ref , 1 lo. I Old Implementor Reference Index lo. -, New Implementor Step 2: Change Pointer File and Index Reconfigu ration Algorithm Figure 6.3 70 Old Volume File Reference Reference Reference File Reference File Index Index Reconfigurable Volume Indirect Volume Indirect Index Volume Ref Index Ref New Volume L----I~ File File Step 1 of Volume Reconfiguration: Move Files Figure 6.4 71 Index Reconfigurable Volume Indirect Volume Volume Ref Indirect Index Index Ref Old Volume File Reference Reference File Reference File Index File '----I~ File New Volume Step 2 of Volume Reconfiguration: Move File Index Figure 6.5 72 Index Reconfigurable Volume Indirect Volume Indirect Index Volume Ref Index Ref Old Volume File Reference Reference File Reference File Index File L.-----1.... File New Volume Step 3 of Volume Reconfigu ration: Change Volume Pointer Figure 6.6 73 CHAPTER 6: RECONFlGURA TION Reconfigure-Volume[new-volume: Volume] +- Prog[ [old-home: Reference; new-home: Reference; entry: Entry; vc: Volume-Class; ic: Index-Class]; -- reconfigure all of the files on the volume ic +- Open[ref.file-index, tc, ring]; vc +- Open[new-volume, tc, ring]; entry +- ic I Enumerate[NIL]; -- move all of the files WHILE Not[Null[entry]] DO [ -- move one file: first construct a reference rfr +- Create[Reconfigurable]; rfr.ref +- Create[Indirect]; rfr.ref.index +- ref.file-index; rfr.ref.indirect-id +- entry.entry-name; open current implementor of file fc +- Open[rfr, tc, ring]; -- create new implementor of file new-home +- vc I Create-File[fc I GetID[], T]; move file fc I Reconfigure[new-home]; fc I Close£]; entry +- ic I Enumerate[entry]; ]; now move the file index new-home +- vc I Create-File[ic I GetID[], T]; new-home +- Create-Index[new-home]; ic I Reconfigure[new-home]; -- change the volume's configuration superc1ass I ChangeReference[new-volume]; -- all done. ic I Close£]; vc I Close[]; ]; 6.3 Refinements 6.3.1 Reducing Data Movement Copy could be modified to return if its source and destination were already identical. Copy could accomplish this by checking their unique identifiers and version numbers. This could significantly reduce unnecessary data movement. Consider the case of a file suite that is reconfigured to change its vote assignments. It may be that no representatives need to be updated. In this case, with the proposed improvement to Copy, the Reconfigure operation would not cause any data movement. 6.3.2 Eliminating the File Index The file index for a reconfigurable volume could be eliminated by storing a file's index entry in its first few pages. When such a file was opened, it would either contain a pointer to itself or to its new location. The reason that we did not use this scheme is that if a file that is part of such an 74 CHAPTER 6: RECONFIGURATION indirect chain was unavailable, it would be impossible to find the new home of a file. By using an index we avoided this problem at some expense. 6.4 Summary Reconfiguration was accomplished with indirection and transfer of state at the time an object is reconfigured. All of the information containing objects we have defined - volumes. files, and indexes - can be made dynamically reconfigurable. Exercise 1. Discuss how a volume reconfiguration could use a transaction for each file or index that had to be moved. 75 Chapter 7: Protection Until this point we have assumed that users are perfectly trustworthy. This chapter relaxes this assumption by describing a protection mechanism that can be used to protect client information. The mechanism can be used directly for the protection of small objects such as data base entries, and it can be used to implement popular protection policies for larger objects. Protection can be considered to consist of four distinct components: secrecy (ensuring that information is only disclosed to authorized users), authentication (ensuring that information is not forged), integrity (ensuring that information is not destroyed), and availability (ensuring that access to information can not be maliciously interrupted). This chapter describes a new protection mechanism called cryptographic sealing that provides primitives for secrecy and authentication. The mechanism is enforced with a synthesis of conventional cryptography, public-key cryptography, and a threshold scheme. The new mechanism is based on the idea of sealing an object with a key. Sealed objects are self-authenticating, and in the absence of an appropriate set of keys, only provide information about the size of their contents. Thus, keys are the basic unit of secrecy and authentication in the mechanism. New keys can be freely created at any time, and keys can also be derived from existing keys with operators that include Key-Or and Key-And. These operators allow protection structures to be established that allow any member of a set of keys to unseal an object, that require every member of a set of keys to unseal an object, or any combination of these extremes. This flexibility allows cryptographic sealing to implement common protection mechansims such as capabilities, access control lists, and information flow control. Objects and sealed objects are simply arrays of bytes. In order to update the value of a sealed object it is necessary to unseal it, change its value, and reseal it. Clients must operate in a secure environment so they can safely manipulate unsealed objects. A simple way of providing such an environment would be for each client to execute on a separate physical processor. For the purposes of this discussion we will assume that every client executes on a secure processor that is protected from every other client (see Section 2.1.1). Our description of the protection mechanism and its applications is organized into six sections. The first section describes some preliminaries, including the general framework of the mechanism and the cryptographic methods that we use. The second section covers cryptographic sealing. The third section shows how cryptographic sealing can be used to implement capabilities, access control lists, information flow control, secure processors, and revocation. The fourth section examines some practical considerations. The fifth section is a comparative analysis of cryptographic sealing and traditional protection mechanisms. The chapter ends with a brief conclusion. 76 CHAPTER 7: PROTECTION 7.1 Preliminaries 7.1.1 Framework In order to compare the present work wIth previous protection systems [Saltzer and Schroeder 79] we need to provide an appropriate framework. We will call a protection mechanism active if it is placed between a client and protected information, as shown in Figure 7.1. An active protection mechanism serves to inhibit unauthorized client requests to storage. If a client could bypass an active protection mechanism, it· could gain unauthorized access to protected information. Thus, active protection mechanisms depend on a security envelope that a client can not penetrate. As shown in the figure, system administrators and operators are inside of this envelope, because they have direct physical access to the system. The protection mechanism this chapter introduces is the first example of a general purpose passive protection mechanism. It is passive because there are no restrictions placed on a client's access to storage. However, a client is only able to decipher information that it is authorized to see. As shown in Figure 7.2, a passive protection system operates as part of the client. A passive protection system can only address the secrecy and authentication aspects of computer security. Section 7.3 describes how an active protection mechanism can be used to supplement a passive system to provide integrity and availability. When passive and active systems are used together in this manner a failure of the active protection system only effects integrity and availability. Secrecy and authentication are still guaranteed by the passive protection system. In a similar way, if the ability to revoke privileges is desired, an active protection system would be used as described in Section 7.3.4. Some of the ideas in this chapter have appeared in other forms. Morris [Morris 73] discussed the concept of sealing objects. but he did not present a way to create general protection structures, and his mechanism was not enforced by cryptography. Chaum and Fabry [Chaum and Fabry 78]. Lindsay and Gligor [Lindsay and Gligor 78], and Needham [Needham 79] independently observed that cryptography can be used to authenticate objects such as capabilities. Gudes [Gudes 80] described a way to use cryptography to implement a form of access control lists without groups or indirect keys. 7.1.2 Environment We describe the building blocks of the protection mechanism in detail because with a thorough knowledge of these facilities the reader will find it easier to understand the sections that follow. The three facilities that the protection mechanism uses are cryptography, a threshold scheme, and checksums. It is difficult to motivate all of these facilities in the abstract, and thus we ask the reader to be patient until the following section where it will become clear why they were introduced. Cryptography Cryptography is used to encrypt information to be protected. also called cleartext. into ciphertext. Encryption is a transformation from the space of possible cleartexts to the space of possible ciphertexts. The transformation selected depends on the key supplied to the encryption function. To decrypt. or recover the cleartext of a ciphertext, requires a specific key so the 77 Client Protection Mechanism Security Envelope An Active Protection Mechanism Figure 7.1 78 Storage Storage Client Protection Mechanism A Passive Protection Mechanism Figure 7.2 79 CHAPTER 7: PROTECfION encryption transformation can be inverted. A cryptographic system is called perfectly secure if all that can be discerned from a ciphertext is that the ciphertext exists, and that the corresponding cleartext is not longer than the ciphertext. Perfect security can only be achieved when the entropy of a key is at least as great as the entropy of information encrypted [Shannon 49]. Thus, perfectly secure systems are usually not practical because a key must be at least as long as the cleartext it is used to encrypt. A cryptographic system is called computationally secure if even when enough infonnation is theoretically available to break the system, the amount of computation required to do so is unreasonable. Unfortunately, proving facts about the computational complexity of many interesting algorithms is beyond the reach of current theory. Thus, most practical cryptosystems can not be proven to be computationally secure. In practice, the best that can be done is to invest a substantial amount of effort in trying to think of a way to break a system; if the system survives such an attack, then it is considered to be secure for practical purposes. As unsatisfying as this may be, it is the current state of the art The goal of a cryptanalytic attack is to find the key that was used to encrypt a ciphertext. Attacks are classified by the information an intruder has to aid him. It is assumed that an intruder knows the workings of the cryptosystem he is attacking. As the name implies, in a ciphertext-only attack an intruder only has ciphertext. I n a known-cleartext attack an intruder has both the ciphertext and the cleartext that was used to generate it. In achosen-cleartext attack an intruder can choose text to be encrypted and observe the resulting ciphertext Experience has shown that a cryptosystem that is known to be vulnerable to a chosen-cleartext attack should not be used. Two generic types of cryptosystems are used to implement cryptographic sealing. As we introduce each cryptosystem we will provide enough background so the reader can understand its properties. For more infonnation about cryptography see Diffie and Hellman [Diffie and Hellman 79]. The framework of the protection system is very general, and it is designed to accommodate new cryptosystems as they are discovered. The interfaces to the cryptosystems are described in general terms, allowing stronger implementations of the systems to be adopted as they become available. Conventional Cryptography In a conventional cryptosystem a cleartext is encrypted with a key, and the same key is used to decrypt the resulting ciphertext The following functions implement conventional cryptography. Create-Conventional-Key[/key: Byte-Array] Create-Conventional-Key returns a random key that can be used in conjunction with the conventional encrypt and decrypt functions. All of our key creation functions are based on a true random bit generator. Conventional-Encrypt[clear: Byte-Array, key: Byte-Array I cipher: Byte-Array] Conventional-Encrypt encrypts clear with key, producing cipher. Conventional-Decrypt[cipher: Byte-Array, key: Byte-Array I clear: Byte-Array] Conventional-Decrypt decrypts cipher with key, producing clear. 80 If the key CHAPTER 7: PROTECfION specified was not used to encrypt cipher, the value of clear is undefined. Public-Key Cryptography In a public-key cryptosystem the key generation function returns a pair of keys. Call the keys of such a pair keya and keyb. Only keyb can decrypt ciphertext enciphered with keya, and only keya can decrypt ciphertext enciphered with keyb. This is in contrast to a conventional cryptosystem, where a single key is used for both encryption and decryption. Public-key cryptosystems were first suggested in the open literature by Diffie and Hellman [Diffie and Hellman 76]. The name public-key resulted from the observation that with such a system certain keys could be made public, solving in part the problem of key distribution. Publickey cryptosystems have also been called asymmetric cryptosystems. Rivest, Shamir, and Adleman [Rivest et al. 78] described the first practical implementation of a public-key system. The only cryptanalytic approaches currently known for breaking their scheme are at least as computationally complex as factoring extremely large numbers. The exact semantics we chose for our definition of public-key cryptography were motivated by the Rivest, Shamir, and Adleman proposal. Some public-key cryptosystems allow ciphertext enciphered with keya to be decrypted with keyb, but not the converse. In this case we will assume that two key pairs from such a system are used to implement one of our public-key pairs to provide the appropriate semantics. The following functions implement public-key cryptography. PK-Pair +- Record[keya: Byte-Array, keyb: Byte-Array]; Create-PK -Pair[/pair: PK -Pair] Create-PK-Pair computes a pair of public keys. The keys are random, in the sense that they are a function of a true random number. PK-Encrypt[clear: Byte-Array, key: Byte-Array / cipher: Byte-Array] PK-Encrypt encrypts clear with one of the members of a key pair, and it returns cipher. PK -Decrypt[cipher: Byte-Array, key: Byte-Array / clear: Byte-Array] PK-Decrypt decrypts cipher with key, producing clear. correct, the value of clear is undefined. If the key specified is not Threshold Scheme A threshold scheme allows a datum D to be divided into n pieces, such that any k pieces are sufficient to reconstruct D but complete knowledge of any k-1 pieces reveals no information about D [Blakley 79, Shamir 79]. A practical implementation of this system was first demonstrated by [Shamir 79]. The following functions implement a threshold scheme. Threshold-Pieces +- Record[pieces: List]; 81 CHAPTER 7: PROTECfION Threshold-Split[clear: Byte-Array, n: Integer, k: Integer / piece-list: Threshold-Pieces] Threshold-Split takes clear and logically splits it into n pieces, any k of which are sufficient to recover clear. These pieces are returned as an n element list Threshold-Recover(piece-list: Threshold-Pieces / clear: Byte-Array] Threshold-Recover takes a list of pieces, and if k pieces are available, it will return the original value of clear. Elements on the input list that are not threshold pieces are ignored. If k pieces are not available, Threshold-Recover returns Error['Fai/ed}. Checksums A checksum function maps arbitrary input values into a comparatively small set of output values, such that independent input values have a small probability of being mapped into the same output value. I f a c bit checksum is implemented by a cyclic code, it can be proved that the fraction of independent values that have the same checksum is T C [Peterson and Weldon 72, p. 229]. Conventional encryption can also be used to create checksums. One way of creating an encryption based checksum is to use a cipher block chain technique [NBS 80). Add-Checksum[x: Any / cx: Checksummed-Object] Add-Checksum returns a copy of x and x's checksum. Check-Checksum[cx: Checksummed-Object / x: Any] Check-Checksum checks the checksum of cx, and if the checksum is correct, it returns the value of x it holds. If the checksum of ex is incorrect, Check-Checksum returns Error['Fai/ed}. 7.2 Cryptographic Sealing Our basic protection mechanism, cryptographic sealing, is described in three stages. First, we provide a model of what the mechanism does. Second, we show how the mechanism works. Third, we discuss the extent to which the mechanism can be trusted. In the sections that follow, we extend the notion of keys. These extended keys are similar enough to cryptographic keys that we did not coin a new tenn, but the reader should be aware that they are not precisely the same. 7.2.1 Model As we have mentioned, keys are the basic unit of secrecy and authentication in the protection mechanism. There are two ways to generate a key. It is possible to generate a brand new random key that does not depend on any other keys; such a key is called a base key. It is also possible to generate a key that is a function of existing keys; such a key is called a derived key. Derived keys are used to implement protection structures that can not be realized with base keys alone. The keys required to unseal an object depend on the structure of the key used to seal the object For example, to unseal an object that was sealed with Key-Or[ka, Key-And[kb, kc}} one needs either ka or both kb and kc. To be as explicit as possible, we introduce the unseals relation between a set of keys and a key. As we shall see in a moment, if a set of keys S unseals key k, .' 82 CHAPTER 7: PROTECfION then S can be used to unseal any object that has been sealed with k. If key set S unseals k we will write S >> k. If S >> k we also say that k is unsealed by S. Seal and Unseal are the primitive functions of the protection mechanism. Seal seals an object with a key, and Unseal reverses the sealing process. The type Key is used below to denote either a base or a derived key. Seal[x: Any, k: Key I sx: Sealed-Object] Seal seals x with k, returning a sealed object modified. Seal[x, NIL} is x. Seal operates by value; x and k are not Unseal[sx: Sealed-Object, ks: List[Key], tc: TC I x: Any] Unseal unseals sx with the set of keys contained in the key set ks. If Unseal is successful, it returns x (the original input value to Seal). Otherwise Unseal returns Error['Fai/ed]. Unseal does not modify sx. The protection mechanism provides two properties. The secrecy property states that a sealed object is useless to someone who does not have a set of keys that unseals the key that was used to seal the object The authentication property states that Unseal will only return values that were in fact properly sealed. x can be recovered from Seal[x, k} with a set of keys S if and only if S Secrecy. » k. Authentication. If x was not sealed with a key that is unsealed by S then the result of Unseal[x, S} will be Error['Failed}. The following functions create keys, and completely define the unseals relation. Create-Base-Key[1 k: Key] Create-Base-Key creates a new regular base key. Regular base keys are the simplest kind of keys. To unseal Seal[x, k} requires k. In other words, a key set unseals k if and only if it contains k. (S Key-Pair +- » k) == (k E S) Record(keya: Key, keyb: Key); Create-Key-Pair[1 kp: Key-Pair] Create-Key-Pair creates a pair of base keys. These keys are related to each other in the following way. To unseal Seal[x, keyaJ requires keyb, and to unseal Seal[x, keyb} requires keya. In other words, a key set unseals keya if and only if it contains keyb, and a key set unseals keyb if and only if it contains keya. (S (S >> » keya) _ (keyb E S) keyb) _ (keya E S) Key-And[ka: Key, kb: Key I dk: Key] Key- And creates a derived key that is the logical and of ka and kb. To unseal Seal[x, dk} requires either dk or a set of keys that will unseal both ka and kb. That is, a key set 83 CHAPTER 7: PROTECTION unseals dk if and only if it unseals both ka and kb or it includes dk. (S » dk) _ «S » » ka) 1\ (S kb» V (dk E S) Key-Or(ka: Key. kb: Key / dk: Key] Key-Or creates a derived key that is the logical or of ka and kb. To unseal Seal[x, dkJ requires either dk or a set of keys that unseals ka or kb. That is. a key set unseals dk if and only if it unseals ka or kb or it includes dk. (S » dk) = (S » ka) V (S » kb) V (dk E S) Key-Quorum[q: Integer. key-list: List[Key] / dk: Key] Key-Quorum creates a derived key. key-list is a list of an arbitrary number of keys. To unseal Seal[x. dkJ requires either dk or a set of keys that unseals q distinct keys from keylist. In other words. a set of keys unseals dk if and only if it includes dk or if it unseals q distinct keys from key-list. Note that Key-Quorum can be used to implement Key- And and Key-Or (with q= 2 and q= 1 respectively). but as we shall see. the implementations of KeyAnd and Key-Or are more efficient than Key-Quorum. For some q combination k j , ... , kq drawn from key-list; (S >>dk) =(1\ 1 < i < q S >>ki ) V (dk E S) Submaster(k: Key / dk: Key] Submaster creates a derived key. To unseal Seal[x, dkJ requires either dk or a set of keys that unseals k. In other words, a key set unseals dk if it unseals k or if it includes dk. (S » dk) == (S » k) V (dk E S) Seal-Only[k: Key / dk: Key] Seal-Only creates a derived key. To unseal Seal[x, dkJ requires a set of keys that unseals k. In other words. a key set unseals dk if and only if it unseals k. A key set that includes dk is not sufficient to unseal dk. (S » dk) = (S » k) Create-Indirect-Key[k: Key, tc: TC / ik: Indirect-Key] Create-Indirect-Key creates an indirect key. To unseal Seal[x, ikJ either requires ik or a set of keys that unseals k. In other words, a key set will unseal ik if it contains ik or if it unseals k. Once an indirect key has been created, it can be changed with Change-IndirectKey. Because indirect keys can be altered, the following statement is an implication. (S » k) V (ik E S) -+ (S » ik) Change-Indirect-Key[ik: Indirect-Key, nk: Key. tc: TC] Change-Indirect-Key changes an indirect key. After Change-Indirect-Key has been performed, to unseal an object that has been sealed with ik either requires ik or a set of keys that unseals nk. In other words, ik is changed such that if a key set unseals nk it now 84 CHAPTER 7: PROTECfION also unseals ike 7.3.4. (S » Change-Indirect-Key does not provide revocation as discussed in Section nk) -+ (S » ik) Let us consider an example to make the idea of an indirect key clear. following occurs: Imagine that the kevin +- Create-Base-Key[]; ik +- Create-Indirect-Key[kevin]; sx +- Seal[x, ik]; At this point sx can be unsealed with either ik or kevin. The indirect key is then updated: harry +- Create-Base-Key[]; Change-Indirect-Key[ik, Key-Or[kevin, harry]]; After the Change-Indirect- Key occurs, ik, harry. or kevin are required to unseal sx. If S » kl and S » k2, then Unseal[Seal[Seal[x, kll. k2l. Sl will be Seal[x. kll. In order to recover an object that may have been sealed many times R Unseal can be used. RUnseal[sx: Sealed-Object, ks: List[Key], tc: TC / x: Any] RUnseal applies Unseal as many times as necessary until its result is not sealed. RUnseal always applies Unseal at least once. RUnseal-List[sl: List[Sealed-Object], ks: List[Key], tc: TC / Ix: List[Any]] RUnseal-List is a function that when passed two lists [al ... an] [kl ... kn] returns [RUnsea4.al. kl. tc] ... RUnsea4.an. kn. tc]] Seal-List[lista: List[Any], listb: List[Key] / slist: List[Sealed]] Seal-List is a function that when passed two lists [al ... an] [kl ... kn] returns [Sea4.al.kl] ... Sea4an. kn]]. NA-Unseal[sx: Sealed[Any], kl: List[Key], tc: TC / x: Any] NA-Unseal unseals sx with RUnseal if it is sealed, otherwise it returns sx. Unseal does not provide authentication. Thus, NA- NA-Unseal-List[sx: Sealed[Any], kl: List[Key], tc: TC / x: Any] The same as RUnseal-List, except that is uses NA-Unseal. 7.2.2 Basic Algorithm We present the basic algorithm by first outlining the principles of the mechanism and then presenting detailed descriptions of the major functions. The protection mechanism operates by arranging that precisely enough information is included in a sealed object to allow an appropriate set of keys to unseal it. We will first consider base keys and then show how this is arranged for derived keys. Base keys are directly implemented by cryptography. If an object is sealed with a base key, 85 CHAPTER 7: PROTECTION then Seal encrypts the object with the key. To provide the authentication property, a checksum is added to the object before it is encrypted. Unseal uses the checksum to tell if it has a legitimate sealed object For example, assume that kb is a base key. Seal[x. kbJ is transformed to: Encrypt[Add-Checksum[x], kb]. Base keys are assigned unique identifiers which are included in the result of an Encrypt operation. Thus, it is easy for Unseal to determine which key can be used to recover the contents of a sealed object A derived key consists of two fields: key and opener. When an object is sealed with a derived key dk, it is sealed with dk.key, and dk.opener is carried along in the sealed object Seal[x. dkJ is transfonned to: [Seal[x, dk.key]. dk.opener]. The central idea is that the opener of a derived key is carefully constructed so that if S >> dk, then it is possible to compute a key that unseals dk.key from dk.opener and S. Figure 7.3 gives several examples of sealed objects. The first example shows what happens when an object is sealed with a base key. The object is encrypted (as suggested by the heavy box), and marked with the key that can be used to decrypt the object. Although it is not shown in the figure, the object in the box includes its checksum for authentication. The next three examples show how derived keys work. It should be clear from the illustration how the opener expresses the protection structure that corresponds to its Seal statement. The following functions implement the protection mechanism. Figure 7.4 is a diagram of the dependency relationships between the functions, which the reader may find helpful. Simple base keys are implemented by conventional cryptography. A simple base key consists of a unique identifier and an encryption key. Simple-Key +- Record[id: UniqueID, key: Byte-Array]; Create-Base-Key[/k: Key] +- Prog[ []; k +- Create[Simple-Key]; k.key +- Create-Conventional-Key[]; k.id +- GetUniqueID[]; Retum[k]; ]; Key pairs are implemented by public-key cryptography. Like a conventional key, a Key-PairHalfhas fields for an identifier and akey. A Key-Pair-Halfalso includes the identifier of the key that will decrypt it Key-Pair-Half +- Record(id: UniqueID, decryptedBy: UniqueID, key: Byte-Array]; Create-Key-Pair[1 kp: Key-Pair] +- Prog[ (pkp: PK-Pair]; kp +- Create[Key-Pair]; kp.keya +- Create[Key-Pair-Half]; kp.keyb +- Create[Key-Pair-Half]; pkp +- Create-PK -Pair[]; -- set up keya kp.keyajd +- GetUniqueID[]; kp.keya.decryptedBy +- GetUniqueID[]; 86 Seal[Object, key] II Object -i I ....- - - k e y : : : . J Seal[Object, Key·And[keya, keyb]] opener II Object 1 I IL.::C ikey ....._ _ _ ikey:J II keya - _..... - - - - - keyb Seal[Object, Key·Or[keya, keyb]] opener II Obje~:ey nII ikey keya UI.-.I_ik_eykeyb U Seal[Object, Key·And[keya, Key·Or[keyb, keyc]]] opener opener II Object -I I IL.::C ikey2 :.J .....- - - ikey2 II keya _ _...... II ikeY10 II ikeyl ..._ _ _ keYb:J -----ikeyl Example Sealed Objects Figure 7.3 87 jl -.- - - keyc : J Seal-Only Change-I ndirect Create-Indirect Seal Unseal / Create-Key I I Encrypt Decrypt X Public-Key Cryptography Key-Quorum " Classical Cryptography Protection System Internal Structure Figure 7.4 88 Threshold Scheme CHAPTER 7: PROTECfION kp.keyakey +- pkp.keya; -- set up keyb kp.keyb.id +- kp.keya.decryptedBy; kp.keyb.decryptedBy +- kp.keyajd; kp.keyb.key +- pkp.keyb; Retum[kp]; ]; Before we discuss Seal and Unseal, let us introduce two cryptographic functions that work with base keys. The Encrypt function encrypts an object with a base key, and includes in its result the unique identifier of the key that will decrypt the resulting ciphertext The Decrypt function takes an object that has been encrypted and attempts to decrypt it with the aid of a set of keys. Ciphertext +- Record[decryptedBy: UniqueID, text: Byte-Array]; Encrypt[clear: Any, k: Key / cipher: Ciphertext] +- Prog[ D; cipher +- Create[Ciphertext]; -- encrypt clear with k, producing cipher IF Is[k, Key-Pair-Half] THEN [ cipher.decryptedBy +- k.decryptedBy; cipher.text +- PK-Encrypt[Encode[clear], k.key]; ] ELSE [ cipher.decryptedBy +- k.id; cipher.text +- Conventional-Encrypt[Encode[clear], k.key]; ]; Retum[cipher] ; ]; Decrypt[cipher: Ciphertext, ks: List[Key] / clear: Any] +- Prog[ [k: Key, rk: Key]; -- decrypt cipher with one of the keys in ks clear +- Error['Failed]; FOR k IN ksOO [ -- If a derived key, get the base key hidden inside. rk +- IF Is[k, Derived-Key] THEN k.key ELSE k; -- see if we have a match IF Is[rk, Simple-Key] THEN [ IF cipher.decryptedBy =rk.id THEN clear +- Decode[Conventional-Decrypt[cipher.text, rk.key)); ]; IF Is[rk, Key-Pair-Half] THEN [ IF cipher.decryptedBy = rk.id THEN clear +- Decode[pK-Decrypt[cipher.text, rk.key)); ]; ]; Retum[clear]; ]; Seal works in the following way. It first tests the type of key that it has been passed. If it is a base key then Encrypt is used to encrypt the object and a checksum of the object The checksum is 89 CHAPTER 7: PROTECTION used by Unseal to authenticate the sealed object If Seal is passed a derived key it uses the base key contained in the derived key to seal the object, and it includes the opener of the derived key in its result Derived-Key +- Record[key: Key. opener: List[Sealed-Object]]; Sealed-Object +- Record[cipher: Any. opener: Any]; Seal[x: Any. k: Key / sealed-x: Sealed-Object] +- Prog[ []; IF NUll[k] THEN Return[x]; sealed-x +- Create[Sealed-Object]; IF Is[k. Derived-Key] THEN [ -- Include the derived key's opener in the sealed object sealed-x.cipher +- Seal[x, k.key]; sealed-x.opener +- k.opener; Return[sealed-x]; ]; -- Add a checksum for authentication. sealed-x.cipher +- Encrypt[Add-Checksum[x], k]; sealed-x.opener t- NIL; Return[sealed-x]; ]; Unseal reverses the sealing process. Unseal first checks to see what type of key was used to seal an object If the object was sealed with a base key, Unseal uses Decrypt to recover the object's contents. The result of Decrypt is authenticated by ensuring that the object's checksum is correct If the object was sealed with a derived key, the opener is unsealed, and the resulting key set is used to unseal the object Unseal[sealed-x: Sealed-Object, ks: List[Key], tc: TC / x: Any] +- Prog[ [opener: List[Key]; tp: Threshold-Pieces]; -- Ensure that the object was in fact sealed. IF Not[Is[sealed-x, Sealed-Object]] THEN Return[Error[,Failed]]; IF Null[sealed-x.opener] THEN [ -- object was sealed with a base key Return[Check-Checksum[Decrypt[sealed-x.cipher, ks]]]; ]; The object was sealed with a derived key. opener +- Normalize[sealed-x.opener, tc]; IF Is[opener, Threshold-Pieces] THEN [ -- The opener is a list of sealed key pieces. tp +- Create[Threshold-Pieces]; tp.pieces +- RUnseal-List[opener.pieces. ks, tc]; opener +- List[Decode[Threshold-Recover[tp]]]; ] ELSE The opener is a list of sealed keys. opener +- RUnseal-List[opener. ks, tc]; Return[Unseal[sealed-x.cipher, Append[opener. ks], tc]]; ]; 90 CHAPTER 7: PROTECfION The derived key dk that Key- And creates is simple in concept The opener of Key- And can be unsealed with a key set S only if S >> ka and S >> kb. ka and kb must not be a key pair. This is because with the implementation of public-key cryptography proposed by [Rivest et at. 78] encrypting with one and then the other is equivalent to encrypting and then decrypting. This restriction could be eliminated at the cost of a small increase in the size of the sealed object by changing the opener to be List[Sea/[Sea/[dk.key, ka], Submasterlkb}}}. Key-And[ka: Key, kb: Key / dk: Key] +- Prog[ []; dk +- Create[Derived-Key]; dk.key +- Create-Base-Key[]; dk.opener +- List[Seal[Seal[dk.key, ka], kb]]]; Retum[dk]; ]; Key-Or is implemented in the same manner as Key- And. One of the elements of the opener of Key-Or can be unsealed with a key set S only if S >> ka or S >> kb. Key-Or[ka: Key, kb: Key / dk: Key] +- Prog[ []; dk +- Create[Derived-Key]; dk.key +- Create-Base-Key[]; dk.opener +- List[Seal[dk.key, ka], Seal[dk.key, kb]]; Retum[dk]; ]; Key-Quorum creates a new base key and then splits the key into n pieces, where n is the length of the input key list Things are arranged so that any combination of q of these pieces can be used to reconstruct the new base key. The n parts of the base key are then sealed with the n input keys, creating an opener that will yield the base key when a key set is available that unseals q or more distinct input keys. Key-Quorum[q: Integer, kl: List[Key] / dk: Key] +- Prog[ []; dk +- Create[Derived-Key]; dk.key +- Create- Base- Key[]; -- Create the proper number of key pieces. dk.opener +- Threshold-Split[Encode[dk.key], Length[kl], q]; -- Seal the pieces with the elements of kl. dk.opener.pieces +- Seal-List[dk.opener.pieces, kl]; Retum[dk]; ]; Submaster is implemented by creating a new base key, and including the key in an opener sealed with k. Submaster[k: Key / dk: Key] +- Prog[ []; dk +- Create[Derived- Key]; dk.key +- Create-Base-Key[]; dk.opener +- List[Seal[dk.key, k]]; Retum[dk]; ]; Seal-Only ensures that only sets that unseal k will unseal dk. 91 This derived key is set up to CHAPTER 7: PROTECTION encrypt objects with one half of a key pair, and to hide the other half in an opener. The opener can only be unsealed with a key set S if S >> k. Seal-Only[k: Key / dk: Key] +- Prog[ [kp: Key-Pair]; kp +- Create-Key-Pair[]; dk +- Create[Derived-Key]; dk.key +- kp.keya; dk.opener +- List[Seal[kp.keyb, k]]; Retum[dk]; ]; Create-Indirect-Key creates a key that can be changed. This is accomplished by creating a derived key with an indirect opener. By changing the indirect opener, one can change the keys that unseal the indirect key. Create-Indirect-Key[k: Key, tc: TC / dk: Key] +- Prog[ 0; k +- Submaster[k]; -- create indirect key dk +- Create[Derived-Key]; dk.key +- k.key; dk.opener +- Create-Indirect[k.opener, tc]; Retum[dk]; ]; Change-Indirect-Key[dk: Key, nk: Key, tc: TC] +- Prog[ 0; -- if nk is NIL, delete key IF Null[nk] THEN Change-Indirect[dk.opener, NIL, tc]; ELSE -- replace opener with new one Change-Indirect[dk.opener, List[Seal[dk.key, nk]], tc]; RetumD; ]; RUnseal and NA-Unseal are implemented as the following functions. RUnseal[sx: Any, ring: List[Key], tc: TC / x: Any] +- Prog[ -- recursive unseal x +- Unseal[sx, ring, tc]; IF Is[x, Sealed] THEN Return[RUnseal[x, ring, tcU; Return[x]; ]; 0; NA-Unseal[sx: Any, ring: List[Key], tc: TC / x: Any] +- Prog[ 0; unauthenticated unseal -- sx does not have to be sealed x +- IF Is[sx, Sealed] THEN RUnseal[sx, ring, tc] ELSE sx; Return[x]; ]; 92 CHAPTER 7: PROTECfION 7.2.3 Strength We reason about the strength of the protection mechanism in two parts. First, assuming that the cryptographic systems we use are perfect, we demonstrate the security and authentication properties of the mechanism. Second, we examine the assumption that cryptographic systems are perfect, and suggest usage conventions that would make the protection mechanism less susceptible to cryptanalysis. Correctness Argument It is difficult to prove facts about systems that are based on cryptography because, as we discussed, it is difficult to show that cryptosystems have certain properties. Thus, what we will do is to make strong assumptions about the behavior of the cryptographic systems that we use, and show that the secrecy and authentication properties of the protection mechanism follow from these assumptions. First, we demonstrate the secrecy property of the protection mechanism, which is: Secrecy. x can be recovered from Seal[x, k] with a set of keys S if and only if S » k. Our first assumption has to do with the security of cryptography: PC. (Perfect Cryptography) Encrypt[x, k] reveals no information about k. If k is one half of a public-key pair, then x can only be recovered with the other half of the public-key pair, and if k is a classical key, then x can only be recovered with k. We interpret "can only be recovered" to mean a total lack of information in the information theoretic sense. PC is close enough to what is expected of a practical cryptosystem to make it a reasonable assumption. However, we know of one exception to PC. To recover x from Encrypt[Encrypt[x, kI], k2] should always require a key set S, such that S » kl and S » k2. In the public-key system proposed by [Rivest et at. 78] if kl and k2 are a public-key pair, then Encrypt[Encrypt[x, kI], k2] is x. In Key-And, where we will need to reason about an object that has been sealed twice, we will assume that kl and k2 are not a public-key pair. Our demonstration that the secrecy property follows from PC proceeds by induction on the structure of the key k. Our basis is the case where k is a base key. Basis. Assume that k is the result of Create-Base-Key or Create-Key-Pair. Seal[x, k] is transformed to Encrypt[Add-Checksum[x], k]. The secrecy property follows from PC. We now assume the secrecy property as our induction hypothesis, and consider each way that derived keys can be created as our induction step. Key-And. Assume dk is the result of Key-And[ka, kb]. dk is [k, [Seal[Seal[k, ka], kb]l1 where k is the result of a Create-Base-Key operation. Seal[x, dk] is [Seal[x, k], [Seal[Seal[k, ka], kbl1]. By the induction hypothesis we need k to recover x, and to recover k we need a set of keys that is admitted to ka and to kb. Thus, a set of keys S can recover x if and only if 93 CHAPTER 7: PROTECTION «S » kb) A (S » ka» V (dk E S). This is precisely the unseals relation for Key-And. Assume dk is the result of Key-Or(ka. kb). Key-Or. dk is [k. [Seal[k. ka). Seal[k. kb]]] where k is the result of a Create-Base-Key operation. Seal[x. dk) is [Seal[x. k). [Seal[k. ka). Seal[k. kb]]]. By the induction hypothesis we need k to recover x. and to recover k we need a set of keys that is admitted to ka or to kb. Thus. a set of keys S can recover x if and only if (S » ka) V (S » kb) V (dk E S). This is precisely the unseals relation for Key-Or. The arguments for Submaster and Create-Indirect are very similar to Key-And and Key-Or. and thus we will omit them. Assume dk is the result of Seal-Only[k). Seal-Only. dk is [ka. [Seal[kb. k]]] where [ka. kb) is the result of Create-Key-Pair. Seal[x. dk) is [Sea1[x. ka). [Seal[kb. k]]]. By the induction hypothesis we need kb to recover x. and to recover kb we need a set of keys that is admitted to k. kb only appears in openers. Thus. a set of keys S can recover x if and only if (S » k). This is precisely the unseals relation for Seal-Only. Key-Quorum. Assume dk is the result of Key-Quorum[q. List[kl .. kn)). dk is [k. [Seal(Pl' k 1) ... Seal(Pn' k n))) where k is the result of a Create-Base-Key operation. and Pl'" Pn are the result of Threshold-Split[k. n. q). Seal[x. k) is [Seal[x. k). [Seal(Pl' k 1) ... Sea1(Pn' k n]]]. By the induction hypothesis we need k to recover x. To recover k we need q distinct values of Pl'" Pn· By the induction hypothesis we can only recover q distinct values of Pl'" Pn with a set of keys that is admitted to q distinct keys in k 1 ... k n . Thus. a set of keys Scan recover x if and only if for some set of q distinct keys ka ... kq drawn from k 1 ... k n : «S »ka) A .. , A (S» kq )} V (dk E S). This is precisely the unseals relation for Key-Quorum. D. 94 CHAPTER 7: PROTECTION We will now demonstrate the authentication property of the protection mechanism, which is: Authentication. If x was not sealed with a key that is unsealed by S then the result of Unseal[x, S] will be Error['Failed]. To demonstrate the authentication property we need to make an assumption concerning the behavior of checksums: CHK. Assuming x and Encrypt[x, k] are known, it is intractable to compute Encrypt[AddChecksum[x], k] unless k is known. The argument we use is once again based on induction on the structure of a key that was used to seal an object As our basis we will assume that Unseal decides that an object has been sealed with a base key. Basis. Assume that Unseal decides that x has been sealed with base key k. By CHK, unless the client that sealed x had k it could not generate the correct encrypted checksum. If the checksum of Decrypt[x. kl] is not valid Unseal returns Error['Failedj. Now we assume the authentication property as the induction hypothesis, and show that if an object was sealed with a derived key the authentication property is also guaranteed. Induction Step. Assume that Unseal decides that x has been sealed with a derived key. It unseals what it thinks is an opener with kl. If the opener was sealed a key that is unsealed by kl, then this operation will return a new key list kl'. By the induction hypothesis this operation would return Error['FailedJ if kl' had not been sealed by a key that is unsealed by kl. Unseal then uses kl and kl'to unseal x. By the induction hypothesis if x was sealed by a key that is not unsealed by the union of kl and kl'then Unseal will return Error['Failedj. A key that is unsealed by kl was required to create the opener that contained kl'. Thus, if x was not sealed with a key that is unsealed by kl then the result of U nseal[x, kl] will be Error['Fai/edJ. D. Susceptibility to Cryptanalysis In the previous section we assumed that cryptography is perfect Of course it is not Often, breaking a practical cryptographic system is a matter of economics [Diffie and Hellman 77]. However, if some guidelines are followed, the susceptibility of our protection system to cryptanalysis can be reduced. The cryptosystems that are used must be secure against a known-cleartext attack. The checksums that are included in sealed objects and the implementation of Key-And increase the probability that an intruder will be able to use a known-c1eartext attack. It is wise to keep the amount of information protected with a single key to a minimum. This makes it more difficult for an intruder to perform a known-cleartext attack, and it reduces the vulnerability of the system to a single cryptanalytic success. Because keys are inexpensive, a client should use as many keys as are natural for its application. The strength of the cryptographic system used to protect information should correspond to the length of time the information needs to be kept secret For example, if certain information is going 95 CHAPTER 7: PROTECfION to be sensitive for twenty years, it would not be wise to protect it with a cryptosystem that can be broken in one year. Another metric is that the value of information protected with a cryptographic system should be considerably less than the cost of an attack on the system. Cryptographic sealing can accommodate the coexistence of a number of cryptosystems that have different key sizes and strengths. Thus, the strength of a cryptosystem can be matched to the sensitivity of the information it protects. Finally, there is no need to divulge information that might lead to a successful cryptanalytic attack to clients that do not need to know the information. For example, public keys can be protected from clients that do not need them. 96 7.3 Applications We now turn our attention to applications of cryptographic sealing. The first two sections show how cryptographic sealing and a simple active protection mechanism can implement a variety of popular protection mechanisms, including capabilities, access control lists, and information flow control. The third section demonstrates secure processors. The last section shows how secure processors can be used to implement revocation. 7.3.1 Privilege Establishment Key Rings Keys are used to represent privileges, and thus a list of keys defines a set of privileges. Each user has a personal list of keys, or key ring, that defines his privileges. When a key ring is stored, it is sealed with a key that only its owner knows. A user authenticates himself to the system by providing his key ring key. The key ring key is used to unseal the user's key ring in his processor, resulting in his list of privileges. A key ring key could be stored on a magnetic card, or perhaps transformed into an easily remembered sentence, such as "Ralph and George ran to the store on a rainy cold day with their Aunt Essie's dog Fred". Such a transformation could be accomplished with a context free grammar. A user's unsealed key ring is the third argument to Open, ring. Open is very careful with a user's key ring, and will not transmit it to any other processor. Encrypted Objects Cryptography is used to control access to information stored in files and indexes. An encrypted object E is an object that is encrypted with a certain conventional key K. Thus, possession of K gives a client the ability to access the information in E. Objects may be encrypted many times. Create-Encrypted[ref: Reference, k: Key / eref: Encrypted] Create-Encrypted creates an encrypted object. If the object is a file or an index, then to access the object a client must have a key that is admitted to k. Create-Encrypted assumes that the referent of ref will be totally overwritten and currently contains no information. Encrypted objects are implemented by creating a special reference that contains their key and a reference to their encrypted form. Encrypted ~ Record[ref: Reference, key: Key]; Create-Encrypted[ref: Reference, key: Key / eref: Encrypted] eref ~ Create[Encrypted]; eref.ref ~ ref; eref.key ~ key; Return [Add-Type[eref, Major-Type[ref]ll; ]; ~ Prog[ []; When an encrypted object is opened, a class structure is set up that will encrypt and decrypt the data portions of read and write requests. 97 CHAPTER 7: PROTECfION Open-Encrypted[ref: Reference, tc: TC, ring: List[Key], guards: List[Key] / c: Class] +- Prog[ [k: Key; ew: Function]; try to unseal key k +- NA-Unseal[ref.key, ring, tc]; -- failed, client does not have access IF Is[k, Error-Type] THEN Retum[k]; ew +- IF Is[ref, Index] THEN 'Encrypted-Index-Write ELSE 'Encrypted-File-Write; c +- Open[ref.ref, tc, ring, guards]; IF Is[c, Error-Type] THEN Retum[c]; c +- Create-Class[List[ 'Copy Reference, 'Default-Copy, 'Read, 'Encrypted-Read, 'Write, ew, 'Enumerate, 'Encrypted-Enumerate], c]; -- Instance variables: k, ref Retum[c]; ]; Encrypted-Read[1 value: Byte-Array] +- Prog[ []; value +- Apply[superclass, request]; Retum[Conventional-Decrypt[value, k.key]]; ]; Encrypted-Index-Write[entry-name: Byte-Array, value: Byte-Array] +- Prog[ []; IF Null[value] THEN Retum[superclass I Write[key, NIL]] ELSE Retum[superclass I Write[entry-name, Conventional-Encrypt[value, k.key]]]; ]; Encrypted-File-Write[startpage: Integer, pages: Integer, value: Byte-Array] +- Prog[ []; Retum[superclass I Write[startpage, pages, Conventional-Encrypt[value, k.key]]]; ]; Encrypted-Enumerate[last: Entry I next: Entry] +- Prog[ []; IF Not[Null[last]] THEN last value +- Conventional-Encrypt[lastvalue, k.key]; next +- superclass I Enumerate[last]; next value +- Conventional-Decrypt[next value, k]; Retum[next]; ]; Guarded Objects To provide integrity and availability we introduce a simple active protection mechanism. Imagine that each object is assigned a unique set of passwords, one for each of its independent privileges. We will call these passwords guards. Because each object has a unique set of guards, they must be stored with the object. For example, suppose file F is assigned write guard G. The processor that stores F would require that G be presented for each write access. Guards are presented in the fourth argument to Open, guards. Guards are checked at open time. For example, if a reference is opened for update (it is not a Read-Only reference) then Open checks for a write guard if one is required. Guards can be directly manipulated by a client, or the facilities described below can be used to help manage guards. The system defines a standard set of guard types. An Access guard must be provided before an 98 CHAPTER 7: PROTECfION object can be used in any way, a Read guard must be provided before an object can be read, a Write guard must be provided before an object can be written, a Create guard must be provided before an object will service a create operation, and a Change guard must be provided before an object will allow its guards to be changed. An object can implement a subset of these guard types, or it can choose to implement guards based on its special needs. c: Class I Set-Guard[gt: Type, gk: Simple-Key] Set-Guard sets a guard for the object serviced by c to be gk. The guard will protect the set of privileges specified by gt. If gk is NIL Set-Guard removes guards. ic: Index-Class I Set-Entry-Guard[entry-id: Byte-Array, gt: Type, gk: Simple-Key] Index entries can also be protected by guards. Set-Entry-Guard sets a guard for the index entry specified by entry-id. The guard will protect the set of privileges specified by gt. If gk is NIL Set-Entry-Guard removes guards. Protected Volumes In many cases clients would like to have the files that they create protected automatically. To this end we provide the notion of a protected volume. Files created on a protected volume assume a default protection structure specified by the protected volume. Create-Protected-Volume[ref: Reference, tl: List[Type], kl: List[Key] / pref: Protected] Create-Protected-Volume creates a reference for a protected volume. All files created on the protected volume will have the types of access controls in tl set, and the resulting privilege keys will be sealed with corresponding elements of kl. The elements of tl may be guard types, or they may be the type Encrypted, in which case the file will be encrypted. Protected ~ Record[ref: Reference, tl: List[Key], kl: List[Key]]; Create-Protected-Volume[ref: Reference, tl: List[Key], Protected] ~ Prog[ []; pref ~ Create[Protected]; pref.kl ~ kl; pref.ref ~ ref; pref.tl ~ tl; Retum[Add-Type[pref, Major-Type[ret]]]; kl: List[Key] / pref: ]; When a protected volume is opened the model implementation establishes a class structure that will set protection controls in response to create file operations, and that will properly copy a protected volume reference. Open-Protected[ref: Reference, tc: TC, ring: List[Key], guards: List[Key] / c: Class] ~ Prog[ []; c ~ Open[ref.ref, tc, ring, guards]; IF Is[c, Error-Type] THEN Retum[c]; Retum[Create-Class[List[ 'Copy Reference, 'Default-Copy, 99 CHAPTER 7: PROTECfION 'Create-File, 'Protected-Create], c]]; ]; Protected-Create[1 cref: Capability] ~ Prog[ []; [k: Key; cl: Class; priv: List[Sealed]; x: Cons[Type, Key]]; priv ~ NIL; -- create file cref ~ Apply[superclass, request]; open file cl ~ Open[cref, te, ring]; -- apply guards or encrypt file FOR x IN Pair[ref.tl, ref.gl] DO k ~ Create-Base-Key[]; IF Is[car[x] , Encyrpted] THEN cref ~ Create-Encrypted[cref, Seal[k, cdr[x]]]; ELSE [ cl I Set-Guard[car[x], k]; priv ~ Append[priv, Seal[k, cdr[x]]]; ]; ]; cl I Close[]; -- Capabilities are described in Section Return[Create-Capability[cref. priv]]; ]; 7.3.2 Common Protection Mechanisms All of the ingredients are now at hand for creating a large number of common protection mechanisms. We have represented privilege by the possession of keys, and these keys can be sealed such that only authorized clients can unseal them. We will treat the three major types of protection mechanisms in current use: capabilities, access control lists, and information flow control. Capabilities A capability is an unforgable ticket which permits a possessor to access the object it names with certain privileges [Dennis and Van Horn 66, Lampson 69]. No special privileges are required in our system to make a copy of a capability. Capabilities can be implemented by including a set of keys and guards in an object's reference. For example, imagine object 0 is encrypted with the conventional key K, and G is the write guard for object o. If a reference for 0 includes K, then it can be used to read O. If the reference also includes G, it can be used to read or write O. Without G or K it is impossible to read or write O. To the extent that keys and guards are considered impossible to guess, capabilities can be considered unforgable. A capability reference is shown in Figure 7.5. Create-Capability[ref: Reference, pI: List[Key] I cref: Capability] Create-Capability creates a capability for ref. pI is the list of keys that defines the set of privileges that the new capability will have. If any elements of pI are sealed, an attempt will be made to unseal them when the capability is opened. Capability ~ Record[ref: Reference, pI: List[Key]]; 100 Object Ref Key Write Guard Capability Reference Figure 7.5 101 CHAPTER 7: PROTECTION Create-Capability[ref: Reference, pI: List[Key] / cref: Capability] cref +- Create[Capability]; cref.pI +- pI; cref.ref +- ref; Return[Add-Type[cref, Major-Type[ret]]]; ]; +- Prog[ []; Open-CapabiIity[ref: Reference, tc: TC, ring: List[Key], guards: List[Key] / c: Class] +- Prog[ [PI]; pI +- NormaIize[ref.pl, te]; pI +- NA-Unseal-List[pl, Append[ring, guards], te]; -- expand privilege keys by pI c +- Open [ref.ref, te, Append[ring, pI], Append[guards, pI]]; IF Is[c, Error-Type] THEN Retum[c]; Retum[Create-Class[List['Copy Reference, 'Default-Copy], c]]; ]; Access Control Lists An access control list system associates a list of users with each object. This list describes who may access the object, and with what privileges [Saltzer and Schroeder 79]. To implement access control lists each user of the system creates a key pair, and makes one of the keys of this pair public. Section 7.4.2 discusses how one user can reliably learn another user's public key. A user keeps the private half of his key pair on his key ring. Thus if user X seals an object with user Y's public key, only user Y will be able to unseal it. This result follows directly from the public-key cryptography that is used to implement key pairs. Access control lists are implemented by sealing the privileges in a capability reference, as shown in Figure 7.6. In general, a key or guard is sealed with the Key-Or of the users' keys that have been granted the corresponding privilege. Once sealed, these references can be placed in a public directory system. Only users that have been granted a privilege will be able to unseal its corresponding key or guard. It is often desirable to be able to grant privileges to a group of users and allow the members of the group to change over time. If revocation is not required, indirect keys can be used to define a group as Group- Key +- Seal-Only[Create-In direct- Key[ Key-Or[ul ... Key-Or[un-l, un]]]]; where ul ... un are the users' keys that are members of the group. The members of the group can be altered with Change-Indirect-Key. Information Flow Control In an information flow control scheme [Denning 76] each object is labeled with one or more classifications, and the output of a computation is labeled with the union of the classifications of its inputs. An example of an information flow control scheme is the military system of classification. If Top Secret and Crypto information are used in a report, then the report would be classified Top Secret, Crypto. Information flow control schemes are usually nondiscretionary, meaning that the classification of 10) Read (Jones>, (Smith>, (Zephyr> Object Ref II II Write (Jones>, (Smith> Key .....---key23 Write Guard TI TI key12 Key23 = Key-Or[(Jones>, (Smith>, (Zephyr>] Key12 = Key-Or[(Jones>, (Smith>] Object Reference: Access Control List Figure 7.6 103 CHAPTER 7: PROTECTION a computation's output is fixed, and can not be altered by a client In the information flow control scheme we present a client chooses the classifications of its outputs. Once a classification is specified, the protection mechanism ensures that it is properly enforced. Cryptographic sealing can be used to implement information flow control in the following manner. Represent each classification by a key. If 0 is created from objects that have classifications C] ... C n• seal the privileges in O's reference with Key-And[C] ... Key-And[Cn_]. C,J]. For example, imagine that the fictitious company Sierra has two divisions, a medical division and an office division. Sierra would like to enforce the policy that information that is private to one division is only accessible to employees of that division. Furthermore, Sierra has financial information that only senior managers are allowed to access. Sierra could create keys to represent these classifications as follows: Medical Office t- Seal-Only[Create-Indirect- Key[ Key-Or[]]]]; Seal-Only[Create-Indirect-Key[ Key-Or[(jones), ... Key-Or[ "" ~ 0 0 .,,. Private Doors Set '" r Remote P rocesso r Structure of a Secure Located Class Figure 7.7 108 Class CHAPTER 7: PROTECTION Create-Revocable-Capability creates a capability that has certain privileges, and these privileges can be changed by Change-Revocable-Capability. If a client's privileges are reduced by ChangeRevocable-Capability. on subsequent calls to Open there is no way the client can assume its old privileges. Create-Revocable-Capability[ref: Reference. pI: List[Key]. key-list: List[Key], ck: Key, i: Index, sp: Secure-Processor, tc: TC / rref: Revocable] Create-Revocable-Capability creates a new revocable capability. pi is the set of privileges that the capability has, and these privileges are protected by key-list with Seal-List. Clients that can unseal ck can change the revocable capability. The index and secure processor specified are used to implement the capability. and must be trusted. Change-Revocable-Capability[rref: Revocable, pI: List[Key], key-list: List[Key]. ring: List[Key], tc: TC] Change-Revocable-Capability changes rref such that the new set of privileges is pi, and these permissions are protected by key-list with Seal-List A key ring that can unseal ck must be supplied. Protected Indirection In order to implement revocation we first need to introduce indirection with access protection. The functions shown below are a simple variation of the standard indirection primitives. As an exercise the reader might consider how to implement these functions from the primitives we have introduced. Create-Secure-Indirect[record: Any, index: Index, tc: TC, access: Key / ind: Indirect]; Create-Secure-Indirect creates an indirect entry that can only be accessed by a client that can unseal access. Change-Secure-Indirect[ie: Indirect, contents: Any. tc: TC, ring: List[Key]] Change-Secure-Indirect changes an indirect entry if ring unseals access, and returns NIL. Otherwise Error['Failed] is returned. Lookup-Secure[ie: Indirect, tc: TC, ring: List[Key]] Lookup-Secure returns the contents of ie if ring contains a key that unseals access. Error['Failed] is returned if ring does not unseal access. Revocation Algorithm Figure 7.8 shows how revocation is accomplished. Assume that the client can not access the second capability shown, but secure processor SP can. The client's key ring is used to used to unseal the first capability, which yields a set of intermediate keys. These intermediate keys are then passed to SP. SP takes the set of intermediate keys and determines the privilege keys for the object SP opens the object on the client's behalf with these keys. Intermediate keys are employed so a user's key ring does not have to be passed to SP. The model implementation of revocation follows. 109 Secure Processor SP Located Ref I I SP I ndi rect Ref I ~ r I Indirect Privileges Revocable Capability '"r I I I "',." I ikeyl I I ] ckeyl pkeyl ikeyl I ] 0 0 0 0 I l Reference ikeyn I ] pkeyn ] ikeyn ckeyn Capability Key List Protected f rom Client ,."'" Object Protected by: pkeyl, """' pkeyn A Revocable Capability Figure 7.8 110 CHAPTER 7: PROTECfION Create-Revocable-Capability[ref: Reference, pi: List[Key], key-list: List[Key], ck: Key, i: Index, sp: Secure-Processor, tc: TC / rref: Revocable] +- Prog[ [nkl: List[Key]; gate: List[Sealed[Key]]]; nkl +- Create-Key-List[Length[pI]]; gate +- Seal-List[pl, nkl]; key-list +- Seal-List[nkl, key-list]; -- with nkl gate can be unsealed rref +- Create-Capability[ref, gate]; -- we assume that the secure processor will use its -- private key when it looks up indirect references. rref +- Create-Secure-Indirect[rref, i, tc, Key-Or[ck, sp.public-key]]; -- switch to secure processor rref +- Create-Located[rref, sp]; -- with key-list nkl can be unsealed rref +- Create-Capability[rref, Create-Indirect[key-list, i, tc]]; Return[rref]; ]; Create-Key-List[n: Integer / kl: List[Key]] +- Prog[ [i: Integer]; -- creates a list of n keys kl +- NIL; FOR i FROM 1 TO n 00 kl +- Cons[Create-Base-Key[], kl]; Return[kl]; ]; Change-Revocable-Capability[rref: Revocable, pi: List[Key], key-list: List[Key], ring: List[Key], tc: TC]] +- Prog[ [nref: Reference; oref: Reference; nkl: List[Key]; gate: List[Sealed[Key]]]; -- create new set of intermediate keys nkl +- Create-Key-List[Length[pI]]; -- create new sealed privilege list gate +- Seal-List[pl, nkl]; -- with key-list, a client can discover intermediate keys key-list +- Seal-List[nkl, key-list]; oref +- Lookup-Secure[rref.ref.ref, tc, ring]; nref +- Create-Capability[oref.ref, gate]; Change-Secure-Indirect[rref.ref.ref, nref, tc, ring]; Change-Indirect[rref.pl, key-list, tc]; ]; 7.4 Practical Considerations 7.4.1 Changing Protection Controls Once a set of protection controls has been established there are two ways of changing the controls. The first way is to create the protection structure with indirect keys. For example, if an indirect key is made for the users that can access a file, it is a simple matter to change this indirect key to authorize additional users. Another way to change protection controls is to reconfigure an object to a new implementor that is protected with a desired structure. The new implementing object may in fact be the same object as the old one, but protected in a new way. III CHAPTER 7: PROTECfION 7.4.2 Authentication in the Large A practical system must ensure that external names are mapped into correct internal keys. There are a number of solutions to this problem. Needham and Schroeder [Needham and Schroeder 78] have demonstrated one way of transforming external names into authentic internal keys. Here is a straightforward way to solve the problem in our framework. Provide every processor with a reference R that is located with a secure processor. Let R be a reference for the file that contains the root of the naming system. If a client uses R as the starting point in resolving names to references or keys, then the client will obtain an authentic internal reference or key. We assume that R is protected against malicious modification. The problem with this approach is that a corrupt system administrator could change R's referent in such a way that users would unwittingly give him access to their objects. To guarantee that an intermediate party does not tamper with key distribution it is necessary to distribute personal public keys outside of the system. Another approach would be to create a key pair, [keya, keyb]. keya is kept secret by the system administration, and Reconfigurable Ref for Indirect Ref for Index: Index-Index Reconfigu rable Index Ref ,'" Index: Index-Index File Index Volume Suite Indirect Volume Located Volume id r: 2 ,'" Index: Index-Index ,'" id:31294 loc: w: 2 I Processor I Votes: 1 id: 2274736 Indirect Volume Located Volume id ,'" Index: Index-Index id: 28946 loc: I Processor I Votes: 1 Indirect Volume Located Volume id Index: Index-Index Votes: 1 A Basic Storage Service Figu re 8.1 117 , .10. id: 38464 loc: I Processor I : Read Protected Volume : Read : Write Read: Key-Or[{Key-2837}, {Key-3482}] Write: {KeY-3482} ref: I : Read Protected Volume : Write Read: {Key-76} Write: ref: {KeY-76} I Protection St ructu re will be able to be read by users in (Accounting Dept.> or in (Manufacturing> and updated by users in (Accounting Dept.>. (Accounting Dept.> and (Manufacturing> can be changed as people join and leave the groups. This is accomplished by changing their indirect keys. Only (John Dover> can read and write files created on (John Dover's Volume>. Note that there is no way to treat the files created on (Accounting Volume> as a group. The same is true for files created on (John Dover's Volume>. Thus. it is not possible to change the protection of all of the files created on (John Dover's Volume> because there is no way to enumerate these files. However. if (John Dover's Volume> had been a reconfigurable protected volume. then we could change the protection of all of the files that had been created on it. There is an important subtle difference between reconfigurable protected volumes and protected reconfigurable volumes. In general. configuration is a difficult problem. This fact is hidden from naive clients. as they can use storage without knowing how it is con figured. A methodology for configuring a system is as follows: 1. First. carefully understand the needs of the client population. Clients' needs should be analyzed in enough detail to understand specific requirements for capacity. reliability. availability. performance. security. and flexibility. 2. Choose a set of hardware components that is likely to satisfy these client requirements. 3. Organize the raw capability represented by the hardware components into a set of external objects (such as volumes) that clients will use. This corresponds to specifying the client interface to the storage service without deciding how the storage service is implemented. 4. Determine how the external objects will be configured. Suite configurations. the location of volumes. an indexing structure. and so on must be selected based. in part. on the anticipated applications of the system. For example. a wide variety of indexing structures can be created. For a very small configuration all objects could be indexed by the index-index: for large configurations it is likely that several levels of indexes will be common (index B indexes volume C. index A indexes index B. the index-index indexes index A). 5. Review the properties of the resulting system. and compare them with the requirements established in Step 1. The last four steps are repeated until an acceptable con figuration is chosen. It is likely that a program could help in this process. A program could have knowledge of the 119 CHAPTER 8: PRACTICAL CONSIDERA nONS types of hardware components available. possible component interconnections. and how the facilities we have proposed can be used to create desired properties. For example. the program could suggest alternative hardware and suite configurations to achieve a required level of availability. Furthermore. when a new type of storage service is required it is important to consider how existing storage facilities might be utilized. A configuration program could maintain a data base of available storage services and their properties and automatically consider how they might be used. In summary. capacity can be added to a configuration by adding storage devices. and the properties of storage can be improved by using the facilities described by the system model. 8.2.2 Dynamic Configuration The configuration decisions we have just outlined so far have two properties. First. the decisions are reviewed infrequently. Second. the decisions are large in the sense that they affect the abstractions that clients see and they involve trucks dropping off new equipment at loading docks. We will call the problem of making such decisions static configuration. It is also possible to take discretionary actions to improve the performance or lower the cost of a system. These decisions do not affect the objects that users see and they are essentiaIIy reviewed continuously. We will caII the problem of making such decisions dynamic configuration. Dynamic configuration includes ideas such as: Using write-once storage to store a file and. when the file is opened for update. temporarily reconfiguring the file to read-write storage. This allows lower cost storage technologies to be used without the knowledge of clients. DynamicaIIy creating weak representatives in response to perceived local needs for information. Dynamically changing suite configurations to match changes in the referencing characteristics of clients. It is likely that decision analysis [Raiffa 70] can be applied to help formulate strategy for dynamic configuration. Decision analysis aIIows strategy to be formulated with incomplete information and allows the value of additional information to be judged. 8.3 Summary We argued that the system model is practical. based on prototypes that have been constructed. The model implementation has benefited from experience with a number of prototypes. Thus. it is likely that full-scale implementations of the system model will be patterned after the model implementation. System configuration was observed to have two independent aspects. The first was the static arrangement of hardware components and logical objects to provide a useful service. The second was the problem of real-time system management which involves strategies for taking discretionary actions to improve the performance of the system. 120 Chapter 9: Summary of Ideas In this paper we have considered the problem of information storage in a decentralized computer system. The major ideas that we adopted were: 1. Transactional Storage Transactions that guarantee totality. serial consistency, and external consistency were used to simplify parts of the system. As we have pointed out. all of the properties of transactions are not always required, but in some instances they provide a foundation that simplifies system design to a large degree. 2. Object Style The system model was constructed using an object oriented programming style. This style allowed a diverse set of ideas to be considered and explained separately. From these two starting points. we introduced the following major novel ideas: 1. Naming References were introduced to name objects. Internally. a reference is a typed' record. To a client a reference appears to be a variable length byte string. References can include such things as location information, protection guards, cryptographic keys, and other references. In addition. references can be made indirect to delay their binding to a specific object or location. 2. Location A new location mechanism was presented that hides the location of objects. Location was implemented by using indirection to delay the binding of references to object storage sites. 3. Replication A new replication algorithm was introduced that can improve the availability. reliability. and performance of objects. It was shown how previous replication algorithms were special cases of the new algorithm. and how temporary copies naturally fit into its framework. 4. Reconfiguration A new reconfiguration mechanism was presented that will dynamically reorganize objects. Reconfiguration is accomplished by indirection and state transfer. 5. Cryptographic Sealing A new very flexible low-level protection mechanism based on sealing objects with cryptographic techniques was introduced. The mechanism can be used for fine grained protection. Assuming perfect cryptography. the mechanism was shown to be correct. 6. Object Protection The low-level protection mechanism was used to create popular protection structures. Access control lists, information flow control. capabilities, secure communication channels. and revocation were implemented in terms of our new low-level protection mechanism. 121 CHAPTER 9: SUMMARY OF IDEAS In Chapter 1 we established a set of architectural principles that the system model was to observe. Here we briefly review how the system model satisfies each of the twelve principles. 1. The system we described will always function in a well defined manner. The semantics of concurrent access to storage are well defined. and the mechanism provided for object location is always guaranteed to find an on-line object. 2. The storage service we provide is based on the ideas of files and volumes. 3. Stable storage is resilient to a set of expected failures. unexpected failures. 4. References provide unambiguous low-level names that can be used in a variety of ways by clients. 5. Transactions mediate concurrent access to storage in a well defined way. 6. The assumed environment of the system model is decentralized. and a mechanism is provided for locating objects. Clients can also choose to use located references. 7. There is no inherent limit to the size of the system. Processors and storage devices can be added to increase the capacity of the system. and the number of users that it can service. 8. Suites provide a comprehensive facility for replicating objects to improve their availability. reliability. and performance. Weak representatives allow cache copies to be handled in a natural way. 9. The reconfiguration mechanism allows the storage that is used to implement a storage system object to be dynamically changed. Furthermore. it will detect most 10. A low-level protection mechanism based on cryptographic sealing is provided. and we showed that it is secure. This mechanism is used to create popular protection policies. 11. Volume suites. reconfigurable volumes. and protected volumes allow complex configuration structures to be hidden from clients. 12. A client specifies the volumes that are used to store its information. guarantee its autonomy if it so desires. Thus. a client can This paper has considered problems that occur in large scale in formation systems. We have taken care to formulate our solutions to these problems in a general way. Some of our solutions. such as the protection mechanism. will no doubt find use in a variety of applications. In the years to come it will be a challenge to design and build large information systems. These systems will alter our life style. influence the way that we interact with each other. and even contain sensitive information about us. It is our hope that the ideas we presented will be useful to the future designers of such systems. 122 Appendix A: Exposition Language· EL EL. the programming language that is used in the paper to describe algorithms. is a simple extension of Lisp 1.5. The purpose of this appendix is to fully describe the extensions we have made. The reader unfamiliar with Lisp will find the Lisp 1.5 Manual [McCarthy et al. 62] helpful. A.I Language Extensions To improve the readability of source programs a number of minor syntactic extensions are added to M-Expressions. These extensions have been chosen to make EL programs easy to read for people that are familiar with contemporary algebraic languages. Things that are ignored: Any line that starts with two dashes "--" is treated as a comment and is ignored. The expression after a colon ":" is ignored. This is so the types of variables can be included as comments in the source text. "Any" is used to indicate that a variable may contain any expression. Transformations: Commas are used to separate variables. Function definitions can include a formal return variable after a slash "I". transformed to F[x] +- Prog[[y]: Return[z]]. F[x/y] +- Z is The following constructs are adopted from Interlisp [Teitelman et al. 78]: IF THEN. IF THEN ELSE. WHILE DO. UNTIL DO. FOR DO. and DO. As in Interlisp, they are transformed into progs. As in Interlisp. certain infix operators are transformed to appropriate functions and predicates. The infix predicates transformed are: =. (, >. (=, >=. # (not equal). The infix functions transformed are +-. +. -. *. I. t. and they must have a space on either side. A shorthand for class invocations is provided. x I y[z] is transformed to Apply[x, List['y, z]] (see Section A.2). Record accesses are transformed to make programs easier to read. The form x.y +- z is transformed to Store[x, 'yo z]. The form x.y without a trailing assignment operator is transformed to Fetch[x. 'y] (see Section A.3). If a function is evaluated with too many arguments. the extra arguments are ignored. If a function is evaluated with too few arguments. NIL is supplied for the missing arguments. A.2 Classes Seemingly identical requests can require substantially different amounts of processing. Consider reading a file. Reading one file might not present any unusual complications. but reading another 123 ApPENDIX A: EXPOSITION LANGUAGE - EL file might require decryption, logic to determine which copy of the file to actually read, and communication to a remote processor. Ideally, a client should not be able to tell the difference between the two files. The notion of a class allows algorithms to be composed with one other to provide a composite service that has a standard interface. A class can use private state and other classes to process the requests that it receives. For example, a class that implemented a replicated file might need state to keep track of what copies are current, and might use other services to read and write file copies. The concept of a class is further elucidated in [Ingalls 78]. To request that a class perform a function the class is applied to a request list. The car of the request list is the function to be performed, and the remainder of the request list consists of arguments to the function. Section A.I introduced the shorthand for invoking a class class I operation [argI .. argn] which is transformed to Apply[c1ass, List['operation, argl. ... , argn]]. When a class starts executing it has at hand a request list, and it also has private state variables that were declared when the class was created. These variables are implemented by the closure mechanism in Lisp 1.5. Create-Class[function-map: List superclass: Class / class: Class] Create-Class creates a class that can process the functions included in junction-map. junction-map is a list of operation names and functions to implement them. For example, if Create-Class was applied to List['Read, 'FRead, 'Write, 'FWrite] it would create a class that would evaluate FRead when it received a Read request, and FWrite when it received a Write request. If an operation is not contained in junction-map then the request is passed to superclass, if it has been specified. If superclass is NIL and the operation is not contained in junction-map then Error[,NoFunction] is returned. Create-Class saves the environment in which it was evaluated. This saved environment is restored whenever the class it created is evaluated. This allows a class to maintain private state between invocations. Using our previous example, FRead and FWrite would both execute in the environment that Create-Class had, and thus could save state and communicate with each other. The following distinguished variables are defined when a class is executing: self superclass request The current class. This class' superclass. The current request. The class mechanism is implemented by the following functions: Create-Class[function-map: List superclass: Class / self: Class] +- Prog[ self +- Function[Standard-Class]: -- self. superclass, and request will be available to the class Retum[self]: ]: 124 0: ApPENDIX A: EXPOSITION LANGUAGE - EL Standard-Class[request: List / result: Any] ... Prog[ [f: Function~ operation: Atom~ self: Class]~ operation ... car[request]~ f ... Listget[f~nction-map, operation]~ IF Null[t] THEN [ -- No function. See if we have a superclass. IF Null[superclass] THEN Return[Error['NoFunction]]~ Return[Apply[superclass. request]]~ ]: Return[Apply[f. cdr[request]]]: ]: Listget[x: List y: Atom / result: Any] ... Prog[ 0: takes a list of the form [atom. val. ... atom. val] -- returns the value associated with y IF Null[x] THEN Return[NIL]: IF car[x] = y THEN Retum[cadr[x]]: Return [Listget[cddr[x]. y]]: ]: A.3 Records To make data structures easier to understand and implement the notion of a record is introduced. A record instance consists of a set of independent fields that can be read and written with Fetch and Store. respectively. A field is just like a variable. and is named by a tag. A record instance is made by applying Create to a record type. A record type is created by applying the function Record to a list of tags the new record might contain. It is also possible to create a new record type by combining two existing record types with Extend. Every record is assigned a unique type. Different types of records are used for different things, and often it is convenient to be able to determine the type of a record instance. A client can use the function Is to test the type of a record instance. Record[tagl: Atom. tag2: Atom. ... tagn: Atom / type: Record-Type] Record creates a new record with a distinct type. The arguments to Record are discarded. Record[/type: Record-Type] ... Prog[ Return[List[GetUniqueID[]]] ]: 0 Create[type: Record-Type / instance: Record-Instance] Create creates an instance of a record. The instance inherits the types of its template. Create[type: Record-Type / instance: Record-Instance] ... Prog[ Return[List[Cons['type. type]]] ]: 125 0 ApPENDIX A: EXPOSITION LANGUAGE - EL Store[instance: Record-Instance. tag: Atom. value: Any] Store sets the value of the field named by tag to be value. Store also returns value. Store[instance: Record-Instance, tag: Atom. value: Any] ... Prog[ [] PutAsscx;[tag. value, instance]: Retum[value]: ]: Fetch[instance: Record-Instance. tag: Atom / value: Any] Fetch returns the value of the field named by tag in instance. If no value has been stored in the field. NIL is returned. Fetch[instance: Record-Instance. tag: Atom / value: Any] "'Prog[ [] value ... Assoc[tag. instance]: IF Null[value] THEN Return[NIL]: Retum[cdr[val ue]]: ]: Extend[typea: Record-Type. typeb: Record-Type / type: Record-Type] Extend creates a record type that is the union of its input record types. I Extend[typea. typeb / type] ... Prog[ [] Return[Append[typea. typeb]]: ]: Is[x: Record-Instance. y: Record-Type / result: Boolean] The type of a record instance can be checked with Is. Is returns T if the the types of yare a subset of the types of x. Otherwise it returns NIL. To define Is we need an auxiliary function. Subset. Subset[x: List. y: List] returns T if x is a subset of y. Is[x: Record-Instance. y: Record-Type / result: Boolean] ... Prog[ [] IF Null[x] THEN Return[NIL): IF Atom[x) THEN Retum[NIL]: Return[Subset[y. Fetch[x. 'type]]]: ]: Subset[x: List y: List / result: Boolean] ... Prog[ [] IF NUll[x] THEN Return[T]: Return[And[Member[Car[x]. y]. Subset[Cdr[x]. y]]]: ]: Add-Type[x: Record-Instance. t: Record-Type / z: Record-Instance] Add-Type adds the type t to the record instance specified by x. x is also returned. Add-Type[x: Record-Instance. t: Type / z: Record-Instance] ... Prog[ []: x.type ... Union[x.type. t]: Return[x]: ]: 126 ApPENDIX A: EXPOSITION LANGUAGE - EL Remove-Type[x: Record-Instance, t: Record-Type / z: Record-Instance] Remove-Type removes the type t from the record instance specified by x. returned. x is also Remove-Type[x: Record-Instance, t: Type / z: Record-Instance] +- Prog[ []; Efface[t, x.type]; Return[x); ]: Examples of record usage: Experienced +- Record[experience: Integer]: Person +- Record[age: Integer]: Experienced-Person +- Extend[person. Experienced]: joe +- Create[Experienced-Person]: -- IsUoe Experienced] = T joe. experience +- 10: joe.age +- 26: A.4 External Representation Encode[object: Any / encoded: Byte-Array] Encode converts an object into an array of bytes. and is similar to the Lisp function Print. An encoded object may be preserved outside of EL for any length of time. None of the data structures in the paper are circular. and thus we do not require that Encode work properly on circular structures. Decode[encoded: Byte-Array / object: Any] Decode converts an encoded object back into a forin that can be directly manipulated; Decode is similar to the Lisp function Read. Decode[Encode[x]] is equivalent to Copy[x]. A.5 Exception Handling Error-Type +- Record[problem: Any]: Error[problem: Any] When an exceptional condition occurs in an EL function. the function returns what amounts to an error code. Error creates an instance of Error-Type. The argument to Error is used to set the new instance's problem field so different types of errors can be Is[x. Error-Type] will return T if x is an error instance. distinguished. Error[problem / instance] +- [ []: instance +- Create[Error-Type]: instance.problem +- problem: Return[instance]: ]: 127 ApPENDIX A: EXPOSITION LANGUAGE- EL A.6 Processes Fork[form: Any I handle: Process-Handle] Fork starts another process that evaluates the supplied form in the current environment. Fork returns a process handle that can be used with Join and Stop. Join[handle: Process-Handle I result: Any] Join returns the result of the evaluation carried out by the process specified by handle. If the process specified by handle has not completed its assigned evaluation Join will pause until the result is available. If no copies of a process' handle are kept then the result of the process' evaluation will be discarded. Stop[handle: Process-Handle] Stop causes the process specified by handle to be stopped and destroyed. A.7 Synchronization Create- Lock[/lock: Lock] Create- Lock creates a lock. Critical[lock: Lock. form: Any I result: Any] Critical locks the specified lock. evaluates form. and then unlocks the specified lock. Critical returns the result of the evaluation of form. Create-Condition-Variable[time-out: Time I cv: Condition-Variable] Create-Condition-Variable creates a condition variable. the maximum time that Wait will pause on cv. If time-out is not NIL. then it is Wait{cv: Condition-Variable] Wait pauses until a Broadcast on the specified condition variable occurs. or the time-out expires. Broadcast{cv: Condition-Variable] Broadcast wakes up all processes waiting on cv. A.8 Sets Set-Create[/set: Set]: Set-Create creates a new volatile set. A volatile set is a set that is destroyed when there are no more references to it or the machine it is stored on fails. Set-Insert[set: Set key: Any. value: Any] Inserts or replaces a reference to value in set indexed by key, 128 ApPENDIX A: EXPOSITION LANGUAGE - EL Set-Lookup[set: Set. key: Any / value: Any] Returns the value indexed by key in set. or NIL if there is no such key. Set-Delete[set: Set. key: Any] Deletes the entry indexed by key in set. A.9 Byte Arrays Byte-Array[length: Integer I array: Byte-Array] The Byte-Array function creates an array of bytes that has a specified length. A byte is an eight bit integer. All of the elements of the array are initially set to zero. Length[array: Byte-Array / length: Integer] Length returns the number of of bytes in an array. Elt[array: Byte-Array. element: Integer / value: Byte] Elt returns the value of the specified element of the array. SElt[array: Byte-Array. element: Integer. value: Byte] SElt sets the value of the specified array element. A.IO Miscellaneous Intersection[lista: List. listb: List / result: List] Intersection returns the list of elements that are in Usta and Ustb. Nth[l: List. n: Integer / element: Any] Nth returns the nth element of list 1. Pair[lista: List. listb: List '/ result: List] If !ista is [a b c] and listb is [1 2 3]. the result of Pair[lista. listb] is [[a 1] [b 2] [c 3]]. Union[lista: List. listb: List / result: List] Union returns the list of elements that are either in Usta or Ustb. 129 Appendix B: Storage System Summary This appendix is a concise summary of the operations that are implemented by the storage system. The reader is referred to the text for full descriptions of these operations. and for details on the protection mechanism. The operations we introduced to create references that are derived from other references are: Create-Capability[ref: Reference. pI: List[Key] / cref: Capability] Creates a capability for ref with privileges pI. Create-Choice[choices: List[Reference] / cr: Choice] Creates a reference whose referent can be anyone of the elements in choices. Create-Encrypted[ref: Reference. k: Key / eref: Encrypted] Encrypts an object with k. Create-Index[storage: File / iref: Index] Creates an index. Create-Indirect[ref: Reference. index: Index. tc: TC / iref: Indirect] Binds iref to ref. but this binding can be changed with ChangeReference. Create-Located[ref: Reference. loc: Location / lref: Located] When Iref is opened. control will be transferred to loe and ref will be opened. Create-Protected-Volume[ref: Reference. tl: List[Type]. kl: List[Key] / gref: Guarded] Files created on a protected volume have protection controls set according to tl and gl. Create-Reconfigurable[ref: Reference. index: Index. tc: TC / IT: Reconfigurable] Creates a reconfigurable object.. whose. first implementor is ref. Create-Revocable[ref: Reference. pI: List[Key]. key-list: List[Key]. ck: Key. i: Index. sp: Secure-Processor. tc: TC / ITef: Revocable] Creates a capability that may be revoked. Create-Suite[name: UniqueID. r: Integer. List[Reference] / sref: Suite] Creates a suite with w: Integer. rep: List[Reference]. votes: the specified configuration. Detailed below are the operations that are supported by each type of object. The bold face entries correspond to object types. The list begins with the operations that are supported by all objects. and then describes specialized operations. Class CloseD Deactivates a class. CopyReference[counted: Boolean / copy: Reference] Copies a reference. 130 ApPENDIX B: STORAGE SYSTEM SUMMARY GetIO[/ id: UniqueIO] Returns the unique identifier of the object GetTransactionClass[1 tc: TC] Returns the transaction class associated with this open object. OestroyReference[/ destroyed: Boolean] Destroys this reference and returns T if its reference count went to zero. Indirect ChangeReference[new-ref: Reference] Changes the indirect reference to point at new-ref. Processor, Secure Processor Eval[fonn: Any I result: Any] Evaluates form at the processor serviced by this class. Coordinator Create-Transaction[1 tref: Transaction] Creates a transaction. Transaction Abort{] Aborts the transaction. Commit(] Commits the transaction. GetStatus[/state: Atom] Returns the current state of the transaction. AddParticipant[commit: Global-Function. abort: Global-Function lid: UniqueIO] Adds a participant to transaction processing. OeleteParticipant[id: U niq ue 10] Deletes a participant from transaction processing. Information Holding Object IsImmutable[/ immutable: Boolean] Returns T if the object is immutable. SetlmmutableD Makes an object immutable. GetVersion[1 version: Integer] Returns the objecfs version number. 131 ApPENDIX B: STORAGE SYSTEM SUMMARY Copy[from: Reference] Overwrites the object with the contents of from. Volume, Volume Suite Create-File[name: UniqueID, exists: Boolean / ref: File] Creates a new file. File, File Suite Read[startPage: Integer. pages: Integer / data: Byte-Array] Reads from a file starting at startPage the number of pages specified by pages. Write[startPage: Integer. pages: Integer. data: Byte-Array] Updates a file starting at startPage by number of pages specified by pages. GetSize[l pageCount: Integer] Returns the number of pages in the file. SetSize[pageCount: Integer] Sets the number of pages in the file. Index, Index Suite Enumerate[Iast: Entry / next: Entry] Enumerates the contents of an index. Read[entry-name: Byte-Array / value: Byte-Array] Reads the value of an index entry. Write[entry-name: Byte-Array. value: Byte-Array] Updates the value of an index entry. 132 Appendix C: Open The function Open contains the only centralized knowledge of the types of objects that are supported by the system. Thus. to add a new type of object to the system it is only necessary to add an appropriate line of code to Open. The function Open-Local is defined to provide classes to service coordinators. files. indexes. volumes. or transactions that reside on the local processor. If the location of a file or an index is not known. we look for it on the local processor. If it is not there. its location is computed. These two things could occur in parallel. Although it is not shown in the code. Open could detect when an object was opened more than once with identical parameters and return the same class. In certain cases. such as opening a processor more than once. the performance gain would be significant Open[ref: Reference. tc: TC. ring: List[Key]. guards: List[Key] / class: Class] +- Prog[ -- if passed NIL return NIL IF Null[ref) THEN Return[NIL]: It -- Located. Capability. and Indirect have the highest precedence (because they alter where a request is processed. the keys it is processed with. and the referent that is used). IF Is[ref. Located] THEN Return[Open-Located[ref. tc. ring. guards]]: IF Is[ref. Capability] THEN Return[Open-Capability[ref. tc. ring. guards]]: IF Is[ref. Indirect] THEN Return[Open-Indirect[ref. tc. ring. guards]]: -IF IF IF IF IF IF IF IF The following can occur in any order. Is[ref. Choice] THEN Return[Open-Choice[ref. tc. ring, guards]]: Is[ref. Encrypted] THEN Return[Open-Encrypted[ref. tc. ring, guards]]: Is[ref. Protected] THEN Return[Open-Protected[ref. tc. ring. guards]]: Is[ref. Processor] THEN Return[Open-Processor[ref, NIL NIL guards]]: Is[ref. Reconfigurable] THEN Return[Open-Reconfigurable[ref. tc. ring, guards]]; Is[ref. Secure-Door] THEN Return[Open-Secure-Door[ref, NIL NIL guards]]: Is[ref. Secure-Processor] THEN Return[Open-Secure-Processor[ref. NIL, NIL guards]]: Is[ref. Suite] THEN Return[Open-Suite[ref. tc. ring. guards]]; -- We don't know where. the object is. Try this processor. class +- Open-Local[ref. tc. ring. guards]: -- if success. return to client IF class # ErrorrNotFound] THEN Return[class]: -- Find object IF Is[ref. File] THEN Return[Open-File[ref. tc. ring. guards]]: IF Is[ref. Index] THEN Return[Open-Index[ref. tc. ring. guards]]: -- Not a File or an Index (should not occur) Return[ErrorrN otFound]]: ]: 133 Bibliography [Alsberg et al. 76] Alsberg, P., Belford, G. G., Day, J. D., Grapa, E., Multi-Copy Resiliency Techniques, CAC Document Number 202, Center for Advanced Computation, Univ. of Illinois, Urbana, Illinois, May 1976. [Blakley 79] Blakley, G. R .. Safeguarding Cryptographic Keys, Proc. 1979 National Compo Con/.. AFIPS Press. pp. 313-317. [Boggs et al. 80] Boggs, D.R .. Shoch. J.F., Taft. E.A.. Pup: An Internetwork Architecture, l.E.E.E. Trans. on Comm COM-28. 4 (April 1980). pp. 612-624. [Belady and Lehman 77] Belady, L. A.. and Lehman, M. M., The Characteristics of Large Systems, Report RC-6785. IBM T. J. Watson Research Center, September 1977. [Birrell and Needham 80] Birrell. A. 0 .. and Needham. R. M .. A Universal File Server, IEEE Trans. on Software Engineering SE-6. 5 (September 1980), pp. 450-453. [Brooks 75] Brooks. F. P.; The Mythical Man-Month. Addison-Wesley, Reading, MA, 1975. [Chaum and Fabry 78] Chaum. D. and Fabry. R .. Implementing Capability-Based Protection Using Encryption. Rep. UCB/ERL M78/46. Electronics Research Laboratory. University of CA, Berkeley, July 1978. [Clark et al. 78] Clark. D. D .. Pogran. K. T .. Reed. D. P.. An Introduction to Local Area Networks, Proc. of the IEEE 66. 11 (November 1978). pp. 1497-1516. [Corbato et al. 62] Corbato. F. 1.. Daggett M. M.. Daley, R. C .. An Experimental Time-Sharing System. 1962 Fall Joint Computer Conference. AFIPS Con/. Proc. 21. pp. 335-344. [Corbato et al. 72] Corbato. F. J .. Saltzer. J. H .. Clingen. C. T.. Multics - The First Seven Years. 1972 Fall Joint Computer Conference. AFIPS Con/. Proc. 40. pp. 571-583. [Davies 73] Davies. C. T.. Recovery Semantics for a DB/DC System. Proc. ACM National Conference. 1973. pp. 136-141. [Denning 76] Denning. D. E.. A Lattice Model of Secure Information Flow. Comm ACM 19. 5 (May 1976). pp. 236-243. [Dennis and Van Hom 66] Dennis. 1. B.. and Van Hom. E. C .. Programming Semantics for Multiprogrammed Computations. Comm ACM 9. 3 (March 1966), pp. 143-155. [DES 75] Notice of a Proposed Federal Information Processing Data Encryption Standard. Federal Register. Vol. 40. No. 52. March. 1975. [Diffie and Hellman 76] Diffie. W .. and Hellman. M. E.. New Directions in Cryptography. IEEE Trans. on In/. Thy. IT-22.6 (November 1976). pp. 644-654. [Diffie and Hellman 77] Diffie. W .. and Hellman. M. E.. Exhaustive Cryptanalysis of the NBS Data Encryption Standard. Computer (June 1977). pp. 74-84. [Diffie and Hellman 79] Diffie. W .. and Hellman. M. E.. Privacy and Authentication: An Introduction to Cryptography. Proc. of the IEEE 67. 3 (March 1979), pp. 397-427. [Dian 80] Dion. J .. The Cambridge File Server. Operating Systems Review 14. 4 (October 1980). pp. 26-35. 134 BIBLIOGRAPHY [Eswaran et ·a1. 76] Eswaran. K.P. et al.. The Notions of Consistency and Predicate Locks in a Database System. Comm. ACM 19. 11 (November 1976). pp. 624-633. [Forsdick et a1. 77] Forsdick. H. C .. Schantz. R. E.. Thomas. R. H.. Operating Systems for Computer Networks. Rep. 3614. Bolt Beranek and Newman. Cambridge. MA. July 1977. [Garnett and Needham 80] Garnett. N.H .. and Needham. R.M .. An Asynchronous Garbage Collector for the Cambridge File Server. Operating Systems Review 14. 4 (October 1980). pp. 36-40. [Gifford 79a] Gifford. D.K.. Violet. An Experimental Decentralized System. Integrated Office System Workshop. IRIA. Rocquencourt. France. November. 1979. Available as CSL Report 79-12. Xerox Palo Alto Research Center. 1979. [Gifford 79b] Gifford. O.K. Weighted Voting for Replicated Data. Operating System Review 13. 5 (December 1979). pp. 150-162. [Gray et a1. 76] Gray. J. N. et al.. Granularity of Locks and Degrees of Consistency in a Shared Data Base. in Modeling in Data Base Management Systems. North Holland Publishing. 1976. pp. 365-394. [Gray 78] Gray. J. N.. Notes on Data Base Operating Systems. in Operating Systems. An Advanced Course. Lecture Notes in Computer Science 60. Springer-Verlag. 1978. pp. 393-481. [Gray et a1. 79] Gray. J. N .. et al.. The Recovery Manager of a Data Management System. Rep. RJ2623. IBM San Jose Research Laboratory. San Jose. CA. August 1979. [Gudes 80] Gudes. E.. The Design of a Cryptography Based Secure File System. IEEE Trans. on Soft. Eng. SE-6. 5 (September 1980). pp. 411-420. [Hellman 76 et a1.] Hellman. M. E.. et al.. Results on an initial attempt to Cryptanalyze the NBS data encryption standard. Rep. SEL-76-042. Dept. of Electrical Engineering. Stanford Univ .. Stanford. CA. September 1976. [IBM 80a] CICS/VS System Application Design Guide. Third Edition. Rep. SC33-0068-2. International Business Machines. 1980. [IBM 80b] IBM 3850 Mass Storage System Introduction and Preinstallation Planning. Second Edition. Rep. GA32-0038-1. International Business Machines. 1980. [Ingalls 78] Ingalls. D. H.. The Smalltalk-76 Programming System Design and Implementation. Fifth Annual ACM Symposium on Principles of Programming Languages. Tucson. Arizona. January 1978. [Israel et al. 1978] Israel. J.E .. Mitchell. J.G .. and Sturgis. H.E.. Separating Data From Function in a Distributed File System. Report CSL-78-5. Xerox Palo Alto Research Center. Palo Alto. CA. September 1978. [Knuth 73] Knuth. D. E.. The Art of Computer Programming. Vol. 3. Sorting and Searching. Addison-Wesley. Reading. MA. 1973. [Lamport 78a] Lamport. L.. The Implementation of Reliable Distributed Multiprocess Systems. Computer Networks 2 (1978). pp. 95-114. [Lamport 78b] Lamport. L.. Time. Clocks. and the Ordering of Events in a Distributed System. Comm. ACM 21. 7 (July 1978). pp. 558-565. [Lampson 69] Lampson. B. W.. Dynamic Protection Structures. 1969 Fall Joint Computer Conference. AFIPS Con/. Proc. 35. pp. 27-38. 135 BIBLIOGRAPHY [Lampson 80] Lampson, B. W., Replicated Commit, Draft Paper, Xerox Palo Alto Research Center, Palo Alto, CA. November 1980. [Lampson and Sproull 79] Lampson, B. W., and Sproull, R. E, An Open Operating System for a Single-User Machine, Operating System Review 13. 5 (December 1979), pp. 98 - 105. [Lampson and Sturgis 79] Lampson. B. W .. and Sturgis, H. E.. Crash Recovery in a Distributed Data Storage System. Draft Report Xerox Palo Alto Reserch Center. Palo Alto, CA, April 1979. Also to appear Comm ACM. [LeLann 81] Le Lann. G., A Distributed System for Real-Time Transaction Processing, Fourteenth Hawaii International Conference on System Science. Honolulu. January 1981. [Liddle 76] Liddle. D., private communication. [Lindsay and Gligor 78] Lindsay. B.. and Gligor. V.. Migration and Authentication of Protected Objects. IEEE Trans. Soft. Eng. SE-5. 6 (November 1979). pp. 607-611. [McCarthy et al. 62] MA. 1962. McCarthy. J .. et. al.. Lisp 1.5 Programmer's Manual, MIT Press, Cambridge, [McCreight 77] McCreight E. M .. Pagination of B*-Trees with Variable-Length Records. Comm ACM 20. 9 (September 1977). pp. 670-674. [Metcalfe 73] Metcalfe. R.M .. Packet Communication. Ph.D. Thesis. Harvard University, December. 1973. Also available as Rep. TR-114. MIT Laboratory for Computer Science. Cambridge, MA. [Metcalfe and Boggs 76] Metcalfe. R.M .. and Boggs. D.. Ethernet: Packet Switching for Local Computer Networks. Comm ACM 19. 7 (July 1976). pp. 395-403. [Morris 73] Morris. 1. H .. Protection in Programming Languages. Comm ACM 16. 1 (January 1973). pp. 15-21. [NBS 80] National Bureau of Standards. DES Modes of Operation. Preliminary Copy of Federal Information Standards Publication 81. September. 1980. [Needham and Schroeder 78] Needham. R. M .. and Schroeder. M. D.. Using Encryption for Authentication in Large Networks of Computers. Comm ACM 21. 12 (December 1978), pp. 993999. [Needham 79] Needham. R. M .. Adding Capability Access to Conventional File Servers. Operating Systems Review 13. 1 (January 1979). pp. 3-4. [Nelson 81] Nelson. Boo Remote Procedure Call. Ph.D. Thesis. Carnegie-Mellon Univ .. May. 1981. [Obermarck 80] Obermarck. R .. Global Deadlock Detection Algorithm. Rep. RJ2845. IBM San Jose Research Laboratory. San Jose. CA. August 1979. [Osterhout et al. 80] Osterhout J. K .. Schelza. D. A.. Sindhu. P. S.. Medusa: An Experiment in Distributed Operating System Structure. Comm ACM 23. 2 (February 1980), pp. 92-105. [Peterson and Weldon 72] Peterson. W. W .. Weldon. E. J .. Error-Correcting Codes. Second Edition. MIT Press. Cambridge. MA. 1972. [Popek et al. 81] Popek. G .. et aL LOCUS: A Network Transparent High Reliability Distributed System. Operating Systems Review 15. 5 (December 1981). pp. 169-177. [Raiffa 70] Raiffa. H .. Decision Analysis: Introductory Lectures on Choices under Uncertainty. Addison-Wesley. Reading. MA. 1975. 136 BIBLIOGRAPHY [Redell 74] . Redell. D. D.. Naming and Protection in Extendible Operating Systems. Ph.D. Thesis. University of California. Berkeley. September 1974. Available as TR-140. MIT Laboratory for Computer Science. Cambridge. MA. [Reed 78] Reed. D. P.. Naming and Synchronization in a Decentralized Computer System. Rep. TR-205. MIT Laboratory for Computer Science. Cambridge. MA. 1978. [Rivest et al. 78] Rivest. R. L.. Shamir. A.. and Adleman. L.. A Method of Obtaining Digital Signatures and Public-Key Cryptosystems. Comm ACM 21. 2 (February 1978). pp. 120-126. [Rothnie et al. 77] Rothnie. 1. B.. Goodman. N.. Bernstein. P. A.. The Redundant Update Methodology of SOD-I: A System for Distributed Databases (The Fully Redundant Case). Technical Report CCA-77-02. Computer Corporation of America. Cambirdge. MA. June 1977. [Saltzer and Schroeder 79] Saltzer. J. H .. and Schroeder. M. D .. The Protection of Information in Computer Systems. Proc. of the IEEE 63. 9 (September. 1975). pp. 1278-1308. [Shamir 79] Shamir. A.. How to Share a Secret. Comm ACM 22. 11 (November. 1979). pp. 612613. [Shannon 49] Shannon. C. E.. Communication Theory of Secrecy Systems. Bell Sys. Tech Journal 28. (October 1949). pp. 656-715. [Spector 80] Spector. A. Z.. Performing Remote Operations Efficiently on a Local Computer Network. Stanford Department of Computer Science Report STAN-CS-80-850. Stanford Univ .. Stanford. CA. December 1980. [Swinehart et al. 79] Swinehart. D .. McDaniel. G .. Boggs. D.. WFS: A Simple Shared File System for a Distributed Environment. Operating System Review 13. 5 (December 1979). pp. 9 - 17. [Tandem 81] 1981. Data Base Management ENCOMPASS: TMF. Tandem Computers. Cupertino. CA. [Teitelman et al. 78] Teitelman. W .. et at.. Interlisp Reference Manual. Xerox Palo Alto Research Center. Palo Alto. CA. 1978. [Thacker et al. 79] Thacker. C. P. et at.. Alto: A Personal Computer. Rep. CSL-79-11. Xerox Palo Alto Research Center. Palo Alto. CA. August 1979. [Thomas 73] Thomas. R. H .. A Resource Sharing Executive for the ARPANET. 1973 National Computer Conference. AFIPS Con/. Proc. 44. pp. 155-163. [Thomas 79] Thomas. R. H .. A Majority Consensus Approach to Concurrency Control for Multiple Copy Databases. ACM Trans. on Database Systems 4. 2 (June 1979) pp. 181-209. [von Neumann 51] von Neumann. J .. Various Techniques Used in Connection with Random Digits. in Monte Carlo Math Series # 12. National Bureau of Standards. 1951. 137 INDEX Conventional-Encrypt. 80 Coordinator, 25 coordinator, 25, 31 Copy, 52 CopyReference.23 counted reference, 22 Create. 125 Create-Base-Key,83 implementation of, 86 Create-Capability. 100 Create-Choice, 47 Create-Class, 124 Create-Condition-Variable, 128 Create-Conventional-Key.80 Create-Encrypted. 97 Create-File, 28 Create-Global-Function. 13 Create-Index, 40 Create-Indirect, 41 implementation of, 41 Create-Indirect-KeY,84 implementation of, 92 Create- Key-Pair. 83 implementation of, 86 Create- Located, 32 Create-Lock, 128 Create-PK -Pair. 81 Create-Protected-Volume,99 Create-Recon figurable, 65 implementation of, 66 Create-Revocable-Capability, 109 implementation of. III Create-Suite, 51 Create-Transaction. 25 Critical, 128 cryptanalytic attack, 80 chosen-cleartext,80 ciphertext-only, 80 known-cleartext. 80 on cryptographic sealing, 95 cryptographic sealing, 76 applications of. 97 authentication property of, 83 comparative analysis of, 113 correctness of, 93 cryptanalysis of. 95 model of, 82 performance of, 112 security property of, 83 cryptography. 8. 77 computationally secure, 80 conventional, 80 hardware for, 115 perfectly secure. 80 public-key. 81 Abort. 26 abort. 25 access control list. 102 action. 23 real. 25 active protection mechanism. 77 Add-Type. 126 AddParticipant 26 address. 9 broadcast 9. 46 Apply-Quorum. 56 architectural principles. l. 122 list of, 6 authentication. 76 elimination of. 113 in the large. 112 availability. 76 base key. 82 Broadcast. 9. 128 broken read lock, 37 Byte-Array. 129 Capability, 100 capability, 100 Change-Indirect 41 implementation of. 43 Change-Indirect-Key.84 implementation of. 92 Change-Revocable-Capability. 109 implementation of. III ChangeReference. 43 checkpoint 38 Checksum. 82 checksum. 82 choice reference. 47 ciphertext 77 class. 22. 123 cleartext 77 Close. 23 Collect-Quorum. 59 Collect-Write-Quorum. 60 Commit. 26 commit 25 communication. 9 one-way authenticated. 105 packet switched. 9 secure. 105 compensation. 25 connection. 16 secure. 105 consistency degrees of. 36 external. 24 lower degrees of. 62 serial. 23 Conventional-Decrypt 80 138 INDEX current, 49 deadlock. 37 detection. 37 prevention. 37 time-out strategy. 37 Decode. 127 decrypt. 77 DeleteParticipant, 26 dependency relation. 24 derived key. 82 key field. 86 opener field. 86 Destroy Reference. 23 door. 33 secure. 105 Door-Eval. 33. 105 dynamic configuration. 120 EL. 3. 121. 123 Elt, 129 Encode. 127 encrypt. 77 Encrypted. 97 encrypted object. 97 Entry. 40 Enumerate. 40 equivalent schedule. 24 Error. 127 Error-Type. 127 Eval. 12. 104 Extend. 126 external consistency. 24 failure processor. 12. 25 stable storage. 38 storage. 10 Fetch. 126 File. 27 file. 21. 27 file index. 66 Fork. 128 form. 12 garbage collection. 22 Get-Size. 28 GetID.23 GetMyProcessor.9 GetProcessorID. 8 GetStatus.26 GetTransactionClass.23 GetUniqueID. 11 GetVersion.52 global function. 13 goal. 1 guard. 98 standard types. 98 guarded object, 98 ideas major, 121 immutable. 28 Inde'x,40 index. 40 entry, 40 entry name. 40 file. 66 Index-Index. 46 Indirect. 41 indirect reference. 43 indirection. 41 information flow control. 102 Initiate-Inquiries. 57 integrity. 76 Intersection. 129 Is. 126 IsImmutable. 28 Join. 128 key. 76 base. 82 derived. 82 key ring. 97 key for. 97 Key-And. 83 implementation of. 91 Key-Or, 84 implementation of. 91 Key-Pair. 83 Key-Quorum. 84 implementation of, 91 Length. 129 Lisp, 123 Located. 32 location. 46. 121 lock. 29 breaking read. 37 compatibility. 35 granularity. 36 releasing read. 37 lock management. 29 log. 29 Lookup. 41 implementation of. 41 Major-Type, 32 message, 9 model implementation. 3 NA-Unseal. 85 implementation of. 92 NA-Unseal-List. 85 naming systems model of. 21 networks internetwork. 9 139 INDEX local. 9 Normalize. 41 implementation of. 43 nth. 129 object sealed. 76 one-way authenticated. 105 Open. 22. 133 Open-Capability. 102. 133 Open-Choice. 47. 133 Open-EncryPted. 98. 133 Open-File. 46. 133 Open-Index. 46. 133 Open-Indirect 44. 133 Open-LocaL 133 Open- Located. 32. 133 Open-Processor. 12. 133 implementation of. 17 Open-Protected. 99.133 Open-Reconfigurable. 66. 133 Open-Secure-~r. 106.133 Open-Secure-Processor. 105. 133 Open-Suite. 53. 133 opener. 86 orphan. 12 page. 28 Pair. 129 participant 26 passive protection mechanism. 77 personal computing. 5 PK-Decrypt 81 PK -Encrypt. 81 PK-Pair.81 practical considerations. 115 Pre-Eval. 14 prepared. 31 Processor. 9 processor. 8 identifier. 8 secure. 104 Protected. 99 protected volume. 99 protection. 76 authentication. 76 availability. 76 changing controls. III common mechanisms. 100. 121 group. 102 integrity. 76 secrecy. 76 protection mechanism. 121 active. 77 algorithm. 85 cryptographic sealing. 76 passive. 77 prototype of. 115 prototypes, 115 pseudo-time, 32 quorum read. 49 write. 49 random number. 80 random numbers. 8 Read file. 28 index, 40 read quorum, 49 Read-Only. 22 real action. 25 Receive, 9 Reconfigurable.66 reconfigurable object. 65 reconfiguration. 65. 121 algorithm. 66 Reconfigure. 65 Record. 125 records. 125 recovery management. 29 Reference. 21 reference. 21 capability. 100 choice. 47 coordinator. 25 counted. 22 encrypted object, 97 file, 27 index. 40 indirect, 43 located. 32 processor. 9 protected volume, 99 read only, 22 reconfigurable.66 suite, 51 transaction, 25 uncounted. 22 volume. 27 reference-count. 22 references. 121 Remember-Eval. 17. 105 remote form evaluation. 11. 104 algorithm. 13 semantics, 12 Remove-Type. 127 replication. 49. 121 related work. 64 replication algorithm prototype of. 115 Representative. 53 representative. 49 140 INDEX transfer rate, 10 write-once, 10 Store. 126 Submaster, 84 implementation of, 91 Suite, 51 suite, 49 algorithm, 52 algorithm correctness, 61 alternative for volumes, 64 concurrent update of, 63 examples, 50 object size of. 62 recursive, 62 representative. 49 total replacement of. 63 volume, 52 voting con figuration. 50 suite algorithm, 52 Suite-Close, 59 Suite-Copy, 56 Suite-CopyRef.56 Suite-Create. 57 Suite-Read. 55 Suite-Write. 56 superc1ass. 124 system configuration, 2. 116 dynamic. 120 example of. 116 methodology for. 119 static, 120 system implementation, 2, 115 full-scale. 116 prototypes, 115 system model, 2 environment. 7 summary of. 130 TC.26 threshold scheme. 81 Threshold-Pieces, 81 Threshold-Recover. 82 Threshold-Split. 82 time-sharing, 4 totality, 23 Transaction. 25 transaction, 21. 23 abort, 25 commit, 25 nested. 38 transactional storage, 21, 121 decentralized. 30 prototype of. 115 single processor, 29 TryToSet 31 two-phase commit protocol, 30 creation. 63 current, 49 obsolete. 63 weak, 62 request, 124 Request- Eval. 15. 105 revocation. 107 algorithm. 109 ring, 97 RUnseal. 85 implementation of. 92 RUnseal-List, 85 save point, 38 schedule. 24 equivalent 24 Seal. 83 implementation of. 87 Seal-List 85 Seal-Only. 84 implementation of, 92 sealed object, 76 sealing. 76 secrecy. 76 secure channel, 105 secure door, 105 secure processor. 104 Secure-~r. 105 Secure-Processor. 105 security envelope. 77 self. 124 SElt.129 Send, 9 serial consistency. 23 Set-Create. 128 Set-Delete. 129 Set-Entry-Guard. 99 Set-Guard. 99 Set-Insert, 128 Set-Lookup. 129 Set-Size. 28 Setlmmutable. 28 shadow mechanism. 29 spoof. 105 stable storage. 10. 30 static configuration. 120 Stop. 128 storage transactional. 21 storage device. 9 capacity. 10 latency. 10 random access. 10 read-only. 10 removable media. 10 serial access. 10 141 INDEX decentralized, 31 uncounted reference, 22 Union. 129 unique identifier. 11 generation. 11 UniqueID,11 Unseal. 83 implementation of. 90 unsealed by. 83 unseals relation. 82 volatile storage. 10 Volume. 27 volume. 21. 26 protected. 99 reconfigurable.66 serial. 26 write-once. 26 vol ume suite. 52 Wait 128 weak representative. 62 Weak-Remote-EvaL 13 implementation of. 14 worker. 30 Write file. 28 index. 40 write quorum. 49 write-ahead-log protocol. 10 142

Source Exif Data:
File Type                       : PDF
File Type Extension             : pdf
MIME Type                       : application/pdf
PDF Version                     : 1.3
Linearized                      : No
XMP Toolkit                     : Adobe XMP Core 4.2.1-c043 52.372728, 2009/01/18-15:56:37
Create Date                     : 2010:11:26 16:51:06-08:00
Modify Date                     : 2010:11:26 16:58:52-07:00
Metadata Date                   : 2010:11:26 16:58:52-07:00
Producer                        : Adobe Acrobat 9.4 Paper Capture Plug-in
Format                          : application/pdf
Document ID                     : uuid:27fc910f-b88c-4508-962f-c54032c3e030
Instance ID                     : uuid:898f7026-c5f0-4027-9787-1e6b79c956d7
Page Layout                     : SinglePage
Page Mode                       : UseNone
Page Count                      : 153
EXIF Metadata provided by EXIF.tools

Navigation menu