1972 12_#41_Part_1 12 #41 Part 1
1972-12_#41_Part_1 1972-12_%2341_Part_1
User Manual: 1972-12_#41_Part_1
Open the PDF directly: View PDF .
Page Count: 666
Download | |
Open PDF In Browser | View PDF |
AFIPS CONFERENCE PROCEEDINGS VOLUME 41 PART I 1972 FALL JOINT COMPUTER CONFERENCE December 5 - 7, 1972 Anaheim, California The ideas and opinions expressed herein are solely those of the authors and are not necessarily representative of or endorsed by the 1972 Fall Joint Computer Conference Committee or the American Federation of Information Processing Societies, Inc. Library of Congress Catalog Card Number 55-44701 AFIPS PRESS 210 Summit Avenue Montvale, New Jersey 07645 ©1972 by the American Federation of Information Processing Societies, Inc., Montvale, New Jersey 07645. All rights reserved. This book, or parts thereof, may not be reproduced in any form without permission of the publisher. Printed in the United States of America CONTENTS PART I OPERATING SYSTEMS Properties of disk scheduling policies in multiprogrammed computer systems .. " ........................................ , ....... . The interaction of multiprogramming job scheduling and CPU scheduling .................................................. . 1 T. J. Teorey 13 23 J. C. Browne J. Lan F. Baskett D. Murphy 33 K. Levitt Exact calculation of· computer network reliability .................. . 49 R. Wilkov E. Hansler G. McAuliffe A framework for analyzing hardware-software trade-offs in fault tolerant computing systems ................................... . 55 K. M. Chandy C. V. Ramamoorthy A. Cowan Automation of reliability evaluation procedures through CARE-The computer aided reliability estimation program .................. . An adaptive error correction scheme for computer memory systems .. . 65 83 Dynamic configuration of system integrity ........................ . 89 F. P. Mathur A. M. Patel M. Hsiau B. Borgerson Storage organization and management in TENEX ................. . The application of program-proving techniques to the verification of synchronization processes ..................................... . ARCHITECTURE FOR HIGH SYSTEM AVAILABILITY COMPUTING INSTALLATIONS-PROBLEMS AND PRACTICES The in-house computer department .............................. . A computer center accounting system ............................ . An approach to job billing in a multiprogramming environment ..... . 97 105 115 Facilities managemBnt-A marriage of porcupines ................. . 123 J. Pendray F. T. Grampp C. Kreitzberg J. Webb D. C. Jung COMPUTER GRAPHICS Automated map reading and analysis by computer ................ . 135 Computer generated optical sound tracks ......................... . 147 Simulating the visual environment in real-time via software ......... . Computer animation of a bicycle simulation ...................... . 153 161 An inverse computer graphics problem ........................... . 169 R. H. Cofer J. Tou E. K. Tucker L. H. Baker D. C. Buckner R. S. Burns J. P. Lynch R.·D. Roland W. D. Bernhart SOFTWARE ENGINEERING-THEORY AND PRACTICE (PART I) Module connection analysis-A tool for scheduling software debugging activities ................................................... . Evaluating the effectiveness of software verification-Practical experience with an automated tool ............................... . A design methodology for reliable software systems ................ . A summary of progress toward proving program correctness. . . ..... . 173 F. M. Haney 181 191 201 J. R. Brown B. H. Liskov T. A. Linden 213 221 D. J. Kuck J. Watson 229 J. A. Rudolph SIFT-Software Implemented Fault Tolerance .................... . TRIDENT-A new maintenance weapon ........................ . 
Computer system maintainability at the Lawrence Livermore Laboratory ................................................. . 243 255 J. H. Wensley R. M. Fitzsimons 263 The retryable processor ........................................ . 273 J. M. Burk J. Schoonover G. H. Maestri 279 287 299 G. J. Nutt T. E. Bell A. De Cegama LOGOS and the software engineer ............................... . Some conclusions from an experiment in software engineering techniques .................................................. . Project SUE as a learning experience ............................ . 311 C. W. Rose 325 331 System quality through structured programming .......... , ....... . 339 D. L. Parnas K. C. Sevcik J. W. Atwood M. S. Grushcow R. C. Holt J. J. Horning D. Tsichritzis F. T. Baker SUPERCOMPUTERS-PRESENT AND FUTURE Supercomputers for ordinary users ............................... . The Texas Instruments advanced scientific computer .............. . A production implementation of an associative array processorSTARAN .................................................. . MAINTENANCE AND SYSTEM INTEGRITY COMPUTER SIMULATIONS OF COMPUTER SYSTEMS Evaluation nets for computer system performance analysis ......... . Objectives and problems in simulating computers .................. . A methodology for computer model building ...................... . SOFTWARE ENGINEERING-THEORY AND PRACTICE (PART II) ARCHITECTURE LIMITATIONS IN LARGE-SCALE COMPUTATION AND DATA PROCESSING (P anel Discussion-No Papers in this Volume) ARRAY LOGIC AND OTHER ADVANCED TECHNIQUES An application of cellular logic for high speed decoding of minimum redundancy codes ........................................... . 345 On an extended threshold logic as a unit cell of array logics ......... . Multiple operand addition and multiplication ..................... . 353 367 Techniques for increasing fault coverage for asynchronous sequential networks .................................................. , . 375 L. R. Hoover J. H. Tracey K.Ohmori K. Nezu S. Naito T. Nanya R. Mori R. Waxman S. Singh ADVANCES IN SIMULATION System identification and simulation-A pattern recognition approach .................................................. , . Horizontal domain partitioning of the Navy atmospheric primitive equation prediction model .................................... . 385 W. J. Karplus 393 An analysis of optimal control system algorithms .................. . 407 Computer simulation of the metropolis ........................... . 415 E. Morenoff P. G. Kesel L. C. Clarke C. N. Walter G. H. Cohen B. Harris 423 425 S. Rothman R. F. Boruch 435 R. Turn N. Z. Shapiro 445 J. M. Carroll Hardware-software trade-offs-Reasons and directions ............. . A design for an auxiliary associative parallel processor ............. . 453 461 An eclectic information processing system ........................ . 473 R. L. Mandell M. A. Wesley S. K. Chang J. H. Mommens R. Cutts H. Huskey J. Haynes J. Kaubisch L. Laitinen G. Tollkuhn E. Yarwood PRIVACY AND THE SECURITY OF DATABANK SYSTEMS The protection of privacy and security in criminal offender record information systems ......................................... . Security of information processing-Implications for social research ... . Privacy and security in data bank systems-Measures, costs, and protector intruder interactions ................................ . Snapshot 1971-How one developed nation organizes information about people ............................ , ..... , .................. . 
ARRAY LOGIC-WHERE ART THOU? (Panel Discussion-No Papers in this Volume) HARDWARE-FIRMWARE-SOFTWARE TRADE-OFFS Microtext-The design of a microprogrammed finite state search machine for full text retrieval ................................. . 479 Design of the B1700 ........................................... . 489 R. H. Bullen, Jr. J. K. Millen W. T. Wilner HUMAN ENGINEERING OF PROGRAMMING SYSTEMS-THE USER'S VIEW An on-line two-dimensional computation system ................... . Debugging PLII programs in the multics environment ............. . AEPL-An Extensible Programming Language ................... . 499 507 515 The investment analysis language ............................... . 525 T. G. Williams B. Wolman E. Milgrom J. Katzenelson C. Dmytryshak DATA COMMUNICATION SYSTEMS The design approach to integrated telephone information in the Netherlands ................................................ . 537 R. DiPalma G. F. Hice Field evaluation of real-time capability of a large electronic switching system ..................................................... . 545 Minimum cost, reliable computer-communications networks ........ . 553 W. C. Jones S. H. Tsiang J. De Mercado MEASUREMENT OF COMPUTER SYSTEMS-SYSTEM PERFORMANCE (Panel Discussion-No Papers in this Volume) MEMORY ORGANIZATION AND MANAGEMENT Control Data STAR-lOO file storage station ....................... . 561 Protection systems and protection implementations ................ . B1700 memory utIlization ...................................... . Rotating storage devices as "partially associative memories" ........ . 571 579 587 G. Christensen P. D. Jones R. M. Needham W. T. Wilner N. Minsky DYNAMIC PROGRAM BEHAVIOR Page fault frequency (PFF) replacement algorithms ............... . 597 Experiments wish program locality .............................. . 611 W. W. Chu H. Opderbeck J. R. Spirn P. J. Denning COMPUTER ASSISTED EDUCATIONAL TEST CONSTRUCTION TASSY-One approach to individualized test construction .......... . 623 T. Blaskovics J. Kutsch, Jr. A comprehensive question retrieval application to serve classroom teachers ... , ............................................. ' .. . Computer processes in repeatable testing ..... , ................... . 633 641 G. Lippey F. Prosser J. N akhnikian Properties of disk scheduling policies in multiprogrammed computer systenls by TOBY J. TEOREY University of Wisconsin Madison, Wisconsin SCAN have been suggested by Coffman and Denning, 2 Manocha, 9 and Merten. 10 The implementation of SCAN is often referred to as LOOK,1O,12 but we retain the name SCAN for consistency within this paper. Both C_SCAN9,11,12,13 and the N-step scan6 ,12,13 have been discussed or studied previously and the Eschenbach scheme was developed for an airlinessystem. 14 Because it requires overhead for rotational optimization as well as seek time optimization it is not included in the following discussion. In the simulation study12 it was seen that the C-SCAN policy, with rotational optimization, was more appropriate than the Eschenbach scheme for all loading conditions, so we only consider C-SCAN here. The simulation results indicated the following, given that cylinder positions are addressed randomly:12 under very light loading all policies perform no better than FCFS. Under medium to heavy loading the FCFS policy allowed the system to saturate and the SSTF policy had intolerable variances in response time. 
SCAN and the N -step policies were superior under light to medium loading, and C-SCAN was superior under heavy loading. We first investigate various properties of the N -step scan, C-SCAN, and SCAN, since these are the highest performance policies that optimize on arm positioning time (seek time). The properties include mean, variance, and distribution of response time; and the distribution of the positions of requests serviced as a function of distance from the disk arm before it begins its next sweep. Response time mean and variance are then compared with simulation results. A unified approach is applied to all three policies to obtain mean response time. The expressions are nonlinear and require an iterative technique for solution; however, we can easily show that sufficient conditions always exist for convergence. Finally, we look at the factors that must be considered in deciding whether or not to implement disk INTRODUCTION The subject of scheduling for movable head rotating storage devices, i.e., disk-like devices, has been discussed at length in recent literature. The early scheduling models were developed by Denning, 3 Frank, 6 and Weingarten. 14 Highly theoretical models have been set forth recently by Manocha,9 and a comprehensive simulation study has been reported on by Teorey and Pinkerton. 12 One of the goals of this study is to develop a model that can be compared with the simulation results over a similar broad range of input loading conditions. Such a model will have two advantages over simulation: the computing cost per data point will be much smaller, and the degree of uncertainty of a stable solution will be decreased. Although the previous analytical results on disk scheduling are valid within their range of assumptions, they do not provide the systems designer with enough information to decide whether or not to implement disk scheduling at all; neither do they determine which scheduling policy to use for a given application, be it batch multiprogramming, time sharing, or real-time processing. The other goal of this study is to provide a basis upon which these questions can be answered. The basic scheduling policies are summarized with brief descriptions in Table 1. Many variations of these policies are possible, but in the interest of mathematical analysis and ease of software implementation we do not discuss them here. SCAN was first discussed by Denning. 3 He assumed a mean (fixed) queue length and derived expected service time and mean response time. The number of requests in the queue was assumed to be much less than the number of cylinders, so the probability of more than one request at· a cylinder was negligible. We do not restrict ourselves to such an assumption here. Improvements on the definition and representation of 1 2 Fall Joint Computer Conference, 1972 TABLE I-Basic Disk Scheduling Policies 1. FCFS (First-come-first-served): No reordering of the queue. 2. SSTF (Shortest-seek-time-first): Disk arm positions next at the request tha.t minimizes arm movement. 3. SCAN: Disk arm sweeps back and forth across the disk surface, servicing all requests in its path. It changes direction only when there are no more requests to service in the current direction. 4. C-SCAN (Circular scan): Disk arm moves unidirectionally across the disk surface toward the inner track. When there are no more requests to service ahead of the arm it jumps back to service the request nearest the outer t.rack and proceeds inward again. 5. 
N-step scan: Disk arm sweeps back and forth as in SCAN, but all requests that arrive during a sweep in one direction are batched and reordered for optimum service during the return sweep. 6. Eschenbach scheme: Disk arm movement is circular like C-SCAN, but with several important exceptions. Every cylinder is serviced for exactly one full track of information whether or not there is a request for that cylinder. Requests are reordered for service within a cylinder to take advantage of rotational position, but if two requests overlap sector positions within a cylinder, only one is serviced for the current sweep of the disk arm. (constant transmission time) is a fair approximation for our purpose of a comparative analysis. 3. Only a single disk drive with a dedicated controller and channel is considered, and there is only one movable head per surface. All disk arms are attached to a single boom so they must move simultaneously. A single position of all the read/write heads defines a cylinder. 4. Seek time is a linear function of seek distance. 5. No distinction is made between READ and WRITE requests, and the overhead for scheduling is assumed negligible. If there are L requests in the queue at equilibrium and C cylinders on the disk, we partition the disk surface into C1 equal regions (as defined below) and assume that at least one request lies in the center of that region. This partition is only valid when seek time is a linear function of distance. C1 is -computed as follows: since the distribution of L requests serviced is uniform, the probability that cylinder k has no requests is given by scheduling in a complex system. In practice, considerable attention should be given to these factors before thinking about which policy to use. N-STEP SCAN The N -step scan is the simplest scheduling policy to model using the approach discussed here. While the disk arm is sweeping across the surface to service the previous group of requests, new requests are ordered linearly for the return sweep. No limit is placed on the size of the batch, but at equilibrium we know the expected value of that size to be L, the mean queue length. Furthermore, we know that the resulting request position distribution will be the same as the input distribution, which we assume to be uniform across all the disk cylinders. We also assume the following: 1. Request inter arrival times are generated from the exponential distribution. 2. File requests are for equal sized records. This simplifies the analysis. We assume that the total service time distribution (seek time plus rotational delay plus transmission) is general and cannot be described by any simple distribution function. We also assume that the access time (seek time plus rotational delay) dominates the total service time, so that fixed record size (1) The expected number of cylinders with no requests is CO=CPk , so that the expected number of cylinders requiring service is: C1 =C-CO =C-C(l- ~r (2) If the incoming requests are placed at· random and the disk arm has equal probability of being at any cylinder, we know that the expected distance between an incoming request and the current position of the disk arm is approximately C/3 for large C. Typically, C ~ 200 for currently available disks. In Figure 1 we see the possible paths taken from the disk arm to the new request for the expected distance of C/3. 
The expected number of requests serviced hefore the new request is serviced is L, and the mean response time is (3) where Ts is the expected service time per request and T 8W is the expected sweep time from one extreme of the disk surface to the other. Properties of Disk Scheduling Policies 3 we have: L= XCl(Tmin+AT/Cl+T/2+T/m-a) 1-Xa NEW REQUEST Figure 1 The expected service time under the assumptions listed above was derived by Teorey and Pinkerton12 as follows: T.=P (T,.+ f + f) +(1-P) (9) Equation (9) computes mean queue length in terms of the input rate X, the known disk hardware characteristics, and C1• C1 is, however, a nonlinear function of L. We solve (9) by estimating an initial value for L in (2) and iteratively substituting (2) into (9) until the process converges. Convergence ! [(mt-2) (m-l) +1] m 2(mt-l) (4) L(l-Xa) where P is the probability that a seek is required to service the next request, Tsk is the expected seek time, T is the rotational time of a disk, m is the number of sectors per track, and t is the number of tracks per cylinder. Under our conditions, P=CI/L, and we simplify expression (4) by making the following definition: T [(mt-2) (m-l) a= +1 ] m 2(mt-1) (5) XAT L= - 1-Xa [1- (C~lrJ (Tmin+ f +; -a) -,..C +- (Tmin +T/2+T/m-a) l-Xa XC (C-l)L - - - (Tmin +T/2+T/m-a) -C (10) Letting Kl=XATj(l-Xa) +[XC/(1-Xa)](Tmin +T/2+ Tim-a) andK2 =[XC/(1-Xa)]. (Tmin +T/2+T/m-a) we obtain after i iterations: llT c; (11) (6) llT T m in+ =MTHC 1-Xa Also, for a linear seek time characteristic Tmin+ Rewriting (9) in terms of (2) we obtain 3 where llT = T max- T min, T min is the seek time for a distance of 1 cylinder, and T max is the seek time for a distance of C -1 cylinders. Restating (4) we now have Assuming that Li>O for all i, and l-Xa>O (no saturation) , we have: Li>O} =}0:5: (C-l)Li <1 - -C1-Xa>0 =}0L i - 1=}Li+l >Li and L i 1 for all i, and l-Aa>O (no saturation). L'i>1 } l-Aa>O ==>O~ (C-l)Li -- C - ==>1/C 0< K1 C , Ka also exists; each has a probability of .5. DISK ARM DIRECTION h AREA 2 Kr c Ka Figure 6 cylinder k obtaining the next incoming request: 1. Kr Ka C for Areas 1, 2, 3; lR = number of requests serviced from Kr to Ka = AREA 3 Area 2 K-1 = 72 (Kr-Ka) ( C":'I· 2L' K-1 2L') C + C-=-1 • C l~k~C (23) The input distribution is uniform; therefore each arrival of a new request represents a repeated Bernoulli trial of the same experiment. The probability that cylinder k remains empty is for Area 2 (Kr-Ka) (Kr+Ka-2)L' C(e-l) (23) = (1- k-l • ~)L' C-l To compute the expected number of cylinders with no requests, we first determine the probability of a given C for Areas 1, 2, 3 (24) and the expected number of occupied cylinders in that region is /,,/ / / A=50 REQUESTS/SEC. 500 C2 =C/3- Kr [1- (k -=-. 1 2L'/ )]lR :E -C lR C 1 k=Ka .,1'" for Area 2 cLIJ U ~ 2)LI 400 c ( k-l C1=C-:E 1- . - LIJ en en k=l t;; 300 LIJ :::> S a::: ~ a::: IJJ C for Areas 1, 2,3 (25) Mean response time 200 A=20 REQUESTS/SEC. CD ~ The mean response time is given by W = Probability {K r> Ka} • Tsw {Area 2} :::> z C-l 100 +Probability {Kr -' f- This expression is the same as (9) for the N -step scan except for the meaning of L' and C1• Solution of (27) is obtained by iteration. ~ ffi ~ .050 ~ .025 Convergence ~---+-----t----+---I--_ _ Sufficient conditions for convergence of the above procedure for SCAN are L'o>O and l-Xa>O. 
The proof proceeds as before: Letting K 1= (X/I-Xa)[AT+ C(Tmin +T/2+T/m-a)] and K 2 = (X/I-Xa) [Tmin + o .5Tsw Figure 7 Variance of response time The response time distribution for SCAN is not intuitively obvious. In order to obtain a close approximation to this distribution we can sample all possible TABLE II-Ratio of Requests Serviced per Sweep to Mean Queue Length for SCAN L' /L Requests/second 10 20 I I I 1.18 1.36 1.46 30 40 I 16 - - - - - N-STEP SCAN ................. SCAN ---C-SCAN 1.47 50 60 1.48 1.49 Limit 1. 50 I 14 I I 12 T/2+T/m-a] we can substitute (25) into (27) and obtain after i iterations: L'i+1=KI-K2 c ( L k=l L'i> 2 k-l )Lif 1- - . - C C-l 0} (2 k -1 )Lif =}O~ 1- - . - - l-Xa>O (28) I fBz 0 f:d 10 sa i= LIJ CFl <1 for all k~C z 0 L 8 ~ f{3 c:: z « k=l I LIJ LIJ =}O~ I :!! C C-l o ( RESPONSE TIME :!! 2 k_l)Lif 1- - . (S:> T)J == [CRAS)':) T)] to form a logical expression involving a single implication where only the transformed consequent assertion appears on the right side of the implication. 35 /\ (B/2= Q*D/4) /\ (D/2=2- k ) '/\ (k = nonnegative integer) /\ (P/Q-D/2 O. TA-71 0582-33 Figure I-Flow chart representation of simple control program Two steps must be followed in proving the program Application of Program-Proving Techniques with respect to the above four assertions. Step 1 is to prove that for all paths the assertions are consistent with the transformation specified by the intervening code; Step 2 is to establish that the validity of the correctness and deadlock parts is correctly embedded in the guessed assertions. First, in Step 1 the following control paths must be verified: 1~2, 1~3,3~4,3~3. For purposes of illustration we will outline the proof of 1~3; this outline should enable the reader to verify the other paths. The path from 1~3 embodies the following steps 37 ---I I CRITICAL SECTION 1 I I ql S~S-l ~ Test: 8=0 D~D+1 V(S) qa. ~I Back substitution on qa leads to the following verification condition: [ (integer S) /\ (8 ~ 1) /\ (D = u ( - 8 + 1) ) /\ (PENS=u(8) -S) J/\ (8-1 =0) :> [ (integer 8 - 1) /\ (8 -1 ~ 0) /\ (D+1=1) /\ (PENS = -S+l)]. The first term of the consequent, integer 8 -1, is true from integer S. The second term is true from 8 -1 = 0, i.e., 8=1. The third term, D=O, is true from [D=u( -8+1)J/\ (S=l). The fourth term is true from [PEN8=u(S)-8J/\(8=1). Thus the path 1~3 is verified (with respect to the "guessed" assertions). Step 2, establishing that the assertions embody the desired behavior of the correctness and deadlock parts, remains to be carried out. The correctness part is apparent from the assertions by noting that D = or 1. The deadlock part is satisfied by noting that whenever PEN8 ~ 1, then also D ~ 1; thus there exists a process currently in the critical section that will eventually schedule some deferred process. As an extension of this simple control program that we will use in the following sections, consider the program displayed in Figure 2. The program is a straightforward extension of the simple single critical section program discussed above. It can be shown by a proof similar to that outlined above that access is granted to only one of the two critical sections at a time. Thus, control cannot be simultaneously at points ® and 0. The interpretation of P(8) and V(S) is modified from that described previously, as shown in Figure 3. The variables PENS1 and PENS2 serve to ° I ~ p ( S ) -...... 
- l® 0 - ;, I I CRITICAL SECTION 2 V TA-71 0582-34 Figure 2-A control program with two critical sections indicate, respectively, the number of processes pending on semaphore 8 at critical sections 1 and 2. The "CHOOSE" box functions as follows. Either of the two output branches is chosen at random. If the test in the selected branch succeeds, then control continues along the branch; otherwise, control is passed to the other branch. Note that the relation S < 1 ensures that control 38 Fall Joint Computer Conference, 1972 APPLYING ASSERTIONS TO SYNCHRONIZATION PROGRAMS AND ABSTRACTING THE PROOF OF CORRECTNESS AND DEADLOCK FOR THE ASSERTIONS P(S) (FOR CRITICAL SECTION 1) ~ S +- S -1 • N01 TEST: S < l~C SPLIT h I -' ~ 4 (:HOOS0 ~ TEST: [ENS I PENS 1+- PENS + To (V II 1 ~ > O~ TEST: PE1V:> 1- 1J IPENS 0 2 +- PENS 2 - 11 The simple program of Figure 1 reveals, although only in a trivial manner, the possibilities for parallel activity that we wish to exhibit. For example, in Figure 1 it is possible for control to reside simultaneously in the critical section (point 0) and at point CD. The assertion we applied at point CD reflects the possibilities for multiple points of control in that the variable relationships correspond to control being only at point CD, simultaneous at points CD and 0, or simultaneous at points CD, ®, 0. (It is assumed that processors are available to execute any code associated with the critical section as well as with the peS) and YeS) blocks.) In proving the program we did not require any new formalisms beyond those associated with the uniprocessing situation since hardware locks are so constituted that the P and V operations are not simultaneously executed. A more general situation is displayed in Figure 4. Here we illustrate portions of two processes, A and B, with interprocess communication achieved via the semaphore S. The particular model of computation that we will assume is as follows: Assume that at periodic intervals calls are made on sections A or B. The availability of a processor + TOQD TA-71 0582-35 Figure 3-Interpretation of P and V for two critical sections can pass along at least one of the branches because if S < 1, then PENS1 + PENS2 > 1. The purpose of the CHOOSE box is to place no arbitrary constraints on the scheduling of deferred processes. The "SPLIT" box simultaneously passes control along each of its output branches. The intention here is both to reschedule another process onto a critical section associated with semaphore S and to have the process that just finished the critical section execute the instructions following YeS). Wherever two or more parallel paths converge there is a JOIN box, embodying some rules for combining the paths. Points 0 and 0 of Figure 3 are really JOIN boxes. The most apparent such rules are OR (AND) indicating that control is to proceed beyond the JOIN box wherever any (all) of the inputs to the JOIN box are active. Our discussion will apply mainly to the OR rule, but is easily· extended to the AND case. SECTION A SECTION B ENTER A ENTER B t ® P(SI---+ Y, <- f(Y21 ! @ VIS) r-- -----------------------, I I I I ITEST':< 1YH.( SPtT ~ I [No [ ! I I ~ II t I S<-S-1 I I ~ ~ TEST: PENS 1 1 :T2! ~: >0 PENS 1 <- PENS 1 - 1 TEST: PENS 2 1 >0 PENS 2 ... PENS 2 - 1 y,·tJ V(SI---+ 1® Y2 <- h(Y2 1 !0 Ya <- g(Y2 1 I II L---------i------------------- J ~V(MI TA-710582-36 Figure 4-Program to illustrate assertion interpretation Application of Program-Proving Techniques to commence process'ng of the calls is always assumed to exist. 
If two or more processors attempt simultaneous reference to a variable or operator, the selection of the processor that achieves access is made arbitrarily. If execution is deferred, say, at point @ , subsequent to the P (lVI) operation, the affected processor is presumably freed to handle other tasks. When the corresponding V (M) operation is carried out, schedul ng a deferred process, a processor is assumed to exist to effect the processing. With reference to this program and the assumed model of parallel computation, we will illustrate approaches to the placement of assertions and to proving the consistency of the assertions relative to intervening program statements. Assertion placement Since we are assuming a parallel/multiprocessing environment, there are potentially many points in the flow chart at which a processor can be in control. For example, in Figure 4 control can be simultaneous at points CD, ®, and 0. However, we will assume that the role of the POVI) and V(l\1:) operations is to exclude simultaneous accesses to the intervening critical section, provided there are no branches into the critical section. Hence, control cannot be simultaneous at points CD and @ . An assertion, for example at point CD, must reflect the state of the variables of the various processes assuming that: (1) Control is at point CD and, simultaneously, (2) Control is at any possible set of allowable points. By "allowable" we mean sets of points not excluded from control by virtue of mutual exclusion. We recall that for the uniprocessor environment assertions are placed so that each path in the program is provable. As an extension of that observation we can show that the proving of paths in a parallel program can be accomplished provided the following rules are satisfied: (1) Each loop in the· program must be broken by at least one assertion. (2) Within a critical section (i.e., one where control is at only one point at a time and where any variables in the critical section common to other portions of the program are themselves in critical sections under control of the same semaphore), only a sufficient number of assertions need be applied to satisfy the loop-cutting 39 rule, (1). We assume that all entries to critical section are controlled by P, V primitives. If not then rule (3) below applies. (3) All points not covered by rule (2) must generally be supplied with assertions. (4) Each HALT point and all WAIT points associated with a P operation must contain assertions. Thus, in Figure 5 a possible placement of assertions is at points @ , CD, ®, 0, 0, and 0. Note that since the purpose of synchronization programs is generally to exclude, by software techniques, more than one process from critical sections, such programs will not require the plethora of assertions associated with a general parallel program. Also note that it is a simple syntactic check to determine if a given assertion placement satisfies the above rules. Once the points where the assertions are to be placed have been selected and the assertions have been developed, it remains to prove the consistency of assertions. As in the uniprocessor case, the first step in this proof process is to develop the verification conditions for each path. For the parallel environment of concern to us here, we are confronted with the following types of paths: simple paths, paths with SPLIT nodes, paths with CHOOSE nodes, and impossible paths. 
These four path types are handled below, wherein the rules are given for developing the verification conditions, and some indication is given that the parallel program is correct if these rules are followed. A complete proof of the validity of the rules is not given because an induction argument similar to that of Floyd's applies here. Verification condition for a simple path By a simple path we mean a path bounded by an antecedent and a consequent assertion, with the intervening program steps being combinations of simple branch and assignment statements. For such a path the verification condition is derived exactly as in the uniprocessor case. That this is the correct rule is seen by noting that the assertion qa placed at point a in the program reflects the status of the variables, assuming that control is at point a and also at any allowable combination of other points containing assertions. Also note that because of our assertion placement rules, the variables involved in the code between a and b are not modified simultaneously by any other process. Thus, if a simple path a~b is bounded by assertions qa and qb and if it is proven that %/\ (intervening code) ::)qb, then the path is proven independently of the existence of control at other allowable points. 40 Fall Joint Computer Conference, 1972 Verification conditions for paths with SPLIT nodes Impossible paths Assume that a SPLIT node occurs in a path, say, bounded on one end by the antecedent assertion qa. Recall that at the SPLIT node, separate processors commence simultaneously on each of the emerging paths. Also assume that along the two separate paths emerging from the split nodes the next assertions encountered are qb and qc, respectively. * In this case the "path" (which is actually two paths) is proved by showing that As mentioned above, not all topological paths in a program are necessarily paths of control. In effect, what this means is that no input data combinations exist so that a particular exit of a Test is taken. Recall that for antecedent and consequent assertions qa, qb and an intervening Test, T, the verification condition is qa/\ T'::)qb', where the prime indicates that back substitution has been carried out. Clearly, if the test always evaluates to FALSE, then qa/\ T' must evaluate to FALSE, in which case the implication evaluates to TRUE independent of qb'. (We recall that TRUE::) TRUE, FALSE::)TRUE, and FALSE::)FALSE are all TRUE.) qa/\ (code between point a and SPLIT node) /\ (code between SPLIT node and point b) /\ (code between SPLIT node and point c)::) (qb/\ qc). Proving that program has no deadlock Note that it is not sufficient merely to consider the path between, say, a and b, since the transformations between the SPLIT node and c may influence the assertion qb. However, note that the variable references along the two paths emerging from the SPLIT node are disjoint, by virtue of the rules for selecting assertion points. Hence the use of back substitution to generate the verification condition can function as follows. Assertion qb is back-transformed by the statements between point b and the SPLIT node, followed by the statements between point c and the SPLIT node, finally followed by the statements between the SPLIT node and point a to generate qb. A similar rule holds for traversing backward from qc to generate qc. Note that the order in which the two paths following the SPLIT node are considered is not crucial since these paths are assumed not to reference the same variables. 
Verification condition for a path with a CHOOSE node Recall that when control reaches a CHOOSE node having two exits, the exit that is chosen to follow is chosen arbitrarily. Hence the effect of a CHOOSE node is simply to introduce two separate simple paths to be proven. For antecedent assertions qb, qc, what must be proved is qa /\ (code between a and b) ::)qb qa /\ (code between a and c) ::)qc. Note that one or possibly both of the paths might not be control paths, but this introduces no difficulties, as we show below. * Various special cases are noted, none of which introduce any particular difficulties. It is possible that qa, qb and qc might not be all distinct or that another SPLIT node occurs along a path before encountering a consequent assertion. For the parallel programs that we are dealing with deadlock will be avoided if for every semaphore S such that one or more processes are pending on S, there exists a process that will eventually perform a YeS) operation and thus schedule one of the deferred processes. (Weare not implying that every deferred process will be scheduled, since no assumptions are made on the scheduling mechanism.) In particular, if a process is pending on semaphore a, then it is necessary to show that another process is processing a. If that latter process is also pending on a semaphore b, it is necessary to show that b~a, and that a third process is processing b. If that third process is pending on c, it is necessary to show that c~b, c~a, and that a fourth process is processing c, etc. In the next sections we apply the concepts above to the verification of particular control programs. PROOF OF COURTOIS2 PROBLEM 1 This section presents a proof of a control program that was proposed by Courtois et al. The program is as follows: Integer RC; initial value = 0 Semaphore M, Q; initial values = 1 READER P(M) WRITER RC~RC+1 if RC= 1 then P(Q) V (lVI) READ PERFORMED P(M) RC~RC-1 if RC=O then V(Q) V(M) P(Q) WRITE PERFORMED V(Q) Application of Program-Proving Techniques WRITER READER lCD P(M)--+® -----..0 RC +- , RC + 1 TEST: RC = 1.:!!!, NOj_ . PIQ )-@ +-------, RD +- RD + 1 lCD l ---... P(Q)--+® V(M) 1® 1 ~1° WD (DEVICE) WD RD - 1 RC - 1 ++- 1 V(Q)_..J I 4 I 1 1 +- WD - 1 I 1@ V(Q) _ __ Yes TEST: RC = 0--, No WD + 1 @)(DEVICE) P(M)-+@ RD RC +- - 41 certain rare circumstances a reader's access might be deferred for a writer even though at the time at which the reader activates the READER section no writer is actually on the device.) A writer is to be granted access only if no other writer or reader is using the device; otherwise, the requesting writer's access is deferred. In particular, any number of simultaneous readers are allowed access provided no writers are already on. The role of the semaphore M is to enforce a scheduling discipline among the readers' access to RC and Q. For our model of parallel computation, it can be shown that the semaphore M is not needed, although its inclusion simplifies the assertion specification. Figure 5 is a flow chart representation of the program. A few words of explanation about the figure are in order. The V(Q) operator for the reader and the V (1\1) operator for the upper critical section are assumed to be the generalized V's containing the CHOOSE and SPLIT nodes as discussed in the two previous sections. The other V operators are assumed to contain CHOOSE but no SPLIT nodes. The dashed line emerging from V (Q) indicates a control path that will later be shown to be an impossible path. 
Associating appropriate variables with each of the P and V operators, the following integer variables and initial values are seen to apply to the flow-chart. 1\;1 Q RC RD WD RPENQ 1 1 0 0 0 WPENQ RPEN1\11 RPEN1\12 o 0 0 ° where the Rand W prefixes to a variable correspond, respectively, to readers and writers and the 1 and 2 suffixes correspond, respectively, to the "upper" and "lower" critical sections associated with semaphore 1\1. Once again we will divide the proof for this program into a correctness part and a deadlock part. For the correctness part we will establish that ......_ _ V(M) l® TA-71 0582-37 Figure 5-Flow chart representation of Courtois problem 1 The purpose of the program is to control the access of "readers" and "writers" to a device, where the device serves in effect as a critical section being competed for by readers and writers. If no writers are in control of the critical section, then all readers who so desire are to be granted access to the device. (We show below that the program almost satisfies this goal, although under (1) WD = 0 or 1, indicating that at most one writer at a time is granted access to the device. (2) If WD=l, then RD=O, indicating that if one writer is on the device, then no readers are "on." (3) If WD=O, then RPENQ=O, indicating that if no writer is on the device, then a reader is not held up by semaphore Q. An entering reader under these circumstances could be held up by semaphore 1\1, i.e., RPENMl>O. (We will temporarily defer discussion of this situation.) According to the assertion placement rules, each 42 Fall Joint Computer Conference, 1972 input, output and wait point must possess an assertion, each loop must be cut by an assertion, and in addition, an assertion must be placed at each point along a path wherein along another parallel path there exists an instruction referencing variables common to the point in quest:on. For this program the assertion placement problem is simplified since all variables, e.g., RC and Q, common to two or more parallel paths are a part of critical sections wherein access is granted only to one such critical section at a time. Hence, only the inputoutput and loop-cutting constraints must be satisfied, leading to a possible placement of assertions at the numbered points in Figure 5. Note that point ® does not require an assertion, but since it represents a control point where readers are on the device, it is an interesting reference point. The assertions associated with all 11 control points are indicated in Table 1. The assertion labelled G LO BAL is intended to conjoin with the other 11 assertions. The appearance of (i) at the beginning of a disjunctive term in q2, q3, q8, q9 indicates that the first (i) terms are the same as in ql. Thus, for example, in the first disjunctive term of assertion q2, the first six conjunctive terms are the same as in the first disjunctive term of ql, but the seventh and eighth terms are different, as shown. It is worthwhile discussing our technique for specifying the assertions-we will provide sample proofs later on to attest to the validity of the assertions. In specifying the assertion at a point a, we assumed, of course, that control is at a and then attempted to guess at which other points control could reside. Variable relationships based on this case analysis were then derived, and then the expressions were logically simplified to diminish the proliferation of terms that resulted. 
For example, in assertion ql, the first disjunctive term corresponds to the case: no writers on the device, i.e., control is not at @. The second disjunctive term corresponds to the case of control at @. With regard to the second term if control is hypothesized at @, it is also guessed that control could possibly be at 0, ([), and 00r0. It remains to verify all the paths bounded by the 11 assertions. The paths so defined are: 1~2;1~3;3~;3~(5,3);3~(5, 7);5~6;5~7, 7~8 [RC~O]; 7~7 [RC~O]; 7~3 [RC~O]; 7~(5, 3) [RC=O]; 7~(5, 7) [RC=O]; 7~(5, 8) [RC=O]; 7~(10, 3) [RC=O]; 7~(10, 7) [RC=O]; 7~(10, 8) [RC=O]; 1~9; 1~10; 10~11; 10~10; 10~5; 10~(5, 3); 10~(5, 7). A brief discussion of the symbolism is in order. For example, the path 3~(5, 3) commences at 0, and then splits at the SPLIT node of V (M) into two paths leading to ® and 0. The path 7~(10, 3) [RC=O] indicates that the branch defined by RC = 0 is taken, followed by a splitting at V (Q), one path leading to 0, and the other path taking the CHOOSE exit toward @. Clearly, many of the above paths are impossible paths-as revealed by the proof. We will not burden the reader of this paper with proofs of all the paths, but we will provide an indication of the proofs for several of the more interesting paths. TABLE I-Assertions for Courtois Problem 1 Global: (All variables E 1);'\ (M ~ 1);'\ (Q ~ 1);'\ (RC ~O);'\ (RD ~O);'\ (WD ~O);'\ (RPENQ ~O);'\ (WPENQ ~O);'\ (RPENMI ~ 0);'\ (RPENM2~0) ql; q2; qa: q4: [(WD=O);,\(RD=RC);,\(RPENQ=O);,\(WPENQ=u(Q)-Q);,\ u(Q)=u(I-RC»;'\(RPENM2~RD);'\(RPENMl+ RPENM2=u(M)-M)]V [WD= 1);'\ (RD =0);'\ (RPENQ = RC);'\ (WPENQ = -Q-RC);'\(RC =u(RC»;,\ (RPENM2 =0);'\ (RPENMI =u(M) - M);'\ (M ~u( 1- RC»] [(6)(RPENMI >Q);,\ (M O RPENQ~RPENQ -1 RD~RD+1 M~M+1 Test: M<5 Test: RPENi>o ~q, RPENMl~RPENMl-l Backsubstitute qa and q5 to yield qa', q5' qs': (WD=1)I\(RD+1=RC)I\(RPENQ=1)I\(WPENQ= -Q-l)I\(Q+1::S;0)I\(RPENM2::S;RD)I\(RPENMl+RPENM2 =u(M+1)-M)I\(RC~1) q3': r(WD = 1)1\ (RD + 1 =RC)I\ (RPENQ = 1)1\ (WPENQ = u(Q + 1) -Q -1)(u(Q + 1) =u( 1-RC»I\ (RPENM2 ::S;RD + 1) I\(RPENM1+RPENM2=u«M+l)-M)]V[(WD=2)I\(RD= -1) ... ] Tests backsubstituted T': (RPENM1 >0)1\ (M <0)1\ (RPENQ >0)1\ (Q <0) Verification Conditions qlOl\ T':>q5' qlOl\ T'::>q3' Sample Proof: Proof of Q5' term RPENQ = 1 From qlO: (RPENQ=RC)I\(RC=u(RC» Thus RPENQ=O or 1 From T' RPENQ>O Thus RPENQ = 1 Table II outlines the steps in proving the path 10~(5, 3). At the top of Table II we delineate the steps encountered along the path. As is readily noted, the path contains a SPLIT node. To develop the verification condition, back substitution is required from both q3 and q5 to form qa' and q5'; note that in developing q5' the statements between the SPLIT node and point ® must be considered, in addition to the statements directly between points @ and 0. To verify the path, the following two logical formulas must be proved true: qlOl\ T'~q5', qlOI\ T'~q3" At the bottom of Table II we outline the few simple steps required to prove the term (RPENQ = 1) in q5'. The patient reader of this paper can carry out the comparably simple steps to handle the remaining terms. Note that qs' is the disjunction of two terms, one beginning with the term (WD = 1) and the other with the term (WD=2). For ql01\ T'~qs' to be true, it is necessary for only one of the disjunctive terms to be true. The reader can verify that it is indeed the first disjunctive term that is pertinent. As a final note on the verification of paths, consider the path 10~(5, 7). 
A little thought should indicate that this should be an impossible path since the effect of control passing to point (j) is to schedule a process that was deferred at point 0, but at point 0 a reader is considered to be on the device, and hence point 0 could not be reached from point @ where a writer is on the device. This is borne out by considering the formula (qlO 1\ T') for the path in question. In qlO there is the conjunctive term (RPENM2=O) while in T', the back-substituted test expression, there is the conjunctive term (RPENM2 <0). Thus, qlOl\ T' evaluates to FALSE, indicating that the path is impossible. I t remains now to prove the correctness and deadlock conditions by observation of the assertions and the program itself. The key assertion here is ql since it expresses the relationship among variables for all possible variations of control, e.g., for all allowable assignments of processors to control points in the program. On the basis of ql we can conclude the following with regard to correctness: (1) WD = 0 or 1, indicating that no more than one writer is ever granted access to the device. (2) If WD=l, then RD=O, indicating that if a writer is on the device, then no reader is. (3) The issue. of a requesting reader not encountering any (appreciable) delay in getting access to a device not occupied by a writer is more complicated. From the first disjunctive term of ql 44 Fall Joint Computer Conference, 1972 (that deals with the case of no writers on the device), we note that if WD =0, then RPENQ = O. Hence, under the assumed circumstances a requesting reader is not deferred by semaphore Q. However, a requesting reader could be deferred by semaphore lVr. In fact, a requesting reader could be deferred at point ® while the RD readers on the device emerge from point 0, and then be scheduled onto the lower/ critical section wherein the last emerging reader performs V(Q) and schedules a writer. The deferred reader will then be scheduled onto the upper critical section only to be deferred by Q at point 0. Although it is an unusual timing of requests and reader completions that leads to this situation, it still violates the hypothesis that a reader is granted access provided no writer is on the device. * Note that, under the assumption that (WD = 0) and RPENM2 remains zero while a requesting reader is deferred by M at point ®, the requesting reader will be granted access to the device prior to any requesting writers. We now dispose of the question of deadlock. We need to demonstrate that, if a process is pending on a semaphore, then there exists another process that will eventually perform a V operation on that semaphore. With regard to semaphore Q, we note from observation of ql that if RPENQ>O or WPENQ>O, then either WD = 1 or RD ~ 1. Thus, if any process is pending on Q, there exist processes that might eventually do a V(Q) operation. It remains to dispose of the issue of these processes themselves pending on semaphores. It is obvious that a writer on the device must emerge eventually, at which time it will do a V (Q) operation. For one reader (or more) on the device, in which case RPENQ = 0, we have shown that the last reader will perform a V(Q) operation. A reader could be deferred by semaphore M, but in this case there is a reader processing lVI that is not deferred by Q and hence must do a V (1\11) operation. 
* We conjecture that there is no solution to this problem without permitting the conditional testing of semaphores, so that the granting of access to a writer or reader to the device is decided on the basis of the arrivaltime of a reader or writer at the entry point to the control program. In effect, what the program here accomplishes is to grant a reader access to the device provided it passes the test: RC = 0 while WD = O. Note that there are other problems that do not admit to solutions using only P and V operations unless conditional testing of semaphores is permitted, e.g., see Patil.1 5 DISCUSSION In this paper we have developed an approach based on program-proving techniques for verifying parallel control programs involving P and V type operations. The proof method requires user-supplied assertions similar to Floyd's method. We have given a method for developing the verification conditions, and for abstracting the proof of correctness and nondeadlock from the assertions. We applied the technique to two control programs devised by Courtois et al. At first glance it might appear that the method is only useful for toy programs since our proofs for the above two programs seem quite complex. However, in reality the proofs presented here are not complex, but just lengthy when written out in detail. The deductions needed to prove the verification conditions are quite trivial, and well within the capability of proposed program proving systems. * By way of extrapolation it seems reasonable for an interactive program verifier to handle hundreds of verification conditions of comparable complexity. Thus one might expect that operating systems containing up to 1000 lines of high-level code should be handled by the proposed program verifier. We might add that some additional theoretical work is called for relative to parallel programs and operating systems. Perhaps the main deficiency of the proofs presented here is that a suspicious reader might not believe the proofs. In establishing the correctness of the programs it was required to carry out a nontrivial analysis of the assertions. For example, we refer the reader to the previous section where the subject of a reader not encountering any delay in access is discussed. Contrast this with a program that prints prime numbers, wherein the output assertion says that the nth item printed is the nth prime-if the proof process establishes the validity of the output assertion there is no doubt that the program is correct. It is thus clear that the operating system environment could benefit from a specification language that would provide a mathematical description of the behavior of an operating system. Also some additional work is needed in understanding the impact of structured programming on the proof of operating systems. We would expect that structured programming techniques would reduce the number of assertion points and the number of paths that must be verified. * See London16 for a review of current and proposed program proving systems. Application of Program-Proving Techniques ACKNOWLEDGMENTS The author wishes to thank Ralph London for many stimulating discussions on program proving and operating systems and for providing a copy of his proof of the two programs discussed in this paper. Peter Neumann, Bernard Elspas and Jack Goldberg read a preliminary version of the manuscript. Two referees provide some extremely perceptive comments. 
14 15 16 REFERENCES 1 E W DIJKSTRA The structure of THE multiprogramming system Comm ACM 11 5 pp 341-346 May 1968 2 P J COURTOIS F HEYMANS D L PARNASS Concurrent control with "READERS" and WRITERS" Comm ACM 14 10 pp 667-668 October 1971 3 R W FLOYD Assigning meanings to programs In Mathematical Aspects of Computer Science J T Schwartz (ed) Vol 19 Am Math Soc pp 19-32 Providence Rhode Island 1967 4 P NAUR Proof of algorithms by general snapshots BIT 6 4 pp 310-316 1966 5 Z MANNA The correctness of programs J Computer and System Sciences 3 2 pp 119-127 May 1969 6 R L LONDON Computer programs can be proved correct In Proc 4th Systems Symposium-Formal Systems and N onnumerical Problem Solving by Computer R B Banerji and M D Mesarovic (eds) pp 281-302 Springer Verlag New York 1970 7 R L LONDON Proof of algorithms, a new kind of certification (Certification of Algorithm 245, TREESORT 3) Comm ACM 136 pp 371-373 June 1970 8 R L LONDON Correctness of two compilers for a LISP subset Stanford Artificial Intelligence Project AIM-151 Stanford California October 1971 9 B ELSPAS K N LEVITT R J WALDINGER A WAKSMAN An assessment of techniques for proving program correctness ACM Computing Surveys 4 2 pp 97-147 June 1972 10 E ASHCROFT Z MANNA Formalization of properties of parallel programs Stanford Artificial Intelligence Project AIM-110 Stanford California February 1970 11 A N HABERMANN Synchronization of communicating processes Comm ACM 153 pp 177-184 March 1970 12 J C KING A program verifier PhD Thesis Carnegie-Mellon University Pittsburgh Pennsylvania 1969 13 D I GOOD Toward a man-machine system for proving program correctness 17 45 PhD Thesis University of Wisconsin Madison Wisconsin 1970 B ELSPAS M W GREEN K N LEVITT R J WALDINGER Research in interactive program-proving technique Stanford Research Institute Menlo Park California May 1972 S PATIL Limitations and capabilities of Dijkstra's semaphore primitives for coordination among processes MIT Project MAC Cambridge Massachusetts February 1971 ' R L LONDON The current status of proving programs correct Proc 1972 ACM Conference pp 39-46 August 1972 R C HOLT Comments on the prevention of system deadlocks Comm ACM 14 1 pp 36-38 January 1971 APPENDIX Proof of Courtois problem 2 Figure 6 displays the flow chart of the second control problem of Courtois et al. 2 The intent of this program is (1) similar to problem 1 in that the device can be shared by one or more readers, but a writer is to be granted exclusive access; (2) if no writers are on the device or waiting for access, a requesting reader is to be granted immediate access to the device; and (3) if one or more writers are deferred, then a writer is to be granted access before any reader that might subsequently request access. As we show below, a formal statement of the priorities can be expressed in terms of the variables of Figure 6. Also, as in problem 1, the intent of the program is not quite achieved relative to the receiving of requests at the entry points of the reader and writer sections. It is noted that the program contains semaphores L, M, N, Q, S, all of which have initial value 1, and "visible" integer variables RS, RD, RC, WS, WD, WC, all of which have initial value o. In addition, as in problem 1, there are the variables associated with the various P and V operations. As in problem 1, the V operators, with the exception of VeL) and those at points @ and @ , embody both the SPLIT and CHOOSE nodes; VeL) has only the SPLIT node, and the final V's have only CHOOSE nodes. 
The dotted control lines indicate paths that can be shown to be impossible. The numbered points on the flow chart are suitable for assertion placement in that each loop is cut by at least one assertion and all commonly referenced variables are contained within critical sections. There are several approaches toward deriving the assertions, but once again the most sensible one involves case analysis. 46 Fall Joint Computer Conference, 1972 WRITER READER 0!® P(L)- ---i® P(S~)_.:... _ _ _ _ _ _ _ _ _ _ _ _ _ _- - , l.. ------ .,I RS +- RS + 1 ~ 0 r-I-r------:::rl ® I RC :tc I! I 1~ ® P~N).@.. @! WC +- wc + 1 ~ TEST: WC = Yes 1~ N0l.. II ...--_ _No\,_=_ 4------------, 4-------, I P(O)-=- II I I + 1 TEST: RC = 01 II I I I I P(M)- : I I I I l I RD +- RD + 1 l L----V(M) RS +- t- P(O)- ~l + WD+-WD+1 (DEVICE)@ --V(L) • ~ WD +- WD - 1 (1) ~® P(M)~ (DEVICE) L-- _-===vtO) ~t® ~@ P(N)- @t:=:======!=; RD +- RD - 1 + WC+-WC-1 + 0--, Yes TEST: WC RC +- RC - 1 Nol .. 11- +@ ~ ~ I + V(N) vtS)------.l.--J TEST: RC = P(S)-=-+ WS+-WS+1 I! 1 @ 1_, I CI describes the domain of the individual variables and is common to all assertions for the program. It was convenient to decompose the second disjunctive term into two disjunctive terms, bl, b2, corresponding to the reader processing Q and the reader not processing Q. A similar case analysis for the al term is embedded in the conjunctive terms. Note that, as in problem 1, the prefixes W, R refer to writer and reader and the suffixes 1 and 2 refer to the upper and lower critical sections. We will not burden the reader of the paper with a listing of the assertions at all points or with a proof of the various paths; the proof is quite similar to that illustrated for problem 1. However, it is of interest to abstract from qi sufficient information to prove the program's intent. For a discussion of deadlock the reader is referred to Reference 14. As with problem 1 the decision concerning whether a requesting reader or a requesting writer gains access to the device is based on which one arrives first at the corresponding P (8) operation-not on arrival time of the readers and writers at the corresponding section entry points. This point is discussed in more detail below: Yes 0---, = WS+-WS-1 -I V(O)-:..-- - - - - ' 1 1 V!S)--_J 1_ ...._~I V(N)=========~ L...-_ _ V(M) l@ TA_710582-38 Figure 6-Flow chart representation of Courtois problem 2 From the view of control at point CD, we have derived the assertion qi of the form qi = CI/\ (al V [a2/\ (b i V b2) ]) , wherein al corresponds to a writer not processing S, i.e., WS =0, and [a2/\ (biV b2) ] corresponds to a 'writer processing S, i.e., WS = 1. The assertionql listed in Table III reflects this case analysis: The global assertion (1) The assertions indicate that any number of readers can share the device provided no writers are on, since if WD = 0, then from al we see there are no constraints on RD. The assertions indicate that at most one writer is on the device because from observation of both al and a2 we note that WD = or 1. (2) The assertions indicate, as follows, that a reader's access to the device is not delayed provided no writers are processing S or are on the device, and provided no writers are pending on Q or S. The term al indicates that if WS = WD = 0, i.e., no writers are processing S or on the device, and if WPENQ= WPENS=O, i.e., no writers are pending on Q or S, then RPEN8=RPENQ=0, ° TABLE III-Main Assertion for Courtois Problem 2 Global: (All. 
variables E I)I\(L~l)/\ (M ~l)/\ (N ~l)/\ (Q ~l)/\ (S~l)/\(RC~O)/\ (RD~O)/\ (WC~O)/\ (WD ~O)/\ (RPENS~O) /\(WPENS~O)/\(RPENQ~O)/\(WPENQ~O)/\(RPENL~O)/\(RPENMI~O)/\(RPENM2]~0)/\(WPENNl~0)/\(WPENN2~0) /\(RS~O)I\(WS~O) ql: (Writer not processing S) (RS=u( -S+l»/\(WS=O)!\(WD=O)/\(WC=u(WC»/\(WPENQ=O)/\(RPENQ=O)/\(RPENS=O)/\(u(WPENS)=u(S)-S) /\ (RD = RC)/\ (u(Q) =u(l-RC»/\ (WPENS = WC)/\(WPENNI =u(N)-N/\ (WPENN2 =0)/\ (RPENL=u(L) -L)/\(u(L) =u(S»/\ (RPENMI +RPENM2 =u(M)- M)/\ (RPENMI ~RD) (Writer processing S) {(WS=l)/\(RS=O)/\(WPENS=O)/\(RPENS= -S)/\(S= -u( -S»A (RPENQ =0)/\ (WPENQ =u(Q)-Q)/\(WPENNl + WPENN2 = u(N) - N)/\ (RC =RD)/\ (RPENL = u(L) - L)/\ (RPENMI = 0)/\ (RPENM2 = u(M) - M)/\ (L:::; u(S + 1» }/\ {[(Q :::;0) /\(RC~l)/\(WD=O)/\(WC= -Q)/\(WPENN2=0)]V[(RC={)/\(M=1)/\(WC=WD+WPENNl+WPENQ)/\(WD=u(WD» /\ (WD ~u(WPENQ»]1 Application of Program-Proving Techniques indicating that no reader is deferred by S or Q from access to the device. The issue of writer priority will be handled by applying case analysis to ql. • RPENQ is always 0; thus a V(Q) operation can only grant access to a deferred writer, never to a reader. • RS is 0 or 1; thus, at most, one reader is processing S. If RS = 1, then RPENS = 0 and WPENS = 0 or 1. This indicates that if a reader is processing S, the subsequent YeS) operation can only signal a deferred writer. • If WS = 1 then WPENS = 0, and there are no constraints on WPENQ. This indicates that all deferred writers are pending on Q (or N as discussed below), and since RPENQ=O a writer must get access to the device either immediately if RD= WD=O, or when the next V(Q) is performed by either a writer or a reader. As we mentioned above, the issue of granting access 47 to a writer or a reader is determined by the arrival time at peS). If this is indeed the intent of the program, then the above discussion serves to prove the correctness of the program. However, there are several important exceptions that deserve discussion. For example, while a writer is pending on S, all subsequent requesting \\-Titers will be deferred by N . Now these writers should be granted access to the device before any requesting readers receive it, which will be the situation under "normal" timing conditions. The deferred writer, at point @, will be scheduled by a reader doing V(S), in which case the writer will perform YeN) and in turn will schedule a deferred writer. These previously deferred writers will not get blocked by S but will pass to P (Q). Of the readers requesting access, one will be blocked by S and the remainder by L. The only way this scheduling would not occur as stated would be if the deferred writer at point @ passed through the \\-Titer section and performed a YeS) operation, thus scheduling a deferred reader before a writer processing the upper critical section could get through the first two instructions. Exact calculation of computer network reliability by E. HANSLER IBM Research Laboratory Ruschlikon, Switzerland G. K. McAULIFFE IBM Corporation Dublin, Ireland and R. S. WILKOV IBM Corporation Armonk, N ew York network fail with the same probability q and each of the b links fail with the same probability p, then Pees, t] is approximately given by INTRODUCTION The exact calculation of the reliability of the communication paths between any pair of nodes in a distributed computer network has not been feasible for large networks. Consequently, many reliability criteria have been suggested based on approximate calculations of network reliability. 
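The cost of exactness is easy to make concrete. The sketch below computes the terminal reliability between two nodes of a small invented network by enumerating all 2^b link-failure states; the network, the failure probability and the function names are illustrative only, and the exponential growth of this enumeration is exactly what forces the approximate criteria discussed next.

```python
from itertools import product

def terminal_reliability(links, s, t, p):
    """Exact probability of an operative s-t path when every link fails
    independently with probability p (nodes taken as perfectly reliable)."""
    total = 0.0
    for state in product((True, False), repeat=len(links)):   # all 2**b link states
        weight = 1.0
        for up in state:
            weight *= (1 - p) if up else p
        operative = [e for e, up in zip(links, state) if up]
        reached, frontier = {s}, [s]                # search over operative links only
        while frontier:
            u = frontier.pop()
            for a, b in operative:
                v = b if a == u else a if b == u else None
                if v is not None and v not in reached:
                    reached.add(v)
                    frontier.append(v)
        if t in reached:
            total += weight
    return total

# A 5-link "bridge" network, invented for illustration; the 2**b enumeration is
# the exhaustive examination that becomes hopeless for large networks.
bridge = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4)]
print(terminal_reliability(bridge, s=1, t=4, p=0.1))
```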
For a thorough treatment of these criteria, the reader is referred to the book and survey paper by Frank and Frisch1,2 and the recent survey paper by Wilkov.3 Making use of the analogy between distributed computer networks and linear graphs, it is noted that a network is said to be connected if there is at least one path between every pair of nodes. A (minimal) set of links in a network whose failure disconnects it is called a (prime) link cutset, and a (minimal) set of nodes with the same property is called a (prime) node cutset. If a node has failed, it is assumed that all of the links incident to that node have also failed. A cutset with respect to a specific pair of nodes ns and nt in a connected network, sometimes called an s-t cut, is such that its removal destroys all paths between nodes ns and nt.

The exact calculation of Pe[s, t], the probability of successful communication between any pair of operative computer centers ns and nt, requires the examination of all paths in the network between nodes ns and nt. More specifically, if each of the n nodes in any given network fails with the same probability q and each of the b links fails with the same probability p, then Pe[s, t] is approximately given by

Pe[s, t] = \sum_{i=0}^{b} A_{s,t}(i) (1-p)^i p^{b-i},   p >> q.   (1)

In Eq. (1), A_{s,t}(i) is the number of combinations of i links such that if only they are operative, there is at least one communication path between nodes ns and nt. On the other hand, the calculation of the probability Pf[s, t] of a communication failure between nodes ns and nt requires the examination of all s-t cuts. For specified values of p or q, Pf[s, t] is approximately given by

Pf[s, t] = \sum_{i=0}^{b} C_{s,t}(i) p^i (1-p)^{b-i},   p >> q.   (2)

For q >> p, a similar expression can be given with the link-cut coefficients replaced by their node-cut counterparts; the two coefficients denote the total number of link and node s-t cuts of size i, respectively. The enumeration of all paths or cutsets between any pair of nodes ns and nt is not computationally possible for very large networks.

RELIABILITY APPROXIMATION BASED ON CUTSET ENUMERATION

For any network G of b links and n nodes, it is easily shown that the order of the number of cutsets is 2^{n-2}
Letting X:,t(m) and X:.t(m) denote the number of prime node and edge s-t cuts of size m, Xn (m) = Maxs.tX:.t(m) and Xe(m) =Maxs.tX:,t(m) have been proposed4 as computer network reliability measures. These measures Xn(m) and Xe(m) denote the maximum number of prime node and edge cutsets of size m with respect to any pair of nodes. A maximally reliable network is such that Xn(m) and Xe(m) are as small as possible for small (large) values of m when the probability of node or link failure is small (large). In the calculation of Xn(m) and Xe(m) for any given network, all node pairs need not be considered if all nodes or links have the same probability of failure. It has been shown5 that in order to calculate Xn(m) and Xe(m), one need only consider those node pairs whose distance (number of links on a shortest route between them) is as large as possible. For a specified pair of nodes n s, nt, X:. t(m) can be calculated for all values of m using a procedure given by Jensen and Bellmore. 6 Their procedure enumerates all prime link cutsets between any specified pair of nodes in a non-oriented network (one consisting only of full or half duplex links). It requires the storage of a large binary tree with one terminal node for each prime cutset. Although these cutsets are not mutually exclusive events, it has been suggested 6 that Eq. (3) be approximated by N Pr[s, tJ ~ ~ P[Q!. tJ. (4) i=O However, it is shown in the following section that no additonal computation time is required to actually compute Pr[s, tJ exactly. EXACT CALCULATION OF COMPUTER NETWORK RELIABILITY A simple procedure is described below to iteratively calculate a minimal set of mutually exclusive events containing all prime link s-t cuts. This procedure starts with the prime cutset consisting of the link incident to node nt. Subsequently, these links are re-connected in all combinations and we then cut the minimal set of links adjacent to these that lie on a path between node ns and nt, assuming that the network contains no pendant nodes (nodes with only one incident link). The link replacements are iterated until the set of links connected to node ns are reached. The procedure is easily extended to provide for node cutsets as well and requires a very small amount of storage since each event is generated from the previous one. PieS, tJ is obtained by accumulating the probabilities of each of the mutually exclusive events. Procedure I 1. Initialization Let: N be the set of all nodes except nodes n s • C be the set of all links not incident to node ns. M1 = {ns} F1 be the set of links incident to both ns and nt Sl be the set of links incident to ns but not nt b1 be a binary number consisting of only I Sl ones i=l I 2. Let: Ti be a subset of Si consisting of those elements in Si for which the corresponding digit in bi is unity. M i +1 be a subset of N consisting of nodes incident to the links in T i. N = N-Mi+l' Fi+l be a subset of C consisting of links incident to nt and adjacent to any member of T i • Si+l be a subset of C consisting of links incident to nodes in N other than nt and adjacent to any member of T i • C C- (Si+lUFi+1). Exact Calculation of Computer Network Reliability 3. If Si+I¢0, then let: bi+l be a binary number with I Si+l I ones i = i+1 Go to step 2 Otherwise, let: Ti+I=0 HI CS= U [Fk U1\U (Sk-Tk)], k=l where CS is a modified cutset and Tk indicates that the links in set Tk are connected. 4. 
Let: C=CUFi+IUSiH' N=NUMi+1 bi = bi-l (modulo 2) If bi B ij • A flow-chart for determining whether a state ought to be saved after task i is finished and if task j is called next is shown in Figure 2. If S = H + M, then it is possible to implement rollback and to allow recovery from one error by means of rollback. The reliability in this case is the probability of no error in T+H time units (in which case no rollback is necessary) plus the probability of exactly one error in T + H units followed by a period of M error free units in which recovery is taking place. R(H+M) = e-a(T+H) [a(T+H) ]te-a(T+H+M) += -:--------- 1! By the same argument, if S = H +2M then two error recoveries are possible and R(H+2M)=R(H+M)+ [a (T +H +M) J2e -a(T+H+2M) , 2. In general R(H+nM) =R(H+(n-1)M) + { aCT + H + (n -1) MJ} ne-a(T+H+nM) n., for n=2, 3, ... If we are considering delaying the time required to complete the mission by S units we get the Software Reliability Efficiency index to be: SRE= R(S) -R(O) S Computing the software reliability efficiency index Let T be the time required to complete the program if there is no error, and without implementing a rollback method. Let H be the overhead incurred by implementing a rollback procedure. H can be easily computed for an arbitrary program as shown in Reference 8. Recollect that the rollback procedure is designed so that the maximum recovery time will not exceed a given value M. If the mission is completed in T+S units rather than T units a "lateness penalty" is incurred which gets larger as S increases. We shall find the reliability of a system with rollback as a function of S, the amount of "lateness" permitted. We shall assume that failures occur according to the exponential failure law, and the mean time between failures is l/a. If S = 0 then the program must finish in T time units without error. The probability of no error in T time units is e-aT • Letting R(S) be the reliability, defined as the probability of completing a successful mission, we have: R(O) =e-aT Note that in this analysis undetected and permanent errors were ignored. They can be included quite simply. Let Q(S) be the probability of the event that there is no undetected or permanent error in S units and let it be independent of other events. Then we have SRE= Q(S) ·[R(S) -R(O) ] . S I nstructional retrial If an error is detected while the processor is executing an instruction, the instruction could be retried, if its operands have not already been modified. This technique is an elementary form of rollback: recovery time never exceeds the execution time of an instruction, and overhead is negligible. However, there is a probability that an error will persist even after instruction retry. Let this probability be Q. The SRE for this technique can be computed in a manner identical to that for rollback and has the same form. The SRE for instruction retrial will in general be higher than that for rollback. Framework for Hardware-Software Tradeoffs 59 Deadlock prevention Discussion INPUT --_1. . ___ CO_",_P_UT_E_R_ _ _: - _........ OUTPUT Prevention of deadlocks is an important aspect of overall system reliability. Deadlocks may arise when procedures refuse to give up the resources assigned to them, and when some procedures demand more resources from the system than the system has left unassigned. Consider a system with one unit each of two resources A and B, and two procedures I and II. Now suppose procedure I is currently assigned the unit of resource A while II is assigned B. 
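The rollback reliability recurrence and the two forms of the SRE index developed above translate directly into a few lines of code. The sketch below is only a numerical illustration of those formulas; the failure rate, mission time, overhead and recovery-segment length at the bottom are invented, and Q defaults to 1 (no undetected or permanent errors), matching the first form of the index.

```python
import math

def rollback_reliability(a, T, H, M, n):
    """R(H + n*M) from the recurrence above: mission success probability when at
    most n rollback recoveries, each of length at most M, are permitted.
    a = failure rate, T = error-free run time, H = rollback overhead."""
    R = math.exp(-a * (T + H)) + a * (T + H) * math.exp(-a * (T + H + M))   # R(H+M)
    for k in range(2, n + 1):
        R += (a * (T + H + (k - 1) * M)) ** k * \
             math.exp(-a * (T + H + k * M)) / math.factorial(k)
    return R

def sre(a, T, H, M, n, Q=lambda s: 1.0):
    """SRE = Q(S) * [R(S) - R(0)] / S with S = H + n*M; Q defaults to ignoring
    undetected and permanent errors, as in the first form of the index."""
    S = H + n * M
    return Q(S) * (rollback_reliability(a, T, H, M, n) - math.exp(-a * T)) / S

# Invented numbers: 100-hour mission, 1000-hour mean time between failures,
# 2 hours of rollback overhead, recovery segments of at most 5 hours.
for n in (1, 2, 3):
    print(n, sre(a=1e-3, T=100.0, H=2.0, M=5.0, n=n))
```

Because the added terms shrink factorially while the lateness S grows linearly, the index eventually falls off as more recoveries are allowed.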
Deadlock prevention

Discussion

Prevention of deadlocks is an important aspect of overall system reliability. Deadlocks may arise when procedures refuse to give up the resources assigned to them, and when some procedures demand more resources from the system than the system has left unassigned. Consider a system with one unit each of two resources A and B, and two procedures I and II. Now suppose procedure I is currently assigned the unit of resource A while II is assigned B. Then if procedure I demands B and II demands A, the system will be in a deadlock: neither procedure can continue without the resources already assigned to the other.

The hardware approach to this problem is to buy sufficient resources so that requests can be satisfied on demand. Habermann and others6,7 have discussed methods for preventing deadlocks without purchasing additional resources. In these methods sufficient resources are always kept in reserve to prevent the occurrence of deadlock. This may entail users being (temporarily) refused resources requested, even though there are unassigned resources available. Keeping resources in reserve also implies that resource utilization is (at least temporarily) decreased. An alternative approach is to allocate resources in such a manner that, even though it is possible that deadlocks might arise, it is very improbable that such a situation could occur. The tradeoff here is between the probability of deadlock on the one hand and resource utilization (or throughput) on the other. The tradeoff is expressed in terms of the software reliability efficiency index.

Determining the software reliability efficiency index

The probability P of a deadlock while the mission is in progress and the time T required to complete the mission (assuming no deadlock) using a scheme where resources are granted on request are determined through simulation. The time (T+H) required to complete the mission using a deadlock prevention scheme is also determined by means of simulation. If Q(L) is the probability that no malfunctions other than deadlock arise in L time units, then assuming independence, we have:

SRE = [Q(T+H) - Q(T) · (1-P)] / H

At this time we know of no way of computing H and P analytically.

Summary of software methods

Different methods for improving the overall reliability of a system using software have been discussed. The software reliability efficiency index was suggested as an aid in evaluating software methods. Techniques for computing SRE were discussed. Similar techniques can be used for computing SRE for other software methods.

HARDWARE METHODS

Triple modulo redundancy

Discussion

Triple Modulo Redundancy (TMR) was one of the earliest methods1 suggested for obtaining a reliable system from less reliable components. The system output (Figure 3) is the majority of three identical components. If only one of the components is in error, the system output will not be in error, since the majority of components will not be in error. Thus, the system can tolerate errors in any one component; note that these errors may be transient or permanent. In this discussion we discuss only permanent errors.

[Figure 3a-Simplex configuration (input, computer, output). Figure 3b-System configuration (system input, system, system output)]
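A quick way to see the two-out-of-three behaviour just described is to simulate it. The sketch below is a Monte Carlo estimate, not the analysis used in the paper: component and vote-taker failure probabilities are invented, failures are taken to be permanent, and the estimate can be checked against the closed-form expression derived in the next subsection.

```python
import random

def tmr_success_rate(p_cycle, v_mission, n_cycles, trials=100_000):
    """Monte Carlo estimate of the probability that three identical components
    plus a vote-taker deliver a correct majority output for a whole mission of
    n_cycles discrete steps.  p_cycle is the per-cycle failure probability of one
    component (failures permanent); v_mission is the probability that the
    vote-taker fails at some point during the mission."""
    good_missions = 0
    for _ in range(trials):
        alive = [True, True, True]
        ok = random.random() >= v_mission          # the vote-taker must survive
        for _ in range(n_cycles):
            for i in range(3):
                if alive[i] and random.random() < p_cycle:
                    alive[i] = False               # permanent component failure
            if sum(alive) < 2:                     # majority lost
                ok = False
                break
        good_missions += ok
    return good_missions / trials

# Invented parameters: a 1000-cycle mission, 1e-4 per-cycle failure probability.
print(tmr_success_rate(p_cycle=1e-4, v_mission=1e-3, n_cycles=1000))
```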
Computing the hardware reliability efficiency index

Let P be the probability that a permanent malfunction occurs in a given component before the mission is completed. If failures obey an exponential law, and if the average time to a failure is 1/a, then P = 1 - e^{-aT}, where T is the time required to complete the mission. If the system is a discrete transition system (such as a computer system), then the time required to complete the mission can be expressed as N cycles (iterations), where computation proceeds in discrete steps called cycles. If the probability of failure on any cycle is p, independent of other cycles, then

P = 1 - (1-p)^N.

Let v be the probability of a malfunction in the vote-taker before the mission is complete, independent of other events. The reliability R of a TMR system is the probability that at least two components and the vote-taker do not fail for the duration of the mission:

R = [(1-P)^3 + 3(1-P)^2 · P] · (1-v)

If C is the cost of each component, and D the cost of the vote-taker, the hardware reliability efficiency index is:

HRE = {[(1-P)^3 + 3(1-P)^2 · P] · (1-v) - (1-P)} / (2C + D)

Transient errors can also be included quite easily in HRE.

Hybrid system

Discussion

Mathur and Avizienis2 discuss an ingenious method of obtaining reliability by using TMR and spares; see Figure 4. The spares are not powered-on and will be referred to as "inactive" components. If at any point in time, one of the three active components disagrees with the majority, the component in the minority is switched out and replaced by a spare. The spare must be powered-up and loaded; one method of loading the component is to use rollback and load the component with the last saved error-free state, and begin computation from that point. If at most one component fails during a cycle and if the vote-taker is error-free, this system is fail-safe until all the spares are used up, i.e., the system output will not be in error.

[Figure 4-Hybrid system (5, 3): three active units, spare units, and a vote-taker]

Consider a comparison of a system with three active units and two spares with another system which has five active units. If at most one unit can fail at a time then the majority is always right and the system with three active units is at least as good as a system with five active units (since a majority of two active units is as right as a majority of four). Thus if at most one unit fails at a time, the number of active units need never exceed three; additional units should be kept as spares. Of course in digital computer systems where computation proceeds in discrete steps such as cycles, iterations, instruction-executions, task-executions, etc., it is possible, though improbable, that more than one unit may fail in a single step. In this case, an analysis which assumes that at most one active unit can fail at a time is an approximation to the real problem.

Computation of the hardware reliability efficiency index

Mathur and Avizienis (op. cit.) assume that malfunctions occur according to an exponential failure law. A consequence of this assumption is that at most one unit can fail at a given instant, which in turn implies that the majority is always right. Now consider what happens if the improbable event does occur and the majority is in error and the minority is correct. The correct minority unit will be switched out to be replaced by a spare which is powered up and initialized. A comparison with the other two active units will show that the powered-on spare is in the minority, and it will in turn be switched out to be replaced by yet another spare and so on. Eventually all the spares will be used up and the system will crash. Thus even though the probability of failure of two units in one iteration is indeed small, the consequence of this improbable event is catastrophic. Hence we feel that in calculating SRE it is important to back up the Mathur-Avizienis study of this ingenious method with an analysis that does not assume that simultaneous failures never occur.

In this analysis we will assume that computation proceeds in discrete steps called tasks; a task may consist of several instructions or a single instruction. Key variables of the active units are compared at the end of a task completion, and the minority element, if any, is switched out. Let the probability of failure of a unit on any step of the computation be P, independent of other units and earlier events. A discrete-state, discrete-transition Markov process may be used to model this system. A Markov state diagram is shown in Figure 5. If the system is in state F, then a system failure has already occurred. The reliability of the system is the probability that the system is not in state F at the Nth iteration, where N is the number of computation steps required in the mission. The reliability can be computed analytically from the Jordan normal form. A curve of reliability as a function of N is shown in Figure 8. Let RH be the reliability of the hybrid system, C the cost of each unit and D the cost of the vote-taker. The hardware reliability efficiency index with two spares is then:

HRE = [RH - (1-P)^N] / (4C + D)

[Figure 5-Markov diagram of a hybrid configuration (states indexed by the number of active and passive units)]

Self-purging system

Discussion

Consider a self-purging system shown in Figure 6. Initially there are five active units and no spares. If at any instant the vote-taker detects a disagreement among the active units, the units whose outputs are in the minority are switched out, leaving three active, error-free units. If the failure rates for active and passive units are the same, the self-purging system will tolerate two simultaneous failures, which may be catastrophic for the hybrid system.

[Figure 6-Self-purging system with 5 units]

Computation of the hardware reliability efficiency index

In this analysis we shall assume that computation proceeds in discrete steps, as in the analysis for the hybrid system. Let P be the probability of failure of a unit on a computation step, independent of other units and earlier steps. A Markov state diagram for this process is shown in Figure 7. As in the hybrid case the reliability of the system is the probability that the system is not in state F on the Nth computation step. A curve showing the reliability of this system as a function of N is shown in Figure 8. Let RS be the reliability of a self-purging system with five active units initially. Then

HRE = [RS - (1-P)^N] / (4C + D)

If the cost of power supplies is included, HRE for the hybrid system is larger than that for self-purging.

[Figure 7-Markov diagram of a self-purging configuration (states indexed by the number of fault-free processors)]

Summary of hardware methods

TMR, hybrid, and a system called a self-purging system were discussed. Some of the problems of approximating these systems as continuous transition systems were analyzed. Techniques for obtaining the hardware reliability efficiency indices were presented. Similar techniques can be used for other hardware methods.

CONCLUSION

We have attempted to develop a set of simple indices which may be useful in comparing different techniques for achieving reliability. We feel that an important research and pedagogical problem is to develop a more comprehensive, sophisticated framework.
Models for rollback and discrete transition models for hybrid and self-purging systems were discussed briefly. Z;0~ 0- .0 o~ ~ «: ACKNOWLEDGMENT This research was supported in part by NSF grants GJ-35109 and GJ-492. REFERENCES o '" o 240 120 Figure 8 Time 1 J VON NEUMANN Probabilistic logics and the synthesis of reliable organisms from unreliable components Automata Studies p 43-98 Princeton University Press Princeton N J 1956 2 F P MATHUR A AVIZIENIS Reliability analysis and architecture of a; hybrid-redundant digital system: Generalized triple module redundancy with self-repair Proc Spring Joint Computer Conference 1970 Framework for Hardware-Software Tradeoffs 3 M BALL F H HARDIE Redundancy for better maintenance of computer systems Computer Design pp 50-52 January 1969 4 M BALL F H HARDIE Self-repair in a T M R computer Computer Design pp 54-57 February 1969 5 A COWAN Hardware-software tradeojJs in the design of reliable computers Master's thesis in the Department of Computer Sciences University of Texas December 1971 6 A N HABERMANN Prevention of system deadlocks Comm ACM Vol 12 No 7 July 1969 7 J HOWARD The coordination of multiple processes in computer operating systems 63 Dissertation Computer Sciences Department University of Texas at Austin 1970 8 K M CHANDY C V RAMAMOORTHY Optimal rollback IEEE-C Vol C-21 No 6 pp 546-556 June 1972 9 G OPPENHEIMER K P CLANCY Considerations of software protection and recovery from hardware failures Proc FJCC 1968 AFIPS pp 29-37 10 A N HIGGINS Error recovery through programming Proc FJCC 1968 AFIPS pp 39-43 11 A N HABERMANN On the harmonious cooperation of abstract machines Thesis Mathematics Department Technological U Eindhoven The Netherlands 1967 Automation of reliability evaluation procedures through CARE-The computer-aided reliability estimation program* by FRANCIS P. MATHUR University of Missouri Columbia, Missouri INTRODUCTION Unifying notation The large number of different redundancy schemes available to the designer of fault-tolerant systems, the number of parameters pertaining to .each scheme, and the large range of possible variations in each parameter seek automated procedures that. would enable the designer to rapidly model, simulate and analyze preliminary designs and through man-machine symbiosis arrive at optimal and balanced fault-tolerant systems under the constraints of the prospective application. Such an automated procedural tool which can model self-repair and fault-tolerant organizations, compute reliability theoretic functions, perform sensitivity analysis, compare competitive systems with respect to various measures and facilitate report preparation by generating tables and graphs is implemented in the form of an on-line interactive computer program called CARE (for Computer-Aided Reliability Estimation). Essentially CARE consists of a repository of mathematical equations defining the various basic redundancy schemes. These equations, under program control, are then interrelated to generate the desired mathematical model to fit the architecture of the system under evaluation. The math model is then supplied with ground instances of its variables and then evaluated to generate values for the reliability theoretic functions applied to the model. The math models may be evaluated as a function of absolute mission time, normalized mission time, nonredundant system reliability, or any other system parameter that may be applicable. 
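CARE itself is a FORTRAN V program, so the fragment below is no more than a toy illustration, in Python, of the idea just described: a small repository of reliability equations that a driver evaluates as a function of absolute mission time. The closed forms used here (simplex, TMR, one ideal standby spare) are standard textbook expressions and not CARE's more general equations, which also carry dormancy, coverage, partitioning and series parameters.

```python
import math

# A toy "repository of equations": each entry maps a configuration name to a
# reliability function of the simplex reliability R.
repository = {
    "simplex":       lambda R: R,
    "TMR (3,0)":     lambda R: 3 * R**2 - 2 * R**3,
    "sparing (1,1)": lambda R: R * (1 - math.log(R)),   # one ideal standby spare
}

def evaluate(lam, t_max, steps):
    """Tabulate every model against absolute mission time T, taking the simplex
    unit to be exponential, R = exp(-lam * T)."""
    for k in range(steps + 1):
        t = t_max * k / steps
        R = math.exp(-lam * t)
        row = "  ".join(f"{name}={f(R):.4f}" for name, f in repository.items())
        print(f"T={t:5.2f}  {row}")

evaluate(lam=0.5, t_max=4.0, steps=8)   # failure rate and mission span are illustrative
```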
A unifying notation, developed to describe the various system configurations using selective, massive, or hybrid redundancy is illustrated in Figure 1. N refers to the number of replicas that are made massively redundant (NMR) ; S is the number of spare units; W refers to the number of cascaded units, i.e., the degree of partitioning; R( ) refers to the reliability of the system as characterized in the parentheses; TMR stands for triple modular redundant system (N =3); the NMR stand for N-tuple modular redundancy. A hybrid redundant system H(N, S, W) is said to have a reliability R(N, S, W). If the number of spares is S = 0, then the hybrid system reduces to a cascaded NMR system whose reliability expression is denoted by R(N, 0, W) ; in the case where there are no cascades, it reduces to R(N, 0,1), or more simply to R(NMR). Thus the term W may be elided if W = 1. The sparing system R (1, S) consists of one basic unit with S spares. Furthermore, the convention is used that R * indicates that the unreliability (1- Rv) due to the overhead required for restoration, detection, or switching has been taken into account e.g., R*(NMR) =Rv.R(NMR); if the asterisk is elided then it is assumed that the overhead has a negligible probability of failure. This proposed notation is extendable and can incorporate a number of functional parameters in addition to those shown here by enlarging the vector or lists of parameters within the parentheses, e.g., R (N, S, W, ... , X, Y, Z). Existing reliability programs Some reliability evaluation programs, known to the author, are the RCP, the RELAN, and the REL70. The RCpl,2 is a reliability computation package developed by Chelson (1967). This is a program which * The work presented here was carried out while the author was with the Jet Propulsion Laboratory, California Institute of Technology, and under Contract No. NAS7-100, sponsored by the National Aeronautics and Space Administration. 65 66 Fall Joint Computer Conference, 1972 NMR SYSTEMS r--~---------------------l : R(NrMRI\ I ., S=O , W=l \ / R(TiMRI~,: \ S=O' II W=l\~ ~ • , : R(N,O,W) \ R(3,O,W) \ I --------,-----------"t-.J SPARING SYSTEMS ts r-------, '! I I ; I =0 I, 1 : N =1 I S =0 \ --------.&..---_______ l. RH, S, WIl+--l R(N, S, WI W=l I \ I· , I W= 1 I :i R(3 S WI I ' , /---+ ,/ N = 3 lW = 1 repository of equations is extendable. Dummy routines are provided wherein new or more general equations may be placed as they are developed and become available to the fault-tolerant computing community. For example, the equation developed by Bouricius, et al., for standby-replacement systems embodying the parameters C and Q has been bodily incorporated into the equation repository of CARE. " / /' I 1. " I I L~E:..~ ___ J+--L -R(N, SI .. / R(3, SI /' I - - - - - - - - - - - _______ -J HYBRID SYSTEMS Figure l-Unifying notation can model a network of arbitrary series-parallel combinations of building blocks and analyzes the system reliability by means of probabilistic fault-trees. RE LAN3 is an interactive program developed by TIME/WARE and is offered on the Computer Sciences Corporation's INFONET network. RELAN like Rep models arbitrary series-parallel combinations but in addition allows a wide choice (any of 17 types) of failure distributions. RELAN has concise and easy to use input formats and provides elegant outputs such as plots and histograms. REL704 and its forerunner REL5 developed by Bouricius, et al., are interactive programs developed in APL/360. 
Unlike RCP and RELAN, REL70 is more adapted for evaluating systems other than series-parallel such as standby-replacement and triple modular redundancy. It offers a large number of system parameters, in particular C the coverage factor defined as the probability of recovering from a failure given that the failure exists and Q, the quota, which is the number of modules of the same type required to be operating concurrently. REL 70 is primarily oriented toward the exponential distribution though it does provide limited capabilities for evaluating reliability with respect to the Weibull distribution; its outputs are primarily tabular. Since APL is an interpretive language, REL is slow in operation; however, its designers have overcome the speed limitation by not programming the explicit reliability equations but approximate versions6 which are applicable to short missions by utilizing the approximation (l-exp( - AT» = AT for small values of AT. The CARE program is a general program for evaluating fault-tolerant systems, general in that its reliability theoretic functions do not pertain to anyone system or equation but to all equations contained in its repository and also to complex equations which may be formed by interrelating the basic equations. This CARE'S ENVIRONMENT, USERS AND AVAILABILITY CARE consists of some 4150 FORTRAN V statements and was developed on the UNIVAC 1108 under EXEC 8 (version lIe). The particular FORTRAN V compiler used was the Level 7E having the modified 2/3/4 argument ENCODE-DECODE commands. The amount of core required by the unsegmented CARE is 64K words. The software for graphical outputs is designed to operate in conjunction with the Stromberg Carlson 4020 plotter. The software enabling threedimensional projections, namely the Projectograph routines,7 are a proprietary item of Technology Service Corporation. In addition to the Jet Propulsion Laboratory, the originator, currently there are three other users of CARE, namelyNASA Langley Research Center (a FORTRAN II version operational on a CDC 3600), Ultrasystems Corp. (operational on a UNIVAC 1108 under EXEC II), and MIT Draper Laboratory. The CARE program, minus the Projectograph routines, has been submitted to COSMIC** and is available to interested parties from them along with users manuals. Its reference number at COSMIC is NPO-13086. CARE's repository of equations The equations residing in CARE, based on the exponential failure law, model the following basic fault-tolerant organizations: (1) Hybrid-redundant (N, S) systems. 8 •9 (a) NMR (N, 0) systems.lO (b) TMR (3,0) systems. 10 ( c) Cascaded or partitioned versions of the above systems. (d) Series string of the above systems. The equation representing the above family of ** Computer Software Management and Information Center, University of Georgia, Athens, Georgia 30601. The Computer-Aided Reliability Estimation Program systems is the following: l~K< 00 for K= 00 for R*(N, S) = [ R QIW 67 L --'(CQAT/W)iJwz --.--'-,-'-S 1. i=0 (3) TMR systems with probabilistic compensating failures. Io (a) Series string and cascaded versions of the above, The equation characterizing this system is: ~ R*(3, 0) = {RV[3R2IW -2R3IW +6P(1- P)RIIW (1- RIIW)2]} wz L (Kl+S)(_l _ 8-2 j={} RI/W s j+1 r _1)i+I}]. RV (4) Hybrid/simplex redundant (3, S)sim systems. ll (a) TMR/simplex systems,s (b) Series string and cascaded versions of the above. 
The general equation for this class of systems is the following: R(3, S)sim[T] 1 -1 ) =R3Rs 8 { 1+1·5 ( -2-8 R Rs J l~K< for = 1IW {RNIWR8 [1+(NK+I) ~ (_I)i-Z( (Kl+I) 1 i=1 J i=0 l~K< X(R~-l -1) (2K+:~~K+i)} for )] -1 00 i l RV }WZ and = (I.5)8+IR-R3 and S= 1 S>O and ",>0 ±(3AT)~+~-i i=I for 2K+~ and S>I l=O R81IWRlIW 8 i=I 8 (3K+ ') 8-1 (S) - II 'J L (-I)i t (~) f (i) -r=O X 00 +i) II (3K --, (S-1,), X[(I.5) i-I]-R3[(1.5)8+1-1] (2) Standby-sparing redundant (1, S) systems. a,10 for S>O and JL=O and (a) K-out-of-N systems,S (b) Simplex systems. (c) Series string and cascaded versions of the above. The general equation for the above is: R(l, S) = [RQIW {H E[~(l-R,'IW); X fi (QK+i) )}T" R*(3, S)sim=R v ·R(3, S)sim For the description of the above systems and their mathematical derivations, refer to the cited references. These equations are the most general representation of their systems parameterizing mission time, failure rates, dormancy factors, coverage, number of spares, number of multiplexed units, number of cascaded units, and number of identical systems in series. The definitions of these parameters reside in CARE and may be optionally requested by the user. More complex systems may be modeled by taking any of the above listed systems in series reliability with one another. 68 Fall Joint Computer Conference, 1972 TABLE I-Table of Abbreviations and Terms x = p, = U npowered failure rate = AI J1, = Dormancy factor K T XT R R S Powered failure rate Mission time Normalized mission time = Simplex reliability = Dormant reliability, exp( -p,T). = Number of spares n = (N-1)/2 where N is the total number of multiplexed units Q = Quota or number of identical units in simplex systems C = Coverage factor, Pr(recovery /failure) RV = Reliability of restoring organ or switching overhead Z = Number of identical systems in series W = N umber of cascaded or partitioned units P = Probability of unit failing to "zero" TMR = Triple modular redundancy TMRp = TMR system with probabilistic compensating failures (1, S) = Standby spare system (N , S) = Hybrid redundant system (3, S)sim = Hybrid/simplex redundant system MTF = Mean life R(MTF) = Reliability at the mean life = = Reliability theoretic functions The reliability equations in the repository may be evaluated as a function of absolute mission time (T), normalized mission time (AT), nonredundant system reliabili ty (R), or any other system parameter that may be applicable. The set of reliability theoretic functions defined in CARE are applicable to any of the equations in the repository. This independence of the equations from the functions to be applied to the equations impart generality to the program. Thus the equation repository may be upgraded without effecting the repertoire of functions. The various reliability theoretic functions useful in the evaluation of faulttolerant computing systems have been presented in Ref. 11, the measures of reliability have been defined, categorized into the domains of probabilistic measures and time measures and their effectiveness compared. Among the various measures of reliability that the user may request for computation are: the system mean-life, the reliability at the mean-life, gain in reliability over a simplex system or some other competitive system, the reliability improvement factor, and the mission time availability for some minimum tolerable mission reliability. 
Operational features Although CARE is primarily an interactive program, it may be run in batch mode if the user prespecifies the protocol explicitly. In the interactive mode CARE assumes minimum knowledge on the user's part. Default values are provided to many of the items that a user should normally supply. This safeguards the user and also makes usage simpler by providing logical default values to conventionally used parameters. Instructions provided by CARE are optional thus the experienced user can circumvent these and operate in fast mode. Definitions of reliability terms and abbreviations used in the program may be optionally requested. An optional "echo" feature that echoes user's responses back to the terminal is also provided .. A number of diagnostics and recovery features that save users from some common fatal errors are in the program. Model formulation-an example A typical problem submitted for CARE analysis may be the following: Given a simplex system with 8 equal modules which is made fault-tolerant by providing two standby spares for each module, where each module has a constant failure rate of 0.5 failures per year and where the spares have a dormancy factor of 10 and the applicable coverage factor being 0.99, it is required to evaluate the system survival probability in steps of 1/10 of a year for a maximum mission duration of 12 years. It is required that the system reliability be compared against the simplex or nonredundant system and that all these results be tabulated and also plotted. It is further required that the mean-life of the system as well as the reliability at the mean-life be computed. It is of interest to know the maximum mission duration that is possible while sustaining some fixed system reliability objective and to display the sensitivity of this mission duration with respect to variations in the tolerable mission reliability. I t is also required that the above analysis be carried out for the case where three standby spares are provided and these configurations of three and two spares be compared and the various comparative measures of reliability be evaluated and displayed. The above problem formulation is entered into CARE by stating that Equation 2 (which models standby spare systems) is required and the pertinent data (S=2,3; Z=8; K=10; T=12.0; LAMBDA =0.5; C=0.99; STEP=O.l) is inserted into CARE between the VARiable namelist delimiters $VAR ... $END. The above example illustrates the complexity of problems that may be posed to CARE, and the simplicity with which the specifications are entered. The reliability theoretic functions to be performed on the above specified system are acknowledged interactively by responding a YES or NOon the demand terminal to CARE's questions at the time it so requests. The Computer-Aided Reliability Estimation Program A PRIMITIVE SYSTEM: O,S), (N,S), (3,S)SIM OR TMRp ----m---rn-... -1:IDAN m- PARTITIONED PRIMITIVE SYSTEM (W = m). ~ ... --c:z::::::J--- SERIES - STRI NG OF A PRIMITIVE SYSTEM (Z =.i). 1 2 i .. ~~~~.~ L.ii%~.~ L _______ .J L ______ .J L ______ J AN m- PARTITIONED SERIES - STRING OF A PRIMITIVE SYSTEM (W =m, Z =2). ....._---,,,--_. . .1. ... -1'--_ _ --1'--_~_----J~ AN ARBITRARY SERIES-STRING OF m-PARTITIONED SERIES-STRING OF PRIMITIVE SYSTEMS. Figure 2-Formation of complex systems COMPLEX SYSTEMS The basic equations in CARE's repository define the primitive systems: (1, S), (N, S), (3, S)sim and TMRp. 
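As a concrete illustration of the simplest of these primitives, the (1, S) standby-sparing form, the sketch below evaluates the model-formulation example posed above (8 modules, 2 or 3 standby spares per module, lambda = 0.5 failures per year, a 12-year mission). It is deliberately simplified: switching is ideal, the spares are treated as completely dormant, and the dormancy factor K = 10 and the 0.99 coverage of the real CARE run are ignored, so the numbers overstate what CARE would report.

```python
import math

def standby_module(lam, T, spares):
    """One module backed by 'spares' standby units, assuming ideal switching,
    full coverage and completely dormant (non-failing) spares: the Poisson
    probability of at most 'spares' failures in time T."""
    x = lam * T
    return math.exp(-x) * sum(x**i / math.factorial(i) for i in range(spares + 1))

def system_reliability(lam, T, spares, modules):
    """All partitions must survive, i.e. series reliability of the modules."""
    return standby_module(lam, T, spares) ** modules

# The problem posed above: 8 modules, lambda = 0.5 failures/year, 12-year
# mission; printed every 3 years to keep the table short.
for tenths in range(0, 121, 30):
    T = tenths / 10.0
    print(f"T={T:4.1f}  simplex={math.exp(-0.5 * T * 8):.4f}  "
          f"S=2: {system_reliability(0.5, T, 2, 8):.4f}  "
          f"S=3: {system_reliability(0.5, T, 3, 8):.4f}")
```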
Equations representing more complex systems may be fabricated by combining the primitive systems in series reliability with one another as shown in Figure 2. The description of a complex system is entered by first enumerating the equation numbers of the primitive systems involved in namelist VARiable1. Thus "$VAR1; PROD = 1, 2; $END;" states that equation 1 and equation 2 are to be configured in series reliability. N ext, the parameter specifications for these equations are then entered using the namelist VARiable. The set of values for any parameter pertaining to a complex system is stored as a matrix, thus in the general case of PARAMETER (m, n) n refers to the equation involved m is the internal index for the set of values that will be attempted successively. For example, C(I, 2) = 1.0, 0.99 states that in equation 2 (the equation for standby-spares system) the value of the coverage factor should be taken to be 1.0 and having evaluated the complex system for this value the system is to be reconsidered with coverage factor being 0.99. 69 here was to be evaluated for the worst case dormancy factors K of 1 and infinity. On completing the evaluation of the above system, the effect of reducing coverage to 0.99 was to be reevaluated. Also the effect of increasing the number of spares to 3, as also the effect of increasing the module failure rates to their upper bound value of .0876 failures/year. All combinations of these modifications on the original system are to be considered. The mission time is 12 years and evaluations are to be made in steps of 1/10th of a year. The above desired computations are specified using the VAR namelist thus: $VAR; T=12.0; STEP=O.I; Z(I, 1) =1, Z(I,2)=8; C(1, 2) =1.0, 0.99; N(I,I)=3; S(I, 1) =2,3,S(I,2) =2,3;LAMBDA(I, 1) = .01752, .0876, LAMBDA(I, 2)=.01752, .0876; K(1, 1) = 1.0, INF, K(I, 2) = 1.0, INF; $END; (N ote the semicolons (;) denote carriage returns.) The ease and compactness with which complex systems can be specified in CARE is demonstrated by the above example. The reader will note the complex system configured in this example corresponds to a STAR-like system having eight functional units in standby-spare mode and a hard-core test-and-repair unit in Hybrid redundant mode (Figure 3). SOlVIE SIGNIFICANT RESULTS USING CARE Some significant results pertaining to the behavior of W partitioned NMR system (Figure 4) will now be presented. These results pertain to the behavior or reliability theoretic functions of an NMR system such as its mean life or mean time to first failure (MTF) and the reliability of the system at the mean life, R(MTF). The reliability theoretic system measure- Complex model formulation-an example It was required to evaluate a system consisting of 8 equally partitioned modules in a standby-spares (1, S) configuration having 2 spares· for each module. The 9th module was the hard-core of the system and was configured in a Hybrid redundant (3, S) system having 2 spares (S = 2). The coverage on the (1, S) system modules was to be initially considered to be 1.0. The lower bound on the failure rate A on all the modules had been evaluated to be .01752 failures/year on the basis of parts count. This complex system as specified 1 0 00000 DOD Figure 3-Configuration for an example of a STAR-like complex model 70 Fall Joint Computer Conference, 1972 1.00 ~ 0.60 ... o ... 
z - 0 :: 0.401---N= TMR, N=3 .125.06250.00 .03125 0.0 0.5 I 0.694 I 2.78 AT R(N,~O, W) vs AT AS A FUNCTION OF NAND W Figure 4-R(N ,0, W) vs AT as a function of Nand W reliability at the mean life, R(MTF)-is the reliability of the system computed for missions or time durations of length equal to the mean time to first failure of the system. The behavior of these functions were evaluated under the limiting conditions of the system parameters in order to establish system performance bounds. The results presented here have been both proven mathematicallylO and been verified by CARE analysis. Since it is well-known that mean-life (MTF) is not a credible measure of reliability performance (e.g., MTF of a simplex system is greater than the MTF of a TMR system!), another measure the reliability at the meanlife R(MTF) has been used to supplant MTF. 'This measure essentially uses a typical reliability estimate of the system. The typical reliability value being taken at a specific and hopefully a representative time of the system. This representative time is taken to be the time to first failure of the system, namely the MTF of the system. The foregoing is the rationale for choosing R(MTF) as a viable measure of system reliability. However, contrary to general belief this measure R(MTF) is not a good measure for partitioned NMR systems due to its asymptotic behavior as a function of the number of partitions W. It is proved in [10J that the reliability at MTF of a (3, 0, W) system in the limit as W becomes very large approaches the value exp ( -11"/4) asymptotically from below and that this bounding value is reached very rapidly, see Figure 5. The Computer-Aided Reliability Estimation Program 71 TABLE II--MTF and R(MTF) as a Function of W r-;··.. 1.00 (3,0, W) System W MTF R(MTF) o (Simplex) 1.0 0.368 0.402 exp( -11"/4) = 0.454 1 (TMR) co (3,0, co) 0.83 co 0.80 Some other results observed graphically in Figure 4 and the detailed mathematical proof of which are in [10J are summarized below. These results follow from the general reliability equation for a W partitioned NMR system, which is: R(N, 0, W) = (N) E [ J----------[A The construction of the submatrices H64 and A is done by an APL program3 given in the appendix with theory stated in Section III. The sub matrix 13) and drop the column if this seven digit vector has already appeared in a previously taken column. This guarantees that these columns along with the first 7 columns for check bits form a single error correcting code. This exercise was carried out using an APL computer program which generated a (104,90) and (172, 154) DEC codes which has separable SEC and can be shortened to handle data bit lengths 64 and 128. The codes are given in the Appendix. The DED capability is obtained by adding a check bit on the SEC code which makes a SEC-DED odd-weightcolumn-code. 5 The number in front of each column of H-matrix in the Appendix represents the cyclic position number in the full length code. These position numbers are used in the algebraic decoding algorithm4 in error correction process. SYSTEM IMPLEMENTATION There are at least three methods of generating the parity check matrix of a double error ·correcting code. The parity check matrix denoted by HI has Xi mod ml(x)ma(X) as its ith column (0 origin) where ml(x) and ma(X) are minimum functions of the field elements a and aa of GF (27). The parity check matrix H2 generated by the second method has the concatenated vector Xi mod ml(X), Xi mod ma(X) as its ith column. 
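All three constructions rest on the same elementary step: reducing successive powers x^i modulo a minimum polynomial over GF(2), which is also what the APL routine in the appendix appears to do with its cyclic shift and conditional exclusive-or. The sketch below shows that step in Python; the modulus m1(x) = x^7 + x^3 + 1 is an assumed primitive polynomial chosen only for illustration, and the analogous reduction by m3(x), the minimal polynomial of alpha^3, which supplies the other half of each H2 column, is not computed here.

```python
def gf2_columns(m_low, degree, n_cols):
    """Columns x^i mod m(x) over GF(2), i = 0 .. n_cols-1.  m_low holds the
    low-order coefficients [m0, m1, ..., m_{degree-1}] of the monic modulus
    m(x) = x^degree + m_{degree-1} x^{degree-1} + ... + m0; each column is the
    coefficient vector (c0, ..., c_{degree-1}) of x^i mod m(x)."""
    state = [1] + [0] * (degree - 1)              # x^0 = 1
    cols = []
    for _ in range(n_cols):
        cols.append(list(state))
        overflow = state[-1]                      # coefficient of x^(degree-1)
        state = [0] + state[:-1]                  # multiply by x
        if overflow:                              # reduce x^degree using m(x)
            state = [s ^ m for s, m in zip(state, m_low)]
    return cols

# Assumed modulus for illustration: m1(x) = 1 + x^3 + x^7.
m1_low = [1, 0, 0, 1, 0, 0, 0]
for i, col in enumerate(gf2_columns(m1_low, 7, 16)):
    print(f"{i:03d}", "".join(str(bit) for bit in col))
```

With this modulus the first seven columns come out as unit vectors, which mirrors the identity blocks visible at the head of the matrices listed in the appendix.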
The parity check matrix Ha generated by the third method has the concatenated vector Xi mod mi (x) , x3 i mod ml(x) as its ith column. The codes generated by thes~ three matrices are not only equivalent but also isomorphic. These three matrices possess different desirable properties. In particular, the matrix HI possess the property (1) for the adaptive correction scheme-presently under consideration. The firs t 14 columns of HI represent an identity matrix which corresponds to 14 independently-acting check digits. However, any 7 check bits as a group do not provide SEC capability which is the required property (2). The matrix H2 on the other hand can be divided into two parts where the first group of seven check bits, corresponding in the part column Xi mod ml (x), does provide SEC capability, however, the two groups of check bits do not act independently and hence are not separable. The matrix Ha behaves in the same manner as H2 except that the syndrome in Ha is easily decodable. 4 Let us use a simple example for illustration. Figure 1 shows a memory system which contains two basic memory units. Each unit has a (72, 64) SEC-DED code. The following is the parity check matrix for this simple system. [H64 IsJ cf> cf> H= cf> cf> [HS4 IsJ cf>] cf> (2) [ [A cf>J [A cf>J Is Where H64 is the first group of 7 columns of the matrix in the Appendix and an additional column is added to make it odd weight. The A-matrix is the second group of 7 columns of the matrix in the Appendix. Another column is added to these 15 columns to make the overall parity matrix odd weight. This means that the overall code has double error correction and triple error detection capability. The encoding follows directly from the H-matrix of Equation (2). The decoding is classified as follows: 1. Any single error in each memory unit can be corrected separately and simultaneously. 2. If a double error is detected in one of the memory units and no error indication in the other memory Adaptive Error Correction Scheme for Computer Memory System 85 words out of a group of m words is very small. Such adaptive error correction scheme more closely matches the requirements of modern computer memory systems and can be used very effectively for masking faults and reducing cost of maintenance. REFERENCES MEM MEM Check DECODER To CPU or Channel Figure 1 unit for the corresponding word, then this double error can be corrected by the additional 8 check bits. 1 J F KEELEY System/370-Reliability from a system viewpoint 1971 IEEE International Computer Society Conference September 22-24 1971 Boston Massachusetts 2 W W PETERSON Error correcting codes MIT Press 1961 3 A D F ALKOFF K ElVERSON APL/360 user's manual IBM Watson Research Center Yorktown Heights New York 1968 4 A M PATEL S J HONG Syndrome trapping technique for fast double-error correction Presented at the 1972 IEEE International Symposium on Information Theory IEEE Catalog No 72 CHO 578-S IT 1972 5 M Y HSIAO A class of optimal minimum odd-weight-column SEC-DED codes IBM J of Res & Dev Vol 14 No 4 pp 395-402 July 1970 APPENDIX A-CODE GENERATION PROGRAl\1 The decoding of the double errors as stated in class 2 needs the data bits portion of both memory units. The data bit portion for the error free memory is required to cancel its effects in the last 8 syndrome bits. Therefore, the double error correction can be done as that given in Reference 4. 
APL 360 VSECDEC[O]V SUl\1MARY An adaptive ECC scheme with SEC-DED feature can be expanded to DEC feature in a memory system containing several memory units environment. The normal memory cycle time remains unaffected, except in the presence of a double error when extra decoding time is required for the double error correction procedure. Other major advantage is cost savings in terms of number of check bits required. If the memory system contains m basic memory units then 8(m-l) check bits can be saved by using this scheme. The number m is chosen such that the probability of double-errors in two [1] [21 [3] [4) [5] [6) V SEC DEC C M+-1+pG }l+2*(!tf+2 ) V+,~1p 0 Z+l'JpO (j+(MpO),l I+O [7] V+MpQ+(-l4>q)~(Gx.q[M-l]) ~9+(I>M)x«X+(2i«M+2)+V»)€Z) [8] [9) '0123456789*'[(10 10 10 TI).10,V] Z[I]+X I+I+1 [10] [11] [12] [13] [14] ~7+(7x(I=N» 'DONE' ~15 V 86 Fall Joint Computer Conference, 1972 APPENDIX B-PARITY CHECK l\;fATRIX FOR (104, 90) SEC-SEPARABLE DEC CODE SEeDEC 1 0 0 0 0 1 1 0 1 1 1 0 1 1 1 000*10000000000000 001*01000000000000 002*00100000000000 003*00010000000000 004*00001000000000 005*00000100000000 006*00000010000000 007*00000001000000 008*00000000100000 009*00000000010000 010*00000000001000 011*00000000000100 012*00000000000010 013*00000000000001 014*10000110111011 015*11000101100110 016*01100010110011 017*10110111100010 018*01011011110001 019*10101011000011 020*11010011011010 021*01101001101101 022*10110010001101 023*11011111111101 024*11101001000101 025*11110010011001 026*11111111110111 027*11111001000000 028*01111100100000 029*00111110010000 030*00011111001000 031*00001111100100 032*00000111110010 037*00110001010110 03$*00011000101011 039*10001010101110 040*01000101010111 041*10100100010000 042*01010010001000 043*00101001000100 044*00010100100010 045*00001010010001 046*10000011110011 047*11000111000010 050*11011101011110 051*01101110101111 052*10110001101100 053*01011000110110 054*00101100011011 055*10010000110110 056*01001000011011 057*10100010110110 058*01010001011011 059*10101110010110 060*01010111001011 061*10101101011110 065*00101011011011 066*10010011010110 074*10001100000010 075*01000110000001 077*11010100000110 078*01101010000011 083*11101111000111 084*11110001011000 085*01111000101100 086*00111100010110 088*10001001111110 094*11001111110100 095*01100111111010 096*00110011111101 097*10011111000101 098*11001001011001 099*11100010010111 100*11110111110000 101*01111011111000 108*10010110110001 109*11001101100011 110*11100000001010 111*01110000000101 112*10111110111001 113*11011001100111 114*11101010001000 115*01110101000100 116*00111010100010 117*00011101010001 119*11000010110010 120*01100001011001 124*00110111011100 125*00011011101110 126*00001101110111 Adaptive Error Correction Scheme for Computer Memory System APPENDIX C-(172, 154) SEC-SEPARABLE DEC CODE SEC DEC SECDEC 1 0 1 1 0 1 1 1 1 0 1 1 0 0 0 1 1 000*1000000000000000 001*0100000000000000 002*0010000000000000 003*0001000000000000 004*0000100000000000 005*U000010000000000 006*0000001000000000 007*0000000100000000 008*0000000010000000 009*0000000001000000 010*0000000000100000 011*0000000000010000 012*0000000000001000 013*0000000000000100 014*000000000CO n 0010 015*0000000000000001 01~*1011011110110001 017*1110110001101001 01q*11000001100001~1 011*1101111101110011 020*1101110000001000 021*°110111000000100 022*OOliolll000000l0 023*0001101110000001 024*1011101001110081 025*1110101010001001 026*1100001011110101 l27*1101011011001011 031*1010110000101011 032*1110000110100100 033*0111000011010010 034*0011100001101001 
035*1010101110000101 030*1110001001110011 o37*1100011~laOOlnoo 038*0110001101000100 03·::J*QOll00nll0l000l0 040*0001100011010001 041*1011101111011001 045*0110101101111111 04r*1000001000001110 047*010Q0001COOOOlll 048*1001011100110010 04J*0100101110011001 050*1001001001111101 051*1111111010001111 052*1100100011110110 053*0110010001111011 o 5 II * 1 000 n 11 110 0011 0 0 05S*0100001011000110 05C*0010000101100011 057*1010011100000000 058*0101001110000000 05j*0010100111J000JO 06J*Q0010ljOlll00000 I) 61 * ,) 0 (; 0 1 0 10 Q 111 :) 0 :I 0 062*0000J11100111000 ~~7·~1~11-11·111"""~ 063*1001101001001001 069*1111101010010101 070*1100101011111011 071*1101001011001100 072*0110100101100110 073*0011010010110011 074*1010110111101000 075*0101011011110100 076*0010101101111010 077*0001011110111101 078*1011110101101111 079*1110100100000110 08~*0111a10010000011 ~81*100Dll0lllll00CO 082*0100011011111000 083*0010001101111100 o [lll * 'i 0 r) 1800110111110 085*1011001111011110 187*0181100111101111 088*10011:lllJ1nOOll0 () 8 ') * 0 1 'J (; 11 r; 11 0 1 r) 0 0 11 090*1001000101100000 ,) 91 * ,:; 1 :) 0 1 0 0 0 1 0 11 ~; 0 () 0 092*0010010001011000 093*0001001000101100 094*0000100100010110 096*1011010111110100 097*0101101011111010 098*0010110101111101 099*1010000100001111 100*1110011100110110 101*0111001110011011 102*1000111001111100 103*0100011100111110 105*1010011001111110 107*1001111800101110 108*010011111001~111 109*1001iJ00000111010 111*1001001110111111 11 3 * 0 1111111 () () 11 0 111 114*1001100000101010 115*0100010000010101 11~*100101011nlll0ll 117*1111110101101100 llR*nllllll0l0ll0ll0 11~*OOlllll101011Qll 120*1010100000011100 121*0101010000001110 1 2 2 * .J 0 1 ,1 1 0 10 0 0 0 0 0 111 123*1010001110111010 124*~1()liJOOlalJll0Jl 125*10011111JOOlll0l 126*11111J0000111111 127*1110101110101110 128*0110010111010111 131*1001011011100111 132*1111110011000010 135*1111001111110001 136*1100111001001001 137*1101000010010101 138*1101111111111011 130*1101100001001100 140*0110110000100110 141*0011011000010011 147*0101111010111101 143*1001100011101111 143*1111101111000110 150*0111110111100011 151*1000100101000000 153*0010001001010000 15l*lnll0llQl01JOOll lGO*Oll1111nOlllQono 161*0011101100111000 162*0001110110011100 163*0000111011001110 164*0000011101100111 165*1011110000000010 171*11DllllollCllo00 172*0110111101101100 176~alnlll01J010llln 177*0010111010010111 178*1010008011111010 179*0101000001111101 182*0111110000111011 189*1010001101100001 190*1110011000000001 191*1100010010110001 192*1101010111101001 193*1101110101000101 19 11 * 11 0110 () 100010011 195*1101101100111000 1 96* 0 11 () 11 () 11 00111 f) 0 201*1001100100110001 209*0000110111101110 210*0000011011110111 21G*OOll111010111100 217*0001111101011110 71~*0000111110101111 210*1011000001100110 220*0101100000110011 223*001001101110101i )24*8001001101110101 225*1011111000001011 226*1110100010110100 228*0011101000101101 229*1010101010100111 23J*nll1a~OlUlllJOOl 232*1000111100001001 233*1111000000110101 234*1100111110101011 23G*Ol10100000110010 240*1100011100000110 242*1000011001110000 243*0100001100111000 87 Dynamic confirmation of system integrity * by BARRY R. BORGERSON University of California Berkeley, California INTRODUCTION continuously integral are identified, and the integrity of the rest of the system can then be confirmed by means less stringent than concurrent fault detection. For example, it might be expedient to allow certain failures to exist for some time before being detected. 
It is always desirable to know the current state of any system. However, with most computing systems, a large class of failures can remain undetected by the system long enough to cause an integrity violation. What is needed is a technique, or set of techniques, for detecting when a system is not functioning correctly. That is, we need some way of observing the integrity of a system.

A slight diversion is necessary here. Most nouns which are used to describe the attributes of computer systems, such as reliability, availability, security, and privacy, have a corresponding adjective which can be used to identify a system that has the associated attribute. Unfortunately, the word "integrity" has no associated adjective. Therefore, in order to enhance the following discourse, the word "integral" will be used as the adjective which describes the integrity of a system. Thus, a computer system will be integral if it is working exactly as specified.

Now, if we could verify all of the system software, then we could monitor the integrity of a system in real time by providing a 100 percent concurrent fault detection capability. Thus, the integrity of the entire system would be confirmed concurrently, where "concurrent confirmation" of the integrity of any unit of logic means that the integrity of this unit is being monitored concurrently with each use.

A practical alternative to providing concurrent confirmation of system integrity is to provide what will be called "dynamic confirmation of system integrity." With this concept, the parts of a system that must be continuously integral are identified, and the integrity of the rest of the system can then be confirmed by means less stringent than concurrent fault detection. For example, it might be expedient to allow certain failures to exist for some time before being detected. This might be desirable, for instance, when certain failure modes are hard to detect concurrently, but where their effects are controllable.

QUALITATIVE JUSTIFICATION

In most contemporary systems, a multiplicity of processes are active at any given time. Two distinct types of integrity violations can occur with respect to the independent processes. One type of integrity violation is for one process to interfere with another process. That is, one process gains unauthorized access to another's information or makes an illegitimate change of another process' state. This type of transgression will be called an "interprocess integrity violation." The other basic type of malfunction which can be caused by an integrity violation occurs when the state of a single process is erroneously changed without any interference from another process. Failures which lead to only intraprocess contaminations will be called "intraprocess integrity violations."

For many real-time applications, no malfunctions of any type can be tolerated. Hence, it is not particularly useful to make the distinction between interprocess and intraprocess integrity violations since concurrent integrity-confirmation techniques must be utilized throughout the system. For most user-oriented systems, however, there is a substantial difference in the two types of violations. Intraprocess integrity violations always manifest themselves as contaminations of a process' environment. Interprocess integrity violations, on the other hand, may manifest themselves as security infractions or contaminations of other processes' environments.
We now see that there can be some freedom in defining what is to constitute a continuously-integral, user-oriented system. For example, the time-sharing system described below is defined to be continuously integral if it is providing interprocess-interference protection on a continuous basis. Thus, other properties of the system, such as intraprocess contamination protection, need not be confirmed on a continuous basis.

Although the concept of dynamic confirmation of system integrity has a potential for being useful in a wide variety of situations, the area of its most obvious applicability seems to be for fault-tolerant systems. More specifically, it is most useful in those systems which are designed using a solitary-fault assumption, where "solitary fault" means that at most one fault is present in the active system at any time. The notion of "dynamic" becomes more clear in this context. Here, "dynamic" means in such a manner, and at such times, that the probability of encountering simultaneous faults is below a predetermined limit. This limit is dictated not only by the allowable probability of a catastrophic failure, but also by the fact that other factors eventually become more prominent in determining the probability of system failure. Thus, there often comes a point beyond which there is very little to be gained by increasing the ability to confirm integrity. The rest of this paper will concern itself with dynamic confirmation in the context of making this concept viable with respect to the solitary-fault assumption.

DYNAMIC CONFIRMATION TECHNIQUES

In this section, and the following section, a particular class of systems will be assumed. The class of systems considered will be those which tolerate faults by restructuring to run without the faulty units. Both the stand-by sparing and the fail-softly types of systems are in this category. These systems have certain characteristics in common; namely, they both must detect, locate, and isolate a fault, and reconfigure to run without the faulty unit, before a second fault can be reliably handled.

Obviously, if simultaneous faults are to be avoided, the integrity of all parts of the system must be verified. This is reasonably straightforward in many areas. For instance, the integrity of data in memory can be rather easily confirmed by the method of storing and checking parity. Of course, checks must also be provided to make sure that the correct word of memory is referenced, but this can be done fairly easily too.1 It is generally true that parity, check sums, and other straightforward concurrent fault-detection techniques can be used to confirm the integrity of most of the logic external to processors. However, there still remain the problems of verifying the integrity of the checkers themselves, of the processors, and of logic that is infrequently used such as that associated with isolation and reconfiguration.

All too often, there is no provision made in a system to check the fault detection logic. Actually, there are two rather straightforward methods of accomplishing this. One method uses checkers that have their own failure space. That is, they have more than two output states; and when they fail, a state is entered which indicates that the checker is malfunctioning. This requires building checkers with specifically defined failure modes. It also requires the ability to recognize and handle this limbo state. An example of this type of checker appears in Reference 2.
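To make the two ideas above concrete (parity over stored words, and a checker whose report distinguishes its own failure), a minimal software model in C follows. The status encoding, the self-test input, and the word width are invented for illustration; this is a sketch of the general technique, not the circuits of Reference 2 or any particular hardware design.

#include <stdio.h>
#include <stdint.h>

/* Minimal software model: an odd-parity check over a stored word, plus a
 * checker report with a third state that flags the checker itself.  The
 * encoding here is hypothetical. */

enum report { WORD_OK, WORD_IN_ERROR, CHECKER_FAULTY };

/* Parity bit that makes the total number of 1s in (data, parity) odd. */
static uint8_t odd_parity(uint32_t data)
{
    uint32_t p = data;
    p ^= p >> 16; p ^= p >> 8; p ^= p >> 4; p ^= p >> 2; p ^= p >> 1;
    return (uint8_t)(~p & 1u);
}

static enum report check_word(uint32_t data, uint8_t stored_parity,
                              int checker_selftest_ok)
{
    if (!checker_selftest_ok)            /* checker entered its "limbo" state */
        return CHECKER_FAULTY;
    return (odd_parity(data) == (stored_parity & 1u)) ? WORD_OK : WORD_IN_ERROR;
}

int main(void)
{
    uint32_t word = 0x2A;                 /* word as written to memory       */
    uint8_t  par  = odd_parity(word);     /* parity generated on the write   */
    word ^= 1u << 7;                      /* a single-bit failure in storage */
    printf("%d\n", check_word(word, par, 1));   /* prints 1 (WORD_IN_ERROR)  */
    return 0;
}

The point of the third state is that system software can treat a silent or misbehaving checker as an event in its own right, rather than mistaking it for error-free operation.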
Another method for verifying the integrity of the fault-detection logic is to inject faults; that is, cause a fault to be created so that the checker must recognize it. In many cases this method turns out to be both cheaper and simpler than the previously mentioned scheme. With this method, it is not necessary to provide a failure space for the checkers themselves. However, it is necessary to make provisions for injecting faults when that is not already possible in the normal design. With this provision, confirming the integrity of the checking circuits becomes a periodic software task. Failures are injected, and fault-detection reports are expected. The system software simply ignores the fault report, or initiates corrective action if no report is generated.

Associated with systems of the type under discussion, there is logic that normally is called into use only when a fault has been detected. This includes the logic dedicated to such tasks as diagnosis, isolation, and reconfiguration. This normally idle class of hardware units will collectively be called "reaction logic." In order to avoid simultaneous faults in a system, this reaction logic must not be allowed to fail without the failure being rapidly detected. Several possibilities exist here. This logic can be made very reliable by using some massive redundancy technique such as triple-modular-redundancy.3 Another possibility is to design these units such that they normally fail into a failure space which is detected and reported. However, this will not be as simple here as it might be for self-checking fault detectors because the failure modes will, in general, be harder to control. A third method would be to simulate the appropriate action and observe the reaction. This also is not as simple here as it was above. For example, it may not be desirable to reconfigure a system on a frequent periodic basis. However, one way out of this is to simulate the action, initiate the reaction, and confirm the integrity of this logic without actually causing the reconfiguration. This will probably require that the output logic either be made "reliable" or be encoded so as to fail into a harmless and detectable failure space.

The final area that requires integrity confirmation is the processors. The technique to be employed here is very dependent on the application of the system. For many real-time applications, nothing short of concurrent fault detection will apparently suffice. However, there are many areas where less drastic methods may be adequate. Fabry4 has presented a method for verifying critical operating-system decisions, in a time-sharing environment, through a series of independent double checks using a combination of a second processor and dedicated hardware. This method can be extended to verifying certain decisions made by a real-time control processor. If most of the tasks that a real-time processor performs concern data reduction, it is possible that software-implemented consistency checks will suffice for monitoring the integrity of the results. When critical control decisions are to be made, a second processor can be brought into the picture for consistency checks, or dedicated hardware can be used for validity checking. Alternatively, a separate algorithm, using separate registers, could be run on the same processor to check the validity of a control action, with external time-out hardware being used to guarantee a response. These procedures could certainly provide substantial cost savings over concurrent fault-detection methods.
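The flavor of such a software-implemented double check can be sketched as follows, assuming a hypothetical data-reduction task: the control action is recomputed by an independent routine before it is issued, and a watchdog, modelled here with the POSIX alarm() call rather than the external time-out hardware mentioned above, guarantees that some response is produced.

#include <signal.h>
#include <stdio.h>
#include <unistd.h>

/* Sketch of a consistency-checked control decision with a time-out.
 * The two compute routines, the sensor value, and the one-second limit
 * are all invented for illustration. */

static void no_response(int sig)
{
    static const char msg[] = "no response in time: presume processor fault\n";
    (void)sig;
    write(2, msg, sizeof msg - 1);
    _exit(1);
}

static int compute_action(int sensor)     { return 2 * sensor; }       /* primary algorithm     */
static int compute_action_alt(int sensor) { return sensor + sensor; }  /* independent recompute */

int main(void)
{
    signal(SIGALRM, no_response);
    alarm(1);                      /* stands in for the external time-out hardware */

    int sensor = 21;
    int a = compute_action(sensor);
    int b = compute_action_alt(sensor);
    alarm(0);                      /* a response was produced within the limit */

    if (a != b) {
        fprintf(stderr, "consistency check failed: withhold control action\n");
        return 1;
    }
    printf("validated control action: %d\n", a);
    return 0;
}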
For a system to be used in a general-purpose, time-sharing environment, the method of checking processors non-concurrently is very powerful because simple, relatively inexpensive schemes will suffice to guarantee the security of a user's environment. The price that is paid is to not detect some faults that could cause contamination of a user's own information. But conventional time-sharing systems have this handicap in addition to not having a high availability and not maintaining security in the presence of faults, so a clear improvement would be realized here at a fairly low cost. In order to detect failures as rapidly as possible in processors that have no concurrent fault-detection capability, periodic surveillance tests can be run which will determine if the processor is integral.

VALIDATION OF THE SOLITARY-FAULT ASSUMPTION

Fault-tolerant systems which are capable of isolating a faulty unit, and reconfiguring to run without it, typically can operate with several functional units removed at any given time. However, in order to design the system so that all possible types of failures can be handled, it is usually necessary to assume that at most one active unit is malfunctioning at any given time. The problem becomes essentially intractable when arbitrary combinations of multiple faults are considered. That is not to say that all cases of multiple faults will bring a system down, but usually no explicit effort is made to handle most multiple faults. Of course, by multiple faults we mean multiple independent faults. If a failure of one unit can affect another, then the system must be designed to handle both units malfunctioning simultaneously, or isolation must be added to limit the influence of the original fault.

A quantitative analysis will now be given which provides a basis for evaluating the viability of utilizing non-concurrent integrity-confirmation techniques in an adaptive fault-tolerant system. In the analysis below, the letter "s" will be used to designate the probability that two independent, simultaneous faults will cause a system to crash. The next concept we need is that of coverage. Coverage is defined5 as the conditional probability that a system will recover given that a failure has occurred. The letter "c" will be used to denote the coverage of a system. In order to determine a system's ability to remain continuously available over a given period of time, it is necessary to know how frequently the components of the system are likely to fail. The usual measure employed here is the mean-time-between-failures. The letter "m" will be used to designate this parameter. It should be noted here that "m" represents the mean-time-between-internal-failures of a system; the system itself hopefully has a much better characteristic. The final parameter that will be needed here is the maximum-time-to-recovery. This is defined to be the maximum time elapsed between the time an arbitrary fault occurs and the time the system has successfully reconfigured to run without the faulty unit. The letter "r" will be used to designate this parameter.

The commonly used assumption that a system does not deteriorate with age over its useful life will be adopted. Therefore, the exponential distribution will be used to characterize the failure probability of a system.
Thus, at any given time, the probability of encountering a fault within the next u time units is:

p = ∫[0,u] (1/m)*exp(-t/m) dt = 1 - exp(-u/m)

From this we can see that the upper bound on the conditional probability of encountering a second independent fault is given by:

q = 1 - exp(-r/m)

Since it is obvious that r must be made much smaller than m if a system is to have a high probability of surviving many internal faults, the following approximation is quite valid:

q = 1 - exp(-r/m) = 1 - Σ(k=0 to ∞) (-r/m)^k / k! = 1 - 1 + r/m - (1/2)*(r/m)^2 + (1/6)*(r/m)^3 - ... ≈ r/m

Therefore, the probability of not being able to recover from an arbitrary internal failure is given by:

x = (1-c) + c*q*s = (1-c) + c*s*r/m

where the first term represents the probability of failing to recover due to a solitary failure and the second term represents the probability of not recovering due to simultaneous failures given that recovery from the first fault was possible.

If we now consider each failure as an independent Bernoulli trial and make the assumption that faulty units are repaired at a sufficient rate so that there is never a problem with having too many units logically removed from a system at any given time, then it is a simple matter to determine the probability of surviving a given period, T, without encountering a system crash. The hardware failures will be treated as n independent samples, each with probability of success (1-x), where n is the smallest integer greater than or equal to T/m. Thus, the probability of not crashing on a given fault is

(1-x) = c*(1 - r*s/m)

and the probability, P, of not crashing during the period T is given by:

P = [c*(1 - r*s/m)]^n = c^n * (1 - r*s/m)^n

With this equation, it is now possible to establish the validity of using the various non-concurrent techniques mentioned above to confirm the integrity of a system. What this equation will establish is how often it will be necessary to perform the fault injection, action simulation, and surveillance procedures in order to gain an acceptable probability of no system crashes. Since the time required to detect, locate, and isolate a fault, and reconfigure to run without the faulty unit, will be primarily a function of the time to detection for the non-concurrent schemes, and since this time is essentially equivalent to how frequently the confirmation procedures are invoked, we can assume that r is equal to the time period between the periodic integrity-confirmation checks.

In order to gain a feeling for the order of r, rather pessimistic numbers can be assumed for m, s, and T. Assume m = 1 week, s = 1/2, and T = 10 years; this gives an n of 520. For now, assume c is equal to one. Now, in order to attain a probability of .95 that a system will survive 10 years with no crashes under the above assumptions, r will have to be:

r = (m/s)*[1 - .95^(1/520)] = 119 seconds

Thus, if the periodic checks are made even as infrequently as every two minutes, a system will last 10 years with a probability of not crashing of approximately .95.

The effects of the coverage must now be examined. In order for the coverage to be good enough to provide a probability of .95 of no system crashes in 10 years due to the system's inability to handle single faults, it must be:

c = .95^(1/520) = .9999

Now this would indeed be a very good coverage. Since the actual coverage of any given system will most likely fall quite short of this value, it seems that the coverage, and not multiple simultaneous faults, is the limiting factor in determining a system's ability to recover from faults. The most important conclusion to be drawn from this section is that the solitary-fault assumption is not only convenient but quite justified, and this is true even when only periodic checks are made to verify the integrity of some of the logic.
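The numbers just quoted can be reproduced directly. The short C program below uses the same assumed values of m, s, n, and the 0.95 survival target, and prints the required check interval r and the required coverage c derived above.

#include <math.h>
#include <stdio.h>

/* Reproduces the arithmetic above: m = 1 week, s = 1/2, n = 520 failures in
 * 10 years, target probability 0.95 of no crash.  Build with: cc prob.c -lm */
int main(void)
{
    double m = 7.0 * 24 * 3600;   /* mean time between internal failures, seconds */
    double s = 0.5;               /* P(two simultaneous faults crash the system)   */
    double n = 520.0;             /* expected failures in 10 years at one per week */
    double P = 0.95;              /* target probability of no crash over 10 years  */

    double r = (m / s) * (1.0 - pow(P, 1.0 / n));  /* required check interval       */
    double c = pow(P, 1.0 / n);                    /* coverage needed if r*s/m ~ 0  */

    printf("r = %.0f seconds\n", r);   /* about 119 seconds */
    printf("c = %.4f\n", c);           /* about 0.9999      */
    return 0;
}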
INTEGRITY CONFIRMATION FEATURES OF THE "PRIME" SYSTEM

In order to better illustrate the potential power of dynamic integrity confirmation techniques, a description will now be given of how this concept is being used to economically provide an integrity confirmation structure for a fault-tolerant system. At the University of California, Berkeley, we are currently building a modular computer system, which has been named PRIME, that is to be used in a multiaccess, interactive environment. The initial version of this system will have five processors, 13 8K-word by 33-bit memory blocks with associated switching units, 15 high-performance disk drives, and a switching network which allows processor, disk, and external-device switching. A block diagram of PRIME appears in Figure 1.

Figure 1-Block diagram of the PRIME system (* = reconfiguration logic; each indicated line represents 16 terminal connections; each memory block (MB) consists of two 4K modules)

The processing elements in PRIME are 3-bus, 16-bit wide, 90ns cycle time microprogrammable processors called META 4s.6 Each processor emulates a target machine in addition to performing I/O and executive functions directly in microcode. At any given time, one of the processors is designated the Control Processor (CP), while the others are Problem Processors (PPs). The CP runs the Central Control Monitor (CCM) which is responsible for scheduling, resource allocation, and interprocess message handling. The Problem Processors run user jobs and perform some system functions with the Extended Control Monitor (ECM) which is completely isolated from user processes. Associated with each PP is a private page, which the ECM uses to store data, and some target-machine code which it occasionally causes to be executed. A more complete description of the structure and functioning of PRIME is given elsewhere.7

The most interesting aspects of PRIME are in the areas of availability, efficiency, and security. PRIME will be able to withstand internal faults. The system has been designed to degrade gracefully in the presence of internal failures.8 Also, interprocess integrity is always maintained even in the presence of either hardware or software faults. The PRIME system is considered continuously integral if it is providing interprocess interference protection. Therefore, security must be maintained at all times.
Other properties, such as providing user service and recovering from failures, can be handled in a less stringent manner. Thus, dynamic confirmation of system integrity in PRIME must be handled concurrently for interprocess interference protection and can be handled periodically with respect to the rest of the system. Of course, there are areas which do not affect interprocess interference protection but which will nonetheless utilize concurrent fault detection simply because it is expedient to do so.

Fault injection is being used to check most of the fault-detection logic in PRIME. This decision was made because the analysis of non-concurrent integrity-confirmation techniques has established that periodic fault injection is sufficiently effective to handle the job and because it is simpler and cheaper than the alternatives.

There is a characteristic of the PRIME system that makes schemes which utilize periodic checking very attractive. At the end of each job step, the current process and the next process are overlap swapped. That is, two disk drives are used simultaneously; one of these disks is rolling the current job out, while the other is rolling the next job in. During this time, the associated processor has some potential free time. Therefore, this time can be effectively used to make whatever periodic checks may be necessary. And since the mean time between job steps will be less than a second, this provides very frequent, inexpensive periodic checking capabilities.

The integrity of Problem Processors is checked at the end of each job step. This check is initiated by the Control Processor, which passes a one-word seed to the PP and expects the PP to compute a response. This seed will guarantee that different responses are required at different times so that the PP cannot accidentally "memorize" the correct response. The computation requires the use of both target machine instructions and a dedicated firmware routine to compute the expected response. The combination of these two routines is called a surveillance procedure. This surveillance procedure checks all of the internal logic and the control storage of the microprocessors. The target machine code of the surveillance routine is always resident in the processor's private page. The microcode part is resident in control storage. A fixed amount of time is allowed for generating a response when the CP asks a PP to run a surveillance on itself. If the wrong response is given, or if no response is given in the allotted time, then the PP is assumed to be malfunctioning and remedial action is initiated. In a similar manner, each PP periodically requests that the CP run a surveillance on itself. If a PP thinks it detects that the CP is malfunctioning, it will tell the CP this, and a reconfiguration will take place followed by diagnosis to locate the actual source of the detected error. More will be said later about the structure of the reconfiguration scheme.
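The seed-and-response exchange can be pictured with the toy model below. The mixing function stands in for the combination of target-machine code and firmware that actually exercises a PP's internal logic and control storage; it, the seed value, and the word width are invented for illustration, and the time limit is omitted.

#include <stdint.h>
#include <stdio.h>

/* Toy model of the surveillance exchange: the CP hands a PP a one-word seed,
 * computes the answer it expects, and compares.  The mixing function below is
 * only a stand-in for the real surveillance routine. */

static uint16_t surveillance_response(uint16_t seed)
{
    uint16_t r = seed;
    for (int i = 0; i < 16; i++)
        r = (uint16_t)(((r << 3) | (r >> 13)) ^ (0x9E37u + (unsigned)i));
    return r;
}

int main(void)
{
    uint16_t seed = 0x1234;                           /* fresh seed each job step        */
    uint16_t expected = surveillance_response(seed);  /* CP's precomputed answer         */
    uint16_t answer   = surveillance_response(seed);  /* PP's reply (here, a healthy PP) */

    if (answer != expected)        /* in PRIME, a late reply counts as a wrong one */
        puts("PP presumed malfunctioning: initiate remedial action");
    else
        puts("PP passed surveillance");
    return 0;
}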
While the periodic running of surveillance procedures is sufficient for most purposes, it does not suffice for protecting against interprocess interference. As previously mentioned, this protection must be continuous. Therefore, a special structure has been developed which is used to prevent interprocess interference on a continuous basis.4 This structure provides double checks on all actions which could lead to interprocess interference. In particular, the validity of all memory and disk references, and all interprocess message transmissions, are among those actions double checked.

A class code is used to associate each sector (1K words) of each disk pack with either a particular process or with the null process, which corresponds to unallocated space. A lock and key scheme is used to protect memory on a page (also 1K words) basis. In both cases, at most one process is bound to a 1K-word piece of physical storage. The Central Control Monitor is responsible for allocating each piece of storage, and it can allocate only those pieces which are currently unallocated. Each process is responsible for deallocating any piece of storage that it no longer needs. Both schemes rely on two processors and a small amount of dedicated hardware to provide the necessary protection against some process gaining access to another process' storage.

In order for the above security scheme to be extremely effective, it was decided to prohibit sharing of any storage. Therefore, the Interconnection Network is used to pass files which are to be shared. Files are sent as regular messages, with the owning process explicitly giving away any information that it wishes to share with any other process. All interprocess messages are sent by way of the CP. Thus, both the CCM and the destination ECM can make consistency checks to make sure that a message is delivered to the correct process.

The remaining area of integrity checking which needs to be discussed is the reaction hardware. In the PRIME system, this includes the isolation, power switching, diagnosis, and reconfiguration logic. A variety of schemes have been employed to confirm the integrity of this reaction logic. In order to describe the methods employed to confirm the integrity, it will be necessary to first outline the structure of the spontaneous reconfiguration scheme used in the PRIME system.

There are four steps involved in reorganizing the hardware structure of PRIME so that it can continue to operate with internal faults. The first step consists of detecting a fault. This is done by one of the many techniques outlined in this paper. In the second step, an initial reconfiguration is performed so that a new processor, one not involved in the detection, is given the job of being the CP. This provides a pseudo "hard core" which will be used to initiate gross diagnostics. The third step is used to locate the fault. This is done by having the new CP attach itself to the Programmable Control Panel9 of a Problem Processor via the Interconnection Network, and test it by single-stepping this PP through a set of diagnostics. If a PP is found to be functioning properly, then it is used to diagnose its own I/O channels. After the fault is located, the faulty functional unit is isolated, and a second reconfiguration is performed to allow the system to run without this unit.
Each processor can set only its own flag that suggests the CP is sick. The flag which indicates that a processor is the CP can be set only if both the associated processor and the dedicated hardware concur. Thus, the dedicated hardware will not let this flag go up if another processor already has its up. Also, this flag will automatically be lowered whenever two processors claim that the CP is malfunctioning. There is somewhat of a dilemma associated with confirming the integrity of this logic. Because of the distributed nature of this reconfiguration structure, it should be unnecessary to make any of it "reliable." That is, the structure is already distributed so that a failure of any part of it can be tolerated. However, if simultaneous faults are to be avoided, the integrity of this logic must be dynamically confirmed. Unfortunately, it is not practical to check this logic by frequently initiating reconfigurations. This dilemma is being solved by a scheme which partially simulates the various actions. The critical logic that cannot be checked during a simulated reconfiguration is duplicated so that infrequent checking by actual reconfiguration is sufficient to confirm the integrity of this logic. The only logic used in the diagnostic scheme where integrity confirmation has not already been discussed is the Programmable Control Panel. This pseudo panel is used to allow the CP to perform all the functions normally available on a standard control panel. No explicit provision will be made for confirming the integrity of the Programmable Control Panel because its loss will never lead to a system crash. That is, failures in this unit can coexist with a failure anywhere else in the system without bringing the system down. For powering and isolation purposes, there are only four different types of functional units in the PRIlVIE system. The four functional units are the intelligence module, which consists of a processor, its I/O controller 95 and the subunits that directly connect to the controller, its memory bus, and its reconfiguration logic; the memory block, which consists of two 4K-word by 33-bit 1\10S memory modules and a 4X2 switching matrix; the switching module, which consists of the switch part of two processor-end and three device-end nodes of the Interconnection Network; and the disk drive. The disk drives and switching modules can be powered up and down manually only. The intelligence modules must be powered up manually, but they can be powered down under program control. Finally, the memory blocks can be powered both up and down under program control. No provision was made to power down the disks or switching modules under program control because there was no isolation problem with these units. Rather than providing very reliable isolation logic at the interfaces of the intelligence modules and memory blocks, it was decided to provide additional isolation by adding the logic which allows these units to be dynamically powered down. Also, because it may be necessary to power memory blocks down and then back up in order to determine which one has a bus tied up, the provision had to be made for performing the powering up of these units on a dynamic basis. Any processor can power down any memory block to which it is attached, so it was not deemed necessary to provide for any frequent confirmation of the integrity of this power-down logic. Also, every processor can be powered down by itself and one other processor. 
All of the different integrity confirmation techniques used in PRIME have been described. The essence of the concept of dynamic confirmation of system integrity is the systematic exploitation of the specific characteristics of a system to provide an adequate integrity confirmation structure which is in some sense minimal. For instance, the type of use and the distributed intelligence of PRIME were taken advantage of to provide a sufficient integrity-confirmation structure at a much lower cost and complexity than would have been possible if these factors were not carefully exploited.

REFERENCES

1 B BORGERSON C V RAVI On addressing failures in memory systems Proceedings of the 1972 ACM International Computing Symposium Venice Italy pp 40-47 April 1972

2 D A ANDERSON G METZE Design of totally self-checking check circuits for M-out-of-N codes Digest of the 1972 International Symposium on Fault-Tolerant Computing pp 30-34

3 R A SHORT The attainment of reliable digital systems through the use of redundancy-A survey IEEE Computer Group News Vol 2 pp 2-17 March 1968

4 R S FABRY Dynamic verification of operating system decisions Computer Systems Research Project Document No P-14.0 University of California Berkeley February 1972

5 W G BOURICIUS W C CARTER P R SCHNEIDER Reliability modeling techniques for self-repairing computer systems Proceedings of the ACM National Conference pp 295-309 1969

6 META 4 computer system microprogramming reference manual Publication No 7043MO Digital Scientific Corporation San Diego California June 1972

7 H B BASKIN B R BORGERSON R ROBERTS PRIME-A modular architecture for terminal-oriented systems Proceedings of the 1972 Spring Joint Computer Conference pp 431-437

8 B R BORGERSON A fail-softly system for time-sharing use Digest of the 1972 International Symposium on Fault-Tolerant Computing pp 89-93

9 G BAILLIU B R BORGERSON A multipurpose processor-enhancement structure Digest of the 1972 IEEE Computer Society Conference San Francisco September 1972 pp 197-200

The in-house computer department

by JOHN J. PENDRAY

TECSI-SOFTWARE
Paris, France

INTRODUCTION
Over fifteen years ago, in some inner recess of some large corporation, a perplexed company official stood pondering before a large corporate organizational chart on his office wall. In his hand he held a small square of paper on which the words "Computer Department" were inscribed. Behold one of the modern frontiersmen of twentieth century business: the first man to try to stick the in-house computer department on the company organizational chart. He probably failed to find a place with which he felt comfortable, thereby becoming the first of many who have failed to resolve this problem.

Most of the earlier attempts ended by putting the computer department somewhere within the grasp of the corporate financial officer. The earliest computer applications were financial in nature, such as payroll, bookkeeping, and, after all, anything that costs as much as a computer must belong in the financial structure somehow. Many corporations are still trying to get these financial officers to recognize that there are many non-financial computer applications which are at least as important as the monthly corporate trial balances. Additionally, and perhaps even worse, the allocation of the computer department's resources is viewed as a relatively straightforward financial matter subject to budgeting within financial availability. This method of resource dispensing seems not to provide the right balance of performance and cost generally sought in the business world.

As the computer department growth pattern followed the precedent of Topsy, many corporations began to wonder why something that had become an integral part of every activity in the company should belong to one function, like finance. This questioning led to a blossoming forth of powerful in-house computer departments disguised under surcharged names like Information Services Department. Often, this square on the organizational chart had a direct line to the chief executive's office. This organizational form has created two of the most widely adopted erroneous concepts ever to permeate corporate activity. The first, and perhaps least damaging of these, is the concept that the highest corporate officers should be directly in touch with the computer at all times (and at any cost) to take advantage of something called the Management Information System (MIS). (Briefly, a MIS is a system designed to replace a fifteen-minute telephone call to the research department by a three-second response from a computer, usually providing answers exactly fourteen minutes and fifty-seven seconds faster than anyone can phrase his precise question.) The second concept to follow the attachment of the computer department to the chief executive's office has been the missionary work which has been undertaken in the name of, and with the power or influence of, the chief executive. Information service missionary work generally consists of the computer experts telling each department exactly what their information needs are and how they should go about their business.

This article will examine the nature of the in-house computer department in terms of its place in the corporate structure, its product, its function in the maturing of the product, and its methods of optimizing its resource utilization. Additionally, one possible internal structure for an in-house computer department will be presented.

THE IN-HOUSE COMPUTER DEPARTMENT WITHIN THE CORPORATE STRUCTURE

Most of the blocks on the corporate organizational chart have some direct participation in the business of the company. Take an example. The Whiz-Bang Corporation is the world's leader in the production of whiz-bangs. Its sales department sells whiz-bangs. Its production department produces whiz-bangs. Its development department develops new types of whiz-bangs. Its computer department has nothing to do with whiz-bangs. The people in the computer department know lots about computers, but their knowledge about whiz-bangs comes from what other departments have told them. What are they doing in the whiz-bang operation?

The computer department provides services to all the other departments in the company. These other departments are directly involved in the business of the company, but the function of the computer department is to provide services to the company, not to contribute directly in the business of the company. In this light, the computer department is like an external supplier of services. How should such a supplier of services be funded? Let's return to the Whiz-Bang Corporation analogy.
The marketing department is allocated sufficient resources to market whiz-bangs; the production department gets resources adequate to produce whiz-bangs, etc. It is not possible to allocate resources to the computer department on the basis of its direct contribution to whiz-bangs. The computer department provides services, and these services are what should be funded. The value of these services provides the basis for funding. Other departments use the computer services, and it follows that only these departments can place the value on a service and that each department should pay for the services which it gets. Therefore, the funding of the computer department is the sum total of the payments received from the other departments for services rendered.

How should the computer department be controlled? First of all, it is necessary to define what is to be controlled. Either one can control product specifications or one can control the resources necessary to produce a product. Product specifications are generally controlled, in one way or another, by the buyer, while resource control is usually an internal problem concerned with the production of the buyer-specified product. At the Whiz-Bang Corporation, marketing determines the buyer-desired product specifications, but each internal department calculates and controls its resource requirements to yield the specified number and type of whiz-bangs. If the nature of the computer department is to provide services as its product, the users of these services should control their specifications. After all, they are paying for them (or should be). If the computer department has the task of providing services that the other departments will be willing to fund, it should have the responsibility to allocate its resources to optimize its capability to provide the services. After all, they are the experts (or should be).

In summary, the departments in the corporation are using an external type of service from an internal source, the in-house computer department. Only they can value the service, but they won't do this job of valuation unless they are charged for the service. This valuation will automatically produce customer-oriented specifications for the services. On the other hand, once the services are specified and accepted at a certain cost, it is the job of the computer department to use its revenues in the best manner to produce its services. That is, the funding flows as revenues from the other departments; but the utilization of this funding is the proper responsibility of the provider of the services, the computer department. These principles indicate that the in-house computer department can be melded into the corporate structure in any position where it can be equally responsive to all of the other departments while controlling, itself, the utilization of its resources.

THE PRODUCT OF THE COMPUTER DEPARTMENT-THE COMPUTER SERVICE

A computer service, which is the product produced and sold by the computer department, has an average life span of between five and ten years. It is to be expected that as the speed of computer technological change diminishes, this life span will lengthen. To date, many computer services have been conceived and developed without a real understanding of the nature of a computer service. The lengthening of the life span of the computer service should produce a more serious interest in understanding this nature in order to produce more economical and responsive services.
A well-conceived computer service is a highly tuned product which depends on the controlled maturing and merging of many technical facets. Too often this maturing and merging is poorly controlled because the life cycle of the computer service is not considered. The net result may be an inflexible and unresponsive product which lives on the edge of suicide, or murder, for the entirety of its operational life. Computer services management should not allow this inflexibility to exist, for the computer is one of the most flexible tools in the scientific grab-bag. This innate flexibility should be exploited by management in the process of maturing a computer service.

MATURING THE COMPUTER SERVICE

There are four major phases in the maturing process: definition, development, operation, and overhaul. Perhaps the most misunderstood aspect of this maturing process is the relation between the phases in the life cycle of a computer service and the corresponding changes required in the application of technical specialties. Each phase requires a different technical outlook and a different level of technical skills.

The definition phase

Defining the service is oriented toward producing the functional specifications which satisfy the needs and constraints of the client. From another point of view, this is the marketing and sales problem for the computer department. It should be treated as a selling problem because the service orientation of the computer department is reinforced by recognition that the buyer has the problem, the money, and the buying decision. The technical outlook should be broad and long term, for the entire life of the service must be considered. Technical details are not desirable at this stage, but it is necessary to have knowledge of recent technical advances which may be used to the benefit of the service. Also, a good understanding of the long-range direction and plans of the computer department is necessary in order to harmonize the proposed service with these goals.

The first step in defining a computer service is to locate the potential clients and estimate their susceptibility to an offer of a computer service. At first glance, this seems an easy task as the potential clients are well-known members of the corporate structure. Not so! Many of the most promising avenues of computer services cut across the normal functional separations and involve various mixtures of the corporate hierarchy. These mixtures are frequently immiscible, and the selling job involves convincing each participant of his benefit and helping him justify his contribution. The corporate higher-ups would also need to be convinced, but the money will seldom come from their operating budgets. In any case, the responsibility to seek out and sell new computer services lies with the computer department; however, the decision to buy is the sole property of the client departments.

After potential clients are identified, a complete understanding of the problem must be gained in order to close the sale. This understanding should give birth to several alternative computer system approaches giving different performance and cost tradeoffs. The potential customer will want to understand the parameters and options available to him in order to select his best buy. This is a phase of the life cycle of the service where the computer department provides information and alternatives to the prospective client.
Closing of the agreement should be in contractual terms with each party obligated for its part of the responsibility. All terms such as financing schedules, product specifications, development schedules, modification procedures, and penalties should be reduced to writing and accepted before work begins. A computer department that cannot (or will not) make firm commitments in advance of a project is poorly managed. (Of course there can always be a periodic corporate reckoning to insure that imbalances are corrected.)

The development phase

The contract is signed; the emphasis for the computer department changes from sales to development and implementation of the service. This phase calls for a concentrated, life-of-the-effort technical outlook with in-depth and competent technical ability required at all levels. The specialists of the computer department must be organized to produce the system which will provide the service as specified. The usual method for accomplishing this organization is the "project". Many learned texts exist on the care and feeding of a technical project, so let's examine here only the roles of the computer department and the client within the general framework of a project.

Computer department participation centers on its role as being the prime responsible party for the project. It is the computer department's responsibility to find the best techniques for satisfying all the goals of the project. The correct utilization of the resources available to the computer department is a key to the project's success. One resource is time, and time runs out for a project. That is to say that no true project succeeds unless it phases out on time. A project team produces a product, turns it over to the production facility, and then the project ceases to exist. The personnel resource of the computer department is also viewed differently in a project. The project team is composed of a hand-tailored mix of specialists who are given a temporary super-incentive and then removed from the project after their work is done. Super-incentives and fluid workforces are not easily arranged in all companies, and this is one of the reasons why the computer department must maintain control of the utilization of its resources. The computer department should acquire new resources for a project within the following guideline: don't. Projects should not collect things around them or they become undisintegratable. The only exception: acquisitions which form part of the product, and not part of the project, and which will go with the product into the production phase.

Assuring the continuing health of the project's product is another critical aspect of the computer department's responsibility in the project. Since the project team will die, it must provide for the product to live independently of the project. This involves producing a turnoverable product which is comprehensible at all levels of detail. Also, the final product must be flexible enough to respond to the normal changes required during its lifetime. It is interesting to note that in the development phase of the life cycle of a service, the project philosophy dictates that the computer department orient itself toward project goals and not just toward satisfying the specifications of the service. That is, the service specifications are only one of the project goals along with time, cost, etc.
On the other hand, the eventual user of the service, i.e., the client department, views the project as only a part of the total process necessary to have the service. To the client, the project is almost a "necessary evil"; however, the development project philosophy depends on active client involvement. Three distinct client functions are required. In their order of importance they are:

1. Continuing decision-making on product performance and cost alternatives surfaced during the project work.
2. Providing aid to the technical specialists of the computer department to insure that the functional specifications are well understood.
3. Preparing for use of the service, including data preparation, personnel training, reorganization, etc.

These three client functions are certainly important aspects of a project, but it should not be forgotten that the development project is a method used by the computer department to marshal its resources and, therefore, must be under the responsibility of the computer department. Development of the service may be an anxious phase as the client has been sold on the idea and is probably eager for his first product. This eagerness should not be blunted by the project team, nor should it affect the sound judgment of the team. Consequently, contact between the technical experts and the client should be controlled and directed toward constructive tasks.

The operation phase

The third step in the life cycle of a service begins when the development project begins to phase out. This is the day-to-day provision of the service to the client. In this phase, the computer department has a production philosophy which is single-minded: to assure the continuing viability of the service. This is often a fire-fighting function in which the quick-and-dirty answer is the best answer. There isn't much technical glory in this part of the life cycle of a service, but it's the part that produces the sustaining revenues for the computer department.

The computer department enhances continuing product viability by performing two functions. Of primary importance is to reliably provide the specified service with minimum expenditure of resources. Secondarily, the client must be kept aware of any possible operational changes which might affect the performance or cost of his service. Again, the client has a strong part in the decision to effect a change. The client must contribute to the continuing viability of the product by using it intelligently and periodically evaluating its continuing worth.

The overhaul phase

As a service ages during its operational heyday, the environment around it changes little by little. Also, the quick-and-dirty maintenance performed by the operations personnel will begin to accumulate into a patchwork quilt which doesn't look much like the original edition. These two factors are not often self-correcting, but they can go unnoticed for years. The only answer is a complete technical review and overhaul. Every service should be periodically dragged out of the inventory and given a scrub-down. This is another job where the technical glamor is quite limited; however, overhauling services to take advantage of new facilities or concepts can provide significant gains, not to mention that the service will remain neat, controllable, flexible, and predictable.

Thus definition, development, operation, and overhaul are the four phases in the life cycle of a computer service. All of these phases directly affect the clients and are accomplished with their aid and involvement. However, there is another area of responsibility for the computer department that does not touch the clients as closely. This area is the control over the utilization of the computer department's resources.
OPTIMIZING THE UTILIZATION OF THE COMPUTER DEPARTMENT'S RESOURCES

This important responsibility of the computer department is an internally-oriented function which is not directly related to the life cycles of the services. This is the problem of selecting the best mix of resources which fulfills the combined needs of the clients. In the computer service business there are two main resources, people and computing equipment.

Effective management of computer specialists involves at least training, challenging, and orienting. If these three aspects are performed well, a company has a better chance of keeping its experts, and keeping them contributing. Training should be provided to increase the professional competence of the staff, but in a direction which is useful to the company. It is not clear, for instance, that companies who use standard off-the-shelf programming systems have a serious need to train the staff in the intricate design of programming systems software. It's been done, and every routine application suddenly became very sophisticated, delicate, and incomprehensible. However, training which is beneficial for the company should be made interesting for the personnel.

Challenging technical experts is a problem which is often aggravated by a poor hiring policy which selects over-qualified personnel. Such people could certainly accomplish the everyday tasks of the company if only they weren't so bored. The management problem of providing challenge is initially solved by hiring people who will be challenged by the work that exists at the company. Continuing challenge can be provided by increasing responsibility and rotating tasks.

Orienting the technical personnel is a critical part of managing the computer department. If left alone, most technical specialists tend to view the outside world as it relates to the parameter list of their logical input/output modules, for example. They need to be oriented to realize that their technical specialty is important because it contributes to the overall whole of the services provided to the clients. This client-oriented attitude is needed at all levels within a service organization.

Besides personnel, the other major resource to be optimized by the computer department is the computing system. This includes equipment and the basic programs delivered with the equipment, sometimes called "hardware" and "software". Optimizing of a computing system is a frequently misunderstood or neglected function of the computer department. In a sense this is not surprising as there are three factors which obscure the recognition of the problem. First of all, computers tend to be configured by technical people who like computers. Secondly, most computer systems have produced adequate means of justifying themselves, even in an unoptimized state. Lastly, computer personnel, both manufacturers and users, have resisted attempts to subject their expenditures to rigorous analysis. It seems paradoxical that the same computer experts who have created effective analysis methodologies for so many other fields maintain that their field is not predictable and not susceptible to methodological optimization.
The utilization of computer systems is capable of being analyzed and may be seen as three distinct steps in the life cycle of the resource. These three steps can be presented diagrammatically as follows:

general requirements → development of the hardware strategy → computing requirements → selection of a system → system options → tuning of the system → system configuration

All too often, the strategy is chosen by default, the selection is made on the basis of sales effectiveness, and the tuning is something called "meeting the budget."

Development of the hardware strategy

Many computer departments don't even realize that different strategies exist for computing. This is not to say that they don't use a strategy; rather that they don't know it and haven't consciously selected a strategy. The hardware strategy depends on having an understanding of the general needs of the computer department. The needs for security, reliability, independence, centralization of employees, type of computing to be done, amount of computing, etc., must be formulated in general terms before a strategy decision can be made. There are many possible ways to arrange computing equipment, and they each have advantages, disadvantages, and, as usual, different costs. The problem is to pick the strategy which responds to the aggregate of the general needs.

Perhaps some examples can best demonstrate the essence of a computing strategy. A large oil company having both significant scientific and business processing decides to separate the two applications onto two machines with each machine chosen for its performance/cost in one of the two specialized domains. A highly decentralized company installs one large economical general purpose computer but with remote terminals each of which is capable of performing significant independent processing when not being used as a terminal. A highly centralized company installs two large mirror-image general purpose computers with remote terminals which are efficient in teletransmission. This is one area where the in-house computer department is not exactly like an external supplier of services, for the system strategy must reflect the general needs, and constraints, of the whole corporation.

Selection of a system

After the strategy is known, it becomes possible to better analyze and formulate the computing needs in terms of the chosen strategy. This usually results in a formal specification of computing requirements which includes workload projections for the expected life of the system. This is not a trivial task and will consume time, but the service rendered by the eventual system will directly depend on the quality of this task. Once an anticipated workload is defined, one is free to utilize one, or a combination, of the methods commonly used for evaluating computer performance. Among these are simulation, benchmarks, and technical expert analysis. One key decision will have a great influence on the results of the system selection: is a complete manufacturer demonstration to be required? This question should not be answered hastily, because a demonstration requires completely operational facilities, which may guarantee that the computer department will get yesterday's system, tomorrow. On the other hand, not having a demonstration requirement may bring tomorrow's most advanced system, but perhaps late and not quite as advanced as expected.
In any case, some methodology of system selection is required, if only to minimize the subjectivity which is so easily disguised behind technical jargon.

Tuning of the system

The winner of the hardware selection should not be allowed to start to take advantage of the computer department once the choice is made. On the contrary, the computer department is now in its strongest position as the parameters are much better defined. One more iteration on the old specifications of requirements can now be made in light of the properties of the selected system. Also, an updating of the workload estimates is probably in order. Armed with this information, the computer department is now ready to do final battle to optimize the utilization of the system. This optimization involves more than just configuring the hardware. It is a fine tuning of the computing environment. Take an example. As a result of the characteristics of the selected computer system, it might turn out that the mix of jobs "required" during the peak hours dictates that the expensive main memory be 50 percent larger than at any other time. Informing the clients of this fact, and that the additional memory cost will naturally be spread over their peak period jobs, will usually determine if all the requirements are really this valuable to the client. The client has the right to be informed of problems that will directly affect his service or costs. Only he can evaluate them and decide what is best for him.

Tuning of the environment involves selecting the best technical options, fully exploiting the potential of the computing configuration, and otherwise varying the parameters available. The trick is to examine all the parameters in the environment, not just the technical ones. This tuning process should be made, on a periodic basis, to insure that the environment remains as responsive as possible to the current needs.

PROPOSAL-AN ORGANIZATIONAL STRUCTURE FOR THE IN-HOUSE COMPUTER DEPARTMENT

It may not be possible to organize every computer department in the same manner, but some orientation should be found which would minimize the lateral dependencies in the organization. Perhaps a division of responsibilities based on the time perspective would be useful. Something as simple as a head office with three sections for long-range, medium-range, and short-range tasks could minimize lateral dependencies and still allow for exploitation of innate flexibility. In the language of the computer department, these sections might be called the head office, planning, projects, and operations, as shown in Figure 1.

Figure 1 (organization chart: head office with planning, projects, and operations sections)

The head office

There are three functions which must be performed by the head office. These functions are those which encompass all of the other sections and are integral to the computer department. The first, and most important, of the functions for the head office is certainly marketing and client relations. All aspects of a service's life cycle involve the customer and he must be presented with a common point of contact on the business level. Every client should feel that he has the attention of city hall for resolving problems. In the opposite direction, the three sections should also use the head office for resolving conflicts or making decisions which affect the clients. The second function of the head office is to control the life cycle of a service. As a service matures from definition to development to operations, it will be passed from one section to another.
This phasing avoids requiring the technical people to change their outlook and skills to match the changes in the maturing process, but may create problems as a service is passed from hand to hand. Only the head office can control the process. Resource control is the last function of the head office. The allocation of the various resources is an unavoidable responsibility and must reflect the changing requirements of the computer department.

The planning section

This is the long-range oriented group which must combine technical and market knowledge to plan for the future. The time orientation of this section will vary from company to company, but any task which can be considered as being in the planning phase is included. Among the planning tasks is the development of long-range strategy. This strategy must be founded on a knowledge of expected customer needs (market research), advances in technical capabilities (state-of-the-art studies), and constraints on the computer department (corporate policy). Development of an equipment strategy is a good example of this task. Another planning function is the developing of functional specifications for potential new services. In this respect, the planning section directly assists the head office in defining new services for clients. Lastly, the planning section assists the projects section by providing state-of-the-art techniques which can be used in developing a specified service.

The projects section

This section has responsibility for the tasks in the computer department which are between planning and operation. Included is both development of services and changes in the technical facilities. The time orientation is limited for each task and each task is executed in a project approach. A permanent nucleus of specialists exists to evaluate and implement major changes in the equipment. Each such major change is limited in its scope and accomplished on a project basis. Development of services is naturally a task for the projects section. Each such project is performed by a team composed of people from the permanent nucleus and from the other two sections. The leadership comes from the projects section to insure that the project philosophy is respected, but utilization of personnel from the other sections assists in the transitions from planning to projects and from projects to operations. This latter transition from development to operations is a part of the third function of the projects section. Direct aid is given to the operations section to insure that project results are properly understood and exploited in the day-to-day operations.

The operations section

Here is the factory. The time orientation is immediate. There are five major tasks to be performed, each of which is self-evident.

• Day-to-day production of the services,
• Accounting, analysis and control of production costs,
• Installation and acceptance of new facilities,
• Maintenance of all facilities (this includes systems software and client services),
• Recurring contact, training, and aid to the clients in use of the services.

TWO EXAMPLES

Perhaps the functioning of this organization can be demonstrated by an example from each of the two major areas of services and resources. The life cycle of a service may begin either in the planning section (as a result of market research) or in the head office (as a result of sales efforts). In any case, the definition of the service is directed by the head office and performed by the planning section.
Once the contract is signed, the responsibility passes to the projects section and the project team is built for the development effort. On the project team there will be at least one member from the planning section who is familiar with the definition of the service. The operations section also contributes personnel to facilitate the turnover at the end of the project. Other personnel are gathered from the permanent nucleus and the sections as needed. Each project member is transferred to the projects section for the life of the project. The service is implemented, turned over to the operations section, and the project team is disbanded. Daily production and maintenance are performed by the operations section, as is the periodic overhaul of the system. Each change of sections and all client contacts are under the control of the head office.

For resource utilization a close parallel exists. The head office again controls the life cycle. As an example, take the life cycle of a computer system. The planning section would develop a strategy of computing which would be approved by the head office. When the time arrived for changing the computer system, the projects section would define a project and combine several temporary members and permanent nucleus personnel to form the project team. A computer system selection would be made in line with the strategy of computing, and the system would be ordered. The operations section would be trained for the new system and accept it after satisfactory installation. Periodic tuning of the computer system would be done by permanent personnel in the projects section with the cooperation of the operations section. The flow of responsibility for these two examples is represented by Figure 2.

Figure 2 (flow of responsibility from the clients and head office through service definition, development, and production, and through resource strategy, selection, and utilization/optimization)

SUMMARY

Excepting those cases where the product of a company contains a computer component, the in-house computer department is in the business of providing an external service to the integral functions of a non-computer business. For this reason, the computer department does not appear to mesh well on an organizational chart of the departments which do directly contribute to the product line of the corporation. However, a well-founded in-house computer department which depends on its users for funds and on itself for the optimizing of the resources provided by these funds can peacefully serve within the organization. The computer department can respond to these two principles of funding and resource control by recognizing that its funds depend on the satisfaction of the users and that the optimizing of the use of these funds can be aided by organizing around the life cycles of both the services provided and the resources used. One possible organization designed to fulfill these two goals is composed of a head office and three sections. The head office maintains continuing control over the client relationship and over the life cycle of both services and resources. Each of the three sections specializes on a certain phase of the life cycle: definition, development, and operation. Such an organizational approach for the computer department should provide:

• Computer services which are responsive to, and justified by, the needs of the users,
• A controlled and uniform evolution of the life cycle of both services and resources,
• A computer department management oriented towards dealing with technical considerations on a business basis,
• Technical personnel who are client-oriented specialists and who are constantly challenged and matured by dealing with different problems from different frames of reference,
• An in-house computer department which is self-supporting, self-evaluating, and justified solely by its indirect contributions to the total productivity of the corporate efforts.

A computer center accounting system

by F. T. GRAMPP
Bell Telephone Laboratories, Incorporated
Holmdel, New Jersey

INTRODUCTION

This paper describes a computer center accounting system presently in use at the Holmdel Laboratory and elsewhere within Bell Telephone Laboratories. It is not (as is IBM's SMF, for example) a tool which measures computer usage and produces "original" data from which cost-per-run and other such information can be derived. It is, rather, a collector of such data: it takes as input original run statistics, storage and service measurements from a variety of sources, converts these to charges, and reports these charges by the organizations (departments) and projects (cases) which incur them.

"DESIGN CRITERIA," below, outlines the overall functions of the system and describes the design criteria that must be imposed in order to assure that these functions can be easily and reliably performed. The remainder of this paper is devoted to a somewhat detailed description of the data base (as seen by a user of the system) and to the actual implementation of the data base. Of particular interest is a rather unusual means of protecting the accounting data in the event of machine malfunction or grossly erroneous updates. Finally, we describe backup procedures to be followed should such protection prove to be inadequate. A description of the system interface is given in the Appendix for reference by those who would implement a similar system.

DESIGN CRITERIA

Many factors were considered in designing the system described here. The following were of major importance:

Cost reporting

Reporting costs is the primary function of any accounting system. Here, we were interested in accurate and timely reporting of charges by case (the term "case" is the accounting term we use for "project" or "account"), so that costs of computer usage to a project would be known, and by department, to ascertain the absolute and relative magnitude of computer expenses in each organization. These orders of reporting are not necessarily identical, or even similar. For example, the cost of developing a particular family of integrated circuits might be charged against a single case, and computer charges for this development might be shared by departments specializing in computer technology, optics, solid state physics, and the like. Similarly, a single department may contribute charges against several or many cases; a good example of this is a drafting department.

Original charging information is associated with a job number, an arbitrary number assigned to a programmer or group of programmers, and associated with the department for which he works, and the project he is working on. This job number is charged to one case and one department at any given point in time; however, the case and/or department to which it is charged may occasionally change, as is shown later.

Simplicity of modification

One thing that can be said of any accounting system is that once operational, it will be subjected to constant changes until the day it finally falls into disuse. This system is no exception. It is subjected to changes in input and output data types and formats, and to changes in the relationships among various parts of its data base. Response to such changes must be quick and simple.

Expansion capability

One of the more obvious unknowns in planning a system of this type is the size to which its data base may eventually grow. On a short term basis, this presents no problem: one simply allocates somewhat more storage than is currently needed, and reallocates periodically as excess space begins to dwindle. Two aspects of such a procedure must, however, be borne in mind: First, the reallocation process must not be disruptive to the day-to-day operation of the system. Second, there must be no reasonably foreseeable upper limit beyond which reallocation eventually cannot take place.

Protection

Loss of, say, hundreds of thousands of dollars worth of accounting information would at the very least be most embarrassing. Thus steps must be taken in the design of the system to guarantee insofar as is possible the protection of the data base. Causes of destruction can be expected to range from deliberate malfeasance (with which, happily, we need not be overly concerned), to program errors, hardware crashes, partial updating, or operational errors such as running the same day's data twice. If such dangers cannot be prevented, then facilities which recover from their effects must be available.

Continued maintenance

The most important design criterion, from the designer's point of view, is that the system be put together in such a way that its continued maintenance be simple and straightforward. The penalty for failure to observe this aspect is severe: the designer becomes the system's perpetual caretaker. On the other hand, such foresight is not altogether selfish when one considers the problems of a computer center whose sole accounting specialist has just been incapacitated.

THE DATA BASE: LEVEL 1

There are two ways in which to examine the data base associated with the accounting system. In the first case, there is its external appearance: the way it looks to the person who puts information into it or extracts information from it. Here, we are concerned with a collection of data structures, the way in which associations among the structures are represented, and the routines by means of which they are accessed. In the second, we look at its internal appearance: Here, we are interested in implementation details, in particular those which make the system easily expansible and maintainable, and less vulnerable to disaster. These two aspects of the data base are, in fact, quite independent; moreover, to look at both simultaneously would be confusing. For this reason, we shall consider the first here, and defer discussion of the second to a later part of this paper. We first examine the structures themselves.
Tally records

Accounting system data is kept on disk in structures called tally records. Since we are concerned with data pertaining to cases, departments and job numbers, we have specified a corresponding set of tally records: Case Tally Records, Department Tally Records and Job Tally Records, respectively. These will be abbreviated as CTRs, DTRs and JTRs. In each tally record is kept the information appropriate to the particular category being represented. Such data fall naturally into three classes: fiscal information-money spent from the beginning of the year until the beginning of the present (fiscal) month; linkage data-pointers to associated records; other data-anything not falling into the other two categories.

For example, a CTR contains fiscal and linkage information: charges (a) up to and (b) for the current fiscal period, and a pointer to a chain of JTRs representing job numbers charged to the CTR's case. A DTR's content is analogous to that of a CTR; the exception is the inclusion of some "other" data. When we report charges by case, the entire report is simply sent to the comptroller. Department reports, however, are sent to the heads of individual departments. To do so, we require the names of the department heads, and their company mailing addresses; hence the "other" data. A JTR contains considerably more information: in addition to the usual fiscal and linkage information, a JTR contains pointers to associated case and department, data identifying the responsible programmer, and a detailed breakdown of how charges for the current month are being accumulated.

There is no way of determining a priori those things which will be charged for in order to recover computer center costs. In the olden days (say, 10 years ago) this was no problem: one simply paid for the amount of time he sat at the computer console. With today's computers, however, things just aren't that simple, since the computer center is called upon to provide all sorts of computing power, peripherals and services, and in turn, must recover the costs of said services from those who use them. Thus one might expect to find charges for CPU time, core usage, I/O, tape and disk storage rental, mounting of private volumes, telephone connect time, and so on. Add to this the fact that the charging algorithm changes from time to time, and it quickly becomes apparent that the number and kinds of charging categories simply defy advance specification. Further, it seems clear that a given resource need not always be charged at the same rate-that in fact the rate charged for a resource should be a function of the way in which the resource is being used. For example, consider a program which reads a few thousand records from a tape and prints them. If such a program were to be run in a batch environment, in which printed output is first spooled to a disk and later sent to a high speed printer, one would expect the tape drive to be in use for only a matter of seconds. If the same program were to be run in a time-shared environment, in which each record read was immediately shipped to a teletype console for printing, the drive might be in use for several hours. If the computer center's charging algorithm is designed to amortize the rental costs of equipment among the users of the equipment, the latter use of "tape" ought to be considerably more expensive than the former, even though the same amount of "work" was done in each case.

For these reasons, we chose to make the process table-driven. In this way, new charging categories can be added, old ones deleted, and rates changed simply by editing the file on which a rate table resides. Such a scheme has the obvious drawback of requiring a table search for each transaction with the system, but the inefficiencies here are more than compensated by the ability to make sweeping changes in the charging information without having to reprogram the system. Our rate table is encoded in such a way that it may be thought of as a two dimensional matrix. One dimension of the matrix consists of the services offered by the computer center: batch processing (in our case, an ASP system), time shared services, data handling (a catch-all category which includes such things as tape copying, disk pack initialization and the like), storage rental, and sundry others. The other dimension consists of the usual computer resources: CPU time, core, disk and tape usage, telephone connect time, etc.

When a user incurs a charge, it is recorded in his JTR as a triple called a "chit." The chit consists of a service name, such as "ASP," a resource name, such as "CPU," and the dollar amount which he has been charged. In this implementation, each chit occupies twelve bytes:

BYTE:    0        4        8
         SERV     RES      COST

These chits are placed in an area at the end of the JTR. Initially, the area is empty. As time progresses and charges are accumulated, the number of chits in the JTR grows each time the job number is charged for a service-resource combination that it hasn't used before. The JTR itself is of variable length, and open-ended "to the right" to accommodate any number of chits that might be placed there.

Linkages

There are, in general, two ways in which one accesses information in the data base. Either one knows about a job number, and applies a charge against it and its associated case and department, or one knows about a case or department number and desires to look at the associated job numbers. This implies that there must be enough linkage information available for the following:

(a) Given a job number, find the case and department to which that number is charged.
(b) Given a case or department number, find all of the job numbers associated with that case or department.

The first case is trivial: one simply spells out, in a JTR, the case and department to which the job number is charged. The second case is somewhat more interesting in that there may be one, or a few, or even very many job numbers associated with a single case or department. At Holmdel, we have the worst of all possible situations in this regard, in that the large majority of our cases and departments have very few job numbers associated with them, whereas a certain few have on the order of a hundred job numbers. Viewed in this light, schemes such as keeping an array of pointers in a CTR or DTR are, to say the least, unattractive because of storage management considerations. What we have chosen to do, in keeping with our philosophy of open-endedness, is to treat the case-job and department-job structures as chains, and using the CTRs and DTRs as chain heads, operate on the chains using conventional list processing techniques. In our implementation, a case-job chain (more properly, the beginning of it) appears in a CTR as a character field containing a job number charged to that case.
In the JTR associated with that job number, the chain is continued in a field which either contains another job number charged to the same case, or a string of zeros, which is used to indicate the end of a chain. Fields in the DTR and JTR function analogously to represent department job chains. 108 Fall Joint Computer Conference, 1972 will fit in the core storage currently allocated the index. RO,MRT: are not relevant at this time, and will be discussed later. Traversing such a chain (as one frequently does while producing reports) is quite simple: begin at the beginning and get successive JTRs until you run out of pointers; then stop. Inserting a new job number into a case- or department-job chain is also straightforward: copy the chain head into the chain field in the new JTR; then point the CTR or DTR to the new JTR. Deletion of JTRs from the system is accomplished by means of similar "pointer copying" techniques. Entries are of the form (P1,P2,NAME), where PI and P2 are 31-bit binary numbers pointing to records in a direct access data set, and NAME is a character string of appropriate length containing a case, job or department number. Indices A ccessing techniques As was previously mentioned, the job numbers that are assigned to users are arbitrary. They happen, in point of fact, to be sequential for the most part, but this is simply a matter of clerical convenience. The only convention followed by case and department numbers is that (as of this writing) they are strictly numeric. This implies the necessity of a symbol table to associate names: case, department and job numbers, with their corresponding tally records on disk. Three types of symbol table organization were considered for use with this system: sequential, in which a search is performed by examining consecutive entries; binary, in which an ordered table is searched by successively halving the table size; hash, in which a randomizing transformation is applied to the key. Of these, the sequential search is simply too slow to be tolerated. While the hashing method has a speed advantage over the binary method, the binary method has a very strong advantage for our application, namely, that the table is ordered. One of the functions of the accounting system is that of producing reports, which are invariably ordered by case, department or job number. The ordering of the indices facilitates the work of many people who use the system. In this implementation, there are three indices: one for cases, one for departments, and one for job numbers. These will be abbreviated CDX, DDX and JDX, respectively. Each index consists of some header information followed by pairs of names and pointers to associated tally records. Header information consists of five items: Two types of access to the data base are required. The first is the programmer's access to the various structures and fields at the time he writes his program. The second is the program' 8 access to the same information at the time the program is run. The choice of PL/I as the language in which to write the system was, oddly enough, an easy one, since of all of the commonly available and commonly used languages for System/360, only PL/I and the assembler have a macro facility. Using assembly language would make much of the code less easily maintainable, and thus PL/I won by default. The macro facility is used solely to describe various data base components to the routines that make up the accounting system by selectively including those components in routines which use them. 
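Before moving on, the chain scheme just described can be made concrete with a short sketch. The fragment below is a modern Python illustration of the idea rather than the system's PL/I: records are reduced to in-core dictionaries, the CJCH field name is borrowed from the Appendix, and the remaining field names, the eight-character all-zeros end marker, and the helper names are illustrative assumptions only.

    # Sketch of the case-job chain: the CTR holds the first job number charged
    # to the case, and each JTR names the next job number on the chain.
    # An all-zeros string marks the end of a chain, as in the paper.

    END_OF_CHAIN = "00000000"

    # Toy in-core stand-ins for records the real system reads from disk.
    ctrs = {"1234": {"CJCH": "JOB1"}}                        # case tally records
    jtrs = {
        "JOB1": {"case": "1234", "case_chain": "JOB2"},      # job tally records
        "JOB2": {"case": "1234", "case_chain": END_OF_CHAIN},
    }

    def jobs_charged_to_case(case):
        """Traverse the chain: start at the CTR head, follow the JTR links."""
        job = ctrs[case]["CJCH"]
        while job != END_OF_CHAIN:
            yield job
            job = jtrs[job]["case_chain"]

    def insert_job(case, new_job):
        """Insert at the head: copy the old head into the new JTR, then repoint the CTR."""
        jtrs[new_job] = {"case": case, "case_chain": ctrs[case]["CJCH"]}
        ctrs[case]["CJCH"] = new_job

    insert_job("1234", "JOB3")
    print(list(jobs_charged_to_case("1234")))   # ['JOB3', 'JOB1', 'JOB2']

Deletion works by the same kind of pointer copying: the predecessor's chain field is simply overwritten with the departing JTR's chain field.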
Further, all references to these components are made via the macros. Adoption of this strategy has two somewhat related advantages: First,. it forces consistent naming of data items. Without the macros, one programmer would call a variable "X", another would call it "END-OFMONTH-TOTAL", and so on. This, at least, would happen, and worse can be imagined. Second, should there be a change in a structure, all of the programs that use the structure must be recompiled. If the macros are used, the change can be made in exactly one place (the compile-time library) before recompilation. Run-time access to the data base is achieved by following simple conventions and by using routines that have been supplied specifically for this purpose. These conventions are simple because they are few and symmetric. The data base consists of six structures: the three indices, ~nd the three types of tally records. None of these structures are internal to a program that interfaces with the data base. All of them are BASED, that is, located by PL/I POINTER VARIABLES which have been declared to be EXTERNAL so that they will be known to all routines in the RL: TN: TMAX: Record Length for the tally record. This is needed by the PL/I and 08/360 InputOutput routines. The number of entries currently in the index. The maximum number of entries which A Computer Center Accounting System system. Thus, for example, a program that accesses the JDX would contain the following declarations: DCL 1 JDX BASED(PJDX), /* The JDX is defined, and */ % INCLUDE JDX; /* its detailed description */ DCL PJDX POINTER EXTERNAL; /* called from the library. */ The same convention applies to all of the other structures: they are allocated dynamically and based on external pointers whose names are the structure names prefixed by "P". A more detailed description of the user interface is given in the Appendix. The foregoing implies that there is a certain amount of initialization work to be done by the system: setting pointers, filling indices and the like. This is, in fact, the case. Initialization is accomplished by calling a routine named INIT, usually at the start of the PL/I MAIN program. Among its other functions, INIT: (a) Opens the accounting files. These include the six files containing the indices and tally records. Also opened are the file which contains the rate table, and a file used for JTR overflow. (b) Allocates space for the indices and tally records, then sets pointers to the allocated areas. (c) Reads into core the indices and the rate table, then closes these files. Some unblocking is required here both because the designers of PL/I (and indeed, of OS/360) have decreed that records shall not exceed 32,756 bytes in length, and because short records make the data base accessible to time shared programs running on our CPS system. Once INIT returns control, the operating environment for the accounting system has been established. Indices are in core, and can be accessed by conventional programming techniques or by using the SEARCH, ENTER and DELETE routines, provided. Reading and writing. of tally records is also done by system routines, these being: RDCTR RDDTR RDJTR WRCTR WRDTR WRJTR The read-write routines all require two arguments-a character string containing the name of the tally record to be read or written, and a logical variable which is set to signal success or failure to the caller. Actual data 109 transfer takes place between a direct access data set and a based "TR" area in core. 
A typical example of the use of these routines is: CALL RDJTR(MYJOB,OK); IF --, OK THEN STOP; Two higher level routines, FORMJTR and LOSEJTR, are available for purposes of expanding or contracting the data base. FORMJTR examines the contents of JTR. If the JTR seems reasonable, that is, if it contains a case and department number, and its chain pointers are explicitly empty (set to zero) it performs the following functions: (a) Checks to see if an appropriate CTR and DTR exist. If not, it creates them. (b) Writes the JTR. (c) Includes the JTR in the linkage chains extending from the CTR and DTR. LOSEJTR performs exactly the inverse function, including deleting CTRs and DTRs whose chains have become empty as a result of the transaction. INTERFACING WITH THE SYSTEM Activities involving the system fall into four general categories: creating the data base, modifying the existing data base, inputting charges, and producing reports. Creating the data base No utility is provided in the system for the express purpose of creating the data base, because the form and format of previously extant accounting information varies widely from one Bell Laboratories installation to the next. A program has to be written for this purpose at each installation; however, the system routines provided are such that the writing of this program is a straightforward job requiring a few hours' work, at most. Briefly, creation of the data base proceeds as follows: (a) Estimates are made of data set space requirements. These estimates are based on the number of cases, departments and job numbers to be handled, and on the direct access storage device capacities as described in IBM's Data Management! publication. Data sets of the proper size are allocated, and perhaps catalogued, using normal OS/360 procedures. (b) An accounting system utility named RESET is 110 Fall Joint Computer Conference, 1972 then run against the files. RESET initializes the indices so that they can communicate their core storage requirements to the system. No entries are made in the indices. (c) The aforementioned installation-written routine is run. This routine consists of a two step loop: read in the information pertinent to a job number and construct a JTR; then call FORMJTR. (d) At this point, the data base is installed. A program called UNLOAD is run so that a copy of the data base in its pristine form is available for backup purposes. Modifying the data base Two types of data base modifications are possible: those which involve linkage information, and those which do not. The latter case is most easily handledan EDITOR is provided which accepts change requests in a format similar to PL/l's data directed input, then reads, modifies and rewrites a designated tally record. The former case is not so simple, however, and is broken down into several specific activities, each of which is handled by an accounting system utility supplied specifically for that purpose. Authorizing new job numbers and closing old ones is done by a program called AUTHOR. This program adds a new entry to the data base by calling FORMJTR, and closes a job number by setting a "closed" bit to "I" in its JTR. Note that closed job numbers are not immediately deleted from the system. Deleting closed job numbers is done once per year with an end-of-year program designed for that purpose. At this time, DTRs and CTRs which have no attached JTRs are also deleted from the system. 
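The installation-written creation routine of step (c) is, in outline, just the two-step loop described above: read whatever legacy accounting records exist, build a JTR, and hand it to FORMJTR. The following is a speculative Python rendering of that loop, not the actual PL/I; the input format assumed in read_legacy_job_records, the field names, and the form_jtr argument are invented stand-ins for an installation's own data and for the system's FORMJTR routine.

    # Hypothetical sketch of data base creation: build a JTR per job number and
    # pass it to a FORMJTR-like routine, which (per the paper) creates any
    # missing CTR/DTR and links the new JTR into the case and department chains.

    def read_legacy_job_records(path):
        """Stand-in for the installation's own input format: one job per line,
        'jobno caseno deptno programmer' separated by blanks."""
        with open(path) as f:
            for line in f:
                job, case, dept, programmer = line.split()
                yield {"job": job, "case": case, "dept": dept,
                       "programmer": programmer,
                       "old_charges": 0, "new_charges": 0,
                       "case_chain": "00000000",   # chains explicitly empty,
                       "dept_chain": "00000000",   # as FORMJTR expects
                       "chits": []}

    def create_data_base(path, form_jtr):
        """Two-step loop: construct a JTR, then call the FORMJTR equivalent."""
        for jtr in read_legacy_job_records(path):
            ok = form_jtr(jtr)
            if not ok:
                raise RuntimeError("FORMJTR rejected job " + jtr["job"])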
Changing the case or department number to which a job number is charged may be done in either of two ways. It is best to illustrate these by example. In the first case, consider a department which has been renamed as a result of an internal reorganization. Its department number has been changed, say from 1234 to 5678, yet its work and personnel remain the same. In this case, it is desirable to delete "1234" from the DDX, install "5678", and change all "1234" references in the department-job chain to "5678". As a second example, consider the case of a job number which was used by department 2345 but is now to be used by department 6789 due to a change in departmental work assignments.· On the surface, this seems to be a matter of taking the job number out of 2345's chain and inserting it into 6789's. Unfortunately, it isn't that simple. The charge fields in a chain, if added, should be equal to the field in the DTR at the chain head. Simply moving a JTR from one chain to another will make the old chain's fields sum low, and the new chain's fields sum high. The obvious solution to this problem is to forbid the changing of charged departments-i.e., to require that in the event that such a change is desired, the old job number be closed, and a new one authorized. Such a solution is not a very popular one, since job numbers have a habit of becoming imbedded in all sorts of hardto-reach places--catalogued procedures, data set names and the like. Furthermore, it has been our experience that programmers develop a certain fondness for particular job numbers over a period of time and are somewhat reluctant to change them. Our solution, then, is as follows: Given a job number, say 1234, whose charged department is to be reassigned, open a new job number, say 1234X, whose name was not previously known to the system, and which is charged to the proper department. Then close the old job number, and proceed to exchange n~mes in the JTRs, and linkage pointers in the respective chains. A utility called SWAP is available which permits renaming or reassignment of either departments or cases (or both). Inputting charges As might be expected from our previous discussion of charging categories, there are many inputs to the accounting system. Moreover, the input formats are quite diverse, and subject to constant change. In order that the people charged with maintaining the accounting system might also be able to maintain their own sanity, it was necessary to design a simple way of incorporating new sources of charging information into the system. Our first thought was to design a "general purpose input processor" i.e., a program that would read a data description and then proceed to process the data following (in this case, charge records). This approach was quickly abandoned for two reasons. First, the data description language required to process our existing forms of charge records would be quite complicated and thus difficult to learn and use, if in fact it could be implemented at all. Second, for each class of input charges, there is a certain amount of validity checking that can be performed at the time the charge records are read. Such checking need not be limited to a single record-for example, if it is known that a certain type of input consists of sequentially numbered cards, then a check can be made to determine whether some cards have been left out. A Computer Center Accounting System Our approach was as follows. 
For each type of charge record used by an installation, an input program must be written. This input program reads a charge record, does whatever checking is possible, constructs a standard structure consisting of a job number, service name, and one or more resource-quantity pairs, and passes this structure to aprogram called CHARGE. CHARGE does the rest. It brings in the appropriate JTR, converts the quantities in the resource-quantity pairs to dollar charges via factors contained in the rate table, charges the JTR, adding chits if necessary, and charges the associated CTR and DTR. The important point here is that the writer of an input program is allowed complete freedom with respect to formats and checking procedures, while he is also allowed almost complete naivete with respect to the rest of the system. Reporting The system includes programs to produce three "standard" reports: one, (by cases) to be sent to the comptroller, one (by departments) to be sent to department heads, and a third (by job number) to be sent to the person responsible for each active job number in the system. The comptroller's report is required of the computer center, and its format was specified in detail by the comptroller. The other two reports were designed to give the users of the computer center a clear and easily readable report of their computer usage in as concise a form as possible. The department report shows old and recent charges to the department, followed by a list of job numbers being charged to that department. Accompanying each job number are its charges, the case to which it is charged, and the name of the person responsible for it. A more detailed breakdown is certainly possible; the average department head, however, usually doesn't want to see a breakdown unless something looks unusual. In that case, the programmer responsible for the unusual charges is probably his best source of information. The user's report shows old and new charges for a job number, together with a detailed breakdown of the new charges by service-resource pairs. Its use to the programmer is threefold: it satisfies his curiousity-it enables him, in some cases, to detect and correct uneconomical practices-and it enables him to supply more detailed information to his department head should the need arise. In order to produce the user's report, all of the chits in all of the JTRs in the system must be scanned. Dur- 111 ing the scanning process, it is a trivial matter to maintain a set of "grand totals" showing the money recovered by the computer center in terms of all serviceresource categories. This valuable "by-product" is published after the user reports have been generated. More specialized reporting is possible, but these programs, by their nature, are· best written by particular installations rather than distributed as a part of the accounting system package. As was mentioned earlier, the ordering of the indices greatly facilitates the writing of such programs. THE DATA BASE: LEVEL II The foregoing discussion of the data base was aimed at the user of the system, and thus said nothing about its structure in terms of physical resources required, and the way in which these resources are used. We now expand on that discussion, concentrating on those factors influencing expansibility and protection. The main features of interest here are the implementation of tally record storage, the indices, and the provision to handle variable length JTRs. Free storage pools CTRs, DTRs and JTRs are stored on direct access data sets. 
When it is desired to access a tally record, a search of the appropriate index is made, and a relative record number on which the tally record is written is obtained from the index and used as the KEY in a PL/I read or write statement. The interesting feature of the system is that there is no permanent association between a particular tally record and a particular relative record number. Direct access records used to contain tally records are stored in linked pools. The RO entry in the appropriate index head points to the first available link, that link points to the second, and so on. One can think of the initial condition of a pool (no space used) as follows: RO contains the number 1, record # 1 contains the number 2, etc. When a link is needed for tally record storage, a routine called GETLINK detaches the record pointed to by RO from the free pool by copying that record's pointer into RO. The record thus detached is no longer available, and its number can be included into an index entry. A second routine called PUTLINK performs exactly the inverse function. These activities are well hidden from the users of the system. The obvious advantage of the casual user not seeing the list processing operations is that he won't 112 Fall Joint Computer Conference, 1972 be confused by them. The disadvantage is that when he runs out of space and allocates a larger data set, he will forget to initialize the records by having the nth unused record point to the n + 1st, as above. On the assumption (probably valid) that the data base will continue to grow over long periods of time, we have simplified the initialization procedure as follows: (a) Let RO point to the first available record (initially 1) and MRT point to the maximum record ever taken (initially 0). (b) When GETLINK calls for a record, compare RO and MRT. If RO>MRT then the data base is expanding into an area it never used before. In this case, set MRT=MRT+1. Otherwise, the record specified by RO has been used in the past and has since been returned by PUTLINK, in which case we proceed as before. With this procedure, initialization of the pools is done at the time that the data base is first created. Subsequent reallocation of data sets for purposes of enlarging the storage area is done as per standard OS / 360 practice. Indices and protection Recall that although an index entry can be thought of as consisting of a pair (P ,NAME) where P is a record number, and NAME is some character string, the entries are in fact represented as triples (Pl,P2,NAME). At the time that an index is read into core by INIT, all of the PIs contain record numbers, while all of the P2s contain O. Reading and writing of the tally records is done as follows. For reading: (a) If the P2 entry is non-zero, read record # P2. (b) Otherwise, read record # Pl. And for writing: (a) If the P2 entry is zero, call GETLINK for a free record number, and copy it into P2. (b) Write record # P2. At the conclusion of a run in which the data base is to be updated, the main program, which had caused the operating environment to be established by calling INIT, now calls a routine named FINI, which in turn: (a) Exchanges the P2 and PI index entries in all cases where P2 is non-zero. (b) Returns surplus links to the pool via PUTLINK. (c) Rewrites the indices in more than one place. (d) Closes all of the files. Such a strategy offers both protection and convenience. Clearly, the danger of partial updating of the files during a charging run is minimized. 
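The P1/P2 discipline amounts to a shadow-write scheme: during a run, updated tally records go to freshly allocated links recorded under P2, and only FINI promotes them to P1, so a crash in mid-run leaves the old records and the old index intact. The sketch below is a modern Python illustration of that idea together with the RO/MRT trick, not the PL/I implementation; the Pool class and the function names are assumptions made for the sketch, with the direct access data set replaced by an in-core dictionary.

    # Reads prefer P2 (the shadow copy) if one exists; the first write of a run
    # allocates a fresh record via a GETLINK-like call and stores it under P2;
    # the FINI-like commit promotes P2 to P1 and recycles the superseded link.

    class Pool:
        """Free-storage pool with the RO/MRT lazy-initialization trick."""
        def __init__(self):
            self.records = {}        # stand-in for the direct access data set
            self.free = []           # links returned by put_link
            self.mrt = 0             # maximum record ever taken

        def get_link(self):
            if self.free:            # reuse a previously returned record
                return self.free.pop()
            self.mrt += 1            # else expand into never-used territory
            return self.mrt

        def put_link(self, recno):
            self.free.append(recno)

    pool = Pool()
    index = {}                       # name -> [p1, p2]

    def read_record(name):
        p1, p2 = index[name]
        return pool.records[p2 if p2 else p1]

    def write_record(name, data):
        p1, p2 = index[name]
        if not p2:                   # first write this run: make a shadow copy
            p2 = pool.get_link()
            index[name][1] = p2
        pool.records[p2] = data

    def fini():
        """Commit: promote shadow copies and recycle the superseded links."""
        for name, (p1, p2) in index.items():
            if p2:
                index[name] = [p2, 0]
                if p1:
                    pool.put_link(p1)

    # Usage: create a record, update it during a "run", then commit.
    index["MYJOB"] = [pool.get_link(), 0]
    pool.records[index["MYJOB"][0]] = {"new_charges": 0}
    write_record("MYJOB", {"new_charges": 12.50})
    fini()
    print(read_record("MYJOB"))      # the committed copy

A program that never calls the commit step leaves the index untouched, which is the property the paper exploits for read-only reporting and debugging runs.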
Indeed, our standard operating instructions for those who run the system state that a job which crashes prior to completion is to be run a second time, unchanged. Further, a program that doesn't call FINI will not update the accounting files. Included in this category, besides "read only" reporting programs, are debugging runs, and input programs which contain algorithms to test the validity of incoming data, and which may not modify the files. Variable length records The JTRs, because of the fact that they can contain an unpredictable number of chits, are variable in length. Overflow records are obtained via GETLINK to extend the JTR as far as required. As read into storage by RDJTR, the overflow links are invisible to the user. Besides the obvious convenience, the overflow handling in the JTR offers a different, if not devious type of protection. In the case of a system such as this, where the number of charging categories is, for practical purposes, unlimited, there is always the temptation to make the charging breakdown finer, and finer, and finer. Succumbing to this temptation gives rise to nasty consequences. Processing time and storage space increase but the reports from the system become more voluminous, hence less readable, and in a sense contain less information because of the imprecision inherent in so many of the "computer usage measurement" techniques. (In this latter case, we often tend to behave analogously to the freshman physics student who measures the edges of a cube with a meter stick and then reports its volume to the nearest cubic millimicron.) By happy coincidence, it turns out that in a system with "normal" charging categories, most JTRs have relatively few chits-too few to cause overflow-while occasional JTRs require one or more overflow records. Should the breakdown become fine enough that most of the JTRs cause overflow, the cost of running the accounting system rises-not gradually, but almost as a step. Further, if the breakdown is subsequently made coarser, the excess chits, and hence the overflow records, quietly disappear at the end of the next account- A Computer Center Accounting System ing period. Thus the system is, in a sense, forgiving, and tends to protect the user from himself. BACKUP As Mr. Peachum2 aptly remarked, it has never been very noticeable that what ought to happen is what happens. In addition to our efforts to make the system crash-proof, we have also provided several levels of backup procedures. 113 to the extent that one or more of them "points to the wrong place". Although this condition is most unusual, it is also most insidious, since there is a possibility that errors of this, type can remain hidden for, perhaps, as long as a few weeks. If enough input data has been added to the data base to make it undesirable to backtrack to the point prior to that at which the initial error is suspected to have occurred, symbolic information sufficient to regenerate the pointers is contained in the data base, and routines have been provided to copy the data base, sans structure, onto a sequential file, and then to rebuild it, using FORMJTR. Backup indices ACKNOWLEDGMENT As noted previously, FINI rewrites the indices, but in more than one place. Since the "extra" copies are written first, these can be copied to the "real" index files in the event that a crash occurs while the latter are being rewritten by FIN!. Unload-reload copies Two utilities, UNLOAD and RELOAD, are supplied with the system .. 
UNLOAD copies the structured files onto tape in such a way that the structure is preserved if RELOAD copies them back. It is our present practice to take UNLOAD snapshots daily, cycling the tapes once per week, and, with a different set of tapes, monthly, cycling the tapes once per year. Since chits are deleted at the end of each month (for economy of storage) UNLOAD-style dumps are also useful if it becomes necessary to backtrack for any reason to a point in time prior to the beginning of the current month. Further, the tapes are in such a format that they are easily transmitted via data link to another installation for purposes of inspection or off-site processing. The design and implementation of the accounting system described here was completed with the help and cooperation of many people, and for this the author is truly grateful. In particular, the efforts, advice, insight and inspiration provided throughout the project by Messers. R. E. Drummond and R. L. Wexelblat assured its successful completion. REFERENCES 1 IBM system/360 operating system data management services Order Form GC26-3746 2 B BRECHT Die Dreigroschenoper APPENDIX The user-system interface The facilities provided to give the user convenient access to the data base and the routines which manipulate it can be divided into two categories: compile-time facilities and run-time facilities. 08/360 dump-restore It is the practice, in our computer center, to periodicany dump all of our permanently mounted direct access storage devices using the OS/360 Dump-Restore utility. Since the accounting files are permanently mounted, this procedure provides an additional level of safety. Reformatting The worst possible mishap is one in which the chains in the system, for one cause or another, are destroyed Compile-time facilities These consist of PL/I macro definitions describing various structures. Since the storage class of a structure (e.g., BASED, STATIC, etc.) may be different in different routines, or, where there are multiple copies of a structure, even within the same routine, the initial "DCL 1 name class," must be provided by the user. Compile-time structures include the indices (CDX, DDX, JDX) the tally records (CTR, DTR, JTR) and the rate table. 114 Fall Joint Computer Conference, 1972 2 CCUM FIXED(31) BINARY, /* Cumulative total. */ 2 CJCH CHAR(8); /* Job chain for this case. */ Example 1: DCL 1 CDX BASED(PCDX), % INCLUDE CDX; produces: Example 3: DCL 1 CDX BASED (PCDX), 2 RL FIXED(15) BINARY, /* Record Length */ 2 TN FIXED(15) BINARY, /* # of Entries */ 2 TMAX FIXED(31) BINARY, /* Max. Entries. */ 2 RO FIXED(31) BINARY, /* Pool Head */ ~2 MRT FIXED(31) BINARY, /* Max. Record Taken */ 2 VAREA(O:N REFER(CDX.TN)), /* Index Proper */ 3 PI FIXED(31) BINARY, /* Read Ptr. */ 3 P2 FIXED(31) BINARY, /* Write Ptr. */ 3 NAME CHAR(9); /* Case Number */ Since it is expected that the user will always use the system-supplied rate table (as opposed to a private copy of same): % INCLUDE RATES; produces: Example 2: DCL 1 CTR BASED (PCTR), % INCLUDE CTR; produces: DCL 1 CTR BASED (PCTR) , 2 CLNK FIXED(31) BINARY, /* Used by GETLINK. 2 CCAS CHAR(9) , /* Case charged. 2 CUNUSED CHAR(3), /* For future use. 2 COLD FIXED(31) BINARY, / * $ to last fiscal. 2 CNEW FIXED(3l) BINARY, /* Latest charges. DCL 1 RATES BASED (PRTS) , /* Rate Data */ 2 #SERVICES FIXED BIN(31), 2 TOT_RES FIXED BIN(31), /* Tot. 
# Resources */ 2 SERVICE(12) , /* Classes of Service */ 3 NAME CHAR(4) , 3 CODE CHAR(l), /* Comptroller's Code */ 3 #RESOURCES FIXED BIN(31), 3 OFFSET /* Into Res. Table */ FIXED BIN(3l), 2 RES_TABLE (120) , /* Resources */ 3 NAME CHAR (20) , 3 ABBR CHAR(4), 3 UNIT CHAR(8) , 3 RATE FLOAT DED(14); /* Per-unit */ */ */ Run-time facilities */ Routines are provided to establish and terminate the system's run-time environment, maintain the indices, fetch and replace tally records, expand and contract the data base, and handle allocation of disk storage. These are shown in Table I, below. */ */ TABLE I-User Interface Routines ROUTINE INIT FINI SEARCH ENTER DELETE RDCTR RDDTR RDJTR WRCTR WRDTR WRJTR FORMJTR LOSEJTR GETLINK PUTLINK FUNCTION Initialization & termination. Index maintenance. ARGUMENTS REQUIRED None. Index name, key name, return pointer, success indicator. Read and write routines for tally records Name (i.e. case, department or job number), success indicator. Installation & deletion of job nos. Allocate & return disk space. Job number, success indicator. Data set name, pointer to 1st avail. record, return pointer. EXAMPLE CALL INIT; CALL FINI; CALL SEARCH(DDX, '1234', RP, OK); IF , OK THEN STOP; CALL RDJTR('MYJOB', OK); IF ,OK THEN DO; PUT LIST(MYJOBII'MISSING'); STOP; END; CALL FORMJTR(NEWJOB,OK); IF ,OK THEN STOP; CALL GETLINK (FILE,RP,POOLHD) ; An approach to joh pricing in a multi-programming environment by CHARLES B. KREITZBERG and JESSE H. WEBB Educational Testing Service Princeton, New Jersey equitably charge for the running of jobs. The two major reasons for this are: INTRODUCTION Computers are amazingly fast, amazingly accurate, and amazingly expensive. This last attribute, expense, is one which must be considered by those who would utilize the speed and accuracy of computers. In order to equitably distribute the expense of computing among the various users, it is essential that the computer installation management be able to accurately assess the costs of processing a specific job. Knowing job costs is also important for efficiency studies, hardware planning, and workload evaluation as well as for billing purposes. For a second generation computer installation, job billing was a relatively simple task; since in this environment, any job that was in execution in the machine had the total machine assigned to it for the entire period of execution. As a result, the billing algorithm could be based simply upon the elapsed time for the job and the cost of the machine being used. In most cases, the cost for a job was given simply as the product of the run time and the rate per unit time. While this algorithm was a very simple one, it nevertheless was an equitable one and in most cases a reproducible one. Because of the fact that in a second generation computer only one job could be resident and in execution at one time, the very fast CPUs were often under utilized. As the CPUs were designed to be even faster, the degree of under utilization of them increased dramatically. Consequently, a major goal of third generation operating systems was to optimize the utilization of the CPU by allowing multiple jobs to be resident concurrently so that when anyone job was in a wait state, the CPU could then be allocated to some other job that could make use of it. While multiprogramming enabled a higher utilization of the CPU, it also introduced new problems in job billing. 
No longer was the old simple algorithm sufficient to • The sharing of resources by the resident jobs, and • The variation in elapsed time from run to run of a . given job. Unlike the second generation computer a given job no longer has all of the resources that are available on the computer allocated to it. In a multi-programming computer, a job will be allocated only those resources that it requests in order to run. Additional resources, that are available on the computer, can be allocated to other jobs. Therefore, it is evident that the rate per unit time cannot be a constant for all jobs, as it was for second generation computer billing, but must in some sense be dependent upon the extent to which resources are allocated to the jobs. The second item, and perhaps the most well-known, that influences the design of a billing algorithm for a third generation computer is the variation that is often experienced in the elapsed time from run to run of a given job. The elapsed time for any given job is no longer a function only of that job, but is also a function of the job mix. In other words, the elapsed time for a job will vary depending upon the kinds and numbers of different jobs which are resident with it when it is run. In order to demonstrate the magnitude of variation that can be experienced with subsequent runs of a given job, one job was run five different times in various job mixes. The elapsed time varied from 288 seconds to 1,022 seconds. This is not an unusual case, but represents exactly what can happen to the elapsed time when running jobs in a multi-programming environment. The effect, of course, is exaggerated as the degree of multiprogramming increases. Not only can this variation in run time cause a difference in the cost of a job from one run to another, 115 116 Fall Joint Computer Conference, 1972 but it also can cause an inequitability in the cost of different jobs; the variation in run time can effectively cause one job to be more expensive than another even though the amount of work being done is less. Objectives We have isolated several important criteria to be met by a multi-programming billing algorithm. Briefly, these criteria are as follows. • Reproducibility-As our previous discussion has indicated, the billing on elapsed time does not provide for reproducibility of charges. Any algorithm that is designed to be used in a multiprogramming environment should have as a characteristic, the ability to produce reproducible charges for any given job regardless of when or how it is run, or what jobs it is sharing with the computer. • Equitability-Any billing algorithm designed for use in a multi-programming environment must produce equitable costs. The cost of a given job must be a function only of the work that the job does, and of the amount of resources that it uses. Jobs which use more resources or do more work must pay more money. The billing algorithm must accommodate this fact. • Cost Recovery-In many computer operations it is necessary to recover the cost of the operation from the users of the hardware. The billing algorithm developed for a multi-programming environment must enable the recovery of costs to be achieved. • A uditability-A multi-programming billing algorithm must produce audit able costs. This is particularly true when billing outside users for the use of computer hardware. The charges to the client must be audit able. 
• Encourage Efficient Use of the Hardware-Since one goal in the design of third generation hardware was to optimize the use of that hardware, a billing algorithm designed for use in a multi-jobbing environment should encourage the efficient use of the hardware.
• Allow for Cost Estimating-The implementation of potential computer applications is often decided upon by making cost estimates of the expense of running the proposed application. Consequently, it is important that the billing algorithm used to charge customers for the use of the hardware also enables potential customers to estimate, beforehand, the expense that they will incur when running their application on the computer hardware.

We distinguish between job cost and job price: job cost is the amount which it costs the installation to process a given job; job price is the amount that a user of the computer facility pays for having his job processed. Ideally, the job price will be based on the job cost, but this may not always be the case. In many organizations, notably universities, the computer charges are absorbed by the institution as overhead; in these installations the job price is effectively zero-the job costs are not. In other organizations, such as service bureaus, the job price may be adjusted to attract clients and may not accurately reflect the job cost. In either case, however, it is important that the installation management know how much it costs to process a specific job.1,2

The development of the job billing algorithm (JBA) discussed in this paper will proceed as follows: first, we will discuss the "traditional" costing formula used in second generation computer systems:

cost = (program run time) X (rate per unit time)

and we shall demonstrate its inadequacy in a multi-jobbing environment. Second, we shall develop a cost formula in which a job is considered to run on a dedicated computer (which is, in fact, a subset of the multi-programming computer) in a time interval developed from the active time of the program.

DEVELOPMENT OF THE JOB PRICING ALGORITHM

In order to recover the cost of a sharable facility over a group of users, the price, P, of performing some operation requiring usage tk is:

P = C (tk / Σti)     (1)

where:
C is the total cost of the facility
Σti is the total usage experienced
tk is the amount of use required for the operation

Consider the billing technique which was used by many computer installations running a single thread (one program at a time) system. Let $m be the cost per unit time of the computer configuration. Then, if a program began execution at time t1 and terminated execution at time t2, the cost of running the program was computed by:

cost = m(t2 - t1)     (2)

As the utilization of the computer increased, the cost per unit time decreased. The cost figure produced by (2) is in many ways a very satisfying one. It is simple to compute; it is reproducible, since a program normally requires a fixed time for its execution; it is equitable, since a "large" job will cost more than a "small" job (where size is measured by the amount of time that the computer's resources are held by the job). Unfortunately for the user, however, the cost produced by (2) charges for all the resources of the computer system even if they are unused.
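As a concrete illustration of equations (1) and (2), the short Python sketch below is added here for this edition; it is not code from the paper, and the facility cost, usage figures, and hourly rate are hypothetical values chosen only to make the arithmetic visible.

def recovery_price(facility_cost, total_usage, usage_k):
    """Equation (1): P = C * (t_k / sum of t_i)."""
    return facility_cost * (usage_k / total_usage)

def second_gen_cost(rate_per_minute, t_start, t_end):
    """Equation (2): cost = m * (t2 - t1), i.e., rate times wall clock time."""
    return rate_per_minute * (t_end - t_start)

# Hypothetical figures: a $100,000/month facility with 40,000 minutes of
# billable usage, and one job that used 50 of those minutes.
print(recovery_price(100_000.0, 40_000.0, 50.0))   # 125.0 dollars
print(second_gen_cost(35.0, 10.0, 60.0))           # 1750.0 dollars for 50 minutes at $35/min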
This "inflated" charge is a result of the fact that, in a single thread environment, all resources of the computer system are allocated to a program being processed even if that program has no need of them. The effect of this is that the most efficient program in a single thread environment is the program which executes in the least amount of time; that is, programmers attempt to minimize the quantity (t2 - t1). This quantity, called the wall clock time (WCT) of the program, determines the program's cost. Since the rate of the computer is constant, the only way to minimize the cost for a given program is to reduce its WCT; in effect, make it run faster. Hence, many of the techniques which were utilized during the second generation were designed to minimize the time that a program remained resident in the computer.

The purpose of running in a multi-thread environment, one in which more than one program is resident concurrently, is to maximize the utilization of the computer's resources, thus reducing the unit cost. In a multi-thread processing system, the cost formula given by (2) is no longer useful because:

1. It is unreasonable to charge the user for the entire computer since the unused resources are available to other programs.
2. The wall clock time of a program is no longer a constant quantity but becomes a function of the operating environment and job mix.

For these reasons we must abandon (2) as a reasonable costing formula. Many pricing algorithms are in use; however, none is as "nice" as (2). If possible, we should like to retain formula (2) for its simplicity and intuitive appeal.3 This may be done if we can find more consistent definitions to replace m (rate) and WCT (elapsed time).

Computed elapsed time

A computer program is a realization of some process on a particular hardware configuration. That is, a program uses some subset of the available resources and "tailors" them to perform a specific task. The program is loaded into the computer's memory at time t1 and terminates at time t2. During the period of residency, the program may be in either one of two states: active or blocked. A program is active when it is executing or when it is awaiting the completion of some external event. A program is blocked when it is waiting for some resource which is unavailable. These categories are exhaustive; if a program is not active and is not waiting for something to happen, then it is not doing anything at all. The two categories are not, however, mutually exclusive, since a program may be processing but also awaiting the completion of an event (for example, an input/output operation); indeed, it is this condition which we attempt to maximize via channel overlap. Therefore, we define voluntary wait as that interval during which a program has ceased computing and is awaiting the completion of an external event. We define involuntary wait as the interval during which a program is blocked, a condition caused by contention. In general, voluntary wait results from initiation of an input/output operation, and in a single thread system we have:

WCT = Σtc + Σtv     (3)

where each tc is a compute interval and each tv is a voluntary wait interval. Graphically, the situation is represented as in Figure 1: the solid line represents periods of compute and the broken line indicates intervals of input/output activity.

Figure 1-States of a program in a single thread environment
Since Σtc is based on the number of instructions executed, which is constant, and Σtv is based on the speed of input/output, which is also constant (except for a few special cases), WCT is itself constant for a given program and data in a single thread environment. The ideal case in this type of system is one in which the overlap is so good that the program obtains the i+1st record as soon as it has finished processing the ith record. Graphically, this situation is shown in Figure 2, and we can derive the lower bound on WCT as:

WCT ≥ Σtc     (4)

and, of course:

WCT → Σtc as Σtv → 0     (5)

Figure 2-A program with maximum overlap

In a multi-thread environment, we know that:

WCT = Σtc + Σtv + Σti     (6)

where ti is an interval of involuntary wait. But from the above discussion we know that Σtc + Σtv is a constant for a given program; hence, the inconsistency in the WCT must come from ti. This is precisely what our intuition tells us: that the residency time of a job will increase with the workload on the computer. Graphically, a program running in a multi-thread environment might appear as in Figure 3.

Figure 3-States of a program in a multi-thread environment

During the interval that a program is in involuntary wait, it is performing no actions (in fact, some programmers refer to a program in this state as "asleep"). As a consequence, we may "remove" the segments of time that the program is asleep from the graph, for time does not exist to a program in involuntary wait. This permits us to construct a series of time sequences for the various programs resident in the computer, counting clock ticks only when the program is active. When we do this, a graph such as Figure 4 becomes continuous with respect to the program (Figure 5). Of course, the units on the x-axis in Figure 5 no longer represent real time; they represent, instead, the active time of the program.

Figure 4-States of a program based on real time
Figure 5-States of a program based on active time

We shall call the computed time interval computed elapsed time (CET), defined as:

CET = Σtc + Σtv = WCT - Σti     (7)

and, as Σti → 0, CET → WCT, so that we have the relationship:

WCT ≥ CET     (8)

The quantity WCT - CET represents the interference from other jobs in the system and may be used as a measure of the degree of multi-programming. Unfortunately, the CET suffers from the same deficiency as the WCT: it is not reproducible. The reason for this is that on a movable head direct access storage device, contention exists for the head and the time for an access varies with the job mix. However, the CET may be estimated from its parameters. Recall that CET = Σtc + Σtv. The quantity Σtc is computed from the number of instructions executed by the program and is an extremely stable parameter. The quantity Σtv is based upon the number and type of accesses and is estimated as:

Σtv ≈ Σ a(i)·ni     (9)

where a(i) is a function which estimates the access time to the ith file and ni is the number of accesses to that file.
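Equation (9) can be evaluated mechanically once per-file access-time estimates are chosen. The short Python sketch below is an illustration added here, not the authors' code; the device names and access counts are invented, and the per-access times simply reuse the tape and disk coefficients reported in the paper's appendix.

def estimated_cet(cpu_time, file_accesses, access_time):
    """CET = sum(tc) + sum(tv), with sum(tv) estimated per equation (9):
    the sum over files of a(i) * n_i, where a(i) is the estimated access
    time for file i and n_i is the number of accesses to it."""
    voluntary_wait = sum(access_time(f) * n for f, n in file_accesses.items())
    return cpu_time + voluntary_wait

# Hypothetical per-device access times (seconds) and access counts.
ACCESS_TIME = {"TAPE1": 0.0251, "DISK1": 0.0474}
accesses = {"TAPE1": 4_000, "DISK1": 2_500}

cet = estimated_cet(cpu_time=85.0,
                    file_accesses=accesses,
                    access_time=lambda f: ACCESS_TIME[f])
print(round(cet, 1))   # 85.0 + 4000*0.0251 + 2500*0.0474 = 303.9 seconds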
The amount of time which a program waits for an input or output operation depends upon a number of factors. The time required to read a record is based upon the transfer rate of the input/output device, the number of bytes transferred, and the latency time associated with the device (such as disk rotation, tape inter-record gap time, and disk arm movement). For example, a tape access on a device with a transfer rate of RT (expressed as time per byte) and a start-stop time of ST would require:

ST + RT·b     (10)

seconds to transfer a record of b bytes. Hence, for a file of n records, we have a total input/output time of:

Σ(i=1 to n) (ST + RT·bi) = n·ST + RT·Σbi     (11)

where Σbi is the total number of bytes transferred. In practice Σbi ≈ nB, where n is the number of records and B is the average blocksize. The term ST is, nominally, the start-stop time of the device. However, this term is also used to apply a correction to the theoretical record time. The reason is that while the CET will never be greater than the I/O time plus the CPU time, overlap may cause it to be less. This problem is mitigated by the fact that at most computer shops (certainly at ETS) almost all programs are written in high-level computer languages and, as a result, the job mix is homogeneous. A measure of overlap may be obtained by fitting various curves to historical data and choosing the one which provides the best fit. In other words, pick the constants which provide the best estimate of the WCT.

It is important to remember that the CET function produces a time as its result. We are using program parameters such as accesses, CPU cycles, and tape mounts only because they enable us to predict the CET with a high degree of accuracy.

The original billing formula (2), which we wished to adapt to a multi-thread environment, utilized a time multiplied by a dollar rate per unit time. The CET estimating function has provided us with a pseudo run time; we must now develop an appropriate dollar rate function. In order to develop a charging rate function, we consider the set of resources assigned to a program. In a multi-programming environment, the computer's resources are assigned to various programs at a given time. The resources are partitioned into subset computers, each assigned to a program. The configuration of the subset computers is dynamic; therefore, the cost of a job is:

cost = Σ(i=1 to n) CETi·ri     (12)

where i indexes the allocation intervals, that is, the intervals between changes in the resources held by the job; CETi is the CET accumulated during the ith interval; and ri is the rate charged for the subset computer with the configuration held by the program during interval i. The allocation interval for OS/360 is a job step.

The rate function

Some of the attributes which the charging rate function should have are:

• the rate for a subset computer should reflect the "size" of the subset computer allocated; a "large" computer should cost more than a "small" computer.
• the rate for a subset computer should include a correction factor based upon the probability that a job will be available to utilize the remaining resources.
• the sum of the charges over a given period must equal the monies to be recovered for that period.

With these points in mind, we may create a rate function. The elements of the resource pool may be classified as sharable resources and nonsharable resources. Tape drives, core memory, and unit record equipment are examples of nonsharable resources; disk units are an example of a sharable resource.
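As a sketch of how equations (11) and (12) combine in practice, the following Python fragment is added for illustration only; the record count, blocksize, device constants, and step rates are invented, not values taken from the paper.

def tape_io_time(n_records, avg_blocksize, start_stop, time_per_byte):
    """Equation (11): n*ST + RT*sum(b_i), with sum(b_i) approximated by n*B."""
    return n_records * start_stop + time_per_byte * (n_records * avg_blocksize)

def job_cost(steps):
    """Equation (12): cost = sum over allocation intervals of CET_i * r_i.
    Each step is a (cet_seconds, rate_dollars_per_second) pair."""
    return sum(cet * rate for cet, rate in steps)

# Hypothetical file: 10,000 records of 800 bytes, ST = 0.008 s, RT = 0.25 microsecond/byte.
print(round(tape_io_time(10_000, 800, 0.008, 0.25e-6), 1))   # 82.0 seconds

# Hypothetical two-step job: 300 s of CET at $0.30/s and 120 s at $0.55/s.
print(job_cost([(300.0, 0.30), (120.0, 0.55)]))               # 156.0 dollars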
While these categories are not always exact, they are useful, since we assume that allocation of a nonsharable resource is more significant than allocation of a sharable resource. At Educational Testing Service, it was determined that the most used nonsharable resources are core storage and tape drives. Therefore, it was logical to partition the computer into subset computers based upon the program's requirement for core and tapes. Tapes are allocated in increments of one; core is allocated in 2K blocks. Hence, there are (number of tape drives + 1) × (available core/2K) possible partitions. For any given design partition, we would like to develop a rate which is proportional to the load which allocation places upon the resource pool. A single job may sometimes disable the entire computer. If, for example, a single program is using all of the available core storage, then the unused devices are not available to any other program and should be charged for. On the other hand, if a single job is using all available tapes, other jobs may still be processed and the charge should be proportionately less.

The design proportion is the mechanism by which the total machine is effectively partitioned into submachines based upon the resources allocated to the submachines. A design proportion can then be assigned to any job based upon the resources it requires. The design proportion should have at least the following properties.

• The design proportion should range between the limits 0 and 1.
• The design proportion should reflect the proportion of the total resources that are allocated to the job.
• The design proportion should reflect, in some fashion, the proportion of the total resources that are not allocated to the job, but which the job prevents other jobs from using.

The design proportion proposed for the billing algorithm is based upon the probability that when the job is resident, some other job can still be run. The definition of this parameter is as stated below.

The design proportion of a job is equal to the probability that when the job is resident, another job will be encountered such that there are insufficient resources remaining to run it.

Since OS/360 allocates core in 2K blocks, the number of ways that programs can be written to occupy available core is equal to:

N = C/2     (13)

where:
N = Number of ways that programs can be written
C = Core available in kilobytes

In addition, if there are T tapes available on the hardware configuration, then there are T plus 1 different ways that programs can be written to utilize tapes. Therefore, the total number of ways that programs can be written to utilize core and tapes is given by the following equation:

N = (C/2)(T+1)     (14)

where:
N = Total number of ways that programs can be written
C = Core available in kilobytes
T = Number of tape drives available

The design proportion for a given job can be alternately defined as 1 minus "the probability that another job can be written to fit in the remaining resources." This is shown as follows:

Dp = 1.0 - { [(CA - CU)/2](TA - TU + 1) } / { [CA/2](TA + 1) }     (15)

where:
Dp = Design proportion for the job
CA = Core available in kilobytes
CU = Core used by the job
TA = Tape drives available on the computer
TU = Tape drives used by the job

It is important to note that the sum of the design proportions of all jobs resident at one time can be greater than 1.0. For example, consider the following two jobs resident in a 10K, four tape machine.
Job #1: 6K, 1 tape:  Dp = 17/25
Job #2: 4K, 3 tapes: Dp = 19/25

The sum of their design proportions is 36/25. This seems odd at first, since the design proportion of a 10K, four tape job is 1.0. However, this can be shown to be a necessary and desirable property of the design proportion. To show that this is the case, it is necessary to consider the amount of work done and the total cost of the work for two or more jobs that use the total machine, compared to the cost of the same amount of work done by a single job that uses the total machine. This analysis will not be covered here.

The design proportion function as defined herein is a theoretical function. It is based solely upon the theoretical possibility of finding jobs to occupy available resources. Clearly, the theoretical probability and the actual probability may be somewhat different. Consequently, a design proportion could be designed based upon the actual probabilities experienced in a particular installation. Such a probability function would change as the nature of the program library changed. The design proportion function described above would change only as the configuration of the hardware changed. Either technique is acceptable, and the design proportion has the desired properties. That is, the design proportion increases as the resources used by the various jobs increase. However, it also reflects the resources that are denied to other jobs because of some one job's residency. Consider the fact that when all of core is used by a job, the tape drives are denied to other jobs. The design proportion in this case is 1.0, reflecting the fact that the job in effect has tied up all available resources even though they are not all used by the job itself.

While the design proportion function is simple, it has many desirable properties:

• It is continuous with respect to OS/360 allocation; all allocation partitions are available.
• It always moves in the right direction; that is, increasing the core requirement or tape requirement of the program results in an increased proportion.
• It results in a proportion which may be multiplied by the rate for the total configuration to produce a dollar cost for the subset computer.
• It is simple to compute.

If it were determined that the required recovery could be obtained if the rate for the computer were set at $35 per CET minute, the price of a step is determined by the equation:

Pstep = ($35 · Dp(core, tapes)) · (CET/60)     (16)

and the price of a job (with n steps) is:

Pjob = Σ(i=1 to n) Pstep     (17)

We have come full circle and returned to our "second generation" billing formula:

cost = rate · time

The key points in the development were:

• A multi-tasking computer system may be considered to be a collection of parallel processors by altering the time reference.
• The variation in time of a program run in a multi-programmed environment is due to involuntary wait time.
• The computed elapsed time may be multiplied by a rate assigned to the subset computer, and an equitable and reproducible cost developed.
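A minimal Python rendering of equations (15) and (16), added here for illustration, reproduces the two sample design proportions above; the $35-per-CET-minute figure is the example rate quoted in the text, and the 300-second step is hypothetical.

from fractions import Fraction

def design_proportion(core_avail_k, tapes_avail, core_used_k, tapes_used):
    """Equation (15): Dp = 1 - [((CA-CU)/2)(TA-TU+1)] / [(CA/2)(TA+1)]."""
    remaining = Fraction(core_avail_k - core_used_k, 2) * (tapes_avail - tapes_used + 1)
    total = Fraction(core_avail_k, 2) * (tapes_avail + 1)
    return 1 - remaining / total

def step_price(dp, cet_seconds, rate_per_cet_minute=35.0):
    """Equation (16): P_step = (rate * Dp) * (CET / 60)."""
    return rate_per_cet_minute * float(dp) * (cet_seconds / 60.0)

# The 10K, four-tape machine from the text.
job1 = design_proportion(10, 4, 6, 1)   # Fraction(17, 25)
job2 = design_proportion(10, 4, 4, 3)   # Fraction(19, 25)
print(job1, job2, job1 + job2)          # 17/25 19/25 36/25

# A hypothetical step of job #1 that accumulates 300 CET seconds.
print(round(step_price(job1, 300.0), 2))  # 35 * 0.68 * 5 = 119.0 dollars

Exact rational arithmetic is used above only so the 17/25 and 19/25 values can be checked against the text; a production version would work in floating point.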
IMPLEMENTATION OF THE JOB PRICING ALGORITHM

The Job Pricing Algorithm (JPA) is implemented under OS/360 Release 19.6. No changes to the operating system were required; a relatively minor modification was made to HASP in order to write accounting records to disk for inclusion in the accounting system. The basis of the JPA is the IBM machine accounting facility known as System Management Facilities (SMF).4 Billing under the JPA involves four steps:

1. Collect the job activity records at execution time. The records are produced by SMF and HASP and are written to a disk data set, SYS1.MANX.
2. Daily, the SYS1.MANX records are consolidated into job description records and converted to a fixed format.
3. The output from step (2) is used as input to a daily billing program which computes a cost for the jobs and prepares a detailed report of the day's activity by account number.
4. Monthly, the input to the daily program is consolidated and used as input to a monthly billing program which interfaces with the ETS accounting system.

The raw SMF data which is produced as a result of job execution contains much valuable information about system performance and computer workload which is of interest to computer center management. One useful characteristic of the JPA is that costs are predictable. This enables a programmer or systems analyst to determine, in advance, the costs of running a particular job and, more importantly, to design his program in the most economical manner possible. In order to facilitate this process, a terminal oriented, interactive cost estimating program has been developed. This program is written in BASIC and enables the programmer to input various parameters of his program (such as filesize, CPU requirements, blocking factors, and memory requirements); the cost estimate program produces the cost of the program being developed. Parameters may then be selectively altered and the effects determined.

CONCLUSION

The approach to user billing described in this paper has proved useful to management as well as users. Many improvements are possible, especially in the area of more accurate CET estimation. Hopefully, designers of operating systems will, in the future, include sophisticated statistics gathering routines as part of their product, thus providing reliable, accurate data for accounting.

APPENDIX

A method of deriving CET parameters

Let the wall clock time (W) be estimated as:

W' = C + AT·XT + AD·XD + AM·XM     (1)

where:
XT = number of tape accesses
XD = number of disk accesses
XM = number of tape mounts
C = CPU time
AT, AD, AM = coefficients to be determined

We wish to determine the coefficients AT, AD, and AM that will maximize the correlation between W', the computed elapsed time, and W, the actual elapsed time. Define the error e as:

e = (W - W')     (2)

The correlation coefficient, r, can be written in terms of the error variance σe² and the variance σw² of W. Then, in order to maximize r², it is sufficient to minimize σe², since σw² is a constant over a given sample. Since

σe² = Σe² - n⁻¹(Σe)²

we have, substituting for e,

σe² = Σ[(Wi - Ci) - AT·XT - AD·XD - AM·XM]² - n⁻¹[Σ(Wi - Ci) - AT·ΣXT - AD·ΣXD - AM·ΣXM]²

Since all the partial derivatives of σe² with respect to AT, AD, and AM must vanish at the minimum, we obtain three simultaneous linear equations; solving them for AT, AD, and AM gives values for the parameters that maximize the correlation between the computed elapsed time and the actual elapsed time.

The technique was applied to a sample month of data which was composed of 19,401 job steps. The coefficients determined were:

AT = 0.0251 seconds
AD = 0.0474 seconds
AM = 81.2 seconds

When these coefficients were used in Equation (1) to determine the computed elapsed time, the correlation coefficient between the computed time and the actual time over the 19,401 steps was 0.825. When other coefficients were used, i.e., AT = 0.015, AD = 0.10, and AM = 60.0, the correlation was only 0.71.
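The coefficient derivation above is an ordinary least-squares fit, so a modern reader can reproduce it directly. The sketch below is illustrative only: the handful of job steps is fabricated (the 19,401-step ETS sample is not available), and NumPy's solver stands in for the normal-equation algebra.

import numpy as np

def fit_cet_coefficients(tape_accesses, disk_accesses, tape_mounts,
                         wall_clock, cpu_time):
    """Least-squares fit of W - C = AT*XT + AD*XD + AM*XM (appendix, eq. 1)."""
    X = np.column_stack([tape_accesses, disk_accesses, tape_mounts])
    y = np.asarray(wall_clock, dtype=float) - np.asarray(cpu_time, dtype=float)
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs   # [AT, AD, AM] in seconds per access / per mount

# Fabricated sample of job steps (counts, and times in seconds).
XT = [4000, 1200, 0,    2500]
XD = [2500, 8000, 1500, 300]
XM = [2,    1,    0,    3]
C  = [85.0, 40.0, 12.0, 30.0]
W  = [c + 0.025*t + 0.047*d + 80.0*m
      for c, t, d, m in zip(C, XT, XD, XM)]   # synthetic "actual" elapsed times

AT, AD, AM = fit_cet_coefficients(XT, XD, XM, W, C)
print(round(AT, 3), round(AD, 3), round(AM, 1))   # recovers 0.025 0.047 80.0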
Note: Card read, card punch, and print time constants were not computed in this fashion simply because there is insufficient data on job steps that use these devices as dedicated devices. However, as data become available in the future, the method could be applied to obtain good access times.

REFERENCES

1. L. L. SELWIN, Computer resource accounting in a time sharing environment, Proceedings of the Fall Joint Computer Conference, 1970.
2. C. R. SYMONS, A cost accounting formula for multi-programming computers, The Computer Journal, Vol 14, No 1, 1971.
3. J. T. HOOTMAN, The pricing dilemma, Datamation, Vol 15, No 8, 1969.
4. IBM Corp., IBM System/360 operating system: System management facilities, Manual GC28-6712, 1971.

Facilities management-A marriage of porcupines

by DAVID C. JUNG
Quantum Science Corporation
Palo Alto, California

FM-DEFINED

FM definition often elusive

There are almost as many definitions for Facilities Management (FM) as there are people trying to define it. Because FM can offer different levels of service, some variations in its definition are legitimate. FM was initiated by the Federal Government in the 1950's when the Atomic Energy Commission, the National Aeronautics and Space Administration (NASA), and the Department of Defense offered several EDP companies the opportunity to manage and operate some of their EDP installations. Previously, these companies had developed strong relationships with the various agencies through systems development and software contracts.

FM definition expanding

Nurtured by the Federal Government, FM has emerged as a legitimate computer service in the commercial EDP environment. Since FM has been offered in the commercial market, its definition has expanded to include additional services. In fact, customers are now beginning to expect FM vendors to have expertise that extends far beyond the day-to-day management of the data processing department.

Electronic Data Systems (EDS), formed in the early 1960's, pioneered the FM concept in the commercial market. Shortly after its founding, EDS recognized the massive EDP changes required in the hospital and medical insurance industry as a result of Medicare and other coverages changed by the Social Security Administration. Accordingly, EDS secured several State Blue Cross/Blue Shield organizations as customers. While operating these installations, EDS developed standard software packages that met the recordkeeping requirements of the Social Security Administration. Moreover, this software succeeded in improving operator control and reducing operating costs. Consequently, EDS marketed these software packages to other Blue Cross/Blue Shield organizations. Outside of the medical insurance field, EDS has successfully pursued FM opportunities in life insurance, banking, and brokerage.

The success of EDS, both in revenue/profit growth and in the stock market, did not go unnoticed by others in the computer services industry. As a result, in the late 1960's and early 1970's many software firms and data service bureaus diversified into FM-many, unfortunately, with no real capabilities. Since FM has proven itself as a viable business in the commercial market, over 50 independent FM firms have been formed. Moreover, at least 50 U.S. corporations with large, widespread computer facilities have spun off profit centers or separate corporations from their EDP operations. In many cases, these spinoffs offer customers FM as one of their computer services.
An ideal concept of FM

The ideal role for the FM vendor is to assist in all the tasks related to business information and the EDP operations in the firm. The Facilities Manager could assume full responsibility for the EDP operations, from acquiring the equipment and staffing the installation to distributing the information to the firm's operating areas. FM also has a vital role in defining business information requirements for top management. More specifically, the FM vendor should be able to define what information is required to operate the business, based on his industry experience. He should also be able to help establish cost parameters, based on an analysis of what other firms in the industry spend for EDP. Moreover, FM vendors will assist top management to cost optimize the array of business processing methods, which may include manual or semi-automated approaches as well as EDP. The FM vendor must be skilled at working with personnel in the customer's operation centers to improve ways in which the information is used and to effectively develop new methods for handling information as a business grows. (See Figure 1.)

Figure 1-Business information and EDP in the ideal firm (top management defines business information requirements; EDP operations acquire equipment and personnel, schedule and budget resources, manage daily operations, perform audits and secure operations, and distribute information to company operations, with periodic review and feedback)

FM-Today it's EDP takeover

The real world of FM is quite different from the ideal version just described, and there will be a period of long and difficult transition to reach that level. Actual takeover of an existing EDP installation is now the prime determinant of whether an FM relationship exists. When the FM vendor takes over the EDP installation, it also takes over such EDP department tasks as (1) possession, maintenance, and operation of all EDP equipment and the payment of all rental fees or acquisition of equipment ownership, (2) hiring and training all EDP personnel, and (3) development of applications, performance of systems analysis, acquisition of new equipment, and implementation of new applications.

Takeover may be partial

Many FM vendors are increasingly offering cafeteria-style services so that the customer can retain control over EDP activities that he can perform proficiently. In some cases where equipment is owned, the customer may retain title to the equipment. Salaries of EDP personnel may continue to be paid by the customer, but responsibility for management is assigned to the vendor. Also included as partial FM are takeovers of less than the client's total EDP activities; only a single division or a major application may be taken over by an FM vendor. Merely taking one of many applications on a computer and performing this function on a service bureau or time-sharing basis, however, is not included as an FM contract.

HOW EDP USERS BENEFIT

FM benefits: A study in contrast

Some FM users benefit
EDP activity of Southwestern Life, a $5 billion life insurance company in Dallas and a customer of EDS, typifies the satisfied FM user. Southwestern Life's vice president, A. E. Wood, has stated, "We are very pleased with our agreement and the further we get into it, the more sure we are we did the right thing. We won't save an appreciable amount of money on operations, but the efficiency of operation will be improved in great measure. To do the same job internally would have taken us two to three times as long and we still would not have benefited from the continual upgrading we expect to see with EDS."

...and some do not

Disgruntled users exist too, but they are more difficult to find and in many cases are legally restricted from discussing their experiences. One manufacturing company told us, "We cannot talk to you; however, let me say that our experience was unfortunate, very unfortunate. They (the FM vendor) did not understand our business, did not understand the urgency of turnaround time on orders. We lost control of our orders and finished goods inventory for six weeks. As a result we lost many customers whom we are still trying to woo back after more than a year." Two medium-sized banks had similar comments that indirectly revealed much about FM benefits: "We're not in any great difficulty. In fact, the EDP operations now are running well, but every time we want to make a change it costs us. I wish I had my own EDP manager back to give orders to."

FM benefits are far ranging

Large users benefit least from FM

There is no question in our mind that there are many potential benefits for FM users. However, installation size is the primary yardstick for measuring the benefits users can obtain from FM; large users have the least to gain, for several reasons. In most cases, large users have already achieved the economy-of-scale benefits which FM and other computer services can bring to bear. Large users typically have computers operating more shifts during the day and do not allow the computers to sit idle. In addition to higher utilization, larger users can more fully exploit the capabilities of applications and systems programmers because they can spread these skills over more CPUs than can smaller users. For these and other reasons, it is much more difficult to demonstrate to large users that an FM vendor can operate his EDP department more efficiently and less expensively. For these reasons, the bulk of FM revenue will come from the small- or medium-sized EDP user. This is defined as a user who has a 360/50 or smaller computer and is spending less than $1.5 million per year on EDP.

Improved EDP operations

The most tangible benefit FM can bring to an EDP user is improved control over the EDP operations and stabilization of the related operating costs. This conclusion is based on Quantum's field research, which has shown that installations in the small- to medium-sized range are out of control despite the refusal by managers to admit it. Lack of EDP planning, budgeting, and scheduling shows up in obvious ways, such as skyrocketing costs, as well as in obscure ways that are difficult to detect, yet contribute significantly to higher EDP costs. These subtle inefficiencies include program reruns due to operator or programming errors, equipment downtime due to sloppy programming documentation, and idle time due to poorly scheduled EDP workloads.
Because they are obscure and often hidden by EDP departments, it is difficult for managements in small and medium installations to detect and correct these problems. On the other hand, an FM vendor can often quickly identify these problems and offer corrective remedies, because his personnel are trained to uncover these inefficiencies and his profits depend on their correction.

Smaller investments to upgrade EDP

FM can also benefit end users by reducing proposed future increases in EDP costs. Small- and medium-sized users that have a single computer must eventually face the problem of increasing their equipment capacity to meet requirements of revenue growth and expanded applications. This often means a significant increment in rental and other support costs. A 360/30 user, for example, who is spending $13,000 a year on equipment may have to jump to a 360/40 or a 370/135, costing $18,000-$22,000 per year, to achieve the required increase in computing power. Support costs will also increase, in many cases more quickly. If a user is acquiring a 360/40, for example, he probably will have to use an Operating System (OS) to achieve efficient machine performance. Many users today will upgrade their software as shown in Table A.

TABLE A-User OS Plans 1971-72

                                              360/40   360/50   360/65
Have OS now                                     10%      46%      69%
Have no OS now, but plan to install in 1971-72  48       38        8
Have no OS now and no plans                     42       16       23
Total                                          100%     100%     100%

An OS installation requires a higher level of programming talent than is currently required to run a DOS 360 system. Because the user does not need the full-time services of these system programmers, FM offers an economical solution whereby system programmers are shared among multiple users.

Elimination of EDP personnel problems

One of the most serious problems users encounter in managing EDP operations is personnel management. The computer has acquired an aura of mysticism that has tended to insulate the EDP department from the normal corporate rules and procedures. Many programmers often expect to receive special treatment, maintain different dress and appearance, and obtain higher pay. High turnover among EDP personnel, often two to three times the norm for other company operations, further aggravates EDP personnel problems. Through subcontracting, FM vendors can separate EDP personnel from the corporation and thus alleviate this situation for management.

Eased conversion to current generation software

Over one-third of all users are locked into using third generation computers in the emulation mode, where second generation language programs are run on third generation computers. Although software conversion is a difficult and expensive task for users, the FM vendor who has an industry-oriented approach usually has a standard package already available that the customer can use. In several installations, FM vendors have simplified conversion, thus providing their users with the economies of third generation computers.

Improved selection of new equipment and services

Users of all sizes continually need to evaluate new equipment and new service offerings, including the evaluation of whether to buy outside or do in-house development.
Again, the large user holds an advantage because his size permits him to invest in a technical staff dedicated to evaluations. Installations spending more than $1.5 million annually for EDP usually have one full-time person or more appointed to these functions. In smaller installations there is no dedicated staff, and pro-tem evaluation committees are formed when required. Table B shows the relationship between the size of EDP expenditures and the share of those expenditures allocated to planning, auditing, and technical evaluations.

TABLE B-User Expenditures on EDP Planning

                    ANNUAL EDP EXPENDITURES    PERCENT OF EDP COSTS SPENT ON PLANNING, ETC.
Large companies     >$1.5 million              1-5%
Medium companies    $300K-$1.5 million         0-2%
Small companies     <$300K                     0-1%

In this area of EDP planning, FM can benefit users in two ways. First, FM vendors can and do take over this responsibility and, second, the effective cost to any single user is less because it is spread over multiple users.

Other operating benefits

One potential benefit from FM relates to new application development. Typically, 60 percent or more of a firm's EDP expenditures are tied to administrative applications, such as payroll, general accounting, and accounts payable. Because of the relatively high saturation in the administrative area, firms are now extending the use of EDP into operational areas such as production control and distribution management. However, many of these firms lack the qualified EDP professionals and line managers necessary to develop and implement applications in non-administrative areas. Thus, they have become receptive to considering alternatives, including FM.

Major EDP cost savings

Earlier in this chapter, the stabilization of EDP costs was discussed. Now we will focus on the major savings that FM can provide through the actual reduction of EDP costs. This potential FM benefit is too often the major theme of an FM vendor's sales pitch. Consequently, its emotional appeal often clouds a rational evaluation that should precede an FM contract. If the FM contract is well written and does not restrict either party, the FM vendor can apply his economies of scale and capabilities for improving EDP operations and should be able to show a direct cost savings for the customer. However, these "savings" may be needed to offset costs of software conversion or other contingencies and thus may not really be available to the customer in the early contract years.

Long range benefits-Better information

Improved operation control and profits through better information: this is the major long-range benefit from FM. While this contribution is not unique to FM vendors, few EDP users today have been able to develop a close relationship between company operations and EDP. Companies such as Weyerhauser and American Airlines, generally recognized as leading edge users, are few in number, and many try to emulate their achievements in integrating EDP into the company operations. EDP expenditures, however, are seldom judged on their contribution toward solving basic company problems and increasing revenues and profits. Many apparently well-run EDP departments would find it difficult to justify their existence in these terms. The situation is changing, however. An indication of this new attitude is the increased status of the top EDP executive in large firms. The top EDP executive is now a corporate officer in over 300 of the Fortune 500 firms. While titles often mean little, the change to Vice President or Director of Business Information from Director of EDP Operations suggests that top management in many companies has considered and faced the problem.
In addition to new management titles, continuing penetration of EDP functions into operating areas is increasingly evident.

FM-A permanent answer for users

FM should not be treated as an interim first-aid treatment for EDP. There are several good reasons for continuing the FM relationship indefinitely.

• Individual users cannot duplicate the economies of scale that FM vendors can achieve. Standard software packages, for example, require constant updating and support, and new equipment evaluations are constantly required if lowest cost EDP is to be maintained.
• Top management would have to become involved in EDP management if operations were brought back in-house. This involvement would take time from selling and other revenue producing activities. A rational top management tries to minimize the share of its time spent on cost-management activities.
• By disengaging from the FM contract, the customer risks losing control over his EDP again while receiving no obvious compensation for this risk. Even if the customer believes he is being overcharged by the FM vendor, there is no real guarantee the excess profits can be converted to savings to the customer.

For these reasons an FM relationship should normally be considered permanent rather than temporary.

MARKET STRUCTURE AND FORECAST

Current FM market

Total FM market size and recent growth

The 1971 market for FM services totals $645 million with 337 contracts. However, 45 percent, or $291 million, was captive and not available to independents. Captive FM contracts are defined as being solidly in the possession of the vendor because of other than competitive considerations. Typically, captive contracts are negotiated between a large firm and its EDP spinoff subsidiary. The remaining market is available to all competitors and totals $354 million, or 55 percent. Available does not necessarily mean the contract is available for competition immediately, since most contracts are signed for a term of two to five years. Captive and available 1971 FM revenues and contracts are shown in Figure 2.

Figure 2-1971 FM market (total $645 million FM revenues; total 337 FM contracts)

Industry analysis

Discrete and Process Manufacturing are the largest industrial sectors using FM services and account for over 44 percent of total FM revenues. However, most EDP spinoffs have occurred in manufacturing, and much of these FM revenues are therefore captive and not available to independent competitors. After deleting the captive portion, the two manufacturing sectors account for only 12 percent of the available 1971 FM market of $354 million. Manufacturing has failed to develop into a major available FM market primarily because there is a general absence of common business and accounting procedures from company to company, thus providing no basis for leveraging standard software. This is true even within manufacturing subsectors producing very common products. In the medical insurance sector, however, Federal Medicare regulations enforce a common method for reporting claims and related insurance data, thus providing a good basis for leveraging standard software. The Medical Sector accounts for 25 percent of available FM revenues.
The Medical Sector includes medical health insurance companies (Blue Cross/Blue Shield) and hospitals. This sector was the first major commercial FM market. FM continues to be attractive in this sector because it permits rapid upgrading of EDP to meet the new Medicare reporting procedures and relieves the problem of low EDP salary scales. The largest industry sector in the available FM market, the Federal Government, accounts for over 34 percent of available revenues. All Federal Government contracts are awarded on the basis of competitive bids. Most Federal Government FM contracts still tend to be purely subcontracting of EDP operations rather than total business information management, which is becoming more common in commercial markets. The Finance Sector currently accounts for 22 percent of available FM revenues. Banks and insurance companies are the major markets within the Finance Sector, which also includes brokerage firms, finance companies, and credit unions.

Type of performance

FM vendors who initially take over on-site operation of a customer's computer strive for economies of scale. This has created a trend whereby the FM vendor has eliminated the need for the customer's computer by processing data through NIS (timesharing) or service bureaus. NIS now accounts for 5 percent of total FM revenues. Service bureau processing, which requires the physical transport of data from the client's location to the vendor's computer installation, accounts for 2 percent of total FM revenues. In Figure 3, which depicts the FM market by type of performance, combination refers to the use of two or more of the above services to carry out the FM contract.

Figure 3-1971 FM market by type of performance (total FM market $645 million; available FM market $354 million)

Types of vendors

Types of vendors that perform FM contracts are described below:

• Independents, who accounted for 67 percent of total FM revenues in 1971, were startups in the computer service industry or vendors who have graduated from the ranks of spinoffs.
• Spinoffs are potentially strongest in their "home" industries; however, competitive pressures may limit market penetration here. An oil company spinoff, for example, would have a difficult time selling its seismic services to another oil company because of the high value placed on oil exploration and related information.
• Computer manufacturers are increasingly offering FM services. Honeywell has several FM contracts and will be joined by Univac and CDC, who have announced intentions to market FM services. The RCA Services Division should find FM opportunities among RCA customers. IBM has several ways in which it can enter FM, and will show an expanding profile.

TABLE C-Major Vendors
(FM revenues at an annual rate in 1971, in millions of dollars)

TOTAL FM REVENUES
1.  Electronic Data Systems Corp.               95.7
2.  McDonnell Douglas Automation Co.            89.4
3.  Boeing Computer Services Inc.               82.2
4.  University Computing Company                42.2
5.  Computer Sciences Corp.                     26.6
6.  Grumman Data Systems                        25.6
7.  Computing and Software, Inc.                14.7
8.  System Development Corp.                    14.0
9.  Martin Marietta Data Systems                11.5
10. Westinghouse Tele-Computer Systems Corp.    11.0
11. MISCO                                       10.0
12. National Sharedata Corp.                     7.5
13. A. O. Smith Corp.'s Data Systems Div.        5.4
14. Executive Computer Systems, Inc.             5.2
15. Unionamerica Computer Corp.                  5.0
16. Cambridge Computer Corp.                     4.4
17. Greyhound Computer Corp.                     4.3
18. Tracor Computing Corp.                       4.0
19. Mentor Corp.                                 4.0
20. Programming Methods, Inc. (PMI)              3.8

AVAILABLE FM REVENUES
Electronic Data Systems Corp.                   95.7
Computer Sciences Corp.                         26.6
Boeing Computer Services Inc.                   22.2
Computing and Software, Inc.                    14.7
System Development Corp.                        12.8
University Computing Company                    10.1
National Sharedata Corp.                         7.5
McDonnell Douglas Automation Co.                 7.1
Executive Computer Systems, Inc.                 5.2
Cambridge Computer Corp.                         4.4
Greyhound Computer Corp.                         4.3
Tracor Computing Corp.                           4.0
Programming Methods, Inc. (PMI)                  3.8
MISCO                                            3.0
Allen Babock Computing Corp.                     2.9
RAAM Information Services Corp.                  2.5
Data Facilities Management Inc.                  2.3
Bradford Computer and Systems, Inc.              2.0
Computer Usage Co., Inc.                         2.0
Martin Marietta Data Systems                     1.5

Contract values

The average FM contract in 1971 is valued at slightly less than $2 million. This is the equivalent of a user with two or three computers, one at least a 360/50. However, this is based on the total market analysis, which distorts the averages for captive and available FM markets. An analysis of available and captive contracts shows that the average value of an available contract drops to $1.24 million, which would be equivalent to a user with
two 360/40's. Analysis of captive contracts, however, shows that the average value is significantly higher, at $5.6 million per year. Most of these contracts are spinoffs from large industrial firms who have centralized computer installations or multiple installations spread throughout the country.

A more revealing analysis of contract values is shown in Table D. Here total and available contracts are distributed according to contract value. From this analysis, it is clear that well over one-third of contracts are valued at $300,000 or less per year. A typical computer installation of this value would include a 360/30, 360/20, 360/25, or equivalent computers in other manufacturers' lines.

TABLE D-FM Contract Analysis by Value

CONTRACT VALUE           ALL CONTRACTS         AVAILABLE CONTRACTS
PER YEAR ($ MILLIONS)     #         %            #         %
0.1-0.3                  129       38.3         122       42.8
0.31-0.5                  40       11.9          30       10.5
0.51-0.8                  39       11.6          34       11.9
>0.8                     129       38.2          99       34.8
TOTAL                    337      100.0         285      100.0

Average contract value: $1.91 million
285 available contracts - average value: $1.24 million
52 captive contracts - average value: $5.61 million

There are in total 18 contracts, captive or available, valued at more than $5 million per year. These are all spinoff parent or Federal Government contracts.

Projected 1977 FM market

FM market potential

The ultimate U.S. market potential for FM is the sum of EDP expenditures for all users. By 1977, EDP expenditures for equipment, salaries, and services will total $29.5 billion spread among 52.4 thousand users. Since FM benefits are not available to all users, five criteria were developed to help identify the industry sectors which could most benefit from FM and would be most amenable to accepting FM as an alternative approach for EDP. The five criteria are:

• Homogeneous Business Methods. Industries with similar information requirements from company to company are ideal situations for FM. These might be the coding of business records, such as the MICR codes used in banks, or price standards, e.g., tariffs used in motor freight.
• Similar Products or Services. The more similar the products and services sold by companies within a given industry sector, the more likely they will have common business procedures and, therefore, EDP systems. In the brokerage industry, for example, there is little differentiation in the service provided.
• Regulation by Government Agencies. Industries that are regulated directly by State/Federal agencies or indirectly through strong trade associations also become good candidates for FM because of the enforced standards for pricing, operating procedures, account books, or other factors that impact EDP operations. Health insurance firms, utilities of all types, and brokerage firms are typical of these highly regulated industries.
• Prior Evidence of Subcontracting Services. Prior company or industry practices which indicate that subcontracting of vital services is an accepted business procedure also help pinpoint industries with high FM potential. Correspondent relationships between smaller banks and larger city banks, historically a part of the banking industry, are an example.
• Special EDP Operating Problems. Several industries have special EDP operating problems. These may result from historically low pay scales for EDP personnel which cannot easily be changed, such as in state and local government, or from a pending major conversion in basic accounting approaches imposed by an outside force, resulting in major EDP conversions, as was the case in health insurance when Medicare and state health programs were implemented in the 1960's.
The above criteria were applied against major industry sectors. As a result, 16 sectors were identified and ranked according to their suitability for FM. (See Table E.) On the basis of this analysis, the industry sectors most likely to benefit from FM include banking (mainly commercial), insurance, state and local governments, the Federal Government, motor freight, brokerage, and medical (hospitals and health insurance firms). Of these, the Federal Government and medical sectors are already established FM markets and will grow more slowly as a result.

TABLE E-High Growth Potential FM Markets

Selection criteria: Homogeneous Business Methods; Similar Products or Services; Regulation by Government Agencies; Prior Evidence of Subcontracting Services; Special EDP Operating Problems.

Industry sectors, ranked by suitability for FM: Medical-Health Insurance; Federal Government; Banking-Commercial; Insurance; State and Local Government; Motor Freight; Brokerage; Utilities (all except telephone); Medical-Hospitals, Clinics; Regional/Interstate Airlines; Mutual Fund Accounting; Banking-Savings; Small & Medium Aerospace Cos.; Education-Elementary and Secondary; Education-College; Construction and Mining.

Projected FM revenues, 1971-1977

Actual realized FM revenues will be $2.86 billion in 1977. This is a 28 percent annual growth from $645 million in 1971. Total contracts will increase to 2,995 in 1977 from 337 in 1971, with an average contract value of slightly less than $1 million. The available portion of the 1977 FM market will total $1.97 billion, up from $354 million in 1971, a growth of over 500 percent. Related contracts will be between 2,000 and 2,200 in 1977, up from 285 in 1971. See Tables F and G.

TABLE F-Total FM Revenue Growth 1971-1977

                                           1971                      1977                 Compound Annual
                                    Revenues   Contracts      Revenues   Contracts        Growth Rate of
                                    $ Millions     #          $ Millions     #            FM Revenues (%)
Medical and Other                      104         89            446        635                 27
Finance                                 88         89            590        590                 37
Discrete Manufacturing                 146         42            389        255                 18
Process Manufacturing                  144         34            344        200                 16
Government-State and Local               7         15            350        350                 92
Government-Federal                     122         36            236        185                 12
Utilities and Transportation            20         13            318        400                 58
Wholesale and Retail Trade              13         17            160        320                 52
EDP Service Bureaus and NIS Operators    1          2             23         60                 69
Total                                  645        337           2856       2995                 28

TABLE G-Available FM Revenue Growth 1971-1977

                                           1971                      1977                 Compound Annual
                                    Revenues   Contracts      Revenues   Contracts        Growth Rate of
                                    $ Millions     #          $ Millions     #            FM Revenues (%)
Government-State and Local               7         15            350        350                 92
Finance                                 77         76            503        480-510             37
Medical and Other                       88         81            280        380-400             21
Government-Federal                     122         36            236        185                 12
Wholesale and Retail Trade              12         16            133        250-270             49
Utilities and Transportation             3          6            196        235-245            137
Discrete Manufacturing                  17         32            135         90-95              41
Process Manufacturing                   27         22            129         70-75              30
EDP Service Bureaus and NIS Operators    1          1              6         12-14              35
Total                                  354        285           1968      2052-2144             33

Who are the vendors?

Independent vendors will retain the same share of the total FM market in 1977 as in 1971. Computer manufacturers will increase their penetration in the FM business, primarily to protect installations that are threatened by competitive equipment. See Figure 4.

Figure 4-1977 FM markets by type of vendor (total $2,856 million; available $1,968 million; independents $1,396 million)

HOW TO EVALUATE FM PROPOSALS

Know what benefits are desired

For the purposes of reading this segment, assume you are an EDP user considering a proposed FM contract. Assume further that by reading the previous material, you have concluded that, indeed, FM can benefit your company, both in terms of improved EDP operations and in improved information flow to the operating departments. But now you must get specific about the vendor, his proposal, and finally the detailed provisions of the contract he wishes you to sign. In this chapter we will provide the guidelines you can use to make these evaluations. Before digging into the evaluation guidelines, you should first articulate just what you, the management, and the current EDP department are expecting in the way of benefits.
HOW TO EVALUATE FM PROPOSALS

Know what benefits are desired

For the purposes of reading this segment, assume you are an EDP user considering a proposed FM contract. Assume further that by reading the previous material, you have concluded that, indeed, FM can benefit your company, both in terms of improved EDP operations and in improved information flow to the operating departments. But now you must get specific about the vendor, his proposal, and finally the detailed provisions of the contract he wishes you to sign. In this chapter we will provide the guidelines you can use to make these evaluations.

Before digging into the evaluation guidelines, you should first articulate just what you, the management, and the current EDP department are expecting in the way of benefits. By doing this, you can compare your expectations as a customer with what the FM vendor is willing and able to provide. Have you had a poor experience with EDP? Is your primary objective to get out of the operating problems of an EDP department? If this is the case, then don't expect immediate improvements in the information you are receiving from EDP and the speed in which it flows to your operating departments-even if you have been told by the FM vendor this is to be the case. On the other hand, if your real goal is speeding order entry and decreasing finished goods inventory by a factor of three without a major investment in new applications software, then these are the points an FM vendor should be addressing in his proposal and you will want to evaluate him on this basis. Assuming you and the vendor have agreed on a set of expectations, let us look at the guidelines you can use in evaluating the vendor, his proposal and the FM contract:

Evaluating vendor and his proposal

Vendor

Three potential problem areas should be explored to accurately appraise an FM vendor. These are: financial stability, past FM record, and level of industry expertise.

• Financial Stability. Financial stability of the vendor is a critical issue to pin down, for if he is in difficulty, such as being short of working capital, your information flow from EDP could be stopped, leaving you in an extremely serious and vulnerable position.
• Previous FM Performance Record. Next to the financial record, the vendor's previous performance in FM as well as in other data services can be a good guide to his future performance on your contract. If the vendor has done well in past contracts, he no doubt will use his past work as a "showcase" and invite your visit to current sites he has under contract. However, the absence of these referrals should not be taken negatively due to the possible proprietary nature of current FM work.
• Industry Expertise. Full knowledge of your industry and its detailed operating problems should be demonstrated completely by the vendor. This should include full appreciation for the operating parameters most sensitive for profitability in your industry and company. The vendor should be staffed with personnel who have had top management experience in the specific industry and people who have had experience in other specific industries. Vendors become more credible if they can show existing customers who are pleased with the vendor's services and who will testify to his ability to solve specific industry-oriented EDP problems.
• Proposal Responsiveness. The proposal should be addressed specifically to the objectives that you and the vendor agreed were the purpose of considering the FM contract. The vendor should detail exactly how he will improve your EDP operation or provide faster or improved information to serve your operating areas. He should suggest where savings can be made or what specific actions he can take that are not now being taken to effect these savings.
• Work Schedule for Information Reports. While it is not desirable to pin the vendor down to an operating schedule for the EDP department-for it is exactly this flexibility that allows him to achieve economies of scale-he should, however, be very specific about the schedule for delivery of required reports. If you have a data-entry problem, for example, then the proposal should indicate that the computer will be available when you need to enter data.
The work schedule should reflect as closely as possible the current way in which you do business, and any change should be fully justified in terms of how it can improve the operation of the whole company, not just the operation of the EDP department.
• Equipment Transfer. Details of equipment ownership and any transfers to the vendor should be specified. Responsibility for rental or lease payments should also be detailed. Responsibility for maintenance not built into equipment rentals or leases should also be delineated.
• Cost Schedule. Contract pricing is the most critical cost item. A fixed-fee contract is advantageous to both parties if the customer's business volume is expected to continue at current levels or grow. If business drops, a fixed fee could hurt the customer. Thus, the fairest pricing formula is composed of two components: a fixed fee to cover basic operating costs and a variable fee based on revenue, number of orders, or some easily identifiable variable sensitive to business volume. Some contracts also include a cost-of-living escalator. The proposed cost schedule should also take into account equipment payments, wages, salary schedules, travel expenses, overhead to be paid to the customer (if the vendor occupies space in the customer's facilities) and all other expenses that might occur during the course of the contract. If special software programming or documentation is to be performed for the customer, the hourly rates to be charged should be identified in the contract.
• Vendor Liaison. A good proposal recognizes the need for continuing contact between top management and the FM vendor. Close liaison is especially required in the early days of the contract, but also throughout its life. The cost for this liaison person should be borne by the customer, but the responsibilities and the functions that will be expected of him should be clearly stated in the proposal.
• Personnel Transfer. Since all or most of the personnel in EDP operations will be transferred to the FM vendor, you must make sure that this will be an orderly transfer. Several questions arise in almost every contract situation and should be covered in the proposal: Does the proposal anticipate the possible personnel management problems that might come about? If all personnel are not being transferred and some may be terminated, how will this be handled? Are FM vendor personnel policies consistent with yours? Has the vendor taken into account the possibility that large numbers of persons may not wish to join the vendor and may leave?
• Failure to Perform. While it is most desirable to emphasize the positive aspects of an FM relationship, the negative possibilities should be explored to the satisfaction of both parties. Most of these revolve around failure to perform. If the vendor fails to perform his part of the contract, you should be able to terminate the contract. The proposal should detail how this termination can be carried out. Is the vendor, for example, obligated to permit you to recover your original status and reinstall your in-house computer? What are the penalties the vendor will incur? What is the extent of his liabilities to replace lost revenue, lost profits that you may suffer as a result of his failure to perform? How will these lost revenues and lost profits be identified and measured? That's the vendor's side, but you also have obligations as a customer.
If your input data is not made available according to schedule, for example, what is your possible exposure in terms of late reports?
• Software. It is important to pin down ownership of existing software when an FM contract is signed and of any subsequent software that is developed. Proprietary as well as non-proprietary software packages should be identified and specified in the report so that competitors may not benefit unfairly if the FM vendor uses the packages with other clients in your industry. Software backup and full documentation procedures should also be identified. This is one area in which FM may be a great help. If your installation is typical, your backup and documentation procedures are weak, and an FM vendor, using professional approaches, should be able to improve your disaster recovery potential.

FM contract: Marriage of porcupines

The FM contract should incorporate all the above issues, plus any others which are uniquely critical, in an organized format for signing. The FM contract is as legal a document as any other the company might enter into; therefore, the customer's legal staff should carefully review it in advance of any signing. The body of a typical FM contract shows the general issues which have been discussed above and which apply in most FM contract situations. Attachments are used to detail specific information about the customer that is proprietary in an FM contract. Attachments discuss the service and time schedule, equipment ownership and responsibility, cost schedule and any special issues.

One of the most striking features is the general absence of detailed legal jargon. This is typical in most FM contracts and is a result of two factors. First, the two parties have attempted to communicate with each other in the language that both understand. Second, the wording reflects an aura of trust between the two parties. In a service subcontracting relationship the customer must implicitly trust the vendor. Without this mutual trust, it would be foolish for a vendor or a customer to even consider a proposal.

BIBLIOGRAPHY

EDP productivity at 50%? Administrative Management June 1971 pp 67-67
P J McGOVERN Interest in facilities management-Whatever it is-Blossoms EDP Industry Report April 30 1971
D M PARNELL JR A new concept: EDP facilities management Administrative Management September 1970 pp 20-24
I POLISKI Facilities management-Cure-all for DP headaches? Business Automation March 1 1971 pp 27-34
A RICHMAN Oklahoma bank opts for FM Bank Systems and Equipment February 1970 pp 18-32
EDP-What top management expects Banking April 1972 pp 18-32
L W SMALL Special report on bank automation Banking April 1971
Facilities management users not sure they're using-If they are Datamation January 1 1971 p 54
When EDP goes back to the experts Business Week October 18 1969 pp 114-116
KUTTNER et al Is facilities management solution to EDP problems? The National Underwriter January 23 1971
H C LUCAS JR The user data processing interface Working Paper #177 Graduate School of Business Stanford University
Quantum Science Corporation Reports: Dedicated information services July 1970; Facilities management-How much of a gamble? November 1971; Federal information services October 1971; Network information services April 1971

Automated map reading and analysis by computer

by R. H. COFER and J. T. TOU
University of Florida
Gainesville, Florida

INTRODUCTION

A great deal of attention is presently being given to the design of computer programs to recognize and describe two-dimensional pictorial symbology. This symbology may arise from natural sources such as scenery or from more conventionalized sources such as text or mathematical notation. The standardized graphics used in specification of topographic maps also form a conventionalized, two-dimensional class of symbology. This paper will discuss the automated perception of the pictorial symbology to be found within topographic maps. Although conventionalized, this symbology is used in description of natural terrain, and therefore has many of the characteristics of more complex scenery such as is found within aerial photography. Thus it is anticipated that the techniques involved may be applied to a broader class of symbology with equal effectiveness.

The overall hardware system is illustrated by Figure 1. A map region is scanned optically and a digitized version of the region is fed into the memory of a computer. The computer perceives in this digitized data the pictorial symbology portrayed and produces a structured output description. This description may then be used as direct input to cartographic information retrieval, editing, land-use or analysis programs.

[Figure 1-The overall hardware system]

THE PROGRAM

Many results of an extensive research into the perception of pictorial symbology have been incorporated into a computer program which recognizes a variety of map symbology under normal conditions of overlap and breakage. The program is called MAPPS since it performs Machine Automated Perception of Pictorial Symbology. MAPPS is written in the PL/I programming language, heavily utilizing the language's list and recursive facilities. It is operated on the University of Florida's IBM 360/65 computer utilizing less than 100K words of direct storage.

Although the set of possible map symbols is quite large, those used in modern topographic maps form the three classes shown in Figure 2. Point symbology is used to represent those map features characterized by specific spatial point locations. This class of symbology is normally utilized to represent cultural artifacts such as buildings, markers, buoys, etc. Lineal symbology is used to mark those features possessing freedom of curvature. This class is normally utilized to represent divisional boundaries, or routes of transportation. Typical examples of lineal symbology include roads, railways, and terrain contours as well as various boundary lines. Area symbology is used to represent those features possessing homogeneity over some spatial region. It is normally composed of repeated instances of point symbology or uniform shading of the region. Examples include swamps, orchards, and rivers.

[Figure 2-Classes of map symbology: point, lineal, and area symbology]

As its extension to the recognition of area symbology is rather straightforward, MAPPS has been designed to recognize the point and lineal forms of symbology only. Further, it has been designed to recognize only that subset of point and lineal symbology which possesses topographically fixed line structures. This restriction is of a minor nature since essentially all map symbology is, or may be easily converted to be, of this form. Even given these restrictions, MAPPS has immediate practical utility since many applications of map reading require only partial recognition of the symbology of a given map. As an example, the survey of cultural artifacts can be largely limited to the recognition of quite restricted forms of point and lineal symbology.

Color information provides strong perceptual clues in maps. On standard topographic maps, for instance, blue represents hydrographic features, brown represents terrain features, and black represents cultural features. Even so, utilization of color clues is not incorporated into MAPPS. This has been done to provide a more stringent test of other more fundamental techniques of recognition. It is obvious, however, that utilization of color descriptors can be easily incorporated, and will result in increased speed of execution and improved accuracy of recognition.

MAPPS is divided into three systems: picture acquisition, line extraction, and perception. In brief, the picture acquisition system inputs regions of the map into the computer, the line extraction system constructs data entities for each elementary line and node present in the input, and the perception system conducts the recognition of selected symbology. A flow-chart of MAPPS is shown in Figure 3.

[Figure 3-MAPPS flow chart: the picture acquisition system produces a gray-level picture array and then a binary picture array, the line extraction system produces a processed list of lines and nodes, and the symbology perception system, guided by a user query and a description of the desired symbology, produces isolated symbology-based list structures and recognized map symbology]

PICTURE ACQUISITION

The picture acquisition system PIDAC is a hardware system developed by the authors to perform precision scanning of 35 mm transparencies within a research environment.1 It consists of a flying-spot scanner, minicomputer, disk memory, storage display, and incremental tape unit. In operation, PIDAC scans a transparency, measures the optical density of the transparency at each point, stores the results in digital form, performs limited preprocessing actions, and generates a digital magnetic record of the acquired data for use by the IBM 360/65 computer.

For each transparency, PIDAC scans a raster of 800 rows by 1024 columns; a square inch in the film plane thus corresponds to approximately 10^6 points. At each raster point PIDAC constructs a 3-bit number corresponding to the optical density at that point. As the original map may be considered to be black and white, a preprocessing routine, operating locally, dynamically reduces the 3-bit code to a 1-bit code in a near optimal fashion. This action is accomplished by a routine called COMPACT, since it compacts storage requirements as well. The result is an array whose elements correspond to the digitized points of the map region. This compacted array is then input to the line extraction system.
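The paper does not spell out how COMPACT chooses its local, dynamic threshold, so the sketch below is only one plausible reading of that description: each 3-bit density value is compared against the mean of its neighborhood. The window size and the threshold rule are assumptions, not the authors' algorithm.

```python
# Minimal sketch of a COMPACT-style reduction from a 3-bit gray-level array
# (optical densities 0-7, larger = darker) to a 1-bit black/white array.
# The neighborhood size and mean-based threshold are illustrative assumptions.
from typing import List

def compact(gray: List[List[int]], window: int = 8) -> List[List[int]]:
    """Return a binary array: 1 marks an ink (black) point, 0 marks background."""
    rows, cols = len(gray), len(gray[0])
    out = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            r0, r1 = max(0, i - window // 2), min(rows, i + window // 2 + 1)
            c0, c1 = max(0, j - window // 2), min(cols, j + window // 2 + 1)
            block = [gray[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            threshold = sum(block) / len(block)   # local, dynamically computed
            out[i][j] = 1 if gray[i][j] > threshold else 0
    return out
```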
The function of this system is the extraction of each of the elementary line segments represented in the map, so that the program can conduct perception in terms of line segments and nodes rather than having to deal with the more numerous set of digitized points. The system of line extraction, as developed, does not destroy or significantly distort the· basic information contained within the map. This is necessary since significant degradation makes later perception more difficult or impossible. The action of the line extraction system is illustrated in Figure 4. First the map is cleared of all small holes and specks likely to have resulted from noise. Then a connected medial axis network of points is obtained for each black region of the map. This first approximation to a desired line network is converted to a true line network by an algorithm called 4-point loop removal. Operating on a more global basis, later algorithms remove spurious lines and nodes, locate possible corner nodes, and convert to a more suitable list processing form of data base. For each line and node, a PL/I based structure is produced. Each structure contains attributes and pointers to adjacent and nearby data entities. The structure for a line entity contains the attributes of width, length, grid interest coding, as well as pointers to adjacent nodes. The structure for each node entity contains the attribute of position and pointers to adjacent lines and nearby nodes. The line extraction system, being somewhat intricate, has been discussed in detail in a prior paper. 2 Abstractly, each state S of the system can be viewed as responding to distortions occurring within the map. These distortions may be characterized by a set of context sensitive productions of the form Rr(i,j)Rzn(i,j)~Rr(i,j)Rlnf(i,j) z= 1, 2, ... , Ns RIm (i, j) represent some region about the point (i, j) having a fixed size and gray-level distribution. Rzn(i, j) and· Rlnf (i, j) represent regions of the point (i, j) having the same fixed size but differing gray-levels. By inversion of the production sets, each stage can be described as the repetitive application of the rules Rr(i,j)Rzn f (i,j)~Rr(i,j)Rln(i,j) l= 1,2, ... ,Ns in forward raster sequence to the points (i, j) of the map until no further response is obtainable. As an example, one such rule {M(i+l,j) =0, M(i-l,j) =O} {M(i,j) =1} Figure 4-Action of line extraction system ~{M(i+l,j) =0, M(i-l,j) =O} {M(i, j) =2} 138 Fall Joint Computer Conference, 1972 is used in the medial axis determination to mark object regions of width 1 as possible line points for further investigation. It is important to observe the degree of data reduction and organization which is accomplished through the extraction of line data. As previously mentioned, even a small map region contains a huge number of nearly 106 picture points. The extracted list structure typically contains no more than 300 lines and node points thereby resulting in a very significant data reduction. Equally significant, the data format of the list structure permits efficient access of all information to be required in the perception of symbology. The digitized map array therefore may be erased to free valuable storage for other purposes. PERCEPTION OF SYMBOLOGY It is interesting to observe that certain familiar pattern recognition procedures cannot be directly used in the recognition of map symbology. 
PERCEPTION OF SYMBOLOGY

It is interesting to observe that certain familiar pattern recognition procedures cannot be directly used in the recognition of map symbology. This results from the fact that in cartography, symbology cannot be well isolated, as there are often requirements for overlap or overlay of symbology in order to meet spatial positioning constraints. Many of the techniques used for recognition of isolated symbology, such as correlation or template matching of characters, cannot be used to recognize such non-isolated symbology and are thus not very powerful in map reading. In MAPPS, alternative techniques have been employed to accomplish isolation of symbology in addition to its recognition.

THE CONCEPT OF ISOLATION PROCESSING

Conceptually, isolation of symbology from within a map cannot be accomplished in vacuo. Isolation requires some partial recognition, while recognition generally requires some partial isolation. This necessitates the use of a procedure in which isolation is accompanied by partial recognition. In order to guide this procedure, there must exist some a priori knowledge about the structure of the symbology being sought. The underlying structure of pictorial symbology, such as is present in maps and elsewhere, is found to be that of planar graphs upon the removal of all metric constraints. Using this structure, the isolation process functions by sifting through the data of the map, proposing possible instances of pattern symbology on a graph-theoretic equivalency basis, thereby suppressing extraneous background detail.

Two types of graph equivalency are used in isolation. These are

• isomorphism
• homomorphism

One graph is isomorphic to another if there exists a one-to-one correspondence between their nodes which preserves adjacency of edges. A graph is homomorphic to another if it can be obtained from the other by subdivision of edges. Figure 5 yields an instance of isomorphic and of homomorphic equivalence of graphs.

[Figure 5-Graph equivalencies]

Using the above definitions of graph equivalency, the process of isolation can be achieved by means dependent upon and able to cope with the types of structural degradation, Figure 6, found within actual maps. For instance, should a map contain no structural degradation, then on the basis of graph structures only, it is necessary and sufficient to propose as possible symbology isolations those components of the map which are isomorphic to the symbology being sought. If the map contains no crossing, overlay, or breakage of lines, then on the basis of graph structures only, it is necessary and sufficient to propose as possible symbology isolations those partial subgraphs of the map which are isomorphic to the symbology. If the map contains no breakage or overlay of lines, then it is necessary and sufficient to propose those partial subgraphs which are homomorphic to the symbology sought. Finally, if a map contains as the only forms of structural degradation line crossovers, line breakage, node overlay, and uncertain location of corner nodes, first complete the map by filling in all possible instances of line breakage. Then it is necessary and essentially sufficient to propose as possible pattern isolations those partial subgraphs of the completed map which have no two adjacent broken edges and which are homomorphic to the symbology sought.

[Figure 6-Structural degradations occurring in MAPPS: crossing of lines, breakage of lines, overlay of nodes, overlay of lines, uncertain location of corner nodes]
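The isomorphism notion used in the simplest of these criteria can be illustrated with a brute-force sketch that tries every node correspondence; it is only a demonstration of the definition for very small graphs and is not the efficient procedure the paper cites.

```python
# Sketch of the graph-isomorphism test described above: search every one-to-one
# node correspondence and check that edge adjacency is preserved.  Practical
# only for tiny graphs; shown purely to illustrate the definition.
from itertools import permutations
from typing import Dict, FrozenSet, Set

Graph = Dict[int, Set[int]]   # node -> set of adjacent nodes (undirected)

def edges(g: Graph) -> Set[FrozenSet[int]]:
    return {frozenset((u, v)) for u, nbrs in g.items() for v in nbrs}

def isomorphic(g1: Graph, g2: Graph) -> bool:
    if len(g1) != len(g2) or len(edges(g1)) != len(edges(g2)):
        return False
    nodes1, nodes2, e2 = list(g1), list(g2), edges(g2)
    for perm in permutations(nodes2):
        mapping = dict(zip(nodes1, perm))
        if all(frozenset((mapping[u], mapping[v])) in e2
               for u, nbrs in g1.items() for v in nbrs):
            return True
    return False

# A triangle is isomorphic to any relabelled triangle:
tri_a = {1: {2, 3}, 2: {1, 3}, 3: {1, 2}}
tri_b = {7: {8, 9}, 8: {7, 9}, 9: {7, 8}}
assert isomorphic(tri_a, tri_b)
```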
Although the process of isomorphic matching of graphs can be conducted rather efficiently,4 the more realistic process of homomorphic matching requires the investigation of large numbers of partial subgraphs of the map for possible equivalency to pattern symbology. In order to limit the number of partial subgraphs which need be checked for homomorphic match, metric equivalency tests have been integrated into the graph-theoretic isolation process. These tests include bounds checking of the lengths, curvatures, thicknesses, and angles between lines, and may be easily extended as required.

If the metric tests are well chosen then they will be conservative, i.e., will not reject any true instance of pattern symbology. This may be seen by viewing the various screening quantities as features in the feature space of Figure 7. The ensemble of all true instances of pattern symbology will form some region A in the space, Figure 7. Any set of metric tests may also be viewed as partitioning the feature space, passing only those instances of symbology which lie in some region B of the space formed by the partition. If region B contains region A, then the set of tests is conservative. If region B exactly contains region A, then the set of tests also forms a perfect recognizer. It is more important, however, that the tests result in a high processing efficiency. This may be achieved by immediate testing of each feature as it is first calculated. This form of testing generates a partition which boxes in some region of feature space as shown in Figure 7c. While this partition is not necessarily perfect, it is usually possible to adjust the bounds of the tests so as to achieve a near optimal, as well as conservative, performance on the basis of limited sampling of pattern symbology within a map, Figure 7d. Thus the isolation process may also often serve well as the final stage of recognition. When desired, however, it is always possible to concatenate other more conventional recognition processes in order to achieve yet higher accuracies of recognition.

[Figure 7-Partitioning of feature space by metric tests: (a) a feature space; (b) region containing instances of pattern symbology; (c) region formed by bounds testing of features; (d) best conservative region formed by bound testing of features]

[Figure 8-The structure of a pattern symbol: a pattern symbol S, its spanning tree Ts, the elements Ts(i) of Ts, and application to a map]

The routine MATCH

Application of the search for graphical and metric equivalencies is conducted via a recursive routine called MATCH. On the graph-theoretic level, MATCH functions through utilization of tree programming. In this approach, a spanning tree Ts is pre-specified for each pattern symbol S. The elements of Ts are named Ts(i), i = 1, 2, ..., Ns, where Ts(i) is constrained to be connected to Ts(1) through the set {Ts(j), j = 1, 2, ..., i}. These structural conventions, illustrated by Figure 8, are developed to permit utilization of a recursive search policy in matching Ts and the partial subgraphs of a map Gm.

The recursive structure of MATCH is shown in Figure 9. It has one entry from and two exits back to the calling program. Being recursive, it can call itself. At the ith level of recursion, MATCH investigates the possibilities of homomorphic equivalence of elements of Gm to Ts(i). As each possibility is proposed, MATCH checks to insure that all implied graph-theoretic equivalencies between Ts(j), j = 1, 2, ..., i, and Gm are acceptable, and that basic metric equivalences are met. More explicitly, at the recursive level i, MATCH takes the following action. If Ts has been fully matched, then MATCH takes a FINAL SUCCESS exit which carries it back up the recursive string with the isolated symbology from Gm. If all matching possibilities for Ts(i), i > 1, have been exhausted, then MATCH takes an ERROR RETURN exit back to the (i-1)th level of recursion in order to try to find other matching possibilities for Ts(i-1). Alternatively, if all matching possibilities for Ts(i), i = 1, have been exhausted, then MATCH fails to isolate the symbology sought and exits along the ERROR RETURN exit to the calling program. On the other hand, if it finds an acceptable match for Ts(i), then it exits via the TEMPORARY SUCCESS exit to continue the matching search for Ts(i+1). At each recursive level, MATCH performs one of three specific actions: matching to nodes of Ts, matching to lines of Ts, and initial matching to new pattern components of Ts.

[Figure 9-Structure of MATCH: entry from the calling routines, retrieval of the next match possibility for the current match level, recursive invocation of MATCH, and the FINAL SUCCESS, TEMPORARY SUCCESS, and ERROR RETURN exits]

Matching of nodes

The fundamental operation performed by MATCH is the matching of the immediate neighborhood of a node of Gm to that of Ts. This matching must satisfy several constraints. It must be feasible, must satisfy certain angular conditions, and must not violate any prior matching assumptions. A matching is feasible if the degree of the node of Gm is greater than or equal to that of the node of Ts. This requirement, for example, results in termination of the matching of the map segment of Figure 8 at recursive stage 16 because the degree of node Ts(16) was greater than the degree of the corresponding node of Gm. A matching satisfies necessary angular constraints if all internal angles of the planar graphs of Gm and Ts are sufficiently similar. It satisfies prior matching assumptions if the present matching attempt is not in conflict with previous matching attempts or involves lines of Gm which are matched to other pattern symbology. The neighborhood of a node of Ts is considered to be fully matched when the node and its adjacent lines are matched to a node of Gm and some subset of its adjacent lines. For instance, if the conditions represented in Figure 10a hold upon a call of MATCH, then Figure 10b shows a suitable match of the neighborhoods.

[Figure 10-Matching of node neighborhoods: (a) node neighborhoods before matching; (b) node neighborhoods after matching]

Matching of lines

In matching a line of Ts to elements of Gm, MATCH finds a path in Gm which corresponds to the line of Ts. This path may contain one or more elementary lines and may even contain breaks. The path must, however, satisfy minimal constraints. It must not cross over itself, no portion of the path other than endpoints may have been previously matched, no breaks may be adjacent, the implied endpoint matchings must be consistent with prior matchings, and finally certain metric equivalencies must be observed. Typically these metric equivalencies need be no more complex than a rough correspondence of length and curvature between the line of Ts and the path within Gm. As an example of the matching of lines consider Figure 11. If the conditions of Figure 11a hold upon a call of MATCH, then Figure 11b shows a suitable matching between the line of Ts and a path within Gm.

[Figure 11-Matching of line regions: (a) line regions before matching; (b) line regions after matching]

Initial matching of components

Matching of nodes and lines of connected symbology is conducted by the tracking of connectivity via Ts. This technique may be extended to the matching of symbology S composed of disjoint components through inclusion of lines within Ts which link nodes of the various components of S. These lines may, for matching purposes, be treated as straight lines of Ts, thereby simplifying the matching process.

FINAL CLASSIFICATION

MAPPS has the capability for inclusion of a final classification routine (CLASS). When used, this routine serves to provide a final decision as to whether an isolated piece of symbology is a true instance of the symbology being sought. If the isolation is determined to be erroneous, then MATCH continues its search toward an alternative isolation. CLASS may be implemented by a variety of approaches, the best known and most appropriate of which is through use of discriminant functions. The power of its application can be dramatically enhanced through proper use of results from MATCH. For example, quite complex features can be devised for input to CLASS from the very detailed description of the isolated symbology produced by MATCH. As a further example, MATCH may be used to isolate new symbology and tentative classification from a map to form a training set for CLASS. Then, with or without a teacher, the discriminant function underlying CLASS can be perturbed toward a more optimal setting by any of several well-known algorithms.3
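The discriminant-function idea behind CLASS can be pictured with a short sketch. The perceptron-style update below is one of the well-known training procedures alluded to, chosen here only for illustration; the feature vector and names are assumptions and not the authors' implementation.

```python
# Sketch of a linear discriminant for a CLASS-style final decision, assuming a
# simple perceptron update; the features (derived from MATCH results, e.g.
# total path length or number of bridged breaks) are illustrative only.
from typing import List, Sequence, Tuple

def classify(x: Sequence[float], weights: Sequence[float], bias: float) -> bool:
    """Accept the isolation when the discriminant g(x) = w.x + b is positive."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias > 0.0

def train(samples: List[Tuple[Sequence[float], bool]],
          n_features: int, epochs: int = 50, rate: float = 0.1):
    """Perturb the weights toward a setting that separates true and false isolations."""
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        for x, is_true_instance in samples:
            if classify(x, weights, bias) != is_true_instance:
                sign = 1.0 if is_true_instance else -1.0
                weights = [w + rate * sign * xi for w, xi in zip(weights, x)]
                bias += rate * sign
    return weights, bias
```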
OUTPUT

The Output Routine (OUTPUT) takes the isolated symbology recognized by earlier routines (MATCH, CLASS, ...) and produces the final MAPPS output in accordance with a specific user query. This is accomplished by establishing a data structure in which data can be retrieved through use of relational pointers. Retrieval is effected by specification of the desired symbology class and by calling various relations. The relation "contains" may be used, for instance, to find the various isolated symbols belonging to a specified symbology class. Another call of "contains" will then result in presentation of all lines and nodes present in the specified symbology. Yet another call of "contains" will return the specific picture points involved. Alternatively, a call of "position" will return the nominal location of the center of each isolated symbology.

RESULTS OF TEST RUNS

MAPPS has been tested on several map regions. In each case CLASS was set to accept all isolations in order to most stringently test the operation of MATCH. Throughout all testing the results were highly satisfactory. Figure 12 presents the results for a representative run. The map region of Figure 12a was fed to the early stages of MAPPS, producing the preprocessed map of Figure 12b. This preprocessed map was then subjected to several searches for specified symbology, resulting in Figures 12c through 12k. In all but one case the recognition was conservative. Only in the case of Figure 12f was a false isolation made. An M was there recognized as an E. Had CLASS been implemented using character recognition techniques, this misrecognition could have been avoided. In those cases where recognition was incomplete, as for the highway of Figure 12c, isolation was terminated by MATCH due to mismatch of structure between the map and symbology sought.

[Figure 12-Test results: (a) the input map; (b) the input map after line extraction; (c) isolated highways; (d) isolated railways; (e) isolated roads; (f) isolated 'E's; (g) buildings; (h) benchmark symbols; (i) churches; (j) swamp symbols; (k) spring symbols]

Some overall statistics on the test run: MAPPS correctly found instances of 18 types of lineal and point symbologies. These instances were formed from 5382 elementary lines. In addition, 7 incorrect instances were isolated, although in each case this could have been avoided by use of a proper classification structure within CLASS. Since minimization of run-time was of minor importance, the average test run for each symbology search of a map region took approximately 10 minutes from input film to output description. It is estimated that this could have been improved very significantly by various means; however this was not a major goal at this stage of research.

CONCLUSION

This work has been an investigation into a broad class of conventionalized, two-dimensional, pictorial patterns: the symbology of maps. Important aspects of the problem involve line extraction, isolation under conditions of qualitatively-defined degradation, use of graph structures and matching techniques in isolation, and interactive recognition of geometrically variable symbology. A sophisticated approach to line extraction yielded a useful data base upon which to conduct symbology isolation and recognition. The use of graph structure and matching in symbology isolation proved very effective. Unexpectedly, it was found to be seldom necessary to resort to formal classification techniques in recognition of the isolated symbology. Such techniques could be incorporated as desired, resulting in a continued search for symbology in case of any misisolation. The program as a whole is able to be expanded to the recognition of a wide variety of graphical symbology. In addition, the concepts involved can quite possibly be applied to the automated perception of gray-level sceneries such as blood cells, aerial photographs, chromosomes, and target detection.

ACKNOWLEDGMENT
The authors would like to acknowledge the interest displayed by other members of the Center for Informatics Research in this and related research. This research has been sponsored in part by the Office of Naval Research under Contract No. N00014-68-A-0173-0001, NR 049-172.

REFERENCES

1 R H COFER Picture acquisition and graphical preprocessing system Proceedings of the Ninth Annual IEEE Region III Convention Charlottesville Virginia 1971
2 R H COFER J T TOU Preprocessing for pictorial pattern recognition Proceedings of the NATO Symposium on Artificial Intelligence Rome Italy 1971
3 J T TOU Engineering principles of pattern recognition Advances in Information Systems Science Vol 1 Plenum Press New York New York 1968
4 G SALTON Information organization and retrieval McGraw-Hill Book Company New York 1968

Computer generated optical sound tracks

by E. K. TUCKER, L. H. BAKER and D. C. BUCKNER
University of California
Los Alamos, New Mexico

INTRODUCTION

For several years various groups at the Los Alamos Scientific Laboratory have been using computer generated motion pictures as an output medium for large simulation and analysis codes.1,2,3 Typically, the numerical output from one simulation run is so large that conventional output media are ineffective. The time-variable medium of motion picture film is required to organize the results into a form that can be readily interpreted. But even this medium cannot always convey all of the information needed. Only a limited number of variables can be distinctly represented before the various representations begin to obscure or obliterate each other. Furthermore, the data presented usually must include a significant amount of explanatory material such as scaling factors, representation keys, and other interpretive aids. If a film is to have long-term usefulness to a number of people, this information must either be included on the film or in a separate writeup that accompanies the film.

In an effort to increase the effective information density of these films, a study was undertaken to determine the feasibility of producing optical sound tracks as well as pictorial images with a microfilm plotter. Some exploratory work done at the Sandia Laboratories, Albuquerque, New Mexico, suggested that this might provide a good solution to the problem.4 It has been demonstrated many times that a sound track facilitates the interpretation of visual presentations.5 However, from our standpoint, the addition of another channel for data presentation was as important as facilitating interpretation. Not only could a sound track present explanatory and narrative material efficiently and appealingly, it could also be used to represent additional data that might otherwise be lost. For example, it is always difficult to clearly represent the movement of many particles within a bounded three-dimensional space. If, however, the collisions of particles, either with each other or with the boundaries of the space, were represented by sounds, interpretation of results would be greatly facilitated. This is feasible only if the sound track is computer produced, not "dubbed in" after the fact.

It should be made clear at this point that it was not an objective of this project to have the computer create all of the waveforms represented on the sound track. What was required was that the computer be able to reproduce on an optical sound track any recorded audible sound, including voices or music. The waveforms that the computer would actually have to create could be limited to some of the sounds we wanted to use as data representations.

OPTICAL SOUND TRACKS

Sound is generated by a vibrating body which produces a series of alternating compressions and rarefactions of some medium, i.e., a wave. As this series is propagated through the medium, particles of the medium are temporarily displaced by varying amounts. We shall speak of the magnitude and direction of this displacement as the instantaneous amplitude of the wave. If the variation of this amplitude can be described as a function of time, a complete description or encoding of the wave is obtained. Thus, a sound wave can be "stored" in various representations, as long as the representation fully describes the variation of amplitude with respect to time.

An optical sound track is one way of representing a sound. It consists of a photographic image which determines the amount of light that can pass through the track area of the film at a given point. As the film is pulled past the reader head, varying amounts of light pass through the film to strike a photocell, producing a proportionally varying electrical signal. A given change in signal amplitude can be produced at the photocell by varying either the area or the intensity of exposure of the sound track image.

Conventional sound tracks are produced by either of two methods. The variable area type of track is produced by having a beam of light of constant intensity pass through a slit of variable length to expose the film. In the variable intensity recording method, either the light's intensity or the slit width can be varied with the slit length held constant. Commercial sound tracks are produced by both methods. In both cases, the sound track image is produced on a special film that is moved past the stationary light source. Separate films of sound track and pictures are then reprinted onto a single film.

Sixteen-millimeter movies with sound have sprocket holes on only one edge. The sound track is located along the other edge of the film (see Figure 1). Such sound tracks are normally limited to reproducing sound with an upper frequency of 5000-6000 Hz. This limitation is imposed by the resolution that can be obtained with relatively inexpensive lens systems, film and processing, and by the sound reproduction system of most 16 mm projectors.6

[Figure 1-A computer generated optical sound track]

[Figure 2-Discrete sampling: time sampling the original signal, and approximating the original signal from the samples]
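The variable-area encoding just described amounts to mapping each sampled amplitude to the length of one transverse line. The sketch below illustrates that mapping; the scaling constants and function names are illustrative assumptions rather than the plotting routine described later in this paper, although the 410 lines per frame figure is taken from the paper's own estimate for roughly 5 kHz at 24 frames per second.

```python
# Sketch of variable-area encoding: each sampled amplitude becomes the length of
# one constant-intensity transverse line in the track area of a frame.
from typing import List

TRACK_WIDTH = 100          # maximum line length, in plotter raster units (assumed)
LINES_PER_FRAME = 410      # per the paper: enough for ~5 kHz at 24 frames/second

def amplitudes_to_line_lengths(samples: List[float], peak: float) -> List[int]:
    """Scale signed samples in -peak..+peak to line lengths 0..TRACK_WIDTH."""
    lengths = []
    for s in samples:
        s = max(-peak, min(peak, s))              # clip out-of-range samples
        fraction = (s + peak) / (2.0 * peak)      # shift to 0..1 about the bias line
        lengths.append(round(fraction * TRACK_WIDTH))
    return lengths

def frames(lengths: List[int]) -> List[List[int]]:
    """Group line lengths into per-frame blocks for plotting one frame at a time."""
    return [lengths[i:i + LINES_PER_FRAME]
            for i in range(0, len(lengths), LINES_PER_FRAME)]
```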
INPUT SIGNALS

In order not to be limited to the use of computer created sounds alone, it was necessary to be able to store other complex audio signals, such as voices, in a form that could be manipulated by a digital computer. As discussed above, any audio signal can be completely described by noting the variation of the signal's amplitude as a function of time. Therefore, the data for a digital approximation of an audio signal can be obtained by periodically sampling the signal's amplitude (see Figure 2). The primary restriction associated with this approach requires that the sampling rate be at least twice the highest frequency contained in the signal.7 In effect, samples obtained at a relatively low sampling rate S from a wave containing relatively high frequencies f will create a spurious "foldover" wave of frequency S-f.

The input for our experimental film was recorded on standard ¼-inch magnetic tape at a speed of 7½ IPS. Frequencies greater than 8000 Hz were filtered out, and the resulting signal was digitized at a sampling rate of 25,000 samples/second. The digitizing was performed on an Astrodata 3906 analog-to-digital converter by the Data Engineering and Processing Division of Sandia Laboratories, Albuquerque. The digital output of this process was on standard ½-inch 7-track digital magnetic tape in a format compatible with a CDC 6600 computer. This digital information served as the audio input for the sound track plotting routine.

PLOTTING THE SOUND TRACK

The sound track plotting routine accepts as input a series of discrete amplitudes which are then appropriately scaled and treated as lengths. These lengths are plotted as successive constant intensity transverse lines in the sound track area of the film. When these lines are plotted close enough together, the result is an evenly exposed image whose width at any point is directly proportional to an instantaneous amplitude of the original audio signal (see Figure 1). Consequently, as this film is pulled past the reader head, the electrical signal produced at the photocell of the projector will approximate the wave form of the original audio signal.

The routine is written to produce one frame's sound track at a time. During this plotting, the film is stationary while the sound track image is produced line by line on the cathode ray tube of the microfilm plotter. The sound reproduction system of a motion picture projector is very sensitive to any gaps or irregularities in the sound track image. Plotting a sound track, therefore, requires very accurate film registration. Furthermore, the sound track image must be aligned in a perfectly vertical orientation. If either the registration or the vertical alignment is off, the track images for successive frames will not butt smoothly together and noise will be produced.

PLOTTER MODIFICATIONS

All of our early experimental films were produced on an SD 4020 microfilm printer/plotter. Three modifications had to be made to the 16 mm camera of this machine in order to make these films. These modifications do not affect any of the camera's normal functions.

In the first modification, the Vought 16 mm camera had to be altered to accommodate single sprocketed 16 mm movie film. For this it was necessary to provide a single sprocketed pull-down assembly. This was accomplished by removing the sprocket teeth on one side of the existing double sprocket pull-down assembly. Next, it was necessary to replace the existing lens with a lens of the proper focal length to enable the camera to plot the sound track at the unsprocketed edge of the film. The lens used was a spare 50 mm lens which had previously been used on the 35 mm camera. With the existing physical mountings in the 4020, this 50 mm lens presents, at the film plane, an image size of approximately 17.5 X 17.5 mm. Thus, with proper raster addressing, a suitable 16 mm image and sound track may be plotted on film. (Increasing the image size in this fashion produces a loss of some effective resolution in the pictorial portion of the frame while the 50 mm lens is in use. This loss of resolution in the picture portion is not particularly penalizing in most applications.) Finally, it was necessary to expand the aperture both horizontally and vertically to allow proper positioning and abutment of the sound track on the film.
By interchanging the new lens with the original lens, normal production can be resumed with no degradation caused by the enlarged aperture and single sprocketed pull-down. No other modifications were required on the SD 4020 in order to implement the sound track option.

The primary difficulty we encountered using the SD 4020 was that we could not get consistently accurate butting of consecutive frames. Therefore, the later films were plotted on an III FR-80, which has pin registered film movement. In order to use this machine, the film transport had to be altered to accommodate single sprocketed film, and the aperture had to be enlarged. A software system tape was produced to allow the sound track image to be plotted at the unsprocketed edge of the film, with the pictorial images still plotted in the normal image space. The FR-80 also provides higher resolution capabilities, so that no loss of effective resolution is incurred when pictorial images and the sound track are plotted in one pass through the machine.

As was discussed earlier, optical sound tracks are usually limited to reproducing sound with an upper frequency of 5000-6000 Hz. Since motion picture film is projected at a rate of 24 frames/second, a minimum of 410 lines per frame are needed to represent such frequencies in the sound track. While we have made no quantitative tests to demonstrate the production of such frequencies, we would expect sufficient resolution to produce frequencies in or near this range with either of the plotters. Our applications so far have not needed the reproduction of sounds in this frequency range.

THE TRACK PLOTTING ROUTINE

The present sound track plotting routine was written with three primary objectives in mind. First, it was felt that it would be advantageous to be able to produce both pictorial imagery and the sound track in one pass through the plotter, with the synchronization of pictures and sound completely under software control. Second, the routine was written to allow the user maximum flexibility and control over his sound track "data files". Finally, the routine was designed to produce film that could be projected with any standard 16 mm projector.

One-pass synchronization

The sound track plotting routine is written to produce one frame's sound track at a time, under the control of any calling program. However, in a projector, the reader head for the sound track is not at the film gate; it is farther along the threading path. The film gate and the reader head are separated by 25 frames of film. Therefore, to synchronize picture and sound, a frame of sound track must lead its corresponding picture frame by this amount so that as a given frame of sound track arrives at the reader head, its corresponding pictorial frame is just reaching the film gate.

In order to be able to generate both picture and sound in one pass through the plotter, it was necessary to build a buffer into the sound track plotting routine. This buffer contains the plotting commands for 26 consecutive frames of film. In this way, a program plotting a pictorial frame still has access to the frame that should contain the sound track for the corresponding picture. The simultaneous treatment of pictorial plot commands puts the synchronization of pictures and sound completely under software control. Furthermore, this can be either the synchronization of sound with picture or the synchronization of picture with sound. This is an important distinction in some applications; the current picture being drawn can determine which sound is to be produced, or a given picture can be produced in response to the behavior of a given sound track wave.

Flexibility

The present routine will read from any number of different digital input files and can handle several files simultaneously. Thus, for example, if one wishes to have a background sound, such as music, from one file behind a narrative taken from another file, the routine will combine the two files into a single sound track. The calling routine can also control the relative amplitudes of the sounds. In this way, one input signal can be made louder or softer than another, or one signal can be faded out as another one fades in. Any input file can be started, stopped, restarted or rewound under the control of the calling program.

DEMONSTRATION FILMS

Several films with sound have been produced using the sound track plotting routine. Most of the visual portions were created with very simple animation techniques in order to emphasize the information content added by the sound track. The films review the techniques employed for the generation of a sound track. No attempts have been made to rigorously quantify the quality of the sounds produced since no particular criterion of fidelity was set as an objective of the project. Furthermore, the sound systems of portable 16 mm projectors are not designed to produce high fidelity sound reproduction, since the audio portion will always be overlaid by the noise of the projector itself. For our purposes it was enough to make purely subjective judgments on the general quality of the sounds produced.

SUMMARY

The ability to produce optical sound tracks, as well as pictorial imagery, on a microfilm plotter can add a tremendous potential to computer generated movies. The sound medium can serve to enhance the visual presentation and can give another dimension of information content to the film. This potential cannot be fully exploited unless the sound track and the pictures can be plotted by the computer simultaneously. Under this condition, the input for the sound track can be treated by the computer as simply one more type of data in the plotting process. The input for the sound track plotting routine discussed in this report is obtained by digitizing any audio signal at a suitable sampling rate. This digital information can then be plotted on the film like any other data.

Very few hardware modifications were made to the plotter in order to produce sound tracks. The modifications that were made did not affect the plotter's other functions. The routine is written to give the user as much flexibility and control as possible in handling his sound track data files. Multiple files can be combined, and synchronization is under the control of the user's program. It now appears that the production of computer generated optical sound tracks will prove to be cost effective as well as feasible. If so, this process could conveniently be used to add sound to any computer generated film.

ACKNOWLEDGMENTS

While many individuals have made significant contributions to this project, the authors would like to give particular thanks to Jerry Melendez of Group C-4 for many hours of help in program structuring and debugging. The work on this project was performed under the auspices of the U. S. Atomic Energy Commission.
REFERENCES

1 L H BAKER J N SAVAGE E K TUCKER Managing unmanageable data Proceedings of the Tenth Meeting of UAIDE Los Angeles California pp 4-122 through 4-127 October 1971
2 L H BAKER B J DONHAM W S GREGORY E K TUCKER Computer movies for simulation of mechanical tests Proceedings of the Third International Symposium on Packaging and Transportation of Radioactive Materials Richland Washington Vol 2 pp 1028-1041 August 1971
3 Computer fluid dynamics 24-minute film prepared by the Los Alamos Scientific Laboratory No Y-204 1969
4 D ROBBINS Visual sound Proceedings of the Seventh Meeting of UAIDE San Francisco California pp 91-96 October 1968
5 W A WITTICH C F SCHULLER Audio visual materials Harper & Row Publishers Inc New York 1967
6 The Focal encyclopedia of film and television techniques Focal Press New York 1969
7 J R RAGAZZINI G F FRANKLIN Sampled-data control systems McGraw-Hill Book Company New York 1958

Simulating the visual environment in real-time via software

by RAYMOND S. BURNS
University of North Carolina
Chapel Hill, North Carolina

INTRODUCTION

Computer graphics has been seen since its inception1 as a means of simulating the visual environment. Ivan Sutherland's binocular CRTs were the first apparatus designed to place a viewing subject in a world generated by a computer. When the subject in Sutherland's apparatus turned his head, the computer generated new images in response, simulating what the subject would see if he really were in the 3-space which existed only in the computer's memory. This paper describes a system which is a practical extension of Sutherland's concept.

The problem of simulating the visual environment of the automobile driver has attracted a variety of partial solutions. Probably the most used technique is simple film projection. This technique requires only that a movie camera be trained on the highway from a moving vehicle as it maneuvers in a traffic situation. The resulting film is shown to subjects seated in detailed mock-ups of automobile interiors, who are directed to work the mock-up controls to "drive" the projected road. The illusion of reality breaks down, however, when the subject turns the steering wheel in an unexpected direction and the projected image continues on its predefined course. Mechanical linkages from the mock-up to the projector, which cause the projector to swing when the steering wheel is turned, have also been tried. But that technique still breaks down when the subject chooses a path basically different from the path taken by the vehicle with the movie camera. Such film simulators are termed "programmed." That is, what the subject sees is a function, not of his dynamic actions, but of the actions taken at the time the film was recorded. An "unprogrammed" simulator reverses this situation in that the image that the subject sees is determined only by his behavior in the mock-up.

Unprogrammed visual environment simulators have been built for studying driving behavior. The U. S. Public Health Service at the Injury Control Research Laboratory, Providence, Rhode Island, has constructed several examples of unprogrammed simulators. One of these features a model terrain board with miniature roads and buildings over which a television camera is moved through mechanical linkages to the steering wheel of an automobile mock-up. The television camera is oriented so that the subject is presented with a windshield view. This arrangement earns the "unprogrammed" label within the physical limits of the terrain board. In practice, however, its value as a research tool is limited to studying driver behavior at dusk, as the image presented to the subject is dim. Natural daylight illumination, even under cloudy conditions, is much brighter than the usual indoor illumination. Duplicating the natural daylight illumination over the surface of the whole terrain board was found to be impractical in terms of the heat produced and the current required by the necessary flood lamps.

Because of the difficulties and disadvantages of film- and terrain board-type simulators, some efforts in recent years have been directed toward constructing visual simulators based on computer-generated images. General Electric has developed a visual simulator for NASA, used for space rendezvous, docking and landing simulation, which embodies few compromises.2 The G. E. simulator output is generated in real time and displayed in color. However, from a cost standpoint, such a simulator is impractical for use as a highway visual simulator because the G. E. simulator was implemented to a large extent in hardware. Consequently, the search for a visual-environment simulator which could be implemented in software was initiated. A study investigating the feasibility of such a simulator was undertaken by the Highway Safety Research Center, Chapel Hill, North Carolina, an agency of the State of North Carolina. This study led to the development of the VES, for Visual Environment Simulator, a black-and-white approximation of the GE-NASA spaceflight simulator, adapted for highway environment simulation and implemented in software.

VES DESIGN REQUIREMENTS

[Figure 1-Mock-up of an automobile interior]

The requirements laid down by the Highway Safety Research Center were for a visual simulator that could be incorporated in a research device to totally simulate the driving experience to the subject. Not only was the visual environment to be simulated, but the auditory and kinesthetic environment as well. The subject was to be seated in a mock-up of an automobile interior, complete with steering wheel, brake and accelerator (see Figure 1). The kinesthetic environment was to be simulated by mounting the mock-up on a moveable platform equipped with hydraulic rams. Under computer control, the mock-up could be subjected to acceleration and deceleration forces, as well as pitch, yaw and roll. Similarly, a prerecorded sound track would be synchronized with the visual simulation to provide auditory feedback. To as great a degree as possible, the subject was to be isolated from the real environment and presented only with a carefully controlled simulated environment.

From the researcher's point of view, this simulation device should allow him to place a subject in the mock-up, present him with a realistic simulated environment and then study the subject's reactions. Further, the choice of reactions available to the subject should not be limited in any way. So, if the subject were to "drive" off the simulated road and through the side of a simulated building, the visual, kinetic and auditory feedback should realistically reflect his actions.

A visual simulator to provide the feedback described above must meet several requirements. To support the subject's unlimited alternatives, each image generated by the visual simulator must be determined only by the subject's inputs via the steering wheel, accelerator and brake, together with the subject's position in the simulated terrain. Therefore, the entire image representing the visual environment must be calculated in the time span separating subsequent images.

REALISM

The high premium placed on realism in the visual simulator implied that the time span between subsequent images would be short, comparable to the time span between movie or television frames. The realism requirement also made hidden surface removal mandatory. Transparent hills, cars and road signs were unacceptable if the illusion of reality were to be maintained. Further, television-type images were preferable to wire-frame drawings. If the images were to be of the wire-frame type, then objects would be represented by bright lines on the otherwise dark face of the CRT. For objects at close range, this representation presents few problems. But for objects at long range, the concentration of bright lines near the horizon would resemble a sunrise.

SYSTEM DESCRIPTION

The visual simulator software runs on a stand-alone IDIIOM-2 interactive graphics terminal consisting of a display processor, a VARIAN 620f mini-computer and a program function keyboard3 (see Figure 2). The display processor is itself a computer, reading and executing its program (called a display file) from the core of the mini-computer on a cycle-stealing basis. The display processor's instruction set is extensive, but the visual simulator uses only a few instructions. Those used are instructions to draw horizontal vectors at varying intensities at varying vertical positions on the screen. The display processor is very fast, drawing a full screen (12") vector in about 20 microseconds. This speed allows a display file of seven thousand instructions to be executed in about ~~oth of a second, effectively preventing image flicker at low light levels. The VARIAN 620f mini-computer is also fast. Its core has a 750 nanosecond cycle time and most instructions require two cycles. Word size is 16 bits and core size is 16,384.

In its present configuration, the simulator receives its steering, braking and acceleration inputs from an arrangement of push buttons on the program function keyboard. The design configuration calls for the installation of an analog-to-digital converter and a driving station mock-up to replace the PFK. At the same time that the analog-to-digital converter is installed, a VARIAN fixed-head disk with a capacity of 128K words will be installed, giving the simulator nearly unlimited source data set storage.

The visual simulator (VES) accepts a pre-defined data set which describes a plan view of the terrain through which travel is being simulated. The terrain data set consists of (x, y, z) triples which describe the vertices of polygons. At present, the VES input data set resides in the computer memory at all times. The main function of the VARIAN fixed-head disk mentioned above will be to store the VES input data set. In operation, the VES system accesses a portion of the input data set corresponding to the terrain which is "visible" to the subject as a function of his position
155 J\ F c 1~ • E • 1 A Figure 3-Diagram depicting subject's position (light triangle) moving through terrain data set versus data set moving past subject's position Display File NO Last Plan on Disk Figure 2-YES system block diagram in the simulated landscape. Then, the steering, brake and accelerator inputs from the mock-up are analyzed and used to compute a wire-frame type view of the terrain which would be visible through a car's windshield as a result of such steering, braking or accelerating.Next, the hidden surface removal routine (HSR) processes each polygon to determine which polygons "obscure" others and to remove the parts of each that are obscured. The output of HSR is then converted into a program (display file) to be executed by the display processor. The display processor executed this program to draw the horizontal vectors at up to 8 different intensities which make up the television-like final image. The subject's position (see figure 3) in the terrain plan view is represented by the light triangle. The dark triangle represents a fixed object in the terrain. If the terrain is established as the frame of reference, the subject'sposition moves across the terrain. But from the point of view of the subject, who is stationary, the terrain must move toward him. The current angular position of the mock-up steering wheel in radians, relative to a fixed heading, is found in variable ALPHA. AL- 156 Fall Joint Computer Conference, 1972 DIST x COS (ALPHA) SIN(ALPHA} N th of the rectangle and the triangle are compared. As indicated by the dashed lines, the rectangle is "behind" the triangle. In this case, no ordered triple is output. On encountering a3, the triangle's flag is set "out". As there is now only one polygon with flag set to "in", the ordered triple (X3, Y 3,PN r ) is output. Similarly, as a4 is encountered, the rectangle's flag is set to "out" and the triple (X4' Y 4, PN r) is output. This concludes LINESCAN's processing of a single scan line. To obtain the set of intersections corresponding to . t hALT for scan 1'" scan line "b," each element In e me a " must be modified by an amount determined by the space between raster elements, oY, and the slope of the polygon's face. Because polygons are composed of straight line segments, the change necessary is constant for each given line segment. To obtain the ALT entries for scan line "b," this constant value is added to 'the previous entries in the ALT. . But before processing scan line "b" can begm, the new ALT is re-sorted on increasing X values. This step is required because when the new ALT is constructed from the old by the addition of the slope constants mentioned above the order of some points may be disturbed. Note that this situation occurs when the ALT for scan line "c" is generated. Because of the differing slopes of the triangle and rectangle sides, a3 now precedes a2 in the left-to-right scanning order. Once the ALT is sorted, LINES CAN continues to process the ALT points as described above. DEVELOPMENT AND SPECIALIZATION OF THE HSR ALGORITHM Y1 1 Figure 7-Detail of LINESCAN operation 159 In writing the HSR program, the basic logic of LINESCAN was implemented. Unlike LINESCAN, which was not expressly designed for real-time applications, HSR was written in assembly language. Some features implemented in LINES CAN were judged unnecessary 160 Fall Joint Computer Conference, 1972 for the visual simulator application. Chief among these was the "implicitly defined line" feature of LINESCAN. 
This feature allows polygons to intersect and project through one another. Without this feature, polygons projecting through one another subvert the scanning logic, producing incorrect and distracting images. In a driving simulator, intersecting polygonal objects usually represent car crashes; hence, these are events which should be distracting. Some major changes to the basic logic of LINESCAN were implemented with the object of saving time. Recall that, when LINESCAN processes the ALT, as each point is encountered, a flag associated with that polygon is set to signify that the scan has "entered" that polygon. Then as each successive point is encountered, a search of the flags is used to determine which and how many flags are set. Performing even a short search at each point encountered on each scanline would consume a large fraction of the time allowed between frames in a real-time system. In the HSR algorithm, a table of polygon numbers is kept and updated as each new polygon is "entered" by the scan. The number of elements in the table is kept in a variable. Unless this variable indicates that the scan is "in" more than one polygon at a time, "no "depth sort" is required and no search need be made for polygons flagged as "in." When a "depth sort" is required, the polygons which must be depth sorted are readily accessible by table reference. Another change to the basic LINESCAN logic also involved sorting. LINESCAN sorts the ALT once for each scan. Recall that this step is required because the ALT is disordered when lines of different slopes intersect. Rather than sort the ALT for each scan, a simple test for ALT order was devised and performed at each point of the ALT. When disorder is found, the ALT is sorted. In simple scenes, this disorder occurs for about 8-10 of the possible 512 lines in a frame. Even very complex scenes require fewer than 20 ALT sorts. Hence, the savings in time are substantial. REFERENCES 1 I E SUTHERLAND A head-mounted three dimensional display Proceedings of the Fall Joint Computer Conference Vol 33 Part I pp 757-764 1968 2 BELSON Color TV generated by computer to evaluate spaceborne systems Aviation Week and Space Technology October 1967 3 IDIOM-2-Interactive graphic display terminal The Computer Display Review Vol 5 pp 201-214 1972 G ML Corporation Lexington Massachusetts 4 J BOUKNIGHT An improved procedure for generation of half-tone computer graphics presentations Communications of the ACM Vol 13 Number 9 pp 527-536 September 1970 Computer animation of a bicycle simulation by JAMES P. LYNCH and R. DOUGLAS ROLAND Cornell Aeronautical Laboratory, Inc. Buffalo, New York INTRODUCTION In early 1971, Cornell Aeronautical Laboratory, Inc., (CAL), began a research program, sponsored by Schwinn Bicycle Company, devoted to the development of a comprehensive digital computer simulation of a bicycle and rider. This simulation would be used to study the effects of certain design parameters on bicycle stability and control. Phase II of this research effort included the development of a computer graphics display program which generates animated movies of the bicycle and rider maneuvers being simulated. It is this graphics display capability that is described herein. For years, printed output was the only means of communication between the computer and man. This limitation dictated that only the technically skilled could interpret the reams of computer printout with its lists of numbers and specialized codes. 
For certain types of computer usage, such as accounting, numbers may be the most meaningful form of output which can be presented to the user. Solutions to other problems, however, may represent functional relationships of intangible variables. In this case plots of output data provide a much faster means of communication between the computer and the human. There is a class of problems for which neither numerical nor plotted output provide sufficient reality for rapid user comprehension. One such area is the simulation of the dynamics of tangible physical systems· such as airplanes, automobiles and bicycles. Fortunately, a means of communication is becoming practical which provides immediate visual interpretation of simulation results; not only for the analyst but for the layman as well. This mediumi s the computer animated graphics display. The early development of computer animated graphics displays was spurred by several investigators. Bill Fetter of the Boeing Company created an animated human figure in 1960 and a carrier landing film in 1961.1 Ed Zajac of Bell Telephone Laboratories produced a computer generated movie of a tumbling communications satellite in 1963. 2 Frank Sinden, also of Bell Laboratories, generated an educational computer animated film about gravitational forces acting on two bodies. 3 Two other investigators deserve mention, Ken Knowlton of Bell Labs for his computer animation language (BEFLIX)4 and Ivan Sutherland for his interactive computer animation work. 5 Interested readers will find an excellent bibliography on the subject in Donald Weiner's survey paper on computer animation. 6 Figure I-Computer graphics rendition of a bicycle and rider 161 162 Fall Joint Computer Conference, 1972 Figure 2-Blcycle slalom maneuver Computer Animation of a Bicycle Simulation 1.. 1 SEC Figure 2 (Cont'd) 163 164 Fall Joint Computer Conference, 1972 Computer Graphics activities at the Cornell Aeronautical Laboratory range from everyday use of general purpose plotting facilities by many programmers to highly complex computer-generated radar displays. One of the more fascinating computer graphics applications has been the Single Vehicle Accident Display Program, developed at CAL for the Bureau of Public Roads by C. M. Theiss. 9 This program converts automobile dynamics simulation data into a sequence of computer animated pictures used to generate motion picture film of the event. The demonstrated usefulness of this capability spurred the development of a graphics program for the Schwinn Bicycle Simulation. BICYCLE GRAPHICS PROGRAM FEATURES The Schwinn Bicycle Graphics Program provides a complete and flexible perspective graphics package capable of pictorially documenting the results of the bicycle simulation. The salient features of the graphics program are; 1. The program can plot a perspective picture of a bicycle and rider, positioned and oriented as per the simulation data. 2. The line drawing of the bicycle and rider can be easily changed to fit simulation or esthetic requirements. 3. The program can produce single pictures or animated movies. 4. Background objects, such aB roadways, houses, obstacles, etc., can be plotted in the scene. 5. The "frame rate" for animated films can be adjusted for "slow motion" or normally timed action. 6. The program is written to simulate a 16 mm movie camera, so that "photographing" a sce:lie is accomplished by specifying a set of standard camera parameters. 7. 
The program's "camera" can be set to automatically pan, zoom, remain fixed, or operate as on a moving base. 8. Any of the above characteristics may be changed during a run. Figure 1 shows a typical frame from a bicycle simulation movie. SIMULATION AND GRAPHICS SOFTWARE Digital computer simulation of bicycle and rider The computer simulation consists of a comprehensive analytical formulation of the dynamics of a bicycle-rider system stabilized and guided by a closed-loop rider control model. This computer simulation program will be used for bicycle design and development with particular consideration being given to the effects of various design parameters and rider ability on bicycle stability and maneuverability. The bicycle-rider model is a system of three rigid masses with eight degrees of freedom; six rigid body degrees of freedom, a steer degree of freedom of the front wheel, and a rider lean degree of freedom. I~ cluded in the analysis are tire radial stiffness, tire side forces due to slip angle and inclination angle, the gyroscopic effects of the rotating wheels, as well as all inertial coupling terms between the rider, the front wheel and steering fork, and the rear wheel and frame. Forty-four parameters of input data are required by the simulation program. These data include dimensions, weights, moments of inertia, tire side force coefficient, initial conditions, etc. The development of the simulation program has been supported by the measurement of the above physical characteristics of bicycles, the measurement of the side force characteristics of several types of bicycle tires and full scale experimental tests using an instrumented bicycle. Solutions are obtained by the application of a modified Runge-Kutta step-by-step procedure to integrate equations of motion. Output is obtained from a separate output processor program which can produce time histories of as many as 36 variables (bicycle translational and angular positions, velocities, accelerations, and tire force components, etc.) in both printed and plotted format. The simulation program, consisting of seven subroutines, uses approximately 170K bytes of core storage and requires about 4 seconds of CPU time per second of problem time when run on an IBM 370/165 computer. The output processor program uses approximately 200K bytes of core storage and requires about 5 seconds of CPU time per run. The total cost of both the simulation and output processor programs is approximately seven dollars per problem. The mechanics of making a bicycle graphics movie In addition to the printed and plotted output generated by the Schwinn Bicycle Simulation Program, a pecial "dynamics tape" is created for input to the bicycle graphics program. This dynamics tape contains, for each simulation solution interval, the bicycle's c.g. position (X, Y, Z coordinates), angular orientation (Euler angles), front wheel steer angle, and rider lean angle. All other pertinent information, such as the steering head caster angle, rider "hunch forward" angle, are fed to the graphics program via data cards, along Computer Animation of a Bicycle Simulation 165 with the stored three-dimensional line drawings of the bicycle and rider, and any desired backgrounds. The bicycle graphics program searches the tape and finds the simulation time corresponding to the desired "frame time." Information is then extracted to draw the desired picture. 
The program mathematically combines the chassis, front fork and pedals to draw the bicycle, and mathematically combines the torso, left and right upper arms and forearms, and left and right thighs, calves and feet to draw the rider. Everything is so combined to yield a picture of a rider astride a bicycle assuming normal pedaling, leaning and handlebar grip. The correctly positioned three dimensional line drawings are transformed into a two dimensional picture plane, as specified by the program's camera parameters (location, orientation, focal length, etc.). COMPUTER PLOTS OF SIMULATION RESULTS PRINTOUT {J. OF BICYCLE GRAPHICS PROGRAM ~-----tol STORED PICTURE Figure 4-Joints used for rider display .j! ::J,.4-, .j! t Figure 3-8teps in making Schwinn bicycle movie An interface program converts the final line drawings into a set of commands to the CAL Flying Spot Scanner. The cathode ray tube beam of the Flying Spot Scanner traces out one frame of the movie while a 16 mm. movie camera records the image. Upon completion of the picture, the movie camera automatically advances one frame and the graphics program reads the next data (positions, angles, etc., of bicycle and rider) from the dynamics tape. The completed film will show animated motion, exactly as simulated by the computer, Figure 2. A block diagram of the movie making procedure is shown in Figure 3. x~ CHASSIS For maximum realism and esthetic quality, seven distinct bicycle/rider motions were generated: 1. Bicycle chassis translation and rotation (6 degrees-of-freedom) c!)- .x FRONT·FORK \ Bicycle motions displayed ·x __ t PEDALS "-. BICYCLE SYSTEM Figure 5-Sections used for bicycle display 166 Fall Joint Computer Conference, 1972 2. 3. 4. 5. 6. 7. Front wheel and handlebar steering Bicycle crank and pedal rotation Rider left-right leaning Rider arm steering Rider leg pedaling Rider ankle flexing (XSTEER, Y STEER, ZSTEER) are points in the front Figure 4 shows the various body members and joints included in the rider. The separate parts of the bicycle are shown in Figure 5. Modification of the basic graphics package The Bureau of Public Roads graphics display program provided an excellent base from which to build the Schwinn Bicycle Graphics Program. A pre-stored line drawing, defined in its own coordinate system, is Euler transformed into fixed space and camera transformed into two dimensional picture space. Edge tests are performed to delete lines out of the field of view. Plotting any object (a line drawing) involves a call to the OBJECT subroutine CALL OBJECT (TITLE, X, Y, Z, PHI, THETA, PSI) Title refers to a particular stored line drawing, while X, Y, Z and PHI, THETA, PSI refer to the desired fixed space position and Euler angles at which the object is to be plotted. Subroutine OBJECT then does all the necessary transformations to plot the object. Plotting the chassis is straightforward, the chassis position and Euler angles are read directly from the dynamics tape. Displaying the bicycle and rider All segments of the bicycle and rider are displayed with the same mathematical approach. Parts are referenced by position and orientation to the chassis axis system, and this information is used to calculate the fixed space Euler angles and position. 
For example, the matrix equation relating points in the front fork axis system to corresponding points in fixed space is:

[XF  YF  ZF]' = [A] ( [B] [XSTEER  YSTEER  ZSTEER]' + [XXF  YYF  ZZF]' ) + [X  Y  Z]'

where:

A is the standard Euler transformation matrix (chassis to fixed space)
B is the front-fork system to chassis axis transformation matrix
(XSTEER, YSTEER, ZSTEER) are points in the front fork space
(XXF, YYF, ZZF) is the front fork system connection point in the chassis system
(X, Y, Z) is the current fixed space position of the bicycle chassis
(XF, YF, ZF) is the front fork point specified in the fixed space set

The B matrix, of course, is a two rotational transformation, being a function of the caster angle and the steer angle. The Euler angles required by subroutine OBJECT can be determined by equating like terms of the standard Euler transformation with the overall transformation,

[AB] = [A]*[B]

For instance:

PHI = TAN^-1 ( AB(3,2) / AB(3,3) )
PSI = TAN^-1 ( AB(2,1) / AB(1,1) )
THETA = TAN^-1 ( -AB(3,1)*SIN(PSI) / AB(2,1) )

This procedure can be easily automated by a general subroutine which accepts the coefficients of the two transformation matrices and outputs the Euler angles.

Displaying the pedaling action

The pedal rotation angle is easily determined by tabulating the distance traveled by the chassis and relating it to the wheel size and gear ratio. The toe angle can be approximated by a cosine function of the pedal rotation angle.

w = gear ratio * distance / wheel size
Toe angle = -0.25 * cos(w)

An important simplifying assumption in the display of the leg pedaling motion is that the legs move up and down in a single plane. This makes trigonometric calculation of the joint locations straightforward and the object-to-chassis transformations simple one-rotation matrices. Once this information is determined, procedures similar to the front-fork manipulations are used. Three objects are required for each leg: the thigh, the calf, and the foot.

Displaying the torso

The torso must hunch forward (so that the arms may reach the handlebars) and lean to the left and right (real-world rider control action). The transformation between the torso axis system and the chassis system is determined by two rotations. This transformation is also used for determination of arm location in the chassis system.

Displaying the arms

Determination of the fixed space Euler angles of the arms is complicated by the fact that the elbow joint lies on a circular locus around the shoulder-to-handlebar line. Since the upper arm and forearm are assumed equal in length, the perpendicular distance from the elbow to the handlebar-to-shoulder line is known. A transformation matrix can be developed to convert points in the elbow circle plane to the torso system. A constant angle from the elbow circle plane's Y-axis defines a unique elbow point which can be transformed back into the chassis system. Once the elbow point is known, determination of the Euler angles of the arm is straightforward.

MOVIE PRODUCTION

Both the bicycle simulation program and the bicycle graphics program are run on CAL's IBM 370/165 computer. The flying spot scanner is interfaced with the central digital computer through an IBM 2909 asynchronous data channel. The flying spot scanner is a high resolution CRT display system used for plotting and scanning. The interface software provides all the controls required by the display to move the beam, advance the film, etc.
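The Euler-angle bookkeeping described above for the front fork is easy to automate. The sketch below, in Python rather than the FORTRAN of the CAL programs and with an assumed yaw-pitch-roll convention for the "standard Euler transformation," composes a chassis-to-fixed rotation A with a fork-to-chassis rotation B and recovers PHI, THETA and PSI from the product AB using the arc-tangent relations quoted above. It is a sketch of the procedure, not the OBJECT subroutine itself.

```python
import numpy as np

def euler_to_matrix(phi, theta, psi):
    """Body-to-fixed rotation built as Rz(psi) @ Ry(theta) @ Rx(phi); the exact
    convention used by the CAL programs is an assumption here."""
    cf, sf = np.cos(phi), np.sin(phi)
    ct, st = np.cos(theta), np.sin(theta)
    cp, sp = np.cos(psi), np.sin(psi)
    Rx = np.array([[1, 0, 0], [0, cf, -sf], [0, sf, cf]])
    Ry = np.array([[ct, 0, st], [0, 1, 0], [-st, 0, ct]])
    Rz = np.array([[cp, -sp, 0], [sp, cp, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def matrix_to_euler(AB):
    """Recover (phi, theta, psi) from a composed rotation AB using the
    arc-tangent relations in the text (atan2 is used for quadrant safety)."""
    phi = np.arctan2(AB[2, 1], AB[2, 2])                     # AB(3,2)/AB(3,3)
    psi = np.arctan2(AB[1, 0], AB[0, 0])                     # AB(2,1)/AB(1,1)
    theta = np.arctan2(-AB[2, 0] * np.sin(psi), AB[1, 0])    # -AB(3,1)*sin(PSI)/AB(2,1)
    return phi, theta, psi

# Chassis attitude A and a front-fork-to-chassis rotation B (caster plus steer),
# with illustrative angles in radians.
A = euler_to_matrix(phi=0.05, theta=-0.02, psi=0.60)
B = euler_to_matrix(phi=0.0, theta=-0.35, psi=0.10)
AB = A @ B

# Euler angles to hand to an OBJECT-style plotting routine for the front fork.
print(matrix_to_euler(AB))
```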
The Schwinn Bicycle Graphics program requires 250K bytes of core, and generally runs from 50¢ to 90¢ per frame in computing costs, depending on image complexity. No attempt at hidden line removal was planned for this phase. FUTURE APPLICATIONS The Schwinn Bicycle Graphics Program was designed as a research tool to demonstrate the capability of the bicycle simulation. Several computer animated movies have been produced of simulated bicycle maneuvers which compare well with full scale experimental maneuvers. At current production cost levels, only the most interesting runs are documented with the bicycle graphics program. The authors feel, however, that the advent of high speed intelligent computer terminals will 167 allow the economical production of computer graphics. In the future the investigator will be able to view animated summaries of simulation results first, before referring to more detailed printed and plotted output data. The most gratifying result of this bicycle graphics capability is that the technically unskilled can share in the understanding that computer simulation is an emulation of reality, and has visible meaning in the everyday world. ACKNOWLEDGMENT The authors wish to express their gratitude to the Schwinn Bicycle Company for permission to present this work and also to Ronald B. Colgrove, CAL chief artist, for his excellent rendering. of the bicycle rider used in the movie sequences. REFERENCES 1 W A FETTER Computer graphics in communication McGraw-Hill New York 1965 2 E ZAJAC Film animation by computer New Scientist Vol 29 Feb 10, 1966 pp 346-349 3 F SINDEN Synthetic cinematography Perspective Vol 7 No 4 1965 pp 279-289 4 K KNOWLTON A computer technique for producing animated movies Joint Computer Conference AFIPS Conference Proceedings Vol 25 Baltimore Md Spartan 1964 pp 67-87 5 I SUTHERLAND Perspective views that change in real time Proceedings of 8th UAIDE Annual Meeting 1969 pp 299-310 6 D D ·WEINER Computer animation-an exciting new tool for educators IEEE Transactions on Education Vol E-14 No 4 Nov 1971 7 R D ROLAND JR D E MASSING A digital computer simulation of bicycle dynamics Cornell Aeronautical Laboratory Inc Technical Report No YA-3063-K-1 June 1971 8 R D ROLAND JR J P LYNCH Bicycle dynamics, tire characteristics and rider modeling Cornell Aeronautical Laboratory Inc Technical Report No YA-3063-K-2 March 1972 9 C M THEISS Perspective picture output for automobile dynamics simulation Prepared for Bureau of Public Roads by Cornell Aeronautical Laboratory Inc Technical Report No CPR-1l-3988 January 1969 10 C M THEISS Computer graphics displays of simulated automobile dynamics Proceedings AFIPS Conference Spring 1969 An inverse computer graphics problem by W. D. BERNHART Wichita State University Wichita, Kansas The goal of a conventional computer perspective algorithm is to assist in the establishment of a scaled perspective view of a real or conceptual geometric object. The purpose of this paper is to present the required conditions for the inverse transformation; that is, given the perspective of an object, establish the required parameters used in generating the perspective and to a more restrictive extent, establish the original geometric definition of the object. Because this inverse mapping is from a two to three dimensional space, the method is approximate and is accomplished by the method of least squares based on certain a priori information regarding the geometrical object. 
The method does require a considerable amount of numerical computation, but is particularly well suited to a digital computer solution.

The need for this required transformation arose in the course of a problem associated with the determination of the coordinates of certain desired points which appeared in photographs of an event which occurred several years ago, wherein the desired points had been completely obliterated by recent construction activities. Thus, the first task was to establish the generating parameters for the photographs. The generating parameters are defined as six independent coordinates from which a photograph may be geometrically reproduced by considering a large number of points in the three-dimensional object space, and transforming these to the two-dimensional space of the photograph. These parameters consist of the coordinates of the point where the camera is located, the symmetric equations of the line along the optical axis of the camera, and a linear scale factor associated with the photograph, enlarged to any magnification. The treatment of a photograph as a true perspective is consistent with the paraxial ray tracing approximation of geometrical optics.

For the purpose of this analysis, all points will be defined in a rectangular Cartesian coordinate system as shown in Figure 1.

Figure 1-Projection plane and control points

The point where the camera is located is denoted by three independent coordinates, (Xe, Ye, Ze). In the context of traditional perspective terminology, this point is commonly described as the location of the eye or observer, and the point (Xo, Yo, Zo) is referred to as the center of interest of the object space or perspective center. A line through these two points is regarded as the optical axis of the camera and the plane perpendicular to this axis represents the picture plane, projection plane, or two-space photograph. The location of this plane in relation to the eye point requires the identification of a linear-scale factor which is associated with each photograph. The coordinates of the center of interest, (Xo, Yo, Zo), are not a unique set, as any point on the line passing through points 'e' and 'o' will require a particular value of the linear-scale factor to perspectively generate the object space into the projection plane space. For this analysis, the scale factor will be regarded as a constant and the six independent parameters, (Xe, Ye, Ze) and (Xo, Yo, Zo), will be determined such that the photograph may be geometrically reproduced in the perspective sense.

Before analyzing this particular problem, it will be necessary to present the required coordinate transformation that maps an arbitrary point 'i' in the object space to the projection-plane space. This perspective transformation has received considerable attention in computer graphics applications in the last decade.1,2,3 A form which is particularly suited to the parameter identification problem is

hi = A (Ro^2 / (Po Di)) { -(Xi - Xo)(Ye - Yo) + (Yi - Yo)(Xe - Xo) }     (1)

vi = A (Ro / (Po Di)) { -[(Xi - Xo)(Xe - Xo) + (Yi - Yo)(Ye - Yo)](Ze - Zo) + (Zi - Zo) Po^2 }     (2)

ni = Ro (1 - A)     (3)

in which

Po = [(Xe - Xo)^2 + (Ye - Yo)^2]^1/2     (4)

Ro = [(Xe - Xo)^2 + (Ye - Yo)^2 + (Ze - Zo)^2]^1/2     (5)

Di = Ro^2 - [(Xe - Xo)(Xi - Xo) + (Ye - Yo)(Yi - Yo) + (Ze - Zo)(Zi - Zo)]     (6)

and A = the linear scale factor; A > 0. The coordinate normal to the picture plane is a constant and is of no particular interest other than as an aid in the estimation of a suitable photographic scale factor. For the case of a photograph, this normal coordinate is proportional to the focal length of a simple convergent camera lens. This particular form of the perspective mapping transformation is based on two successive rotational transformations such that the plane defined by a line parallel to the Z-axis and the point 'e' also contains the V-axis of the projection plane. These two successive rotations are defined as follows

theta = tan^-1 [(Ye - Yo)/(Xe - Xo)]     (7a)

beta = sin^-1 [(Ze - Zo)/Ro]     (7b)

A third rotation may be easily introduced by rotating the H, V-axes in the projection plane. It is important to note that distances measured in the projection plane would remain invariant with respect to this third rotation.

Returning to the original problem, the six desired parameters are determined by the method of least squares by considering four or more points in the object space whose rectangular coordinates are known or may be estimated with a high degree of accuracy. Next let (Sij)m denote the measured value of the distance between points i and j in the photograph. Thus, for 'n' such points, there are m = n(n - 1)/2 corresponding measured distances. The calculated value of this associated distance in the projection plane is given by

(Sij)c = [(hi - hj)^2 + (vi - vj)^2]^1/2     (8)

and the six desired generating parameters are then obtained by expanding this calculated value in a multiple Taylor series, expressed as

(Sij)c = [(Sij)c]a + the sum, over the six parameters Xe, Ye, Ze, Xo, Yo, Zo, of the partial derivative of (Sij)c with respect to each parameter, evaluated at 'a', times the increment in that parameter, + higher-order terms     (9)

The subscript 'a' in Equation 9 denotes the evaluation for some assumed value of the six parameters. Thus, by neglecting the higher-order terms and minimizing the sum of the squares of the residuals between the calculated and measured values for the 'n' points,

G = SUM (k = 1, ..., m) [(Sij)c - (Sij)m]k^2     (10a)

and

dG/dXe = 0,  dG/dYe = 0,  dG/dZe = 0,  dG/dXo = 0,  dG/dYo = 0,  dG/dZo = 0     (10b)

The six equations 10b, in general, yield the six desired parameters after two to five iterations, depending on the initial assumed values of the parameters and the desired accuracy. Again, the scale factor is held constant throughout this iterative process. A different choice of A will simply slide the coordinates of point 'o' along the line o-e without disturbing the iterated coordinates of point 'e'.

The writer has employed this procedure on several different controlled photographs with encouraging success.4 These laboratory experiments yielded parameter estimates within 4 percent of their exact values. This error is largely attributed to the various unknowns associated with the optics of both the camera and enlarger, as both instruments were of commercial rather than laboratory quality. Recent experiments,5 dealing with photogrammetric resectioning, yielded considerably smaller errors. These experiments utilized a phototheodolite, spectroscopic flat quality glass plates and a monocomparator.
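A compact numerical sketch of the fitting procedure is given below. It is written in Python with NumPy rather than in the language of the original study; the control-point coordinates and camera parameters are invented, and the Gauss-Newton step uses finite-difference derivatives in place of the analytic partials of Equation 9. It is meant only to show that a handful of surveyed points and their measured image distances suffice to recover the six generating parameters when the scale factor is held fixed.

```python
import numpy as np
from itertools import combinations

A_SCALE = 1.0  # linear scale factor, held constant during the iteration

def project(params, pts):
    """Equations (1)-(6): map object-space points to (h, v) picture coordinates."""
    Xe, Ye, Ze, Xo, Yo, Zo = params
    d = pts - np.array([Xo, Yo, Zo])            # (Xi-Xo, Yi-Yo, Zi-Zo)
    e = np.array([Xe - Xo, Ye - Yo, Ze - Zo])   # eye point relative to center of interest
    Po2 = e[0]**2 + e[1]**2
    Ro2 = Po2 + e[2]**2
    Po, Ro = np.sqrt(Po2), np.sqrt(Ro2)
    Di = Ro2 - d @ e                            # equation (6)
    h = A_SCALE * Ro2 / (Po * Di) * (-d[:, 0] * e[1] + d[:, 1] * e[0])
    v = A_SCALE * Ro / (Po * Di) * (-(d[:, 0] * e[0] + d[:, 1] * e[1]) * e[2] + d[:, 2] * Po2)
    return np.column_stack([h, v])

def pair_distances(params, pts, pairs):
    """Equation (8): calculated projection-plane distances (Sij)c for all point pairs."""
    hv = project(params, pts)
    return np.array([np.linalg.norm(hv[i] - hv[j]) for i, j in pairs])

def fit(params0, pts, measured, pairs, iters=8, eps=1e-6):
    """Gauss-Newton on the residuals (Sij)c - (Sij)m, with a finite-difference Jacobian."""
    p = np.asarray(params0, float)
    for _ in range(iters):
        r = pair_distances(p, pts, pairs) - measured
        J = np.empty((len(pairs), 6))
        for k in range(6):
            dp = np.zeros(6); dp[k] = eps
            J[:, k] = (pair_distances(p + dp, pts, pairs) - (r + measured)) / eps
        p -= np.linalg.lstsq(J, r, rcond=None)[0]
    return p

# Five surveyed control points (invented coordinates) and a "true" camera geometry.
pts = np.array([[0., 0., 0.], [10., 2., 0.], [3., 12., 1.], [8., 9., 4.], [1., 6., 2.]])
true = np.array([40., -25., 12., 5., 5., 1.])           # (Xe, Ye, Ze, Xo, Yo, Zo)
pairs = list(combinations(range(len(pts)), 2))          # m = n(n-1)/2 pairs
measured = pair_distances(true, pts, pairs)             # stands in for plate measurements

print(fit(true + np.array([3., -4., 2., 1., -1., .5]), pts, measured, pairs))
```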
Thus, by iteratively determining the generating parameters for each photo- 171 graph, the coordinates of the desired point were redetermined by solving for the intersection of the two lines associated with the point in each photograph. REFERENCES 1 H R. PUCKETT Computer methods for perspective drawing ARS-IAS Structures and Materials Conference Engineering Paper No 135 Palm Springs California April 1-3 1963 2 T E JOHNSON Sketchpad III-A computer program for drawing in three dimensions Proceedings Spring Joint Computer Conference 1963 3 W D BERNHART W A FETTER Planar illustration method and apparatus United States Patent Office No 3519997 July 7 1970 4 W D BERNHART Determination of perspective generating parameters ASCE Journal of the Surveying and Mapping Division Vol 94 No SU2 September 1968 5 L J FESSER Computer-generated perspective plots for highway design evaluation Federal Highway Administration Report No FHWA-RD-72-3 September 1971 Module connection analysis-A tool for scheduling software debugging activities by FREDERICK M. HANEY X erox Corporation El Segundo, California INTRODUCTION of various kinds of effort such as design, coding, module testing, etc. More recently Belady and Lehman described a mathematical model for the "meta-dynamics of systems in growth. "3 These schemes provide useful insights into the difficulties of designing and implementing large systems. Even with these improved estimation techniques, however, we still face the threat of long periods of unstructured post-integration putting out of fires. We may know better how long this "final" debugging will take, but we are still at a loss to predict what resources will be required or what specific activities will take place. If we predict an 18 month period for "final testing," will management buy it? How can we peer into this hazy contingency portion of a schedule and predict in greater detail where bugs will occur, who will be needed to fix them, elapsed time between internal releases, etc.? Belady and Lehman suggest the need for a "micro-model" for system activities; i.e., a model based on internal, structural aspects of a system. This is essentially the objective of this paper. In the following sections, we will develop a very simple, but useful, technique for modeling the "stabilization" of a large system as a function of its internal structure. The concrete result described in this paper is a simple matrix formula which serves as a useful model for the "rippling" effect of changes in a system. The real emphasis is on the use of the formula as a model; i.e., as an aid to understanding. The formula can certainly be used to obtain numeric estimates for specific systems, but its greater value is that it helps to explain, in terms of system structure and complexity, why the process of changing a system is generally more involved than our intuition leads us to believe. The technique described here, called Module Connection A nalysis, is based on the idea that every module pair (may be replaced by subsystem, component, or any other classification) of a system has a finite (possibly 0) The largest challenge facing software engineers today is to find ways to deliver large systems on schedule. Past experience obviously indicates that this is not a wellunderstood problem. The development costs and schedules for many large systems have exceeded the most conservative, contingency-laden estimates that anyone dared to make. Why has this happened? There must be a plethora of explanations and excuses, but I think H. R. J. 
Grosch identified the common denominator in his article, "Why MAC, MIS and ABM will never fly."l Grosch's observation is essentially that for some large systems the problem to be solved and the system designed to solve it are in such constant flux that stability is never achieved. Even for some systems that are flying today, it is obvious that they came precariously close to this unstable, "critical mass" state. It is my feeling that our most significant problem has been gross underestimation of the effort required to change (either for purposes of debugging or adding function) a large, complex system. l\10st existing systems spent several years in a state of gradual, painfully slow transition toward a releasable product. This transition was only partially anticipated and almost entirely unstructured; it was a time for putting out fires with little expectation about where the next one would occur. The difficulties of stabilizing large systems are universal enough that our experience has resulted in several improved methods for estimating projects. Rules-of-thumb like "10 lines of code per man day" once sounded like extremely conservative allowance for the complexities of system integration and testing. J. D. Aron2 has described a relatively elaborate technique for estimating total effort for large projects. Aron's technique is based· on the estimated amount of code for a project and empirically observed distributions 173 174 Fall Joint Computer Conference, 1972 probability that a change in one module will necessitate a change in any other module. By interpreting these probabilities and applying elementary matrix algebra, we can derive formulae for estimating the total number of "changes" required to stabilize a system and the staging of internal releases. The total number of changes, by module, is given by where A is a row vector representing the initial changes per module, P is a matrix such that Pij is the probability that a change in module i necessitates a change in modulej, and 1 is the nXn identity matrix. The number of changes required for each "internal release" is given by APk, K=O, 1, ... , or by AX (1 -P)-lX Uk, k=1,2, ... n, Uk= (0, ... , 1, ... 0) i k th element depending upon the release strategy. The derivations of these formulae are presented in the following section. Module connection analysis is useful primarily as a tool for augmenting a designer's quantitative understanding of his problem. It produces quantitative estimates of the effects of module interconnections, an area in which intuitive judgment is generally inadequate. THEORY OF lVIODULE CONNECTIONS As a basis for our analysis, we postulate several characteristics of a system: • A system is hierarchical in structure. It may consist of subsystems, which contain components, which contain modules· or it may be completely general having n different levels of composition where an object at any level is composed of objects at the next lower level. • At any level of the hierarchy, there may be some interdependence between any two parts of the system. • If we view a system as a collection of modules (or, whatever object resides at the lowest hierarchical level), then the various interdependencies are manifested in terms of dependencies between all pairs of modules. By dependence here, we mean that a change in one module may necessitate a change in the other. The fundamental axiom of module connection analysis is that intermodule connections are the essential culprit in elongated schedules. 
A change in one module creates the necessity for changes in other modules, and these changes create others, and so on. Later, we will see that perfectly harmless-looking assumptions lead easily to sums like hundreds of changes required as a result of a single initial change. (The notions of hierarchy, interconnection, etc., used here are described at length in Reference 4.)

If we assume that a system consists of n "modules," then there are n^2 pairwise relationships of the form Pij = probability that a change in module i necessitates a change in module j. In the following, the letter "P" denotes the n X n matrix with elements Pij. Furthermore, with each module i, there is associated a number Ai of changes that must be made in module i upon integration with the system. (Ai is approximately the number of bugs that show up in module i when it is integrated with the system.) If we let A denote a row vector with elements Ai, then we have the following:

A = total changes, by module, required at integration time, or at internal release 0.

AP = total changes required, by module, as a result of changes made in release 0, or total changes for internal release 1. (Internal release n+1 is, roughly, a version of the system containing fixes for all first-order problems in internal release n.)

Now we observe that the i,jth element of P^2 is

SUM (k = 1, ..., n) Pik Pkj,

which represents the sum of probabilities that a change in module i is propagated to module k and then to module j. Hence, the i,jth element of P^2 is the "two-step" probability that a change in module i propagates to module j. Or, AP^2 is the number of changes required in internal release 2. The generalization is now obvious. The number of changes required in internal release k is given by AP^k and the total number of changes, T, is given by

T = A + AP + AP^2 + ... = A (I + P + P^2 + ...).

Now we are interested to know whether or not the matrix power series in P converges; clearly, if it does not our system will never stabilize. To establish convergence of the power series, we appeal to matrix algebra (see Reference 5, for example) which tells us that the above series converges whenever the eigenvalues of P are less than 1 in absolute value. If this is the case, then the series converges and

T = A (I - P)^-1,

where I is the n X n identity matrix. We now have an extremely simple way to estimate the total number of changes required to stabilize a system as a linear function of a set of initial changes, A. Moreover, the number of changes at each release is given by the elements of AI, AP, AP^2, etc.

ESTIMATING TOTAL DEBUGGING EFFORT FOR A SYSTEM

The above theory suggests a simple procedure for estimating the total number of changes required to stabilize a system. The procedure is as follows:

(1) For each module pair, i,j, estimate the probability that a change in module i will force a change in module j. These estimates constitute the probability matrix P.
(2) Form the vector A by estimating for each module i the number of "zero-order" changes, or changes required at integration time.
(3) Compute the total number of changes, by module: T = A (I - P)^-1.
(4) Sum the elements of the vector T to obtain the total number of changes, N.
(5) Make a simple extrapolation to "total time" based on past experience and knowledge of the environment. If past experience suggests a "fix" rate of d per week, then the total number of weeks required is N/d.

Hence if we have some estimate for the initial correctness (or "bugginess") of a system and for the intermodule connectivity (the probabilities), then we can easily obtain an estimate for the total number of changes that will be required to debug the system. The formula is a simple one in matrix notation, but the fact that we are dealing with matrices probably explains the failure of our intuition in understanding debugging problems. In the following sections, we will show how the above formula can be used to aid our understanding of other aspects of the debugging process.

STAGING INTERNAL RELEASES

There are various strategies for tracking down bugs in a complex system. The most obvious are: (1) fix all bugs in one selected module and chase down all side effects, or, (2) fix all "first-order" bugs in each module, then fix all "second-order" bugs, and so on. The module connection model can aid in predicting release intervals for either approach.

For strategy (1) (one module at a time), the number of changes required to stabilize module i, given Ai initial changes, is given by

(0, ..., Ai, ..., 0) (I - P)^-1

The product is a row vector with elements corresponding to the number of changes that must be made in each module as a result of the original changes. The total number of changes required to stabilize this one release is given by

Ai SUM (k = 1, ..., n) Xik,

where the Xik are elements of (I - P)^-1. This strategy, then, results in n internal releases where the time for release i is

Ai (max over k of Xik) × (time required per change)

and the total debug time after integration is

SUM over i of [Ai (max over k of Xik)] × (time required per change)

With the second debugging strategy (make all "first-order" changes, then all "second-order" changes, etc.), the number of changes in the kth release is given by AP^k. That is, AP^k is a row vector with elements corresponding to the number of changes in each module for release k. The time required for release k is approximately

max (AP^k) × time required per change.

To determine the total number of releases for this strategy, we must examine A, AP, AP^2, ..., until the number of changes AP^s in release s is small enough that the system is releasable. The total time for this strategy, then, is

SUM (k = 0, ..., s) max (AP^k) × time required per change.

It is worth noting that both of the debug strategies described above evidence a "critical path" effect. The total time in each case is a sum of maximum times for each release. This effect corresponds to the well-known fact that debugging is generally a highly sequential process with only minor possibilities for making many fixes in parallel. This fact, coupled with the "amplification" of changes caused by rippling effects, certainly accounts for a large portion of many schedule slips.
REFINING THE INITIAL ESTIMATES

Module connection analysis is proposed as a tool for aiding designers and implementors. More than anything else, it is a rationale for making detailed quantitative estimates for what is generally called "contingency." Now, we must ask, "As a project progresses, how can we take advantage of actual experience to refine the initial estimates?" The module connection model is based on two objects: A, the vector of initial changes; and P, the matrix of connection probabilities between the modules. Both A and P can be revised simply as live data become available. As each module i is integrated into the system, the number Ai of initial changes becomes apparent. Using updated values for the vector A, we can recompute the expected total number of changes and the revised release strategy.

The elements, Pij, of the matrix P can be revised periodically if sufficient data is kept on changes, their causes, and their after-effects. One simple way to do this is to keep a record for each module i of every change made to it: a description of the change, which other module (if any) caused it, and which other modules were affected by it. After a relatively large sample of data is available, these records can be used to revise P as follows:

Pij = (number of changes in j caused by i) / (total changes made to i)

The revised matrix P can be used to revise earlier estimates for total effort and release strategies.
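A few lines suffice to keep that bookkeeping mechanically. The Python sketch below uses a hypothetical change log (the module names, log format and counts are all invented) and tallies it into revised Pij values with the formula just given.

```python
from collections import defaultdict

# Hypothetical change log: (module changed, module whose change caused it).
# A cause of None marks an original, externally motivated change.
change_log = [
    ("EDITOR", None), ("LOADER", "EDITOR"), ("EDITOR", None),
    ("FILESYS", "EDITOR"), ("LOADER", None), ("EDITOR", "LOADER"),
    ("FILESYS", None), ("LOADER", "EDITOR"), ("EDITOR", None),
]

modules = sorted({m for m, _ in change_log})
changes_to = defaultdict(int)            # total changes made to module i
caused = defaultdict(int)                # changes in j caused by a change in i

for changed, cause in change_log:
    changes_to[changed] += 1
    if cause is not None:
        caused[(cause, changed)] += 1

# Revised connection probabilities:
# Pij = (number of changes in j caused by i) / (total changes made to i)
P = {(i, j): caused[(i, j)] / changes_to[i]
     for i in modules for j in modules if changes_to[i]}

for (i, j), p in sorted(P.items()):
    if p > 0:
        print(f"P[{i} -> {j}] = {p:.2f}")
```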
AN EXAMPLE OF MODULE CONNECTION ANALYSIS

The following example is based on the Xerox Universal Timesharing System. Eighteen actual subsystems are used as "modules." Estimates for connection probabilities and initial changes are made in the same way that they would be made for a new system, except that some experience and "feel" for the system were used to obtain realistic numbers. (Thanks to G. E. Bryan, Xerox Corporation, for helping to construct this example.)

The 18 X 18 probability connection matrix for this example is given in Figure 1.

Figure 1-Probability connection matrix, P

The matrix is relatively sparse; moreover, most of the nonzero elements have a value of .1. Most of the larger elements lie on the diagonal, corresponding to the fact that the subsystems are relatively large, so that the probability of ripple within a subsystem is relatively large.

The total number of changes required in each module are given in Figure 2. It is interesting to note which modules require the most changes and to observe that six modules account for 50 percent of the changes.

INITIAL AND FINAL CHANGES

Module    Initial Changes    Total Required Changes
1         2                  241.817
2         8                  100.716
3         4                  4.44444
4         6                  98.1284
5         28                 248.835
6         12                 230.976
7         8                  228.951
8         28                 257.467
9         4                  4.44444
10        8                  318.754
11        40                 238.609
12        12                 131.311
13        16                 128.318
14        12                 157.108
15        12                 96.1138
16        28                 150.104
17        28                 188.295
18        40                 139.460
TOTALS    296                2963.85

Figure 2

Figure 3 illustrates the one-release-per-module debug strategy. That is, we repair one module and all side effects, then another module, and so on. This strategy is rather erratic since the time between releases, which is determined by the maximum number of fixes in one module, ranges from 4 to 95 indiscriminately. If we adopt this strategy, we may want to select the worst module first and continue using the worst module at each step. We will see, however, that this strategy is far from optimal because it does not take maximum advantage of opportunities to make fixes in parallel.

ONE RELEASE PER MODULE

Release    Maximum Changes in One Module
1          4.41764
2          11.8619
3          4.44444
4          8.84029
5          67.8994
6          24.7185
7          20.3720
8          85.8099
9          4.44444
10         35.2976
11         95.2147
12         22.5608
13         39.7013
14         15.0000
15         15.0000
16         35.0000
17         35.0000
18         66.5554
"CRITICAL PATH" TOTAL    592.138

Figure 3

A more effective release strategy is illustrated in Figure 4. This strategy assumes all first-order changes in release 1, all second order changes in release 2, etc. Figure 4 shows, for each release, the maximum number of changes in one module and the total number of changes. The reader who has worked on a large system will, no doubt, recognize the painfully slow convergence pattern. In this case, the system is assumed to be ready for external release when the "maximum changes per module" becomes less than one. If we assume the "critical path" changes are made at an average rate of about 1 per day, then Figure 4 is fairly representative of experience with the first release of UTS. The total number of changes on the "critical path" is 338, so that approximately 15 months would be required to stabilize the system for the first external release.

Figure 4-Maximum changes per module per release and total changes per release, by "internal" release

To conclude this example, let us take a brief look at the relationship between "total changes" and the probability of intermodule connection. The probabilities in the connection matrix above have an average value of approximately .04. What is the result if we assume the same relative distribution of probabilities in the matrix, but reduce the average by dividing each element by a constant? Figure 5 shows the total number of changes as a function of "average probability of module connection" under the above assumption. This curve shows that our example is precariously close to "critical mass" and that any small improvement in the connection probabilities results in significant payoff.

Figure 5-Total changes as a function of "average connection probability"

OTHER APPLICATIONS OF MODULE CONNECTION ANALYSIS

The value of module connection analysis is its simplicity. The computations can be performed easily by a small (less than 50 lines) program written in APL, BASIC, or whatever language is available. Used on-line, the technique is useful for experimenting with various design approaches, implementation strategies, etc. Three examples of this use of the model are described below:

Estimating new work

If the designers, or managers, of a system have kept detailed records of the module-module changes in the system (as described above), then the matrix P is a reliable estimator of the "ripple factor" for the system. It can be used to predict, and stage, the effort to stabilize the system after any set of changes. If we postulate a major improvement release of the system, then we can assume, for example, that the new program code falls into two categories: (1) independent code particular to a new function and, (2) code that necessitates changes in an existing module.
By estimating the number of changes, bi, to each module i, we can estimate the total number of changes to restabilize the system:

    total changes = b(I - P)⁻¹,  where b = (b1, b2, ..., bn)

The previously described computations can be used to estimate release intervals and total time for the improvement release. To be more realistic, it may be useful in the above computation to use bi + ei as the estimated changes in the module, where ei represents the number of changes required in module i by previous activity.

Evaluating design approaches

The best time to guarantee success of a system development effort is in the early design stages when architecture of the system is still variable. There is much to be gained by selecting an appropriate "decomposition" (see Reference 4) of the system into subsystems, components, etc. During this stage of a project, module connection analysis is a useful tool for evaluating various decompositions, interfacing techniques, etc. It is a simple, quantitative way of estimating the modularity of a system, the ever-present objective that no one knows exactly how to achieve. By fixing some of his assumptions about intermodule connections, a designer can experiment with various system organizations to determine which are the least likely to achieve "critical mass."

Evaluating implementation approaches

The reader who performs some simple experiments with the formulas described here is likely to be very surprised at the results. Even an extremely sparse connection matrix with very low probabilities can result [examine (I - P)⁻¹] in very large "ripple factors." It is also interesting to experiment with small perturbations in the connection matrix and observe the profound effect they can have on the "ripple factor." One becomes convinced more than ever before that it is necessary to minimize connections between modules, localize changes, and simplify the process of making changes.

The most impressive gains come from minimizing the probabilities of intermodule propagation of changes. A reduction of the average probability by as little as 5 or 10 percent can cause a significant reduction in the "ripple factor." Additional improvement can result from improvements in techniques for making changes. The total debug time is essentially linear with respect to the time required to make a change, but the multiplier (total number of changes) can be so large that any reduction in the time-per-change results in enormous savings.

Module connection techniques are extremely useful in estimating the value of various implementation techniques and strategies. How are the module connection probabilities changed if we use a high-level implementation language? How much easier will it be to make changes? How much will we save, if any, by doing elaborate environment simulation and testing of each module before it is integrated with the system? Module connection analysis is a valuable augmentation of intuition in these areas and can be useful for generating cost justifications for approaches that result in significant savings.

CONCLUSION

The objective of this paper has been to describe a simple model for the effect of "rippling changes" in a large system. The model can be used to estimate the number of changes and a release strategy for stabilizing a system given any set of initial changes. The model can be criticized for being simplistic, yet it seems to describe the essence of the problem of stabilizing a system.
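Such experimentation costs almost nothing to carry out. The sketch below (the same hypothetical three-module data as in the earlier sketch, not the paper's program) scales the connection matrix up and down, in the spirit of the Figure 5 exercise and the perturbation experiments suggested under "Evaluating implementation approaches":

    import numpy as np

    P = np.array([[0.2, 0.1, 0.0],          # hypothetical connection matrix
                  [0.1, 0.3, 0.1],
                  [0.0, 0.1, 0.2]])
    A = np.array([4.0, 10.0, 2.0])          # initial changes
    I = np.eye(len(A))

    for scale in (0.5, 0.75, 1.0, 1.25, 1.5):
        Q = P * scale                        # same relative distribution, different average probability
        radius = max(abs(np.linalg.eigvals(Q)))
        total = (A @ np.linalg.inv(I - Q)).sum()
        print(f"average probability {Q.mean():.3f}  spectral radius {radius:.2f}  total changes {total:8.1f}")
    # As the spectral radius of the scaled matrix approaches 1 the total grows
    # without bound; that is the "critical mass" behavior the Figure 5 curve shows.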
It is clear, to the author at least, that experimentation with the module connection model could have prevented a significant portion of the schedule delay that occurred for many large systems.

REFERENCES

1 H R J GROSCH  Why MAC, MIS, and ABM won't fly  Datamation 17  Nov 1 1971  pp 71-72
2 J D ARON  Estimating resources for large programming systems  Software Engineering Techniques  J N Buxton and B Randell (eds)  April 1970
3 L A BELADY  M M LEHMAN  Programming system dynamics or the meta-dynamics of systems in maintenance and growth  Research Report  IBM Thomas J Watson Research Center  Yorktown Heights  New York  July 1971
4 C ALEXANDER  Notes on the synthesis of form  Cambridge Mass  Harvard University Press  1964
5 M MARCUS  Basic theorems in matrix theory  National Bureau of Standards Applied Mathematics Series #57  US Government Printing Office  January 1960

Evaluating the effectiveness of software verification-Practical experience with an automated tool

by J. R. BROWN and R. H. HOFFMAN
TRW Systems Group
Redondo Beach, California

INTRODUCTION

From the point of view of the user, a reliable computer program is one which performs satisfactorily according to the computer program's specifications. The ability to determine if a computer program does indeed satisfy its specifications is most often based upon accumulated experience in using the software. This is due in part to general agreement that the quality of computer software increases as the software is extensively used and failures are discovered and corrected. In keeping with this philosophy, increasing emphasis has been placed on exhaustive testing of computer programs as the principal means of assuring sufficient quality. Nevertheless, a significant problem which pervades all software development is a lack of knowledge as to how much testing of a software system or component constitutes sufficient verification.
The major impact of this problem (if not adequately addressed) is evidenced by the high cost of testing (as much as 50 percent of total project cost) and insufficient visibility of test effectiveness. As a result, we often lack sufficient confidence that the software will continue to operate successfully for unanticipated combinations of data in a real-world environment.

In recognition of the high cost and uncertainty of software verification, TRW Systems' Product Assurance Office initiated a company-funded effort to improve upon current testing methodology. Much of the effort has been directed toward development of some general purpose automated software "tools" which would provide significant aid in performance of a software quality assurance activity. The desirable extent to which the "general purpose" and "automated" characteristics should be pursued has received considerable study, as did a precise definition of "significant aid." The result of the study, experimentation, design and development thus far conducted comprises the TRW Product Assurance Confidence Evaluator (PACE) system, an evolving collection of automated tools which provide support in various phases of software testing.

Examination of a typical software testing process results in identification of four fundamental activities: test planning, production, execution and evaluation. Examination of the overall cost and schedule impact resulting from manual performance of these activities reveals the reasons for many testing efforts being less complete and successful than expected. With emphasis upon those tasks which are often neglected due to the menial aspect of their performance, PACE development was planned to complement manual testing efforts with automated utilities.

Early planning and study efforts indicated a need to give emphasis to the ability of the system to meet diverse (and probably changing) user needs. To adequately cope with this requirement a number of events (instances) were identified at which operational releases of interim PACE capability would be most beneficial. Practical applications of the capabilities produced by each PACE instance would then provide meaningful direction for subsequent releases.

The initial PACE instance was the FLOW program to support test evaluation activities. FLOW monitors statement usage during test execution, thus providing a basic evaluation of test effectiveness. The results produced by FLOW, in particular the statement usage frequencies, are similar to the program profiles discussed by Knuth in Reference 1. In addition, FLOW supports the test planning activity by indicating the unexercised code and, consequently, the additional tests required for more comprehensive testing.

FLOW PROGRAM DESCRIPTION

Purpose

During the software development process, a question frequently asked (and seldom if ever answered satisfactorily) is: "How much testing is enough?" There appears to be vital interest in the subject,2,3,4 but too little in the way of practical applications has been accomplished in the past to provide any final answers. We feel strongly that a measure of the variety of ways in which a computer program is tested (or not tested) can combine to form a software "experience index", and quantification of the index supports evaluation of both the computer program and testing thoroughness. Based on this premise, the FLOW program was developed to: (1) support assessment of the extensiveness with which a computer program is tested, (2) provide a variety of quantified indices summarizing program operation, and, (3) support efforts to create a more comprehensive but less costly test process. The objective of FLOW is not to find errors, per se, but to quantitatively assess how thoroughly a program has been tested and to support test planning by indicating the portions of code which are not exercised by existing test cases.

Figure 1-Sample program with pseudo statement numbers (the listing shows the main program SPEAR, portions of subroutine SRANK, and the complete subroutine TIE from the Spearman rank correlation example, with each executable statement labeled by its PSN)

Method

FLOW analyzes the source code of a computer program and instruments the code in a manner which permits subsequent compilation and makes possible monitored execution of the program.
This technique is representative of one of several approaches toward software measurement technology described by Kolence.5 A complete application of FLOW provides for an accumulation of frequencies with which selected program elements (e.g., statements, small segments of code, subprograms, etc.) are exercised as the program is being tested. There are optional levels of detail at which usage monitoring can be performed. The desired level is selected by the user and controlled by input.

Figure 2-FLOW execution frequency summary (per-element statement usage frequencies and cumulative times, followed by a usage summary; for the sample test the subject program contained 90 executable statements, 72 of which were exercised, giving a test effectiveness ratio of .80 at the statement level and 1.00 at the entry point level)

Typical use of the complete FLOW capability involves the application of three distinct FLOW elements. The first of these is QAMOD, the code analysis and instrumentation module. The QAMOD module sequentially analyzes each statement of a FORTRAN source program and accomplishes the following:

1. The first executable statement of each element (i.e., subroutine or main program) is assigned a pseudo statement number (PSN) of one. Each subsequent statement (assuming that the most detailed monitoring is opted) is assigned a sequential PSN and the statements are displayed with their assigned number as illustrated in Figure 1.* Statements are referenced by element name and PSN during subsequent FLOW processing.
2. The code is instrumented by the insertion of transfers to the FLOW execution monitor subprogram, QAFLOW. The function of the transfers is the generation of a recording file containing the sequence of statements exercised during test execution.

Upon completion of the analysis and instrumentation of the source program, the instrumented version of the program is output to a file for subsequent compilation and execution. QAFLOW is appended to the program prior to execution with test data. The third FLOW module, QAPROC, provides summary statistics on the frequency of use of program elements as well as detailed trace information and an indication of the effectiveness of the test. QAPROC accesses the statement execution recording file generated by execution of the instrumented subject program and produces an evaluation and summary of the test case executed.
The recording file is sequentially accessed and the data are assimilated into an internal table. At times designated by the input control options, a display is printed (Figure 2) which includes the following:

1. A map, delineated by subroutine, indicating the number of executions which have been recorded for each statement.
2. Statistics indicating the percentage of the total executable statements which were exercised at least once.
3. Statistics indicating the percentage of the total number of subroutines which were executed at least once.
4. A list of the names of subroutines which were not executed.
5. Total execution time spent in each subroutine.

* The program shown is a modification of the Spearman Rank Correlation Coefficient program from the IBM Scientific Subroutine Package.6 Figure 1 shows a portion of the main program SPEAR and the subroutine SRANK (lines omitted indicated by :). The complete subroutine TIE is included to support later reference in this report.

Frequencies derived by FLOW from a number of separate tests of the subject program may be combined to provide a cumulative measure of the comprehensiveness of all testing applied to the program.

At the option of the user, detailed trace information can be displayed. The trace depicts the sequence in which statements (referenced by pseudo statement number) were exercised during program execution. A complete trace display for one test of the SPEAR program is illustrated in Figure 3. In addition, time of entry to each subroutine is recorded and displayed to support timing studies. The information in Figure 3 is interpreted as follows:

• Execution is initiated at pseudo statement number (PSN) 1 of the main program SPEAR at time 2.474;
• Subroutine SRANK is called from PSN 2 of SPEAR at time 2.479;
• Subroutine RANK is called following the sequential execution of PSN 1, 2 and 3 of SRANK;
• Upon entry to subroutine RANK, PSN 1 and 2 are executed 10 times before proceeding to PSN 3;
• When execution of RANK reaches PSN 24, control is returned to subroutine SRANK at PSN 4.

Figure 3-FLOW execution trace display (the sequence of PSN ranges executed in SPEAR, SRANK, RANK, and TIE, with the entry time for each element)

The value of the FLOW trace information in understanding an otherwise complex logic structure can be appreciated by following the execution of subroutine TIE (using the program listing in Figure 1). The interaction of the three FLOW modules is illustrated in Figure 4 with a description of inputs and outputs for a typical application.

Figure 4-FLOW program overview (QAMOD produces the indexed listing of the subject source program shown in Figure 1; QAPROC produces the execution frequency summary shown in Figure 2)

CASE STUDIES

In early planning for the capability which FLOW should provide, consideration was given to the requirements of the various phases of the software testing process. Because of the resulting flexibility of the FLOW program, successful use has been reported from a number of diverse applications. Major usage has been in two areas: (1) assessment of testing effectiveness, and (2) analysis and solution of software problems difficult to solve with conventional techniques. Brief descriptions of several such applications are included here and grouped accordingly.
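The bookkeeping FLOW automates for FORTRAN can be suggested with a much smaller, language-level analogue. The sketch below is not PACE/FLOW code: it uses Python's standard line-tracing hook to play the role of QAFLOW, a crude source scan to stand in for QAMOD's syntax analysis, and every function name in it is invented for the illustration.

    import sys, inspect

    def statement_coverage(func, *args, **kwargs):
        """Run func under a line tracer (the role QAFLOW plays) and report which of
        its source lines were exercised; a rough, single-function analogue of FLOW."""
        hits, code = set(), func.__code__

        def tracer(frame, event, arg):
            if event == "call":
                return tracer if frame.f_code is code else None   # monitor only the subject
            if event == "line":
                hits.add(frame.f_lineno)
            return tracer

        sys.settrace(tracer)
        try:
            func(*args, **kwargs)
        finally:
            sys.settrace(None)

        # Crude stand-in for QAMOD's analysis: treat non-blank, non-comment body
        # lines as executable statements (good enough for the illustration).
        src, start = inspect.getsourcelines(func)
        executable = {start + i for i, line in enumerate(src)
                      if line.strip() and not line.strip().startswith(("#", "def", "else", '"""'))}
        exercised = hits & executable
        print(f"test effectiveness ratio: {len(exercised) / len(executable):.2f}")
        print("unexercised lines:", sorted(executable - hits))

    def subject(values, threshold):
        total = 0
        for v in values:
            if v > threshold:
                total += v
            else:
                total -= 1          # a weak test never reaches this statement
        return total

    statement_coverage(subject, [5, 7, 9], threshold=1)

FLOW itself, of course, works on FORTRAN source, distinguishes statements by syntax rather than by line layout, and accumulates results across runs; the sketch only mirrors the shape of the report.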
Test effectiveness

• Houston Operations Predictor/Estimator (HOPE)

The HOPE program is used by NASA/MSC for orbit determination and error analyses on the Apollo Missions. It contains approximately 500 subprograms including 80,000 lines of code. Over a four year period of program development, cases had been added to the test data file as required until the file consisted of 33 separate cases which required 4.5 hours of computer time and 35-50 man-hours of test results validation. Although developers were aware that redundant testing was being performed, it was impractical to delete any of the cases from the file. Because of the criticality of the program's accuracy, removal of any test case without precise proof of its impact on verification effectiveness could not be allowed. In addition, the tight schedule of the project did not permit detailed manual appraisal of each test case.

The FLOW program provided the means of determining the areas of HOPE which were tested by each case. The first FLOW analysis disclosed that the 33 cases tested 85 percent of the subprograms and that one-half of this number were exercised by almost every case. Consideration of these statistics prompted the funding of extended analyses to produce a more effective test file. An incremental test planning activity was performed and a file of six cases was generated. These six cases tested 93 percent of the subprograms, but they required less than three hours of computer time and less than 24 man-hours of test results examination. Since the FLOW analyses indicate the areas of the program exercised by each case, these six cases can be selectively used at each update to assure maximum cost effectiveness.
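The "incremental test planning activity" that pared HOPE's 33 cases down to six is, at bottom, a covering problem over per-case FLOW maps. A sketch of the simplest (greedy) way to make that selection follows; the case names, subprogram names, and coverage sets are invented for the illustration, not HOPE data.

    # Per-case coverage as FLOW might report it: case -> subprograms exercised.
    coverage = {
        "case01": {"A", "B", "C", "D"},
        "case02": {"A", "B", "C"},
        "case03": {"C", "D", "E", "F"},
        "case04": {"A", "G"},
        "case05": {"B", "C", "D"},
    }

    def reduce_test_file(coverage):
        """Greedy set cover: repeatedly keep the case that adds the most
        not-yet-covered subprograms, until no case adds anything new."""
        remaining = set().union(*coverage.values())
        kept = []
        while remaining:
            best = max(coverage, key=lambda case: len(coverage[case] & remaining))
            gained = coverage[best] & remaining
            if not gained:
                break
            kept.append(best)
            remaining -= gained
        return kept

    kept = reduce_test_file(coverage)
    everything = set().union(*coverage.values())
    covered = set().union(*(coverage[c] for c in kept))
    print(f"kept {kept}: {len(covered)} of {len(everything)} subprograms covered")

Greedy selection is not guaranteed minimal, but it is the same shape of reasoning: keep a case only for the coverage it alone contributes.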
• Navigation Simulation Processor (NAVPRO)

NAVPRO is the program used by NASA/MSC to process data from Apollo Command Module and Lunar Module onboard computer navigation simulation programs. NAVPRO contains approximately 75 subprograms and 4000 executable statements. FLOW was applied to NAVPRO to assist in the generation of a comprehensive set of test cases. The first step was selection of a basic test from the cases which were then being used for verification. FLOW analysis of the effectiveness of this first test had surprising results; although the case exercised 45 percent of the NAVPRO code, it was apparent that the time span being simulated (and consequently the case execution time) could be reduced by 85 percent without significantly reducing the effectiveness of the test. By eliminating this redundant testing and then manually extending the input data with the goal of improving its effectiveness, the case was modified such that it tested 80 percent of the code in one-fourth the time required by the original case. By continued application of FLOW, a complete test file consisting of four cases was compiled which tested 98 percent of the executable statements. The 2 percent not tested were areas of the program dedicated to error terminations not considered worthy of verification at each program update. These were verified initially and will be retested only if modifications are made which specifically affect their operation.

• Skylab Activities Scheduling Program

The FLOW program was used by NASA/MSC to measure the comprehensiveness of a set of 20 test cases for 52 subroutines comprising a crew model for the Skylab Activities Scheduling Program. The testing which had been performed was thought to be adequate but, since the program is to be used for on-line mission support, documentary evidence of sufficient verification is especially important. Each of the 20 test cases was executed and evaluated separately by FLOW, then the results were accumulated using one of the FLOW options. These cumulative results verified that the critical software for each of the subroutines was indeed exercised; thus, there was no requirement to apply FLOW in the modification or addition of test cases. Although no direct manpower savings can be assigned to this application, the value of the confidence in the software and in the test cases due to the FLOW results is evident. The users also acknowledged the value of the trace capability of FLOW, since they easily diagnosed a program error which had been previously undiscovered in their testing.

• Program Anatomy Tables Generator (TABGEN)

TABGEN is a utility program developed for NASA/MSC as one of the components of the Automated Verification System (AVS). The functions of TABGEN are to perform syntax analysis of FORTRAN programs, segment the code into blocks of statements and generate tables describing each of these blocks (e.g., variables referenced, transfer destinations) and the logical relationships between blocks. TABGEN consists of approximately 25 subroutines and 2000 executable statements. Through FLOW application to TABGEN, test cases were devised to test 100 percent of the executable statements. The developers and users of TABGEN are convinced of the value of thorough testing, due to the fact that no errors have occurred since delivery of TABGEN in November 1971. The original version of the program was not altered until April 1972, when new requirements made modifications necessary.

• Minuteman Operational Targeting Program (MOTP)

MOTP is used by USAF/SAC to generate the targeting constants which must be supplied to the guidance computers aboard the Minuteman II missile system.
The program contains approximately 160 subprograms which are extensively overlaid. Prior to each delivery of an updated version to SAC, extensive validation must be performed. Because of the criticality of this validation exercise, a means of accurately measuring the testing effectiveness was clearly required. To determine the applicability of FLOW to the MOTP verification effort, a particularly complex portion of the program was instrumented and then monitored during a complete targeting run. FLOW provided new information about portions of the program which were assumed to be exercised but, in fact, were not. The results of this application clearly demonstrated the value of using FLOW to complement verification efforts. The decision was made to incorporate FLOW as a standard testing procedure for future deliveries. Recommendations were also made for selective use of the FLOW logic trace feature to gain a clearer understanding of the more complex portions of the MOTP.

Problem solving

• Apollo Reference Mission Program (ARM)

The ARM program is used by NASA/MSC during Apollo missions for simulation of all activities (powered and free flight) from earth launch to re-entry. Because of its extensive use during Apollo and anticipated future applications, it is imperative that the program execution time be optimal, especially in the areas of the program which receive the most use. The FLOW program was applied to ARM to determine the most-used portions of the program during a typical mission simulation and to obtain execution time analyses.* Although the application did not produce any surprising results, the predictions of the developers were verified (i.e., timing had been of prime consideration during development). Careful examination of critical statements (those exercised more than 10,000 times during the run) resulted in some minor modifications to improve timing which, if extrapolated over their anticipated period of use, will result in noticeable cost savings.

* Similar applications have been produced by Knuth using the FORDAP program.1

• DRUM SLAB II

The DRUM SLAB II aerodynamic analysis program was developed for NASA/MSC to simulate the molecule impact force and direction on spacecraft surfaces during re-entry. During checkout, the program always aborted after seven minutes of execution with an illegal operation apparently resulting from erroneous storage of data due to the complex computation of various indices. Several unsuccessful attempts were made to manually diagnose and solve the problem. Although the incorrect data storage was thought to be occurring throughout the run, it did not cause an abort until the density of the molecules began to increase rapidly at lower altitudes. It was not obvious which of the indices were being miscalculated or at what point they were computed. Because of the complex modelling of the program and the fact that the original developers were not available, the problem caused the development project to be discontinued after three months of unsuccessful debug efforts using conventional methods. Several months later, after attending a FLOW demonstration, the manager in charge of the DRUM SLAB development requested that FLOW be applied in an attempt to diagnose the problem. By selective instrumentation of the DRUM SLAB program and application of the FLOW data trace option, the problem was found to originate at some point during execution of the first 800 lines of the main program.
Then, by close examination of the statement execution trace for these 800 lines, the precise point at which the erroneous index computation occurred was determined. Three separate errors were found in the computation of various indices. Correction of these computations eliminated the store error and resulted in an apparently error-free execution until the run was terminated by the operator at 15 minutes (the maximum execution time specified for the run). Although limited funds and lack of personnel familiar with the DRUM SLAB program prohibited a complete verification of the modified program, the utility of FLOW was proven by the fact that the problem had been solved in 50 man-hours by personnel totally unfamiliar with the DRUM SLAB program. • Minuteman Geometric Identification Data Program (GIDATA) The Minuteman Geometric Identification Data (GIDATA) program has been used to generate absolute and relative radar data for tracking sites. Recently, the program was extensively modified to generate special purpose output. The FLOW program was applied to the GIDATA program before modification was started in hope that the analysis would give a better understanding of the program and, hence, aid in modification design. Some of the most useful information obtained from FLOW was: • Subroutine level trace and usage summary • Inefficient subroutine structure and calling sequences • Areas where code was never used • Relative subroutine timing indicating inefficient code Using this information a better understanding of GIDATA was achieved and it became relatively simple to determine necessary modifications for reducing program execution time and core requirements. Upon completion of the GIDATA modifications, additional applications of FLOW will ensure comprehensive testing of the program. • Navigation Simulation Processor (NAVPRO) In generation of a particular test case for NAVPRO (program described in the previous sections of this report), a problem developed when the error flag indicating vehicle impact with the lunar surface was being set during execution. Since the flag was in global COMMON and could have been set in any of several subroutines during the integration, it was difficult to determine precisely where the error was occurring. Since NAVPRO had already been instrumented for statement execution monitoring, the origin of the error was easily detected. By using a special option of FLOW, the value stored in the error flag location was checked at execution of each transfer during the run. The FLOW display disclosed the exact statement at which the vehicle impact flag was set and described the program logic flow immediately preceding the impact. The error, which was in the NAVPRO input data, was found and corrected. • Earth Re-entry Orbit Determination Program (REPOD) REPOD is a large multi-link program developed and used in support of Minuteman trajectory analysis and orbit determination. Since REPOD is an amalgamation of several older programs, the detailed flow through each of its 9 links is particularly difficult to understand. The trajectory link is one of the more complex and was therefore chosen for FLOW analysis to identify possible program improvements. The analysis of the trajectory link was particularly desirable because: (1) A significant portion of the total REPOD execution time is spent in this link. (2) It was felt, by the user, that the FLOW analysis would lead to significant improvement in program efficiency. 
One application of FLOW provided some striking results in identification of blocks of statements which were exercised with unexpectedly high frequency. FLOW also: (1) identified portions of REPOD not used for selected input options, (2) displayed subroutine and statement trace data for given options, and (3) indicated primary areas of concern for subsequent program improvements. Using the FLOW results as a guide, a detailed examination of the trajectory integration algorithm was initiated. The complete task culminated in significant reductions in execution time (for example, processing time for one function was cut from 67 seconds to 11 seconds) and optimum selection of error criterion and integration step size for improved program performance.

SUMMARY

The initial PACE instance described here responded to an important need in supporting assurance of comprehensively tested and more reliable software products. Although execution of all statements is by no means a conclusive measure of test effectiveness, it is considered an important first step in the improvement of conventional testing methodology. Subsequent instances of PACE have produced:

• A program which displays unexercised statements and performs an analysis of the FORTRAN code to determine the conditions necessary for their execution;3 the computation and input of significant parameters is highlighted to support test redesign activities.
• A program to determine all possible logical transfers and extrapolate these to construct and display all logic paths within a FORTRAN module.7
• A program to monitor the execution of transfers during program execution;8 a test effectiveness ratio is calculated based upon actual versus potential transfers exercised (used either as an alternative or in conjunction with FLOW statement usage analyses).

Parallel research and development activities have resulted in:

• A FLOW-like program to produce statement usage frequency without the execution trace feature;9 although the results are not as detailed as those produced by FLOW, the program operation is more efficient and therefore more easily applied to large systems.
• Well-defined steps for the adaptation of PACE technology to programming languages other than FORTRAN (e.g., assembly language, COBOL, JOVIAL).

This approach toward development of PACE technology has proved successful and has resulted in needed exposure and critique of concepts and techniques. PACE applications have already provided some very meaningful answers to a variety of participants (from programmer to procurer) in a number of software development activities. As was expected, each new application lends additional insight into the evaluation of existing PACE technology and provides vital information for direction of continued design and implementation.10 Application of PACE capabilities has stimulated interest in the effectiveness of testing among TRW personnel and its customers and has provided a firm foundation upon which a long-neglected technology2 can now advance.

ACKNOWLEDGMENTS

Without the cooperation of many individuals the collection and presentation of the FLOW usage results documented here would have been an extremely difficult task. Those particularly deserving of mention are A. C. Arterbery, K. W. Krause, Dr. E. C. Nelson, R. M. Poole, R. W. Smith and R. F. Webber.
REFERENCES

1 D E KNUTH  An empirical study of FORTRAN programs  Software-Practice and Experience  Vol 1  pp 105-133  1971
2 F GRUENBERGER  Program testing and validating  Computing: A First Course  1968
3 J R BROWN et al  Automated software quality assurance: A case study of three systems  Presented at the ACM SIGPLAN Symposium  Chapel Hill  North Carolina  June 21-23 1972
4 LTC F BUCKLEY  Verification of software programs  Computers and Automation  February 1971
5 K W KOLENCE  A software view of measurement tools  Datamation  January 1971
6 System/360 scientific subroutine package (360A-CM-03X) version III programmer's manual  IBM Application Program H20-0205-3
7 J R BROWN  Practical applications of automated software tools  To be published in the Proceedings of the Western Electronic Show and Convention (WESCON)  Los Angeles  California  September 19-22 1972
8 R W SMITH  Measurement of segment relationship execution frequency  TRW Systems (#72-4912.30-31)  March 29 1972
9 R H HOFFMAN et al  Node determination and analysis program (NODAL) user's manual  TRW Systems (#18793-6147-RO-00)  June 30 1972
10 J R BROWN  R H HOFFMAN  Automating software development-A survey of techniques and automated tools  TRW Inc  May 1972

A design methodology for reliable software systems*

by B. H. LISKOV**
The MITRE Corporation
Bedford, Massachusetts

INTRODUCTION

Any user of a computer system is aware that current systems are unreliable because of errors in their software components. While system designers and implementers recognize the need for reliable software, they have been unable to produce it. For example, operating systems such as OS/360 are released to the public with hundreds of errors still in them.1 A project is underway at the MITRE Corporation which is concerned with learning how to build reliable software systems. Because systems of any size can always be expected to be subject to changes in requirements, the project goal is to produce not only reliable software, but readable software which is relatively easy to modify and maintain. This paper describes a design methodology developed as part of that project.

Rationale

Before going on to describe the methodology, a few words are in order about why a design methodology approach to software reliability has been selected.† The unfortunate fact is that the standard approach to building systems, involving extensive debugging, has not proved successful in producing reliable software, and there is no reason to suppose it ever will. Although improvements in debugging techniques may lead to the detection of more errors, this does not imply that all errors will be found. There certainly is no guarantee of this implicit in debugging: as Dijkstra said, "Program testing can be used to show the presence of bugs, but never to show their absence."3

In order for testing to guarantee reliability, it is necessary to insure that all relevant test cases have been checked. This requires solving two problems:

(1) A complete (but minimal) set of relevant test cases must be identified.
(2) It must be possible to test all relevant test cases; this implies that the set of relevant test cases is small and that it is possible to generate every case.

The solutions to these problems do not lie in the domain of debugging, which has no control over the sources of the problems. Instead, since it is the system design which determines how many test cases there are and how easily they can be identified, the problems can be solved most effectively during the design process: The need for exhaustive testing must influence the design. We believe that such a design methodology can be developed by borrowing from the work being done on proof of correctness of programs. While it is too difficult at present to give formal proofs of the correctness of large programs, it is possible to structure programs so that they are more amenable to proof techniques. The objective of the methodology presented in this paper is to produce such a program structure, which will lend itself to informal proofs of correctness. The proofs, in addition to building confidence in the correctness of the program, will help to identify the relevant test cases, which can then be exhaustively tested. When exhaustive testing is combined with informal proofs, it is reasonable to expect reliable software after testing is complete. This expectation is borne out by at least one experiment performed in the past.4

* This work was supported by Air Force Contract No. F19(628)71-C-0002.
** Present Address: Department of Electrical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts.
† The material in this section is covered in much greater detail in Liskov and Towster.2
The proofs, in addition to building confidence in the correctness of the program, will help to identify the relevant test cases, which can then be exhaustively tested. When exhaustive testing is combined with informal proofs, it is reasonable to expect reliable software after testing is complete. This expectation is borne out by at least one experiment performed in the past. 4 Rationale Before going on to describe the methodology, a few words are in order about why a design methodology approach to software reliability has been selected. t The unfortunate fact is that the standard approach to building systems, involving extensive debugging, has not proved successful in producing reliable software, and there is no reason to suppose it ever will. Although improvements in debugging techniques may lead to the detection of more errors, this does not imply that all errors will be found. There certainly is no guarantee of this implicit in debugging: as Dijkstra said, "Program testing can be used to show the presence of bugs, but never to show their absence." 3 * This work was supported by Air Force Contract No. F19(628)71-C-0002. ** Present Address-Department of Electrical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts. t The material in this section is covered in much greater detail in Liskov and Towster.2 The scope of the paper A key word in the discussion of software reliability is "complex"; it is only when dealing with complex sys191 192 Fall Joint Computer Conference, 1972 tems that reliability becomes an acute problem. A twofold definition is offered for "complex." First, there are many system states in such a system, and it is difficult to organize the program logic to handle all states correctly. Second, the efforts of many individuals must be coordinated in order to build the system. A design methodology is concerned with providing techniques which enable designers to cope with the inherent logical complexity effectively. Coordination of the efforts of individuals is accomplished through management techniques. The fact that this paper only discusses a design methodology should not be interpreted to imply that management techniques are unimportant. Both design methodology and management techniques are essential to the successful construction of reliable systems. It is customary to divide the construction of a software system into three stages: design, implementation, and testing. Design involves both making decisions about what precisely a system will do and then planning an overall structure for the software which enables it to perform its tasks. A "good" design is an essential first step toward a reliable system, but there is still a long way to go before the system actually exists. Only management techniques can insure that the system implementation fits into the structure established by the design and that exhaustive testing is carried out. The management techniques should not only have the form of requirements placed on personnel; the organization of personnel is also important. It is generally accepted that the organizational structure imposes a structure on the system being built.5 Since we wish to have a system structure based on the design methodology, the organizational structure must be set up accordingly. * CRITERIA FOR A GOOD DESIGN The design methodology is presented in two parts. This section defines the criteria which a system design should satisfy. 
The next section presents guidelines intended to help a designer develop a design satisfying the criteria.

To reiterate, a complex system is one in which there are so many system states that it is difficult to understand how to organize the program logic so that all states will be handled correctly. The obvious technique to apply when confronting this type of situation is "divide and rule." This is an old idea in programming and is known as modularization. Modularization consists of dividing a program into subprograms (modules) which can be compiled separately, but which have connections with other modules.

* Management techniques intended to support the design methodology proposed in this paper are described by Liskov.6

We will use the definition of Parnas:7 "The connections between modules are the assumptions which the modules make about each other." Modules have connections in control via their entry and exit points; connections in data, explicitly via their arguments and values, and implicitly through data referenced by more than one module; and connections in the services which the modules provide for one another.

Traditionally, modularity was chosen as a technique for system production because it makes a large system more manageable. It permits efficient use of personnel, since programmers can implement and test different modules in parallel. Also, it permits a single function to be performed by a single module and implemented and tested just once, thus eliminating some duplication of effort and also standardizing the way such functions are performed.

The basic idea of modularity seems very good, but unfortunately it does not always work well in practice. The trouble is that the division of a system into modules may introduce additional complexity. The complexity comes from two sources: functional complexity and complexity in the connections between the modules. Examples of such complexity are:

(1) A module is made to do too many (related but different) functions, until its logic is completely obscured by the tests to distinguish among the different functions (functional complexity).
(2) A common function is not identified early enough, with the result that it is distributed among many different modules, thus obscuring the logic of each affected module (functional complexity).
(3) Modules interact on common data in unexpected ways (complexity in connections).

The point is that if modularity is viewed only as an aid to management, then any ad hoc modularization of the system is acceptable. However, the success of modularity depends directly on how well modules are chosen. We will accept modularization as the way of organizing the programming of complex software systems. A major part of this paper will be concerned with the question of how good modularity can be achieved, that is, how modules can be chosen so as to minimize the connections between them. First, however, it is necessary to give a definition of "good" modularity. To emphasize the requirement that modules be as disjoint as possible, and because the term "module" has been used so often and so diversely, we will discard it and define modularity as the division of the system into
The first, levels of abstraction, permits the development of a system design which copes with the inherent complexity of the system effectively. The second, structured programming, insures a clear and understandable representation of the design in the system software. Levels of abstraction Levels of abstraction were first defined by Dijkstra. 8 They provide a conceptual framework for achieving a clear and logical design for a system. The entire system is conceived as a hierarchy of levels, the lowest levels being those closest to the machine. Each level supports an important abstraction; for example, one level might support segments (named virtual memories), while another (higher) level could support files which consist of several segments connected together. An example of a file system design based entirely on a hierarchy of levels can be found in Madnick and Alsop. 9 Each level of abstraction is composed of a group of related functions. One or more of these functions may be referenced (called) by functions belonging to other levels; these are the external functions. There may also be internal functions which are used only within the level to perform certain tasks common to all work being performed by the level and which cannot be referenced from other levels of abstraction. Levels of abstraction, which will constitute the partitions of the system, are accompanied by rules governing some of the connections between them. There are two important rules governing levels of abstraction. The first concerns resources (I/O devices, data) : each level has resources which it owns exclusively and which other levels are not permitted to access. The second involves the hierarchy: lower levels are not aware of the existence of higher levels and therefore may not refer to them in any way. Higher levels may appeal to the (external) functions of lower levels to perform tasks; they may also appeal to them to obtain information contained in the resources of the lower levels. * * In the Madnick and Alsop paper referenced earlier, the hierarchy of levels is strictly enforced in the sense that if the third level wishes to make use of the services of the first level, it must do so through the second level. This paper does not impose such a strict requirement; a high level may make use of a level several steps below it in the hierarchy without necessarily requiring the assistance of intermediate levels. The 'THE' systemS and the Venus systemlO contain exampl~ of levels used in this way. 193 Structured programming Structured programming is a programming discipline which was introduced with reliability in mind. 11 ,12 Although of fairly recent origin, the term "structured programming" does not have a standard definition. We will use the following definition in this paper. Structured programming is defined by two rules. The first rule states that structured programs are developed from the top down, in levels. * The highest level describes the flow of control among major functional components (major subsystems) of the system; component names are introduced to represent the components. The names are subsequently associated with code which describes the flow of control among still lower-level components, which are again represented by their component names. The process stops when no undefined names remaIn. The second rule defines which control structures may be used in structured programs. 
Only the following control structures are permitted: concatenation, selection of the next statement based on the testing of a condition, and iteration. Connection of two statements by a goto is not permitted. The statements themselves may make use of the component names of lower-level components. Structured prograInming and proofs of correctness The goal of structured programming is to produce program structures which are amenable to proofs of correctness. The proof of a structured program is broken down into proofs of the correctness of each of the components. Before a component is coded, a specification exists explaining its input and output and the function which it is supposed to perform. (The specification is defined at the time the component name is introduced; it may even be part of the name.) When the component is coded, it is expressed in terms of specifications of lower level components. The theorem to be proved is that the code of the component matches its specifications; this proof will be given based on axioms stating that lower level components match their specifications. The proof depends on the rule about control structures in two important ways. First, limiting a component to combinations of the three permissible control structures insures that control always returns from a component to the statement following the use of the * The levels in a structured program are not (usually) levels of abstraction, because they do not obey the rule about ownership of resources. 194 Fall Joint Computer Conference, 1972 component name (this would not be true if go to statements were permitted). This means that reasoning about the flow of control in the system may be limited to the flow of control as defined locally in the component being proved. Second, each permissible control structure is associated with a well-known rule of inference: concatenation with linear reasoning, iteration with induction, and conditional selection with case analysis. These rules of inference are the tools used to perform the proof (or understand the component). Structured progra:mming and syste:m design Structured programming is obviously applicable to system implementation. We do not believe that by itself it constitutes a sufficient basis for system design; rather we believe that system design should be based on identification of levels of abstraction. * Levels of abstraction provide the framework around which and within which structured programming can take place. Structured programming is compatible with levels of abstraction because it provides a comfortable environment in which to deal with abstractions. Each structured program component is written in terms of the names of lower-level components; these names, in effect, constitute a vocabulary of abstractions. In addition, structured programs can replace flowcharts as a way of specifying what a program is supposed to do. 
Figure 1 shows a structured program for the top level of the parser in a bottom-up compiler for an operator precedence grammar, and Figure 2 is a flowchart containing approximately the same amount of detail.

begin
    integer relation; boolean must_scan; string symbol; stack parse_stack;
    must_scan := true;
    push(parse_stack, eol_entry);
    while not finished(parse_stack) do
        begin
            if must_scan then symbol := scan_next_symbol;
            relation := precedence_relation(top(parse_stack), symbol);
            perform_operation_based_on_relation(relation, parse_stack, symbol, must_scan)
        end
end

Figure 1-A structured program for an operator precedence parser

Figure 2-Flowchart of an operator precedence parser (initialize; until finished: scan symbol if necessary, compute precedence relation, perform operation based on the relation)

* A recent paper by Henderson and Snowden13 describes an experiment in which structured programming was the only technique used to build a program. The program had an error in it which was the direct result of not identifying a level of abstraction.

While it is slightly more difficult to write the structured program, there are compensating advantages. The structured program is part of the final program; no translation is necessary (with the attendant possibility of introduction of errors). In addition, a structured program is more rigorous than a flowchart. For one thing, it is written in a programming language and therefore the semantics are well defined. For another, a flowchart only describes the flow of control among parts of a system, but a structured program at a minimum must also define the data controlling its flow,
Since the system structure is expressed as a structured program, it should be possible to prove that it satisfies the system specifications, assuming that the structured programs which will eventually support the functions of the levels of abstraction satisfy their specifications. In addition, it is reasonable to expect that exhaustive testing of all relevant test cases will be possible. Exhaustive testing of the whole system means that each partition must be exhaustively tested, and all combinations of partitions must be exhaustively tested. Exhaustive testing of a single partition involves both testing based on input parameters to the functions in the partition and testing based on intermediate values of state vari- 195 abIes of the partition. When this testing is complete, it is no longer necessary to worry about the state variables because of requirement 2. Thus, the testing of combinations of partitions is limited to testing the input and output parameters of the external functions in the partitions. In addition, requirement 3 says that partitions are logically independent of one another; this means that it is not necessary when combining partitions to test combinations of the relevant test cases for each partition. Thus, the number of relevant test cases for two partitions equals the sum of the relevant test cases for each partition, not the product. GUIDELINES FOR SYSTEM DESIGN Now that we have a definition of good modularization, the next question is how a system modularization satisfying this definition can be achieved. The traditional technique for modularization is to analyze the execution-time flow of the system and organize the system structure around each major sequential task. This technique leads to a structure which has very simple connections in control, but the connections in data tend to be complex (for examples see Parnas14 and CohenI5). The structure therefore violates requirement 2; it is likely to violate requirement 3 also since there is no reason (in general) to assume any correspondence between the sequential ordering of events and the independence of the events. If the execution flow technique is discarded, however, we are left with almost nothing concrete to help us make decisions about how to organize the system structure. The guidelines presented here are intended to help rectify this situation. First are some guidelines about how to select abstractions; these guidelines tend to overlap, and when designing a system, the choice of a particular abstraction will probably be based on several of the guidelines. Next the question of how to proceed with the design is addressed. Finally, an example of the selection of a particular abstraction within the Venus systemlO is presented to illustrate the application of several of the principles; an understanding of Venus is not necessary for understanding the example. Guidelines for selecting abstractions Partitions are always introduced to support an abstraction or concept which the designer finds helpful in thinking about the system. Abstraction is a very valuable aid to ordering complexity. Abstractions are introduced in order to make what the system is doing clearer and more understandable; an abstraction is a conceptual simplification because it expresses what is being done 196 Fall Joint Computer Conference, 1972 without specifying how it is done. The purpose of this section is to discuss the types of abstractions which may be expected to be useful in designing a system. 
Abstractions of resources Every hardware resource available on the system will be represented by an abstraction having useful characteristics for the user or the system itself. The abstraction will be supported by a partition whose functions map the characteristics of the abstract resource into the characteristics of the real underlying resource or resources. This mapping may itself make use of several lower partitions, each supporting an abstraction useful in defining the functions of the original partition. It is likely that a strict hierarchy will be imposed on the group of partitions; that is, other parts of the system may only reference the functions in the original partition. In this case, we will refer to the lower partitions as "sub-partitions." Two examples of abstract resources are given. In an interactive system, "abstract teletypes" with end-ofmessage and erasing conventions are to be expected. In a multiprogramming system, the abstraction of processes frees the rest of the system from concern about the true number of processors. Abstract characteristics of data In most systems the users are interested in the structure of data rather than (or in addition to) storage of data. The system can satisfy this interest by the inclusion of an abstraction supporting the chosen data structure; functions of the partition for that abstraction will map the structure into the way data is actually represented by the machine (again this may be accomplished by several sub-partitions). For example, in a file management system such an abstraction might be an indexed sequential access method. The system itself also benefits from abstract representation of data; for example, the scanner in a compiler permits the rest of the compiler to deal with symbols rather than with characters. Simplification via limiting information According to the third requirement for good modularization, the functions comprising a partition support only one abstraction and nothing more. Sometimes it is difficult to see that this restriction is being violated, or to recognize that the possibility for identification of another abstraction exists. One technique for simplification is to limit the amount of information which the functions in the partition need to know (or even have access to). An example of such information is the complicated format in which data is stored for use by the functions in the partition (the data would be a resource of the partition). The functions require the information embedded in the data but need not know how it is derived from the data. This knowledge can be successfully hidden within a lower partition (possibly a sub-partition) whose functions will provide requested information when called; note that the data in question become a resource of the lower partition. Simplification via generalization Another technique for simplification is to recognize that a slight generalization of a function (or group of functions) will cause the functions to become generally useful. Then a separate partition can be created to contain the generalized function or functions. Separating such groups is a common technique in system implementation and is also useful for error avoidance, minimization of work, and standardization. The existence of such a group simplifies other partitions, which need only appeal to the functions of the lower partition rather than perform the tasks themselves. 
An example of a generalization is a function which will move a specified number of characters from one location to another, where both locations are also specified; this function is a generalization of a function in which one or more of the input parameters is assumed. Sometimes an already existing partition contains functions supporting tasks very similar to some work which must be performed. When this is true, a new partition containing new versions of those functions may be created, provided that the new functions are not much more complex than the old ones. System maintenance and modification Producing a system which is easily modified and maintained is one of our primary goals. This goal can be aided by separating into independent partitions functions which are performing a task whose definition is likely to change in the future. For example, if a partition supports paging of data between core and some backup storage, it may be wise to isolate as an independent partition those functions which actually know what the backup storage device is (and the device becomes a resource of the new partition). Then if a new device is added to the system (or a current device is removed), only the functions in the lower partition 'will be affected; the higher partition will have been isolated Design Methodology for Reliable Software Systems from such changes by the requirement about data connections between partitions. How to proceed with the design Two phases of design are distinguished. The very first phase of the design (phase 1) will be concerned with defining precise system specifications and analyzing them with respect to the environment (hardware or software) in which the system will eventually exist. The result of this phase will be a number of abstractions which represent the eventual system behavior in a very general way. These abstractions imply the existence of partitions, but very little is known about the connections between the partitions, the flow of control among the partitions (although a general idea of the hierarchy of partitions will exist), or how the functions of the partitions will be coded. Every important external characteristic of the system should be present as an abstraction at this stage. Many of the abstractions have to do with the management of system resources; others have to do with services provided to the user. The second phase of system design (phase 2) investigates the practicality of the abstractions proposed by phase 1 and establishes the data connections between the partitions and the flow of control among the partitions. This latter exercise establishes the placement of the various partitions in the hierarchy. The second phase occurs concurrently with the first; as abstractions are proposed, their utility and practicality are immediately investigated. For example; in an information retrieval system the question of whether a given search technique is efficient enough to satisfy system constraints must be investigated. A partition has been adequately investigated when its connections with the rest of the system are known and when the designers are confident that they understand exactly what its effect on the system will be. Varying depths of analysis will be necessary to achieve this confidence. It may be necessary to analyze how the functions of the partition could be implemented, involving phase 1 analysis as new abstractions are postulated requiring lower partitions or sub-partitions. 
Possible results of a phase 2 investigation are that an abstraction may be accepted with or without changes, or it may be rejected. If an abstraction is rejected, then another abstraction must be proposed (phase 1) and investigated (phase 2). The iteration between phase 1 and phase 2 continues until the design is complete. Structured program.m.ing It is not clear exactly how early structured- programming of the system should begin. Obviously, whenever 197 the urge is felt to draw a flowchart, a structured program should be written instead. Structured programs connecting all the partitions together will be expected by the end of the design phase. The best rule is probably to keep trying to write structured programs; failure will indicate that system abstractions are not yet sufficiently understood and perhaps this exercise will shed some light on where more effort is needed or where other abstractions are required. When is the design finished? The design will be considered finished when the following criteria are satisfied: (1) All major abstractions have been identified and partitions defined for them; the system resources have been distributed among the partitions and their positions in the hierarchy established. (2) The system exists as a structured program, showing how the flow of control passes among the partitions. The structured program consists of several components, but no component is likely to be completely defined; rather each component is likely to use the names of lower-level components which are not yet defined. The interfaces between the partitions have been defined, and the relevant test cases for each partition have been identified. (3) Sufficient information is available so that a skeleton of a user's guide to the system could be written. Many details of the guide would be filled in later, but new sections should not be needed.* A n example from Venus The following example from the Venus systemlO is presented because it illustrates many of the points made about selection, implementation, and use of abstractions and partitions. The concept to be discussed is that of external segment name, referred to as ESN from now on. The concept of ESN was introduced as an abstraction primarily for the benefit of users of the system. The important point is that a segment (named virtual memory) exists both conceptually (as a place where a * This requirement helps to insure that the design fulfills the system specifications. In fact, if there is a customer for whom the system is being developed, a preliminary user's guide derived from the system design could be a means for reviewing and accepting the design. 198 Fall Joint Computer Conference, 1972 programmer thinks of information as being stored) and in reality (the encoding of that information in the computer). The reality of a segment is supported by an internal segment name (ISN) which is not very convenient for a programmer to use or remember. Therefore, the symbolic ESN was introduced. As soon as the concept of ESN was imagined, the existence of a partition supporting this concept was implied. This partition owned a nebulous data resource, a dictionary, which contained information about the mappings between ESNsand ISNs. The formatting of this data was hidden information as far as the rest of the system was concerned. In fact, decisions about the dictionary format and about the algorithms used to search a dictionary could safely be delayed until much later in the design process. 
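A present-day sketch of what the external functions of such a dictionary partition might look like is given below. It is not from the paper; the function names and the sample values are illustrative. The only point it makes is the one just stated: callers see functions, not the dictionary format, so the format and search algorithm remain free to change.

    # A sketch (not from the paper) of a dictionary partition's external
    # interface in Python.  The internal representation is a hidden resource
    # of the partition and could be reorganized (different format, different
    # search algorithm) without affecting callers.

    _dictionary = []                 # hidden data resource: (ESN, ISN) pairs

    def add_entry(esn, isn):
        # External function: record the mapping from a symbolic ESN to an ISN.
        _dictionary.append((esn, isn))

    def lookup(esn):
        # External function: map an ESN to its ISN; the search method used
        # here (straight serial search) is invisible to callers.
        for name, isn in _dictionary:
            if name == esn:
                return isn
        return None

    add_entry("PAYROL", 17)          # the ISN value is made up for the sketch
    print(lookup("PAYROL"))          # -> 17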
A collective name, the dictionary functions, was given to the functions in this partition. Now phase 2 analysis commenced. It was necessary to define the interface presented by the partition to the rest of the system. Obvious items of interest are ESNs and ISNs; the format of ISNs was already determined by the computer architecture, but it was necessary to decide about the format of ESNs. The most general format would be a count of the number of characters in the ESN followed by the ESN itself; for efficiency, however, a fixed format of six characters was selected. At this point a generalization of the concept of ESN occurred, because it was recognized that a two-part ESN would be more useful than a single symbolic ESN. The first part of the ESN is the symbolic name of the dictionary which should be used to make the mapping; the second part is the symbolic name to be looked up in the dictionary. This concept was supported by the existence of a dictionary containing the names of all dictionaries. A format had to be chosen for telling dictionary functions which dictionary to use; for reasons of efficiency, the ISN of the dictionary was chosen (thus avoiding repeated conversions of dictionary ESN into diction~ry IS N) . When phase 2 analysis was over, we had the identification of a partition; we knew what type of function belonged in this partition, what sort of interface it presented to the rest of the system, and what information was kept in dictionaries. As the system design proceeded, new dictionary functions were specified as needed. Two generalizations were realized later. The first was to add extra information to the dictionary; this was information which the system wanted on a segment basis, and the dictionaries were a handy place to store it. The second was to make use of dictionary functions as a general mapping device; for example, dictionaries are used to hold information about the map- ping of record names into tape locations, permitting simplification of a higher partition. In reality, as soon as dictionaries and dictionary functions were conceived, a core of dictionary functions was implemented and tested. This is a common situation in building systems and did not cause any difficulty in this case. For one thing, extra space was purposely left in dictionary entries because we suspected we might want extra information there later although we did not then know what it was. The search algorithm selected was straight serial search; the search was embedded in two internal dictionary functions (a sub-partition) so that the format of the dictionaries might be changed and the search algorithm redefined with very little effect on the system or most of the dictionary functions. This follows the guideline of modifiability. CONCLUSIONS This paper has described a design methodology for the development of reliable software systems. The first part of the methodology is a definition of a "good" system modularization, in which the system is organized into a hierarchy of "partitions", each supporting an "abstraction" and having minimal connections with one another. The total system design, showing how control flows among the partitions, is expressed as a structured program, and thus the system structure is amenable to proof techniques. The second part of the methodology addresses the question of how to achieve a system design having good modularity. 
The key to design is seen as the identification of "useful" abstractions which are introduced to help a designer think about the system; some methods of finding abstractions are suggested. Also included is a definition of the "end of design", at which time, in addition to having a system design with the desired structure, a preliminary user's guide to the system could be written as a way of checking that the system meets its specifications. Although the methodology proposed in this paper is based on techniques which have contributed to the production of reliable software in the past, it is nevertheless largely intuitive, and may prove difficult to apply to real system design. The next step to be undertaken at MITRE is to test the methodology by conscientiously applying it, in conjunction with certain management techniques,6 to the construction of a small, but complex, multi-user file management system. We hope that this exercise will lead to the refinement, extension and clarification of the methodology.

ACKNOWLEDGMENTS

The author wishes to thank J. A. Clapp and D. L. Parnas for many helpful criticisms.

REFERENCES

1 J N BUXTON B RANDELL (eds) Software engineering techniques Report on a Conference Sponsored by the NATO Science Committee Rome Italy p 20 1969
2 B H LISKOV E TOWSTER The proof of correctness approach to reliable systems The MITRE Corporation MTR 2073 Bedford Massachusetts 1971
3 E W DIJKSTRA Structured programming Software Engineering Techniques Report on a Conference sponsored by the NATO Science Committee Rome Italy J N Buxton and B Randell (eds) pp 84-88 1969
4 F T BAKER Chief programmer team management of production programming IBM Syst J 11 1 pp 56-73 1972
5 M CONWAY How do committees invent? Datamation 14 4 pp 28-31 1968
6 B H LISKOV Guidelines for the design and implementation of reliable software systems The MITRE Corporation MTR 2345 Bedford Massachusetts 1972
7 D L PARNAS Information distribution aspects of design methodology Technical Report Department of Computer Science Carnegie-Mellon University 1971
8 E W DIJKSTRA The structure of the "THE"-multiprogramming system Comm ACM 11 5 pp 341-346 1968
9 S MADNICK J W ALSOP II A modular approach to file system design AFIPS Conference Proceedings 34 AFIPS Press Montvale New Jersey pp 1-13 1969
10 B H LISKOV The design of the Venus operating system Comm ACM 15 3 pp 144-149 1972
11 E W DIJKSTRA Notes on structured programming Technische Hogeschool Eindhoven The Netherlands 1969
12 H D MILLS Structured programming in large systems Debugging Techniques in Large Systems R Rustin (ed) Prentice Hall Inc Englewood Cliffs New Jersey pp 41-55
13 P HENDERSON R SNOWDEN An experiment in structured programming BIT 12 pp 38-53 1972
14 D L PARNAS On the criteria to be used in decomposing systems into modules Technical Report CMU-CS-71-101 Carnegie-Mellon University 1971
15 A COHEN Modular programs: Defining the module Datamation 18 1 pp 34-37 1972

A summary of progress toward proving program correctness

by T. A. LINDEN
National Security Agency
Ft. George G. Meade, Maryland
INTRODUCTION

Interest in proving the correctness of programs has grown explosively within the last two or three years. There are now over a hundred people pursuing research on this general topic; most of them are relative newcomers to the field. At least three reasons can be cited for this rapid growth:

(1) The inability to design and implement software systems which can be guaranteed correct is severely restricting computer applications in many important areas.
(2) Debugging and maintaining large computer programs is now well recognized as one of the most serious and costly problems facing the computer industry.
(3) A large number of mathematicians, especially logicians, are interested in applications where their talents can be used.

This paper summarizes recent progress in developing rigorous techniques for proving that programs satisfy formally defined specifications. Until recently proofs of correctness were limited to toy programs. They are still limited to small programs, but it is now conceivable to attempt to prove the correctness of small critical modules of a large program. This paper is designed to give a sufficient introduction to current research so that a software engineer can evaluate whether a proof of correctness might be applicable to some of his problems sometime in the future.

THE NATURE OF CORRECTNESS PROOFS

Given formal specifications for a program and given the text of a program in some formally defined language, it is then a well-defined mathematical question to ask whether the program text is correct with respect to those specifications. The mathematics necessary for this was originally worked out primarily by Floyd1 and Manna.2 It must be made clear that a proof of correctness is radically different from the usual process of testing a program. Testing can and often does prove a program is incorrect, but no reasonable amount of testing can ever prove that a nontrivial program will be correct over all allowable inputs.

Example

The approach to proving programs correct which was developed and popularized by Floyd is still the basis for most current proofs of correctness. It is generally known as the method of inductive assertions. Let us begin with a simple example of the basic idea.

Consider the flowchart in Figure 1 for exponentiation to a positive integral power by repeated multiplication. For simplicity, assume all values are integers. I have put assertions or specifications for correctness on the input and output of the program. We want to prove that if X and Y are inputs with Y > 0, then the output Z will satisfy Z = X^Y. This assertion at the output is the specification for correctness of the program. The assertion at the input defines the input conditions (if any) for which the program is to produce output satisfying the output assertion. Note that the proof will use symbolic techniques to establish that the program is correct for all allowable inputs.

Figure 1-Exponentiation program

The proof technique works as follows: Somewhere within each loop we must add an assertion that adequately characterizes an invariant property of the loop. This has been done for the single loop flowchart of Figure 1. It is now possible to break this flowchart into tree-like sections such that each section begins and ends with assertions and no section contains a loop. This is shown in Figure 2 if one disregards the dashed-line boxes. We want to show that if execution of a section begins in a state with the assertion at its head true, then when the execution leaves that section, the assertion at the exit must also be true.
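The idea can be made concrete by writing the exponentiation program in a present-day language with its assertions attached. The sketch below is not from the paper; it assumes, consistent with the verification condition quoted with Figure 2, that the flowchart initializes Z to X and I to 1, multiplies Z by X each time around the loop, and exits when I = Y. Executing the asserts only checks particular runs; the proof method argues from the same three assertions symbolically, for all allowable inputs.

    # A sketch (not from the paper) of the Figure 1 exponentiation program
    # in Python, with the input assertion, the added loop invariant, and the
    # output assertion written as assert statements.

    def power(x, y):
        assert y > 0                      # input assertion
        z = x
        i = 1
        while i != y:
            assert z == x ** i            # loop invariant (the added assertion)
            z = z * x
            i = i + 1
        assert z == x ** y                # output assertion: Z = X^Y
        return z

    print(power(3, 4))                    # -> 81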
By taking an assertion at the end of each of these sections and using the semantics of the program statement above it, one can generate an assertion which should have held before that statement if the assertions after it are to be guaranteed true. Working up the trees one then generates all the assertions in dashed-line boxes in Figure 2. Each section will then preserve truth from its first to its last assertions if the first assertion implies the assertion that was generated in the dashed-line box at the top. One thus gets the logical theorems or verification conditions given below each section. With a little thought it can now be seen that if these theorems can all be proven and if the program halts, then it will halt with the correct output values. In this case the theorems are obviously true. Halting can be proven by other techniques.

Figure 2-Sectioned flowchart (for the loop section, the verification condition is Z = X^I ⊃ [(Y = I ⊃ Z = X^Y) & (Y ≠ I ⊃ Z·X = X^(I+1))])

The careful reader will note that the input assumption Y > 0 is not really needed for the proof of either of these theorems. This is because that assumption is really only needed to prove that the program terminates.

Inherent difficulties

This process for proving the correctness of programs is subject to many variations both to handle programming constructs which do not occur in this example and to try to make the proof of correctness more efficient. Full treatments with many examples are available in a recent survey by Elspas, et al.,3 and in Manna's forthcoming textbook.4 Some further general comments about the nature of the problem will be made here. Analogous comments could be made about most of the other approaches to proving correctness.

Programs can only be said to be correct with respect to formal specifications or input-output assertions. There is no formal way to guarantee that these specifications adequately express what someone really wants the program to do.

Given a program with specifications on the input and output, there is probably no automatic way to generate all the additional assertions which must be added to make the proof work. For a human to add these assertions requires a thorough understanding of the program. The programmer should be able to supply these assertions if he is able to formalize his intuitive understanding of the program.

Given a program with assertions in each loop and given an adequate definition of the semantics of the programming language, it is fairly routine to generate the theorems or verification conditions. Several existing computer programs that do this are described below. The real problem in proving correctness lies in the fact that even for simple programs, the theorems that are generated become quite long. This length makes proving the theorems very difficult for a human or for current automatic theorem provers.

Formalizing the programmer's intuition of correctness

It may not be apparent, but the process of proving correctness is just a formalization into rigorous logical terms of the informal and sometimes sloppy reasoning that a programmer uses in constructing and checking his program.
The programmer has some idea of what he expects to be true at each stage of his program (the assertions), he knows how the programming language semantics will transform a stage (generating the assertions in dashed-line boxes of Figure 2), and he convinces himself that the transformations will give the desired result (the proof of the theorem). In this sense proving program correctness is just a way to put into formal language everything one should understand in reading and informally checking a program for correctness. In fact, there is no clear division between the idea of reading code to check it for correctness and the idea of proving it correct by more rigorous means; the difference is one of degree of formality. One question that should be addressed in this context regards the fact that both the correctness and the halting problems for arbitrary programs are known to be undecidable in the mathematical sense. However, this question of mathematical undecidability should not arise for any program for which there are valid intuitive reasons for the program to be correct. Confidence in correctness I hope I have made the point that logical proof of correctness techniques are radically different from 203 testing techniques which are based on executing the program on selected input data in a specific environment. However, I do not want to imply that in a practical situation a proof or anything else can lead to absolute certitude of correctness. In fact a proof by itself does not necessarily lead to a higher level of confidence than might be achieved by extensive testing of a program. From a practical viewpoint there are a number of things that could still be wrong after a proof if one is not careful: what is proven may not be what one thought was proven, the proof may be incorrect, or assumptions about either the execution environment or the problem domain may not be valid. However, a proof does give a quite different and independent view of program correctness, and if it is done well, it should be able to provide a very high level of confidence in correctness. In particular, to the extent that a proof is valid, there should no longer be any doubt about what might happen after allowable but unexpected input values. MANUAL PROOFS The basic ideas in the last section have been known for some time. This section describes the practical progress which has been made with manual proofs in the last few years. The size of programs which can be proven by hand depends on the level of formality that is used. In 1967 McCarthy and Painter5 manually proved the correctness of a compiler for very elementary arithmetic expressions. I t was a formal proof based on formal definitions of the syntax and semantics of the simple languages involved. Rigorous but informal proofs A more informal approach to proofs is now popular. This approach is rigorous, but uses a level of formality like that in a typical mathematics text. Arguments are based on an intuitive definition of the semantics of the programming language without a complete axiomatization. Using these techniques a variety of realistic, efficient programs to do sorting, merging, and searching have been proven correct. The proof of a twenty line sort program might require about three pages. It would now be a reasonable exercise for advanced graduate students. Proofs of significantly more complex programs have also been published. London6 ,7 has done proofs of a pair of LISP compilers. The larger compiler is about 160 lines of highly recursive code. 
It complies almost the 204 Fall Joint Computer Conference, 1972 full LISP language-enough so it can compile itself. It is a generally unused compiler. It was written for teaching purposes, but it is not just a toy program. Another complex program has been proven correct by Jones. s The program is a PL-1 coding of a slightly simplified version of Earley's recognizer algorithm. It is about 200 lines of code. Probably the largest program that has been proven correct is in the work on computer interval arithmetic by Good and London. 9 There they proved the correctness of over 400 lines of Algol code. The largest individual procedure was in the 150-200 line category. A listing of many other significant programs which have been proven correct can be found in London's recent paper. lO If a complex 200 line program can now be proven correct by one man in a couple of months, one can begin to think about breaking larger programs into modules and getting a proof of correctness within a few man years of effort. Clearly there are programs for which a guarantee of the correctness of the running program would be worth not man years but many man decades of effort. We had better take a closer look at the feasibility of such an undertaking and what the proof of correctness would really accomplish. language program would certainly go a long way toward improving the probability that the program will run according to specifications. Errors in the proof An informal proof of correctness typically is much longer than the program text itself-often five to ten times as long. Thus the proof itself is subject to error just like any other extremely detailed and complex task done by humans. There is the possibility that an informal proof is just as wrong as the program. However, a proof does not have any loops and the meaning of a statement is fixed and not dependent on the current internal state of the computer. To read and check a proof is a straightforward and potentially automatable operation. The same can hardly be said for programs. Despite its potential fallibility, an informal proof would dramatically improve the probability that a program is correct. There is evidence from London's work7 that a proof of correctness will find program bugs that have been overlooked in the code. Less rigorous proofs Environment problems In most existing proofs of program correctness, what has been proven correct is either the algorithm or a high level language representation of the algorithm. With today's computers what happens when the program actually runs on a physical computer would still be anybody's guess. It would be a significant additional chore to verify that the environment for the running program satisfies all the assumptions that were made about it in the proof. Problems with round off errors, overflow, and so forth can be handled in proofs. Good and London, 9 Hoare,11 and others have described techniques for proving properties of programs in the context of computer arithmetic, but this can make the proof much more complex. Furthermore, to assure correctness of the running program one would have to be sure that all assumptions about the semantics of the programming language were actually valid in the implementation. The compiler and other system software would have to be certified. Finally, this could all be for naught considering the possibility of hardware failure as it exists in today's machines. 
Thus, proving the correctness of a source language program is only one aspect of the whole problem of guaranteeing the correctness of a running program. Nevertheless, eliminating all errors from the source A person proving a program correct by manual techniques must first achieve a very thorough understanding of all details of the program. This clearly limits manual proof techniques to programs simple enough to be totally comprehended by the program provers. It also means that clarity and simplicity is very important in the program design if the program is to be proven correct. There is another school of thought which places primary emphasis on techniques for obtaining clarity and structure in the program design. Dijkstral2 •l3 as long been the primary advocate of this approach. By appropriately structuring the program and by using what is apparently a much less formal approach to proofs, Dijkstra claims to have proven the correctness of his THE operating system. l4 Millsl6 advocates a similar approach with the program being sufficiently structured so an informal proof can be as short as the program text itself. I t is probably true that more practical results can be obtained with less rigorous approaches to proofs, especially in the near future. I t is even debatable whether the more rigorous proofs give more assurance of correctness, but the formality does make it more feasible to automate the proof process. Whether or not one feels that the rigorous hand proofs of correctness will have much practical value, they are providing experience with different proof techniques that should Summary of Progress Toward Proving Program Correctness be very valuable in attempting to automate the proof process. AUTOMATING PROOFS OF CORRECTNESS In proving program correctness the logical statement that has to be proven usually is very long; however, the proof is seldom mathematically deep and much of it is likely to be quite simple. In the example given previously the theorems to be proven were almost trivial. I t would seem that some sort of automatic theorem proving should be able to be applied in proving program correctness. This has been tried. So far the results have not been very exciting from a practical viewpoint. Computer-generated proofs Fully automatic theorem provers based on the resolution principle generally can prove correctness for very small programs-not much larger than the exponentiation program above. However, Slagle and N orton16 report that they have obtained fully automatic proofs of the verification conditions for Hoare's sophisticated little program FIND17 which finds the nth largest element of an array. In 1969 King18 completed a program verifier that automatically generated the verification conditions and then used a special theorem prover based on a natural deduction principle to automatically prove them. This system successfully proved programs to do a simple exchange sort, to test whether a number is prime, and similar integer manipulation programs. The data types were limited to integer variables and one dimensional arrays. Others have experimented with other data types and proof procedures. At the time of this writing I believe that there is no automatic theorem prover which has proven correctness for a program significantly larger than those mentioned. Automatic theorem provers still cannot handle the length and complexity of the theorems that result from larger programs. 
Another problem lies in the fact that some semantics of the programming language and additional facts about the application area of the program have to be supplied to the theorem prover as axioms. Automatic theorem provers have difficulty in selecting the right axioms when they are given a large number of them. Even in the minor successes that have been achieved, a somewhat tailor-made set of axioms or rules of inference have been used. 205 Computer-aided proofs There are now several efforts directed toward providing computer assistance for proving correctness. This takes the form of systems to generate verification conditions and to do proof checking, formula simplification and editing, and semiautomatic or interactive theorem proving. Unfortunately at this time almost any automation of the proof process forces one into more detailed formalisms and reduces the size of the program that can be proven. This is because the logical size of the proof steps that can be taken in a partially automated proof system is still quite small. Presumably this is a temporary phenomenon. It seems reasonable to expect that we will soon see computer-aided verification systems which make use of some automatic theorem proving and can be used to prove correctness of programs somewhat larger than those that have been proven by hand. Igarashi, London, and Luckham19 are developing a system for proving programs written in PASCAL. The verification condition generator handles almost all the constructs of that language except for many of the data structures. Their approach is based on the work of Hoare. ll •2o Elspas, Green, Levitt, and Waldinger21 are developing a proof of correctness system based on the problemsolving language QA4. 22 It will use the goal-oriented, heuristic approach to theorem proving which is characteristic of that language. Good and Ragland23 have designed a simple language NUCLEUS with the idea that a verification system and a compiler for the language could be proven correct. Both the verification system and the compiler would be written in NUCLEUS and the proofs of correctness would be based on a formal definition of the language. Theintent is that the language would then be able to be used to obtain other certified system software. These three systems give a general idea of the current work going on. A proof-checking system will be described in the next section. Several other interesting systems have been implemented and basic information about them is readily available in London's recent paper. 10 Long-range outlook Proofs of correctness are currently far behind testing techniques in terms of the size and complexity of the programs that can be handled adequately. It is very much an open question whether automated proof techniques will ever be feasible as a commonly used alter- 206 Fall Joint Computer Conference, 1972 native to testing a program. Many arguments pro and con are too subjective for adequate consideration here; however, a few comments are in order before one uses the rate of progress in the past as a basis for extrapolating into the future. Proofs are based on sophisticated symbolic manipulations, and we are still at an early stage of gathering information about ways to automate them. Existing proof systems have been aimed mostly at testing the feasibility of techniques. Few if any have involved more than a couple man years of effort-many have been conceived on a scale appropriate for a Ph.D. dissertation. 
If and when a cost-effective system for proving correctness becomes feasible, it will certainly require a much larger implementation effort. Proofs may be practical only in cases where a very high level of confidence is desired in specified aspects of program behavior. With computer-aided proofs one could hope to eliminate most of the sources of error that might remain after a manual proof. As exemplified by the work of Good and Ragland,23 the verification system itself as well as compilers and other system software should be able to be certified. If the basic hardware/software is implemented with a system such as LOGOS24 for computer-aided design of computer systems, then there should be a reasonable guarantee that the implemented computer system meets design specifications. With sufficient error-checking and redundancy, it should thus be possible to virtually eliminate the danger of either design or hardware malfunction errors. By the end of this decade these techniques may make it possible to obtain virtual certitude about a program's behavior in a running environment. There are many applications in areas such as real-time control, financial transactions, and computer privacy for which one would like to be able to achieve such a level of confidence. SOME THEORETICAL FRONTIERS Proofs of program correctness involve one in a seemingly exorbitant amount of formalism and detail. Some of this is inherent in the nature of the problem and will have to be handled by automation; however, the formalisms themselves often seem awkward. The long formulas and excessive detail may result partially because we have not yet found the best techniques and notation. Active theoretical research is developing many new techniques that could be used in proving correctness. Research in this area, usually called the mathematical theory of computation, has been active since McCarthy's25.26 early papers on the subject. I feel that practical applications for proofs of correctness will develop slowly unless new techniques for proving correctness can significantly reduce the awkwardness of the formalisms required. This section will describe some of the current ideas being investigated. The topics chosen are those which seemed more directly related to techniques for facilitating proofs of correctness. Induction techniques for loops and recursion Proving correctness of programs would be comparatively simple if programs had no loops or recursion. However, some form of iteration or recursion is central to programming, and techniques for dealing with it effectively in proofs have been a subject of intensive study. All the techniques use some form of induction either explicitly or implicitly. The method of inductive assertions described previously handles loops in flowcharts by the addition of enough extra assertions to break every loop and then appeals to induction on the number of commands executed. For theoretical purposes it is often easier and more general to work with recursively defined functions rather than flowcharts. Almost ten years ago McCarthy proposed what he called Recursion Induction26 for this situation. Manna et al. have extended the inductive assertion method to cover recursive, 27 parallel,28 and non-deterministic29 programs. Several other induction principles have been proposed by Burstall,30 Park,3l Morris, 32 and Scott. 33 A development and comparison of the various induction principles has been done recently by Manna, Ness, and Vuilleman. 
34 Formalizing the semantics of programming languages The process of constructing the verification conditions or logical formulation of correctness is dependent on the meaning or semantics of the programming language. One can also take the opposite approach-proving correctness is a formal way of knowing whether a higher level meaning is true of the program. Thus the meaning or semantics of any program in a language is implicitly defined by a formal standard for deciding \vhether the program satisfies a specification. There is a very close interrelation between techniques for formalizing the semantics of a programming language and proofs of program correctness. Floyd's early work on assigning meanings to programsl has been developed especially by Manna2 and Ashcroft. 35 Bursta1l36 gives an alternative way to formulate program semantics in first-order logic. Ashcroft37 has recently summarized this work and described its relevance. Summary of Progress Toward Proving Program Correctness Hoare,1l·2o Igarashi,38 de Bakker,39 and others have worked to develop axiomatic characterizations of the semantics of particular programming languages and constructs. The Vienna Definition Language40 uses an abstract machine approach to defining semantics, and Allen41 describes a way of obtaining an axiomatic definition from an abstract machine definition. The axiomatic definition is generally more useful in proofs. Scott and Strachey have developed another approach to defining semantics42 which is described below. Work on defining the semantics of programming languages is very active with many different approaches being tried. Those described above are only the ones more closely related to proofs. If any of these ideas can greatly simplify the expression or manipulation of properties of programs, they should have a similar simplifying impact on proofs of correctness. Formal notation for specifications Formal correctness only has meaning with respect to an independent, formal specification of what the program is supposed to do. For some programs such specifications can be given fairly easily. For example, consider a routine SORT which takes a vector X of arbitrary length n as an argument and produces a vector Y as its result. With appropriate conventions, the desired ordering on Y is specified by: (Vi,j)[l~i
Figure 5-Associative array map example (256-bit array words holding a sales record, a project record, an empty array word, a deleted record, and a two-section personnel record; each record format carries an activity bit and a file descriptor tag, and multi-word records also carry a section identifier tag and a parent record identifier, here the employee name)

the PE having the lowest physical address in the array (or arrays). The new record, with its activity field set to a "1," is written into this first empty location. The hardware pointer then moves to the next available empty memory location for writing another record if a batch of new entries must be loaded. If no empty locations are found the program will exit to whatever routine the programmer has chosen for handling this type of error; for example, if appropriate to a specific application, the program may select an age test of all records in a particular file, purging the oldest to make room for the newest. A record once located may be deleted from a file by merely setting the activity bit to a "0."

When a specific file is to be processed in some manner, the scattered locations containing the file's records are activated by performing EQC's on both the activity field and an n-bit "file descriptor" tag field. If, as in the example of Figure 5, the file descriptor field is two bits long, the entire selected file will be ready for processing in less than 2 microseconds (< 1 μs for the activity bit search, < 1 μs for the file descriptor field search).

Where record lengths are greater than the 256-bit length of the associative array word, several noncontiguous associative array words may be used to store the single record in sections, one section per array word. The format for each record section must contain the same activity and file descriptor fields as are used in all record formats, and in addition it must contain a parent record identifier and an n-bit "section identifier" tag field. The scattered locations containing the desired section of all records in the specific file may be activated by performing EQC's on the activity, file descriptor, and section identifier fields. All three searches can be completed in approximately 2 or 3 microseconds.

These two or three tag search operations in the AP STARAN permit random placement of records in the physical file and eliminate the bookkeeping associated with file structuring and control required in conventional systems. The same approach is used for files which exceed the capacity of the associative arrays; the records of such files are stored in a similar manner on external mass storage devices and are paged into the arrays as required.

The strategy used to allocate array storage space can have a significant effect on program execution time. An example is shown in Figure 6 where the products of three operand pairs are required. In A, the operands are stored in a single array word. For 20-bit fixed point operands the three MPF instructions would execute in a total of 1175 microseconds. All similar data sets stored in other array words would be processed during the same instruction execution.
However, an alternative storage scheme (B) which utilizes three PE's per data set requires only one MPF execution to produce the three products in 392 microseconds. If one thousand data sets were involved in 235 each case the average multiply times per product would be 392 and 131 nanoseconds, respectively, but at the expense, in B, of using 3000 processing elements. Unused bits in B may be assigned to other functions· A last example of how array storage allocation can affect program execution time is shown in Figure 7 where the columns represent fields. Here the sum el, of 16 numbers is required. If the 16 numbers are directly or as a result of a previous computation stored in the same field of 16 physically contiguous array words, the near-neighbor relationships between the processing elements can be used to reduce the number of ADF executions to four. All similar 16 number sets would be processed at the same time. STARAN APPLICATIONS While many papers have appeared (see Minker4 for a comprehensive bibliography) which discuss the application of AM's and AP's in information retrieval, PROBLEM: 0i , bi , ci ,di ,ei ,fj ARE 20 BIT OPERANDS. FORM PRODUCTS ojbj, cjdj , ejfj FOR n DATA SETS FILE METHOD A - ALLOCATE ONE ARRAY WORD (PROCESSING ELEMENT) PER DATA SET SET IDENT PROGRAM A-i. MPF A, B, G 2. MPF C, D, H "3. MPF E, F, J I n sets processed in 1175.A(s (fixed point) METHOD B - ALLOCATE THREE ARRAY WORDS (PROCESSING ELEMENTS) PER DATA SET FIELD NAME - A B C °i Cj bi OJ bj j 01 1 dj ci di j 01 t ej fj ej fj i 01 t I / 1 III ? ! an bn On bn cn dn en fn n 011 cn dn n 011 en fn n 011 PROGRAM B - MPF A, B, C 1 n sets processed in 392 .A(s (fixed point) Figure 6-Effect of array memory allocation on execution time 236 Fall Joint Computer Conference, 1972 A ir traffic control -- ] ~1---~- ~ JJT~ ~~ JT~;:r 0e '< be °9 010 °11 °12 013 16 °14 Lai 1 °15 °16 NUMBER OF OPERATIONS IS -fn2N = ..In216 = 4 Figure 7-Tree-sum example text editing, matrix computations, management information systems and sensor data processing systems, there are none yet published which describe actual results with operating AP equipment in any application. (But see Stillman: for a recent AM application result.) Recent actual applications of the AP have been in real time sensor related surveillance and control systems. These initial applications share several common characteristics: 1. a highly active data base; 2. operations upon the data base involve multiple key searches in complex combinations of equal, greater, between-limits, etc., operations; 3. identical processing algorithms may be performed on sets of records which satisfy a complex search criterion; 4. one or more streams of input data must be processed in real time; and 5. there is a requirement for real time data output in accordance with individual selection criteria for multiple output devices. An example of an actual AP application in an air traffic control environment is shown in Figure 8. In this application a two array (512 processing elements) STARAN 8-500 model was interfaced via leased telephone lines with the output of the FAA ARSR long range radar at Suitland, Maryland. Digitized radar and beacon reports for all air traffic within a 55 mile radius of Philadelphia were transmitted to STARAN in real time. An FAA air traffic controller's display of the type used in the new ARTS-III terminal ATC system and a Metrolab Digitalk-400 digital voice generator were interfaced with STARAN to provide real-time data output. 
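As a concrete rendering of the tree sum of Figure 7, the sketch below (not from the paper) simulates the pairwise additions sequentially. On the associative processor each pass of the outer loop would correspond to a single ADF execution performed by all processing elements at once, so the 16 operands are summed in log2 16 = 4 such executions.

    # A sketch (not from the paper) of the log2(N) tree sum of Figure 7,
    # simulated sequentially in Python.

    def tree_sum(values):
        a = list(values)              # one element per processing element
        n = len(a)                    # assumed to be a power of two (16 here)
        step = 1
        while step < n:
            # One simulated ADF execution: each participating PE adds in the
            # value held by its neighbor 'step' array words away.
            for i in range(0, n, 2 * step):
                a[i] += a[i + step]
            step *= 2
        return a[0]

    print(tree_sum(range(1, 17)))     # -> 136, after log2(16) = 4 passes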
The controller's keyboard was used to enter commands, call up various control programs and select display options. Although a conventional computer is not shown explicitly in Figure 8 the sequentially oriented portions of the overall data processing load were programmed for and executed in the STARAN sequential controller as shown in Figure 9. Sequential and associative programs and instruction counts for STARAN are shown in Table II. In a larger system involving multiple sensors and displays, and more ATC functions such as metering and spacing, flight plan processing, and digital communications, the sequential and parallel workloads would increase to the point where a separate conventional computer system interfaced with the AP would be required. The STARAN system was sized to process 400 tracks. Since the instantaneous airborne count in the 55 mile radius of Philadelphia was not expected to exceed 144 aircraft, a simulation program was developed to simultaneously generate 256 simulated ARSR RADAR TELEPHONE SUITLAND, MARYLAND LINES .. r STARAN S- 500 l FAA A portion of the processing inherent in these applications is parallel-oriented and well suited to the array processing capability of the AP. On the other hand these same applications also involve a significant amount of sequentially-oriented computation which would be inefficient to perform upon any array processor, a simple example being coordinate conversion of serially occurring sensor reports. DISPLAY BEACON • RADAR • CONFLICT • DETECT ION RESOLUTION AVOI DANCE TERRAIN AUTOMAT IC vorCE ADVISORY DIGITAL 1 TRACK I NG • CONFLICT • • MONITOR TRACKI NG • DISPLAY PROCESSING VFR VOICE GENERATOR Figure 8-Air traffic control application STARAN 237 DATA PATH CONTROL PATH EXECUTIVE ---...,I I I M~E1_""'...I.:_--.j_.... SC TELETYPE SC''''-f---t ---+---1. . CONTROLLER TAPE READER/PUNCH ON -LINE SC DEBUG AND UTILITY PACKAGE DATA RECEIVER ASSEMBLY KEYBOARD INTERRUPT HANDLER , __________ .J r---I • EXTERNAL DEVICE ARTS m KEYBOARD TARGET SIMULATION ROUTINE .. .....-----....- AVA I SC I I AUTOMATIC VOICE ADVISORY DRIVER I I I I I ASSOCIATIVE PROCESSOR L _______ .., AP I I I AP SC LIVE DATA INTERRUPT HANDLER I I I I SEQUENTIAL CONTROL PROCESSOR CLOCK INTERRUPT SC DATA LINE SC I L ________ ., I I I : 1- - - - - I I AP CONFLICT PREDICTION AP CONFLICT RESOLUTION AVA MESSAGE SELECTOR AP I I L_________'t. ______ ____ :!t __ __________ ~ _________ ...J Figure 9-ATC program organization aircraft tracks. Display options permitted display of mixed live and simulated aircraft. The 400 aircraft capacity is representative of the density expected as North-South traffic loads increase through the late '70s. Conflict prediction and resolution programs based upon computed track data were demonstrated and used to display conflict warning options. Automatic voice services were provided for operator-designated aircraft, thus simulating warning advisories for VFR pilots requesting the service. The voice messages, which in an operational system would be automatically radioed to the pilot, were generated by the Metrolab unit from digital formats produced by the associative processor and broadcast in the demonstration area via a public address system. A· typical message would be read out in voice as, "ABLE BAKER CHARLIE, FAST TRAFFIC SEVEN O'CLOCK, 4 MILES, ALTITUDE 123 HUNDRED, NORTHEAST BOUND". Top level flow charts for four of the associative programs used in the demonstration are shown in Figures 10, 11, 12, and 13. 
A detailed report is in preparation describing all of the ATC programs used in this demonstration, but some comments on the four flow charts shown may be of interest. Live target tracking (Figure 10) is performed in two dimensions (mode C altitude data was not available) using both radar· and beacon target reports to track all aircraft. Incoming reports are correlated against the entire track file using five correlation box sizes, three of which vary in size with range. Any incoming report which does not correlate with an existing track is used to automatically initiate a new tentative track. An aircraft track must correlate on two successive scans and have a velocity exceeding 21 knots to qualify as an established track and must correlate on three successive scans to achieve a track firmness level high enough to be displayed to a controller as a live target. 238 Fall Joint Computer Conference, 1972 TABLE II-STARAN Air Traffic Control Programs SEQUENTIAL PROGRAMS NAME Executive , ASSOCIATIVE PROORAMS INSTR COUNT Keyboard Inte rrupt Real Time Interrupt Live Data Input Automatic Voice Output > 1600 . INSTR COUNT Tracking System 881 Track Simulation System 415 Turn Detection 88 Conf 1 ict Pred ic t ion 488 Conflict Resolution 296 Automat ic Voice Advisory 709 Display Process ing Total Field Definition Statements Included ret Operating Instructions 1"6"00 There are provisions for 15 levels of track firmness including 7 "coast" levels. If a report correlates with more than one track, special processing (second pass resolve) resolves the ambiguity. Correlated new reports in all tracks are used for position and velocity smoothing once per scan via an alpha-beta tracking filter where for each track one of nine sets of alpha-beta values is selected as a function of track history and the correlation box size required for the latest report correlation. If both beacon and radar reports correlate with a track, the radar report is used for position updating. Smoothed velocity and position values are used to predict the position of the aircraft for the next scan of the radar and for the look-ahead period involved in conflict prediction. Track simulation processing (Figure 11) produces 256 tracks in three dimensions with up to four programmable legs for each track. Each leg can be. of 0 to 5 minute duration and have a turn rate, acceleration, or altitude rate change. A leg change can be forced by the conflict resolution program to simulate pilot response to a ground controller's collision avoidance maneuver command. Targets may have velocities between 0-600 knots, altitudes between 100-52,000 feet, and altitude rates between 0-3000 feet per minute. The conflict prediction program sequentially selects Net Ope rat ing Instructions 1140 4017 514 3493 up to 100 operator-designated "controlled" or "AVA" aircraft, called reference tracks in Figure 12, and compares the future position of each during the lookahead period with the future positions of all live and simulated aircraft and also to the static position of all terrain obstacles. Any detected conflicts cause conflict tags in the track word format to be set, making the tracks available for conflict display processing. A turn detection program not shown opens up the heading uncertainty for turning tracks. 
Display processing (Figure 13) is a complex associative program which provides a variety of manage-by-exception display options and automatically moves operator-assigned alphanumeric identification display data blocks associated with displayed aircraft so as to prevent overlap of data blocks for aircraft in close proximity to one another on the display screen. Sector control, hand off, and quick-look processing is provided.

All programs listed in Table II were successfully demonstrated at three different locations in three successive weeks, using live radar data from the Suitland radar at each location. The associative programs were operated directly out of the bulk core and page 0 portions of control memory since there was no requirement, in view of the low 400 aircraft density involved, for the higher speed instruction accesses available from the page memories.

[Figure 10-Live target tracking: correlate one report against all tracks, second pass resolve (once per ambiguous track), smooth track position and velocity (once per 10 sec.), predict tracks' next reporting positions (once per 5 sec.), and update track firmness (once per 5 sec.).]

[Figure 11-Tracking simulation: calculate the new velocity components, modified by the turn rate or acceleration, where tn is the time left in the leg, β the turning rate, and V̇ the acceleration rate.]

At intervals during the demonstration all programs were demonstrated at a speed-up of 20 times real time with the exception of the live data and AVA programs which, being real-time, cannot be speeded up. Timing data for the individual program segments will be available in the final report. The entire program executed in less than 200 milliseconds per 2 second radar sector scan, or in less than 10 percent of real time. All programming effort was completed in 4½ months with approximately 3 man-years of effort.

This was the first and, as of this writing, the only actual demonstration of a production associative processor in a live signal environment known to the author. It was completed in June, 1972. Other actual applications currently in the programming process at Goodyear involve sonar, electronic warfare and large scale data management systems. These will be reported as results are achieved.
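The conflict prediction pass described earlier can also be pictured with a short sketch: each operator-designated reference track is extrapolated over the look-ahead period and compared against every other track, and any pair that comes too close is flagged. The Python fragment below is a hypothetical serial rendering for illustration only; the associative processor performs the comparison against all tracks in parallel, and the thresholds, step size and track layout are assumptions.

    from math import hypot

    def predict(track, t):
        """Straight-line extrapolation of a track dict {'x','y','vx','vy'} by t seconds."""
        return track['x'] + track['vx'] * t, track['y'] + track['vy'] * t

    def conflicts(reference_tracks, all_tracks, lookahead=120, step=5, min_sep=3.0):
        """Return (reference, other) index pairs predicted to violate min_sep
        (same distance units as the track coordinates) within the look-ahead period."""
        flagged = []
        for i, ref in enumerate(reference_tracks):
            for j, other in enumerate(all_tracks):
                if ref is other:
                    continue
                for t in range(0, lookahead + 1, step):
                    rx, ry = predict(ref, t)
                    ox, oy = predict(other, t)
                    if hypot(rx - ox, ry - oy) < min_sep:
                        flagged.append((i, j))   # set the conflict tag for display processing
                        break
        return flagged

    # Example with two assumed tracks closing on each other (units arbitrary).
    ref = [{'x': 0.0, 'y': 0.0, 'vx': 0.1, 'vy': 0.0}]
    others = ref + [{'x': 10.0, 'y': 0.0, 'vx': -0.1, 'vy': 0.0}]
    print(conflicts(ref, others))   # -> [(0, 1)]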
COST EFFECTIVENESS

Associative processor cost effectiveness can be expressed in elementary terms as shown in Figure 14, where performance is shown in terms of millions of instructions per second for the ADF and EQC instructions using two different operand lengths, and cost effectiveness is measured in terms of instructions per second per hardware dollar. This form of presentation was taken from Bell. Another cost effectiveness measure is to compare projected hardware and software costs of an associative configuration and an all-conventional design for the same new system requirements, where the associative configuration may include a conventional computer. Only a few attempts at this approach have been made to date and none have been confirmed through experience. One classified example, using a customer defined cost effectiveness formula, yielded a total system cost effectiveness ratio of 1.6 in favor of the associative configuration.

Of the two methods, the first is least useful because there is no way of estimating from these data how much of the associative computing capability can be used in an actual application. The second method is more meaningful but is exceedingly expensive to use since it implies a significant engineering effort to derive processing algorithms, system flow charts, instruction counts, and timing estimates for both the conventional and the associative approach. The weakest element in this approach lies in the conventional approach software estimate, which historically has been subject to overruns of major dimensions.

[Figure 14-Performance and cost effectiveness of the EQC and ADF instructions at 16-bit and 32-bit operand lengths.]

[Figure 5-Percent of total machine availability, by month, 1971-1972.]

Computer System Maintainability

TABLE I-Current Scheduled Maintenance (network hub: Monday through Friday, 4:00-8:00; CDC 6600: 4:00-8:00, taken on alternate Mondays; three CDC 7600's: 4:30-8:00, any two may be taken, but not all three at the same time)

Availability is maximized by conducting scheduled maintenance at a time least visible to the user (0400-0800 hours) and by selecting subsets of components to be down concurrently. For comparison, the scheduled or preventive maintenance (PM) and unscheduled or emergency maintenance (EM) history for the CDC 7600 R (serial No. 1) and CDC 7600 S (serial No. 2) host computers is illustrated in Figure 4. These maintenance actions required the host computers to be off-line and therefore completely unavailable to the user. Figure 5 shows the average total percentage availability for the CDC 7600 R and S host computers, the CDC 6600 L (serial No. 13) and CDC 6600 M (serial No. 31) host computers, and the total percentage availability for the PDP-10 hub or control computer. The percentages are arrived at as follows:

    Hours in Month - (PM + EM)
    --------------------------
         Hours in Month

EVALUATION OF DIAGNOSTIC MAINTENANCE TOOLS AND PROCEDURES

The diagnostic maintenance tools do provide for rapid, positive identification and isolation when the component or device failure is solid. However, experience has indicated that most failures tend to be intermittent in nature and difficult to identify and isolate. Even though great amounts of time and money can be spent attempting to isolate intermittent failures, it has not been demonstrated at LLL that intermittent failures become significantly less intermittent when extensive off-line diagnostic periods are used. For this reason, it is concluded that it is better to catalog an intermittent error for administrative analysis, recover as softly as possible, and promptly return the device or component to full productivity rather than insist on the immediate off-line isolation of the problem.

Samplings (Figure 6) of system availability taken hourly Monday through Friday from 0800-1630 hours from November 1970 through April 1972 demonstrate the following:

    Total system availability (all devices on line)                          75 percent*
    Partial system availability (useful work being accomplished
    by at least one host)                                                   100 percent*

* Power failures affecting the entire network are not included in these figures.

[Figure 6-Host computers on-line during sampling period: 7600 R, 7600 S, 6600 L, 6600 M, PDP-10, file transport network, IBM Data Cell, IBM Photostore, General Precision disk, and the 400, 200 and 600 series PDP-8 TTY networks.]
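The availability percentage defined above is straightforward to compute. The short Python fragment below simply applies the formula to one month of illustrative (not measured) maintenance figures.

    def percent_availability(hours_in_month, pm_hours, em_hours):
        """Availability = (hours in month - (PM + EM)) / hours in month, as a percentage."""
        return 100.0 * (hours_in_month - (pm_hours + em_hours)) / hours_in_month

    # Hypothetical month: 720 hours, with 16 hours of scheduled (PM) and
    # 9 hours of unscheduled (EM) maintenance on one host computer.
    print(round(percent_availability(720, 16, 9), 1))   # -> 96.5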
ACKNOWLEDGMENTS

The authors express their appreciation to LLL's Donald L. von Buskirk, Richard E. Potter, Robert G. Werner, and Pete Nickolas for their contributions to this paper.

REFERENCES

1 D L PEHRSON  An engineering view of the LRL Octopus computer network  Lawrence Livermore Laboratory Rept UCID-51754 1970
2 Livermore time-sharing system manual M-026  Lawrence Livermore Laboratory 1972
3 K H PRYOR et al  Status of major hardware additions to Octopus  Lawrence Livermore Laboratory Rept UCID-30036 1972

APPENDIX

Commands available include:

Printer/Punch
    P ALL P1: Send all printer output to on-line printer No. 1. Printer 2 can now go down for maintenance.
    P ALL P2: Send all printer output to on-line printer No. 2. Printer 1 can now go down for maintenance.
    P ALL HSP: Send all printer output to tape for off-line processing. Both on-line printers can now go down for maintenance.
    P KILL P1 (or P2 or PUNCH): Aborts processing of current printer or punch files on indicated device.
    P2 HSP: Send all printer No. 2 output to tape for off-line processing. Printer 2 can now go down for maintenance.
    P NORMAL: Restores operating status of printer output devices.

Disk
    D N MMM: N is a disk unit number 1 through 4. MMM is either "IN" or "OUT." This will allow or prevent, respectively, the creation of new files on disk N. If "OUT," existing files remain accessible and disk N can be scheduled for maintenance.
    DP N MMM: As described for the D option, but will also purge disk N of all existing files. All files on disk N are destroyed and are no longer accessible.

Drum
    P DRUM DOWN: Transfers system tables from the drum to memory and rewrites these tables to a disk file. All subsequent attempts to access the drum will be redirected to the disk. This provides backup capability for the drum and allows the drum to be taken down for maintenance.
    P DRUM UP: Restores normal operating status of the drum. System tables that have been stored on disk during the drum down period are transferred from disk to memory and rewritten to the drum.

Tape
    C N: Tape unit N is physically not available to the system.
    F N: Tape unit N is physically available to the system.
    E P: A tape error status is returned to program P.
    X N: Severs logical connection between tape unit N and problem program.

Execution
    SP NNNNNN IN TEXT: Allows only privileged user number NNNNNN access to the system. All users programs are saved on disk, and all users remote TTY stations are logged out. The TEXT is sent to all users attempting to log in.
    SP NNNNNN OUT: Removes privileged user number NNNNNN and automatically restarts previously running programs and restores normal time-sharing.
    S TEXT: Prohibits any additional log in. TEXT is sent to remote TTY stations.
    R: Restores normal time-sharing.

Broadcasts
    I TEXT: Sends the TEXT to all logged in remote TTY stations.
    I STORE TEXT: Sends the TEXT once to all remote TTY's at log-in time. Sends TEXT to TMDS.
    I ERASE: Erases the I STORE TEXT.
    I BROAD TEXT: Broadcasts TEXT once to all remote TTY stations and sends TEXT to TMDS.

The retryable processor

by GEORGE H. MAESTRI
Honeywell Information Systems Inc.
Phoenix, Arizona

INTRODUCTION

In the interest of improving readability, instruction retry is presented generically. Technical terms unique to the 6000 are avoided.
The intermittent failure problem

A study performed by the U.S.A.F.¹ showed that 80 percent of the electronic failures in computers are intermittent. A second study performed by IBM² stated that "intermittents comprised over 90 percent of the field failures."

A hard failure is the result of a logic signal either remaining permanently in a one or a zero state or of a signal consistently switching to an improper state (such as an AND gate behaving as an OR). In the case of an intermittent failure, identical instructions executed in different sequences or at different times will not fail consistently.

Test and Diagnostic (T&D) programs are designed to diagnose solid failures and are successful at accomplishing their design objectives. They begin by certifying a basic core of processor functions and then read the T&D executive into memory to commence comprehensive testing. No function is used until it is tested. A problem with this approach is that intermittent failures can occur on functions that have been previously certified, completely destroying the rationale of the program. The second, and most likely problem, is that T&D programs seldom detect intermittent failures. To trigger an intermittent failure, the T&D must execute a particular sequence of instructions in an exact order, using the proper data patterns. Also, intermittents are often triggered by stimulus that is beyond the control of programs; thermal variations, mechanical vibration and power fluctuations are examples. Sequence sensitive intermittents can be caused by the following: a low noise threshold in an IC, crosstalk, slow rise or fall times of logic signals, or extra slow or fast gates that activate a normally quiescent race condition.

Alternatives for solution

The ideal method of diagnosing an intermittent failure is to bypass test programs and to diagnose directly from the symptoms of the original failure. The only reason that this method is not in common use is that the set of failure symptoms that are available to programs is inadequate for that purpose. First, it is necessary to know which bits are in error and whether they failed to switch from a one to a zero state or vice versa. Second, all control points should be visible to the diagnostic for all cycles of the failed instruction. A scratchpad memory or snapshot register could save the state of control points and data in case an error is detected. In the case of intermittent failures, the problems associated with using a failing processor to diagnose its own ills will be minimal. Also, a minicomputer or a second processor could do the data processing necessary for diagnosis.

If it is not possible to diagnose from the symptoms of the original failure, it will be necessary to run a T&D program to gather additional information about the failure. However, T&D programs are ineffective against intermittent failures because they are usually unsuccessful in detecting them. What is required then is a method of making an intermittent failure solid. Stress testing is often effective in changing an intermittent failure to a solid failure. Stress testing involves setting marginal voltage and timing conditions to amplify the effects of slow rise times, low switching thresholds and race conditions; thermal stress is also applied for the same reasons.
Mechanical vibration is applied while the T&D is in execution to locate loose wirewraps, defective connectors, microphonic chips or substrates, conductive debris that is caught between wirewrap pins, and defective printed circuit runs.

Error Detection And Correction (EDAC) codes are particularly effective for correcting memory parity errors, which are inherently not recoverable by instruction retry techniques. Algorithms have been developed to allow single or multiple bit failures to be corrected. Some EDAC codes can traverse adders to correct addition errors.

Advantages of instruction retry over other alternatives

There is no reason that an immediate branch to a diagnostic program must exclude an instruction retry. The detection of an error can cause an immediate branch to a diagnostic program that will log and analyze all available symptoms. Failure analysis could result in the generation of a list of all logic elements whose failure could result in the symptoms that were recorded. The boundary between suspect and nonsuspect logic will be called "limits." Once limits are established, they can be analyzed to determine if the failure has been sufficiently isolated to enable a repair. If they have not, the processor can be restored to the state that existed prior to the failure and the instruction can be retried. If the retry attempt is successful, the processor will remain available to the customer until the next failure. Subsequent failures will serve to further narrow the limits by contributing new symptoms.

Stress testing requires that the processor be dedicated to T&D, which means that the processor will not be available to the user. Instruction retry will keep the processor available to the user as long as it is successful; maintenance can be performed during slack time. Also, thermal and mechanical stress can inflict new damage.

While EDAC is an effective means of correcting memory parity errors, it is incapable of correcting control point errors in the processor. If a word of data and its correction code fail to traverse a switch, because of a control point error, both the correction code and the data will be lost. Since instruction retry is conversely ineffective against memory parity errors and most effective against control point errors, EDAC and instruction retry will complement each other.

Obstacles to retry

A prerequisite to a successful instruction retry is that memory locations and registers associated with the faulted instruction must contain the same data that they did before the instruction was started. If a register or memory location was changed during the execution of the instruction, it must be restored before retry can be attempted. Restoration will not be possible if the contents of a memory location is added to and replaces the contents of a register and an image of the original register contents is not available.

Memory parity errors are detected after an error has invalidated the contents of a memory location. Unless memories are duplicated or EDAC is present, memory parity errors cannot be retried.

The instruction repertoire of some processors includes instructions that cause the memory controller to perform a destructive read of a memory location.
If an error occurs on an instruction that caused a destructive read, it will be necessary to restore the cleared memory location before retry can be attempted. A MOVE is an instruction that moves a block of data from one area of memory to another. If a MOVE overlays part of its source data, instruction retry will not be possible. For example: if 100 words are moved from location 70 to location 0, words 70 to 100 of the source data can be overlaid. Indirect addressing offers the programmer the capability to address operands via a string of indirect words that are often automatically updated every . time they are accessed. If a .faulted instruction has obtained its operand via such a string, every indirect word in the string must be restored prior to retrying the instruction. If an error occurred during an indirect word cycle, then only the indirect words preceding the failure must be restored. In processors with hard-wired control logic, the multicycle instructions repeatedly change the contents of registers as fast as data can be cycled through the adder. Delaying every cycle for error checking is often an unacceptable degradation of throughput. Consequently, a register could be overlaid with erroneous data before the error can be detected. Instruction overlap is a feature of large scale processors that complicates identifying the failing instruc- , tion. Instruction overlap takes advantage of the fact that no single instruction uses all of the processor logic at any given instant. While one instruction is The Retryable Processor terminating, a second instruction may be using the adder, while a third is undergoing an address preparation cycle and a fourth is being fetched from memory. If instruction overlap is active, the instruction counter may be pointing to the instruction being fetched from memory at the time an error is detected on the instruction that is in the address preparation sequence. Thus, merely safestoring the instruction counter at the time of failure will not serve to identify the failing instruction. Design methodology to avoid obstacles The destruction of data can be avoided for single cycle instructions by not overlaying the register in the first place. The adder sum can be buffered or merely retained on the data lines until error checking has finished. If an error is detected, the instruction can be aborted before the defective data is moved into a register. EDAC can enable the recovery of memory parity errors. At the time of a fault, the state of the cycle control flags and address register could be saved in a snapshot register. The contents of the snapshot register could identify the failing cycle of a MOVE so that software could continue moving the block of data in place of the interrupted MOVE. This will be effective in recovering an error on a MOVE that has overlaid part of its source data. The snapshot register can also serve as a diagnostic aid by saving the state of cycle control flags at the time of an error. One method of restoring a string of indirect words is to obtain a pointer to the first indirect word from the address field of the instruction. Since indirect word updates are performed by a fixed and known algorithm, it will be a simple matter to restore the first word of the string to obtain a pointer to the second and then follow the string; restoring each word to obtain a pointer to the next. However, several pitfalls· exist in the above method. 
One is that if the error occurred on an indirect word cycle, the recovery program must know when to terminate its indirect word restoration activity. Otherwise, it may restore indirect words that have not been updated, thus inducing an error. Also, the recovery program must be able to determine if the indirect word being restored has been damaged by a parity error. Another problem is the possibility that an indirect string may wrap back on itself, causing a word to be updated twice. If the recovery program merely follows the string, without knowledge of the 275 double update, it will fail to restore part of the string and will induce an error when the instruction is retried. An alternative that would also allow software to restore indirect words without inducing errors, is to provide a scratchpad memory to save the state of sequence control flags and memory addresses for every cycle of an instruction. If an indirect word string wraps back on itself, causing an indirect word to be updated twice, it would present no problem to the recovery program; the snapshot register will contain two entires for that word, and it will be rolled back twice. Providing intermediate registers will serve to both increase processor speed and to protect the contents of primary registers in case of error. The secondary registers can be placed at the inputs to the adder to hold the operand from memory and the operand from a primary register. The secondary registers will also serve to decrease the execution time of multicyc1e instructions, by providing a shorter path to the adder. Intermediate registers will allow date manipulation to be performed for multiplies, divides, etc., without changing the contents of the primary registers. The sum, product, quotient, etc., can be moved into a primary register after error checking is complete. Another alternative would be to save an image of the registers every time that an instruction comes to a normal (error-free) termination. Instruction retry could be accomplished by refreshing the - primary registers with data from the backup registers. If the processor has instruction overlap capability, it will be necessary to correct the instruction counter when a fault is detected. Otherwise, it may not be possible to determine which of several instructions, that are simultaneously in execution, failed. Another possibility is to provide an instruction counter for each of the instructions that can be simultaneously executed. The instruction counter assigned to the failing cycle can be selectively safestored. A third possibility is to include a failure flag in the scratchpad memory to identify the failing cycle. If the failing cycle is identified in the scratchpad, the instruction containing that cycle can be identified by a program. Tradeoffs Figure 1 shows that the simple processor operations; i.e., loads, stores, transfers and instruction fetches account for 95 percent of all processor operations (excluding address modification). Figure 2 shows that 30 percent of all processor operations are preceded by 276 Fall Joint Computer Conference, 1972 some type of address modification; of the 30 percent, 25.4 percent is simple register type modification and 4.6 percent involves indirect words. Since register type address modification does not in itself alter register contents, it is not a factor in determining the retryability of a simple instruction. Consequently, if instruction retry is implemented at all, it will be better than 90 percent effective. 
The mandatory design requirements for instruction retry are: (1) The failure must have been detected by hardware error detection. (2) The failing instruction must be identifiable. (3) Instruction operands must either remain intact or must be restorable. The simple mechanism of holding the adder sum output on the data lines until it 'has been determined that an error has not occurred will prove effective against processor/memory interface errors, for simple instructions. If the processor does not have instruction overlap capability, merely safestoring the instruction address in a predetermined memory location will serve to identify the failing instruction. If the processor has overlap capability, then a more sophisticated method of identifying the failing inNumber of Operations Number of Instructions Instruction Fetches** Stores Multiplies Divides Transfers Shifts Floating Adds Floating Multiplies Floating Divides Loads Load Register, Store Register Negate Master Mode Entry Execute Double, Execute Repeats Repeated Instructions*** Returns Binary Coded Decimal NOP Retryable Operations 2,653,856 1,661,723 938,873 367,811 1,196 933* 479,421 41,894 3,231* 2,621* 372* 743,170 1,078 450 661 4,152 5,326 53,260* 4,088 2,919 2,400 2,589,287 (97.5 percent) * Not Retryable 64,569 (2.5 percent) ** Instruction Fetches = 56.5 percent Number of Instructions *** Repeated Instructions == Repeats times 10 Figure I-Instruction frequency analysis ProbabiH Probabili ty .4 .3 .2 .1 '------1t-----+----f----4--.-.-==.~~.~ Any R IT IR RI Totals Any address modification 761,580 Address modification =R 644,062 Address modification = IT 58,161 Address modification = IR 53,778 Address modification = RI 5,579 Figure 2-Probability of address modification on processor operations 3 struction is necessary. (See "Design Methodology To Avoid Obstacles"). Once the failing instruction is identified, the opcode and address modification specifier can be examined to determine if the instruction is a candidate for an instruction retry attempt. The addition of a snapshot register bank and other features mentioned in the "Design Methodology" section of this paper will allow multi-cycle instructions and instructions utilizing indirect address modification to be retried; This will raise the effectiveness of instruction retry to almost 100 percent.* Cost of implementation The 6000 processor features hard wired control logic and instruction overlap. Its four instruction counters, scratchpad register bank and intermediate registers allow instruction retry to be better than 97 percent successful (see Figure 1). * 100 percent effectiveness means that instruction retry can be unconditionally attempted. The Retryable Processor However, none of the above features were incorporated as instruction retry aids and cannot be considered a cost of implementation. The instruction retry effort was not started until after the processor design was frozen. The four instruction counters are an improvement on the 600 line's "ICT Correction" logic. It has always been considered good design practice to accurately identify the location of a fault. The scratchpad register bank was implemented in approximately 400 SSI chips as a dump analysis and T&D aid. The intermediate registers were implemented to speed processor instruction execution. The instruction retry feasibility study, programming and debug efforts required one man year. 277 3 K ROSENSTEEL An analysis of dynamic program behavior Honeywell No ASEE # 54 1972 ACKNOWLEDGMENTS Peter J. 
Scola For originating many of the hardware changes that enabled instruction retry to be implemented on the Honeywell 6000 line and for obtaining the necessary funding. Harlow E. Frick For assistance in implementing the Honeywell Instruction Retry Program. CONCLUSION APPENDIX If a processor failure is detected, there are two possible actions that can be taken. One is to abort the program that was in the execution, to prevent propagation of the error. The second is to retry the failing instruction in an attempt to bypass a possible intermittent failure. Since 80 to 90 percent of processor failures are intermittent, there is an excellent chance that instruction retry will succeed. The advantage of retrying the instruction over aborting the program is that a successful instruction retry will keep the system available to the customer while an abort takes it away from him. As long as errors continue to be successfully recovered and performance is not seriously degraded, maintenance can be deferred until a convenient time period. The question of what comprises a serious degradation is probably best answered by the individual user. In a real time application, 3 or 4 percent may be serious; while in an I/O bound batch application, 30 or 40 percent degradation may be tolerable. REFERENCES 1 J P ROTH Phase I I of an architectural study for a self-repairing computer USAF Space and Missile Sys Org Los Angeles CA AD 825 460 18 1967 2 M BALL F HARDIE Effects and detection of intermittent failures in digital systems IBM No 67-825-2137 1967 With the exception of the footnotes in Figure 1, Figures 1 and 2 were extracted from a report by Kenneth Rosensteel3 entitled: "An Analysis of Dynamic Program Behavior". Figures 1 and 2 represent the total number of instructions executed by a mix of FORTRAN, ALGOL and COBOL compilations and executions. In considering address modification, Figure 2 shows the four possible modification types: Register (R)-Indexing according to the named register and termination of the address modification procedure. Register Then Indirect (RI)-Indexing with the named register, then substitution and continuation of the modification procedure as directed by the tag field of this indirect word. (Indirection with pre-indexing.) Indirect Then Register (IR)-Saving the register designator, then substitution and continuation of the modification procedure as directed by the tag field of this indirect word. (Indirection with post-indexing.) Indirect Then Tally (IT)-Indirection, then use of the indirect word as tally and address with automatic incrementing and decrementing. Any-Probability of any type of address modification. Evaluation nets for computer system performance analysis by G. J. NUTT* University of Washington Seattle, Washington INTRODUCTION second transition, a2, has a single input location, b3, and two output locations, bl and b4• For this example, let a dot in a location represent a token residing on that location. Suppose that the definitions of the two transitions, al and a2, specify that they fire when all input locations contain a token and all output locations do not contain a token. Then in Figure 1 (a), transition al is ready to fire. Figure 1 (b) shows the same transitions and locations after transition al has fired. Figure 1 (c) shows the result after firing a2. In this example, the prespecified subsets are the complete sets of input and output locations. Figure 1 may be interpreted as the following situation in a computer system. 
If bl contains a token, the central processor is available. If b2 contains a token, there is a job requesting the central processor. Thus, concurrent occupancy of tokens on bl and b2 indicates that there is a request for the central processor and it is available, causing transition al to take action, (representing central processor allocation). The time required to fire al indicates allocation time, and is negligible. A token on b3 represents central processor activity. The transition time for a2 reflects the length of central processor time for the job, and on completion of firing, the central processor again becomes available, (i.e., a token is placed on bl ) and the job has completed central processor utilization, (i.e., a token is placed on b4 ). Evaluation nets are derived from the work of Petri2 and Noe3 ; they have also been influenced by the work of Holt,4,5 who is primarily responsible for the development of Petri nets. The nets given in this paper· differ from Petri nets in their' 'practicality." The path of a token through a net is well-defined by providing a mechanism to choose from a set of alternate paths that the token might take. Each token in the net is distinct and retains its identity. The token may have a vector of attributes, (capable of taking on values), that may be modified by the various transitions that operate on the token. The time required for each execution of a transition is part of the specification of the net, thus introducing time as a measure of net performance. The growing complexity of modern computer systems has made performance evaluation results more and more difficult to obtain. The difficulty of representation and analysis of combination hardware/software systems has increased with the level of sophistication used in their design. One popular approach that has been used for evaluating proposed computer systems is simulation.l In this paper, a method of representation is presented that is useful in constructing a modular model, where the level of detail may vary throughout. This method has been designed to aid implementation of these models, with a net effect of providing the ability to construct flexible simulation models in a relatively small amount of time. A graphic approach is used so that the two dimensional structure of the machine is pictorially available to th~ simulation designer. The graphs are also useful for planning measurements of either a simulation model or the machine they represent. An evaluation net is made up of transitions interconnected by directed edges with locations. Each location may contain a token. For a particular transition, the members of the set of locations directed into the transition are called input locations, and the members of the set of locations directed away from the transition are called output locations. A transition fires if the set of input and output locations satisfies the definition of that particular transition, causing one token to be removed from each location of a prespecified sllbset of the input locations and one token to be placed on each .location of a prespecified subset of the output locations. Example Figure 1 (a) shows an example with two transitions. The first transition, (the vertical line labeled al), has two input locations, (the circles labeled bl and b2 ). 
The * Present address: Department of Computer Science, University of Colorado, Boulder, Colorado 80302 279 280 Fall Joint Computer Conference 1972 THE CLASS OF EVALUATION NETS (a) In this section we shall describe the class of evaluation nets in detail. We begin by defining location types and statuses. All locations are connected to at least one . transition. If a location is an input (output) location for some transition and is not an output (input) for any other transition, the location is said to be peripheral, e.g., b2 and b4 in Figure 1. If a location is not peripheral, it is an inner location. A location is empty if it does not contain a token, and full if it contains a token, e.g., locations b1 and b2 are full and ba is empty in Figure 1 (a). If it is not known whether the location is empty or full, the status of the location is undefined. An inner location may change from empty to full or full to empty only by the firing of one of the transitions to which it is connected. The conditions for the status of a location to be undefined will be discussed later, as will the utility of this convention. (b) Transition schemata (c) A transition definition is given by a triple, a = ( s, t ( a ) , q), where s is a transition schema (or type), t (a) is a transition time, and q is a transition procedure. The movement of tokens from input locations to output locations is described by the schema. The number of output locations is limited to two, the number of input locations is limited to three, and the number of connected locations is limited to four for any given transition. If a location is empty, its status is denoted as "0". If the location is full, the .status is denoted as "1". The undefined status is given by the symbol, " ,M(r): =l-i] where i is either 0 or 1, r is the label of the peripheral resolution location, and PI, P2 are "Algolic" Boolean expressions (predicates) which! can be evaluated to either true or false, (Nutt7 contains a more complete handling of these predicates). The resolution procedure is evaluated by first evaluating Pl. If it is true, M(r) becomes i and further evaluation of the procedure is discontinued. Otherwise, P2 is evaluated; if it is true, M (r) is set to 1- i. In either case the resolution procedure evaluation is discontinued after predicate P2 is evaluated. Note that when both predicates are evaluated as false, the marking of r remains undefined. The procedure need not be evaluated again until one of the arguments of the predicates changes its status. Examples of resolution procedures are given in the next section. Token attributes and their modification The transition schema definitions imply that no location may contain more than a single token at a time, provided that an initial marking does not place more than one token on any location. For example, the type T transition fires only when the input location is full and the output is empty, hence only an empty location can receive a token. This property of evaluation nets, (known as safety), allows each token to be distinct. Since tokens retain their identity, we shall give them names and associate a list of n attributes with each token, such that each attribute may take on a value. A token, K, with n attributes is denoted as K[n], and if location b contains K[n], we shall write M(b) =K[n] rather than M(b) =1. The }th attribute of the token K is denoted as K ( j). At times we will find use for tokens with no attributes, whose identity is unimportant. 
We shall continue to indicate these tokens by the symbol "I". For example, a resolution location setting will only need to indicate empty or full status, hence can be denoted as M(r) =0 or M(r) =1. The attributes of a token impose a data structure on the locations of a net. Any particular location will always receive (and yield) tokens with a fixed number, n, of attributes. A location, b, which holds tokens with n attributes is denoted as ben]. Hence, more properly, the expression of a marking should be M (b[n]) = K[n]. As long as the context makes the dimension of b clear, we will not insist on the more complete notation. Conceptually, a location b[n] is composed of n "slots" which contain the n attributes of a token residing on the location. The values of the slots are the values of the corresponding attributes. If the location is empty, the values of the slots are undefined. We shall refer to the ith slot as M(b(i», hence if M(b[n]) =K[n], the ith attribute of K may be denoted M(b(i». Let b[m] be an output location of transition ai and an input location of transition aj (see Figure 4). First, suppose that ai produces a token, K[n], to be placed on b[m], where n~m; the resulting M(b[m]) is defined as follows. Let g be the minimum of the integers nand m. Then M(b(l»: =K(l) M(b(2»: =K(2) M(b(g»: =K(g) If n is greater than m, then the remaining attributes of K[ n] are lost. If n is less than m, then the values of M (b (i) ), for n+ 1 :::; i:::; m, are undefined. Next suppose that aj removes the token on b[m]. The number of a.~ a·J F-€~ Figure 4-Number of attributes Evaluation Nets for Computer System Performance Analysis attributes for that token is defined to be m; where jlf(b(n+l)), ... ,M(b(m)) are undefined through the placement of the token on M(b[n]). Although the transition schema of a particular transition defines the locations that are to receive and yield tokens, the identities and attribute modifications are not reflected without specifically providing for them. For example, suppose a transition a= (s, tea), q), has a schema, s, of J(b1[n], b2[n], b3 [n]) and M(b 1) = KI[n], M(b2) =K2[n], where KI[n]~K2[n]. A transition procedure has the form [Pl-7(ell; e12; ... ; eln ) : ••• : Pk-7(ekl; ek2; ... ; ekm)] where the Pi are predicates (l:::;;i:::;;k, k finite), as described previously, and the eij are "Algolic" arithmetic assignment statements, e.g., M(b 3 (4)): =M(b1(4)) +100. A transition procedure is evaluated by the following algorithm: 1. Set i to 1. Go to step 2. 2. If Pi is true, execute (eil; ei2; ... ; eij) and then terminate transition procedure evaluation. Otherwise go to step 3. 3. Set i to i+l. If i is greater than k, terminate the transition procedure evaluation. Otherwise g<> to step 2. Transition firing A transition firing may now be more formally defined as consisting of the following phases: pseudo enabled phase: A transition is pseudo enabled if all locations satisfy the left hand side of a schema except for the undefined status of a peripheral resolution location. Since this status is undefined, the resolution procedure must be evaluated. (The resolution procedure cannot be evaluated unless the transition is pseudo enabled.) enabled phase: A transition is enabled if all location statuses satisfy the left hand side of a schema. The transition then begins operation. active phase: Transition action is in progress. The status of all associated locations does not change. 
terminate phase: The transition completes processing, changing the status of output locations to agree with the right hand side of the schema, then executing the transition procedure, and finally changing the status of the input locations to agree with the right hand side of the schema. 283 The existence of an active phase in a transition firing implies an associated time that the transition requires to carry out its operation. An expression reflecting this time is provided by the second coordinate of the transition description, (s, tea), q). This specification, t (a), may be a constant value, or it may be a function that is evaluated, ( on entering the active phase), for the particular token(s) that enable the transition for a specific firing. It is convenient to express t (a) for a transition of type X or Yas an ordered pair, where the first coordinate is t (a) if the token moves from the location labeled "0" in a Y transition graph. and the second coordinate is t ( a) if the token moves from the location labeled "1". In the X transitions, the first coordinate indicates t (a) if the token moves to the location labeled by a "0" in the graph and the second coordinate indicates tea) if the token moves to the location labeled with a "1". Since tokens that enable a transition reside on the input location(s) during the active phase, transition time imposes a dwell time on each location. The dwell time of a location, b, denoted d(b), is the total amount of time any token resided on location b. The dwell time contributed by a particular token may be greater than the corresponding transition time for the token, since the token may have begun residence on the location without enabling the associated transition. The accumulation of dwell time for a location reflects the "occupancy time" or "busy time" for that location. Dwell times for a particular token may be summed up to provide a measure of the time required for that particular token to traverse the network, hence turnaround time. In N utt, 7 measures of dwell time and their relationship to transition times are explored further. Definition of an evaluation net With the above preliminaries in mind, we can now define an evaluation net. An evaluation net is a connected set of locations over the set of transition schema and is denoted as the 4-tuple E = (L, P, R, A) and an initial marking of the locations, M, where L = A finite, non-empty set of locations. P = A set of peripheral locations, P~L. R = A set of resolution locations, R ~L. A = A finite, non-empty set of transition declarations, {ai}, ai "= (s, t(ai), q) where s is a transition schema, t(ai) is a transition time, and q is a transition procedure. 284 Fall Joint Computer Conference, 1972 EXAMPLE OF AN EVALUATION NET Let us construct a model of a very simple computer system which uses most of the concepts presented in the previous section. In our computer system, a job entering the mix may either wait for a single tape drive if it requests one, or if no tape drive is needed, proceed directly to processing by requesting the central processor. When processing is complete, the job relinquishes the central processor and releases the tape drive if it has been allocated. In the description given below, we shall use the symbol "T" to denote a predicate that is always true. 
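A minimal executable sketch of the token-game semantics defined above may help fix ideas. The Python fragment below is a simplified, hypothetical rendering rather than the author's formulation or an implementation of the full class: it models locations holding at most one attribute-carrying token and fires a J-type transition when both input locations are full and the output location is empty, applying a transition procedure to build the output token.

    class Location:
        """A location holds at most one token; a token is a dict of named attributes."""
        def __init__(self, name):
            self.name = name
            self.token = None          # None means the location is empty
        def full(self):
            return self.token is not None

    def fire_J(b_in1, b_in2, b_out, procedure):
        """Fire a J-type transition: enabled when both inputs are full and the
        output is empty; on termination the procedure builds the output token
        and the input tokens are removed."""
        if b_in1.full() and b_in2.full() and not b_out.full():
            b_out.token = procedure(b_in1.token, b_in2.token)
            b_in1.token, b_in2.token = None, None
            return True                # transition fired
        return False                   # not enabled

    # Example in the spirit of Figure 1: b1 = CPU available, b2 = job requesting
    # the CPU, b3 = CPU active; the procedure carries the job's attributes forward.
    b1, b2, b3 = Location('b1'), Location('b2'), Location('b3')
    b1.token = {}                                  # unattributed token: CPU is free
    b2.token = {'cpu_time': 40.0}                  # job token with one attribute
    fired = fire_J(b1, b2, b3, lambda cpu, job: dict(job))
    print(fired, b3.token)                         # -> True {'cpu_time': 40.0}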
For the transition procedures that are implied by the transition schema, (i.e., there is no attribute modification and the transition merely copies the token from an input location to the output location(s) indicated by the schema), the procedure is indicated by a hyphen, bl : b2 : b : 3 b4 : b : 5 b6 : b : 7 be: b : 9 Job ready to enter mix Tape drive is available Job requires tape drive Job does not require tape drive Tape job has drive allocated Job requesting CP CP is idle CP is busy Job is through with CP b lO : b ll : b 12 : b : 13 rl : r2: r3: r4: Tape job ready to release drive Non-tape job ready to vacate Tape job ready to vacate Job 1s complete Routes tape job to b ; Non-tape to b4 3 Chooses job from b or b 4 5 Routes tape job to blO; Non-tape to b ll Chooses job from b l2 or b n Figure 5-Graph of evaluation net "-" Tokens that represent jobs in the computer system will be of the form K[3J, where K(l) =The number of tape drives required, (0 or 1). K(2) =Time required to fetch and mount a tape. K (3) = Central processor time. Let E = (L, P, R, A) be the net, (see Figure 5) R = {rl' r2, ra, r4} P = {bl [3J, bu [3]} UR L = {b2, b3[3J, b4[3J, bs[3J, b6 [3J, b7, bs[3J, b9[3J, blO[3J, bl1 [3J, bI2 [3J} UP A = {al'~, ... , as} al= (X(rl, bl [3J, b4[3J, b3[3J), (0,0), -) i.e., al is a type X transition with input locations rl and bl [3J, which copies tokens to either b3[3J or b4[3J with no time delay. a2= (J(~, b3[3J, bs[3J), M(b3(2)), [T~(M(bs[3J): =M(b3[3J)J) i.e., a2 is a type J transition whose time is determined by the second attribute of the token on the input location, b3[3J. aa= (Y(r2, b4[3], bs[3J, b6[3J), (0,0), -) a4 = (J (b6 [3J, b7 , bs[3J), 0, [T~(M (bs[3J) : = M (b6[3J)) J) as= (F(bs[3J, b9[3J, b7), M(b s(3)), [T~M(b7) :=1]) a6= (X(r3rb9[3J, bu [3J, bIO[3J), (0,10 sec.), -) ~= (F(b IO [3J, b2, bI2 [3J), 0, [T~(M(b2): =1)J) as= (Y(r4, bu [3J, bI2 [3J, bI3 [3J), (0,0), -) rl: [(M(b l (l)) = l)~M(rl): = 1; (M(b l (l)) =O)~M(rl): =OJ i.e., rl takes on the same values as the first attribute of the token on bl [3 J. r2:[T~M(r2): = IJ i.e., r2 is always marked with a one. r3:[(M(b9(1)) =O)~M(ra): =0; T~M(r3): =1J r4:[T~M(r4): =IJ Initially, letM(b2 ) =M(b7) =1. A job enters the net at location bl [3J, (the arrival rate of subsequent jobs is not specified in this example). The existence of a token on bl [3 J pseudo enables transition al since b3[3J and b4[3J are both empty. Resolution procedure rl is evaluated, its marking being determined by the first attribute of the token on b1[3J. Suppose that M (~ (1) ) = 1. Then the token is- moved to location b3 [3J, the transition time being negligible, i.e., teal) is zero. Since M(~) = 1 initially, transition ~ is enabled and becomes active. The transition time for a2 is provided by the second attribute of the token on location b3 [3J (which, let us say, contains "trace data" giving the time required to mount a tape). When this transition time has elapsed, bs[3J receives the token from b3 [3J, (see the transition procedure for ~). The resolution location, r2, is a "tie breaker" and in this case always favors jobs· that have just had the tape drive allocated to them, should two jobs be ready to start requesting the central processor simultaneously. Evaluation Nets for Computer System Performance Analysis The remainder of the net may be interpreted in the manner described above. Let us suppose that the net was put into operation at time to and was halted at time tn. 
The elapsed time, tn - to, is called the system up time and is denoted T Notice that the dwell time of location b7, d (b 7 ), gives the central processor idle time and corresponds to Tu-d(b s[3]). Similarly, the resource utilization of the tape drive is available from Tu-d(b2). If token Km[3] enters location b1[3] at time tim and enters location b13 [3] at time tjm, the expression tjm - tim reflects the turnaround time of the job represented as Km[3]. Let K 1 [3], K 2 [3], ... ,KN [3] be N tokens that traversed the net. Then the mean turnaround time for this mix is given by 'U' N L (tjm-tim)/N m=l or, alternatively, may be computed by summing up the appropriate dwell times and dividing by N. The throughput rate may be expressed as N jTu jobs/system up time Suppose we exercise our model and find that it is insufficient for our purposes, e.g., disk access is completely ignored, but has an affect on the parameters we are measuring. We may choose to change the level of detail of the central processor activity in the net. Figure 6 suggests a slightly more complex net that reflects simultaneous disk I/O with central processor activity. We can replace transitions a4, as, and location bs[3] of Figure 5 by the net shown in Figure 6 (the definition of this modification can be expressed in the b : Job through with OF and disk 9 b 14 : Job ready to use disk and OF b : Disk is idle 15 b : JOb is requesting disk 16 b : Disk is busy 17 b : Job is through with disk 18 b : OF is busy 19 b ! Job is through with OF eO b 2l : Job ready to relinquish OP Figure 6-Parallel central processor and disk activity 285 same manner as illustrated previously, but will not be given here). This implies that another attribute for disk time is necessary, which determines the transition time for an. The transition time for as would become zero, and t(a13) is determined by trace data carried in M (b 19 (3». SUMMARY The class of evaluation nets has been informally described. An evaluation net may be treated as an interpreted marked directed graph, where transitions correspond to vertices and the locations correspond to directed arcs. The arcs are capable of holding a single item of structured data at a time. The graph of the net represents the structure of the system and indicates the control of token flow. The transition procedures interpret the action of the vertices. By operating the net, (in a simulation manner), measures of resource utilization, turnaround, throughput, etc., are available for further analysis of the system. An implementation of the nets might include some "automatic" analysis, such as resource utilization figures. The nets are modular and allow varying level of detail of representation. An interactive implementation of evaluation nets might consist of a net editor with graphic and symbolic output. The graphic output would be used by the designer in structural debugging and the symbolic output could be used by an interpreter to simulate the net. Current studies at the University of Washington include the implementation of evaluation nets. A more formal treatment of the nets may be found in N utt, 7 from which this paper is abstracted. Examples are given which model the Boolean functions of two variables and a Turing machine. A comprehensive evaluation net of the CDC 6400 is presented which shows the structure of the machine and which allows an extensive performance evaluation of the machine at the task/resource level. 
This net includes models of priority queues of arbitrary length and illustrates how queueing algorithms may be handled. Evaluation nets are also compared with Petri nets. Future work, besides the implementation, includes the study of the nets as models for computational processes. ACKNOWLEDGMENT The author is grateful to Jerre D. Noe for his support of the research and to Alan C. Shaw for his editorial suggestions. 286 Fall Joint Computer Conference, 1972 REFERENCES 1 H CLUCAS JR Performance evaluation and monitoring Computing Surveys 3 No 4 pp 79-911971 2 C A PETRI Kommunikation mit automaten PhD dissertation University of Bonn 1962 Translated by C F Greene Jr Applied Data Research Inc Technical Report No RADC-TR-65-377 1 supl1 1966 3 J D NOE A Petri net description of the CDC 6400 Proceedings of ACM Workshop on System Performance Evaluation Harvard University pp 362-378 1971 4 A W HOLT et al Information system theory report Applied Data Research Inc Technical Report No RADC-TR-68-305 1968 5 A W HOLT F COMMONER Events and conditions Record of the Project MAC Conference on Concurrent Systems and Parallel Computation pp 3-52 1970 6 J D NOE G J NUTT Validation of a trace-driven CDC 6400 simulation SJCC Proceedings Volume 40 pp 749-757 1972 7 G J NUTT The formulation and application of evaluation nets PhD dissertation University of Washington Computer Science 1972 Objectives and problems in simulating computers by THOMAS E. BELL The Rand Corporation Santa Monica, California problems, a list of objectives, and, finally, a matrix showing which objectives lead to which problems. With this information he can plan his effort more effectively* and improve the design of his simulation model. INTRODUCTION Because the effort required to simulate a computer system is often very great, analysts should consider carefully the probable value of the results prior to embarking on it. Speciallanguages1-5 have been created to aid the programmer in reducing the time required to code a simulation, and analysis techniques 6- 11 are available to reduce time requirements in the later phases of a study. Still, unexpected problems usually arise: An effort concludes with a study only partly completed because budgeted resources have been exhausted, * or the results may be of less value than anticipated. If the analyst can foresee problems prior to commencing the detailed coding phase of a study, he can avoid many of the problems, mitigate many of the remainder, and allow for the rest in anticipating the payoffs of the effort. While some of the problems encountered have unique characteristics, a common set of them seems to keep appearing in simulation studies of computers. Simply knowing the total list of all common problems is no solution to the analyst who typically goes over budget; his difficulty is sorting out the problems that are most relevant to his situation and ignoring the rest. Trying to plan for the unlikely and unimportant can deflect effort from more appropriate areas and lead to less effective analysis than would occur if the problems were ignored until they appeared. The objectives of the simulation influence how the situation will be approached and which problems will most likely lead to critical difficulties. The challenge facing the analyst is to associate the potential problems with his objectives so that he can anticipate his most probable pitfalls and allocate his resources to solving these problems. 
He needs -a list of SIMULATION PROBLEMS Problems in simulating computer systems could be organized into (1) choosing the language for the simulation, (2) representing the real system appropriately, (3) debugging the simulation, (4) performing experiments, and (5) interpreting the results. ** This classification scheme jumps to the analyst's mind immediately because, chronologically, these are the steps he takes in performing a simulation analysis. Although procedural frameworks are important and may lead to improved simulations, they usually do not attempt to identify which particular issues will be most important for a specific simulation effort throughout the procedure. For example, the analyst, in designing his simulation, must consider the resources available to him and how flexible his work must be. He can choose his simulation language by considering these and several other issues. The underlying problems he encounters in language choice and the other steps in a study amount to resolv- * One of the most important advantages in the planning stage is an ability to predict the costs and specific payoffs of an effort. Overselling the potential payoffs of a simulation not only puts the actual results in question, but decreases the credibility of future simulations. ** A more useful scheme is suggested by Morris (Chairman of the Association for Computing Machinery's Special Interest Group on Simulation) and Mayhan in an unpublished paper:12 (1) Define the problem; (2) select a solution method; (3) develop models; (4) validate models; (5) simulate alternative solutions; (6) select and implement the best alternative; and (7) validate simulation solutions. * See Reference 2, page 2. 287 288 Fall Joint Computer Conference, 1972 ing them correctly. Some of the most troublesome are the following:* 1. Resources. The amount of manpower and machine resources to perform a simulation study may be greater than the expected value of the study, or they may simply exceed the total resources available for the effort. The desire should always be to minimize invested resources, but the characteristics of some simulations make this issue more critical than in other studies. (The total available resources may be very limitedparticularly in terms of elapsed time-and the challenge very great.) Typically, adequate resources are invested in the early phases of an effort with the later phases receiving whatever is available. The issue of resources is mentioned on page 150 of Reference 14. 2. Changes. Changes to improve model validity, to produce additional output, and to reflect modified objectives can prove a major difficulty in some simulations, while they are relatively trivial in others. Although some simulation efforts are not complicated by unexpected changes, quick examination of simulation code often reveals that changes were far more extensive than anticipated. Inadequate appreciation of the need for change can lead to choosing a language that is too inflexible as well as designing code that is too complex. The need for changes in a model is noted on page 87 of Reference 15. 3. Boundaries. In addition to changes as described above, a simulation analyst may find that the boundaries defining the modeled portion of the system change as the study progresses. He may discover that he has attempted to simulate too much of the system and be forced to replace parts of the simulation with simple functions. 
Alternatively, he may find that his boundaries are too narrow, and important interactions are not being reflected. Identifying the degree to which boundaries will need change can alter a simulation's design to reduce the difficulty of boundary redefinition. Dumas16 refers to the problem of boundaries on page 77 of his paper.

4. Costs. Cost models are often of significant utility, particularly when the objective includes analysis for procurement or performance improvement decisions. Their inclusion, however, often implies a heavier investment of resources in order to determine the costs of purchasing hardware or software. Costs of using alternative systems (including costs of delays) often prove particularly difficult to quantify. The importance of cost models is noted in References 17 and 18.

5. Experimental design. Toward the end of many simulation efforts analysts realize that exercising the simulation will not be a straightforward process. At this late date, they begin to consider how to design experiments: Are 500 hours of CPU time adequate to determine the response surface? (A rough budget check of this kind is sketched after this list.) Many documents deal with the problem, including References 6 and 8-11.

6. Detail. Simulations vary in detail of implementation from those that are relatively gross (References 19 and 20 give examples) to those that represent operations at the micro-instruction level (References 21 and 22 present examples). The level of detail can often be expressed as the smallest increment of time explicitly recognized in the model. If the simulation is performed in a language like GPSS23 or CSS,24 this level is explicitly recognized in the language. However, this indicator of minimum time increment, although quantitative, conceals the essence of the problem, which is to decide on the extent to which system interactions are to be replicated.

7. Accuracy. Analysts naturally want the achievable degree of accuracy in a simulation to be as high as possible. However, the utility of improved accuracy may be very low and hardly worth the cost. This issue is addressed in References 19 and 25.

8. Validation. An analyst's belief in the accuracy of his simulation is inadequate for evaluating its actual closeness to reality. Only a formal validation effort can reduce the doubt that it is unrepresentative of the real system. The degree of representativeness is usually assumed to be definable by the ability of the simulation to produce a few numbers that are close to the numbers obtained from reality. Other types of validity are often important, however, including correct sequences of operations and correct responses to alterations of input. The analyst must determine the most appropriate degree of effort to be expended in validating his model. Although many simulations of computers are never validated, examples of validation exercises can be found in References 19 and 25.

* Few authors even mention the problems they have encountered in simulating a computer system; this may explain the impression held by some that such efforts are easy. References to sources dealing with specific issues are given in the descriptions of the issues. McCredie and Schlesinger13 mention nearly every one of the issues in their paper.
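The budget question raised under experimental design (item 5) can be checked with simple arithmetic before any runs are made. The sketch below is purely illustrative and is not drawn from the paper; the factor count, levels, replications, and hours per run are hypothetical placeholders that an analyst would replace with figures from his own study.

    # Rough CPU-hour budget check for a replicated full factorial experiment.
    # All of the specific numbers below are invented for illustration.

    def runs_required(factors: int, levels: int, replications: int) -> int:
        """Number of runs in a full factorial design with replication."""
        return (levels ** factors) * replications

    def cpu_hours_needed(factors: int, levels: int, replications: int,
                         hours_per_run: float) -> float:
        """Total CPU hours implied by the design."""
        return runs_required(factors, levels, replications) * hours_per_run

    if __name__ == "__main__":
        budget_hours = 500.0      # the figure questioned in item 5
        needed = cpu_hours_needed(factors=6, levels=2, replications=3,
                                  hours_per_run=2.0)
        print(f"{runs_required(6, 2, 3)} runs, {needed:.0f} CPU hours, "
              f"within budget: {needed <= budget_hours}")

Even this crude count (192 runs, 384 hours for the invented settings) shows how quickly an experiment outgrows its budget: adding a seventh factor at the same settings doubles the requirement to 768 hours.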
OBJECTIVES

The objectives of a simulation should be explicitly stated and should be closely related to the decision environment in both terminology and emphasis. Simulation for its own sake is a sterile process and economically unjustifiable. Some published papers on specific simulations state that the author's objective was to simulate a particular jobstream on a particular hardware/software system. These papers probably reflect the author's orientation toward the problems involved in the simulation activity per se; the decision-environment objectives can usually be deduced from sections titled "Findings" or "Conclusions."

Five categories of simulation objectives seem to characterize the bulk of simulations of computer systems. These five categories are as follows:

1. Feasibility analysis-investigating the possibility of performing a conceptualized workload on a general class of computer systems. An example of a feasibility analysis is presented in Reference 26.

2. Procurement decision-making-comparing one or more computer systems with a specific workload to decide which of several (or whether any) computer systems should be procured. For example, Bell Telephone Laboratories reports this type of simulation application in Reference 27, and page 4 of Reference 28 provides a report of Mobil Oil Corporation's application.

3. Design support-projecting the effects of various design decisions and/or tracking the development of a system. Many simulations have design as the objective. Examples are to be found in References 15, 16, and 29.

4. Determining capacity-for projected systems, determining the processing capacity of various configurations; for existing systems, determining the processing capacity under a load different from the current work. Examples of what were apparently simulations to determine capacity are presented in References 30 and 31.

5. Improving system performance-increasing processing capacity by identifying and changing the most sensitive parts of the hardware/software system. This process is also known as tuning, and examples can be found in References 20, 25, and 32.

Decision-oriented objectives may be as hard to state at the beginning of an effort as they are to discover in many post-analysis papers. Nevertheless, analysts somehow manage to choose an approach and then develop some solution to each of the issues suggested earlier. Many of these are developed within the context of other choices (e.g., the language to be used) involving additional, mechanistic criteria (e.g., user-directed output). One danger in using this procedure is that the process of making other choices may seriously compromise the simulation's value by directing the simulation into unfruitful areas. Just as importantly, the analyst may attempt to generate a simulation that will do all possible things. McCredie and Schlesinger (pages 201-202 of Reference 13) point out that attempting simulations "capable of answering almost any reasonable question about the system ... must be paid for by large investments in personnel and computer time." This is true, of course, because the analyst must solve all the problems indicated earlier, and some of these may have solutions for one objective that are inconsistent with solutions for other objectives. Such inconsistent solutions should be detectable by drawing a matrix of the issues and objectives with the general solutions as entries. Figure 1 illustrates such a matrix, but it is not completed because the objectives are not well enough defined to permit identification of even a general solution for each issue.

Figure 1-Desired matrix (rows: Resources, Changes, Boundaries, Costs, Experimental Design, Detail, Accuracy, Validation; columns: Feasibility, Procurement, Design, Determining Capacity, Improving System Performance; entries left blank)
For example, the most appropriate level of detail in a study to determine the feasibility of computer logic might be at the microinstruction level, as it is in Rummer's study.33 At the same time, a simulation to investigate the feasibility of an entire system might be at so high a level that nothing shorter than a complete job task or data transmission is considered. (This is the case in the studies by Downs et al.26 and Katz.34) Yet both simulations would have feasibility as the objective. A different categorization scheme is needed for objectives-one that will make it easier to associate problems with objectives by aggregating the decision environment's objectives into classes for the simulation environment.

Alternative categorization scheme for objectives

The alternative scheme suggested in this paper does not divide the objectives into more categories; instead, it reduces the number and redirects them so that they are more useful in defining answers to the issues suggested above. The alternative defines three categories: absolute projection, sensitivity analysis, and diagnostic investigation. It may appear that all the simulations in each of the five decision-oriented categories map easily, as blocks, into categories in the alternative scheme of three, but exceptions appear often enough that generalizations about mappings are dangerous.

Absolute projection

This category includes those simulation studies whose objectives can be reduced to the desire to make basically dissimilar comparisons. An example of this type of objective is a situation in which the processing capacities of two systems under a certain load are to be compared. The analyst wishes to determine which system should be procured. Another example is the comparison of a system's processing capacity with the load that it is expected to encounter. (This is usually tested operationally by determining the expected time for the simulated system to process a load and comparing this time with the maximum allowable time.) The decision under consideration in this instance may be whether to procure a certain system, or it may be whether to perform a new job on an existing system. A third example of a simulation in this category is one in which response times are being compared with stated requirements. If the proposed system is unable to meet the requirements, then it must be augmented.

The important characteristic in each of these examples is the necessity for evaluating an objective function in absolute terms with a high degree of absolute accuracy. If two systems actually differ in processing capacity by 20 percent, the simulation technique must produce answers with absolute errors of less than 10 percent if the analyst is to be sure of choosing the better system. Apparent examples of absolute projection are described in References 14, 26, and 30.
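The 10 percent figure above can be made concrete with a short calculation, under the simplifying assumption (made here for illustration, not stated in the paper) that both estimates err by at most a fraction \(\epsilon\) of the smaller system's true capacity. Let the true capacities be \(C\) and \(1.2C\); the simulated estimates are guaranteed to rank the two systems correctly whenever

\[ 1.2C - \epsilon C > C + \epsilon C \iff \epsilon < 0.1 , \]

that is, whenever absolute errors stay below 10 percent of the smaller capacity. A sensitivity analysis, discussed next, escapes this requirement because its decision rests on the ratio of two estimates produced by the same model, so a consistent multiplicative bias cancels out of the comparison.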
Sensitivity analysis

Simulations falling in this category emphasize similar comparisons. While simulation studies making absolute projections must have absolute accuracy, sensitivity analyses require good accuracy only in (1) the areas in which two cases are not identical and (2) the areas that significantly interact with the nonidentical areas. Although the simulation code may represent far more than the portion of the system under consideration, the primary validation effort should be devoted to the central portion, with reasonableness being the criterion for the rest.

The remainder of the simulation code is seldom excess (and therefore an embarrassment) for several reasons. First, it usually interacts with the central portion in some manner in which the details are not important, but the general sorts of interaction are important. Second, other sensitivity analyses may use the same simulation code, and building one simulation for several analyses may be the most efficient procedure. Third, the boundaries of the central portion are often not identifiable early in the simulation effort because the analyst is not yet familiar with all the interactions.

A decision-maker doing sensitivity analysis may require that answers have high reliability, but if he has an alternative that improves on the default by 20 percent, he does not need to have the absolute values of each. His decision is based on the changed value of the objective function rather than its absolute value. A basic characteristic of sensitivity analysis, of course, is the comparison of slightly different alternatives. For the simulation analyst this implies that his simulation must be constructed to facilitate changes.

As an example of sensitivity analysis, an analyst might be interested in the effects of changing hardware, changing software, or changing scheduling schemes. Specifically, he might want to know whether increasing the size of buffers results in increased message throughput. With the exception of the changes under consideration, the initial and changed simulations are identical. The analyst must ensure that the changed portion (and the parts it interacts with) is represented accurately, but the remainder of the simulation (probably including disk queuing, front-end processors, file layouts, etc.) can be less accurate. Of course, the possibility exists that the analyst will incorrectly assume that parts of the system are not critical when they really are, but this is the boundary problem that an analyst always faces. He might find comfort in having the simulation agree with reality in correctly reporting message throughput over a wide range of conditions, but his decision can be made on the basis of the ratio of throughputs before and after the increase in buffer size. References 16, 18, 21, 22, 29, and 35 would appear to give examples of sensitivity analysis.

Diagnostic investigation

Diagnostic investigations tend to place less emphasis on the value of an objective function. The interest of the analyst is to gain understanding of the detailed manner in which the simulated system behaves. He may be interested in examining interactions, in analyzing aberrations in the real system (or, too often, those peculiar to his simulation), or in tracing the progress of a transaction to determine whether it goes through the system as expected. The emphasis tends to be on the performance of very small parts of the system. Graphical analysis techniques often find application in this type of simulation since detailed sequences of activities may require examination. Diagnostic investigation would appear to be the objective in References 32 and 33.

Substudy objectives

The global objectives of a simulation study may not match the immediate objectives of an analyst at certain points in a study. For example, an analyst performing a sensitivity analysis study may find that he needs to project absolute performance to determine whether his model's gross interactions yield results that are even remotely correct.
Then he may wish to verify the details of an alternative scheduling strategy and trace its actions through the scheduling algorithm. Only then does he bother to perform simulation runs for the several alternatives that he has programmed. Although his global objective would fall in the category of sensitivity analysis, the analyst would have performed two substudies with local objectives in the other two categories of absolute projection and diagnostic investigation. Dumas16 and Ceci and Dangel36 have performed these types of substudies.

The substudy objectives in a simulation study constitute a means of attaining the study's objectives and merely reflect short-term techniques. While these may be important in performing tasks such as verification and ensuring reasonableness, they are not the objectives that determine the simulation's overall design and should not confuse the analyst about the type of global objectives he is pursuing. If the effort devoted to a substudy becomes large, the analyst should carefully consider whether his substudy effort is relevant, whether his formal global objectives should be revised, or whether the global objectives are simply unattainable.

ASSOCIATING PROBLEMS WITH OBJECTIVES

The definitions of problems and objectives suggested in the preceding pages have assumed that the analyst is interested in designing his simulation before launching into the details of language choice and coding. The assertion has been implicit that these definitions could be used in associating the problems with the objectives to lead to better simulation designs. Figure 2 represents an attempt to provide this type of aid. The importance of the first five issues (resources, changes, boundaries, costs, and experimental design) is indicated there; the applicability of most of the entries is apparent. For example, limitations on available resources will be a critical problem in absolute projection studies because the entire system must be simulated to a high degree of accuracy, and usually the work must be done in a short time. On the other hand, a diagnostic investigation need only reflect particular parts of the system of interest. Sensitivity analyses lie somewhere between these two extremes.

Suggestions regarding the last three issues are of a different character. Rather than indicating importance (largely the degree of resource commitment needed), they suggest approaches that are not necessarily indicative of a particular level of effort; however, taking the right action is critical for a simulation of any objective.

Level of detail

The most appropriate level of detail for an absolute projection simulation is usually at quite a macro level because the entire system must be simulated, and resources are usually at a premium. At the other extreme, a diagnostic investigation usually must be at a relatively micro level in order to reflect detailed interactions. A sensitivity analysis simulation, however, often represents a combination of levels since it may represent the bulk of the system grossly and the altered part in detail.

Accuracy

The accuracy of response time or throughput figures in a diagnostic investigation study is usually of superficial importance. The analyst is investigating the manners in which one (or a few) parts of the system interact; investigating the details of one part of a system's behavior does not require overall accuracy of performance parameters.
Of course, the performance should be reasonable or the behavior will not be reasonable, but high accuracy in performance parameters is not necessary for representative interactions.

ISSUE                  ABSOLUTE PROJECTION    SENSITIVITY ANALYSIS     DIAGNOSTIC INVESTIGATION
Resources              Critical               Important                Desirable
Changes                Desirable              Critical                 Important
Boundaries             Desirable              Important                Critical
Costs                  Desirable              Important                Irrelevant
Experimental Design    Important              Critical                 Desirable
Detail                 Macro                  Moderate                 Micro
Accuracy               Critical Overall       Critical in Places       Reasonableness Only
Validation             Value Comparison       Derivative Comparison    Sequence Checking

Figure 2-Issues vs objectives

For sensitivity analysis, a simulation must closely reflect the differences that will be encountered between the various alternatives under consideration. While absolute values of performance parameters may be comforting, the decision problem at hand requires only the relative difference between similar situations. Absolute projection, of course, requires accuracy in desired performance parameters; undue faith in absolute projections is perhaps the most dangerous error in simulating computer performance.

Validation

Validation is performed to improve the confidence that the required type and degree of accuracy is obtained in a simulation. This means that, in absolute projection, the simulation's projection of performance parameters must be compared with the parameters from the real system. This value comparison is necessary if faith is to be vested in the predicted parameters. In instances where a system does not exist (so no validation can be performed), the analyst should include a caveat with any reported results to indicate that the simulation is of undetermined accuracy.

Sensitivity analyses are often validated by comparing the values of real performance parameters with the predicted values over some set of conditions that are realizable on the actual system being simulated. A projection can then be made of an unvalidated case based on the knowledge that the predicted performance was correct in a number of similar cases; therefore, the changes in performance were accurately reflected and probably will be in the new case. This approach may be excessively expensive because it requires accuracy in parts of the simulation that are not to be altered. An alternative is to compare only changes in performance due to specific changes in the system. In this case, only the fractional changes need be compared, so significant savings may be possible. This less exhaustive process is analogous to comparing derivatives rather than absolute values of functions. In many cases, the analyst only needs to determine whether the change is positive, negative, or zero.

Validation of diagnostic investigations requires even less rigor than for sensitivity analyses. Since the emphasis is on examining detailed interactions, the analyst usually only needs to ensure that the sequence of operations is correct. Even this validation can be quite time-consuming and frustrating if the analyst is restricted to viewing flowcharts. Powerful graphical techniques for showing interactions are very useful here.
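The paper presents Figure 2 as a chart to be consulted by eye. Purely as an illustration (not something the paper provides), its entries can also be tabulated so that, once the objectives of a study are stated, the suggested emphasis for each issue can be looked up and any issue that mixed objectives pull in different directions can be flagged. The dictionary below simply transcribes Figure 2; the consult function and its names are hypothetical.

    # Figure 2 transcribed as a lookup table: issue -> objective -> suggested entry.
    FIGURE_2 = {
        "Resources":           {"absolute": "Critical",         "sensitivity": "Important",             "diagnostic": "Desirable"},
        "Changes":             {"absolute": "Desirable",        "sensitivity": "Critical",              "diagnostic": "Important"},
        "Boundaries":          {"absolute": "Desirable",        "sensitivity": "Important",             "diagnostic": "Critical"},
        "Costs":               {"absolute": "Desirable",        "sensitivity": "Important",             "diagnostic": "Irrelevant"},
        "Experimental design": {"absolute": "Important",        "sensitivity": "Critical",              "diagnostic": "Desirable"},
        "Detail":              {"absolute": "Macro",            "sensitivity": "Moderate",              "diagnostic": "Micro"},
        "Accuracy":            {"absolute": "Critical overall", "sensitivity": "Critical in places",    "diagnostic": "Reasonableness only"},
        "Validation":          {"absolute": "Value comparison", "sensitivity": "Derivative comparison", "diagnostic": "Sequence checking"},
    }

    def consult(objectives):
        """For a set of stated objectives, report each issue's suggested entries
        and flag issues whose suggestions conflict across those objectives."""
        report = {}
        for issue, row in FIGURE_2.items():
            entries = {obj: row[obj] for obj in objectives}
            conflict = len(set(entries.values())) > 1
            report[issue] = (entries, conflict)
        return report

    if __name__ == "__main__":
        for issue, (entries, conflict) in consult(["absolute", "diagnostic"]).items():
            flag = "  <-- conflicting demands" if conflict else ""
            print(f"{issue:20s} {entries}{flag}")

Consulting the table for a study that combines absolute projection with diagnostic investigation, for instance, immediately flags Detail (macro versus micro) and Accuracy as points of tension of the kind that Example 1 below encountered.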
APPLICATION

The categorization schemes and matrix presented in this paper are without value unless they can aid analysts in planning analyses and designing simulations. Two examples will be given to indicate how they can be applied. One example uses a simulation that was performed without reference to such schemes and illustrates how the effort could have been aided by their use. (The problems encountered led to developing the schemes and matrix.) The second example presents a situation in which the schemes were applied in order to avoid problems that otherwise might have arisen.

Example 1: Simulating without reference to the matrix

This first example involves a simulation of the Video Graphics System (VGS) performed during the implementation of software on newly designed hardware. The system uses a central communications switching and controlling machine-an IBM 1800-that communicates with a series of terminals and several service machines. The service machines execute user code and send digital representations of pictures to the 1800 for conversion to analog representation in a special picture generator controlled by the 1800. One picture generator and three scan converters service all users (presently 28), who employ terminals with raster-scanned screens that can be slightly modified broadcast television sets. Various input devices, including keyboards, are added to the sets to enable two-way communications. The objective of the system is to supply high-powered interactive graphics capability to many users at a moderately low cost through time-shared use of the expensive digital-to-picture hardware. The system as a whole is described in Reference 37, and a description of the modeled portion of the system is presented in Reference 38.

Objectives

Prior to doing any simulation coding, we spent time learning about the system's characteristics and developing a set of simulation objectives. We then distributed a preliminary description of our understanding of the system along with our proposed objectives. (The objectives were expressed as questions that needed answers.) The characters to the left of each objective did not appear in the original (taken from page 80 of Reference 38). They indicate the type of objective, and the characters stand for the following:

A - Absolute projection
S - Sensitivity analysis
D - Diagnostic investigation

Although several additional objectives were added during the study, the objectives listed below were retained for its duration. Many of these objectives, however, were not addressed due to lack of time and the belief that the questions could not be adequately answered with the simulation.

A 1. Under what load conditions will the system give poor response? (It may be feasible to alter the load by user education as well as by changing characteristics of such software support as the Integrated Graphics System.)
A 2. Will messages be unduly delayed in the VMH (Video Message Handler, essentially an access method) system in the 360s?
A 3. Will channel cycle-stealing slow the 1800 CPU enough that input data are lost due to delays in processing?
S 4. Will a ping-pong system decrease response time of the VGS?
D 5. What will be the effect of the 1800 waiting at interrupt level four while buffers are unavailable for service machine input?
D 6. What will be the effect on the 1800 of one service machine being unresponsive for a short period?
A 7. What portion of system capacity does a Tablet take? (It might be profitable to disable a Tablet that is temporarily not in use, or to use a keyboard instead of the Tablet.)
S 8. How useful would more core be in the 1800?
S 9. How useful would another 1800 be?
These objectives included four in the category of absolute projection, three in sensitivity analysis, and two in diagnostic investigation. Since we had objectives in each of the three objective categories, we can see from Figure 2 that we needed a simulation that was at a macro level but also (conflicting) in micro detail. In addition, the simulation had to have a high degree of overall accuracy, be easy to change, have easily altered boundaries, use few resources, and be extensively validated. While no one noted the extreme difficulty of achieving all the objectives at the time we stated them, our proposed categorization schemes and matrix of solutions quickly reveal how difficult it would be to achieve them all.

Diagnostic investigations

We decided to create a simulation at a low (i.e., detailed) level; the basic time increment in this GPSS simulation was 50 microseconds. It traced all normal interactions in the 1800 and used approximate timing information generated by multiplying the number of instructions in a module by the average time per instruction as measured during an early run of the system. Most of the actual work with the simulation involved diagnostic investigations, including objectives 5 and 6. In addition to the objectives stated before coding began, we investigated cases of potential deadlock and of the platooning that was characteristic of the system. Interactive computer graphics was used extensively to aid in investigating these situations; hardcopy graphics was used to document the results and communicate with system designers. Figure 3 shows a typical output, complete with the analyst's marginal notes. This display shows, over simulated time, the priority level of the executing software at the top; the entire bottom of the display presents a Gantt chart. This Gantt chart shows which routines are in control at each moment of simulated time. With these displays the simulation served admirably in answering questions during diagnostic investigations.

Figure 3-Graphical output (display panels: statistics, Gantt chart, variable graph)
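The GPSS model itself is not reproduced in the paper, so the following is only a schematic sketch of the two ideas just described: module service times approximated as an instruction count multiplied by an average instruction time, and a record of which routine is in control that could feed a Gantt-style display such as Figure 3. The module names, instruction counts, and average instruction time are invented for the illustration, and the sketch is written in Python rather than GPSS.

    # Schematic fixed-increment trace, loosely in the spirit of the model above:
    # a 50-microsecond clock and module times derived from instruction counts.
    import math

    TICK_US = 50                 # smallest time increment explicitly recognized
    AVG_INSTRUCTION_US = 4.0     # hypothetical average 1800 instruction time

    # Hypothetical modules executed in sequence: (name, instruction count).
    MODULES = [("poll_keyboards", 120), ("queue_message", 260),
               ("drive_picture_generator", 900)]

    def ticks_for(instruction_count: int) -> int:
        """Module service time, rounded up to whole 50-microsecond ticks."""
        return max(1, math.ceil(instruction_count * AVG_INSTRUCTION_US / TICK_US))

    def trace(modules):
        """Return (start_us, end_us, module) rows for a Gantt-style chart."""
        clock_us, rows = 0, []
        for name, count in modules:
            duration_us = ticks_for(count) * TICK_US
            rows.append((clock_us, clock_us + duration_us, name))
            clock_us += duration_us
        return rows

    if __name__ == "__main__":
        for start_us, end_us, name in trace(MODULES):
            print(f"{start_us:6d}-{end_us:6d} us  {name}")

In the real model the 50-microsecond tick plays the same role as TICK_US here: nothing shorter than one tick is ever distinguished, which is exactly the level-of-detail indicator discussed earlier.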
Sensitivity analysis

The initial sensitivity analysis objectives were addressed, but delays in validation caused us to be very reluctant to put much faith in the results. Early use of the real system indicated that some functions performed by hardware should be implemented in software, and we decided to add an objective about the utility of this change. We found, to our surprise, that total system utilization would be only marginally affected by the change. Without validation, we discounted this result initially. The importance of the issue, however, led to a substudy with strong characteristics of diagnostic investigation to explain the result. We discovered that low-priority attempts by the system to clean up various queues caused processing in the altered case (with hardware implementation) but not in the initial case (with a software implementation). Eventually, we did perform a validation of the simulation and found that, within the context of sensitivity analysis, the simulation had quite adequate accuracy. (See Reference 39 for details of this validation effort.)

Absolute projection

The largest number of objectives for this simulation study fell into the category of absolute projection. Objective 7 (regarding the portion of system capacity used by a single RAND Tablet) is typical of these, and illustrates the problems of using a simulation for absolute projection when it is designed to fulfill other objectives too. The first problem is that the portion of the system required by a Tablet varies with system loading. As the load increases, the overhead to handle a Tablet (contrary to usual system performance) decreases. Therefore, a single number is inadequate to represent performance in general. This characteristic (performance not being easily represented in simple ways) appears in most systems, but, in absolute projection, stating the fact is often considered unacceptable by people desiring simple answers.

Secondly, the absolute projections, in comparison with measured results, tended to be optimistic by a factor of about two. That is, reality took twice as long as predicted by the simulation. Since projections were based on average instruction times, we put the 1800 processor into a very restricted processing loop (83 instructions) to separate timing assumptions from interaction representations, computed the predicted time to execute the instructions (using published, manufacturer-supplied timings), and measured the actual time to execute them. In a variety of instances the actual and predicted times did not agree; in one of the clearest cases the prediction was 155 microseconds and the measured time was 220 microseconds. This last difference led us to doubt that our bottom-up approach to generating timings would ever lead to fulfilling most of the absolute projection objectives, since we did not understand some of the interactions between hardware and software. (One of the few objectives that were usefully addressed, even if not rigorously answered, was objective 7. The predicted system loading was so high that even gross errors in the simulation would not lead to acceptable performance. Predicting this performance problem helped strengthen the case for hardware implementation of some of the functions necessary for Tablet operation.)
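The restricted-loop experiment just described is essentially a small calibration check: sum the published instruction timings for a known loop, compare the sum with a measurement of the same loop, and treat the ratio as a warning about the timing assumptions. The sketch below only illustrates that arithmetic; the instruction mix is invented, and the 155- and 220-microsecond figures are the ones reported in the text.

    # Calibration check in the spirit of the restricted-loop experiment above:
    # bottom-up predicted time versus measured time for the same 83-instruction loop.

    # Hypothetical mix: (published instruction time in microseconds, count).
    PUBLISHED_TIMINGS = [(1.5, 40), (2.0, 30), (3.0, 13)]   # 40 + 30 + 13 = 83 instructions

    predicted_us = sum(time_us * count for time_us, count in PUBLISHED_TIMINGS)

    reported_prediction_us = 155.0   # the paper's bottom-up prediction for one case
    measured_us = 220.0              # the measured time reported for the same case

    print(f"bottom-up prediction from the hypothetical mix: {predicted_us:.0f} us")
    print(f"reported prediction {reported_prediction_us:.0f} us, measured {measured_us:.0f} us, "
          f"ratio {measured_us / reported_prediction_us:.2f}")

A ratio of about 1.4 from timing assumptions alone, against a factor of about two for the full model, suggests that unmodeled hardware/software interactions contributed the remainder of the optimism, which is consistent with the doubt expressed above.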
Summary

Many of the anticipated payoffs of the simulation were not realized because its objectives implied inconsistent solutions to problems in simulation. While the results were useful in fulfilling some objectives, a review of the problems to be encountered in achieving the other objectives could have allowed us to rank them and to consider, before coding the simulation, whether its design was most appropriate in aiding the VGS designers.

Example 2: Referencing the matrix before simulating

This second example involves a simulation of a very large information management system. The simulation was undertaken during the design stage; no hardware was yet available for running any validation tests. A "packaged simulator" was to be used to determine the size of hardware to be ordered. The objective clearly fell into the absolute projection category, and yet validation of program descriptions could not be performed. While management wished to know the precise configuration that should be acquired, facilities were not available for performing the necessary validations of overall accuracy.

Diagnostic investigation

We suggested that diagnostic investigations be undertaken to determine whether some critical portions of software would perform as expected. The micro-level simulations could be checked for correct sequences and, as soon as hardware was available, the reasonableness of the predictions could be validated.

Absolute projection

The need for information about appropriate hardware configurations was very real, so we suggested that a multi-phase strategy be pursued. During the period when no validation was possible, important programs could be simulated at a macro level to see whether obvious design problems existed. (If the simulation predicted 100 hours to run each of ten daily programs, even the most skeptical analyst would question the design. A number of such instances were discovered and corrected.) The important element of this phase was to devote heavy effort only to cases where problems clearly exceeded the potential errors in the simulation. Concurrently, techniques for describing programs were checked by employing them on software being run on an existing system. This effort led to changes in the descriptions of software for use in simulations. Later, preliminary validation could be performed using data made available from configurations used in testing. Since analysts had already completed initial simulations of the programs, validation and revision could be accomplished in the short time between availability of initial data and the required hardware ordering date.

Summary

Suggestions about a more appropriate procedure for this example could clearly be made without our schemes and matrix. In practice, however, they often are forgotten in the rush to implement something and show results. Further, opinion about the difficulty of a specific task is a weak tool to use in convincing people who are unfamiliar with simulation's limitations or under heavy pressure to "get on with the job." The categorization schemes and matrix of solutions are convenient techniques for indicating the requirements to achieve a certain objective in comparison with other potential objectives.

RECOMMENDATIONS

We have found the application of this approach useful in planning and designing our own simulations and in helping other analysts to improve theirs. It proves particularly useful in predicting how much effort is appropriate for validation exercises and what form such validation should take. While an experienced simulation analyst may feel that it expresses little that he does not already know, too many analysts fail to apply their knowledge rigorously in the early stages of a simulation effort.

We suggest that analysts force themselves to state objectives clearly-and in writing-at the beginning of a simulation effort. They should then consider whether all their objectives are realizable when using the suggested solutions to the problems listed in the matrix of Figure 2. Only after assuring themselves that the effort can result in fulfilling the objectives should they design the simulation. Finally, they should consider whether the achievement of the objectives will justify the cost required to implement and validate the simulation.
REFERENCES

1 L J COHEN S3 The system and software simulator Digest of the Second Conference on Applications of Simulation ACM et al New York December 1968 pp 282-285
2 N R NIELSEN ECSS: An extendable computer system simulator The Rand Corporation RM-6132-NASA February 1970
3 J N BAIRSTOW A review of system evaluation packages Computer Decisions Vol 2 No 6 June 1970 p 20
4 W C THOMPSON The application of simulation in computer system design and optimization Digest of the Second Conference on Applications of Simulation ACM et al New York December 1968 pp 286-290
5 G K HUTCHINSON J N MAGUIRE Computer systems design and analysis through simulation Proceedings AFIPS 1965 Fall Joint Computer Conference Part 1 pp 161-167
6 G S FISHMAN Estimating reliability in simulation experiments Digest of the Second Conference on Applications of Simulation ACM et al New York December 1968 pp 6-10
7 T E BELL Computer graphics for simulation problem-solving Third Conference on Applications of Simulation ACM et al New York December 1969 pp 47-56 (Also available as RM-6112 The Rand Corporation December 1969)
8 D P GAVER JR Statistical methods for improving simulation efficiency Third Conference on Applications of Simulation ACM et al New York December 1969 pp 38-46
9 T H NAYLOR K WERTZ T H WONNACOTT Methods for analyzing data from computer simulation experiments Communications of the ACM Vol 10 No 11 November 1967 pp 703-710
10 G S FISHMAN Problems in the statistical analysis of computer simulation experiments: the comparison of means and the length of sample records The Rand Corporation RM-4880-PR February 1967
11 G A MIHRAM An efficient procedure for locating the optimal simular response Fourth Conference on Applications of Simulation ACM et al New York December 1970 pp 154-161
12 M F MORRIS A J MAYHAN Simulation as a process Simuletter Vol 4 No 1 October 1972 pp 10-15
13 J W McCREDIE S J SCHLESINGER A modular simulation of TSS/360 Fourth Conference on Applications of Simulation ACM et al New York December 1970 pp 201-206
14 H A ANDERSON Simulation of the time-varying load on future remote-access immediate-response computer systems Third Conference on Applications of Simulation ACM et al New York December 1969 pp 142-164
15 A L FRANK The use of simulation in the design of information systems Digest of the Second Conference on Applications of Simulation ACM et al New York December 1968 pp 87-88
16 K DUMAS The effects of program segmentation on job completion times in a multiprocessor computing system Digest of the Second Conference on Applications of Simulation ACM et al New York December 1968 pp 77-78
17 S R CLARK T A ROURKE A simulation study of cost of delays in computer systems Fourth Conference on Applications of Simulation ACM et al New York December 1970 pp 195-200
18 N R NIELSEN An analysis of some time-sharing techniques Communications of the ACM Vol 14 No 2 February 1971 pp 79-90
19 J D NOE G J NUTT Validation of a trace-driven CDC 6400 simulation Proceedings AFIPS 1972 Spring Joint Computer Conference Vol 40 1972 pp 749-757
20 J H KATZ Simulation of a multiprocessor computer system Proceedings AFIPS 1966 Spring Joint Computer Conference Vol 28 pp 127-157
21 S C CATANIA The effects of input/output activity on the average instruction time of a real-time computer system Third Conference on Applications of Simulation ACM et al New York December 1969 pp 105-113
22 S E McAULAY Jobstream simulation using a channel multiprogramming feature Fourth Conference on Applications of Simulation ACM et al New York
December 1970 pp 190-194
23 General purpose simulation system/360 user's manual H20-0326 International Business Machines Corporation White Plains New York 1967
24 Computer system simulator II (CSS II) general information manual GH20-0874 International Business Machines Corporation White Plains New York 1970
25 P E BARKER H K WATSON Calibrating the simulation model of the IBM system/360 time sharing syste