AUGUST 1991

WRL Research Report 91/10

Experience with a Software-Defined Machine Architecture

David W. Wall

digital Western Research Laboratory
250 University Avenue
Palo Alto, California 94301 USA

The Western Research Laboratory (WRL) is a computer systems research group that was founded by Digital Equipment Corporation in 1982. Our focus is computer science research relevant to the design and application of high performance scientific computers. We test our ideas by designing, building, and using real systems. The systems we build are research prototypes; they are not intended to become products. There are two other research laboratories located in Palo Alto, the Network Systems Laboratory (NSL) and the Systems Research Center (SRC). Other Digital research groups are located in Paris (PRL) and in Cambridge, Massachusetts (CRL).

Our research is directed towards mainstream high-performance computer systems. Our prototypes are intended to foreshadow the future computing environments used by many Digital customers. The long-term goal of WRL is to aid and accelerate the development of high-performance uni- and multi-processors. The research projects within WRL will address various aspects of high-performance computing.

We believe that significant advances in computer systems do not come from any single technological advance. Technologies, both hardware and software, do not all advance at the same pace. System design is the art of composing systems which use each level of technology in an appropriate balance. A major advance in overall system performance will require reexamination of all aspects of the system.

We do work in the design, fabrication and packaging of hardware; language processing and scaling issues in system software design; and the exploration of new applications areas that are opening up with the advent of higher performance systems. Researchers at WRL cooperate closely and move freely among the various levels of system design.
This allows us to explore a wide range of tradeoffs to meet system goals.

We publish the results of our work in a variety of journals, conferences, research reports, and technical notes. This document is a research report. Research reports are normally accounts of completed research and may include material from earlier technical notes. We use technical notes for rapid distribution of technical material; usually this represents research in progress.

Research reports and technical notes may be ordered from us. You may mail your order to:

Technical Report Distribution
DEC Western Research Laboratory, WRL-2
250 University Avenue
Palo Alto, California 94301 USA

Reports and notes may also be ordered by electronic mail. Use one of the following addresses:

Digital E-net: DECWRL::WRL-TECHREPORTS
Internet: WRL-Techreports@decwrl.dec.com
UUCP: decwrl!wrl-techreports

To obtain more details on ordering by electronic mail, send a message to one of these addresses with the word ‘‘help’’ in the Subject line; you will receive detailed instructions.

Abstract

We built a system in which the compiler back end and the linker work together to present an abstract machine at a considerably higher level than the actual machine. The intermediate language translated by the back end is the target language of all high-level compilers and is also the only assembly language generally available. This lets us do intermodule register allocation, which would be harder if some of the code in the program had come from a traditional assembler, out of sight of the optimizer. We do intermodule register allocation and pipeline instruction scheduling at link time, using information gathered by the compiler back end.
The mechanism for analyzing and modifying the program at link time was also useful in a wide array of instrumentation tools.

1. Introduction

When our lab built its experimental RISC workstation, the Titan, we defined a high-level assembly language as the official interface to the machine. This high-level assembly language, called Mahler, represents a compromise between a traditional assembly language and a compiler intermediate language. It is the target language for all high-level language compilers, and it is also the only assembly language available for general use. For this reason we say that the Titan/Mahler combination is a machine with a software-defined architecture: hardware with a software face.

We had two reasons for this approach. First, programming a RISC machine in a true assembly language can be quite unpleasant. We expected this to be especially true for us, because the Titan’s instruction set is reduced even further than most RISC machines that preceded or followed it [1,5,16,24,26,31,33], and moreover we expected the architecture to change somewhat from generation to generation. Second, we wanted to do very global optimization, including intermodule promotion of variables to registers. Traditional assemblers can interfere with this by allowing actions that the optimizer does not know about; this forces the optimizer to make conservative assumptions.

Mahler is not intended to be a machine-independent intermediate language, an idea going back thirty years or more [37]. Its structure is intentionally quite close to that of the Titan family, and it is more appropriate to think of it as a high-level assembly language. To put it another way, Mahler provides the same sort of clean interface for the Titan that a CISC architecture provides for its microcode engine. Within the envelope of this interface, the assembler and linker can safely perform optimizations that require knowledge of the entire program.

This paper describes our experience with Mahler.
Section 2 briefly contrasts the Titan hardware with the Mahler interface. Section 3, the bulk of the paper, describes the implementation of the Mahler assembler and linker, and describes the optimization and instrumentation capabilities of the system. Section 4 contains quantitative measurements of the effectiveness of Mahler’s link-time optimizations.

(Early versions of some of this material appeared in Proceedings of the SIGPLAN ’86 Symposium on Compiler Construction (1986) [40], the Second International Symposium on Architectural Support for Programming Languages and Operating Systems (1987) [44], and Proceedings of the SIGPLAN ’88 Conference on Programming Language Design and Implementation (1988) [41].)

2. Titan and Mahler

The Titan [29] is an experimental, high-performance, 32-bit scientific workstation developed at Digital Equipment’s Western Research Lab (WRL). It is the first of a family of machines that we have designed. There have already been two different versions of Titan itself and a substantially different CMOS machine we nevertheless call the MicroTitan [23]. These machines are generally similar in that each is a pipelined load-store architecture with a large register set and a floating-point coprocessor. They are substantially different in the size of the register set, the details of the pipeline, and the interface and capabilities of the coprocessor.

Both the Titan and the MicroTitan are load-store architectures. Memory is accessed only through simple load and store instructions with a base register and a 16-bit displacement. Other operations like addition and subtraction are three-register instructions that cannot access memory. Branches and jumps are delayed [31]: the transfer of control does not occur immediately after the branch instruction, but rather one instruction later than that. The instruction executed after a branch is said to occupy the branch slot.
Filling the branch slot with a useful instruction rather than a no-op is desirable. The CPU pipeline can stall for one cycle at a memory reference, depending on the instruction that preceded it, or for tens of cycles at a coprocessor operation if it tries to fetch a coprocessor result that is not ready yet. These stalls are implemented by the hardware and do not require explicit no-ops. Nevertheless, we would like to avoid these wasted cycles by scheduling other instructions to fill them whenever possible.

Mahler is the intermediate language used by all of the front-end compilers implemented for the Titan. It has a uniform infix syntax for operations performed on addresses, signed integers, unsigned integers, single-precision reals, and double-precision reals. These operations are performed on constants or named variables. A variable may be either local to a procedure or global. Some variables are user variables from the source program, while others are temporary variables generated by the front-end compiler. Branches in Mahler are not delayed. There is no explicit concept of a register in Mahler; it is as if there are unboundedly many registers available, each corresponding to a named scalar variable. Procedures are explicitly declared and explicitly called, as in a high-level language.

Thus Mahler hides several machine-level aspects of the Titan: the registers, the pipeline, the delayed branches, and the asynchronous coprocessor. Nevertheless we want to treat the Mahler interface as the official machine description, and as the only available assembly language. The Mahler implementation therefore needs to use these resources well, so that performance is not degraded and Mahler’s clients do not wish that they had direct control over them. We accomplish this by taking advantage of the fact that Mahler is the official machine description.
This means that it has complete control over an entire program, regardless of what high-level languages it is written in, and regardless even of whether ‘‘assembly language’’ subroutines are included. The Mahler implementation uses this complete control to perform several code improvements at link time, when the entire program is available for analysis.

3. The Mahler implementation

Figure 1 shows an overview of our language system. The Mahler implementation consists of the Mahler assembler and an extended linker. A high-level language compiler or a human writes source files in the Mahler language. The assembler reads the Mahler source files and produces object modules. These object files can be linked in the usual manner to form an executable program, or they can be improved at link time by invoking an intermodule register allocator and an instruction scheduler. They can also be instrumented at link time in various ways. The assembler extensively annotates the object modules with information that enables these link-time transformations to be done quickly and correctly.

[Figure 1. The Mahler implementation. Front-end compilers for Fortran, Modula-2, Scheme, C, and Pascal, along with human assembly programmers, produce Mahler source files. The Mahler assembler (local optimization, code annotation, intra-procedural variable coloring, per-procedure use estimates) turns these into annotated object files; the Mahler linker (program-wide use estimates, variable promotion, pipeline scheduling, code rewriting) turns those into an executable file.]

3.1. The Mahler assembler

Code selection on a RISC is easy because there are few choices. Since Mahler is at a higher level than the hardware, it is both possible and desirable to do a number of classical optimizations. Some are global optimizations based on data-flow analysis, but most are local optimizations like local strength reduction.
On the Titan, for example, multiplication by an integer constant can be profitably replaced by the sum or difference of as many as four shifts.*

* Since constants between 1 and 170 can all be handled this way, this eliminates all the multiplications in many programs. Bernstein [7] explored this technique more generally.

The Mahler assembler starts by parsing the Mahler source file and producing an internal form that represents each basic block as a directed acyclic graph (DAG). All of the local optimization is then performed on this DAG structure. Within each basic block, the DAGs for common subexpressions, including multiple uses of the same variable value, are merged when possible.

After DAG optimization, the assembler traverses the DAG, generating code for each node in turn. It allocates temporary registers from a small set of eight reserved for expression evaluation. This number of temporaries is large enough that most basic blocks do not have to spill these registers to memory, but small enough to leave as many registers as possible for global allocation.

In general, the assembler generates simple fast code whose performance will be tolerable even without link-time improvement. It makes no attempt to keep variables in registers, except locally within a basic block. For the most part, it does not try to avoid pipeline stalls, and it generates a no-op after each branch. The exceptions to this occur only inside ‘‘idioms,’’ single operations in Mahler that require sequences of several instructions at the machine level. One example of an idiom is procedure call; another is variable left shift, which is carried out as an indexed jump into a table of thirty-two constant left shifts. These idioms are somewhat like a set of standard macros, and we tailored their expansions to do useful work during pipeline delays and branch slots whenever we could.
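The shift-based strength reduction mentioned above can be sketched as a greedy nearest-power-of-two decomposition. This is an illustrative reconstruction, not the Mahler assembler's actual code; the function names are invented:

```python
def shift_add_sequence(c):
    """Decompose multiplication by a positive integer constant c into a
    signed sum of shifts, greedily taking the nearest power of two at
    each step.  Returns a list of (sign, shift) pairs meaning
    x * c == sum of sign * (x << shift)."""
    terms, sign = [], '+'
    while c:
        hi = c.bit_length() - 1
        low, high = 1 << hi, 1 << (hi + 1)
        if c - low <= high - c:           # nearest power is below (or tie)
            terms.append((sign, hi))
            c -= low
        else:                             # overshoot to the power above,
            terms.append((sign, hi + 1))  # then cancel the excess with
            c = high - c                  # terms of the opposite sign
            sign = '-' if sign == '+' else '+'
    return terms

def multiply_by_constant(x, c):
    """Evaluate x * c using only the shift/add/subtract sequence."""
    result = 0
    for sign, shift in shift_add_sequence(c):
        result += (x << shift) if sign == '+' else -(x << shift)
    return result
```

Multiplication by 10, for instance, becomes (x << 3) + (x << 1), and multiplication by 7 becomes (x << 3) - x. Under this greedy rule every constant from 1 to 170 needs at most four terms, matching the bound in the footnote, while 171 is the first constant that needs five.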
Part of the assembler’s job is to collect and record information to help the link-time register allocator and pipeline scheduler. It builds a table of the variables and procedures that are defined or referenced in the module, and includes this table in the object file. It also annotates the code heavily with a kind of generalized relocation information, which the register allocator uses to remove loads and stores of variables that it promotes to registers. Finally, it flags certain code sequences in idioms as unchangeable so the pipeline scheduler will know not to harm them.

In addition, the assembler optionally does a dataflow analysis on each procedure to determine whether two locals of the same procedure are ever simultaneously live. If they aren’t, the link-time register allocator can decide to keep them in the same register.

When the assembler has finished with all of the source files, the resulting object modules can be linked in the usual manner. Alternatively, the linker can be requested to perform register allocation and instruction scheduling based on the information the assembler has included with each object module.

3.2. The Mahler linker

The interesting part of the Mahler implementation is the linker. The Mahler linker is a standard linker, augmented with a variety of code transformation options. These transformations include optimizations like inter-module register allocation and pipeline instruction scheduling. They also include various kinds of high-level and low-level instrumentation.

The code transformation system is invoked by the linker proper, but is essentially independent of it. The linker reads into memory the modules to be linked, from object files or from libraries, and passes each in turn to the module rewriter. Depending on the requested transformation, the module rewriter changes the module in memory and returns it to the linker proper.
The changed module has the same format as a module read from a file, so several independent transformations can be done in succession. After all the transformations, the linker combines the resulting object modules into a single executable image, which it writes to an executable file.

Some transformations are easy and do not depend on much context; others are harder and require knowledge of the program structure, sometimes even at the source level. This kind of information is collected by the assembler and passed in the object module to the linker, which can then combine the information from all of the modules if needed. The assembler includes this information routinely, so it will be present no matter what transformations we ask the linker to do.

Transformations require us to insert, delete, or change individual instructions. Inserting and deleting is the hard part, because this causes the addresses of code and data locations to change. To correct for this, we must do two things: compute a mapping from old addresses to new ones, and find each place in the program where an address is used and identify the address referenced there.

Computing the mapping is straightforward because we know where we are inserting or deleting instructions. We simply keep track of how much insertion and deletion has been done at each point in the program. We must be careful, however, to identify code inserted between two instructions as either after the first instruction or before the second. The destination of a branch to some instruction should be changed to point to new code inserted before that instruction, but not to new code inserted after its predecessor. The computation of the correct mapping of this instruction’s address must distinguish between these two similar-looking insertions.

Finding the places where addresses are used also turns out to be relatively easy.
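The mapping bookkeeping described above might look like the following sketch. The representation and names are invented; the paper does not show the Mahler linker's actual data structures:

```python
import bisect

class AddressMap:
    """Old-to-new address mapping for a module being rewritten.
    Insertions must be recorded in increasing old-address order.
    before=True records code inserted before the instruction at
    old_addr, so branches to old_addr reach the new code and old_addr
    itself shifts; before=False records code inserted after the
    instruction at old_addr, shifting only strictly later addresses."""

    def __init__(self):
        self.points = []   # old addresses at which the offset changes
        self.deltas = []   # cumulative word offset in effect from each point on

    def record(self, old_addr, delta_words, before=True):
        key = old_addr if before else old_addr + 1
        prev = self.deltas[-1] if self.deltas else 0
        self.points.append(key)
        self.deltas.append(prev + delta_words)

    def translate(self, old_addr):
        i = bisect.bisect_right(self.points, old_addr)
        return old_addr + (self.deltas[i - 1] if i else 0)
```

The two `record` cases correspond exactly to the two similar-looking insertions the text distinguishes: inserting two words before the instruction at old address 8 moves that instruction (and all later ones) to 10, while inserting a word after the instruction at 20 leaves 20 in place and shifts only the addresses beyond it.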
Each object module contains a loader symbol table and relocation tables that mark the places where unresolved addresses are used. An unresolved address may be one that depends on an imported symbol whose value is not yet known, or may be one that is known relative to the current module but that will change when the module is linked with other modules. In either case it must be resolved during normal linking, and the relocation tables tell the linker what kind of address it is. This same information also lets the module rewriter correct the value, if only by leaving the relocation entry in place so that the linker will correct it.

Other addresses are not marked for relocation, because they are position-relative addresses and will not change when the module is linked with others. These are all manifest in the instruction format itself, as in the case of a PC-relative branch instruction. If instructions are inserted between a branch and its destination, we increase the magnitude of the branch’s displacement.

Inserting and deleting instructions can have an effect on span-dependent branches. We believed that this would not be a problem, and it never was; our span-dependent branches were used only within a procedure, and no procedure was ever too big. Handling span-dependent branches is well understood, however, and incorporating a classical algorithm [38] would not be hard.

We are translating individual object modules into new object modules, so we must also translate the loader symbol table and relocation tables. Addresses in the symbol table are tagged with types that tell us how to correct them. Addresses in the relocation tables are a bit different, since they identify specific code or data words, rather than locations; even if we insert a new instruction, the relocation entry should apply to the same instruction as before!
We therefore do not translate these addresses by looking them up in our mapping; instead we build the new relocation table as we build the new segment, appending an entry whenever we append the associated instruction.

In summary, the algorithm for modifying an object module is as follows. Based on the transformation requested, the linker determines the changes to be made, possibly using information left by the assembler. In one pass over the module it makes the individual changes. As it does this, it builds a mapping from old addresses to new, in each case relative to the beginning of the module under consideration. On a second pass it corrects the addresses that appear in the changed code. During this second pass it also produces a new version of the code’s relocation table. It then goes on to correct the addresses that appear in the data segment and in the loader symbol table. The result is a new version of the object module. This new object module can then be linked with others just as if it had been read from an object file, or can be passed in turn to another transformation for further modification.

The Mahler linker can perform many different transformations in this way. Perhaps the most interesting is the intermodule register allocation, to which we turn next.

3.3. Intermodule register allocation

A good way to use a lot of registers is to promote variables: allocate registers for them to live in for long periods of time. Chaitin et al. [9,10] and Chow [11] pioneered a technique that combines this with ordinary local register allocation. A key idea is that the same register can be used for many different variables if this causes no conflicts. The compiler starts by doing a liveness dataflow analysis. It then builds a conflict graph in which an edge appears between two variables if they are ever live at the same time.
Two variables that are never live at the same time can be promoted to the same register. It follows that a coloring of this graph is equivalent to a register allocation for the variables. (A coloring of a graph is an assignment of ‘‘colors,’’ one per vertex, subject to the constraint that adjacent vertices must be different colors. A minimal coloring is one with the fewest possible colors. Good linear-time coloring heuristics exist, even though the problem of finding a minimal coloring is NP-complete.)

Applying this technique across procedure boundaries requires some form of interprocedural dataflow analysis. This in turn requires a more complete picture of the program than is usually available to a compiler in an environment of separately compiled modules. We might approach this problem by compiling all the modules in one monolithic compilation. More recently, some compilers have maintained persistent program databases [34], triggering recompilation of modules when new information shows this to be necessary. In either case, it is awkward to include library routines in the process, since users don’t want to worry about what library modules are included.

Our solution is to wait until link time to select the variables to promote. We can pick the globals and locals to promote using program-wide safety and payoff information. We can keep an important global in the same register across all procedures. And when one procedure calls another, we can keep the locals of the two procedures in different registers. Spills and reloads around calls are not necessary.

Simply postponing the coloring technique until link time is impractical, however. If we wait until link time and then do a dataflow analysis, conflict graph construction, graph coloring, and final code generation, we will be doing the bulk of the back end’s work at link time. Rebuilding a program after a change will take almost as long as if we used monolithic compilation.
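The coloring step itself can be sketched with a simple greedy heuristic of the linear-time kind mentioned above. This is illustrative only, not Chaitin's or Chow's actual algorithm, and the names are invented:

```python
def color_conflict_graph(conflicts, num_registers):
    """Greedily color a conflict graph: variables that conflict (are
    simultaneously live) must get different registers.  `conflicts`
    maps each variable to the set of variables it conflicts with.
    Returns a map from variable to register number, or to None when no
    register is free (the variable stays in memory)."""
    assignment = {}
    # Handle the most-constrained variables first, a common heuristic.
    order = sorted(conflicts, key=lambda v: len(conflicts[v]), reverse=True)
    for v in order:
        taken = {assignment.get(u) for u in conflicts[v]}
        free = [r for r in range(num_registers) if r not in taken]
        assignment[v] = free[0] if free else None
    return assignment
```

With two registers and the conflict graph a-b, b-c, the variables a and c share one register while b takes the other. When registers run short, this sketch simply leaves the excess variables in memory, which mirrors the Mahler decision, described below, not to force every variable into a register.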
Moreover, algorithms for dataflow analysis and conflict graph construction require slightly more than linear time. The cost may be acceptable if we apply the algorithms repeatedly to a lot of smaller modules or procedures, but we have found it too expensive to apply them to an entire program.

The call-return structure of most programs gives us an easy way to conservatively approximate the interprocedural liveness information. Our link-time register allocator assumes that a local variable is live for exactly as long as its procedure is active.*

* We can optionally refine this assumption with an intra-procedural coloring done by the Mahler assembler. See section 3.3.7.

The assembler generates code assuming that variables live in memory, loading them into temporary registers as needed within a basic block. Along with the object code, it produces extra information about the uses of variables and procedures throughout the code. If invoked, the intermodule register allocator picks the variables to promote and uses this extra information to transform each module accordingly.

Generating code that is correct as it stands but that can be easily modified later is attractive for three reasons. First, it means that the promotion of variables is entirely optional. If the expense of global register allocation is unwarranted, we can omit it. Second, it means we are not forced somehow to allocate registers for all the variables in the program. If we manage to keep the most frequently used variables in registers, we can keep the rest of them in memory and the results are still correct. Finally, it keeps us honest as designers of the system. Once we postpone anything until link time, the temptation is great to postpone everything, so that we can know what the other modules look like. The requirement that we generate working code at compile time assures that we will not succumb to that temptation.
To make the global promotion of variables optional, the assembler sets aside some registers as expression temporaries. These temporaries will not be available for link-time allocation to promoted variables. This is contrary to the philosophy of Chaitin et al., who included expression temporaries in the conflict graph, and introduced spills and reloads in order to reduce the complexity of the conflict graph to the point where it was colorable using the number of registers available. This meant that global register allocation was an integral part of code generation and could never be omitted.

3.3.1. Annotated object code

The assembler’s main contribution to link-time register allocation is to explain the code it produces, by annotating it with register allocation actions.

    action     explanation
    REMOVE.v   delete the instruction
    OP1.v      replace operand 1 by the register allocated to v
    OP2.v      replace operand 2 by the register allocated to v
    RESULT.v   replace the result register by the register allocated to v
    LOAD.v     replace ‘‘temp := load v’’ by ‘‘temp := reg v’’
    STORE.v    replace ‘‘store v := temp’’ by ‘‘reg v := temp’’
    KEEP.v     do not delete the instruction, overriding any REMOVE

Figure 2. Register allocation actions invoked if v is promoted to a register.

These actions tell the linker how to change the code if it decides to promote some of the variables to registers. Typically the linker removes loads and stores of a promoted variable, and modifies each instruction that uses the loaded or stored value so that it references the newly allocated register instead of the original temporary. Occasionally the structure of a basic block forces variations of this. The annotations discussed in this paper appear in Figure 2. Each action is qualified by its variable name, and will be ignored if that variable is not promoted.

The simplest case is when any of the loads or stores can be removed.
For example, the assignment ‘‘x := y + z’’ would lead to annotated code like this:

    instruction        actions
    r1 := load y       REMOVE.y
    r2 := load z       REMOVE.z
    r3 := r1 + r2      OP1.y OP2.z RESULT.x
    store x := r3      REMOVE.x

(In this example, registers r1, r2, and r3 are temporaries managed by the assembler.) For instance, if we decide to keep y in a register, we will delete the load of y and change the addition instruction so that its first operand is the register allocated for y instead of the temporary register. The three examples below show the rewritten code that results from various link-time selections of register variables.

    register y         register x,y       register x,y,z
    r2 := load z       r2 := load z       x := y + z
    r3 := y + r2       x := y + r2
    store x := r3

If a value will not be available in the variable for as long as it is needed, the linker must replace the load by a register move rather than deleting it altogether. Figure 3 shows the two possible cases. In the first case, the variable x is not changed until after the last use of its original value; in the second case, x is reassigned before we are through. This could happen because of local optimizations like copy propagation, and means we must keep the value in the temporary and accordingly must replace the load by a register move.

    instruction        actions        instruction        actions
    r1 := load x       REMOVE.x       r1 := load x       LOAD.x
    ... := r1 + ...    OP1.x          ... := r1 + ...
    ... := r1 - ...    OP1.x          store x := ...
    store x := ...                    ... := r1 - ...

Figure 3. Register actions if value is always available in x (left) and if x gets a new value before the old value is dead (right).

A similar analysis applies in the case of a store. When a value is computed and immediately stored to a promoted variable, we can find subsequent uses of that value in that variable if we make no new assignment to the variable before the last use of the value.
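The link-time rewriting these annotations drive can be modeled in miniature. This is a toy representation with invented names; the real linker rewrites Titan machine instructions:

```python
def apply_actions(annotated_code, promoted):
    """Rewrite a list of (instruction, actions) pairs given the set of
    promoted variables.  Instructions are dicts with 'op', 'op1',
    'op2', and 'result' fields; an action is a (name, variable) pair
    and is ignored unless its variable was promoted, as in Figure 2."""
    out = []
    for instr, actions in annotated_code:
        live = [(name, v) for name, v in actions if v in promoted]
        names = {name for name, _ in live}
        if 'REMOVE' in names and 'KEEP' not in names:
            continue                    # load/store of a promoted variable
        instr = dict(instr)             # don't mutate the input module
        for name, v in live:
            reg = 'reg_' + v            # the register allocated to v
            if name == 'OP1':
                instr['op1'] = reg
            elif name == 'OP2':
                instr['op2'] = reg
            elif name == 'RESULT':
                instr['result'] = reg
            elif name in ('LOAD', 'STORE'):
                instr['op'] = 'move'    # memory access becomes a register move
                instr['op1' if name == 'LOAD' else 'result'] = reg
        out.append(instr)
    return out
```

Run on the annotated code for ‘‘x := y + z’’ above, promoting all of x, y, and z collapses the four instructions to a single register add, while promoting only y removes just the load of y and redirects the add's first operand, matching the rewritten columns shown earlier.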
A simple assignment like ‘‘x := y’’ is a special case, because the analysis sketched above would cause both the load and the store to be marked for removal if both x and y are promoted to registers, which is clearly wrong. Instead, we use the KEEP action:

    instruction        actions
    r1 := load y       REMOVE.y RESULT.x
    store x := r1      STORE.x REMOVE.x KEEP.y OP1.y

The KEEP.y action overrides the REMOVE.x action if both x and y are promoted. We end up with a single instruction if we promote either or both.

This three-part analysis of loads, stores, and simple assignments lets us correctly annotate the code we generate. We begin with a forward and backward pass over the basic block. This step marks each operation in the basic block with the variable or variables, if any, in which its result will always be available. This is followed by the generation of the annotated code. Appendix 1 sketches these algorithms.

Some generated instructions may be given redundant annotations, such as two different OP1 actions. This just means that the value is available in two different variables, and either can be used. We don’t know in advance which variables will be promoted, so this redundancy can be helpful. By analyzing the basic block this carefully, the assembler is essentially planning the code needed for any possible combination of promoted variables. The rest of the work is done by the linker, which chooses which variables to promote, and uses the assembler’s annotations to rewrite the code based on that choice.

3.3.2. Register allocation and module rewriting

The first thing the register allocator must do is decide which variables to promote. To help it, the assembler collects use information about each module it translates, and records this information in the object file. This information includes a list of the procedures appearing in the module and a list of the procedures each calls.
It also includes an estimate of how many times each procedure uses each variable. The assembler computes this estimate for each procedure by counting the static number of times the reference is made, with references inside loops, no matter how deeply nested, counting as 10. This is a coarse estimate, but it seemed to work better than, for instance, multiplying by 10 for each surrounding loop.

The register allocator in the linker collects the use information for all the modules being linked, and then builds a call graph for the program. If there are no recursive or indirect calls, then this call graph is a directed acyclic graph (DAG). Recursive and indirect calls require special handling, which we will describe shortly. The allocator then estimates the dynamic number of references of each variable, by multiplying each of the assembler’s per-procedure estimates by the static number of places the procedure is called. (In effect, we are estimating that each procedure is called exactly once from each call site. We tried getting a better estimate by traversing the call graph and multiplying estimated counts, so that if P calls Q ten times and Q calls R ten times, we estimate 100 calls for R. Since the assembler’s estimates are only rough guesses, this tended to give wildly wrong counts, some even overflowing a 32-bit integer. The current more conservative scheme works better.)

The idea behind our allocation of registers to local variables is simple. If two variables are local to different procedures, and these procedures are never simultaneously active, then the two variables are never simultaneously live. These two locals can therefore be grouped together and kept in the same register.*

* Michael L. Powell is responsible for this insight, for which I am most grateful.

We do this grouping by traversing the call DAG in
A B B D C A C E D E total number of registers needed Figure 4. A sample call DAG, with arrows showing calls in the program, and the allocation of pseudo-registers to locals of the procedures. reverse depth-first search order [4]. We start by allocating pseudo-registers to the locals of each leaf procedure, beginning with pseudo-register 0. We then allocate pseudo-registers to the locals of other procedures, beginning immediately after those used by its children. Figure 4 shows an example call DAG and the associated allocation of pseudo-registers to the procedures’ locals. The algorithm* is given in Figure 5. for each proc p in reverse depth-first search order do: – – First find the largest pseudo-register allocated in a –– previous iteration to the children of this procedure. – – Leaves have no children. childPseudos := 0 for each child q of p in the call DAG do: childPseudos := max (childPseudos, pseudos[q]) – – Now allocate pseudo-registers to p’s locals, starting –– where its children left off pseudos[p] := childPseudos for each local v of p do: allocate pseudo-register number "pseudos[p]" to v pseudos[p] := pseudos[p] + 1 Figure 5. Assigning locals to pseudo-registers. 333333333333333 * My thanks to Richard Beigel for this elegant algorithm. 10 EXPERIENCE WITH A SOFTWARE-DEFINED MACHINE ARCHITECTURE The register allocator then creates a pseudo-register for each individual global scalar. It gives each pseudo-register a reference frequency that is the sum of the frequencies of its variables, and sorts the pseudo-registers by frequency. Finally, it allocates an actual register to each of the most frequently used pseudo-registers. A variable whose address is taken anywhere in the program is not eligible for promotion. This is an extremely conservative decision and might be relaxed by using dataflow analysis and type information to determine the places where the address might be used. 
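As a cross-check of the Figure 5 algorithm, here is a direct Python transcription. The call DAG is represented as an adjacency list; the representation and names are invented for illustration.

```python
# Python transcription of Figure 5: assign pseudo-registers to locals
# by traversing the call DAG in reverse depth-first search order, so
# that every procedure is processed after all of its children.

def assign_pseudo_registers(dag, locals_of):
    order, seen = [], set()
    def dfs(p):
        if p in seen:
            return
        seen.add(p)
        for q in dag.get(p, []):
            dfs(q)
        order.append(p)                 # post-order: children first
    for p in dag:
        dfs(p)

    prnum, pseudos = {}, {}
    for p in order:
        # Largest pseudo-register count among the children (0 for leaves).
        child = max((pseudos[q] for q in dag.get(p, [])), default=0)
        n = child                       # start where the children left off
        for v in locals_of.get(p, []):
            prnum[(p, v)] = n           # allocate pseudo-register n to v
            n += 1
        pseudos[p] = n
    return prnum, pseudos
```

On the program of Figure 6 (main calls P and T, P calls R, T calls Q), R's locals get pseudo-registers 0 and 1, Q's local gets 0, P's gets 2, T's gets 1, and main's gets 3 — the allocation r1 through r4 shown there.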
When we have decided which variables to keep in registers, we know which register actions to apply to the code of each module. Applying these actions is easy, since each one is independent of context; the assembler did the hard work when it generated the actions in the first place.

3.3.3. Initialization

A global variable may be declared with an initial value. This value is preset in memory before the program begins to execute, so no instruction is responsible for assigning the value to the variable. If we promote such a variable, we must somehow get that initial value from the original memory location into the register. The assembler identifies these globals in the use information it puts in the object file. The startup routine that sets up the program environment before invoking the user main program includes a special register action, called INIT, that is unconditionally performed as part of register allocation. This action inserts instructions to load the initial value of each initialized global that gets promoted to a register.

A local variable declared with an initial value does not present this problem. The frontend compiler generates ordinary Mahler statements to assign the variable's initial value at the beginning of the procedure. The Mahler assembler then annotates this code just like code for explicit user statements.

A parameter is more like a global than like a local. The assembler assumes that parameters are passed on the stack, so the argument value is already in the parameter location when we enter the procedure. If a parameter is promoted, the module rewriter must insert code at the beginning of the procedure to load that value into the register.*

3.3.4. Indirect and recursive calls

Our algorithm for assigning pseudo-registers to variables depends on building a directed acyclic graph to describe the static call structure of the entire program. Indirect calls don't appear on the DAG at all, and recursive calls make the graph cyclic.
Something must be added to handle these kinds of calls.

One solution to the problem of indirect calls is to represent an indirect call by direct calls to each possible destination procedure. We could do this naively, by listing every procedure that is ever assigned to a procedure variable. We could also do it more carefully, by using type information and flow analysis to narrow down the possible candidates. The naive approach seems a bad idea, but the careful approach might well be appropriate for an object-oriented program with a preponderance of indirect calls.

* In fact, we can do better, by having the caller put the argument value directly into the parameter register. We discuss this in section 3.3.5.

Our solution is different. We don't record indirect calls in the call graph, and we insert a non-trivial spill and reload around each indirect call. In this approach an indirect call archives all of the apparently active locals. When we arrive at the called procedure we can then behave as if we had performed a normal call visible in the call DAG. The set of active locals consists of each local whose procedure is on any path in the call DAG from the root to the calling procedure. We chose our solution because it seemed likely that few of our applications would make heavy use of indirect calls; it was also easier. This seems not to have hurt us: a program whose author [30] describes it as ‘‘object-oriented’’ seems to benefit as much as the others.

We should note that our solution works only because we allow indirect calls only to top-level procedures. Suppose that P and Q are both nested in R and both refer to R's locals, and suppose that P calls Q indirectly. Then archiving R's locals around the call is wrong, because Q might legally change one of them.

For many years we thought recursion could be handled analogously to indirect calls. In an early report [40], we described a system that did so.
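As an aside, the set of apparently active locals archived around an indirect call — every local of a procedure lying on some root-to-caller path in the call DAG — can be computed from two reachability passes. This is a sketch with invented data structures, not Mahler code:

```python
# Locals to archive around an indirect call: the locals of every
# procedure on some call-DAG path from the root to the caller.

def active_locals(dag, locals_of, root, caller):
    def reachable(start, edges):
        seen, stack = set(), [start]
        while stack:
            p = stack.pop()
            if p not in seen:
                seen.add(p)
                stack.extend(edges.get(p, []))
        return seen

    rev = {}                             # reversed call DAG
    for p, callees in dag.items():
        for q in callees:
            rev.setdefault(q, []).append(p)

    # On a root-to-caller path  <=>  reachable from the root
    # and able to reach the caller.
    on_path = reachable(root, dag) & reachable(caller, rev)
    return {v for p in on_path for v in locals_of.get(p, [])}
```

On the Figure 6 program, an indirect call made from Q would archive the locals of main, T, and Q, but not those of P or R.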
We removed recursion from the call graph by building a depth-first spanning tree [4] and discarding the back edges. This gave us an acyclic graph on which we could perform our usual algorithm. At the call corresponding to the discarded back edge, we added a spill and reload for a possibly large set of variables. To determine the set of variables spilled, we examined the call DAG to find the procedures appearing on each path from the called procedure to the calling procedure. The locals of these procedures were spilled. On the assumption that the back-edge call really is recursive, these are the only locals that might be recursively re-instantiated.

Unfortunately, this isn't always correct. Figure 6 shows one of two possible depth-first-search DAGs for a counter-example program with five procedures. The solid lines are calls in the DAG; the dashed line from Q to P is a call not in the DAG. Procedure R has locals in r1 and r2, and procedure Q has a local in r1. This means that procedure P can keep a local in r3 (but not r2, because P's child R uses it). Procedure T, on the other hand, is allowed to keep a local in r2, because T's only child is Q. The recursive call from Q to P will archive all locals of both P and Q, namely r1 and r3 but not r2. Unfortunately, the series of calls shown as a dotted line leads to an error. Both T and R feel entitled to use r2, but r2 was not archived by the backward call to P. Thus the value in T's local can be damaged by R.

[Figure 6 shows the call DAG: main (r4) calls P (r3) and T (r2); P calls R (r1,r2); T calls Q (r1); a dashed back edge runs from Q to P.]

Figure 6. A depth-first-search call-DAG (solid) with a recursive call (dashed). The call chain shown (dotted) reveals a bug in our original algorithm for saving registers.

It was not hard to fix this problem. The better way to remove cycles from a directed graph is to partition the graph into its strongly-connected components* and then replace each component by a single node. Then we apply the bottom-up algorithm as before, with one difference. When we reach a node representing an entire recursive cycle, we must pick pseudo-registers for all locals of all procedures in the cycle. We can use the same pseudo-register for locals of two different procedures in the cycle, but we must insert a spill and reload on each path between them. Our system does this simply by treating each procedure in the recursive component independently; it inserts spills and restores around any call between strongly-connected procedures. If recursion were more common, it might be worth concentrating the spills and reloads at the less frequent calls.

* Nodes a and b are in the same strongly-connected component of a directed graph if there is a (possibly empty) path from a to b and a (possibly empty) path from b to a. Thus a strongly-connected component of a call graph is either a maximal group of procedures that can form a recursive cycle, or a single procedure that can never be in a recursive cycle [3].

3.3.5. Fast argument passing

The scheme we have described so far deals with argument passing in a simple-minded way. We evaluate each argument and place it on the stack, in the memory location associated with the parameter that we are about to create. If the called procedure has a promoted parameter, it loads the argument from the stack into that register. It would be better if the calling procedure could put the argument directly into the parameter register to begin with.

We use the same idea here that we used in expression evaluation. The assembler generates code to evaluate the argument and store it on the stack. It flags this code with the usual RESULT, STORE, and REMOVE actions. These actions are not qualified by the name of the parameter, however, but rather by its position in the parameter list of the called procedure.
(It cannot refer to the parameter by name because it generates the action at assembly time, when the called procedure may not be visible.) At link time, the module rewriter checks whether that parameter was promoted. If it was, the positional qualification is converted into a normal variable qualification.

We discussed the promotion of parameters to registers in section 3.3.3. In that scheme, the caller put each argument value on the stack. If we promote the parameter, we insert into the procedure entry code a load of the argument value into the parameter register. Now we must not do these loads, since the caller has already put the arguments directly into the parameter registers, and the value on the stack is garbage. Unfortunately, a caller still cannot do this if the call is indirect, because the identity of the called procedure is unknown. So we leave the inserted loads in place, and instead modify direct calls so that each enters the procedure after the point where we have inserted the loads.

3.3.6. Profiling

We do not keep all of the variables in registers, but only those that seem to be referenced frequently. As a result, the quality of the frequency estimates is important, especially in large programs. Instead of relying on estimates, we can use previously gathered execution-time profiles. Two kinds of profile are useful — a variable-use profile and a procedure-call profile.

A variable-use profile tells how many times each variable is loaded or stored. We can use these counts directly in place of the variable-use estimates that drive the selection of variables to promote. We built the variable-use profiler using the code modification facility in the Mahler linker. We will describe this in more detail in section 3.7.2. The variable-use profile has the disadvantage that we may need to recompute it often for it to be effective.
If the programmer adds a new variable to a procedure that is executed very frequently, that variable will not appear in the previously computed profile. It is therefore unlikely to get promoted. An alternative is to use gprof [17] to get a procedure-call profile telling us how many times each procedure was called. We can then combine these dynamic counts with the assembler's per-procedure estimates of variable use, which are computed every time the module is assembled. The resulting estimates will be less precise than a true variable-use profile, but also somewhat less sensitive to small changes in the program, and should be considerably more precise than estimates alone.

3.3.7. Intra-procedure coloring

In grouping local variables together into pseudo-registers, we assume very conservatively that a local variable is live whenever its procedure is active. This is a cheap way of estimating when locals of different procedures will conflict, and it is reasonable because the lifetimes of most procedures are short anyway. This simplification does have one important limitation: it does not allow us to combine two non-conflicting locals of the same procedure. Recognizing two such locals does not require interprocedural dataflow analysis, only intra-procedure analysis and coloring. The assembler does a liveness analysis on each procedure it translates, grouping together locals that do not conflict. It does not allocate registers to these groups, but simply reports them to the link-time allocator.*

The only tricky part is that the liveness analysis may allow a parameter to be combined with another local, or even with another parameter that for some reason is initially dead. If so, the register allocator must make sure that the live argument value gets properly loaded into the register that the two variables share.

* This analysis can be suppressed by the user, resulting in a faster compilation but producing code that is a few percent slower [40].
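The grouping step just described — combining non-conflicting locals of a single procedure — can be sketched as follows, under the simplifying assumption that each local's live range has already been reduced to one interval of instruction positions (real liveness analysis over arbitrary control flow is messier):

```python
# Greedy packing of non-conflicting locals of one procedure into
# shared groups.  Each local has a live interval (first definition,
# last use); two locals can share a group only if their intervals do
# not overlap.  Processing intervals in order of start point makes
# this greedy choice optimal for interval graphs.

def group_locals(intervals):
    groups = []                          # each entry: [members, last_end]
    for v, (start, end) in sorted(intervals.items(),
                                  key=lambda kv: kv[1][0]):
        for g in groups:
            if start > g[1]:             # no overlap: v can join this group
                g[0].append(v)
                g[1] = end
                break
        else:
            groups.append([[v], end])    # v conflicts with every group so far
    return [members for members, _ in groups]
```

Each resulting group would then be reported to the link-time allocator as a single candidate for a pseudo-register.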
3.3.8. Alternative allocation strategies

Register allocation works by first deciding which variables and constants we should keep in registers and then rewriting the code to reflect this decision. These two phases are essentially independent. The code rewriting works in the same manner no matter how we make the selection of variables to promote. This allows us, for example, to drive the selection either by frequency estimates or by a dynamic variable-use profile. It also allows us to exclude certain classes of variables from consideration explicitly, if we wish. In a multiprocessor system, for instance, we might wish to force global variables to reside in memory, because they are shared between processors with different register banks.

Another interesting notion is the distinction between cooperative allocation and selfish allocation. Cooperative allocation is the form described earlier. In cooperative allocation, a given procedure may not get all of its locals into registers. On the other hand, the allocation is based on the call graph, so spills and reloads around calls are not usually necessary. In selfish allocation, by contrast, each procedure is treated independently. It can keep all of its locals in registers if it wishes, but it must pay for this by saving and restoring those registers at entry and exit. In practice this means that it keeps a local in a register only if the local is used more than twice. The allocator uses frequency estimates or a dynamic profile to make that judgement.

3.3.9. Comparison with Steenkiste's allocation method

Steenkiste [35,36] independently developed an approach similar to ours. He does not do allocation in the linker and therefore has no need for annotations or module rewriting, but his algorithm for allocating registers to variables is much the same. The major difference is that Steenkiste does not use frequency information in allocating registers.
Both schemes traverse the call graph starting with the leaves, in order to combine local variables into groups. Our scheme allocates registers to the most important of these groups. Steenkiste's scheme allocates registers to the groups sufficiently close to the leaves. Whenever it runs out of registers, Steenkiste's scheme reverts to doing saves and restores. Thus procedures near the leaves use cooperative allocation, and procedures far from the leaves use selfish allocation. Globals do not fit well into Steenkiste's scheme, which can hamper performance.

3.3.10. Intermodule register allocation: Summary

The Mahler assembler prepares for intermodule register allocation by annotating the code with register actions and providing intra-procedure estimates of scalar use frequency. The Mahler linker combines these estimates to obtain estimates for the entire run, groups together non-conflicting locals based on the program call graph, and allocates registers for the most important groups. It then rewrites the code to reflect this choice, removing loads and stores and modifying other instructions. As we will see in section 4.1, between 60% and 80% of the references to scalars are removed in this manner.

Once register allocation is finished, we can schedule the resulting instructions for better pipeline performance, the subject of the next section.

3.4. Pipeline instruction scheduling

Our second link-time optimizing transformation is the pipeline instruction scheduler, originally written by Michael L. Powell [44]. The instruction scheduler finds the places where cycles will be wasted: a pipeline stall in accessing memory, a stall waiting for the coprocessor to finish, or a no-op after a branch. It then tries to move instructions around so that these wasted cycles are filled with productive instructions.
Because these productive instructions would otherwise require cycles of their own, the program gets faster.

The instruction scheduler works in two phases. First, it reorders each basic block independently, trying to fill stalls and branch slots from within the same basic block. Then it makes a final pass looking for branch slots that have not been filled. It tries to fill each unfilled slot with a useful or potentially useful instruction from the next block to be executed.

3.4.1. Scheduling a basic block

We say that two instructions conflict if they cannot be exchanged even when they are adjacent. This is true only if one of them modifies a register or memory location that the other instruction also uses or modifies. Usually it is impossible to tell whether two memory references are to the same location, so we must assume the worst and suppose that they are. The one exception is when the two references use the same base register and different displacements. These two references do not conflict even though an intervening instruction might change the value of the base register, because the intervening instruction will conflict with both memory references and the order of the three will be preserved.

The scheduler starts by computing the earliest time at which each instruction can be executed relative to the beginning of its basic block. This earliest time is determined by looking for conflicts with all preceding instructions. In the same way, the scheduler computes the latest time at which each instruction can be executed, relative to the end of the block, by looking for conflicts with instructions that follow it. The instructions in the block (except for the final branch, if present, and its branch slot) are then sorted according to latest start time, with ties broken according to earliest start time. Next, another pass looks for instructions that stall because of an interaction with the previous instruction.
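The conflict test and the earliest/latest computation can be sketched as follows — deliberately simplified, with unit latencies, no branch or slot handling, and all of memory modeled as a single location ‘‘mem’’ to reflect the conservative assumption above:

```python
# Simplified block scheduler: conflicts, earliest/latest times, and
# the sort by latest start time with earliest time as tie-breaker.
# An instruction is (writes, reads), each a set of locations.

def conflict(i, j):
    (wi, ri), (wj, rj) = i, j
    return bool(wi & (wj | rj)) or bool(wj & ri)

def schedule(block):
    n = len(block)
    earliest = [0] * n                   # relative to the block start
    for i in range(n):
        for j in range(i):
            if conflict(block[i], block[j]):
                earliest[i] = max(earliest[i], earliest[j] + 1)
    latest = [0] * n                     # relative to the block end
    for i in reversed(range(n)):
        for j in range(i + 1, n):
            if conflict(block[i], block[j]):
                latest[i] = max(latest[i], latest[j] + 1)
    order = sorted(range(n),
                   key=lambda i: (-latest[i], earliest[i]))
    return [block[i] for i in order]
```

For example, given a load of r1, an increment of r1, and an independent load of r2, the sort moves the independent load between the first load and its use — exactly the stall-filling behavior the scheduler wants.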
The scheduler delays such an instruction by pushing it later in the instruction sequence. It can be pushed past any instruction whose earliest allowed time precedes its own. At this point, the scheduler has filled as many of the memory stalls and coprocessor stalls in the basic block as it can.* If the block ends with an empty branch slot, the scheduler looks at the instruction that has been sorted to the position before the branch. If this instruction can be performed later than the branch itself, then it can be moved into the branch slot to replace the no-op there.

* The problem of finding an optimal schedule is unfortunately NP-complete, but more aggressive heuristics than ours have been explored [20].

3.4.2. Filling slots with future instructions

When the scheduler reorganizes a basic block, it may fill the branch slot with an instruction from earlier in the basic block. When it has reorganized all of the blocks, however, some branch slots may still be unfilled. The possibility remains of looking in the block that will be executed after a branch, and moving an instruction from there into the branch slot. A final pass over the object code tries to do this, by looking at the destination of each such branch.

If a branch jumps to an unconditional branch, we can sometimes change the first branch to bypass the second, by jumping directly to the same place. The only problem is that the second branch's delay slot may contain an instruction that must be executed. If the first branch is unconditional, we can guarantee that this instruction gets executed anyway by copying it into the slot of the first branch. We cannot bypass the second branch if both have non-empty branch slots, or if both are conditional.

If an unconditional branch jumps to a nonbranch, we copy the nonbranch target into the branch slot.
Then we change the branch so that it jumps to the instruction after the nonbranch. This also works if the destination is a conditional branch, because of the way delayed branches work.

The most interesting case is when a conditional branch jumps to a nonbranch. We may be able to fill the slot speculatively, either with an instruction from the block at the branch destination or with one from the block that we reach by falling through the branch. This is safe only if the instruction is harmless even when we end up in the wrong block; it is profitable whenever we end up in the right one. Looking at one of the two blocks, we can pick a candidate instruction from among those instructions whose earliest time is 0; these can be executed first and so could just as well be executed in the preceding branch slot. A candidate instruction from one block is safe if the register it modifies is dead at the beginning of the other block. Dataflow analysis would give us precise information about this, but we can do pretty well just by looking to see whether the other block sets that register without using it first.

If we pick the instruction from the fall-through block, we must delete it from that block. Otherwise it will be executed twice: once in the branch slot and once in the block that follows. We therefore consider using the fall-through block only if it cannot be branched to from elsewhere in the program. If we pick the instruction from the destination block, we move it to the beginning of that block, copy it into the branch slot, and increment the destination of the branch so that it skips over the original. This leaves the destination block intact in case there is another branch to it. We are not allowed to do this a second time to fill some other branch slot, however, so we mark the instruction in the destination block as immovable.

In Figure 7(a), the only safe candidates are the loads into r1.
It is worth noting that they make each other safe: moving either one into the branch slot is safe because the other one shows us that r1 is dead at the start of its block. The loads into r2 or r3 cannot be guaranteed safe, and the arithmetic operations and stores cannot even be candidates because their start times are greater than zero. Figures 7(b) and 7(c) show what happens if we select each of the two safe candidates.

    (a) original code        (b) fill from destination  (c) fill from fall-through

        if r4 = 0 goto L1        if r4 = 0 goto L1+1        if r4 = 0 goto L1
        no-op                    r1 := load v               r1 := load x

        r1 := load x             r1 := load x               r2 := load y
        r2 := load y             r2 := load y               r1 := r1 + r2
        r1 := r1 + r2            r1 := r1 + r2              goto L2
        goto L2                  goto L2                    store z := r1
        store z := r1            store z := r1

    L1: r3 := load u         L1: r1 := load v           L1: r3 := load u
        r1 := load v             r3 := load u               r1 := load v
        r3 := r3 - r1            r3 := r3 - r1              r3 := r3 - r1
        store z := r3            store z := r3              store z := r3
    L2:                      L2:                        L2:

Figure 7. Filling a conditional branch slot with instructions from each possible successor block.

If the branch is a backward branch, it is probably a loop and will therefore most likely be taken. We therefore prefer to use the destination block rather than the fall-through block. But a speculatively executed instruction in a branch slot is no worse than a no-op, so looking in both blocks seems sensible.

3.5. Cooperation between assembler and linker

We have already seen that the assembler provides information that the register allocator uses to select register variables and rewrite the code. We have also seen that the assembler marks tricky idioms so that the scheduler will not hurt them. There are other forms of cooperation as well.

One important example is our decision to do scheduling after global register allocation. The instruction scheduler never crosses module boundaries and might therefore be part of the assembler rather than the linker.
Doing scheduling before register allocation doesn't make much sense, however, because register allocation causes profound changes in the structure of a basic block. To begin with, register allocation removes many loads and stores, which are common sources of stalls. There would be little point in reorganizing the code to avoid a load stall if the load itself later disappears. Furthermore, by using more of the registers, register allocation can remove irrelevant conflicts between instructions. If the sequence

    r1 := load x
    r1 := r1 + 1
    store x := r1

appears in the middle of a block, other unrelated uses of r1 cannot be moved past it in either direction. If x is promoted, however, the increment of x no longer uses r1, and other instructions become more mobile.

The assembler helps the scheduler by allocating the dedicated temporary registers in round-robin order. When we are finished using r1, we will usually not allocate it for a new purpose until several instructions later. Thus even a block without many promoted variables will tend to have more mobile instructions than it would if it used the same one or two temporaries over and over.

This round-robin allocation of temporary registers might hamper the effectiveness of the inter-block phase of scheduling, because it makes it harder to be sure that a particular register is dead at the beginning of a block. The assembler alleviates this somewhat by starting the cycle over each time a new basic block begins. Thus the first temporary register assigned in a block is more likely to be r1 than any other, so it is more likely that two blocks will both have such assignments. In that case either one could be moved into the branch slot.

3.6. Loopholes

Sometimes you really do need precise control over the machine.
You might need to write absolutely the best possible version of a library routine that will be executed millions of times, like input/output conversion. Or you might need to write a routine that directly maintains part of the machine state, like an interrupt handler. Mahler has two mechanisms for getting this kind of fine control over the machine: builtin procedures and startup procedures.

The Mahler assembler recognizes the names of a small set of builtin procedures. The name of a builtin procedure is a Mahler keyword and cannot be redefined by the user. When the assembly language contains a call to a builtin procedure, the Mahler assembler translates it into inline code instead of calling code. Builtin procedures are used to implement operations that cannot be expressed (or expressed efficiently enough) in normal Mahler. A call to a builtin procedure looks just like a normal call, so it can be made from a high-level language as well as from a Mahler routine.

We have tried to design Mahler builtins with semantics that are as clean and high-level as practical. They should satisfy the user, but they should also let Mahler know what is happening so that its analysis will not be impaired. Thus we have shied away, for example, from builtins that simply let the user examine or change a specific register. Instead, we have tried to find out what problem the user was really trying to solve, so we could design a builtin to suit. Dangerous lower-level builtins are occasionally useful to let the user decide what is really needed. Nonetheless, they are used rarely; we have always tried to replace them with more meaningful builtins when we understand the problem better.

For example, Mahler has builtin procedures to spill and reload big pieces of the register state, for use in interrupt handling. It has builtin procedures to access certain extended-precision arithmetic instructions, for use by the input/output conversion routines.
These routines would likely have to be changed for a new generation of Titan, and perhaps it would be better to make the entire conversion routine a builtin. Even the C library routines setjmp, longjmp, and alloca are Mahler builtins, because Mahler must know when these things happen if it is to optimize data references safely. Adding a new builtin procedure to Mahler is straightforward, but it must be done by the Mahler implementor: the user cannot specify arbitrary machine-level code. We have found this mechanism adequate. Machine-level control of the registers and pipeline is handled adequately by the Mahler system itself, and machine-level instruction selection is for the most part expressible in the Mahler language — where it is necessary at all. The Titan is a simple machine, and most of its user operations are directly accessible from a high-level language.

The other Mahler loophole mechanism is the startup procedure. A startup procedure has no name and is never called by another routine. To give this routine control, some piece of magic is needed: either hardware intervention (for operating system bootstrap code) or operating system activity (for user code that begins a program execution). A startup procedure has no local variables and must ensure that a valid data environment exists before it can call a normal procedure or reference a global variable; the startup procedure can invoke builtins to do these setup tasks inline.

These two extensions were enough to avoid the use of any lower-level assembly language, even in the operating system. A UNIX operating system was experimentally ported to the Titan using only high-level languages and Mahler. Even the less pure port now in use includes only 1000 instructions of ‘‘true’’ assembly code, for bootstrapping and interrupt handling.
This code is not reachable by normal means, and the normal procedures that it calls are declared outside as procedure variables, so no change to Mahler's analysis is necessary.

3.7. Instrumentation

Mahler produces all of the code in any program. It also makes extensive changes to the code at link time. This makes it easy for the linker to insert instrumentation code of various kinds. We have taken advantage of this for a variety of different applications, both high-level and low-level, and both special-purpose and general-purpose [42].

A similar facility is provided by the pixie tool developed by Earl Killian at MIPS Computer Systems [28]. Pixie was developed independently of our system, but the possible modifications overlap with ours and its approach is in some ways similar. The pixie system works by transforming a fully-linked executable instead of the object files being linked. Thus it transforms the code after linking instead of before. This has two consequences. On one hand, their system is easier to use than ours: one must know only the name of the executable, rather than how to link its components. On the other hand, our approach is easier to implement. An isolated object file, prior to linking, contains the relocation table and loader symbol table, which make the transformation easier. Lacking this information, pixie must postpone much address translation until the modified program is executed, because it cannot reliably distinguish text addresses from data values at transformation time [25]. This introduces runtime overhead that makes pixie an unsuitable vehicle for optimization, though it has been used for several of the instrumentation purposes that Mahler has.

3.7.1. Instruction-level instrumentation

Our first instrumentation tool was an instruction-level profiler. We create a set of global counters in the instrumented program, one for each kind of event we want to count.
We then insert code to increment the associated counter wherever that event appears. If we insert an increment wherever a load appears, for example, running the program will count the loads performed. This lets us count occurrences of different kinds of instructions or of recognizable multi-instruction idioms like byte load or procedure call. We can also count pipeline stalls that occur completely within a single basic block, by examining each block and inserting an increment by the right amount. This requires only the same analysis that the scheduler itself must do. Because stalls sometimes occur across a block boundary, we also create some global state variables. When we see a block with an operation whose latency extends past the end of the block, we insert code to set the state variable, and when we see a block with an instruction that could be delayed because of that state, we insert code to check the state variable and conditionally increment the stall counter. Instruction-level instrumentation provides us with statistics that one can often acquire only by instruction-level simulation. Inserting the instrumentation at link time is expensive, but it is an order of magnitude cheaper than inserting it at compile time. Executing the instrumented program is expensive, too: the instrumentation can slow the program by an order of magnitude. But simulating the program instead would be slower by two to four orders of magnitude. For this kind of application, link-time instrumentation seems to be the best approach.

3.7.2. Profiling

The gprof [17] profiler is a useful tool for performance debugging of applications. The normal way to use it is to recompile all of your source modules with the compiler option -pg, and then link the new object modules with standard libraries that have themselves previously been compiled with the same option. Running the resulting instrumented program produces a file of profile data.
The only effect of compiling with -pg is to insert a lot of calls to a special routine named mcount, which is responsible for keeping track of who calls whom. These calls are inserted at the beginning of each procedure, and it is easy to make the Mahler linker do that itself. The advantages are considerable: relinking is one or two orders of magnitude faster than recompiling all modules, and we no longer need to maintain instrumented versions of the libraries. Profiling is now much more attractive. Just as we can insert profile code at the beginning of each procedure, we can insert code to count the executions of each individual basic block. The linker allocates a long vector of counts, one for each basic block in the program, and the inserted code for a block increments the count associated with that block. Our main use of the basic block execution counts is the construction of a variable-use profile. We observed earlier that register allocation can be improved by using actual variable reference counts rather than compile-time estimates. To build the variable-use profile, we combine the basic block counts with static information produced by the register allocator. This static information tells, for each basic block, how many variable references it contains, and which variables they are. This is easy to determine: each variable reference counted is a load or store marked for removal if the variable is promoted. Combining these per-block statistics with the dynamic block counts gives us a dynamic profile of variable use.

3.7.3. Register management strategies

In 1988, we reported the results of a study comparing register allocation techniques and hardware register windows [41]. Some machines [23,24,29,33] treat the register set as a flat space and rely on the language system to promote the variables.
Other machines [1,5,16,31] include hardware to divide the register set into a circular buffer of windows. When a procedure is called, the tail of the buffer is advanced in order to allocate a new window of registers, which the procedure can use for its locals. Without promotions, we would need to perform some number M0 of loads and stores to access scalar variables and constants. Promoting some of these to registers reduces this number to M1. (In either case, of course, we would need loads and stores to access array or pointer structures. These are not included in M0 or M1.) Unfortunately, any scheme for promotion also requires that we add some number S of new loads and stores to spill and reload promoted variables. With register allocation, the new loads and stores are needed when one makes procedure calls that would otherwise harm registers in use. This might be all procedure calls or, in our scheme, only recursive and indirect calls. With register windows, the new loads and stores are needed when one runs out of windows after a long series of calls or returns. When this happens, an exception occurs that is handled by spilling or reloading windows. Although these loads and stores do not appear in the user program, it seems only fair to charge for them. We can take both effects into account by computing a miss ratio (M1+S)/M0. The miss ratio is a measure of how thoroughly we were able to remove the loads and stores associated with scalars. If the miss ratio is zero, we have managed to keep all variables and constants in registers, and did not need to perform any spills and reloads. If the miss ratio is more than unity, then the spills and reloads slowed the program down more than keeping things in registers sped it up. The miss ratio is a property of the dynamic behavior of a program: a particular load or store instruction may be counted many times, or no times, depending on how often control passes through it.
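As a concrete illustration of the miss ratio (the counts here are invented for the example, not measurements from our study), a run that would perform 1,000,000 scalar loads and stores with nothing promoted, performs 200,000 after promotion, and adds 50,000 spills and reloads has a miss ratio of (200,000 + 50,000)/1,000,000 = 0.25. A minimal sketch:

```python
def miss_ratio(m0, m1, s):
    """Miss ratio (M1 + S) / M0 for a register promotion scheme.

    m0: scalar loads/stores executed with no promotion
    m1: scalar loads/stores remaining after promotion
    s:  extra loads/stores added to spill and reload promoted scalars
    """
    return (m1 + s) / m0

# Invented example counts: promotion removes 80% of scalar references
# but costs 50,000 spills and reloads.
print(miss_ratio(1_000_000, 200_000, 50_000))  # prints: 0.25

# A ratio of 0 means every scalar stayed in a register with no spills;
# a ratio above 1 means the spill traffic outweighed the savings.
```
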
We used our automatic instrumentation to compute the miss ratio for a variety of hardware and software promotion schemes. This usually required more than simply instrumenting the executed code. For register allocation, we inserted code to count not the loads and stores that were actually executed, but those that would have been executed with a given allocation scheme and a given number of registers. We similarly counted the times we passed a place where a load or store would have been removed. For hardware windows, the saves and restores depend on the history of recent calls and returns and on how deep the call chain is. We inserted instrumentation code to keep track of this and to simulate the state of the circular window buffer. This let us count hypothetical saves and restores at the times when a machine with windows would actually overflow or underflow. Some of the results will be sketched in section 4.1.

3.7.4. Address traces

An address trace is the sequence of instruction or data addresses referenced by a program during the course of its execution. Such a trace is useful in simulating the performance of different cache configurations. Previous techniques for acquiring address traces fall into two general categories. One is to physically monitor the address bus, using either a hardware monitor [13] or extra microcode [2], and log the addresses seen. This approach can slow execution by an order of magnitude, and in any case is not suitable for modern RISC machines, which are integrated on a single chip and have no microcode. The other approach is simulation. We can build an instruction-level machine simulator that also logs the memory references made by the simulated program. Unfortunately, simulation is very slow: simulating a program typically takes two to four orders of magnitude longer than executing it.
With the advent of machines with very large caches, we must use very long address traces to get realistic and useful results. Simulation is too slow to do this easily. Our approach is to use link-time code modification to instrument the code. Wherever a data memory reference appears, the linker inserts a very short, stylized subroutine call to a routine that logs the reference in a large buffer. The same thing is done at the beginning of each basic block, to record instruction references. Borg [8] provided this facility with operating system kernel support that lets the trace buffer be shared by several multiprogrammed processes and even the kernel itself. Each time the buffer fills up, a special untraced process is resumed to consume it, either dumping it to an output device or, more usefully, incrementally continuing a parameterized cache simulation.

4. Performance of register and pipeline management

How well does Mahler do at managing the registers and pipeline? By instrumenting the Mahler implementation and the code it produces, we were able to produce some answers to this question. We performed the measurements described in this section using five artificial benchmarks and ten real programs in use at WRL. These fifteen programs are summarized in Figure 8. Sed, egrep, troff, yacc, rsim, and mx are written in C, and the Boyer benchmark is written in Scheme, a dialect of Lisp; the rest are written in Modula-2.
             lines   vars  procs  remarks
Livermore      268    347     20  Livermore loops
Whetstones     462    413     37  Floating-point
Linpack        814    397     31  Linear algebra [14]
Stanford      1019    565     65  Hennessy’s suite [18]
Boyer          600   1702     98  Lisp theorem-proving [15]
sed           1751    259     31  Stream editor
egrep          844    241     28  File search
troff         7577    947    175  Text formatter
yacc          1856    689     73  Compiler-compiler
metronome     4287   1240    157  Timing verifier
rsim          3003    911    133  Logic simulator [39]
grr           5883   1833    243  Printed circuit board router
eco           2721   1008    150  Recursive tree comparison
ccom         10142   2193    333  C compiler front end
mx           26449   2869    354  Mouse-based editor [30]

Figure 8. The fifteen test programs.

4.1. Performance of intermodule register allocation

The aim of intermodule register allocation is to promote important variables and constants to registers, so as to avoid memory references. The miss ratio, defined in section 3.7.3, is a measure of how effective we are at this. We used automatic instrumentation to compute the miss ratio of our register allocation technique assuming a range of possible register set sizes. Figure 9 shows the miss ratios for each of the fifteen test programs, plotted as a function of the number of registers available for global register allocation. The left-hand graph shows the effects of register allocation based only on the compile-time estimates of variable use frequency. The right-hand graph shows the effects based on an actual variable-frequency profile for the same run. In this graph, six of the test programs had miss ratios very close to zero; these are labelled with numbers rather than names. These graphs extend beyond the actual limits of the Titan, in which 52 registers are available for global register allocation.
The dotted curve is the arithmetic mean of the miss ratios for all fifteen programs.

Figure 9. Miss ratio as a function of registers allocated, for allocation using estimates (left) or dynamic profile (right). (Key: 1 = yacc, 2 = whet, 3 = sed, 4 = livermore, 5 = egrep, 6 = linpak.)

There is considerable variation across the test suite, but it is not obviously correlated with the size of the program. The mx program is the largest, and moreover is written [30] in ‘‘object-oriented’’ style with frequent use of procedure variables, but it nevertheless does considerably better than average. Even with relatively few registers, most programs can make half or more of their scalar references from registers. On the Titan the average is around two-thirds. Use of a profile not only improves the curves, it also smooths them: each time a new register is available, the allocator makes the best choice of what to do with it. Each profile used here was for identical inputs to the program in question, so the curves on the right represent the best possible behavior.* Unsurprisingly, the profile makes the biggest difference when few registers are available. As discussed in section 3.7.3, we computed the miss ratio curves for several register management techniques. Figure 10 shows the average miss ratio for each scheme, as a function of the number of registers used. The left-hand graph shows the average for the five artificial benchmarks, and the right-hand graph shows it for the ten real programs. We examined six different schemes.
Each scheme was tested twice, basing its decisions first on compile-time variable-frequency estimates, and then on an actual variable-frequency profile. The labels on Figure 10’s curves are defined below; a ‘‘P’’ indicates that the scheme used a dynamic variable-use profile rather than estimates.

Link-time allocation (L, LP) is our technique, described in section 3.3.† It starts by keeping all scalars in memory except within a basic block, but improves this at link time based on intermodule analysis. It builds a complete call graph and an estimate of how often each scalar is used. Locals are combined into non-conflicting groups based on the call graph, and the most frequently used globals and groups of locals are promoted to registers. Little-used scalars continue to live in memory, and are brought into temporary registers whenever they are needed.

* This is optimistic, but perhaps not grossly so [43].
† The miss ratios for each benchmark under schemes L and LP were shown in Figure 9.

Compile-time allocation (C, CP) keeps each procedure’s locals in registers but saves and restores those registers on entry and exit so that this procedure does not interfere with others. Globals are not kept in registers, as this requires interprocedure analysis to be safe and effective.

Steenkiste allocation (S, SP) is an improvement of compile-time allocation. Steenkiste allocation starts with a compile-time allocation, in which each procedure uses essentially the same set of registers. Then it builds a complete call graph, and locals of procedures near the leaves are renamed to different registers so that saves and restores are not needed. The decision of which locals to rename is determined by the structure of the call graph and not by use frequency. Globals are not kept in registers.

Hybrid allocation (H, HP) is a combination of our link-time allocation and Steenkiste allocation.
It starts with a compile-time allocation with saves and restores in each procedure. A second phase then tries to use the remaining registers to rename locals so as to remove saves and restores, or to hold globals. This second phase is based on the estimated savings as well as on the structure of the call graph. As with both our link-time allocation and Steenkiste allocation, using the call graph allows the same register to be used for several non-conflicting locals. Fixed-sized windows (W, WP) assumes that the register set is divided into overlapping hardware windows in a circular buffer. The buffer shifts at each call, allowing eight new registers to be used. When a long chain of calls or returns causes the buffer to overflow or underflow, exactly eight registers are saved or restored to make room for the required window. Although these saves and restores are done by trap code, we assume that the trap has absolutely no overhead, and charge only for the loads and stores that occur. This is optimistic; real window machines have a significant trap overhead, and try to minimize it by saving and restoring several windows at a time. In that case, more may be saved or restored than is necessary. Variable-sized windows (V, VP) assumes that the register set is used as a circular buffer of variable-sized windows. At each call, the buffer shifts for the number of registers the procedure needs for its locals. When the buffer overflows or underflows, exactly enough registers are saved or restored to make room for the required window. Again, we assume that the window traps have no overhead, and we therefore charge only for the loads and stores. In Figure 10, if we look only at the five artificial benchmarks, we might conclude that the link-time schemes L and H were clear winners over the others. The results of the ten real programs suggest that the advantage is not so dramatic. 
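The overflow and underflow accounting for the window schemes can be sketched in a few lines. This is a hypothetical reconstruction, not the instrumentation code we actually inserted: it models a circular buffer that holds a fixed number of fixed-size windows and charges exactly eight register loads or stores per window spilled or reloaded, as in scheme W.

```python
class WindowBuffer:
    """Circular buffer of fixed-size register windows (scheme W).

    Hypothetical model: `depth` windows fit in the register file at
    once; each overflow or underflow spills or reloads exactly one
    window of `window_size` registers.
    """
    def __init__(self, depth, window_size=8):
        self.depth = depth              # windows that fit in registers
        self.window_size = window_size  # registers per window
        self.resident = 1               # windows currently in registers
        self.call_depth = 1             # live frames on the call stack
        self.spills = 0                 # registers stored on overflow
        self.reloads = 0                # registers loaded on underflow

    def call(self):
        self.call_depth += 1
        if self.resident == self.depth:      # buffer full: overflow trap
            self.spills += self.window_size  # spill the oldest window
        else:
            self.resident += 1

    def ret(self):
        self.call_depth -= 1
        if self.resident == 1 and self.call_depth >= 1:
            # caller's window was spilled earlier: underflow trap
            self.reloads += self.window_size
        else:
            self.resident -= 1

buf = WindowBuffer(depth=4)
for _ in range(6):      # a call chain deeper than the buffer...
    buf.call()
for _ in range(6):      # ...followed by the matching returns
    buf.ret()
print(buf.spills, buf.reloads)  # prints: 24 24
```

Charging only these loads and stores, with no trap overhead, matches the optimistic accounting we apply to schemes W and V in the measurements below.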
Compile-time allocation, Steenkiste allocation, and windows have smaller average miss ratios for the real programs than they do for the artificial benchmarks. This merely shows that artificial benchmarks are unrealistic. Modern allocation techniques aim to exploit the locality of procedures. Unfortunately, artificial benchmarks often have unrealistically small procedures and an unrealistically high ratio of global references to local references.

Figure 10. Miss ratio as a function of registers allocated, for each register management scheme, averaged over the five artificial benchmarks (left) or the ten real programs (right).

For most of our range of study, variable-sized windows don’t do noticeably better than fixed-sized windows. On the Titan, with 52 registers available, windows and link-time allocation do about equally well. The variable-window curve flattens out almost immediately, suggesting that windows would have the biggest advantage over allocation when only a few registers are available. We should remember, though, that the miss ratio for windows does not include the overhead of the trap for window overflow or underflow; its apparent advantage for small register sets may be spurious. A profile had little effect on schemes not designed to exploit one, like Steenkiste’s. Nevertheless, a profile was occasionally of use even to these schemes: it revealed variables that were never used at all and that therefore did not need a register. Having implemented both Steenkiste’s scheme and our own, we found the hybrid scheme an easy generalization. It turns out to be reasonably successful.
It combines the best of Steenkiste’s scheme (complete coverage of locals, at the smaller expense of saves and restores) with the best of our link-time scheme (inclusion of globals and use of frequency information). Although it has little to recommend it over our normal scheme when applied to the artificial benchmarks, its performance on the real programs is good. It stays consistently ahead of our scheme, and takes an early lead that makes it suitable even when relatively few registers are available, giving it a clear advantage over both its parents. When we first made the comparison, the hybrid scheme did several percent worse than our original scheme. The reason was interesting. In our standard libraries, several routines make calls through procedure variables, even when it is not necessary. The exit routine is one such, as are all routines that set up arguments for system calls. In either Steenkiste allocation or the hybrid scheme, these routines, along with any routines above them in the call graph, are rendered ineligible for renaming of local registers. This is overkill, however. We never return from the exit routine, and so it would be safe to rename locals above it on the call graph. Similarly, the routine that performs a system-call trap is short and stylized, and poses no threat to renamed registers above it. The designers of a system using this sort of register allocation would probably tune their libraries so that they did not unnecessarily handicap the allocator. Alternatively, they might build in knowledge of these exceptions so that the allocator could work around them. The latter would be more dangerous but might be easier. We assumed that the problem would be addressed in some manner, and wanted only to get realistic measurements of the Steenkiste and hybrid schemes. We therefore adopted the latter approach.
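The call-graph grouping that both our scheme and the hybrid rely on can be illustrated with a small, invented example. Two procedures may share a register for their locals whenever neither is reachable from the other in the call graph, because they are then never live on the stack at the same time. The sketch below is our own reconstruction, not the Mahler allocator: it greedily assigns each procedure’s locals to the lowest-numbered register group that no caller or callee of that procedure already uses.

```python
def reachable(graph):
    """Map each procedure to the set of procedures it can transitively call."""
    reach = {p: set() for p in graph}
    def visit(p, seen):
        for q in graph.get(p, []):
            if q not in seen:
                seen.add(q)
                visit(q, seen)
    for p in graph:
        visit(p, reach[p])
    return reach

def group_locals(graph):
    """Assign each procedure a register group.  Two procedures may share
    a group only if neither reaches the other in the call graph, i.e.
    they can never coexist on the call stack."""
    reach = reachable(graph)
    conflicts = {
        p: {q for q in graph if q != p and (q in reach[p] or p in reach[q])}
        for p in graph
    }
    group = {}
    for p in graph:  # greedy assignment in declaration order
        used = {group[q] for q in conflicts[p] if q in group}
        g = 0
        while g in used:
            g += 1
        group[p] = g
    return group

# Invented call graph: main calls f and g; f and g each call h.
calls = {"main": ["f", "g"], "f": ["h"], "g": ["h"], "h": []}
print(group_locals(calls))
```

Here f and g end up sharing a group because neither can be on the stack while the other runs, which is exactly the effect of combining locals into non-conflicting groups.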
The link-time allocation schemes (L and H) are clear winners asymptotically, though it must be admitted that, without a profile, other schemes work better with small register sets. As future machines exhibit higher levels of integration and more instruction-level parallelism, we are likely to see an increase in the size of register sets. This will make the advantages of link-time allocation schemes even more important. The better performance of the link-time schemes comes from two advantages over the other schemes. First, the link-time analysis makes it safe to include globals without risk of aliasing errors. Second, the use of variable-frequency estimates (or profiles) lets them use their registers preferentially for important variables. By including this kind of interprocedural and intermodule analysis, we could improve the window schemes enough to give them comparable miss ratios [41]. Still, if we are willing to do the intermodule analysis, it is unclear why we should want the windows.

4.2. Performance of pipeline instruction scheduling

Our other link-time optimization is pipeline scheduling. The Titan CPU can waste cycles because of a pipeline stall or an unfilled branch slot. The scheduler hides these CPU-wasted cycles by moving other instructions into them, in an intra-block phase followed by a modest inter-block phase, as described in section 3.4. Because the inter-block phase fills branch slots speculatively from the likely successor block, we must be careful how we count CPU-wasted cycles. A branch slot filled with an instruction from the destination block, for example, is useful whenever the branch is actually taken, and is wasted only when the branch is not taken; the reverse is true if we fill the slot from the fall-through block. We used Mahler’s instruction-level instrumentation to count useful and wasted cycles in unscheduled code, in code with only intra-block scheduling, and in code with full scheduling.* The results are shown in Figure 11.
CPU-wasted cycles in unscheduled code made up between 20 and 60 percent of all cycles. We normalized the CPU-wasted cycles for each benchmark so that the wasted cycles for unscheduled code is unity; thus the fraction removed is zero. In the left-hand graph, global register allocation was done; in the right-hand graph it was not. The dotted curves are the arithmetic means over all fifteen programs. In most cases the second phase of scheduling was less effective than the first, because the second phase is trying only to fill branch slots. Two of the programs, however, improved proportionally more in the second phase than in the first. The spread of improvements was about the same with and without register allocation. On the average, scheduling removed about one fourth of the CPU-wasted cycles.

* This does not count wasted cycles due to effects outside the processor, such as cache misses or page faults.

Figure 11. Fraction of all CPU-wasted cycles that are removed by different scheduling intensities, with global register allocation (left) and without (right).

Also of interest is how well the scheduler can fill branch slots with useful instructions. As before, we must consider how often a branch slot is usefully executed rather than how often it contains an instruction other than a no-op. The results are summarized in Figure 12, which shows how often various kinds of branch slots contained useful instructions when executed, averaged over all the test programs.* The certainty of an unconditional jump allows us to fill the slot almost all the time.
In fact, the only exceptions are certain branches whose destinations are unknown, as when we are jumping into a table. Because a backward conditional branch is likely to be a loop, we try to fill its slot from the destination block in preference to the fall-through block. The result is that the slot after a backward conditional branch is likely to be usefully filled around two-thirds of the time.

* The fraction of useful branch slots is not zero even without scheduling, because the Mahler assembler fills in a few branch slots itself, within idioms (see section 3.1).

Figure 12. Fraction of branch slots that are useful for three kinds of branches, with and without register allocation.

Figure 12 shows that global register allocation has little effect on either intra-block scheduling or inter-block filling of unconditional branch slots. Unfortunately, it does significantly reduce our ability to speculatively fill the slot of a conditional branch in the inter-block phase. This is because register allocation broadens the set of registers referenced. In code without register allocation, the first instruction of a block is likely to be a load of some variable into the first temporary register. It is likely that any pair of possible successor blocks will both start with such a load instruction. In that case, the scheduler will know that either one can safely be moved into the preceding branch slot. After register allocation, however, these loads are likely to be absent; the first instruction is probably an operation using the register allocated for the variable that was loaded in the original code. The variables used by the two successor blocks will not necessarily be the same.
The scheduler will therefore have more trouble guaranteeing that a candidate instruction will be harmless if control goes the other way. Although register allocation degrades the filling of forward branches and backward branches by comparable proportions on the average, there is an interesting asymmetry between the two. Figure 13 shows the effect of register allocation on the fraction of usefully filled branch slots after scheduling. The left-hand graph shows forward conditional branches, and the right-hand graph shows backward conditional branches. Each graph shows the fraction of slots that are useful in each of the fifteen test programs, first when register allocation is not done and then when it is done. The range of fractions without register allocation is about the same in either case. The effect of register allocation on the two cases is quite different, however. Register allocation degraded the filling of forward branch slots by degrading most of the programs by about the same amount. In contrast, it degraded the filling of backward branch slots by spreading out the distribution. Some programs had many backward branch slots filled, with or without register allocation. Others did much worse under register allocation. The variation does not seem to be related to differences in the relative abundance of forward and backward branches. Why this asymmetry exists is still unclear.

Figure 13. Effect of register allocation on the dynamic fraction of useful branch slots after scheduling, for forward conditional branches (left) and backward conditional branches (right).

5.
A retrospective assessment

Performing optimizations at link time is not free. The register actions increase the size of an object file, typically by a factor of 2. Register allocation by itself slows down the link step by a factor of 1.5 to 3. Asking for both register allocation and scheduling slows it down by a factor of 2 to 4. We had hoped it would be less than this; slowdowns of 2 to 4 are enough to keep some people from using link-time optimizations routinely. Because our register allocation exploits intermodule information, an alternative would be to require complete recompilation, so that the intermodule information can be collected. A complete recompilation takes anywhere from 5 to 50 times as long as a non-optimizing link, not including recompiling the relevant library modules. An optimizing link is clearly preferable. Whether people at WRL are willing to accept the slowdown of optimizing at link time depends on how fast our programs run after optimization. Figure 14 shows the runtimes of the fifteen test programs for various combinations of register allocation and scheduling, relative to the runtimes when neither was done. Even without profiling, the average payoff of using both is a 25% improvement in speed. What would we do differently? Our biggest mistake, despite our attempts to avoid it, was that we pushed too much into the linker. The Mahler assembler should have done intraprocedure promotion of variables, inserting saves and restores that the link-time allocator would remove. This would have led directly to the hybrid scheme described in section 4.1, making the link-time part of the algorithm both more effective and faster. With fewer memory references making it past the assembler, we might even have been able to do pipeline scheduling in the assembler rather than the linker. We were also disappointed in the low appeal of profile-driven allocation.
Figure 14. Execution times for various optimization combinations, normalized to the case of no link-time optimization.

Even though using profile-driven allocation made programs 10% to 15% faster than using normal link-time allocation, few people went to the trouble. In retrospect, this is not terribly surprising. The programs that one wants to squeeze the last bit of performance from are usually the programs that run a long time, which means they are not programs one wants to run as part of the build step. Perhaps if a profile had offered a 50% or 90% improvement this would be different, but our experience may serve as a warning to those exploring profile-driven optimization. We did not attempt to make the debugger understand link-time optimized programs. This was due mainly to a lack of people. Register allocation should not have been a significant problem, since a variable is either promoted for its entire lifetime or not promoted at all. Pipeline scheduling can rearrange instructions enough to make statement-oriented breakpoints difficult, but there is now a fairly well-understood tradeoff between code motion and debugging [19]. It is interesting to note that we had merely to relink if we wanted to convert between an optimized program and a debuggable one.
This is considerably more convenient than if we had had to recompile; had the inconvenience been greater, we might have taken the trouble to make the debugger smarter.

Our greatest success was the technique of link-time code modification. By itself, register allocation at link time may be overkill; very global optimization by monolithic compilation of source files or of intermediate-code files [12,21,36,45], or perhaps by reference to persistent program data bases [34], could still turn out to be a better tradeoff. But our machinery for code modification led us to develop a wide variety of tools for performance analysis at the source and machine levels. Most of these tools actually require very little of the machinery we implemented, but their advantage is profound: they have transformed gprof from something we used a few times a year into something we use a few times a month or week.

Late modification of the compiled program, though not a new idea, is getting an increasing amount of attention. It is a handy stage at which to reorder procedures or basic blocks to improve cache performance [27,32], because only when the code is in our hands do we know how big basic blocks and procedures really are. Benitez and Davidson [6] describe an optimizing linker that does fairly general peephole optimizations. Johnson [22] discusses an elegant approach called postloading: code modification occurs after linking and is completely separate from the linker, but the linker always retains in the executable file enough information for the postloader to understand the uses of addresses and the like.

Finally, we are also pleased with our use of Mahler as the ‘‘software-defined architecture’’ for the Titan. Three related but different machines have been designed under the Mahler umbrella, with few changes needed in the front-end compilers.
The implementation of Mahler was good enough that users interested in performance were not heard to wish that they could get at the underlying machine. The main benefit of the ‘‘RISC philosophy’’ is the ability to make sensible tradeoffs between hardware and software. A software-defined architecture like Mahler allows us to make these tradeoffs flexibly and without causing turmoil in the higher levels of the environment.

History and acknowledgements

In 1982, Forest Baskett founded Digital Equipment’s Western Research Lab. Its initial project was the development of the Titan, a high-performance scientific workstation intended to exploit the advantages of simplified architecture and hardware-software tradeoffs. Baskett and Michael L. Powell were interested in doing very global allocation before I joined the lab, and provided many prods and insights. Powell implemented our first pipeline scheduler and the associated instruction-level instrumentation. David Goldberg invented the first version of the Mahler language and a global register allocator that worked at compile time, as well as the name ‘‘Mahler.’’ Loretta Guarino Reid provided a linker that worked on Titans, and contributed to the interface between the linker proper and the code modification facilities. Richard Beigel did a prototype study of register allocation during his summer with us, suggesting that allocation based on the call graph would be effective. Many thanks must also go to Jud Leonard, Gene McDaniel, Jeff Prisner, and Neil Wilhelm, for first advocating and then tolerating Mahler’s extreme approach toward hiding the underlying machine.

In more recent years, newcomers have contributed to the applications in performance measurement. Anita Borg took the hack that enabled address tracing and incorporated it into a system that allows traces of multiprogram sets and even of kernel activity.
Scott Nettles and Jeff Mogul spurred the implementation of link-time gprof. Joel Bartlett, Patrick Boyle, Mary Jo Doherty, Alan Eustace, Norm Jouppi, Jeff Mogul, and John Ousterhout were unyieldingly critical of drafts of this paper, forcing me to strengthen it considerably. My thanks to them. Finally, I must thank all of WRL’s varying population over the years, for helping to build a stimulating environment in which strange ideas and real systems turn out to be the same thing.

Appendix 1. Generation of annotated code

We assume that a basic block is represented as a sequence of commands of three kinds. A leaf command says to obtain the value of a particular variable. An assign command says to assign the value of a particular previous command to a particular variable. An operate command says to perform a particular operation on the results of two previous commands and make the new result available to later commands. There are links from an operate or assign command to the commands it uses, and a link from a leaf or assign command to the symbol table entry for the variable it uses. Commands also have the following fields, which we will fill as we perform the algorithm:

    index    : cardinal;           -- index in basic block
    firstUse : link to Command;    -- link to first command using c
    lastUse  : link to Command;    -- link to last command using c
    assigned : link to Variable;   -- where c gets assigned, if available
    inLeaf   : boolean;            -- c is leaf and will be avail in leaf var

We also need Variables to have a field

    nextAssn : cardinal;           -- index of command that next assigns to v

that we use for bookkeeping while we decorate the commands. The first step is to find out if each value will be available, either in a leaf variable or as the result of an assignment, for as long as we need it. This is done with a forward pass over the basic block, followed by a backward pass.
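As an illustration, the command records and the two decoration passes can be rendered in runnable form. The following is a Python sketch, not the Mahler source: the class layout, the list-based block representation, and the decorate entry point are assumptions for this example, though the field names and the pass logic follow the pseudocode.

```python
class Variable:
    """Symbol-table entry for a variable."""
    def __init__(self, name):
        self.name = name
        self.nextAssn = 0   # bookkeeping: index of next command assigning to this variable

class Command:
    """One command in a basic block: kind is 'leaf', 'assign', or 'operate'."""
    def __init__(self, kind, var=None, operands=()):
        self.kind = kind                 # 'leaf' reads var; 'assign' stores operands[0] into var
        self.var = var                   # symbol-table link for leaf/assign commands
        self.operands = list(operands)   # links to earlier commands whose results this one uses
        # fields filled in by the two passes:
        self.index = 0          # index in basic block
        self.firstUse = None    # first command using this one
        self.lastUse = None     # last command using this one
        self.assigned = None    # variable this result gets assigned to, if available
        self.inLeaf = False     # leaf whose value remains available in its variable

MAXINT = 2**31 - 1

def decorate(block):
    """Forward pass then backward pass over a basic block, filling in the fields."""
    # Forward pass: number the commands and record first and last uses.
    for index, c in enumerate(block):
        c.index = index
        c.assigned = None
        c.inLeaf = False
        c.firstUse = None
        for c2 in c.operands:
            if c2.firstUse is None:
                c2.firstUse = c
            c2.lastUse = c
        if c.kind in ('leaf', 'assign'):
            c.var.nextAssn = MAXINT
    # Backward pass: decide which values stay available in variables.
    for c in reversed(block):
        if c.kind == 'assign':
            rhs = c.operands[0]
            if rhs.firstUse is c and c.var.nextAssn >= rhs.lastUse.index:
                rhs.assigned = c.var
            c.var.nextAssn = c.index
        elif c.kind == 'leaf':
            if c.var.nextAssn >= c.lastUse.index:
                c.inLeaf = True
```

For a block computing a := a + b (leaf a, leaf b, operate, assign), the passes mark both leaf values as remaining available in their variables and record that the operation's result is assigned to a; the assignment at the end does not invalidate the leaf value of a, because that value's last use precedes the assignment.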
The forward pass is as follows:

    index := 0
    for each command c do
        c.index := index
        index := index + 1
        c.assigned := nil
        c.inLeaf := false
        c.firstUse := nobody
        for each command c2 that is an operand of c do
            if c2.firstUse = nobody then c2.firstUse := c
            c2.lastUse := c
        if c is ‘‘leaf v’’ or ‘‘v := rhs’’ then v.nextAssn := MAXINT

and the backward pass is as follows:

    for each command c, in reverse order do
        if c is ‘‘v := rhs’’ then
            if (rhs.firstUse = c) and (v.nextAssn >= rhs.lastUse.index) then
                rhs.assigned := v
            v.nextAssn := c.index
        elseif c is ‘‘leaf v’’ then
            if v.nextAssn >= c.lastUse.index then c.inLeaf := true

Then we can generate annotated code. The main algorithm for this is as follows:

    procedure GenCode ( )
        for each command c do
            if c = ‘‘leaf v’’ then
                Generate an instruction to load v into a temporary register
                FlagLeaf (c, v)
            elseif c = ‘‘v := rhs’’ then
                Generate an instruction to store rhs from its temp reg into v
                FlagAssn (c, v, rhs)
                FlagOp (1, rhs, c)
            else  -- c is an operation
                Generate one or more instructions to perform the operation.
                Call FlagOp (n, opnd, c) for any instructions that use the value of opnd as the n-th operand, where opnd is some operand of c
                Call FlagResult (c) for any instructions that produce the result of this operation

This algorithm uses the following four subsidiary algorithms to annotate individual instructions:

    procedure FlagLeaf (c : Expr; v : Symbol)
        if c.inLeaf and c.assigned = x then REMOVE.v RESULT.x
        elseif c.inLeaf then REMOVE.v
        else LOAD.v

    procedure FlagAssn (c : Expr; v : Symbol; rhs : Expr)
        if rhs.firstUse ≠ c then STORE.v
        elseif rhs.assigned = nobody then STORE.v
        elseif rhs = ‘‘leaf x’’ and rhs.inLeaf then STORE.v REMOVE.v KEEP.x
        elseif rhs = ‘‘leaf x’’ then STORE.v
        else REMOVE.v

    procedure FlagResult (c : Expr)
        if c.assigned = x then RESULT.x

    procedure FlagOp (n : cardinal; opnd, c : Expr)
        if n = 1 then op := OP1 else op := OP2
        if opnd = ‘‘leaf v’’ and opnd.inLeaf then op.v
        if opnd.assigned = x and opnd.firstUse ≠ c then op.x

References

[1] Advanced Micro Devices. Am29000 Streamlined Instruction Processor User’s Manual. Advanced Micro Devices, Inc., 901 Thompson Place, P. O. Box 3453, Sunnyvale, CA 94088, 1987.

[2] Anant Agarwal, Richard L. Sites, and Mark Horowitz. ATUM: A new technique for capturing address traces using microcode. Proceedings of the 13th Annual Symposium on Computer Architecture, pp. 119-127. Published as Computer Architecture News 14 (2), June 1986.

[3] Alfred V. Aho, John E. Hopcroft, and Jeffrey D. Ullman. The Design and Analysis of Computer Algorithms, pp. 189-195. Addison-Wesley, 1974.

[4] Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman. Compilers: Principles, Techniques, and Tools, pp. 660-664. Addison-Wesley, 1986.

[5] Russell R. Atkinson and Edward M. McCreight. The Dragon processor. Proceedings of the Second International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 65-69.
Published as Computer Architecture News 15 (5), Operating Systems Review 21 (4), SIGPLAN Notices 22 (10), October 1987.

[6] Manuel E. Benitez and Jack W. Davidson. A portable global optimizer and linker. Proceedings of the SIGPLAN ’88 Conference on Programming Language Design and Implementation, pp. 329-338. Published as SIGPLAN Notices 23 (7), July 1988.

[7] Robert Bernstein. Multiplication by integer constants. Software — Practice and Experience 16 (7), pp. 641-652, July 1986.

[8] Anita Borg, R. E. Kessler, Georgia Lazana, and David W. Wall. Long address traces from RISC machines: Generation and analysis. Seventeenth Annual International Symposium on Computer Architecture, pp. 270-279, May 1990. A more detailed version is available as WRL Research Report 89/14, September 1989.

[9] G. J. Chaitin. Register allocation & spilling via graph coloring. Proceedings of the SIGPLAN ’82 Symposium on Compiler Construction, pp. 98-105. Published as SIGPLAN Notices 17 (6), June 1982.

[10] Gregory J. Chaitin, Marc A. Auslander, Ashok K. Chandra, John Cocke, Martin E. Hopkins, and Peter W. Markstein. Register allocation via coloring. Computer Languages 6, pp. 47-57, 1981.

[11] Frederick C. Chow. A Portable Machine-Independent Global Optimizer — Design and Measurements. PhD dissertation, Stanford University. Available as Computer Systems Laboratory Technical Note 83-254, Stanford University, December 1983.

[12] Fred C. Chow. Minimizing register usage penalty at procedure calls. Proceedings of the SIGPLAN ’88 Conference on Programming Language Design and Implementation, pp. 85-94. Published as SIGPLAN Notices 23 (7), July 1988.

[13] Douglas W. Clark. Cache performance in the VAX-11/780. ACM Transactions on Computer Systems 1 (1), pp. 24-37, February 1983.

[14] Jack J. Dongarra. Performance of various computers using standard linear equations software in a Fortran environment. Computer Architecture News 11 (5), pp.
22-27, December 1983.

[15] Richard P. Gabriel. Performance and Evaluation of Lisp Systems, pp. 116-135. The MIT Press, 1985.

[16] Robert B. Garner, et al. The Scalable Processor Architecture (SPARC). Digest of Papers: Compcon 88, pp. 278-283, March 1988.

[17] Susan L. Graham, Peter B. Kessler, and Marshall K. McKusick. gprof: a call graph execution profiler. Proceedings of the SIGPLAN ’82 Symposium on Compiler Construction, pp. 120-126. Published as SIGPLAN Notices 17 (6), June 1982.

[18] John Hennessy. Stanford benchmark suite. Personal communication.

[19] John Hennessy. Symbolic debugging of optimized code. ACM Transactions on Programming Languages and Systems 4 (3), pp. 323-344, July 1982.

[20] John Hennessy and Thomas Gross. Postpass code optimization of pipeline constraints. ACM Transactions on Programming Languages and Systems 5 (3), pp. 422-448, July 1983.

[21] Mark I. Himelstein, Fred C. Chow, and Kevin Enderby. Cross-module optimizations: Its implementation and benefits. Proceedings of the Summer 1987 USENIX Conference, pp. 347-356, June 1987.

[22] S. C. Johnson. Postloading for fun and profit. Proceedings of the Winter ’90 USENIX Conference, pp. 325-330, January 1990.

[23] Norman P. Jouppi and Jeffrey Y.-F. Tang. A 20 MIPS sustained 32 bit CMOS microprocessor with high ratio of sustained to peak performance. IEEE Journal of Solid-State Circuits 24 (5), pp. 1348-1359, October 1989.

[24] Gerry Kane. MIPS R2000 RISC Architecture. Prentice Hall, 1987.

[25] Earl A. Killian. Personal communication.

[26] Ruby B. Lee. Precision Architecture. IEEE Computer 22 (1), pp. 78-89, January 1989.

[27] Scott McFarling. Program optimization for instruction caches. Third International Symposium on Architectural Support for Programming Languages and Operating Systems, pp. 183-191, April 1989. Published as Computer Architecture News 17 (2), Operating Systems Review 23 (special issue), SIGPLAN Notices 24 (special issue).

[28] MIPS Computer Systems.
RISCompiler and C Programmer’s Guide. MIPS Computer Systems, Inc., 930 Arques Ave., Sunnyvale, California 94086, 1989.

[29] Michael J. K. Nielsen. Titan system manual. WRL Research Report 86/1, September 1986.

[30] John Ousterhout. Personal communication.

[31] David A. Patterson. Reduced instruction set computers. Communications of the ACM 28 (1), pp. 8-21, January 1985.

[32] Karl Pettis and Robert C. Hansen. Profile guided code positioning. Proceedings of the SIGPLAN ’90 Conference on Programming Language Design and Implementation, pp. 16-27. Published as SIGPLAN Notices 25 (6), June 1990.

[33] George Radin. The 801 minicomputer. Proceedings of the Symposium on Architectural Support for Programming Languages and Operating Systems, pp. 39-47. Published as SIGARCH Computer Architecture News 10 (2), March 1982, and as SIGPLAN Notices 17 (4), April 1982.

[34] Vatsa Santhanam and Daryl Odnert. Register allocation across procedure and module boundaries. Proceedings of the SIGPLAN ’90 Conference on Programming Language Design and Implementation, pp. 28-39. Published as SIGPLAN Notices 25 (6), June 1990.

[35] Peter Steenkiste. Lisp on a Reduced-Instruction-Set Processor: Characterization and Optimization. PhD dissertation, Stanford University. Available as Stanford Computer Systems Laboratory Technical Report CSL-TR-87-324, March 1987.

[36] Peter A. Steenkiste and John L. Hennessy. A simple interprocedural register allocation algorithm and its effectiveness for LISP. ACM Transactions on Programming Languages and Systems 11 (1), pp. 1-32, January 1989.

[37] J. Strong, et al. The problem of programming communication with changing machines: A proposed solution. Communications of the ACM 1 (8), pp. 12-18, August 1958, and 1 (9), pp. 9-15, September 1958.

[38] Thomas G. Szymanski. Assembling code for machines with span-dependent instructions. Communications of the ACM 21 (4), pp. 300-308, April 1978.
[39] Christopher J. Terman. User’s Guide to NET, PRESIM, and RNL/NL. M.I.T. Laboratory for Computer Science, 545 Technology Square, Room 418, Cambridge, Massachusetts.

[40] David W. Wall. Global register allocation at link time. Proceedings of the SIGPLAN ’86 Symposium on Compiler Construction, pp. 264-275. Published as SIGPLAN Notices 21 (7), July 1986. Also available as WRL Research Report 86/3.

[41] David W. Wall. Register windows vs. register allocation. Proceedings of the SIGPLAN ’88 Conference on Programming Language Design and Implementation, pp. 67-78. Published as SIGPLAN Notices 23 (7), July 1988. Also available as WRL Research Report 87/5.

[42] David W. Wall. Link-time code modification. WRL Research Report 89/17, September 1989.

[43] David W. Wall. Predicting program behavior using real or estimated profiles. Proceedings of the SIGPLAN ’91 Conference on Programming Language Design and Implementation, pp. 59-70. Published as SIGPLAN Notices 26 (6), June 1991. Also available as WRL Technical Note TN-18.

[44] David W. Wall and Michael L. Powell. The Mahler experience: Using an intermediate language as the machine description. Second International Symposium on Architectural Support for Programming Languages and Operating Systems, pp. 100-104. Published as Computer Architecture News 15 (5), Operating Systems Review 21 (4), SIGPLAN Notices 22 (10), October 1987. A more detailed version is available as WRL Research Report 87/1.

[45] William Wulf. Personal communication.