Digital Technical Journal, Volume 10, Number 1 Dtj_v10 01_1998 Dtj V10 01 1998
dtj_v10-01_1998 dtj_v10-01_1998
User Manual: dtj_v10-01_1998
Open the PDF directly: View PDF .
Page Count: 111
Download | |
Open PDF In Browser | View PDF |
I PROGRAMMING LANGUAGES & TOOLS Volume 10 Number 1 1998 Editorial jane C. Blake, Managing Editor Kathleen M. Stetson, Editor Hden L. Patterson, Editor The Digital Technicaljoumalis a refereed AlphaServer, Compaq, tl1e Compaq logo, journal published quarterly by Compaq DEC, DIGITAL, tl1e DIGITAL logo, MA 01460-1289. Computer Corporation, 550 King Street, LKGI-2jW7, Littleton, Circulation sending a check in U.S. funds (made payable Kristine M. Lowe, Administrator to Compaq Computer Corporation) to the Production rates arc $40.00 (non-U.S. $60) for four issues Hard-copy subscriptions can be ordered by Christa W. Jessica, Production Editor Elizabeth McGrail, Typographer Peter R. Woodbury, Illustrator Advisory Board Thomas F. Gannon, Chairman (Acting) Z. Harbert Scott E. Cutler Donald pub lis hed- by address. General subscription and $75.00 (non-U.S. $115) for eight issues. University and college professors and Ph.D. students in the elecu·icaJ engineering and com puter science fields receive complimentary sub notification when a new issue is available on the Internet. Single copies and back issues can be ordered by sending tl1e requested issue's volume and number and a check for $16.00 (non-U.S. $18) each to tl1e published-by address. Recent issues arc also available on me Internet at mentary subscription orders can be sent Corporation. Copying wimout fee is per mitted provided that such copies are made f·or usc in educational institutions by faculty members and are not distributed for com mercial advantage. Absu·acting with credit of Compaq Computer Corporation's author ship is permitted. The information in tl1e jo u rnal is subject to change without notice and should not be construed as a commitment by Compaq Computer Corporation or by the compa nies herein represented. Compaq Computer Corporation assumes no responsibility for any errors that may appear in t11e./OII/'I/Cii. ISSN 0898-90IX Documentation Number EC-P9706-I8 Book production was done by Quantic Communications, Inc. The cover was designed by Lucinda O'Neill of the Compaq Industria! and Graphic Design Group. Corporation. sively through X/Open Company Ltd. tl1e published-by or electronic mail address. formance possible for software applications. SPEC and SPECint are registered trademarks of Standard Performance Evaluation can also be made by calling U1e.fournal Copyright© 1998 Compaq Computer transforms code to extract the highest per International, Inc. UNIX is a registered trademark in the United and may be sent to tl1e managing editor at forms common elements into precious gold International Business Machines C01voration. Solaris is a registered trademark of Sun States and in other countries, licensed exclu requests to contact autl1ors are welcomed to represent the compiler developer who marks of Roque Wave Software, Inc. RS/6000 is a registered trademark of mail address, ctj@compaq.com. Inquiries Comments on the content of any paper and wc have chosen the alchemist who trans Roque Wave and .h++ are registered trade published-by address or tl1e electronic office at 978-506-6858. & Tools, specifi Corporation. Microsysrems, Inc. to the Dlj!,ital Technica/Joumal at tl1e Programming Languages NULLSTONE is a trademark ofNullstonc SPARC is a registered trademark of SPARC Inquiries, address changes, and compli cally on compiler software. For the cover, MIPS is a registered trademark of MIPS Technologies, Inc. Compaq employees may order subscrip http://web rc.das.dec.com. Cover Design Microsoft, Visual C++, Windows, and http://www.digital.com/ dtj. tions through Readers Choice at URL This special issue of the jounw/ focuses on IlUX is a registered trademark of Silicon Graphics, Inc. of Microsoft Corporation. This service will send an electronic mail Robcrt M. Supnik Intel and Pentium are registered u·ademarks of Intel Corporation. aged to contact tl1eir sales representatives. http:jjwww.digital.com/subscription. Richard F. Lary Corporation. Windows NT are registered trademarks no charge by accessing URL Alan G. Nemeth DIGITAL UNIX, FX132, and OpenVMS arc trademarks of Compaq Computer may qualify tor gift subscriptions and arc encour scriptions upon request. Compaq customers Electronic subscriptions are available at William A. Laing ULTIUX, VAX, and VMS are registered in the U.S. Patent and Trademark Office. Other product and company names mentioned herein may be trademarks and/or registered trademarks of their respective owners. December 1998 A letter to readers of the Dip,ital Technicaljournal This issue is the last Digital Technicaljournal to be published. Since 1985, the Journal has been privileged to publish intormation about significant engineeting accomplishments for DIGITAL, including standards-setting network and storage teclmologies, industry-leading VAX. systems, record-breaking Alpha microproces sors and semiconductor technologies, and advanced application software and performance tools. The Journal has been rewarded by continual growth in rhe number of readers and by rheir expressions of appreciation for the quality of content a.nd presentation. The editors dunk rhe engineers who somehow made d1e time to write, the engi neering managers who supported rhem, rhe consulting engineers and professors who reviewed manuscripts and made rhe process a learning experience for all of us, and, of course, the readers who are the reason the Journal came into existence 13 years ago. With kind regards, Jane Blake Managing Editor Kathleen Stetson Editor Helen Patterson Editor Digital Technical Journal Volume 10 Number 1 Contents Introd uction C. Rober t Morgan, Guest Editor 2 Foreword William C. Blake 4 Tracing and Characterization of W i ndows NT-based Jason P. Casmira, David P. Hunter, 6 System Workloads and David R. Kael i Automatic Template Instantiation i n DIGITAL C++ Avru m E . I tzkowitz and Lois D . Foltan 22 Hemant G. Rotithor, Kevin W. Harris, 32 Measurement and Analysis of C and C++ Performance and Mark W. Davis August G. Reinig 48 Compiler Optimization for Superscalar Systems: P hilip 58 Global I nstruction Scheduling without Copies and Brett L. H uber Maximizing M ulti processor Performance Mary W. Hall, Jennifer M . Anderson, with the S U I F Compiler Saman P. Amarasinghe, Brian R. Murp hy, Alias Analysis in the DEC C and DIGITAL C++ Compilers H. Sweany, Steven M. Carr, 71 Shih-Wei Liao, Eduoard Bugnion, :md Monica S. Lam Debugging Optimized Code: Concepts and Ronald F. Brender, Jeffrey E. Nelson, Implementation on DIGITAL Alpha Systems and Mark E. Arsenault D i fferentia l Testing for Software William M. McKeeman 81 100 Introduction The complexity of high-performance Profiling describes the point in the systems and d1e need tor ever-increased program that is most frequently performance to be gained from those executed. Tracing describes the systems creates a challenge for engi commonly executed sequence of neers, one d1at requires bod1 experience instructions. In addition to helping and innovation in the development developers build more efficient of software tools. The papers in this applications, this information assists issue of tJ1 e ]ournal are a few selected exa mp le s of the work performed C. Robert Morgan within Compaq and by researchers Every compi l er consists of two worldwide to advance me state of me components: the front end, which Technical Program Manage1; art. In fact, Compaq supports rele analyzes the specific language, and Core Technology Croup vant research in programming lan the back end, which generates opti guages and tools. mized instructions for the target Senior Consulting Engineer and Compaq has been developing balance of both components. As lan than thirty years, starting with the guages such as C++ evolve, the com Fortran compiler for the DIGITAL piler front end must also evolve to PDP-10, introduced in 1967. Later keep pace. C++ has now been stan compilers and tools for VAX com dardized, so evolutionary changes puter systems, introduced in 1977, made the VA.'< system one of me most will Jessen. However, compiler devel usable in history. The compilers and front-end techniques for implement opers must continue to improve ing the language to ensure ever better plary. With the introduction of the application performance. An impor VfuY.. successor in 1992, the 64-bit tant feature of C++ compiler develop RISC Alpha systems, Compaq has ment is C++ templates. Temp lates continued me tradition of developing may be implemented in multiple advanced tools that accelerate appli ways, with varying effects on appli cation performance and usability for cation programs. The paper by system users. The papers, however, Itzkowitz and Foltan describes represent not only the work of Compaq's efficient implementation Compaq engineers but aJso that of of templates. On a related subject, researchers and academics who are working on problems and advanced techniques of interest to Compaq. The paper on cbaractetization of Rotid1or, Hanis, and Davis describe a systematic approach Compaq has developed for monitoring and improving C++ compiler perfor system workloads by Casmira, Hw1ter, mance to minimize cost and maxi and Kaeli addresses the capture of mize function and reliability. basic data needed for me development Digital Technical Journal machin e. An efficient compiler is a high-performance tools for more debugger f or VAXjVMS are exem 2 designers and implementers of future Windows NT systems. Improved optimization techniques of tools and high-performance appli for compiler back ends are presented cations. The authors' work focuses in three papers In the first of d1e se on generating accurate profile and trace data on machines ru n ning the Reinig addresses the requirement in an optimizing compiler for an accu Windows NT operating system. rate description of the variables and Vol. 10 N o . 1 1 99 8 . , fields that may be changed by an by Brender, Nelson, and Arsenault assignment operation, and describes reports an advanced developmt:nt an efficient technique used in the project at Compaq to provide tech C/C++ compilers for gathering this niques for the debugger to discover information. Sweany, Carr, and Huber a more accurate image of the state of describe techniques for increasing the program. These techniques are execution speed in processors like currently being added to Compaq the Alpha that issue multiple instruc debuggers. tions simultaneously. The technique One of the problems that tool reorders the instructions in the pro developers face is increasing tool reli gram to increase the number of ability. Tool developers, therefore, instructions that are simultaneously test the code. However, developers issued. Maximizing the performance are often biased; they know how their of multiprocessor systems is the sub programs operate, and they test cer ject of the paper by Hall et al., which tain aspects of the code but not oth was previously published in IEEE ers. The paper by McKeeman describes Computer and a technique called differential testing updated with an addendum for this issue. The authors that generates correct random tests of describe the SUIF compiler, which tools such as compilers. The random represents some of the best research nature of the tests removes the devel in this area and has become the basis opers' bias. The tool can be used for of one part of the ARPA compiler two purposes: to improve existing infrastructure project. Compaq tools and to compare the reliability assisted researchers by providing the of competitive tools. DIGITAL Fortran compiler fi-ont end and an AJphaServer 8400 system. As compilers become more effec The High Performance Technical Computing Group and the Core Technology Group within Compaq tive in increasing application program are pleased to help develop this issue performance, the ability to debug of the]ou rn al. Studying the work the programs becomes more difficult. performed within Compaq and by The difficulty arises because the other researchers worldwide is one compiler gains efficiency by reorder way tlut we remain at the cutting ing and eliminating instructions. edge of technology of programming Consequently, the instructions for language, compiler, and program an application program are not easiJy ming tool research. identifiable as part of any particular statement. The debugger cannot always report to the application pro gram where variables are stored or what statement is currently being executed. Application programmers have two choices: Debug an unopti mized version of the program or find some other technique for determn i ing the state of the program. The paper Digital Technical Journal Vol. 10 No. I 1998 3 Foreword You might think that the cover of this issue of the Digital William C. Blake Director, High Performance Technical Computing and Core Technologv Gruups Tecbnicaljournal piled into those instructions. This the relevance of those ancient alchemists semantic gap between programming in the drawing to the computer-age languages and machine instructions is topic of programming languages and central to the evolution of compilers tools? Certainly, both alchemists and and to microprocessor architectures programmers work busily on new as well. The compiler developer's role tools. An even more interesting is to help close tbe gap by preserving metaphorical connection is the the correctness of the compilation alchemist and the compiler software and at the same time resolving the developer as creators of tools that trade-offs between the optimizations transform (transmute, in the strict needed tor improvements "close to sense of alchemy) tbe base into the the programmer" and those needed precious. The metaphor does, how "close to the machine." To put the work described in tl1is and folklore of alchemy, the science journal and technology of compiler software think about the changes in compiler development is a real and important requirements over tl1e past 15 years. into context, it is helptl.IJ to part of processing a new solution or It was in the early 1980s that the direc algorithm into the correct and high tion of future computer architectures est performance set of actual machine changed rrom increasingly complex instructions. This issue of tl1ejournal instruction sets, CISC, that supported addresses current, state-of-the-art high-level languages to computer work at Compaq Computer Corp architectures with much simpler, oration on programming languages reduced instruction sets, RJSC. Three and tools. key research efforts led the way: the Gone are the days when program mers plied their craft "close to the Berkeley RJSC processor, the IBM 801 RISC processor, and the Stanford machine," tlut is, working in detailed MIPS processor. Nl three approaches machine instructions. Today, system dramatically reduced the instruction designers and application developers, set and increased the clock rate. The driven by the pressures of time to RISC approach promised improve market and technical complexity, ments up to a factor of five compared must express their solutions in terms witl1 CISC machines using the same "close to the programmer" because manufacturing technology. Compaq's people think best in ways that are transition rrom the VAX to the Npha abstract, language dependent, and 64-bit RISC architecture was a direct machine independent. Enhancing result of the new architectural trend. the characteristics of an abstract high-level language, however, con Digital Technical Journal programmer must be correctly com is a bit odd. After all, what could be ever, break down. Unlike the mytl1 4 the high-level programs close to the As a consequence of these major architectural changes, compilers and flicts with the need tor lower level their associated tools became signifi optimizations tl1at make tl1e code cantly more important. New, much run f:1stest. Computers still require more complex compilers for RISC detailed machine instructions, and machines eliminated the need tor the Vol. 1 0 No . I 1998 large, microcoded CISC machines. speedup enhancements. In the next The complexities of high-level lan 1 5 years, Moore's Law may be stopped guage processing moved from the by the physical reali6es of scaling lim petritied software of CISC micro processors to a whole new generation of optimizing compilers. This move caused some to claim that ruse really its. But Amdahl's Law will be broken as well, as improvements in parallel language, tool development, and new methods of achieving parallelism wiU stands for "Relegate Important Stuff posi6vely affect the future of compil to Compilers." ers and hence application performance. The introduction of the third-gen eration Alpha microprocessor, the 21264, demonstrates that the shift to As you will see in papers in this issue, there is a new emphasis on increasing execution speed by exploiting the ruse and AJpha system implementa multiple instruction issue capability of tions and compilers served Compaq AJpha microprocessors. Improvements customers well by producing reliable, in execu6on speed will accelerate dra accurate, and high-performance com matically as future compilers exploit puters. In fact, AJpha systems, which performance improvement techniques have the ability to process over a bil using new capabilities evolved in AJpha. lion 64-bit floating-point numbers Compilers will deliver new ways of per second, pertorm at levels formerly hiding instruc6on latency (reducing attained only by specialized super the pertormance gap bel:\veen vector IUSC superscalar computers. It is not surprising that processors and the AJpha microprocessor is the most machines), improved unrolling and frequendy used microprocessor in the optimization of loops, instruction top 500 largest supercomputing sites reordering and scheduling, and ways in the world. After reading through the papers of dealing with parallel decomposi6on and data layout in nonuniform in this issue, you may wonder what is memory architectures. The challenges next for compilers and tools. As phys to compiler and tool developers will ical limits curtail the shrinking of sili undoubtedly increase over 6me. con feature sizes, there is not likely to be a repeat of the performance gains By not relying on hardware improvements to deliver all the at the microprocessor level, so atten increases in performance, compiler tion will turn to compiler technology wizards are making their own contri and computer architecture to deliver butions- always watchful of correct the next thousandfold increase in sus ness first, d1en run-time performance, tained application pertormance. The and, finally, speed and efficiency of the two principal laws that atfect drama6c software development process itself application pertormance improve ments are Moore's Law and Amdahl's Law. Moore's Law states d1at perfor mance will double each 1 8 months due to semiconductor process scaling; and Amdahl's Law expresses the diminishing returns of various system Digital Technical Journal Vol . 1 0 No. 1 1 998 5 I Jason P. Cas mira David P. Htmter David R. Kaeli Tracing and Characterization of Windows NT-based System Workloads To optimize the design of pipelines, branch pre dictors, and cache memories, computer archi tects study the characteristics of benchmark programs by examining traces, i.e., samples of program execution. Since commercial desktop applications are increasingly dependent on ser vices and application programming interfaces provided by the host operating system, the authors argue that traces from benchmark exe cution must capture operating system execution in addition to native application execution. Common benchmark-based workloads, how ever, lack operating system execution. This paper discusses the ongoing joint efforts of the Northeaster n University Computer Architecture Research Laboratory and Compaq Computer Corporation's Advanced and Emerging Tech nologies Advanced Development Group to cap ture operating system-rich traces on Alpha based machines running the Windows NT oper ating system. The authors describe the latest PatchWrx software toolset and demonstrate its trace-generating capabilities by characterizing numerous applications. Included is a discussion of the fundamental differences between using traces captured from common benchmark pro grams and using those captured on commercial desktop applications. The data presented demonstrates that operating system execution can dominate the overall execution time of desktop applications such as Microsoft Word, Microsoft Visual C/C++, and Microsoft Internet Explorer and that the characteristics of the operating system instruction stream can be quite different from those typically found in benchmarking workloads. 6 Digital Technic� I journal Vol. 10 No. l 1 99 8 The computer architecture research co mmun iry com monly uses trace-driven sim ulation in pursuing answers to a variety of design issues. Archi tects spend a significant amou n t of ti me studying the characteristics of benchmark programs by examining traces, i .e., sam ples taken from program execu tion . Popu lar bench mark programs include the SPEC' and the BYTEmark2 benchmark test s u i tes. Si nce the underlyi ng assump tion is that these programs generate workloads that represent user applications, today's computer designs have been optimized based on the cl1aracteristics of these benchmark programs. Although the authors of popu l ar benchmarks arc wel l i n tentioned, the resulti ng workloads lack operat ing system execution and consequently do n o t repre sent some of the most prevalen t desktop applications, e.g., Microsoft Word , Mi crosoft Visual C/C++, and Microsoft Internet Explorer. Such applications make heavy use of app lication programming inted:1ces (APis ) , which in turn exec ute many instructions in the operating system. As a resu lt, the overal l performance of many desktop applications depends on efficien t operating system interaction . C learly operating system overhead can greatly reduce the benefits of a new compu ter design feature. Past archi tectural studies , however, have generally ignored operating system interaction because few tools can generate operating system-ric h traces. This paper d iscusses the ongoing joi n t ef forts of Northeastern U niversi ty and Compaq Computer Corporation to capture operating system-rich traces on DIGITAL Alpha-based machi nes running the M icrosoft Windo>vs NT operating system . We argue th:tt tor u·aces of today's workloads to be accurate, they must capture the operating system execution as well as the native appli cation execution . This need to capture complete pro gram u·ace i n formation has been a dtiving fen-ce behind the development and use of software tools such as the PatchWrx dynamic execution-tracing too lset, which we desctibe i n this paper. The PatchvVrx toolset was origi nally developed by Sites and Perl at Digi tal Equ ipment Corporation's Systems Research Center. They described P:ttchWrx, as developed for vVindows NT version 3.5, in "Studies of Windows NT Performance Using Dynamic Execution far from complete, t h i s list provides a sample of the Traces."> The Northeastern University Computer tools that have been used to generate input to a variety Architecture Research Laboratory and Compaq's of trace-driven sim ulation studies. 'vVe have character Advanced ized each tool in terms of the three issues (criteria) pre Development Group continue to develop t h e toolset. viously mentioned. Table llists the target plattorm(s) We have updated the fra mework to operate under for each tracing tool. Advanced and Emerging Technologies Wi ndows NT version 4.0, added the ability to trace Note that many of these tools cannot capture oper programs that have code sections larger than 4 mega ating system activity. For those that can, their associ bytes (MB), added multiple trace buffer sizes, and ated slowdown can significantly affect the accuracy of developed additional postprocessing tools. the captured trace . Of the tools that provide this capa After briefly discussi ng related tracing tools, we bility, PatchWrx introduces the least amount of slow describe the PatchWrx toolset and specify the new down yet mai ntains the integrity of the address space. features we have ad ded. We then analyze PatchWrx The next section discusses the Patch Wrx toolset. traces captured on W i ndows NT version 4.0, demon strating the capabilities of the tool while illustrati ng PatchWrx the i m portance of capturing operating system-rich traces. In the final section, we su m m arize the paper, Patch Wrx discuss the current limitations of the toolset, and sug developed for use on the Alpha- based Microsoft is a dynamic execution-traci ng toolset gest new directions for development and study. W indows NT operating system. The toolset utilizes the Privileged Architecture Library (PAL) facility, also Trace Generation Tools referred to as PALcode, of the Alpha microprocessor Trace-driven simulation has been the method of can instrument, i . e . , choice for evaluating the merits of various architec tion and system binary i mages, including the kernel, tural trade-offs.'5 Traces captured from the system operating system services, drivers, and shared libraries. to perform tracing with minimal overhead .2' PatchWrx patch, all W indows NT applica under test are recorded and replayed through a model The PAL faci l ity i s a set of architected fu nctions and of the proposed architecture instructions that provides a consistent interface to a set researchers have proposed methodologies that capture of complex system functions. These routines provide both application and operati ng system references. These tools include hardware- based"- 10 and software pri mitives for memory management, context switch design. Computer ing, interrupts, and e xceptions. based' Hs methods . Some of the issues involved in cap turing operating system-rich traces are Patch Wrx and the Alpha PAL Routines The PatchWrx software tool is made possible through l. Tracing overhead (system slowdown ) the PAL used by DIGITAL Alpha microprocessors. 2. Accuracy (perntrbation of the memory address space) PAL routines have access to physical memory and 3. Completeness ( capturing all desired i n formation, i nternal hardware registers and operate with interrupts disabled . PALcode is loaded from disk at system boot e .g . , the operating system reference stream) Table 1 contains a list of 10 tracing tools that have been developed over the past 10 to 15 years. Although tim e . We modified and extended the shrink-wrapped Alpha PALcode on a DIGITAL Alpha 21064-based system to support the PatchWrx operations. The mod- Tab le 1 S a m p l e of Trac i n g Too l s Average Slowdown Addr ess Pertur bation Operating System Activity Platfor m ATOM'3 lOX to lOOX No Yes DIGITAL Alpha UNIX ATUM'6 20X No Yes DIGITAL VA X Ope nVMS EEL" lOX to lOOX Yes No SPARC Solaris Etch'" 35X Yes No Intel x86 Microsoft Wind ows NT V4.0 NT-Atom" lOX to lOOX No No DIGITAL Alpha M i crosoft Windows N T V4.0 PatchWrx3 4X No Yes DIGITAL Alpha M i crosoft Windows NT V4.0 Pixie'-0 lOX to lOOX Yes No DIGITAL MIPS ULTRIX Q P T 12 lOX to lOOX Yes No SPARC Sola ris, DIGITAL ULTRIX Shade2' 6X No No S PARC Solaris SimOS14 1 OX to 50,000X No Yes DIGITAL Alpha UNIX, SGIIRIX, SPARC Solar is Name Digital TechnicJ! Journal Vol. 10 No. I 1998 7 i fied PatchWrx PAL rou t i nes serve two major pur ( l ) to reserve the trace bufkr at system boot time and ( 2 ) to l og trace e n tries at trace ti m e . poses: We d e fi n e a patched i nstr uction as a n i nstruction wit h i n a n i m age's code section that is overwri tten wi th a n u ncon d i tional branch ( B R) to a patc h . The target of O n e w a y that PatchWrx mai ntains a l o w operati n g the B R contains the parch sec/io n . The patch sec tion overhead i s t o store t h e captu red trace i n a p h ysical i ncludes the trap ( CA L L_PA L ) to the appropriate PA L memory bu fter, w hi c h is reserved at boot time. The routine t h a t l ogs a trace e ntry correspond i n g to the s i ze of the bu ffer can be varied depen d i n g on t h e type of i nstruction p<1tched and t h e return branch to a m o u n t of physical m e mory i nstal led on the system . the origi n a l target. S i n ce we use PAL rou ti nes to reserve this mem ory, the PatchWrx docs not m od i fy the origi n a l b i na r y operati n g system i s not aware that the m e m o ry e x ists i mages; i nstead , i t generates n e w i m ages t h a t conta i n because the PALcode performs all low - l evel system i n i patches. This operation preserves t h e origi nal i mages ti a l i zation before t h e operating system is started. on the system in case they need to be restored . PatchWrx logs all trace e n tries in this b u ffer. Wri ti ng trace e ntries d i rectly to p h ysical memory h as several I nstru mentation i nvolves replaci n g all bra n c h i ng i nstr uctions of type u n conditional bran c h , con d i tional advantages. F i rst, writing to memory is m u c h faster bra n c h ( e . g . , b ra n c h i f e q u a l to zero [ B EQ] ) , branch th;m wri t i n g to d i sk or to tape. Seco n d , u s i n g p h ysical to subroutine ( BS R ) , fu n c tion retu rn memory a l l ows tracing of the lowest levels of the oper ( J M P ) , and j u m p to su bro u ti n e ( J S R) w i t h i n an ( RET ) , j u m p ating system ( i . e . , the p a ge fau l t h:md ler) without gen i m age 's code section with u n co n d i tional bra nc h es to era ti ng page fa u l ts . T h i r d , usi n g p h ysical memor}' a patch secti o n . If loads a n d stores are a lso trace d , a l l ows tracing across m u l ti pl e t h reads r u n n i n g in m u l PatcbWr x rep l aces t h ese i n structions ( e . g . , l oad sign tiple add ress spaces regardJess of which ad d ress sp a ce i s e x te nd e d currently r u n n i ng . branches t o tl1e patch secti on, where t h e o riginal load To enabl e PatchWrx t o operate u nder Wi ndows NT l o n gword [ LD L ] ) with u n co n d i ti o n a l o r store i nstruction i s copi ed . A return branch is also 4 . 0 , we started with the PA L rou needed to return control flow to the i nstruction s u bse tines mod i fi ed by S ites a n d Perf and made additional q u e n t to the original load . Wnen PatchWrx enco u n versions 3 . 5 1 and mod i fications as req u i red by the operating system ver ters this patc h , t h e tool records t h e register value of the sions . These m od i fi cations were concentrated in the original load or store i nstruction i n the trace log. The process d ata structu res . The PatchWr x -specitlc PAL p atch section con ta i n s all the patches for the i m age 2 . The fi rst t h ree routines a n d is added to t h e rewritten i mage. Figure l s h ows are used for read i n g the trace e n tries tl·om t h e bu ffer e x a m p les of patched i n structi o n s . Patch \rVrx rep laces and for t u rn i ng tracing on and off. The rema i n i n g five o n l y branch i n structions within a n i mage to red uce the ro u ti n es are listed i n Ta ble rou ti n es are used to log trace e n tries based on the type type and n u m be r of e n tries logged in the trace bu ffer. of i nstruction i nstru mented . Usi n g these traced bra n c h es, the tool can later recon struct the basic blocks they represent. Patch Wrx Image Instrumentation Next we describe how we use PatchWrx to i nstru ment As s h own in Figure 1 , PatchWrx repl aces B R a n d M icrosoft Wind ows NT i m ages. Patc h i ng the o perat J M P i nstructions w i t h B R i nstructions t h a t transfer i ng system i n volves the i nstru m e n tation of ::d l the instruction i s re peated i n t h e patch section for the p u r binary i m ages, i n c l ud i n g app l i cations, operating sys pose of record i n g the va l u e o r· the target register ( i f control to the patch secti o n . The ori g i n a l BR or J M P tem cxecutables, l ibraries, and kern e l . O n ce patc h i n g necessary) i n to the trace bu ffe r w h e n the patched is complete , trace e ntries a r e logged by means o r' PA L i mage is exec u ted . T h i s register val u e is necessary ro u tines as i mages execute. reconstru cti ng the traced i nstruction stream . Patch\Vrx Table 2 PatchWrx-specific PAL Routines 8 PAL Routines Function PWR D E NT Read a trace entry from trace memo ry PWP E E K Read an arbitrary l ocation (for debug) PWCTRL I n iti a l i ze, turn tracing on/off PWB S R Record a branch to s u b routi ne PWJSR Record a j u m p/call/return PWLDST Record a load/store base reg ister va l ue PWBRT Record a co nditional branch taken bit PWB RF Record a condit i o n a l branch fal l -thro u g h bit DigiL11 Tec h nical Journ,l l Vul . ! 0 No. l 1 99 8 tor PATCH E D CODE ORIGINAL CODE EXAMPLE 1 MP ZERO , ( R1 9 ) Jl!P Z'i8RO, (Rl9) PATCH . O O l : EXAMPLE 2 J S R R2 6 , ( R1 9 ) ��� P.'\TCH . 0 0 2 : EXAMPLE 3 BEQ R 3 , TARGET . 0 0 3 BR CALL_PAL PltJJSR J�lP ZERO , ( R 1 9 ) BSR R2 6 , PAT CH . 0 0 2 CALL_PAL PWJSR JMP ZERO , ( R l 9 ) BEQ R3 . �RSE� . 002 BACK . 0 0 3 PATCH . 0 0 3 : l?l>.TCH . 0 0 1 BR PAT . H . 0 0 3 BEQ R 2 , PATCH . 0 0 3 T Cli.LL_P AL PWBRF BR BACK . 0 0 3 PATCH . 0 0 3 T : EXAMPLE Figure 1 Instruction Patch 4 LDL R 2 0 , 4 ( R 1 6 ) LDL R20 , 41Rl6 ) Bli.CK . 04 1?/I.TCH . 00 4 : CAL _PAL P BRT B R TARG ET . 0 0 3 BR P TCH . 0 0 4 CALL_PAL PWLDST LDL R2 0 , 4 ( Rl 6 ) BR 8 ACK . 0 0 4 Examples repl aces JSR and BSR i nstructions with BSR patches. This replacement preserves the return address ( RA) register fi e l d value, which contains the return address for the subroutine. Again, the original i nstruction is repeated in the patch section for register val ue record ing during traci ng to help facilitate reconstruction. Cond itional branches have a larger and more com plex patch than the other branch types because the original condition is d u plicated and resolved within the patch . The taken or fall-through path generates a bit val ue when logged within the taken or fall-th rough trace entry. The return branch i n the patch section is a rep l i ca of the original cond i tional branc h . As explained earuer, tor all patches, PatchWrx replaces the original branch with a patch unconclitional branch . Since Alph a i nstructions are equal i n size, this replace ment process allows patching without increasi ng the code size with i n the i mage . Although the code size remains u nchanged, the image size will increase in proportion to the number of patches added. This i mage size change becomes an issue for dynamical ly linked l ibrary ( DLL) i mages. Patching Dynamic Link Libraries The Microsoft Wi ndows NT operating system pro vides a memory management system that allows shar i n g between p rocesses.n For example, two processes that edit text files can share the text editor application image that has been mapped into memory. When the first process i nvokes the ed itor, the operating system loads the application into memory and maps the process's virtual address space to it. When the second process i nvokes the editor, rather than l oad another editor image, the operati ng system maps the second process's virtual address space to the physical pages that contain the editor. Of course, both processes con tain local storage for private data . DLLs are loaded i nto memory and shared in this manner. When patches are added to a DLL, the size of the i mage i ncreases. When this i mage is mapped to Digital Technical Journal Vol.. l O No. l 1998 9 p hysical memory ( as per its preferred base load address ) , the larger i m age may overlap with another image having J bJse add ress wi thin the new range. This i mage overlap can p revent the operati ng system from booting properly: some environment DLLs wi ll conflict i n memory because they perform calls d i rectly i n to other D LLs at fi xed offsets . To resolve this issue, we rebase 24 the preferred base load add resses of the patched DLLs, which modi fies the base l oad add resses of each patched D L L to elimi nate con fl icts . Rebasing affects the address accuracy of the patched S}'Ste m , though w e are a b l e to readjust t h e addresses d u ri ng reconstruction . An increase i n the pagi ng activit\' m ay also be observed si nce the additional code may cross page boundaries. The original version of the PatchWrx toolset was developed on Microsoft Windows NT version 3 . 5 . When versions 3 . 5 1 and 4 . 0 were released, several mod ifications were made to the i mage format. In complet ing the 3 . 5 1 - and 4.0-eompatible versions of PatchWrx, we bad to add ress this issue. One change that affected how we patch was the placement of the I mport Address T1ble ( IAT) into the front of the i nitial code section of executable binary images. This table is used to look up the add resses of DLL proced ures used ( i . e . , i m ported ) by the executable binary. In developing the current gen eration of Patch\Vrx, we had to make modi fications to usc image header fields that had previously remained un used or reserved, indicating the executa b le code sec tions that contained data areas. Another issue t hat we add ressed in the recent modi fications to PatchvVrx was long branches. The original version of PatchWrx repl aces a l l branch, j u mp , ca l l , and return i nstructions with either B R o r B S R i nstruc tions to the patch section. Si nce the PatchWrx tool has no information about machine state d u ri ng the patch ing phase, i t is impossi ble to uti l ize other branching instructions ( e . g . , J MP or JSR instructions ) to provide this branc h - to-patch tra nsition. Register and register indirect branching instructions wou ld requ i re per tu rbing the machine state . Therefore , the devel opers could use only program counter ( PC)-based offset branching i nstructions. As discu ssed previously, i n replacing a control How instruction with a patch branch, PatchvVrx uses a B R o r B S R i nstruction i n which the off-Set field i s set to branch to the correspond ing patch wi thin the i m age's patch section . The A l pha architecture branching i nstructions use the format s hown i n hgurc 2 . I = = = 25-BIT DISPLACEMENT LBR I NSTRUCTION FORMAT 20-BIT DISPLACEMENT OPCODE 31 26 2 5 REG 2 1 -BIT DISPLACEMENT 0 2 1 20 Oi[!:iLal Technical Journal LBSR I NSTRUCTION FOR MAT Fig u re 3 PALcode Lon g B ranch Instr uction l-'ormars Figure 2 Al p ha Branch Instruction Format 10 The branch target virtual add ress comp u tation t-cJr this format is newPC ( ol d PC + 4) + (4 * sign cxtcndcd ( 2 l -bit branch d isp lacement) ) . The register field holds the return address for BSRs. With this branch format and target virtual add ress computation, the Alpha architectu re provides a branc h target range of 4 MB from an i nstruction's current PC. Several appl ications that run today on Microsoft Windows NT version 4 . 0 are sufficiently large that the displacement between a control rlow i nstruction to be patched and the patch location within the patch section exceeds this 4-MB l imit. ( Recal l that since we want to avoid moving code or data sections, the patch section is placed at the end of the image . ) To address this problem, we developed two new branch i nstructions for usc with PatchWrx. These new branches were n ot implemented in the i nstruction set architecture of the Al pha architec ture. I nstead, we used PALcodc to implement d1cm . The two new branches arc designated long branch ( LB R) and long branch subroutine ( LB S R) . F i gu re 3 i l l ustrates the format of these two i nstructions. The computation of the target virtual add ress is newPC ( oldPC + 4 ) + (4 * sign-ex te nd ed ( 2 5 -bit branch d isplacement)) tor LB R branches and ncwPC (oldPC + 4 ) + ( 32 * zcro-cxte nded ( 2 0 - bi t br::mch dis placemen t) ) for LBSR branches. PatchWrx uses LB Rs when p a tching any control fl ow instruction that has a d ispl acement greater than 4 LV! B . PatchWrx uses LBSRs similarly for control H ow i nstructions that must p reserve the register field val u e . \Vhcn a n LB R or L B S R i nstruction i s cxecu ted within the i mage code section, a trap to PALcodc occurs . Normal ly, CALL_PAL i nstructions have one of several defined fu nction fields that cause a correspond i n g PAL routine to b e executed . The two l o n g branch i nstructions have fu nction fields that do not belong to any of the defi ned CALL_PAL instructions a nd there fore force a n i l legal i nstructio n exception within the PALcod e . This PALcodc flow has been mod i fied to detect i f a long branch has been encou ntered . Vol . 10 No. l 1 99 8 AB shown in Figure 3, both long branch types have the same PALcode operation code (opcode) value of 000000. To distinguish between the r-.vo types, the least significant bit in the instruction word is set to 0 for LBRs and to 1 for LBSRs. This bit is not included as a usable bit for the displacement fields of either branch type. Consequently, each LBR has a 2 5 - bit displacement field and each LBSR has a 20-bit field. With a 2 5 - bit usable displacement field, the PALcode performs the LB R tar get address computation, allowing a ± 64-MB range . Since each LBSR instruction has a 2 0-bit d isplace ment field, whereas the original Alpha architecture branch displacement field is 2 1 bits, the target instruc tion address computation for LBSR instructions is per formed differently than tOr standard branches within the PALcode. As shown in the address computation equation, the 2 0 - bi t displacement is mu ltiplied by 3 2 rather than by 4 ( as for the L B R branch ) . Notice that the 2 0 - bi t d isplacement is always zero extended . The computation provides the LBSR instruction with a dis placement of + 3 2 M B . This computation procedure has two implications. First, LBSR instructions can only be used to branc h from an image code section to an image's patch sec tion . Second , branches into the patch section are either BR or BSR instructions (or their long displace ment counterparts ) . PatchWrx uses only BR or LBR instructions to return from the patch section to the original branch target within a code section; BSR and LBSR i nstructions are never used . Therefore, restrict ing LBSR i nstructions to use positive displacements does not present a problem. The LBSR displacement m u l tiplier value of 32 does present some restrictions, however. The m ultiplier value of 4 used in the original Alpha i nstruction set architecture represents the instruction word length of 4 bytes. Thus, normal branch instruction target addresses must be aligned on a 4- byte boundary. By using the multiplier val u e of 32 for LBSR instructions, LBSR target addresses are restricted to align on a 32byte (i.e., eight-instruction) boundary. Since all LBSR targets reside within the patch section, this restriction does not pose a problem . If an LBSR is to be inserted into the image code section and the next available patch target address is not aligned properly, PatchWrx can insert no operation ( NOP) instruction words and advance the next avai la ble patch target address unti l the necessary alignment is achieved. PatchWrx never executes the NOPs; they are i nserted for alignment purposes only. Although inserting these NOP instruc tions increases t h e image size, w e have implemented several optimizations into the instrumentation algo rithm to minimize this increase. For example, a queue is used to hold LBSRs that do not align . As LBR patches are committed , PatchWrx probes the queue to determine if any LBSRs align fi·om their origin to the newly available patch target offset. Trace Capture The PatchWrx toolset allows the user to turn tracing on and off and thus capture any portion of workload execu tion. The tracing tool is also responsible for copying trace entJies fi-om the physical memory buffer to disk. Copying the trace buffer to disk is performed after u·acing has stopped so that the time required to perform the copy does not introduce any overhead during u·ace capture . PatchWrx logs a trace e nu·y for each patch encoun tered during program execution. AB .it executes instruc tions witllin the code section, PatchWrx encounters an unconditional PatchWrx branch. Instead of branclling to the otiginal target, the patched branch transfers control to tl1e image's patch section . Witl1in the patch section, a PatcbWrx PALcall u·aps to the PAL routine correspond i.ng to tl1e patch type and logs a trace entry to tl1e trace buffer. The PAL routine then returns to the instruction following the CALL_PAL insu·uction. PatchWrx uses an unconditional branch to transfer control fi-om tl1e patch section back to the original target within an image code section. During the execution of the PatchWrx PAL rou tine, necessary machine state information is recorded and logged in the trace buffer. This allows for the capture of register contents, process I D information, etc . , which are used later during u·ace reconsu·uction. The trace capture £1cility captures tl1e dynamic execu tion of a workload running on the system . To recon struct tl1e trace after it has been captured, the tracing tool must also capture a snapshot of tlK base load addresses of all active images on tl1e system. This snap shot serves as the virtual address map used in recon structing the trace. Each active process and its associated libra.Ji es is loaded into a separate address space, which may be different tha.Jl me preferred load address as spec ified statically in tl1e image header. If each image was loaded into memory at its preferred base address, tl1e virtual address map would not be necessary to perform reconstruction. Instead, PatchWrx could map target addresses from the trace buffer using the base address values contained in tl1e static image headers. The type of trace record that PatchWrx logs into the trace buffer depends on the type of branch or low-level PAL function being traced. Figure 4 shows the trace record formats. The first three trace entry formats consist of an 8 - bit opcode and a 24-bit time stamp. The time stamp is the low-order 24 bits of the CPU cycle counter. The 32 -bit field of these three formats depends on the type of trace entry logged . The .first format is used for target virtual addresses for all u nconditional direct and i ndirect branches, j umps, calls, returns, interrupts, and returns from interrupts. The 32- bit field of the second format is used to record the base register val ue tor traced load and store instructions and stack pointer val ues that are flushed into the trace buffer during system caJis and returns. The 32 -bit field of the third format is used for logging the current active process ID at a context swap. Digita} Technical Journal VoJ . 1 0 No. 1 1 99 8 11 OPCODE T I M E STA M P TARGET P C 8 24 32 OPCODE TIME STAMP BASE REGISTER VAL U E 8 24 OPCODE T I M E STAMP 8 32 NEW PROCESS 24 10 32 r-- OPCODE \ START B I T J 3 I VECTOR OF 60 TA KE N/FALL-TH ROUGH TWO-WAY BRANCH BITS 1 60 Figur e 4 Trace Entry Formats The fo urth trace entry type is used for tracing con Using the first target virtual address and process ID ditional branches. I t uses a 3-bit opcode a n d up to 60 pair from the captured trace, trace reconsu·uction con taken/fa l l -through bits. A start bit i s used to deter su l ts the virtual address map to determine in which mine how many b i ts are active. The start bit i s set to i m age the instruction fa lls ( b ased on its dynamic base l i f a conditional branch is taken and to 0 i f the branch load address) and where that image is physically is not taken . This recording scheme allows a compact located o n the syste m . The tool consults the patched encoding of conditional branch trace entries. Duri n g image to determine the actual i nstruction at the target trace reconstruction, Patc hWrx uses conditional branch address, records this instruction , a n d then reads the trace e n tries to reconstruct the correct instru ction next insu·uction from the patched image . This process flow when condi tional branches are e n countered and continues until reconstruction encounters either a to provide concise information about when to d eliver conditional branch or an u n conditional branch. A i n terrupts in loops. conditional branch causes the tool to check the first Trace Reconstruction determine su bsequent control flow; the process then The reconstruction phase is the final step in generating conti n ues at that address. I f a n un conditional branch is a full instruction stream of traced system activity. As encou ntered , reconstruction records the e n try and active bit of the current taken/fall- through e n try to shown i n Figure 5 , trace reconstruction requires sev ch ecks it against the next captured trace en try. I f the eral resources i n order to generate a n accurate instruc tvvo entries match , the tool outp u ts the recorded tion stream of all traced system activity. instructions to an instruction stream file, consults the Trace reconstruction reads and i ni tializes the head 12 captured trace entry for the next target i nstruction vir i n g of the cap tured trace, which i ncludes a time sta m p , tual address, and repeats the procedure u n til the entire t h e n a m e of the u s e r w h o captured t h e trace, a n d any captured trace has been processe d . important system configuration information, e . g . , the Since PatchWrx cap tures i nterrupts a n d other low operating system version n u m ber. Next, reconstruc level system activities ( e .g., page fa ults) i n the trace, tion reads the first fou r raw trace records, which are these activi ties must also be reconstructed . When automatically entered w h enever traci ng i s turned o n . PatchWrx logs an interrupt in to the trace bu ffer, the These records contain t h e first target virtual address, corresponding target virtual address in the captured the active process ID , the value of the stack pointer, record represents the address of the rl rst instruction and the first talcen/fall- through record to be used not executed when the i nterrupt was take n . PatchvVrx (such records always precede the branches they repre flushes the currently active taken/fa l l -through entry sen t ) . PatchvVrx uses this i nformation to i n i tialize the to the memory buffer and i n itializes a nevv taken/fall necessary data su·uctures of the reconstruction process. through enu·y. This new e n try will be responsible fo r Digital Technic:�] Journal Vo l . 1 0 No. I 1 998 PATCHED IMAGE PATCHED IMAGE PATCH E D IMAGE I - RECONSTRUCTED I N STRUCTION STREAM CAPTUR E D RAW TRACE RECON STRUCTION TOOL V I RTUAL ADDRESS MAP Figur e 5 Instruction Stream Reconstruction Resources the conditional branches e n countered begi n n i n g with the i n terrupt service routi n e . The add ress of the first in struction wi t h i n the i n terrupt service ro utine is then logged i n the trace . D u ring reconstruction, the reconstruction tool looks fo r the i n terrupt's first u n e xecmed i n struction address 2 . DLL domain-Wi n 3 2 user (e.g., kernel 3 2 d l l , user3 2 . dl l , a n d ntd ll . dl l ) 3 . Operati ng system domain-W i n 3 2 kerne l , ke rn e l , system processes, system idle process (e.g., Wi n 3 2 K.sys, ntoskr n l . e xe, drivers, and t h e spooler) to know which instructi o n to stop at when recon Exa m i n i n g the e ti mes provi des i nsight into a work- structi n g the i nstruction strea m . The tool then begi ns load 's use of each dom ai n . We also examine DLL and reco nstructi ng the i n struction stream, i n c l u d i n g the system service usJge on a n i m age basis for each work interrupt h a n d l e r stream. I f the u nexecuted i n struc load . Tlus breakdown helps us more clearly identi f)r the tion is w i t h i n a loop, trace reconstruction uti l i zes the dependence between the workloJd and the system ser taken/fa l l - through entry convention . On ta king the vices provided by the Windows NT operati ng system. i nterrupt, the active take n/fall-through record i s flushed We also present the i n struction m i x of each workload and another record is starte d . This process al lows the with and without the incl usion of the operating system tool to conti n u e to reconstruct i terations o f the l oop execution . U n dersta nding the djfferences in instru c u n ti l a l l the taken/fal l-through bits are exhau sted . tion com position i n the presence of system activity fur ther highli ghts the behavior lacking in application-only Operating System-Rich Workload Characterization traces, such as i n creases in branch and memory instruc tions, when compared to application-only workloads. We present the average basic block l e ngths fo r each As prese nted i n the study by Lee et al . ; ' desktop appli domain of execution ( Jpplication-only, DLL, operating cations and benchmarks s h a re some workload charac syste m ) separately a nd the n i n co m bi na ti o n . This met teristics, but app l ic ations alone d o not represent fu l l ric reveals which workload domai n dominates the system behavior. To i n vestigate and address system branc hing beh avior. Casm ira's work provides a more design issues, com p u ter arch i tects should use operat complete description of these d i fferences across a wider ing system-r i c h traces. set o f workload c h aracteristics.2; To i l l u s trate this point, we present a sample of the vJrious workload characteristics tbat exist in a set of Workload Descriptions bench mark a n d desktop appl ications spec i a l ly sele cted We pertonn ed a l l the e xperiments reported on in this to study the d i ffe rences in the use o f the operati ng sys paper o n a DI GITAL Alpha p l attorm r u n n i n g the tem and related services. The first c h a racteristic we dis .Microsoft Windows NT version 4.0 operat i n g syste m . cuss is the amount oftime each ben chmark or desktop We captured the traces o n a 1 5 0- megahertz Npha application spends with i n three domains: 2 1 064 processor. The system configuration incl uded l . Appl ication-on ly domai n ( e . g . , winword .exe and excel . e x e ) 8 0 MB of physical memory. TJ ble 3 l ists the workloads we examined . Digital Tech n ical Journal Vol . 10 No. 1 1 998 13 Ta ble 3 Workload Description Workload Description fou r i e r B YTEmark benchma rk; a n umerica l ana lysis routi ne for calculat i n g series approximations o f waveforms neural BYTEmark bench mark; a s m a l l , funct i o n a l back- propagation netwo rk s i m u lator go SPEC95 Go! game bench m a rk li SPEC95 Lisp i nterpreter bench mark cdplay Microsoft CD Pl ayer playing a m u sic CD fx ! 3 2 D I G ITAL FX 1 3 2 V 1 . 1 i nterpretin g/translating incl uded Ope n G L s a m p l e x86 a p p l ication ie M icrosoft I nternet Explore r V2.0 fo llowing a series of web page l i nks vc50 M icrosoft Visual C/C++ V S . O com p i l i n g a 3, 000- l i ne C program word Microsoft Wo rd97 V7.0, spell-check i n g a 1 5-page docu me nt The fourier and n e u ra l workloads are from the To provide a clear and represent::�tivc comparison BYTEmark benchm ark test s u i te : the n e u ral workload ohvorkload behavior, we captured several traces. For is a small array- based floating-point test; the fou rier all scenarios, fu l l traces of each workload captu red workload i s designed to measure transce n d e n ta l a n d approximately 5 to lO seconds of executi o n , f-i l l i ng the 4 5 - M B trace b u ffer. To c haracterize worldoad behav trigonometric fl oating-p oi nt unit perfor mance. The go and li workloads a.rc !Tom the SPEC9 5 integer ior, each experim e n t w:1s run with the benchmark or bench mark suite: the go workload is a simulation of the application as the only activity o n the syste m . E a c h game workload w a s r u n in t h e !-(>regro u n d . Co1, witl1 ilie computer playing against itselr; ilie li workload is a Lisp in terpreter. All the workloads use ilie To e n s u re t h a t t h e traces captured were represe nta stand ard i n puts provided vvit h tl1c bench marks and are tive o f the overall worldo:�d behavior, we captured com piled mul tiple traces. We chose d i ffe re nt poi n ts d u ri n g exe with the default optimiz:.tion level using the native Alpha version of Mi crosoft C/C++ version 5 . 0 . cution fo r tracing to allow comparison between d i fter The cdplay workload i s t h e Mi crosoft C D Player en t portions of the selected scen:�rios. To i n v estigate application i n c l uded i n M i c rosoft Wi ndows NT ver the variabi l i ty present in selected workloads, we tr:�ced sion 4 . 0 . The device w:.s traced while playing a music additiona l scenarios . A second M i crosoft Word trace CD using d e fa u l t p l ayi n g options ( e . g . , playing a l l the was captured with the appli cation perfor m i n g an auto songs i n order). for m a t operation of the same docu m e n t used in the The 6.: ' 32 workload is the DIGITAL FX' 32 version 1 . 1 first trace of the spell- check operation , and we cap emulator/translator provided by Compaq's DIGITAL tured a second M i crosoft I n ternet Explorer tr;K e , Alpha Migration Tools G ro u p .1" We ran the robot arm repeating the S o n y l i nks but with t h e l i n ks cac hed . We Ope n G L sample I n te l - based appli cation in the for e captu red a second trace o f ground d u ring trace captu re. The ie workl oad i s the st:.ndard M i c rosoft I n ternet FX ' 32 using the i n c l u ded boggl e sample ga m e ( to r comparison agai nst using the OpenGL application i np u t ) . Add i tional ly, the FX 1 3 2 Exp lorer version 2 . 0 workload i n clude d i n lvl icrosoft translator was traced while i t optim ized a n:�tive I n tel Win dows NT version 4 . 0 . The ie workload was traced x86 application's pro fi l e . To conde nse the n u m be r of w h i l e traversing fo u r l i n ks through the Sony home memory pages occupied web page, arriving fi na l ly at the Sony PlayStation Store designed the new l i n ker to a l low d a ta to resid e wi r- h i n by a n i m :� gc, Microsoft we b page . The trace was captured on M ay 4 , 1 9 9 8 ; t h e code regions. Hookway a n d Herdeg"' provide :1 n pages m ay have c h anged s i n ce this d ate. The history expl anation of the D I G ITAL cache and the web link cache were both e mpty w h e n t:ranslationjoptimization procedures . Casmira discusses t h e trace was captu red . iliese scenarios a n d others .' · FX1 32 e m u lation and The vcSO workload i s tl1c M i crosoft C/C++ version 5 . 0 compiler compiling a 3 , 000-l ine C source code tile. Domain Mix We used the command l i ne i nterrace, and we used the To i l l ustrate the i n herent d i ffe rences between bench default optimization levels and oilier parameters, which m a r k and d esktop application behavior, we break best represented ilie common usage of tl1e compiler. The word workload is M i c rosofi: \Nord from the M i crosofi: Offi ce97 desktop app l i c::�tion su ite tor the ( 2 ) D LL, and ( 3 ) operati ng syste m . The application Alpha processor used to capture :1 m a n u a l spell c h e c k domai n represen ts the set of-' executed instructions that o f a 1 5 -page Mi crosoft Wo rd docu m e n t . T h e standard are within the traced appl ication ' s execut a ble i m ::� ge. Microsoft Wo rd d i c tiorury was employe d . 14 down the captured trace i n terms of three m m u a l ly exclusive domai n s . These domains arc ( l ) application, Digital Tcdmi'a: 1U) � 10 l � n 5 : I 0 FOURIER N E U RAL GO Ll C O P LAY FX'32 IE VC 5 0 WORD WORKLOAD Figure 9 Average Basic Block Length within their executable ima ges . Therefore, i n c l u d i n g The vc50 workload spends a s i g n i fica nt amount of any operati n g system activity i n to a basic block le ngth ti me wi t h i n i ts own executable image , which leads to average has a m i n i m a l effect. a n overa l l average basic block l e n gth si m i l ar to the However, consideri n g the la rge amount of operat applic ation - o n l y va l u e . T h e word workload is s i m i l a r, ing system execution present in the cdplay trace, the b u t the D LL behavior domi nates. The cd p l ay and ie overa l l basic block lengt h is s i gn i fi cantly Jess than the workloads experience a 50 percent decrease i n average appl ication-only l e ngth . The overa l l and operating basic b l o c k length . This decrease c a n be attri buted to system length val ues are al most the same. Not o n l y a n i ncrease i n the n u mber of branc hes in the presence does i nc l u ding t h e system activity i n t h e trace i n tl u of operating system activity. With this i n crease in con e n ce t h e overa l l basic block length b u t the amount trol fl ow i nstructi ons, we ex pect increased pressure to of system activity determines to what degree the length be placed u po n the branch prediction hardware. i s affected . In a si milar fashion, the overall basic block length of As observed in othe r c h aracteristic categories, the fou r bench marks do not e x h i bit noticeable deviations the fx ! 3 2 trace tracks that of its D LLs. The length is fro m appli cation - o n ly be havior when the operati n g directly proportional to the amount of ti me the work system activity is introd uced. Aga i n this explains why load spends in its DLL domai n . The execution of the ie s i m u l ation results using ben c h mark traces usually track workload is more evenly distri buted among the t hree the actual performance when the bench marks are r u n domains, which affects tl1e overall basic block length, on the real syste m . I n contrast, fo ur o f the five desktop produ cing a more evenly weighted average of all its applications exhib i t significantly d i ffe rent behav i or i n domain basic block lengths ( n o one domain dominates ) . the presence o f the operating system. Digiral l.edmical Journal Vol . 1 0 No. 1 1998 19 Summary 6. rhe Eleue n rh Sympos ium o n Co mputer A rchitecture I n this paper we described the PatchWrx toolset. We (June compared it to e x isti ng tools and d emonstrated the need for operati n g system-ri ch traces by showing the 1 9 94 ) : 1 2 6-1 3 5 . the D L Ls. I n a d d i t i o n , w e showed t h a t n i s t i n g d e s k K . Flanagan , J. Arch ibald, B . Nelson, and K . Grim srud , " B AC H : BYU Address Collection Hardware; The Collection of Complete Traces," Proceedings of t o p bench marks do not exercise the kernel and t h e the Sixtb International Olllfereucc un Jfodeling Tech 7. amount of the total execution spe n t i n the ke r n e l a n d n iques and Tools /or Co mp u t er Fmlwttio l l ( 1 992 ) : D LL sufficiently to provide m e a n i ngfu l i n d i cators of 5 1 -6 5 . desktop pertonnance. These resu lts have rei n torced our argu ment that and operating system i n forma tion , especia l ly a s n e w D. Kaeli, 0 . La Maire, 'vV. White, P . Henner, a n d W. Starke, " Real-Time Trace Ge neration," !nt em u t iu ua l Journal 0 1 1 Computer Sim ula t i o n . vol. 6, no. 1 ( 1 9 96 ) : appl ications s p e n d m o r e time executing within the 53-68 . 8. researchers need to use traces with both applicati o n operating syste m . The goal i s for computer arch itects to usc operati ng system-ri c h traces of appli cations that 9. D . Kaeli, L. Fong, D . Ren frew, K. I m m i ng, and R. Booth , "Performance Analysis on J CC-NUM.A Prototype," IBM .foumal ol l es(t; i l a 1 1 d !mplementatio l l . Orloudo. References and Notes l. SPEC Neu'Sietter( Septembe r 1 9 95 ) . 2. I n formation about t h e BYTEmark benchmark suite is available fi·om B YTE Magazine ar http :/jwww. byte. com/bmark/bmark .hrm. 3. 15. M . Rosenblum, E. Bu gnion, S. Devine, and S. Herrod, "Using the SimOS Machine Simu lator to Study Com plex Computer Systems," A CJ1 Transactio/Is IJ/1 .\llod eling and Sim ula t ion , vol . 7, no. I ( January 1 9 9 7 ) : 4. 16. D. Kael i , " Issues i n Trace- Driven Simu lation," Lectu re f\iu. and Verlag, 5. 729, Per/ormance Notes in Computer Science, Eua lu at io u Svstems. 78-1 0 3 . 1996) : 1 69-1 8 3 . of Co mp uter Com m u n ication L. Donatiello and R. Nelson, eds. (Springer A. Agarwal, A nazvsis o/ Ca che Perjorma n e ej(;r Oper ating Systems a11d Multipru,q ra rnming m ic Publisher, ( Kluwer Acade 1989). 1 7. ]. Larus and E. Schnarr, " EEL: Rewriting Executable Files to Measure Program Behavior," Pmc('edi ngs of the A C!VI SIG'PLA N"95 Co nference 011 Pn��ran1111i11g 1993 ) : 2 2 4-244 . R. Uhlig and T. Mudge, "Trace- D riven Memory Sim u lation : A Su rvey," A C/11 Comfllltillg Surn·: Fs, vol . 2 9 , no. 2 (J unc 1 99 7 ) : 1 2 8-1 7 0 . Digital T.:d1nic.1l journal 1 994 ) : 1 96-2 0 5 . M . Rosenblum, S. Herrod , E. Wirchcl, a n d A . Gu pta, "Complete Computer System Simulation: The SimOS Approach," JEF:F..fo u m al of Pa ra llel a u d Distrlhu ted Tech nology, 1 99 8 , forthco ming. ings o/ the Secoud fSIW!X .�vrnposium on Operating ( October Fla. ( Ju ne 14. S . Perl and R . Sites, "Studies of Windows NT Perfor mance Using Dynamic Execution Traces," Proceed S),stem f)esig n and lmplcmentmiun 20 J. Emer and D. Clark, "A Characte rization of Proces sor Performance in the VA,\. l l - 7 8 0 , " Proceedillf.;s u/ Vol . 10 No. l 1 99 8 La ng uage Desi_q 1 1 and Implementation. ( Ju n e 1 99 5 ) : 2 9 1-300. La jolla, Calif 1 8 . D . Lee, P. Crowley, ] . - L. B:ter, T. Anderson, and B. B ershad , " Execution CharJCteristics of Desktop Appl ications on Windows NT," Proceedings of the Twen ty�jifih International -�ymposiu m on Computer A rchitecture. Barcelona, Spain ( J u n e 1 99 8 ). 19. E . Bem, D . H u nter, and S . Smith , " Moving ATOM to Windows NT for Al ph a , Dtj; ital Tech n ical journal. vol . 10, no. 2 , accepted for p u b l ication . " 20. M . Smith, "Tracing with Pixie," Technica l Report, CSL- TR-9 1 -497, Stanford Univ e rsi ty, November 1 99 1 . 2 1 . R . Cmelik a nd D . Keppel, "Shade: A Fast I nstruction Set Simu lator for Execution Profi l ing," Proceedings of A CM S(qmetrics ( May 1 99 4 ) : 1 28- 1 37 . 2 2 . Alphu AXP A rchitec tu re Ha ndhnok. Order No. EC Q D 2 KA-TE ( Maynard , Mass . : Digital Equipment Corporation, O ct ober 1 994 ) . 2 3 . H . Custer, Inside Wi n dows Microsoft Press, 1 993 ) . NT ( Red mond , Wash . : 2 4 . Microsoft Sothvare Developer's Toolkit. This toolkit is ava i l able :tt http://msd n.microsoft.com/developer/ sdk/plattorm.htm. 25. J. Casmira, " Op e rating System Rich Workload Char acterization," Master's thesis, ECE-CEG-98-0 1 8 , Northeastern University, May 1 99 8 . David P. Hunter David H u nter is the engineeri ng manager of Compaq Computer Corporation's Advanced and Emerging Technologies Group. Prior to that he was the manager of D I GITAL's Software Parmer Engineering Advanced Development Group, where he was involved in perfornnnce i nvestigations of databases and their i nteractions with the U N I X and Windows NT operating syste ms. H e has held positions i n the Alpha Migration Organization, the I SV Porting Group, and the Govern ment Group's Tech nical Program Management Oftice. David joined D I GITAL's La boratory Data Prod ucts Group in 1 98 3 , where he devel oped the VA..,'\lab User Management System. H e was the project leader of the advanced development project, ITS, an executive information system, tor which he designed hard ware a.nd so thvare components. David has two p a te n t appl i cations pending in the area of sothvare engineering. He holds a degree in electrical and computer engineering ti·om Northeastern U niversity in Boston, Mas achusetts, and a diploma in National Security and Strategic Smdies fTom t h e United States Naval War Col lege in Newport, Rhode Island . 26. R . Hookw<1Y a n d M . Herdeg, " D I GITAL F X ! 3 2 : Combining Emu lation a n d Binary Translation," Digital Tecbnicaljournal. vol. 9, no. 1 ( 1 997): 3- 1 2 . Biographies Jason P. Casmira Jason Casmira received B . S . and M . S . degrees in electrical engi n eering ri·om Northeastern University i n 1 996 and 1998, respectively, and is pursuing a Ph . D . degree in com pute r science at the University of Colorado, Boulder. For the past two yc<1rs, ] ason was a member of the Northeastern U niversity Computer Architectu re Research Laboratory ( N UCAR), where he focused on developing the cu rrent version of the P:nchWrx tracing toolset. H e also investi gated issues related to swdying operating system -ri c h traces. While at N U CAR, Jason was supported by a grant ri·om the Nation;� I Sci e n c e Foundation. H e has p u blished seven papers and is a member of the I E EE and the Eta Kappa Nu honor society. David R. Kael i Da,�d .Kadi received P h . D . ( 1 992) and B . S . ( 1 98 1 ) degrees in e lectrical engineering trom Ru tge rs U niversity and an M .S . degree in computer engineering trom Syracuse U niversity in 1 985. He joined the electrical and computer engineering facu l ty at Northeastern U niversity in 1 99 3 after spending 12 years at I BM , the last 7 of which were at the I B M T. j . vVatson Research Center i n Yorktown Heights, New York. David is the d i rector of the Northeastern U niversity Computer Architecrure Research Laboratorv ( NCCAR ) , where he investigates the performance and design o f high p e r form a nc e computer systems and sothvare. H is current research topics i nclude 1/0 worklo:�d characterization, branch prediction snrdies, memory hierarchy design, object oriented code execution pertonnance, 3-D microelectronics, and back-end compiler design. He frequently gives tutorials on the subject of trace-driven char c l a s s S t ack ( T * top_o f_s t a c k ; p bl i c : voi d push ( : aL l ; vo i po ( T · ar ) ; } ; The act of ap p l yi ng the arguments to the tem plate to as t e m pla te insta n tiation . An i n sta n ti a tion of a te mpl ate creates a new type or fun ct i on that is defined for the speci f-i ed typ es. Stack< i n t> creates a c l ass that provides a stack of the type int. Stack creates a c lass that provides a stack of u ser_ cl as s . The types i n t and user_class are the argu ments for the tempiJte Stack. is r efer red 22 Digir:li Technicol Journ:�l Vol . 10 No. I 1 99 8 tem p l ate needs to be i n stantiated when de fined d i rectives or pragm as. S i nce i nstanti ations are it is referen ced . When a c l ass template is i nstantiate d , give n global external l i nkage, the u ser must ensure only those membe r fun ctions a n d static data members that the specified te mplate i nstant i ations appear o n l y that are referenced are also i n stanti:Hed . In the Stack o n ce throughout all the modu les t h a t com pose t h e e xa m p l e , the m e m be r fu nction Push of the c l ass progra m . When o n l y t h i s m o d e o f i n st a n ti a tion i s In genera l , :1 Sta c k < i n t> needs to be i nstantiated only if it is used . u s e d , the u s e r also must e n s u r e that a l l req u i re d tem Tem p l ate fu nctions a n d static d a ta mem bers have plate i n stanti ations are spe c i fi ed to avoid u n resolved global scope; there fore, o n ly one i nstantiation of each sym bols at l i n k time. should be i n a user's appl i cati o n . Since source fi les are Command- l i n e options compiled separately a n d combined later at l i n k time to Co m m and-line Instantiation prod uce an exec utable, the compi l e r alone i s not able can be used to speci f)' template in stantiation . They are to ensure that one a n d only one i nstance of a specific similar in operation to the explicit i nstantiation req uests, templ ate is e fficiently generated for any given exe except they i n dicate groups of templates that shou ld be c u ta b l e . That is, the compiler by itself is not a b le to instantiated, rather than naming specific templates to be know whether the function or variable defi nition for a i nstantiated . The com mand - l i ne options include speci tlc te mpl ate is satisfied by code ge nerated in another object mod u l e . The C++ Standard • provides fac i l i ties for the user to e ntities whose definitions are known d u ri n g compi specif)' where a tem p late en tity s hou ld be i n stantiated . ' lation and whose argu m e n t types are specified . This When the user explicitly spe cities te m p late i n stantia has the advantage of spec i fYing many te mpl ate tio n , the user then becomes responsible for ensuring i n stantiations at o n c e . The user must st i l l e nsure that there is only one i nstantiation of the te mplate that no te m p late i n stantiation happens more than fu nction or static data m e m be r per appl ic ation . This once in the program a n d that all req u i red i n stantia responsibil ity can necessitate a conside ra b l e amount of tions are satisfi ed . Due to these require m e nts, the work. However, the com p i l e r and l i n ker worki ng user can not usually specif)' this option on more than together can provide e ffective templ ate i n stantiation one source- fi l e com p i l a tion in the progra m . This without specific user d i rectio n . option can also cause the i ns ta n ti a ti o n of templates I n the foll owi ng section, we presen t t h e various approaches that can be used for template i n sta ntiati o n . I n stantiate Al l Te mpl ates. A com m a n d - l i n e option can d i rect the compiler to i nstantiate all tem p l ate that are not used by the prog ram . • Instantiate Used Te mplates. A command-line option Te mplate Instantiation Techniques can be used to d i rect the compiler to i nstantiate Te mpl ate i nstantiation te c h n i q ues can be broad ly cat source code and whose defi n i tions arc known at only those templ ate enti ties that are used by the com pil ation . As in the previous tec h n i q u e , the user egorized as either m a n u a l or automati c . vVith m a n u a l m ust ensure that no template i nstantiation happens i nstantiation, t h e com pi l a tion system responds t o user more than once i n the program and that all req uired d i rectives to i nstantiate te mplate e n tities. These d irec i nstantiations arc satis fied . Due to these req u i re tives can be in the source progra m , or they may be ments , t h e u s e r can not u s u a l l y spec i fY this option co m m a n d - l i n e options. With autom atic i n stantiatio n , o n more than one sou rce - fi l e comp i lation in the the compilation syste m , i n c l u d i n g t h e l i n ker, decides progra m . which instantiations are req u i red a n d attempts to pro vide them t(Jr the user's appl icati o n . • Instantiate Used Te mplates Locally. Thi s command line option works l i ke the i nstantiate used te mpl ates Manual Instantiation opti o n , except that it d e fi nes each te m p l a te i n st a n Manual te m p l ate i n sta ntiation is the act of manua l ly tiation locally in the current compilation . This option specifYing that a template should be i nstantiated in the has the adva n tage of provid i n g com plete te mpbte ti le that i s being compi led . This instantiat ion i s given i n sta n tiation coverage for the progra m , as long as global external l in kage, so that references to the the definitjons of the used tem plates are avai lable in i nstantiation that are made i n other til es resolve to this each mod u l e . Since all templ a te i nstantiations are te mpl ate i n stantiati o n . M a n u a l te mplate i nstantiation given local scope, there is n o pote n tial problem i nc l udes explicit i n stantiation requ ests and pragmas as with wel l as com mand - l i n e options. program is l i n ke d . The major p ro b l e m with this m u l t iply d e fi ned i nstantiations when the tech n i q u e is that the user's appl ication can be The u n n ecessari ly large, si nce the same te mplate i n stan compi lation system i nstantiates those te mpl ate e n ti ties tiations could appear withi n m u l tiple object fi les that the user specifies tor i nstantiation . The specification used can be made using the C++ expl icit te mpl ate i n stantia if the i nstan tiations m u st have global scope such as tion syntax or may be made using i m p lementation- a c l ass's static d ata m e m bers. Explicit I nsta ntiation Requests and Pragma s to l i n k the app l icati o n . This technique wi ll fai l Digital T,·,hni,al Journal Vol . 1 0 No. l 1 998 23 Figure 1 shows an example o r' a template fu nction, template_func, that contains a locally defi ned static variable. As shown in the figure, the object fi les of both A and B contai n local copies o f template_func i nstanti ated with i nt. E a c h i nstance o f templ ate_func defines i ts own version of static variable x. I n this case, directing the compiler to i nstantiate used templates locally yields a d i fferent resul t than i nstantiating a l l or used templates globally. I f we give the static data mem bers global scope a n d ensure t h a t they are properly defi ned a n d initi al i zed by executable code rather than by static i n i tial ization, we can solve the static d ata mem bers prob l e m . The app l i cation , however, remains unnecessaril y large, because m u l tiple copies of the i nstantiated templates can be present i n the exec u table. A u tomatic template instantiation rel ieves the user of the burden of determining which templates must be i nstanti ated a nd where in the application those i nstanti ations should take place. Automatic template i nstantia tion can be d ivided into two categories: compi l e - time i nstanti ation , whereby the decision about what shou l d b e i nstantiated is made at compile t i m e , and l i n k- time i nstantiation, whereby decisions about template instan tiation are made when the user's application is li n ked . I n both cases, specific ]ink-time support is needed to select the required i nstantiations for the execu table . Each i nstantiation is placed i n the commu nal data sec tion ( COM DAT) of the current compi l ation 's object fi l e . Each object fi l e contains a copy of every template instantiation needed by that compilation u n i t . COMDATs are sections t h a t have an attri bute t h a t tells the l i n ker to accept, without issuing a warni ng, m u l ti ple definitions o f a symbol d e fi ned i n the section . ' I f more t h a n o n e object file defines that symbol , o n l y the section from one object fi le is l i nked i n to the i mage and the rest are d iscarded, along with a l l symbols i n the symbol table d efi ned in t h e d iscarded section con tributi o n . At link ti me, the l i n ker resolves a n i nstantia tion reference by choosi ng one of the i nstantiations defined i n an i ndividual obj ect fi le's COMDAT. The resu l ti ng user's appl ication executable has a single copy of each requested i nsta n tiatio n . vVhen s u c h l i n ker support is n o t avail a ble, another mechanism must be used to control compi l e - time i nstantiation . O ne such approach is to use a repository to contai n the generated i nstantiations. The compiler creates the i nstantiations i n the repository i nstead of the c urrent compi lation 's object fi l e . A t l i n k time, the l i n ker incl udes any req uested i nstantiations from the repository. As a performance i m provement, the com piler can also decide whether an i nstantiation needs to be generated from the state of the reposi tory. I f the requested i nstantiation is in the repository and can be determ i ned to be up to date, the compiler does not need to regenerate the i nstanti a tion. Two major tec hn iq ues can be used to perform a utomatic tem plate instantia tion at compile ti me. The choi ce between the two depends upon the fac i l i ties available i n the l i n ker. M icrosoft Visual C++ i nstantiates templates at compile time using a strategy similar to the i nstantiate used templ ates com mand-line option described previously. ' The decision to instantiate can be left u n ti l l i n k time. The linker can find the instantia tions that are needed and direct the compiler to generate those i nstantiations. McCluskey describes one li n k-ti me instantiation scheme.'.r' The compiler logs every class, union, struct, or cnum in a name-mapping file in a repos itory. Every declared template is also logged in the name- Automatic Instantiation Com pile-time I nstantiation Link-time I nstantiation e . h :' " I I templ l i n c l u e c i o s t ream . h t emp l a te c l ass T vo i d te�plate_func { s ta ic 'J' :< � 0 ; cou t < < x .,. p ; IT p ) X+ + ; I / A . c :< :{ # i n c l ude • emp l a t e . h x x " e x t e n vo i c� b_func { ) ; int ma i n { ) ( templ a te_func ( l O I ; b_func ( ) ; re urn 0; Figure 1 Template Fu nction Containing a Local ly Ddi necl Static Variable 24 Digital Technical Journal Vol . 1 0 No. I 1 998 / / B . c::< x " i nc lude " L empl te . hx x " vo i d b_ func ( vo i d ) { II . . . temp l � e_ f un c ( 2 0 ) ; II. . . mapping file. At link time, a prelinker determines which template instantiations are required. The prelinker builds temporary instantiation source files in the repository to I* per f o r�so e_ f unct i on ( C& } # .i nc l ude " empl a te . hx x " � i nc l ude " t emp l a t e . cx x " U i nc lude satisfY the referenced instantiations, compiles them, and */ · c_class . h " adds the resulting object files to the linker input. Consider the example in Figure 2. D u ri n g the c o m pi lation o f m a i n . cxx, a n a m e Figur e 3 Example of an Instantiation Sou rce File mapping fi l e is b u i l t in the repository a n d t h e location of the user-defined class C and tJ1e flmction template, perform_some_function, are recorded. From tJ1e infor sponds to the parti c u l a r source file that can success mation stored i n the name- mapping file, an i nstan ful ly instantiate the user's request. Compiling and pre tiation source file is men created i n me repository. l in king the program used in Figure 2 generates an Figure 3 s hows the contents of tJ1e instantiation source i ns tantiation assignment file for main.cxx. This tile contains i n formation concerning the command-line file created to satisfY perform_some_fu nction . The prelinker tJ1en compi les me instantiation source options specified, me user's current worki ng directory, file by i nvoking the compiler in a special directed mode, and a l ist of instantiations m at should be i nstantiated. which directs the com piler to generate code only for Main .cxx now owns the responsibi l i ty of i nstantiating speci fi c template i nstantiations that are l isted on the perform_some_flmction . The prelinker recompiles command l i ne . The compiler then generates the defin tJ1e source fi les, such as main .cxx, tJ1at have changes i n ition of perfonn_some_flmcti o n < C > in the resu lti n g their template i nstantiation assign m ents. The process object fi l e . The resu lting object now satisfies the is repeated until there are no changes made to the instantiation request and is included as part of the i nstantiation assignments. Then the final link can be application's final .l i n k . To build the i nstantiation completed. source fi les easily, the i mplementation of this scheme This approach has the advantage of requiring no generally requires mat template decl arations, template special file structure to support automatic template definitions, and any argu ment types used to instantiate instantiation. It is generally faster and simpler than a class or function template must appear i n separate, McCluskey's approach, because fewer files are com related header files. piled in the generation of the needed i nstantiations The Edison Design Group has developed anomer and the i nstantiations are generated in the context of approach to li nk-time i nstantiation . 7 In this approach, the use r's source cod e . I n addition, the assignment of tJ1e compiler records where template i nstantiations are i nstantiations to sou rce files can be preserved between used and ·where they can be i nstantiated . At l i n k time, recompilations of the source code, so that u n less the a pre l i n ker assigns template i nstantiations by record i ng strucmre ofthe application changes, the needed instanti the assignments in a specially gene rated file that corre- ations \viU be available wimout additional recompilation. I I C_c l ass . h xx: c l ss C { publ i c : II . . . } ; 1 / t empl a t e . hxx templ a c e < C ] a s s T void er form_ s ome_ f nc i on ( T &par m ) ; 1 / t empl a t e . c xx temp l a e vo i d per f o rm_s ome_ func i o n ( T & param l ( } l lma in . c x x h nc l e " C_c l as s . hxx " " emp l a e . h x x " h ncl · de i n t ma i n ( ) { C C; perfo m_some_ unct i on ( re rn 0 ; ) ; Figur e 2 Exam p le of a Li nk -time I nstantiation Sc heme (McCluskey) Digital Technical Journal Vol. 1 0 No I 1998 25 Comparison of Manual and Automatic Instantiation Techniques The manual i nstantiation techni q u es require planning on the part of the user to ensure that needed instantia tions are present, that no extraneous i nstantiations are generated, and that each needed instantiation appears exactly once within the application . Witl1 manual i nstantiation , the user has the advantage of gai ning explicit control over aU template i nstantiations. Almough the strategy of instantiati ng used templates locally requires l ess planning, it does so at the cost of object file size and tl1e restricted use of templates when static data mem bers are present or when static data is defined locally within a function template instantiation. Automatic template i nstantiation provides template instantiation wim no explicit action on the part of the user. Compi le-time i nstantiation requires either spe cific l i n ker support to select a single template instanti ation from potentially many candidates, or support by the compiler to generate i nstantiations i n separate object files while compiling the user's source cod e . Relying on linker support allows t h e compiler t o effi ciently generate i nstantiations at the cost of larger object files; however, tl1e user loses control over which i nstantiation is used in the executable fil e . Although the use of separate instantiation object files usually takes more time at compilation than tl1e linker-support memod, it results in more compact object files and can provide the user wim more control over which instan tiation is used in the executable file. Link-time instantiation provides template instan tiation that is tai lored to the needs of the executable file. The primary cost is l i n k-time performance, since generation of instantiations occurs at link time. Another disadvantage oflink-time instanti ation can be observed when building object-code libraries. Either the library must contain all the i nstantiations that it requires, or the user who wants to link with the u brary must have access to all the machinery to create i nstan tiations. Creating a library's i nstantiations involves extra steps during library construction . All the object files to be incl uded in the li brary m ust be pre l i nked, so tlut the needed i nstantiations are generated. If i nstantiations are i ncl uded i n the i ndividual object files in the library, as in the Edison Design Group approach , unintended modules may be linked from the li brary to provide the needed instantiations. Consider the following scenario, i n whic h object fi l es A and B are i ncluded in tl1e library. Both files require tl1e instantiation of perform_some_function . V/hen these fi l es are preli n ked, the i nstanti a tion of perform_some_fu nction < i nt > is assigned to one of the files, say A . If an application that is being linked against the l ibrary requires that the object file B be linked into tl1e executable, men the object file A is also linked . Here tl1e instantiation needed by B was i nstan- 26 Digiral Tech nical Journal Vol . 1 0 N o . 1 1 99 8 tiated i n A even though the executable never refer enced anything explicitly defined i n file A. This can yield an unnecessarily large executable. In the next section, we review the template i nstan tiation support i n earlier versions of D I GI TAL C++ and then discuss the rationale and design of the auto matic template i nstan tiation facility i n version 6 . 0 of DIGITAL C++. DIGITAL C++ Tem plate I nstantiation Experience As the use of C++ templates has grown, DIGITAL C++ has been enhanced to s upport the need for improved i nstantiation techniques . The i n i tial release of DIGITAL C++ occurred before the C++ standard i zation process had matured, so that the language sup ported was based on The A nnotated C+ + Reference Manual, referred to as the AR.t\1 .8 The ARM defined template fimctionality, but it d id not provide guidance for either manual or automatic template i nstantiation. Thus it was necessary to provide a D I GITAL C++ specific mechanism for template instantiation. DIGITAL C++ Manual Template Instantiation The #pragma define_template directive and the instan tiate all command - l i ne option, -defi ne_tem p l a tes, have been supported since the initial release of DIGITAL C++. In Figure 4, tl1e define_template pragma directs the compi ler to instantiate class template , C, with type i nt. When the compiler detects the use of the pragma, it creates an i n ternal C type node and traverses the list of static data members a nd member fu nctions defined within tl1e class. If the defin itions of these members are present at tl1C point me pragma is speci fied, the compiler material izes each with type int. As the C++ language developed and template usage increased, users found manual template i nstantiation to be very labor i ntensive and req uested an automated method. DIGITAL C++ Version 5.3 Automatic Template Instantiation Automatic template i nstantiation capability became a serious issue d uri ng the planning stages of DIGITAL C++ version 5 . 3 . The use of templates was i ncreasing rapidly, and many new thi rd-parry libraries, such as Rogue Wave Software's Tools.h++, contained a signif icant use of templates. Due to this growing need, the requirements were straigh tforward. The support had to be easy to use, have a short design phase, be quickly implemenrable on both the DIG ITAL UNIX and the OpenVMS platforms, and provide reasonable perfor mance. Because McCluskey's approach had been used in several implementations, it presented i tself as our best option. emp l a t e p < class lass T c { bl i c : •nc 2 { T p ) ; v o i d mem_f n c 1 { T p ) ; vo i d mem_f ); t mp l e t empl a te lrprag a cl a s s T > vo i d C : : mem_ E unc l ( T p ) < c l a s s T > vo i d C : : mem_f unc 2 ( T p ) II II . . .l ...l de f " ne_ e mp l a t e C < i n t > Figure 4 The define_template Pragma DIGITAL made two major changes to McCluskey's approach to take advantage of the D I G ITAL C++ compiler design . First, we al lowed i nstantiation source files to be created at compile time instead of l i n k ti me. This eliminated the need for McCluskey's name- mapping fi le and simplified the prelinkin g process considerably. Since t h e needed source files existed i n the repository, there was no need to decon struct the required template insta n tiations to deter mine their arguments and types. The second change addressed the transitive closure problem . Figure 5 shows an example of the class tem plate B uffer being instantiated with the user-defined type C. After compilation of app.cxx with the McCluskey B { . I I B_c l a s s . h xx class II . . approach, the name-mapping file contained definition locations of class B and class C. However, it did not con tain any indication that class C had a data member that relied on the definition of class B . From the information in the name-mapping fil e, the pre linker then created an instantiation source file that included only C_class.hxx, Buffer.hxx, and Buffer.cxx. When this instantiation source file was compiled, an error resulted complaining that B is an undefined type whose size is unknown . We solved this problem in D IGITAL C++ version 5 . 3 by i ncluding all the top-level header files incl uded by the current compilation unit in any i nstantiation source files created. This ensured that B_class. h xx wou ld be included in the generated i nstantiation file. class C { J; I I C_c l a s s . hxx bl i c : B da t a_mem ; p ); I I B u f f er . h xx emp l a t e < c l a s s T> c lass i n t num_o f_i t ems ; T * { l i B f fer . c xx templ a t e < C l a s s T> v o i d Bu f fer T> : : a dd_i t e m ( T * p ) { ) u f f er ; p bl i c : vo " d add_i t em ( T ) ; Bu f f e r II . . . *) ; II . . . l l app . cxx # i n c l ude " B_c l ss . hxx" " Bu f f er . h xx " � · nc lude " C_c l a s s . hxx " N i n c lude { vo i d C f ( vo i d ) c; B f fer< C > c_bu f f er ; c_ bu f fe r . a dd_ i t em ( & c ) ; Figure 5 I nstantiation of the Class Template B u ffer Digital Technical Jouriul Vol. 1 0 No. l 1 99 8 27 Despite the fact that this type of automatic link time instantiation scheme was bei ng widely used in the i nd ustry, the results of using a modified McCl uskey approach were m i xed . S troustrup has described the general problems with McCl uskey's approach.9 We found that our implementation suf fered particularly from poor l i n k - time performance and so did not satisfy our users' needs. DIGITAL C++ Version 6.0 Automatic Template Instantiation DIGITAL C++ version 6 . 0 is a complete reimpJeme n tation o f DIGITAL C + + , with emphasis o n ANSI C++ conformance. It is implemented using a completely new code base, which includes the i ndustry -standard C++ tl·ont end from the Edison Design Group and a standard class library from Rogue Wave. From our experience with templ ate i nstan tiation in DIGITAL C++ versions 5 . 3 through 5 . 6 , we con cluded that the most i mportant issue that should be add ressed in the design and implementation of the a u tomatic temp late instantiation facility was the compile- and link-time per formance. The primary goal w:ts to have the performance of automatic tem plate i nstantiation su bstantially exceed the perfor mance of version 5 . 6 . Another important goa l was to remove the restri ction of template declaration and defin ition placement i n header files. In :�ddition, the automatic template instantiation facility in version 6 . 0 h a d ro b e culturally compati ble with the previous i mplementation . The user had to be able to move sources and objects to different di rectories, easi ly build archived and shared libraries, share instantia tions between various applications, and have error diagnostics reported at the earliest possible moment in the i nstantiation process. Design and I mplem entation We decided to use a compile-time instantiation model as the basis for our implementation . Since we were using the Edison Design Group's front end, we seriously considered using their link-time mod e l . However, the compi le time model seemed advantageous tor several reasons. First, there are significant complications ( as described in the section Comparison of Manual and Automatic I nstantiation Techniques) when trying to build l ibraries with a compiler that uses the Edison Design Group link-time m odel. In addition, the link-ti me model requires recompilations that limit performance in many typical cases of template use. We recognized that the link-time model could provide better pertor mance in some cases, but these would be i n the minor ity. Finally, the implementation of the link-time model would req uire su bstantially more implementation eftort on the Open VMS platform . The version of the Edison Design Group front end being used to build DIGITAL C++ version 6 . 0 required tools to scan a 28 Digir�l Tec hnical Journal Vol . 1 0 No. l 1998 user's object fi les tor i n formation concerning which mod ules could instantiate requested templates. Similar functionality would need to be implemented for the OpenVMS platform . We preserved the concept of the te mpl ate reposi tory as a d irectory that contai ns the i ndivid ual tem plate i nstan tiation ob;ect files. The repository stores one object fi le tor each templ ate fu nction , mem ber fu nction , static data member, and virtual table that is generated by a u tomatic template instantiation . The file name of the instantiation object file is derived from the name of the instanti ation 's external n ame. At com pile time, the front end generates i n termed iate code for aJI templates that are needed in the compilation unit and can be instantiate d . A tree walk is pedorrned over the i n termediate code to find all entities that are needed by each generated template instantiation . The code generator is cal led to generate cod e for the user speci fied object ti le and is then called repeatedly for each template i ns tantiation to generate t he insta n tia tion object fi les in the repository. The compiler generally considers an instantiation to be needed whe n it is referenced from a context that is itself needed, such as in a function with global visibility or by the initialization of a vatiable d1at is needed . Virtual member fi.mctions are needed when a constructor for the class is needed . Thus, ail virtual .fi.mction definitions should be visible in a compilation unit that requ ires a constructor for d1e class. Each instantiation d1at is gener ated ''�th autom:.1tic instantiation is marked as potentially being in its own object file i n the repository. The i n termediate representation of each generated instantiation is walked to determine what other entities it references. At t his point, the i nstantiation is a candi date to be generated in its own object fil e , but it can sometimes be generated as part of the user-specified object file. If the i nstantiation references an entity that is local to the compi lation unit, such as a static fu nc tion, a n d that local en tity is nonconstant and statically initial ized , the instantiation is merged into the user specified object fi le rather than generated in its own object file. As an :�lternative, we could have chosen to change the loc:tl enti ty i nto a global enti ty with :-� u nique name and generate the instantiation in its own object file. We chose not to do this in order to make i t easier t o share a repository between applications. With this alternative, the instantiation in the repository requires the object file contai ning d1e local entity's def inition, which may be i n another application . Note that any application that contains more d1an one definition of the same instantiation that references a nonconstant local enti ty is a nonstandard -conform ing application. This is a violation ofd1 e one definition rule w Consider the followin g code fragment: s t a t ic int j ; templa e vo i d f u n c ( T a r ) { s a ic i n t coun: = 0 ; pt- i n _co n t { " co n ::. " , count + + ) ; The fi.mction, print_count, is defined i n the sou rce file :m d generated as a defined function in the user specified object file. The template function, fu nc, refer ences the function, print_count. When the code for fi.mc is generated i n its own object file, the rderence to print_count m ust be changed from a rderence to a defined h.mction to a reference to an external function. By default, each needed instantiation is generated by every compilation that requires the instantiation . This is the safe default because it ensures that instantiations in the repository are up to date. However, there will prob:�bly be some compilation overhead fi-om regener ating instantiations that may already be up to date . We believed that the overhead of regeneratin(T b instantia. nons would typically be relatively smaJ I . For applications with a high overhead of i nstantiation , such as a large number of source files using the same large n u m ber o f template i nstantiations, w e provided a compila tion option to control the generation of template i nstantiations to improve compile-time performance. The generation of i nstantiation object files only when they are actually required is a difficult problem . Fine-grain dependency information would have to be kept for each i nstantiation object file. Such depen dency information would need to rdlect those fiJes that are required to successfully generate the instantiation and record which command- line options the user speci fied to the compiler. vVe suspected that the overhead involved with gathering and checkjng the information might be an appreciable percentage of the time it wouJd take to do the i nstantiation , and thus it would not give us the performance improvement that we wanted. Instead, we decided to provide an option that allows the user to decide when i nstantiations are generated . We rder to this as the template time-stamp option, -m mestamp. When using the time-stamp option, the compi ler looks 111 the repository for a file named TIMESTAl\1 P . If the fi le is not found, it is created. The modification time of this ftle is referred to as the time stamp. When generating an instantiation, the compi ler looks i.n the repository to see if the instantiation object file exists. If it does not exist, it is generated . If the file already exists, its modi fication ti me is compared to the time stamp. If the modi fication time is later than the time stamp, the i nstantiation is assumed to be up to date and is not regenerated . Otherwise, the i nstantia tion is generated. The user can control the generation of instantiation object tiles by changing the modifica tion time of the TIM ESTAMP fil e . The ti me-stamp option wou ld typical ly be used in a makefile or a shell script that compiles and builds an entire application. Before i nvoking make or the shell script, the user would make certain that no TIMESTAMP file resided in the repository. This would ensure that each needed i nstantiation would be generated exactly once duri ng all the compilations done by the build procedure. Much of the C++ linker support in version 5.6 was reused with only minor mod ifications for version �.0. The compiler is presented with a single repository mto whtch the instantiation object fi les are written . Multiple repositories can b e specified at link time, and each can be searched for i nstantiations that are needed by the executable tile. The linker is used in a tria l link mode to generate a l ist of a l l the unresolved external r �ferences. This list is then used to search the reposito nes to find the needed i nstantiation fiks, and tl1e process is repeated u n til no more instantiations are needed or can be satisfied from the repository. The lmk then proceeds as any normal li nk, adding the l ist of tnstantiation object files to the l ist of object tiles and libraries as specified by the user. If a vendor is cre:�ting a l ibrary rather tl1an an exe cutable file, the i nstantiations needed by the modules in the _li brary can be provided in either of two ways: ( 1 ) The hbrary vendor can put the needed i nstantiations in the libra:y by adding tJ1e files in the repository to the hbrary hle. ( 2 ) The li brary vendor can provide the repository with the l i brary and require that l i brary users lmk WJth the repository as wel l . Note that instan tiations pl aced in the library :u·e fixed when the l i brary IS created . Smce the library is included in the trial link of an application, any instantiation i n the library takes precedence over the same named instantiati a"n i n a repository. In a number of tests, DIGITAL C++ version performance over version 5 .6 . We tested a variety o f user code samples that use tem plates to varying degrees and found that build times tor version 6.0 decreased substantially compared to tl1e version 5 . 6 compi ler. Examples of two typical C++ applications used in our tests are the pu blicly avail able EON ray-tracing benchmark and a subset of tests from our Standard Template Library (STL) test suite. For Resu lts 6.0 showed improved D i gital Technical Journal Vol . 10 N o . I 1 998 29 the EON benchnurk, the b u i ld ti me for version ture of the ti l es used to generate the i nstJn tiati o n . For reduced to example, if the user speci fied Jn i nc l u de d i rectory 6.0 was 28 percent of the build time tor version 5 .6 . For the STL tests, t h e b u i l d ti me tor version 6 . 0 was reduced to 1 9 percent of the b u ild time fo r version 5 . 6 . of old_i n c l u d e on the i nitial compibtion and later specified J.n i ncl ude d i rectory of new_i n c l ude, this The n u m ber o f fi les i n the repository also d ecreased approach wo u l d not recognize that d i ffere n t fi les were signiti cm tly because version being i n c l u d e d . 6.0 generates only i nstan tiation object fi les i nstead o f the i nstan tiation source, Another approach to i m prov i n g application b u i l d com m a n d , dependency, and object files of\-crsion 5 .6 . performance i s t o sup port a b u i l d fa cil i ty t h a t can For EON, the version files make use of te mplate i n f(m11 J tion in determining com pared to 6 . 0 repository contained 8 8 260 fi l es i n version 5 . 6. d ependency. C u r rently, each user-spec i fied object fil e U s i n g the ti me-sta m p option, b u i ld ti me tor the i s dependent o n :� I I the i nc l uded fi les nece ssary to EON bench mark was red u ced by on l y 5 percent co m create i nstantiation object fi les f( >r te m p l ate req uests. pared to t h e dcfJ u l t i nstanti a tion strJtegy. The real When a change is made to a te mpbte d e fi n ition, all the benefit of the ti me -stamp option comes w i th appl i c a sources that reference the te mpl ate need to be reco m tions t h a t u s c t h e same te mplate i nstantiations i n many p i l e d . A b u i ld fac i l ity designed to be sensitive to te m comp i l ation u n i ts . For example, in one user's test case, plate i nstJntiati o n cou l d de tect t h a t a cha nge i n the build times d ropped from roughly 1 8 hours with the template d e fi n ition was l i m i ted to the i nstantiation d e fa u l t i n stantiation to object file. It could t h e n i nstruct the compi ler to sup 3 h o urs w h e n using the time stamp option. press the regeneration of o bject fi les tor sou rce fi les I n tl1e next secti o n , we conclude our paper with a dis that are only b e i ng recompi led due to the ci1Jnge in cu ssion of fu rtl1er work that can i m prove the perfor the te m plate i n stanti ation . S u ch a f.1 ci l i ty could also mance and usability of a u tomatic template instantiation. suppress the reco m p i i J tion o f any source fi l e thJt Future Research that were already regenerated . We conti n u e to i n vestigate approaches a nd tech niq ues can pertonn better i n some cases than the compile-time to i m prove tl1e usJb i l i ty and performance of the a u to approac h , we Jre i nvestigating the l i n k - time i n st:�ntia matic template i nstantiJtion facility. Optimal usJbility tion mod e l as a user option. wou l d only reproduce the changes to i nstantiations Because we recognize that l i nk-time i nsta n ti:�tion and performance would seem to require a development environment completely i n tq!;rJted for C++. This envi Finally, we conti n u e to look a t ways to red u c e the cost of generati ng each i nsta ntiation . For example, by ronment wo u ld keep trac k of all entity definitions Jnd default the compi l e r compresses the generated object usage rm with semantic i n formation embedded • For C++, expanding te mpl ate classes and fu nctions into their individ uaJ insta nces • SimplifYing h igh-level l an guage constructs i n to a form acceptable to the opti mi zation p hases • Converting the abstract represen tation to a differ ent a bstract form acceptable to an opti mizer, usu ally called an i ntermed iate language ( I L) • Expand ing some low- level functions inline i nto the contex t of their callers • Performing mu ltiple optim ization passes involving an notation and transformation of the I L • Converti ng the I L to a form symbolically represent ing the target machine language , usually called code generation • Performing sched uling and other opti mi zations on the symbolic machine l anguage • Converting the symbolic machine language to actual object code and writing it onto disk In modern C and C++ compi lers, these various i nter mediate f(xms are kept e n tirely in dynamic memory. Although some of these operations can be performed on a fu nction-by-fu nction basis with in a modu le, it is sometimes necessary for at least one intermed iate form of the module to reside in dynamic memory in its entirety. I n some instances, it is necessary to keep mul ti ple tonns of the whole mod ule simultaneously. This presents a diffic ult design chaJ le nge : how do we compile large programs using an acceptable amount of virtuaJ and physical memory? Trade-offs c hange con stantly as memory prices dec l ine and pagi ng a lgorithms of operating systems change. Some optimizations even have the potential to e xpand one of the intermediate representations into a form that grows faster than the size of the program ( 0( n x log( n ) ) , or even 0( n 1 ) ) . I n these cases, optimization designers often limit the scope of the transformation to a su bset of an i ndividual function (e.g., a loop nest) or use some other means to artificial ly l i mi t the dynamic memory and computation requirements. To allow additional headroom, upstream compiler ph ases are designed to eliminate un necessary portions of the module as early as possi ble. In ad d ition, the memory ma nagement systems are designed to allow i n ternal me mory reuse as e ffi ciently a s possib l e . For this reaso n , compi ler design ers at Compaq have genera l l y preferred a zone-based memory management approach rather than e ither a mal l oc- based or a garbage-col lection approach. A zoned memory approach ty pical ly allows a l location of varying amou n ts of memory i nto one of a set of identified zones, fo l l owed by deallocation of the e n ti re zone when all the i ndivi dual al locations are no longer n eeded . Since the source program is repre sen ted by a su ccession of i n ternal represen tations in an opti mizing compi ler, a zoned - b ased memory manage ment system is very approp riate . The main goals of the design are to keep the peak memory use below any artificial limits on the virtual memory avai lable for all the actual source mod ules that users care about, and to avoid algorithms that access memory i n a way that causes excessive cache misses or page taul ts. Templates are a major new teature of the C++ la nguage and are heavily used i n the new Standard Li brary. I nstantiation of templates can domin ate the compile time of the mod u les that use them . For this reason, template instantia tion is undergoing active study and i mprovement, both when compi ling a mod ule for the first time and when recom piling in response to a source change. An i mproved technique, now widely adopted , retains pre compiled i nstantiations in a l i brary to be used across compil ations of multiple mod u les. Te mplate i nstantiation may be done a t either com pile ti me or during l i n k ti me, or some com bination . ' D I G I TAL C++ h a s recently changed from a link- time to a com pi le-ti me model for improved i nstantiation performance . The i nstanti ation time i s generally pro porti onal to the nu mber of tem plates i nstanti ated , which is based on a command-line swi tch speci fication and the ti me req u i red to instantiate a typical te mplate. Te m p late Instantiation Time for C++ Digital Tcchniol Journal Vo l . 1 0 No. 1 1 998 35 Run- Time Performance Metrics We use automated SC!ipts to measure run-time perfor mance tor generated code, the debug image size, the pro duction image size, and specific optimizations triggered . R u n Time for Generated Code The run ti me for gen erated code is measured as the sum of user and system time on UNIX required to r u n an executable image. This is the pri mary metric for the qual ity of generated cod e . Code correctness is also valid ated . Comparing run times tor s lightly differing versions of synthetic benchm arks al lows us to test su pport for specitic opti mi zations. Performance regression testing on both synthetic bench marks and user applications, h owever, is the most cost-effective method of preventing per formance degradations. Tracing a pe rrormance regres sion to a specific compiler change is often d i fficu lt, but the earlier a regressio n is detected, the easier and c heaper it is to correct. Debug I m age Size The size of an i mage compiled with the debug option selected during compilation is mcJ.sured in bytes. It is a consta nt struggl e to avoid bloat caused by unnecessary or red u ndant i n formation req u i red for sym bolic debuggi ng su pport. The size of a prod uction ( optimized , with no debug i n tonmtion ) Jppl ication i m age is measured in bytes. The use of optimi zation techniq ues has historical ly made this size smal ler, but modern RISC processors such as the Alpha micro processor require optim i zations that can in crease code size su bstantial ly and can lead to excessive i mage si zes i f the techniq ues are used indiscri mi nately. Heuristics used in the optimi zation algorithms l i m i t this size impact; however, su btle changes in one part of the opti mizer can trigger unexpected size increases that aHect I -cache performance. Production Image Size In J m u l tiphase opti mizing compi ler, a specific opti mization usua l ly req ui res preparatory contributions from several upstream phases and cleanup from several down stre;�m phases, i n addition to tbe ;�ctual transforma tion . In this environment, an unre l a ted cha nge in one of the upstream or downstream phases may i n terfere with a data structure or violate an assumption exploi ted by a downstream ph ase and thus generate bad code or su ppress the optimizations. The genera tion of bad code can be detected qu ickly with auto mated testing, but opti m i zation regressions are much harder to fi n d . For s o m e opti mizations, however, it is possible to write test programs that are clearly represe n tative ;� nd can show, either by some kind of d um p i n g or by compar;�tive performance tests, when an i m p le mented opti m i zation fai ls to work as expected . One Specific Opti m i zations Triggered 36 Digit:ll T�chnicJI Journal Vo l 10 No. 1 1 99 8 commercially avaiL1ble test suite is called N U L LSTONE ,'' and custom-wri tten tests are used as wel l . In a collection of such tests, the total n umber of opti mizations implemented as a percentage of the total tests can provide a usefu l metric. This metric can indi cate if su ccessive com p i l er versions have improved and can h e l p in comparing opti mizations imple mented in compilers from difterent vendors. The opti mizations that are indicated as not im plemen ted provide useful data for guiding fu ture development effort. The app lication developer m ust always consider the compile-time versus run-time trade-off. I n a wel l designed opti m i zi ng co m p i l er, longer compile times are exchanged f(Jr shorter run times. This relationship, however, is fa r from l i near and depends on the i m por tance of pertorma nce to the application and the phase of deve lopment. During the initial code-deve l opment stage, a shorter compi le time is usefu l because the code is compiled often . During the production stage, a shorter run ti me is more im port r fi-eq u e n t T h e tools we use for comp i l e -speed a n d r u n - ti m e DCPl o u t analysis are considerably more soph isticated tha n t h e cache m i sses c a n b e i d e n ti fied fi·o m t h e measure ment tools. They are ge n e ra l l y p rovid e d b y put, whereas they may not a l ways be obvious from the CPU design or i n g a s wel l a s com p i l e r i m provements. VVe h ave used the fol lowi ng compile -speed analysis tool s : • The compi l er's i n ternal -show s tat i s t i cs feature gives a crude meas u re of the time req u i red to r each compi l e r phase. D i giral Tec h n ical J o u nul cas u a l l y observing the m a c h i n e cod e . operati n g system tools develop ment groups and are wi d e ly used for a pplication tu n 40 We analyze the the detaile d log fi l e . This l og identi fies the pro b l e m compi l e - t i m e measure m e n t , the d e fault, debug, and ously d iscussed . ! PROB E tool as d escri bed a bove , to the r u n - ti m e b e havior o f the test program rather than of t i m e s , writes t h e res u l ti n g ti m i ngs t o a fi l e . Post ( average ti mes, deviati ons, a n d fi l e s i zes) a n d compare \Ve apply h i prof a n d gprof i n combinati o n , and the a n d , after com p i l i n g t h e sou rce t h e speci fi e d n u m be r processi n g scripts eva l u ate the usabi l i ty of the resu l ts ! P RO B E too l , w h i c h can provide i nstruction- by - i nstruction details about the e x e c u mation about the execution can be captured to system t i m i n g packages. For com pile-time measure VVh e n t h e pro b l e m needs t o b e p i npoi nted more accurately than is poss i b l e with these profi l i ng cutable i mage i n some man ner, so that e no u gh i n f(x ment tools u s i n g scripts l ayered over standard operating a specific area of compi ler source . Once this i n f(xma Vol . l O N o . 1 1 998 • Final ly, we use t h e estimated schedule d u m p and statistical data optionally generated by the G EM back e n d . 1 This d u mp te l l s us how i n structions are sched u led and issued based on the processor arc h i tecture selecte d . I t may also provi d e i n formation a bo u t ways to i mprove the sched u l e . In the rest of this section, we discuss three examples of applying analysis tools to problems identified by the performance measurement scripts. called by esc . Since these components are included in dle G EM back end, the problem was fixed there. Run-Time Test Cases Compile-Time Test Case Compile-time regression occurred after a new opti mization called base components was added to tbe GEM back end to i mprove the run-time performance of structure reterences. Table l gives compile-time test results that compare the ratios of compile times using the new opti mized back e nd to those obtained with the older back end . The resu I ts for the iostream test indicate a significant degradation of 2 5 percent in the compile speed for optimize mode, whereas the perfor mance in the other two modes is unchanged . To analyze this proble m , we built hi prof versions of the two compilers and compiled the iostream bench mark to obtain its compilation profile. Figures l a and l b show the top contributions in the flat hi prof pro fi les from the two compilers. These profiles i ndicate that the nu mber of calls made to esc and gem_il_peep in the new version is greater than that of the old one and that these cal ls are responsible for performance degradation . Figures 2a and 2b show d1e cal l graph profiJes tor esc for the two compilers and show me calls made by esc and the contri butions of each component For the run-time analysis, we used two d i fferen t test e nviron ments, the Haney kernels benchmark and the NULLSTONE test nm against gee . Haney Kernels The Haney kernels benchmark i s a synthetic test written to examine the performance of specific C++ language features. In this run-ti me test case, an older C++ compiler (version 5 . 5 ) was com pared with a new compiler u nder development (version 6 . 0 ) . The Haney kernels results showed that the ver sion 6.0 development compiler experienced an overall performance regression of 40 percent. We isolated the problem to the real matrix multiplication fu nction. Figure 3 shows the execution profile for this fu nction. We then used the DCPI tool to analyze perfor mance of the inner loop instructions exercised on ver sion 6 . 0 and version 5 . 5 of the C++ compi ler. The resulting coun ts in Figures 4a and 4b show that dle version 6.0 development compi ler su ffered a code scheduling regression. The leftmost column shows the average cycle cou nts for each i nstruction executed. The reason for th is regression proved to be that a test Ta b l e 1 Ratios of CPU (User a nd System) Com p i l e Ti mes (Seconds) of the New Com piler to Those of the Old Com p i l e r F i l e Name Debug Mode Options Default Mode Optimize Mode - 04 - gO - 00 - g a 1 a mch2 0.970 0.970 0.930 col l evol 0.9 1 0 0.780 0.740 d_i n h 0.970 0.960 0.960 e_rvi rt_yes 0.970 0.980 0.960 i nterfaceparticle 0.880 0.790 0.730 iostream 0.990 0.980 1 .250 pistream 0.890 0.760 0.790 t202 0.970 0.970 1 . 1 30 t300 0.980 0.960 1 .040 t601 1 .0 1 0 1 020 1 .0 1 0 t606 1 .000 1 . 020 1 .020 t643 1 .020 1 .0 1 0 1 .000 test_complex_excepti 0.960 0.890 0.830 test_compl ex_math 0.970 0.950 0.950 . test_ demo 0.950 0.830 0.780 test_generic 1 .000 1 .020 1 . 1 00 test_task_q ueue6 0.970 0.920 0.960 test_task_rand 1 0.950 0.890 0.890 test_vector 0.970 0.920 1 . 1 20 vectorf 0.890 0.790 0.850 Averages 0.961 0.920 0.952 Digital Technical Journal Vol. 10 No. I 1998 41 g ranu l ar i ty : % t ime c yc l e 2 .8 cumu l a t i ve seconds 1 . 37 2.6 2 . 66 2.4 2.3 3 . 93 2.6 6 . 23 5 . 09 uni t s : seconds ; to tal : 4 8 . 9 6 seconds sel f seconds se l f tota l cal l s m s / ca l l 1 . 29 1 . 27 ms / ca 1 l 1 . 37 10195 0 . 13 0 . 13 515566 0 . 01 0 . 00 gem_f i _ud_acces s_resource 1 . 17 481891 0 . 00 0 . 00 gem_vm_get_nz 713176 0 . 00 0 . 00 _OtsZero 2 1 9 607 1 . 14 0 . 01 0 . 00 name cse gem_j l_oeep ( 3 1 ] [12 ] [37] [75) [67] (a) HiprofProfile Showing Instructions Executed with the New Compiler granu l a ri ty : cyc l es ; un i ts : c� ime c umu l a t i ve 3 .0 0 . 83 seconds ?. . 7 1 . 58 2 . 71 3 . 14 1.7 1.6 2.5 2 . 26 seconds ; t ocal : 2 7 . 4 9 seconds sel f total ca l l s ms / c a l l ms / ca l l 0 . OJ 614 3 5 0 0 . 01 0 . 00 se l f 0 . 83 0.75 143483 seconds 0 . 00 0 . 68 0 . 45 4 65634 8664 0 . 08 0 . 08 0 . 00 0 . 43 423144 0 . 00 0 . 00 0 . 00 gem_i I _peep name O t s z er o cse [ 1 6 ] _ [40] [64 J [36) gem_ f i _ud_access_resource g em_vm_g e c _n z [86] (b) Hiprof Proftle Showing Instructions Executed with the Old Compiler Figure 1 H i prof Profiles of Compilers for pointer disamb iguation outside the loop code was not performed properly in the version 6.0 compiler. The test would have ensured that the pointers a and t were not overlapping. We traced the origin of this regression back to the intermediate code generated by the two compil ers. Here we found that the version 6.0 compiler used a more modern form of array address computation i n the i n termediate language for which the scheduler had not yet been tuned properly. The problem was fixed i n the scheduler, and the regression was eliminated. [ 12 ] 14 . 1 1 . 37 2 . 63 0 . 63 0 . 59 5 . 55 0 . 32 0 . 34 I n itial N U LLSTONE Test Run agai nst gee We measured the performance of the DEC C compiJ er in compi ling the NULLSTONE tests and repeated the performance measurement of t he gee 2 . 7 . 2 compiler and libraries on the same tests. Figures Sa and Sb show the results of our tests. This comparison is of interest because gee is in the public domain and is widely used , bei ng the primary compiler available on the public-domain Linux operating system . Figure Sa shows the tests i n which the DEC C compi ler performs a t least 10 per cent better than gee. Figure Sb ind icates the optirniza- 134485 / 1 3 4 4 8 5 cse 1 3 4 4 8 5 / 1 3 r. 4 8 5 update_operands 1 2 1 2 4 3 / 1 2 12 4 3 Les L_ for_i n d u c t i on 10 195+9 95 102760 / 1 02 7 6 0 1 2 1 2 7 / 12 1 2 7 t e s t_for_c se [12] [ 42 ] [ 13 pu s h_ e f e c t gem_df_mo ve [ 92] 6J [ 97 ] [ 149 1 (a) Hierarchical Profile for cse with the New Compiler [ 16 ) 10. 5 0 . 68 8 6 64 + 7 5 9 3 2 . 19 1 . 04 96554 / 96554 t e s t- for_c s e 0 . 30 66850 / 66 8 50 t es t_for_ i nduc t ion 0 . 29 9 6 5 54 / 9 6 5 5 4 upd a t e_operands 0 . 12 87 1 7 6 / 8 7 1 7 6 move 0 . 09 7863 !7863 cse [16] [ 215 ] pop_e f fec t ( b ) Hierarchical Profile for cse with the Old Compiler Figure 2 H ierarchical Call Graph Proti les for esc 42 Digital Tech n ical journal Vol . 1 0 N o . l 1 998 [ 56 ] [267] 1 1 04 ] [ 1 06 ] void �1 lHC ( Real • t , Rea l * a , Real * b . i n M , con s t in nna cons const cons References and Notes N, con s t i n � K l int i , j , k ; Rea l emp ; memse l l t , 0, H • N * s i z o f ( Rea l l l ; for- ( j � 1 ; j < = N; j • I { for l k - l ; k c� K ; k + + ) ( tern = b [ k - 1 ,. K * I j 1) ] ; i f ( temp ! = 0 . 0 ) { E r l i - l ; i <= M ; i H I t [ i - 1 • �l • ( j - 1 1 l + ernp * a { · - 1 - H • I k • 1. D. B lickstein et a l . , "The G EM Opti mizing Compi ler System," D(t5ital Tc'ch n ica ! Jou rnal, vol. 4, no. 4 ( Special issue, 1 99 2 ) : 1 2 1- 1 36. 2. B . Ctlder, D. Gru nw�ld, ;md B . Zorn, "QuanritYing Beh�vioral Difkrences Berween C and C++ Programs," journal 3. ll J ; Haney Loop r()r Real Matrix Nlu l tiplication 6. N U L LSTONE Optimization Categories, U RL : h ttp:/ /w>vw. n u l l srone . com/h t m l s/category. h t m , Nullsrone Corporation, 1 990- 1 99 8 . 7. ]. the machine code generated fo r those test cases. In this Orost, "The Bench++ Benc hmark Su ite," December A drati: paper is available at http:/jwww . research .a tt .com/-orost/bench_pl us_plus/paper. h tml. 1 2 , 1 99 5 . 8. regressions were caused by the use of a n ou tmoded ( -st 0 ) for DIG ITAL U N I X environ ment. After we retested with the - ansi_al ia s option , these regres sions disappeared. i n vestigated and fi xe d regressions 10. A . Eustace a n d A. Srivast:w a, " ATOM : A Flexi ble I nterface for Bui lding H i gh Performance Program Analysis Tools," Western Research Lab Technical Note TN- 44 , Digital Equipment Corporation, July 1 99 4 . 11. A. Eustace, "Using Atom i n Computer Architecture Teaching and Research," Co mpu ter A rchitect/Ire Technical Corn m ittee Neil 'sletter I EE E Computer Society, Spring 1995 : 28- 3 5 . 12. J . Anderson e t a l . , "Continuous Profiling: Where Have All the Cycles Gone?" SRC Technical Note 1 99 7 - 0 1 6 , Digit::t l Equipment Corporation, July 1997; ;t lso in A CM Tra nsac ti o n s 0 1 1 Computer Svstems. vol . 1 5 , no. regressions, which were too d i fti c u l t to fi x w i t h i n the to the issues list with appropriate priorities. Conclusions The measurement and analysis of compi ler performance has become an i m portant and demanding fi e ld . The CPU architectures and the 4 ( 1 99 7 ) : 3 5 7-39 0. addition of new features to languages require the devel opment and i mplementation of new strategies fo r test Mass . : Digital Equip ment Corporation, 1 99 5 ) . in existing sche d u l e for the current release, were added C++ Benchnurks, Compari ng Compi ler Performance, U RL: h ttp:/jwww.bi .com/index.html, Kuck <111d Associates, Inc. ( KAI ), 1 9 9 8 . 9 . A TOJiif.o User Mwtllai ( Maynard, i nstruction com b i n i n g a n d i f optimizations. O ther i ncreasing complexity of D . Detlefs, A. Dosser, and B . Zorn, " M emory AJ ioo tion Costs in Large C and C++ Programs," Sojitl'are Practice a n d l:..,perience, vol. 24, no. 6 ( 1 9 94 ) : A . Itzkowitz and L . Folt:111, "Au tomatic Te mplate I nstanri:J tion in D I G ITAL C++," Digital Techn ical Journal. vol. 1 0 , no. I ( this issue, 1 9 9 8 ) : 22-3 1 case, the alias optimization portion showed that the a l so ( 19 94 ): 5. We i nvestigated the i n divid u a l regressions by look· We 2 P. Wu :md F. Wang, "On the Efficiencv and Optimiza tion of C++ Programs," Sojiu>are Practice and Experi ence. vol . 2 6 , no 4 ( 1 9 9 6 ) : 4 5 3-4 6 5 . ing at the detai led log of the r u n and then ex a m i n i n g DEC C i n the Lcmi� uages, 4. DEC C compiler shows 10 per cent or more regression compared to gee. standard " as the d efa u l t l a n guage d i alect Pru,f5 rct nuning 527-542 . Figure 3 tion tests i n which the of 3 1 3-35 1 these challenges. Our systematic ti·a mework tor com J. Dean, ) . H icks, C . W::tldspurger, W. Wei h l , and G. Chrysos, "Proti leMe: Hardware Support for Instruction Level Profiling on Out-ofOrder Processors," 30th Sym posium on Microarchitecrure ( Mi cro- 3 0 ) , Raleigh, N.C., December 1 997. piler performance measurement, analysis, �md prioriti 1 4 . G'u ide to /PRO/Jt·. lustct!linp, and r :, ing ( M aynard, 1 3. ing the perf(xma nce of C and C++ compilers. By employi ng en hanced measurement and analysis tech niq ues, tools, and benchmarks, we were able to address zation of improve ment opportunities should serve as an excell e nt st:u-ting poi n t for the practitioner i n :� situation in which simil:�r req u i rements :u-c im posed . Mass . : Digital Equipment Corporation, 1 9 94 ) . 15. B. Ke rn inghon �nd D . Richie, The C Progra mm iug Lang u age ( Englewood Cliffs, N . J . : Prentice- H a l l , 1 978 ) . Digital T.:chnictl Journal Vol . 1 0 N o . l 1 998 43 tm t.l lHC_X P t PC t PC ' i i : 181 xl 2 0 0 � 4 8 9 4 70 Ox l 2 0 0 1 4 8 9 8 62 4 33 l ds 1 '1 $£1, zero , 1 0 ( t5 1 0 ( L6) 8( 61 O xl 2 0 0 4 8 9c 6 O x l 00 4 8 a 0 0 : 8 9 4 6 0 00 0 lds SflO , 0 : 580 1 10 4 1 mul s Ox12001 48a4 0 : 4 7e 6 0 4 1 2 bi s $ f0 , $ t l , S t l a2 z e ro , L 0 :< 1 0 : 40 1 0 0 0 1 4 8a8 3 0 5 8 0 :< 1 2 0 0 1 � 8 a c 15 O x 1 2 0 0 1 4 8 b0 0 7265 0 a dl 0 : 2 0 c6 0 0 1 0 t4 , lda <:5 , cmp l e t4 , L7 , lda L6 , :6 ( t6 ) 0 :.; 1 0 : 5 9 4 1 1 0 0 ... add s $ fl O , ts fl, 0 0 2.� 8b8 0 : 9826 f f f 0 : 8 967£ f4 'J x 1 - D 0 1 4 8c 4 :<1 0 0 1 4 8 c 8 0 :< 1 2 0 0 4 8 cc 1 L 8 8 O x1 2 0 01 4 8 d0 3 2 5 O x l 2 o : 4 8 d4 Oxl 2001 48 8 Ox 1 2 0 0 1 4 8d c 1 286 2 Oxl2 L Ox1 2 0 0 1 4 8 e4 87 : 8986f f f4 : 58 b Ox1 2 0 1 4 8e8 0 O x1 2 0 0 1 4 8ec 4b 0: 9 7f 0 : 89c Sfll , -12 ( L6 ) $tl2 , - 12 ( t5 ) f8 f f f8 c 9e7 f f f c £15 , f O :< l 2 0 0 1 4 8 f 0 0 : 5 8 0 f104 f m ls 1 2 7 0 5 Ox l 2 0 0 1 4 8 f 4 0 : 5 a 0 f 1 00 f adds 12748 0 : 99 £ 20 00 O x1 2 0 01 4 8 f 8 (a) DCPI Profile fo r This -8 ( t 5 ) .3. $£1 3 $fl4 , $fl . $fl) s 1 3 , -8 t 5 ) .10 d 0 : 8a 0 6 14 , $ f0 . $ 0 : 9 9a6 f f f 0: $ f" , $ fl l , Sfll S f12 . Sf l l , Stl" $ f l l. - 12 ( t 5 ) $ 13 , -8 ( t ) $ 0 : 5 80 d .._ 04 d :5 fl . $fl -16 (t5) _ds muls 6 6 f f f4 :8 a4 lds 0 : 5 98bl 00b 0:. 4 8 e0 3134 6357 < 51 0 : 2 0 e7 0 0 1 0 4 6 3 88 4 O x4 , 0 : 4 0 a 8 0 b4 13054 0 09005 O x.l 2 0 0 1 4 8b4 12784 Oxl 0 1 4 8b 3207 xl 2 0 0 1 4 8c0 6 0 : 8 82 7 0 0 0 0 O : a 3 e7 0 0 8 0 t -4 ( t 6 ) $ f l6 , -4 ( t 5 ) $ f 0 , $ f 1 5 , !' 1 5 $ .. 1 6 , $ f l 5 , $ [ 1 5 2 ( 21 $ 15, Execution with Version 6.0 l·ma t Mu l HC_X P f PC f PC fC i C i C i : 3 5 1 O x l 2 0 0 1 94 d0 0 3 :i 3 ' 0 Oxl2 0 0 1 94 Oxl2 4 1 9 4 d8 Ox 1 2 0 0 1 9 4 c 0 : 88 2 7 0 0 0 0 0 : 4 0a09005 0 : 4 0a 8 0db4 0( 4, O x4 , lds SflO, t1 0 (L5) 4 0 O xl 2 0 0 1 9 4 eB 0 : 2 0c6001 0 .._ 2 8 7 0 Ox l 2 0 0 1 9 4 ec 0 : 5 9 -1 1 1 0 0 1 ad ·s SE10 , 'L , $ tl 127 O x l 2 0 0 1 9 4 f0 17968 x12 019 e4 0: Oe 7 0 0 1 0 0 : 5801 1 8 0 d 1 0 4 d 0: 8 6fff 0 : 5 9c b 1 0 0 b : 5 9 ec l 0 0c 0 : 5a0 l O Od Ox 1 2 0 0 1 9 5 2 tl : 9966fff4 3 1 3 tl xl2 19 28 3 2 0 0 Ox ' 2 0 0 1 952c 3 1 6 8 Ox' 0019530 0 : 9986 f t f8 6 58 (b) 0 : 99 6ffE O : f 6 9 f ffe7 lcs ... s a 'd adds a ds sts s s s s bne Vol . 1 0 No. l 1 998 -1 ( t5 ) 6) S f0 , $ f l l . $ f l 1 S f l t; , Sf - 1 7. ( 5l . $ f l 2 , $f12 $fl5 , -8 ( L5 ) 0 , $ 1 3 .. , f l 3 $f16 , $ f 14 , -4 ( t5 ) f11 , SEl l $ f 1 5 , $ 1 2 , $ £ 1 :! Sfl6, Sfl3 , $fU $ f : l , - 12 ( t 5 ) $ E l 2 , - 8 ( t5 ) $ f 1 3 , - 4 I tS ) a4 , O x l 2 0 0 1 4 d 0 DCPI Profile with Counts with Version 5 . 5 F i g u re 4 DCPI Profi les of rhe Inner Loop Digiral 'l'c chnical Journal $fl. cmp le . aa mu:..s 1 ·a 3 2 ' 5 O x l 2 u0 1 94 e0 44 0 0 : 8% 6 1 s a 'd l ULL TONE SU:• \ARY PERE'OR!,�C E Hl PROVE�1Et\'T l l s one E PORT e l ea s e 3 . 9 b2 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - + � - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 1 , l'h:reshol d : u l l s t one Ra t i o Increase by a leas e 10% - - - - - - - - - - - - - - � - - - - -- - - - - - ---- - - - + - - -- - - - - - - - - - - - - - - - - - - - - - - - - - - -- - - - - - - - �-- + Comp a r i o . Com_ � · er Ba s e l i ne Compi - e r · ------------------�---- --------- - --------------- �-----------------------------+ C omp i l e r DEC Alpha C GCC '2 . 7 . 2 Architec �:e .3 0 0 0 / 3 0 0 + ------- - - - - --- - -- i -- - - 0 t ' m ' �a --- - ------------- i on + + --- -- - ---- - - -- - - - - - - - --- - - - ------ - - - - ---- - - - -- - - + i m i z a t i on Al i a s Op . i m ' za i on Al i a s Op �miza i on Bi f i el d Op im i za ( y t yp ) ize I Fol Con Propa a t i o t !: S 2510 Dead Code E l i m i n a t i on D i vide Op t i m i za 56 tests 2600 ests 306 15 ion I f 09 i m i z - don S impl i f i c a L i o n 1/. 0 s 13 es t s l 3 Cl:OSS J im i z a i on 99 26 ti n 15 tes s 92 s Op i m i z a i on I n tege:r Mul t ip l y Opt i m i z a t ion P o i n l e r Op t imi z P � i n t f Op i T< i l Re c ur ion R e i s te r A l l oc a t i o n N t'l'OWing S PEC Co formance S La t i c ec l a ra t ions S t ri ng Op 4 + -- - - - - - - - - - - - --- - - � - - - - - Tot 1 Pe�: formanc I mp rovemen t s • - - - - -- - - - - - - - - - - - - - - T - - - - - -- -- 6499 es >� 10% tes t s ests tes t s e s ts Les s 1 1 teSLS 3 es ts 2 ests 1 2 26 1 3 tes s tests tes t s 18 20 90 t es t s i m i z a t ion Vo l a t i l e Conformanc 3 1 39 es t s Le s 3 tes s 30 t ts 4 t SLS 4 tests 3 es c s 2 tes t s l tes t Op l i mi z Lion 15 es t s 3 i z a ion !'on1ard S tore Va l ue Range tests L es s In t eget. l·iodu l Address Op s t.:es s Lesu; 2 t sts 2 res s l es t s 4 tes s mp i ng t e s ts L e s ts 16 B l oc k t·le rg ing 8 92 Lo p Un�:o l l ing UnsH i tchi � tests 181 Loop Co L a9s i n Loop Fus i on 15 es s 38 b l e E l imi na t i on 2026 tests 5 6 t es t s cests Ho i s t i ng Va r i I n . i n ' ng s 278 S reng t h Reduc t: i on i on Indu c L ' on tes ces s es t 3 9 tests 4 t S I.: S 2 es·s Func I 1 9 Lests 3 t es t s 2353 69 + ents tes ts I 0 3 in Impr ve - - - - · -- ------------+ 15 C S E E� imi�a i o n 1 Sa ple 52 ion Co s Lan Expres s i on 3000/ 30 0 11 u a l i f i ed ) I n s t ru c t � o� Comb i n i ng I n l e" e r bl 3 6 -- ----- - - ---- - - - - - - - --------- 1 02 ( y a dr e s s ) ( con Branch E l :. i n a - i o n rant 123 DEC Al pha DEC A l pha , odel Al i a s 0 5.7 n o r esu ic · tests tes s res s t e s ts res _ 9 tests 3 te s t s tes 0 res s 2 tes t s 1 t es t s 3 tes t s 0 tes t s 0 tes t s 0 tes t s 1 tests 4 tests -- - - - - - -- - - - - - - - - . 5065 + tests Figure Sa N U LLSTO N E Results Com paring gee w i t h D E C C Compiler, Showing All I mprovem.cnts of Magnitude 1 0 % or More Digiral Technical Jourrul Vol . 10 No. l l 99S 45 NULLSTONE SU}WARY PERFORMfu�CE REGRE . Sl O, � l l s to � Re " ea se 3 . 9b 2 + - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -- - - - - - - I Threshol +---+ : -- - - - - - - - - - - - - - - - - -- - - - - - - - - -- - - - - - - + - R c i o Decreas d by a c l ea t 1 0 % N lls on B · e l i r. e Comp � l e r - - - - - - -- - - - - -- --- - - ---- - - - - - - �----------- +- I G CC 2 . 7 . 2 Comp i l e r I �1odel no 300 t30C 300 0 / 3 0 Op t im i z a t ion Al i a · Op L i i : a t � on - - - - - - - - - - - - - - - - - - - - - -- - - - - ( by t yp - - � - - - - - - - - - -- - - - - + I Regr· ---- ----- --- + - - - - Sampl e S i ,:e 10 ) ests s s i ons 64 es s l con s t - aua ! : . l ea ) 11 es s ( by n dd;e s s ) 57 a t s ests 7 tests Ins truc t i n Cornn ' n ing Cons tan Propaga i on 2 510 tests 204 CSE El i m i na t i o n 2 6 0 0 tests 9 2 tes s 1 8 1 te s t s Alia� Op t l rni z a t · on Al i a s Op irni ::: a · on I n eger D i v i de l j Les s t im i z t on Expre s i an S i mp l i f i ca t i on p t i mi zat i o n t 69 s Op L irn i z a t i on I n t ege r Mu . L iply Opt imi z a t i on P o i n t e t Op t irn i z Ta i l Recur i on esLs t ion 92 ests 9 15 tests tes ts 3 es s 6499 ests tes t s 32 :-e s t s 32 34 s � F i g u re 5b N U LLSTO N E Res u l ts Comp�ring gee 10% > - with DEC C Compi l e r, Sh owi n g Al l I I t.: e s t: s L sts 95 es 1 tests /. €' !; l: S 2 t:es ., - - - - - - - - - - - - - - - - - - t - - - - - - - - - - - - - - - - - - - - - - - - - - - - - � - - - - - - - - - - - - - - - - - - - - - - - - - - - i T o a l Performance Regre s s i ons s 1 40 1 tests "'l tO\� i ng es 5 14 4 J 8 tes t s 2 tests Ho i s L i n Unswi � h i n g I n e g e r Modu + res t.r i c t DEC Alpna � - - - - - - - - -- - - - - - - - - · - - - - - - - - - - - - - - - - - - - - - - - - - - - - - · - - - - - - - 1 - - - - - - - +- omp i l e t - - - - - - - - - - - - - - - - - - - - - - -- - - - - DEC Alpha · - - - - - - - - - - - - -- - - - - - - - - - - -- - - - - - Compa r i son DEC A lp ha . 5 . 7 - 1 2 3 bl 3 6 I Arch · tec'.ute R EPORT 1 542 tes s i - � s 1 Regressi o n s of 1 0% or vVors..: Biographies Kevin W. Harris Kevi n Harris is a consulting sofrw:�t-c engi neer at Compaq, cu rren tJy wo rk i n g in the DEC C and C ++ D ev e l opm e nt Hemant Gro u p . He has 2 1 y e:-t rs of e x pe ri e n c e worki ng on h i g h G. Rotithor Hcmant Rotithor received B . S . , td . S . , :md P h . D . d egre es in e lec tric :� I e n gi n ..:..: r i n g in 1 9 79, 1 98 1 , and 1 989, respn: · t i v e l y. He worked on C � n d C++ c o m p i l e r per �o rnu n ce i:;sues in tht: Core Technology Gr ou p ;�t Digit;�J Equipment Co rpor� t ion �or t h ree years. Prior to that, he w;�s Jn clssis tant �)rofessor at vVorcester P ol ytech n i c I nsti tute and :1 d e vel o pm e n t ..:ngi n c e r ctt P h i l i ps . H e m a n t i s a m e m b er of the p r ogr a m com minee ofThe l Oth l nrnn :-ttio n a l ContC:rt:ncc on ParJ l l e l and Distributed Com p uti ng a n d Syste ms ( PDCS '98 ) . He and ,1 is a sc·nim m e m be r of rhe II:J-: E member of Eta Kapp:� N u , T:1u Beta Pi , and Sigma Xi. His interests i nclude c o m p u te r :� rchirccmre, p e r fo r ma nc e a n :� lvs i s , digita l design, and ne tworki ng. Hennnt is currentlv e m p l ov..:d a t I nt e l Corpor:�rio n . 46 Dig:iral Tec h n ical )ournJI Vol . 1 0 No. 1 1 998 pe rfo rm a n c e compi l ers , optimization, a n d p:�ra l k l pro cessi n g . Kevin grad uated Phi BeLl Kappa in m:�rhem:�tics �i·o m the U n iversiry of 1\lb ryland cmd J O i n ed Digita l Equi pment Co rpora ti o n c1ti:cr earning a n M.S. i n com purer scien ce ti:om the Pen n sy lva n i a State U n i versity. He has m a d e maj o r contri bu tions to t he D I G ITAL F or tr:� n , C, an d C++ p ro d u c t fa m i l ies. He ho l ds p.t L c n ts f(J r tech n iqu e s tor exploiti n g performance of shared memory m u l t iproces sors and register allocatio n . H ..: is c ur re n tl y responsible tor pe r form an c e issues in the DEC C and D 1 G !Tt\L C + + product fa m i l i e s . He i s interested i n CPU a rchi tecture , c o mpiler design, large · and snul l -scalc p.tra l l e l i s m :� n d irs exploitation, and oti:ware q u a l irv issues. Mark W. Davis Mark Davis is a senior consulting engineer in the Core Technology Group at Compaq. He is a member of Compaq's GEM Compiler Back End team, toc using on performance issues. He also chairs the D IGITAL Unix Calling Standard Committee. H e joined D i gital Equipment Corporation i n 199 1 after worki ng as Director of Compilers at Stardent Computer Corporation. Mark graduated Phi Be ta Kappa i n mathemat.ics from Amherst College and earned a Ph. D . in computer science !Tom Harvard University. H e is co-inventor on a pending patent concerning 64-bit software on OpenVMS. Digital Technical Journal Vol . 10 No. l 1998 47 I August G. Reinig Alias Analysis in the DEC C and DIGITAl C++ Com pilers During alias analysis, the DEC C and DIGITAl C++ compilers use source-level type information to improve the quality of code generated. Without the use of type information, the compilers would have to assume that any assignment through a pointer expression could modify any pointer-aliased object. In contrast, through the use of type information, the compilers can assume that such an assignment can modify only those objects whose type matches that referenced by the pointer. 48 Digital Tec h nical Jou rnal Vol . lO N o . 1 1 998 vVh e n two or more address expressions reference the same memory location, these add ress ex pressions are aliases for each other. A compiler performs alias anJJy sis to detect which add ress exp ressions do not refer ence the same memory locJ.tions. Good alias an alysis is essential to the generation of efficient code. Code motion out of loops, common su bexpression elimina tion, allocation of variables to registers, and detection of u n i n i tialized variables a l l depend upon the compiler knowi ng which objects a load or a store operation could reference. Address expressions may be symbol expressions or pointer expressions. I n the C and C++ languages, a compiler always knows wh at obj ect a symbol expres sion references. The same is not true with pointer expressions. Determining which objects a pointer expression may reference is a n ongoing topic of research . Most o f the rese arch i n this area focuses o n the use of techniq ues that track which object a poin ter expres sion m ight point to. u When these techniques cannot make this determination, they assume that the pointer expression poi nts to any object whose add ress has been taken . These tech niq ues generally ignore the type information avai l able to the source program . The best tech niques perform interprocedural analysis to i mprove their accu rJcy. Although effective, the cost of analyzing a complete program can make this analysis i mpractical . I n contrast, the DEC C and DIGITAL C++ compi l ers use h igh-level type information as they perform alias analysis on a routine -by-routine basis. Limiting alias analysis to withi n a routine reduces its cost, albeit at the cost of red ucing its effectiveness . The use of this type i n formation results in s l ight i mprovements in the performance of some standard con forming C a n d C++ programs. These improve ments come at l i ttle expense in terms of compi l a tion ti me. There is, however, a risk that the use of this rype information on nonsra nd:�rd-conforming C or C++ programs may result in the compi ler producing code that exhibits u nexpected behavior. The Side-effects Package The C and C++ Type Systems Research availab.le on the use of type intormation du r The DEC C and D I G ITAL C++ compilers are GEM ing alias analysis involves languages other than C and compil ers -" The GEM compiler system incl udes a C++ . ' Trad itional ly, C is a weakJy typed l a nguage . A highly opti mizing back end. This back end uses the poi nter that references one type may actually point to GEM data access model to determine which objects a an object of a different type . For this reason, most load or a store may access. GEM compiler front ends alias-analysis techniques ignore type information when augment the GEM data access model with a side analyzing programs written in C . effects package, i . e . , an a l ias-analysis package . The The ISO Standard for C detlnes a much stronger side-effects package provides the GEM opti m i zer typing system .' In ISO Stand ard C, a poi nter expres additional i n formation about loads and stores using sion can access an object only i f the type referenced by l anguage-spec ific i n formation otl1erwise unavailable the pointer meets the following criteria: to the GEM optimizer. • It is compatible with the type of the object, ignor ing type q uali fiers and signedness. • I t is compatible with the type of a member of an aggregate or union or su bmembers thereof, ignor The DEC C and D I G ITAL C++ compilers share side-effects package • It is the char type . Thus, in Figure 1 , the pointer p can poi nt to A, B, C, o r S ( through S .s u b . m ) b u t not to T or F. The poi nter q, bei ng a pointer to char, can refer to any of A, B , C, S, T, or F. The proposed ISO Standard for C++ d e fines a simi lar typing system for C + + . ' The strength of the Standard C and C++ type systems a llows the DEC C and DIG ITAL C++ comp i lers to use type i n formation d u ri ng al ias analysis. Many existi ng C appl ications do not conform to the Standard C typing rules. They use cast ex pressions to circu mvent the Standard C type syste m . To support these applications, the DEC C compiler has a mode whereby i t ignores type information during alias analy sis. The D I G I TAL C++ compiler also has such a mod e . This mode exists t o support those C++ programmers who circumvent the C++ type system. Determines which symbols, types, and parts thereof a routine references ing type q u ali fiers and signedness. • a com mon side-eftects package . The D E C C and C++ • Determines the possi ble side efkcts of these reterences • Answers q ueries fi.-om tl1e GEM optimizer regardi ng tl1e effects and dependencies of memory accesses Preserving Memory Reference Information The D E C C and D I G ITAL C++ front ends perform lexical analysis and parsing of the source program, generating a G EM i ntermediate language (GEM I L ) graph representation of the source program 6 A tuple i s a node i n the G E M I L and represents an operation in the source program. As the D E C C and D I GITAL C++ tfont ends gener ate GEM I L , they an notate each fetch (read ) and store (write) tuple with intormation describing tl1e object being read or writte n . The front ends annotate fetches and stores of symbols with i n tormation about tl1e sym bol. They annotate fetches and stores tlu-ough poi nters with information about tl1e type tl1e pointer references. The an notation i n tornution includes information describing exactly which bytes of the symbol or type tl1e tuple accesses. This al lows the side-effects package int i gned i� � c n s t B ; un s i gned i n t vol t i l e C ; s Lruct: { s t ru int m; ) s b; $; sLr c t { horL z ; ) T; flo t F ; i 'p; c ha r * q ; to d i fferentiate between access to t\vo different mem bers of a structure. Arrays Neitl1er the DEC C nor the DIGITAL C++ tfont end ditferentiates bet\veen accesses to different elements of an array. Both assume that aU array accesses are to the first element: of the array. The GEM optimizer does extensive analysis of array references.7 Being flow insensitive, the DEC C and C++ side-effects package can, at best, differentiate between two array references tl1at both use constam indices. The GEM optimizer can do much more. V/hat the GEM optimizer cannot do, however, is Figure 1 Code Fragmenr Associated with rhe E.xpbnation ofthe Standard C Aliasing Ru les determine that an assignment through a pointer to an int: does not change any value in an array of doubles. This is the purpose oftl1e DEC C and C++ side-eftects package. Mapping a l l array accesses to access the first Digital Technical Journal Vol . 1 0 No. I 1 998 49 element of a n array does not hinder this purpose and simplifies al ias analysis of arrays. an object. To m i n i mize the n u m ber of effec ts cl asses u nd e r considera tion, the side-effects package creates effects classes for only those object regions referenced For the program fi·agme nt Tuple Annotation Example i n Figure 2, the DEC C and DIGITAL C++ ti·ont ends generate the annotated tuples displ X 3; Store p->x none struct S 0 3 3 Store v 1 .y v1 struct S 4 7 v1 .y v2 = d[i] = = v1 = d [O] Fetch v 1 v1 struct S 0 7 Store v2 v2 struct S 0 7 Fetch d [O] d double 0 7 i nt 0 3 double 0 7 Fetch i d Store d ( i ] 50 Vol . 10 To. 1 1 99 8 if two members occupy exactly the same memory loca tions, a single effects cl ass represents both mem bers . For the program fragme nt in Figure 3 , the side effects pac kage creates the effects cl asses displayed in Table 2 . There i s only o n e effects class for * uip and *ip since uip and ip may point to the same object. There are no effects c lasses for bytes 0 through 3 ofs a nd struct S as there arc no references to s . x or sp->x. By al locating effects classes for only those object regions referenced within the rou tine, the side-effects package greatly red uces both the n u m ber of effects classes and the time requi red to perform alias analysis. In the traditional C type system , a poi nter expres sion may point to anything, regardless of type. To rep rcst:nt this, the side-effects package creates exactly one eftects class to represent allocated objects. It ignores the type and the start- and end -offset information . S { inl x ; s r c T int y ; flo t t; S t:. rUCL z; s; s tl.·uc t s * p ; s i gned i n t • ip ; u si gned i nt • u i p ; l oa * fp ; * u ip : * ip ; * fp = 2 ; sp - > t = s . e ; sp.y = 2; s - • sp ; Fig u re 3 Code Fragment Associated with Allocating Efkcts Classes Using tl1e traditional C type system, for the program fragment shown in Figure 3, the side-effects package creates the effects classes displayed in Table 3 . Here, effects class 7 replaces effects classes 7 through 1 1 in Table 2. All the differentiation by types djsappears. Effects-class Sig natures Having created the effects classes, the side-effects pac kage associates a signature with each effects class. In addi tion, it associates an effects-class signature with each tuple within the rou tine and each symbol referenced within the rou tine . An effects-class signature records the possible side effects of referencing an effects class. A reference to one effects class may reference another effects class. The effects class for a load through a pointer to an int i n dicates that the load references an al located int object. The poi nter to an int may actually reference a pointer-aliased int symbol or an int mem ber of a struc ture or union. An effects-class signature is a su bset of al l the effects classes that might be referenced by a tuple. There is only one requirement for an effects-class signature : I f two tuples may refer to the same part o f memory, the intersection of their respective effects-cl ass signatures must be non-null . If two tuples cannot refer to the same part of memory, it is desirable that tl1e intersec tion of their e ffects-class signatures is null. An em pty i ntersection l eads to more optimization opportu nities. The most obvious rule for building an effects-class signature is to include in it a l l the effects classes that might be to uched by a reference to tl1e effects class. This leads to subopti mal code in cases such as that shown in Figure 4. There are three effects classes for this code , s<0,3>, S<4,7> , and S<0,7>, generated by references to s.x, s .y, and s, respectively. If the effects-class signature for S<0,3> in cludes both s<0,3> and s<0,7> and the effects-class signature for s<4,7> i ncludes both s<4 ,7> and s<0,7> , then the intersection of these 1:\vo effects- Ta ble 2 Effects Classes Using the Sta ndard C Type R u l es Effects Class Type or Symbol Start Offset End Offset Sou rce Generating Effects Class 1 0 11 2 4 11 s.t 3 sp 0 7 sp 4 fp 0 7 fp 5 ip 0 7 ip 6 uip 0 7 uip 7 struct 5 0 11 *sp 8 struct 5 4 11 sp->t 9 struct 5 4 7 10 fl oat 0 3 *fp 11 i nt 0 3 * u i p and * i p Digital Tech nical journal sp->t.y Vol . t O N o I 1 9 98 51 Ta ble 3 Effects Cl asses U s i n g the Traditional C Type Rules Effects Class Ty pe or Symbol 1 2 Start Offset End Offset Source Generating Effects Class 0 11 5 4 11 s.t 3 5p 0 7 sp 4 fp 0 7 fp 5 ip 0 7 ip 6 uip 0 7 uip 7 char 0 *sp, sp->t, *u ip, sp->t.y, *fp, * i p c lass sign atu res is no n - n u l l . This talsely i n di cates that s.x and s.y may refer to the same memory l ocation. This forces GEM to generate code that stores s.y after stor ing to s.x. The DEC C and C++ side-ef'tects package uses more effective rules for bui ldi ng effects-class signatures. These rules offer more optimization oppornmities while pre serving necessary dependency i n tormation. I f an effects class represents a region A of a symbol, its signature incl udes itself Its signature also incl udes all efrecrs cl asses repre senti ng regions of the symbol wholly conta i ned with i n A. Final ly, i t i nclu des a n y eftects class representing a region of the symbol that partially overlaps A. I t does not i nclude e ffects c l asses representing regions of the symbol that do not overlap A or th::�t wholly contain A. Ta ble 4 gives the symbol effects-class signatures for the three effects cl:lsscs u nder discussion . The i nc lusion o f su bregions i n an effects-cl ass signa ture means that references to symbols i n terfere with references to mem bers therein and vice versa. Excluding su per-regions in an effects- class signature means that references to two separate members of :1 symbol do not interfere with each other. In Table 4, the eftects class signatures for S<0,3> a nd s<4,7> do not in tcrkrc with each other. Both signatu res interfere with the effects-class signature tor s<0,7>. The incl usion of effects classes representing parti::�lly overlapping regions of a symbol a l l ows tor the correct representation of the side effects of referencing sub members of complex unions. Effects-class Signatures for Symbols s t ru c t i n t: int s; S x; { - • s . y - . . . ; • • s; Symbol Effects-c lass Signatu res Effects-class Signature S<0,3> 5<0,3> S<4, 7> 5<4, 7> S<0, 7> <0,3>, 5<4, 7>, 5<0, 7> Dip.ital Tc chniol Jou rnal • Any region of a poi nter-aliased sym bol whose type is compati b l e to T, ignori ng type qu ::�li fiers ::�nd signed n ess • A region of a poi n ter-aliascd aggregate or union symbol that contains a member or submember whose type is compatible to T, ignoring type q u a l i fiers a n d signed ness • A Vol . ! 0 No. l region of an aggregate or unio n type that con tains a member or submember whose type i s com patible to T, ignoring type qualifiers and signed ness Table 5 gives the signatures for the efkcts classes in Ta ble 2 , assuming that the sym bol s is poi nter aliased . I ncluding the effects classes of symbols in the effects c lass signatures of types records the interference of references through poi nters with references to poi nter a liased sym bols. I n Figure 3, the pointer uip points to an u nsigned int. The member s . t.y hJs type int. Thus, uip may point to s. t.y. The mem ber s.r contains s.t.y. Thus, the signature for the effects-class int<0,3> co n - Ta ble 4 52 Those regions ofT that overl ap the region ofT the effects class represents, using the same ovnlap ru les JS for symbols ; Figure 4 Example o f Problem atic Code fo r the NaYvc Ru le for B u i l d i n g E tlccrs-class Signatu res E ffects Class • y; S. X re t u tn If J n efkcts class represents a region of a type, the contents of its signa ture depends upon the type. I f tbe type is the char type, the e ffects-class signature contains a l l the eftects classes representi ng regions of other types or poi nter-aliased symbols. This reflects the C and C++ type rules , which state that a pointer to a char can point to :mything. If the type is some type T other than char, the effects class signature contains dlects classes represen ting: Effects-class Sig natures for Types 1 99 8 D u ring opti miza Ta ble 5 Res ponding to O ptim izer Q ueries Type Effects-class Sig natures tion, ilie optimizer m a kes two types of q ueries to the Effects Class Effects-class Signature 1 S<0, 1 1 > 1, 2 2 S<4, 1 1 > 2 N u m ber side-effects analysis routines: domi nator-based queries and nondominator-based q ueries . When doing nondominator- based optimizations, tJ1e optimizer uses a bit vector to represent iliose objects a 3 sp<0,7> 3 write may ch ange ( its effects ) . A similar bit vector repre 4 fp<0,7> 4 sents those objects whose val ue a read may fetch ( i ts 5 i p<0,7> 5 dependencies ) . Each bit in tJ1e bit vector represents an 6 u i p<0.7> 6 effects class. If a tuple's effects-class signan1re contains 7 struct 5<0, 1 1 > 1 , 2, 7, 8, 9 8 struct 5<4, 1 1 > 1 , 2, 8, 9 9 struct 5<4, 7> 1, 2, 9 10 float<0,3> 1 , 2, 7, 8, 1 0 11 i nt<0,3> 1 , 2, 7, 8, 9, 1 1 an effects class, iliat effects class's bit is set in ilie tuple's bit vector. The optimizer uses ilie u nion of ilie bit vec tors associated witJ1 a set ofn1ples to represent the com bined effects or dependencies of those mples. Domi nator-based queries involve fi nding the near est dominating tuple that might write to the same memory location as the tuple in q uestio n . Tuple A domi nates tuple rains the e tiects-class s<4, l l > . This means that the load of s . t depends upon the store through u i p . Including t h e effects classes of types i n t h e signa tures of the effects classes of other types records the i nterference of references through a pointer with ref erences through pointers to other types. I n F igure 3 , the pointer fp points to a float object. T h e m e m ber sp - >t . z has type float. Thus, fp m ay point to sp-> t . z . The member sp- > t contains sp- > t . z . T h u s , the signa ture for tJ1e effects-cl ass float<0,3> contains ilie effects class struct 5<4, 1 1 > . This reflects the fac t that the tore to sp->t.y depends upon the store tJ1 rough � I . e . , It m ust occ ur after ilie store ilirough fp. fp, Even though the signature for the e ffects-class float< 0,3> contains the effects-class struct 5 <4 l l > � ( s p - > t ) , it does not conta i n the e ffects-class s ruct 5<4,7> ( s p - > t . y ) . There i s no float member of struct 5 whose position within struct 5 overlaps bytes 4 through 7 ofstruct 5. There is a float member of struct 5 , namely z, whose position within struct S overlaps bytes 4 through 1 1 of struct S . The signature for the effects-class float<0,3> wou ld not contai n the effects class s<0,3> if i t existed. There is no float member of s whose position overlaps bytes 0 ilirough 3 of s . Additional Effects-class Signat u res The side-effects package creates a special effects-class signature repre senting the side effects of a cal l . A cal led procedure may reference the following: • • B if every path from the start of the B goes through A . 8 I f both tuples A and C dominate B , tuple A is the nearer domi nator i f C dom rou tine to inates A. When doing dominator- based opti mizations, the side-effects package represents the tuples i n the cur rent dominator chain as a stack, adding and removing tuples from the stack as G EM moves from one path in the routine's domi nator tree to another. Searching a single stack for the nearest dominating tuple that might write the same memory as the tuple in question references could lead to O(N9 performance, where N is the n u mber of tup les i n the domi nator chain . This worst-case behavior occurs when none ofilie tuples in a dominator chain affects any su bsequent tuple i n the chai n . Each time the side-effects package searches the stack, it exami nes all the tuples in the stack. To avoid iliis, ilie DEC C and C++ side-effects pack age creates a stack for each e ffects class. When pushing a tuple, the side-effects package pushes the tuple on each stack associated with an e fTects class in the tuple's effects-class signature. When the G E M optimizer tells th e side -effects package to find the nearest domina ti na write for a tuple, the side-effects package need onl � choose the nearest of those tu ples that are on the top of the stacks associated with ilie tuple's effects-class signature . It need only look at the top of each stack, because a tuple wou l d not be in tJ1e stack u n less it mi ght affect objects i n the e ffects class associated with tJ1e stack. The m ultistack worst-case behavior is O(NC). There Any pointer-aliased symbol ( by means of a refer are C separate stacks, one for each effects class. The ence through a pointer) effects-class signature for each effects class may con Any allocated object ( by means of a reference tam all the other effects classes. This would mean that ilirough a pointer) each of the N tuples in the domin ator chain would • Any nonlocal symbol ( by means of direct access) • Any local static symbol ( by means of recursion) The effects signature for a call i ncl udes all the effects classes representing these objects . appear in each of ilie stacks. Although the worst-case behavior for the m u l tistack case is no better than the single-stack case ( C may be e uaJ to N ), in practice there are often more tL;ples � Withm a routine than e ffects classes. Furthermore ) Digital Technical Journal Vol . 10 No. 1 1 998 53 effects-class signatures often contai n a smal l n u m ber Effectiveness of effects classes. A smal l num ber of e ffects cl asses in an effects-class signature means that there are a small The benchmark programs from the SPECint95 suite numb er of stacks to consider. Choosing the nearest offer some convenient test cases for measm ing the dominator from among the top tuples on these stacks e ffectiveness of type- based alias analysis. The sources are requir es exa mining only a small n u m be r of tuples. readily available and portable. The programs conform Cost of Using Type Information Standards Institute (ANSI) and are compute intensive. When compiling all of the SPECint95 test suite9 using culations. This reduces the number of different types high optimi zation, alias analysis accounts for approxi used in the programs. Type -based alias analysis works mately 5 percent of the comp i l a tion ti me. The use of best when there are many di fferent types in use . to aLias rules established by the American National U n fortunately, they do not contain floating-point cal Standard C type rules during alias analysis i ncreases Tlu·ee of the SPECint95 programs show no improve compi l a tion time by less than 0 . 2 percent ( ti m e mea ment when compiled using the Standard C typing rules sured i n number of cycles consumed by the compiler as opposed to using the traditional C typing ru les. as reported by Digital Continuous Profiling I n fra These programs, namely compress, go, a n d li, do not structure [ D CPI] '"). The i ncrease in compilation time use many different types and pointers to the m . \Vh en varies from program to program but never exceeds all the pointers i n a program are pointers to ints ( go ) , 0 . 5 percent. Hand l i n g the extra effects classes gener there is only o n e e ffects class for a l l pointer accesses. ated by using Standard C type al iasing i n formation Because the compiler has no way to d i fferentiate accounted for most o f the i ncrease . among the objects touched by a dereference of a Potentially, the cost of including type-aliasing infor pointer expression, it generates identical code for these mation could be huge . Calculating which effects classes programs, regardless of the type r u les use d . The ge n a reference through a char * pointer could touch is erated c o d e for l i differs o n l y sl ightly a n d only for straightforward as shown by the al gorithm in Figure 5 . i n frequently executed routines. A much more complicated process i s required to Changes i n generated code for the remai ning five calcu late which e ffects classes could be touched by a benchm arks are more prevalent. Two benchmarks, reference through a poi nter to a type other than char. ijpeg and perl, show a smal l reduction i n the nu m ber The algorithm in Fi gure of loads executed but no meani ngful reduction i n the 6 performs this process. Fortu nately, the innermost section of this loop is total numbe r of instructions executed . The other rare ly executed . The i n n e rmost section executes onJy three SPECi nt9 5 benchmarks show varying degrees if a routine references a structure e i ther through a of red uction i n both the n u m ber of loads executed pointer or a pointer-al iased sym bol, that structure (see Ta ble contains a substructure, and the routine references the executed (see Ta ble 7 ) . 6) and the total nu m be r of in structions su bstructure through a poi n ter. f or ea ch p o i n ter al i a s ed s ymbol foreach e f f e c t s c l a s s represen i ng a region o f the symbol l a ss to the e f fec t s c l as s s i gna ure for add c h a t e f fec s c ar Figure 5 Calculation of the Effects-class Signature of the Type char * foreach p oi n t er a l i a s ed s ymbo l or cype referenc ed t hrough here i n f o r e a c h member i f t he member ' s type i s referenced through poin er foreach e f f e c c s c l a s s repre s en t i ng a re i o n o f foreach e f fec t s c l a s s re p resen t i g a region re ferenced hrough a po i n t e r a o i n te r the member ' s f Lype che s ymbo l or type i f the tHO e f fec t s c l a s s re i on s ove rl ap add the symbol ' s or po i n t er ' s e f f e c t s c l a s s to the e f f e c t s c l a s s s i g natu r a s s o c i ated with t h e e f fec t c l s s represen t i n the member ' s Lype F i g u re 6 Calculation of the Effects-class Signature for Types Other Than char 54 DigiraJ Technical JournaJ Vol . 1 0 No. 1 1 998 6 N u mber of Loads Executed by the Sel ect SPECint95 Benchmarks Ta b l e SPEC Benchmark M i l l ions of Loads Using Type I nformation M i l l ions of Loads without Type I nformation Percent Reduction gee 1 0, 268 1 0,365 0.9 0.2 ij peg 1 6,853 1 6,888 m88ksim 1 3,889 1 4, 1 57 1 .9 peri 1 1 , 260 1 1 , 296 0.3 vortex 1 8, 994 1 9, 207 1.1 Ta b l e 7 N u m be r of I n structions Executed by the Sel ect S P E C i nt95 Bench m a rks M i l lions of I nstructions SPEC Benchmark M i l l ions of Instructions Using Type I nformation without Type Information Percent Reduction gee 42,830 42, 9 3 5 0.2 ij peg 82, 844 82,834 0.0 m88ks i m 72,490 73, 1 5 5 0.9 peri 45,2 1 9 45,252 0. 1 vortex 80,093 80, 607 0.6 The load and instruction cou nts are those reported by using Ato m 's pixie tool on the SPECint9 5 bi naries to generate pi xstat data. 1 1 • 1 1 The compiler used was a deve lopment C compiler. A l l comp i l ations used the fol lowing swi tches: - fas t , -04 , - a rch ev 56 , a nd - i nl i ne peed . The compil ations using the Standard C type system used the -ansi_a l i a switc h . The compilations using the trad itional C type system used the - noans i_a l i s switch . T h e bench mark binaries were run using the reference data set. DCPI'" measurements of the reduction in the n u m ber of cycles consu med b y these SPECint9 5 bench marks showed no consistent reductions. Ru n-to-run variabi lity in the data col lected swamped any cycle time reductions that might have occu rred. S i m i larly, measu rements of gains in SP ECint95'' resu lts due to the use of type information during alias analysis showed no significant changes . Changes in Generated Code The code-generation changes one sees in the SPECint95 benchmarks arc exactly what one would expect. The usc of type information during alias analysis reduces the number of redundant loads. An example of this occurs in ijpeg, which contains the code sequence: main->r v?group_c t r· = fJDH1EN · ron) ; ( JDI M� S lO I ) ( c in fo- >min_OCT_s ca l ed_s i ze • main- �rowgrou s_ava i l ( c i n fo->mi n_DCT_scal d_si ze + ll ; 2) ; in process_data_context. Using the tradi tional C type syste m , the compiler m ust assume that mai n - > row group_ctr is an alias tor cinfo->mi n_DCT_scaled_size. Thus, it must generate code that loads cinfo - > m i n_ DCT_scaled_size twice . The Standard C type system allows the compiler to generate only one load of cinfo->mi n_DCT_scaled_si ze. Several of the bench marks contain code similar to the fol lowing from conversion_rccipe i n gee: c rr . ne . . t - ur l i s - >opcode ; - 1 ; curr . ex t > l i s t - c o s t - 0 ; cu rr . exc l i s - >prev - 0 ; . ne x >l i s -> o - from ; Using traditional C type rules, the compiler must gen erate four loads of curr. next->l ist. The compiler must assume that the poi nter curr.next-> list may point to itself, making curr. next- > l ist- >member an al ias tor curr.next- > l ist. The Standard C type r u les allow the compiler to assume that curr.next->l ist does not point to itsel f. This allows the compiler to generate code that reuses the result of the fi rst load of curr.next->l ist, e liminati ng three redundant loads. In a nother example in gee, the use of Standard C type rul es allows the compiler to move a load outside a loop. The fol lowi ng loop occurs i n fi xup_gotos: f or ( ; if 1 i s Ls ; l ists - TREE_ - - thi s b loc k - > ! T REE_CHAI ( l is TREE_ADDRES ABLE s) ( li H.l\I ( l i s ts ) ) . b l oc k . ou Ls ) • 1 er_c l.e nup ·) Standard C type rules tel l the compiler that the store generated by TR.EE_ADD RESSABLE ( l ists ) I can not modi�' thisblock- >data . block.outer_clcanups. This a l l ows the compiler to generate code that retches thisblock->data. block.outer_cleanups once betore entering the loop. Usi ng traditional C type rules, the compiler m ust generate code that fetches = Digital Tec hn i d journal Vol . 1 0 N o . 1 998 55 thisblock->d ata . bl o c k . o u ter_c l ea n u ps each tim e it traverses t h e loop. Not only can type i n formation reduce the n u m ber p rograms in this suite a rc s u p posed to conf(xm to the of redu nda nt loads, i t em reduce the nu m ber of red u n Standard d an t stores. I n m88ksi m , there a r e many routines s i m i ment to the GEM optim i zer, this benchmark started lar t o the fol lowi ng: i r: t ffirst < s trLct p:r->gen . opcl ptr-·.gen . r.:cs t p•t -"9'� . OJX/ p�r >gen . n;/. retuLr. ( 0 l ; = • • - C type-al iasing r u l es . B ecll!se of an improve to give u n e x pected res u l ts . In rrx_a l loc, gee c l ears a n.S ':.. J.1..:t:L iu;. � c;rri , t:nior. opcode ?·pt:r) structure by treating it as an a rray of i nts, assigni n g zero t o e a c h e lement of the array. S u bseq uent t o zero 0. 3c: ; operar.d,; . v-lue [ O ] ; am-�op,c . nT ; operan .,; . alue [ 1 I ; i ng this structu re, gee assigns a val u e to one of the fi e l d s i n the structure. Through a series of va l id opti m izations ( given the i ncorrect type i n formation ) , the resu l ting code did not c l e a r a l l the fi e l d s i n the struc where ope 1 , d est, opc2, and src2 Jrc b i t fields s haring the same 32 bits ( long-vord ) . Using traditional C typ ture . This l e ft u n i n i ti a l i zed d a ta i n the structure res u l ting i n gee behavi n g i n an u n e xpected m a n ne To avoid potential problems, the D EC / C compi l e r, C type r u l es each other. Thus to i mplement the above routine, the ing r u l es, ptr- >gcn and cmd- >opc may be al iases for by d ef:1 u l t, d oes not use the Standard when performing alias analysis. The user of the com compi l e r must generate code that performs the fol p i l e r has to expl icitly assert that the program does fol lowing actions: l o w t h e Standard Load ptr->gen • U pdate b i t fi e l d s ptr- >gen .opc l a n d ptr- > gen .dest • Store ptr->gcn C++ C++ • Load cmd->opc. rrr c a n use • U p d a te bit fi e lds ptr- >ge n .opc2 and ptr- > gen.src2 • Store ptr->gen Usi ng Standard C typing ru l es, ments to ptr- >gcn .opc l C type r u l es t h ro u gh the u s e of a com mand -l ine switc h . • the compiler does not have to generate the first store ofptr- >gen . The assign and ptr- > ge n . d est cannot change cmd - >opc. rrr. I n this case, a l ias a n a l ysis t h a t is not type based wou l d have a d i fficu l t time detecting that p tr- > gen and cmd - >opc d o not a l i as each other. M 8 8 ksim never calls Hi rst d i rectly. It cal ls it by means T h e DIG ITA L C + + compiler docs ass u m e that the program i t is comp i l i n g ad h eres to the Standard type r ules . A user of the D I G ITAL a C++ com p i l e r com mand - li ne switch t o i n kmn the c o m p i l e r that i t s h o u l d u s e traditional C type r u l es w h e n per forming alias a nalysis. Summary Using Standard C type i n f(xmation d u ring al ias analysis C and C++ does improve the generated code f()r some programs. The compi lation cost o f· using type i n forma tion is sma l l . Except for rare cases, performance gains res u l ting from these code i mprove m e nts are small . Any of an array-indexed fu nction pointer. programs compiled using type i n formation duri ng alias A Note of Caution aliasing rules. I f not, the optimizer may generate code Many C analysis must strictly ad h e re to the Standard programs do not ad here to the Standard C i m p l i c i t casting, they access objects of one type by means of pointers to other types. More aggressive optimization by GEM combi ned with more detailed alias-analysis i n formation fi·om the D EC C and C++ side-effects package i ncreasi ngly resu l ts in these programs e x hi bit ing u n e x pected behavior when the compiler uses Standard C aliasing ru les. expects poi nter to another type works as expecte d , Passing a p o i n te r to o n e type to a routine that a u ntil t h e GEM opti m izer i n li nes the cal led procedur e . If the procedure is n o t i n l i n ed , t h e D EC C and C++ sid e-effects package m ust ass u m e that the call conflicts with aJ I pointer accesses before a n d after the cal l . Once G EM i n l i nes the routine, the side-effects package is free to ass u m e t hat references using the i n l i ned pointer do not conflict with references using the poi nter at the call site. The two pointers point to t:\vo d i fferent types. Digital Tcchnic;JI J o u n d C and C++ that produces unexpected resu l ts. aliasing rules. Through d1e usc of expucit casting and 56 A recent example of this pro b l e m occu rred in the gee program in the S P ECint9 5 benchmark s u i te . All Vol . 1 0 No. l 1 998 Acknowledgments The a u thor wou ld l i ke to than k Dave B l ickste i n , Mark D avis, N e i l Fai m a n , Steve Hobbs, and B i l l Noyce of the G E M team for their advice and reviews of this work. Dave B lickstein and N e i l Faiman a lso d i d work in the G EM opti m i zer to ensure that the D E C C++ C and s i d e - e ffects package h a d a l l the i n f(mnation i t needed t o do alias analysis correctly a n d to ensur e that the GEJ\rl o p ti mi zer effectively used the i n fcm1ution the side-effects package provid e d . Thanks a l so to J o h n Henning of the C S D Performance Gro u p a n d J eannie Lieb o f t h e GEM team fiJ r t h e i r h e l p usi ng the S PECint95 benchmark suite. A f-i nal word of t h a n ks goes to B o b M o rgan f(x suggesti ng that I write this paper and to my m anage ment f()r s upporting my doi ng so. Biography References and Notes 1 . R. Wi lson a nd M . Lam, " Enicient Comext-Sensitive Poi mcr Analysis for C Programs," Proceedini�S of the A C/\1 S!C;PLA 1\ '95 Conference on Progra m m ing La n guaw' IJesip,n a n d Implementation. L a J o l l a , C a l i f. ( J u ne 1 99 5 ) : 1-1 2 . 2 . D . Coutant, " Rctargetable High-Level Alias Analysis," Proceedings ofthe 13th A nnual �)mposium ciples oj' Program ming Languages, on Pl7n St. Petersburg Beach, Fla . ( Ja n uary 1 98 6 ) : 1 1 0-1 1 8 . 3 . A . Diwan e t al . , "Type-Based Alias Analysis," Procecd iu,�s o/ the 1 998 A CM SICPLA N Co uference o11 Pro f:ira m m ing La nguage Desig n a n d Implementation. Montreal , Canada ( J u n e 1 99 8 ): 1 06-1 1 7 . 4 . J o i n t Tech n ical Com mittee ISO/ I E C JTC 1 , "The C Programming Language," International Sta n da rd !SO/JJ;'C 9899 1990, section 6 . 3 Expressions. 5. "Worki ng Paper for Draft Proposed I n ternational Standard for I nforma tion Systems-Progra mming Language C++," WG2 1 /N 1 146, November 1 997, section 3 . 1 0 . 6. D . Blickstein e t a l . , "The G E M Optimizing Compiler System," /Ji[;ilal Tech n ical.fournal, vol. 4 , no. 4 ( Spe cial Issue, 1 99 2 ) : 1 2 1 - 1 3 6 . 7. R. Crowell ct a l . , " T h e GEM Loop Transformer," ni,� ital Tech n icaiJou rnal, vol. 1 0 , no. 2, accepted for publ ication. August G . Reinig August Rei n ig is a principal somvarc engineer, currently working on debugger support i n the D I G I TAL C++ compiler. In addition to his work on the DEC C and C++ side-effects package, August implemented a Java- based distributed test system for t.he D EC C and D IGITAL C++ · compilers and a para l le l build system for the DEC C and D I GITAL C++ compilers. The d istri buted test system simultaneously runs multiple tests on d ifferent machines and is fault tolerant. Betore joining the DEC C and C++ team, he conu·ibuted to a n advanced development incre mental compiler project, which led to two patents, "Method and Apparatus fc>r Somvare Testing Using a Testi ng Technique to Test Compilers" and "Method and Apparatus tor Testing Somvare. " He earned a B .S. in mathematics ( m agna cum laude) !Tom Dartmouth Col lege in 1 980 and an M .S . in computer science fi·om H arvard University in 1 997. He is a member of Ph.i Beta Kappa. 8. A . AJ1o, R . Sethi , and ] . U l l m a n , Compilers Princ iples. Techn iljlles. a n d Tools ( Reading, l'vbss: Addison vVesley, 1 98 6 ): 104. 9 . I n formation about the SPEC benchma rks is available from the Standard Pertorm<\nce Evaluation Corpora tion at http ://www. specbench.org/. 1 0 . J. Anderson ct a l . , "Conti nuous Profiling: vVhcre H ave All the Cycles Gone> " Proceedings of the Sixteenth A O\If .S)'mposiu m on Operatln/:5 Systl!rn Principles, Sait M::tlo, France ( October 1 99 7 ) : 1 5-26. I I . A. SrivJstava and A. Eustace, "ATOM : A System for Building Customized Program Analysis Tools," Pro ceedings of tbe .-10\lf S!CPL- !:V 9 ·1 Conference on Pro wwn m ing Language Desig n U l l d !mplemenlalion. Orlando, F l a . ( J u ne 1 994 ) : 1 96-2 0 5 . 1 2 . l/i\1/IPS- V Rejere11ce Manual (pixie a nd pixstats) (Sun nyva le, Ca l i f. : M IPS Computer Systems, 1 99 0 ) . Digital Technical Joumal Vol . 1 0 No. 1 1 998 57 I Philip H. Sweany Steven M. Carr Brett L. Huber Compiler Optimization for Superscalar Systems: Global Instruction Scheduling without Copies The performance of instruction-level parallel Many of today's computer appl ications req u i re com p u systems can be improved by com piler prog rams t u res that provide l i ttle or no para l l e l i s m . A pro m ising that order mach i n e operations to i ncrease system paral lelism and reduce execution time. The opti mization, cal led i nstruction sched u l ing, tation power n o t easily achieved b y computer architec alternative is the parallel architecture, more specifical ly, the instruction-level para l l e l ( I LP ) arc h i tecture, which i ncreases computation d u ri n g each machine cycle. I LP i s typica lly classified as local schedu l i n g if only computers a llow para l l e l computation of the lowest basic-block context is considered, or as g lobal level mac h i n e operations with i n a sin gle i nstruction cycle, i n c l u d i n g such operations as m e mory l oads and sched u l i n g if a larger context is used. G lobal sched u l i n g is generally thought to g ive better results. One g lobal method, domi nator-path stores, i n teger additions, and floating-point m u ltiplic:� tions. I LP architectures, l ike conventional architectures, conta i n m u l tiple fu nctional u n its and p i p c l i ned fi.m c sched u l ing, sched ules paths in a function's tional u nits; b u t, they have a singJ c progr:�m cou nter domi nator tree. U n l i ke many other g l obal and operate on a single instruction stre a m . Compaq sched u l i n g methods, dominator-path sched u l Computer Corporation's AlphaServer syste m , based on the Alpha i n g does n o t req u i re copy i n g of operations 2 1 1 64 microprocessor, is :�n example of an ILP machine. to preserve program semantics, making this method attractive for supersca lar a rch itectures that provide a l i m ited amount of i nstruction To effectively usc parallel h a rdware and obtain performance ad van tagcs, compi ler programs must i d c n tif)r the appropriate level o f para l lelism . For I LP l evel para l l e l i sm. In a sma l l test su ite for the arc h i tectu res, the comp i l e r must order the s i n g l e Alpha 2 1 1 64 supersca lar arch itecture, dominator i nstruction stream such t h a t m u l ti pl e , low-level opera path sched u l i n g produced sched u les req u i ring 7.3 percent less execution time than those pro duced by local sched u l i n g a lone. tions execute s i m u l taneously whe never possi b l e . This orderi ng by the compiler of machine operations to e ffectively use an I LP arc h i tecture's increased para l l e l i s m i s called instruction schedulin,r, . I t i s an opti m i zation n o t us u a l l v ro u nd in compi lers for non- I LP arch i tcctu res . Instruction sched u l i ng is c lassified as local if i t considers code o n ly within a basic b l o c k a n d ,r,loha! i f i t schedu l es code across m u l tiple bJsic b l ocks. A dis advantage to local instruction sched u l i n g is its i n a b i l i ty to consider context from s u rrou n d i n g b l ocks. \Vh i l e local sche d u li n g c a n fi n d parall elism with i n a basi c block, it can do noth i n g to exploit para l l el i s m bel:\veen basic blocks. Generally, global sched u l i n g is preferred because i t can take advantage of added program parJ l lelism avai lable when t h e compiler i s :� !lowed t o move code across basic block bmmdJries. Tj aden and F l y n n , ' tor example , fo u n d paralle l ism w i t h i n a basic block q u i te l im ited . Using a test s u i te o f scienti fi c programs, t h ey m e as u re d an average para l lelism o f 1 . 8 within basic blocks. In s i m i l a r experi ments o n scientifi c pro- 58 Digital Tcch tlical journal Vol . 10 N o . I 1 99 8 grams in which the compi l er moved code across basic later than Y. These D O D edges are b;�sed on the formal block bound aries, Nicolau and Fisher ' rou n d paral ism of data dependence analysis. There are tl1ree basic l e l ism that ranged from 4 to a virtually u n l i m ited num ber, with a n average of90 for the entire test suite. Trace scheduling '' is a global schedu ling technique types of data dependence, as described by Pad u a et al .'' • dent on a progra m , possibly at t11e expense o f less freq uently within sequential code by allowing massive migration of lvL writes to some m e mory location read by M , . • Antidependence, a l so cal led fa lse dependence. A DDD node M2 is a n tidependent on operations across basic block bounda.ties during schedul D D D node M , i f M , executes before M z and M 2 writes t o a m e m ing. By addressing this l arger scheduling context ( m any ory locati o n read by M , , thereby destroyi n g the basic blocks), trace scheduling can produce better sched val u e needed by M , . u l es tlun teclmiques that address the smaller context of a single block. To ensure the program sema.t1tics are not D D D node M , is flow depen D D D node M , i f M , execu tes before M, and data dependence. A that attempts to optimize fi:equently executed paths of executed pat11s . Trace schedu ling exploits paral lclis� Flow dependence, also cal led b·ue dependence or • A D D D node M , i s output ODD node M , i f M , executes before O u tp u t dependence. changed by i n terblock motion , trace sched u li n g inserts dependent on copies of operations that move across block bou ndaties. M2 and M1 and M, both write to the same locati o n . Such copies, necessary to ensure program semantics, are called wmpm1sation copies. The research described here is driven by a desire to develop a global i nstruction sched u l i n g tech n i q u e t h a t , l i ke trace schedu l i ng, a l l ows operations t o cross block bou n daries to fi n d good schedules a n d that, u n l i ke trace sched u l ing, does not req u i re i nsertion o f compensation copies . L i k e trace sched u l i ng, D PS first defi nes a m u l ti block context for sched u l i ng and then uses a local i nstru ction scheduler to treat the l arger context l i ke a si ngle basic bloc k . S u ch a tec h n iq u e pro vides effective sched u l es and avoids the performance cost of execu ting compensation copies. The g lobal sched u l i ng tech nique described here is based on the dominator relation * among the basic blocks of a fu nc tion and i s calle d domi nator-path sched u l i n g ( D PS ) . Local Instruction Sched u l i ng ; begin with a brief d iscussion of the local schedt ling problem. As the n a me i m plies, local instruction sched uling attempts to maxi mize para l lelism within eac h basic block of a fu nction's con trol rl ow graph. I n gen this opti m i za tion problem is N P-complete . ' H owever, i n practice, heuristics ach ieve good results. ( L..1.ndskov e t al.'' give a good su rvey of early instruction sc h ed u l i n g algorithms. Al lan e t aF describe how one might b u i l d a retarge table local i nstruction sched u ler. ) L1st schedulinp, " i s a general method often used tor local i nstructi on sched u l i ng . Briefly, l ist sched u l i n g _ typtc:: d ly req u t res two p h ases. The fi rst phase bui lds a d i rected acyclic graph ( DAG), c<�lled the d:J.tJ. depen dence DAG ( D D D ) , tor each basic block i n the fu nctio n . D D D nodes represent operations to be sched u led . The DDD's d i rected edges i n d icate that a node X D D D node, a set of a l l memory locations used ( read ) defined ( writte n ) by that particular D D D nod e . and all memory locations Once the D D D i s constructed, t h e second phase begi n s when list sche d u l i n g orders the graph 's nodes i nto the shortest sequence of insb·uctions, s u bject to (1) the constraints in the gra p h , and ( 2 ) the resource limitations i n the machine ( i . e . , a mac h i n e is typical ly u m i ted to holding o n l y a single value a t any ti m e ) . I � genera! l i s t sched u l i n g , :.1. n ordered J i s t o f tasks, called a pnoriz)l list, is constru cted . The priority l ist takes i ts name from the tact that tasks are r:mked such that those with the highest priority are chosen first. In the context of local i nstruction scheduling, the priority list contains DDD nodes, all of whose predecessors have a l ready been i nc l uded in the sched u l e being constru cted . Si nce D PS relies on a local i nstruction sched uler we e ra l , To facil itate deter m i nation <1 11d manipul ation o f data dependence, the c o m p i l e r maintains, for each preceding a node Y constrains X to occ u r n o * A basic bl_oc k , D , domin ates another block, B , i f cl'<.:n p a t h from the root ot the control-How graph (or a function ro B must pass throug;h D Expressions, Statements, and Operations Within the context o f this paper, we d iscuss a l gorithms for code motion . Before going fu rther, we need to ensure common u nderstanding among our readers tor o u r use of terms such <�S expressions. statements. and operations. To start, we consider a com p u ter program to be a ltst of operations, each of which ( possi b l y ) computes a righ t-hand side ( rh s ) v;�l u e and assigns the rhs val u e to a memory location represented by a left hand side ( l h s ) vari a b l e . This can be expressed as A�E where A represents a single m e mory l ocation and E represents an ex pression with one or more operators and a n appropri:ue n u m be r of oper;�nds . D u ring d i f fere n t phases of a compi ler, operations might be repre sented <�s • • Source cod e , a high - level langu<�ge such as C I n termediate statements, a l i n ear form of three address code such as q u ads or n-tuples'" Digital Technical Journal Vol . L O N o . 1 998 59 • DDD nodes, nodes in a DDD, ready to be sched u l ed by the instruction scheduler Important to note about operations, whether repre sented as mtermediate statements, source code, or DDD nodes, is that operations include both a set of definitions and a set of uses. Expressions, in contrast, represent the rhs of an operation and, as such, include uses but not defini tions. Throughout this paper, we use the terms state ment. intermediate statement, operation, and DDD . node Interchangeably, because they all represent an operation, with both uses and definitions, albeit gen erally at different stages of the compilation process . When we use the term expression, however, we mean an rhs with uses only and no definition. Dominator Analysis Used in Code Motion I n order to determine which operations can move across basic block boundaries, we need to analyze the source program . Although there are some choices as to tl�e exact analysis to perform , dominator-patl1 scheduhng IS ased upon a formalism first described by Retf and Taq a n . " We summarize Rei f and Tarjan's work here and then discuss the enhancements needed to allow interblock movement of operations. I n their 1 9 8 1 paper, Reif and Tarjan provide a fast algorithm for determining the approrimate hirthpoints of expressions in a program's flow graph . An expres sion's birthpoint is the first block in the control flow graph at which the expression can be computed, and the value computed is guaranteed to be the same as in the original program. Their technique is based upon fast computation of the idef set for each basic block of the control flow graph . The idef set for a block B is that set of variables defined on a path between B's i mmediate dominator and B. G iven that the domina tor relation for the basic blocks of a function can be represented as a dominator tree, the immediate domi nator, IDOM, of a basic block B is B's parent in the dominator tree . Expression birth points are not sufficient t o allow u s t o safely move entire operations from a block t o one of its dominators because birthpoints address only the movement of expressions, not definitions. Operations in general include not only a computation of some expression but the assignment of the val ue computed to a program variabl e . Ensuring a "safe" motion tor an expression requires only that no expression operand move above any possible definition of that operand, thus changing the program semantics. A similar requirement is necessary, but not sufficient, for the variable to which the value is being assigned. In add i tion to not moving A above any previous defirution o f A, A cannot move above any possible use of A . Otherwise, w e r u n the risk of changjng A's val ue for ? 60 Digital Technical JournaJ Vol . 1 0 No. l 1998 mat previous use. Thus, dominator analysis compu tes me zuse set for each basic block and tor me idef set. The iuse set for a block, B, is that set of variables used on some path between B 's immediate dominator and B. Using the idefand iuse sets, dominator analysis com putes an approxinute birtl1point for each operation. In this paper, we use the term dominator analysis to mean the analysis necessary to allow code motion of opera ons while disallowing compensation copies. Additionally, we use the term dominator motion for the �eneral optimization of code motion based upon dommator analysis. � Enhancing the Reif and Tarjan Algorithm By enhancing Rei f and Tarjan 's algorithm to compute hi11hpoints of operations i nstead of expressions, we make several issues i mportant that previously had no effect upon Reif and Tarjan's algorith m . This section motivates and describes the information needed to allow dominator motion, including the use, def iuse, and ide{ sets for each basic block. An algorithmic description of this dominator analysis information is included in the section Overview of Dominator-Path Scheduling and the Algorimm tor I n tet·block Motion . \V:hen we aLlow code motion to move intermediate statements ( or j ust expressions) from a block to one of its dominators, we run the tisk that the statement (expression) will be executed a different number of times in the dominator block than it would have been in its original location. vVhen we move only expres sions, the risk is acceptable ( although it may not be efficient to move a statement i n to a loop ) since the value needed at the original point of computation is preserved. Relative to program semantics, the number of times the same value is computed has no effect as long as the correct value is computed the last time. This accuracy is guaranteed by expression birthpoints. Consider also the consequences of moving an expres sion Jiom a block that is never executed for some partic ular input data. Again, i t may not be efficient to compute a value never used, but the computation does not alter progran1 semantics. \Vhen dominator motion moves entire statements, however, the issue becomes more complex. I f the statement moved assigns a new value to an induction vatiable, as in me following exatnple, n= n+ 1 dominator motion would change n's fin al value if it moved the statement to a block where the execution freq uency differed from that of its original block. We cou ld al leviate this pro blem by prohibiting motion of any statement for which the use and de{ sets are not disjoint, but the possibi lity remains that a statement may ddine a variable based indirectly upon that vari able's previous value. To remedy the more general p roblem, we disallow motion of any statement S ) ) whose def set intersects with those variables that are used-before-defined in the basic block in whi ch S resides. Suppose the optimizer moves an i ntermediate state ment that defines a global variable from a block that may never be executed for some set of inpu t data i nto a dominator block that is executed at least once for the same i nput data. Then the optimized version has defined a variable that the u noptimized function did not, possibly changing program semantics. We can be sure that such motion does not change the semanti cs of that function being compiled; b u t there is no mech anism, short of compiling the e n tire program as a sin gle unit, to ensure that defining a global variable in this function will not change the val ue used in another fu nction. Thus, to be conservative and ensure that it does not change program semantics, dominator motion prohibits interblock movement of any state ment that detines a global variable. At first gl ance, it may seem that this prohibition cripples dominator motion's ability to move any i ntermediate statements at all; but we shall see that such is not the case . One fi n al addition to Reif and Tarj a n information is required to take care of a subtle problem. As discussed above, dominator analysis uses the idef and iuse sets to prevent i l legal code motion . The use of these sets was assumed to be sufficient to ensure the legality of code motion i nto a domi n ator block; u n fortunately, this is not the case . The problem is that a definition might pass through the i mmediate dominator o f B to reach a use i n a sibling of B i n the dominator tree. I f there were a detlnition of this variable in B, but the variable was not defined on any path from the immediate dom i nator, there would be nothing in dominator analysis to prevent the definition from being moved into the domi nator. But that would change tl1e program's semantics. Figure 1 shows tl1e control-flow graph for a function called fi ndmax ( ) , with only the statements referring to register r7. Register r7 is defined in blocks B3 and B7, and referenced in B 9 . This means mat r7 is live-out of B 5 and live-in to B 8 , but not live-in to B7; there is a definition of r7 i n B 3 that reaches B 8 . Because there i s no definition o r use between B 7 and its i mmediate dominator B 5 , the idef and iuse sets of B 7 are empty; thus, dominator analysis, as described above, would allow the assignment of r7 to move upward to block B 5 . This motion is i ll egal ; it changes the definition in B 3 . Moving me operation from B7 to B5 changes the conditional assignment of r7 to an unconditional one. To prevent this from happening, we can i nsert the variable into the iuse set of the block B, in which we wish the statement to remain. We do not, however, want to add to the iuse set unnecessarily. The solution is to add each variable, V, that is live-in to any of B 's siblings i n tl1e domi nator tree, but not i nto B, or to B's t I B4 I B5 I t qJ 7 gJ - B8 G Figure 1 Control Flow Graph for the Function tindmax( ) iuse set. This will prevent any definition of V that might exist in B from moving up. If there is a defini tion o f V i n B, but V is live-in to B , there must be some use of V in B before the definition, so it could not move upward in any case . Measurement of Dominator Motion To measure the motion possible i n C programs, Sweany1' defined dominator motion as the movement of each i n termediate statement to its birthpoint as defined by dominator analysis and by the n u mber of domi nator blocks each statement j u mps during such movement. Sweany's choice of i ntermediate state ments (as contrasted with source code, assembly lan guage, or DDD nodes) is attributed to the lack of machine resource constraints at that level of program abstraction . He envisioned dominator motion as an upper bound on the motion avai lable in C programs when compensation copies are i ncluded . In the test suite of 12 C programs compiled, more than 25 per cent of all i ntermediate statements moved at least one dominator block upwards toward the root of the dom i nator tree . One function allowed more than 50 per cent of the statements to be hoisted an average of nearly eight domi nator blocks. The considerable amount of motio n (without copies ) avai lable at the i n termediate statement level of program abstraction Digital Technical Journal Vol . 1 0 No. 1 1 998 61 provided us with the motivation to use similar analysis techniques to facilitate global instruction schedu l i ng. Overview of Dom inator-path Sched u l ing and the Algorithm for lnterblock Motion Since experi ments show that dominator analysis al lows considerable code motion without copies, we chose to use dominator analysis as the basis tor the instruction scheduling algorithm described here, namely dominator path scheduling. As noted above, D PS is a global i nstruction scheduling method that does not require copies of operations that move ti.-om one basic block to another. DPS performs global instruction scheduling by treating a group of basic blocks found on a dominator tree path as a si ngle block, scheduling the group as a whole . In this regard, it resembles trace scheduling, \vhich schedules adjacent basic blocks as a single block. DPS's fou ndation is scheduling instructions while mov ing operations among blocks according to both the opportunities provided by and the restrictions imposed by dominator analysis. The question arises as to how to exploit dominator analysis information to permit code motion at the instruction level during scheduling. DPS is based on the observation that we can use ide( and iuse sets to al low operations to move from a block to one of its dominators during instruction scheduling. I nstruction scheduling can then choose the most advantageous position tor an operation that is placed in any one of several blocks. Because machine operations are incor porated in nodes of the DDD used in schedu ling and , l i ke intermediate statements, DDD nodes are repre sented by dejand use sets, the same analysis performed on intermediate statements can also be applied to a basic block's DDD nodes. The same motivation that drives trace scheduling namely that scheduling one large block allows better use of machine resources than scheduling the same code as several smaller blocks-also applies to D PS . I n contrast to trace scheduli ng, DPS does not allow motion of DDD nodes when a copy of a node is required and does not incur the code explosion due to copying that trace scheduling can potentially produce. For architectu res with moderate instruction- level paralle lism, D PS may produce better results than trace sche uling, because the more l imited motion may be suttictent to make good use of machine resources, and unlike trace sched ul ing, no machine resources are devoted to execunng semantic-preserving operation copies. Much l i ke traces,* the dominator path's blocks can be chosen by any of several methods. One method is a heuristic choice of a path based on length , nesting depth , or some other program characteristic. Another is programmer specification of the most important � •groups of blocks ro be scheduled rogerhcr in rrace sched u l i n g 62 Digital Technical Journal Yol . l O N o . 1 1 998 paths. A third is actual profiling of the running pro gram . We visit this issue again in the section Choosing Dominator Paths. First, however, we need to discuss the algorithmic details of D PS . O n c e D PS selects a dominator p a t h to schedule, it requ i res a method to combine the blocks' DDDs into a single DDD for the entire dominator path . I n our compiler, this task is performed by a DDD coupler,�.' which is designed for the p urpose. Given the DDD coupler, DPS proceeds by repeatedly • Choosing a dominator path to schedule • Using the DDD coupler to combine each block's DDD on the chosen dominator path • Scheduling tl1e combined DDD as a single block The dominator-path schedu ling algorithm, detailed i n this section, is summarized in Figures 2 and 3 . A significant aspect o f the D PS process i s to ensure "appropriate" interblock motion of D D D nodes and to prohibit "il legal" motion. As noted earl ier, the combined DDD for a domi nator path includes control flow. Therefore, when D PS schedules a group of blocks represented by a single DDD, it needs a mecha nism to map correctly the scheduled instructions to the basic blocks. The mechanism is easi ly accom pl ished by tl1e addition of two special nodes to each block's n"D D . Called B lockStart and B lockEnd, these special nodes represent the basic block boundaries. Since dominator-path scheduling does not allow branches to move across block bou ndaries, each B lockStart and B lockEnd node is initially "tied" (witl1 DDD arcs) to the branch statement of the block, .if any. Because B lockStart and B lockEnd are nodes i n the eventually combined DDD, they arc sched uled like all other nodes of the combined D D D . After scheduling, all i nstructions between the instruction containing the B lockStart node for a block and the i nstruction con taining the B lockEnd node for that block are consid ered i nstructions for that block. Next, DPS must ensure that the B lockStart and BlockEnd DDD nodes remain ordered ( i n the scheduled instructions) relative to one another and ro the B lockStart and BlockEnd nodes tor any other block. To do so, DPS adds use and dej i n formation to the nodes to represent a pseudore source, B lockBoundary. Because each BlockStart node defines B lockBoundary and each B lockEnd node uses BlockBoundary, no BlockEnd node can be scheduled ahead of its associated BlockStart node ( because of flow dependence . ) Also, a BlockStart node cannot be scheduled before i ts dominator block's BlockEnd node ( because of antidependence). By establishing these imaginary dependencies, DPS ensures that the DDD coupler adds arcs between all BlockS tart a nd B lockEnd nodes . Algorithm Domi nator- Path S c heduling I np u t : Function Control Flow Graph Domi nator Tree Post- Domi nator Tree Outp u t : Sched u led i nstructions for the fun ction Algori th m : Whi le a t least one Basic B lock i s unsched u l ed Heuristically choose a path B , , B 1 , . . . , B, in the Dominator Tree that inc ludes only u nschedu led Basic B locks . Pe rform dominator analysis to co m p u te l De f a nd I U s e sets / * B uild one D D D tor the entire domi nator path *I Combined D D D = B , For i = 2 to T n = I ni tiali zeTransitionDDD ( B , ., , B , ) Com bined D D D = Cou p l e ( Combined D D D ,T) Combined D D D = Cou ple ( C ombi ned D D D , B , ) Perform list sched uling on Combined D O D Mark each block o f DP sch eduled Copy schedu led in structions to the B l ocks of the path ( i nstructions between the BlockStart and B l ockEnd nodes for a B lock are "written " to that B lock) End vVhi le Figure 2 Dominaror-pJth Scheduling Algorithm Looking back to domi nator analysis, we see that operations tl1at i nstruction sched u l i n g allows. In dom i n terblock motion i s prohibited if the operation bei ng i n ator motion , i ntermediate stJtements move in only moved one d i rection , i . e . , toward the top of the ti.mction's • D e fi nes someth i n g that is i n c l uded in e i ther the ide/or iusc set • Uses some thing included i n the idef set for the bl ock in which the operation cu rrently resides control How graph , not from a domi nator block to a domi nJted one. This one-directional motion is rea sonable when attempting to move i n termediate stJte ments because one state ment's movement wil l l i kely open possibil i ties tor more motion i n the same d i rec To obtain the same p rohibitions i n the combined tion by other state ments. When statements move i n D O D , we add the ide("set tor a basic block, B, to the d i fferent directions, o n e stJte ment's motion m ight defset B 's BlockStart node. S i m i l a rl y, we add the iuse set tor B to the use set of B's B lockStart n ode. Thus we i n h i bi t another's movement in tl1e opposite d i re c tion . The goal of dominator motion is to move statements as cntorcc the same restriction on movement that domi t�u· as possible i n tl1e control flow graph. In contrast, tl1e nator analysis i m posed upon i n termediate statements goal of DPS is not to maxi.rn.ize code motion, but rather 0, to fi n d , for each operation, will yield me shortest sched u l e . Thus our goal has restrictions on movement of operations that define changed fi: om that of dominator motio n . To gain the 3 fu ll benefit from DPS, we wish to allow operJtions to gives an a lgori thmic descrip tion of the process of move past block boundaries in either direction . To per either global vJriJbles or i n d uction variables. Figure that location for 0 and ensure that any i n tcrb.lock motion preserves pro gram scmJntics. In J similar manner, DPS i ncludes the that "doping" the B lockS ta r t and BlockEnd nodes to pre mit bidirectional motion, we use the post-dominJtor ven t d isal l owed code motion. rel a tion , which says that D PS i s complicated by factors not relevant tor dom inator motion of i n termediate statements. Foremost is the complexity im posed by the bidirectional motion of a basic block, P D , is a post domi nator of a basic block B if al l paths from B to the function's exit must pass ilirough P D . Using thi s strat egy, we s i m i l arly define post-idefand Digital TcdHlic.ll Journal post-i use s ets. Vol . 1 0 No. l In 1 99 8 63 Algorithm I n i ti al i zeTransition D D D ( B , , B 1 ) Input: A Transition D D D templates, with a D u mmy DD DNode for B , 's block end and one for B, 's block start Two basic blocks, B, and B , that we wish to couple Domi nator Tree Post- Domi nator Tree The fo l l owing dataflow information Def, Use, I Def, and I Use sets for B , and B, Used - B e fore-Defined set for B, Post- I Def, a n d Post-I Use sets for B , and B, B,'s "sibling" set, defined to i n clude any variable live-in to a dominator-tree si bling ofB,, but not live - i n to B, A basic bloc k D D D for each of B, and B, Output: An i n i tialized Transition D D D , T Algorith m : T = Tra nsiti o n D D D / * "Fix" s e t for global and induction variables. * / Add set of global variables to B/s ! U se Add B/s Use d - Before - Defined to B/s IUse Add B/s si b l i n g set to B/s I Use If B, does not post-dominate B , Add B, 's Use set to Ts Block End Def set Add B , 's Defset to T's BlockEnd Use set Else Add B, 's Post- I Def set to T's BlockEnd Def set Add B , 's Post-lUse set to T's B l ockEnd Use set Add B/s I Def set to T's B lockS tart Def set Add B� 's I Use set to T's BlockS tart Use set Return T Figure 3 I nitial ize Transition O D D Algorithm fact, it is not d i ffi c u l t to comp u te Jll these q u a n ti ties sor, S, in the forward domi nator p:tth does not post for a fu n c tion . The s i mp l est w:�y is to l ogica l l y reverse d o m i nate B , DPS adds B 's de(set to the the direction of all the control flow gr:�ph arcs and per B l ockEnd node associated with use set of the B . In similar t-:1sh i o n , for m domi nator an alysis on the resu l t i n g gra p h . w e a d d B' s Hav i n g co m p u ted t h e post-domi n ator tree, DPS This technique prevents any D D D node origi n a l l y in ch ooses dominator paths such that the domina ted B from moving downward i n the domi nator path . use s e t t o B ' s B lockEnd node's de( set. node is a post-domi nator of its i m m ediate predecessor in a d o m i n ator p a t h . This c h oice a l l ows operations to Choosing Dom inator Paths move " free ly" in both d i rections. Of course, this may be too l imiting on the choice of domi n a tor paths. To DPS allows code movement a l o n g any domin ator allow for the possibility that nodes i n a domi nator path path, b u t there are many ways to wi l l not form a post-domin ator relati on, D PS needs a investigation of the effects of domi nator-path choice mec hanism when on the efficiency of generated schedu les te lls us that needed . Again, we rely o n the tech nique of a d d i ng the choice of p a th is too i mporta n t to be left to arbi to limit bidirection a l motion dependencies to the combi ned D D D . In this case ( assu m i n g that DPS is sched u l i ng paths in the forward domi nator tree), for any basic block, B, whose succes- 64 Digiral T�c hnical JournJI Vol . 10 No. I 1 9 98 select these paths. An trary selectio n ; twice the average percent speed u p * for several functions can often be ach ieved with a simple , *( unopti m i zed_speed - oprirnized_spccd )/u noptirnizcd_spccd well-chosen heuristic. Some functions have a potential tion o f D PS and the n u mber of cli stinct dominator tree percent speed up almost fou r times the average. Thus, partitionings. The original i m plementation of DPS it is important to find a good, generally app l icable incl uded a single, simple heuristic to choose domina heuristic to select tl1e domi nator paths. tor patl1s. More specifically, to choose dominator pams Unfortunately, it is not practical to schedule all of witl1 in a group, G, of contiguous blocks at me same the possible partitionings for large functions. If we nesting level, me compiler continues to choose a allow a basic block to be included in only one domina block, B, to "expand . " Expansion ofB initializes a new tor path, the formula for the numbe r of distinct parti dominator path to include B and adds B's dominators tionings of the dominator tree is until no more can be added. The algorimm then starts anomer domi nator path by expanding another ( as yet IT [ outdeg( n) + 1 ] u nexpanded) block of G. The first block of G chosen II € .\' to expand is me tail block, T, in an atte mpt to obtain as where N is the set of nodes of the dominator tree . " long a dominator pam as possible . Although the n u m ber of possible paths i s not prohi bi Unformnately, not all functions are small enough to tive for small dominator trees, larger trees have a pro be tested by performing DPS for each possible parti hibitively large n u m ber. For example, whetstone's tioning of the dominator tree. Therefore, we defined main( ), with 49 basic blocks, has a lmost two tri ll ion 37 different heuristic memods of choosing dominator trees, based upon groupings of SL"X key heuristic factors. distinct partitionings. To evaluate differences i n dominator-path choices, The maxim u m patl1 lengms of tl1e basic guidelines we scheduled a group of small fu nctions with DPS were adjusted to produce actual heuristics. We used using every possible choice of dominator p at h . The the heuristic factors from which the individual heuris target architecture for this study was a hypotheticaJ tics were constr ucted ; each seemed likely e i ther to 6-wide long-instruction-word ( LIW) machine, which m i m i c the observed characteristics of the best path was simu lated and i n which it was assumed that all selection or to allow more freedom of code motion cache accesses were hits. and, therefore, more fl exibility i n filling "gaps. " The results of exhaustive dominator-path testing show, as expected , that varying the choice of domina tor paths significantly affects the performance of scheduling. For all functions of at least two basic blocks, DPS showed i mprovement over local schedul ing for at least one of tl1e possible choices of domina tor paths. Table 1 shows the best, average, and worst percent speedup over local scheduling found for a l l fu nctions that h a d a "best" speedup of over 2 percent; it also shows the speed u p of tl1e origi nal implementa- • One nesting level-Group blocks from the same nesti ng level of a loop. Each block is in the same strongly connected component, so the blocks tend to have similar restrictions to code motion . For a group of blocks to be a strongly connected compo n ent, there must be some path in the control tlow graph fro m each node in the component to all the otl1er nodes in the component. Si nce the function will probably repeat the loop, it seems l i kely that the scheduler will be able to overlap blocks in it. Ta ble 1 Percent of Function Speed u p I m p roveme nt Using D PS Path C h oices over Local Sche d u l i ng Percent Speed up Function Name Best Average Worst Original No. Dominator Tree Partitions bu bble 39.2 1 0 .6 - 0. 1 1 1 .7 72 readm 32.5 9.3 - 0. 2 32.5 48 solve 27.8 9.9 - 0. 2 27.8 96 qu eens 25.4 8.3 - 0. 4 - 0.4 96 swa prow 23.1 5 .8 - 3 .7 1 9 .5 24 print ( g) 22.0 9.1 - 0. 2 22.0 8 find max 2 1 .3 6.2 - 0. 3 8. 7 18 copy col 1 8. 5 5.6 - 5.0 1 9 .9 8 elim 1 4.3 2.3 - 3.8 1 0. 2 576 mult 1 3 .7 2.1 - 3.8 1 0. 3 96 su bst 1 2 .9 2.4 - 4. 9 4.9 96 pri nt(8) 1 2.5 6.2 0.0 1 2.5 8 Digiral Technical Journal Vo l . 10 No. l 1 99 8 65 • • Longest pa th-Sched u l e the longest ava i l a ble path . Conse q u e n tly, path lengths c: 1n be l i m i ted without This h e u ristic c l ass Jl lows the maxi m u m d istance lowering the efficie ncy of generated cod e, and l o n ger tor code motion . paths, whi c h i ncrease sched u l i n g time, c: m be avoided. Since n o one heuristic performed we l l for all fu n c Postdomin ator-Follow the postdominator relation in the dominator tree. When J dominator block, P, is tions, w e advise u s i n g a com bi nation of heu ristics, i . e . , succeeded by a non-postd ominator block, S, our schedule by using each of th ree heuristics Jnd taking compiler adds P's del set to the use set of P's the best sched u l e . The "com bi n ed " heuristic i n c l u des B l oc kl-: nd node and the the following: use set to the def set to prevent any code motion from P to S. I f P is i n stead succeeded by its postdomi mtor block, no such mod i n cation is nece ss::try, and code would be allowed to move in both directions. Intu i tively, the postd omi na • Instruc tion density, limit to five blocks • O n e nesting level on path, l i mit to fi ve bl ocks • Non -postdomi nator, u n l i m i te d l e ngth tor relation is the cx::tct inverse of the dominator reb ti on, so code can move down, i n to a postdomi nator, as it moves up i nto a domi nator. Fur ther, the simple act of adding n odes to the D D D will complicate list sched u l ing, malvith sparse blocks and putting sp arse blocks together. blocks with h igher nesti n g levels are more costly than those added to bl ocks with lower n esting levels. Even within a loop, there exists the potenti�1 l tor consider able variation i n the executi o n fTcq uencie s o f d i tkrent b l ocks i n the meta- block due to control tlow. o r· course variable execution freq uency is not :111 issue i n trad i ti on a l local sched u l i n g bec:1use, with i n the con text of a s i ngle basic block, each D D D nodl: is exe c u ted the same n u m b er o f times, n:� m e l y, once each time executi o n enters the block. To address the issue of d i ffe ri ng execution frequen cies within meta- blocks schedu led as :1 single block by D PS, we i.nvestigated fl·equency- based l ist sc hedu l i ng (FBLS ) , ' ; an an extension of Jist sched u l i n g th::Jt provides answer to this d i ffi c u l ty by considering that execu The h e u ristic factors were used to make i nd i vidual tion fi-equencies d i ffer with i n sections of the meta h e u ristics by ch:1nging the limit on the poss i b l e nu m blocks. FBLS uses a greedy method to p l :1cc D D D nodes ber of b locks i n a p::lth . I t was reasonable t o set l i m i ts in the lowest-cost instruction possi b l e . f B LS ame nds fo r fo ur tactors : postdominator, non- postd o m i n a tor, tl1e basic list- sched u l i n g a lgorithm by revising only the ide/ size, a n d density. We tried p:�th length l i m i ts in 3, 4, 5 , :1 n d u n l i mited , making a total o f D D D node placement policy in an atte mpt to red u ce blocks of 2 , the r u n-time cycles required to execute ;1 meta-bl ock. U n fortunate ly, although F B LS makes intuitive sense, five heu ristics fi·om each h e uristic factor. 66 t(x the d i fferi n g path beh avior. idef size, the more interference t h ere is to code • In theory, to best s c h e d u l e any meta- block, an class . T h e previous cbss was s u ggested b y intuition Ru n n i n g DPS using cJch of the he uristic me thods we fou nd that D PS pro d u c ed worse schedu les with a nd comparing the efti ci e n cy of the res u l ti n g code F B LS than i t produ ced with a na ive local sc h e d u l i n g l eads to several con cl usions about effective heu ristics algorithm t h a t ignored frequency d i ffe rences with i n fo r choosi ng DPS's d o m i n ator paths. for some he u ris D PS's meta- blocks. Therefore, t h e c u r rent imple tics, we can achieve the best schedules for DPS by mentation of D PS ignores the execution tt·cq uency using paths that r:1 rely exceed th ree blocks. For :1 ny d i ffe rences be t\.veen basic blocks, both i n ch oosing particular class of heuristics, we can Jchievc the best dominator paths to sche d u l e and in sched u l i n g those schedule with paths l i m i ted to rive b locks or fe wer. d o m i n ator-path m e ta - blocks. Digital Te chn ical journal Vol . 10 No. 1 1 998 Evaluation of Dominator-path Scheduling measurements were made on an Alpha 2 1 1 64 server r u n n i n g at 2 5 0 megahertz with data cache sizes of 8 To measure the potential of DPS to generate more kilobytes, 96 kilobytes, and 4 megabytes. efficient sched u l es than local schedu l i n g for commer Looking at Table 2, we see that, in genera l , DPS cial superscalar architectures, we ran a small test suite i m proved the i n teger programs less than it i m proved of C programs on an Alpha 2 1 1 64 server. The Al pha the floati ng-poi nt programs. The range of improve server is a superscalar architecture capable of issuing ments for i nteger programs was from 0.7 percent for two i n teger and tvm floati ng-point i nstr u ctions each Dhrystone to 7 . 3 percen t each for 8- Queens and for cyc l e . Our compiler esti mates the effectiveness of a Sym bo!Table. S u m m i ng a l l the improve ments and sched u le by modeling the 2 1 1 64 as an LIW architec d ividing by eight (t he n u mber of integer programs) ture with all operation latencies known at comp i l e gives an "average" of 4.7 percent i m provement for the time. Of course th is mode l was used o n l y w i t h i n the i n teger programs. DPS improved some of the floating O u r resu l ts measured changes i n point programs even more significantly than the i n te compiler itself. 2 1 1 64 execution ti me ( m e asured w i t h the U N I X ger programs. The range of i mprovements for the six "time" command) req u i red for each progra m . floating-poi nt programs was from 3 . 7 percent for Dice Our test suite o f 1 4 C programs i ncludes 8 programs (a simu lation of rolli ng a pair of dice 10,000,000 times that use i nteger computation only and 6 programs that using a u n i form random n u m ber generator) to 1 7 .6 i nclude tloati ng-poi nt computation. We separated percent i mprovement fo r the finite difference pro those groups because we see dramati c differences i n gram. The average for the six floating-point programs DPS's pertormance when viewing i nteger and floating was 10.8 percent. This suggests, not surprisingly, that point programs. To choose dominator paths, we used the Alpha 2 1 1 64 provides more opportu n i ties for the combined he uristic recommended by Huber. '' Table 2 sum marizes the res u l ts of tests we con global schedu l i n g i mprovement when floati ng-point programs are being compiled. ducted to compare the execu tion times of programs Even with i n the six floati ng-point programs, how using DPS scheduling with those using local sched u l ever, we see a distinct bi- modal behavior in terms of ing only. T h e table l ists the programs used i n t h e test execution-ti me improvement. Three of the programs suite and the percent im provement in execution times range from 1 2 . 3 percent to 1 7 . 6 percent improve for DPS-sched u led p rograms. The executi o n time ment, whereas three are below lO percent ( and two of those sign ificantly below lO percen t ) . A reason for this wide range is the use of global variables. Remember Table 2 Percent D PS Sched ul ing I m p rovements over Local Sched u l i n g of Programs Program Percent Execution Time I m p rovement tl1at DPS forbids the motion of global variable defi n i tions across block bo undaries. This is necessary to ensure correct program semantics. I t is hardly a coinci dence that both Dice and Whetstone i ncl ude on ly global floati ng-point variables, whereas Livermore's 8- Queens 7.3 floating-point variables are mixed about hal f local SymboiTa b l e 7.3 a nd h a l f global, and the three better performers use Bubb leSort 5.0 a lmost no global variables. Thus we conclude that, for Nsieve 6. 1 floating-point programs with few global variables, we Hea psort 6.0 K i l lcache 2.6 TSP 2.4 D h rystone 0.7 C integer average 4.7 D ice 3.7 Whetstone 5.4 can expect i mprovements of roughly 1 2 to 1 5 percent i n execution time. Inclusion of global variables and exclusi o n of fl oati ng-point values wi l l , however, decrease DPS's abi lity to improve execu tion time tor the Alpha 2 1 1 64. Related Work As we have discussed , local instruction sched u l i n g can Matrix M u ltiply 1 6. 2 find paral lelism wi th i n a basic block but cannot exploit Gauss 1 2. 3 parallelism between basic blocks. Several global sched F i n ite Difference 1 7. 6 u l i n g techniques are availabl e , however, that extract Livermore C floati ng-point average Overall average 9.3 1 0.8 7.3 paral lelism from a program by moving operations across block bou ndaries and subsequently inserting compensation copies to maintai n program semantics. Trace schedu l ing1 was the first of these techniq ues to be defined. As previously mentioned, trace sched u l i n g Digital Tech n i cal Journal Vol. 10 No. I 1998 67 req u i res compensation copies. Other "early" global Conclusions sched u l ing algorithms that req u i re compenstation copies include Nicolau's percolation scheduling 1 "· 1 7 It is commonly accepted t h at to exploit the perfor and Gupta's region scheduling 1 8 A recent and qu ite mance benefits of iLP, global i nstruction schedul i n g is popular extension of trace scheduling is Hwu's SuperBlock scheduling. 19 2 0 In add ition to these more requi re d . Several varieties of global instruction sched u l i n g exist , most req uiring compensation copies to general, global schedu l i n g methods, signi ficant resu lts ensure proper program semantics when operations have been obtained by software pipel i n i ng, which is a cross block boundaries during i nstruction scheduling. tech n i q u e that overlaps i terations of loops to exploit Although such global scheduling with compensation avai lable ILP. Al lan ct a l .2 1 provide a good s u mmary, copies may be an effective strategy for archi tectures and Rau22 provides an excellent tutorial on how modulo with large degrees of ILP, another approach seems scheduling, a pop u l ar software pipeli n i n g tec h n i q u e , reasonable for more limited architectures, such as c ur should b e i mplemented. Promising recent tech niques rently available su perscalar computers. have focused on defining a meta-environment, which This paper outli nes DPS, a global instruction sched i ncludes both global scheduling and software pipelin uling tec h n i q ue that docs not req uire compensation i n g . Moon and Ebcioglu23 present an aggressive tec h copies. Based on the fact that more than n i q u e that combines software pipdining and global i ntermediate statements can be moved upward at l east 25 percent of code motion (with copies) i nto a si ngle fra mework. one domi nator block in the control flow graph with Novak and Nicolau2' describe a sophisticated schedul out changing program semantics, DPS schedules paths i ng framework in which to place software pipe l i n i ng, in a function's domi nator tree as meta- blocks, making including alternatives to modulo scheduling. While use of an extended local instruction scheduler to provi d i ng a significant n u m ber of excel l e n t global schedu le dominator paths. scheduling altern atives, none of these tec h n i q ues pro Experimental evidence shows that D PS does i ndeed vides global sc heduling without the possibi l ity of code produce more efficient schedules than local schedu l expansion ( copy code ) as D PS does. ing for Com paq's Alpha To address the issue of producing schedules without operation copies, Bernstein2; -27 defined a techniqu e he larly tor floati ng-point programs that avoid the use of calls global instruction scheduling ( G PS) that aJ.lows siderable fl exibility in p l acement of code is possible movement of instructions beyond block bou ndaries even when com pensation copies are not a l l owed . based upon the program dependence graph ( PDG) .28 In Al though more research i s req u i red t o look i n to a test suite of four programs run on I BM's RS/6000, 2 1 1 64 server syste m , particu global variables. This work has demonstrated that con possible uses for this flexibility, the global i nstruction Bernstein's method showed improvement of rough ly schedu l i ng method 7 percent over local scheduling for two ofthe programs, promise for lLP architectures. described here ( D PS ) shows with no significant clifference for the others. Comparing DPS to Bernste i n ' s method, we see that Acknowledgments both allow for i n terb lock motion without copies. Bernstein also al lows for interblock movement req uir ing dupl icates that D PS does not. Interestingly, This research was supported i n part by an Exte rn a l Research Program grant from Digi ta l Equ ipment Bernstein's later work27 does not make use of th is abi l Corporation and by the National Scie nce Fou ndation i ty to al low motion that req u i res duplication of opera under grant CCR-9308348. tions, suggesting that, to date , he has not found such motion advisable for the RS/6000 architecture to References which his techniq ues have been applied . Bernstei n a l l ows operation movement in only one clirection, l. tions nator block to a postdominator. This added flexibility is Computers, C- 1 9 ( 1 0 ) ( O ctober 1 97 0 ) : 2. A . Nicolau a n d J . Fisher, "Measu ring t h e Parallelism tions. Bernstein uses a separate set of heuristics to move Available tor Very Long I nstruction Word Architec operations i n the PDG and then uses a subsequent local t u res," scheduling pass to order operations v.rithin each block. (November 1 9 8 4 ) : 968-976. Fisheil argues that incorporati ng movement of opera tions with the scheduling p h ase itself provides better schedu l i n g than divicling the i nterblock motion and schedul i n g phases. Based on that criterion alone, DPS has some advantages over Bernestein's method. 68 on 8 8 9-8 9 5 . an advantage to DPS. Of possibly greater significance, DPS uses the local i nstruction scheduler to place opera G. Tjaden and M . Flynn , " D etection of Parallel Exe cut ion of I ndependent I nstructions," IEEE Tra nsac whereas DPS a llows operations to move from a domi Digital Technical Journal Vol . 1 0 No. 1 1 998 IEEE Transactions on Co mputers, 33( l l ) 3 . J . Fisher, "Trace Sche d u l i n g : A Tec hn i q u e tor Global Microcode Compaction," IEEE Transactions on Com puters, C-30( 7 ) ( J u l y 1 9 8 1 ) 478-490 4. 5. 6. J. El lis, Bulldog A Comp iler for VJJW A rchitectures 1 8 . R. Gupta and Approach fo r Detecting and Yale U niversity ( 1 9 84 ). lelism," IEEE Transactions on Software Eng ineering, Production of Optimal Horizontal Microcode , " P h . D . 1 9 . S . Mahlke, W. Chen, W. - M . H wu , B . Rao, a n d M . thesis, U n i versity of M i c higan, A n n Arbor, M i c h . Schlansker, "Sentinel Scheduling for VL IW and Super ( 1 97 6 ) . scalar Processors," Proceedings of the 5th Interna tional D . Lands kov, S . Davidson, B . S h river, a n d P. Mallett, Computing Surveys, 1 2( 3) (September 20. 9. B locks," Proceedings of the 29th International Sym France ( December 1 996 ) : 58-67 . 2 1 . V. Al lan, R. Jones, R. Lee , and S . Al lan, "Software D . Padua, D . Kuck, and D. Lawri e, " H igh- Speed Mul Pipel i n ing," A CJ\11 Computing Su rveys, 2 7 ( 3 ) (Septem tiprocessors and Compilation Techniques," IEEE Trans ber 1 995 ). 763-776. 22. and Tools ( Reading, MA: ( M I CR0·27), San Jose, Calif ( December 1994 ) : 63-74. Addison 2 3 . S . - M . Moon and K. Ebcioglu , " Parallelizi n g N o n n u H . Rei f and R . Tarjan, "Symbolic Program Analysis i n merical Code with Selective Sched u l i n g and Software Almost- Linear Time," Journal of Compuling, 1 1 ( 1 ) Pipel i n i ng," ( February 1 9 8 1 ) : 8 1-9 3 . Languages and s:ystems, 1 8 ( 6 ) ( N ovember 1 99 7 ) : P h . D . thesis, Computer Scie n c e Department, Col on Eng ineering· Special Issue on Microprogram m ing, 1 4( 5 ) ( May 1 998 ) : 5 7 5-5 8 3 . B . Huber, "Path-Selection Heu tistics tor Dominator Technological University ( 1 995 ) . Sched uling to Consider Execution F requency," Pro ceedings of the 28th Ha waii International Conference on System Sciences (J anuary 1 996 ) . 1 6. A . Nicol a u , " Percolation Sche du ling: A Parallel Com pilation Tech n i q u e , " Te ch nical Report TR8 5 - 678 , Department of Computer Science, C01·neU U niversity ( May 1 9 8 5 ) . icol a u , " A Deve lopment Envi ron· ment for Horizonta l M icrocode," !Etc Transactions 5 84-594 . 25. D. Bernstein and M. Rode h , "Global Instruction the ACM 51GPLAN 1 991 Conference on Programming Language Desig n and Implementat ion, Toronto, Canada ( J u ne 1 99 1 ) : 2 4 1-2 5 5 . Dupl ication: An Assist lor Global Instruction Sched ul ing," Proceedings of the 24th International Symposium on Microarchitecture ( M I C R0 - 2 4 ) , Albuquerq u e , N . Mex. ( Nove mber 1 99 1 ) : 1 03-1 1 3 . 2 7 . D . Bernste i n , D . Cohen, Y. Lavon, and V. Rai nish, " Performance Evaluation of In struction Sched u l i n g o n the I B M RS/6 000," Proceedings of the 25tb Inter national Symposium on Microarchitecture ( M I CR0- 25 ), Portland, Oreg. ( December 1 992 ) : 226-2 3 5 . 1 7 . A . Ai ken a n d A . Software Techniques 2 6 . D . Bernstein , D . Cohen, and H . Krawczyk, "Code M . Bourke, P . Sweany, and S . Beaty, " E xtending List on Compiler Sched u l i n g tor Su perscalar Machi nes," Proceedings of Path Sched u l ing," Master's thesis, Department of Com 15. Parallel Architectures and ( PACT 96), Boston, Mass. (October 1996) 87-96. Microarchitectures," IEEE Tra nsaclions on Software Michigan Progra mm ing Directed Approach to Exploiting Insu·uction-Level Paral R . M uel ler, M . D u d a , P. Sweany, and J . Walicki, Science, on lelism," Proceedings qftbe 1996 International Conference " Horizon: A Retargetable Compiler lor Horizontal puter Transactions 2 4 . S . Novak and A . Nicolau, " A n Efficient Global Resource orado State U niversity ( 1 992 ) . 1 4. A CM 8 5 3-89 8 . 1 2 . P. Sweany, " lnterblock Code Motion without Copies," 13. Proceedings of tbe 2 7tb International Symposium on Microarchitecture A. Aho, R. Sethi, and } . Ullman, Compilers. Principles, Techniques, B. Rau, "I terative Modulo Scheduling: An Algorithm for Software Pipelining Loops," Wesley, 1 9 8 6 ) . 11. C . Chekuri, R. Johnson, R. Motwani, B. N a tarajan, B. posium on Microarchitect ure ( M ICR0 - 2 9 ) , Paris, Compuler and job-Shop Scheduling actions on Computers, C-29( 9 ) (September 1 98 0 ) : 10. Support .for Level - Pa ra l l e l Sched u ling with Application to Super Practice & Experience, 2 8 ( 3 ) ( March 1 99 8 ) : 249-2 84. 77JeOiy ( New York : John Wiley & Sons, 1 9 76). Arcb itectu.rat Rau, and M. Schlansker, " Profile- Driven I nstruction Retargetable Local Instruction Schedu ler," So.ftware Colfmao, on Boston, Mass. ( October 1 9 9 2 ) : 2 3 8-247. V. Al l a n , S . Beaty, B. Su, and P. Sweany, " B u i l d i n g a E. Conference Programm ing Languages and Opera ting Systems, 1 9 80): 26 1-294. 8. Redistributing Paral 1 6 ( 4 ) (April l990 ) : 42 1-43 1 . D. DeWitt, "A Machine- Independent Approach to the " Local Microcode Compaction Tec hniques," A CM 7. M. Solh, " Region Scheduling: An ( Cambridge, MA: M I T Press, 1 9 8 5 ) , Ph D. thesis, Engineering, 1 4( 5 ) ( May 1 988): 2 8 . J . Ferrante, K . Ottenste i n , and J . Warren, "The Pro gram Dependence Graph and Irs Use in Optimiza tion," A CM Transactions on Programming Languages and Systems, 9 ( 3 ) ( J u ly 1 98 7 ) : 3 1 9-349. D i g i tal Technical Journal Vol . 10 No. l 1 998 69 Biographies Brett L. Huber Philip H. Sweany Associate Professor Phil Sweanv has been a member of Michigan Technological Unive'rsity's Computer Science faculty since 1 99 1 . He has been investigati ng compiler techniques for instruction-level parallel ( I LP) architectures, co-authoring several papers on instruction schedul i ng, reg ister assignment, and the i nteraction between these two optimizations. Phil has been the primary designer and implementer of Rocket, a highly optimizing compiler that is easily retargeta ble for a wide range ofiLP architectures. His research has been significantly assisted by grants from Digital Equi pment Corporation and the National Science Foundation. Phil received a B .S . in computer science in 1 9 83 from Washington State University, and M . S . and Ph . D . degrees i n computer science from Colorado State University in 1 986 and 1 99 2 , respectively. Steven M. Carr Steve Carr is an assistant professor in the Department of Com puter Science at J\tli chigan Technological University. The focus of his research at the un iversity is memory hierarchy management and optimization of instruction level parallel archi tectures. Steve's research has been sup ported by both the National Science Foundation and DigitaJ Equipment Corporation. He received a B . S . i n computer science trom Nliduga.n Technological Uruversity in 1 9 8 7 and M.S. and P h . D . degrees fi·om Rice University in 1 9 90 and 1 993, respectively. Steve is a member o.fACM and an I EEE Computer Society Affi liate. 70 Digiral Technical Journal Vo L 10 No. I 1 998 Raised in Hope, lv1ichigan, Brett earned B . S . and M.S. degrees in computer science at M ichigan Technological University i n Mich igan's h istoric Keweenaw Peninsu la. He is an engineer in the Software Development Systems group at Texas I nsrruments, I n c . , and is currently developing an optimizing com piler for the TMS320C6x fa milv ofVLIVV digitaJ signal processors. Brett is a member oftl� e ACM and an IEEE Computer Society Affiliate . I Mary W. Hall Jetmifer M. Anderson Maximizing Multiprocessor Performance with Sarnart P. Amarasinghe Briart R. Murphy Shih-Wei Liao Edouard Bugnion Monica S . Lam the S U IF Compiler Parallel izing compi lers for m u ltiprocessors face The affordability of shared memory mu lti processors many h u rdles. However, S U I F's robust ana lysis offers the potential of supercomputer-class performance and memory optimization tech n i q ues enabled speed u ps on three fourths of the NAS and SPECfp95 benchmark programs. to the general public. Typical ly used in a m u l tiprogram ming mode, these machi n es increase throu ghput by r u n n i n g several independent applications in paral l e l . B u t m u l tiple processors can also work together to speed up single applications. This req uires that ordinary sequential programs be rewritten to take advantage of the extra processors. ' 4 Automatic paral l e lization with a comp i l er otfers a way to do this. Parall e lizing com pilers face more difficult challenges from m u l tiprocessors than from vector machines, which were their initial target. Using a vector architecwre eftec· tively i nvolves paral le li zi ng repeated a.tithmetic opera tions on large data su-eams-for exam p l e the i nnermost , loops in array-oriented programs. On a mul tiprocessor, however, this approach typ i cally does not provide suffi cient granu l arity of paral lelism: Not enough work i s performed i n parallel t o overcome processor synch ronization and communication overhead . To use a multiprocessor effectively, the compiler must exploit coarse-gra i n paral lelism, locating large computations that can execute independently in parallel . Locating para l l e l ism i s j ust the fi rst step i n prod uc· i n g efficient m u l ti processor cod e . Achievi ng h igh per formance also req u i res e ffective use of the memory hierarchy, and multjprocessor systems have more com plex memory hierarch ies than typical vector mac h i nes: They conta i n not only shared memory but also multi ple levels of cache memory. These added challenges often limited tl1e effectiveness of early paralJel izing compilers for m u l tiprocessors, so programmers developed their appl i cations fi·om scratch, without assistance from tools. But explicitly managing an application's paral lelism and memory use requires a great deal of programming knowledge, and tl1e work is tedious and error-prone. Moreover, the resulting programs are © 1 996 IEEE. Re p r i nt ed , with permission, ti·o m CiJJIIjm/eJ; December 1 996, pages 8 4 - 8 9 . This p3pa has been m od i tied for publication h e re with the addition of the section The Status :md Fu tu re of S l " l F optimized for only a specific machine. Thus, the effort required to develop efficient parallel programs restricts the user base for m u l tiprocessors. This article describes automatic parall e l i zation tech n iques in the SU I F (Stanford U niversity I n termed iate Digital Tc·chnical Journ;ll Vol. 10 No. I 1 99 8 71 Form a t ) compiler that res u l t i n good m u l tiprocessor Moreover, it recognizes c o m m u tative operations on pertormance fo r array - based n u m erical progra m s . vVe sections o f a n array and tra ns forms th em i n to parallel provide SUIF performance measurements for the com red u ctions. The red u c tion a n a l ysis is powe r fu l enough pl ete NAS and SPECfP95 benchmark suites. Overall , the to recogn i ze co m m u tative u p d ates of even i n d i rectly results tor these scientific programs are promising. The accessed array l ocations, a l lowing para l le lization of compiler yields speedups on three fo u rths of the pro sparse computations. grams and has obtained the highest ever pcrronnancc on All these analyses are for m u lated i n terms of i nteger the S PECfP95 bench m ark, indicating that the com piler progra m m i n g p ro b l e m s o n systems of l i near i n eq u a l i can also achieve e fficient abso l u te performance. ties that represent t h e data accesse d . These i neq ualities are derived from loop bounds and array access fu nc Finding Coarse-grain Parallelism tions. I m pl e m e n t i n g opti m i z ations to speed u p com mon cases reduc<::s the compilation ti me. Mu ltiprocessors work best when the in dividu,l l proces sors have large u n i ts of in dependent co m pu tation , b u t lnterprocedural Analysis Framework it is n o t easy t o find such coarse-grain para l lel ism . First All the ana lyses arc i m p l emented using a u n i form the compiler mu st find avai lable paralleli sm across pro i n terprocedu ral a n a lysis fra mework, which helps ma n ced ure bou ndaries. F u rthermore, the original compu a g e the software engin eering complexity. T h e fra m e tations may not be paral l e l i zable as given and may first work uses i n terprocedural d a ta fl ow a n alysis,• which i s require some transtonn ations. For example, experience m o r e efficient tlun the m o r e common tec h n i q ue o f i n para l l e l i z i n g by h a n d su ggests that we must often i n l i ne s u bstitutio n . ' I n l i n e substitu tion repl aces each replace global arrays with private versions on d i ffe rent proce d u re cal l with J copy o f the cal led proced ure, processors. In other cases, the com p u tation may the n a n alyzes the expanded code in t h e usual i ntrapro need to be restructured-for e x a m p l e , we may have to cedura l m a n ner. I n l i n e subs ti tu t ion is not practical for re place a sequen tial accumu lation wi th J p:tral lel reduc large progra ms, because it can m a ke the program too tion operati o n . large to ana lyze . I t takes a l arge suite of robust a nalysis tec h n i q ues to O u r tec h n i q u e 1 : 11alyzes only a s i n gle copy of each successfu l l y locate coarse -gra i n p::tra l l e l i sm . Gen eral procedure, captu ri ng i rs side efrects in and u n i r(xm fra meworks he lped us ma nage the com fu nction i s then a p p l ied a t each cal l site to produce plexity i nvolved i n b u i l d i n g such a system i n to S U I F . precise results. When d i fferent cal l i n g contexts make it a fu n ction . This We auto m ated t h e a n a l ysis to privatize arrays a n d to necessary, the algorithm sel ective ly cl ones a procedure recognize red u c tions to both sca lar and array vari a b l es . so that code can be analy zed and poss i b l y paral le l i zed O u r com pile r's analysis tec h n i q u es a l l operate se a m u nder d i ffe rent c a l l i n g contexts ( as when d i ffe re n t less l y K : ross procedure bound aries. consta n t values J r c passed to the s a m e fo rmal para m e ter ) . I n this w a y the fu l l advantages of i n l i n i n g a r e achieved without e x p a n d i n g the c o d e i n d isc ri mina te ly. Scalar Analyses An ini tial phase analyzes scalar variables i n the programs. I t uses tec h n iq ues such as data dependence analysis, In Fi g u re 1 the boxes represe n t procedure bodies, and t h e l i nes connecting them represent proc e d u re scalar privatization analysis, and reduction recognition calls. The m::tin com putation is a series o f tour loops to to detect paral lel ism among operations with scal ar· vari com p u te three - d i mensional fast Fourier transr(mns. ables. It also derives symbolic information on these scalar Using i nterproced ural scalar and ar ray ana lyses, tile variables that is useful in the array analysis phase. S u c h S U [ f compiler d etermines that these l oops are para l i n formation i n cludes constant propagation, induc tion lel i z a b l e . Each loop contai ns m o r e than 5 0 0 li nes of vari a b l e recognition and elimi nation, recognition of code sp a n n i n g up to n i ne proc edures with up to loop-i nvariant computations, procedure calls. If this program had been fu l l y i n l i ned , and sym bolic relation 42 the loops pres<::n t ed to the compiler for a n a lysis would propagation .'"' h ave each conta i ned more than 86 ,000 l i nes of cod e . Array Analyses An :trray a n a l ysis p h ase uses a u n i fied mathe matical Memory Optimization tl-amework based on linear algebra a n d i nteger l i near 72 program m i n g . ' The a n a l ysis appl ies the basic data Numerical appl ications on high-performance micro dependence test to d etermine i f accesses to an array p rocessors can rerer to the same locati on. To supp ort array priva more levels of cache to bridge the gap between proces tization, it a l so finds array data � ow i n formation that sor and memory speeds, a processor may still waste h a l f are often memory bou n d . Even with one or determ i nes whether array elements used i n a n i teration its t i m e stalled on memory accesses because it ITequently rd cr to the val ues produced i n a p revious i tera tion . references an item not i n the cache (a cache miss ) . This Digira1 Technical Journal Vol . 1 0 No. l 1 998 P1ifi Figure 1 The compiler discovers parallelism through intcrprocedural array analysis. Each of the four parallelized loops at left consists of more than 500 lines of code spanning up to nine procedures ( boxes) with up to 42 procedure calls ( l i nes ) . memory bottleneck i s fi.1 rther exacerbated on multi processors by tl1eir greater need for memory traffic, resulting in more contention on tl1e memory bus. An effective compiler m ust address four issues that affect cache behavior: • • Commu nication : Processors in a multiprocessor system com mu nicate through accesses to the same memory location . Coherent caches typically keep tl1e data consistent by causing accesses to data writ ten by another processor to miss in the cache. Such misses are cal led true sharing misses. Limited capacity: Nu meric applications tend to have large working sets, which typically exceed cache capacity. These applications often stream through large amounts of data before reusing any of it, resulting in poor temporal locality and numerous capacity misses. • Limited associativity: Caches typically have a small set associativity; that is, each memory location can map to only one or just a few locations i n the cache. Conflict misses-when an item is discarded and later retrieved--can occur even when the applica tion 's working set is smaller than the cache, i f the data are mapped to the same cache locations. • Large line size : Data in a cache are transferred i n fixed-size units called cache lines. Applications that do not use all the data in a cache line i ncur more misses and are said to have poor spatial locality. On a m u ltiprocessor, large cache J ines can also lead to cache misses when different processors use differ- ent parts of the same cache line. Such m isses are called false sharing misses. The compiler tries to eliminate as many cache misses as possible, ilien minimize tl1e i mpact of any iliat remain by • ensuring that processors reuse the same data as many times as possi ble and • making the data accessed by each processor con tiguous in tl1e shared address space . Teclmiques for addressing each of t11ese subproblems are discussed below. Final ly, to tolerate tl1e latency of remaining cache misses, the compiler uses compiler insettedprefetching to move data into the cache before it is needed. Improving Processor Data Reuse The compiler reorgani zes tl1e computation so mat each processor reuses data to the greatest possible extent -'-� This reduces tl1e worki ng set o n each processor, thereby minimizing capacity misses. It also reduces i nterprocessor communication and thus minimizes true sharing misses. To achieve optimal reuse, the com piler uses affine pm1itioning. This technique analyzes reference patterns in the program to derive an aftine mapping (linear transformation plus an offset) of the computation of the data to tl1e processors. The affi ne mappings are chosen to maximize a processor's reuse of data wh.ile maintaining sufficient parallelism to keep all processors busy. The compi ler also uses loop block i ng to reorder tl1e computation executed on a single processor so that data is reused in the cache. Digital Technical Journal Vol . 10 No. 1 1 99 8 73 Making Processor Data Contiguous The its kn o wl e d ge of the access patterns to d i rect the oper a l l ocation policy to m a ke e a c h processor's d a ta contiguous i n the physical add ress space. The operating sys te m uses th ese h i n ts to deter mine the virtua l - to-p hysi cal p:�ge m:�pping at p:�gc a l location t i me compiler tries to arrange the data to m a ke a processor's accesses contiguous in the share d space . This i m proves spatial loc a l i ty ating system's page address red u c i n g while co n A ict m isses a n d fa lse shari n g . S U I F can ma nage d ata p lacement within a singl e array and ac ross m u l ti p l e arrays. . The d a ta - to- processor mappi ngs compute d b y the affi n e p a r titio ni ng ana lysis are used t o d e ter Experi mental Results mine the d a ta bei ng accessed by each proce ssor. Figu re 2 shows how the co m p i l e r s usc of data per a series of performance ev al uations to o f· S U I �'s ana lyses and opti m izations. We obtained measu reme nts on a D ig i t a l Alph a Serve r 8400 with eight 2 1 1 64 processors, each w i th two levels of on-chip cache and a 4 - M byte exter vVe ' mutation and data stri p - m i n i ng'" can make contiguous condu cted demonstrate the i m pact the data w i t h i n a single arra)' that is accessed by one processor. Data permuta tion i n terchanges the d i m e n sions of the arra y -fix e x a m p l e , transposi n g a !'NO d i m e nsional array. Data stri p - m i n i n g c h anges an array's d i mension a l i ty so that all data accessed by the same processor are in t he same plane of the array. To m a ke data K : ross m u l t i p le arrays accessed by the same processor contiguous, we use a tec hnique c a l led compiler-directed page colorinp,. ' ' T h e co mpiler uses nal cach e . B ecause speed ups are harder to obtain on machi nes with fJst processors, our usc of a state-of the-:lrt m a c h i n e makes the re su lts more m e a n i ngfu l and ap pl icable to fi.1ture systems. \Ve used two comp l e te standard bench mark suites to evaluate our compil er. W<:. present resu l ts for the y y X X y y y X X X P E R M UTATION STR I P-MINING Figure 2 Data transformations cJn make the dar,1 accessed by each processor contiguous i n the shared add ress space. I n the two examples above, the origi nal arrays arc two-dimensional; the axes are identified to show that elements along the ti rst nis arc contiguous. First the a Hine partitioning a nalysis d etermi n es which data elements arc accessed by the same processor (the shaded ele ments are accessed by the first processor. ) Second, data strip-mining turns the 2 0 Jrray i nto a 3D array, with the s haded eleme nts i n the same plane. Final ly, applying data permutation rotates the array, mJking data accessed by each processor contiguous. 74 Digital Technical journal Vo l . 10 No. l 1 9 98 10 programs in the SPECtp95 benchmark suite, which is techniq ues as well as techniq ues for locating coarse commonly used for benchmarking u niprocessors. We grain parallel loops-for example, array privatization also used the eight official benchmark programs fro m and reduction transformations, and ful l interproce the NAS paral lel-system benchmark suite, except for d u ral analysis of both scalar and array variables. embar; here we used a slightly modified version from Memory includes the coarse-grain techniq u es as wel l Applied Para l l e l Research. as the m u ltiprocessor memory optim izations we Figure 3 shows the S PECtp95 and NAS speedups, measured on up to eight processors on a 300-MHz AJphaServer. We calculated the speedups over the best described earlier. Figure 3 shows tl1at of tl1e 1 8 programs, 1 3 show good parallel speedup and can tlms take advantage of adclitionaJ sequential execution time from either officially reported processors. SUIF's coarse-grain techniques and memory resul ts or our own measurements. Note that mgrid and optimizations significantly affect tl1e performance of half applu appear in both benchmark suites (the program the programs. The swim and tomcat\' programs show sou rce and data set sizes differ slightly). superlinear speedups because the compiler eliminates To measure the effects of the different compiler techniq ues, we broke down the performance obtained almost al l cache misses and their 1 4 Mbyte working sets fit into the multiprocessor's aggregate cache. on eight processors into three components. In Figure For most of the programs that did not speed up, the 4 , baseline shows the speedu p obtained with paral compiler fou n d much of their computation to be par lelization using only intraprocedural data dependence al lelizable, but tl1e granularity is too fi ne to yield good analysis, scalar privatization, and scalar reduction m u l tiprocessor performance on machines with fast transtormations. Coarse grain includes the baseline processors. Only two applications, tpppp and buk, have 16 swim 15 14 13 12 11 tomcatv 10 a.. ::> Cl w w ll. (f) 9 8 / 8 7 / / / / / / 7 mgrid applu turb3d hydro2d 6 5 su2cor 4 ' em bar 6 appbt /'>...._____. a.. 5 ::> Cl w w 4 a.. (f) applu cgm appsp 3 3 2 2 / 0 , , , / / / 2 3 4 5 6 7 8 / / �--�----�---� 0 2 PROCESSORS ( a ) SPECfp95 3 4 5 6 7 buk fftpde 8 PROCESSORS ( b ) NAS Parallel B e nchmarks Figure 3 S U I F compi ler speedups over the best sequential time achieved on the ( a ) SPECfp95 and ( b ) NAS parallel benchmarks. Digital Tcdmical Journal Vol . 1 0 No. 1 1 998 75 14 12 ::::J Q._ 10 8 w Q._ [f) 6 [il 4 2 0 .?. "' u E 2 E -� (f) 0 u N ::0 (f) N e "0 "0 >;;: "0 E ·� Ci. :0 a. "' C') -e .=l "0 a. a. .e- a. a. ·u; "' Q) l[) > "' i: :0 a. a. "' ::0 Ci. a. "' (f) a. a. a. "' ::0 .D -"' E Ol u ro .D E Q) Q) "0 a. E ·� E "0 KEY: D D • MEMORY OPTIM IZATION COARSE-GRAIN PARALLELISM BASELINE Figu re 4 The speedup achieved on eight processors is broken down into three components to show how S U IF's memory opt.i mization and discovery of coarse-grain paralle l ism affected perform:mce . require that the software be general ly available. The ratios we obtained are nevertheless valid in assessing our compiler's performance . ) The geometric mean of the SPEC ratios improves over the u niprocessor execu tion by a factor of 3 with four processors and by a fac tor of 4 . 3 with eight processors. Our eight- processor ratio of 6 3 .9 represents a 50 percent improvement over the highest number reported to date . ' 2 no statically analyzable loop-level parallelism, so they are not amena ble to our techniques. Table 1 shows the times and SPEC ratios obtained on an eight-processor, 440-MHz Digital AlphaServer 8400, testifYing to our compiler's high absolute per formance. The SPEC ratios compare machine perfor mance with that of a reference machin e . ( These are not official SPEC ratings, which among other things Ta b l e 1 Abso l ute Performa nce for t h e S P E Cfp95 Bench m a rks Measured o n a 440-M H z D i g ital AlphaServer U s i n g One Processor, Four Processors, and Eight Processors Execution Time 1P 4P 8P 1P 4P 8P tomcatv 2 1 9. 1 30.3 1 8. 5 1 6. 9 1 22 . 1 200.0 swim 297.9 33.5 1 7.2 28.9 256.7 500.0 su2cor 1 55 . 0 44.9 3 1 .0 9.0 3 1 .2 45.2 hyd ro2d 249 .4 61.1 40.7 9.6 39.3 59.0 27.0 1 3. 5 59.5 92.6 Benchmark mgrid 1 85 . 3 42 . 0 applu 296 . 1 85.5 39.5 7.4 25.7 55.7 t u rb3d 267.7 7 3 .6 43.5 1 5. 3 55.7 94.3 a psi 1 37 . 5 1 4 1 .2 1 43.2 1 5. 3 1 4. 9 1 4.7 29.0 29.0 29.0 1 9.8 21 .1 20.4 1 5 .0 44.4 63.9 fpppp 33 1 . 6 331 .6 331 .6 waveS 1 5 1 .8 1 4 1 .9 1 47 .4 Geometric Mean 76 SPEC Ratio (sees) Digital Technical J ournal Vol. lO N o . 1 1 998 1 2 . K. Kennedy and U . Kremer, "Automatic Data Layout Acknowledgments This research was s upported in part by the Air Force Materiel Command and ARPA contracts F 3 0602-9 5 C-0098, DABT63 -9 5 -C-O l l 8, a n d DABT6 3 -94-C0054; a D igital Equipment Corporation grant; an NSF Young I nvestjgator Award ; an NSF CISE post doctoral fel lowship; and fe llowships from AT&T Bell Laboratories, DEC Western Research Laboratory, I ntel Corp. , and the National Science Foundation. Editors ' Note.· With the following section, the authors provide an update on the status of the SU!F compiler since the publication of their paper in Computer in December 1996. Addendum: The Status and Futu re of SUIF References l . J . M . Anderson, S . P. Amarasinghe, and M .S . Lam , " Data and Comp u tation Transformations for M u l ti processors," Proc. Fifth A CM S!GPlan Symp. Princi ples and Practice of Parallel Programming, ACM Press, New York, 1 9 9 5 , pp . 1 66-1 78. 2 . ]. M . Anderson and M .S . Lam , "Global Optimizations for Para l l el ism and Localiry on Scalable Paralle l Machines," Proc. SIGPia n '93 Conf Programming La nguage Desig n and Implementation, ACM Press, New York, 1 993, pp. 1 1 2- 1 2 5 . 3 . P. Banerjee e t a l . , "The Paradigm Compiler for Distributed -Memory JVI.ul ticompnters," Computer, Oct. 1 99 5 , pp. 37-47. 4 . W. B l u me e t a l . , "Effective Automatic Para l l e l ization with Polaris," Int i I Parallel Progra mming, May 1 99 5 . 5 . E . B ugnion e t a l . , "Compiler-Directed Page Coloring for Multiprocessors," Proc. Seventh In! ' I C011f A rchi tectural Support for Program m ing Languages and Operating Systems, tor High Performance Fortran," Proc. Supercomput ing '95. I E E E CS Press, Los Alamitos, Calif. , 1995 ( CD - ROM onl y ) . ACM Press, New York, 1 996, p p . 244-2 57. 6 . K. Cooper et a l . , "The ParaScope Parall e l Program m ing Environ ment," Proc. IEEE, Feb. 1 99 3 , p p . 244-26 3 . 7 . Standard Performance Eval uation Corp . , " D i gital Equipment Corporation AlphaServer 8400 5/440 SPEC CFP95 Results," SPEC Newsletter; Oct. 1 996. 8 . M . Haghighat and C . Polychronopolous, "Sym bolic Analysis for Parallelizing Compilers," A Cl\1 Trans. Pro gramming Languages and Systems, July 1 996, p p . 477-5 1 8 . 9 . .M.W. Hall et al . , " D etecting Coarse-Grain Parallelism Using a n lnterproce d u ral Paral l e lizing Compi ler," Proc. Supercomputing '95, I E E E CS Press, Los Alam i tos, Calif. , 1 99 5 ( CD - RO M onl y ) . 10. P. Havlak, lnterprocedural !Symbolic A nalysis, P h D thesis, Dept. of Computer Science, Rice U niv. , May 1 994. 1 1 . F. l rigoi n , .P. Jouvelot, and R . Triolet, "Semantical Interprocedura l Parallelization: An Overview of the P I PS Project," Proc. 1991 A C!J!! lnt'l Conf Supercom puting, ACM Press, New York, 1 99 1 , pp. 244-2 5 1 . Public Availability of SUIF-parallelized Benchmarks The SUIF-parallelized versions of the SPECfp9 5 benchmarks used for the experiments described in this paper have been released to the SPEC committee and are avail able to any license holders of SPEC ( see http:/jwww. specbench.org/osg/cpu95/par-research). This benchmark distribution contains the SUIF out put (C and FORTRAI'\1 code ) , along with the source code for the accompanying run-time libraries. We expect these benchmarks wil l be usefu l for two purposes: ( l ) for technology transfer, providjng insight i nto how the compiler transforms the applications to yield the reported results; and ( 2) for further experimentation , such as in architecture-simulation studies. The SUIF compiler system i tself is available from the SUIF web site at http ://www-su ifstanford .edu. This system includes only the standard parallelization analy ses that were used to obtain our basel in e results. New Parallelization Analyses in SUIF Overall, the results of automatic paraUelization reported in this paper are impressive; however, a few applica tions either do not speed up at all or achieve limited speedup at best. The question arises as to whether SUIF is exploiting al l the available parallel ism in these applications. Recently, an experiment to answer this question was performed i n which loops left unparal lelized by SUIF were instru mented witl1 ru n-time tests to determine whether opportunities for increasing the effectiveness of automatic parallelization remained in these programs . ' Run - time testing determined that eight of the programs from the NAS and SPEC95fp benchmarks had additional parallel loops, for a total of 69 additional parallelizable loops, which is less than 5% of the total number of loops in these programs. Of these 69 loops, the remaining parallelism had a signifi cant effect on coverage ( the percentage of the pro gram that is parallelizable) or granularity ( the size of the parallel regions) in only four of the programs: a psi, su2cor, waveS , and fftpde. We found that al most all the significant loops in these four programs could potentially be parallelized using a new approach that associates predicates with array data-flow values.2 Instead of producing conserv- Digital Technical Journal Vol . l O No. 1 1 99 8 77 ative results that hold tor all control-How paths and all possi ble program i nputs, predi cated array d a ta-flow analysis can derive optimistic results guarded by predi cates . Pred icated array data - fl ow anal ysis can lead to more dkctive automatic parallelization i n three ways: ( l ) It i mprove s compile-time ana.J ysis by r u l i n g out i n feasi ble con trol -flow paths. ( 2 ) It provid es a frame work for the compi ler to i n t ro du c e pred icates that, i f proven true, wou l d guar:mtee safety tor desirable data flow vaJ u es. ( 3) It enJ bles the compiler to derive low-cost run-time para l l e l i zation tests based on the predicates associated with desirJ ble data-flow values. SUIF and Compaq's GEM Compiler The G E M compiler system is the te chn ology Compaq has been using to build compiler products for a variety of languages and hardware/software platform s . ·1 Wi th i n C o m pa q , work bas been done to con nec t S U I F with t h e G EM c om pi l er. S U IF's i n termed iate re pre sentation was converted i n to GEM's i ntermediate rep rese n ta tion , s o that S U I F code can b e passed directly to G E M ' s optim i zi n g back e n d . T h is e l i m i na tes the Joss of i n fo r matio n su ftCred when S U I F code is trans l ated to C/FORTRAN source bdore i t is passed to GEM. I t also enables us to generate more efficient code for Alpha-microprocessor systems . SUlF componen t of the NCI project is the re s u l t of tl1e col l a boration among researchers in five universities ( Harvard University, Massachusetts I nstitute of Technology, Rice U niversity, Stanford U n i ve rsity, University of C a l i forn i a at San ta Barbara) and one i n dustrial partner, Portland Group I nc . Co m paq is a corporate sponsor of the p roj ect and is providing the FORTRAN fron t end . A revised version of the S U I F i n frastructure ( S U I F 2 .0 ) is being released a s part o f t h e S U I r: N C I project ( a prel i minary version of s u r r: 2 .0 is a va i l a ble at the S UIF we b site ) . The compl eted system wi ll be enhanced to support p a ra ll el i z ::� t i o n , in tc rp roc e d u ra l analysis, m e mory hierarchy o p t imiz a tio n s , obj ected oriented programming, sca lar optimizations, and m a chine-dependent opti mi z:nions. An overview of the S U I F NCI system is shown i n Figure A l . Sec vvww-suif.stanford .cd u/s u i f/NCI/su i f. h t m l for more i n formation about S U I F and the NCI project, includ i n g a complete list of opti m i zations ;md a sched u le . References 1 . B. So, S. Moo n , and M . Hal l , "Measuring rhc Eftecrivc ness of Automatic Parallcl ization in S U I !:'," of the 98, J u l y 1998. 2 . S . M o o n , J\11 . Ha l l , and B . M u r p h1·, " Predicated Arr:1y SUIF and the National Compiler Infrastructure The SUIF compiler system was recently chosen to be Data-Flow Amlysis for Ru n-Time Para l le l izati o n , " Pro part of the National Compiler I n frastrucnrre ( NCI) project fu n ded by t h e D e fense Advanced Research Projects Agency ( DARPA) and the National Science Fou ndation ( NS F ) . The goal of the project is to develop a com mon com pil e r p latform for researchers and to fac i l i tate tec h nolo gy transfer to i n dustry. The puting 98, July ceedings of the fntt!rncltiiJIIUI Umfim'IICe 1 99 2 ) : 1 2 1-1 36 . I N T E R P ROCEDURAL ANALY S I S PARALLELIZATION LOCALITY OPTIM IZATIONS OBJECT-OR I E NT E D OPTIM IZATIONS SCALAR OPTIMIZATIONS ._ AL P HA _ _ SCHEDULING R E G I STER ALLOCATION _.I L..l _ __-_ Figure A1 The S U ! F Comp i l e r l n trJstr u c w re Digital Tcc' h n icll journal Vul . 10 No. l 1998 Sl lj)('rcom Syste m , " Dig ilcd h·cb n ical follmal. \'O I . 4, n o . 4 (Speci:l l Issue, CIC++ (I BM) TARGET LANGUAGES 011 1998. 3 . D . B l ic kste i n c t a l . , "The G F M O p t i m i z i n g C o m p i l c t FRONT ENDS 78 Proceedin;:;s Jnterr/{./ffonal Confaence on SupercomjJIItinp, x_ s6 ---.l.� I _ _ _ _ _ C/FORTRAN Biog raph ies Mary W. Hall Mary Hall is jointly a research assistant professor and project leader at the University of Southern California, Department of Computer Science and at USC's Information Sciences Institute, where she has been since 1996. Her research interests focus on compiler support for high-performance computing, particularly interprocedural analysis and auto matic parallelization. She graduated magna cum laude with a B .A . in computer science and mathematical sciences in 1985 and received an M.S. and a Ph .D . in computer science in 1 989 and 1 99 1 , respectively, all from Rice University. Prior w joining USC/lSI, she was a visiting assistant pro fessor and senior research tdlow in the Department of Com p uter Science at Caltech . In earlier positions, she was a research scientist at Stanford University, working with the S U I F Compiler group, and i n the Center for Research on Parallel Computation at Rice University. Brian R. Murphy A doctoral canclidate in computer science at Stanford Uni versity, Brian Murphy is currently working on advanced pro gram analysis under SUIF as part of the National Compi le r Infrastructure Project. He received a B .S. in computer sci ence and engineering and an M .S. in electrical engineering and computer science from the Massachusetts Institute of Tec hno logy. His master's thesis work on program analysis was carried out with the Functional Languages group at the IBM Almaden Research Center. Brian was elected to me Tau Beta Pi and Eta Kappa Nu honor societies. Shih-Wei Liao Shih-Wei Liao is a doctoral candidate at the Stanford U niversity Computer Systems Laboratory. His research interests i nclude compiler algorithms and design, pro gramming environments, and computer architectures. He received a B .S . in computer science from National Taiwan University. in 1 9 9 1 and an M . S . in electrical engineering from Stanford U niversity in 1 994. Jennifer M. Anderson Jennite r Anderson is a research staff member at Compaq's Western Research Laboratory where she has worked on the Digital Continuous Profiling Infrastructure ( DCPI ) proj ect. Her research interests include compiler algorithms, programming l an gua ges and environments, profiling sys tems, and para l l e l and distributed systems software. She earned a B .S . i n intormation and computer science from the U niversity of California at I rvine and received M .S . a n d P h . D . degrees i n computer science from Stanford University. Edouard Bugnion Ed Bugnion holds a Diplom in engineering from the Swiss Federal Institute ofTechnology ( ETH ), Zurich ( 1 994) and an M .S . from Stanford University ( 1 996 ), where h e i s a doctoral candidate i n computer science. His research inter ests include operating systems, computer architecture, and machine simulati o n . From 1 996 to 1 997, Ed was also a research consultant to Compaq's Western Research Laboratory. He is the recipient of a National Science Foundation Graduate Research Fellowship. Saman P. Ama rasinghe Sa man Amarasinghe is an assistant professor of computer science and engineering at the Massachusetts I nstitute of Technology and a member oftl1e La boraw r y for Computer Science. His research interests include compilers and com puter architecture. H e received a B .S. in electrical engineer ing and computer science from Cornell University and M .S. and P h . D . degrees i n electrical engineering from Stanford University. Digital Technical Journal Vol . 10 No. I 1 99 8 79 Monica S. Lam Monica Lam is an associate professor i n the Computer Science Department at Stanford University. She leads the SUIF project, which i s aimed at developing a common infrastructure to support research i n compilers for advanced languages and architectures. Her research i n ter ests are compi lers and computer architecture. Monica earned a B . S . from the U niversity of British Colu mbia in 1 980 and a Ph . D . in compu ter science fi·om Carnegie Mellon University in 1987. She received the National Science Foundation Young I nvestigator award i n 80 Digital Technical Journal 1992. Vol. 1 0 No. 1 1 99 8 I Debugging Optimized Ronald F. Brender Jeffrey E. Nelson Mark E. Arsenault Code: Concepts and Implementation on DIGITAL Alpha Systems Effective user d ebugging of optimized code has been a topic of theoretical and practical inte rest in the software development comm u n ity for al most two decades, yet today the state of the a rt is sti l l highly u neven. We present a brief s u r Introduction In software development, it is common practice to debug a program that has been compi l ed with little or no optimization applie d . The generated code closely corresponds to the source and is readily described by a vey of the l iteratu re and cu rrent practice that simple and straightforward debugging symbol table . leads to the identification of three aspects of debugger can interpret and control execu tion of the debugging optimized code that seem to be code in a fashion close to the user's source - l evel view critical as well as tractable without extraordi nary efforts. These aspects are (1) split l i feti me support for variables whose allocation varies within a program com bi ned with defin ition A of the progra m . Sometimes, however, developers find it necessary or desirable to debug an optimized version of the pro gram . For instance, a bug-whether a compiler bug or incorrect source code-may only reveaJ itself when point reporting for cu rrency determi nation, optimization is appLied . I n other cases, the resource (2) stepping and setting breakpoints based on constraints may not aLlow the unoptimized form to be a semantic event characterization of program behavior, and (3) treatment of i n l i ned routine calls in a manner that ma kes i n l i n i n g largely transparent. We describe the real i zation of used because the code is too big and/or too slow. Or, the deve l oper may need to start anaJysis using the remains, such as a core file, of the failed program, whether or not this code has been optimized . Whatever the reason , debugging optimized code is harder than these capa b i l ities as part of Compaq's GEM debugging unoptimized code-much harder-because back-end com piler tech nology and the debug opti mization can greatly compLicate the relationship g i ng component of the Open VMS Alpha oper between the source program and the generated code. ati ng system. Zellweger1 introduced the terms expected behavior truthful behavior when referring to deb u gging optimized cod e . A debugger provides e xpected behav and ior if it provides the behavior a user wo uld experience when debugging a n u nopti m i zed version of a pro gra m . Since achieving that behavior is often not possi ble, a secondary goal is to provide at least truthful behavior, that is, to never lie to or mislead a user. In our experience, even tr uthfuL behavior can be chal lenging to achieve, but i t can be closely approached . This paper describes three i mprovements made to Compaq 's OpenVMS GEM back-end compiler system and to DEBUG , the debugging component of the OpenVMS Alpha operating syste m . These improve ments address 1 . Split life ti m e variables and cu rrency determination 2. Semantic events 3. Inlining Digital Technical Journal Vol . 1 0 No. 1 1998 81 Before presenting the details of this work, we dis generally must include all call sites and may corre cuss the alternative approaches to debugging optimized spond to most statement boundaries. His experience code that we considered, the state of the art, and the suggests, however, that even l i m i ting inspection points operating strategies we adopted. to statement boundaries severely limits almost all kinds of optimization . Alternative Approaches Holzle et al .8 describe techniq ues to dynamically Various approaches have been expl ored to i m p rove deoptimize part of a program ( replace optimized code the ability to d e b u g optimized cod e . They i n c l ude with i ts unoptimized equivalent) during debugging to the following: enable a debugger to perform requested actions. They make the tec h niq ue more tractable, in part by delaying • Enhance debugger analysis • Li mit optimization • Lim i t debugging to preplan ned locations m i zation between i n terruption points is unrestricted . • Dynamically deoptimize as needed However, even this choice of interruption points • Exploit an associated program database asynchronous events to well-defined interruption points, generally backward branches and cal l s . O pti severely l i mits most code motion and many other global optimizations. We touch on these approaches i n turn. Pollock and others9 10 use a different kind of deopti In probably the oldest theoretical analysis that m i zation , which might be called preplanned, incre supports debugging opti m i zed code, H ennessyl stu d mental deopti m i zation . D uring a de bugging session, i e s whether the value displayed for a variable is current, any debugging req u ests that cannot be honored that is, the expected value for that variable at a given point in the program. The value displayed might not because of o p ti m i zation effects are remem bered so that a subsequent compilation can create an exe be c u rrent because, for example, assignment of a later cuta ble that can honor these requests . This scheme is val u e has been moved forward or the relevant assign supported by an incremental opti mizer that uses a pro ment has been delayed or omitte d . Hennessy postu gram database to provide rapid and smooth forward lates that a flow graph description of a program is i nformation flmv to s u bseq uent debugging sessions. comm unicated to the debugger, which then solves Feiler' ' uses a program database to achieve the bene certain flow analysis equations in response to debug fits of interactive debugging while applying as m uch comm ands needed . static compilation technology as possible. He describes Copperman' takes a similar though m uch more gen to determine currency as techniques for m aintaining consistency between the eral approach . Conversely, commercial i m p lementa primary tree-based representation and a derivative tions have favored more complete preprocessing of compiled form of the program in the face of both information in the compiler to enable simpler debug debugging actions and program modifications on-the ger mechanisms.H fly. While he appears to demonstrate that more is possi If optimization is the "problem," then one approach ble than might be expected, su bstantial limitations still to solving the problem is to limit opti m i zation to only exist on debugging capability, optimization, or both. those kinds that are actually supported i n an avail a ble debugger. Zurawski7 develops the notion of a recovery A comprehensive introduction and overview to these and other approaches can be found in Copperman3 and function that matches each kind of opti m ization . As an Adl-Tabatabi . " opti m i zation is applied during compilation, the com graphy on Debugging Optimized Code" is available In addition, "An Annotated B iblio pensating recovery function is also created and made separately on the avai lable for later use by a debugger. I f such a recovery http://wvvw. digital.com/info/DTJ. This bibliography Dl:(!,ital Tecl:mical.fourna! web site at function cannot be created, then the optimization is cites and summarizes tbe entire literature on debugging omitted. Unfortunately, code-motion-related opti mi optimized code as best we know it. zations generally lack recovery functions and so must be foregone . Taking this approach to the extrem e State of the Art converges with traditional practice, which i s simply to When we began our work in early disable all opti m i zation and debug a completely unop the level of support for debugging optimi zed code we assessed that was available with competitive compilers. Because timized program . l f fu l l debugger functionality need only b e provided we have not updated this assessment, it is not appro at some locations, then some debugger capabilities can priate for us to report the results here i n detail . We do be provided more easily. Zurawski7 also employed this however s u m marize the methodology used and the idea to make it easier to construct appropriate recov mai n results, which we believe remain generally valid . ery fu nctions. This approach b uilds on a language dependent 82 1 994, concept Digital Technical Journal of inspection points, which Vol . 1 0 No. 1 1 998 VIe created a series of example programs that pro vide opportunities for optimization of a particular kind or of related ki nds, and which could lead a traditional debugger to deviate from expected behavior. We com p iled and executed these programs under the control of each system's debugger and recorded how the sys tem hand led the various kinds of opti m i zation. The range of observed behaviors was diverse . At one extreme were compilers that automati cally disable all opti mization i f a debugging symbol table is requested (or, equivalently for our purposes, give an error i f both optimization and a debugging symbol table are requeste d ) . For these compilers, the whole exercise becomes moot; that is, atte mpting to debug opti mized code i s not allowed . Some compiler/debugger combinations appeared to usefully support some of our test cases, although none handled all of them correctly. In particular, none seemed able to show a traceback of su brouti ne cal ls that compensated for in lining of routine calls and all seemed to produce a Jot of jitter when stepping by l ine on systems where code is highly scheduled . The worst example that we fou nd al lowed comp i la tion using optimization but produced a debugging symbol table that did not reflect the results of that opti mization . For example, local variables were described as allocated on the stack even though the generated code clearly used registers for these variables and never accessed any stack locations. At debug ti me, a request to exami ne such a variable resulted in the ctisplay of the irrelevant and never-accessed stack locations. The bottom line fi·om this analysis was very clear: the state of the art for support of debugging opti mized code was general ly q uite poor. D I G ITAL's debuggers, including OpenVMS DEBUG, were not unusual in this regard . The analysis did indicate some good examples, though. B oth the CONVEX CXdb4·" and the HP 9000 DOC6 systems provide many val u able capabil ities. Biases and Goals Early i n our work, we adopted the fol lowing strategies: • Do not limit or compromise optimization in any way. • Stay within the t!·amework of the traditional edit compile-link-debug cycle . • Keep the burden of analysis within the compiler. The prime directive for Compaq 's GEM- based compilers is to achieve the h ighest possible perfor mance from the Alpha architecture and chi p tech nol ogy. Any improvements i n de bugging such optimized code shou ld be usefu l in the £1ce of the best that a compiler has to offer. Conversely, i f a programmer has the luxury of preparing a less optimi zed version for debuggin g purposes, there is little or no reason for that version to be anything other than completely unop timized. There seems to be no particular benefit to creating a special i ntermed iate level of combined debugger/optimization support. Pragmatical ly, we did not have the ti me or staffi ng to develop a new opti m i zation framework, for exam ple, based on some kind of program database. Nor were we interested i n i ntruding into those parts of the GEM compiler that performed optimization to create more complicated options and variations, which might be needed for dynamic deoptim ization or recovery fu nction creation . Finally, i t seemed sensible to perform most analysis activities within the compiler, where the most complete information about the program is already available. It i s conceivable that passing additional information from tl1e compiler to the debugger using the object file debugging symbol table might eventually tip the bal ance toward performing more analysis in the debugger proper. The available size data ( p resented later in this paper in Table 3) do not incticate thi s . We identified three areas i n which w e fe lt e n hanced capabil i ties would significantly i mprove support for debugging optimized code. These areas are l. The handling of split lifetime variables and currency determination 2. The process of stepping though the program 3 . The handling of procedure inlining I n the fol l owing sections we present the capabil ities we developed i n each of these areas together with i nsight i nto the implementation techniques employed. First, we review the GEM a nd OpenVMS DEBUG framework i n which we worked. The next three sec tions address the new capabilities in turn. The last major section explores the resource costs (compile time size and performance, and object and image sizes) needed to realize these capabil i ties. Starting Framework Compaq's GEM compiler system and the OpenVMS DEBUG component of the OpenVMS operati ng system provide the framework for our work. A brief description of each follows. GEM The GEM compiler system 1 3 is tl1e technology Compaq is using to build state-of- the-art compiler products for a variety of languages and hardware and software platforms. The GEM system supports a range of languages (C, C++, FORTRAN including HPF, Pascal, Ada, COBOL, B LISS, and others) and has been successfu l ly retargeted and rehosted for the Alpha, MIPS, and Intel IA- 3 2 architectures and tor the Digital Technical Journal Vol . 10 No. l 1998 83 OpenVJV!S , D I G I TAL U N I X , Win dows NT, a n d • Windows 9 5 operati ng systems. The major components of a GEM compi ler are the fron t end, the o p ti m i zer, the code ge nerator, the fi n al code stream opti m i zer, a n d the compi l e r she l l . • i n termedi ate l a n guage ( I L ) graphs and sym bol tables. Front ends for all source langu ages translate to the same common representati o n . • The opti m i zer transforms t h e I L generated by the front end i nto a s e m a n tically eq ui val e n t fo rm that wi l l exec ute faster on the target m a c h i n e . A sign i fi cant te c h n i ca l achievement i s that a si ngle opti m i zer is used ror all la nguages and target p l atforms. • The code generator translates the IL i n to a l i st of code cel l s , each of which represents one machin e in struction for th e target h ardware . Virtual l y al l the target m achine instruction-specific code is e ncapsu l ated i n the code ge nerator. • The fi n a l phase pertorms patte rn - based peephole optimi zations fol l o wed by i nstru ction sc hed u l i n g . • T h e shel l i s a porta ble i n terface to t h e external envi ron ment i n which the compi l e r i s used. It provides common compiler fu nctions such as l isti ng ge nera tors, object fi l e emitters, and command line proces sors i n a fo rm t h a t a l l ows the other components to remain independent of the operating syste m . The bu l k of the GEM im pleme ntation work described i n this paper occ u rs at the bound ary betwe e n the fi n a l p h ase and t h e object fi l e output portion of the shel l . A new debugging opti m i zed code a n alysis phase exa m i nes t h e generated c o d e stream representation of the progra m , together wit h the com piler sy m bol table, to e x tract the i n formation necessary to pass on to a debugger t h rough the de b uggi ng sy mbol ta b l e . Most of the i m plementation is readily adapted to d i fferent target a rc h i tectures by means o f the same i n str uction property tables that arc used i n the code gene rator and final optim izer. Exa m i n e user variables and hardware recristers 0 • Display a stack trace back showi n g the cu rrent cal l stack • Set watch points • Perform many other fu nctions1' Split Lifetime Variables and Cu rrency Determination Displayi ng (printing) the va l u e of a program vatiable is one of th e most basic services that a debugger can pro v i d e . For u n opti m i ze d code and tra d i tional d e b ug gers, the mechan isms fo r doing this are ge nera l l y based on several ass u m p tions. l . A vari able h as a single al location that remains f-i xed throughout its lifetime. For a local or a stack-allocated variable that means throu ghout the l i fetime of the scope in which the varia b l e i s declared . 2. Definitions a n d uses of the va l u es of user variables occur in the same order in the ge nera ted cod e as they do i n the origi n a l program sourc e . 3 . The s e t of in str u c tions t h a t b e l o n g t o a given scope (which may b e a ro u ti n e bod y ) can be described by a single contiguous range of addresses. The first and second assumptions arc of interest in this discussion because many GFM optim izations mal(e the m inappropriate. Split lifeti me optimization (d is cussed later in this section ) leads to violation o f the fi rst assumption. Code motion optimization leads to vio l a tion of the second assu m ption and thereby creates rl1e so-called c u rrency problem. 'I'Ve treat both �frl1ese prob .>plit lems together, and we refer to them collectively as lifetime suppo11. Statement and in struction sched u l ing optimization leads to violation of the rl1i rd assumption. This topi c is addressed l ater, in the section I n li ning. Split Lifetime Variable Definition A variable is said to have spl it l i fetimes i f the set o f Open VMS DEBUG The OpenVMS Alpha d e bugger, original ly d eveloped for the OpenVMS VAX syste m , 1 '' i s a fu l l - fu nction , source- leve l , symbo l ic debugger. I t supports sym bolic d e bugging o f programs written i n B LISS , MACR0 - 3 2 , MACR0-64, FORTRAN, A d a , C , C++, Pascal , P L/ 1 , BASIC, and COBOL. The d e b ugger al lows the user to control the execution and to exa m i n e the state o f a progra m . Users can 84 character- based user i n te rrace • The front end performs lexical ana lysis a n d pars i n g o f th e sou rce program . T h e prim ary outputs are D isplay t h e source-level v i e w of t h e program's exe c u ti o n usi ng either a gra p h i c al user i n te rface or a fetches and stores of the variable c a n be partitioned such that none of the values stored i n o n e su bset are ever fe tched in another s u bset. When such a partition exists, the vari able can be "split" i n to several i n depen dent " c h i l d " variabl es, each corresp o nd i n g to a parti tion . As i n d ependent vari a bles, the c h i l d varia b l es can be a l located i n depe n d e n t ! }'· The eftect is that the original vari able can be thought to reside in d i ftcrent locations at d i fferent p o i nts i n ti me-so metim es in a • Set breakpoints to stop at certain poi nts i n the program register, • Step through the ex ecution of the program a l i ne at nowhere at a l l . I ndeed , it is even possible ror the dirfer a time ent child variables to be active s i m u l ta n eously. Digital Te chnical journal Vol to No. I 1 99 8 sometimes in memory, and someti mes A simple e xa m p l e of a split Split Lifetime Example X_Fioati n g ) variables as we l l as variables of any of the l i tctime vari able can be seen i n the fo ll ow i n g strai ght complex types ( see Sites ' 6 ) . These l atter variables are line code fragment: referred to as two-part variables because each req u ires A = B A = c = = ; A. ; ; A ! Define ( a s s i g n va l ue t o ) ! 1 Use def i i t ion De ine A aga i n 1 Use l a t er ( v lue o f ) efini A A i on rent with respect to a given position i n the source pro A I n this e xample, the first value assigned to vari able A is never used agai n . A new value is assigned to A and C. Without c h an gi n g the m eaning of this fragment, we can rewrite the code as Al - . . . , B = 2 - c = ! . . . .'1. 1 . . . , . . . , I . . . .�� . . . ' Use .:0.1 De f i ne . ·. c Jl ! . . A2 I .' This scenario in l i ne l is c o m p u t i n g and as s i g n i ng the wro n g This example i l lustrates that s p l i t l i fetime opti m i Moreover, other optim i zations can create opportu n i ties for s p l i t l i fetime opti m i zation t h a t may n o t b e apparent from casu a l examination o f the original source. In particular, loop un rol l i n g ( i n which the body of a loop is replicated several ti mes i n a row ) can create loop bod ies for which split l i fetime opti m i zation is fe:�si b le and desirable . Our impleme n tation deals only Al p h a 's extended precision tl oati ng- point ( 1 2 8 - b i t U noptimized A . . .A . . . ' B c A . . .A. . . ; D v a l u e . The p ro b l e m occurs because t h e com p i l e r h a s moved the second assi gnment so that i t is e a r l y rel a tive t o l i n e 3. Another cu rrency example can be seen i n the frag ment ( taken from Copperman·') that appears in Figure 2. I n thi s case, the optimizing com pi ler has chosen to omit the second assignment to variable A and to assign that val u e d i rectly i n to the actual parameter location used for the call of routi n e FOO. Su ppose that the debugger is stopped at the call of routine FOO. Given a request to d isplay A , a trad itional debugger is l i kely to display the resu lt of the fi rst assignment to A . Agai n , this val u e i s an actual val u e o f A , b u t i t i s not the with scalar variables and parameters. This i nc l u des 5 3. fr u i t l ess atte m p t to d e te r m i n e how the assi g n m e n t zation i s possible even in s i m p l e straight-line code. 3 h e re , i n t h e opti m i zed cod e , h a p p e n s to b e t h e res u l t might easi l y m is l e a d a user into a fr u s trat i n g and single vari able have overlapping l i fetimes. 4 of the of A i s a correct v a l u e , b u t i t is n o t the expected Here, we see that the value of A2 is assigned whi l e the 2 3 h a p p e n s t o b e c o n t a i n e d i n the l ocation of A , w h i c h value that should be disp layed at l i n e val ue of A l is sti l l al ive . That is, the spl it chiJdren of a 1 Now su ppose that execu of t h e second assi g n m e n t t o A . This d is p l ayed v a l u e Use A2 Line 3. G iven a req uest to d i s p l ay ( p ri n t ) the value of A, De E i ne l\?. U s e .'U Variables of Interest As shown i n Figure l , the opti m i zing compiler has ch osen to change the order of operations so that l i n e 4 a tradi tional d e b u gger w i l l d i spla y w h a tever v a l u e ! De f i ne Al ! vari ables. Consider the c u r rency example i n Figu re l . able C. 2 Use A2 • • • I Several kinds of opti m i zation c a n lead t o noncu rrent unoptimized code, the line that assigns a value to vari is also a n equ ivalent fragment: A2 B expected in a n u noptimi zed version of the progra m . tion has stopped at the i nstruction in l i ne Because A l and A2 are independent, the fol lowi n g ., gram if the vari able h olds the value that wou ld be is executed prior to l i n e D e f i ne A l where vari ables A l and A2 are split child variables of A. Al Currency Definition The value of a variable i n an opti mized program is cur used later i n the ass i gn ment to variable B and then used in the assi gnment to vari able two registers to hold i ts val ue. ! ! expected value . Alte rnatively, it is possi ble that prior to reac hing the cal l , the opti m i zi n g compi ler has decided to reuse the De E i e A Use 7'. c es no depend on De f i ne 7'. U e s econa ae f i n i t i o A o[ A Optimized A . . . ' B = . . . A. . . ; A c D = . . .A. . . ; F i g u re 1 Currency Example 1 Digital Technical journal Vol . 1 0 No. 1 1 99 8 85 Line l 2 Unoptimized Optimized express i o n l ; A B = 3 A = 4 FOO ( . . .A . . . ; e x p res s i on . ; ) ; I Use 1 s t def . o f A I Use 2 nd deE . 0 A A B ex = re . . .A. s ionl ; ; . . FOO ( e xpre s s i on . ) ; F i g u re 2 Cu rrency Example 2 location that originally held the first va l u e of A fo r locations h o l d val u es of user variables at any given another p u rpose. I n this case, no val ue of A is avai la ble poi n t i n the program and combine this with the set of definition locations that provide those values. B ecause to display a t the call of ro u tine F O O . Final ly, consider t h e example shown i n Figure 3, t here may be more than one source locati o n , the user which i l l ustrates that t h e cu rrency o f a vari able i s not a is given the basi c i n formation to determine where i n property that is i nvariant over ti m e . S u p pose that exe the source t h e value of a vari a b l e may have originated . c u ti on is stopped a t line 5, i nside the Joop. I n this case, Conse q u en tly, the user can determine whether the A is not c u rrent d u ring the first time through the loop val u e d isplayed is appropriate fo r his or her p urpose . body because the act ual value comes from l i ne 3 ( m oved from inside the loop ); i t shou l d come ti·om Compiler Processing l i ne 1 . O n subseq u e n t t i m es through the loop, the A compiler performs most spl i t l i fetime a nalysis on a value from l i ne 3 is the expected valu e , and the val u e of rou tine-by-routine basis. A p re l i m inary w a l k over the A is cu rrent. e n ti re sy mbol table identi fi es the varia b l e sym bols that As d iscussed earlier, most approaches to cu rrency determi nation involve making certain ki nds of A ow graph a n d compiler opti m i zation i n forma tion avai l tine, t h e compiler performs t h e fol l owing steps: • Code cell prepass • Flow gra p h construction avoid adding major new kinds of ana lysis capabi l i ty to • Basic bloc k processing D I G ITAL's debuggers. • Parameter processi ng • Backward propagation • Forward propagation • Information promotion and cleanup a b l e to the debugger so that it can report when a d is p layed va l u e is not curre n t . However, we wan ted to More fundamentally, as the degree of opti m ization i ncreases, the notion of currentposition i n the program itself becomes increasingJy ambiguous. Even when the partic u l ar i nstruction at which execution is pending can be c learly and unequivocally related to a particu lar source location, this location is not automatically the best one to use tor currency determination. Nevertheless, the source location (or set of locations) where a displayed val ue was assigned can be reliably reported without needing to establish the current position . Accordi ngly, we use an approac h d i ffe ren t than those considered in the l i terature. We use a stra ight forward flow analysis form u l ation to de termine what Line l 2 3 4 5 6 7 Currency Example clean u p tasks . The fo l lowi ng contains a brief d iscus sion of these s teps. In this summary, we highlight only the main charac tetistics of general interest. In particular, we assume that each location, such as a register, is independent of all other locati ons. This assumption is not ap propriate to locations on the stack because variables of different sizes Optimized ; ; A . . . ; wh i l e ( . . . ) A ; . . .A. . . ; = . �;h i l e A } - . . (. . . . , = . . . . . . A. . . . . ) I I A is 1 o p i n va ri an t 3 Digital Tec h n ical Journal After t h e com p i l e r comp l e tes t h i s process i n g fo r all ro u ti n es, a symbol table posrwal k performs fi n a l On op timized A Fig u re 3 86 are of i nterest fo r further analysis. Then , for each ro u Vol . 10 No I 1 99 8 may overlay each other. The complexity of dealing with overlapping allocations is beyond the scope of this paper. Of special i mportance in this processing is the fact that each operand of every instruction includes a base symbol field that refers to the compi ler's symbol table entry for the entity that is i nvolved. The symbol table prewalk identifies the variables of interest for analysis. As dis cussed, we are i nterested i n scalars corresponding to user variables ( not compiler- created temporaries) , i ncluding Alpha's extended precision floating-point ( 1 28-bit X_Fioating) and complex val ues. DIG ITAL's FORTRAN implementations pass para me ters using a by-reterence mechanism with bind (rather than copy-i n/copy-out) semantics. GEM treats the hidden reference value as a variable that is su bject to split lifeti me optimization. Si nce the reference vari able must be available to effect operations on the logi cal parameter variable, it fol lows that both the abstract parameter and its reference val ue must be treated as interesti ng variables. Symbol Table Prewa l k The code cel l prepass performs a single walk over all code cells to determ ine Code Cel l Prepass • The maximum and minimum offsets i n the stack frame that hold any interesting variables • The highest numbered register that is actu ally refer enced by the code • Whether the stack frame uses a frame pointer that is separate from the stack pointer The compiler uses these ch aracteristics to preallocate various working storage areas. Flow Graph Construction A flow graph is built, i n which each basic block i s a node o f the graph. Basic block processing per Basic Block Processing forms a kind of symbolic execution of the i nstructions of each block, keeping track of the effect on machine state as execution progresses. When an instruction operand writes to a location with a base symbol that i ndicates an interesting vari able, the compiler updates the location description to i ndicate tbat the variable is now known to reside in that location-this begins a lifetime segment. The i nstruction that assigned the val ue is also recorded with the lifetime segment. If there was previously a known variable in that loca tion, that lifeti me segm ent is ended ( even if it was for the same variable ) . The beginning and ending i nstruc tions for that segment are then recorded with the vari able in the symbol table. When an instruction reads an operand with a base symbol that indicates an interesting variable, some more unusual processi ng applies. If the variable being read is already known to occupy that location, then no further processing is required. This is the most common case . I f the location already contains some other known variable, then the variable being read is added to the set of variables for that location. This situation can arise when there is an assignment of one variable to another and the register al locator arranges to a llocate them both to the same location. As a result, the assign ment happens i mpl icitly. If the location does not contain a known variable but there is a write operation to that location earlier in the same block (a fact that is available from the loca tion description) , the prior write is retroactively u·eated as though it did write that variable at the earlier i nstruction. This situation can atise when the resu lt of a function call is assigned to a variable and the register allocator arranges to al locate that variable in tl1e regis ter where the call returns its value. The code cell repre sentation for the ca l l contains nothing that indicates a write to the variable; all that is known is that me return value location is written as a resu lt of the cal l . Only when a later code cell i ndicates that i t is using the val ue of a known variable from that location can we infer more ofwhat actually happened. I f the location does not contai n a known variable and there is no write to that same location earlier in this same basic block, then the defining i nstruction cannot be i m mediately determined. A location description is created for the beginning of tl1e basic block i ndicati ng that the give n variable or set of variables must have been defined in some predecessor block. Ofcourse, the contents known as a result of the read operation can also propagate forward toward the end of the block, just as for any other read or write operation. Special care is needed to deal ��th a two-part variable . Such a variable does not become defined u n til both instructions that assign tl1e value have been encoun tered. Similarly, any reuse of eitl1er of the two locations ends tl1e ufetime segment of me variable as a whole. At the end of basic block processing, location descriptions specif)' what is known about the contents of each location as a resul t of read and write operations that occurred in the block. This description indicates the set of variables that occupy the location, or that the location was last written by some value that is not the value of a user variable, or that the location does not change during execution of the block. Parameter Processing The compiler models parame ters as locations that are defined with the contents of a known variable at the entry point of a routine. Digiral Technical Journal Vol . 1 0 No. l 1998 87 Backward Propagation Backward propagation i ter ates over the flow graph and uses the locations with known contents at the begi n n i ng of a block to work backward to predecessor blocks looking tor i nstruc tions that write to that location . For each variable in each input location, any such prior write i nstruction is retroactively made to look like a definition of the vari able . Note that this propagation is not a fl ow algo ri thm because no convergence criteria is i nvolved; it is simply a kind of spanning wal k. Forward Propagation Forward propagation iterates over the flow graph and uses the locations with known contents at the end of each block to work forward to successor blocks to provide known contents at the beginning of other blocks. This is a classic "reaching definitions" flow algorithm, in which the input state of a location for a block is the i n tersection of the known contents from the predecessors. In our case, the compiler also propagates definition points, which are the addresses of the i nstructions that begin the lifetime segments. For those variables that are known to occupy a l ocation, the set of definitions is tl1e union of all the definitions that flow into that location . The final step of compiler processing is to combine information for adja cent blocks where possible . This action saves space in me debugging symbol table but does not affect me accuracy of the desctiption . Descriptions for by-reference bind parameters are next merged witl1 me descriptions for the associated reference variables. Finally, lifetime segment information not already associated wim symbol table entries is copied back. Information Promotion and Cleanup Debugger Processing Name resol u tion, that is, binding a textual name to the appropriate entry in the debug symbol table, is in no way affected by whether or not a variable has split lite ti me segments. After the symbol table entry is found, any seq uence of l i fetime segments is searched for one that includes the current point of execution i ndicated by the program coun ter ( PC). If found, the location of the val ue is taken from that segment. Otherwise, the value of the variable is not available. Usage Example To illustrate how a user sees tl1e results of this processing, consider me smaJJ C program in Figure 4. Note mat me numbers in m e left colunm are listing line numbers. \Vhen DOCT8 is compiled , linked, and executed under debugger control, me dialogue shown in Figure 5 appears. The figure also incl udes interpretive comments. Known Limitations The fol lowing limitations apply to the e xisting split lifetime support. M u ltiple Active Split Children While the compil e r analysis correctly determines multiple active spl i t child variables and me debug symbol tab.le corrccdy describes them, OpenVMS DEBUG docs not currendy support mu ltiple active c hild variables. When searching a sym bol's lifetime segments for one mat includes the current PC, me first match is taken as the ortly match . Support for two-part variables ( those occupying two registers) assumes that a com plete defi nition wi ll occur within a single basic bloc k . Two-part Va ri a b l es Object File Representation The object file debugging sym bol table representation tor split lifetime variables is actually q uite simple. Instead of a single address for a variable, there is a sequence of l i fetime segment descriptions. Each life time segment consists of • The range of addresses over which the child loca tion applies • The location (in a register, at a certain offSet in the curren t stack frame, indirect through a register or stack location, etc . ) • The set of addresses that provide defini tions for this lifetime segment By convention, the last segment in the sequence can have the address range 0 to FFFFFFFF ( hex ) . This address range is used for a static variable, for example in a FORTRAN COMMON block, that has a default al lo cation that applies whenever no active children exist. 88 Digital Technical Journal VoL 10 No. 1 1 99 8 385 38 6 387 oct8 388 389 390 391 392 39 3 () 398 399 400 j, int i , i j k if k; 1; 3; 2 ; j 17 ; ( f oo ( i ) ) 394 39 5 396 397 { e l se ) k ) = ( 18 ; prin t f ( " %d , %d , %d \ n " , i, j, k) ; 401 402 Figure 4 C Example Routine DOCT8 (Source with Listi ng Line N u m bers) $ T7 . 2 - 0 0 1 t un doc t B OpenVMS A l pha Debug64 Ve rs ion s ep / i n to is langucge %I , DBG> C, mod l e set to DOCT 8 o DOC T B \ doc 8 \ %LINE 3 9 1 s teppe . i, k = 3 91 : DBG > e xam ine %�1 . en t i ty · i' %W , en ' j ' %W , en t i t y j, 3; v1a s k no l a l l oca t ed in memory ' k ' does n o t h i ly does no va l ue a t h ve v ( wa the cu rrent a va l ue a PC PC p t lm i z ed a\v.:ly ) the curren t Note the difference in the message for variable i compared to the messages for variables j and k. We see that variable i was not allocated i n memory ( registers or otherwise ) , so there is no point in ever trying to examine its value again . Variables j an d k, however, do not have a vaJ ue "at the current PC." Somewhere later in the program they will have a value, but not here. The dialogue conti nues as follows : DOCT 8 \ doc t 8 \ %LINE 391 s t epped t o DOCT 8 \ doc t 8 \ %LINE 393 DBG> s tep 6 s to 391 : e pp e d OBG k = 3; s tep 393 : DBG> exam i ne %1-'1 , en t i t y ' j, k if ( f oo ( i ) ) j do s ' DOC T8 \ do c t8 k : ( not have a value at the current PC 3 va lue d e f ined a t DOCT8 \ oc t 8 \ % L INE 391 Here we see that j i s still u ndefined b u t k now has a value, namely 3, which was assigned a t line 39 1 . The source indicates thatj was assigned a value at line 390, before the assignment to k, butj's assign ment has yet to occur. Skipping ahead in the dialogue to the print statement at line 400, we see the foJ iowing: DBG> s e t b r a k % l ine 400 DBG> g o 400 : pr in t f ( " %d , DBG> examine j OCTB \ %d , DOCT 8 \ d oc t 8 \ % L IN E break a t oc t 8 \ j : 2 de f i ned va l at 400 %d \ n " , i, j , DOCT 8 \ d oc t 8 \ % L I NE k) ; 390 value de f i ned a t DOCT 8 \ d oc t 8 \ %LI NE 3 9 4 DBG> ex m i ne k DOCTS \ oc t 8 \ k : 18 va l u e de f i ned at va l e DOC T8 \ do c t 8 \ % L I E 3 97 + 4 e i ned a t DOCT 8 \ oc t 8 \ %LI E 3 9 1 This portion o f the message shows that more than one definition location is given for both j and k. Which of each pair applies depends on which path was taken i n the i f statement. If a variable has an apparently i nappropriate val ue, this mechanism provides a means to take a closer look at those places, and only those places, from vvhich that value might have come. Figure 5 Dialogue Resulting from Running DOCT8 That is, at the end of a basic block, if the second part of a definition is missing then the i nitial part is discarded and forgotten . Consider the following FORTRAN fragment: COHPLEX X, X = Y = X + Y [1.0, 0.0) Suppose that the last use of variable X occurs i n the assignment to variable Y so that X and Y can be and are allocated i n the same location, in particular, the same register pair. In this case, the definition of Y req uires only one instruction, which adds 1 .0 to the real part of the location shared by X and Y. Because there is no sec ond instruction to i ndicate completion of the defini tion, the definition will be lost by our implementation. Digital Tech nical Journal Vol . 1 0 No. 1 1 998 89 Semantic Stepping Not all such instructions are appropriate, h owever. We start with an i n i tial set of cand id ate i nstructions A major problem with step p i n g by l in e though opti a n d refine it. The fol lowing sections describe the m i zed code is that the apparent sou rce program loca h e u ristics that are currently i n use. tion " bo u n ces" back and forth , with the same l i ne often appearing again a nd aga i n . In l arge part this Assig nment bouncing is due to a compiler opti m ization called are the instructions that assign a val u e to a variable ( o r code scheduling, i n w h i c h i nstructions that arise from t o one o f i ts s p l i t c h i l d ren ) . The second instruction i n The candid ates for assignment evcms the same source l i n e are sched u l e d , that is, reordered a n assignment t o a two-part vari a b l e is e x c l u d e d . and i n termixed with other instructions, for better exe Stopp i n g between the two assignmcms is i nadvisable c u tion performance. OpenVMS because at that p o i nt the variable no longer h as the DEBUG, l i ke most de buggers, interprets the STEP / LINE (step by l i n e ) command to m ean that compl ete old state and docs not yet have the complete new state. the program should execute u nt i l t h e l i ne nu mber c h anges. Line nu mbers c h a n ge more ti·cqu cntly i n Branches sched u led code than i n u nopti m i zed code. tional a n d cond i ti on a l . An u ncond i tional branch may For example, in sample programs ti·om the SPEC95 There are two kinds o f branch : u ncond i have a known desti nation or a n u n known destination . B enchmark S u ite, the average n u m ber o f instructions U nconditional in sequence that share the same l i ne n u m ber is typ i most often arise as part o f some larger semantic con c a l l y between 2 and 3-and typica l ly 5 0 t o 70 percen t of those sequences consist o f j u s t 1 i nstructi o n ! I n branches with known d estinations struct such as an i f- t h e n - e lse or a l oop . For e x a m p l e , c o d e for an i f- th e n-else construct genera l l y has an contrast, i f o n l y i nstruction- level sched u l ing i s d is i mp l i c i t j o i n t h a t occurs at the end o f the state m e n t . abled , then t h e average n u m ber of i nstructions i s The join takes the form of a j u m p fi· om the end of o n e between 4 2 0 t o 3 0 percent consistin g o f al ternative t o the location j ust past the l a s t i nstruction one i nstr u ction . I n a c o m p i lation w i t h no optimiza and of the other ( w h i c h has no e x pl i c i t j u m p and fal l s tion, there are rou g h l y 5 6, with 8 to 12 i nstructions i n a seq uence, with percem consisting o f a single instructi o n . A second pro b l e m w i t h stepping b y l i ne through an through i n to the n e x t statement) . Th is j u m p turns the i n h erently symmetric join at the sou rce leve l i nt o an asymmetric construction at the code stream level . opti mized program is that, because of the behavior of U n cond itional j u m ps a lmost never define i n terest revisiting the same l i n e again and aga i n , the user is ing semanti c events-some re l ated instruction u s u a l l y never q u i te sure when the l i ne has fi n ished e xe c u ting. provides a more useful even t point, s u c h as the ter m i It is u n c lear when an assignment actua l l y occurs or a nation test i n t he case o f a loop . O n e exception is a control flow decision is abom to be made. s i m p l e goto statement, b u t these arc very often opti In u n opti mi zcd cod e, when a user requests a break m i zed away i n any case . Conseq u e n t l y, u nconditional point on a certain l i n e , the user e x pects e xecution to branches with known destinations arc not treated as stop j ust before that l i n e , hence before the l i ne is car semantic events. ried out. I n opti mized code, however, there is no wel l U nconditional branches with u nknown destinJ defined l ocation that i s " before t h e l i ne i s carried o u t," tions are rea l l y cond i tional b ranches: they arise from because the code for that l i n e is typica l l y scattered constructs s u ch as a C swi tch statement i m pl e m e n ted abo u t , i ntermixed, and even com bined wi th the code as a table dispatch or a FORTRAN assigned GO TO state for various other l i nes. It is u s u a l ly possi b l e , h owever, ment. These bra nc h es defin itely arc i n teresti ng points cfkct of the l i n e . d i rection is take n . Thus, the com piler retains u ncon to identifY !be i nstruction that actually carries o u t the at vvh i c h to a l low user i n teraction before the new d i tional branches as semantic events. Similarly, in genera l , cond i tional branches to known Semantic Event Concept We i n trod uce a new kind of stepping mode ca l le d 90 destinations are i mportant semanti c event points. Often semantic stepping t o address these problems. Semantic more t11JJ1 one branch instruction is generated rcJ r a sin stepping al lows the program to execute u p to, but not gle high- level source construct, rex example, a decision i nclu d i ng, an i nstruction that causes a semantic eftect. tree of tests and branches used to i mplement J small I nstructions that cause semantic eftixts are instructions C switch statement. I n this case, on ly the first i n the that execution sequence is used as the semantic event poi nt. • Assign a value to a user variable • Make a control flow decision semanti cally i nteresting even ts . • Make a routine call some r u n - time l i brary rou ti nes arc u s u a l l y n ot i nterest- Digiral Tech n ical ) o u nul Ca lls Vol . 10 1'-Jo . 1 1 998 Most calls are visible to a user and constitute However, calls to ing because these calls are perceived to be merely soft already marks branches with the semantic eve nt ware i mplementations of primitive operations, such as attrib �1te, if appropriate. Also unlike the u·aditional i nteger division i n the case of the Alpha architecture. stepping- by-line algorithm, the new algorithm does GEM internally marks calls to all its own run-time sup not consider d1e source line number. port routines �s not semantically interesting. CompiJer front ends accomplish this where appropriate for their Visible Effect own set of r u n - ti m e support routines by setti ng a flag With semantic steppi ng, a user's perception of forward on the associated entry symbol node. progress through the code is no longer dominated by Compiler Processing every few insm.Ktions regardless of what is happening. I n most cases, the compiler can identify semantic event Rather, this perception is m uch more closely related to locations by simple predicates on each instr u ction . the actual semantic behavior, that is, stopping every the side e ffects of code sched uling, that is, stopping statement or so, i ndependent of how many instruc The exceptions are • The second of the tvvo i nstructions that assign val ues to a two - part variable is iden tified during split lifetime analysis. • Conditional branches that are part of a larger con s u· u ct are identified during a simple pass over the How graph. tions from disparate statements may have executed. Note that j u mping forward and backward in the sou rce may still occur, for example, when code motions have changed the order in which semantic actions are performed. Nothing about semantic event handling attempts to hide such reordering. lnlining Object Module Representation The object module debugging semantic event repre sentation contains a sequence of address and event kind pairs, in ascend i ng address order. is i n lined into rou tine CALLER and the current point of execution is within INN ER, should the debugger Debugger Processing Semantic stepping i n the debugger i nvolves a new algorithm for determining the range of instructions to execute. This algorithm is built on a debugger pri m i tive mechanism that supports full-speed execution of user instructions within a given range of addresses but u·aps any transter out of that range, whether by reach i ng the end or b y executing any kind of branch o r call instruction. Semantic stepping works as follows. Starti ng with the current program cou nter address, Open VMS DEBUG rinds the next higher address that is a seman tic event point; this i s the OpenVMS DEBUG Procedure call inlining can be confusing when using a traditional debugger. For example, if routine INNER target event point. executes i nstructions in the add ress range that starts at the address of the c urrent i n s tructi o n and ends at the i nstru ction that precedes the target event point. The range execution terminates in the following two cases: l . If the next instruction to execute is the target event point, then execution reached t he end of target range and the step operation is complete. 2. If the next insu·uction to execute is not the target event point, then the next address becomes the cur rent address and the process repeats ( silently). report the current source location as at a location in the caller routine or in the called routine? Neither is completely satisfactory by itself. I f the current line is reported as at the location within INNER, then that information will appear to conflict with i n formation from a call stack traceback, which would not show routi ne INN ER. If the current l i ne is reported as though in CALLER, then relevant location informa tion from the callee will be obscured or suppressed . Worse yet, i n the case of nested inlining, potentially crucial i n formation about tl1e i n termediate call path may not be available in any torm . The problem of dealing with inlining was solved long ago by Zellweger'-at least the topic has not been treated again since. Zellweger's approach adds additional information to an otherwise traditional table that maps fro m instruction addresses to the corre sponding source line numbers. Our approach is d i ffer ent: it i ncludes additional i n formation in the scope description of the debugging symbol table. A key u nderpinning for inline s u pport i s the ability to accurately describe scopes that consist of m ultiple discontiguous ranges of instr u ction addresses, rather than the tradi tional single range. This capability is q u ite independent of inlining as such. However, Note that, u n l i ke the algorithm that determines the because code from an inli ned rou tine is freely sch e d range for stepping by line, the new algoritl1m does not u led with other code from t h e cal ling context, dealing requ i re an explicit test for the kind of instruction, in accurately with the resul ting disjoint scopes 1s an particular, to test if it is a kind of branch. The compiler essential buiJd ing block for effective support. Digi tal Technical Journal Vol . 1 0 No. 1 1 998 91 Goals for Debugger Support • Our overa l l goal is to support debuggin g of i n l i ned code with expected behavior, that i s , as though the i n l i n ing has not occurred . More specifically, we seek to provid e the a b i l i ty to • rogate parameter variables. • scope, which is a copy of the body of the in l i n e d • e nces to d e c larations or parameters of the rou ti ne are replaced with references to their correspo nding Show a traceback that incl udes cal l frames corre copied declarations. I n a dd i tion, returns ti·om the spo nding to i n li ned routines • • routine are replaced with j u m ps back to rhe tuple Set a breakpoi nt at a given rou ti n e e n try following the origi n a l c al l . Set a breJkpo i n t at a given line n u m be r ( from wi th i n a n i n li ned routine) • • S im i lar " bou n d a ry adj ustments" Jre made to deal with fu nction results, output parameters, c h oice of Cal l a n i n l i n ed ro u ti n e e ntry point (wh e n there i s more than one, as m i ght occur for FORTRAN a l ternate entry state ments ) , W e have achieved these goals to a s u bstantial exte n t . e t c . ( Th e boo kkee p i n g i s a bit i ntricate, b u t it is concep t u a l l y straightforw:� rd . ) GEM Locators Bdore descr i b i n g the mechanisms to r i n l i n i ng, we i ntrod u ce the G EM notion of a loca tor. A locator describes a place in the source text. The sim plest kinds of locator describe a poi nt i n the source, i nclud ing the name of the file, the l i ne within that file, and the col u m n with i n t h a t l in e ; they eve n describe the point a t The cal l i n g rou t i n e , which now i n corporates a copy of the i n lined routine, is then fu rther processed as a normal ( th o u g h larger) routi n e . lnlining Annotations for Debugging which th at fi l e was i n c luded by another fi l e ( as fo r a C as follows. or C++ #include d i rective ) , i f applicable. • - A pointer to the routine declaration being i n l i ned. or pointer. ( How this is achieved is beyond the scope The locator fi·om the call that i s replaced . In of this paper. ) I n particu lar, locators are smal l e n ough that every tuple node in the i n termediate la nguage completed; dtis locator captures t h e original call location for poss i ble later use, for example, as J meticulous a bo u t mai ntai n i ng and propagating h i g h supplement to d1e i n formation thJt maps instruc q u a l i ty locator i n formation throughout i ts opti miza tion addresses to source line n u m bers. tion and code generation . locator e n codes a p a i r that consists o f a locator ( w h i c h m a y als o b e a n i n l i ne locator) and the add ress of an associated scope node i n the G E M symbol ta b l e . As the code l i st o f the original i n l i ned routine is copied, each locator from the origi n a l is replaced by a new inline locator that records - The origi nal locator. bei ng copied . De buggi ng optimi zed code su pport tor i n l i n i n g ge n era l ly b u i ld s o n a n d i s a mi nor en hancement of t h e G E M i n l in i n g mechanis m . l n l i n i ng occurs d u ring a n early part of the G E M opti m i zer phase. Within the scope that contains the cal l site, an inline scope block is i ntrod uced . This scope represents the result of the in l i n i n g operation . It i s populated with local varia ble declarations that correspond one-to one with d1e tormal parameters ofd1e i.nlined routine. Vol . 10 N o . I As a result of these steps, every i n l i ne d i nstruc tion Gin be related back to the scope i nto which i t was i n l i n e d and h e n c e t o the routine ri·om which it was i n l i n e d , regardless o f h o w i t m a y be m o d i tied o r moved a s a resu lt of subseq uent optimization . I n l i n i ng is i mp l e me nted in G E M as fol l ows: Digiral Technid Jou rn:1l • - The newly created i n l i n e scope i n to which it is Compiler Processing • sim left i n the I L from t h e origin a l call after i n l i n i ng i s contains one. M oreover, GEM as a whole is q u i te inline a ple call wid1 n o argu ments, there may be noth i n g ( I L) and every code cell in the ge nera ted code stream An add i tional ki nd of locator was i n trod uced fo r The newly created i n l i ne scope block is a n n otated with additional i n formation , namely, of a u n i form fixed size that is no l a rger than an i n teger i n l i n ing support. This The main changes introduced for debuggi ng opti mized code support are A crucial characteristic o f locators is that they are aU 92 The origi nal call is replaced with a jump to a copy of the IL for the body of the routine, i n wh i c h refer Display para m e ters and l ocal vari a bles of an . i n lined routine • The i nl i n e scope is also made to conta i n a bod)' routi n e , includi n g a copy o f its local variab les. Report the source locJtion corresp o n d i n g to the c u rrent position i n the code • The actual argu m e n ts of the call are transformed i n to assignments that i n i tiJiize the val u es o f the sur 1 99 8 Note dut these additional steps arc an exception to the general assertion th at debugging opti m i zed code su pport occurs after code ge n e ration and j u st prior to object code e m ission. T hese steps i n no vvay i n t1 u e nce the generated code-only th e d e buggi ng symbol table that is output. The prologue of a rou through the tlow graph looki n g for the last i n struction tine genera l l y consists of those i nstructions at the ( that is, the i nstr u ction closest to the routine e x i t ) of begi n n i n g of the rou tine that establish the routine an i n l i ne i nstance that can reach Jn exit. Prolog u e a n d E p ilogue Sets stack frame ( for example, a l locate stack and save the Note that prologue and epi logue sets are not strictly return address and other preserved registers) and that symmetric: prologue sets consist of only instructions tl1at must be execu ted before a debugger can usefu Uy i nter are also semantic events, whereas epilogue sets i nc l ude pret the state of the rou ti ne. For this reaso n , setti ng a instructions tlut may or may not be semantic events. breakp o i n t at the begi n n i n g of a routi ne is usua l ly ( tra nsparently) i m plemen ted by setting a breakpoint Object Module Representation after the prologue of that routine i s co mp leted. To describe any i n l i n i n g that may have occurred dur Conversely, the epilogue o f a rou t i n e consists o f those i n structions at the end of a routin e t h a t tear i n g compilation, we i nclude three new ki nds of i n for mation i n the debu gging symbol table. down the stack fra me, reestablish the caller's conte xt, I f tl1e instructions contained in J scope do not fo rm a and make the retu rn value, i f any, avai lable to the single contiguous range, then the description of the caller. For this reason , stopping at the e n d of a routine scope is a u gmented vvjth a discontiguous range descrip i s usu ally ( transpare n t ly) i m plemente d by setti n g a tio n . This description consists of J sequence of ranges. breakpo i n t before the epilogue of that routine begi ns. (The scope itself indicates tl1e traditional approximate One benefi t of i n l i n i n g i s that most prologue and range description to provide bac kward compati bility epi logue code is avoided; however, there may still be with older versions of OpenVMS D E B U C ) . This aug some scope man agement associated with scope en try mented desc1i ption applies to aU scopes, whether or not a nd exit. Also, some program m i n g l a nguage-related they are tl1e result ofinlining. environment ma nageme n t associated with the scope For a scope that results from i n l i ni n g a cal l , the may exist and should be treated i n a m a nner analogous descri ption o f the scope is augmented with a record to traditional prologue and epilogue cod e . The prob that refers to the rou tine that was i n l ined as we l l as the lem i s hovv to i d e ntif)' it, because most o f the tradi l i n e n u m ber o f the c al l . Each scope also contains two tional comp i l e r code generation hooks do not app ly. e n tries that consist of the sequence of prologue J nd The model we chose takes adva ntage of the seman tic event i n formation that we describe in the section epilogue addresses, respectively. Backward compatibility is fu l l y mai n tained . An older Semantic Steppi n g . In parti c u l a r, we define the first version of Open VMS DEBUC that does not recognize semantic event that can be executed within the in l i ned the new kinds of i n fonnation wi l l simply ignore it. routine to be the end of the prologue. For reasons d is cussed l ater, we define the last i nstruction ( not the l ast Debugger Processing semantic eve n t ) of the i nJ i ned code as the begin n i n g of As the debugger reads the debuggi n g symbol table of the epi logu e . As a res u l t of u n rel ated opti m i zation a modu l e , i t constructs a l ist o f the i nl i ned i nstances for effects, each of these may turn out to be a set of each routi n e . This process makes it possible to tl nd a l l i nstructions. Determ i nation o f i n l i n e prologue and instances o f a given routine. Note, however, that if every epi logue sets occurs after split l i fe ti me and semantic call of the routine is expanded i n l i n e and the routine event determ i n ation is completed so that the results of cannot otherwise be called fi-om outside that m odule, those analyses can be use d . tl1en CEM does not create a noninlined (closed - form ) To determine the set o f prologue instructions, for each version of tl1e routine. i nline instance, CEM starts vvjtl1 every possible entry block and scans torward through the flow graph looking Report Source Location for tl1C first semantic event instruction that can be reached the source location tl1at correspo nds to tl1e current code It is ;.1 simple process to report from that en try. The set of such i nstructions constitu tes address. When stopped inside the code resu l ting from the prologue set for tl1at instance of the inJined routine. an inlined routine, the program cou nter maps directly This is a spa n n i n g walk forward from the ro utine to a source l i n e ,vjthin the inlined routine. e n try ( o r e ntries) that stops ei ther when a b l ock is fou nd to conta i n a n i nstruction from the given i n l i n e Display Parameters and Local Variables i nstance or when the block has alreJdy been e n cou n for tered ( each block i s considered a t most once ) . Note i nlined routine contains copies of the parame ters and a As i s tl1e case noni n l i n ed rou tine , tl1e scope description tor an that there nuy be execution paths that include one or the local variables. No special processi n g is req uired to more i nstructions from an i n l i n i n g, none o f w h i c h is a perform name binding for such entities. semantic event i nstructi o n . The set of epilogue i nstru ctions i s determined usi n g Include ln lined Ca l l s in Traceback The debu gger pre a n i nverse o f the prol ogue algorith m . The process sents i n l i ned ro u ti nes as if they are real routi ne calls. A starts with eJch possible e x i t block and scans bac kward stack frame whose cu rre n t code address corresponds Digital Technical )ourml Vol . 1 0 No. I 1 998 93 to an i nl ined rou ti n e i nstance is described with two or more virtual stack frames: one or more for the i n li ned instance(s) and one for the u l ti mate cal ler. ( An exam ple is shown later in Figure 7 . ) Set Breakpoi nts at l n l ined Routine I nstances The strategy for setting breakpoints at i n l i ned routines is based on a generalization of processing that previously existed for C++ member fu nctions. Com pilation of C++ modules can resul t i n code for a given member fu nction being compiled every time the class or tem p b te definition that contains the me mber fu nction is compiled. vVe refer to all these com pilations as clones. ( I t is not n ecessary to disti n guish which of them is the Line.: 6 7 tine: i n all the curren tly active mod u l es . �·la i n .t.ou t : ne 1 ·-EGER A , c TYPE * .� ( 3 ' c ( 0 ) 1 END F'UNCT ! ON A 12 13 14 15 ; 8(5, \ .' ( I . L) B I) 2 "" L + RETURN END c 16 F'UNCTIO: B ( J , INTEGER B , c B - C(9) + J K) + K END +T+++ c File DOCFJ- I NLI � E - 2 A . FOR fUNCT ION INTEGER C c � 2' 1 C(J) R ETUR 6 Set Breakpoints at I n lined Line N u mber I nsta nces DOC F'J - J NLI E - 2 . FOR II TEGER 9 10 11 2 3 4 5 F' i l e c 8 originaL ) I n our general i zation, an inl ined routine call routi n e , the debugger sets breakpoints at all the end c c 2 3 4 5 i nstance is treated l ike a clone . To set a breakpoint at a of�prologue addresses of every clone of the given rou +•+++ END The strategy for setting breakpoints on line n u m bers shares some teatures of setting breakpoints on routines, with Figure 6 Program to J l l ustr:�rc In l i n i ng Support additional complications. Compiler- reported l i ne num bers on OpenVMS systems are unique across a l l the files i ncluded i n a compilati o n. It follows that the same file i ncluded in more than one compilation may h ave d i ffere nt associated line n u m bers. To set a breakpoint at a particu l :�r l i n e n u m ber, that l i ne nu mber needs to be fi rst nonn:1 l i zed rel ative to the cont a i n i n g file. This norm a lized l i n e n u m be r v :t l u e i s then compared to n orm a l i zed l i ne n u m bers fo r that same fi l e that are included in other com p i l a tions. ( I f d i fferent versions of the same named fi l e occu r i n d i fferent compil ations, t h e versions are treated as u nr e lated . ) The origin a l l i ne n u m ber is converted i n to the set of add ress ranges t h a t corre spond to it in a l l modules, taking i n to account i n l i n i n g and c l o n i n g . C a l l a Routine That Is l n l i ned I f the compiler creates a can call that rou tine independent of whether there nuv also be i n l i ned i nstances of tl1e routine . If no such � ver ion ofthe routine exists, then the d e b u gger cannot call the routin e . If w e com pile , l ink, a n d r u n this program using the OpenVMS DEBUG optio n , we can step to a place in routine B that is just bdore the call to rou t i n e C and tl1en request a traceb:1ck of the call stack. This dialogue is shown in Figure 7 . Figure 7 shows tlut pseudo stack fi-ames are reported for routines A and B, even tl1ough the call of routine B has been in lined i nto routine A and the call of rou tine A has been i n l i ned i nto the main program. The main d i f ference from a real stack rrame is t h e ext.rJ line that Limitations In a real stack ri·ame, i t is possi b l e to examine ( a nd even deposit i nto) the real machine registers, rather than exami ne the variJ bles that happen to be a l located in machine registers. In an i n l i ned stack frame, this operation i s not well ddlncd and consequ e n tly not supported . In a non i n l i ned stack ti:a me, these opera Usage Example I n l i n ing s u pport has m a ny aspects, but we will i l l u s trate only one-a c a l l traceback t h a t i n c ludes i n l i n e d ca lls. Consider t h e sample program shown i n hg u re 6 . This program has four routi nes: tl1t-cc JIT combi ned i n : 1 single fi l e (enabling t h e GEM FORT RA.N com p i l e r to perform in l i ne opti m izations), and the l a s t i s i n a separate fi l e . To help correlate the l ines of code in DigitJI Tcch nicJI JournJI n u mbers to the left of the cod e . Note that these n u m bers are not p:�rt o f the progra m . reports tl1at tl1e "above routine is i n lined ." closed-form version of a routi n e , then the debugger 94 these two riles with those in Figure 7, we added l i ne Vo l . 10 No. I 1 998 tions are sti l l allowed . An attractive feature that wou l d rou nd out the expected beluvior o f i n l i ned rou tine calls wou ld be to s u pport steppi ng i n to or over the inlined call i n the same way that is possible tc)r noni nlined caJls. T h is rea ture is not curre n tly su pported-execution alwJys steps into the ull. GEMEVN$ FJ - I NL I 1E - 2 1S A l p h a Debu r n D Open %! , Lang D G> s ge : to DOCFJ - I LI B C(9) 15 : DBG > show = ca l l s e n e * DOCFJ - I . LI ine name rou above IN tO A L I N E - 2 $.1AIN E rel 15�8 abs P PC OOOOOOOOOO OOOO l C 0 0 0 0 0 0 0 0 0 2 0 0 6C 0 i s inl ined 9 ine is 00000000000000 04 0 000000002 054 i . l in e ' r , E - 2 $ MAI DO FJ - It\L I N E - 2 $:�. I . 4 DEBUG B \ % LI l i ne 15 abo ve rou t i ne ' DOC FJ- I• Figure 7 OpenVMS DOCFJ -Ir E - 2 $ MA I 'l * DOCFJ- I LINE 2 $ ---- - ver s i o� T7 . 2 - 0 0 1 E - 2 $ MA I 1 � J + K B ----- 6 Modu l e : ep / seman t i c s t epped mod FORTRAN , 0 0 0 00 0 0 0 0 0 00 0 0 3 8 0 000 0 0 0 0 0 0 02 00 3 8 0000000000000000 FFFFFFFF 8 5 9 0 7 1 6 C Dialogue to I l lustrate I nlining Suppon 5 1 : no opti m i zation (noop t ) , no d e buggi n g i n for Performance and Resource Usage mation ( nodebug, nodbgopt) We gathered a n u m ber of statistics to determine t:ypi 52: no o p t i m i zation ( n oopt ), normal debugging cal resource req u i re m ents tOr using the enhanced information ( debug, nodbgopt) debugging optimized code capability com pared to the 54: full ( d e fa u l t ) o p ti m i zation (opt), no debugging trad itional practice of debugging u nopti m i zed cod e . A information ( nod ebug, nodbgopt) short sum mary of the findi ngs fol l ows. • 55: All metrics tend to show wide variance fi·om pro gram to program , especially smal l ones. • 58: fu ll optimi zation (opt), en hanced debugging Generating traditional debugging symbol information i n formation (debug, d bgopt) increases the size of object modules typically by 50 to 100 percent on the OpenVMS syste m . Executable image sizes show similar but smaller size increases. • Generating en hanced symbol table i n formation adds about 2 to 5 percent to the typical compi lation time, although higher percen tages have been seen tor u n usually large progra ms. • Generating enhanced symbol table i n formation uses significant me mory d u ri n g compilation bm does not affect the peak memory req u i re ment of a compi lation . • Note that the option combination n u m bering sys tem is historical; we retained the system to help keep data logs consistent over ti m e . Compile-time Speed The incremental compile-time cost of creating enhanced symbol table information is presented i.n Ta ble data in this table can be summaJized as follows: • • dling split l i fetime variables ( column • reduces the re sulting ima ge size . Total net i mage s i ze i ncreases typ i cal ly by 50 to 80 percent. A more d e tai led presentation of fi n d ings fol l ows. Ta bles 1 through 3 present data collected u s i ng pro d u ction OpenVMS Al pha native co m p i lers built in December 1 9 9 6 . I n developing these results, we used five combi nations of compi lation options as fo ll ows: 3 perce n t , is attributed t o t h e flow analysis i nvolved i n han 200 perce nt to the deb ugging symbol table of opti m i zation Enhanced deb ugging (col u m n 2) increases the component of that ti me, approxi mately object modules and perhaps 50 to 1 00 percen t fo r Compiling with ful l l percent. compil ation time by about 4 percent. The largest mation compared to that for an u nopti m i zed com· executable i m ages. Traditional debugging ( column 1 ) i n creases the total compi lation time by about pilation . On the OpenVMS syste m , this adds 100 to • l for a sampling of BLISS , C, and FORTRAN modules. The Generating enh anced symbol table i n formation further i ncreases the s i ze of the sym bol table i n tor full opti mization ( o p t ) , normal d e bugging information only (debug, nodbgopt) 3). Debuggi ng tends to i n crease a s a percen tage of ti m e in larger modules, whic h suggests that pro cessing time is slightly nonli near in program size; however, thi s i ncrease does not seem to be excessive even in very large modules. Compile-time Space The compile-time memory usage during the creation of e nhanced symbol information i s presented in Table 2 . Digital Technical )ourn;ll Vol. 10 N o . I 1 998 95 Ta ble 1 Percent of Com p i lation Ti me Used to Create/Output Debugging I nformation Module S2 (noopt, debug, nodbg opt) G E M_AN 0.3% 1.1% G E M_DB 0.9 1 .8 1 .3 G EM_D F 0.8 4.4 2.7 1 3 .9 S8 (opt, debug, d bg opt) (Split lifetime Ana lysis On ly) BLISS CO D E 0.7% G E M_FB 0.7 5.2 3.5 G E M_I l_PEEP 0.6 1 4.4 (_METR I C 1.5 5.2 4.1 G RAM 0.5 2.9 2.2 I NTERP 1 .2 4.5 3.2 MATRIX300X nm nm nm NAG l 1 .4 1 3 .0 1 1 .9 SPICE_V07 3.0 6.4 4.7 WAVEX 2.5 6.3 4.8 C CO D E FORTRAN COD E Average Typical range 1 .2 % 4.3% 3.2% (0. 5 % -1 . 5 % ) (3.0%-7 . 0 % ) (2.0%-5.0%) Note: " n m " represents "not m e a n i ngfu l , " that i s , too s m a l l to b e accurately measured. Ta b l e 2 Key Dy n a m i c Mem ory Zone S izes d u r i n g B LISS G E M Com p i l at i o n s File Peak Tota l SYMBOL ZO N E Ell ZONE G E M_AN G E M_DF G E M_FB G E M_ I l_pEEP 2, 507 1 1 , 305 4,694 40,4 1 9 1 30 836 316 1 ,606 85 1 , 672 522 1 7, 666 7,381 3,03 1 3,563 1' 1 1 5 82 3 54 494 81 5 308 934 6,267 6,234 1 2, 8 1 2 1 43 1 ,520 1 ,0 5 1 4,676 227 1 ,791 3,256 3, 1 1 9 CODE ZONE OM ZONE % Peak % Larg % Ell 15 1 ' 1 80 304 1 4, 1 43 6% 10 6 34 8% 57 58 80 1 8% 71 58 80 B L I S S CODE 1 84 2,056 457 4.4 1 1 C CO D E C_M ETRIC G RAM I NTERP 2,563 21 1 688 1 67 267 131 2 9 4 6 33 20 34 33 43 58 68 459 68 6 11 7 5 26 38 14 14 26 38 14 22 32% 40% FORTRAN CODE MATRIX300X NAG l SPICE_V07 WAVEX 1 01 1 .742 885 3.482 Average 9% Note: A l l numbers t o t h e left of t h e vertical bar are thousa n d s o f bytes. not multiples o f 96 1 ,024. Column Key: Column Description Peak Tota l SY M B O L Z O N E E l l ZONE CO D E ZON E OM Z O N E % Peak % larg %Ell The peak dyn a m i c memory a l located i n a l l zones d u ring the co m p i l a t i o n The z o n e t h a t h o l d s the G E M sy m bol ta b l e The zone that holds the l argest Ell Z O N E (used for the expanded i ntermediate representation) The zone that holds the G E M gen erated code l ist T h e z o n e that holds s p l it l ifet i m e a n d other work i n g d ata The OM ZONE size as a percentage of the Peak Total size The O M ZONE s i ze as a percentage of the l a rgest s i n g l e zone i n the co m p i lation The O M Z O N E size a s a percentage o f t h e E l l ZONE size Digir:U Tcchn ica.l Journal Vol . 10 No. 1 1 998 image text, etc.) due to the inclusion ofenhanced infor mation compared to the traditional symbol table size. The followi ng is a summary of the data, where OM ZON E refers to the temporary working virtual mem ory zone used tor split l i feti me analysis: • The OM ZO N E size averages about 10 percent of the peak compil ation size. • The OM ZON E size is one-quarter to one- half of the EIL ZONE size. (The latter is well known for typi cal ly being the largest zone in a GEM compilation . ) • Since the O M ZONE i s created and destroyed after aU ElL ZONEs are destroyed, the OM ZONE does not conuibutc to establishing the peak total size. SS/52 : This ratio shows the object or i mage size with enhanced debugging i n formation with opti mization compared to the traditional debugging size without optimization. The last ratio, SS/52, is especially interesting because it combines two effects: ( l ) the reduction in size as a result of compiler optimization, and (2) the i ncrease i n size because the larger debugging symbol table needed to describe the resu lt of the optimi zation . The resu lt ing net i ncrease is reasonably modest. Object Module Size Summary and Concl usions The increased size of enhanced symbol table informa tion tor both object files and executable i mage files is shown in Table 3. In Table 3, the application or group of modu les is iden tified in the first column. The columns labeled 5 1 , 52, etc. give the resulting size for the combination of compilation options described earlier. Object module and executable image data is presented in successive rows. Three ratios of particu lar i nterest arc computed . There exists a small but significant literature regarding the debugging of optimized code, yet very kw de bug gers take advantage of what is known. In this paper we describe the new capabi l i ties tor debuggin g optimized code that are now su pported in the G EM compiler sys tem and the Open VMS D E B U G component of the OpenVMS Alpha operating system . These capabilities deal with split lifeti m e variables and currency determi nation , semantic stepping, and procedure inlining. For each case, we describe the p roblem add ressed a nd then presen t an overview of G EM com piler and OpenVMS D E B U G processi ng and the object mod u l e represen tation that med iates between them. All but the inlin i n g support are i ncluded i n OpenVMS DEBUG V7 .0 and i n G E M- based compi lers for Alpha systems that have been shipping since 1 996. The inl ining support is 52/5 1 : This ratio shows the object or i mage size with traditional debugging information compared to a base compilation without any debuggi ng infor mation . This ratio i ndicates the additional cost, in terms of increased object and image file size, associ ated with doing traditional sym bolic debugging. (S8-S5 )/(S2-S 1 ) : This ratio shows the increase in debugging symbol table size (exclusive of base object, Ta b l e 3 Object/Executa b l e (.OBJ/. EXE) F i l e S izes (in N u m ber of B l ocks) for Va rious Open VMS Components 51 52 54 55 58 no opt debug nodbgopt 52/51 opt node bug nodbdopt opt debug nodbgopt opt debug dbgopt (58-55)/ (52-5 1 ) File noopt nodebug nodbgopt 58/52 Ratio Ratio G E M_* . OBJ G E M_* .EXE 3 1 ,477 1 2, 1 60 5 1 , 069 29,543 1 .62 2 .43 BLISS CODE 27,483 1 0, 3 7 3 47, 0 3 1 27, 7 5 5 68,728 32,288 1.11 0.26 1 .35 1 . 09 436 250 1 02 60 1 40 80 653 348 1 20 70 207 113 1 . 50 1 .39 1.19 1.17 1 .48 1 .4 1 C CODE 478 250 1 00 58 1 34 75 733 385 1 17 69 205 1 13 1 , 680 581 224 91 450 1 67 4.36 2 .00 5 .94 2.20 3 . 66 1 .64 2 . 57 1 .67 1 .87 1 .30 2.17 1 .47 20 19 42 289 1 , 652 1 ,03 1 555 634 34 29 63 388 3, 1 1 7 1 , 660 1 , 639 1 , 1 90 1 .70 1 . 53 1 .51 1 . 34 1 .89 1 .61 2.95 1 .88 FORTRAN CODE 16 15 288 1 87 1 ,073 549 393 490 29 25 509 333 2,571 1 ,3 1 8 1 , 556 1 , 1 67 71 34 1 ' 1 78 469 4, 9 1 6 1 , 803 2, 949 1 ,437 3 .00 0.90 3.1 1 1 .3 7 1 . 60 0.77 1 .29 0.49 2 . 08 1.17 1 .84 1 .2 1 1 . 58 1 .09 1 .80 1 .2 1 C_M ETRIC.OBJ C_M ETR I C . E X E GRAM. OBJ GRAM.EXE INTER P.OBJ I NTERP. EX E I I MATRIX300X.OBJ MATRI X300X. E X E NAGL.OBJ NAG L . E XE SPICE. OBJ SPICE.EXE WAVEX.OBJ WAVEX.EXE Ratio Digital Tec h n ical journal Vol . l O N o . ! 998 97 currently i n tleJd test. Work is under way to provide 12. similar capa bilities in the lade bug debugger"·" compo nent of th e DIG ITAL UN I X operating syste m . There are and will always b e more opportunities and new challenges to im prove the ability to debug opti 1 3 . D . B li c kste i n see how the capabilities described in this paper provide et cial lssue I'Ol. 4, n o . 4 1 99 2 ) : 1 2 1-1 3 6 . 1 4 . B . B e and er, " V&V.. D E B UG : An I nteractive, Symbolic, Multili ngual D e b ugge r, " ACM 5!CSOFT/5!G'PI..A N Soft major benefits. We find it much harder to see what capa war(' Eugineering S)lmpusium on I-Jigb-Lel'el Deb ug bility cou ld provide the next major increment in debug ging ACM 5/CPLAN Notices. ging effectiveness when working wi tl1 optimized code. 1 9 8 3 ) : 1 7 3-1 79 . 15. References C omp il er ( Spe al . , " Th e G E M O p t im i z ing Sy s tem , " / .! igi lal Tech nicaljoumal, mized code . Perhaps tl1e biggest problem of all is to fig ure out where best to focus future anention. l t is easy to A . Adl -Tabatabi, "Source- Level Debugging of Glob a l l y O ptimi z e d C o de ," P h . D . Di sse r t a tio n , C a rm:g i t: Me l lo n U n i ve r si ty, CM U-CS-9 6- 1 3 3 (June 1 996 ) . vo l . 18, Ope n \I,VfS Dehugp,er /vfan uol, Or d e r no. No. T E ( M a yn a rd , M ass . : Digital E q u ip m e nt (August 8 i\A- QSBJ B C orpo ra t i o n , November 1 9 9 6 ) . l . P. Ze l l wt: g t: r, " Interactive Source- Level D e b u gging o f Opti m i zed Programs," P h . D . D i ss e r ta t io n, U n i vers i ty of Cal ifornia, Xerox PARC CSL-84-5 ( M ay 1 9 84 ) . 2. i J . H e nness y, " S ymbolic Deb u ggn g o f Optimized C od e , " ACM ]i·(n lsactions on Programm ing Languoges (/1/d 5)s/ems. 3. vol 4, no. 3 (July 1982 ): 3 2 3-344. M. C opp e rm an , " D e b u ggi ng O p t i m i zed Code With our B e ing M isled," P h . D . Dissertation, U n iversity o f Cali forn i a a t Santa Cruz, U CSC Technical Rep or t (J u n e U CS C - C RL- 9 3 - 2 1 4. I I , 1 99 3 ) . 16. R. 3d Si te s , e d . , Alpha Architecture e d . ( VVob u rn, Mass. Reference Jllh! l l ! la/. Press, 1 9 98 ) . Husson, ''Experie nces Developing and Us i n g an Object-Oriented Li brar y tor Pro g ra m Manipu lat1011," OOP5LA c'rJI I(i:rence Pro ceedings. A CM 5/CPLAN Notices. vo l. 1 2 , no. l O ( O c tober 1 9 9 3 ) : 8 3-89. 1 7 . T. B i ng:hJm, N . Hobbs, and D . 1 8 . D(v,ital UNIX Ladehug [)ehu,�er Ma n ual, Order N o . AA-PZ7EE - Tl T E ( M avn a.rd , Mass . : Digi tal Eq u ipm e nt Corporation, March G . Hansen, and S . Si mmons , " A New Ap p ro Jc h to D ebu g gi n g O p ti m i ze d C o d e ," ACW SIC D igi tal 1996 ) . G . B rook s , PLAN '92 Co nference on Progr am mi ng Language Biographies Des ig n and !mplementalion, SJG'PLAN Notices, vol 2 7, no. 7 ( J u ly 1 9 9 2 ) : 1-l l 5 . Convex Co m p u te r C o rpora t io n , CONVJ;X CXdh Con cepts ( Richardson, Tex . : Convex Press, Order No. DSW-47 1 , M a y 1 9 9 1 ) . 6. D . Coutant, S. M e loy, and M . Ruscetta, "D O C: A PrJc to Source-Level De bu gg ing of G l o bally C od e ," Proceedings oft he 5/CPLAN 8H Con tical Approach Optimized ferenw on Pro,({ rctmmincf� La n,r.;uage Design and Imple m entat io n . Atl anta , 7. 8. Ga. (June 2 2-24, 1 9 8 8 ) : 1 2 5-1 34. L. Zurawski, "Sou rce - Leve l Debugging of GlobaLly Opti m i zed Code witl1 Expected Behavior," Ph . D . Disserta tion, University of I ll i no i s at Urbana-Champaign ( 1989 ) . U . H ol z le , C . Cham bers, �111d D . U n gar, "Deb u gging Code with Dynamic Deop ti m i zati on , " A CM SIG'PLA N '92 Couference on Programm ing Lcl l l gua.lw Desigu a n d lmplemenlalion, San Fr a n c i s co , Calif ( J u n e 1 7- 1 9 , 1 9 92 ) a n d SI GPLAi'-l Notices, vol. 27, no. 7 ( July 1 99 2 ) : 3 2-43. Op t imi z e d 9. L. Pollock an d M . S offa, H i g h - leve l D eb uggi n g with of' an I ncremenral O pti m ize r, " Proceedings o( " the Aid the .l lsi J-Jmuaii /nlernalioua/ Conference on 5rstem Scieuces (January 1 9 8 8 ) : 5 24-5 3 2 . 10. and M . S offa , " D e b u ggi ng Tai lori ng, " fulernational 5ympu L . Po l loc k , LV! . B i 1 ·e n s , Op ti mi z e d Code via silun on Software Testing and A na/)JSis (August 1994 ) . 11. 98 P. Feiler, "A Lang uage - Oriented I nteractive Prog rcl m m i ng Environment Based on C omp i l a tion Tech nol ogy," P h . D . D i sse rta ti on , Car ne gie - Me llo n University, CJ\-W - CS- 8 2 - 1 1 7 (May 1 9 8 2 ) . Digit:�! Tcchniccll Journ:�l Vol . 10 No. l ! 998 Ronald F. Brender Ro nald F. B rt: n d e r is a senior consultant software engineer in Compaq' s Core Technology Group, w he re he is working on botl1 the GEM co mpi l er and the l • 1 I X l ad e b u g pro j e c ts . D u r i n g his oreer, Ron bas worked in advanced deve lopm e nt and pr od u c t d e ve l op men t roles for BLISS, DI GITAL's DECsy�t<:m- 1 0 , PDP- 1 1 , VAX , and Alpha computn systems. He served as a representative on the ANSI and ISO standards commi tt<:e s tor fO Rl'RAN 77 an d later for Ada 83, al so sel·v ing as a U.S. D e par tme nt ofDcti::nse invited Di sti ng u i s hed Reviewer J nd a mem be r of the Ada Board and the Ada La ngua ge Maintenance Com m i t tee tor more than eight years. Ron j oi ne d Di gital Equipment COlvorat.ion in 1 9 70, after earning the d eg re es of B . .S. t:. ( <: n gi ne eri n g scienc<:s ) , M.S. ( a ppl ied mat h e m a t i cs ) , a nd P h . D . (computer .111d commu nication s c i en ces) in 1 9 6 5 , 1 9 68, and 1 969, respec tively, aJ l ti·om me Unive rsity of M ic h ig � << val , am s i zeof ( in t ) * 8 ) ; i n t arnt ) < { amt ; For example, the generated text a << b is replaced upon regeneration by the text i t_sh l _i n t_ i n (a , bl H l 1 6 - ( u i 1 7 + + + u i 2 0 * ( s l 2 1 & ( a rgc < � argc: < = • + s 1 2 2 : - - ( ( * & * sl 4 1 ) 01 600303 7 < •• ( 5u7 ) . s i t 5m6 & 1 7 3 1 0 4 4 3 8 u * + + ui 5 * ( n s igned int I + + ( ld2 6 ) ) & ( ( ( 0 7 6 1 ) * 2 1 3 7 1 6 7 7 2 1 L * sl27 ? u l2 8 & d 1 2 * + + d9 * DBL_EPSI LON * 7 e + 4 * + + 11 ., , d l O * d1 2 * ( " ld3 J * . � L * 9 . 1 - l d 3 2 * ++ f 3 3 - - . 7 3 9 2 E - 6 L * " l d 3 4 + ?. ? . 8 2 L + 1 . 9 1 * - - l d 3 5 >= H l d 3 7 ) =. F + ( + + f 3 8 ) + + + [ 3 9 * [4 0 > ( floa t ) + + f 4 1 * 1: 4 2 >= c l 4 + + : s c 4 3 & s s 4 4 1 ' I I C 1 3 & . 9 3 0 9 L ( u i 1 8 * 0 0 7 1 1 U * u i l 9 , sc 4 6 - - ? - - ld4 7 + l d4 8 : • • L d 4 9 - ld4 8 * + + ld50 : • + ld5l I >2 3 9 . 6 1 1 ) • - + + ar c ( i n t s ig ned ) argc + + ui 5 4 ) - + + · 1 7 > = • • u l 5 8 * argc - 9ul * + & ul59 * + + u l 6 0 ; + * < • • •+ Figure 2 Generated C Expression Digital Technical Journal Vol . 10 No. 1 1 998 1 03 It� on being rer u n , the regenerated test case asserts a with generated com piler d i rective flags revea l ed a bug stand ards violation ( tor examp l e , a shift of more than i n a compiler under test-it could not even comp i l e its the word length ) , the test is discarded and testing con own header files. ti n u es with the next case. Two problems "'�th the generator remain: ( l ) obtain i n g enough output fi-om t h e generated programs so that d i fferences are visible and ( 2 ) ensuri n g that the i n g the testi ng. O n l y those results that are exhi bited by generated programs resemble real-world programs so very short text are shown . Some of the res u l ts d erive that the developers are i n terested in the test res u l ts . from h a n d genera l i zation of a p roblem that origi n a l ly Solving these two problems brin gs the q u al i ty of test surfaced through random testi ng. i n put to level 7. The t1ick here is to begin generating the There was a reason for each res u l t . For example, the program not fi-om the C grammar nonterminal symbol server crash occurred when the tested compi ler got a translation -u nit but rather !Tom a model program stack overflow on a heavily loaded machi n e with a very described by a more e l a borate string in which some of l a rge memory. The operati n g system attempted to the program is already fully generated . As a simple cl u m p a gigabyte of com piler stack, which caused a l l example, suppose you want to generate a n u m ber of t h e o t h e r active users t o thrash , and many of t h e m a lso print statements at the end of the test progra m . The d u m ped tor l ack of memory. The many d isk d rives on starting string of the generating grammar might be n de f i ne P ( v ) p r i n l f ( � v • - % x \ \ n " , in t he server began a d ance of the lights that sopped up vi the remai n i n g free resources, causing the operators to boot the server to recover. Excel lent testi n g can m ;� kc ai n ( ) you unpopular with almost everyone. d e c l ara t i on - l i s e s t <> tement 1ist Test Distribution ex i t ( 0 ) ; ri t-list where the gram matical d e fi n i tion of pri nt- l i s t Each tested or comparison program m ust be executed IS given by pri n t l i s t P ( j den t i f ier ) ; pr i n t - l i pr i n t - l i s t P ( i denl i f i er ) There are n umerous ways t o u t i l ize a n e twork to distri b u te tests and the n gather the resu lts. One par ; mi nals for the t h ree l ists i nstead of j ust one for the standard C start symbol tra nslation- un i t . Programs generated tl-om this starting stri n g wi l l cause output j ust betore e x i t . Because d i fferences caused by rou nd i n g error were u n i nteresti n g t o u s , w e mod i fied t h is pri n t macro tor types f loa t a n d double to pri n t only a tew significant d igits. With a l ittle more effort, the expa nsion of pri n t - l i s t can be forced to pri n t each variable exactly once. A l ternatively, suppose a test designer receives a bug report fl·om the fi e l d , analyzes the report, and fixes the bug. I nstead of simply p utting the bug-causing case in the regression s u i te, the test designer c a n genera l i ze it in the m a n n er j ust presented so that many simi lar test cJses can be used to expl ore for other nearby bugs. The effect of l evel 7 is to augment the probabil i ties in the stochastic grammar with more precise and direct means of control . ticul arly simple way is to use continuously r u n n i n g watcher programs. E a c h watcher program periodically exami nes a common til e system for t h e existence of some particu lar fi les upon w h i c h the program can act. If no fi l es exist, the watcher program sleeps for a whi l e a n d tries agai n . O n most operating systems, watcher programs can be i m p l e m e n ted as command scripts. There is a test master and a n u m ber of test beds . The test master generates the test cases, assigns them to the test beds, a n d l a ter analyzes the resu l ts . Each test bed runs its assigned tests. The test master and test beds share a fi l e space, perhaps via a network. For each test bed there is a test i n p u t directory and a test output di rectory. A watcher program c a l l e d the test d river waits u n til a l l the ( poss i b l y remote ) test i nput d i rectories are empty. The test d river then writes its l atest generated test case i nto each of tlhe test i n p u t d i rectories a n d returns t o i t s watch -sleep cycle. F o r e a c h test b e d there is a test watcher program that waits u n t i l there is a fi le i n its test i np u t d i rectory. \Vhen a test watcher fi n d s a Forgotten Inputs The e l a borate com mand - l i ne fl ags, config fi les, and environ ment variables that condition the behavior of progr:�ms arc also i n p ut. Such input can also be gener ated using the same toolset that is used to generate the test programs. The very first test on the very first run Dig;i t
Source Exif Data:File Type : PDF File Type Extension : pdf MIME Type : application/pdf PDF Version : 1.6 Linearized : Yes Has XFA : No XMP Toolkit : Adobe XMP Core 5.2-c001 63.139439, 2010/09/27-13:37:26 Create Date : 2006:04:13 13:14:21+01:00 Creator Tool : Adobe Acrobat 7.05 Modify Date : 2013:01:11 09:17:25Z Metadata Date : 2013:01:11 09:17:25Z Producer : Adobe Acrobat 10.1.4 Paper Capture Plug-in with ClearScan Format : application/pdf Title : Digital Technical Journal, Volume 10, Number 1: Programming Languages & Tools Creator : Document ID : uuid:5d02c306-16df-4680-b15a-c80d2f01bca6 Instance ID : uuid:fc734410-8429-48c8-86e4-6efc5eb0db4b Page Layout : SinglePage Page Mode : UseOutlines Page Count : 111EXIF Metadata provided by EXIF.tools