Digital Technical Journal, Volume 10, Number 1


PROGRAMMING LANGUAGES & TOOLS

Volume 10 Number 1
1998

Editorial
Jane C. Blake, Managing Editor
Kathleen M. Stetson, Editor
Helen L. Patterson, Editor

Circulation
Kristine M. Lowe, Administrator

Production
Christa W. Jessica, Production Editor
Elizabeth McGrail, Typographer
Peter R. Woodbury, Illustrator

Advisory Board
Thomas F. Gannon, Chairman (Acting)
Scott E. Cutler
Donald Z. Harbert
William A. Laing
Richard F. Lary
Alan G. Nemeth
Robert M. Supnik

Cover Design
This special issue of the Journal focuses on Programming Languages & Tools, specifically on compiler software. For the cover, we have chosen the alchemist who transforms common elements into precious gold to represent the compiler developer who transforms code to extract the highest performance possible for software applications.

The cover was designed by Lucinda O'Neill of the Compaq Industrial and Graphic Design Group.

The Digital Technical Journal is a refereed journal published quarterly by Compaq Computer Corporation, 550 King Street, LKG1-2/W7, Littleton, MA 01460-1289.

Hard-copy subscriptions can be ordered by sending a check in U.S. funds (made payable to Compaq Computer Corporation) to the published-by address. General subscription rates are $40.00 (non-U.S. $60) for four issues and $75.00 (non-U.S. $115) for eight issues. University and college professors and Ph.D. students in the electrical engineering and computer science fields receive complimentary subscriptions upon request. Compaq customers may qualify for gift subscriptions and are encouraged to contact their sales representatives.

Electronic subscriptions are available at no charge by accessing URL http://www.digital.com/subscription. This service will send an electronic mail notification when a new issue is available on the Internet.

Single copies and back issues can be ordered by sending the requested issue's volume and number and a check for $16.00 (non-U.S. $18) each to the published-by address. Recent issues are also available on the Internet at http://www.digital.com/dtj.

Compaq employees may order subscriptions through Readers Choice at URL http://webrc.das.dec.com.

Inquiries, address changes, and complimentary subscription orders can be sent to the Digital Technical Journal at the published-by address or the electronic mail address, ctj@compaq.com. Inquiries can also be made by calling the Journal office at 978-506-6858.

Comments on the content of any paper and requests to contact authors are welcomed and may be sent to the managing editor at the published-by or electronic mail address.

Copyright © 1998 Compaq Computer Corporation. Copying without fee is permitted provided that such copies are made for use in educational institutions by faculty members and are not distributed for commercial advantage. Abstracting with credit of Compaq Computer Corporation's authorship is permitted.

The information in the Journal is subject to change without notice and should not be construed as a commitment by Compaq Computer Corporation or by the companies herein represented. Compaq Computer Corporation assumes no responsibility for any errors that may appear in the Journal.

ISSN 0898-901X
Documentation Number EC-P9706-18
Book production was done by Quantic Communications, Inc.

AlphaServer, Compaq, the Compaq logo, DEC, DIGITAL, the DIGITAL logo, ULTRIX, VAX, and VMS are registered in the U.S. Patent and Trademark Office.

DIGITAL UNIX, FX!32, and OpenVMS are trademarks of Compaq Computer Corporation.

Intel and Pentium are registered trademarks of Intel Corporation.

IRIX is a registered trademark of Silicon Graphics, Inc.

Microsoft, Visual C++, Windows, and Windows NT are registered trademarks of Microsoft Corporation.

MIPS is a registered trademark of MIPS Technologies, Inc.

NULLSTONE is a trademark of Nullstone Corporation.

Rogue Wave and .h++ are registered trademarks of Rogue Wave Software, Inc.

RS/6000 is a registered trademark of International Business Machines Corporation.

Solaris is a registered trademark of Sun Microsystems, Inc.

SPARC is a registered trademark of SPARC International, Inc.

SPEC and SPECint are registered trademarks of Standard Performance Evaluation Corporation.

UNIX is a registered trademark in the United States and in other countries, licensed exclusively through X/Open Company Ltd.

Other product and company names mentioned herein may be trademarks and/or registered trademarks of their respective owners.

December 1998
A letter to readers of the Digital Technical Journal

This issue is the last Digital Technical Journal to be published. Since 1985, the Journal has been privileged to publish information about significant engineering accomplishments for DIGITAL, including standards-setting network and storage technologies, industry-leading VAX systems, record-breaking Alpha microprocessors and semiconductor technologies, and advanced application software and performance tools. The Journal has been rewarded by continual growth in the number of readers and by their expressions of appreciation for the quality of content and presentation.

The editors thank the engineers who somehow made the time to write, the engineering managers who supported them, the consulting engineers and professors who reviewed manuscripts and made the process a learning experience for all of us, and, of course, the readers who are the reason the Journal came into existence 13 years ago.

With kind regards,

Jane Blake
Managing Editor

Kathleen Stetson
Editor

Helen Patterson
Editor

Digital Technical Journal
Volume 10 Number 1
Contents
Introduction
C. Robert Morgan, Guest Editor

Foreword
William C. Blake

Tracing and Characterization of Windows NT-based System Workloads
Jason P. Casmira, David P. Hunter, and David R. Kaeli

Automatic Template Instantiation in DIGITAL C++
Avrum E. Itzkowitz and Lois D. Foltan

Measurement and Analysis of C and C++ Performance
Hemant G. Rotithor, Kevin W. Harris, and Mark W. Davis

Alias Analysis in the DEC C and DIGITAL C++ Compilers
August G. Reinig

Compiler Optimization for Superscalar Systems: Global Instruction Scheduling without Copies
Philip H. Sweany, Steven M. Carr, and Brett L. Huber

Maximizing Multiprocessor Performance with the SUIF Compiler
Mary W. Hall, Jennifer M. Anderson, Saman P. Amarasinghe, Brian R. Murphy, Shih-Wei Liao, Edouard Bugnion, and Monica S. Lam

Debugging Optimized Code: Concepts and Implementation on DIGITAL Alpha Systems
Ronald F. Brender, Jeffrey E. Nelson, and Mark E. Arsenault

Differential Testing for Software
William M. McKeeman

Introduction

C. Robert Morgan
Senior Consulting Engineer and Technical Program Manager, Core Technology Group

The complexity of high-performance systems and the need for ever-increased performance to be gained from those systems creates a challenge for engineers, one that requires both experience and innovation in the development of software tools. The papers in this issue of the Journal are a few selected examples of the work performed within Compaq and by researchers worldwide to advance the state of the art. In fact, Compaq supports relevant research in programming languages and tools.

Compaq has been developing high-performance tools for more than thirty years, starting with the Fortran compiler for the DIGITAL PDP-10, introduced in 1967. Later compilers and tools for VAX computer systems, introduced in 1977, made the VAX system one of the most usable in history. The compilers and debugger for VAX/VMS are exemplary. With the introduction of the VAX successor in 1992, the 64-bit RISC Alpha systems, Compaq has continued the tradition of developing advanced tools that accelerate application performance and usability for system users. The papers, however, represent not only the work of Compaq engineers but also that of researchers and academics who are working on problems and advanced techniques of interest to Compaq.

The paper on characterization of system workloads by Casmira, Hunter, and Kaeli addresses the capture of basic data needed for the development of tools and high-performance applications. The authors' work focuses on generating accurate profile and trace data on machines running the Windows NT operating system. Profiling describes the point in the program that is most frequently executed. Tracing describes the commonly executed sequence of instructions. In addition to helping developers build more efficient applications, this information assists designers and implementers of future Windows NT systems.

Every compiler consists of two components: the front end, which analyzes the specific language, and the back end, which generates optimized instructions for the target machine. An efficient compiler is a balance of both components. As languages such as C++ evolve, the compiler front end must also evolve to keep pace. C++ has now been standardized, so evolutionary changes will lessen. However, compiler developers must continue to improve front-end techniques for implementing the language to ensure ever better application performance. An important feature of C++ compiler development is C++ templates. Templates may be implemented in multiple ways, with varying effects on application programs. The paper by Itzkowitz and Foltan describes Compaq's efficient implementation of templates. On a related subject, Rotithor, Harris, and Davis describe a systematic approach Compaq has developed for monitoring and improving C++ compiler performance to minimize cost and maximize function and reliability.

Improved optimization techniques for compiler back ends are presented in three papers. In the first of these, Reinig addresses the requirement in an optimizing compiler for an accurate description of the variables and fields that may be changed by an assignment operation, and describes an efficient technique used in the C/C++ compilers for gathering this information. Sweany, Carr, and Huber describe techniques for increasing execution speed in processors like the Alpha that issue multiple instructions simultaneously. The technique reorders the instructions in the program to increase the number of instructions that are simultaneously issued. Maximizing the performance of multiprocessor systems is the subject of the paper by Hall et al., which was previously published in IEEE Computer and updated with an addendum for this issue. The authors describe the SUIF compiler, which represents some of the best research in this area and has become the basis of one part of the ARPA compiler infrastructure project. Compaq assisted researchers by providing the DIGITAL Fortran compiler front end and an AlphaServer 8400 system.

As compilers become more effective in increasing application program performance, the ability to debug the programs becomes more difficult. The difficulty arises because the compiler gains efficiency by reordering and eliminating instructions. Consequently, the instructions for an application program are not easily identifiable as part of any particular statement. The debugger cannot always report to the application programmer where variables are stored or what statement is currently being executed. Application programmers have two choices: Debug an unoptimized version of the program or find some other technique for determining the state of the program. The paper by Brender, Nelson, and Arsenault reports an advanced development project at Compaq to provide techniques for the debugger to discover a more accurate image of the state of the program. These techniques are currently being added to Compaq debuggers.

One of the problems that tool developers face is increasing tool reliability. Tool developers, therefore, test the code. However, developers are often biased; they know how their programs operate, and they test certain aspects of the code but not others. The paper by McKeeman describes a technique called differential testing that generates correct random tests of tools such as compilers. The random nature of the tests removes the developers' bias. The tool can be used for two purposes: to improve existing tools and to compare the reliability of competitive tools.

The High Performance Technical Computing Group and the Core Technology Group within Compaq are pleased to help develop this issue of the Journal. Studying the work performed within Compaq and by other researchers worldwide is one way that we remain at the cutting edge of technology of programming language, compiler, and programming tool research.
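The differential testing idea summarized in the Introduction can be illustrated with a minimal Python sketch. It is not McKeeman's tool: the expression generator, the two toy evaluators, and every function name below are assumptions made for illustration. The sketch generates random but syntactically valid inputs, feeds each one to two independent implementations, and reports every input on which they disagree.

```python
import random


def gen_expr(rng, depth=0):
    """Generate a random, syntactically valid integer expression (hypothetical grammar)."""
    if depth >= 3 or rng.random() < 0.3:
        return str(rng.randint(0, 9))
    op = rng.choice(["+", "-", "*"])
    return f"({gen_expr(rng, depth + 1)} {op} {gen_expr(rng, depth + 1)})"


def eval_reference(expr):
    """First implementation: Python's built-in expression evaluator."""
    return eval(expr, {"__builtins__": {}})


def eval_candidate(expr):
    """Second, independent implementation: a small recursive-descent evaluator."""
    tokens = expr.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def parse():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok != "(":
            return int(tok)          # a literal operand
        left = parse()
        op = tokens[pos]
        pos += 1
        right = parse()
        pos += 1                     # skip the closing ")"
        if op == "+":
            return left + right
        if op == "-":
            return left - right
        return left * right

    return parse()


def differential_test(impl_a, impl_b, trials=200, seed=0):
    """Return the random inputs on which the two implementations disagree."""
    rng = random.Random(seed)
    mismatches = []
    for _ in range(trials):
        expr = gen_expr(rng)
        if impl_a(expr) != impl_b(expr):
            mismatches.append(expr)
    return mismatches
```

An empty result from `differential_test(eval_reference, eval_candidate)` means the two evaluators agreed on every generated input; any returned expression is a candidate bug report for one of the two implementations.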

Foreword

William C. Blake
Director, High Performance Technical Computing and Core Technology Groups

You might think that the cover of this issue of the Digital Technical Journal is a bit odd. After all, what could be the relevance of those ancient alchemists in the drawing to the computer-age topic of programming languages and tools? Certainly, both alchemists and programmers work busily on new tools. An even more interesting metaphorical connection is the alchemist and the compiler software developer as creators of tools that transform (transmute, in the strict sense of alchemy) the base into the precious. The metaphor does, however, break down. Unlike the myth and folklore of alchemy, the science and technology of compiler software development is a real and important part of processing a new solution or algorithm into the correct and highest performance set of actual machine instructions. This issue of the Journal addresses current, state-of-the-art work at Compaq Computer Corporation on programming languages and tools.

Gone are the days when programmers plied their craft "close to the machine," that is, working in detailed machine instructions. Today, system designers and application developers, driven by the pressures of time to market and technical complexity, must express their solutions in terms "close to the programmer" because people think best in ways that are abstract, language dependent, and machine independent. Enhancing the characteristics of an abstract high-level language, however, conflicts with the need for lower level optimizations that make the code run fastest. Computers still require detailed machine instructions, and the high-level programs close to the programmer must be correctly compiled into those instructions. This semantic gap between programming languages and machine instructions is central to the evolution of compilers and to microprocessor architectures as well. The compiler developer's role is to help close the gap by preserving the correctness of the compilation and at the same time resolving the trade-offs between the optimizations needed for improvements "close to the programmer" and those needed "close to the machine."

To put the work described in this Journal into context, it is helpful to think about the changes in compiler requirements over the past 15 years. It was in the early 1980s that the direction of future computer architectures changed from increasingly complex instruction sets, CISC, that supported high-level languages to computer architectures with much simpler, reduced instruction sets, RISC. Three key research efforts led the way: the Berkeley RISC processor, the IBM 801 RISC processor, and the Stanford MIPS processor. All three approaches dramatically reduced the instruction set and increased the clock rate. The RISC approach promised improvements up to a factor of five compared with CISC machines using the same manufacturing technology. Compaq's transition from the VAX to the Alpha 64-bit RISC architecture was a direct result of the new architectural trend.

As a consequence of these major architectural changes, compilers and their associated tools became significantly more important. New, much more complex compilers for RISC machines eliminated the need for the large, microcoded CISC machines. The complexities of high-level language processing moved from the petrified software of CISC microprocessors to a whole new generation of optimizing compilers. This move caused some to claim that RISC really stands for "Relegate Important Stuff to Compilers."

The introduction of the third-generation Alpha microprocessor, the 21264, demonstrates that the shift to RISC and Alpha system implementations and compilers served Compaq customers well by producing reliable, accurate, and high-performance computers. In fact, Alpha systems, which have the ability to process over a billion 64-bit floating-point numbers per second, perform at levels formerly attained only by specialized supercomputers. It is not surprising that the Alpha microprocessor is the most frequently used microprocessor in the top 500 largest supercomputing sites in the world.

After reading through the papers in this issue, you may wonder what is next for compilers and tools. As physical limits curtail the shrinking of silicon feature sizes, there is not likely to be a repeat of the performance gains at the microprocessor level, so attention will turn to compiler technology and computer architecture to deliver the next thousandfold increase in sustained application performance. The two principal laws that affect dramatic application performance improvements are Moore's Law and Amdahl's Law. Moore's Law states that performance will double each 18 months due to semiconductor process scaling; and Amdahl's Law expresses the diminishing returns of various system speedup enhancements. In the next 15 years, Moore's Law may be stopped by the physical realities of scaling limits. But Amdahl's Law will be broken as well, as improvements in parallel language, tool development, and new methods of achieving parallelism will positively affect the future of compilers and hence application performance.

As you will see in papers in this issue, there is a new emphasis on increasing execution speed by exploiting the multiple instruction issue capability of Alpha microprocessors. Improvements in execution speed will accelerate dramatically as future compilers exploit performance improvement techniques using new capabilities evolved in Alpha. Compilers will deliver new ways of hiding instruction latency (reducing the performance gap between vector processors and RISC superscalar machines), improved unrolling and optimization of loops, instruction reordering and scheduling, and ways of dealing with parallel decomposition and data layout in nonuniform memory architectures. The challenges to compiler and tool developers will undoubtedly increase over time.

By not relying on hardware improvements to deliver all the increases in performance, compiler wizards are making their own contributions, always watchful of correctness first, then run-time performance, and, finally, speed and efficiency of the software development process itself.
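The two laws named in the Foreword can be made concrete with a short Python sketch. The Foreword states them only in words; the formulas below are the commonly cited textbook forms, and the function and parameter names are assumptions chosen for illustration. The first function computes the Amdahl's Law speedup when a fraction of the work is accelerated, and the second computes how long a performance target would take at a fixed doubling period such as the 18 months quoted above.

```python
import math


def amdahl_speedup(fraction_enhanced, enhancement_factor):
    """Overall speedup when `fraction_enhanced` of the run time is made
    `enhancement_factor` times faster (textbook form of Amdahl's Law)."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / enhancement_factor)


def years_to_gain(target_factor, doubling_period_years=1.5):
    """Years needed to reach `target_factor` if performance doubles every
    `doubling_period_years` (the 18-month figure quoted in the Foreword)."""
    return math.log2(target_factor) * doubling_period_years


if __name__ == "__main__":
    # Speeding up 90% of a program tenfold gives only about a 5.3x gain overall.
    print(round(amdahl_speedup(0.9, 10), 2))
    # A thousandfold gain at one doubling per 18 months takes roughly 15 years.
    print(round(years_to_gain(1000), 1))
```

Under these assumptions, the thousandfold increase mentioned in the Foreword corresponds to about ten doublings, or roughly 15 years of scaling, while the Amdahl's Law result shows why the portion of a program left unimproved quickly dominates the total.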

Jason P. Casmira
David P. Hunter
David R. Kaeli

Tracing and
Characterization of
Windows NT-based
System Workloads

To optimize the design of pipelines, branch predictors, and cache memories, computer architects study the characteristics of benchmark programs by examining traces, i.e., samples of program execution. Since commercial desktop applications are increasingly dependent on services and application programming interfaces provided by the host operating system, the authors argue that traces from benchmark execution must capture operating system execution in addition to native application execution. Common benchmark-based workloads, however, lack operating system execution. This paper discusses the ongoing joint efforts of the Northeastern University Computer Architecture Research Laboratory and Compaq Computer Corporation's Advanced and Emerging Technologies Advanced Development Group to capture operating system-rich traces on Alpha-based machines running the Windows NT operating system. The authors describe the latest PatchWrx software toolset and demonstrate its trace-generating capabilities by characterizing numerous applications. Included is a discussion of the fundamental differences between using traces captured from common benchmark programs and using those captured on commercial desktop applications. The data presented demonstrates that operating system execution can dominate the overall execution time of desktop applications such as Microsoft Word, Microsoft Visual C/C++, and Microsoft Internet Explorer and that the characteristics of the operating system instruction stream can be quite different from those typically found in benchmarking workloads.


The computer architecture research community commonly uses trace-driven simulation in pursuing answers to a variety of design issues. Architects spend a significant amount of time studying the characteristics of benchmark programs by examining traces, i.e., samples taken from program execution. Popular benchmark programs include the SPEC [1] and the BYTEmark [2] benchmark test suites. Since the underlying assumption is that these programs generate workloads that represent user applications, today's computer designs have been optimized based on the characteristics of these benchmark programs.

Although the authors of popular benchmarks are well intentioned, the resulting workloads lack operating system execution and consequently do not represent some of the most prevalent desktop applications, e.g., Microsoft Word, Microsoft Visual C/C++, and Microsoft Internet Explorer. Such applications make heavy use of application programming interfaces (APIs), which in turn execute many instructions in the operating system. As a result, the overall performance of many desktop applications depends on efficient operating system interaction. Clearly operating system overhead can greatly reduce the benefits of a new computer design feature. Past architectural studies, however, have generally ignored operating system interaction because few tools can generate operating system-rich traces.

This paper discusses the ongoing joint efforts of Northeastern University and Compaq Computer Corporation to capture operating system-rich traces on DIGITAL Alpha-based machines running the Microsoft Windows NT operating system. We argue that for traces of today's workloads to be accurate, they must capture the operating system execution as well as the native application execution. This need to capture complete program trace information has been a driving force behind the development and use of software tools such as the PatchWrx dynamic execution-tracing toolset, which we describe in this paper.

The PatchWrx toolset was originally developed by Sites and Perl at Digital Equipment Corporation's Systems Research Center. They described PatchWrx, as developed for Windows NT version 3.5, in "Studies of
Windows NT Performance Using Dynamic Execution Traces." [3] The Northeastern University Computer Architecture Research Laboratory and Compaq's Advanced and Emerging Technologies Advanced Development Group continue to develop the toolset. We have updated the framework to operate under Windows NT version 4.0, added the ability to trace programs that have code sections larger than 4 megabytes (MB), added multiple trace buffer sizes, and developed additional postprocessing tools.

After briefly discussing related tracing tools, we describe the PatchWrx toolset and specify the new features we have added. We then analyze PatchWrx traces captured on Windows NT version 4.0, demonstrating the capabilities of the tool while illustrating the importance of capturing operating system-rich traces. In the final section, we summarize the paper, discuss the current limitations of the toolset, and suggest new directions for development and study.

Trace Generation Tools

Trace-driven simulation has been the method of choice for evaluating the merits of various architectural trade-offs. [4, 5] Traces captured from the system under test are recorded and replayed through a model of the proposed architecture design. Computer researchers have proposed methodologies that capture both application and operating system references. These tools include hardware-based [6-10] and software-based [11-15] methods. Some of the issues involved in capturing operating system-rich traces are

1. Tracing overhead (system slowdown)
2. Accuracy (perturbation of the memory address space)
3. Completeness (capturing all desired information, e.g., the operating system reference stream)

Table 1 contains a list of 10 tracing tools that have been developed over the past 10 to 15 years. Although far from complete, this list provides a sample of the tools that have been used to generate input to a variety of trace-driven simulation studies. We have characterized each tool in terms of the three issues (criteria) previously mentioned. Table 1 lists the target platform(s) for each tracing tool.

Note that many of these tools cannot capture operating system activity. For those that can, their associated slowdown can significantly affect the accuracy of the captured trace. Of the tools that provide this capability, PatchWrx introduces the least amount of slowdown yet maintains the integrity of the address space. The next section discusses the PatchWrx toolset.

PatchWrx

PatchWrx is a dynamic execution-tracing toolset developed for use on the Alpha-based Microsoft Windows NT operating system. The toolset utilizes the Privileged Architecture Library (PAL) facility, also referred to as PALcode, of the Alpha microprocessor to perform tracing with minimal overhead. [22] PatchWrx can instrument, i.e., patch, all Windows NT application and system binary images, including the kernel, operating system services, drivers, and shared libraries.

The PAL facility is a set of architected functions and instructions that provides a consistent interface to a set of complex system functions. These routines provide primitives for memory management, context switching, interrupts, and exceptions.

PatchWrx and the Alpha PAL Routines

The PatchWrx software tool is made possible through the PAL used by DIGITAL Alpha microprocessors. PAL routines have access to physical memory and internal hardware registers and operate with interrupts disabled. PALcode is loaded from disk at system boot
time. We modified and extended the shrink-wrapped Alpha PALcode on a DIGITAL Alpha 21064-based system to support the PatchWrx operations. The modified PatchWrx PAL routines serve two major purposes: (1) to reserve the trace buffer at system boot time and (2) to log trace entries at trace time.

Table 1
Sample of Tracing Tools

Name           Average           Address       Operating System  Platform
               Slowdown          Perturbation  Activity
ATOM [13]      10X to 100X       No            Yes               DIGITAL Alpha UNIX
ATUM [16]      20X               No            Yes               DIGITAL VAX OpenVMS
EEL [17]       10X to 100X       Yes           No                SPARC Solaris
Etch [18]      35X               Yes           No                Intel x86 Microsoft Windows NT V4.0
NT-Atom [19]   10X to 100X       No            No                DIGITAL Alpha Microsoft Windows NT V4.0
PatchWrx [3]   4X                No            Yes               DIGITAL Alpha Microsoft Windows NT V4.0
Pixie [20]     10X to 100X       Yes           No                DIGITAL MIPS ULTRIX
QPT [12]       10X to 100X       Yes           No                SPARC Solaris, DIGITAL ULTRIX
Shade [21]     6X                No            No                SPARC Solaris
SimOS [14]     10X to 50,000X    No            Yes               DIGITAL Alpha UNIX, SGI IRIX, SPARC Solaris
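The comparison that Table 1 supports can be checked with a small Python sketch. The rows are transcribed from Table 1; the tuple layout, the helper names, and the choice to rank by the lower and then the upper bound of each slowdown range are assumptions made for illustration, not part of the paper.

```python
# Rows transcribed from Table 1:
# (name, min slowdown, max slowdown, perturbs address space, captures OS activity)
TRACING_TOOLS = [
    ("ATOM", 10, 100, False, True),
    ("ATUM", 20, 20, False, True),
    ("EEL", 10, 100, True, False),
    ("Etch", 35, 35, True, False),
    ("NT-Atom", 10, 100, False, False),
    ("PatchWrx", 4, 4, False, True),
    ("Pixie", 10, 100, True, False),
    ("QPT", 10, 100, True, False),
    ("Shade", 6, 6, False, False),
    ("SimOS", 10, 50000, False, True),
]


def os_capable_tools(tools):
    """Tools that capture operating system activity, ordered by (min, max) slowdown."""
    capable = [t for t in tools if t[4]]
    return sorted(capable, key=lambda t: (t[1], t[2]))


def meets_all_criteria(tool, max_slowdown):
    """True if the tool captures OS activity, does not perturb the address
    space, and never exceeds the given slowdown factor."""
    _name, _lo, hi, perturbs, captures_os = tool
    return captures_os and not perturbs and hi <= max_slowdown


if __name__ == "__main__":
    print([t[0] for t in os_capable_tools(TRACING_TOOLS)])
    print([t[0] for t in TRACING_TOOLS if meets_all_criteria(t, 10)])
```

With the table's figures, PatchWrx ranks first among the tools that capture operating system activity, which matches the statement that it introduces the least slowdown of the tools with that capability.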

One way that PatchWrx maintains a low operating overhead is to store the captured trace in a physical memory buffer, which is reserved at boot time. The size of the buffer can be varied depending on the amount of physical memory installed on the system. Since we use PAL routines to reserve this memory, the operating system is not aware that the memory exists because the PALcode performs all low-level system initialization before the operating system is started. PatchWrx logs all trace entries in this buffer. Writing trace entries directly to physical memory has several advantages. First, writing to memory is much faster than writing to disk or to tape. Second, using physical memory allows tracing of the lowest levels of the operating system (i.e., the page fault handler) without generating page faults. Third, using physical memory allows tracing across multiple threads running in multiple address spaces regardless of which address space is currently running.

To enable PatchWrx to operate under Windows NT versions 3.51 and 4.0, we started with the PAL routines modified by Sites and Perl and made additional modifications as required by the operating system versions. These modifications were concentrated in the process data structures. The PatchWrx-specific PAL routines are listed in Table 2. The first three routines are used for reading the trace entries from the buffer and for turning tracing on and off. The remaining five routines are used to log trace entries based on the type of instruction instrumented.

PatchWrx Image Instrumentation

Next we describe how we use PatchWrx to instrument Microsoft Windows NT images. Patching the operating system involves the instrumentation of all the binary images, including applications, operating system executables, libraries, and kernel. Once patching is complete, trace entries are logged by means of PAL routines as images execute.

We define a patched instruction as an instruction within an image's code section that is overwritten with an unconditional branch (BR) to a patch. The target of the BR contains the patch section. The patch section includes the trap (CALL_PAL) to the appropriate PAL routine that logs a trace entry corresponding to the type of instruction patched and the return branch to the original target.

PatchWrx does not modify the original binary images; instead, it generates new images that contain patches. This operation preserves the original images on the system in case they need to be restored.

Instrumentation involves replacing all branching instructions of type unconditional branch, conditional branch (e.g., branch if equal to zero [BEQ]), branch to subroutine (BSR), function return (RET), jump (JMP), and jump to subroutine (JSR) within an image's code section with unconditional branches to a patch section. If loads and stores are also traced, PatchWrx replaces these instructions (e.g., load sign-extended longword [LDL]) with unconditional branches to the patch section, where the original load or store instruction is copied. A return branch is also needed to return control flow to the instruction subsequent to the original load. When PatchWrx encounters this patch, the tool records the register value of the original load or store instruction in the trace log. The patch section contains all the patches for the image and is added to the rewritten image. Figure 1 shows examples of patched instructions. PatchWrx replaces only branch instructions within an image to reduce the type and number of entries logged in the trace buffer. Using these traced branches, the tool can later reconstruct the basic blocks they represent.

As shown in Figure 1, PatchWrx replaces BR and JMP instructions with BR instructions that transfer control to the patch section. The original BR or JMP instruction is repeated in the patch section for the purpose of recording the value of the target register (if necessary) into the trace buffer when the patched image is executed. This register value is necessary for reconstructing the traced instruction stream. PatchWrx

Table 2

PatchWrx-specific PAL Routines

8

PAL Routines

Function

PWR D E NT

Read a trace entry from trace memo ry

PWP E E K

Read an arbitrary l ocation (for debug)

PWCTRL

I n iti a l i ze, turn tracing on/off

PWB S R

Record a branch to s u b routi ne

PWJSR

Record a j u m p/call/return

PWLDST

Record a load/store base reg ister va l ue

PWBRT

Record a co nditional branch taken bit

PWB RF

Record a condit i o n a l branch fal l -thro u g h bit

DigiL11 Tec h nical Journ,l l

Vul . ! 0 No. l

1 99 8

tor

PATCH E D CODE

ORIGINAL CODE
EXAMPLE

1

MP

ZERO , ( R1 9 )

Jl!P

Z'i8RO, (Rl9)

PATCH . O O l :

EXAMPLE 2

J S R R2 6 , ( R1 9 )

���

P.'\TCH . 0 0 2 :

EXAMPLE 3

BEQ R 3 , TARGET . 0 0 3

BR

CALL_PAL PltJJSR
J�lP ZERO , ( R 1 9 )

BSR R2 6 , PAT CH . 0 0 2

CALL_PAL PWJSR
JMP ZERO , ( R l 9 )

BEQ R3 . �RSE� . 002

BACK . 0 0 3

PATCH . 0 0 3 :

l?l>.TCH . 0 0 1

BR

PAT . H . 0 0 3

BEQ R 2 , PATCH . 0 0 3 T
Cli.LL_P AL PWBRF
BR BACK . 0 0 3

PATCH . 0 0 3 T :

EXAMPLE

Figure 1
Instruction

Patch

4

LDL R 2 0 , 4 ( R 1 6 )

LDL R20 , 41Rl6 )
Bli.CK .

04

1?/I.TCH .

00 4 :

CAL _PAL P BRT
B R TARG ET . 0 0 3

BR P TCH . 0 0 4

CALL_PAL PWLDST
LDL R2 0 , 4 ( Rl 6 )
BR 8 ACK . 0 0 4

Examples

replaces JSR and BSR instructions with BSR patches. This replacement preserves the return address (RA) register field value, which contains the return address for the subroutine. Again, the original instruction is repeated in the patch section so that the register value can be recorded during tracing to facilitate reconstruction.

Conditional branches have a larger and more complex patch than the other branch types because the original condition is duplicated and resolved within the patch. The taken or fall-through path generates a bit value when logged within the taken or fall-through trace entry. The return branch in the patch section is a replica of the original conditional branch.

As explained earlier, for all patches, PatchWrx replaces the original branch with an unconditional patch branch. Since Alpha instructions are equal in size, this replacement process allows patching without increasing the code size within the image. Although the code size remains unchanged, the image size will increase in proportion to the number of patches added. This image size change becomes an issue for dynamically linked library (DLL) images.
Patching Dynamic Link Libraries

The Microsoft Windows NT operating system provides a memory management system that allows sharing between processes. For example, two processes that edit text files can share the text editor application image that has been mapped into memory. When the first process invokes the editor, the operating system loads the application into memory and maps the process's virtual address space to it. When the second process invokes the editor, rather than load another editor image, the operating system maps the second process's virtual address space to the physical pages that contain the editor. Of course, both processes contain local storage for private data.

DLLs are loaded into memory and shared in this manner. When patches are added to a DLL, the size of the image increases. When this image is mapped to physical memory (as per its preferred base load address), the larger image may overlap with another image having a base address within the new range. This image overlap can prevent the operating system from booting properly: some environment DLLs will conflict in memory because they perform calls directly into other DLLs at fixed offsets. To resolve this issue, we rebase [24] the preferred base load addresses of the patched DLLs, which modifies the base load address of each patched DLL to eliminate conflicts. Rebasing affects the address accuracy of the patched system, though we are able to readjust the addresses during reconstruction. An increase in the paging activity may also be observed since the additional code may cross page boundaries.
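The rebasing idea can be illustrated with a minimal sketch. This is not the rebase utility the authors used. It only shows one way to assign non-overlapping base addresses once patched images have grown. The function names, the tuple layout, the example DLL names and sizes, and the 64 KB alignment value are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Assumed granularity for image base addresses (illustrative value).
ALIGNMENT = 0x10000


def round_up(value: int, alignment: int) -> int:
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment


def rebase(images: List[Tuple[str, int, int]]) -> Dict[str, int]:
    """Assign each (name, preferred_base, patched_size) image a base address
    so that no two images overlap. An image keeps its preferred base unless
    the previous (grown) image now extends past it."""
    new_bases: Dict[str, int] = {}
    next_free = 0
    for name, preferred_base, size in sorted(images, key=lambda im: im[1]):
        base = round_up(max(preferred_base, next_free), ALIGNMENT)
        new_bases[name] = base
        next_free = base + size
    return new_bases


# Hypothetical example: a.dll grew from 0x20000 to 0x28000 bytes after
# patching and now reaches into the range where b.dll prefers to load.
patched = [
    ("a.dll", 0x10000000, 0x28000),
    ("b.dll", 0x10020000, 0x10000),
    ("c.dll", 0x20000000, 0x8000),
]
bases = rebase(patched)
```

In this sketch only b.dll moves; images whose preferred ranges remain free keep their original base addresses, which limits the amount of address readjustment needed during reconstruction.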
The original version of the PatchWrx toolset was developed on Microsoft Windows NT version 3.5. When versions 3.51 and 4.0 were released, several modifications were made to the image format. In completing the 3.51- and 4.0-compatible versions of PatchWrx, we had to address this issue. One change that affected how we patch was the placement of the Import Address Table (IAT) into the front of the initial code section of executable binary images. This table is used to look up the addresses of DLL procedures used (i.e., imported) by the executable binary. In developing the current generation of PatchWrx, we had to make modifications to use image header fields that had previously remained unused or reserved, indicating the executable code sections that contained data areas.

Another issue that we addressed in the recent modifications to PatchWrx was long branches. The original version of PatchWrx replaces all branch, jump, call, and return instructions with either BR or BSR instructions to the patch section. Since the PatchWrx tool has no information about machine state during the patching phase, it is impossible to utilize other branching instructions (e.g., JMP or JSR instructions) to provide this branch-to-patch transition. Register and register indirect branching instructions would require perturbing the machine state. Therefore, the developers could use only program counter (PC)-based offset branching instructions.

As discussed previously, in replacing a control flow instruction with a patch branch, PatchWrx uses a BR or BSR instruction in which the offset field is set to branch to the corresponding patch within the image's patch section. The Alpha architecture branching instructions use the format shown in Figure 2.

[Fields: OPCODE (bits 31-26), REG (bits 25-21), 21-BIT DISPLACEMENT (bits 20-0)]

Figure 2
Alpha Branch Instruction Format

[LBR instruction format: 25-bit displacement. LBSR instruction format: 20-bit displacement.]

Figure 3
PALcode Long Branch Instruction Formats

The branch target virtual address computation for this format is

    newPC = (oldPC + 4) + (4 * sign-extended(21-bit branch displacement))

The register field holds the return address for BSRs. With this branch format and target virtual address computation, the Alpha architecture provides a branch target range of ±4 MB from an instruction's current PC.
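As a quick check of the range quoted above, the target computation can be written out directly. The following is a minimal sketch; the function names are illustrative and are not part of PatchWrx.

```python
def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    return (value ^ sign_bit) - sign_bit


def branch_target(old_pc: int, disp21: int) -> int:
    """newPC = (oldPC + 4) + 4 * sign-extended(21-bit displacement)."""
    return (old_pc + 4) + 4 * sign_extend(disp21, 21)


MB = 1024 * 1024
farthest_forward = branch_target(0, 0x0FFFFF)   # largest positive displacement
farthest_backward = branch_target(0, 0x100000)  # most negative displacement
```

The largest positive displacement reaches 4 MB ahead of the branch, and the most negative displacement reaches just under 4 MB behind it, which is why a patch section placed more than 4 MB from the branch cannot be reached with BR or BSR.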
Several applications that run today on Microsoft Windows NT version 4.0 are sufficiently large that the displacement between a control flow instruction to be patched and the patch location within the patch section exceeds this 4-MB limit. (Recall that since we want to avoid moving code or data sections, the patch section is placed at the end of the image.) To address this problem, we developed two new branch instructions for use with PatchWrx. These new branches were not implemented in the instruction set architecture of the Alpha architecture. Instead, we used PALcode to implement them. The two new branches are designated long branch (LBR) and long branch subroutine (LBSR). Figure 3 illustrates the format of these two instructions.

The computation of the target virtual address is

    newPC = (oldPC + 4) + (4 * sign-extended(25-bit branch displacement))

for LBR branches and

    newPC = (oldPC + 4) + (32 * zero-extended(20-bit branch displacement))

for LBSR branches. PatchWrx uses LBRs when patching any control flow instruction that has a displacement greater than 4 MB. PatchWrx uses LBSRs similarly for control flow instructions that must preserve the register field value.

When an LBR or LBSR instruction is executed within the image code section, a trap to PALcode occurs. Normally, CALL_PAL instructions have one of several defined function fields that cause a corresponding PAL routine to be executed. The two long branch instructions have function fields that do not belong to any of the defined CALL_PAL instructions and therefore force an illegal instruction exception within the PALcode. This PALcode flow has been modified to detect if a long branch has been encountered.


As shown in Figure 3, both long branch types have the same PALcode operation code (opcode) value of 000000. To distinguish between the two types, the least significant bit in the instruction word is set to 0 for LBRs and to 1 for LBSRs. This bit is not included as a usable bit for the displacement fields of either branch type. Consequently, each LBR has a 25-bit displacement field and each LBSR has a 20-bit field. With a 25-bit usable displacement field, the PALcode performs the LBR target address computation, allowing a ±64-MB range.

Since each LBSR instruction has a 20-bit displacement field, whereas the original Alpha architecture branch displacement field is 21 bits, the target instruction address computation for LBSR instructions is performed differently than for standard branches within the PALcode. As shown in the address computation equation, the 20-bit displacement is multiplied by 32 rather than by 4 (as for the LBR branch). Notice that the 20-bit displacement is always zero extended. The computation provides the LBSR instruction with a displacement of +32 MB.
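The two ranges follow directly from the address equations. The sketch below works on displacement values only; it does not model the bit positions of the fields within the instruction word, which are not spelled out in the text. Function names are illustrative.

```python
def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    return (value ^ sign_bit) - sign_bit


def lbr_target(old_pc: int, disp25: int) -> int:
    """LBR: newPC = (oldPC + 4) + 4 * sign-extended(25-bit displacement)."""
    return (old_pc + 4) + 4 * sign_extend(disp25, 25)


def lbsr_target(old_pc: int, disp20: int) -> int:
    """LBSR: newPC = (oldPC + 4) + 32 * zero-extended(20-bit displacement)."""
    return (old_pc + 4) + 32 * (disp20 & 0xFFFFF)


MB = 1024 * 1024
lbr_forward = lbr_target(0, (1 << 24) - 1)   # largest positive LBR displacement
lbr_backward = lbr_target(0, 1 << 24)        # most negative LBR displacement
lbsr_forward = lbsr_target(0, 0xFFFFF)       # largest LBSR displacement
```

The LBR reaches 64 MB in either direction, while the LBSR reaches only forward, to just under 32 MB past the instruction, in steps of 32 bytes.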
This computation procedure has two implications. First, LBSR instructions can only be used to branch from an image code section to an image's patch section. Second, branches into the patch section are either BR or BSR instructions (or their long displacement counterparts). PatchWrx uses only BR or LBR instructions to return from the patch section to the original branch target within a code section; BSR and LBSR instructions are never used. Therefore, restricting LBSR instructions to use positive displacements does not present a problem.

The LBSR displacement multiplier value of 32 does present some restrictions, however. The multiplier value of 4 used in the original Alpha instruction set architecture represents the instruction word length of 4 bytes. Thus, normal branch instruction target addresses must be aligned on a 4-byte boundary. By using the multiplier value of 32 for LBSR instructions, LBSR target addresses are restricted to align on a 32-byte (i.e., eight-instruction) boundary. Since all LBSR targets reside within the patch section, this restriction does not pose a problem. If an LBSR is to be inserted into the image code section and the next available patch target address is not aligned properly, PatchWrx can insert no operation (NOP) instruction words and advance the next available patch target address until the necessary alignment is achieved. PatchWrx never executes the NOPs; they are inserted for alignment purposes only. Although inserting these NOP instructions increases the image size, we have implemented several optimizations into the instrumentation algorithm to minimize this increase. For example, a queue is used to hold LBSRs that do not align. As LBR patches are committed, PatchWrx probes the queue to determine if any LBSRs align from their origin to the newly available patch target offset.
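The amount of padding needed for a given LBSR can be sketched as follows. This is an illustration only, not the PatchWrx algorithm: it assumes that alignment is measured relative to (oldPC + 4), as the LBSR address equation implies, and it ignores the queue optimization. The function name and constants are hypothetical.

```python
INSTRUCTION_BYTES = 4   # Alpha instruction word length
LBSR_SCALE = 32         # LBSR displacement multiplier


def nops_needed(lbsr_pc: int, next_patch_addr: int) -> int:
    """Number of NOP words to insert before the next patch so that an LBSR
    located at lbsr_pc can reach it.

    Assumes the reachable targets are (lbsr_pc + 4) + 32 * k for k >= 0,
    and that both addresses are 4-byte aligned."""
    origin = lbsr_pc + INSTRUCTION_BYTES
    if next_patch_addr < origin:
        raise ValueError("LBSR displacements are zero extended: target must follow the branch")
    misalignment = (next_patch_addr - origin) % LBSR_SCALE
    pad_bytes = (LBSR_SCALE - misalignment) % LBSR_SCALE
    return pad_bytes // INSTRUCTION_BYTES


already_aligned = nops_needed(0x1000, 0x1004 + 64)
worst_case = nops_needed(0x1000, 0x1004 + 68)
one_short = nops_needed(0x1000, 0x1004 + 60)
```

At most seven NOP words are needed per LBSR under these assumptions, which is the cost the queue optimization described above tries to avoid by pairing waiting LBSRs with patch offsets that already happen to align.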

Trace Capture

The PatchWrx toolset allows the user to turn tracing on and off and thus capture any portion of workload execution. The tracing tool is also responsible for copying trace entries from the physical memory buffer to disk. Copying the trace buffer to disk is performed after tracing has stopped so that the time required to perform the copy does not introduce any overhead during trace capture.

PatchWrx logs a trace entry for each patch encountered during program execution. As the traced image executes instructions within its code section, it encounters an unconditional PatchWrx branch. Instead of branching to the original target, the patched branch transfers control to the image's patch section. Within the patch section, a PatchWrx PALcall traps to the PAL routine corresponding to the patch type and logs a trace entry to the trace buffer. The PAL routine then returns to the instruction following the CALL_PAL instruction. PatchWrx uses an unconditional branch to transfer control from the patch section back to the original target within an image code section. During the execution of the PatchWrx PAL routine, necessary machine state information is recorded and logged in the trace buffer. This allows for the capture of register contents, process ID information, etc., which are used later during trace reconstruction.

The trace capture facility captures the dynamic execution of a workload running on the system. To reconstruct the trace after it has been captured, the tracing tool must also capture a snapshot of the base load addresses of all active images on the system. This snapshot serves as the virtual address map used in reconstructing the trace. Each active process and its associated libraries are loaded into a separate address space, at base addresses that may differ from the preferred load addresses specified statically in the image headers. If each image were loaded into memory at its preferred base address, the virtual address map would not be necessary to perform reconstruction. Instead, PatchWrx could map target addresses from the trace buffer using the base address values contained in the static image headers.
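A virtual address map of this kind is essentially a sorted table of loaded images that is searched by address. The sketch below shows one possible lookup for a single address space. The record layout, the function names, and the example base addresses and sizes are assumptions for illustration; the actual snapshot format is not described in the text.

```python
import bisect
from typing import List, NamedTuple, Optional, Tuple


class LoadedImage(NamedTuple):
    name: str
    base: int   # dynamic base load address captured in the snapshot
    size: int   # size of the (patched) image in bytes


def build_address_map(images: List[LoadedImage]) -> List[LoadedImage]:
    """Sort the snapshot by base address so it can be binary searched."""
    return sorted(images, key=lambda image: image.base)


def find_image(address_map: List[LoadedImage], address: int) -> Optional[Tuple[str, int]]:
    """Return (image name, offset within image) for a traced target address,
    or None if the address falls outside every loaded image."""
    bases = [image.base for image in address_map]
    index = bisect.bisect_right(bases, address) - 1
    if index < 0:
        return None
    image = address_map[index]
    if address < image.base + image.size:
        return image.name, address - image.base
    return None


# Hypothetical snapshot for one process.
snapshot = build_address_map([
    LoadedImage("ntdll.dll", 0x77F60000, 0x60000),
    LoadedImage("winword.exe", 0x30000000, 0x500000),
])
hit = find_image(snapshot, 0x30001234)
miss = find_image(snapshot, 0x10)
```

The returned offset is what lets the reconstruction tool locate the corresponding instruction in the patched image file, independent of where the image happened to be loaded.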
The type of trace record that PatchWrx logs into the trace buffer depends on the type of branch or low-level PAL function being traced. Figure 4 shows the trace record formats. The first three trace entry formats consist of an 8-bit opcode and a 24-bit time stamp. The time stamp is the low-order 24 bits of the CPU cycle counter. The 32-bit field of these three formats depends on the type of trace entry logged. The 32-bit field of the first format is used for target virtual addresses for all unconditional direct and indirect branches, jumps, calls, returns, interrupts, and returns from interrupts. The 32-bit field of the second format is used to record the base register value for traced load and store instructions and stack pointer values that are flushed into the trace buffer during system calls and returns. The 32-bit field of the third format is used for logging the current active process ID at a context swap.
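The first three formats can be modeled as a single 64-bit word. The sketch below assumes, for illustration only, that the opcode occupies the most significant 8 bits, followed by the time stamp and then the 32-bit field; the text gives the field widths but not their order in memory. The opcode value used in the example is also hypothetical.

```python
from typing import Tuple


def pack_entry(opcode: int, cycle_counter: int, field32: int) -> int:
    """Build a 64-bit trace entry: 8-bit opcode, 24-bit time stamp, 32-bit field.
    The time stamp keeps only the low-order 24 bits of the cycle counter."""
    time_stamp = cycle_counter & 0xFFFFFF
    return ((opcode & 0xFF) << 56) | (time_stamp << 32) | (field32 & 0xFFFFFFFF)


def unpack_entry(word: int) -> Tuple[int, int, int]:
    """Split a 64-bit trace entry into (opcode, time stamp, 32-bit field)."""
    return (word >> 56) & 0xFF, (word >> 32) & 0xFFFFFF, word & 0xFFFFFFFF


def elapsed_cycles(earlier_stamp: int, later_stamp: int) -> int:
    """Cycles between two entries, allowing for one wrap of the 24-bit stamp."""
    return (later_stamp - earlier_stamp) & 0xFFFFFF


entry = pack_entry(0x12, 0x1ABCDEF01, 0x80001000)
fields = unpack_entry(entry)
wrapped = elapsed_cycles(0xFFFFF0, 0x000010)
```

Because the stamp is only 24 bits wide, differences between consecutive entries are meaningful only when fewer than 2^24 cycles separate them; longer gaps are ambiguous in this encoding.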

    OPCODE (8 bits) | TIME STAMP (24 bits) | TARGET PC (32 bits)
    OPCODE (8 bits) | TIME STAMP (24 bits) | BASE REGISTER VALUE (32 bits)
    OPCODE (8 bits) | TIME STAMP (24 bits) | NEW PROCESS ID (32 bits)
    OPCODE (3 bits) | START BIT (1 bit) | VECTOR OF 60 TAKEN/FALL-THROUGH TWO-WAY BRANCH BITS (60 bits)

Figure 4
Trace Entry Formats

The fourth trace entry type is used for tracing conditional branches. It uses a 3-bit opcode and up to 60 taken/fall-through bits. A start bit is used to determine how many bits are active. The start bit is set to 1 if a conditional branch is taken and to 0 if the branch is not taken. This recording scheme allows a compact encoding of conditional branch trace entries. During trace reconstruction, PatchWrx uses conditional branch trace entries to reconstruct the correct instruction flow when conditional branches are encountered and to provide concise information about when to deliver interrupts in loops.

Trace Reconstruction

The reconstruction phase is the final step in generating a full instruction stream of traced system activity. As shown in Figure 5, trace reconstruction requires several resources in order to generate an accurate instruction stream of all traced system activity.

Trace reconstruction reads and initializes the heading of the captured trace, which includes a time stamp, the name of the user who captured the trace, and any important system configuration information, e.g., the operating system version number. Next, reconstruction reads the first four raw trace records, which are automatically entered whenever tracing is turned on. These records contain the first target virtual address, the active process ID, the value of the stack pointer, and the first taken/fall-through record to be used (such records always precede the branches they represent). PatchWrx uses this information to initialize the necessary data structures of the reconstruction process.

[Diagram elements: PATCHED IMAGE (three shown), CAPTURED RAW TRACE, VIRTUAL ADDRESS MAP, RECONSTRUCTION TOOL, RECONSTRUCTED INSTRUCTION STREAM]

Figure 5
Instruction Stream Reconstruction Resources

Using the first target virtual address and process ID pair from the captured trace, trace reconstruction consults the virtual address map to determine in which image the instruction falls (based on its dynamic base load address) and where that image is physically located on the system. The tool consults the patched image to determine the actual instruction at the target address, records this instruction, and then reads the next instruction from the patched image. This process continues until reconstruction encounters either a conditional branch or an unconditional branch. A conditional branch causes the tool to check the first active bit of the current taken/fall-through entry to determine subsequent control flow; the process then continues at that address. If an unconditional branch is encountered, reconstruction records the entry and checks it against the next captured trace entry. If the two entries match, the tool outputs the recorded instructions to an instruction stream file, consults the captured trace entry for the next target instruction virtual address, and repeats the procedure until the entire captured trace has been processed.

Since PatchWrx captures interrupts and other low-level system activities (e.g., page faults) in the trace, these activities must also be reconstructed. When PatchWrx logs an interrupt into the trace buffer, the corresponding target virtual address in the captured record represents the address of the first instruction not executed when the interrupt was taken. PatchWrx flushes the currently active taken/fall-through entry to the memory buffer and initializes a new taken/fall-through entry. This new entry will be responsible for

the conditional branches encountered beginning with the interrupt service routine. The address of the first instruction within the interrupt service routine is then logged in the trace.

During reconstruction, the reconstruction tool looks for the interrupt's first unexecuted instruction address to know which instruction to stop at when reconstructing the instruction stream. The tool then begins reconstructing the instruction stream, including the interrupt handler stream. If the unexecuted instruction is within a loop, trace reconstruction utilizes the taken/fall-through entry convention. On taking the interrupt, the active taken/fall-through record is flushed and another record is started. This process allows the tool to continue to reconstruct iterations of the loop until all the taken/fall-through bits are exhausted.

Operating System-Rich Workload Characterization

As presented in the study by Lee et al., desktop applications and benchmarks share some workload characteristics, but applications alone do not represent full system behavior. To investigate and address system design issues, computer architects should use operating system-rich traces.

To illustrate this point, we present a sample of the various workload characteristics that exist in a set of benchmark and desktop applications specially selected to study the differences in the use of the operating system and related services. The first characteristic we discuss is the amount of time each benchmark or desktop application spends within three domains:

1. Application-only domain (e.g., winword.exe and excel.exe)
2. DLL domain: Win32 user (e.g., kernel32.dll, user32.dll, and ntdll.dll)
3. Operating system domain: Win32 kernel, kernel, system processes, system idle process (e.g., Win32K.sys, ntoskrnl.exe, drivers, and the spooler)

Examining these times provides insight into a workload's use of each domain. We also examine DLL and system service usage on an image basis for each workload. This breakdown helps us more clearly identify the dependence between the workload and the system services provided by the Windows NT operating system.

We also present the instruction mix of each workload with and without the inclusion of the operating system execution. Understanding the differences in instruction composition in the presence of system activity further highlights the behavior lacking in application-only traces, such as increases in branch and memory instructions, when compared to application-only workloads. We present the average basic block lengths for each domain of execution (application-only, DLL, operating system) separately and then in combination. This metric reveals which workload domain dominates the branching behavior. Casmira's work provides a more complete description of these differences across a wider set of workload characteristics. [25]

Workload Descriptions

We performed all the experiments reported on in this paper on a DIGITAL Alpha platform running the Microsoft Windows NT version 4.0 operating system. We captured the traces on a 150-megahertz Alpha 21064 processor. The system configuration included 80 MB of physical memory. Table 3 lists the workloads we examined.

Table 3
Workload Description

Workload    Description
fourier     BYTEmark benchmark; a numerical analysis routine for calculating series approximations of waveforms
neural      BYTEmark benchmark; a small, functional back-propagation network simulator
go          SPEC95 Go! game benchmark
li          SPEC95 Lisp interpreter benchmark
cdplay      Microsoft CD Player playing a music CD
fx!32       DIGITAL FX!32 V1.1 interpreting/translating included OpenGL sample x86 application
ie          Microsoft Internet Explorer V2.0 following a series of web page links
vc50        Microsoft Visual C/C++ V5.0 compiling a 3,000-line C program
word        Microsoft Word97 V7.0, spell-checking a 15-page document

The fourier and neural workloads are from the BYTEmark benchmark test suite: the neural workload is a small array-based floating-point test; the fourier workload is designed to measure transcendental and trigonometric floating-point unit performance.

The go and li workloads are from the SPEC95 integer benchmark suite: the go workload is a simulation of the game Go!, with the computer playing against itself; the li workload is a Lisp interpreter. All the workloads use the standard inputs provided with the benchmarks and are compiled with the default optimization level using the native Alpha version of Microsoft C/C++ version 5.0.

The cdplay workload is the Microsoft CD Player application included in Microsoft Windows NT version 4.0. The device was traced while playing a music CD using default playing options (e.g., playing all the songs in order).

The fx!32 workload is the DIGITAL FX!32 version 1.1 emulator/translator provided by Compaq's DIGITAL Alpha Migration Tools Group. We ran the robot arm OpenGL sample Intel-based application in the foreground during trace capture.

The ie workload is the standard Microsoft Internet Explorer version 2.0 workload included in Microsoft Windows NT version 4.0. The ie workload was traced while traversing four links through the Sony home web page, arriving finally at the Sony PlayStation Store web page. The trace was captured on May 4, 1998; pages may have changed since this date. The history cache and the web link cache were both empty when the trace was captured.

The vc50 workload is the Microsoft C/C++ version 5.0 compiler compiling a 3,000-line C source code file. We used the command line interface, and we used the default optimization levels and other parameters, which best represented the common usage of the compiler.

The word workload is Microsoft Word from the Microsoft Office97 desktop application suite for the Alpha processor used to capture a manual spell check of a 15-page Microsoft Word document. The standard Microsoft Word dictionary was employed.

To provide a clear and representative comparison of workload behavior, we captured several traces. For all scenarios, full traces of each workload captured approximately 5 to 10 seconds of execution, filling the 45-MB trace buffer. To characterize workload behavior, each experiment was run with the benchmark or application as the only activity on the system. Each workload was run in the foreground.

To ensure that the traces captured were representative of the overall workload behavior, we captured multiple traces. We chose different points during execution for tracing to allow comparison between different portions of the selected scenarios. To investigate the variability present in selected workloads, we traced additional scenarios. A second Microsoft Word trace was captured with the application performing an autoformat operation of the same document used in the first trace of the spell-check operation, and we captured a second Microsoft Internet Explorer trace, repeating the Sony links but with the links cached. We captured a second trace of FX!32 using the included boggle sample game (for comparison against using the OpenGL application input). Additionally, the FX!32 translator was traced while it optimized a native Intel x86 application's profile. To condense the number of memory pages occupied by an image, Microsoft designed the new linker to allow data to reside within the code regions. Hookway and Herdeg provide an explanation of the DIGITAL FX!32 emulation and translation/optimization procedures. Casmira discusses these scenarios and others.

Domain Mix

To illustrate the inherent differences between benchmark and desktop application behavior, we break down the captured trace in terms of three mutually exclusive domains. These domains are (1) application, (2) DLL, and (3) operating system. The application domain represents the set of executed instructions that are within the traced application's executable image.

[Chart; x-axis: WORKLOAD (FOURIER, NEURAL, GO, LI, CDPLAY, FX!32, IE, VC50, WORD)]

Figure 9
Average Basic Block Length

[...] within their executable images. Therefore, including any operating system activity into a basic block length average has a minimal effect.

However, considering the large amount of operating system execution present in the cdplay trace, the overall basic block length is significantly less than the application-only length. The overall and operating system length values are almost the same. Not only does including the system activity in the trace influence the overall basic block length but the amount of system activity determines to what degree the length is affected.

In a similar fashion, the overall basic block length of the fx!32 trace tracks that of its DLLs. The length is directly proportional to the amount of time the workload spends in its DLL domain. The execution of the ie workload is more evenly distributed among the three domains, which affects the overall basic block length, producing a more evenly weighted average of all its domain basic block lengths (no one domain dominates).

The vc50 workload spends a significant amount of time within its own executable image, which leads to an overall average basic block length similar to the application-only value. The word workload is similar, but the DLL behavior dominates. The cdplay and ie workloads experience a 50 percent decrease in average basic block length. This decrease can be attributed to an increase in the number of branches in the presence of operating system activity. With this increase in control flow instructions, we expect increased pressure to be placed upon the branch prediction hardware.

As observed in other characteristic categories, the four benchmarks do not exhibit noticeable deviations from application-only behavior when the operating system activity is introduced. Again this explains why simulation results using benchmark traces usually track the actual performance when the benchmarks are run on the real system. In contrast, four of the five desktop applications exhibit significantly different behavior in the presence of the operating system.

Summary

In this paper we described the PatchWrx toolset. We compared it to existing tools and demonstrated the need for operating system-rich traces by showing the amount of the total execution spent in the kernel and the DLLs. In addition, we showed that existing desktop benchmarks do not exercise the kernel and the DLL sufficiently to provide meaningful indicators of desktop performance.

These results have reinforced our argument that researchers need to use traces with both application and operating system information, especially as new applications spend more time executing within the operating system. The goal is for computer architects to use operating system-rich traces of applications that

6.

[...] the Eleventh Symposium on Computer Architecture (June 1994): 126-135.

K. Flanagan, J. Archibald, B. Nelson, and K. Grimsrud, "BACH: BYU Address Collection Hardware; The Collection of Complete Traces," Proceedings of the Sixth International Conference on Modeling Techniques and Tools for Computer Evaluation (1992): 51-65.

7.

D. Kaeli, O. LaMaire, W. White, P. Henner, and W. Starke, "Real-Time Trace Generation," International Journal on Computer Simulation, vol. 6, no. 1 (1996): 53-68.

8.

9.

D. Kaeli, L. Fong, D. Renfrew, K. Imming, and R. Booth, "Performance Analysis on a CC-NUMA Prototype," IBM Journal of [...]


David P. Hunter
David Hunter is the engineering manager of Compaq Computer
Corporation's Advanced and Emerging Technologies Group. Prior to that
he was the manager of DIGITAL's Software Partner Engineering Advanced
Development Group, where he was involved in performance
investigations of databases and their interactions with the UNIX and
Windows NT operating systems. He has held positions in the Alpha
Migration Organization, the ISV Porting Group, and the Government
Group's Technical Program Management Office. David joined DIGITAL's
Laboratory Data Products Group in 1983, where he developed the VAXlab
User Management System. He was the project leader of the advanced
development project, ITS, an executive information system, for which
he designed hardware and software components. David has two patent
applications pending in the area of software engineering. He holds a
degree in electrical and computer engineering from Northeastern
University in Boston, Massachusetts, and a diploma in National
Security and Strategic Studies from the United States Naval War
College in Newport, Rhode Island.


Biographies

Jason P. Casmira
Jason Casmira received B.S. and M.S. degrees in electrical
engineering from Northeastern University in 1996 and 1998,
respectively, and is pursuing a Ph.D. degree in computer science at
the University of Colorado, Boulder. For the past two years, Jason
was a member of the Northeastern University Computer Architecture
Research Laboratory (NUCAR), where he focused on developing the
current version of the PatchWrx tracing toolset. He also investigated
issues related to studying operating system-rich traces. While at
NUCAR, Jason was supported by a grant from the National Science
Foundation. He has published seven papers and is a member of the IEEE
and the Eta Kappa Nu honor society.

David R. Kaeli
David Kaeli received Ph.D. (1992) and B.S. (1981) degrees in
electrical engineering from Rutgers University and an M.S. degree in
computer engineering from Syracuse University in 1985. He joined the
electrical and computer engineering faculty at Northeastern
University in 1993 after spending 12 years at IBM, the last 7 of
which were at the IBM T. J. Watson Research Center in Yorktown
Heights, New York. David is the director of the Northeastern
University Computer Architecture Research Laboratory (NUCAR), where
he investigates the performance and design of high-performance
computer systems and software. His current research topics include
I/O workload characterization, branch prediction studies, memory
hierarchy design, object-oriented code execution performance, 3-D
microelectronics, and back-end compiler design. He frequently gives
tutorials on the subject of trace-driven char

template <class T> class Stack {
    T *top_of_stack;
public:
    void push(T arg);
    void pop(T &arg);
};

The act of applying the arguments to the template is referred to as
template instantiation. An instantiation of a template creates a new
type or function that is defined for the specified types. Stack<int>
creates a class that provides a stack of the type int.
Stack<user_class> creates a class that provides a stack of
user_class. The types int and user_class are the arguments for the
template Stack.
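
The following is a minimal sketch of the two instantiations named above, not the authors' code. The member bodies, the std::vector storage, and the helper functions last_int_pushed and last_user_pushed are assumptions added so that the example compiles on its own.

```cpp
#include <string>
#include <vector>

// Simplified Stack: std::vector storage is an assumption made for brevity
// and replaces the raw top_of_stack pointer of the original declaration.
template <class T>
class Stack {
    std::vector<T> items_;
public:
    void push(T arg) { items_.push_back(arg); }
    // Copies the top element into arg and removes it; leaves arg untouched if empty.
    void pop(T &arg) {
        if (items_.empty()) return;
        arg = items_.back();
        items_.pop_back();
    }
};

// Stand-in for the article's user_class (its real contents are not shown there).
struct user_class {
    std::string name;
};

// Using Stack<int> makes the compiler instantiate the class for int.
int last_int_pushed() {
    Stack<int> s;
    s.push(1);
    s.push(2);
    int top = 0;
    s.pop(top);
    return top;
}

// Using Stack<user_class> creates a second, unrelated class.
std::string last_user_pushed() {
    Stack<user_class> s;
    user_class first = {"first"};
    user_class second = {"second"};
    s.push(first);
    s.push(second);
    user_class top;
    s.pop(top);
    return top.name;
}
```

Stack<int> and Stack<user_class> share source text but are distinct types, so an object of one cannot be passed where the other is expected.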


In general, a template needs to be instantiated when it is
referenced. When a class template is instantiated, only those member
functions and static data members that are referenced are also
instantiated. In the Stack example, the member function push of the
class Stack<int> needs to be instantiated only if it is used.
Template functions and static data members have global scope;
therefore, only one instantiation of each should be in a user's
application. Since source files are compiled separately and combined
later at link time to produce an executable, the compiler alone is
not able to ensure that one and only one instance of a specific
template is efficiently generated for any given executable. That is,
the compiler by itself is not able to know whether the function or
variable definition for a specific template is satisfied by code
generated in another object module.

The C++ Standard provides facilities for the user to specify where a
template entity should be instantiated. When the user explicitly
specifies template instantiation, the user then becomes responsible
for ensuring that there is only one instantiation of the template
function or static data member per application. This responsibility
can necessitate a considerable amount of work. However, the compiler
and linker working together can provide effective template
instantiation without specific user direction.

In the following section, we present the various approaches that can
be used for template instantiation.

Template Instantiation Techniques

Template instantiation techniques can be broadly categorized as
either manual or automatic. With manual instantiation, the
compilation system responds to user directives to instantiate
template entities. These directives can be in the source program, or
they may be command-line options. With automatic instantiation, the
compilation system, including the linker, decides which
instantiations are required and attempts to provide them for the
user's application.

Manual Instantiation

Manual template instantiation is the act of manually specifying that
a template should be instantiated in the file that is being compiled.
This instantiation is given global external linkage, so that
references to the instantiation that are made in other files resolve
to this template instantiation. Manual template instantiation
includes explicit instantiation requests and pragmas as well as
command-line options.

Explicit Instantiation Requests and Pragmas
The compilation system instantiates those template entities that the
user specifies for instantiation. The specification can be made using
the C++ explicit template instantiation syntax or may be made using
implementation-defined directives or pragmas. Since instantiations
are given global external linkage, the user must ensure that the
specified template instantiations appear only once throughout all the
modules that compose the program. When only this mode of
instantiation is used, the user also must ensure that all required
template instantiations are specified to avoid unresolved symbols at
link time.

Command-line Instantiation
Command-line options can be used to specify template instantiation.
They are similar in operation to the explicit instantiation requests,
except they indicate groups of templates that should be instantiated,
rather than naming specific templates to be instantiated. The
command-line options include

• Instantiate All Templates. A command-line option can direct the
  compiler to instantiate all template entities whose definitions are
  known during compilation and whose argument types are specified.
  This has the advantage of specifying many template instantiations
  at once. The user must still ensure that no template instantiation
  happens more than once in the program and that all required
  instantiations are satisfied. Due to these requirements, the user
  cannot usually specify this option on more than one source-file
  compilation in the program. This option can also cause the
  instantiation of templates that are not used by the program.

• Instantiate Used Templates. A command-line option can be used to
  direct the compiler to instantiate only those template entities
  that are used by the source code and whose definitions are known at
  compilation. As in the previous technique, the user must ensure
  that no template instantiation happens more than once in the
  program and that all required instantiations are satisfied. Due to
  these requirements, the user cannot usually specify this option on
  more than one source-file compilation in the program.

• Instantiate Used Templates Locally. This command-line option works
  like the instantiate used templates option, except that it defines
  each template instantiation locally in the current compilation.
  This option has the advantage of providing complete template
  instantiation coverage for the program, as long as the definitions
  of the used templates are available in each module. Since all
  template instantiations are given local scope, there is no
  potential problem with multiply defined instantiations when the
  program is linked. The major problem with this technique is that
  the user's application can be unnecessarily large, since the same
  template instantiations could appear within multiple object files
  used to link the application. This technique will fail if the
  instantiations must have global scope such as a class's static data
  members.
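
As an illustration of the explicit instantiation requests discussed above, the sketch below uses the standard C++ explicit instantiation syntax. The Counter class, the twice function, and the helper total_after_two_adds are hypothetical names invented for this example; they do not come from the article.

```cpp
// Hypothetical class template with a static data member.
template <class T>
class Counter {
    T total_;
public:
    static int instances_created;          // static data member of the template
    Counter() : total_() { ++instances_created; }
    void add(T amount) { total_ += amount; }
    T total() const { return total_; }
};

template <class T>
int Counter<T>::instances_created = 0;

// Hypothetical function template.
template <class T>
T twice(T value) { return value + value; }

// Explicit instantiation requests. In a multi-file program these lines
// would appear in exactly one source file, as the text requires.
template class Counter<int>;             // every defined member of Counter<int>
template double twice<double>(double);   // a single function instantiation

// Ordinary use; the definitions come from the requests above.
int total_after_two_adds() {
    Counter<int> c;
    c.add(3);
    c.add(4);
    return c.total();
}
```

Because the class request instantiates every defined member, including the static data member, it behaves more like the instantiate all option for that one class than like the instantiate used option.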


Figure 1 shows an example of a template function, template_func, that
contains a locally defined static variable. As shown in the figure,
the object files of both A and B contain local copies of
template_func instantiated with int. Each instance of
template_func<int> defines its own version of static variable x. In
this case, directing the compiler to instantiate used templates
locally yields a different result than instantiating all or used
templates globally.

If we give the static data members global scope and ensure that they
are properly defined and initialized by executable code rather than
by static initialization, we can solve the static data members
problem. The application, however, remains unnecessarily large,
because multiple copies of the instantiated templates can be present
in the executable.
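
The difference in results can be imitated in a single file. In the sketch below, each namespace stands in for one object file that holds its own private copy of template_func<int>; this namespace layout, the return value (the article's version prints x + p instead), and the helper names local_result and global_result are assumptions made so that the behavior can be observed without a multi-file build.

```cpp
// Stand-in for object file A with a locally instantiated copy.
namespace module_a {
    template <class T> T template_func(T p) {
        static T x = 0;      // private to module_a's copy
        T result = x + p;
        x++;
        return result;
    }
}

// Stand-in for object file B with its own locally instantiated copy.
namespace module_b {
    template <class T> T template_func(T p) {
        static T x = 0;      // private to module_b's copy
        T result = x + p;
        x++;
        return result;
    }
}

// Stand-in for a single, globally visible instantiation.
namespace shared {
    template <class T> T template_func(T p) {
        static T x = 0;      // one x for the whole program
        T result = x + p;
        x++;
        return result;
    }
}

// Local instantiation: B's copy has not seen A's increment of x.
int local_result() {
    module_a::template_func(10);
    return module_b::template_func(20);
}

// Global instantiation: the second call sees x already incremented.
int global_result() {
    shared::template_func(10);
    return shared::template_func(20);
}
```

On a first call, local_result yields 20 while global_result yields 21, which is the kind of divergence the text attributes to local instantiation.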

// template.hxx
#include <iostream.h>
template <class T> void template_func(T p)
{
    static T x = 0;
    cout << x + p;
    x++;
}

// A.cxx
#include "template.hxx"
extern void b_func();
int main()
{
    template_func(10);
    b_func();
    return 0;
}

// B.cxx
#include "template.hxx"
void b_func(void)
{
    // ...
    template_func(20);
    // ...
}

Figure 1
Template Function Containing a Locally Defined Static Variable

Automatic Instantiation

Automatic template instantiation relieves the user of the burden of
determining which templates must be instantiated and where in the
application those instantiations should take place. Automatic
template instantiation can be divided into two categories:
compile-time instantiation, whereby the decision about what should be
instantiated is made at compile time, and link-time instantiation,
whereby decisions about template instantiation are made when the
user's application is linked. In both cases, specific link-time
support is needed to select the required instantiations for the
executable.

Compile-time Instantiation
Two major techniques can be used to perform automatic template
instantiation at compile time. The choice between the two depends
upon the facilities available in the linker. Microsoft Visual C++
instantiates templates at compile time using a strategy similar to
the instantiate used templates command-line option described
previously. Each instantiation is placed in the communal data section
(COMDAT) of the current compilation's object file. Each object file
contains a copy of every template instantiation needed by that
compilation unit. COMDATs are sections that have an attribute that
tells the linker to accept, without issuing a warning, multiple
definitions of a symbol defined in the section. If more than one
object file defines that symbol, only the section from one object
file is linked into the image and the rest are discarded, along with
all symbols in the symbol table defined in the discarded section
contribution. At link time, the linker resolves an instantiation
reference by choosing one of the instantiations defined in an
individual object file's COMDAT. The resulting user's application
executable has a single copy of each requested instantiation.

When such linker support is not available, another mechanism must be
used to control compile-time instantiation. One such approach is to
use a repository to contain the generated instantiations. The
compiler creates the instantiations in the repository instead of the
current compilation's object file. At link time, the linker includes
any requested instantiations from the repository. As a performance
improvement, the compiler can also decide whether an instantiation
needs to be generated from the state of the repository. If the
requested instantiation is in the repository and can be determined to
be up to date, the compiler does not need to regenerate the
instantiation.

Link-time Instantiation
The decision to instantiate can be left until link time. The linker
can find the instantiations that are needed and direct the compiler
to generate those instantiations. McCluskey describes one link-time
instantiation scheme. The compiler logs every class, union, struct,
or enum in a name-mapping file in a repository. Every declared
template is also logged in the name-

mapping file. At link time, a prelinker determines which template
instantiations are required. The prelinker builds temporary
instantiation source files in the repository to satisfy the
referenced instantiations, compiles them, and adds the resulting
object files to the linker input.

Consider the example in Figure 2. During the compilation of main.cxx,
a name-mapping file is built in the repository and the location of
the user-defined class C and the function template,
perform_some_function, are recorded. From the information stored in
the name-mapping file, an instantiation source file is then created
in the repository. Figure 3 shows the contents of the instantiation
source file created to satisfy perform_some_function<C>.

/* perform_some_function(C&) */
#include "template.hxx"
#include "template.cxx"
#include "C_class.hxx"

Figure 3
Example of an Instantiation Source File

The prelinker then compiles the instantiation source file by invoking
the compiler in a special directed mode, which directs the compiler
to generate code only for specific template instantiations that are
listed on the command line. The compiler then generates the
definition of perform_some_function<C> in the resulting object file.
The resulting object now satisfies the instantiation request and is
included as part of the application's final link. To build the
instantiation source files easily, the implementation of this scheme
generally requires that template declarations, template definitions,
and any argument types used to instantiate a class or function
template must appear in separate, related header files.

The Edison Design Group has developed another approach to link-time
instantiation.7 In this approach, the compiler records where template
instantiations are used and where they can be instantiated. At link
time, a prelinker assigns template instantiations by recording the
assignments in a specially generated file that corresponds to the
particular source file that can successfully instantiate the user's
request. Compiling and prelinking the program used in Figure 2
generates an instantiation assignment file for main.cxx. This file
contains information concerning the command-line options specified,
the user's current working directory, and a list of instantiations
that should be instantiated. Main.cxx now owns the responsibility of
instantiating perform_some_function<C>. The prelinker recompiles the
source files, such as main.cxx, that have changes in their template
instantiation assignments. The process is repeated until there are no
changes made to the instantiation assignments. Then the final link
can be completed.

This approach has the advantage of requiring no special file
structure to support automatic template instantiation. It is
generally faster and simpler than McCluskey's approach, because fewer
files are compiled in the generation of the needed instantiations and
the instantiations are generated in the context of the user's source
code. In addition, the assignment of instantiations to source files
can be preserved between recompilations of the source code, so that
unless the structure of the application changes, the needed
instantiations will be available without additional recompilation.
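
The repeat-until-stable behavior of the assignment scheme can be pictured with a toy model. The sketch below is not the Edison Design Group implementation: the SourceFile record, the Dependencies map, the rule of assigning to the first capable file, and the extra template helper<C> are all assumptions chosen only to show why more than one pass may be needed.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Toy model of one source file as seen by a hypothetical prelinker.
struct SourceFile {
    std::string name;
    std::set<std::string> uses;             // instantiations referenced by the file
    std::set<std::string> can_instantiate;  // instantiations it is able to generate
    std::set<std::string> assigned;         // instantiations it has been told to generate
};

// Instantiations that a generated instantiation references in turn.
typedef std::map<std::string, std::set<std::string> > Dependencies;

// Repeats assignment passes until no assignment changes.
// Returns the number of simulated recompilations.
int assign_instantiations(std::vector<SourceFile> &files, const Dependencies &deps) {
    int recompilations = 0;
    bool changed = true;
    while (changed) {
        changed = false;
        // Collect what is needed now and what is already assigned.
        std::set<std::string> needed, assigned;
        for (const SourceFile &f : files) {
            needed.insert(f.uses.begin(), f.uses.end());
            for (const std::string &inst : f.assigned) {
                assigned.insert(inst);
                Dependencies::const_iterator d = deps.find(inst);
                if (d != deps.end()) needed.insert(d->second.begin(), d->second.end());
            }
        }
        // Give each unassigned instantiation to the first file able to generate it.
        std::set<std::string> recompiled;   // files whose assignments changed in this pass
        for (const std::string &inst : needed) {
            if (assigned.count(inst)) continue;
            for (SourceFile &f : files) {
                if (f.can_instantiate.count(inst)) {
                    f.assigned.insert(inst);
                    recompiled.insert(f.name);
                    changed = true;
                    break;
                }
            }
        }
        recompilations += static_cast<int>(recompiled.size());
    }
    return recompilations;
}

// main.cxx uses perform_some_function<C>, which (hypothetically) uses helper<C>.
int demo_recompilations() {
    SourceFile main_file;
    main_file.name = "main.cxx";
    main_file.uses.insert("perform_some_function<C>");
    main_file.can_instantiate.insert("perform_some_function<C>");
    main_file.can_instantiate.insert("helper<C>");
    std::vector<SourceFile> files(1, main_file);
    Dependencies deps;
    deps["perform_some_function<C>"].insert("helper<C>");
    return assign_instantiations(files, deps);
}
```

In this model the second recompilation is triggered only because generating the first instantiation exposes a new requirement, which mirrors the statement that the process repeats until the assignments stop changing.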

// C_class.hxx
class C {
public:
    // ...
};

// template.hxx
template <class T> void perform_some_function(T &param);

// template.cxx
template <class T> void perform_some_function(T &param)
{
}

// main.cxx
#include "C_class.hxx"
#include "template.hxx"
int main()
{
    C c;
    perform_some_function(c);
    return 0;
}

Figure 2
Example of a Link-time Instantiation Scheme (McCluskey)


Comparison of Manual and Automatic Instantiation
Techniques

The manual instantiation techniques require planning on the part of
the user to ensure that needed instantiations are present, that no
extraneous instantiations are generated, and that each needed
instantiation appears exactly once within the application. With
manual instantiation, the user has the advantage of gaining explicit
control over all template instantiations. Although the strategy of
instantiating used templates locally requires less planning, it does
so at the cost of object file size and the restricted use of
templates when static data members are present or when static data is
defined locally within a function template instantiation.
Automatic template instantiation provides template instantiation with
no explicit action on the part of the user. Compile-time
instantiation requires either specific linker support to select a
single template instantiation from potentially many candidates, or
support by the compiler to generate instantiations in separate object
files while compiling the user's source code. Relying on linker
support allows the compiler to efficiently generate instantiations at
the cost of larger object files; however, the user loses control over
which instantiation is used in the executable file. Although the use
of separate instantiation object files usually takes more time at
compilation than the linker-support method, it results in more
compact object files and can provide the user with more control over
which instantiation is used in the executable file.
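
The linker-support method can be pictured as a selection among duplicate contributions. The sketch below is a toy model, not a description of any real linker: the Section record, the link_image function, and the policy of keeping the first contribution are assumptions (the article says only that one contribution is kept and the rest are discarded).

```cpp
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// One symbol-defining section contributed by an object file.
struct Section {
    std::string symbol;
    std::string object_file;
    bool comdat;     // true if duplicate definitions are permitted
};

// Builds the image: duplicate COMDAT contributions are silently discarded,
// while any other duplicate is reported as an error.
std::map<std::string, Section> link_image(const std::vector<Section> &sections) {
    std::map<std::string, Section> image;
    for (const Section &s : sections) {
        std::map<std::string, Section>::iterator found = image.find(s.symbol);
        if (found == image.end()) {
            image.insert(std::make_pair(s.symbol, s));   // first contribution is kept
        } else if (!s.comdat || !found->second.comdat) {
            throw std::runtime_error("multiply defined symbol: " + s.symbol);
        }
    }
    return image;
}

// Two object files each carry a copy of the same instantiation.
std::string provider_of_push() {
    std::vector<Section> sections = {
        {"main", "A.obj", false},
        {"Stack<int>::push", "A.obj", true},
        {"b_func", "B.obj", false},
        {"Stack<int>::push", "B.obj", true}
    };
    return link_image(sections).at("Stack<int>::push").object_file;
}
```

The user has no say in which copy survives, which is the loss of control mentioned above; both object files still pay the size cost of carrying their own copy.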
Link-time instantiation provides template instantiation that is
tailored to the needs of the executable file. The primary cost is
link-time performance, since generation of instantiations occurs at
link time. Another disadvantage of link-time instantiation can be
observed when building object-code libraries. Either the library must
contain all the instantiations that it requires, or the user who
wants to link with the library must have access to all the machinery
to create instantiations. Creating a library's instantiations
involves extra steps during library construction. All the object
files to be included in the library must be prelinked, so that the
needed instantiations are generated. If instantiations are included
in the individual object files in the library, as in the Edison
Design Group approach, unintended modules may be linked from the
library to provide the needed instantiations.
Consider the following scenario, in which object files A and B are
included in the library. Both files require the instantiation of
perform_some_function<int>. When these files are prelinked, the
instantiation of perform_some_function<int> is assigned to one of the
files, say A. If an application that is being linked against the
library requires that the object file B be linked into the
executable, then the object file A is also linked. Here the
instantiation needed by B was instantiated in A even though the
executable never referenced anything explicitly defined in file A.
This can yield an unnecessarily large executable.
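
The scenario can be traced with a small model of library linking. Everything in the sketch below is illustrative: the ObjectFile record, the link_from_library function, and the symbol names a_func and b_func are assumptions, and real linkers resolve archive members with more elaborate rules.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// A library member with the symbols it defines and the ones it needs from elsewhere.
struct ObjectFile {
    std::string name;
    std::set<std::string> defines;
    std::set<std::string> references;
};

// Pulls in every library member needed to satisfy the application's references,
// following references transitively. Returns the names of the linked members.
std::set<std::string> link_from_library(const std::vector<ObjectFile> &library,
                                        const std::set<std::string> &app_references) {
    std::set<std::string> linked;
    std::set<std::string> seen(app_references);
    std::vector<std::string> pending(app_references.begin(), app_references.end());
    while (!pending.empty()) {
        std::string symbol = pending.back();
        pending.pop_back();
        for (const ObjectFile &obj : library) {
            if (obj.defines.count(symbol) == 0) continue;
            if (linked.insert(obj.name).second) {
                // A newly linked member brings its own references with it.
                for (const std::string &ref : obj.references) {
                    if (seen.insert(ref).second) pending.push_back(ref);
                }
            }
            break;   // the first defining member satisfies the symbol
        }
    }
    return linked;
}

// The application references only b_func. Returns how many members get linked.
std::size_t members_linked(bool instantiation_assigned_to_a) {
    const std::string inst = "perform_some_function<int>";
    ObjectFile a = {"A", {"a_func", inst}, {}};
    ObjectFile b = {"B", {"b_func"}, {inst}};
    if (!instantiation_assigned_to_a) {
        b.defines.insert(inst);      // B carries its own copy instead
        b.references.clear();
    }
    std::vector<ObjectFile> library;
    library.push_back(a);
    library.push_back(b);
    std::set<std::string> app_references;
    app_references.insert("b_func");
    return link_from_library(library, app_references).size();
}
```

With the instantiation assigned to A, both members are linked even though the application asked only for b_func; when B carries its own copy, only B is linked.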
In the next section, we review the template instantiation support in
earlier versions of DIGITAL C++ and then discuss the rationale and
design of the automatic template instantiation facility in version
6.0 of DIGITAL C++.

DIGITAL C++ Template Instantiation Experience

As the use of C++ templates has grown, DIGITAL C++ has been enhanced
to support the need for improved instantiation techniques. The
initial release of DIGITAL C++ occurred before the C++
standardization process had matured, so that the language supported
was based on The Annotated C++ Reference Manual, referred to as the
ARM.8 The ARM defined template functionality, but it did not provide
guidance for either manual or automatic template instantiation. Thus
it was necessary to provide a DIGITAL C++ specific mechanism for
template instantiation.
DIGITAL C++ Manual Template Instantiation

The #pragma define_template directive and the instantiate all
command-line option, -define_templates, have been supported since the
initial release of DIGITAL C++.

In Figure 4, the define_template pragma directs the compiler to
instantiate class template, C, with type int. When the compiler
detects the use of the pragma, it creates an internal C<int> type
node and traverses the list of static data members and member
functions defined within the class. If the definitions of these
members are present at the point the pragma is specified, the
compiler materializes each with type int.

As the C++ language developed and template usage increased, users
found manual template instantiation to be very labor intensive and
requested an automated method.
DIGITAL C++ Version 5.3 Automatic Template
Instantiation

Automatic template instantiation capability became a serious issue
during the planning stages of DIGITAL C++ version 5.3. The use of
templates was increasing rapidly, and many new third-party libraries,
such as Rogue Wave Software's Tools.h++, contained a significant use
of templates. Due to this growing need, the requirements were
straightforward. The support had to be easy to use, have a short
design phase, be quickly implementable on both the DIGITAL UNIX and
the OpenVMS platforms, and provide reasonable performance. Because
McCluskey's approach had been used in several implementations, it
presented itself as our best option.

template <class T> class C {
public:
    void mem_func1(T p);
    void mem_func2(T p);
};
template <class T> void C<T>::mem_func1(T p) { /* ... */ }
template <class T> void C<T>::mem_func2(T p) { /* ... */ }

#pragma define_template C<int>

Figure 4
The define_template Pragma

DIGITAL made two major changes to McCluskey's approach to take
advantage of the DIGITAL C++ compiler design. First, we allowed
instantiation source files to be created at compile time instead of
link time. This eliminated the need for McCluskey's name-mapping file
and simplified the prelinking process considerably. Since the needed
source files existed in the repository, there was no need to
deconstruct the required template instantiations to determine their
arguments and types.

The second change addressed the transitive closure problem. Figure 5
shows an example of the class template Buffer being instantiated with
the user-defined type C. After compilation of app.cxx with the
McCluskey approach, the name-mapping file contained definition
locations of class B and class C. However, it did not contain any
indication that class C had a data member that relied on the
definition of class B. From the information in the name-mapping file,
the prelinker then created an instantiation source file that included
only C_class.hxx, Buffer.hxx, and Buffer.cxx. When this instantiation
source file was compiled, an error resulted complaining that B is an
undefined type whose size is unknown.

We solved this problem in DIGITAL C++ version 5.3 by including all
the top-level header files included by the current compilation unit
in any instantiation source files created. This ensured that
B_class.hxx would be included in the generated instantiation file.

// B_class.hxx
class B {
    // ...
};

// C_class.hxx
class C {
public:
    B data_mem;
};

// Buffer.hxx
template <class T> class Buffer {
    int num_of_items;
    T *buffer;
public:
    void add_item(T *);
    // ...
};

// Buffer.cxx
template <class T> void Buffer<T>::add_item(T *p)
{
    // ...
}

// app.cxx
#include "B_class.hxx"
#include "C_class.hxx"
#include "Buffer.hxx"
void f(void)
{
    C c;
    Buffer<C> c_buffer;
    c_buffer.add_item(&c);
}

Figure 5
Instantiation of the Class Template Buffer
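
To make the transitive closure problem concrete, the sketch below flattens what a working instantiation source file would have to contain into a single file. It is an illustration under assumptions: the member bodies, the constructors, the count accessor, and the helper items_after_one_add are invented, and a standard explicit instantiation stands in for the compiler's directed mode.

```cpp
// B_class.hxx: must be seen before class C is defined
class B {
public:
    int value;
    B() : value(0) {}
};

// C_class.hxx: relies on B but, as in Figure 5, does not include it itself
class C {
public:
    B data_mem;
};

// Buffer.hxx
template <class T>
class Buffer {
    int num_of_items;
    T *buffer;
public:
    Buffer() : num_of_items(0), buffer(0) {}
    void add_item(T *p);
    int count() const { return num_of_items; }
};

// Buffer.cxx
template <class T>
void Buffer<T>::add_item(T *p) {
    buffer = p;          // toy behavior: remember only the most recent item
    ++num_of_items;
}

// Stand-in for the directed instantiation request for Buffer<C>.
template class Buffer<C>;

// Mirrors the body of f() in app.cxx.
int items_after_one_add() {
    C c;
    Buffer<C> c_buffer;
    c_buffer.add_item(&c);
    return c_buffer.count();
}
```

Deleting the definition of class B from this file makes class C fail to compile, which corresponds to the error the prelinker-generated source produced before all top-level headers were included.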


Despite the fact that this type of automatic link-time instantiation
scheme was being widely used in the industry, the results of using a
modified McCluskey approach were mixed. Stroustrup has described the
general problems with McCluskey's approach.9 We found that our
implementation suffered particularly from poor link-time performance
and so did not satisfy our users' needs.
DIGITAL C++ Version 6.0 Automatic Template
Instantiation

DIGITAL C++ version 6.0 is a complete reimplementation of DIGITAL
C++, with emphasis on ANSI C++ conformance. It is implemented using a
completely new code base, which includes the industry-standard C++
front end from the Edison Design Group and a standard class library
from Rogue Wave.

From our experience with template instantiation in DIGITAL C++
versions 5.3 through 5.6, we concluded that the most important issue
that should be addressed in the design and implementation of the
automatic template instantiation facility was the compile- and
link-time performance. The primary goal was to have the performance
of automatic template instantiation substantially exceed the
performance of version 5.6. Another important goal was to remove the
restriction of template declaration and definition placement in
header files. In addition, the automatic template instantiation
facility in version 6.0 had to be culturally compatible with the
previous implementation. The user had to be able to move sources and
objects to different directories, easily build archived and shared
libraries, share instantiations between various applications, and
have error diagnostics reported at the earliest possible moment in
the instantiation process.
Design and Implementation We decided to use a compile-time
instantiation model as the basis for our implementation. Since
we were using the Edison Design Group's front end, we seriously
considered using their link-time model. However, the
compile-time model seemed advantageous for several reasons.
First, there are significant complications (as described in the
section Comparison of Manual and Automatic Instantiation
Techniques) when trying to build libraries with a compiler that
uses the Edison Design Group link-time model. In addition, the
link-time model requires recompilations that limit performance
in many typical cases of template use. We recognized that the
link-time model could provide better performance in some cases,
but these would be in the minority. Finally, the implementation
of the link-time model would require substantially more
implementation effort on the OpenVMS platform. The version of
the Edison Design Group front end being used to build
DIGITAL C++ version 6.0 required tools to scan a


user's object files for information concerning which modules
could instantiate requested templates. Similar functionality
would need to be implemented for the OpenVMS platform.
We preserved the concept of the template repository as a
directory that contains the individual template instantiation
object files. The repository stores one object file for each
template function, member function, static data member, and
virtual table that is generated by automatic template
instantiation. The file name of the instantiation object file is
derived from the instantiation's external name. At compile time,
the front end generates intermediate code for all templates that
are needed in the compilation unit and can be instantiated. A
tree walk is performed over the intermediate code to find all
entities that are needed by each generated template
instantiation. The code generator is called to generate code for
the user-specified object file and is then called repeatedly for
each template instantiation to generate the instantiation object
files in the repository.
The compiler generally considers an instantiation to be needed
when it is referenced from a context that is itself needed, such
as in a function with global visibility or by the initialization
of a variable that is needed. Virtual member functions are
needed when a constructor for the class is needed. Thus, all
virtual function definitions should be visible in a compilation
unit that requires a constructor for the class. Each
instantiation that is generated with automatic instantiation is
marked as potentially being in its own object file in the
repository.
The intermediate representation of each generated instantiation
is walked to determine what other entities it references. At
this point, the instantiation is a candidate to be generated in
its own object file, but it can sometimes be generated as part
of the user-specified object file. If the instantiation
references an entity that is local to the compilation unit, such
as a static function, and that local entity is nonconstant and
statically initialized, the instantiation is merged into the
user-specified object file rather than generated in its own
object file. As an alternative, we could have chosen to change
the local entity into a global entity with a unique name and
generate the instantiation in its own object file. We chose not
to do this in order to make it easier to share a repository
between applications. With this alternative, the instantiation
in the repository requires the object file containing the local
entity's definition, which may be in another application. Note
that any application that contains more than one definition of
the same instantiation that references a nonconstant local
entity is a nonstandard-conforming application. This is a
violation of the one definition rule.10 Consider the following
code fragment:
static int j;
template <class T> void func(T arg)
{
    static int count = 0;
    print_count("count: ", count++);
}

The function, print_count, is defined in the source file and
generated as a defined function in the user-specified object
file. The template function, func, references the function,
print_count. When the code for func is generated in its own
object file, the reference to print_count must be changed from a
reference to a defined function to a reference to an external
function.
By default, each needed instantiation is generated by every
compilation that requires the instantiation. This is the safe
default because it ensures that instantiations in the repository
are up to date. However, there will probably be some compilation
overhead from regenerating instantiations that may already be up
to date. We believed that the overhead of regenerating
instantiations would typically be relatively small. For
applications with a high overhead of instantiation, such as a
large number of source files using the same large number of
template instantiations, we provided a compilation option to
control the generation of template instantiations to improve
compile-time performance.
The generation of instantiation object files only when they are
actually required is a difficult problem. Fine-grain dependency
information would have to be kept for each instantiation object
file. Such dependency information would need to reflect those
files that are required to successfully generate the
instantiation and record which command-line options the user
specified to the compiler. We suspected that the overhead
involved with gathering and checking the information might be an
appreciable percentage of the time it would take to do the
instantiation, and thus it would not give us the performance
improvement that we wanted.
Instead, we decided to provide an option that allows the user to
decide when instantiations are generated. We refer to this as
the template time-stamp option, -ttimestamp. When using the
time-stamp option, the compiler looks in the repository for a
file named TIMESTAMP. If the file is not found, it is created.
The modification time of this file is referred to as the time
stamp. When generating an instantiation, the compiler looks in
the repository to see if the instantiation object file exists.
If it does not exist, it is generated. If the file already
exists, its modification time is compared to the time stamp. If
the modification time is later than the time stamp, the
instantiation is assumed to be up to date and is not
regenerated. Otherwise, the instantiation is generated. The user
can control the generation of instantiation object files by
changing the modification time of the TIMESTAMP file.
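The time-stamp rule described above can be illustrated with a minimal C++17 sketch. This is not the compiler's implementation; the function name needs_regeneration and its parameters are hypothetical, and the sketch assumes the repository is an ordinary directory reachable through std::filesystem.

```cpp
#include <cassert>
#include <chrono>
#include <filesystem>
#include <fstream>
#include <string>

namespace fs = std::filesystem;

// Hypothetical helper: decides whether an instantiation object file
// has to be generated, following the rule stated in the text.
bool needs_regeneration(const fs::path& repository,
                        const std::string& object_name)
{
    assert(!object_name.empty());

    // The time stamp is the modification time of the TIMESTAMP file.
    const fs::path stamp = repository / "TIMESTAMP";
    if (!fs::exists(stamp)) {
        std::ofstream create(stamp);  // create it when it is not found
    }

    // A missing instantiation object file is always generated.
    const fs::path object_file = repository / object_name;
    if (!fs::exists(object_file)) {
        return true;
    }

    // Up to date only when modified later than the time stamp.
    return fs::last_write_time(object_file) <= fs::last_write_time(stamp);
}
```

In this model, deleting or touching the TIMESTAMP file before a build forces each instantiation to be generated once, after which later compilations in the same build see a newer object file and skip it.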
The time-stamp option would typically be used in a makefile or a
shell script that compiles and builds an entire application.
Before invoking make or the shell script, the user would make
certain that no TIMESTAMP file resided in the repository. This
would ensure that each needed instantiation would be generated
exactly once during all the compilations done by the build
procedure.
Much of the C++ linker support in version 5.6 was reused with
only minor modifications for version 6.0. The compiler is
presented with a single repository into which the instantiation
object files are written. Multiple repositories can be specified
at link time, and each can be searched for instantiations that
are needed by the executable file. The linker is used in a trial
link mode to generate a list of all the unresolved external
references. This list is then used to search the repositories to
find the needed instantiation files, and the process is repeated
until no more instantiations are needed or can be satisfied from
the repository. The link then proceeds as any normal link,
adding the list of instantiation object files to the list of
object files and libraries as specified by the user.
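The repeated trial link can be viewed as a fixed-point search. The sketch below models it in C++ under stated assumptions: a repository is represented as a hypothetical map from an instantiation's external name to the external names its object file references, and the function collect_instantiations stands in for the linker's trial mode. None of these names come from the product.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical model of a repository: external name of an instantiation
// mapped to the external names that its object file references.
using Repository = std::map<std::string, std::vector<std::string>>;

// Repeats the search until no unresolved name can be satisfied.
std::set<std::string> collect_instantiations(
    std::set<std::string> unresolved,
    const std::vector<Repository>& repositories)
{
    std::set<std::string> selected;
    bool progress = true;
    while (progress && !unresolved.empty()) {
        progress = false;
        std::set<std::string> next;
        for (const std::string& name : unresolved) {
            if (selected.count(name) != 0) continue;  // already added
            const std::vector<std::string>* refs = nullptr;
            for (const Repository& repo : repositories) {  // search in order
                auto it = repo.find(name);
                if (it != repo.end()) { refs = &it->second; break; }
            }
            if (refs == nullptr) { next.insert(name); continue; }  // not here
            selected.insert(name);
            progress = true;
            for (const std::string& r : *refs) {
                if (selected.count(r) == 0) next.insert(r);  // new request
            }
        }
        unresolved = std::move(next);
    }
    assert(!repositories.empty() || selected.empty());
    return selected;
}
```

Names that no repository can satisfy simply remain unresolved when the loop stops, which mirrors the text: they are left for the user's own object files and libraries in the final link.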
If a vendor is creating a library rather than an executable
file, the instantiations needed by the modules in the library
can be provided in either of two ways: (1) The library vendor
can put the needed instantiations in the library by adding the
files in the repository to the library file. (2) The library
vendor can provide the repository with the library and require
that library users link with the repository as well. Note that
instantiations placed in the library are fixed when the library
is created. Since the library is included in the trial link of
an application, any instantiation in the library takes
precedence over the same named instantiation in a repository.
Results In a number of tests, DIGITAL C++ version 6.0 showed
improved performance over version 5.6. We tested a variety of
user code samples that use templates to varying degrees and
found that build times for version 6.0 decreased substantially
compared to the version 5.6 compiler. Examples of two typical
C++ applications used in our tests are the publicly available
EON ray-tracing benchmark and a subset of tests from our
Standard Template Library (STL) test suite. For


the EON benchmark, the build time for version 6.0 was reduced to
28 percent of the build time for version 5.6. For the STL tests,
the build time for version 6.0 was reduced to 19 percent of the
build time for version 5.6. The number of files in the
repository also decreased significantly because version 6.0
generates only instantiation object files instead of the
instantiation source, command, dependency, and object files of
version 5.6. For EON, the version 6.0 repository contained 88
files compared to 260 files in version 5.6.
Using the time-stamp option, build time for the EON benchmark
was reduced by only 5 percent compared to the default
instantiation strategy. The real benefit of the time-stamp
option comes with applications that use the same template
instantiations in many compilation units. For example, in one
user's test case, build times dropped from roughly 18 hours with
the default instantiation to 3 hours when using the time-stamp
option.
In the next section, we conclude our paper with a discussion of
further work that can improve the performance and usability of
automatic template instantiation.

Future Research

We continue to investigate approaches and techniques to improve
the usability and performance of the automatic template
instantiation facility. Optimal usability and performance would
seem to require a development environment completely integrated
for C++. This environment would keep track of all entity
definitions and usage [...]

[...]ture of the files used to generate the instantiation. For
example, if the user specified an include directory of
old_include on the initial compilation and later specified an
include directory of new_include, this approach would not
recognize that different files were being included.
Another approach to improving application build performance is
to support a build facility that can make use of template
information in determining dependency. Currently, each
user-specified object file is dependent on all the included
files necessary to create instantiation object files for
template requests. When a change is made to a template
definition, all the sources that reference the template need to
be recompiled. A build facility designed to be sensitive to
template instantiation could detect that a change in the
template definition was limited to the instantiation object
file. It could then instruct the compiler to suppress the
regeneration of object files for source files that are only
being recompiled due to the change in the template
instantiation. Such a facility could also suppress the
recompilation of any source file that would only reproduce the
changes to instantiations that were already regenerated.
Because we recognize that link-time instantiation can perform
better in some cases than the compile-time approach, we are
investigating the link-time instantiation model as a user
option.
Finally, we continue to look at ways to reduce the cost of
generating each instantiation. For example, by default the
compiler compresses the generated object [...]

[...]rm with semantic information embedded

• For C++, expanding template classes and functions into their
  individual instances

• Simplifying high-level language constructs into a form
  acceptable to the optimization phases

• Converting the abstract representation to a different abstract
  form acceptable to an optimizer, usually called an
  intermediate language (IL)

• Expanding some low-level functions inline into the context of
  their callers

• Performing multiple optimization passes involving annotation
  and transformation of the IL

• Converting the IL to a form symbolically representing the
  target machine language, usually called code generation

• Performing scheduling and other optimizations on the symbolic
  machine language

• Converting the symbolic machine language to actual object code
  and writing it onto disk

In modern C and C++ compilers, these various intermediate forms
are kept entirely in dynamic memory. Although some of these
operations can be performed on a function-by-function basis
within a module, it is sometimes necessary for at least one
intermediate form of the module to reside in dynamic memory in
its entirety. In some instances, it is necessary to keep
multiple forms of the whole module simultaneously.

This presents a difficult design challenge: how do we compile
large programs using an acceptable amount of virtual and
physical memory? Trade-offs change constantly as memory prices
decline and paging algorithms of operating systems change. Some
optimizations even have the potential to expand one of the
intermediate representations into a form that grows faster than
the size of the program ($O(n \times \log(n))$, or even
$O(n^2)$). In these cases, optimization designers often limit
the scope of the transformation to a subset of an individual
function (e.g., a loop nest) or use some other means to
artificially limit the dynamic memory and computation
requirements. To allow additional headroom, upstream compiler
phases are designed to eliminate unnecessary portions of the
module as early as possible.
In addition, the memory management systems are designed to allow
internal memory reuse as efficiently as possible. For this
reason, compiler designers at Compaq have generally preferred a
zone-based memory management approach rather than either a
malloc-based or a garbage-collection approach. A zoned memory
approach typically allows allocation of varying amounts of
memory into one of a set of identified zones, followed by
deallocation of the entire zone when all the individual
allocations are no longer needed. Since the source program is
represented by a succession of internal representations in an
optimizing compiler, a zoned-based memory management system is
very appropriate.
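As an illustration of the zoned approach, the following C++ sketch shows a minimal zone that hands out varying amounts of memory and frees everything in one step. The class name Zone, its block size, and its interface are assumptions made for this example; the compilers' actual memory manager is not described at this level in the text.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>

// Hypothetical zone: allocations are carved out of large blocks and
// are never freed individually.
class Zone {
public:
    explicit Zone(std::size_t block_size = 64 * 1024)
        : block_size_(block_size) { assert(block_size > 0); }
    Zone(const Zone&) = delete;
    Zone& operator=(const Zone&) = delete;
    ~Zone() { release(); }

    // Allocate `size` bytes, aligned for any fundamental type.
    void* allocate(std::size_t size) {
        const std::size_t align = alignof(std::max_align_t);
        size = (size + align - 1) / align * align;  // round up
        if (blocks_.empty() || used_ + size > capacity_) {
            // Start a new block; oversized requests get their own block.
            capacity_ = size > block_size_ ? size : block_size_;
            void* block = std::malloc(capacity_);
            if (block == nullptr) throw std::bad_alloc();
            blocks_.push_back(static_cast<char*>(block));
            used_ = 0;
        }
        void* p = blocks_.back() + used_;
        used_ += size;
        return p;
    }

    // Deallocate the entire zone at once.
    void release() {
        for (char* b : blocks_) std::free(b);
        blocks_.clear();
        used_ = 0;
        capacity_ = 0;
    }

    std::size_t block_count() const { return blocks_.size(); }

private:
    std::size_t block_size_;
    std::size_t used_ = 0;
    std::size_t capacity_ = 0;
    std::vector<char*> blocks_;
};
```

A compiler phase could, for example, keep one zone per intermediate representation and release it as soon as the next representation has been built, which is the reuse pattern the paragraph describes.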
The main goals of the design are to keep the peak memory use
below any artificial limits on the virtual memory available for
all the actual source modules that users care about, and to
avoid algorithms that access memory in a way that causes
excessive cache misses or page faults.
Template Instantiation Time for C++ Templates are a major new
feature of the C++ language and are heavily used in the new
Standard Library. Instantiation of templates can dominate the
compile time of the modules that use them. For this reason,
template instantiation is undergoing active study and
improvement, both when compiling a module for the first time and
when recompiling in response to a source change. An improved
technique, now widely adopted, retains precompiled
instantiations in a library to be used across compilations of
multiple modules.
Template instantiation may be done at either compile time or
during link time, or some combination. DIGITAL C++ has recently
changed from a link-time to a compile-time model for improved
instantiation performance. The instantiation time is generally
proportional to the number of templates instantiated, which is
based on a command-line switch specification and the time
required to instantiate a typical template.


Run-Time Performance Metrics

We use automated scripts to measure run-time performance for
generated code, the debug image size, the production image size,
and specific optimizations triggered.
Run Time for Generated Code The run time for generated code is
measured as the sum of user and system time on UNIX required to
run an executable image. This is the primary metric for the
quality of generated code. Code correctness is also validated.
Comparing run times for slightly differing versions of synthetic
benchmarks allows us to test support for specific optimizations.
Performance regression testing on both synthetic benchmarks and
user applications, however, is the most cost-effective method of
preventing performance degradations. Tracing a performance
regression to a specific compiler change is often difficult, but
the earlier a regression is detected, the easier and cheaper it
is to correct.
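The run-time metric, the sum of user and system time, can be collected on a POSIX system roughly as follows. This is a sketch and not the measurement scripts used by the authors: the function measure_run_time is hypothetical, it relies on getrusage with RUSAGE_CHILDREN, and because the command is started through the shell by std::system, the small cost of the shell itself is included in the result.

```cpp
#include <cassert>
#include <cstdlib>
#include <sys/resource.h>
#include <sys/time.h>

// Converts a timeval to seconds.
static double to_seconds(const timeval& tv)
{
    return static_cast<double>(tv.tv_sec) + tv.tv_usec / 1e6;
}

// User plus system CPU seconds used so far by terminated child processes.
static double children_cpu_seconds()
{
    rusage usage;
    if (getrusage(RUSAGE_CHILDREN, &usage) != 0) return -1.0;
    return to_seconds(usage.ru_utime) + to_seconds(usage.ru_stime);
}

// Hypothetical helper: runs `command` and returns the user + system
// time in seconds that it consumed, or -1.0 on failure.
double measure_run_time(const char* command)
{
    assert(command != nullptr);
    const double before = children_cpu_seconds();
    const int status = std::system(command);  // waits for completion
    const double after = children_cpu_seconds();
    if (status == -1 || before < 0.0 || after < 0.0) return -1.0;
    return after - before;
}
```

Repeating such a measurement several times and comparing averages against a stored baseline is one simple way to turn the metric into a regression check.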
Debug Image Size The size of an image compiled with the debug
option selected during compilation is measured in bytes. It is a
constant struggle to avoid bloat caused by unnecessary or
redundant information required for symbolic debugging support.

Production Image Size The size of a production (optimized, with
no debug information) application image is measured in bytes.
The use of optimization techniques has historically made this
size smaller, but modern RISC processors such as the Alpha
microprocessor require optimizations that can increase code size
substantially and can lead to excessive image sizes if the
techniques are used indiscriminately. Heuristics used in the
optimization algorithms limit this size impact; however, subtle
changes in one part of the optimizer can trigger unexpected size
increases that affect I-cache performance.

Specific Optimizations Triggered In a multiphase optimizing
compiler, a specific optimization usually requires preparatory
contributions from several upstream phases and cleanup from
several downstream phases, in addition to the actual
transformation. In this environment, an unrelated change in one
of the upstream or downstream phases may interfere with a data
structure or violate an assumption exploited by a downstream
phase and thus generate bad code or suppress the optimizations.
The generation of bad code can be detected quickly with
automated testing, but optimization regressions are much harder
to find.
For some optimizations, however, it is possible to write test
programs that are clearly representative and can show, either by
some kind of dumping or by comparative performance tests, when
an implemented optimization fails to work as expected. One


commercially available test suite is called NULLSTONE,6 and
custom-written tests are used as well.
In a collection of such tests, the total number of optimizations
implemented as a percentage of the total tests can provide a
useful metric. This metric can indicate if successive compiler
versions have improved and can help in comparing optimizations
implemented in compilers from different vendors. The
optimizations that are indicated as not implemented provide
useful data for guiding future development effort.
The application developer must always consider the compile-time
versus run-time trade-off. In a well-designed optimizing
compiler, longer compile times are exchanged for shorter run
times. This relationship, however, is far from linear and
depends on the importance of performance to the application and
the phase of development.
During the initial code-development stage, a shorter compile
time is useful because the code is compiled often. During the
production stage, a shorter run time is more important [...]

[...] frequent cache misses can be identified from the DCPI
output, whereas they may not always be obvious from casually
observing the machine code.

The tools we use for compile-speed and run-time analysis are
considerably more sophisticated than the measurement tools. They
are generally provided by the CPU design or operating system
tools development groups and are widely used for application
tuning as well as compiler improvements. We have used the
following compile-speed analysis tools:

• The compiler's internal -show statistics feature gives a crude
  measure of the time required for each compiler phase.

[...]ment tools using scripts layered over standard operating
system timing packages. For compile-time measure[...]

[...] compile-time measurement, the default, debug, and [...]

[...]ously discussed.

[...] and, after compiling the source the specified number of
times, writes the resulting timings to a file. Postprocessing
scripts evaluate the usability of the results (average times,
deviations, and file sizes) and compare [...]

We apply hiprof and gprof in combination, and the IPROBE tool,
which can provide instruction-by-instruction details about the
execu[...]

When the problem needs to be pinpointed more accurately than is
possible with these profiling [...]

[...] IPROBE tool as described above, to the [...]

[...] run-time behavior of the test program rather than [...]

We analyze the [...]

[...] the detailed log file. This log identifies the problem [...]

[...]cutable image in some manner, so that enough information
about the execution can be captured to [...]

[...] a specific area of compiler source. Once this informa[...]


• Finally, we use the estimated schedule dump and statistical
  data optionally generated by the GEM back end.1 This dump
  tells us how instructions are scheduled and issued based on
  the processor architecture selected. It may also provide
  information about ways to improve the schedule.

In the rest of this section, we discuss three examples
of applying analysis tools to problems identified by the
performance measurement scripts.

Compile-Time Test Case

Compile-time regression occurred after a new optimization called
base components was added to the GEM back end to improve the
run-time performance of structure references. Table 1 gives
compile-time test results that compare the ratios of compile
times using the new optimized back end to those obtained with
the older back end. The results for the iostream test indicate a
significant degradation of 25 percent in the compile speed for
optimize mode, whereas the performance in the other two modes is
unchanged.
To analyze this problem, we built hiprof versions of the two
compilers and compiled the iostream benchmark to obtain its
compilation profile. Figures 1a and 1b show the top
contributions in the flat hiprof profiles from the two
compilers. These profiles indicate that the number of calls made
to cse and gem_il_peep in the new version is greater than that
of the old one and that these calls are responsible for
performance degradation. Figures 2a and 2b show the call graph
profiles for cse for the two compilers and show the calls made
by cse and the contributions of each component called by cse.
Since these components are included in the GEM back end, the
problem was fixed there.

Run-Time Test Cases

For the run-time analysis, we used two different test
environments, the Haney kernels benchmark and the NULLSTONE test
run against gcc.
Haney Kernels The Haney kernels benchmark is a synthetic test
written to examine the performance of specific C++ language
features. In this run-time test case, an older C++ compiler
(version 5.5) was compared with a new compiler under development
(version 6.0). The Haney kernels results showed that the version
6.0 development compiler experienced an overall performance
regression of 40 percent. We isolated the problem to the real
matrix multiplication function. Figure 3 shows the execution
profile for this function.
We then used the DCPI tool to analyze performance of the inner
loop instructions exercised on version 6.0 and version 5.5 of
the C++ compiler. The resulting counts in Figures 4a and 4b show
that the version 6.0 development compiler suffered a code
scheduling regression. The leftmost column shows the average
cycle counts for each instruction executed. The reason for this
regression proved to be that a test

Table 1
Ratios of CPU (User and System) Compile Times (Seconds) of the New Compiler to Those of the Old Compiler

File Name              Debug Mode   Default Mode   Optimize Mode
Options                -O0 -g                      -O4 -g0

a1amch2                0.970        0.970          0.930
collevol               0.910        0.780          0.740
d_inh                  0.970        0.960          0.960
e_rvirt_yes            0.970        0.980          0.960
interfaceparticle      0.880        0.790          0.730
iostream               0.990        0.980          1.250
pistream               0.890        0.760          0.790
t202                   0.970        0.970          1.130
t300                   0.980        0.960          1.040
t601                   1.010        1.020          1.010
t606                   1.000        1.020          1.020
t643                   1.020        1.010          1.000
test_complex_excepti   0.960        0.890          0.830
test_complex_math      0.970        0.950          0.950
test_demo              0.950        0.830          0.780
test_generic           1.000        1.020          1.100
test_task_queue6       0.970        0.920          0.960
test_task_rand1        0.950        0.890          0.890
test_vector            0.970        0.920          1.120
vectorf                0.890        0.790          0.850

Averages               0.961        0.920          0.952


(a) Hiprof Profile Showing Instructions Executed with the New Compiler
    granularity: cycles; units: seconds; total: 48.96 seconds
    Columns: % time, cumulative seconds, self seconds, calls,
    self ms/call, total ms/call, name
    Routines listed: cse [12], gem_il_peep [31],
    gem_fi_ud_access_resource, gem_vm_get_nz, _OtsZero
    cse [12]: 2.8 % time, 1.37 self seconds, 10195 calls,
    0.13 ms/call

(b) Hiprof Profile Showing Instructions Executed with the Old Compiler
    granularity: cycles; units: seconds; total: 27.49 seconds
    Columns: % time, cumulative seconds, self seconds, calls,
    self ms/call, total ms/call, name
    Routines listed: gem_il_peep, _OtsZero, cse [16],
    gem_fi_ud_access_resource, gem_vm_get_nz
    cse [16]: 2.5 % time, 0.68 self seconds, 8664 calls,
    0.08 ms/call

Figure 1
Hiprof Profiles of Compilers

for pointer disambiguation outside the loop code was not
performed properly in the version 6.0 compiler. The test would
have ensured that the pointers a and t were not overlapping.
We traced the origin of this regression back to the intermediate
code generated by the two compilers. Here we found that the
version 6.0 compiler used a more modern form of array address
computation in the intermediate language for which the scheduler
had not yet been tuned properly. The problem was fixed in the
scheduler, and the regression was eliminated.
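The role of a pointer disambiguation test placed outside the loop can be shown with a small hand-written C++ analogue. This is only an illustration of the idea, not the code the compiler generates: the names disjoint and scaled_add are hypothetical, and comparing addresses of unrelated arrays through std::uintptr_t assumes a flat address space.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// True when the ranges [p, p + n) and [q, q + n) share no element.
inline bool disjoint(const double* p, const double* q, std::size_t n)
{
    const std::uintptr_t a = reinterpret_cast<std::uintptr_t>(p);
    const std::uintptr_t b = reinterpret_cast<std::uintptr_t>(q);
    const std::uintptr_t bytes = n * sizeof(double);
    return a + bytes <= b || b + bytes <= a;
}

// Inner-loop pattern of the matrix multiply: t[i] += temp * a[i].
void scaled_add(double* t, const double* a, double temp, std::size_t m)
{
    assert(m == 0 || (t != nullptr && a != nullptr));
    if (disjoint(t, a, m)) {
        // Fast path: loads of a[] are hoisted above the stores to t[].
        // This reordering is valid only because t and a do not overlap.
        std::size_t i = 0;
        for (; i + 4 <= m; i += 4) {
            const double a0 = a[i], a1 = a[i + 1];
            const double a2 = a[i + 2], a3 = a[i + 3];
            t[i] += temp * a0;
            t[i + 1] += temp * a1;
            t[i + 2] += temp * a2;
            t[i + 3] += temp * a3;
        }
        for (; i < m; ++i) t[i] += temp * a[i];
    } else {
        // Conservative path: strict load/store order is preserved.
        for (std::size_t i = 0; i < m; ++i) t[i] += temp * a[i];
    }
}
```

Without the test, only the conservative path is safe, which limits how freely a scheduler can overlap loads and stores; that is the kind of scheduling loss the text attributes to the missing disambiguation.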


Initial NULLSTONE Test Run against gcc We measured the
performance of the DEC C compiler in compiling the NULLSTONE
tests and repeated the performance measurement of the gcc 2.7.2
compiler and libraries on the same tests. Figures 5a and 5b show
the results of our tests. This comparison is of interest because
gcc is in the public domain and is widely used, being the
primary compiler available on the public-domain Linux operating
system. Figure 5a shows the tests in which the DEC C compiler
performs at least 10 percent better than gcc. Figure 5b
indicates the optimiza-

(a) Hierarchical Profile for cse with the New Compiler
    cse [12]: 14.1 % time, 1.37 self seconds, 10195 calls
    Callees listed: update_operands, test_for_induction,
    test_for_cse, push_effect, gem_df_move

(b) Hierarchical Profile for cse with the Old Compiler
    cse [16]: 10.5 % time, 0.68 self seconds, 8664 calls
    Callees listed: test_for_cse, test_for_induction,
    update_operands, move, pop_effect

Figure 2
Hierarchical Call Graph Profiles for cse

void rmatMulHC(Real * t,
               const Real * a,
               const Real * b,
               const int M, const int N, const int K)
{
    int i, j, k;
    Real temp;
    memset(t, 0, M * N * sizeof(Real));
    for (j = 1; j <= N; j++)
    {
        for (k = 1; k <= K; k++)
        {
            temp = b[k - 1 + K * (j - 1)];
            if (temp != 0.0)
            {
                for (i = 1; i <= M; i++)
                    t[i - 1 + M * (j - 1)] += temp * a[i - 1 + M * (k - 1)];
            }
        }
    }
}

Figure 3
Haney Loop for Real Matrix Multiplication

[...] DEC C compiler shows 10 percent or more regression compared
to gcc.
We investigated the individual regressions by looking at the
detailed log of the run and then examining the machine code
generated for those test cases. In this case, the alias
optimization portion showed that the regressions were caused by
the use of an outmoded standard (-std0) as the default language
dialect for DEC C in the DIGITAL UNIX environment. After we
retested with the -ansi_alias option, these regressions
disappeared.
We also investigated and fixed regressions in instruction
combining and if optimizations. Other regressions, which were
too difficult to fix within the existing schedule for the
current release, were added to the issues list with appropriate
priorities.

Conclusions

The measurement and analysis of compiler performance has become
an important and demanding field. The increasing complexity of
CPU architectures and the addition of new features to languages
require the development and implementation of new strategies for
test[...]

References and Notes

1. D. Blickstein et al., "The GEM Optimizing Compiler System,"
   Digital Technical Journal, vol. 4, no. 4 (Special issue,
   1992): 121-136.

2. B. Calder, D. Grunwald, and B. Zorn, "Quantifying Behavioral
   Differences Between C and C++ Programs," Journal of
   Programming Languages, 2 (1994).

3. D. Detlefs, A. Dosser, and B. Zorn, "Memory Allocation Costs
   in Large C and C++ Programs," Software Practice and
   Experience, vol. 24, no. 6 (1994).

4. P. Wu and F. Wang, "On the Efficiency and Optimization of C++
   Programs," Software Practice and Experience, vol. 26, no. 4
   (1996): 453-465.

5. A. Itzkowitz and L. Foltan, "Automatic Template Instantiation
   in DIGITAL C++," Digital Technical Journal, vol. 10, no. 1
   (this issue, 1998): 22-31.

6. NULLSTONE Optimization Categories, URL:
   http://www.nullstone.com/htmls/category.htm, Nullstone
   Corporation, 1990-1998.

7. J. Orost, "The Bench++ Benchmark Suite," December 12, 1995.
   A draft paper is available at
   http://www.research.att.com/~orost/bench_plus_plus/paper.html.

8. C++ Benchmarks, Comparing Compiler Performance, URL:
   http://www.kai.com/index.html, Kuck and Associates, Inc.
   (KAI), 1998.

9. ATOM: User Manual (Maynard, Mass.: Digital Equipment
   Corporation, 1995).

10. A. Eustace and A. Srivastava, "ATOM: A Flexible Interface for
    Building High Performance Program Analysis Tools," Western
    Research Lab Technical Note TN-44, Digital Equipment
    Corporation, July 1994.

11. A. Eustace, "Using Atom in Computer Architecture Teaching and
    Research," Computer Architecture Technical Committee
    Newsletter, IEEE Computer Society, Spring 1995: 28-35.

12. J. Anderson et al., "Continuous Profiling: Where Have All the
    Cycles Gone?" SRC Technical Note 1997-016, Digital Equipment
    Corporation, July 1997; also in ACM Transactions on Computer
    Systems, vol. 15, no. 4 (1997): 357-390.

Pru,f5 rct nuning

527-542 .

Figure 3

tion tests i n which the

of

3 1 3-35 1

these challenges. Our systematic ti·a mework tor com

J. Dean, ) . H icks, C . W::tldspurger, W. Wei h l , and G.
Chrysos, "Proti leMe: Hardware Support for Instruction
Level Profiling on Out-ofOrder Processors," 30th Sym
posium on Microarchitecrure ( Mi cro- 3 0 ) , Raleigh, N.C.,
December 1 997.

piler performance measurement, analysis, �md prioriti

1 4 . G'u ide to /PRO/Jt·. lustct!linp, and r :, ing ( M aynard,

1 3.

ing the perf(xma nce of C and C++ compilers. By
employi ng en hanced measurement and analysis tech
niq ues, tools, and benchmarks, we were able to address

zation of improve ment opportunities should serve as an
excell e nt st:u-ting poi n t for the practitioner i n :� situation
in which simil:�r req u i rements :u-c im posed .

Mass . : Digital Equipment Corporation, 1 9 94 ) .
15.

B. Ke rn inghon �nd D . Richie, The C Progra mm iug
Lang u age ( Englewood Cliffs, N . J . : Prentice- H a l l ,
1 978 ) .


[Figure 4: annotated Alpha instruction listings for the inner loop, with per-instruction DCPI sample counts]

(a) DCPI Profile for This Execution with Version 6.0
(b) DCPI Profile with Counts with Version 5.5

Figure 4
DCPI Profiles of the Inner Loop

[Figure 5a: NULLSTONE Summary Performance Improvement Report, Nullstone Release 3.9b2. Threshold: Nullstone ratio increased by at least 10%. Baseline compiler: GCC 2.7.2. Comparison compiler: DEC Alpha C 5.7-123 bl36. Architecture: DEC Alpha, model 3000/300. For each optimization category, the report lists the sample size and the number of tests that improved.]

Figure 5a
NULLSTONE Results Comparing gcc with DEC C Compiler, Showing All Improvements of Magnitude 10% or More


[Figure 5b: NULLSTONE Summary Performance Regression Report, Nullstone Release 3.9b2. Threshold: Nullstone ratio decreased by at least 10%. Baseline compiler: GCC 2.7.2. Comparison compiler: DEC Alpha C 5.7-123 bl36. Architecture: DEC Alpha, model 3000/300. For each optimization category, the report lists the sample size and the number of tests that regressed.]

Figure 5b
NULLSTONE Results Comparing gcc with DEC C Compiler, Showing All Regressions of 10% or Worse

Biographies

Kevin W. Harris
Kevin Harris is a consulting software engineer at Compaq, currently working in the DEC C and C++ Development Group. He has 21 years of experience working on high-performance compilers, optimization, and parallel processing. Kevin graduated Phi Beta Kappa in mathematics from the University of Maryland and joined Digital Equipment Corporation after earning an M.S. in computer science from the Pennsylvania State University. He has made major contributions to the DIGITAL Fortran, C, and C++ product families. He holds patents for techniques for exploiting performance of shared memory multiprocessors and register allocation. He is currently responsible for performance issues in the DEC C and DIGITAL C++ product families. He is interested in CPU architecture, compiler design, large- and small-scale parallelism and its exploitation, and software quality issues.

Hemant G. Rotithor
Hemant Rotithor received B.S., M.S., and Ph.D. degrees in electrical engineering in 1979, 1981, and 1989, respectively. He worked on C and C++ compiler performance issues in the Core Technology Group at Digital Equipment Corporation for three years. Prior to that, he was an assistant professor at Worcester Polytechnic Institute and a development engineer at Philips. Hemant is a member of the program committee of The 10th International Conference on Parallel and Distributed Computing and Systems (PDCS '98). He is a senior member of the IEEE and a member of Eta Kappa Nu, Tau Beta Pi, and Sigma Xi. His interests include computer architecture, performance analysis, digital design, and networking. Hemant is currently employed at Intel Corporation.

Mark W. Davis
Mark Davis is a senior consulting engineer in the Core Technology Group at Compaq. He is a member of Compaq's GEM Compiler Back End team, focusing on performance issues. He also chairs the DIGITAL UNIX Calling Standard Committee. He joined Digital Equipment Corporation in 1991 after working as Director of Compilers at Stardent Computer Corporation. Mark graduated Phi Beta Kappa in mathematics from Amherst College and earned a Ph.D. in computer science from Harvard University. He is co-inventor on a pending patent concerning 64-bit software on OpenVMS.


August G. Reinig

Alias Analysis in the DEC C and DIGITAL C++ Compilers

During alias analysis, the DEC C and DIGITAL C++
compilers use source-level type information to
improve the quality of code generated. Without
the use of type information, the compilers
would have to assume that any assignment
through a pointer expression could modify any
pointer-aliased object. In contrast, through the
use of type information, the compilers can
assume that such an assignment can modify
only those objects whose type matches that
referenced by the pointer.


When two or more address expressions reference the same memory location, these address expressions are aliases for each other. A compiler performs alias analysis to detect which address expressions do not reference the same memory locations. Good alias analysis is essential to the generation of efficient code. Code motion out of loops, common subexpression elimination, allocation of variables to registers, and detection of uninitialized variables all depend upon the compiler knowing which objects a load or a store operation could reference.
Address expressions may be symbol expressions or pointer expressions. In the C and C++ languages, a compiler always knows what object a symbol expression references. The same is not true with pointer expressions. Determining which objects a pointer expression may reference is an ongoing topic of research.
Most of the research in this area focuses on the use of techniques that track which object a pointer expression might point to. When these techniques cannot make this determination, they assume that the pointer expression points to any object whose address has been taken. These techniques generally ignore the type information available to the source program. The best techniques perform interprocedural analysis to improve their accuracy. Although effective, the cost of analyzing a complete program can make this analysis impractical.
In contrast, the DEC C and DIGITAL C++ compilers use high-level type information as they perform alias analysis on a routine-by-routine basis. Limiting alias analysis to within a routine reduces its cost, albeit at the cost of reducing its effectiveness.
The use of this type information results in slight improvements in the performance of some standard-conforming C and C++ programs. These improvements come at little expense in terms of compilation time. There is, however, a risk that the use of this type information on nonstandard-conforming C or C++ programs may result in the compiler producing code that exhibits unexpected behavior.
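
The dependence of code motion on alias information can be made concrete with a small sketch. The two routines below are hypothetical illustrations, not code from the DEC C compiler or its test suites. The first loads *scale inside the loop; the second hoists that load by hand, which is the transformation an optimizer may perform only when it can show that no store to dst[i] modifies *scale.

```c
#include <assert.h>
#include <stddef.h>

/* Straightforward loop. The store to dst[i] might modify *scale if scale
   points into dst, so the result must be as if *scale were reloaded on
   every iteration. */
void scale_all(int *dst, const int *src, const int *scale, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * *scale;
}

/* Hand-hoisted variant. It computes the same result as scale_all only
   when scale does not alias any element of dst. */
void scale_all_hoisted(int *dst, const int *src, const int *scale, size_t n)
{
    assert(dst != NULL && src != NULL && scale != NULL);
    const int k = *scale;   /* loop-invariant load moved out of the loop */
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * k;
}
```

When scale points at a separate object, the two routines agree. When scale points at dst[0], they diverge, which is why the compiler needs alias analysis before it may hoist the load itself.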

The C and C++ Type Systems

Research available on the use of type information during alias analysis involves languages other than C and C++. Traditionally, C is a weakly typed language. A pointer that references one type may actually point to an object of a different type. For this reason, most alias-analysis techniques ignore type information when analyzing programs written in C.
The ISO Standard for C defines a much stronger typing system. In ISO Standard C, a pointer expression can access an object only if the type referenced by the pointer meets the following criteria:

• It is compatible with the type of the object, ignoring type qualifiers and signedness.

• It is compatible with the type of a member of an aggregate or union or submembers thereof, ignoring type qualifiers and signedness.

• It is the char type.

Thus, in Figure 1, the pointer p can point to A, B, C, or S (through S.sub.m) but not to T or F. The pointer q, being a pointer to char, can refer to any of A, B, C, S, T, or F.
The proposed ISO Standard for C++ defines a similar typing system for C++. The strength of the Standard C and C++ type systems allows the DEC C and DIGITAL C++ compilers to use type information during alias analysis.
Many existing C applications do not conform to the Standard C typing rules. They use cast expressions to circumvent the Standard C type system. To support these applications, the DEC C compiler has a mode whereby it ignores type information during alias analysis. The DIGITAL C++ compiler also has such a mode. This mode exists to support those C++ programmers who circumvent the C++ type system.

The Side-effects Package

The DEC C and DIGITAL C++ compilers are GEM compilers. The GEM compiler system includes a highly optimizing back end. This back end uses the GEM data access model to determine which objects a load or a store may access. GEM compiler front ends augment the GEM data access model with a side-effects package, i.e., an alias-analysis package. The side-effects package provides the GEM optimizer additional information about loads and stores using language-specific information otherwise unavailable to the GEM optimizer.
The DEC C and DIGITAL C++ compilers share a common side-effects package. The DEC C and C++ side-effects package

• Determines which symbols, types, and parts thereof a routine references

• Determines the possible side effects of these references

• Answers queries from the GEM optimizer regarding the effects and dependencies of memory accesses
Preserving Memory Reference Information

The DEC C and DIGITAL C++ front ends perform lexical analysis and parsing of the source program, generating a GEM intermediate language (GEM IL) graph representation of the source program.6 A tuple is a node in the GEM IL and represents an operation in the source program.
As the DEC C and DIGITAL C++ front ends generate GEM IL, they annotate each fetch (read) and store (write) tuple with information describing the object being read or written. The front ends annotate fetches and stores of symbols with information about the symbol. They annotate fetches and stores through pointers with information about the type the pointer references. The annotation information includes information describing exactly which bytes of the symbol or type the tuple accesses. This allows the side-effects package to differentiate between access to two different members of a structure.

int A;
signed int const B;
unsigned int volatile C;
struct {
    struct {
        int m;
    } sub;
} S;
struct {
    short z;
} T;
float F;
int *p;
char *q;

Figure 1
Code Fragment Associated with the Explanation of the Standard C Aliasing Rules

Arrays
Neither the DEC C nor the DIGITAL C++ front end differentiates between accesses to different elements of an array. Both assume that all array accesses are to the first element of the array. The GEM optimizer does extensive analysis of array references.7 Being flow insensitive, the DEC C and C++ side-effects package can, at best, differentiate between two array references that both use constant indices. The GEM optimizer can do much more.
What the GEM optimizer cannot do, however, is determine that an assignment through a pointer to an int does not change any value in an array of doubles. This is the purpose of the DEC C and C++ side-effects package. Mapping all array accesses to access the first element of an array does not hinder this purpose and simplifies alias analysis of arrays.
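
The byte-range annotation described above can be illustrated with a short sketch. The Annotation record, the MEMBER_ANNOTATION macro, and ranges_overlap are hypothetical names chosen for this example; the actual GEM IL annotation format is not given in the text. The sketch records the first and last byte each access touches and shows that accesses to two different members of a structure have disjoint ranges.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct S { int x; int y; };

/* Hypothetical annotation attached to a fetch or store tuple. */
typedef struct {
    const char *symbol;   /* NULL for an access through a pointer */
    const char *type;     /* type of the enclosing object */
    size_t      start;    /* first byte accessed */
    size_t      end;      /* last byte accessed (inclusive) */
} Annotation;

/* Build the annotation for an access to member m of an object of type T. */
#define MEMBER_ANNOTATION(sym, T, m) \
    { (sym), #T, offsetof(T, m), offsetof(T, m) + sizeof(((T *)0)->m) - 1 }

/* Two accesses to the same object can touch common bytes only if their
   inclusive byte ranges intersect. */
bool ranges_overlap(const Annotation *a, const Annotation *b)
{
    assert(a->start <= a->end && b->start <= b->end);
    return a->start <= b->end && b->start <= a->end;
}
```

An access to the whole structure overlaps both member accesses, while the accesses to x and y do not overlap each other.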

Tuple Annotation Example
For the program fragment in Figure 2, the DEC C and DIGITAL C++ front ends generate the annotated tuples displayed in Table 1.

Source         Tuple         Symbol   Type       Start Offset   End Offset
p->x = 3       Store p->x    none     struct S   0              3
v1.y = 3       Store v1.y    v1       struct S   4              7
v2 = v1        Fetch v1      v1       struct S   0              7
               Store v2      v2       struct S   0              7
d[i] = d[0]    Fetch d[0]    d        double     0              7
               Fetch i       i        int        0              3
               Store d[i]    d        double     0              7

... an object. To minimize the number of effects classes under consideration, the side-effects package creates effects classes for only those object regions referenced ... if two members occupy exactly the same memory locations, a single effects class represents both members.
For the program fragment in Figure 3, the side-effects package creates the effects classes displayed in Table 2.
There is only one effects class for *uip and *ip since uip and ip may point to the same object. There are no effects classes for bytes 0 through 3 of s and struct S as there are no references to s.x or sp->x. By allocating effects classes for only those object regions referenced within the routine, the side-effects package greatly reduces both the number of effects classes and the time required to perform alias analysis.
In the traditional C type system, a pointer expression may point to anything, regardless of type. To represent this, the side-effects package creates exactly one effects class to represent allocated objects. It ignores the type and the start- and end-offset information.

struct S {
    int x;
    struct T {
        int y;
        float z;
    } t;
} s;

struct S *sp;
signed int *ip;
unsigned int *uip;
float *fp;

*uip = *ip;
*fp = 2;
sp->t = s.t;
sp->t.y = 2;
s = *sp;

Figure 3
Code Fragment Associated with Allocating Effects Classes
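
The idea of creating effects classes only for the regions a routine actually references can be sketched as a small lookup-or-create table. The names EffectsClass, EffectsTable, and effects_class_for are hypothetical, and the fixed-size array is an assumption made to keep the example short. The owner string stands for a symbol name or for a type name with qualifiers and signedness already removed, so that *uip and *ip both map to the region int<0,3>.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_CLASSES 32

/* One effects class: a byte region of a symbol or of a type. */
typedef struct {
    const char *owner;    /* symbol name, or type name without qualifiers */
    size_t      start;    /* first byte of the region */
    size_t      end;      /* last byte of the region (inclusive) */
} EffectsClass;

typedef struct {
    EffectsClass cls[MAX_CLASSES];
    size_t       count;
} EffectsTable;

/* Return the index of the effects class for this region, creating the class
   the first time the region is referenced. Identical regions share a class. */
size_t effects_class_for(EffectsTable *t, const char *owner,
                         size_t start, size_t end)
{
    for (size_t i = 0; i < t->count; i++)
        if (strcmp(t->cls[i].owner, owner) == 0 &&
            t->cls[i].start == start && t->cls[i].end == end)
            return i;
    assert(t->count < MAX_CLASSES);
    t->cls[t->count] = (EffectsClass){ owner, start, end };
    return t->count++;
}
```

Regions that are never passed to effects_class_for, such as bytes 0 through 3 of s in Figure 3, never receive a class, which keeps the table small.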

Using the traditional C type system, for the program fragment shown in Figure 3, the side-effects package creates the effects classes displayed in Table 3. Here, effects class 7 replaces effects classes 7 through 11 in Table 2. All the differentiation by types disappears.

Table 2
Effects Classes Using the Standard C Type Rules

Effects Class   Type or Symbol   Start Offset   End Offset   Source Generating Effects Class
1               s                0              11           s
2               s                4              11           s.t
3               sp               0              7            sp
4               fp               0              7            fp
5               ip               0              7            ip
6               uip              0              7            uip
7               struct S         0              11           *sp
8               struct S         4              11           sp->t
9               struct S         4              7            sp->t.y
10              float            0              3            *fp
11              int              0              3            *uip and *ip

Table 3
Effects Classes Using the Traditional C Type Rules

Effects Class   Type or Symbol   Start Offset   End Offset   Source Generating Effects Class
1               s                0              11           s
2               s                4              11           s.t
3               sp               0              7            sp
4               fp               0              7            fp
5               ip               0              7            ip
6               uip              0              7            uip
7               char             0                           *sp, sp->t, *uip, sp->t.y, *fp, *ip

Effects-class Signatures
Having created the effects classes, the side-effects package associates a signature with each effects class. In addition, it associates an effects-class signature with each tuple within the routine and each symbol referenced within the routine.
An effects-class signature records the possible side effects of referencing an effects class. A reference to one effects class may reference another effects class. The effects class for a load through a pointer to an int indicates that the load references an allocated int object. The pointer to an int may actually reference a pointer-aliased int symbol or an int member of a structure or union.
An effects-class signature is a subset of all the effects classes that might be referenced by a tuple. There is only one requirement for an effects-class signature: If two tuples may refer to the same part of memory, the intersection of their respective effects-class signatures must be non-null. If two tuples cannot refer to the same part of memory, it is desirable that the intersection of their effects-class signatures is null. An empty intersection leads to more optimization opportunities.
The most obvious rule for building an effects-class signature is to include in it all the effects classes that might be touched by a reference to the effects class. This leads to suboptimal code in cases such as that shown in Figure 4.
There are three effects classes for this code, s<0,3>, s<4,7>, and s<0,7>, generated by references to s.x, s.y, and s, respectively. If the effects-class signature for s<0,3> includes both s<0,3> and s<0,7> and the effects-class signature for s<4,7> includes both s<4,7> and s<0,7>, then the intersection of these two effects-class signatures is non-null. This falsely indicates that s.x and s.y may refer to the same memory location. This forces GEM to generate code that stores s.y after storing to s.x.
The DEC C and C++ side-effects package uses more effective rules for building effects-class signatures. These rules offer more optimization opportunities while preserving necessary dependency information.

Effects-class Signatures for Symbols
If an effects class represents a region A of a symbol, its signature includes itself. Its signature also includes all effects classes representing regions of the symbol wholly contained within A. Finally, it includes any effects class representing a region of the symbol that partially overlaps A. It does not include effects classes representing regions of the symbol that do not overlap A or that wholly contain A. Table 4 gives the symbol effects-class signatures for the three effects classes under discussion.
The inclusion of subregions in an effects-class signature means that references to symbols interfere with references to members therein and vice versa. Excluding super-regions in an effects-class signature means that references to two separate members of a symbol do not interfere with each other. In Table 4, the effects-class signatures for s<0,3> and s<4,7> do not interfere with each other. Both signatures interfere with the effects-class signature for s<0,7>.
The inclusion of effects classes representing partially overlapping regions of a symbol allows for the correct representation of the side effects of referencing submembers of complex unions.
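
The rule for symbol effects-class signatures (include the class itself, wholly contained regions, and partially overlapping regions, but exclude disjoint regions and regions that wholly contain it) can be written as a short predicate. The names Region and in_symbol_signature are hypothetical, and the sketch assumes both regions belong to the same symbol and use inclusive byte offsets.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Inclusive byte range within one symbol. */
typedef struct { size_t start, end; } Region;

static bool overlaps(Region a, Region b)
{
    return a.start <= b.end && b.start <= a.end;
}

static bool contains(Region outer, Region inner)
{
    return outer.start <= inner.start && inner.end <= outer.end;
}

/* Does the signature of the effects class for region `a` include the
   effects class for region `other` of the same symbol? */
bool in_symbol_signature(Region a, Region other)
{
    assert(a.start <= a.end && other.start <= other.end);
    if (!overlaps(a, other))
        return false;                  /* disjoint regions are excluded */
    if (contains(other, a) && !contains(a, other))
        return false;                  /* strict super-regions are excluded */
    return true;                       /* itself, contained, or partial overlap */
}
```

Applied to s<0,3>, s<4,7>, and s<0,7>, the predicate yields signatures in which s<0,3> and s<4,7> each contain only themselves while s<0,7> contains all three, so the two member signatures do not intersect each other but both intersect the signature of the whole symbol.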

struct S {
    int x;
    int y;
} s;

s.x = ...;
s.y = ...;
return s;

Figure 4
Example of Problematic Code for the Naive Rule for Building Effects-class Signatures

Table 4
Symbol Effects-class Signatures

Effects Class   Effects-class Signature
s<0,3>          s<0,3>
s<4,7>          s<4,7>
s<0,7>          s<0,3>, s<4,7>, s<0,7>

Effects-class Signatures for Types
If an effects class represents a region of a type, the contents of its signature depends upon the type. If the type is the char type, the effects-class signature contains all the effects classes representing regions of other types or pointer-aliased symbols. This reflects the C and C++ type rules, which state that a pointer to a char can point to anything.
If the type is some type T other than char, the effects-class signature contains effects classes representing:

• Those regions of T that overlap the region of T the effects class represents, using the same overlap rules as for symbols

• Any region of a pointer-aliased symbol whose type is compatible to T, ignoring type qualifiers and signedness

• A region of a pointer-aliased aggregate or union symbol that contains a member or submember whose type is compatible to T, ignoring type qualifiers and signedness

• A region of an aggregate or union type that contains a member or submember whose type is compatible to T, ignoring type qualifiers and signedness

Table 5 gives the signatures for the effects classes in Table 2, assuming that the symbol s is pointer aliased.
Including the effects classes of symbols in the effects-class signatures of types records the interference of references through pointers with references to pointer-aliased symbols. In Figure 3, the pointer uip points to an unsigned int. The member s.t.y has type int. Thus, uip may point to s.t.y. The member s.t contains s.t.y. Thus, the signature for the effects-class int<0,3> contains the effects-class s<4,11>. This means that the load of s.t depends upon the store through uip.
Including the effects classes of types in the signatures of the effects classes of other types records the interference of references through a pointer with references through pointers to other types. In Figure 3, the pointer fp points to a float object. The member sp->t.z has type float. Thus, fp may point to sp->t.z. The member sp->t contains sp->t.z. Thus, the signature for the effects-class float<0,3> contains the effects-class struct S<4,11>. This reflects the fact that the store to sp->t.y depends upon the store through fp, i.e., it must occur after the store through fp.
Even though the signature for the effects-class float<0,3> contains the effects-class struct S<4,11> (sp->t), it does not contain the effects-class struct S<4,7> (sp->t.y). There is no float member of struct S whose position within struct S overlaps bytes 4 through 7 of struct S. There is a float member of struct S, namely z, whose position within struct S overlaps bytes 4 through 11 of struct S. The signature for the effects-class float<0,3> would not contain the effects-class s<0,3> if it existed. There is no float member of s whose position overlaps bytes 0 through 3 of s.

Table 5
Type Effects-class Signatures

Effects Class Number   Effects Class     Effects-class Signature
1                      s<0,11>           1, 2
2                      s<4,11>           2
3                      sp<0,7>           3
4                      fp<0,7>           4
5                      ip<0,7>           5
6                      uip<0,7>          6
7                      struct S<0,11>    1, 2, 7, 8, 9
8                      struct S<4,11>    1, 2, 8, 9
9                      struct S<4,7>     1, 2, 9
10                     float<0,3>        1, 2, 7, 8, 10
11                     int<0,3>          1, 2, 7, 8, 9, 11

Additional Effects-class Signatures
The side-effects package creates a special effects-class signature representing the side effects of a call. A called procedure may reference the following:

•

•

Responding to Optimizer Queries
During optimization, the optimizer makes two types of queries to the side-effects analysis routines: dominator-based queries and nondominator-based queries.
When doing nondominator-based optimizations, the optimizer uses a bit vector to represent those objects a write may change (its effects). A similar bit vector represents those objects whose value a read may fetch (its dependencies). Each bit in the bit vector represents an effects class. If a tuple's effects-class signature contains an effects class, that effects class's bit is set in the tuple's bit vector. The optimizer uses the union of the bit vectors associated with a set of tuples to represent the combined effects or dependencies of those tuples.
Dominator-based queries involve finding the nearest dominating tuple that might write to the same memory location as the tuple in question. Tuple A dominates tuple B if every path from the start of the routine to B goes through A.8 If both tuples A and C dominate B, tuple A is the nearer dominator if C dominates A.
When doing dominator-based optimizations, the side-effects package represents the tuples in the current dominator chain as a stack, adding and removing tuples from the stack as GEM moves from one path in the routine's dominator tree to another. Searching a single stack for the nearest dominating tuple that might write the same memory as the tuple in question references could lead to O(N^2) performance, where N is the number of tuples in the dominator chain. This worst-case behavior occurs when none of the tuples in a dominator chain affects any subsequent tuple in the chain. Each time the side-effects package searches the stack, it examines all the tuples in the stack.
To avoid iliis, ilie DEC C and C++ side-effects pack
age creates a stack for each e ffects class. When pushing
a tuple, the side-effects package pushes the tuple on
each stack associated with an e fTects class in the tuple's
effects-class signature. When the G E M optimizer tells
th e side -effects package to find the nearest domina ti na
write for a tuple, the side-effects package need onl

�

choose the nearest of those tu ples that are on the top

of the stacks associated with ilie tuple's effects-class
signature . It need only look at the top of each stack,
because a tuple wou l d not be in tJ1e stack u n less it
mi ght affect objects i n the e ffects class associated with
tJ1e stack.
The m ultistack worst-case behavior is O(NC). There

Any pointer-aliased symbol ( by means of a refer

are C separate stacks, one for each effects class. The

ence through a pointer)

effects-class signature for each effects class may con

Any allocated object ( by means of a reference

tam all the other effects classes. This would mean that

ilirough a pointer)

each of the N tuples in the domin ator chain would

•

Any nonlocal symbol ( by means of direct access)

•

Any local static symbol ( by means of recursion)

The effects signature for a call i ncl udes all the effects
classes representing these objects .

appear in each of ilie stacks.
Although the worst-case behavior for the m u l tistack
case is no better than the single-stack case ( C may be

e uaJ to N ), in practice there are often more tL;ples
�
Withm a routine than e ffects classes. Furthermore )
Digital Technical Journal

Vol . 10 No. 1

1 998

53

effects-class signatures often contain a small number of effects
classes. A small number of effects classes in an effects-class
signature means that there are a small number of stacks to consider.
Choosing the nearest dominator from among the top tuples on these
stacks requires examining only a small number of tuples.

Cost of Using Type Information

When compiling all of the SPECint95 test suite⁹ using high
optimization, alias analysis accounts for approximately 5 percent of
the compilation time. The use of Standard C type rules during alias
analysis increases compilation time by less than 0.2 percent (time
measured in number of cycles consumed by the compiler as reported by
Digital Continuous Profiling Infrastructure [DCPI]¹⁰). The increase in
compilation time varies from program to program but never exceeds
0.5 percent. Handling the extra effects classes generated by using
Standard C type aliasing information accounted for most of the
increase.

Potentially, the cost of including type-aliasing information could be
huge. Calculating which effects classes a reference through a char *
pointer could touch is straightforward as shown by the algorithm in
Figure 5. A much more complicated process is required to calculate
which effects classes could be touched by a reference through a
pointer to a type other than char. The algorithm in Figure 6 performs
this process.

Fortunately, the innermost section of this loop is rarely executed.
The innermost section executes only if a routine references a
structure either through a pointer or a pointer-aliased symbol, that
structure contains a substructure, and the routine references the
substructure through a pointer.

Effectiveness

The benchmark programs from the SPECint95 suite offer some convenient
test cases for measuring the effectiveness of type-based alias
analysis. The sources are readily available and portable. The programs
conform to alias rules established by the American National Standards
Institute (ANSI) and are compute intensive. Unfortunately, they do not
contain floating-point calculations. This reduces the number of
different types used in the programs. Type-based alias analysis works
best when there are many different types in use.

Three of the SPECint95 programs show no improvement when compiled
using the Standard C typing rules as opposed to using the traditional
C typing rules. These programs, namely compress, go, and li, do not
use many different types and pointers to them. When all the pointers
in a program are pointers to ints (go), there is only one effects
class for all pointer accesses. Because the compiler has no way to
differentiate among the objects touched by a dereference of a pointer
expression, it generates identical code for these programs, regardless
of the type rules used. The generated code for li differs only
slightly and only for infrequently executed routines.

Changes in generated code for the remaining five benchmarks are more
prevalent. Two benchmarks, ijpeg and perl, show a small reduction in
the number of loads executed but no meaningful reduction in the total
number of instructions executed. The other three SPECint95 benchmarks
show varying degrees of reduction in both the number of loads executed
(see Table 6) and the total number of instructions executed (see
Table 7).

foreach pointer-aliased symbol
    foreach effects class representing a region of the symbol
        add that effects class to the effects-class signature for char

Figure 5
Calculation of the Effects-class Signature of the Type char *
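
The loop in Figure 5 can be sketched in C as follows. This is only an illustration under stated assumptions: the Symbol record, the 32-bit Signature encoding (one bit per effects class, echoing the bit vectors described under Responding to Optimizer Queries), and the may_interfere test are hypothetical and are not taken from GEM or the DEC C side-effects package.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical encoding: bit i set means effects class i is in the signature. */
typedef uint32_t Signature;

/* Hypothetical symbol record; not a GEM or DEC C data structure. */
typedef struct {
    int pointer_aliased;        /* nonzero if the symbol can be reached through a pointer */
    const int *region_classes;  /* effects-class numbers of the symbol's regions */
    size_t region_count;
} Symbol;

/* Figure 5: add every region of every pointer-aliased symbol to the char signature. */
Signature char_signature(const Symbol *syms, size_t nsyms)
{
    Signature sig = 0;
    for (size_t i = 0; i < nsyms; i++) {
        if (!syms[i].pointer_aliased)
            continue;
        for (size_t r = 0; r < syms[i].region_count; r++) {
            int cls = syms[i].region_classes[r];
            assert(cls >= 0 && cls < 32);   /* must fit in the 32-bit vector */
            sig |= (Signature)1 << cls;
        }
    }
    return sig;
}

/* Assumed query: two references may interfere if their bit vectors intersect. */
int may_interfere(Signature a, Signature b)
{
    return (a & b) != 0;
}
```

With a pointer-aliased symbol whose regions are effects classes 1 and 2 (as for s in Table 5), the resulting signature has exactly bits 1 and 2 set; symbols that are not pointer aliased contribute nothing.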

foreach pointer-aliased symbol or type referenced through a pointer
    foreach member therein
        if the member's type is referenced through a pointer
            foreach effects class representing a region of the member's type
                foreach effects class representing a region of the symbol
                        or type referenced through a pointer
                    if the two effects class regions overlap
                        add the symbol's or pointer's effects class to the
                        effects-class signature associated with the effects
                        class representing the member's type

Figure 6
Calculation of the Effects-class Signature for Types Other Than char
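
The overlap test at the heart of Figure 6 can be sketched in C as below. This is a simplified illustration, not the package's implementation: it handles a single member against the regions of one enclosing type, and the Region record and function names are hypothetical. The usage note assumes that z occupies bytes 8 through 11 of struct S and y occupies bytes 4 through 7, which is consistent with the discussion of Table 5 but is an inference, since Figure 3 is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t Signature;     /* bit i set => effects class i is in the signature */

/* Hypothetical region record: an effects class covering bytes first..last (inclusive). */
typedef struct {
    int effects_class;
    unsigned first, last;
} Region;

/* Two inclusive byte ranges overlap when each starts no later than the other ends. */
int regions_overlap(unsigned first_a, unsigned last_a,
                    unsigned first_b, unsigned last_b)
{
    return first_a <= last_b && first_b <= last_a;
}

/* Add to member_sig every enclosing-type region that overlaps the member's bytes. */
Signature add_overlapping(Signature member_sig,
                          const Region *regions, size_t nregions,
                          unsigned member_first, unsigned member_last)
{
    for (size_t i = 0; i < nregions; i++) {
        assert(regions[i].effects_class >= 0 && regions[i].effects_class < 32);
        if (regions_overlap(regions[i].first, regions[i].last,
                            member_first, member_last))
            member_sig |= (Signature)1 << regions[i].effects_class;
    }
    return member_sig;
}
```

Given the three struct S regions of Table 5 (classes 7, 8, and 9) and a float member at bytes 8 through 11, the sketch adds classes 7 and 8 to the float signature but not class 9, which agrees with the struct S entries in the float<0,3> row of Table 5.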


Table 6
Number of Loads Executed by the Select SPECint95 Benchmarks

                  Millions of Loads         Millions of Loads
SPEC Benchmark    Using Type Information    without Type Information    Percent Reduction
gcc               10,268                    10,365                      0.9
ijpeg             16,853                    16,888                      0.2
m88ksim           13,889                    14,157                      1.9
perl              11,260                    11,296                      0.3
vortex            18,994                    19,207                      1.1

Table 7
Number of Instructions Executed by the Select SPECint95 Benchmarks

                  Millions of Instructions  Millions of Instructions
SPEC Benchmark    Using Type Information    without Type Information    Percent Reduction
gcc               42,830                    42,935                      0.2
ijpeg             82,844                    82,834                      0.0
m88ksim           72,490                    73,155                      0.9
perl              45,219                    45,252                      0.1
vortex            80,093                    80,607                      0.6

The load and instruction counts are those reported by using Atom's
pixie tool on the SPECint95 binaries to generate pixstat data.¹¹,¹²
The compiler used was a development C compiler. All compilations used
the following switches: -fast, -O4, -arch ev56, and -inline speed. The
compilations using the Standard C type system used the -ansi_alias
switch. The compilations using the traditional C type system used the
-noansi_alias switch. The benchmark binaries were run using the
reference data set.

DCPI¹⁰ measurements of the reduction in the number of cycles consumed
by these SPECint95 benchmarks showed no consistent reductions.
Run-to-run variability in the data collected swamped any cycle time
reductions that might have occurred. Similarly, measurements of gains
in SPECint95⁹ results due to the use of type information during alias
analysis showed no significant changes.
Changes in Generated Code

The code-generation changes one sees in the SPECint95 benchmarks are
exactly what one would expect. The use of type information during
alias analysis reduces the number of redundant loads. An example of
this occurs in ijpeg, which contains the code sequence:

    main->rowgroup_ctr = (JDIMENSION) (cinfo->min_DCT_scaled_size + 1);
    main->rowgroups_avail = (JDIMENSION) (cinfo->min_DCT_scaled_size + 2);

in process_data_context. Using the traditional C type system, the
compiler must assume that main->rowgroup_ctr is an alias for
cinfo->min_DCT_scaled_size. Thus, it must generate code that loads
cinfo->min_DCT_scaled_size twice. The Standard C type system allows
the compiler to generate only one load of cinfo->min_DCT_scaled_size.
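
The effect of removing the second load can be shown at the source level. The sketch below uses hypothetical, simplified stand-ins for the ijpeg structures (the struct names and the choice of long for the counters are assumptions made so that the stored and loaded types are plainly different); the second function is a hand-written equivalent of what a compiler may produce when it knows the store cannot change the loaded field.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the ijpeg structures. */
struct controller { long rowgroup_ctr; long rowgroups_avail; };
struct decompress { int min_DCT_scaled_size; };

/* As written: the source reads cinfo->min_DCT_scaled_size twice. */
void set_counters(struct controller *ctl, const struct decompress *cinfo)
{
    assert(ctl != NULL && cinfo != NULL);
    ctl->rowgroup_ctr = (long)(cinfo->min_DCT_scaled_size + 1);
    ctl->rowgroups_avail = (long)(cinfo->min_DCT_scaled_size + 2);
}

/* Hand-written equivalent of the single-load code; valid only if the store
   to ctl->rowgroup_ctr cannot change cinfo->min_DCT_scaled_size. */
void set_counters_one_load(struct controller *ctl, const struct decompress *cinfo)
{
    assert(ctl != NULL && cinfo != NULL);
    int size = cinfo->min_DCT_scaled_size;   /* the only load */
    ctl->rowgroup_ctr = (long)(size + 1);
    ctl->rowgroups_avail = (long)(size + 2);
}
```

The two functions agree whenever the two structures do not overlap in memory, which is what the Standard C type rules let the compiler assume for objects of unrelated types.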
Several of the benchmarks contain code similar to the following from
conversion_recipe in gcc:

    curr.next->list->opcode = -1;
    curr.next->list->cost = 0;
    curr.next->list->prev = 0;
    curr.next->list->to = from;

Using traditional C type rules, the compiler must generate four loads
of curr.next->list. The compiler must assume that the pointer
curr.next->list may point to itself, making curr.next->list->member an
alias for curr.next->list. The Standard C type rules allow the
compiler to assume that curr.next->list does not point to itself. This
allows the compiler to generate code that reuses the result of the
first load of curr.next->list, eliminating three redundant loads.
In another example in gcc, the use of Standard C type rules allows the
compiler to move a load outside a loop. The following loop occurs in
fixup_gotos:

    for (; lists; lists = TREE_CHAIN (lists))
      if (TREE_CHAIN (lists) == thisblock->data.block.outer_cleanups)
        TREE_ADDRESSABLE (lists) = 1;

Standard C type rules tell the compiler that the store generated by
TREE_ADDRESSABLE (lists) = 1 cannot modify
thisblock->data.block.outer_cleanups. This allows the compiler to
generate code that fetches thisblock->data.block.outer_cleanups once
before entering the loop. Using traditional C type rules, the compiler
must generate code that fetches thisblock->data.block.outer_cleanups
each time it traverses the loop.
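
The hoisting described above can be written out by hand. The following is a sketch under stated assumptions: struct node and struct block are hypothetical stand-ins for gcc's tree nodes and block data, and the functions return a count of marked nodes purely so the two versions can be compared.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for gcc's tree nodes and block data. */
struct node { struct node *chain; unsigned addressable : 1; };
struct block { struct node *outer_cleanups; };

/* As written: thisblock->outer_cleanups is read on every iteration. */
int mark_reloading(struct node *lists, const struct block *thisblock)
{
    int marked = 0;
    assert(thisblock != NULL);
    for (; lists; lists = lists->chain)
        if (lists->chain == thisblock->outer_cleanups) {
            lists->addressable = 1;
            marked++;
        }
    return marked;
}

/* Hand-hoisted equivalent: valid only if the bit-field store cannot
   modify thisblock->outer_cleanups. */
int mark_hoisted(struct node *lists, const struct block *thisblock)
{
    int marked = 0;
    assert(thisblock != NULL);
    struct node *cleanups = thisblock->outer_cleanups;   /* fetched once */
    for (; lists; lists = lists->chain)
        if (lists->chain == cleanups) {
            lists->addressable = 1;
            marked++;
        }
    return marked;
}
```

The store writes a bit field inside a node, while the hoisted load reads a pointer inside a block, so under the Standard C type rules the two accesses cannot refer to the same object and both versions behave identically.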
Not only can type information reduce the number of redundant loads, it
can reduce the number of redundant stores. In m88ksim, there are many
routines similar to the following:

    int ffirst (struct instruction *cmd, union opcode *ptr)
    {
        ptr->gen.opc1 = 0x3c;
        ptr->gen.dest = operands.value[0];
        ptr->gen.opc2 = cmd->opc.rrr;
        ptr->gen.src2 = operands.value[1];
        return (0);
    }

where opc1, dest, opc2, and src2 are bit fields sharing the same
32 bits (longword). Using traditional C typing rules, ptr->gen and
cmd->opc may be aliases for each other. Thus to implement the above
routine, the compiler must generate code that performs the following
actions:

• Load ptr->gen
• Update bit fields ptr->gen.opc1 and ptr->gen.dest
• Store ptr->gen
• Load cmd->opc.rrr
• Update bit fields ptr->gen.opc2 and ptr->gen.src2
• Store ptr->gen

Using Standard C typing rules, the compiler does not have to generate
the first store of ptr->gen. The assignments to ptr->gen.opc1 and
ptr->gen.dest cannot change cmd->opc.rrr. In this case, alias analysis
that is not type based would have a difficult time detecting that
ptr->gen and cmd->opc do not alias each other. M88ksim never calls
ffirst directly. It calls it by means of an array-indexed function
pointer.

A Note of Caution

Many C programs do not adhere to the Standard C aliasing rules.
Through the use of explicit casting and implicit casting, they access
objects of one type by means of pointers to other types. More
aggressive optimization by GEM combined with more detailed
alias-analysis information from the DEC C and C++ side-effects package
increasingly results in these programs exhibiting unexpected behavior
when the compiler uses Standard C aliasing rules.

Passing a pointer to one type to a routine that expects a pointer to
another type works as expected, until the GEM optimizer inlines the
called procedure. If the procedure is not inlined, the DEC C and C++
side-effects package must assume that the call conflicts with all
pointer accesses before and after the call. Once GEM inlines the
routine, the side-effects package is free to assume that references
using the inlined pointer do not conflict with references using the
pointer at the call site. The two pointers point to two different
types.

A recent example of this problem occurred in the gcc program in the
SPECint95 benchmark suite. All programs in this suite are supposed to
conform to the Standard C type-aliasing rules. Because of an
improvement to the GEM optimizer, this benchmark started to give
unexpected results. In rtx_alloc, gcc clears a structure by treating
it as an array of ints, assigning zero to each element of the array.
Subsequent to zeroing this structure, gcc assigns a value to one of
the fields in the structure. Through a series of valid optimizations
(given the incorrect type information), the resulting code did not
clear all the fields in the structure. This left uninitialized data in
the structure, resulting in gcc behaving in an unexpected manner.

To avoid potential problems, the DEC C compiler, by default, does not
use the Standard C type rules when performing alias analysis. The user
of the compiler has to explicitly assert that the program does follow
the Standard C type rules through the use of a command-line switch.

The DIGITAL C++ compiler does assume that the program it is compiling
adheres to the Standard C++ type rules. A user of the DIGITAL C++
compiler can use a command-line switch to inform the compiler that it
should use traditional C type rules when performing alias analysis.

Summary

Using Standard C type information during alias analysis does improve
the generated code for some C and C++ programs. The compilation cost
of using type information is small. Except for rare cases, performance
gains resulting from these code improvements are small. Any programs
compiled using type information during alias analysis must strictly
adhere to the Standard C and C++ aliasing rules. If not, the optimizer
may generate code that produces unexpected results.

Acknowledgments

The author would like to thank Dave Blickstein, Mark Davis, Neil
Faiman, Steve Hobbs, and Bill Noyce of the GEM team for their advice
and reviews of this work. Dave Blickstein and Neil Faiman also did
work in the GEM optimizer to ensure that the DEC C and C++
side-effects package had all the information it needed to do alias
analysis correctly and to ensure that the GEM optimizer effectively
used the information the side-effects package provided. Thanks also to
John Henning of the CSD Performance Group and Jeannie Lieb of the GEM
team for their help using the SPECint95 benchmark suite. A final word
of thanks goes to Bob Morgan for suggesting that I write this paper
and to my management for supporting my doing so.

References and Notes

1. R. Wilson and M. Lam, "Efficient Context-Sensitive Pointer Analysis
   for C Programs," Proceedings of the ACM SIGPLAN '95 Conference on
   Programming Language Design and Implementation, La Jolla, Calif.
   (June 1995): 1-12.
2. D. Coutant, "Retargetable High-Level Alias Analysis," Proceedings
   of the 13th Annual Symposium on Principles of Programming
   Languages, St. Petersburg Beach, Fla. (January 1986): 110-118.
3. A. Diwan et al., "Type-Based Alias Analysis," Proceedings of the
   1998 ACM SIGPLAN Conference on Programming Language Design and
   Implementation, Montreal, Canada (June 1998): 106-117.
4. Joint Technical Committee ISO/IEC JTC 1, "The C Programming
   Language," International Standard ISO/IEC 9899:1990, section 6.3
   Expressions.
5. "Working Paper for Draft Proposed International Standard for
   Information Systems-Programming Language C++," WG21/N1146,
   November 1997, section 3.10.
6. D. Blickstein et al., "The GEM Optimizing Compiler System," Digital
   Technical Journal, vol. 4, no. 4 (Special Issue, 1992): 121-136.
7. R. Crowell et al., "The GEM Loop Transformer," Digital Technical
   Journal, vol. 10, no. 2, accepted for publication.
8. A. Aho, R. Sethi, and J. Ullman, Compilers: Principles, Techniques,
   and Tools (Reading, Mass.: Addison-Wesley, 1986): 104.
9. Information about the SPEC benchmarks is available from the
   Standard Performance Evaluation Corporation at
   http://www.specbench.org/.
10. J. Anderson et al., "Continuous Profiling: Where Have All the
    Cycles Gone?" Proceedings of the Sixteenth ACM Symposium on
    Operating System Principles, Saint-Malo, France (October 1997):
    15-26.
11. A. Srivastava and A. Eustace, "ATOM: A System for Building
    Customized Program Analysis Tools," Proceedings of the ACM
    SIGPLAN '94 Conference on Programming Language Design and
    Implementation, Orlando, Fla. (June 1994): 196-205.
12. UMIPS-V Reference Manual (pixie and pixstats) (Sunnyvale, Calif.:
    MIPS Computer Systems, 1990).

Biography

August G. Reinig
August Reinig is a principal software engineer, currently working on
debugger support in the DIGITAL C++ compiler. In addition to his work
on the DEC C and C++ side-effects package, August implemented a
Java-based distributed test system for the DEC C and DIGITAL C++
compilers and a parallel build system for the DEC C and DIGITAL C++
compilers. The distributed test system simultaneously runs multiple
tests on different machines and is fault tolerant. Before joining the
DEC C and C++ team, he contributed to an advanced development
incremental compiler project, which led to two patents, "Method and
Apparatus for Software Testing Using a Testing Technique to Test
Compilers" and "Method and Apparatus for Testing Software." He earned
a B.S. in mathematics (magna cum laude) from Dartmouth College in 1980
and an M.S. in computer science from Harvard University in 1997. He is
a member of Phi Beta Kappa.

Philip H. Sweany
Steven M. Carr
Brett L. Huber

Compiler Optimization for Superscalar Systems: Global Instruction
Scheduling without Copies

The performance of instruction-level parallel systems can be improved
by compiler programs that order machine operations to increase system
parallelism and reduce execution time. The optimization, called
instruction scheduling, is typically classified as local scheduling if
only basic-block context is considered, or as global scheduling if a
larger context is used. Global scheduling is generally thought to give
better results. One global method, dominator-path scheduling,
schedules paths in a function's dominator tree. Unlike many other
global scheduling methods, dominator-path scheduling does not require
copying of operations to preserve program semantics, making this
method attractive for superscalar architectures that provide a limited
amount of instruction-level parallelism. In a small test suite for the
Alpha 21164 superscalar architecture, dominator-path scheduling
produced schedules requiring 7.3 percent less execution time than
those produced by local scheduling alone.

Many of today's computer applications require computation power not
easily achieved by computer architectures that provide little or no
parallelism. A promising alternative is the parallel architecture,
more specifically, the instruction-level parallel (ILP) architecture,
which increases computation during each machine cycle. ILP computers
allow parallel computation of the lowest level machine operations
within a single instruction cycle, including such operations as memory
loads and stores, integer additions, and floating-point
multiplications. ILP architectures, like conventional architectures,
contain multiple functional units and pipelined functional units; but,
they have a single program counter and operate on a single instruction
stream. Compaq Computer Corporation's AlphaServer system, based on the
Alpha 21164 microprocessor, is an example of an ILP machine.

To effectively use parallel hardware and obtain performance
advantages, compiler programs must identify the appropriate level of
parallelism. For ILP architectures, the compiler must order the single
instruction stream such that multiple, low-level operations execute
simultaneously whenever possible. This ordering by the compiler of
machine operations to effectively use an ILP architecture's increased
parallelism is called instruction scheduling. It is an optimization
not usually found in compilers for non-ILP architectures.

Instruction scheduling is classified as local if it considers code
only within a basic block and global if it schedules code across
multiple basic blocks. A disadvantage to local instruction scheduling
is its inability to consider context from surrounding blocks. While
local scheduling can find parallelism within a basic block, it can do
nothing to exploit parallelism between basic blocks. Generally, global
scheduling is preferred because it can take advantage of added program
parallelism available when the compiler is allowed to move code across
basic block boundaries. Tjaden and Flynn, for example, found
parallelism within a basic block quite limited. Using a test suite of
scientific programs, they measured an average parallelism of 1.8
within basic blocks. In similar experiments on scientific programs in
which the compiler moved code across basic block boundaries, Nicolau
and Fisher found parallelism that ranged from 4 to a virtually
unlimited number, with an average of 90 for the entire test suite.

Trace scheduling is a global scheduling technique that attempts to
optimize frequently executed paths of a program, possibly at the
expense of less frequently executed paths. Trace scheduling exploits
parallelism within sequential code by allowing massive migration of
operations across basic block boundaries during scheduling. By
addressing this larger scheduling context (many basic blocks), trace
scheduling can produce better schedules than techniques that address
the smaller context of a single block. To ensure the program semantics
are not changed by interblock motion, trace scheduling inserts copies
of operations that move across block boundaries. Such copies,
necessary to ensure program semantics, are called compensation copies.

The research described here is driven by a desire to develop a global
instruction scheduling technique that, like trace scheduling, allows
operations to cross block boundaries to find good schedules and that,
unlike trace scheduling, does not require insertion of compensation
copies. Like trace scheduling, DPS first defines a multiblock context
for scheduling and then uses a local instruction scheduler to treat
the larger context like a single basic block. Such a technique
provides effective schedules and avoids the performance cost of
executing compensation copies. The global scheduling technique
described here is based on the dominator relation* among the basic
blocks of a function and is called dominator-path scheduling (DPS).

* A basic block, D, dominates another block, B, if every path from the
root of the control-flow graph for a function to B must pass
through D.

Local Instruction Scheduling

Since DPS relies on a local instruction scheduler, we begin with a
brief discussion of the local scheduling problem. As the name implies,
local instruction scheduling attempts to maximize parallelism within
each basic block of a function's control flow graph. In general, this
optimization problem is NP-complete. However, in practice, heuristics
achieve good results. (Landskov et al. give a good survey of early
instruction scheduling algorithms. Allan et al. describe how one might
build a retargetable local instruction scheduler.) List scheduling is
a general method often used for local instruction scheduling. Briefly,
list scheduling typically requires two phases. The first phase builds
a directed acyclic graph (DAG), called the data dependence DAG (DDD),
for each basic block in the function. DDD nodes represent operations
to be scheduled. The DDD's directed edges indicate that a node X
preceding a node Y constrains X to occur no later than Y. These DDD
edges are based on the formalism of data dependence analysis. There
are three basic types of data dependence, as described by Padua
et al.

• Flow dependence, also called true dependence or data dependence. A
  DDD node M2 is flow dependent on DDD node M1 if M1 executes before
  M2 and M1 writes to some memory location read by M2.
• Antidependence, also called false dependence. A DDD node M2 is
  antidependent on DDD node M1 if M1 executes before M2 and M2 writes
  to a memory location read by M1, thereby destroying the value needed
  by M1.
• Output dependence. A DDD node M2 is output dependent on DDD node M1
  if M1 executes before M2 and M2 and M1 both write to the same
  location.

To facilitate determination and manipulation of data dependence, the
compiler maintains, for each DDD node, a set of all memory locations
used (read) and all memory locations defined (written) by that
particular DDD node.

Once the DDD is constructed, the second phase begins when list
scheduling orders the graph's nodes into the shortest sequence of
instructions, subject to (1) the constraints in the graph, and (2) the
resource limitations in the machine (i.e., a machine is typically
limited to holding only a single value at any time). In general list
scheduling, an ordered list of tasks, called a priority list, is
constructed. The priority list takes its name from the fact that tasks
are ranked such that those with the highest priority are chosen first.
In the context of local instruction scheduling, the priority list
contains DDD nodes, all of whose predecessors have already been
included in the schedule being constructed.

Expressions, Statements, and Operations

Within the context of this paper, we discuss algorithms for code
motion. Before going further, we need to ensure common understanding
among our readers for our use of terms such as expressions,
statements, and operations. To start, we consider a computer program
to be a list of operations, each of which (possibly) computes a
right-hand side (rhs) value and assigns the rhs value to a memory
location represented by a left-hand side (lhs) variable. This can be
expressed as

    A ← E

where A represents a single memory location and E represents an
expression with one or more operators and an appropriate number of
operands. During different phases of a compiler, operations might be
represented as

• Source code, a high-level language such as C
• Intermediate statements, a linear form of three-address code such as
  quads or n-tuples
• DDD nodes, nodes in a DDD, ready to be scheduled by the instruction
  scheduler

Important to note about operations, whether represented as
intermediate statements, source code, or DDD nodes, is that operations
include both a set of definitions and a set of uses.

Expressions, in contrast, represent the rhs of an operation and, as
such, include uses but not definitions. Throughout this paper, we use
the terms statement, intermediate statement, operation, and DDD node
interchangeably, because they all represent an operation, with both
uses and definitions, albeit generally at different stages of the
compilation process. When we use the term expression, however, we mean
an rhs with uses only and no definition.
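
Because every operation carries a set of definitions and a set of uses, the three dependence types listed under Local Instruction Scheduling can be found by intersecting those sets. The following C sketch illustrates the idea under stated assumptions: the LocSet bit encoding (one bit per memory location), the Operation record, and the DEP_ constants are hypothetical and are not taken from the authors' compiler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t LocSet;   /* hypothetical encoding: bit i set => memory location i */

/* An operation as described in the text: a set of definitions and a set of uses. */
typedef struct {
    LocSet defs;   /* locations written */
    LocSet uses;   /* locations read */
} Operation;

enum { DEP_FLOW = 1, DEP_ANTI = 2, DEP_OUTPUT = 4 };

/* Dependences of `later` on `earlier`, assuming `earlier` executes first. */
unsigned classify_dependence(const Operation *earlier, const Operation *later)
{
    unsigned kinds = 0;
    assert(earlier != NULL && later != NULL);
    if (earlier->defs & later->uses)
        kinds |= DEP_FLOW;     /* earlier writes what later reads */
    if (earlier->uses & later->defs)
        kinds |= DEP_ANTI;     /* later overwrites what earlier reads */
    if (earlier->defs & later->defs)
        kinds |= DEP_OUTPUT;   /* both write the same location */
    return kinds;
}
```

A result of zero means the two operations are independent, so a scheduler would be free to reorder them; any nonzero result corresponds to an edge in the DDD.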
Dominator Analysis Used in Code Motion

In order to determine which operations can move
across basic block boundaries, we need to analyze the
source program. Although there are some choices
as to the exact analysis to perform, dominator-path
scheduling is based upon a formalism first described by
Reif and Tarjan. We summarize Reif and Tarjan's
work here and then discuss the enhancements needed
to allow interblock movement of operations.
In their 1981 paper, Reif and Tarjan provide a fast
algorithm for determining the approximate birthpoints
of expressions in a program's flow graph. An
expression's birthpoint is the first block in the control flow
graph at which the expression can be computed, and
the value computed is guaranteed to be the same as in
the original program. Their technique is based upon
fast computation of the idef set for each basic block of
the control flow graph. The idef set for a block B is
that set of variables defined on a path between B's
immediate dominator and B. Given that the dominator
relation for the basic blocks of a function can be
represented as a dominator tree, the immediate
dominator, IDOM, of a basic block B is B's parent in the
dominator tree.
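
The dominator relation and the immediate dominator can be computed in several ways. The sketch below uses the simple iterative set-intersection method, not Reif and Tarjan's fast algorithm; the function names and the dictionary encoding of the flow graph are assumptions made for illustration, and every block is assumed to be reachable from the entry block.

```python
def dominators(succs, entry):
    """Return dom[b], the set of blocks that dominate b (b included)."""
    nodes = set(succs)
    preds = {n: set() for n in nodes}
    for n, targets in succs.items():
        for t in targets:
            preds[t].add(n)

    # Start from "everything dominates everything" and shrink to a fixed point.
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            pred_doms = [dom[p] for p in preds[n]]
            new = {n} | set.intersection(*pred_doms) if pred_doms else {n}
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom


def immediate_dominators(dom, entry):
    """idom[b] is b's parent in the dominator tree."""
    idom = {}
    for b, ds in dom.items():
        if b == entry:
            continue
        strict = ds - {b}
        # Strict dominators form a chain; the closest one has the largest dom set.
        idom[b] = max(strict, key=lambda d: len(dom[d]))
    return idom


# A diamond-shaped flow graph: B1 branches to B2 and B3, which join at B4.
cfg = {"B1": ["B2", "B3"], "B2": ["B4"], "B3": ["B4"], "B4": []}
dom = dominators(cfg, "B1")
idom = immediate_dominators(dom, "B1")
```

In this diamond, neither B2 nor B3 dominates B4, so B4's parent in the dominator tree is B1.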
Expression birthpoints are not sufficient to allow us
to safely move entire operations from a block to one of
its dominators because birthpoints address only the
movement of expressions, not definitions. Operations
in general include not only a computation of some
expression but the assignment of the value computed
to a program variable. Ensuring a "safe" motion for an
expression requires only that no expression operand
move above any possible definition of that operand,
since such a move would change the program semantics.
A similar requirement is necessary, but not sufficient,
for the variable to which the value is being assigned.
In addition to not moving A above any previous
definition of A, A cannot move above any possible use of A.
Otherwise, we run the risk of changing A's value for


that previous use. Thus, dominator analysis computes
the iuse set for each basic block as well as the idef set.
The iuse set for a block, B, is that set of variables used
on some path between B's immediate dominator and
B. Using the idef and iuse sets, dominator analysis
computes an approximate birthpoint for each operation.
In this paper, we use the term dominator analysis
to mean the analysis necessary to allow code motion of
operations while disallowing compensation copies.
Additionally, we use the term dominator motion for
the general optimization of code motion based upon
dominator analysis.
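
A minimal sketch of the idef and iuse sets follows. It is a straightforward backward walk, not the fast technique of Reif and Tarjan, and it assumes that "a path between B's immediate dominator and B" means the blocks reachable backward from B's predecessors without crossing the immediate dominator. The function names and the per-block `defs` and `uses` dictionaries are hypothetical.

```python
def region_blocks(preds, idom, b):
    """Blocks lying on some path from idom[b] to b, excluding idom[b] itself."""
    stop = idom[b]
    seen = set()
    work = [p for p in preds[b] if p != stop]
    while work:
        n = work.pop()
        if n in seen:
            continue
        seen.add(n)
        work.extend(p for p in preds[n] if p != stop)
    return seen


def idef_iuse(preds, idom, defs, uses, b):
    """Union the defs and uses of every block between idom[b] and b."""
    region = region_blocks(preds, idom, b)
    idef = set().union(*(defs[n] for n in region))
    iuse = set().union(*(uses[n] for n in region))
    return idef, iuse


# Diamond flow graph: B1 -> B2 -> B4 and B1 -> B3 -> B4.
preds = {"B1": [], "B2": ["B1"], "B3": ["B1"], "B4": ["B2", "B3"]}
idom = {"B2": "B1", "B3": "B1", "B4": "B1"}
defs = {"B1": {"a"}, "B2": {"x"}, "B3": {"y"}, "B4": {"z"}}
uses = {"B1": set(), "B2": {"a"}, "B3": {"a", "b"}, "B4": {"x", "y"}}

idef_b4, iuse_b4 = idef_iuse(preds, idom, defs, uses, "B4")
idef_b2, iuse_b2 = idef_iuse(preds, idom, defs, uses, "B2")
```

For B4, both arms of the diamond lie between the block and its immediate dominator, so their definitions and uses all constrain motion from B4 into B1. For B2, nothing lies in between and both sets are empty.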


Enhancing the Reif and Tarjan Algorithm

By enhancing Reif and Tarjan's algorithm to compute
birthpoints of operations instead of expressions, we
make several issues important that previously had no
effect upon Reif and Tarjan's algorithm. This section
motivates and describes the information needed to
allow dominator motion, including the use, def, iuse,
and idef sets for each basic block. An algorithmic
description of this dominator analysis information is
included in the section Overview of Dominator-path
Scheduling and the Algorithm for Interblock Motion.
When we allow code motion to move intermediate
statements (or just expressions) from a block to one of
its dominators, we run the risk that the statement
(expression) will be executed a different number of
times in the dominator block than it would have been
in its original location. When we move only
expressions, the risk is acceptable (although it may not be
efficient to move a statement into a loop) since the
value needed at the original point of computation is
preserved. Relative to program semantics, the number
of times the same value is computed has no effect as
long as the correct value is computed the last time.
This accuracy is guaranteed by expression birthpoints.
Consider also the consequences of moving an
expression from a block that is never executed for some
particular input data. Again, it may not be efficient to compute
a value never used, but the computation does not alter
program semantics. When dominator motion moves
entire statements, however, the issue becomes more
complex. If the statement moved assigns a new value to
an induction variable, as in the following example,

n = n + 1

dominator motion would change n's final value if it
moved the statement to a block where the execution
frequency differed from that of its original block. We
could alleviate this problem by prohibiting motion of
any statement for which the use and def sets are not
disjoint, but the possibility remains that a statement
may define a variable based indirectly upon that
variable's previous value. To remedy the more general
problem, we disallow motion of any statement S
whose def set intersects with those variables that are
used-before-defined in the basic block in which S resides.
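
The used-before-defined rule can be sketched as follows. This is an illustrative Python fragment under the assumption that a block is a list of (defs, uses) pairs in program order; the function names are hypothetical and the check covers only this one rule, not the other restrictions imposed by dominator analysis.

```python
def used_before_defined(block_ops):
    """Variables read in the block before any definition of them in the block."""
    defined, ubd = set(), set()
    for defs, uses in block_ops:
        ubd |= set(uses) - defined   # reads of values that come from outside the block
        defined |= set(defs)
    return ubd


def may_leave_block(op_defs, block_ops):
    """True if this rule alone does not pin the statement to its block."""
    return not (set(op_defs) & used_before_defined(block_ops))


# Direct case: n = n + 1
direct = [({"n"}, {"n"})]

# Indirect case: t = n * 2 followed by n = t + 1
indirect = [({"t"}, {"n"}), ({"n"}, {"t"})]

# No dependence on a previous value: x = a + b
plain = [({"x"}, {"a", "b"})]
```

In the indirect case the statement `n = t + 1` has disjoint use and def sets, yet it is still held in place because n is used before it is defined in the block.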
Suppose the optimizer moves an intermediate
statement that defines a global variable from a block that
may never be executed for some set of input data into
a dominator block that is executed at least once for
the same input data. Then the optimized version has
defined a variable that the unoptimized function did
not, possibly changing program semantics. We can be
sure that such motion does not change the semantics
of that function being compiled; but there is no
mechanism, short of compiling the entire program as a
single unit, to ensure that defining a global variable in this
function will not change the value used in another
function. Thus, to be conservative and ensure that
it does not change program semantics, dominator
motion prohibits interblock movement of any
statement that defines a global variable. At first glance, it
may seem that this prohibition cripples dominator
motion's ability to move any intermediate statements
at all; but we shall see that such is not the case.
One final addition to the Reif and Tarjan information is
required to take care of a subtle problem. As discussed
above, dominator analysis uses the idef and iuse sets to
prevent illegal code motion. The use of these sets was
assumed to be sufficient to ensure the legality of code
motion into a dominator block; unfortunately, this is
not the case. The problem is that a definition might
pass through the immediate dominator of B to reach
a use in a sibling of B in the dominator tree. If there
were a definition of this variable in B, but the variable
was not defined on any path from the immediate
dominator, there would be nothing in dominator analysis
to prevent the definition from being moved into the
dominator. But that would change the program's
semantics. Figure 1 shows the control flow graph for a
function called findmax( ), with only the statements
referring to register r7. Register r7 is defined in blocks
B3 and B7, and referenced in B9. This means that r7
is live-out of B5 and live-in to B8, but not live-in to
B7; there is a definition of r7 in B3 that reaches B8.
Because there is no definition or use between B7 and
its immediate dominator B5, the idef and iuse sets of
B7 are empty; thus, dominator analysis, as described
above, would allow the assignment of r7 to move
upward to block B5. This motion is illegal because the
hoisted assignment would overwrite the definition
from B3 that reaches B8. Moving the operation from B7 to
B5 changes the conditional assignment of r7 to an
unconditional one.
To prevent this from happening, we can insert the
variable into the iuse set of the block B, in which we
wish the statement to remain. We do not, however,
want to add to the iuse set unnecessarily. The solution
is to add each variable, V, that is live-in to any of B's
siblings in the dominator tree, but not live-in to B, to B's

Figure 1
Control Flow Graph for the Function findmax( )

iuse set. This will prevent any definition of V that
might exist in B from moving up. If there is a
definition of V in B, but V is live-in to B, there must be some
use of V in B before the definition, so it could not move
upward in any case.
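
The sibling rule can be sketched in a few lines of Python. The function name `sibling_set` and the dictionary inputs are hypothetical, and the example data is only an assumed reading of the findmax( ) discussion, in which B7 and B8 are both taken to be children of B5 in the dominator tree.

```python
def sibling_set(b, idom, live_in):
    """Variables live-in to a dominator-tree sibling of b but not live-in to b."""
    parent = idom[b]
    siblings = [s for s, p in idom.items() if p == parent and s != b]
    result = set()
    for s in siblings:
        result |= live_in[s]
    return result - live_in[b]


# Assumed fragment of the dominator tree: B5 is the parent of B7 and B8.
idom = {"B7": "B5", "B8": "B5"}
live_in = {"B7": set(), "B8": {"r7"}}

# iuse of B7 is empty before the fix, which would let the r7 assignment rise to B5.
iuse = {"B7": set(), "B8": set()}
iuse["B7"] |= sibling_set("B7", idom, live_in)
iuse["B8"] |= sibling_set("B8", idom, live_in)
```

After the update, r7 appears in the iuse set of B7, so an operation in B7 that defines r7 can no longer move up into B5. Nothing is added for B8 because r7 is already live-in there.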
Measurement of Dominator Motion

To measure the motion possible in C programs,
Sweany defined dominator motion as the movement
of each intermediate statement to its birthpoint as
defined by dominator analysis and measured it by the
number of dominator blocks each statement jumps during such
movement. Sweany's choice of intermediate
statements (as contrasted with source code, assembly
language, or DDD nodes) is attributed to the lack of
machine resource constraints at that level of program
abstraction. He envisioned dominator motion as an
upper bound on the motion available in C programs
when compensation copies are included. In the test
suite of 12 C programs compiled, more than 25
percent of all intermediate statements moved at least one
dominator block upwards toward the root of the
dominator tree. One function allowed more than 50
percent of the statements to be hoisted an average of
nearly eight dominator blocks. The considerable
amount of motion (without copies) available at the
intermediate statement level of program abstraction


provided us with the motivation to use similar analysis
techniques to facilitate global instruction scheduling.
Overview of Dominator-path Scheduling and the Algorithm for Interblock Motion

Since experiments show that dominator analysis allows
considerable code motion without copies, we chose to
use dominator analysis as the basis for the instruction
scheduling algorithm described here, namely
dominator-path scheduling. As noted above, DPS is a global
instruction scheduling method that does not require
copies of operations that move from one basic block to
another. DPS performs global instruction scheduling by
treating a group of basic blocks found on a dominator
tree path as a single block, scheduling the group as a
whole. In this regard, it resembles trace scheduling,
which schedules adjacent basic blocks as a single block.
DPS's foundation is scheduling instructions while
moving operations among blocks according to both the
opportunities provided by and the restrictions imposed
by dominator analysis.
The question arises as to how to exploit dominator
analysis information to permit code motion at the
instruction level during scheduling. DPS is based on
the observation that we can use idef and iuse sets to
allow operations to move from a block to one of its
dominators during instruction scheduling. Instruction
scheduling can then choose the most advantageous
position for an operation that can be placed in any one of
several blocks. Because machine operations are
incorporated in nodes of the DDD used in scheduling and,
like intermediate statements, DDD nodes are
represented by def and use sets, the same analysis performed
on intermediate statements can also be applied to a
basic block's DDD nodes.
The same motivation that drives trace scheduling,
namely that scheduling one large block allows better use
of machine resources than scheduling the same code as
several smaller blocks, also applies to DPS. In contrast
to trace scheduling, DPS does not allow motion of
DDD nodes when a copy of a node is required and does
not incur the code explosion due to copying that trace
scheduling can potentially produce. For architectures
with moderate instruction-level parallelism, DPS may
produce better results than trace scheduling, because
the more limited motion may be sufficient to make
good use of machine resources, and unlike trace
scheduling, no machine resources are devoted to executing
semantic-preserving operation copies.
Much like traces,* the dominator path's blocks can
be chosen by any of several methods. One method is a
heuristic choice of a path based on length, nesting
depth, or some other program characteristic. Another
is programmer specification of the most important

*Groups of blocks to be scheduled together in trace scheduling


paths. A third is actual profiling of the running
program. We visit this issue again in the section Choosing
Dominator Paths. First, however, we need to discuss
the algorithmic details of DPS.
Once DPS selects a dominator path to schedule, it
requires a method to combine the blocks' DDDs into
a single DDD for the entire dominator path. In our
compiler, this task is performed by a DDD coupler,
which is designed for the purpose. Given the DDD
coupler, DPS proceeds by repeatedly

• Choosing a dominator path to schedule

• Using the DDD coupler to combine each block's
DDD on the chosen dominator path

• Scheduling the combined DDD as a single block

The dominator-path scheduling algorithm, detailed
in this section, is summarized in Figures 2 and 3.
A significant aspect of the DPS process is to ensure
"appropriate" interblock motion of DDD nodes and
to prohibit "illegal" motion. As noted earlier, the
combined DDD for a dominator path includes control
flow. Therefore, when DPS schedules a group of
blocks represented by a single DDD, it needs a
mechanism to map correctly the scheduled instructions to
the basic blocks. The mechanism is easily
accomplished by the addition of two special nodes to each
block's DDD. Called BlockStart and BlockEnd, these
special nodes represent the basic block boundaries.
Since dominator-path scheduling does not allow
branches to move across block boundaries, each
BlockStart and BlockEnd node is initially "tied" (with
DDD arcs) to the branch statement of the block, if any.
Because BlockStart and BlockEnd are nodes in the
eventually combined DDD, they are scheduled like all
other nodes of the combined DDD. After scheduling,
all instructions between the instruction containing the
BlockStart node for a block and the instruction
containing the BlockEnd node for that block are
considered instructions for that block. Next, DPS must
ensure that the BlockStart and BlockEnd DDD nodes
remain ordered (in the scheduled instructions) relative
to one another and to the BlockStart and BlockEnd
nodes for any other block. To do so, DPS adds use and
def information to the nodes to represent a
pseudoresource, BlockBoundary. Because each BlockStart
node defines BlockBoundary and each BlockEnd
node uses BlockBoundary, no BlockEnd node can be
scheduled ahead of its associated BlockStart node
(because of flow dependence). Also, a BlockStart node
cannot be scheduled before its dominator block's
BlockEnd node (because of antidependence). By
establishing these imaginary dependencies, DPS
ensures that the DDD coupler adds arcs between all
BlockStart and BlockEnd nodes.
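
The effect of the BlockBoundary pseudoresource can be sketched as follows. This Python fragment is illustrative only: the `BoundaryNode` class, the node naming, and the `dependence` helper are hypothetical and stand in for whatever the DDD coupler actually does when it compares def and use sets.

```python
from dataclasses import dataclass, field


@dataclass
class BoundaryNode:
    name: str
    defs: set = field(default_factory=set)
    uses: set = field(default_factory=set)


def boundary_nodes(path):
    """BlockStart defines BlockBoundary; BlockEnd uses it."""
    nodes = []
    for b in path:
        nodes.append(BoundaryNode(f"BlockStart({b})", defs={"BlockBoundary"}))
        nodes.append(BoundaryNode(f"BlockEnd({b})", uses={"BlockBoundary"}))
    return nodes


def dependence(earlier, later):
    """Classify the dependence that forces `later` to stay after `earlier`."""
    if earlier.defs & later.uses:
        return "flow"
    if earlier.uses & later.defs:
        return "anti"
    if earlier.defs & later.defs:
        return "output"
    return None


def boundary_arcs(path):
    """Arcs between consecutive boundary nodes along a dominator path."""
    nodes = boundary_nodes(path)
    return [(a.name, b.name, dependence(a, b)) for a, b in zip(nodes, nodes[1:])]


arcs = boundary_arcs(["B1", "B2"])
```

For a two-block path the arcs alternate between flow dependence (start to end of the same block) and antidependence (end of one block to start of the next), which keeps the boundary nodes in their original order.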

Algorithm Dominator-Path Scheduling
Input:
    Function Control Flow Graph
    Dominator Tree
    Post-Dominator Tree
Output:
    Scheduled instructions for the function
Algorithm:
    While at least one Basic Block is unscheduled
        Heuristically choose a path B1, B2, ..., Bn in the Dominator Tree
            that includes only unscheduled Basic Blocks.
        Perform dominator analysis to compute IDef and IUse sets
        /* Build one DDD for the entire dominator path */
        CombinedDDD = B1
        For i = 2 to n
            T = InitializeTransitionDDD(B(i-1), Bi)
            CombinedDDD = Couple(CombinedDDD, T)
            CombinedDDD = Couple(CombinedDDD, Bi)
        Perform list scheduling on CombinedDDD
        Mark each block of DP scheduled
        Copy scheduled instructions to the Blocks of the path (instructions
            between the BlockStart and BlockEnd nodes for a Block are
            "written" to that Block)
    End While

Figure 2
Dominator-path Scheduling Algorithm
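
The last step of Figure 2, writing scheduled instructions back to their blocks, can be sketched as below. This is a simplified illustration: it treats the schedule as a flat list of node names rather than wide instructions, and the function name `write_back` and the marker spelling `BlockStart(B)` / `BlockEnd(B)` are assumptions.

```python
def write_back(schedule):
    """Assign each scheduled node to the block whose markers enclose it."""
    blocks, current = {}, None
    for node in schedule:
        if node.startswith("BlockStart("):
            current = node[len("BlockStart("):-1]
            blocks[current] = []
        elif node.startswith("BlockEnd("):
            current = None
        elif current is None:
            # In this simplified model every node must sit inside some block.
            raise ValueError(f"{node!r} is outside every block")
        else:
            blocks[current].append(node)
    return blocks


# "add r2" started in B2 but was scheduled above B1's BlockEnd marker,
# so after write-back it belongs to B1.
schedule = [
    "BlockStart(B1)", "load r1", "add r2", "BlockEnd(B1)",
    "BlockStart(B2)", "mul r3", "store r3", "BlockEnd(B2)",
]
placed = write_back(schedule)
```

Because the markers are ordinary scheduled nodes, interblock motion needs no separate bookkeeping: a node's final block is simply the pair of markers it ends up between.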

Looking back to dominator analysis, we see that
interblock motion is prohibited if the operation being
moved

• Defines something that is included in either the
idef or iuse set

• Uses something included in the idef set for the
block in which the operation currently resides

To obtain the same prohibitions in the combined
DDD, we add the idef set for a basic block, B, to the
def set of B's BlockStart node. Similarly, we add the iuse
set for B to the use set of B's BlockStart node. Thus we
enforce the same restriction on movement that
dominator analysis imposed upon intermediate statements
and ensure that any interblock motion preserves
program semantics. In a similar manner, DPS includes the
restrictions on movement of operations that define
either global variables or induction variables. Figure 3
gives an algorithmic description of the process of
"doping" the BlockStart and BlockEnd nodes to
prevent disallowed code motion.
DPS is complicated by factors not relevant for
dominator motion of intermediate statements. Foremost is
the complexity imposed by the bidirectional motion of
operations that instruction scheduling allows. In
dominator motion, intermediate statements move in only
one direction, i.e., toward the top of the function's
control flow graph, not from a dominator block to a
dominated one. This one-directional motion is
reasonable when attempting to move intermediate
statements because one statement's movement will likely
open possibilities for more motion in the same
direction by other statements. When statements move in
different directions, one statement's motion might
inhibit another's movement in the opposite direction.
The goal of dominator motion is to move statements as
far as possible in the control flow graph. In contrast, the
goal of DPS is not to maximize code motion, but rather
to find, for each operation, O, that location for O that
will yield the shortest schedule. Thus our goal has
changed from that of dominator motion. To gain the
full benefit from DPS, we wish to allow operations to
move past block boundaries in either direction. To
permit bidirectional motion, we use the post-dominator
relation, which says that a basic block, PD, is a
post-dominator of a basic block B if all paths from B to the
function's exit must pass through PD. Using this
strategy, we similarly define post-idef and post-iuse sets. In

Algorithm InitializeTransitionDDD(Bi, Bj)
Input:
    A Transition DDD template, with a dummy DDD node
        for Bi's block end and one for Bj's block start
    Two basic blocks, Bi and Bj, that we wish to couple
    Dominator Tree
    Post-Dominator Tree
    The following dataflow information
        Def, Use, IDef, and IUse sets for Bi and Bj
        Used-Before-Defined set for Bj
        Post-IDef and Post-IUse sets for Bi and Bj
        Bj's "sibling" set, defined to include any variable
            live-in to a dominator-tree sibling of Bj, but not
            live-in to Bj
    A basic block DDD for each of Bi and Bj
Output:
    An initialized Transition DDD, T
Algorithm:
    T = TransitionDDD
    /* "Fix" set for global and induction variables. */
    Add set of global variables to Bj's IUse
    Add Bj's Used-Before-Defined to Bj's IUse
    Add Bj's sibling set to Bj's IUse
    If Bj does not post-dominate Bi
        Add Bi's Use set to T's BlockEnd Def set
        Add Bi's Def set to T's BlockEnd Use set
    Else
        Add Bi's Post-IDef set to T's BlockEnd Def set
        Add Bi's Post-IUse set to T's BlockEnd Use set
    Add Bj's IDef set to T's BlockStart Def set
    Add Bj's IUse set to T's BlockStart Use set
    Return T

Figure 3
Initialize Transition DDD Algorithm

fact, it is not difficult to compute all these quantities
for a function. The simplest way is to logically reverse
the direction of all the control flow graph arcs and
perform dominator analysis on the resulting graph.
Having computed the post-dominator tree, DPS
chooses dominator paths such that the dominated
node is a post-dominator of its immediate predecessor
in a dominator path. This choice allows operations to
move "freely" in both directions. Of course, this may
be too limiting on the choice of dominator paths. To
allow for the possibility that nodes in a dominator path
will not form a post-dominator relation, DPS needs a
mechanism to limit bidirectional motion when
needed. Again, we rely on the technique of adding
dependencies to the combined DDD. In this case
(assuming that DPS is scheduling paths in the forward
dominator tree), for any basic block, B, whose
successor, S, in the forward dominator path does not
post-dominate B, DPS adds B's def set to the use set of the
BlockEnd node associated with B. In similar fashion,
we add B's use set to B's BlockEnd node's def set.
This technique prevents any DDD node originally in
B from moving downward in the dominator path.
Choosing Dominator Paths

DPS allows code movement along any dominator
path, but there are many ways to select these paths. An
investigation of the effects of dominator-path choice
on the efficiency of generated schedules tells us that
the choice of path is too important to be left to
arbitrary selection; twice the average percent speedup* for
several functions can often be achieved with a simple,

*(unoptimized_speed - optimized_speed)/unoptimized_speed

well-chosen heuristic. Some functions have a potential
percent speedup almost four times the average. Thus,
it is important to find a good, generally applicable
heuristic to select the dominator paths.
Unfortunately, it is not practical to schedule all of
the possible partitionings for large functions. If we
allow a basic block to be included in only one
dominator path, the formula for the number of distinct
partitionings of the dominator tree is

\[ \prod_{n \in N} [\mathrm{outdeg}(n) + 1] \]

where N is the set of nodes of the dominator tree.
Although the number of possible paths is not
prohibitive for small dominator trees, larger trees have a
prohibitively large number. For example, whetstone's
main( ), with 49 basic blocks, has almost two trillion
distinct partitionings.
To evaluate differences in dominator-path choices,
we scheduled a group of small functions with DPS
using every possible choice of dominator path. The
target architecture for this study was a hypothetical
6-wide long-instruction-word (LIW) machine, which
was simulated and in which it was assumed that all
cache accesses were hits.
The results of exhaustive dominator-path testing
show, as expected, that varying the choice of
dominator paths significantly affects the performance of
scheduling. For all functions of at least two basic
blocks, DPS showed improvement over local
scheduling for at least one of the possible choices of
dominator paths. Table 1 shows the best, average, and worst
percent speedup over local scheduling found for all
functions that had a "best" speedup of over 2 percent;
it also shows the speedup of the original
implementation of DPS and the number of distinct dominator tree
partitionings. The original implementation of DPS
included a single, simple heuristic to choose
dominator paths. More specifically, to choose dominator paths
within a group, G, of contiguous blocks at the same
nesting level, the compiler continues to choose a
block, B, to "expand." Expansion of B initializes a new
dominator path to include B and adds B's dominators
until no more can be added. The algorithm then starts
another dominator path by expanding another (as yet
unexpanded) block of G. The first block of G chosen
to expand is the tail block, T, in an attempt to obtain as
long a dominator path as possible.
Unfortunately, not all functions are small enough to
be tested by performing DPS for each possible
partitioning of the dominator tree. Therefore, we defined
37 different heuristic methods of choosing dominator
paths, based upon groupings of six key heuristic factors.
The maximum path lengths of the basic guidelines
were adjusted to produce actual heuristics. The
following are the heuristic factors from which the individual
heuristics were constructed; each seemed likely either to
mimic the observed characteristics of the best path
selection or to allow more freedom of code motion
and, therefore, more flexibility in filling "gaps."

• One nesting level: Group blocks from the same
nesting level of a loop. Each block is in the same
strongly connected component, so the blocks tend
to have similar restrictions to code motion. For a
group of blocks to be a strongly connected
component, there must be some path in the control flow
graph from each node in the component to all the
other nodes in the component. Since the function
will probably repeat the loop, it seems likely that
the scheduler will be able to overlap blocks in it.

Table 1
Percent of Function Speedup Improvement Using DPS Path Choices over Local Scheduling

                          Percent Speedup                No. Dominator
Function Name    Best    Average    Worst    Original    Tree Partitions
bubble           39.2     10.6      -0.1      11.7              72
readm            32.5      9.3      -0.2      32.5              48
solve            27.8      9.9      -0.2      27.8              96
queens           25.4      8.3      -0.4      -0.4              96
swaprow          23.1      5.8      -3.7      19.5              24
print(g)         22.0      9.1      -0.2      22.0               8
findmax          21.3      6.2      -0.3       8.7              18
copycol          18.5      5.6      -5.0      19.9               8
elim             14.3      2.3      -3.8      10.2             576
mult             13.7      2.1      -3.8      10.3              96
subst            12.9      2.4      -4.9       4.9              96
print(8)         12.5      6.2       0.0      12.5               8
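
The partition counts in the last column of Table 1 follow from the product formula given earlier, the product over all dominator-tree nodes of outdeg(n) + 1. A minimal Python sketch of that count is shown here; the function name and the child-list encoding of the dominator tree are assumptions, and the example trees are made up for illustration rather than taken from the measured functions.

```python
from math import prod


def count_partitionings(children):
    """Product over dominator-tree nodes of (number of children + 1).

    Each node either ends its path (the +1) or extends it into exactly one child.
    """
    return prod(len(kids) + 1 for kids in children.values())


# A chain of four blocks: each of the three inner links is either kept or cut.
chain = {"B1": ["B2"], "B2": ["B3"], "B3": ["B4"], "B4": []}

# A root with two leaf children: the root stands alone or joins one child.
fork = {"B1": ["B2", "B3"], "B2": [], "B3": []}

chain_count = count_partitionings(chain)
fork_count = count_partitionings(fork)
```

The count grows multiplicatively with every branching node, which is why exhaustive testing is feasible only for small functions.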


• Longest path: Schedule the longest available path.
This heuristic class allows the maximum distance
for code motion.

• Postdominator: Follow the postdominator relation
in the dominator tree. When a dominator block, P, is
succeeded by a non-postdominator block, S, our
compiler adds P's def set to the use set of P's
BlockEnd node and the use set to the def set to
prevent any code motion from P to S. If P is instead
succeeded by its postdominator block, no such
modification is necessary, and code would be allowed to
move in both directions. Intuitively, the
postdominator relation is the exact inverse of the dominator
relation, so code can move down, into a postdominator,
as it moves up into a dominator. Further, the simple
act of adding nodes to the DDD will complicate list
scheduling [...]

[...] class. The previous class was suggested by intuition
[...] for the differing path behavior.

[...] idef size, the more interference there is to code [...]

[...] with sparse blocks and putting sparse blocks together.

The heuristic factors were used to make individual
heuristics by changing the limit on the possible
number of blocks in a path. It was reasonable to set limits
for four factors: postdominator, non-postdominator,
idef size, and density. We tried path length limits in
blocks of 2, 3, 4, 5, and unlimited, making a total of
five heuristics from each heuristic factor.
Running DPS using each of the heuristic methods
and comparing the efficiency of the resulting code
leads to several conclusions about effective heuristics
for choosing DPS's dominator paths. For some
heuristics, we can achieve the best schedules for DPS by
using paths that rarely exceed three blocks. For any
particular class of heuristics, we can achieve the best
schedule with paths limited to five blocks or fewer.
Consequently, path lengths can be limited without
lowering the efficiency of generated code, and longer
paths, which increase scheduling time, can be avoided.
Since no one heuristic performed well for all
functions, we advise using a combination of heuristics, i.e.,
schedule by using each of three heuristics and taking
the best schedule. The "combined" heuristic includes
the following:

• Instruction density, limit to five blocks

• One nesting level on path, limit to five blocks

• Non-postdominator, unlimited length

In theory, to best schedule any meta-block, an [...]
blocks with higher nesting levels are more costly than
those added to blocks with lower nesting levels. Even
within a loop, there exists the potential for
considerable variation in the execution frequencies of different
blocks in the meta-block due to control flow. Of
course variable execution frequency is not an issue in
traditional local scheduling because, within the
context of a single basic block, each DDD node is
executed the same number of times, namely, once each
time execution enters the block.
To address the issue of differing execution
frequencies within meta-blocks scheduled as a single block by
DPS, we investigated frequency-based list scheduling
(FBLS), an extension of list scheduling that provides
an answer to this difficulty by considering that
execution frequencies differ within sections of the
meta-blocks. FBLS uses a greedy method to place DDD nodes
in the lowest-cost instruction possible. FBLS amends
the basic list-scheduling algorithm by revising only the
DDD node placement policy in an attempt to reduce
the run-time cycles required to execute a meta-block.
Unfortunately, although FBLS makes intuitive sense,
we found that DPS produced worse schedules with
FBLS than it produced with a naive local scheduling
algorithm that ignored frequency differences within
DPS's meta-blocks. Therefore, the current
implementation of DPS ignores the execution frequency
differences between basic blocks, both in choosing
dominator paths to schedule and in scheduling those
dominator-path meta-blocks.


Evaluation of Dominator-path Scheduling

To measure the potential of DPS to generate more efficient schedules than local scheduling for commercial superscalar architectures, we ran a small test suite of C programs on an Alpha 21164 server. The Alpha server is a superscalar architecture capable of issuing two integer and two floating-point instructions each cycle. Our compiler estimates the effectiveness of a schedule by modeling the 21164 as an LIW architecture with all operation latencies known at compile time. Of course this model was used only within the compiler itself. Our results measured changes in 21164 execution time (measured with the UNIX "time" command) required for each program.

Our test suite of 14 C programs includes 8 programs that use integer computation only and 6 programs that include floating-point computation. We separated those groups because we see dramatic differences in DPS's performance when viewing integer and floating-point programs. To choose dominator paths, we used the combined heuristic recommended by Huber.

Table 2 summarizes the results of tests we conducted to compare the execution times of programs using DPS scheduling with those using local scheduling only. The table lists the programs used in the test suite and the percent improvement in execution times for DPS-scheduled programs. The execution time measurements were made on an Alpha 21164 server running at 250 megahertz with data cache sizes of 8 kilobytes, 96 kilobytes, and 4 megabytes.

Looking at Table 2, we see that, in general, DPS improved the integer programs less than it improved the floating-point programs. The range of improvements for integer programs was from 0.7 percent for Dhrystone to 7.3 percent each for 8-Queens and for SymbolTable. Summing all the improvements and dividing by eight (the number of integer programs) gives an "average" of 4.7 percent improvement for the integer programs. DPS improved some of the floating-point programs even more significantly than the integer programs. The range of improvements for the six floating-point programs was from 3.7 percent for Dice (a simulation of rolling a pair of dice 10,000,000 times using a uniform random number generator) to 17.6 percent improvement for the finite difference program. The average for the six floating-point programs was 10.8 percent. This suggests, not surprisingly, that the Alpha 21164 provides more opportunities for global scheduling improvement when floating-point programs are being compiled.

Even within the six floating-point programs, however, we see a distinct bi-modal behavior in terms of execution-time improvement. Three of the programs range from 12.3 percent to 17.6 percent improvement, whereas three are below 10 percent (and two of those significantly below 10 percent). A reason for this wide range is the use of global variables. Remember that DPS forbids the motion of global variable definitions across block boundaries. This is necessary to ensure correct program semantics. It is hardly a coincidence that both Dice and Whetstone include only global floating-point variables, whereas Livermore's floating-point variables are mixed about half local and half global, and the three better performers use almost no global variables. Thus we conclude that, for floating-point programs with few global variables, we can expect improvements of roughly 12 to 15 percent in execution time. Inclusion of global variables and exclusion of floating-point values will, however, decrease DPS's ability to improve execution time for the Alpha 21164.

Table 2
Percent DPS Scheduling Improvements over Local Scheduling of Programs

Program                      Percent Execution Time Improvement
8-Queens                     7.3
SymbolTable                  7.3
BubbleSort                   5.0
Nsieve                       6.1
Heapsort                     6.0
Killcache                    2.6
TSP                          2.4
Dhrystone                    0.7
C integer average            4.7
Dice                         3.7
Whetstone                    5.4
Matrix Multiply              16.2
Gauss                        12.3
Finite Difference            17.6
Livermore                    9.3
C floating-point average     10.8
Overall average              7.3
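The averages reported in Table 2 can be rechecked with a few lines of arithmetic. The sketch below is only an illustration: the per-program values are copied from Table 2, while the names (`INTEGER_PROGRAMS`, `FLOATING_POINT_PROGRAMS`, `average_improvement`) and the choice of an unweighted mean rounded half-up to one decimal place are assumptions made here, not something the paper specifies beyond "summing all the improvements and dividing" by the number of programs.

```python
from decimal import Decimal, ROUND_HALF_UP

# Percent execution-time improvements copied from Table 2 (kept as strings
# so that Decimal represents each value exactly).
INTEGER_PROGRAMS = {
    "8-Queens": "7.3", "SymbolTable": "7.3", "BubbleSort": "5.0",
    "Nsieve": "6.1", "Heapsort": "6.0", "Killcache": "2.6",
    "TSP": "2.4", "Dhrystone": "0.7",
}
FLOATING_POINT_PROGRAMS = {
    "Dice": "3.7", "Whetstone": "5.4", "Matrix Multiply": "16.2",
    "Gauss": "12.3", "Finite Difference": "17.6", "Livermore": "9.3",
}


def average_improvement(table):
    """Unweighted mean of the improvements, rounded half-up to one decimal."""
    values = [Decimal(v) for v in table.values()]
    mean = sum(values) / len(values)
    return mean.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


integer_average = average_improvement(INTEGER_PROGRAMS)
floating_point_average = average_improvement(FLOATING_POINT_PROGRAMS)
overall_average = average_improvement({**INTEGER_PROGRAMS, **FLOATING_POINT_PROGRAMS})

print(integer_average, floating_point_average, overall_average)
```

Under these assumptions the three results agree with the "C integer average", "C floating-point average", and "Overall average" rows of the table.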

Related Work

As we have discussed, local instruction scheduling can find parallelism within a basic block but cannot exploit parallelism between basic blocks. Several global scheduling techniques are available, however, that extract parallelism from a program by moving operations across block boundaries and subsequently inserting compensation copies to maintain program semantics. Trace scheduling [3] was the first of these techniques to be defined. As previously mentioned, trace scheduling requires compensation copies. Other "early" global scheduling algorithms that require compensation copies include Nicolau's percolation scheduling [16, 17] and Gupta's region scheduling [18]. A recent and quite popular extension of trace scheduling is Hwu's SuperBlock scheduling [19, 20]. In addition to these more general, global scheduling methods, significant results have been obtained by software pipelining, which is a technique that overlaps iterations of loops to exploit available ILP. Allan et al. [21] provide a good summary, and Rau [22] provides an excellent tutorial on how modulo scheduling, a popular software pipelining technique, should be implemented. Promising recent techniques have focused on defining a meta-environment, which includes both global scheduling and software pipelining. Moon and Ebcioglu [23] present an aggressive technique that combines software pipelining and global code motion (with copies) into a single framework. Novak and Nicolau [24] describe a sophisticated scheduling framework in which to place software pipelining, including alternatives to modulo scheduling. While providing a significant number of excellent global scheduling alternatives, none of these techniques provides global scheduling without the possibility of code expansion (copy code) as DPS does.

To address the issue of producing schedules without operation copies, Bernstein [25-27] defined a technique he calls global instruction scheduling (GPS) that allows movement of instructions beyond block boundaries based upon the program dependence graph (PDG) [28]. In a test suite of four programs run on IBM's RS/6000, Bernstein's method showed improvement of roughly 7 percent over local scheduling for two of the programs, with no significant difference for the others.

Comparing DPS to Bernstein's method, we see that both allow for interblock motion without copies. Bernstein also allows for interblock movement requiring duplicates that DPS does not. Interestingly, Bernstein's later work [27] does not make use of this ability to allow motion that requires duplication of operations, suggesting that, to date, he has not found such motion advisable for the RS/6000 architecture to which his techniques have been applied. Bernstein allows operation movement in only one direction, whereas DPS allows operations to move from a dominator block to a postdominator. This added flexibility is an advantage to DPS. Of possibly greater significance, DPS uses the local instruction scheduler to place operations. Bernstein uses a separate set of heuristics to move operations in the PDG and then uses a subsequent local scheduling pass to order operations within each block. Fisher argues that incorporating movement of operations with the scheduling phase itself provides better scheduling than dividing the interblock motion and scheduling phases. Based on that criterion alone, DPS has some advantages over Bernstein's method.

Conclusions

It is commonly accepted that to exploit the performance benefits of ILP, global instruction scheduling is required. Several varieties of global instruction scheduling exist, most requiring compensation copies to ensure proper program semantics when operations cross block boundaries during instruction scheduling. Although such global scheduling with compensation copies may be an effective strategy for architectures with large degrees of ILP, another approach seems reasonable for more limited architectures, such as currently available superscalar computers.

This paper outlines DPS, a global instruction scheduling technique that does not require compensation copies. Based on the fact that more than 25 percent of intermediate statements can be moved upward at least one dominator block in the control flow graph without changing program semantics, DPS schedules paths in a function's dominator tree as meta-blocks, making use of an extended local instruction scheduler to schedule dominator paths.

Experimental evidence shows that DPS does indeed produce more efficient schedules than local scheduling for Compaq's Alpha 21164 server system, particularly for floating-point programs that avoid the use of global variables. This work has demonstrated that considerable flexibility in placement of code is possible even when compensation copies are not allowed. Although more research is required to look into possible uses for this flexibility, the global instruction scheduling method described here (DPS) shows promise for ILP architectures.

Acknowledgments

This research was supported in part by an External Research Program grant from Digital Equipment Corporation and by the National Science Foundation under grant CCR-9308348.

References

1. G. Tjaden and M. Flynn, "Detection of Parallel Execution of Independent Instructions," IEEE Transactions on Computers, C-19(10) (October 1970): 889-895.

2. A. Nicolau and J. Fisher, "Measuring the Parallelism Available for Very Long Instruction Word Architectures," IEEE Transactions on Computers, 33(11) (November 1984): 968-976.

3. J. Fisher, "Trace Scheduling: A Technique for Global Microcode Compaction," IEEE Transactions on Computers, C-30(7) (July 1981): 478-490.

4. J. Ellis, Bulldog: A Compiler for VLIW Architectures (Cambridge, MA: MIT Press, 1985), Ph.D. thesis, Yale University (1984).

5. D. DeWitt, "A Machine-Independent Approach to the Production of Optimal Horizontal Microcode," Ph.D. thesis, University of Michigan, Ann Arbor, Mich. (1976).

6. D. Landskov, S. Davidson, B. Shriver, and P. Mallett, "Local Microcode Compaction Techniques," ACM Computing Surveys, 12(3) (September 1980): 261-294.

7. E. Coffman, Computer and Job-Shop Scheduling Theory (New York: John Wiley & Sons, 1976).

8. V. Allan, S. Beaty, B. Su, and P. Sweany, "Building a Retargetable Local Instruction Scheduler," Software Practice & Experience, 28(3) (March 1998): 249-284.

9. D. Padua, D. Kuck, and D. Lawrie, "High-Speed Multiprocessors and Compilation Techniques," IEEE Transactions on Computers, C-29(9) (September 1980): 763-776.

10. A. Aho, R. Sethi, and J. Ullman, Compilers: Principles, Techniques, and Tools (Reading, MA: Addison-Wesley, 1986).

11. H. Reif and R. Tarjan, "Symbolic Program Analysis in Almost-Linear Time," Journal of Computing, 11(1) (February 1981): 81-93.

12. P. Sweany, "Interblock Code Motion without Copies," Ph.D. thesis, Computer Science Department, Colorado State University (1992).

13. R. Mueller, M. Duda, P. Sweany, and J. Walicki, "Horizon: A Retargetable Compiler for Horizontal Microarchitectures," IEEE Transactions on Software Engineering: Special Issue on Microprogramming, 14(5) (May 1988): 575-583.

14. B. Huber, "Path-Selection Heuristics for Dominator Path Scheduling," Master's thesis, Department of Computer Science, Michigan Technological University (1995).

15. M. Bourke, P. Sweany, and S. Beaty, "Extending List Scheduling to Consider Execution Frequency," Proceedings of the 28th Hawaii International Conference on System Sciences (January 1996).

16. A. Nicolau, "Percolation Scheduling: A Parallel Compilation Technique," Technical Report TR85-678, Department of Computer Science, Cornell University (May 1985).

17. A. Aiken and A. Nicolau, "A Development Environment for Horizontal Microcode," IEEE Transactions on Software Engineering, 14(5) (May 1988): 584-594.

18. R. Gupta and M. Soffa, "Region Scheduling: An Approach for Detecting and Redistributing Parallelism," IEEE Transactions on Software Engineering, 16(4) (April 1990): 421-431.

19. S. Mahlke, W. Chen, W.-M. Hwu, B. Rao, and M. Schlansker, "Sentinel Scheduling for VLIW and Superscalar Processors," Proceedings of the 5th International Conference on Architectural Support for Programming Languages and Operating Systems, Boston, Mass. (October 1992): 238-247.

20. C. Chekuri, R. Johnson, R. Motwani, B. Natarajan, B. Rau, and M. Schlansker, "Profile-Driven Instruction-Level-Parallel Scheduling with Application to Super Blocks," Proceedings of the 29th International Symposium on Microarchitecture (MICRO-29), Paris, France (December 1996): 58-67.

21. V. Allan, R. Jones, R. Lee, and S. Allan, "Software Pipelining," ACM Computing Surveys, 27(3) (September 1995).

22. B. Rau, "Iterative Modulo Scheduling: An Algorithm for Software Pipelining Loops," Proceedings of the 27th International Symposium on Microarchitecture (MICRO-27), San Jose, Calif. (December 1994): 63-74.

23. S.-M. Moon and K. Ebcioglu, "Parallelizing Nonnumerical Code with Selective Scheduling and Software Pipelining," ACM Transactions on Programming Languages and Systems, 18(6) (November 1997): 853-898.

24. S. Novak and A. Nicolau, "An Efficient Global Resource Directed Approach to Exploiting Instruction-Level Parallelism," Proceedings of the 1996 International Conference on Parallel Architectures and Compiler Techniques (PACT 96), Boston, Mass. (October 1996): 87-96.

25. D. Bernstein and M. Rodeh, "Global Instruction Scheduling for Superscalar Machines," Proceedings of the ACM SIGPLAN 1991 Conference on Programming Language Design and Implementation, Toronto, Canada (June 1991): 241-255.

26. D. Bernstein, D. Cohen, and H. Krawczyk, "Code Duplication: An Assist for Global Instruction Scheduling," Proceedings of the 24th International Symposium on Microarchitecture (MICRO-24), Albuquerque, N. Mex. (November 1991): 103-113.

27. D. Bernstein, D. Cohen, Y. Lavon, and V. Rainish, "Performance Evaluation of Instruction Scheduling on the IBM RS/6000," Proceedings of the 25th International Symposium on Microarchitecture (MICRO-25), Portland, Oreg. (December 1992): 226-235.

28. J. Ferrante, K. Ottenstein, and J. Warren, "The Program Dependence Graph and Its Use in Optimization," ACM Transactions on Programming Languages and Systems, 9(3) (July 1987): 319-349.


Biographies

Philip H. Sweany

Associate Professor Phil Sweany has been a member of Michigan Technological University's Computer Science faculty since 1991. He has been investigating compiler techniques for instruction-level parallel (ILP) architectures, co-authoring several papers on instruction scheduling, register assignment, and the interaction between these two optimizations. Phil has been the primary designer and implementer of Rocket, a highly optimizing compiler that is easily retargetable for a wide range of ILP architectures. His research has been significantly assisted by grants from Digital Equipment Corporation and the National Science Foundation. Phil received a B.S. in computer science in 1983 from Washington State University, and M.S. and Ph.D. degrees in computer science from Colorado State University in 1986 and 1992, respectively.

Steven M. Carr

Steve Carr is an assistant professor in the Department of Computer Science at Michigan Technological University. The focus of his research at the university is memory hierarchy management and optimization of instruction-level parallel architectures. Steve's research has been supported by both the National Science Foundation and Digital Equipment Corporation. He received a B.S. in computer science from Michigan Technological University in 1987 and M.S. and Ph.D. degrees from Rice University in 1990 and 1993, respectively. Steve is a member of ACM and an IEEE Computer Society Affiliate.

Brett L. Huber

Raised in Hope, Michigan, Brett earned B.S. and M.S. degrees in computer science at Michigan Technological University in Michigan's historic Keweenaw Peninsula. He is an engineer in the Software Development Systems group at Texas Instruments, Inc., and is currently developing an optimizing compiler for the TMS320C6x family of VLIW digital signal processors. Brett is a member of the ACM and an IEEE Computer Society Affiliate.

Maximizing Multiprocessor Performance with the SUIF Compiler

Mary W. Hall
Jennifer M. Anderson
Saman P. Amarasinghe
Brian R. Murphy
Shih-Wei Liao
Edouard Bugnion
Monica S. Lam

Parallelizing compilers for multiprocessors face many hurdles. However, SUIF's robust analysis and memory optimization techniques enabled speedups on three fourths of the NAS and SPECfp95 benchmark programs.

The affordability of shared memory multiprocessors offers the potential of supercomputer-class performance to the general public. Typically used in a multiprogramming mode, these machines increase throughput by running several independent applications in parallel. But multiple processors can also work together to speed up single applications. This requires that ordinary sequential programs be rewritten to take advantage of the extra processors. Automatic parallelization with a compiler offers a way to do this.

Parallelizing compilers face more difficult challenges from multiprocessors than from vector machines, which were their initial target. Using a vector architecture effectively involves parallelizing repeated arithmetic operations on large data streams, for example, the innermost loops in array-oriented programs. On a multiprocessor, however, this approach typically does not provide sufficient granularity of parallelism: Not enough work is performed in parallel to overcome processor synchronization and communication overhead. To use a multiprocessor effectively, the compiler must exploit coarse-grain parallelism, locating large computations that can execute independently in parallel.

Locating parallelism is just the first step in producing efficient multiprocessor code. Achieving high performance also requires effective use of the memory hierarchy, and multiprocessor systems have more complex memory hierarchies than typical vector machines: They contain not only shared memory but also multiple levels of cache memory.

These added challenges often limited the effectiveness of early parallelizing compilers for multiprocessors, so programmers developed their applications from scratch, without assistance from tools. But explicitly managing an application's parallelism and memory use requires a great deal of programming knowledge, and the work is tedious and error-prone. Moreover, the resulting programs are optimized for only a specific machine. Thus, the effort required to develop efficient parallel programs restricts the user base for multiprocessors.

© 1996 IEEE. Reprinted, with permission, from Computer, December 1996, pages 84-89. This paper has been modified for publication here with the addition of the section The Status and Future of SUIF.

This article describes automatic parallelization techniques in the SUIF (Stanford University Intermediate

Format) compiler that result in good multiprocessor performance for array-based numerical programs. We provide SUIF performance measurements for the complete NAS and SPECfp95 benchmark suites. Overall, the results for these scientific programs are promising. The compiler yields speedups on three fourths of the programs and has obtained the highest ever performance on the SPECfp95 benchmark, indicating that the compiler can also achieve efficient absolute performance.

Finding Coarse-grain Parallelism

Multiprocessors work best when the individual processors have large units of independent computation, but it is not easy to find such coarse-grain parallelism. First the compiler must find available parallelism across procedure boundaries. Furthermore, the original computations may not be parallelizable as given and may first require some transformations. For example, experience in parallelizing by hand suggests that we must often replace global arrays with private versions on different processors. In other cases, the computation may need to be restructured; for example, we may have to replace a sequential accumulation with a parallel reduction operation.

It takes a large suite of robust analysis techniques to successfully locate coarse-grain parallelism. General and uniform frameworks helped us manage the complexity involved in building such a system into SUIF. We automated the analysis to privatize arrays and to recognize reductions to both scalar and array variables. Our compiler's analysis techniques all operate seamlessly across procedure boundaries.

Scalar Analyses

An initial phase analyzes scalar variables in the programs. It uses techniques such as data dependence analysis, scalar privatization analysis, and reduction recognition to detect parallelism among operations with scalar variables. It also derives symbolic information on these scalar variables that is useful in the array analysis phase. Such information includes constant propagation, induction variable recognition and elimination, recognition of loop-invariant computations, and symbolic relation propagation.

Array Analyses

An array analysis phase uses a unified mathematical framework based on linear algebra and integer linear programming. The analysis applies the basic data dependence test to determine if accesses to an array can refer to the same location. To support array privatization, it also finds array data flow information that determines whether array elements used in an iteration refer to the values produced in a previous iteration. Moreover, it recognizes commutative operations on sections of an array and transforms them into parallel reductions. The reduction analysis is powerful enough to recognize commutative updates of even indirectly accessed array locations, allowing parallelization of sparse computations.

All these analyses are formulated in terms of integer programming problems on systems of linear inequalities that represent the data accessed. These inequalities are derived from loop bounds and array access functions. Implementing optimizations to speed up common cases reduces the compilation time.

Interprocedural Analysis Framework

All the analyses are implemented using a uniform interprocedural analysis framework, which helps manage the software engineering complexity. The framework uses interprocedural data flow analysis, which is more efficient than the more common technique of inline substitution. Inline substitution replaces each procedure call with a copy of the called procedure, then analyzes the expanded code in the usual intraprocedural manner. Inline substitution is not practical for large programs, because it can make the program too large to analyze.

Our technique analyzes only a single copy of each procedure, capturing its side effects in a function. This function is then applied at each call site to produce precise results. When different calling contexts make it necessary, the algorithm selectively clones a procedure so that code can be analyzed and possibly parallelized under different calling contexts (as when different constant values are passed to the same formal parameter). In this way the full advantages of inlining are achieved without expanding the code indiscriminately.

In Figure 1 the boxes represent procedure bodies, and the lines connecting them represent procedure calls. The main computation is a series of four loops to compute three-dimensional fast Fourier transforms. Using interprocedural scalar and array analyses, the SUIF compiler determines that these loops are parallelizable. Each loop contains more than 500 lines of code spanning up to nine procedures with up to 42 procedure calls. If this program had been fully inlined, the loops presented to the compiler for analysis would have each contained more than 86,000 lines of code.

Figure 1
The compiler discovers parallelism through interprocedural array analysis. Each of the four parallelized loops at left consists of more than 500 lines of code spanning up to nine procedures (boxes) with up to 42 procedure calls (lines).

Memory Optimization

Numerical applications on high-performance microprocessors are often memory bound. Even with one or more levels of cache to bridge the gap between processor and memory speeds, a processor may still waste half its time stalled on memory accesses because it frequently references an item not in the cache (a cache miss). This memory bottleneck is further exacerbated on multiprocessors by their greater need for memory traffic, resulting in more contention on the memory bus.
An effective compiler must address four issues that affect cache behavior:

• Communication: Processors in a multiprocessor system communicate through accesses to the same memory location. Coherent caches typically keep the data consistent by causing accesses to data written by another processor to miss in the cache. Such misses are called true sharing misses.

• Limited capacity: Numeric applications tend to have large working sets, which typically exceed cache capacity. These applications often stream through large amounts of data before reusing any of it, resulting in poor temporal locality and numerous capacity misses.

• Limited associativity: Caches typically have a small set associativity; that is, each memory location can map to only one or just a few locations in the cache. Conflict misses (when an item is discarded and later retrieved) can occur even when the application's working set is smaller than the cache, if the data are mapped to the same cache locations.

• Large line size: Data in a cache are transferred in fixed-size units called cache lines. Applications that do not use all the data in a cache line incur more misses and are said to have poor spatial locality. On a multiprocessor, large cache lines can also lead to cache misses when different processors use different parts of the same cache line. Such misses are called false sharing misses.
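The limited-associativity and large-line-size issues both follow from how an address is split into a cache line number and a set index. The sketch below is a minimal model under assumed parameters (a direct-mapped 8-Kbyte cache with 64-byte lines); these constants and the helper names are illustrative assumptions and are not taken from the article or from any particular processor.

```python
# Assumed, illustrative cache geometry (not from the article).
LINE_SIZE = 64          # bytes per cache line
CACHE_SIZE = 8 * 1024   # total cache capacity in bytes
WAYS = 1                # associativity; 1 means direct-mapped
NUM_SETS = CACHE_SIZE // (LINE_SIZE * WAYS)


def line_number(address):
    """Index of the cache line (fixed-size unit) that holds this byte address."""
    return address // LINE_SIZE


def set_index(address):
    """Cache set that this address maps to."""
    return line_number(address) % NUM_SETS


def compete_for_same_set(addr_a, addr_b):
    """True if two addresses lie in different lines that map to the same set.

    In a direct-mapped cache such lines evict each other (conflict misses),
    even when the total working set is far smaller than the cache.
    """
    return line_number(addr_a) != line_number(addr_b) and set_index(addr_a) == set_index(addr_b)


def may_false_share(addr_a, addr_b):
    """True if two distinct addresses fall in the same cache line.

    If different processors write these two addresses, the shared line can
    bounce between their caches although no datum is actually shared.
    """
    return addr_a != addr_b and line_number(addr_a) == line_number(addr_b)


print(compete_for_same_set(0, CACHE_SIZE))  # addresses one cache size apart
print(may_false_share(0, 8))                # two words inside one line
```

In this model, two arrays whose starting addresses differ by a multiple of the cache size collide element by element, which is the kind of layout the data transformations discussed later are meant to avoid.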
The compiler tries to eliminate as many cache misses as possible, then minimize the impact of any that remain by

• ensuring that processors reuse the same data as many times as possible and

• making the data accessed by each processor contiguous in the shared address space.

Techniques for addressing each of these subproblems are discussed below. Finally, to tolerate the latency of remaining cache misses, the compiler uses compiler-inserted prefetching to move data into the cache before it is needed.
Improving Processor Data Reuse

The compiler reorganizes the computation so that each processor reuses data to the greatest possible extent. This reduces the working set on each processor, thereby minimizing capacity misses. It also reduces interprocessor communication and thus minimizes true sharing misses. To achieve optimal reuse, the compiler uses affine partitioning. This technique analyzes reference patterns in the program to derive an affine mapping (linear transformation plus an offset) of the computation of the data to the processors. The affine mappings are chosen to maximize a processor's reuse of data while maintaining sufficient parallelism to keep all processors busy. The compiler also uses loop blocking to reorder the computation executed on a single processor so that data is reused in the cache.
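Loop blocking can be illustrated on a small matrix multiplication. The sketch below is a hand-written example, not SUIF output: the function names, the square-matrix restriction, and the block size are assumptions chosen for illustration. The blocked version performs the same multiply-add operations as the plain triple loop but visits them tile by tile, so each tile of the operands is reused while it would still be resident in a cache.

```python
def matmul_naive(a, b):
    """Reference triple loop for square matrices given as lists of lists."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            for j in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_blocked(a, b, block=2):
    """Same computation, reordered into block x block tiles (loop blocking)."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    # The three outer loops walk over tiles; the three inner loops stay
    # inside one tile, so its data is reused before moving on.
    for ii in range(0, n, block):
        for kk in range(0, n, block):
            for jj in range(0, n, block):
                for i in range(ii, min(ii + block, n)):
                    for k in range(kk, min(kk + block, n)):
                        a_ik = a[i][k]
                        for j in range(jj, min(jj + block, n)):
                            c[i][j] += a_ik * b[k][j]
    return c


a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
b = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
result = matmul_blocked(a, b, block=2)
print(result == matmul_naive(a, b))
```

In a compiled language the block size would be chosen so that the tiles touched by the inner loops fit in the cache; in this Python sketch the reordering only demonstrates that the result is unchanged.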


Making Processor Data Contiguous

The compiler tries to arrange the data to make a processor's accesses contiguous in the shared address space. This improves spatial locality while reducing conflict misses and false sharing. SUIF can manage data placement within a single array and across multiple arrays. The data-to-processor mappings computed by the affine partitioning analysis are used to determine the data being accessed by each processor.

Figure 2 shows how the compiler's use of data permutation and data strip-mining can make contiguous the data within a single array that is accessed by one processor. Data permutation interchanges the dimensions of the array, for example, transposing a two-dimensional array. Data strip-mining changes an array's dimensionality so that all data accessed by the same processor are in the same plane of the array.

To make data across multiple arrays accessed by the same processor contiguous, we use a technique called compiler-directed page coloring. The compiler uses its knowledge of the access patterns to direct the operating system's page allocation policy to make each processor's data contiguous in the physical address space. The operating system uses these hints to determine the virtual-to-physical page mapping at page allocation time.

Experimental Results

We conducted a series of performance evaluations to demonstrate the impact of SUIF's analyses and optimizations. We obtained measurements on a Digital AlphaServer 8400 with eight 21164 processors, each with two levels of on-chip cache and a 4-Mbyte external cache. Because speedups are harder to obtain on machines with fast processors, our use of a state-of-the-art machine makes the results more meaningful and applicable to future systems.

We used two complete standard benchmark suites to evaluate our compiler. We present results for the

[Figure 2 diagram: two-dimensional arrays drawn with x and y axes, in panels labeled STRIP-MINING and PERMUTATION.]

Figure 2
Data transformations can make the data accessed by each processor contiguous in the shared address space. In the two examples above, the original arrays are two-dimensional; the axes are identified to show that elements along the first axis are contiguous. First the affine partitioning analysis determines which data elements are accessed by the same processor (the shaded elements are accessed by the first processor). Second, data strip-mining turns the 2D array into a 3D array, with the shaded elements in the same plane. Finally, applying data permutation rotates the array, making data accessed by each processor contiguous.
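The effect of the two transformations in Figure 2 can be sketched as index arithmetic. The example below rests on assumptions that are not stated in the article: the array is N by M with the first index varying fastest in memory, processor p owns a block of N / P consecutive values of the first index, and P divides N. The function names are illustrative only.

```python
def original_offset(x, y, n):
    """Linear offset in the original N x M layout (first index fastest)."""
    return x + n * y


def transformed_offset(x, y, n, m, procs):
    """Offset after strip-mining the first axis and permuting the result.

    Strip-mining splits x into (x_in, p), where p is the owning processor.
    Permutation then makes p the slowest-varying dimension, so the layout
    becomes (x_in, y, p) and each processor's data forms one solid chunk.
    """
    block = n // procs              # assumes procs divides n evenly
    p, x_in = divmod(x, block)
    return x_in + block * (y + m * p)


def owned_offsets(p, n, m, procs, transformed):
    """Sorted memory offsets of all elements owned by processor p."""
    block = n // procs
    xs = range(p * block, (p + 1) * block)
    if transformed:
        return sorted(transformed_offset(x, y, n, m, procs) for x in xs for y in range(m))
    return sorted(original_offset(x, y, n) for x in xs for y in range(m))


# A 4 x 3 array shared by 2 processors.
print(owned_offsets(0, 4, 3, 2, transformed=False))  # scattered offsets
print(owned_offsets(0, 4, 3, 2, transformed=True))   # one contiguous range
```

With these assumptions, the first processor's elements are scattered through memory in the original layout and occupy a single contiguous range after the transformation, which is the property the figure illustrates.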


10 programs in the SPECfp95 benchmark suite, which is commonly used for benchmarking uniprocessors. We also used the eight official benchmark programs from the NAS parallel-system benchmark suite, except for embar; here we used a slightly modified version from Applied Parallel Research.

Figure 3 shows the SPECfp95 and NAS speedups, measured on up to eight processors on a 300-MHz AlphaServer. We calculated the speedups over the best sequential execution time from either officially reported results or our own measurements. Note that mgrid and applu appear in both benchmark suites (the program source and data set sizes differ slightly).

To measure the effects of the different compiler techniques, we broke down the performance obtained on eight processors into three components. In Figure 4, baseline shows the speedup obtained with parallelization using only intraprocedural data dependence analysis, scalar privatization, and scalar reduction transformations. Coarse grain includes the baseline techniques as well as techniques for locating coarse-grain parallel loops, for example, array privatization and reduction transformations, and full interprocedural analysis of both scalar and array variables. Memory includes the coarse-grain techniques as well as the multiprocessor memory optimizations we described earlier.

Figure 3 shows that of the 18 programs, 13 show good parallel speedup and can thus take advantage of additional processors. SUIF's coarse-grain techniques and memory optimizations significantly affect the performance of half the programs. The swim and tomcatv programs show superlinear speedups because the compiler eliminates almost all cache misses and their 14 Mbyte working sets fit into the multiprocessor's aggregate cache.

For most of the programs that did not speed up, the compiler found much of their computation to be parallelizable, but the granularity is too fine to yield good multiprocessor performance on machines with fast processors. Only two applications, fpppp and buk, have

[Figure 3 plots: SPEEDUP versus PROCESSORS (1 to 8) in two panels, (a) SPECfp95 and (b) NAS Parallel Benchmarks. Curve labels recovered from the plots: swim, tomcatv, mgrid, applu, turb3d, hydro2d, su2cor, embar, appbt, applu, cgm, appsp, buk, fftpde.]

Figure 3

SUIF compiler speedups over the best sequential time achieved on the (a) SPECfp95 and (b) NAS parallel benchmarks.


[Figure 4 bar chart: SPEEDUP (0 to 14) on eight processors for each benchmark. KEY: MEMORY OPTIMIZATION, COARSE-GRAIN PARALLELISM, BASELINE]

Figure 4

The speedup achieved on eight processors is broken down into three components to show how SUIF's memory optimization and discovery of coarse-grain parallelism affected performance.

no statically analyzable loop-level parallelism, so they are not amenable to our techniques.

Table 1 shows the times and SPEC ratios obtained on an eight-processor, 440-MHz Digital AlphaServer 8400, testifying to our compiler's high absolute performance. The SPEC ratios compare machine performance with that of a reference machine. (These are not official SPEC ratings, which among other things require that the software be generally available. The ratios we obtained are nevertheless valid in assessing our compiler's performance.) The geometric mean of the SPEC ratios improves over the uniprocessor execution by a factor of 3 with four processors and by a factor of 4.3 with eight processors. Our eight-processor ratio of 63.9 represents a 50 percent improvement over the highest number reported to date.

Table 1

Absolute Performance for the SPECfp95 Benchmarks Measured on a 440-MHz Digital AlphaServer Using One Processor, Four Processors, and Eight Processors

                   Execution Time (secs)         SPEC Ratio
Benchmark            1P      4P      8P        1P      4P      8P
tomcatv           219.1    30.3    18.5      16.9   122.1   200.0
swim              297.9    33.5    17.2      28.9   256.7   500.0
su2cor            155.0    44.9    31.0       9.0    31.2    45.2
hydro2d           249.4    61.1    40.7       9.6    39.3    59.0
mgrid             185.3    42.0    27.0      13.5    59.5    92.6
applu             296.1    85.5    39.5       7.4    25.7    55.7
turb3d            267.7    73.6    43.5      15.3    55.7    94.3
apsi              137.5   141.2   143.2      15.3    14.9    14.7
fpppp             331.6   331.6   331.6      29.0    29.0    29.0
wave5             151.8   141.9   147.4      19.8    21.1    20.4
Geometric Mean                               15.0    44.4    63.9
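
The geometric means in the last row of Table 1 can be recomputed from the per-benchmark SPEC ratios. The short sketch below does so; the variable and function names are illustrative, and the ratios are the values read from Table 1.

```python
import math

# SPEC ratios read from Table 1 for one, four, and eight processors.
spec_ratios = {
    "tomcatv": (16.9, 122.1, 200.0),
    "swim":    (28.9, 256.7, 500.0),
    "su2cor":  (9.0, 31.2, 45.2),
    "hydro2d": (9.6, 39.3, 59.0),
    "mgrid":   (13.5, 59.5, 92.6),
    "applu":   (7.4, 25.7, 55.7),
    "turb3d":  (15.3, 55.7, 94.3),
    "apsi":    (15.3, 14.9, 14.7),
    "fpppp":   (29.0, 29.0, 29.0),
    "wave5":   (19.8, 21.1, 20.4),
}


def geometric_mean(values):
    """Geometric mean computed through logarithms to avoid overflow."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


# One geometric mean per processor count (1P, 4P, 8P).
means = [geometric_mean(r[k] for r in spec_ratios.values()) for k in range(3)]

# Improvement of the multiprocessor means over the uniprocessor mean.
improvement_4p = means[1] / means[0]
improvement_8p = means[2] / means[0]
```

The resulting means agree with the tabulated 15.0, 44.4, and 63.9 to within rounding, and the two improvement factors come out close to the factors of 3 and 4.3 quoted in the text.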


Acknowledgments

This research was supported in part by the Air Force Materiel Command and ARPA contracts F30602-95-C-0098, DABT63-95-C-0118, and DABT63-94-C-0054; a Digital Equipment Corporation grant; an NSF Young Investigator Award; an NSF CISE postdoctoral fellowship; and fellowships from AT&T Bell Laboratories, DEC Western Research Laboratory, Intel Corp., and the National Science Foundation.

References
1. J.M. Anderson, S.P. Amarasinghe, and M.S. Lam, "Data and Computation Transformations for Multiprocessors," Proc. Fifth ACM SIGPlan Symp. Principles and Practice of Parallel Programming, ACM Press, New York, 1995, pp. 166-178.
2. J.M. Anderson and M.S. Lam, "Global Optimizations for Parallelism and Locality on Scalable Parallel Machines," Proc. SIGPlan '93 Conf. Programming Language Design and Implementation, ACM Press, New York, 1993, pp. 112-125.
3. P. Banerjee et al., "The Paradigm Compiler for Distributed-Memory Multicomputers," Computer, Oct. 1995, pp. 37-47.
4. W. Blume et al., "Effective Automatic Parallelization with Polaris," Int'l J. Parallel Programming, May 1995.
5. E. Bugnion et al., "Compiler-Directed Page Coloring for Multiprocessors," Proc. Seventh Int'l Conf. Architectural Support for Programming Languages and Operating Systems, ACM Press, New York, 1996, pp. 244-257.
6. K. Cooper et al., "The ParaScope Parallel Programming Environment," Proc. IEEE, Feb. 1993, pp. 244-263.
7. Standard Performance Evaluation Corp., "Digital Equipment Corporation AlphaServer 8400 5/440 SPEC CFP95 Results," SPEC Newsletter, Oct. 1996.
8. M. Haghighat and C. Polychronopolous, "Symbolic Analysis for Parallelizing Compilers," ACM Trans. Programming Languages and Systems, July 1996, pp. 477-518.
9. M.W. Hall et al., "Detecting Coarse-Grain Parallelism Using an Interprocedural Parallelizing Compiler," Proc. Supercomputing '95, IEEE CS Press, Los Alamitos, Calif., 1995 (CD-ROM only).
10. P. Havlak, Interprocedural Symbolic Analysis, PhD thesis, Dept. of Computer Science, Rice Univ., May 1994.
11. F. Irigoin, P. Jouvelot, and R. Triolet, "Semantical Interprocedural Parallelization: An Overview of the PIPS Project," Proc. 1991 ACM Int'l Conf. Supercomputing, ACM Press, New York, 1991, pp. 244-251.
12. K. Kennedy and U. Kremer, "Automatic Data Layout for High Performance Fortran," Proc. Supercomputing '95, IEEE CS Press, Los Alamitos, Calif., 1995 (CD-ROM only).

Editors' Note: With the following section, the authors provide an update on the status of the SUIF compiler since the publication of their paper in Computer in December 1996.

Addendum: The Status and Future of SUIF

Public Availability of SUIF-parallelized Benchmarks

The SUIF-parallelized versions of the SPECfp95 benchmarks used for the experiments described in this paper have been released to the SPEC committee and are available to any license holders of SPEC (see http://www.specbench.org/osg/cpu95/par-research). This benchmark distribution contains the SUIF output (C and FORTRAN code), along with the source code for the accompanying run-time libraries. We expect these benchmarks will be useful for two purposes: (1) for technology transfer, providing insight into how the compiler transforms the applications to yield the reported results; and (2) for further experimentation, such as in architecture-simulation studies.
The SUIF compiler system itself is available from the SUIF web site at http://www-suif.stanford.edu. This system includes only the standard parallelization analyses that were used to obtain our baseline results.
New Parallelization Analyses in SUIF

Overall, the results of automatic parallelization reported in this paper are impressive; however, a few applications either do not speed up at all or achieve limited speedup at best. The question arises as to whether SUIF is exploiting all the available parallelism in these applications. Recently, an experiment to answer this question was performed in which loops left unparallelized by SUIF were instrumented with run-time tests to determine whether opportunities for increasing the effectiveness of automatic parallelization remained in these programs.1 Run-time testing determined that eight of the programs from the NAS and SPECfp95 benchmarks had additional parallel loops, for a total of 69 additional parallelizable loops, which is less than 5% of the total number of loops in these programs. Of these 69 loops, the remaining parallelism had a significant effect on coverage (the percentage of the program that is parallelizable) or granularity (the size of the parallel regions) in only four of the programs: apsi, su2cor, wave5, and fftpde.
We found that almost all the significant loops in these four programs could potentially be parallelized using a new approach that associates predicates with array data-flow values.2 Instead of producing conservative results that hold for all control-flow paths and all possible program inputs, predicated array data-flow analysis can derive optimistic results guarded by predicates. Predicated array data-flow analysis can lead to more effective automatic parallelization in three ways: (1) It improves compile-time analysis by ruling out infeasible control-flow paths. (2) It provides a framework for the compiler to introduce predicates that, if proven true, would guarantee safety for desirable data-flow values. (3) It enables the compiler to derive low-cost run-time parallelization tests based on the predicates associated with desirable data-flow values.
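
The third point, a low-cost run-time parallelization test, can be illustrated with a two-version loop. The sketch below is a hand-written example, not output of SUIF or of the predicated analysis itself: the loop, the predicate, and all function names are assumptions made for illustration, and reverse-order execution stands in for parallel execution.

```python
def loop_is_parallel(n, k):
    """Run-time predicate for the loop  a[i + k] = a[i] + b[i], 0 <= i < n,
    with k >= 0. The iterations are independent when no iteration reads an
    element that a different iteration writes."""
    return k == 0 or k >= n


def run_sequential(a, b, n, k):
    """Reference version: iterations executed in source order."""
    for i in range(n):
        a[i + k] = a[i] + b[i]
    return a


def run_any_order(a, b, n, k):
    """Stand-in for a parallel version: iterations run in reverse order."""
    for i in reversed(range(n)):
        a[i + k] = a[i] + b[i]
    return a


def run_guarded(a, b, n, k):
    """Two-version loop: take the order-independent version only when the
    run-time test succeeds; otherwise fall back to the sequential loop."""
    if loop_is_parallel(n, k):
        return run_any_order(a, b, n, k)
    return run_sequential(a, b, n, k)
```

The test costs two comparisons per loop entry, which is the sense in which such a guard is cheap compared with the loop it protects.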

SUIF and Compaq's GEM Compiler

The GEM compiler system is the technology Compaq has been using to build compiler products for a variety of languages and hardware/software platforms.3 Within Compaq, work has been done to connect SUIF with the GEM compiler. SUIF's intermediate representation was converted into GEM's intermediate representation, so that SUIF code can be passed directly to GEM's optimizing back end. This eliminates the loss of information suffered when SUIF code is translated to C/FORTRAN source before it is passed to GEM. It also enables us to generate more efficient code for Alpha-microprocessor systems.

SUIF and the National Compiler Infrastructure

The SUIF compiler system was recently chosen to be part of the National Compiler Infrastructure (NCI) project funded by the Defense Advanced Research Projects Agency (DARPA) and the National Science Foundation (NSF). The goal of the project is to develop a common compiler platform for researchers and to facilitate technology transfer to industry. The SUIF component of the NCI project is the result of the collaboration among researchers in five universities (Harvard University, Massachusetts Institute of Technology, Rice University, Stanford University, University of California at Santa Barbara) and one industrial partner, Portland Group Inc. Compaq is a corporate sponsor of the project and is providing the FORTRAN front end.
A revised version of the SUIF infrastructure (SUIF 2.0) is being released as part of the SUIF NCI project (a preliminary version of SUIF 2.0 is available at the SUIF web site). The completed system will be enhanced to support parallelization, interprocedural analysis, memory hierarchy optimizations, object-oriented programming, scalar optimizations, and machine-dependent optimizations. An overview of the SUIF NCI system is shown in Figure A1. See www-suif.stanford.edu/suif/NCI/suif.html for more information about SUIF and the NCI project, including a complete list of optimizations and a schedule.

[Figure A1 diagram: FRONT ENDS (including C/FORTRAN and C/C++); INTERPROCEDURAL ANALYSIS, PARALLELIZATION, LOCALITY OPTIMIZATIONS, OBJECT-ORIENTED OPTIMIZATIONS, SCALAR OPTIMIZATIONS; SCHEDULING, REGISTER ALLOCATION; TARGET LANGUAGES (including ALPHA and C/FORTRAN)]

Figure A1
The SUIF Compiler Infrastructure

References
1. B. So, S. Moon, and M. Hall, "Measuring the Effectiveness of Automatic Parallelization in SUIF," Proceedings of the International Conference on Supercomputing '98, July 1998.
2. S. Moon, M. Hall, and B. Murphy, "Predicated Array Data-Flow Analysis for Run-Time Parallelization," Proceedings of the International Conference on Supercomputing '98, July 1998.
3. D. Blickstein et al., "The GEM Optimizing Compiler System," Digital Technical Journal, vol. 4, no. 4 (Special Issue, 1992): 121-136.

Biographies
Mary W. Hall
Mary Hall is jointly a research assistant professor and project leader at the University of Southern California, Department of Computer Science and at USC's Information Sciences Institute, where she has been since 1996. Her research interests focus on compiler support for high-performance computing, particularly interprocedural analysis and automatic parallelization. She graduated magna cum laude with a B.A. in computer science and mathematical sciences in 1985 and received an M.S. and a Ph.D. in computer science in 1989 and 1991, respectively, all from Rice University. Prior to joining USC/ISI, she was a visiting assistant professor and senior research fellow in the Department of Computer Science at Caltech. In earlier positions, she was a research scientist at Stanford University, working with the SUIF Compiler group, and in the Center for Research on Parallel Computation at Rice University.

Brian R. Murphy
A doctoral candidate in computer science at Stanford University, Brian Murphy is currently working on advanced program analysis under SUIF as part of the National Compiler Infrastructure Project. He received a B.S. in computer science and engineering and an M.S. in electrical engineering and computer science from the Massachusetts Institute of Technology. His master's thesis work on program analysis was carried out with the Functional Languages group at the IBM Almaden Research Center. Brian was elected to the Tau Beta Pi and Eta Kappa Nu honor societies.

Shih-Wei Liao
Shih-Wei Liao is a doctoral candidate at the Stanford University Computer Systems Laboratory. His research interests include compiler algorithms and design, programming environments, and computer architectures. He received a B.S. in computer science from National Taiwan University in 1991 and an M.S. in electrical engineering from Stanford University in 1994.
Jennifer M. Anderson
Jennifer Anderson is a research staff member at Compaq's Western Research Laboratory where she has worked on the Digital Continuous Profiling Infrastructure (DCPI) project. Her research interests include compiler algorithms, programming languages and environments, profiling systems, and parallel and distributed systems software. She earned a B.S. in information and computer science from the University of California at Irvine and received M.S. and Ph.D. degrees in computer science from Stanford University.

Edouard Bugnion
Ed Bugnion holds a Diplom in engineering from the Swiss Federal Institute of Technology (ETH), Zurich (1994) and an M.S. from Stanford University (1996), where he is a doctoral candidate in computer science. His research interests include operating systems, computer architecture, and machine simulation. From 1996 to 1997, Ed was also a research consultant to Compaq's Western Research Laboratory. He is the recipient of a National Science Foundation Graduate Research Fellowship.

Saman P. Amarasinghe
Saman Amarasinghe is an assistant professor of computer science and engineering at the Massachusetts Institute of Technology and a member of the Laboratory for Computer Science. His research interests include compilers and computer architecture. He received a B.S. in electrical engineering and computer science from Cornell University and M.S. and Ph.D. degrees in electrical engineering from Stanford University.


Monica S. Lam
Monica Lam is an associate professor in the Computer Science Department at Stanford University. She leads the SUIF project, which is aimed at developing a common infrastructure to support research in compilers for advanced languages and architectures. Her research interests are compilers and computer architecture. Monica earned a B.S. from the University of British Columbia in 1980 and a Ph.D. in computer science from Carnegie Mellon University in 1987. She received the National Science Foundation Young Investigator award in 1992.

Debugging Optimized Code: Concepts and Implementation on DIGITAL Alpha Systems

Ronald F. Brender
Jeffrey E. Nelson
Mark E. Arsenault

Effective user debugging of optimized code has been a topic of theoretical and practical interest in the software development community for almost two decades, yet today the state of the art is still highly uneven. We present a brief survey of the literature and current practice that leads to the identification of three aspects of debugging optimized code that seem to be critical as well as tractable without extraordinary efforts. These aspects are (1) split lifetime support for variables whose allocation varies within a program combined with definition point reporting for currency determination, (2) stepping and setting breakpoints based on a semantic event characterization of program behavior, and (3) treatment of inlined routine calls in a manner that makes inlining largely transparent. We describe the realization of these capabilities as part of Compaq's GEM back-end compiler technology and the debugging component of the OpenVMS Alpha operating system.

Introduction
In software development, it is common practice to debug a program that has been compiled with little or no optimization applied. The generated code closely corresponds to the source and is readily described by a simple and straightforward debugging symbol table. A debugger can interpret and control execution of the code in a fashion close to the user's source-level view of the program.
Sometimes, however, developers find it necessary or desirable to debug an optimized version of the program. For instance, a bug (whether a compiler bug or incorrect source code) may only reveal itself when optimization is applied. In other cases, the resource constraints may not allow the unoptimized form to be used because the code is too big and/or too slow. Or, the developer may need to start analysis using the remains, such as a core file, of the failed program, whether or not this code has been optimized. Whatever the reason, debugging optimized code is harder than debugging unoptimized code, much harder, because optimization can greatly complicate the relationship between the source program and the generated code.
Zellweger1 introduced the terms expected behavior and truthful behavior when referring to debugging optimized code. A debugger provides expected behavior if it provides the behavior a user would experience when debugging an unoptimized version of a program. Since achieving that behavior is often not possible, a secondary goal is to provide at least truthful behavior, that is, to never lie to or mislead a user. In our experience, even truthful behavior can be challenging to achieve, but it can be closely approached.
This paper describes three improvements made to Compaq's GEM back-end compiler system and to OpenVMS DEBUG, the debugging component of the OpenVMS Alpha operating system. These improvements address

1. Split lifetime variables and currency determination
2. Semantic events
3. Inlining
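
To make the first item concrete: a split lifetime variable is one whose allocation varies within a program, so a debugger needs to know where the variable lives at a given program counter. The following is a minimal sketch of such a lookup. The table layout, addresses, and location names are hypothetical and are not the OpenVMS DEBUG symbol table format.

```python
from bisect import bisect_right

# Hypothetical location list for one variable: (start_pc, end_pc, location),
# sorted by start_pc, with half-open ranges [start_pc, end_pc).
x_locations = [
    (0x100, 0x120, "register R3"),
    (0x120, 0x148, "stack offset 16"),
    (0x160, 0x180, "register R7"),
]


def locate(locations, pc):
    """Return where the variable lives at `pc`, or None if it has no
    allocation there (for example, between two split lifetimes)."""
    starts = [start for start, _, _ in locations]
    index = bisect_right(starts, pc) - 1
    if index >= 0:
        start, end, where = locations[index]
        if start <= pc < end:
            return where
    return None
```

A lookup that returns None lets the debugger report that the variable is unavailable rather than display the contents of a location the generated code no longer uses for it.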

Before presenting the details of this work, we discuss the alternative approaches to debugging optimized code that we considered, the state of the art, and the operating strategies we adopted.

Alternative Approaches

Various approaches have been explored to improve the ability to debug optimized code. They include the following:

• Enhance debugger analysis
• Limit optimization
• Limit debugging to preplanned locations
• Dynamically deoptimize as needed
• Exploit an associated program database

We touch on these approaches in turn.
In probably the oldest theoretical analysis that supports debugging optimized code, Hennessy2 studies whether the value displayed for a variable is current, that is, the expected value for that variable at a given point in the program. The value displayed might not be current because, for example, assignment of a later value has been moved forward or the relevant assignment has been delayed or omitted. Hennessy postulates that a flow graph description of a program is communicated to the debugger, which then solves certain flow analysis equations in response to debug commands to determine currency as needed. Copperman3 takes a similar though much more general approach. Conversely, commercial implementations have favored more complete preprocessing of information in the compiler to enable simpler debugger mechanisms.4-6
If optimization is the "problem," then one approach to solving the problem is to limit optimization to only those kinds that are actually supported in an available debugger. Zurawski7 develops the notion of a recovery function that matches each kind of optimization. As an optimization is applied during compilation, the compensating recovery function is also created and made available for later use by a debugger. If such a recovery function cannot be created, then the optimization is omitted. Unfortunately, code-motion-related optimizations generally lack recovery functions and so must be foregone. Taking this approach to the extreme converges with traditional practice, which is simply to disable all optimization and debug a completely unoptimized program.
If full debugger functionality need only be provided at some locations, then some debugger capabilities can be provided more easily. Zurawski7 also employed this idea to make it easier to construct appropriate recovery functions. This approach builds on a language-dependent concept of inspection points, which generally must include all call sites and may correspond to most statement boundaries. His experience suggests, however, that even limiting inspection points to statement boundaries severely limits almost all kinds of optimization.
Hölzle et al.8 describe techniques to dynamically deoptimize part of a program (replace optimized code with its unoptimized equivalent) during debugging to enable a debugger to perform requested actions. They make the technique more tractable, in part by delaying asynchronous events to well-defined interruption points, generally backward branches and calls. Optimization between interruption points is unrestricted. However, even this choice of interruption points severely limits most code motion and many other global optimizations.
Pollock and others9,10 use a different kind of deoptimization, which might be called preplanned, incremental deoptimization. During a debugging session, any debugging requests that cannot be honored because of optimization effects are remembered so that a subsequent compilation can create an executable that can honor these requests. This scheme is supported by an incremental optimizer that uses a program database to provide rapid and smooth forward information flow to subsequent debugging sessions.
Feiler11 uses a program database to achieve the benefits of interactive debugging while applying as much static compilation technology as possible. He describes techniques for maintaining consistency between the primary tree-based representation and a derivative compiled form of the program in the face of both debugging actions and program modifications on-the-fly. While he appears to demonstrate that more is possible than might be expected, substantial limitations still exist on debugging capability, optimization, or both.
A comprehensive introduction and overview to these and other approaches can be found in Copperman3 and Adl-Tabatabai.12 In addition, "An Annotated Bibliography on Debugging Optimized Code" is available separately on the Digital Technical Journal web site at http://www.digital.com/info/DTJ. This bibliography cites and summarizes the entire literature on debugging optimized code as best we know it.

State of the Art

When we began our work in early 1994, we assessed the level of support for debugging optimized code that was available with competitive compilers. Because we have not updated this assessment, it is not appropriate for us to report the results here in detail. We do however summarize the methodology used and the main results, which we believe remain generally valid.
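
The notion of currency discussed under Alternative Approaches can be made concrete for straight-line code. The sketch below is a deliberately simplified illustration, not Hennessy's or Copperman's algorithm: it assumes no branches, one assignment per statement, and hypothetical statement numbers, and it calls a variable current at a breakpoint when the same assignment reaches that point in both the source order and the optimized order.

```python
def reaching_definition(order, assigns, breakpoint_stmt, var):
    """Last statement assigning `var` that executes before `breakpoint_stmt`
    when statements run in the given order (straight-line code only)."""
    last = None
    for stmt in order:
        if stmt == breakpoint_stmt:
            break
        if assigns[stmt] == var:
            last = stmt
    return last


def is_current(source_order, optimized_order, assigns, breakpoint_stmt, var):
    """A variable is current if the optimized code has executed exactly the
    assignment that the source order says should reach the breakpoint."""
    expected = reaching_definition(source_order, assigns, breakpoint_stmt, var)
    actual = reaching_definition(optimized_order, assigns, breakpoint_stmt, var)
    return expected == actual


# Hypothetical program: statement number -> variable it assigns.
assigns = {1: "x", 2: "y", 3: "x", 4: "z"}
source_order = [1, 2, 3, 4]
optimized_order = [1, 3, 2, 4]   # the optimizer hoisted statement 3 above 2
```

With this program, stopping at statement 2 finds x holding the value assigned by statement 3 rather than by statement 1, so x is not current there, while y is unaffected by the code motion.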

We created a series of example programs that provide opportunities for optimization of a particular kind or of related kinds, and which could lead a traditional debugger to deviate from expected behavior. We compiled and executed these programs under the control of each system's debugger and recorded how the system handled the various kinds of optimization. The range of observed behaviors was diverse.
At one extreme were compilers that automatically disable all optimization if a debugging symbol table is requested (or, equivalently for our purposes, give an error if both optimization and a debugging symbol table are requested). For these compilers, the whole exercise becomes moot; that is, attempting to debug optimized code is not allowed.
Some compiler/debugger combinations appeared to usefully support some of our test cases, although none handled all of them correctly. In particular, none seemed able to show a traceback of subroutine calls that compensated for inlining of routine calls and all seemed to produce a lot of jitter when stepping by line on systems where code is highly scheduled.
The worst example that we found allowed compilation using optimization but produced a debugging symbol table that did not reflect the results of that optimization. For example, local variables were described as allocated on the stack even though the generated code clearly used registers for these variables and never accessed any stack locations. At debug time, a request to examine such a variable resulted in the display of the irrelevant and never-accessed stack locations.
The bottom line from this analysis was very clear: the state of the art for support of debugging optimized code was generally quite poor. DIGITAL's debuggers, including OpenVMS DEBUG, were not unusual in this regard. The analysis did indicate some good examples, though. Both the CONVEX CXdb4,5 and the HP 9000 DOC6 systems provide many valuable capabilities.
Biases and Goals

Early in our work, we adopted the following strategies:

• Do not limit or compromise optimization in any way.
• Stay within the framework of the traditional edit-compile-link-debug cycle.
• Keep the burden of analysis within the compiler.

The prime directive for Compaq's GEM-based compilers is to achieve the highest possible performance from the Alpha architecture and chip technology. Any improvements in debugging such optimized code should be useful in the face of the best that a compiler has to offer. Conversely, if a programmer has the luxury of preparing a less optimized version for debugging purposes, there is little or no reason for that version to be anything other than completely unoptimized. There seems to be no particular benefit to creating a special intermediate level of combined debugger/optimization support.
Pragmatically, we did not have the time or staffing to develop a new optimization framework, for example, based on some kind of program database. Nor were we interested in intruding into those parts of the GEM compiler that performed optimization to create more complicated options and variations, which might be needed for dynamic deoptimization or recovery function creation.
Finally, it seemed sensible to perform most analysis activities within the compiler, where the most complete information about the program is already available. It is conceivable that passing additional information from the compiler to the debugger using the object file debugging symbol table might eventually tip the balance toward performing more analysis in the debugger proper. The available size data (presented later in this paper in Table 3) do not indicate this.
We identified three areas in which we felt enhanced capabilities would significantly improve support for debugging optimized code. These areas are

1. The handling of split lifetime variables and currency determination
2. The process of stepping through the program
3. The handling of procedure inlining

In the following sections we present the capabilities we developed in each of these areas together with insight into the implementation techniques employed.
First, we review the GEM and OpenVMS DEBUG framework in which we worked. The next three sections address the new capabilities in turn. The last major section explores the resource costs (compile-time size and performance, and object and image sizes) needed to realize these capabilities.
Starting Framework

Compaq's GEM compiler system and the OpenVMS DEBUG component of the OpenVMS operating system provide the framework for our work. A brief description of each follows.

GEM

The GEM compiler system13 is the technology Compaq is using to build state-of-the-art compiler products for a variety of languages and hardware and software platforms. The GEM system supports a range of languages (C, C++, FORTRAN including HPF, Pascal, Ada, COBOL, BLISS, and others) and has been successfully retargeted and rehosted for the Alpha, MIPS, and Intel IA-32 architectures and for the OpenVMS, DIGITAL UNIX, Windows NT, and Windows 95 operating systems.
The major components of a GEM compiler are the front end, the optimizer, the code generator, the final code stream optimizer, and the compiler shell.

• The front end produces intermediate language (IL) graphs and symbol tables. Front ends for all source languages translate to the same common representation.
• The optimizer transforms the IL generated by the front end into a semantically equivalent form that will execute faster on the target machine. A significant technical achievement is that a single optimizer is used for all languages and target platforms.
• The code generator translates the IL into a list of code cells, each of which represents one machine instruction for the target hardware. Virtually all the target machine instruction-specific code is encapsulated in the code generator.
• The final phase performs pattern-based peephole optimizations followed by instruction scheduling.
• The shell is a portable interface to the external environment in which the compiler is used. It provides common compiler functions such as listing generators, object file emitters, and command line processors in a form that allows the other components to remain independent of the operating system.

The bulk of the GEM implementation work described in this paper occurs at the boundary between the final phase and the object file output portion of the shell. A new debugging optimized code analysis phase examines the generated code stream representation of the program, together with the compiler symbol table, to extract the information necessary to pass on to a debugger through the debugging symbol table. Most of the implementation is readily adapted to different target architectures by means of the same instruction property tables that are used in the code generator and final optimizer.

Exa m i n e user variables and hardware recristers
0

•

Display a stack trace back showi n g the cu rrent cal l
stack

•

Set watch points

•

Perform many other fu nctions1'

Split Lifetime Variables and Cu rrency
Determination
Displayi ng (printing) the va l u e of a program vatiable is
one of th e most basic services that a debugger can pro
v i d e . For u n opti m i ze d code and tra d i tional d e b ug
gers, the mechan isms fo r doing this are ge nera l l y
based on several ass u m p tions.
l . A vari able h as a single al location that remains f-i xed
throughout its lifetime. For a local or a stack-allocated
variable that means throu ghout the l i fetime of the
scope in which the varia b l e i s declared .
2. Definitions a n d uses of the va l u es of user variables
occur in the same order in the ge nera ted cod e as
they do i n the origi n a l program sourc e .
3 . The s e t of in str u c tions t h a t b e l o n g t o a given scope
(which may b e a ro u ti n e bod y ) can be described by
a single contiguous range of addresses.
The first and second assumptions arc of interest in this
discussion because many GFM optim izations mal(e
the m inappropriate. Split lifeti me optimization (d is
cussed later in this section ) leads to violation o f the fi rst
assumption. Code motion optimization leads to vio l a
tion of the second assu m ption and thereby creates rl1e
so-called c u rrency problem. 'I'Ve treat both

�frl1ese prob
.>plit

lems together, and we refer to them collectively as

lifetime suppo11.

Statement and in struction sched u l ing

optimization leads to violation of the rl1i rd assumption.
This topi c is addressed l ater, in the section I n li ning.
Split Lifetime Variable Definition

A variable is said to have spl it l i fetimes i f the set o f

Open VMS DEBUG

The OpenVMS Alpha d e bugger, original ly d eveloped
for the OpenVMS VAX syste m , 1 '' i s a fu l l - fu nction ,
source- leve l , symbo l ic debugger. I t supports sym bolic
d e bugging o f programs written i n B LISS , MACR0 - 3 2 ,
MACR0-64, FORTRAN, A d a , C , C++, Pascal , P L/ 1 ,
BASIC, and COBOL. The d e b ugger al lows the user to
control the execution and to exa m i n e the state o f a
progra m . Users can

84

character- based user i n te rrace
•

The front end performs lexical ana lysis a n d pars i n g
o f th e sou rce program . T h e prim ary outputs are

D isplay t h e source-level v i e w of t h e program's exe
c u ti o n usi ng either a gra p h i c al user i n te rface or a

fetches and stores of the variable c a n be partitioned
such that none of the values stored i n o n e su bset are
ever fe tched in another s u bset. When such a partition
exists, the vari able can be "split" i n to several i n depen
dent " c h i l d " variabl es, each corresp o nd i n g to a parti 
tion . As i n d ependent vari a bles, the c h i l d varia b l es can
be a l located i n depe n d e n t ! }'· The eftect is that the
original vari able can be thought to reside in d i ftcrent
locations at d i fferent p o i nts i n ti me-so metim es in a

•

Set breakpoints to stop at certain poi nts i n the program

register,

•

Step through the ex ecution of the program a l i ne at

nowhere at a l l . I ndeed , it is even possible ror the dirfer

a time

ent child variables to be active s i m u l ta n eously.

Digital

Te chnical journal

Vol

to No. I

1 99 8

sometimes

in

memory,

and

someti mes
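
As a rough illustration of the partition just defined, the following Python sketch splits one variable's stores and fetches into child variables. It assumes straight-line code only, so each fetch belongs to the most recent store; the function name `split_children` and the event encoding are hypothetical and are not part of GEM, which works on a flow graph rather than a flat list.

```python
def split_children(events):
    """Partition the stores ("def") and fetches ("use") of one variable.

    events: list of (kind, line) pairs in execution order, straight-line
    code only (an assumption of this sketch). Each store starts a new
    child; each fetch belongs to the child started by the latest store.
    """
    children = []
    for kind, line in events:
        if kind == "def":
            children.append({"def": line, "uses": []})
        elif kind == "use":
            if not children:
                raise ValueError(f"use at line {line} has no earlier definition")
            children[-1]["uses"].append(line)
        else:
            raise ValueError(f"unknown event kind: {kind!r}")
    return children


# Two stores, each followed by one fetch, yield two independent children.
events_for_a = [("def", 1), ("use", 2), ("def", 3), ("use", 4)]
children_of_a = split_children(events_for_a)
```

Because no value stored by one child is fetched by another, each child could be given its own register or stack slot.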

Split Lifetime Example

A simple example of a split lifetime variable can be seen in the following straight-line code fragment:

    A = ...;          ! Define (assign value to) A
    B = ...A...;      ! Use definition (value of) A
    A = ...;          ! Define A again
    C = ...A...;      ! Use later definition A

In this example, the first value assigned to variable A is used later in the assignment to variable B and then never used again. A new value is assigned to A and used in the assignment to variable C.

Without changing the meaning of this fragment, we can rewrite the code as

    A1 = ...;         ! Define A1
    B = ...A1...;     ! Use A1
    A2 = ...;         ! Define A2
    C = ...A2...;     ! Use A2

where variables A1 and A2 are split child variables of A.

Because A1 and A2 are independent, the following is also an equivalent fragment:

    A1 = ...;         ! Define A1
    A2 = ...;         ! Define A2
    B = ...A1...;     ! Use A1
    C = ...A2...;     ! Use A2

Here, we see that the value of A2 is assigned while the value of A1 is still alive. That is, the split children of a single variable have overlapping lifetimes.

This example illustrates that split lifetime optimization is possible even in simple straight-line code. Moreover, other optimizations can create opportunities for split lifetime optimization that may not be apparent from casual examination of the original source. In particular, loop unrolling (in which the body of a loop is replicated several times in a row) can create loop bodies for which split lifetime optimization is feasible and desirable.

Variables of Interest

Our implementation deals only with scalar variables and parameters. This includes Alpha's extended precision floating-point (128-bit X_Floating) variables as well as variables of any of the complex types (see Sites¹⁶). These latter variables are referred to as two-part variables because each requires two registers to hold its value.

Currency Definition

The value of a variable in an optimized program is current with respect to a given position in the source program if the variable holds the value that would be expected in an unoptimized version of the program. Several kinds of optimization can lead to noncurrent variables. Consider the currency example in Figure 1.

    Line  Unoptimized                                  Optimized
    1     A = ...;      ! Define A                     A = ...;
    2     B = ...A...;  ! Use A                        B = ...A...;
    3     C = ...;      ! C does not depend on A       A = ...;
    4     A = ...;      ! Define A                     C = ...;
    5     D = ...A...;  ! Use second definition of A   D = ...A...;

Figure 1
Currency Example 1

As shown in Figure 1, the optimizing compiler has chosen to change the order of operations so that line 4 is executed prior to line 3. Now suppose that execution has stopped at the instruction in line 3 of the unoptimized code, the line that assigns a value to variable C. Given a request to display (print) the value of A, a traditional debugger will display whatever value happens to be contained in the location of A, which here, in the optimized code, happens to be the result of the second assignment to A. This displayed value of A is a correct value, but it is not the expected value that should be displayed at line 3. This scenario might easily mislead a user into a frustrating and fruitless attempt to determine how the assignment in line 1 is computing and assigning the wrong value. The problem occurs because the compiler has moved the second assignment so that it is early relative to line 3.

Another currency example can be seen in the fragment (taken from Copperman) that appears in Figure 2. In this case, the optimizing compiler has chosen to omit the second assignment to variable A and to assign that value directly into the actual parameter location used for the call of routine FOO. Suppose that the debugger is stopped at the call of routine FOO. Given a request to display A, a traditional debugger is likely to display the result of the first assignment to A. Again, this value is an actual value of A, but it is not the expected value.

Alternatively, it is possible that prior to reaching the call, the optimizing compiler has decided to reuse the
location that originally held the first value of A for another purpose. In this case, no value of A is available to display at the call of routine FOO.

    Line  Unoptimized                              Optimized
    1     A = expression1;                         A = expression1;
    2     B = ...A...;      ! Use 1st def. of A    B = ...A...;
    3     A = expression2;
    4     FOO(A);           ! Use 2nd def. of A    FOO(expression2);

Figure 2
Currency Example 2

Finally, consider the example shown in Figure 3, which illustrates that the currency of a variable is not a property that is invariant over time. Suppose that execution is stopped at line 5, inside the loop. In this case, A is not current during the first time through the loop body because the actual value comes from line 3 (moved from inside the loop); it should come from line 1. On subsequent times through the loop, the value from line 3 is the expected value, and the value of A is current.

[Listing with columns Line (1 through 7), Unoptimized, and Optimized: an assignment to A that is loop invariant is moved out of a while loop]

Figure 3
Currency Example 3

As discussed earlier, most approaches to currency determination involve making certain kinds of flow graph and compiler optimization information available to the debugger so that it can report when a displayed value is not current. However, we wanted to avoid adding major new kinds of analysis capability to DIGITAL's debuggers.

More fundamentally, as the degree of optimization increases, the notion of current position in the program itself becomes increasingly ambiguous. Even when the particular instruction at which execution is pending can be clearly and unequivocally related to a particular source location, this location is not automatically the best one to use for currency determination. Nevertheless, the source location (or set of locations) where a displayed value was assigned can be reliably reported without needing to establish the current position.

Accordingly, we use an approach different from those considered in the literature. We use a straightforward flow analysis formulation to determine what locations hold values of user variables at any given point in the program and combine this with the set of definition locations that provide those values. Because there may be more than one source location, the user is given the basic information to determine where in the source the value of a variable may have originated. Consequently, the user can determine whether the value displayed is appropriate for his or her purpose.

Compiler Processing

A compiler performs most split lifetime analysis on a routine-by-routine basis. A preliminary walk over the entire symbol table identifies the variable symbols that are of interest for further analysis. Then, for each routine, the compiler performs the following steps:

• Code cell prepass

• Flow graph construction

• Basic block processing

• Parameter processing

• Backward propagation

• Forward propagation

• Information promotion and cleanup

After the compiler completes this processing for all routines, a symbol table postwalk performs final cleanup tasks. The following contains a brief discussion of these steps.

In this summary, we highlight only the main characteristics of general interest. In particular, we assume that each location, such as a register, is independent of all other locations. This assumption is not appropriate to locations on the stack because variables of different sizes may overlay each other. The complexity of dealing with overlapping allocations is beyond the scope of this paper.

Of special importance in this processing is the fact that each operand of every instruction includes a base symbol field that refers to the compiler's symbol table entry for the entity that is involved.

Symbol Table Prewalk  The symbol table prewalk identifies the variables of interest for analysis. As discussed, we are interested in scalars corresponding to user variables (not compiler-created temporaries), including Alpha's extended precision floating-point (128-bit X_Floating) and complex values.

DIGITAL's FORTRAN implementations pass parameters using a by-reference mechanism with bind (rather than copy-in/copy-out) semantics. GEM treats the hidden reference value as a variable that is subject to split lifetime optimization. Since the reference variable must be available to effect operations on the logical parameter variable, it follows that both the abstract parameter and its reference value must be treated as interesting variables.

Code Cell Prepass  The code cell prepass performs a single walk over all code cells to determine

• The maximum and minimum offsets in the stack frame that hold any interesting variables

• The highest numbered register that is actually referenced by the code

• Whether the stack frame uses a frame pointer that is separate from the stack pointer

The compiler uses these characteristics to preallocate various working storage areas.
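
The three quantities gathered by the prepass can be sketched in Python as follows. The `Operand` and `CodeCell` classes, their field names, and the dictionary returned by `prepass` are assumptions made for illustration; the paper does not describe GEM's actual code cell layout.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Operand:
    kind: str                           # "reg" or "stack" (hypothetical encoding)
    value: int                          # register number, or byte offset in the frame
    base_symbol: Optional[str] = None   # symbol table entry involved, if any
    base_reg: str = "SP"                # for stack operands: "SP" or "FP"


@dataclass
class CodeCell:
    opcode: str
    operands: List[Operand]


def prepass(cells, interesting):
    """Single walk over all code cells, collecting sizing information."""
    min_off = max_off = None
    highest_reg = -1
    uses_fp = False
    for cell in cells:
        for op in cell.operands:
            if op.kind == "reg":
                highest_reg = max(highest_reg, op.value)
            elif op.kind == "stack":
                uses_fp = uses_fp or op.base_reg == "FP"
                if op.base_symbol in interesting:
                    min_off = op.value if min_off is None else min(min_off, op.value)
                    max_off = op.value if max_off is None else max(max_off, op.value)
    return {"min_offset": min_off, "max_offset": max_off,
            "highest_reg": highest_reg, "uses_frame_pointer": uses_fp}


cells = [
    CodeCell("ldq", [Operand("reg", 3), Operand("stack", 16, "x")]),
    CodeCell("stq", [Operand("reg", 9), Operand("stack", 48, "tmp")]),
    CodeCell("stq", [Operand("reg", 1), Operand("stack", -8, "y", "FP")]),
]
summary = prepass(cells, interesting={"x", "y"})
```

A later phase could use such a summary to size per-register and per-stack-slot tables once, instead of growing them during symbolic execution.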
Flow Graph Construction  A flow graph is built, in which each basic block is a node of the graph.

Basic Block Processing  Basic block processing performs a kind of symbolic execution of the instructions of each block, keeping track of the effect on machine state as execution progresses.

When an instruction operand writes to a location with a base symbol that indicates an interesting variable, the compiler updates the location description to indicate that the variable is now known to reside in that location; this begins a lifetime segment. The instruction that assigned the value is also recorded with the lifetime segment.

If there was previously a known variable in that location, that lifetime segment is ended (even if it was for the same variable). The beginning and ending instructions for that segment are then recorded with the variable in the symbol table.

When an instruction reads an operand with a base symbol that indicates an interesting variable, some more unusual processing applies.

If the variable being read is already known to occupy that location, then no further processing is required. This is the most common case.

If the location already contains some other known variable, then the variable being read is added to the set of variables for that location. This situation can arise when there is an assignment of one variable to another and the register allocator arranges to allocate them both to the same location. As a result, the assignment happens implicitly.

If the location does not contain a known variable but there is a write operation to that location earlier in the same block (a fact that is available from the location description), the prior write is retroactively treated as though it did write that variable at the earlier instruction. This situation can arise when the result of a function call is assigned to a variable and the register allocator arranges to allocate that variable in the register where the call returns its value. The code cell representation for the call contains nothing that indicates a write to the variable; all that is known is that the return value location is written as a result of the call. Only when a later code cell indicates that it is using the value of a known variable from that location can we infer more of what actually happened.

If the location does not contain a known variable and there is no write to that same location earlier in this same basic block, then the defining instruction cannot be immediately determined. A location description is created for the beginning of the basic block indicating that the given variable or set of variables must have been defined in some predecessor block. Of course, the contents known as a result of the read operation can also propagate forward toward the end of the block, just as for any other read or write operation.

Special care is needed to deal with a two-part variable. Such a variable does not become defined until both instructions that assign the value have been encountered. Similarly, any reuse of either of the two locations ends the lifetime segment of the variable as a whole.

At the end of basic block processing, location descriptions specify what is known about the contents of each location as a result of read and write operations that occurred in the block. This description indicates the set of variables that occupy the location, or that the location was last written by some value that is not the value of a user variable, or that the location does not change during execution of the block.
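
The write case and the four read cases described above can be sketched as a small symbolic executor for one basic block. This is an illustrative Python sketch, not GEM's implementation: the instruction tuples, the dictionary used as a location description, and the name `process_block` are all assumptions, and two-part variables are ignored.

```python
def process_block(instrs, interesting):
    """Symbolically execute one basic block.

    instrs: list of (addr, op, location, base_symbol) with op "write" or "read".
    Returns (closed segments, block-entry requirements, final location state).
    """
    state = {}        # location -> {"vars": set, "def": addr or None, "last_write": addr or None}
    segments = []     # (variables, location, begin_addr, end_addr)
    entry_needs = {}  # location -> variables that must be defined in a predecessor

    for addr, op, loc, sym in instrs:
        var = sym if sym in interesting else None
        st = state.get(loc)
        if op == "write":
            # Any previously known contents end here, even for the same variable.
            if st and st["vars"]:
                segments.append((frozenset(st["vars"]), loc, st["def"], addr))
            state[loc] = {"vars": {var} if var else set(), "def": addr, "last_write": addr}
        elif op == "read" and var:
            if st and var in st["vars"]:
                continue                      # most common case: already known
            if st and st["vars"]:
                st["vars"].add(var)           # implicit assignment: shared location
                if st["def"] is None:
                    entry_needs.setdefault(loc, set()).add(var)
            elif st and st["last_write"] is not None:
                st["vars"].add(var)           # earlier write retroactively defines var
            else:
                entry_needs.setdefault(loc, set()).add(var)
                state[loc] = {"vars": {var}, "def": None, "last_write": None}
    return segments, entry_needs, state


block = [
    (0, "write", "R0", None),    # call returns its value in R0; no variable named
    (4, "read", "R0", "x"),      # later use reveals that R0 held x since address 0
    (8, "write", "R0", "y"),     # ends the segment for x, begins one for y
    (12, "read", "R1", "z"),     # z must have been defined in a predecessor block
]
segments, entry_needs, state = process_block(block, {"x", "y", "z"})
```

Here the segment for x is recorded as running from address 0 to address 8, and the requirement on R1 is what backward propagation would later try to resolve.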
Parameter Processing  The compiler models parameters as locations that are defined with the contents of a known variable at the entry point of a routine.


Backward Propagation  Backward propagation iterates over the flow graph and uses the locations with known contents at the beginning of a block to work backward to predecessor blocks looking for instructions that write to that location. For each variable in each input location, any such prior write instruction is retroactively made to look like a definition of the variable. Note that this propagation is not a flow algorithm because no convergence criterion is involved; it is simply a kind of spanning walk.
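
A spanning walk of this kind might look like the following Python sketch. The predecessor map, the `last_write` table (block to location to address of the last write in that block), and the function name are hypothetical; the sketch handles one location and one variable at a time.

```python
def propagate_backward(preds, last_write, block, loc, var):
    """Walk predecessors of `block` looking for writes to `loc`.

    Each write found is reported as a definition of `var`. A path stops at
    the first block that writes the location; the visited set guarantees
    termination on loops without any convergence test.
    """
    found = set()
    visited = set()
    stack = list(preds.get(block, ()))
    while stack:
        b = stack.pop()
        if b in visited:
            continue
        visited.add(b)
        addr = last_write.get(b, {}).get(loc)
        if addr is not None:
            found.add((var, addr))          # retroactive definition of var
        else:
            stack.extend(preds.get(b, ()))  # keep walking backward
    return found


preds = {"B3": ["B1", "B2"], "B2": ["B0", "B2"], "B1": ["B0"], "B0": []}
last_write = {"B1": {"R0": 16}, "B0": {"R0": 4}}
definitions = propagate_backward(preds, last_write, "B3", "R0", "x")
```

In this example the walk finds the write at address 16 directly in B1, and the write at address 4 by passing through B2, which does not touch R0.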
Forward Propagation  Forward propagation iterates over the flow graph and uses the locations with known contents at the end of each block to work forward to successor blocks to provide known contents at the beginning of other blocks. This is a classic "reaching definitions" flow algorithm, in which the input state of a location for a block is the intersection of the known contents from the predecessors.

In our case, the compiler also propagates definition points, which are the addresses of the instructions that begin the lifetime segments. For those variables that are known to occupy a location, the set of definitions is the union of all the definitions that flow into that location.
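
The combination of intersection (for contents) and union (for definition points) can be sketched for a single location as follows. This is a simplified Python illustration under stated assumptions: the names `meet`, `propagate_forward`, and `gen` are hypothetical, the state of the location is a mapping from variable to its set of definition points, and listing line numbers stand in for instruction addresses.

```python
def meet(states):
    """Intersect the known variables; union their definition points."""
    states = [s for s in states if s is not None]   # None: not yet computed
    if not states:
        return None
    common = set(states[0])
    for s in states[1:]:
        common &= set(s)
    return {v: set().union(*(s[v] for s in states)) for v in common}


def propagate_forward(blocks, preds, gen, entry):
    """Iterate to a fixed point for one location.

    gen[b] is the state block b leaves in the location if b writes it;
    blocks absent from gen pass their input state through unchanged.
    """
    state_in = {b: None for b in blocks}
    state_out = {b: None for b in blocks}
    changed = True
    while changed:
        changed = False
        for b in blocks:
            new_in = {} if b == entry else meet([state_out[p] for p in preds.get(b, ())])
            new_out = gen[b] if b in gen else new_in
            if new_in != state_in[b] or new_out != state_out[b]:
                state_in[b], state_out[b] = new_in, new_out
                changed = True
    return state_in


# An if-then-else shape: B0 assigns j, the "then" block B1 reassigns it,
# the "else" block B2 leaves it alone, and both paths join at B3.
blocks = ["B0", "B1", "B2", "B3"]
preds = {"B1": ["B0"], "B2": ["B0"], "B3": ["B1", "B2"]}
gen_j = {"B0": {"j": {390}}, "B1": {"j": {394}}}
at_join = propagate_forward(blocks, preds, gen_j, entry="B0")["B3"]
```

At the join, j is still known to occupy the location because it does so on both incoming paths, and two definition points are reported because either one may have supplied the value.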

Information Promotion and Cleanup  The final step of compiler processing is to combine information for adjacent blocks where possible. This action saves space in the debugging symbol table but does not affect the accuracy of the description. Descriptions for by-reference bind parameters are next merged with the descriptions for the associated reference variables. Finally, lifetime segment information not already associated with symbol table entries is copied back.

Object File Representation

The object file debugging symbol table representation for split lifetime variables is actually quite simple. Instead of a single address for a variable, there is a sequence of lifetime segment descriptions. Each lifetime segment consists of

• The range of addresses over which the child location applies

• The location (in a register, at a certain offset in the current stack frame, indirect through a register or stack location, etc.)

• The set of addresses that provide definitions for this lifetime segment

By convention, the last segment in the sequence can have the address range 0 to FFFFFFFF (hex). This address range is used for a static variable, for example in a FORTRAN COMMON block, that has a default allocation that applies whenever no active children exist.

Debugger Processing

Name resolution, that is, binding a textual name to the appropriate entry in the debug symbol table, is in no way affected by whether or not a variable has split lifetime segments. After the symbol table entry is found, any sequence of lifetime segments is searched for one that includes the current point of execution indicated by the program counter (PC). If found, the location of the value is taken from that segment. Otherwise, the value of the variable is not available.

Usage Example

To illustrate how a user sees the results of this processing, consider the small C program in Figure 4. Note that the numbers in the left column are listing line numbers. When DOCT8 is compiled, linked, and executed under debugger control, the dialogue shown in Figure 5 appears. The figure also includes interpretive comments.

Known Limitations

The following limitations apply to the existing split lifetime support.

Multiple Active Split Children  While the compiler analysis correctly determines multiple active split child variables and the debug symbol table correctly describes them, OpenVMS DEBUG does not currently support multiple active child variables. When searching a symbol's lifetime segments for one that includes the current PC, the first match is taken as the only match.

Two-part Variables  Support for two-part variables (those occupying two registers) assumes that a complete definition will occur within a single basic block.
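
The lifetime segment search described under Debugger Processing, including the first-match behavior noted under Multiple Active Split Children, can be sketched in Python as follows. The `LifetimeSegment` class, its field names, the inclusive treatment of the address range, and the location strings are assumptions for illustration; they do not reflect the actual OpenVMS debug symbol table encoding.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class LifetimeSegment:
    start: int                   # first address at which the location applies
    end: int                     # last address (treated as inclusive here)
    location: str                # e.g. "R3" or "SP+16"; free-form in this sketch
    definitions: FrozenSet[int]  # addresses of the instructions providing the value


def find_segment(segments: List[LifetimeSegment], pc: int) -> Optional[LifetimeSegment]:
    """Return the first segment whose address range includes pc, or None."""
    for seg in segments:
        if seg.start <= pc <= seg.end:
            return seg           # first match is taken as the only match
    return None


segments_of_a = [
    LifetimeSegment(0x100, 0x13F, "R3", frozenset({0x100})),
    LifetimeSegment(0x120, 0x17F, "R5", frozenset({0x11C})),    # overlaps the first child
    LifetimeSegment(0x0, 0xFFFFFFFF, "static", frozenset()),    # default allocation, last
]
hit = find_segment(segments_of_a, 0x130)
fallback = find_segment(segments_of_a, 0x200)
```

At address 0x130 both children are active, but only the R3 segment is reported, which mirrors the limitation described above. Placing the catch-all range last lets it act as the default only when no child segment matches.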


    385   doct8 () {
    386
    387       int i, j, k;
    388
    389       i = 1;
    390       j = 2;
    391       k = 3;
    392
    393       if (foo(i)) {
    394           j = 17;
    395           }
    396       else {
    397           k = 18;
    398           }
    399
    400       printf("%d, %d, %d\n", i, j, k);
    401
    402       }

Figure 4
C Example Routine DOCT8 (Source with Listing Line Numbers)

    $ run doct8

             OpenVMS Alpha Debug64 Version T7.2-001

    %I, language is C, module set to DOCT8

    DBG> step/into
    stepped to DOCT8\doct8\%LINE 391
       391:     k = 3;
    DBG> examine i, j, k
    %W, entity 'i' was not allocated in memory (was optimized away)
    %W, entity 'j' does not have a value at the current PC
    %W, entity 'k' does not have a value at the current PC

Note the difference in the message for variable i compared to the messages for variables j and k. We see that variable i was not allocated in memory (registers or otherwise), so there is no point in ever trying to examine its value again. Variables j and k, however, do not have a value "at the current PC." Somewhere later in the program they will have a value, but not here.

The dialogue continues as follows:

    DBG> step 6
    stepped to DOCT8\doct8\%LINE 391
       391:     k = 3;
    DBG> step
    stepped to DOCT8\doct8\%LINE 393
       393:     if (foo(i)) {
    DBG> examine j, k
    %W, entity 'j' does not have a value at the current PC
    DOCT8\doct8\k:  3
        value defined at DOCT8\doct8\%LINE 391

Here we see that j is still undefined but k now has a value, namely 3, which was assigned at line 391. The source indicates that j was assigned a value at line 390, before the assignment to k, but j's assignment has yet to occur.

Skipping ahead in the dialogue to the print statement at line 400, we see the following:

    DBG> set break %line 400
    DBG> go
    break at DOCT8\doct8\%LINE 400
       400:     printf("%d, %d, %d\n", i, j, k);
    DBG> examine j
    DOCT8\doct8\j:  2
        value defined at DOCT8\doct8\%LINE 390
        value defined at DOCT8\doct8\%LINE 394
    DBG> examine k
    DOCT8\doct8\k:  18
        value defined at DOCT8\doct8\%LINE 397+4
        value defined at DOCT8\doct8\%LINE 391

This portion of the message shows that more than one definition location is given for both j and k. Which of each pair applies depends on which path was taken in the if statement. If a variable has an apparently inappropriate value, this mechanism provides a means to take a closer look at those places, and only those places, from which that value might have come.

Figure 5
Dialogue Resulting from Running DOCT8

That is, at the end of a basic block, if the second part of a definition is missing, then the initial part is discarded and forgotten.

Consider the following FORTRAN fragment:

    COMPLEX X, Y
    X = ...
    Y = X + (1.0, 0.0)

Suppose that the last use of variable X occurs in the assignment to variable Y so that X and Y can be and are allocated in the same location, in particular, the same register pair. In this case, the definition of Y requires only one instruction, which adds 1.0 to the real part of the location shared by X and Y. Because there is no second instruction to indicate completion of the definition, the definition will be lost by our implementation.
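
The way such a definition is lost can be illustrated with a short Python sketch. It assumes a deliberately simplified model in which a two-part definition is complete only when writes to both registers of the pair have been seen within one block; the function name, the tuple layout, and the register names are hypothetical.

```python
def two_part_definitions(block_writes, two_part_vars):
    """Collect complete two-part definitions within one basic block.

    block_writes: list of (addr, register, variable) in execution order.
    A definition is complete once a second write, to the other register
    of the pair, is seen. Anything still pending at the end of the block
    is discarded, as in the limitation described above.
    """
    pending = {}     # variable -> (addr, register) of the first part seen
    complete = []    # (variable, addr_of_first_part, addr_of_second_part)
    for addr, reg, var in block_writes:
        if var not in two_part_vars:
            continue
        if var in pending and pending[var][1] != reg:
            first_addr, _ = pending.pop(var)
            complete.append((var, first_addr, addr))
        else:
            pending[var] = (addr, reg)
    return complete  # pending entries are dropped here


writes = [
    (0, "F2", "X"),  # real part of X
    (4, "F3", "X"),  # imaginary part of X: definition of X is complete
    (8, "F2", "Y"),  # Y = X + (1.0, 0.0): only the real part is written
]
found = two_part_definitions(writes, {"X", "Y"})
```

Only X is reported as defined. The single instruction that produces Y never gets a partner, so its definition is dropped at the end of the block.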


Semantic Stepping

A major problem with stepping by line through optimized code is that the apparent source program location "bounces" back and forth, with the same line often appearing again and again. In large part this bouncing is due to a compiler optimization called code scheduling, in which instructions that arise from the same source line are scheduled, that is, reordered and intermixed with other instructions, for better execution performance.

OpenVMS DEBUG, like most debuggers, interprets the STEP/LINE (step by line) command to mean that the program should execute until the line number changes. Line numbers change more frequently in scheduled code than in unoptimized code.

For example, in sample programs from the SPEC95 Benchmark Suite, the average number of instructions in sequence that share the same line number is typically between 2 and 3, and typically 50 to 70 percent of those sequences consist of just 1 instruction! In contrast, if only instruction-level scheduling is disabled, then the average number of instructions is between 4 and 6, with 20 to 30 percent consisting of one instruction. In a compilation with no optimization, there are 8 to 12 instructions in a sequence, with roughly 5 percent consisting of a single instruction.

A second problem with stepping by line through an optimized program is that, because of the behavior of revisiting the same line again and again, the user is never quite sure when the line has finished executing. It is unclear when an assignment actually occurs or a control flow decision is about to be made.

In unoptimized code, when a user requests a breakpoint on a certain line, the user expects execution to stop just before that line, hence before the line is carried out. In optimized code, however, there is no well-defined location that is "before the line is carried out," because the code for that line is typically scattered about, intermixed, and even combined with the code for various other lines. It is usually possible, however, to identify the instruction that actually carries out the effect of the line.

Semantic Event Concept

We introduce a new kind of stepping mode called semantic stepping to address these problems. Semantic stepping allows the program to execute up to, but not including, an instruction that causes a semantic effect. Instructions that cause semantic effects are instructions that

• Assign a value to a user variable

• Make a control flow decision

• Make a routine call

Not all such instructions are appropriate, however. We start with an initial set of candidate instructions and refine it. The following sections describe the heuristics that are currently in use.

Assignment  The candidates for assignment events are the instructions that assign a value to a variable (or to one of its split children). The second instruction in an assignment to a two-part variable is excluded. Stopping between the two assignments is inadvisable because at that point the variable no longer has the complete old state and does not yet have the complete new state.

Branches  There are two kinds of branch: unconditional and conditional. An unconditional branch may have a known destination or an unknown destination.

Unconditional branches with known destinations most often arise as part of some larger semantic construct such as an if-then-else or a loop. For example, code for an if-then-else construct generally has an implicit join that occurs at the end of the statement. The join takes the form of a jump from the end of one alternative to the location just past the last instruction of the other (which has no explicit jump and falls through into the next statement). This jump turns the inherently symmetric join at the source level into an asymmetric construction at the code stream level.

Unconditional jumps almost never define interesting semantic events; some related instruction usually provides a more useful event point, such as the termination test in the case of a loop. One exception is a simple goto statement, but these are very often optimized away in any case. Consequently, unconditional branches with known destinations are not treated as semantic events.

Unconditional branches with unknown destinations are really conditional branches: they arise from constructs such as a C switch statement implemented as a table dispatch or a FORTRAN assigned GO TO statement. These branches definitely are interesting points at which to allow user interaction before the new direction is taken. Thus, the compiler retains unconditional branches with unknown destinations as semantic events.

Similarly, in general, conditional branches to known destinations are important semantic event points. Often more than one branch instruction is generated for a single high-level source construct, for example, a decision tree of tests and branches used to implement a small C switch statement. In this case, only the first in the execution sequence is used as the semantic event point.

Calls  Most calls are visible to a user and constitute semantically interesting events. However, calls to some run-time library routines are usually not interesting because these calls are perceived to be merely software implementations of primitive operations, such as integer division in the case of the Alpha architecture. GEM internally marks calls to all its own run-time support routines as not semantically interesting. Compiler front ends accomplish this where appropriate for their own set of run-time support routines by setting a flag on the associated entry symbol node.

Compiler Processing

In most cases, the compiler can identify semantic event locations by simple predicates on each instruction. The exceptions are

• The second of the two instructions that assign values to a two-part variable is identified during split lifetime analysis.

• Conditional branches that are part of a larger construct are identified during a simple pass over the flow graph.

already marks branches with the semantic event attribute, if appropriate. Also unlike the traditional stepping-by-line algorithm, the new algorithm does not consider the source line number.

Visible Effect

With semantic stepping, a user's perception of forward progress through the code is no longer dominated by the side effects of code scheduling, that is, stopping every few instructions regardless of what is happening. Rather, this perception is much more closely related to the actual semantic behavior, that is, stopping every statement or so, independent of how many instructions from disparate statements may have executed.

Note that jumping forward and backward in the source may still occur, for example, when code motions have changed the order in which semantic actions are performed. Nothing about semantic event handling attempts to hide such reordering.

lnlining
Object Module Representation

The object module debugging semantic event repre
sentation contains a sequence of address and event
kind pairs, in ascend i ng address order.

is i n lined into rou tine CALLER and the current point
of execution is within INN ER, should the debugger

Debugger Processing

Semantic stepping i n the debugger i nvolves a new
algorithm for determining the range of instructions to
execute. This algorithm is built on a debugger pri m i 
tive mechanism that supports full-speed execution of
user instructions within a given range of addresses but
u·aps any transter out of that range, whether by reach 
i ng the end or b y executing any kind of branch o r call
instruction.
Semantic stepping works as follows. Starti ng with
the current program cou nter address, Open VMS
DEBUG rinds the next higher address that is a seman
tic event point; this i s the
OpenVMS

DEBUG

Procedure call inlining can be confusing when using a
traditional debugger. For example, if routine INNER

target event point.

executes i nstructions

in

the

add ress range that starts at the address of the c urrent
i n s tructi o n and ends at the i nstru ction that precedes
the target event point. The range execution terminates
in the following two cases:
l . If the next instruction to execute is the target event

point, then execution reached t he end of target
range and the step operation is complete.

2. If the next insu·uction to execute is not the target
event point, then the next address becomes the cur
rent address and the process repeats ( silently).

report the current source location as at a location in
the caller routine or in the called routine? Neither is
completely satisfactory by itself. I f the current line is
reported as at the location within INNER, then that
information will appear to conflict with i n formation
from a call stack traceback, which would not show
routi ne INN ER. If the current l i ne is reported as
though in CALLER, then relevant location informa
tion from the callee will be obscured or suppressed .
Worse yet, i n the case of nested inlining, potentially
crucial i n formation about tl1e i n termediate call path
may not be available in any torm .
The problem of dealing with inlining was solved
long ago by Zellweger'-at least the topic has not
been treated again since. Zellweger's approach adds
additional information to

an

otherwise traditional table

that maps fro m instruction addresses to the corre
sponding source line numbers. Our approach is d i ffer
ent: it i ncludes additional i n formation in the scope
description of the debugging symbol table.
A key u nderpinning for inline s u pport i s the ability
to accurately describe scopes that consist of m ultiple
discontiguous ranges of instr u ction addresses, rather
than the tradi tional single range. This capability is
q u ite independent of inlining as such. However,

Note that, u n l i ke the algorithm that determines the

because code from an inli ned rou tine is freely sch e d 

range for stepping by line, the new algoritl1m does not

u led with other code from t h e cal ling context, dealing

requ i re an explicit test for the kind of instruction, in

accurately with the resul ting disjoint scopes 1s an

particular, to test if it is a kind of branch. The compiler

essential buiJd ing block for effective support.

Digi tal Technical Journal

Vol . 1 0 No. 1

1 998

91
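The two-case stepping procedure described under Debugger Processing in the semantic stepping discussion can be sketched in a few lines. This is only a toy model under stated assumptions, not OpenVMS DEBUG code: the `program` successor map stands in for real execution, and the names `next_event`, `run_range`, and `semantic_step` are hypothetical. The only input taken from the text is the ascending list of semantic event addresses.

```python
from bisect import bisect_right

def next_event(event_addrs, pc):
    """Return the lowest semantic event address above pc, or None."""
    i = bisect_right(event_addrs, pc)
    return event_addrs[i] if i < len(event_addrs) else None

def run_range(program, start, target):
    """Toy stand-in for the range-execution primitive (an assumption).

    Runs from `start` while control stays inside [start, target) and
    returns the address of the next instruction to execute once control
    leaves that range, by falling through to `target` or by a transfer.
    """
    pc = start
    while start <= pc < target:
        pc = program[pc]          # address executed after the one at pc
    return pc

def semantic_step(program, event_addrs, pc):
    """Return the address at which a semantic step from pc stops."""
    while True:
        target = next_event(event_addrs, pc)
        if target is None:
            return None           # no later event point in this toy model
        nxt = run_range(program, pc, target)
        if nxt == target:
            return nxt            # case 1: reached the target event point
        pc = nxt                  # case 2: left the range; repeat silently

# Hypothetical layout: the instruction at 8 branches to 20, so the event
# point at 12 is never reached on this path.
program = {0: 4, 4: 8, 8: 20, 12: 16, 16: 20, 20: 24, 24: 28}
events = [12, 24]                 # ascending address order
stop = semantic_step(program, events, 0)
print(stop)
```

Note that the loop never inspects the kind of instruction or the source line number; it relies only on the event table and the range primitive, which matches the contrast the text draws with stepping by line.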

Goals for Debugger Support

Our overall goal is to support debugging of inlined code with expected
behavior, that is, as though the inlining has not occurred. More
specifically, we seek to provide the ability to

• Report the source location corresponding to the current position in the
  code
• Display parameters and local variables of an inlined routine
• Show a traceback that includes call frames corresponding to inlined
  routines
• Set a breakpoint at a given routine entry
• Set a breakpoint at a given line number (from within an inlined routine)
• Call an inlined routine

We have achieved these goals to a substantial extent.

GEM Locators

Before describing the mechanisms for inlining, we introduce the GEM notion
of a locator. A locator describes a place in the source text. The simplest
kinds of locator describe a point in the source, including the name of the
file, the line within that file, and the column within that line; they even
describe the point at which that file was included by another file (as for
a C or C++ #include directive), if applicable.

A crucial characteristic of locators is that they are all of a uniform
fixed size that is no larger than an integer or pointer. (How this is
achieved is beyond the scope of this paper.) In particular, locators are
small enough that every tuple node in the intermediate language (IL) and
every code cell in the generated code stream contains one. Moreover, GEM as
a whole is quite meticulous about maintaining and propagating high-quality
locator information throughout its optimization and code generation.

An additional kind of locator was introduced for inlining support. This
inline locator encodes a pair that consists of a locator (which may also be
an inline locator) and the address of an associated scope node in the GEM
symbol table.

Compiler Processing

Debugging optimized code support for inlining generally builds on and is a
minor enhancement of the GEM inlining mechanism. Inlining occurs during an
early part of the GEM optimizer phase.

Inlining is implemented in GEM as follows:

• Within the scope that contains the call site, an inline scope block is
  introduced. This scope represents the result of the inlining operation.
  It is populated with local variable declarations that correspond
  one-to-one with the formal parameters of the inlined routine.
• The actual arguments of the call are transformed into assignments that
  initialize the values of the surrogate parameter variables.
• The inline scope is also made to contain a body scope, which is a copy of
  the body of the inlined routine, including a copy of its local variables.
• The original call is replaced with a jump to a copy of the IL for the
  body of the routine, in which references to declarations or parameters of
  the routine are replaced with references to their corresponding copied
  declarations. In addition, returns from the routine are replaced with
  jumps back to the tuple following the original call.
• Similar "boundary adjustments" are made to deal with function results,
  output parameters, choice of entry point (when there is more than one, as
  might occur for FORTRAN alternate entry statements), etc. (The
  bookkeeping is a bit intricate, but it is conceptually straightforward.)

The calling routine, which now incorporates a copy of the inlined routine,
is then further processed as a normal (though larger) routine.

Inlining Annotations for Debugging

The main changes introduced for debugging optimized code support are as
follows.

• The newly created inline scope block is annotated with additional
  information, namely,
  - A pointer to the routine declaration being inlined.
  - The locator from the call that is replaced. In a simple call with no
    arguments, there may be nothing left in the IL from the original call
    after inlining is completed; this locator captures the original call
    location for possible later use, for example, as a supplement to the
    information that maps instruction addresses to source line numbers.
• As the code list of the original inlined routine is copied, each locator
  from the original is replaced by a new inline locator that records
  - The original locator.
  - The newly created inline scope into which it is being copied.

As a result of these steps, every inlined instruction can be related back
to the scope into which it was inlined and hence to the routine from which
it was inlined, regardless of how it may be modified or moved as a result
of subsequent optimization.

Note that these additional steps are an exception to the general assertion
that debugging optimized code support occurs after code generation and just
prior to object code emission. These steps in no way influence the
generated code, only the debugging symbol table that is output.
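To make the inline locator pairing concrete, the following sketch models an inline locator as a (locator, scope) pair and unwinds a nested one into the chain of routines it implies. It is a minimal illustration under assumptions: real GEM locators are fixed-size encodings rather than objects, and the class names, the `virtual_frames` helper, the file name, and the line numbers are all hypothetical.

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Locator:
    file: str
    line: int

@dataclass(frozen=True)
class InlineScope:
    routine: str          # routine declaration that was inlined
    call_site: Locator    # locator of the call that was replaced

@dataclass(frozen=True)
class InlineLocator:
    original: Union[Locator, "InlineLocator"]   # may itself be inline
    scope: InlineScope                          # scope it was copied into

def virtual_frames(loc, outer_routine):
    """Return (routine, line) pairs, innermost first, for one instruction."""
    scopes = []
    while isinstance(loc, InlineLocator):       # peel from outermost inward
        scopes.append(loc.scope)
        loc = loc.original
    frames, line = [], loc.line
    for scope in reversed(scopes):              # innermost inline scope first
        frames.append((scope.routine, line))
        line = scope.call_site.line             # where that routine was called
    frames.append((outer_routine, line))
    return frames

# Illustrative nesting: B is inlined into A, and A is inlined into MAIN.
in_b = InlineScope("B", Locator("demo.for", 9))
in_a = InlineScope("A", Locator("demo.for", 4))
loc = InlineLocator(InlineLocator(Locator("demo.for", 15), in_b), in_a)
frames = virtual_frames(loc, "MAIN")
print(frames)
```

Because each copy wraps the previous locator rather than replacing it, the full intermediate call path survives nested inlining, which is the property the text relies on.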

Prologue and Epilogue Sets

The prologue of a routine generally consists of those instructions at the
beginning of the routine that establish the routine stack frame (for
example, allocate stack and save the return address and other preserved
registers) and that must be executed before a debugger can usefully
interpret the state of the routine. For this reason, setting a breakpoint
at the beginning of a routine is usually (transparently) implemented by
setting a breakpoint after the prologue of that routine is completed.

Conversely, the epilogue of a routine consists of those instructions at the
end of a routine that tear down the stack frame, reestablish the caller's
context, and make the return value, if any, available to the caller. For
this reason, stopping at the end of a routine is usually (transparently)
implemented by setting a breakpoint before the epilogue of that routine
begins.

One benefit of inlining is that most prologue and epilogue code is avoided;
however, there may still be some scope management associated with scope
entry and exit. Also, some programming language-related environment
management associated with the scope may exist and should be treated in a
manner analogous to traditional prologue and epilogue code. The problem is
how to identify it, because most of the traditional compiler code
generation hooks do not apply.

The model we chose takes advantage of the semantic event information that
we describe in the section Semantic Stepping. In particular, we define the
first semantic event that can be executed within the inlined routine to be
the end of the prologue. For reasons discussed later, we define the last
instruction (not the last semantic event) of the inlined code as the
beginning of the epilogue. As a result of unrelated optimization effects,
each of these may turn out to be a set of instructions. Determination of
inline prologue and epilogue sets occurs after split lifetime and semantic
event determination is completed so that the results of those analyses can
be used.

To determine the set of prologue instructions, for each inline instance,
GEM starts with every possible entry block and scans forward through the
flow graph looking for the first semantic event instruction that can be
reached from that entry. The set of such instructions constitutes the
prologue set for that instance of the inlined routine.

This is a spanning walk forward from the routine entry (or entries) that
stops either when a block is found to contain an instruction from the given
inline instance or when the block has already been encountered (each block
is considered at most once). Note that there may be execution paths that
include one or more instructions from an inlining, none of which is a
semantic event instruction.

The set of epilogue instructions is determined using an inverse of the
prologue algorithm. The process starts with each possible exit block and
scans backward through the flow graph looking for the last instruction
(that is, the instruction closest to the routine exit) of an inline
instance that can reach an exit.

Note that prologue and epilogue sets are not strictly symmetric: prologue
sets consist of only instructions that are also semantic events, whereas
epilogue sets include instructions that may or may not be semantic events.

Object Module Representation

To describe any inlining that may have occurred during compilation, we
include three new kinds of information in the debugging symbol table.

If the instructions contained in a scope do not form a single contiguous
range, then the description of the scope is augmented with a discontiguous
range description. This description consists of a sequence of ranges. (The
scope itself indicates the traditional approximate range description to
provide backward compatibility with older versions of OpenVMS DEBUG.) This
augmented description applies to all scopes, whether or not they are the
result of inlining.

For a scope that results from inlining a call, the description of the scope
is augmented with a record that refers to the routine that was inlined as
well as the line number of the call. Each scope also contains two entries
that consist of the sequence of prologue and epilogue addresses,
respectively.

Backward compatibility is fully maintained. An older version of OpenVMS
DEBUG that does not recognize the new kinds of information will simply
ignore it.

Debugger Processing

As the debugger reads the debugging symbol table of a module, it constructs
a list of the inlined instances for each routine. This process makes it
possible to find all instances of a given routine. Note, however, that if
every call of the routine is expanded inline and the routine cannot
otherwise be called from outside that module, then GEM does not create a
noninlined (closed-form) version of the routine.

Report Source Location

It is a simple process to report the source location that corresponds to
the current code address. When stopped inside the code resulting from an
inlined routine, the program counter maps directly to a source line within
the inlined routine.

Display Parameters and Local Variables

As is the case for a noninlined routine, the scope description for an
inlined routine contains copies of the parameters and the local variables.
No special processing is required to perform name binding for such
entities.

Include Inlined Calls in Traceback

The debugger presents inlined routines as if they are real routine calls. A
stack frame whose current code address corresponds to an inlined routine
instance is described with two or more virtual stack frames: one or more
for the inlined instance(s) and one for the ultimate caller. (An example is
shown later in Figure 7.)
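The prologue-set determination described under Prologue and Epilogue Sets can be pictured as a forward spanning walk over the flow graph. The sketch below is one reading of that description, not GEM's implementation: the `Instr` and `Block` classes, the block names, and the addresses are hypothetical. It assumes that a path is cut off at the first block holding a semantic event of the given inline instance; the text could also be read as cutting the path at the first block holding any instruction of the instance.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Instr:
    addr: int
    instance: Optional[str]   # inline instance it belongs to, if any
    is_event: bool            # marked as a semantic event

@dataclass
class Block:
    instrs: list
    succs: list = field(default_factory=list)   # names of successor blocks

def prologue_set(blocks, entries, instance):
    """Addresses of the first semantic events of `instance` on each path."""
    seen, result = set(), set()
    work = deque(entries)
    while work:
        name = work.popleft()
        if name in seen:          # each block is considered at most once
            continue
        seen.add(name)
        hit = next((i for i in blocks[name].instrs
                    if i.instance == instance and i.is_event), None)
        if hit is not None:
            result.add(hit.addr)  # this path ends here
        else:
            work.extend(blocks[name].succs)
    return result

# Hypothetical flow graph: code of instance "A#1" was scheduled into two
# different blocks reached from the routine entry.
blocks = {
    "entry": Block([Instr(0x00, None, False)], ["left", "right"]),
    "left":  Block([Instr(0x10, "A#1", False), Instr(0x14, "A#1", True)], ["join"]),
    "right": Block([Instr(0x20, None, True)], ["join"]),
    "join":  Block([Instr(0x30, "A#1", True)], []),
}
pset = prologue_set(blocks, ["entry"], "A#1")
print(sorted(pset))
```

The result is a set rather than a single address, which reflects the remark that optimization may leave an inlined routine with more than one end-of-prologue instruction. An epilogue sketch would walk predecessors from the exit blocks in the same manner.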
Set Breakpoints at Inlined Routine Instances

The strategy for setting breakpoints at inlined routines is based on a
generalization of processing that previously existed for C++ member
functions. Compilation of C++ modules can result in code for a given member
function being compiled every time the class or template definition that
contains the member function is compiled. We refer to all these
compilations as clones. (It is not necessary to distinguish which of them
is the original.) In our generalization, an inlined routine call instance
is treated like a clone. To set a breakpoint at a routine, the debugger
sets breakpoints at all the end-of-prologue addresses of every clone of the
given routine in all the currently active modules.

Set Breakpoints at Inlined Line Number Instances

The strategy for setting breakpoints on line numbers shares some features
of setting breakpoints on routines, with additional complications.
Compiler-reported line numbers on OpenVMS systems are unique across all the
files included in a compilation. It follows that the same file included in
more than one compilation may have different associated line numbers.

To set a breakpoint at a particular line number, that line number needs to
be first normalized relative to the containing file. This normalized line
number value is then compared to normalized line numbers for that same file
that are included in other compilations. (If different versions of the same
named file occur in different compilations, the versions are treated as
unrelated.) The original line number is converted into the set of address
ranges that correspond to it in all modules, taking into account inlining
and cloning.

Call a Routine That Is Inlined

If the compiler creates a closed-form version of a routine, then the
debugger can call that routine independent of whether there may also be
inlined instances of the routine. If no such version of the routine exists,
then the debugger cannot call the routine.

Usage Example

Inlining support has many aspects, but we will illustrate only one: a call
traceback that includes inlined calls. Consider the sample program shown in
Figure 6. This program has four routines: three are combined in a single
file (enabling the GEM FORTRAN compiler to perform inline optimizations),
and the last is in a separate file. To help correlate the lines of code in
these two files with those in Figure 7, we added line numbers to the left
of the code. Note that these numbers are not part of the program.

    Line    File DOCFJ-INLINE-2.FOR

     2      C     Main routine
     3            INTEGER A, C
     4            TYPE *, A(3, C(0))
     5            END
     6      C
     7            FUNCTION A(I, L)
     8            INTEGER A, B
     9            A = B(5, I) + 2*L
    10            RETURN
    11            END
    12      C
    13            FUNCTION B(J, K)
    14            INTEGER B, C
    15            B = C(9) + J + K
    16            END

    Line    File DOCFJ-INLINE-2A.FOR

     2            FUNCTION C(J)
     3            INTEGER C
     4            C = 2*J
     5            RETURN
     6            END

Figure 6
Program to Illustrate Inlining Support

If we compile, link, and run this program using the OpenVMS DEBUG option,
we can step to a place in routine B that is just before the call to routine
C and then request a traceback of the call stack. This dialogue is shown in
Figure 7.

Figure 7 shows that pseudo stack frames are reported for routines A and B,
even though the call of routine B has been inlined into routine A and the
call of routine A has been inlined into the main program. The main
difference from a real stack frame is the extra line that reports that the
"above routine is inlined."

Limitations

In a real stack frame, it is possible to examine (and even deposit into)
the real machine registers, rather than examine the variables that happen
to be allocated in machine registers. In an inlined stack frame, this
operation is not well defined and consequently not supported. In a
noninlined stack frame, these operations are still allowed.

An attractive feature that would round out the expected behavior of inlined
routine calls would be to support stepping into or over the inlined call in
the same way that is possible for noninlined calls. This feature is not
currently supported; execution always steps into the call.
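The routine-breakpoint strategy described under Set Breakpoints at Inlined Routine Instances amounts to collecting end-of-prologue addresses over every instance of the routine. The sketch below shows that collection step only; the `Instance` record, the module names, and the addresses are hypothetical and do not reflect the actual OpenVMS DEBUG symbol table layout.

```python
from dataclasses import dataclass

@dataclass
class Instance:
    routine: str
    kind: str            # "closed-form", "clone", or "inlined"
    prologue_end: tuple  # end-of-prologue addresses (may be several)

def breakpoint_addresses(modules, routine):
    """Collect every end-of-prologue address for `routine` in all modules."""
    addrs = set()
    for instances in modules.values():
        for inst in instances:
            if inst.routine == routine:
                addrs.update(inst.prologue_end)
    return sorted(addrs)

# Hypothetical active modules: B has one closed-form copy and two inlined
# instances, one of which has a two-element prologue set.
modules = {
    "MOD1": [Instance("B", "closed-form", (0x2000,)),
             Instance("B", "inlined", (0x2054, 0x2060))],
    "MOD2": [Instance("B", "inlined", (0x3010,)),
             Instance("C", "closed-form", (0x3100,))],
}
addrs = breakpoint_addresses(modules, "B")
print([hex(a) for a in addrs])
```

Treating inlined instances exactly like clones keeps the user-visible request ("break at B") unchanged while the debugger plants as many physical breakpoints as there are instances.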

[Figure 7 session, summarized from the surviving fragments: under OpenVMS
Alpha DEBUG (version T7.2-001), language FORTRAN, module
DOCFJ-INLINE-2$MAIN, a step/semantic command stops at line 15 of routine B
("B = C(9) + J + K"). A subsequent show calls command lists the module
name, routine name, line, relative PC, and absolute PC for routine B at
line 15, routine A at line 9, and the main program at line 4; the entries
for B and A are each followed by the line "above routine is inlined."]

Figure 7
OpenVMS DEBUG Dialogue to Illustrate Inlining Support

Performance and Resource Usage

We gathered a number of statistics to determine typical resource
requirements for using the enhanced debugging optimized code capability
compared to the traditional practice of debugging unoptimized code. A short
summary of the findings follows.

• All metrics tend to show wide variance from program to program,
  especially small ones.
• Generating traditional debugging symbol information increases the size of
  object modules typically by 50 to 100 percent on the OpenVMS system.
  Executable image sizes show similar but smaller size increases.
• Generating enhanced symbol table information adds about 2 to 5 percent to
  the typical compilation time, although higher percentages have been seen
  for unusually large programs.
• Generating enhanced symbol table information uses significant memory
  during compilation but does not affect the peak memory requirement of a
  compilation.
• Generating enhanced symbol table information further increases the size
  of the symbol table information compared to that for an unoptimized
  compilation. On the OpenVMS system, this adds 100 to 200 percent to the
  debugging symbol table of object modules and perhaps 50 to 100 percent
  for executable images.
• Compiling with full optimization reduces the resulting image size. Total
  net image size increases typically by 50 to 80 percent.

A more detailed presentation of findings follows. Tables 1 through 3
present data collected using production OpenVMS Alpha native compilers
built in December 1996. In developing these results, we used five
combinations of compilation options as follows:

S1: no optimization (noopt), no debugging information (nodebug, nodbgopt)
S2: no optimization (noopt), normal debugging information (debug, nodbgopt)
S4: full (default) optimization (opt), no debugging information (nodebug,
    nodbgopt)
S5: full optimization (opt), normal debugging information only (debug,
    nodbgopt)
S8: full optimization (opt), enhanced debugging information (debug, dbgopt)

Note that the option combination numbering system is historical; we
retained the system to help keep data logs consistent over time.

Compile-time Speed

The incremental compile-time cost of creating enhanced symbol table
information is presented in Table 1 for a sampling of BLISS, C, and FORTRAN
modules. The data in this table can be summarized as follows:

• Traditional debugging (column 1) increases the total compilation time by
  about 1 percent.
• Enhanced debugging (column 2) increases the compilation time by about 4
  percent. The largest component of that time, approximately 3 percent, is
  attributed to the flow analysis involved in handling split lifetime
  variables (column 3).
• Debugging tends to increase as a percentage of time in larger modules,
  which suggests that processing time is slightly nonlinear in program
  size; however, this increase does not seem to be excessive even in very
  large modules.

Compile-time Space

The compile-time memory usage during the creation of enhanced symbol
information is presented in Table 2.

Table 1
Percent of Compilation Time Used to Create/Output Debugging Information

                S2 (noopt, debug,   S8 (opt, debug,   (Split Lifetime
Module          nodbgopt)           dbgopt)           Analysis Only)

BLISS CODE
GEM_AN          0.3%                1.1%              0.7%
GEM_DB          0.9                 1.8               1.3
GEM_DF          0.8                 5.2               4.4
GEM_FB          0.7                 3.5               2.7
GEM_IL_PEEP     0.6                 14.4              13.9

C CODE
C_METRIC        1.5                 5.2               4.1
GRAM            0.5                 2.9               2.2
INTERP          1.2                 4.5               3.2

FORTRAN CODE
MATRIX300X      nm                  nm                nm
NAGL            1.4                 13.0              11.9
SPICE_V07       3.0                 6.4               4.7
WAVEX           2.5                 6.3               4.8

Average         1.2%                4.3%              3.2%
Typical range   (0.5%-1.5%)         (3.0%-7.0%)       (2.0%-5.0%)

Note: "nm" represents "not meaningful," that is, too small to be accurately
measured.

Table 2
Key Dynamic Memory Zone Sizes during BLISS GEM Compilations

              Peak     SYMBOL   EIL      CODE     OM      |  %      %      %
File          Total    ZONE     ZONE     ZONE     ZONE    |  Peak   Larg   EIL

BLISS CODE
GEM_AN         2,507     130       85      184       15   |   6%     8%    18%
GEM_DF        11,305     836    1,672    2,056    1,180   |  10     57     71
GEM_FB         4,694     316      522      457      304   |   6     58     58
GEM_IL_PEEP   40,419   1,606   17,666    4,411   14,143   |  34     80     80

C CODE
C_METRIC       7,381   1,115      494    2,563      167   |   2      6     34
GRAM           3,031      82      815      211      267   |   9     33     33
INTERP         3,563     354      308      688      131   |   4     20     43

FORTRAN CODE
MATRIX300X       934     143      227      101       58   |   6     26     26
NAGL           6,267   1,520    1,791    1,742       68   |  11     38     38
SPICE_V07      6,234   1,051    3,256      885      459   |   7     14     14
WAVEX         12,812   4,676    3,119    3,482       68   |   5     14     22

Average                                                   |   9%    32%    40%

Note: All numbers to the left of the vertical bar are thousands of bytes,
not multiples of 1,024.

Column Key:

Column        Description

Peak Total    The peak dynamic memory allocated in all zones during the
              compilation
SYMBOL ZONE   The zone that holds the GEM symbol table
EIL ZONE      The zone that holds the largest EIL ZONE (used for the
              expanded intermediate representation)
CODE ZONE     The zone that holds the GEM generated code list
OM ZONE       The zone that holds split lifetime and other working data
% Peak        The OM ZONE size as a percentage of the Peak Total size
% Larg        The OM ZONE size as a percentage of the largest single zone
              in the compilation
% EIL         The OM ZONE size as a percentage of the EIL ZONE size
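The three percentage columns of Table 2 are derived from the zone sizes as defined in the column key, and a short calculation makes the definitions concrete. The sketch below recomputes them for the GEM_DF and GEM_FB rows. The function name is hypothetical, rounding to the nearest integer is an assumption, and "largest single zone" is taken here to mean the largest of the SYMBOL, EIL, and CODE zones listed in the table; not every row of the table reproduces exactly under these assumptions.

```python
def om_percentages(peak, symbol, eil, code, om):
    """Return (% Peak, % Larg, % EIL) for one row, sizes in thousands of bytes."""
    largest = max(symbol, eil, code)   # assumed meaning of "largest single zone"
    return tuple(round(100 * om / base) for base in (peak, largest, eil))

# Rows taken from Table 2: Peak Total, SYMBOL, EIL, CODE, OM.
gem_df = om_percentages(11305, 836, 1672, 2056, 1180)
gem_fb = om_percentages(4694, 316, 522, 457, 304)
print(gem_df, gem_fb)
```

For GEM_DF the largest zone is the CODE ZONE, so % Larg and % EIL differ; for GEM_FB the EIL ZONE is the largest, so the two columns coincide.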

image text, etc.) due to the inclusion ofenhanced infor
mation compared to the traditional symbol table size.

The followi ng is a summary of the data, where OM
ZON E refers to the temporary working virtual mem
ory zone used tor split l i feti me analysis:
•

The OM ZO N E size averages about 10 percent of
the peak compil ation size.

•

The OM ZON E size is one-quarter to one- half of the
EIL ZONE size. (The latter is well known for typi 
cal ly being the largest zone in a GEM compilation . )

•

Since the O M ZONE i s created and destroyed after aU
ElL ZONEs are destroyed, the OM ZONE does not
conuibutc to establishing the peak total size.

SS/52 : This ratio shows the object or i mage size
with enhanced debugging i n formation with opti
mization compared to the traditional debugging
size without optimization.
The last ratio, SS/52, is especially interesting because
it combines two effects: ( l ) the reduction in size as a
result of compiler optimization, and (2) the i ncrease i n
size because the larger debugging symbol table needed
to describe the resu lt of the optimi zation . The resu lt
ing net i ncrease is reasonably modest.

Object Module Size

Summary and Concl usions

The increased size of enhanced symbol table informa
tion tor both object files and executable i mage files is
shown in Table 3.
In Table 3, the application or group of modu les is iden
tified in the first column. The columns labeled 5 1 , 52, etc.
give the resulting size for the combination of compilation
options described earlier. Object module and executable
image data is presented in successive rows.
Three ratios of particu lar i nterest arc computed .

There exists a small but significant literature regarding
the debugging of optimized code, yet very kw de bug
gers take advantage of what is known. In this paper we
describe the new capabi l i ties tor debuggin g optimized
code that are now su pported in the G EM compiler sys
tem and the Open VMS D E B U G component of the
OpenVMS Alpha operating system . These capabilities
deal with split lifeti m e variables and currency determi
nation , semantic stepping, and procedure inlining. For
each case, we describe the p roblem add ressed a nd then
presen t an overview of G EM com piler and OpenVMS
D E B U G processi ng and the object mod u l e represen
tation that med iates between them. All but the inlin
i n g support are i ncluded i n OpenVMS DEBUG V7 .0
and i n G E M- based compi lers for Alpha systems that
have been shipping since 1 996. The inl ining support is

52/5 1 : This ratio shows the object or i mage size
with traditional debugging information compared
to a base compilation without any debuggi ng infor
mation . This ratio i ndicates the additional cost, in
terms of increased object and image file size, associ 
ated with doing traditional sym bolic debugging.
(S8-S5 )/(S2-S 1 ) : This ratio shows the increase in
debugging symbol table size (exclusive of base object,

Table 3
Object/Executable (.OBJ/.EXE) File Sizes (in Number of Blocks) for Various OpenVMS Components

                       S1       S2  S2/S1       S4       S5       S8  (S8-S5)/  S8/S2
File                noopt    noopt  Ratio      opt      opt      opt   (S2-S1)  Ratio
                  nodebug    debug         nodebug    debug    debug     Ratio
                 nodbgopt nodbgopt        nodbgopt nodbgopt   dbgopt

BLISS CODE
GEM_*.OBJ          31,477   51,069   1.62   27,483   47,031   68,728      1.11   1.35
GEM_*.EXE          12,160   29,543   2.43   10,373   27,755   32,288      0.26   1.09

C CODE
C_METRIC.OBJ          436      653   1.50      478      733    1,680      4.36   2.57
C_METRIC.EXE          250      348   1.39      250      385      581      2.00   1.67
GRAM.OBJ              102      120   1.19      100      117      224      5.94   1.87
GRAM.EXE               60       70   1.17       58       69       91      2.20   1.30
INTERP.OBJ            140      207   1.48      134      205      450      3.66   2.17
INTERP.EXE             80      113   1.41       75      113      167      1.64   1.47

FORTRAN CODE
MATRIX300X.OBJ         20       34   1.70       16       29       71      3.00   2.08
MATRIX300X.EXE         19       29   1.53       15       25       34      0.90   1.17
NAGL.OBJ               42       63   1.51      288      509    1,178      3.11   1.84
NAGL.EXE              289      388   1.34      187      333      469      1.37   1.21
SPICE.OBJ           1,652    3,117   1.89    1,073    2,571    4,916      1.60   1.58
SPICE.EXE           1,031    1,660   1.61      549    1,318    1,803      0.77   1.09
WAVEX.OBJ             555    1,639   2.95      393    1,556    2,949      1.29   1.80
WAVEX.EXE             634    1,190   1.88      490    1,167    1,437      0.49   1.21

Digital Technical Journal    Vol. 10 No. 1    1998    97

currently in field test. Work is under way to provide
similar capabilities in the ladebug debugger [17, 18]
component of the DIGITAL UNIX operating system.

There are and will always be more opportunities and
new challenges to improve the ability to debug optimized
code. Perhaps the biggest problem of all is to figure
out where best to focus future attention. It is easy to
see how the capabilities described in this paper provide
major benefits. We find it much harder to see what
capability could provide the next major increment in
debugging effectiveness when working with optimized code.

References

1. P. Zellweger, "Interactive Source-Level Debugging of
   Optimized Programs," Ph.D. Dissertation, University
   of California, Xerox PARC CSL-84-5 (May 1984).

2. J. Hennessy, "Symbolic Debugging of Optimized Code,"
   ACM Transactions on Programming Languages and
   Systems, vol. 4, no. 3 (July 1982): 323-344.

3. M. Copperman, "Debugging Optimized Code Without
   Being Misled," Ph.D. Dissertation, University of
   California at Santa Cruz, UCSC Technical Report
   UCSC-CRL-93-21 (June 11, 1993).

4. G. Brooks, G. Hansen, and S. Simmons, "A New
   Approach to Debugging Optimized Code," ACM SIGPLAN
   '92 Conference on Programming Language Design and
   Implementation, SIGPLAN Notices, vol. 27, no. 7
   (July 1992): 1-11.

5. Convex Computer Corporation, CONVEX CXdb Concepts
   (Richardson, Tex.: Convex Press, Order No. DSW-471,
   May 1991).

6. D. Coutant, S. Meloy, and M. Ruscetta, "DOC: A
   Practical Approach to Source-Level Debugging of
   Globally Optimized Code," Proceedings of the SIGPLAN
   '88 Conference on Programming Language Design and
   Implementation, Atlanta, Ga. (June 22-24, 1988):
   125-134.

7. L. Zurawski, "Source-Level Debugging of Globally
   Optimized Code with Expected Behavior," Ph.D.
   Dissertation, University of Illinois at
   Urbana-Champaign (1989).

8. U. Hölzle, C. Chambers, and D. Ungar, "Debugging
   Optimized Code with Dynamic Deoptimization," ACM
   SIGPLAN '92 Conference on Programming Language Design
   and Implementation, San Francisco, Calif. (June 17-19,
   1992) and SIGPLAN Notices, vol. 27, no. 7 (July 1992):
   32-43.

9. L. Pollock and M. Soffa, "High-level Debugging with
   the Aid of an Incremental Optimizer," Proceedings of
   the 21st Hawaii International Conference on System
   Sciences (January 1988): 524-532.

10. L. Pollock, M. Bivens, and M. Soffa, "Debugging
    Optimized Code via Tailoring," International
    Symposium on Software Testing and Analysis
    (August 1994).

11. P. Feiler, "A Language-Oriented Interactive
    Programming Environment Based on Compilation
    Technology," Ph.D. Dissertation, Carnegie-Mellon
    University, CMU-CS-82-117 (May 1982).

12. A. Adl-Tabatabai, "Source-Level Debugging of Globally
    Optimized Code," Ph.D. Dissertation, Carnegie Mellon
    University, CMU-CS-96-133 (June 1996).

13. D. Blickstein et al., "The GEM Optimizing Compiler
    System," Digital Technical Journal, vol. 4, no. 4
    (Special Issue 1992): 121-136.

14. B. Beander, "VAX DEBUG: An Interactive, Symbolic,
    Multilingual Debugger," ACM SIGSOFT/SIGPLAN Software
    Engineering Symposium on High-Level Debugging, ACM
    SIGPLAN Notices, vol. 18, no. 8 (August 1983):
    173-179.

15. OpenVMS Debugger Manual, Order No. AA-QSBJB-TE
    (Maynard, Mass.: Digital Equipment Corporation,
    November 1996).

16. R. Sites, ed., Alpha Architecture Reference Manual,
    3d ed. (Woburn, Mass.: Digital Press, 1998).

17. T. Bingham, N. Hobbs, and D. Husson, "Experiences
    Developing and Using an Object-Oriented Library for
    Program Manipulation," OOPSLA Conference Proceedings,
    ACM SIGPLAN Notices, vol. 28, no. 10 (October 1993):
    83-89.

18. Digital UNIX Ladebug Debugger Manual, Order No.
    AA-PZ7EE-T1TE (Maynard, Mass.: Digital Equipment
    Corporation, March 1996).

Biographies

Ronald F. Brender

Ronald F. Brender is a senior consultant software engineer
in Compaq's Core Technology Group, where he is working
on both the GEM compiler and the UNIX ladebug projects.
During his career, Ron has worked in advanced development
and product development roles for BLISS, [...] DIGITAL's
DECsystem-10, PDP-11, VAX, and Alpha computer systems.
He served as a representative on the ANSI and ISO standards
committees for FORTRAN 77 and later for Ada 83, also
serving as a U.S. Department of Defense invited Distinguished
Reviewer and a member of the Ada Board and the Ada
Language Maintenance Committee for more than eight
years. Ron joined Digital Equipment Corporation in 1970,
after earning the degrees of B.S.E. (engineering sciences),
M.S. (applied mathematics), and Ph.D. (computer and
communication sciences) in 1965, 1968, and 1969,
respectively, all from the University of Michigan.

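    #include <assert.h>

    /* Checked replacement for val << amt: the assertion fails
       if the shift count is outside the width of int. */
    int int_shl_int_int(int val, int amt)
    {
        assert(0 <= amt && amt < sizeof(int) * 8);
        return val << amt;
    }

For example, the generated text

    a << b

is replaced upon regeneration by the text

    int_shl_int_int(a, b)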

Figure 2
Generated C Expression


If, on being rerun, the regenerated test case asserts a
standards violation (for example, a shift of more than
the word length), the test is discarded and testing
continues with the next case.

Two problems with the generator remain: (1) obtaining
enough output from the generated programs so that
differences are visible and (2) ensuring that the
generated programs resemble real-world programs so
that the developers are interested in the test results.
Solving these two problems brings the quality of test
input to level 7. The trick here is to begin generating the
program not from the C grammar nonterminal symbol
translation-unit but rather from a model program
described by a more elaborate string in which some of
the program is already fully generated. As a simple
example, suppose you want to generate a number of
print statements at the end of the test program. The
starting string of the generating grammar might be

    #define P(v) printf(#v "=%x\\n", v)
    int main() {
        declaration-list
        statement-list
        print-list
        exit(0);
    }

where the grammatical definition of print-list is
given by

    print-list    P ( identifier ) ;
    print-list    print-list P ( identifier ) ;

This starting string has nonterminals for the three lists
instead of just one for the standard C start symbol
translation-unit. Programs generated from this starting
string will cause output just before exit. Because
differences caused by rounding error were uninteresting
to us, we modified this print macro for types float and
double to print only a few significant digits. With a
little more effort, the expansion of print-list can be
forced to print each variable exactly once.

Alternatively, suppose a test designer receives a bug
report from the field, analyzes the report, and fixes the
bug. Instead of simply putting the bug-causing case in
the regression suite, the test designer can generalize it
in the manner just presented so that many similar test
cases can be used to explore for other nearby bugs.

The effect of level 7 is to augment the probabilities
in the stochastic grammar with more precise and direct
means of control.

Forgotten Inputs

The elaborate command-line flags, config files, and
environment variables that condition the behavior of
programs are also input. Such input can also be
generated using the same toolset that is used to generate
the test programs. The very first test on the very first run
with generated compiler directive flags revealed a bug
in a compiler under test: it could not even compile its
own header files.

...ing the testing. Only those results that are exhibited by
very short text are shown. Some of the results derive
from hand generalization of a problem that originally
surfaced through random testing.

There was a reason for each result. For example, the
server crash occurred when the tested compiler got a
stack overflow on a heavily loaded machine with a very
large memory. The operating system attempted to
dump a gigabyte of compiler stack, which caused all
the other active users to thrash, and many of them also
dumped for lack of memory. The many disk drives on
the server began a dance of the lights that sopped up
the remaining free resources, causing the operators to
boot the server to recover. Excellent testing can make
you unpopular with almost everyone.

Test Distribution

Each tested or comparison program must be executed ...

There are numerous ways to utilize a network
to distribute tests and then gather the results. One
particularly simple way is to use continuously running
watcher programs. Each watcher program periodically
examines a common file system for the existence of
some particular files upon which the program can act.
If no files exist, the watcher program sleeps for a while
and tries again. On most operating systems, watcher
programs can be implemented as command scripts.

There is a test master and a number of test beds.
The test master generates the test cases, assigns them
to the test beds, and later analyzes the results. Each
test bed runs its assigned tests. The test master and test
beds share a file space, perhaps via a network. For each
test bed there is a test input directory and a test output
directory.

A watcher program called the test driver waits until
all the (possibly remote) test input directories are
empty. The test driver then writes its latest generated
test case into each of the test input directories and
returns to its watch-sleep cycle. For each test bed there
is a test watcher program that waits until there is a file
in its test input directory. When a test watcher finds a
