Digital Technical Journal, Volume 2, Number 4, 1990

VAX 9000 Series

Digital Technical Journal
Digital Equipment Corporation

Volume 2 Number 4
Fall 1990

Editorial
Jane C. Blake, Editor
Barbara Lindmark, Associate Editor

Circulation
Catherine M. Phillips, Administrator
Suzanne J. Babineau, Secretary

Production
Helen L. Patterson, Production Editor
Nancy Jones, Typographer
Peter Woodbury, Illustrator and Designer

Advisory Board
Samuel H. Fuller, Chairman
Richard W. Beane
Robert M. Glorioso
John W. McCredie
Mahendra R. Patel
Grant Saviers
Robert K. Spitz
Victor A. Vyssotsky

The Digital Technical Journal is published quarterly by Digital
Equipment Corporation, 146 Main Street MLO1-3/B68, Maynard,
Massachusetts 01754-2571. Subscriptions to the Journal are $40.00
for four issues and must be prepaid in U.S. funds. University and
college professors and Ph.D. students in the electrical engineering
and computer science fields receive complimentary subscriptions
upon request. Orders, inquiries, and address changes should be
sent to The Digital Technical Journal at the published-by address.
Inquiries can also be sent electronically to DTJ@CRL.DEC.COM.
Single copies and back issues are available for $16.00 each from
Digital Press of Digital Equipment Corporation, 12 Crosby Drive,
Bedford, MA 01730-1493.

Digital employees may send subscription orders on the ENET to
RDVAX::JOURNAL or by interoffice mail to mailstop MLO1-3/B68.
Orders should include badge number, cost center, site location
code and address. U.S. engineers in Engineering and Manufacturing
receive complimentary subscriptions; engineers in these
organizations in countries outside the U.S. should contact the
Journal office to receive their complimentary subscriptions. All
employees must advise of changes of address.

Comments on the content of any paper are welcomed and may
be sent to the editor at the published-by or network address.

Copyright © 1990 Digital Equipment Corporation. Copying
without fee is permitted provided that such copies are made for
use in educational institutions by faculty members and are not
distributed for commercial advantage. Abstracting with credit
of Digital Equipment Corporation's authorship is permitted.
All rights reserved.

The information in this Journal is subject to change without
notice and should not be construed as a commitment by Digital
Equipment Corporation. Digital Equipment Corporation
assumes no responsibility for any errors that may appear in
this Journal.

ISSN 0898-901X

Cover Design
Digital's VAX 9000 mainframe system is the theme of this issue.
Our cover depicts several simple instructions flowing through
the VAX 9000 instruction execution pipeline. High performance
was achieved by breaking the VAX instructions into small simple
tasks that could be pipelined efficiently. Concurrent operation
on up to six instructions simultaneously resulted in an execution
rate of one simple VAX instruction per clock period.
Gloria Monroy of the High Performance Systems Group designed
the cover graphic, which was implemented in cooperation
with David Comberg of the Corporate Design Group.

Documentation Number EY-E762E-DP

The following are trademarks of Digital Equipment Corporation:
CI, DECsystem-10, DECSYSTEM-20, Digital, the Digital logo, HDSC,
MCU, MicroVAX, NI, PDP-1, ULTRIX, VAX, VAX-11/780, VAX 6000,
VAX 8000, VAX 8600, VAX 8650, VAX 9000, VAXBI, VMS, XMI.
IBM is a registered trademark of International Business Machines
Corporation.
Kapton is a trademark of E.I. du Pont de Nemours & Company.
MOSAIC III is a trademark of Motorola Corporation.
Micromaster Plus is a registered trademark of LTX Company.

Book production was done by Digital's Educational Services
Media Communications Group in Bedford, MA.

Contents

Foreword
Carl S. Gibson

VAX 9000 Series

Design Strategy for the VAX 9000 System
David B. Fite Jr., Tryggve Fossum, and Dwight Manley

VAX Instructions That Illustrate the Architectural Features
of the VAX 9000 CPU
John E. Murray, Ricky C. Hetherington, and Ronald M. Salett

Semiconductor Technology in a High-performance VAX System
Matthew J. Adiletta, Richard L. Doucette, John H. Hackenberg,
Dale H. Leuthold, and Dennis M. Litwinetz

Vector Processing on the VAX 9000 System
Richard A. Brunner, Dileep P. Bhandarkar, Francis X. McKeen,
Bimal Patel, William J. Rogers Jr., and Gregory L. Yoder

HDSC and Multichip Unit Design and Manufacture
Peter B. Dunbeck, Richard J. Dischler, James B. McElroy,
and Frank J. Swiatowiec

The VAX 9000 Service Processor Unit
Matthew S. Goldman, Paul H. Dormitzer, and Paul A. Leveille

The Unique Features of the VAX 9000 Power System Design
Derrick J. Chin, Barry G. Brown, Charles F. Butala, Luke L. Chang,
Steven J. Chenetz, Gerald E. Cotter, Brian T. Lynch, Thiagarajan Natarajan,
and Leonard J. Salafia

Synthesis in the CAD System Used to Design the VAX 9000 System
Donald F. Hooper and John C. Eck

Hierarchical Fault Detection and Isolation Strategy
for the VAX 9000 System
Karen E. Barnard and Robert P. Harokopus

Editor's Introduction

Jane C. Blake
Editor

The VAX 9000, Digital's first mainframe computer, is the topic of
papers in this issue of the Digital Technical Journal. As engineers
writing for this issue relate, the primary goal of the project from
the initial product strategy through manufacture was to design and
build a very high-performance, highly reliable VAX system.

Design engineers applied both CISC and RISC techniques to achieve
high levels of performance for this tightly coupled multiprocessor
system. In the opening paper, Dave Fite, Tryggve Fossum, and Dwight
Manley explain the strategy behind the design. They begin with an
overview of the system, the technology, and CAD tools, and then
describe the redesign of VAX instructions into small tasks which can
be efficiently pipelined. The authors also touch upon three
additional aspects of the VAX 9000 system: the integration of vector
processing into the VAX architecture, new error handling techniques,
and performance modeling.

One measure of performance is the number of instructions processed
per cycle. The average number of cycles per instruction is less than
five, which is nearly half that of previous VAX systems. To
illustrate the architectural features that enable this level of
performance, John Murray, Rick Hetherington, and Ron Salett have
selected a small sample of VAX instructions. They describe the
instruction flow through the pipeline, how instruction features
combine to work on a single macro, and how stages of the pipeline
interact.

In addition to the architectural improvements, machine performance
is enhanced at the semiconductor level by a new generation of
semicustom and custom integrated circuits that support a low cycle
time. Matt Adiletta, Dick Doucette, John Hackenberg, Dale Leuthold,
and Dennis Litwinetz give an overview of the bipolar technology used
in the system. They then describe the methods used to implement the
77 different gate array chips, the five custom chips, and the
self-timed RAM architecture.

An additional performance improvement for numeric computations is
the VAX vector architecture and is treated in the paper by Rich
Brunner, Dileep Bhandarkar, Frank McKeen, Bimal Patel, Bill Rogers,
and Greg Yoder. They discuss the architectural model and particulars
of the VAX 9000 implementation, which affords numerically intensive
applications performance four to five times greater than can be
achieved by the scalar processor.

To ensure that the system performance gains at the semiconductor
level were not diminished but were instead enhanced by packaging and
interconnects, engineers developed several technologies unique in
the industry. The technology behind the high-density signal carrier
and the multichip unit is explained in the paper by Pete Dunbeck,
Rich Dischler, Jim McElroy, and Frank Swiatowiec.

Equally important to performance in the new 9000 is system
reliability as evidenced by the introduction of the service
processor unit. In their paper about the service processor, Matt
Goldman, Paul Dormitzer, and Paul Leveille relate how the
MicroVAX-based system embedded within the 9000 detects, isolates,
and corrects problems without interrupting the system.

High system availability was also one impetus in the design of the
power system. Some of the unique features of the power system, such
as redundant regulators, improved load sharing and simulation, are
discussed by Derrick Chin, Barry Brown, Charles Butala, Luke Chang,
Steve Chenetz, Jerry Cotter, Brian Lynch, Raj Natarajan, and Len
Salafia.

The two papers that close this issue address the topics of CAD
methodology and system diagnosis. Don Hooper and John Eck describe a
CAD methodology that combines advanced rule-based AI techniques with
an object-oriented database. The new methodology saves logic
designers significant time and reduces errors. A complex system such
as the VAX 9000 requires improved system diagnosis capabilities to
achieve the desired high system availability. Karen Barnard and Rob
Harokopus demonstrate how a new scan system, in combination with
scan pattern testing and symptom-directed diagnosis, achieves this
necessary diagnosis capability.

The editors thank Rick Hetherington of the High Performance Systems
Group not only for writing a paper but also for his help in
coordinating this issue.

Biographies

Matthew J. Adiletta

Matthew Adiletta is currently contributing to the implementation of
a new processor architecture and performing a technology evaluation
to determine the technology for the implementation. He joined
Digital in 1985 to work on a high-performance RISC architecture.
Matt was not only the architect for the VAX 9000 system, but he also
implemented the integer and floating point multiply and divide units
and developed an ECL custom chip process. He holds one patent and
has several patents pending. Matt received a B.S.E.E. (honors, 1985)
from the University of Connecticut.

Karen E. Barnard

A senior software engineer with the High Power Business Unit CPU
Development Group, Karen Barnard wrote the read-only memory-based
diagnostic for the VAX 9000 service processor unit's scan control
module and developed the scan pattern diagnostic for the VAX 9000
CPU and SCU. Karen also worked on the debugging structural test
process for the VAX 9000 kernel environment. Prior to joining
Digital in 1986, Karen was with Data General Corporation. She
received a B.S. (1983) in computer science from the Worcester
Polytechnic Institute.

Dileep P. Bhandarkar

As technical director for RISC systems, Dileep Bhandarkar is
responsible for leading the architectural direction of RISC
products. He joined Digital in 1978 and was responsible for managing
the evolution of the VAX architecture. Dileep was the chief
architect for VAX vector processing and coarchitect of Digital's
RISC architecture. He holds one patent for his work at Digital and
has several patents pending. His degrees in electrical engineering
include a Bachelor of Technology from the Indian Institute of
Technology and an M.S. and a Ph.D. from Carnegie-Mellon University.

Barry G. Brown

The concept of designing DC-to-DC converters as system elements
rather than individual "power supplies" was introduced into the
high-power systems products by Barry Brown. He created and developed
a highly flexible, high-reliability DC-to-DC conversion system for
the VAX 9000 series. Barry designed, implemented, and verified the
power system for the VAX 9000 Model 200 systems. He was a principal
engineer for the Codex Corporation before coming to Digital in 1984.
Barry is a graduate of Woolwich Polytechnic and Harlow Technical
College.

Richard A. Brunner

As a principal engineer, Richard Brunner is the architect currently
responsible for the engineering refinement and control of both the
VAX and VAX vector architectures. He is the editor of the VAX
Architecture Reference Manual and coauthor of the VAX Vector
Handbook and several papers on the VAX vector architecture. He
received a B.S. (high honors, 1984) in electrical engineering from
Case Western Reserve University and an M.S. (1987) in computer
engineering from Rensselaer Polytechnic Institute. He is a member of
IEEE and Tau Beta Pi.

Charles F. Butala

Presently responsible for the power system design and architecture
of the VAX 9000 Model 400 systems, Charles Butala is a consulting
engineer in the Information Systems Business Unit Power Systems
Group. Since he joined Digital in 1976, he has been responsible for
several power system design projects, including the VAX 8600 system.
He is a member of IEEE and Tau Beta Pi, and holds honorary society
membership in Eta Kappa Nu. Charles received a B.S.E.E. (1968) from
Illinois Institute of Technology and an M.S.E.E. from Northeastern
University.

Luke L. Chang

After receiving his M.S. in electrical engineering from Virginia
Polytechnic Institute and State University in 1988, Luke Chang
joined the Power Systems Technology and Regulations Group. He is
currently a hardware engineer and is responsible for developing
simulation tools to perform high-quality software design
verification tests for the next generation DC-to-DC power
converters. Luke's previous responsibilities include transient
analysis and testing of the VAX 9000 memory power distribution
system, and power system cost reduction studies.

Steven J. Chenetz

As a principal engineer in the Information Systems Business Unit
Power Systems Group, Steven Chenetz is currently working on the
H7390 for a high-power VAX system. He previously was a member of the
design and development teams for the H7380 of the VAX 9000 system,
the H7188 environmental monitoring module for the VAX 8600 power
system, the VAX 8600 clock distribution system, and signal integrity
for the VAX 8600 system. Steve joined Digital upon graduation from
Rensselaer Polytechnic Institute in 1981. He has an M.S.E.E. from
Northeastern University (1987).

Derrick J. Chin

Derrick Chin is the engineering manager for several Information
Systems Business Unit power groups and is design engineer of the
VAX 9000 processor's DC power distribution system. His association
with Digital began in 1961, and he has participated in many
projects, from the PDP-1 and the DECsystem-10 to the VAX 8600
systems. His responsibilities have ranged from development of
precision displays, circuit design, and core and semiconductor
memories to environmental monitoring modules and power systems. He
holds a B.S.E.E. (1959) from MIT.

Gerald E. Cotter

Principal engineer Gerald Cotter is a member of the Information
Systems Business Unit Power Systems Group. He was the project
engineer and coarchitect of the VAX 9000 power control system (PCS).
Jerry was the PCS interface to Customer Service and Support
Engineering, Manufacturing, and Service Processor Unit Groups. He
participated in development of the PCS and power system test
strategies and the initial design of the T01060 power and
environmental monitor module. His previous work includes the
VAX 8600 system's power and control subsystem.

Richard J. Dischler

In his position of systems engineer for the High Performance Systems
Group, Richard Dischler worked on the VAX 9000 signal integrity
project. He also was a member of the project team for the electrical
design of HDSC and micropackaging for multichip units, planar
boards, and connectors for the VAX 9000 system. Rich held similar
responsibilities in the development of the VAX 8600 system. He
joined Digital in 1982, and his previous experience was at Applied
Research Laboratories. He holds a B.S.E.E. (1982) from Pennsylvania
State University.

Paul H. Dormitzer

As an undergraduate at Harvard University, Paul Dormitzer gained
experience with the UNIX operating system by working as a programmer
and operator. Upon receiving his B.A. in computer science in 1987,
he joined Digital's High Performance Systems Group. He is currently
an engineer in the High Performance Business Unit CPU Engineering
Group. Paul's primary responsibilities are in the development of
error recovery processes for high-power systems, such as the
VAX 9000 system.

Richard L. Doucette

Since joining Digital in 1979, Richard Doucette has been a member of
several high-performance systems project teams. As a senior engineer
on the VAX 8600 team, he helped introduce the Motorola Macrocell
Array I (MCA I) technology into Digital and was responsible for its
design analysis and characterization in the system. As engineering
manager on the VAX 9000 team, he was responsible for the
incorporation of MCA 3 technology, custom chips, and self-timed RAM
components in the system. He holds a B.S.E.E. (1973) from the
University of Maine.

Peter B. Dunbeck

Peter Dunbeck is an engineering manager in the High Performance
Business Unit Technology Research and Engineering Group. He held
various positions on the VAX 9000 program between 1985 and 1990,
including technology program manager and design engineering manager
for the multichip unit. Before joining Digital in 1984 as a
manufacturing engineer, Peter developed energy conservation programs
for Thermo Electron. He holds a B.S. (1977) in mechanical
engineering from Virginia Tech and an S.M. (1979) in aeronautics and
astronautics from MIT.

John C. Eck

The development of the majority of the physical design CAD tools
used in the VAX 9000 system was managed by John Eck. He is a
software engineer manager in the High Performance Systems CAD and
Diagnostics Group. John was employed as the manager of the Automated
Design Department of Badger Company before coming to Digital in
1984. He holds a B.S. (1964) in physics and an M.S. (1966) in
aeronautics and astronautics from MIT, and an M.B.A. (highest
honors, 1984) from Babson College.

David B. Fite Jr.

Consultant engineer David Fite was a member of the initial
architecture team for the VAX 9000 system. He developed the
architecture for the branch prediction, instruction fetch, and
instruction decode for the VAX 9000. His previous work includes
responsibility for prototype debugging on the VAX 8600 system. Dave
joined Digital in 1982. He has one patent and several patent
applications pending. He is a graduate of Worcester Polytechnic
Institute with a B.S. (honors) in electrical engineering.

Tryggve Fossum

Tryggve Fossum is the system architect of the VAX 9000 system. He
received a B.S. (1968) from the University of Oslo and earned his
Ph.D. (1972) from the University of Illinois. Tryggve joined Digital
in 1973 and worked on the design of high-end computers, notably the
VAX-11/780 system. As a project leader on the VAX 8600 team, he
guided the design of the floating point accelerator. He has also
worked on several research projects, including an early raster scan
graphics workstation, and a workstation with an integrated disk
system.

Matthew S. Goldman

As a senior engineer on the VAX 9000 project team, Matthew Goldman
designed the scan control chip, which contains the control logic for
the VAX 9000 scan system. He was also the responsible engineer for
all VAX 9000 service processor hardware. Prior to joining Digital's
High Performance Systems CPU Group in 1986, Matt was a design
engineer for Raytheon Company. He is a member of Tau Beta Pi and Eta
Kappa Nu. Matt holds a B.S. (highest honors, 1983) and an M.S.
(1988) in electrical engineering from Worcester Polytechnic
Institute.

John H. Hackenberg

In 1968, John Hackenberg came to Digital as a technician on the
KI-10 project, leaving after two years to serve in the armed forces.
He returned to Digital in 1971 and worked on the designs for various
high-end systems, including the KL-10. As a consulting engineer on
the VAX 8600 project, he worked in the area of signal integrity.
John was the project leader for the MCA 3 gate array used in the
VAX 9000 system and is currently developing a bipolar gate array. He
holds a B.S.E.T. (1979) from the University of Lowell.

Robert P. Harokopus

A cum laude graduate of the University of Michigan, Robert Harokopus
received a B.S. (1986) in computer engineering and is now studying
for an M.S. in computer engineering from Boston University. Bob is a
senior software engineer and joined Digital in 1986. He developed
the symptom-directed diagnosis software used in the VAX 9000 service
processor unit. Bob also developed software for the HIDE CAD tool
and SCEPTER automatic test pattern generator, both of which were
used in the VAX 9000 design project. He is a member of Tau Beta Pi
and Eta Kappa Nu.

Ricky C. Hetherington

As a principal engineer with the High Performance Systems Group,
Ricky Hetherington is currently the project leader of the
translation buffer and cache design of the VAX 9000 system. He holds
one patent and has several patents pending on the various design
features of the VAX 9000 M-box. Rick joined Digital in 1982 as a
senior engineer in Digital's Large Computer Group. He has a B.S.
from Pennsylvania State University.

Donald F. Hooper

Don Hooper is a consulting engineer in both logic design and CAD
disciplines. He initiated and led the development of the Synthesis
of Integral Design program, Digital's first synthesis tool. Before
coming to Digital in 1979, he was architect for the Itel 7031
mainframe and cache designer for the Itel Advanced System 4. He is a
graduate of Don Bosco Technical Institute. Don holds patents in
speech recognition circuits, the tag and queuing system for
Digital's first pipelined CPU, and the control storage pipe for the
VAX 8600 system. In addition, he has several patents pending in
logic synthesis.

Dale H. Leuthold

A member of the technical staff of the Integral Circuit Design
Group, Dale Leuthold led the design team for the VAX 9000 vector
register chip. He is currently working on random-access memory
development for high-speed mainframes. Dale was responsible for
bipolar integrated circuit design at Signetics Corporation and
Trilogy Systems Corporation before coming to Digital in 1986. He
holds one patent and has one patent pending. Dale received a B.S.
from Oregon State University.

Paul A. Leveille

In his nearly ten-year relationship with Digital, Paul Leveille has
specialized in the development of high-power systems, particularly
the VAX 8600 and VAX 9000 systems. As a principal engineer in the
High Performance Business Unit, he helped define the VAX 9000
service processor subsystem and was responsible for developing the
scan control firmware and portions of the service processor
application software. Paul's previous responsibilities include
console diagnostics, firmware, and application software.

Bio�raphies

Dennis M. Litwinetz

The project leader for the design of four standard cell and custom
chips for the VAX 9000, Dennis Litwinetz is a consulting engineer in
the High Performance Business Unit. He has previously participated
in the design of two standard cell chip designs for the VAX 8600
system. He joined Digital in 1967 as a technician for the
DECsystem-10 Engineering Group. Dennis has a patent pending for the
VAX 9000 self-timed register file design. He received a B.S.E.E.T.
from Lowell Technological Institute and an M.S.C.E. from the
University of Lowell.

Brian T. Lynch

Brian Lynch is a principal hardware engineer in the Information
Systems Business Unit Power Systems Group. In this position, he
designed and developed the H7382 bias power supply used in the
VAX 9000 system. He is presently working on power solutions for
future high-performance systems. Prior to joining Digital in 1972,
Brian was responsible for power converter and analog module design
at Intronics. He has a B.S.E.E. (1978) from Worcester Polytechnic
Institute.

Dwight Manley

As a principal engineer on the VAX 9000 project, Dwight Manley was
responsible for all of the performance modeling of the VAX 9000 CPU
design. His present responsibilities include writing code for a
Digital Extended Math Library product. Dwight joined Digital in 1979
as a member of the Systems Performance Analysis Group. Prior to that
time, he worked as a systems programmer for the Bell Telephone
System. Dwight has a B.S. (1971) in mathematics from the University
of Massachusetts and an M.S. (1976) from Northeastern University.

James B. McElroy

Jim McElroy is the multichip unit operations manager. His work on
the VAX 9000 system began with interconnect and packaging, followed
by the management of the physical technology efforts. He then became
the manufacturing systems program manager for the introduction of
the VAX 9000 system into manufacturing. Before joining Digital in
1976, Jim worked at RCA on packaging and interconnect design for
military computer systems. He received a B.S.M.E. and an M.S.M.E.
from Northeastern University.

Francis X. McKeen

The project leader for the V-box unit of the VAX 9000 system was
Francis McKeen. Prior to working on the VAX 9000 system, he wrote
microcode for the VAX 8600 and VAX 8650 systems. Frank is a
principal engineer and has been with Digital for seven years. He
holds one patent and has several patent applications pending. Frank
received a B.S.E.E. from Northeastern University and is a member of
IEEE.

John E. Murray

The coauthor of Microarchitecture of the VAX 9000, John Murray is a
consulting engineer in the High Performance Business Unit. He served
as project leader of the design team for the I-box unit of the
VAX 9000. He joined Digital in 1982. John's previous employer was
ICL in the United Kingdom, where he was a design engineer. He
received a B.Sc. (1969) from Warwick University. He holds one patent
and has several patents pending.

Thiagarajan Natarajan

Thiagarajan Natarajan is manager of a DC-to-DC converter group in
the Information Systems Business Unit. His group develops a
high-density and highly reliable DC-to-DC converter, associated
hybrids, semiconductor components, and the distribution system for
the next generation, high-performance VAX systems. Raj's prior
experience includes positions at General Electric, Bell
Laboratories, and Perkin Elmer Corporation. He has a Ph.D. in
electrical engineering, has been awarded one patent, and has
authored approximately seventeen technical papers.

Bimal Patel

Principal engineer Bimal Patel joined Digital in 1986 as a senior
engineer. His primary responsibility since that time was the design
of the V-box unit of the VAX 9000 system. Bimal was previously
employed as a senior engineer in the CPU Design Group of Prime
Computer, Inc. He has an M.S. in computer engineering from Boston
University.

William J. Rogers Jr.

William Rogers is an engineer in the VAX 9000 CPU Group, where he
developed the design of the control logic of the V-box unit for the
VAX 9000. Prior to working on this high-performance system, Bill was
a member of the SASE Support Engineering Group. He joined Digital in
1986 and is a member of IEEE and Tau Beta Pi. He received a B.S.
(1986) in electrical engineering from Michigan Technological
University.

Leonard J. Salafia

The development of the AC front end for the VAX 9000 system was the
responsibility of Leonard Salafia, who is the manager of the AC
Power Interface Development Group. His previous work at Digital
includes supervising the development of storage system power
products for the Central Power Supply Engineering Group and for the
Storage Systems Power Group. Len worked for General Electric prior
to coming to Digital in 1980. He holds a B.S.E.E. (magna cum laude,
1969) from the University of Hartford and an M.S.E.E. (1976) from
Rensselaer Polytechnic Institute.

Ronald M. Salett

As a consulting engineer in the High Performance Systems Group, Ron
Salett is currently leading the development of a new
high-performance CPU. As a project leader for the VAX 9000 system,
he was responsible for the architecture, design, and microcode of
the execution unit. Since joining Digital in 1977, Ron has also
worked as an architect and project leader on low-end integrated
PDP-11 systems. He holds two patents. Ron holds a B.S.E.E. (1975)
from Carnegie-Mellon University and an M.S.E.E. (1979) from
Worcester Polytechnic Institute.

Frank J. Swiatowiec

In 1988, Frank Swiatowiec became HDSC operations manager, with the
primary responsibility to transition Digital's new HDSC technology
to volume production. He was one of the engineering managers
responsible for the definition and development of the HDSC. Frank
had over 15 years of experience in the semiconductor industry when
he joined Digital in 1986. While with Motorola Corporation, he was
awarded four patents on ECL circuit designs. Frank holds a B.S.E.E.
from the University of Illinois and an M.S.E.E. from Arizona State
University.

Gregory L. Yoder

Gregory Yoder is a senior hardware engineer with the High
Performance Systems CPU Engineering Group. His primary
responsibilities on the VAX 9000 system included the design and
testing of the V-box unit, and prototype system debug, for which he
received an excellence award. He also assisted Manufacturing in
producing and installing external field test VAX 9000 machines. Greg
joined Digital in 1988, after participating in a one-year co-op
session at IBM. He holds a B.S.E.E. from Pennsylvania State
University.

Foreword

Carl S. Gibson
VAX 9000 Program Manager

This issue of the Digital Technical Journal is a
collection of papers describing the technologies,
designs, and design methods employed in Digital's
VAX 9000 mainframe/supercomputer, which was
introduced in the fall of 1989.
The VAX 9000 system embodies hundreds of
innovations in most areas of design, manufacture,
and service. In selecting papers for this Journal, we
have attempted to reflect the immense scope and
variety of this program, which ranks among the
largest and most complex in the history of our
industry.
In the summer of 1983, a small group of us set
about to determine what it would take for Digital to
develop a true mainframe. We felt that a mainframe
VAX would be a powerful addition to Digital's
product family. The products that we have created
took form, changed, and evolved over the months
and years as technical challenges yielded to
innovations, rigor, and discipline. An undertaking on
this scale necessarily undergoes numerous
transitions as new data emerges, assumptions are tested,
and alternatives are eliminated. Technical
breakthroughs built upon one another incrementally
as we pressed the design closer to our goals. The
primary objectives of very high system-level
performance and world-class reliability drove the design
process and the changes that emerged.
The planar logic packaging is illustrative of how
changes and improvements built upon one another.
The reliability benefits of minimal connections
precipitated a logic packaging design change from
stacked modules in dual backplanes to the planar
array. This change, an optimization for reliability,
in the end actually helped performance and
maintainability. Ultimately, though not envisioned
at the time, the adoption of the planar array had
a significant impact in that this structure enabled
impingement air cooling and elimination of the
bulky liquid system that was part of the initial
design. The final design of the VAX 9000 system
reflects, in myriad forms, this continual process of
successive refinement toward shared goals.
Design changes notwithstanding, our primary
strategy remained constant. The reader will note
that, while we innovated aggressively in CPU
structure, implementation technologies, and design
methodologies, we preserved full compatibility
with the VAX, Digital storage, and Digital
networking and cluster architectures. We wanted Digital
and our customers to be able to enjoy very high
performance levels in a product that was compatible
with prior investments. Therefore, we drew as
much as possible from existing products and
designs from many Digital development groups.
As a result, the VAX 9000 system incorporates
Digital's standard XMI bus and popular BI, CI, and
NI system-level interconnects. The system runs VMS
and ULTRIX operating systems, VAX layered
products, and all of our customers' and independent
software vendors' tools and applications. This
capability proved especially rewarding when, in the
final months of the project, our own VAX 9000
prototypes, running our unmodified CAD tools,
accelerated the processing of the inevitable
last-minute changes.
High-performance computation fundamentally
requires two key ingredients: short machine cycle
times and maximum computational work
performed in each cycle. The semiconductor and
multichip unit papers describe how we minimized
the VAX 9000 cycle time by use of fast circuits,
high-density packaging, and high-speed interconnects.
These papers are complemented by architecture
descriptions through which the authors present the
innovative features that minimize the number of
cycles required to execute the VAX instruction set.
These papers present the sophisticated pipelining
techniques and vector processing capabilities
incorporated in the VAX 9000 system.
Equal in importance to the computational
capabilities of the product are the service and control
features of the system. Papers covering the
VAX 9000 service processor and the system's fault
management capabilities provide the reader with
insights into these important aspects of the
product.
The development strategy for the VAX 9000
system was explicitly formulated to deal with
enormous technical and project complexity.
Complexity itself was the single most formidable challenge
facing the team. Apparent from the outset was the
fact that such an ambitious product required the
integration of a very large number of discrete
design objects; each had to be conceived, created,
documented, tested, and ultimately integrated and
verified as part of the whole. The reader will see
the diversity of these efforts and recognize the
challenge of unifying a design from this breadth of
technical advancement.
Central to our strategy was the creation of a
unified design tool suite operating in a seamless,
homogeneous VMS computing environment. The
first few years of the project were devoted to
construction of this environment in parallel with
top-level design formulation. The recognition that
rigorous design methods were crucial to our success
was possibly one of the team's most powerful
fundamental notions. Papers included in this journal
illustrate some of the legacy of powerful CAD tools
and structured design approaches created by the
VAX 9000 team.
As we have seen for the product, the
methodologies were not immune to change as the project
progressed. Working with rapidly evolving
technologies, design process experts continually
adapted to evolving user needs. Concurrent design
permeated every aspect of the project and
dominated the way people worked together, with many
aspects of the technology and product design
converging and adapting as we learned from our
own processes. When the manufacturing process
needed some help, designs could be reprocessed
with the new rules and rereleased to keep things
moving ahead.
And move ahead they did! Today, the VAX 9000
system is installed at many customer sites where the
systems are exceeding our original goals in both
performance and dependability. It has been
accepted by experienced, high-end computer users
as a bona fide mainframe - a mainframe with the
unique advantage of full integration with Digital's
rich distributed processing architecture.
The VAX 9000 system was created by engineers
working in many disciplines and collaborating
worldwide to invent hardware, software, and
processes that have significantly advanced the state
of the art of computer design, manufacture, and
service. The papers in this journal describe but a
few representative examples of the creativity and
determination of this large and dedicated team of
professionals.

David B. Fite, Jr.
Tryggve Fossum
Dwight Manley

Design Strategy for the
VAX 9000 System
The VAX 9000 system is Digital's newest high-end processor in the VAX family. This
paper describes the design strategy used to achieve high performance and shows how
RISC concepts were applied to a CISC architecture. New opportunities for parallelism
in VAX program execution were found by breaking the VAX instructions into simple
tasks which could be pipelined efficiently. By using independent, dedicated pipeline
stages, execution rates approach one instruction per cycle.

The task confronting the VAX 9000 design team
was to develop a VAX system that outperformed
any previous VAX system and that was competitive
with similarly sized processors from other
vendors. Although the VAX system is based on one
of the world's most popular computer architectures,
the VAX architecture's instruction complexities
preclude efficient macroinstruction pipelining,
such as that found in reduced instruction set
computers (RISC). RISC processors can be built with low
gate counts to handle simple, fixed-length instruction
sets, load/store architectures, and delayed
branching.
To compete with machines based on such
architectures and still remain compatible with the VAX
architecture, the design team chose to implement
the VAX architecture on the VAX 9000 system by
applying techniques that were similar to those used
in RISC processors. We redesigned the VAX instructions
into small, simple tasks, and designed
dedicated hardware that was optimized for each task.
The result is a network of specialized processors,
each of which has its own data paths and state
machines, that operate in parallel and execute
VAX instructions quickly. The most common,
simple instructions are executed at the rate of one
per cycle.

System Overview
The VAX 9000 system is a tightly coupled
multiprocessor, which runs the symmetric multiprocessing
(SMP) version of the VMS operating system and can
have up to four processors sharing a central main
memory. Figure 1 shows a simplified block diagram
of the system. The major system components
include four CPUs, two memory controllers, two
I/O controllers, and a service processor, which is
connected through the system control unit (SCU).
Through a cross-bar switch, the SCU provides
high-speed, simultaneous transfers among the central
processors, I/O devices, and memory banks. System
cache consistency is maintained with duplicate tag
directories located in the SCU. As references are
made to memory, the addresses are checked against
the tag directories. If a cache hit occurs, the cache in
question is requested to invalidate or write back to
main memory. The SCU supplies a bandwidth that
allows near linear performance improvement as
new processors are added to the system. The
memory is interleaved on cache block boundaries to
provide bandwidth for multiple CPUs and vector
processors.
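
The duplicate-tag check can be pictured with a small software model. The Python sketch below is illustrative only and does not describe the SCU hardware: the class and method names, the direct-mapped organization, the number of sets, and the 64-byte block size used for indexing are all assumptions made for the example.

```python
class DuplicateTagDirectory:
    """Toy model of SCU-side copies of each CPU cache's tags (hypothetical)."""

    def __init__(self, num_cpus=4, num_sets=1024, block_bytes=64):
        self.num_sets = num_sets          # assumed directory size
        self.block_bytes = block_bytes    # assumed cache block size
        # One direct-mapped tag array per CPU; None marks an empty entry.
        self.tags = [[None] * num_sets for _ in range(num_cpus)]

    def _index_and_tag(self, address):
        block = address // self.block_bytes
        return block % self.num_sets, block // self.num_sets

    def record_fill(self, cpu, address):
        """Remember that `cpu` now caches the block holding `address`."""
        index, tag = self._index_and_tag(address)
        self.tags[cpu][index] = tag

    def holders(self, requester, address):
        """CPUs other than `requester` whose caches hold the block."""
        index, tag = self._index_and_tag(address)
        return [cpu for cpu, array in enumerate(self.tags)
                if cpu != requester and array[index] == tag]

    def invalidate(self, cpu, address):
        """Drop the block from `cpu`'s directory copy once it has responded."""
        index, tag = self._index_and_tag(address)
        if self.tags[cpu][index] == tag:
            self.tags[cpu][index] = None


directory = DuplicateTagDirectory()
directory.record_fill(cpu=1, address=0x1000)
# CPU 0 references another byte of the same 64-byte block.
hit_cpus = directory.holders(requester=0, address=0x1010)
```

In this model a non-empty `hit_cpus` list corresponds to the case in which the cache in question would be asked to invalidate or write back its copy.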
Four XMI backplane buses provide high-bandwidth
paths to I/O devices. Although the XMI is used
as the system bus in VAX 6000 systems, the XMI is
used exclusively for I/O in the VAX 9000 system.
Several new adapters were designed to increase
throughput and reduce latency for I/O transactions.
These adapters include connections to the CI, the
NI, the BI, and local disk controllers. Although
high-performance I/O features, such as disk striping,
solid-state disk, and load balancing, have been added
to all VAX systems, the VAX 9000 system benefits the
most from these features because it has the I/O
backplane bandwidth to take advantage of them. A block
diagram of a single VAX 9000 CPU connected to the
SCU and the major data paths between the two units
is shown in Figure 2.¹

Technology Contributions to
Improved Performance
The central processor cycle time has been reduced
to 16 nanoseconds (ns) mainly by the use of fast
emitter-coupled logic (ECL) semiconductors and
fast self-timed random-access memories (RAMs) for
registers and caches, and by decreasing the
interconnect wire length between components.

Figure 1  VAX 9000 System Diagram

Motorola's Macrocell Array III (MCA3) technology
provided both macrocell array and standard cell
capabilities. The entire system is composed of 77
unique MCA3 options and 5 custom chip types. A
single MCA3 contains 838 cells (414 major, 224
input, and 200 output), which yield 10,000 equivalent
gates, and 256 I/O pins. Maximum power
dissipation is 30.0 watts, with unloaded gate
propagation delays of 120 picoseconds (ps).
Performance-critical operations, such as multiplication,
division, integer and vector register accesses, and
system clocking, were further aided by employing
custom chips.²
Caches for instruction stream and memory
data, scratch pad registers, and control stores all
require high-speed local storage. Two versions of
a proprietary self-timed RAM were designed for
these specific applications. A 4-kilobit (Kb)
self-timed RAM, at 5.5 ns, and a 16Kb self-timed RAM,
at 11.5 ns, provide internal input and output
latches and write pulse generation circuitry.
Multiple access modes allow highly pipelined operations
to take advantage of shorter access times.
Each new semiconductor generation reduces
cycle time, which increases the relative importance
of interconnect delay. High density signal carriers
(HDSC), tape automated bonding, and a single
planar module all reduce the interconnect delay
between active components in the VAX 9000
system. Strict impedance control is maintained
throughout the system. Clock skew is minimized by
employing fixed-length, differential transmission
and dedicated routing layers.

CAD Contributions to Improved
Performance
Hundreds of computer-aided design (CAD) tools
were used during the design and construction of
the VAX 9000 system. However, none of these tools
was more important in improving performance
than the physical layout and timing analysis tools.
Once the design team had placed large functional
sections, placement tools refined individual macro
cell selection and pin placements. Over 33,000 pins
were selected to minimize overall wire length and
maximize critical interconnections.
Routing presented several challenges. All levels of
interconnect included critical signals, differential
pairs, and fixed-length requirements. The HDSC
contains large cutouts that enable die attachment
and allow cooling through the back panel. These
large routing restrictions and special routing
characteristics could not be handled by existing
CAD tools. Therefore, we developed Chameleon,
a general-purpose router. With Chameleon,
crosstalk is minimized, and crossing counts are
maintained and used to increase signal integrity, which
improves performance.
To model the timing relationships within the
system, we used sophisticated CAD tools to
generate an accurate representation of the VAX 9000
system. Detailed timing models of each macrocell
device were created using the SPICE simulator
program.⁵ Chameleon and signal integrity tools
provided delay values for each signal within the
MCA3, HDSC, and planar modules. CPUDLY, using
the AUTODLY timing tool, tied the various pieces
together and gave the design engineers a powerful
view of the timing domain.

Instruction Processing
VAX systems exist in a variety of environments and
run thousands of applications. With any new,
high-performance VAX system, it is important to increase
the speed of all applications and to continue to
provide general-purpose computer power. Given
the size of the installed VAX base and the nature
of the applications, performance gains should not
require code modifications. Digital has gathered
substantial information on how VAX processors are
used. This data formed the basis for design
decisions and trade-offs we made in the development
of the VAX 9000 system.
Simple Instructions
In many VAX programs, only a few opcodes are
responsible for a large percentage of the instructions
issued. Most of these opcodes are simple and
limited to a single arithmetic or logical operation.
Often, one of the operands is in memory. A typical
example is
ADDL3 (R0),R1,R2

Because of the high frequency of these instructions,
speeding up these instructions is a top priority.
Most of the high performance achieved on RISC
processors is derived because these instructions are
pipelined. In a complex instruction set computer
(CISC), such as a VAX system, pipelining
macroinstructions is more complex. Therefore, previous
VAX implementations have pipelined operations at
the microinstruction level.
Processing simple instructions in a VAX system
involves obtaining and decoding the instruction,
fetching source operands, performing an operation,
and storing the result. The most important
difference between the way a VAX processor and a
RISC processor process simple instructions is how
the variable length instructions and memory
specifiers are handled. VAX operands may reside in
general-purpose registers (similar to RISC
operands), in memory, or may be embedded in the
instruction stream. The VAX architecture provides
a rich selection of memory operand specifiers,
which often require computations to create the
address. In a RISC processor, only load and store
instructions access main memory.

Figure 2  VAX 9000 CPU/Vector Block Diagram

The instruction preprocessing stage (I-box)
decodes instructions and fetches operands in the
VAX 9000 system. In the execution stage (E-box),
simple VAX instructions resemble RISC instructions.
A simple opcode describes the operation, a single
register file provides source operands, and a
destination queue supplies a result descriptor. The I-box
operates in parallel with the E-box, which
functions as a RISC processor by executing one
instruction each cycle. Execution occurs without the need
to identify the operand's source or addressing
complexity. Figure 3 illustrates how simple instructions
flow through the VAX 9000 pipeline. Although all
VAX implementations perform these tasks, the VAX
9000 implementation uses separate, independent
hardware units to overlap the work because
concurrent operation is a prerequisite for single-cycle
instruction execution.
Instruction Cache
We used an instruction cache in the I-box to
decrease instruction stream fetch latency and
reduce the bandwidth requirements on the main
cache. Choosing a virtually addressed cache further
reduced latency and simplified the design by
removing the need for duplicate translation buffers.
The virtual instruction cache is an 8 kilobyte (KB)
cache with a quadword line size, 32-byte blocks,
and a single-cycle access time. Line valid bits are
maintained to allow variable size fills from the main
data cache. Because the average VAX code block size
is 16 to 20 bytes, the block size of the virtual
instruction cache provides a good balance between the
instruction decode stage and the main cache.
Context switches, translation buffer changes, and
instruction stream modifications all require that the
virtual instruction cache be invalidated. Two
complete sets of block valid bits reduce cache sweeps to
a single cycle, if consecutive sweeps do not occur
within 256 cycles of each other. Block size and
frequent sweeping reduce the virtual instruction
cache's hit rate to approximately 96 percent, but by
filling through the main cache, the miss penalty is
minimized.

Instruction Decode
Because the majority of instructions executed
require only a single cycle to execute, the
instruction decode's task of keeping ahead of the E-box is
not simple. Most instructions must be decoded in a
single cycle to keep the VAX 9000 system's
ticks-per-instruction (tpi) low.
For example, VAX instructions may contain up to
six operand specifiers. With 59 different specifier
addressing modes, instruction lengths can vary
from a single byte to more than 50 bytes. However,
the overall average VAX instruction length is 3.8
bytes, and 98 percent of instructions require only
8 or fewer bytes. Furthermore, 96 percent of VAX
instructions executed use only 3 or fewer specifiers.
In each machine cycle, a 9-byte instruction buffer
is presented to the decode stage (XBAR). The
instruction buffer contains instruction stream data
prefetched from the virtual instruction cache.
Instruction decoding consists of generating an
initial microaddress, determining the number of
specifiers for the instruction, including each
specifier access mode and data type, and forwarding the
appropriate specifier data to the operand
processing stages. The XBAR can handle up to three
specifiers. Instructions that contain more than three
specifiers require additional decode cycles. Since
general-purpose register specifiers occur
approximately 41 percent of the time, three register
specifiers can be processed concurrently. Short literals
comprise nearly 16 percent of the specifiers.
However, the XBAR can only decode a single short literal
per cycle. The remaining specifiers must all be
processed by the operand processing unit, which
decodes a single complex specifier per cycle. Unlike
preceding processors, the XBAR handles multiple
specifiers in any order. Table 1 shows the number of
decode cycles required for several VAX processors.

Table 1  Decode Cycles Required
Columns: Instruction, VAX-11/780, VAX 8650, VAX 9000
Instructions listed:
ADDL3   R3,R5,R7
MULF3   S^#48,R4,@(R2)[R3]
AOBLEQ  S^#63,R10,10$

Figure 3  The VAX 9000 Instruction Pipeline
Operations shown for each cycle: PC generation, VIC access,
instruction decode, specifier processing, translate address,
data cache access, multiply unit execution, floating unit
execution, integer unit execution, and retire.
Instruction sequence shown:
LOOP:   MULF3   (R0),#2.5,R1
        MULF3   4(R0),#3.5,R2
        MULF3   8(R0),#4.5,R3
        ADDF3   R1,R2,R4
        ADDL2   #^XC,R0
        ADDF3   R3,R4,(R5)+
        SOBGEQ  R6,LOOP

Operand Prefetching
Because most simple instructions are decoded and
executed in a single cycle by various pipeline stages,
instruction operands also must be handled in a
single cycle. Multiple, specialized operand units
increase operand processing throughput. From one
to three register operands may be forwarded to the
E-box by one register unit per cycle. A dedicated
short literal unit expands all VAX data formats. The
operand processing unit performs complex address
calculations and requests memory operand data
from the cache unit (M-box). Both the operand
processing and short literal units can perform
multiple-cycle operations.

Load/Store Architecture
Load/store architectures separate memory accesses
from computation. Loads can be scheduled to place
arriving memory data at a functional unit just as the
operation begins. To achieve this effect with VAX
instructions, memory specifiers are treated as
load/store instructions. VAX memory specifiers describe
the effective addresses of memory operands. VAX
memory specifiers do not contain the source and
destination registers that are specified in RISC
load/store instructions. Rather, the VAX 9000 system
assigns temporary register file locations to buffer
memory data. By processing specifiers early in the
pipeline, data can be scheduled to arrive at the
appropriate time.
Memory specifiers act as independent instructions
executed in the operand processing unit. This
unit creates the operand's effective address and
forwards it to the M-box. For loads, the actual memory
data is returned to the E-box register file. The
translated physical address is saved in a queue of write
addresses for store/destination specifiers. When
execution results arrive from the E-box, the
previously saved address is used to write the data into
the cache.
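
The per-cycle decode limits given under Instruction Decode (at most three specifiers, at most one short literal, and at most one complex specifier per cycle) can be turned into a rough cycle count. The Python sketch below is a simplified model, not the XBAR's actual algorithm: the function name, the three specifier categories, and the rule that specifiers are consumed strictly in instruction order until a limit is reached are assumptions made for illustration, and its results are not the Table 1 figures.

```python
# Per-cycle limits taken from the text; the category names are hypothetical.
LIMITS = {"register": 3, "literal": 1, "complex": 1}
MAX_SPECIFIERS_PER_CYCLE = 3


def decode_cycles(specifiers):
    """Estimate decode cycles for a list of specifier kinds, in order."""
    cycles = 0
    position = 0
    while position < len(specifiers):
        used = {kind: 0 for kind in LIMITS}
        taken = 0
        # Consume specifiers until a per-cycle limit would be exceeded.
        while position < len(specifiers) and taken < MAX_SPECIFIERS_PER_CYCLE:
            kind = specifiers[position]
            if used[kind] == LIMITS[kind]:
                break
            used[kind] += 1
            taken += 1
            position += 1
        cycles += 1
    # An instruction with no specifiers is assumed to need one cycle.
    return max(cycles, 1)


three_registers = decode_cycles(["register", "register", "register"])
mixed = decode_cycles(["literal", "register", "complex"])
two_literals = decode_cycles(["literal", "literal", "register"])
```

Under these assumptions an instruction with three register specifiers, or with one specifier of each kind, decodes in one cycle, while a second short literal forces a second cycle.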
Conflict Detection and Resolution
Macropipelining in the VAX 9000 system relies on
autonomous units operating in parallel. Each
independent unit is optimized for an individual task.
However, macropipelining does require that
mechanisms be added to resolve data dependencies
among instruction processing units. Data conflicts
occur when an instruction's results are required by
an earlier pipeline stage. An addressing data conflict
appears in the following example:

MOVL R0,R1
MOVB TABLE(R1),R2

Any dedicated address calculating hardware must
wait for the MOVL instruction results before
performing the MOVB instruction's effective address
computation. A memory conflict is another form
of data dependency.
In the following example,

MOVB R0,(R1)
MOVB (R2),R3

a prefetch unit could read the second instruction's
source operand while the E-box writes the first
instruction's results, if the values of registers R1 and
R2 are different. However, when the registers
contain identical values, the read must be delayed until
the write occurs. The VAX 9000 system uses several
different mechanisms to detect and resolve data
dependencies. Passing pointers, scoreboard masks
within the I-box, the write queue in the M-box, and
architectural restrictions are all used to handle
various conflicts.

Register Conflicts
The simplest hardware mechanism
employed in the VAX 9000 system is the use of
pointers to reference data. The operand processing
unit oversees a 16-entry source queue, an 8-entry
destination queue, and a 16-entry source list. A
single pointer is inserted into the source queue for
each source specifier. The pointer represents either
a register number, in the case of general-purpose
register operands, or a tag that indicates an entry in
the source list where the operand data is located. A
pointer is added to the destination queue for each
destination. This pointer represents a register
number or a flag which indicates that the result should
be written to memory.
The instruction issue unit removes source
pointers from the source queue. These pointers are
used to address either the general-purpose registers
or source list for the actual source data. Destination
pointers from the destination queue determine
where results should be written. Register conflicts
can be detected by comparing the source pointers
needed to issue an instruction with all issued
destination pointers in the destination queue. For
example, in Figure 4, the MULL3's R0 source queue entry
would match the ADDL3's R0 destination queue
entry. A write to the general-purpose registers by
the E-box removes the destination queue entry, and
the instruction issue can resume.

Figure 4
Instructions shown: ADDL3 R1,R2,R0 and MULL3 (R3),R0,(R4)
Structures shown: SRCQ (R1, R2), DSTQ (R0, MEM), SLIST (DATA)
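
The pointer comparison used for register conflicts can be sketched in a few lines of Python. This is a toy model under stated assumptions: the function name, the use of strings such as "R0" for register pointers, tuples for source-list tags, and the "MEM" flag value are all hypothetical representations chosen for the example.

```python
from collections import deque

MEM = "MEM"  # destination flag: the result goes to memory, not a register


def has_register_conflict(source_pointers, issued_destinations):
    """True if any source pointer matches a pending register destination."""
    pending = {dest for dest in issued_destinations if dest != MEM}
    return any(source in pending for source in source_pointers)


# ADDL3 R1,R2,R0 has issued; its destination pointer is still queued.
destination_queue = deque(["R0"])
# MULL3 (R3),R0,(R4): a source-list tag for (R3) plus a register pointer.
mull3_sources = [("SLIST", 0), "R0"]

blocked = has_register_conflict(mull3_sources, destination_queue)
destination_queue.popleft()  # the E-box writes R0, removing the entry
released = not has_register_conflict(mull3_sources, destination_queue)
```

Source-list tags never match a register destination in this model, which reflects the idea that prefetched memory data is buffered separately from the general-purpose registers.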

Addressing Conflicts
To resolve addressing data
conflicts, the I-box maintains a read/write register
scoreboard. Two register masks are created for
each instruction decoded. The first register mask
denotes the general-purpose registers that the E-box
will read for the instruction, and the second register
mask specifies the general-purpose register writes.
Each bit in these register masks refers to a single
VAX general-purpose register. Specifiers that are
being processed in the operand processing unit are
checked against up to six previous instruction
masks. From the first example above, the specifier
[TABLE(R1)] requires that the operand processing
unit read R1. If the R1 bit is asserted in any
preceding instruction's scoreboard write masks, this
effective address calculation must be deferred.
The VAX architecture presents a unique
addressing conflict problem because some specifiers,
such as -(Rn) and (Rn)+, modify general-purpose
registers.
In the following example,

SUBL2 R0,R1
ADDL2 (R0)+,R2

the (R0)+ specifier modifies the contents of R0.
Therefore, the operand processing unit cannot
update the general-purpose register without
affecting the prior instruction. The read masks are used
to detect this type of conflict. All specifiers that
modify general-purpose registers must check the
scoreboard read masks before proceeding with
the instruction. Thus, when a conflict occurs, the
general-purpose register modification stalls.
When an instruction completes execution, the
instruction's read/write mask is removed from the
scoreboard. In all addressing conflicts, specifier
processing continues once the blocking mask is
removed.
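
The read/write scoreboard lends itself to a bit-mask illustration. The following Python sketch is an assumption-laden model rather than the I-box logic: the class and method names are invented for the example, registers are identified by number (R0 is 0, R1 is 1), and a full scoreboard simply raises an error where real hardware would stall.

```python
def mask(*registers):
    """Build a register mask with one bit per general-purpose register."""
    value = 0
    for register in registers:
        value |= 1 << register
    return value


class Scoreboard:
    def __init__(self, depth=6):
        self.depth = depth   # up to six previous instruction masks
        self.entries = []    # (read_mask, write_mask), oldest first

    def add(self, reads, writes):
        if len(self.entries) >= self.depth:
            raise RuntimeError("scoreboard full; decode would have to stall")
        self.entries.append((mask(*reads), mask(*writes)))

    def retire_oldest(self):
        """Remove the masks of the instruction that completed execution."""
        self.entries.pop(0)

    def must_defer_address_read(self, register):
        """An address calculation reading `register` waits on pending writes."""
        bit = 1 << register
        return any(write & bit for _, write in self.entries)

    def must_stall_register_update(self, register):
        """A -(Rn) or (Rn)+ update waits while earlier reads are pending."""
        bit = 1 << register
        return any(read & bit for read, _ in self.entries)


scoreboard = Scoreboard()
scoreboard.add(reads=[0], writes=[1])             # MOVL R0,R1
deferred = scoreboard.must_defer_address_read(1)  # TABLE(R1) needs R1
scoreboard.retire_oldest()
scoreboard.add(reads=[0, 1], writes=[1])          # SUBL2 R0,R1
stalled = scoreboard.must_stall_register_update(0)  # (R0)+ modifies R0
```

Once `retire_oldest` removes the blocking entry, both checks return False and specifier processing can continue.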
Memory Conflicts
The write queue is used to
resolve memory conflicts. Physical addresses,
received from the translation buffer, are inserted
into an eight-entry FIFO. These addresses are later
paired with the proper write data from the E-box
and written into the M-box. To avoid prefetching
stale data, all memory addresses for source memory
operands are translated and compared with the
addresses in the write queue. When no address
conflict occurs, the data from memory is forwarded
to the source list. Operand requests that conflict
with a pending write address are stalled until the
conflict is resolved. The conflict is resolved when
the appropriate write data is received. The
conflicting address is then removed from the write queue.
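
A small Python model makes the write queue's behavior concrete. It is a sketch only: the class and method names are hypothetical, a dictionary stands in for the cache, and addresses are compared for exact equality, whereas the granularity of the hardware comparison is not stated in the text.

```python
from collections import deque


class WriteQueue:
    def __init__(self, capacity=8):
        self.capacity = capacity   # eight-entry FIFO, as in the text
        self.pending = deque()     # translated addresses awaiting data

    def reserve(self, physical_address):
        """Queue a destination address; False means the queue is full."""
        if len(self.pending) >= self.capacity:
            return False
        self.pending.append(physical_address)
        return True

    def read_must_stall(self, physical_address):
        """A source operand read conflicts with a pending write address."""
        return physical_address in self.pending

    def complete_oldest(self, data, memory):
        """Pair arriving E-box data with the oldest address and write it."""
        address = self.pending.popleft()
        memory[address] = data
        return address


memory = {}
queue = WriteQueue()
queue.reserve(0x200)                        # MOVB R0,(R1) with R1 = 0x200
stalled = queue.read_must_stall(0x200)      # MOVB (R2),R3 with R2 = 0x200
independent = queue.read_must_stall(0x204)  # a different address proceeds
queue.complete_oldest(data=7, memory=memory)
resumed = not queue.read_must_stall(0x200)
```

The example mirrors the MOVB pair discussed earlier: the read stalls only when R1 and R2 hold the same address, and it resumes once the write data has arrived.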
Miscellaneous Conflicts
The VAX architecture
includes instructions with operands that either are
not known when the instruction is decoded (e.g.,
INSQUE, MTPR), or modify large portions of
memory (e.g., MOVC5). To avoid conflicts from these
instructions, the I-box suspends processing
memory specifiers until the instruction execution is
completed. Self-modifying code presents another
form of conflict, which is solved by an REI
instruction that notifies the hardware of this condition.

Branch Instructions
Branch instructions have a substantial influence on
the overall performance of a VAX processor. On
average, a VAX processor executes 3.9 instructions,
including the branch, before a branch starts a new
instruction sequence. Instructions that modify the
program counter represent nearly 40 percent of the
total instructions executed. The VAX 9000 system
uses a 1024-entry branch cache and a two-tiered
prediction scheme to increase the average code
block size and reduce the branch-taken latency.
Unlike its predecessors, the VAX 9000 system
commits all its resources to a single branch path. The
prediction hardware selects the path of execution
to resolve memory conflicts for those branch
instructions that are decoded before results are
available. This path selection is based on prior
history, if the branch hits in the branch cache. If the
branch does not hit in the branch cache, the path
is predicted statically, based on the instruction's
opcode. When the branch executes, the prediction
is compared to the actual results. The pipeline is
flushed back to the correct code path if the branch
prediction was incorrect.
The entries in the branch cache store the branch
results of the previous execution of the branch and
the target address, if the branch was taken. Because
the branch cache is a one-way associative cache that
can store only 1024 entries, the results have an
average hit rate of approximately 80 percent. However,
correct predictions occur 85 percent of the time
from the cache, as opposed to an average hit rate of
56 percent, when the predictions are based solely
on opcode. Loop branches are always predicted
as taken, which increases the overall correct
prediction rate to close to 89 percent. By caching
branch targets, the calculation may be avoided and
a one-cycle branch-taken latency is
achieved. The branch cache can store a sufficient
amount of branch context to eliminate the need
to sweep the cache.
The I-box can process instructions with up to
two conditional branches outstanding.
Unconditional branches (e.g., BSBW, BRB) are processed as
ordinary instructions by simply changing the
instruction flow. To reduce the penalty for a bad
prediction, which results in a four-cycle penalty,
operand specifiers that modify general-purpose
registers are not processed under a branch
prediction and cause the operand processing unit to stall.
Also, branch instruction execution is overlapped
with the previous instruction to provide the actual
branch results earlier.
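
The two-tiered prediction scheme can be sketched as a one-way (direct-mapped) history cache backed by a static, opcode-based rule. The Python model below is hypothetical in its details: indexing by the low bits of the PC, storing the full PC as the tag, and predicting non-loop branches as not taken on a miss are assumptions for the example. Only the 1024-entry size, the history-then-opcode order, and the always-taken rule for loop branches come from the text.

```python
CACHE_ENTRIES = 1024
# VAX loop-type branch opcodes, treated here as "always predicted taken".
LOOP_OPCODES = {"SOBGEQ", "SOBGTR", "AOBLEQ", "AOBLSS", "ACBL"}


class BranchPredictor:
    def __init__(self):
        # Each entry is (pc, taken, target) or None when empty.
        self.cache = [None] * CACHE_ENTRIES

    def predict(self, pc, opcode):
        """Return (predicted_taken, cached_target_or_None)."""
        entry = self.cache[pc % CACHE_ENTRIES]
        hit = entry is not None and entry[0] == pc
        if opcode in LOOP_OPCODES:
            return True, (entry[2] if hit else None)
        if hit:
            return entry[1], entry[2]   # first tier: prior history
        return False, None              # second tier: assumed static rule

    def update(self, pc, taken, target):
        """Record the actual outcome once the branch has executed."""
        self.cache[pc % CACHE_ENTRIES] = (pc, taken, target if taken else None)


predictor = BranchPredictor()
cold = predictor.predict(0x100, "BEQL")           # miss: static prediction
predictor.update(0x100, taken=True, target=0x180)
warm = predictor.predict(0x100, "BEQL")           # hit: history and target
aliased = predictor.predict(0x100 + CACHE_ENTRIES, "BEQL")  # tag mismatch
loop = predictor.predict(0x300, "SOBGEQ")         # loop branch: taken
```

Because each PC maps to exactly one entry, two branches whose addresses differ by a multiple of the cache size evict one another, which is one reason a cache of this kind cannot reach a 100 percent hit rate.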


Compute-intensive Instructions
Compute-intensive instructions require multiple
execution stage cycles. Common examples of these
instructions are multiplication, division, and floating
point operations. All VAX implementations
employ dedicated logic for compute-intensive
instructions that occur frequently. Less frequently
used instructions depend on microcode-controlled
arithmetic and logical data paths. The VAX 9000
system contains four independent execution processors.
The integer, floating point, multiply, and
divide units execute the VAX instruction set. The
I-box preprocesses instructions, which allows
instruction execution to overlap in these units. In
each cycle, a new instruction can be initiated in
the appropriate unit prior to the completion of
previous instructions. The floating point and multiply
units are pipelined and can accept one instruction
each cycle. The integer unit is pipelined for
simple instructions. However, complex instructions
must use microcode control to perform multicycle
operations.

Pipelined instructions are issued in order and
proceed through the data path without further
microcode control. Upon completion, instruction
results are retired in the same instruction order. The
instructions must be processed in order because the
result of one operation is often needed in a subsequent
operation. Therefore, the pipelines must be
short and contain data bypasses to make results
available quickly. The multiply, float, and divide
units' internal data paths are 64 bits wide. To understand
how the pipelined and overlapped operations
apply to the following operation,

    Y(i) = Y(i) + C * X(i)

consider the program:

    LOOP:   MULG3   R6, (R0)+, R4
            MULG3   R6, (R0)+, R2
            ADDG2   R4, (R1)+
            ADDG2   R2, (R1)+

The two MULG3/ADDG2 instruction pairs prevent
a pipeline stall that could occur because of data
dependencies. The instructions further reduce the
loop overhead, which is already fairly small
because the loop control instruction was predicted
correctly. Instructions and source operands are
prefetched. The multiply and add units accept the
instructions as they become available. The memory
references are made as the operand processing unit
processes memory specifiers. The majority of specifier
processing is performed independently of the
instruction execution.

Memory-intensive Instructions
Some VAX instruction classes are primarily memory
operations that require only minor computation.
Typical examples of these instructions are character
string, decimal, and privileged operating system
instructions. Pipelined execution offers little advantage to
memory-intensive instructions because the number
of memory references is not reduced as the number
of cycles required for execution is reduced by new
implementations. Because memory bandwidth is
critical, the VAX 9000 system provides features to
benefit these instructions.

For example, the virtual instruction cache services
most instruction stream references, which
frees the main cache to service prefetched operand
references. Both the virtual instruction cache and
the main cache have 64-bit data paths, important
for character string operations and extended precision
arithmetic. The caches are fully pipelined
and allow one read per cycle. The main cache block
size is 64 bytes, exploiting spatial locality. When
cache references do miss, data is wrapped and the
most critical data is returned first. A write back,
write allocation algorithm further reduces main
memory and cache bandwidth requirements and
reduces latency.

The VAX system is a virtual memory architecture.
Virtual addresses need to be translated to physical
addresses through page tables in memory. A translation
buffer caches the most recently used page
table entries. VAX systems, such as the VAX-11/780
system, process translation buffer misses in microcode,
which can be time-consuming. However, the
VAX 9000 system uses a memory management processor
to process translation buffer misses as part
of instruction preprocessing. This operation is performed
early in the pipeline and is faster than
microcode.

The CALL and RETURN instructions push and pop
registers on the stack, and these instructions can
be memory-bound. The VAX 9000 system contains
both the control logic and the bandwidth to process
these registers at a rate of one per cycle.

Unconventional Instructions
Special, dedicated hardware was added to the
VAX 9000 system to process those VAX instructions
that did not fit into the categories listed above. The
additional hardware operates within the pipeline
architecture and cycle time, and the cost of adding
the hardware was minimal.

In the following example,

    MOVL R0, -(SP)  <---------->  PUSHL R0

the MOVL and PUSHL instructions perform identical
operations, but the PUSHL instruction does not
explicitly specify a destination address. On previous
VAX systems, the instruction prefetching
would stall until the current instruction execution
was completed. However, the VAX 9000 modifies
such instructions during the decode stage by
adding the implied specifiers. The benefits of this


enhancement are more evident in the following
instructions.

    BSBW 10$  <---------->  MOVAL Return_PC, -(SP)
    RSB       <---------->  JMP @(SP)+
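
The decode-stage rewriting described above can be sketched as a small lookup table. This is only an illustrative model, not the hardware mechanism: the table name `IMPLIED_FORMS` and the function `add_implied_specifiers` are hypothetical, and the three entries simply restate the equivalences listed in the text (the transfer of control to the BSBW target is not modeled).

```python
# Hypothetical model of the decode stage adding implied specifiers.
# Each entry maps an instruction with implicit operands to the explicit
# form given in the text; "{0}" stands for the first written operand.
IMPLIED_FORMS = {
    "PUSHL": ("MOVL", ["{0}", "-(SP)"]),
    "BSBW": ("MOVAL", ["Return_PC", "-(SP)"]),
    "RSB": ("JMP", ["@(SP)+"]),
}


def add_implied_specifiers(opcode, operands):
    """Return (opcode, operands) with implied specifiers made explicit.

    Instructions that are not in the table are returned unchanged.
    """
    if opcode not in IMPLIED_FORMS:
        return opcode, list(operands)
    new_opcode, template = IMPLIED_FORMS[opcode]
    return new_opcode, [spec.format(*operands) for spec in template]


if __name__ == "__main__":
    for op, args in [("PUSHL", ["R0"]), ("RSB", []), ("ADDL2", ["R7", "(R8)"])]:
        print(op, args, "->", add_implied_specifiers(op, args))
```

Once every specifier is explicit, later pipeline stages can treat PUSHL like any other instruction with a stack-relative destination, which is why prefetching no longer has to wait.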

Similarly, instructions such as LOCC and CMPC3
implicitly reference the general-purpose registers.
The instruction decode stage creates a read/write
mask with these references, which allows instruction
prefetching to continue.

To aid handling instructions like PUSHR and
CALL, the integer execution unit contains special
bit mask manipulation hardware, which optimizes
general-purpose register saves and restores.

The VAX instruction set contains variable-length,
bit-field instructions that handle non-byte data.
These instructions can reference memory within a
512 megabyte (MB) range. The field referenced is
within the first 8 bytes of the base address more
than 95 percent of the time. Therefore, to allow
instruction prefetching to continue, the operand
processing unit assumes that the field is within the
initial quadword and requests that data. If, during
execution, the field destination actually resides outside
the prefetched quadword, the correct data is
fetched and the pipeline is flushed to avoid potential
memory conflicts.

Integrating Vector Processing
The VAX 9000 project team was instrumental in
integrating vector operations and data types into
the VAX architecture. For many scientific applications,
the use of vectors improves performance in
three ways:

• Vector instructions specify many operations in
  a single opcode, which eliminates instruction
  stream decode as a processing bottleneck.

• Vector registers increase available local storage.

• Vector registers support high peak performance
  through high bandwidth and short access
  latency.

The VAX vector architecture implements a load/
store architecture, which permits the hardware to
deal with large pieces of memory in a uniform
manner and increases the use of parallelism.

We added the vector instructions and data types
to the VAX architecture in an integrated fashion.
Scalar and vector instructions are mixed throughout
the pipelines. Systems that do not include vector
processors emulate vector instructions with software,
a technique especially useful for program
development.

Logical Integration
The VAX 9000 vector processor connects to the
scalar CPU as an additional functional execution
unit. Vector instructions are processed, and
operands are stored, in queues, the same as are
scalar instructions. As instructions are issued, a control
word is sent with instruction operands to the
vector processor. The processor contains vector
registers and arithmetic units. Addresses for load,
store, gather, and scatter operations are also generated
by the vector processor. Vector data is stored in
the main cache, and both the scalar and vector processors
have fast, shared access to that data.

Physical Integration
The VAX 9000 scalar and vector processors reside
on a single planar board. Three multichip unit slots
are reserved for the optional vector processor,
which is field-installable. The integration of the vector
processor directly with the scalar processor
keeps critical interconnects short and reduces vector
instruction overhead.

Error Handling
Reliability, availability, and integrity are critical factors
in a high-performance computer system. These
factors are affected by the quality of the physical
design (i.e., worst-case design), effective cooling,
redundant power supplies, and quality controls
during manufacture. Still, failures are possible, and
the VAX 9000 design had to deal effectively with
errors.

Error handling in the VAX 9000 system has two
main goals:

• Minimize system service disruption from individual
  failures

• Maximize the failure information collected for
  use in preventive and corrective maintenance

A large percentage of hardware failures are intermittent,
and many solid hardware failures start as
intermittent. The VAX 9000 system was designed to
recover from these failures and to use the failure
data to predict (and prevent) future problems.

To gather information effectively, VAX 9000 storage
elements (i.e., latches, flip-flops, and RAM cells)
are visible to the service processor unit through a
serial diagnostic bus. Most state information that
is relevant to isolate the failing component is available
for error analysis programs that can be run at
a convenient time. The result of this processing is
then used to isolate the failing components for
quick repair.


To access the storage elements through the visibility
chain, the system clocks must be disabled,
which disrupts the system operation for a period
of time. The error may also have affected the execution
of the instructions in the pipeline. Error
handling minimizes these disruptions by making
them invisible to the users almost all the time.

The macroinstruction is the unit of execution in
a program that is visible to the user. Between
instructions, the program state is clearly defined
in terms of memory contents and register values.
Interrupts and exceptions are handled between
instructions to save this state in an orderly fashion.
It is important to handle errors the same way.

Two problems arose in trying to provide the
same method of error handling. First, instructions
go through many stages in a pipelined computer,
and several instructions will be in progress at the
same time. It is difficult to identify a beginning
and end for each instruction. Second, even when
boundaries are established, errors can occur at any
time and the errors do not automatically line up
with instruction boundaries.

To solve this, we made the E-box the point of synchronization
between error handling and instruction
execution. In the instruction execution model,
the E-box accepts operands, then computes and
delivers results for storage. If an error occurs that
directly affects one of these steps, the error is
synchronous to the execution of that instruction.
Asynchronous errors do not directly affect any of
these steps and are treated as interrupts, i.e., processed
after the E-box completes an instruction but
before it starts another instruction.

A synchronous error causes a trap to occur in
the E-box when the E-box requests data from the
subsystem with the error. Since such data can be
unavailable as a result of virtual access problems,
the E-box is ready to deal with exceptions at
that time, and errors can use the same pipelined
mechanism.

We do not differentiate between those synchronous
errors that affect computation in the
E-box and those that do not. Instead, if the program
visible state of the machine has not been modified,
the instruction is backed up to the beginning
and restarted. Performing this task is not a problem,
since the state is normally not changed until
the result is stored at the end of the instruction.
Errors occurring in early pipeline stages are easily
recoverable. In a few cases, memory and registers
could have been modified early and, as a result,
be affected by the error. Status flags indicate if this
has happened.
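
The decision flow in the preceding paragraphs can be summarized in a short sketch. The names `Action` and `classify_error` are hypothetical, and the third outcome is an assumption: the text says only that status flags record early modification, not what the recovery software does next.

```python
from enum import Enum


class Action(Enum):
    # Asynchronous error: handled between instructions, like an interrupt.
    DEFER_AS_INTERRUPT = "process at the next instruction boundary"
    # Synchronous error, program-visible state untouched: retry.
    RESTART_INSTRUCTION = "back up to the start of the instruction and restart"
    # Synchronous error after an early state change (assumed outcome).
    CHECK_STATUS_FLAGS = "state was modified early; consult the status flags"


def classify_error(affects_ebox_step, state_modified):
    """Pick a handling path for one detected error.

    affects_ebox_step: the error directly affects operand delivery,
        computation, or result storage for the current instruction.
    state_modified: memory or registers were already changed by it.
    """
    if not affects_ebox_step:
        return Action.DEFER_AS_INTERRUPT
    if not state_modified:
        return Action.RESTART_INSTRUCTION
    return Action.CHECK_STATUS_FLAGS
```

The sketch makes the point of the text explicit: whether the error actually corrupted a computation is never tested, only whether program-visible state has changed.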

By getting to an instruction boundary, the clocks
can be stopped in an orderly fashion, and the state
can be read out, including temporary data to be
used for failure analysis. The machine can be reset
to start processing at the instruction boundary once
the clocks are started again.

While the clock is stopped, the CPU cannot interact
with other subsystems or I/O processors. To
keep these functions from being blocked and possibly
timing out, we only stop the clock to the CPU in
error, not all the clocks in the system. We also
sweep the cache of written data before the clock is
stopped, and I/O interrupts are directed to other
CPUs in a symmetric multiprocessing system.

Performance Modeling
When multiple features are added to a CPU design
to individually enhance performance, some of
those features can interact negatively with each
other to decrease performance. Therefore, we
designed a performance model to help us evaluate
the performance of the design and make trade-offs
where necessary. Although instructions were not
executed on the model, it is an accurate cycle-by-cycle
model of the system for most instruction operations.
Equally important, the model was written at
a high level, which made it easy to modify and use
to experiment with different features before they
were added to the design.

Cycle Time
A perennial CPU design issue is the trade-off
between cycle time and cycles per instruction. In
a VAX system, the cycle time is often limited by the
RAM speed in the control store and cache. We modeled
a machine at 8 ns and one at 16 ns for the VAX
9000 system. At 8 ns, the pipelines became longer.
Although the peak throughput almost doubled,
the model showed that the net performance gain
did not offset the risks associated with the shorter
cycle time.

I-stream Synchronization
The VAX architecture requires that changes to the
instruction stream be synchronized with an REI
instruction. This synchronization makes it easier to
implement an instruction cache that is separate
from the main cache. To synchronize, either all
memory writes can be watched or the I-cache can
be cleared on every REI. The first alternative entails
high hardware costs, and the second can affect
performance. However, the model showed us that
the performance impact would be minimal if the
I-cache was refilled from the main cache rather than
from main memory because the critical parameters
were the main cache bandwidth and the I-cache
invalidation time, rather than the refill latency.

Branch Prediction
The branch prediction scheme used in the
VAX 9000 system was analyzed in great detail.
We investigated the use of multiple history bits to
improve the effectiveness of branch prediction.
In all cases, the use of extra bits provided less than
a 1 percent improvement in system performance.
Furthermore, no multiple bit scheme could be
implemented without increasing cycle time
because multiple history bit branch prediction
schemes update status each time a branch is
encountered. Therefore, we chose to use a single
bit technique in the VAX 9000 design. Unlike multiple
bit schemes that read and write history bits
for each branch instruction encountered, the single
bit technique updates the history bit only when the
prediction is wrong. The single-bit scheme is both
faster and simpler.
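
A minimal sketch of the single-bit technique follows. It shows only the update rule (the history bit is written only on a misprediction). Everything else is an assumption made for illustration: the class name `SingleBitBranchCache`, indexing by program counter modulo the table size, and an initial prediction of not-taken. The 1024-entry default matches the branch cache size quoted earlier in the article, but the real branch cache also holds tags and target addresses, which are omitted here.

```python
class SingleBitBranchCache:
    """Toy single-history-bit branch predictor (illustrative only)."""

    def __init__(self, entries=1024):
        self.entries = entries
        self.taken = [False] * entries  # one history bit per entry
        self.writes = 0                 # counts history-bit updates

    def _index(self, pc):
        # Assumed indexing scheme: low-order bits of the branch address.
        return pc % self.entries

    def predict(self, pc):
        return self.taken[self._index(pc)]

    def resolve(self, pc, actually_taken):
        """Record the real outcome; return True if the prediction was right."""
        correct = self.predict(pc) == actually_taken
        if not correct:
            # The history bit is written only when the prediction was wrong.
            self.taken[self._index(pc)] = actually_taken
            self.writes += 1
        return correct


if __name__ == "__main__":
    cache = SingleBitBranchCache()
    outcomes = [True] * 9 + [False]     # a loop branch: taken 9 times, then exits
    hits = sum(cache.resolve(0x2000, taken) for taken in outcomes)
    print(hits, "correct predictions,", cache.writes, "history-bit writes")
```

For a loop-closing branch the history bit is written twice (on entry and on exit) regardless of the iteration count, whereas a multiple-bit counter would be read and written on every iteration.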

We also used the performance model as a verification
tool. The model provided us with early
warnings when a feature did not function in the
model, or when the cycle count differed from the
count in the gate-level simulation. For example,
from the model, we became aware of problems in
the design of how conflicts between instructions
in specifier processing were handled. Periodically,
we compared the performance model to the logical
model. Both models were subjected to the same
instruction sequences. Deviations of more than
±5.0 percent were investigated. Some design bugs
were found that did not affect the results of the program
but which did keep performance features
from working properly. The average deviation was
on the order of ±1.0 percent.

Performance tests are among the first programs
run on a functional prototype. The VAX 9000 system
performed almost as expected. Table 2 compares
the actual performance of a VAX 9000 system
to its predicted performance for a small sample of
modeled programs. The accuracy of the predictions
highlights the increasing importance of models in
the modern engineering process.

Cache Parameters
The main data cache was accurately modeled. The
VAX 9000 system uses a first-in first-out (FIFO) block
replacement scheme. The performance model predicted
that a true least recently used replacement
policy would provide an insignificant improvement
in performance over the FIFO method. Also, a true
least recently used policy requires that status be
read and written for each cache access. In contrast,
the FIFO replacement policy updates status
only when a cache miss has occurred. Further, the
update can be done in parallel with the writing of
data into the cache block. Although the 128-byte
cache block provided a better cache hit rate, we chose
the 64-byte block because it produced better system
level performance.

We chose two-set associativity because the model
clearly indicated that performance would degrade
with a direct-mapped scheme. The model also predicted
that a four-way set associative cache would
not improve performance enough to justify the
extra hardware, design complexity, and cycle time
penalty.
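
The difference between FIFO and least recently used replacement can be seen in a small sketch of a two-way set associative cache with 64-byte blocks. The class `TwoWayFifoCache`, the tiny set count used in the example, and the `status_updates` counter are hypothetical and exist only to show that the FIFO pointer is touched on misses and never on hits. The sketch tracks tags only, not data.

```python
class TwoWayFifoCache:
    """Toy two-way set associative cache with FIFO replacement."""

    BLOCK_BYTES = 64

    def __init__(self, num_sets):
        self.num_sets = num_sets
        self.tags = [[None, None] for _ in range(num_sets)]
        self.next_victim = [0] * num_sets  # FIFO pointer per set
        self.status_updates = 0            # replacement-status writes

    def access(self, address):
        """Return True on a hit, False on a miss (the block is then filled)."""
        block = address // self.BLOCK_BYTES
        index = block % self.num_sets
        tag = block // self.num_sets
        if tag in self.tags[index]:
            return True                    # hit: no status read or write
        way = self.next_victim[index]
        self.tags[index][way] = tag        # fill the oldest way
        self.next_victim[index] = 1 - way  # status changes only on a miss
        self.status_updates += 1
        return False


if __name__ == "__main__":
    cache = TwoWayFifoCache(num_sets=4)
    # Addresses 0, 256, and 512 all map to set 0 when there are 4 sets.
    print([cache.access(a) for a in (0, 256, 0, 512, 0)])
    print("status updates:", cache.status_updates)
```

In the access pattern above, the hit on address 0 does not protect it from eviction, which is exactly where FIFO and true least recently used differ. According to the text, the model predicted that this difference costs very little performance.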

The data bypass mechanism, the write queue,
and the parallel translation buffer fix-up mechanisms
were implemented after the performance
model indicated significant performance gains
would be achieved from these features.


Table 2  Performance Measurements of a VAX 9000 System

Program Name    Predicted (VUPs*)    Measured (VUPs*)
HANOI                28.54                25.53
FFT45                36.87                37.85
GAUSS                32.72                32.57
WHETS                27.78                27.17
WHETD                34.48                34.89

* Performance measured in VAX units of performance (VUP), where
  the performance of the VAX-11/780 system = 1.0 VUP.
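
As a reading aid, the gap between prediction and measurement in Table 2 can be expressed as a signed percentage of the measured value. The helper name `percent_error` and the choice of the measured figure as the denominator are assumptions made here. The figures are copied from the table as printed.

```python
# (predicted VUPs, measured VUPs) for each program in Table 2.
TABLE_2 = {
    "HANOI": (28.54, 25.53),
    "FFT45": (36.87, 37.85),
    "GAUSS": (32.72, 32.57),
    "WHETS": (27.78, 27.17),
    "WHETD": (34.48, 34.89),
}


def percent_error(predicted, measured):
    """Signed prediction error as a percentage of the measured value."""
    return 100.0 * (predicted - measured) / measured


errors = {name: percent_error(p, m) for name, (p, m) in TABLE_2.items()}

if __name__ == "__main__":
    for name, err in errors.items():
        print(f"{name:6s} {err:+6.2f} %")
```

A positive value means the model predicted more performance than was measured on the prototype.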

Vector Performance
Vector processing was modeled using graphical
descriptions of the pipeline. The graphical descriptions
were essentially critical path method scheduling
charts. This approach is reasonable because
vector processing makes regular demands on system
resources. In fact, the regularity of resource
demand patterns was a major reason that vector
processing techniques were developed. By using
the pipeline schedules, we realized that data should
be prefetched to ensure good vector performance.


Performance Measurement
Table 3 compares the VAX 9000 scalar and vector
processors' performance to other members of the
VAX family of processors.

Table 3  Performance of the VAX 9000 Scalar and Vector Processors

Program    VAX 8550    VAX 9000            VAX 9000
Name       System      Scalar Processor    Vector Processor
           (VUPs*)     (VUPs*)             (VUPs*)
A3D          6.55        65.54               77.45
DYFESM       5.12        31.88               40.49
EMIT         5.86        41.65               79.86
CFFT2D       5.52        25.76               64.18
BMK8A1       5.45        30.65               83.84
MXM          5.93        40.81              269.32

* Performance measured in VAX units of performance (VUP), where
  the performance of the VAX-11/780 system = 1.0 VUP.

The variations in these performance numbers
indicate that significant performance improvements
can be achieved by using applications that
take advantage of machine resources. The numbers
also highlight opportunities. By modifying applications
to capitalize on machine features, large performance
gains may be realized. Performance gains
of 100 to 200 percent are often realized and may
substantially extend the lives of older programs.

Vector applications tend to fall into three categories.
The first category generally does not contain
much parallel content. This category is represented
by A3D and DYFESM in Table 3. Vectorizing such
programs improves performance by a modest
0 to 50 percent. Programs EMIT and CFFT2D in
Table 3 represent the second category, which are
applications of moderate parallel content. Applications
in this category realize a 50 to 150 percent
performance gain when vectorized. Applications
in the third category, highest parallel content,
demonstrate performance improvements of more
than 150 percent when vectorized. Programs
BMK8A1 and MXM in Table 3 are examples of this
class of application.

Often, modest code changes can realize dramatic
performance improvements. By simply redefining
array dimensions or loop specifications, an application
can move from the first category to the third
category.

Acknowledgments
Many people contributed to reaching the VAX 9000
performance goals. The authors would especially
like to thank David Orbits, whose advanced development
work on high-performance VAX designs
became the basis for the performance model; and
Bill Grundmann, Rick Hetherington, John Murray,
Bill Smith, and David Webb, who comprised,
with the authors, the original VAX 9000 architecture
team.

References
1. J. Murray et al., "VAX Instructions That Illustrate
   the Architectural Features of the VAX 9000 CPU,"
   Digital Technical Journal, vol. 2, no. 4 (Fall
   1990, this issue): 25-42.

2. M. Adiletta et al., "Semiconductor Technology
   in a High-performance VAX System," Digital
   Technical Journal, vol. 2, no. 4 (Fall 1990, this
   issue): 43-60.

3. SPICE is a general-purpose circuit simulator
   program developed by Lawrence Nagel and
   Ellis Cohen of the Department of Electrical
   Engineering and Computer Sciences, University
   of California, Berkeley.

4. D. Clark, "Pipelining and Performance in the
   VAX 8800 Processor," Architectural Support
   for Programming Languages and Operating
   Systems (ACM, October 1987).

5. C. Wiecek, "A Case Study of VAX-11 Instruction
   Set Usage for Compiler Execution," Proceedings
   of the Symposium on Architectural Support
   for Programming Languages and Operating
   Systems (ACM, March 1982): 177-184.

6. J. Emer and D. Clark, "A Characterization of
   Processor Performance in the VAX-11/780,"
   Proceedings of the 11th Annual Symposium on
   Computer Architecture (Ann Arbor: June 1984):
   301-310.

7. VAX Vector Processing Handbook (Maynard:
   Digital Equipment Corporation, Order No.
   EC-H0419-46, 1989).

8. R. Brunner and D. Bhandarkar, "Vector Extensions
   to the VAX Architecture," Proceedings
   of COMPCON '90 (San Francisco: Spring 1990).


John E. Murray
Ricky C. Hetherington
Ronald M. Salett

VAX Instructions That
Illustrate the Architectural
Features of the VAX 9000 CPU
The VAX 9000 system is Digital's largest and most powerful VAX system. As such,
it offers many unique features that required the use of advanced technology and
innovative architecture in the design of the system. Overall, the VAX 9000 microarchitecture
produces a high level of system performance and the lowest cycle time
of any VAX processor, i.e., less than five cycles per instruction. Three sections of the
VAX 9000 CPU - the instruction fetch and decode unit (I-box), the execution unit
(E-box), and the data cache and main memory interface unit (M-box) - are
illustrated in this paper through descriptions of a small sample of VAX instructions.
These instructions are discussed in relation to their flow through the pipeline, how
their architectural features combine to work on a single macroinstruction, and how
various stages of the pipeline interact.

In October 1989, Digital introduced its VAX 9000
family of high-performance scalar, vector, and parallel
processors. The VAX 9000 system is designed
to be expandable from one to four processors, with
an optional integrated vector facility available on
each processor. The design team obtained high
levels of performance with advanced technology
and innovative architectural features.
The technology provided a platform that has the shortest
cycle time for any VAX processor. Most VAX processors
average ten or more cycles per instruction,
whereas the architectural features of the VAX 9000
system reduce that average below five.

The VAX architecture is a complex instruction set
architecture. VAX instructions vary in length and
number of operand specifiers. The opcode may be
one or two bytes long. The number of specifiers
is implied by the opcode. Each specifier's length is
determined by the specifier type, and the length can
vary by up to 17 bytes [1]. Although the VAX 9000
implements a large number of instructions in a
single cycle, some instructions need to be implemented
in tens of cycles. In these cases, microcode
assistance is required. To increase performance,
many features were included in the VAX 9000
system that have not been implemented in previous
VAX systems. The system contains a virtual
instruction cache, a branch prediction cache,
multiple specifier evaluation units, deep instruction
prefetch, hardware translation buffer fix-up unit,
write address buffer and conflict checker, multiported
write-back cache, independent arithmetic
units, and separate issue and retire queues. These
features are pipelined and do not interact in a
straightforward way. Many stages are not directly
linked to the subsequent stage but feed a queue
or first-in first-out (FIFO) buffer. The subsequent
stage works on the output of the FIFO buffer. The
pipeline is not of fixed length, and many operations
are done in parallel.

The architectural features do not function totally
independent of one another. In fact, the highest
level of performance is achieved when all the units
function in harmony. This paper highlights the
implementation of the macropipeline found in the
three major subsystems of the VAX 9000. These
subsystems are the instruction fetch and decode
unit (I-box), the execution unit (E-box), and the data
cache and main memory interface (M-box).

The design team for the VAX 9000 system's
I-box evolved a cost-effective subsystem that outperforms
all previous VAX systems. As shown in
Figure 1, the I-box processes the majority of instructions
in just one cycle. It combines a single cycle
access virtual instruction cache with a 25-byte
instruction buffer and an instruction decode crossbar
that can decode three specifiers per cycle. To
minimize cycle-wasting stalls, a branch prediction
unit handles transitions from one code block to
another. In addition, the operand processing unit
receives and processes specifiers from the decode
unit. The specifiers are passed either to the E-box as
pointers, literal data or addresses, or to the M-box
as virtual addresses.

Figure 2 illustrates how the front end of the
M-box translates addresses by using either a translation
buffer or an autonomous virtual-to-physical
address translation unit. Physical addresses for
reads are used to access a two-way associative
write-back cache and to fetch data from memory
through the system control unit (SCU), if the data
is missing from the cache. Read data is returned to
the E-box. Write addresses from the operand processing
unit are translated and queued by the M-box
until the E-box provides the data for the write.

The E-box of the VAX 9000 CPU performs all
scalar operations. As shown in Figure 3, the E-box
is a pipelined design that incorporates a microsequencer
to control functional unit operation.
Other dedicated control logic directs the flow
through the pipe stages.

A multiported register file provides general-purpose
registers and temporarily holds memory
data. The data is processed by one of the four
arithmetic functional units. Results pass through a
retirement multiplexer to the register file or the
M-box data cache, as shown in Figure 4. Multiple
VAX instructions are executed concurrently in the
E-box pipeline. The primary goal of the E-box is
to produce a 32-bit result each cycle, which allows
the majority of the simple, but most frequent, VAX
instructions to be executed in one cycle. This goal
is achieved when four requirements are met. First,
the I-box must have commands available for the
E-box. Second, operand data, often from the M-box
data cache, must be available. Third, pipelined or
single-cycle latency functional units are required
for single-cycle throughput. Finally, results must
be transferred from the functional units. E-box
features, such as queues, data bypass paths, and
powerful arithmetic units, help the system attain
a high-performance level. Stalls are avoided and
each instruction is executed in a minimal amount
of time.

The M-box of the VAX 9000 CPU is the primary
source of memory data. Therefore, it contains the
virtual address translation buffer and the data
cache. The M-box is multiported and pipelined with
two autonomous pipeline segments. Each segment
occupies one machine cycle, and the cache access
latency is, therefore, two cycles long. During the
first cycle, the M-box receives and prioritizes virtually
(or physically) addressed memory requests.
The M-box then indexes the translation buffer to
produce a 33-bit physical address and to perform
protection and validity checks. The second pipelined
cycle involves data cache access, data alignment,
if required, and port response. There are
numerous architectural features within both segments
that are targeted at high bandwidth for
prefetching and storing scalar and vector operands.

To illustrate the various features of the VAX 9000
microarchitecture, we have selected the code
sequence shown in Figure 5. In the following sections,
we discuss each instruction as it progresses
through the pipeline as if it were the only instruction
in the pipeline. We then summarize by considering
the same instructions as a block of code.

VAX Instruction ADDL2
The ADDL2 instruction uses general-purpose register
R8 as an address to memory. The contents of
that location are added to general-purpose register
R7, and the result is written back to the same location
in memory. The instruction is encoded in three
bytes: opcode, register, and base register.

Cycles One through Three
If we assume that the ADDL2 instruction is the first
instruction either in an interrupt routine or following
a context switch, the program counter is generated
by the E-box and passed to the I-box on a 32-bit
bus. The program counter is latched and used to
access the virtual instruction cache during cycle
one. The virtual instruction cache contains up to
8 kilobytes (KB) in 32-byte blocks and 8-byte lines
of instruction stream data.

Bits <12:3> of the program counter's prefetch
buffer are used to access an 8-byte line from the
virtual instruction cache. Bits <12:5> are used to
access a tag, a valid block, and four quadword valid
bits. The tag is compared with bits <31:13> of the
program counter's prefetch buffer. If the tag and the
bits match, the block and the quadword within the
block are valid, and the instruction is in the virtual
instruction cache (i.e., a hit). Bits <2:0> of the prefetch
buffer are used to rotate the quadword for the
opcode byte to be loaded into byte 0 of the I-buffer
at the end of cycle one. Similar to the VAX 8650
system, the first byte of the I-buffer is the operation
code (opcode) of the instruction.
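
The bit fields named above can be checked with a short sketch that splits a 32-bit prefetch program counter into its cache-access fields. The function name `split_prefetch_pc` and the dictionary keys are hypothetical. The bit ranges are the ones given in the text. Note that the line index and the block index overlap by design: the block index is the upper eight bits of the ten-bit line index.

```python
def split_prefetch_pc(pc):
    """Split a 32-bit prefetch PC into virtual instruction cache fields."""
    return {
        "tag": (pc >> 13) & 0x7FFFF,      # bits <31:13>, compared with the stored tag
        "block_index": (pc >> 5) & 0xFF,  # bits <12:5>, selects tag and valid bits
        "line_index": (pc >> 3) & 0x3FF,  # bits <12:3>, selects an 8-byte line
        "byte_in_line": pc & 0x7,         # bits <2:0>, rotate amount for the opcode
    }


# Both index widths cover the same 8 KB of instruction stream data.
BLOCKS, BLOCK_BYTES = 1 << 8, 32
LINES, LINE_BYTES = 1 << 10, 8

if __name__ == "__main__":
    print(split_prefetch_pc(0x00012345))
    print(BLOCKS * BLOCK_BYTES, LINES * LINE_BYTES)
```

The two products printed at the end both equal 8192 bytes, which agrees with the 8 KB capacity stated for the virtual instruction cache.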

The ADDL2 is three bytes long and normally
fits in one line of the virtual instruction cache. If
the ADDL2 instruction crosses a line boundary, a


[Figure 1  Block Diagram of the VAX 9000 System I-box, showing the fetch stage, decode stage, and specifier stage]

KEY
VIC  - VIRTUAL INSTRUCTION CACHE
S1   - SOURCE 1
S2   - SOURCE 2
DEST - DESTINATION
IB   - I-BUFFER
PPC  - PREFETCH PROGRAM COUNTER
UPC  - UNWIND PROGRAM COUNTER
DPC  - DECODE PROGRAM COUNTER
SPC  - SPECIFIER PROGRAM COUNTER
BP   - BRANCH PREDICTION
PC   - PROGRAM COUNTER
OPU  - OPERAND PROCESSING UNIT
SL   - SHORT LITERAL
GPR  - GENERAL PURPOSE REGISTER
GPRS - GENERAL PURPOSE REGISTERS
XGPR - X GENERAL PURPOSE REGISTER
YGPR - Y GENERAL PURPOSE REGISTER
OPD  - OP DECODE
SLD  - SHORT LITERAL DECODE
R1   - REGISTER 1
R2   - REGISTER 2
R3   - REGISTER 3
DISP - DISPENSER

VAX 9000 Series

CONTROL
LOGIC

M ICRO
SEQUENCER
I-BOX

QUEUES
V-BOX

M-BOX

REG ISTER
FILE

Figure 2

::>

9
:::

I - BUFFER

Front End ofthe VAX 9000 System M-box

�
�

OPU

E-BOX

r:: SEQUENCER �

f
Figure 3

�

MISS

FIX-UP

�

r=::>v

��

TRANSLATION
BUFFER

TRANSLATION

� BFIUXF-UF PE R

f.-

I--

�v

Block Diagram ofthe VAX 9000 .�ystem E-box

Vol.

l No. 4

Fall

/')')IJ

Digital Tecbnicaljournal

VAX Instructions That Illustrate the A rchitectural Features ofthe VAX 9000 CPU

E-BOX

OPERAND
PROCESSING
UNIT

I -BUFF E R

CACHE

OPERAND
PROCESSING
UN I T

MAIN MEMORY -f: 64

shou ld be routed w the register/pointer unit and
that the memory specifier should be routed to the
operand processing unit.
I n parallel with the XBAR decode process dur
ing cycle two, the program coumer is passed to the
E-box from the 1-box. The opcode is used to address
the fork random-access memories (RAMs) in the
E-box that provide a fork address to the microse
quencer. At the end of cycle two, the decoded bytes
are shifted out of the !-buffer, and the subsequent
instruction is presented to the XBA R in cycle three.
The fork address from the 1-box is then used to
address a fork RAM in the E-box. For each opcode,
the fork RAM provides an entry address into the
control store, i nd icates w h ic h functional u n i t
should begin the execution , and specifies how
many source operands are needed i n the first cycle.
The fork address is modified when an instruction

co 0080

ADDL2

R7,

4 1 0083

SUBF3

#0,5,

535940C2 8F

45FD 0088

M U LG3

#2345.5,

E4 0095

BBSC

#13.

68 57

WRITE BACK

FILL BUFFER

subsequent cycle is required to access the second
l ine. The average VAX i nstruction is 3 . 8 bytes long.
Therefore, a virtual instruction cache hit delivers
about two instructions to the l-buffer.6
Other VAX processors general l y require a cycle
to decode the opcode and one or more cycles to
decode each subsequent specifier.7.H However, the
VAX 9000 CPU's instruction decode cross bar can
decode the vast majority of common instructions in
a single cycle.
If the three bytes of the ADDL2 instruction were
loaded into the !-buffer at the end of cycle one, the
bytes would be decoded during cycle two. The
decode unit (XBAR) passes data from the !-buffer to
a short l iteral unit, a register/pointer unit or an
operand processing unit. As the opcode and speci
fier bytes are decoded in paral lel, the X BAR deter
mines in less than a cycle that both specifier bytes

Figure

Cache Unit ofthe VAX 9000 System M-box

Figure 4

M-BOX
E-BOX
WRITE B U F F E R

WRITE
QUEUE

6044

59 85 9999A999

E-BOX
WRITE BUFFER

E - BOX

I-BUFFER

000001 2 1 '

EF OD

1 $:

(R8)
(RO)[R4].

(R5)+.

BDATA.

VAX Instructions That Illustrate the Major Features ofthe VAX 9000 System

index the 1024-entry translation buffer. The translation buffer is a direct-mapped, associative memory that holds up to 1024 recent translations. Bits <30:18> of the virtual address are compared with the tag field, and the entry is validated and protection-checked. The physical frame number is a 24-bit field that is appended to the virtual address bits <8:0> to create the 33-bit physical address. The RAM used for the translation buffer is a 1024 by 4 self-timed RAM with a 4.5 nanosecond (ns) access time.
Protection checking occurs during the latter portion of cycle four. The example we are discussing is a request for a read and write check. Therefore, both read and write access are checked. Fault indication is forwarded with the request to the data cache and subsequently, with the data, to the E-box. If the request has a valid entry in the translation buffer and no protection violations exist (i.e., translation buffer hit), a data cache access is required in cycle five.
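The formation of the physical address on a translation buffer hit can be illustrated with a short Python sketch. It assumes the standard 512-byte VAX page, so the byte offset within the page is the low nine bits of the virtual address; that is the assumption under which a 24-bit frame number yields a 33-bit physical address. The function and constant names are hypothetical.

```python
PAGE_OFFSET_BITS = 9   # assumption: 512-byte VAX page
PFN_BITS = 24          # physical frame number width given in the text


def physical_address(pfn: int, virtual_address: int) -> int:
    """Append the page offset of the virtual address to the frame number."""
    offset = virtual_address & ((1 << PAGE_OFFSET_BITS) - 1)
    frame = pfn & ((1 << PFN_BITS) - 1)
    return (frame << PAGE_OFFSET_BITS) | offset


# The largest frame number and offset fill exactly 33 bits.
print(physical_address(0xFFFFFF, 0x1FF).bit_length())
```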
The two source pointers and the destination pointer from the I-box are latched in the source and destination queues, respectively, at the start of cycle four. The source queue holds 16 entries and can receive 2 entries per cycle. The destination queue holds eight entries. Both queues are circular FIFO queues that can be flushed with the fork queue. The two source pointers are also latched in the source operand logic at the start of cycle four. The source operand logic determines which two source pointers to use each cycle. The pointers can come from the source queue, the I-box, the microword, the register log, and several special functions. In this example, the two pointers are selected directly from the latched I-box pointers because using the source queue would have required an extra cycle.
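A circular FIFO of the kind described for the source and destination queues can be sketched as follows. This is a generic software illustration, not the hardware design: the class name, method names, and the choice to report a full queue by returning False are all assumptions.

```python
class PointerQueue:
    """Fixed-capacity circular FIFO, loosely modeled on the source queue."""

    def __init__(self, capacity: int = 16) -> None:
        self.slots = [None] * capacity
        self.head = 0      # index of the oldest entry
        self.count = 0     # number of valid entries

    def push(self, pointer) -> bool:
        """Append one pointer; return False when the queue is full."""
        if self.count == len(self.slots):
            return False
        tail = (self.head + self.count) % len(self.slots)
        self.slots[tail] = pointer
        self.count += 1
        return True

    def pop(self):
        """Remove and return the oldest pointer, or None when empty."""
        if self.count == 0:
            return None
        pointer = self.slots[self.head]
        self.head = (self.head + 1) % len(self.slots)
        self.count -= 1
        return pointer

    def flush(self) -> None:
        """Discard every entry, as when the queues are flushed together."""
        self.head = 0
        self.count = 0


source_queue = PointerQueue(16)
source_queue.push("S1 pointer")   # two pushes model the two entries
source_queue.push("S2 pointer")   # the source queue can accept per cycle
print(source_queue.pop())
```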
The selected pointers address the register file and are passed to the issue logic early in the fourth cycle. The register file contains the 15 general-purpose registers, R0 through R14. These registers can be written by either the E-box or the I-box for autoincrement or autodecrement specifiers. The first pointer accesses general-purpose register R7. The contents of general-purpose register R7 are

following the ADDL2 instruction (i.e., bytes <2:0>). In the latter case, the SUBF3 instruction would be shifted into the lower bytes as the ADDL2 instruction is shifted out.

Cycles Two through Eight
In cycle two, the SUBF3 instruction is completely decoded and shifted out of the I-buffer. As a result, the following actions occur:
• The fork address is passed to the E-box.
• The short literal is passed to the short literal expansion unit.
• The base and index registers are passed to the operand processing unit.
• The destination general-purpose register R3 and the two sources are passed to the register/pointer unit.

During cycle three, the register/pointer unit allocates the next available entry in the source list to the short literal and the subsequent entry to the indexed memory reference. The E-box is informed of these allocations as pointers to the relevant entries are passed to the pointer queues as the source one and source two pointers. The register/pointer unit also passes the destination register to the destination queue in the E-box.
The operand processing unit passes the tag from the register/pointer unit, with the address for the indexed memory specifier request, to the M-box. The address is generated by the adder in the operand processing unit. In parallel with the operand processing unit and register/pointer unit, the short literal expansion unit takes the 6-bit field and expands it to a 32-bit F_floating number.
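The expansion of a 6-bit short literal to F_floating can be sketched in Python. The sketch assumes the standard VAX architecture definition, in which the literal holds a 3-bit exponent and a 3-bit fraction that land in bits <9:4> of the F_floating longword with the exponent biased by 128; the function names are hypothetical, and the decoder is included only to check the result.

```python
def expand_short_literal_f(literal: int) -> int:
    """Expand a 6-bit floating short literal to a 32-bit F_floating longword."""
    if not 0 <= literal <= 0x3F:
        raise ValueError("short literal must fit in 6 bits")
    # 0x4000 sets the exponent field <14:7> to 128; the literal's exponent
    # and fraction bits then occupy bits <9:4>.
    return 0x4000 | (literal << 4)


def f_floating_value(longword: int) -> float:
    """Decode an F_floating longword (sign <15>, exponent <14:7>)."""
    sign = (longword >> 15) & 1
    exponent = (longword >> 7) & 0xFF
    fraction = ((longword & 0x7F) << 16) | ((longword >> 16) & 0xFFFF)
    if exponent == 0:
        return 0.0  # true zero; the reserved-operand case is ignored here
    magnitude = (0.5 + fraction / 2**24) * 2.0 ** (exponent - 128)
    return -magnitude if sign else magnitude


# Short literal 0 is the #0.5 operand of the SUBF3 example.
print(hex(expand_short_literal_f(0)), f_floating_value(expand_short_literal_f(0)))
```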
During cycle four, the short literal is written through the I-box data bus to the relevant entry in the source list. Issue control can issue with bypass because only the memory data for operand two is missing.
The E-box stalls until the memory data arrives. Because the I-box and the M-box generally are functioning ahead of the E-box, memory stalls are short or nonexistent. In this example, the memory data arrives at the end of cycle five, as was the case with the ADDL2 instruction.
In cycle four, the M-box operates for the SUBF3 instruction in a similar manner to its cycle four activity for the ADDL2 instruction. At the start of the cycle, a command, address, context, and tag field are sent from the operand processing unit to the M-box. The command is a simple operand read. Arbitration occurs early in the cycle. The translation buffer is then accessed, and the physical address is sent to the cache.
Cycle five begins when the data cache receives the physical address for the operand processing unit read. The tag store lookup and address matching are performed simultaneously with the data read, and the data is available to the E-box at the end of the cycle. If the operand read results in a cache miss, the M-box must assemble a command and an address, which are sent to the SCU to enable the SCU to access a 64-byte block of memory data. In addition, the data cache tells the SCU which set the cache will replace with the new cache block. If the current cache block contains valid and written data, the block must be written back to main memory before the new cache block arrives.
The SCU sends a command and an address back to the M-box when the memory data is ready. The send takes approximately 26 cycles and is followed, within a short period of time, by eight cycles of data transfer. Each cycle transfers 8 bytes. The requested quadword is returned first to respond to the requesting port during the first cycle of the cache refill. On the eighth cycle of cache refill, the tag store is updated.
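The refill sequence can be illustrated with a small Python sketch. The text states only that the requested quadword is returned first and that the 64-byte block arrives in eight 8-byte transfers; the wraparound order used below for the remaining seven quadwords is an assumption, as are the function and parameter names.

```python
def refill_order(miss_address: int, block_size: int = 64, beat_bytes: int = 8) -> list:
    """Return the quadword numbers of a block in the order they arrive.

    The requested quadword comes first; the wraparound order of the
    rest is assumed, not taken from the text.
    """
    beats = block_size // beat_bytes
    first = (miss_address % block_size) // beat_bytes
    return [(first + i) % beats for i in range(beats)]


# A miss on byte offset 0x28 of a block requests quadword 5.
print(refill_order(0x1028))
```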
The floating point functional unit is started in cycle six, as specified by the fork RAM data. Both source operands are delivered, and the microword indicates a SUBF operation. The floating point unit requires two cycles to perform the SUBF operation. Unpacking and alignment occur in the first cycle. The floating point unit signals the issue control that the result will be available at the end of the following cycle. The issue control enters the general-purpose register R3 destination but must wait another cycle before beginning retirement. If the next instruction requires the floating point unit and its operands are available, that instruction would be issued in this cycle because the floating point unit is fully pipelined.
The second execution cycle occurs in cycle seven. The floating point unit adds, normalizes, rounds, and packs. The result is latched in the floating point unit at the end of the cycle, and the issue control discards the top entry from the result queue to retire the data.
In cycle eight, the retire multiplexer selects the floating point unit result data and sends that data to the data distribution logic. The data distribution logic holds the result, which will be written into general-purpose register R3 in the register file during the next cycle. The write is purposely delayed to permit it to be aborted if an arithmetic fault occurs. Because the result is held in the data distribution logic, it can be bypassed into the data path to act as a source operand. The result is written into the register file at the beginning of cycle nine.

VAX Instruction MULG3
The MULG3 instruction takes the G_format floating number addressed by general-purpose register R5, multiplies it by the immediate constant 2345.675 in the instruction stream, which is also a G_format number, and puts the result in general-purpose registers R9 and R10. General-purpose register R5 also is incremented by eight as a side effect of the specifier evaluation. The opcode is 2 bytes long, the constant is a nine-byte immediate specifier, and the autoincrement and register specifiers are each a single byte. Thus, the instruction is encoded in 13 bytes.

Cycles One through Five
As in cycle one of the SUBF3 instruction, the MULG3 instruction can either be a virtual instruction cache access cycle or part of the instruction already can be in the I-buffer and shifted to the least significant byte as the previous instruction is shifted out. For example, if the previous instruction is the SUBF3 #0.5, (R0)[R4], R3 in bytes <4:0> of the I-buffer, the first four bytes of the MULG3 instruction could be in bytes <8:5>. The four remaining bytes of the immediate specifier could be valid in the I-bex and the rest of the instruction could be contained in the I-bex2. At the end of cycle one, the first four bytes are shifted to the low four bytes of the I-buffer. The next four bytes are merged from the I-bex to the high four bytes of the I-buffer. The I-bex is now empty, and the bytes in the I-bex2 can be loaded into the I-bex.
Because the MULG3 instruction has a 2-byte-long opcode, the only decoding necessary in cycle two is to note the 2-byte length and shift out the first byte so as to align the specifiers to be the same as a single-byte opcode instruction. The specifiers are then in bytes <1:8> of the I-buffer. As the first opcode byte (in this case, #FD) is shifted out, the next valid byte in the I-bex is merged into byte 9 of the I-buffer, which leaves seven valid bytes in the I-bex.
Decoding really begins in cycle three. The fork's address is sent to the E-box, and bit <8> is set to indicate a 2-byte-long opcode. The first five bytes of the immediate specifier are passed to the operand processing unit. The first byte also is passed to the register/pointer unit for source list allocation. The five bytes shifted out of the I-buffer are replenished from the I-bex, which leaves two valid bytes in the I-bex.
In cycle four, the register/pointer unit allocates the two entries in the source list for the immediate G_floating number by passing a source one pointer to the E-box and the tag to the operand processing unit. The operand processing unit passes the first longword of the immediate G_floating number to the unit's output buffer.
The next four bytes of the immediate are passed from the I-buffer to the operand processing unit. The remaining two valid bytes from the I-bex are merged into the I-buffer. The I-bex is then loaded with eight bytes from the virtual instruction cache.
In cycle five, the autoincrement and register specifiers are decoded and the remaining bytes of the instruction are shifted out. Five bytes from the I-bex are merged with the four valid bytes in the I-buffer. The autoincrement general-purpose register R5 is passed to the operand processing unit and the register/pointer unit, which also receives general-purpose register R9. The first longword of the immediate specifier is passed from the operand processing unit output buffer, through the I-box, to the source list entry allocated by the register/pointer unit. The second longword is passed to the operand processing unit output buffer.
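The MULG3 immediate is a G_floating constant that occupies two longwords, which is why it moves through the I-box in two pieces. The following Python sketch packs a value into the four 16-bit words of the format, assuming the standard VAX G_floating layout (sign in bit <15> of the first word, an 11-bit exponent biased by 1024, and a 52-bit fraction after the hidden bit). The function name is hypothetical, and because the input is a Python double, the low-order fraction bits are those of the double rather than a guaranteed match for what a VAX assembler would emit.

```python
import math


def g_floating_words(value: float) -> list:
    """Pack a value into four 16-bit G_floating words, first word first."""
    if value == 0.0:
        return [0, 0, 0, 0]
    sign = 1 if value < 0 else 0
    mantissa, exponent = math.frexp(abs(value))  # 0.5 <= mantissa < 1
    biased = exponent + 1024
    if not 0 < biased < 2048:
        raise OverflowError("value out of G_floating range")
    fraction = int((mantissa - 0.5) * 2**53)     # 52 bits after the hidden bit
    first = (sign << 15) | (biased << 4) | (fraction >> 48)
    return [first,
            (fraction >> 32) & 0xFFFF,
            (fraction >> 16) & 0xFFFF,
            fraction & 0xFFFF]


words = g_floating_words(2345.675)
print([hex(w) for w in words[:2]])

# Instruction length: opcode + immediate specifier + two 1-byte specifiers.
print(2 + 9 + 1 + 1)
```

The first two words form the first longword of the immediate; the last two form the second longword.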


The first microword is accessed and distributed throughout the E-box. The microsequencer uses the fast fields of the microword to generate the final control store address for this instruction. The microinstruction is not issued because it requires two source operands and the second source pointer is not yet available.
Cycle Six

In cycle six, the register/pointer unit allocates two source list entries for the autoincrement specifier, passes this information to the E-box in the source one pointer, and passes a tag to the operand processing unit. The general-purpose register R9 is passed to the E-box as the destination pointer.
The operand processing unit accesses general-purpose register R5 and passes it, with a tag and a quadword read request, as an address to the M-box. In parallel, the operand processing unit writes general-purpose register R5, incremented by the 8-byte operand length, in the unit's output buffer. The second longword of the immediate specifier is written to the source list at the relevant entry.
The operand processing unit sends the M-box a quadword read request for the double-precision floating point operand. If the address is on a quadword boundary, the front end of the M-box will not produce any additional virtual addresses because the operand will not cross a page boundary or a cache line boundary. If there is a miss in the translation buffer for this reference, all other arbitration stops and control is given to the state machine of the translation buffer fix-up unit.
Bits <31:09> of the request are captured by the translation buffer's fix-up unit in parallel with the translation buffer RAM's access to achieve an early start on miss processing. The fork to the state machine is sensitive to bits <31:30> of the virtual address. Therefore, when a translation buffer miss occurs, a constrained control word flow begins based on the values of bits <31:30>. Because this is a user-mode reference, the value is zero. Therefore, on the first cycle following the translation buffer miss, the virtual page number is compared against the P0 length register, P0LR. On the next machine cycle, the P0BR (i.e., base register) is added to the virtual page number to create the system virtual address of the process page table entry. The fix-up unit acts the same as any other port into the translation buffer, and makes a virtual read request with an aligned longword context. The state machine is controlled by a microword that branches to itself until one of three events occurs: a miss in the translation buffer (the fix-up unit processes double misses), a memory management fault, or a cache response. The cache response, which is the event most likely to occur, signals the state machine to return to idle and prepare for the next miss. Hardware control external to the fix-up unit writes the entry into the translation buffer, and the original request is retried. This time there is a translation buffer hit, and the physical address is sent to the cache. Single misses in the translation buffer require seven cycles to process. A double miss requires 13 cycles, assuming data cache hits occur.
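The first two steps of miss processing for a P0 reference, the length check and the page table entry address calculation, can be sketched as follows. The sketch assumes the standard VAX memory management rules: 512-byte pages, a 21-bit virtual page number in bits <29:9>, 4-byte page table entries (so the page number is scaled by 4 before it is added to P0BR), and a length violation when the page number is not less than P0LR. The names used here are hypothetical.

```python
PAGE_SHIFT = 9   # assumption: 512-byte pages
PTE_BYTES = 4    # assumption: longword page table entries


class LengthViolation(Exception):
    """Raised when the virtual page number is outside the P0 page table."""


def p0_pte_system_va(virtual_address: int, p0br: int, p0lr: int) -> int:
    """Return the system virtual address of the PTE for a P0 address."""
    if (virtual_address >> 30) & 0x3 != 0:
        raise ValueError("bits <31:30> are not zero: not a P0 address")
    vpn = (virtual_address >> PAGE_SHIFT) & 0x1FFFFF  # bits <29:9>
    if vpn >= p0lr:                                   # first cycle: length check
        raise LengthViolation(hex(virtual_address))
    return (p0br + PTE_BYTES * vpn) & 0xFFFFFFFF      # next cycle: add base


print(hex(p0_pte_system_va(0x00000A00, p0br=0x80010000, p0lr=16)))
```

The returned address is itself a system virtual address, which is why the fix-up unit's own read can miss in the translation buffer and produce the double miss mentioned in the text.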
The issue control asserts the microword hold signal to force the microword latches to hold the first microword until it can be executed. The microsequencer regenerates the control store address of the second microword each cycle until the execution stall ends.
Cycles Seven through Thirteen

Cycle seven is the data cache read cycle for the quadword operand processing unit request that was translated in the previous cycle. The VAX 9000 system has a 128KB data cache, with a block size of 64 bytes and an access width of 8 bytes. The 64-bit access width matches the 64-bit data path to the E-box, which was constructed to provide high bandwidth for double-precision operand transfers. When a cache hit results for the read of an aligned quadword, both the normal response line and the quadword response signal are asserted to alert the E-box that the M-box is sending a quadword of data.
In cycle seven, general-purpose register R5 of both the E-box and I-box is written with the incremented value. In addition, both source pointers and the first source operand are available to the issue control. Because only the second operand is missing, the microinstruction can be issued with bypass awaiting memory data.
The quadword operand is available to the M-box at the end of cycle eight. The low longword is latched in the data distribution logic of the E-box, and the high longword is held in the M-box.
In cycle nine, the quadword operand is written into the register file at the two source list locations allocated by the operand processing unit. However, the low longword is available as a source immediately. The low longword of the immediate operand and the low longword of the memory operand are passed to the multiply functional unit at the start of cycle nine. The multiply unit performs the first cycle of execution, which includes unpacking and multiplying the most significant bits of the two operands. Issue control drops the microword hold signal to allow the second microword to be latched. An entry, which specifies general-purpose register R9 as the destination for the low longword of the result, is made to the result queue. The second microword is issued because the multiplier requires the next half of each source operand and both are available from the register file.
The microsequencer then attempts to generate a new control store address from the next entry in the fork queue. If no new forks are available, the microsequencer remains idle.
In the tenth cycle, the multiply unit receives the high longword of both source operands. The second execution cycle is performed, which includes unpacking and three simultaneous multiplications of the appropriate combinations of the most and least significant bits of the two operands. The multiplier signals the issue control that the result will be available in the following cycle. The issue control makes an entry, which specifies general-purpose register R10 as the destination for the high longword of the result, in the result queue. The multiply functional unit is fully pipelined and could be issued in this cycle to start subsequent operations.
Cycle eleven is the third and final execution cycle. The multiplier accumulates the four products it produced in the two previous cycles, rounds, and packs the final double-precision result. The issue control discards the top entry from the result queue to retire the low longword of the result.
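The idea of forming a wide product from four partial products, one in the first cycle and three in the second, then accumulating them, can be illustrated with integers. This Python sketch shows only the arithmetic identity; the 32-bit split point, the function name, and the mapping of products to cycles in the comments are assumptions, and unpacking, rounding, and packing are omitted.

```python
def multiply_by_halves(a: int, b: int, half_bits: int = 32) -> int:
    """Multiply two unsigned values using four half-width partial products."""
    mask = (1 << half_bits) - 1
    a_hi, a_lo = a >> half_bits, a & mask
    b_hi, b_lo = b >> half_bits, b & mask

    high = a_hi * b_hi        # first execution cycle: most significant halves
    cross_1 = a_hi * b_lo     # second execution cycle: the three remaining
    cross_2 = a_lo * b_hi     # combinations of most and least significant
    low = a_lo * b_lo         # halves, formed together

    # Third execution cycle: accumulate the four products.
    return (high << (2 * half_bits)) + ((cross_1 + cross_2) << half_bits) + low


x, y = 0x1234567890ABCDEF, 0x0FEDCBA987654321
print(multiply_by_halves(x, y) == x * y)
```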
In cycle twelve, the retire multiplexer selects the multiply unit result data and sends it to the data distribution logic. The issue control discards another entry from the result queue to retire the high longword of the result. The low longword of the result is written into the register file's general-purpose register R9 in cycle thirteen. The high longword of the result is written into general-purpose register R10 in the next cycle as the instruction is completed.

VAX Instruction BBSC
The BBSC instruction tests a bit in memory, branches if the bit is set, and clears the bit. BDATA is the base address in memory, and the number 13 is the bit position offset from that base. The majority of VAX field instructions have a position offset of less than 64 bits. Therefore, the VAX 9000 system's I-box prefetches the quadword addressed by the base. As with all conditional branches, the result of the test is predicted and the VAX 9000 system's I-box continues to fetch instructions along the predicted path. The BBSC is encoded in eight bytes: one for the opcode, one for the short literal position, five for the base address (a 4-byte displacement off the program counter), and one for the branch displacement.

Cycles One and Two
Cycle one for the BBSC can be fetching the instruction stream from the virtual instruction cache, as described for cycle one of the ADDL2 instruction, or the instruction already can be in the I-buffer (e.g., bytes <8:3>) and the I-bex (i.e., bytes <7:6>) following the MULG3 (i.e., bytes <2:0>). In the latter case, the BBSC instruction is shifted into the lower bytes as the MULG3 instruction is shifted out.
The decode of the BBSC begins with passing the short literal, number 13, to the short literal expansion unit and the program counter-relative base address to the operand processing unit. Information on both specifiers is passed to the register/pointer unit. In this cycle, the fork address is also passed to the E-box. The fork address is modified for field instructions if the base is a register. Therefore, passing the fork address is delayed until the base specifier is decoded. In this example, the base is decoded in the cycle after the opcode is received. If the base is a register, the field instruction takes a different microcode flow.
During cycle two, the decoder passes the program counter's decoder, which holds the program counter of the instruction being decoded, to the operand processing unit. The program counter is passed to the operand processing unit and the E-box in the first decode cycle. Whenever a specifier is passed to the operand processing unit, the XBAR also sends a specifier offset delta. When the delta is added to the program counter's decoder, the address of the last byte of the specifier plus one is produced.
As the short literal and program counter-relative specifiers are decoded, they are discarded from the I-buffer. The BBSC displacement is shifted to the first byte of the I-buffer. The data arriving from the cache is merged into bytes <8:2>, and the other byte is placed in the I-bex.
The branch prediction unit begins operating during the first decode cycle. A prediction for the branch must accompany the fork address sent to the E-box. The prediction is made by using the program counter to access a branch prediction cache and determine how the branch behaved the last time it was decoded (i.e., one history bit). If the branch is in the cache, the prediction is that the branch will behave the same as the last time. If the branch is not in the cache, a prediction is made based on the normal behavior of this conditional branch. For example, a BEQL (58 percent) and a BBSC (73 percent) normally do not branch, whereas a BNEQ (62 percent) normally branches. If the BBSC instruction is in the cache and branched last time, this information is indicated to the E-box, with the I-box prediction given as true.
Cycle Three

In this cycle, the register/pointer unit allocates one entry in the source list for the position specifier and three entries for the base specifier. The unit then passes the source one, source two, and destination pointers to the E-box.
In the operand processing unit, the address of the last byte of the specifier plus one is first calculated using the program counter of the instruction and the delta provided by the XBAR. The displacement from the instruction is then added to this calculation. The result is latched in the operand processing unit's output buffer and passed to the M-box. The operand processing unit also passes a quadword, field modify function, and the source list tag.
The short literal expansion unit extends the size of the position specifier to a longword and latches it in the unit's output buffer. In this example, the extension is done with zeros. The XBAR passes the branch displacement byte and an updated value of the program counter's delta to the operand processing unit. The program counter's delta, which serves as the instruction length, and the branch displacement are also sent to the branch prediction unit. The BBSC instruction is completely decoded, and the opcode and displacement are discarded from the I-buffer.
The branch prediction unit does most of its work during the last decode cycle of a branch. For the majority of conditional branches, the last decode cycle is also the first.
The branch prediction cache contains 1024 entries. Each entry has a history bit, a 32-bit target program counter, a 6-bit instruction length, a 16-bit branch displacement, and a tag. The entries are addressed by bits <9:0> of the program counter's decoder. If the tag matches bits <31:10> of the program counter's decoder, the entry is assumed to be the entry for this branch (i.e., a hit).
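The lookup described above, a direct-mapped cache with one history bit per entry and a per-opcode default on a miss, can be sketched in Python. The indexing, the tag comparison, and the three default directions follow the text; the class layout, the update method, and its always-overwrite policy are assumptions made for illustration, and the instruction length and displacement fields are left out.

```python
# Default directions quoted in the text for branches not in the cache.
DEFAULT_TAKEN = {"BEQL": False, "BBSC": False, "BNEQ": True}


class BranchPredictionCache:
    ENTRIES = 1024

    def __init__(self) -> None:
        self.entries = [None] * self.ENTRIES

    def predict(self, decode_pc: int, opcode: str):
        """Return (predicted_taken, cached_target_or_None)."""
        index = decode_pc & 0x3FF              # bits <9:0>
        tag = (decode_pc >> 10) & 0x3FFFFF     # bits <31:10>
        entry = self.entries[index]
        if entry is not None and entry["tag"] == tag:
            return entry["taken"], entry["target"]   # hit: repeat last behavior
        return DEFAULT_TAKEN[opcode], None            # miss: opcode default

    def update(self, decode_pc: int, taken: bool, target: int) -> None:
        """Record the resolved outcome (replacement policy is assumed)."""
        self.entries[decode_pc & 0x3FF] = {
            "tag": (decode_pc >> 10) & 0x3FFFFF,
            "taken": taken,
            "target": target,
        }


cache = BranchPredictionCache()
print(cache.predict(0x95, "BBSC"))   # not cached: falls back to the default
cache.update(0x95, taken=True, target=0x9D)
print(cache.predict(0x95, "BBSC"))   # cached: repeats the last outcome
```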
If a hit occurs and the history bit shows that the branch was not taken last time, the branch prediction unit latches this state information and allows the subsequent instruction stream to be decoded. The operand processing unit produces the target address as soon as it is not busy. The target address must be stored in the program counter's unwind buffer in case the prediction is incorrect. The E-box indicates the correctness of the prediction as soon as possible. For simple branches, the E-box could indicate that the prediction is incorrect before the branch is fully decoded.
If a hit occurs but the history bit shows that the branch was taken last time, the branch prediction unit latches this state information and stops the decoding of the subsequent instruction stream by clearing the I-buffer and the I-bex. The program counter of the subsequent instruction is stored in the program counter's unwind buffer. The target program counter, which is received from the branch prediction cache, is passed to the program counter's prefetch buffer. The target address that is later provided by the operand processing unit may be discarded. The branch displacement and instruction length from the branch prediction cache are latched. For the following discussion of the remaining cycles in the BBSC instruction, we have assumed that the BBSC instruction is a branch prediction hit and that the branch was taken the last time it was decoded.
Cycle Four

In cycle four, both the operand processing and short literal expansion units contain data to be passed to the source list. The operand processing unit normally has the higher priority of the two. Therefore, the short literal expansion unit will stall. The operand processing unit passes the base address to the source list through the I-box. In the operand processing unit, the new delta of the program counter is added to the program counter, the sign of the branch's displacement is extended from a byte to 32 bits, and the two are added to produce the new target address. The result is latched in the operand processing unit output buffer.
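The target address calculation amounts to a sign extension and two additions, as the following Python sketch shows. The function names are hypothetical, and the example values (an 8-byte instruction at address 0x95) are taken from the BBSC example only for illustration.

```python
def sign_extend_byte(displacement: int) -> int:
    """Extend the sign of an 8-bit branch displacement."""
    displacement &= 0xFF
    return displacement - 0x100 if displacement & 0x80 else displacement


def branch_target(decode_pc: int, delta: int, displacement_byte: int) -> int:
    """Add the delta to the program counter, then the extended displacement."""
    updated_pc = decode_pc + delta   # address of the byte after the instruction
    return (updated_pc + sign_extend_byte(displacement_byte)) & 0xFFFFFFFF


# A zero displacement branches to the instruction that follows.
print(hex(branch_target(0x95, 8, 0x00)))
# A displacement of -8 (0xF8) branches back to the start of the instruction.
print(hex(branch_target(0x95, 8, 0xF8)))
```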
The virtual instruction cache is accessed for the target instruction. If the instruction is in the virtual instruction cache, it is passed to the I-buffer. However, there is a gap in the pipeline because no instruction can be decoded this cycle.
The displacement and instruction length from the branch cache are compared with the actual displacement and instruction length. Normally, these values match. However, if they are different, the target address from the branch prediction cache is probably incorrect. The fetching and decoding of instructions must wait until the operand processing unit provides the correct address.
At the start of cycle four, the M-box receives a request from the operand processing unit. This

request differs from all requests previously described in that it contains a command that gets special treatment in the M-box. The command is an "opu read with write check no block." The command is used because the VAX 9000 CPU contains an optimization that enhances the performance of bit field instructions. With this command, the operand processing unit prefetches a quadword of data, starting from the address pointed to by the base, without looking at the value of the position operand. The assumption is that the majority of bit fields are within 64 bits of the base. The special command tells the M-box that if a fault should occur, it should pass the fault, with an operand, to the E-box and not close down the operand processing unit port or put a lock on the fault parameters. The request is for an unaligned quadword operand and, as such, requires that the M-box produce additional virtual addresses to correctly access the cache. A quadword is unaligned when bits <2:0> are nonzero. For this example, we have assumed that the starting address is xxxxxxx1.
Specialized hardware in the front end of the M-box detects if the starting address requires sequencing (i.e., the addition of a constant of 4 to the current address) and how many sequenced addresses are necessary. In this case, three addresses are required. The first is the starting address (i.e., addr xxxxxxx1), which is received from the operand processing unit. As the starting address is accessing the translation buffer, a constant of 4 is added and the sequence port requests a virtual address (i.e., addr xxxxxxx5) from the translation buffer at the start of cycle five.

tion or for subsequent branches to be decoded. The unit predicts a maximum of three branches before it stalls decoding to resolve the first branch.
As the address xxxxxxx5 is accessing the translation buffer, the final address is produced by adding 4, which makes a translation buffer request (i.e., addr xxxxxxx9) through the sequencer port in cycle six. The three translation buffer accesses are contiguous and interruptible. Data alignment is performed by the M-box, but the alignment is constrained to longwords. When an unaligned quadword is detected, the front end of the M-box alters the context field that it passes to the data cache unit. The quadword request is effectively broken into two unaligned longwords, which are properly rotated into the low longword of the quadword interface and sent to the E-box independently.
Cycle five is the data cache read cycle for the first unaligned longword. Because the starting address is xxxxxxx1, the entire longword is contained in the cache line. Therefore, one additional rotation cycle is all that is required before the data is sent to the E-box. The M-box pipe is effectively lengthened by a cycle when it is performing unaligned operations. Because cycle five is a data cache read cycle, no response is issued to the E-box. In addition to the data cache read, the physical address is placed in the write queue. A memory write is required after the bit is tested. A status bit for a new quadword is set in the write queue. The new quadword indicates that this is the starting address of an operand and writes should not take place until an entry appears in the write queue with a last bit assertion.
The issue control uses the fork RAM data to deter

Because the first operand is written into the
source J ist, t he operand is available ro the integer

mine that the integer unit and two source operands

unit at the start of cycle six . The microword hold

are required . Because only the first operand is miss

signal is asserted to hold the first microword during

ing from the source list, the instruction is issued

the stall. The microsequencer regenerates the con

with bypass. The microsequencer generates the sec

trol store address of the second m icroword.

ond control store address based on the fast access
fields of the first m icroword .

Cycles Six through Nine
I n cycle six , the d ata cache is read again w i t h

Cycle Five

address

Decoding the target instruction stream begins in

read in cycle five. However, because the context is

cycle five. The operand processing unit sends the

a longword, one additional byte of data must be

xxxxxx:x 5,

which is t h e same cache line

target address to the branch prediction unit through

read from the cache to satisfy the reques t . Also, in

the program counter's target address. However, as

cycle six, rotation of the data read in cycle five is

noted earlier, the target address sent is discarded.

completed, and the M -box responds to the E-box.

Because t he operand processing unit does not use

Finally, address xxxxxxx 5 is placed in the write

the 1-box data register, the short l itera l expansion

queue.

unit can pass the short literal to the source Jis t .

By using source pointers from the source queue,

T h e branch prediction u n i t now waits either for

the position and base address operands are selected

the E-box to indicate the correctness of the predic-

by the fork RAM and passed to the i nteger u n i t . If

of interest from the cache read in cycle seven to the correct position. No response is issued to the E-box because this unaligned reference requires two data cache reads to fulfill. The address xxxxxxx9 and the last bit are inserted into the write queue.

The M-box delivers the required longword, and execution begins immediately. The second execution cycle calculates the target byte address. The position, divided by eight, is added to the base address. The microsequencer generates the fourth control store address by using the next address field of the microword. No operands are selected for the next cycle, and the next instruction is issued normally.

Cycle eight is a rotation-only cycle. The one byte <8> of interest, read from the cache in the previous cycle, is rotated into the correct position (i.e., byte <0:3>), and the M-box sends the data to the E-box by issuing a response.

The third execution cycle uses the bit position to set up the special encoder in the integer unit and clear the appropriate bit. The source two register file pointer is incremented again to select the high longword from the source list. This microword branches on three conditions determined by hardware functions. The first condition indicates if the low longword of the prefetched field has a page fault. If a fault does exist, the microword flow checks whether the longword is needed or not. As noted earlier, the longword was prefetched in the hope that the bit position was within the first 64 bits of the base. If the bit is not within the first longword, the page fault can be disregarded. The second branch checks whether the position is greater than (l_) bits. If it is greater, the microcode

next cycle is issued normally.

Cycles Ten through Fifteen
In cycle ten, the E-box initiates a byte write to the M-box. Data is passed to the M-box, and the appropriate byte is shifted to the low byte location. The sixth and final microinstruction is issued normally.

In cycle eleven, the M-box receives an explicit E-box write request to retire the BBSC instruction with a memory write. Explicit writes differ from writes initiated by the I-box in that the E-box supplies a virtual address with the data, whereas the I-box provides a virtual address and the E-box subsequently provides the data for I-box writes. However, three entries exist in the write queue for the prefetched quadword. These entries were placed in the queue for memory conflict-checking purposes and cannot be used for writing purposes because only a byte of data is being written and not a quadword. The write field command from the E-box forces the write queue control to discard the three entries. The front end of the E-box accesses the translation buffer and checks for write success during this cycle. If the write is successful, the physical address and the context of the byte are sent to the data cache.

The final execution cycle determines if the branch prediction was correct. The bit specified by the correct position is shifted to the least significant position in the shifter, where it can be used for a macrobranch comparison. The macrobranch result is compared to the I-box branch prediction in cycle twelve. The microword also indicates that the microsequencer should start forking for new macroinstructions.
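The address arithmetic used in this walkthrough can be sketched in a few lines of Python. This is an illustrative model only, not the M-box or E-box logic: the function names are hypothetical, and the rule used to count sequenced addresses (one request per aligned longword that the quadword touches) is an assumption chosen because it reproduces the three requests of the example.

```python
def sequenced_addresses(start):
    """Virtual addresses requested for a quadword prefetch starting at `start`.

    Assumption: an aligned quadword (bits <2:0> all zero) needs no sequencing.
    Otherwise one request is made per aligned longword that the 8 bytes touch,
    each formed by adding a constant of 4 to the previous address.
    """
    if start & 0x7 == 0:
        return [start]
    first_longword = start & ~0x3
    last_longword = (start + 7) & ~0x3
    count = (last_longword - first_longword) // 4 + 1
    return [start + 4 * i for i in range(count)]


def locate_bit(base, position):
    """Target byte address and bit number for a bit field operand.

    The position divided by eight is added to the base address; the
    remainder selects the bit within that byte.
    """
    return base + (position >> 3), position & 0x7


def within_prefetch(position):
    """True if the bit lies inside the quadword prefetched from the base."""
    return 0 <= position < 64


# A starting address ending in 1, as in the example, yields three requests.
print([hex(a) for a in sequenced_addresses(0x1001)])
print(locate_bit(0x1001, 19), within_prefetch(19))
```

With a starting address ending in 1, the sketch produces addresses ending in 1, 5, and 9, matching the three translation buffer requests described in the text.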

Cycle twelve is the data cache lookup cycle for the byte-write operation. The data size is less than a longword. Therefore, the byte that is to be written must be merged with the seven unaffected bytes of the cache line.

Two signals are sent to inform the I-box of the branch prediction status. The branch valid signal indicates that a branch prediction validation has occurred, and the branch signal indicates if the validation was correct.

The branch prediction logic receives the branch valid signal. If the prediction was correct, the pro

E-box. This process evens the flow through the pipeline and keeps the E-box busy. Figure 6 illustrates the code block as it moves down the pipe.

The first stage is the virtual instruction cache access, or fetch, stage as the instruction is read from the virtual instruction cache. Some instructions do not need an actual virtual instruction cache access but are in the I-buffer from a previous virtual instruction cache fetch. The instruction decode takes place in the decode, or XBAR, stage. The I-buffer is shifted and the fork RAM

Matthew J. Adiletta
Richard L. Doucette
John H. Hackenberg
Dale H. Leuthold
Dennis M. Litwinetz

Semiconductor Technology
in a High-performance
VAX System
The VAX 9000 system is the newest member of Digital's VAX family of computer systems. The 9000 is a high-performance ECL processor, with a very fast, 16-nanosecond cycle time. To achieve this high level of performance, a new generation of semicustom and custom integrated circuits was required for the scalar CPU and the vector processing option. Goals for circuit density, performance, and skew maintenance were fulfilled with the development of a high-speed gate array, special custom chips used in key applications, and a high-speed RAM employing a new architecture.

The semiconductor requirements for the VAX 9000 system posed a number of challenges for Digital's Integrated Circuits Development Group. Those requirements included a tremendous number of equivalent logic gates (1,037,400 gates) and a large amount of RAM in the processor (3,280,000 bits). Moreover, the project's performance goal of over 30 VAX-11/780 units of performance (VUPs) required the development of state-of-the-art semiconductors and the use of innovative techniques to design them.

Given the project's goals, the IC technologists evaluated several competing semiconductor technologies and decided to implement most of the logic within the 9000 system in a high-speed, high-density, 10,000-gate array. The gate array provides a broad range of speed and power-dissipation options. Working with Motorola, the IC Group first engineered the base 10,000-gate macrocell array (MCA), which is implemented in Motorola's MOSAIC III process. Logic engineers then designed the 77 different gate array chips (options) on the base array, using a rich library of logic functions and a set of automated place and route tools. Additionally, they designed five custom chips, invented a fast cycle time, self-timed random access memory (STRAM) architecture, and designed a multichip unit to interconnect all these high-performance ICs.¹

Four different design methods were used to implement the chips. The MCAx chips employ a gate array design technique. The CDxx, the VRGx, and the STRAM chips required a full custom approach. The STGx chip was implemented using a silicon compiler technique. The MULx and DIVx chips were implemented using a standard cell design approach. Statistics on 9000 system chip design are given in Table 1.

This paper describes the VAX 9000 MCA III gate array, the development of each of the five custom chips, and the STRAM architecture. Before our discussion of the gate array, we present a brief overview of the semiconductor technology used to fabricate the array and the custom chips.

Semiconductor Technology
In 1985, the VAX 8800 series was Digital's largest and most powerful system, offering single-CPU performance of eight VUPs. The 8800 CPU logic was implemented in Motorola's Macrocell Array I (MCA I) gate array, which was fabricated in MOSAIC I bipolar technology. In comparison, the VAX 9000 goal of 30 VUPs was aggressive, and the IC Group realized a new semiconductor technology was required.

At the start of the project, the technologists evaluated semiconductor vendors to determine what was the "best" technology available to implement the new system. CMOS, BiCMOS, bipolar, and GaAs IC technologies were evaluated. Among the factors considered were logic density, gate delays, on- and off-chip interconnect delays, manufacturing risks, and product delivery.

Although very high gate densities were available with CMOS technology, the logic gate delays proved to be too slow to meet the cycle time requirement. Also, the CMOS output circuits could not drive signals off-chip into a 50-ohm transmission line as quickly as a bipolar transistor, which limited the speed of signals between ICs.

Table 1   VAX 9000 Chip Statistics

Chip   Description                     Die Size        Signal   Transistor   RAM      Power
                                       (Millimeters)   Pins     Count        Bits     (Watts)
MCAx   MCA III gate array chip         9.8             256      40.1K                 30
CDxx   Clock distribution chip         6.2             170      7.2K                  13.9
STGx   Self-timed register file chip   9.8             152      29.3K                 17.8
MULx   Multiplication chip             9.8             182      48.4K                 30.9
DIVx   Division chip                   9.8             112      29.2K                 23.9
VRGx   Vector register file chip       9.8             198      76.0K        9216     24.9
1KSR   1K × 4 self-timed RAM           4.9 × 3.6                28.0K        4096     2.4
4KSR   4K × 4 self-timed RAM           6.4 × 4.2                103.0K       16384    2.4

BiCMOS offers the advantage of highly dense CMOS coupled with bipolar drive capability. However, the technologies available at the time were optimized for the best CMOS transistors with a compromised bipolar device. This approach limited the overall performance of the circuit to a level roughly equivalent to that of previous generation bipolar devices, which would not be aggressive enough to meet the CPU performance needs.

Gallium arsenide (GaAs) ICs offer a theoretical performance advantage of between two and three to one over silicon implementations. The group found IC densities were lower than those of bipolar devices, however, and the on-chip speed advantage was countered by the need for more off-chip signals in the critical paths of the CPU. Also, because the manufacturing technology of GaAs ICs was immature, very few companies had attempted to sell GaAs into the commercial marketplace. So while this technology was considered for a time in some applications where alternatives also existed, GaAs was eventually dropped from consideration because of the uncertainty of availability.

The IC Group also studied Motorola's third generation of their oxide-isolated self-aligned implanted circuits (MOSAIC III) bipolar technology.² It offered a factor of six in speed advantage over the previously used MOSAIC I technology and had the potential of providing eight to ten times the logic density. Although not as dense as CMOS or BiCMOS, MOSAIC III was much faster than either of those technologies and much denser than any available GaAs technology. In addition, although many of the manufacturing steps were new, most of them were based on previously proven techniques. The group therefore concluded that MOSAIC III was best suited to meet the challenges of the VAX 9000 system.
The MOSAIC III process is an advanced silicon bipolar process which yields a transistor structure with polysilicon base, emitter, and collector electrodes, polysilicon resistors, and three layers of metalization. Compared to the MOSAIC I device used in the 8800, the critical collector-base junction of this transistor structure takes up approximately 50 percent less area, as shown in Figure 1. Combined with shallower junctions and reduced base resistance, the intrinsic device performance was improved by a factor of three. Further, the polysilicon resistor produced with this process has far lower parasitic capacitance than the MOSAIC I monosilicon resistor. Some key performance modeling parameters and density metrics are provided with the figure.

The VAX 9000 packaging imposed other requirements on the semiconductor technology. Power dissipation increased from 5 watts for the MCA I to 30 watts for the MCA III because of the increase in gate density from 1,200 to 10,000 gates. Therefore it was determined that all chips should be mounted directly to the multichip unit cold plate for optimum cooling. For manufacturing economy, it was desirable to bond the multiple leads of the chip directly to the pads on the high-density signal carrier (HDSC). Consequently, all CPU chips must be provided to the multichip unit assembly site in a tape automated bond (TAB) package. As shown in Figure 2, chips are mounted in a plastic carrier suitable for automated handling, and the surface of the die is protected from mechanical damage with an epoxy encapsulant.
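As a quick arithmetic check on the packaging figures above, the sketch below divides chip power by equivalent gate count for the two gate array generations. It uses only the numbers quoted in the text; treating the result as an average per gate is a simplification, since it ignores how power is split between internal cells and output drivers.

```python
# Figures quoted in the text for the two gate array generations.
MCA_I = {"power_w": 5.0, "gates": 1_200}
MCA_III = {"power_w": 30.0, "gates": 10_000}


def milliwatts_per_gate(chip):
    """Average power per equivalent gate, in milliwatts."""
    return 1000.0 * chip["power_w"] / chip["gates"]


for name, chip in (("MCA I", MCA_I), ("MCA III", MCA_III)):
    print(f"{name}: {milliwatts_per_gate(chip):.2f} mW per gate")
```

Total chip power rises sixfold while the gate count rises by a factor of about 8.3, so the average power per gate falls even though the package must remove far more heat.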


Figure 1   Comparison of MOSAIC III and MOSAIC I Devices

Parameter             MOSAIC I       MOSAIC III
NPN fT                5 GHz          16 GHz
RB                    1475 ohms      400 ohms
CJC                   50 fF          20 fF
CJE                   45 fF          24 fF
CJS                   185 fF         54 fF
Drawn emitter size    3 µm × 4 µm    1.75 µm × 4 µm
Metal 1 pitch         8 µm           4.5 µm
Metal 2 pitch         15 µm          7 µm

Metal 3 pitch: 12 µm

MCA 10K Gate Array
A high-performance emitter coupled logic (ECL) gate array with 10,000 equivalent gates and 256 inputs/outputs has been developed for the VAX 9000 system. The gate array design approach used in the VAX 9000 system ensures the shortest possible turnaround time from option mask to hardware, thereby reducing the system design time. In this approach, cell boundaries are defined with all transistors and resistors fixed within the cells. When a cell function is selected from a predefined cell library, the cell customization occurs at the metal between the transistors and resistors. Then, to define the function of the gate array option, the metalization between cells is customized. This approach allows the semiconductor foundry to build many wafers up to the customization level; when a gate array is to be built, only the custom metal is required. As noted above, 77 different 10K ECL gate array options are used in the VAX 9000 system. This gate array has a rich selection of logic cells with different power settings for the logicians to use to meet performance and power requirements.

Using Rent's Rule, technologists maintained a balance between the number of gates and the package I/O count. This balance ensures that a maximum number of logic cells for a given signal pin count are available for the logic designers. Technologists evaluated several key factors to determine the gate array physical layout and to ensure its success:

• Area of the silicon chip versus yield
• I/O pad pitch
• Maximum power dissipation
• Speed of the gates
• Maximum number of logic cells

Successful trial layouts of the 10K ECL gate array floor plan were completed before any VAX 9000 options were started.

The gate array floor plan, shown in Figure 3, comprises a central core area of 414 major (M) cells, divisible into quarter cell functions, arranged in an array of 20 rows and 21 columns, less 6 sites for the master bias generators and special clock generator circuits. The number of transistors used in a quarter cell is based on the logic cell most frequently used in the 10K ECL gate array, the scan latch. A ring of 200 output (O) cells is interspersed with 224 interface (I) cells. The ring surrounds the internal cells and interfaces the pad drivers with the internal cells. The 256 I/O pad cells along with the 104 power pads are located around the perimeter of the 10K gate array. The metalization system uses three interconnect layers. The customized routing channels reside on the first and second metal layers with interconnecting vias between the two layers of metal. The top metal layer and parts of metal 1 and 2 provide power and ground distribution.
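The floor plan numbers and the Rent's Rule balance mentioned above can be checked with a short sketch. The cell and pad counts come from the text. The Rent's Rule form T = t * G**p is the standard one, but the exponent p = 0.5 is an assumed value used purely for illustration; the article does not state the coefficient or exponent the technologists used.

```python
# Floor plan arithmetic from the text.
ROWS, COLUMNS, RESERVED_SITES = 20, 21, 6
major_cells = ROWS * COLUMNS - RESERVED_SITES

SIGNAL_PADS, POWER_PADS = 256, 104
perimeter_pads = SIGNAL_PADS + POWER_PADS


def rent_terminals(gates, t, p):
    """Rent's Rule estimate of signal terminals: T = t * G**p."""
    return t * gates ** p


def rent_coefficient(gates, terminals, p):
    """Coefficient t implied by a gate count, a pin count, and an exponent p."""
    return terminals / gates ** p


# p = 0.5 is an assumed exponent, not a value taken from the article.
t = rent_coefficient(10_000, 256, p=0.5)
print(major_cells, perimeter_pads, round(t, 2))
```

Under that assumed exponent, 10,000 gates and 256 signal pins imply a coefficient of about 2.56 terminals per gate to the power p; a different exponent would give a different coefficient.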
The 10K ECL gate array used in the VAX 9000 is approximately ten times more dense than the ECL gate array used in the VAX 8800 system. The gate delays in the 9000 are improved six times over gate delays in the VAX 8800. Table 2 compares the 10K ECL gate array used in the VAX 9000 to the ECL gate array used in the VAX 8800.

Previous gate array designs, in general, have provided only two levels of series gating, thereby limiting the complexity of functions that can be designed with one current switch. Within this gate array, three levels of series gating at both internal and output macrocells provide additional "AND" (product) gate functions at very high speed with one switch delay and at a lower power level. Figure 4 compares three-level series gating and two-level series gating for a "2-3-4-4 AND/OR" logic function (internal gate). Table 3 lists the differences in typical gate performance for a low power gate. The table also compares low power gate and high power gate. Notice the power difference between the two-level and three-level high power gate.
Table 2   Comparison of Number of Cells and Delays in the VAX 8800 and VAX 9000 Gate Arrays

                             VAX 8800                  VAX 9000
                             Gate Array                Gate Array
Internal major cells                                   414
Output cells                                           200
Input cells                                            224
Input cells gate delay       1.05 nanoseconds          175 picoseconds (high power)
Metal delay (fall delays)    2.6 picoseconds per mil   1.3 picoseconds per mil

Table 3   Comparison of Two-level and Three-level Series Gating

                                            Two Levels         Three Levels
                                            of Gating          of Gating
Gate delay from input pin A
to output pin YA (low power)                300 picoseconds    250 picoseconds
Low power gate                              9.88 milliwatts    8.84 milliwatts
High power gate                             18.20 milliwatts   13.00 milliwatts

Figure 2   Chip in TAB Package Mounted on Plastic Carrier and Encapsulated

Figure 3   Photomicrograph of the Gate Array

All current switches within the array are powered from the main supply voltage VEE1. Three-level series gated functions are implemented in the VAX 9000 gate array option, which requires VEE1 to be set to -5.2 V. Input cells are powered from a second, lower supply voltage VEE2 (-3.4 V) to save power. The output emitter followers of M, I, and O cells as well as series-terminated ECL (STECL) output followers employ constant current source pulldowns to VEE2 to save power. The constant current source pulldowns minimize the sensitivity of AC performance to variations in power supply. This same termination scheme was used in VAX 9000 custom chips.

One of the technologists' main goals was to minimize power consumption of each macrocell while obtaining the highest possible performance from the 10K ECL gate array. The overall 10K ECL gate array power is limited to 30 watts because of the cooling requirements, the internal power distribution, and the current density limits on power pins.

A unique feature included in the 10K ECL gate array that previous gate arrays do not have is series-terminated ECL (STECL) outputs. STECL outputs include a constant current source pulldown and a series terminating resistor. This feature allows the elimination of off-chip termination resistors used in conventional 50-ohm ECL outputs. STECL outputs allow shorter interconnections between chips on the multichip unit because the chips can be placed closer to each other, thus improving performance. Another advantage of using STECL outputs over 50-ohm outputs is that less than half of the simultaneous switching output noise is coupled to unswitched outputs. All custom chips used in the VAX 9000 employ STECL termination.

Clock Distribution Chip - CDxx
The major function of the clock distribution chip (CDxx), shown in Figure 5, is to distribute master and reference clocks to each MCA on a multichip unit. There are eight pairs of differential master and reference clocks. The chip also supplies clocks to all STRAMs on the unit. Each of the STRAM's four groups of six clocks can be programmed to one of eight possible clock phases. This flexibility in programming allows the system designer to select the appropriate clocks for STRAMs in order to meet system timing requirements.

In addition to providing the functions above, the design goals for the CDxx project included the following:

• Minimize the space occupied by the chip on the multichip unit
• Provide scan control and scan distribution
• Include a wideband amplifier
• Ensure low clock skew
• Provide a temperature-detecting circuit

Figure 4   Two-level Functions versus Three-level Functions

Figure 5   Photomicrograph of CDxx Chip

Minimizing the real estate occupied by the chip was complicated by additional functions located on the CDxx, such as scan and the temperature-detecting circuits. The minimization was accomplished by employing a custom chip design approach in which each element (cell) is optimized and then manually placed and routed to achieve a compact design. As it turned out, the size of the chip was not determined by the amount of real estate needed to implement the circuits, but rather by the number of pins required to communicate to the rest of the multichip unit.

Since a CDxx is mounted on every multichip unit in the CPU, the scan distribution and control logic are located on this chip. The CDxx chips in the system are chained together on the system scan bus. Each CDxx receives its scan control signals from the previous CDxx in the chain or from the service processor. As shown in the figure, there are three scan rings located on the CDxx. Ring 12 is a 16-bit ring reserved for the CDxx STRAM clock generation control ring. This ring controls the STRAM clock phase selection and enable for each of the four STRAM clock groups. Ring 13 is a 14-bit ring reserved for the CDxx scan control. Data is shifted into this ring and then loaded into CDxx control registers. Ring 14 is a 47-bit ring reserved for the CDxx information scan ring. Data is loaded into this ring from CDxx data registers and shifted out to the service processor.

The design of the wideband amplifier was prompted by the need for the clock distribution chip to receive two differential sinusoidal master and reference clock signals as inputs. These signals are transformer coupled from the clock source. The master clock runs at one eighth the system cycle time, and the reference clock runs at the system cycle time. The wideband amplifier receives differential sinusoidal signals of relatively small amplitude (less than 125 millivolts peak to peak) and transforms them to 100K ECL levels on output. The design of the input circuits meets these criteria and typically functions with inputs less than 65 millivolts.
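The STRAM clock control ring described earlier (Ring 12, 16 bits, holding a phase selection and an enable for each of four clock groups) lends itself to a small packing sketch. The bit layout below is an assumption made for illustration: four groups times a 3-bit phase field (eight phases) plus one enable bit happens to total 16 bits, but the article does not give the actual bit assignment, and the function names are hypothetical.

```python
GROUPS = 4                    # STRAM clock groups
PHASE_BITS = 3                # eight possible clock phases
FIELD_BITS = PHASE_BITS + 1   # plus one enable bit (assumed layout)


def pack_ring12(settings):
    """Pack [(phase, enabled), ...] for the four groups into a 16-bit word."""
    if len(settings) != GROUPS:
        raise ValueError("expected one setting per STRAM clock group")
    word = 0
    for group, (phase, enabled) in enumerate(settings):
        if not 0 <= phase < 2 ** PHASE_BITS:
            raise ValueError("phase must be in the range 0..7")
        field = (int(enabled) << PHASE_BITS) | phase
        word |= field << (group * FIELD_BITS)
    return word


def unpack_ring12(word):
    """Recover the per-group (phase, enabled) settings from the word."""
    settings = []
    for group in range(GROUPS):
        field = (word >> (group * FIELD_BITS)) & (2 ** FIELD_BITS - 1)
        phase = field & (2 ** PHASE_BITS - 1)
        settings.append((phase, bool(field >> PHASE_BITS)))
    return settings


example = [(0, True), (3, True), (7, False), (5, True)]
word = pack_ring12(example)
print(f"{word:016b}", unpack_ring12(word) == example)
```

A real scan ring would be shifted in serially, bit by bit; the word here simply stands in for the ring contents after the shift completes.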
All the clocks are distributed by the CDxx as pairs of differential signals. The distribution of these clocks is, of course, to be done with minimal clock skew. Clock skew is the difference in delay time between different clock outputs measured from a common point. The common point in this case is the number of master clock inputs to the chip. To maintain low clock skew, technologists designed fast gates and minimized the number of cascaded gates in the clock path. Also, all the metal that interconnects the cells in the clock path is controlled for equal delay. As a result, the measured clock skew is less than 100 picoseconds on a chip for master, reference, and STRAM clocks. The delay of master clock input to output is less than 1 nanosecond (ns).
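The skew definition above reduces to a difference between the slowest and fastest output delays measured from the common point. The sketch below shows that calculation; the output names and delay values are hypothetical examples, not measurements from the CDxx.

```python
def clock_skew_ps(delays_ps):
    """Skew: largest minus smallest delay from the common point, in ps."""
    delays = list(delays_ps)
    return max(delays) - min(delays)


# Hypothetical per-output delays in picoseconds, not measured data.
example_delays = {
    "master_0": 940,
    "master_1": 965,
    "reference_0": 910,
    "stram_0": 990,
}

skew = clock_skew_ps(example_delays.values())
print(skew, skew < 100)
```

In this made-up example every delay is under 1 ns and the spread is 80 ps, which would satisfy both limits quoted in the text.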
The temperature-detecting circuit on the CDxx warns the system when a device junction temperature approaches the maximum allowed temperature on a multichip unit. As implemented, the circuit is controlled from the system console. The console loads the CDxx with a number that represents the temperature the circuit must use as a point of comparison. If the junction temperature of the CDxx is higher than the programmed value, the circuit trips and notifies the console of a temperature problem. The console then takes corrective action.
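The console interaction described above amounts to a programmable comparator. The following sketch models that behavior only; the class and method names are hypothetical, and the threshold is treated as a plain number because the article does not say how the console encodes the temperature.

```python
class TemperatureDetector:
    """Behavioral sketch of a programmable over-temperature comparator."""

    def __init__(self):
        self.threshold = None

    def load_threshold(self, value):
        """Console loads the number used as the point of comparison."""
        self.threshold = value

    def tripped(self, junction_temperature):
        """True when the junction temperature exceeds the loaded value."""
        if self.threshold is None:
            raise RuntimeError("no comparison value has been loaded")
        return junction_temperature > self.threshold


detector = TemperatureDetector()
detector.load_threshold(85)   # example value, units assumed
print(detector.tripped(80), detector.tripped(90))
```

What the console does after a trip is outside the sketch; the text says only that it takes corrective action.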

Self-timed Register File Chip - STGx
The self-timed register file chip (STGx) is employed in the VAX 9000 to provide four register banks accessible through multiple read and write ports. The four banks include a microcode scratch-pad register bank, the VAX general-purpose register set, a memory data register storage bank, and an instruction data register bank. The performance requirements for the STGx were quite rigid and guided several key design decisions, including density and layout. The read access time was to be less than 5 ns. The write access time was to be less than 6 ns. In other words, the chip must read or write any one of its 64 locations in 5 or 6 ns, respectively. Both goals have been met. In fact, the read access time is typically less than 4 ns, and the write time is typically less than 5 ns. Figure 6 is a photomicrograph of the STGx chip.
The STGx is a 64-word by 18-bit ECL register file containing three write ports and two read ports. The 64 words are separated into four 16-word by 18-bit storage array sections. Each of the four storage banks has dual read capability. Storage bank one has dual write capability; storage banks two and three have triple write capability; and storage bank four has single write capability. Simultaneous write access to the array is possible through all ports with correct results occurring; the only exception is in the case of writes to the same location from multiple ports, which is an undefined operation. A write followed by a read access to the array, even to the same address, is possible with correct results occurring. The chip has two clock inputs for controlling reads and writes.
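The port rules in the paragraph above can be captured in a small behavioral model. This is a sketch under stated assumptions rather than a description of the circuit: the class name is hypothetical, the model raises an error where the hardware leaves the result undefined, and it treats each bank's write capability as a per-cycle limit on writes landing in that bank.

```python
from collections import Counter


class RegisterFileModel:
    """Behavioral sketch of a 64-word by 18-bit, four-bank register file."""

    WORDS = 64
    WIDTH = 18
    BANK_WORDS = 16
    WRITE_PORTS = 3
    READ_PORTS = 2
    WRITES_PER_BANK = (2, 3, 3, 1)   # storage banks one through four

    def __init__(self):
        self.words = [0] * self.WORDS

    def cycle(self, writes=(), reads=()):
        """Apply simultaneous writes, then return the values for the reads."""
        addresses = [address for address, _ in writes]
        if len(addresses) > self.WRITE_PORTS:
            raise ValueError("only three write ports are available")
        if len(set(addresses)) != len(addresses):
            raise ValueError("multiple writes to one location are undefined")
        per_bank = Counter(address // self.BANK_WORDS for address in addresses)
        for bank, count in per_bank.items():
            if count > self.WRITES_PER_BANK[bank]:
                raise ValueError(f"too many writes to storage bank {bank + 1}")
        if len(reads) > self.READ_PORTS:
            raise ValueError("only two read ports are available")
        for address, value in writes:
            self.words[address] = value & (2 ** self.WIDTH - 1)
        # Reads follow the writes, so a read of a just-written address
        # returns the new data.
        return [self.words[address] for address in reads]


rf = RegisterFileModel()
print(rf.cycle(writes=[(16, 1), (17, 2), (18, 3)], reads=[16, 18]))
```

Applying writes before reads mirrors the statement that a write followed by a read of the same address gives correct results.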
One requirement for the design was to include a self-timed write capability so that the system need not provide properly timed write pulses to the chip. In the system, the chip is clocked with STRAM clocks for reading and writing. The design uses these clocks to latch read address information, to latch write address information, and to latch input data. In addition, the design takes the leading edge of the write clock to generate a delayed write pulse. The delayed write pulse is used to write the appropriate word in the 64-word by 18-bit array, taking into account the time needed to decode the write address.

The design style used to implement the self-timed register file chip is similar to a silicon compiler technique. The chip's storage area is made up of four arrays. The input address register for both read and write ports, the input data latches, and the data output drivers are arrangements of cells in strips. The placement and routing of these arrays and strips was procedurally performed using custom layout tools. Once the blocks were assembled and placed, interconnections among blocks, strips, and pins were then routed manually.

Multiplication Chip -MULx
The architecture of the scalar processor defined an integrated
floating point processor. Unlike most RISC processors, which off-load
all floating point operations to a separate floating point processor,
the VAX 9000 system handles floating point operations within the
E-box.1 The multiplication unit therefore supports both integer and
floating point formats. To achieve this support, a custom chip was
required that provided superior performance, special logic gates, and
improved density. Custom chip technology provided enough density to
accommodate a 32-bit by 32-bit, eight-logic-level multiplication
array in a single chip (MULx). To minimize the cost and time of
custom design, designers employed standard cell design techniques in
which the cell height was fixed and the width could vary to take
advantage of packing density. By constraining the design in this
fashion, the High Performance Systems Group's CAD suite could be
employed to place and route the chip. Special logic gates eliminated
three logic levels, and high-powered fast gates provided the
performance to permit a 32-bit by 32-bit multiply operation in less
than 9 ns. Figure 7 shows a photomicrograph of the MULx chip.

Figure 6   Photomicrograph of STGx Chip


Three MULx chips were required in the scalar processor to achieve
double-precision performance in which every 64 ns a 56-bit
multiplication could complete. Each MULx chip has two 32-bit input
data buses. The MULx chip is also employed to perform all integer
multiply operations in a single 16-ns cycle.

The scalar processor, which has 32-bit-wide data paths, delivers
double-precision input data in two cycles. In the first cycle, each
MULx consumes the most significant high bits of each operand. All
three MULx chips latch this data while also unpacking it, multiplying
it, and then latching the product. One of the MULx chips' results is
then saved. In the second cycle, the remaining double-precision data,
the least significant low bits, is consumed, and each MULx chip
unpacks the data and performs a unique multiply: operand A high bits
and operand B low bits; operand A low bits and operand B high bits;
and operand A low bits and operand B low bits. An MCA III gate array
accumulates all these results, and another rounds and packs the bits
into a VAX floating point product. Since each MULx needs to know
which partial product it must compute in the second cycle, two
personality bits are included that are loaded by means of the system
scan chain.

MULx chips are also used in the vector processor. The vector
processor (V-box) has 64-bit-wide data paths. Four MULx chips are
employed to complete a double-precision multiply every 16 ns. Since
the operand unpacking differs between the scalar and vector
processors as a result of how fast operands are delivered, each MULx
has an additional personality bit for indicating whether the MULx is
in the V-box or E-box.

Figure 7   Photomicrograph of MULx Chip

The MULx chip, as used in both the scalar and vector processors, is a
32-bit by 32-bit ECL parallel multiplier which is fully pipelined for
a 16-ns cycle time. It performs both two's complement and
sign/magnitude multiplication. In a single cycle, the chip unpacks
VAX floating point formats F, D, and G, or integer formats long,
word, and byte; performs exponent calculations and sign handling; and
completes up to a 32-bit by 32-bit multiplication.

If the operation is double precision, the 64-bit result is a partial
result. It must be accumulated with three other partial results to
form the double-precision, correctly rounded, and normalized product.
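The way four 32-bit by 32-bit partial results combine into one wide
product can be illustrated with a short arithmetic sketch. It is not a
model of the hardware: it treats the operands as plain unsigned 64-bit
integers, ignores floating point unpacking, rounding, and
normalization, and the names mul32 and mul64_from_partials are
hypothetical.

```python
MASK32 = (1 << 32) - 1


def mul32(a, b):
    """Stand-in for one 32-bit by 32-bit unsigned multiply pass."""
    assert 0 <= a <= MASK32 and 0 <= b <= MASK32
    return a * b  # the result fits in 64 bits


def mul64_from_partials(a, b):
    """Form a 64-bit by 64-bit product from four 32x32 partial products."""
    a_hi, a_lo = a >> 32, a & MASK32
    b_hi, b_lo = b >> 32, b & MASK32
    # First cycle in the text: high bits of both operands.
    hh = mul32(a_hi, b_hi)
    # Second cycle in the text: the three remaining unique multiplies.
    hl = mul32(a_hi, b_lo)
    lh = mul32(a_lo, b_hi)
    ll = mul32(a_lo, b_lo)
    # Accumulate the partial results with their binary weights.
    return (hh << 64) + ((hl + lh) << 32) + ll
```

The accumulation step in the last line corresponds to the work the
text assigns to a separate gate array; in the sketch it is simply
integer addition.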


If the operation is an integer type, then the 64-bit two's complement
result is the VAX integer product. Along with producing this integer
product, MULx also produces the correct condition codes. Integer
operations require one machine cycle to complete.

Operands are not latched at input. Instead they are immediately
unpacked and sent to the multiplication array. This multipurpose
array then produces a set of sum and carry product vectors. These
vectors are then added in a full carry lookahead adder (CLA). This
adder comprises a 31-bit adder and a 32-bit adder, cascaded. The
produced sum is the 64-bit product, which is then latched. The output
of the latch is used to compute integer-type condition codes.

The integer instructions supported include VAX MULB, MULW, and MULL.
EMUL is also directly supported, along with the Z and N bit condition
codes. Finally, to assist in H format-type multiplications, a true
32-bit by 32-bit magnitude multiplication is also supported, called
EXTMUL (extended multiply). There is a 64-bit data path back into the
E-box for EMUL- and EXTMUL-type operations.
Six features of the MULx design that improve performance and minimize
logic should be noted. First, unlike traditional designs, the MULx
design does not include Booth recoding of the multiplier operand.
Booth recoding offers no logic savings either in timing or real
estate when the multiplication array reduction scheme is optimal.
Second, a Baugh-Wooley two's complement algorithm was used to
implement integer multiplication.4 Third, engineers designed special
full adder logic gates to integrate multiplication summand generation
into the full adder cell and to eliminate the need for an additional
logic level. Fourth, a unique multiplication reduction algorithm was
developed which provides the initial routing advantages of a Wallace
tree, with the minimal logic of a Dadda tree.5,6 Fifth, a ripple is
formed in the reduction array. The ripple facilitates the start of
the least significant 31-bit CLA addition at least one logic level
sooner than the most significant 32 bits and does not require a
carry-in input to the upper 32-bit adder. Finally, by developing a
very fast 4-3-2-1 AND/OR gate, engineers were able to remove two
additional logic levels in both CLA adder networks.
To avoid bugs in the array design, since bugs in an array consisting
of 1000 full adders could have significantly affected the product
shipment schedule, engineers developed a FORTRAN program to logically
interconnect and physically place the array. Any bugs would be
algorithmic and not random, and algorithmic bugs should be obvious.
In addition, by algorithmically placing the array, significant
density improvements were realized. This program provides a
Wallace-Dadda implementation that logically reduces 32 rows in 8
logic levels, and consumes as many initial summand bits. It also uses
the fewest full adders theoretically possible, while delivering the
least significant 32 bits of sum and carries at least one full logic
level sooner than the most significant bits.
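The figure of 8 logic levels for 32 rows agrees with the standard
height sequence for trees of full adders (2, 3, 4, 6, 9, 13, 19, 28,
...), in which each level can reduce at most three rows to two. The
sketch below only counts levels; it does not reproduce the FORTRAN
placement program, and the function names are hypothetical.

```python
def dadda_heights(rows):
    """Return the target heights 2, 3, 4, 6, 9, ... that lie below `rows`."""
    heights = [2]
    # Each next height is floor(1.5 * previous height).
    while heights[-1] * 3 // 2 < rows:
        heights.append(heights[-1] * 3 // 2)
    return heights


def reduction_levels(rows):
    """Number of full adder levels needed to reduce `rows` rows to two."""
    if rows <= 2:
        return 0
    return len(dadda_heights(rows))
```

For a 32-row array the intermediate heights are 28, 19, 13, 9, 6, 4,
3, and 2, which is eight reduction levels.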

Division Chip - DIVx
The iterative divide function performed by the division chip, DIVx,
requires a significant amount of hardware, the density of which a
standard cell chip affords. Two gate arrays would be required to
perform the same function, in which case a timing critical path
crossing would occur between the two chips. Therefore, the IC
designers implemented the DIVx chip as a standard cell design by
building on the techniques developed for the MULx chip described
above. Also, like the MULx design, the goals for the DIVx design
project were to optimize performance and minimize real estate use by
fitting the iterative divide function in a single chip.

The IC designers employed a standard cell technique in which four
horizontal sections are defined, each section having a different
number of columns. Reference cells are located in the center row of
each section and provide ECL reference voltages to the cells above
and below in that section's columns. Placement was driven for
performance, with quotient selection logic being distributed to where
it was required. This method made for an irregular structure, as can
be seen in Figure 8.

The VAX 9000 system optimizes both multiplication and division by
providing separate functional units. Each functional unit performs
both integer and floating point operations. This approach differs
from the one taken by most processor architects, who conceptually
link multiplication and division. Usually, algorithms are chosen that
can share hardware at the expense of the performance of either
operation. The separate division unit in the 9000 provides superior
performance for both integer and floating point operations. The DIVx
chip is also used by the V-box to perform very fast vector division
operations, as shown in Table 4.
Table 4   Division Performance

Data Type                     Cycles    Time (Nanoseconds)

Integer:          byte        3-4       48-64
                  word        3-5       48-80
                  long        3-8       48-128
Floating point:   F-format    7         112
                  D-format    13        208
                  G-format    12        192

Figure 8   Photomicrograph of DIVx Chip

Division is an iterative process. Unlike the case of multiplication,
one cannot predict the summands and then reduce the summand matrix.
The two approaches to division most commonly used are the Taylor
Series convergence algorithm and a subtract and shift algorithm. The
algorithm employed in the 9000 is a variation on the subtract and
shift method, which allows for savings in hardware as well as
increased performance.

In this method, an imprecise quotient is selected based on a
truncated estimated partial remainder and a truncated version of the
exact divisor. This imprecise quotient digit is corrected when the
next guess quotient digit is selected. The selected digits may be
positive or negative. The positive digits are accumulated in a
positive-value shift register. The negative digits are accumulated in
a negative-value shift register. The final corrected binary quotient
is then formed by subtracting the negative register from the positive
register.

The algorithm is based on a signed digit notation scheme. To
determine two quotient bits, the bits may be chosen from a digit set
that includes {-2, -1, -0, +0, +1, +2}. The digit set is simply an
expanded form of the common nonrestoring digit set that typically
uses {-1, 0, +1}. In nonrestoring algorithms, the quotient is
normally corrected as needed; whereas here, it is not corrected until
the entire iterative process is completed. The next significant
difference between this division technique and the nonrestoring
method is that the quotient bits selected are based on an estimate of
the partial remainder and divisor rather than the exact values. The
first advantage of this method is that an estimate can be obtained
faster than the exact value. Second, a truncated estimate is
acceptable, rather than a full-width estimate. Consequently, this
method saves a significant amount of hardware and increases the speed
of the operation. If one were to complete each partial remainder, up
to three additional chips would be required and the delay would more
than double.
The trick to the method lies in the quotient selection. The selection
is based on partial remainder range transformations which guarantee
that a quotient digit selected in one iteration may be corrected to
the exact quotient digit on the next iteration. Therefore, although
six quotient digits are determined per major iteration, an additional
minor iteration is required to guarantee the least significant digit
of the major iteration. The major and minor iteration terms refer to
the architecture of the divide iterative hardware. The DIVx produces
six quotient bits per machine cycle. This is a radix 64 division
technique. However, the high radix division is accomplished by
overlapping lesser radix divisions. In particular, there are three
sets of radix 4 division groups. The first two sets are overlapped,
so that the critical path through the radix 64 division is actually
the critical path through two radix 4 divisions. A minor iteration is
the path through one radix 4 division group. A major iteration is the
path through the overlapped set of two radix 4 division groups,
followed by the final radix 4 group. It is important to note that
extra iterations do not adversely affect the corrected quotient.

Finally, to produce the corrected quotient, the set of negative
quotient digits is subtracted from the set of positive quotient
digits, where each digit is properly radix 2 weighted, based on the
order of selection. (That is, the first quotient digit selected is
the most significant bit of the correct quotient.)
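The idea of collecting positive and negative quotient digits in
separate registers and subtracting them at the end can be shown with
a simplified radix 4 sketch. It is an illustration under stated
assumptions rather than the DIVx algorithm: it selects each digit
from the exact partial remainder instead of a truncated estimate, it
does not overlap radix 4 groups, it handles only non-negative
integers, and the function name is hypothetical.

```python
def signed_digit_divide(n, d):
    """Divide n by d (n >= 0, d > 0) using radix 4 signed quotient digits."""
    if n < 0 or d <= 0:
        raise ValueError("sketch handles n >= 0 and d > 0 only")
    # Pick enough radix 4 digits that the starting remainder is within
    # two thirds of the scaled divisor, which keeps every later
    # remainder in the same range for digits limited to -2..+2.
    k = 0
    while 3 * n > 2 * d * 4 ** k:
        k += 1
    scaled_d = d * 4 ** k
    r = n
    pos = neg = 0  # positive-value and negative-value digit registers
    for _ in range(k):
        r *= 4
        # Nearest digit, limited to the set {-2, -1, 0, +1, +2}.
        q = max(-2, min(2, (2 * r + scaled_d) // (2 * scaled_d)))
        r -= q * scaled_d
        pos = pos * 4 + max(q, 0)
        neg = neg * 4 + max(-q, 0)
    # Single correction step at the end: subtract the registers.
    quotient = pos - neg
    remainder = r // 4 ** k  # exact, r is a multiple of 4**k here
    if remainder < 0:
        quotient -= 1
        remainder += d
    return quotient, remainder
```

Because the digits may be negative, no intermediate quotient needs to
be corrected; the only subtraction of registers happens once, after
the last digit has been selected.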

Vector Register File Chip - VRGx
The VAX 9000 architecture adds vector instructions to the standard
VAX environment; thus, a vector register file was required. There
were two primary design requirements for the vector register file.
First, the register file and associated cross-bar logic had to fit in
a single multichip unit; and second, the register file had to perform
read and write at different addresses within a single 16-ns clock
cycle. These requirements could not be met with available memory and
logic chips, thus necessitating the development of a fully custom
vector register chip.

The vector register file is 64 bits wide and consists of 16 vector
registers with 64 elements each. The vector register chip, VRGx, was
developed as an 8-bit slice of the 64-bit vector register file. The
chip contains 9216 bits of RAM for data storage and the cross-bar
logic (6000 equivalent gates) that allows access from the five read
ports and three write ports. Integrating the register memory and the
cross-bar logic on the same chip allowed timing to be optimized so
that the system timing requirements were met.
VRGx Chip Physical Features and
Organization

The VRGx chip is fabricated using the MOSAIC III ECL process, which
was not designed as a memory process. Coordination with the vendor
resulted in the addition of an implant step for the memory-cell bit
line emitters. Key features of the process are three metal
interconnect layers, oxide isolation, and polysilicon emitters with a
drawn width of 1.75 microns.

Figure 9 shows the locations of the major circuit blocks in the VRGx
chip. The major blocks of the VRGx chip are five read ports, three
write ports, and 16 vector registers in the RAM bank array. The block
diagram, Figure 10, shows the main data paths. The 16 vector
registers are implemented as 64-word by 9-bit single port RAMs. Eight
bits are a slice of the 64-bit vector register file and the ninth bit
is for byte parity.
Timing

A register RAM can be read from one address and written from a
different address in one 16-ns clock cycle. This dual operation is
made possible by a 2 to 1 multiplexer on the RAM address inputs. The
read address is applied during the first portion of the cycle, and
the write address is applied during the second portion of the cycle.
Splitting the clock cycle into read and write portions eliminates
conflict between read and write ports in the event that a single
register RAM is selected for both read and write. Read data is held
in a latch during the second portion of the cycle and is unaffected
by the write operation.

A single clock cycle consists of nonoverlapping clock phases A and B.
Latches on the read and write port inputs are clocked by phase A, and
read port output latches are clocked by phase B. For a read operation
initiated on phase A, the output read data becomes valid during
phase B.

Figure 9   Photomicrograph of VRGx Chip
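The split-cycle behavior can be pictured with a small behavioral
sketch. It models only the ordering described above (read in the
first portion, write in the second, read data held in a latch); the
class and method names are hypothetical, and nothing here reflects
actual circuit timing.

```python
class RegisterRam:
    """Behavioral sketch of one 64-word by 9-bit single port register RAM."""

    WORDS = 64

    def __init__(self):
        self.cells = [0] * self.WORDS
        self.read_latch = 0

    def cycle(self, read_addr=None, write_addr=None, write_data=0):
        """Model one clock cycle: read portion first, then write portion."""
        # First portion: the address multiplexer passes the read address.
        if read_addr is not None:
            self.read_latch = self.cells[read_addr]
        # Second portion: the address multiplexer passes the write address.
        if write_addr is not None:
            self.cells[write_addr] = write_data & 0x1FF  # keep 9 bits
        # The latched read data is not disturbed by the write.
        return self.read_latch
```

A cycle that reads element 5 while writing element 9 returns the
stored contents of element 5, and the new value of element 9 is
visible to a read in a later cycle.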

Cross-bar Logic
Cross-bar logic in the RAM bank array makes each of the 16 vector
register RAMs independently accessible from the read and write ports.
Enable inputs on the ports prevent invalid addresses from conflicting
with intended addresses. Read and write ports may point to the same
register RAM, but different write ports may not point to the same
RAM. Also, different read ports may only point to the same RAM if the
vector element address is the same. All conflicts must be resolved
external to the chip.
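Since conflicts must be resolved outside the chip, external logic has
to apply the port rules before issuing a cycle. The sketch below
expresses those rules as a check; the function name and the
(register select, element address) tuple layout are assumptions made
for illustration, and port numbers in the messages are positions in
the lists passed in.

```python
from itertools import combinations


def port_conflicts(read_ports, write_ports):
    """Return descriptions of rule violations for one cycle.

    Each enabled port is a (register_select, element_address) tuple;
    disabled ports are simply left out of the lists.
    """
    problems = []
    # Different write ports may not point to the same register RAM.
    for (i, a), (j, b) in combinations(enumerate(write_ports), 2):
        if a[0] == b[0]:
            problems.append(f"write ports {i} and {j} select register {a[0]}")
    # Read ports may share a register RAM only with the same element address.
    for (i, a), (j, b) in combinations(enumerate(read_ports), 2):
        if a[0] == b[0] and a[1] != b[1]:
            problems.append(
                f"read ports {i} and {j} select register {a[0]} "
                "with different element addresses"
            )
    return problems
```

A read port and a write port selecting the same register are not
flagged, which matches the rule that read and write ports may point
to the same register RAM.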

A read port consists of an enable, a 4-bit register select, a 6-bit
vector element address, and a 9-bit output. An enabled read port
applies a register select code that points to a particular RAM bank.
At that RAM bank, a 5 to 1 multiplexer selects the vector element
address from the active read port and applies it to the read address
of the RAM. Then the RAM output passes through a 16 to 1 multiplexer
controlled by the register select code, so that the selected RAM
output reaches the output of the active read port.

A write port consists of an enable, a 4-bit register select, a 6-bit
vector element address, and a 9-bit write data input. An enabled
write port applies a register select code that points to a particular
RAM bank. At that RAM bank, a 3 to 1 multiplexer selects the vector
element address from the active write port and applies it to the
write address of the RAM. Also, a 3 to 1 multiplexer selects the
write data from the active write port and applies it to the RAM data
input.

Figure 10   VRGx Chip Block Diagram
RAM Technology
The normal transistors in an ECL process are of the NPN type, where
the collector is a buried N-doped region. For memory cells, a lateral
PNP transistor is placed in the same collector region, and the
combined structure has the latching characteristics of a silicon
controlled rectifier (SCR). The memory cell array in the 64 by 9
register RAMs is implemented with ECL SCR memory cells.

The SCR memory cell shown in Figure 11 consists of two cross-coupled
SCR structures. Extra NPN emitters connect to the bit lines and
provide a means of writing and sensing the cell. The "on" side of the
cell saturates, allowing the bit line emitter to conduct in the
inverse mode. Inverse gain of the bit line emitters must be limited
to avoid excessive leakage into the unselected cells. An added
process step applies a special base implant to the bit line emitters
only to control their inverse gain.

Advantages of the SCR cell include good density, low standby power,
large sense voltage differential, and low sensitivity to
alpha-particle-induced soft errors. The cell has one limitation:
excess charge storage due to write current can delay subsequent
writing to the opposite state. This problem is eliminated with a
special bit line current steering circuit that makes write current
state dependent (Figure 11).

The SCR memory cell in Figure 11 is written by applying a high
current (four times read current) to the "off" bit line emitter. The
current steering transistors prevent this current from reaching a bit
line emitter that is already "on." Thus, attempting to write a cell
that is already in the desired state does not result in any
additional cell current beyond the normal read current, and no
additional charge storage occurs.
Other Chip Features

Other noteworthy chip features include scan logic, parity error
detect logic, and a data pipeline for write port 0 data. Scan
operation gives access to the register RAMs. In a single scan-in and
scan-out operation, it is possible to read five registers and to
write three registers.

Parity checking logic is used to detect input errors and set error
flags. There is a parity check on the 9-bit write port data inputs.
Another parity checker is applied to address and control inputs.
These are assigned to three parity groups, with a parity bit input
for each group.

The write port 0 data pipeline allows a delay of one, two, or three
clock cycles to be selected, delaying the write port data as
necessary to resolve register access conflicts.

Figure 11   SCR Memory Cell with Bit Line Current Steering Circuit
KEY: WC - write control; UWL - upper word line; BL - bit line (left);
BR - bit line (right); LWL - lower word line; VA - voltage reference

Self-timed RAM
In the VAX 9000 system - as in any high-performance CPU - fast memory
is used for cache and control store applications. Engineers
traditionally use very fast static RAMs within the CPU for memory.
Logic designers, however, have long recognized that CPU performance
is often limited as a result of the time needed to access data in
these RAMs. This limitation is not only the result of the access time
and write cycle performance of the devices themselves, but also of
the off-chip circuitry and interconnect used for write pulse
generation and distribution. The logic designers and technologists
for the VAX 9000 knew that unless some architectural improvements
were made to the traditional static RAM, much of the RAM performance
improvement would be lost in the wiring interconnect. They also
realized that Digital's memory suppliers would have to be convinced
that a new RAM architecture would be marketable to their other
customers. After several design iterations, the technologists
submitted a set of specifications for a synchronous, self-timed RAM
(STRAM) to several suppliers for their review. After extensive market
surveys, our memory suppliers agreed that this new architecture could
eventually become a new standard for high-speed static RAMs.

The VAX 9000 system requires two configurations of the basic STRAM
device: 1K words by 4 bits, and 4K words by 4 bits. A block diagram
of the STRAM is shown in Figure 12. The STRAM is similar to the
traditional RAM in that it has chip select, input address and data,
and output data. However, the STRAM also has several nontraditional
inputs such as write, a differential clock, and a reference voltage
(Vbb). Latches added to all inputs and outputs provide pipelined
timing. An internal write pulse generator controls write operations
and eliminates the need to generate and distribute the write pulse
signal externally on the module. Also two optional output
configurations are provided: a 50-ohm drive open emitter for standard
parallel termination on the module, and a resistor and pulldown
current source which is wired externally to implement STECL or
on-chip source termination.

The clock buffer design allows inputs to be driven differentially
from off-chip to minimize clock skew. The clock buffer is also
designed to accommodate customers who are not greatly concerned about
skew or who may be more concerned about conserving routing area. One
input of the clock buffer may be tied to the output pin of the
reference generator which provides the standard ECL threshold voltage
(Vbb), allowing the other input of the clock buffer to be driven in a
single-ended mode.

Input and output latches are clocked on opposite edges of the
internal differential clock buffer. Timing diagrams are shown in
Figure 13. On a falling edge of CLK H, data and address inputs flow
into the RAM array.

If write is asserted during the next rising edge of CLK H, then a
write cycle is initiated, and the input data is stored in the memory
at the address presented at the ADR inputs. At the same time, the
data is passed through the multiplexer and the output latch.

If write is deasserted on the rising edge of CLK H, then the STRAM is
in a read cycle and input data is ignored. The data stored in the RAM
at the address presented at the ADR inputs flows out to the
multiplexer and output latch.

If chip select (CS) is deasserted prior to the rising edge of CLK H,
then write and read operations are disabled and the output latches
are reset low.
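The three cases above (write cycle, read cycle, chip deselected) can
be summarized in a small behavioral sketch. It collapses each clock
cycle into one call and ignores edge timing, the write pulse
generator, and signal polarities; the class name, the method
signature, and the default of 1024 words are assumptions chosen for
illustration.

```python
class Stram:
    """Behavioral sketch of a synchronous self-timed RAM, 4 bits wide."""

    def __init__(self, words=1024):
        self.mem = [0] * words
        self.dout = 0  # contents of the output latch

    def clock(self, addr, din=0, write=False, cs=True):
        """Model one full clock cycle and return the latched output."""
        if not cs:
            # Chip deselected: no read or write, outputs reset low.
            self.dout = 0
        elif write:
            # Write cycle: store the data and pass it to the output latch.
            self.mem[addr] = din & 0xF
            self.dout = self.mem[addr]
        else:
            # Read cycle: input data is ignored.
            self.dout = self.mem[addr]
        return self.dout
```

In this model a write returns the data just written, a following
read of the same address returns the stored value regardless of the
data input, and a deselected cycle returns zero without altering the
memory contents.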
For proper operation of the STRAM, certain timing requirements must
be fulfilled. The write operation is terminated by either the falling
edge of CLK H or by the internal write pulse generator, whichever
occurs first. Therefore CLK H must be asserted long enough to ensure
that data is properly written into the memory array. The internal
write pulse generator provides an output having the proper duration
as determined by a string of gates.

Also, the assertion of the internal write pulse signal must be
delayed by an amount equal to the internal access time of the RAM. In
this way, the correct data is stored, and not the data previously
stored in the input registers. The delay is accomplished by the row
delay circuit, which is also simply a string of gates. These features
give the STRAM its "self-timed" nature.

Figure 12   STRAM Block Diagram

Figure 13   STRAM Timing Diagrams
NOTE: Clock high state must last long enough to complete a write
cycle.
Acknowledgments
The authors would like to acknowledge the following individuals who
participated in and contributed to the success of the VAX 9000
project: Jerry Weisbach, Andy Moroney, Bob Haller, Marc Lamere, Mark
Hamel, Tom Senna, Dave McCall, Patty Kroesen, Rick Jones, Jim Jensen,
Terry Skrypek, Eugene Marteney, Paul Guglielmi, Elaine Fire, Larry
Herman, Bill Grundmann, Mark Pascarelli, Fran Richard, Linda Greska,
Jack Mason, Chris Caiazzi, Roger Dame, Mike Normand, Steve Sullivan,
Rob Reinschmidt, Bob Bechdolt, Mike Warder, Mike Hickman, Brian
Sadler, Wayne Nunn, Rita Wespi, Gene Yee, Bruce Smith, Alisyn
Emerson, and Jim Glanville.

References
1. D. Marshall and J. McElroy, "VAX 9000 Packaging, The Multi-Chip
   Unit," Proceedings of COMPCON '90 (Spring 1990).
2. P. Zdebel et al., "MOSAIC III - A High Performance Bipolar
   Technology with Self-Aligned Devices," Proceedings of IEEE 1987
   Bipolar Circuits and Technology Meeting.
3. D. Fite and T. Fossum, "Designing a VAX for High Performance,"
   Proceedings of COMPCON '90 (Spring 1990).
4. C. Baugh and B. Wooley, "A Two's Complement Parallel Array
   Multiplication Algorithm," Short Note at COMPCON 73, 7th Annual
   IEEE Computer Society International Conference (February 1973).
5. C. Wallace, "A Suggestion for a Fast Multiplier," IEEE
   Transactions on Electronic Computers, Vol. EC-13 (February 1964):
   14-17.
6. L. Dadda, "Some Schemes for Parallel Multipliers," Colloque sur
   l'Algebre de Boole (January 1965).
7. K. Hwang, Computer Arithmetic: Principles, Architecture, and
   Design (New York: John Wiley and Sons, 1979): 213-283.


Richard A. Brunner

Dileep P. Bhandarkar

Francis X. McKeen
Bimal Patel

William J. Rogers Jr.
Gregory L. Yoder

Vector Processing on the
VAX 9000 System
The VAX 9000 system provides the first emitter-coupled logic (ECL)
implementation of the VAX vector architecture. The optional vector
processor on the VAX 9000 system addresses the computing needs of
numerically intensive applications with a peak performance of 125
MFLOPS for double-precision calculations. The innovative design of
the vector register file allows the vector processor to overlap the
execution of up to three vector instructions. Supported by both the
VMS and ULTRIX operating systems, the vector processor on the VAX
9000 system provides four to five times performance improvement for
vectorizable applications over its scalar processor.

For a long time, vector processing was the domain of large, expensive
supercomputers such as the CRAY-1.1 However, with the availability of
low cost, pipelined floating point arithmetic chips, and the
maturation of vectorizing compilers, vector processing has become a
mainstream technology for scientific applications.2 Applications that
can benefit from vector processing include finite element analysis,
signal processing, and computational fluid dynamics. The recent
addition of integrated vector processing to the VAX architecture and
its implementation on the VAX 9000 system provides these applications
with an improvement in execution time of four to five times over that
of a VAX 9000 system without vector processing. Vector processing
extends the performance range of VAX systems.

The vector processor on the VAX 9000 system, referred to as the
V-box, is the first emitter-coupled logic (ECL) implementation of the
VAX vector architecture. The definition of the architecture and the
development of the V-box started in 1986, two years after the design
of the rest of the VAX 9000 CPU. Thus, the design of the V-box was
synergistic with the definition of the VAX vector architecture. The
major goal of the V-box design was to provide adequate vector
performance (four to five times speed-up over scalar) without
impacting the design of the remainder of the VAX 9000 CPU and the
memory subsystem, which were too far along in development to change.
With vector performance comparable to a CRAY-1 and a peak performance
of 125 MFLOPS for double-precision calculations, the V-box fulfills
this goal.


This paper describes the VAX vector architecture and its
implementation by the VAX 9000 V-box. The first part of the paper
discusses the architectural model that all VAX vector processors must
follow. The second part shows the actual realization of this
architecture in the VAX 9000 V-box and explains the innovative
techniques the V-box uses to achieve good performance. The paper
concludes with preliminary vector performance numbers for the VAX
9000 system on some standard vector benchmarks and a number of vector
code examples.

VAX Vector Architecture
The VAX vector architecture defines the instruction set, registers,
and behavior that all VAX vector implementations, such as the VAX
9000 V-box, must follow.3 The vector architecture effort started in
December 1985. At that time several CPU development projects were
well underway, including the VAX 9000 system. With the expectation of
providing four to five times performance improvement for vectorizable
applications, Digital decided to add vector processing to the VAX
9000 system, even though the system was in an advanced stage of
development. A decision also was made to provide a complementary
metal oxide semiconductor (CMOS) implementation of the architecture
on the VAX 6000 Model 400 system.4

Because both systems could not tolerate major changes without a major
slip in schedule, the architecture required an approach that made few
changes to the scalar processor - that part of a VAX processor that
executes the regular VAX instruction set. Furthermore, because not
all applications and markets can benefit from vector processing,
Digital decided not to require vector processing on every new VAX
processor. Therefore, vector processing is offered as an optional
capability. The scalar processor decodes vector instructions and
passes them to its associated vector processor. All processing of
vector instructions is handled by the vector processor. Mechanisms
are provided for vector-scalar synchronization and handling of vector
exceptions by the scalar processor.
Although the architecture had to account for the
implementation constraints of both ongoing CMOS
and ECL projects, it had to be general and flexible
enough to allow future, more integrated
implementations at higher performance. The architecture
also had to minimize its impact on the existing VMS
and ULTRIX operating systems because major
changes could significantly delay software support
for vector processing.

Basic Architecture

The VAX vector architecture uses a vector-register-based
design pioneered by Seymour Cray.1
There are 16 vector registers, each of which holds
64 elements; an element is 64 bits. Instructions
that operate on longword integers or F_floating
point data manipulate only the low-order 32 bits
of each element, sometimes referred to as longword
elements.
A number of vector control registers control
which elements of a vector register are processed
by an instruction. The vector length register (VLR)
limits the highest-numbered vector register element
that is processed by a vector instruction. The
vector mask register (VMR) consists of a 64-bit mask,
in which each mask bit corresponds to one of the
possible element positions in a vector register.
When instructions are executed under control of
the vector mask register, only those elements for
which the corresponding mask bit is true are
processed by the instruction. Vector compare
instructions set the value of the vector mask register.
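
The effect of the vector length and vector mask registers can be illustrated with a small software model. The Python sketch below is only an illustration, not a description of any hardware: the function name masked_vvadd, the use_mask flag, the treatment of mask bit value 1 as "true," and the reading of VLR as a count of elements starting at element 0 are all assumptions made for the example.

```python
NUM_ELEMENTS = 64  # elements per vector register, as stated in the text


def masked_vvadd(va, vb, vc, vlr, vmr, use_mask=False):
    """Model a vector-vector add: vc[i] = va[i] + vb[i] for selected elements.

    vlr      -- vector length; elements 0 .. vlr-1 are candidates (assumed)
    vmr      -- 64-bit mask held in an int; bit i corresponds to element i
    use_mask -- when True, only elements whose mask bit is 1 are processed
    """
    if not 0 <= vlr <= NUM_ELEMENTS:
        raise ValueError("VLR must be between 0 and 64")
    result = list(vc)  # unselected elements keep their previous values
    for i in range(vlr):
        if use_mask and not (vmr >> i) & 1:
            continue  # mask bit is false: element i is not processed
        result[i] = va[i] + vb[i]
    return result


va = list(range(NUM_ELEMENTS))
vb = [10] * NUM_ELEMENTS
vc = [0] * NUM_ELEMENTS
# Only elements 0 and 2 are both below VLR and enabled by the mask.
print(masked_vvadd(va, vb, vc, vlr=4, vmr=0b0101, use_mask=True)[:5])
# prints [10, 0, 12, 0, 0]
```
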
The vector count register (VCR) receives the
number of elements generated by the compressed
IOTA instruction, which is similar to COMPRESSED
IOTA on the CRAY-2.1 All VAX vector instructions use
two-byte extended opcodes. Any necessary scalar
operands (e.g., base address and stride for vector
memory instructions) are specified by standard VAX
scalar operand specifiers. The instruction formats
allow all VAX vector instructions to be encoded in
seven classes. The seven basic instruction groups
and their opcodes are shown in Table 1.
Within each class, all instructions have the same
number and types of operands, which allows the
scalar processor to use block-decoding techniques.
The differences in operation between the individual
instructions within a class are irrelevant to the
scalar processor and need only be known by the
vector processor. Important features of the
instruction set are

•  Support for random-strided vector memory data
   through gather (VGATH) and scatter (VSCAT)
   instructions

•  Generation of compressed IOTA vectors (through
   the IOTA instruction) to be used as offsets to the
   gather and scatter instructions

•  Merging vector registers through the VMERGE
   instruction

•  The ability for any vector instruction to operate
   under control of the vector mask register
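
How a compressed IOTA vector feeds a gather can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the architectural definition: the functions iota and gather are hypothetical, memory is modeled as a Python list indexed in element units rather than bytes, and the rule that the running offset advances by the stride for every element position while only positions whose mask bit matches are recorded is an assumption about the instruction's behavior.

```python
def iota(stride, vlr, vmr, match=1):
    """Return (offsets, count): one offset per mask bit equal to `match`.

    The running offset advances by `stride` for every element position,
    but is recorded only where the mask bit matches (assumed behavior).
    The count models the value placed in the vector count register.
    """
    offsets = []
    offset = 0
    for i in range(vlr):
        if ((vmr >> i) & 1) == match:
            offsets.append(offset)
        offset += stride
    return offsets, len(offsets)


def gather(memory, base, offsets):
    """Load memory[base + offset] for each offset (element-indexed model)."""
    return [memory[base + off] for off in offsets]


memory = [0, 5, 0, 7, 9, 0]
# Build a mask that selects the nonzero elements, as a compare would.
vmr = sum(1 << i for i, value in enumerate(memory) if value != 0)
offsets, vcr = iota(stride=1, vlr=len(memory), vmr=vmr)
packed = gather(memory, base=0, offsets=offsets)
print(offsets, vcr, packed)  # prints [1, 3, 4] 3 [5, 7, 9]
```

The packed result can then be processed as a dense vector of length vcr, and a scatter with the same offsets would return results to their original positions.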

Additional control information for a vector
instruction is provided in the vector control word
(shown as cntrl in Table 1), which is a scalar
operand to most vector instructions. The control
word operand can be specified using any VAX
addressing mode. However, VAX compilers generally
use immediate mode addressing (that is, place
the control word within the instruction stream).
The format of the vector control word is shown in
Figure 1.
The Va, Vb, and Vc fields indicate the source and
destination vector registers to be used by the
instruction. These fields also indicate the specific
operation to be performed by a vector compare or
convert instruction. The MOE bit indicates whether
the particular instruction operates under control of
the vector mask register. The MTF bit determines
what bit value corresponds to "true" for vector
mask register bits. It allows a compiler to vectorize
if-then-else constructs. The EXC bit is used in vector
arithmetic instructions to enable integer overflow
and floating underflow exception reporting. The
MI bit is used in vector memory load instructions to
indicate modify-intent. Figure 2 shows the encoding
for some typical VAX vector instructions.
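
The fields of the control word can be pulled apart with simple shifts and masks. The Python sketch below is illustrative only. The exact bit positions are an assumption: Figure 2 places Va and the mask-control flags in bits <15:8> and Vb and Vc in bits <7:0>, and the sketch further assumes MOE in bit 15, MTF in bit 14, EXC (or MI) in bit 13, Va in <11:8>, Vb in <7:4>, and Vc in <3:0>. The names ControlWord and decode_control_word are hypothetical.

```python
from typing import NamedTuple


class ControlWord(NamedTuple):
    moe: bool        # masked operation enable
    mtf: int         # mask bit value that counts as "true"
    exc_or_mi: bool  # EXC for arithmetic, MI for memory loads
    va: int          # source register, or convert function
    vb: int          # source register
    vc: int          # destination register, or compare function


def decode_control_word(cw):
    """Split a 16-bit control word into fields (assumed bit layout)."""
    return ControlWord(
        moe=bool((cw >> 15) & 1),
        mtf=(cw >> 14) & 1,
        exc_or_mi=bool((cw >> 13) & 1),
        va=(cw >> 8) & 0xF,
        vb=(cw >> 4) & 0xF,
        vc=cw & 0xF,
    )


# VVADDF/1 V1,V2,V3 from Figure 2: masked add with match = 1.
cw = (1 << 15) | (1 << 14) | (1 << 8) | (2 << 4) | 3
print(decode_control_word(cw))
```
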
Vector Execution Model

With the addition of vector processing, a typical
VAX processor consists of a scalar processor and an
associated vector processor; the two are referred to
as a scalar/vector pair. A VAX multiprocessor system


Table 1    VAX Vector Instruction Classes

Vector Memory, Constant-stride
opcode cntrl, base, stride
  VLDL      Load longword vector data
  VLDQ      Load quadword vector data
  VSTL      Store longword vector data
  VSTQ      Store quadword vector data

Vector Memory, Random-stride
opcode cntrl, base
  VGATHL    Gather longword vector data
  VGATHQ    Gather quadword vector data
  VSCATL    Scatter longword vector data
  VSCATQ    Scatter quadword vector data

Vector-Scalar Single-precision Arithmetic
opcode cntrl, scalar
  VSADDL    Integer longword add
  VSADDF    F_floating add
  VSBICL    Bit clear longword
  VSBISL    Bit set longword
  VSCMPL    Integer longword compare
  VSCMPF    F_floating compare
  VSDIVF    F_floating divide
  VSMULL    Integer longword multiply
  VSMULF    F_floating multiply
  VSSLLL    Shift left logical longword
  VSSRLL    Shift right logical longword
  VSSUBL    Integer longword subtract
  VSSUBF    F_floating subtract
  VSXORL    Exclusive-or longword
  IOTA      Generate compressed IOTA vector

Vector-Scalar Double-precision Arithmetic
opcode cntrl, scalar
  VSADDD    D_floating add
  VSADDG    G_floating add
  VSCMPD    D_floating compare
  VSCMPG    G_floating compare
  VSDIVD    D_floating divide
  VSDIVG    G_floating divide
  VSMULD    D_floating multiply
  VSMULG    G_floating multiply
  VSSUBD    D_floating subtract
  VSSUBG    G_floating subtract
  VSMERGE   Merge

Vector-Vector Arithmetic
opcode cntrl or regnum
  VVADDL    Integer longword add
  VVADDF    F_floating add
  VVADDD    D_floating add
  VVADDG    G_floating add
  VVBICL    Bit clear longword
  VVBISL    Bit set longword
  VVCMPL    Integer longword compare
  VVCMPF    F_floating compare
  VVCMPD    D_floating compare
  VVCMPG    G_floating compare
  VVCVT     Convert
  VVDIVF    F_floating divide
  VVDIVD    D_floating divide
  VVDIVG    G_floating divide
  VVMERGE   Merge
  VVMULL    Integer longword multiply
  VVMULF    F_floating multiply
  VVMULD    D_floating multiply
  VVMULG    G_floating multiply
  VVSLLL    Shift left logical longword
  VVSRLL    Shift right logical longword
  VVSUBL    Integer longword subtract
  VVSUBF    F_floating subtract
  VVSUBD    D_floating subtract
  VVSUBG    G_floating subtract
  VVXORL    Exclusive-or longword
  VSYNC     Synchronize vector memory access

Vector Control Register Read
opcode regnum, destination
  MFVP      Move from vector processor

Vector Control Register Write
opcode regnum, scalar
  MTVP      Move to vector processor


Figure 1    Vector Control Word
[Bit-field diagram; labeled fields: MOE, MTF, EXC, MI,
VA/CONVERT FCN, VC/COMPARE FCN]

comprises a number of these scalar/vector pairs.
Asymmetric configurations can exist when only
some of the VAX processors in a multiprocessor
system contain a vector processor.

For good performance, the scalar processor operates
asynchronously from its vector processor
whenever possible. Asynchronous operation allows
the execution of scalar instructions to be overlapped
with the execution of vector instructions.
Furthermore, the servicing of interrupts and scalar
exceptions by the scalar processor does not disturb
the execution of the vector processor, which is
freed from the complexity of resuming the execution
of vector instructions after such events. However,
the asynchronous execution does cause the
reporting of vector exceptions to be imprecise.
Special instructions, which are described in the
Synchronization section, are provided to ensure
synchronous operation when necessary.

Both scalar and vector instructions are initially
fetched from memory and decoded by the scalar
processor. If the opcode indicates a vector
instruction, the opcode and necessary scalar operands are
issued to the vector processor and placed in its
instruction queue. The vector processor accesses
memory directly for any vector data that it must
read or write. For most vector instructions, once the
scalar processor successfully issues the vector

ASSEMBLER FORMAT:
VVEQLF V6,V7         ; IF V6[i] = V7[i] THEN VMR[i] = 1, ELSE VMR[i] = 0
                     ; (VVEQLF IS A VVCMPF PSEUDO-OPCODE)
VVADDF/1 V1,V2,V3    ; V3 = V1 + V2. DO ADDITION UNDER CONTROL OF VMR
                     ; WITH MATCH = 1
VSMULF/U R4,V4,V5    ; V5 = R4*V4 WITH UNDERFLOW EXCEPTION CHECKING ENABLED

INSTRUCTION FORMAT:
VVCMPF cntrl.rw          ; INSTRUCTION CONSISTS OF OPCODE AND CONTROL WORD
VVADDF cntrl.rw          ; INSTRUCTION CONSISTS OF OPCODE AND CONTROL WORD
VSMULF cntrl.rw, src.rl  ; INSTRUCTION CONSISTS OF OPCODE, CONTROL WORD, AND SCALAR SOURCE

ENCODING IN MEMORY:
BYTE
:0-:1  FD C4  TWO-BYTE OPCODE FOR VVCMPF
:2     8F     OPERAND SPECIFIER FOR IMMEDIATE MODE (FOR CONTROL WORD)
:3            CONTROL WORD <7:0>: COMPARE FCN IS EQL AND V7 IS A SOURCE
:4            CONTROL WORD <15:8>: V6 IS A SOURCE
:5-:6         TWO-BYTE OPCODE FOR VVADDF
:7            OPERAND SPECIFIER FOR IMMEDIATE MODE (FOR CONTROL WORD)
:8            CONTROL WORD <7:0>: V3 IS DESTINATION AND V2 IS A SOURCE
:9            CONTROL WORD <15:8>: V1 IS A SOURCE, MASKED OPERATIONS ARE ENABLED, AND MATCH = 1
:A-:B         TWO-BYTE OPCODE FOR VSMULF
:C            OPERAND SPECIFIER FOR IMMEDIATE MODE (FOR CONTROL WORD)
:D            CONTROL WORD <7:0>: V5 IS DESTINATION AND V4 IS A SOURCE
:E            CONTROL WORD <15:8>: VA IS IGNORED. UNDERFLOW EXCEPTION CHECKING IS ENABLED
:F            OPERAND SPECIFIER FOR REGISTER MODE WITH SCALAR DATA IN R4

Figure 2    Vector Instruction Encoding


instruction, it proceeds to process other instructions
and does not wait for the vector instruction to
complete. An execution model is shown in Figure 3.

When the scalar processor attempts to issue a
vector instruction, it checks to see if the vector
processor is disabled - that is, whether it will accept
further vector instructions. If the vector processor
is disabled, then the scalar processor takes a "vector
processor disabled" fault. An operating system
handler is then invoked on the scalar processor to
examine the various error-reporting registers on the
vector processor to determine the disabling
condition. The vector processor disables itself to report
the occurrence of vector arithmetic exceptions or
hardware errors. The operating system disables the
vector processor, usually to indicate the
unavailability of the vector processor, by writing to a
privileged vector register. If the disabling condition can
be corrected, the handler enables the vector
processor and directs the scalar processor to reissue the
faulted vector instruction.
Within the constraint of maintaining the proper
ordering among the operations of data-dependent
instructions, the architecture explicitly allows the
vector processor to execute any number of the
instructions in its queue concurrently and retire
them out of order. Thus, a VAX vector implementation
can chain and overlap instructions to the
extent best suited for its technology and
cost-performance. In addition, by making this feature an
explicit part of the architecture, software is
provided with a programming model that ensures
correct results regardless of the extent to which a
particular implementation chains or overlaps. This
approach differs from some other existing vector
architectures, such as the IBM S/370 vector
architecture, which give the appearance of sequential
instruction execution.6
A VAX vector implementation may have its own
memory management hardware, translation buffer,
and cache; or it may share those of the scalar
processor. In high-end vector implementations, such as
the VAX 9000 system, the vector and scalar
processors are tightly coupled. The problems of limited
chip area and translation buffer and cache
coherency can be lessened by allowing high-speed
memory management hardware and cache to be shared
by both vector and scalar processors. For other
implementations, such as the VAX 6000 Model 400
system, the vector and scalar processors are not so
tightly coupled, and there is a performance
advantage in allowing separate memory management
hardware and cache.1 Little additional effort is
necessary by an operating system to support separate
vector memory management hardware and cache.
A vector processor can treat vector memory
management exceptions (MME) in a synchronous
manner, as the VAX 9000 V-box does. Once the
scalar processor issues a vector memory instruction,
it pauses until the vector processor determines
whether an MME will be encountered by the
instruction. If an MME will occur, then a precise
exception is taken on the scalar processor and the
appropriate operating system handler is invoked.
If no MME will occur, the scalar processor proceeds
to process other instructions and the vector
processor completes the memory instruction. In the case
of referencing a unity-strided vector, which occurs
most frequently, the MME checking takes only
a short time at the beginning because the vector
is contained in two or fewer pages. (MME checking is
done at the page level.)

Figure 3    Vector Execution Unit
[Block diagram; labeled elements: physical memory, VAX
scalar CPU, instruction stream, data stream, opcode and
control word, instructions, data, disable/status, vector data]

Context Switching
Because of the asynchronous operation of the vector
and scalar processors, the vector context state of
a process is separate from its scalar context state.
Thus, it is possible for an operating system to swap
in a new process to the scalar processor while
allowing the vector context of the previous process
to remain on the vector processor. When the previous
process is swapped out, the vector processor is
disabled by the operating system to prevent other
processes from accessing this vector context.

If the subsequent processes do not use the vector
processor, then the operating system avoids
the overhead of saving and subsequently restoring
8 kilobytes (KB) of vector context state for the
original process. If another process does use the vector
processor, the operating system must reenable the
vector processor, save the vector state of the
original process, load the vector context of the new
process, and, finally, make the vector processor
available. This full context switch can take up to
100 microseconds on the VAX 9000 system.

Assuming that only a few processes require the
vector processor, it is likely that when the original
process is rescheduled to the same scalar/vector
pair, the process will find its vector context state
residing on the vector processor. By using this
technique, which is referred to as "cheap vector context
switching," both the VMS and ULTRIX operating
systems reduce the time required to swap in a process
that uses the vector processor.

Exceptions
Most of the exceptions encountered by VAX vector
instructions are identical to those that occur for
VAX scalar instructions. The arithmetic exceptions
are exactly the same. The memory management
exceptions have been extended to include two new
vector exceptions: vector I/O space reference and
vector alignment fault. As in the VAX scalar
architecture, the reporting of floating underflow and integer
overflow exceptions can be enabled or disabled
through the EXC bit in the vector control word.

Vector arithmetic exceptions are reported in an
imprecise manner by vector processor disabled
faults. When an exception occurs in the processing
of a vector element, the vector processor records
the exception both in a privileged exception
register (the vector arithmetic exception register, VAER)
and in the corresponding element of the destination
vector register specified by the instruction. The
vector processor then disables itself from receiving
further vector instructions. However, the vector
processor continues to execute the instruction that
encountered the exception to completion by
processing the remaining vector register elements.

As stated earlier, memory management exceptions
can be reported precisely by a VAX vector
processor to its scalar processor, as the VAX 9000
V-box does, and the scalar processor takes a normal
VAX memory management fault. Exception
information is placed on the stack in the same format as
for scalar memory management exceptions. The
use of the same format minimizes the effort needed
by an operating system to support these exceptions.

Memory management exceptions were extended
for vectors to include two new exception parameter
bits: vector I/O space reference and vector
alignment fault. A vector I/O space reference occurs
whenever an attempt is made to load or store vector
data to I/O space. Because of the performance
degradation of unaligned memory data, a vector
alignment fault occurs whenever an element being
accessed by a vector memory instruction does not
begin at an address that is an integer multiple of the
length of the element in bytes. For example, a
longword (4-byte) element in memory should begin at
an address that is an integer multiple of 4 bytes.

Synchronization
In most cases, it is desirable for the vector processor
to operate asynchronously with the scalar
processor to achieve good performance. However, there
are cases in which the operation of the vector and
scalar processors must be synchronized to ensure
correct results. Rather than forcing the vector
processor to detect and automatically provide
synchronization in these cases, the architecture provides
special instructions, which software can use, to
accomplish the synchronization. Some of these
instructions are discussed below. Software must
determine when to use these synchronization
instructions to ensure correct results or establish
exception checkpoints. Given the necessary
sophistication of vectorizing compilers, this requirement
is not onerous.


Vector and scalar memory references may be
issued simultaneously. Therefore, these references
must be synchronized to prevent a conflict from
occurring when accessing shared memory locations.
This synchronization is provided by the
MSYNC function of the MFVP instruction. Once the
MSYNC function is invoked, the scalar processor
does not issue further instructions until all previous
vector and scalar memory references have
completed.

Because the vector and scalar processors execute
asynchronously, software cannot determine when a
vector exception will be reported. However, software
requires that exceptions be reported at certain
checkpoints. For example, exceptions incurred in a
procedure must be reported within the context of
that procedure before another procedure is called.
This exception reporting synchronization is
provided by the SYNC function of the MFVP instruction.
Once SYNC is invoked, the scalar processor does not
issue further instructions until the exceptions of
previous vector instructions, if any, are reported.

VAX 9000 V-box Overview

The VAX 9000 V-box is one of four tightly coupled,
parallel function units that compose the VAX 9000
CPU. As such, it shares, with the rest of the CPU,
both the large 128KB data cache and the very fast
address translation hardware. As a result, the V-box
has very fast access to memory data. The V-box is
connected to the CPU through the scalar execution
unit as shown in Figure 4. This connection consists
of a 64-bit data path, which brings instructions and
data to the vector unit, and a 32-bit path, which
sends data to the scalar unit. All vector memory
instructions send data through this data path.

As Figure 4 also shows, the V-box is composed of
the following subunits: vector register unit, vector
add unit, vector multiply unit, vector mask unit,
vector address unit, and vector control unit. Each of
these subunits can function in parallel, which
allows up to two vector arithmetic instructions
and one vector memory instruction to be executed
simultaneously. Crucial to this instruction
overlapping ability is the vector register unit, which
supports up to eight simultaneous accesses from
the other subunits.

Physically, the V-box resides on the same planar
board as the remainder of the VAX 9000 CPU. Three
multichip units (MCUs) are reserved for the V-box,
which is a field-installable option. The V-box
comprises 25 ECL Motorola Macrocell Array IIIs (MCA3).7
(For brevity, a macrocell array is referred to as a
"chip" in this paper.) The operation of these
subunits and the techniques used to enhance their
performance are described in the following sections.

Figure 4    V-box Organization (with VAX 9000 CPU)
[Block diagram; labeled elements: vector control unit,
vector register unit, mask/address]

Vector Control Unit

The vector control unit receives and coordinates
the execution of vector instructions within the
V-box. The VAX 9000 scalar execution engine
(E-box) transfers both an encoded version of the
vector instruction and the necessary scalar data to
the unit, which loads the instruction and data into a
circular queue as shown in Figure 5. The queue can
buffer a few pending instructions while the remaining
V-box subunits are executing others. Without
the queue, the V-box could not accept pending
instructions when all of its subunits are busy, thus
propagating a stall condition to the scalar execution
unit and resulting in poor performance.
The scalar data that is required by a vector
instruction is placed in the queue one location
behind the instruction quadword. Whenever the
queue contains two entries, the vector control unit
returns a signal to the scalar execution unit and
requests that subsequent instruction issue be
delayed until the number of entries in the queue
has diminished to one or fewer. The queue is
circular in nature and wraps around to the beginning
automatically.
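
The queue behavior described above can be modeled as a small ring buffer with a hold signal. The following Python sketch is a software analogy only: the class name VectorIssueQueue, the capacity of four slots, and the choice to count an instruction together with its scalar data as one entry are assumptions for illustration. Only the threshold of two entries comes from the text.

```python
class VectorIssueQueue:
    """Ring buffer that asks the issuer to hold when two entries are queued."""

    HOLD_THRESHOLD = 2  # from the text: hold issue at two queued entries

    def __init__(self, capacity=4):  # capacity is a hypothetical value
        self._slots = [None] * capacity
        self._head = 0   # index of the oldest entry
        self._count = 0  # number of valid entries

    def push(self, entry):
        if self._count == len(self._slots):
            raise OverflowError("queue full")
        tail = (self._head + self._count) % len(self._slots)  # wraps around
        self._slots[tail] = entry
        self._count += 1

    def pop(self):
        if self._count == 0:
            raise IndexError("queue empty")
        entry = self._slots[self._head]
        self._slots[self._head] = None
        self._head = (self._head + 1) % len(self._slots)
        self._count -= 1
        return entry

    @property
    def hold_issue(self):
        """True while the scalar unit should delay further issue."""
        return self._count >= self.HOLD_THRESHOLD


queue = VectorIssueQueue()
queue.push("VVADDF")
queue.push("VSMULF")
print(queue.hold_issue)  # prints True: two entries are queued
queue.pop()
print(queue.hold_issue)  # prints False: one entry remains
```
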
When an instruction is loaded into the queue, a
pointer directs the instruction to the decode logic
shown in Figure 5. If there is enough instruction
data available in the queue and the necessary
subunit is not busy, then the vector control unit sends
the instruction data from the queue to the register
conflict logic. The register conflict logic determines
if the vector registers required by the instruction are
already in use by the other subunits, a condition
called register conflict. The determination is made
by comparing the vector register addresses that
are to be used by already executing vector
instructions in the next cycle against the vector register
addresses required by the new instruction. If none
of the addresses overlap, then the instruction is free
to issue. If an overlap does exist, the instruction is
held until the next cycle, when it can then be issued
to the appropriate subunit. (The lack of significant
cycle delay in this case is due to the optimal design
of the vector register unit.) If there are no register
conflicts, the instruction is issued immediately to
the appropriate subunit.
As the vector control unit issues the instruction to
the subunit, it also sends scalar source operands,
if any, and the addresses of the vector registers
required by the instruction to the vector register
unit. The vector register unit latches the scalar data
for the duration of that instruction. For each cycle
of the instruction's execution, the register unit then
sends the necessary scalar and register data to the
appropriate subunit. The vector control unit also
contains the vector length register and sends a copy
of it with every instruction that is issued to a
subunit. By supplying each subunit with a copy of
the vector length register, writes to the register by
MTVP instructions do not affect instructions
currently executing under the register's previous value.
Without this mechanism, writes to the vector
length register would be delayed until previously
executing instructions had finished, which would
result in poor performance.

Figure 5    Vector Control Unit
[Block diagram; labeled elements: E-box vector data,
buffer, buffer valid counter, vector instruction,
instruction issue decision logic, vector register conflict
check logic, source/destination vector register addresses,
scalar data to vector register file]
Upon reaching the subunit, most vector instructions
execute at one cycle per element, after the
initial pipeline latency. However, the vector divide
instructions (VSDIV and VVDIV) execute at a varying
number of cycles, depending on the floating point
format (F, D, or G). (To simplify the vector control
logic, no other vector instructions are issued once
a vector divide starts.) Results are returned to the
vector register unit or vector mask unit as they are
generated, depending on the instruction.
As described earlier, microcode in the scalar
execution engine encodes vector instructions into an
instruction quadword before passing them to the
V-box. Table 2 shows the high-order 32 bits of the
format used for every instruction sent to the V-box.
This quadword contains fields that indicate the
instruction, appropriate V-box subunit to execute
the instruction, and format of the vector control
word. The low-order 32 bits of the instruction
quadword contain the vector control word for the vector
instruction. The instruction quadwords present the
V-box with a fixed format instruction that smoothly
fits into a fixed-length instruction queue, requires
little subsequent decoding, and has fields that can
be directly gated to selection logic. As a result, the
time needed by the V-box to decode vector
instructions is reduced and performance is increased.
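
The fixed layout of the instruction quadword makes field extraction a matter of shifts and masks, which the Python sketch below illustrates. The field positions follow the column headings of Table 2 (opcode in bits <39:32>, control word type in <42:40>, dispatch type in <46:43>, bits <63:47> reserved) with the control word in the low-order 32 bits. The function names and the sample field values are hypothetical and chosen only for the example.

```python
def encode_quadword(opcode, cw_type, dispatch, control_word):
    """Pack the four fields into a 64-bit instruction quadword."""
    return (
        ((dispatch & 0xF) << 43)
        | ((cw_type & 0x7) << 40)
        | ((opcode & 0xFF) << 32)
        | (control_word & 0xFFFFFFFF)
    )


def decode_quadword(qw):
    """Unpack a 64-bit instruction quadword into its fields."""
    return {
        "control_word": qw & 0xFFFFFFFF,  # bits <31:0>
        "opcode": (qw >> 32) & 0xFF,      # bits <39:32>
        "cw_type": (qw >> 40) & 0x7,      # bits <42:40>
        "dispatch": (qw >> 43) & 0xF,     # bits <46:43>
        "reserved": qw >> 47,             # bits <63:47>
    }


# Sample values are illustrative, not taken from a real instruction.
qw = encode_quadword(opcode=0x04, cw_type=2, dispatch=1, control_word=0xC123)
print(decode_quadword(qw))
```

Because each field sits at a fixed position, hardware can route the dispatch-type bits straight to subunit selection without first interpreting the opcode.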
Vector Register Unit

The vector register unit or file, as its name implies,
contains the logic and fast memory that implement
the 16 VAX vector registers on the V-box. The
block diagram of the vector register file is shown in
Figure 6. The vector register file has three write
ports and five read ports. By using the innovative
technique described below, these ports provide the
multiple accesses needed to feed two operands per
cycle to the vector add and multiply units, and one
operand to the vector address-mask unit. This unit
is the single largest contributor to the excellent
vector performance of the VAX 9000 system.
The file consists of 16 vector registers. Each
register contains 64 elements, and each element is
72 bits wide (64 data, 8 parity). The vector register
file is implemented as a byte-sliced custom chip,
which has a single parity bit per data port. Three
writes and five reads to the file can occur
simultaneously in any cycle. All writes must be to different
register banks. However, multiple reads can occur
to the same bank if the same element is required by
each read access. Internally to the vector register
unit, reads occur during the first half of the cycle,
and writes occur during the last half. A write and
read enabling signal is generated for each register
bank every cycle. Each cycle, data is selected from
one of the three write ports to be written into any
enabled register banks. Write port 0 has a four-stage
pipe to buffer data coming from the E-box, through
the control logic, which cannot be written due to a
register bank conflict. The vector register file also
has three scalar registers (one each for the vector
address-mask unit, vector add unit, and vector
multiply unit) to hold scalar source operands for
vector-scalar instructions. Write port 0 is used to write
these registers. Each enabled read port selects an
element from one of the 16 register banks or scalar
registers (for vector-scalar instructions) and
transfers it to one of the other subunits.
The vector register file uses a technique referred
to as "barber poling" to improve the use of chaining
and overlapped instruction execution. As Figure 7
shows, barber poling spreads each architecturally
defined vector register across all vector register
banks. Elements are laid out such that the first
vector element of vector register n is in location
0 of physical register bank n, and element b
of vector register n is in location b of vector register
bank ([n + b] modulo 16).
By using this technique, a vector register conflict
causes the vector control unit to delay the issuing
of a new vector instruction for no more than three
cycles. If the more standard technique of placing all
elements of one vector register in the same bank
were used, a vector register conflict could cause
the execution of a new instruction to be delayed by
64 cycles. The 64-cycle delay would have frustrated
attempts at overlapping and severely degraded the
vector performance of the VAX 9000 system.
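
The barber-pole mapping and its effect on issue delay can be checked with a short Python model. The mapping itself, bank = (n + b) modulo 16 with element b held in location b, is taken from the text. The issue_delay function is a deliberately simplified assumption: it treats every in-flight instruction as a single stream that touches one element per cycle and never finishes, and it ignores the rule that reads of the same element may share a bank, so it should not be read as the V-box conflict logic.

```python
NUM_BANKS = 16
NUM_ELEMENTS = 64


def bank_of(reg, elem):
    """Bank holding element `elem` of vector register `reg` (from the text)."""
    return (reg + elem) % NUM_BANKS


def placement(reg, elem):
    """(bank, location) pair; element b is kept in location b of its bank."""
    return bank_of(reg, elem), elem


def issue_delay(in_flight, new_reg, now):
    """Smallest delay after which a new stream never shares a bank.

    in_flight -- list of (reg, start_cycle) for streams already running,
                 each accessing one element per cycle (simplified model)
    """
    for delay in range(NUM_BANKS):
        start = now + delay
        # At cycle c the old stream is in bank reg + (c - s) and the new one
        # in bank new_reg + (c - start); the difference never changes.
        if all((reg + (start - s) - new_reg) % NUM_BANKS != 0
               for reg, s in in_flight):
            return delay
    raise RuntimeError("no conflict-free issue slot in this model")


# Every (register, element) pair gets its own (bank, location) cell.
cells = {placement(n, b) for n in range(16) for b in range(NUM_ELEMENTS)}
print(len(cells))                                  # prints 1024
print(issue_delay([(2, 0)], new_reg=5, now=3))     # prints 1
print(issue_delay([(2, 0)], new_reg=2, now=3))     # prints 0
```

In this model a stream that would collide with an in-flight stream needs to slip by only one cycle per conflicting stream, because both then advance through the banks in step. With all elements of a register in one bank, the second stream would instead have to wait for the first to drain.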
Vector Add Unit

The vector add unit executes most vector instructions,
including both floating point and integer
addition, subtraction, comparison; vector convert;
vector shift logical; vector logical operations; and
vector merges. For brevity, these instructions are
referred to as add-class instructions. One of the
challenges in designing the vector add unit was the
need to perform both integer and floating point
arithmetic.
The organization of the vector add unit is shown
in Figure 8. It is a pipelined structure that comprises
two identical chips for unpacking and aligning
operands (VFSA and VFSB); one chip for performing
arithmetic and logical operations (VFAD); and a

Table 2    Encoded Instruction Quadword (bits <63:32>)

Vector                     Opcode    Control Word   Dispatch
Instruction                <39:32>   Type <42:40>   Type <46:43>

VVSUBF/VSSUBF              0F9       2/6            0
VVSUBG/VSSUBG              0DB       2/6            0
VVSUBD/VSSUBD              0D2       2/6            0
VVSUBL/VSSUBL              0F6       2/6            0
VVCMPL/VSCMPL              0F5       3/7            0
VVSLL/VSSLL                034       2/6            0
VVSRL/VSSRL                026       2/6            0
VVBISL/VSBISL              086       2/6            0
VVBICL/VSBICL              08E       2/6            0
VVXORL/VSXORL              088       2/6            0
VVMERGE/VSMERGE            0AE       5/1            0
VVADDD/VSADDD              092       2/6            0
VVADDF/VSADDF              089       2/6            0
VVADDG/VSADDG              098       2/6            0
VVADDL/VSADDL              086       2/6            0
VVCMPD/VSCMPD              0D5       3/7            0
VVCMPF/VSCMPF              0FD       3/7            0
VVCMPG/VSCMPG              0DD       3/7            0
VVCMPD/VSCMPD              0D5       3/7            0
VVCVTDF                    011       4              0
VVCVTDL                    016       4              0
VVCVTFD                    03A       4              0
VVCVTFG                    038       4              0
VVCVTFL                    03E       4              0
VVCVTGF                    019       4              0
VVCVTGL                    01E       4              0
VVCVTLD                    032       4              0
VVCVTLF                    031       4              0
VVCVTLG                    033       4              0
VVCVTDL                    017       4              0
VVCVTFL                    03F       4              0
VVCVTGL                    01F       4              0
VVMULL/VSMULL              003       2/6            1
VVMULF/VSMULF              004       2/6            1
VVMULD/VSMULD              005       2/6            1
VVMULG/VSMULG              006       2/6            1
VVDIVF/VSDIVF              00C       2/6            4
VVDIVD/VSDIVD              00D       2/6            4
VVDIVG/VSDIVG              00E       2/6            4
VLDL                       001       0              2
VLDQ                       002       0              2
Block load                 00C       0              2
VSTL                       003       0              2
VSTQ                       004       0              2
VGATHL                     005       1              2
VGATHQ                     006       1              2
VSCATL                     010       1              2
VSCATQ                     011       1              2
IOTA                       012       0              2
Load VLR                   007       0              2
Load low VMR               009       0              2
Load high VMR              00A       0              2
Store low VMR              00D       0              3
Store high VMR             00E       0              3
Store unaligned address    013       0              3
Load VPSR                  014       0              2
Load VAER                  015       0              3
Store VAER                 008       0              3
RESET                      00F       2              3

Bits <63:47> are reserved.

Figure 6    Vector Register Unit
[Block diagram; labeled elements: write ports 0-2 (vector
write data from the control logic, VAD result, VML result),
scalar registers, write enable and write address logic,
read enable and read address logic, memory array of
register banks 0-15, read ports 0-4 to the mask, VML, and
adder logic]

remaining chip for norm a l izing, rou nding, and
packing the result (YFPK). The data paths between
t he chips are a l 1 64 -bits w ide.
The pipeline latency through this unit for both
single-precision (integer and F _floating) and dou
ble-precision (G_floating and D _ floating) formats is
only three cycles. Thus, the vector/scalar cross-over
number for add-class instruct ions is quite small
(that is, the minimum number of vector elements
needed for the V-box to surpass the performance
of the remainder of the VA X 9000 CPU for this class
of instructions.) As a result, the V-box achieves good
performance for add-class instructions with small
sized vectors and large-sized vectors (large-sized
vectors being naturally favored by t he technique of
pipelining).
[Figure 7  Barber Poling]

When the vector add unit begins to execute an
instruction, it receives two source elements from
the vector register unit each cycle. The elements are
latched into the unpacking logic, one element for
each of the two chips. During the next cycle, each
unpacking chip concurrently unpacks and aligns
its source element, if necessary, and forwards the
result to the addition or logical-operation logic,
depending on the instruction. Within the same
cycle, the addition chip uses the two sources from
the unpacking logic to generate a result, which is
then latched.
During the final cycle, the result is sent to the
packing chip, which normalizes, rounds, and packs
the result, if necessary, and sends it to the vector
register unit to be written. Exception checking and
reporting are also done in the last cycle by the
packing chip, which maintains the vector add unit's
copy of the vector arithmetic exception register
(VAER). When the instruction completes, the vector
add unit sends its VAER copy to the vector mask unit
to be merged with the VAER copy from the vector
multiply unit.
The vector add unit does not differentiate
between masked and unmasked vector instructions.
The complexity of skipping over masked-out
elements would have added extra cycles of pipeline
latency and resulted in lower performance for small
vectors. For masked as well as unmasked
instructions, the vector add unit operates from the
first up to the last element (as indicated by the
vector length register) of both source registers. The
actual masking of results is handled by the vector
control unit, which blocks the vector register unit
from receiving masked-out results as they are
being sent by the vector add unit. However, the
packing chip does use vector mask register bits to
suppress exception generation for results that are
masked out.
[Figure 8  Vector Add Unit]

Floating Point Operation
When executing vector floating point instructions,
the unpacking logic takes the various fields of a
floating point element and expands and rearranges
them into a more convenient format for the addition
logic, i.e., the element is "unpacked." As a result of
this process, the addition logic is simplified because
all VAX floating point formats (F, D, and G) are
unpacked into an identical format. The unpacking
involves decoding the sign, inserting the hidden bit,
and rearranging the fraction bits. For all VAX
floating point formats, the fractional part is expanded
to 56 bits. (F_floating and G_floating are expanded
with zeros on the right.) The fractional part is then
surrounded on the right with two guard bits and a
rounding bit to form a 59-bit fraction. The overflow
and guard bits ensure the accuracy of rounded results.
After the elements are unpacked, the unpacking
chips align the elements by taking the fractional
part of the smaller magnitude number and shifting
it to the right until its exponent is equal to that of
the larger magnitude number. Each unpacking chip
also receives the exponent bits of the other chip's
element. Therefore, the alignment process can be
done in parallel before the elements are sent to the
addition logic that requires the alignment. If, during
the alignment of an element for a vector floating
point subtract instruction, a one is shifted out of the
59-bit fraction field, then a "sticky bit" is generated.
This sticky bit is used by the addition logic in the
next cycle as a carry into the subtraction.
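
The alignment step described above can be sketched in a few lines of Python. This is only an illustrative software model, not the chip logic: the function name `align_fraction`, the integer representation of the fraction, and the handling of shift counts larger than the field are assumptions made for the example. Only the 59-bit field width and the rule that a sticky bit records any one shifted out come from the text.

```python
from typing import Tuple


def align_fraction(frac: int, shift: int, width: int = 59) -> Tuple[int, bool]:
    """Shift a width-bit fraction right by `shift` places (hypothetical helper).

    Returns the aligned fraction and a sticky flag that is True when at
    least one 1 bit was shifted out of the fraction field.
    """
    if shift <= 0:
        # Exponents already equal: nothing is shifted out.
        return frac, False
    if shift >= width:
        # Assumed behavior: the whole fraction is shifted out of the field.
        return 0, frac != 0
    lost = frac & ((1 << shift) - 1)   # the bits that fall off the right end
    return frac >> shift, lost != 0
```

For example, aligning the fraction `0b1011` by two places gives `0b10` with the sticky flag set, because the discarded bits `11` contain a one.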
The unpacked, aligned elements are then sent to
the add chip, which produces a result and then
partially normalizes the result before sending it to the
packing chip. Again, if the shifting during
normalization shifts a one out of the fraction field, a sticky
bit is generated. Finally, the partially normalized
result and the second sticky bit are sent to the
packing chip, which completes the normalization and
rounding and adjusts the exponent field
accordingly. To save an extra cycle, the packing chip
computes two exponent values, one for each value of
the carry-over in the rounding process. Final
selection of the exponent and its exception is done using
the actual carry-over of the rounding logic. The
proper exponent and the normalized fraction are
then rearranged into the appropriate floating point
format, and the assembled element is sent to the
vector register unit.
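
The idea of preparing both exponents before the rounding result is known can be illustrated as follows. This is a minimal sketch under stated assumptions: the function name, the add-half-and-truncate rounding, and the split of the 59-bit field into 56 fraction bits plus 3 low-order bits are choices made for the example, not a description of the actual VFPK logic.

```python
from typing import Tuple


def round_and_select_exponent(frac: int, exp: int,
                              frac_bits: int = 56,
                              extra_bits: int = 3) -> Tuple[int, int]:
    """Round a normalized (frac_bits + extra_bits)-bit fraction (hypothetical model).

    Both candidate exponents are formed up front; the carry out of the
    rounding add only selects between them.
    """
    exp_if_no_carry = exp        # candidate 1: rounding does not overflow
    exp_if_carry = exp + 1       # candidate 2: rounding carries out of the fraction

    half = 1 << (extra_bits - 1)
    rounded = (frac + half) >> extra_bits   # assumed round-to-nearest by adding half
    carry = rounded >> frac_bits            # 1 if the fraction overflowed

    if carry:
        rounded >>= 1                       # renormalize after the carry-out
    return rounded, (exp_if_carry if carry else exp_if_no_carry)
```

A fraction of all ones rounds up and carries out, so the second exponent is selected; a fraction with only the leading bit set rounds without a carry and keeps the first.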
Integer and Logical Instructions
For vector integer and logical instructions, the
elements bypass the alignment logic and are sent to
the add chip (VFAD) for all but the logical shift
right instructions (VVSRLL and VSSRLL). For logical
shift right instructions, the alignment logic does
the shifting because the shifting circuitry is already
needed for the alignment of fractions in floating
point elements. The exponent unpacking logic is
used to pass on the logical shift right count to the
alignment logic, which then sends the shifted result
to the add chip. The add chip operates on the
low-order 32 bits of these elements and passes the
high-order 32 bits through unchanged to the packing
chip. For logical shift left instructions (VVSLLL and
VSSLLL), the low-order 32 bits also pass through the
add chip unchanged. On the packing chip, the
floating point normalize logic is used to perform the
logical shift left operations. The shift count is passed
to the normalize logic from the unpacking logic
during the first cycle. For all other integer and
logical instructions, the normalize count is forced to
zero to pass the add chip result through. Finally, just
before sending the result to the vector register unit,
the packing chip checks for integer overflow
exceptions.
Merge Instructions
For vector merge instructions (VVMERGE and
VSMERGE), the unpacking chip with the masked-out
element, based on the appropriate vector mask
register bit, zeros that element out before sending
it to the addition logic. The addition logic adds the
zero to the other element, which has the effect of
passing the value of the other element on to the
packing chip.
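
The merge-by-addition trick can be shown with a short sketch. It is a software analogy only: the function name `merge_via_add` and the convention that a mask bit equal to `select` keeps the element from the first source are assumptions made for the example.

```python
from typing import List, Sequence


def merge_via_add(a: Sequence[float], b: Sequence[float],
                  mask: Sequence[int], select: int = 1) -> List[float]:
    """Merge two vectors by zeroing the masked-out source and adding (hypothetical model)."""
    result = []
    for x, y, m in zip(a, b, mask):
        if m == select:
            y = 0    # source B is masked out for this element
        else:
            x = 0    # source A is masked out for this element
        result.append(x + y)   # adding zero passes the other element through
    return result
```

With `a = [1, 2, 3]`, `b = [10, 20, 30]`, and `mask = [1, 0, 1]`, the result takes elements 0 and 2 from `a` and element 1 from `b`.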
Vector Memory Operation

Because vector applications tend to issue many
vector memory instructions, the execution time of
these instructions is a critical factor in the
performance of a vector processor. Therefore, the V-box
was designed to minimize the execution time by
taking advantage of the VAX 9000 CPU's large 128KB
data cache, by prefetching vector data, and by
fetching it in blocks instead of element by element.
Memory requests by the V-box are sent through
the VAX 9000 CPU to the cache and address
translation hardware (M-box) of the VAX 9000 CPU. The
M-box translates the 32-bit virtual addresses for
vector data into physical addresses and accesses the
proper locations in the data cache. The vector
address-mask unit generates the virtual addresses
for the vector elements. For vector load and gather
instructions, the vector data is returned to the
V-box through the E-box and written to the proper
vector registers. The M-box returns 64 bits of data
each cycle. For vector store and scatter instructions,
the vector elements are sent through the E-box to
the M-box. Although the vector register unit is
capable of sending 64 bits at a time, the E-box need
only forward 32 bits per cycle to the M-box. The
M-box requires two cycles to write the cache and
does not actually write the 64-bit data until the
second cycle. (The first cycle performs the cache tag
lookup.) Because the V-box implements
synchronous memory management exception reporting,
once a vector memory instruction begins
execution, no other vector instruction may be issued until
the memory instruction completes.
The VAX 9000 CPU prefetches vector data. This
mechanism is used to move data from the main
memory to cache in a manner which optimizes
memory bandwidth. By using this method, a 25
percent improvement in the performance of vector
load instructions is achieved. The prefetching starts
when the scalar microcode on the VAX 9000 CPU
checks the stride of a VLDQ instruction. If this stride
is 8 bytes long (quadwords are contiguous in
memory), the microcode converts the instruction into a
block load instruction and sends it to the V-box.
The block load instruction directs the V-box to issue
a series of block load requests for vector data. A
block load request moves an entire cache block
from the memory into the vector registers. These
blocks are loaded into both the cache and the vector
registers when they come from main memory.
(Bypassing the cache to load the vector registers
directly reduces the effect of a cache miss for vector
data.) If the stride is not 8 bytes, the memory
requests are done for one register element at a time.
In addition to converting the VLDQ to a block
load instruction, the scalar microcode also issues
prefetch requests to the M-box. The M-box
determines if the data is valid in the cache. If so, no
further action is taken on the request. If not, the data
is requested from main memory. In this manner
several prefetch requests are started in successive
cycles. This method results in multiple memory
banks being used in parallel. Vector data comes
back to the cache at a rate of 500 megabytes
(MB) per second. The microcode stops issuing
prefetch requests when all the vector data has been
requested. This ensures that the requests from the
V-box do not encounter many cache misses.
Vector Address-Mask Unit
The vector address-mask unit performs the address
generation and memory requests needed to
execute the vector memory instructions VLD, VST,
VSCAT, and VGATH. It also contains the vector mask
register and support logic for masked instructions.
Further, it contains the complete vector arithmetic
exception register (VAER), which it updates based
on the status sent by the vector add and vector
multiply units.

For vector memory instructions, the vector
address-mask unit receives the base (starting
memory address of the vector) and stride (distance
between vector elements in memory) of the
instruction from the vector control unit in an indirect
manner through the vector register unit. Both the
base and stride are 32 bits long.
For most vector load and store instructions, the
memory addresses for the vector data are generated
in an iterative fashion. During the first cycle of
execution, the base address bypasses the address adder
and is immediately sent to the M-box to request the
first element. Concurrently, the base and stride are
added together by the address adder and latched to
provide the address of the next element. In the next
cycle, the latched address is sent to the M-box and
to the address adder, where it is added to the stride
to generate the next address. The process repeats
until all element addresses have been issued. In
tandem with the address generation, the vector
control unit directs the vector register unit to send
or receive the appropriate vector register element.
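
The iterative address generation for strided loads and stores can be modeled as below. The function name `strided_addresses` and the wrap to 32 bits on every step are assumptions made for the example (the text states only that base and stride are 32 bits long); the loop body mirrors the description of one address issued per cycle while the adder prepares the next.

```python
from typing import List


def strided_addresses(base: int, stride: int, length: int) -> List[int]:
    """Model one address issued per cycle for a strided vector access (hypothetical helper)."""
    issued = []
    latched = base                              # cycle 1: the base bypasses the adder
    for _ in range(length):
        issued.append(latched)                  # address sent to the M-box this cycle
        latched = (latched + stride) & 0xFFFFFFFF   # adder result latched for next cycle
    return issued
```

A call such as `strided_addresses(0x1000, 8, 4)` yields the four quadword addresses 0x1000, 0x1008, 0x1010, and 0x1018; a negative stride walks downward through memory in the same way.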
For vector gather and scatter instructions, the
memory addresses for the vector data are also
issued in an iterative fashion. During the first cycle
of execution, the base address is sent to the vector
address unit. In the second cycle, the vector control
unit directs the vector register unit to send the first
element of the offset vector to the vector address
unit, which adds it to the base and latches the result.
In the third and subsequent cycles, the resulting
address is sent to the M-box while the base and next
offset are added together. The process repeats until
all element addresses have been issued. In tandem
with the address generation, the vector control unit
directs the vector register unit to send or receive the
appropriate vector register element.
For masked vector load and gather instructions,
addresses for all elements, masked and unmasked,
are sent to the M-box. However, for masked-out
elements, the request is modified from read to
read no-op (i.e., do not actually perform the read).
This process prevents the M-box from taking cache
misses and address translation exceptions on
masked-out elements. For masked-out elements,
the M-box returns a dummy value to the V-box,
which blocks the value from being written to the
vector register unit. The vector address unit directs
the control unit to block writes, based on the value
of the appropriate vector mask register bit.
For masked vector store and scatter instructions,
although both masked and unmasked elements
are read from the vector register unit, masked-out
elements are stopped from reaching the M-box. The
vector address unit, based on the vector mask
register, causes the E-box to discard the masked-out
element instead of forwarding it to the M-box.
As described earlier, a VLDQ instruction with a
stride of 8 bytes (unity stride) is converted by the
VAX 9000 scalar processor into a block load
instruction when sent to the V-box. The vector address
unit, in turn, issues a number of block load requests,
each of which is for 64 bytes of data, to the M-box
with the appropriate address and selection bits.
There are eight selection bits, one for each
quadword in the block, which tell the M-box whether to
return the corresponding quadword to the V-box
for that block load request. Generation of these
selection bits by the vector address unit is
complicated because the starting address of a vector in
memory is not necessarily aligned on a block
boundary (i.e., the vector may start in the middle of
a block). The bits also depend on the vector mask
register (for masked block loads).
To handle unaligned, masked block loads, the
vector address unit must generate selection bits that
deselect those quadwords which are not part of the
vector but lie within the same blocks as the first
and last elements of the vector. In addition, it must
deselect those quadwords within the vector that
are masked out by the vector mask register. Both of
the above requirements are handled by using an
extended version of the vector mask register to
generate the selection bits. This process involves
conceptually extending the vector mask register on
both ends with enough selection bits so that each
quadword has a corresponding selection bit. For
example, a vector starting at the last quadword of
one block requires that seven selection bits be
added at the beginning of the vector mask register
and one bit be added after the end.
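
The extended-mask idea can be made concrete with a small model. Everything named here is an assumption for illustration: the function `block_load_selects`, the choice that bit q of the selection field corresponds to quadword q of the block, and the requirement that the base address be quadword aligned. Only the 64-byte block, the eight selection bits, and the padding of the mask at both ends come from the text.

```python
from typing import List, Optional, Sequence, Tuple

QUADWORD_BYTES = 8
QUADWORDS_PER_BLOCK = 8   # 64-byte block, one selection bit per quadword


def block_load_selects(base: int, length: int,
                       mask: Optional[Sequence[int]] = None
                       ) -> List[Tuple[int, int]]:
    """Return (block_address, selection_bits) pairs for a unity-stride load.

    Hypothetical model: `base` is assumed quadword aligned, and bit q of
    the selection field is assumed to select quadword q of the block.
    """
    if mask is None:
        mask = [1] * length                       # unmasked load: every element wanted
    lead = (base // QUADWORD_BYTES) % QUADWORDS_PER_BLOCK
    # Extend the mask with deselected bits before the first element ...
    extended = [0] * lead + list(mask[:length])
    # ... and after the last element, up to the next block boundary.
    extended += [0] * (-len(extended) % QUADWORDS_PER_BLOCK)

    first_block = base - lead * QUADWORD_BYTES
    requests = []
    for i in range(0, len(extended), QUADWORDS_PER_BLOCK):
        bits = 0
        for q in range(QUADWORDS_PER_BLOCK):
            bits |= extended[i + q] << q
        requests.append((first_block + i * QUADWORD_BYTES, bits))
    return requests
```

For the case in the text, a 64-element vector that starts at the last quadword of a block (`base = 56`), the model pads seven bits in front and one behind, giving nine block requests: the first selects only quadword 7 and the last selects quadwords 0 through 6.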
Vector Multiply Unit
The vector multiply unit performs all of the vector
multiply and vector divide operations defined by
the VAX vector architecture: VVMUL, VSMUL,
VVDIV, and VSDIV. The unit can perform either one
multiply instruction or one divide instruction at a
time, but cannot perform both types of
instructions simultaneously. In addition, the unit performs
exception checking and reporting, as required,
including floating overflow, floating underflow, and
divide by zero exceptions. The unit consists of
four custom multipliers, a custom divider, a divide
unpack chip, and two packing chips. Physically,
these chips reside on the VML multichip unit of the
VAX 9000 CPU. The custom multipliers and divider
are identical to those used in the scalar execution
engine (E-box).8
Multiplication
By using four parallel multipliers, the pipeline
latency through the multiplication logic for both
single precision (integer and F_floating) and double
precision (G_floating and D_floating) is only three
cycles. Thus, the vector/scalar cross-over number
for multiplication is quite small. As a result, the
V-box achieves good performance for vector
multiply instructions with small vectors as well as
large. As a double-precision vector multiply
instruction executes, two 64-bit elements are received
from the vector register unit each cycle and are
latched in the four custom multipliers, each of
which does a 32-bit by 32-bit multiplication.
As shown in Figure 9, the element bits are
distributed in such a way that one multiplier operates
on the high-order bits of both elements; one
multiplier operates on the low-order bits of operand one
and the high-order bits of operand two; one
multiplier operates on the high-order bits of operand one
and the low-order bits of operand two; and one
multiplier operates on the low-order bits of both
elements.
During the next clock cycle, each of the four
multipliers unpacks its inputs and sends them through
a large multiplication array, which produces one
64-bit partial product and latches the product.
During the third cycle, the pack chips (VMLA and
VMLB) add the four 64-bit partial products together
to produce one result and prepare the result to be
written back to the vector register unit. In this
cycle, the four partial products are shifted
according to their weight. Weight is determined in relation
to which bits the multiplier used to produce a
result. For example, the multiplier that operated on
the high-order 32 bits (most significant bits) of both
elements produces the most significant partial
product bits, and the multiplier that operated on
the low-order 32 bits (least significant bits) of both
elements produces the least significant partial
product bits. The partial products must be aligned
or shifted properly before they are added together.
Once the partial products have been added, the
final product is then rounded, normalized, and
packed into the appropriate VAX integer or floating
point format before being written into the vector
register unit in the next cycle.
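
The decomposition into four 32-bit by 32-bit partial products and their weighted sum can be checked with ordinary integer arithmetic. The sketch below treats the operands as unsigned 64-bit integers; the function name and the omission of floating point unpacking, rounding, and normalization are simplifications made for the example.

```python
MASK32 = (1 << 32) - 1


def multiply_by_partial_products(a: int, b: int) -> int:
    """Form a 64 x 64-bit product from four 32 x 32-bit partial products (illustrative model)."""
    a_hi, a_lo = a >> 32, a & MASK32
    b_hi, b_lo = b >> 32, b & MASK32

    # One partial product per multiplier, each at most 64 bits wide.
    hi_hi = a_hi * b_hi
    hi_lo = a_hi * b_lo
    lo_hi = a_lo * b_hi
    lo_lo = a_lo * b_lo

    # Shift each partial product according to its weight, then add.
    return (hi_hi << 64) + (hi_lo << 32) + (lo_hi << 32) + lo_lo
```

The weighted sum equals the full product for any pair of 64-bit operands, which is the property the pack chips rely on when they accumulate the four partial products.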
The process and pipeline stages for
single-precision multiplication (VVMULF and VSMULF) are
similar to the process used for double-precision
multiplication. However, in single-precision
multiplication, only one multiplier chip is needed to
produce the result, and the pack chips do not need to
sum the partial products. Integer multiplication is
slightly different from floating point multiplication
because it does not need to be accumulated or
rounded. Thus, the correct product is produced
by one multiplier. The result bypasses the
accumulation and rounding logic and proceeds directly
into the packing logic to be sent to the vector
register unit.

[Figure 9  Vector Multiply Unit]
The exponent handling for both multiplication
and division is performed by the same logic on the
packing chips. Depending on the instruction being
executed, the exponents are either added
(multiplication) or subtracted (division). The result of this
operation is then piped to the next stage and the
position of the hidden bit is determined. If the
fractional portion of the data must be shifted to ensure
the hidden bit is in the correct position, the
exponent is then incremented or decremented
accordingly. The normalize count (i.e., shift count) is used
to select the correct final exponent. Overflow and
underflow exceptions can be detected and reported
only after the final exponent is selected. If
an exception is detected, then a reserved operand is
written to the appropriate vector register element.
The first stage of the exponent logic also checks for
divide by zero and reserved operand exceptions.
Division
Vector division is a variable-cycle function. The
number of cycles depends on the format of the
operands. The custom divider is capable of
producing six quotient bits per cycle. Therefore,
F_floating point division is performed in 7 cycles,
G_floating point in 12 cycles, and D_floating
point in 13 cycles. Because of the variable number
of cycles in a divide instruction, no other
instruction can execute in the V-box while a divide is in
process. Also, because of the iterative nature of
division (i.e., one division must be completed before
another can be started), the instruction cannot be
pipelined.


As a vector divide instruction executes, two
64-bit elements are received from the vector
register unit each cycle and are latched in the divide
unpack chip. The elements are unpacked, and the
fractional portion of the elements is sent to the
custom divider in 32-bit slices. The exponent portion
is sent to the shared exponent logic on the packing
chips, as described in the Multiplication section.
During this cycle, time-critical values, such as
complemented element values and first-cycle quotient
bits, are calculated and forwarded to the custom
divider.

When the divider receives the data, it uses an
iterative algorithm to produce six quotient bits per
cycle. The quotient bits produced are then sent to
the packing chips, which may have to increment
the quotient, depending on the value of subsequent
quotient bits. The divider instructs the quotient
accumulation logic whether or not incrementing is
necessary. The partial quotient, once decided, is
held in a bank of latches until all the quotient bits
are received. When the entire quotient is available,
the result is rounded, normalized, and packed by
using the same logic path as multiplication. A
multiplexer switches this packing logic between the
multiplication and division logic.
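
The quoted divide latencies are consistent with a simple count of quotient cycles plus a fixed overhead. The sketch below is one possible reading, not a statement of how the hardware is sequenced: the three-cycle overhead (for example unpack, round, and pack) is an assumption inferred from the figures, and the fraction widths are the standard VAX values including the hidden bit (24 bits for F_floating, 53 for G_floating, 56 for D_floating).

```python
import math

# Fraction widths including the hidden bit (standard VAX formats).
FRACTION_BITS = {"F": 24, "G": 53, "D": 56}


def divide_cycles(fmt: str, bits_per_cycle: int = 6, overhead: int = 3) -> int:
    """Estimate divide latency per element (hypothetical model).

    `overhead` is an assumed fixed cost chosen so that the model
    reproduces the cycle counts quoted in the text.
    """
    quotient_cycles = math.ceil(FRACTION_BITS[fmt] / bits_per_cycle)
    return quotient_cycles + overhead
```

Under these assumptions the model returns 7, 12, and 13 cycles for F_floating, G_floating, and D_floating, matching the figures above.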

Performance Characteristics
As of this writing, testing of the vector performance
of the VAX 9000 system has only just begun.
However, some preliminary results are presented in
Table 3. We expect that these results will improve
as testing continues and more code is optimized
to take advantage of the chaining and overlapping
provided by the V-box.

Table 3  VAX 9000 Model 210 Preliminary Performance
         Double-precision MFLOPS, Uniprocessor

    Benchmarks:     Peak rate; LFK (Geometric mean);
                    LFK (Arithmetic average); LINPACK; FFT;
                    Convolution; Matrix multiply
    Size column:    441; 441; 1000²; 4096; 150 x 1500; 64²
    Vector column:  125; 13.2; 20.6; 99.15; 111.36

Chaining and Overlapping
Because of the design of the vector register unit,
the V-box can concurrently execute a vector
add-class instruction, vector multiply instruction, and
vector memory instruction. Unlike the VAX 6000
Model 400 system, vector register conflicts between
these instructions have little effect on overlapping.
With the VAX 9000 system, a conflict delays
the execution of the subsequent vector instruction
by one or two cycles at most.

However, the overlapping behavior of the V-box
is sensitive to the issue order of vector instructions.
If two vector instructions executed by the same
V-box unit are issued one after the other, the second
instruction is delayed until the V-box unit has
finished executing the first. In addition, vector
instructions issued after a vector memory instruction or
divide instruction do not begin execution until the
previous instruction completes. A general rule in
scheduling code for the VAX 9000 V-box is to
generate, whenever possible, instruction triples, where
the first two instructions are a vector add-class and
vector multiply instruction and the last instruction
is a vector memory or vector divide instruction.
Failing that, at least one vector add-class or vector
multiply instruction should be issued before a
vector memory or vector divide instruction.

The following code examples demonstrate the
usage of the VAX vector instruction set and the
overlapping behavior of the VAX 9000 V-box. (Note: It
should be assumed in the examples that all arrays
are 8-byte double precision.)

In the following DAXPY inner loop example, the
first two VLDQ instructions do not overlap.
However, the VSMULD, VVADDD, and VSTQ instructions
do overlap.

    Do i = ...
       DY(i) = DY(i) + DA * DX(i)
    enddo

vectorizes as:

    VLDQ     DX, #8, V0     ; Load vector DX
    VLDQ/M   DY, #8, V2     ; Load vector DY
                            ; with modify intent
    VSMULD   DA, V0, V1     ; V1 = DA*DX
    VVADDD   V1, V2, V3     ; V3 = V1 + DY
    VSTQ     V3, DY, #8     ; Store vector DY

The first two VLDQ instructions do not overlap in
the following MERGE example,

    Do i = 1, 64
       a(i) = b(i) - c(i)
       if (a(i) .gt. 0) then
          b(i) = a(i)
       else
          b(i) = c(i)
       endif
    enddo

vectorizes as:

    VLDQ     b, #8, V0      ; Load vector b
    VLDQ     c, #8, V1      ; Load vector c
    VVSUBD   V0, V1, V2     ; b-c
    VSTQ     V2, a, #8      ; Store vector a
    VSLSSD   #^X0, V2       ; Test a(*) and set mask
                            ; in VMR. (VSCMP
                            ; pseudo-op doing Less
                            ; Than Signed test)
    VVMERGE  V1, V2, V0     ; Merge a and c into b
                            ; using mask in VMR
    VSTQ     V0, b, #8      ; Store vector b

However, the VVSUBD instruction does overlap
with the VSTQ instruction. Both the VSLSSD
(VSCMP) and VVMERGE instructions are executed by
the vector add unit. Therefore, these two
instructions do not overlap. However, the VVMERGE
instruction does overlap with the VSTQ instruction.

An IF-THEN-ELSE example, such as the
following,

    Do i = 1, 64
       if (a(i) .gt. 0) then
          b(i) = c(i)
       else
          b(i) = c(i) / a(i)
       endif
    enddo

vectorizes as:

    VLDQ      a, #8, V0     ; Load vector a
    VSLSSD    #^X0, V0      ; Test a(*) and set mask
                            ; in VMR. (VSCMP
                            ; pseudo-op doing Less
                            ; Than Signed test)
    VLDQ      c, #8, V1     ; Load vector c
    VVDIVD/0  V1, V0, V2    ; Masked divide of c by a
                            ; for VMR = 0
    VSTQ/1    V1, b, #8     ; Store "then" part of b(*)
    VSTQ/0    V2, b, #8     ; Store "else" part of b(*)

Nothing overlaps the first VLDQ instruction, but
the VSLSSD instruction does overlap the second
VLDQ instruction. Nothing can overlap with the
VVDIVD instruction. Thus, the VSTQ instruction
does not begin execution until the VVDIVD
instruction completes. The remaining VSTQ instruction
waits for the first VSTQ instruction to complete.

In the following scatter-gather example, none of
the instructions is overlapped.

    Do i = 1, 64
       if (a(i) .eq. 0) then
          b(i) = c(i)/d(i)
       endif
    enddo

vectorizes as:

    VLDQ     a, #8, V0      ; Load vector a
    VSEQLD   #^X0, V0       ; Test a(*) for zero and
                            ; set mask. (VSCMP pseudo-
                            ; op doing Equal test)
    IOTA     #8, V1         ; Make compressed
                            ; vector of offsets,
                            ; write size of vector
                            ; to VCR
    MFVCR    R0             ; Move VCR into R0
                            ; (MFVP pseudo-op)
    MTVLR    R0             ; Load new VLR value
                            ; (MTVP pseudo-op)
    VGATHQ   c, V1, V2      ; Gather vector c
                            ; using offsets in V1
    VGATHQ   d, V1, V3      ; Gather vector d
                            ; using offsets in V1
    VVDIVD   V2, V3, V4     ; Divide c by d
    VSCATQ   V4, b, V1      ; Scatter vector b using
                            ; offsets in V1

It should be noted in this example that the
VSEQLD and the IOTA instructions do not overlap.
This lack of overlap occurs because the IOTA
instruction is actually done with microcode on the
E-box, and the IOTA instruction cannot begin
execution until the VSEQLD instruction has computed
all the new vector mask register bits. The vector
register access instructions (MFVCR and MTVLR)
take only a few cycles and do not significantly affect
the overlapping of other vector instructions.
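
The role of a compressed offset vector in a scatter-gather sequence can be illustrated with a software model of the offset generation. This is a sketch rather than a specification of the IOTA instruction: the function name `iota_offsets`, the list representation of the mask, and the `select` argument are assumptions made for the example.

```python
from typing import List, Sequence, Tuple


def iota_offsets(stride: int, mask: Sequence[int],
                 select: int = 1) -> Tuple[List[int], int]:
    """Build a compressed vector of byte offsets from a mask (hypothetical model).

    Returns the offsets of the selected elements and their count; the
    count plays the role of the value written to VCR in the example.
    """
    offsets = [i * stride for i, bit in enumerate(mask) if bit == select]
    return offsets, len(offsets)


# Usage: elements of `a` equal to zero select the positions to process.
a = [0.0, 3.0, 0.0, 5.0]
mask = [1 if value == 0 else 0 for value in a]
offsets, count = iota_offsets(8, mask)
```

For this input the selected positions are elements 0 and 2, so the offsets are 0 and 16 bytes and the count is 2, which would become the new vector length for the gather, divide, and scatter that follow.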

Summary
By taking advantage of key features of the VAX
vector architecture, such as instruction
overlapping, imprecise exceptions, and asynchronous
interaction with the scalar processor, the vector
processor of the VAX 9000 system provides
supercomputing performance for computationally
intensive applications. Through the use of barber poling,
the vector processor can overlap two vector
arithmetic instructions with one memory instruction
to deliver a peak double-precision performance of
125 MFLOPS.
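
The peak figure can be cross-checked against the 16-nanosecond machine cycle time quoted elsewhere in this issue. The short calculation below assumes that the add and multiply pipelines each deliver one result per cycle when overlapped, which is an interpretation of the text rather than a stated fact; the 64 bits per cycle returned by the M-box is taken from the Vector Memory Operation section.

```python
CYCLE_TIME_NS = 16            # VAX 9000 machine cycle time
RESULTS_PER_CYCLE = 2         # assumed: one add result plus one multiply result
BYTES_PER_CYCLE = 8           # 64 bits returned by the M-box each cycle


def per_second_in_millions(per_cycle: float, cycle_ns: float = CYCLE_TIME_NS) -> float:
    """Convert a per-cycle quantity into millions per second."""
    cycles_per_second = 1e9 / cycle_ns
    return per_cycle * cycles_per_second / 1e6


peak_mflops = per_second_in_millions(RESULTS_PER_CYCLE)
peak_mb_per_second = per_second_in_millions(BYTES_PER_CYCLE)
```

Under these assumptions the calculation gives 125 MFLOPS and 500 MB per second, which agree with the peak rate above and with the vector data rate quoted for prefetching.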

Acknowledgments
The authors wish to acknowledge the technical
contributions of the following individuals to the
VAX vector architecture and the VAX 9000 V-box
design: Wayne Cardoza, Dave Cutler, Tryggve
Fossum, Rich Grove, Kevin Harris, Steve Hobbs,
Brian Koblenz, Dwight Manley, Dave Orbits,
Bob Supnik, Mike Tehranian, Cheryl Wiecek, and
Rich Witek.


References

1. Russell, "The CRAY-1 Computer System,"
   ACM Proceedings, vol. 21, no. 1 (January 1978):
   63-72.

2. VAX Vector Processing Handbook (Maynard:
   Digital Equipment Corporation, Order No.
   EC-H0419-46/89, 1989).

3. R. Brunner, VAX Architecture Reference Manual
   (Bedford: Digital Press, Order No. EY-F576E-DP,
   1990).

4. D. Marshall and J. McElroy, "VAX 9000
   Packaging - The Multichip Unit," Proceedings of
   COMPCON '90 (IEEE, Spring 1990).

5. CRAY-2 Computer System Functional
   Description (Cray Research, Inc., 1985).

6. W. Buchholz, "The IBM System/370 Vector
   Architecture," IBM Systems Journal, vol. 25, no. 1
   (1986): 51-62.

7. D. Fenwick et al., "A VLSI Implementation of
   the VAX Vector Architecture," Proceedings of
   COMPCON '90 (IEEE, Spring 1990).

8. M. Adiletta et al., "Semiconductor Technology
   in a High-performance VAX System," Digital
   Technical Journal, vol. 2, no. 4 (Fall 1990, this
   issue): 43-60.

Peter B. Dunbeck
Richard J. Dischler
James B. McElroy
Frank J. Swiatowiec

HDSC and Multichip Unit
Design and Manufacture
The VAX 9000 system effectively integrates state-of-the-art packaging and
interconnects with advanced integrated circuits to achieve a short machine cycle time
(16 nanoseconds) and a high rate of instruction execution. To meet high-frequency
electrical signal and pin count requirements for the system, engineers chose tape
automated bonding technology and consequently conceived and developed the
high-density signal carrier (HDSC). The HDSC offers densities three to five times greater
than conventional printed circuit boards. This unique technology is manufactured
using semiconductor and advanced printed circuit board techniques. The HDSC is
at the heart of the multichip unit, a high-performance logic module, with which the
VAX 9000 CPUs and system control unit are constructed.

Over the past decade, advances in the performance
of integrated circuits (ICs) have outpaced advances
in packaging and interconnect technologies. Thus a
high-performance mainframe with conventionally
packaged bipolar integrated circuits would experience
interconnect delays that account for more
than 50 percent of the system cycle time. Key to
optimizing high-end mainframe performance, then,
is the effective integration of state-of-the-art packaging
and interconnects with advanced integrated
circuits. The high-density signal carrier (HDSC) and
the multichip unit (MCU) are proprietary technologies
that shrink interconnect paths and thus
reduce the distance and electrical loading of signals
between chips. These technologies use conventional
semiconductor and printed circuit board
(PCB) equipment in many areas of manufacturing
to improve reliability at a competitive cost. The
result is shorter machine cycle time and higher
instruction execution rate. The VAX 9000 CPUs and
system control unit (SCU) are constructed entirely
of multichip units on large planar modules. The SCU
is composed of arrays of 6 multichip units, and the
CPUs are composed of arrays of 16.

Multichip Unit Design Goals
Beginning at the concept level and throughout the
development and test phase, signal integrity considerations
guided the development of the HDSC
and the multichip unit. Designers had to ensure
that the fast signals would not be disturbed by
noise. The cycle time goal for the VAX 9000 system,
16 nanoseconds (ns), allows the system to operate at
30 VAX units of performance (VUPs).
To transmit electrical signals quickly between
chips, wiring paths must have controlled ratios of
wire size to distance from voltage planes. These
impedance-controlled paths allow radio-frequency
computer signals to propagate with minimal distortion.
Prevention of noise on the signals is
paramount, and many details of the physical implementation,
including spacings between wires, are
critical to ensuring signal integrity.
To meet the cycle time goal, high-frequency electrical
signal concerns needed to be considered in
the design, concerns that would have been negligible
for slower speed signals. Due to the physics of
electrical fields, as electrical signals switch at high
frequencies, they succeed in holding their shape
(data) only if they are fed power extremely quickly,
and if they are given short paths of uniform properties
on which to travel. Due to the amount of power
and the short amount of time a signal is given to
arrive on chip, conventional chip carrier packages
were disallowed for the VAX 9000 system. The signal
paths had to be very short to be virtually noiseless.
To achieve this objective, engineers decided to
enhance tape automated bonding (TAB) technology
with a ground plane for electrical control of the
wire impedances (paths). This reduction in chip
package size also allowed all of the chips for the system
to be packaged into a tight area. Consequently,
to fit wires between chips, extremely dense HDSC
technology was conceived and developed.


The multichip unit also required careful thermal
design attention because each chip consumes up to
30 watts. Moreover, most multichip units contain
four to eight of these chips plus self-timed RAMs
(STRAMs). The key to success for the VAX 9000
program was balancing the trade-offs between performance
requirements and technology development
risks.
To meet the electrical and density requirements
for the machine, engineers specified the following
for the multichip unit:

1. Series-terminated output drivers were required
on chip. Therefore, external resistors are not
needed on the multichip units or programmed
into the design elsewhere. These external resistors
take up space and lower reliability.
2. TAB was specified for manufacturing reasons.
Short TAB tape was required to reduce switching
noise on chips. Noise would have been generated
if the TAB wires were longer. In the case of the
noisiest chips, a ground plane was added to the
tape to reduce noise.
3. HDSC etch had to be two routing layers of
18-micron by 9-micron wires on 75-micron centers
to meet the density, resistivity, crosstalk, impedance,
and other goals.
4. Four power planes, each one powered from two
sides, were required to distribute three voltage
rails with acceptably high conductivity.
5. Thin dielectric separates the power planes and
produces high capacitance, which filters noise
and improves performance. This capacitance
eliminates the need for discrete parts, which
consume valuable space and lower reliability.
6. Impedance control of the connectors on the
multichip unit was needed to prevent signal
disturbance. Rules were generated for the number
of ground pins.

The heart of the multichip unit is the HDSC. The
HDSC is an interconnect technology consisting of
nine metal layers separated by polyimide dielectric
and mounted on a copper baseplate. The top metal
layer is a pad layer used to solder-attach all of the
integrated circuits and connectors. The four metal
layers below make up the signal core. The signal
core is a controlled-impedance, dual buried strip-line
interconnect system used to wire all integrated
circuits to each other and to the connectors. The
power is brought from the perimeter of the HDSC to
the integrated circuits through the bottom four
metal layers.
All integrated circuits in the multichip unit are
attached to the HDSC by a tape automated bonding
(TAB) process. The VAX 9000 system uses four types
of chips, all of which have emitter coupled logic
(ECL): gate arrays, custom chips, and two types of
STRAMs. At each chip site, a cutout in the HDSC
allows the chip to directly attach to the baseplate.
The signals on and off the multichip unit are carried
by four signal flex connectors which attach to the
perimeter of the HDSC. The signal flex connector
provides a separable interface to the planar board
and extends the controlled-impedance electrical
environment of the HDSC. Power is brought
through two power connectors attached to opposite
sides of the HDSC. The signal flexes, the power
connectors, and the baseplate are attached to the
multichip unit housing. The housing provides the
structure for the multichip unit and holds the components
needed to position and wipe the signal
flex. The chips and HDSC surface are covered by a
plastic lid.
The high-powered chips are efficiently cooled by
a short conductive path through the back of each
chip. The thermal power is conducted from the
chip to the baseplate and into a pin fin heat sink
over which air is impinged to remove the heat.
The following sections describe the implementation
of the technology.

The HDSC Design and
Manufacturing Process
The goal for the HDSC project was to produce a
high-density, high-performance, manufacturable
printed circuit board. This goal was achieved. The
density of the HDSC is three to five times greater
than that of conventional printed circuit boards.
Even at this density, the HDSC maintains the signal
integrity of bipolar integrated circuits with edge
speeds of 200 picoseconds. This section describes
how the manufacture of the HDSC pushes the limits
of printed circuit board and semiconductor equipment
into new types of applications. We also
address the integration of computer-aided design (CAD)
tools, process controls, and test feedback, which
helped us to achieve the results we sought.

HDSC Technology
As noted earlier, the HDSC has nine copper layers
for power and signal distribution. The insulating
material, polyimide, has a low dielectric constant
of 3.5 as compared with oxide or nitrides used in
integrated circuits or as compared with ceramic,
which is used for hybrid circuits. The interconnect
is laminated to a copper baseplate to provide
mechanical structure as well as attachment of the
multichip unit heat sink.
The conducting layers consist of the following:
• Two layers for signal distribution
• Two layers that serve as signal reference planes
• Four layers for power distribution
• One layer with bonding pads to attach the TAB
  and connectors

The signal distribution is a single x-y pair that
uses the reference planes to create a dual strip-line
interconnect. This interconnect provides a
controlled-impedance signal path with minimal
crosstalk. Table 1 lists the electrical and physical
design parameters of the HDSC.

Process Overview
The HDSC is manufactured by two types of processes:
core processing and assembly processing.
Figure 1 is a diagram of the HDSC process flow.
The core process, described further below, uses
semiconductor manufacturing equipment and is
similar to the manufacturing process for the back
end of an integrated circuit. Two cores are manufactured:
a signal core for strip-line signal interconnect,
and a power core for the four planes
(or layers) that distribute power throughout the
finished HDSC.
The second process, assembly, uses advanced
printed circuit board techniques to laminate and
interconnect the signal core and power core. The
completed HDSC has solder pads to accept the outer
lead bond of TAB integrated circuits, signal flex, and
power flex. The HDSC is tested with a custom flying
probe tester. Tests are made to ensure the HDSC is
functional and meets electrical parameters.

Table 1   HDSC Physical and Electrical Design Parameters

Line pitch              75 microns
Line width              18 microns
Line thickness          10 microns
Dielectric thickness    25 microns
Dielectric constant     3.5
Line impedance          60 ohms
Line resistance         1.0 ohm/centimeter
Crossover capacitance   3.6 femtofarads
Crosstalk               5.1 percent maximum
Propagation delay       66 picoseconds/centimeter
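
Several of the Table 1 entries can be cross-checked against one another with first-order formulas. The sketch below is illustrative only and is not taken from the design tools used on the project. It assumes an ideal line fully embedded in the dielectric (delay set only by the dielectric constant) and the bulk resistivity of copper at room temperature; both are assumptions, and the function names are hypothetical.

```python
import math

C_CM_PER_PS = 0.0299792   # speed of light in centimeters per picosecond
RHO_CU_OHM_CM = 1.7e-6    # assumed bulk copper resistivity (ohm-centimeters)


def ideal_delay_ps_per_cm(dielectric_constant):
    """Delay of an ideal line fully embedded in the dielectric."""
    return math.sqrt(dielectric_constant) / C_CM_PER_PS


def line_resistance_ohm_per_cm(width_um, thickness_um, rho=RHO_CU_OHM_CM):
    """DC resistance per centimeter of a rectangular conductor."""
    area_cm2 = (width_um * 1e-4) * (thickness_um * 1e-4)
    return rho / area_cm2


def lines_per_cm(pitch_um):
    """Number of parallel lines per centimeter on one routing layer."""
    return 1e4 / pitch_um


if __name__ == "__main__":
    print(f"ideal delay: {ideal_delay_ps_per_cm(3.5):.1f} ps/cm")
    print(f"resistance:  {line_resistance_ohm_per_cm(18, 10):.2f} ohm/cm")
    print(f"density:     {lines_per_cm(75):.0f} lines/cm per layer")
```

Under these assumptions the estimates come out at roughly 62 picoseconds per centimeter and 0.94 ohm per centimeter, close to the 66 picoseconds per centimeter and 1.0 ohm per centimeter listed in the table.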


Figure 1   HDSC Core and Assembly Process Flow (panels: signal core and power core process flows, each ending in test, then on to the MCU)


Core Processing  The process for the manufacture
of the signal and power cores, or the core process,
consists of alternating between copper deposition
and polyimide coating until the completed interconnect
layers are built on the metal wafer. The process
is performed on a metal substrate shaped like a
6-inch semiconductor wafer. Copper layers are
deposited by a combination of sputtering and plating
techniques. Patterns in the copper that become
signal traces are generated by a semiconductor
photolithographic technique. First, a photoresist is
applied to the metal wafer. The resist is then
exposed to the pattern in the mask that is held by
the semiconductor wafer aligner. This pattern is
then developed in the resist and etched into the
copper. The remaining copper thickness is then
added by plating. Another resist pattern is developed
over the plated signal traces to define where
a copper connection between interconnect layers
will occur. This connection is called a via post, and
it is also formed by a plating process.
Polyimide is spun on to the wafers by integrated
circuit photoresist spin tracks. The relatively thick
polyimide (25 microns at signal layers) helps to
planarize the surface of the wafers and also to cover

Figure 3   Isometric of a Gate Array Showing Features of the TAB (labels: integrated circuit, 200-micron pitch, inner lead bond, TAB, encapsulation, outer lead bond, high-density signal carrier)


no plating is required for epoxy die attach. The
epoxy die attach is filled with microscopic particles
to enhance the thermal conductivity while maintaining
electrical isolation between chips.

Signal Flex Connector
The signal flex connector is a high-density,
controlled-impedance connector used to transmit signals
between the HDSCs and the planar module.
Each multichip unit has four flex connectors with
a combined signal I/O of 800 in an area less than
40 square centimeters. Figure 4 shows a cross section
of one signal flex connector. The body of the
connector is a two-metal-layer flex print with 50-
and 60-ohm signal lines. The ground plane in the
flex circuit is used as an AC return path. No power is
carried through the signal flex. The signal plane
contains 200 etch lines with a raised gold bump on
each at the planar module interface. The connection
to the HDSC is a solder bond similar to the solder
bonds for the TAB device. A window is opened
through the polyimide to allow the formation of
cantilevered, exposed, solder-plated leads.
The raised bump on the flex circuit concentrates
the contact force into a small area. The bump is
solid copper that is plated over with nickel and hard
gold. The force on the bump is generated by compressing
a molded silicone rubber elastomer. The
compression of the connector causes the flex
frame to engage a cam on the housing and wipe the
contacts across the planar module pads. The connector
is compressed, nominally, 1.27 mm and
wipes 0.46 mm. The bottom of the elastomer mates
with a tray which has a contoured surface to vary
the compression along the length of the elastomer.
This contoured surface improves the uniformity of
the force that the bumps exert on their pads. The
connector has been designed to generate 100 grams
minimum load on all bumps. The wipe action and
the bump force of the connector minimize the
effect of dust and environmental films on the mating
surfaces.
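
The connector figures quoted above imply a few derived quantities that are easy to check. The short sketch below is only an illustration of that arithmetic, using the 800 signals, four connectors, 40 square centimeters, and 100-gram minimum bump load stated in the text. The function name and the dictionary keys are hypothetical, and the conversion to newtons assumes standard gravity.

```python
GRAMS_TO_NEWTONS = 9.80665e-3  # newtons per gram-force at standard gravity


def flex_connector_summary(total_signals=800, connectors=4,
                           area_cm2=40.0, min_load_g=100.0):
    """Derive per-connector figures from the quoted connector data."""
    per_connector = total_signals // connectors
    force_g = per_connector * min_load_g
    return {
        "signals_per_connector": per_connector,
        # The text gives the area as an upper bound, so this is a lower bound.
        "min_signals_per_cm2": total_signals / area_cm2,
        "min_force_per_connector_g": force_g,
        "min_force_per_connector_n": force_g * GRAMS_TO_NEWTONS,
    }


if __name__ == "__main__":
    for name, value in flex_connector_summary().items():
        print(f"{name}: {value}")
```

The 200 signals per connector agree with the 200 etch lines on the signal plane. The total force is a simple sum over bumps and says nothing about how the contoured tray distributes it.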

Power Connector
The power consumed by the multichip unit is
brought in through two power connectors mounted
on opposite sides of the HDSC. The connector is
composed of a flex circuit, a connector, and decoupling
capacitors. The flex circuit is solder bonded to
large pads on the HDSC surface. The flex has three
copper conductive planes separated by polyimide
dielectric. The connector has stamped metal contacts
soldered into the flex circuit and assembled
into a plastic housing. The connector plugs into flat
blades on the bus bar of the planar module assembly.
The decoupling capacitors on the power flex
circuit filter the medium-frequency switching noise
on the MCU and the MCU power bus.

Figure 4   Signal Flex Connector with Detail of Bump (labels: planar module, signal flex circuit, flex circuit bump, elastomer, elastomer tray)

Thermal Design
The multichip unit was designed from conception
to provide an efficient cooling path for the integrated
circuits. Figure 5 shows a cross section of the
multichip unit. The heat dissipated by the chips is
conducted through the silicon and the die attach
into the baseplate. As mentioned above, the die
attach is an epoxy heavily filled with microscopic
diamond particles to increase thermal conductivity.
The heat spreads out in the copper alloy baseplate
and is conducted across a dry interface to an aluminum
base of the pin fin heat sink. The heat sink
has 600 aluminum pins, each 0.20 centimeters in
diameter, pressed into the base. Air plenums in the
cabinets direct at least 14.6 liters per second of air
into each multichip unit heat sink. The thermal
resistance for a 30-watt gate array is less than 2.0
degrees Celsius per watt, which gives a junction
temperature of 85 degrees Celsius with room air at
25 degrees Celsius. This low junction temperature
is a critical part of the high reliability of the multichip
unit.
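
The junction temperature quoted above follows from the usual steady-state relation: junction temperature equals ambient temperature plus dissipated power times thermal resistance, with the resistance expressed in degrees Celsius per watt. The sketch below is a minimal illustration of that relation using the numbers in the text. The function names are hypothetical, and the model treats the whole path from junction to room air as a single lumped resistance.

```python
def junction_temp_c(ambient_c, power_w, theta_c_per_w):
    """Steady-state junction temperature for a lumped thermal resistance."""
    return ambient_c + power_w * theta_c_per_w


def max_thermal_resistance(ambient_c, power_w, junction_limit_c):
    """Largest junction-to-air resistance that keeps the junction at or below the limit."""
    return (junction_limit_c - ambient_c) / power_w


if __name__ == "__main__":
    # 30-watt gate array, 25 degree room air, 2.0 degrees Celsius per watt
    print(junction_temp_c(25.0, 30.0, 2.0))          # 85.0
    print(max_thermal_resistance(25.0, 30.0, 85.0))  # 2.0
```

Because the stated resistance is an upper bound, 85 degrees Celsius is the worst case for a 30-watt chip in 25-degree air under this model.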

Figure 5   Thermal Path

Multichip Unit Manufacturing
Figure 6 shows the manufacturing process flow,
which has three major work centers:
• 54-class assembly and inspection
• P1000 assembly and inspection
• Test and diagnose

In the 54-class process, TAB semiconductor
devices are assembled to the HDSC substrate, resulting
in the subassembly known internally as a
54-class module. In the P1000 process, connector and
housing components are assembled. At the last
major center, the test process, final units are tested
and, if necessary, diagnosed. A shop floor control
system tracks the units through the line and provides
critical component and process trace information.
In addition, this control system is used to
monitor process parameters to ensure control of
the line and consistent product quality.
The following section provides insight into
several of the process technologies we used to meet
the manufacturing goals of the VAX 9000 system.


Clock Distribution
The system clock on the VAX 9000 system is
distributed to each of the multichip unit clock
distribution chips (CDxx). The CDxx generates 40
differential outputs which are routed through
equal-length etch to the other chips. The CDxx also
distributes and controls the scan lines that test the
unit both in manufacturing and in the field. The
scan lines also allow the unit serial number and revision
status to be read by the system console.


TAB and Flex Circuit Bonding
The insertion and soldering of leads is the most
critical step in the multichip unit manufacturing
process. Single-lead and multiple-lead gang bonding
approaches were both considered. Gang reflow soldering
is an effective way to achieve repeatable, reliable
connections for both the TAB semiconductors
and the signal flex circuits. Early development work
on manual machines required operator action for
lead forming, lead alignment, and gang bonding.
Today, critical process parameters (time, pressure,
temperature) are computer controlled to specified
values, and the process uses tools to assist the
operator in material movement and vision systems
to improve alignment of leads. Before bonding, the
leads are covered with a low activation flux which
is removed later in the process.

Die Attach
Another critical manufacturing step is the die attach
process. The excellent thermal performance of the
multichip unit is achieved by following these steps:
• Careful control of the die attach materials with
  feedback to our suppliers.
• Surface cleanliness specified and also managed
  with our suppliers.
• Dispensing of epoxy. The filled epoxy is dispensed
  by an x-y table that is computer controlled
  to supply the correct pattern for the
  particular multichip unit type.
• Establishment of bond line thickness and epoxy
  cure. Bond line thickness is accomplished by
  mechanically applying pressure while curing in
  a purged belt furnace.

Figure 6   Manufacturing Process Flow (labels: end of 54-class assembly, start of P1000 assembly, align HDSC to housing, ship)

Inspection
To ensure that all soldered leads are reliably
bonded, leads must be inspected for shorts, misalignments,
opens, and weak joints. Shorts and misalignments
are discovered by an automated vision
system that calls marginal points to the operator's
attention. The operator can then determine if
repair action is warranted. Inspection for opens and
weak joints is done by striking the leads with a pulse
of laser energy and then measuring the thermal
decay profile. Repair is typically made by localized
short removal or single-point bonding. Over time,
we believe that our materials and processes can be
controlled to the point at which inspection and
repair can be dramatically reduced.

Final Test
The goal of our test process was to ensure that
multichip units would operate successfully in a
system environment. Since no test equipment
manufacturer offered a system that met our needs,
we developed our own by working with several
Digital groups as well as outside suppliers. The
system contains three major stations. The first
provides alignment information and can also read
visual serial and part numbers. In the second station,
low voltage shorts are determined between
nearest neighbor leads. This step supplements our
inspection for shorts described above. In the final
station, we test for connector opens, thermal measurement
(die attach integrity), scan chain integrity,
and scan pattern data. The scan pattern testing is
done in several bursts of the clock at system speed.
In addition, diagnostic capability is provided by flying
probes, voltage and clock margining, and a thermal
chuck to vary temperature.

Conclusion
Successful use of advanced interconnect technologies
requires a seamless phased development process
that begins with advanced development and
continues through volume manufacture. The HDSC
and multichip unit technologies have successfully
achieved the volume manufacturing phase. Using
the products and technologies described
in this paper, we have played a key role in the introduction
of the VAX 9000 system to the marketplace.
Extensions of this manufacturing process will
ensure that this technology base can be applied
across a wide spectrum of products of both higher
and lower performance.


Matthew S. Goldman
Paul H. Dormitzer
Paul A. Leveille

The VAX 9000 Service
Processor Unit
The VAX 9000 service processor unit provides the front-end services needed to support
a highly available and reliable mainframe system. The unit is closely linked to the
VAX 9000 system to provide realtime detection and recovery of system failures.
However, the unit is independent enough to be isolated for maintenance without
affecting normal system processor operation. This combination is a first for VAX
systems. The service processor also provides various debugging features that were
essential for development and early manufacture of the VAX 9000 system. These
features utilize a system-wide scan architecture to achieve direct access to machine
state, which provides extensive visibility and control of system logic functions. The
inclusion and use of such a scan architecture is a new feature for a Digital processor.

The VAX 9000 service processor unit (SPU) is
designed to provide a dedicated subsystem for service
and maintenance support for the VAX 9000
family. The SPU serves two distinct roles. It functions
as the familiar operator interface (i.e., VAX
console) and as a maintenance vehicle used to diagnose
and isolate system processor hardware faults.
The SPU performs the following major front-end
services:
• System initialization
• Power system control and monitoring
• Environmental monitoring
• Clock control and monitoring
• VAX 9000 operating system access to SPU mass
  storage devices (disk and tape)
• Remote diagnosis port support
• System error detection, recovery, and reporting

The SPU also provides or assists in the following
system diagnosis functions:
• SPU module self-tests
• Scan system diagnostics
• Clock system diagnostics
• Scan pattern structural diagnostics
• Structure cell (e.g., self-timed random-access
  memory [RAM]) diagnostics
• XMI-to-system control unit adapter interface test
• Symptom-directed diagnosis support

In addition to its use as the front-end processor
for the VAX 9000 system, the SPU was embedded
in several manufacturing and engineering test
vehicles. In the Debugging Features section of this
paper, we describe how the SPU was used as a
debugging tool during VAX 9000 product development
and the various debugging features we
provide to help locate design and fabrication
problems.
A major goal of the SPU was to perform system-wide
error detection and recovery functions for the
VAX 9000 processor. In the Error Handling section
of this paper, we detail the types of errors that the
SPU handles and how error detection, reporting,
and recovery occur.
Another of our design goals was to be able to
service the SPU without adversely affecting the
operation of the system processor. This feature was
needed to support the high availability requirements
of a mainframe system. To meet this goal, we
designed mechanisms to enable the VAX 9000 operating
system to determine that the SPU is not functional
(whereupon the operating system takes the
appropriate action to secure its own operation),
as well as recognize and reintegrate with the SPU
when the SPU is functional again.
If the VAX 9000 operating system attempts to
access one of the SPU-based processor registers and
the SPU does not respond, the failure is detected by
using the usual register time-out mechanism. However,
because the SPU is responsible for system error
handling, SPU failures must be detected quickly to
enable the SPU to respond to a system error should
one occur. Consequently, we developed a keep-alive
protocol with which the VAX 9000 operating
system can determine SPU failures without relying
on operating system accesses to SPU-based processor
registers. The keep-alive mechanism is
described in more detail under the Error Handling
section of this paper. Both the time-out and keep-alive
mechanisms work regardless of whether the
SPU has an unexpected failure or undergoes a scheduled
power-down.
Should the SPU require service, field upgrades
may be performed easily and quickly because of the
modularity of the hardware, which is primarily
VAXBI bus interface-based adapters. The VAXBI
backplane minimizes downtime because modules
can be removed or inserted without requiring recabling.
When power to the SPU is restored, SPU self-tests
are performed. The SPU's operating system
then boots automatically and signals its availability
to the VAX 9000 operating system.
The SPU is designed to continue operation even
if the SPU primary storage device, an RD54
Winchester disk drive, fails, which further increases
the availability of the SPU. For customers who
require data security and high availability, we
designed a system configuration option that does
not use a disk drive. In this case, the SPU boots from
TK50 cartridge tape. The SPU functions that require
a disk drive for data storage (e.g., SPU-generated
error logs) are disabled in this configuration.
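
The general idea behind a keep-alive check can be illustrated with a short sketch. This is a generic watchdog pattern, not the protocol used between the SPU and the VAX 9000 operating system, whose details are given in the Error Handling section. The class name, the timeout value, and the injectable clock are all assumptions made for illustration.

```python
import time


class KeepAliveMonitor:
    """Treats the peer as failed if no keep-alive arrives within timeout_s."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock          # injectable so the logic can be tested
        self._last_seen = clock()

    def heartbeat(self):
        """Record that a keep-alive message was just received."""
        self._last_seen = self._clock()

    def peer_alive(self):
        """True while the most recent keep-alive is recent enough."""
        return (self._clock() - self._last_seen) <= self.timeout_s


if __name__ == "__main__":
    now = [0.0]                       # simulated time in seconds
    monitor = KeepAliveMonitor(timeout_s=2.0, clock=lambda: now[0])
    now[0] = 1.5
    monitor.heartbeat()               # peer reports in at t = 1.5
    now[0] = 3.0
    print(monitor.peer_alive())       # True: only 1.5 s since last report
    now[0] = 5.1
    print(monitor.peer_alive())       # False: 3.6 s of silence
```

The point of the pattern is the one made in the text: the monitoring side learns of a failure from the absence of periodic messages, without having to wait for an ordinary register access to time out.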

SPU Architecture
A block diagram of the SPU architecture is shown in
Figure 1. The service processor module, scan con
trol module, and power and environmental monitor
were designed uniquely for the VAX 9000 system.
The disk controller, tape controller, as well as the
memory daughter board were available from other

DISK
CONTROLLER
(1 1 03 1 KFBTA)

TAPE/NETWORK"
14-------'
CONTROLLER
*
Nl
(11 034 DEBNK)

VAX

TO/FROM
REST OF PCS

SERVICE
PROCESSOR
MODULE
(12051 S P M )

POWER AND
ENVIRONMENTAL
MONITOR
(11 060 PEM)

SPU M E MORY
16 MBYTES
ECC

S P U OS

F I RMWA R E

SCAN
CONTROL
MODULE
(12050 SCM)

F I RMWARE

POWER CONTROL SYSTEM
SJI

PlY

!-----'

" N I CONNECTION U S E D DURING
DEVELOPMENT ONLY

SYSTEM PROCESSOR

Figure 1

DiRilttl Tecbnicaljournal

VtJ/. 2

VAX 9000 SPU Block Diagram and interconnects

Fa/1 /'J'J()

VAX 9000 Series

Digital products. Every S P l J VA X B I adapter provides
1

i ts own bu i l t-in self-test diagnostics.

S P U hardware is based on ei t her i ndustry-proven
(e.g. , 74 00-series

interface, the system processor may also interru p t
t h e S P U w h e n the processor needs service. T h is
type of interrupt request is known as an attention.

TTL components, complementary

The SPU is i ntegrated i n to the system cabinet to

metal oxide semiconductor [CMOS] gate arrays)

better meet the performance req u i rements neces

or Digital-proven tech nology (e.g. , VAX B I , Digital

sary for system error recovery and VAX 9000 oper

custOm CMOS devices) to ensure that the unit is a n

ating system boo t . Cabinet i n tegration substantially

e ffective debugging platform for a system processor

decreases i nterconnect distances to processor logic

based on leading edge tech nology. As a resu l t , the

and ensu res that all cables are kept i nternal to the

i n herent risk and learning cu rve associated with

cabinet. Another reason for choosing the VA X B I

n e w tech nology were avoided and t h e SPU was

backplane card cage i s t h a t i ts form factor is sma l l ,

ready and available during the VA X 9000 system

w h ic h reduces the cabinet area needed (cabinet area

protOtype debugging p rocess.

is a lways in high demand), yet the user-definable

The S I' U also was made available to manufactur
ing process and tester groups (e.g. , multichip u n i t
tester) for use w i t h their designs. T h e advantages to

zones provide the high pin density req u i red for
i nterconnects ( i . e. , 1 80 110 pins per VAX B I s lot).

this approach were t hat tec hn icians became fam i l

Communication Path

i a r w i t h t h e same subsystem t h a t wou ld b e used i n

The SPU commu nicates w i t h the system processor

t h e VAX 9000 fam i ly, a n d t h e test programs could

using the SJI . This in terface is used to load the pri

be transferred for use in other test envi ronments

mary bootstrap into the VAX 9000 main memory,

that also used the SPU , including the VA X 9000 sys

t ransfe r error and m ac h i ne-check i n fo r m a t i o n to

tem itself.

the VA X 9000 opera t i ng system , provide file trans

The service processor mod u l e is the primary

fer access between the VAX 9000 opera t i ng system

processi ng element of the S P U and is the VAX B I host

and the SPU 's R D 5 4 disk drive, access system main

adapter. Based on the M i c roVAX 78032 chip and

memory, and access system i /O registers.

several custom-designed applicat ion-specific i n te

The VA X 9000 operating system accesses the SPU

grated c i r c u i t s (e. g . , S P ll -to-system cont rol u n i t

as if i t were a standard J /0 device. T h e SPU is a n

adapter, S P U memory control ler) , t h e module con

i ndependent subsystem and does not rel y o n the

t a i n s a l l the h ar d w a re necessary to store and

execution u n i t of the system processor to be a con

execute the S P U operating system . The on-board

sole processing engine, as was done i n previous

firmware contains a VA X standard console i nterface

VA X systems. T h e re a re several b enefits to t h i s

to load the SPU operating system during i n i ti a l iza

design approac h . Each C P U has equal access t o the

tion and to assist in subsystem debugging. The S P U

S P U and may i nterrupt the SPU to request serv ice.

to-system control unit interface (SJI) connects t h e

I n addition, the SPU may i n terrupt any of the CPUs

service processor mod ule to the system control unit

to request an operating system serv ice. The S P U

and is the primary communication path between

m a y b e used a s a debugging tool d ur i ng system pro
cessor debugging because it does not req u ire that

the SPU and the VAX 9000 opera ting system.
The scan control mod u le is the control i n te rface

a n y portion of the system processor be operational.

to the VAX 9000 scan system , w h ich is the visibility

The fact that the SPU could be used as a debugging

and mai ntenance path to the system p rocessor. Like

tool was an extremely important benefit for the

the service processor module, the scan control

VA X 9000 system debugging effort. The debugger

module is based on the MicroVAX 780)2 chip ami

d i d not h ave easy a c cess to the l o g i c element s

s<:veral custom-designed applicat ion-specific inte

because o f the advanced packaging a n d c i rcu i t i n te

grated c i rc u i ts (e . g . ,

gration of the VAX 9000 system . Therefore, S P U ser

distribut ion c h ip).

scan c o n trol c h i p,

scan

On-board firmware provides

v ices were u t i l ized in l ieu of logic probes. Further,

high-level fu nctions that a l low the service p rocessor

because the SPU no longer uses t he CPU for system

module to continue processing while scan-related

access, console support microcode ( i . e . , the collec

ope rations, i n c l u d i ng logi c a l - to-p h ysical s i g n a l

tion of microcode procedures t radit ional l y used for

trans lations, a r e performed concurrent l y by the

access to the system processor, memory, and J/0

scan control mod u le. The scan i nterconnect (SCI)

registers) is not requi red . The benefit of this p rocess

connects the scan control module to the system

is that valuable VAX 9000 control store space could

processor (i.e. , one to fou r C : P U s and the system con

be used for system m i c rocode or to reduce the con

trol u n i t ) and t he master clock mod u le. Using this

trol store size. For example, in the VAX 8650 system ,

Vol. J No. 4

Fall /<)<)()

Digital Tecbnicaljournal

The VAX <)000 Senlice Processor Unit

console support m icrocode occupies approxi
mately 180 microword locations.
VAX 9000 operating system access to the SPU is through the
VAX console register set. We extended the VAX console register
set to provide access to the enhanced capabilities of the SPU.
Additional registers include transmit function request and
parameter and receive function request and parameter (i.e.,
TXFCT, TXPRM, RXFCT, RXPRM). Table 1 lists the functions
provided by these registers.
SJI communications are in the form of 14-byte packets that
contain the command (i.e., function), address, and data.
Packets are sent and received over two 8-bit data paths that
provide full duplex operation. Data transfers peak at
3.5 megabytes (MB) per second for quadword transfers.
When the VAX 9000 operating system executes a
Move_to/from_Processor_Register instruction that specifies an
SPU register, the system control unit sends an I/O command
packet, through the SJI, to the SPU to initiate the system
request. Then the SPU typically uses an interrupt command
packet, which generates an interrupt to the specified CPU. The
two other packet types are direct memory access and error
correction code.
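
The packet description above can be made concrete with a small sketch. The text gives only the total size (14 bytes), the three kinds of content (command, address, data), and the four packet types; the field widths, field order, and numeric type codes below are assumptions chosen so that the fields add up to 14 bytes, not the actual SJI format.

```python
import struct
from enum import IntEnum


class PacketType(IntEnum):
    # The four packet types named in the text; the numeric codes are invented.
    IO_COMMAND = 0
    INTERRUPT_COMMAND = 1
    DIRECT_MEMORY_ACCESS = 2
    ERROR_CORRECTION_CODE = 3


# Hypothetical layout (not from the source): 1-byte packet type,
# 1-byte function code, 4-byte address, 8-byte (quadword) data.
# "<" selects little-endian packing with no padding, so the size is 14.
_LAYOUT = struct.Struct("<BBIQ")
assert _LAYOUT.size == 14


def pack_packet(ptype, function, address, data):
    """Build one 14-byte packet from its fields."""
    return _LAYOUT.pack(ptype, function, address, data)


def unpack_packet(raw):
    """Split a 14-byte packet back into its fields."""
    ptype, function, address, data = _LAYOUT.unpack(raw)
    return PacketType(ptype), function, address, data


packet = pack_packet(PacketType.IO_COMMAND, 0x01, 0x2000, 0x1122334455667788)
```

Under this assumed layout only 8 of the 14 bytes carry data, which is one plausible reason the quoted peak rate is stated specifically for quadword transfers.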

Table 1
RXFCT/RXPRM and TXFCT/TXPRM Functions

RXFCT/RXPRM Functions (SPU to System Processor)
Remove processor
Add processor
Mark memory page bad
Request pages of memory
Send error log entry
Send OPCOM message
Get datagram buffer
Send datagram
Return datagram status
Set keep-alive state
Abort datalink
Error interrupt

TXFCT/TXPRM Functions (System Processor to SPU)
Get hardware context (of a halted CPU)
Virtual block file operation (access to SPU disk and tape)
Keep-alive
Send datagram
Return datagram status
Switch primary
Reboot system request
Clear warm start flag
Clear cold start flag
Boot secondary processor
Halt CPU and remove from available set
Halt CPU and keep in available set
Console quiet
Set interrupt mode
Abort datalink
Reset I/O system
Disable vector unit
Set keep-alive state
Start processor
Margin power
Margin clock
Fault signal
Start error window
End error window
Report error in window
Get error log entry
Get unmarked error log entry and mark
Enable halt restart
Get I/O physical address memory map configuration
Get physical address memory map configuration

Visibility Path
In the development and manufacture of a complex computer
system, extensive testing methods must be available to ensure
functional operation and product quality. Design engineering
no longer can use manual probing techniques in prototype
debugging. Space limitations have resulted from advanced
packaging and the close pitch of integrated circuit I/O pins,
which is due to high integration levels. Failure isolation
must be performed in the manufacturing process, often without
an extensive knowledge of the machine design.
A separate visibility and control path in the system
processor of the VAX 9000 system provides nearly 100 percent
visibility to the machine-state. The visibility path
eliminates the need to select a subset of visibility points to
meet all test needs, as was done with previous VAX systems. In
addition, the path allows designers to directly alter the
entire machine-state, which is a major advantage for design
and process debugging. A VAX 9000 uniprocessor (i.e., one CPU
and system control unit) contains over 26,000 access points.
The path is called the VAX 9000 scan system and is controlled
by the service control module. The scan system is the
foundation for direct access by prototype debuggers, system
error recovery software, and diagnostics to observe and alter
the VAX 9000 machine-state. Some functions provided by the
scan control module and supporting SPU software are
• Load and save processor state
• Scan pattern execution
• Continuity testing of the processor's scan hardware
• Multichip unit type and revision information extraction
• Processor attention notification

A block diagram of the VAX 9000 scan system is shown in
Figure 2. The scan control module connects to the system
planar module over the SCI. Scan and clock distribution logic,
contained in a macrocell array on the planar module,
distributes data and control signals over the scan bus to each
of the multichip units. A clock distribution chip at the hub
of each multichip unit further distributes the scan bus
signals to the macrocell arrays, which are integrated circuits
that contain system logic.
As shown in Figure 3, the state devices within a macrocell
array are scan latches. The latches are connected serially to
form a ring or chain by connecting the Scan_Data_Out line of
each latch to the Scan_Data_In line of the next latch. The end
links are connected to the clock distribution chip. When the
system clocks are running, data is loaded into the latch from
the system data input. During scan operation, system clocks
are not active. Generated by the scan control module, the scan
clocks load the latch with data from the scan data input.
Consequently, the scan control module reads system state by
issuing scan clocks, which serially shift system data to the
scan control module. System state is changed when the scan
control module drives new data to the system latches while
issuing scan clocks.
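
The serial shifting described above can be sketched as a toy model. This is a software illustration only, assuming a single ring of one-bit latches; the class and function names are hypothetical, and recirculating the scan-out bit during a read (so that reading does not destroy state) is an assumption of this sketch rather than something the text states.

```python
class ScanChain:
    """A ring of one-bit scan latches, modeled as a list.

    latches[0] is the latch nearest the scan data input;
    latches[-1] drives the scan data output.
    """

    def __init__(self, bits):
        self.latches = list(bits)

    def scan_clock(self, scan_in):
        """One scan clock: each latch loads its predecessor's value."""
        scan_out = self.latches[-1]
        self.latches = [scan_in] + self.latches[:-1]
        return scan_out


def read_state(chain):
    """Shift the whole ring out, feeding each bit back in (assumed)."""
    shifted = []
    for _ in range(len(chain.latches)):
        bit = chain.latches[-1]          # value on the scan-out line
        shifted.append(chain.scan_clock(bit))
    return shifted[::-1]                 # last latch came out first


def write_state(chain, new_bits):
    """Shift new_bits in; return the state that was shifted out."""
    old = [chain.scan_clock(bit) for bit in reversed(new_bits)]
    return old[::-1]


chain = ScanChain([1, 0, 1, 1])
before = read_state(chain)               # state is unchanged afterwards
previous = write_state(chain, [0, 0, 1, 0])
```

Reading or writing a ring of n latches costs n scan clocks, so the length of each ring bounds how quickly the machine-state can be observed or altered.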
An architectural feature permits each multichip unit to
generate an attention interrupt directly to the scan control
module over the scan data return line. Attentions notify the
SPU of system events, such as processor errors, memory
self-test completion, CPU halts, and keep-alive responses.
System diagnostics can diagnose the SCI by using the same
control signals as used for scan system operation. Dedicated
logic and special routing of the scan lines provide failure
isolation. Stuck-at faults and disconnect conditions can be
isolated to the multichip unit.

Debugging Features
In addition to its use as the VAX 9000 front-end processor,
the SPU provides a variety of features for debugging and
troubleshooting multichip unit logic configurations. These
features were required because all multichip unit logic
visibility and control is handled through the SCI, which
connects directly to the SPU. The use of scan latches to
access internal logic states is a first for VAX systems and
challenged the designers to define and deliver the necessary
tools and features to assist the multichip unit debugging
effort. Furthermore, the features provided by the SPU had to
apply to various tester environments, ranging from single
multichip units mounted in probe stations to full system
configurations. Additional requirements to support the clock
and power system test stations made it clear that the SPU
would have to be adaptable to a variety of environments.
Figure 2
VAX 9000 Scan System
[Block diagram. Labels: service processor, scan control
module, SCI, planar module, SCD (scan and clock distribution
logic), MCU0 through MCUn, scan data in and control, scan data
return.]

The translation from a logical signal to its associated scan
latch uses data structures supplied in a configuration
database file, which is loaded into SPU memory during SPU
initialization. All CPUs with identical multichip unit
configurations (i.e., same CPU revision) share the same
configuration database memory image. The system control unit
always requires its own database. Only two CPU revisions can
be supported at one time because of SPU memory constraints for
storing the separate configuration databases. However, by
providing for two CPU revisions, the needs of single and dual
CPU configurations were completely satisfied. Further, it was
possible to upgrade homogeneous triple and quadruple
configurations in a stepwise manner.
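
The name-to-latch translation and the sharing of one database image per CPU revision can be illustrated with a small lookup sketch. Everything concrete here is hypothetical: the signal name, the ring and position numbers, the revision labels, and the in-memory structure are invented for illustration, since the text does not describe the database format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatchLocation:
    ring: int       # which scan ring holds the latch
    position: int   # bit position within that ring


# One table per CPU revision (two revisions, matching the stated limit).
# Signal names and locations are invented.
CONFIG_DB = {
    "cpu_rev_A": {"PC_BIT0": LatchLocation(ring=3, position=17)},
    "cpu_rev_B": {"PC_BIT0": LatchLocation(ring=3, position=21)},
}

# CPUs of the same revision point at the same table, so a
# four-CPU system needs at most two CPU tables in memory.
CPU_REVISION = {0: "cpu_rev_A", 1: "cpu_rev_A", 2: "cpu_rev_B", 3: "cpu_rev_B"}


def locate(cpu, signal):
    """Translate a logical signal name to its scan latch for one CPU."""
    return CONFIG_DB[CPU_REVISION[cpu]][signal]


same_image = locate(0, "PC_BIT0") is locate(1, "PC_BIT0")
```

Because callers ask for a signal by name, the same tool or script works unchanged when a latch moves between revisions; only the table differs.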

Macrocode Execution
Initial system-level multichip unit configurations consisted
only of a scalar CPU. The system control unit was not yet
available as a result of the extended simulation of the
design. Fortunately, we had anticipated the possibility of
running partial configurations and could provide modes within
the SPU software to redirect commands that normally access
main memory (e.g., EXAMINE, LOAD) to access the CPU's
128 kilobyte (KB) system cache or 8KB virtual instruction
cache instead. The first VAX macro-instructions were loaded
and executed on the VAX 9000 system using this technique. An
additional feature, which involved minor hooks in the system
microcode, provided a means for the VAX instruction set
diagnostic, EVKAA, to communicate with the console terminal
through scan attentions rather than by using the system
control unit. Thus, the diagnostic could run to completion.

Advanced Debugging Features
Although not obvious aids to VAX 9000 debug, the following
features were indispensable or, at the least, reduced
debugging time and effort:
• A character-cell windowing capability that allows system
  microcode sources to be automatically located, displayed,
  and updated on the screen as the system is single-stepped.
  We modeled this feature after the VAX debugger's windowing
  capability because most VAX engineers are familiar with
  this capability. Windowing eliminated the need for
  hard-copy microcode listings and the logistical problems
  associated with their use.
• By connecting the SPU to the engineering network during
  development, timely updates of SPU software were made
  possible. This kept the VAX 9000 debugging effort, which
  was occurring simultaneously on several systems, up to date
  with the latest SPU software fixes and enhancements.
  Together with the multisession capability of the SPU
  operating system, the use of the network made remote
  debugging a reality throughout the VAX 9000 debug phase.
• Because the SPU had to initialize the VAX 9000 system
  thousands of times during system debugging, the unit was
  designed to perform system initialization as efficiently as
  possible. For example, the loading of structures (e.g.,
  control stores or cache tags) was optimized by overlapping
  the operation of three MicroVAX-based processors: the
  service processor module, the scan control module, and the
  disk controller.

The debugging features located early design and fabrication
problems in the clock, power, scan, and processor logic areas.
Ultimately, the features were used to initialize and run the
first VAX 9000 system.

Error Handling
To support high system availability, accurate and timely
error detection and logging is required. Error data collection
cannot depend upon host system availability, and the data must
be available when the system is not functional. Therefore, an
independent service subsystem that can collect data from all
system components, render it into a useful format, and store
and display the information is needed. The service subsystem
must also be organized in such a way that if it fails, it does
not directly cause system processor failures. Repair, reboot,
and system reintegration must occur without interfering with
system processor operation. The SPU meets these requirements;
it is a fully independent computer that runs its own operating
system with dedicated peripherals. The SPU performs
system-wide error detection and reporting functions and
provides advanced error recovery features for the system
processor.

Error Detection
The SPU reports errors in its own VAXBI adapters, the service
processor module, the scan control module, the power and
environmental monitor, the disk controller, and the tape
controller. It also reports errors in various parts of the
VAX 9000 system, such as the system control unit, the CPUs,
the memory system, the master clock module, and the power and
environmental systems. Because failures in any of these
subsystems can incapacitate the VAX 9000 system, none of them
reports its errors directly to the VAX 9000 operating system.
SPU Errors The disk controller, tape controller, and scan
control module use the VAXBI VAXport protocol to report
errors. The power and environmental monitor passes error
information to the service processor module through its
private bus, the SPU-to-power control system interface.
Environmental Exceptions The power and environmental monitor
monitors the regulator intelligence cards, airflow sensors,
and temperature sensors throughout the system. When it detects
any problems in operating voltages, currents, temperatures, or
airflow, it notifies the service processor operating system,
which logs the error condition.
Clock Exceptions When the master clock module detects an
error in either the clock phase or the clock frequency lock,
it generates an attention to the scan control module, which
interrupts the service processor module. The SPU operating
system logs the error condition.
Memory Error Correction Code Events The main memory of the
VAX 9000 system contains error correcting logic to correct
single-bit errors and detect double-bit errors. When a memory
location with a single-bit error is read, the system control
unit corrects the error and passes the corrected data to the
requesting device. It also writes an SPU register with the
error type and the failing memory address. The SPU operating
system writes this information to the error log. If the system
control unit detects a double-bit error or reads a marked-bad
location, it passes the bad data, marked as bad, to the
requesting device and notifies the service processor operating
system, which logs the error. The bad data is handled locally
by the requesting device, usually by generating an error of
its own.
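
The single-error-correct, double-error-detect behavior described above can be demonstrated with a textbook extended Hamming code on a 4-bit value. This is only an illustration of the general technique: the text does not say which code, word width, or check-bit arrangement the VAX 9000 memory actually uses, and the function names are hypothetical.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword.

    Index 0 holds the overall parity bit; indexes 1..7 are the
    classic Hamming(7,4) positions (check bits at 1, 2, and 4).
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2          # make total parity even
    return code


def decode(code):
    """Return (status, nibble); status is ok, corrected, or uncorrectable."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos              # XOR of positions of set bits
    odd_parity = sum(code) % 2 == 1
    if syndrome == 0 and not odd_parity:
        status = "ok"
    elif odd_parity:
        code[syndrome] ^= 1              # single-bit error: flip it back
        status = "corrected"
    else:
        status = "uncorrectable"         # even parity, nonzero syndrome
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


word = encode(0b1011)
one_flip = list(word)
one_flip[5] ^= 1
two_flips = list(word)
two_flips[2] ^= 1
two_flips[6] ^= 1
```

In this sketch the "corrected" case corresponds to the single-bit path in the text (data delivered, event logged), and the "uncorrectable" case to the double-bit path, where the data can only be flagged as bad.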
CPU and System Control Unit Errors When a CPU detects an
error in a parity checker, it attempts to come to an
instruction boundary and halt. Once it has halted, the CPU
sweeps its cache. When the cache sweep is completed, the CPU
asserts an attention to the scan control module to inform the
SPU that recovery is required. When the system control unit
detects an error, it first asserts a fatal error signal to
each of the CPUs, and then asserts an attention. When the CPUs
receive the fatal error signal, they attempt to come to an
instruction boundary and halt. Once halted, the CPUs assert
attention lines to the scan control module. The caches are not
swept since their path to memory, the system control unit, is
not working.
Keep-alive, Timeout To ensure that a CPU is not hung by an
undetected error, the SPU periodically sends a keep-alive
interrupt to each CPU. CPU microcode services the interrupt at
the next macroinstruction boundary by asserting an attention
to the scan control module. If the CPU should be hung by an
undetected error, the SPU times out while it waits for the
keep-alive reply attention and, thus, determines that there
has been an error. Similarly, the primary CPU monitors the SPU
by sending it a keep-alive request through the TXFCT register.
If the SPU does not respond to this request within a timeout
period, the VAX 9000 operating system assumes that the SPU is
hung and reboots it using a VAXBI reset. When the SPU reboots,
it reintegrates itself with the rest of the VAX 9000 system
without interfering with system operation.
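
The SPU side of the keep-alive exchange amounts to a watchdog with one outstanding request per CPU. The sketch below models only that bookkeeping; the class name, the explicit `now` timestamps, and the timeout value are assumptions, since the text gives no intervals.

```python
class KeepAliveMonitor:
    """Tracks keep-alive interrupts that have not yet been answered."""

    def __init__(self, timeout):
        self.timeout = timeout           # hypothetical value, in seconds
        self.pending = {}                # cpu id -> time interrupt was sent

    def send_keep_alive(self, cpu, now):
        """Record that a keep-alive interrupt went to this CPU."""
        self.pending[cpu] = now

    def attention_received(self, cpu):
        """The CPU replied with an attention; it is not hung."""
        self.pending.pop(cpu, None)

    def hung_cpus(self, now):
        """CPUs whose reply attention is overdue."""
        return sorted(
            cpu for cpu, sent in self.pending.items()
            if now - sent > self.timeout
        )


monitor = KeepAliveMonitor(timeout=1.0)
monitor.send_keep_alive(cpu=0, now=0.0)
monitor.send_keep_alive(cpu=1, now=0.0)
monitor.attention_received(cpu=0)        # CPU 0 answers; CPU 1 does not
```

The check in the opposite direction (primary CPU watching the SPU through TXFCT) follows the same pattern with the roles reversed.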
Error Reporting

When errors are reported to the SPU operating system, the
error formatting facility logs the error information locally
and reliably transmits it to all intended receivers. The error
formatter maintains the error log file ERRLOG.SYS on the SPU
RD54 drive, passes error log entries to the VAX 9000 operating
system to be logged in the system error log, and also passes
the entries to any SPU software that requests them. The error
formatter writes the error log file using the SPU operating
system disk I/O functions, passes the error log entries to the
VAX 9000 operating system using an RXFCT function, and passes
the error log entries to other SPU processes using the SPU
port protocol. If the RD54 drive is not available, which
prevents access to the SPU error log, the error formatter
continues to send error log entries to the VAX 9000 operating
system and to other SPU processes.
The SPU error log contains all the error log entries
collected by the SPU (but not those collected by the VAX 9000
operating system) and time stamps, which are logged every ten
minutes. Should an SPU operating system crash occur, the time
stamps may be used to determine the approximate time of the
crash. Errors are logged regardless of the state of the system
processor. As a result, information is available for analysis
even in the event of a total processor failure. The error log
file may also be transferred to TK50 tape for off-site
analysis.
The error formatter passes error information to the VAX 9000
operating system by copying the error log entry to system
memory and then invoking the RXFCT function to notify the
VAX 9000 operating system that the entry is available. Should
the operating system not respond to this notification, the
error formatter assumes that the operating system has crashed
and writes the error log entry to a temporary data file. When
the VAX 9000 operating system reboots, it notifies the SPU by
using a TXFCT function. The error formatter then reads any
saved error log entries from the data file and transmits them
to the VAX 9000 operating system. This protocol ensures that
all collected error data is eventually reported in the system
error log.
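
The protocol just described is a store-and-forward scheme, which can be sketched in a few lines. The names are hypothetical, a Python list stands in for the temporary data file, and the `notify_host` callable stands in for the copy to system memory plus the RXFCT notification.

```python
class ErrorFormatter:
    """Delivers error log entries to the host, saving any it cannot deliver."""

    def __init__(self, notify_host):
        # notify_host(entry) returns True if the host acknowledged it.
        self.notify_host = notify_host
        self.saved = []                  # stands in for the temporary file

    def report(self, entry):
        """Try to deliver one entry; keep it if the host does not respond."""
        if not self.notify_host(entry):
            self.saved.append(entry)

    def host_rebooted(self):
        """Host announced its reboot: replay saved entries in order."""
        pending, self.saved = self.saved, []
        for entry in pending:
            self.report(entry)


host = {"up": False, "log": []}


def notify_host(entry):
    """Toy host: acknowledges and logs entries only while it is up."""
    if host["up"]:
        host["log"].append(entry)
    return host["up"]


formatter = ErrorFormatter(notify_host)
formatter.report("entry 1")              # host is down, so these are saved
formatter.report("entry 2")
saved_while_down = list(formatter.saved)
host["up"] = True
formatter.host_rebooted()                # saved entries are now delivered
```

Replaying through `report` rather than sending directly means an entry is saved again if the host fails a second time during the replay.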
The error formatter also maintains an SPU port to which any
process running on the SPU may connect. Connected processes
receive copies of all error log entries as the entries are
logged. This port is used by EWKCA, the symptom-directed
diagnosis tool, which analyzes errors as they occur and
determines which system components might have caused the
failure. The port is also used for system debugging by the
error insertion program to verify that errors are being logged
and analyzed correctly.
Snapshots In addition to its error logging facilities, the
SPU operating system provides the ability to take "snapshots"
of the system processor state. The snapshot file provides a
detailed record of system context, which allows engineers to
take a snapshot of a hung system and reboot it, and then
analyze the snapshot file while the system proceeds to perform
other useful work. The snapshot display utility is used to
examine the data in a snapshot file. In addition to formatting
the data in the snapshot file, the snapshot display utility
can be used to examine any scan latch in the file, by name, in
the same fashion as the console EXAMINE command is used on the
actual hardware. The data available in a snapshot file is
summarized in Table 2.
Table 2
Snapshot File Contents

Revision Section
All multichip unit revisions
All SPU adapter revisions
Microcode revisions
All XMI adapter revisions
All VAXBI adapter revisions

Power Section
All power control system registers
"Sense power" results

Clock Section
All master clock module registers

SPU Section
All SPU-to-system control unit adapter registers

I/O Section
XMI device error registers
VAXBI device error registers
XMI-to-system control unit error registers

System Control Unit Section
All scan latches
Last 50 entries from system control unit microprogram
counter history buffer
All cache tags
All other logical structures (e.g., control stores)
Configuration database version
I/O physical address memory map
Memory physical address memory map
Nonexistent physical address memory map

CPU Section (Repeated Once for Each CPU)
All scan latches
Last 50 entries from program counter history buffer
All cache tags
All general-purpose registers
All internal processor registers
All other logical structures (e.g., control stores)
Top 50 longwords of current mode stack
Top 50 longwords of interrupt stack
32 bytes of instruction stream around each program counter
in history buffer
Configuration database version
50 microprogram counters, collected by stepping the clocks

Error Recovery
The high level of visibility achieved by the scan system
allows the SPU to provide extensive error recovery facilities
for the VAX 9000 processor. SPU-based recovery offers several
advantages over traditional microcode-based error handling.
The CPU hardware resources that might otherwise be used for
error handling were available for the logic designers to
improve the system performance. Because the error data is
processed external to the failing component, the recovery
process itself is not suspect. Finally, because the system
clocks are stopped while recovery takes place, erroneous data
does not propagate throughout the system.
Traditionally, many microwords in the CPU control store
(approximately 500 in the VAX 8600 system) are used for error
recovery microcode. However, because the SPU is responsible
for VAX 9000 error recovery, additional control store space is
available for instruction microcode. If this had not been the
case, we might have had to make a space trade-off between
instruction and recovery microcode, which could have resulted
in more emulated instructions and a performance penalty for
VAX instruction execution speed.
Because the scan system allows the SPU to determine the state
of every scan latch in the CPUs and system control unit, logic
designers were able to place error detectors anywhere in the
design without organizing the detectors into
microcode-readable error registers. As a result, significantly
more error detectors were used for precise error analysis than
would have been possible if the scan system were not
available. Each VAX 9000 CPU contains over 450 error detector
latches.
Several advantages are derived from performing error recovery
independently from a failed component. The most obvious
advantage is that hardware, which may be failing, is not used
to control the recovery. Once the system processor state has
been scanned out into SPU memory, analysis is a function of
software running on a known good processor. The SPU analyzes
the data and then scans a correct state into the system
processor. The entire process is performed while the system
clocks are stopped. Therefore, processor errors cannot cause
"error loops," in which the error recovery process itself gets
errors from a corrupt processor state. SPU-based error
recovery can completely reset a corrupt system, regardless of
the degree of corruption.
The VAX 9000 error-handling facility takes advantage of many
advanced software features that are available in the SPU
operating system. It uses configuration database information
to access system processor signals by name rather than by scan
ring locations. Thus, one version of the error handling code
can handle several different physical processor variations.
The error handler also uses the SPU operating system structure
access routines to read and write the processor structures,
again, by burying the physical implementation in the
configuration database. As a result, the error handler can
look at the architectural features of the VAX processor rather
than at the gate-level design of the VAX 9000 system when
performing error analysis. The benefit of this approach is
that recovery procedures are based on the system architecture,
rather than on the machine implementation.
One of our design goals for the VAX 9000 error handling
system was to recover from most errors in under
500 milliseconds. Longer delays increase the probability that
I/O devices will time out while waiting for the operating
system to respond to requests and cause the operating system
to crash, even if the error-handling system successfully
recovers from the error. The error handler meets this goal by
taking maximum advantage of the multiprocessing capabilities
of the tightly coupled hardware design of the service
processor module and scan control module. Error recovery is
split into a multistep process that keeps both SPU processors
working on the problem simultaneously.
The error handler recovers a failed system in five phases:
data collection, data analysis, error recovery, macrostep, and
cleanup. In the data collection phase, the scan control module
scans out all scan rings of the failed CPU or system control
unit. In the analysis phase, the scanned data is used to
determine which architectural features of the system have been
corrupted (e.g., caches, general-purpose registers, internal
processor registers, microcode stores, and the translation
buffer).
In the recovery phase, the error handler attempts to restore
the system to a state in which no software-visible data is
corrupt. Therefore, the software running on the VAX 9000
system, including the operating system, is unaware that an
error has occurred. The error handler determines whether the
system state can be restored successfully or if a machine
check must be generated to allow the VAX 9000 operating system
to attempt to handle the error on a higher level. It then
restores the CPU to a known good operating state, by using
latch data from the configuration database, and corrects any
corrupted software-visible data.
In the macrostep phase, the error handler turns on the system
clocks to allow the failed CPU to attempt to macrostep one
instruction. If the macrostep completes successfully, the
recovery is considered successful and system operation is
allowed to continue. In the clean-up phase, the SPU processes
the data from the data collection phase into an error log
entry, posts the entry, and cleans up the data structures that
will be used to recover from the next error.
Errors that are too severe for the error handler to handle
are signaled to the SPU command interpreter, which can run
command scripts to completely reinitialize the machine and
reboot the VAX 9000 operating system. Examples of such severe
errors are hard errors that prevent VAX 9000 operating system
machine check code from running and errors that cause a CPU to
fail its macrostep.
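
The five phases and the escalation path can be summarized as a simple sequential driver. This is a control-flow sketch only: the handler callables are placeholders, the real phases overlap work across the two SPU processors rather than running strictly one after another, and stopping at the first failed phase (skipping cleanup) is an assumption of the sketch, not something the text specifies.

```python
PHASES = [
    "data collection",
    "data analysis",
    "error recovery",
    "macrostep",
    "cleanup",
]


def run_recovery(handlers):
    """Run each phase in order; escalate if a phase reports failure.

    handlers maps a phase name to a callable returning True on success.
    Returns (phases completed, outcome).
    """
    completed = []
    for phase in PHASES:
        if not handlers[phase]():
            # Too severe for the error handler: hand over to the
            # command interpreter to reinitialize and reboot.
            return completed, "escalate: reinitialize and reboot"
        completed.append(phase)
    return completed, "recovered"


all_ok = {phase: (lambda: True) for phase in PHASES}
good_run = run_recovery(all_ok)

failed_macrostep = dict(all_ok)
failed_macrostep["macrostep"] = lambda: False
bad_run = run_recovery(failed_macrostep)
```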

Summary
The SPU is a dedicated subsystem for service and maintenance
support for the VAX 9000 family. It is closely linked to the
VAX 9000 processor to provide system error recovery. It also
presents a high-level interface with which debuggers may
observe and control system processor activity. Through the use
of a system-wide scan architecture, the SPU provides access to
nearly 100 percent of processor machine-state. Finally, the
use of the SPU in various tester environments greatly assisted
the multichip unit debugging effort and provided advanced
training for VAX 9000 system debuggers.

Acknowledgments
The authors wish to thank Michael Evans, the SPU project
leader, whose drive and ambition provided the force behind the
project's success. We also wish to acknowledge the other
members of the SPU design team: Karen Barnard, Stephen Conway,
David D'Antonio, Susan DesMarais, and Brian Rost.

Reference
1. D. Chin et al., "The Unique Features of the VAX 9000 Power
   System Design," Digital Technical Journal, vol. 2, no. 4
   (Fall 1990, this issue): 102-117.

Derrick J. Chin
Barry G. Brown
Charles F. Butala
Luke L. Chang
Steven J. Chenetz
Gerald E. Cotter
Brian T. Lynch
Thiagarajan Natarajan
Leonard J. Salafia

The Unique Features of the VAX 9000 Power System Design

The VAX 9000 series represents Digital's first implementation
of a mainframe computer system. To be competitive in this
market, the power system for the VAX 9000 series had to
provide high system availability. To meet this goal, the
system includes features neither considered nor found in
previous large Digital computer systems. Some of these
features are the use of redundancy in parts of the design and
the addition of more power system diagnosis capability for
quicker fault isolation and faulty unit replacement. Other
features provide competitive advantages in specific
marketplaces, such as meeting low harmonic distortion for AC
input current, which is an emerging European AC power quality
standard. Simulation tools, which are used more prevalently in
digital logic, were used to improve the power design.

The two key requirements of the VAX 9000 power system are
high availability and the inclusion of competitive features.
High availability for the power system means we had to achieve
the highest unit regulator reliability possible by using the
appropriate technology available. Further, we had to deliver
both more power system and cabinet environmental monitoring
and diagnostic capability that could reduce the time spent in
isolating and replacing a malfunctioning unit. Competitive
features mean designing into the system features that would be
either better than expected or advantageous to the VAX 9000
system in certain markets.
A full discussion of all the methods used to meet these
requirements is too long for this paper. Therefore, the
discussion in this paper focuses on some of the unique
applications of the power technology and tools used in the
design of the VAX 9000 system:
• Power system architecture
• Improved load sharing
• Simulation
• Increased control and monitoring
• Low harmonic distortion

One of the issues we had to decide in designing the power
system architecture was how many regulators should be used. A
large number of regulators in a power system can cause the
mean time between failures (MTBF) to be lower than desired.
Therefore, we chose to use redundant regulators in the power
system architecture for improved availability.
Another means of increasing the MTBF was achieved by
improving the load sharing among the parallel regulators that
power a low-voltage current load. With this feature, no one
regulator operates at a percentage of maximum rating much
higher than its parallel regulators, which eliminates the
higher operating temperatures that would otherwise occur and
lower the MTBF.
High regulator reliability results from good cir
cuit design. Three examples of the unique simula
tion features that were used as checks on circuit
designs are discussed in the Simulation section of
this paper. In one case, simulation pointed the way
to a circuit problem that was not initially apparent.
In another case, simulation was used to verify on
paper that the n umber of regulators chosen to
power a specific load was sufficient .
High availability can be achieved by reducing the time to isolate a system problem and replace the malfunctioning unit. A power and cabinet monitoring module, EMM, fulfilled this purpose in the VAX 8000 systems. The power control subsystem, PCS, used for this purpose in the VAX 9000 systems, expands on the diagnostic and monitoring features of the EMM.
Meeting emerging European AC power quality standards was viewed by the European sales force as a distinct competitive advantage for the VAX 9000 system. A proposed standard we wanted to meet called for low harmonic distortion of the input AC current waveform, which was achieved in the utility power conditioner (UPC) front-end design of the power system. High availability was designed into the UPC through such features as redundancy and increased immunity to power line disturbances, from the commonly accepted industry practice of one AC cycle to ten AC cycles.
VAX 9000 Power System Architecture
The discussion of the power system architecture will focus on some of the architecture's major features: power zoning, N + 1 redundancy, and decoupling.
• Power zoning enables parts of the system to be powered off for maintenance while the rest of the system remains operational.
• N + 1 redundancy provides higher perceived system availability to counteract the impact of low system mean time between failures, which is a result of the large number of regulators.
• Decoupling major sections of the power system allows future upgrades to be made without requiring significant changes to the rest of the system.

The basic power system architecture for the VAX 9000 Model 200 and Model 400 series is shown in Figures 1 and 2, respectively. Power processing in each model occurs in two distinct stages. First, an AC front end processes and converts AC utility input power to high-voltage DC, which is then bused about the power system. Second, DC-to-DC switching regulators convert the high-voltage DC to low-voltage outputs, which are then distributed through high-current-carrying busbars to the various logic loads. An intelligent power control subsystem (PCS) provides control, sequencing, monitoring, and diagnostic capabilities. Dedicated bias regulators, which are powered from the high-voltage DC, provide housekeeping control (i.e., low power) and start-up power to each bank of output regulators.
The high-voltage DC bus permits low-voltage output regulators to be added or removed for different system configurations. The high-voltage DC bus also can be backed up with a battery unit that produces high-voltage DC from 48-volt batteries through a step-up switching regulator. This approach allows any specific low-voltage output to be produced, as needed, during the battery backup period without using specific battery-to-logic voltage output DC-to-DC regulators. The battery required to back up the entire computer system would be larger than the computer itself. Therefore, diodes are inserted into the high-voltage DC distribution to partition the high-voltage DC bus, and only sections, such as the memory refresh operation and PCS control, are backed up.

Figure 1  VAX 9000 Model 200 Series Power System

Figure 2  VAX 9000 Model 400 Series Power System

Power Zoning
The power-zoning feature meets the maintainability and high availability goals in the VAX 9000 Model 400 series of triple and quadruple processors. In the power system's configuration, a pair of dual processors can be powered off for maintenance, while the remaining powered-on processors maintain system operation.
A quadruple processor configuration is not composed of two identical dual processors. Some functions of a quadruple processor are not replicated. The system control unit, the memory, the service processor unit, and the PCS are common to both dual processors. Therefore, these functions are powered up by either front end. The high-voltage DC power bus is diode OR'd from either AC power source, through the dual diode, CR1, and then fed to the output stages that power the common elements listed above.
The diode-OR process in the VAX 9000 system does not provide for active load sharing. Active load sharing between the AC front ends increases the overall power system reliability because it ensures that each AC front end supplies half the load. Otherwise, one AC front end could take most of the load (and be stressed more), which would leave the other unit too lightly loaded. However, active load sharing is complicated by the physical distances between the AC front ends and the complex handling of faults and partial faults in each AC front end. The load of the common elements in the VAX 9000 system is only 20 percent of the total system. Therefore, the worst load imbalance does not justify the added complexity.
The diode does not have a significant impact on overall power load reliability because conservative derating of the diode results in a lower diode operating temperature and hence higher reliability.
We were concerned that power zoning could have an impact on the rest of the system as a result of powering down part of the system. However, analysis of the results showed that such a concern was unfounded. The high-voltage DC bus has relatively long time constants (i.e., slow to react to changes). Therefore, turn-on and turn-off transients on the bus are smooth and gradual and do not generate quick-changing electromagnetic fields that could affect the operation of the sections of the system that are still functioning.

N + 1 Redundancy
Each processor in the VAX 9000 power system uses approximately 400 amperes from each of the two supply voltages. The ratings of the power semiconductors used in the outputs of the DC-to-DC regulators deliver an optimal regulator rating of approximately 240 amperes. Based on these ratings, powering a CPU in the VAX 9000 system would require two regulators for each voltage. However, in a large system, such as the VAX 9000 system, the number of regulators can quickly add up, which would result in an equally quick drop in overall system reliability. Powering two CPUs from the same voltage bus reduces the number of regulators.
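The regulator count implied by these ratings can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming the approximate 400-ampere load and 240-ampere rating quoted above; the function name and the comparison of separate and shared buses with one redundant unit each are illustrative and not taken from the design documentation.

```python
import math

def regulators_needed(load_amps, rating_amps, redundant=0):
    """Regulators required to carry the load, plus optional redundant units."""
    return math.ceil(load_amps / rating_amps) + redundant

CPU_LOAD_A = 400.0          # approximate current per CPU per supply voltage
REGULATOR_RATING_A = 240.0  # approximate optimal regulator rating

# One CPU on its own bus.
per_cpu = regulators_needed(CPU_LOAD_A, REGULATOR_RATING_A)

# Two CPUs sharing one bus, without redundancy.
shared = regulators_needed(2 * CPU_LOAD_A, REGULATOR_RATING_A)

# With one redundant regulator per bus (illustrative comparison).
separate_n_plus_1 = 2 * regulators_needed(CPU_LOAD_A, REGULATOR_RATING_A, redundant=1)
shared_n_plus_1 = regulators_needed(2 * CPU_LOAD_A, REGULATOR_RATING_A, redundant=1)

print(per_cpu, shared, separate_n_plus_1, shared_n_plus_1)  # 2 4 6 5
```

Under these assumptions, a shared bus with one redundant unit needs five regulators, whereas two separate buses with one redundant unit each would need six.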


Redundancy is then used to minimize the impact of the large number of regulators on the bus. By using redundancy, additional regulators on a voltage bus increase the perceived time between complete failures.
For example, consider a voltage bus that requires two regulators to supply the load current. A failure in either regulator causes a complete failure. If another parallel regulator is added to supply the load current, the probability of a complete failure significantly decreases. In this case, if one regulator fails, the other two could supply the load. The statistical probability that another failure would occur before the failed regulator is replaced is very small.
A system of N regulators at an individual failure rate of lambda (λ) would have a system failure rate of N times λ, or an MTBF of 1 divided by N times λ.¹ The actual calculations are

$$\lambda_{\text{total}} = N \times \lambda$$

or

$$\text{MTBF} = \frac{1}{\lambda_{\text{total}}} = \frac{1}{N \times \lambda}$$

The failure rate calculation for a system that contains one regulator more than required (N + 1) is

$$\lambda_{\text{total observed}} = \frac{(N + 1) \times N \times \lambda \times \lambda}{[(N + 1) \times \lambda] + (N \times \lambda) + u}$$

$$\text{MTBF}_{\text{observed}} = \frac{[(N + 1) \times \lambda] + (N \times \lambda) + u}{(N + 1) \times N \times \lambda \times \lambda}$$

It should be noted for the above equation that u equals 1 divided by the time between fault and repair (service interval).
Using this calculation, if a bus required four regulators and each regulator had an MTBF of 400,000 hours, the observed MTBF would be 100,000 hours. The observed MTBF with five regulators (i.e., N + 1) would be 23,989,000 hours, which is 239 times longer than the four-regulator case. The maximum time between the fault occurrence and repair would be two weeks, or 336 hours. The observed MTBF is so large, compared to other elements in the system, that the redundant regulators have an extremely small effect on the overall reliability.
The number of redundant regulators per output voltage bus is limited to one in the VAX 9000 power system for space, weight, and cost reasons. N is the number of regulators required to supply the maximum current of a bus, and the addition of one more regulator is called N + 1 redundancy.
N + 1 redundancy relies on the good regulators on the output bus to pick up the load from the failed unit. This reliance has a significant impact on the design of the regulator, the regulator response time, and how the regulator handles the faults that can cause a failure. Fast regulator response (the time it takes to respond to a change in input or output) is needed to ensure that the output voltage does not dip too much when each regulator picks up its share of the load from the failed regulator. However, the faster response time makes it more difficult to keep the control functions of the unit stable. Moreover, the regulator input voltage range is designed to be relatively wide to tolerate wide swings in the high-voltage DC input.
When one regulator in a bank of regulators operated in parallel fails, the output bus voltage dips until the other regulators, which are connected in parallel, can react and pick up the load currents. The magnitude of the dip depends on the time the input fuses in each regulator take to open and on the values of the input capacitors and the distribution impedances.
Fast-opening fuses allow smaller voltage dips but are more prone to false nuisance openings. Slow-opening fuses do not open for normal or nuisance surges, but allow a greater voltage dip. Large values of input capacitance provide the energy to open the fuses quickly, but the voltage recharging of the capacitors is longer. A high distribution impedance decouples the faults from other units but has a high power loss.
Simulation and testing showed that the wide input range design of the regulators is sufficient to tolerate the high-voltage input dips caused by other faults. The regulator control and response time keep the low-voltage DC outputs within specification when the input voltage is within its range.
Other faults within the regulator can cause it to fail, but the load is picked up by the other regulators, operating in parallel, on the bus. Clearly, faults such as a permanent short on the output bus cannot be survived. Because the low-voltage output regulators operate in parallel and in an N + 1 redundancy mode, the output voltage is not affected by most common single-fault conditions in the power system hardware.

Decoupling
A key feature of the power system's architecture is that each major subsystem is relatively decoupled from the other subsystems. Decoupling permits each subsystem to be designed for its own requirements and to be changed or upgraded as the requirements change (e.g., more cost effective, improved technology, or different output voltage), provided the interface and critical function remain the same. For example, two significantly different cost and performance options, H7392 or H7390, for the AC front end can be used in different configurations, and the rest of the power system does not need to be changed. Thus, power platforms can be flexibly tailored to meet the needs of different computer systems.
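Returning to the N + 1 redundancy figures quoted in the preceding section, the MTBF arithmetic can be reproduced with a short script. This is a minimal sketch, assuming the failure rate expressions given in the text, a per-regulator MTBF of 400,000 hours, four required regulators, and a 336-hour service interval; the function names are illustrative.

```python
def mtbf_without_redundancy(n_required, unit_mtbf_hours):
    """MTBF of a bus on which any single regulator failure is a bus failure."""
    lam = 1.0 / unit_mtbf_hours
    return 1.0 / (n_required * lam)

def mtbf_with_one_redundant(n_required, unit_mtbf_hours, repair_hours):
    """Observed MTBF of a bus with one regulator more than required (N + 1)."""
    lam = 1.0 / unit_mtbf_hours   # individual failure rate
    u = 1.0 / repair_hours        # 1 / (time between fault and repair)
    n = n_required
    return ((n + 1) * lam + n * lam + u) / ((n + 1) * n * lam * lam)

four_regulators = mtbf_without_redundancy(4, 400_000)
five_regulators = mtbf_with_one_redundant(4, 400_000, 336)

print(round(four_regulators))                    # 100000
print(round(five_regulators))                    # 23989524
print(int(five_regulators / four_regulators))    # 239
```

The result of roughly 23.99 million hours agrees with the 23,989,000 hours quoted for the five-regulator case, and the ratio of 239 matches the stated improvement over the four-regulator case.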

Achieving Low Harmonic Distortion
The AC front end of the VAX 9000 power system processes and converts public utility AC power to high-voltage DC. Our goal was to design the AC front end to be highly reliable, have a high availability, and meet the emerging European AC power quality standards. One of those standards is to have low harmonic distortion of the input AC current waveform. These features were essential to support the VAX 9000 system's entry into the mainframe computer market. We also decided to meet the low harmonic distortion standard for the AC front end because the European marketing and sales force viewed compliance with this standard as a distinct competitive advantage.

Design Factors
The dominating design factor for the AC front end was the size of the input power level, which was approximately 20,000 watts. This size significantly exceeded the power levels of previous AC circuit designs for a single unit. The high power consumption was a result of the use of 250,000 emitter-coupled logic (ECL) gates in the CPU and 512 megabytes (MB) of memory.
High Reliability and Availability
To achieve high reliability, we used conservative power derating levels and good thermal management for key devices. Typically, devices are operated at 80 percent of their voltage rating. The main switches and rectifiers in the power stages are operated at 40 percent of rating. Current derating is also conservatively placed at 40 percent. Stress is lessened because of lower device junction temperatures, which results in a longer operational life, which equates to higher reliability.
We took two approaches to attain high availability. First, redundant circuitry was used for the AC-to-DC circuit function. Second, we increased immunity to line outage from the standard practice of one cycle of outage protection to ten cycles. The increase from one cycle to ten cycles of outage immunity provides the VAX 9000 system with a 300 percent improvement in mean time between observed system power outages over standard Digital systems. This feature improves system availability to the customer.
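To give a sense of scale for ten cycles of outage immunity, the ride-through time and the energy the front end must deliver during the outage can be estimated. This is a rough sketch, assuming a constant 20,000-watt load throughout the outage and ignoring conversion losses; the function name and the choice of 50-hertz and 60-hertz line frequencies are illustrative.

```python
def ride_through(cycles, line_hz, load_watts):
    """Return (seconds of outage covered, joules delivered to the load)."""
    seconds = cycles / line_hz
    joules = load_watts * seconds
    return seconds, joules

LOAD_W = 20_000  # approximate input power level quoted for the AC front end

for hz in (50, 60):
    one_cycle = ride_through(1, hz, LOAD_W)
    ten_cycles = ride_through(10, hz, LOAD_W)
    print(hz, one_cycle, ten_cycles)
```

At 50 hertz, ten cycles correspond to 0.2 seconds and about 4,000 joules under these assumptions, ten times the one-cycle figure.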
Harmonic Distortion
The power system's design had to meet the increasing restrictions on the interface with the public power utility and be able to withstand the occasional availability of only poor power. Utility power is generated as a relatively pure (i.e., low harmonic distortion) sine wave. AC front ends and power supplies must convert this sine wave of voltage to a ripple-free DC voltage for ultimate consumption by the logic chips within the computer system. Standard methods used for this conversion create a nonlinear load on the sine wave of voltage. This nonlinear load distorts the utility's sine wave of voltage for other users, because of the distribution system impedance, and usually appears as interference for other users. In Europe, the occurrence of this type of interference is planned to be limited by restricting how much nonlinear load current an AC front end can have. Therefore, we had to design unique circuitry that could convert AC power to DC power at 20,000 watts without high levels of current distortion to meet this European requirement.
A design based on commercially available control technology could not meet the stringent technical requirements of high overall conversion efficiency and stability of operation because conventional AC-to-DC circuitry produces up to 30 percent distortion. Our goal was to comply with emerging European requirements of harmonic current distortion levels in the 5 percent range. However, at the time we were designing the system, no circuitry at this power level existed in the power conversion industry. Therefore, we had to develop a unique pulse-width modulator (PWM) circuit and control equations for the input power conversion stage, which is shown in Figure 3.
The pulse-width modulator combines the advantages of low switching frequency, which reduces switching losses in the converter, with exceptionally short response time to all input line voltage disturbances and to rapid changes in the required computer power. The final design produces less than 5 percent total harmonic distortion of the input line current when the UPC is operated at 20,000 watts load. The unique PWM design also increased the immunity to line voltage outages from one cycle of outage protection to ten cycles. Furthermore, the increase was achieved without a corresponding tenfold increase in storage capacitors.
Figure 3  UPC Block Diagram (blocks: AC input, AC filter, rectifier, output switch, fast discharge, auxiliary AC power and power line monitor, RIC interface)

Flexible Line Cord
The high power level and the requirements for a flexible line cord and plug required that the Underwriters Laboratories (UL) and Canadian Standards Association (CSA) agencies expand the regulations that governed the size of power cordage allowed in a computer room. A flexible line cord connected to the AC service is a requirement by Digital for all its products. This feature is deemed valuable because it facilitates both the initial installation of the computer and its possible relocation at the customer's site. Although delays can occur while waiting for a national agency to amend one of its national regulatory codes, the approvals were received in time to maintain the project's schedule.

Improving Load Sharing
Detailed stress analyses show that when regulators are operated in parallel, maximum reliability is achieved when the load current is shared equally among them.
Traditional Approach
A traditional approach to running regulators in parallel may be seen in VAX 8000 series machines. In these processors, regulators that are designed for standalone operation are placed in a parallel configuration. Current sharing is forced by modifying each supply's individual reference voltage through external monitoring and control. In the case of VAX 8000 machines, a maximum of four units may be coupled in this way. Figure 4 shows that this method essentially uses equipment that was designed to function as standalone regulated voltage sources. By adding external control loops, the equipment is forced to provide identical output voltages, as measured at some defined point in the system. If precise voltage matching is not achieved, whichever supply has the higher voltage takes the load, up to its overcurrent sense point. Thus, equal load sharing cannot happen. Individual external controllers are required for each converter, which makes the system more complex. The VAX 9000 system requires up to five converters per bus, and we could not achieve better than 20 percent power sharing between modules by using this method. No traditional methods could support the number of converters in the VAX 9000 system. Also, most methods had a master-slave relationship that precluded maximizing a regulator's reliability potential.
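The sensitivity of paralleled voltage sources to small setpoint mismatches can be illustrated with a simple circuit model. This is a sketch under stated assumptions: each regulator is modeled as an ideal voltage source behind an equal series resistance, the load is a constant current sink, and overcurrent limiting is ignored. The setpoints, the 0.5-milliohm resistance, and the 400-ampere load are hypothetical values chosen for illustration.

```python
def voltage_source_sharing(setpoints_v, output_res_ohm, load_amps):
    """Per-regulator currents for paralleled voltage sources feeding one bus.

    Solves sum((V_k - V_bus) / R) = I_load for V_bus, then returns each
    regulator's share of the load current.
    """
    n = len(setpoints_v)
    v_bus = (sum(setpoints_v) - load_amps * output_res_ohm) / n
    return [(v - v_bus) / output_res_ohm for v in setpoints_v]

# Two regulators whose setpoints differ by only 10 millivolts.
currents = voltage_source_sharing([5.000, 5.010], 0.0005, 400.0)
print([round(i, 1) for i in currents])  # [190.0, 210.0]
```

In this model a 10-millivolt mismatch shifts 20 amperes between the two units, and the imbalance grows as the output resistance shrinks, which is consistent with the difficulty of forcing equal sharing described above.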
New Approach
As a result of the limitations of the traditional methods, we developed a new, less complex approach to current sharing between parallel converters. Although developed specifically for the VAX 9000 program, the features and utility of this approach have universal application. The essential technological shift from prior practice is that in this system the regulators are current sources rather than voltage sources.

Figure 4  Load Sharing by Voltage Control of Voltage Sources

We designed the current sources to have a compliance range that covers a band of voltages that are normally found in logic circuits. By making the VAX 9000 regulator outputs fully floating, the system requirements for +5-volt, -3.4-volt, and -5.2-volt buses are met with only one regulator design, rather than a separate design for each voltage. The VAX 9000 design is simpler and has a lower manufacturing cost. The regulator is voltage and polarity "blind" over its compliance range, and any number of regulators may operate in parallel to provide any amount of power required at any voltage within the compliance range. Also, this method automatically compensates for the effects of stray resistances and different path lengths from individual regulators on the bus.
The basic features of this new approach are shown in Figure 5. Individual regulators behave as externally programmed current sources controlled by a common control signal, such that each regulator delivers the same current. When the outputs are connected to a common load, the current in that load is the sum of the individual regulator output currents. The resulting voltage that appears across the load is the product of that current and the equivalent resistance of the load. Furthermore, if that voltage is compared with a reference voltage in a conventional error amplifier and the resulting error signal is used to derive the regulators' external programming source, then a voltage control loop exists around the regulator system. Thus, although each regulator acts as a current source, the system acts as a controlled and regulated voltage source. Because the voltage control loop only contains one pole, the bandwidth of the control loop can be increased by a factor of at least 15. As a result, the substantially high current change requirements imposed by high-speed memories, such as those used in the VAX 9000 system, can be accommodated.

Figure 5  Load Sharing by Current Control of Current Sources

Principle of Operation
A two-transistor forward regulator is shown in Figure 6. In this regulator, S1 and S2 are switched

into conduction simultaneously, which causes the current to flow in the primary winding of transformer T1 at a level that is directly proportional to the output current Iout plus the slope of the current due to Lout. This current also flows in the primary winding of current sense transformer T2. The resulting current that flows in the T2 secondary winding develops a voltage across the load resistor, RL, which is amplified in A1 and applied to the input of comparator C1. Therefore, at this point, a voltage pulse appears, the amplitude and shape of which are directly proportional to the current flowing in the output choke Lout during the S1-S2 conduction period.
A conventional reference source/error amplifier combination is placed across the output of the supply. The resulting error signal, called Vcontrol, is applied to the other input of comparator C1 as a DC level. The comparator is followed by gating and drive circuits to the power switches.
Switching is initiated by a pulse within the gating circuit that drives the power switches on. The current flows in the output choke, Lout, and a proportional voltage appears at the output of the amplifier A1. As this voltage ramps, it crosses the threshold set by Vcontrol at the C1 input. The comparator output then changes state and causes the drive pulse to the switches to cease.

Figure 6  Two-transistor Forward Regulator

If Vcontrol were a fixed value, the system would be a constant current source. Therefore, the voltage that would appear at its output (i.e., Vout) would be the result of that constant current and whatever load is placed across those terminals; that is, it would be determined by the load value. By using an error amplifier and reference, Vcontrol can be made a variable quantity. Therefore, the regulator transfer function can control its output current to any level necessary to produce the desired voltage. In such a system, a control voltage, which is derived from a single error amplifier and reference, can be used as the control input for several regulators that are running in parallel. Thus, the current from multiple regulators that feed a common bus can be shared.
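The behavior of several current-source regulators driven by one control voltage can be sketched with a small discrete-time simulation. This is an idealized illustration rather than a model of the H7380 circuit: it assumes identical regulators with a fixed amperes-per-volt gain, a purely resistive load, and a simple integrating error amplifier. All numeric values (gain, load resistance, integrator constant) are hypothetical.

```python
def simulate_bus(n_regulators, v_ref=5.0, r_load=0.0125,
                 gain_a_per_v=100.0, k_int=0.002, steps=2000):
    """Iterate a single voltage loop that programs n identical current sources.

    Returns the final bus voltage and the list of regulator currents.
    """
    v_control = 0.0
    v_out, currents = 0.0, [0.0] * n_regulators
    for _ in range(steps):
        # Every regulator sees the same control voltage, so currents are equal.
        currents = [gain_a_per_v * v_control] * n_regulators
        v_out = r_load * sum(currents)            # load voltage from summed current
        v_control += k_int * (v_ref - v_out)      # integrating error amplifier
    return v_out, currents

v5, i5 = simulate_bus(5)   # five regulators in parallel
v4, i4 = simulate_bus(4)   # same bus after one regulator is removed
print(round(v5, 3), round(i5[0], 1))  # 5.0 80.0
print(round(v4, 3), round(i4[0], 1))  # 5.0 100.0
```

In this sketch the bus settles at the reference voltage in both cases, and the remaining four regulators each rise from 80 to 100 amperes when the fifth is removed, with no per-regulator voltage matching required.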

Increased Control and Monitoring
In the VAX 8000 series, power and environmental monitoring and control is provided by the H7188 environmental monitoring module (EMM). In the VAX 9000 system, these functions are provided by the power control subsystem (PCS).
Basic Design of EMM and PCS
The EMM monitors the DC-to-DC regulator control, air flow sensor, and cabinet temperature. It is also the interface between the system console and the power system. Conceptually, the EMM functions as a peripheral device to the console similar to the way an intelligent disk controller is a peripheral to a CPU. The EMM is a single module that plugs into a power back panel.
The PCS is a distributed data acquisition and control system. It also interfaces between the power and environmental systems and other parts of the computer system. The PCS takes commands from, and reports status changes to, the service processor unit.
However, in the PCS, the conceptual model of the EMM is extended to provide additional support in hardware and firmware to off-load the service processor unit and to simplify the software interface to the PCS. The PCS includes many features that enhance testability, fault coverage, fault isolation, and system availability. The relationship of the PCS modules to one another and to other system components is illustrated in Figure 7. There are five PCS modules:
• Power and environmental monitor (PEM)
• CPU regulator intelligence card (CPURIC)
• I/O regulator intelligence card (IORIC)
• Signal interface panel (SIP)
• Operator control panel (OCP)

Figure 8  H7380 Output Switching Stage
The initial model of the H7380 inverter stage used simple component models and did not consider any printed circuit board inductances or transistor capacitances because they seemed negligible compared to other elements. We noted a discrepancy in the voltage across the transistor Q1 (Vds) during the turn-off process between the simulated waveform, shown in Figure 9, and the measured waveform, shown in Figure 10.
Figure 9 shows that the voltage is initially zero while the transistor is conducting but rises to 200 volts when the transistor is turned off. Figure 10 shows that ringing occurs as the voltage approaches 200 volts, with an overshoot to 240 volts. The ringing and overshoot, not shown in Figure 9, are caused by the circuit board inductance, transformer leakage inductance, and the capacitance of the transistor.
Figure 11 shows a more accurate model of the output stage because the L1 through L4 etch inductances and C1 and C2 transistor capacitances are included. The current source, IPULSE, and the resistor, RT, approximate the transformer. Figure 12 shows the result of the simulation model that includes the L and C values shown in Figure 11. When the simulation and the measured data are correlated, the advantage of accurate simulation becomes apparent. By using worst-case values for the circuit parameters, the simulation can determine the maximum peak voltage. The simulation depicted in Figure 12 shows that a device capable of withstanding the expected 240 volts is needed. Reliance on a less accurate model without parasitics could lead to the selection of a device capable of withstanding only 200 volts. Thus, accurate simulation allows the correct components and component ratings to be chosen and ensures a robust design.
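The effect of the parasitic elements on component selection can be put in rough numbers. The following is a back-of-envelope sketch, not the circuit simulation used in the design: it applies the textbook LC resonance formula to a hypothetical loop inductance and transistor capacitance, and it applies an assumed 80 percent voltage derating to the 200-volt and 240-volt peaks quoted above. The inductance, capacitance, and derating values are assumptions for illustration only.

```python
import math

def ringing_frequency_hz(inductance_h, capacitance_f):
    """Resonant frequency of a series LC loop: 1 / (2 * pi * sqrt(L * C))."""
    return 1.0 / (2.0 * math.pi * math.sqrt(inductance_h * capacitance_f))

def minimum_rating_volts(peak_volts, derating=0.8):
    """Smallest device rating that keeps the peak at or below the derated level."""
    return peak_volts / derating

# Hypothetical parasitics: 80 nH of total etch inductance, 100 pF of capacitance.
f_ring = ringing_frequency_hz(80e-9, 100e-12)

rating_without_parasitics = minimum_rating_volts(200.0)  # peak from simple model
rating_with_parasitics = minimum_rating_volts(240.0)     # peak including overshoot

print(f"{f_ring / 1e6:.1f} MHz")
print(round(rating_without_parasitics), round(rating_with_parasitics))  # 250 300
```

Under these assumptions, a part chosen from the simpler model would be rated about 50 volts lower than one chosen from the model that includes the overshoot.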

Figure 9  Vds (Q1) Simulated Turnoff without Parasitics

Figure 10  Vds (Q1) Measured Turnoff

Figure 11  Final Model of H7380 Output Switching Stage

Transient Analysis
A memory system that includes dynamic random access memory (RAM) chips presents a difficult transient load problem to its power supply. The problem arises from a combination of very high changes in dynamic RAM supply current and current change rise times that are typically more than a thousand times faster than the reaction time of a power system. The result is a temporary change in the load supply voltage. To handle these fast current edges, high-frequency capacitors are mounted on memory boards near the dynamic RAMs. Also, low-frequency, electrolytic capacitors, which provide a source of local charge storage, are mounted on the memory boards to handle the magnitude of the change. The capacitors help keep the supply voltage within its operating range until the power supply can react and sufficiently change the current it supplies to the memory to stabilize the supply voltage.
An adequate supply design with specified capaci
tors can keep the supply vol tage within its operat
ing tolerance. Simulation is used to determine the
correct mi..'< of high and low frequency capacitors
and the number of regulators required to support
this high transient load .
Another power supply problem arises from the use of N+1 redundancy for
parallel regulators. When one of the regulators in a parallel regulator
configuration fails, the remaining regulators must be able to take on
the load from the failed regulator and keep the supply voltage within
operating tolerance. Because the remaining regulators cannot react
instantaneously, the load voltage drops until a sufficient increase in
current can be provided by the remaining regulators.
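The size of the step each surviving regulator must absorb can be sketched with simple arithmetic. The Python fragment below assumes ideal, equal current sharing among identical regulators, which is an assumption made here for illustration; the load value is hypothetical.

```python
def per_regulator_current(load_amperes, working_regulators):
    """Current per regulator under assumed equal sharing."""
    if working_regulators < 1:
        raise ValueError("at least one working regulator is required")
    return load_amperes / working_regulators


def current_step_on_failure(load_amperes, installed_regulators):
    """Extra current each survivor must supply after one regulator fails."""
    before = per_regulator_current(load_amperes, installed_regulators)
    after = per_regulator_current(load_amperes, installed_regulators - 1)
    return after - before


load = 300.0  # hypothetical load in amperes
print(current_step_on_failure(load, 3))  # each of two survivors picks up 50 A
```

With three regulators, each survivor's output must rise by half of its previous share, and the load voltage sags for as long as that increase takes.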
For the VAX 9000 series memory system, a proposed dynamic RAM power
supply design consisted of three H7380 DC-to-DC regulators, which would
operate in parallel (including N+1 redundancy) and be connected to the
memory through power distribution busbars. The numbers of high- and
low-frequency capacitors were also proposed. The power supply was
expected to be ready for load testing before the memory or the busbars
would be available. Therefore, we had to verify that this design could
keep the memory supply voltage within operating tolerance. We verified
the design by simulating the performance of the power system and
measuring the performance of the actual power supply with a simulated
load.

[Figure 11: Vds (Q1) Simulated Turnoff with Parasitics]
Power Supply Operating Voltage Tolerance
The memory designers specified the operating tolerance of the dynamic
RAM supply as +5 volts, ±10 percent. Using 10 percent as the supply
tolerance budget, the supply designer made the allocations shown in
Table 2 to all the factors that would cause the load voltage to deviate
from its nominal value of +5 volts. As can be seen from this table, the
sum of x and y must be less than 350 millivolts, or 7 percent of
+5 volts.
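The budget arithmetic can be checked with a short Python sketch. The constants come from the paragraph above; the function names are introduced here for illustration only.

```python
NOMINAL_VOLTS = 5.0
TOLERANCE_PERCENT = 10            # +/-10 percent of +5 volts, from the text
TRANSIENT_PLUS_FAILURE_MV = 350   # stated limit on x + y

# Total deviation budget in millivolts: 10 percent of 5 V.
total_budget_mv = NOMINAL_VOLTS * 1000 * TOLERANCE_PERCENT / 100


def percent_of_nominal(millivolts):
    """Express a deviation in millivolts as a percentage of +5 volts."""
    return millivolts / (NOMINAL_VOLTS * 1000) * 100


def within_budget(x_mv, y_mv, limit_mv=TRANSIENT_PLUS_FAILURE_MV):
    """True if the transient (x) and regulator-failure (y) deviations fit."""
    return x_mv + y_mv < limit_mv


print(total_budget_mv)                                    # 500.0
print(round(percent_of_nominal(TRANSIENT_PLUS_FAILURE_MV), 3))
print(within_budget(80, 100))
```

The last call uses the 80-millivolt and 100-millivolt peaks reported in the laboratory measurements later in this article; their sum is well inside the 350-millivolt allowance.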
Memory Load
The dynamic RAM supply current was calculated to be a steady-state
pulsed current of 256 amperes that would last for 92 nanoseconds (ns),
with rise and fall times of 20 ns, as shown in Figure 13. The initial
pulse magnitude was 1024 amperes.
Table 2  Supply Tolerance Budget Allocation

Causes of Voltage Deviation           Millivolts   Percentage of +5 Volts
Regulator tolerance                   100          2
Back panel distribution
Transient load with two regulators    x
Failure of one regulator              y
Total deviation budget                500          10

[Figure 13: VAX 9000 Model 400 Series Memory Power System Dynamic RAM Load. Timing marks: 288 ns and 12.96 microseconds. Key: A = amperes, NS = nanoseconds]

Memory Power System SPICE Model
In the SPICE model of the supply, busbar, load, and capacitors that is
shown in Figure 14, the three regulators are modeled as a current
source, Gout, controlled by the regulator feedback voltage, Vf. Cout
and Rout represent the regulators' combined output capacitors and
resistors. Most of the other elements in the model are determined from
component specifications. The relationship between Gout and Vf was
determined by laboratory measurements on a regulator and resulted in
the following equations. For two regulators,

    Gout = 339 × Vf = 339 × (V8 − 2.5)

For three regulators,

    Gout = 678 × Vf = 678 × (V8 − 2.5)

The load is represented as two current sources, IA and IR, the
characteristics of which were obtained from the loads shown in
Figure 13.
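The controlled-source relationship is simple enough to express directly. The Python sketch below encodes the two measured gains as given in the equations; treating the result as amperes and the gain as amperes per volt is an assumption made here, and the function name is illustrative.

```python
GAIN_BY_REGULATOR_COUNT = {2: 339.0, 3: 678.0}  # gains from the equations
REFERENCE_VOLTS = 2.5                           # offset subtracted from V8


def gout(v8_volts, regulators):
    """Controlled-source output for a node-8 voltage and regulator count."""
    vf = v8_volts - REFERENCE_VOLTS  # feedback voltage Vf
    return GAIN_BY_REGULATOR_COUNT[regulators] * vf


print(gout(3.0, 3))  # 678 * 0.5
print(gout(2.5, 2))  # zero feedback voltage gives zero output
```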
Figure 14  SPICE Model of VAX 9000 Memory Power System

KEY:
R1     1   2    10K
C1     2   0    0.6N IC=2.5
R2     2   3    10K
R3     3   4    20K
C2     3   4    18P IC=5.0
R4     4   5    1K
VR     6   0    DC 5
R5     5   7    2K
C3     7   8    68N IC=3.0
VG     8   9    DC 2.5
R7     9   0    10MEG
R6     9   10   10K
C4     10  0    0.757N
GOUT   0   20   POLY(1) 10 0 0 678
D1     20  21   DIODE
ROUT   21  0    17K

COUT   21  22   12300U IC=5.0
RESR   22  23   1M
LESL   23  0    2.4N
RBB    21  24   300U
LBB    24  1    150N
CHF    1   26   1.3M
RHF    26  27   21U
LHF    27  0    1.4P
CLF    1   25   108.8M
RLF    25  28   400U
LLF    28  0    0.3N
IA     1   0    PULSE 0 512A 0NS 20NS 20NS 92NS 288NS
IR     1   0    PULSE 0 512A 0NS 20NS 20NS 92NS 12.96US
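To make the two load sources in Figure 14 concrete, the following Python sketch evaluates a SPICE-style PULSE(V1 V2 TD TR TF PW PER) waveform and sums IA and IR. It assumes the values 0 A, 512 A, 0 ns delay, 20 ns rise, 20 ns fall, 92 ns width, and periods of 288 ns (IA) and 12.96 microseconds (IR) as read from the figure key; the function names are illustrative, and times are kept in integer nanoseconds to avoid rounding surprises.

```python
def pulse(t_ns, v1, v2, td, tr, tf, pw, per):
    """Value of a SPICE-style PULSE source at time t_ns (all times in ns)."""
    if t_ns < td:
        return float(v1)
    tau = (t_ns - td) % per          # position within the current period
    if tau < tr:                     # rising edge
        return v1 + (v2 - v1) * tau / tr
    if tau < tr + pw:                # flat top
        return float(v2)
    if tau < tr + pw + tf:           # falling edge
        return v2 + (v1 - v2) * (tau - tr - pw) / tf
    return float(v1)                 # remainder of the period


def load_current(t_ns):
    """Sum of the two assumed load sources, IA and IR, in amperes."""
    ia = pulse(t_ns, 0, 512, 0, 20, 20, 92, 288)
    ir = pulse(t_ns, 0, 512, 0, 20, 20, 92, 12960)
    return ia + ir


print(load_current(50))    # both sources at full value: 1024.0
print(load_current(338))   # second IA pulse only: 512.0
```

Under these assumed values the two sources coincide at the start of each 12.96-microsecond interval and add to 1024 amperes, which is consistent with the initial pulse magnitude quoted in the text.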

Simulation and Laboratory Measurements
The two previously stated conditions of interest resulting in large
load voltage changes are the transient load with two regulators and the
failure of one regulator.
For transient loads, a larger voltage change occurs with two regulators
rather than with three because two regulators take longer than three to
adjust the supply current to the new load value.

Simulated Load
For laboratory measurements, the actual dynamic RAM load, as shown in
Figure 13, is difficult to design and build in a reasonable time
because of the magnitude and rise time combination. However, a load
with a much slower rise time could be easily built. Such a load
(I in Figure 14) is expected through the busbar, because the capacitors
and busbar slow down the fast edges of the dynamic RAM load. This
simulated load was built and connected to two regulators. The predicted
waveform and the measured waveform showed that the initial shapes of
the peak change, the peak magnitudes (80 millivolts), and the times of
occurrence of the peak (300 microseconds) were all similar. However, we
could not measure the overshoot and ringing after the peak because the
busbar was not available.

Failure of One Regulator
When one of the three regulators fails, the other two regulators cannot
meet the increased load instantaneously. As a result, the load voltage
drops until the two regulators can increase their output current
sufficiently to reverse the direction of the drop. The SPICE model for
this condition was run and the load voltage of the drop was predicted.
Laboratory measurements were then taken with the simulated load and one
regulator was turned off. Both the predicted and measured waveforms had
the same shapes, peak magnitudes (100 millivolts), and times of
occurrence of the peak (200 microseconds) after the regulator was
turned off. Therefore, we concluded that the proposed design could meet
the load requirements.

References

1. P. O'Connor, Practical Reliability Engineering, 2d ed. (New York:
   John Wiley and Sons, 1985).
2. SPICE is a general-purpose circuit simulator program developed by
   Lawrence Nagel and Ellis Cohen of the Department of Electrical
   Engineering and Computer Sciences, University of California,
   Berkeley.

Donald F. Hooper
John C. Eck

Synthesis in the CAD System Used to Design the VAX 9000 System

The design of the VAX 9000 system represents a sixfold increase in
complexity over the VAX 8600/8650 system. This increased complexity
posed a significant challenge because of the concurrent need to shorten
the duration of the project design cycle and convert all
high-performance systems computer-aided design (CAD) software from the
DECSYSTEM-20 system to the VAX system. As part of the task of meeting
these challenges, the CAD Group proposed the implementation of a design
methodology that used logic synthesis for the first time in the
development of a major product for Digital. The primary objectives of
this methodology were to increase the productivity of the logic
designers and to reduce the number of errors introduced during
conversion of high-level designs into gate-level structural designs.

Methodologies

Previous Methodology
In the previous development methodology, as shown in Figure 1, logic
designers specified high-level designs on paper, and simulation
engineers transferred this rendition into a behavioral model.
Technology engineers developed the gate-level cells. After the cells
were defined and characterized for function and timing, the logic
designers generated schematic drawings by using graphical bodies that
represented the cells.
As changes were made to the schematics, the simulation engineers
attempted to reflect these in the behavioral model. Finally, a
gate-level simulation model was assembled from the completed schematics
to verify that the design represented a valid VAX system. This process
was extremely laborious, error-prone, and time-consuming. Therefore, we
concluded it could not be used to develop the VAX 9000 system, which is
a 700,000-gate design and for which the technology cells would not be
defined and characterized until late in the design stage.

Logic Synthesis
Our early research into logic synthesis began in 1982. Over the next
two years, we explored new synthesis ideas and constructed prototypes
to determine the feasibility of those ideas. For example, one of our
early logic minimization efforts was a program that emulated Brown's
Laws of Form for transformations of Boolean logic to reduce gate counts
and improve critical timing paths [1]. However, this program has had
only limited success and is not really usable as a released
computer-aided design (CAD) product. For example, the program does not
deal with selections of cells for combinational logic nor does it
consider the myriad problems involved in assembling a database for a
buildable gate array chip.
During 1984 and 1985, new artificial intelligence (AI) and synthesis
ideas were being developed. Universities and technical communities were
exploring the potential of object-oriented databases, rule-based AI,
data flow design entry, and algorithmic minimizations. We began the
prototype development of our system for integral design (SID) at
approximately the same time as the ideas for the VAX 9000 hardware
architecture were beginning to be developed. In 1985, the SID program
became an internal CAD product for use in the development of the
VAX 9000 system. By combining the most advanced rule-based AI
techniques with an object-oriented database, the core SID was designed
to be a repository of logic design knowledge. We hoped that, over the
years, SID would mature to perform many highly repetitive logic design
tasks at an expert level.
From 1985 to 1988, the capabilities of the SID system gradually
improved until it was producing gate array chips that met the VAX 9000
machine cycle time, power, and electrical rules requirements.

[Figure 1: Previous Design Methodology. Flow boxes: technology cell definition; technology characterization; behavior model text edit; gate-level schematic entry; place and route; bug report generated]

New Methodology
The VAX 9000 development methodology, shown in Figure 2, circumvents
the need to wait for the technology cells to be completely specified
before beginning logic design. This methodology uses schematic entry
and simulates the technology-independent, register transfer level (RTL)
bodies. The RTL library for this type of entry includes MUXes, latches,
adders, comparators, incrementers, decoders, and simple Boolean gates.
The entry is extracted to a common database format, called CADEX, from
which a simulation model is built. A behavior model still exists, but
its hierarchy matches the RTL schematic hierarchy at key physical
boundaries. Thus, simulation models can be built that consist of a
hierarchy of mixed behavior and RTL models.
While logic designers are creating the RTL design, technology engineers
are defining the technology cells. In parallel with these activities,
synthesis knowledge engineers are writing rules to transform the RTL
design into technology cells. These three activities should be
completed at the same time, at which point synthesis produces each of
the VAX 9000 system's 77 gate array chips. The goals for the synthesis
program were to

• Simplify design entry and thereby reduce schematic complexity by a
  factor of 4

• Generate 90 percent of the VAX 9000 system's logic through synthesis

• Reduce the number of simulation errors introduced in the design

• Reduce the number of electrical rules violations in the design

To generate a database for a buildable gate array chip, the synthesis
tool is required to

• Read technology-independent input standard net list format, which can
  be in DECSIM behavioral notation or CADEX common database format

• Minimize Boolean gates through state-of-the-art minimization
  techniques

• Improve timing-critical paths through Boolean transformations,
  cell/pin selections, power settings, and net load allocations

• Choose the best available technology cells based on timing, size
  (area), and power estimates

• Insert the clock system for the gate array chip

• Insert testability access logic for the service processor unit

• Obey all electrical design rules for the gate array chip

• Make it easy to detect whether the tool has performed well

• Simplify the improvement of the tool

SID Database
The design of the SID database is fundamental to the robustness of the
CAD system. Previous CAD databases have all assumed that the data is
stable at the time that the CAD tools are working with it. Simulation,
timing verification, design rule checkers (DRCs), and many other CAD
tools assume that net lists and components are fixed and unchanging.
In synthesis, although the data is maintained in a form that makes it
easy to update its parameter values, the basic structure of gates,
pins, and nets remains the same. However, throughout most of the
synthesis process, the basic structures are in a state of change. In
fact, it is a characteristic of synthesis that logic functions are
removed and replaced with new, functionally equivalent logic. Because
of this difference, we designed basic data structures and

[Figure 2: VAX 9000 Development Methodology. Flow boxes: technology cell definition; technology characterization; behavior model text edit; synthesis rules text edit; RTL schematic entry; synthesize, place, route, set power; bug report generated (loop back)]
IS_BOOLEAN, IS_A_NUMBER; adjectives are words such as ANY, ALL, NO.
Dbobjects are database objects or the parameters of these objects.
The command forms used for right-side actions are command dbobject and
command dbobject preposition dbobject. Commands are words such as
INSERT, REMOVE, REPLACE, MODIFY; prepositions are words such as WITH,
TO, FROM. The dbobject can be any of the primary database objects,
secondary objects, or their parameters.
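The two action forms can be illustrated with a small Python sketch that splits an action into its parts. Only the command and preposition word lists come from the text; the concrete syntax, the object names in the examples, and the function name are hypothetical and are not the Ruleform grammar itself.

```python
COMMANDS = {"INSERT", "REMOVE", "REPLACE", "MODIFY"}
PREPOSITIONS = {"WITH", "TO", "FROM"}


def parse_action(text):
    """Split an action into (command, dbobject, preposition, dbobject).

    Accepts 'command dbobject' or 'command dbobject preposition dbobject'.
    The last two fields are None for the short form.
    """
    tokens = text.split()
    if not tokens or tokens[0].upper() not in COMMANDS:
        raise ValueError(f"unknown command in {text!r}")
    command, rest = tokens[0].upper(), tokens[1:]
    for i, tok in enumerate(rest):
        if tok.upper() in PREPOSITIONS:
            first, second = rest[:i], rest[i + 1:]
            if not first or not second:
                raise ValueError(f"missing dbobject in {text!r}")
            return command, " ".join(first), tok.upper(), " ".join(second)
    if not rest:
        raise ValueError(f"missing dbobject in {text!r}")
    return command, " ".join(rest), None, None


# Hypothetical actions; the object names are made up for illustration.
print(parse_action("REPLACE gate WITH nand2"))
print(parse_action("REMOVE inverter"))
```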

For more complex operations, we also allowed LISP functions to be
called by prefixing them with the keyword LISP, or by insertion of a
LISP expression. Thus, if the rule language cannot implement a required
function, a LISP algorithmic routine is called. We used algorithmic
transforms in the generation of adder carry-lookahead.
Ruleform Database Access
Because the database could be traversed in any direction for any
arbitrary distance through the multidirectional pointer system, rules
had to have the same traversal capability. Therefore, the dbobject of
the Ruleform language is a shorthand notation of the "database walk."
Dbobject can be used in a sentence to compare two database objects by
walking to both of them and using a predicate for the comparison.
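The idea of a database walk followed by a predicate test can be sketched in Python as follows. This is only an illustration of the concept: the miniature database, the pointer names, the predicate symbols, and both function names are assumptions introduced here and do not reproduce the SID data structures or the Ruleform notation.

```python
import operator

# Hypothetical predicate table for the comparison step.
PREDICATES = {"=": operator.eq, ">": operator.gt, "<": operator.lt}


def walk(start, path):
    """Follow a sequence of named pointers from a starting object."""
    obj = start
    for step in path:
        obj = obj[step]
    return obj


def compare(start, left_path, predicate, right_path):
    """Walk to two objects from the same start and apply a predicate."""
    return PREDICATES[predicate](walk(start, left_path), walk(start, right_path))


# Miniature database with pointers in both directions (all names made up).
net1, net2 = {"load": 3}, {"load": 1}
pin1, pin2 = {"net": net1}, {"net": net2}
gate = {"name": "g1", "in1": pin1, "in2": pin2}
pin1["gate"] = pin2["gate"] = gate
net1["pin"], net2["pin"] = pin1, pin2

# Forward walk from the gate to two net loads, then compare them.
print(compare(gate, ["in1", "net", "load"], ">", ["in2", "net", "load"]))
# Backward walk from a net to the gate it feeds.
print(walk(net1, ["pin", "gate", "name"]))
```

Because every object holds pointers both to what it drives and to what owns it, a single path list can move in either direction, which is the property the rule language needs in order to express a test as one readable sentence.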
Had the database access been implemented in pure LISP programming
notation, the sentence form would be lost in the many levels of
expressions enclosed in parentheses. One test would occupy many lines
of code and would read more like a software program than an English
sentence. In this case, the chain of thought of the rule writer, the
purpose of which is to capture the step-by-step thoughts of a logic
designer in words, would probably be broken.


To improve the comprehension of the notation used for identifying the
database object, we developed an
