Digital Technical Journal, Volume 2, Number 4, 1990
VAX 9000 Series
Digital Technical Journal
Digital Equipment Corporation
Volume 2 Number 4
Fall 1990
Editorial
Jane C. Blake, Editor
Barbara Lindmark, Associate Editor
Circulation
Catherine M. Phillips, Administrator
Suzanne J. Babineau, Secretary
Production
Helen L. Patterson, Production Editor
Nancy Jones, Typographer
Peter Woodbury, Illustrator and Designer
Advisory Board
Samuel H. Fuller, Chairman
Richard W. Beane
Robert M. Glorioso
John W. McCredie
Mahendra R. Patel
F. Grant Saviers
Robert K. Spitz
Victor A. Vyssotsky
The Digital Technical Journal is published quarterly by Digital
Equipment Corporation, 146 Main Street MLO1-3/B68, Maynard,
Massachusetts 01754-2571. Subscriptions to the Journal are $40.00
for four issues and must be prepaid in U.S. funds. University and
college professors and Ph.D. students in the electrical engineering
and computer science fields receive complimentary subscriptions
upon request. Orders, inquiries, and address changes should be
sent to The Digital Technical Journal at the published-by address.
Inquiries can also be sent electronically to DTJ@CRL.DEC.COM.
Single copies and back issues are available for $16.00 each from
Digital Press of Digital Equipment Corporation, 12 Crosby Drive,
Bedford, MA 01730-1493.
Digital employees may send subscription orders on the ENET to
RDVAX::JOURNAL or by interoffice mail to mailstop MLO1-3/B68.
Orders should include badge number, cost center, site location
code, and address. U.S. engineers in Engineering and Manufacturing
receive complimentary subscriptions; engineers in these
organizations in countries outside the U.S. should contact the
Journal office to receive their complimentary subscriptions. All
employees must advise of changes of address.
Comments on the content of any paper are welcomed and may
be sent to the editor at the published-by or network address.
Copyright © 1990 Digital Equipment Corporation. Copying
without fee is permitted provided that such copies are made for
use in educational institutions by faculty members and are not
distributed for commercial advantage. Abstracting with credit
of Digital Equipment Corporation's authorship is permitted.
All rights reserved.
The information in this Journal is subject to change without
notice and should not be construed as a commitment by Digital
Equipment Corporation. Digital Equipment Corporation
assumes no responsibility for any errors that may appear in
this Journal.
ISSN 0898-901X
Cover Design
Digital's VAX 9000 mainframe system is the theme of this issue.
Our cover depicts several simple instructions flowing through
the VAX 9000 instruction execution pipeline. High performance
was achieved by breaking the VAX instructions into small simple
tasks that could be pipelined efficiently. Concurrent operation
on up to six instructions simultaneously resulted in an execution
rate of one simple VAX instruction per clock period.
Gloria Monroy of the High Performance Systems Group designed
the cover graphic, which was implemented in cooperation
with David Comberg of the Corporate Design Group.
Documentation Number EY-E762E-DP
The following are trademarks of Digital Equipment Corporation:
CI, DECsystem-10, DECSYSTEM-20, Digital, the Digital logo, HDSC,
MCU, MicroVAX, NI, PDP-1, ULTRIX, VAX, VAX-11/780, VAX 6000,
VAX 8000, VAX 8600, VAX 8650, VAX 9000, VAXBI, VMS, XMI.
IBM is a registered trademark of International Business Machines
Corporation.
Kapton is a trademark of E. I. duPont de Nemours & Company.
MOSAIC III is a trademark of Motorola Corporation.
Micromaster Plus is a registered trademark of LTX Company.
Book production was done by Digital's Educational Services
Media Communications Group in Bedford, MA.
Contents
Foreword
Carl S. Gibson
VAX 9000 Series
Design Strategy for the VAX 9000 System
David B. Fite Jr., Tryggve Fossum, and Dwight Manley
VAX Instructions That Illustrate the Architectural Features
of the VAX 9000 CPU
John E. Murray, Ricky C. Hetherington, and Ronald M. Salett
Semiconductor Technology in a High-performance VAX System
Matthew J. Adiletta, Richard L. Doucette, John H. Hackenberg,
Dale H. Leuthold, and Dennis M. Litwinetz
Vector Processing on the VAX 9000 System
Richard A. Brunner, Dileep P. Bhandarkar, Francis X. McKeen,
Bimal Patel, William J. Rogers Jr., and Gregory L. Yoder
HDSC and Multichip Unit Design and Manufacture
Peter B. Dunbeck, Richard J. Dischler, James B. McElroy,
and Frank J. Swiatowiec
The VAX 9000 Service Processor Unit
Matthew S. Goldman, Paul H. Dormitzer, and Paul A. Leveille
The Unique Features of the VAX 9000 Power System Design
Derrick J. Chin, Barry G. Brown, Charles F. Butala, Luke L. Chang,
Steven J. Chenetz, Gerald E. Cotter, Brian T. Lynch, Thiagarajan Natarajan,
and Leonard J. Salafia
Synthesis in the CAD System Used to Design the VAX 9000 System
Donald F. Hooper and John C. Eck
Hierarchical Fault Detection and Isolation Strategy
for the VAX 9000 System
Karen E. Barnard and Robert P. Harokopus
Editor's Introduction
Jane C. Blake
Editor
The VAX 9000, Digital's first mainframe computer,
is the topic of papers in this issue of the Digital
Technical Journal. As engineers writing for this
issue relate, the primary goal of the project from the
initial product strategy through manufacture was to
design and build a very high-performance, highly
reliable VAX system.
Design engineers applied both CISC and RISC
techniques to achieve high levels of performance
for this tightly coupled multiprocessor system.
In the opening paper, Dave Fite, Tryggve Fossum,
and Dwight Manley explain the strategy behind the
design. They begin with an overview of the system,
the technology, and CAD tools, and then describe
the redesign of VAX instructions into small tasks
which can be efficiently pipelined. The authors
also touch upon three additional aspects of the
VAX 9000 system: the integration of vector processing
into the VAX architecture, new error handling
techniques, and performance modeling.
One measure of performance is the number of
instructions processed per cycle. The average number
of cycles per instruction is less than five, which
is nearly half the instruction execution rate of previous
VAX systems. To illustrate the architectural
features that enable this level of performance, John
Murray, Rick Hetherington, and Ron Salett have
selected a small sample of VAX instructions. They
describe the instruction flow through the pipeline,
how instruction features combine to work on a single
macro, and how stages of the pipeline interact.
In addition to the architectural improvements,
machine performance is enhanced at the semiconductor
level by a new generation of semicustom
and custom integrated circuits that support a low
cycle time. Matt Adiletta, Dick Doucette, John
Hackenberg, Dale Leuthold, and Dennis Litwinetz
give an overview of the bipolar technology used in
the system. They then describe the methods used to
implement the 77 different gate array chips, the five
custom chips, and the self-timed RAM architecture.
An additional performance improvement for
numeric computations is the VAX vector architecture
and is treated in the paper by Rich Brunner,
Dileep Bhandarkar, Frank McKeen, Bimal Patel, Bill
Rogers, and Greg Yoder. They discuss the architectural
model and particulars of the VAX 9000 implementation,
which affords numerically intensive
applications performance four to five times greater
than can be achieved by the scalar processor.
To ensure that the system performance gains
at the semiconductor level were not diminished
but were instead enhanced by packaging and interconnects,
engineers developed several technologies
unique in the industry. The technology behind the
high-density signal carrier and the multichip unit
is explained in the paper by Pete Dunbeck, Rich
Dischler, Jim McElroy, and Frank Swiatowiec.
Equally important to performance in the new
9000 is system reliability as evidenced by the introduction
of the service processor unit. In their paper
about the service processor, Matt Goldman, Paul
Dormitzer, and Paul Leveille relate how the
MicroVAX-based system embedded within the 9000
detects, isolates, and corrects problems without
interrupting the system.
High system availability was also one impetus in
the design of the power system. Some of the unique
features of the power system, such as redundant
regulators, improved load sharing and simulation,
are discussed by Derrick Chin, Barry Brown,
Charles Butala, Luke Chang, Steve Chenetz, Jerry
Cotter, Brian Lynch, Raj Natarajan, and Len Salafia.
The two papers that close this issue address the
topics of CAD methodology and system diagnosis.
Don Hooper and John Eck describe a CAD methodology
that combines advanced rule-based AI techniques
with an object-oriented database. The new
methodology saves logic designers significant time
and reduces errors. A complex system such as the
VAX 9000 requires improved system diagnosis capabilities
to achieve the desired high system availability.
Karen Barnard and Rob Harokopus demonstrate
how a new scan system, in combination with scan
pattern testing and symptom-directed diagnosis,
achieves this necessary diagnosis capability.
The editors thank Rick Hetherington of the High
Performance Systems Group not only for writing a
paper but also for his help in coordinating this issue.
Biographies
Matthew J. Adiletta
Matthew Adiletta is currently contributing to the
implementation of a new processor architecture and performing a technology
evaluation to determine the technology for the implementation. He joined
Digital in 1985 to work on a high-performance RISC architecture. Matt was not
only the architect for the VAX 9000 system, but he also implemented the integer
and floating point multiply and divide units and developed an ECL custom chip
process. He holds one patent and has several patents pending. Matt received a
B.S.E.E. (honors, 1985) from the University of Connecticut.
Karen E. Barnard
A senior software engineer with the High Power Business Unit
CPU Development Group, Karen Barnard wrote the read-only memory-based
diagnostic for the VAX 9000 service processor unit's scan control module
and developed the scan pattern diagnostic for the VAX 9000 CPU and SCU. Karen
also worked on the debugging structural test process for the VAX 9000 kernel
environment. Prior to joining Digital in 1986, Karen was with Data General
Corporation. She received a B.S. (1983) in computer science from the Worcester
Polytechnic Institute.
Dileep P. Bhandarkar
As technical director for RISC systems, Dileep Bhandarkar is
responsible for leading the architectural direction of RISC products.
He joined Digital in 1978 and was responsible for managing the evolution of
the VAX architecture. Dileep was the chief architect for VAX vector processing
and coarchitect of Digital's RISC architecture. He holds one patent for his work at
Digital and has several patents pending. His degrees in electrical engineering
include a Bachelor of Technology from the Indian Institute of Technology and an
M.S. and a Ph.D. from Carnegie-Mellon University.
Barry G. Brown
The concept of designing DC-to-DC converters as system
elements rather than individual "power supplies" was introduced into the
high-power systems products by Barry Brown. He created and developed a highly
flexible, high-reliability DC-to-DC conversion system for the VAX 9000 series.
Barry designed, implemented, and verified the power system for the VAX 9000
Model 200 systems. He was a principal engineer for the Codex Corporation
before coming to Digital in 1984. Barry is a graduate of Woolwich Polytechnic
and Harlow Technical College.
Richard A. Brunner
As a principal engineer, Richard Brunner is the architect
currently responsible for the engineering refinement and control of both
the VAX and VAX vector architectures. He is the editor of the VAX Architecture
Reference Manual and coauthor of the VAX Vector Handbook and several papers
on the VAX vector architecture. He received a B.S. (high honors, 1984) in
electrical engineering from Case Western Reserve University and an M.S. (1987) in
computer engineering from Rensselaer Polytechnic Institute. He is a member of
IEEE and Tau Beta Pi.
Charles F. Butala
Presently responsible for the power system design and
architecture of the VAX 9000 Model 400 systems, Charles Butala is a consulting
engineer in the Information Systems Business Unit Power Systems Group. Since
he joined Digital in 1976, he has been responsible for several power system design
projects, including the VAX 8600 system. He is a member of IEEE and Tau Beta Pi,
and holds honorary society membership in Eta Kappa Nu. Charles received
a B.S.E.E. (1968) from Illinois Institute of Technology and an M.S.E.E. from
Northeastern University.
Luke L. Chang
After receiving his M.S. in electrical engineering from Virginia
Polytechnic Institute and State University in 1988, Luke Chang joined the Power
Systems Technology and Regulations Group. He is currently a hardware engineer
and is responsible for developing simulation tools to perform high-quality
software design verification tests for the next generation DC-to-DC power
converters. Luke's previous responsibilities include transient analysis and testing of
the VAX 9000 memory power distribution system, and power system cost
reduction studies.
Steven J. Chenetz
As a principal engineer in the Information Systems Business
Unit Power Systems Group, Steven Chenetz is currently working on the
H7390 for a high-power VAX system. He previously was a member of the design
and development teams for the H7380 of the VAX 9000 system, the H7188
environmental monitoring module for the VAX 8600 power system, the VAX 8600
clock distribution system, and signal integrity for the VAX 8600 system. Steve
joined Digital upon graduation from Rensselaer Polytechnic Institute in 1981.
He has an M.S.E.E. from Northeastern University (1987).
Derrick J. Chin
Derrick Chin is the engineering manager for several Information
Systems Business Unit power groups and is design engineer of the
VAX 9000 processor's DC power distribution system. His association with Digital
began in 1961, and he has participated in many projects, from the PDP-1 and the
DECsystem-10 to the VAX HMO systems. His responsibilities have ranged from
development of precision displays, circuit design, and core and semiconductor
memories to environmental monitoring modules and power systems. He holds a
B.S.E.E. (1959) from MIT.
Gerald E. Cotter
Principal engineer Gerald Cotter is a member of the Information
Systems Business Unit Power Systems Group. He was the project engineer
and coarchitect of the VAX 9000 power control system (PCS). Jerry was the PCS
interface to Customer Service and Support Engineering, Manufacturing, and
Service Processor Unit Groups. He participated in development of the PCS and
power system test strategies and the initial design of the T01060 power and
environmental monitor module. His previous work includes the VAX 8600 system's
power and control subsystem.
Richard J. Dischler
In his position of systems engineer for the High Performance
Systems Group, Richard Dischler worked on the VAX 9000 signal integrity
project. He also was a member of the project team for the electrical design of
HDSC and micropackaging for multichip units, planar boards, and connectors for
the VAX 9000 system. Rich held similar responsibilities in the development of the
VAX 8600 system. He joined Digital in 1982, and his previous experience was at
Applied Research Laboratories. He holds a B.S.E.E. (1982) from Pennsylvania
State University.
Paul H. Dormitzer
As an undergraduate at Harvard University, Paul
Dormitzer gained experience with the UNIX operating system by working as a
programmer and operator. Upon receiving his B.A. in computer science in 1987,
he joined Digital's High Performance Systems Group. He is currently an engineer
in the High Performance Business Unit CPU Engineering Group. Paul's primary
responsibilities are in the development of error recovery processes for
high-power systems, such as the VAX 9000 system.
Richard L. Doucette
Since joining Digital in 1979, Richard Doucette has been
a member of several high-performance systems project teams. As a senior
engineer on the VAX 8600 team, he helped introduce the Motorola Macrocell Array I
(MCA I) technology into Digital and was responsible for its design analysis and
characterization in the system. As engineering manager on the VAX 9000 team,
he was responsible for the incorporation of MCA 3 technology, custom chips, and
self-timed RAM components in the system. He holds a B.S.E.E. (1973) from the
University of Maine.
Peter B. Dunbeck
Peter Dunbeck is an engineering manager in the High
Performance Business Unit Technology Research and Engineering Group. He
held various positions on the VAX 9000 program between 1985 and 1990,
including technology program manager and design engineering manager for the
multichip unit. Before joining Digital in 1984 as a manufacturing engineer, Peter
developed energy conservation programs for Thermo Electron. He holds a B.S.
(1977) in mechanical engineering from Virginia Tech and an S.M. (1979) in
aeronautics and astronautics from MIT.
John C. Eck
The development of the majority of the physical design CAD tools
used in the VAX 9000 system was managed by John Eck. He is a software
engineer manager in the High Performance Systems CAD and Diagnostics Group.
John was employed as the manager of the Automated Design Department of
Badger Company before coming to Digital in 1984. He holds a B.S. (1964) in
physics and an M.S. (1966) in aeronautics and astronautics from MIT, and an
M.B.A. (highest honors, 1984) from Babson College.
David B. Fite Jr.
Consultant engineer David Fite was a member of the initial
architecture team for the VAX 9000 system. He developed the architecture for the
branch prediction, instruction fetch, and instruction decode for the VAX 9000.
His previous work includes responsibility for prototype debugging on the VAX
8600 system. Dave joined Digital in 1982. He has one patent and several patent
applications pending. He is a graduate of Worcester Polytechnic Institute with a
B.S. (honors) in electrical engineering.
Tryggve Fossum
Tryggve Fossum is the system architect of the VAX 9000
system. He received a B.S. (1968) from the University of Oslo and earned his Ph.D.
(1972) from the University of Illinois. Tryggve joined Digital in 1973 and worked
on the design of high-end computers, notably the VAX-11/780 system. As a
project leader on the VAX 8600 team, he guided the design of the floating point
accelerator. He has also worked on several research projects, including an early
raster scan graphics workstation, and a workstation with an integrated disk system.
Matthew S. Goldman
As a senior engineer on the VAX 9000 project team,
Matthew Goldman designed the scan control chip, which contains the control
logic for the VAX 9000 scan system. He was also the responsible engineer for
all VAX 9000 service processor hardware. Prior to joining Digital's High
Performance Systems CPU Group in 1986, Matt was a design engineer for Raytheon
Company. He is a member of Tau Beta Pi and Eta Kappa Nu. Matt holds a
B.S. (highest honors, 1983) and an M.S. (1988) in electrical engineering from
Worcester Polytechnic Institute.
John H. Hackenberg
In 1968, John Hackenberg came to Digital as a
technician on the KI-10 project, leaving after two years to serve in the armed forces.
He returned to Digital in 1971 and worked on the designs for various high-end
systems, including the KL-10. As a consulting engineer on the VAX 8600 project,
he worked in the area of signal integrity. John was the project leader for the MCA 3
gate array used in the VAX 9000 system and is currently developing a bipolar gate
array. He holds a B.S.E.T. (1979) from the University of Lowell.
Robert P. Harokopus
A cum laude graduate of the University of Michigan,
Robert Harokopus received a B.S. (1986) in computer engineering and is now
studying for an M.S. in computer engineering from Boston University. Bob is a
senior software engineer and joined Digital in 1986. He developed the
symptom-directed diagnosis software used in the VAX 9000 service processor unit. Bob
also developed software for the HIDE CAD tool and SCEPTER automatic test
pattern generator, both of which were used in the VAX 9000 design project. He is
a member of Tau Beta Pi and Eta Kappa Nu.
Ricky C. Hetherington
As a principal engineer with the High Performance
Systems Group, Ricky Hetherington is currently the project leader of the
translation buffer and cache design of the VAX 9000 system. He holds one patent and has
several patents pending on the various design features of the VAX 9000 M-box.
Rick joined Digital in 1982 as a senior engineer in Digital's Large Computer
Group. He has a B.S. from Pennsylvania State University.
Donald F. Hooper
Don Hooper is a consulting engineer in both logic design
and CAD disciplines. He initiated and led the development of the Synthesis of
Integral Design program, Digital's first synthesis tool. Before coming to Digital
in 1979, he was architect for the Itel 7031 mainframe and cache designer for the
Itel Advanced System 4. He is a graduate of Don Bosco Technical Institute. Don
holds patents in speech recognition circuits, the tag and queuing system for
Digital's first pipelined CPU, and the control storage pipe for the VAX 8600
system. In addition, he has several patents pending in logic synthesis.
Dale H. Leuthold
A member of the technical staff of the Integral Circuit
Design Group, Dale Leuthold led the design team for the VAX 9000 vector
register chip. He is currently working on random-access memory development for
high-speed mainframes. Dale was responsible for bipolar integrated circuit
design at Signetics Corporation and Trilogy Systems Corporation before coming
to Digital in 1986. He holds one patent and has one patent pending. Dale received
a B.S. from Oregon State University.
Paul A. Leveille
In his nearly ten-year relationship with Digital, Paul Leveille
has specialized in the development of high-power systems, particularly the
VAX 8600 and VAX 9000 systems. As a principal engineer in the High
Performance Business Unit, he helped define the VAX 9000 service processor
subsystem and was responsible for developing the scan control firmware and
portions of the service processor application software. Paul's previous
responsibilities include console diagnostics, firmware, and application software.
Dennis M. Litwinetz
The project leader for the design of four standard cell
and custom chips for the VAX 9000, Dennis Litwinetz is a consulting engineer
in the High Performance Business Unit. He has previously participated in the
design of two standard cell chip designs for the VAX 8600 system. He joined
Digital in 1967 as a technician for the DECsystem-10 Engineering Group. Dennis
has a patent pending for the VAX 9000 self-timed register file design. He received
a B.S.E.E.T. from Lowell Technological Institute and an M.S.C.E. from the
University of Lowell.
Brian T. Lynch
Brian Lynch is a principal hardware engineer in the Information
Systems Business Unit Power Systems Group. In this position, he designed
and developed the H7382 bias power supply used in the VAX 9000 system. He is
presently working on power solutions for future high-performance systems.
Prior to joining Digital in 1972, Brian was responsible for power converter and
analog module design at Intronics. He has a B.S.E.E. (1978) from Worcester
Polytechnic Institute.
Dwight Manley
As a principal engineer on the VAX 9000 project, Dwight
Manley was responsible for all of the performance modeling of the VAX 9000
CPU design. His present responsibilities include writing code for a Digital
Extended Math Library product. Dwight joined Digital in 1979 as a member of
the Systems Performance Analysis Group. Prior to that time, he worked as a
systems programmer for the Bell Telephone System. Dwight has a B.S. (1971) in
mathematics from the University of Massachusetts and an M.S. (1976) from
Northeastern University.
James B. McElroy
Jim McElroy is the multichip unit operations manager. His
work on the VAX 9000 system began with interconnect and packaging, followed
by the management of the physical technology efforts. He then became the
manufacturing systems program manager for the introduction of the VAX 9000
system into manufacturing. Before joining Digital in 1976, Jim worked at RCA on
packaging and interconnect design for military computer systems. He received a
B.S.M.E. and an M.S.M.E. from Northeastern University.
Francis X. McKeen
The project leader for the V-box unit of the VAX 9000
system was Francis McKeen. Prior to working on the VAX 9000 system, he wrote
microcode for the VAX 8600 and VAX 8650 systems. Frank is a principal engineer
and has been with Digital for seven years. He holds one patent and has several
patent applications pending. Frank received a B.S.E.E. from Northeastern
University and is a member of IEEE.
John E. Murray
The coauthor of Microarchitecture of the VAX 9000, John
Murray is a consulting engineer in the High Performance Business Unit. He
served as project leader of the design team for the I-box unit of the VAX
9000. He joined Digital in 1982. John's previous employer was ICL in the
United Kingdom, where he was a design engineer. He received a B.Sc. (1969)
from Warwick University. He holds one patent and has several patents pending.
Thiagarajan Natarajan
Thiagarajan Natarajan is manager of a DC-to-DC
converter group in the Information Systems Business Unit. His group develops
a high-density and highly reliable DC-to-DC converter, associated hybrids,
semiconductor components, and the distribution system for the next generation,
high-performance VAX systems. Raj's prior experience includes positions at
General Electric, Bell Laboratories, and Perkin Elmer Corporation. He has a
Ph.D. in electrical engineering, has been awarded one patent, and has authored
approximately seventeen technical papers.
Bimal Patel
Principal engineer Bimal Patel joined Digital in 1986 as a senior
engineer. His primary responsibility since that time was the design of the V-box
unit of the VAX 9000 system. Bimal was previously employed as a senior engineer
in the CPU Design Group of Prime Computer, Inc. He has an M.S. in computer
engineering from Boston University.
William J. Rogers Jr.
William Rogers is an engineer in the VAX 9000 CPU
Group, where he developed the design of the control logic of the V-box unit for
the VAX 9000. Prior to working on this high-performance system, Bill was a
member of the SASE Support Engineering Group. He joined Digital in 1986 and is
a member of IEEE and Tau Beta Pi. He received a B.S. (1986) in electrical
engineering from Michigan Technological University.
Leonard J. Salafia
The development of the AC front end for the VAX 9000
system was the responsibility of Leonard Salafia, who is the manager of the
AC Power Interface Development Group. His previous work at Digital includes
supervising the development of storage system power products for the Central
Power Supply Engineering Group and for the Storage Systems Power Group. Len
worked for General Electric prior to coming to Digital in 1980. He holds a
B.S.E.E. (magna cum laude, 1969) from the University of Hartford and an
M.S.E.E. (1976) from Rensselaer Polytechnic Institute.
Ronald M. Salett
As a consulting engineer in the High Performance Systems
Group, Ron Salett is currently leading the development of a new
high-performance CPU. As a project leader for the VAX 9000 system, he was responsible
for the architecture, design, and microcode of the execution unit. Since joining
Digital in 1977, Ron has also worked as an architect and project leader on
low-end integrated PDP-11 systems. He holds two patents. Ron holds a B.S.E.E.
(1975) from Carnegie-Mellon University and an M.S.E.E. (1979) from Worcester
Polytechnic Institute.
Frank J. Swiatowiec
In 1988, Frank Swiatowiec became HDSC operations
manager, with the primary responsibility to transition Digital's new HDSC
technology to volume production. He was one of the engineering managers
responsible for the definition and development of the HDSC. Frank had over 15 years of
experience in the semiconductor industry when he joined Digital in 1986. While
with Motorola Corporation, he was awarded four patents on ECL circuit designs.
Frank holds a B.S.E.E. from the University of Illinois and an M.S.E.E. from
Arizona State University.
Gregory L. Yoder
Gregory Yoder is a senior hardware engineer with the High
Performance Systems CPU Engineering Group. His primary responsibilities on
the VAX 9000 system included the design and testing of the V-box unit, and
prototype system debug, for which he received an excellence award. He also
assisted Manufacturing in producing and installing external field test VAX 9000
machines. Greg joined Digital in 1988, after participating in a one-year co-op
session at IBM. He holds a B.S.E.E. from Pennsylvania State University.
Foreword
Carl S. Gibson
VAX 9000 Program Manager
This issue of the Digital Technical Journal is a
collection of papers describing the technologies,
designs, and design methods employed in Digital's
VAX 9000 mainframe/supercomputer, which was
introduced in the fall of 1989.
The VAX 9000 system embodies hundreds of
innovations in most areas of design, manufacture,
and service. In selecting papers for this journal, we
have attempted to reflect the immense scope and
variety of this program, which ranks among the
largest and most complex in the history of our
industry.
In the summer of 1983, a small group of us set
about to determine what it would take for Digital to
develop a true mainframe. We felt that a mainframe
VAX would be a powerful addition to Digital's
product family. The products that we have created
took form, changed, and evolved over the months
and years as technical challenges yielded to
innovations, rigor, and discipline. An undertaking on
this scale necessarily undergoes numerous
transitions as new data emerges, assumptions are tested,
and alternatives are eliminated. Technical
breakthroughs built upon one another incrementally
as we pressed the design closer to our goals. The
primary objectives of very high system-level
performance and world-class reliability drove the design
process and the changes that emerged.
The planar logic packaging is illustrative of how
changes and improvements built upon one another.
The reliability benefits of minimal connections
precipitated a logic packaging design change from
stacked modules in dual backplanes to the planar
array. This change, an optimization for reliability,
in the end actually helped performance and
maintainability. Ultimately, though not envisioned
at the time, the adoption of the planar array had
a significant impact in that this structure enabled
impingement air cooling and elimination of the
bulky liquid system that was part of the initial
design. The final design of the VAX 9000 system
reflects, in myriad forms, this continual process of
successive refinement toward shared goals.
Design changes notwithstanding, our primary
strategy remained constant. The reader will note
that, while we innovated aggressively in CPU
structure, implementation technologies, and design
methodologies, we preserved full compatibility
with the VAX, Digital storage, and Digital
networking and cluster architectures. We wanted Digital
and our customers to be able to enjoy very high
performance levels in a product that was compatible
with prior investments. Therefore, we drew as
much as possible from existing products and
designs from many Digital development groups.
As a result, the VAX 9000 system incorporates
Digital's standard XMI bus and popular BI, CI, and
NI system-level interconnects. The system runs VMS
and ULTRIX operating systems, VAX layered
products, and all of our customers' and independent
software vendors' tools and applications. This
capability proved especially rewarding when, in the
final months of the project, our own VAX 9000
prototypes, running our unmodified CAD tools,
accelerated the processing of the inevitable
last-minute changes.
High-performance computation fundamentally
requires two key ingredients: short machine cycle
times and maximum computational work
performed in each cycle. The semiconductor and
multichip unit papers describe how we minimized
the VAX 9000 cycle time by use of fast circuits,
high-density packaging, and high-speed interconnects.
These papers are complemented by architecture
descriptions through which the authors present the
innovative features that minimize the number of
cycles required to execute the VAX instruction set.
These papers present the sophisticated pipelining
techniques and vector processing capabilities
incorporated in the VAX 9000 system.
Equal in importance to the computational
capabilities of the product are the service and control
features of the system. Papers covering the
VAX 9000 service processor and the system's fault
management capabilities provide the reader with
insights into these important aspects of the
product.
The development strategy for the VAX 9000
system was explicitly formulated to deal with
enormous technical and project complexity.
Complexity itself was the single most formidable challenge
facing the team. Apparent from the outset was the
fact that such an ambitious product required the
integration of a very large number of discrete
design objects; each had to be conceived, created,
documented, tested, and ultimately integrated and
verified as part of the whole. The reader will see
the diversity of these efforts and recognize the
challenge of unifying a design from this breadth of
technical advancement.
Central to our strategy was the creation of a
unified design tool suite operating in a seamless,
homogeneous VMS computing environment. The
first few years of the project were devoted to
construction of this environment in parallel with
top-level design formulation. The recognition that
rigorous design methods were crucial to our success
was possibly one of the team's most powerful
fundamental notions. Papers included in this journal
illustrate some of the legacy of powerful CAD tools
and structured design approaches created by the
VAX 9000 team.
As we have seen for the product, the
methodologies were not immune to change as the project
progressed. Working with rapidly evolving
technologies, design process experts continually
adapted to evolving user needs. Concurrent design
permeated every aspect of the project and
dominated the way people worked together, with many
aspects of the technology and product design
converging and adapting as we learned from our
own processes. When the manufacturing process
needed some help, designs could be reprocessed
with the new rules and rereleased to keep things
moving ahead.
And move ahead they did! Today, the VAX 9000
system is installed at many customer sites where the
systems are exceeding our original goals in both
performance and dependability. It has been
accepted by experienced, high-end computer users
as a bona fide mainframe, a mainframe with the
unique advantage of full integration with Digital's
rich distributed processing architecture.
The VAX 9000 system was created by engineers
working in many disciplines and collaborating
worldwide to invent hardware, software, and
processes that have significantly advanced the state
of the art of computer design, manufacture, and
service. The papers in this journal describe but a
few representative examples of the creativity and
determination of this large and dedicated team of
professionals.
David B. Fite, Jr.
Tryggve Fossum
Dwight Manley
Design Strategy for the
VAX 9000 System
The VAX 9000 system is Digital's newest high-end processor in the VAX family. This
paper describes the design strategy used to achieve high performance and shows how
RISC concepts were applied to a CISC architecture. New opportunities for parallelism
in VAX program execution were found by breaking the VAX instructions into simple
tasks which could be pipelined efficiently. By using independent, dedicated pipeline
stages, execution rates approach one instruction per cycle.
The task confronting the VAX 9000 design team
was to develop a VAX system that outperformed
any previous VAX system and that was competitive
with similarly sized processors from other
vendors. Although the VAX system is based on one
of the world's most popular computer architectures,
the VAX architecture's instruction complexities
preclude efficient macroinstruction pipelining,
such as that found in reduced instruction set
computers (RISC). RISC processors can be built with low
gate counts to handle simple, fixed-length
instruction sets, load/store architectures, and delayed
branching.
To compete with machines based on such
architectures and still remain compatible with the VAX
architecture, the design team chose to implement
the VAX architecture on the VAX 9000 system by
applying techniques that were similar to those used
in RISC processors. We redesigned the VAX
instructions into small, simple tasks, and designed
dedicated hardware that was optimized for each task.
The result is a network of specialized processors,
each of which has its own data paths and state
machines, that operate in parallel and execute
VAX instructions quickly. The most common,
simple instructions are executed at the rate of one
per cycle.
System Overview
The VAX 9000 system is a tightly coupled
multiprocessor, which runs the symmetric multiprocessing
(SMP) version of the VMS operating system and can
have up to four processors sharing a central main
memory. Figure 1 shows a simplified block diagram
of the system. The major system components
include four CPUs, two memory controllers, two
I/O controllers, and a service processor, which is
connected through the system control unit (SCU).
Through a cross-bar switch, the SCU provides
high-speed, simultaneous transfers among the central
processors, I/O devices, and memory banks. System
cache consistency is maintained with duplicate tag
directories located in the SCU. As references are
made to memory, the addresses are checked against
the tag directories. If a cache hit occurs, the cache in
question is requested to invalidate or write back to
main memory. The SCU supplies a bandwidth that
allows near-linear performance improvement as
new processors are added to the system. The
memory is interleaved on cache block boundaries to
provide bandwidth for multiple CPUs and vector
processors.
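The duplicate-tag check described above can be pictured with a small model. The following is a minimal sketch, not the SCU's actual logic: the class name `DuplicateTagDirectory`, the direct-mapped organization, the block size, and the set count are all assumptions made for illustration.

```python
# Illustrative model only: organization and sizes are assumptions, not the SCU design.
BLOCK_BYTES = 64      # assumed cache block size
NUM_SETS = 1024       # assumed number of sets in each (direct-mapped) CPU cache


class DuplicateTagDirectory:
    """SCU-side copy of each CPU cache's tags, consulted on every memory reference."""

    def __init__(self, num_cpus):
        # One tag array per CPU cache; None means the set holds no block.
        self.tags = [[None] * NUM_SETS for _ in range(num_cpus)]

    @staticmethod
    def _split(address):
        block = address // BLOCK_BYTES
        return block % NUM_SETS, block // NUM_SETS   # (set index, tag)

    def record_fill(self, cpu, address):
        """Note that `cpu` now holds the block containing `address`."""
        index, tag = self._split(address)
        self.tags[cpu][index] = tag

    def check(self, requester, address):
        """Return the other CPUs whose caches hold the referenced block.

        In the scheme the text describes, those caches would be asked to
        invalidate the block or write it back to main memory.
        """
        index, tag = self._split(address)
        return [cpu for cpu, array in enumerate(self.tags)
                if cpu != requester and array[index] == tag]
```

For example, after `record_fill(1, 0x1000)`, a reference from CPU 0 to address `0x1010` falls in the same assumed 64-byte block, so `check(0, 0x1010)` reports CPU 1. Keeping the copy of the tags in the SCU means the check does not consume bandwidth in the CPU caches themselves.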
Four XMI backplane buses provide high-bandwidth
paths to I/O devices. Although the XMI is used
as the system bus in VAX 6000 systems, the XMI is
used exclusively for I/O in the VAX 9000 system.
Several new adapters were designed to increase
throughput and reduce latency for I/O transactions.
These adapters include connections to the CI, the
NI, the BI, and local disk controllers. Although
high-performance I/O features, such as disk striping,
solid-state disk, and load balancing, have been added
to all VAX systems, the VAX 9000 system benefits the
most from these features because it has the I/O
backplane bandwidth to take advantage of them. A block
diagram of a single VAX 9000 CPU connected to the
SCU and the major data paths between the two units
is shown in Figure 2.[1]
Technology Contributions to
Improved Performance
The central processor cycle time has been reduced
to 16 nanoseconds (ns) mainly by the use of fast
emitter-coupled logic (ECL) semiconductors and
fast self-timed random-access memories (RAMs) for
registers and caches, and by decreasing the
interconnect wire length between components.
Motorola's Macrocell Array III (MCA3) technology
provided both macrocell array and standard cell
capabilities. The entire system is composed of 77
unique MCA3 options and 5 custom chip types. A
single MCA3 contains 838 cells (414 major, 224
input, and 200 output), which yield 10,000
equivalent gates, and 256 I/O pins. Maximum power
dissipation is 30.0 watts, with unloaded gate
propagation delays of 120 picoseconds (ps).
Performance-critical operations, such as multiplication,
division, integer and vector register accesses, and
system clocking, were further aided by employing
custom chips.[2]

Figure 1  VAX 9000 System Diagram

Caches for instruction stream and memory
data, scratch pad registers, and control stores all
require high-speed local storage. Two versions of
a proprietary self-timed RAM were designed for
these specific applications. A 4-kilobit (Kb)
self-timed RAM, at 5.5 ns, and a 16Kb self-timed RAM,
at 11.5 ns, provide internal input and output
latches and write pulse generation circuitry.
Multiple access modes allow highly pipelined operations
to take advantage of shorter access times.
Each new semiconductor generation reduces
cycle time, which increases the relative importance
of interconnect delay. High density signal carriers
(HDSC), tape automated bonding, and a single
planar module all reduce the interconnect delay
between active components in the VAX 9000
system. Strict impedance control is maintained
throughout the system. Clock skew is minimized by
employing fixed-length, differential transmission
and dedicated routing layers.
CAD Contributions to Improved
Performance
Hundreds of computer-aided design (CAD) tools
were used during the design and construction of
the VAX 9000 system. However, none of these tools
was more important in improving performance
than the physical layout and timing analysis tools.
Once the design team had placed large functional
sections, placement tools refined individual
macrocell selection and pin placements. Over 33,000 pins
were selected to minimize overall wire length and
maximize critical interconnections.
Routing presented several challenges. All levels of
interconnect included critical signals, differential
pairs, and fixed-length requirements. The HDSC
contains large cutouts that enable die attachment
and allow cooling through the back panel. These
large routing restrictions and special routing
characteristics could not be handled by existing
CAD tools. Therefore, we developed Chameleon,
a general-purpose router. With Chameleon,
crosstalk is minimized, and crossing counts are
maintained and used to increase signal integrity, which
improves performance.
To model the timing relationships within the
system, we used sophisticated CAD tools to
generate an accurate representation of the VAX 9000
system. Detailed timing models of each macrocell
device were created using the SPICE simulator
program.[5] Chameleon and signal integrity tools
provided delay values for each signal within the
MCA3, HDSC, and planar modules. CPUDLY, using
the AUTODLY timing tool, tied the various pieces
together and gave the design engineers a powerful
view of the timing domain.
Instruction Processing
VAX systems exist in a variety of environments and
run thousands of applications. With any new,
high-performance VAX system, it is important to increase
the speed of all applications and to continue to
provide general-purpose computer power. Given
the size of the installed VAX base and the nature
of the applications, performance gains should not
require code modifications. Digital has gathered
substantial information on how VAX processors are
used. This data formed the basis for design
decisions and trade-offs we made in the development
of the VAX 9000 system.
Simple Instructions
In many VAX programs, only a few opcodes are
responsible for a large percentage of the
instructions issued. Most of these opcodes are simple and
limited to a single arithmetic or logical operation.
Often, one of the operands is in memory. A typical
example is
    ADDL3   (R0),R1,R2
Because of the high frequency of these instructions,
speeding up these instructions is a top priority.
Most of the high performance achieved on RISC
processors is derived because these instructions are
pipelined. In a complex instruction set computer
(CISC), such as a VAX system, pipelining
macroinstructions is more complex. Therefore, previous
VAX implementations have pipelined operations at
the microinstruction level.
Processing simple instructions in a VAX system
involves obtaining and decoding the instruction,
fetching source operands, performing an
operation, and storing the result. The most important
difference between the way a VAX processor and a
RISC processor process simple instructions is how
the variable-length instructions and memory
specifiers are handled. VAX operands may reside in
general-purpose registers (similar to RISC
operands), in memory, or may be embedded in the
instruction stream. The VAX architecture provides
a rich selection of memory operand specifiers,
which often require computations to create the
address. In a RISC processor, only load and store
instructions access main memory.

Figure 2  VAX 9000 CPU/Vector Block Diagram

The instruction preprocessing stage (I-box)
decodes instructions and fetches operands in the
VAX 9000 system. In the execution stage (E-box),
simple VAX instructions resemble RISC instructions.
A simple opcode describes the operation, a single
register file provides source operands, and a
destination queue supplies a result descriptor. The I-box
operates in parallel with the E-box, which
functions as a RISC processor by executing one
instruction each cycle. Execution occurs without the need
to identify the operand's source or addressing
complexity. Figure 3 illustrates how simple instructions
flow through the VAX 9000 pipeline. Although all
VAX implementations perform these tasks, the VAX
9000 implementation uses separate, independent
hardware units to overlap the work because
concurrent operation is a prerequisite for single-cycle
instruction execution.
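A rough way to see why overlapped operation is a prerequisite for single-cycle execution is to count cycles for an idealized pipeline. The sketch below is illustrative arithmetic only: it assumes no stalls, conflicts, or cache misses, and the function names and the stage count used in the usage note are assumptions rather than figures from the design. The 16 ns cycle time is the value quoted earlier in the paper.

```python
CYCLE_NS = 16  # central processor cycle time quoted in the text


def cycles_unpipelined(n_instructions, n_stages):
    """Each instruction passes through every stage before the next one starts."""
    return n_instructions * n_stages


def cycles_pipelined(n_instructions, n_stages):
    """Ideal overlap: fill the pipeline once, then complete one instruction per cycle."""
    if n_instructions == 0:
        return 0
    return n_stages + n_instructions - 1


def elapsed_ns(cycles):
    """Convert a cycle count to nanoseconds at the quoted cycle time."""
    return cycles * CYCLE_NS
```

With an assumed six stages, 1000 simple instructions take 6000 cycles without overlap but 1005 cycles with ideal overlap, which approaches the one-instruction-per-cycle rate as the instruction count grows.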
Instruction Cache
We used an instruction cache in the I-box to
decrease instruction stream fetch latency and
reduce the bandwidth requirements on the main
cache. Choosing a virtually addressed cache further
reduced latency and simplified the design by
removing the need for duplicate translation buffers.
The virtual instruction cache is an 8-kilobyte (KB)
cache with a quadword line size, 32-byte blocks,
and a single-cycle access time. Line valid bits are
maintained to allow variable size fills from the main
data cache. Because the average VAX code block size
is 16 to 20 bytes, the block size of the virtual
instruction cache provides a good balance between the
instruction decode stage and the main cache.
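The cache geometry quoted above implies a few derived quantities, worked out in the short sketch below. The sizes come from the text; the direct-mapped address split in `split_address` is an assumption made only to show how the fields relate, since the paper does not state the cache's associativity.

```python
CACHE_BYTES = 8 * 1024   # 8 KB virtual instruction cache (from the text)
BLOCK_BYTES = 32         # block size (from the text)
LINE_BYTES = 8           # quadword line size (from the text)

NUM_BLOCKS = CACHE_BYTES // BLOCK_BYTES       # blocks in the cache
LINES_PER_BLOCK = BLOCK_BYTES // LINE_BYTES   # one line valid bit per line


def split_address(virtual_address):
    """Split a virtual address into (tag, block index, line, byte).

    Assumes a direct-mapped organization, which the text does not confirm.
    """
    byte = virtual_address % LINE_BYTES
    line = (virtual_address // LINE_BYTES) % LINES_PER_BLOCK
    index = (virtual_address // BLOCK_BYTES) % NUM_BLOCKS
    tag = virtual_address // CACHE_BYTES
    return tag, index, line, byte
```

The arithmetic gives 256 blocks of four quadword lines each, so four line valid bits per block are enough to record a partial fill from the main cache.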
Context switches, translation buffer changes, and
instruction stream modifications all require that the
virtual instruction cache be invalidated. Two
complete sets of block valid bits reduce cache sweeps to
a single cycle, if consecutive sweeps do not occur
within 256 cycles of each other. Block size and
frequent sweeping reduce the virtual instruction
cache's hit rate to approximately 96 percent, but by
filling through the main cache, the miss penalty is
minimized.

Instruction Decode
Because the majority of instructions executed
require only a single cycle to execute, the
instruction decode's task of keeping ahead of the E-box is
not simple. Most instructions must be decoded in a
single cycle to keep the VAX 9000 system's
ticks-per-instruction (tpi) low.

For example, VAX instructions may contain up to
six operand specifiers. With 59 different specifier
addressing modes, instruction lengths can vary
from a single byte to more than 50 bytes. However,
the overall average VAX instruction length is 3.8
bytes, and 98 percent of instructions require only
8 or fewer bytes. Furthermore, 96 percent of VAX
instructions executed use only 3 or fewer specifiers.

In each machine cycle, a 9-byte instruction buffer
is presented to the decode stage (XBAR). The
instruction buffer contains instruction stream data
prefetched from the virtual instruction cache.
Instruction decoding consists of generating an
initial microaddress, determining the number of
specifiers for the instruction, including each
specifier access mode and data type, and forwarding the
appropriate specifier data to the operand
processing stages. The XBAR can handle up to three
specifiers. Instructions that contain more than three
specifiers require additional decode cycles. Since
general-purpose register specifiers occur
approximately 41 percent of the time, three register
specifiers can be processed concurrently. Short literals
comprise nearly 16 percent of the specifiers.
However, the XBAR can only decode a single short literal
per cycle. The remaining specifiers must all be
processed by the operand processing unit, which
decodes a single complex specifier per cycle. Unlike
preceding processors, the XBAR handles multiple
specifiers in any order. Table 1 shows the number of
decode cycles required for several VAX processors.

Table 1  Decode Cycles Required

Instruction                       VAX-11/780   VAX 8650   VAX 9000
ADDL3   R3,R5,R7                  3            2
MULF3   S^#48,R4,@(R2)+[R3]       5            4
AOBLEQ  S^#63,R10,10$             3            3

Figure 3  The VAX 9000 Instruction Pipeline
(Stage occupancy over cycles 1 through 18 for PC
generation, VIC access, instruction decode, specifier
processing, address translation, data cache access,
multiply, floating, and integer unit execution, retire,
register write, and data cache access, for the loop below.)

LOOP:   MULF3   (R0),#2.5,R1
        MULF3   4(R0),#3.5,R2
        MULF3   8(R0),#4.5,R3
        ADDF3   R1,R2,R4
        ADDL2   #^XC,R0
        ADDF3   R3,R4,(R5)+
        SOBGEQ  R6,LOOP

Operand Prefetching
Because most simple instructions are decoded and
executed in a single cycle by various pipeline stages,
instruction operands also must be handled in a
single cycle. Multiple, specialized operand units
increase operand processing throughput. From one
to three register operands may be forwarded to the
E-box by one register unit per cycle. A dedicated
short literal unit expands all VAX data formats. The
operand processing unit performs complex address
calculations and requests memory operand data
from the cache unit (M-box). Both the operand
processing and short literal units can perform
multiple-cycle operations.

Load/Store Architecture
Load/store architectures separate memory accesses
from computation. Loads can be scheduled to place
arriving memory data at a functional unit just as an
operation begins. To achieve this effect with VAX
instructions, memory specifiers are treated as
load/store instructions. VAX memory specifiers describe
the effective addresses of memory operands. VAX
memory specifiers do not contain the source and
destination registers that are specified in RISC
load/store instructions. Rather, the VAX 9000 system
assigns temporary register file locations to buffer
memory data. By processing specifiers early in the
pipeline, data can be scheduled to arrive at the
appropriate time.

Memory specifiers act as independent
instructions executed in the operand processing unit. This
unit creates the operand's effective address and
forwards it to the M-box. For loads, the actual memory
data is returned to the E-box register file. The
translated physical address is saved in a queue of write
addresses for store/destination specifiers. When
execution results arrive from the E-box, the
previously saved address is used to write the data into
the cache.
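The treatment of memory specifiers as loads and stores can be sketched in a few lines. This is a simplified model under stated assumptions, not the hardware's behavior: the class and method names are hypothetical, address translation is omitted, and memory is a plain dictionary keyed by physical address.

```python
from collections import deque


class MemorySpecifierModel:
    """Toy model: memory specifiers become loads into temporary slots or deferred stores."""

    def __init__(self, memory):
        self.memory = memory              # dict: physical address -> value
        self.source_list = []             # temporary register-file slots for loaded data
        self.write_addresses = deque()    # addresses saved for destination specifiers

    def load_specifier(self, address):
        """Source specifier: fetch the operand into a temporary slot ahead of execution."""
        self.source_list.append(self.memory.get(address, 0))
        return len(self.source_list) - 1  # slot number the execution stage will read

    def store_specifier(self, address):
        """Destination specifier: save the address now; the data arrives later."""
        self.write_addresses.append(address)

    def retire_result(self, value):
        """Pair an execution result with the oldest saved write address."""
        address = self.write_addresses.popleft()
        self.memory[address] = value
        return address
```

As a usage example, an add whose first source and destination are both in memory would call `load_specifier` for the source, `store_specifier` for the destination, compute the sum from the temporary slot, and then call `retire_result`. The execution step itself never sees an address, which is the property that lets it behave like a register-to-register operation.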
Conflict Detection and Resolution
Macropipelining in the VAX 9000 system relies on
autonomous units operating in parallel. Each
independent unit is optimized for an individual task.
However, macropipelining does require that
mechanisms be added to resolve data dependencies
among instruction processing units. Data conflicts
occur when an instruction's results are required by
an earlier pipeline stage. An addressing data conflict
appears in the following example:

    MOVL    R0,R1
    MOVB    TABLE(R1),R2

Any dedicated address calculating hardware must
wait for the MOVL instruction results before
performing the MOVB instruction's effective address
computation. A memory conflict is another form
of data dependency. In the following example,

    MOVB    R0,(R1)
    MOVB    (R2),R3

a prefetch unit could read the second instruction's
source operand while the E-box writes the first
instruction's results, if the values of registers R1 and
R2 are different. However, when the registers
contain identical values, the read must be delayed until
the write occurs. The VAX 9000 system uses several
different mechanisms to detect and resolve data
dependencies. Passing pointers, scoreboard masks
within the I-box, the write queue in the M-box, and
architectural restrictions are all used to handle
various conflicts.

Register Conflicts
The simplest hardware mechanism employed in the
VAX 9000 system is the use of
pointers to reference data. The operand processing
unit oversees a 16-entry source queue, an 8-entry
destination queue, and a 16-entry source list. A
single pointer is inserted into the source queue for
each source specifier. The pointer represents either
a register number, in the case of general-purpose
register operands, or a tag that indicates an entry in
the source list where the operand data is located. A
pointer is added to the destination queue for each
destination. This pointer represents a register
number or a flag which indicates that the result should
be written to memory.

The instruction issue unit removes source
pointers from the source queue. These pointers are
used to address either the general-purpose registers
or source list for the actual source data. Destination
pointers from the destination queue determine
where results should be written. Register conflicts
can be detected by comparing the source pointers
needed to issue an instruction with all issued
destination pointers in the destination queue. For
example, in Figure 4, the MULL3's R0 source queue entry
would match the ADDL3's R0 destination queue
entry. A write to the general-purpose registers by
the E-box removes the destination queue entry, and
the instruction issue can resume.

Figure 4  Register Conflict Detection
(Source queue (SRCQ), destination queue (DSTQ), and
source list (SLIST) contents for ADDL3 R1,R2,R0
followed by MULL3 (R3),R0,(R4).)
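The pointer comparison behind register conflict detection is simple enough to model directly. The sketch below is an assumption-laden illustration: register pointers are represented as strings such as "R0", source list tags as integers, and the memory-destination flag as the constant `MEM`. None of these encodings comes from the paper.

```python
from collections import deque

MEM = "MEM"  # flag meaning the result is written to memory, not to a register


def register_conflict(source_pointers, issued_destinations):
    """Return True if a source register is still awaited as a destination.

    source_pointers: register names (str) or source list tags (int)
    issued_destinations: destination queue entries of issued instructions
    """
    registers_pending = {d for d in issued_destinations if d != MEM}
    return any(pointer in registers_pending for pointer in source_pointers)
```

Taking the sequence from Figure 4, once ADDL3 R1,R2,R0 has issued, the destination queue holds "R0". The source pointers for MULL3 (R3),R0,(R4) are a source list tag for the (R3) operand and the register "R0", so `register_conflict([0, "R0"], deque(["R0"]))` reports a conflict. After the write to R0 removes that destination entry, the same call returns False and issue can resume.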
Addressing Conflicts
To resolve addressing data
conflicts, the I-box maintains a read/write register
scoreboard. Two register masks are created for
each instruction decoded. The first register mask
denotes the general-purpose registers that the E-box
will read for the instruction, and the second register
mask specifies the general-purpose register writes.
Each bit in these register masks refers to a single
VAX general-purpose register. Specifiers that are
being processed in the operand processing unit are
checked against up to six previous instruction
masks. From the first example above, the specifier
[TABLE(R1)] requires that the operand processing
unit read R1. If the R1 bit is asserted in any
preceding instruction's scoreboard write masks, this
effective address calculation must be deferred.

The VAX architecture presents a unique
addressing conflict problem because some specifiers,
such as -(Rn) and (Rn)+, modify general-purpose
registers. In the following example,

    SUBL2   R0,R1
    ADDL2   (R0)+,R2

the (R0)+ specifier modifies the contents of R0.
Therefore, the operand processing unit cannot
update the general-purpose register without
affecting the prior instruction. The read masks are used
to detect this type of conflict. All specifiers that
modify general-purpose registers must check the
scoreboard read masks before proceeding with
the instruction. Thus, when a conflict occurs, the
general-purpose register modification stalls.

When an instruction completes execution, the
instruction's read/write mask is removed from the
scoreboard. In all addressing conflicts, specifier
processing continues once the blocking mask is
removed.
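The two scoreboard checks described in this section can be expressed with bit masks. The following is a minimal sketch: the class name, method names, and the behavior when the scoreboard is full are assumptions, while the depth of six and the use of separate read and write masks follow the text.

```python
class Scoreboard:
    """Toy read/write register scoreboard with one (read, write) mask pair per instruction."""

    DEPTH = 6  # the text checks specifiers against up to six previous instruction masks

    def __init__(self):
        self.entries = []  # oldest first: (read_mask, write_mask)

    @staticmethod
    def mask(*registers):
        """Build a mask with one bit per general-purpose register number."""
        m = 0
        for r in registers:
            m |= 1 << r
        return m

    def add(self, read_mask, write_mask):
        if len(self.entries) == self.DEPTH:
            raise RuntimeError("scoreboard full (assumed to stall decode)")
        self.entries.append((read_mask, write_mask))

    def retire_oldest(self):
        """Remove the masks of the instruction that has completed execution."""
        self.entries.pop(0)

    def must_defer_address_read(self, register):
        """An address calculation reads `register`: wait if an earlier instruction writes it."""
        bit = 1 << register
        return any(write & bit for _, write in self.entries)

    def must_defer_autoupdate(self, register):
        """(Rn)+ or -(Rn) modifies `register`: wait if an earlier instruction still reads it."""
        bit = 1 << register
        return any(read & bit for read, _ in self.entries)
```

For MOVL R0,R1 followed by MOVB TABLE(R1),R2, the first instruction's write mask has the R1 bit set, so `must_defer_address_read(1)` holds until that entry retires. For SUBL2 R0,R1 followed by ADDL2 (R0)+,R2, the first instruction's read mask contains R0, so `must_defer_autoupdate(0)` holds and the autoincrement stalls.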
Memory Conflicts
The write queue is used to
resolve memory conflicts. Physical addresses,
received from the translation buffer, are inserted
into an eight-entry FIFO. These addresses are later
paired with the proper write data from the E-box
and written into the M-box. To avoid prefetching
stale data, all memory addresses for source memory
operands are translated and compared with the
addresses in the write queue. When no address
conflict occurs, the data from memory is forwarded
to the source list. Operand requests that conflict
with a pending write address are stalled until the
conflict is resolved. The conflict is resolved when
the appropriate write data is received. The
conflicting address is then removed from the write queue.
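The write queue's role in avoiding stale prefetches can be modeled as a small FIFO of pending addresses. This is a sketch under assumptions: the class and method names are hypothetical, addresses are compared for exact equality (the paper does not give the comparison granularity), and a full queue simply raises an error.

```python
from collections import deque


class WriteQueue:
    """Toy model of a FIFO of physical addresses awaiting write data."""

    CAPACITY = 8  # eight-entry FIFO, per the text

    def __init__(self):
        self.pending = deque()

    def reserve(self, physical_address):
        """Record a destination address as soon as it has been translated."""
        if len(self.pending) == self.CAPACITY:
            raise RuntimeError("write queue full (assumed to stall specifier processing)")
        self.pending.append(physical_address)

    def conflicts(self, physical_address):
        """A source operand read must stall while a write to the same address is pending."""
        return physical_address in self.pending

    def write_data_arrived(self, memory, value):
        """Pair arriving data with the oldest address, perform the write, free the entry."""
        address = self.pending.popleft()
        memory[address] = value
        return address
```

In the earlier example, MOVB R0,(R1) followed by MOVB (R2),R3 with R1 equal to R2, the first instruction reserves the address, the second instruction's read finds `conflicts(...)` true and waits, and the read proceeds once `write_data_arrived` has removed the entry.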
Miscellaneous Conflicts
The VAX architecture
includes instructions with operands that either are
not known when the instruction is decoded (e.g.,
INSQUE, MTPR), or modify large portions of
memory (e.g., MOVC5). To avoid conflicts from these
instructions, the I-box suspends processing
memory specifiers until the instruction execution is
completed. Self-modifying code presents another
form of conflict, which is solved by an REI
instruction that notifies the hardware of this condition.
Branch Instructions
Branch instructions have a substantial influence on
the overall performance of a VAX processor. On
average, a VAX processor executes 3.9 instructions,
including the branch, before a branch starts a new
instruction sequence. Instructions that modify the
program counter represent nearly 40 percent of the
total instructions executed. The VAX 9000 system
uses a 1024-entry branch cache and a two-tiered
prediction scheme to increase the average code
block size and reduce the branch-taken latency.

Unlike its predecessors, the VAX 9000 system
commits all its resources to a single branch path. The
prediction hardware selects the path of execution
to resolve memory conflicts for those branch
instructions that are decoded before results are
available. This path selection is based on prior
history, if the branch hits in the branch cache. If the
branch does not hit in the branch cache, the path
is predicted statically, based on the instruction's
opcode. When the branch executes, the prediction
is compared to the actual results. The pipeline is
flushed back to the correct code path if the branch
prediction was incorrect.

The entries in the branch cache store the branch
results of the previous execution of the branch and
the target address, if the branch was taken. Because
the branch cache is a one-way associative cache that
can store only 1024 entries, the results have an
average hit rate of approximately 80 percent. However,
correct predictions occur 85 percent of the time
from the cache, as opposed to an average hit rate of
56 percent, when the predictions are based solely
on opcode. Loop branches are always predicted
as taken, which increases the overall correct
prediction rate to close to 89 percent. By caching
branch targets, the calculation may be avoided and
a latency factor of one-cycle branch taken is
achieved. The branch cache can store a sufficient
amount of branch context to eliminate the need
to sweep the cache.

The I-box can process instructions with up to
two conditional branches outstanding.
Unconditional branches (e.g., BSBW, BRB) are processed as
ordinary instructions by simply changing the
instruction flow. To reduce the penalty for a bad
prediction, which results in a four-cycle penalty,
operand specifiers that modify general-purpose
registers are not processed under a branch
prediction and cause the operand processing unit to stall.
Also, branch instruction execution is overlapped
with the previous instruction to provide the actual
branch results earlier.
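The two-tiered prediction scheme can be sketched as a direct-mapped table with a static fallback. This is an illustration under assumptions, not the I-box logic: indexing by the low bits of the branch address, storing the full address as the tag, and passing in a set of opcodes that are statically predicted taken are all choices made here for clarity.

```python
class BranchPredictor:
    """Toy two-tier predictor: branch cache history first, opcode-based guess on a miss."""

    ENTRIES = 1024  # one-way associative (direct-mapped) branch cache, per the text

    def __init__(self, static_taken_opcodes):
        self.cache = [None] * self.ENTRIES            # each entry: (pc, taken, target)
        self.static_taken = set(static_taken_opcodes)

    def predict(self, pc, opcode):
        """Return (predicted_taken, cached_target_or_None)."""
        entry = self.cache[pc % self.ENTRIES]
        if entry is not None and entry[0] == pc:      # hit: follow prior history
            _, taken, target = entry
            return taken, target
        return opcode in self.static_taken, None      # miss: static prediction by opcode

    def update(self, pc, taken, target):
        """Record the actual outcome once the branch has executed."""
        self.cache[pc % self.ENTRIES] = (pc, taken, target if taken else None)
```

A loop branch such as SOBGEQ would be placed in the statically taken set, matching the statement that loop branches are always predicted as taken. On a hit, the cached target is available immediately, which is what allows the target calculation to be skipped for taken branches.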
Compute-intensive Instructions
Compute-intensive instructions require multiple execution stage cycles. Common examples of these instructions are multiplication, division, and floating point operations. All VAX implementations employ dedicated logic for compute-intensive instructions that occur frequently. Less frequently used instructions depend on microcode-controlled arithmetic and logical data paths. The VAX 9000 system contains four independent execution processors. The integer, floating point, multiply, and divide units execute the VAX instruction set. The I-box preprocesses instructions, which allows instruction execution to overlap in these units. In each cycle, a new instruction can be initiated in the appropriate unit prior to the completion of previous instructions. The floating point and multiply units are pipelined and can accept one instruction each cycle. The integer unit is pipelined for simple instructions. However, complex instructions must use microcode control to perform multicycle operations.
Pipelined instructions are issued in order and proceed through the data path without further microcode control. Upon completion, instruction results are retired in the same instruction order. The instructions must be processed in order because the result of one operation is often needed in a subsequent operation. Therefore, the pipelines must be short and contain data bypasses to make results available quickly. The multiply, float, and divide units' internal data paths are 64 bits wide. To understand how the pipelined and overlapped operations apply to the following operation,

    Y(i) = Y(i) + C * X(i)

consider the program:

    LOOP:
        MULG3   R6, (R0)+, R4
        MULG3   R6, (R0)+, R2
        ADDG2   R4, (R1)+
        ADDG2   R2, (R1)+

The two MULG3/ADDG2 instruction pairs prevent a pipeline stall that could occur because of data dependencies. The instructions further reduce the loop overhead, which is already fairly small because the loop control instruction was predicted correctly. Instructions and source operands are prefetched. The multiply and add units accept the instructions as they become available. The memory references are made as the operand processing unit processes memory specifiers. The majority of specifier processing is performed independently of the instruction execution.
Memory-intensive Instructions
Some VAX instruction classes are primarily memory operations that require only minor computation. Typical examples of these instructions are character string, decimal, and privileged operating system instructions. Pipelined execution offers little advantage to memory-intensive instructions because the number of memory references is not reduced as the number of cycles required for execution is reduced by new implementations. Because memory bandwidth is critical, the VAX 9000 system provides features to benefit these instructions.
For example, the virtual instruction cache services most instruction stream references, which frees the main cache to service prefetched operand references. Both the virtual instruction cache and the main cache have 64-bit data paths, important for character string operations and extended precision arithmetic. The caches are fully pipelined and allow one read per cycle. The main cache block size is 64 bytes, exploiting spatial locality. When cache references do miss, data is wrapped and the most critical data is returned first. A write back, write allocation algorithm further reduces main memory and cache bandwidth requirements and reduces latency.
The VAX system is a virtual memory architecture. Virtual addresses need to be translated to physical addresses through page tables in memory. A translation buffer caches the most recently used page table entries. VAX systems, such as the VAX-11/780 system, process translation buffer misses in microcode, which can be time-consuming. However, the VAX 9000 system uses a memory management processor to process translation buffer misses as part of instruction preprocessing. This operation is performed early in the pipeline and is faster than microcode.
The CALL and RETURN instructions push and pop registers on the stack, and these instructions can be memory-bound. The VAX 9000 system contains both the control logic and the bandwidth to process these registers at a rate of one per cycle.
Unconventional Instructions
Special, dedicated hardware was added to the VAX 9000 system to process those VAX instructions that did not fit into the categories listed above. The additional hardware operates within the pipeline architecture and cycle time, and the cost of adding the hardware was minimal.
In the following example,

    MOVL R0, -(SP)   <---------->   PUSHL R0

the MOVL and PUSHL instructions perform identical operations, but the PUSHL instruction does not explicitly specify a destination address. On previous VAX systems, the instruction prefetching would stall until the current instruction execution was completed. However, the VAX 9000 modifies such instructions during the decode stage by adding the implied specifiers. The benefits of this
enhancement are more evident in the following instructions.

    BSBW 10$   <---------->   MOVAL Return_PC, -(SP)
    RSB        <---------->   JMP @(SP)+

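
The decode-stage rewriting shown in these examples can be illustrated with a small sketch. The Python below is only a model of the idea, not the I-box logic: the table name `IMPLIED_FORMS` and the function `add_implied_specifiers` are hypothetical, and the table holds only the three equivalences given in the examples (PUSHL, BSBW, RSB).

```python
# Hypothetical model of decode-stage expansion: each entry maps an
# instruction with implied operands to the explicit form shown in the text.
IMPLIED_FORMS = {
    "PUSHL": lambda ops: ("MOVL", [ops[0], "-(SP)"]),
    # Only the push of the return PC is modeled, as in the example above.
    "BSBW": lambda ops: ("MOVAL", ["Return_PC", "-(SP)"]),
    "RSB": lambda ops: ("JMP", ["@(SP)+"]),
}


def add_implied_specifiers(mnemonic, operands):
    """Return (mnemonic, operands) with implied specifiers made explicit."""
    expand = IMPLIED_FORMS.get(mnemonic)
    if expand is None:
        # Nothing is implied for this instruction; pass it through unchanged.
        return mnemonic, list(operands)
    return expand(operands)


if __name__ == "__main__":
    for instr in [("PUSHL", ["R0"]), ("RSB", []), ("ADDL2", ["R7", "(R8)"])]:
        print(instr, "->", add_implied_specifiers(*instr))
```

Once the stack reference appears as an ordinary specifier, it can be handled by the same operand processing path as any other memory reference, which is the benefit the text attributes to this enhancement.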
Similarly, instructions such as LOCC and CMPC3 implicitly reference the general-purpose registers. The instruction decode stage creates a read/write mask with these references, which allows instruction prefetching to continue.
To aid handling instructions like PUSHR and CALL, the integer execution unit contains special bit mask manipulation hardware, which optimizes general-purpose register saves and restores.
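
As a rough illustration of what such bit mask manipulation involves, the Python sketch below walks a register save mask one set bit at a time. It is a software analogy under stated assumptions, not a description of the integer unit: the function name `registers_in_mask` is hypothetical, bit n of the mask is assumed to stand for register Rn, and the order in which the hardware actually saves registers is not modeled.

```python
def registers_in_mask(mask):
    """List the register numbers selected by a save mask, lowest first.

    Assumption for this sketch: bit n of the mask selects register Rn.
    """
    registers = []
    while mask:
        lowest = mask & -mask                     # isolate the lowest set bit
        registers.append(lowest.bit_length() - 1)  # convert the bit to its index
        mask ^= lowest                            # clear that bit and continue
    return registers


if __name__ == "__main__":
    mask = 0b0000_0000_0100_0101  # selects R0, R2, and R6
    selected = registers_in_mask(mask)
    print(selected, "->", len(selected), "register transfers")
```

Each loop iteration yields exactly one register, which loosely mirrors the one-register-per-cycle rate the article cites for CALL and RETURN.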
The VAX instruction set contains variable-length, bit-field instructions that handle non-byte data. These instructions can reference memory within a 512 megabyte (MB) range. The field referenced is within the first 8 bytes of the base address more than 95 percent of the time. Therefore, to allow instruction prefetching to continue, the operand processing unit assumes that the field is within the initial quadword and requests that data. If, during
execution, the field destination actually resides outside the prefetched quadword, the correct data is fetched and the pipeline is flushed to avoid potential memory conflicts.
Integrating Vector Processing
The VAX 9000 project team was instrumental in integrating vector operations and data types into the VAX architecture. For many scientific applications, the use of vectors improves performance in three ways:
• Vector instructions specify many operations in a single opcode, which eliminates instruction stream decode as a processing bottleneck.
• Vector registers increase available local storage.
• Vector registers support high peak performance through high bandwidth and short access latency.
The VAX vector architecture implements a load/store architecture, which permits the hardware to deal with large pieces of memory in a uniform manner and increases the use of parallelism.
We added the vector instructions and data types to the VAX architecture in an integrated fashion. Scalar and vector instructions are mixed throughout the pipelines. Systems that do not include vector processors emulate vector instructions with software, a technique especially useful for program development.7,8
Logical Integration
The VAX 9000 vector processor connects to the scalar CPU as an additional functional execution unit. Vector instructions are processed, and operands are stored, in queues, the same as are scalar instructions. As instructions are issued, a control word is sent with instruction operands to the vector processor. The processor contains vector registers and arithmetic units. Addresses for load, store, gather, and scatter operations are also generated by the vector processor. Vector data is stored in the main cache, and both the scalar and vector processors have fast, shared access to that data.
Physical Integration
The VAX 9000 scalar and vector processors reside on a single planar board. Three multichip unit slots are reserved for the optional vector processor, which is field-installable. The integration of the vector processor directly with the scalar processor keeps critical interconnects short and reduces vector instruction overhead.
Error Handling
Reliability, availability, and integrity are critical factors in a high-performance computer system. These factors are affected by the quality of the physical design (i.e., worst-case design), effective cooling, redundant power supplies, and quality controls during manufacture. Still, failures are possible, and the VAX 9000 design had to deal effectively with errors.
Error handling in the VAX 9000 system has two main goals:
• Minimize system service disruption from individual failures
• Maximize the failure information collected for use in preventive and corrective maintenance
A large percentage of hardware failures are intermittent, and many solid hardware failures start as intermittent. The VAX 9000 system was designed to recover from these failures and to use the failure data to predict (and prevent) future problems.
To gather information effectively, VAX 9000 storage elements (i.e., latches, flip-flops, and RAM cells) are visible to the service processor unit through a serial diagnostic bus. Most state information that is relevant to isolate the failing component is available for error analysis programs that can be run at a convenient time. The result of this processing is then used to isolate the failing components for quick repair.
To access the storage elements through the visibility chain, the system clocks must be disabled, which disrupts the system operation for a period of time. The error may also have affected the execution of the instructions in the pipeline. Error handling minimizes these disruptions by making them invisible to the users almost all the time.
The macroinstruction is the unit of execution in a program that is visible to the user. Between instructions, the program state is clearly defined in terms of memory contents and register values. Interrupts and exceptions are handled between instructions to save this state in an orderly fashion. It is important to handle errors the same way.
Two problems arose in trying to provide the same method of error handling. First, instructions go through many stages in a pipelined computer, and several instructions will be in progress at the same time. It is difficult to identify a beginning and end for each instruction. Second, even when boundaries are established, errors can occur at any time and the errors do not automatically line up with instruction boundaries.
To solve this, we made the E-box the point of synchronization between error handling and instruction execution. In the instruction execution model, the E-box accepts operands, then computes and delivers results for storage. If an error occurs that directly affects one of these steps, the error is synchronous to the execution of that instruction. Asynchronous errors do not directly affect any of these steps and are treated as interrupts, i.e., processed after the E-box completes an instruction but before it starts another instruction.
A synchronous error causes a trap to occur in the E-box when the E-box requests data from the subsystem with the error. Since such data can be unavailable as a result of virtual access problems, the E-box is ready to deal with exceptions at that time, and errors can use the same pipelined mechanism.
We do not differentiate between those synchronous errors that affect computation in the E-box and those that do not. Instead, if the program-visible state of the machine has not been modified, the instruction is backed up to the beginning and restarted. Performing this task is not a problem, since the state is normally not changed until the result is stored at the end of the instruction. Errors occurring in early pipeline stages are easily recoverable. In a few cases, memory and registers could have been modified early and, as a result, be affected by the error. Status flags indicate if this has happened.
By getting to an instruction boundary, the clocks can be stopped in an orderly fashion, and the state can be read out, including temporary data to be used for failure analysis. The machine can be reset to start processing at the instruction boundary once the clocks are started again.
While the clock is stopped, the CPU cannot interact with other subsystems or I/O processors. To keep these functions from being blocked and possibly timing out, we only stop the clock to the CPU in error, not all the clocks in the system. We also sweep the cache of written data before the clock is stopped, and I/O interrupts are directed to other CPUs in a symmetric multiprocessing system.
Performance Modeling
When multiple features are added to a CPU design to individually enhance performance, some of those features can interact negatively with each other to decrease performance. Therefore, we designed a performance model to help us evaluate the performance of the design and make trade-offs where necessary. Although instructions were not executed on the model, it is an accurate cycle-by-cycle model of the system for most instruction operations. Equally important, the model was written at a high level, which made it easy to modify and use to experiment with different features before they were added to the design.
Cycle Time
A perennial CPU design issue is the trade-off between cycle time and cycles per instruction. In a VAX system, the cycle time is often limited by the RAM speed in the control store and cache. We modeled a machine at 8 ns and one at 16 ns for the VAX 9000 system. At 8 ns, the pipelines became longer. Although the peak throughput almost doubled, the model showed that the net performance gain did not offset the risks associated with the shorter cycle time.
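
The trade-off can be made concrete with the usual relation: average time per instruction equals cycles per instruction multiplied by cycle time. The sketch below is illustrative only. The cycles-per-instruction figures are hypothetical values chosen to show how longer pipelines can erode the benefit of a halved cycle time; the article does not report such figures for the two modeled machines, and the function names are likewise assumptions.

```python
def time_per_instruction_ns(cycle_time_ns, cycles_per_instruction):
    """Average time per instruction = cycle time x cycles per instruction."""
    return cycle_time_ns * cycles_per_instruction


def net_speedup(baseline, candidate):
    """Speedup of candidate over baseline; each is (cycle_time_ns, cpi)."""
    return time_per_instruction_ns(*baseline) / time_per_instruction_ns(*candidate)


if __name__ == "__main__":
    # HYPOTHETICAL inputs: a 16 ns machine at 4.5 cycles per instruction,
    # and an 8 ns machine whose longer pipelines raise that figure to 8.0.
    slow_clock = (16, 4.5)
    fast_clock = (8, 8.0)
    print("net speedup:", net_speedup(slow_clock, fast_clock))
```

With these made-up inputs the peak rate doubles while the net gain is far smaller, which is the shape of result the paragraph above describes.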
I-stream Synchronization
The VAX architecture requires that changes to the instruction stream be synchronized with an REI instruction. This synchronization makes it easier to implement an instruction cache that is separate from the main cache. To synchronize, either all memory writes can be watched or the I-cache can be cleared on every REI. The first alternative entails high hardware costs, and the second can affect performance. However, the model showed us that the performance impact would be minimal if the I-cache was refilled from the main cache rather than from main memory because the critical parameters were the main cache bandwidth and the I-cache invalidation time, rather than the refill latency.
Branch Prediction
The branch prediction scheme used in the VAX 9000 system was analyzed in great detail. We investigated the use of multiple history bits to improve the effectiveness of branch prediction. In all cases, the use of extra bits provided less than a 1 percent improvement in system performance. Furthermore, no multiple bit scheme could be implemented without increasing cycle time because multiple history bit branch prediction schemes update status each time a branch is encountered. Therefore, we chose to use a single bit technique in the VAX 9000 design. Unlike multiple bit schemes that read and write history bits for each branch instruction encountered, the single bit technique updates the history bit only when the prediction is wrong. The single-bit scheme is both faster and simpler.
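
A minimal sketch of the single-bit idea follows. It is not the VAX 9000 branch cache: the class name, the modulo indexing, and the initial "not taken" state are assumptions made for illustration, and a real branch cache would also hold target information that is omitted here. The sketch only shows the property the text emphasizes, namely that the history bit is written on a misprediction and left alone otherwise.

```python
class SingleBitBranchHistory:
    """One history bit per entry; the bit is written only on a misprediction."""

    def __init__(self, entries=1024):
        self.entries = entries
        # Assumed initial state for this sketch: every branch predicted not taken.
        self.taken_bit = [False] * entries
        self.history_writes = 0
        self.mispredictions = 0

    def _index(self, pc):
        # Assumed indexing scheme; the real cache lookup is not modeled.
        return pc % self.entries

    def predict(self, pc):
        return self.taken_bit[self._index(pc)]

    def resolve(self, pc, actually_taken):
        """Record the real outcome; write the bit only if the prediction was wrong."""
        i = self._index(pc)
        if self.taken_bit[i] != actually_taken:
            self.mispredictions += 1
            self.taken_bit[i] = actually_taken
            self.history_writes += 1


if __name__ == "__main__":
    predictor = SingleBitBranchHistory()
    # A loop branch taken nine times, then falling through once.
    for taken in [True] * 9 + [False]:
        predictor.resolve(0x2000, taken)
    print(predictor.mispredictions, "mispredictions,",
          predictor.history_writes, "history writes")
```

For the ten-iteration loop in the usage example, the bit is written only on entry to and exit from the loop; a scheme that updates a counter on every branch would write history ten times.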
We also used the performance model as a verification tool. The model provided us with early warnings when a feature did not function in the model, or when the cycle count differed from the count in the gate-level simulation. For example, from the model, we became aware of problems in the design of how conflicts between instructions in specifier processing were handled. Periodically, we compared the performance model to the logical model. Both models were subjected to the same instruction sequences. Deviations of more than ±5.0 percent were investigated. Some design bugs were found that did not affect the results of the program but which did keep performance features from working properly. The average deviation was on the order of ±1.0 percent.
Performance tests are among the first programs run on a functional prototype. The VAX 9000 system performed almost as expected. Table 2 compares the actual performance of a VAX 9000 system to its predicted performance for a small sample of modeled programs. The accuracy of the predictions highlights the increasing importance of models in the modern engineering process.
Cache Parameters
The main data cache was accurately modeled. The VAX 9000 system uses a first-in first-out (FIFO) block replacement scheme. The performance model predicted that a true least recently used replacement policy would provide an insignificant improvement in performance over the FIFO method. Also, a true least recently used policy requires that status be read and written for each cache access. In contrast, the FIFO replacement policy updates status only when a cache miss has occurred. Further, the update can be done in parallel with the writing of data into the cache block. Although the 128-byte cache block provided a better cache hit rate, we chose the 64-byte block because it produced better system level performance.
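
The difference in status traffic between the two policies can be seen in a toy simulation. The Python below is a sketch under stated assumptions, not the performance model described in the article: the function name `simulate`, the four-set geometry, and the address trace are all hypothetical. It counts misses and the number of times replacement status must be written for each policy.

```python
def simulate(addresses, policy, num_sets=4, ways=2, block_bytes=64):
    """Return (misses, status_updates) for a small set-associative cache.

    policy is "fifo" or "lru". Each set is a list ordered from the next
    victim (front) to the most recently inserted or used block (back).
    """
    if policy not in ("fifo", "lru"):
        raise ValueError("policy must be 'fifo' or 'lru'")
    sets = [[] for _ in range(num_sets)]
    misses = 0
    status_updates = 0
    for address in addresses:
        block = address // block_bytes
        lines = sets[block % num_sets]
        if block in lines:
            if policy == "lru":
                # LRU must reorder its status on every hit.
                lines.remove(block)
                lines.append(block)
                status_updates += 1
            # FIFO leaves its status untouched on a hit.
        else:
            misses += 1
            status_updates += 1  # both policies update status on a fill
            if len(lines) == ways:
                lines.pop(0)     # evict the block at the front of the order
            lines.append(block)
    return misses, status_updates


if __name__ == "__main__":
    # Hypothetical trace: three blocks that all map to set 0.
    trace = [0, 256, 0, 512, 0]
    for policy in ("fifo", "lru"):
        print(policy, simulate(trace, policy))
```

On this short trace LRU saves one miss but writes status on every access, while FIFO writes status only on misses; whether the extra hit is worth the extra status traffic is the kind of question the article says the model was used to answer.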
We chose two-set associativity because the model clearly indicated that performance would degrade with a direct-mapped scheme. The model also predicted that a four-way set associative cache would not improve performance enough to justify the extra hardware, design complexity, and cycle time penalty.
The data bypass mechanism, the write queue, and the parallel translation buffer fix-up mechanisms were implemented after the performance model indicated significant performance gains would be achieved from these features.
Table 2  Performance Measurements of a VAX 9000 System

    Program Name    Predicted (VUPs*)    Measured (VUPs*)
    HANOI           28.54                25.53
    FFT45           36.87                37.85
    GAUSS           32.72                32.57
    WHETS           27.78                27.17
    WHETD           34.48                34.89

* Performance measured in VAX units of performance (VUP), where the performance of the VAX-11/780 system = 1.0 VUP.
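
As a worked check on Table 2, the short Python sketch below computes how far each measured figure departs from its prediction. The numbers are taken directly from the table; the names `TABLE_2` and `deviation_percent` are introduced here only for illustration.

```python
# (predicted VUPs, measured VUPs) as listed in Table 2.
TABLE_2 = {
    "HANOI": (28.54, 25.53),
    "FFT45": (36.87, 37.85),
    "GAUSS": (32.72, 32.57),
    "WHETS": (27.78, 27.17),
    "WHETD": (34.48, 34.89),
}


def deviation_percent(predicted, measured):
    """Signed deviation of the measurement from the prediction, in percent."""
    return 100.0 * (measured - predicted) / predicted


deviations = {
    name: deviation_percent(predicted, measured)
    for name, (predicted, measured) in TABLE_2.items()
}

if __name__ == "__main__":
    for name, value in deviations.items():
        print(f"{name:6s} {value:+.2f}%")
```

A negative value means the prototype ran slower than the model predicted for that program.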
Vector Performance
Vector processing was modeled using graphical descriptions of the pipeline. The graphical descriptions were essentially critical path method scheduling charts. This approach is reasonable because vector processing makes regular demands on system resources. In fact, the regularity of resource demand patterns was a major reason that vector processing techniques were developed. By using the pipeline schedules, we realized that data should be prefetched to ensure good vector performance.
Performance Measurement
Table 3 compares the VAX 9000 scalar and vector processors' performance to other members of the VAX family of processors.

Table 3  Performance of the VAX 9000 Scalar and Vector Processors

    Program    VAX 8550    VAX 9000            VAX 9000
    Name       System      Scalar Processor    Vector Processor
               (VUPs*)     (VUPs*)             (VUPs*)
    A3D        6.55        65.54               77.45
    DYFESM     5.12        31.88               40.49
    EMIT       5.86        41.65               79.86
    CFFT2D     5.52        25.76               64.18
    BMK8A1     5.45        30.65               83.84
    MXM        5.93        40.81               269.32

* Performance measured in VAX units of performance (VUP), where the performance of the VAX-11/780 system = 1.0 VUP.

The variations in these performance numbers indicate that significant performance improvements can be achieved by using applications that take advantage of machine resources. The numbers also highlight opportunities. By modifying applications to capitalize on machine features, large performance gains may be realized. Performance gains of 100 to 200 percent are often realized and may substantially extend the lives of older programs.
Vector applications tend to fall into three categories. The first category generally does not contain much parallel content. This category is represented by A3D and DYFESM in Table 3. Vectorizing such programs improves performance by a modest 0 to 50 percent. Programs EMIT and CFFT2D in Table 3 represent the second category, which are applications of moderate parallel content. Applications in this category realize a 50 to 150 percent performance gain when vectorized. Applications in the third category, highest parallel content, demonstrate performance improvements of more than 150 percent when vectorized. Programs BMK8A1 and MXM in Table 3 are examples of this class of application.
Often, modest code changes can realize dramatic performance improvements. By simply redefining array dimensions or loop specifications, an application can move from the first category to the third category.
Acknowledgments
Many people contributed to reaching the VAX 9000 performance goals. The authors would especially like to thank David Orbits, whose advanced development work on high-performance VAX designs became the basis for the performance model; and Bill Grundmann, Rick Hetherington, John Murray, Bill Smith, and David Webb, who comprised, with the authors, the original VAX 9000 architecture team.
References
1. J. Murray et al., "VAX Instructions That Illustrate the Architectural Features of the VAX 9000 CPU," Digital Technical Journal, vol. 2, no. 4 (Fall 1990, this issue): 25-42.
2. M. Adiletta et al., "Semiconductor Technology in a High-performance VAX System," Digital Technical Journal, vol. 2, no. 4 (Fall 1990, this issue): 43-60.
3. SPICE is a general-purpose circuit simulator program developed by Lawrence Nagel and Ellis Cohen of the Department of Electrical Engineering and Computer Sciences, University of California, Berkeley.
4. D. Clark, "Pipelining and Performance in the VAX 8800 Processor," Architectural Support for Programming Languages and Operating Systems (ACM, October 1987).
5. C. Wiecek, "A Case Study of VAX-11 Instruction Set Usage for Compiler Execution," Proceedings of the Symposium on Architectural Support for Programming Languages and Operating Systems (ACM, March 1982): 177-184.
6. J. Emer and D. Clark, "A Characterization of Processor Performance in the VAX-11/780," Proceedings of the 11th Annual Symposium on Computer Architecture (Ann Arbor: June 1984): 301-310.
7. VAX Vector Processing Handbook (Maynard: Digital Equipment Corporation, Order No. EC-H0419-46, 1989).
8. R. Brunner and D. Bhandarkar, "Vector Extensions to the VAX Architecture," Proceedings of COMPCON '90 (San Francisco: Spring 1990).
John E. Murray
Ricky C. Hetherington
Ronald M. Salett
VAX Instructions That Illustrate the Architectural Features of the VAX 9000 CPU
The VAX 9000 system is Digital's largest and most powerful VAX system. As such, it offers many unique features that required the use of advanced technology and innovative architecture in the design of the system. Overall, the VAX 9000 microarchitecture produces a high level of system performance and the lowest cycle time of any VAX processor, i.e., less than five cycles per instruction. Three sections of the VAX 9000 CPU - the instruction fetch and decode unit (I-box), the execution unit (E-box), and the data cache and main memory interface unit (M-box) - are illustrated in this paper through descriptions of a small sample of VAX instructions. These instructions are discussed in relation to their flow through the pipeline, how their architectural features combine to work on a single macro instruction, and how various stages of the pipeline interact.
In October 1989, Digital introduced its VAX 9000 family of high-performance scalar, vector, and parallel processors. The VAX 9000 system is designed to be expandable from one to four processors, with an optional integrated vector facility available on each processor. The design team obtained high levels of performance with advanced technology and innovative architectural features. The technology provided a platform that has the shortest cycle time for any VAX processor. Most VAX processors average ten or more cycles per instruction, whereas the architectural features of the VAX 9000 system reduce that average below five.
The VAX architecture is a complex instruction set architecture. VAX instructions vary in length and number of operand specifiers. The opcode may be one or two bytes long. The number of specifiers is implied by the opcode. Each specifier's length is determined by the specifier type, and the length can vary by up to 17 bytes.1 Although the VAX 9000 implements a large number of instructions in a single cycle, some instructions need to be implemented in tens of cycles. In these cases, microcode assistance is required. To increase performance, many features were included in the VAX 9000 system that have not been implemented in previous VAX systems. The system contains a virtual instruction cache, a branch prediction cache, multiple specifier evaluation units, deep instruction prefetch, hardware translation buffer fix-up unit, write address buffer and conflict checker, multiported write-back cache, independent arithmetic units, and separate issue and retire queues. These features are pipelined and do not interact in a straightforward way. Many stages are not directly linked to the subsequent stage but feed a queue or first-in first-out (FIFO) buffer. The subsequent stage works on the output of the FIFO buffer. The pipeline is not of fixed length and many operations are done in parallel.
The architectural features do not function totally independent of one another. In fact, the highest level of performance is achieved when all the units function in harmony. This paper highlights the implementation of the macropipeline found in the three major subsystems of the VAX 9000. These subsystems are the instruction fetch and decode unit (I-box), the execution unit (E-box), and the data cache and main memory interface (M-box).
The design team for the VAX 9000 system's I-box evolved a cost-effective subsystem that outperforms all previous VAX systems. As shown in Figure 1, the I-box processes the majority of instructions in just one cycle. It combines a single cycle access virtual instruction cache with a 25-byte instruction buffer and an instruction decode cross bar that can decode three specifiers per cycle. To minimize cycle-wasting stalls, a branch prediction unit handles transitions from one code block to another. In addition, the operand processing unit receives and processes specifiers from the decode unit. The specifiers are passed either to the E-box as pointers, literal data or addresses, or to the M-box as virtual addresses.
Figure 2 illustrates how the front end of the M-box translates addresses by using either a translation buffer or an autonomous virtual-to-physical address translation unit. Physical addresses for reads are used to access a two-way associative write-back cache and to fetch data from memory through the system control unit (SCU), if the data is missing from the cache. Read data is returned to the E-box. Write addresses from the operand processing unit are translated and queued by the M-box until the E-box provides the data for the write.
The E-box of the VAX 9000 CPU performs all scalar operations. As shown in Figure 3, the E-box is a pipelined design that incorporates a microsequencer to control functional unit operation. Other dedicated control logic directs the flow through the pipe stages.
A multiported register file provides general-purpose registers and temporarily holds memory data. The data is processed by one of the four arithmetic functional units. Results pass through a retirement multiplexer to the register file or the M-box data cache, as shown in Figure 4. Multiple VAX instructions are executed concurrently in the E-box pipeline. The primary goal of the E-box is to produce a 32-bit result each cycle, which allows the majority of the simple, but most frequent, VAX instructions to be executed in one cycle. This goal is achieved when four requirements are met. First, the I-box must have commands available for the E-box. Second, operand data, often from the M-box data cache, must be available. Third, pipelined or single-cycle latency functional units are required for single-cycle throughput. Finally, results must be transferred from the functional units. E-box features, such as queues, data bypass paths, and powerful arithmetic units, help the system attain a high-performance level. Stalls are avoided and each instruction is executed in a minimal amount of time.
The M-box of the VAX 9000 CPU is the primary source of memory data. Therefore, it contains the virtual address translation buffer and the data cache. The M-box is multiported and pipelined with two autonomous pipeline segments. Each segment occupies one machine cycle, and the cache access latency is, therefore, two cycles long. During the first cycle, the M-box receives and prioritizes virtually (or physically) addressed memory requests. The M-box then indexes the translation buffer to produce a 33-bit physical address and to perform protection and validity checks. The second pipelined cycle involves data cache access, data alignment, if required, and port response. There are numerous architectural features within both segments that are targeted at high bandwidth for prefetching and storing scalar and vector operands.
To illustrate the various features of the VAX 9000 microarchitecture, we have selected the code sequence shown in Figure 5. In the following sections, we discuss each instruction as it progresses through the pipeline as if it were the only instruction in the pipeline. We then summarize by considering the same instructions as a block of code.
VAX Instruction ADDL2
The ADDL2 instruction uses general-purpose register R8 as an address to memory. The contents of that location are added to general-purpose register R7, and the result is written back to the same location in memory. The instruction is encoded in three bytes: opcode, register, and base register.
Cycles One through Three
If we assume that the ADDL2 instruction is the first instruction either in an interrupt routine or following a context switch, the program counter is generated by the E-box and passed to the I-box on a 32-bit bus. The program counter is latched and used to access the virtual instruction cache during cycle one. The virtual instruction cache contains up to 8 kilobytes (KB) in 32-byte blocks and 8-byte lines of instruction stream data.
Bits <12:3> of the program counter's prefetch buffer are used to access an 8-byte line from the virtual instruction cache. Bits <12:5> are used to access a tag, a valid block, and four quadword valid bits. The tag is compared with bits <31:13> of the program counter's prefetch buffer. If the tag and the bits match, the block and the quadword within the block are valid, and the instruction is in the virtual instruction cache (i.e., a hit). Bits <2:0> of the prefetch buffer are used to rotate the quadword for the opcode byte to be loaded into byte 0 of the I-buffer at the end of cycle one. Similar to the VAX 8650 system, the first byte of the I-buffer is the operation code (opcode) of the instruction.
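
The bit fields named above can be checked with a short sketch. The Python below simply splits a 32-bit prefetch program counter into the four fields the text describes; the function name `split_prefetch_pc` and the dictionary keys are hypothetical labels, and the sketch says nothing about how the hardware stores tags or valid bits.

```python
LINE_BYTES = 8           # each line is one quadword
BLOCK_BYTES = 32         # four quadwords per block
CACHE_BYTES = 8 * 1024   # 8 KB virtual instruction cache


def split_prefetch_pc(pc):
    """Split a 32-bit prefetch PC into the fields described in the text."""
    return {
        "byte_in_line": pc & 0x7,          # bits <2:0>: rotate amount for the opcode byte
        "line_index": (pc >> 3) & 0x3FF,   # bits <12:3>: selects one of 1024 lines
        "block_index": (pc >> 5) & 0xFF,   # bits <12:5>: selects one of 256 blocks
        "tag": (pc >> 13) & 0x7FFFF,       # bits <31:13>: compared with the stored tag
    }


if __name__ == "__main__":
    fields = split_prefetch_pc(0x00012345)
    print(fields)
    # The index widths are consistent with the stated cache geometry.
    print((1 << 10) * LINE_BYTES == CACHE_BYTES, (1 << 8) * BLOCK_BYTES == CACHE_BYTES)
```

The field widths agree with the geometry given earlier: 1024 lines of 8 bytes and 256 blocks of 32 bytes both come to 8 KB, and the low two bits of the line index pick one of the four quadwords within a block.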
The ADDL2 is three bytes long and normally fits in one line of the virtual instruction cache. If the ADDL2 instruction crosses a line boundary, a
Figure 1  Block Diagram of the VAX 9000 System I-box
Figure 2  Front End of the VAX 9000 System M-box
Figure 3  Block Diagram of the VAX 9000 System E-box
shou ld be routed w the register/pointer unit and
that the memory specifier should be routed to the
operand processing unit.
I n parallel with the XBAR decode process dur
ing cycle two, the program coumer is passed to the
E-box from the 1-box. The opcode is used to address
the fork random-access memories (RAMs) in the
E-box that provide a fork address to the microse
quencer. At the end of cycle two, the decoded bytes
are shifted out of the !-buffer, and the subsequent
instruction is presented to the XBA R in cycle three.
The fork address from the 1-box is then used to
address a fork RAM in the E-box. For each opcode,
the fork RAM provides an entry address into the
control store, i nd icates w h ic h functional u n i t
should begin the execution , and specifies how
many source operands are needed i n the first cycle.
The fork address is modified when an instruction
co 0080
22
ADDL2
R7,
00
4 1 0083
23
SUBF3
#0,5,
535940C2 8F
45FD 0088
24
M U LG3
#2345.5,
E4 0095
25
BBSC
#13.
68 57
5
WRITE BACK
FILL BUFFER
subsequent cycle is required to access the second
l ine. The average VAX i nstruction is 3 . 8 bytes long.
Therefore, a virtual instruction cache hit delivers
about two instructions to the l-buffer.6
Other VAX processors general l y require a cycle
to decode the opcode and one or more cycles to
decode each subsequent specifier.7.H However, the
VAX 9000 CPU's instruction decode cross bar can
decode the vast majority of common instructions in
a single cycle.
If the three bytes of the ADDL2 instruction were
loaded into the !-buffer at the end of cycle one, the
bytes would be decoded during cycle two. The
decode unit (XBAR) passes data from the !-buffer to
a short l iteral unit, a register/pointer unit or an
operand processing unit. As the opcode and speci
fier bytes are decoded in paral lel, the X BAR deter
mines in less than a cycle that both specifier bytes
Figure
32
Cache Unit ofthe VAX 9000 System M-box
Figure 4
E3
M-BOX
E-BOX
WRITE B U F F E R
WRITE
QUEUE
6044
59 85 9999A999
64
E-BOX
WRITE BUFFER
E - BOX
53
I-BUFFER
000001 2 1 '
EF OD
1 $:
(R8)
(RO)[R4].
R3
(R5)+.
R9
BDATA.
1$
VAX Instructions That Illustrate the Major Features ofthe VAX 9000 System
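
The addresses in the listing (0080, 0083, 0088, 0095) follow from the instruction lengths. The Python sketch below recomputes them. The byte counts for ADDL2, MULG3, and BBSC are those given in the text; the count for SUBF3 is derived from the standard VAX specifier encodings (one byte for a short literal, one byte each for the index and base of an indexed specifier, one byte for a register) and should be read as an assumption. All names are illustrative.

```python
# (mnemonic, opcode length in bytes, list of specifier lengths in bytes)
EXAMPLE_PROGRAM = [
    ("ADDL2", 1, [1, 1]),      # R7, (R8)
    ("SUBF3", 1, [1, 2, 1]),   # #0.5, (R0)[R4], R3
    ("MULG3", 2, [9, 1, 1]),   # 9-byte immediate, (R5)+, R9
    ("BBSC", 1, [1, 5, 1]),    # #13, PC-relative base, branch displacement
]


def layout(start_address, program):
    """Return (mnemonic, address, length) for each instruction in order."""
    rows = []
    address = start_address
    for mnemonic, opcode_length, specifier_lengths in program:
        length = opcode_length + sum(specifier_lengths)
        rows.append((mnemonic, address, length))
        address += length
    return rows


for mnemonic, address, length in layout(0x0080, EXAMPLE_PROGRAM):
    print(f"{address:04X}  {mnemonic:<6} {length:2d} bytes")
```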

[...] index the 1024-entry translation buffer. The translation buffer
is a direct-mapped, associative memory that contains the results of
the most recent 1024 translations. Bits <30:18> are compared,
validated, and protection-checked against the tag field. The physical
frame number is a 24-bit field that is appended to the virtual
address bits <8:0> to create the 33-bit physical address. The RAM
used for the translation buffer is a 1024 by 4 self-timed RAM with a
4.5 nanosecond (ns) access time.
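
A minimal sketch of the physical address formation, in Python, assuming 512-byte VAX pages so that the byte offset within a page is 9 bits wide (a 24-bit frame number plus a 9-bit offset gives the 33-bit address quoted in the text). The function name is hypothetical.

```python
PAGE_OFFSET_BITS = 9   # assumed: 512-byte pages, offset in VA bits <8:0>
PFN_BITS = 24          # physical frame number width, from the text


def physical_address(pfn: int, virtual_address: int) -> int:
    """Append the page offset of a virtual address to a 24-bit PFN."""
    if not 0 <= pfn < (1 << PFN_BITS):
        raise ValueError("physical frame number must fit in 24 bits")
    offset = virtual_address & ((1 << PAGE_OFFSET_BITS) - 1)
    return (pfn << PAGE_OFFSET_BITS) | offset
```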

Protection checking occurs during the latter portion of cycle four.
The example we are discussing is a request for a read and write
check. Therefore, both read and write access are checked. Fault
indication is forwarded with the request to the data cache and
subsequently, with the data, to the E-box. If the request has a valid
entry in the translation buffer and no protection violations exist
(i.e., translation buffer hit), a data cache access is required in
cycle five.

The two source pointers and the destination pointer from the I-box
are latched in the source and destination queues, respectively, at
the start of cycle four. The source queue holds 16 entries and can
receive 2 entries per cycle. The destination queue holds eight
entries. Both queues are circular FIFO queues that can be flushed
with the fork queue. The two source pointers are also latched in the
source operand logic at the start of cycle four. The source operand
logic determines which two source pointers to use each cycle. The
pointers can come from the source queue, the I-box, the microword,
the register log, and several special functions. In this example, the
two pointers are selected directly from the latched I-box pointers
because using the source queue would have required an extra cycle.
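
The behavior of a circular FIFO that accepts a limited number of entries per cycle and can be flushed might be modeled as follows. This is a software sketch in Python, not a description of the hardware; the class name, the method names, and the one-entry-per-cycle rate chosen for the destination queue are assumptions.

```python
class PointerQueue:
    """Fixed-capacity circular FIFO with a flush operation."""

    def __init__(self, capacity: int, max_push_per_cycle: int) -> None:
        self.slots = [None] * capacity
        self.head = 0     # index of the oldest entry
        self.count = 0    # number of valid entries
        self.max_push = max_push_per_cycle

    def push_cycle(self, entries: list) -> int:
        """Accept up to max_push entries in one cycle; return how many fit."""
        accepted = 0
        for entry in entries[: self.max_push]:
            if self.count == len(self.slots):
                break  # queue full: the sender would have to stall
            tail = (self.head + self.count) % len(self.slots)
            self.slots[tail] = entry
            self.count += 1
            accepted += 1
        return accepted

    def pop(self):
        """Remove and return the oldest entry."""
        if self.count == 0:
            raise IndexError("queue is empty")
        entry = self.slots[self.head]
        self.head = (self.head + 1) % len(self.slots)
        self.count -= 1
        return entry

    def flush(self) -> None:
        """Discard every entry, for example together with the fork queue."""
        self.head = 0
        self.count = 0


source_queue = PointerQueue(capacity=16, max_push_per_cycle=2)
destination_queue = PointerQueue(capacity=8, max_push_per_cycle=1)  # rate assumed
```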

The selected pointers address the register file and are passed to the
issue logic early in the fourth cycle. The register file contains the
15 general-purpose registers, R0 through R14. These registers can be
written by either the E-box or the I-box for autoincrement or
autodecrement specifiers. The first pointer accesses general-purpose
register R7. The contents of general-purpose register R7 are
[...] following the ADDL2 instruction (i.e., bytes <2:0>). In the
latter case, the SUBF3 instruction would be shifted into the lower
bytes as the ADDL2 instruction is shifted out.
Cycles Two through Eight
In cycle two, the SUBF3 instruction is completely decoded and shifted
out of the I-buffer. As a result, the following actions occur:
• The fork address is passed to the E-box.
• The short literal is passed to the short literal expansion unit.
• The base and index registers are passed to the operand processing
  unit.
• The destination general-purpose register R3 and the two sources are
  passed to the register/pointer unit.

During cycle three, the register/pointer unit allocates the next
available entry in the source list to the short literal and the
subsequent entry to the indexed memory reference. The E-box is
informed of these allocations as pointers to the relevant entries are
passed to the pointer queues in the source one and source two
pointers. The register/pointer unit also passes the destination
register to the destination queue in the E-box.

The operand processing unit passes the tag, with the address for the
indexed memory specifier request, from the register/pointer unit to
the M-box. The address is generated by the adder in the operand
processing unit. In parallel with the operand processing unit and
register/pointer unit, the short literal expansion unit takes the
6-bit field and expands it to a 32-bit F_floating number.
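
The expansion of a 6-bit short literal can be sketched in Python as shown below. The layout used here is the standard VAX one (three exponent bits and three fraction bits placed into an F_floating longword with an excess-128 exponent); it is not spelled out in the text, and the function names are hypothetical.

```python
def expand_short_literal_f(literal: int) -> int:
    """Expand a 6-bit short literal into a 32-bit F_floating longword."""
    if not 0 <= literal <= 0x3F:
        raise ValueError("short literal must fit in 6 bits")
    exponent = 128 + (literal >> 3)        # F_floating is excess-128
    fraction_high = (literal & 0x7) << 4   # top three fraction bits, <6:4>
    return (exponent << 7) | fraction_high # sign bit <15> stays clear


def f_floating_value(longword: int) -> float:
    """Decode an F_floating longword into a Python float."""
    low, high = longword & 0xFFFF, (longword >> 16) & 0xFFFF
    exponent = (low >> 7) & 0xFF
    if exponent == 0:
        return 0.0  # reserved operands are ignored in this sketch
    fraction = ((low & 0x7F) << 16) | high       # 23 stored fraction bits
    mantissa = ((1 << 23) | fraction) / (1 << 24)  # hidden bit, 0.1f form
    sign = -1.0 if low & 0x8000 else 1.0
    return sign * mantissa * 2.0 ** (exponent - 128)
```

For the SUBF3 example, the literal byte 00 expands to the longword 0x00004000, which decodes to 0.5.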

During cycle four, the short literal is written through the I-box
data bus to the relevant entry in the source list. Issue control can
issue with bypass because only the memory data for operand two is
missing.

The E-box stalls until the memory data arrives. Because the I-box and
the M-box generally are functioning ahead of the E-box, memory stalls
are short or nonexistent. In this example, the memory data arrives at
the end of cycle five, as was the case with the ADDL2 instruction.

In cycle four, the M-box operates for the SUBF3 instruction in a
similar manner to its cycle four activity for the ADDL2 instruction.
At the start of the cycle, a command, address, context, and tag field
are sent from the operand processing unit to the M-box. The command
is a simple operand read. Arbitration occurs early in the cycle. The
translation buffer is then accessed, and the physical address is sent
to the cache.

Cycle five begins when the data cache receives the physical address
for the operand processing unit to read. The tag store lookup and
address matching are performed simultaneously with the data read, and
the data is available to the E-box at the end of the cycle. If the
operand read results in a cache miss, the M-box must assemble a
command and an address, which are sent to the SCU to enable the SCU
to access a 64-byte block of memory data. In addition, the data cache
tells the SCU which set the cache will replace with the new cache
block. If the current cache block contains valid and written data,
the block must be written back to main memory before the new cache
block arrives.

The SCU sends a command and an address back to the M-box when the
memory data is ready. The send takes approximately 26 cycles and is
followed, within a short period of time, by eight cycles of data
transfer. Each cycle transfers 8 bytes. The requested quadword is
returned first to respond to the requesting port during the first
cycle of the cache refill. On the eighth cycle of cache refill, the
tag store is updated.
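
The refill sequence can be pictured with a short Python sketch. The text states only that the requested quadword comes first and that eight 8-byte transfers fill the 64-byte block; continuing in wraparound order after the first quadword is an assumption made here for illustration, and the function name is hypothetical.

```python
BLOCK_BYTES = 64
QUADWORD_BYTES = 8


def refill_order(miss_address: int) -> list[int]:
    """Quadword addresses in refill order, requested quadword first.

    Wraparound order after the first quadword is assumed, not sourced.
    """
    block_base = miss_address & ~(BLOCK_BYTES - 1)
    first = (miss_address & (BLOCK_BYTES - 1)) // QUADWORD_BYTES
    count = BLOCK_BYTES // QUADWORD_BYTES
    return [
        block_base + ((first + i) % count) * QUADWORD_BYTES
        for i in range(count)
    ]
```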

The floating point functional unit is started in cycle six, as
specified by the fork RAM data. Both source operands are delivered,
and the microword indicates a SUBF operation. The floating point unit
requires two cycles to perform the SUBF operation. Unpacking and
alignment occur in the first cycle. The floating point unit signals
the issue control that the result will be available at the end of the
following cycle. The issue control enters the general-purpose
register R3 destination but must wait another cycle before beginning
retirement. If the next instruction required the floating point unit
and its operands were available, the instruction would be issued in
this cycle because the floating point unit is fully pipelined.

The second execution cycle occurs in cycle seven. The floating point
unit adds, normalizes, rounds, and packs. The result is latched in
the floating point unit at the end of the cycle, and the issue
control discards the top entry from the result queue to retire the
data.

In cycle eight, the retire multiplexer selects the floating point
unit result data and sends that data to the data distribution logic.
The data distribution logic holds the result, which will be written
into general-purpose register R3 in the register file during the next
cycle. The write is purposely delayed to permit it to be aborted if
an arithmetic fault occurs. Because the result is held in the data
distribution logic, it can be bypassed into the data path to act as a
source operand. The result is written into the register file at the
beginning of cycle nine.

VAX Instruction MULG3

The MULG3 instruction takes the G_format floating number addressed by
general-purpose register R5, multiplies it by the immediate constant
2345.675 from the instruction stream, which is also a G_format
number, and puts the result in general-purpose registers R9 and R10.
General-purpose register R5 also is incremented by eight as a side
effect of the specifier evaluation. The opcode is 2 bytes long, the
constant is a nine-byte immediate specifier, and the autoincrement
and register specifiers are each a single byte. Thus, the instruction
is encoded in 13 bytes.

Cycles One through Five

As in cycle one of the SUBF3 instruction, cycle one of the MULG3
instruction can either be a virtual instruction cache access cycle,
or part of the instruction already can be in the I-buffer and shifted
to the least significant byte as the previous instruction is shifted
out. For example, if the previous instruction is the SUBF3
#0.5,(R0)[R4],R3 in bytes <4:0> of the I-buffer, the first four bytes
of the MULG3 instruction could be in bytes <8:5>. The four remaining
bytes of the immediate specifier could be valid in the I-bex and the
rest of the instruction could be contained in the I-bex2. At the end
of cycle one, the first four bytes are shifted to the low four bytes
of the I-buffer. The next four bytes are merged from the I-bex to the
high four bytes of the I-buffer. The I-bex is now empty, and the
bytes in the I-bex2 can be loaded into the I-bex.

Because the MULG3 instruction has a 2-byte-long opcode, the only
decoding necessary in cycle two is to note the 2-byte length and
shift out the first byte so as to align the specifiers to be the same
as a single-byte opcode instruction. The specifiers are then in bytes
<1:8> of the I-buffer. As the first opcode byte (in this case, #FD)
is shifted out, the next valid byte in the I-bex is merged into byte
9 of the I-buffer, which leaves seven valid bytes in the I-bex.

Decoding really begins in cycle three. The fork address is sent to
the E-box, and bit <8> is set to indicate a 2-byte-long opcode. The
first five bytes of the immediate specifier are passed to the operand
processing unit. The first byte also is passed to the
register/pointer unit for source list allocation. The five bytes
shifted out of the I-buffer are replenished from the I-bex, which
leaves two valid bytes in the I-bex.

In cycle four, the register/pointer unit allocates the two entries in
the source list for the immediate G_floating number by passing a
source one pointer to the E-box and the tag to the operand processing
unit. The operand processing unit passes the first longword of the
immediate G_floating number to the unit's output buffer.

The next four bytes of the immediate are passed from the I-buffer to
the operand processing unit. The remaining two valid bytes from the
I-bex are merged into the I-buffer. The I-bex is then loaded with
eight bytes from the virtual instruction cache.

In cycle five, the autoincrement and register specifiers are decoded
and the remaining bytes of the instruction are shifted out. Five
bytes from the I-bex are merged with the four valid bytes in the
I-buffer. The autoincrement general-purpose register R5 is passed to
the operand processing unit and the register/pointer unit, which also
receives general-purpose register R9. The first longword of the
immediate specifier is passed from the operand processing unit output
buffer, through the I-box, to the source list entry allocated by the
register/pointer unit. The second longword is passed to the operand
processing unit output buffer.

The first microword is accessed and distributed throughout the E-box.
The microsequencer uses the fast fields of the microword to generate
the final control store address for this instruction. The
microinstruction is not issued because it requires two source
operands and the second source pointer is not yet available.
Cycle Six
In cycle six, the register/pointer unit allocates two source list
entries for the autoincrement specifier, passes this information to
the E-box in the source one pointer, and passes a tag to the operand
processing unit. The general-purpose register R9 is passed to the
E-box as the destination pointer.

The operand processing unit accesses general-purpose register R5 and
passes it, with a tag and a quadword read request, as an address to
the M-box. In parallel, the operand processing unit writes the value
of general-purpose register R5, incremented by eight bytes, into the
unit's output buffer. The second longword of the immediate specifier
is written to the source list at the relevant entry.

The operand processing unit sends the M-box a quadword read request
for the double-precision floating point operand. If the address is on
a quadword boundary, the front end of the M-box will not produce any
additional virtual addresses because the operand will not cross a
page boundary or a cache line boundary. If there is a miss in the
translation buffer for this reference, all other arbitration stops
and control is given to the state machine of the translation buffer
fix-up unit.

Bits <31:09> of the request are captured by the translation buffer's
fix-up unit in parallel with the translation buffer RAM's access to
achieve an early start on miss processing. The fork to the state
machine is sensitive to bits <31:30> of the virtual address.
Therefore, when a translation buffer miss occurs, a constrained
control word flow begins based on the values of bits <31:30>. Because
this is a user mode reference, the value is zero. Therefore, on the
first cycle following the translation buffer miss, the virtual page
number is compared against the P0 length register, P0LR. On the next
machine cycle, the P0BR (i.e., base register) is added to the virtual
page number to create the system virtual address of the process page
table entry. The fix-up unit acts the same as any other port into the
translation buffer, and makes a virtual read request with an aligned
longword context. The state machine is controlled by a microword that
branches to itself until one of three events occurs: a miss in the
translation buffer (the fix-up unit processes double misses), a
memory management fault, or a cache response. The cache response,
which is the event most likely to occur, signals the state machine to
return to idle and prepare for the next miss. Hardware control
external to the fix-up unit writes the entry into the translation
buffer, and the original request is retried. This time there is a
translation buffer hit, and the physical address is sent to the
cache. Single misses in the translation buffer require seven cycles
to process. A double miss requires 13 cycles, assuming data cache
hits occur.
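
The first two steps of the miss flow, the length check and the page table entry address calculation, might be sketched in Python as follows. The scaling of the virtual page number by four (page table entries are longwords in the VAX architecture) and the 21-bit width of a P0 page number are assumptions not stated in the text; the function and exception names are hypothetical.

```python
LONGWORD_BYTES = 4     # assumed: VAX page table entries are longwords
PAGE_OFFSET_BITS = 9   # virtual page number starts at VA bit 9


class LengthViolation(Exception):
    """Raised when the page number is outside the P0 page table."""


def p0_pte_address(virtual_address: int, p0br: int, p0lr: int) -> int:
    """System virtual address of the process PTE for a P0 address."""
    region = (virtual_address >> 30) & 0x3  # VA bits <31:30>
    if region != 0:
        raise ValueError("bits <31:30> are not zero: not a P0 address")
    vpn = (virtual_address >> PAGE_OFFSET_BITS) & 0x1FFFFF
    if vpn >= p0lr:  # first cycle: compare against the length register
        raise LengthViolation(f"page {vpn} is beyond P0LR {p0lr}")
    # next cycle: add the base register to the scaled page number
    return (p0br + vpn * LONGWORD_BYTES) & 0xFFFFFFFF
```

The address returned would then be presented to the translation buffer as an ordinary longword read, which is where a second (double) miss can occur.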

The issue control asserts the microword hold signal to force the
microword latches to hold the first microword until it can be
executed. The microsequencer regenerates the control store address of
the second microword each cycle until the execution stall ends.
Cycles Seven through Thirteen
Cycle seven is the data cache read cycle for the quadword operand
processing unit request that was translated in the previous cycle.
The VAX 9000 system has a 128KB data cache, with a block size of 64
bytes and access width of 8 bytes. The 64-bit access width matches
the 64-bit data path to the E-box, which was constructed to provide
high bandwidth for double-precision operand transfers. When a cache
hit results for the read of an aligned quadword, both the normal
response line and the quadword response signal are asserted to alert
the E-box that the M-box is sending a quadword of data.

In cycle seven, general-purpose register R5 of both the E-box and
I-box is written with the incremented value. In addition, both source
pointers and the first source operand are available to the issue
control. Because only the second operand is missing, the
microinstruction can be issued with bypass awaiting memory data.

The quadword operand is available to the M-box at the end of cycle
eight. The low longword is latched in the data distribution logic of
the E-box, and the high longword is held in the M-box.

In cycle nine, the quadword operand is written into the register file
at the two source list locations allocated by the operand processing
unit. However, the low longword is available as a source immediately.
The low longword of the short literal operand and the low longword of
the memory operand are passed to the multiply functional unit at the
start of cycle nine. The multiply unit performs the first cycle of
execution, which includes unpacking and multiplying the most
significant bits of the two operands. Issue control drops the
microword hold signal to allow the second microword to be latched. An
entry, which specifies general-purpose register R9 as the destination
for the low longword of the result, is made to the result queue. The
second microword is issued because the multiplier requires the next
half of each source operand and both are available from the register
file.

The microsequencer then attempts to generate a new control store
address from the next entry in the fork queue. If no new forks are
available, the microsequencer remains idle.

In the tenth cycle, the multiply unit receives the high longword of
both source operands. The second execution cycle is performed, which
includes unpacking and three simultaneous multiplications of the
appropriate combinations of the most and least significant bits of
the two operands. The multiplier signals the issue control that the
result will be available in the following cycle. The issue control
makes an entry, which specifies general-purpose register R10 as the
destination for the high longword of the result, in the result queue.
The multiply functional unit is fully pipelined and could be issued
in this cycle to start subsequent operations.

Cycle eleven is the third and final execution cycle. The multiplier
accumulates the four products it produced in the two previous cycles,
rounds, and packs the final double-precision result. The issue
control discards the top entry from the result queue to retire the
low longword of the result.
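
The three-cycle schedule (one product, then three products, then accumulation) corresponds to the familiar decomposition of a wide multiplication into four partial products. The Python sketch below shows that arithmetic on plain integers. How the hardware actually splits the fraction bits is not given in the text, so the 32-bit split point and the function name are assumptions; unpacking, rounding, and packing are left out.

```python
def split_multiply(a: int, b: int, half_bits: int = 32) -> int:
    """Multiply two unsigned integers using four partial products."""
    mask = (1 << half_bits) - 1
    a_hi, a_lo = a >> half_bits, a & mask
    b_hi, b_lo = b >> half_bits, b & mask

    # First execution cycle: most significant halves only.
    hi_hi = a_hi * b_hi

    # Second execution cycle: the three remaining combinations.
    hi_lo = a_hi * b_lo
    lo_hi = a_lo * b_hi
    lo_lo = a_lo * b_lo

    # Third execution cycle: accumulate the four products.
    return (
        (hi_hi << (2 * half_bits))
        + ((hi_lo + lo_hi) << half_bits)
        + lo_lo
    )
```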

In cycle twelve, the retire multiplexer selects the multiply unit
result data and sends it to the data distribution logic. The issue
control discards another entry from the result queue to retire the
high longword of the result. The low longword of the result is
written into the register file's general-purpose register R9 in cycle
thirteen. The high longword of the result is written into
general-purpose register R10 in the next cycle as the instruction is
completed.
VAX Instruction BBSC
The BBSC instruction tests a bit in memory, branches if the bit is
set, and clears the bit. BDATA is the base address in memory, and the
number 13 is the bit position offset. The majority of VAX field
instructions have a position offset of less than 64 bits. Therefore,
the VAX 9000 system's I-box prefetches the quadword addressed by the
base. As with all conditional branches, the result of the test is
predicted and the VAX 9000 system's I-box continues to fetch
instructions along the predicted path. The BBSC is encoded in eight
bytes: one for the opcode, one for the short literal position, five
for the base address (a 4-byte displacement off the program counter),
and one for the branch displacement.
Cycles One and Two
Cycle one for the BBSC can be fetching the instruction stream from
the virtual instruction cache, as described for cycle one of the
ADDL2 instruction, or the instruction already can be in the I-buffer
(e.g., bytes <8:3>) and the I-bex (i.e., bytes <7:6>) following the
MULG3 (i.e., bytes <2:0>). In the latter case, the BBSC instruction
is shifted into the lower bytes as the MULG3 instruction is shifted
out.

The decode of the BBSC begins with passing the short literal, number
13, to the short literal expansion unit and the program
counter-relative base address to the operand processing unit.
Information on both specifiers is passed to the register/pointer
unit. In this cycle, the fork address is also passed to the E-box.
The fork address is modified for field instructions if the base is a
register. Therefore, passing the fork address is delayed until the
base specifier is decoded. In this example, the base is decoded in
the cycle after the opcode is received. If the base is a register,
the field instruction takes a different microcode flow.

During cycle two, the decoder passes the program counter's decoder,
that is, the program counter of the instruction being decoded, to the
operand processing unit. The program counter is passed to the operand
processing unit and the E-box in the first decode cycle. Whenever a
specifier is passed to the operand processing unit, the XBAR also
sends a specifier offset delta. When the delta is added to the
program counter's decoder, the address of the last byte of the
specifier plus one is produced.

As the short literal and program counter-relative specifiers are
decoded, they are discarded from the I-buffer. The BBSC displacement
is shifted to the first byte of the I-buffer. The data arriving from
the cache is merged into bytes <8:2>, and the other byte is placed in
the I-bex.

The branch prediction unit begins operating during the first decode
cycle. A prediction for the branch must accompany the fork address
sent to the E-box. The prediction is made by using the program
counter to access a branch prediction cache and determine how the
branch behaved the last time it was decoded (i.e., one history bit).
If the branch is in the cache, the prediction is that the branch will
behave the same as the last time. If the branch is not in the cache,
a prediction is made based on the normal behavior of this conditional
branch. For example, a BEQL (58 percent) and a BBSC (73 percent)
normally do not branch, whereas a BNEQ (62 percent) normally
branches. If the BBSC instruction is in the cache and branched last
time, this information is indicated to the E-box, with the I-box
prediction given as true.
Cycle Three
In this cycle, the register/pointer unit allocates one entry in the
source list for the position specifier and three entries for the base
specifier. The unit then passes the source one, source two, and
destination pointers to the E-box.

In the operand processing unit, the address of the last byte of the
specifier plus one is first calculated using the program counter of
the instruction and the delta provided by the XBAR. The displacement
from the instruction is then added to this calculation. The result is
latched in the operand processing unit's output buffer and passed to
the M-box. The operand processing unit also passes a quadword, field
modify function, and the source list tag.
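
The two-step address calculation can be written as a small Python function. The names are hypothetical. The numbers in the usage note are taken from the example listing (BBSC at address 0095, a 4-byte displacement of 00000121) and are used only to illustrate the arithmetic; the text itself assumes a different starting address for the memory access.

```python
MASK32 = 0xFFFFFFFF


def pc_relative_address(instruction_pc: int, delta: int,
                        displacement: int) -> int:
    """Base address of a program counter-relative specifier.

    instruction_pc + delta is the address of the last byte of the
    specifier plus one; the displacement is then added to it.
    """
    pc_after_specifier = (instruction_pc + delta) & MASK32
    return (pc_after_specifier + displacement) & MASK32
```

For the listing values, the base specifier ends seven bytes into the instruction (opcode, short literal, mode byte, and four displacement bytes), so `pc_relative_address(0x0095, 7, 0x121)` evaluates to 0x1BD.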

The short literal expansion unit extends the size of the position
specifier to a longword and latches it in the unit's output buffer.
In this example, the extension is done with zeros. The XBAR passes
the branch displacement byte and an updated value of the program
counter's delta to the operand processing unit. The delta of the
program counter and the branch displacement are also sent to the
branch prediction unit as instruction lengths. The BBSC instruction
is completely decoded, and the opcode and displacement are discarded
from the I-buffer.

The branch prediction unit does most of its work during the last
decode cycle of a branch. For the majority of conditional branches,
the last decode cycle is also the first.

The branch prediction cache contains 1024 entries. Each entry has a
history bit, a 32-bit target program counter, a 6-bit instruction
length, and a 16-bit branch displacement and its tag. The entries are
addressed by bits 9 through 0 of the program counter's decoder. If
the tag matches bits <31:10> of the program counter's decoder, the
entry is assumed to be the entry, or a hit, for this branch.
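
A software model of this lookup, with one history bit per entry and a static fallback on a miss, might look like the Python sketch below. The index and tag bit ranges and the default directions for BEQL, BBSC, and BNEQ come from the text; the class layout, the method names, the update policy, and the not-taken default for other opcodes are assumptions. The instruction length and displacement fields are omitted for brevity.

```python
from dataclasses import dataclass
from typing import Optional

ENTRIES = 1024

# Static fallback on a miss, from the text: True means "predict taken".
DEFAULT_TAKEN = {"BEQL": False, "BBSC": False, "BNEQ": True}


@dataclass
class Entry:
    tag: int          # program counter bits <31:10>
    taken_last: bool  # the single history bit
    target_pc: int    # 32-bit target program counter


class BranchPredictionCache:
    def __init__(self) -> None:
        self.entries: list[Optional[Entry]] = [None] * ENTRIES

    @staticmethod
    def _split(pc: int) -> tuple[int, int]:
        """Return (index from bits <9:0>, tag from bits <31:10>)."""
        return pc & 0x3FF, (pc >> 10) & 0x3FFFFF

    def predict(self, pc: int, opcode: str) -> tuple[bool, Optional[int]]:
        """Return (predict taken?, cached target if predicted taken)."""
        index, tag = self._split(pc)
        entry = self.entries[index]
        if entry is not None and entry.tag == tag:  # hit
            target = entry.target_pc if entry.taken_last else None
            return entry.taken_last, target
        return DEFAULT_TAKEN.get(opcode, False), None  # miss

    def update(self, pc: int, taken: bool, target_pc: int) -> None:
        """Record the resolved outcome (replacement policy assumed)."""
        index, tag = self._split(pc)
        self.entries[index] = Entry(tag, taken, target_pc)
```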

If a hit occurs and the history bit shows that the branch was not
taken last time, the branch prediction unit latches this state
information and allows the subsequent instruction stream to be
decoded. The operand processing unit produces the target address as
soon as it is not busy. The target address must be stored in the
program counter's unwind buffer in case the prediction is incorrect.
The E-box indicates the correctness of the prediction as soon as
possible. For simple branches, the E-box could indicate that the
prediction is incorrect before the branch is fully decoded.

If a hit occurs but the history bit shows that the branch was taken
last time, the branch prediction unit latches this state information
and stops the decoding of the subsequent instruction stream by
clearing the I-buffer and the I-bex. The program counter of the
subsequent instruction is stored in the program counter's unwind
buffer. The program counter's target address, which is received from
the branch prediction unit cache, is passed to the program counter's
prefetch buffer. The target address that is later provided by the
operand processing unit may be discarded. The branch displacement and
instruction length from the branch prediction cache are latched. For
the following discussion on the remaining cycles in the BBSC
instruction, we have assumed that the BBSC instruction is a branch
prediction hit and that the branch was taken the last time decoding
occurred.
Cycle Four
In cycle four, both the operand processing and short literal
expansion units contain data to be passed to the source list. The
operand processing unit normally has the higher priority of the two.
Therefore, the short literal expansion unit will stall.

The operand processing unit passes the base address to the source
list through the I-box. In the operand processing unit, the new delta
of the program counter is added to the program counter, the sign of
the branch's displacement is extended from a byte to 32 bits, and the
two are added to produce the new target address. The result is
latched in the operand processing unit output buffer.
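
The target address arithmetic can be checked with a few lines of Python. The function names are hypothetical, and the sample values in the usage note come from the example listing (BBSC at 0095, eight bytes long, displacement byte E3).

```python
def sign_extend_byte(value: int) -> int:
    """Extend the sign of an 8-bit displacement to a Python integer."""
    value &= 0xFF
    return value - 0x100 if value & 0x80 else value


def branch_target(instruction_pc: int, instruction_length: int,
                  displacement_byte: int) -> int:
    """Add the sign-extended displacement to the updated program counter."""
    next_pc = instruction_pc + instruction_length
    return (next_pc + sign_extend_byte(displacement_byte)) & 0xFFFFFFFF
```

With the listing values, `branch_target(0x0095, 8, 0xE3)` gives 0x0080, the address of the label 1$ on the ADDL2 instruction.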

The virtual instruction cache is accessed for the target instruction.
If the instruction is in the virtual instruction cache, it is passed
to the I-buffer. However, there is a gap in the pipeline because no
instruction can be decoded this cycle.

The displacement and instruction length from the branch cache are
compared with the actual displacement and instruction length.
Normally, these lengths match. However, if they are different, the
target address from the branch prediction unit cache is probably
incorrect. The fetching and decoding of instructions must wait until
the operand processing unit provides the correct address.

At the start of cycle four, the M-box receives a request from the
operand processing unit. This request differs from all requests
previously described in that it contains a command that gets special
treatment in the M-box. The command is an "opu read with write check
no block."

The command is used because the VAX 9000 CPU contains an optimization
that enhances the performance of bit field instructions. With this
command, the operand processing unit prefetches a quadword of data,
starting from the address pointed to by the base, without looking at
the value of the position operand. Hopefully, the majority of bit
fields are within 64 bits of the base. The special command tells the
M-box that if a fault should occur, it should pass the fault, with an
operand, to the E-box and not close down the operand processing unit
port or put a lock on the fault parameters. The command is for an
unaligned quadword operand and, as such, requires that the M-box
produce additional virtual addresses to correctly access the cache. A
quadword is unaligned when bits <2:0> are nonzero. For this example,
we have assumed that the starting address is xxxxxxx1.

Specialized hardware in the front end of the M-box detects if the
starting address requires sequencing (i.e., the addition of a
constant of 4 to the current address) and how many sequenced
addresses are necessary. In this case, three addresses are required.
The first is the starting address (i.e., addr = xxxxxxx1), which is
received from the operand processing unit. As the starting address is
accessing the translation buffer, a constant of 4 is added and the
sequence port requests a virtual address (i.e., addr = xxxxxxx5) from
the translation buffer at the start of cycle five.

The issue control uses the fork RAM data to determine that the
integer unit and two source operands are required. Because only the
first operand is missing from the source list, the instruction is
issued with bypass. The microsequencer generates the second control
store address based on the fast access fields of the first microword.

Cycle Five

Decoding the target instruction stream begins in cycle five. The
operand processing unit sends the target address to the branch
prediction unit through the program counter's target address.
However, as [...] tion or for subsequent branches to be decoded. The
unit predicts a maximum of three branches before it stalls decoding
to resolve the first branch.

As the address xxxxxxx5 is accessing the translation buffer, the
final address is produced by adding 4, which makes a translation
buffer request (i.e., addr = xxxxxxx9) through the sequencer port in
cycle six. The three translation buffer accesses are contiguous and
interruptible. Data alignment is performed by the M-box, but the
alignment is constrained to longwords. When an unaligned quadword is
detected, the front end of the M-box alters the context field that it
passes to the data cache unit. The quadword request is effectively
broken into two unaligned longwords, which are properly rotated into
the low longword of the quadword interface and sent to the E-box
independently.

Cycle five is the data cache read cycle for the first unaligned
longword. Because the starting address is xxxxxxx1, the entire
longword is contained in the cache line. Therefore, one additional
rotation cycle is all that is required before the data is sent to the
E-box. The M-box pipe is effectively lengthened by a cycle when it is
performing unaligned operations. Because cycle five is a data cache
read cycle, no response is issued to the E-box. In addition to the
data cache read, the physical address is placed in the write queue. A
memory write is required after the bit is tested. A status bit for a
new quadword is set in the write queue. The new quadword indicates
that this is the starting address of an operand and writes should not
take place until an entry appears in the write queue with a last bit
assertion.

Because the first operand is written into the source list, the
operand is available to the integer unit at the start of cycle six.
The microword hold signal is asserted to hold the first microword
during the stall. The microsequencer regenerates the control store
address of the second microword.

Cycles Six through Nine

In cycle six, the data cache is read again with address xxxxxxx5,
which is the same cache line read in cycle five. However, because the
context is a longword, one additional byte of data must be read from
the cache to satisfy the request. Also, in cycle six, rotation of the
data read in cycle five is
noted earlier, the target address sent is discarded.
completed, and the M -box responds to the E-box.
Because t he operand processing unit does not use
Finally, address xxxxxxx 5 is placed in the write
the 1-box data register, the short l itera l expansion
queue.
unit can pass the short literal to the source Jis t .
By using source pointers from the source queue,
T h e branch prediction u n i t now waits either for
the position and base address operands are selected
the E-box to indicate the correctness of the predic-
by the fork RAM and passed to the i nteger u n i t . If
38
of interest from the cache read in cycle seven to the correct position. No response is issued to the E-box because this unaligned reference requires two data cache reads to fulfill. The address xxxxxxx9 and the last bit are inserted into the write queue. The M-box delivers the required longword, and execution begins immediately. The second execution cycle calculates the target byte address. The position, divided by eight, is added to the base address. The microsequencer generates the fourth control store address by using the next address field of the microword. No operands are selected for the next cycle, and the next instruction is issued normally.

Cycle eight is a rotation-only cycle. The one byte <8> of interest, read from the cache in the previous cycle, is rotated into the correct position (i.e., byte <0:3>), and the M-box sends the data to the E-box by issuing a response.

The third execution cycle uses the bit position to set up the special encoder in the integer unit and clear the appropriate bit. The source two register file pointer is incremented again to select the high longword from the source list. This microword branches on three conditions determined by hardware functions. The first condition indicates if the low longword of the prefetched field has a page fault. If a fault does exist, the microword flow checks whether the longword is needed or not. As noted earlier, the longword was prefetched in the hope that the bit position was within the first 64 bits of the base. If the bit is not within the first longword, the page fault can be disregarded. The second branch checks whether the position is greater than (l_) bits. If it is greater, the microcode

next cycle is issued normally.

Cycles Ten through Fifteen
In cycle ten, the E-box initiates a byte write to the M-box. Data is passed to the M-box, and the appropriate byte is shifted to the low byte location. The sixth and final microinstruction is issued normally.

In cycle eleven, the M-box receives an explicit E-box write request to retire the BBSC instruction with a memory write. Explicit writes differ from writes initiated by the I-box in that the E-box supplies a virtual address with the data, whereas the I-box provides a virtual address and the E-box subsequently provides the data for I-box writes. However, three entries exist in the write queue for the prefetched quadword. These entries were placed in the queue for memory conflict-checking purposes and cannot be used for writing purposes because only a byte of data is being written and not a quadword. The write field command from the E-box forces the write queue control to discard the three entries. The front end of the E-box accesses the translation buffer and checks for write success during this cycle. If the write is successful, the physical address and the context of the byte are sent to the data cache.

The final execution cycle determines if the branch prediction was correct. The bit specified by the correct position is shifted to the least significant position in the shifter, where it can be used for a macrobranch comparison. The macrobranch result is compared to the I-box branch prediction in cycle twelve. The microword also indicates that the microsequencer should start forking for new macroinstructions.
Cycle twelve is the data cache lookup cycle for the byte-write operation. The data size is less than a longword. Therefore, the byte that is to be written must be merged with the seven unaffected bytes of the cache line.

Two signals are sent to inform the I-box of the branch prediction status. The branch valid signal indicates that a branch prediction validation has occurred, and the branch signal indicates if the validation was correct.

The branch prediction logic receives the branch valid signal. If the prediction was correct, the pro

E-box. This process evens the flow through the pipeline and keeps the E-box busy. Figure 6 illustrates the code block as it moves down the pipe.

The first stage is the virtual instruction cache access, or fetch, stage as the instruction is read from the virtual instruction cache. Some instructions do not need an actual virtual instruction cache access but are in the I-buffer from a previous virtual instruction cache fetch. The instruction decode takes place in the decode, or XBAR, stage. The I-buffer is shifted and the fork RAM

COMPCON '90 (San Francisco: Spring 1990): 44-53.
8. S. Mishra, "The VAX 8800 Microarchitecture," Digital Technical Journal, vol. 1, no. 4 (February 1987): 20-33.
3. T. Leonard, VAX Architecture Reference Manual
7. T.
Matthew J. Adiletta
Richard L. Doucette
John H. Hackenberg
Dale H. Leuthold
Dennis M. Litwinetz
Semiconductor Technology
in a High-performance
VAX System
The VAX 9000 system is the newest member of Digital's VAX family of computer systems. The 9000 is a high-performance ECL processor, with a very fast, 16-nanosecond cycle time. To achieve this high level of performance, a new generation of semicustom and custom integrated circuits was required for the scalar CPU and the vector processing option. Goals for circuit density, performance, and skew maintenance were fulfilled with the development of a high-speed gate array, special custom chips used in key applications, and a high-speed RAM employing a new architecture.
The semiconductor requirements for the VAX 9000 system posed a number of challenges for Digital's Integrated Circuits Development Group. Those requirements included a tremendous number of equivalent logic gates (1,037,400 gates) and a large amount of RAM in the processor (3,280,000 bits). Moreover, the project's performance goal of over 30 VAX-11/780 units of performance (VUPs) required the development of state-of-the-art semiconductors and the use of innovative techniques to design them.

Given the project's goals, the IC technologists evaluated several competing semiconductor technologies and decided to implement most of the logic within the 9000 system in a high-speed, high-density, 10,000-gate array. The gate array provides a broad range of speed and power-dissipation options. Working with Motorola, the IC Group first engineered the base 10,000-gate macrocell array (MCA), which is implemented in Motorola's MOSAIC III process. Logic engineers then designed the 77 different gate array chips (options) on the base array, using a rich library of logic functions and a set of automated place and route tools. Additionally, they designed five custom chips, invented a fast cycle time, self-timed random access memory (STRAM) architecture, and designed a multichip unit to interconnect all these high-performance ICs.1

Four different design methods were used to implement the chips. The MCAx chips employ a gate array design technique. The CDxx, the VRGx, and the STRAM chips required a full custom approach. The STGx chip was implemented using a silicon compiler technique. The MULx and DIVx chips were implemented using a standard cell design approach. Statistics on 9000 system chip design are given in Table 1.

This paper describes the VAX 9000 MCA III gate array, the development of each of the five custom chips, and the STRAM architecture. Before our discussion of the gate array, we present a brief overview of the semiconductor technology used to fabricate the array and the custom chips.
Semiconductor Technology
In 1985, the VAX 8800 series was Digital's largest and most powerful system, offering single-CPU performance of eight VUPs. The 8800 CPU logic was Motorola's Macrocell Array I (MCA I) gate array, which was fabricated in MOSAIC I bipolar technology. In comparison, the VAX 9000 goal of 30 VUPs was aggressive, and the IC Group realized a new semiconductor technology was required.

At the start of the project, the technologists evaluated semiconductor vendors to determine what was the "best" technology available to implement the new system. CMOS, BiCMOS, bipolar, and GaAs IC technologies were evaluated. Among the factors considered were logic density, gate delays, on- and off-chip interconnect delays, manufacturing risks, and product delivery.
Table 1  VAX 9000 Chip Statistics

Chip   Description                   Die Size        Signal  Transistor  RAM     Power
                                     (Millimeters)   Pins    Count       Bits    (Watts)
MCAx   MCA III gate array chip       9.8 x 9.8       256     40.1K               30
CDxx   Clock distribution chip       6.2 x 6.2       170     7.2K                13.9
STGx   Self-timed register file chip 9.8 x 9.8       152     29.3K               17.8
MULx   Multiplication chip           9.8 x 9.8       182     48.4K               30.9
DIVx   Division chip                 9.8 x 9.8       112     29.2K               23.9
VRGx   Vector register file chip     9.8 x 9.8       198     76.0K       9216    24.9
1KSR   1K x 4 self-timed RAM         4.9 x 3.6       33      28.0K       4096    2.4
4KSR   4K x 4 self-timed RAM         6.4 x 4.2       35      103.0K      16384   2.4

Although very high gate densities were available with CMOS technology, the logic gate delays proved to be too slow to meet the cycle time requirement. Also, the CMOS output circuits could not drive signals off-chip into a 50-ohm transmission line as quickly as a bipolar transistor, which limited the speed of signals between ICs.

BiCMOS offers the advantage of highly dense CMOS coupled with bipolar drive capability. However, the technologies available at the time were optimized for the best CMOS transistors with a compromised bipolar device. This approach limited the overall performance of the circuit to a level roughly equivalent to that of previous generation bipolar devices, which would not be aggressive enough to meet the CPU performance needs.

Gallium arsenide (GaAs) ICs offer a theoretical performance advantage of between two and three to one over silicon implementations. The group found IC densities were lower than those of bipolar devices, however, and the on-chip speed advantage was countered by the need for more off-chip signals in the critical paths of the CPU. Also, because the manufacturing technology of GaAs ICs was immature, very few companies had attempted to sell GaAs into the commercial marketplace. So while this technology was considered for a time in some applications where alternatives also existed, GaAs was eventually dropped from consideration because of the uncertainty of availability.

The IC Group also studied Motorola's third generation of their oxide-isolated self-aligned implanted circuits (MOSAIC III) bipolar technology.2 It offered a factor of six in speed advantage over the previously used MOSAIC I technology and had the potential of providing eight to ten times the logic density. Although not as dense as CMOS or BiCMOS, MOSAIC III was much faster than either of those technologies and much denser than any available GaAs technology. In addition, although many of the manufacturing steps were new, most of them were based on previously proven techniques. The group therefore concluded that MOSAIC III was best suited to meet the challenges of the VAX 9000 system.
The MOSAIC III process is an advanced silicon bipolar process which yields a transistor structure with polysilicon base, emitter, and collector electrodes, polysilicon resistors, and three layers of metalization. Compared to the MOSAIC I device used in the 8800, the critical collector-base junction of this transistor structure takes up approximately 50 percent less area, as shown in Figure 1. Together with shallower junctions and reduced base resistance, this improved the intrinsic device performance by a factor of three. Further, the polysilicon resistor produced with this process has far lower parasitic capacitance than the MOSAIC I monosilicon resistor. Some key performance modeling parameters and density metrics are provided with the figure.

The VAX 9000 packaging imposed other requirements on the semiconductor technology. Power dissipation increased from 5 watts for the MCA I to 30 watts for the MCA III because of the increase in gate density from 1,200 to 10,000 gates. Therefore it was determined that all chips should be mounted directly to the multichip unit cold plate for optimum cooling. For manufacturing economy, it was desirable to bond the multiple leads of the chip directly to the pads on the high-density signal carrier (HDSC). Consequently, all CPU chips must be provided to the multichip unit assembly site in a tape automated bond (TAB) package. As shown in Figure 2, chips are mounted in a plastic carrier suitable for automated handling, and the surface of the die is protected from mechanical damage with an epoxy encapsulant.
Figure 1  Comparison of MOSAIC III and MOSAIC I Devices

Parameter             MOSAIC I       MOSAIC III
NPN fT                5 GHz          16 GHz
R0                    1475 ohms      400 ohms
CJC                   50 fF          20 fF
CJE                   45 fF          24 fF
CJS                   185 fF         54 fF
Drawn emitter size    3 µm x 4 µm    1.75 µm x 4 µm
Metal 1 pitch         8 µm           4.5 µm
Metal 2 pitch         15 µm          7 µm
Metal 3 pitch                        12 µm

MCA 10K Gate Array
A high-performance emitter coupled logic (ECL) gate array with 10,000 equivalent gates and 256 inputs/outputs has been developed for the VAX 9000 system. The gate array design approach used in the VAX 9000 system ensures the shortest possible turnaround time from option mask to hardware, thereby reducing the system design time. In this approach, cell boundaries are defined with all transistors and resistors fixed within the cells. When a cell function is selected from a predefined cell library, the cell customization occurs at the metal between the transistors and resistors. Then, to define the function of the gate array option, the metalization between cells is customized. This approach allows the semiconductor foundry to build many wafers up to the customization level; when a gate array is to be built, only the custom metal is required. As noted above, 77 different 10K ECL gate array options are used in the VAX 9000 system. This gate array has a rich selection of logic cells with different power settings for the logic designers to use to meet performance and power requirements.

Using Rent's Rule, technologists maintained a balance between the number of gates and the package I/O count. This balance ensures that a maximum number of logic cells for a given signal pin count are available for the logic designers. Technologists evaluated several key factors to determine the gate array physical layout and to ensure its success:

• Area of the silicon chip versus yield
• I/O pad pitch
• Maximum power dissipation
• Speed of the gates
• Maximum number of logic cells

Successful trial layouts of the 10K ECL gate array floor plan were completed before any VAX 9000 options were started.

The gate array floor plan, shown in Figure 3, comprises a central core area of 414 major (M) cells, divisible into quarter cell functions, arranged in an array of 20 rows and 21 columns, less 6 sites for the master bias generators and special clock generator circuits. The number of transistors used in a quarter cell is based on the logic cell most frequently used in the 10K ECL gate array, the scan latch. A ring of 200 output (O) cells is interspersed with 224 interface (I) cells. The ring surrounds the internal cells and interfaces the pad drivers with the internal cells. The 256 I/O pad cells along with the 104 power pads are located around the perimeter of the 10K gate array. The metalization system uses three interconnect layers. The customized routing channels reside on the first and second metal layers with interconnecting vias between the two layers of metal. The top metal layer and parts of metal 1 and 2 provide power and ground distribution.
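The floor plan arithmetic and the Rent's Rule balance mentioned above can be checked with a few lines of code. The sketch below is illustrative only: Rent's Rule is taken in its usual form T = t * G^p, and the exponent p = 0.6 is an assumed value chosen for the example, not a figure given in this paper.

```python
ROWS, COLUMNS, RESERVED_SITES = 20, 21, 6  # values quoted in the text


def major_cell_count(rows: int = ROWS, columns: int = COLUMNS,
                     reserved: int = RESERVED_SITES) -> int:
    """Major (M) cells left after reserving sites for bias and clock generators."""
    return rows * columns - reserved


def rent_pins(gates: int, t: float, p: float) -> float:
    """Rent's Rule: signal pins T = t * G**p for G gates."""
    return t * gates ** p


def rent_coefficient(gates: int, pins: int, p: float) -> float:
    """Solve Rent's Rule for t, given a gate count, a pin count, and an exponent."""
    return pins / gates ** p


if __name__ == "__main__":
    print("major cells:", major_cell_count())
    print("quarter cells:", 4 * major_cell_count())
    # Assumed exponent; the paper does not state the Rent parameters used.
    t = rent_coefficient(gates=10_000, pins=256, p=0.6)
    print("implied Rent coefficient for p = 0.6:", round(t, 3))
```

With the 20 by 21 array and 6 reserved sites, the count comes to the 414 major cells quoted in the text.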
The 10K ECL gate array used in the VAX 9000 is approximately ten times more dense than the ECL gate array used in the VAX 8800 system. The gate delays in the 9000 are improved six times over gate delays in the VAX 8800. Table 2 compares the 10K ECL gate array used in the VAX 9000 to the ECL gate array used in the VAX 8800.

Previous gate array designs, in general, have provided only two levels of series gating, thereby limiting the complexity of functions that can be designed with one current switch. Within this gate array, three levels of series gating at both internal and output macrocells provide additional "AND" (product) gate functions at very high speed with one switch delay and at a lower power level. Figure 4 compares three-level series gating and two-level series gating for a "2-3-4-4 AND/OR" logic function (internal gate). Table 3 lists the differences in typical gate performance for a low power gate. The table also compares low power gate and high power gate. Notice the power difference between the two-level and three-level high power gate.
Table 2  Comparison of Number of Cells and Delays in the VAX 8800 and VAX 9000 Gate Arrays

                             VAX 8800            VAX 9000
                             Gate Array          Gate Array
Internal major cells         48                  414
Output cells                 26                  200
Input cells                  25                  224
Input cells gate delay       1.05 nanoseconds    175 picoseconds (high power)
Metal delay (fall delays)    2.6 picoseconds     1.3 picoseconds
                             per mil             per mil
All current switches within the array are powered from the main supply voltage VEE1. Three-level-series gated functions are implemented in the VAX 9000 gate array option, which requires VEE1 to be set to -5.2 V. Input cells are powered from a second, lower supply voltage VEE2 (-3.4 V) to save power. The output emitter followers of M, I, and O cells as well as series-terminated ECL (STECL) output followers employ constant current source pulldowns to VEE2 to save power. The constant current source pulldowns minimize the sensitivity of AC performance to variations in power supply. This same termination scheme was used in VAX 9000 custom chips.

One of the technologists' main goals was to minimize power consumption of each macrocell while obtaining the highest possible performance from the 10K ECL gate array. The overall 10K ECL gate array power is limited to 30 watts because of the cooling requirements, the internal power distribution, and the current density limits on power pins.

A unique feature included in the 10K ECL gate array that previous gate arrays do not have is series-terminated ECL (STECL) outputs. STECL outputs include a constant current source pulldown and a series terminating resistor. This feature allows the elimination of off-chip termination resistors used in conventional 50-ohm ECL outputs. STECL outputs allow shorter interconnections between chips on the multichip unit because the chips can be placed closer to each other, thus improving performance. Another advantage of using STECL outputs over 50-ohm outputs is that less than half of the simultaneous switching output noise is coupled to unswitched outputs. All custom chips used in the VAX 9000 employ STECL termination.

Table 3  Comparison of Two-level and Three-level Series Gating

                                 Two Levels          Three Levels
                                 of Gating           of Gating
Gate delay from input pin A
to output pin YA (low power)     300 picoseconds     250 picoseconds
Low power gate                   9.88 milliwatts     8.84 milliwatts
High power gate                  18.20 milliwatts    13.00 milliwatts

Figure 2  Chip in TAB Package Mounted on Plastic Carrier and Encapsulated

Figure 3  Photomicrograph of the Gate Array

Clock Distribution Chip - CDxx
The major function of the clock distribution chip (CDxx), shown in Figure 5, is to distribute master and reference clocks to each MCA on a multichip unit. There are eight pairs of differential master and reference clocks. The chip also supplies clocks to all STRAMs on the unit. Each of the STRAM's four groups of six clocks can be programmed to one of eight possible clock phases. This flexibility in programming allows the system designer to select the appropriate clocks for STRAMs in order to meet system timing requirements.

In addition to providing the functions above, the design goals for the CDxx project included the following:

• Minimize the space occupied by the chip on the multichip unit
• Provide scan control and scan distribution
• Include a wideband amplifier
• Ensure low clock skew
• Provide a temperature-detecting circuit
Figure 4  Two-level Functions versus Three-level Functions

Figure 5  Photomicrograph of CDxx Chip
Minimizing the real estate occupied by the chip was complicated by additional functions located on the CDxx, such as scan and the temperature detecting circuits. The minimization was accomplished by employing a custom chip design approach in which each element (cell) is optimized and then manually placed and routed to achieve a compact design. As it turned out, the size of the chip was not determined by the amount of real estate needed to implement the circuits, but rather by the number of pins required to communicate to the rest of the multichip unit.

Since a CDxx is mounted on every multichip unit in the CPU, the scan distribution and control logic are located on this chip. The CDxx chips in the system are chained together on the system scan bus. Each CDxx receives its scan control signals from the previous CDxx in the chain or from the service processor. As shown in Figure 5, there are three scan rings located on the CDxx. Ring 12 is a 16-bit ring reserved for the CDxx STRAM clock generation control ring. This ring controls the STRAM clock phase selection and enable for each of the four STRAM clock groups. Ring 13 is a 14-bit ring reserved for the CDxx scan control. Data is shifted into this ring and then loaded into CDxx control registers. Ring 14 is a 47-bit ring reserved for the CDxx information scan ring. Data is loaded into this ring from CDxx data registers and shifted out to the service processor.

The design of the wideband amplifier was prompted by the need for the clock distribution chip to receive two differential sinusoidal master and reference clock signals as inputs. These signals are transformer coupled from the clock source. The master clock runs at one eighth the system cycle time, and the reference clock runs at the system cycle time. The wideband amplifier receives differential sinusoidal signals of relatively small amplitude (less than 125 millivolts peak to peak) and transforms them to 100K ECL levels on output. The design of the input circuits meets these criteria and typically functions with inputs less than 65 millivolts.
All the clocks are distributed by the CDxx as pairs of differential signals. The distribution of these clocks is, of course, to be done with minimal clock skew. Clock skew is the difference in delay time between different clock outputs measured from a common point. The common point in this case is the number of master clock inputs to the chip. To maintain low clock skew, technologists designed fast gates and minimized the number of cascaded gates in the clock path. Also, all the metal that interconnects the cells in the clock path is controlled for equal delay. As a result, the measured clock skew is less than 100 picoseconds on a chip for master, reference, and STRAM clocks. The delay of master clock input to output is less than 1 nanosecond (ns).
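The skew definition above reduces to a simple calculation: measure each output's delay from the common point and take the spread. The sketch below shows that calculation; the delay values and function names are invented for illustration and are not measurements from the CDxx.

```python
from typing import Iterable

SKEW_LIMIT_PS = 100.0  # on-chip skew bound quoted in the text


def clock_skew_ps(output_delays_ps: Iterable[float]) -> float:
    """Skew = largest minus smallest delay, all measured from one common point."""
    delays = list(output_delays_ps)
    if not delays:
        raise ValueError("at least one clock output delay is required")
    return max(delays) - min(delays)


def meets_skew_limit(output_delays_ps: Iterable[float],
                     limit_ps: float = SKEW_LIMIT_PS) -> bool:
    """True when the spread of output delays is below the given limit."""
    return clock_skew_ps(output_delays_ps) < limit_ps


if __name__ == "__main__":
    # Hypothetical input-to-output delays for four clock outputs, in picoseconds.
    delays = [910, 955, 980, 930]
    print("skew (ps):", clock_skew_ps(delays))
    print("within limit:", meets_skew_limit(delays))
```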
The temperature-detecting circuit on the CDxx warns the system when a device junction temperature approaches the maximum allowed temperature on a multichip unit. As implemented, the circuit is controlled from the system console. The console loads the CDxx with a number that represents the temperature the circuit must use as a point of comparison. If the junction temperature of the CDxx is higher than the programmed value, the circuit trips and notifies the console of a temperature problem. The console then takes corrective action.
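The console-programmed comparison described above can be modeled in a few lines. This is a behavioral sketch only: the class name, the method names, and the use of degrees Celsius are assumptions made for the example, and the actual encoding of the number loaded by the console is not given in the paper.

```python
from typing import Optional


class TemperatureDetector:
    """Behavioral model of a programmable over-temperature comparator."""

    def __init__(self) -> None:
        self.threshold_c: Optional[float] = None  # nothing loaded yet

    def load_threshold(self, threshold_c: float) -> None:
        """Console loads the comparison point (hypothetical units: degrees C)."""
        self.threshold_c = threshold_c

    def tripped(self, junction_temp_c: float) -> bool:
        """Trip when the junction temperature is higher than the programmed value."""
        if self.threshold_c is None:
            raise RuntimeError("no comparison value has been loaded")
        return junction_temp_c > self.threshold_c


if __name__ == "__main__":
    detector = TemperatureDetector()
    detector.load_threshold(85.0)   # assumed value for illustration
    for temp in (70.0, 85.0, 92.5):
        print(temp, "->", "notify console" if detector.tripped(temp) else "ok")
```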
Self-timed Register File Chip - STGx
The self-timed register file chip (STGx) is employed in the VAX 9000 to provide four register banks accessible through multiple read and write ports. The four banks include a microcode scratch-pad register bank, the VAX general-purpose register set, a memory data register storage bank, and an instruction data register bank. The performance requirements for the STGx were quite rigid and guided several key design decisions, including density and layout. The read access time was to be less than 5 ns. The write access time was to be less than 6 ns. In other words, the chip must read or write any one of its 64 locations in 5 or 6 ns, respectively. Both goals have been met. In fact, the read access time is typically less than 4 ns, and the write time is typically less than 5 ns. Figure 6 is a photomicrograph of the STGx chip.

The STGx is a 64-word by 18-bit ECL register file containing three write ports and two read ports. The 64 words are separated into four 16-word by 18-bit storage array sections. Each of the four storage banks has dual read capability. Storage bank one has dual write capability; storage banks two and three have triple write capability; and storage bank four has single write capability. Simultaneous write access to the array is possible through all ports with correct results occurring; the only exception is in the case of writes to the same location from multiple ports, which is an undefined operation. A write followed by a read access to the array, even to the same address, is possible with correct results occurring. The chip has two clock inputs for controlling reads and writes.
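The port and bank rules above can be captured in a small behavioral model. The sketch below is hypothetical: how the three write ports are wired to the banks is not specified in the text, so the model simply limits the number of simultaneous writes per bank, and it raises an error for same-location writes, which the paper only describes as undefined.

```python
from collections import Counter

WORDS, WIDTH, BANK_SIZE = 64, 18, 16
WRITE_PORTS = 3
WRITES_ALLOWED_PER_BANK = (2, 3, 3, 1)  # banks one through four, per the text
WORD_MASK = (1 << WIDTH) - 1


class RegisterFileModel:
    """Behavioral model of a 64-word by 18-bit, multi-ported register file."""

    def __init__(self) -> None:
        self.words = [0] * WORDS

    def read(self, addr_a: int, addr_b: int) -> tuple[int, int]:
        """Two read ports; every bank supports dual reads."""
        return self.words[addr_a], self.words[addr_b]

    def write_cycle(self, writes: list[tuple[int, int]]) -> None:
        """Apply up to three (address, value) writes in one cycle."""
        if len(writes) > WRITE_PORTS:
            raise ValueError("only three write ports exist")
        addresses = [addr for addr, _ in writes]
        if len(set(addresses)) != len(addresses):
            # Undefined in hardware; the model refuses rather than guessing.
            raise ValueError("multiple writes to the same location are undefined")
        per_bank = Counter(addr // BANK_SIZE for addr in addresses)
        for bank, count in per_bank.items():
            if count > WRITES_ALLOWED_PER_BANK[bank]:
                raise ValueError(f"bank {bank + 1} cannot accept {count} writes")
        for addr, value in writes:
            self.words[addr] = value & WORD_MASK
```

A read issued after `write_cycle` returns the newly written value, which corresponds to the write-followed-by-read case the text describes as producing correct results.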
One requirement for the design was to include a self-timed write capability so that the system need not provide properly timed write pulses to the chip. In the system, the chip is clocked with STRAM clocks for reading and writing. The design uses these clocks to latch read address information, to latch write address information, and to latch input data. In addition, the design takes the leading edge of the write clock to generate a delayed write pulse. The delayed write pulse is used to write the appropriate word in the 64-word by 18-bit array, taking into account the time needed to decode the write address.

The design style used to implement the self-timed register file chip is similar to a silicon compiler technique. The chip's storage area is made up of four arrays. The input address register for both read and write ports, the input data latches, and the data output drivers are arrangements of cells in strips. The placement and routing of these arrays and strips was procedurally performed using custom layout tools. Once the blocks were assembled and placed, interconnections among blocks, strips, and pins were then routed manually.
Figure 6  Photomicrograph of STGx Chip

Multiplication Chip - MULx
The architecture of the scalar processor defined an integrated floating point processor. Unlike most RISC processors, which off-load all floating point operations to a separate floating point processor, the VAX 9000 system handles floating point operations within the E-box.1 The multiplication unit therefore supports both integer and floating point formats. To achieve this support, a custom chip was required that provided superior performance, special logic gates, and improved density. Custom chip technology provided enough density to accommodate a 32-bit by 32-bit, eight-logic-level multiplication array in a single chip (MULx). To minimize the cost and time of custom design, designers employed standard cell design techniques in which the cell height was fixed and the width could vary to take advantage of packing density. By constraining the design in this fashion, the High Performance Systems Group's CAD suite could be employed to place and route the chip. Special logic gates eliminated three logic levels, and high-powered fast gates provided the performance to permit a 32-bit by 32-bit multiply operation in less than 9 ns. Figure 7 shows a photomicrograph of the MULx chip.
Three MULx chips were required in the scalar processor to achieve
double-precision performance in which every 64 ns a 56-bit
multiplication could complete. Each MULx chip has two 32-bit input data
buses. The MULx chip is also employed to perform all integer multiply
operations in a single 16-ns cycle.

The scalar processor, which has 32-bit-wide data paths, delivers
double-precision input data in two cycles. In the first cycle, each MULx
consumes the most significant high bits of each operand. All three MULx
chips latch this data while also unpacking it, multiplying it, and then
latching the product. One of the MULx chips' results is then saved. In
the second cycle, the remaining double-precision data, the least
significant low bits, is consumed, and each MULx chip unpacks the data
and performs a unique multiply: operand A high bits and operand B low
bits; operand A low bits and operand B high bits; and operand A low bits
and operand B low bits. An MCA III gate array accumulates all these
results, and another rounds and packs the bits into a VAX floating point
product. Since each MULx needs to know which partial product it must
compute in the second cycle, two personality bits are included that are
loaded by means of the system scan chain.

Figure 7  Photomicrograph of MULx Chip

MULx chips are also used in the vector processor. The vector processor
(V-box) has 64-bit-wide data paths. Four MULx chips are employed to
complete a double-precision multiply every 16 ns. Since the operand
unpacking differs between the scalar and vector processors as a result
of how fast operands are delivered, each MULx has an additional
personality bit for indicating whether the MULx is in the V-box or
E-box.

The MULx chip, as used in both the scalar and vector processors, is a
32-bit by 32-bit ECL parallel multiplier which is fully pipelined for a
16-ns cycle time. It performs both two's complement and sign/magnitude
multiplication. In a single cycle, the chip unpacks VAX floating point
formats F, D, and G, or integer formats long, word, and byte; performs
exponent calculations and sign handling; and completes up to a 32-bit by
32-bit multiplication.

If the operation is double precision, the 64-bit result is a partial
result. It must be accumulated with three other partial results to form
the double-precision, correctly rounded, and normalized product.
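The way four 32-bit by 32-bit partial products combine into one wide product can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `multiply_by_halves` and the choice of a 32-bit split point are assumptions made for the example, and it models the fraction arithmetic alone, not the unpacking, rounding, or normalization performed by the hardware.

```python
MASK32 = (1 << 32) - 1  # selects the low 32 bits of an operand


def multiply_by_halves(a, b):
    """Multiply two operands of up to 64 bits using four 32x32 multiplies."""
    a_hi, a_lo = a >> 32, a & MASK32
    b_hi, b_lo = b >> 32, b & MASK32

    # The four partial products named in the text.
    hh = a_hi * b_hi  # high bits times high bits
    hl = a_hi * b_lo  # A high bits times B low bits
    lh = a_lo * b_hi  # A low bits times B high bits
    ll = a_lo * b_lo  # low bits times low bits

    # Accumulate the partial products with their binary weights.
    return (hh << 64) + ((hl + lh) << 32) + ll


if __name__ == "__main__":
    a = 0x00FFFFFFFFFFFFFF  # a 56-bit fraction, as in VAX D-format
    b = 0x0080000000000001
    print(multiply_by_halves(a, b) == a * b)
```

Each of the four multiplies fits a 32-bit by 32-bit array, which is why one multiplier design can serve both the three-chip, two-cycle scalar arrangement and the four-chip vector arrangement.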
If the operation is an integer type, then the 64-bit two's complement
result is the VAX integer product. Along with producing this integer
product, MULx also produces the correct condition codes. Integer
operations require one machine cycle to complete. Operands are not
latched at input. Instead they are immediately unpacked and sent to the
multiplication array. This multipurpose array then produces a set of sum
and carry product vectors. These vectors are then added in a full carry
lookahead adder (CLA). This adder comprises a 31-bit adder and a 32-bit
adder, cascaded. The produced sum is the 64-bit product, which is then
latched. The output of the latch is used to compute integer-type
condition codes.
The integer instructions supported include VAX MULB, MULW, and MULL.
EMUL is also directly supported, along with the Z and N bit condition
codes. Finally, to assist in H format-type multiplications, a true
32-bit by 32-bit magnitude multiplication is also supported, called
EXTMUL (extended multiply). There is a 64-bit data path back into the
E-box for EMUL- and EXTMUL-type operations.
Six features of the MULx design that improve performance and minimize
logic should be noted. First, unlike traditional designs, the MULx
design does not include Booth recoding of the multiplier operand. Booth
recoding offers no savings in either timing or real estate when the
multiplication array reduction scheme is optimal. Second, a Baugh-Wooley
two's complement algorithm was used to implement integer
multiplication.4 Third, engineers designed special full adder logic
gates to integrate multiplication summand generation into the full adder
cell and to eliminate the need for an additional logic level. Fourth, a
unique multiplication reduction algorithm was developed which provides
the initial routing advantages of a Wallace tree, with the minimal logic
of a Dadda tree.5,6 Fifth, a ripple is formed in the reduction array.
The ripple facilitates the start of the least significant 31-bit CLA
addition at least one logic level sooner than the most significant 32
bits and does not require a carry-in input to the upper 32-bit adder.
Finally, by developing a very fast 4-3-2-1 AND/OR gate, engineers were
able to remove two additional logic levels in both CLA adder networks.
Bugs in an array consisting of 1000 full adders could have significantly
affected the product shipment schedule. To avoid them, engineers
developed a FORTRAN program to logically interconnect and physically
place the array. Any bugs would be algorithmic and not random, and
algorithmic bugs should be obvious. In addition, by algorithmically
placing the array, significant density improvements were realized. This
program provides a Wallace-Dadda implementation that logically reduces
32 rows in 8 logic levels, and consumes as many initial summand bits. It
also uses the fewest full adders theoretically possible, while
delivering the least significant 32 bits of sum and carries at least one
full logic level sooner than the most significant bits.
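The figure of 8 levels for 32 rows can be checked with a short calculation. The Python sketch below is not the FORTRAN placement program described in the text; it only counts carry-save (3:2) reduction levels under the textbook Wallace and Dadda schemes. The function names are invented for this illustration.

```python
def wallace_levels(rows):
    """Count 3:2 carry-save levels needed to reduce `rows` operands to two."""
    levels = 0
    while rows > 2:
        # Each full adder turns three rows into two; leftover rows pass through.
        rows = 2 * (rows // 3) + rows % 3
        levels += 1
    return levels


def dadda_heights(rows):
    """Return the Dadda target heights below `rows`, smallest first.

    For rows > 2, the number of heights equals the number of reduction stages.
    """
    heights = [2]
    while heights[-1] * 3 // 2 < rows:
        heights.append(heights[-1] * 3 // 2)
    return heights


if __name__ == "__main__":
    print(wallace_levels(32))   # levels for a 32-row summand matrix
    print(dadda_heights(32))    # stage heights a Dadda reduction steps through
```

Both schemes need the same number of levels for 32 rows. They differ in where the adders are placed and how many are used, which is the trade-off the combined Wallace-Dadda approach addresses.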
Division Chip - DIVx
The iterative divide function performed by the division chip, DIVx,
requires a significant amount of hardware, the density of which a
standard cell chip affords. Two gate arrays would be required to perform
the same function, in which case a timing-critical path crossing would
occur between the two chips. Therefore, the IC designers implemented the
DIVx chip as a standard cell design by building on the techniques
developed for the MULx chip described above. As with the MULx design,
the goals for the DIVx design project were to optimize performance and
minimize real estate use by fitting the iterative divide function in a
single chip.
The IC designers employed a standard cell technique in which four
horizontal sections are defined, each section having a different number
of columns. Reference cells are located in the center row of each
section and provide ECL reference voltages to the cells above and below
in that section's columns. Placement was driven for performance, with
quotient selection logic being distributed to where it was required.
This method made for an irregular structure, as can be seen in Figure 8.
The VAX 9000 system optimizes both multiplication and division by
providing separate functional units. Each functional unit performs both
integer and floating point operations. This approach differs from the
one taken by most processor architects, who conceptually link
multiplication and division. Usually, algorithms are chosen that can
share hardware at the expense of the performance of either operation.
The separate division unit in the 9000 provides superior performance for
both integer and floating point operations. The DIVx chip is also used
by the V-box to perform very fast vector division operations, as shown
in Table 4.
Table 4  Division Performance

Data Type                     Cycles    Time (Nanoseconds)
Integer:          byte        3-4       48-64
                  word        3-5       48-80
                  long        3-8       48-128
Floating point:   F-format    7         112
                  D-format    13        208
                  G-format    12        192

Division is an iterative process. Unlike the case of multiplication, one
cannot predict the summands and then reduce the summand matrix. The two
approaches to division most commonly used are the Taylor Series
convergence algorithm and a subtract and shift algorithm. The algorithm
employed in the 9000 is a variation on the subtract and shift method,
which allows for savings in hardware as well as increased performance.

In this method, an imprecise quotient is selected based on a truncated
estimated partial remainder and a truncated version of the exact
divisor. This imprecise quotient digit is corrected when the next guess
quotient digit is selected. The selected digits may be positive or
negative. The positive digits are accumulated in a positive-value shift
register. The negative digits are accumulated in a negative-value shift
register. The final corrected binary quotient is then formed by
subtracting the negative register from the positive register.

Figure 8  Photomicrograph of DIVx Chip

The algorithm is based on a signed digit notation scheme. To determine
two quotient bits, the bits may be chosen from a digit set that includes
{-2, -1, -0, +0, +1, +2}. The digit set is simply an expanded form of
the common nonrestoring digit set that typically uses {-1, 0, +1}. In
nonrestoring algorithms, the quotient is normally corrected as needed;
whereas here, it is not corrected until the entire iterative process is
completed. The next significant difference between this division
technique and the nonrestoring method is that the quotient bits selected
are based on an estimate of the partial remainder and divisor rather
than the exact values.

The first advantage of this method is that an estimate can be obtained
faster than the exact value. Second, a truncated estimate is acceptable,
rather than a full-width estimate. Consequently, this method saves a
significant amount of hardware and increases the speed of the operation.
If one were to complete each partial remainder, up to three additional
chips would be required and the delay would more than double.
The trick to the method lies in the quotient selection. The selection is
based on partial remainder range transformations which guarantee that a
quotient digit selected in one iteration may be corrected to the exact
quotient digit on the next iteration. Therefore, although six quotient
digits are determined per major iteration, an additional minor iteration
is required to guarantee the least significant digit of the major
iteration. The major and minor iteration terms refer to the architecture
of the divide iterative hardware. The DIVx produces six quotient bits
per machine cycle. This is a radix 64 division technique. However, the
high radix division is accomplished by overlapping lesser radix
divisions. In particular, there are three sets of radix 4 division
groups. The first two sets are overlapped, so that the critical path
through the radix 64 division is actually the critical path through two
radix 4 divisions. A minor iteration is the path through one radix 4
division group. A major iteration is the path through the overlapped set
of two radix 4 division groups, followed by the final radix 4 group. It
is important to note that extra iterations do not adversely affect the
corrected quotient.

Finally, to produce the corrected quotient, the set of negative quotient
digits is subtracted from the set of positive quotient digits, where
each digit is properly radix 2 weighted, based on the order of
selection. (That is, the first quotient digit selected is the most
significant bit of the correct quotient.)
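The final correction step, subtracting the accumulated negative digits from the accumulated positive digits, can be illustrated with a small Python sketch. It assumes radix 4 digits drawn from the set -2 through +2, most significant digit first, and models the two shift registers as plain integers. The function name and the digit-list representation are assumptions for the example; the quotient selection logic itself is not modeled.

```python
def signed_digits_to_quotient(digits, radix=4):
    """Convert signed quotient digits (most significant first) to a binary value."""
    positive = 0  # models the positive-value shift register
    negative = 0  # models the negative-value shift register
    for digit in digits:
        # Shift both registers by one digit position, then record the new digit
        # in whichever register matches its sign.
        positive = positive * radix + (digit if digit > 0 else 0)
        negative = negative * radix + (-digit if digit < 0 else 0)
    # One subtraction at the end yields the conventional binary quotient.
    return positive - negative


if __name__ == "__main__":
    # Two different signed-digit strings can encode the same quotient.
    print(signed_digits_to_quotient([1, -2, 2, -1]))
    print(signed_digits_to_quotient([0, 2, 1, 2, 1][1:]))
```

Because the representation is redundant, a digit chosen slightly too large in one step can be offset by a negative digit in a later step, which is what allows selection from truncated estimates.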
Vector Register File Chip - VRGx
The VAX 9000 architecture adds vector instructions to the standard VAX
environment, thus a vector register file was required. There were two
primary design requirements for the vector register file. First, the
register file and associated cross-bar logic had to fit in a single
multichip unit; and second, the register file had to perform read and
write at different addresses within a single 16-ns clock cycle. These
requirements could not be met with available memory and logic chips,
thus necessitating the development of a fully custom vector register
chip.
The vector register file is 64 bits wide and consists of 16 vector
registers with 64 elements each. The vector register chip, VRGx, was
developed as an 8-bit slice of the 64-bit vector register file. The chip
contains 9216 bits of RAM for data storage and the cross-bar logic (6000
equivalent gates) that allows access from the five read ports and three
write ports. Integrating the register memory and the cross-bar logic on
the same chip allowed timing to be optimized so that the system timing
requirements were met.
VRGx Chip Physical Features and
Organization
The VRGx chip is fabricated using the MOSAIC III ECL process, which was
not designed as a memory process. Coordination with the vendor resulted
in the addition of an implant step for the memory-cell bit line
emitters. Key features of the process are three metal interconnect
layers, oxide isolation, and polysilicon emitters with a drawn width of
1.75 microns.
Figure 9 shows the locations of the major circuit blocks in the VRGx
chip. The major blocks of the VRGx chip are five read ports, three write
ports, and 16 vector registers in the RAM bank array. The block diagram,
Figure 10, shows the main data paths. The 16 vector registers are
implemented as 64-word by 9-bit single port RAMs. Eight bits are a slice
of the 64-bit vector register file and the ninth bit is for byte parity.
Timing
A register RAM can be read from one address and written from a different
address in one 16-ns clock cycle. This dual operation is made possible
by a 2 to 1 multiplexer on the RAM address inputs. The read address is
applied during the first portion of the cycle, and the write address is
applied during the second portion of the cycle. Splitting the clock
cycle into read and write portions eliminates conflict between read and
write ports in the event that a single register RAM is selected for both
read and write. Read data is held in a latch during the second portion
of the cycle and is unaffected by the write operation.
Figure 9  Photomicrograph of VRGx Chip

A single clock cycle consists of nonoverlapping clock phases A and B.
Latches on the read and write port inputs are clocked by phase A, and
read port output latches are clocked by phase B. For a read operation
initiated on phase A, the output read data becomes valid during phase B.
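The read-then-write ordering within one cycle can be modeled behaviorally. The Python sketch below is a simplified illustration, not a description of the circuit: the class name `RegisterRam` and its `cycle` method are invented for the example, and the two portions of the clock cycle are represented simply by the order of two statements.

```python
class RegisterRam:
    """Behavioral model of one 64-word by 9-bit single-port register RAM."""

    def __init__(self, words=64, width=9):
        self.cells = [0] * words
        self.mask = (1 << width) - 1

    def cycle(self, read_addr=None, write_addr=None, write_data=0):
        """Perform one clock cycle: read first, then write."""
        latched = None
        if read_addr is not None:
            # First portion of the cycle: the read address is applied and the
            # result is captured in the output latch.
            latched = self.cells[read_addr]
        if write_addr is not None:
            # Second portion of the cycle: the write address is applied. The
            # latched read data is no longer affected.
            self.cells[write_addr] = write_data & self.mask
        return latched


if __name__ == "__main__":
    ram = RegisterRam()
    ram.cycle(write_addr=5, write_data=0x0AA)
    # Reading and writing the same element in one cycle returns the old value.
    print(hex(ram.cycle(read_addr=5, write_addr=5, write_data=0x155)))
    print(hex(ram.cycle(read_addr=5)))
```

Reading and writing the same element in one cycle returns the value stored before the write, which matches the behavior the text describes for the held read latch.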
Cross-bar Logic
Cross-bar logic in the RAM bank array makes each of the 16 vector
register RAMs independently accessible from the read and write ports.
Enable inputs on the ports prevent invalid addresses from conflicting
with intended addresses. Read and write ports may point to the same
register RAM, but different write ports may not point to the same RAM.
Also, different read ports may only point to the same RAM if the vector
element address is the same. All conflicts must be resolved external to
the chip.
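Since conflicts must be resolved outside the chip, external logic needs some way to detect them. The following Python sketch expresses the stated port rules as a checker. It is a hypothetical illustration: the function name, the tuple layout `(enable, register_select, element_address)`, and the returned conflict records are all assumptions made for the example.

```python
def find_conflicts(read_ports, write_ports):
    """Return port pairs that violate the register RAM access rules.

    Each port is a tuple (enable, register_select, element_address).
    """
    conflicts = []

    # Rule 1: two enabled write ports may not select the same register RAM.
    first_writer = {}
    for index, (enable, select, _address) in enumerate(write_ports):
        if not enable:
            continue
        if select in first_writer:
            conflicts.append(("write", first_writer[select], index))
        else:
            first_writer[select] = index

    # Rule 2: two enabled read ports may share a register RAM only when they
    # use the same vector element address.
    first_reader = {}
    for index, (enable, select, address) in enumerate(read_ports):
        if not enable:
            continue
        if select in first_reader:
            other_index, other_address = first_reader[select]
            if address != other_address:
                conflicts.append(("read", other_index, index))
        else:
            first_reader[select] = (index, address)

    # A read port and a write port may select the same RAM, so no check here.
    return conflicts


if __name__ == "__main__":
    reads = [(True, 3, 10), (True, 3, 10), (True, 3, 11), (False, 3, 0), (True, 4, 0)]
    writes = [(True, 3, 1), (True, 3, 2), (False, 3, 5)]
    print(find_conflicts(reads, writes))
```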
A read port consists of an enable, a 4-bit register select, a 6-bit
vector element address, and a 9-bit output. An enabled read port applies
a register select code that points to a particular RAM bank. At that RAM
bank, a 5 to 1 multiplexer selects the vector element address from the
active read port and applies it to the read address of the RAM. Then the
RAM output passes through a 16 to 1 multiplexer controlled by the
register select code, so that the selected RAM output reaches the output
of the active read port.
A write port consists of an enable, a 4-bit register select, a 6-bit
vector element address, and a 9-bit write data input. An enabled write
port applies a register select code that points to a particular RAM
bank. At that RAM bank, a 3 to 1 multiplexer selects the vector element
address from the active write port and applies it to the write address
of the RAM. Also, a 3 to 1 multiplexer selects the write data from the
active write port and applies it to the RAM data input.

Figure 10  VRGx Chip Block Diagram
RAM Technology
The normal transistors in an ECL process are of the NPN type, where the
collector is a buried N-doped region. For memory cells, a lateral PNP
transistor is placed in the same collector region, and the combined
structure has the latching characteristics of a silicon controlled
rectifier (SCR). The memory cell array in the 64 by 9 register RAMs is
implemented with ECL SCR memory cells.
The SCR memory cell shown in Figure 11 consists of two cross-coupled SCR
structures. Extra NPN emitters connect to the bit lines and provide a
means of writing and sensing the cell. The "on" side of the cell
saturates, allowing the bit line emitter to conduct in the inverse mode.
Inverse gain of the bit line emitters must be limited to avoid excessive
leakage into the unselected cells. An added process step applies a
special base implant to the bit line emitters only to control their
inverse gain.
Advantages of the SCR cell include good density, low standby power,
large sense voltage differential, and low sensitivity to
alpha-particle-induced soft errors. The cell has one limitation: excess
charge storage due to write current can delay subsequent writing to the
opposite state. This problem is eliminated with a special bit line
current steering circuit that makes write current state dependent
(Figure 11).
The SCR memory cell in Figure 11 is written by applying a high current
(four times read current) to the "off" bit line emitter. The current
steering transistors prevent this current from reaching a bit line
emitter that is already "on." Thus, attempting to write a cell that is
already in the desired state does not result in any additional cell
current beyond the normal read current, and no additional charge storage
occurs.
Other Chip Features
Other noteworthy chip features include scan logic, parity error detect
logic, and a data pipeline for write port 0 data. Scan operation gives
access to the register RAMs. In a single scan-in and scan-out operation,
it is possible to read five registers and to write three registers.
Figure 11  SCR Memory Cell with Bit Line Current Steering Circuit
Key: WC = write control; UWL = upper word line; BL = bit line (left);
BR = bit line (right); LWL = lower word line; VA = voltage reference

Parity checking logic is used to detect input errors and set error
flags. There is a parity check on the 9-bit write port data inputs.
Another parity checker is applied to address and control inputs. These
are assigned to three parity groups, with a parity bit input for each
group.
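A byte-parity check on a 9-bit word can be sketched as follows. The text does not say whether the chip uses odd or even parity, nor which bit position carries the parity bit, so both are explicit assumptions here: the sketch defaults to odd parity and places the parity bit in bit 8. The function names are likewise invented for the illustration.

```python
def parity_bit(data_byte, odd=True):
    """Return the parity bit for an 8-bit value (odd parity is an assumption)."""
    ones = bin(data_byte & 0xFF).count("1")
    even_bit = ones & 1  # the bit that would make the total count of ones even
    return even_bit ^ 1 if odd else even_bit


def word_is_valid(word9, odd=True):
    """Check a 9-bit word: bits 7..0 are data, bit 8 is assumed to be parity."""
    data = word9 & 0xFF
    stored = (word9 >> 8) & 1
    return parity_bit(data, odd) == stored


if __name__ == "__main__":
    data = 0b10110000
    word = (parity_bit(data) << 8) | data
    print(word_is_valid(word))          # the word as written
    print(word_is_valid(word ^ 0x001))  # the same word with one data bit flipped
```

Any single-bit error in the nine bits changes the count of ones by one, so a single parity bit is enough to flag it, though not to locate or correct it.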
The write port 0 data pipeline allows a delay of one, two, or three
clock cycles to be selected, delaying the write port data as necessary
to resolve register access conflicts.
Self-timed RAM
In the VAX 9000 system, as in any high-performance CPU, fast memory is
used for cache and control store applications. Engineers traditionally
use very fast static RAMs within the CPU for memory. Logic designers,
however, have long recognized that CPU performance is often limited as a
result of the time needed to access data in these RAMs. This limitation
is not only the result of the access time and write cycle performance of
the devices themselves, but also of the off-chip circuitry and
interconnect used for write pulse generation and distribution. The logic
designers and technologists for the VAX 9000 knew that unless some
architectural improvements were made to the traditional static RAM, much
of the RAM performance improvement would be lost in the wiring
interconnect. They also realized that Digital's memory suppliers would
have to be convinced that a new RAM architecture would be marketable to
their other customers. After several design iterations, the
technologists submitted a set of specifications for a synchronous,
self-timed RAM (STRAM) to several suppliers for their review. After
extensive market surveys, our memory suppliers agreed that this new
architecture could eventually become a new standard for high-speed
static RAMs.
The VAX 9000 system requires two configurations of the basic STRAM
device: 1K words by 4 bits, and 4K words by 4 bits. A block diagram of
the STRAM is shown in Figure 12. The STRAM is similar to the traditional
RAM in that it has chip select, input address and data, and output data.
However, the STRAM also has several nontraditional inputs such as write,
a differential clock, and a reference voltage (Vbb). Latches added to
all inputs and outputs provide pipelined timing. An internal write pulse
generator controls write operations and eliminates the need to generate
and distribute the write pulse signal externally on the module. Also,
two optional output configurations are provided: a 50-ohm drive open
emitter for standard parallel termination on the module, and a resistor
and pulldown current source which is wired externally to implement STECL
or on-chip source termination.

The clock buffer design allows inputs to be driven differentially from
off-chip to minimize clock skew. The clock buffer is also designed to
accommodate customers who are not greatly concerned about skew or who
may be more concerned about conserving routing area. One input of the
clock buffer may be tied to the output pin of the reference generator
which provides the standard ECL threshold voltage (Vbb), allowing the
other input of the clock buffer to be driven in a single-ended mode.
Input and output latches are clocked on opposite edges of the internal
differential clock buffer. Timing diagrams are shown in Figure 13. On a
falling edge of CLK H, data and address inputs flow into the RAM array.

If write is asserted during the next rising edge of CLK H, then a write
cycle is initiated, and the input data is stored in the memory at the
address presented at the ADR inputs. At the same time, the data is
passed through the multiplexer and the output latch.

If write is deasserted on the rising edge of CLK H, then the STRAM is in
a read cycle and input data is ignored. The data stored in the RAM at
the address presented at the ADR inputs flows out to the multiplexer and
output latch.

If chip select (CS) is deasserted prior to the rising edge of CLK H,
then write and read operations are disabled and the output latches are
reset low.
For proper operation of the STRAM, certain timing requirements must be
fulfilled. The write operation is terminated by either the falling edge
of CLK H or by the internal write pulse generator, whichever occurs
first. Therefore CLK H must be asserted long enough to ensure that data
is properly written into the memory array. The internal write pulse
generator provides an output having the proper duration as determined by
a string of gates.

Figure 12  STRAM Block Diagram

Figure 13  STRAM Timing Diagrams
Note: The clock high state must last long enough to complete a write
cycle. Key: RD = read operation cycle; WR = write operation cycle.

Also, the assertion of the internal write pulse signal must be delayed
by an amount equal to the internal access time of the RAM. In this way,
the correct data is stored, and not the data previously stored in the
input registers. The delay is accomplished by the row delay circuit,
which is also simply a string of gates. These features give the STRAM
its "self-timed" nature.
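The read, write, and deselect behavior described for the STRAM can be summarized in a small behavioral model. This Python sketch is an assumption-laden simplification: the class and method names are invented, signals are represented as logical "asserted" booleans rather than their active-low or differential electrical forms, and the analog timing of the write pulse and row delay is not modeled at all.

```python
class StramModel:
    """Edge-level behavioral sketch of a synchronous self-timed RAM."""

    def __init__(self, words=1024, width=4):
        self.mem = [0] * words
        self.mask = (1 << width) - 1
        self.addr = 0   # contents of the address input latch
        self.din = 0    # contents of the data input latch
        self.dout = 0   # contents of the output latch

    def falling_edge(self, addr, din):
        """On the falling clock edge, address and data flow into the array."""
        self.addr = addr
        self.din = din & self.mask

    def rising_edge(self, write, cs=True):
        """On the rising clock edge, complete a write, a read, or a deselect."""
        if not cs:
            self.dout = 0                       # deselected: output reset low
        elif write:
            self.mem[self.addr] = self.din      # write cycle stores input data
            self.dout = self.din                # and passes it to the output
        else:
            self.dout = self.mem[self.addr]     # read cycle ignores input data
        return self.dout


if __name__ == "__main__":
    ram = StramModel()
    ram.falling_edge(0x10, 0xA)
    print(ram.rising_edge(write=True))
    ram.falling_edge(0x10, 0x5)
    print(ram.rising_edge(write=False))
```

Because the write pulse is generated inside the device, the only external timing the model needs is the pair of clock edges, which is the point of the self-timed design.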
Acknowledgments
The authors would like to acknowledge the following individuals who
participated in and contributed to the success of the VAX 9000 project:
Jerry Weisbach, Andy Moroney, Bob Haller, Marc Lamere, Mark Hamel, Tom
Senna, Dave McCall, Patty Kroesen, Rick Jones, Jim Jensen, Terry
Skrypek, Eugene Marteney, Paul Guglielmi, Elaine Fire, Larry Herman,
Bill Grundmann, Mark Pascarelli, Fran Richard, Linda Greska, Jack Mason,
Chris Caiazzi, Roger Dame, Mike Normand, Steve Sullivan, Rob
Reinschmidt, Bob Bechdolt, Mike Warder, Mike Hickman, Brian Sadler,
Wayne Nunn, Rita Wespi, Gene Yee, Bruce Smith, Alisyn Emerson, Jim
Glanville.
References
1. D. Marshall and J. McElroy, "VAX 9000 Packaging, The Multi-Chip
   Unit," Proceedings of COMPCON '90 (Spring 1990).
2. P. Zdebel et al., "MOSAIC III - A High Performance Bipolar Technology
   with Self-Aligned Devices," Proceedings of IEEE 1987 Bipolar Circuits
   and Technology Meeting.
3. D. Fite and T. Fossum, "Designing a VAX for High Performance,"
   Proceedings of COMPCON '90 (Spring 1990).
4. C. Baugh and B. Wooley, "A Two's Complement Parallel Array
   Multiplication Algorithm," Short Note at COMPCON 73, 7th Annual IEEE
   Computer Society International Conference (February 1973).
5. C. Wallace, "A Suggestion for a Fast Multiplier," IEEE Transactions
   on Electronic Computers, Vol. EC-13 (February 1964): 14-17.
6. L. Dadda, "Some Schemes for Parallel Multipliers," Colloque sur
   l'Algebre de Boole (January 1965).
7. K. Hwang, Computer Arithmetic Principles, Architecture, and Design
   (New York: John Wiley and Sons, 1979): 213-283.
Richard A. Brunner
Dileep P. Bhandarkar
Francis X. McKeen
Bimal Patel
William J. Rogers Jr.
Gregory L. Yoder
Vector Processing on the
VAX 9000 System
The VAX 9000 system provides the first emitter-coupled logic (ECL)
implementation of the VAX vector architecture. The optional vector
processor on the VAX 9000 system addresses the computing needs of
numerically intensive applications with a peak performance of 125 MFLOPS
for double-precision calculations. The innovative design of the vector
register file allows the vector processor to overlap the execution of up
to three vector instructions. Supported by both the VMS and ULTRIX
operating systems, the vector processor on the VAX 9000 system provides
four to five times performance improvement for vectorizable applications
over its scalar processor.
For a long time, vector processing was the domain of large, expensive
supercomputers such as the CRAY-1.1 However, with the availability of
low cost, pipelined floating point arithmetic chips, and the maturation
of vectorizing compilers, vector processing has become a mainstream
technology for scientific applications.2 Applications that can benefit
from vector processing include finite element analysis, signal
processing, and computational fluid dynamics. The recent addition of
integrated vector processing to the VAX architecture and its
implementation on the VAX 9000 system provides these applications with
an improvement in execution time of four to five times over that of a
VAX 9000 system without vector processing. Vector processing extends the
performance range of VAX systems.
The vector processor on the VAX 9000 system, referred to as the V-box,
is the first emitter-coupled logic (ECL) implementation of the VAX
vector architecture. The definition of the architecture and the
development of the V-box started in 1986, two years after the design of
the rest of the VAX 9000 CPU. Thus, the design of the V-box was
synergistic with the definition of the VAX vector architecture. The
major goal of the V-box design was to provide adequate vector
performance (four to five times speed-up over scalar) without impacting
the design of the remainder of the VAX 9000 CPU and the memory
subsystem, which were too far along in development to change. With
vector performance comparable to a CRAY-1 and a peak performance of 125
MFLOPS for double-precision calculations, the V-box fulfills this goal.
This paper describes the VAX vector architecture and its implementation
by the VAX 9000 V-box. The first part of the paper discusses the
architectural model that all VAX vector processors must follow. The
second part shows the actual realization of this architecture in the VAX
9000 V-box and explains the innovative techniques the V-box uses to
achieve good performance. The paper concludes with preliminary vector
performance numbers for the VAX 9000 system on some standard vector
benchmarks and a number of vector code examples.
VAX Vector Architecture
The VAX vector architecture defines the instruction
set , registers, and behavior that all VAX vector
implementations, such as the VAX 9000 V-box, must
follow.' The vector architecture effort started in
December 1985. At that time several CPU develop
ment projects were well underway, including the
VAX 9000 system. With the expectation of provid
ing four to five times performance improvement
for vectorizable applications, Digital decided to add
vector processing to the VAX 9000 system, even
though the system was in an advanced stage of
development. A decision also was made to provide
a complementary metal oxide semiconductor
(CMOS) implementation of the architecture on the
VAX 6000 Model 400 system."
Because both systems could not tolerate major changes without a major
slip in schedule, the architecture required an approach that made few
changes to the scalar processor, that part of a VAX processor that
executes the regular VAX instruction set. Furthermore, because not all
applications and markets can benefit from vector processing, Digital
decided not to require vector processing on every new VAX processor.
Therefore, vector processing is offered as an optional capability. The
scalar processor decodes vector instructions and passes them to its
associated vector processor. All processing of vector instructions is
handled by the vector processor. Mechanisms are provided for
vector-scalar synchronization and handling of vector exceptions by the
scalar processor.
Although the architecture had to account for the implementation
constraints of both ongoing CMOS and ECL projects, it had to be general
and flexible enough to allow future, more integrated implementations at
higher performance. The architecture also had to minimize its impact on
the existing VMS and ULTRIX operating systems because major changes
could significantly delay software support for vector processing.
Basic Architecture
The VAX vector architecture uses a vector-register-based design
pioneered by Seymour Cray.1 There are 16 vector registers, each of which
holds 64 elements; each element is 64 bits. Instructions that operate on
longword integers or F_floating point data manipulate only the low-order
32 bits of each element, sometimes referred to as longword elements.
A number of vector control registers control which elements of a
vector register are processed by an instruction. The vector
length register (VLR) limits the highest-numbered vector register
element that is processed by a vector instruction. The vector
mask register (VMR) consists of a 64-bit mask, in which each mask
bit corresponds to one of the possible element positions in a
vector register. When instructions are executed under control of
the vector mask register, only those elements for which the
corresponding mask bit is true are processed by the instruction.
Vector compare instructions set the value of the vector mask
register.
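The combined effect of the vector length and vector mask registers can be illustrated with a short Python sketch. It is not taken from any VAX implementation: the function name is hypothetical, and two behaviors are assumptions made only for illustration, namely that elements 0 through VLR - 1 are processed and that destination elements not selected by the mask keep their previous contents.

```python
NUM_ELEMENTS = 64  # elements per vector register, as stated in the text


def masked_vector_add(va, vb, vc, vlr, vmr=None):
    """Return a copy of vc with va[i] + vb[i] written to selected elements.

    vlr -- vector length; elements 0 .. vlr-1 are candidates (assumption)
    vmr -- 64-bit mask as an int, or None for an unmasked operation;
           bit i set means element i is processed
    """
    result = list(vc)
    for i in range(min(vlr, NUM_ELEMENTS)):
        if vmr is not None and not (vmr >> i) & 1:
            continue  # masked-out element: destination left unchanged (assumption)
        result[i] = va[i] + vb[i]
    return result


va = list(range(NUM_ELEMENTS))
vb = [10] * NUM_ELEMENTS
vc = [0] * NUM_ELEMENTS
out = masked_vector_add(va, vb, vc, vlr=4, vmr=0b0101)
print(out[:5])
```

With a vector length of 4 and mask bits 0 and 2 set, only elements 0 and 2 of the destination are written; elements 1 and 3 are masked out, and element 4 lies beyond the vector length.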
The vector count register (VCR) receives the number of elements
generated by the compressed IOTA instruction, which is similar to
COMPRESSED IOTA on the CRAY-2.1 All VAX vector instructions use
two-byte extended opcodes. Any necessary scalar operands (e.g.,
base address and stride for vector memory instructions) are
specified by standard VAX scalar operand specifiers. The
instruction formats allow all VAX vector instructions to be
encoded in seven classes. The seven basic instruction groups and
their opcodes are shown in Table 1.
Within each class, all instructions have the same number and
types of operands, which allows the scalar processor to use
block-decoding techniques. The differences in operation between
the individual instructions within a class are irrelevant to the
scalar processor and need only be known by the vector processor.
Important features of the instruction set are:
• Support for random-strided vector memory data through gather
  (VGATH) and scatter (VSCAT) instructions
• Generation of compressed IOTA vectors (through the IOTA
  instruction) to be used as offsets to the gather and scatter
  instructions
• Merging vector registers through the VMERGE instruction
• The ability for any vector instruction to operate under control
  of the vector mask register
Additional control information for a vector instruction is
provided in the vector control word (shown as cntrl in Table 1),
which is a scalar operand to most vector instructions. The
control word operand can be specified using any VAX addressing
mode. However, VAX compilers generally use immediate mode
addressing (that is, place the control word within the
instruction stream). The format of the vector control word is
shown in Figure 1.
The Va, Vb, and Vc fields indicate the source and destination
vector registers to be used by the instruction. These fields also
indicate the specific operation to be performed by a vector
compare or convert instruction. The MOE bit indicates whether the
particular instruction operates under control of the vector mask
register. The MTF bit determines what bit value corresponds to
"true" for vector mask register bits. It allows a compiler to
vectorize if-then-else constructs. The EXC bit is used in vector
arithmetic instructions to enable integer overflow and floating
underflow exception reporting. The MI bit is used in vector
memory load instructions to indicate modify-intent. Figure 2
shows the encoding for some typical VAX vector instructions.
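As an illustration of how the control word fields might be unpacked, the following Python sketch decodes a 16-bit control word. The bit positions are those read from Figure 1 (MOE in bit 15, MTF in bit 14, EXC or MI sharing bit 13, Va in bits <11:8>, Vb in bits <7:4>, Vc in bits <3:0>); the function name and dictionary keys are hypothetical, and the sample value 0xC123 is constructed here from that layout rather than copied from the journal.

```python
def decode_control_word(cntrl):
    """Split a 16-bit vector control word into its fields.

    Field positions follow the layout read from Figure 1; treat them as
    an assumption of this sketch.
    """
    return {
        "MOE": (cntrl >> 15) & 1,        # masked operation enable
        "MTF": (cntrl >> 14) & 1,        # mask bit value that counts as "true"
        "EXC_or_MI": (cntrl >> 13) & 1,  # EXC (arithmetic) or MI (memory load)
        "Va": (cntrl >> 8) & 0xF,        # bits <11:8>
        "Vb": (cntrl >> 4) & 0xF,        # bits <7:4>
        "Vc": cntrl & 0xF,               # bits <3:0>
    }


# Constructed example: masked add with match = 1, Va = 1, Vb = 2, Vc = 3,
# corresponding to the VVADDF/1 V1,V2,V3 case described for Figure 2.
fields = decode_control_word(0xC123)
print(fields)
```

Under this layout, a masked add of V1 and V2 into V3 with the mask sense set to 1 decodes to MOE = 1, MTF = 1, Va = 1, Vb = 2, and Vc = 3.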
Table 1   VAX Vector Instruction Classes

Vector Memory, Constant-stride
opcode cntrl, base, stride
  VLDL      Load longword vector data
  VLDQ      Load quadword vector data
  VSTL      Store longword vector data
  VSTQ      Store quadword vector data

Vector Memory, Random-stride
opcode cntrl, base
  VGATHL    Gather longword vector data
  VGATHQ    Gather quadword vector data
  VSCATL    Scatter longword vector data
  VSCATQ    Scatter quadword vector data

Vector-scalar Single-precision Arithmetic
opcode cntrl, scalar
  VSADDL    Integer longword add
  VSADDF    F_floating add
  VSBICL    Bit clear longword
  VSBISL    Bit set longword
  VSCMPL    Integer longword compare
  VSCMPF    F_floating compare
  VSDIVF    F_floating divide
  VSMULL    Integer longword multiply
  VSMULF    F_floating multiply
  VSSLLL    Shift left logical longword
  VSSRLL    Shift right logical longword
  VSSUBL    Integer longword subtract
  VSSUBF    F_floating subtract
  VSXORL    Exclusive-or longword
  IOTA      Generate compressed IOTA vector

Vector-scalar Double-precision Arithmetic
opcode cntrl, scalar
  VSADDD    D_floating add
  VSADDG    G_floating add
  VSCMPD    D_floating compare
  VSCMPG    G_floating compare
  VSDIVD    D_floating divide
  VSDIVG    G_floating divide
  VSMULD    D_floating multiply
  VSMULG    G_floating multiply
  VSSUBD    D_floating subtract
  VSSUBG    G_floating subtract
  VSMERGE   Merge

Vector-vector Arithmetic
opcode cntrl or regnum
  VVADDL    Integer longword add
  VVADDF    F_floating add
  VVADDD    D_floating add
  VVADDG    G_floating add
  VVBICL    Bit clear longword
  VVBISL    Bit set longword
  VVCMPL    Integer longword compare
  VVCMPF    F_floating compare
  VVCMPD    D_floating compare
  VVCMPG    G_floating compare
  VVCVT     Convert
  VVDIVF    F_floating divide
  VVDIVD    D_floating divide
  VVDIVG    G_floating divide
  VVMERGE   Merge
  VVMULL    Integer longword multiply
  VVMULF    F_floating multiply
  VVMULD    D_floating multiply
  VVMULG    G_floating multiply
  VVSLLL    Shift left logical longword
  VVSRLL    Shift right logical longword
  VVSUBL    Integer longword subtract
  VVSUBF    F_floating subtract
  VVSUBD    D_floating subtract
  VVSUBG    G_floating subtract
  VVXORL    Exclusive-or longword
  VSYNC     Synchronize vector memory access

Vector Control Register Read
opcode regnum, destination
  MFVP      Move from vector processor

Vector Control Register Write
opcode regnum, scalar
  MTVP      Move to vector processor

Figure 1   Vector Control Word
  Bit 15        MOE
  Bit 14        MTF
  Bit 13        EXC or MI
  Bit 12        0
  Bits <11:8>   VA/CONVERT FCN
  Bits <7:4>    VB
  Bits <3:0>    VC/COMPARE FCN

Figure 2   Vector Instruction Encoding

ASSEMBLER FORMAT:
  VVEQLF   V6,V7      ; IF V6[i] = V7[i] THEN VMR[i] = 1, ELSE VMR[i] = 0
                      ; (VVEQLF IS A VVCMPF PSEUDO-OPCODE)
  VVADDF/1 V1,V2,V3   ; V3 = V1 + V2. DO ADDITION UNDER CONTROL OF VMR
                      ; WITH MATCH = 1
  VSMULF/U R4,V4,V5   ; V5 = R4*V4 WITH UNDERFLOW EXCEPTION CHECKING ENABLED

INSTRUCTION FORMAT:
  VVCMPF cntrl.rw           ; INSTRUCTION CONSISTS OF OPCODE AND CONTROL WORD
  VVADDF cntrl.rw           ; INSTRUCTION CONSISTS OF OPCODE AND CONTROL WORD
  VSMULF cntrl.rw, src.rl   ; INSTRUCTION CONSISTS OF OPCODE, CONTROL WORD,
                            ; AND SCALAR SOURCE

ENCODING IN MEMORY:
  BYTE     VALUE   CONTENTS
  :0-:1    FD C4   TWO-BYTE OPCODE FOR VVCMPF
  :2       8F      OPERAND SPECIFIER FOR IMMEDIATE MODE (FOR CONTROL WORD)
  :3               CONTROL WORD <7:0>: COMPARE FCN IS EQL AND V7 IS A SOURCE
  :4               CONTROL WORD <15:8>: V6 IS A SOURCE
  :5-:6            TWO-BYTE OPCODE FOR VVADDF
  :7               OPERAND SPECIFIER FOR IMMEDIATE MODE (FOR CONTROL WORD)
  :8               CONTROL WORD <7:0>: V3 IS DESTINATION AND V2 IS A SOURCE
  :9               CONTROL WORD <15:8>: V1 IS A SOURCE, MASKED OPERATIONS
                   ARE ENABLED, AND MATCH = 1
  :A-:B            TWO-BYTE OPCODE FOR VSMULF
  :C               OPERAND SPECIFIER FOR IMMEDIATE MODE (FOR CONTROL WORD)
  :D               CONTROL WORD <7:0>: V5 IS DESTINATION AND V4 IS A SOURCE
  :E               CONTROL WORD <15:8>: VA IS IGNORED. UNDERFLOW EXCEPTION
                   CHECKING IS ENABLED
  :F               OPERAND SPECIFIER FOR REGISTER MODE WITH SCALAR DATA IN R4

Vector Execution Model
With the addition of vector processing, a typical VAX processor
consists of a scalar processor and an associated vector
processor; the two are referred to as a scalar/vector pair. A VAX
multiprocessor system comprises a number of these scalar/vector
pairs. Asymmetric configurations can exist when only some of the
VAX processors in a multiprocessor system contain a vector
processor.

For good performance, the scalar processor operates
asynchronously from its vector processor whenever possible.
Asynchronous operation allows the execution of scalar
instructions to be overlapped with the execution of vector
instructions. Furthermore, the servicing of interrupts and scalar
exceptions by the scalar processor does not disturb the execution
of the vector processor, which is freed from the complexity of
resuming the execution of vector instructions after such events.
However, the asynchronous execution does cause the reporting of
vector exceptions to be imprecise. Special instructions, which
are described in the Synchronization section, are provided to
ensure synchronous operation when necessary.

Both scalar and vector instructions are initially fetched from
memory and decoded by the scalar processor. If the opcode
indicates a vector instruction, the opcode and necessary scalar
operands are issued to the vector processor and placed in its
instruction queue. The vector processor accesses memory directly
for any vector data that it must read or write. For most vector
instructions, once the scalar processor successfully issues the
vector instruction, it proceeds to process other instructions and
does not wait for the vector instruction to complete. An
execution model is shown in Figure 3.
When the scalar processor attempts to issue a vector instruction,
it checks to see if the vector processor is disabled, that is,
whether it will accept further vector instructions. If the vector
processor is disabled, then the scalar processor takes a "vector
processor disabled" fault. An operating system handler is then
invoked on the scalar processor to examine the various
error-reporting registers on the vector processor to determine
the disabling condition. The vector processor disables itself to
report the occurrence of vector arithmetic exceptions or hardware
errors. The operating system disables the vector processor,
usually to indicate the unavailability of the vector processor,
by writing to a privileged vector register. If the disabling
condition can be corrected, the handler enables the vector
processor and directs the scalar processor to reissue the faulted
vector instruction.
Within the constraint of maintaining the proper ordering among
the operations of data-dependent instructions, the architecture
explicitly allows the vector processor to execute any number of
the instructions in its queue concurrently and retire them out of
order. Thus, a VAX vector implementation can chain and overlap
instructions to the extent best suited for its technology and
cost performance. In addition, by making this feature an explicit
part of the architecture, software is provided with a programming
model that ensures correct results regardless of the extent to
which a particular implementation chains or overlaps. This
approach differs from that of some other existing vector
architectures, such as the IBM S/370 vector architecture, which
give the appearance of sequential instruction execution.6
A VAX vector implementation may have its own memory management
hardware, translation buffer, and cache; or it may share those of
the scalar processor. In high-end vector implementations, such as
the VAX 9000 system, the vector and scalar processors are tightly
coupled. The problems of limited chip area and translation buffer
and cache coherency can be lessened by allowing high-speed memory
management hardware and cache to be shared by both vector and
scalar processors. For other implementations, such as the VAX
6000 Model 400 system, the vector and scalar processors are not
so tightly coupled, and there is a performance advantage in
allowing separate memory management hardware and cache.1 Little
additional effort is necessary by an operating system to support
separate vector memory management hardware and cache.
A vector processor can treat vector memory management exceptions
(MME) in a synchronous manner, as the VAX 9000 V-box does. Once
the scalar processor issues a vector memory instruction, it
pauses until the vector processor determines whether an MME will
be encountered by the instruction. If an MME will occur, then a
precise exception is taken on the scalar processor and the
appropriate operating system handler is invoked. If no MME will
occur, the scalar processor proceeds to process other
instructions and the vector processor completes the memory
instruction. In the case of referencing a unity-strided vector,
which occurs most frequently, the MME checking takes only a short
time at the beginning because the vector is contained in two or
fewer pages. (MME checking is done at the page level.)

Figure 3   Vector Execution Unit
(Diagram labels: physical memory, 16 GB; instruction stream; data
stream; VAX scalar CPU; opcode, control word; instructions; data;
disable/status; vector data.)

Context Switching
Because of the asynchronous operation of the vector and scalar
processors, the vector context state of a process is separate
from its scalar context state. Thus, it is possible for an
operating system to swap in a new process to the scalar processor
while allowing the vector context of the previous process to
remain on the vector processor. When the previous process is
swapped out, the vector processor is disabled by the operating
system to prevent other processes from accessing this vector
context.

If the subsequent processes do not use the vector processor, then
the operating system avoids the overhead of saving and
subsequently restoring 8 kilobytes (KB) of vector context state
for the original process. If another process does use the vector
processor, the operating system must reenable the vector
processor, save the vector state of the original process, load
the vector context of the new process, and, finally, make the
vector processor available. This full context switch can take up
to 100 microseconds on the VAX 9000 system.

Assuming that only a few processes require the vector processor,
it is likely that when the original process is rescheduled to the
same scalar/vector pair, the process will find its vector context
state residing on the vector processor. By using this technique,
which is referred to as "cheap vector context switching," both
the VMS and ULTRIX operating systems reduce the time required to
swap in a process that uses the vector processor.

Exceptions
Most of the exceptions encountered by VAX vector instructions are
identical to those that occur for VAX scalar instructions. The
arithmetic exceptions are exactly the same. The memory management
exceptions have been extended to include two new vector
exceptions: vector I/O space reference and vector alignment
fault. As in the VAX scalar architecture, the reporting of
floating underflow and integer overflow exceptions can be
disabled by means of the EXC bit in the vector control word.

Vector arithmetic exceptions are reported in an imprecise manner
by vector processor disabled faults. When an exception occurs in
the processing of a vector element, the vector processor records
the exception in both a privileged exception register (the vector
arithmetic exception register, VAER) and in the corresponding
element of the destination vector register specified by the
instruction. The vector processor then disables itself from
receiving further vector instructions. However, the vector
processor continues to execute the instruction that encountered
the exception to completion by processing the remaining vector
register elements.

As stated earlier, memory management exceptions can be reported
precisely by a VAX vector processor to its scalar processor, as
the VAX 9000 V-box does, and the scalar processor takes a normal
VAX memory management fault. Exception information is placed on
the stack in the same format as for scalar memory management
exceptions. The use of the same format minimizes the effort
needed by an operating system to support these exceptions.

Memory management exceptions were extended for vectors to include
two new exception parameter bits: vector I/O space reference and
vector alignment fault. A vector I/O space reference occurs
whenever an attempt is made to load or store vector data to I/O
space. Because of the performance degradation of unaligned memory
data, a vector alignment fault occurs whenever an element being
accessed by a vector memory instruction does not begin at an
address that is an integer multiple of the length of the element
in bytes. For example, a longword (4-byte) element in memory
should begin at an address that is an integer multiple of 4
bytes.

Synchronization
In most cases, it is desirable for the vector processor to
operate asynchronously with the scalar processor to achieve good
performance. However, there are cases in which the operation of
the vector and scalar processors must be synchronized to ensure
correct results. Rather than forcing the vector processor to
detect and automatically provide synchronization in these cases,
the architecture provides special instructions, which software
can use, to accomplish the synchronization. Some of these
instructions are discussed below. Software must determine when to
use these synchronization instructions to ensure correct results
or establish exception checkpoints. Given the necessary
sophistication of vectorizing compilers, this requirement is not
onerous.
Vector and scalar memory references may be issued simultaneously.
Therefore, these references must be synchronized to prevent a
conflict from occurring when accessing shared memory locations.
This synchronization is provided by the MSYNC function of the
MFVP instruction. Once the MSYNC function is invoked, the scalar
processor does not issue further instructions until all previous
vector and scalar memory references have completed.

Because the vector and scalar processors execute asynchronously,
software cannot determine when a vector exception will be
reported. However, software requires that exceptions be reported
at certain checkpoints. For example, exceptions incurred in a
procedure must be reported within the context of that procedure
before another procedure is called. This exception reporting
synchronization is provided by the SYNC function of the MFVP
instruction. Once SYNC is invoked, the scalar processor does not
issue further instructions until the exceptions of previous
vector instructions, if any, are reported.
VAX 9000 V-box Overview
The VAX 9000 V-box is one of four tightly coupled, parallel
function units that compose the VAX 9000 CPU. As such, it shares,
with the rest of the CPU, both the large 128KB data cache and the
very fast address translation hardware. As a result, the V-box
has very fast access to memory data. The V-box is connected to
the CPU through the scalar execution unit as shown in Figure 4.
This connection consists of a 64-bit data path, which brings
instructions and data to the vector unit, and a 32-bit path,
which sends data to the scalar unit. All vector memory
instructions send data through this data path.

Figure 4   V-box Organization (with VAX 9000 CPU)

As Figure 4 also shows, the V-box is composed of the following
subunits: vector register unit, vector add unit, vector multiply
unit, vector mask unit, vector address unit, and vector control
unit. Each of these subunits can function in parallel, which
allows up to two vector arithmetic instructions and one vector
memory instruction to be executed simultaneously. Crucial to this
instruction overlapping ability is the vector register unit,
which supports up to eight simultaneous accesses from the other
subunits.

Physically, the V-box resides on the same planar board as the
remainder of the VAX 9000 CPU. Three multichip units (MCUs) are
reserved for the V-box, which is a field-installable option. The
V-box comprises 25 ECL Motorola Macrocell Array IIIs (MCA3).7
(For brevity, a macrocell array is referred to as a "chip" in
this paper.) The operation of these subunits and the techniques
used to enhance their performance are described in the following
sections.

Vector Control Unit
The vector control unit receives and coordinates the execution of
vector instructions within the V-box. The VAX 9000 scalar
execution engine (E-box) transfers both an encoded version of the
vector instruction and the necessary scalar data to the unit,
which loads the instruction and data into a circular queue as
shown in Figure 5. The queue can buffer a few pending
instructions while the remaining V-box subunits are executing
others. Without the queue, the V-box could not accept pending
instructions when all of its subunits are busy, thus propagating
a stall condition to the scalar execution unit and resulting in
poor performance.
The scalar data that is required by a vector instruction is
placed in the queue one location behind the instruction quadword.
Whenever the queue contains two entries, the vector control unit
returns a signal to the scalar execution unit and requests that
subsequent instruction issue be delayed until the number of
entries in the queue has diminished to one or less. The queue is
circular in nature and wraps around to the beginning
automatically.
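The queue behavior described above can be sketched in a few lines of Python. This is only a software analogy, not the V-box logic: the class and method names are hypothetical, the slot count of four is an assumption (the text says only that the queue buffers "a few" instructions), and each push is treated as one entry.

```python
class CircularQueue:
    """Fixed-size circular buffer with a stall signal at two or more entries."""

    def __init__(self, slots=4, stall_threshold=2):
        self.buf = [None] * slots   # slot count is an assumption
        self.head = 0               # index of the oldest entry
        self.tail = 0               # index of the next free slot
        self.count = 0
        self.stall_threshold = stall_threshold

    def stall_requested(self):
        # Signal returned to the issuing unit: delay further issue
        # until the number of entries drops to one or less.
        return self.count >= self.stall_threshold

    def push(self, entry):
        if self.count == len(self.buf):
            raise OverflowError("queue full")
        self.buf[self.tail] = entry
        self.tail = (self.tail + 1) % len(self.buf)  # wrap around
        self.count += 1

    def pop(self):
        if self.count == 0:
            raise IndexError("queue empty")
        entry = self.buf[self.head]
        self.head = (self.head + 1) % len(self.buf)  # wrap around
        self.count -= 1
        return entry


q = CircularQueue()
q.push("instruction quadword")
q.push("scalar data")
print(q.stall_requested())
```

Keeping separate head and tail indices that advance modulo the buffer size is what lets the queue wrap around to the beginning without moving any stored entries.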
When an instruction is loaded into the queue, a pointer directs
the instruction to the decode logic shown in Figure 5. If there
is enough instruction data available in the queue and the
necessary subunit is not busy, then the vector control unit sends
the instruction data from the queue to the register conflict
logic. The register conflict logic determines if the vector
registers required by the instruction are already in use by the
other subunits, a condition called register conflict. The
determination is made by comparing the vector register addresses
that are to be used by already executing vector instructions in
the next cycle against the vector register addresses required by
the new instruction. If none of the addresses overlap, then the
instruction is free to issue. If an overlap does exist, the
instruction is held until the next cycle, when it can then be
issued to the appropriate subunit. (The lack of significant cycle
delay in this case is due to the optimal design of the vector
register unit.) If there are no register conflicts, the
instruction is issued immediately to the appropriate subunit.
As the vector control unit issues the instruction to the subunit,
it also sends scalar source operands, if any, and the addresses
of the vector registers required by the instruction to the vector
register unit. The vector register unit latches the scalar data
for the duration of that instruction. For each cycle of the
instruction's execution, the register unit then sends the
necessary scalar and register data to the appropriate subunit.
The vector control unit also contains the vector length register
and sends a copy of it with every instruction that is issued to a
subunit. Because each subunit is supplied with a copy of the
vector length register, writes to the register by MTVP
instructions do not affect instructions currently executing under
the register's previous value. Without this mechanism, writes to
the vector length register would be delayed until previously
executing instructions had finished, which would result in poor
performance.

Figure 5   Vector Control Unit
(Diagram labels: E-box vector data buffer; scalar data to vector
register file; source/destination vector register addresses;
ADD, MUL, and GEN vector instruction; buffer valid; buffer
counter; instruction issue decision logic; vector register
conflict check logic; no conflict; issue new instruction.)
Upon reaching the subunit, most vector instructions execute at
one cycle per element, after the initial pipeline latency.
However, the vector divide instructions (VSDIV and VVDIV) execute
at a varying number of cycles, depending on the floating point
format (F, D, or G). (To simplify the vector control logic, no
other vector instructions are issued once a vector divide
starts.) Results are returned to the vector register unit or
vector mask unit as they are generated, depending on the
instruction.
As described earlier, microcode in the scalar execution engine
encodes vector instructions into an instruction quadword before
passing them to the V-box. Table 2 shows the high-order 32 bits
of the format used for every instruction sent to the V-box. This
quadword contains fields that indicate the instruction, the
appropriate V-box subunit to execute the instruction, and the
format of the vector control word. The low-order 32 bits of the
instruction quadword contain the vector control word for the
vector instruction. The instruction quadwords present the V-box
with a fixed format instruction that smoothly fits into a
fixed-length instruction queue, requires little subsequent
decoding, and has fields that can be directly gated to selection
logic. As a result, the time needed by the V-box to decode vector
instructions is reduced and performance is increased.
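A fixed-format quadword of this kind can be taken apart with simple shifts and masks, which is the software counterpart of gating fields directly to selection logic. The Python sketch below uses the bit ranges given in the column headings of Table 2 (opcode in bits <39:32>, control word type in bits <42:40>, dispatch type in bits <46:43>) and treats the low-order 32 bits as the control word. The function names are hypothetical, and the sample field values are arbitrary rather than taken from the table.

```python
def encode_instruction_quadword(opcode, cw_type, dispatch, control_word):
    """Pack the four fields into a 64-bit value (bits <63:47> left zero)."""
    return ((dispatch & 0xF) << 43) | ((cw_type & 0x7) << 40) \
        | ((opcode & 0xFF) << 32) | (control_word & 0xFFFFFFFF)


def decode_instruction_quadword(quad):
    """Extract the fields of an encoded instruction quadword."""
    return {
        "control_word": quad & 0xFFFFFFFF,         # bits <31:0>
        "opcode": (quad >> 32) & 0xFF,             # bits <39:32>
        "control_word_type": (quad >> 40) & 0x7,   # bits <42:40>
        "dispatch_type": (quad >> 43) & 0xF,       # bits <46:43>
    }


# Arbitrary sample values, chosen only to exercise each field.
quad = encode_instruction_quadword(opcode=0x03, cw_type=2, dispatch=1,
                                   control_word=0xC123)
print(decode_instruction_quadword(quad))
```

Because every field sits at a fixed position, decoding needs no knowledge of the instruction itself, which mirrors the reduced decode effort the text attributes to the fixed format.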
Vector Register Unit
The vector register unit or file, as its name implies, contains
the logic and fast memory that implement the 16 VAX vector
registers on the V-box. The block diagram of the vector register
file is shown in Figure 6. The vector register file has three
write ports and five read ports. By using the innovative
technique described below, these ports provide the multiple
accesses needed to feed two operands per cycle to the vector add
and multiply units, and one operand to the vector address-mask
unit. This unit is the single largest contributor to the
excellent vector performance of the VAX 9000 system.
The file consists of 16 vector registers. Each register contains
64 elements, and each element is 72 bits wide (64 data, 8
parity). The vector register file is implemented as a byte-sliced
custom chip, which has a single parity bit per data port. Three
writes and five reads to the file can occur simultaneously in any
cycle. All writes must be to different register banks. However,
multiple reads can occur to the same bank if the same element is
required by each read access. Internally to the vector register
unit, reads occur during the first half of the cycle, and writes
occur during the last half. A write and read enabling signal is
generated for each register bank every cycle. Each cycle, data is
selected from one of the three write ports to be written into any
enabled register banks. Write port 0 has a four-stage pipe to
buffer data coming from the E-box, through the control logic,
which cannot be written due to a register bank conflict. The
vector register file also has three scalar registers (one each
for the vector address-mask unit, vector add unit, and vector
multiply unit) to hold scalar source operands for vector-scalar
instructions. Write port 0 is used to write these registers. Each
enabled read port selects an element from one of the 16 register
banks or scalar registers (for vector-scalar instructions) and
transfers it to one of the other subunits.
The vector register file uses a technique referred to as "barber
poling" to improve the use of chaining and overlapped instruction
execution. As Figure 7 shows, barber poling spreads each
architecturally defined vector register across all vector
register banks. Elements are laid out such that the first vector
element of each vector register is in location 0 of the
same-numbered physical register bank, and element b of vector
register n is in location b of vector register bank
((n + b) modulo 16).
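The barber-pole mapping can be checked with a few lines of Python. The sketch simply applies the formula given in the text, bank = (n + b) modulo 16 with the element stored at location b; the function name is hypothetical, and the code models only the address mapping, not the timing of the register file.

```python
NUM_BANKS = 16      # register banks, as stated in the text
NUM_ELEMENTS = 64   # elements per vector register


def bank_and_location(n, b):
    """Map element b of vector register n to (bank, location)."""
    return ((n + b) % NUM_BANKS, b)


# Successive elements of one register step through successive banks.
print([bank_and_location(3, b)[0] for b in range(4)])

# For any fixed element index, the 16 registers fall in 16 different banks,
# so instructions working on the same element index of different registers
# do not compete for a bank.
for b in range(NUM_ELEMENTS):
    banks = {bank_and_location(n, b)[0] for n in range(NUM_BANKS)}
    assert len(banks) == NUM_BANKS
```

Under this mapping, element 0 of register n lands in bank n, and a stream of consecutive elements from any single register visits each bank once every 16 elements instead of occupying one bank for the whole vector.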
By using this technique, a vector register conflict causes the
vector control unit to delay the issuing of a new vector
instruction for no more than three cycles. If the more standard
technique of placing all elements of one vector register in the
same bank were used, a vector register conflict could cause the
execution of a new instruction to be delayed by 64 cycles. The
64-cycle delay would have frustrated attempts at overlapping and
severely degraded the vector performance of the VAX 9000 system.
Vector Add Unit
The vector add unit executes most vector instructions, including
both floating point and integer addition, subtraction, and
comparison; vector convert; vector shift logical; vector logical
operations; and vector merges. For brevity, these instructions
are referred to as add-class instructions. One of the challenges
in designing the vector add unit was the need to perform both
integer and floating point arithmetic.
Table 2   Encoded Instruction Quadword (bits <63:32>)

Vector                     OPCODE    Control Word Type   Dispatch Type
Instruction                <39:32>   <42:40>             <46:43>
VVSUBF/VSSUBF              0F9       2/6                 0
VVSUBG/VSSUBG              0DB       2/6                 0
VVSUBD/VSSUBD              0D2       2/6                 0
VVSUBL/VSSUBL              0F6       2/6                 0
VVCMPL/VSCMPL              0F5       3/7                 0
VVSLLL/VSSLLL              034       2/6                 0
VVSRLL/VSSRLL              026       2/6                 0
VVBISL/VSBISL              086       2/6                 0
VVBICL/VSBICL              08E       2/6                 0
VVXORL/VSXORL              088       2/6                 0
VVMERGE/VSMERGE            0AE       5/1                 0
VVADDD/VSADDD              092       2/6                 0
VVADDF/VSADDF              089       2/6                 0
VVADDG/VSADDG              098       2/6                 0
VVADDL/VSADDL              086       2/6                 0
VVCMPD/VSCMPD              0D5       3/7                 0
VVCMPF/VSCMPF              0FD       3/7                 0
VVCMPG/VSCMPG              0DD       3/7                 0
VVCMPD/VSCMPD              0D5       3/7                 0
VVCVTDF                    011       4                   0
VVCVTDL                    016       4                   0
VVCVTFD                    03A       4                   0
VVCVTFG                    038       4                   0
VVCVTFL                    03E       4                   0
VVCVTGF                    019       4                   0
VVCVTGL                    01E       4                   0
VVCVTLD                    032       4                   0
VVCVTLF                    031       4                   0
VVCVTLG                    033       4                   0
VVCVTDL                    017       4                   0
VVCVTFL                    03F       4                   0
VVCVTGL                    01F       4                   0
VVMULL/VSMULL              003       2/6                 1
VVMULF/VSMULF              004       2/6                 1
VVMULD/VSMULD              005       2/6                 1
VVMULG/VSMULG              006       2/6                 1
VVDIVF/VSDIVF              00C       2/6                 4
VVDIVD/VSDIVD              00D       2/6                 4
VVDIVG/VSDIVG              00E       2/6                 4
VLDL                       001       0                   2
VLDQ                       002       0                   2
Block load                 00C       0                   2
VSTL                       003       0                   2
VSTQ                       004       0                   2
VGATHL                     005       1                   2
VGATHQ                     006       1                   2
VSCATL                     010       1                   2
VSCATQ                     011       1                   2
IOTA                       012       0                   2
Load VLR                   007       0                   2
Load low VMR               009       0                   2
Load high VMR              00A       0                   2
Store low VMR              00D       0                   3
Store high VMR             00E       0                   3
Store unaligned address    013       0                   3
Load VPSR                  014       0                   2
Load VAER                  015       0                   3
Store VAER                 008       0                   3
RESET                      00F       2                   3

Bits <63:47> are reserved.

Figure 6   Vector Register Unit
(Diagram labels: write port 0, vector write data from control;
write port 1, VAD result from VAD; write port 2, VML result from
VML; scalar registers 0, 2, and 4; write enable logic; write
address logic; read enable logic; read address logic; memory
array, register banks 0 through 15; read ports 0 through 4, to
mask logic, to VML logic, and to adder logic.)

Figure 7   Barber Poling
(Diagram labels: bank 0, bank 1, bank 2, through bank 13, bank
14, bank 15.)

The organization of the vector add unit is shown in Figure 8. It
is a pipelined structure that comprises two identical chips for
unpacking and aligning operands (VFSA and VFSB); one chip for
performing arithmetic and logical operations (VFAD); and a
remaining chip for normalizing, rounding, and packing the result
(VFPK). The data paths between the chips are all 64 bits wide.

The pipeline latency through this unit for both single-precision
(integer and F_floating) and double-precision (G_floating and
D_floating) formats is only three cycles. Thus, the vector/scalar
cross-over number for add-class instructions (that is, the
minimum number of vector elements needed for the V-box to surpass
the performance of the remainder of the VAX 9000 CPU for this
class of instructions) is quite small. As a result, the V-box
achieves good performance for add-class instructions with both
small-sized and large-sized vectors (large-sized vectors being
naturally favored by the technique of pipelining).

When the vector add unit begins to execute an instruction, it
receives two source elements from the vector register unit each
cycle. The elements are latched into the unpacking logic, one
element for each of the two chips. During the next cycle, each
unpacking chip concurrently unpacks and aligns its source
element, if necessary, and forwards the result to the addition or
logical-operation logic, depending on the instruction. Within the
same cycle, the addition chip uses the two sources from the
unpacking logic to generate a result, which is then latched.
During the final cycle, the result is sent to the
packing chip, which normalizes, rounds, and packs
the result, if necessary, and sends it to the vector
register unit to be written. Exception checking and
reporting are also done in the last cycle by the
packing chip, which maintains the vector add unit's
copy of the vector arithmetic exception register
(VAER). When the instruction completes, the vector
add unit sends its VAER copy to the vector mask unit
to be merged with the VAER copy from the vector
multiply unit.
The vector add unit does not differentiate
between masked and unmasked vector instructions.
The complexity of skipping over masked-out elements
would have added extra cycles of pipeline
latency and resulted in lower performance for small
vectors. For masked as well as unmasked
instructions, the vector add unit operates from the
first up to the last element (as indicated by the
vector length register) of both source registers. The
actual masking of results is handled by the vector
control unit, which blocks the vector register unit
from receiving masked-out results as they are
being sent by the vector add unit. However, the
packing chip does use vector mask register bits to
suppress exception generation for results that are
masked out.
Floating Point Operation
When executing vector floating point instructions, the
unpacking logic takes the various fields of a floating
point element and expands and rearranges them into a more
convenient format for the addition logic, i.e., the element
is "unpacked." As a result of this process, the addition
logic is simplified because all VAX floating point
formats (F, D, and G) are unpacked into an identical
format. The unpacking involves decoding the sign,
inserting the hidden bit, and rearranging the fraction
bits. For all VAX floating point formats, the
fractional part is expanded to 56 bits. (F_floating
and G_floating are expanded with zeros on the
right.) The fractional part is then extended on the
right with two guard bits and a rounding bit to
form a 59-bit fraction. The overflow and guard bits
ensure the accuracy of rounded results.

Figure 8   Vector Add Unit (block diagram: source elements
pass through chips VFSA, VFSB, VFAD, and VFPK; the VAER copy
is sent to the vector mask unit)

After the elements are unpacked, the unpacking
chips align the elements by taking the fractional
part of the smaller magnitude number and shifting
it to the right until its exponent is equal to that of
the larger magnitude number. Each unpacking chip
also receives the exponent bits of the other chip's
element. Therefore, the alignment process can be
done in parallel before the elements are sent to the
addition logic that requires the alignment. If, during
the alignment of an element for a vector floating
point subtract instruction, a one is shifted out of the
59-bit fraction field, then a "sticky bit" is generated.
This sticky bit is used by the addition logic in the
next cycle as a carry into the subtraction.
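
The alignment step described above can be sketched in a few lines of Python. This is only a behavioral model under stated assumptions, not the chip logic: the function name align_fraction and the use of a plain integer for the 59-bit fraction are hypothetical, and the sketch reports a sticky bit whenever a one is shifted out, regardless of the instruction.

```python
FRACTION_BITS = 59  # 56-bit fraction plus two guard bits and a rounding bit


def align_fraction(frac_small: int, exp_small: int, exp_large: int):
    """Shift the smaller operand's fraction right until the exponents match.

    Returns (aligned_fraction, sticky), where sticky is 1 if any one bit
    was shifted out of the fraction field. Hypothetical helper for
    illustration only.
    """
    shift = exp_large - exp_small
    if shift < 0:
        raise ValueError("exp_small must not exceed exp_large")
    if shift >= FRACTION_BITS:
        # Everything is shifted out; sticky records whether anything was lost.
        return 0, int(frac_small != 0)
    lost_bits = frac_small & ((1 << shift) - 1)
    return frac_small >> shift, int(lost_bits != 0)
```

For instance, aligning the fraction 0b1011 across an exponent difference of two shifts out the bits 11, so the model returns the fraction 0b10 with the sticky bit set.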
The unpacked, aligned elements are then sent to
the add chip, which produces a result and then
partially normalizes the result before sending it to the
packing chip. Again, if the shifting during
normalization shifts a one out of the fraction field, a sticky
bit is generated. Finally, the partially normalized
result and the second sticky bit are sent to the
packing chip, which completes the normalization and
rounding and adjusts the exponent field accordingly.
To save an extra cycle, the packing chip computes
two exponent values, one for each value of
the carry-over in the rounding process. Final
selection of the exponent and its exception is done using
the actual carry-over of the rounding logic. The
proper exponent and the normalized fraction are
then rearranged into the appropriate floating point
format, and the assembled element is sent to the
vector register unit.
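
The trick of computing both candidate exponents ahead of time and selecting one with the rounding carry can be illustrated as follows. The sketch is an assumption-laden model: the name round_and_select, the 56-bit default width, and the boolean round_up input are hypothetical, and exception detection is omitted.

```python
def round_and_select(fraction: int, exponent: int, round_up: bool, width: int = 56):
    """Round a normalized fraction and pick the matching exponent.

    Both candidate exponents are formed before the rounding carry is
    known, mirroring the "compute two, select one" idea. Hypothetical
    model for illustration; not the packing chip's actual logic.
    """
    exponent_if_no_carry = exponent
    exponent_if_carry = exponent + 1

    rounded = fraction + (1 if round_up else 0)
    carry = rounded >> width          # 1 if rounding overflowed the fraction
    if carry:
        rounded >>= 1                 # renormalize after the carry-out

    selected = exponent_if_carry if carry else exponent_if_no_carry
    return rounded, selected
```

In software the selection costs nothing, but in hardware precomputing both exponents removes the exponent increment from the path that follows the rounding adder.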
Integer and Logical Instructions
For vector integer and logical instructions, the elements
bypass the alignment logic and are sent to the add chip (VFAD)
for all but the logical shift right instructions (VVSRLL
and VSSRLL). For logical shift right instructions, the
alignment logic does the shifting because the shifting
circuitry is already needed for the alignment of
fractions in floating point elements. The exponent
unpacking logic is used to pass on the logical shift
right count to the alignment logic, which then
sends the shifted result to the add chip. The add
chip operates on the low-order 32 bits of these
elements and passes through the high-order 32 bits
unchanged to the packing chip. For logical shift
left instructions (VVSLLL and VSSLLL), the low-order
32 bits also pass through the add chip unchanged.
On the packing chip, the floating point normalize
logic is used to perform logical shift-left operations.
The shift count is passed to the normalize logic
from the unpacking logic during the first cycle. For
all other integer and logical instructions, the
normalize count is forced to zero to pass the add chip
result through. Finally, just before sending the result
to the vector register unit, the packing chip checks
for integer overflow exceptions.
Merge Instructions
For vector merge instructions
(VVMERGE and VSMERGE), the unpacking chip with
the masked-out element, based on the appropriate
vector mask register bit, zeros that element out
before sending it to the addition logic. The addition
logic adds the zero to the other element, which has
the effect of passing the value of the other element
on to the packing chip.
Vector Memory Operation
Because vector applications tend to issue many
vector memory instructions, the execution time of
these instructions is a critical factor in the
performance of a vector processor. Therefore, the V-box
was designed to minimize the execution time by
taking advantage of the VAX 9000 CPU's large 128KB
data cache, by prefetching vector data, and by
fetching it in blocks instead of element by element.
Memory requests by the V-box are sent through
the VAX 9000 CPU to the cache and address
translation hardware (M-box) of the VAX 9000 CPU. The
M-box translates the 32-bit virtual addresses for
vector data into physical addresses and accesses the
proper locations in the data cache. The vector
address-mask unit generates the virtual addresses
for the vector elements. For vector load and gather
instructions, the vector data is returned to the
V-box through the E-box and written to the proper
vector registers. The M-box returns 64 bits of data
each cycle. For vector store and scatter instructions,
the vector elements are sent through the E-box to
the M-box. Although the vector register unit is
capable of sending 64 bits at a time, the E-box need
only forward 32 bits per cycle to the M-box. The
M-box requires two cycles to write the cache and
does not actually write the 64-bit data until the
second cycle. (The first cycle performs the cache tag
lookup.) Because the V-box implements synchronous
memory management exception reporting,
once a vector memory instruction begins execution,
no other vector instruction may be issued until
the memory instruction completes.
The VAX 9000 CPU prefetches vector data. This
mechanism is used to move data from main
memory to cache in a manner that optimizes
memory bandwidth. By using this method, a 25
percent improvement in the performance of vector
load instructions is achieved. The prefetching starts
when the scalar microcode on the VAX 9000 CPU
checks the stride of a VLDQ instruction. If this stride
is 8 bytes long (quadwords are contiguous in memory),
the microcode converts the instruction into a
block load instruction and sends it to the V-box.
The block load instruction directs the V-box to issue
a series of block load requests for vector data. A
block load request moves an entire cache block
from the memory into the vector registers. These
blocks are loaded into both the cache and the vector
registers when they come from main memory.
(Bypassing the cache to load the vector registers
directly reduces the effect of a cache miss for vector
data.) Otherwise, the memory requests are done for
one register element at a time.
In addition to converting the VLDQ to a block
load instruction, the scalar microcode also issues
prefetch requests to the M-box. The M-box
determines if the data is valid in the cache. If so, no
further action is taken on the request. If not, the data
is requested from main memory. In this manner
several prefetch requests are started in successive
cycles. This method results in multiple memory
banks being used in parallel. Vector data comes
back to the cache at a rate of 500 megabytes
(MB) per second. The microcode stops issuing
prefetch requests when all the vector data has been
requested. This ensures that the requests from the
V-box do not encounter many cache misses.
Vector Address-Mask Unit
The vector address-mask unit performs the address
generation and memory requests needed to execute
the vector memory instructions VLD, VST,
VSCAT, and VGATH. It also contains the vector mask
register and support logic for masked instructions.
Further, it contains the complete vector arithmetic
exception register (VAER), which it updates based
on the status sent by the vector add and vector
multiply units.
For vector memory instructions, the vector
address-mask unit receives the base (starting memory
address of the vector) and stride (distance
between vector elements in memory) of the instruction
from the vector control unit in an indirect
manner through the vector register unit. Both the
base and stride are 32 bits long.
For most vector load and store instructions, the
memory addresses for the vector data are generated
in an iterative fashion. During the first cycle of
execution, the base address bypasses the address adder
and is immediately sent to the M-box to request the
first element. Concurrently, the base and stride are
added together by the address adder and latched to
provide the address of the next element. In the next
cycle, the latched address is sent to the M-box and
to the address adder, where it is added to the stride
to generate the next address. The process repeats
until all element addresses have been issued. In
tandem with the address generation, the vector
control unit directs the vector register unit to send
or receive the appropriate vector register element.
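
The iterative base-plus-stride scheme can be modeled with a small generator. This is a sketch, not the hardware: the name strided_addresses is hypothetical, each loop iteration stands in for one cycle, and wrapping addresses to 32 bits is an assumption based on the stated 32-bit base and stride.

```python
ADDRESS_MASK = 0xFFFFFFFF  # assumed 32-bit virtual address wrap-around


def strided_addresses(base: int, stride: int, length: int):
    """Yield one element address per iteration (one iteration per cycle).

    The first address is the base itself (it bypasses the adder); each
    later address is the previously latched sum plus the stride.
    """
    latched = base & ADDRESS_MASK
    for _ in range(length):
        yield latched
        latched = (latched + stride) & ADDRESS_MASK  # address adder + latch
```

A unity-stride quadword load of four elements starting at 0x1000 would therefore request 0x1000, 0x1008, 0x1010, and 0x1018; a negative stride walks backward through memory in the same way.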
For vector gather and scatter instructions, the
memory addresses for the vector data are also
issued in an iterative fashion. During the first cycle
of execution, the base address is sent to the vector
address unit. In the second cycle, the vector control
unit directs the vector register unit to send the first
element of the offset vector to the vector address
unit, which adds it to the base and latches the result.
In the third and subsequent cycles, the resulting
address is sent to the M-box while the base and next
offset are added together. The process repeats until
all element addresses have been issued. In tandem
with the address generation, the vector control unit
directs the vector register unit to send or receive the
appropriate vector register element.
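
For gather and scatter, each address is the base plus one element of the offset vector rather than a running sum. A minimal sketch, assuming byte offsets and 32-bit wrap-around (the function name gather_addresses is hypothetical):

```python
def gather_addresses(base: int, offsets):
    """Return the memory address for each element of an offset vector.

    Models the "base + offset[i]" addition performed once per element;
    the one-cycle pipelining of add and issue is not modeled.
    """
    return [(base + offset) & 0xFFFFFFFF for offset in offsets]
```

Unlike the strided case, the offsets need not be ordered or evenly spaced, which is why no address can be derived from its predecessor.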
For masked vector load and gather instructions,
addresses for all elements, masked and unmasked,
are sent to the M-box. However, for masked-out
elements, the request is modified from read to
read no-op (i.e., do not actually perform the read).
This process prevents the M-box from taking cache
misses and address translation exceptions on
masked-out elements. For masked-out elements,
the M-box returns a dummy value to the V-box,
which blocks the value from being written to the
vector register unit. The vector address unit directs
the control unit to block writes, based on the value
of the appropriate vector mask register bit.
For masked vector store and scatter instructions,
although both masked and unmasked elements
are read from the vector register unit, masked-out
elements are stopped from reaching the M-box. The
vector address unit, based on the vector mask
register, causes the E-box to discard the masked-out
element instead of forwarding it to the M-box.
As described earlier, a VLDQ instruction with a
stride of 8 bytes (unity stride) is converted by the
VAX 9000 scalar processor into a block load
instruction when sent to the V-box. The vector address
unit, in turn, issues a number of block load requests,
each of which is for 64 bytes of data, to the M-box
with the appropriate address and selection bits.
There are eight selection bits, one for each
quadword in the block, which tell the M-box whether to
return the corresponding quadword to the V-box
for that block load request. Generation of these
selection bits by the vector address unit is
complicated because the starting address of a vector in
memory is not necessarily aligned on a block boundary
(i.e., the vector may start in the middle of a block).
The bits also depend on the vector mask register (for
masked block loads).
To handle unaligned, masked block loads, the
vector address unit must generate selection bits that
deselect those quadwords which are not part of the
vector but lie within the same blocks as the first
and last elements of the vector. In addition, it must
deselect those quadwords within the vector that
are masked out by the vector mask register. Both of
the above requirements are handled by using an
extended version of the vector mask register to
generate the selection bits. This process involves
conceptually extending the vector mask register on
both ends with enough selection bits so that each
quadword has a corresponding selection bit. For
example, a vector starting at the last quadword of
one block requires that seven selection bits be
added at the beginning of the vector mask register
and one bit be added after the end .
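
The extended-mask idea can be made concrete with a short Python sketch. It rests on several assumptions that the text does not state: a mask bit of 1 means "element enabled," the base address is quadword aligned, and the function name block_load_requests and its return shape are hypothetical.

```python
BLOCK_BYTES = 64       # one cache block per block load request
QUADWORD_BYTES = 8
QW_PER_BLOCK = BLOCK_BYTES // QUADWORD_BYTES  # eight selection bits


def block_load_requests(base: int, length: int, mask=None):
    """Return a list of (block_address, selection_bits) pairs.

    The vector mask is conceptually extended with zeros at both ends so
    that every quadword of every touched block has a selection bit.
    """
    if base % QUADWORD_BYTES:
        raise ValueError("sketch assumes a quadword-aligned base address")
    if mask is None:
        mask = [1] * length               # unmasked load: all elements enabled
    lead = (base % BLOCK_BYTES) // QUADWORD_BYTES   # bits added at the start
    trail = (-(lead + length)) % QW_PER_BLOCK       # bits added after the end
    extended = [0] * lead + list(mask) + [0] * trail

    first_block = base - (base % BLOCK_BYTES)
    return [
        (first_block + i * QUADWORD_BYTES, extended[i:i + QW_PER_BLOCK])
        for i in range(0, len(extended), QW_PER_BLOCK)
    ]
```

With a 64-element vector whose first element is the last quadword of a block (byte offset 56), the sketch adds seven bits before the mask and one after it, giving nine block requests, which matches the example in the text.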
Vector Multiply Unit
The vector multiply unit performs all of the vector
multiply and vector divide operations defined by
the VAX vector architecture: VVMUL, VSMUL,
VVDIV, and VSDIV. The unit can perform either one
multiply instruction or one divide instruction at a
time, but cannot perform both types of instructions
simultaneously. In addition, the unit performs
exception checking and reporting, as required,
including floating overflow, floating underflow, and
divide by zero exceptions. The unit consists of
four custom multipliers, a custom divider, a divide
unpack chip, and two packing chips. Physically,
these chips reside on the VML multichip unit of the
VAX 9000 CPU. The custom multipliers and divider
are identical to those used in the scalar execution
engine (E-box). [8]
Multiplication
By using four parallel multipliers, the pipeline
latency through the multiplication logic for both
single precision (integer and F_floating) and
double precision (G_floating and D_floating) is only
three cycles. Thus, the vector/scalar cross-over
number for multiplication is quite small. As a result,
the V-box achieves good performance for vector
multiply instructions with small vectors as well as
large. As a double-precision vector multiply
instruction executes, two 64-bit elements are received
from the vector register unit each cycle and are
latched in the four custom multipliers, each of which
does a 32-bit by 32-bit multiplication.
As shown in Figure 9, the element bits are
distributed in such a way that one multiplier operates
on the high-order bits of both elements; one
multiplier operates on the low-order bits of operand one
and the high-order bits of operand two; one
multiplier operates on the high-order bits of operand one
and the low-order bits of operand two; and one
multiplier operates on the low-order bits of both
elements.
During the next clock cycle, each of the four
multipliers unpacks its inputs and sends them through
a large multiplication array, which produces one
64-bit partial product and latches the product.
During the third cycle, the pack chips (VMLA and
VMLB) add the four 64-bit partial products together
to produce one result and prepare the result to be
written back to the vector register unit. In this
cycle, the four partial products are shifted according
to their weight. Weight is determined in relation
to which bits the multiplier used to produce a
result. For example, the multiplier that operated on
the high-order 32 bits (most significant bits) of both
elements produces the most significant partial
product bits, and the multiplier that operated on
the low-order 32 bits (least significant bits) of both
elements produces the least significant partial
product bits. The partial products must be aligned
or shifted properly before they are added together.
Once the partial products have been added, the
final product is then rounded, normalized, and
packed into the appropriate VAX integer or floating
point format before being written into the vector
register unit in the next cycle.
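
The weighting of the four partial products follows from ordinary long multiplication on 32-bit halves. The sketch below models only the arithmetic identity on unsigned 64-bit integers; the function name multiply_64x64 is hypothetical, and the rounding, normalizing, and packing steps of the real unit are not represented.

```python
MASK32 = 0xFFFFFFFF


def multiply_64x64(a: int, b: int) -> int:
    """Form a 128-bit product from four 32-bit by 32-bit partial products."""
    a_hi, a_lo = a >> 32, a & MASK32
    b_hi, b_lo = b >> 32, b & MASK32

    # One partial product per multiplier, each at most 64 bits wide.
    hi_hi = a_hi * b_hi   # most significant partial product
    lo_hi = a_lo * b_hi
    hi_lo = a_hi * b_lo
    lo_lo = a_lo * b_lo   # least significant partial product

    # Shift each partial product according to its weight, then add.
    return (hi_hi << 64) + ((lo_hi + hi_lo) << 32) + lo_lo
```

The two cross terms carry the same weight (2^32), so they can be summed before shifting.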
The process and pipeline stages for single-precision
multiplication (VVMULF and VSMULF) are
similar to the process used for double-precision
multiplication. However, in single-precision
multiplication, only one multiplier chip is needed to
produce the result and the pack chips do not need to
sum the partial products. Integer multiplication is
slightly different from floating point multiplication
because it does not need to be accumulated or
rounded. Thus, the correct product is produced
by one multiplier. The result bypasses the
accumulation and rounding logic and proceeds directly
into the packing logic to be sent to the vector
register unit.

Figure 9   Vector Multiply Unit (block diagram: 32-bit halves of
VREG_SOURCE1 and VREG_SOURCE2 feed the four custom multipliers;
the partial products are accumulated on VMLA/VMLB, which share
the result path with the divider (DIVU) and combine the final
product with exception data and the final exponent from the
exponent logic to form VML_RESULT[63:0], sent to the vector
register unit)

The exponent handling for both multiplication
and division is performed by the same logic on the
packing chips. Depending on the instruction being
executed, the exponents are either added
(multiplication) or subtracted (division). The result of this
operation is then piped to the next stage and the
position of the hidden bit is determined. If the
fractional portion of the data must be shifted to ensure
the hidden bit is in the correct position, the
exponent is then incremented or decremented accordingly.
The normalize count (i.e., shift count) is used
to select the correct final exponent. Overflow and
underflow exceptions can only be detected
and reported after the final exponent is selected. If
an exception is detected, then a reserved operand is
written to the appropriate vector register element.
The first stage of the exponent logic also checks for
divide by zero and reserved operand exceptions.
Division
Vector division is a variable-cycle function.
The number of cycles depends on the format
of the operands. The custom divider is capable of
producing six quotient bits per cycle. Therefore,
F_floating point division is performed in 7 cycles,
G_floating point in 12 cycles, and D_floating
point in 13 cycles. Because of the variable number
of cycles in a divide instruction, no other
instruction can execute in the V-box while a divide is in
process. Also, because of the iterative nature of
division (i.e., one division must be completed before
another can be started), the instruction cannot be
pipelined.
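
The quoted cycle counts are consistent with a simple estimate: the number of fraction bits divided by six quotient bits per cycle, rounded up, plus a fixed overhead. In the sketch below, the fraction widths (including the hidden bit) are those of the VAX formats, but the three-cycle overhead is purely an assumption chosen because it reproduces the figures in the text; the article does not break the cycles down this way.

```python
import math

# Fraction widths of the VAX floating point formats, hidden bit included.
FRACTION_BITS = {"F_floating": 24, "G_floating": 53, "D_floating": 56}
QUOTIENT_BITS_PER_CYCLE = 6
ASSUMED_OVERHEAD_CYCLES = 3   # assumption: unpack, round/normalize, pack


def divide_cycles(fmt: str) -> int:
    """Estimate the cycles for one vector-element division."""
    iterations = math.ceil(FRACTION_BITS[fmt] / QUOTIENT_BITS_PER_CYCLE)
    return iterations + ASSUMED_OVERHEAD_CYCLES
```

Under this model F_floating needs 4 iterations, G_floating 9, and D_floating 10, giving 7, 12, and 13 cycles respectively.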
As a vector divide instruction executes, two
64-bit elements are received from the vector
register unit each cycle and are latched in the divide
unpack chip. The elements are unpacked, and the
fractional portion of the elements is sent to the
custom divider in 32-bit slices. The exponent portion
is sent to the shared exponent logic on the packing
chips, as described in the Multiplication section.
During this cycle, time-critical values, such as
complemented element values and first-cycle quotient
bits, are calculated and forwarded to the custom
divider.
When the divider receives the data, it uses an
iterative algorithm to produce six quotient bits per
cycle. The quotient bits produced are then sent to
the packing chips, which may have to increment
the quotient, depending on the value of subsequent
quotient bits. The divider instructs the quotient
accumulation logic whether or not incrementing is
necessary. The partial quotient, once decided, is
held in a bank of latches until all the quotient bits
are received. When the entire quotient is available,
the result is rounded, normalized, and packed by
using the same logic path as multiplication. A
multiplexer switches this packing logic between the
multiplication and division logic.
Performance Characteristics
As of this writing, testing of the vector performance
of the VAX 9000 system has only just begun. However,
some preliminary results are presented in
Table 3. We expect that these results will improve
as testing continues and more code is optimized
to take advantage of the chaining and overlapping
provided by the V-box.

Table 3   VAX 9000 Model 210 Preliminary Performance,
          Double-precision MFLOPS, Uniprocessor

                            Size          Vector
Peak rate                   NA            125
LFK (Geometric mean)        441           13.2
LFK (Arithmetic average)    441           20.6
LINPACK                     1000²         80
FFT                         4096          26
Convolution                 150 x 1500    99.15
Matrix multiply             64²           111.36

Chaining and Overlapping
Because of the design of the vector register unit,
the V-box can concurrently execute a vector add-class
instruction, vector multiply instruction, and
vector memory instruction. Unlike the VAX 6000
Model 400 system, vector register conflicts between
these instructions have little effect on overlapping. [4]
With the VAX 9000 system, a conflict delays
the execution of the subsequent vector instruction
by one or two cycles at most.
However, the overlapping behavior of the V-box
is sensitive to the issue order of vector instructions.
If two vector instructions executed by the same
V-box unit are issued one after the other, the second
instruction is delayed until the V-box unit has
finished executing the first. In addition, vector
instructions issued after a vector memory instruction or
divide instruction do not begin execution until the
previous instruction completes. A general rule in
scheduling code for the VAX 9000 V-box is to
generate, whenever possible, instruction triples, where
the first two instructions are a vector add-class and
vector multiply instruction and the last instruction
is a vector memory or vector divide instruction.
Failing that, at least one vector add-class or vector
multiply instruction should be issued before a
vector memory or vector divide instruction.
The following code examples demonstrate the
usage of the VAX vector instruction set and the
overlapping behavior of the VAX 9000 V-box. (Note: It
should be assumed in the examples that all arrays
are 8-byte double precision.)
In the following DAXPY inner loop example, the
first two VLDQ instructions do not overlap. However,
the VSMULD, VVADDD, and VSTQ instructions
do overlap.

      Do i = 1, 64
        DY(i) = DY(i) + DA * DX(i)
      enddo

vectorizes as:

      VLDQ     DX, #8, V0     ;Load vector DX
      VLDQ/M   DY, #8, V2     ;Load vector DY
                              ;with modify intent
      VSMULD   DA, V0, V1     ;V1 = DA*DX
      VVADDD   V1, V2, V3     ;V3 = V1+DY
      VSTQ     V3, DY, #8     ;Store vector DY

The first two VLDQ instructions do not overlap in
the following MERGE example.

      Do i = 1, 64
        a(i) = b(i) - c(i)
        if (a(i) .gt. 0) then
          b(i) = a(i)
        else
          b(i) = c(i)
        endif
      enddo

vectorizes as:

      VLDQ     b, #8, V0      ;Load vector b
      VLDQ     c, #8, V1      ;Load vector c
      VVSUBD   V0, V1, V2     ;b-c
      VSTQ     V2, a, #8      ;Store vector a
      VSLSSD   #^X0, V2       ;Test a(*) and set mask
                              ;in VMR. (VSCMP
                              ;pseudo-op doing Less
                              ;Than Signed test)
      VVMERGE  V1, V2, V0     ;Merge a and c into b
                              ;using mask in VMR
      VSTQ     V0, b, #8      ;Store vector b

However, the VVSUBD instruction does overlap
with the VSTQ instruction. Both the VSLSSD
(VSCMP) and VVMERGE instructions are executed by
the vector add unit. Therefore, these two
instructions do not overlap. However, the VVMERGE
instruction does overlap with the VSTQ instruction.
An IF-THEN-ELSE example, such as the
following,

      Do i = 1, 64
        if (a(i) .gt. 0) then
          b(i) = c(i)
        else
          b(i) = c(i) / a(i)
        endif
      enddo

vectorizes as:

      VLDQ      a, #8, V0     ;Load vector a
      VSLSSD    #^X0, V0      ;Test a(*) and set mask
                              ;in VMR. (VSCMP
                              ;pseudo-op doing Less
                              ;Than Signed test)
      VLDQ      c, #8, V1     ;Load vector c
      VVDIVD/0  V1, V0, V2    ;Masked divide of c by a
                              ;for VMR(i) = 0
      VSTQ/1    V1, b, #8     ;Store "then" part of b(*)
      VSTQ/0    V2, b, #8     ;Store "else" part of b(*)

Nothing overlaps the first VLDQ instruction, but
the VSLSSD instruction does overlap the second
VLDQ instruction. Nothing can overlap with the
VVDIVD instruction. Thus, the first VSTQ instruction
does not begin execution until the VVDIVD
instruction completes. The remaining VSTQ instruction
waits for the first VSTQ instruction to complete.
In the following scatter-gather example, none of
the instructions is overlapped.

      Do i = 1, 64
        if (a(i) .eq. 0) then
          b(i) = c(i)/d(i)
        endif
      enddo

vectorizes as:

      VLDQ     a, #8, V0      ;Load vector a
      VSEQLD   #^X0, V0       ;Test a(*) for zero and
                              ;set mask. (VSCMP pseudo-
                              ;op doing Equal test)
      IOTA     #8, V1         ;Make compressed
                              ;vector of offsets
                              ;, write size of vector
                              ;to VCR
      MFVCR    R0             ;Move VCR into R0
                              ;(MFVP pseudo-op)
      MTVLR    R0             ;Load new VLR value
                              ;(MTVP pseudo-op)
      VGATHQ   c, V1, V2      ;Gather vector c
                              ;using offsets in V1
      VGATHQ   d, V1, V3      ;Gather vector d
                              ;using offsets in V1
      VVDIVD   V2, V3, V4     ;Divide c by d
      VSCATQ   V4, b, V1      ;Scatter vector b using
                              ;offsets in V1

It should be noted in this example that the
VSEQLD and the IOTA instructions do not overlap.
This lack of overlap occurs because the IOTA
instruction is actually done with microcode on the
E-box, and the IOTA instruction cannot begin
execution until the VSEQLD instruction has computed
all the new vector mask register bits. The vector
register access instructions (MFVCR and MTVLR)
take only a few cycles and do not significantly affect
the overlapping of other vector instructions.
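
The issue-order rules from the Chaining and Overlapping section can be captured in a toy model that groups consecutive instructions that may execute concurrently. This is a deliberate simplification and an assumption throughout: the function overlap_groups and the UNIT table are hypothetical, register conflicts and cycle counts are ignored, and E-box microcoded instructions such as IOTA are not modeled.

```python
# Which V-box unit executes each opcode (only opcodes from the examples).
UNIT = {
    "VVADDD": "add", "VVSUBD": "add", "VSLSSD": "add", "VVMERGE": "add",
    "VSMULD": "mul", "VVDIVD": "mul",
    "VLDQ": "mem", "VSTQ": "mem",
}
DIVIDES = {"VVDIVD"}
# Instructions issued after these wait for them to complete.
BLOCKS_FOLLOWERS = {"VLDQ", "VSTQ"} | DIVIDES


def opcode(instruction: str) -> str:
    """Strip qualifiers such as /M, /0, /1 from an instruction name."""
    return instruction.split("/")[0]


def overlap_groups(instructions):
    """Group instructions that, under the simplified rules, may overlap."""
    groups = []
    for instruction in instructions:
        op = opcode(instruction)
        join = bool(groups) and op not in DIVIDES  # a divide runs alone
        if join:
            current = groups[-1]
            busy_units = {UNIT[opcode(i)] for i in current}
            previous = opcode(current[-1])
            if previous in BLOCKS_FOLLOWERS or UNIT[op] in busy_units:
                join = False
        if join:
            groups[-1].append(instruction)
        else:
            groups.append([instruction])
    return groups
```

Applied to the DAXPY sequence, the model yields three groups: each VLDQ alone, then VSMULD, VVADDD, and VSTQ together, which agrees with the overlap described for that example. The same function reproduces the groupings described for the MERGE and IF-THEN-ELSE examples.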
Summary
By taking advantage of key features of the VAX
vector architecture, such as instruction overlapping,
imprecise exceptions, and asynchronous
interaction with the scalar processor, the vector
processor of the VAX 9000 system provides
supercomputing performance for computationally
intensive applications. Through the use of barber poling,
the vector processor can overlap two vector
arithmetic instructions with one memory instruction
to deliver a peak double-precision performance of
125 MFLOPS.
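
The peak figures follow from the machine cycle time. The arithmetic below assumes the 16-nanosecond cycle time quoted for the VAX 9000 elsewhere in this issue, one add result plus one multiply result per cycle at peak, and 1 MB = 10^6 bytes; the function names are hypothetical.

```python
CYCLE_TIME_NS = 16                          # VAX 9000 machine cycle time
CYCLES_PER_SECOND = 1e9 / CYCLE_TIME_NS     # 62.5 million cycles per second


def peak_mflops(results_per_cycle: int = 2) -> float:
    """Peak rate with one add and one multiply result every cycle."""
    return results_per_cycle * CYCLES_PER_SECOND / 1e6


def bandwidth_mb_per_second(bytes_per_cycle: int = 8) -> float:
    """Data rate when 64 bits (8 bytes) move every cycle."""
    return bytes_per_cycle * CYCLES_PER_SECOND / 1e6
```

Two floating point results per 16 ns cycle give 125 MFLOPS, and one quadword per cycle gives 500 MB per second, matching the peak rate and the vector data return rate cited in the text.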
Acknowledgments
The authors wish to acknowledge the technical
contributions of the following individuals to the
VAX vector architecture and the VAX 9000 V-box
design: Wayne Cardoza, Dave Cutler, Tryggve
Fossum, Rich Grove, Kevin Harris, Steve Hobbs,
Brian Koblenz, Dwight Manley, Dave Orbits,
Bob Supnik, Mike Tehranian, Cheryl Wiecek, and
Rich Witek.
References
1. R. Russell, "The CRAY-1 Computer System,"
   Communications of the ACM, vol. 21, no. 1 (January 1978):
   63-72.
2. VAX Vector Processing Handbook (Maynard:
   Digital Equipment Corporation, Order No.
   EC-H0419-46/89, 1989).
3. R. Brunner, VAX Architecture Reference Manual
   (Bedford: Digital Press, Order No. EY-F576E-DP,
   1990).
4. D. Fenwick et al., "A VLSI Implementation of
   the VAX Vector Architecture," Proceedings of
   COMPCON '90 (IEEE, Spring 1990).
5. CRAY-2 Computer System Functional Description
   (Cray Research, Inc., 1985).
6. W. Buchholz, "The IBM System/370 Vector
   Architecture," IBM Systems Journal, vol. 25, no. 1
   (1986): 51-62.
7. D. Marshall and J. McElroy, "VAX 9000 Packaging -
   The Multichip Unit," Proceedings of
   COMPCON '90 (IEEE, Spring 1990).
8. M. Adiletta et al., "Semiconductor Technology
   in a High-performance VAX System," Digital
   Technical Journal, vol. 2, no. 4 (Fall 1990, this
   issue): 43-60.

Peter B. Dunbeck
Richard J. Dischler
James B. McElroy
Frank J. Swiatowiec
HDSC and Multichip Unit
Design and Manufacture
The VAX 9000 system effectively integrates state-of-the-art packaging and
interconnects with advanced integrated circuits to achieve a short machine cycle time
(16 nanoseconds) and a high rate of instruction execution. To meet high-frequency
electrical signal and pin count requirements for the system, engineers chose tape
automated bonding technology and consequently conceived and developed the
high-density signal carrier (HDSC). The HDSC offers densities three to five times greater
than conventional printed circuit boards. This unique technology is manufactured
using semiconductor and advanced printed circuit board techniques. The HDSC is
at the heart of the multichip unit, a high-performance logic module, with which the
VAX 9000 CPUs and system control unit are constructed.
Over the past decade, advances in the performance
of integrated circuits (ICs) have outpaced advances
in packaging and interconnect technologies. Thus a
high-performance mainframe with conventionally
packaged bipolar integrated circuits would
experience interconnect delays that account for more
than 50 percent of the system cycle time. Key to
optimizing high-end mainframe performance, then,
is the effective integration of state-of-the-art
packaging and interconnects with advanced integrated
circuits. The high-density signal carrier (HDSC) and
the multichip unit (MCU) are proprietary
technologies that shrink interconnect paths and thus
reduce the distance and electrical loading of signals
between chips. These technologies use conventional
semiconductor and printed circuit board
(PCB) equipment in many areas of manufacturing
to improve reliability at a competitive cost. The
result is shorter machine cycle time and higher
instruction execution rate. The VAX 9000 CPUs and
system control unit (SCU) are constructed entirely
of multichip units on large planar modules. The SCU
is composed of arrays of 6 multichip units, and the
CPUs are composed of arrays of 16.
Multichip Unit Design Goals
Beginning at the concept level and throughout the development and test phase,
signal integrity considerations guided the development of the HDSC and the
multichip unit. Designers had to ensure that the fast signals would not be
disturbed by noise. The cycle time goal for the VAX 9000 system,
16 nanoseconds (ns), allows the system to operate at 30 VAX units of
performance (VUPs).
To transmit electrical signals quickly between chips, wiring paths must have
controlled ratios of wire size to distance from voltage planes. These
impedance-controlled paths allow radio-frequency computer signals to propagate
with minimal distortion. Prevention of noise on the signals is paramount and
many details of the physical implementation, including spacings between wires,
are critical to ensuring signal integrity.
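
As a rough illustration of why these signals are treated as radio-frequency signals, the sketch below combines two figures quoted elsewhere in this paper (200-picosecond edges and a 66 picoseconds/centimeter propagation delay) with two common engineering rules of thumb. The 0.35 bandwidth factor and the one-sixth critical-length factor are assumptions added here for illustration; they are not taken from the paper.

```python
# Rule-of-thumb estimates; the 0.35 and 1/6 factors are common engineering
# approximations, not values taken from the paper.

EDGE_TIME_PS = 200.0      # edge speed quoted for the bipolar chips
DELAY_PS_PER_CM = 66.0    # HDSC propagation delay listed in Table 1


def knee_frequency_ghz(edge_time_ps: float) -> float:
    """Approximate bandwidth of a digital edge: 0.35 / rise time."""
    return 0.35 / edge_time_ps * 1000.0  # 1/ps equals 1000 GHz


def edge_length_cm(edge_time_ps: float, delay_ps_per_cm: float) -> float:
    """Distance a signal travels along the wire during one edge transition."""
    return edge_time_ps / delay_ps_per_cm


def critical_length_cm(edge_time_ps: float, delay_ps_per_cm: float) -> float:
    """Wire length above which reflections matter (one sixth of the edge length)."""
    return edge_length_cm(edge_time_ps, delay_ps_per_cm) / 6.0


if __name__ == "__main__":
    print(f"bandwidth       ~ {knee_frequency_ghz(EDGE_TIME_PS):.2f} GHz")
    print(f"edge length     ~ {edge_length_cm(EDGE_TIME_PS, DELAY_PS_PER_CM):.2f} cm")
    print(f"critical length ~ {critical_length_cm(EDGE_TIME_PS, DELAY_PS_PER_CM):.2f} cm")
```

Under these assumptions an edge occupies only about 3 centimeters of wire, so chip-to-chip paths longer than roughly half a centimeter already behave as transmission lines and need a controlled impedance.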
To meet the cycle time goal, the design had to address high-frequency
electrical signal concerns that would have been negligible for slower speed
signals. Due to the physics of electrical fields, as electrical signals switch
at high frequencies, they succeed in holding their shape (data) only if they
are fed power extremely quickly, and if they are given short paths of uniform
properties on which to travel. Due to the amount of power and the short amount
of time a signal is given to arrive on chip, conventional chip carrier packages
were ruled out for the VAX 9000 system. The signal paths had to be very short
to be virtually noiseless. To achieve this objective, engineers decided to
enhance tape automated bonding (TAB) technology with a ground plane for
electrical control of the wire impedances (paths). This reduction in chip
package size also allowed all of the chips for the system to be packaged into a
tight area. Consequently, to fit wires between chips, extremely dense HDSC
technology was conceived and developed.
The multichip unit also required careful thermal design attention because each
chip consumes up to 30 watts. Moreover, most multichip units contain four to
eight of these chips plus self-timed RAMs (STRAMs). The key to success for the
VAX 9000 program was balancing the trade-offs between performance requirements
and technology development risks.
To meet the electrical and density requirements for the machine, engineers
specified the following for the multichip unit:

1. Series-terminated output drivers were required on chip. Therefore, external
   resistors are not needed on the multichip units or programmed into the
   design elsewhere. These external resistors take up space and lower
   reliability.

2. TAB was specified for manufacturing reasons. Short TAB tape was required to
   reduce switching noise on chips. Noise would have been generated if the TAB
   wires were longer. In the case of the noisiest chips, a ground plane was
   added to the tape to reduce noise.

3. HDSC etch had to be two routing layers of 18-micron by 9-micron wires on
   75-micron centers to meet the density, resistivity, crosstalk, impedance
   and other goals.

4. Four power planes, each one powered from two sides, were required to
   distribute three voltage rails with acceptably high conductivity.

5. Thin dielectric separates the power planes and produces high capacitance
   which filters noise and improves performance. This capacitance eliminates
   the need for discrete parts which consume valuable space and lower
   reliability.

6. Impedance control of the connectors on the multichip unit was needed to
   prevent signal disturbance. Rules were generated for the number of ground
   pins.

The heart of the multichip unit is the HDSC. The HDSC is an interconnect
technology consisting of nine metal layers separated by polyimide dielectric
and mounted on a copper baseplate. The top metal layer is a pad layer used to
solder-attach all of the integrated circuits and connectors. The four metal
layers below make up the signal core. The signal core is a
controlled-impedance, dual buried strip line interconnect system used to wire
all integrated circuits to each other and to the connectors. The power is
brought from the perimeter of the HDSC to the integrated circuits through the
bottom four metal layers.

All integrated circuits in the multichip unit are attached to the HDSC by a
tape automated bonding (TAB) process. The VAX 9000 system uses four types of
chips, all of which have emitter coupled logic (ECL): gate arrays, custom
chips, and two types of STRAMs. At each chip site, a cutout in the HDSC allows
the chip to directly attach to the baseplate. The signals on and off the
multichip unit are carried by four signal flex connectors which attach to the
perimeter of the HDSC. The signal flex connector provides a separable
interface to the planar board and extends the controlled-impedance electrical
environment of the HDSC. Power is brought through two power connectors
attached to opposite sides of the HDSC. The signal flexes, the power
connectors, and the baseplate are attached to the multichip unit housing. The
housing provides the structure for the multichip unit and holds the components
needed to position and wipe the signal flex. The chips and HDSC surface are
covered by a plastic lid.

The high-powered chips are efficiently cooled by a short conductive path
through the back of each chip. The thermal power is conducted from the chip to
the baseplate and into a pin fin heat sink over which air is impinged to
remove the heat.

The following sections describe the implementation of the technology.

The HDSC Design and Manufacturing Process
The goal for the HDSC project was to produce a high-density, high-performance,
manufacturable printed circuit board. This goal was achieved. The density of
the HDSC is three to five times greater than that of conventional printed
circuit boards. Even at this density, the HDSC maintains the signal integrity
of bipolar integrated circuits with edge speeds of 200 picoseconds. This
section describes how the manufacture of the HDSC pushes the limits of printed
circuit board and semiconductor equipment into new types of applications. We
also address the integration of computer-aided design (CAD) tools, process
controls, and test feedback, which helped us to achieve the results we sought.
HDSC Technology
As noted earlier, the HDSC has nine copper layers for power and signal
distribution. The insulating material, polyimide, has a low dielectric
constant of 3.5 as compared with oxide or nitrides used in integrated circuits
or as compared with ceramic, which is used for hybrid circuits. The
interconnect is laminated to a copper baseplate to provide mechanical
structure as well as attachment of the multichip unit heat sink.
The conducting layers consist of the following:
• Two layers for signal distribution
• Two layers that serve as signal reference planes
• Four layers for power distribution
• One layer with bonding pads to attach the TAB and connectors
The signal distribution is a single x-y pair that uses the reference planes to
create a dual strip line interconnect. This interconnect provides a
controlled-impedance signal path with minimal crosstalk. Table 1 lists the
electrical and physical design parameters of the HDSC.
Process Overview
The HDSC is manufactured by two types of processes: core processing and
assembly processing. Figure 1 is a diagram of the HDSC process flow. The core
process, described further below, uses semiconductor manufacturing equipment
and is similar to the manufacturing process for the back end of an integrated
circuit. Two cores are manufactured: a signal core for strip-line signal
interconnect, and a power core for the four planes (or layers) that distribute
power throughout the finished HDSC.
The second process, assembly, uses advanced printed circuit board techniques
to laminate and interconnect the signal core and power core. The completed
HDSC has solder pads to accept the outer lead bond of TAB integrated circuits,
signal flex, and power flex. The HDSC is tested with a custom flying probe
tester. Tests are made to ensure the HDSC is functional and meets electrical
parameters.
Table 1  HDSC Physical and Electrical Design Parameters

Line pitch              75 microns
Line width              18 microns
Line thickness          10 microns
Dielectric thickness    25 microns
Dielectric constant     3.5
Line impedance          60 ohms
Line resistance         1.0 ohm/centimeter
Crossover capacitance   3.6 femtofarads
Crosstalk               5.1 percent maximum
Propagation delay       66 picoseconds/centimeter
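
Two of the Table 1 entries can be cross-checked from the others with textbook formulas. The sketch below is only a sanity check under stated assumptions: it treats the line as fully embedded in a uniform dielectric (so delay is the square root of the dielectric constant divided by the speed of light) and uses a bulk copper resistivity of 1.7e-8 ohm-meter, a value that is assumed here and is not given in the paper.

```python
import math

C_CM_PER_PS = 0.0299792458  # speed of light in centimeters per picosecond


def ideal_delay_ps_per_cm(dielectric_constant: float) -> float:
    """Propagation delay of a line fully embedded in a uniform dielectric."""
    return math.sqrt(dielectric_constant) / C_CM_PER_PS


def dc_resistance_ohm_per_cm(width_um: float, thickness_um: float,
                             resistivity_ohm_m: float = 1.7e-8) -> float:
    """DC resistance per centimeter of a rectangular conductor.

    The default resistivity is an assumed bulk-copper value, not a figure
    from the paper.
    """
    area_m2 = (width_um * 1e-6) * (thickness_um * 1e-6)
    return resistivity_ohm_m / area_m2 * 1e-2  # ohm per meter -> ohm per cm


if __name__ == "__main__":
    print(f"ideal delay ~ {ideal_delay_ps_per_cm(3.5):.1f} ps/cm")
    print(f"resistance  ~ {dc_resistance_ohm_per_cm(18.0, 10.0):.2f} ohm/cm")
```

The idealized delay comes out a few picoseconds per centimeter below the tabulated 66, and the resistance lands close to one ohm per centimeter, so the tabulated values are mutually plausible; the remaining differences are expected given the simplifications.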
Figure 1  HDSC Core and Assembly Process Flow
(Core process flow: the signal core, with etched copper lines and vias, and the
power core, with whole planes, are each built from metal and polyimide layers
and tested before going on to the MCU.)
Core Processing  The process for the manufacture of the signal and power
cores, or the core process, consists of alternating between copper deposition
and polyimide coating until the completed interconnect layers are built on the
metal wafer. The process is performed on a metal substrate shaped like a
6-inch semiconductor wafer. Copper layers are deposited by a combination of
sputtering and plating techniques. Patterns in the copper that become signal
traces are generated by a semiconductor photolithographic technique. First, a
photoresist is applied to the metal wafer. The resist is then exposed to the
pattern in the mask that is held by the semiconductor wafer aligner. This
pattern is then developed in the resist and etched into the copper. The
remaining copper thickness is then added by plating. Another resist pattern is
developed over the plated signal traces to define where a copper connection
between interconnect layers will occur. This connection is called a via post,
and it is also formed by a plating process.
Polyimide is spun on to the wafers by integrated
circuit photoresist spin tracks. The relatively thick
polyimide (25 microns at signal layers) helps to
planarize the surface of the wafers and also to cover
Figure 3  Isometric of a Gate Array Showing Features of the TAB
(Labels: integrated circuit, 200-micron pitch inner lead bond, TAB,
encapsulation, outer lead bond, high-density signal carrier.)
no plating is required for epoxy die attach. The epoxy die attach is filled
with microscopic particles to enhance the thermal conductivity while
maintaining electrical isolation between chips.
Signal Flex Connector
The signal flex connector is a high-density, controlled-impedance connector
used to transmit signals between the HDSCs and the planar module. Each
multichip unit has four flex connectors with a combined signal I/O of 800 in
an area less than 40 square centimeters. Figure 4 shows a cross section of one
signal flex connector. The body of the connector is a two-metal-layer flex
print with 50- and 60-ohm signal lines. The ground plane in the flex circuit
is used as an AC return path. No power is carried through the signal flex. The
signal plane contains 200 etch lines with a raised gold bump on each at the
planar module interface. The connection to the HDSC is a solder bond similar
to the solder bonds for the TAB device. A window is opened through the
polyimide to allow the formation of cantilevered, exposed, solder-plated
leads.
The raised bump on the flex circuit concentrates the contact force into a
small area. The bump is solid copper that is plated over with nickel and hard
gold. The force on the bump is generated by compressing a molded silicone
rubber elastomer. The compression of the connector causes the flex frame to
engage a cam on the housing and wipe the contacts across the planar module
pads. The connector is compressed, nominally, 1.27 mm and wipes 0.46 mm. The
bottom of the elastomer mates with a tray which has a contoured surface to
vary the compression along the length of the elastomer. This contoured surface
improves the uniformity of the force that the bumps exert on their pads. The
connector has been designed to generate 100 grams minimum load on all bumps.
The wipe action and the bump force of the connector minimize the effect of
dust and environmental films on the mating surfaces.
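
The contact figures above imply a substantial clamping load. The short sketch below works that out, assuming the 100-gram minimum is a per-bump figure (one reading of the text) and that each of the four connectors carries 200 bumps, consistent with the 800 signal I/O quoted earlier. The function name is illustrative only.

```python
GRAMS_FORCE_TO_NEWTONS = 9.80665e-3  # one gram-force expressed in newtons


def min_contact_force_newtons(bumps: int, grams_per_bump: float = 100.0) -> float:
    """Minimum total contact force if every bump carries at least grams_per_bump."""
    return bumps * grams_per_bump * GRAMS_FORCE_TO_NEWTONS


if __name__ == "__main__":
    per_connector = min_contact_force_newtons(200)
    per_unit = min_contact_force_newtons(4 * 200)
    print(f"per connector      : {per_connector:.1f} N")
    print(f"per multichip unit : {per_unit:.1f} N")
```

On this reading the elastomer in each connector must deliver on the order of 20 kilograms-force, which helps explain the contoured tray used to even out the load along its length.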
Power Connector
The power consumed by the multichip unit is brought in through two power
connectors mounted on opposite sides of the HDSC. The connector is composed of
a flex circuit, a connector, and decoupling capacitors. The flex circuit is
solder bonded to large pads on the HDSC surface. The flex has three copper
conductive planes separated by polyimide dielectric. The connector has stamped
metal contacts soldered into the flex circuit and assembled into a plastic
housing. The connector plugs into flat blades on the bus bar of the planar
module assembly. The decoupling capacitors on the power flex circuit filter
the medium-frequency switching noise on the MCU and the MCU power bus.
Figure 4  Signal Flex Connector with Detail of Bump
(Labels: planar module, signal flex circuit, flex circuit bump, elastomer,
elastomer tray.)

Thermal Design
The multichip unit was designed from conception to provide an efficient
cooling path for the integrated circuits. Figure 5 shows a cross section of
the multichip unit. The heat dissipated by the chips is conducted through the
silicon and the die attach into the baseplate. As mentioned above, the die
attach is an epoxy heavily filled with microscopic diamond particles to
increase thermal conductivity. The heat spreads out in the copper alloy
baseplate and is conducted across a dry interface to an aluminum base of the
pin fin heat sink. The heat sink has 600 aluminum pins, each 0.20 centimeters
in diameter, pressed into the base. Air plenums in the cabinets direct at
least 14.6 liters per second of air into each multichip unit heat sink. The
thermal resistance for a 30-watt gate array is less than 2.0 degrees Celsius
per watt, which gives a junction temperature of 85 degrees Celsius with room
air at 25 degrees Celsius. This low junction temperature is a critical part of
the high reliability of the multichip unit.
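
The junction temperature quoted above follows from the usual steady-state relation: junction temperature equals ambient temperature plus power multiplied by thermal resistance, with the resistance expressed in degrees Celsius per watt. A minimal sketch of that arithmetic, using the paper's figures and an illustrative function name:

```python
def junction_temperature_c(power_w: float, theta_c_per_w: float,
                           ambient_c: float) -> float:
    """Steady-state junction temperature: ambient plus power times thermal resistance."""
    return ambient_c + power_w * theta_c_per_w


if __name__ == "__main__":
    # 30 W gate array, 2.0 degC/W junction-to-air, 25 degC room air.
    print(junction_temperature_c(30.0, 2.0, 25.0))
```

Because 2.0 degrees Celsius per watt is stated as an upper bound on the thermal resistance, the result is a worst-case junction temperature for a 30-watt chip in 25-degree air.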
Figure 5  Thermal Path
(Labels: baseplate, pin fin heat sink.)

Clock Distribution
The system clock on the VAX 9000 system is distributed to each of the
multichip unit clock distribution chips (CDxx). The CDxx generates 40
differential outputs which are routed through equal-length etch to the other
chips. The CDxx also distributes and controls the scan lines that test the
unit both in manufacturing and in the field. The scan lines also allow the
unit serial number and revision status to be read by the system console.

Multichip Unit Manufacturing
Figure 6 shows the manufacturing process flow, which has three major work
centers:
• 54-class assembly and inspection
• P1000 assembly and inspection
• Test and diagnose
In the 54-class process, TAB semiconductor devices are assembled to the HDSC
substrate, resulting in the subassembly known internally as a 54-class module.
In the P1000 process, connector and housing components are assembled. At the
last major center, the test process, final units are tested and, if necessary,
diagnosed. A shop floor control system tracks the units through the line and
provides critical component and process trace information. In addition, this
control system is used to monitor process parameters to ensure control of the
line and consistent product quality.
The following section provides insight into several of the process
technologies we used to meet the manufacturing goals of the VAX 9000 system.
TAB and Flex Circuit Bonding
The insertion and soldering of leads is the most critical step in the
multichip unit manufacturing process. Single-lead and multiple-lead gang
bonding approaches were both considered. Gang reflow soldering is an effective
way to achieve repeatable, reliable connections for both the TAB
semiconductors and the signal flex circuits. Early development work on manual
machines required operator action for lead forming, lead alignment, and gang
bonding. Today, critical process parameters (time, pressure, temperature) are
computer controlled to specified values, and the process uses tools to assist
the operator in material movement and vision systems to improve alignment of
leads. Before bonding, the leads are covered with a low activation flux which
is removed later in the process.
Die Attach
Another critical manufacturing step is the die attach process. The excellent
thermal performance of the multichip unit is achieved by following these
steps:
• Careful control of the die attach materials with feedback to our suppliers.
• Surface cleanliness specified and also managed with our suppliers.
• Dispensing of epoxy. The filled epoxy is dispensed by an x-y table that is
  computer controlled to supply the correct pattern for the particular
  multichip unit type.
• Establishment of bond line thickness and epoxy cure. Bond line thickness is
  accomplished by mechanically applying pressure while curing in a purged belt
  furnace.

Figure 6  Manufacturing Process Flow
(Labels include: end of 54-class assembly, start of P1000 assembly, align HDSC
to housing, ship.)

Inspection
To ensure that all soldered leads are reliably bonded, leads must be inspected
for shorts, misalignments, opens, and weak joints. Shorts and misalignments
are discovered by an automated vision system that calls marginal points to the
operator's attention. The operator can then determine if repair action is
warranted. Inspection for opens and weak joints is done by striking the leads
with a pulse of laser energy and then measuring the thermal decay profile.
Repair is typically made by localized short removal or single-point bonding.
Over time, we believe that our materials and processes can be controlled to
the point at which inspection and repair can be dramatically reduced.

Final Test
The goal of our test process was to ensure that multichip units would operate
successfully in a system environment. Since no test equipment manufacturer
offered a system that met our needs, we developed our own by working with
several Digital groups as well as outside suppliers. The system contains three
major stations. The first provides alignment information and can also read
visual serial and part numbers. In the second station, low voltage shorts are
determined between nearest neighbor leads. This step supplements our
inspection for shorts described above. In the final station, we test for
connector opens, thermal measurement (die attach integrity), scan chain
integrity, and scan pattern data. The scan pattern testing is done in several
bursts of the clock at system speed. In addition, diagnostic capability is
provided by flying probes, voltage and clock margining, and a thermal chuck to
vary temperature.
Conclusion
Successful use of advanced interconnect technologies requires a seamless
phased development process that begins with advanced development and continues
through volume manufacture. The HDSC and multichip unit technologies have
successfully achieved the volume manufacturing phase. Using the products and
technologies described in this paper, we have played a key role in the
introduction of the VAX 9000 system to the marketplace. Extensions of this
manufacturing process will ensure that this technology base can be applied
across a wide spectrum of products of both higher and lower performance.
Matthew S. Goldman
Paul H. Dormitzer
Paul A. Leveille
The VAX 9000 Service
Processor Unit
The VAX 9000 service processor unit provides the front-end services needed to support
a highly available and reliable mainframe system. The unit is closely linked to the
VAX 9000 system to provide realtime detection and recovery of system failures.
However, the unit is independent enough to be isolated for maintenance without
affecting normal system processor operation. This combination is a first for VAX
systems. The service processor also provides various debugging features that were
essential for development and early manufacture of the VAX 9000 system. These
features utilize a system-wide scan architecture to achieve direct access to machine
state, which provides extensive visibility and control of system logic functions. The
inclusion and use of such a scan architecture is a new feature for a Digital processor.
The VAX 9000 service processor unit (SPU) is designed to provide a dedicated
subsystem for service and maintenance support for the VAX 9000 family. The SPU
serves two distinct roles. It functions as the familiar operator interface
(i.e., VAX console) and as a maintenance vehicle used to diagnose and isolate
system processor hardware faults. The SPU performs the following major
front-end services:
• System initialization
• Power system control and monitoring
• Environmental monitoring
• Clock control and monitoring
• VAX 9000 operating system access to SPU mass storage devices (disk and tape)
• Remote diagnosis port support
• System error detection, recovery, and reporting
The SPU also provides or assists in the following system diagnosis functions:
• SPU module self-tests
• Scan system diagnostics
• Clock system diagnostics
• Scan pattern structural diagnostics
• Structure cell (e.g., self-timed random-access memory [RAM]) diagnostics
• XMI-to-system control unit adapter interface test
• Symptom-directed diagnosis support
In addition to its use as the front-end processor for the VAX 9000 system, the
SPU was embedded in several manufacturing and engineering test vehicles. In
the Debugging Features section of this paper, we describe how the SPU was used
as a debugging tool during VAX 9000 product development and the various
debugging features we provide to help locate design and fabrication problems.
A major goal of the SPU was to perform system-wide error detection and
recovery functions for the VAX 9000 processor. In the Error Handling section
of this paper, we detail the types of errors that the SPU handles and how
error detection, reporting, and recovery occurs.
Another of our design goals was to be able to service the SPU without
adversely affecting the operation of the system processor. This feature was
needed to support the high availability requirements of a mainframe system. To
meet this goal, we designed mechanisms to enable the VAX 9000 operating system
to determine that the SPU is not functional (whereupon the operating system
takes the appropriate action to secure its own operation), as well as
recognize and reintegrate with the SPU when the SPU is functional again.
If the VAX 9000 operating system attempts to access one of the SPU-based
processor registers and the SPU does not respond, the failure is detected by
using the usual register time-out mechanism. However, because the SPU is
responsible for system error handling, SPU failures must be detected quickly
to enable the SPU to respond to a system error should one occur. Consequently,
we developed a keep-alive protocol with which the VAX 9000 operating system
can determine SPU failures without relying on operating system accesses to
SPU-based processor registers. The keep-alive mechanism is described in more
detail under the Error Handling section of this paper. Both the time-out and
keep-alive mechanisms work regardless of whether the SPU has an unexpected
failure or undergoes a scheduled power-down.

Should the SPU require service, field upgrades may be performed easily and
quickly because of the modularity of the hardware, which is primarily VAXBI
bus interface-based adapters. The VAXBI backplane minimizes downtime because
modules can be removed or inserted without requiring recabling. When power to
the SPU is restored, SPU self-tests are performed. The SPU's operating system
then boots automatically and signals its availability to the VAX 9000
operating system.

The SPU is designed to continue operation even if the SPU primary storage
device, an RD54 Winchester disk drive, fails, which further increases the
availability of the SPU. For customers who require data security and high
availability, we designed a system configuration option that does not use a
disk drive. In this case, the SPU boots from TK50 cartridge tape. The SPU
functions that require a disk drive for data storage (e.g., SPU-generated
error logs) are disabled in this configuration.
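
The keep-alive idea mentioned above can be illustrated generically. The sketch below is not the VAX 9000 protocol, whose details belong to the Error Handling section; it only shows the general technique of declaring a peer failed when no keep-alive has been seen within a timeout. The class name, method names, and the two-second timeout are all hypothetical.

```python
class KeepAliveMonitor:
    """Flags a peer as failed when no keep-alive arrives within timeout_s.

    Generic illustration only; names and timing are assumptions, not the
    mechanism used between the VAX 9000 operating system and the SPU.
    """

    def __init__(self, timeout_s: float, now: float) -> None:
        self.timeout_s = timeout_s
        self.last_seen = now

    def record_keep_alive(self, now: float) -> None:
        """Note that the peer was heard from at time `now`."""
        self.last_seen = now

    def peer_alive(self, now: float) -> bool:
        """True while the most recent keep-alive is no older than the timeout."""
        return (now - self.last_seen) <= self.timeout_s


if __name__ == "__main__":
    monitor = KeepAliveMonitor(timeout_s=2.0, now=0.0)
    print(monitor.peer_alive(now=1.5))   # still within the timeout
    print(monitor.peer_alive(now=2.5))   # timeout exceeded, peer presumed failed
    monitor.record_keep_alive(now=2.5)   # peer reintegrates
    print(monitor.peer_alive(now=4.0))
```

Passing timestamps in explicitly keeps the sketch deterministic; a real monitor would read a clock and, as the text notes, would not depend on ordinary register accesses to notice the failure.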
SPU Architecture
A block diagram of the SPU architecture is shown in Figure 1. The service
processor module, scan control module, and power and environmental monitor
were designed uniquely for the VAX 9000 system. The disk controller, tape
controller, as well as the memory daughter board were available from other
Digital products. Every SPU VAXBI adapter provides its own built-in self-test
diagnostics.

Figure 1  VAX 9000 SPU Block Diagram and Interconnects
(Blocks: disk controller, tape/network controller, service processor module
with 16 Mbytes of ECC memory, power and environmental monitor, and scan
control module on the VAXBI; SJI and SCI to the system processor. The NI
connection was used during development only.)

SPU hardware is based on either industry-proven (e.g., 7400-series TTL
components, complementary metal oxide semiconductor [CMOS] gate arrays) or
Digital-proven technology (e.g., VAXBI, Digital custom CMOS devices) to ensure
that the unit is an effective debugging platform for a system processor based
on leading edge technology. As a result, the inherent risk and learning curve
associated with new technology were avoided and the SPU was ready and
available during the VAX 9000 system prototype debugging process.

The SPU also was made available to manufacturing process and tester groups
(e.g., multichip unit tester) for use with their designs. The advantages to
this approach were that technicians became familiar with the same subsystem
that would be used in the VAX 9000 family, and the test programs could be
transferred for use in other test environments that also used the SPU,
including the VAX 9000 system itself.

The service processor module is the primary processing element of the SPU and
is the VAXBI host adapter. Based on the MicroVAX 78032 chip and several
custom-designed application-specific integrated circuits (e.g., SPU-to-system
control unit adapter, SPU memory controller), the module contains all the
hardware necessary to store and execute the SPU operating system. The on-board
firmware contains a VAX standard console interface to load the SPU operating
system during initialization and to assist in subsystem debugging. The
SPU-to-system control unit interface (SJI) connects the service processor
module to the system control unit and is the primary communication path
between the SPU and the VAX 9000 operating system.

The scan control module is the control interface to the VAX 9000 scan system,
which is the visibility and maintenance path to the system processor. Like the
service processor module, the scan control module is based on the MicroVAX
78032 chip and several custom-designed application-specific integrated
circuits (e.g., scan control chip, scan distribution chip). On-board firmware
provides high-level functions that allow the service processor module to
continue processing while scan-related operations, including
logical-to-physical signal translations, are performed concurrently by the
scan control module. The scan interconnect (SCI) connects the scan control
module to the system processor (i.e., one to four CPUs and the system control
unit) and the master clock module. Using this interface, the system processor
may also interrupt the SPU when the processor needs service. This type of
interrupt request is known as an attention.

The SPU is integrated into the system cabinet to better meet the performance
requirements necessary for system error recovery and VAX 9000 operating system
boot. Cabinet integration substantially decreases interconnect distances to
processor logic and ensures that all cables are kept internal to the cabinet.
Another reason for choosing the VAXBI backplane card cage is that its form
factor is small, which reduces the cabinet area needed (cabinet area is always
in high demand), yet the user-definable zones provide the high pin density
required for interconnects (i.e., 180 I/O pins per VAXBI slot).

Communication Path
The SPU communicates with the system processor using the SJI. This interface
is used to load the primary bootstrap into the VAX 9000 main memory, transfer
error and machine-check information to the VAX 9000 operating system, provide
file transfer access between the VAX 9000 operating system and the SPU's RD54
disk drive, access system main memory, and access system I/O registers.

The VAX 9000 operating system accesses the SPU as if it were a standard I/O
device. The SPU is an independent subsystem and does not rely on the execution
unit of the system processor to be a console processing engine, as was done in
previous VAX systems. There are several benefits to this design approach. Each
CPU has equal access to the SPU and may interrupt the SPU to request service.
In addition, the SPU may interrupt any of the CPUs to request an operating
system service. The SPU may be used as a debugging tool during system
processor debugging because it does not require that any portion of the system
processor be operational.

The fact that the SPU could be used as a debugging tool was an extremely
important benefit for the VAX 9000 system debugging effort. The debugger did
not have easy access to the logic elements because of the advanced packaging
and circuit integration of the VAX 9000 system. Therefore, SPU services were
utilized in lieu of logic probes. Further, because the SPU no longer uses the
CPU for system access, console support microcode (i.e., the collection of
microcode procedures traditionally used for access to the system processor,
memory, and I/O registers) is not required. The benefit of this process is
that valuable VAX 9000 control store space could be used for system microcode
or to reduce the control store size. For example, in the VAX 8650 system,
console support microcode occupies approximately 180 microword locations.
VAX 9000 operating system access to the SPU is through the VAX console
register set. We extended the VAX console register set to provide access to
the enhanced capabilities of the SPU. Additional registers include transmit
function request and parameter and receive function request and parameter
(i.e., TXFCT, TXPRM, RXFCT, RXPRM). Table 1 lists the functions provided by
these registers.
SJI communications are in the form of 14-byte packets that contain
the command (i.e., function), address, and data. Packets are sent
and received over two 8-bit data paths that provide full duplex
operation. Data transfers peak at 3.5 megabytes (MB) per second for
quadword transfers.
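The text gives only the packet length and its three fields, not the
field layout. As a rough illustration, the Python sketch below assumes
a 2-byte command, a 4-byte address, and an 8-byte (quadword) data
field, which together happen to total 14 bytes. The field widths, the
byte order, and the example command code are assumptions made for this
sketch and are not the documented SJI format.

```python
import struct

# Hypothetical layout (an assumption, not the real SJI format):
# 2-byte command, 4-byte address, 8-byte data, little-endian, unpadded.
PACKET_FORMAT = "<HIQ"
PACKET_SIZE = struct.calcsize(PACKET_FORMAT)  # 2 + 4 + 8 bytes


def pack_packet(command, address, data):
    """Build one fixed-size packet from its three fields."""
    return struct.pack(PACKET_FORMAT, command, address, data)


def unpack_packet(raw):
    """Split a packet back into (command, address, data)."""
    if len(raw) != PACKET_SIZE:
        raise ValueError("packet must be exactly %d bytes" % PACKET_SIZE)
    return struct.unpack(PACKET_FORMAT, raw)


# Example with a made-up command code and address.
packet = pack_packet(0x0001, 0x20000000, 0x1122334455667788)
fields = unpack_packet(packet)
```

A fixed-size packet keeps the receiving side simple: it always reads
the same number of bytes before decoding the command.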
When the VAX 9000 operating system executes a
Move_to/from_Processor_Register instruction that specifies an SPU
register, the system control unit sends an I/O command packet,
through the SJI, to the SPU to initiate the system request. Then the
SPU typically uses an interrupt command packet, which generates an
interrupt to the specified CPU. The two other packet types are
direct memory access and error correction code.
Table 1  RXFCT/RXPRM and TXFCT/TXPRM Functions

RXFCT/RXPRM Functions (SPU to System Processor)
  Remove processor
  Add processor
  Mark memory page bad
  Request pages of memory
  Send error log entry
  Send OPCOM message
  Get datagram buffer
  Send datagram
  Return datagram status
  Set keep-alive state
  Abort datalink
  Error interrupt

TXFCT/TXPRM Functions (System Processor to SPU)
  Get hardware context (of a halted CPU)
  Virtual block file operation (access to SPU disk and tape)
  Keep-alive
  Send datagram
  Return datagram status
  Switch primary
  Reboot system request
  Clear warm start flag
  Clear cold start flag
  Boot secondary processor
  Halt CPU and remove from available set
  Halt CPU and keep in available set
  Console quiet
  Set interrupt mode
  Abort datalink
  Reset I/O system
  Disable vector unit
  Set keep-alive state
  Start processor
  Margin power
  Margin clock
  Fault signal
  Start error window
  End error window
  Report error in window
  Get error log entry
  Get unmarked error log entry and mark
  Enable halt restart
  Get I/O physical address memory map configuration
  Get physical address memory map configuration

Visibility Path
In the development and manufacture of a complex computer system,
extensive testing methods must be available to ensure functional
operation and product quality. Design engineering no longer can use
manual probing techniques in prototype debugging. Space limitations
have resulted from advanced packaging and the close pitch of
integrated circuit I/O pins, which is due to high integration
levels. Failure isolation must be performed in the manufacturing
process, often without an extensive knowledge of the machine design.
A separate visibility and control path in the system processor of
the VAX 9000 system provides nearly 100 percent visibility to the
machine-state. The visibility path eliminates the need to select a
subset of visibility points to meet all test needs, as was done with
previous VAX systems. In addition, the path allows designers to
directly alter the entire machine-state, which is a major advantage
for design and process debugging. A VAX 9000 uniprocessor (i.e., one
CPU and system control unit) contains over 26,000 access points.
The path is called the VAX 9000 scan system and is controlled by the
service control module. The scan system is the foundation for direct
access by prototype debuggers, system error recovery
software, and diagnostics to observe and alter the VAX 9000
machine-state. Some functions provided by the scan control module
and supporting SPU software are
• Load and save processor state
• Scan pattern execution
• Continuity testing of the processor's scan hardware
• Multichip unit type and revision information extraction
• Processor attention notification
A block diagram of the VAX 9000 scan system is shown in Figure 2.
The scan control module connects to the system planar module over
the SCI. Scan and clock distribution logic, contained in a macrocell
array on the planar module, distributes data and control signals
over the scan bus to each of the multichip units. A clock
distribution chip at the hub of each multichip unit further
distributes the scan bus signals to the macrocell arrays, which are
integrated circuits that contain system logic.
As shown in Figure 3, the state devices within a macrocell array are
scan latches. The latches are connected serially to form a ring or
chain by connecting the Scan_Data_Out line of each latch to the
Scan_Data_In line of the next latch. The end links are connected to
the clock distribution chip. When the system clocks are running,
data is loaded into the latch from the system data input. During
scan operation, system clocks are not active. Generated by the scan
control module, the scan clocks load the latch with data from the
scan data input. Consequently, the scan control module reads system
state by issuing scan clocks, which serially shift system data to
the scan control module. System state is changed when the scan
control module drives new data to the system latches while issuing
scan clocks.
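The shifting behavior described above can be modeled in a few lines.
The Python sketch below is a toy model of a single ring and is not
the SPU firmware. The class and function names are invented for
illustration, and feeding each bit back into the ring during a read,
so that the read leaves the state intact, is an assumption of the
sketch.

```python
class ScanChain:
    """Toy model of one ring of scan latches (illustrative only)."""

    def __init__(self, state):
        # latches[0] sits at the scan-data-in end of the ring and
        # latches[-1] at the scan-data-out end.
        self.latches = list(state)

    def scan_clock(self, scan_in):
        """Issue one scan clock: shift every latch one place along."""
        scan_out = self.latches[-1]
        self.latches = [scan_in] + self.latches[:-1]
        return scan_out


def read_state(chain):
    """Shift the whole ring out, feeding each bit back in so that the
    ring ends up holding its original contents."""
    bits = []
    for _ in range(len(chain.latches)):
        bit = chain.latches[-1]
        chain.scan_clock(bit)
        bits.append(bit)
    return bits[::-1]  # the latch nearest scan-data-out emerged first


def write_state(chain, new_bits):
    """Shift new data in (one bit per latch) and return the state
    that was shifted out in the process."""
    old = [chain.scan_clock(bit) for bit in reversed(new_bits)]
    return old[::-1]


chain = ScanChain([1, 0, 1, 1])
snapshot = read_state(chain)                 # observe the state
previous = write_state(chain, [0, 0, 1, 0])  # alter the state
```

Reading or writing a ring of n latches costs n scan clocks, which is
why the number of access points matters for initialization and
recovery time.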
An architectural feature permits each multichip unit to generate an
attention interrupt directly to the scan control module over the
scan data return line. Attentions notify the SPU of system events,
such as processor errors, memory self-test completion, CPU halts,
and keep-alive responses.
System diagnostics can diagnose the SCI by using the same control
signals as used for scan system operation. Dedicated logic and
special routing of the scan lines provide failure isolation.
Stuck-at faults and disconnect conditions can be isolated to the
multichip unit.
Debugging Features
In addition to its use as the VAX 9000 front-end processor, the SPU
provides a variety of features for debugging and troubleshooting
multichip unit logic configurations. These features were required
because all multichip unit logic visibility and control is handled
through the SCI, which connects directly to the SPU. The use of scan
latches to access internal logic states is a first for VAX systems
and challenged the designers to define and deliver the necessary
tools and features to assist the multichip unit debugging effort.
Furthermore, the features provided by the SPU had to apply to
various tester environments, ranging from single multichip units
mounted in probe stations to full system configurations. Additional
requirements to support the clock and power system test stations
made it clear that the SPU would have to be adaptable to a variety
of environments.
Figure 2  VAX 9000 Scan System
[Block diagram. Labels: service processor, scan control module, SCI,
planar module, SCD (scan and clock distribution logic), scan data in
and control, scan data return, MCU0 through MCUn.]
The translation from a logical signal to its associated scan latch
uses data structures supplied in a configuration database file,
which is loaded into SPU memory during SPU initialization. All CPUs
with identical multichip unit configurations (i.e., same CPU
revision) share the same configuration database memory image. The
system control unit always requires its own database. Only two CPU
revisions can be supported at one time because of SPU memory
constraints for storing the separate configuration databases.
However, by providing for two CPU revisions, the needs of single and
dual CPU configurations were completely satisfied. Further, it was
possible to upgrade homogeneous triple and quadruple configurations
in a stepwise manner.
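The sharing rule in this paragraph can be sketched as a lookup table
keyed by CPU revision. The Python below is a minimal illustration
under stated assumptions: the signal names, ring numbers, bit
positions, and revision labels are all hypothetical, and the real
configuration database format is not described in the text.

```python
# Hypothetical database contents: logical signal -> (scan ring, bit).
CONFIG_DB_FILES = {
    "rev_a": {"PC_VALID": (3, 17), "CACHE_PARITY_ERR": (5, 2)},
    "rev_b": {"PC_VALID": (3, 18), "CACHE_PARITY_ERR": (6, 40)},
    "rev_c": {"PC_VALID": (4, 1), "CACHE_PARITY_ERR": (6, 41)},
}
MAX_LOADED_REVISIONS = 2  # the two-revision limit stated in the text


class SignalMap:
    """Maps (CPU, logical signal) to a physical scan latch location."""

    def __init__(self):
        self.loaded = {}        # revision -> in-memory database image
        self.cpu_revision = {}  # CPU id -> revision

    def add_cpu(self, cpu_id, revision):
        # CPUs of the same revision share one memory image.
        if revision not in self.loaded:
            if len(self.loaded) >= MAX_LOADED_REVISIONS:
                raise RuntimeError("only two CPU revisions can be loaded")
            self.loaded[revision] = CONFIG_DB_FILES[revision]
        self.cpu_revision[cpu_id] = revision

    def locate(self, cpu_id, signal):
        """Translate a logical signal name to (ring, bit position)."""
        return self.loaded[self.cpu_revision[cpu_id]][signal]


signals = SignalMap()
signals.add_cpu(0, "rev_a")
signals.add_cpu(1, "rev_a")   # shares the rev_a image with CPU 0
signals.add_cpu(2, "rev_b")   # second and last revision allowed
location = signals.locate(2, "PC_VALID")
```

Because software refers to signals by name, a change in latch
placement between revisions only requires a new database file.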
Macrocode Execution
Initial system-level multichip unit configurations consisted only of
a scalar CPU. The system control unit was not yet available as a
result of the extended simulation of the design. Fortunately, we had
anticipated the possibility of running partial configurations and
could provide modes within the SPU software to redirect commands
that normally access main memory (e.g., EXAMINE, LOAD) to access the
CPU's 128 kilobyte (KB) system cache or 8 KB virtual instruction
cache instead. The first VAX macro-instructions were loaded and
executed on the VAX 9000 system using this technique. An additional
feature, which involved minor hooks in the system microcode,
provided a means for the VAX instruction set diagnostic, EVKAA, to
communicate with the console terminal through scan attentions rather
than by using the system control unit. Thus, the diagnostic could
run to completion.
Advanced Debugging Features
Although not obvious aids to VAX 9000 debug, the following features
were indispensable or, at the least, reduced debugging time and
effort:
• A character-cell windowing capability that allows system microcode
  sources to be automatically located, displayed, and updated on the
  screen as the system is single-stepped. We modeled this feature
  after the VAX debugger's windowing capability because most VAX
  engineers are familiar with this capability. Windowing eliminated
  the need for hard-copy microcode listings and the logistical
  problems associated with their use.
• By connecting the SPU to the engineering network during
  development, timely updates of SPU software were made possible.
  This kept the VAX 9000 debugging effort, which was occurring
  simultaneously on several systems, up to date with the latest SPU
  software fixes and enhancements. Together with the multisession
  capability of the SPU operating system, the use of the network
  made remote debugging a reality throughout the VAX 9000 debug
  phase.
• Because the SPU had to initialize the VAX 9000 system thousands of
  times during system debugging, the unit was designed to perform
  system initialization as efficiently as possible. For example, the
  loading of structures (e.g., control stores or cache tags) was
  optimized by overlapping the operation of three MicroVAX-based
  processors: the service processor module, the scan control module,
  and the disk controller.
The debugging features located early design and fabrication problems
in the clock, power, scan, and processor logic areas. Ultimately,
the features were used to initialize and run the first VAX 9000
system.
Error Handling
To support high system availability, accurate and timely error
detection and logging is required. Error data collection cannot
depend upon host system availability, and the data must be available
when the system is not functional. Therefore, an independent service
subsystem that can collect data from all system components, render
it into a useful format, and store and display the information is
needed. The service subsystem must also be organized in such a way
that if it fails, it does not directly cause system processor
failures. Repair, reboot, and system reintegration must occur
without interfering with system processor operation. The SPU meets
these requirements; it is a fully independent computer that runs its
own operating system with dedicated peripherals. The SPU performs
system-wide error detection and reporting functions and provides
advanced error recovery features for the system processor.
Error Detection
The SPU reports errors in its own VAXBI adapters, the service
processor module, the scan control module, the power and
environmental monitor, the disk controller, and the tape controller.
It also reports errors in various parts of the VAX 9000 system, such
as the system control unit, the CPUs, the memory system, the master
clock module, and the power and environmental systems. Because
failures in any of these subsystems can incapacitate the VAX 9000
system, none of them reports its errors directly to the VAX 9000
operating system.
SPU Errors  The disk controller, tape controller, and scan control
module use the VAXBI VAXport protocol to report errors. The power
and environmental monitor passes error information to the service
processor module through its private bus, the SPU-to-power control
system interface.
Environmental Exceptions  The power and environmental monitor
monitors the regulator intelligence cards, airflow sensors, and
temperature sensors throughout the system. When it detects any
problems in operating voltages, currents, temperatures, or airflow,
it notifies the service processor operating system, which logs the
error condition.
Clock Exceptions  When the master clock module detects an error in
either the clock phase or the clock frequency lock, it generates an
attention to the scan control module, which interrupts the service
processor module. The SPU operating system logs the error condition.
Memory Error Correction Code Events  The main memory of the VAX 9000
system contains error correcting logic to correct single-bit errors
and detect double-bit errors. When a memory location with a
single-bit error is read, the system control unit corrects the error
and passes the corrected data to the requesting device. It also
writes an SPU register with the error type and the failing memory
address. The SPU operating system writes this information to the
error log. If the system control unit detects a double-bit error or
reads a marked-bad location, it passes the bad data, marked as bad,
to the requesting device and notifies the service processor
operating system, which logs the error. The bad data is handled
locally by the requesting device, usually by generating an error of
its own.
CPU and System Control Unit Errors  When a CPU detects an error in a
parity checker, it attempts to come to an instruction boundary and
halt. Once it has halted, the CPU sweeps its cache. When the cache
sweep is completed, the CPU asserts an attention to the scan control
module to inform the SPU that recovery is required. When the system
control unit detects an error, it first asserts a fatal error signal
to each of the CPUs, and then asserts an attention. When the CPUs
receive the fatal error signal, they attempt to come to an
instruction boundary and halt. Once halted, the CPUs assert
attention lines to the scan control module. The caches are not swept
since their path to memory, the system control unit, is not working.
Keep-alive Timeout  To ensure that a CPU is not hung by an
undetected error, the SPU periodically sends a keep-alive interrupt
to each CPU. CPU microcode services the interrupt at the next
macro-instruction boundary by asserting an attention to the scan
control module. If the CPU should be hung by an undetected error,
the SPU times out while it waits for the keep-alive reply attention
and, thus, determines that there has been an error. Similarly, the
primary CPU monitors the SPU by sending it a keep-alive request
through the TXFCT register. If the SPU does not respond to this
request within a timeout period, the VAX 9000 operating system
assumes that the SPU is hung and reboots it using a VAXBI reset.
When the SPU reboots, it reintegrates itself with the rest of the
VAX 9000 system without interfering with system operation.
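The SPU side of this handshake amounts to a watchdog per CPU. The
Python sketch below shows one way such bookkeeping could look; it is
not the SPU implementation. The class name, the explicit time
arguments, and the timeout value are assumptions, since the text does
not give the keep-alive period or the timeout.

```python
class KeepAliveMonitor:
    """Tracks outstanding keep-alive requests, one per CPU."""

    def __init__(self, timeout):
        self.timeout = timeout  # assumed value; not given in the text
        self.pending = {}       # CPU id -> time the request was sent

    def send_keep_alive(self, cpu_id, now):
        """Record that a keep-alive interrupt was sent to a CPU."""
        self.pending[cpu_id] = now

    def attention_received(self, cpu_id):
        """The CPU answered with an attention, so it is not hung."""
        self.pending.pop(cpu_id, None)

    def hung_cpus(self, now):
        """Return the CPUs whose reply is overdue."""
        return sorted(cpu for cpu, sent in self.pending.items()
                      if now - sent > self.timeout)


monitor = KeepAliveMonitor(timeout=1.0)
monitor.send_keep_alive(0, now=0.0)
monitor.send_keep_alive(1, now=0.0)
monitor.attention_received(0)       # CPU 0 replies, CPU 1 does not
overdue = monitor.hung_cpus(now=2.0)
```

Passing the current time in explicitly keeps the sketch
deterministic; a real monitor would read a clock instead.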
Error Reporting
When errors are reported to the SPU operating system, the error
formatting facility logs the error information locally and reliably
transmits it to all intended receivers. The error formatter
maintains the error log file ERRLOG.SYS on the SPU RD54 drive,
passes error log entries to the VAX 9000 operating system to be
logged in the system error log, and also passes the entries to any
SPU software that requests them. The error formatter writes the
error log file using the SPU operating system disk I/O functions,
passes the error log entries to the VAX 9000 operating system using
an RXFCT function, and passes the error log entries to other SPU
processes using the SPU port protocol. If the RD54 drive is not
available, which prevents access to the SPU error log, the error
formatter continues to send error log entries to the VAX 9000
operating system and to other SPU processes.
The SPU error log contains all the error log entries collected by
the SPU (but not those collected by the VAX 9000 operating system)
and time stamps, which are logged every ten minutes. Should an SPU
operating system crash occur, the time stamps may be used to
determine the approximate time of the crash. Errors are logged
regardless of the state of the system processor. As a result,
information is available for analysis even in the event of a total
processor failure. The error log file may also be transferred to
TK50 tape for off-site analysis.
The error formatter passes error information to the VAX 9000
operating system by copying the error log entry to system memory and
then invoking the RXFCT function to notify the VAX 9000 operating
system that the entry is available. Should the operating system not
respond to this notification, the error formatter assumes that the
operating system has crashed and writes the error log entry to a
temporary data file. When the VAX 9000 operating system reboots, it
notifies the SPU by using a TXFCT function. The error formatter then
reads any saved error log entries from the data file and transmits
them to the VAX 9000 operating system. This protocol ensures that
all collected error data is eventually reported in the system error
log.
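This is a store-and-forward protocol, and its control flow can be
sketched briefly. In the Python below, a list stands in for the
temporary data file and a callback stands in for the RXFCT
notification; both are assumptions, as are the names. Skipping
delivery attempts while the host is believed to be down is also an
assumption of the sketch rather than something the text states.

```python
class ErrorForwarder:
    """Forwards error log entries to the host, spooling them while
    the host operating system is not responding."""

    def __init__(self, deliver):
        self.deliver = deliver  # callable(entry) -> True if acknowledged
        self.host_up = True
        self.spool = []         # stands in for the temporary data file

    def report(self, entry):
        """Try to deliver an entry; spool it if the host is silent."""
        if self.host_up and self.deliver(entry):
            return True
        self.host_up = False
        self.spool.append(entry)
        return False

    def host_rebooted(self):
        """Called when the host announces its reboot: replay the
        spooled entries in their original order."""
        self.host_up = True
        saved, self.spool = self.spool, []
        for entry in saved:
            self.report(entry)


# A stand-in host that can be switched off to simulate a crash.
host_alive = [True]
host_log = []


def notify_host(entry):
    if host_alive[0]:
        host_log.append(entry)
        return True
    return False


forwarder = ErrorForwarder(notify_host)
forwarder.report("entry 1")
host_alive[0] = False
forwarder.report("entry 2")   # not acknowledged, so spooled
forwarder.report("entry 3")   # host assumed down, so spooled
host_alive[0] = True
forwarder.host_rebooted()     # spooled entries are replayed
```

If the host fails again during the replay, the remaining entries
simply return to the spool, so nothing is dropped.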
The error formatter also maintains an SPU port to
which any process running on the SPU may con
nect. Connected processes receive copies of all
error log entries as the entries are logged . This port
is used by EWKCA , the symptom-directed diagnosis
tool, which analyzes errors as they occur and
determines which system components might have
caused the failure. The port is also used for system
debugging by the error insertion program to verify
that errors are being logged and analyzed correctly.
Snapshots  In addition to its error logging facilities, the SPU
operating system provides the ability to take "snapshots" of the
system processor state. The snapshot file provides a detailed record
of system context, which allows engineers to take a snapshot of a
hung system and reboot it, and then analyze the snapshot file while
the system proceeds to perform other useful work. The snapshot
display utility is used to examine the data in a snapshot file. In
addition to formatting the data in the snapshot file, the snapshot
display utility can be used to examine any scan latch in the file,
by name, in the same fashion as the console EXAMINE command is used
on the actual hardware. The data available in a snapshot file is
summarized in Table 2.
Table 2  Snapshot File Contents

Revision Section
  All multichip unit revisions
  All SPU adapter revisions
  Microcode revisions
  All XMI adapter revisions
  All VAXBI adapter revisions
Power Section
  All power control system registers
  "Sense power" results
Clock Section
  All master clock module registers
SPU Section
  All SPU-to-system control unit adapter registers
I/O Section
  XMI device error registers
  VAXBI device error registers
  XMI-to-system control unit error registers
System Control Unit Section
  All scan latches
  Last 50 entries from system control unit microprogram counter
    history buffer
  All cache tags
  All other logical structures (e.g., control stores)
  Configuration database version
  I/O physical address memory map
  Memory physical address memory map
  Nonexistent physical address memory map
CPU Section (Repeated Once for Each CPU)
  All scan latches
  Last 50 entries from program counter history buffer
  All cache tags
  All general-purpose registers
  All internal processor registers
  All other logical structures (e.g., control stores)
  Top 50 longwords of current mode stack
  Top 50 longwords of interrupt stack
  32 bytes of instruction stream around each program counter in
    history buffer
  Configuration database version
  50 microprogram counters, collected by stepping the clocks

Error Recovery
The high level of visibility achieved by the scan system allows the
SPU to provide extensive error recovery facilities for the VAX 9000
processor. SPU-based recovery offers several advantages over
traditional microcode-based error handling. The CPU hardware
resources that might otherwise be used for error handling were
available for the logic designers to improve the system performance.
Because the error data is processed external to the failing
component, the recovery process itself is not suspect. Finally,
because the system clocks are stopped while recovery takes place,
erroneous data does not propagate throughout the system.
Traditionally, many microwords in the CPU control store
(approximately 500 in the VAX 8600 system) are used for error
recovery microcode. However, because the SPU is responsible for
VAX 9000 error recovery, additional control store space is available
for instruction microcode. If this had not been the case, we might
have had to make a space trade-off between instruction and recovery
microcode, which could have resulted in more emulated instructions
and a performance penalty for VAX instruction execution speed.
Because the scan system allows the SPU to determine the state of
every scan latch in the CPUs and system control unit, logic
designers were able to place error detectors anywhere in the design
without organizing the detectors into microcode-readable error
registers. As a result, significantly more error detectors were used
for precise error analysis than would have been possible if the scan
system were not available. Each VAX 9000 CPU contains over 450 error
detector latches.
Several advantages are derived from performing error recovery
independently from a failed component. The most obvious advantage is
that hardware, which may be failing, is not used to control the
recovery. Once the system processor state has been scanned out into
SPU memory, analysis is a function of software running on a known
good processor. The SPU analyzes the data and then scans a correct
state into the system processor. The entire process is performed
while the system clocks are stopped. Therefore, processor errors
cannot cause "error loops," in which the error recovery process
itself gets errors from a corrupt processor state. SPU-based error
recovery can completely reset a corrupt system, regardless of the
degree of corruption.
The VAX 9000 error-handling facility takes advantage of many
advanced software features that are available in the SPU operating
system. It uses configuration database information to access system
processor signals by name rather than by scan ring locations. Thus,
one version of the error handling code can handle several different
physical processor variations. The error handler also uses the SPU
operating system structure access routines to read and write the
processor structures, again, by burying the physical implementation
in the configuration database. As a result, the error handler can
look at the architectural features of the VAX processor rather than
at the gate-level design of the VAX 9000 system when performing
error analysis. The benefit of this approach is that recovery
procedures are based on the system architecture, rather than on the
machine implementation.
One of our design goals for the VAX 9000 error handling system was
to recover from most errors in under 500 milliseconds. Longer delays
increase the probability that I/O devices will time out while
waiting for the operating system to respond to requests and cause
the operating system to crash, even if the error-handling system
successfully recovers from the error. The error handler meets this
goal by taking maximum advantage of the multiprocessing capabilities
of the tightly coupled hardware design of the service processor
module and scan control module. Error recovery is split into a
multistep process that keeps both SPU processors working on the
problem simultaneously.
The error handler recovers a failed system in five phases: data
collection, data analysis, error recovery, macrostep, and cleanup.
In the data collection phase, the scan control module scans out all
scan rings of the failed CPU or system control unit. In the analysis
phase, the scanned data is used to determine which architectural
features of the system have been corrupted (e.g., caches,
general-purpose registers, internal processor registers, microcode
stores, and the translation buffer).
In the recovery phase, the error handler attempts to restore the
system to a state in which no software-visible data is corrupt.
Therefore, the software running on the VAX 9000 system, including
the operating system, is unaware that an error has occurred. The
error handler determines whether the system state can be restored
successfully or if a machine check must be generated to allow the
VAX 9000 operating system to attempt to handle the error on a higher
level. It then restores the CPU to a known good operating state, by
using latch data from the configuration database, and corrects any
corrupted software-visible data.
In the macrostep phase, the error handler turns on the system clocks
to allow the failed CPU to attempt to macrostep one instruction. If
the macrostep completes successfully, the recovery is considered
successful and system operation is allowed to continue. In the
cleanup phase, the SPU processes the data from the data collection
phase into an error log entry, posts the entry, and cleans up the
data structures that will be used to recover from the next error.
Errors that are too severe for the error handler to handle are
signaled to the SPU command interpreter, which can run command
scripts to completely reinitialize the machine and reboot the
VAX 9000 operating system. Examples of such severe errors are hard
errors that prevent VAX 9000 operating system machine check code
from running and errors that cause a CPU to fail its macrostep.
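The five phases and their possible outcomes can be summarized as a
short control-flow sketch. The Python below is a simplification and
not the SPU error handler: the `hw` object and its method names are
hypothetical, the phases run sequentially rather than on two
processors, and running the cleanup phase even when the macrostep
fails is an assumption.

```python
from enum import Enum


class Outcome(Enum):
    RECOVERED = "recovered"          # software never sees the error
    MACHINE_CHECK = "machine check"  # host operating system must handle it
    REINITIALIZE = "reinitialize"    # too severe: reinitialize and reboot


def handle_error(hw):
    """Run the five recovery phases against a hypothetical hardware
    interface and report how the error was resolved."""
    state = hw.scan_out()            # 1. data collection
    damage = hw.analyze(state)       # 2. data analysis
    restorable = hw.restore(damage)  # 3. error recovery
    stepped = hw.macrostep()         # 4. macrostep one instruction
    hw.log(state)                    # 5. cleanup: post the error log entry
    if not stepped:
        return Outcome.REINITIALIZE
    return Outcome.RECOVERED if restorable else Outcome.MACHINE_CHECK
```

Here `restore` is assumed to return False when corrupted
software-visible data cannot be corrected, which is the case in which
a machine check is raised to the host operating system.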
Summary
The SPU is a dedicated subsystem for service and maintenance support
for the VAX 9000 family. It is closely linked to the VAX 9000
processor to provide system error recovery. It also presents a
high-level interface with which debuggers may observe and control
system processor activity. Through the use of a system-wide scan
architecture, the SPU provides access to nearly 100 percent of
processor machine-state. Finally, the use of the SPU in various
tester environments greatly assisted the multichip unit debugging
effort and provided advanced training for VAX 9000 system debuggers.
Acknowledgments
The authors wish to thank Michael Evans, the SPU project leader,
whose drive and ambition provided the force behind the project's
success. We also wish to acknowledge the other members of the SPU
design team: Karen Barnard, Stephen Conway, David D'Antonio, Susan
DesMarais, and Brian Rost.
Reference
1. D. Chin et al., "The Unique Features of the VAX 9000 Power System
   Design," Digital Technical Journal, vol. 2, no. 4 (Fall 1990,
   this issue): 102-117.
Derrick J. Chin
Barry G. Brown
Charles F. Butala
Luke L. Chang
Steven J. Chenetz
Gerald E. Cotter
Brian T. Lynch
Thiagarajan Natarajan
Leonard J. Salafia
The Unique Features of the VAX 9000 Power System Design
The VAX 9000 series represents Digital's first implementation of a
mainframe computer system. To be competitive in this market, the
power system for the VAX 9000 series had to provide high system
availability. To meet this goal, the system includes features
neither considered nor found in previous large Digital computer
systems. Some of these features are the use of redundancy in parts
of the design and the addition of more power system diagnosis
capability for quicker fault isolation and faulty unit replacement.
Other features provide competitive advantages in specific
marketplaces, such as meeting low harmonic distortion for AC input
current, which is an emerging European AC power quality standard.
Simulation tools, which are used more prevalently in digital logic,
were used to improve the power design.
The two key requirements of the VAX 9000 power system are high
availability and the inclusion of competitive features. High
availability for the power system means we had to achieve the
highest unit regulator reliability possible by using the appropriate
technology available. Further, we had to deliver both more power
system and cabinet environmental monitoring and diagnostic
capability that could reduce the time spent in isolating and
replacing a malfunctioning unit. Competitive features mean designing
into the system features that would be either better than expected
or advantageous to the VAX 9000 system in certain markets.
A full discussion of all the methods used to meet these requirements
is too long for this paper. Therefore, the discussion in this paper
focuses on some of the unique applications of the power technology
and tools used in the design of the VAX 9000 system:
• Power system architecture
• Improved load sharing
• Simulation
• Increased control and monitoring
• Low harmonic distortion
One of the issues we had to decide in designing the power system
architecture was how many regulators should be used. A large number
of regulators in a power system can cause the mean time between
failures (MTBF) to be lower than desired. Therefore, we chose to use
redundant regulators in the power system architecture for improved
availability.
Another means of increasing the MTBF was achieved by improving the
load sharing among the parallel regulators that power a low-voltage
current load. With this feature, no one regulator operates at a
percentage of maximum rating much higher than its parallel
regulators. This eliminates the higher operating temperatures that
can otherwise occur and that, as a result, lower the MTBF.
High regulator reliability results from good circuit design. Three
examples of the unique simulation features that were used as checks
on circuit designs are discussed in the Simulation section of this
paper. In one case, simulation pointed the way to a circuit problem
that was not initially apparent. In another case, simulation was
used to verify on paper that the number of regulators chosen to
power a specific load was sufficient.
High availability can be achieved by reducing the time to isolate a
system problem and replace the malfunctioning unit. A power and
cabinet monitoring module, EMM, fulfilled this purpose in the
VAX 8000 systems. The power control subsystem, PCS, used for this
purpose in the VAX 9000 systems, expands on the diagnostic and
monitoring features of the EMM.
Meeting emerging European AC power quality standards was viewed by
the European sales force as a distinct competitive advantage for the
VAX 9000 system. A proposed standard we wanted to meet was to
achieve low harmonic distortion of the input AC current wave form,
which was met in the utility power conditioner (UPC) front-end
design of the power system. High availability was designed into the
UPC through such features as redundancy and increased immunity to
power line disturbances from a commonly accepted industry
practice of one AC cycle to teo AC cycles.
VAX 9000 Power System Architecture
The discussion of the power system architecture will focus on some
of the architecture's major features: power zoning, N + 1
redundancy, and decoupling.
• Power zoning enables parts of the system to be powered off for
  maintenance while the rest of the system remains operational.
• N + 1 redundancy provides higher perceived system availability to
  counteract the impact of low system mean time between failures,
  which is a result of the large number of regulators.
• Decoupling major sections of the power system allows future
  upgrades to be made without requiring significant changes to the
  rest of the system.
The basic power system architecture for the VAX 9000 Model 200 and Model 400 series is shown in Figures 1 and 2, respectively. Power processing in each model occurs in two distinct stages. First, an AC front end processes and converts AC utility input power to high-voltage DC, which is then bused about the power system. Second, DC-to-DC switching regulators convert the high-voltage DC to low-voltage outputs, which are then distributed through high-current-carrying busbars to the various logic loads. An intelligent power control subsystem (PCS) provides control, sequencing, monitoring, and diagnostic capabilities. Dedicated bias regulators, which are powered from the high-voltage DC, provide housekeeping control (i.e., low power) and start-up power to each bank of output regulators.
The high-voltage DC bus permits low-voltage output regulators to be added or removed for different system configurations. The high-voltage DC bus also can be backed up with a battery unit that produces high-voltage DC from 48-volt batteries through a step-up switching regulator. This approach allows any specific low-voltage output to be produced, as needed, during the battery backup period without using specific battery-to-logic voltage output DC-to-DC regulators. The battery required to back up the entire computer system would be larger than the computer itself. Therefore, diodes are inserted into the high-voltage DC distribution to partition the high-voltage DC bus, and only sections, such as the memory refresh operation and PCS control, are backed up.
Figure 1   VAX 9000 Model 200 Series Power System (block diagram; recoverable labels: utility power, 120/208 VAC, 3 phase; environmental monitors; PCS, power control subsystem)
Figure 2   VAX 9000 Model 400 Series Power System (block diagram; recoverable label: PCS, power control subsystem)
Power Zoning
The power-zoning feature meets the maintainability and high availability goals in the VAX 9000 Model 400 series of triple and quadruple processors. In the power system's configuration, one dual-processor pair can be powered off for maintenance, while the remaining powered-on processors maintain system operation.
A quadruple processor configuration is not composed of two identical dual processors. Some functions of a quadruple processor are not replicated. The system control unit, the memory, the service processor unit, and the PCS are common to both dual processors. Therefore, these functions can be powered by either front end. The high-voltage DC power bus is diode OR'd from either AC power source, through the dual diode, CR1, and then fed to the output stages that power the common elements listed above.
The diode-OR process in the VAX 9000 system does not provide for active load sharing. Active load sharing between the AC front ends would increase the overall actual power system reliability because it ensures that each AC front end supplies half the load. Otherwise, one AC front end could take most of the load (and be stressed higher), which would leave the other unit too lightly loaded. However, active load sharing is complicated by the physical distances between the AC front ends and the complex handling of faults and partial faults in each AC front end. The load of the common elements in the VAX 9000 system is only 20 percent of the total system. Therefore, the worst load imbalance does not justify the added complexity.
The diode does not have a significant impact on overall power load reliability because conservative derating of the diode results in a lower diode operating temperature and hence higher reliability.
We were concerned that power zoning could have an impact on the rest of the system as a result of powering down part of the system. However, analysis of the results showed that such a concern was unfounded. The high-voltage DC bus has relatively long time constants (i.e., it is slow to react to changes). Therefore, turn-on and turn-off transients on the bus are smooth and gradual and do not generate quick-changing electromagnetic fields that could affect the operation of the sections of the system that are still functioning.
N + 1 Redundancy
Each processor in the VAX 9000 power system uses approximately 400 amperes from each of the two supply voltages. The ratings of the power semiconductors used in the outputs of the DC-to-DC regulators deliver an optimal regulator rating of approximately 240 amperes. Based on these ratings, powering a CPU in the VAX 9000 system would require two regulators for each voltage. However, in a large system, such as the VAX 9000 system, the number of regulators can quickly add up, which would result in an equally quick drop in overall system reliability. Powering two CPUs from the same voltage bus reduces the number of regulators.
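One way to read the regulator-count argument above is sketched below in Python. The 400-ampere load and 240-ampere rating come from the text; the assumption that each bus carries one spare regulator (N + 1) and the helper name `regulators_needed` are ours, for illustration only.

```python
import math

CPU_LOAD_A = 400.0          # approximate current per CPU per supply voltage (from the text)
REGULATOR_RATING_A = 240.0  # approximate regulator rating (from the text)


def regulators_needed(load_amps, rating_amps, spares=1):
    """Regulators on one bus: N to carry the load, plus `spares` redundant units."""
    n = math.ceil(load_amps / rating_amps)
    return n + spares


# Assumption: one bus per CPU, each bus with its own spare regulator.
separate_buses = 2 * regulators_needed(CPU_LOAD_A, REGULATOR_RATING_A)

# Assumption: both CPUs share one bus, so a single spare covers both.
shared_bus = regulators_needed(2 * CPU_LOAD_A, REGULATOR_RATING_A)

print("separate buses:", separate_buses)
print("shared bus:", shared_bus)
```

Under these assumptions the saving comes from sharing the spare: without redundancy (`spares=0`) both arrangements need four regulators per voltage.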
Redundancy is then used to minimize the impact of the large number of regulators in the bus. By using redundancy, additional regulators on a voltage bus increase the perceived time between complete failures.

For example, consider a voltage bus that requires two regulators to supply the load current. A failure in either regulator causes a complete failure. If another parallel regulator is added to supply the load current, the probability of a complete failure significantly decreases. In this case, if one regulator fails, the other two could supply the load. The statistical probability that another failure would occur before the failed regulator is replaced is very small.

A system of N regulators at an individual failure rate of lambda (λ) would have a system failure rate of N times λ, or an MTBF of 1 divided by N times λ.¹ The actual calculations are

$$\lambda_{\text{total}} = N \times \lambda$$

or

$$\text{MTBF} = \frac{1}{\lambda_{\text{total}}} = \frac{1}{N \times \lambda}$$

The failure rate calculation for a system that contains one regulator more than required (N + 1) is

$$\lambda_{\text{total observed}} = \frac{(N + 1) \times N \times \lambda \times \lambda}{[(N + 1) \times \lambda] + (N \times \lambda) + u}$$

$$\text{MTBF}_{\text{observed}} = \frac{[(N + 1) \times \lambda] + (N \times \lambda) + u}{(N + 1) \times N \times \lambda \times \lambda}$$

It should be noted for the above equation that u equals 1 divided by the time between fault and repair (service interval).

Using this calculation, if a bus required 4 regulators and each regulator had an MTBF of 400,000 hours, the observed MTBF would be 100,000 hours. The observed MTBF with five regulators (i.e., N + 1) would be 23,989,000 hours, which is 239 times longer than the four regulator case. This calculation assumes that the maximum time between the fault occurrence and repair is 2 weeks, or 336 hours. The observed MTBF is so large, compared to other elements in the system, that the redundant regulators have an extremely small effect on the overall reliability.

The number of redundant regulators per output voltage bus is limited to one in the VAX 9000 power system for space, weight, and cost reasons. N is the number of regulators required to supply the maximum current of a bus, and the addition of one more regulator is called N + 1 redundancy.

N + 1 redundancy relies on the good regulators on the output bus to pick up the load from the failed unit. This reliance has a significant impact on the design of the regulator, the regulator response time, and how the regulator handles the faults that can cause a failure. Fast regulator response (the time it takes to respond to a change in input or output) is needed to ensure that the output voltage does not dip too much when each regulator picks up its share of the load from the failed regulator. However, the faster response time makes it more difficult to keep the control functions of the unit stable. Moreover, the regulator input voltage range is designed to be relatively wide to tolerate wide swings in the high-voltage DC input.

When one regulator in a bank of regulators operated in parallel fails, the output bus voltage dips until the other regulators, which are connected in parallel, can react and pick up the load currents. The magnitude of the dip depends on the time the input fuses in each regulator take to open and on the values of the input capacitors and the distribution impedances.

Fast-opening fuses allow smaller voltage dips but are more prone to false nuisance openings. Slow-opening fuses do not open for normal or nuisance surges, but allow a greater voltage dip. Large values of input capacitance provide the energy to open the fuses quickly, but the voltage recharging of the capacitors is longer. A high distribution impedance decouples the faults from other units but has a high power loss.

Simulation and testing showed that the wide input range design of the regulators is sufficient to tolerate the high-voltage input dips caused by other faults. The regulator control and response time keep the low-voltage DC outputs within specification when the input voltage is within its range.

Other faults within the regulator can cause it to fail, but the load is picked up by the other regulators, operating in parallel, on the bus. Clearly, faults such as a permanent short on the output bus cannot be survived. Because the low-voltage output regulators operate in parallel and in an N + 1 redundancy mode, the output voltage is not affected by most common single-fault conditions in the power system hardware.

Decoupling
A key feature of the power system's architecture is that each major subsystem is relatively decoupled from the other subsystems. Decoupling permits each subsystem to be designed for its own requirements and to be changed or upgraded as the requirements change (e.g., more cost effective, improved technology, or different output voltage), provided the interface and critical function remain the same. For example, two significantly different cost and performance options, H7392 or H7390, for the AC front end can be used in different configurations, and the rest of the power system does not need to be changed. Thus, power platforms can be flexibly tailored to meet the needs of different computer systems.
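The N + 1 arithmetic quoted in the N + 1 Redundancy section can be checked with a short Python sketch. The two formulas are the ones given in the text; the function names are ours, and the 336-hour service interval is the two-week repair time stated there.

```python
def mtbf_without_redundancy(n, regulator_mtbf_hours):
    """MTBF of a bus that fails when any one of its n regulators fails."""
    failure_rate = 1.0 / regulator_mtbf_hours   # lambda
    return 1.0 / (n * failure_rate)


def mtbf_n_plus_1(n, regulator_mtbf_hours, repair_hours):
    """Observed MTBF of a bus with n required regulators plus one spare."""
    failure_rate = 1.0 / regulator_mtbf_hours   # lambda
    repair_rate = 1.0 / repair_hours            # u
    observed_rate = ((n + 1) * n * failure_rate * failure_rate) / (
        (n + 1) * failure_rate + n * failure_rate + repair_rate
    )
    return 1.0 / observed_rate


four_regulators = mtbf_without_redundancy(4, 400_000)
five_regulators = mtbf_n_plus_1(4, 400_000, 336)

print(f"4 regulators, no spare: {four_regulators:,.0f} hours")
print(f"5 regulators (N + 1):   {five_regulators:,.0f} hours")
print(f"improvement factor:     {five_regulators / four_regulators:.1f}")
```

The result lands near 24 million hours, which matches the figure in the text once rounded to the nearest thousand.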
Achieving Low Harmonic Distortion
The AC front end of the VAX 9000 power system processes and converts public utility AC power to high-voltage DC. Our goal was to design the AC front end to be highly reliable, have high availability, and meet the emerging European AC power quality standards. One of those standards is to have low harmonic distortion of the input AC current waveform. These features were essential to support the VAX 9000 system's entry into the mainframe computer market. We also decided to meet the low harmonic distortion standard in the AC front end because the European marketing and sales force viewed compliance with this standard as a distinct competitive advantage.
Design Factors
The dominating design factor for the AC front end was the size of the input power level, which was approximately 20,000 watts. This size significantly exceeded the power levels of previous AC circuit designs for a single unit. The high power consumption was a result of the use of 250,000 emitter-coupled logic (ECL) gates in the CPU and 512 megabytes (MB) of memory.
High Reliability and Availability   To achieve high reliability, we used conservative power derating levels and good thermal management for key devices. Typically, device voltages are held to 80 percent of rating. The main switches and rectifiers used in the power stages are held to 40 percent of rating. Current derating is also conservatively placed at 40 percent. Stress is lessened because of lower device junction temperatures, which results in a longer operational life, which equates to higher reliability.
We designed two approaches to attain high availability. First, redundant circuitry was used for the AC-to-DC circuit function. Second, we increased immunity to line outage from the standard practice of one cycle of outage protection to ten cycles. The increase from one cycle to ten cycles of outage immunity provides the VAX 9000 system with a 300 percent improvement in mean time between observed system power outages over standard Digital systems. This feature improves system availability to the customer.
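To give a feel for what ten cycles of outage immunity means at this power level, the Python sketch below converts line cycles to ride-through time and to the energy the load draws in that time. The 20,000-watt figure is from the text; the 50 Hz and 60 Hz line frequencies are our assumptions, and conversion losses are ignored.

```python
LOAD_WATTS = 20_000  # approximate AC front-end power level (from the text)


def ride_through_seconds(cycles, line_hz):
    """Duration of an outage lasting `cycles` AC cycles at `line_hz`."""
    return cycles / line_hz


def ride_through_joules(load_watts, cycles, line_hz):
    """Energy the load draws during the outage, ignoring conversion losses."""
    return load_watts * ride_through_seconds(cycles, line_hz)


for line_hz in (50, 60):          # assumed utility frequencies
    for cycles in (1, 10):
        t_ms = 1000 * ride_through_seconds(cycles, line_hz)
        e_j = ride_through_joules(LOAD_WATTS, cycles, line_hz)
        print(f"{line_hz} Hz, {cycles:2d} cycle(s): {t_ms:6.1f} ms, {e_j:7.1f} J")
```

At 60 Hz, ten cycles is about a sixth of a second, during which a 20,000-watt load draws a few kilojoules.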
Harmonic Distortion   The power system's design had to meet the increasing restrictions on the interface with the public power utility and be able to withstand the occasional availability of only poor power. Utility power is generated as a relatively pure (i.e., low harmonic distortion) sine wave. AC front ends and power supplies must convert this sine wave of voltage to a ripple-free DC voltage for ultimate consumption by the logic chips within the computer system. Standard methods used for this conversion create a nonlinear load on the sine wave of voltage. Because of the distribution system impedance, this nonlinear load distorts the utility's sine wave of voltage, and the distortion usually appears as interference for other users. In Europe, the occurrence of this type of interference is planned to be limited by restricting how much nonlinear load current an AC front end can have. Therefore, we had to design unique circuitry that could convert AC power to DC power at 20,000 watts without high levels of current distortion to meet this European requirement.
A design based on commercially available control technology could not meet the stringent technical requirements of high overall conversion efficiency and stability of operation because conventional AC-to-DC circuitry produces up to 30 percent distortion. Our goal was to comply with emerging European requirements of harmonic current distortion levels in the 5 percent range. However, at the time we were designing the system, no circuitry at this power level existed in the power conversion industry. Therefore, we had to develop a unique pulse-width modulator (PWM) circuit and control equations for the input power conversion stage, which is shown in Figure 3.
The pulse-width modulator combines the advantages of low switching frequency, which reduces switching losses in the converter, with exceptionally short response time to all input line voltage disturbances and to rapid changes in the required computer power. The final design produces less than 5 percent total harmonic distortion of the input line current when the UPC is operated at 20,000 watts load. The unique PWM design also increased the immunity to line voltage outages from one cycle of outage protection to ten cycles. Furthermore, the increase was achieved without a corresponding tenfold increase in storage capacitors.
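The distortion percentages above refer to total harmonic distortion (THD) of the line current. As a reminder of how that number is formed, here is a minimal Python sketch using the usual definition (RMS of the harmonics divided by the RMS of the fundamental). The harmonic amplitudes are invented for illustration and are not measurements of the UPC or of any conventional rectifier.

```python
import math


def total_harmonic_distortion(fundamental_rms, harmonic_rms):
    """THD as a fraction: RMS sum of the harmonics over the fundamental."""
    return math.sqrt(sum(h * h for h in harmonic_rms)) / fundamental_rms


# Hypothetical harmonic currents, expressed relative to a fundamental of 1.0.
conventional_front_end = [0.22, 0.15, 0.10, 0.08]
low_distortion_front_end = [0.03, 0.02, 0.015, 0.01]

thd_conventional = total_harmonic_distortion(1.0, conventional_front_end)
thd_low = total_harmonic_distortion(1.0, low_distortion_front_end)

print(f"conventional (hypothetical):   {100 * thd_conventional:.1f} percent")
print(f"low distortion (hypothetical): {100 * thd_low:.1f} percent")
```

The hypothetical amplitudes were chosen only so that the two cases fall near the 30 percent and 5 percent levels discussed in the text.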
Figure 3   UPC Block Diagram (recoverable labels: AC input, AC filter, rectifier, fast discharge, output switch, aux AC power and power line monitor, to UPC circuits, digital power bus and total off bus, RIC interface)
Flexible Line Cord
The high power level and the requirements for a flexible line cord and plug required that the Underwriters Laboratories (UL) and Canadian Standards Association (CSA) agencies expand the regulations that governed the size of power cordage allowed in a computer room. A flexible line cord connected to the AC service is a requirement by Digital for all its products. This feature is deemed valuable because it facilitates both the initial installation of the computer and possible relocation at the customer's site. Although delays can occur while waiting for a national agency to amend one of its national regulatory codes, the approvals were received in time to maintain the project's schedule.
Improving Load Sharing
Detailed stress analyses show that when regulators are operated in parallel, maximum reliability is achieved when the load current is shared equally among them.
Traditional Approach
A traditional approach to running regulators in parallel may be seen in VAX 8000 series machines. In these processors, regulators that are designed for standalone operation are placed in a parallel configuration. Current sharing is forced by modifying each supply's individual reference voltage through external monitoring and control. In the case of VAX 8000 machines, a maximum of four units may be coupled in this way. Figure 4 shows that this method essentially uses equipment that was designed to function as standalone regulated voltage sources. By adding external control loops, the equipment is forced to provide identical output voltages, as measured at some defined point in the system. If precise voltage matching is not achieved, whichever supply has the higher voltage takes the load, up to its overcurrent sense point. Thus, equal load sharing cannot happen.

Individual external controllers are required for each converter, which makes the system more complex. The VAX 9000 system requires up to five converters per bus, and we could not achieve better than 20 percent power sharing between modules by using this method. No traditional methods could support the number of converters in the VAX 9000 system. Also, most methods had a master-slave relationship that precluded maximizing a regulator's reliability potential.
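The sensitivity of paralleled voltage sources to small setpoint mismatches can be illustrated with a simple Python model. It treats each supply as an ideal voltage source behind a small output resistance feeding a common bus; the setpoints, resistances, and load current are hypothetical values chosen for illustration, and overcurrent limiting is not modeled.

```python
def parallel_voltage_sources(setpoints_v, output_ohms, load_amps):
    """Bus voltage and per-supply currents for voltage sources sharing one load."""
    conductances = [1.0 / r for r in output_ohms]
    weighted = sum(v * g for v, g in zip(setpoints_v, conductances))
    v_bus = (weighted - load_amps) / sum(conductances)
    currents = [(v - v_bus) * g for v, g in zip(setpoints_v, conductances)]
    return v_bus, currents


# Two supplies whose setpoints differ by 10 millivolts (0.2 percent of 5 V).
setpoints = [5.000, 5.010]

# Hypothetical output resistances: 1 milliohm, then 0.1 milliohm per supply.
_, soft_currents = parallel_voltage_sources(setpoints, [1e-3, 1e-3], 300.0)
_, stiff_currents = parallel_voltage_sources(setpoints, [1e-4, 1e-4], 300.0)

print("1 milliohm each:  ", [round(i, 1) for i in soft_currents])
print("0.1 milliohm each:", [round(i, 1) for i in stiff_currents])
```

In this model the stiffer the sources, the larger the current imbalance produced by the same setpoint error, which is why external loops are needed to trim each reference.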
New Approach
As a result of the limitations of the traditional methods, we developed a new, less complex approach to current sharing between parallel converters. Although developed specifically for the VAX 9000 program, the features and utility of this approach have universal application. The essential technological shift from prior practice is that in this system the regulators are current sources rather than voltage sources.

Figure 4   Load Sharing by Voltage Control of Voltage Sources (block diagram; recoverable labels: converter, internal reference and error amp, current sense, intelligent control unit (one per module), power control system, voltage control, load)

We designed the current sources to have a compliance range that covers a band of voltages that are normally found in logic circuits. By making the regulator outputs fully floating, the VAX 9000 system requirements for +5-volt, -3.4-volt, and -5.2-volt buses are met with only one regulator design, rather than a separate design for each voltage. The VAX 9000 design is simpler and has a lower manufacturing cost. The regulator is voltage and polarity "blind" over its compliance range, and any number of regulators may operate in parallel to provide any amount of power required at any voltage within the compliance range. Also, this method automatically compensates for the effects of stray resistances and different path lengths from individual regulators on a bus.

The basic features of this new approach are shown in Figure 5. Individual regulators behave as externally programmed current sources controlled by a common control signal, such that each regulator delivers the same current. If the outputs are connected to a common load, the current in that load is the sum of the individual regulator output currents. The resulting voltage that appears across the load is the product of that current and the equivalent resistance of the load. Furthermore, if that voltage is compared with a reference voltage in a conventional error amplifier and the resulting error signal is used to derive the regulators' external programming source, then a voltage control loop exists around the regulator system. Thus, although each regulator acts as a current source, the system acts as a controlled and regulated voltage source. Because the voltage control loop only contains one pole, the bandwidth of the control loop can be increased by a factor of at least 15. As a result, the substantially high current change requirements imposed by high-speed memories, such as those used in the VAX 9000 system, can be accommodated.

Figure 5   Load Sharing by Current Control of Current Sources (block diagram; recoverable labels: three converters with output currents I1, I2, and I3, current control, load; V = (I1 + I2 + I3) × Z LOAD)

Principle of Operation
A two-transistor forward regulator is shown in Figure 6. In this regulator, S1 and S2 are switched into conduction simultaneously, which causes the current to flow in the primary winding of transformer T1 at a level that is directly proportional to the output current Iout plus the slope of the current due to Lout. This current also flows in the primary winding of current sense transformer T2. The resulting current that flows in the T2 secondary winding develops a voltage across the load resistor, RL, which is amplified in A1 and applied to the input of comparator C1. Therefore, at this point, a voltage pulse appears, the amplitude and shape of which are directly proportional to the current flowing in the output choke Lout during the S1-to-S2 conduction period.
A conventional reference source/error amplifier combination is placed across the output of the supply. The resulting error signal, called Vcontrol, is applied to the other input of comparator C1 as a DC level. The comparator is followed by gating and drive circuits to the power switches.
Switching is initiated by a pulse within the gating circuit that drives the power switches on. The current flows in the output choke, Lout, and a proportional voltage appears at the output of the amplifier A1. As this voltage ramps, it crosses the threshold set by Vcontrol at the C1 input. The comparator output then changes state and causes the drive pulse to the switches to cease.
If Vcontrol were a fixed value, the system would be a constant current source. Therefore, the voltage that would appear at its output would be the result of that constant current and whatever load is placed across those terminals (i.e., Vout would be determined by the load value). By using an error amplifier and reference, Vcontrol can be made a variable quantity. Therefore, the regulator transfer function can control its output current to any level necessary to produce the desired voltage. In such a system, a control voltage, which is derived from a single error amplifier and reference, can be used as the control input for several regulators that are running in parallel. Thus, the current from multiple regulators that feed a common bus can be shared.

Figure 6   Two-transistor Forward Regulator (schematic; recoverable labels: IOUT, to C2 through N)
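The principle described above, in which one error amplifier programs several identical current sources, can be sketched as a small discrete-time simulation in Python. This is an idealized illustration, not a model of the H7380: the integrating error amplifier, the current-source gain, the load resistance, and the time step are all assumptions.

```python
def simulate_shared_bus(n_regulators, load_ohms, v_ref=5.0,
                        gain_a_per_v=100.0, ki=50.0, dt=1e-4, steps=2000):
    """Drive n identical current sources from one integrating error amplifier."""
    v_control = 0.0
    v_out = 0.0
    currents = [0.0] * n_regulators
    for _step in range(steps):
        # Every regulator receives the same Vcontrol, so each delivers the same current.
        currents = [gain_a_per_v * v_control for _k in range(n_regulators)]
        # The load voltage is the summed current times the load resistance.
        v_out = load_ohms * sum(currents)
        # Single error amplifier: integrate the difference from the reference.
        v_control += ki * (v_ref - v_out) * dt
    return v_out, currents


# Hypothetical 0.02-ohm load (250 A at 5 V), first with three regulators, then two.
v_three, i_three = simulate_shared_bus(3, 0.02)
v_two, i_two = simulate_shared_bus(2, 0.02)

print(f"3 regulators: {v_three:.3f} V, {i_three[0]:.1f} A each")
print(f"2 regulators: {v_two:.3f} V, {i_two[0]:.1f} A each")
```

In this sketch the bus voltage settles at the reference in both cases, and the per-regulator current simply rescales when a unit is removed, without any per-regulator trimming.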
Increased Control and Monitoring
In the VAX 8000 series, power and environmental monitoring and control is provided by the H7188 environmental monitoring module (EMM). In the VAX 9000 system, these functions are provided by the power control subsystem (PCS).
Basic Design of EMM and PCS
The EMM monitors the DC-to-DC regulator control, air flow sensor, and cabinet temperature. It is also the interface between the system console and the power system. Conceptually, the EMM functions as a peripheral device to the console similar to the way an intelligent disk controller is a peripheral to a CPU. The EMM is a single module that plugs into a power back panel.
The PCS is a distributed data acquisition and control system. It also interfaces between the power and environmental systems and other parts of the computer system. The PCS takes commands from, and reports status changes to, the service processor unit.
However, in the PCS, the conceptual model of the EMM is extended to provide additional support in hardware and firmware to off-load the service processor unit and to simplify the software interface to the PCS. The PCS includes many features that enhance testability, fault coverage, fault isolation, and system availability. The relationship of the PCS modules to one another and to other system components is illustrated in Figure 7. There are five PCS modules:
• Power and environmental monitor (PEM)
• CPU regulator intelligence card (CPURIC)
• I/O regulator intelligence card (IORIC)
• Signal interface panel (SIP)
• Operator control panel (OCP)
Figure 7 (block diagram; recoverable labels: to other power backplanes, power backplane, bulkhead)
Figure 8   H7380 Output Switching Stage (schematic; recoverable labels: T1, to rectifier and filter, D1 MUR460)
The initial model of the H7380 inverter stage used simple component models and did not consider any printed circuit board inductances or transistor capacitances because they seemed negligible compared to other elements. We noted a discrepancy in the voltage across the transistor Q1 (Vds) during the turn-off process between the simulated waveform, shown in Figure 9, and the measured waveform, shown in Figure 10.

Figure 9 shows that the voltage is initially zero while the transistor is conducting but rises to 200 volts when the transistor is turned off. Figure 10 shows that ringing occurs as the voltage approaches 200 volts, with an overshoot to 240 volts. The ringing and overshoot, not shown in Figure 9, are caused by the circuit board inductance, transformer leakage inductance, and the capacitance of the transistor.

Figure 11 shows a more accurate model of the output stage because the L1 through L4 etch inductances and C1 and C2 transistor capacitances are included. The current source, IPULSE, and the resistor, RT, approximate the transformer. Figure 12 shows the result of the simulation model that includes the L and C values shown in Figure 11.

When the simulation and the measured data are correlated, the advantage of accurate simulation becomes apparent. By using worst-case values for the circuit parameters, the simulation can determine the maximum peak voltage. The model depicted in Figure 12 shows that a device capable of withstanding the expected 240 volts is needed. Reliance on a less accurate model without parasitics could lead to the selection of a device capable of withstanding only 200 volts. Thus, accurate simulation allows the correct components and component ratings to be chosen and ensures a robust design.
Figure 9   Vds (Q1) Simulated Turnoff without Parasitics (plot of V(40,3) versus time; axes: volts, seconds)
Figure 10   Vds (Q1) Measured Turnoff (time axis: 200 nanoseconds per division)
Figure 11   Final Model of H7380 Output Switching Stage (simulation model of output circuit with parasitics; recoverable labels: L1 15NH, L3, L4 15NH, D1 MUR460, D2 MUR460)

Transient Analysis
A memory system that includes dynamic random access memory (RAM) chips presents a difficult transient load problem to its power supply. The problem arises from a combination of very high changes in dynamic RAM supply current and current change rise times that are typically more than a thousand times faster than the reaction time of a power system. The result is a temporary change in the load supply voltage. To handle these fast current edges, high-frequency capacitors are mounted on memory boards near the dynamic RAMs. Also, low-frequency electrolytic capacitors, which provide a source of local charge storage, are mounted on the memory boards to handle the magnitude of the change. The capacitors help keep the supply voltage within its operating range until the power supply can react and sufficiently change the current it supplies to the memory to stabilize the supply voltage. An adequate supply design with specified capacitors can keep the supply voltage within its operating tolerance. Simulation is used to determine the correct mix of high- and low-frequency capacitors and the number of regulators required to support this high transient load.
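A first-order feel for why local capacitance matters comes from the charge balance ΔV = I × Δt / C. The Python sketch below applies it to the 256-ampere, 92-nanosecond load pulse quoted later in this paper. The capacitance values are borrowed from the CHF and CLF entries of the Figure 14 listing as illustrative assumptions, the 10-microsecond regulator reaction time is hypothetical, and capacitor series resistance and inductance are ignored, so this is no substitute for the SPICE analysis.

```python
def droop_volts(load_step_amps, duration_s, capacitance_f):
    """Voltage droop when a capacitor alone supplies a current step: I * t / C."""
    return load_step_amps * duration_s / capacitance_f


PULSE_AMPS = 256.0        # steady-state pulse magnitude (from the text)
PULSE_WIDTH_S = 92e-9     # pulse duration (from the text)
C_HIGH_FREQ_F = 1.3e-3    # assumed total high-frequency capacitance
C_LOW_FREQ_F = 108.8e-3   # assumed total low-frequency capacitance
REACTION_TIME_S = 10e-6   # hypothetical regulator reaction time

fast_edge_droop = droop_volts(PULSE_AMPS, PULSE_WIDTH_S, C_HIGH_FREQ_F)
bulk_droop = droop_volts(PULSE_AMPS, REACTION_TIME_S, C_LOW_FREQ_F)

print(f"one pulse from high-frequency capacitors: {1000 * fast_edge_droop:.1f} mV")
print(f"reaction time from low-frequency capacitors: {1000 * bulk_droop:.1f} mV")
```

Both estimates come out at a few tens of millivolts under these assumptions, the same order as the millivolt allocations in the supply tolerance budget.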
Another power supply problem arises from the use of N + 1 redundancy for parallel regulators. When one of the regulators in a parallel regulator configuration fails, the remaining regulators must be able to take on the load from the failed regulator and keep the supply voltage within operating tolerance. Because the remaining regulators cannot react instantaneously, the load voltage drops until a sufficient increase in current can be provided by the remaining regulators.
For the VAX 9000 series memory system, a proposed dynamic RAM power supply design consisted of three H7380 DC-to-DC regulators, which would operate in parallel (including N + 1 redundancy) and be connected to the memory through power distribution busbars. The numbers of high- and low-frequency capacitors were also proposed. The power supply was expected to be ready for load testing before the memory or the busbars would be available. Therefore, we had to verify that this design could keep the memory supply voltage within operating tolerance. We verified the design by simulating the performance of the power system and measuring the performance of the actual power supply with a simulated load.

Figure 12   Vds (Q1) Simulated Turnoff with Parasitics (plot versus time in seconds)
Power Supply Operating Voltage Tolerance   The memory designers specified the operating tolerance of the dynamic RAM supply as +5 volts, ±10 percent. Using 10 percent as the supply tolerance budget, the supply designer made the allocations shown in Table 2 to all the factors that would cause the load voltage to deviate from its nominal value of +5 volts. As can be seen from this table, the sum of x and y must be less than 350 millivolts, or 7 percent of +5 volts.
Memory Load   The dynamic RAM supply current was calculated to be a steady-state pulsed current of 256 amperes that would last for 92 nanoseconds (ns), with rise and fall times of 20 ns, as shown in Figure 13. The initial pulse magnitude was 1024 amperes.
Table 2   Supply Tolerance Budget Allocation

Causes of Voltage Deviation           Millivolts   Percentage of +5 Volts
Regulator tolerance                   100          2
Back panel distribution               50           1
Transient load with two regulators    x
Failure of one regulator              y
Total deviation budget                500          10
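The budget arithmetic behind Table 2 is simple enough to restate as a Python sketch. The fixed allocations and the 500-millivolt total are from the table; the helper `fits_budget` and the sample x and y values are hypothetical.

```python
NOMINAL_MV = 5000           # +5-volt supply, in millivolts
TOTAL_BUDGET_MV = 500       # 10 percent of +5 volts

fixed_allocations_mv = {
    "regulator tolerance": 100,
    "back panel distribution": 50,
}

# What is left for the two transient terms, x and y.
remaining_mv = TOTAL_BUDGET_MV - sum(fixed_allocations_mv.values())
remaining_percent = 100 * remaining_mv / NOMINAL_MV


def fits_budget(x_mv, y_mv):
    """True if the transient-load dip (x) plus the regulator-failure dip (y) fits."""
    return x_mv + y_mv < remaining_mv


print(f"remaining for x + y: {remaining_mv} mV ({remaining_percent:.0f} percent)")
print("x = 200 mV, y = 100 mV fits:", fits_budget(200, 100))
print("x = 300 mV, y = 100 mV fits:", fits_budget(300, 100))
```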
VAX 9000 Series
Figure 13  VAX 9000 Model 400 Series Memory Power System Dynamic RAM Load (timing marks: 288 nanoseconds and 12.96 microseconds; KEY: A = amperes, NS = nanoseconds)

Memory Power System SPICE Model
In the SPICE model of the supply, busbar, load, and capacitors that is shown in Figure 14, the three regulators are modeled as a current source, Gout, controlled by the regulator feedback voltage, Vf. Cout and Rout represent the regulators' combined output capacitors and resistors. Most of the other elements in the model are determined from component specifications. The relationship between Gout and Vf was determined by laboratory measurements on a regulator and resulted in the following equations. For two regulators,

    Gout = 339 × Vf = 339 × (V8 - 2.5)

For three regulators,

    Gout = 678 × Vf = 678 × (V8 - 2.5)

The load is represented as two current sources, IA and IR, the characteristics of which were obtained from the loads shown in Figure 13.
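The two gain equations can be evaluated directly, as in the minimal sketch below. The gains 339 and 678 and the 2.5 volt offset come from the equations in the text. The function name is hypothetical, and because the paper does not state the units of Gout, the sketch returns a bare number rather than claiming amperes.

```python
REFERENCE_V = 2.5  # offset subtracted from V8 to form the feedback voltage Vf

# Gains quoted in the text for two and three regulators in parallel.
GAIN_BY_REGULATOR_COUNT = {2: 339.0, 3: 678.0}


def regulator_output(v8: float, regulators: int = 3) -> float:
    """Return Gout = gain * (V8 - 2.5) for two or three regulators."""
    if regulators not in GAIN_BY_REGULATOR_COUNT:
        raise ValueError("the text gives gains only for two or three regulators")
    vf = v8 - REFERENCE_V
    return GAIN_BY_REGULATOR_COUNT[regulators] * vf


if __name__ == "__main__":
    print(regulator_output(3.0, regulators=2))
    print(regulator_output(3.0, regulators=3))
```

For the same feedback voltage, the two-regulator configuration has half the gain of the three-regulator one, which is consistent with the slower correction of load changes described for two regulators.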
Figure 14  SPICE Model of VAX 9000 Memory Power System

KEY:
    R1    1   2   10K
    C1    2   0   0.6N IC=2.5
    R2    2   3   10K
    R3    3   4   20K
    C2    3   4   18P IC=5.0
    R4    4   5   1K
    VR    6   0   DC 5
    R5    5   7   2K
    C3    7   8   68N IC=3.0
    VG    8   9   DC 2.5
    R7    9   0   10MEG
    R6    9   10  10K
    C4    10  0   0.757N
    GOUT  0   20  POLY(1) 10 0 0 678
    D1    20  21  DIODE
    ROUT  21  0   17K
    COUT  21  22  12300U IC=5.0
    RESR  22  23  1M
    LESL  23  0   2.4N
    RBB   21  24  300U
    LBB   24  1   150N
    CHF   1   26  1.3M
    RHF   26  27  21U
    LHF   27  0   1.4P
    CLF   1   25  108.8M
    RLF   25  28  400U
    LLF   28  0   0.3N
    IA    1   0   PULSE 0 512A 0NS 20NS 20NS 92NS 288NS
    IR    1   0   PULSE 0 512A 0NS 20NS 20NS 92NS 12.96US
Simulation and Laboratory Measurements
The two previously stated conditions of interest resulting in large load voltage changes are the transient load with two regulators and the failure of one regulator.

For transient loads, a larger voltage change occurs with two regulators rather than with three because two regulators take longer than three to adjust the supply current to the new load value.

Simulated Load
For laboratory measurements, the actual dynamic RAM load, as shown in Figure 13, is difficult to design and build in a reasonable time because of the magnitude and rise time combination. However, a load with a much slower rise time could be easily built. Such a load (I in Figure 14) is expected through the busbar because the capacitors and busbar slow down the fast edges of the dynamic RAM load. This simulated load was built and connected to two regulators. The predicted waveform and the measured waveform showed that the initial shapes of the peak change, the peak magnitudes (80 millivolts), and the times of occurrence of the peak (300 microseconds) were all similar. However, we could not measure the overshoot and ringing after the peak because the busbar was not available.

Failure of One Regulator
When one of the three regulators fails, the other two regulators cannot meet the increased load instantaneously. As a result, the load voltage drops until the two regulators can increase their output current sufficiently to reverse the direction of the drop. The SPICE model for this condition was run and the load voltage of the drop was predicted. Laboratory measurements were then taken with the simulated load and one regulator was turned off. Both the predicted and measured waveforms had the same shapes, peak magnitudes (100 millivolts), and times of occurrence of the peak (200 microseconds) after the regulator was turned off. Therefore, we concluded that the proposed design could meet the load requirements.

References
1. P. O'Connor, Practical Reliability Engineering, 2d ed. (New York: John Wiley and Sons, 1985).
2. SPICE is a general-purpose circuit simulator program developed by Lawrence Nagel and Ellis Cohen of the Department of Electrical Engineering and Computer Sciences, University of California, Berkeley.
Donald F. Hooper
John C. Eck

Synthesis in the CAD System Used to Design the VAX 9000 System

The design of the VAX 9000 system represents a sixfold increase in complexity over the VAX 8600/8650 system. This increased complexity posed a significant challenge because of the concurrent need to shorten the duration of the project design cycle and convert all high-performance systems computer-aided design (CAD) software from the DECSYSTEM-20 system to the VAX system. As part of the task of meeting these challenges, the CAD Group proposed the implementation of a design methodology that used logic synthesis for the first time in the development of a major product for Digital. The primary objectives of this methodology were to increase the productivity of the logic designers and to reduce the number of errors introduced during conversion of high-level designs into gate-level structural designs.

Methodologies

Previous Methodology
In the previous development methodology, as shown in Figure 1, logic designers specified high-level designs on paper, and simulation engineers transferred this rendition into a behavioral model. Technology engineers developed the gate-level cells. After the cells were defined and characterized for function and timing, the logic designers generated schematic drawings by using graphical bodies that represented the cells.

As changes were made to the schematics, the simulation engineers attempted to reflect these in the behavioral model. Finally, a gate-level simulation model was assembled from the completed schematics to verify that the design represented a valid VAX system. This process was extremely laborious, error-prone, and time-consuming. Therefore, we concluded it could not be used to develop the VAX 9000 system, which is a 700,000 gate design and for which the technology cells would not be defined and characterized until late in the design stage.

Logic Synthesis
Our early research into logic synthesis began in 1982. Over the next two years, we explored new synthesis ideas and constructed prototypes to determine the feasibility of those ideas. For example, one of our early logic minimization efforts was a program that emulated Brown's Laws of Form for transformations of Boolean logic to reduce gate counts and improve critical timing paths.¹ However, this program has had only limited success and is not really usable as a released computer-aided design (CAD) product. For example, the program does not deal with selections of cells for combinational logic nor does it consider the myriad problems involved in assembling a database for a buildable gate array chip.

During 1984 and 1985, new artificial intelligence (AI) and synthesis ideas were being developed. Universities and technical communities were exploring the potential of object-oriented databases, rule-based AI, data flow design entry, and algorithmic minimizations. We began the prototype development of our system for integral design (SID) at approximately the same time as the ideas for the VAX 9000 hardware architecture were beginning to be developed. In 1985, the SID program became an internal CAD product for use in the development of the VAX 9000 system. By combining the most advanced rule-based AI techniques with an object-oriented database, the core SID was designed to be a repository of logic design knowledge. We hoped that, over the years, SID would mature to perform many highly repetitive logic design tasks at an expert level.

From 1985 to 1988, the capabilities of the SID system gradually improved until it was producing gate array chips that met the VAX 9000 machine cycle time, power, and electrical rules requirements.
Figure 1  Previous Design Methodology (flow stages: technology cell definition, technology characterization, behavior model text edit, gate-level schematic entry, place and route; bug reports generated at each stage)
New Methodology
The VAX 9000 development methodology, shown in Figure 2, circumvents the need to wait for the technology cells to be completely specified before beginning logic design. This methodology uses schematic entry and simulates the technology-independent, register transfer level (RTL) bodies. The RTL library for this type of entry includes MUXes, latches, adders, comparators, incrementers, decoders, and simple Boolean gates. The entry is extracted to a common database format, called CADEX, from which a simulation model is built. A behavior model still exists, but its hierarchy matches the RTL schematic hierarchy at key physical boundaries. Thus, simulation models can be built that consist of a hierarchy of mixed behavior and RTL models.

While logic designers are creating the RTL design, technology engineers are defining the technology cells. In parallel with these activities, synthesis knowledge engineers are writing rules to transform the RTL design into technology cells. These three activities should be completed at the same time, at which point synthesis produces each of the VAX 9000 system's 77 gate array chips. The goals for the synthesis program were to

• Simplify design entry and thereby reduce schematic complexity by a factor of 4
• Generate 90 percent of the VAX 9000 system's logic through synthesis
• Reduce the number of simulation errors introduced in the design
• Reduce the number of electrical rules violations in the design
To generate a database for a buildable gate array chip, the synthesis tool is required to

• Read technology-independent input standard net list format, which can be in DECSIM behavioral notation or CADEX common database format
• Minimize Boolean gates through state-of-the-art minimization techniques
• Improve timing-critical paths through Boolean transformations, cell/pin selections, power settings, and net load allocations
• Choose the best available technology cells based on timing, size (area), and power estimates
• Insert the clock system for the gate array chip
• Insert testability access logic for the service processor unit
• Obey all electrical design rules for the gate array chip
• Make it easy to detect whether the tool has performed well
• Simplify the improvement of the tool
SID Database
The design of the SID database is fundamental to the robustness of the CAD system. Previous CAD databases have all assumed that the data is stable at the time that the CAD tools are working with it. Simulation, timing verification, design rule checkers (DRCs), and many other CAD tools assume that net lists and components are fixed and unchanging.

In those tools, although the data is maintained in a form that makes it easy to update its parameter values, the basic structure of gates, pins, and nets remains the same. However, throughout most of the synthesis process, the basic structures are in a state of change. In fact, it is a characteristic of synthesis that logic functions are removed and replaced with new, functionally equivalent logic. Because of this difference, we designed basic data structures and
Figure 2  VAX 9000 Development Methodology (flow stages: technology cell definition, technology characterization, behavior model text edit, synthesis rules text edit, RTL schematic entry, synthesize, place, route, set power; bug reports loop back to earlier stages)
IS_BOOLEAN, IS_A_NUMBER; adjectives are words such as ANY, ALL, NO. Dbobjects are database objects or the parameters of these objects.

The command forms used for right-side actions are "command dbobject" and "command dbobject preposition dbobject." Commands are words such as INSERT, REMOVE, REPLACE, MODIFY; prepositions are words such as WITH, TO, FROM. The dbobject can be any of the primary database objects, secondary objects, or their parameters.
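As an illustration only, the two right-side command forms can be recognized with a few lines of Python. The command and preposition vocabularies are the examples given in the text (which says "words such as," so the real lists were probably longer). The whitespace tokenization, the `Action` type, and the object names used in the usage example are assumptions of this sketch, not the actual SID rule syntax.

```python
from typing import NamedTuple, Optional

# Example vocabularies quoted in the text; the real language likely had more.
COMMANDS = {"INSERT", "REMOVE", "REPLACE", "MODIFY"}
PREPOSITIONS = {"WITH", "TO", "FROM"}


class Action(NamedTuple):
    """Parsed right-side action (hypothetical representation)."""
    command: str
    target: str
    preposition: Optional[str] = None
    other: Optional[str] = None


def parse_action(sentence: str) -> Action:
    """Parse 'command dbobject' or 'command dbobject preposition dbobject'."""
    words = sentence.split()
    if not words or words[0].upper() not in COMMANDS:
        raise ValueError("action must start with a known command")
    command, rest = words[0].upper(), words[1:]
    # Locate the first preposition, if any, to split the two dbobjects.
    prep_at = next(
        (i for i, w in enumerate(rest) if w.upper() in PREPOSITIONS), None)
    if prep_at is None:
        if not rest:
            raise ValueError("missing dbobject")
        return Action(command, " ".join(rest))
    target, other = rest[:prep_at], rest[prep_at + 1:]
    if not target or not other:
        raise ValueError("a dbobject is required on both sides of the preposition")
    return Action(command, " ".join(target), rest[prep_at].upper(), " ".join(other))


if __name__ == "__main__":
    print(parse_action("REMOVE gate"))
    print(parse_action("REPLACE gate WITH nand_cell"))
```

Keeping the action as a small sentence-like tuple mirrors the point of the rule language: the rule stays readable as words, and only the interpreter needs to know how each dbobject is resolved.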
For more complex operations, we also allowed LISP functions to be called by prefixing them with the keyword LISP, or by insertion of a LISP expression. Thus, if the rule language cannot implement a required function, a LISP algorithmic routine is called. We used algorithmic transforms in the generation of adder carry-lookahead.
Ruleform Database Access
Because the database could be traversed in any direction for any arbitrary distance through the multidirectional pointer system, rules had to have the same traversal capability. Therefore, the dbobject of the Ruleform language is a shorthand notation of the "database walk." Dbobject can be used in a sentence to compare two database objects by walking to both of them and using a predicate for the comparison.
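The idea of a database walk followed by a predicate comparison can be sketched in Python. Everything in this sketch is hypothetical: the `Node` class, the link names (`in_a`, `in_b`, `driver`), and the `walk` and `compare` helpers are illustrative stand-ins, not the SID database schema or the Ruleform notation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Sequence


@dataclass(eq=False)  # identity comparison only; the graph may contain cycles
class Node:
    """A database object with named pointers to neighbouring objects."""
    name: str
    kind: str
    links: Dict[str, "Node"] = field(default_factory=dict)


def walk(start: Node, path: Sequence[str]) -> Node:
    """Follow a sequence of named pointers from start; return the end object."""
    current = start
    for step in path:
        current = current.links[step]
    return current


def compare(start: Node, path_a: Sequence[str], path_b: Sequence[str],
            predicate: Callable[[Node, Node], bool]) -> bool:
    """Walk to two objects from the same start and apply a predicate to them."""
    return predicate(walk(start, path_a), walk(start, path_b))


def same_object(a: Node, b: Node) -> bool:
    """Predicate: both walks ended at the very same database object."""
    return a is b


if __name__ == "__main__":
    driver = Node("g0", "gate")
    net = Node("n1", "net", {"driver": driver})
    gate = Node("g1", "gate", {"in_a": net, "in_b": net})
    driver.links["out"] = net  # pointers run in both directions
    # Are both inputs of g1 driven by the same gate?
    print(compare(gate, ["in_a", "driver"], ["in_b", "driver"], same_object))
```

A rule condition built this way reads as "walk here, walk there, compare," which is the sentence shape the Ruleform shorthand is described as preserving.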
Had the database access been implemented in pure LISP programming notation, the sentence form would be lost in the many levels of expressions enclosed in parentheses. One test would occupy many lines of code and would read more like a software program than an English sentence. In this case, the chain of thought of the rule writer, the purpose of which is to capture the step-by-step thoughts of a logic designer in words, would probably be broken.
To improve the comprehension of the notation used for identifying the database object, we developed an