Digital Technical Journal
Number 4
February 1987
Editorial Staff
Editor - Richard W. Beane
Production Staff
Production Editor - Jane C. Blake
Designer - Charlotte Bell
Interactive Page Makeup - Leslie K. Schoemaker
Advisory Board
Samuel H. Fuller, Chairman
Robert M. Glorioso
John W. McCredie
Mahendra R. Patel
F. Grant Saviers
William D. Strecker
The Digital Technical Journal is published by Digital Equipment Corporation, 77 Reed Road, Hudson, Massachusetts 01749.
Changes of address should be sent to Digital Equipment Corporation, attention: Media Response Manager, 200 Baker Ave., CFO1-1/M94, Concord, MA 01742.
Comments on the content of any paper are welcomed. Write to the editor at Mail Stop HL02-3/K11 at the published-by address. Comments can also be sent on the ENET to RDVAX::BEANE or on the ARPANET to BEANE%RDVAX.DEC@DECWRL.
Copyright © 1987 Digital Equipment Corporation
Copying without fee is permitted provided that such copies are made for use in educational institutions by faculty members and are not distributed for commercial advantage. Abstracting with credit of Digital Equipment Corporation's authorship is permitted. Requests for other copies for a fee may be made to the Digital Press of Digital Equipment Corporation. All rights reserved.
The information in this journal is subject to change without notice and should not be construed as a commitment by Digital Equipment Corporation. Digital Equipment Corporation assumes no responsibility for any errors that may appear in this document.
ISBN 1-55558-001-7
Documentation Number EY-6711E-DP
The following are trademarks of Digital Equipment Corporation: DEC, DECnet, the Digital logo, LN03 Plus, MicroVAX I, MicroVAX II, NMI, PDP-11, PDP-11/24, PDP-11/44, RSX, RSX-11M, RSX-11M-PLUS, SBI, UNIBUS, VAX, VAX-11/750, VAX-11/780, VAX-11/782, VAX 8200, VAX 8300, VAX 8500, VAX 8550, VAX 8600, VAX 8650, VAX 8700, VAX 8800, VAXBI, VAXBI 78732, VAXcluster, VAXstation, VAXstation II, VMS.
ADA is a registered trademark of the U.S. Government.
Data General is a registered trademark of Data General Corporation.
Harris is a trademark of Harris Corporation.
IBM is a registered trademark of International Business Machines Corporation.
Lightspeed is a trademark of Lightspeed Computers, Inc.
Motorola is a registered trademark of Motorola, Inc.
SCALDsystem and ValidGED are trademarks of Valid Logic, Inc.
TK!Solver is a trademark of Software Arts, Inc.
UNIX is a trademark of American Telephone & Telegraph Company Bell Laboratories.
Book production was done by Educational Services Media Communications Group in Bedford, MA.
Cover Design
This issue features the VAX 8800 family. Our cover depicts the growth of a chambered nautilus as a metaphor for the growth of the VAX family. As those chambers spiral from the center, so the power of the VAX family grows from the MicroVAX systems, through the VAX 8200 and 8300 CPUs, to the new VAX 8800 multiprocessor. The image was created using the Lightspeed system.
The cover was designed by Deborah Falck, Eddie Lee and Tsuneo Taniuchi of the Graphic Design Department.
Contents

Foreword
Donald J. McInnis

New Products

An Overview of the Four Systems in the VAX 8800 Family
Robert M. Burley

The VAX 8800 Microarchitecture
Sudhindra N. Mishra

The CPU Clock System in the VAX 8800 Family
William A. Samaras

Aspects of the VAX 8800 C Box Design
John Fu, James B. Keller, and Kenneth J. Haduch

The Memory System in the VAX 8800 Family
Paul J. Natusch, David C. Senerchia, and Eugene L. Yu

Floating Point in the VAX 8800 Family
John H.P. Zurawski, Kathleen L. Pratt, and Tracey L. Jones

The VAX 8800 Input/Output System
James P. Janetos

The VAXBI Bus - A Randomly Configurable Design
Paul C. Wade

A Logical Grounding Scheme for the VAX 8800 Processor
Michael W. Kement and Gerald J. Brand

The Simulation of Processor Performance for the VAX 8800 Family
Cheryl A. Wiecek

VMS Multiprocessing on the VAX 8800 System
Stuart J. Farnham, Michael S. Harvey, and Kathleen D. Morse

A Parallel Implementation of the Circuit Simulator SPICE on the VAX 8800 System
Gabriel P. Bischoff and Steven S. Greenberg

The Impact of VAX 8800 Design Methodology on CAD Development
Dennis T. Bak

On-line Manufacturing Data Access on the VAX 8800 Project
Andrew J. Matthews
Editor's Introduction
Richard W. Beane
Editor
This issue features papers about the design of the VAX 8800 family of CPUs, written by members of the design team. The technology used in Digital's latest high-end machine, the VAX 8800 multiprocessor, also forms the basis for the other three family members: the 8700, 8550, and 8500 CPUs.
Bob Burley's overview relates the processes used in the 8800 design and the functions of the memory interconnect (NMI), the VAXBI I/O bus, and the four logic boxes forming the five-stage pipeline. The early discovery of design flaws and the use of automated tools helped to achieve an aggressive completion schedule.
The micromachine implements the microarchitecture and contains four of the five pipeline stages. Sudhin Mishra describes how microinstructions are handled, emphasizing the use of microbranches and microtraps to ensure coherency.
The VAX 8800 clock system, discussed by Bill Samaras, was designed using an automated timing verifier. He describes the trade-off between using the verifier and maximizing the accuracy of timing signals by minimizing their skew.
The C Box and the M Box are two parts of the pipeline. John Fu, Jim Keller, and Ken Haduch describe the C Box's no-write allocate cache and the delayed-write algorithm that ensures correct write-through. The C Box must also handle pipeline stall conditions and maintain data coherency between processors. The M Box handles read and write requests for the memory arrays. Paul Natusch, Dave Senerchia, and Gene Yu explain how the designs of the NMI and the cache affected their design, and why they used TTL in the memory controller.
The VAX 8800 family does not have a separate floating point accelerator. As John Zurawski, Kathy Pratt, and Tracey Jones point out, however, a custom ECL unit achieves high performance through the normal datapaths. Thus less hardware is needed, and operands are fetched faster.
I/O devices are linked to the CPU by the VAXBI bus. In his paper, Jim Janetos discusses the NBI adapter, which contains logic to handle CPU references and DMA requests. Then Paul Wade describes how the VAXBI design team had to abandon the traditional approach and use a variety of techniques to specify the bus. Some chip problems were resolved only after a thorough analysis of the physical configuration.
Jerry Brand and Mike Kement discuss the importance of using ground correctly as a signal conductor to achieve high performance. They describe the sources of ground-related noise in the CPU, and what they did to isolate and control those sources.
Many VMS features support multiprocessing. Stu Farnham, Mike Harvey, and Kathy Morse first describe the hardware that supports multiprocessing, then the interlocked instructions, exception handlers, and traps that implement VMS multiprocessing. To show how multiprocessing decreases execution time, Gabriel Bischoff and Steve Greenberg converted the SPICE circuit simulator into CAYENNE, a parallel program. They created master and slave processes that ran CAYENNE 1.7 times faster than SPICE.
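As a rough illustration of what a 1.7 times speedup can imply, the sketch below inverts Amdahl's law to estimate the fraction of the work that ran in parallel. Amdahl's law is a model brought in here for illustration, not one the introduction cites, and the processor count of two is an assumption based on the Foreword's description of the VAX 8800 as a dual processor. The function names are hypothetical.

```python
def amdahl_parallel_fraction(speedup: float, processors: int) -> float:
    """Solve Amdahl's law, S = 1 / ((1 - p) + p / n), for the parallel fraction p."""
    if processors < 2:
        raise ValueError("need at least two processors to infer a parallel fraction")
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return (1.0 - 1.0 / speedup) * processors / (processors - 1)


def parallel_efficiency(speedup: float, processors: int) -> float:
    """Speedup achieved per processor."""
    return speedup / processors


# Reported speedup of CAYENNE over SPICE; two processors is an assumption.
REPORTED_SPEEDUP = 1.7
ASSUMED_PROCESSORS = 2

fraction = amdahl_parallel_fraction(REPORTED_SPEEDUP, ASSUMED_PROCESSORS)
efficiency = parallel_efficiency(REPORTED_SPEEDUP, ASSUMED_PROCESSORS)
print(f"Implied parallel fraction: {fraction:.3f}")
print(f"Parallel efficiency:       {efficiency:.2f}")
```

Under these assumptions, a speedup of 1.7 on two processors corresponds to a little over four-fifths of the run time being parallelizable. The real figure depends on synchronization and memory effects that this simple model ignores.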
The final two papers relate some of the automated tools and techniques used on the 8800 project. Dennis Bak first describes building the CAD suite from existing tools, newly developed ones, and modifications. The methodology was truly innovative, serving as a framework for future projects. Then Andy Matthews discusses the on-line system that transformed CAD data into specifications used by Manufacturing. This system minimized the product start-up time by eliminating paperwork.
Biographies

Dennis T. Bak
Dennis Bak is a principal software engineer in the Advanced VAX Development Group. As a project leader, he is currently developing new CAD tools to improve designer productivity on future design projects. In other positions, Dennis performed configuration testing for PDP-11 and VAX systems. Prior to joining Digital in 1980, he worked as a research engineer at Ford Motor Company, doing advanced development on electronic engine-control systems. Dennis earned a B.S. degree in electrical engineering from the University of Michigan in 1974.

Gabriel P. Bischoff
In 1985, Gabriel Bischoff joined Digital after receiving a Diploma of Engineer and a Diploma of Advanced Studies in device physics from the Ecole Centrale de Lyon (1980) and a Ph.D. degree in E.E. from Cornell University (1985). As a senior software engineer in the Semiconductor Engineering Group, he is investigating the application of parallel computing architectures for VLSI CAD tools, particularly circuit simulators. Gabriel developed a parallel version of the circuit simulator SPICE for shared-memory multiprocessors. A member of IEEE, he has published papers on device modeling and circuit simulation.

Gerald J. Brand
Jerry Brand is a principal engineer currently developing high-density, high-availability power systems. Prior to working on the power and packaging team for the VAX 8800 family, he designed two MPS power modules that are widely used in Digital's products. Before joining Digital in 1980, Jerry worked for over 14 years in disciplines ranging from oceanography to gas-turbine instrumentation. He holds a B.S.E.E. degree from the University of Illinois and participated in the M.S.E.E. program at the University of New Hampshire. Jerry teaches circuit analysis and electronics in the continuing education program at the University of Lowell.

Robert M. Burley
As a senior product management manager, Bob Burley was the engineering product manager for the four systems in the VAX 8800 family. As a program manager in the LSI Acquisition and Test Group, he was responsible for relations with external vendors and acquiring technologies for the advanced gate arrays used in new CPU designs. Prior to joining Digital in 1980, Bob was a product and business development manager at Colt Industries, Inc., and a product and manufacturing manager at Scott Paper Company. He earned his B.S. degree in mathematics and economics from Hobart College in 1965.

Stuart J. Farnham
As a principal software engineer in the VMS Development Group, Stu Farnham is currently working on future directions in multiprocessing. Earlier, he provided VMS support at the corporate level for Software Services. Stu was a developer and instructor for the VAX/VMS Systems Seminar. He joined Digital in 1982 after working as a software engineer at Pitney Bowes, Inc.

John Fu
Currently earning his M.S. degree in computer science at the University of Illinois, John Fu was a principal engineer on the VAX 8800 project. He worked on the design of the C Box and configurations for the VAX 8800 family. Formerly, he worked on large-systems designs at International Computers Limited and on microprocessor control systems for Siemens Limited. John was also a project manager at Systems and Software, Inc. He received a B.Sc. (Hons) in computer science (1977) from the University of Manchester in England. John is a member of the British Computer Society and the IEE in England.

Steven S. Greenberg
As a team leader in the CAD Department of the Semiconductor Engineering Group, Steve Greenberg codeveloped the CAYENNE program. An early provider of circuit and process simulators at Digital, he did research in timing verification and circuit simulators. As a Digital industrial fellow at the University of California at Berkeley, Steve performed research on iterated timing analysis. Before joining Digital in 1976, he was a member of the technical staff at RCA and a CAD engineer at Texas Instruments. Steve received a B.S.E.E. degree (1966) from M.I.T. and an M.S.E.E. degree (1979) from Northeastern University. He is a member of IEEE and Tau Beta Pi.

Kenneth J. Haduch
In 1974, Ken Haduch joined Digital after earning his Associate in Electronic and Computer Technology degree from the Electronic Institutes, Pittsburgh. He worked as a technician in Manufacturing on the PDP-11/70 and VAX-11/780 CPUs and in Engineering on the DR750 and FP750 designs. Ken helped to develop the C Box as a hardware designer on the VAX 8800 project. He is currently a hardware engineer in the Advanced VAX Development Group, working on the hardware design for a new VAX processor. Ken is also pursuing a B.S. degree from Northeastern University.

Michael S. Harvey
Mike Harvey joined Digital in 1978 after receiving his B.S. degree in computer science from the University of Vermont. He worked on developing the RSX-11M and RSX-11M-PLUS operating systems and then led the team that developed the VAX-11 RSX layered product for the VMS system. Since joining the VMS Development Group, Mike has participated in new processor support for the VAX 8300 and 8800 systems, specializing in multiprocessing. As a principal software engineer, he is currently working on future directions for VMS multiprocessing and support for high-end VAX CPUs.

James P. Janetos
Jim Janetos is currently studying computer architecture as a graduate student at Purdue University. He joined Digital in 1980 after receiving his B.S.E.E. degree (Summa Cum Laude) from the University of Michigan, where he was elected to Tau Beta Pi. As a design engineer, Jim worked on memory upgrades for the PDP-11/24 and 11/44 systems, on memory system designs, and on dynamic RAM evaluations. On the VAX 8800 project, he initially worked on the diagnostic software for the I/O adapter, the NBI. Later, he designed the NBIB module, one of the two modules in the NBI.

Tracey L. Jones
Earning her B.S. degree in computer engineering from Boston University, Tracey Jones joined Digital after graduation in 1982. As a firmware engineer in the Advanced VAX Engineering Group, she wrote a major portion of the microcode that performs floating point operations in the VAX 8800 family of processors. After promotion to senior engineer, Tracey enrolled in Digital's Graduate Engineering Education Program and is now pursuing an M.S. degree in electrical engineering at Brown University.

James B. Keller
Jim Keller is the project leader for the instruction-fetch and execution units, the I and E Boxes, and the console for a new VAX processor. On the VAX 8800 project, he worked on the design of the C Box. Prior to joining Digital in 1982, Jim worked on fiber optics and the designs of several microprocessor boards at Harris Corporation. He earned a B.S. degree in electrical engineering in 1980 from Pennsylvania State University, where he was elected to Eta Kappa Nu. Jim has applied for three patents on the technology in the VAX 8800 design.

Michael W. Kement
Mike Kement is a senior design engineer in the Power System Technology Group, currently working on EMI and EMC. He was the design engineer for the power system on the VAX 8800 project. Mike has worked on the power systems of many products since joining Digital in 1974, including the LA36 and LA180 terminals, the PDP-11/44, VAX-11/780 and 11/750 systems, and the VAX 8600 CPU.

Andrew J. Matthews
As a senior software manager in the Advanced VAX Systems CAD Group, Andy Matthews is currently automating the CAD to CAM transition. He has managed the development of surface-mount CAD processes and a pilot program of advanced CAD to CAM data methods. Andy designed the prototype and first release of VLS, the VAX layout software Digital uses for module design. He worked for Adage, Inc., as the manager of applications programming before coming to Digital in 1977. Andy holds a B.S. degree in C.S. and M.E. (1968) from Boston University. He has presented two papers at the Design Automation Conference.

Sudhindra N. Mishra
Sudhin Mishra is a project leader in the Advanced VAX Development Group, currently developing a design verification CAD tool. As a principal engineer on the VAX 8800 project, he designed and implemented most of the I Box and originated the system-level simulation of the CPU. Before joining Digital in 1982, he was a senior research engineer at Prime Computers, Inc. Sudhin has worked on projects ranging from radar and heat-seeking missiles to computers. He earned a B.Sc. degree in engineering from Ranchi University and an S.M. in E.E. and C.S. from M.I.T. Sudhin has applied for a patent on the technology in the VAX 8800 design.

Kathleen D. Morse
As a consulting software engineer, Kathy Morse is responsible for all low-end CPUs and peripherals. She is also one of the designers for future directions in VMS multiprocessing. Kathy provided VMS support for the VAX-11/782 and MicroVAX I and II systems, and the MA780 memory. She joined Digital after receiving her B.S.C.S. degree (1976) from Worcester Polytechnic Institute, where she also earned her M.S.C.S. degree (1985). Kathy is a member of IEEE, the Professional Council, ACM, Tau Beta Pi, and Upsilon Phi Epsilon. She has published in the Computer Measurement Group's Conference Proceedings, the Digital Technical Journal, and DATAMATION.

Paul J. Natusch
As a principal hardware engineer, Paul Natusch is currently managing the hardware development for a new VAX processor in the Advanced VAX Development Group. On the VAX 8800 project, he was a member of the memory system team and later took over as its leader. Earlier, he worked on an upgrade to the VAX-11/750 memory controller, which expanded it from 2 MB to 8 MB. Paul joined Digital in 1980 from Storage Technology Corporation, where he was a diagnostic engineer. He received his B.S.E.E. degree from Cornell University in 1979 and an M.B.A. degree from Northeastern University in 1985.

Kathleen L. Pratt
Educated at Rensselaer Polytechnic Institute, Kathy Pratt came to Digital after receiving her B.S. degree in computer and systems engineering in 1980. She worked on hardware designs for networks in the Local Area Networks Group, then on the design of the floating point hardware for the VAX 8800 central processor in the Advanced VAX Development Group. Kathy is currently a senior engineer working on the floating point design for a new VAX processor.

William A. Samaras
Bill Samaras is a principal engineer working to design a new VAX processor. He joined Digital in 1982 to design the clock system on the VAX 8800 project. Formerly, at Accutest Corporation, Bill designed VLSI testers and timing systems. He holds an Associates degree (1973) from Northern Essex Community College, and B.S. degrees in engineering technology (1975) and electrical engineering (1976), both from Southeastern Massachusetts University. Bill teaches digital electronics for continuing education at the University of Lowell. He has applied jointly for a patent on the technology in the 8800 clock system.

David C. Senerchia
Dave Senerchia is currently a senior engineer in the Electronic Storage Development Group. He is a member of the design team working on the main memory for a new mid-range VAX system. On the VAX 8800 team, Dave designed the initial array module for main memory and participated in the architecture and design of the memory system, the M Box. He joined Digital in 1982 after earning a B.S. degree in electrical engineering from Washington University.

Paul C. Wade
As a principal engineer, Paul Wade is working on advanced development for future VAX CPUs. He was responsible for the electrical design, verification, and testing for the VAXBI bus. Paul also designed parts of the VAX 8200 system. Before joining Digital in 1980, he worked as a project engineer at Microwave Semiconductor Corporation, RCA, and Lockheed Electronics. Paul earned a B.S.E.E. degree (1973) from Newark College of Engineering. He holds a patent on gallium arsenide technology and has written nine papers on that topic. One paper won the Beatrice Winner Award at the 1980 ISSCC.

Cheryl A. Wiecek
Cheryl Wiecek is the engineering manager of the Systems Architecture Group and is responsible for the VAX architecture and a number of Digital's interconnect architectures. She worked on VAX instruction-set characterization and performance simulation for the VAX 8800 CPU. Cheryl also worked on PDP-11 performance simulation after coming to Digital in 1978. She was a programmer/analyst at the Connecticut Education Association and taught mathematics in Connecticut. Cheryl holds a B.A. degree in mathematics (1974) and an M.S. degree in computer science (1979) from the University of Connecticut. She has published five papers on computer performance in ACM and IEEE journals.

Eugene L. Yu
Gene Yu is a senior design engineer in the Workstation Engineering Group at Palo Alto. On the VAX 8800 project, he designed the memory system interface to the memory interconnect, the NMI. Before joining Digital in 1982, Gene worked at Prime Computer as a hardware designer on their 400 and 9900 systems, and at Data General Corporation on Nova products. He earned a B.S. degree in electrical engineering from the University of Massachusetts. Gene has applied for a patent as coinventor of the NMI and memory design for the VAX 8800 CPU.

John H.P. Zurawski
John Zurawski is a consulting engineer working as the project leader for computer arithmetic in the Advanced VAX Development Group. He led the team that designed the floating point strategy and hardware for the VAX 8800 family. John joined Digital in 1982 from the University of Manchester, where he was a post-doctoral research associate. He holds a B.Sc. degree in physics (1976), and M.Sc. (1977) and Ph.D. (1980) degrees in computer science, all from the University of Manchester. A member of IEEE, John has published four papers on computer technology.
Foreword
Donald J. McInnis
Group Manager, Advanced VAX Engineering
Since the announcement of the VAX-11/780 system in November 1977, Digital Equipment Corporation has steadily expanded the VAX family with new VAX products: the VAX-11/750, VAX-11/730, MicroVAX I, VAX-11/725, VAX-11/785, VAX 8600, MicroVAX II, VAX 8650, VAX 8200, and VAX 8300 systems. The market acceptance of the VAX family has been excellent across almost all computing applications. This remarkable and steady increase in the use of VAX systems creates a continuous demand by the VAX customer base for enhanced products across all segments of the computing industry. In the fall of 1982, the development team for the 8800 project (known internally as "Nautilus") was assigned the responsibility of designing new systems to enhance the mid-to-high end of the VAX family.
This issue of the Digital Technical Journal represents a sampling of the types of design engineering that went into the VAX 8800 family. It takes an amazingly large number of different engineering disciplines to design and manufacture a product of this complexity. As time moves on, each successive development project seems to require a bigger investment in a larger number of disciplines to produce a product attractive to the marketplace. It is unfortunate that neither time nor space permits us to give proper visibility to all the design, manufacturing, and customer-service engineering efforts that led to the shipment of the VAX 8800 family.
The VAX 8800 family consists of four new processors: the VAX 8800, VAX 8700, VAX 8550, and VAX 8500 CPUs. The VAX 8800 family and the VAX 8200 system introduced a major new I/O bus, the VAXBI. We also introduced a completely new set of I/O adapters for the VAXBI bus, which will be the new foundation I/O channel for many future mid- to high-end VAX systems. The VAXBI bus will replace the UNIBUS on this class of system. The VAXBI offers a six-fold increase in performance and substantially better reliability and maintainability features in comparison to the UNIBUS.
The 8800 represents a significant advance into new areas of high-performance computing for the VAX family. A customer can replace a VAX-11/780 CPU with a VAX 8800 CPU in the same footprint and effect an order of magnitude increase in the amount of work done. The VAX 8500 CPU is really a replacement product for the VAX-11/785 CPU kernel. However, the 8500 has the same price, twice the performance, and one third the footprint.
To produce a product that has a good price/performance ratio in the marketplace, you have to push hard on some dimensions of technology. A number of new pieces of technology were introduced on the VAX 8800 project, such as the 22-layer backplane and a 480-pin, zero insertion force connector. In the VLSI technology area, one 8800 includes a total of 186 emitter-coupled logic (ECL) gate arrays and a total of 28 custom-designed ECL parts.
The cycle time of a VAX CPU is a large determinant in its performance. The challenge of meeting a 45-nanosecond cycle time (versus 200 nanoseconds for the 11/780) required significant advancements in technology implementation and in CAD tools for analysis.
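To put the two cycle times side by side, the short sketch below converts each to an equivalent clock frequency and computes their ratio. It is only arithmetic on the two figures quoted above; the function and constant names are illustrative. The cycle-time ratio is not the same thing as the system-level performance ratio, which also depends on the pipeline, the memory system, and the workload.

```python
def clock_mhz(cycle_time_ns: float) -> float:
    """Convert a cycle time in nanoseconds to a clock frequency in MHz."""
    if cycle_time_ns <= 0:
        raise ValueError("cycle time must be positive")
    return 1_000.0 / cycle_time_ns


# Cycle times quoted in the Foreword.
VAX_8800_CYCLE_NS = 45.0
VAX_11_780_CYCLE_NS = 200.0

cycle_ratio = VAX_11_780_CYCLE_NS / VAX_8800_CYCLE_NS

print(f"VAX 8800:   {clock_mhz(VAX_8800_CYCLE_NS):.1f} MHz")
print(f"VAX-11/780: {clock_mhz(VAX_11_780_CYCLE_NS):.1f} MHz")
print(f"Cycle-time ratio: {cycle_ratio:.2f}x")
```

The shorter cycle alone accounts for roughly a 4.4 times gain in cycles per second. Any larger throughput difference between the systems would have to come from other design factors.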
Enhancements were made to the base operating system software for the VAX 8800 processor. These software enhancements represent a basic technological change that is available to our customers. The VMS operating system was improved significantly to provide much better throughput for customers using the VAX 8800 dual processor as a general-purpose system. The ULTRIX-32 operating system was enhanced to support tightly coupled multiprocessing. Software library structures were also developed for customers who might want to improve the throughput of a single job by decomposing it to run in parallel on the tightly coupled dual processors of an 8800.
To meet the performance goals, the overall design of the VAX 8800 system is necessarily quite complex and was potentially difficult to implement quickly and correctly. We understood this from the beginning of the project, based on our understanding of the experiences of previous projects (e.g., the VAX-11/750, VAX 8600, and J11 VLSI CPU chip projects). To manage that complexity in a timely manner, we selected some key strategies and stuck with them through the completion of the project. They proved to be very successful since the hardware prototypes were relatively error free, and the manufacturing start-up was very smooth and rapid. Some of these strategies are as follows:
• The project followed a structured design methodology that ensured the completion of comprehensive specifications before any detailed design was done.
• We made a large investment in our CAD team and in CAD tools to automate the design process.
• The basic design was managed by a chief architect.
• The system was simulated extensively before we built any hardware. (We finished the project with 14 VAX-11/780 and 11/785 systems in our cluster. During our peak simulation effort, however, over 30 dedicated VAX systems were used for a period of several months.)
• Since many different engineering and manufacturing locations were involved, we made extensive use of Digital's worldwide network for electronic mail and data exchange.
A more important factor than any of the above examples, however, was the people who worked on the project. We attempted to build an excellent team that worked well together. The attribute of teamwork and the willingness of people to have a broad engineering focus proved to be invaluable, especially in the simulation and prototyping phases. The core management team started with very experienced people, most of whom had VAX-11/780 or VAX-11/750 development experience: Sas Durvasula, VAX 8500 project manager; John Hittell, manufacturing manager; Steve Jenkins, engineering manager; Nancy Kronenberg, VMS engineering; Bob Kusik, CAD manager; Steve Omand, customer service engineering; and Bob Stewart, chief architect. Many contributors at the next level also had similar backgrounds, and all remained in place for the duration of the project. This continuity was a major factor in completing a very successful project and a very successful family of products.
Robert M. Burley
An Overview of the Four Systems
in the VAX 8800 Family
The VAX 8800 multiprocessor and the VAX 8700, 8550, and 8500 systems all derive from the same fundamental design. Their sustained applications throughput ranges from 3.0 to 12 times that of the VAX-11/780 system. In the design process, automated tools helped to correct design bugs early. ECL technology and a two-phase clock system achieve a 45-nanosecond cycle time. Microinstructions are processed simultaneously through four logic boxes that implement a five-stage pipeline. A high-speed memory interconnect, the NMI bus, links CPUs to memory and the I/O subsystem, which connects to VAXBI buses. Many reliability features, including extensive diagnostics, are implemented.
Design work on the VAX 8800 system began in September 1982 and concentrated on developing a balanced, high-performance system based upon the use of ECL components and multiprocessing. Although performance was the primary product goal, many technology, packaging, and implementation decisions reflected the equally pressing business requirements for reliability and ease of manufacturing.

The flexibility of the design ultimately spawned four CPU systems: the VAX 8800, VAX 8700, VAX 8550, and VAX 8500 models. These systems share many common functional and design attributes yet maintain noticeable implementation differences in the areas of performance, multiprocessing, expansion capability (memory and I/O), and packaging. As a result of these implementation variations, the sustained applications throughput (SAT) rates for these systems range from approximately 3.0 to 12 times the rate for a VAX-11/780 system. Sustained applications throughput is more indicative of usable performance for a given system than the more frequently reported peak numbers that can be derived from ideal or biased conditions. Table 1 compares the physical and performance attributes of these four VAX processor systems.
Design Environment
Traditional design environments have placed the greatest emphasis on discovering and eliminating design errors in the physical hardware. The complexity of the VAX 8800 design coupled with the new technologies involved would have created costly delays in the development schedule had traditional approaches been used. Early in the project, goals were defined to identify logic design problems and to solve all timing problems through the use of extensive design verification tools.

A hierarchical design and simulation environment allowed the engineers to move freely throughout the design at any level from gates, layouts, and behavioral models through complete system simulation and timing verification. Considerable computing resources were required to allow that freedom. This environment, with its carefully managed libraries and databases, allowed this work to be done before any hardware was actually assembled.¹ As a result, the design matured within our VAXcluster systems, evolving to hardware prototypes only after it was essentially complete and stable. In addition to the expected savings in prototype costs and a reduction in overall development time, the pervasive use of software tools significantly shifted the traditional debug effort to an earlier point in the design process. Cumulative bug-detection plots were used extensively to provide insight into the stability of the design.
The effect of this shift was to provide stable, early prototypes for extensive system characterization and testing, leading to earlier design acceptance. This strictly controlled design environment allowed us to complete physical debug along with the required system evaluation and testing in only eight months.

Table 1  CPU and Memory Attributes of the VAX 8800 Family

                                    VAX 8500            VAX 8550            VAX 8700            VAX 8800
SAT (compared to VAX-11/780)        3.5                 6.0                 6.0                 10.0 to 12.0
Cycle Time                          45 ns               45 ns               45 ns               45 ns

CPU Attributes
Number of Processors                1                   1                   1                   2
Upgrade Potential                   To 8550             None                To 8800             None
Writable Control Store (Words)      15K                 15K                 15K                 15K in each CPU
User Control Store (Words)          1K                  1K                  1K                  1K in each CPU
Microword Size                      143 Bits            143 Bits            143 Bits            143 Bits
Cache Size                          64KB                64KB                64KB                64KB (in each CPU)
Internal Datapath                   32 Bits             32 Bits             32 Bits             32 Bits
Instruction Buffer Type             16 Byte Look Ahead  16 Byte Look Ahead  16 Byte Look Ahead  16 Byte Look Ahead in each CPU
Maximum Total I/O Data Rate         16MB/s              16MB/s              Over 30MB/s         Over 30MB/s
Maximum I/O Channels                2                   2                   4                   4

Memory Attributes
Maximum Physical Memory Size        80MB                80MB                128MB               128MB
Cycle Times:
  Hexword Read (256 bits), min.     495 ns              495 ns              495 ns              495 ns
  Hexword Read (256 bits), max.     1260 ns             1260 ns             1260 ns             1260 ns
  Octaword Write (128 bits), min.   270 ns              270 ns              270 ns              270 ns
  Octaword Write (128 bits), max.   540 ns              540 ns              540 ns              540 ns
  Longword Write (32 bits), min.    135 ns              135 ns              135 ns              135 ns
  Longword Write (32 bits), max.    495 ns              495 ns              495 ns              495 ns
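
The memory cycle times listed in Table 1 can be checked against the 45-ns CPU cycle time with a little arithmetic. The sketch below is illustrative only: the helper names are hypothetical, and the "best-case MB/s" figure is simply bytes per transaction divided by the minimum time, which is an assumption about how to read the table rather than a throughput claim made by the article.

```python
# Illustrative arithmetic on the Table 1 memory cycle times.
# All function and variable names here are hypothetical.

CYCLE_NS = 45  # CPU cycle time stated in the article

# operation -> (bytes moved, minimum ns, maximum ns), taken from Table 1
MEMORY_TIMES_NS = {
    "hexword read": (32, 495, 1260),
    "octaword write": (16, 270, 540),
    "longword write": (4, 135, 495),
}


def in_cycles(time_ns, cycle_ns=CYCLE_NS):
    """Express time_ns as a whole number of cycles; raise if it is not one."""
    cycles, remainder = divmod(time_ns, cycle_ns)
    if remainder:
        raise ValueError(f"{time_ns} ns is not a multiple of {cycle_ns} ns")
    return cycles


def best_case_mb_per_s(size_bytes, time_ns):
    """Bytes per transaction over its minimum time, in units of 10**6 bytes/s."""
    return size_bytes / time_ns * 1000.0


for name, (size, t_min, t_max) in MEMORY_TIMES_NS.items():
    print(name, in_cycles(t_min), in_cycles(t_max),
          round(best_case_mb_per_s(size, t_min), 1))
```

Every entry in the table turns out to be a whole number of 45-ns cycles, which is what one would expect of a memory system synchronized to the CPU clock.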
In a software-intensive design environment, the production of actual hardware is deferred somewhat in favor of design stability, resulting in a slightly longer soft-design period. The delay in hardware availability, however, is more than balanced by the stability of the hardware prototypes, which can then be accelerated through the evaluation and qualification-testing phases. The design schedule recovers during these later phases, and substantial cost savings are realized because fewer engineering changes are made and stable manufacturing can begin quickly.
CPU Design Overview
The VAX 8800 family of designs was structured around the functional elements, or "boxes," of the system. The CPU, memory, I/O, and bus subsystems were all matched to provide the necessary system balance. One simple model is to treat performance as a function of two variables: the instruction execution rate, and the amount of "work" each instruction can perform. The design of the VAX 8800 family focused on what we call the "short tick" approach to achieve the necessary, sustained performance.
In this approach, the instruction and data streams are kept simple and are executed quickly. Any design trade-offs were resolved in favor of speed and simplicity, thus reducing design complexity. The use of high-speed custom and semicustom VLSI components combined with several new internal bus architectures resulted in a family of processors with a 45-nanosecond (ns) cycle time. All models employ a five-stage instruction execution pipeline, integral floating point acceleration (F, D, G, H formats), and the VAXBI bus as the primary I/O subsystem. The extensive use of microcode controls with minimal hardware assist augments current performance while providing flexibility for future enhancements. The block diagram in Figure 1 (using the VAX 8700 and VAX 8800 systems) illustrates the key functional elements common to the VAX 8800 family design.
Technology
The raw speed, off-chip drive capabilities, and availability of bipolar emitter-coupled logic (ECL) components provided the most straightforward means of achieving the desired performance of the VAX 8800 family. Most logic is implemented in 1200-gate ECL arrays. Custom logic chips designed by Digital provide further performance gains for floating point operations and general-purpose registers. The cache is implemented in 10-ns and 15-ns ECL RAMs. Nine-layer, controlled-impedance CPU logic modules and a 22-layer, controlled-impedance CPU backplane were developed to meet the signal-integrity and signal-propagation requirements crucial to an ECL design. Other multilayer backplanes were designed for the private memory array bus and I/O subsystems.

[Figure 1  8700/8800 Block Diagram. Blocks shown: ECC memory; console; VAX processor (standard VAX 8700); VAX processor (upgrade, VAX 8800); high-speed memory interconnect bus (NMI); bus interface and optional bus interface; four VAXBI I/O buses (standard 8700/8800, standard 8800, and two optional 8700/8800).]

An innovative scheme of bus bars and ribbon straps routes the appropriate power to each of the backplanes, minimizing cable management problems for system power. The eight CPU logic modules, all memory arrays, and all I/O controllers attach to their respective backplanes by means of zero insertion force (ZIF) connectors, which improve our ability to manufacture and service the system. Figure 2 shows the two different module types (CPU and VAXBI) used in the VAX 8800 family.

[Figure 2  Typical CPU and I/O Modules]

An extensive environmental monitoring subsystem, called the EMM, has been implemented throughout the system. The EMM constantly monitors current fluctuations, air flows, and temperature variations, providing warnings at the system console. The EMM can automatically power down the system in the event that safe operating limits are violated.

CPU Subsystems
The designs of the CPUs in the VAX 8800 family are partitioned along the logical functions performed within each processor. There are four logical boxes: the instruction unit (I Box), the cache (C Box), the execution unit (E Box), and the memory subsystem (M Box). Each processor contains these functional units and their related buses. Five buses are implemented within each CPU: the cache/ALU bypass bus, the cache data bus, the instruction-buffer data bus, the virtual address bus, and the write data bus. Figure 3 is a block diagram of the processor configuration.

[Figure 3  Block diagram of the processor configuration. Labels recovered: console subsystem interface; visibility bus; I Box; IBD bus; E Box; C Box; cache data bus; memory interconnect interface; high-speed memory interconnect bus (NMI); NBIA adapter, to NBIB adapters.]

[Figure  Box Block Diagram. Labels recovered from this and the adjacent C Box diagram: physical address; interrupt pending; translation buffer; tag store; writable control store; microsequencer; microword; virtual address; decoder control; mux; branch; interrupt logic; file address; instruction decoder; control.]

buffer and the cache data store to be free to process other processor requests until the requested data arrives from memory. A block diagram of the C Box is shown in Figure 6.

[Figure 6  C Box Block Diagram]

The E Box
The E Box receives data from the I Box and the C Box, processes that data, and returns it to the C Box. The E Box performs five primary functions required by the processor.
• Handles all arithmetic, logical, and bit-shift operations
• Maintains the program counter and general registers
• Maintains the processor registers
• Controls data transfers between the C Box, the I Box, and the clock-module registers
• Provides condition-code information to the I Box microsequencer

The major elements of the E Box, located physically on the data-slice modules and the shifter module, consist of a register file, a data file, the program-counter logic, the main ALU, and a shifter. The logic of the E Box includes integral floating point operations that are optimized and a 64-bit multiplier (implemented in custom designed VLSI chips) that augments the speed of both integer and floating point multiplication. Figure 7 is a block diagram of the E Box.

[Figure 7  E Box Block Diagram. Labels recovered: write data bus (to C Box); cache data bus (from C Box); instruction buffer data bus (from I Box); virtual address bus; cache/ALU bypass bus; latch; slow data file; register file; program counter; arithmetic and logic unit; parity check; shifter; floating point; multiplier.]

The M Box
The M Box, the memory subsystem, consists of memory control logic, memory arrays, and a dedicated memory array bus that provides a usable data rate of over 50MB per second to the memory subsystem. The control logic optimizes multiple memory read and write operations, implements three-way interleaving, and buffers memory transactions for optimum data movement. The dedicated memory array bus, coupled with the memory control logic, effectively off-loads the NMI bus, providing balanced bus access and loads. The interleaving algorithms are based upon array boundaries, making the memory control logic technology independent. The result is that as increasingly dense memory arrays become available, few if any controller modifications will be required.
The error checking and correction (ECC) is built around 7 check bits for every 32 bits of data. This protocol provides automatic single-bit correction and double-bit detection.
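
The figure of 7 check bits per 32 data bits matches what a single-error-correcting, double-error-detecting (SECDED) code requires. The sketch below is an illustration under stated assumptions: the article does not say which code the memory controller uses, so an extended Hamming code is assumed here, and all function names and the bit layout are hypothetical.

```python
# A minimal SECDED sketch using an extended Hamming code (an assumption;
# the article does not name the code used by the VAX 8800 memory controller).

DATA_BITS = 32


def check_bits_needed(data_bits):
    """Hamming check bits r with 2**r >= data_bits + r + 1, plus one overall parity bit."""
    r = 0
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r + 1


HAMMING_R = check_bits_needed(DATA_BITS) - 1          # 6 Hamming check bits
CODE_LEN = DATA_BITS + HAMMING_R                      # positions 1..38
PARITY_POS = [1 << i for i in range(HAMMING_R)]       # 1, 2, 4, 8, 16, 32
DATA_POS = [p for p in range(1, CODE_LEN + 1) if p not in PARITY_POS]


def encode(data):
    """Return a list of 39 bits; index 0 holds the overall parity bit."""
    word = [0] * (CODE_LEN + 1)
    for i, pos in enumerate(DATA_POS):
        word[pos] = (data >> i) & 1
    for p in PARITY_POS:
        parity = 0
        for pos in range(1, CODE_LEN + 1):
            if pos & p and pos != p:
                parity ^= word[pos]
        word[p] = parity
    word[0] = sum(word[1:]) & 1
    return word


def decode(word):
    """Return (data, status); data is None when the error cannot be corrected."""
    word = list(word)
    syndrome = 0
    for pos in range(1, CODE_LEN + 1):
        if word[pos]:
            syndrome ^= pos
    overall_odd = sum(word) & 1
    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd:
        if syndrome > CODE_LEN:
            return None, "uncorrectable"
        word[syndrome] ^= 1        # syndrome 0 means the overall parity bit itself
        status = "corrected"
    else:
        return None, "double-bit error detected"
    data = 0
    for i, pos in enumerate(DATA_POS):
        data |= word[pos] << i
    return data, status
```

With this layout, flipping any single bit of an encoded word is repaired by `decode`, while flipping any two bits is reported but not repaired, which is the behavior the text describes.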
In the VAX 8800 multiprocessor, all memory is fully sharable. Current systems in the VAX 8800 family are offered with 16MB per memory array, giving the VAX 8700 and VAX 8800 systems a maximum memory capacity of 128MB, and the VAX 8500 and VAX 8550 systems a maximum of 80MB. Figure 8 is a block diagram of the M Box.

[Figure 8  M Box Block Diagram. Labels recovered: instruction box; high-speed memory interconnect bus (NMI); power subsystem; memory control.]

The Clock Subsystem
The clock subsystem generates, controls, and distributes timing signals to all the components of the processor system. The clock subsystem contains the console interface, an oscillator, a phase generator, clock-control logic circuits, and the logic circuits for clock signal distribution.

The VAX 8800 family implements a two-phase, nonoverlapped clock subsystem operating at a cycle time of 45 ns. A stable, high-frequency oscillator (120 MHz nominal with variable output), coupled with a phase generator, provides the signal. The implementation of a two-phase design with matched signal-length distribution throughout the CPU is most efficient for the pipelined, latch-based design of the VAX 8800 family. This design avoids the inefficiencies associated with the compressed signal-assertion times resulting from approaches that specify minimum delays for given logic elements.

A-clock and B-clock signals are distributed to alternate latches in a given logic stream. All data transfers occur between latches clocked by different phases to assure a race-free design. The essence of fast-processor design is managing and controlling skew. In this regard, signal propagation and distribution presented significant challenges in the areas of controlled etch lengths, controlled impedance, routing, and placement. To assure a stable, reliable design, all design activity was predicated on worst-case design rules rather than using the typical-case limits.
The NMI Bus
Integral to the design of this family of processors was the development of a high-speed memory interconnect bus called the NMI bus. This bus, analogous to the synchronous backplane interconnect (SBI bus) in the VAX-11/780 CPU, links the subsystems for CPU logic, central memory, and I/O. The NMI bus is a 32-bit synchronous bus, physically implemented within the 22-layer backplane. This bus provides the control and datapath functions as well as the distribution of clock signals for the VAX 8800 family.

One fundamental problem in the design of high-performance systems revolves around balancing the bus access needed at any given instant with the raw bandwidth available. To provide the correct balance, the NMI bus was implemented as a pended (vs. interlocked) bus, resulting in very high bus-access availability.

Since memory is the critical resource in sustained operations, the NMI bus uses a modified round-robin arbitration that gives the memory a higher priority when there is contention for the bus. This arbitration priority eliminates any lock-step conditions and also provides for recovery of states and data in the event of preemption. This high bus-access capability, coupled with usable data rates of up to 60MB per second, provides the necessary balance to support CPU, memory, and I/O transactions. The inclusion of write buffers within each CPU, coupled with the large cache size, effectively reduces the number of transactions presented to the bus. Measurements on a VAX 8800 system in our Engineering VAXcluster environment have indicated that the NMI bus is rarely busy more than 50 percent of the time; the CPUs use approximately 25 percent of the available access time and bandwidth. Other applications may see somewhat different ratios.
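
The idea of a round-robin arbiter modified to favor memory can be sketched in a few lines. This is only a toy model under assumptions: the article does not give the NMI arbitration rules in detail, so the class name, node names, and the exact policy (memory always wins when it requests, everyone else rotates) are hypothetical.

```python
# A toy "modified round-robin" arbiter: one priority node (memory) wins any
# contention; the remaining nodes take turns. Names and policy details are
# illustrative assumptions, not the documented NMI protocol.

class ModifiedRoundRobinArbiter:
    def __init__(self, nodes, priority_node):
        self.priority_node = priority_node
        self.nodes = [n for n in nodes if n != priority_node]
        self.next_index = 0  # rotation pointer among the non-priority nodes

    def grant(self, requests):
        """Return the node granted the bus this cycle, or None if nobody asked."""
        if self.priority_node in requests:
            return self.priority_node
        count = len(self.nodes)
        for offset in range(count):
            i = (self.next_index + offset) % count
            if self.nodes[i] in requests:
                self.next_index = (i + 1) % count  # winner moves to the back
                return self.nodes[i]
        return None


arbiter = ModifiedRoundRobinArbiter(["cpu0", "cpu1", "memory", "io"], "memory")
grants = [
    arbiter.grant({"cpu0", "cpu1", "io"}),
    arbiter.grant({"cpu0", "cpu1", "io"}),
    arbiter.grant({"cpu0", "memory"}),
    arbiter.grant({"cpu0", "cpu1", "io"}),
]
```

In this model a memory grant does not advance the rotation pointer, so the non-memory nodes keep their place in line after memory has been served.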
VAXBI Bus
The VAX 8800 family uses the VAX bus interconnect, called the VAXBI bus, for the I/O subsystem in order to provide adequate balance for the CPU performance. The VAXBI bus, a 32-bit clocked bus with distributed arbitration, is capable of usable data rates in the VAX 8800 family up to 8MB per second, depending upon word size and application. Custom logic on each interface module provides all bus protocols, as well as integral data-integrity features, including master transmit and command acknowledge.

The VAX 8800 and VAX 8700 systems can be configured with up to four VAXBI channels, whereas the VAX 8550 and VAX 8500 systems accept up to two. Therefore, fully configured VAX 8800 and VAX 8700 systems can support aggregate I/O bandwidths up to 30MB per second. Similarly, fully configured VAX 8550 and VAX 8500 systems can support aggregate bandwidths up to 16MB per second. Each VAXBI bus can support up to 16 nodes, or logical addresses, which connect to any combination of networks, intelligent and nonintelligent devices, DMA devices, and VAXcluster systems, as well as providing for connection to existing UNIBUS-based devices.

All of Digital's network protocols interface directly to the VAXBI on the VAX 8800 family. Thus, VAXcluster, Ethernet, DECnet and DSA (Digital Storage Architecture) devices are all ported directly to this high-performance I/O subsystem.

Reliability
Reliability was one of the primary goals of the VAX 8800 design. Numerous features were implemented that more than doubled the basic computing kernel availability compared to the VAX-11/780 system. Some of the key functions include
• Environmental and power monitors that query the system and maintain safe system operating levels
• Automatic verification of hardware, firmware, and software revision compatibility
• Electrically keyed modules and module slots that prevent improper installation and damage to the modules or the system
• Automatic electrostatic discharge (ESD) protection of modules during installation and removal
• ECC on main memory
• Parity checking on internal RAMs
• Bus protocol checking for the memory interconnect
• Timing and voltage margining
• Remote diagnostics capability
• Dual-to-single processor reconfiguration (VAX 8800 system only)
Diagnostic Development
Similar to the hardware development, the design methodology for the diagnostics depended very heavily on simulation. Almost all the diagnostic tests were debugged on behavioral and structural models of the design before the initial prototype was powered up. There were three major benefits of this methodology.
1. Microdiagnostic and macrodiagnostic tests were useful for design verification testing.
2. Test vectors for automatic test equipment (module test) were extracted from the simulation data base.
3. A comprehensive diagnostic package was available shortly after the prototype was powered up.
The diagnostic for the VAX 8800 family consists of tests specific to this processor and generic to the VAX architecture. The processor is tested primarily with microdiagnostics. These tests execute from the processor's writable control store and are governed by the console.

VAX generic diagnostics are included to test the UNIBUS and VAXBI adapters and options. All the diagnostic code fits on the console's Winchester disk. When the system is powered up, a subset of the microdiagnostic tests is executed.

Balanced Systems
The VAX 8800 design effort delivered four different systems, the 8800, the 8700, the 8550, and the 8500, all reflecting the overriding concept of balanced system design. While the CPUs themselves demonstrate excellent internal balance between their logical and functional subsystems, they are also balanced members of the extended system that can span much larger physical distances. Monolithic or isolated computing resources are no longer capable of accessing, manipulating, and distributing the volumes of information needed for complex or extended solutions. In this light, the VAX 8800 family should be viewed in the context of a balanced network. The movement of data is governed by speed and distance. An inverse relationship exists as shown in Figure 9. The VAX 8800 family fits on the top bound of the bandwidth range throughout the distance function.

[Figure 9  Bandwidth versus Distance. Horizontal axis: distance in meters (log scale; 10, 100, 1000). Vertical axis: bandwidth. A technology scale runs from complex to simple.]

Summary
The VAX 8800 family of products merges fast instruction-execution rates, large physical memories, large high-speed data caches, VAXBI I/O channels, pipelining, and balanced internal-bus architectures to provide high system-applications throughput. Spanning an applications throughput range that is from 3 to 12 times that of the VAX-11/780 system, the VAX 8500, VAX 8550, VAX 8700, and VAX 8800 systems are matched to the network and applications strategies offered by Digital Equipment Corporation.

References
1. D. Bak, "The Impact of VAX 8800 Design Methodology on CAD Development," Digital Technical Journal (February 1987, this issue): 129-135.
2. VAX Hardware Handbook (Maynard: Digital Equipment Corporation, Order No. EB-21710-20, 1982).

Sudhindra N. Mishra
The VAX 8800 Microarchitecture
The VAX 8800 processor has a simple but efficient microarchitecture. Its pipelined micromachine has a one-cycle next-address loop and four-cycle latencies for both microbranches and microtraps. Instruction prefetch and decode are done in parallel with microcode execution. The instruction buffer is a bit-sliced, four-longword circular queue. The decoder is primarily a RAM-based table. For special events, hardwired logic is used for decoding. A bit-sliced microsequencer provides up to 32-way conditional microbranching, using a collection of about 80 branch conditions. A hardware microstack provides up to 15 levels of nested subroutine calls and returns. Microtrap conditions are prioritized over 16 levels, and microtraps are chained, not nested.
The term "microarchitecture" means the specification or description of the interrelationships between the parts of the micromachine that implements the instruction set processor. In terms of this definition, the microarchitecture of the VAX 8800 processor will be described by elucidating the organization of its micromachine and the interaction between its components.

Figure 1 shows a simple three-stage state machine model of an abstract micromachine appropriate for implementing the control unit of a typical von Neumann processor. Figure 2 shows a block diagram depicting the essential elements of such a micromachine. This state machine is capable of executing microcode routines to implement an instruction set processor. In such a system, every macroinstruction is decoded by the hardware to produce the starting addresses of a small set of microprograms, which execute sequentially to produce the desired effect. Barring some exceptions, a microprogram or microcode routine can execute rather independently in the sense that each microinstruction produces the address of the next microinstruction. The last microinstruction causes the selection of an external address, such as one produced by the decoder, to start the execution of another routine.
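
The control flow described above, in which each microinstruction names its successor and the last one hands control back to the decoder, can be sketched as a small interpreter loop. Everything concrete in this sketch is hypothetical: the opcodes, microaddresses, and action strings are invented for illustration and do not correspond to real VAX 8800 microcode.

```python
# A toy abstract micromachine. The control-store contents, addresses, and
# opcode names below are made up for illustration only.

from collections import namedtuple

MicroInstr = namedtuple("MicroInstr", "action next_addr")

DECODE = None  # sentinel: take the next address from the decoder

CONTROL_STORE = {
    0x10: MicroInstr("read operands", 0x11),
    0x11: MicroInstr("add", 0x12),
    0x12: MicroInstr("write result", DECODE),
    0x20: MicroInstr("read operand", 0x21),
    0x21: MicroInstr("negate and write result", DECODE),
}

# The decoder maps each macroinstruction to the start of its microroutine.
DECODER = {"ADD": 0x10, "NEG": 0x20}


def run(macro_program):
    """Execute each macroinstruction's microroutine; return a trace of steps."""
    trace = []
    for opcode in macro_program:
        addr = DECODER[opcode]              # external address from the decoder
        while addr is not DECODE:
            micro = CONTROL_STORE[addr]     # fetch the microinstruction
            trace.append((opcode, hex(addr), micro.action))  # "interpret" it
            addr = micro.next_addr          # microinstruction supplies next address
    return trace


trace = run(["ADD", "NEG"])
```

The only point at which the loop consults anything outside the control store is when a routine ends, which mirrors the "rather independent" execution the text describes.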
In Digital's vernacular, the I Box is the logical partition containing the instruction-processing hardware. Figure 3 shows a block diagram of the VAX 8800 I Box with the basic elements of its micromachine.

[Figure 1  State-machine Model of an Abstract Micromachine. Recovered state labels: fetch microinstruction; interpret microinstruction.]

From the early IBM and CDC computers to the modern CRAY machines, computer designers have used a technique called "pipelining" to obtain higher performance. Pipelining overlaps the execution of instructions in time; that is, several instructions can be executing at the same time. This technique provides a higher throughput when the pipeline is fully loaded, but there is a cost involved. If the pipeline is broken, extra processing is required to refill it. Moreover, if any active instructions have partially executed, information about their states may have to be saved to continue processing after an abrupt interruption.
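
The trade-off between throughput and refill cost can be made concrete with a simple cycle-count model. This is a textbook idealization rather than a description of the VAX 8800: it assumes one operation enters the pipeline per cycle and that every break forces a complete refill. The function names and the example numbers are illustrative.

```python
# Idealized pipeline timing model (an assumption-laden sketch, not measured data).

def unpipelined_cycles(n_ops, stages):
    """Each operation passes through every stage before the next one starts."""
    return n_ops * stages


def pipelined_cycles(n_ops, stages, breaks=0):
    """Fill the pipe once, then retire one operation per cycle.

    Each break is assumed to cost (stages - 1) extra cycles to refill the pipe.
    """
    if n_ops == 0:
        return 0
    return stages + (n_ops - 1) + breaks * (stages - 1)


# Example: 100 operations through a five-stage pipeline.
smooth = pipelined_cycles(100, 5)             # no breaks
bumpy = pipelined_cycles(100, 5, breaks=10)   # ten breaks
serial = unpipelined_cycles(100, 5)
```

In this model, lengthening the pipeline raises the penalty per break in direct proportion, which is the overhead the discussion of pipeline length is concerned with.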
The degree of pipelining varies from one machine to another depending upon the design choices and trade-offs made by the system architects. A metaphor often used to indicate the degree of pipelining is the length of the pipeline stated as the number of stages, for example, a three-stage pipeline or a four-stage pipeline. The number of stages conveys the extent of time overlap for typical operations in a computer. In a machine with a pipelined microarchitecture, these operations are executions of microinstructions. A higher degree of pipelining makes short cycle times possible, thus leading to a higher throughput when the pipeline is fully loaded. But longer pipelines entail increased overhead in terms of their ability to resume operations after a break in the pipeline caused by any abnormal event. Therefore, an architect's goal is to design the system so that the pipeline remains loaded most of the time and recovery from a broken pipeline is not too inefficient. The VAX 8800 CPU is a prime example of a processor with a pipelined microarchitecture.

[Figure 2  Block Diagram of an Abstract Micromachine. Labels recovered: external addresses and controls; microaddress generation logic; microaddress latch or register; control store; microdata latch or register; microdata interpretation logic; control signals.]

[Figure 3  VAX 8800 I Box. Labels recovered: cache; instruction buffer; IB data; opcode, specifier, specifier number; decoder; decoder control; PC increment to E Box; branch conditions, traps, interrupts; microsequencer; microsequencer control; control store; microword; control to E Box; control to C Box.]

System Considerations
The design philosophy of the VAX 8800 processor was to optimize the hardware so that it would execute the microcode efficiently. A large control store (144 bits by 16,000 entries) holds the entire microcode. Using fairly generalized data paths, the microcode executes the logic of the instructions. However, special hardware is used to speed up performance in critical areas. The processor logic is primarily designed with latches, which are clocked with a globally distributed, two-phase, nonoverlapping clocking scheme. The two clock phases are called the A-clock and the B-clock. A typical example of logic design, based on the above approach, is shown in Figure 4.

[Figure 4  A Typical Section of the VAX 8800. Legend: CL = combinatorial logic.]

It is apparent from Figure 4 that the data flow
in such a logic system occurs through the perpetual
data transfers between the latches connected to the
A-clock and those connected to the B-clock. Each
data transfer may be considered atomic in the sense
of hardware operation. A microoperation may be
envisioned as a logical operation that is atomic in
terms of the execution of a microinstruction, such
as a register read, a register write, or an ALU
function. Hence a microoperation constitutes one or
more data transfers, and the microinstruction
execution simply constitutes a time sequence of
microoperations, as shown in Figure 5.
Figure 5  Example of a Microinstruction (over successive A and B clock phases: read registers; ALU function, ADD; store result in register)
In high-performance machines, like those in the
VAX family, there is usually a mismatch between CPU
cycle times and memory-access times. For example,
consider an ADD instruction. If the operands are in
registers, the ADD can be done rather quickly. But
if one of the operands has to be read out of memory,
the ADD cannot be performed until the desired data
arrives from memory. Most VAX processors have a fast
cache memory, tightly bound to the processor's
arithmetic units, to alleviate the memory-latency
problem. In the case of a cache miss on a required
datum, however, the only alternative for a von
Neumann processor is to wait. A processor in such a
state is said to be "stalled." Under such
conditions, the state of the processor must be
"frozen" until the cause of the stall no longer
persists and the stall is broken. The two-phase
clocking scheme provides a convenient way to
implement stalls, in which one of the clock phases
(the A-clock in the 8800) may be blocked. Stalls are
controlled by the cache through a special hardware
signal distributed globally to block the A-clock.
Thus, the processor logic contains two flavors of
A-latches:
• Stalled A-latches, which are affected by a stall
• Unstalled A-latches, which are not affected by a stall
The micromachine is implemented only with stalled
A-latches. Hence the effect of stalls on the
execution of the micromachine is largely
transparent.
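The difference between the two latch flavors can be illustrated with a small software model. The sketch below is only an illustration under stated assumptions: the class name ALatch, the function run, and the "+1" logic between latches are hypothetical and do not describe the actual 8800 circuitry. It shows a stalled A-latch holding its value while the stall signal is asserted, while an unstalled A-latch keeps loading.

```python
from dataclasses import dataclass


@dataclass
class ALatch:
    """Toy model of a latch clocked on the A phase (hypothetical name)."""
    value: int = 0
    stalled_type: bool = True  # True: "stalled" flavor, False: "unstalled"

    def clock_a(self, new_value: int, stall: bool) -> None:
        # A stalled-flavor latch sees no A-clock while the stall is asserted.
        if stall and self.stalled_type:
            return
        self.value = new_value


def run(stall_pattern):
    """Drive one latch of each flavor through A/B cycles; return both values."""
    stalled = ALatch(stalled_type=True)
    unstalled = ALatch(stalled_type=False)
    b_stalled = b_unstalled = 0  # B-latches; the B-clock is never blocked
    for stall in stall_pattern:
        # A phase: each A-latch loads (B-latch output + 1), an arbitrary
        # stand-in for the combinatorial logic between latches.
        stalled.clock_a(b_stalled + 1, stall)
        unstalled.clock_a(b_unstalled + 1, stall)
        # B phase: B-latches capture the A-latch outputs.
        b_stalled, b_unstalled = stalled.value, unstalled.value
    return stalled.value, unstalled.value
```

With the pattern [False, True, True, False], the stalled latch advances only in the two cycles without a stall, whereas the unstalled latch advances in all four.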
A mechanism is also required to deal with
hardware exceptions when the results of the
execution of a microinstruction have to be undone.
In a pipelined microarchitecture, several
microinstructions may have partially executed when
an exception condition is detected. In that case it
is necessary to undo the effects of all those
microinstructions. The most common technique used
to deal with such situations is called a microtrap.
Since microtraps relate closely to the micromachine
execution, every processor has its own scheme to
implement them. In every case, however, microtraps
must permit the "rollback" of some number of
microinstructions because the detection of a trap
condition usually occurs quite late with respect to
microinstruction execution.
In the VAX 8800 processor, microtraps are
implemented so that the offending microinstruction
is allowed to complete, but subsequent
microinstructions in the pipeline are blocked. Since
the offending microinstruction may have caused some
undesirable results, the trap-handler microcode must
fix the problem. Depending on the particular
situation, either the microinstruction execution
flow is resumed from the blocked state or a new flow
is originated.
System Buses and Datapath
Figure 6 is a block diagram of the VAX 8800 CPU
datapath, showing all the major buses. The hardware
organization of the CPU provides a two-cycle
operation between the cache and the ALU, as shown.
The processor has several functional units in
addition to the main ALU. These additional units
perform high-speed multiply and divide, shifting,
and floating-point arithmetic operations.

There are several possibilities for selecting
inputs to these functional units. For operations
involving two inputs, both can be presented
simultaneously onto the two legs of the main ALU as
well as most other functional units. The results
from these functional units are sent on the W bus
for writing to either the multiport register file
(MPR) or the cache. However, since the write
actually occurs in the following cycle, the bypass
bus provides a shortcut (saving a cycle) in case the
write datum is read by the very next
microinstruction.

Figure 6  VAX 8800 Datapath (A, B = A and B phases of the two-phase clock)
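The role of the bypass bus can be shown with a toy register file whose writes land one cycle after they are issued. This is a minimal sketch, not a description of the MPR hardware: the class RegisterFile, its method names, and the 16-register size are assumptions made for illustration. With the bypass enabled, a read in the very next cycle sees the datum that is still in flight; without it, the reader would see the old value or have to wait a cycle.

```python
class RegisterFile:
    """Toy register file with a one-cycle write delay (hypothetical model)."""

    def __init__(self, size=16, bypass=True):
        self.regs = [0] * size
        self.issued = None     # (index, value) issued in the current cycle
        self.in_flight = None  # (index, value) being written this cycle
        self.bypass = bypass

    def write(self, index, value):
        # The microinstruction issues the write; it lands a cycle later.
        self.issued = (index, value)

    def read(self, index):
        # Bypass: forward the in-flight datum instead of the stale register.
        if self.bypass and self.in_flight and self.in_flight[0] == index:
            return self.in_flight[1]
        return self.regs[index]

    def end_cycle(self):
        # The write that was in flight during this cycle now lands.
        if self.in_flight:
            index, value = self.in_flight
            self.regs[index] = value
        self.in_flight, self.issued = self.issued, None


def read_next_cycle(bypass):
    """Write register 3, then read it in the very next cycle."""
    rf = RegisterFile(bypass=bypass)
    rf.write(3, 42)
    rf.end_cycle()
    return rf.read(3)
```

Here read_next_cycle(True) returns the new datum, while read_next_cycle(False) still returns the old register contents.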
The virtual address bus carries the virtual
address of any data-stream (d-stream) references,
whereas the program-counter bus has the current
program counter (PC). The instruction buffer data
bus provides the instruction-stream (i-stream) data.
The instructions and data from the cache are
returned on the cache data bus. However, a cache
data bypass bus provides a direct path to the
functional units for the data returned by the cache,
in case the processor is or will be stalled for that
data.
Microinstruction Pipeline

The top part of Figure 7 shows the execution of
microinstructions as a function of time in a
nonpipelined microarchitecture; the bottom depicts
that in a pipelined microarchitecture.

The basic data flow in a processor occurs in the
following sequence:
1. Read the register operands into a functional
   unit, such as the ALU.
2. Perform some ALU function.
3. Write the results into the destination register.
4. If there is a cache, start a cache operation at
   approximately the same time as a register write
   since memory references are buffered through
   special-purpose memory data registers (MDRs or
   MDs) in most high-performance processors.

Figure 7  Microinstruction Execution (top: microinstruction execution in a nonpipelined micromachine; bottom: microinstruction execution in a pipelined micromachine)

Figure 5 shows that the sequence above occurs in
a natural order in time as a consequence of the
microinstruction execution. With pipelined
microarchitectures, a time reference is needed to
correlate the microoperations performed by various
microinstructions with respect to each other. The
notion of canonical times is very convenient for
this purpose. The clock ticks of the reference
microinstruction may be labeled with a monotonically
increasing set of T numbers starting at T0 as shown
in Figure 8. These T numbers are called the
canonical times of a particular microinstruction.
The microoperation labeled T0 marks the start of a
microinstruction execution cycle. Figure 8 shows the
basic microoperations of a VAX 8800 microinstruction
with their canonical times.

We shall use the simple model of a micromachine
in Figure 1 to describe the VAX 8800
microinstruction format as a sequence of basic
microoperations like those in Figure 8. The first
stage in the microinstruction execution cycle is the
microaddress fetch. The microinstruction execution
cycle begins with a decoder operation. The decoder
produces the starting microaddress for every new
microinstruction sequence and presents it to the
microsequencer. The decoder determines that address
on the basis of the contents and current state of
the instruction buffer (IB). Each microinstruction
specifies to the microsequencer whether or not to
accept the decoder's microaddress. If not, the
microinstruction must either specify the address of
the next microinstruction directly, as a part of the
microword, or indicate an alternate source for the
address within the microsequencer. Since the
decoder's operation is concurrent with the
microsequencer's, the decoder always has a starting
microaddress for the microsequencer. It is
convenient to think of this IB-decoder concurrency
as a "hidden decoder cycle."

Figure 8  Canonical Times of a VAX 8800 Microinstruction

The next stage in the microinstruction execution
sequence is the fetch of the microinstruction,
performed by a look-up in the control store. In the
VAX 8800 system, the microaddress is pipelined, not
the microdata. Consequently, the microdata from a
segmented control store appears at the appropriate
time for the three basic operations to occur in the
indicated order. The microdata looked up causes a
sequence in which the register read occurs between
the times T5 and T6, the ALU function between T6 and
T8, and the register write between T8 and T10. The
cache operations also occur between the times T8 and
T10. The section beyond T10 denotes cache activity
with respect to the memory if there is a cache miss.
(The cache/memory interface is controlled by an
independent micromachine.) During every cycle, a
microinstruction produces the address of the next
microinstruction, which is then executed. Figure 9
depicts the generic microinstruction pipeline of the
VAX 8800 processor.

Figure 9  Microinstruction Pipeline of the VAX 8800 CPU
LUK = control store look-up (control store 0 segment); XOS = board crossing segment (overlaps control store 1 look-up); RD = register read (overlaps control store 2 segment look-up); ALU = ALU function; WR = register write; CACH = cache operation; DECODER = decoder operation
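The overlap implied by the canonical times can be made concrete with a short calculation. The sketch below uses the stage windows quoted in the text (register read T5 to T6, ALU function T6 to T8, register write and cache operation T8 to T10). Two points are assumptions rather than statements from the article: each window is treated as half-open, and successive microinstructions are taken to start one cycle, that is, two T ticks (one A phase and one B phase), apart. The names STAGES, TICKS_PER_CYCLE, and active are hypothetical.

```python
# Stage windows in canonical T ticks, as quoted in the text.
# Treated here as half-open intervals [start, end) (an assumption).
STAGES = {"RD": (5, 6), "ALU": (6, 8), "WR": (8, 10), "CACH": (8, 10)}

# Assumption: a new microinstruction starts every cycle, and one cycle
# spans two T ticks (an A phase followed by a B phase).
TICKS_PER_CYCLE = 2


def active(tick, count):
    """List (microinstruction index, stage) pairs active at a given tick.

    The tick is measured on the time scale of microinstruction 0;
    microinstruction k starts k cycles later.
    """
    result = []
    for k in range(count):
        local = tick - k * TICKS_PER_CYCLE  # canonical time seen by k
        for name, (start, end) in STAGES.items():
            if start <= local < end:
                result.append((k, name))
    return result
```

Under these assumptions, at tick 9 of the first microinstruction three microinstructions are in flight at once: the first is writing its result and accessing the cache, the second is in its ALU function, and the third is reading its registers.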
Microbranch Latency

One consequence of pipelining is that any
intervening microinstructions must be spaced between
the instruction that produces a branch condition and
the instruction that can branch on it due to latency
in the development of the branch condition.
Obviously, the execution of the intervening
microinstructions must be independent of the branch.
Usually, microcoders are able to code some useful
operations during the inevitable wait. Otherwise,
the intervening instructions must be NOPs (no
operation). Figure 10 shows the microbranch latency
in the VAX 8800 CPU.

Microtrap Latency

A hardware exception causes a microtrap. However,
the trap conditions, like the branch conditions, may
develop after some execution cycles have been
completed. Once again there must be some intervening
microinstructions between the trap-causing
microinstruction and the trap-handling routine.
Moreover, the state of the micromachine must be
saved so that the current execution can be resumed
in such a way that the intervening execution of the
trap routine appears to be transparent. This state
consists primarily of microbranch conditions that
result from the execution of microinstructions in
the pipeline since those could influence subsequent
microaddresses and hence the execution sequence.
Therefore, on interruption of the current sequence
by the trap routine, the branch conditions from the
earlier execution are essential to reproduce the
same sequence.

To simplify the hardware design, all early traps
are delayed to a fixed canonical time (T10). Some
trap conditions, however, develop later than the
canonical time with the consequence that those traps
cannot be returned from. In such cases the microcode
must roll back the state to the beginning, which
causes a reexecution of the entire macroinstruction.

Figure 11 shows a sequence in which a
microinstruction at address T provokes a microtrap.
At the earliest, the trap-handling routine can begin
at microinstruction X. Meanwhile, microinstructions
U, V, and W follow T, quite unaware of the impending
trap. In fact, they are in partial execution when
the trap condition is detected. These
microinstructions are said to be in the trap shadow,
and they must be blocked from writing any registers,
thus making it appear as if they had never executed.
When control is returned from the trap-handling
routine, these trap shadow microinstructions are
reexecuted, continuing the sequence that would have
arisen had the trap not occurred.
Figure 10  Microbranch Latency

Figure 11  Microtrap Latency (microinstruction T causes a microtrap; U, V, and W are in the trap shadow)

Instruction Buffer and Decoder

The IB buffers the prefetched VAX i-stream
delivered by the cache and in turn delivers the
opcode and specifier to the decoder. The IB also
delivers the i-stream data to the execution unit,
the E Box. The decoder expects to receive the
current opcode and the current specifier byte.
Hence the IB saves the opcode for the duration of
the instruction execution and shifts the buffered
i-stream along to send each specifier in turn to the
decoder. The goal of the VAX 8800 decoder is to
produce a starting microaddress corresponding to the
opcode and the specifiers.

The sequence of microcode execution caused by the
decoder is first to process all the specifiers,
making all the operands available, and then to
execute the operation specified by the opcode. If an
instruction has no specifiers, the execution
microcode is initiated directly. In any case the
decoder always has a microaddress ahead of time for
the microsequencer. This microaddress is the
starting address of either a specifier routine or
the execution routine, based on the contents and the
state of the IB.
If at any time the IB does not contain enough
i-stream data for a successful decode, the decoder
will produce a special microaddress. The
microinstruction at that address is simply a NOP
that again requests the selection of the decoder's
address. The micromachine thus waits in a loop for
sufficient i-stream data to arrive in the IB so that
the decoder can again dispatch a useful
microaddress. This wait-loop state of the
micromachine is commonly referred to as the IB
stall, which is different from the stall described
earlier. Note that clocks to stalled A-latches are
not blocked for an IB stall. On the contrary, the
micromachine runs normally as does the rest of the
processor hardware. IB stalls may occur when the
instruction prefetch pipeline is broken due to
macroinstruction branches. This condition requires
the current contents of the IB to be discarded and
new i-stream data to be prefetched into the IB.
The VAX 8800 IB is a four-longword circular
queue, which is usually long enough to hold an
entire instruction. The data is consumed out of the
IB from the position pointed to by the read pointer.
However, new data could be written concurrently by
the cache at the position pointed to by the write
pointer. Whenever it has room, the IB is loaded by
the cache if the cache has no other higher priority
job to do. Occasionally, the IB becomes full (the
write pointer catches up with the read pointer), and
then it does not accept the datum from the cache. If
a datum is not accepted by the IB, the cache keeps
repeating the transfer until the datum is accepted.
Occasionally, the IB becomes empty if the cache is
busy doing other things and the decoder has consumed
all the data from the IB (the read pointer and the
write pointer point to the same location).
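The queue discipline described above can be sketched as a ring buffer with separate read and write pointers. This is an illustrative model only. It assumes byte-sized entries (four 32-bit longwords give 16 bytes) and uses an explicit count to tell a full queue from an empty one, since the pointers coincide in both cases; how the hardware distinguishes the two is not stated in the text. The class and method names are hypothetical.

```python
class InstructionBuffer:
    """Toy four-longword circular queue (hypothetical software model)."""

    SIZE = 16  # four longwords of four bytes each; byte entries assumed

    def __init__(self):
        self.data = bytearray(self.SIZE)
        self.read_ptr = 0
        self.write_ptr = 0
        self.count = 0  # assumption: disambiguates full from empty

    def load(self, byte):
        """Cache side. Returns False if the IB is full (datum refused)."""
        if self.count == self.SIZE:
            return False  # the cache would repeat the transfer later
        self.data[self.write_ptr] = byte
        self.write_ptr = (self.write_ptr + 1) % self.SIZE
        self.count += 1
        return True

    def consume(self):
        """Decoder side. Returns None if the IB is empty (an IB stall)."""
        if self.count == 0:
            return None
        byte = self.data[self.read_ptr]
        self.read_ptr = (self.read_ptr + 1) % self.SIZE
        self.count -= 1
        return byte

    def flush(self):
        """Discard the contents, as after a macroinstruction branch."""
        self.read_ptr = self.write_ptr = self.count = 0
```

A refused load leaves the queue unchanged, so the caller can simply retry it, which mirrors the cache repeating the transfer until the datum is accepted.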
The IB in the VAX 8800 family is implemented with
four identical gate arrays with 8-bit slices
designed to use a rather clever bit-scattering/
gathering scheme. The IB also contains logic to
extract and format i-stream data, making it
available to the E Box. A common silo holds the
opcode history for the duration of a
macroinstruction's execution, as well as for
recovery from microtraps. The VAX 8800 decoder is a
RAM-based look-up table for generating
microaddresses. In the case of special events,
however, hardware logic is provided for generating
special microaddresses, as shown in Figure 12, thus
bypassing the RAM look-up. The decoder also provides
controls for the IB state machine as well as some
other hardware assists.

Figure 12  VAX 8800 Decoder
Microsequencer
The state-machine responsible for generating the
next microaddress for a microinstruction sequence is
commonly called the microsequencer. As shown in
Figure 13, this state-machine is realized
collectively by the control store, the next
microaddress generation logic, and the microaddress
and microdata latches (or registers).

Figure 13  An Abstract Microsequencer
The goal of the VAX 8800 microsequencer is to
produce the address of the next microinstruction
during every cycle. Figure 14 depicts how the
microsequencer achieves this goal.
Each microinstruction may modify its next
microaddress field through a microbranch command to
produce the address of the target microinstruction.
Microbranch conditions are delivered by other
sections of the machine, such as the ALU. These
conditions are grouped together in ways convenient
for microprogramming so that multiway branches can
be taken. Microsubroutines can be called and
returned from by means of a hardware microPC stack.
Stalls cause the microsequencer state to be
frozen on a cycle boundary (i.e., the clocks on
microaddress and microdata latches are effectively
blocked). Microtraps allow the microcode to deal
with unusual events that would be too slow or
inconvenient to check normally with microbranches,
such as TB misses and address misalignments. The VAX
8800 processor does not permit traps to be nested.
Instead, traps are "chained," meaning that trap
routines and hardware trap priorities are carefully
arranged so that a second trap is taken only when
the first trap routine finishes. (Machine check
traps cannot be controlled in this way.)
Sources of Microaddresses
There are five sources for microaddresses:
• The decoder
• The next-address field in the microword
• The microstack upon returning from a subroutine
• The microPC silo for a saved microtrap
• The micromatch register for an address from the console
An address from the console is selected in
response to an explicit console request and takes
precedence over everything else. Addresses from the
silo are requeued in response to a trap-return
command. Addresses from the microstack are selected
in response to a subroutine-return command. A
decoder-generated address is selected whenever the
current sequence ends and a new specifier or
execution routine should begin. Normally, this
selection is caused by the assertion of a microword
bit in the very last microinstruction of the current
sequence. The next-address field is selected as the
default for normal sequencing. This field is also
used to provide an offset in case of subroutine
returns.
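The selection rules above amount to a small decision function. The sketch below is one possible reading, not the actual selection logic. The text states only that the console takes precedence over everything and that the next-address field is the default; the relative order in which the trap-return, subroutine-return, and end-of-sequence requests are tested here is an assumption, since a single microinstruction would presumably assert at most one of them. All names are hypothetical.

```python
from enum import Enum, auto


class Source(Enum):
    """The five microaddress sources listed in the text."""
    CONSOLE = auto()       # micromatch register
    SILO = auto()          # microPC silo, saved microtrap
    MICROSTACK = auto()    # subroutine return
    DECODER = auto()       # start of a specifier or execution routine
    NEXT_ADDRESS = auto()  # next-address field in the microword


def select_source(console_request=False, trap_return=False,
                  subroutine_return=False, end_of_sequence=False):
    """Pick the microaddress source for the next cycle."""
    if console_request:
        return Source.CONSOLE      # stated: precedence over everything
    if trap_return:
        return Source.SILO         # ordering below is an assumption
    if subroutine_return:
        return Source.MICROSTACK
    if end_of_sequence:
        return Source.DECODER
    return Source.NEXT_ADDRESS     # stated: default for normal sequencing
```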
Microbranching
In normal cases, part of the selected
microaddress can be modified according to the branch
conditions, that is, whenever the next-address field
is selected. A combination of two microword fields,
branch type and branch mask, selects the branch
conditions, which are then ORed into part of the
target microaddress. In the VAX 8800 system, the
microbranch logic is implemented with five identical
gate arrays, each of which generates a 3-bit slice
of the microaddress. One microaddress bit is branch
sensitive in each slice. This organization permits
up to 32-way branching. Branchings of 2, 4, 8, and
16 ways are also made possible by a separate mask
bit, called the branch mask, to every slice. This
bit is used to turn off the sensitivity to branch
conditions in a particular slice.
There are 16 basic recipes for conditional
branching in each slice. This arrangement of
slicing, masking, and branch-condition selection in
every slice requires that all the microbranch
conditions be organized into 5 groups of 16
conditions each. The branch conditions are
classified as either static or dynamic. Static
conditions, once captured, are available for
branching in any later cycle as long as those
conditions remain unchanged. Dynamic conditions are
asserted for just one cycle and must be branched on
in that cycle.
Some special trap-related branch conditions are
saved at the time of the trap so that the trap
routine may use them. For speed reasons, the basic
hardware mechanism for multiway branching is that
the selected condition is ORed rather than added to
the branch-sensitive microaddress bit. The OR
implies that the branch-sensitive bits of a
microaddress must be "zeros" by convention. If
branching is masked in any slice, however, only
unmasked branch-sensitive bits need to be zeros.
Thus the branch-masking scheme leads to a
substantial increase in the number of conditional
branch-target addresses, constrained by the
requirement for zeros.
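The sliced OR mechanism can be illustrated in a few lines. The following is a sketch under assumptions, not the gate-array logic itself: the text says only that one bit in each of the five 3-bit slices is branch sensitive, so placing that bit at the low end of each slice (SENSITIVE_BIT = 0) is a guess made for illustration, and the function names are hypothetical. The sketch shows why unmasked branch-sensitive bits must be zero and how masking slices reduces a 32-way branch to fewer ways.

```python
from itertools import product

SLICES = 5
BITS_PER_SLICE = 3
SENSITIVE_BIT = 0  # assumption: which bit of a slice is branch sensitive


def branch_target(next_address, conditions, mask):
    """OR one selected condition per unmasked slice into the address.

    conditions: five booleans, the condition selected in each slice
    mask:       five booleans, True turns off sensitivity in that slice
    """
    target = next_address
    for s in range(SLICES):
        if mask[s]:
            continue  # a masked slice ignores its branch condition
        bit = 1 << (s * BITS_PER_SLICE + SENSITIVE_BIT)
        if next_address & bit:
            # An OR cannot clear a bit, hence the zeros convention.
            raise ValueError("unmasked branch-sensitive bit must be zero")
        if conditions[s]:
            target |= bit
    return target


def ways(next_address, mask):
    """Count the distinct targets reachable from one next-address field."""
    return len({branch_target(next_address, c, mask)
                for c in product((False, True), repeat=SLICES)})
```

For a next-address field such as 0b110_110_110_110_110, whose branch-sensitive bits are all zero, an all-unmasked branch reaches 32 distinct targets, and masking three slices leaves a 4-way branch.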
Figure 14  VAX 8800 Microsequencer
Table 1  Microbranch Conditions

Slice
Number   Microbranch Conditions
1        State flags
2        WBUS low-order bits
3        WBUS high-order bits
4        SALU condition codes
5        PSL condition codes
6        XALU condition codes
7        ALU condition codes
8        Priority encoder condition codes
9        TB-status
10       Cache command
11       MD number
12       AC low
13       Digit valid
14       NMI ID
15       Interrupt pending
16       Interval timer carry
17       Halt pending
18       Console mode
19       Interrupt ID
20       Non_Retry flag
Table 1 shows an example of several microbranch
conditions.
Microsubroutine Call and Return
As in the normal case just discussed, the default
microaddress, the next-address field, is selected as
the starting address of a microsubroutine. However,
a subroutine-calling microinstruction pushes its own
address onto the microstack. During the subroutine
return, the microstack is selected as the source and
then popped. Thus the address of the calling
instruction is used as a base for the return. The
returning instruction may OR an offset from the
next-address field to that base, thus yielding the
target return address. The fact that bits are ORed
rather than added constrains the calling addresses
to have zeros in the low-order bit positions.
The write path to the microstack (PUSH) is
pipelined by a cycle for timing reasons. However, a
bypass path saves what would be the top entry of the
microstack in the read latch (POP) so that PUSHs and
POPs occur in a fairly unrestricted manner. There
are, however, some minor coding restrictions with
respect to traps and decoder-made addresses.

Subroutine calls and returns are unaffected by
stalls. In the VAX 8800 CPU, the microstack is
16 entries deep and is used exclusively for
subroutine calls and returns (i.e., microtraps do
not use the stack). Subroutine calls may be nested
up to 15 entries deep, beyond which the microstack
wraps around and overwrites previous call addresses.
Since the next-address field is conditionally ORed
into the calling address to make the return address,
a conditional multiway return becomes feasible.
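The call and return mechanism can be modeled with a small wraparound stack. The sketch below is a simplified illustration: it ignores the one-cycle PUSH pipelining and the bypass path, and the class name, method names, and example addresses are hypothetical. It shows the caller's own address being pushed, the return address being formed by ORing an offset into the popped address, and the wraparound that overwrites old entries.

```python
class MicroStack:
    """Toy model of a 16-entry microstack with wraparound (hypothetical)."""

    DEPTH = 16

    def __init__(self):
        self.entries = [0] * self.DEPTH
        self.pointer = 0  # next free slot

    def call(self, caller_address, target):
        """Push the calling microinstruction's own address; go to target."""
        self.entries[self.pointer] = caller_address
        self.pointer = (self.pointer + 1) % self.DEPTH  # wraps, no fault
        return target

    def ret(self, offset):
        """Pop the caller's address and OR in the return offset."""
        self.pointer = (self.pointer - 1) % self.DEPTH
        return self.entries[self.pointer] | offset
```

With a caller at 0x120 (low-order bits zero, as the OR requires), returning with offset 0x3 yields 0x123. Because the offset can itself depend on branch conditions, different offsets give the multiway return mentioned above.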
Microtrap and Return
A microtrap is caused when the hardware detects a
condition that would not allow the current
microinstruction to complete its execution
successfully. The hardware forces the next
microaddress to a fixed location that depends on the
particular condition, thus overriding the address
that would otherwise be selected. This special
location is the starting address of the
trap-handling microcode routine specific to that
trap condition. Microtraps are used extensively by
the memory management system to implement the
virtual memory architecture. Microtraps are also
caused by serious system faults (i.e., machine
checks), such as control-store or bus parity errors.
Table 2 lists the microtrap conditions and their
priorities. The priorities are arranged so that if
more than one microtrap occurs during a cycle, the
one with the highest priority will be serviced and
the others ignored.
Table 2  Microtrap Conditions and Priorities

Microtrap Condition        Priority
Microbreak                 Highest
Machine check
VA parity error
TB tag parity error
Reserved for ECO
Reserved float operand
Add rounding
Multiply rounding
Integer overflow
TB miss
Access violation
Modify bit
Page cross
Unaligned page cross
Unaligned trap
Conditional VAX branch     Lowest
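The priority rule can be expressed as a simple ordered search. This sketch takes the ordering from Table 2 and returns the single condition that would be serviced when several are raised in the same cycle; the function name service and the use of strings to identify conditions are illustrative choices, and the actual trap vector addresses are not given in the text, so none are modeled.

```python
# Microtrap conditions in priority order, highest first (from Table 2).
TRAP_PRIORITY = [
    "Microbreak",
    "Machine check",
    "VA parity error",
    "TB tag parity error",
    "Reserved for ECO",
    "Reserved float operand",
    "Add rounding",
    "Multiply rounding",
    "Integer overflow",
    "TB miss",
    "Access violation",
    "Modify bit",
    "Page cross",
    "Unaligned page cross",
    "Unaligned trap",
    "Conditional VAX branch",
]


def service(pending):
    """Return the highest-priority pending microtrap, or None.

    All other conditions raised in the same cycle are ignored.
    """
    for name in TRAP_PRIORITY:
        if name in pending:
            return name
    return None
```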
Figure 11 shows the microtrap latency and its consequences
on pipelining. As described earlier, a trap-causing
microinstruction, even if it writes the wrong results, is
allowed to complete because it is too late to block it
anyway. (The canonical time of register write is T9, whereas
the microtrap signal occurs at canonical time T10.) The only
recourse is to let the trap-handling microcode correct any
problems caused by the trapping microinstruction. The
microtrap signal occurs in time to block all three
microinstructions in the trap shadow. Therefore, the
microtrap logic generates two global signals, the global
microtrap (one cycle long) and the block writes (three
cycles long), at time T10. The purpose of the
global-microtrap signal is to trigger any necessary
trap-contingent actions in various parts of the processor.
The purpose of the block-writes signal is to block register
writes at canonical times T11, T13, and T15, thus rendering
ineffectual microinstructions U, V, and W in Figure 11. In
other words, the blocking of writes by hardware is in effect
until the trap-handling microcode takes control of the
micromachine.
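The timing relationship above can be checked with a small sketch. It assumes, based only on the quoted write times T9, T11, T13, and T15, that successive microinstructions write two canonical time units apart; the function names and the exact window boundaries are illustrative assumptions, not the hardware definition.

```python
TRAP_WRITE_TIME = 9       # canonical register-write time of the trapping microinstruction
TRAP_SIGNAL_TIME = 10     # canonical time at which the microtrap signal occurs
T_UNITS_PER_CYCLE = 2     # assumption inferred from the write times T9, T11, T13, T15
BLOCK_WRITES_CYCLES = 3   # the block-writes signal is three cycles long


def write_time(position):
    """Canonical write time of the microinstruction `position` slots after the trapping one."""
    return TRAP_WRITE_TIME + T_UNITS_PER_CYCLE * position


def is_blocked(position):
    """True if that microinstruction's register write falls inside the block-writes window."""
    window_end = TRAP_SIGNAL_TIME + T_UNITS_PER_CYCLE * BLOCK_WRITES_CYCLES
    return TRAP_SIGNAL_TIME < write_time(position) <= window_end


# Position 0 is the trapping microinstruction; 1, 2, 3 are U, V, W in the trap shadow.
blocked = [is_blocked(p) for p in range(5)]
```

Under these assumptions the trapping microinstruction itself completes its write, the three microinstructions in the trap shadow are blocked, and the fourth following microinstruction is no longer covered by the signal.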
A silo is generally used to save the state of the machine
across a microtrap. In most cases the length of the silo is
equal to the depth of pipelining. Since there are many more
branch condition bits than microaddress bits, it is more
economical to save microaddresses in the trap silo than to
save the conditions causing those addresses. Microaddresses
U, V, and W must be saved in the silo since they may be
branch targets of some previous microinstructions. For the
same reason, however, the address X (overridden by X', the
starting address of the trap routine) must be saved as well.
During the execution of the trap routine, the trap silos are
"frozen" (blocked from loading), thus saving the state of
the micromachine at the time of trap.
After the trap routine has completed, two conditions are
possible:

1. The recovery from the trap is impossible, and hence the
   microinstruction sequence cannot be continued. Then the
   only recourse is to roll back and reexecute the
   macroinstruction. That is, the macroPC is backed up from
   its silo, the IB is flushed, and if necessary, any
   register changes are undone. In this case the last
   microinstruction of the trap routine performs a trap
   release, which unblocks the silos so they can resume
   loading the new states.

2. Microcode can remedy the cause of the trap so that the
   microinstruction sequence can be continued. In this case
   the last microinstruction of the trap routine performs a
   trap return, causing the hardware to recycle
   microaddresses U, V, W, and X through the microaddress
   pipe. This action results in the reexecution of aborted
   microinstructions from the trap shadow.
In the case of a trap return, the hardware selects the
microPC silo as the microaddress for the next four cycles.
As shown in Figure 14, however, the microPC silo does not
contain the microaddresses made by the decoder. Therefore,
it is necessary to resynchronize the microinstruction
execution sequence with the decoder, while requeuing the
trapped microaddresses from the silo. This is made possible
by keeping a tag bit in the silo to identify the positions
of the microaddresses made by the decoder in the sequence.
If a microaddress from the silo is found to be tagged, the
requeuing is terminated immediately and the microaddress
generated by the decoder is selected. A complete recovery
thus occurs since the state of the IB has by this time been
backed up, and therefore the decoder-generated microaddress
can be used for the continuation.
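The requeuing rule for a trap return can be sketched as follows. This is a simplified model under stated assumptions: the silo is represented as a list of (microaddress, tag) pairs, oldest first, and the function name and the string addresses are hypothetical.

```python
def trap_return_sequence(silo, decoder_address):
    """Return the microaddresses issued after a trap return.

    `silo` holds (microaddress, made_by_decoder) pairs, oldest first.
    Untagged entries are requeued as saved. The first tagged entry ends
    the requeuing, and the decoder-generated address is used instead.
    """
    issued = []
    for address, made_by_decoder in silo:
        if made_by_decoder:
            issued.append(decoder_address)
            break
        issued.append(address)
    return issued


# No decoder-made entries: all four saved microaddresses are requeued.
plain = trap_return_sequence(
    [("U", False), ("V", False), ("W", False), ("X", False)], "DECODE")

# Third entry was made by the decoder: requeuing stops there.
tagged = trap_return_sequence(
    [("U", False), ("V", False), ("W", True), ("X", False)], "DECODE")
```

The second case reflects why the tag bit is needed: the saved copy of a decoder-made microaddress is not reused, and the sequence continues from the decoder once the IB state has been backed up.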
Chaining of Microtraps
By convention, microtraps are not allowed to nest; instead,
they are chained. In other words, the trap-handling
microcode must ensure that it will not cause any microtraps
itself. The sole exception is its last microinstruction,
which may cause a second microtrap to follow immediately,
even as the saved microaddresses from the silo are being
requeued to resume the original flow. Note that this second
microtrap does not take effect until four cycles later,
whereas intervening microinstructions are blocked by the
hardware as a result of this second microtrap. Consequently,
the same microaddresses end up in the microPC silo once
again during the execution of the second trap routine. The
original sequence may finally resume after the last of such
chained traps has been serviced.
Acknowledgments
The specification and design of the VAX 8800 I Box was a
team effort. Dave Laurello contributed to the IB design, the
I-stream data formatter, and the interrupt logic. Bei Pong
Wang was responsible for the decoder, the PC increment
logic, and the IB-state manager. Jack Ward looked after the
physical construction of the sequencer and the control
store. The entire development was carried out under the
excellent leadership of Doug Clark. Many thanks also go to
both Doug Clark and Bob Stewart for their suggestions and
guidance during the course of this development.
William A. Samaras
The CPU Clock System in the
VAX 8800 Family
The clock system in the VAX 8800 CPU sends timing signals to every state
device every 45 nanoseconds. The lack of accuracy of these timing signals
is called skew, which must be minimized. Two skews exist: global, between
modules; and local, within a module (the lower of the two). The design
complexity of the overall system dictated the use of an automated timing
verifier. Although advantages accrue from designing for local skew, the
verifier could not segregate between skew types. To gain the benefit of the
verifier, a unique hardware trade-off was made to minimize total skew:
local was made equal to global. The result was that 83 percent of the cycle
time is used productively.
All synchronous computers must provide some means of
generating and distributing accurate timing signals. The
goal of the timing system in the VAX 8800 family is to
provide low-skew (therefore, accurate) timing signals to all
parts of the processor without any manufacturing
adjustments. Furthermore, the design team wanted to automate
the verification of the timing during the design phase.
Therefore, design trade-offs in the clocking system were
necessary to accomplish that automation. This paper
discusses how the hardware designs of the clocking system
were influenced to provide a good environment for the
automatic timing verification.
Clocking System Requirements
The design of the clocking system required us to address
many interrelated problems that had to culminate in a common
solution. This design depended on certain fundamental
specifications that were established for the VAX 8800 CPU by
the system architects. The two primary requirements are
described below.
Cycle Time
The cycle time of the VAX 8800 family of processors is
45 nanoseconds (ns), which means that a CPU can accomplish
some amount of work during that period. Looking at it
another way, these processors can do about 22.2 million
actions every second. Usually, a number of these 45-ns
cycles are required by a processor to produce just one VAX
instruction. The clocking system must keep the thousands of
circuits in the processor "ticking" in perfect step together
every 45 ns.

The 8800 was designed to contain two complete CPUs in the
same cabinet. Since both CPUs share a common memory, it is
beneficial to make the memory system and both CPUs
synchronous with each other. The clock system must keep all
three items running together, precisely locked in time.
Modules
All the circuitry for both processors and the memory
controller is contained on 20 modules, or printed circuit
boards, each 16 inches by 12 inches. These modules occupy
slots in a 21-inch-wide backplane. Each module contains up
to 20 ECL gate arrays and miscellaneous ECL logic. The state
devices, called latches, reside both in the gate arrays and
the miscellaneous logic of each module.
The Clocking Problem
The basic difficulty for this (and any) clocking system is
to get the timing signals to every state device in the
machine at precisely the same time. Every synchronous
machine faces this problem. However, in faster computers,
like the VAX 8800 system, the tolerances placed on the
timing signals are more severe. In a physical sense, it is
simply not possible to send all the
timing signals to every part of each module at the same
instant. There is some precision, however, that should and
can be achieved. We now discuss how important this tolerance
is to the VAX 8800 systems, and what we did to minimize it.
The tolerance, or time difference, that we encounter in
attempting to provide timing signals to every state device
at the same time is called the clock skew. Clock skew is the
uncertainty in the time of a particular event. As an
analogy, consider an airline flight that is scheduled to
arrive at an airport at precisely 5:02 P.M. Now, we know
this flight will not arrive at 5:02 P.M. on the dot; it will
probably arrive within a minute or two of that published
arrival time. This uncertainty in the time of arrival is the
skew of that time. If the uncertainty of arrival is
30 seconds, this skew would probably be a very acceptable
value and we would say the flight is right on time: it
arrived with low skew.

On the other hand, if the uncertainty of arrival is large,
say 30 minutes, we would probably try another airline. Why?
Not simply because we are impatient but for a more
fundamental reason. When the uncertainty is large, we have
less time to do other things that are valuable to us.
Usually, we are committed to the entire time of the
uncertainty. Put another way, this uncertainty, or skew, is
wasted time. Enough of this analogy. How does this skew
affect the operation of a digital computer?

As mentioned earlier, since the cycle time of each CPU is
45 ns, all state devices are "scheduled" to clock at the
start of that period. Any uncertainty in this time from one
latch to another is called clock skew. As in our airline
example, clock skew is wasted time. There are many factors
that increase the clock skew; let us consider one of the
most important ones.

Since the backplane width is 21 inches, all the CPU hardware
modules are separated by no more than that distance. Since
all the wiring in the system is composed of
controlled-impedance transmission lines, the logic signals
can travel at close to the speed of light. At that speed a
logic signal could circle the earth about 4.5 times in
1 second, or it takes about 4 nanoseconds to travel the
21 inches across the processor backplane. Now we can begin
to understand the skew problem. The minimum uncertainty of
any signal traveling through the entire processor would be
at least 4 ns, which is almost 10 percent of the 45-ns
cycle. And that is only one source of skew.
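The backplane figures quoted above can be worked through numerically. The sketch below uses only the 21-inch width, the 4-ns delay, and the 45-ns cycle from the text; the variable names are illustrative, and the implied signal speed is simply what those quoted numbers yield, not a measured property of the backplane.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458
METERS_PER_INCH = 0.0254

backplane_m = 21 * METERS_PER_INCH       # 21-inch backplane width
delay_ns = 4.0                           # quoted end-to-end travel time
cycle_ns = 45.0                          # CPU cycle time

# Share of the cycle consumed by this one source of uncertainty.
share_of_cycle = delay_ns / cycle_ns

# Signal speed implied by the two quoted figures, as a fraction of c.
implied_speed = backplane_m / (delay_ns * 1e-9)
fraction_of_c = implied_speed / SPEED_OF_LIGHT_M_PER_S

print(f"{share_of_cycle:.1%} of the cycle, {fraction_of_c:.2f} c implied")
```

The share comes out a little under 9 percent, which is the "almost 10 percent" cited in the text.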
Since skew can be wasted time, our goal was to make it as
small as possible. In the 8800 system, there are three major
contributors to clock skew: variations in the semiconductor
components, variations in the wiring lengths (described
above), and different manufacturing tolerances of the
modules. One common way to remove skew from a system is to
make some type of adjustment during the assembly of the
hardware. Theoretically, at least, all the skew could be
removed through this method of adjustment. To keep the cost
of manufacturing low, however, another of our goals was to
require no adjustments of any kind. That goal placed an
extra burden on the clock system to deliver accurate signals
without excessive skew. By carefully designing the circuits
of the clocking system and controlling the skew sources
mentioned above, we held the overall clock skew in the
VAX 8800 family to 7.5 ns. Thus, on average, 83 percent of
our 45-ns cycle is utilized. The remainder of the paper
explains some of the trade-offs we made to achieve this
figure.
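The 83 percent figure follows directly from the skew and the cycle time. A minimal sketch of that arithmetic, with a hypothetical helper name:

```python
def usable_fraction(cycle_ns, skew_ns):
    """Fraction of the cycle left for useful work once clock skew is subtracted."""
    return (cycle_ns - skew_ns) / cycle_ns


# 7.5 ns of skew in a 45-ns cycle.
utilized = usable_fraction(45.0, 7.5)
wasted = 1.0 - utilized
print(f"utilized {utilized:.0%}, lost to skew {wasted:.0%}")
```

The remaining 17 percent is the portion of each cycle given up to timing uncertainty.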
Clock Hardware Overview
Figure 1 depicts the hardware in the clock system of the
VAX 8800 family.

The oscillator section is the time base of the whole
machine. The implementation is a custom phase-locked-loop
design that allows the clock period to be varied for test
purposes during the manufacturing process. Using a
phase-locked loop makes it possible to have a very accurate
timing source at many specific clock periods.

The output of the oscillator section connects to a phase
generator that provides two clock phases with the proper
timing relationship between them. The outputs (called the
A-Clock and the B-Clock) of the phase generator are the
actual clock signals distributed to all state devices in the
machine. The phase generator is implemented digitally by
high-speed, 100K ECL shift registers. This technology
creates very accurate timing without requiring any
manufacturing adjustments.
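One way a shift register can derive two clock phases from a faster oscillator is sketched below. This is purely illustrative and is not taken from the paper: the six-stage ring, the one-hot pattern, and the choice of stages that drive A and B are all assumptions. The only figures borrowed from the document are the 133.5 MHz oscillator label and the 22.25 nominal label in Figure 1, whose ratio is 6.

```python
OSCILLATOR_MHZ = 133.5   # oscillator label in Figure 1
STAGES = 6               # assumption: 133.5 / 6 = 22.25, the nominal label in Figure 1


def phase_generator(ticks, stages=STAGES):
    """Yield (A, B) clock levels for each oscillator tick.

    A single 1 bit circulates through a ring shift register. In this
    hypothetical decode, A is high while the bit is in stage 0 and B is
    high while it is in the middle stage.
    """
    register = [1] + [0] * (stages - 1)
    for _ in range(ticks):
        yield register[0], register[stages // 2]
        register = [register[-1]] + register[:-1]   # shift by one stage


waveform = list(phase_generator(12))
period_ns = STAGES / OSCILLATOR_MHZ * 1000.0
```

Because both phases are decoded from the same circulating bit, their spacing is fixed by the oscillator period alone, which is consistent with the claim that a digital phase generator needs no manufacturing adjustment.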
Since there is only one phase generator and thousands of
state devices requiring the clocks, or timing signals, a
method is needed to get the output of the phase generator to
every state device without adding very much skew. That is
the purpose of the distribution stage of the clock system.
The actual circuitry used for the distribution consists of
100K ECL differential devices and 10KH ECL devices. The
distribution was heavily influenced by our desire to use an
automatic timing verifier. The following discussion of the
timing verification environment gives a clearer view of the
reasoning behind the clock distribution scheme.

Figure 1  Clock System in VAX 8800 Family
(Block diagram. The clock module contains the programmable
clock oscillator, labeled 133.5 MHz, control logic, and the
digital clock phase generator, labeled 22.25 nominal, which
produces the A and B phases. 20 A,B clock pairs cross the
backplane interconnect: one to each CPU module, one to the
memory controller, and one to each I/O controller. CPU 1 and
CPU 2 have 8 modules each; there can be up to 2 I/O
controllers. On each module, A and B clock distribution
circuits feed the gate arrays.)
Clock System and the Timing
Verification Environment
Traditionally, timing verification was accomplished by hand
calculations using component specifications. A designer
would simply add all the component propagation delays in a
particular path and determine if all timing criteria were
met. In the past, this method worked fairly well for several
reasons. First, the designer usually knew which paths in a
circuit were critical and could give special attention to
them. Second, components generally behaved better than their
worst-case vendor specifications.

Marginal timing problems, or ones that were simply
overlooked, would often be less serious than the difference
between the worst-case specifications and how the components
actually worked. Finally, timing errors were expected to
appear during the hardware debug phase of a project.
Therefore, timing errors that were blatantly missed during
the design could be corrected (with a lot of hard work)
during that phase. That was possible because the overall
complexity of the design could be comprehended by the
designers.
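The hand calculation described above amounts to summing worst-case delays along a path and comparing the total with the time available in a cycle. The sketch below illustrates that check. The function name, the setup-time parameter, and the sample delays are hypothetical; the 45-ns cycle and 7.5-ns skew are the figures given earlier in this paper.

```python
def path_meets_timing(delays_ns, cycle_ns=45.0, skew_ns=7.5, setup_ns=0.0):
    """Worst-case check for one path between two latches.

    The summed propagation delays, plus clock skew and any setup time at
    the receiving latch, must fit within one clock cycle.
    """
    return sum(delays_ns) + skew_ns + setup_ns <= cycle_ns


# Hypothetical worst-case component delays along two paths, in ns.
short_path = [10.0, 12.0, 8.0]    # 30.0 ns of logic
long_path = [15.0, 15.0, 10.0]    # 40.0 ns of logic

ok_short = path_meets_timing(short_path)
ok_long = path_meets_timing(long_path)
```

A single check like this is easy; the difficulty described in the text comes from repeating it over millions of paths, which is what motivates an automatic verifier.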
From the beginning of the VAX 8800 design effort, we knew
that the timing of the design would be difficult to analyze
manually. First, the sheer complexity of the machine created
over four million different timing paths. It was impossible
to analyze every path manually or to discover every
"critical" one with either manual or intuitive analysis
methods.

Second, hardware circuit loops are widely used in the
design; these are circuits that feed signals back to
themselves during a later machine cycle. These circuits are
very difficult to analyze, especially when loops cross
physical boundaries or are nested within other loops. Just
thinking about the timing ramifications of nested loops
taxes the mind. Manually analyzing thousands of these cases
would be impossible.
Finally, the hardware design made heavy use of gate arrays,
which contain most of the logic. Our ambitious development
schedule and the large number of gate array designs simply
could not tolerate unanticipated timing errors. A timing
error in a gate array meant that a new gate array must be
produced to fix the problem. The fabrication overhead for
another semiconductor device, usually taking months, was not
consistent with our development schedule. Moreover, while
that new gate array was being fabricated, the debugging of
the entire system could be jeopardized since it was just not
possible to "fix" an LSI chip.
Therefore, the hardware design group wanted to design the
processor with the aid of an automatic CAD tool for timing
verification. Such an automatic method for verifying the
timing was essential to the success of the project. Since
the entire design was to be "soft" (the schematics were
contained in computer databases), it seemed logical that
some type of software tool for automatic timing verification
could be applied.

We decided that the most appropriate timing verifier for
this project was produced by Valid Logic, Inc. Although this
automatic tool solved the problems caused by manual timing
verification, it also created some very special new
restrictions.
It was apparent from the beginning of the design effort that
some restrictions had to be placed on the design styles of
individual engineers to reduce the timing-analysis problem
to a manageable level. CPU hardware designers, like any
other creative persons, often assume large degrees of
freedom in their work. Usually, no two designers will arrive
at the same solution to a problem, although all solutions
may be acceptable. When ten or more designers work
independently, as happened on this project, it is likely
that ten unique design styles will emerge.

Therefore, we placed restrictions on the timing environment
for the following two reasons:

• Some standardization of timing had to take place for
  electrical signals to communicate properly between designs
  generated by different people.

• Since the automatic timing verification software was new,
  several important features were lacking.
The usefulness of an automatic timing verifier depends
largely on how well timing-rule violations are reported.
Knowing that a design contains timing errors is useful only
if it is easy to find them. One way to aid the reporting of
timing errors is to create an environment that clocks all
state devices in the processor the same way. This means that
all logic designs in the processor must follow consistent
and strict rules for the clocking of state devices. That was
the method we decided to pursue in this design project.
The Timing Environment
The clock system needed strict constraints on its circuit
design and physical layout to guarantee accuracy. Therefore,
the generation and use of clocking signals were tightly
controlled to minimize the different ways in which the
circuits could communicate. The timing control of state
devices had to be consistent throughout the design.
Moreover, any arbitrary timing control of the state devices
would have been an impossible task for the timing
verification software.
The timing signals in the VAX 8800 processor were carefully
distributed to every state device. This distribution was
accomplished by carefully expanding the clock signals at
strategic physical positions in the processor. A simple
example of this expansion, or fan-out, is shown in Figure 2.

Figure 2  Clock Expansion Groups
(Diagram. A clock source is fanned out through levels 1 to 5
to groups of latches.)

Each time the clock signals are expanded, more timing
uncertainty is introduced into the resulting signals. The
8800 design required up to five levels of expansion to
produce enough clock signals for every state device. As
shown in Figure 2, some signals are in common distribution
groups. Signals existing in the same group will have low
timing uncertainty between them, a characteristic called
skew correlation. The timing uncertainty between signals in
different distribution groups has no correlation; therefore,
these signals have the highest skew. Signals from the same
group have a skew, called local skew, lower than the overall
group-to-group skew, called global skew.

Figure 3  Minimized Global Skew Distribution
(Diagram. The A and B clocks pass from the clock module
across the backplane to a typical CPU module and its gate
arrays through fan-out levels 1 to 5.)
It is very tempting for designers to take advantage of the
lower local skew, which is often only half that of the
global skew. Each clock distribution group is usually
contained entirely on one logic module due to the natural
physical partitioning of the hardware. Therefore,
communication between circuits on any particular module can
take advantage of the lower local skew. If all signal
communication occurs within the local-skew environment, the
timing analysis can be consistent and easily managed.
However, complications arise when trying to analyze signals
that cross from the local-skew environment to the
global-skew environment. Signal communication between logic
modules will have to pay the penalty of using the higher
global skew because the timing signals at each end of the
communication are derived from different distribution
groups. Managing the timing interface across this partition
between local and global skews was beyond the capabilities
of the timing verification software.
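The distinction between the two skews can be expressed as a rule for choosing which figure applies to a given path. The sketch below is an illustration only; the function and parameter names are hypothetical, and the sample values are the ones reported in Table 1 of this paper. It also models a verifier that cannot tell the two skews apart and must therefore assume the global figure everywhere.

```python
# Skew figures in ns, taken from Table 1 of this paper.
LOCAL_OPTIMIZED = {"local": 2.0, "global": 9.0}
GLOBAL_OPTIMIZED = {"local": 7.5, "global": 7.5}


def applicable_skew(module_a, module_b, distribution, verifier_knows_local=True):
    """Return the clock skew in ns to assume between latches on two modules.

    Latches on the same module share a distribution group and see the
    local skew. A verifier with no notion of local skew must use the
    global skew for every path.
    """
    same_group = module_a == module_b
    if same_group and verifier_knows_local:
        return distribution["local"]
    return distribution["global"]


# Same-module path, analyzed by a tool that cannot distinguish the skews.
penalized = applicable_skew(3, 3, LOCAL_OPTIMIZED, verifier_knows_local=False)
uniform = applicable_skew(3, 3, GLOBAL_OPTIMIZED, verifier_knows_local=False)
```

With a tool that applies one skew everywhere, a distribution tuned for low local skew is charged 9 ns on every path, whereas the distribution tuned for global skew is charged 7.5 ns.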
As discussed earlier, a timing analysis of the entire
processor was beyond human capacity; therefore, it had to be
performed with timing verification software. The timing
verification tool chosen for the 8800 development had no
facility for distinguishing between local and global skews.
Moreover, we wanted to use the timing verifier to analyze
the timing of the entire CPU as one entity. This decision
forced us to disallow the use of any local-skew computations
in our timing analysis. Now, from a design point of view
this decision made the environment very easy to work with.
All timing transactions anywhere in the CPU could be
analyzed the same way with the same set of specifications.
Everything comes at a price, however, and the obvious
negative side of this decision was the loss of the ability
to apply the lower local skew. At that point, some
performance of the processor seemed to be compromised just
to simplify the timing analysis. The following discussion
explains how this problem was solved.

The Clock Distribution Solution
Since we wanted to time the CPU as one entity, we had to
make the global skew as small as possible to maximize CPU
performance. In the actual implementation, the global skew
was lowered by removing one gating level from the clock
distribution. The gating level removed was necessary for
producing low local skew. Figure 3 illustrates the five
levels of fan-out that were required to produce enough
signals when the global-skew distribution was minimized.
Figure 4 shows the same fan-out to produce enough signals in
the case in which the local-skew distribution would be
minimized. Table 1 illustrates the impact of this
optimization for global skew.

Figure 4  Minimized Local Skew Distribution
(Diagram. Clock fan-out on a typical CPU module to the gate
arrays.)

Table 1  Distribution Changes

                         Global Skew    Local Skew
Optimized Local Skew     9 ns           2 ns
Optimized Global Skew    7.5 ns         7.5 ns

Although using the lower local skew would have been
valuable, it was sacrificed by making it equal to the global
skew. In short, the hardware of the clock system was
designed to allow the maximum exploitation of the timing
verification software. Of course, hardware and software
trade-offs are a common occurrence in any design project. In
this case, however, the value of the hardware involved with
operating the machine was balanced against the software
analysis needed during the design phase of the machine.

Summary
Producing the clocking system for a high-speed computer is
best described as an exercise in minimizing and managing
skew. In the VAX 8800 project, we avoided exotic hardware
techniques so that we could gain the benefit of using an
automatic timing verifier. The resulting skew of 17 percent
of the cycle time was a figure that could be tolerated. This
balance was a fair trade-off since the simplicity of the
timing environment allowed us to decrease the time to design
and build the VAX 8800 family of systems.
John Fu
James B. Keller
Kenneth J. Haduch
Aspects of the VAX 8800
C Box Design
In each processor in the VAX 8800 family, instructions and data are
supplied to the execution units by the C Box. Employing a simple structure
with a translation buffer, cache, and address and data buffers, this logic
unit is an integral part of the processor's five-stage pipeline. The
no-write-allocate cache uses a write-through scheme featuring a unique
delayed-write algorithm. The C Box has control logic to accommodate
pipeline stall conditions caused by memory accesses. The C Box also
maintains data coherency within a processor and between processors. A
dynamic priority-arbitration scheme solves the lock-out problem between
I/O and processor requests.
The performance of a high-speed computer depends to a large
extent on how fast data can be passed from its memory to its
execution units. If the computer is pipelined, the unit
responsible for memory accesses may have to handle pipeline
stall conditions. And if the computer is a multiprocessor,
that unit in each processor may also have to handle data
coherency problems. In processors with the VAX architecture,
data accesses are further complicated by the fact that
virtual addresses are normally specified. These addresses
require translation to physical addresses before a data
access can even be attempted.

In the VAX 8800 system, which is a multiprocessor with
pipelined CPUs, the unit that performs address translations
and data accesses is the C Box.
C Box Description
The C Box consists of three subunits: the translation buffer
(TB), the cache, and the NMI interface. Figure 1 is a
schematic diagram of this unit.

The translation of a VAX virtual address to a physical
address is a complicated process.¹ Accesses to system and
process page tables are required, and shifting and adding
must be done to obtain the final physical address.
Performing this address translation process for every data
reference significantly increases the data access time and
reduces the read bandwidth. One way to avoid that is to
store the result of this address calculation in a small,
fast memory called a translation buffer. Since each
translation can access a page of data (512 bytes in the VAX
architecture), it is likely that the translation will be
used again in the program being executed. Rather than
recalculating the physical address (PA) on those subsequent
accesses, it can be retrieved from the TB.

The translation buffer in the VAX 8800 processor holds
512 system and 512 process address translations. The
following summarizes the characteristics of the TB.

Characteristics of the Translation Buffer
• Direct Mapped
• 1024 Lines
  - 512 System Lines
  - 512 Process Lines
• Allocation on Translation Buffer Miss
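A direct-mapped translation buffer with the listed characteristics might be modeled as in the sketch below. The index function is an assumption made for illustration: the paper does not say which address bits select a line, so this sketch uses virtual address bit 31 (set for VAX system space) to choose the system or process half and the low nine bits of the virtual page number to choose a line. The class and method names are hypothetical.

```python
PAGE_SHIFT = 9          # 512-byte pages
LINES_PER_HALF = 512    # 512 system lines and 512 process lines


def tb_index(va):
    """Map a 32-bit virtual address to one of 1024 TB lines (hypothetical mapping)."""
    vpn = va >> PAGE_SHIFT
    is_system = (va >> 31) & 1           # VAX system space has VA bit 31 set
    return is_system * LINES_PER_HALF + (vpn % LINES_PER_HALF)


class TranslationBuffer:
    def __init__(self):
        self.tags = [None] * (2 * LINES_PER_HALF)
        self.frames = [None] * (2 * LINES_PER_HALF)

    def lookup(self, va):
        """Return the physical address on a hit, or None on a TB miss."""
        line = tb_index(va)
        if self.tags[line] != va >> PAGE_SHIFT:
            return None
        offset = va & ((1 << PAGE_SHIFT) - 1)
        return (self.frames[line] << PAGE_SHIFT) | offset

    def allocate(self, va, frame):
        """Fill the selected line after a miss (allocation on TB miss)."""
        line = tb_index(va)
        self.tags[line] = va >> PAGE_SHIFT
        self.frames[line] = frame


tb = TranslationBuffer()
first = tb.lookup(0x1234)        # miss: nothing has been loaded yet
tb.allocate(0x1234, 0x55)        # full translation would happen here
second = tb.lookup(0x1234)       # hit: physical address comes from the TB
```

Because the buffer is direct mapped, two pages that select the same line evict each other; the stored tag is what detects that the line holds a different page.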
A common approach to the problem of data access latency for
high-speed processors, and the one used in the VAX 8800 CPU,
is to use a cache.² A cache is a small, fast memory located
between the processor and the main memory system. If the
data requested by the CPU is not contained in the cache,
that data is accessed from main memory and loaded into the
cache.
Figure 1  Block Diagram of C Box
(Diagram not reproduced. Blocks shown: the translation buffer, with TB tag, TB data, and TB hit; the cache, with cache tag, cache data, cache address, and cache hit; and the NMI interface, with read stream address buffering, write stream data buffering, and the write buffer, connected to the NMI. Legend: TB = translation buffer; VA = virtual address; PA = physical address; A, B = A and B phases of the two-phase clock.)
Thus, in the majority of cases, the cache will contain recently referenced data items, and future references to those data items will be fetched from the cache. The intent is to minimize the number of longer latency accesses to the main memory subsystem. The success of a cache memory relies on the locality of references in both time and space.
The data cache in each VAX 8800 CPU holds 64 kilobytes (KB) of both data and instructions. The list below summarizes the characteristics of the cache.
Characteristics of the Cache
• Direct Mapped with Physical Address
• Read Allocate Only
• Delayed-Write Cache Update
• Write-through Memory Update with Write Buffering
• 1024 Blocks
• 64-byte Block Size
• 4-byte (one longword) Line Size
• 32-byte (one hexword) Cache Refill Size
The TB and the cache are very similar in concept and structure, except that the TB is used to accelerate address translations and the cache to accelerate data accesses. Each consists of a tag section and a data section. The tag section holds the unique identifier, or tag, for the data item held in the corresponding data section. The TB and the cache are direct mapped, meaning that each address can point to only one location; however, each location can potentially be allocated to one of many addresses. A tag permits the identification of a data item in either the TB or a cache location. The tag in the VAX 8800 processor is an unmodified selection of bits from the address of the data item being accessed. This concept is depicted in Figure 2.
Figure 2  Translation Buffer and Cache Address Mapping
(Diagram not reproduced. Labels recoverable from the figure: VA(31-0), VA(30-18), TB tag, TB data, TB hit, PA(29-0), PA(28-16), PA(15-6), cache tag, cache data, cache hit. Legend: VA = virtual address; PA = physical address; TB = translation buffer.)
As mentioned earlier, a memory access is required if the cache does not contain a requested data item. In the 8800, both processors are connected to the memory and the I/O subsystems through the NMI bus. All read and write references that go to these subsystems are processed by the NMI interface. This interface maintains a set of buffers for both read and write reference streams. For the read stream there are actually two sets of address buffers: one for data reads, the other for instruction reads.
C Box Operations
A C Box reference consists of a function code, an address, and in the case of writes, 32 bits of data. In general, that address is a 32-bit virtual address (VA). The VA translation process begins with a check to see if the PA is available in the TB. If the PA is available, called a TB hit, the data is read out and concatenated with the lower nine bits of the VA to form the PA. As part of the translation process, the TB also performs page access checking. If the PA that pertains to the VA is not in the TB, called a TB miss, then microcode must perform the translation. The microcode then writes the data into the TB for subsequent use. (If the address supplied is already a PA, then the TB is not used.)
Only physical addresses access the cache. If the data referenced is contained in the cache, called a cache hit, then the data can be accessed from there. If the cache does not contain the data, called a cache miss, then the data must be accessed from memory.
Read Operations
Cache-miss addresses for reads are passed to the NMI interface, where they are held in the read address buffers. A hexword read request (32 bytes), with the address of the missed location, is then made to memory. The memory data is passed to the requesting unit, and the address held in the read address buffer is used to update the missed cache location. A read miss is the only occasion upon which a cache location is allocated.
There are two read streams in the C Box for requests to memory: the data stream, called the d-stream, and the instruction stream, called the i-stream. The i-stream requests the memory to send data destined for the instruction unit (I Box), which interprets that data as macroinstructions. I-stream fetches are initiated by microcode, which loads a C Box register called the physical instruction buffer address (PIBA). The PIBA holds the address of the next longword of the i-stream to be fetched. If the execution of macroinstructions is sequential (i.e., there are no branches, page crosses, etc.), the C Box can increment the PIBA contents automatically after each fetch. However, should the program branch or a page cross occur, microcode must be used to reload the PIBA. D-stream fetches are made only by the microcode, which must specify one of eight memory data (MD) registers as its destination. D-stream data is always returned to the execution unit.
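The address handling described above can be illustrated with a small software model. The following Python sketch is not the hardware design; it only assumes the figures given in the text (512-byte pages, 512 system plus 512 process TB lines, 1024 cache blocks of 64 bytes) and the bit-field labels that survive from Figure 2. The class and function names are hypothetical, and the use of VA bit 31 to select the system half of the TB is an assumption of the sketch.

```python
PAGE_OFFSET_BITS = 9       # 512-byte VAX page: lower nine bits pass through
TB_LINES_PER_HALF = 512    # 512 system lines + 512 process lines


class TranslationBuffer:
    """Direct-mapped TB model: one (tag, page frame) slot per line."""

    def __init__(self):
        self.lines = [None] * (2 * TB_LINES_PER_HALF)

    @staticmethod
    def _split(va):
        system = (va >> 31) & 1                 # assumption: bit 31 picks the half
        index = system * TB_LINES_PER_HALF + ((va >> PAGE_OFFSET_BITS) & 0x1FF)
        tag = (va >> 18) & 0x1FFF               # VA(30-18), as labeled in Figure 2
        return index, tag

    def lookup(self, va):
        """Return the PA on a TB hit, or None on a TB miss."""
        index, tag = self._split(va)
        line = self.lines[index]
        if line is None or line[0] != tag:
            return None
        # TB data concatenated with the lower nine bits of the VA.
        return (line[1] << PAGE_OFFSET_BITS) | (va & 0x1FF)

    def fill(self, va, page_frame):
        """What microcode does after a TB miss: allocate the line."""
        index, tag = self._split(va)
        self.lines[index] = (tag, page_frame)


def cache_index_and_tag(pa):
    """Split a PA for a 1024-block, 64-byte-block direct-mapped cache."""
    index = (pa >> 6) & 0x3FF      # PA(15-6)
    tag = (pa >> 16) & 0x1FFF      # PA(28-16)
    return index, tag
```

Two virtual addresses that differ only in VA(30-18) compete for the same TB line, and two physical addresses 64 KB apart compete for the same cache block, which is the "one location, many addresses" property of a direct-mapped structure.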
Write Operations
In general, the performance of a cache is measured by its hit rate when reading data. The selection of the update mechanisms for both cache and memory, however, can have a major influence on the design of the cache. There are two well known strategies for updating a cache: write allocate, and no-write allocate. A write allocate scheme updates a cache location whether or not the write is a hit or a miss. This scheme is generally implemented with a write-back memory arrangement (discussed later). In a no-write allocate scheme, the cache is updated only if the write was a hit. The VAX 8800 processor uses a no-write allocate scheme.
The no-write allocate scheme does, however, present a problem. Since only writes that hit will update the cache, cache updates take two pipeline cycles in the C Box: the first to check for hit or miss, the second to update the cache for a hit. The C Box was designed to enable one read reference to complete in each cycle. If two consecutive cycles are needed to update the cache, the second cycle could block a read reference, thus causing a pipeline stall.
To solve this problem, the C Box implements a delayed-write algorithm. This mechanism delays writes that must update the cache from doing so until the first cycle of the next write reference. The second cycle of the delayed write does not need to be the next consecutive cycle.
The delayed-write algorithm in the C Box takes advantage of the fact that the first cycle of a write utilizes only the tag section of the cache to determine whether a hit or a miss has occurred. The second cycle uses only the data section. A write that must update the cache has its address and data placed into the delayed write address and data buffers respectively. On the next write access, during the cache-tag lookup cycle, the data section of the cache will be updated from the address and data contained in those buffers, but only if the previous write access was a hit. Since reading a data item after one has been written is common, this design significantly reduces the potential for stalls.
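The write path just described can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the C Box logic: the cache is shrunk to 16 longword lines, each line carries its own tag, and the names (DelayedWriteCache, read_allocate, pending) are hypothetical. It shows only the overlap of one write's tag check with the previous write's data update; interactions with intervening reads are left out.

```python
LINES = 16  # toy size, chosen only to keep the example small


class DelayedWriteCache:
    def __init__(self):
        self.tag = [None] * LINES
        self.data = [0] * LINES
        self.pending = None   # (index, value) from the previous write hit

    def _split(self, pa):
        line = pa >> 2        # longword granularity
        return line % LINES, line // LINES

    def read_allocate(self, pa, value):
        """Read-miss fill: the only way a location is allocated."""
        index, tag = self._split(pa)
        self.tag[index] = tag
        self.data[index] = value

    def write(self, pa, value):
        """One write cycle: tag check for this write, data update for the last."""
        index, tag = self._split(pa)
        hit = self.tag[index] == tag        # uses the tag section only
        if self.pending is not None:        # uses the data section only
            p_index, p_value = self.pending
            self.data[p_index] = p_value
        # No-write allocate: a miss leaves the cache untouched.
        self.pending = (index, value) if hit else None
        return hit
```

After a write hit, the data section still holds the old value until the next write reference arrives, which is why a read of that location needs separate handling.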
Write Buffer
All write references, whether or not they hit in the cache, must eventually go to memory. There are two general strategies in cache design with respect to memory updating: write-through, and write-back. In the write-through approach, write references are sent to the memory system immediately. Conversely, in the write-back approach, writes are held until the cache block is deallocated (made ready to receive different data).
There are several major problems with a write-back strategy. First, it requires either microcode or hardware to accomplish all the write-back functions. Adding that code or hardware to the C Box would have considerably increased its complexity.
Second, if there is a write miss with this scheme, a cache block that might be full of valid data could be displaced by a block whose only valid data was that just written to the cache. For a cache having a large block size, like the 8800 has, this action is undesirable. Moreover, in most cases microcode reads data before it is written; therefore, writes will generally hit in the cache.
Finally, the write-back strategy requires a complex algorithm to maintain coherency between caches within a multiprocessor system. Therefore, for all those reasons, we chose to use the write-through approach in the cache.
One disadvantage of write-through is that it tends to generate a lot of write traffic to the memory. In a shared-bus system like the 8800, this traffic can limit performance. To reduce memory-write traffic, writes in the VAX 8800 processor are buffered in a write buffer contained in the NMI interface. This write buffer is really a one-line, octaword, write-allocate cache. A write going out to the NMI bus is held in the write buffer. Subsequent writes to the same octaword update only the write buffer so that no memory requests are sent on the NMI bus. A write that is outside the octaword currently in the write buffer deallocates it; that is, the contents of the write buffer are sent to memory, and the next write replaces those contents in the buffer.
Like the cache, the success of the write buffer in reducing bus traffic relies on the locality of programs in space and time. For example, sequential writes, such as pushes to the stack, will get collected in the write buffer even if the writes occurred in different macroinstructions. This collected "package" of writes can then be sent to the memory more efficiently than can individual writes.
Another advantage of the write buffer is that it decouples the processor from memory activity. When the memory is busy processing transactions from the other processor or from the I/O subsystem, a processor will not stall due to writes. The write buffer is actually implemented as a two-deep buffer, which further reduces the potential for stalls.
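The collecting behavior of the write buffer can be shown with a short Python sketch. It models a single octaword entry only (the text notes the real buffer is two deep), and the class name, the bus_writes list, and the dictionary layout are assumptions made for illustration.

```python
class WriteBuffer:
    """One-line, octaword (16-byte), write-allocate buffer in front of memory."""

    def __init__(self):
        self.base = None        # octaword-aligned address currently held
        self.longwords = {}     # byte offset (0, 4, 8, 12) -> value
        self.bus_writes = []    # (base, {offset: value}) sent on the bus

    def write(self, pa, value):
        base, offset = pa & ~0xF, pa & 0xC
        if self.base is not None and base != self.base:
            self.flush()        # a write outside the octaword deallocates it
        self.base = base
        self.longwords[offset] = value

    def flush(self):
        if self.base is not None:
            self.bus_writes.append((self.base, dict(self.longwords)))
        self.base, self.longwords = None, {}
```

Four longword pushes to a descending stack that fall in one octaword produce a single bus transaction in this model; only the fifth push, which lands in the next octaword, forces the first package out.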
Pipeline Stalls
In a pipelined implementation, how well the pipeline performs is determined both by how often it is flushed clear and how often it is stalled. Stall conditions are generally related to the lack of some physical resource or data.
In some implementations, some pipeline stages can take more cycles to complete than others for certain functions. If a shorter stage precedes a longer one, the longer one will be unable either to accept fresh data or to pass its result to the next stage until finished with its cycle. In turn, other portions of the pipeline cannot proceed with their operations; therefore, the pipeline will stall. In this stalled condition, all stages preceding the "bottleneck" maintain their input and output conditions until the stage responsible for the stall completes its function. Some implementations have a combination of stages that may exhibit these characteristics, leading to complex pipeline stall conditions.
In the VAX 8800 CPU, the design simplicity of the pipeline ensures that each pipeline stage, except the C Box, always completes its function in one cycle. Since the C Box also controls data accesses, all stalls in the 8800 are related to the operation of this unit. The pipeline will experience two types of stalls: the MD stall, and the VA stall.
MD Stalls
When making a read reference, a microinstruction must specify one of eight MD registers to be used as its destination. When data is made available, either from the cache or from memory, it is written into the specified MD register. Subsequent microinstructions then use the data from this register. If a microinstruction attempts to use an MD register that is not "valid" (i.e., the data has not yet been fetched by the C Box), the pipeline will experience an MD stall.
The MD stall condition is a data-dependency type of stall that is generally seen in pipelined machines. On the VAX 8800 processor, certain steps are taken to either avoid such stalls or reduce their effects. For example, consider two consecutive microinstructions, R and S, as illustrated in Figure 3. R is a microinstruction that performs a read and puts data into an MD register. S then accesses and uses the data fetched by R. If R and S are adjacent, the pipeline will stall in the 8800. The reason for the stall is that the pipeline stage accessing the MD data and the stage fetching that data (the C Box) are separated by one other stage, the arithmetic and logic unit (ALU). When S tries to use the MD data, R is just starting to make the read reference in the C Box. S must therefore stall the pipeline, waiting for data to be supplied by R.
Figure 3  Instructions R and S Are Adjacent
(Timing diagram not reproduced. It shows instructions R and S, cycle by cycle, passing through the stages MD access for data, ALU, TB, and cache. Annotations: R starts read reference; S requires data read by R and must stall at least one cycle for the data. Legend: MD = memory data register; TB = translation buffer.)
Figure 4  Instructions R and S Separated by Another Instruction
(Timing diagram not reproduced. It shows instruction R, an intervening instruction, and instruction S, cycle by cycle, passing through the stages MD access for data, ALU, TB, and cache. Annotations: R has completed read reference, data just available; S requires data, data sent directly into ALU, bypassed MD update, no stall.)
On the other hand, if R and S are separated by one other instruction, then when S attempts to use the data read by R, that data is just being made available by the C Box (assuming, of course, a read hit in the cache). If S were to wait for the MD registers to be updated before using the data, the pipeline would stall. To eliminate that type of stall, a path has been designed from the C Box directly into the input of the ALU, bypassing the MD registers. Therefore, the data coming from the cache is sent both to the MD registers for updating and directly to the ALU, where S can use the data. The net effect is that this bypass path removes the one-cycle latency that S would have experienced had it waited for the data to come out of the MD registers. Figure 4 illustrates these concepts.
Had R caused a read miss, S would still cause an MD stall since the C Box must make a memory fetch for the data. Notice that an MD stall happens only when S attempts to use an MD register. Therefore, a general rule for making microcode accesses to the C Box is to make read references early and to use the MD registers late. Should the read reference miss, some part of the memory fetch latency will be hidden by the microinstructions between the read and the MD register access. When data returns from a read miss and the pipeline is either undergoing or about to undergo an MD stall, the bypass path can be used to reduce the effects of the stall or even prevent it.
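The "read early, use late" rule can be made concrete with a small cycle-count model. The Python function below is only a sketch inferred from the two cases in the text (adjacent R and S stall for one cycle on a cache hit; one intervening instruction removes the stall when the bypass path is used). The function name and the miss_penalty parameter are hypothetical, and actual miss latencies are not given in the text.

```python
def md_stall_cycles(gap, bypass=True, miss_penalty=0):
    """Estimate MD stall cycles seen by S.

    gap          -- number of microinstructions between R and S
    bypass       -- whether the C Box-to-ALU bypass path is used
    miss_penalty -- extra cycles for a read miss (assumed value, 0 = hit)
    """
    # With the bypass, data reaches the ALU one instruction slot after R;
    # without it, S waits one more cycle for the MD register update.
    needed = 1 if bypass else 2
    return max(0, needed + miss_penalty - gap)
```

In this model, every microinstruction placed between the read and the first use of the MD register hides one cycle of latency, whether the read hits or misses.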
VA Stalls
A VA stall condition occurs when the C Box cannot process a requested reference. This can be due to either an invalidation cycle in the C Box (discussed in the final section of this paper) or the capabilities of the address and data buffers in the NMI interface being exceeded.
As mentioned earlier, for reads there is a set of buffers for d-stream and i-stream references. The d-stream buffering is one deep, meaning there can only be one read miss outstanding in the C Box. However, the implementation will not allow the pipeline to stall should subsequent reads hit in the cache. I-stream reads never stall the pipeline as do VA and MD stalls, which stop the clock. The instruction buffer can "stall" if it does not have enough data for the decoder to complete the decode of the current VAX instruction operand. This condition causes the CPU to perform a no-operation microword. That does not stop the clock, however, and thus is not a pipeline stall.
The C Box can still receive commands even if it contains one read miss. Of course, there is the potential that the command being received will miss in the cache. That will require the NMI interface to request the data from memory, thus resulting in a VA stall. That stall lasts from the time the command is received until the time the previous read-miss data returns from memory. If the second command is a read that hits in the cache, a VA stall will be generated for the one cycle that it takes to determine whether or not there is a cache hit. The read data will then be taken from the cache and returned to the MD, after which the stall will be released.
Since writes go to memory more than reads, the buffering for writes is more extensive. The delay-write buffer and the double buffering in the write buffer are used to reduce the possibility of write stalls. These buffers enable the C Box to hold a maximum of nine longwords of data before the pipeline will experience a VA stall on a write.
Stalled and Unstalled Logic in the C Box
If an instruction is stalled, the C Box has either not returned the data or cannot take another reference. Therefore, all stages prior to the C Box (the I Box and the E Box) must be stalled. The TB is part of the last stage of the pipeline; therefore, it must be capable of being stalled. When the pipeline stalls, the TB holds the address of the stalled reference. Only the NMI interface can resolve a stall, either by supplying the read miss data or by freeing up its buffers. Thus this interface can never be stalled. However, the cache, being part of the last stage of the pipeline, is also the path for supplying data to the stalled instruction. This situation leads to an interesting control characteristic of the C Box. One of its sections, the TB, can be stalled; another, the NMI interface, must never stall; and the third section, the cache, must remain unstalled but maintain stalled input and output conditions in its logic. Figure 5 depicts the logic for stalled and unstalled conditions in the C Box.
Figure 5  Stalled and Unstalled Logic in C Box
(Diagram not reproduced. Labels recoverable from the figure: I Box, E Box, translation buffer (stalled), cache (stalled/unstalled), NMI interface (unstalled), physical address, data, NMI.)
Coherency Problems in the C Box
In general, data coherency means that a read should always get correctly modified data when a series of reads and writes is made in any sequence. One way to maintain coherency is to perform all reads and writes to completion in a purely sequential manner, thus strictly maintaining their sequence of reference. However, in a pipelined machine, not only can there be several sources of read and write references, but there can also be more than one copy of the data item. This duplication often leads to very complex solutions to achieve coherency.
This complexity has been simplified somewhat in the VAX 8800 pipeline by having the C Box both control and sequence all data accesses. The C Box itself, however, is pipelined, having a d-stream and an i-stream for reads, and a stream for writes. This fact also presents some coherency problems. Coherency for the C Box means that two conditions must be met.
1. After a sequence of reads and writes has completed, any valid blocks in the cache must match the data in the memory.
2. Whenever the processor writes to a location in memory and then reads that location, the data has to be what was written.
Two types of coherency problems exist in the VAX 8800 system: coherency within a processor, and coherency between processors.
The first type of problem in the C Box arises from the implementation of the delay-write algorithm discussed earlier. A problem occurs when a read is attempted to the cache location waiting to be updated by the write held in the delay-write buffers. The read will hit, but the cache data will be stale. One solution to this problem is to stall the pipeline while the cache is updated, performing the read for the correct data. The trouble here is that the sequence of writing to and reading from the same location is a common occurrence. Thus to stall would significantly reduce the read bandwidth.
The C Box solves this problem by comparing selected bits of the read and write addresses in the delay-write buffer. If the bits match, then the data content of that buffer is used as the read data. This solution works because, to the read, the delay-write buffer appears to be an extension of the cache. Since the read address matched the address in this buffer, the data can be taken directly from it. Coherency is thus assured, and no stall penalty is incurred.
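The forwarding idea can be expressed in a few lines. The Python sketch below is a simplification: it compares full longword addresses rather than the "selected bits" the hardware uses, and the function name and data layout (a dictionary for cache contents, a tuple for the delay-write buffer) are assumptions made for illustration.

```python
def read_longword(pa, cache_data, delayed_write):
    """Return read data for a cache hit at physical address pa.

    cache_data    -- dict mapping longword address -> value in the data section
    delayed_write -- (address, value) waiting in the delay-write buffer, or None
    """
    # The delay-write buffer acts as an extension of the cache: a matching
    # address means the buffer, not the data section, holds the current value.
    if delayed_write is not None and delayed_write[0] == pa:
        return delayed_write[1]
    return cache_data[pa]
```

A read that matches the buffered write gets the new value without waiting for the data section to be updated; any other read is served from the cache as usual.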
The second type of coherency problem occurs when the read is a miss and thus goes to the NMI interface. To assure high performance, the NMI interface maintains two streams of data requests, the read and write streams. The buffering and the control of these two streams operate independently. If made to different data items, read and write requests can be processed to memory as quickly as possible, even out of sequence. The coherency problem is to make sure that subsequent reads and writes to the same data item result in its correct state.
If a read request occurs that was a miss, the cache will send it to the NMI interface upon discovering that fact. Once in the NMI interface, the read address is compared to the address of the octaword in the write buffer. If those addresses are different, the cache will send the read directly to memory. Thus the data in the write buffer will be unaffected. If the addresses match, however, the write data will be sent to memory, followed by the read request. Since the memory subsystem processes references in a sequential manner, the read will always access the correct data. (Of course, this case is fairly simple. A more complicated one is that in which a read is sent to memory, and the processor performs a write while waiting for that read.)
If the addresses of the read and write match, the cache can give the processor the requested data but cannot mark the returned data valid in the cache. This situation occurs because the read-miss data being fetched from memory has been made stale for subsequent reads.
The microcode is designed so that it will never read a data item and then write to it without first accessing the MD registers. However, a cache block is 64 bytes long. The microcode could write to any other data item in the block before coming to the missed data item. There can be as many as three writes and two reads (one each for the d- and i-streams) buffered simultaneously in the C Box, all referencing the same cache block. Even worse, the C Box can send an arbitrary number of writes to memory while waiting for the data returned by the read to memory. To maintain coherency, the C Box performs a set of address matches between the read and write streams. Then it "remembers" whether or not any write addresses matched the outstanding reads and marks them invalid as appropriate.
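The two rules described above (send a matching buffered write to memory ahead of the read, and remember writes that touch a block with a read outstanding) can be sketched as follows. This Python model is deliberately reduced: one write-buffer entry, one outstanding read, and hypothetical names throughout. It is meant to show the ordering and the "remember and invalidate" idea, not the actual buffer structure.

```python
OCTAWORD_MASK = 0xF    # 16-byte write-buffer entry
BLOCK_MASK = 0x3F      # 64-byte cache block


class NmiInterfaceModel:
    def __init__(self):
        self.write_buffer = None   # octaword base held in the write buffer
        self.to_memory = []        # requests in the order memory receives them
        self.read_block = None     # cache block of the read miss in flight
        self.fill_valid = False

    def write(self, pa):
        base = pa & ~OCTAWORD_MASK
        if self.write_buffer is not None and self.write_buffer != base:
            self.to_memory.append(("write", self.write_buffer))
        self.write_buffer = base
        # A write into the block being fetched makes the returning data stale.
        if self.read_block == (pa & ~BLOCK_MASK):
            self.fill_valid = False

    def read_miss(self, pa):
        # A matching buffered write must reach memory before the read.
        if self.write_buffer == (pa & ~OCTAWORD_MASK):
            self.to_memory.append(("write", self.write_buffer))
            self.write_buffer = None
        self.to_memory.append(("read", pa))
        self.read_block = pa & ~BLOCK_MASK
        self.fill_valid = True

    def read_return(self):
        """Report whether the returned block may be marked valid in the cache."""
        valid, self.read_block = self.fill_valid, None
        return valid
```

In this model the processor still receives the requested data when read_return reports False; only the cache fill is suppressed.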
C Box Design for a Multiprocessor System
The VAX 8800 system consists of two identical VAX 8800 processors on the NMI bus connected to the memory and I/O subsystems. Within a processor, only the design of the C Box has been affected by the requirements of a multiprocessor arrangement. That is because the C Box is the CPU's interface to the NMI bus and contains the central arbitration logic for that bus.
There are three key issues in designing a memory interconnect for a multiprocessor system: bus arbitration, bus bandwidth, and data coherency between processors.
Bus Arbitration on the NMI Bus
Two major problems were encountered in the design of an arbitration scheme for the NMI bus. The first was the fact that between the CPUs and the I/O subsystems, called the NBIs, there was a possibility that a high-priority device could lock out a low-priority device from the bus. This is certainly possible with a fixed priority-arbitration scheme. To address this problem, the C Box implements a dynamic priority-allocation scheme that causes priority to be assigned between two groups: the I/O devices, and the CPUs. Within these groups, the priority shifts between the two CPUs and the two I/O devices. For example, if all four devices wanted to use the bus all the time, the order in which the bus would be granted to the devices would be
first CPU, first I/O, second CPU, second I/O, first CPU, first I/O, second CPU, second I/O, etc.
This scheme guarantees that all devices on the bus will have nearly equal access to the bus, thus solving the lock-out problem.
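One way to obtain the grant order given above is to toggle priority at two levels: between the groups after every grant, and between the two members of a group after a grant to that group. The Python sketch below takes that approach. It is an assumption about how such a scheme could work, not a description of the NMI arbitration logic, and the class and attribute names are hypothetical.

```python
class NmiArbiter:
    GROUPS = ("CPU", "I/O")

    def __init__(self):
        self.group_priority = 0                        # index into GROUPS
        self.member_priority = {"CPU": 0, "I/O": 0}    # which device goes first

    def grant(self, requests):
        """Pick one requester from a set of (group, member) pairs."""
        for g in (self.group_priority, 1 - self.group_priority):
            group = self.GROUPS[g]
            first = self.member_priority[group]
            for member in (first, 1 - first):
                if (group, member) in requests:
                    # Shift priority away from the device just served.
                    self.group_priority = 1 - g
                    self.member_priority[group] = 1 - member
                    return group, member
        return None   # nobody is requesting the bus
```

With all four devices requesting continuously, this model grants the bus in the order first CPU, first I/O, second CPU, second I/O, and then repeats, so no device can be locked out by another.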
The second problem involves the "memory busy" situation. Whenever the memory subsystem cannot process more requests, it sends a "memory busy" signal. It could happen, for instance, that a CPU accesses the bus and attempts to write to memory. Upon receiving a memory-busy signal, the CPU will abort the write. When memory is released, some other device will access the bus and perform a write, thus filling the write queue in memory. Once again, the first CPU re-arbitrates, accesses the bus, and tries to write. Once again, that CPU receives a memory busy signal. And so on.
The NMI arbitration scheme mentioned above solves this problem in which a device might get locked out of memory. As implemented, the arbitration scheme saves the priority state at the time before the memory-busy signal was asserted. The arbitration logic then restores that state so that the device that received the signal will get the bus when the memory-busy signal is deasserted.
Bus Bandwidth
For the processors on the interconnect, bus bandwidth involves two components: read bandwidth, and write bandwidth. The problem of inadequate read bandwidth is addressed by having a high hit-rate cache. The higher the hit rate, the fewer the requests to memory. The problem of inadequate write bandwidth can be treated in two ways. The first way is to have a write-back cache like the one on the VAX 8650 processor. Such a cache writes a block to memory only when the cache block is deallocated. This technique can significantly reduce the write bandwidth requirements.
In multiprocessor systems like the 8800, however, in which each processor has an internal cache, this technique becomes complicated. In these systems, a data item can exist not only in memory but also in all the caches. To maintain coherency, each write-back cache would have to notify the other cache when the first cache writes. This technique usually leads to a complex protocol and design implementation.
Another approach in a multiprocessor system, the one used in the 8800, is to implement write-through caches. In such an approach, all write references go directly to memory so that each cache on the bus can "see" all write activity. The caches can then be invalidated. Such an approach greatly simplifies the protocol for cache coherency but, as discussed earlier, generates a high degree of write traffic. The unique design of the write buffer helps to reduce this traffic, although not as much as a write-back cache would. In the 8800 processor, however, the write buffer reduces traffic enough so that the two VAX 8800 processors can write at their maximum bandwidths on the NMI bus.
Coherency in a Multiprocessor System
A multiprocessor system with internal caches presents a number of
interesting coherency issues when sharing data. Ideally, if one
processor writes to a location and the other processor reads that
location, the read will always get the data that was written. In
practice, achieving this condition is difficult. Several major
questions arise: Did the read happen before the write or after it?
What happens if both processors write to the same location at the same
time? Unless controlled, these situations can produce unpredictable
results.
If programs on the processors want to share data, they must use the
interlock instructions in the VAX architecture. Only after an
interlock instruction is processed will the memory location be
guaranteed to have the correct data. The general method is as follows.
Processes must decide to share a block of memory. One memory location
is called the software lock, and only one process at a time is allowed
to write to (or lock) that location. This is accessed with an
interlock instruction, for example, the branch on bit set and set
interlocked (BBSSI) or the add aligned word interlocked (ADAWI)
instructions.
Upon gaining the software lock, a given process can proceed to write
any location in the shared block. Read-write coherency will be assured
only if the other processes sharing that data observe the protocol of
obtaining the software lock before modifying the data structure.
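
The software-lock convention just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `InterlockedBit` is a hypothetical stand-in for a memory location accessed with BBSSI-style instructions (a `threading.Lock` plays the role of the hardware interlock), and `SharedBlock`, `increment`, and `run` are invented names, not part of the VAX architecture or VMS.

```python
import threading
import time


class InterlockedBit:
    """Hypothetical stand-in for a lock bit accessed with interlocked instructions."""

    def __init__(self):
        self._bit = 0
        self._interlock = threading.Lock()  # models the hardware memory interlock

    def test_and_set(self):
        """Atomically set the bit and return its previous value (BBSSI-like)."""
        with self._interlock:
            old = self._bit
            self._bit = 1
            return old

    def clear(self):
        """Atomically clear the bit to release the software lock."""
        with self._interlock:
            self._bit = 0


class SharedBlock:
    """Shared data that may be modified only while holding the software lock."""

    def __init__(self):
        self.lock = InterlockedBit()
        self.counter = 0

    def increment(self):
        while self.lock.test_and_set():  # previous value 1: someone else holds it
            time.sleep(0)                # yield and retry
        try:
            self.counter += 1            # protected update of the shared block
        finally:
            self.lock.clear()


def run(workers=3, iterations=200):
    """Have several threads update the block; every one obeys the protocol."""
    block = SharedBlock()

    def work():
        for _ in range(iterations):
            block.increment()

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return block.counter
```

The result is correct only because every writer takes the lock first; a thread that updated `counter` without calling `test_and_set` would break the guarantee, which is the point the text makes about all sharers observing the protocol.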
The VAX interlock instructions are implemented using interlock
microinstructions. These enable a processor to lock and unlock the
memory subsystem. Once locked, this subsystem excludes further
attempts to lock it until an unlock has occurred. Thus only one
processor or I/O system can lock the memory subsystem at any one time.
When each processor has an internal cache, there is one more mechanism
that keeps the two processors coherent. While one processor is
performing a write to memory and while the write command is on the NMI
bus, the other processor will examine its cache store to see if it
contains a copy of that data. If the data is there, it is marked
invalid. The next request for this data will then result in a cache
miss and a subsequent fetch to memory. This simple approach is
possible because the VAX 8800 caches are write-through. Although all
writes are seen on the bus, the write buffer packs together
consecutive writes within an octaword. Therefore, the number of
invalidation cycles performed by a processor will be reduced. When an
interlock write is performed, the contents of the write buffer are
sent to memory. Thus the interlock mechanism ensures that data
coherency will work under all conditions. Figure 6 illustrates the
events that achieve coherency in the 8800.
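
The interplay of write-through caches, bus snooping, and the write buffer can be made concrete with a toy model. The sketch below is not the 8800 implementation: the class names (`Bus`, `Cpu`), the dictionary-based cache, and the rule that the buffer drains only when a write falls in a different octaword or is interlocked are all simplifying assumptions made for illustration.

```python
OCTAWORD = 16  # bytes


class Bus:
    """Toy NMI: delivers each buffered write to memory and to every other cache."""

    def __init__(self):
        self.memory = {}
        self.cpus = []
        self.write_cycles = 0

    def write(self, source, pending):
        self.write_cycles += 1
        self.memory.update(pending)
        for cpu in self.cpus:
            if cpu is not source:
                cpu.snoop(pending)


class Cpu:
    def __init__(self, bus):
        self.bus = bus
        self.cache = {}   # address -> data for valid entries
        self.buffer = {}  # write buffer; all entries lie in one octaword
        bus.cpus.append(self)

    def read(self, addr):
        if addr not in self.cache:                 # miss: fetch from memory
            self.cache[addr] = self.bus.memory.get(addr, 0)
        return self.cache[addr]

    def write(self, addr, data, interlocked=False):
        # A write to a different octaword pushes out the buffered one first.
        if self.buffer and next(iter(self.buffer)) // OCTAWORD != addr // OCTAWORD:
            self.flush()
        self.buffer[addr] = data
        self.cache[addr] = data                    # assumption: local copy is updated
        if interlocked:                            # interlock write empties the buffer
            self.flush()

    def flush(self):
        if self.buffer:
            self.bus.write(self, dict(self.buffer))
            self.buffer.clear()

    def snoop(self, pending):
        for addr in pending:                       # mark matching entries invalid
            self.cache.pop(addr, None)
```

In this model two longword writes to the same octaword cost one bus cycle and one invalidation pass, and a reader on the other processor is guaranteed to see them only after the writer performs an interlocked write.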
Summary
The general concepts used in the design of the C Box are well known to
computer designers. Our goal was to achieve a simple yet
high-performance design that avoided unnecessarily complex solutions
that did not give comparable increases in performance. The choices
made have yielded a design that fully supports the multiprocessor
concept. The VAX 8800 system can translate addresses and access data
faster than any previous VAX processor.

Figure 6  Multiprocessor Coherency
(Diagram labels: left processor, right processor, cache, write
buffers, NMI, memory, software lock. Annotations: "Other processor
sees write on NMI and looks in cache for invalidation"; "Write
interlock forces write buffer contents to memory.")
Acknowledgments
All those who worked on the VAX 8800 system contributed to the
thinking that went into the C Box design. Special thanks go to Dave
Sager for keeping things going.
References
1. VAX Architecture Handbook (Maynard: Digital Equipment Corporation,
   Order No. EB-26115-46, 1986): 7-11 to 7-19.
2. A. Smith, "Cache Memories," Computing Surveys, vol. 14, no. 3
   (September 1982): 473-530.
3. S. Mishra, "The VAX 8800 Microarchitecture," Digital Technical
   Journal (February 1987, this issue): 20-33.
4. T. Fossum, J. McElroy, and W. English, "An Overview of the VAX 8600
   System," Digital Technical Journal (August 1985): 8-23.
5. S. Farnham, M. Harvey, and K. Morse, "VMS Multiprocessing on the
   VAX 8800 System," Digital Technical Journal (February 1987, this
   issue): 111-119.
Paul J. Natusch
David C. Senerchia
Eugene L. Yu
The Memory System in the VAX 8800 Family
The memory system in the VAX 8800 family can send data at 71MB per
second and receive it at 59MB per second. The 8800 and 8700 CPUs can
contain up to 128MB of memory, the 8550 and 8500 up to 80MB. Commands,
addresses, and data flow between the memory interconnect (NMI bus)
and the memory controller, array bus, and array modules. Read, write,
and masked-write commands are executed. The designs of the NMI bus
and write-through cache affected the memory system design. Although
ECL is used in the controller, TTL is used in the array bus. The array
modules of 4MB and 16MB contain 256K MOS dynamic RAM chips.
All members of the VAX 8800 family of processors (the 8800, 8700,
8550, and 8500) use the same type of memory system. Since the
VAX 8800 system is a multiprocessor, that memory system must connect
to both CPUs and both I/O adapters, called the NBIAs. The bus
connecting these devices is called the NMI bus, and each connection on
the NMI bus is called a nexus. These connections are illustrated in
Figure 1, which shows five nexuses: one for each CPU, one for each
NBIA, and one for the memory system. The memory system can deliver
71 megabytes (MB) per second of read bandwidth and 59MB per second of
write bandwidth.
Since the VAX architecture has a 32-bit format, all datapaths in the
memory system must also handle 32 bits. These datapaths are combined
by pipelined and parallel operations to produce the read and write
bandwidths. The most significant occurrence of parallel operations is
two-dimensional interleaving. The first dimension interleaves between
longwords (32 bits) of data on a single array module; the second
interleaves between octawords (4 longwords) on different array
modules. As many as three array modules can be active simultaneously
with either a read or a write. There are three cases:
•  Each module can do one read.
•  One module can do a read while the other two can do as many as
   four writes.
•  Two modules can each do a read while the third can do as many as
   four writes.

Figure 1  Memory Interconnect Structure

The memory system itself consists of three major parts, as depicted in
Figure 2:
•  A memory controller based on ECL technology
•  A high-speed TTL bus connecting that memory controller to a maximum
   of eight array modules
•  The array modules themselves
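
The two-dimensional interleaving described earlier can be pictured as an address decomposition. The following is a minimal sketch under a loud assumption: the article does not give the actual address-to-module mapping (module selection is programmable from the console), so the simple modulo scheme and the names `locate`, `module`, and `bank` are hypothetical and chosen only to show the idea.

```python
LONGWORD = 4    # bytes
OCTAWORD = 16   # bytes, i.e. four longwords


def locate(addr, modules=8):
    """Return (module, bank) for a byte address under the assumed interleave.

    First dimension:  consecutive longwords fall in different banks of one module.
    Second dimension: consecutive octawords fall on different array modules.
    """
    bank = (addr // LONGWORD) % 4
    module = (addr // OCTAWORD) % modules
    return module, bank
```

Under this assumed mapping the two octawords of an aligned hexword land on neighboring modules, so both halves can be accessed in parallel, while the four longwords of one octaword land in four different banks of the same module.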
The selection of the array modules can be programmed from the console
when the system is powered up. Thus the memory system can support a
variety of array module sizes and speeds without the need to modify
the hardware in the memory controller. Moreover, the memory controller
can address 512MB of physical memory, the limit of the VAX
architecture. The 8800 is the first VAX system to be able to address
this much physical memory.
Figure 2  Plan of Memory System
(Diagram labels: NMI; memory controller; command bus carrying input
command and clock; array modules, up to 8.)
Owing to the limits of the existing technology, however, the initial
machine was introduced with 32MB for the 8800 and 8700 systems, and
20MB for the 8500 and 8550 systems. The 32MB configuration consists of
eight 4MB modules with 256K MOS dynamic RAMs packaged in DIPs. To
increase the density of the machine without using a different
semiconductor technology, a 2MB daughter module was developed after
the initial announcement. This module uses double-sided surface-mount
technology and plastic leadless chip carriers. Eight of these daughter
modules are mounted on a mother module to produce a 16MB array module.
This new module has increased the machine's memory to 128MB for the
8800 and 8700 systems, and to 80MB for the 8550 and 8500 systems.
Memory System Architecture
As shown in Figures 1 and 2, the memory controller communicates with
the CPUs and the NBIAs over the memory interconnect, called the NMI
bus. Commands, addresses, and data requests are all first received by
the NMI interface and then passed to other sections of the memory
controller. Addresses and data are stored in custom multiport RAMs,
where eight locations are reserved for addresses and eight for data.
The NMI interface encodes command information, passing it to the
command-control portion of the memory controller.
Since the memory controller communicates with the NMI bus and the
array bus, the NMI protocol has to be changed to that of the array
bus. Reads and writes of data fields with various sizes are received
by the NMI interface. The NMI bus supports a very robust set of
commands. Reads and interlocked reads are supported for longwords
(4 bytes), octawords (4 longwords), and hexwords (2 octawords). Masked
writes and masked-write unlocks are supported for longwords, quadwords
(8 bytes), and octawords. Writes are supported for longwords and
octawords.
The read-interlocked and masked-write unlock commands are used to
implement VAX instructions in which mutual exclusion is required. For
example, the VAX instructions ADAWI, BBCCI, BBSSI, INSQHI, INSQTI,
INSQUE, REMQHI, and REMQTI all need these commands. Since an
interlocked instruction locks the entire memory system, the interlock
bit must reside in the memory controller. This bit restricts the
execution of subsequent interlock commands until the lock has been
released by a masked-write unlock instruction.
After receiving a memory request from a nexus, the memory controller
must transfer that request to the appropriate array module. This
transfer is accomplished using the array bus. This bus consists of
•  A unidirectional set of command and address lines from the memory
   controller to the array modules
•  Another unidirectional set of data lines from the memory controller
   to the array modules
•  A set of data lines (capable of assuming three states) that can be
   driven by any one of the array modules and received by the memory
   controller
•  Various status and control lines that communicate in both
   directions
The array bus has a minimal repertoire of commands, consisting of
longword reads, octaword reads, and longword writes, but not hexword
reads. Since the NMI supports hexword reads, the memory controller
must convert them into two octaword reads and then send them to the
array modules. Thus the two octawords of a hexword read can reside on
different array modules. That fact increases the memory bandwidth
because parallel accesses can be executed. The array bus supports only
longword writes; therefore, octaword writes must also be converted. As
mentioned earlier, the array bus has one set of lines for commands and
addresses and another for data. Therefore, an octaword write, which
takes five cycles to transfer on the NMI (one for the command, four
for the data), can be transmitted in five cycles on the array bus to
an array module. Figure 3 shows the corresponding actions during each
cycle on the NMI and on the array bus.
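
The protocol conversion described above amounts to splitting the larger NMI commands into commands the array bus understands. A minimal sketch follows; the command names (`"read_hexword"` and so on) and the function `to_array_commands` are invented for illustration, and the sketch ignores wrapped ordering and cycle timing.

```python
LONGWORD, OCTAWORD, HEXWORD = 4, 16, 32  # sizes in bytes


def to_array_commands(kind, addr):
    """Translate one NMI command into a list of (array_command, address) pairs."""
    if kind == "read_hexword":
        base = addr - addr % HEXWORD
        # Two octaword reads, which may go to different array modules.
        return [("read_octaword", base), ("read_octaword", base + OCTAWORD)]
    if kind == "write_octaword":
        base = addr - addr % OCTAWORD
        # Four longword writes, one per longword of the octaword.
        return [("write_longword", base + i * LONGWORD) for i in range(4)]
    if kind in ("read_longword", "read_octaword", "write_longword"):
        return [(kind, addr)]  # already supported by the array bus
    raise ValueError(f"unsupported command: {kind}")
```

For example, `to_array_commands("write_octaword", 0x40)` yields four longword writes at 0x40, 0x44, 0x48, and 0x4C; because the array bus carries commands and data on separate lines, the real hardware can overlap these rather than pay for them one after another.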
In addition to commands, the memory system must also execute
maintenance tasks, including memory refresh, error reporting, and
battery backup.
Since physical memory is implemented with MOS dynamic RAMs, every
array row must be refreshed once every 4 milliseconds. This function
can be done by refreshing one row every 14 microseconds. To facilitate
this activity, the memory controller sends signals to each array
module from a 14-microsecond oscillator. Upon receiving a refresh
signal, an array module will handle the refresh arbitration and
execute the operation.
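
The refresh figures can be checked with a little arithmetic. In the sketch below the 4-millisecond interval and the 14-microsecond tick come from the text, while the row count of 256 is an assumption (a common refresh specification for 256K dynamic RAMs of the period), not a number stated in the article.

```python
REFRESH_INTERVAL_US = 4000  # every row must be refreshed within 4 ms (from the text)
TICK_US = 14                # one row is refreshed per oscillator tick (from the text)
ASSUMED_ROWS = 256          # assumption: rows per 256K DRAM, not given in the article


def max_rows_per_interval():
    """Largest number of rows that one-row-per-tick refresh can cover in time."""
    return REFRESH_INTERVAL_US // TICK_US


def full_pass_us(rows=ASSUMED_ROWS):
    """Time in microseconds to refresh every row once."""
    return rows * TICK_US
```

With these numbers a full pass over 256 rows takes 3584 microseconds, inside the 4000-microsecond limit, and the 14-microsecond tick could cover at most 285 rows in time.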
Occasionally, a bit will be lost due to either alpha particles or a
device failure. In that case the memory controller must handle those
errors and other types in a graceful manner. To do that, the memory
system uses a 7-bit modified Hamming code to generate the ECC, which
allows all single-bit errors to be corrected and all double-bit errors
to be detected. After correcting each error the memory system logs the
error's physical page address and the bit. The memory system then
interrupts the CPU to call an error service routine, which logs in a
VMS file the necessary information to isolate the failure. The memory
system can also interrupt the CPU to handle internal parity errors and
interlocked time-outs. An interlocked time-out happens when a nexus
executes a read interlock but never issues a masked-write unlock. The
system software can enable or disable these interrupts.
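
To show how seven check bits can protect a 32-bit longword, here is a sketch of a generic single-error-correcting, double-error-detecting code: a textbook Hamming code with six check bits plus one overall parity bit. It is not the modified Hamming code used in the 8800, whose check-bit matrix the article does not give; the bit layout and the names `encode` and `decode` are assumptions.

```python
PARITY_POSITIONS = (1, 2, 4, 8, 16, 32)  # Hamming check bits, 1-based positions
CODE_LEN = 38                            # 32 data bits + 6 Hamming check bits
DATA_POSITIONS = [p for p in range(1, CODE_LEN + 1) if p not in PARITY_POSITIONS]


def encode(word):
    """Return a 39-entry bit list: index 0 is overall parity, 1..38 the Hamming code."""
    code = [0] * (CODE_LEN + 1)
    for i, pos in enumerate(DATA_POSITIONS):
        code[pos] = (word >> i) & 1
    for p in PARITY_POSITIONS:
        parity = 0
        for pos in range(1, CODE_LEN + 1):
            if pos & p and pos != p:
                parity ^= code[pos]
        code[p] = parity
    code[0] = sum(code[1:]) & 1          # seventh check bit: makes total parity even
    return code


def decode(code):
    """Return (word, status); word is None when the error cannot be corrected."""
    code = list(code)
    syndrome = 0
    for pos in range(1, CODE_LEN + 1):
        if code[pos]:
            syndrome ^= pos              # XOR of the positions of all set bits
    overall = sum(code) & 1
    if overall:                          # odd number of flipped bits
        if syndrome > CODE_LEN:
            return None, "uncorrectable"
        code[syndrome] ^= 1              # syndrome 0 means the parity bit itself
        status = "corrected"
    elif syndrome:                       # even parity but nonzero syndrome
        return None, "double"
    else:
        status = "ok"
    word = 0
    for i, pos in enumerate(DATA_POSITIONS):
        word |= code[pos] << i
    return word, status
```

Any single flipped bit among the 39 is located by the syndrome and repaired, while any two flipped bits leave overall parity even with a nonzero syndrome and are reported without an attempt at correction.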
Battery backup, standard equipment on both the 8800 and 8700 systems,
can power the refresh operation when the system is down. That power
allows the memory system to continue to refresh the RAMs so that data
will not be lost. Note that the entire system is not backed up;
therefore, all components must be in quiescent states before the
memory system enters battery mode. Upon sensing that power is eroding,
the 8800 will write all its data to the memory system. The memory
controller will then complete all commands and send signals to the
array modules informing them to enter battery mode. In this mode only
five MSI chips on the memory controller and approximately half the
control logic on the array module will be active.

Figure 3  Cycles on NMI Bus and Array Bus
(Timing chart over cycles 1 through 7. NMI: one command-or-address
cycle followed by four data cycles. Array bus: four command-or-address
cycles on the command/address line and four data cycles on the data
line.)

Figure 4  Datapaths in Memory Controller and Array Modules
(Diagram labels: NMI, memory controller, multiport RAM, error
correction logic, ECC generation logic, bus enable, array module.)
Command Execution
The execution of any command received by the memory system is a joint
effort between the memory controller and the array modules. Figure 4
depicts the datapath in each memory component. After a nexus places a
command on the NMI bus, the interface in the memory controller
ascertains if the command is a valid memory reference and, if so,
decodes it. The interface then places the command in a queue of
commands waiting to be executed.
Since one array module can execute multiple write commands
simultaneously, and since multiple array modules can also execute
commands, the memory controller must maintain the status of the array
modules. The status control logic to monitor activity must "remember"
which portions of which arrays are "busy." This status control logic
can best be described by showing how the three basic operations,
writes, reads, and masked writes, are executed.
Write Commands
For a write command, the control portion of the memory controller
performs only three actions: it determines the capability of the array
module to accept the command, it sends the command, and it waits for
the array module to signal its readiness to receive another command.
The write datapath is that portion of the logic responsible for the
flow of data from the NMI bus to the array modules. This path
comprises both electrical interconnects (buses and cables) and a
considerable amount of logic. The major storage element for the
datapath is a 9-bit by 32-location custom multiport RAM (MPR) with two
ports for reads and two for writes. Data received from the NMI bus is
placed in the next available location of the MPR. Upon determining
that the required array module is available, the control logic sends
the data from the MPR to that array module over the array bus. Each
array module holds the data until it is strobed into the dynamic RAMs
(DRAMs). The array module can load four longwords of data with their
associated ECC bits on four consecutive cycles.
Some writes are called masked because there is a 4-bit byte mask
associated with each data word. The byte mask informs the memory
system as to which bytes are to be written. The memory system executes
this command by first doing a read and correcting any single-bit
errors that may exist. It then merges the memory data with the data
received from the NMI bus, and finally does a write command. This
sequence easily allows the implementation of longword and octaword
masked writes. Masked writes for quadwords (8 bytes) are executed by
performing an octaword masked write in which the data of two of the
longwords remains unchanged.
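
The read-merge-write sequence for masked writes can be sketched as follows. The function names, the dictionary standing in for an array module, and the convention that mask bit i selects byte i of the longword are all assumptions for illustration; the article does not state the bit ordering of the byte mask.

```python
def merge_longword(old, new, mask):
    """Merge bytes of `new` into `old`; mask bit i set means byte i comes from `new`."""
    result = old
    for byte in range(4):
        if mask & (1 << byte):
            shift = 8 * byte
            result = (result & ~(0xFF << shift)) | (new & (0xFF << shift))
    return result & 0xFFFFFFFF


def masked_write_octaword(memory, base, data, masks):
    """Read each longword, merge under its mask, and write it back."""
    for i in range(4):
        addr = base + 4 * i
        memory[addr] = merge_longword(memory.get(addr, 0), data[i], masks[i])


def masked_write_quadword(memory, addr, data, masks):
    """Quadword masked write done as an octaword masked write (8-byte aligned addr)."""
    base = addr - addr % 16
    first = (addr % 16) // 4               # 0 or 2: which half of the octaword
    full_data, full_masks = [0] * 4, [0] * 4
    full_data[first:first + 2] = data
    full_masks[first:first + 2] = masks    # the other two longwords keep mask 0
    masked_write_octaword(memory, base, full_data, full_masks)
```

For instance, `merge_longword(0x11223344, 0xAABBCCDD, 0b0101)` replaces bytes 0 and 2 and yields `0x11BB33DD`; in the quadword case the two untouched longwords simply carry an all-zero mask, so their data is rewritten unchanged.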
Read Commands
For read commands, the memory controller performs four actions: it
determines if the selected array module is ready to accept the read,
it sends the command, it waits for a data-ready response, and it
transfers the data from the array module. Embedded in the command
field of the read are address bits that select the longword of the
octaword that is required first. This action allows wrapped reads to
be implemented. (Wrapped reads are described later in the section
"Impact of the Cache.")
The read datapath originates at the DRAM, which sends the requested
data. As in the case of write commands, each array module stores an
octaword of read data. Once the data has been loaded into the latches,
the array module signals to the memory controller that the data is
ready. As mentioned earlier, the read datapath between the array
module and the memory controller is tristatable. Therefore, the memory
controller must ensure that only one array module at a time drives
this datapath. Once the data has been requested by the memory
controller, the array module must send the longwords sequentially,
beginning with the starting address that was sent with the command.
This action allows the memory controller to request any one of the
four longwords as the first to be read. The array module portion of
the read datapath can transfer one longword of data during every
cycle.
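
The order in which an array module returns the longwords of an octaword, starting from the requested one, can be written down directly. The helper below is a small illustration; the name `wrapped_order` is invented, and the sketch covers only the octaword case described in this section.

```python
def wrapped_order(first, count=4):
    """Order in which longwords 0..count-1 are sent, beginning with `first`."""
    return [(first + i) % count for i in range(count)]
```

A request whose command field selects longword 2 therefore receives longwords 2, 3, 0, and 1 on consecutive cycles, so the longword the requester needs first arrives without waiting for the rest of the block.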
The error-correction logic in the memory controller receives each
longword of data plus the seven ECC bits. This logic detects single-
and double-bit errors, but only single-bit errors can be corrected. A
significant feature of this process is that error detection and
correction is performed as the read data is pipelined through the
memory controller. Thus no additional cycles are needed to correct
read data.
Masked-write Commands
The execution of a masked write involves both a read and a write
sequence. The memory controller executes a masked-write command by
first issuing a read to the selected array module. Assuming that there
were no memory errors, the data returned is sent to the MPR, where the
bytes are merged with those sent to the memory controller over the NMI
bus. The memory controller must ensure that no commands to the same
array come between the read and write portions of a masked write.
After all the bytes have been merged into the data buffer, the memory
controller will write the data to the array module. The array module
then generates new ECC data, adds it to the other data, and strobes
the composite data into the DRAMs.
If a single-bit error is detected, the process is quite similar to the
one with no errors, except that the data must be corrected. Since
corrected data and NMI traffic both share the same datapath on the
memory controller, the NMI interface must be free to correct errors
found during masked writes. This freedom is ensured by asserting a
signal that stops all activity on the NMI bus. Once activity has
stopped, the data can be routed through the NMI interface, corrected,
and then merged with the NMI data in the data buffer. The process then
continues as it would have if there were no errors.
If a double-bit error is detected, the process is similar to the case
in which no error occurred, except that the write is prevented from
happening. When the array location is read the second time, the
double-bit error will still be present, thus alerting the system that
the data is unusable.
Memory Address Path
The memory controller continuously latches all addresses from the NMI
bus. Once an address is latched, the memory controller must verify it
as a valid memory address. That verification is done by comparing the
address to valid addresses of both the control status registers (CSRs)
and physical memory.
The CSR addresses are hardwired into the NMI interface logic;
therefore, only a simple compare of the addresses is required. The
compare for a valid memory address requires a reference to a "decode"
RAM. This RAM is loaded by console software when the system is powered
up and is used to configure memory. Loading the RAM from software
allows the memory controller to support several different sizes of
array modules without modifying any hardware.
Once the address has been verified as being valid, it is placed in one
of eight storage locations allocated to address buffering in the MPR.
The address remains in that buffer until its command is sent to an
array module.
Even though eight locations are allocated to address buffering, only
seven of them can be used for temporary storage. One location is
reserved for the error's page address, a pointer to a physical page of
memory containing an error. Since the location of the error
page-address buffer is not fixed, the control logic for the
address-buffer control must look ahead and not allow a new address to
overwrite that error page address.
The control of the address buffer is further complicated by masked
writes and error logging. Since a masked write is implemented as a
read followed by a write, the address in the buffer cannot be
overwritten until the write has completed. A similar situation exists
for error logging on read transactions. Since an error is not detected
until the read has completed, the address cannot be overwritten until
the data has been checked.
Design Requirements of the VAX 8800 System
Impact of the NMI Bus
As stated earlier, the VAX 8800 memory system interfaces with the CPUs
and I/O systems through a synchronous bus called the NMI bus. This bus
is highly efficient and operates in a pended fashion similar to the
synchronous backplane interconnect (SBI bus) in the VAX-11/780
processor. The NMI bus allows several transfers to be in progress
simultaneously.
There are four nexuses in the 8800 system that can require memory: the
two CPUs, and the two NBIAs. Each nexus is allowed to have two
commands outstanding at any time. The protocol supports this
arrangement by allocating two codes in a 4-bit ID field to each nexus.
The CPUs use one of their references for program data, called the
d-stream, and the other for instructions, called the i-stream. The
CPUs always request a hexword of data; the NBIAs may request either
longwords or octawords. Thus there can be as many as eight
simultaneous requesters of memory data. These simultaneous events
require that the memory system buffer several commands while
executing. In the 8800 implementation, the memory system can access
three array modules in parallel and store two commands.
Moreover, since the memory system can accept multiple read commands,
it must store the identification of the requester and the length of
the transaction. The NMI interface does the actual storing and returns
the identification with the correct data. This action is possible
because all commands are processed in sequence; therefore, the read
returned first is the one stored the longest. However, hexword reads
are returned to the NMI interface as two separate octaword reads;
therefore, that interface must ensure that both octawords have been
returned before discarding the identification.
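
Because reads complete in the order they were accepted, the bookkeeping described above behaves like a first-in, first-out queue of requester IDs. The sketch below illustrates that idea only; the class `ReadTracker`, its method names, and the string length codes are hypothetical and do not describe the actual NMI interface logic.

```python
from collections import deque


class ReadTracker:
    """FIFO of outstanding reads: each entry is [requester_id, parts_remaining]."""

    def __init__(self):
        self._pending = deque()

    def accept(self, requester_id, length):
        # A hexword comes back from the arrays as two separate octaword reads.
        parts = 2 if length == "hexword" else 1
        self._pending.append([requester_id, parts])

    def data_returned(self):
        """Return the ID to tag the returning data with; retire the entry when done."""
        entry = self._pending[0]          # oldest stored read is the one returning
        entry[1] -= 1
        if entry[1] == 0:
            self._pending.popleft()       # discard the ID only after the last part
        return entry[0]

    def outstanding(self):
        return len(self._pending)
```

A hexword read from one requester followed by a longword read from another yields the first ID twice and then the second ID once, which is the behavior the text requires of the interface.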
To prevent a deadlock condition, the memory system is given the
highest priority during arbitration. This priority guarantees that the
memory system will be able to return data to a requester. When full,
the memory system notifies any potential requesters that it cannot
process any more commands and to try again later, thus preventing the
memory system from overfilling.
Impact of the Cache
The design of the cache affected the design of the memory system. The
write-through design of the cache guarantees there will be a large
number of longword writes directed at memory.1 A write buffer was
installed to bundle a series of longword writes into octaword writes;
however, the write buffer is only effective if multiple longwords are
written in the same octaword.
forma n c e . The extra write ba n d w i d t h for t h i s
memory syste m , however , re q u i red m o re logic
than w hat wo u l d have been req u i red to i m ple
ment extra read b a n d w i d t h . The added co m
plex i ty was needed r o fac i l i tate i n terleaving o n
longword boundaries for write operations.
When the 8800 project was first initiated, the goal of the memory
system was to maximize read bandwidth, thus producing a relatively
simple array-module design. In that design, any operation, regardless
of its size, kept an entire array module busy until the operation
completed. The control logic on the array module was simple and
required a reasonable amount of board space and power. When the design
changed to the write-through concept, however, higher write bandwidth
was required. Therefore, the control logic in each array module had to
be replicated for each bank (longword) of memory to allow independent
write operations. This replication permitted four longwords to be
written on four consecutive cycles to the same array module.
This increase in design complexity was not limited to the array
module. In the initial design, when maximum read bandwidth was
critical, the memory control logic was relatively simple. It had only
to track the state of an array module as being busy or not. However,
with the interleaving capability required for the increased write
bandwidth, the memory control logic now has to track simultaneously
the status of as many as eight write operations in progress on two
array modules.
Although maximizing the longword write bandwidth was important,
minimizing the read latency to the first longword required was
critical. Wrapped reads were implemented to reduce this latency. A
wrapped read is a hexword or octaword command that requests a specific
longword to be returned first, with other longwords in that block to
follow in "wrapped" fashion.
Other Design Trade-offs and Options
As in all design processes, we considered many trade-offs and options
before committing to a particular design architecture. One area with
several alternatives was the interconnect between the memory
controller and the array modules. The array modules and the controller
reside in physically separate backplanes interconnected by a cable. We
had to decide whether to make this interconnect with ECL or TTL.
The overall project goal was to make the 8800 an all-ECL machine.
Therefore, our first choice for this interconnect was ECL, which
provides enhanced signal integrity, reduced skews, and overall speed
advantages over TTL. As the system and memory design progressed,
however, some real problems arose that altered our opinion. The first
problem became apparent as the array-module design coalesced enough to
allow some accurate power estimates to be made. We found that, with an
ECL bus, the array module would require -5.2 V power in excess of its
allocation. The next problem surfaced in response to an architectural
requirement that the memory system function with fewer than eight
array modules and, preferably, without load cards. This requirement
made it difficult to implement a termination scheme for an ECL
interconnect.
With these problems in mind, we investigated a TTL interconnect, which clearly offered some design challenges, the least of which were speed and skew. Using the SPICE simulator, we constructed an accurate model to verify that a TTL electrical interconnect could indeed meet our signal integrity, speed, and skew requirements.² While the simulation results showed that a TTL interconnect could work, the associated skews certainly increased the complexity of the memory design. While alleviating the problems of limited -5.2 V power on the array module and the termination of varied loading, this TTL scheme required ECL-to-TTL translators in the memory controller to drive the array bus. We finally decided to accept the added complexity and use TTL for the interconnect. The sole exception was the clocks, which were differential ECL, received and translated on the array module.
There were logical trade-offs as well as electrical ones. The original specification for the NMI did not support quadword masked writes. They were added after the implementation of the memory system had progressed considerably. Since the array bus supported only longword and octaword reads, there were three options to support this change:
• The first was to change the array bus protocol, the command generator on the memory controller, and the array module.
• The second was to execute the command by performing two longword masked writes. This option would take almost twice as long as a quadword masked write if implemented like the first option, yet still require changes to the command generator in the memory controller.
• The third was to execute an octaword masked write in which the data of two of the longwords remains unchanged.
Since the design was well advanced, we chose the last method to ease the problems of implementation; this decision actually has little impact on system performance. The logic to accomplish this addition already existed on the array module. Only small changes were required to the command generator of the memory controller and the datapath control. In practice, the frequency of quadword masked writes is extremely low since they are executed only by the NBIAs.
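The chosen option can be pictured as widening the byte mask: the quadword's mask occupies half of an octaword mask, and the other half is left disabled so those two longwords keep their old data. The sketch below is illustrative only. The one-bit-per-byte mask encoding and both function names are assumptions, not the NMI or array bus definitions.

```python
def octaword_mask_for_quadword(quad_mask, quad_index):
    """Widen an 8-bit quadword byte mask to a 16-bit octaword byte mask.

    quad_index selects the lower (0) or upper (1) quadword of the octaword.
    Bytes of the other quadword stay disabled, so they remain unchanged.
    """
    if quad_index not in (0, 1):
        raise ValueError("quad_index must be 0 or 1")
    return (quad_mask & 0xFF) << (8 * quad_index)


def masked_write(old, new, mask):
    """Return the 16-byte octaword after a masked write.

    Bit i of mask enables the write of byte i (hypothetical encoding).
    """
    if len(old) != 16 or len(new) != 16:
        raise ValueError("octawords are 16 bytes long")
    return bytes(n if (mask >> i) & 1 else o
                 for i, (o, n) in enumerate(zip(old, new)))


if __name__ == "__main__":
    old = bytes(16)
    new = bytes([0xFF] * 16)
    mask = octaword_mask_for_quadword(0x0F, 1)  # low four bytes of upper quadword
    print(masked_write(old, new, mask).hex())
```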
Technology Description
A number of different module and component technologies were used for the memory controller, backplane, and two array modules.
Memory Controller
The memory controller is a 9-layer, controlled-impedance, extended hex module (15 inches by 11 inches). The lay-up consists of 6 routing layers, 2 power layers (-5.2 V and -2 V), and a ground plane. Since there is a minimal amount of TTL, both the +5 V power and the +5 V battery are run on the surface with 50-mil etch. With the mixed technology on the module, we took special care to keep the TTL signals properly spaced from the ECL signals to avoid signal integrity problems.

The logic on this module is implemented using nine unique macrocell-array designs from Motorola, Inc., and one custom ECL multiported RAM. There are 16 custom and semicustom devices on the module. It also contains some 10KH MSI logic, some ECL-to-TTL converters, and some CMOS logic used for operating with battery backup.
Array Module Backplane
The array module backplane in the VAX 8800 and 8700 CPUs is a 12-layer, 8-slot pressed-pin backplane. The one in the VAX 8550 and 8500 CPUs is a 5-slot backplane. Since a TTL bus was chosen to communicate between the memory controller and the array modules, a good termination strategy had to be developed. Using the SPICE simulator, we evolved the termination strategies shown in Figure 5.
[Figure 5: Termination Strategies in Memory Controller and Array Modules. The schematic shows the memory controller (ECL-to-TTL and TTL-to-ECL translators) connected to the array modules (F374 registers) by the NAB command/address-write data bus and the NAB read data bus. Legend: DO - data out; DI - data in; CS - chip select; EN - enable; CLK - clock; HLD - hold (clock).]
Four Megabyte Array Module
The 4 MB array module was designed using an 8-layer, controlled-impedance, printed circuit board. The lay-up consists of 4 routing layers, 2 power layers, and 2 ground layers. To support battery backup, the module has separate power planes for +5 V power and the +5 V battery. Since only a limited amount of -5.2 V and -2 V power is needed, these voltages share space on the other power planes. To eliminate discontinuities that could cause unwanted reflections, we ensured that signals did not cross the power-plane splits by surrounding the power planes with solid ground planes.

Approximately half of the logic technology on the array module consists of MOS dynamic RAMs; the other half is FAST MSI logic. The clock system is implemented in ECL to minimize the skew.

Sixteen Megabyte Array Module
A 16 MB array module was developed to increase the available memory to 128 MB for the 8800 and 8700 systems and 80 MB for the 8550 and 8500 systems. This array module consists of an 8-layer mother board (similar to the 4 MB module) and eight 2 MB surface-mounted daughter boards. The 16 MB array module is pictured in Figure 6.

[Figure 6: Sixteen Megabyte Array Module]

Summary
The VAX 8800 memory system was designed to provide 71 MB per second of read bandwidth and 59 MB per second of write bandwidth to the multiprocessor system. The system architecture, processor performance needs, and high I/O activity combined to make a high-performance memory a requirement.

Since the 8800 contains ECL components, the memory system has to provide a high-speed path between the ECL logic in the CPUs and the high-density dynamic RAMs used for main storage. Although the memory system does not play a direct role in the execution of a VAX instruction, its performance has to match closely that of the multiprocessor system. If the memory system were underdesigned, the processors would stall frequently, thus reducing their usable performance. If the memory system were overdesigned, it would contain extra complexity, with the attendant extra cost, that could not be used by the system. Thus the memory strategy played an important role in the price/performance trade-offs that had to be made.
Acknowledgments
Although done by a small group of engineers, the design of the memory system was greatly influenced by the efforts of many people from the Electronic Storage Development Group and the Advanced VAX Engineering Group. We would especially like to acknowledge the creativity, leadership, and energy level of the late John Henry, Jr.
References
1. J. Fu, J. Keller, and K. Haduch, "Aspects of the VAX 8800 C Box Design," Digital Technical Journal (February 1987, this issue): 41-51.
2. SPICE was developed by Lawrence Nagel and Ellis Cohen of the Department of Electrical Engineering and Computer Science, University of California, Berkeley.
John H.P. Zurawski
Kathleen L. Pratt
Tracey L. Jones
Floating Point in the VAX 8800 Family
The processors in the VAX 8800 family were designed with particular emphasis on cost-effectiveness. These CPUs do not contain separate floating point accelerators. Their performance is not compromised, however, especially for the double-precision instructions. High performance is achieved, in part, by a custom ECL multiplier and divider unit and by specific hardware for exponent manipulation and normalization. The main advantages of this integrated approach are less hardware to replicate and a tightly coupled interface to each CPU, thus less time is wasted fetching the operands. Microcode branch problems are minimized by using a prediction strategy and extensive hardware assistance.
Unlike other VAX families, the processors in the VAX 8800 family do not contain separate floating point accelerators (FPAs). Instead, their FPA is integrated into each processor's main data path. Therefore, no distinction is made between instructions that are executed in the FPA and those that are not: the hardware is available to be used for all functions. For example, the extended arithmetic logic unit (XALU) is also used as a counter for the move character instruction (MOVC). This usage differs from that in the VAX 8600 and VAX-11/780 systems, where the XALU is used only for floating point instructions. Furthermore, all the floating point instructions, from the most complicated (POLY and EMOD) to the simplest (MOVF), have access to the FPA hardware.
There are a number of advantages to this approach. First, logic is not duplicated; only one arithmetic logic unit (ALU) and one shifter unit are shared between the floating point and the normal arithmetic. Second, the design is tightly integrated with the rest of the computer; there is no overhead involved in starting the floating point computation.
Clearly, since all other VAX families use FPAs, there are also disadvantages with our approach. Shared logic is more complex than specialized logic. Performance may also suffer since the design cannot be optimized toward one class of problem. Those disadvantages can be overcome, however, as we shall relate in this paper. The problem of optimization was ameliorated by providing dedicated hardware for the main operations of multiplication and addition. A custom multiplier and divider chip is provided together with exponent manipulation logic and a shifter unit optimized for floating point. These logic elements handle those floating point operations that take the longest times to execute.

The floating point logic resides in the execution unit, the E Box, of the VAX 8800 CPU. That logic is controlled by microcode in the instruction unit, the I Box.¹
VAX Formats and Instructions
The VAX architecture supports four floating point formats: F, D, G, and H. These formats are discussed at length in references 2 and 3. The F format is 32 bits wide, the D and G formats are both 64 bits wide, and the H format is 128 bits wide. Although the D and G formats have the same width, the exponent field is larger in the G format, and its fractional field is commensurately smaller. This format allows a larger range but with slightly lower precision. The fractions are always normalized and the leading bit, the hidden bit, is not stored.
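The trade-off between exponent and fraction width can be tabulated in a few lines. The total widths below come from the text; the per-field widths are taken from the VAX architecture definition rather than from this paper, and the names `FORMATS` and `precision_bits` are introduced here only for illustration.

```python
# (total bits, exponent bits, stored fraction bits) for each VAX format.
# Field widths are from the VAX architecture definition, not from this paper.
FORMATS = {
    "F": (32, 8, 23),
    "D": (64, 8, 55),
    "G": (64, 11, 52),
    "H": (128, 15, 112),
}


def precision_bits(name):
    """Fraction precision including the hidden (unstored) leading bit."""
    _, _, fraction = FORMATS[name]
    return fraction + 1


if __name__ == "__main__":
    for name, (total, exponent, fraction) in FORMATS.items():
        # One sign bit plus exponent plus stored fraction fills the format.
        assert 1 + exponent + fraction == total
        print(name, "exponent bits:", exponent,
              "precision bits:", precision_bits(name))
```

With these widths the D format carries 56 bits of precision and the G format 53, which is the "slightly lower precision" exchanged for a wider exponent.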
E Box Operation
Physically, floating point operations are performed on three modules: two slice modules and a shifter module. The slice modules contain the cache, the main ALU, and a register file. The shifter module contains the custom multiplier, the shifter unit, the exponent manipulation logic (the two ALUs), and the priority encoder. Figure 1 shows this partitioning. To a large extent, the shifter module strongly resembles an FPA but without the ALU and register file.
The source operands are fetched from either the 64 kilobyte (KB) cache or a general-purpose register (GPR). The operands are sent on the A and B ports to the ALU on the slice modules and to the shifter module. All the components on the shifter module are driven in parallel by the A and B ports.
From Figure 1 it is clear that the datapath is highly parallel; the shifter, XALU, multiplier, and ALU can all operate simultaneously. This parallelism is used extensively to gain performance and to save cost. For example, in multiplication operations, the XALU determines the exponent of the result, the multiplier multiplies, and the shifter absorbs the low-order bytes of the product that are discarded each cycle by the multiplier.
The main problem with designing an integrated FPA is that the VAX formats for integer and floating point numbers must all be handled by the same shared units. Figure 2 shows the different bit orderings for two VAX formats, the F floating point and the integer. In the integer format, the bit ordering is from right to left. In the F format, the mantissa begins at bit 16 and increases in significance to bit 31, then continues from bits 0 through 6. The remaining bit positions are used to hold the exponent and the sign.
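The F format bit ordering can be made concrete with a small software decoder. This is a sketch for illustration, not a model of the 8800 hardware. The sign position (bit 15), the exponent position (bits 14 through 7), and the excess-128 exponent bias are taken from the VAX architecture definition and are not stated in this paper.

```python
def decode_f_floating(longword):
    """Decode a 32-bit VAX F format longword into a Python float."""
    sign = (longword >> 15) & 1
    exponent = (longword >> 7) & 0xFF          # excess-128 exponent
    frac_hi = longword & 0x7F                  # bits 6:0, most significant part
    frac_lo = (longword >> 16) & 0xFFFF        # bits 31:16, least significant part
    if exponent == 0:
        if sign:
            raise ValueError("reserved operand")
        return 0.0
    # Reassemble the 24-bit fraction 0.1fff... with the hidden bit restored.
    mantissa = (1 << 23) | (frac_hi << 16) | frac_lo
    value = mantissa / float(1 << 24) * 2.0 ** (exponent - 128)
    return -value if sign else value


if __name__ == "__main__":
    for word in (0x00004080, 0x0000C080, 0x000040C0):
        print(hex(word), decode_f_floating(word))
```

The two shifts that gather `frac_hi` and `frac_lo` are the software counterpart of the word reordering that the shared ALU and shifter have to perform in hardware.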
This requirement for shared handling complicates the carry path of the ALU. The carries out of the 16-bit word boundaries have to be switched into the appropriate places, as shown in Figure 3. The problem with shifting is similar to the carry problem, except that now the carry path of Figure 3 represents the flow of the shifted bits.
[Figure 1: Block Diagram of the E Box. The diagram shows the shifter module and the slice modules, with the A port, the B port, the bypass bus <31:0>, the shift count bus, the cache data, and the register file.]

[Figure 2: Two VAX Formats. F format: bits 31 through 16 hold the least significant part of the mantissa (least significant bit at bit 16), bit 15 is the sign bit (S), the exponent follows down to bit 7, and bits 6 through 0 hold the most significant part of the mantissa. Integer format: bit 31 is the most significant bit and bit 0 is the least significant bit.]
The ALU and the shifter unit are both designed to handle all integer and floating point formats. The multiplier expects operands to come only in a floating point format. Therefore, for integer multiplications, the data must first be converted into a pseudo-floating point format by swapping the places of 16-bit words within the integer format. This operation is performed by the shifter unit.
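The word swap itself is a simple operation, sketched below. Only the exchange of the two 16-bit words of a longword is shown. Any further details of the pseudo-floating point format are not given in the text, and `swap_words` is a hypothetical name.

```python
def swap_words(longword):
    """Exchange the upper and lower 16-bit words of a 32-bit integer."""
    longword &= 0xFFFFFFFF
    return ((longword & 0xFFFF) << 16) | (longword >> 16)


if __name__ == "__main__":
    print(hex(swap_words(0x12345678)))
```

Applying the swap twice restores the original integer, so the same operation can serve for converting a result back.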
Table 1 gives the execution times for the most common floating point instructions. These times include the overhead for fetching the operands. The VAX 8800 processor is designed so that there is little, if any, difference in performance between register and memory operands. The execution times vary from 2.25 to over 5 times the performance of the VAX-11/780 CPU with an FPA for the F and D formats. For multiplies, one 8800 CPU is 2.5 times faster in F format and 4.8 times faster in D format; divides are 3.0 times faster. The gain is even more substantial for the G and H formats since they are not accelerated on the 11/780.
[Figure 3: Floating Point Carry for D Format. The diagram shows the sign bit (S), exponent, and mantissa fields of a D format operand and the path of the carry from the carry in at the least significant mantissa bit to the most significant mantissa bit.]
Table 1   Execution Times

Instruction                Execution Time (Nanoseconds)
(Register to Register)        F        D        G        H
ADD                         315      495      540     3314
MUL                         450      675      842     6306
DIV                        1607     3197     3107    21649
In the 8800 the D format is slightly faster than the G format with its longer opcode, which requires an extra cycle in the decoder. The single-precision F format executes the fastest, and the larger 128-bit H format executes the slowest. However, the H format is intended as a backup for intermediate calculations in the D and G formats. Used thus, the H format ensures that the final calculation result has sufficient precision and avoids overflow or underflow problems. Little hardware assistance is provided for the H format; it is driven mostly by microcode.
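As a rough cross-check, the Table 1 times can be converted into machine cycles. The sketch below assumes the 45-ns cycle time quoted in the multiplier discussion of this paper. The paper does not explain why some entries are not whole multiples of that cycle time, so no interpretation of the fractional values is offered here.

```python
CYCLE_NS = 45  # cycle time quoted in the multiplier discussion of this paper

# Execution times from Table 1, in nanoseconds.
TABLE_1 = {
    "ADD": {"F": 315, "D": 495, "G": 540, "H": 3314},
    "MUL": {"F": 450, "D": 675, "G": 842, "H": 6306},
    "DIV": {"F": 1607, "D": 3197, "G": 3107, "H": 21649},
}


def cycles(time_ns):
    """Convert an execution time in nanoseconds to 45-ns machine cycles."""
    return time_ns / CYCLE_NS


if __name__ == "__main__":
    for op, row in TABLE_1.items():
        print(op, {fmt: round(cycles(ns), 1) for fmt, ns in row.items()})
```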
Technology
Component technology used in the VAX 8800 processor is an enhanced version of the macrocell array (MCA) used in the VAX 8600 CPU.² This technology provides about 1,200 gate equivalents with a typical gate speed of 1 nanosecond (ns). MCAs utilize emitter-coupled logic (ECL) in a 72-pin package that is 1 square inch with a maximum power dissipation of 5.5 watts. The GPR and the multiplier are made with custom technology, which uses the same package as the MCA but contains a more advanced process. Around 1,800 gate equivalents are provided, and the gate speed is 50 percent faster than the MCA. This higher performance is achieved by using the following features:
• Smaller transistors and metal-oxide-walled resistors
• Current mode logic instead of the slower ECL
• Four-level logic instead of the two-level logic of the MCA

At 300 by 260 mils, the size of the custom chip is larger than the dimensions of 221 by 252 mils for the MCA.
The shifter module contains 12 MCAs and 8 custom multiplier parts. Some 10KH parts are used for clock distribution and for driving the bidirectional bypass bus.
Arithmetic Algorithm Processing
Addition and Subtraction
For an addition operation, the 32-bit words containing the exponents are sent to the main ALU. There they are passed to the A and B ports, which feed the shifter module. These ports drive all the gate arrays in parallel.
The exponents are then loaded into the XALU and the shift-amount ALU (SALU), which computes the alignment shift amount sent to the shifter. The SALU also generates some 20 branch conditions for the microcode. These conditions indicate the size of the alignment shift and whether any source operand is zero or a reserved operand. They also help to optimize the microcode flow.
The XALU, which selects the larger exponent and saves it for later use, has a 12-bit datapath and a register to hold the exponent. The size of this datapath is sufficient for the F, D, and G formats plus a guard bit for overflow or underflow detection. An ALU is provided to perform arithmetic operations on the exponent. The SALU, with an 11-bit datapath, subtracts the exponents to determine the alignment shift amount, which is always positive. The sign manipulation logic also resides in the SALU.
Next, the fractional part of the smaller operand is aligned by the shifter. This operation involves either one CPU cycle for F format operands or two CPU cycles for the D and G formats. The shifter unit shifts in the floating point format and can do a full 64-bit shift. The logic that determines the round bits is related to the alignment shift operation but is physically located in the priority encoder gate array. This gate array also contains some of the shifter functionality.
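The exponent comparison and alignment steps described above can be summarized in software. The sketch is only a functional illustration of the arithmetic, with fractions held as plain integers. The function name `align` and the way the shifted-out bits are returned are assumptions, not a description of the XALU, SALU, or round-bit logic.

```python
def align(exp_a, frac_a, exp_b, frac_b):
    """Align two operands for a floating point add.

    Returns (result_exponent, larger_fraction, aligned_smaller_fraction,
    shifted_out_bits). The shift amount is always non-negative because the
    operand with the larger exponent is selected first.
    """
    if exp_a >= exp_b:
        big_e, big_f, small_f, shift = exp_a, frac_a, frac_b, exp_a - exp_b
    else:
        big_e, big_f, small_f, shift = exp_b, frac_b, frac_a, exp_b - exp_a
    aligned = small_f >> shift                  # alignment shift of the smaller operand
    shifted_out = small_f & ((1 << shift) - 1)  # bits that would feed rounding
    return big_e, big_f, aligned, shifted_out


if __name__ == "__main__":
    print(align(130, 0x800000, 128, 0xC00000))
    print(align(128, 0x3, 130, 0x800000))
```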
Nine gate arrays are used for the shifter unit. Of those, eight make up the datapath; the ninth is the control device. The shifter can accept either a 64-bit operand on the A and B ports or a 32-bit operand on either port. The shifter generates a 32-bit result that can be either the high-order or the low-order part of the answer. The shifter datapath gate arrays are identical: each effectively constitutes a byte slice of the design and performs a bit shift of up to seven places. Byte shifting is then performed by sending the correct shifter output to the correct byte position. This operation is facilitated by having all the outputs wired to the OR gates at all possible byte positions and by enabling the correct output.
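The two-stage organization, a bit shift of up to seven places within each byte slice followed by byte steering, can be mimicked in software. The sketch below models a 64-bit logical right shift only. How each slice obtains bits from its neighbor is an assumption made here for illustration, and `shift_right_64` is a hypothetical name.

```python
MASK64 = (1 << 64) - 1


def shift_right_64(value, count):
    """64-bit logical right shift built from byte slices plus byte steering."""
    if not 0 <= count < 64:
        raise ValueError("shift count must be in 0..63")
    value &= MASK64
    bit_shift, byte_shift = count % 8, count // 8
    # Split the operand into eight bytes, plus a zero byte above the top slice.
    in_bytes = [(value >> (8 * i)) & 0xFF for i in range(8)] + [0]
    # Stage 1: each slice shifts by 0..7 bits, borrowing from the next byte up.
    sliced = [((in_bytes[i] | (in_bytes[i + 1] << 8)) >> bit_shift) & 0xFF
              for i in range(8)]
    # Stage 2: steer each slice output to its final byte position.
    result = 0
    for position in range(8 - byte_shift):
        result |= sliced[position + byte_shift] << (8 * position)
    return result


if __name__ == "__main__":
    print(hex(shift_right_64(0x0123456789ABCDEF, 12)))
```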
The shifter performs floating point, integer, and logical shifts, as well as a number of miscellaneous functions. These include converts from decimal-format data into integer format and vice versa. The masking of the exponent field and the insertion of the hidden bit are also done by the shifter.

After the alignment shift, the output of the shifter is directed to the main ALU on the bypass bus. There, the output is added to or subtracted from the fraction of the larger operand. The output of the ALU operation is now ready to be normalized in the shifter. In most cases a small normalize shift of at most one bit position left or right will be sufficient. The specialized hardware in the shifter handles this case and then rounds the result. Should a larger shift be required, then microcode will first direct the ALU result to the priority encoder gate array. There, the position of the leading 1 is found and used to determine the normalize amount for the subsequent cycle.

The rounding operation in the VAX 8800 CPU is unusual in that it is limited to the low-order eight bits. Therefore, a small 8-bit adder can be used for this operation. This adder is both faster and cheaper than the usual method of using a full 64-bit adder. The 8-bit adder is also sufficient to calculate the correct answer in over 99.5 percent of the addition operations. Should a carry-out be generated by this 8-bit rounding add, then clearly the result created is incorrect. In that case the computer is trapped and microcode invoked to correct the result.

Multiplication
As mentioned earlier, the 8800 contains a high-performance, custom-designed multiplier and divider unit. A number of factors impelled us to use such a unit. First, multiplication is a very frequent operation that is used extensively in matrix manipulation. For example, in the LINPACK benchmark, the time-critical routine contains an even mix of addition and multiplication operations.

Second, it was not possible to succumb to the temptation of using the main ALU to provide the division operation. This desire was natural since division is an infrequent operation, and the use of an ALU in a repeated subtract and shift mode was appealing. For example, the VAX 8600 uses the ALU for just that purpose. In the 8800 the main ALU also computes the virtual address. Since this datapath is very time-critical (in the 8800 as well as in most other computer designs), it cannot be allowed to go any slower. Including an extra path to accommodate division would have slowed down this critical path by around 5 ns, resulting in a 10 percent performance degradation for all operations.

Moreover, the available space for the multiplier and divider unit was limited since floating point operations are integrated with the rest of the machine. Approximately one-third of a module (12 inches by 16 inches) was available. In contrast, the VAX 8650 CPU dedicates a full module to multiplication.

The custom design of the multiplier and divider unit is basically a byte slice of a large word-sized multiplier and divider unit. The multiplier handles 8 bits per cycle, the divider handles 1 bit. Figure 4 shows the complete 56-bit by 8-bit multiplier with its eight byte-slice custom chips. Eight chips are used to form the required word size of 64 bits (56 data bits plus 8 guard bits). This arrangement is sufficient to handle F, D, and G format operations. H format operations are performed by partitioning the problem into many smaller 56-bit multiplications under microcode control.

[Figure 4: Multiplier and Divider Unit. The block diagram shows the multiplicand input, the multiplier input, the Booth recode logic, the 8-bit shift, the PRGB, the 64-bit datapath, and the multiplier output.]

The multiplicand is loaded into the MD latch after passing through the mask logic, which clears the sign and the exponent field and inserts the hidden bit. The PR latch and the PRGB are cleared at the start of the multiply. The PRGB contains the guard bits for the PR latch. At the end of a multiply, this latch will hold the bits required for a possible normalization shift and also for a rounding operation. The least significant eight bits of the multiplier are loaded into the multiplier latch. The first multiply cycle is now ready to be performed.

A 56-bit by 8-bit multiplication is performed between the contents of the MD and multiplier latches. The result is then added to the contents of the PR latch (which is initially zero) and then written back into it with a right shift of 8 bits. The PR latch is thus an accumulating latch and contains the 64-bit partial product of each multiplication operation. The next 8 bits of the multiplier are loaded into the multiplier latch, ready for the next cycle. This cycling continues until the multiplicand has been multiplied by all the multiplier bytes. This algorithm is similar to the one used in the VAX 8650 scheme, except that that processor has a narrower datapath of 32 bits.
Notice that the least significant byte of the partial product is discarded after each cycle and absorbed by the shifter unit. These bytes are required only for the H format multiply.
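The multiply loop can be followed step by step in a short functional model. The sketch below reproduces only the arithmetic: an 8-bit slice of the multiplier per cycle, accumulation into a partial-product value, and a right shift of 8 bits that discards one low-order byte per cycle. Booth recoding, the carry-save adders, and the guard-bit latch are not modeled, and the name `multiply_by_bytes` is hypothetical.

```python
def multiply_by_bytes(multiplicand, multiplier, multiplier_bytes=7):
    """Multiply two fractions, consuming 8 multiplier bits per cycle.

    Returns (pr, discarded), where pr is the accumulated high-order product
    and discarded lists the low-order byte shifted out in each cycle.
    Seven cycles cover a 56-bit multiplier.
    """
    pr = 0          # accumulating partial product (the PR latch in the text)
    discarded = []  # bytes the text says are absorbed by the shifter unit
    for cycle in range(multiplier_bytes):
        byte = (multiplier >> (8 * cycle)) & 0xFF
        total = pr + multiplicand * byte   # 56-bit by 8-bit multiply, then add
        discarded.append(total & 0xFF)     # least significant byte leaves the latch
        pr = total >> 8                    # write back with a right shift of 8 bits
    return pr, discarded


if __name__ == "__main__":
    m_cand = 0x80000000000001
    m_plier = 0xC0000000000003
    pr, low_bytes = multiply_by_bytes(m_cand, m_plier)
    print(hex(pr), [hex(b) for b in low_bytes])
```

Reassembling `pr` with the discarded bytes gives the exact double-length product, which is why those bytes matter only when an extended result is being built up, as in the H format case.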
Once completed, the result is sent out through the result latch, then normalized and rounded. The rounding carry is only propagated into the least significant byte of the result. This procedure uses less logic since only an 8-bit instead of a 64-bit incrementer is required. The 8-bit incrementer will be sufficient for most multiplies. Should a greater increment be required, then the multiplier will trap the rest of the machine, and the correction will be performed by the main ALU. This scheme is similar to the one used for addition.
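The fast path and the trap path of this rounding scheme can be sketched as follows. The code is a functional illustration only. The function names and the way the trap is reported are assumptions, and the slow path simply uses a full-width add to stand in for the correction done elsewhere in the machine.

```python
def round_low_byte(result, round_up):
    """Apply the rounding carry to the low-order 8 bits only.

    Returns (value, trap). If the 8-bit increment produces a carry-out,
    the fast path cannot finish the job and trap is True; the value is
    returned unmodified so that a slower full-width correction can run.
    """
    low = (result & 0xFF) + (1 if round_up else 0)
    if low > 0xFF:
        return result, True
    return (result & ~0xFF) | low, False


def round_full_width(result, round_up):
    """Slow-path correction using a full-width add."""
    return result + (1 if round_up else 0)


if __name__ == "__main__":
    for value in (0x1234, 0x12FF):
        rounded, trap = round_low_byte(value, True)
        if trap:
            rounded = round_full_width(value, True)
        print(hex(value), "->", hex(rounded), "trap" if trap else "fast path")
```

If the low-order byte were uniformly distributed, which is an assumption, a carry-out would occur for only one value in 256, so the trap path would be taken rarely.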
The prov i s i o n of a 6 4 · b i t a d d e r i ns i d e t h e
m a i n m u l t i p l y path i s u nusual i n a h igh·perfor·
nunce machi n e . H i gh ·speed m u l t i p l i e r designs
typ i c a l l y use ca rry·save a d d e rs , w h i c h do nor
propagate the carry signal bur save them so t hey
can be absorbed by the subseq u ent cyc l e . This
form of adder is indeed used i n the c u sro m m u l ·
r i p l i e r r o perform t h e 5 6 · b i t by 8·bit m u l t i p ly
fu nction i l l ustrated i n F i gu re 4 . Howeve r, the
8800 a l so uses a fu l l 6 4 ·bi t adder for the fo l l ow·
i ng reasons:
• A 64-bit adder has to be provided somewhere to propagate the carries
  from the carry-save adders.
• With the 45-ns cycle time, the 64-bit adder fits in the main
  datapath. A faster clock for the multiplier would have complicated
  the clock distribution and been difficult to generate with low skew.
• A full adder in the datapath allows the use of a simple nonrestoring
  division algorithm.

The multiplier and divider chip contains a 12-bit by 8-bit multiplier,
two 8-bit adders, six latches with a total size of 72 bits, as well as
the rounding, normalizing, and control logic. A comparable MCA design
would require between three and four of these elements.

Alternative Designs for the Multiplier
An MCA design was certainly possible and could have been made to fit
in the specified space. The performance of such a design, however,
would not be as good as the custom design for multiplication but
comparable for division. An MCA design would be 1.7 times better than
an 11/780 with an FPA for a multiply in F format, whereas the custom
logic chosen is 2 . ') times better. The performance would be 2 ')
times better for the D format, whereas the custom design is 4.8 times
better.

Another alternative was to use a commercially available multiplier.
That was tempting because such a product has the advantage of being
readily available and tested. Using it would have circumvented the
high risk of a custom design. However, there are a number of
disadvantages to using general-purpose multipliers:
• Many of the available designs are intended for integer applications,
  such as FFT butterflies and digital signal processors. Hence, the
  designs are optimized for those applications. Extending these 8- or
  16-bit multipliers to a larger word length, as required for the VAX
  architecture, was neither straightforward nor cost effective.
  Moreover, the normalization and rounding of results entails either
  extra logic or additional cycles if the floating point hardware in
  the E Box is used.
• Extra logic is required to mask out the sign and exponent of the data
  and to insert the hidden bit. The output of the multiplier would have
  to be masked.
• Most available products cannot handle division. Thus a separate
  divider would have been required, which was expensive. Even division
  algorithms using multiplication require a large amount of ROM to
  contain the approximation constants.
• Most designs have a clock system not consistent with the rest of the
  machine. This fact introduces the complication of a special clock
  distribution and difficulties in verifying the design.
• Very few designs are based on ECL technology. Other technologies,
  such as TTL, would require a different power rail and thus an extra
  power supply.
The closest available multiplier to the 8800 requirements is the 10901
made by Motorola, Inc. This MCA implementation contains an 8-bit by
8-bit multiplier together with a 16-bit adder. However, no latches are
included; they must therefore be provided externally, thus increasing
the cost substantially. On the other hand, division could be provided
by repeatedly using the 16-bit adder of the 10901.
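
The carry-save addition discussed in this section, followed by a single carry-propagating add, can be illustrated with a short Python sketch. It is only a behavioral model under stated assumptions: the function names, the 64-bit default width, and the one-bit-per-step partial product loop are illustrative choices, not a description of the 8800 multiplier circuit.

```python
def carry_save_add(a, b, c, width=64):
    """Reduce three operands to a (sum, carry) pair without carry propagation."""
    mask = (1 << width) - 1
    partial_sum = (a ^ b ^ c) & mask                  # per-bit sum, carries ignored
    saved_carry = (((a & b) | (a & c) | (b & c)) << 1) & mask  # carries kept for later
    return partial_sum, saved_carry


def multiply_with_carry_save(multiplicand, multiplier, width=64):
    """Accumulate partial products in carry-save form, then do one full add.

    Hypothetical helper: it handles one multiplier bit per step, whereas
    the hardware described in the text handles 8 multiplier bits per cycle.
    """
    mask = (1 << width) - 1
    s, c = 0, 0
    for bit in range(multiplier.bit_length()):
        if (multiplier >> bit) & 1:
            s, c = carry_save_add(s, c, (multiplicand << bit) & mask, width)
    # The only carry-propagating addition: this is the role of the full adder.
    return (s + c) & mask
```

The point of the sketch is that carries only have to ripple once, at the end, which is why a full-width adder must exist somewhere in the datapath.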
Division
The multiplier performs a nonrestoring division algorithm, 1 bit per
cycle, for the F, D, and G formats. The divider can accept a new
dividend bit during every cycle, thus permitting a 128-bit by 56-bit
divide. A divide of this size is used in the H format algorithm to form
the starting approximation.

The Booth recode of the multiplier is modified slightly to accommodate
the division decode.² In the case of multiplication, the multiplier
recode selects the correct multiples of the multiplicand to add to the
partial product during each multiplication operation. In the case of
division, the divisor is loaded into the MD latch, and the Booth recode
selects either +1 or -1 times the divisor for each division step.

In the nonrestoring division algorithm, the sign bit of the previous
result selects the correct divisor multiple for the next cycle. This
selection is facilitated by feeding the sign signal into the modified
Booth recode so that it will select the multiples of either +1 or -1
times the divisor.

The quotient bit generated every cycle is sent to the shifter unit to
be absorbed. The first quotient bit generated corresponds to the most
significant bit of the answer. That bit is then normalized and rounded
by the shifter.
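
The division steps described above can be sketched in Python as a textbook nonrestoring divide on unsigned integers. This is an illustrative model, not the 8800 logic: the function name, the `n_bits` parameter, and the final remainder correction are assumptions made so that the sketch is self-contained and checkable.

```python
def nonrestoring_divide(dividend, divisor, n_bits):
    """Unsigned nonrestoring division, one quotient bit per step, MSB first.

    Assumes 0 <= dividend < 2**n_bits and divisor > 0.
    Returns (quotient, remainder).
    """
    if divisor <= 0 or dividend < 0 or dividend >> n_bits:
        raise ValueError("operands out of range for this sketch")
    remainder = 0
    quotient = 0
    for i in reversed(range(n_bits)):
        dividend_bit = (dividend >> i) & 1        # one new dividend bit per step
        # The sign of the previous result selects -1 or +1 times the divisor.
        multiple = -1 if remainder >= 0 else +1
        remainder = 2 * remainder + dividend_bit + multiple * divisor
        # A non-negative result yields a quotient bit of 1, otherwise 0.
        quotient = (quotient << 1) | (1 if remainder >= 0 else 0)
    if remainder < 0:
        remainder += divisor                      # final correction step
    return quotient, remainder
```

Because a negative partial remainder is never restored, each step needs only one add or subtract, which is the property that makes the scheme attractive when a full adder is already in the datapath.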
Microcode Design
Being integrated into the logic in the main machine, the floating point
logic is also controlled by the main microcode. The VAX 8800 CPU is an
extensively pipelined design.⁵ Although pipelining is a well known
technique for improving performance (for example, the VAX 8600 CPU), it
comes at a price: the microcode branch latency will increase. By that
we mean that the microcode cannot branch on a condition or flag in the
very next instruction; instead, it must wait a number of cycles. This
delay is a consequence of the overlapping of the microinstructions;
each successive microinstruction starts before its predecessor has
completed.
Figure 5 shows a typical pipeline similar to that used in the VAX 8800
system. The microinstruction is subdivided into five components:
• In NEXT ADDRESS, the address for the next microinstruction is
  computed, as well as those for any selected branch conditions.
• In LOOK-UP, the microcode RAM is accessed to fetch the
  microinstruction specified by the current NEXT ADDRESS.
• In READ, the register file is read to fetch the specified operands
  (e.g., fetch R0 and R1).
• In ALU, the operation in the arithmetic logic unit is performed
  (e.g., R0 + R1).
• In WRITE, the result of the ALU operation is written back to the
  register file.

Thus when the next-address cycle has completed for the first
microinstruction, A, the next-address cycle for the microinstruction,
B, in the subsequent cycle is started. This cycle now overlaps with the
look-up cycle for A. As many as five operations can proceed
simultaneously in this manner.

The branch latency of this pipeline is governed by the first
microinstruction that can "see" a branch condition set in an earlier
cycle. For example, if the ALU cycle of A sets a carry condition, then
the first instruction that can possibly use this signal in its
next-address cycle is E. Thus the branch latency is three
microinstructions, as shown in Figure 5.

Figure 5   Five-stage Pipeline. Microinstructions A through E each pass
through NA (next address), LU (microcode instruction lookup), READ,
ALU, and WRITE, each starting one cycle after its predecessor. The
condition code set in the ALU cycle of A (e.g., carry out) is first
available to the NA cycle of E, the earliest instruction that can
branch on the condition code of instruction A.

Naturally, this branch latency influenced the way in which we designed
the logic to perform floating point operations. Clearly, we had to
avoid branching whenever possible as this would result in an
excessively slow algorithm. Instead, we had to adopt a strategy based
on prediction and provide extensive hardware assistance.

Prediction is based on the fact that the speed of algorithms for
floating point adds is usually data dependent. For example, for certain
data values, the result of a floating point add will require
considerable normalization. That requirement is always present when two
values of similar magnitude and large cancellation are subtracted. In
other cases little or no normalization is required. It is clearly
preferable not to pay the penalty of unnecessary normalizations.
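
The branch latency of the five-stage pipeline described earlier can be worked out with a small Python sketch. It assumes one microinstruction issued per cycle and a condition that becomes visible in the cycle after the stage that sets it; the function names and the 0-based cycle numbering are illustrative assumptions, not taken from the 8800 documentation.

```python
STAGES = ["NA", "LU", "READ", "ALU", "WRITE"]


def stage_cycle(instr, stage):
    """Cycle (0-based) in which microinstruction number `instr` is in `stage`."""
    return instr + STAGES.index(stage)


def first_instruction_that_can_branch(set_stage="ALU"):
    """Index of the first microinstruction whose NA cycle sees A's condition."""
    ready_cycle = stage_cycle(0, set_stage) + 1   # condition visible one cycle later
    instr = 1
    while stage_cycle(instr, "NA") < ready_cycle:
        instr += 1
    return instr


def branch_latency(set_stage="ALU"):
    """Number of microinstructions after A that cannot branch on A's condition."""
    return first_instruction_that_can_branch(set_stage) - 1


def timing_diagram(n_instr=5):
    """Text rendering of the staggered pipeline, one row per microinstruction."""
    rows = []
    for i in range(n_instr):
        name = chr(ord("A") + i)
        cells = ["     "] * i + [f"{s:<5}" for s in STAGES]
        rows.append(f"{name}: " + " ".join(cells).rstrip())
    return "\n".join(rows)
```

With the condition set in the ALU stage of A, the sketch identifies the fifth microinstruction (E) as the first that can branch, so B, C, and D make up the three-instruction latency stated in the text.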
The approach we took in the 8800 is to proceed down the most likely
path, assuming that a small normalization will be required while
waiting for the result of the branch signals. The add and subtract
algorithms in particular are structured that way. The SALU examines the
exponents of the operands and other signals; then it sets approximately
20 branch conditions in the first two cycles of the add/subtract
datapath.

In certain situations all paths may be equally probable. In these cases
the microcode enables hardware signals to control the datapath. A good
example of this processing is the selection of operands. For a floating
point add, it is natural to think in terms of the larger and the
smaller operands. For example, the smaller operand is the one that is
always aligned. However, the microcode does not know which register
location holds the smaller value, and it does not want to wait for the
whole branch-latency period to find out.

Therefore, the microcode will assume that the larger operand is in a
particular register. Should this assumption be incorrect, then the SALU
will swap the register file read addresses (thus sorting the operands).
Not all locations have their addresses modified in this manner since
the microcode still needs to be able to read and write to specific
locations.
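
The operand sorting described above can be pictured with a short Python sketch. It is a simplified model under explicit assumptions: each register holds a hypothetical `(exponent, fraction)` pair, operands are ordered by exponent only, and the helper names are invented for illustration. The actual SALU logic is not specified at this level in the text.

```python
def sorted_read_addresses(regfile, assumed_larger, assumed_smaller):
    """Return (larger, smaller) read addresses, swapping if the guess was wrong.

    `regfile` maps a register address to an (exponent, fraction) pair.
    Comparing exponents only is an assumption of this sketch.
    """
    exp_a, _ = regfile[assumed_larger]
    exp_b, _ = regfile[assumed_smaller]
    if exp_a >= exp_b:
        return assumed_larger, assumed_smaller    # assumption was correct
    return assumed_smaller, assumed_larger        # swap the read addresses


def align_for_add(regfile, addr_a, addr_b):
    """Align the smaller operand's fraction to the larger operand's exponent."""
    larger, smaller = sorted_read_addresses(regfile, addr_a, addr_b)
    exp_large, frac_large = regfile[larger]
    exp_small, frac_small = regfile[smaller]
    shift = exp_large - exp_small                 # only the smaller operand shifts
    return exp_large, frac_large, frac_small >> shift
```

The microcode in this picture always issues the same read sequence; only the addresses presented to the register file change, so no microcode branch is needed to find the smaller operand.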
Similarly, the SALU determines if the main ALU is to do an add or
subtract operation. At this point in the computation the microcode is
unaware of which operation will be required. The pipeline is still
within the long branch latency of the 8800 and cannot branch until this
latency delay has elapsed. Note that one of the most frequently
performed instructions is ADDF. That instruction will have just
completed by the time the microcode can finally branch. Therefore, the
ADDF cannot execute any faster since it is limited by the
branch-latency delay. Consequently, those instructions that are the
most probable cases are completely hardware driven.

To allow fast paths in the add algorithms, it is necessary to know that
the result cannot possibly overflow since overflowed results must never
be written. To prevent overflow the SALU examines the exponents of the
operands. It then determines if the exponent of the result could
possibly overflow or underflow, taking into account any possible
normalization shift. There is also the added complexity of a rounding
operation provoking an extra normalization step. That would happen when
the rounding increment caused a carry to propagate throughout the whole
fraction.

Consequently, the use of a small 8-bit incrementer for the round
operation is possible only if it is known that an overflow cannot
happen. The reason for this is that halting (trapping) the machine is
not instantaneous (for the same reason that branch latency exists);
therefore, the result will always be written. Thus, although the
microcode can eventually correct the result, it cannot prevent that
result from being written.
Performance Issues
When a program with many floating point instructions, such as LINPACK,
is run, its performance is not totally dictated by the raw floating
point speed of the CPU. Having a more profound effect are other
factors, such as
• The size and organization of the cache: This factor is particularly
  important for programs with large amounts of data because the
  operands will reside in memory. Having superior register-to-register
  performance will not help in this type of program. Clearly, the
  larger the cache, the greater the chance that the required data will
  be quickly available, thus avoiding a lengthy transaction with
  memory.
• The performance of the integer and control instructions: Even
  programs performing extensive floating point operations still have
  significant amounts of integer and control instructions. Doing these
  quickly can contribute substantially to the program's performance.

To illustrate the effect of these factors, compare the performance of
the VAX 8800 system with that of the VAX 8650 when both run LINPACK, as
shown in Table 2.⁴ The 8650 has faster raw floating point speed,
especially for the F format (over twice as fast). Yet the two systems
run this benchmark with almost the same performance. Clearly, in
programs with these characteristics, factors other than raw speed will
have a greater influence on performance. Of course, in applications
without them, the raw speed advantage of the 8650 will be more
pronounced.
Table 2   LINPACK Performance

                Performance (MFLOPS)
Computer        F Format    D Format
VAX 8800        1.35        0.99
VAX 8650        1.30        0.70

Summary
The architecture of a processor like the VAX 8800 CPU is all a matter
of trade-offs. Where does the performance make a difference? For
example, we could have supplied the 8800 with a separate floating point
unit to achieve faster performance. Doing that, however, would have
required at least one extra module. To keep the cost of the system
constant, this extra module would have entailed removing a module of
logic from some other part of the computer. Perhaps removing that
module would have resulted in a smaller cache or a simpler decoder with
no optimizations for the frequent instructions. In any case the net
effect would have been to sacrifice the performance of the computer in
some other area. All things considered, we feel that the design is well
balanced for the multitude of different computing tasks that customers
will perform with the VAX 8800 system.

Acknowledgments
The authors would like to thank Ron Melanson and his team for the
circuit design of the custom multiplier. In addition, we would like to
thank Dave Sager for his help and guidance.

References
1. R. Burley, "An Overview of the Four Systems in the VAX 8800
   Family," Digital Technical Journal (February 1987, this issue):
   10-19.
2. T. Fossum, W. Grundmann, and V. Blaha, "The F Box, Floating Point in
   the VAX 8600 System," Digital Technical Journal (August 1985):
   43-53.
3. VAX Architecture Manual (Maynard: Digital Equipment Corporation,
   Order No. EB-19580, 1981).
4. J. Dongarra, "Performance of Various Computers Using Standard Linear
   Equations Software in a FORTRAN Environment," Argonne National
   Laboratory (May 1986).
5. S. Mishra, "The VAX 8800 Microarchitecture," Digital Technical
   Journal (February 1987, this issue): 20-33.
James P. Janetos
The VAX 8800 Input/Output System
The VAXBI bus links the processors in the VAX 8800 family to I/O
devices, including clusters and networks. The VAX 8800 multiprocessor
can support four of these 32-bit synchronous buses, each of which
connects up to 16 I/O devices. Each VAXBI bus connects to the memory
interconnect, the NMI bus, by an NBI adapter, which contains an
interface chip to implement the VAXBI protocol. The NBI adapter logic
handles CPU references and direct memory accesses to and from the I/O
devices. The adapter has its own 200-nanosecond clock, which is
completely asynchronous with the 45-ns CPU clock.
The VAX 8800 family of systems is another major step for Digital
Equipment Corporation into the realm of high-performance computing.
While increasing the computing capability of the VAX line for
scientific and technical applications, these systems will undoubtedly
play an important role in commercial and office markets. In these
markets, the ability to connect to a computing cluster, service many
users, and function in a network are as important as a fast CPU.
Indeed, in a multiuser, multiprogramming system, the efficiency of
"housekeeping" operations affects the perceived system performance as
much as raw processor computing speed. These operations include sharing
memory between many programs, swapping processes into and out of
memory, paging, and responding to interactive user requests.

All members of the VAX 8800 family use Digital's new VAXBI bus as their
communication link to clusters, networks, and interactive users. With
its ability to connect to four separate VAXBI channels, the VAX 8800
system in particular offers great flexibility in configuring peripheral
devices and interfaces. This paper first discusses the characteristics
of the system communication buses in the VAX 8800 system. Following
that is a discussion of the interface, called the NBI adapter, linking
the primary system bus to the VAXBI input/output (I/O) bus. Figure 1
illustrates the various components of a VAX 8800 system.

The Processor-to-Memory Bus
The two CPUs, the I/O subsystem, and memory all share the primary
system bus, called the NMI bus. This bus is a limited-length,
high-speed synchronous communications path that provides the data link
between these four devices. The NMI bus is completely contained in the
main system cabinet; its cycle time is 45 nanoseconds (ns), the same as
the CPU's. The bus protocol handles several outstanding transactions at
one time, thus effectively increasing the bus's utilization. That is,
once a device has issued a transaction (e.g., a read), that device
relinquishes the use of the bus until the responding device is ready
with the data. Other devices are then free to start other transactions.

In this fashion, the bus usage is greatly increased. The two CPUs
communicate directly with memory over the NMI bus; the I/O devices
connected to the VAXBI buses access memory via the NBI adapters. A
device on the NMI bus is called a "nexus." Arbitration among nexuses
occurs in parallel with data transfers and is handled by one CPU in a
nearly round-robin fashion. This guarantees that each nexus gains its
fair share of the bus resource. Data transfers on the NMI bus occur in
longword, octaword, and hexaword lengths (4, 16, and 32 bytes
respectively). Four levels of device interrupts are supported.

The VAXBI Backplane Interconnect
The VAXBI bus is used as the I/O bus for the VAX 8800 system. As shown
in Figure 1, from one to four VAXBI buses can be interfaced to the NMI
bus, depending on a customer's needs and his desired mix of peripheral
devices. Each VAXBI
maintained below a specified threshold. This error is called the local
truncation error. The resulting system of nonlinear equations is
reduced to a system of linear equations by performing a first-order
Taylor expansion of the nonlinear elements of the circuit. This
linearization introduces another error called the linearization error.
The resulting system of linear equations is then solved exactly, using
an LU factorization of the system matrix.

After the solution of the system has been obtained, the linearization
error can be estimated. If this error is too big, a new linearization
is performed around the previously computed solution, and the new
linear system is solved again. Successive linearizations are performed
until convergence is obtained, that is, until the linearization error
is below a specified threshold. When convergence is reached the
solution of the nonlinear system is obtained, and the local truncation
error is then checked. If this error is too big, the solution at time
point t_i is rejected and the system of differential equations is
solved at a new time point t'_i so that t_(i-1) < t'_i < t_i. If the
error is below a specified threshold, the solution is accepted, and the
system is solved at a new time point t_(i+1) so that t_i < t_(i+1).
This procedure is repeated until the entire transient analysis is
computed. During a transient simulation the circuit simulator SPICE
spends up to 90 percent of its CPU time in three phases of the previous
algorithm. These phases are as follows:

    discretize differential equations
    DO WHILE (not converged)
        linearize algebraic equations
        solve linear equations
        check convergence
    ENDDO
    IF (local truncation error too big) THEN
        reduce time
    ELSE
        save results at this time
        advance time
    ENDIF
ENDDO

Figure 3   Transient Analysis Algorithm for SPICE
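
The transient analysis loop of Figure 3 can be made concrete with a small Python sketch for a single scalar equation dx/dt = f(x). Everything specific here is an assumption chosen for illustration rather than SPICE's actual method: backward Euler as the discretization, a scalar Newton step standing in for the LU solve, a simple derivative-difference estimate of the local truncation error, and the halving and doubling rules for the time step.

```python
def transient(f, dfdx, x0, t_end, h0, lte_tol=1e-4, newton_tol=1e-10, max_newton=50):
    """Integrate dx/dt = f(x) from t = 0 to t_end; returns a list of (t, x)."""
    t, x, h = 0.0, x0, h0
    results = [(t, x)]
    while t_end - t > 1e-12:
        h = min(h, t_end - t)
        if h < 1e-15:
            raise RuntimeError("time step underflow in this sketch")
        # Discretize: backward Euler gives g(x_new) = x_new - x - h*f(x_new) = 0.
        x_new = x
        converged = False
        for _ in range(max_newton):
            g = x_new - x - h * f(x_new)
            dg = 1.0 - h * dfdx(x_new)        # linearize around the current guess
            step = g / dg                     # scalar stand-in for the LU solve
            x_new -= step
            if abs(step) < newton_tol:        # check convergence
                converged = True
                break
        # Rough local truncation error estimate for backward Euler.
        lte = 0.5 * h * abs(f(x_new) - f(x))
        if not converged or lte > lte_tol:
            h *= 0.5                          # reject the point and reduce time
            continue
        t += h                                # accept: save results, advance time
        x = x_new
        results.append((t, x))
        if lte < 0.25 * lte_tol:
            h *= 2.0                          # error is small, try a larger step
    return results
```

For example, `transient(lambda x: -x, lambda x: -1.0, 1.0, 1.0, 0.1)` integrates an exponential decay and rejects the first few trial steps until the error estimate falls below the tolerance.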
In our environment synchronization is done through software, and the
fine-grain parallelism used for vectorization may not be efficient.
Based on these considerations, we have proposed and implemented an
algorithm in which particular care has been taken to minimize the
overhead incurred with parallel processing. The details of our
algorithm can be found in reference 10.
Local Truncation Error Phase
The parallel computation of the time step does not present major
difficulties since the computation of the local truncation error for
each energy storage element is independent. Each slave process is
assigned a set of energy storage elements and computes the time step
required by this set. The master process then computes the minimum time
step among the time steps returned by the slave processes. The energy
storage elements are statically assigned among slave processes so that
the work among them is balanced.
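
The master and slave division of work in this phase can be sketched in Python. The sketch is only an analogy: threads from the standard library stand in for the VMS slave processes, the round-robin partition is one assumed way to balance a static assignment, and `required_step` is a hypothetical callback that returns the time step demanded by one energy storage element.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(elements, n_slaves):
    """Static round-robin assignment of elements to slaves."""
    return [elements[i::n_slaves] for i in range(n_slaves)]


def slave_time_step(subset, required_step):
    """Work done by one slave: the smallest step required by its own set."""
    return min(required_step(e) for e in subset)


def parallel_time_step(elements, required_step, n_slaves=2):
    """Master: gather the per-slave minima and take the overall minimum.

    Assumes `elements` is non-empty.
    """
    subsets = [s for s in partition(elements, n_slaves) if s]
    with ThreadPoolExecutor(max_workers=len(subsets)) as pool:
        partial_minima = list(
            pool.map(lambda s: slave_time_step(s, required_step), subsets)
        )
    return min(partial_minima)
```

Because each slave returns a single number, the only synchronization needed is the final gather by the master, which keeps the overhead of this phase small.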
Results
The parallel algorithms described in this paper have been implemented
to produce the program CAYENNE. We now present two examples to compare
the timing performances of SPICE and CAYENNE.

The first example is the simulation of a MOS arithmetic logic unit
(ALU) on a VAX 8800 system. The circuit has 200 nodes and 1350
elements. Twelve hundred Newton-Raphson iterations are required for the
transient simulation. The efficiency of our parallel implementation is
measured in this example. If a multiple-stream phase runs sequentially
in an elapsed time T_s and in parallel with N slave processes in an
elapsed time T_p, we define the efficiency, E, of the parallel
execution by

    E = (T_s - T_p) / (T_s - T_s/N)

E represents the ratio of the actual savings in elapsed time to the
potential savings in elapsed time. Table 1 gives timings and
efficiencies for the ALU example. As a comparison, SPICE simulates the
same circuit in an elapsed time of 834 seconds.
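
The efficiency definition above translates directly into a few lines of Python. The function name and the sample timings in the usage note are hypothetical and are not values from Table 1.

```python
def parallel_efficiency(t_seq, t_par, n_slaves):
    """Ratio of actual to potential savings in elapsed time.

    t_seq: elapsed time of the phase run sequentially.
    t_par: elapsed time of the phase run with n_slaves slave processes.
    Requires n_slaves > 1, otherwise there is no potential saving.
    """
    if n_slaves <= 1:
        raise ValueError("efficiency is defined only for more than one slave")
    potential_saving = t_seq - t_seq / n_slaves
    actual_saving = t_seq - t_par
    return actual_saving / potential_saving
```

As a made-up illustration, a phase that takes 100 seconds sequentially and 55 seconds with two slaves has saved 45 of a possible 50 seconds, so `parallel_efficiency(100.0, 55.0, 2)` evaluates to about 0.9.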
Table 1   Timing Performances and Efficiencies
Columns: Phase | CAYENNE 0 Slaves (Seconds) | CAYENNE 2 Slaves (Seconds) | Efficiency (Percent)
Rows: Load, LU, LTE, Total Simulation
694
97
86
22
14
70
67
35
96
867
529
The second example is the simulation of a MOS control store. The
circuit has 160 nodes and 530 elements, and the transient simulation
requires 1404 Newton-Raphson iterations. SPICE spends 91 percent of the
simulation time in the three phases we modified for parallel
processing. CAYENNE executing with two slave processes achieves
90-percent efficiency in these phases and simulates the circuit 1.7
times faster than SPICE. For this simulation, CAYENNE on a VAX 8800
runs 9 times faster than SPICE on a VAX-11/780 CPU. Table 2 shows these
comparisons.
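
The figures quoted for the second example can be cross-checked with a short Python calculation. It inverts the efficiency definition to get the parallel time of the modified phases and leaves the remaining time unchanged. The sketch assumes that single-process CAYENNE takes about as long as SPICE, which the text does not state for this example, and the function names are illustrative.

```python
def parallel_time(t_seq, efficiency, n_slaves):
    """Elapsed time implied by E = (T_s - T_p) / (T_s - T_s/N), solved for T_p."""
    return t_seq - efficiency * (t_seq - t_seq / n_slaves)


def predicted_speedup(parallel_fraction, efficiency, n_slaves):
    """Overall speedup when only `parallel_fraction` of the run is parallelized."""
    serial_part = 1.0 - parallel_fraction
    parallel_part = parallel_time(parallel_fraction, efficiency, n_slaves)
    return 1.0 / (serial_part + parallel_part)
```

With 91 percent of the time in the parallel phases, 90-percent efficiency, and two slaves, `predicted_speedup(0.91, 0.90, 2)` comes out near 1.69, which is consistent with the reported factor of 1.7.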
Table 2   Comparison of SPICE and CAYENNE Elapsed Run Times

Case                       Elapsed Seconds    Ratio
SPICE on VAX-11/780        3990               9.1
SPICE on VAX 8800          750                1.7
CAYENNE on VAX 8800        440                1.0

The efficiencies of a parallel execution of CAYENNE depend on the size
of the circuit. Indeed, there is a fixed overhead incurred by calling
the synchronization routines JOIN, FORK or JOIN_FORK. The bigger the
task performed by the slave processes before a call to a
synchronization routine, the smaller the relative cost of
synchronization. The simulations of our examples were also run on a
lightly loaded system. Loss of efficiency occurs when processors have
to be shared with nonrelated processes, and busy-wait synchronizations
may waste significant resources. A workload consisting of several
independent simulations of equal importance is already decomposed, and
CAYENNE should be run in single-process mode. If the turnaround of a
single, large simulation needs to be minimized, however, CAYENNE should
be run with two slave processes on a dedicated or lightly loaded 8800.

Summary
We have described a general methodology for parallel processing on the
VAX 8800 system and a user-friendly set of routines that embed our
methodology. We have also presented the successful conversion of the
circuit simulator SPICE into the parallel program CAYENNE. New schemes
to minimize the overhead of parallel processing and to balance the load
among processes contribute to the overall efficiency of our
implementation.

Acknowledgments
We would like to acknowledge Bob Kusik for initiating this project,
Craig Yankes for introducing us to parallel processing within the
VAX/VMS system and for providing us with an initial library of routines
from which our methodology evolved, and John Faricelli, Nadim Khalil,
Karem Sakallah, and John Sopka for many fruitful discussions.

References
1. R. Hockney and C. Jesshope, "Parallel Computers," (Bristol: Adam
   Hilger, Ltd., 1981).
2. L. Nagel, "SPICE2, A Computer Program to Simulate Semiconductor
   Circuits," Memo no. ERL-M520, University of California, Berkeley
   (May 1975).
3. Guide to Multiprocessing on VAX/VMS (Maynard: Digital Equipment
   Corporation, Order No. AA-HP69A-TE, 1986).
4. S. Farnham, M. Harvey, and K. Morse, "VMS Multiprocessing on the VAX
   8800 System," Digital Technical Journal (February 1987, this
   issue): 111-119.
5. VAX/VMS System Services Reference Manual (Maynard: Digital Equipment
   Corporation, Order No. AA-Z501B-TE, 1986).
6. G. Jacob, A. Newton, and D. Pederson, "Direct Method Circuit
   Simulation Using Multiprocessors," Proceedings of the International
   Symposium on Circuits and Systems (May 1986): 170-173.
7. A. Newton, "The Simulation of Large Scale Integrated Circuits," IEEE
   Transactions on Circuits and Systems, vol. CAS-26 (September 1979):
   741-749.
8. R. Thomas, "Using the Butterfly to Solve Simultaneous Linear
   Equations," Laboratory Memorandum, Bolt, Beranek, and Newman, Inc.
   (March 1985).
9. F. Yamamoto and S. Takahashi, "Vectorized LU Decomposition
   Algorithms for Large Scale Circuit Simulation," IEEE Transactions on
   Computer Aided Design, vol. CAD-4, no. 3 (July 1985): 232-239.
10. G. Bischoff and S. Greenberg, "CAYENNE: A Parallel Implementation
    of the Circuit Simulator SPICE," Proceedings of the IEEE
    International Conference on Computer Aided Design (November 1986):
    182-185.
Dennis T. Bak
The Impact of VAX 8800 Design
Methodology on CAD Development
Contributing to the success of the VAX 8800 project was an integrated
CAD environment supporting the hardware design effort. A CAD group
dedicated to this single project was chartered to supply a smoothly oper
ating CAD process from initial design conception to final production.
The CAD environment evolved through a blending of existing tools avail
able in Digital with new tools developed outside the company. Gaps in the
environment were filled through extensive modification of existing tools
and new development efforts. The driving force behind the CAD process
was a design methodology, radical for its time but second nature now.
Past CAD Development Efforts
Prior to the mid-1970s, logic development efforts within Digital Equipment Corporation were largely done without the extensive use of CAD tools. Hand-drawn schematic diagrams were the primary means of expressing logic designs.

A major advance in design automation took place in the mid-1970s when the Stanford University Design System, or SUDS, began to be used within Digital. SUDS allowed the entry of schematics into and the extraction of net lists from a graphics database. Although it was a major step forward in the automation of design processes, SUDS required significant user training and experience to become an effective tool.

Building a SUDS database capable of being used by a computer opened a new avenue for the evolving CAD groups to automate their design processes. These groups soon developed a large body of programs to support net-list extraction, design analysis, placement and routing, and eventually manufacturing parts-lists generation. Simulation tools were developed to help verify the operations of a design before any actual hardware was available. The increased complexity of design drove CAD developers to provide more powerful CAD tools. In turn, logic designers soon grew increasingly dependent on CAD tools as their capabilities increased.

The design methodologies and the CAD tool suite evolved to support large-CPU designs,
such as the VAX 8600 family. SUDS eased the burden of entering and coping with design changes; however, the actual contents of its schematics differed little from those of the earlier hand-drawn ones. In large part the schematics entered by designers into SUDS correlated directly with the physical entity being built, showing all components and their pins.

At the inception of the VAX 8800 project in the early 1980s, a vast collection of CAD tools, written by many internal groups, had sprung up. Most of these tools required large ASCII data files and significant manual intervention by CAD experts. Although many aids were provided to develop design processes, they lacked the cohesiveness and simplicity needed to put a process directly into the hands of the designers.

At about this time, a number of significant advances were made in CAD technology. Engineering workstations were announced at prices that made it practical to put them directly into the hands of designers. Moreover, new design methodologies, such as structured computer aided logic design, or SCALD, were also developed. [1]

These methodologies could significantly improve the quality of design while decreasing the time to develop complex systems. Therefore, Digital made a commitment to use those methodologies on the VAX 8800 project to produce not only the product but a more productive way of developing it.
Design Methodology
The development of CAD tools for the VAX 8800 project was a considerable challenge to the CAD designers. The complexity of the VAX 8800 design, with its particular gate array implementation, demanded that the design quality be high before anything was committed to hardware. In fact, the project managers made a radical (for its time) commitment to simulate the entire design and verify its timing before any hardware was built. Therefore, the CAD process had to be designed not only to meet that goal but also to facilitate the rapid production of hardware once the design had proven acceptable. This section of the paper describes the methodology we followed to make the best use of our CAD tools. The next section describes those tools and how they were used.

The tool suite that evolved, pictured in Figure 1, supported both logical and physical design processes with checks and balances to ensure that the design topologies remained the same. Schematic diagrams, captured at an engineering workstation, were processed into a logical net list that was used by the simulation and verification tools. Once a logical design reached a certain level of maturity, it was mapped into a physical design. At that point a physical analysis, to determine delays and signal integrity, was performed. Placement and routing tools were then run to further refine the design. The part of the physical design database that represented the logical topology was then passed back to the logical side of the design process. There, a comparison was made to ensure that the physical and logical designs were congruent. The results of simulations based on the physical design were also passed to the logical process for comparison with the simulations based on the logical design. These mechanisms provided the primary checks to ensure that the logical design matched the physical one.

Figure 1  CAD Tool Suite (diagram not reproduced; legible labels: Designer, Manufacturing, Logical to Physical Mapping, Placement, Routing, Delays, Signal Integrity, Wire Rule Check, Rules Check, Interactive Cleanup, Manufacturing Interface Files, Reports, UNIX, VAX/VMS)

We decided that the best way to assure success was to develop a complete paper specification of the machine to be built. Once the overall goals for the machine had been established,
the designers developed the specifications for each major logic section. This high-level logical design was then partitioned into functions required within modules and gate arrays. These primary interfaces were specified before any detailed logic was developed. As it turned out, that partitioning remained relatively intact throughout the project.

The next step was to develop probe designs and abstract models for the most complex parts of the machine. These designs and models tested whether or not particular logic functions could be developed and timing constraints met. In some cases the probe designs were carried through to the actual fabrication of gate arrays or modules. This continuity allowed us to test the limitations of the selected ECL technology as well as the logic design.

The probe designs proved useful in many ways to both the designers and the CAD developers. The designers were able to verify that their logic implementations would work. The CAD developers were able to use the designs as test cases to develop and debug processes. These test cases proved to be critical to the project's success, especially when the finished design was given to the manufacturing organization. The process was so smooth, in fact, that designs flowed through it with few problems.
The Influence of SCALD
At the onset of the VAX 8800 project, we investigated the tools available within Digital for building a process to support the evolving design methodology. This study led the CAD team to explore several systems being developed by other companies. One system being developed by Valid Logic, Inc., the SCALDSystem CAD system, was procured by Digital. This system put the power of dedicated engineering workstations directly into the hands of logic designers. Of equal importance was the fact that the SCALDSystem CAD tools were being developed by the same people who conceived the SCALD approach to hardware design.

Logical schematics, requiring almost no information about the physical design, were entered into the SCALDSystem database. These schematics were entered in a hierarchical manner through an easy-to-learn graphical system. Such an arrangement encouraged the designers to
avoid the creation of paper schematics by transferring their concepts directly to the workstation screens.

The decomposition of the design was from the top down, but the actual entry of design data occurred simultaneously at many levels. A "design tree" evolved in which cells forming gate arrays were merged onto modules that plugged into the backplane to form a system. The logical design was entered via the SCALDSystem tools onto schematics. The physical implementation of that logical design was left to the physical design tools.
Simulation and Timing Verification
Simulation on the VAX 8800 project was approached from two different viewpoints. The first aimed to determine whether or not the performance goals of the proposed microarchitecture were within the necessary range, as specified by the project's needs. [2] This simulation started early in the project before any detailed logic design had been completed. Once those performance goals had been verified, the second level of simulation focused on the logic design as it evolved.

The designers could verify that each piece of the design functioned as specified while that piece was being developed. As the design tree evolved, the number of logic levels given to the simulation tools increased until the entire logic design had been entered. At this point the designers actually had the equivalent of a software breadboard of the entire VAX 8800 processor. Microcoded instructions were "running" on this software breadboard long before any hardware was available.

The ability to run instruction streams on the breadboard gave the project several advantages. Logic designers could debug their logic concurrently with the microcode developers verifying their microcode. Moreover, the diagnostics engineers could write as well as debug significant numbers of microdiagnostics much earlier than was usual in a design project. The early completion of those diagnostics allowed the first available hardware to be checked thoroughly.

Making the design logically correct through simulation did not ensure that the machine would work at the desired cycle time. In the
ECL technology used in the VAX 8800, signal timing was critical. Therefore, a timing verifier, part of the SCALDSystem tools, was used to ascertain whether or not the timing goals were being met. It was within the timing verifier that the influence of the physical implementation on the logical design was first felt. The logic designers had to ensure that the placement of gates and routing of signals were optimal for all critical elements. Delay information was then extracted from the physical design and fed back to the timing verifier.

Physical Design
As the logical design evolved, we developed a CAD process to convert it rapidly into a physical design. A set of automatic placement and routing tools, together with delay-estimation and signal-integrity tools, was used to give feedback to the designers. The important question here was whether or not they could build physical representations of their logic designs. These tools also passed data to the timing verifier, which analyzed the effect of the physical design on circuit timings.

Since all the logic had to be verified before any hardware was fabricated, all processes had to be designed to handle a large number of designs in parallel. The relevant Digital manufacturing facilities and outside vendors were acquainted with the physical design through the test cases rather than through an actual prototype. Thus the facilities and vendors could configure and debug their own manufacturing processes before any completed physical designs were sent to them.

To ensure a smooth transition into the fabrication phase, manufacturing engineers were assigned to work directly with the designers early in the design process. Thus these engineers became familiar with the VAX 8800 technology and the machine as it evolved. This was an important step because our manufacturing organization was to build all the hardware, including the prototypes. This early acquaintance with the design allowed them to develop manufacturing processes to support the rapid change to full volume shipments soon after the VAX 8800 system was announced. [3]

Computational Resources
One of the largest VAXcluster systems ever built was assembled to support the VAX 8800 project. This cluster consisted of 14 VAX-11/780 and VAX-11/785 systems with over 20 gigabytes of mass storage. Even this large amount of storage was inadequate at times to support the demands of the databases. Forecasting the computational requirements of this project proved difficult. The VAXcluster system provided the computational power and flexibility to grow as the demands increased.

The availability of sufficient computational resources was critical to the success of our project. The design methodology of extensive simulation was effective only with reasonable program run times. Once the design was verified, large numbers of physical designs were released for fabrication within a short period, which consumed significant computational and storage resources.

The Tool Suite
Design Data Management
A design data management (DDM) system was developed to organize the many files that contained the actual design data. At the heart of that system was the concept of a "design object." This object was some functional piece of the design, usually conforming to the physical partitioning. For example, each gate array and module in the system was defined as a design object. For each object we developed a hierarchy of subdirectories within the VMS file system. This separation of data files into subdirectories allowed various tools within the CAD process to know where to find input files and to write output files.

The design database was continually churning with new information. To give a stable picture as the overall design evolved, a "snapshot" of a design object could be taken at any time, thus generating a revision of the design object. New subdirectory file trees were then created for each revision. Using this scheme a designer could create a "frozen" revision of a design. He could then use that revision for simulations or other activities while changes were being made to another revision of the design.

The relationships between design objects were defined within a revision-matrix file kept with each file tree. This file defined the system level hierarchy of the machine: which design objects were subordinate to a given object. Using this file a designer working on a module
design could select frozen revisions of the gate array designs on that module and be assured of not having them changed as he worked on it.
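
The snapshot and revision-matrix scheme described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the DDM implementation: the directory layout (a `work` subdirectory plus `rev_N` subdirectories), the function names `snapshot` and `resolve`, and the use of a plain dictionary for the revision matrix are all hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def snapshot(object_dir: Path) -> Path:
    """Freeze the working files of a design object as a new revision.

    Assumed (hypothetical) layout: <object>/work holds the live files and
    <object>/rev_1, <object>/rev_2, ... hold frozen copies.
    """
    existing = [int(p.name.split("_")[1]) for p in object_dir.glob("rev_*")]
    next_rev = max(existing, default=0) + 1
    frozen = object_dir / f"rev_{next_rev}"
    shutil.copytree(object_dir / "work", frozen)
    return frozen


def resolve(root: Path, matrix: dict, parent: str) -> dict:
    """Map each subordinate object of `parent` to its frozen revision path."""
    return {child: root / child / f"rev_{rev}"
            for child, rev in matrix[parent].items()}


# Usage: freeze a gate array, keep editing the live copy, and read the
# revision that a module design has selected.
root = Path(tempfile.mkdtemp())
work = root / "gate_array_a" / "work"
work.mkdir(parents=True)
(work / "netlist.txt").write_text("version 1")

frozen = snapshot(root / "gate_array_a")
(work / "netlist.txt").write_text("version 2")  # later change to the live copy

matrix = {"module_x": {"gate_array_a": 1}}  # module_x uses frozen revision 1
view = resolve(root, matrix, "module_x")
frozen_text = (view["gate_array_a"] / "netlist.txt").read_text()
```

The point of the sketch is the separation of concerns: the live copy keeps changing, while anything resolved through the matrix sees only the frozen files.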

Another facility provided by the DDM system was a user interface to the design environment. This interface consisted of a simple command language for traversing the design trees and for running specific tools. Since these tools required a large number of input variables, we established a system of default parameters to minimize user input. For cases in which those defaults proved inadequate, users or CAD developers could change parameters to meet the design's needs.
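
A default-parameter scheme of this kind might look like the following sketch. The tool name, the parameter names, and the function `tool_parameters` are invented for illustration; the text does not describe the actual parameters or how the DDM system stored them.

```python
from collections import ChainMap

# Hypothetical defaults; the real parameter names are not given in the text.
TOOL_DEFAULTS = {
    "router": {"passes": 3, "grid": "fine", "report": True},
}


def tool_parameters(tool, overrides=None):
    """Return the defaults for `tool`, with any overrides taking precedence."""
    overrides = overrides or {}
    unknown = set(overrides) - set(TOOL_DEFAULTS[tool])
    if unknown:
        raise KeyError(f"unknown parameters for {tool}: {sorted(unknown)}")
    # ChainMap looks up keys in the first mapping before the second.
    return dict(ChainMap(overrides, TOOL_DEFAULTS[tool]))


default_run = tool_parameters("router")
custom_run = tool_parameters("router", {"passes": 5})
```

Rejecting unknown parameter names is a design choice made here so that a mistyped override fails loudly instead of being silently ignored.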
Schematic Capture
Using the Valid GED editor, logic schematics were entered directly into the workstations by the designers. The extracted wire lists were then transferred from the SCALDSystem UNIX-based workstation through a communications port to the VAXcluster system. The workstations were also interconnected in a networking environment, thus providing communication between them. To ease the burden on designers to learn multiple operating systems, only graphical data entry was permitted on the workstations. All the other CAD tools were run in the more native VAXcluster environment.

Since the majority of a designer's time was spent interacting with CAD tools on the VAXcluster system, there was no need for each designer to have a dedicated workstation for schematic capture. The ratio of designers to workstations of about two to one proved adequate. The easily learned GED editor supported a rapid increase in the number of nondesigners (managers, secretaries, and documentation writers) in the user community. All were drawn to the system by the ease of graphical data creation. Eventually, this documentation activity accounted for the majority of workstation usage.
Simulation and Timing Verification
Another proprietary tool, called the DECSIM system, was the primary simulator used on the project. This system supported mixed-level simulations, both structural and behavioral. The logical design was transferred hierarchically to the DECSIM system. This system allowed the designers to deal with complex designs by viewing the simulation in the same hierarchical form as the schematics. For complex devices, such as multiplier
chips and RAM devices, behavioral models were developed. These more efficient models increased the overall performance of the simulations. In the case of RAM devices, abstracting to a behavioral model also allowed the microcoded instructions to be loaded efficiently.
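
To illustrate why a behavioral model helps, the sketch below models a RAM by what it does (store and return words) instead of by its gates. It is not DECSIM code: the class name `BehavioralRam`, the sizes, and the choice to read unwritten locations as zero are assumptions made for the example.

```python
class BehavioralRam:
    """Behavioral RAM: models read/write behavior, not individual gates."""

    def __init__(self, depth, width):
        self.depth = depth
        self.mask = (1 << width) - 1
        self.cells = {}  # sparse storage; unwritten addresses read as 0 (assumption)

    def load(self, words, base=0):
        """Load an image (e.g. microcode words) without simulating write cycles."""
        for offset, word in enumerate(words):
            self.write(base + offset, word)

    def write(self, address, value):
        if not 0 <= address < self.depth:
            raise IndexError(address)
        self.cells[address] = value & self.mask  # truncate to the word width

    def read(self, address):
        if not 0 <= address < self.depth:
            raise IndexError(address)
        return self.cells.get(address, 0)


# Hypothetical 1K x 16 control store loaded with three words.
control_store = BehavioralRam(depth=1024, width=16)
control_store.load([0x1234, 0xBEEF, 0x1FFFF])
```

Because `load` writes the image directly into the model's storage, a simulator using such a model would not need to clock each word through simulated write logic.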

Complementing the functional simulation facilities of the DECSIM system was the timing verifier (TV) in the SCALDSystem tools. TV analyzed circuit timings to ensure that the design would work under worst-case conditions at the desired clock rate.

Wire delays are a major factor to be taken into account by timing verification. The placement of the physical gates was critical to minimize the wire lengths and hence the delays. Since the placement was not available in the initial design phases, statistical delays based on loading were used. As placement information became plentiful, the latest refined delays were sent to the timing verifier. When the physical design had been completed, delays based on routed lengths were used. If the required timing was not met at any point in the process, the offending circuits were redesigned or the layout was changed to correct the problem.
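
The progressive refinement of delay data can be sketched as follows. This is a deliberately simplified model, not the SCALD timing verifier: the names `NetDelay` and `failing_paths`, the fixed per-gate delay, and all numeric values are hypothetical, and a real verifier analyzes minimum and maximum delays along every path.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NetDelay:
    statistical_ns: float              # estimate from loading only
    placed_ns: Optional[float] = None  # refined once placement is known
    routed_ns: Optional[float] = None  # derived from routed wire length

    def best(self) -> float:
        """Prefer the most physically accurate figure available."""
        if self.routed_ns is not None:
            return self.routed_ns
        if self.placed_ns is not None:
            return self.placed_ns
        return self.statistical_ns


def failing_paths(paths, nets, gate_delay_ns, cycle_ns):
    """Return the names of paths whose total delay exceeds the cycle time."""
    failures = []
    for name, net_names in paths.items():
        total = sum(gate_delay_ns + nets[n].best() for n in net_names)
        if total > cycle_ns:
            failures.append(name)
    return failures


nets = {
    "n1": NetDelay(statistical_ns=2.0),
    "n2": NetDelay(statistical_ns=2.0, placed_ns=3.5),
    "n3": NetDelay(statistical_ns=2.0, placed_ns=1.5, routed_ns=1.0),
}
paths = {"alu_carry": ["n1", "n2"], "bypass": ["n3"]}
late = failing_paths(paths, nets, gate_delay_ns=1.0, cycle_ns=7.0)
```

In this example the `alu_carry` path passes on statistical delays alone but fails once the placement-based figure for `n2` arrives, which is the kind of late discovery that forced a redesign or layout change.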
Wirelisting and State Maintenance
The logic gates entered on schematics by the designers were, in general, assigned to physical components by the CAD process. This mapping occurred initially within the SCALDSystem postprocessor software using a random gate-to-component assignment. This random packaging was then fed into a system called YAWL (for Yet Another WireLister). YAWL acted as a general purpose wirelister, generating interfaces to many tools and accepting feedback from the physical design tools.

As the physical design process refined the gate assignment, YAWL ensured that the logical design topology did not change. By accepting feedback data from the placement and routing tools and the physical design system, YAWL caught any illegal changes that would have altered the logic functions.
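
One way to picture this kind of check is sketched below. It is not YAWL's algorithm, which the text does not describe. The sketch assumes the gate-to-component assignment is already known, so it is simpler than a full graph-isomorphism comparison; the data layout and the name `topology_matches` are hypothetical.

```python
def topology_matches(logical_nets, physical_nets, assignment):
    """Check that the physical nets connect the same pins as the logical nets.

    logical_nets:  {net_name: set of (gate, pin)}
    physical_nets: {net_name: set of (component, pin)}
    assignment:    {(gate, pin): (component, pin)}
    Net names are ignored; only connectivity is compared.
    """
    mapped = {frozenset(assignment[p] for p in pins)
              for pins in logical_nets.values()}
    actual = {frozenset(pins) for pins in physical_nets.values()}
    return mapped == actual


# Two logical gates packaged into one hypothetical component, U1.
logical = {
    "A": {("g1", "out"), ("g2", "in1")},
    "B": {("g2", "out")},
}
assignment = {
    ("g1", "out"): ("U1", 3),
    ("g2", "in1"): ("U1", 4),
    ("g2", "out"): ("U1", 6),
}
good_physical = {"N1": {("U1", 3), ("U1", 4)}, "N2": {("U1", 6)}}
bad_physical = {"N1": {("U1", 3)}, "N2": {("U1", 4), ("U1", 6)}}

good_ok = topology_matches(logical, good_physical, assignment)
bad_ok = topology_matches(logical, bad_physical, assignment)
```

The second physical netlist connects the same pins differently, so it would change the logic function and is rejected, while a legal reassignment of gates only changes `assignment` and still passes.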

Eventually, the complexity of maintaining the state became so great that YAWL alone could not cope with it. Therefore, several other programs were placed in the feedback loop from the physical design tools to detect changes made in the process of manually cleaning up the physical design. These programs were needed since,
even at that late stage, a designer could still add logic to the design. The CAD process therefore had to handle these additions as well as detect illegal transformations to the logic. The resolution of these changes took a lot of resources, both in terms of time and computer power.

In addition to being the state maintainer, YAWL acted as a primary source of the design data needed for the remainder of the CAD process. YAWL created many reports to inform designers of problems between their logical and physical designs. Most of the interface files in the CAD process were read by YAWL, written by it, or both; YAWL thus played a key role in the overall process.
Placement and Routing
Two processes were developed for the placement and routing of gate-array and module designs. The gate array process was highly automated, requiring a minimum of interaction by the designers. The process was organized to make several runs from which a designer could select the one that best optimized his logic design.
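
The idea of making several runs and keeping the best one can be sketched as follows. This is a toy illustration, not the project's placement tool: random placement, the half-perimeter wire-length measure, and automatic selection of the shortest result are all assumptions (in the process described above, the designer made the selection).

```python
import random


def wire_length(placement, nets):
    """Sum of half-perimeter bounding boxes over all nets."""
    total = 0
    for cells in nets:
        xs = [placement[c][0] for c in cells]
        ys = [placement[c][1] for c in cells]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total


def random_placement(cells, columns, rng):
    """Assign each cell a distinct slot on a grid with `columns` columns."""
    slots = list(range(len(cells)))
    rng.shuffle(slots)
    return {c: (s % columns, s // columns) for c, s in zip(cells, slots)}


def best_of_runs(cells, nets, columns, runs, seed=0):
    """Make `runs` independent placements and keep the shortest one."""
    rng = random.Random(seed)
    candidates = [random_placement(cells, columns, rng) for _ in range(runs)]
    return min(candidates, key=lambda p: wire_length(p, nets))


cells = ["c0", "c1", "c2", "c3", "c4", "c5"]
nets = [("c0", "c1"), ("c1", "c2", "c3"), ("c4", "c5"), ("c0", "c5")]
best = best_of_runs(cells, nets, columns=3, runs=20)
```

A fixed seed keeps the runs reproducible, which matters when a selected result has to be regenerated later in the process.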

The bounded problem of placement and routing within a gate array was easy to solve in comparison to the module designs. In module designs, the constraints placed by designers, the limitations of tools, and the complexities of design required extensive human intervention.

Analysis tools were used extensively to assist in determining the quality of design at the two design levels: gate arrays and modules. These tools analyzed such factors as thermal dissipation, signal integrity, and cross talk. The constraints defined in these tools and in the extensive design-rule checkers were met, thus ensuring a high-quality design.

Most of the tools used for the physical design were developed within Digital. Those developed outside the VAX 8800 CAD group were modified, sometimes extensively, to meet the needs of the project.
Physical Design and
Manufacturing Interface
A proprietary physical design system, called the VAX layout system (VLS), was used for the final physical design tasks. VLS took the physical design, as given by the placement and routing
tools, and added the data required to manufacture the design. A layout designer, through the VLS interactive graphics system, could manually complete the routing that could not be handled by the automatic tools. Some additional parts that were necessary for fabrication, such as handles for modules, were also added at this time. The net result was a complete design, specified so that it could be used to manufacture the product.

The design data was then collected to form a release package. To keep track of the formal release of design data, a system called POST was developed by the CAD group. POST provided an on-line database, which any member of the project team could query to determine the release status of a design.
Problems Imposed by the
Design Methodology
Up to this point, we have described the basics of the design methodology used to develop the VAX 8800 system and some highlights of the CAD tools supporting that methodology. As mentioned earlier, the CAD process was placed directly into the hands of the designers. Thus a tight coupling was established between the process of design and the design process. This coupling posed several major problems, as now described, for the CAD group.
Training
With direct control of a process or tool given to the designers, they all now needed extensive training. On previous projects, one highly knowledgeable individual could run a tool; now, there were 30 or so novice users all learning to use that same tool. Extensive support for those users, in terms of both trainers and documentation, had to be provided.

In most cases the designers quickly learned how to utilize the tools. In a few cases, the placement of modules in particular, placement experts were needed owing to the specialized nature of the task. In summary, the extent of the support required by users was greater than anticipated.
State Maintenance
The task of state maintenance proved to be extremely complex owing to the freedom given to designers to make changes at almost any point
in the design process. To ensure that the logical and physical designs matched, it was necessary to do a complete isomorphic comparison of the physical topology against the logical topology of the design.
Logical Prints
The schematics generated by the designers at their workstations represented the logical design, not the physical one. Certain features available in the SCALDSystem tools, such as vectorized signals and gates, allowed them to produce a concise representation of the logic. This came, however, at the expense of not putting physical data back onto the print set. For reasons of state maintenance, we were also unable to restructure a print set once mapped to a physical implementation. Both these factors contributed to a print set that appeared quite different from those generated by previous projects.

Logical print sets, while initially envisioned as being beneficial, later caused problems in documenting the designs. This was particularly true for module-level designs, for which training was needed so that groups outside the project team could interpret the new symbology.
Cross References
Using logical print sets alone, a technician could not probe a pin of the physical boards. Since an abstract mapping took place in the CAD process, it was necessary to develop an extensive set of cross references showing the mapping of the logical to the physical design. These cross references proved to be cumbersome and, when printed, consumed vast amounts of paper.
Libraries
CAD tools run on libraries, and each major tool has its own format for library data. These libraries must be consistent across the entire process. Despite all the safeguards built into the process, we found that inconsistencies still crept back into the database. Discovering and eliminating those inconsistencies, many of which were found late in the project, consumed a lot of time.
Summary
Both the design methodology and the CAD process supporting the VAX 8800 project were quite successful. The first prototype hardware
delivered to us worked as expected. We found only a small number of hardware problems during the prototype debug phase of the project. Most of those problems were in areas that had not had extensive simulation or timing verification.

Some general conclusions reached from the VAX 8800 project can help future CAD designers to improve their tools.

• A close coupling from the start, both physically and organizationally, between all groups associated with the project leads to the development of a smooth process flow.
• The design methodology has a direct and far-reaching impact on the CAD process. The capabilities of CAD tools directly affect the design methodology.
• Extensive simulation and timing verification before fabrication can help to achieve a high-quality product.
• The impact of radical changes (e.g., in the data content of schematics) must be appreciated and then taken into account by all project members.

In future projects we will focus on reducing the process-loop times and enhancing the capabilities of the simulation and timing verification tools. It will be easier to function in future design environments, and more tools will be placed directly into the hands of the designers. The design methodology will be modified to make the resolution of the design state easier and therefore faster.
References
1. Structured Computer Aided Logic Design was developed at Lawrence Livermore Laboratories and applied there to the design of the S-1 computer.
2. C. Wiecek, "The Simulation of Processor Performance for the VAX 8800 Family," Digital Technical Journal (February 1987, this issue): 100-110.
3. A. Matthews, "On-line Manufacturing Data Access on the VAX 8800 Project," Digital Technical Journal (February 1987, this issue): 136-141.
Andrew J. Matthews
On-line Manufacturing Data
Access on the VAX 8800 Project
Previously, the transition from design to manufacture involved transferring significant amounts of data on paper. To minimize product start-up time, the VAX 8800 project used an on-line system that eliminated much of the paper. The key task was transforming the data from existing CAD tools with different formats into manufacturing data. Two generic types of VMS files, DATA and DRAWING, contained data for each Part Number and Revision Number. VMS's subdirectory and access-control capabilities provided total revision control. Manufacturing engineers pulled files at will using DATA files to drive their processes and viewing DRAWING files from VAXstation II workstations.
A key objective for the VAX 8800 project was to
rather th