Digital Technical Journal, Number 4, February 1987
Page Count: 144
Digital Technical Journal
Number 4, February 1987

Editorial Staff
Editor - Richard W. Beane

Production Staff
Production Editor - Jane C. Blake
Designer - Charlotte Bell
Interactive Page Makeup - Leslie K. Schoemaker

Advisory Board
Samuel H. Fuller, Chairman
Robert M. Glorioso
John W. McCredie
Mahendra R. Patel
F. Grant Saviers
William D. Strecker

The Digital Technical Journal is published by Digital Equipment Corporation, 77 Reed Road, Hudson, Massachusetts 01749.

Changes of address should be sent to Digital Equipment Corporation, attention: Media Response Manager, 200 Baker Ave., CFO1-1/M94, Concord, MA 01742.

Comments on the content of any paper are welcomed. Write to the editor at Mail Stop HLO2-3/K11 at the published-by address. Comments can also be sent on the ENET to RDVAX::BEANE or on the ARPANET to BEANE%RDVAX.DEC@DECWRL.

Copyright © 1987 Digital Equipment Corporation. Copying without fee is permitted provided that such copies are made for use in educational institutions by faculty members and are not distributed for commercial advantage. Abstracting with credit of Digital Equipment Corporation's authorship is permitted. Requests for other copies for a fee may be made to the Digital Press of Digital Equipment Corporation. All rights reserved.

The information in this journal is subject to change without notice and should not be construed as a commitment by Digital Equipment Corporation. Digital Equipment Corporation assumes no responsibility for any errors that may appear in this document.

ISBN 1-55558-001-7
Documentation Number EY-6711E-DP

The following are trademarks of Digital Equipment Corporation: DEC, DECnet, the Digital logo, LN03 Plus, MicroVAX I, MicroVAX II, NMI, PDP-11, PDP-11/24, PDP-11/44, RSX, RSX-11M, RSX-11M-PLUS, SBI, UNIBUS, VAX, VAX-11/750, VAX-11/780, VAX-11/782, VAX 8200, VAX 8300, VAX 8500, VAX 8550, VAX 8600, VAX 8650, VAX 8700, VAX 8800, VAXBI, VAXBI 78732, VAXcluster, VAXstation,
VAXstation II, VMS.

ADA is a registered trademark of the U.S. Government.
Data General is a registered trademark of Data General Corporation.
Harris is a trademark of Harris Corporation.
IBM is a registered trademark of International Business Machines Corporation.
Lightspeed is a trademark of Lightspeed Computers, Inc.
Motorola is a registered trademark of Motorola, Inc.
SCALDsystem and ValidGED are trademarks of Valid Logic, Inc.
TK!Solver is a trademark of Software Arts, Inc.
UNIX is a trademark of American Telephone & Telegraph Company Bell Laboratories.

Cover Design

This issue features the VAX 8800 family. Our cover depicts the growth of a chambered nautilus as a metaphor for the growth of the VAX family. As those chambers spiral from the center, so the power of the VAX family grows from the MicroVAX systems, through the VAX 8200 and 8300 CPUs, to the new VAX 8800 multiprocessor. The image was created using the Lightspeed system. The cover was designed by Deborah Falck, Eddie Lee and Tsuneo Taniuchi of the Graphic Design Department.

Book production was done by Educational Services Media Communications Group in Bedford, MA.

Contents

Foreword
Donald J. McInnis

New Products

An Overview of the Four Systems in the VAX 8800 Family
Robert M. Burley

The VAX 8800 Microarchitecture
Sudhindra N. Mishra

The CPU Clock System in the VAX 8800 Family
William A. Samaras

Aspects of the VAX 8800 C Box Design
John Fu, James B. Keller, and Kenneth J. Haduch

The Memory System in the VAX 8800 Family
Paul J. Natusch, David C. Senerchia, and Eugene L. Yu

Floating Point in the VAX 8800 Family
John H.P. Zurawski, Kathleen L. Pratt, and Tracey L. Jones

The VAX 8800 Input/Output System
James P. Janetos

The VAXBI Bus - A Randomly Configurable Design
Paul C. Wade

A Logical Grounding Scheme for the VAX 8800 Processor
Michael W. Kement and Gerald J. Brand

The Simulation of Processor Performance for the VAX 8800 Family
Cheryl A. Wiecek

VMS Multiprocessing on the VAX 8800 System
Stuart J. Farnham, Michael S. Harvey, and Kathleen D. Morse

A Parallel Implementation of the Circuit Simulator SPICE on the VAX 8800 System
Gabriel P. Bischoff and Steven S. Greenberg

The Impact of VAX 8800 Design Methodology on CAD Development
Dennis T. Bak

On-line Manufacturing Data Access on the VAX 8800 Project
Andrew J. Matthews

Editor's Introduction

Richard W. Beane
Editor

This issue features papers about the design of the VAX 8800 family of CPUs, written by members of the design team. The technology used in Digital's latest high-end machine, the VAX 8800 multiprocessor, also forms the basis for the other three family members: the 8700, 8550, and 8500 CPUs.

Bob Burley's overview relates the processes used in the 8800 design and the functions of the memory interconnect (NMI), the VAXBI I/O bus, and the four logic boxes forming the five-stage pipeline. The early discovery of design flaws and the use of automated tools helped to achieve an aggressive completion schedule.

The micromachine implements the microarchitecture and contains four of the five pipeline stages. Sudhin Mishra describes how microinstructions are handled, emphasizing the use of microbranches and microtraps to ensure coherency.

The VAX 8800 clock system, discussed by Bill Samaras, was designed using an automated timing verifier. He describes the trade-off between using the verifier and maximizing the accuracy of timing signals by minimizing their skew.

The C Box and the M Box are two parts of the pipeline. John Fu, Jim Keller, and Ken Haduch describe the C Box's no-write allocate cache and the delayed-write algorithm that ensures correct write-through. The C Box must also handle pipeline stall conditions and maintain data coherency between processors. The M Box handles read and write requests for the memory arrays. Paul Natusch, Dave Senerchia, and Gene Yu explain how the designs of the NMI and the cache affected their design, and why they used TTL in the memory controller.

The VAX 8800 family does not have a separate floating point accelerator. As John Zurawski, Kathy Pratt, and Tracey Jones point out, however, a custom ECL unit achieves high performance through the normal datapaths. Thus less hardware is needed, and operands are fetched faster.

I/O devices are linked to the CPU by the VAXBI bus. In his paper, Jim Janetos discusses the NBI adapter, which contains logic to handle CPU references and DMA requests. Then Paul Wade describes how the VAXBI design team had to abandon the traditional approach and use a variety of techniques to specify the bus. Some chip problems were resolved only after a thorough analysis of the physical configuration.

Jerry Brand and Mike Kement discuss the importance of using ground correctly as a signal conductor to achieve high performance. They describe the sources of ground-related noise in the CPU, and what they did to isolate and control those sources.

Many VMS features support multiprocessing. Stu Farnham, Mike Harvey, and Kathy Morse first describe the hardware that supports multiprocessing, then the interlocked instructions, exception handlers, and traps that implement VMS multiprocessing.

To show how multiprocessing decreases execution time, Gabriel Bischoff and Steve Greenberg converted the SPICE circuit simulator into CAYENNE, a parallel program. They created master and slave processes that ran CAYENNE 1.7 times faster than SPICE.

The final two papers relate some of the automated tools and techniques used on the 8800 project. Dennis Bak first describes building the CAD suite from existing tools, newly developed ones, and modifications. The methodology was truly innovative, serving as a framework for future projects. Then Andy Matthews discusses the on-line system that transformed CAD data into specifications used by Manufacturing. This system minimized the product start-up time by eliminating paperwork.

Biographies

Dennis T. Bak

Dennis Bak is a principal software engineer in the Advanced VAX Development Group. As a project leader, he is currently developing new CAD tools to improve designer productivity on future design projects. In other positions, Dennis performed configuration testing for PDP-11 and VAX systems. Prior to joining Digital in 1980, he worked as a research engineer at Ford Motor Company, doing advanced development on electronic engine-control systems. Dennis earned a B.S. degree in electrical engineering from the University of Michigan in 1974.

Gabriel P. Bischoff

In 1985, Gabriel Bischoff joined Digital after receiving a Diploma of Engineer and a Diploma of Advanced Studies in device physics from the Ecole Centrale de Lyon (1980) and a Ph.D. degree in E.E. from Cornell University (1985). As a senior software engineer in the Semiconductor Engineering Group, he is investigating the application of parallel computing architectures for VLSI CAD tools, particularly circuit simulators. Gabriel developed a parallel version of the circuit simulator SPICE for shared-memory multiprocessors. A member of IEEE, he has published papers on device modeling and circuit simulation.

Gerald J. Brand

Jerry Brand is a principal engineer currently developing high-density, high-availability power systems. Prior to working on the power and packaging team for the VAX 8800 family, he designed two MPS power modules that are widely used in Digital's products. Before joining Digital in 1980, Jerry worked for over 14 years in disciplines ranging from oceanography to gas-turbine instrumentation. He holds a B.S.E.E. degree from the University of Illinois and participated in the M.S.E.E. program at the University of New Hampshire. Jerry teaches circuit analysis and electronics in the continuing education program at the University of Lowell.

Robert M. Burley

As a senior product management manager, Bob Burley was the engineering product manager for the four systems in the VAX 8800 family. As a program manager in the LSI Acquisition and Test Group, he was responsible for relations with external vendors and acquiring technologies for the advanced gate arrays used in new CPU designs. Prior to joining Digital in 1980, Bob was a product and business development manager at Colt Industries, Inc., and a product and manufacturing manager at Scott Paper Company.
He earned his B.S. degree in mathematics and economics from Hobart College in 1965.

Stuart J. Farnham

As a principal software engineer in the VMS Development Group, Stu Farnham is currently working on future directions in multiprocessing. Earlier, he provided VMS support at the corporate level for Software Services. Stu was a developer and instructor for the VAX/VMS Systems Seminar. He joined Digital in 1982 after working as a software engineer at Pitney Bowes, Inc.

John Fu

Currently earning his M.S. degree in computer science at the University of Illinois, John Fu was a principal engineer on the VAX 8800 project. He worked on the design of the C Box and configurations for the VAX 8800 family. Formerly, he worked on large-systems designs at International Computers Limited and on microprocessor control systems for Siemens Limited. John was also a project manager at Systems and Software, Inc. He received a B.Sc. (Hons) in computer science (1977) from the University of Manchester in England. John is a member of the British Computer Society and the IEE in England.

Steven S. Greenberg

As a team leader in the CAD Department of the Semiconductor Engineering Group, Steve Greenberg codeveloped the CAYENNE program. An early provider of circuit and process simulators at Digital, he did research in timing verification and circuit simulators. As a Digital industrial fellow at the University of California at Berkeley, Steve performed research on iterated timing analysis. Before joining Digital in 1976, he was a member of the technical staff at RCA and a CAD engineer at Texas Instruments. Steve received a B.S.E.E. degree (1966) from M.I.T. and an M.S.E.E. degree (1979) from Northeastern University.
He is a member of IEEE and Tau Beta Pi.

Kenneth J. Haduch

In 1974, Ken Haduch joined Digital after earning his Associate in Electronic and Computer Technology degree from the Electronic Institutes, Pittsburgh. He worked as a technician in Manufacturing on the PDP-11/70 and VAX-11/780 CPUs and in Engineering on the DR750 and FP750 designs. Ken helped to develop the C Box as a hardware designer on the VAX 8800 project. He is currently a hardware engineer in the Advanced VAX Development Group, working on the hardware design for a new VAX processor. Ken is also pursuing a B.S. degree from Northeastern University.

Michael S. Harvey

Mike Harvey joined Digital in 1978 after receiving his B.S. degree in computer science from the University of Vermont. He worked on developing the RSX-11M and RSX-11M-PLUS operating systems and then led the team that developed the VAX-11 RSX layered product for the VMS system. Since joining the VMS Development Group, Mike has participated in new processor support for the VAX 8300 and 8800 systems, specializing in multiprocessing. As a principal software engineer, he is currently working on future directions for VMS multiprocessing and support for high-end VAX CPUs.

James P. Janetos

Jim Janetos is currently studying computer architecture as a graduate student at Purdue University. He joined Digital in 1980 after receiving his B.S.E.E. degree (Summa Cum Laude) from the University of Michigan, where he was elected to Tau Beta Pi. As a design engineer, Jim worked on memory upgrades for the PDP-11/24 and 11/44 systems, on memory system designs, and on dynamic RAM evaluations. On the VAX 8800 project, he initially worked on the diagnostic software for the I/O adapter, the NBI.
Later, he designed the NBIB module, one of the two modules in the NBI.

Tracey L. Jones

Earning her B.S. degree in computer engineering from Boston University, Tracey Jones joined Digital after graduation in 1982. As a firmware engineer in the Advanced VAX Engineering Group, she wrote a major portion of the microcode that performs floating point operations in the VAX 8800 family of processors. After promotion to senior engineer, Tracey enrolled in Digital's Graduate Engineering Education Program and is now pursuing an M.S. degree in electrical engineering at Brown University.

James B. Keller

Jim Keller is the project leader for the instruction-fetch and execution units, the I and E Boxes, and the console for a new VAX processor. On the VAX 8800 project, he worked on the design of the C Box. Prior to joining Digital in 1982, Jim worked on fiber optics and the designs of several microprocessor boards at Harris Corporation. He earned a B.S. degree in electrical engineering in 1980 from Pennsylvania State University, where he was elected to Eta Kappa Nu. Jim has applied for three patents on the technology in the VAX 8800 design.

Michael W. Kement

Mike Kement is a senior design engineer in the Power System Technology Group, currently working on EMI and EMC. He was the design engineer for the power system on the VAX 8800 project. Mike has worked on the power systems of many products since joining Digital in 1974, including the LA36 and LA180 terminals, the PDP-11/44, VAX-11/780 and 11/750 systems, and the VAX 8600 CPU.

Andrew J. Matthews

As a senior software manager in the Advanced VAX Systems CAD Group, Andy Matthews is currently automating the CAD to CAM transition. He has managed the development of surface-mount CAD processes and a pilot program of advanced CAD to CAM data methods.
Andy designed the prototype and first release of VLS, the VAX layout software Digital uses for module design. He worked for Adage, Inc., as the manager of applications programming before coming to Digital in 1977. Andy holds a B.S. degree in C.S. and M.E. (1968) from Boston University. He has presented two papers at the Design Automation Conference.

Sudhindra N. Mishra

Sudhin Mishra is a project leader in the Advanced VAX Development Group, currently developing a design verification CAD tool. As a principal engineer on the VAX 8800 project, he designed and implemented most of the I Box and originated the system-level simulation of the CPU. Before joining Digital in 1982, he was a senior research engineer at Prime Computers, Inc. Sudhin has worked on projects ranging from radar and heat-seeking missiles to computers. He earned a B.Sc. degree in engineering from Ranchi University and an S.M. in E.E. and C.S. from M.I.T. Sudhin has applied for a patent on the technology in the VAX 8800 design.

Kathleen D. Morse

As a consulting software engineer, Kathy Morse is responsible for all low-end CPUs and peripherals. She is also one of the designers for future directions in VMS multiprocessing. Kathy provided VMS support for the VAX-11/782 and MicroVAX I and II systems, and the MA780 memory. She joined Digital after receiving her B.S.C.S. degree (1976) from Worcester Polytechnic Institute, where she also earned her M.S.C.S. degree (1985). Kathy is a member of IEEE, the Professional Council, ACM, Tau Beta Pi, and Upsilon Phi Epsilon. She has published in the Computer Measurement Group's Conference Proceedings, the Digital Technical Journal, and DATAMATION.

Paul J. Natusch

As a principal hardware engineer, Paul Natusch is currently managing the hardware development for a new VAX processor in the Advanced VAX Development Group. On the VAX 8800 project, he was a member of the memory system team and later took over as its leader. Earlier, he worked on an upgrade to the VAX-11/750 memory controller, which expanded it from 2MB to 8MB. Paul joined Digital in 1980 from Storage Technology Corporation, where he was a diagnostic engineer. He received his B.S.E.E. degree from Cornell University in 1979 and an M.B.A. degree from Northeastern University in 1985.

Kathleen L. Pratt

Educated at Rensselaer Polytechnic Institute, Kathy Pratt came to Digital after receiving her B.S. degree in computer and systems engineering in 1980. She worked on hardware designs for networks in the Local Area Networks Group, then on the design of the floating point hardware for the VAX 8800 central processor in the Advanced VAX Development Group. Kathy is currently a senior engineer working on the floating point design for a new VAX processor.

William A. Samaras

Bill Samaras is a principal engineer working to design a new VAX processor. He joined Digital in 1982 to design the clock system on the VAX 8800 project. Formerly, at Accutest Corporation, Bill designed VLSI testers and timing systems. He holds an Associates degree (1973) from Northern Essex Community College, and B.S. degrees in engineering technology (1975) and electrical engineering (1976), both from Southeastern Massachusetts University. Bill teaches digital electronics for continuing education at the University of Lowell. He has applied jointly for a patent on the technology in the 8800 clock system.

David C. Senerchia

Dave Senerchia is currently a senior engineer in the Electronic Storage Development Group. He is a member of the design team working on the main memory for a new mid-range VAX system. On the VAX 8800 team, Dave designed the initial array module for main memory and participated in the architecture and design of the memory system, the M Box. He joined Digital in 1982 after earning a B.S. degree in electrical engineering from Washington University.

Paul C. Wade

As a principal engineer, Paul Wade is working on advanced development for future VAX CPUs. He was responsible for the electrical design, verification, and testing for the VAXBI bus. Paul also designed parts of the VAX 8200 system. Before joining Digital in 1980, he worked as a project engineer at Microwave Semiconductor Corporation, RCA, and Lockheed Electronics. Paul earned a B.S.E.E. degree (1973) from Newark College of Engineering. He holds a patent on gallium arsenide technology and has written nine papers on that topic. One paper won the Beatrice Winner Award at the 1980 ISSCC.

Cheryl A. Wiecek

Cheryl Wiecek is the engineering manager of the Systems Architecture Group and is responsible for the VAX architecture and a number of Digital's interconnect architectures. She worked on VAX instruction-set characterization and performance simulation for the VAX 8800 CPU. Cheryl also worked on PDP-11 performance simulation after coming to Digital in 1978. She was a programmer/analyst at the Connecticut Education Association and taught mathematics in Connecticut. Cheryl holds a B.A. degree in mathematics (1974) and an M.S. degree in computer science (1979) from the University of Connecticut. She has published five papers on computer performance in ACM and IEEE journals.

Eugene L. Yu

Gene Yu is a senior design engineer in the Workstation Engineering Group at Palo Alto.
On the VAX 8800 project, he designed the memory system interface to the memory interconnect, the NMI. Before joining Digital in 1982, Gene worked at Prime Computer as a hardware designer on their 400 and 9900 systems, and at Data General Corporation on Nova products. He earned a B.S. degree in electrical engineering from the University of Massachusetts. Gene has applied for a patent as coinventor of the NMI and memory design for the VAX 8800 CPU.

John H.P. Zurawski

John Zurawski is a consulting engineer working as the project leader for computer arithmetic in the Advanced VAX Development Group. He led the team that designed the floating point strategy and hardware for the VAX 8800 family. John joined Digital in 1982 from the University of Manchester, where he was a post-doctoral research associate. He holds a B.Sc. degree in physics (1976), and M.Sc. (1977) and Ph.D. (1980) degrees in computer science, all from the University of Manchester. A member of IEEE, John has published four papers on computer technology.

Foreword

Donald J. McInnis
Group Manager, Advanced VAX Engineering

Since the announcement of the VAX-11/780 system in November 1977, Digital Equipment Corporation has steadily expanded the VAX family with new VAX products: the VAX-11/750, VAX-11/730, MicroVAX I, VAX-11/725, VAX-11/785, VAX 8600, MicroVAX II, VAX 8650, VAX 8200, and VAX 8300 systems. The market acceptance of the VAX family has been excellent across almost all computing applications. This remarkable and steady increase in the use of VAX systems creates a continuous demand by the VAX customer base for enhanced products across all segments of the computing industry.

In the fall of 1982,
the development team for the 8800 project (known internally as "Nautilus") was assigned the responsibility of designing new systems to enhance the mid-to-high end of the VAX family.

This issue of the Digital Technical Journal represents a sampling of the types of design engineering that went into the VAX 8800 family. It takes an amazingly large number of different engineering disciplines to design and manufacture a product of this complexity. As time moves on, each successive development project seems to require a bigger investment in a larger number of disciplines to produce a product attractive to the marketplace. It is unfortunate that neither time nor space permits us to give proper visibility to all the design, manufacturing, and customer-service engineering efforts that led to the shipment of the VAX 8800 family.

The VAX 8800 family consists of four new processors: the VAX 8800, VAX 8700, VAX 8550, and VAX 8500 CPUs. The VAX 8800 family and the VAX 8200 system introduced a major new I/O bus, the VAXBI. We also introduced a completely new set of I/O adapters for the VAXBI bus, which will be the new foundation I/O channel for many future mid- to high-end VAX systems. The VAXBI bus will replace the UNIBUS on this class of system. The VAXBI offers a six-fold increase in performance and substantially better reliability and maintainability features in comparison to the UNIBUS.

The 8800 represents a significant advance into new areas of high-performance computing for the VAX family. A customer can replace a VAX-11/780 CPU with a VAX 8800 CPU in the same footprint and effect an order of magnitude increase in the amount of work done. The VAX 8500 CPU is really a replacement product for the VAX-11/785 CPU kernel.
However, the 8500 has the same price, twice the performance, and one third the footprint.

To produce a product that has a good price/performance ratio in the marketplace, you have to push hard on some dimensions of technology. A number of new pieces of technology were introduced on the VAX 8800 project, such as the 22-layer backplane and a 480-pin, zero insertion force connector. In the VLSI technology area, one 8800 includes a total of 186 emitter-coupled logic (ECL) gate arrays and a total of 28 custom-designed ECL parts.

The cycle time of a VAX CPU is a large determinant in its performance. The challenge of meeting a 45-nanosecond cycle time (versus 200 nanoseconds for the 11/780) required significant advancements in technology implementation and in CAD tools for analysis.

Enhancements were made to the base operating system software for the VAX 8800 processor. These software enhancements represent a basic technological change that is available to our customers. The VMS operating system was improved significantly to provide much better throughput for customers using the VAX 8800 dual processor as a general-purpose system. The ULTRIX-32 operating system was enhanced to support tightly coupled multiprocessing. Software library structures were also developed for customers who might want to improve the throughput of a single job by decomposing it to run in parallel on the tightly coupled dual processors of an 8800.

To meet the performance goals, the overall design of the VAX 8800 system is necessarily quite complex and was potentially difficult to implement quickly and correctly. We understood this from the beginning of the project, based on our understanding of the experiences of previous projects (e.g., the VAX-11/750, VAX 8600, and J11 VLSI CPU chip projects). To manage that complexity in a timely manner, we selected some key strategies and stuck with them through the completion of the project. They proved to be very successful since the hardware prototypes were relatively error free, and the manufacturing start-up was very smooth and rapid. Some of these strategies are as follows:

• The project followed a structured design methodology that ensured the completion of comprehensive specifications before any detailed design was done.

• We made a large investment in our CAD team and in CAD tools to automate the design process.

• The basic design was managed by a chief architect.

• The system was simulated extensively before we built any hardware. (We finished the project with 14 VAX-11/780 and 11/785 systems in our cluster. During our peak simulation effort, however, over 30 dedicated VAX systems were used for a period of several months.)

• Since many different engineering and manufacturing locations were involved, we made extensive use of Digital's worldwide network for electronic mail and data exchange.

A more important factor than any of the above examples, however, was the people who worked on the project. We attempted to build an excellent team that worked well together. The attribute of teamwork and the willingness of people to have a broad engineering focus proved to be invaluable, especially in the simulation and prototyping phases. The core management team started with very experienced people, most of whom had VAX-11/780 or VAX-11/750 development experience: Sas Durvasula, VAX 8500 project manager; John Hittell, manufacturing manager; Steve Jenkins, engineering manager; Nancy Kronenberg, VMS engineering; Bob Kusik, CAD manager; Steve Omand, customer service engineering; and Bob Stewart, chief architect. Many contributors at the next level also had similar backgrounds, and all remained in place for the duration of the project. This continuity was a major factor in completing a very successful project and a very successful family of products.

Robert M. Burley

An Overview of the Four Systems in the VAX 8800 Family

The VAX 8800 multiprocessor and the VAX 8700, 8550, and 8500 systems all derive from the same fundamental design. Their sustained applications throughput ranges from 3.0 to 12 times that of the VAX-11/780 system. In the design process, automated tools helped to correct design bugs early. ECL technology and a two-phase clock system achieve a 45-nanosecond cycle time. Microinstructions are processed simultaneously through four logic boxes that implement a five-stage pipeline. A high-speed memory interconnect, the NMI bus, links CPUs to memory and the I/O subsystem, which connects to VAXBI buses. Many reliability features, including extensive diagnostics, are implemented.

Design work on the VAX 8800 system began in September 1982 and concentrated on developing a balanced, high-performance system based upon the use of ECL components and multiprocessing. Although performance was the primary product goal, many technology, packaging, and implementation decisions reflected the equally pressing business requirements for reliability and ease of manufacturing.

The flexibility of the design ultimately spawned four CPU systems: the VAX 8800, VAX 8700, VAX 8550, and VAX 8500 models. These systems share many common functional and design attributes yet maintain noticeable implementation differences in the areas of performance, multiprocessing, expansion capability (memory and I/O), and packaging.
As a result of these implementation variations, the sustained applications throughput (SAT) rates for these systems range from approximately 3.0 to 12 times the rate for a VAX-11/780 system. Sustained applications throughput is more indicative of usable performance for a given system than the more frequently reported peak numbers that can be derived from ideal or biased conditions. Table 1 compares the physical and performance attributes of these four VAX processor systems.

Design Environment

Traditional design environments have placed the greatest emphasis on discovering and eliminating design errors in the physical hardware. The complexity of the VAX 8800 design coupled with the new technologies involved would have created costly delays in the development schedule had traditional approaches been used. Early in the project, goals were defined to identify logic design problems and to solve all timing problems through the use of extensive design verification tools.

A hierarchical design and simulation environment allowed the engineers to move freely throughout the design at any level from gates, layouts, and behavioral models through complete system simulation and timing verification. Considerable computing resources were required to allow that freedom. This environment, with its carefully managed libraries and databases, allowed this work to be done before any hardware was actually assembled.[1] As a result, the design matured within our VAXcluster systems, evolving to hardware prototypes only after it was essentially complete and stable.
In addition to the expected savings in prototype costs and a reduction in overall development time, the pervasive use of software tools significantly shifted the traditional debug effort to an earlier point in the design process. Cumulative bug-detection plots were used extensively to provide insight into the stability of the design. The effect of this shift was to provide stable, early prototypes for extensive system characterization and testing, leading to earlier design acceptance.

Table 1  CPU and Memory Attributes of the VAX 8800 Family

                                 VAX 8500       VAX 8550       VAX 8700       VAX 8800
SAT (compared to VAX-11/780)     3.5            6.0            6.0            10.0 to 12.0
Cycle Time                       45 ns          45 ns          45 ns          45 ns

CPU Attributes
Number of Processors             1              1              1              2
Upgrade Potential                To 8550        None           To 8800        None
Writable Control Store (Words)   15K            15K            15K            15K in each CPU
User Control Store (Words)       1K             1K             1K             1K in each CPU
Microword Size                   143 bits       143 bits       143 bits       143 bits
Cache Size                       64KB           64KB           64KB           64KB in each CPU
Internal Datapath                32 bits        32 bits        32 bits        32 bits
Instruction Buffer Type          16-byte        16-byte        16-byte        16-byte look-ahead
                                 look-ahead     look-ahead     look-ahead     in each CPU
Maximum Total I/O Data Rate      16 MB/s        16 MB/s        Over 30 MB/s   Over 30 MB/s
Maximum I/O Channels             2              2              4              4

Memory Attributes
Maximum Physical Memory Size     80 MB          80 MB          128 MB         128 MB
Cycle Times (min./max.):
  Hexword Read (256 bits)        495/1260 ns    495/1260 ns    495/1260 ns    495/1260 ns
  Octaword Write (128 bits)      270/540 ns     270/540 ns     270/540 ns     270/540 ns
  Longword Write (32 bits)       135/495 ns     135/495 ns     135/495 ns     135/495 ns
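The memory cycle times listed in Table 1 can be restated in units of the 45-ns CPU cycle. The following Python sketch does only that arithmetic. The dictionary and function names are illustrative and do not come from the VAX 8800 design, and reading the result as a sign that memory timing is specified in whole CPU cycles is an inference from the numbers, not a statement made in the text.

```python
CPU_CYCLE_NS = 45  # cycle time listed in Table 1

# (minimum, maximum) memory cycle times from Table 1, in nanoseconds
MEMORY_CYCLE_TIMES_NS = {
    "hexword read": (495, 1260),
    "octaword write": (270, 540),
    "longword write": (135, 495),
}


def to_cpu_cycles(time_ns, cycle_ns=CPU_CYCLE_NS):
    """Convert a time to whole CPU cycles; fail if it is not an exact multiple."""
    cycles, remainder = divmod(time_ns, cycle_ns)
    if remainder:
        raise ValueError(f"{time_ns} ns is not a whole number of {cycle_ns}-ns cycles")
    return cycles


def memory_times_in_cycles():
    """Return {operation: (min_cycles, max_cycles)} for the Table 1 entries."""
    return {
        name: (to_cpu_cycles(low), to_cpu_cycles(high))
        for name, (low, high) in MEMORY_CYCLE_TIMES_NS.items()
    }
```

Calling memory_times_in_cycles() gives 11 to 28 cycles for a hexword read, 6 to 12 cycles for an octaword write, and 3 to 11 cycles for a longword write, so every entry divides evenly by the 45-ns cycle.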
This strictly controlled design environment allowed us to complete physical debug along with the required system evaluation and testing in only eight months.

In a software-intensive design environment, the production of actual hardware is deferred somewhat in favor of design stability, resulting in a slightly longer soft-design period. The delay in hardware availability, however, is more than balanced by the stability of the hardware prototypes, which can then be accelerated through the evaluation and qualification-testing phases. The design schedule recovers during these later phases, and substantial cost savings are realized because fewer engineering changes are made and stable manufacturing can begin quickly.

CPU Design Overview

The VAX 8800 family of designs was structured around the functional elements, or "boxes," of the system. The CPU, memory, I/O, and bus subsystems were all matched to provide the necessary system balance. One simple model is to treat performance as a function of two variables: the instruction execution rate, and the amount of "work" each instruction can perform. The design of the VAX 8800 family focused on what we call the "short tick" approach to achieve the necessary, sustained performance. In this approach, the instruction and data streams are kept simple and are executed quickly. Any design trade-offs were resolved in favor of speed and simplicity, thus reducing design complexity. The use of high-speed custom and semicustom VLSI components combined with several new internal bus architectures resulted in a family of processors with a 45-nanosecond (ns) cycle time.
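The two-variable performance model above can be written out as a short Python sketch. Only the 45-ns cycle time comes from the text. The function names, the cycles-per-instruction figure, the 90-ns alternative, and the work-per-instruction values are all invented for illustration and say nothing about measured VAX 8800 behavior.

```python
def instruction_rate(cycle_time_ns, cycles_per_instruction):
    """Instructions per second for a given cycle time and average cycle count."""
    return 1e9 / (cycle_time_ns * cycles_per_instruction)


def performance(cycle_time_ns, cycles_per_instruction, work_per_instruction):
    """Model performance as execution rate times work done per instruction."""
    rate = instruction_rate(cycle_time_ns, cycles_per_instruction)
    return rate * work_per_instruction


# Hypothetical comparison: a short tick with simple instructions versus a
# tick twice as long whose instructions each do 1.5 times the work.
short_tick = performance(45, cycles_per_instruction=4, work_per_instruction=1.0)
long_tick = performance(90, cycles_per_instruction=4, work_per_instruction=1.5)
```

With these made-up inputs the short-tick design comes out ahead, because halving the cycle time doubles the rate while the longer tick gains only a factor of 1.5 in work per instruction. Different assumed values could reverse the result, which is why the model is only a way of framing the trade-off.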
All models employ a five-stage instruction execution pipeline, integral floating point acceleration (F, D, G, H formats), and the VAXBI bus as the primary I/O subsystem. The extensive use of microcode controls with minimal hardware assist augments current performance while providing flexibility for future enhancements. The block diagram in Figure 1 (using the VAX 8700 and VAX 8800 systems) illustrates the key functional elements common to the VAX 8800 family design.

Technology

The raw speed, off-chip drive capabilities, and availability of bipolar emitter-coupled logic (ECL) components provided the most straightforward means of achieving the desired performance of the VAX 8800 family. Most logic is implemented in 1200-gate ECL arrays. Custom logic chips designed by Digital provide further performance gains for floating point operations and general-purpose registers. The cache is implemented in 10-ns and 15-ns ECL RAMs.

Nine-layer, controlled-impedance CPU logic modules and a 22-layer, controlled-impedance CPU backplane were developed to meet the signal-integrity and signal-propagation requirements crucial to an ECL design. Other multilayer backplanes were designed for the private memory array bus and I/O subsystems.
Figure 1  8700/8800 Block Diagram (ECC memory, console, and one or two VAX processors on the high-speed memory interconnect (NMI) bus, with bus interfaces to standard and optional VAXBI I/O buses)

An innovative scheme of bus bars and ribbon straps routes the appropriate power to each of the backplanes, minimizing cable management problems for system power. The eight CPU logic modules, all memory arrays, and all I/O controllers attach to their respective backplanes by means of zero insertion force (ZIF) connectors, which improve our ability to manufacture and service the system. Figure 2 shows the two different module types (CPU and VAXBI) used in the VAX 8800 family.

Figure 2  Typical CPU and I/O Modules

An extensive environmental monitoring subsystem, called the EMM, has been implemented throughout the system. The EMM constantly monitors current fluctuations, air flows, and temperature variations, providing warnings at the system console. The EMM can automatically power down the system in the event that safe operating limits are violated.

CPU Subsystems

The designs of the CPUs in the VAX 8800 family are partitioned along the logical functions performed within each processor. There are four logical boxes: the instruction unit (I Box), the cache (C Box), the execution unit (E Box), and the memory subsystem (M Box). Each processor contains these functional units and their related buses. Five buses are implemented within each CPU: the cache/ALU bypass bus, the cache data bus, the instruction-buffer data bus, the virtual address bus, and the write data bus. Figure 3 is a block diagram of the processor configuration.

Figure 3  Processor block diagram (console subsystem interface, visibility bus, I Box, IBD bus, E Box, C Box, cache data bus, NMI bus, NBIA adapter to NBIB adapters)

store buffer and the cache data to be free to process other processor requests until the requested data arrives from memory. A block diagram of the C Box is shown in Figure 6.

Figure 6  C Box Block Diagram

The E Box

The E Box receives data from the I Box and the C Box, processes that data, and returns it to the C Box. The E Box performs five primary functions required by the processor.

• Handles all arithmetic, logical and bit-shift operations
• Maintains the program counter and general registers
• Maintains the processor registers
• Controls data transfers between the C Box, the I Box, and the clock-module registers
• Provides condition-code information to the I Box microsequencer

The major elements of the E Box, located physically on the data-slice modules and the shifter module, consist of a register file, a data file, the program-counter logic, the main ALU, and a shifter. The logic of the E Box includes integral floating point operations that are optimized and a 64-bit multiplier (implemented in custom designed VLSI chips) that augments the speed of both integer and floating point multiplication. Figure 7 is a block diagram of the E Box.

Figure 7  E Box Block Diagram

The M Box

The M Box, the memory subsystem, consists of memory control logic, memory arrays, and a dedicated memory array bus that provides a usable data rate of over 50 MB per second to the memory subsystem. The control logic optimizes multiple memory read and write operations, implements three-way interleaving, and buffers memory transactions for optimum data movement. The dedicated memory array bus, coupled
4 Februmy I ')87 New Prod ucts wi t h the memory con t rol logic , effect i vely off loads t h e N M l b u s , p rovi d i n g b a l a nced bus access a n d l oads . The i nt e rleaving a l gori t h ms are based u pon a rray bo u n da r i es . m a k i ng t h e memory control logic technology i ndependent . The resu lt is that as i ncreasi ngly dense me mory a rrays become ava i l a b l e , few if any cont ro ll e r mod i fications will be req u i red . The error checking and contro l ( ECC) is bu i l t a rou n d 7 c h e c k b i ts for every 3 2 b i ts of d a ta . This protocol provides automatic si ngle-bit cor rection a nd doubk-bit detect ion . I n the VAX 8800 multiprocessor, a l l memory is ful ly sharable. Current systems in the VAX 8800 fam i ly a re offered w i t h 1 6 MB per memory array , g i v i n g t h e VA..'{ 8 7 0 0 a nd VA..'{ 8 8 0 0 systems a max i m u m memory capa c i ty of 1 2 8 M B , and t he VAX 8 5 0 0 and VAX 8 5 5 0 systems a max i m u m o f 80MB. Figure 8 is a block d iagram of t h e M Box. INSTRUCTION BOX H I G H SPEED M EMORY I N TERCONNECT B U S (NMI) POWER SUBSYSTEM - - - - - - ..., r I M EMORY CONTR O L I I I I I I L I I I I I I I I I _ _ Figure 8 _ M Box Block Diagram Di[!Jtal Technical journal February J 'J8 7 No. 4 _ _ ..J The Clock Subsystem The c l o c k s u bsystem generates , contro ls , a n d di stributes t i m ing signals to a l l the components of t h e p rocessor system . The clock su bsystem conta i ns the consol e i nt e rfa ce , a n osc i ll a tor , a p hase generaror, clock-con trol logic c i rcuits, and t he l ogic c i rcuits for clock signa l d istri but ion. The VAX 8 8 0 0 fa m i l y i m p l e m e n ts a two p hase. nonoverlapped c lock su bsystem operating at a cycle time of 4 5 ns . A stable, high-frequency osc i l lator ( 1 2 0 MHz nominal with variable out put ) . coupled with a phase ge nerator, provides the signa l . 
The implementation of a two-phase design with matched signal-length distribution throughout the CPU is most efficient for the pipelined, latch-based design of the VAX 8800 family. This design avoids the inefficiencies associated with the compressed signal-assertion times resulting from approaches that specify minimum delays for given logic elements. A-clock and B-clock signals are distributed to alternate latches in a given logic stream. All data transfers occur between latches clocked by different phases to assure a race-free design. The essence of fast-processor design is managing and controlling skew. In this regard, signal propagation and distribution presented significant challenges in the areas of controlled etch lengths, controlled impedance, routing, and placement. To assure a stable, reliable design, all design activity was predicated on worst-case design rules rather than using the typical-case limits.

The NMI Bus

Integral to the design of this family of processors was the development of a high-speed memory interconnect bus called the NMI bus. This bus, analogous to the synchronous backplane interconnect (SBI bus) in the VAX-11/780 CPU, links the subsystems for CPU logic, central memory, and I/O. The NMI bus is a 32-bit synchronous bus, physically implemented within the 22-layer backplane. This bus provides the control and datapath functions as well as the distribution of clock signals for the VAX 8800 family.

One fundamental problem in the design of high-performance systems revolves around balancing the bus access needed at any given instant with the raw bandwidth available.
To provide the correct balance, the NMI bus was implemented as a pended (vs. interlocked) bus, resulting in very high bus-access availability. Since memory is the critical resource in sustained operations, the NMI bus uses a modified round-robin arbitration that gives the memory a higher priority when there is contention for the bus. This arbitration priority eliminates any lock-step conditions and also provides for recovery of states and data in the event of preemption. This high bus-access capability, coupled with usable data rates of up to 60 MB per second, provides the necessary balance to support CPU, memory, and I/O transactions. The inclusion of write buffers within each CPU, coupled with the large cache size, effectively reduces the number of transactions presented to the bus. Measurements on a VAX 8800 system in our Engineering VAXcluster environment have indicated that the NMI bus is rarely busy more than 50 percent of the time; the CPUs use approximately 25 percent of the available access time and bandwidth. Other applications may see somewhat different ratios.

VAXBI Bus

The VAX 8800 family uses the VAX bus interconnect, called the VAXBI bus, for the I/O subsystem in order to provide adequate balance for the CPU performance. The VAXBI bus, a 32-bit clocked bus with distributed arbitration, is capable of usable data rates in the VAX 8800 family up to 8 MB per second, depending upon word size and application. Custom logic on each interface module provides all bus protocols, as well as integral data-integrity features, including master transmit and command acknowledge.

The VAX 8800 and VAX 8700 systems can be configured with up to four VAXBI channels, whereas the VAX 8550 and VAX 8500 systems accept up to two. Therefore, fully configured VAX 8800 and VAX 8700 systems can support aggregate I/O bandwidths up to 30 MB per second. Similarly, fully configured VAX 8550 and VAX 8500 systems can support aggregate bandwidths up to 16 MB per second. Each VAXBI bus can support up to 16 nodes, or logical addresses, which connect to any combination of networks, intelligent and nonintelligent devices, DMA devices, and VAXcluster systems, as well as providing for connection to existing UNIBUS-based devices. All of Digital's network protocols interface directly to the VAXBI on the VAX 8800 family. Thus, VAXcluster, Ethernet, DECnet and DSA (Digital Storage Architecture) devices are all ported directly to this high-performance I/O subsystem.

Reliability

Reliability was one of the primary goals of the VAX 8800 design. Numerous features were implemented that more than doubled the basic computing kernel availability compared to the VAX-11/780 system. Some of the key functions include

• Environmental and power monitors that query the system and maintain safe system operating levels
• Automatic verification of hardware, firmware, and software revision compatibility
• Electrically keyed modules and module slots that prevent improper installation and damage to the modules or the system
• Automatic electrostatic discharge (ESD) protection of modules during installation and removal
• ECC on main memory
• Parity checking on internal RAMs
• Bus protocol checking for the memory interconnect
• Timing and voltage margining
• Remote diagnostics capability
• Dual-to-single processor reconfiguration (VAX 8800 system only)

Diagnostic Development

Similar to the hardware development, the design methodology for the diagnostics depended very heavily on simulation. Almost all the diagnostic tests were debugged on behavioral and structural models of the design before the initial prototype was powered up. There were three major benefits of this methodology.

1. Microdiagnostic and macrodiagnostic tests were useful for design verification testing.
2. Test vectors for automatic test equipment (module test) were extracted from the simulation data base.
3. A comprehensive diagnostic package was available shortly after the prototype was powered up.

The diagnostic for the VAX 8800 family consists of tests specific to this processor and generic to the VAX architecture. The processor is tested primarily with microdiagnostics. These tests execute from the processor's writable control store and are governed by the console. VAX generic diagnostics are included to test the UNIBUS and VAXBI adapters and options. All the diagnostic code fits on the console's Winchester disk. When the system is powered up, a subset of the microdiagnostic tests is executed.

Balanced Systems

The VAX 8800 design effort delivered four different systems, the 8800, the 8700, the 8550, and the 8500, all reflecting the overriding concept of balanced system design. While the CPUs themselves demonstrate excellent internal balance between their logical and functional subsystems, they are also balanced members of the extended system that can span much larger physical distances. Monolithic or isolated computing resources are no longer capable of accessing, manipulating, and distributing the volumes of information needed for complex or extended solutions. In this light, the VAX 8800 family should be viewed in the context of a balanced network. The movement of data is governed by speed and distance. An inverse relationship exists as shown in Figure 9. The VAX 8800 family fits on the top bound of the bandwidth range throughout the distance function.

Summary

The VAX 8800 family of products merges fast instruction-execution rates, large physical memories, large high-speed data caches, VAXBI I/O channels, pipelining, and balanced internal-bus architectures to provide high system-applications throughput. Spanning an applications throughput range that is from 3 to 12 times that of the VAX-11/780 system, the VAX 8500, VAX 8550, VAX 8700, and VAX 8800 systems are matched to the network and applications strategies offered by Digital Equipment Corporation.

References

1. D. Bak, "The Impact of VAX 8800 Design Methodology on CAD Development," Digital Technical Journal (February 1987, this issue): 129-135.
2. VAX Hardware Handbook (Maynard: Digital Equipment Corporation, Order No.
EB-21710-20, 1982).

Figure 9  Bandwidth versus Distance (bandwidth plotted against distance in meters on a log scale; technology ranges from complex to simple)

Sudhindra N. Mishra

The VAX 8800 Microarchitecture

The VAX 8800 processor has a simple but efficient microarchitecture. Its pipelined micromachine has a one-cycle next-address loop and four-cycle latencies for both microbranches and microtraps. Instruction prefetch and decode are done in parallel with microcode execution. The instruction buffer is a bit-sliced, four-longword circular queue. The decoder is primarily a RAM-based table. For special events, hardwired logic is used for decoding. A bit-sliced microsequencer provides up to 32-way conditional microbranching, using a collection of about 80 branch conditions. A hardware microstack provides up to 15 levels of nested subroutine calls and returns. Microtrap conditions are prioritized over 16 levels, and microtraps are chained, not nested.

The term "microarchitecture" means the specification or description of the interrelationships between the parts of the micromachine that implements the instruction set processor. In terms of this definition, the microarchitecture of the VAX 8800 processor will be described by elucidating the organization of its micromachine and the interaction between its components.

Figure 1 shows a simple three-stage state machine model of an abstract micromachine appropriate for implementing the control unit of a typical von Neumann processor. Figure 2 shows a block diagram depicting the essential elements of such a micromachine. This state machine is capable of executing microcode routines to implement an instruction set processor.
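The abstract micromachine of Figures 1 and 2 can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the VAX 8800 microword format: it assumes each microword names its successor, that an external source supplies the starting address of each routine, and that interpreting a microword amounts to recording a label. The names MicroInstruction, CONTROL_STORE, and run, along with the addresses and routine contents, are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MicroInstruction:
    operation: str               # stands in for the control signals
    next_address: Optional[int]  # None: the next address comes from outside


# Hypothetical control store holding two short routines at made-up addresses.
CONTROL_STORE = {
    0x10: MicroInstruction("read operands", 0x11),
    0x11: MicroInstruction("add", 0x12),
    0x12: MicroInstruction("write result", None),
    0x20: MicroInstruction("read operand", 0x21),
    0x21: MicroInstruction("move", None),
}


def run(external_addresses):
    """Run one routine per external start address; return the operations issued."""
    trace = []
    for start in external_addresses:
        address = start                    # address generation: external source
        while address is not None:
            word = CONTROL_STORE[address]  # fetch the microinstruction
            trace.append(word.operation)   # interpret it (emit control signals)
            address = word.next_address    # address generation: from the microword
    return trace
```

For example, run([0x10, 0x20]) steps through the three-word routine and then the two-word routine, returning the five operation labels in order. A real micromachine overlaps these three steps in time rather than performing them one after another.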
In such a system, every macroinstruction is decoded by the hardware to produce the starting addresses of a small set of microprograms, which execute sequentially to produce the desired effect. Barring some exceptions, a microprogram or microcode routine can execute rather independently in the sense that each microinstruction produces the address of the next microinstruction. The last microinstruction causes the selection of an external address, such as one produced by the decoder, to start the execution of another routine.

In Digital's vernacular, the I Box is the logical partition containing the instruction-processing hardware. Figure 3 shows a block diagram of the VAX 8800 I Box with the basic elements of its micromachine.

Figure 1  State-machine Model of an Abstract Micromachine (states include fetch microinstruction and interpret microinstruction)

From the early IBM and CDC computers to the modern CRAY machines, computer designers have used a technique called "pipelining" to obtain higher performance. Pipelining overlaps the execution of instructions in time; that is, several instructions can be executing at the same time. This technique provides a higher throughput when the pipeline is fully loaded, but there is a cost involved. If the pipeline is broken, extra processing is required to refill it. Moreover, if any active instructions have partially executed, information about their states may have to be saved to continue processing after an abrupt interruption.

The degree of pipelining varies from one machine to another depending upon the design choices and trade-offs made by the system architects. A metaphor often used to indicate the degree of pipelining is the length of the pipeline
A metaphor o ft e n used to i n d i c a t e the degree of pipe l i n i ng is the length of the pipe l i n e I .& ( 7 Digital Technical journal • 1 I New Products � MI CRO- ADDRESS GENE RATION EXTERNAL ADDRESSES - LOGIC AND CONTRO LS I-- MIC ROA D D R ESS LATCH OR REGISTER Figure 2 r--- CONTROL STORE � I N STRUCTION BUFFER I B DATA OPCODE, SPEC I F I E R , SPEC I FI E R M BER ,..---L----'--N_,U CONTROL MICRO SEQUENCER DECODER PC INCREM ENT TO E BOX DECODER CONTROL CONTROL STORE MIC ROWORD f--- r-- MICRODATA - CONTROL INTERPRE- SIGNALS TATION LOGIC - structions. A higher degree of pipe li n i ng makes short cyc l e t i m e s poss i b l e , t h u s lead i n g to a h i g h e r t h rou g h p u t w h e n t h e p i p e l i n e is fu l ly l o a d ed . But l onger p i p e l in es e n ta i l i n creased overhead in terms of their a b i l i ty tO resu me oper ations after a break in the pipeline caused by any abnormal even t. Therefore , an a rc h i tect's goal is to design the system so t hat the pipe l i ne re mai ns loaded most of the t i me and recovery from a bro ken pipe l i ne is not roo inefficient. The VAX 8800 CPU i s a prime example o f a processor with a pipe l i ned microarch itecture. System Considerations CONTROL TO E BOX CONTROL TO C BOX M I CROSEOUENCER CONTROL Figure 3 r- Block Diagram of an A bstract Micromachine CAC HE BRANCH CONDITIONS, TRAPS, INTERRUPTS - MICRODATA LATCH OR R EGISTER VAX 8800 I Box stated as the n u m be r of stages, for exa m p l e , a t h ree-stage p i pe l i ne or a fou r-stage p i p e l i n e . The number of stages conveys the extent of t i m e overlap for ty p i c a l opera t i o ns i n a compu t e r . I n a machi n e w i t h a p i p e l i n ed m i croar c b i tec tu re, these operations are executions of micro i n - The design philosophy of the VAX 8800 proces sor was to o p t i m i z e t h e h a rdware so t h a t i t wou l d e x e c u t e t h e m i c rocode effi c i e n t l y . 
A large control store (144 bits by 16,000 entries) holds the entire microcode. Using fairly generalized datapaths, the microcode executes the logic of the instructions. However, special hardware is used to speed up performance in critical areas. The processor logic is primarily designed with latches, which are clocked with a globally distributed, two-phase, nonoverlapping clocking scheme. The two clock phases are called the A-clock and the B-clock. A typical example of logic design, based on the above approach, is shown in Figure 4.

Figure 4  A Typical Section of the VAX 8800 (CL: combinatorial logic)

It is apparent from Figure 4 that the data flow in such a logic system occurs through the perpetual data transfers between the latches connected to the A-clock and those connected to the B-clock. Each data transfer may be considered atomic in the sense of hardware operation. A microoperation may be envisioned as a logical operation that is atomic in terms of the execution of a microinstruction, such as a register read, a register write or an ALU function. Hence a microoperation constitutes one or more data transfers, and the microinstruction execution simply constitutes a time sequence of microoperations, as shown in Figure 5.

Figure 5  Example of a Microinstruction (clock phases A and B against time: read registers, ALU function ADD, store result in register)

In high-performance machines, like those in the VAX family, there is usually a mismatch between CPU cycle times and memory-access times. For example, consider an ADD instruction. If the operands are in registers, the ADD can be done rather quickly.
But if one of the operands has to be read out of memory, the ADD cannot be performed until the desired data arrives from memory. Most VAX processors have a fast cache memory, tightly bound to the processor's arithmetic units, to alleviate the memory-latency problem. In the case of a cache miss on a required datum, however, the only alternative for a von Neumann processor is to wait. A processor in such a state is said to be "stalled." Under such conditions, the state of the processor must be "frozen" until the cause of the stall no longer persists and the stall is broken. The two-phase clocking scheme provides a convenient way to implement stalls, in which one of the clock phases (the A-clock in the 8800) may be blocked. Stalls are controlled by the cache through a special hardware signal distributed globally to block the A-clock. Thus, the processor logic contains two flavors of A-latches:

• Stalled A-latches, which are affected by a stall
• Unstalled A-latches, which are not affected by a stall

The micromachine is implemented only with stalled A-latches. Hence the effect of stalls on the execution of the micromachine is largely transparent.

A mechanism is also required to deal with hardware exceptions when the results of the execution of a microinstruction have to be undone. In a pipelined microarchitecture, several microinstructions may have partially executed when an exception condition is detected. In that case it is necessary to undo the effects of all those microinstructions. The most common technique used to deal with such situations is called a microtrap.
Since microtraps relate closely to the micromachine execution, every processor has its own scheme to implement them. In every case, however, microtraps must permit the "rollback" of some number of microinstructions because the detection of a trap condition usually occurs quite late with respect to microinstruction execution. In the VAX 8800 processor, microtraps are implemented so that the offending microinstruction is allowed to complete, but subsequent microinstructions in the pipeline are blocked. Since the offending microinstruction may have caused some undesirable results, the trap-handler microcode must fix the problem. Depending on the particular situation, either the microinstruction execution flow is resumed from the blocked state or a new flow is originated.

System Buses and Datapath

Figure 6 is a block diagram of the VAX 8800 CPU datapath, showing all the major buses. The hardware organization of the CPU provides a two-cycle operation between the cache and the ALU, as shown. The processor has several functional units in addition to the main ALU. These additional units perform high-speed multiply and divide, shifting, and floating-point arithmetic operations. There are several possibilities for selecting inputs to these functional units. For operations involving two inputs, both can be presented simultaneously onto the two legs of the main ALU as well as most other functional units. The results from these functional units are sent on the W bus for writing to either the multiport register file (MPR) or the cache. However, since the write actually occurs in the following cycle, the bypass bus provides a shortcut (saving a cycle) in case the write datum is read by the very next microinstruction.

[Figure 6: VAX 8800 Datapath. A, B = A and B phases of the two-phase clock]

The virtual address bus carries the virtual address of any data-stream (d-stream) references, whereas the program-counter bus has the current program counter (PC). The instruction buffer data bus provides the instruction-stream (i-stream) data. The instructions and data from the cache are returned on the cache data bus. However, a cache data bypass bus provides a direct path to the functional units for the data returned by the cache, in case the processor is or will be stalled for that data.

Microinstruction Pipeline

The top part of Figure 7 shows the execution of microinstructions as a function of time in a nonpipelined microarchitecture; the bottom depicts that in a pipelined microarchitecture. The basic data flow in a processor occurs in the following sequence:

1. Read the register operands into a functional unit, such as the ALU.
2. Perform some ALU function.
3. Write the results into the destination register.
4. If there is a cache, start a cache operation at approximately the same time as a register write since memory references are buffered through special-purpose memory data registers (MDRs or MDs) in most high-performance processors.

[Figure 7: Microinstruction Execution. Top: microinstruction execution in a nonpipelined micromachine; bottom: microinstruction execution in a pipelined micromachine]

Figure 5 shows that the sequence above occurs in a natural order in time as a consequence of the microinstruction execution. With pipelined microarchitectures, a time reference is needed to correlate the microoperations performed by various microinstructions with respect to each other. The notion of canonical times is very convenient for this purpose. The clock ticks of the reference microinstruction may be labeled with a monotonically increasing set of T numbers starting at T0, as shown in Figure 8. These T numbers are called the canonical times of a particular microinstruction. The microoperation labeled T0 marks the start of a microinstruction execution cycle. Figure 8 shows the basic microoperations of a VAX 8800 microinstruction with their canonical times.

[Figure 8: Canonical Times of a VAX 8800 Microinstruction. Operations shown: decoder operation, control store look-up, register read, ALU operation, register write, cache operation, cache miss action]

We shall use the simple model of a micromachine in Figure 1 to describe the VAX 8800 microinstruction format as a sequence of basic microoperations like those in Figure 8.

The first stage in the microinstruction execution cycle is the microaddress fetch. The microinstruction execution cycle begins with a decoder operation. The decoder produces the starting microaddress for every new microinstruction sequence and presents it to the microsequencer. The decoder determines that address on the basis of the contents and current state of the instruction buffer (IB). Each microinstruction specifies to the microsequencer whether or not to accept the decoder's microaddress. If not, the microinstruction must either specify the address of the next microinstruction directly, as a part of the microword, or indicate an alternate source for the address within the microsequencer. Since the decoder's operation is concurrent with the microsequencer's, the decoder always has a starting microaddress for the microsequencer. It is convenient to think of this IB-decoder concurrency as a "hidden decoder cycle."

The next stage in the microinstruction execution sequence is the fetch of the microinstruction, performed by a look-up in the control store. In the VAX 8800 system, the microaddress is pipelined, not the microdata. Consequently, the microdata from a segmented control store appears at the appropriate time for the three basic operations to occur in the indicated order. The microdata looked up causes a sequence in which the register read occurs between the times T5 and T6, the ALU function between T6 and T8, and the register write between T8 and T10. The cache operations also occur between the times T8 and T10. The section beyond T10 denotes cache activity with respect to the memory if there is a cache miss. (The cache/memory interface is controlled by an independent micromachine.) During every cycle, a microinstruction produces the address of the next microinstruction, which is then executed. Figure 9 depicts the generic microinstruction pipeline of the VAX 8800 processor.

[Figure 9: Microinstruction Pipeline of the VAX 8800 CPU. Legend: DECODER = decoder operation; LUK = control store look-up (control store 0 segment); XOS = board crossing segment (overlaps control store 1 look-up); RD = register read (overlaps control store 2 segment look-up); ALU = ALU function; WR = register write; CACH = cache operation]

Microbranch Latency

One consequence of pipelining is that some intervening microinstructions must be placed between the microinstruction that produces a branch condition and the microinstruction that can branch on it, because of latency in the development of the branch condition. Obviously, the execution of the intervening microinstructions must be independent of the branch. Usually, microcoders are able to code some useful operations during the inevitable wait. Otherwise, the intervening instructions must be NOPs (no operation). Figure 10 shows the microbranch latency in the VAX 8800 CPU.

[Figure 10: Microbranch Latency. Shows the microinstruction that generates the branch condition, the potential NOPs that follow it, the branch microinstruction, and the target of the conditional microbranch]

Microtrap Latency

A hardware exception causes a microtrap. However, the trap conditions, like the branch conditions, may develop after some execution cycles have been completed. Once again there must be some intervening microinstructions between the trap-causing microinstruction and the trap-handling routine. Moreover, the state of the micromachine must be saved so that the current execution can be resumed in such a way that the intervening execution of the trap routine appears to be transparent. This state consists primarily of microbranch conditions that result from the execution of microinstructions in the pipeline since those could influence subsequent microaddresses and hence the execution sequence. Therefore, on interruption of the current sequence by the trap routine, the branch conditions from the earlier execution are essential to reproduce the same sequence.

To simplify the hardware design, all early traps are delayed to a fixed canonical time (T10). Some trap conditions, however, develop later than the canonical time, with the consequence that those traps cannot be returned from. In such cases the microcode must roll back the state to the beginning, which causes a reexecution of the entire macroinstruction.

Figure 11 shows a sequence in which a microinstruction at address T provokes a microtrap. At the earliest, the trap-handling routine can begin at microinstruction X. Meanwhile, microinstructions U, V, and W follow T, quite unaware of the impending trap. In fact, they are in partial execution when the trap condition is detected. These microinstructions are said to be in the trap shadow, and they must be blocked from writing any registers, thus making it appear as if they had never executed. When control is returned from the trap-handling routine, these trap shadow microinstructions are reexecuted, continuing the sequence that would have arisen had the trap not occurred.

Instruction Buffer and Decoder

The IB buffers the prefetched VAX i-stream delivered by the cache and in turn delivers the opcode and specifier to the decoder. The IB also delivers the i-stream data to the execution unit, the EBox. The decoder expects to receive the current opcode and the current specifier byte.
[Figure 11: Microtrap Latency. Microinstruction T causes a microtrap; microinstructions U, V, and W are in the trap shadow]

Hence the IB saves the opcode for the duration of the instruction execution and shifts the buffered i-stream along to send each specifier in turn to the decoder.

The goal of the VAX 8800 decoder is to produce a starting microaddress corresponding to the opcode and the specifiers. The sequence of microcode execution caused by the decoder is first to process all the specifiers, making all the operands available, and then to execute the operation specified by the opcode. If an instruction has no specifiers, the execution microcode is initiated directly. In any case the decoder always has a microaddress ahead of time for the microsequencer. This microaddress is the starting address of either a specifier routine or the execution routine, based on the contents and the state of the IB.

If at any time the IB does not contain enough i-stream data for a successful decode, the decoder will produce a special microaddress. The microinstruction at that address is simply a NOP that again requests the selection of the decoder's address. The micromachine thus waits in a loop for sufficient i-stream data to arrive in the IB so that the decoder can again dispatch a useful microaddress.
This wait-loop state of the micromachine is commonly referred to as the IB stall, which is different from the stall described earlier. Note that clocks to stalled A-latches are not blocked for an IB stall. On the contrary, the micromachine runs normally as does the rest of the processor hardware. IB stalls may occur when the instruction prefetch pipeline is broken due to macroinstruction branches. This condition requires the current contents of the IB to be discarded and new i-stream data to be prefetched into the IB.

The VAX 8800 IB is a four-longword circular queue, which is usually long enough to hold an entire instruction. The data is consumed out of the IB from the position pointed to by the read pointer. However, new data could be written concurrently by the cache at the position pointed to by the write pointer. Whenever it has room, the IB is loaded by the cache if the cache has no other higher priority job to do. Occasionally, the IB becomes full (the write pointer catches up with the read pointer), and then it does not accept the datum from the cache. If a datum is not accepted by the IB, the cache keeps repeating the transfer until the datum is accepted. Occasionally, the IB becomes empty if the cache is busy doing other things and the decoder has consumed all the data from the IB (the read pointer and the write pointer point to the same location).

The IB in the VAX 8800 family is implemented with four identical gate arrays with 8-bit slices designed to use a rather clever bit-scattering/gathering scheme. The IB also contains logic to extract and format i-stream data, making it available to the EBox.
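As an illustration of the queue behavior described above, the following Python sketch models a four-longword circular buffer with a read pointer, a write pointer, and the decoder's IB-stall dispatch. It is not the 8800 hardware design. The byte counter used to tell a full buffer from an empty one, the longword-at-a-time load, and all names (`InstructionBuffer`, `load_longword`, `consume`, `IB_STALL_ADDRESS`, `next_decoder_address`) are assumptions made for the example.

```python
from typing import Optional

LONGWORD_BYTES = 4
IB_LONGWORDS = 4                      # "four-longword circular queue"
IB_BYTES = LONGWORD_BYTES * IB_LONGWORDS

# Hypothetical label for the special microaddress of the wait-loop NOP.
IB_STALL_ADDRESS = "IB_STALL_NOP"


class InstructionBuffer:
    """Toy model of a circular i-stream queue (assumed byte-granular reads)."""

    def __init__(self) -> None:
        self.data = bytearray(IB_BYTES)
        self.read_ptr = 0    # next byte the decoder will consume
        self.write_ptr = 0   # where the cache loads the next longword
        self.valid = 0       # unread bytes; assumption used to tell full from empty

    def load_longword(self, longword: bytes) -> bool:
        """Cache offers one longword. False means 'not accepted, retry later'."""
        if len(longword) != LONGWORD_BYTES:
            raise ValueError("expected exactly one longword")
        if IB_BYTES - self.valid < LONGWORD_BYTES:
            return False     # no room: the write pointer has caught up
        for i, byte in enumerate(longword):
            self.data[(self.write_ptr + i) % IB_BYTES] = byte
        self.write_ptr = (self.write_ptr + LONGWORD_BYTES) % IB_BYTES
        self.valid += LONGWORD_BYTES
        return True

    def consume(self, count: int) -> Optional[bytes]:
        """Decoder takes `count` bytes, or None if not enough data is buffered."""
        if count > self.valid:
            return None
        out = bytes(self.data[(self.read_ptr + i) % IB_BYTES] for i in range(count))
        self.read_ptr = (self.read_ptr + count) % IB_BYTES
        self.valid -= count
        return out

    def flush(self) -> None:
        """Discard the contents, as after a macroinstruction branch."""
        self.read_ptr = self.write_ptr = self.valid = 0


def next_decoder_address(ib: InstructionBuffer, bytes_needed: int, useful_address):
    """Return the useful microaddress, or the stall NOP address if data is short."""
    return useful_address if ib.valid >= bytes_needed else IB_STALL_ADDRESS
```

In this sketch a cache model would simply call `load_longword` again on a later cycle whenever it returns `False`, which mirrors the repeated transfer described in the text.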
A common silo holds the opcode history for the duration of a macroinstruction's execution, as well as for recovery from microtraps. The VAX 8800 decoder is a RAM-based look-up table for generating microaddresses. In the case of special events, however, hardware logic is provided for generating special microaddresses, as shown in Figure 12, thus bypassing the RAM look-up. The decoder also provides controls for the IB state machine as well as some other hardware assists.

[Figure 12: VAX 8800 Decoder. Blocks shown include the decoder RAM, the special address encoder, IB state control, and IB data format control]

Microsequencer

The state-machine responsible for generating the next microaddress for a microinstruction sequence is commonly called the microsequencer. As shown in Figure 13, this state-machine is realized collectively by the control store, the next microaddress generation logic, and the microaddress and microdata latches (or registers).

[Figure 13: An Abstract Microsequencer. Blocks shown: microtrap logic, microbranching and address selection logic, microaddress latch or register, control store, microdata latch or register]

The goal of the VAX 8800 microsequencer is to produce the address of the next microinstruction during every cycle. Figure 14 depicts how the microsequencer achieves this goal.

Each microinstruction may modify its next microaddress field through a microbranch command to produce the address of the target microinstruction. Microbranch conditions are delivered by other sections of the machine, such as the ALU. These conditions are grouped together in ways convenient for microprogramming so that multiway branches can be taken. Microsubroutines can be called and returned from by means of a hardware microPC stack. Stalls cause the microsequencer state to be frozen on a cycle boundary (i.e., the clocks on microaddress and microdata latches are effectively blocked). Microtraps allow the microcode to deal with unusual events that would be too slow or inconvenient to check normally with microbranches, such as TB misses and address misalignments. The VAX 8800 processor does not permit traps to be nested. Instead, traps are "chained," meaning that trap routines and hardware trap priorities are carefully arranged so that a second trap is taken only when the first trap routine finishes. (Machine check traps cannot be controlled in this way.)

Sources of Microaddresses

There are five sources for microaddresses:

• The decoder
• The next-address field in the microword
• The microstack upon returning from a subroutine
• The microPC silo for a saved microtrap
• The micromatch register for an address from the console

An address from the console is selected in response to an explicit console request and takes precedence over everything else. Addresses from the silo are requeued in response to a trap-return command. Addresses from the microstack are selected in response to a subroutine-return command. A decoder-generated address is selected whenever the current sequence ends and a new specifier or execution routine should begin. Normally, this selection is caused by the assertion of a microword bit in the very last microinstruction of the current sequence. The next-address field is selected as the default for normal sequencing. This field is also used to provide an offset in case of subroutine returns.

Microbranching

In normal cases, part of the selected microaddress can be modified according to the branch conditions, that is, whenever the next-address field is selected. A combination of two microword fields, branch type and branch mask, selects the branch conditions, which are then ORed into part of the target microaddress. In the VAX 8800 system, the microbranch logic is implemented with five identical gate arrays, each of which generates a 3-bit slice of the microaddress. One microaddress bit is branch sensitive in each slice. This organization permits up to 32-way branching.
Branchings of 2, 4, 8, and 16 ways are also made possible by a separate mask bit, called the branch mask, to every slice. This bit is used to turn off the sensitivity to branch conditions in a particular slice. There are 16 basic recipes for conditional branching in each slice. This arrangement of slicing, masking, and branch-condition selection in every slice requires that all the microbranch conditions be organized into 5 groups of 16 conditions each.

The branch conditions are classified as either static or dynamic. Static conditions, once captured, are available for branching in any later cycle as long as those conditions remain unchanged. Dynamic conditions are asserted for just one cycle and must be branched on in that cycle. Some special trap-related branch conditions are saved at the time of the trap so that the trap routine may use them.

For speed reasons, the basic hardware mechanism for multiway branching is that the selected condition is ORed rather than added to the branch-sensitive microaddress bit. The OR implies that the branch-sensitive bits of a microaddress must be "zeros" by convention. If branching is masked in any slice, however, only unmasked branch-sensitive bits need to be zeros. Thus the branch-masking scheme leads to a substantial increase in the number of conditional branch-target addresses, constrained by the requirement for zeros.
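The ORing scheme can be illustrated with a small Python sketch. It assumes, purely for illustration, that the branch-sensitive bit is the lowest bit of each 3-bit slice and that the condition chosen by the branch type field has already been reduced to one bit per slice. The function names and bit positions are hypothetical and are not taken from the 8800 design.

```python
from itertools import product

SLICES = 5          # five identical gate arrays
SLICE_BITS = 3      # each generates a 3-bit slice of the microaddress
SENSITIVE_BIT = 0   # assumption: lowest bit of each slice is branch sensitive


def sensitive_mask(slice_index: int) -> int:
    """Bit mask of the branch-sensitive microaddress bit in one slice."""
    return 1 << (slice_index * SLICE_BITS + SENSITIVE_BIT)


def branch_target(next_address: int, conditions, enabled) -> int:
    """OR the selected condition of every enabled slice into the next address.

    conditions[i] is the selected branch condition (0 or 1) for slice i.
    enabled[i] is False when the branch mask turns slice i's sensitivity off.
    """
    target = next_address
    for i in range(SLICES):
        if not enabled[i]:
            continue                      # masked slice: address bit passes through
        bit = sensitive_mask(i)
        if next_address & bit:
            # ORing into a 1 would hide the condition, hence the zeros convention.
            raise ValueError(f"branch-sensitive bit of slice {i} must be zero")
        if conditions[i]:
            target |= bit
    return target


def reachable_targets(next_address: int, enabled) -> set:
    """All targets reachable from one microinstruction: 2**k for k enabled slices."""
    return {
        branch_target(next_address, conditions, enabled)
        for conditions in product((0, 1), repeat=SLICES)
    }
```

With two slices enabled, `reachable_targets(0b010, [True, True, False, False, False])` yields four addresses (a 4-way branch), and with all five enabled it yields 32, matching the counts given in the text.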
[Figure 14: VAX 8800 Microsequencer. The microaddress source selection logic chooses among the microword next address, the top of the microstack, the silo addresses, the console address (micromatch register), the trap vector, and the decoder's microaddress; the selected microaddress drives control stores 0, 1, and 2]

Table 1 Microbranch Conditions

Slice Number   Microbranch Conditions
 1             State flags
 2             WBUS low-order bits
 3             WBUS high-order bits
 4             SALU condition codes
 5             PSL condition codes
 6             XALU condition codes
 7             ALU condition codes
 8             Priority encoder condition codes
 9             TB-status
10             Cache command
11             MD number
12             AC low
13             Digit valid
14             NMI ID
15             Interrupt pending
16             Interval timer carry
17             Halt pending
18             Console mode
19             Interrupt ID
20             Non_Retry flag

Table 1 shows an example of several microbranch conditions.

Microsubroutine Call and Return

As in the normal case just discussed, the default microaddress, the next-address field, is selected as the starting address of a microsubroutine. However, a subroutine-calling microinstruction pushes its own address onto the microstack. During the subroutine return, the microstack is selected as the source and then popped. Thus the address of the calling instruction is used as a base for the return. The returning instruction may OR an offset from the next-address field to that base, thus yielding the target return address.
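A minimal Python sketch of this call and return sequence follows. The class and function names are hypothetical, the 16-entry depth with wraparound is taken from the description of the microstack in this section, and the pipelined PUSH path and its bypass are deliberately not modeled.

```python
STACK_DEPTH = 16  # depth stated in the text for the VAX 8800 microstack


class Microstack:
    """Toy circular stack of calling microaddresses (illustrative only)."""

    def __init__(self) -> None:
        self.entries = [0] * STACK_DEPTH
        self.pointer = 0  # index of the next free entry

    def push(self, address: int) -> None:
        self.entries[self.pointer] = address
        # Wraps around and overwrites the oldest entry when nesting is too deep.
        self.pointer = (self.pointer + 1) % STACK_DEPTH

    def pop(self) -> int:
        self.pointer = (self.pointer - 1) % STACK_DEPTH
        return self.entries[self.pointer]


def call(stack: Microstack, own_address: int, next_address_field: int) -> int:
    """Calling microinstruction: push its own address, continue at the subroutine."""
    stack.push(own_address)
    return next_address_field


def ret(stack: Microstack, offset: int) -> int:
    """Returning microinstruction: pop the caller's address and OR in an offset."""
    return stack.pop() | offset
```

For example, a call made from microaddress `0x120` to a subroutine at `0x400` can return to `0x121` with `ret(stack, 0x1)`, or to another low-order offset of the same base if the returning microinstruction supplies a different offset.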
The fact that bits are ORed rather than added constrains the calling addresses to have zeros in the low-order bit positions.

The write path to the microstack (PUSH) is pipelined by a cycle for timing reasons. However, a bypass path saves what would be the top entry of the microstack in the read latch (POP) so that PUSHs and POPs occur in a fairly unrestricted manner. There are, however, some minor coding restrictions with respect to traps and decoder-made addresses.

Subroutine calls and returns are unaffected by stalls. In the VAX 8800 CPU, the microstack is 16 entries deep and is used exclusively for subroutine calls and returns (i.e., microtraps do not use the stack). Subroutine calls may be nested up to 15 entries deep, beyond which the microstack wraps around and overwrites previous call addresses. Since the next-address field is conditionally ORed into the calling address to make the return address, a conditional multiway return becomes feasible.

Microtrap and Return

A microtrap is caused when the hardware detects a condition that would not allow the current microinstruction to complete its execution successfully. The hardware forces the next microaddress to a fixed location that depends on the particular condition, thus overriding the address that would otherwise be selected. This special location is the starting address of the trap-handling microcode routine specific to that trap condition. Microtraps are used extensively by the memory management system to implement the virtual memory architecture. Microtraps are also caused by serious system faults (i.e., machine checks), such as control-store or bus parity errors. Table 2 lists the microtrap conditions and their priorities.
The priorities are arranged so that if more than one microtrap occurs during a cycle, the one with the highest priority will be serviced and the others ignored.

Table 2 Microtrap Conditions and Priorities

Microtrap Condition        Priority
Microbreak                 Highest
Machine check
VA parity error
TB tag parity error
Reserved for ECO
Reserved float operand
Add rounding
Multiply rounding
Integer overflow
TB miss
Access violation
Modify bit
Page cross
Unaligned page cross
Unaligned trap
Conditional VAX branch     Lowest

Figure 11 shows the microtrap latency and its consequences on pipelining. As described earlier, a trap-causing microinstruction, even if it writes the wrong results, is allowed to complete because it is too late to block it anyway. (The canonical time of register write is T9, whereas the microtrap signal occurs at canonical time T10.) The only recourse is to let the trap-handling microcode correct any problems caused by the trapping microinstruction. The microtrap signal occurs in time to block all three microinstructions in the trap shadow. Therefore, the microtrap logic generates two global signals, the global microtrap (one cycle long) and the block writes (three cycles long), at time T10. The purpose of the global-microtrap signal is to trigger any necessary trap-contingent actions in various parts of the processor. The purpose of the block-writes signal is to block register writes at canonical times T11, T13, and T15, thus rendering ineffectual microinstructions U, V, and W in Figure 11. In other words the blocking of writes by hardware is in effect until the trap-handling microcode takes control of the micromachine.

A silo is generally used to save the state of the machine across a microtrap.
In most cases the length of the silo is equal to the depth of pipelining. Since there are many more branch condition bits than microaddress bits, it is more economical to save microaddresses in the trap silo than to save the conditions causing those addresses. Microaddresses U, V, and W must be saved in the silo since they may be branch targets of some previous microinstructions. For the same reason, however, the address X (overridden by X', the starting address of the trap routine) must be saved as well. During the execution of the trap routine, the trap silos are "frozen" (blocked from loading), thus saving the state of the micromachine at the time of trap. After the trap routine has completed, two conditions are possible:

1. The recovery from the trap is impossible, and hence the microinstruction sequence cannot be continued. Then the only recourse is to roll back and reexecute the macroinstruction. That is, the macroPC is backed up from its silo, the IB is flushed, and if necessary, any register changes are undone. In this case the last microinstruction of the trap routine performs a trap release, which unblocks the silos so they can resume loading the new states.

2. Microcode can remedy the cause of the trap so that the microinstruction sequence can be continued. In this case the last microinstruction of the trap routine performs a trap return, causing the hardware to recycle microaddresses U, V, W, and X through the microaddress pipe. This action results in the reexecution of aborted microinstructions from the trap shadow.

In the case of a trap return, the hardware selects the microPC silo as the microaddress for the next four cycles.
As shown in Figure 14, however, the microPC silo does not contain the microaddresses made by the decoder. Therefore, it is necessary to resynchronize the microinstruction execution sequence with the decoder, while requeuing the trapped microaddresses from the silo. This is made possible by keeping a tag bit in the silo to identify the positions of the microaddresses made by the decoder in the sequence. If a microaddress from the silo is found to be tagged, the requeuing is terminated immediately and the microaddress generated by the decoder is selected. A complete recovery thus occurs since the state of the IB has by this time been backed up, and therefore the decoder-generated microaddress can be used for the continuation.

Chaining of Microtraps

By convention, microtraps are not allowed to nest; instead, they are chained. In other words, the trap-handling microcode must ensure that it will not cause any microtraps itself. The sole exception is its last microinstruction, which may cause a second microtrap to follow immediately, even as the saved microaddresses from the silo are being requeued to resume the original flow. Note that this second microtrap does not take effect until four cycles later, whereas intervening microinstructions are blocked by the hardware as a result of this second microtrap. Consequently, the same microaddresses end up in the microPC silo once again during the execution of the second trap routine. The original sequence may finally resume after the last of such chained traps has been serviced.

Acknowledgments

The specification and design of the VAX 8800 I Box was a team effort.
Dave Laurdlo contributed to the IB design, the i-stream data formatter, and the interrupt logic. Bei Pong Wang was responsible for the decoder, the PC increment logic, and the IB-state manager. Jack Ward looked after the physical construction of the sequencer and the control store. The entire development was carried out under the excellent leadership of Doug Clark. Many thanks also go to both Doug Clark and Bob Stewart for their suggestions and guidance during the course of this development.

William A. Samaras

The CPU Clock System in the VAX 8800 Family

The clock system in the VAX 8800 CPU sends timing signals to every state device every 45 nanoseconds. The lack of accuracy of these timing signals is called skew, which must be minimized. Two skews exist: global, between modules; and local, within a module (the lower of the two). The design complexity of the overall system dictated the use of an automated timing verifier. Although advantages accrue from designing for local skew, the verifier could not distinguish between skew types. To gain the benefit of the verifier, a unique hardware trade-off was made to minimize total skew: local was made equal to global. The result was that 83 percent of the cycle time is used productively.

All synchronous computers must provide some means of generating and distributing accurate timing signals. The goal of the timing system in the VAX 8800 family is to provide low-skew (therefore, accurate) timing signals to all parts of the processor without any manufacturing adjustments. Furthermore, the design team wanted to automate the verification of the timing during the design phase. Therefore, design trade-offs in the clocking system were necessary to accomplish that automation.
This paper discusses how the hardware designs of the clocking system were influenced to provide a good environment for the automatic timing verification.

Clocking System Requirements

The design of the clocking system required us to address many interrelated problems that had to culminate in a common solution. This design depended on certain fundamental specifications that were established for the VAX 8800 CPU by the system architects. The two primary requirements are described below.

Cycle Time

The cycle time of the VAX 8800 family of processors is 45 nanoseconds (ns), which means that a CPU can accomplish some amount of work during that period. Looking at it another way, these processors can do about 22.2 million actions every second. Usually, a number of these 45-ns cycles are required by a processor to produce just one VAX instruction. The clocking system must keep the thousands of circuits in the processor "ticking" in perfect step together every 45 ns.

The 8800 was designed to contain two complete CPUs in the same cabinet. Since both CPUs share a common memory, it is beneficial to make the memory system and both CPUs synchronous with each other. The clock system must keep all three items running together, precisely locked in time.

Modules

All the circuitry for both processors and the memory controller is contained on 20 modules, or printed circuit boards, each measuring 16 inches by 12 inches. These modules occupy slots in a 21-inch-wide backplane. Each module contains up to 20 ECL gate arrays and miscellaneous ECL logic. The state devices, called latches, reside both in the gate arrays and the miscellaneous logic of each module.
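The cycle-time figures in this section can be checked with a few lines of arithmetic. The following Python sketch is only an illustration: the 45-ns period comes from the text, while the function names and the cycles-per-instruction values are hypothetical, chosen to show how "a number of cycles per instruction" turns into an instruction rate. A 45-ns period works out to roughly 22 million cycles per second.

```python
CYCLE_NS = 45.0  # cycle time quoted in the text, in nanoseconds


def cycles_per_second(cycle_ns: float) -> float:
    """Return the number of clock cycles completed in one second."""
    return 1e9 / cycle_ns


def instructions_per_second(cycle_ns: float, cycles_per_instruction: float) -> float:
    """Return the instruction rate for an assumed cycles-per-instruction value."""
    return cycles_per_second(cycle_ns) / cycles_per_instruction


if __name__ == "__main__":
    rate = cycles_per_second(CYCLE_NS)
    print(f"{rate / 1e6:.1f} million cycles per second")
    # Hypothetical cycles-per-instruction values, for illustration only.
    for cpi in (5, 10):
        mips = instructions_per_second(CYCLE_NS, cpi) / 1e6
        print(f"{cpi} cycles per instruction: {mips:.2f} million instructions per second")
```

The text does not state an average number of cycles per VAX instruction, so the loop values above should not be read as measured figures.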
The Clocking Problem

The basic difficulty for this (and any) clocking system is to get the timing signals to every state device in the machine at precisely the same time. Every synchronous machine faces this problem. However, in faster computers, like the VAX 8800 system, the tolerances placed on the timing signals are more severe. In a physical sense, it is simply not possible to send all the timing signals to every part of each module at the same instant. There is some precision, however, that should and can be achieved. We now discuss how important this tolerance is to the VAX 8800 systems, and what we did to minimize it.

The tolerance, or time difference, that we encounter in attempting to provide timing signals to every state device at the same time is called the clock skew. Clock skew is the uncertainty in the time of a particular event. As an analogy, consider an airline flight that is scheduled to arrive at an airport at precisely 5:02 P.M. Now, we know this flight will not arrive at 5:02 P.M. on the dot; it will probably arrive within a minute or two of that published arrival time. This uncertainty in the time of arrival is the skew of that time. If the uncertainty of arrival is 30 seconds, this skew would probably be a very acceptable value and we would say the flight is right on time: it arrived with low skew.

On the other hand, if the uncertainty of arrival is large, say 30 minutes, we would probably try another airline. Why? Not simply because we are impatient but for a more fundamental reason. When the uncertainty is large, we have less time to do other things that are valuable to us. Usually, we are committed to the entire time of the uncertainty.
Put another way, this uncertainty, or skew, is wasted time.

Enough of this analogy. How does this skew affect the operation of a digital computer? As mentioned earlier, since the cycle time of each CPU is 45 ns, all state devices are "scheduled" to clock at the start of that period. Any uncertainty in this time from one latch to another is called clock skew. As in our airline example, clock skew is wasted time.

There are many factors that increase the clock skew; let us consider one of the most important ones. Since the backplane width is 21 inches, all the CPU hardware modules are separated by no more than that distance. Since all the wiring in the system is composed of controlled-impedance transmission lines, the logic signals can travel at close to the speed of light. At that speed a logic signal could circle the earth about 4.5 times in 1 second, or it takes about 4 nanoseconds to travel the 21 inches across the processor backplane. Now we can begin to understand the skew problem. The minimum uncertainty of any signal traveling through the entire processor would be at least 4 ns, which is almost 10 percent of the 45-ns cycle. And that is only one source of skew.

Since skew can be wasted time, our goal was to make it as small as possible. In the 8800 system, there are three major contributors to clock skew: variations in the semiconductor components, variations in the wiring lengths (described above), and different manufacturing tolerances of the modules.

One common way to remove skew from a system is to make some type of adjustment during the assembly of the hardware. Theoretically, at least, all the skew could be removed through this method of adjustment.
To keep the cost of manufacturing low, however, another of our goals was to require no adjustments of any kind. That goal placed an extra burden on the clock system to deliver accurate signals without excessive skew.

By carefully designing the circuits of the clocking system and controlling the skew sources mentioned above, we held the overall clock skew in the VAX 8800 family to 7.5 ns. Thus, on average, 83 percent of our 45-ns cycle is utilized. The remainder of the paper explains some of the trade-offs we made to achieve this figure.

Clock Hardware Overview

Figure 1 depicts the hardware in the clock system of the VAX 8800 family. The oscillator section is the time base of the whole machine. The implementation is a custom phase-locked-loop design that allows the clock period to be varied for test purposes during the manufacturing process. Using a phase-locked loop makes it possible to have a very accurate timing source at many specific clock periods.

The output of the oscillator section connects to a phase generator that provides two clock phases with the proper timing relationship between them. The outputs (called the A-Clock and the B-Clock) of the phase generator are the actual clock signals distributed to all state devices in the machine. The phase generator is implemented digitally by high-speed, 100K ECL shift registers. This technology creates very accurate timing without requiring any manufacturing adjustments.

Since there is only one phase generator and thousands of state devices requiring the clocks, or timing signals, a method is needed to get the output of the phase generator to every state device without adding very much skew. That is the purpose of the distribution stage of the clock system.
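To put "very much skew" in perspective, the figures quoted so far can be combined in a short calculation. The Python sketch below is illustrative only: the 45-ns cycle, the 7.5-ns overall skew, and the roughly 4-ns backplane transit time are taken from the text, while the function names are hypothetical. It simply treats skew as the part of each cycle that cannot be used for work.

```python
CYCLE_NS = 45.0           # cycle time quoted in the text
OVERALL_SKEW_NS = 7.5     # overall clock skew quoted in the text
BACKPLANE_DELAY_NS = 4.0  # approximate backplane transit time quoted in the text


def fraction_of_cycle(time_ns: float, cycle_ns: float = CYCLE_NS) -> float:
    """Return the given time as a fraction of one clock cycle."""
    return time_ns / cycle_ns


def usable_fraction(skew_ns: float, cycle_ns: float = CYCLE_NS) -> float:
    """Return the fraction of the cycle left over once skew is subtracted."""
    return 1.0 - fraction_of_cycle(skew_ns, cycle_ns)


if __name__ == "__main__":
    print(f"backplane transit: {fraction_of_cycle(BACKPLANE_DELAY_NS):.1%} of the cycle")
    print(f"overall skew:      {fraction_of_cycle(OVERALL_SKEW_NS):.1%} of the cycle")
    print(f"usable cycle time: {usable_fraction(OVERALL_SKEW_NS):.1%}")
```

Under these assumptions the skew consumes about one sixth of the cycle, which agrees with the 83 percent utilization quoted in the text.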
The actual circuitry used for the distribution consists of 100K ECL differential devices and 10KH ECL devices. The distribution was heavily influenced by our desire to use an automatic timing verifier. The following discussion of the timing verification environment gives a clearer view of the reasoning behind the clock distribution scheme.

Figure 1  Clock System in VAX 8800 Family (clock module with a 133.5-MHz programmable clock oscillator, control logic, and digital clock phase generator; 20 A,B clock pairs carried over the CPU backplane interconnect, one to each CPU module, one to the memory controller, and one to each I/O controller; each module has its own clock distribution to its gate arrays)

Clock System and the Timing Verification Environment

Traditionally, timing verification was accomplished by hand calculations using component specifications. A designer would simply add all the component propagation delays in a particular path and determine if all timing criteria were met. In the past, this method worked fairly well for several reasons. First, the designer usually knew which paths in a circuit were critical and could give special attention to them. Second, components generally behaved better than their worst-case vendor specifications.
Marginal timing problems, or ones that were simply overlooked, would often be less serious than the difference between the worst-case specifications and how the components actually worked. Finally, timing errors were expected to appear during the hardware debug phase of a project. Therefore, timing errors that were blatantly missed during the design could be corrected (with a lot of hard work) during that phase. That was possible because the overall complexity of the design could be comprehended by the designers.

From the beginning of the VAX 8800 design effort, we knew that the timing of the design would be difficult to analyze manually. First, the sheer complexity of the machine created over four million different timing paths. It was impossible to analyze every path manually or to discover every "critical" one with either manual or intuitive analysis methods.

Second, hardware circuit loops are widely used in the design; these are circuits that feed signals back to themselves during a later machine cycle. These circuits are very difficult to analyze, especially when loops cross physical boundaries or are nested within other loops. Just thinking about the timing ramifications of nested loops taxes the mind. Manually analyzing thousands of these cases would be impossible.

Finally, the hardware design made heavy use of gate arrays, which contain most of the logic. Our ambitious development schedule and the large number of gate array designs simply could not tolerate unanticipated timing errors. A timing error in a gate array meant that a new gate array must be produced to fix the problem.
The fabrication overhead for another semiconductor device, usually taking months, was not consistent with our development schedule. Moreover, while that new gate array was being fabricated, the debugging of the entire system could be jeopardized since it was just not possible to "fix" an LSI chip.

Therefore, the hardware design group wanted to design the processor with the aid of an automatic CAD tool for timing verification. Such an automatic method for verifying the timing was essential to the success of the project. Since the entire design was to be "soft" (the schematics were contained in computer databases), it seemed logical that some type of software tool for automatic timing verification could be applied.

We decided that the most appropriate timing verifier for this project was produced by Valid Logic, Inc. Although this automatic tool solved the problems caused by manual timing verification, it also created some very special new restrictions.

It was apparent from the beginning of the design effort that some restrictions had to be placed on the design styles of individual engineers to reduce the timing-analysis problem to a manageable level. CPU hardware designers, like any other creative persons, often assume large degrees of freedom in their work. Usually, no two designers will arrive at the same solution to a problem, although all solutions may be acceptable. When ten or more designers work independently, as happened on this project, it is likely that ten unique design styles will emerge.
Therefore, we placed restrictions on the timing environment for the following two reasons:

• Some standardization of timing had to take place for electrical signals to communicate properly between designs generated by different people.

• Since the automatic timing verification software was new, several important features were lacking.

The usefulness of an automatic timing verifier depends largely on how well timing-rule violations are reported. Knowing that a design contains timing errors is useful only if it is easy to find them. One way to aid the reporting of timing errors is to create an environment that clocks all state devices in the processor the same way. This means that all logic designs in the processor must follow consistent and strict rules for the clocking of state devices. That was the method we decided to pursue in this design project.

The Timing Environment

The clock system needed strict constraints on its circuit design and physical layout to guarantee accuracy. Therefore, the generation and use of clocking signals were tightly controlled to minimize the different ways in which the circuits could communicate. The timing control of state devices had to be consistent throughout the design. Moreover, any arbitrary timing control of the state devices would have been an impossible task for the timing verification software.

The timing signals in the VAX 8800 processor were carefully distributed to every state device. This distribution was accomplished by carefully expanding the clock signals at strategic physical positions in the processor. A simple example of this expansion, or fan-out, is shown in Figure 2. Each time the clock signals are expanded, more timing uncertainty is introduced into the resulting signals. The 8800 design required up to five levels of expansion to produce enough clock signals for every state device.

Figure 2  Clock Expansion Groups (a clock source fanned out through levels 1 to 5 to the latches)

Figure 3  Minimized Global Skew Distribution (A and B clocks from the clock module, across the backplane, through fan-out levels 1 to 5 on a typical CPU module, to the gate arrays)

As shown in Figure 2, some signals are in common distribution groups. Signals existing in the same group will have low timing uncertainty between them, a characteristic called skew correlation. The timing uncertainty between signals in different distribution groups has no correlation; therefore, these signals have the highest skew. Signals from the same group have a skew, called local skew, lower than the overall group-to-group skew, called global skew.

It is very tempting for designers to take advantage of the lower local skew, which is often only half that of the global skew. Each clock distribution group is usually contained entirely on one logic module due to the natural physical partitioning of the hardware. Therefore, communication between circuits on any particular module can take advantage of the lower local skew. If all signal communication occurs within the local-skew environment, the timing analysis can be consistent and easily managed. However, complications arise when trying to analyze signals that cross from the local-skew environment to the global-skew environment. Signal communication between logic modules will have to pay the penalty of using the higher global skew because the timing signals at each end of the communication are derived from different distribution groups. Managing the timing interface across this partition between local and global skews was beyond the capabilities of the timing verification software.

As discussed earlier, a timing analysis of the entire processor was beyond human capacity; therefore, it had to be performed with timing verification software. The timing verification tool chosen for the 8800 development had no facility for distinguishing between local and global skews. Moreover, we wanted to use the timing verifier to analyze the timing of the entire CPU as one entity. This decision forced us to disallow the use of any local-skew computations in our timing analysis.

Now, from a design point of view this decision made the environment very easy to work with. All timing transactions anywhere in the CPU could be analyzed the same way with the same set of specifications.

Figure 4  Minimized Local Skew Distribution (clock distribution on a typical CPU module, to the gate arrays)

Everything comes at a price, however,
and the obvious negative side of this decision was the loss of the ability to apply the lower local skew. At that point, some performance of the processor seemed to be compromised just to simplify the timing analysis. The following discussion explains how this problem was solved.

The Clock Distribution Solution

Since we wanted to time the CPU as one entity, we had to make the global skew as small as possible to maximize CPU performance. In the actual implementation, the global skew was lowered by removing one gating level from the clock distribution. The gating level removed was necessary for producing low local skew. Figure 3 illustrates the five levels of fan-out that were required to produce enough signals when the global-skew distribution was minimized. Figure 4 shows the same fan-out to produce enough signals in the case in which the local-skew distribution would be minimized. Table 1 illustrates the impact of this optimization for global skew.

Table 1  Distribution Changes

               Local Skew Optimized    Global Skew Optimized
Global Skew    9 ns                    7.5 ns
Local Skew     2 ns                    7.5 ns

Although using the lower local skew would have been valuable, it was sacrificed by making it equal to the global skew. In short, the hardware of the clock system was designed to allow the maximum exploitation of the timing verification software. Of course, hardware and software trade-offs are a common occurrence in any design project. In this case, however, the value of the hardware involved with operating the machine was balanced against the software analysis needed during the design phase of the machine.

Summary

Producing the clocking system for a high-speed computer is best described as an exercise in minimizing and managing skew. In the VAX 8800 project, we avoided exotic hardware techniques so that we could gain the benefit of using an automatic timing verifier. The resulting skew of 17 percent of the cycle time was a figure that could be tolerated. This balance was a fair trade-off since the simplicity of the timing environment allowed us to decrease the time to design and build the VAX 8800 family of systems.

John Fu
James B. Keller
Kenneth J. Haduch

Aspects of the VAX 8800 C Box Design

In each processor in the VAX 8800 family, instructions and data are supplied to the execution units by the C Box. Employing a simple structure with a translation buffer, cache, and address and data buffers, this logic unit is an integral part of the processor's five-stage pipeline. The no-write allocate cache uses a write-through scheme featuring a unique delayed-write algorithm. The C Box has control logic to accommodate pipeline stall conditions caused by memory accesses. The C Box also maintains data coherency within a processor and between processors. A dynamic priority-arbitration scheme solves the lock-out problem between I/O and processor requests.

The performance of a high-speed computer depends to a large extent on how fast data can be passed from its memory to its execution units. If the computer is pipelined, the unit responsible for memory accesses may have to handle pipeline stall conditions.
And if the computer is a multiprocessor, that unit in each processor may also have to handle data coherency problems. In processors with the VAX architecture, data accesses are further complicated by the fact that virtual addresses are normally specified. These addresses require translation to physical addresses before a data access can even be attempted.

In the VAX 8800 system, which is a multiprocessor with pipelined CPUs, the unit that performs address translations and data accesses is the C Box.

C Box Description

The C Box consists of three subunits: the translation buffer (TB), the cache, and the NMI interface. Figure 1 is a schematic diagram of this unit.

The translation of a VAX virtual address to a physical address is a complicated process.¹ Accesses to system and process page tables are required, and shifting and adding must be done to obtain the final physical address. Performing this address translation process for every data reference significantly increases the data access time and reduces the read bandwidth. One way to avoid that is to store the result of this address calculation in a small, fast memory called a translation buffer. Since each translation can access a page of data (512 bytes in the VAX architecture), it is likely that the translation will be used again in the program being executed. Rather than recalculating the physical address (PA) on those subsequent accesses, it can be retrieved from the TB.

The translation buffer in the VAX 8800 processor holds 512 system and 512 process address translations. The following summarizes the characteristics of the TB.

Characteristics of the Translation Buffer

• Direct Mapped
• 1024 Lines
  - 512 System Lines
  - 512 Process Lines
• Allocation on Translation Buffer Miss

A common approach to the problem of data access latency for high-speed processors, and the one used in the VAX 8800 CPU, is to use a cache.² A cache is a small, fast memory located between the processor and the main memory system. If the data requested by the CPU is not contained in the cache, that data is accessed from main memory and loaded into the cache. Thus, in the majority of cases, the cache will contain recently referenced data items, and future references to those data items will be fetched from the cache. The intent is to minimize the number of longer latency accesses to the main memory subsystem. The success of a cache memory relies on the locality of references in both time and space.

Figure 1  Block Diagram of C Box (TB - translation buffer; VA - virtual address; PA - physical address; A, B - A and B phases of the two-phase clock)

The data cache in each VAX 8800 CPU holds 64 kilobytes (KB) of both data and instructions. The list titled "Characteristics of the Cache" summarizes the characteristics of the cache.

The TB and the cache are very similar in concept and structure, except that the TB is used to accelerate address translations and the cache to accelerate data accesses. Each consists of a tag section and a data section. The tag section holds the unique identifier, or tag, for the data item held in the corresponding data section.
Characteristics of the Cache

• Direct Mapped with Physical Address
• Read Allocate Only
• Delayed-Write Cache Update
• Write-through Memory Update with Write Buffering
• 1024 Blocks
• 64-byte Block Size
• 4-byte (one longword) Line Size
• 32-byte (one hexword) Cache Refill Size

The TB and the cache are direct mapped, meaning that each address can point to only one location; however, each location can potentially be allocated to one of many addresses. A tag permits the identification of a data item in either the TB or a cache location. The tag in the VAX 8800 processor is an unmodified selection of bits from the address of the data item being accessed. This concept is depicted in Figure 2.

Figure 2  Translation Buffer and Cache Address Mapping (labels shown: VA(31-0), VA(30-18), PA(29-0), PA(28-16), PA(15-6), TB tag, TB data, cache data, TB hit, cache hit; VA - virtual address; PA - physical address; TB - translation buffer)

As mentioned earlier, a memory access is required if the cache does not contain a requested data item. In the 8800, both processors are connected to the memory and the I/O subsystems through the NMI bus. All read and write references that go to these subsystems are processed by the NMI interface. This interface maintains a set of buffers for both read and write reference streams. For the read stream there are actually two sets of address buffers: one for data reads, the other for instruction reads.

C Box Operations

A C Box reference consists of a function code, an address, and in the case of writes, 32 bits of data. In general, that address is a 32-bit virtual address (VA). The VA translation process begins with a check to see if the PA is available in the TB. If the PA is available, called a TB hit, the data is read out and concatenated with the lower nine bits of the VA to form the PA. As part of the translation process, the TB also performs page access checking. If the PA that pertains to the VA is not in the TB, called a TB miss, then microcode must perform the translation. The microcode then writes the data into the TB for subsequent use. (If the address supplied is already a PA, then the TB is not used.)

Only physical addresses access the cache. If the data referenced is contained in the cache, called a cache hit, then the data can be accessed from there. If the cache does not contain the data, called a cache miss, then the data must be accessed from memory.

Read Operations

Cache-miss addresses for reads are passed to the NMI interface, where they are held in the read address buffers. A hexword read request (32 bytes), with the address of the missed location, is then made to memory. The memory data is passed to the requesting unit, and the address held in the read address buffer is used to update the missed cache location. A read miss is the only occasion upon which a cache location is allocated.

There are two read streams in the C Box for requests to memory: the data stream, called the d-stream, and the instruction stream, called the i-stream. The i-stream requests the memory to send data destined for the instruction unit (I Box), which interprets that data as macroinstructions. I-stream fetches are initiated by microcode, which loads a C Box register called the physical instruction buffer address (PIBA). The PIBA holds the address of the next longword of the i-stream to be fetched.
If the execution of macroinstructions is sequential (i.e., there are no branches, page crosses, etc.), the C Box can increment the PIBA contents automatically after each fetch. However, should the program branch or a page cross occur, microcode must be used to reload the PIBA. D-stream fetches are made only by the microcode, which must specify one of eight memory data (MD) registers as its destination. D-stream data is always returned to the execution unit.

Write Operations

In general, the performance of a cache is measured by its hit rate when reading data. The selection of the update mechanisms for both cache and memory, however, can have a major influence on the design of the cache. There are two well-known strategies for updating a cache: write allocate, and no-write allocate. A write allocate scheme updates a cache location whether or not the write is a hit or a miss. This scheme is generally implemented with a write-back memory arrangement (discussed later). In a no-write allocate scheme, the cache is updated only if the write was a hit. The VAX 8800 processor uses a no-write allocate scheme.

The no-write allocate scheme does, however, present a problem. Since only writes that hit will update the cache, cache updates take two pipeline cycles in the C Box: the first to check for hit or miss, the second to update the cache for a hit. The C Box was designed to enable one read reference to complete in each cycle. If two consecutive cycles are needed to update the cache, the second cycle could block a read reference, thus causing a pipeline stall. To solve this problem, the C Box implements a delayed-write algorithm.
This mechanism delays writes that must update the cache from doing so until the first cycle of the next write reference. The second cycle of the delayed write does not need to be the next consecutive cycle.

The delayed-write algorithm in the C Box takes advantage of the fact that the first cycle of a write utilizes only the tag section of the cache to determine whether a hit or a miss has occurred. The second cycle uses only the data section. A write that must update the cache has its address and data placed into the delayed write address and data buffers respectively. On the next write access, during the cache-tag lookup cycle, the data section of the cache will be updated from the address and data contained in those buffers, but only if the previous write access was a hit. Since reading a data item after one has been written is common, this design significantly reduces the potential for stalls.

Write Buffer

All write references, whether or not they hit in the cache, must eventually go to memory. There are two general strategies in cache design with respect to memory updating: write-through, and write-back. In the write-through approach, write references are sent to the memory system immediately. Conversely, in the write-back approach, writes are held until the cache block is deallocated (made ready to receive different data).

There are several major problems with a write-back strategy. First, it requires either microcode or hardware to accomplish all the write-back functions. Adding that code or hardware to the C Box would have considerably increased its complexity.
Second, if there is a write miss with this scheme, a cache block that might be full of valid data could be displaced by a block whose only valid data was that just written to the cache. For a cache with a large block size, such as that of the 8800, this action is undesirable. Moreover, in most cases microcode reads data before it is written; therefore, writes will generally hit in the cache. Finally, the write-back strategy requires a complex algorithm to maintain coherency between caches within a multiprocessor system. Therefore, for all those reasons, we chose to use the write-through approach in the cache.

One disadvantage of write-through is that it tends to generate a lot of write traffic to the memory. In a shared-bus system like the 8800, this traffic can limit performance. To reduce memory-write traffic, writes in the VAX 8800 processor are buffered in a write buffer contained in the NMI interface. This write buffer is really a one-line, octaword, write-allocate cache. A write going out to the NMI bus is held in the write buffer. Subsequent writes to the same octaword update only the write buffer so that no memory requests are sent on the NMI bus. A write that is outside the octaword currently in the write buffer deallocates it; that is, the contents of the write buffer are sent to memory, and the next write replaces those contents in the buffer.

Like the cache, the success of the write buffer in reducing bus traffic relies on the locality of programs in space and time. For example, sequential writes, such as pushes to the stack, will get collected in the write buffer even if the writes occurred in different macroinstructions.
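The merging behavior of the write buffer can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the NMI interface logic: it holds a single octaword entry, it records a bus transfer only when the entry is deallocated or explicitly flushed, and the name `ToyWriteBuffer` and its fields are invented for illustration.

```python
OCTAWORD_BYTES = 16


class ToyWriteBuffer:
    """One-entry write buffer that merges writes to the same octaword."""

    def __init__(self):
        self.base = None       # octaword-aligned address currently held
        self.longwords = {}    # longword offset (0, 4, 8, 12) -> value
        self.bus_writes = []   # (base, longwords) transfers sent to memory

    def write(self, addr, value):
        base = addr & ~(OCTAWORD_BYTES - 1)
        if self.base is not None and base != self.base:
            # A write outside the held octaword deallocates the entry.
            self.flush()
        self.base = base
        offset = addr & (OCTAWORD_BYTES - 1) & ~3
        self.longwords[offset] = value

    def flush(self):
        """Send the held octaword, if any, to memory and empty the buffer."""
        if self.base is not None:
            self.bus_writes.append((self.base, dict(self.longwords)))
        self.base = None
        self.longwords = {}
```

In this sketch, four longword pushes at 0x100C, 0x1008, 0x1004, and 0x1000 produce no bus transfers at all; a later write to 0x2000 sends the four collected longwords to memory as one transfer.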
This collected "package" of writes can then be sent to the memory more efficiently than can individual writes.

Another advantage of the write buffer is that it decouples the processor from memory activity. When the memory is busy processing transactions from the other processor or from the I/O subsystem, a processor will not stall due to writes. The write buffer is actually implemented as a two-deep buffer, which further reduces the potential for stalls.

Pipeline Stalls

In a pipelined implementation, how well the pipeline performs is determined both by how often it is flushed clear and how often it is stalled. Stall conditions are generally related to the lack of some physical resource or data. In some implementations, some pipeline stages can take more cycles to complete than others for certain functions. If a shorter stage precedes a longer one, the longer one will be unable either to accept fresh data or to pass its result to the next stage until finished with its cycle. In turn, other portions of the pipeline cannot proceed with their operations; therefore, the pipeline will stall. In this stalled condition, all stages preceding the "bottleneck" maintain their input and output conditions until the stage responsible for the stall completes its function. Some implementations have a combination of stages that may exhibit these characteristics, leading to complex pipeline stall conditions.

In the VAX 8800 CPU, the design simplicity of the pipeline ensures that each pipeline stage, except the C Box, always completes its function in one cycle.³ Since the C Box also controls data accesses, all stalls in the 8800 are related to the operation of this unit. The pipeline will experience two types of stalls: the MD stall, and the VA stall.

MD Stalls

When making a read reference, a microinstruction must specify one of eight MD registers to be used as its destination. When data is made available, either from the cache or from memory, it is written into the specified MD register. Subsequent microinstructions then use the data from this register. If a microinstruction attempts to use an MD register that is not "valid" (i.e., the data has not yet been fetched by the C Box), the pipeline will experience an MD stall.

The MD stall condition is a data-dependency type of stall that is generally seen in pipelined machines. On the VAX 8800 processor, certain steps are taken to either avoid such stalls or reduce their effects. For example, consider two consecutive microinstructions, R and S, as illustrated in Figure 3. R is a microinstruction that performs a read and puts data into an MD register. S then accesses and uses the data fetched by R. If R and S are adjacent, the pipeline will stall in the 8800. The reason for the stall is that the pipeline stage accessing the MD data and the stage fetching that data (the C Box) are separated by one other stage, the arithmetic and logic unit (ALU). When S tries to use the MD data, R is just starting to make the read reference in the C Box. S must therefore stall the pipeline, waiting for data to be supplied by R.
[Figure 3: Instructions R and S Are Adjacent. Each instruction passes through the stages MD access for data, ALU, TB, and cache. R starts its read reference; S requires the data read by R and must stall at least one cycle for the data. MD = memory data register; TB = translation buffer.]

[Figure 4: Instructions R and S Separated by Another Instruction. R has completed its read reference and the data is just available; S requires the data, which is sent directly into the ALU, bypassing the MD update. No stall.]

On the other hand, if R and S are separated by one other instruction, then when S attempts to use the data read by R, that data is just being made available by the C Box (assuming, of course, a read hit in the cache). If S were to wait for the MD registers to be updated before using the data, the pipeline would stall. To eliminate that type of stall, a path has been designed from the C Box directly into the input of the ALU, bypassing the MD registers. Therefore, the data coming from the cache is sent both to the MD registers for updating and directly to the ALU, where S can use the data. The net effect is that this bypass path removes the one-cycle latency that S would have experienced had it waited for the data to come out of the MD registers. Figure 4 illustrates these concepts.

Had R caused a read miss, S would still cause an MD stall since the C Box must make a memory fetch for the data. Notice that an MD stall happens only when S attempts to use an MD register.
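The interaction between instruction spacing, the bypass path, and read misses can be put into a small arithmetic model. The Python sketch below is one reading of the description above, not a statement of the actual pipeline timing: the cycle numbering (`C_BOX_CYCLE`), the function name, and the `miss_penalty` parameter are all assumptions, and the article gives no memory latency figure.

```python
# Assumed numbering: R issues in cycle 0, passes the ALU in cycle 1, and
# reaches the C Box in cycle 2, where a cache hit delivers the data.
C_BOX_CYCLE = 2


def md_stall_cycles(gap, bypass=True, miss_penalty=0):
    """Cycles that S stalls when it issues `gap` cycles after R.

    gap=1 means R and S are adjacent; gap=2 means one instruction
    separates them. miss_penalty is a hypothetical number of extra
    cycles before read-miss data returns from memory.
    """
    if gap < 1:
        raise ValueError("S must issue at least one cycle after R")
    data_cycle = C_BOX_CYCLE + miss_penalty
    # With the bypass, S may start in the cycle the data is delivered;
    # without it, S waits one more cycle for the MD register update.
    earliest_start = data_cycle if bypass else data_cycle + 1
    return max(0, earliest_start - gap)
```

Under these assumptions, adjacent instructions stall for one cycle, one intervening instruction removes the stall when the bypass is present, and the same spacing without the bypass would cost one cycle. With a nonzero `miss_penalty`, every instruction placed between R and S hides one cycle of the memory latency.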
Therefore, a general rule for making microcode accesses to the C Box is to make read references early and to use the MD registers late. Should the read reference miss, some part of the memory fetch latency will be hidden by the microinstructions between the read and the MD register access. When data returns from a read miss and the pipeline is either undergoing or about to undergo an MD stall, the bypass path can be used to reduce the effects of the stall or even prevent it.

VA Stalls

A VA stall condition occurs when the C Box cannot process a requested reference. This can be due to either an invalidation cycle in the C Box (discussed in the final section of this paper) or the capabilities of the address and data buffers in the NMI interface being exceeded.

As mentioned earlier, for reads there is a set of buffers for d-stream and i-stream references. The d-stream buffering is one deep, meaning there can only be one read miss outstanding in the C Box. However, the implementation will not allow the pipeline to stall should subsequent reads hit in the cache. I-stream reads never stall the pipeline as do VA and MD stalls, which stop the clock. The instruction buffer can "stall" if it does not have enough data for the decoder to complete the decode of the current VAX instruction operand. This condition causes the CPU to perform a no-operation microword. That does not stop the clock, however, and thus is not a pipeline stall.

The C Box can still receive commands even if it contains one read miss. Of course, there is the potential that the command being received will miss in the cache.
That will require the NMI interface to request the data from memory, thus resulting in a VA stall. That stall lasts from the time the command is received until the time the previous read-miss data returns from memory. If the second command is a read that hits in the cache, a VA stall will be generated for the one cycle that it takes to determine whether or not there is a cache hit. The read data will then be taken from the cache and returned to the MD, after which the stall will be released.

Since writes go to memory more than reads, the buffering for writes is more extensive. The delay-write buffer and the double buffering in the write buffer are used to reduce the possibility of write stalls. These buffers enable the C Box to hold a maximum of nine longwords of data before the pipeline will experience a VA stall on a write.

Stalled and Unstalled Logic in the C Box

If an instruction is stalled, the C Box has either not returned the data or cannot take another reference. Therefore, all stages prior to the C Box (the I Box and the E Box) must be stalled. The TB is part of the last stage of the pipeline; therefore, it must be capable of being stalled. When the pipeline stalls, the TB holds the address of the stalled reference. Only the NMI interface can resolve a stall, either by supplying the read miss data or by freeing up its buffers. Thus this interface can never be stalled. However, the cache, being part of the last stage of the pipeline, is also the path for supplying data to the stalled instruction. This situation leads to an interesting control characteristic of the C Box. One of its sections, the TB, can be stalled; another, the NMI interface, must never stall; and the third section, the cache, must remain unstalled but maintain stalled input and output conditions in its logic. Figure 5 depicts the logic for stalled and unstalled conditions in the C Box.

[Figure 5: Stalled and Unstalled Logic in C Box. The I Box, E Box, and translation buffer are stalled; the cache is stalled/unstalled; the NMI interface is unstalled.]

Coherency Problems in the C Box

In general, data coherency means that a read should always get correctly modified data when a series of reads and writes is made in any sequence. One way to maintain coherency is to perform all reads and writes to completion in a purely sequential manner, thus strictly maintaining their sequence of reference. However, in a pipelined machine, not only can there be several sources of read and write references, but there can also be more than one copy of the data item. This duplication often leads to very complex solutions to achieve coherency.

This complexity has been simplified somewhat in the VAX 8800 pipeline by having the C Box both control and sequence all data accesses. The C Box itself, however, is pipelined, having a d-stream and an i-stream for reads, and a stream for writes. This fact also presents some coherency problems. Coherency for the C Box means that two conditions must be met.

1. After a sequence of reads and writes has completed, any valid blocks in the cache must match the data in the memory.

2. Whenever the processor writes to a location in memory and then reads that location, the data has to be what was written.

Two types of coherency problems exist in the VAX 8800 system: coherency within a processor, and coherency between processors.

The first type of problem in the C Box arises from the implementation of the delay-write algorithm discussed earlier. A problem occurs when a read is attempted to the cache location waiting to be updated by the write held in the delay-write buffers. The read will hit, but the cache data will be stale. One solution to this problem is to stall the pipeline while the cache is updated and then perform the read to obtain the correct data. The trouble here is that the sequence of writing to and reading from the same location is a common occurrence. Thus to stall would significantly reduce the read bandwidth.

The C Box solves this problem by comparing selected bits of the read and write addresses in the delay-write buffer. If the bits match, then the data content of that buffer is used as the read data. This solution works because, to the read, the delay-write buffer appears to be an extension of the cache. Since the read address matched the address in this buffer, the data can be taken directly from it. Coherency is thus assured, and no stall penalty is incurred.

The second type of coherency problem occurs when the read is a miss and thus goes to the NMI interface. To assure high performance, the NMI interface maintains two streams of data requests, the read and write streams. The buffering and the control of these two streams operate independently. If made to different data items, read and write requests can be processed to memory as quickly as possible, even out of sequence.
The coherency problem is to make sure that subsequent reads and writes to the same data item result in its correct state.

If a read request occurs that was a miss, the cache will send it to the NMI interface upon discovering that fact. Once in the NMI interface, the read address is compared to the address of the octaword in the write buffer. If those addresses are different, the cache will send the read directly to memory. Thus the data in the write buffer will be unaffected. If the addresses match, however, the write data will be sent to memory, followed by the read request. Since the memory subsystem processes references in a sequential manner, the read will always access the correct data. (Of course, this case is fairly simple. A more complicated one is that in which a read is sent to memory, and the processor performs a write while waiting for that read.)

If the addresses of the read and write match, the cache can give the processor the requested data but cannot mark the returned data valid in the cache. This situation occurs because the read-miss data being fetched from memory has been made stale for subsequent reads.

The microcode is designed so that it will never read a data item and then write to it without first accessing the MD registers. However, a cache block is 64 bytes long. The microcode could write to any other data item in the block before coming to the missed data item. There can be as many as three writes and two reads (one each for the d- and i-streams) buffered simultaneously in the C Box, all referencing the same cache block. Even worse, the C Box can send an arbitrary number of writes to memory while waiting for the data returned by the read to memory. To maintain coherency,
the C Box performs a set of address matches between the read and write streams. Then it "remembers" whether or not any write addresses matched the outstanding reads and marks them invalid as appropriate.

C Box Design for a Multiprocessor System

The VAX 8800 system consists of two identical VAX 8800 processors on the NMI bus connected to the memory and I/O subsystems. Within a processor, only the design of the C Box has been affected by the requirements of a multiprocessor arrangement. That is because the C Box is the CPU's interface to the NMI bus and contains the central arbitration logic for that bus.

There are three key issues in designing a memory interconnect for a multiprocessor system: bus arbitration, bus bandwidth, and data coherency between processors.

Bus Arbitration on the NMI Bus

Two major problems were encountered in the design of an arbitration scheme for the NMI bus. The first was the fact that between the CPUs and the I/O subsystems, called the NBIs, there was a possibility that a high-priority device could lock out a low-priority device from the bus. This is certainly possible with a fixed priority-arbitration scheme. To address this problem, the C Box implements a dynamic priority-allocation scheme that causes priority to be assigned between two groups: the I/O devices, and the CPUs. Within these groups, the priority shifts between the two CPUs and the two I/O devices. For example, if all four devices wanted to use the bus all the time, the order in which the bus would be granted to the devices would be first CPU, first I/O, second CPU, second I/O, first CPU, first I/O, second CPU, second I/O, etc.
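The grant order in this example can be reproduced with a two-level rotation, sketched below in Python. The sketch assumes that priority simply rotates after every grant, both between the two groups and within the winning group; the article does not describe the arbitration logic at this level of detail, and the device labels and the `ToyArbiter` class are invented for illustration.

```python
class ToyArbiter:
    """Two-level rotating priority: between groups, then within a group."""

    def __init__(self):
        # Within each list, the first entry currently has priority.
        self.groups = {"cpu": ["CPU0", "CPU1"], "io": ["IO0", "IO1"]}
        self.group_order = ["cpu", "io"]

    def grant(self, requesters):
        """Grant the bus to one requesting device, or return None."""
        for group in list(self.group_order):
            for device in list(self.groups[group]):
                if device in requesters:
                    # The winner and its group drop to lowest priority.
                    self.groups[group].remove(device)
                    self.groups[group].append(device)
                    self.group_order.remove(group)
                    self.group_order.append(group)
                    return device
        return None
```

With all four devices requesting continuously, repeated calls to `grant` return CPU0, IO0, CPU1, IO1, and then repeat, which matches the order given in the example. No requester is ever passed over twice in a row by the same competitor, which is the property a fixed-priority scheme lacks.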
This scheme guarantees that all devices on the bus will have nearly equal access to the bus, thus solving the lock-out problem.

The second problem involves the "memory busy" situation. Whenever the memory subsystem cannot process more requests, it sends a "memory busy" signal. It could happen, for instance, that a CPU accesses the bus and attempts to write to memory. Upon receiving a memory-busy signal, the CPU will abort the write. When memory is released, some other device will access the bus and perform a write, thus filling the write queue in memory. Once again, the first CPU re-arbitrates, accesses the bus, and tries to write. Once again, that CPU receives a memory-busy signal. And so on.

The NMI arbitration scheme mentioned above solves this problem in which a device might get locked out of memory. As implemented, the arbitration scheme saves the priority state at the time before the memory-busy signal was asserted. The arbitration logic then restores that state so that the device that received the signal will get the bus when the memory-busy signal is deasserted.

Bus Bandwidth

For the processors on the interconnect, bus bandwidth involves two components: read bandwidth, and write bandwidth. The problem of inadequate read bandwidth is addressed by having a high hit-rate cache. The higher the hit rate, the fewer the requests to memory.

The problem of inadequate write bandwidth can be treated in two ways. The first way is to have a write-back cache like the one on the VAX 8650 processor.⁴ Such a cache writes a block to memory only when the cache block is deallocated. This technique can significantly reduce the write bandwidth requirements.
In multiprocessor systems like the 8800, however, in which each processor has an internal cache, this technique becomes complicated. In these systems, a data item can exist not only in memory but also in all the caches. To maintain coherency, each write-back cache would have to notify the other cache when the first cache writes. This technique usually leads to a complex protocol and design implementation.

Another approach in a multiprocessor system, the one used in the 8800, is to implement write-through caches. In such an approach, all write references go directly to memory so that each cache on the bus can "see" all write activity. The caches can then be invalidated. Such an approach greatly simplifies the protocol for cache coherency but, as discussed earlier, generates a high degree of write traffic. The unique design of the write buffer helps to reduce this traffic, although not as much as a write-back cache would. In the 8800 processor, however, the write buffer reduces traffic enough so that the two VAX 8800 processors can write at their maximum bandwidths on the NMI bus.

Coherency in a Multiprocessor System

A multiprocessor system, with internal caches, presents a number of interesting coherency issues when sharing data. Ideally, if one processor writes to a location and the other processor reads that location, the read will always get the data that was written. In practice, achieving this condition is difficult. Several major questions arise: Did the read happen before the write or after it? What happens if both processors write to the same location at the same time? Unless controlled, these situations can produce unpredictable results.

If programs on the processors want to share data,
they must use the interlock instructions in the VAX architecture.⁵ Only after an interlock instruction is processed will the memory location be guaranteed to have the correct data.

The general method is as follows. Processes must decide to share a block of memory. One memory location is called the software lock, and only one process at a time is allowed to write to (or lock) that location. This is accessed with an interlock instruction, for example, the branch on bit set and set interlocked (BBSSI) or the add aligned word interlocked (ADAWI) instructions. Upon gaining the software lock, a given process can proceed to write any location in the shared block. Read-write coherency will be assured only if the other processes sharing that data observe the protocol of obtaining the software lock before modifying the data structure.

The VAX interlock instructions are implemented using interlock microinstructions. These enable a processor to lock and unlock the memory subsystem. Once locked, this subsystem excludes further attempts to lock it until an unlock has occurred. Thus only one processor or I/O system can lock the memory subsystem at any one time.

When each processor has an internal cache, there is one more mechanism that keeps the two processors coherent. While one processor is performing a write to memory and while the write command is on the NMI bus, the other processor will examine its cache store to see if it contains a copy of that data. If the data is there, it is marked invalid. The next request for this data will then result in a cache miss and a subsequent fetch to memory.
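The invalidate-on-write mechanism can be sketched with two toy processors sharing a toy bus. The Python model below is a simplification under stated assumptions: it tracks single addresses rather than 64-byte blocks, it omits the write buffer and the interlock, and the names `ToyBus`, `ToyCpu`, and `snoop_write` are invented for illustration.

```python
class ToyBus:
    """Shared bus in front of memory; every write is visible to all CPUs."""

    def __init__(self):
        self.memory = {}
        self.cpus = []

    def read(self, addr):
        return self.memory.get(addr, 0)

    def write(self, source, addr, value):
        self.memory[addr] = value
        for cpu in self.cpus:
            if cpu is not source:
                cpu.snoop_write(addr)   # other caches check for a copy


class ToyCpu:
    """Processor with a write-through, no-write-allocate cache."""

    def __init__(self, bus):
        self.bus = bus
        self.cache = {}
        bus.cpus.append(self)

    def read(self, addr):
        if addr not in self.cache:       # read miss: allocate from memory
            self.cache[addr] = self.bus.read(addr)
        return self.cache[addr]

    def write(self, addr, value):
        if addr in self.cache:           # update the cache only on a hit
            self.cache[addr] = value
        self.bus.write(self, addr, value)

    def snoop_write(self, addr):
        self.cache.pop(addr, None)       # mark any local copy invalid
```

In this model, if the right processor has cached a location and the left processor then writes it, the right processor's copy is dropped, and its next read misses and fetches the new value from memory.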
This simple approach is possible because the VAX 8800 caches are write-through. Although all writes are seen on the bus, the write buffer packs together consecutive writes within an octaword. Therefore, the number of invalidation cycles performed by a processor will be reduced. When an interlock write is performed, the contents of the write buffer are sent to memory. Thus the interlock mechanism ensures that data coherency will work under all conditions. Figure 6 illustrates the events that achieve coherency in the 8800.

[Figure 6: Multiprocessor Coherency. The left and right processors, each with a cache and a write buffer, share the NMI bus and the memory, which holds the software lock. The other processor sees a write on the NMI and looks in its cache for invalidation; a write interlock forces the write buffer contents to memory.]

Summary

The general concepts used in the design of the C Box are well known to computer designers. Our goal was to achieve a simple yet high-performance design that avoided unnecessarily complex solutions that did not give comparable increases in performance. The choices made have yielded a design that fully supports the multiprocessor concept. The VAX 8800 system can translate addresses and access data faster than any previous VAX processor.

Acknowledgments

All those who worked on the VAX 8800 system contributed to the thinking that went into the C Box design. Special thanks go to Dave Sager for keeping things going.

References

1. VAX Architecture Handbook (Maynard: Digital Equipment Corporation, Order No. EB-26115-46, 1986): 7-11 to 7-19.

2. A. Smith, "Cache Memories," Computing Surveys, vol. 14, no.
3 , ( S e p t e m b e r 1 982) : 473-530. 3. S. M ishra , · ' The VAX 8800 M icroarchitec ture . " Digital Techn ical jo u rnal (Febru ary 1 9 8 7 , t h is issue) : 2 0- 3 3 . 4. T . Foss u m , J . M c E l roy, and W . E ng l i s h , "An Overview of the VAX 8600 System , " D ig it a l Te c h n i c a l jo u r n a l ( A u g u s t J 985): 8-23 5. S . F a r n h a m , M . H a r ve y , a n d K . M o rse . "VMS M u l t i processi ng on the VAX 8800 S y s t e m , ' ' Dig i tal Te c h n ical jo u r n a l ( Ft:bruary 1 98 7 , t h is i ssue ) : 1 1 1 - 1 1 9 Digital Technical journal No. 4 Fe/Jmmy 1 98 7 51 Paul]. Natusch David C. Senerchia Eugene L. Yu The Memo ry System in the VAX 8800 Family The memory system in the VAX 8800family can send data at 71MB per sec ond and receive it at 59MB per second. The 8800 and 8700 CPUs can con tain up to 128MB of memory, the 8550 and 8500 up to BOMB. Commands, addresses, and data flow between the memory interconnect (NMI bus) and the memory controller, array bus, and array modules. Read, write, and masked-write commands are executed. The designs of the NMI bus and write-through cache affected the memory system design. Although ECL is used in the controller, TTL is used in the array bus. The array modules of 4MB and 1 6MB contain 256K MOS dynamic RAM chips. Al i members of the VAX 8 8 0 0 fam i ly of proces sors (the 8800, 8 7 0 0 , 8 5 5 0 , and 8 5 00) usc the s a m e t y p e o f m e m o r y sys te m . S i n c e t h e VAX 8800 system is a m u l t i processor, that mem ory system must connect co both CPUs and both I/0 adapters , cal led the N B lAs. The bus connect i ng these devices is called the NMI bus, and each connec t i o n o n t h e N M I bus is c a l led a n e xu s . These con necti ons a re i l l ustrated i n F i g u re 1 , which shows five nexuses : one for each CPU, one for each N B LA, and one for the memory system . 
Figure 1 Memory Interconnect Structure

The memory system itself consists of three major parts, as depicted in Figure 2:
• A memory controller based on ECL technology
• A high-speed TTL bus connecting that memory controller to a maximum of eight array modules
• The array modules themselves

The memory system can deliver 71 megabytes (MB) per second of read bandwidth and 59MB per second of write bandwidth. Since the VAX architecture has a 32-bit format, all datapaths in the memory system must also handle 32 bits. These datapaths are combined by pipelined and parallel operations to produce the read and write bandwidths. The most significant occurrence of parallel operations is two-dimensional interleaving. The first dimension interleaves between longwords (32 bits) of data on a single array module; the second interleaves between octawords (4 longwords) on different array modules. As many as three array modules can be active simultaneously with either a read or a write. There are three cases:
• Each module can do one read.
• One module can do a read while the other two can do as many as four writes.
• Two modules can each do a read while the third can do as many as four writes.

The selection of the array modules can be programmed from the console when the system is powered up. Thus the memory system can support a variety of array module sizes and speeds without the need to modify the hardware in the memory controller. Moreover, the memory controller can address 512MB of physical memory, the limit of the VAX architecture. The 8800 is the first VAX system to be able to address this much physical memory.
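The two-dimensional interleaving described above can be pictured as a simple decomposition of a byte address. The following Python sketch is illustrative only: the function name, the bit layout, and the number of modules in an interleave set are assumptions, since the real mapping is programmed from the console and is not given in this paper.

```python
LONGWORD_BYTES = 4          # a longword is 32 bits
LONGWORDS_PER_OCTAWORD = 4  # an octaword is 4 longwords (16 bytes)

def interleave_map(byte_address, num_modules=2):
    """Map a byte address to (module, bank, row) under an ASSUMED layout.

    bank   : longword within its octaword (first interleaving dimension)
    module : consecutive octawords rotate across modules (second dimension)
    row    : which octaword slot inside the selected module
    """
    longword_index = byte_address // LONGWORD_BYTES
    bank = longword_index % LONGWORDS_PER_OCTAWORD
    octaword_index = longword_index // LONGWORDS_PER_OCTAWORD
    module = octaword_index % num_modules
    row = octaword_index // num_modules
    return module, bank, row

# Walk three consecutive octawords, one longword at a time.
for addr in range(0, 48, 4):
    print(hex(addr), interleave_map(addr))
```

Under these assumptions, the four longwords of one octaword land in four different banks of one module, and the next octaword lands on a different module, which is what lets several operations overlap.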
Figure 2 Plan of Memory System

Owing to the limits of the existing technology, however, the initial machine was introduced with 32MB for the 8800 and 8700 systems, and 20MB for the 8500 and 8550 systems. The 32MB configuration consists of eight 4MB modules with 256K MOS dynamic RAMs packaged in DIPs. To increase the density of the machine without using a different semiconductor technology, a 2MB daughter module was developed after the initial announcement. This module uses double-sided surface-mount technology and plastic leadless chip carriers. Eight of these daughter modules are mounted on a mother module to produce a 16MB array module. This new module has increased the machine's memory to 128MB for the 8800 and 8700 systems, and to 80MB for the 8550 and 8500 systems.

Memory System Architecture
As shown in Figures 1 and 2, the memory controller communicates with the CPUs and the NBIAs over the memory interconnect, called the NMI bus. Commands, addresses, and data requests are all first received by the NMI interface and then passed to other sections of the memory controller. Addresses and data are stored in custom multiport RAMs, where eight locations are reserved for addresses and eight for data. The NMI interface encodes command information, passing it to the command-control portion of the memory controller. Since the memory controller communicates with the NMI bus and the array bus, the NMI protocol has to be changed to that of the array bus.
Reads and writes of data fields with various sizes are received by the NMI interface. The NMI bus supports a very robust set of commands. Reads and interlocked reads are supported for longwords (4 bytes), octawords (4 longwords), and hexwords (2 octawords). Masked writes and masked-write unlocks are supported for longwords, quadwords (8 bytes), and octawords. Writes are supported for longwords and octawords.

The read-interlocked and masked-write unlock commands are used to implement VAX instructions in which mutual exclusion is required. For example, the VAX instructions ADAWI, BBCCI, BBSSI, INSQHI, INSQTI, INSQUE, REMQHI, and REMQTI all need these commands. Since an interlocked instruction locks the entire memory system, the interlock bit must reside in the memory controller. This bit restricts the execution of subsequent interlock commands until the lock has been released by a masked-write unlock instruction.

After receiving a memory request from a nexus, the memory controller must transfer that request to the appropriate array module. This transfer is accomplished using the array bus.
This bus consists of
• A unidirectional set of command and address lines from the memory controller to the array modules
• Another unidirectional set of data lines from the memory controller to the array modules
• A set of data lines (capable of assuming three states) that can be driven by any one of the array modules and received by the memory controller
• Various status and control lines that communicate in both directions

The array bus has a minimal repertoire of commands, consisting of longword reads, octaword reads, and longword writes, but not hexword reads. Since the NMI supports hexword reads, the memory controller must convert them into two octaword reads and then send them to the array modules. Thus the two octawords of a hexword read can reside on different array modules. That fact increases the memory bandwidth because parallel accesses can be executed. The array bus supports only longword writes; therefore, octaword writes must also be converted. As mentioned earlier, the array bus has one set of lines for commands and addresses and another for data. Therefore, an octaword write, which takes five cycles to transfer on the NMI (one for the command, four for the data), can be transmitted in five cycles on the array bus to an array module. Figure 3 shows the corresponding actions during each cycle on the NMI and on the array bus.

In addition to commands, the memory system must also execute maintenance tasks, including memory refresh, error reporting, and battery backup. Since physical memory is implemented with MOS dynamic RAMs, every array row must be refreshed once every 4 milliseconds. This function can be done by refreshing one row every 14 microseconds. To facilitate this activity, the memory controller sends signals to each array module from a 14-microsecond oscillator.
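The refresh figures quoted above can be checked with a few lines of arithmetic. In this sketch the 4-millisecond window and the 14-microsecond interval come from the text; the count of 256 rows per DRAM is an assumption made for illustration and is not stated in this paper.

```python
REFRESH_WINDOW_US = 4_000    # every row must be refreshed within 4 ms (from the text)
REFRESH_INTERVAL_US = 14     # one row is refreshed every 14 microseconds (from the text)
ASSUMED_ROWS = 256           # ASSUMPTION: rows per DRAM, not given in the paper

# How many single-row refreshes fit inside one 4 ms window?
max_rows = REFRESH_WINDOW_US // REFRESH_INTERVAL_US

# How long does one full sweep of the assumed row count take?
sweep_time_us = ASSUMED_ROWS * REFRESH_INTERVAL_US

print(max_rows, sweep_time_us, sweep_time_us <= REFRESH_WINDOW_US)
```

With these numbers a 14-microsecond tick leaves room for up to 285 row refreshes per window, so a device with the assumed 256 rows would be swept with some margin to spare.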
Upon receiving a refresh signal, an array module will handle the refresh arbitration and execute the operation.

Figure 3 Cycles on NMI Bus and Array Bus

Occasionally, a bit will be lost due to either alpha particles or a device failure. In that case the memory controller must handle those errors and other types in a graceful manner. To do that, the memory system uses a 7-bit modified Hamming code to generate the ECC, which allows all single-bit errors to be corrected and all double-bit errors to be detected. After correcting each error the memory system logs the error's physical page address and the bit. The memory system then interrupts the CPU to call an error service routine, which logs in a VMS file the necessary information to isolate the failure. The memory system can also interrupt the CPU to handle internal parity errors and interlocked time-outs. An interlocked time-out happens when a nexus executes a read interlock but never issues a masked-write unlock. The system software can enable or disable these interrupts.

Battery backup, standard equipment on both the 8800 and 8700 systems, can power the refresh operation when the system is down. That power allows the memory system to continue to refresh the RAMs so that data will not be lost. Note that the entire system is not backed up; therefore, all components must be in quiescent states before the memory system enters battery mode. Upon sensing that power is eroding, the 8800 will write all its data to the memory system. The memory controller will then complete all commands and send signals to the array modules informing them to enter battery mode. In this mode only five MSI chips on the memory controller and approximately half the control logic on the array module will be active.

Command Execution
The execution of any command received by the memory system is a joint effort between the memory controller and the array modules. Figure 4 depicts the datapath in each memory component. After a nexus places a command on the NMI bus, the interface in the memory controller ascertains if the command is a valid memory reference and, if so, decodes it. The interface then places the command in a queue of commands waiting to be executed.

Figure 4 Datapaths in Memory Controller and Array Modules

Since one array module can execute multiple write commands simultaneously, and since multiple array modules can also execute commands, the memory controller must maintain the status of the array modules. The status control logic to monitor activity must "remember" which portions of which arrays are "busy." This status control logic can best be described by showing how the three basic operations, writes, reads, and masked writes, are executed.
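Before a queued command reaches an array module, it has to be expressed in the smaller command set of the array bus, as described in the architecture section. The sketch below shows one way to picture that conversion. The command names, the tuple format, and the assumption of naturally aligned addresses are all hypothetical, and masked writes and wrapped ordering are left out.

```python
OCTAWORD_BYTES = 16
LONGWORD_BYTES = 4

def to_array_bus(command, address):
    """Translate one NMI command into a list of array-bus commands.

    Returns (operation, address) tuples. The array bus is modeled as
    accepting only longword reads, octaword reads, and longword writes.
    """
    if command == "hexword_read":
        # A hexword is two octawords; they may live on different modules.
        return [("octaword_read", address),
                ("octaword_read", address + OCTAWORD_BYTES)]
    if command == "octaword_write":
        # The array bus writes one longword at a time.
        return [("longword_write", address + LONGWORD_BYTES * i)
                for i in range(4)]
    if command in ("longword_read", "octaword_read", "longword_write"):
        return [(command, address)]  # passes through unchanged
    raise ValueError("command not modeled in this sketch: " + command)

print(to_array_bus("hexword_read", 0x40))
print(to_array_bus("octaword_write", 0x00))
```

Splitting a hexword read into two independent octaword reads is what allows the two halves to be serviced by different array modules in parallel.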
Write Commands
For a write command, the control portion of the memory controller performs only three actions: it determines the capability of the array module to accept the command, it sends the command, and it waits for the array module to signal its readiness to receive another command.

The write datapath is that portion of the logic responsible for the flow of data from the NMI bus to the array modules. This path comprises both electrical interconnects (buses and cables) and a considerable amount of logic. The major storage element for the datapath is a 9-bit by 32-location custom multiport RAM (MPR) with two ports for reads and two for writes. Data received from the NMI bus is placed in the next available location of the MPR. Upon determining that the required array module is available, the control logic sends the data from the MPR to that array module over the array bus. Each array module holds the data until it is strobed into the dynamic RAMs (DRAMs). The array module can load four longwords of data with their associated ECC bits on four consecutive cycles.

Some writes are called masked because there is a 4-bit byte mask associated with each data word. The byte mask informs the memory system as to which bytes are to be written. The memory system executes this command by first doing a read and correcting any single-bit errors that may exist. It then merges the memory data with the data received from the NMI bus, and finally does a write command. This sequence easily allows the implementation of longword and octaword masked writes. Masked writes for quadwords (8 bytes) are executed by performing an octaword masked write in which the data of two of the longwords remains unchanged.
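The merge step of a masked write can be sketched in a few lines. This is an illustration rather than the hardware's logic: the function names are hypothetical, and the convention that mask bit 0 selects the least significant byte is an assumption.

```python
def merge_masked_longword(memory_word, nmi_word, byte_mask):
    """Merge a 32-bit NMI word into a 32-bit memory word under a 4-bit mask.

    ASSUMPTION: mask bit i selects byte i, with byte 0 least significant.
    Bytes whose mask bit is clear keep the value read from memory.
    """
    result = memory_word
    for byte in range(4):
        if byte_mask & (1 << byte):
            lane = 0xFF << (8 * byte)
            result = (result & ~lane) | (nmi_word & lane)
    return result & 0xFFFFFFFF

def quadword_as_octaword_masks(quad_masks, quad_index):
    """Express a quadword masked write as an octaword masked write.

    quad_masks holds the two longword byte masks of the quadword, and
    quad_index (0 or 1) says which half of the octaword it occupies.
    The two untouched longwords get a mask of 0, so they stay unchanged.
    """
    masks = [0, 0, 0, 0]
    masks[2 * quad_index] = quad_masks[0]
    masks[2 * quad_index + 1] = quad_masks[1]
    return masks

print(hex(merge_masked_longword(0x11223344, 0xAABBCCDD, 0b0101)))
print(quadword_as_octaword_masks((0xF, 0x3), 1))
```

A mask of zero on a longword leaves it exactly as read, which is why a quadword masked write can ride on the existing octaword path.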
Read Commands
For read commands, the memory controller performs four actions: it determines if the selected array module is ready to accept the read, it sends the command, it waits for a data-ready response, and it transfers the data from the array module. Embedded in the command field of the read are address bits that select the longword of the octaword that is required first. This action allows wrapped reads to be implemented. (Wrapped reads are described later in the section "Impact of the Cache.")

The read datapath originates at the DRAM, which sends the requested data. As in the case of write commands, each array module stores an octaword of read data. Once the data has been loaded into the latches, the array module signals to the memory controller that the data is ready. As mentioned earlier, the read datapath between the array module and the memory controller is tristatable. Therefore, the memory controller must ensure that only one array module at a time drives this datapath. Once the data has been requested by the memory controller, the array module must send the longwords sequentially, beginning with the starting address that was sent with the command. This action allows the memory controller to request any one of the four longwords as the first to be read. The array module portion of the read datapath can transfer one longword of data during every cycle.

The error-correction logic in the memory controller receives each longword of data plus the seven ECC bits. This logic detects single- and double-bit errors, but only single-bit errors can be corrected. A significant feature of this process is that error detection and correction is performed as the read data is pipelined through the memory controller.
Thus no additional cycles are needed to correct read data.

Masked-write Commands
The execution of a masked write involves both a read and a write sequence. The memory controller executes a masked-write command by first issuing a read to the selected array module. Assuming that there were no memory errors, the data returned is sent to the MPR, where the bytes are merged with those sent to the memory controller over the NMI bus. The memory controller must ensure that no commands to the same array come between the read and write portions of a masked write. After all the bytes have been merged into the data buffer, the memory controller will write the data to the array module. The array module then generates new ECC data, adds it to the other data, and strobes the composite data into the DRAMs.

If a single-bit error is detected, the process is quite similar to the one with no errors, except that the data must be corrected. Since corrected data and NMI traffic both share the same datapath on the memory controller, the NMI interface must be free to correct errors found during masked writes. This freedom is ensured by asserting a signal that stops all activity on the NMI bus. Once activity has stopped, the data can be routed through the NMI interface, corrected, and then merged with the NMI data in the data buffer. The process then continues as it would have if there were no errors.

If a double-bit error is detected, the process is similar to the case in which no error occurred, except that the write is prevented from happening. When the array location is read the second time, the double-bit error will still be present, thus alerting the system that the data is unusable.
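The three outcomes just described (no error, a corrected single-bit error, and an uncorrectable double-bit error) follow from the properties of a single-error-correcting, double-error-detecting code. The paper does not give the check matrix of its 7-bit modified Hamming code, so the sketch below substitutes the textbook extended Hamming construction: six Hamming check bits plus one overall parity bit over 32 data bits. The bit positions and function names are assumptions and should not be read as the 8800's actual code.

```python
PARITY_POSITIONS = (1, 2, 4, 8, 16, 32)
# Positions 1..38 that are not powers of two hold the 32 data bits.
DATA_POSITIONS = [p for p in range(1, 39) if p not in PARITY_POSITIONS]

def encode(data):
    """Return a 39-bit codeword (bit p = position p) for a 32-bit value."""
    word = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (data >> i) & 1:
            word |= 1 << pos
    for p in PARITY_POSITIONS:
        # Check bit p makes parity even over all positions with bit p set.
        parity = 0
        for pos in range(1, 39):
            if pos & p and (word >> pos) & 1:
                parity ^= 1
        if parity:
            word |= 1 << p
    if bin(word).count("1") % 2:
        word |= 1  # position 0: overall parity, total number of ones is even
    return word

def decode(word):
    """Return (data, status); status is 'ok', 'corrected', or 'uncorrectable'."""
    syndrome = 0
    for pos in range(1, 39):
        if (word >> pos) & 1:
            syndrome ^= pos
    overall_odd = bin(word).count("1") % 2 == 1
    status = "ok"
    if overall_odd:
        if syndrome >= 39:
            return None, "uncorrectable"  # points outside the codeword
        word ^= 1 << syndrome             # syndrome 0 means the parity bit itself
        status = "corrected"
    elif syndrome != 0:
        return None, "uncorrectable"      # even parity, nonzero syndrome: two errors
    data = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (word >> pos) & 1:
            data |= 1 << i
    return data, status

codeword = encode(0xDEADBEEF)
print(decode(codeword)[1], decode(codeword ^ (1 << 7))[1], decode(codeword ^ 0b11)[1])
```

In terms of the masked-write flow, an "ok" or "corrected" result lets the merge and write proceed, while "uncorrectable" corresponds to the case in which the write is suppressed so that the bad location is still flagged on its next read.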
Memory Address Path
The memory controller continuously latches all addresses from the NMI bus. Once an address is latched, the memory controller must verify it as a valid memory address. That verification is done by comparing the address to valid addresses of both the control status registers (CSRs) and physical memory. The CSR addresses are hardwired into the NMI interface logic; therefore, only a simple compare of the addresses is required. The compare for a valid memory address requires a reference to a "decode" RAM. This RAM is loaded by console software when the system is powered up and is used to configure memory. Loading the RAM from software allows the memory controller to support several different sizes of array modules without modifying any hardware.

Once the address has been verified as being valid, it is placed in one of eight storage locations allocated to address buffering in the MPR. The address remains in that buffer until its command is sent to an array module.

Even though eight locations are allocated to address buffering, only seven of them can be used for temporary storage. One location is reserved for the error's page address, a pointer to a physical page of memory containing an error. Since the location of the error page-address buffer is not fixed, the control logic for the address-buffer control must look ahead and not allow a new address to overwrite that error page address.
The control of the address buffer is further complicated by masked writes and error logging. Since a masked write is implemented as a read followed by a write, the address in the buffer cannot be overwritten until the write has completed. A similar situation exists for error logging on read transactions. Since an error is not detected until the read has completed, the address cannot be overwritten until the data has been checked.

Design Requirements of the VAX 8800 System

Impact of the NMI Bus
As stated earlier, the VAX 8800 memory system interfaces with the CPUs and I/O systems through a synchronous bus called the NMI bus. This bus is highly efficient and operates in a pended fashion similar to the synchronous backplane interconnect (SBI bus) in the VAX-11/780 processor. The NMI bus allows several transfers to be in progress simultaneously.

There are four nexuses in the 8800 system that can require memory: the two CPUs and the two NBIAs. Each nexus is allowed to have two commands outstanding at any time. The protocol supports this arrangement by allocating two codes in a 4-bit ID field to each nexus. The CPUs use one of their references for program data, called the d-stream, and the other for instructions, called the i-stream. The CPUs always request a hexword of data; the NBIAs may request either longwords or octawords. Thus there can be as many as eight simultaneous requesters of memory data. These simultaneous events require that the memory system buffer several commands while executing.
In the 8800 implementation, the memory system can access three array modules in parallel and store two commands.

Moreover, since the memory system can accept multiple read commands, it must store the identification of the requester and the length of the transaction. The NMI interface does the actual storing and returns the identification with the correct data. This action is possible because all commands are processed in sequence; therefore, the read returned first is the one stored the longest. However, hexword reads are returned to the NMI interface as two separate octaword reads; therefore, that interface must ensure that both octawords have been returned before discarding the identification.

To prevent a deadlock condition, the memory system is given the highest priority during arbitration. This priority guarantees that the memory system will be able to return data to a requester. When full, the memory system notifies any potential requesters that it cannot process any more commands and to try again later, thus preventing the memory system from overfilling.

Impact of the Cache
The design of the cache affected the design of the memory system. The write-through design of the cache guarantees there will be a large number of longword writes directed at memory.[1] A write buffer was installed to bundle a series of longword writes into octaword writes; however, the write buffer is only effective if multiple longwords are written in the same octaword.

Extra logic is always required to increase performance.
The extra write bandwidth for this memory system, however, required more logic than what would have been required to implement extra read bandwidth. The added complexity was needed to facilitate interleaving on longword boundaries for write operations.

When the 8800 project was first initiated, the goal of the memory system was to maximize read bandwidth, thus producing a relatively simple array-module design. In that design, any operation, regardless of its size, kept an entire array module busy until the operation completed. The control logic on the array module was simple and required a reasonable amount of board space and power. When the design changed to the write-through concept, however, higher write bandwidth was required. Therefore, the control logic in each array module had to be replicated for each bank (longword) of memory to allow independent write operations. This replication permitted four longwords to be written on four consecutive cycles to the same array module.

This increase in design complexity was not limited to the array module. In the initial design, when maximum read bandwidth was critical, the memory control logic was relatively simple. It had only to track the state of an array module as being busy or not. However, with the interleaving capability required for the increased write bandwidth, the memory control logic now has to track simultaneously the status of as many as eight write operations in progress on two array modules.
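The step from one busy flag per module to one busy flag per bank can be illustrated with a small model. The class and method names below are hypothetical, and the model covers only write tracking; it is a sketch of the bookkeeping idea, not of the controller's actual logic.

```python
class ArrayStatus:
    """Toy model of per-bank busy tracking for write operations."""

    BANKS = 4  # one bank per longword of an octaword (from the text)

    def __init__(self, num_modules=8):
        # One 4-bit busy mask per array module; bit b set means bank b is busy.
        self.busy = [0] * num_modules

    def can_start_write(self, module, bank):
        return not (self.busy[module] >> bank) & 1

    def start_write(self, module, bank):
        """Mark a bank busy; return False if it is already in use."""
        if not self.can_start_write(module, bank):
            return False
        self.busy[module] |= 1 << bank
        return True

    def finish_write(self, module, bank):
        self.busy[module] &= ~(1 << bank)

    def writes_in_progress(self):
        return sum(bin(mask).count("1") for mask in self.busy)

status = ArrayStatus()
for module in (0, 1):
    for bank in range(ArrayStatus.BANKS):
        status.start_write(module, bank)
print(status.writes_in_progress())
```

Four banks on each of two modules gives the eight simultaneous write operations mentioned above; a single busy bit per module could not represent that state.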
Although maximizing the longword write bandwidth was important, minimizing the read latency to the first longword required was critical. Wrapped reads were implemented to reduce this latency. A wrapped read is a hexword or octaword command that requests a specific longword to be returned first, with other longwords in that block to follow in "wrapped" fashion.

Other Design Trade-offs and Options
As in all design processes, we considered many trade-offs and options before committing to a particular design architecture. One area with several alternatives was the interconnect between the memory controller and the array modules. The array modules and the controller reside in physically separate backplanes interconnected by a cable. We had to decide whether to make this interconnect with ECL or TTL. The overall project goal was to make the 8800 an all-ECL machine. Therefore, our first choice for this interconnect was ECL, which provides enhanced signal integrity, reduced skews, and overall speed advantages over TTL. As the system and memory design progressed, however, some real problems arose that altered our opinion.

The first problem became apparent as the array-module design coalesced enough to allow some accurate power estimates to be made. We found that, with an ECL bus, the array module would require -5.2 V power in excess of its allocation. The next problem surfaced in response to an architectural requirement that the memory system function with fewer than eight array modules and, preferably, without load cards.
This requirement made it difficult to implement a termination scheme for an ECL interconnect.

With these problems in mind, we investigated a TTL interconnect, which clearly offered some design challenges, the least of which were speed and skew. Using the SPICE simulator, we constructed an accurate model to verify that a TTL electrical interconnect could indeed meet our signal integrity, speed, and skew requirements.[2] While the simulation results showed that a TTL interconnect could work, the associated skews certainly increased the complexity of the memory design. While alleviating the problems of limited -5.2 V power on the array module and the termination of varied loading, this TTL scheme required ECL-to-TTL translators in the memory controller to drive the array bus. We finally decided to accept the added complexity and use TTL for the interconnect. The sole exception was the clocks, which were differential ECL, received and translated on the array module.

There were logical trade-offs as well as electrical ones. The original specification for the NMI did not support quadword masked writes. They were added after the implementation of the memory system had progressed considerably. Since the array bus supported only longword and octaword reads, there were three options to support this change:
• The first was to change the array bus protocol, the command generator on the memory controller, and the array module.
• The second was to execute the command by performing two longword masked writes. This option would take almost twice as long as a quadword masked write if implemented like the first option, yet still require changes to the command generator in the memory controller.
• The third was to execute an octaword masked write in which the data of two of the longwords remains unchanged.

Since the design was well advanced, we chose the last method to ease the problems of implementation; this decision actually has little impact on system performance. The logic to accomplish this addition already existed on the array module. Only small changes were required to the command generator of the memory controller and the datapath control. In practice, the frequency of quadword masked writes is extremely low since they are executed only by the NBIAs.

Technology Description
A number of different module and component technologies were used for the memory controller, backplane, and two array modules.

Memory Controller
The memory controller is a 9-layer, controlled-impedance, extended hex module (15 inches by 11 inches). The lay-up consists of 6 routing layers, 2 power layers (-5.2 V and -2 V), and a ground plane. Since there is a minimal amount of TTL, both the +5 V power and the +5 V battery are run on the surface with 50-mil etch. With the mixed technology on the module, we took special care to keep the TTL signals properly spaced from the ECL signals to avoid signal integrity problems.

The logic on this module is implemented using nine unique macrocell-array designs from Motorola, Inc., and one custom ECL multiported RAM. There are 16 custom and semicustom devices on the module.
It also contains some 10KH MSI logic, some ECL-to-TTL converters, and some CMOS logic used for operating with battery backup.

Array Module Backplane

The array module backplane in the VAX 8800 and 8700 CPUs is a 12-layer, 8-slot pressed-pin backplane. The one in the VAX 8550 and 8500 CPUs is a 5-slot backplane. Since a TTL bus was chosen to communicate between the memory controller and the array modules, a good termination strategy had to be developed. Using the SPICE simulator, we evolved the termination strategies shown in Figure 5.

[Figure 5: Termination Strategies in Memory Controller and Array Modules. Labeled elements: memory controller with 8480 ECL-to-TTL and 8481 TTL-to-ECL converters, F374 registers on the array modules, NAB read data bus, NAB command/address-write data bus. Legend: DO = data out, DI = data in, CS = chip select, HLD = hold (clock), EN = enable, CLK = clock.]

Four Megabyte Array Module

The 4MB array module was designed using an 8-layer, controlled-impedance, printed circuit board. The lay-up consists of 4 routing layers, 2 power layers, and 2 ground layers. To support battery backup, the module has separate power planes for +5 V power and the +5 V battery. Since only a limited amount of -5.2 V and -2 V power is needed, these voltages share space on the other power planes. To eliminate discontinuities that could cause unwanted reflections, we ensured that signals did not cross the power-plane splits by surrounding the power planes with solid ground planes.

Approximately half of the logic technology on the array module consists of MOS dynamic RAMs; the other half is FAST MSI logic. The clock system is implemented in ECL to minimize the skew.

Sixteen Megabyte Array Module

A 16MB array module was developed to increase the available memory to 128MB for the 8800 and 8700 systems and 80MB for the 8550 and 8500 systems. This array module consists of an 8-layer mother board (similar to the 4MB module) and eight 2MB surface-mounted daughter boards. The 16MB array module is pictured in Figure 6.

[Figure 6: Sixteen Megabyte Array Module]

Summary

The VAX 8800 memory system was designed to provide 71 MB per second of read bandwidth and 59 MB per second of write bandwidth to the multiprocessor system. The system architecture, processor performance needs, and high I/O activity combined to make a high-performance memory a requirement. Since the 8800 contains ECL components, the memory system has to provide a high-speed path between the ECL logic in the CPUs and the high-density dynamic RAMs used for main storage.

Although the memory system does not play a direct role in the execution of a VAX instruction, its performance has to match closely that of the multiprocessor system. If the memory system were underdesigned, the processors would stall frequently, thus reducing their usable performance. If the memory system were overdesigned, it would contain extra complexity, with the attendant extra cost, that could not be used by the system. Thus the memory strategy played an important role in the price/performance trade-offs that had to be made.

Acknowledgments

Although done by a small group of engineers, the design of the memory system was greatly influenced by the efforts of many people from the Electronic Storage Development Group and the Advanced VAX Engineering Group. We would especially like to acknowledge the creativity, leadership, and energy level of the late John Henry, Jr.

References

1. J. Fu, J. Keller, and K. Haduch, "Aspects of the VAX 8800 C Box Design," Digital Technical Journal (February 1987, this issue): 41-51.

2. SPICE was developed by Lawrence Nagel and Ellis Cohen of the Department of Electrical Engineering and Computer Science, University of California, Berkeley.

John H.P. Zurawski
Kathleen L. Pratt
Tracey L. Jones

Floating Point in the VAX 8800 Family

The processors in the VAX 8800 family were designed with particular emphasis on cost-effectiveness. These CPUs do not contain separate floating point accelerators. Their performance is not compromised, however, especially for the double-precision instructions. High performance is achieved, in part, by a custom ECL multiplier and divider unit and by specific hardware for exponent manipulation and normalization. The main advantages of this integrated approach are less hardware to replicate and a tightly coupled interface to each CPU, thus less time is wasted fetching the operands. Microcode branch problems are minimized by using a prediction strategy and extensive hardware assistance.

Unlike other VAX families, the processors in the VAX 8800 family do not contain separate floating point accelerators (FPAs).
Instead, their FPA is integrated into each processor's main datapath. Therefore, no distinction is made between instructions that are executed in the FPA and those that are not: the hardware is available to be used for all functions. For example, the extended arithmetic logic unit (XALU) is also used as a counter for the move character instruction (MOVC). This usage differs from that in the VAX 8600 and VAX-11/780 systems, where the XALU is used only for floating point instructions. Furthermore, all the floating point instructions, from the most complicated (POLY and EMOD) to the simplest (MOVF), have access to the FPA hardware.

There are a number of advantages to this approach. First, logic is not duplicated; only one arithmetic logic unit (ALU) and one shifter unit are shared between the floating point and the normal arithmetic. Second, the design is tightly integrated with the rest of the computer; there is no overhead involved in starting the floating point computation.

Clearly, since all other VAX families use FPAs, there are also disadvantages with our approach. Shared logic is more complex than specialized logic. Performance may also suffer since the design cannot be optimized toward one class of problem. Those disadvantages can be overcome, however, as we shall relate in this paper. The problem of optimization was ameliorated by providing dedicated hardware for the main operations of multiplication and addition. A custom multiplier and divider chip is provided together with exponent manipulation logic and a shifter unit optimized for floating point. These logic elements handle those floating point operations that take the longest times to execute.
The floating point logic resides in the execution unit, the E Box, of the VAX 8800 CPU. That logic is controlled by microcode in the instruction unit, the I Box.[1]

VAX Formats and Instructions

The VAX architecture supports four floating point formats: F, D, G, and H. These formats are discussed at length in references 2 and 3. The F format is 32 bits wide, the D and G formats are both 64 bits wide, and the H format is 128 bits wide. Although the D and G formats have the same width, the exponent field is larger in the G format, and its fractional field is commensurately smaller. This format allows a larger range but with slightly lower precision. The fractions are always normalized and the leading bit (the hidden bit) is not stored.

E Box Operation

Physically, floating point operations are performed on three modules: two slice modules and a shifter module. The slice modules contain the cache, the main ALU, and a register file. The shifter module contains the custom multiplier, the shifter unit, the exponent manipulation logic (the two ALUs), and the priority encoder. Figure 1 shows this partitioning. To a large extent, the shifter module strongly resembles an FPA but without the ALU and register file.

The source operands are fetched from either the 64 kilobyte (KB) cache or a general-purpose register (GPR). The operands are sent on the A and B ports to the ALU on the slice modules and to the shifter module. All the components on the shifter module are driven in parallel by the A and B ports. From Figure 1 it is clear that the datapath is highly parallel; the shifter, XALU, multiplier, and ALU can all operate simultaneously.
This parallelism is used extensively to gain performance and to save cost. For example, in multiplication operations, the XALU determines the exponent of the result, the multiplier multiplies, and the shifter absorbs the low-order bytes of the product that are discarded each cycle by the multiplier.

[Figure 1: Block Diagram of the E Box. Labeled elements: slice modules (cache data, register file), shifter module, A port, B port, bypass bus <31:0>, shift count bus <5:0>.]

The main problem with designing an integrated FPA is that the VAX formats for integer and floating point numbers must all be handled by the same shared units. Figure 2 shows the different bit orderings for two VAX formats, the F floating point and the integer. In the integer format, the bit ordering is from right to left. In the F format, the mantissa begins at bit 16 and increases in significance to bit 31, then continues from bits 0 through 6. The remaining bit positions are used to hold the exponent and the sign.

[Figure 2: Two VAX Formats. F format: bits 31-16 hold the least significant part of the mantissa, bit 15 the sign (S), bits 14-7 the exponent, bits 6-0 the most significant part of the mantissa. Integer format: bit 31 is the most significant bit, bit 0 the least significant.]

This requirement for shared handling complicates the carry path of the ALU. The carries out of the 16-bit word boundaries have to be switched into the appropriate places, as shown in Figure 3. The problem with shifting is similar to the carry problem, except that now the carry path of Figure 3 represents the flow of the shifted bits.

The ALU and the shifter unit are both designed to handle all integer and floating point formats.
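As an illustration of the F format bit ordering described above, the following Python sketch unpacks a 32-bit F format longword into an ordinary number. The field positions are those given in the text. The function name, the excess-128 exponent bias, the normalization of the fraction to the range 0.5 to 1, and the handling of a zero exponent are assumptions taken from the general VAX architecture definition rather than from this paper.

```python
def decode_f_format(longword):
    """Unpack a 32-bit VAX F format longword into a Python float.

    Illustrative sketch only; bias and zero handling are assumptions
    from the general VAX architecture, not from the paper.
    """
    sign = (longword >> 15) & 0x1          # bit 15
    exponent = (longword >> 7) & 0xFF      # bits 14:7
    frac_high = longword & 0x7F            # bits 6:0, most significant part
    frac_low = (longword >> 16) & 0xFFFF   # bits 31:16, least significant part

    if exponent == 0:
        if sign:
            raise ValueError("reserved operand")
        return 0.0

    # Put the 23 stored fraction bits back in order of significance and
    # restore the hidden bit, which is always 1 for a normalized value.
    mantissa = (1 << 23) | (frac_high << 16) | frac_low

    # The 24-bit mantissa represents a fraction in the range [0.5, 1).
    value = (mantissa / (1 << 24)) * 2.0 ** (exponent - 128)
    return -value if sign else value
```

Under these assumptions, decode_f_format(0x00004080) returns 1.0. Note how the two halves of the mantissa have to be exchanged before ordinary integer arithmetic can be applied, which is the same reordering that the shared carry and shift paths must perform in hardware.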
The multiplier expects operands to come only in a floating point format. Therefore, for integer multiplications, the data must first be converted into a pseudo-floating point format by swapping the places of 16-bit words within the integer format. This operation is performed by the shifter unit.

[Figure 3: Floating Point Carry for D Format]

Table 1 gives the execution times for the most common floating point instructions. These times include the overhead for fetching the operands. The VAX 8800 processor is designed so that there is little, if any, difference in performance between register and memory operands. For the F and D formats, performance ranges from 2.25 to over 5 times that of the VAX-11/780 CPU with an FPA. For multiplies, one 8800 CPU is 2.5 times faster in F format and 4.8 times faster in D format; divides are 3.0 times faster. The gain is even more substantial for the G and H formats since they are not accelerated on the 11/780.

Table 1  Execution Times

Register-to-Register Execution Time (Nanoseconds)

Instruction      F       D       G       H
ADD            315     495     540    3314
MUL            450     675     842    6306
DIV           1607    3197    3107   21649

In the 8800 the D format is slightly faster than the G format with its longer opcode, which requires an extra cycle in the decoder. The single-precision F format executes the fastest, and the larger 128-bit H format executes the slowest.
However, the H format is intended as a backup for intermediate calculations in the D and G formats. Used thus, the H format ensures that the final calculation result has sufficient precision and avoids overflow or underflow problems. Little hardware assistance is provided for the H format; it is driven mostly by microcode.

Technology

Component technology used in the VAX 8800 processor is an enhanced version of the macrocell array (MCA) used in the VAX 8600 CPU.[2] This technology provides about 1,200 gate equivalents with a typical gate speed of 1 nanosecond (ns). MCAs utilize emitter-coupled logic (ECL) in a 72-pin package that is 1 square inch with a maximum power dissipation of 5.5 watts. The GPR and the multiplier are made with custom technology, which uses the same package as the MCA but contains a more advanced process. Around 1,800 gate equivalents are provided, and the gate speed is 50 percent faster than the MCA. This higher performance is achieved by using the following features:

• Smaller transistors and metal-oxide-walled resistors
• Current mode logic instead of the slower ECL
• Four-level logic instead of the two-level logic of the MCA

At 300 by 260 mils, the size of the custom chip is larger than the dimensions of 221 by 252 mils for the MCA.

The shifter module contains 12 MCAs and 8 custom multiplier parts. Some 10KH parts are used for clock distribution and for driving the bidirectional bypass bus.
Arithmetic Algorithm Processing

Addition and Subtraction

For an addition operation, the 32-bit words containing the exponents are sent to the main ALU. There they are passed to the A and B ports, which feed the shifter module. These ports drive all the gate arrays in parallel.

The exponents are then loaded into the XALU and the shift-amount ALU (SALU), which computes the alignment shift amount sent to the shifter. The SALU also generates some 20 branch conditions for the microcode. These conditions indicate the size of the alignment shift and whether any source operand is zero or a reserved operand. They also help to optimize the microcode flow.

The XALU, which selects the larger exponent and saves it for later use, has a 12-bit datapath and a register to hold the exponent. The size of this datapath is sufficient for the F, D, and G formats plus a guard bit for overflow or underflow detection. An ALU is provided to perform arithmetic operations on the exponent. The SALU, with an 11-bit datapath, subtracts the exponents to determine the alignment shift amount, which is always positive. The sign manipulation logic also resides in the SALU.

Next, the fractional part of the smaller operand is aligned by the shifter. This operation involves either one CPU cycle for F format operands or two CPU cycles for the D and G formats. The shifter unit shifts in the floating point format and can do a full 64-bit shift. The logic that determines the round bits is related to the alignment shift operation but is physically located in the priority encoder gate array. This gate array also contains some of the shifter functionality.
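The exponent comparison and alignment steps described above can be sketched in a few lines of Python. This is a behavioral illustration only: the function name and argument layout are hypothetical, the fractions are modeled as plain unsigned integers with the hidden bit already inserted, and the round-bit logic is left out.

```python
def align_for_add(exp_a, frac_a, exp_b, frac_b):
    """Return (result_exponent, larger_fraction, aligned_smaller_fraction).

    Hypothetical helper modeling the roles the text assigns to the
    XALU, the SALU, and the shifter; not the hardware's actual logic.
    """
    # Order the operands so that the shift amount is never negative.
    if exp_a < exp_b:
        exp_a, frac_a, exp_b, frac_b = exp_b, frac_b, exp_a, frac_a

    shift_amount = exp_a - exp_b       # SALU role: exponent difference
    aligned = frac_b >> shift_amount   # shifter role: align the smaller operand
    return exp_a, frac_a, aligned      # XALU role: keep the larger exponent
```

After this step the two fractions share one exponent, so the main ALU can add or subtract them directly.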
Nine gate arrays are used for the shifter unit. Of those, eight make up the datapath; the ninth is the control device. The shifter can accept either a 64-bit operand on the A and B ports or a 32-bit operand on either port. The shifter generates a 32-bit result that can be either the high-order or the low-order part of the answer. The shifter datapath gate arrays are identical: each effectively constitutes a byte slice of the design and performs a bit shift of up to seven places. Byte shifting is then performed by sending the correct shifter output to the correct byte position. This operation is facilitated by having all the outputs wired to the OR gates at all possible byte positions and by enabling the correct output.

The shifter performs floating point, integer, and logical shifts, as well as a number of miscellaneous functions. These include conversions from decimal-format data into integer format and vice versa. The masking of the exponent field and the insertion of the hidden bit are also done by the shifter.

After the alignment shift, the output of the shifter is directed to the main ALU on the bypass bus. There, the output is added to or subtracted from the fraction of the larger operand. The output of the ALU operation is now ready to be normalized in the shifter. In most cases a small normalize shift of at most one bit position left or right will be sufficient. The specialized hardware in the shifter handles this case and then rounds the result. Should a larger shift be required, then microcode will first direct the ALU result to the priority encoder gate array. There, the position of the leading 1 is found and used to determine the normalize amount for the subsequent cycle.

The rounding operation in the VAX 8800 CPU is unusual in that it is limited to the low-order eight bits. Therefore, a small 8-bit adder can be used for this operation. This adder is both faster and cheaper than the usual method of using a full 64-bit adder. The 8-bit adder is also sufficient to calculate the correct answer in over 99.5 percent of the addition operations. Should a carry-out be generated by this 8-bit rounding add, then clearly the result created is incorrect. In that case the computer is trapped and microcode invoked to correct the result.

Multiplication

As mentioned earlier, the 8800 contains a high-performance, custom-designed multiplier and divider unit. A number of factors impelled us to use such a unit. First, multiplication is a very frequent operation that is used extensively in matrix manipulation. For example, in the LINPACK benchmark, the time-critical routine contains an even mix of addition and multiplication operations.

Second, it was not possible to succumb to the temptation of using the main ALU to provide the division operation. This desire was natural since division is an infrequent operation, and the use of an ALU in a repeated subtract and shift mode was appealing. For example, the VAX 8600 uses the ALU for just that purpose. In the 8800 the main ALU also computes the virtual address. Since this datapath is very time-critical (in the 8800 as well as in most other computer designs), it cannot be allowed to go any slower. Including an extra path to accommodate division would have slowed down this critical path by around 5 ns, resulting in a 10 percent performance degradation for all operations.

Moreover, the available space for the multiplier and divider unit was limited since floating point operations are integrated with the rest of the machine. Approximately one-third of a module (12 inches by 16 inches) was available. In contrast, the VAX 8650 CPU dedicates a full module to multiplication.

The custom design of the multiplier and divider unit is basically a byte slice of a large word-sized multiplier and divider unit. The multiplier handles 8 bits per cycle; the divider handles 1 bit. Figure 4 shows the complete 56-bit by 8-bit multiplier with its eight byte-slice custom chips. Eight chips are used to form the required word size of 64 bits (56 data bits plus 8 guard bits). This arrangement is sufficient to handle F, D, and G format operations. H format operations are performed by partitioning the problem into many smaller 56-bit multiplications under microcode control.

[Figure 4: Multiplier and Divider Unit. Labeled elements: multiplicand input, multiplier input, Booth recode, 8-bit shift, PRGB, 64-bit path, multiplier output.]

The multiplicand is loaded into the MD latch after passing through the mask logic, which clears the sign and the exponent field and inserts the hidden bit. The PR latch and the PRGB are cleared at the start of the multiply. The PRGB contains the guard bits for the PR latch. At the end of a multiply, this latch will hold the bits required for a possible normalization shift and also for a rounding operation. The least significant eight bits of the multiplier are loaded into the multiplier latch.

The first multiply cycle is now ready to be performed. A 56-bit by 8-bit multiplication is performed between the contents of the MD and multiplier latches. The result is then added to the contents of the PR latch (which is initially zero) and then written back into it with a right shift of 8 bits. The PR latch is thus an accumulating latch and contains the 64-bit partial product of each multiplication operation. The next 8 bits of the multiplier are loaded into the multiplier latch, ready for the next cycle. This cycling continues until the multiplicand has been multiplied by all the multiplier bytes.
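The multiply loop just described, one multiplier byte per cycle with an accumulate and an 8-bit right shift, can be modeled in Python as follows. This is a sketch of the arithmetic only: Booth recoding, the carry-save adders, and the guard-bit latch are not modeled, and the function name, the default of seven multiplier bytes (56 bits), and the returned tuple are assumptions made for illustration.

```python
def byte_serial_multiply(multiplicand, multiplier, num_bytes=7):
    """Multiply a 56-bit multiplicand by a multiplier, 8 bits per cycle.

    Returns (pr, absorbed): the final contents of the accumulating PR
    latch and the low-order bytes shifted out of it along the way.
    Hypothetical model of the arithmetic, not of the chip's logic.
    """
    assert 0 <= multiplicand < (1 << 56)
    assert 0 <= multiplier < (1 << (8 * num_bytes))

    pr = 0        # PR latch, cleared at the start of the multiply
    absorbed = 0  # bytes discarded each cycle (absorbed by the shifter)

    for cycle in range(num_bytes):
        # Load the next 8 bits of the multiplier, least significant first.
        multiplier_byte = (multiplier >> (8 * cycle)) & 0xFF

        # 56-bit by 8-bit product added to the current partial product.
        total = pr + multiplicand * multiplier_byte
        assert total < (1 << 64)  # the sum always fits a 64-bit adder

        # Write back with a right shift of 8 bits; keep the byte shifted out.
        absorbed |= (total & 0xFF) << (8 * cycle)
        pr = total >> 8

    return pr, absorbed
```

With the default seven cycles, the full product is recovered as (pr << 56) | absorbed. The assertion inside the loop reflects why a 64-bit word is enough for a 56-bit by 8-bit step: the partial product never exceeds the multiplicand, so the sum stays below 2 to the power 64.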
This algorithm is similar to the one used in the VAX 8650 scheme, except that that processor has a narrower datapath of 32 bits. Notice that the least significant byte of the partial product is discarded after each cycle and absorbed by the shifter unit. These bytes are required only for the H format multiply.

Once completed, the result is sent out through the result latch, then normalized and rounded. The rounding carry is only propagated into the least significant byte of the result. This procedure uses less logic since only an 8-bit instead of a 64-bit incrementer is required. The 8-bit incrementer will be sufficient for most multiplies. Should a greater increment be required, then the multiplier will trap the rest of the machine, and the correction will be performed by the main ALU. This scheme is similar to the one used for addition.

The provision of a 64-bit adder inside the main multiply path is unusual in a high-performance machine. High-speed multiplier designs typically use carry-save adders, which do not propagate the carry signals but save them so they can be absorbed by the subsequent cycle. This form of adder is indeed used in the custom multiplier to perform the 56-bit by 8-bit multiply function illustrated in Figure 4. However, the 8800 also uses a full 64-bit adder for the following reasons:

• A 64-bit adder has to be provided somewhere to propagate the carries from the carry-save adders.
• With the 45-ns cycle time, the 64-bit adder fits in the main datapath. A faster clock for the multiplier would have complicated the clock distribution and been difficult to generate with low skew.
• A full adder in the datapath allows the use of a simple nonrestoring division algorithm.

The multiplier and divider chip contains a 12-bit by 8-bit multiplier, two 8-bit adders, six latches with a total size of 72 bits, as well as the rounding, normalizing, and control logic. A comparable MCA design would require between three and four of these elements.

Alternative Designs for the Multiplier

An MCA design was certainly possible and could have been made to fit in the specified space. The performance of such a design, however, would not be as good as the custom design for multiplication but comparable for division. An MCA design would be 1.7 times better than an 11/780 with an FPA for a multiply in F format, whereas the custom logic chosen is 2.5 times better. The performance would be 2.5 times better for the D format, whereas the custom design is 4.8 times better.

Another alternative was to use a commercially available multiplier. That was tempting because such a product has the advantage of being readily available and tested. Using it would have circumvented the high risk of a custom design. However, there are a number of disadvantages to using general-purpose multipliers:

• Most available products cannot handle division. Thus a separate divider would have been required, which was expensive. Even division algorithms using multiplication require a large amount of ROM to contain the approximation constants.
• Many of the available designs are intended for integer applications, such as FFT butterflies and digital signal processors. Hence, the designs are optimized for those applications. Extending these 8- or 16-bit multipliers to a larger word length, as required for the VAX architecture, was neither straightforward nor cost effective. Moreover, the normalization and rounding of results entails either extra logic or additional cycles if the floating point hardware in the E Box is used. Extra logic is required to mask out the sign and exponent of the data and to insert the hidden bit. The output of the multiplier would have to be masked.
• Most designs have a clock system not consistent with the rest of the machine. This fact introduces the complication of a special clock distribution and difficulties in verifying the design.
• Very few designs are based on ECL technology. Other technologies, such as TTL, would require a different power rail and thus an extra power supply.

The closest available multiplier to the 8800 requirements is the 10901 made by Motorola, Inc. This MCA implementation contains an 8-bit by 8-bit multiplier together with a 16-bit adder. However, no latches are included; they must therefore be provided externally, thus increasing the cost substantially. On the other hand, division could be provided by repeatedly using the 16-bit adder of the 10901.

Division

The multiplier performs a nonrestoring division algorithm, 1 bit per cycle, for the F, D, and G formats. The divider can accept a new dividend bit during every cycle, thus permitting a 128-bit by 56-bit divide. A divide of this size is used in the H format algorithm to form the starting approximation.
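One way to picture a nonrestoring divide that produces 1 quotient bit per cycle and takes in a new dividend bit every cycle is the following Python sketch. It is the textbook formulation on unsigned integers, not a description of the chip's logic; the function name, its arguments, and the final remainder correction step are assumptions made for illustration.

```python
def nonrestoring_divide(dividend, divisor, bits):
    """Divide two unsigned integers, one quotient bit per loop iteration.

    `bits` is the number of dividend bits to process. Textbook sketch;
    the hardware's latches and recoding are not modeled.
    """
    assert divisor > 0
    assert 0 <= dividend < (1 << bits)

    remainder = 0  # signed partial remainder
    quotient = 0

    for position in reversed(range(bits)):
        next_bit = (dividend >> position) & 1  # new dividend bit this cycle

        # The sign of the previous result selects -1 or +1 times the divisor.
        if remainder >= 0:
            remainder = 2 * remainder + next_bit - divisor
        else:
            remainder = 2 * remainder + next_bit + divisor

        # The quotient bit is 1 when the new partial remainder is not negative.
        quotient = (quotient << 1) | (1 if remainder >= 0 else 0)

    # A negative partial remainder is never restored inside the loop,
    # so a single correction at the end yields the true remainder.
    if remainder < 0:
        remainder += divisor

    return quotient, remainder
```

Because each step only adds or subtracts the divisor once, the loop body needs a single full-width add per cycle, which is consistent with the earlier remark that a full adder in the datapath makes a simple nonrestoring algorithm possible.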
The Booth recode of the multiplier is modified slightly to accommodate the division decode. In the case of multiplication, the multiplier recode selects the correct multiples of the multiplicand to add to the partial product during each multiplication operation. In the case of division, the divisor is loaded into the MD latch, and the Booth recode selects either +1 or -1 times the divisor for each division step.

In the nonrestoring division algorithm, the sign bit of the previous result selects the correct divisor multiple for the next cycle. This selection is facilitated by feeding the sign signal into the modified Booth recode so that it will select the multiples of either +1 or -1 times the divisor.

The quotient bit generated every cycle is sent to the shifter unit to be absorbed. The first quotient bit generated corresponds to the most significant bit of the answer. The accumulated quotient is then normalized and rounded by the shifter.

Microcode Design

Being integrated into the logic in the main machine, the floating point logic is also controlled by the main microcode. The VAX 8800 CPU is an extensively pipelined design.[5] Although pipelining is a well-known technique for improving performance (for example, the VAX 8600 CPU), it comes at a price: the microcode branch latency will increase. By that we mean that the microcode cannot branch on a condition or flag in the very next instruction; instead, it must wait a number of cycles.
This delay is a consequence of the overlapping of the microinstructions; each successive microinstruction starts before its predecessor has completed.

Figure 5 shows a typical pipeline similar to that used in the VAX 8800 system. The microinstruction is subdivided into five components:

• In NEXT ADDRESS, the address for the next microinstruction is computed, as well as those for any selected branch conditions.
• In LOOK-UP, the microcode RAM is accessed to fetch the microinstruction specified by the current NEXT ADDRESS.
• In READ, the register file is read to fetch the specified operands (e.g., fetch R0 and R1).
• In ALU, the operation in the arithmetic logic unit is performed (e.g., R0 + R1).
• In WRITE, the result of the ALU operation is written back to the register file.

Thus when the next-address cycle has completed for the first microinstruction, A, the next-address cycle for the microinstruction, B, in the subsequent cycle is started. This cycle now overlaps with the look-up cycle for A. As many as five operations can proceed simultaneously in this manner.

[Figure 5: Five-stage Pipeline. Instructions A through E each pass through NA (next address), LU (microcode instruction lookup), READ, ALU, and WRITE, offset by one cycle. A condition code (e.g., carry out) is set in the ALU stage of instruction A; E is the earliest instruction that can branch on the condition code of instruction A.]

The branch latency of this pipeline is governed by the first microinstruction that can "see" a branch condition set in an earlier cycle. For example, if the ALU cycle of A sets a carry condition, then the first instruction that can possibly use this signal in its next-address cycle is E. Thus the branch latency is three microinstructions, as shown in Figure 5.

Naturally, this branch latency influenced the way in which we designed the logic to perform floating point operations. Clearly, we had to avoid branching whenever possible as this would result in an excessively slow algorithm. Instead, we had to adopt a strategy based on prediction and provide extensive hardware assistance.

Prediction is based on the fact that the speed of algorithms for floating point adds is usually data dependent. For example, for certain data values, the result of a floating point add will require considerable normalization. That requirement is always present when two values of similar magnitude are subtracted and large cancellation occurs. In other cases little or no normalization is required. It is clearly preferable not to pay the penalty of unnecessary normalizations.

The approach we took in the 8800 is to proceed down the most likely path, assuming that a small normalization will be required while waiting for the result of the branch signals. The add and subtract algorithms in particular are structured that way. The SALU examines the exponents of the operands and other signals; then it sets approximately 20 branch conditions in the first two cycles of the add/subtract datapath.

In certain situations all paths may be equally probable. In these cases the microcode enables hardware signals to control the datapath. A good example of this processing is the selection of operands.
For a floating point add, it is natural to think in terms of the larger and the smaller operands. For example, the smaller operand is the one that is always aligned. However, the microcode does not know which register location holds the smaller value, and it does not want to wait for the whole branch-latency period to find out. Therefore, the microcode will assume that the larger operand is in a particular register. Should this assumption be incorrect, then the SALU will swap the register file read addresses (thus sorting the operands). Not all locations have their addresses modified in this manner since the microcode still needs to be able to read and write to specific locations.

Similarly, the SALU determines if the main ALU is to do an add or subtract operation. At this point in the computation the microcode is unaware of which operation will be required. The pipeline is still within the long branch latency of the 8800 and cannot branch until this latency delay has elapsed. Note that one of the most frequently performed instructions is ADDF. That instruction will have just completed by the time the microcode can finally branch. Therefore, the ADDF cannot execute any faster since it is limited by the branch-latency delay. Consequently, those instructions that are the most probable cases are completely hardware driven.

To allow fast paths in the add algorithms, it is necessary to know that the result cannot possibly overflow since overflowed results must never be written. To prevent overflow the SALU examines the exponents of the operands.
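The exponent-driven decisions attributed to the SALU can be pictured in software. The following is a minimal sketch, assuming an F_floating-like format (biased exponents 1 through 255, a 24-bit fraction including the hidden bit) and nonzero operands. The Operand and AddPlan types, the register names, and plan_add are hypothetical; the real hardware sets branch conditions and swaps register file read addresses rather than returning a record.

```python
from dataclasses import dataclass

# Assumed format parameters (labelled assumptions, not taken from the article).
EXP_MIN, EXP_MAX = 1, 255   # usable biased exponents; 0 is treated as reserved
FRACTION_BITS = 24          # fraction width including the hidden bit


@dataclass
class Operand:
    reg: str        # hypothetical register name
    sign: int       # 0 = positive, 1 = negative
    exponent: int   # biased exponent of a nonzero operand


@dataclass
class AddPlan:
    larger: str          # register treated as the larger operand
    smaller: str         # register whose fraction gets aligned
    align_shift: int     # right shift applied to the smaller fraction
    alu_op: str          # "add" or "sub" applied to the magnitudes
    may_overflow: bool   # result exponent could exceed EXP_MAX
    may_underflow: bool  # result exponent could fall below EXP_MIN


def plan_add(a: Operand, b: Operand, subtract: bool = False) -> AddPlan:
    # Sort by exponent; swapping here stands in for swapping read addresses.
    large, small = (a, b) if a.exponent >= b.exponent else (b, a)
    # The magnitudes are subtracted when the effective signs differ.
    effective_sub = (a.sign ^ b.sign ^ int(subtract)) == 1
    # Worst cases: a carry out (or a rounding carry) raises the exponent by
    # one on a magnitude add; cancellation on a magnitude subtract can lower
    # it by as much as FRACTION_BITS.
    may_overflow = (not effective_sub) and large.exponent + 1 > EXP_MAX
    may_underflow = effective_sub and large.exponent - FRACTION_BITS < EXP_MIN
    return AddPlan(large.reg, small.reg, large.exponent - small.exponent,
                   "sub" if effective_sub else "add",
                   may_overflow, may_underflow)


if __name__ == "__main__":
    print(plan_add(Operand("R0", 0, 130), Operand("R1", 0, 133)))
```

The overflow and underflow flags are deliberately conservative: they say only that the result exponent could leave the representable range, which is the kind of information a fast path needs before committing to write a result.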
The SALU then determines if the exponent of the result could possibly overflow or underflow, taking into account any possible normalization shift. There is also the added complexity of a rounding operation provoking an extra normalization step. That would happen when the rounding increment caused a carry to propagate throughout the whole fraction.

Consequently, the use of a small 8-bit incrementer for the round operation is possible only if it is known that an overflow cannot happen. The reason for this is that halting (trapping) the machine is not instantaneous (for the same reason that branch latency exists); therefore, the result will always be written. Thus, although the microcode can eventually correct the result, it cannot prevent that result from being written.

Performance Issues

When a program with many floating point instructions, such as LINPACK, is run, its performance is not totally dictated by the raw floating point speed of the CPU. Having a more profound effect are other factors, such as

• The size and organization of the cache: This factor is particularly important for programs with large amounts of data because the operands will reside in memory. Having superior register-to-register performance will not help in this type of program. Clearly, the larger the cache, the greater the chance that the required data will be quickly available, thus avoiding a lengthy transaction with memory.
• The performance of the integer and control instructions: Even programs performing extensive floating point operations still have significant amounts of integer and control instructions. Doing these quickly can contribute substantially to the program's performance.
To illustrate the effect of these factors, compare the performance of the VAX 8800 system with that of the VAX 8650 when both run LINPACK, as shown in Table 2 [3]. The 8650 has faster raw floating point speed, especially for the F format (over twice as fast). Yet the two systems run this benchmark with almost the same performance. Clearly, in programs with these characteristics, factors other than raw speed will have a greater influence on performance. Of course, in applications without them, the raw speed advantage of the 8650 will be more pronounced.

Table 2  LINPACK Performance

              Performance (MFLOPS)
Computer      F Format    D Format
VAX 8800      1.35        0.99
VAX 8650      1.30        0.70

Summary

The architecture of a processor like the VAX 8800 CPU is all a matter of trade-offs. Where does the performance make a difference? For example, we could have supplied the 8800 with a separate floating point unit to achieve faster performance. Doing that, however, would have required at least one extra module. To keep the cost of the system constant, this extra module would have entailed removing a module of logic from some other part of the computer. Perhaps removing that module would have resulted in a smaller cache or a simpler decoder with no optimizations for the frequent instructions. In any case the net effect would have been to sacrifice the performance of the computer in some other area. All things considered, we feel that the design is well balanced for the multitude of different computing tasks that customers will perform with the VAX 8800 system.

Acknowledgments

The authors would like to thank Ron Melanson and his team for the circuit design of the custom multiplier. In addition, we would like to thank Dave Sager for his help and guidance.

References

1. R. Burley, "An Overview of the Four Systems in the VAX 8800 Family," Digital Technical Journal (February 1987, this issue): 10-19.
2. T. Fossum, W. Grundmann, and V. Blaha, "The F Box, Floating Point in the VAX 8600 System," Digital Technical Journal (August 1985): 43-53.
3. J. Dongarra, "Performance of Various Computers Using Standard Linear Equations Software in a FORTRAN Environment," Argonne National Laboratory (May 1986).
4. VAX Architecture Manual (Maynard: Digital Equipment Corporation, Order No. EB-19580, 1981).
5. S. Mishra, "The VAX 8800 Microarchitecture," Digital Technical Journal (February 1987, this issue): 20-33.

James P. Janetos

The VAX 8800 Input/Output System

The VAXBI bus links the processors in the VAX 8800 family to I/O devices, including clusters and networks. The VAX 8800 multiprocessor can support four of these 32-bit synchronous buses, each of which connects up to 16 I/O devices. Each VAXBI bus connects to the memory interconnect, the NMI bus, by an NBI adapter, which contains an interface chip to implement the VAXBI protocol. The NBI adapter logic handles CPU references and direct memory accesses to and from the I/O devices. The adapter has its own 200-nanosecond clock, which is completely asynchronous with the 45-ns CPU clock.

The VAX 8800 family of systems is another major step for Digital Equipment Corporation into the realm of high-performance computing. While increasing the computing capability of the VAX line for scientific and technical applications, these systems will undoubtedly play an important role in commercial and office markets. In these markets, the ability to connect to a computing cluster, service many users, and function in a network are as important as a fast CPU. Indeed, in a multiuser, multiprogramming system, the efficiency of "housekeeping" operations affects the perceived system performance as much as raw processor computing speed. These operations include sharing memory between many programs, swapping processes into and out of memory, paging, and responding to interactive user requests.

All members of the VAX 8800 family use Digital's new VAXBI bus as their communication link to clusters, networks, and interactive users. With its ability to connect to four separate VAXBI channels, the VAX 8800 system in particular offers great flexibility in configuring peripheral devices and interfaces. This paper first discusses the characteristics of the system communication buses in the VAX 8800 system. Following that is a discussion of the interface, called the NBI adapter, linking the primary system bus to the VAXBI input/output (I/O) bus. Figure 1 illustrates the various components of a VAX 8800 system.

The Processor-to-Memory Bus

The two CPUs, the I/O subsystem, and memory all share the primary system bus, called the NMI bus. This bus is a limited-length, high-speed synchronous communications path that provides the data link between these four devices. The NMI bus is completely contained in the main system cabinet; its cycle time is 45 nanoseconds (ns), the same as the CPU's. The bus protocol handles several outstanding transactions at one time, thus effectively increasing the bus's utilization. That is, once a device has issued a transaction (e.g., a read), that device relinquishes the use of the bus until the responding device is ready with the data. Other devices are then free to start other transactions. In this fashion, the bus usage is greatly increased. The two CPUs communicate directly with memory over the NMI bus; the I/O devices connected to the VAXBI buses access memory via the NBI adapters.

A device on the NMI bus is called a "nexus." Arbitration among nexuses occurs in parallel with data transfers and is handled by one CPU in a nearly round-robin fashion. This guarantees that each nexus gains its fair share of the bus resource. Data transfers on the NMI bus occur in longword, octaword, and hexaword lengths (4, 16, and 32 bytes respectively). Four levels of device interrupts are supported.

The VAXBI Backplane Interconnect

The VAXBI bus is used as the I/O bus for the VAX 8800 system. As shown in Figure 1, from one to four VAXBI buses can be interfaced to the NMI bus, depending on a customer's needs and his desired mix of peripheral devices. Each VAXBI

maintained below a specified threshold. This error is called the local truncation error.

The resulting system of nonlinear equations is reduced to a system of linear equations by performing a first-order Taylor expansion of the nonlinear elements of the circuit. This linearization introduces another error called the linearization error. The resulting system of linear equations is then solved exactly, using an LU factorization of the system matrix.

After the solution of the system has been obtained, the linearization error can be estimated. If this error is too big, a new linearization is performed around the previously computed solution, and the new linear system is solved again. Successive linearizations are performed until convergence is obtained, that is, until the linearization error is below a specified threshold. When convergence is reached the solution of the nonlinear system is obtained, and the local truncation error is then checked. If this error is too big, the solution at time point $t_i$ is rejected and the system of differential equations is solved at a new time point $t_i'$ so that $t_{i-1} < t_i' < t_i$. If the error is below a specified threshold, the solution is accepted, and the system is solved at a new time point $t_{i+1}$ so that $t_i < t_{i+1}$. This procedure is repeated until the entire transient analysis is computed.

      discretize differential equations
      DO WHILE (not converged)
         linearize algebraic equations
         solve linear equations
         check convergence
      ENDDO
      IF (local truncation error too big) THEN
         reduce time
      ELSE
         save results at this time
         advance time
      ENDIF
   ENDDO

Figure 3  Transient Analysis Algorithm for SPICE

During a transient simulation the circuit simulator SPICE spends up to 90 percent of its CPU time in three phases of the previous algorithm. These phases are as follows:

In our environment synchronization is done through software, and the fine-grain parallelism used for vectorization may not be efficient. Based on these considerations, we have proposed and implemented an algorithm in which particular care has been taken to minimize the overhead incurred with parallel processing. The details of our algorithm can be found in reference 10.

Local Truncation Error Phase

The parallel computation of the time step does not present major difficulties since the computation of the local truncation error for each energy storage element is independent. Each slave process is assigned a set of energy storage elements and computes the time step required by this set. The master process then computes the minimum time step among the time steps returned by the slave processes. The energy storage elements are statically assigned among slave processes so that the work among them is balanced.

Results

The parallel algorithms described in this paper have been implemented to produce the program CAYENNE. We now present two examples to compare the timing performances of SPICE and CAYENNE.

The first example is the simulation of a MOS arithmetic logic unit (ALU) on a VAX 8800 system. The circuit has 200 nodes and 1350 elements. Twelve hundred Newton-Raphson iterations are required for the transient simulation. The efficiency of our parallel implementation is measured in this example. If a multiple-stream phase runs sequentially in an elapsed time $T_s$ and in parallel with $N$ slave processes in an elapsed time $T_p$, we define the efficiency, $E$, of the parallel execution by

$$E = \frac{T_s - T_p}{T_s - T_s/N}$$

$E$ represents the ratio of the actual savings in elapsed time to the potential savings in elapsed time. Table 1 gives timings and efficiencies for the ALU example. As a comparison, SPICE simulates the same circuit in an elapsed time of 834 seconds.

Table 1  Simulation Timing Performances and Efficiencies

          CAYENNE      CAYENNE
          0 Slaves     2 Slaves     Efficiency
Phase     (Seconds)    (Seconds)    (Percent)
Load      694          397          86
LU        22           14           70
LTE       67           35           96
Total     867          529

The second example is the simulation of a MOS control store. The circuit has 160 nodes and 530 elements, and the transient simulation requires 1404 Newton-Raphson iterations. SPICE spends 91 percent of the simulation time in the three phases we modified for parallel processing. CAYENNE executing with two slave processes achieves 90-percent efficiency in these phases and simulates the circuit 1.7 times faster than SPICE. For this simulation, CAYENNE on a VAX 8800 runs 9 times faster than SPICE on a VAX-11/780 CPU. Table 2 shows these comparisons.

The efficiencies of a parallel execution of CAYENNE depend on the size of the circuit.
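The efficiency measure defined above is easy to evaluate directly. A minimal sketch, with a hypothetical function name and made-up timings rather than values from Table 1:

```python
def parallel_efficiency(t_seq: float, t_par: float, n_slaves: int) -> float:
    """E = (T_s - T_p) / (T_s - T_s / N): actual savings over potential savings."""
    if n_slaves < 2:
        # With a single process the potential savings are zero, so E is undefined.
        raise ValueError("efficiency needs at least two slave processes")
    if t_seq <= 0:
        raise ValueError("sequential time must be positive")
    actual_savings = t_seq - t_par
    potential_savings = t_seq - t_seq / n_slaves
    return actual_savings / potential_savings


if __name__ == "__main__":
    # Hypothetical timings in seconds, chosen only for illustration.
    print(f"{parallel_efficiency(100.0, 55.0, 2):.0%}")   # 90%
```

An efficiency of 1 means the parallel run reached the ideal time of T_s/N; an efficiency of 0 means it saved nothing over the sequential run.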
Indeed, there is a fixed overhead incurred by calling the synchronization routines JOIN, FORK or JOIN_FORK. The bigger the task performed by the slave processes before a call to a synchronization routine, the smaller the relative cost of synchronization. The simulations of our examples were also run on a lightly loaded system. Loss of efficiency occurs when processors have to be shared with nonrelated processes, and busy-wait synchronizations may waste significant resources. A workload consisting of several independent simulations of equal importance is already decomposed, and CAYENNE should be run in single-process mode. If the turnaround of a single, large simulation needs to be minimized, however, CAYENNE should be run with two slave processes on a dedicated or lightly loaded 8800.

Table 2  Comparison of SPICE and CAYENNE Elapsed Run Times

Case                      Elapsed Seconds    Ratio
SPICE on VAX-11/780       3990               9.1
SPICE on VAX 8800         750                1.7
CAYENNE on VAX 8800       440                1.0

Summary

We have described a general methodology for parallel processing on the VAX 8800 system and a user-friendly set of routines that embed our methodology. We have also presented the successful conversion of the circuit simulator SPICE into the parallel program CAYENNE. New schemes to minimize the overhead of parallel processing and to balance the load among processes contribute to the overall efficiency of our implementation.

Acknowledgments

We would like to acknowledge Bob Kusik for initiating this project, Craig Yankes for introducing us to parallel processing within the VAX/VMS system and for providing us with an initial library of routines from which our methodology evolved, and John Faricelli, Nadim Khalil, Karem Sakallah, and John Sopka for many fruitful discussions.

References

1. R. Hockney and C. Jesshope, "Parallel Computers," (Bristol: Adam Hilger, Ltd., 1981).
2. L. Nagel, "SPICE2, A Computer Program to Simulate Semiconductor Circuits," Memo no. ERL-M520, University of California, Berkeley (May 1975).
3. Guide to Multiprocessing on VAX/VMS (Maynard: Digital Equipment Corporation, Order No. AA-HP69A-TE, 1986).
4. S. Farnham, M. Harvey, and K. Morse, "VMS Multiprocessing on the VAX 8800 System," Digital Technical Journal (February 1987, this issue): 111-119.
5. VAX/VMS System Services Reference Manual (Maynard: Digital Equipment Corporation, Order No. AA-Z501B-TE, 1986).
6. G. Jacob, A. Newton, and D. Pederson, "Direct Method Circuit Simulation Using Multiprocessors," Proceedings of the International Symposium on Circuits and Systems (May 1986): 170-173.
7. A. Newton, "The Simulation of Large Scale Integrated Circuits," IEEE Transactions on Circuits and Systems, vol. CAS-26 (September 1979): 741-749.
8. R. Thomas, "Using the Butterfly to Solve Simultaneous Linear Equations," Laboratory Memorandum, Bolt, Beranek, and Newman, Inc. (March 1985).
9. F. Yamamoto and S. Takahashi, "Vectorized LU Decomposition Algorithms for Large Scale Circuit Simulation," IEEE Transactions on Computer Aided Design, vol. CAD-4, no. 3 (July 1985): 232-239.
10. G. Bischoff and S. Greenberg, "CAYENNE: A Parallel Implementation of the Circuit Simulator SPICE," Proceedings of the IEEE International Conference on Computer Aided Design (November 1986): 182-185.

Dennis T. Bak

The Impact of VAX 8800 Design Methodology on CAD Development

Contributing to the success of the VAX 8800 project was an integrated CAD environment supporting the hardware design effort. A CAD group dedicated to this single project was chartered to supply a smoothly operating CAD process from initial design conception to final production. The CAD environment evolved through a blending of existing tools available in Digital with new tools developed outside the company. Gaps in the environment were filled through extensive modification of existing tools and new development efforts. The driving force behind the CAD process was a design methodology, radical for its time but second nature now.

Past CAD Development Efforts

Prior to the mid-1970s, logic development efforts within Digital Equipment Corporation were largely done without the extensive use of CAD tools. Hand-drawn schematic diagrams were the primary means of expressing logic designs.
A major advance in design automation took place in the mid-1970s when the Stanford University Design System, or SUDS, began to be used within Digital. SUDS allowed the entry of schematics into and the extraction of net lists from a graphics database. Although it was a major step forward in the automation of design processes, SUDS required significant user training and experience to become an effective tool.

Building a SUDS database capable of being used by a computer opened a new avenue for the evolving CAD groups to automate their design processes. These groups soon developed a large body of programs to support net-list extraction, design analysis, placement and routing, and eventually manufacturing parts-lists generation. Simulation tools were developed to help verify the operations of a design before any actual hardware was available. The increased complexity of design drove CAD developers to provide more powerful CAD tools. In turn, logic designers soon grew increasingly dependent on CAD tools as their capabilities increased.

The design methodologies and the CAD tool suite evolved to support large-CPU designs, such as the VAX 8600 family. SUDS eased the burden of entering and coping with design changes; however, the actual contents of its schematics differed little from those of the earlier hand-drawn ones. In large part the schematics entered by designers into SUDS correlated directly with the physical entity being built, showing all components and their pins.

At the inception of the VAX 8800 project in the early 1980s, a vast collection of CAD tools, written by many internal groups, had sprung up.
Most of these tools required large ASCII data files and significant manual intervention by CAD experts. Although many aids were provided to develop design processes, they lacked the cohesiveness and simplicity needed to put a process directly into the hands of the designers.

At about this time, a number of significant advances were made in CAD technology. Engineering workstations were announced at prices that made it practical to put them directly into the hands of designers. Moreover, new design methodologies, such as structured computer aided logic design, or SCALD, were also developed [1]. These methodologies could significantly improve the quality of design while decreasing the time to develop complex systems. Therefore, Digital made a commitment to use those methodologies on the VAX 8800 project to produce not only the product but a more productive way of developing it.

Design Methodology

The development of CAD tools for the VAX 8800 project was a considerable challenge to the CAD designers. The complexity of the VAX 8800 design, with its particular gate array implementation, demanded that the design quality be high before anything was committed to hardware. In fact, the project managers made a radical (for its time) commitment to simulate the entire design and verify its timing before any hardware was built. Therefore, the CAD process had to be designed to meet not only that goal but also to facilitate the rapid production of hardware once the design had proven acceptable. This section of the paper describes the methodology we followed to make the best use of our CAD tools. The next section describes those tools and how they were used.

The tool suite that evolved, pictured in Figure 1, supported both logical and physical design processes with checks and balances to ensure that the design topologies remained the same. Schematic diagrams, captured at an engineering workstation, were processed into a logical net list that was used by the simulation and verification tools. Once a logical design reached a certain level of maturity, it was mapped into a physical design. At that point a physical analysis, to determine delays and signal integrity, was performed. Placement and routing tools were then run to further refine the design. The part of the physical design database that represented the logical topology was then passed back to the logical side of the design process. There, a comparison was made to ensure that the physical and logical designs were congruent. The results of simulations based on the physical design were also passed to the logical process for comparison with the simulations based on the logical design. These mechanisms provided the primary checks to ensure that the logical design matched the physical one.

[Figure 1: CAD Tool Suite. Recoverable labels: designer, manufacturing, logical to physical mapping, rules check, placement, routing, interactive cleanup, delays, signal integrity, wire rule check, reports, manufacturing interface files, UNIX, VAX/VMS.]

We decided that the best way to assure success was to develop a complete paper specification of the machine to be built. Once the overall goals for the machine had been established, the designers developed the specifications for each major logic section. This high-level logical design was then partitioned into functions required within modules and gate arrays. These primary interfaces were specified before any detailed logic was developed. As it turned out, that partitioning remained relatively intact throughout the project.

The next step was to develop probe designs and abstract models for the most complex parts of the machine. These designs and models tested whether or not particular logic functions could be developed and timing constraints met. In some cases the probe designs were carried through to the actual fabrications of gate arrays or modules. This continuity allowed us to test the limitations of the selected ECL technology as well as the logic design.

The probe designs proved useful in many ways to both the designers and the CAD developers. The designers were able to verify that their logic implementations would work. The CAD developers were able to use the designs as test cases to develop and debug processes. These test cases proved to be critical to the project's success, especially when the finished design was given to the manufacturing organization. The process was so smooth, in fact, that designs flowed through it with few problems.
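The congruence check between the logical and physical designs described in this section can be illustrated with a small net-list comparison. This is only a sketch under stated assumptions: net lists are modeled as Python dictionaries, instance and pin names are assumed to be identical in both views, and net names are ignored because the two sides may name nets differently. None of these names or structures come from the actual VAX 8800 tools.

```python
from typing import Dict, FrozenSet, Set, Tuple

Pin = Tuple[str, str]            # (instance name, pin name), hypothetical naming
Netlist = Dict[str, Set[Pin]]    # net name -> pins wired together


def topology(netlist: Netlist) -> Set[FrozenSet[Pin]]:
    """Name-independent view of a net list: the groups of pins that share a net."""
    return {frozenset(pins) for pins in netlist.values()}


def compare(logical: Netlist, physical: Netlist):
    """Return (only_in_logical, only_in_physical); both empty means congruent."""
    logical_groups = topology(logical)
    physical_groups = topology(physical)
    return logical_groups - physical_groups, physical_groups - logical_groups


if __name__ == "__main__":
    logical = {
        "CARRY": {("U1", "Y"), ("U2", "A")},
        "SUM": {("U2", "Y"), ("U3", "A")},
    }
    physical = {
        "N001": {("U1", "Y"), ("U2", "A")},
        "N002": {("U2", "Y"), ("U3", "B")},   # deliberately miswired pin
    }
    missing, extra = compare(logical, physical)
    print("lost in physical design:", missing)
    print("introduced by physical design:", extra)
```

Comparing pin groups rather than net names is a design choice for the sketch: it reports a mismatch only when the wiring itself differs, which is the sense in which the text asks for the two topologies to remain the same.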
The Influence of SCALD

At the outset of the VAX 8800 project, we investigated the tools available within Digital for building a process to support the evolving design methodology. This study led the CAD team to explore several systems being developed by other companies. One system being developed by Valid Logic, Inc., the SCALDSystem CAD system, was procured by Digital. This system put the power of dedicated engineering workstations directly into the hands of logic designers. Of equal importance was the fact that the SCALDSystem CAD tools were being developed by the same people who conceived the SCALD approach to hardware design.

Logical schematics, requiring almost no information about the physical design, were entered into the SCALDSystem database. These schematics were entered in a hierarchical manner through an easy-to-learn graphical system. Such an arrangement encouraged the designers to avoid the creation of paper schematics by transferring their concepts directly to the workstation screens.

The decomposition of the design was from the top down, but the actual entry of design data occurred simultaneously at many levels. A "design tree" evolved in which cells forming gate arrays were merged onto modules that plugged into the backplane to form a system. The logical design was entered via the SCALDSystem tools onto schematics. The physical implementation of that logical design was left to the physical design tools.

Simulation and Timing Verification

Simulation on the VAX 8800 project was approached from two different viewpoints.
The first aimed to determine whether or not the performance goals of the proposed microarchitecture were within the necessary range, as specified by the project's needs.[2] This simulation started early in the project before any detailed logic design had been completed.

Once those performance goals had been verified, the second level of simulation focused on the logic design as it evolved. The designers could verify that each piece of the design functioned as specified while that piece was being developed. As the design tree evolved, the number of logic levels given to the simulation tools increased until the entire logic design had been entered. At this point the designers actually had the equivalent of a software breadboard of the entire VAX 8800 processor. Microcoded instructions were "running" on this software breadboard long before any hardware was available.

The ability to run instruction streams on the breadboard gave the project several advantages. Logic designers could debug their logic concurrently with the microcode developers verifying their microcode. Moreover, the diagnostics engineers could write as well as debug significant numbers of microdiagnostics much earlier than was usual in a design project. The early completion of those diagnostics allowed the first available hardware to be checked thoroughly.

Making the design logically correct through simulation did not ensure that the machine would work at the desired cycle time. In the ECL technology used in the VAX 8800, signal timing was critical.
Therefore, a timing verifier, part of the SCALDSystem tools, was used to ascertain whether or not the timing goals were being met. It was within the timing verifier that the influence of the physical implementation on the logical design was first felt. The logic designers had to ensure that the placement of gates and routing of signals was optimal for all critical elements. Delay information was then extracted from the physical design and fed back to the timing verifier.

Physical Design

As the logical design evolved, we developed a CAD process to convert it rapidly into a physical design. A set of automatic placement and routing tools, together with delay-estimation and signal-integrity tools, was used to give feedback to the designers. The important question here was whether or not they could build physical representations of their logic designs. These tools also passed data to the timing verifier, which analyzed the effect of the physical design on circuit timings.

Since all the logic had to be verified before any hardware was fabricated, all processes had to be designed to handle a large number of designs in parallel. The relevant Digital manufacturing facilities and outside vendors were acquainted with the physical design through the test cases rather than through an actual prototype. Thus the facilities and vendors could configure and debug their own manufacturing processes before any completed physical designs were sent to them.

To ensure a smooth transition into the fabrication phase, manufacturing engineers were assigned to work directly with the designers early in the design process. Thus these engineers became familiar with the VAX 8800 technology and the machine as it evolved. This was an important step because our manufacturing organization was to build all the hardware, including the prototypes. This early acquaintance with the design allowed them to develop manufacturing processes to support the rapid change to full volume shipments soon after the VAX 8800 system was announced.[1]

Computational Resources

One of the largest VAXcluster systems ever built was assembled to support the VAX 8800 project. This cluster consisted of 14 VAX-11/780 and VAX-11/785 systems with over 20 gigabytes of mass storage. Even this large amount of storage was inadequate at times to support the demands of the databases. Forecasting the computational requirements of this project proved difficult. The VAXcluster system provided the computational power and flexibility to grow as the demands increased.

The availability of sufficient computational resources was critical to the success of our project. The design methodology of extensive simulation was effective only with reasonable program run times. Once the design was verified, large numbers of physical designs were released for fabrication within a short period, which consumed significant computational and storage resources.

The Tool Suite

Design Data Management

A design data management (DDM) system was developed to organize the many files that contained the actual design data. At the heart of that system was the concept of a "design object." This object was some functional piece of the design, usually conforming to the physical partitioning. For example, each gate array and module in the system was defined as a design object. For each object we developed a hierarchy of subdirectories within the VMS file system. This separation of data files into subdirectories allowed various tools within the CAD process to know where to find input files and to write output files.

The design database was continually churning with new information. To give a stable picture as the overall design evolved, a "snapshot" of a design object could be taken at any time, thus generating a revision of the design object. New subdirectory file trees were then created for each revision. Using this scheme a designer could create a "frozen" revision of a design. He could then use that revision for simulations or other activities while changes were being made to another revision of the design.

The relationships between design objects were defined within a revision-matrix file kept with each file tree.
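The snapshot scheme can be sketched in a few lines. This is a minimal in-memory illustration that assumes nothing about the actual DDM implementation: the class, the method names, the file names, and the shape of the revision matrix (modeled here as a mapping from design object name to revision number) are all hypothetical, and dictionaries stand in for the VMS subdirectory trees.

```python
import copy
from dataclasses import dataclass, field
from typing import Dict, List

Files = Dict[str, str]  # file name -> contents; stands in for a subdirectory tree


@dataclass
class DesignObject:
    """Hypothetical design object, e.g. one gate array or one module."""
    name: str
    working: Files = field(default_factory=dict)
    revisions: List[Files] = field(default_factory=list)

    def snapshot(self) -> int:
        """Freeze a copy of the working files; return the new revision number."""
        self.revisions.append(copy.deepcopy(self.working))
        return len(self.revisions)  # revisions are numbered from 1

    def frozen(self, revision: int) -> Files:
        """Return a copy of a frozen revision so callers cannot alter it."""
        return copy.deepcopy(self.revisions[revision - 1])


def resolve(matrix: Dict[str, int], objects: Dict[str, DesignObject]) -> Dict[str, Files]:
    """Look up the frozen revision that a revision matrix names for each object."""
    return {name: objects[name].frozen(rev) for name, rev in matrix.items()}


# Invented example: freeze a gate array, then keep editing it.
gate_array = DesignObject("GA1", working={"wirelist.dat": "version A"})
rev = gate_array.snapshot()
gate_array.working["wirelist.dat"] = "version B"  # later churn in the database

module_matrix = {"GA1": rev}  # the module pins GA1 at its first revision
pinned = resolve(module_matrix, {"GA1": gate_array})
print(pinned["GA1"]["wirelist.dat"])
```

In this sketch the module keeps seeing the frozen contents even though the working copy of the gate array has moved on, which is the stability the frozen revisions were meant to provide.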
The revision-matrix file defined the system-level hierarchy of the machine: which design objects were subordinate to a given object. Using this file, a designer working on a module design could select frozen revisions of the gate array designs on that module and be assured of not having them changed as he worked on it.

Another facility provided by the DDM system was a user interface to the design environment. This interface consisted of a simple command language for traversing the design trees and for running specific tools. Since these tools required a large number of input variables, we established a system of default parameters to minimize user input. For cases in which those defaults proved inadequate, users or CAD developers could change parameters to meet the design's needs.

Schematic Capture

Using the Valid GED editor, logic schematics were entered directly into the workstations by the designers. The extracted wire lists were then transferred from the SCALDSystem UNIX-based workstation through a communications port to the VAXcluster system. The workstations were also interconnected in a networking environment, thus providing communication between them. To ease the burden on designers to learn multiple operating systems, only graphical data entry was permitted on the workstations. All the other CAD tools were run in the more native VAXcluster environment.

Since the majority of a designer's time was spent interacting with CAD tools on the VAXcluster system, there was no need for each designer to have a dedicated workstation for schematic capture.
The ratio of designers to workstations of about two to one proved adequate. The easily learned GED editor supported a rapid increase in the number of nondesigners (managers, secretaries, and documentation writers) in the user community. All were drawn to the system by the ease of graphical data creation. Eventually, this documentation activity accounted for the majority of workstation usage.

Simulation and Timing Verification

Another proprietary tool, called the DECSIM system, was the primary simulator used on the project. This system supported mixed-level simulations, both structural and behavioral. The logical design was transferred hierarchically to the DECSIM system. This system allowed the designers to deal with complex designs by viewing the simulation in the same hierarchical form as the schematics. For complex devices, such as multiplier chips and RAM devices, behavioral models were developed. These more efficient models increased the overall performance of the simulations. In the case of RAM devices, abstracting to a behavioral model also allowed the microcoded instructions to be loaded efficiently.

Complementing the functional simulation facilities of the DECSIM system was the timing verifier (TV) in the SCALDSystem tools. TV analyzed circuit timings to ensure that the design would work under worst-case conditions at the desired clock rate.

Wire delays are a major factor to be taken into account by timing verification. The placement of the physical gates was critical to minimize the wire lengths and hence the delays. Since the placement was not available in the initial design phases, statistical delays based on loading were used.
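To make the idea of statistical delays concrete, the sketch below estimates each wire delay from the number of loads on a net and checks path totals against a cycle time. It is a simplified illustration, not the algorithm of the SCALDSystem timing verifier; the linear delay model, its coefficients, the path data, and the cycle time are all invented for the example.

```python
from typing import Dict, List, Tuple

Stage = Tuple[float, int]  # (gate delay in ns, number of loads on its output net)


def estimated_wire_delay(loads: int, base_ns: float = 0.1, per_load_ns: float = 0.05) -> float:
    """Assumed linear loading model; the coefficients are illustrative only."""
    return base_ns + per_load_ns * loads


def path_delay(stages: List[Stage]) -> float:
    """Total delay of a path: gate delays plus estimated wire delays."""
    return sum(gate_ns + estimated_wire_delay(loads) for gate_ns, loads in stages)


def timing_violations(paths: Dict[str, List[Stage]], cycle_ns: float) -> Dict[str, float]:
    """Return the paths whose total delay exceeds the cycle time."""
    delays = {name: path_delay(stages) for name, stages in paths.items()}
    return {name: d for name, d in delays.items() if d > cycle_ns}


# Invented example paths and cycle time.
paths = {
    "short": [(1.0, 2), (1.0, 4)],
    "long": [(1.0, 2), (1.0, 4), (1.0, 8)],
}
print(timing_violations(paths, cycle_ns=3.0))
```

With these made-up numbers only the path named `long` is reported. Keeping the wire-delay estimate in its own function means a loading-based guess could later be swapped for a length-based figure without touching the path check.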
As placement information became plentiful, the latest refined delays were sent to the timing verifier. When the physical design had been completed, delays based on routed lengths were used. If the required timing was not met at any point in the process, the offending circuits were redesigned or the layout was changed to correct the problem.

Wirelisting and State Maintenance

The logic gates entered on schematics by the designers were, in general, assigned to physical components by the CAD process. This mapping occurred initially within the SCALDSystem postprocessor software using a random gate-to-component assignment. This random packaging was then fed into a system called YAWL (for Yet Another WireLister). YAWL acted as a general-purpose wirelister, generating interfaces to many tools and accepting feedback from the physical design tools.

As the physical design process refined the gate assignment, YAWL ensured that the logical design topology did not change. By accepting feedback data from the placement and routing tools and the physical design system, YAWL caught any illegal changes that would have altered the logic functions.

Eventually, the complexity of maintaining the state became so great that YAWL alone could not cope with it. Therefore, several other programs were placed in the feedback loop from the physical design tools to detect changes made in the process of manually cleaning up the physical design. These programs were needed since, even at that late stage, a designer could still add logic to the design. The CAD process therefore had to handle these additions as well as to detect illegal transformations to the logic.
The resolution of these changes took a lot of resources, both in terms of time and computer power.

In addition to being the state maintainer, YAWL acted as a primary source of the design data needed for the remainder of the CAD process. YAWL created many reports to inform designers of problems between their logical and physical designs. Most of the interface files in the CAD process were either read, written, or both, by YAWL, which played a key role in the overall process.

Placement and Routing

Two processes were developed for the placement and routing of gate-array and module designs. The gate array process was highly automated, requiring a minimum of interaction by the designers. The process was organized to make several runs from which a designer could select the one that best optimized his logic design.

The bounded problem of placement and routing within a gate array was easy to solve in comparison to the module designs. Here the constraints placed by designers, the limitations of tools, and the complexities of design required extensive human intervention.

Analysis tools were used extensively to assist in determining the quality of design at the two design levels: gate arrays and modules. These tools analyzed such factors as thermal dissipation, signal integrity, and crosstalk. The constraints defined in these tools and in the extensive design-rule checkers were met, thus ensuring a high-quality design.

Most of the tools used for the physical design were developed within Digital. Those developed outside the VAX 8800 CAD group were modified, sometimes extensively, to meet the needs of the project.
Physical Design and Manufacturing Interface

A proprietary physical design system, called the VAX layout system (VLS), was used for the final physical design tasks. VLS took the physical design, as given by the placement and routing tools, and added the data required to manufacture the design. A layout designer, through the VLS interactive graphics system, could manually complete the routing that could not be handled by the automatic tools. Some additional parts that were necessary for fabrication, such as handles for modules, were also added at this time. The net result was a complete design, specified so that it could be used to manufacture the product.

The design data was then collected to form a release package. To keep track of the formal release of design data, a system called POST was developed by the CAD group. POST provided an on-line database, which any member of the project team could query to determine the release status of a design.

Problems Imposed by the Design Methodology

Up to this point, we have described the basics of the design methodology used to develop the VAX 8800 system and some highlights of the CAD tools supporting that methodology. As mentioned earlier, the CAD process was placed directly into the hands of the designers. Thus a tight coupling was established between the process of design and the design process. This coupling posed several major problems, as now described, for the CAD group.

Training

With direct control of a process or tool given to the designers, they all now needed extensive training. On previous projects, one highly knowledgeable individual could run a tool; now, there were 30 or so novice users all learning to use that same tool.
Extensive support for those users, in terms of both trainers and documentation, had to be provided. In most cases the designers quickly learned how to utilize the tools. In a few cases (the placement of modules in particular) placement experts were needed owing to the specialized nature of the task. In summary, the extent of the support required by users was greater than anticipated.

State Maintenance

The task of state maintenance proved to be extremely complex owing to the freedom given to designers to make changes at almost any point in the design process. To ensure that the logical and physical designs matched, it was necessary to do a complete isomorphic comparison of the physical topology against the logical topology of the design.

Logical Prints

The schematics generated by the designers at their workstations represented the logical design, not the physical one. Certain features available in the SCALDSystem tools, such as vectorized signals and gates, allowed the system to produce a concise representation of the logic. This came, however, at the expense of not putting physical data back onto the print set. For reasons of state maintenance, we were also unable to restructure a print set once mapped to a physical implementation. Both these factors contributed to a print set that appeared quite different from those generated by previous projects.

Logical print sets, while initially envisioned as being beneficial, later caused problems in documenting the designs. This was particularly true for module-level designs, for which training was needed so that groups outside the project team could interpret the new symbology.
Cross References

Using logical print sets alone, a technician could not probe a pin of the physical boards. Since an abstract mapping took place in the CAD process, it was necessary to develop an extensive set of cross references showing the mapping of the logical to the physical design. These cross references proved to be cumbersome and, when printed, consumed vast amounts of paper.

Libraries

CAD tools run on libraries, and each major tool has its own format for library data. These libraries must be consistent across the entire process. Despite all the safeguards built into the process, we found that inconsistencies still crept back into the database. Discovering and eliminating those inconsistencies, many of which were found late in the project, consumed a lot of time.

Summary

Both the design methodology and the CAD process supporting the VAX 8800 project were quite successful. The first prototype hardware delivered to us worked as expected. We found only a small number of hardware problems during the prototype debug phase of the project. Most of those problems were in areas that had not had extensive simulation or timing verification.

Some general conclusions reached from the VAX 8800 project can help future CAD designers to improve their tools.

• A close coupling from the start, both physically and organizationally, between all groups associated with the project leads to the development of a smooth process flow.
• The design methodology has a direct and far-reaching impact on the CAD process. The capabilities of CAD tools directly affect the design methodology.
• Extensive simulation and timing verification before fabrication can help to achieve a high-quality product.
• The impact of radical changes (e.g., in the data content of schematics) must be appreciated and then taken into account by all project members.

In future projects we will focus on reducing the process-loop times and enhancing the capabilities of the simulation and timing verification tools. It will be easier to function in future design environments, and more tools will be placed directly into the hands of the designers. The design methodology will be modified to make the resolution of the design state easier and therefore faster.

References

1. Structured Computer Aided Logic Design was developed at Lawrence Livermore Laboratories and applied there to the design of the S-1 computer.
2. C. Wiecek, "The Simulation of Processor Performance for the VAX 8800 Family," Digital Technical Journal (February 1987, this issue): 100-110.
3. A. Matthews, "On-line Manufacturing Data Access on the VAX 8800 Project," Digital Technical Journal (February 1987, this issue): 136-141.

Andrew J. Matthews

On-line Manufacturing Data Access on the VAX 8800 Project

Previously, the transition from design to manufacture involved transferring significant amounts of data on paper. To minimize product start-up time, the VAX 8800 project used an on-line system that eliminated much of the paper. The key task was transforming the data from existing CAD tools with different formats into manufacturing data. Two generic types of VMS files, DATA and DRAWING, contained data for each Part Number and Revision Number. VMS's subdirectory and access-control capabilities provided total revision control.
Manufacturing engineers pulled files at will, using DATA files to drive their processes and viewing DRAWING files from VAXstation II workstations.

A key objective for the VAX 8800 project was to rather th