Digital Technical Journal, Volume 2, Number 4, 1990
VAX 9000 Series
Digital Technical Journal
Digital Equipment Corporation
Volume 2 Number 4 Fall 1990

Editorial
Jane C. Blake, Editor
Barbara Lindmark, Associate Editor

Circulation
Catherine M. Phillips, Administrator
Suzanne J. Babineau, Secretary

Production
Helen L. Patterson, Production Editor
Nancy Jones, Typographer
Peter Woodbury, Illustrator and Designer

Advisory Board
Samuel H. Fuller, Chairman
Richard W. Beane
Robert M. Glorioso
John W. McCredie
Mahendra R. Patel
F. Grant Saviers
Robert K. Spitz
Victor A. Vyssotsky

The Digital Technical Journal is published quarterly by Digital Equipment Corporation, 146 Main Street MLO1-3/B68, Maynard, Massachusetts 01754-2571. Subscriptions to the Journal are $40.00 for four issues and must be prepaid in U.S. funds. University and college professors and Ph.D. students in the electrical engineering and computer science fields receive complimentary subscriptions upon request. Orders, inquiries, and address changes should be sent to The Digital Technical Journal at the published-by address. Inquiries can also be sent electronically to DTJ@CRL.DEC.COM. Single copies and back issues are available for $16.00 each from Digital Press of Digital Equipment Corporation, 12 Crosby Drive, Bedford, MA 01730-1493.

Digital employees may send subscription orders on the ENET to RDVAX::JOURNAL or by interoffice mail to mailstop MLO1-3/B68. Orders should include badge number, cost center, site location code and address. U.S. engineers in Engineering and Manufacturing receive complimentary subscriptions; engineers in these organizations in countries outside the U.S. should contact the Journal office to receive their complimentary subscriptions. All employees must advise of changes of address.

Comments on the content of any paper are welcomed and may be sent to the editor at the published-by or network address.

Copyright © 1990 Digital Equipment Corporation.
Copying without fee is permitted provided that such copies are made for use in educational institutions by faculty members and are not distributed for commercial advantage. Abstracting with credit of Digital Equipment Corporation's authorship is permitted. All rights reserved.

The information in this Journal is subject to change without notice and should not be construed as a commitment by Digital Equipment Corporation. Digital Equipment Corporation assumes no responsibility for any errors that may appear in this Journal.

ISSN 0898-901X

Documentation Number EY-E762E-DP

The following are trademarks of Digital Equipment Corporation: CI, DECsystem-10, DECSYSTEM-20, Digital, the Digital logo, HDSC, MCU, MicroVAX, NI, PDP-1, ULTRIX, VAX, VAX-11/780, VAX 6000, VAX 8000, VAX 8600, VAX 8650, VAX 9000, VAXBI, VMS, XMI.

IBM is a registered trademark of International Business Machines Corporation.
Kapton is a trademark of E. I. duPont de Nemours & Company.
MOSAIC III is a trademark of Motorola Corporation.
Micromaster Plus is a registered trademark of t.:rx Company.

Book production was done by Digital's Educational Services Media Communications Group in Bedford, MA.

Cover Design
Digital's VAX 9000 mainframe system is the theme of this issue. Our cover depicts several simple instructions flowing through the VAX 9000 instruction execution pipeline. High performance was achieved by breaking the VAX instructions into small simple tasks that could be pipelined efficiently. Concurrent operation on up to six instructions resulted in an execution rate of one simple VAX instruction per clock period. Gloria Monroy of the High Performance Systems Group designed the cover graphic, which was implemented in cooperation with David Comberg of the Corporate Design Group.

Contents

Foreword
    Carl S. Gibson

VAX 9000 Series

Design Strategy for the VAX 9000 System
    David B. Fite Jr., Tryggve Fossum, and Dwight Manley

VAX Instructions That Illustrate the Architectural Features of the VAX 9000 CPU
    John E. Murray, Ricky C. Hetherington, and Ronald M. Salett

Semiconductor Technology in a High-performance VAX System
    Matthew J. Adiletta, Richard L. Doucette, John H. Hackenberg, Dale H. Leuthold, and Dennis M. Litwinetz

Vector Processing on the VAX 9000 System
    Richard A. Brunner, Dileep P. Bhandarkar, Francis X. McKeen, Bimal Patel, William J. Rogers Jr., and Gregory L. Yoder

HDSC and Multichip Unit Design and Manufacture
    Peter B. Dunbeck, Richard J. Dischler, James B. McElroy, and Frank J. Swiatowiec

The VAX 9000 Service Processor Unit
    Matthew S. Goldman, Paul H. Dormitzer, and Paul A. Leveille

The Unique Features of the VAX 9000 Power System Design
    Derrick J. Chin, Barry G. Brown, Charles F. Butala, Luke L. Chang, Steven J. Chenetz, Gerald E. Cotter, Brian T. Lynch, Thiagarajan Natarajan, and Leonard J. Salafia

Synthesis in the CAD System Used to Design the VAX 9000 System
    Donald F. Hooper and John C. Eck

Hierarchical Fault Detection and Isolation Strategy for the VAX 9000 System
    Karen E. Barnard and Robert P. Harokopus

Editor's Introduction

Jane C. Blake
Editor

The VAX 9000, Digital's first mainframe computer, is the topic of papers in this issue of the Digital Technical Journal. As engineers writing for this issue relate, the primary goal of the project from the initial product strategy through manufacture was to design and build a very high-performance, highly reliable VAX system.

Design engineers applied both CISC and RISC techniques to achieve high levels of performance for this tightly coupled multiprocessor system. In the opening paper, Dave Fite, Tryggve Fossum, and Dwight Manley explain the strategy behind the design. They begin with an overview of the system, the technology, and CAD tools, and then describe the redesign of VAX instructions into small tasks which can be efficiently pipelined. The authors also touch upon three additional aspects of the VAX 9000 system: the integration of vector processing into the VAX architecture, new error handling techniques, and performance modeling.

One measure of performance is the number of instructions processed per cycle. The average number of cycles per instruction is less than five, nearly half the average for previous VAX systems. To illustrate the architectural features that enable this level of performance, John Murray, Rick Hetherington, and Ron Salett have selected a small sample of VAX instructions. They describe the instruction flow through the pipeline, how instruction features combine to work on a single macro, and how stages of the pipeline interact.

In addition to the architectural improvements, machine performance is enhanced at the semiconductor level by a new generation of semicustom and custom integrated circuits that support a low cycle time. Matt Adiletta, Dick Doucette, John Hackenberg, Dale Leuthold, and Dennis Litwinetz give an overview of the bipolar technology used in the system. They then describe the methods used to implement the 77 different gate array chips, the five custom chips, and the self-timed RAM architecture.

An additional performance improvement for numeric computations is the VAX vector architecture, treated in the paper by Rich Brunner, Dileep Bhandarkar, Frank McKeen, Bimal Patel, Bill Rogers, and Greg Yoder. They discuss the architectural model and particulars of the VAX 9000 implementation, which affords numerically intensive applications performance four to five times greater than can be achieved by the scalar processor.

To ensure that the system performance gains at the semiconductor level were not diminished but were instead enhanced by packaging and interconnects, engineers developed several technologies unique in the industry. The technologies behind the high-density signal carrier and the multichip unit are explained in the paper by Pete Dunbeck, Rich Dischler, Jim McElroy, and Frank Swiatowiec.

Equally important to performance in the new 9000 is system reliability, as evidenced by the introduction of the service processor unit. In their paper about the service processor, Matt Goldman, Paul Dormitzer, and Paul Leveille relate how the MicroVAX-based system embedded within the 9000 detects, isolates, and corrects problems without interrupting the system.

High system availability was also one impetus in the design of the power system. Some of the unique features of the power system, such as redundant regulators, improved load sharing, and simulation, are discussed by Derrick Chin, Barry Brown, Charles Butala, Luke Chang, Steve Chenetz, Jerry Cotter, Brian Lynch, Raj Natarajan, and Len Salafia.

The two papers that close this issue address the topics of CAD methodology and system diagnosis. Don Hooper and John Eck describe a CAD methodology that combines advanced rule-based AI techniques with an object-oriented database. The new methodology saves logic designers significant time and reduces errors. A complex system such as the VAX 9000 requires improved system diagnosis capabilities to achieve the desired high system availability. Karen Barnard and Rob Harokopus demonstrate how a new scan system, in combination with scan pattern testing and symptom-directed diagnosis, achieves this necessary diagnosis capability.

The editors thank Rick Hetherington of the High Performance Systems Group for not only writing a paper but for his help in coordinating this issue.

Biographies

Matthew J. Adiletta
Matthew Adiletta is currently contributing to the implementation of a new processor architecture and performing a technology evaluation to determine the technology for the implementation.
He joined Digital in 1985 to work on a high-performance RISC architecture. Matt was not only the architect for the VAX 9000 system, but he also implemented the integer and floating point multiply and divide units and developed an ECL custom chip process. He holds one patent and has several patents pending. Matt received a B.S.E.E. (honors, 1985) from the University of Connecticut.

Karen E. Barnard
A senior software engineer with the High Power Business Unit CPU Development Group, Karen Barnard wrote the read-only memory based diagnostic for the VAX 9000 service processor unit's scan control module and developed the scan pattern diagnostic for the VAX 9000 CPU and SCU. Karen also worked on the debugging structural test process for the VAX 9000 kernel environment. Prior to joining Digital in 1986, Karen was with Data General Corporation. She received a B.S. (1983) in computer science from the Worcester Polytechnic Institute.

Dileep P. Bhandarkar
As technical director for RISC systems, Dileep Bhandarkar is responsible for leading the architectural direction of RISC products. He joined Digital in 1978 and was responsible for managing the evolution of the VAX architecture. Dileep was the chief architect for VAX vector processing and coarchitect of Digital's RISC architecture. He holds one patent for his work at Digital and has several patents pending. His degrees in electrical engineering include a Bachelor of Technology from the Indian Institute of Technology and an M.S. and a Ph.D. from Carnegie-Mellon University.

Barry G. Brown
The concept of designing DC-to-DC converters as system elements rather than individual "power supplies" was introduced into the high power systems products by Barry Brown. He created and developed a highly flexible, high-reliability DC-to-DC conversion system for the VAX 9000 series. Barry designed, implemented, and verified the power system for the VAX 9000 Model 200 systems.
He was a principal engineer for the Codex Corporation before coming to Digital in 1984. Barry is a graduate of Woolwich Polytechnic and Harlow Technical College.

Richard A. Brunner
As a principal engineer, Richard Brunner is the architect currently responsible for the engineering refinement and control of both the VAX and VAX vector architectures. He is the editor of the VAX Architecture Reference Manual and coauthor of the VAX Vector Handbook and several papers on the VAX vector architecture. He received a B.S. (high honors, 1984) in electrical engineering from Case Western Reserve University and an M.S. (1987) in computer engineering from Rensselaer Polytechnic Institute. He is a member of IEEE and Tau Beta Pi.

Charles F. Butala
Presently responsible for the power system design and architecture of the VAX 9000 Model 400 systems, Charles Butala is a consulting engineer in the Information Systems Business Unit Power Systems Group. Since he joined Digital in 1976, he has been responsible for several power system design projects, including the VAX 8600 system. He is a member of IEEE and Tau Beta Pi, and holds honorary society membership in Eta Kappa Nu. Charles received a B.S.E.E. (1968) from Illinois Institute of Technology and an M.S.E.E. from Northeastern University.

Luke L. Chang
After receiving his M.S. in electrical engineering from Virginia Polytechnic Institute and State University in 1988, Luke Chang joined the Power Systems Technology and Regulations Group. He is currently a hardware engineer and is responsible for developing simulation tools to perform high-quality software design verification tests for the next generation DC-to-DC power converters. Luke's previous responsibilities include transient analysis and testing of the VAX 9000 memory power distribution system, and power system cost reduction studies.

Steven J. Chenetz
As a principal engineer in the Information Systems Business Unit Power Systems Group, Steven Chenetz is currently working on the H7390 for a high-power VAX system. He previously was a member of the design and development teams for the H7380 of the VAX 9000 system, the H7188 environmental monitoring module for the VAX 8600 power system, the VAX 8600 clock distribution system, and signal integrity for the VAX 8600 system. Steve joined Digital upon graduation from Rensselaer Polytechnic Institute in 1981. He has an M.S.E.E. from Northeastern University (1987).

Derrick J. Chin
Derrick Chin is the engineering manager for several Information Systems Business Unit power groups and is design engineer of the VAX 9000 processor's DC power distribution system. His association with Digital began in 1961, and he has participated in many projects, from the PDP-1 and the DECsystem-10 to the VAX HMO systems. His responsibilities have ranged from development of precision displays, circuit design, and core and semiconductor memories to environmental monitoring modules and power systems. He holds a B.S.E.E. (1959) from MIT.

Gerald E. Cotter
Principal engineer Gerald Cotter is a member of the Information Systems Business Unit Power Systems Group. He was the project engineer and coarchitect of the VAX 9000 power control system (PCS). Jerry was the PCS interface to Customer Service and Support Engineering, Manufacturing, and Service Processor Unit Groups. He participated in development of the PCS and power system test strategies and the initial design of the T01060 power and environmental monitor module. His previous work includes the VAX 8600 system's power and control subsystem.

Richard J. Dischler
In his position of systems engineer for the High Performance Systems Group, Richard Dischler worked on the VAX 9000 signal integrity project. He also was a member of the project team for the electrical design of HDSC and micropackaging for multichip units, planar boards, and connectors for the VAX 9000 system. Rich held similar responsibilities in the development of the VAX 8600 system. He joined Digital in 1982, and his previous experience was at Applied Research Laboratories. He holds a B.S.E.E. (1982) from Pennsylvania State University.

Paul H. Dormitzer
As an undergraduate at Harvard University, Paul Dormitzer gained experience with the UNIX operating system by working as a programmer and operator. Upon receiving his B.A. in computer science in 1987, he joined Digital's High Performance Systems Group. He is currently an engineer in the High Performance Business Unit CPU Engineering Group. Paul's primary responsibilities are in the development of error recovery processes for high power systems, such as the VAX 9000 system.

Richard L. Doucette
Since joining Digital in 1979, Richard Doucette has been a member of several high-performance systems project teams. As a senior engineer on the VAX 8600 team, he helped introduce the Motorola Macrocell Array I (MCA I) technology into Digital and was responsible for its design analysis and characterization in the system. As engineering manager on the VAX 9000 team, he was responsible for the incorporation of MCA 3 technology, custom chips, and self-timed RAM components in the system. He holds a B.S.E.E. (1973) from the University of Maine.

Peter B. Dunbeck
Peter Dunbeck is an engineering manager in the High Performance Business Unit Technology Research and Engineering Group. He held various positions on the VAX 9000 program between 1985 and 1990, including technology program manager and design engineering manager for the multichip unit. Before joining Digital in 1984 as a manufacturing engineer, Peter developed energy conservation programs for Thermo Electron. He holds a B.S. (1977) in mechanical engineering from Virginia Tech and an S.M. (1979) in aeronautics and astronautics from MIT.

John C. Eck
The development of the majority of the physical design CAD tools used in the VAX 9000 system was managed by John Eck. He is a software engineer manager in the High Performance Systems CAD and Diagnostics Group. John was employed as the manager of the Automated Design Department of Badger Company before coming to Digital in 1984. He holds a B.S. (1964) in physics and an M.S. (1966) in aeronautics and astronautics from MIT, and an M.B.A. (highest honors, 1984) from Babson College.

David B. Fite Jr.
Consultant engineer David Fite was a member of the initial architecture team for the VAX 9000 system. He developed the architecture for the branch prediction, instruction fetch, and instruction decode for the VAX 9000. His previous work includes responsibility for prototype debugging on the VAX 8600 system. Dave joined Digital in 1982. He has one patent and several patent applications pending. He is a graduate of Worcester Polytechnic Institute with a B.S. (honors) in electrical engineering.

Tryggve Fossum
Tryggve Fossum is the system architect of the VAX 9000 system. He received a B.S. (1968) from the University of Oslo and earned his Ph.D. (1972) from the University of Illinois. Tryggve joined Digital in 1973 and worked on the design of high-end computers, notably the VAX-11/780 system. As a project leader on the VAX 8600 team, he guided the design of the floating point accelerator. He has also worked on several research projects, including an early raster scan graphics workstation, and a workstation with an integrated disk system.

Matthew S. Goldman
As a senior engineer on the VAX 9000 project team, Matthew Goldman designed the scan control chip, which contains the control logic for the VAX 9000 scan system. He was also the responsible engineer for all VAX 9000 service processor hardware.
Prior to joining Digital's High Performance Systems CPU Group in 1986, Matt was a design engineer for Raytheon Company. He is a member of Tau Beta Pi and Eta Kappa Nu. Matt holds a B.S. (highest honors, 1983) and an M.S. (1988) in electrical engineering from Worcester Polytechnic Institute.

John H. Hackenberg
In 1968, John Hackenberg came to Digital as a technician on the KI-10 project, leaving after two years to serve in the armed forces. He returned to Digital in 1971 and worked on the designs for various high-end systems, including the KL-10. As a consulting engineer on the VAX 8600 project, he worked in the area of signal integrity. John was the project leader for the MCA 3 gate array used in the VAX 9000 system and is currently developing a bipolar gate array. He holds a B.S.E.T. (1979) from the University of Lowell.

Robert P. Harokopus
A cum laude graduate of the University of Michigan, Robert Harokopus received a B.S. (1986) in computer engineering and is now studying for an M.S. in computer engineering from Boston University. Bob is a senior software engineer and joined Digital in 1986. He developed the symptom-directed diagnosis software used in the VAX 9000 service processor unit. Bob also developed software for the HIDE CAD tool and SCEPTER automatic test pattern generator, both of which were used in the VAX 9000 design project. He is a member of Tau Beta Pi and Eta Kappa Nu.

Ricky C. Hetherington
As a principal engineer with the High Performance Systems Group, Ricky Hetherington is currently the project leader of the translation buffer and cache design of the VAX 9000 system. He holds one patent and has several patents pending on the various design features of the VAX 9000 M-box. Rick joined Digital in 1982 as a senior engineer in Digital's Large Computer Group. He has a B.S. from Pennsylvania State University.

Donald F. Hooper
Don Hooper is a consulting engineer in both logic design and CAD disciplines. He initiated and led the development of the Synthesis of Integral Design program, Digital's first synthesis tool. Before coming to Digital in 1979, he was architect for the Itel 7031 mainframe and cache designer for the Itel Advanced System 4. He is a graduate of Don Bosco Technical Institute. Don holds patents in speech recognition circuits, the tag and queuing system for Digital's first pipelined CPU, and the control storage pipe for the VAX 8600 system. In addition, he has several patents pending in logic synthesis.

Dale H. Leuthold
A member of the technical staff of the Integral Circuit Design Group, Dale Leuthold led the design team for the VAX 9000 vector register chip. He is currently working on random-access memory development for high-speed mainframes. Dale was responsible for bipolar integrated circuit design at Signetics Corporation and Trilogy Systems Corporation before coming to Digital in 1986. He holds one patent and has one patent pending. Dale received a B.S. from Oregon State University.

Paul A. Leveille
In his nearly ten-year relationship with Digital, Paul Leveille has specialized in the development of high-power systems, particularly the VAX 8600 and VAX 9000 systems. As a principal engineer in the High Performance Business Unit, he helped define the VAX 9000 service processor subsystem and was responsible for developing the scan control firmware and portions of the service processor application software. Paul's previous responsibilities include console diagnostics, firmware, and application software.

Dennis M. Litwinetz
The project leader for the design of four standard cell and custom chips for the VAX 9000, Dennis Litwinetz is a consulting engineer in the High Performance Business Unit. He has previously participated in the design of two standard cell chip designs for the VAX 8600 system. He joined Digital in 1967 as a technician for the DECsystem-10 Engineering Group.
Dennis has a patent pending for the VAX 9000 self-timed register file design. He received a B.S.E.E.T. from Lowell Technological Institute and an M.S.C.E. from the University of Lowell.

Brian T. Lynch
Brian Lynch is a principal hardware engineer in the Information Systems Business Unit Power Systems Group. In this position, he designed and developed the H7382 bias power supply used in the VAX 9000 system. He is presently working on power solutions for future high-performance systems. Prior to joining Digital in 1972, Brian was responsible for power converter and analog module design at Intronics. He has a B.S.E.E. (1978) from Worcester Polytechnic Institute.

Dwight Manley
As a principal engineer on the VAX 9000 project, Dwight Manley was responsible for all of the performance modeling of the VAX 9000 CPU design. His present responsibilities include writing code for a Digital Extended Math Library product. Dwight joined Digital in 1979 as a member of the Systems Performance Analysis Group. Prior to that time, he worked as a systems programmer for the Bell Telephone System. Dwight has a B.S. (1971) in mathematics from the University of Massachusetts and an M.S. (1976) from Northeastern University.

James B. McElroy
Jim McElroy is the multichip unit operations manager. His work on the VAX 9000 system began with interconnect and packaging, followed by the management of the physical technology efforts. He then became the manufacturing systems program manager for the introduction of the VAX 9000 system into manufacturing. Before joining Digital in 1976, Jim worked at RCA on packaging and interconnect design for military computer systems. He received a B.S.M.E. and an M.S.M.E. from Northeastern University.

Francis X. McKeen
The project leader for the V-box unit of the VAX 9000 system was Francis McKeen. Prior to working on the VAX 9000 system, he wrote microcode for the VAX 8600 and VAX 8650 systems.
Frank is a principal engineer and has been with Digital for seven years. He holds one patent and has several patent applications pending. Frank received a B.S.E.E. from Northeastern University and is a member of IEEE.

John E. Murray
The coauthor of Microarchitecture of the VAX 9000, John Murray is a consulting engineer in the High Performance Business Unit. He served as project leader of the design team for the I-box unit of the VAX 9000. He joined Digital in 1982. John's previous employer was ICL in the United Kingdom, where he was a design engineer. He received a B.Sc. (1969) from Warwick University. He holds one patent and has several patents pending.

Thiagarajan Natarajan
Thiagarajan Natarajan is manager of a DC-to-DC converter group in the Information Systems Business Unit. His group develops a high-density and highly reliable DC-to-DC converter, associated hybrids, semiconductor components, and the distribution system for the next generation, high-performance VAX systems. Raj's prior experience includes positions at General Electric, Bell Laboratories, and Perkin Elmer Corporation. He has a Ph.D. in electrical engineering, has been awarded one patent, and has authored approximately seventeen technical papers.

Bimal Patel
Principal engineer Bimal Patel joined Digital in 1986 as a senior engineer. His primary responsibility since that time was the design of the V-box unit of the VAX 9000 system. Bimal was previously employed as a senior engineer in the CPU Design Group of Prime Computer, Inc. He has an M.S. in computer engineering from Boston University.

William J. Rogers Jr.
William Rogers is an engineer in the VAX 9000 CPU Group, where he developed the design of the control logic of the V-box unit for the VAX 9000. Prior to working on this high-performance system, Bill was a member of the SASE Support Engineering Group. He joined Digital in 1986 and is a member of IEEE and Tau Beta Pi. He received a B.S. (1986) in electrical engineering from Michigan Technological University.

Leonard J. Salafia
The development of the AC front end for the VAX 9000 system was the responsibility of Leonard Salafia, who is the manager of the AC Power Interface Development Group. His previous work at Digital includes supervising the development of storage system power products for the Central Power Supply Engineering Group and for the Storage Systems Power Group. Len worked for General Electric prior to coming to Digital in 1980. He holds a B.S.E.E. (magna cum laude, 1969) from the University of Hartford and an M.S.E.E. (1976) from Rensselaer Polytechnic Institute.

Ronald M. Salett
As a consulting engineer in the High Performance Systems Group, Ron Salett is currently leading the development of a new high-performance CPU. As a project leader for the VAX 9000 system, he was responsible for the architecture, design, and microcode of the execution unit. Since joining Digital in 1977, Ron has also worked as an architect and project leader on low-end integrated PDP-11 systems. He holds two patents. Ron holds a B.S.E.E. (1975) from Carnegie-Mellon University and an M.S.E.E. (1979) from Worcester Polytechnic Institute.

Frank J. Swiatowiec
In 1988, Frank Swiatowiec became HDSC operations manager, with the primary responsibility to transition Digital's new HDSC technology to volume production. He was one of the engineering managers responsible for the definition and development of the HDSC. Frank had over 15 years of experience in the semiconductor industry when he joined Digital in 1986. While with Motorola Corporation, he was awarded four patents on ECL circuit designs. Frank holds a B.S.E.E. from the University of Illinois and an M.S.E.E. from Arizona State University.

Gregory L. Yoder
Gregory Yoder is a senior hardware engineer with the High Performance Systems CPU Engineering Group. His primary responsibilities on the VAX 9000 system included the design and testing of the V-box unit, and prototype system debug, for which he received an excellence award. He also assisted Manufacturing in producing and installing external field test VAX 9000 machines. Greg joined Digital in 1988, after participating in a one-year co-op session at IBM. He holds a B.S.E.E. from Pennsylvania State University.

Foreword

Carl S. Gibson
VAX 9000 Program Manager

This issue of the Digital Technical Journal is a collection of papers describing the technologies, designs, and design methods employed in Digital's VAX 9000 mainframe/supercomputer, which was introduced in the fall of 1989. The VAX 9000 system embodies hundreds of innovations in most areas of design, manufacture, and service. In selecting papers for this Journal, we have attempted to reflect the immense scope and variety of this program, which ranks among the largest and most complex in the history of our industry.

In the summer of 1983, a small group of us set about to determine what it would take for Digital to develop a true mainframe. We felt that a mainframe VAX would be a powerful addition to Digital's product family. The products that we have created took form, changed, and evolved over the months and years as technical challenges yielded to innovations, rigor, and discipline. An undertaking on this scale necessarily undergoes numerous transitions as new data emerges, assumptions are tested, and alternatives are eliminated. Technical breakthroughs built upon one another incrementally as we pressed the design closer to our goals. The primary objectives of very high system-level performance and world-class reliability drove the design process and the changes that emerged.

The planar logic packaging is illustrative of how changes and improvements built upon one another.
The reliability benefits of minimal connections precipitated a logic packaging design change from stacked modules in dual backplanes to the planar array. This change - an optimization for reliability - in the end actually helped performance and maintainability. Ultimately, though not envisioned at the time, the adoption of the planar array had a significant impact in that this structure enabled impingement air cooling and elimination of the bulky liquid system that was part of the initial design. The final design of the VAX 9000 system reflects, in myriad forms, this continual process of successive refinement toward shared goals.

Design changes notwithstanding, our primary strategy remained constant. The reader will note that, while we innovated aggressively in CPU structure, implementation technologies, and design methodologies, we preserved full compatibility with the VAX, Digital storage, and Digital networking and cluster architectures. We wanted Digital and our customers to be able to enjoy very high performance levels in a product that was compatible with prior investments. Therefore, we drew as much as possible from existing products and designs from many Digital development groups. As a result, the VAX 9000 system incorporates Digital's standard XMI bus and popular BI, CI, and NI system-level interconnects. The system runs VMS and ULTRIX operating systems, VAX layered products, and all of our customers' and independent software vendors' tools and applications. This capability proved especially rewarding when, in the final months of the project, our own VAX 9000 prototypes, running our unmodified CAD tools, accelerated the processing of the inevitable last-minute changes.

High-performance computation fundamentally requires two key ingredients: short machine cycle times and maximum computational work performed in each cycle.
The semiconductor and multichip unit papers describe how we minimized the VAX 9000 cycle time by use of fast circuits, high-density packaging, and high-speed interconnects. These papers are complemented by architecture descriptions through which the authors present the innovative features that minimize the number of cycles required to execute the VAX instruction set. These papers present the sophisticated pipelining techniques and vector processing capabilities incorporated in the VAX 9000 system.

Equal in importance to the computational capabilities of the product are the service and control features of the system. Papers covering the VAX 9000 service processor and the system's fault management capabilities provide the reader with insights into these important aspects of the product.

The development strategy for the VAX 9000 system was explicitly formulated to deal with enormous technical and project complexity. Complexity itself was the single most formidable challenge facing the team. It was apparent from the outset that such an ambitious product required the integration of a very large number of discrete design objects; each had to be conceived, created, documented, tested, and ultimately integrated and verified as part of the whole. The reader will see the diversity of these efforts and recognize the challenge of unifying a design from this breadth of technical advancement.

Central to our strategy was the creation of a unified design tool suite operating in a seamless, homogeneous VMS computing environment. The first few years of the project were devoted to construction of this environment in parallel with top-level design formulation. The recognition that rigorous design methods were crucial to our success was possibly one of the team's most powerful fundamental notions. Papers included in this journal illustrate some of the legacy of powerful CAD tools and structured design approaches created by the VAX 9000 team.
As we have seen for the product, the methodologies were not immune to change as the project progressed. Working with rapidly evolving technologies, design process experts continually adapted to evolving user needs. Concurrent design permeated every aspect of the project and dominated the way people worked together, with many aspects of the technology and product design converging and adapting as we learned from our own processes. When the manufacturing process needed some help, designs could be reprocessed with the new rules and rereleased to keep things moving ahead.

And move ahead they did! Today, the VAX 9000 system is installed at many customer sites where the systems are exceeding our original goals in both performance and dependability. It has been accepted by experienced, high-end computer users as a bona fide mainframe, a mainframe with the unique advantage of full integration with Digital's rich distributed processing architecture.

The VAX 9000 system was created by engineers working in many disciplines and collaborating worldwide to invent hardware, software, and processes that have significantly advanced the state of the art of computer design, manufacture, and service. The papers in this journal describe but a few representative examples of the creativity and determination of this large and dedicated team of professionals.

David B. Fite Jr.
Tryggve Fossum
Dwight Manley

Design Strategy for the VAX 9000 System

The VAX 9000 system is Digital's newest high-end processor in the VAX family. This paper describes the design strategy used to achieve high performance and shows how RISC concepts were applied to a CISC architecture. New opportunities for parallelism in VAX program execution were found by breaking the VAX instructions into simple tasks which could be pipelined efficiently. By using independent, dedicated pipeline stages, execution rates approach one instruction per cycle.
The task confronting the VAX 9000 design team was to develop a VAX system that outperformed any previous VAX system and that was competitive with similarly sized processors from other vendors. Although the VAX system is based on one of the world's most popular computer architectures, the VAX architecture's instruction complexities preclude efficient macroinstruction pipelining, such as that found in reduced instruction set computers (RISC). RISC processors can be built with low gate counts to handle simple, fixed-length instruction sets, load/store architectures, and delayed branching.

To compete with machines based on such architectures and still remain compatible with the VAX architecture, the design team chose to implement the VAX architecture on the VAX 9000 system by applying techniques that were similar to those used in RISC processors. We broke the VAX instructions into small, simple tasks, and designed dedicated hardware that was optimized for each task. The result is a network of specialized processors, each of which has its own data paths and state machines, that operate in parallel and execute VAX instructions quickly. The most common, simple instructions are executed at the rate of one per cycle.

System Overview

The VAX 9000 system is a tightly coupled multiprocessor, which runs the symmetric multiprocessing (SMP) version of the VMS operating system and can have up to four processors sharing a central main memory. Figure 1 shows a simplified block diagram of the system. The major system components include four CPUs, two memory controllers, two I/O controllers, and a service processor, which is connected through the system control unit (SCU). Through a cross-bar switch, the SCU provides high-speed, simultaneous transfers among the central processors, I/O devices, and memory banks.
System cache consistency is maintained with duplicate tag directories located in the SCU. As references are made to memory, the addresses are checked against the tag directories. If a cache hit occurs, the cache in question is requested to invalidate or write back to main memory. The SCU supplies a bandwidth that allows near linear performance improvement as new processors are added to the system. The memory is interleaved on cache block boundaries to provide bandwidth for multiple CPUs and vector processors.

Four XMI backplane buses provide high-bandwidth paths to I/O devices. Although the XMI is used as the system bus in VAX 6000 systems, the XMI is used exclusively for I/O in the VAX 9000 system. Several new adapters were designed to increase throughput and reduce latency for I/O transactions. These adapters include connections to the CI, the NI, the BI, and local disk controllers. Although high-performance I/O features, such as disk striping, solid-state disk, and load balancing have been added to all VAX systems, the VAX 9000 system benefits the most from these features because it has the I/O backplane bandwidth to take advantage of them. A block diagram of a single VAX 9000 CPU connected to the SCU and the major data paths between the two units is shown in Figure 2.[1]

[Figure 1: VAX 9000 System Diagram]

Technology Contributions to Improved Performance

The central processor cycle time has been reduced to 16 nanoseconds (ns) mainly by the use of fast emitter-coupled logic (ECL) semiconductors and fast self-timed random-access memories (RAMs) for registers and caches, and by decreasing the interconnect wire length between components. Motorola's Macrocell Array III (MCA3) technology provided both macrocell array and standard cell capabilities. The entire system is composed of 77 unique MCA3 options and 5 custom chip types.
A single MCA3 contains 838 cells (414 major, 224 input, and 200 output), which yield 10,000 equivalent gates, and 256 I/O pins. Maximum power dissipation is 30.0 watts, with unloaded gate propagation delays of 120 picoseconds (ps). Performance-critical operations, such as multiplication, division, integer and vector register accesses, and system clocking, were further aided by employing custom chips.[2]

Caches for instruction stream and memory data, scratch pad registers, and control stores all require high-speed local storage. Two versions of a proprietary self-timed RAM were designed for these specific applications. A 4-kilobit (Kb) self-timed RAM, at 5.5 ns, and a 16Kb self-timed RAM, at 11.5 ns, provide internal input and output latches and write pulse generation circuitry. Multiple access modes allow highly pipelined operations to take advantage of shorter access times.

Each new semiconductor generation reduces cycle time, which increases the relative importance of interconnect delay. High-density signal carriers (HDSC), tape-automated bonding, and a single planar module all reduce the interconnect delay between active components in the VAX 9000 system. Strict impedance control is maintained throughout the system. Clock skew is minimized by employing fixed-length, differential transmission and dedicated routing layers.

CAD Contributions to Improved Performance

Hundreds of computer-aided design (CAD) tools were used during the design and construction of the VAX 9000 system. However, none of these tools was more important in improving performance than the physical layout and timing analysis tools.

Once the design team had placed large functional sections, placement tools refined individual macrocell selection and pin placements. Over 33,000 pins were selected to minimize overall wire length and maximize critical interconnections.

Routing presented several challenges.
All levels of interconnect included critical signals, differential pairs, and fixed-length requirements. The HDSC contains large cutouts that enable die attachment and allow cooling through the back panel. These large routing restrictions and special routing characteristics could not be handled by existing CAD tools. Therefore, we developed Chameleon, a general-purpose router. With Chameleon, crosstalk is minimized, and crossing counts are maintained and used to increase signal integrity, which improves performance.

To model the timing relationships within the system, we used sophisticated CAD tools to generate an accurate representation of the VAX 9000 system. Detailed timing models of each macrocell device were created using the SPICE simulator program.[5] Chameleon and signal integrity tools provided delay values for each signal within the MCA3, HDSC, and planar modules. CPUDLY, using the AUTODLY timing tool, tied the various pieces together and gave the design engineers a powerful view of the timing domain.

Instruction Processing

VAX systems exist in a variety of environments and run thousands of applications. With any new, high-performance VAX system, it is important to increase the speed of all applications and to continue to provide general-purpose computer power. Given the size of the installed VAX base and the nature of the applications, performance gains should not require code modifications. Digital has gathered substantial information on how VAX processors are used. This data formed the basis for design decisions and trade-offs we made in the development of the VAX 9000 system.
Simple Instructions

In many VAX programs, only a few opcodes are responsible for a large percentage of the instructions issued. Most of these opcodes are simple and limited to a single arithmetic or logical operation. Often, one of the operands is in memory. A typical example is

    ADDL3 (R0),R1,R2

Because of the high frequency of these instructions, speeding them up is a top priority. Most of the high performance achieved on RISC processors derives from pipelining these instructions. In a complex instruction set computer (CISC), such as a VAX system, pipelining macroinstructions is more complex. Therefore, previous VAX implementations have pipelined operations at the microinstruction level.

Processing simple instructions in a VAX system involves obtaining and decoding the instruction, fetching source operands, performing an operation, and storing the result. The most important difference between the way a VAX processor and a RISC processor process simple instructions is how the variable-length instructions and memory specifiers are handled. VAX operands may reside in general-purpose registers (similar to RISC operands), in memory, or may be embedded in the instruction stream. The VAX architecture provides a rich selection of memory operand specifiers, which often require computations to create the address. In a RISC processor, only load and store instructions access main memory.

[Figure 2: VAX 9000 CPU/Vector Block Diagram]

The instruction preprocessing stage (I-box) decodes instructions and fetches operands in the VAX 9000 system. In the execution stage (E-box), simple VAX instructions resemble RISC instructions. A simple opcode describes the operation, a single register file provides source operands, and a destination queue supplies a result descriptor. The I-box operates in parallel with the E-box, which functions as a RISC processor by executing one instruction each cycle. Execution occurs without the need to identify the operand's source or addressing complexity. Figure 3 illustrates how simple instructions flow through the VAX 9000 pipeline. Although all VAX implementations perform these tasks, the VAX 9000 implementation uses separate, independent hardware units to overlap the work because concurrent operation is a prerequisite for single-cycle instruction execution.

Instruction Cache

We used an instruction cache in the I-box to decrease instruction stream fetch latency and reduce the bandwidth requirements on the main cache. Choosing a virtually addressed cache further reduced latency and simplified the design by removing the need for duplicate translation buffers. The virtual instruction cache is an 8-kilobyte (KB) cache with a quadword line size, 32-byte blocks, and a single-cycle access time. Line valid bits are maintained to allow variable size fills from the main data cache.
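The cache geometry just described can be worked through in a few lines. The Python sketch below is illustrative only: the text gives the capacity, block size, and line size, but the direct-mapped organization, the tag/index split, and every name in the code are assumptions made for the example, not details taken from the VAX 9000 design.

```python
# Geometry from the text: 8 KB cache, 32-byte blocks, quadword (8-byte) lines.
CACHE_BYTES = 8 * 1024
BLOCK_BYTES = 32
LINE_BYTES = 8

NUM_BLOCKS = CACHE_BYTES // BLOCK_BYTES        # number of cache blocks
LINES_PER_BLOCK = BLOCK_BYTES // LINE_BYTES    # quadword lines in each block


class VirtualInstructionCache:
    """Toy model of a virtually addressed cache with per-line valid bits.

    ASSUMPTION: direct-mapped placement. The source does not state the
    associativity of the virtual instruction cache.
    """

    def __init__(self):
        self.tags = [None] * NUM_BLOCKS
        self.line_valid = [[False] * LINES_PER_BLOCK for _ in range(NUM_BLOCKS)]

    @staticmethod
    def _split(vaddr):
        """Split a virtual address into (tag, block index, line index)."""
        line = (vaddr // LINE_BYTES) % LINES_PER_BLOCK
        block = (vaddr // BLOCK_BYTES) % NUM_BLOCKS
        tag = vaddr // CACHE_BYTES
        return tag, block, line

    def lookup(self, vaddr):
        """Hit only if the block tag matches and the addressed line is valid."""
        tag, block, line = self._split(vaddr)
        return self.tags[block] == tag and self.line_valid[block][line]

    def fill(self, vaddr, num_lines=1):
        """Variable-size fill: mark num_lines quadwords valid, starting at the
        line containing vaddr and never running past the end of the block."""
        tag, block, line = self._split(vaddr)
        if self.tags[block] != tag:
            # A different block is being brought in: drop its old lines.
            self.tags[block] = tag
            self.line_valid[block] = [False] * LINES_PER_BLOCK
        for i in range(line, min(line + num_lines, LINES_PER_BLOCK)):
            self.line_valid[block][i] = True
```

With these parameters the cache holds 256 blocks of four quadword lines each. A two-quadword fill starting at the second line of a block leaves the first and last lines of that block invalid, which is the kind of partial fill that per-line valid bits make possible.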
Because the average VAX code block size is 16 to 20 bytes, the block size of the virtual instruction cache provides a good balance between the instruction decode stage and the main cache.

Context switches, translation buffer changes, and instruction stream modifications all require that the virtual instruction cache be invalidated. Two complete sets of block valid bits reduce cache sweeps to a single cycle, if consecutive sweeps do not occur within 256 cycles of each other. Block size and frequent sweeping reduce the virtual instruction cache's hit rate to approximately 96 percent, but by filling through the main cache, the miss penalty is minimized.

Instruction Decode

Because the majority of instructions executed require only a single cycle to execute, the instruction decode's task of keeping ahead of the E-box is not simple. Most instructions must be decoded in a single cycle to keep the VAX 9000 system's ticks-per-instruction (tpi) low. For example, VAX instructions may contain up to six operand specifiers. With 59 different specifier addressing modes, instruction lengths can vary from a single byte to more than 50 bytes. However, the overall average VAX instruction length is 3.8 bytes, and 98 percent of instructions require 8 or fewer bytes. Furthermore, 96 percent of VAX instructions executed use 3 or fewer specifiers.

In each machine cycle, a 9-byte instruction buffer is presented to the decode stage (XBAR). The instruction buffer contains instruction stream data prefetched from the virtual instruction cache. Instruction decoding consists of generating an initial microaddress, determining the number of specifiers for the instruction, including each specifier access mode and data type, and forwarding the appropriate specifier data to the operand processing stages. The XBAR can handle up to three specifiers. Instructions that contain more than three specifiers require additional decode cycles. Since general-purpose register specifiers occur approximately 41 percent of the time, three register specifiers can be processed concurrently. Short literals comprise nearly 16 percent of the specifiers. However, the XBAR can only decode a single short literal per cycle. The remaining specifiers must all be processed by the operand processing unit, which decodes a single complex specifier per cycle. Unlike preceding processors, the XBAR handles multiple specifiers in any order. Table 1 shows the number of decode cycles required for several VAX processors.

[Table 1: Decode Cycles Required. Columns: Instruction, VAX-11/780, VAX 8650, VAX 9000. Rows: ADDL3 R3,R5,R7; MULF3 S^#48,R4,@(R2)+[R3]; AOBLEQ S^#63,R10,10$]

[Figure 3: The VAX 9000 Instruction Pipeline. Cycle-by-cycle chart (cycles 1 to 18) of the operations PC generation, VIC access, instruction decode, specifier processing, translate address, data cache access, multiply unit execution, floating unit execution, integer unit execution, retire, register write, and data cache access, for the loop:
    LOOP: MULF3 (R0),#2.5,R1
          MULF3 4(R0),#3.5,R2
          MULF3 8(R0),#4.5,R3
          ADDF3 R1,R2,R4
          ADDL2 #^XC,R0
          ADDF3 R3,R4,(R5)+
          SOBGEQ R6,LOOP]

Operand Prefetching

Because most simple instructions are decoded and executed in a single cycle by various pipeline stages, VAX instruction operands also must be handled in a single cycle. Multiple, specialized operand units increase operand processing throughput. From one to three register operands may be forwarded to the E-box by one register unit per cycle. A dedicated short literal unit expands all VAX data formats. The operand processing unit performs complex address calculations and requests memory operand data from the cache unit (M-box). Both the operand processing and short literal units can perform multiple-cycle operations.

Load/Store Architecture

Load/store architectures separate memory accesses from computation. Loads can be scheduled to place arriving memory data at a functional unit just as an operation begins. To achieve this effect with VAX instructions, memory specifiers are treated as load/store instructions. VAX memory specifiers describe the effective addresses of memory operands. VAX memory specifiers do not contain the source and destination registers that are specified in RISC load/store instructions. Rather, the VAX 9000 system assigns temporary register file locations to buffer memory data. By processing specifiers early in the pipeline, data can be scheduled to arrive at the appropriate time.

Memory specifiers act as independent instructions executed in the operand processing unit. This unit creates the operand's effective address and forwards it to the M-box. For loads, the actual memory data is returned to the E-box register file. The translated physical address is saved in a queue of write addresses for store/destination specifiers. When execution results arrive from the E-box, the previously saved address is used to write the data into the cache.

Conflict Detection and Resolution

Macropipelining in the VAX 9000 system relies on autonomous units operating in parallel. Each independent unit is optimized for an individual task. However, macropipelining does require that mechanisms be added to resolve data dependencies among instruction processing units. Data conflicts occur when an instruction's results are required by an earlier pipeline stage. An addressing data conflict appears in the following example:
An addressing data conilict appears in the following example: ber or a flag which indicates that the resu lt should be written to memory. The instruction issue unit removes source pointers from the source queue. These pointers are used to address either the general-purpose registers or source list for the actual source data. Destination pointers from t he destination queue determ ine where resulls should be wrirren. Register conflicts can be detected by comparing the source pointers needed to issue an instruction with all issued desti nation pointers in the destination queue. For exam ple, in Figure 4, the M U L L 3 's RO source queue entry would match the A DDL3 's RO destination queue entry. A write to the general-purpose registers by the E-box removes the destination queue entry, and the instruction issue can resume. SRCQ Rl R2 MOVL R O , R 1 MOVB T A B L E ( R 1 ) , R 2 Any dedicated aclclress calculating hardware must wait for the MOVL instruction results before per forming the l'viOVB instruction's effective address computation. A memory conflict is another form of data dependency. In the following example, #0 Register Conflicts The simplest hardware mecha nism employed in the VAX 9000 system is the use of pointers to reference data. The operand processing unit oversees a 16-entry source queue, an H-entry destination queue, and a 16-entry source list. A sin gle pointer is inserted into the source queue for each source specifier. The pointer represents either a register number, in the case of general-purpose register operands, or a tag that indicates an entry in the source list where the operand data is located . A pointer is added to the destination queue for e:.�ch destination. This pointer represents a register num- 18 I- ADDL3 R 1 ,R2,RO MULL3 (R3),RO,(R4) , (R1 ) ( R2 ) , R3 a prefetch unit could read the second instruction's source operand while the E-box writes the first instruction's results, if the values of registers R I and R2 are different . 
However, when the registers con tain identical values, the read must be delayed until the write occurs. The VA X 9000 system uses several differem mechanisms w detect and resolve data dependencies. Passing pointers, scoreboard masks within the 1-box, the write queue in the M-box, and architectural restrictions are all used to handle vari ous conflicts. DSTQ RO MEM RO MOVB R 0 MOVB __. SLIST DATA Figure 4 Register Conflict Detection To resolve addressing data contlicts, the I-box maintains a read/write register scoreboard . Two register masks a re c reated for each instruction decoded . The first register mask denotes the general-purpose registers that t he E-box will read for the instruction, and the second register mask specifies the general-purpose register writes. Each bit in these register masks refers to a single VA X general-purpose register. Specifiers that are being processed in the operand processing unit are checked against up to six previous instruction masks. From t he first example above, the specifier [TABLE(R I )] requires that the operand processing unit read R 1. If the R l bit is asserted in any preced ing instruction's scoreboard write masks, this effec tive address calculation must be deferred . The VAX architecture presents a unique address ing conflict p roblem because some speci fiers, such as -(Rn) and (Rn)+, modify general-purpose registers. In the following example, Addressing Conflicts SUBL2 R O , R 1 ADDL2 C RO ) . , R2 Vol .! No. -1 ht/1 1'.)!)0 Digital Technicaljournal Design Strategyfor the VAX 9000 System the (RO)+ specifier modifies the contents of RO. Therefore, the operand p rocessing u n i t cannot update the general-purpose register without affect ing the prior instruction. The read masks are used ro detect this type of confl ict. A l l specifiers that mod ify general-purpose registers must check the scoreboard read masks before proceeding with the instruction. 
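The scoreboard checks described above can be sketched in software as follows. This is a minimal illustration under stated assumptions, not the I-box logic itself: the class and method names are invented for the example, the six-entry limit is taken from the "up to six previous instruction masks" in the text, and requiring a modifying specifier to pass the write-mask check as well as the read-mask check is an assumption (the specifier must read the register to form its address before it updates it).

```python
MAX_TRACKED = 6  # the text checks specifiers against up to six earlier instructions


def reg_mask(*regs):
    """Build a bit mask with one bit per general-purpose register number."""
    mask = 0
    for r in regs:
        mask |= 1 << r
    return mask


class Scoreboard:
    """Hypothetical model of a read/write register scoreboard."""

    def __init__(self):
        # (read_mask, write_mask) per decoded, not yet completed instruction,
        # oldest first.
        self.entries = []

    def decode(self, read_mask, write_mask):
        """Record the masks of a newly decoded instruction."""
        if len(self.entries) >= MAX_TRACKED:
            raise RuntimeError("scoreboard full: decode would have to stall")
        self.entries.append((read_mask, write_mask))

    def complete(self):
        """The oldest instruction finished executing: drop its masks."""
        self.entries.pop(0)

    def can_read(self, reg):
        """An address calculation may read reg only if no earlier
        instruction still has reg set in its write mask."""
        bit = 1 << reg
        return not any(write & bit for _, write in self.entries)

    def can_modify(self, reg):
        """An autoincrement or autodecrement specifier may update reg only
        if no earlier instruction still has reg set in its read mask.
        ASSUMPTION: it must also pass can_read, since it reads reg first."""
        bit = 1 << reg
        return self.can_read(reg) and not any(read & bit for read, _ in self.entries)
```

For the MOVL R0,R1 / MOVB TABLE(R1),R2 example, recording MOVL with read mask {R0} and write mask {R1} makes `can_read(1)` false until MOVL completes. For SUBL2 R0,R1 / ADDL2 (R0)+,R2, recording SUBL2 with read mask {R0, R1} makes `can_modify(0)` false until SUBL2 completes, even though R0 itself may still be read.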
When such a conflict occurs, the general-purpose register modification stalls. When an instruction completes execution, the instruction's read/write mask is removed from the scoreboard. In all addressing conflicts, specifier processing continues once the blocking mask is removed.

Memory Conflicts  The write queue is used to resolve memory conflicts. Physical addresses, received from the translation buffer, are inserted into an eight-entry FIFO. These addresses are later paired with the proper write data from the E-box and written into the M-box. To avoid prefetching stale data, all memory addresses for source memory operands are translated and compared with the addresses in the write queue. When no address conflict occurs, the data from memory is forwarded to the source list. Operand requests that conflict with a pending write address are stalled until the conflict is resolved. The conflict is resolved when the appropriate write data is received. The conflicting address is then removed from the write queue.

Miscellaneous Conflicts  The VAX architecture includes instructions with operands that either are not known when the instruction is decoded (e.g., INSQUE, MTPR), or modify large portions of memory (e.g., MOVC5). To avoid conflicts from these instructions, the I-box suspends processing memory specifiers until the instruction execution is completed. Self-modifying code presents another form of conflict, which is solved by an REI instruction that notifies the hardware of this condition.

Branch Instructions

Branch instructions have a substantial influence on the overall performance of a VAX processor. On average, a VAX processor executes 3.9 instructions, including the branch, before a branch starts a new instruction sequence. Instructions that modify the program counter represent nearly 40 percent of the total instructions executed. The VAX 9000 system uses a 1024-entry branch cache and a two-tiered prediction scheme to increase the average code block size and reduce the branch-taken latency.

Unlike its predecessors, the VAX 9000 system commits all its resources to a single branch path. The prediction hardware selects the path of execution to resolve memory conflicts for those branch instructions that are decoded before results are available. This path selection is based on prior history, if the branch hits in the branch cache. If the branch does not hit in the branch cache, the path is predicted statically, based on the instruction's opcode. When the branch executes, the prediction is compared to the actual results. The pipeline is flushed back to the correct code path if the branch prediction was incorrect.

The entries in the branch cache store the branch results of the previous execution of the branch and the target address, if the branch was taken. Because the branch cache is a one-way associative cache that can store only 1024 entries, the results have an average hit rate of approximately 80 percent. However, correct predictions occur 85 percent of the time from the cache, as opposed to an average rate of 56 percent when the predictions are based solely on opcode. Loop branches are always predicted as taken, which increases the overall correct prediction rate to close to 89 percent. By caching branch targets, the target calculation may be avoided, and a one-cycle branch-taken latency is achieved. The branch cache can store a sufficient amount of branch context to eliminate the need to sweep the cache.

The I-box can process instructions with up to two conditional branches outstanding. Unconditional branches (e.g., BSBW, BRB) are processed as ordinary instructions by simply changing the instruction flow. To reduce the penalty for a bad prediction, which results in a four-cycle penalty, operand specifiers that modify general-purpose registers are not processed under a branch prediction and cause the operand processing unit to stall. Also, branch instruction execution is overlapped with the previous instruction to provide the actual branch results earlier.

Compute-intensive Instructions

Compute-intensive instructions require multiple execution stage cycles. Common examples of these instructions are multiplication, division, and floating point operations. All VAX implementations employ dedicated logic for compute-intensive instructions that occur frequently. Less frequently used instructions depend on microcode-controlled arithmetic and logical data paths. The VAX 9000 system contains four independent execution processors. The integer, floating point, multiply, and divide units execute the VAX instruction set. The I-box preprocesses instructions, which allows instruction execution to overlap in these units. In each cycle, a new instruction can be initiated in the appropriate unit prior to the completion of previous instructions. The floating point and multiply units are pipelined and can accept one instruction each cycle. The integer unit is pipelined for simple instructions. However, complex instructions must use microcode control to perform multicycle operations.

Pipelined instructions are issued in order and proceed through the data path without further microcode control. Upon completion, instruction results are retired in the same instruction order. The instructions must be processed in order because the result of one operation is often needed in a subsequent operation. Therefore, the pipelines must be short and contain data bypasses to make results available quickly. The multiply, float, and divide units' internal data paths are 64 bits wide. To understand how the pipelined and overlapped operations apply to the following operation,

    y(i) = y(i) + c * x(i)

consider the program:

    LOOP: MULG3 R6,(R0)+,R4
          MULG3 R6,(R0)+,R2
          ADDG2 R4,(R1)+
          ADDG2 R2,(R1)+

The two MULG3/ADDG2 instruction pairs prevent a pipeline stall that could occur because of data dependencies. The instructions further reduce the loop overhead, which is already fairly small because the loop control instruction was predicted correctly. Instructions and source operands are prefetched. The multiply and add units accept the instructions as they become available. The memory references are made as the operand processing unit processes memory specifiers. The majority of specifier processing is performed independently of the instruction execution.

Memory-intensive Instructions

Some VAX instruction classes are primarily memory operations that require only minor computation. Typical examples of these instructions are character string, decimal, and privileged operating system. Pipelined execution offers little advantage to memory-intensive instructions because the number of memory references is not reduced as the number of cycles required for execution is reduced by new implementations. Because memory bandwidth is critical, the VAX 9000 system provides features to benefit these instructions.

For example, the virtual instruction cache services most instruction stream references, which frees the main cache to service prefetched operand references. Both the virtual instruction cache and the main cache have 64-bit data paths, important for character string operations and extended precision arithmetic. The caches are fully pipelined and allow one read per cycle. The main cache block size is 64 bytes, exploiting spatial locality. When cache references do miss, data is wrapped and the most critical data is returned first. A write back, write allocation algorithm further reduces main memory and cache bandwidth requirements and reduces latency.

The VAX system is a virtual memory architecture. Virtual addresses need to be translated to physical addresses through page tables in memory. A translation buffer caches the most recently used page table entries. VAX systems, such as the VAX-11/780 system, process translation buffer misses in microcode, which can be time-consuming. However, the VAX 9000 system uses a memory management processor to process translation buffer misses as part of instruction preprocessing. This operation is performed early in the pipeline and is faster than microcode.

The CALL and RETURN instructions push and pop registers on the stack, and these instructions can be memory-bound. The VAX 9000 system contains both the control logic and the bandwidth to process these registers at a rate of one per cycle.

Unconventional Instructions

Special, dedicated hardware was added to the VAX 9000 system to process those VAX instructions that did not fit into the categories listed above. The additional hardware operates within the pipeline architecture and cycle time, and the cost of adding the hardware was minimal.

In the following example,

    MOVL R0,-(SP)   <---------->   PUSHL R0

the MOVL and PUSHL instructions perform identical operations, but the PUSHL instruction does not explicitly specify a destination address. On previous VAX systems, the instruction prefetching would stall until the current instruction execution was completed. However, the VAX 9000 modifies such instructions during the decode stage by adding the implied specifiers. The benefits of this enhancement are more evident in the following instructions.
BSBW 1 0 $ < - - - - - - - - - - > MOVAL R e l u r n _ P C , - < S P l < - - - - - - - - - - > JMP @ ( S P l + RSB Sim i larly, instructions such as LOCC and CMPC3 impl ici t l y reference t h e general -purpose registers. The instruction decode s tage creates a read/w rite mask with these references, which a l lows instruc tion prefetching to cont inue. To aid handling i nstructions l i ke PUSH R a n d C A L L , the in reger execution u n i t conrains special bit m a s k m a n i p u lation h :trd ware, w hi c h opti m i zes general-purpose register saves and restores. The VAX instruction set contains variable-length, bit-field instructions that handle non-byte data. These instructions can reference memory within a '512 megabyte (MB) range. The field referenced is within the first 8 hytes of the base add ress more than 9'5 percent of the time. Therefore, to a llow instruction prefctching to con tinue, the operand processing unit assumes that the fiel d is within the initial quadword and requests that data. I f, during Logical Integration The VAX 9000 vector processor connects to the scalar CPU as an additional fu nctional execution unit. Vector i n s t ructions are processed , and operands are stored, in queues, the same as are scalar instructions. As i nstructions are issued , a con trol word is sent with instruction operands to the vector processor. The processor contains vector registers and arithmetic units. Add resses for load , store, gather, and scatter operations are also gener ated by the vector processor. Vector data is stored in the main cache, and both the scalar and vector pro cessors have fast, shared access to that dat::t. Physical Integration The VAX 9000 scalar and vector processors reside on a single planar board. Three mu l tichip unit slots are reserved for the optional vector processor, w h ich is fie ld- instal l able. 
The integration of t he vec tor processor d i rectly with the scalar processor keeps critical i nt erconnects short and reduces vec tor instruction overhead . execu tion, the field destination act ua l ly resides out Error Handling side the prefetched quadword, the correct data is Rel iabi lity, ava i l a bility, and integrity are critical fac fetched and the pipeline is flushed to avoid poten tial memory con tlicrs. Integrating Vector Processing The VAX 9000 tors in a high-performance computer system . These factors are affected by the quality of t he physical project team was instrumental in design (i .e. , worst-case design), effective coo l i ng, redundant power supplies, and quality controls during manufacture. S t i l l , fai l u res are possible, and in regrating vector operations and data types inro the VAX the VA X architecture. For many scientific applica errors. tions, the use of vectors im proves performance in three ways: • 9000 design had to dea l effectively with Error handl ing in the VA X 9000 system has two main goal s : Vector i nstructions specify many operations in • a single opcode, which e l i m i nates instruction stream decode as a processing hottleneck. • Vecwr registers increase available local storage. • Vector registers support h ig h pea k perfor mance through h igh bandwidth and short access l atency. Minim ize system service disruption from ind i vidual failures • Maximize the fai l u re information col lected for use in preventive and corrective maintenance A l arge percentage of hardware fa i l u res are inter m ittent , and many solid hardware fai l u res start as intermittent. The VAX The VA X vector archi tecture implements a load/ store architecture, which permits the hardware to deal w i t h l arge p ieces of m e mory in a u n i fo r m manner and increases t h e use o f para l lelis m . 9000 system was designed to recover from these fa i l u res and to use the fai lure data to predict (and prevent) future problems. 
To gather information effectively, VA X 9000 stor age elements ( i . e . , latches, tli p tlops. and RAM cells) We added the vector instructions and data types are v isible to the service rrocessor unit through a to the VA X architecture in an i n tegrated fash ion . serial d i agnostic bus. Most state i n formation t h a t Scalar and vector instructions are mixed throughout is relevant to isolate t h e fai l ing component i s avail the pipdi nes. Systems that do not incl ude vector able for error analysis programs that can be run at processors emulate vector instructions with soft a convenient time. The result of t h is processing is ware. a tec h n iq u e espec i a l l y usefu l for p rogram development . . >< t he n used to isolate the fai l i ng compone n ts for Di�ital Tecbnicaljournal llul. .! Nu. .j Fa/1 /'J'.IO quick repair. 21 VAX 9000 Series To access the storage elements through the visi bility chain, the system clocks must be disabled, which disrupts the system operation for a period of time. The error may also have affected the exe cution of the instructions in the pipeline. Error handling minimizes these disruptions by making them invisible ro the users almost a l l the time. The macroinstruction is the unit of execution in a program that is v isible to the user. Between instructions, the program state is clearly defined in terms of memory contents and register values. I nterrupts and exceptions are handled between instructions to save this state in an orderly fashion. It is important to handle errors the same way. Two problems arose i n trying to provide the same method of error handling. First, instructions go th rough many stages in a pipelined computer, and several instructions will be in progress at the same time. It is d i ffic u l t to identify a begin n i ng and end for each inMruction. Second, even when boundaries are established, errors can occur at any time and the errors do nor automatically l ine up with instruction boundaries. 
To solve this, we made the E-box the point of synchronization between error handling and instruction execution. In the instruction execution model, the E-box accepts operands, then computes and delivers results for storage. If an error occurs that directly affects one of these steps, the error is synchronous to the execution of that instruction. Asynchronous errors do not directly affect any of these steps and are treated as interrupts, i.e., processed after the E-box completes an instruction but before it starts another instruction.

A synchronous error causes a trap to occur in the E-box when the E-box requests data from the subsystem with the error. Since such data can be unavailable as a result of virtual access problems, the E-box is ready to deal with exceptions at that time, and errors can use the same pipelined mechanism.

We do not differentiate between those synchronous errors that affect computation in the E-box and those that do not. Instead, if the program visible state of the machine has not been modified, the instruction is backed up to the beginning and restarted. Performing this task is not a problem, since the state is normally not changed until the result is stored at the end of the instruction. Errors occurring in early pipeline stages are easily recoverable. In a few cases, memory and registers could have been modified early and, as a result, be affected by the error. Status flags indicate if this has happened.

By getting to an instruction boundary, the clocks can be stopped in an orderly fashion, and the state can be read out, including temporary data to be used for failure analysis. The machine can be reset to start processing at the instruction boundary once the clocks are started again.

While the clock is stopped, the CPU cannot interact with other subsystems or I/O processors.
To keep these functions from being blocked and possibly timing out, we only stop the clock to the CPU in error, not all the clocks in the system. We also sweep the cache of written data before the clock is stopped, and I/O interrupts are directed to other CPUs in a symmetric multiprocessing system.

Performance Modeling

When multiple features are added to a CPU design to individually enhance performance, some of those features can interact negatively with each other to decrease performance. Therefore, we designed a performance model to help us evaluate the performance of the design and make trade-offs where necessary. Although instructions were not executed on the model, it is an accurate cycle-by-cycle model of the system for most instruction operations. Equally important, the model was written at a high level, which made it easy to modify and use to experiment with different features before they were added to the design.

Cycle Time

A perennial CPU design issue is the trade-off between cycle time and cycles per instruction. In a VAX system, the cycle time is often limited by the RAM speed in the control store and cache. We modeled a machine at 8 ns and one at 16 ns for the VAX 9000 system. At 8 ns, the pipelines became longer. Although the peak throughput almost doubled, the model showed that the net performance gain did not offset the risks associated with the shorter cycle time.

I-stream Synchronization

The VAX architecture requires that changes to the instruction stream be synchronized with an REI instruction. This synchronization makes it easier to implement an instruction cache that is separate from the main cache. To synchronize, either all memory writes can be watched or the I-cache can be cleared on every REI. The first alternative entails high hardware costs, and the second can affect performance. However, the model showed us that the performance impact would be minimal if the I-cache was refilled from the main cache rather than from main memory because the critical parameters were the main cache bandwidth and the I-cache invalidation time, rather than the refill latency.

Branch Prediction

The branch prediction scheme used in the VAX 9000 system was analyzed in great detail. We investigated the use of multiple history bits to improve the effectiveness of branch prediction. In all cases, the use of extra bits provided less than a 1 percent improvement in system performance. Furthermore, no multiple bit scheme could be implemented without increasing cycle time because multiple history bit branch prediction schemes update status each time a branch is encountered. Therefore, we chose to use a single bit technique in the VAX 9000 design. Unlike multiple bit schemes that read and write history bits for each branch instruction encountered, the single bit technique updates the history bit only when the prediction is wrong. The single-bit scheme is both faster and simpler.

We also used the performance model as a verification tool. The model provided us with early warnings when a feature did not function in the model, or when the cycle count differed from the count in the gate-level simulation. For example, from the model, we became aware of problems in the design of how conflicts between instructions in specifier processing were handled.

Periodically, we compared the performance model to the logical model. Both models were subjected to the same instruction sequences. Deviations of more than ±5.0 percent were investigated. Some design bugs were found that did not affect the results of the program but which did keep performance features from working properly. The average deviation was on the order of ±1.0 percent.

Performance tests are among the first programs run on a functional prototype. The VAX 9000 system performed almost as expected.
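The single-bit technique described under Branch Prediction can be illustrated with a short sketch. The Python below is a minimal model, not the VAX 9000 logic: the class name, the modulo indexing, the default not-taken prediction, and the omission of tags and of the second prediction tier are all assumptions made for illustration. Only the 1024-entry size and the rule that the history bit is written solely on a misprediction come from the text.

```python
class SingleBitBranchCache:
    """Illustrative single-bit branch history cache (hypothetical model)."""

    def __init__(self, entries=1024, default_taken=False):
        self.entries = entries              # 1024 entries, as stated in the text
        self.default_taken = default_taken  # assumption: unseen branches predict not taken
        self.taken_bit = {}                 # index -> last recorded direction
        self.history_writes = 0             # counts updates to the history bits

    def _index(self, pc):
        # Assumption: simple modulo indexing; the real indexing is not described.
        return pc % self.entries

    def predict(self, pc):
        return self.taken_bit.get(self._index(pc), self.default_taken)

    def resolve(self, pc, actually_taken):
        """Compare the prediction with the outcome; write history only when wrong."""
        correct = self.predict(pc) == actually_taken
        if not correct:
            self.taken_bit[self._index(pc)] = actually_taken
            self.history_writes += 1
        return correct


def run_loop_branch(cache, pc, iterations):
    """Model a loop-closing branch: taken on every pass except the last."""
    correct = 0
    for i in range(iterations):
        taken = i < iterations - 1
        if cache.resolve(pc, taken):
            correct += 1
    return correct


if __name__ == "__main__":
    cache = SingleBitBranchCache()
    hits = run_loop_branch(cache, 0x2000, 10)
    print(hits, cache.history_writes)
```

For a loop-closing branch executed ten times, this model mispredicts only the first and last executions, so the history bit is written twice rather than once per branch, which is the update saving the text attributes to the single-bit scheme.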
Table 2 compares the actual performance of a VAX 9000 system to its predicted performance for a small sample of modeled programs. The accuracy of the predictions highlights the increasing importance of models in the modern engineering process.

Cache Parameters

The main data cache was accurately modeled. The VAX 9000 system uses a first-in first-out (FIFO) block replacement scheme. The performance model predicted that a true least recently used replacement policy would provide an insignificant improvement in performance over the FIFO method. Also, a true least recently used policy requires that status be read and written for each cache access. In contrast, the FIFO replacement policy updates status only when a cache miss has occurred. Further, the update can be done in parallel with the writing of data into the cache block.

Although the 128-byte cache block provided a better cache hit rate, we chose the 64-byte block because it produced better system level performance. We chose two-set associativity because the model clearly indicated that performance would degrade with a direct-mapped scheme. The model also predicted that a four-way set associative cache would not improve performance enough to justify the extra hardware, design complexity, and cycle time penalty.

The data bypass mechanism, the write queue, and the parallel translation buffer fix-up mechanisms were implemented after the performance model indicated significant performance gains would be achieved from these features.

Table 2   Performance Measurements of a VAX 9000 System

Program Name    Predicted (VUPs*)    Measured (VUPs*)
HANOI           28.54                25.53
FFT45           36.87                37.85
GAUSS           32.72                32.57
WHETS           27.78                27.17
WHETD           34.48                34.89

* Performance measured in VAX units of performance (VUP), where the performance of the VAX-11/780 system = 1.0 VUP.

Vector Performance

Vector processing was modeled using graphical descriptions of the pipeline.
The graphical descriptions were essentially critical path method scheduling charts. This approach is reasonable because vector processing makes regular demands on system resources. In fact, the regularity of resource demand patterns was a major reason that vector processing techniques were developed. By using the pipeline schedules, we realized that data should be prefetched to ensure good vector performance.

Performance Measurement

Table 3 compares the VAX 9000 scalar and vector processors' performance to other members of the VAX family of processors.

Table 3   Performance of the VAX 9000 Scalar and Vector Processors

Program Name    VAX 8550 System    VAX 9000 Scalar Processor    VAX 9000 Vector Processor
                (VUPs*)            (VUPs*)                      (VUPs*)
A3D             6.55               65.54                        77.45
DYFESM          5.12               31.88                        40.49
EMIT            5.86               41.65                        79.86
CFFT2D          5.52               25.76                        64.18
BMK8A1          5.45               30.65                        83.84
MXM             5.93               40.81                        269.32

* Performance measured in VAX units of performance (VUP), where the performance of the VAX-11/780 system = 1.0 VUP.

The variations in these performance numbers indicate that significant performance improvements can be achieved by using applications that take advantage of machine resources. The numbers also highlight opportunities. By modifying applications to capitalize on machine features, large performance gains may be realized. Performance gains of 100 to 200 percent are often realized and may substantially extend the lives of older programs.

Vector applications tend to fall into three categories. The first category generally does not contain much parallel content. This category is represented by A3D and DYFESM in Table 3. Vectorizing such programs improves performance by a modest 0 to 50 percent. Programs EMIT and CFFT2D in Table 3 represent the second category, which are applications of moderate parallel content. Applications in this category realize a 50 to 150 percent performance gain when vectorized. Applications in the third category, highest parallel content, demonstrate performance improvements of more than 150 percent when vectorized. Programs BMK8A1 and MXM in Table 3 are examples of this class of application.

Often, modest code changes can realize dramatic performance improvements. By simply redefining array dimensions or loop specifications, an application can move from the first category to the third category.

Acknowledgments

Many people contributed to reaching the VAX 9000 performance goals. The authors would especially like to thank David Orbits, whose advanced development work on high-performance VAX designs became the basis for the performance model; and Bill Grundmann, Rick Hetherington, John Murray, Bill Smith, and David Webb, who comprised, with the authors, the original VAX 9000 architecture team.

References

1. J. Murray et al., "VAX Instructions That Illustrate the Architectural Features of the VAX 9000 CPU," Digital Technical Journal, vol. 2, no. 4 (Fall 1990, this issue): 25-42.
2. M. Adiletta et al., "Semiconductor Technology in a High-performance VAX System," Digital Technical Journal, vol. 2, no. 4 (Fall 1990, this issue): 43-60.
3. SPICE is a general-purpose circuit simulator program developed by Lawrence Nagel and Ellis Cohen of the Department of Electrical Engineering and Computer Sciences, University of California, Berkeley.
4. D. Clark, "Pipelining and Performance in the VAX 8800 Processor," Architectural Support for Programming Languages and Operating Systems (ACM, October 1987).
5. C. Wiecek, "A Case Study of VAX-11 Instruction Set Usage for Compiler Execution," Proceedings of the Symposium on Architectural Support for Programming Languages and Operating Systems (ACM, March 1982): 177-184.
6. J. Emer and D. Clark, "A Characterization of Processor Performance in the VAX-11/780," Proceedings of the 11th Annual Symposium on Computer Architecture (Ann Arbor: June 1984): 301-310.
7. VAX Vector Processing Handbook (Maynard: Digital Equipment Corporation, Order No. EC-H0419-46, 1989).
8. R. Brunner and D. Bhandarkar, "Vector Extensions to the VAX Architecture," Proceedings of COMPCON '90 (San Francisco: Spring 1990).

John E. Murray
Ricky C. Hetherington
Ronald M. Salett

VAX Instructions That Illustrate the Architectural Features of the VAX 9000 CPU

The VAX 9000 system is Digital's largest and most powerful VAX system. As such, it offers many unique features that required the use of advanced technology and innovative architecture in the design of the system. Overall, the VAX 9000 microarchitecture produces a high level of system performance and the lowest cycle time of any VAX processor, i.e., less than five cycles per instruction. Three sections of the VAX 9000 CPU - the instruction fetch and decode unit (I-box), the execution unit (E-box), and the data cache and main memory interface unit (M-box) - are illustrated in this paper through descriptions of a small sample of VAX instructions. These instructions are discussed in relation to their flow through the pipeline, how their architectural features combine to work on a single macroinstruction, and how various stages of the pipeline interact.

In October 1989, Digital introduced its VAX 9000 family of high-performance scalar, vector, and parallel processors. The VAX 9000 system is designed to be expandable from one to four processors, with an optional integrated vector facility available on each processor. The design team obtained high levels of performance with advanced technology and innovative architectural features. The technology provided a platform that has the shortest cycle time for any VAX processor. Most VAX processors average ten or more cycles per instruction, whereas the architectural features of the VAX 9000 system reduce that average below five.

The VAX architecture is a complex instruction set architecture. VAX instructions vary in length and number of operand specifiers. The opcode may be one or two bytes long. The number of specifiers is implied by the opcode. Each specifier's length is determined by the specifier type, and the length can vary by up to 17 bytes.[1] Although the VAX 9000 implements a large number of instructions in a single cycle, some instructions need to be implemented in tens of cycles. In these cases, microcode assistance is required.

To increase performance, many features were included in the VAX 9000 system that have not been implemented in previous VAX systems. The system contains a virtual instruction cache, a branch prediction cache, multiple specifier evaluation units, deep instruction prefetch, hardware translation buffer fix-up unit, write address buffer and conflict checker, multiported write-back cache, independent arithmetic units, and separate issue and retire queues. These features are pipelined and do not interact in a straightforward way. Many stages are not directly linked to the subsequent stage but feed a queue or first-in first-out (FIFO) buffer. The subsequent stage works on the output of the FIFO buffer. The pipeline is not of a fixed length, and many operations are done in parallel.

The architectural features do not function totally independently of one another. In fact, the highest level of performance is achieved when all the units function in harmony. This paper highlights the implementation of the macropipeline found in the three major subsystems of the VAX 9000. These subsystems are the instruction fetch and decode unit (I-box), the execution unit (E-box), and the data cache and main memory interface (M-box).

The design team for the VAX 9000 system's I-box evolved a cost-effective subsystem that outperforms all previous VAX systems. As shown in Figure 1, the I-box processes the majority of instructions in just one cycle. It combines a single cycle access virtual instruction cache with a 25-byte instruction buffer and an instruction decode crossbar that can decode three specifiers per cycle. To minimize cycle-wasting stalls, a branch prediction unit handles transitions from one code block to another. In addition, the operand processing unit receives and processes specifiers from the decode unit. The specifiers are passed either to the E-box as pointers, literal data or addresses, or to the M-box as virtual addresses.

Figure 2 illustrates how the front end of the M-box translates addresses by using either a translation buffer or an autonomous virtual-to-physical address translation unit. Physical addresses for reads are used to access a two-way associative write-back cache and to fetch data from memory through the system control unit (SCU), if the data is missing from the cache. Read data is returned to the E-box. Write addresses from the operand processing unit are translated and queued by the M-box until the E-box provides the data for the write.

The E-box of the VAX 9000 CPU performs all scalar operations. As shown in Figure 3, the E-box is a pipelined design that incorporates a microsequencer to control functional unit operation. Other dedicated control logic directs the flow through the pipe stages. A multiported register file provides general purpose registers and temporarily holds memory data. The data is processed by one of the four arithmetic functional units. Results pass through a retirement multiplexer to the register file or the M-box data cache, as shown in Figure 4.
Multiple VAX instructions are executed concurrently in the E-box pipeline. The primary goal of the E-box is to produce a 32-bit result each cycle, which allows the majority of the simple, but most frequent, VAX instructions to be executed in one cycle. This goal is achieved when four requirements are met. First, the I-box must have commands available for the E-box. Second, operand data, often from the M-box data cache, must be available. Third, pipelined or single-cycle latency functional units are required for single-cycle throughput. Finally, results must be transferred from the functional units. E-box features, such as queues, data bypass paths, and powerful arithmetic units, help the system attain a high-performance level. Stalls are avoided and each instruction is executed in a minimal amount of time.

The M-box of the VAX 9000 CPU is the primary source of memory data. Therefore, it contains the virtual address translation buffer and the data cache. The M-box is multiported and pipelined with two autonomous pipeline segments. Each segment occupies one machine cycle, and the cache access latency is, therefore, two cycles long. During the first cycle, the M-box receives and prioritizes virtually (or physically) addressed memory requests. The M-box then indexes the translation buffer to produce a 33-bit physical address and to perform protection and validity checks. The second pipelined cycle involves data cache access, data alignment, if required, and port response. There are numerous architectural features within both segments that are targeted at high bandwidth for prefetching and storing scalar and vector operands.

To illustrate the various features of the VAX 9000 microarchitecture, we have selected the code sequence shown in Figure 5. In the following sections, we discuss each instruction as it progresses through the pipeline as if it were the only instruction in the pipeline.
We then summarize by considering the same instructions as a block of code.

VAX Instruction ADDL2

The ADDL2 instruction uses general-purpose register R8 as an address to memory. The contents of that location are added to general-purpose register R7, and the result is written back to the same location in memory. The instruction is encoded in three bytes: opcode, register, and base register.

Cycles One through Three

If we assume that the ADDL2 instruction is the first instruction either in an interrupt routine or following a context switch, the program counter is generated by the E-box and passed to the I-box on a 32-bit bus. The program counter is latched and used to access the virtual instruction cache during cycle one. The virtual instruction cache contains up to 8 kilobytes (KB) in 32-byte blocks and 8-byte lines of instruction stream data. Bits <12:3> of the program counter's prefetch buffer are used to access an 8-byte line from the virtual instruction cache. Bits <12:5> are used to access a tag, a valid block, and four quadword valid bits. The tag is compared with bits <31:13> of the program counter's prefetch buffer. If the tag and the bits match, the block and the quadword within the block are valid, and the instruction is in the virtual instruction cache (i.e., a hit). Bits <2:0> of the prefetch buffer are used to rotate the quadword for the opcode byte to be loaded into byte 0 of the I-buffer at the end of cycle one. Similar to the VAX 8650 system, the first byte of the I-buffer is the operation code (opcode) of the instruction.

The ADDL2 is three bytes long and normally fits in one line of the virtual instruction cache. If the ADDL2 instruction crosses a line boundary, a subsequent cycle is required to access the second line. The average VAX instruction is 3.8 bytes long. Therefore, a virtual instruction cache hit delivers about two instructions to the I-buffer.[6]

[Figure 1: Block Diagram of the VAX 9000 System I-box, showing the fetch, decode, and specifier stages. Key: VIR = virtual instruction cache; S1 = source 1; S2 = source 2; DEST = destination; IB = I-buffer; PPC = prefetch program counter; UPC = unwind program counter; DPC = decode program counter; SPC = specifier program counter; BP = branch prediction; PC = program counter; OPU = operand processing unit; SL = short literal; GPR = general purpose register; GPRS = general purpose registers; XGPR = X general purpose register; YGPR = Y general purpose register; OP D = op decode; SL D = short literal decode; R1, R2, R3 = registers 1, 2, 3; DISP = dispenser.]

[Figure 2: Front End of the VAX 9000 System M-box]

[Figure 3: Block Diagram of the VAX 9000 System E-box]

[Figure 4: Cache Unit of the VAX 9000 System M-box]

[Figure 5: VAX Instructions That Illustrate the Major Features of the VAX 9000 System. The listing shows ADDL2 R7,(R8) at address 0080; SUBF3 #0.5,(R0)[R4],R3 at 0083; MULG3 #2345.5,(R5)+,R9 at 0088; and BBSC #13,BDATA,1$ at 0095.]

Other VAX processors generally require a cycle to decode the opcode and one or more cycles to decode each subsequent specifier.[7, 8] However, the VAX 9000 CPU's instruction decode crossbar can decode the vast majority of common instructions in a single cycle. If the three bytes of the ADDL2 instruction were loaded into the I-buffer at the end of cycle one, the bytes would be decoded during cycle two. The decode unit (XBAR) passes data from the I-buffer to a short literal unit, a register/pointer unit or an operand processing unit. As the opcode and specifier bytes are decoded in parallel, the XBAR determines in less than a cycle that both specifier bytes should be routed to the register/pointer unit and that the memory specifier should be routed to the operand processing unit.

In parallel with the XBAR decode process during cycle two, the program counter is passed to the E-box from the I-box. The opcode is used to address the fork random-access memories (RAMs) in the E-box that provide a fork address to the microsequencer. At the end of cycle two, the decoded bytes are shifted out of the I-buffer, and the subsequent instruction is presented to the XBAR in cycle three. The fork address from the I-box is then used to address a fork RAM in the E-box. For each opcode, the fork RAM provides an entry address into the control store, indicates which functional unit should begin the execution, and specifies how many source operands are needed in the first cycle. The fork address is modified when an instruction

[...]

index the 1024-entry translation buffer. The translation buffer is a direct-mapped, associative memory that contains the results of the most recent 1024 translations. Bits <30:18> are compared, validated, and protection-checked against the tag field.
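The translation buffer lookup described here can be sketched in a few lines. This is a hedged illustration only: the text above does not preserve which virtual address bits form the index, so the choice of bit <31> plus bits <17:9> is an assumption (it is simply the ten bits left over once the tag bits <30:18> and the 9-bit byte offset of a 512-byte VAX page are set aside). The entry layout, the function names, and the omission of protection checks are likewise hypothetical.

```python
PAGE_OFFSET_BITS = 9   # VAX pages are 512 bytes
TB_ENTRIES = 1024      # direct-mapped, per the text


def tb_tag(va):
    """Tag field: virtual address bits <30:18> (13 bits), as stated in the text."""
    return (va >> 18) & 0x1FFF


def tb_index(va):
    """ASSUMPTION: index = bit <31> concatenated with bits <17:9> (10 bits)."""
    return (((va >> 31) & 0x1) << 9) | ((va >> PAGE_OFFSET_BITS) & 0x1FF)


def translate(va, tb):
    """Return the physical address on a hit, or None on a miss.

    tb is a list of TB_ENTRIES items, each None or a dict with the
    hypothetical keys 'valid', 'tag', and 'pfn' (24-bit frame number).
    Protection checking is not modeled.
    """
    entry = tb[tb_index(va)]
    if entry is None or not entry["valid"] or entry["tag"] != tb_tag(va):
        return None  # a miss would be handed to the fix-up hardware
    offset = va & ((1 << PAGE_OFFSET_BITS) - 1)
    # 24-bit frame number + 9-bit offset = 33-bit physical address
    return (entry["pfn"] << PAGE_OFFSET_BITS) | offset


if __name__ == "__main__":
    tb = [None] * TB_ENTRIES
    va = 0x80012345
    tb[tb_index(va)] = {"valid": True, "tag": tb_tag(va), "pfn": 0xABCDEF}
    print(hex(translate(va, tb)))
```

A second address that maps to the same entry but carries a different tag returns None in this model, which corresponds to a translation buffer miss.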
The physical frame number is a 24-bit field that is appended to the virtual address bits <8:0> to create the 33-bit physical address. The translation buffer uses a 1024 by 4 self-timed RAM with a 4.5-nanosecond (ns) access time.

Protection checking occurs during the latter portion of cycle four. The example we are discussing is a request for a read and write check. Therefore, both read and write access are checked. Fault indication is forwarded with the request to the data cache and subsequently, with the data, to the E-box. If the request has a valid entry in the translation buffer and no protection violations exist (i.e., translation buffer hit), a data cache access is required in cycle five.

The two source pointers and the destination pointer from the I-box are latched in the source and destination queues, respectively, at the start of cycle four. The source queue holds 16 entries and can receive 2 entries per cycle. The destination queue holds eight entries. Both queues are circular FIFO queues that can be flushed with the fork queue.

The two source pointers are also latched in the source operand logic at the start of cycle four. The source operand logic determines which two source pointers to use each cycle. The pointers can come from the source queue, the I-box, the microword, the register log, and several special functions. In this example, the two pointers are selected directly from the latched I-box pointers because using the source queue would have required an extra cycle. The selected pointers address the register file and are passed to the issue logic early in the fourth cycle.

The register file contains the 15 general-purpose registers, R0 through R14. These registers can be written by either the E-box or the I-box for autoincrement or autodecrement specifiers. The first pointer accesses general-purpose register R7. The contents of general-purpose register R7 are
following the ADDL2 instruction (i.e., bytes <2:0>). In the latter case, the SUBF3 instruction would be shifted into the lower bytes as the ADDL2 instruction is shifted out.

Cycles Two through Eight

In cycle two, the SUBF3 instruction is completely decoded and shifted out of the I-buffer. As a result, the following actions occur:

• The fork address is passed to the E-box.
• The short literal is passed to the short literal expansion unit.
• The base and index registers are passed to the operand processing unit.
• The destination general-purpose register R3 and the two sources are passed to the register/pointer unit.

During cycle three, the register/pointer unit allocates the next available entry in the source list to the short literal and the subsequent entry to the indexed memory reference. The E-box is informed of these allocations as pointers to the relevant entries are passed to the pointer queues in the source one and source two pointers. The register/pointer unit also passes the destination register to the destination queue in the E-box.

The operand processing unit passes the tag, with the address for the indexed memory specifier request, from the register/pointer unit to the M-box. The address is generated by the adder in the operand processing unit. In parallel with the operand processing unit and register/pointer unit, the short literal expansion unit takes the 6-bit field and expands it to a 32-bit F_floating number.

During cycle four, the short literal is written through the I-box data bus to the relevant entry in the source list. Issue control can issue with bypass because only the memory data for operand two is missing. The E-box stalls until the memory data arrives. Because the I-box and the M-box generally are functioning ahead of the E-box, memory stalls are short or nonexistent.
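The short literal expansion mentioned above can be illustrated with a minimal sketch. It assumes the standard VAX encoding, in which the upper three bits of the 6-bit literal supply the low exponent bits and the lower three bits supply the leading fraction bits of an excess-128 F_floating number. The function names are hypothetical, and the code models only the data transformation, not the hardware of the short literal expansion unit.

```python
def expand_short_literal_f(literal: int) -> int:
    """Expand a 6-bit VAX short literal into a 32-bit F_floating longword.

    Assumed layout of the low word of an F_floating datum:
    bit 15 = sign, bits <14:7> = excess-128 exponent, bits <6:0> = high fraction.
    The literal lands in bits <9:4>, and exponent bit 14 is set.
    """
    if not 0 <= literal <= 0x3F:
        raise ValueError("short literal must fit in 6 bits")
    return 0x4000 | (literal << 4)


def f_floating_to_float(longword: int) -> float:
    """Decode an F_floating longword (illustrative helper, not from the paper)."""
    sign = (longword >> 15) & 0x1
    exponent = (longword >> 7) & 0xFF
    fraction = ((longword & 0x7F) << 16) | ((longword >> 16) & 0xFFFF)
    if exponent == 0:
        return 0.0  # true zero when sign is 0 (reserved operand otherwise)
    mantissa = ((1 << 23) | fraction) / float(1 << 24)  # hidden bit, 0.5 <= m < 1
    value = mantissa * 2.0 ** (exponent - 128)
    return -value if sign else value


def short_literal_value(literal: int) -> float:
    """Closed form of the value a floating short literal represents."""
    exponent = (literal >> 3) & 0x7
    fraction = literal & 0x7
    return (0.5 + fraction / 16.0) * 2.0 ** exponent
```

Under these assumptions, the literal 0 used by `SUBF3 #0.5, (R0)[R4], R3` expands to the longword `0x00004000`, which decodes to 0.5, and the largest literal (63) decodes to 120.0.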
In this example, the memory data arrives at the end of cycle five, as was the case with the ADDL2 instruction.

In cycle four, the M-box operates for the SUBF3 instruction in a similar manner to its cycle four activity for the ADDL2 instruction. At the start of the cycle, a command, address, context, and tag field are sent from the operand processing unit to the M-box. The command is a simple operand read. Arbitration occurs early in the cycle. The translation buffer is then accessed, and the physical address is sent to the cache.

Cycle five begins when the data cache receives the physical address for the operand processing unit to read. The tag store lookup and address matching are performed simultaneously with the data read, and the data is available to the E-box at the end of the cycle.

If the operand read results in a cache miss, the M-box must assemble a command and an address, which are sent to the SCU to enable the SCU to access a 64-byte block of memory data. In addition, the data cache tells the SCU which set the cache will replace with the new cache block. If the current cache block contains valid and written data, the block must be written back to main memory before the new cache block arrives. The SCU sends a command and an address back to the M-box when the memory data is ready. The send takes approximately 26 cycles and is followed, within a short period of time, by eight cycles of data transfer. Each cycle transfers 8 bytes. The requested quadword is returned first to respond to the requesting port during the first cycle of the cache refill. On the eighth cycle of cache refill, the tag store is updated.

The floating point functional unit is started in cycle six, as specified by the fork RAM data. Both source operands are delivered, and the microword indicates a SUBF operation. The floating point unit requires two cycles to perform the SUBF operation. Unpacking and alignment occur in the first cycle. The floating point unit signals the issue control that the result will be available at the end of the following cycle. The issue control enters the general-purpose register R3 destination but must wait another cycle before beginning retirement. If the next instruction requires that the floating point unit and the operands be available, the instruction would be issued in this cycle because the floating point unit is fully pipelined.

The second execution cycle occurs in cycle seven. The floating point unit adds, normalizes, rounds, and packs. The result is latched in the floating point unit at the end of the cycle, and the issue control discards the top entry from the result queue to retire the data.

In cycle eight, the retire multiplexer selects the floating point unit result data and sends that data to the data distribution logic. The data distribution logic holds the result, which will be written into general-purpose register R3 in the register file during the next cycle. The write is purposely delayed to permit it to be aborted if an arithmetic fault occurs. Holding the result in the data distribution logic also allows it to be bypassed into the data path as a source operand. The result is written into the register file at the beginning of cycle nine.

VAX Instruction MULG3

The MULG3 instruction takes the G_format floating number addressed by general-purpose register R5, multiplies it by the immediate constant 2345.675 from the instruction stream, which is also a G_format number, and puts the result in general-purpose registers R9 and R10. General-purpose register R5 also is incremented by eight as a side effect of the specifier evaluation. The opcode is 2 bytes long, the constant is a nine-byte immediate specifier, and the autoincrement and register specifiers are each a single byte. Thus, the instruction is encoded in 13 bytes.

Cycles One through Five

As in cycle one of the SUBF3 instruction, cycle one of the MULG3 instruction can be a virtual instruction cache access cycle, or part of the instruction can already be in the I-buffer and be shifted to the least significant byte as the previous instruction is shifted out. For example, if the previous instruction is the SUBF3 #0.5, (R0)[R4], R3 in bytes <4:0> of the I-buffer, the first four bytes of the MULG3 instruction could be in bytes <8:5>. The four remaining bytes of the immediate specifier could be valid in the I-bex and the rest of the instruction could be contained in the I-bex2. At the end of cycle one, the first four bytes are shifted to the low four bytes of the I-buffer. The next four bytes are merged from the I-bex to the high four bytes of the I-buffer. The I-bex is now empty, and the bytes in the I-bex2 can be loaded into the I-bex.

Because the MULG3 instruction has a 2-byte-long opcode, the only decoding necessary in cycle two is to note the 2-byte length and shift out the first byte so as to align the specifiers to be the same as a single-byte opcode instruction. The specifiers are then in bytes <8:1> of the I-buffer. As the first opcode byte (in this case, #FD) is shifted out, the next valid byte in the I-bex is merged into byte 9 of the I-buffer, which leaves seven valid bytes in the I-bex.

Decoding really begins in cycle three. The fork's address is sent to the E-box, and bit <8> is set to indicate a 2-byte-long opcode. The first five bytes of the immediate specifier are passed to the operand processing unit. The first byte also is passed to the register/pointer unit for source list allocation. The five bytes shifted out of the I-buffer are replenished from the I-bex, which leaves two valid bytes in the I-bex.

In cycle four, the register/pointer unit allocates the two entries in the source list for the immediate G_floating number by passing a source one pointer to the E-box and the tag to the operand processing unit. The operand processing unit passes the first longword of the immediate G_floating number to the unit's output buffer. The next four bytes of the immediate are passed from the I-buffer to the operand processing unit. The remaining two valid bytes from the I-bex are merged into the I-buffer. The I-bex is then loaded with eight bytes from the virtual instruction cache.

In cycle five, the autoincrement and register specifiers are decoded and the remaining bytes of the instruction are shifted out. Five bytes from the I-bex are merged with the four valid bytes in the I-buffer. The autoincrement general-purpose register R5 is passed to the operand processing unit and the register/pointer unit, which also receives general-purpose register R9. The first longword of the immediate specifier is passed from the operand processing unit output buffer, through the I-box, to the source list entry allocated by the register/pointer unit. The second longword is passed to the operand processing unit output buffer.

The first microword is accessed and distributed throughout the E-box. The microsequencer uses the fast fields of the microword to generate the final control store address for this instruction. The microinstruction is not issued because it requires two source operands and the second source pointer is not yet available.

Cycle Six

In cycle six, the register/pointer unit allocates two source list entries for the autoincrement specifier, passes this information to the E-box in the source one pointer, and passes a tag to the operand processing unit. The general-purpose register R9 is passed to the E-box as the destination pointer.
The operand processing unit accesses general-purpose register R5 and passes it, with a tag and a quadword read request, as an address to the M-box. In parallel, the operand processing unit writes the value of general-purpose register R5, incremented by eight bytes, into the unit's output buffer. The second longword of the immediate specifier is written to the source list at the relevant entry.

The operand processing unit sends the M-box a quadword read request for the double-precision floating point operand. If the address is on a quadword boundary, the front end of the M-box will not produce any additional virtual addresses because the operand will not cross a page boundary or a cache line boundary. If there is a miss in the translation buffer for this reference, all other arbitration stops and control is given to the state machine of the translation buffer fix-up unit. Bits <31:09> of the request are captured by the translation buffer's fix-up unit in parallel with the translation buffer RAM's access to achieve an early start on miss processing.

The fork to the state machine is sensitive to bits <31:30> of the virtual address. Therefore, when a translation buffer miss occurs, a constrained control word flow begins based on the values of bits <31:30>. Because this is a user-mode reference, the value is zero. Therefore, on the first cycle following the translation buffer miss, the virtual page number is compared against the P0 length register, P0LR. On the next machine cycle, the P0BR (i.e., base register) is added to the virtual page number to create the system virtual address of the process page table entry. The fix-up unit acts the same as any other port into the translation buffer, and makes a virtual read request with an aligned longword context. The state machine is controlled by a microword that branches to itself until one of three events occurs: a miss in the translation buffer
(the fix-up unit processes double misses), a memory management fault, or a cache response. The cache response, which is the event most likely to occur, signals the state machine to return to idle and prepare for the next miss. Hardware control external to the fix-up unit writes the entry into the translation buffer, and the original request is retried. This time there is a translation buffer hit, and the physical address is sent to the cache. Single misses in the translation buffer require seven cycles to process. A double miss requires 13 cycles, assuming data cache hits occur.

The issue control asserts the microword hold signal to force the microword latches to hold the first microword until it can be executed. The microsequencer regenerates the control store address of the second microword each cycle until the execution stall ends.

Cycles Seven through Thirteen

Cycle seven is the data cache read cycle for the quadword operand processing unit request that was translated in the previous cycle. The VAX 9000 system has a 128KB data cache, with a block size of 64 bytes and access width of 8 bytes. The 64-bit access width matches the 64-bit data path to the E-box, which was constructed to provide high bandwidth for double-precision operand transfers. When a cache hit results for the read of an aligned quadword, both the normal response line and the quadword response signal are asserted to alert the E-box that the M-box is sending a quadword of data.

In cycle seven, general-purpose register R5 of both the E-box and I-box is written with the incremented value. In addition, both source pointers and the first source operand are available to the issue control. Because only the second operand is missing, the microinstruction can be issued with bypass awaiting memory data. The quadword operand is available to the M-box at the end of cycle eight.
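The first two steps of the translation buffer miss flow described for cycle six (the length check against P0LR and the formation of the page table entry address from P0BR) can be sketched as follows. This is a minimal sketch under stated assumptions: it uses the standard VAX conventions of 512-byte pages and longword (4-byte) page table entries, so the virtual page number is scaled by four before it is added to P0BR. The function and exception names are hypothetical, and the code does not model the fix-up unit's state machine or its timing.

```python
PAGE_SHIFT = 9          # assumed: 512-byte VAX pages, byte offset in bits <8:0>
PTE_BYTES = 4           # assumed: each page table entry is one longword
VPN_MASK = 0x1FFFFF     # virtual page number, bits <29:9> of the address


class LengthViolation(Exception):
    """Raised when the virtual page number is outside the P0 page table."""


def p0_pte_address(virtual_address: int, p0br: int, p0lr: int) -> int:
    """Return the system virtual address of the P0 page table entry."""
    region = (virtual_address >> 30) & 0x3   # bits <31:30> select the region
    if region != 0:
        raise ValueError("address is not in the P0 region")
    vpn = (virtual_address >> PAGE_SHIFT) & VPN_MASK
    if vpn >= p0lr:                          # first cycle: compare with P0LR
        raise LengthViolation(hex(virtual_address))
    return (p0br + PTE_BYTES * vpn) & 0xFFFFFFFF   # next cycle: add P0BR
```

For example, with a hypothetical P0BR of 0x80000000 and a sufficiently large P0LR, the virtual address 0x00001234 lies in virtual page 9, so the sketch yields the page table entry address 0x80000024. The fix-up unit would then read that longword through the translation buffer like any other port.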
The low longword is latched in the data distribution logic of the E-box, and the high longword is held in the M-box.

In cycle nine, the quadword operand is written into the register file at the two source list locations allocated by the operand processing unit. However, the low longword is available as a source immediately. The low longword of the short literal operand and the low longword of the memory operand are passed to the multiply functional unit at the start of cycle nine. The multiply unit performs the first cycle of execution, which includes unpacking and multiplying the most significant bits of the two operands. Issue control drops the microword hold signal to allow the second microword to be latched. An entry, which specifies general-purpose register R9 as the destination for the low longword of the result, is made to the result queue.

The second microword is issued because the multiplier requires the next half of each source operand and both are available from the register file. The microsequencer then attempts to generate a new control store address from the next entry in the fork queue. If no new forks are available, the microsequencer remains idle.

In the tenth cycle, the multiply unit receives the high longword of both source operands. The second execution cycle is performed, which includes unpacking and three simultaneous multiplications of the appropriate combinations of the most and least significant bits of the two operands. The multiplier signals the issue control that the result will be available in the following cycle. The issue control makes an entry, which specifies general-purpose register R10 as the destination for the high longword of the result, in the result queue. The multiply functional unit is fully pipelined and could be issued in this cycle to start subsequent operations.

Cycle eleven is the third and final execution cycle.
The multiplier accumulates the four products it produced in the two previous cycles, rounds, and packs the final double-precision result. The issue control discards the top entry from the result queue to retire the low longword of the result.

In cycle twelve, the retire multiplexer selects the multiply unit result data and sends it to the data distribution logic. The issue control discards another entry from the result queue to retire the high longword of the result. The low longword of the result is written into the register file's general-purpose register R9 in cycle thirteen. The high longword of the result is written into general-purpose register R10 in the next cycle as the instruction is completed.

VAX Instruction BBSC

The BBSC instruction tests a bit in memory, branches if the bit is set, and clears the bit. BDATA is the base address in memory, and the number 13 is the bit position offset. The majority of VAX field instructions have a position offset of less than 64 bits. Therefore, the VAX 9000 system's I-box prefetches the quadword addressed by the base. As with all conditional branches, the result of the test is predicted and the VAX 9000 system's I-box continues to fetch instructions along the predicted path. The BBSC is encoded in eight bytes: one opcode, one short literal position, five for the base address (a 4-byte displacement off the program counter), and one displacement.

Cycles One and Two

Cycle one for the BBSC can be fetching the instruction stream from the virtual instruction cache, as described for cycle one of the ADDL2 instruction, or it already can be in the I-buffer (e.g., bytes <8:3>) and the I-bex (i.e., bytes <7:6>) following the MULG3 (i.e., bytes <2:0>). In the latter case, the BBSC instruction is shifted into the lower bytes as the MULG3 instruction is shifted out.
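The architectural effect of BBSC described above (test a bit, report whether the branch is taken, clear the bit) can be expressed as a short sketch. It assumes the usual VAX bit-field addressing for a memory base, in which the byte address is the base plus the position divided by eight and the low three position bits select the bit within that byte. The function name and the `bytearray` memory model are hypothetical and stand in for the M-box and integer unit activity that the paper walks through cycle by cycle.

```python
def bbsc(memory: bytearray, base: int, position: int) -> bool:
    """Branch on Bit Set and Clear: return True if the branch is taken.

    memory   -- hypothetical byte-addressable memory image
    base     -- byte address of the field base (BDATA in the example)
    position -- bit offset from the base (13 in the example)
    """
    byte_address = base + (position >> 3)   # position divided by eight
    mask = 1 << (position & 0x7)            # bit within the selected byte
    was_set = bool(memory[byte_address] & mask)
    memory[byte_address] &= ~mask & 0xFF    # the bit is cleared either way
    return was_set
```

With position 13, the sketch touches bit 5 of the byte at base + 1, which is why a prefetched quadword starting at the base covers the common case of positions below 64.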
The decode of the BBSC begins with passing the short literal, number 13, to the short literal expansion unit and the program counter-relative base address to the operand processing unit. Information on both specifiers is passed to the register/pointer unit. In this cycle, the fork address is also passed to the E-box. The fork address is modified for field instructions if the base is a register. Therefore, passing the fork address is delayed until the base specifier is decoded. In this example, the base is decoded in the cycle after the opcode is received. If the base is a register, the field instruction takes a different microcode flow.

During cycle two, the decoder passes the program counter's decoder, that is, the program counter of the instruction being decoded, to the operand processing unit. The program counter is passed to the operand processing unit and the E-box in the first decode cycle. Whenever a specifier is passed to the operand processing unit, the XBAR also sends a specifier offset delta. When the delta is added to the program counter's decoder, the address of the last byte of the specifier plus one is produced. As the short literal and program counter-relative specifiers are decoded, they are discarded from the I-buffer. The BBSC displacement is shifted to the first byte of the I-buffer. The data arriving from the cache is merged into bytes <8:2>, and the other byte is placed in the I-bex.

The branch prediction unit begins operating during the first decode cycle. A prediction for the branch must accompany the fork address sent to the E-box. The prediction is made by using the program counter to access a branch prediction cache and determine how the branch behaved the last time it was decoded (i.e., one history bit). If the branch is in the cache, the prediction is that the branch will behave the same as the last time.
If the branch is not in the cache, a prediction is made based on the normal behavior of this conditional branch. For example, a BEQL (58 percent) and a BBSC (73 percent) normally do not branch, whereas a BNEQ (62 percent) normally branches. If the BBSC instruction is in the cache and branched last time, this information is indicated to the E-box, with the I-box prediction given as true.

Cycle Three

In this cycle, the register/pointer unit allocates one entry in the source list for the position specifier and three entries for the base specifier. The unit then passes the source one, source two, and destination pointers to the E-box.

In the operand processing unit, the address of the last byte of the specifier plus one is first calculated using the program counter of the instruction and the delta provided by the XBAR. The displacement from the instruction is then added to this calculation. The result is latched in the operand processing unit's output buffer and passed to the M-box. The operand processing unit also passes a quadword, field modify function, and the source list tag.

The short literal expansion unit extends the size of the position specifier to a longword and latches it in the unit's output buffer. In this example, the extension is done with zeros. The XBAR passes the branch displacement byte and an updated value of the program counter's delta to the operand processing unit. The delta of the program counter and the branch displacement are also sent to the branch prediction unit as instruction lengths. The BBSC instruction is completely decoded, and the opcode and displacement are discarded from the I-buffer.

The branch prediction unit does most of its work during the last decode cycle of a branch. For the majority of conditional branches, the last decode cycle is also the first.
The branch prediction cache contains 1024 entries. Each entry has a history bit, a 32-bit target program counter, a 6-bit instruction length, and a 16-bit branch displacement and its tag. The entries are addressed by bits 9 through 0 of the program counter's decoder. If the tag matches bits <31:10> of the program counter's decoder, the entry is assumed to be the entry, or a hit, for this branch.

If a hit occurs and the history bit shows that the branch was not taken last time, the branch prediction unit latches this state information and allows the subsequent instruction stream to be decoded. The operand processing unit produces the target address as soon as it is not busy. The target address must be stored in the program counter's unwind buffer in case the prediction is incorrect. The E-box indicates the correctness of the prediction as soon as possible. For simple branches, the E-box could indicate that the prediction is incorrect before the branch is fully decoded.

If a hit occurs but the history bit shows that the branch was taken last time, the branch prediction unit latches this state information and stops the decoding of the subsequent instruction stream by clearing the I-buffer and the I-bex. The program counter of the subsequent instruction is stored in the program counter's unwind buffer. The program counter's target address, which is received from the branch prediction unit cache, is passed to the program counter's prefetch buffer. The target address that is later provided by the operand processing unit may be discarded. The branch displacement and instruction length from the branch prediction cache are latched.

For the following discussion on the remaining cycles in the BBSC instruction, we have assumed that the BBSC instruction is a branch prediction hit and that the branch was taken the last time decoding occurred.
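The lookup described above (1024 entries indexed by bits <9:0> of the program counter's decoder, a tag compared against bits <31:10>, one history bit, and a per-opcode default on a miss) can be sketched as a small data structure. This is an illustrative model only: the class and field names are hypothetical, the default table covers just the three opcodes the text mentions, the fallback for any other opcode is an assumption, and the `update` policy is assumed because this section does not describe how entries are written.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

ENTRIES = 1024  # indexed by bits <9:0> of the decode program counter


@dataclass
class BranchEntry:
    tag: int            # bits <31:10> of the decode program counter
    taken: bool         # single history bit: behavior the last time
    target_pc: int      # 32-bit target program counter
    length: int         # 6-bit instruction length
    displacement: int   # branch displacement


# Default predictions on a miss, from the normal behavior quoted in the text.
DEFAULT_TAKEN = {"BEQL": False, "BBSC": False, "BNEQ": True}


class BranchPredictionCache:
    def __init__(self) -> None:
        self.entries: List[Optional[BranchEntry]] = [None] * ENTRIES

    @staticmethod
    def split(pc: int) -> Tuple[int, int]:
        """Return (index, tag) for a 32-bit program counter value."""
        return pc & 0x3FF, (pc >> 10) & 0x3FFFFF

    def predict(self, pc: int, opcode: str) -> Tuple[bool, Optional[int]]:
        """Return (predicted taken, cached target or None on a miss)."""
        index, tag = self.split(pc)
        entry = self.entries[index]
        if entry is not None and entry.tag == tag:
            return entry.taken, entry.target_pc
        # Assumption: opcodes not listed in the text default to not taken.
        return DEFAULT_TAKEN.get(opcode, False), None

    def update(self, pc: int, taken: bool, target_pc: int,
               length: int, displacement: int) -> None:
        """Assumed policy: overwrite the indexed entry with the outcome."""
        index, tag = self.split(pc)
        self.entries[index] = BranchEntry(tag, taken, target_pc,
                                          length, displacement)
```

As a usage example, a BBSC at address 0x0095 that branches to itself misses on first decode and is predicted not taken. After `update(0x0095, True, 0x0095, 8, -8)` (hypothetical values), the same lookup hits and predicts taken with the cached target, which is the case the remaining cycles assume.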
Cycle Four

In cycle four, both the operand processing and short literal expansion units contain data to be passed to the source list. The operand processing unit normally has the higher priority of the two. Therefore, the short literal expansion unit will stall. The operand processing unit passes the base address to the source list through the I-box. In the operand processing unit, the new delta of the program counter is added to the program counter, the sign of the branch's displacement is extended from a byte to 32 bits, and the two are added to produce the new target address. The result is latched in the operand processing unit output buffer.

The virtual instruction cache is accessed for the target instruction. If the instruction is in the virtual instruction cache, it is passed to the I-buffer. However, there is a gap in the pipeline because no instruction can be decoded this cycle.

The displacement and instruction length from the branch cache are compared with the actual displacement and instruction length. Normally, these lengths match. However, if they are different, the target address from the branch prediction unit cache is probably incorrect. The fetching and decoding of instructions must wait until the operand processing unit provides the correct address.

At the start of cycle four, the M-box receives a request from the operand processing unit. This request differs from all requests previously described in that it contains a command that gets special treatment in the M-box. The command is an "opu read with write check no block." The command is used because the VAX 9000 CPU contains an optimization that enhances the performance of bit field instructions. With this command, the operand processing unit prefetches a quadword of data, starting from the address pointed to by the base, without looking at the value of the position operand. Hopefully, the majority of bit fields are within 64 bits of the base. The special command tells the M-box that if a fault should occur, it should pass the fault, with an operand, to the E-box and not close down the operand processing unit port or put a lock on the fault parameters. The command is an unaligned quadword operand and, as such, requires that the M-box produce additional virtual addresses to correctly access the cache. A quadword is unaligned when bits <2:0> are nonzero. For this example, we have assumed that the starting address is xxxxxxx1.

Specialized hardware in the front end of the M-box detects if the starting address requires sequencing (i.e., the addition of a constant of 4 to the current address) and how many sequenced addresses are necessary. In this case, three addresses are required. The first is the starting address (i.e., addr = xxxxxxx1), which is received from the operand processing unit. As the starting address is accessing the translation buffer, a constant of 4 is added and the sequence port requests a virtual address (i.e., addr = xxxxxxx5) from the translation buffer at the start of cycle five.

The issue control uses the fork RAM data to determine that the integer unit and two source operands are required. Because only the first operand is missing from the source list, the instruction is issued with bypass. The microsequencer generates the second control store address based on the fast access fields of the first microword.

Cycle Five

Decoding the target instruction stream begins in cycle five. The operand processing unit sends the target address to the branch prediction unit through the program counter's target address. However, as noted earlier, the target address sent is discarded. Because the operand processing unit does not use the I-box data register, the short literal expansion unit can pass the short literal to the source list.

The branch prediction unit now waits either for the E-box to indicate the correctness of the prediction or for subsequent branches to be decoded. The unit predicts a maximum of three branches before it stalls decoding to resolve the first branch.

As the address xxxxxxx5 is accessing the translation buffer, the final address is produced by adding 4, which makes a translation buffer request (i.e., addr = xxxxxxx9) through the sequencer port in cycle six. The three translation buffer accesses are contiguous and interruptible. Data alignment is performed by the M-box, but the alignment is constrained to longwords. When an unaligned quadword is detected, the front end of the M-box alters the context field that it passes to the data cache unit. The quadword request is effectively broken into two unaligned longwords, which are properly rotated into the low longword of the quadword interface and sent to the E-box independently.

Cycle five is the data cache read cycle for the first unaligned longword. Because the starting address is xxxxxxx1, the entire longword is contained in the cache line. Therefore, one additional rotation cycle is all that is required before the data is sent to the E-box. The M-box pipe is effectively lengthened by a cycle when it is performing unaligned operations.

Because cycle five is a data cache read cycle, no response is issued to the E-box. In addition to the data cache read, the physical address is placed in the write queue. A memory write is required after the bit is tested. A status bit for a new quadword is set in the write queue. The new quadword indicates that this is the starting address of an operand and writes should not take place until an entry appears in the write queue with a last bit assertion.

Because the first operand is written into the source list, the operand is available to the integer unit at the start of cycle six. The microword hold signal is asserted to hold the first microword during the stall. The microsequencer regenerates the control store address of the second microword.

Cycles Six through Nine

In cycle six, the data cache is read again with address xxxxxxx5, which is the same cache line read in cycle five. However, because the context is a longword, one additional byte of data must be read from the cache to satisfy the request. Also, in cycle six, rotation of the data read in cycle five is completed, and the M-box responds to the E-box. Finally, address xxxxxxx5 is placed in the write queue.

By using source pointers from the source queue, the position and base address operands are selected by the fork RAM and passed to the integer unit. If

of interest from the cache read in cycle seven to the correct position. No response is issued to the E-box because this unaligned reference requires two data cache reads to fulfill. The address xxxxxxx9 and the last bit are inserted into the write queue. The M-box delivers the required longword, and execution begins immediately. The second execution cycle calculates the target byte address. The position, divided by eight, is added to the base address. The microsequencer generates the fourth control store address by using the next address field of the microword. No operands are selected for the next cycle, and the next instruction is issued normally.

Cycle eight is a rotation-only cycle. The one byte <8> of interest, read from the cache in the previous cycle, is rotated into the correct position (i.e., byte <0:3>), and the M-box sends the data to the E-box by issuing a response.

The third execution cycle uses the bit position to set up the special encoder in the integer unit and clear the appropriate bit. The source two register file pointer is incremented again to select the high longword from the source list. This microword branches on three conditions determined by hardware functions. The first condition indicates if the low longword of the prefetched field has a page fault. If a fault does exist, the microword flow checks whether the longword is needed or not. As noted earlier, the longword was prefetched in the hope that the bit position was within the first 64 bits of the base. If the bit is not within the first longword, the page fault can be disregarded. The second branch checks whether the position is greater than [illegible] bits. If it is greater, the microcode

next cycle is issued normally.

Cycles Ten through Fifteen

In cycle ten, the E-box initiates a byte write to the M-box. Data is passed to the M-box, and the appropriate byte is shifted to the low byte location. The sixth and final microinstruction is issued normally.

In cycle eleven, the M-box receives an explicit E-box write request to retire the BBSC instruction with a memory write. Explicit writes differ from writes initiated by the I-box in that the E-box supplies a virtual address with the data, whereas the I-box provides a virtual address and the E-box subsequently provides the data for I-box writes. However, three entries exist in the write queue for the prefetched quadword. These entries were placed in the queue for memory conflict-checking purposes and cannot be used for writing purposes because only a byte of data is being written and not a quadword. The write field command from the E-box forces the write queue control to discard the three entries. The front end of the E-box accesses the translation buffer and checks for write success during this cycle. If the write is successful, the physical address and the context of the byte are sent to the data cache.

The final execution cycle determines if the branch prediction was correct. The bit specified by the correct position is shifted to the least significant position in the shifter, where it can be used for a macrobranch comparison. The macrobranch result is compared to the I-box branch prediction in cycle twelve. The microword also indicates that the microsequencer should start forking for new macroinstructions.

Cycle twelve is the data cache lookup cycle for the byte-write operation. The data size is less than a longword. Therefore, the byte that is to be written must be merged with the seven unaffected bytes of the cache line.

Two signals are sent to inform the I-box of the branch prediction status. The branch valid signal indicates that a branch prediction validation has occurred, and the branch signal indicates if the validation was correct. The branch prediction logic receives the branch valid signal. If the prediction was correct, the pro

E-box. This process evens the flow through the pipeline and keeps the E-box busy. Figure 6 illustrates the code block as it moves down the pipe.

The first stage is the virtual instruction cache access, or fetch, stage as the instruction is read from the virtual instruction cache. Some instructions do not need an actual virtual instruction cache access but are in the I-buffer from a previous virtual instruction cache fetch. The instruction decode takes place in the decode, or XBAR, stage. The I-buffer is shifted and the fork RAM

COMPCON '90 (San Francisco: Spring 1990): 44-53.

8. S. Mishra, "The VAX 8800 Microarchitecture," Digital Technical Journal, vol. 1, no.
Matthew J. Adiletta
Richard L. Doucette
John H. Hackenberg
Dale H. Leuthold
Dennis M. Litwinetz

Semiconductor Technology in a High-performance VAX System

The VAX 9000 system is the newest member of Digital's VAX family of computer systems. The 9000 is a high-performance ECL processor, with a very fast, 16-nanosecond cycle time. To achieve this high level of performance, a new generation of semicustom and custom integrated circuits was required for the scalar CPU and the vector processing option. Goals for circuit density, performance, and skew maintenance were fulfilled with the development of a high-speed gate array, special custom chips used in key applications, and a high-speed RAM employing a new architecture.

The semiconductor requirements for the VAX 9000 system posed a number of challenges for Digital's Integrated Circuits Development Group. Those requirements included a tremendous number of equivalent logic gates (1,037,400 gates) and a large amount of RAM in the processor (3,280,000 bits). Moreover, the project's performance goal of over 30 VAX-11/780 units of performance (VUPs) required the development of state-of-the-art semiconductors and the use of innovative techniques to design them.

Given the project's goals, the IC technologists evaluated several competing semiconductor technologies and decided to implement most of the logic within the 9000 system in a high-speed, high-density, 10,000-gate array. The gate array provides a broad range of speed and power-dissipation options. Working with Motorola, the IC Group first engineered the base 10,000-gate macrocell array (MCA), which is implemented in Motorola's MOSAIC III process. Logic engineers then designed the 77 different gate array chips (options) on the base array, using a rich library of logic functions and a set of automated place and route tools.
Additionally, they designed five custom chips, invented a fast cycle time, self-timed random access memory (STRAM) architecture, and designed a multichip unit to interconnect all these high-performance ICs.

Four different design methods were used to implement the chips. The MCAx chips employ a gate array design technique. The CDxx, the VRGx, and the STRAM chips required a full custom approach. The STGx chip was implemented using a silicon compiler technique. The MULx and DIVx chips were implemented using a standard cell design approach. Statistics on 9000 system chip design are given in Table 1.

This paper describes the VAX 9000 MCA III gate array, the development of each of the five custom chips, and the STRAM architecture. Before our discussion of the gate array, we present a brief overview of the semiconductor technology used to fabricate the array and the custom chips.

Semiconductor Technology

In 1985, the VAX 8800 series was Digital's largest and most powerful system, offering single-CPU performance of eight VUPs. The 8800 CPU logic was implemented in Motorola's Macrocell Array I (MCA I) gate array, which was fabricated in MOSAIC I bipolar technology. In comparison, the VAX 9000 goal of 30 VUPs was aggressive, and the IC Group realized a new semiconductor technology was required.

At the start of the project, the technologists evaluated semiconductor vendors to determine what was the "best" technology available to implement the new system. CMOS, BiCMOS, bipolar, and GaAs IC technologies were evaluated. Among the factors considered were logic density, gate delays, on- and off-chip interconnect delays, manufacturing risks, and product delivery.
Although very high gate densities were available with CMOS technology, the logic gate delays proved to be too slow to meet the cycle time requirement. Also, the CMOS output circuits could not drive signals off-chip into a 50-ohm transmission line as quickly as a bipolar transistor, which limited the speed of signals between ICs.

Table 1  VAX 9000 Chip Statistics

Chip   Description                      Die Size (mm)  Signal Pins  Transistor Count  RAM Bits  Power (Watts)
MCAx   MCA III gate array chip          9.8 x 9.8      256          40.1K             -         30
CDxx   Clock distribution chip          6.2 x 6.2      170          7.2K              -         13.9
STGx   Self-timed register file chip    9.8 x 9.8      152          29.3K             -         17.8
MULx   Multiplication chip              9.8 x 9.8      182          48.4K             -         30.9
DIVx   Division chip                    9.8 x 9.8      112          29.2K             -         23.9
VRGx   Vector register file chip        9.8 x 9.8      198          76.0K             9216      24.9
1KSR   1K x 4 self-timed RAM            4.9 x 3.6      33           28.0K             4096      2.4
4KSR   4K x 4 self-timed RAM            6.4 x 4.2      35           103.0K            16384     2.4

BiCMOS offers the advantage of highly dense CMOS coupled with bipolar drive capability. However, the technologies available at the time were optimized for the best CMOS transistors with a compromised bipolar device. This approach limited the overall performance of the circuit to a level roughly equivalent to that of previous generation bipolar devices, which would not be aggressive enough to meet the CPU performance needs.

Gallium arsenide (GaAs) ICs offer a theoretical performance advantage of between two and three to one over silicon implementations. The group found IC densities were lower than those of bipolar devices, however, and the on-chip speed advantage was countered by the need for more off-chip signals in the critical paths of the CPU. Also, because the manufacturing technology of GaAs ICs was immature, very few companies had attempted to sell GaAs into the commercial marketplace. So while this technology was considered for a time in some applications where alternatives also existed, GaAs was eventually dropped from consideration because of the uncertainty of availability.

The IC Group also studied Motorola's third generation of their oxide-isolated self-aligned implanted circuits (MOSAIC III) bipolar technology.[2] It offered a factor of six in speed advantage over the previously used MOSAIC I technology and had the potential of providing eight to ten times the logic density. Although not as dense as CMOS or BiCMOS, MOSAIC III was much faster than either of those technologies and much denser than any available GaAs technology. In addition, although many of the manufacturing steps were new, most of them were based on previously proven techniques. The group therefore concluded that MOSAIC III was best suited to meet the challenges of the VAX 9000 system.

The MOSAIC III process is an advanced silicon bipolar process which yields a transistor structure with polysilicon base, emitter, and collector electrodes, polysilicon resistors, and three layers of metalization. Compared to the MOSAIC I device used in the 8800, the critical collector-base junction of this transistor structure takes up approximately 50 percent less area, as shown in Figure 1. Combined with shallower junctions and reduced base resistance, the intrinsic device performance was improved by a factor of three. Further, the polysilicon resistor produced with this process has far lower parasitic capacitance than the MOSAIC I monosilicon resistor. Some key performance modeling parameters and density metrics are provided with the figure.

The VAX 9000 packaging imposed other requirements on the semiconductor technology. Power dissipation increased from 5 watts for the MCA I to 30 watts for the MCA III because of the increase in gate density from 1,200 to 10,000 gates.
Therefore it was determined that all chips should be mounted directly to the multichip unit cold plate for optimum cooling. For manufacturing economy, it was desirable to bond the multiple leads of the chip directly to the pads on the high-density signal carrier (HDSC). Consequently, all CPU chips must be provided to the multichip unit assembly site in a tape automated bond (TAB) package. As shown in Figure 2, chips are mounted in a plastic carrier suitable for automated handling, and the surface of the die is protected from mechanical damage with an epoxy encapsulant.

Figure 1  Comparison of MOSAIC III and MOSAIC I Devices

Parameter             MOSAIC I       MOSAIC III
NPN fT                5 GHz          16 GHz
Rb                    1475 ohms      400 ohms
CJC                   50 fF          20 fF
CJE                   45 fF          24 fF
CJS                   185 fF         54 fF
Drawn emitter size    3 µm x 4 µm    1.75 µm x 4 µm
Metal 1 pitch         8 µm           4.5 µm
Metal 2 pitch         15 µm          7 µm
Metal 3 pitch         -              12 µm

MCA 10K Gate Array

A high-performance emitter coupled logic (ECL) gate array with 10,000 equivalent gates and 256 inputs/outputs has been developed for the VAX 9000 system. The gate array design approach used in the VAX 9000 system ensures the shortest possible turnaround time from option mask to hardware, thereby reducing the system design time. In this approach, cell boundaries are defined with all transistors and resistors fixed within the cells. When a cell function is selected from a predefined cell library, the cell customization occurs at the metal between the transistors and resistors. Then, to define the function of the gate array option, the metalization between cells is customized. This approach allows the semiconductor foundry to build many wafers up to the customization level; when a gate array is to be built, only the custom metal is required. As noted above, 77 different 10K ECL gate array options are used in the VAX 9000 system. This gate array has a rich selection of logic cells with different power settings for the logicians to use to meet performance and power requirements.

Using Rent's Rule, technologists maintained a balance between the number of gates and the package I/O count. This balance ensures that a maximum number of logic cells for a given signal pin count are available for the logic designers. Technologists evaluated several key factors to determine the gate array physical layout and to ensure its success:

• Area of the silicon chip versus yield
• I/O pad pitch
• Maximum power dissipation
• Speed of the gates
• Maximum number of logic cells

Successful trial layouts of the 10K ECL gate array floor plan were completed before any VAX 9000 options were started. The gate array floor plan, shown in Figure 3, comprises a central core area of 414 major (M) cells, divisible into quarter cell functions, arranged in an array of 20 rows and 21 columns, less 6 sites for the master bias generators and special clock generator circuits. The number of transistors used in a quarter cell is based on the logic cell most frequently used in the 10K ECL gate array, the scan latch. A ring of 200 output (O) cells is interspersed with 224 interface (I) cells. The ring surrounds the internal cells and interfaces the pad drivers with the internal cells. The 256 I/O pad cells along with the 104 power pads are located around the perimeter of the 10K gate array. The metalization system uses three interconnect layers.
The customized routing channels reside on the first and second metal layers with interconnecting vias between the two layers of metal. The top metal layer and parts of metal 1 and 2 provide power and ground distribution.

The 10K ECL gate array used in the VAX 9000 is approximately ten times more dense than the ECL gate array used in the VAX 8800 system. The gate delays in the 9000 are improved six times over gate delays in the VAX 8800. Table 2 compares the 10K ECL gate array used in the VAX 9000 to the ECL gate array used in the VAX 8800.

Table 2  Comparison of Number of Cells and Delays in the VAX 8800 and VAX 9000 Gate Arrays

                             VAX 8800 Gate Array        VAX 9000 Gate Array
Internal major cells         48                         414
Output cells                 26                         200
Input cells                  25                         224
Input cells gate delay       1.05 nanoseconds           175 picoseconds (high power)
Metal delay (fall delays)    2.6 picoseconds per mil    1.3 picoseconds per mil

Previous gate array designs, in general, have provided only two levels of series gating, thereby limiting the complexity of functions that can be designed with one current switch. Within this gate array, three levels of series gating at both internal and output macrocells provide additional "AND" (product) gate functions at very high speed with one switch delay and at a lower power level. Figure 4 compares three-level series gating and two-level series gating for a "2-3-4-4 AND/OR" logic function (internal gate). Table 3 lists the differences in typical gate performance for a low power gate. The table also compares low power gate and high power gate. Notice the power difference between the two-level and three-level high power gate.

All current switches within the array are powered from the main supply voltage VEE1. Three-level series gated functions are implemented in the VAX 9000 gate array option, which requires VEE1 to be set to -5.2 V.
Input cells are powered from a second, lower supply voltage VEE2 (-3.4 V) to save power. The output emitter followers of M, I, and O cells as well as series-terminated ECL (STECL) output followers employ constant current source pulldowns to VEE2 to save power. The constant current source pulldowns minimize the sensitivity of AC performance to variations in power supply. This same termination scheme was used in VAX 9000 custom chips.

One of the technologists' main goals was to minimize power consumption of each macrocell while obtaining the highest possible performance from the 10K ECL gate array. The overall 10K ECL gate array power is limited to 30 watts because of the cooling requirements, the internal power distribution, and the current density limits on power pins.

A unique feature included in the 10K ECL gate array that previous gate arrays do not have is series-terminated ECL (STECL) outputs. STECL outputs include a constant current source pulldown and a series terminating resistor. This feature allows the elimination of off-chip termination resistors used in conventional 50-ohm ECL outputs. STECL outputs allow shorter interconnections between chips on the multichip unit because the chips can be placed closer to each other, thus improving performance. Another advantage of using STECL outputs over 50-ohm outputs is that less than half of the simultaneous switching output noise is coupled to unswitched outputs. All custom chips used in the VAX 9000 employ STECL termination.

Table 3  Comparison of Two-level and Three-level Series Gating

                                                 Two Levels of Gating    Three Levels of Gating
Gate delay from input pin A to output pin YA
  (low power)                                    300 picoseconds         250 picoseconds
Low power gate                                   9.88 milliwatts         8.84 milliwatts
High power gate                                  18.20 milliwatts        13.00 milliwatts

Figure 2  Chip in TAB Package Mounted on Plastic Carrier and Encapsulated

Figure 3  Photomicrograph of the Gate Array

Figure 4  Two-level Functions versus Three-level Functions

Clock Distribution Chip - CDxx

The major function of the clock distribution chip (CDxx), shown in Figure 5, is to distribute master and reference clocks to each MCA on a multichip unit. There are eight pairs of differential master and reference clocks. The chip also supplies clocks to all STRAMs on the unit. Each of the four groups of six STRAM clocks can be programmed to one of eight possible clock phases. This flexibility in programming allows the system designer to select the appropriate clocks for STRAMs in order to meet system timing requirements.

Figure 5  Photomicrograph of CDxx Chip

In addition to providing the functions above, the design goals for the CDxx project included the following:

• Minimize the space occupied by the chip on the multichip unit
• Provide scan control and scan distribution
• Include a wideband amplifier
• Ensure low clock skew
• Provide a temperature-detecting circuit

Minimizing the real estate occupied by the chip was complicated by additional functions located on the CDxx, such as scan and the temperature-detecting circuits. The minimization was accomplished by employing a custom chip design approach in which each element (cell) is optimized and then manually placed and routed to achieve a compact design. As it turned out, the size of the chip was not determined by the amount of real estate needed to implement the circuits, but rather by the number of pins required to communicate to the rest of the multichip unit.

Since a CDxx is mounted on every multichip unit in the CPU, the scan distribution and control logic are located on this chip. The CDxx chips in the system are chained together on the system scan bus. Each CDxx receives its scan control signals from the previous CDxx in the chain or from the service processor. As shown in Figure 5, there are three scan rings located on the CDxx. Ring 12 is a 16-bit ring reserved as the CDxx STRAM clock generation control ring. This ring controls the STRAM clock phase selection and enable for each of the four STRAM clock groups. Ring 13 is a 14-bit ring reserved for the CDxx scan control. Data is shifted into this ring and then loaded into CDxx control registers. Ring 14 is a 47-bit ring reserved as the CDxx information scan ring. Data is loaded into this ring from CDxx data registers and shifted out to the service processor.

The design of the wideband amplifier was prompted by the need for the clock distribution chip to receive two differential sinusoidal master and reference clock signals as inputs. These signals are transformer coupled from the clock source. The master clock runs at one eighth the system cycle time, and the reference clock runs at the system cycle time. The wideband amplifier receives differential sinusoidal signals of relatively small amplitude, less than 125 millivolts peak to peak, and transforms them to 100K ECL levels on output. The design of the input circuits meets these criteria and typically functions with inputs less than 65 millivolts.

All the clocks are distributed by the CDxx as pairs of differential signals.
The distribution of these clocks is, of course, to be done with minimal clock skew. Clock skew is the difference in delay time between different clock outputs measured from a common point. The common point in this case is the number of master clock inputs to the chip. To maintain low clock skew, technologists designed fast gates and minimized the number of cascaded gates in the clock path. Also, all the metal that interconnects the cells in the clock path is controlled for equal delay. As a result, the measured clock skew is less than 100 picoseconds on a chip for master, reference, and STRAM clocks. The delay of master clock input to output is less than 1 nanosecond (ns).

The temperature-detecting circuit on the CDxx warns the system when a device junction temperature approaches the maximum allowed temperature on a multichip unit. As implemented, the circuit is controlled from the system console. The console loads the CDxx with a number that represents the temperature the circuit must use as a point of comparison. If the junction temperature of the CDxx is higher than the programmed value, the circuit trips and notifies the console of a temperature problem. The console then takes corrective action.

Self-timed Register File Chip - STGx

The self-timed register file chip (STGx) is employed in the VAX 9000 to provide four register banks accessible through multiple read and write ports. The four banks include a microcode scratch-pad register bank, the VAX general-purpose register set, a memory data register storage bank, and an instruction data register bank. The performance requirements for the STGx were quite rigid and guided several key design decisions, including density and layout. The read access time was to be less than 5 ns. The write access time was to be less than 6 ns. In other words, the chip must read or write any one of its 64 locations in 5 or 6 ns, respectively.
Both goals have been met. In fact, the read access time is typically less than 4 ns, and the write time is typically less than 5 ns. Figure 6 is a photomicrograph of the STGx chip.

The STGx is a 64-word by 18-bit ECL register file containing three write ports and two read ports. The 64 words are separated into four 16-word by 18-bit storage array sections. Each of the four storage banks has dual read capability. Storage bank one has dual write capability; storage banks two and three have triple write capability; and storage bank four has single write capability. Simultaneous write access to the array is possible through all ports with correct results occurring; the only exception is in the case of writes to the same location from multiple ports, which is an undefined operation. A write followed by a read access to the array, even to the same address, is possible with correct results occurring.

The chip has two clock inputs for controlling reads and writes. One requirement for the design was to include a self-timed write capability so that the system need not provide properly timed write pulses to the chip. In the system, the chip is clocked with STRAM clocks for reading and writing. The design uses these clocks to latch read address information, to latch write address information, and to latch input data. In addition, the design takes the leading edge of the write clock to generate a delayed write pulse. The delayed write pulse is used to write the appropriate word in the 64-word by 18-bit array, taking into account the time needed to decode the write address.

The design style used to implement the self-timed register file chip is similar to a silicon compiler technique. The chip's storage area is made up of four arrays. The input address register for both read and write ports, the input data latches, and the data output drivers are arrangements of cells in strips.
The placement and routing of these arrays and strips was procedurally performed using custom layout tools. Once the blocks were assembled and placed, interconnections among blocks, strips, and pins were then routed manually.

Figure 6  Photomicrograph of STGx Chip

Multiplication Chip - MULx

The architecture of the scalar processor defined an integrated floating point processor. Unlike most RISC processors, which off-load all floating point operations to a separate floating point processor, the VAX 9000 system handles floating point operations within the E-box.[1] The multiplication unit therefore supports both integer and floating point formats. To achieve this support, a custom chip was required that provided superior performance, special logic gates, and improved density. Custom chip technology provided enough density to accommodate a 32-bit by 32-bit, eight-logic-level multiplication array in a single chip (MULx). To minimize the cost and time of custom design, designers employed standard cell design techniques in which the cell height was fixed and the width could vary to take advantage of packing density. By constraining the design in this fashion, the High Performance Systems Group's CAD suite could be employed to place and route the chip. Special logic gates eliminated three logic levels, and high-powered fast gates provided the performance to permit a 32-bit by 32-bit multiply operation in less than 9 ns. Figure 7 shows a photomicrograph of the MULx chip.

Figure 7  Photomicrograph of MULx Chip

Three MULx chips were required in the scalar processor to achieve double-precision performance in which every 64 ns a 56-bit multiplication could complete. Each MULx chip has two 32-bit input data buses. The MULx chip is also employed to perform all integer multiply operations in a single 16-ns cycle. The scalar processor, which has 32-bit-wide data paths, delivers double-precision input data in two cycles. In the first cycle, each MULx consumes the most significant high bits of each operand. All three MULx chips latch this data while also unpacking it, multiplying it, and then latching the product. One of the MULx chips' results is then saved. In the second cycle, the remaining double-precision data, the least significant low bits, is consumed, and each MULx chip unpacks the data and performs a unique multiply: operand A high bits and operand B low bits; operand A low bits and operand B high bits; and operand A low bits and operand B low bits. An MCA III gate array accumulates all these results, and another rounds and packs the bits into a VAX floating point product. Since each MULx needs to know which partial product it must compute in the second cycle, two personality bits are included that are loaded by means of the system scan chain.

MULx chips are also used in the vector processor. The vector processor (V-box) has 64-bit-wide data paths. Four MULx chips are employed to complete a double-precision multiply every 16 ns. Since operand unpacking differs between the scalar and vector processors as a result of how fast operands are delivered, each MULx has an additional personality bit for indicating whether the MULx is in the V-box or E-box.

The MULx chip, as used in both the scalar and vector processors, is a 32-bit by 32-bit ECL parallel multiplier which is fully pipelined for a 16-ns cycle time. It performs both two's complement and sign/magnitude multiplication. In a single cycle, the chip unpacks VAX floating point formats F, D, and G, or the integer formats long, word, and byte; performs exponent calculations and sign handling; and completes up to a 32-bit by 32-bit multiplication. If the operation is double precision, the 64-bit result is a partial result. It must be accumulated with three other partial results to form the double-precision, correctly rounded, and normalized product. If the operation is an integer type, then the 64-bit two's complement result is the VAX integer product. Along with producing this integer product, MULx also produces the correct condition codes.

Integer operations require one machine cycle to complete. Operands are not latched at input. Instead they are immediately unpacked and sent to the multiplication array. This multipurpose array then produces a set of sum and carry product vectors. These vectors are then added in a full carry lookahead adder (CLA). This adder comprises a 31-bit adder and a 32-bit adder, cascaded. The produced sum is the 64-bit product, which is then latched. The output of the latch is used to compute integer-type condition codes.

The integer instructions supported include VAX MULB, MULW, and MULL. EMUL is also directly supported, along with the Z and N bit condition codes. Finally, to assist in H format-type multiplications, a true 32-bit by 32-bit magnitude multiplication is also supported, called EXTMUL (extended multiply). There is a 64-bit data path back into the E-box for EMUL- and EXTMUL-type operations.

Six features of the MULx design that improve performance and minimize logic should be noted. First, unlike traditional designs, the MULx design does not include Booth recoding of the multiplier operand.
Booth recoding offers no logic savings either in timing or real estate when the multiplication array reduction scheme is optimal. Second, a Baugh-Wooley two's complement algorithm was used to implement integer multiplication.[4] Third, engineers designed special full adder logic gates to integrate multiplication summand generation into the full adder cell and to eliminate the need for an additional logic level. Fourth, a unique multiplication reduction algorithm was developed which provides the initial routing advantages of a Wallace tree, with the minimal logic of a Dadda tree.[5,6] Fifth, a ripple is formed in the reduction array. The ripple facilitates the start of the least significant 31-bit CLA addition at least one logic level sooner than the most significant 32 bits and does not require a carry-in input to the upper 32-bit adder. Finally, by developing a very fast 4-3-2-1 AND/OR gate, engineers were able to remove two additional logic levels in both CLA adder networks.

To avoid bugs in the array design, since bugs in an array consisting of 1000 full adders could have significantly affected the product shipment schedule, engineers developed a FORTRAN program to logically interconnect and physically place the array. Any bugs would be algorithmic and not random, and algorithmic bugs should be obvious. In addition, by algorithmically placing the array, significant density improvements were realized. This program provides a Wallace-Dadda implementation that logically reduces 32 rows in 8 logic levels, and consumes as many initial summand bits. It also uses the fewest full adders theoretically possible, while delivering the least significant 32 bits of sum and carries at least one full logic level sooner than the most significant bits.
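As a rough illustration of the two-cycle double-precision scheme described for the scalar processor, the sketch below splits two wide operands into high and low halves, forms the four partial products, and accumulates them into the full product. The 32-bit split, the function names, and the use of plain Python integers are assumptions made for illustration; the fraction widths of the VAX formats, the unpacking, and the rounding and normalization performed by the gate arrays are not modeled.

```python
HALF_BITS = 32                    # assumed width of each operand half
HALF_MASK = (1 << HALF_BITS) - 1


def split(value):
    """Return the (high, low) halves of a non-negative operand."""
    return value >> HALF_BITS, value & HALF_MASK


def partial_products(a, b):
    """Form the four half-by-half partial products of a and b."""
    a_hi, a_lo = split(a)
    b_hi, b_lo = split(b)
    return {
        "hi_hi": a_hi * b_hi,  # high bits of both operands (first cycle)
        "hi_lo": a_hi * b_lo,  # the three "unique" multiplies (second cycle)
        "lo_hi": a_lo * b_hi,
        "lo_lo": a_lo * b_lo,
    }


def accumulate(products):
    """Weight each partial product by its position and sum them."""
    return (
        (products["hi_hi"] << (2 * HALF_BITS))
        + ((products["hi_lo"] + products["lo_hi"]) << HALF_BITS)
        + products["lo_lo"]
    )
```

For any two operands below 2**64, `accumulate(partial_products(a, b))` equals `a * b`, and each individual partial product fits in 64 bits, which is why a 32-bit by 32-bit multiplier is sufficient as the building block.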
Division Chip - DIVx

The iterative divide function performed by the division chip, DIVx, requires a significant amount of hardware, the density of which a standard cell chip affords. Two gate arrays would be required to perform the same function, in which case a timing critical path crossing would occur between the two chips. Therefore, the IC designers implemented the DIVx chip as a standard cell design by building on the techniques developed for the MULx chip described above. Also, like the MULx design, the goals for the DIVx design project were to optimize performance and minimize real estate use by fitting the iterative divide function in a single chip.

The IC designers employed a standard cell technique in which four horizontal sections are defined, each section having a different number of columns. Reference cells are located in the center row of each section and provide ECL reference voltages to the cells above and below in that section's columns. Placement was driven for performance, with quotient selection logic being distributed to where it was required. This method made for an irregular structure, as can be seen in Figure 8.

The VAX 9000 system optimizes both multiplication and division by providing separate functional units. Each functional unit performs both integer and floating point operations. This approach differs from the one taken by most processor architects, who conceptually link multiplication and division. Usually, algorithms are chosen that can share hardware at the expense of the performance of either operation. The separate division unit in the 9000 provides superior performance for both integer and floating point operations. The DIVx chip is also used by the V-box to perform very fast vector division operations, as shown in Table 4.

Division is an iterative process. Unlike the case of multiplication, one cannot predict the summands and then reduce the summand matrix.
The two approaches to division most commonly used are the Taylor Series convergence algorithm and a subtract and shift algorithm. The algorithm employed in the 9000 is a variation on the subtract and shift method, which allows for savings in hardware as well as increased performance.

Table 4 Division Performance

Data Type            Cycles    Time (Nanoseconds)
Integer:
  byte               3-4       48-64
  word               3-5       48-80
  long               3-8       48-128
Floating point:
  F-format           7         112
  D-format           13        208
  G-format           12        192

Figure 8 Photomicrograph of DIVx Chip

In this method, an imprecise quotient is selected based on a truncated estimated partial remainder and a truncated version of the exact divisor. This imprecise quotient digit is corrected when the next guess quotient digit is selected. The selected digits may be positive or negative. The positive digits are accumulated in a positive-value shift register. The negative digits are accumulated in a negative-value shift register. The final corrected binary quotient is then formed by subtracting the negative register from the positive register.

The algorithm is based on a signed digit notation scheme. To determine two quotient bits, the bits may be chosen from a digit set that includes {-2, -1, -0, +0, +1, +2}. The digit set is simply an expanded form of the common nonrestoring digit set that typically uses {-1, 0, +1}. In nonrestoring algorithms, the quotient is normally corrected as needed; whereas here, it is not corrected until the entire iterative process is completed. The next significant difference between this division technique and the nonrestoring method is that the quotient bits selected are based on an estimate of the partial remainder and divisor rather than the exact values. The first advantage of this method is that an estimate can be obtained faster than the exact value.
Second, a truncated estimate is acceptable, rather than a full-width estimate. Consequently, this method saves a significant amount of hardware and increases the speed of the operation. If one were to complete each partial remainder, up to three additional chips would be required and the delay would more than double.

The trick to the method lies in the quotient selection. The selection is based on partial remainder range transformations which guarantee that a quotient digit selected in one iteration may be corrected to the exact quotient digit on the next iteration. Therefore, although six quotient digits are determined per major iteration, an additional minor iteration is required to guarantee the least significant digit of the major iteration.

The major and minor iteration terms refer to the architecture of the divide iterative hardware. The DIVx produces six quotient bits per machine cycle. This is a radix 64 division technique. However, the high radix division is accomplished by overlapping lesser radix divisions. In particular, there are three sets of radix 4 division groups. The first two sets are overlapped, so that the critical path through the radix 64 division is actually the critical path through two radix 4 divisions. A minor iteration is the path through one radix 4 division group. A major iteration is the path through the overlapped set of two radix 4 division groups, followed by the final radix 4 group. It is important to note that extra iterations do not adversely affect the corrected quotient.

Finally, to produce the corrected quotient, the set of negative quotient digits is subtracted from the set of positive quotient digits, where each digit is properly radix 2 weighted, based on the order of selection. (That is, the first quotient digit selected is the most significant bit of the correct quotient.)
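The signed-digit bookkeeping described above can be sketched in a few lines. The following is a minimal radix 4 model, not the DIVx selection logic: it picks each digit from {-2, ..., +2} using the exact partial remainder rather than a truncated estimate, does not distinguish -0 from +0, and works on non-negative integers. The function name, the digit-count loop, and the final remainder fix-up (added so the result matches ordinary integer division) are assumptions made for illustration.

```python
def signed_digit_divide(dividend, divisor):
    """Radix 4 signed-digit division; returns (quotient, remainder)."""
    if dividend < 0 or divisor <= 0:
        raise ValueError("sketch expects dividend >= 0 and divisor > 0")

    # Choose enough radix 4 digits that |remainder| <= (2/3) * divisor * 4**digits.
    digits = 0
    while 3 * dividend > 2 * divisor * 4 ** digits:
        digits += 1

    positive = 0           # models the positive-value shift register
    negative = 0           # models the negative-value shift register
    remainder = dividend

    for position in reversed(range(digits)):
        scaled = divisor * 4 ** position
        # Nearest integer to remainder / scaled, clamped to the digit set.
        digit = (2 * remainder + scaled) // (2 * scaled)
        digit = max(-2, min(2, digit))
        remainder -= digit * scaled
        # Both registers shift by one radix 4 digit (two bits) per step.
        positive = positive * 4 + (digit if digit > 0 else 0)
        negative = negative * 4 + (-digit if digit < 0 else 0)

    # The quotient is formed only once, after all digits are selected.
    quotient = positive - negative
    if remainder < 0:      # fix-up so that 0 <= remainder < divisor
        quotient -= 1
        remainder += divisor
    return quotient, remainder
```

Dividing 100 by 7 in this model selects the digits +1, 0, -2, so the positive register holds 16, the negative register holds 2, and their difference is 14. A negative digit is what corrects an earlier digit that overshot, which is why no correction is needed during the iterations themselves.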
Vector Register File Chip - VRGx

The VAX 9000 architecture adds vector instructions to the standard VAX environment; thus, a vector register file was required. There were two primary design requirements for the vector register file. First, the register file and associated cross-bar logic had to fit in a single multichip unit; and second, the register file had to perform read and write at different addresses within a single 16-ns clock cycle. These requirements could not be met with available memory and logic chips, thus necessitating the development of a fully custom vector register chip.

The vector register file is 64 bits wide and consists of 16 vector registers with 64 elements each. The vector register chip, VRGx, was developed as an 8-bit slice of the 64-bit vector register file. The chip contains 9216 bits of RAM for data storage and the cross-bar logic (6000 equivalent gates) that allows access from the five read ports and three write ports. Integrating the register memory and the cross-bar logic on the same chip allowed timing to be optimized so that the system timing requirements were met.

VRGx Chip Physical Features and Organization

The VRGx chip is fabricated using the MOSAIC III ECL process, which was not designed as a memory process. Coordination with the vendor resulted in the addition of an implant step for the memory-cell bit line emitters. Key features of the process are three metal interconnect layers, oxide isolation, and polysilicon emitters with a drawn width of 1.75 microns. Figure 9 shows the locations of the major circuit blocks in the VRGx chip.

The major blocks of the VRGx chip are five read ports, three write ports, and 16 vector registers in the RAM bank array. The block diagram, Figure 10, shows the main data paths. The 16 vector registers are implemented as 64-word by 9-bit single port RAMs.
Eight bits are a slice of the 64-bit vector register file and the ninth bit is for byte parity.

Timing

A register RAM can be read from one address and written from a different address in one 16-ns clock cycle. This dual operation is made possible by a 2 to 1 multiplexer on the RAM address inputs. The read address is applied during the first portion of the cycle, and the write address is applied during the second portion of the cycle. Splitting the clock cycle into read and write portions eliminates conflict between read and write ports in the event that a single register RAM is selected for both read and write. Read data is held in a latch during the second portion of the cycle and is unaffected by the write operation.

A single clock cycle consists of nonoverlapping clock phases A and B. Latches on the read and write port inputs are clocked by phase A, and read port output latches are clocked by phase B. For a read operation initiated on phase A, the output read data becomes valid during phase B.

Figure 9 Photomicrograph of VRGx Chip

Cross-bar Logic

Cross-bar logic in the RAM bank array makes each of the 16 vector register RAMs independently accessible from the read and write ports. Enable inputs on the ports prevent invalid addresses from conflicting with intended addresses. Read and write ports may point to the same register RAM, but different write ports may not point to the same RAM. Also, different read ports may only point to the same RAM if the vector element address is the same. All conflicts must be resolved external to the chip.

A read port consists of an enable, a 4-bit register select, a 6-bit vector element address, and a 9-bit output. An enabled read port applies a register select code that points to a particular RAM bank. At that RAM bank, a 5 to 1 multiplexer selects the vector element address from the active read port and applies it to the read address of the RAM.
Then the RAM output passes through a 16 to 1 multiplexer controlled by the register select code, so that the selected RAM output reaches the output of the active read port.

A write port consists of an enable, a 4-bit register select, a 6-bit vector element address, and a 9-bit write data input. An enabled write port applies a register select code that points to a particular RAM bank. At that RAM bank, a 3 to 1 multiplexer selects the vector element address from the active write port and applies it to the write address of the RAM. Also, a 3 to 1 multiplexer selects the write data from the active write port and applies it to the RAM data input.

Figure 10 VRGx Chip Block Diagram

RAM Technology

The normal transistors in an ECL process are of the NPN type, where the collector is a buried N-doped region. For memory cells, a lateral PNP transistor is placed in the same collector region, and the combined structure has the latching characteristics of a silicon controlled rectifier (SCR). The memory cell array in the 64 by 9 register RAMs is implemented with ECL SCR memory cells.

The SCR memory cell shown in Figure 11 consists of two cross-coupled SCR structures. Extra NPN emitters connect to the bit lines and provide a means of writing and sensing the cell. The "on" side of the cell saturates, allowing the bit line emitter to conduct in the inverse mode. Inverse gain of the bit line emitters must be limited to avoid excessive leakage into the unselected cells.
An added process step applies a special base implant to the bit line emitters only to control their inverse gain.

Advantages of the SCR cell include good density, low standby power, large sense voltage differential, and low sensitivity to alpha-particle-induced soft errors. The cell has one limitation: excess charge storage due to write current can delay subsequent writing to the opposite state. This problem is eliminated with a special bit line current steering circuit that makes write current state dependent (Figure 11).

The SCR memory cell in Figure 11 is written by applying a high current (four times read current) to the "off" bit line emitter. The current steering transistors prevent this current from reaching a bit line emitter that is already "on." Thus, attempting to write a cell that is already in the desired state does not result in any additional cell current beyond the normal read current, and no additional charge storage occurs.

Figure 11 SCR Memory Cell with Bit Line Current Steering Circuit
KEY: WC = write control; UWL = upper word line; BL = bit line (left); BR = bit line (right); LWL = lower word line; VA = voltage reference

Other Chip Features

Other noteworthy chip features include scan logic, parity error detect logic, and a data pipeline for write port 0 data. Scan operation gives access to the register RAMs. In a single scan-in and scan-out operation, it is possible to read five registers and to write three registers.

Parity checking logic is used to detect input errors and set error flags. There is a parity check on the 9-bit write port data inputs. Another parity checker is applied to address and control inputs. These are assigned to three parity groups, with a parity bit input for each group.

The write port 0 data pipeline allows a delay of one,
two, or three clock cycles to be selected, delaying the write port data as necessary to resolve register access conflicts.

Self-timed RAM

In the VAX 9000 system, as in any high-performance CPU, fast memory is used for cache and control store applications. Engineers traditionally use very fast static RAMs within the CPU for memory. Logic designers, however, have long recognized that CPU performance is often limited as a result of the time needed to access data in these RAMs. This limitation is not only the result of the access time and write cycle performance of the devices themselves, but also of the off-chip circuitry and interconnect used for write pulse generation and distribution. The logic designers and technologists for the VAX 9000 knew that unless some architectural improvements were made to the traditional static RAM, many of the RAM performance improvements would be lost in the wiring interconnect. They also realized that Digital's memory suppliers would have to be convinced that a new RAM architecture would be marketable to their other customers. After several design iterations, the technologists submitted a set of specifications for a synchronous, self-timed RAM (STRAM) to several suppliers for their review. After extensive market surveys, our memory suppliers agreed that this new architecture could eventually become a new standard for high-speed static RAMs.

The VAX 9000 system requires two configurations of the basic STRAM device: 1K words by 4 bits, and 4K words by 4 bits. A block diagram of the STRAM is shown in Figure 12. The STRAM is similar to the traditional RAM in that it has chip select, input address and data, and output data. However, the STRAM also has several nontraditional inputs such as write, a differential clock, and a reference voltage (Vbb).
Latches added to all inputs and outputs provide pipelined timing. An internal write pulse generator controls write operations and eliminates the need to generate and distribute the write pulse signal externally on the module. Also, two optional output configurations are provided: a 50-ohm drive open emitter for standard parallel termination on the module, and a resistor and pulldown current source which is wired externally to implement STECL or on-chip source termination.

The clock buffer design allows inputs to be driven differentially from off-chip to minimize clock skew. The clock buffer is also designed to accommodate customers who are not greatly concerned about skew or who may be more concerned about conserving routing area. One input of the clock buffer may be tied to the output pin of the reference generator which provides the standard ECL threshold voltage (Vbb), allowing the other input of the clock buffer to be driven in a single-ended mode.

Input and output latches are clocked on opposite edges of the internal differential clock buffer. Timing diagrams are shown in Figure 13. On a falling edge of CLK H, data and address inputs flow into the RAM array. If write is asserted during the next rising edge of CLK H, then a write cycle is initiated, and the input data is stored in the memory at the address presented at the ADR inputs. At the same time, the data is passed through the multiplexer and the output latch. If write is deasserted on the rising edge of CLK H, then the STRAM is in a read cycle and input data is ignored. The data stored in the RAM at the address presented at the ADR inputs flows out to the multiplexer and output latch. If chip select (CS) is deasserted prior to the rising edge of CLK H, then write and read operations are disabled and the output latches are reset low.

For proper operation of the STRAM, certain timing requirements must be fulfilled.
The write operation is terminated by either the falling edge of CLK H or by the internal write pulse generator, whichever occurs first. Therefore CLK H must be asserted long enough to ensure that data is properly written into the memory array. The internal write pulse generator provides an output having the proper duration as determined by a string of gates. Also, the assertion of the internal write pulse signal must be delayed by an amount equal to the internal access time of the RAM. In this way, the correct data is stored, and not the data previously stored in the input registers. The delay is accomplished by the row delay circuit, which is also simply a string of gates. These features give the STRAM its "self-timed" nature.

Figure 12 STRAM Block Diagram

Figure 13 STRAM Timing Diagrams
NOTE: Clock high state must last long enough to complete a write cycle.
KEY: RD = read operation cycle; WR = write operation cycle

Acknowledgments

The authors would like to acknowledge the following individuals who participated in and contributed to the success of the VAX 9000 project: Jerry Weisbach, Andy Moroney, Bob Haller, Marc Lamere, Mark Hamel, Tom Senna, Dave McCall, Patty Kroesen, Rick Jones, Jim Jensen, Terry Skrypek, Eugene Marteney, Paul Guglielmi, Elaine Fire, Larry Herman, Bill Grundmann, Mark Pascarelli, Fran Richard, Linda Greska, Jack Mason, Chris Caiazzi, Roger Dame, Mike Normand, Steve Sullivan, Rob Reinschmidt, Bob Bechdolt, Mike Warder, Mike Hickman, Brian Sadler, Wayne Nunn, Rita Wespi, Gene Yee, Bruce Smith, Alisyn Emerson, Jim Glanville.

References

1. D. Marshall and J.
McElroy, "VAX 9000 Packaging, The Multi-Chip Unit," Proceedings of COMPCON '90 (Spring 1990).
2. P. Zdebel et al., "MOSAIC III - A High Performance Bipolar Technology with Self-Aligned Devices," Proceedings of IEEE 1987 Bipolar Circuits and Technology Meeting.
3. D. Fite and T. Fossum, "Designing a VAX for High Performance," Proceedings of COMPCON '90 (Spring 1990).
4. C. Baugh and B. Wooley, "A Two's Complement Parallel Array Multiplication Algorithm," Short Note at COMPCON 73, 7th Annual IEEE Computer Society International Conference (February 1973).
5. C. Wallace, "A Suggestion for a Fast Multiplier," IEEE Transactions on Electronic Computers, Vol. EC-13 (February 1964): 14-17.
6. L. Dadda, "Some Schemes for Parallel Multipliers," Colloque sur l'Algebre de Boole (January 1965).
7. K. Hwang, Computer Arithmetic Principles, Architecture, and Design (New York: John Wiley and Sons, 1979): 213-283.

Richard A. Brunner
Dileep P. Bhandarkar
Francis X. McKeen
Bimal Patel
William J. Rogers, Jr.
Gregory L. Yoder

Vector Processing on the VAX 9000 System

The VAX 9000 system provides the first emitter-coupled logic (ECL) implementation of the VAX vector architecture. The optional vector processor on the VAX 9000 system addresses the computing needs of numerically intensive applications with a peak performance of 125 MFLOPS for double-precision calculations. The innovative design of the vector register file allows the vector processor to overlap the execution of up to three vector instructions. Supported by both the VMS and ULTRIX operating systems, the vector processor on the VAX 9000 system provides four to five times performance improvement for vectorizable applications over its scalar processor.

For a long time, vector processing was the domain of large, expensive supercomputers such as the CRAY-1.
[1] However, with the availability of low cost, pipelined floating point arithmetic chips, and the maturation of vectorizing compilers, vector processing has become a mainstream technology for scientific applications.[2] Applications that can benefit from vector processing include finite element analysis, signal processing, and computational fluid dynamics. The recent addition of integrated vector processing to the VAX architecture and its implementation on the VAX 9000 system provides these applications with an improvement in execution time of four to five times over that of a VAX 9000 system without vector processing. Vector processing extends the performance range of VAX systems.

The vector processor on the VAX 9000 system, referred to as the V-box, is the first emitter-coupled logic (ECL) implementation of the VAX vector architecture. The definition of the architecture and the development of the V-box started in 1986, two years after the design of the rest of the VAX 9000 CPU. Thus, the design of the V-box was synergistic with the definition of the VAX vector architecture. The major goal of the V-box design was to provide adequate vector performance (four to five times speed-up over scalar) without impacting the design of the remainder of the VAX 9000 CPU and the memory subsystem, which were too far along in development to change. With vector performance comparable to a CRAY-1 and a peak performance of 125 MFLOPS for double-precision calculations, the V-box fulfills this goal.

This paper describes the VAX vector architecture and its implementation by the VAX 9000 V-box. The first part of the paper discusses the architectural model that all VAX vector processors must follow. The second part shows the actual realization of this architecture in the VAX 9000 V-box and explains the innovative techniques the V-box uses to achieve good performance.
The paper concludes with preliminary vector performance numbers for the VAX 9000 system on some standard vector benchmarks and a number of vector code examples.

VAX Vector Architecture

The VAX vector architecture defines the instruction set, registers, and behavior that all VAX vector implementations, such as the VAX 9000 V-box, must follow. The vector architecture effort started in December 1985. At that time several CPU development projects were well underway, including the VAX 9000 system. With the expectation of providing four to five times performance improvement for vectorizable applications, Digital decided to add vector processing to the VAX 9000 system, even though the system was in an advanced stage of development. A decision also was made to provide a complementary metal oxide semiconductor (CMOS) implementation of the architecture on the VAX 6000 Model 400 system.

Because both systems could not tolerate major changes without a major slip in schedule, the architecture required an approach that made few changes to the scalar processor - that part of a VAX
The architecture also had to m inimize its impact on the existing VMS a nd ULTRIX operating systems because major changes could significantly delay software support for vector processing. Basic A rchitecture The VAX vector architecture uses a vector-register based design first pioneered by Seymour C ray. 1 There are 1 6 vector registers, each of which holds 64 elements; an element is 64 -bits. Instructions which operate on longword integers or F _floating point data, only manipu late the low-order 32 bits of each element - sometimes referred to as long word elements. A n umber of vector control registers control which elements of a vector register are processed by an instmction. The vector length register (VLR) limits the highest-numbered vector register ele ment that is processed by a vector instruction. The vector mask register (VMR) consists of a 64 -bit mask, in which each mask bit corresponds to one of the possible element positions in a vector register. When instructions are executed under control of the vector mask register, only those elements for which the corresponding mask bit is true are pro cessed by the instruction. Vector compare instruc tions set the value of the vector mask register. The vector coun t register (VCR) receives t he number of elements generated by the compressed IOTA instruction, which is similar to COMPRESSED IOTA on the CRAY-2.1 All VAX vector instructions use two-byte extended opcodes. Any necessary scalar operands (e. g. , base address and stride for vector memory instructions) are specified by standard VAX scalar operand specifiers. The instruction formats allow all VAX vector instructions to be encoded in 62 seven classes. The seven basic instruction groups and their opcodes are shown in Table l . Within each class, all instructions have the same number and types of operands, which allows the scalar processor to use block-decoding techniques. 
The differences in operation between the individual instructions within a class are irrelevant to the scalar processor and need only be known by the vector processor. Important features of the instruction set are

• Support for random-strided vector memory data through gather (VGATH) and scatter (VSCAT) instructions
• Generation of compressed IOTA vectors (through the IOTA instruction) to be used as offsets to the gather and scatter instructions
• Merging vector registers through the VMERGE instruction
• The ability for any vector instruction to operate under control of the vector mask register

Additional control information for a vector instruction is provided in the vector control word (shown as cntrl in Table 1), which is a scalar operand to most vector instructions. The control word operand can be specified using any VAX addressing mode. However, VAX compilers generally use immediate mode addressing (that is, place the control word within the instruction stream). The format of the vector control word is shown in Figure 1. The Va, Vb, and Vc fields indicate the source and destination vector registers to be used by the instruction. These fields also indicate the specific operation to be performed by a vector compare or convert instruction. The MOE bit indicates whether the particular instruction operates under control of the vector mask register. The MTF bit determines what bit value corresponds to "true" for vector mask register bits. It allows a compiler to vectorize if-then-else constructs. The EXC bit is used in vector arithmetic instructions to enable integer overflow and floating underflow exception reporting. The MI bit is used in vector memory load instructions to indicate modify-intent. Figure 2 shows the encoding for some typical VAX vector instructions.
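To make the control word fields concrete, the sketch below packs and unpacks a 16-bit control word and applies the MOE and MTF bits to an element-wise add. The bit positions (MOE in bit 15, MTF in bit 14, EXC or MI in bit 13, Va in bits 11:8, Vb in bits 7:4, Vc in bits 3:0) are taken from the Figure 1 layout; the class and function names are hypothetical, and treating VLR as a count of elements processed starting from element 0 is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ControlWord:
    moe: bool     # masked operation enable (bit 15)
    mtf: bool     # mask bit value that counts as "true" (bit 14)
    exc_mi: bool  # EXC for arithmetic, MI for memory loads (bit 13)
    va: int       # Va or convert function (bits 11:8)
    vb: int       # Vb (bits 7:4)
    vc: int       # Vc or compare function (bits 3:0)


def decode_control_word(word):
    """Split a 16-bit control word into its fields."""
    if not 0 <= word <= 0xFFFF:
        raise ValueError("control word must fit in 16 bits")
    return ControlWord(
        moe=bool((word >> 15) & 1),
        mtf=bool((word >> 14) & 1),
        exc_mi=bool((word >> 13) & 1),
        va=(word >> 8) & 0xF,
        vb=(word >> 4) & 0xF,
        vc=word & 0xF,
    )


def encode_control_word(cw):
    """Pack the fields back into a 16-bit control word."""
    return ((cw.moe << 15) | (cw.mtf << 14) | (cw.exc_mi << 13)
            | (cw.va << 8) | (cw.vb << 4) | cw.vc)


def masked_add(src_a, src_b, dest, vlr, vmr, cw):
    """Element-wise add over the first vlr elements, honoring MOE and MTF.

    Elements whose mask bit does not match MTF keep their old value in dest.
    """
    result = list(dest)
    for i in range(vlr):
        mask_bit = bool((vmr >> i) & 1)
        if not cw.moe or mask_bit == cw.mtf:
            result[i] = src_a[i] + src_b[i]
    return result
```

Under this layout, a masked add of V1 and V2 into V3 with MOE and MTF both set corresponds to the value `0xC123`. Running the same operation twice, once with MTF set and once with MTF clear, updates complementary sets of elements, which is how the two arms of a vectorized if-then-else can share a single mask.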
Table 1  VAX Vector Instruction Classes

Vector Memory, Constant-stride (opcode cntrl, base, stride)
  VLDL      Load longword vector data
  VLDQ      Load quadword vector data
  VSTL      Store longword vector data
  VSTQ      Store quadword vector data

Vector Memory, Random-stride (opcode cntrl, base)
  VGATHL    Gather longword vector data
  VGATHQ    Gather quadword vector data
  VSCATL    Scatter longword vector data
  VSCATQ    Scatter quadword vector data

Vector-scalar Single-precision Arithmetic (opcode cntrl, scalar)
  VSADDL    Integer longword add
  VSADDF    F_floating add
  VSBICL    Bit clear longword
  VSBISL    Bit set longword
  VSCMPL    Integer longword compare
  VSCMPF    F_floating compare
  VSDIVF    F_floating divide
  VSMULL    Integer longword multiply
  VSMULF    F_floating multiply
  VSSLLL    Shift left logical longword
  VSSRLL    Shift right logical longword
  VSSUBL    Integer longword subtract
  VSSUBF    F_floating subtract
  VSXORL    Exclusive-or longword

  IOTA      Generate compressed IOTA vector

Vector-scalar Double-precision Arithmetic (opcode cntrl, scalar)
  VSADDD    D_floating add
  VSADDG    G_floating add
  VSCMPD    D_floating compare
  VSCMPG    G_floating compare
  VSDIVD    D_floating divide
  VSDIVG    G_floating divide
  VSMULD    D_floating multiply
  VSMULG    G_floating multiply
  VSSUBD    D_floating subtract
  VSSUBG    G_floating subtract
  VSMERGE   Merge

Vector-vector Arithmetic (opcode cntrl or regnum)
  VVADDL    Integer longword add
  VVADDF    F_floating add
  VVADDD    D_floating add
  VVADDG    G_floating add
  VVBICL    Bit clear longword
  VVBISL    Bit set longword
  VVCMPL    Integer longword compare
  VVCMPF    F_floating compare
  VVCMPD    D_floating compare
  VVCMPG    G_floating compare
  VVCVT     Convert
  VVDIVF    F_floating divide
  VVDIVD    D_floating divide
  VVDIVG    G_floating divide
  VVMERGE   Merge
  VVMULL    Integer longword multiply
  VVMULF    F_floating multiply
  VVMULD    D_floating multiply
  VVMULG    G_floating multiply
  VVSLLL    Shift left logical longword
  VVSRLL    Shift right logical longword
  VVSUBL    Integer longword subtract
  VVSUBF    F_floating subtract
  VVSUBD    D_floating subtract
  VVSUBG    G_floating subtract
  VVXORL    Exclusive-or longword

Vector Control Register Read (opcode regnum, destination)
  MFVP      Move from vector processor

Vector Control Register Write (opcode regnum, scalar)
  MTVP      Move to vector processor

  VSYNC     Synchronize vector memory access

Figure 1  Vector Control Word

  Bits     Field
  15       MOE
  14       MTF
  13       EXC or MI
  12       0
  11:8     Va or convert function
  7:4      Vb
  3:0      Vc or compare function

Figure 2  Vector Instruction Encoding

ASSEMBLER FORMAT:
  VVEQLF    V6,V7       ; IF V6[i] = V7[i] THEN VMR[i] = 1, ELSE VMR[i] = 0
                        ; (VVEQLF IS A VVCMPF PSEUDO-OPCODE)
  VVADDF/1  V1,V2,V3    ; V3 = V1 + V2. DO ADDITION UNDER CONTROL OF VMR
                        ; WITH MATCH = 1
  VSMULF/U  R4,V4,V5    ; V5 = R4*V4 WITH UNDERFLOW EXCEPTION CHECKING ENABLED

INSTRUCTION FORMAT:
  VVCMPF    cntrl.rw            ; INSTRUCTION CONSISTS OF OPCODE AND CONTROL WORD
  VVADDF    cntrl.rw            ; INSTRUCTION CONSISTS OF OPCODE AND CONTROL WORD
  VSMULF    cntrl.rw, src.rl    ; INSTRUCTION CONSISTS OF OPCODE, CONTROL WORD,
                                ; AND SCALAR SOURCE

ENCODING IN MEMORY:
  BYTE
  :0-:1    TWO-BYTE OPCODE FOR VVCMPF (FD, C4)
  :2       OPERAND SPECIFIER FOR IMMEDIATE MODE (FOR CONTROL WORD) (8F)
  :3       CONTROL WORD <7:0>: COMPARE FCN IS EQL AND V7 IS A SOURCE
  :4       CONTROL WORD <15:8>: V6 IS A SOURCE
  :5-:6    TWO-BYTE OPCODE FOR VVADDF
  :7       OPERAND SPECIFIER FOR IMMEDIATE MODE (FOR CONTROL WORD)
  :8       CONTROL WORD <7:0>: V3 IS DESTINATION AND V2 IS A SOURCE
  :9       CONTROL WORD <15:8>: V1 IS A SOURCE, MASKED OPERATIONS ARE ENABLED,
           AND MATCH = 1
  :A-:B    TWO-BYTE OPCODE FOR VSMULF
  :C       OPERAND SPECIFIER FOR IMMEDIATE MODE (FOR CONTROL WORD)
  :D       CONTROL WORD <7:0>: V5 IS DESTINATION AND V4 IS A SOURCE
  :E       CONTROL WORD <15:8>: VA IS IGNORED. UNDERFLOW EXCEPTION CHECKING
           IS ENABLED
  :F       OPERAND SPECIFIER FOR REGISTER MODE WITH SCALAR DATA IN R4

Vector Execution Model

With the addition of vector processing, a typical VAX processor consists of a scalar processor and an associated vector processor; the two are referred to as a scalar/vector pair. A VAX multiprocessor system comprises a number of these scalar/vector pairs. Asymmetric configurations can exist when only some of the VAX processors in a multiprocessor system contain a vector processor.

For good performance, the scalar processor operates asynchronously from its vector processor whenever possible. Asynchronous operation allows the execution of scalar instructions to be overlapped with the execution of vector instructions. Furthermore, the servicing of interrupts and scalar exceptions by the scalar processor does not disturb the execution of the vector processor, which is freed from the complexity of resuming the execution of vector instructions after such events. However, the asynchronous execution does cause the reporting of vector exceptions to be imprecise. Special instructions, which are described in the Synchronization section, are provided to ensure synchronous operation when necessary.

Both scalar and vector instructions are initially fetched from memory and decoded by the scalar processor. If the opcode indicates a vector instruction, the opcode and necessary scalar operands are issued to the vector processor and placed in its instruction queue. The vector processor accesses memory directly for any vector data that it must read or write. For most vector instructions, once the scalar processor successfully issues the vector instruction, it proceeds to process other instructions and does not wait for the vector instruction to complete. An execution model is shown in Figure 3.
When the scalar processor attempts to issue a vector instruction, it checks whether the vector processor is disabled, that is, whether it will refuse further vector instructions. If the vector processor is disabled, then the scalar processor takes a "vector processor disabled" fault. An operating system handler is then invoked on the scalar processor to examine the various error-reporting registers on the vector processor to determine the disabling condition. The vector processor disables itself to report the occurrence of vector arithmetic exceptions or hardware errors. The operating system disables the vector processor, usually to indicate the unavailability of the vector processor, by writing to a privileged vector register. If the disabling condition can be corrected, the handler enables the vector processor and directs the scalar processor to reissue the faulted vector instruction.

Within the constraint of maintaining the proper ordering among the operations of data-dependent instructions, the architecture explicitly allows the vector processor to execute any number of the instructions in its queue concurrently and retire them out of order. Thus, a VAX vector implementation can chain and overlap instructions to the extent best suited for its technology and cost performance. In addition, by making this feature an explicit part of the architecture, software is provided with a programming model that ensures correct results regardless of the extent to which a particular implementation chains or overlaps. This approach differs from that of some other existing vector architectures, such as the IBM S/370 vector architecture, which give the appearance of sequential instruction execution.6

A VAX vector implementation may have its own memory management hardware, translation buffer, and cache; or it may share those of the scalar processor. In high-end vector implementations, such as the VAX 9000 system, the vector and scalar processors are tightly coupled.
The problems of limited chip area and translation buffer and cache coherency can be lessened by allowing high-speed memory management hardware and cache to be shared by both vector and scalar processors. For other implementations, such as the VAX 6000 Model 400 system, the vector and scalar processors are not so tightly coupled, and there is a performance advantage in allowing separate memory management hardware and cache.1 Little additional effort is necessary by an operating system to support separate vector memory management hardware and cache.

A vector processor can treat vector memory management exceptions (MME) in a synchronous manner, as the VAX 9000 V-box does. Once the scalar processor issues a vector memory instruction, it pauses until the vector processor determines whether an MME will be encountered by the instruction. If an MME will occur, then a precise exception is taken on the scalar processor and the appropriate operating system handler is invoked. If no MME will occur, the scalar processor proceeds to process other instructions and the vector processor completes the memory instruction. In the case of referencing a unity-strided vector, which occurs most frequently, the MME checking takes only a short time at the beginning because the vector is contained in two or fewer pages. (MME checking is done at the page level.)

[Figure 3. Block diagram of the execution model. Recoverable labels: physical memory (16 GB); VAX scalar CPU; vector execution unit; instruction stream; data stream; opcode, control word; instructions; data; disable/status; vector data.]

Context Switching

Because of the asynchronous operation of the vector and scalar processors, the vector context state of a process is separate from its scalar context state. Thus, it is possible for an operating system to swap in a new process to the scalar processor while allowing the vector context of the previous process to remain on the vector processor. When the previous process is swapped out, the vector processor is disabled by the operating system to prevent other processes from accessing this vector context.

If the subsequent processes do not use the vector processor, then the operating system avoids the overhead of saving and subsequently restoring 8 kilobytes (KB) of vector context state for the original process. If another process does use the vector processor, the operating system must reenable the vector processor, save the vector state of the original process, load the vector context of the new process, and, finally, make the vector processor available. This full context switch can take up to 100 microseconds on the VAX 9000 system.

Assuming that only a few processes require the vector processor, it is likely that when the original process is rescheduled to the same scalar/vector pair, the process will find its vector context state residing on the vector processor. By using this technique, which is referred to as "cheap vector context switching," both the VMS and ULTRIX operating systems reduce the time required to swap in a process that uses the vector processor.

Exceptions

Most of the exceptions encountered by VAX vector instructions are identical to those that occur for VAX scalar instructions. The arithmetic exceptions are exactly the same. The memory management exceptions have been extended to include two new exceptions: vector I/O space reference and vector alignment fault. As in the VAX scalar architecture, the reporting of floating underflow and integer overflow exceptions can be disabled through the EXC bit in the vector control word.

Vector arithmetic exceptions are reported in an imprecise manner by vector processor disabled faults. When an exception occurs in the processing of a vector element, the vector processor records the exception in both a privileged exception register (the vector arithmetic exception register, VAER) and in the corresponding element of the destination vector register specified by the instruction. The vector processor then disables itself from receiving further vector instructions. However, the vector processor continues to execute the instruction that encountered the exception to completion by processing the remaining vector register elements.

As stated earlier, memory management exceptions can be reported precisely by a VAX vector processor to its scalar processor, as the VAX 9000 V-box does, and the scalar processor takes a normal VAX memory management fault. Exception information is placed on the stack in the same format as for scalar memory management exceptions. The use of the same format minimizes the effort needed by an operating system to support these exceptions.

Memory management exceptions were extended for vectors to include two new exception parameter bits: vector I/O space reference and vector alignment fault. A vector I/O space reference occurs whenever an attempt is made to load or store vector data to I/O space. Because of the performance degradation of unaligned memory data, a vector alignment fault occurs whenever an element being accessed by a vector memory instruction does not begin at an address that is an integer multiple of the length of the element in bytes. For example, a longword (4-byte) element in memory should begin at an address which is an integer multiple of 4 bytes.

Synchronization

In most cases, it is desirable for the vector processor to operate asynchronously with the scalar processor to achieve good performance. However, there are cases in which the operation of the vector and scalar processors must be synchronized to ensure correct results. Rather than forcing the vector processor to detect and automatically provide synchronization in these cases, the architecture provides special instructions, which software can use, to accomplish the synchronization. Some of these instructions are discussed below. Software must determine when to use these synchronization instructions to ensure correct results or establish exception checkpoints. Given the necessary sophistication of vectorizing compilers, this requirement is not onerous.

Vector and scalar memory references may be issued simultaneously. Therefore, these references must be synchronized to prevent a conflict from occurring when accessing shared memory locations. This synchronization is provided by the MSYNC function of the MFVP instruction. Once the MSYNC function is invoked, the scalar processor does not issue further instructions until all previous vector and scalar memory references have completed.

Because the vector and scalar processors execute asynchronously, software cannot determine when a vector exception will be reported. However, software requires that exceptions be reported at certain checkpoints. For example, exceptions incurred in a procedure must be reported within the context of that procedure before another procedure is called. This exception reporting synchronization is provided by the SYNC function of the MFVP instruction. Once SYNC is invoked, the scalar processor does not issue further instructions until the exceptions of previous vector instructions, if any, are reported.

VAX 9000 V-box Overview

The VAX 9000 V-box is one of four tightly coupled, parallel function units that compose the VAX 9000 CPU. As such, it shares, with the rest of the CPU, both the large 128KB data cache and the very fast address translation hardware. As a result, the V-box has very fast access to memory data.
The V-box is connected to the CPU through the scalar execution unit as shown in Figure 4. This connection consists of a 64-bit data path, which brings instructions and data to the vector unit, and a 32-bit path, which sends data to the scalar unit. All vector memory instructions send data through this data path.

[Figure 4. V-box Organization (with VAX 9000 CPU). Recoverable labels: vector control unit; vector register unit; mask/address.]

As Figure 4 also shows, the V-box is composed of the following subunits: vector register unit, vector add unit, vector multiply unit, vector mask unit, vector address unit, and vector control unit. Each of these subunits can function in parallel, which allows up to two vector arithmetic instructions and one vector memory instruction to be executed simultaneously. Crucial to this instruction overlapping ability is the vector register unit, which supports up to eight simultaneous accesses from the other subunits.

Physically, the V-box resides on the same planar board as the remainder of the VAX 9000 CPU. Three multichip units (MCUs) are reserved for the V-box, which is a field-installable option. The V-box comprises 25 ECL Motorola Macrocell Array IIIs (MCA3).7 (For brevity, a macrocell array is referred to as a "chip" in this paper.) The operation of these subunits and the techniques used to enhance their performance are described in the following sections.

Vector Control Unit

The vector control unit receives and coordinates the execution of vector instructions within the V-box. The VAX 9000 scalar execution engine (E-box) transfers both an encoded version of the vector instruction and the necessary scalar data to the unit, which loads the instruction and data into a circular queue as shown in Figure 5. The queue can buffer a few pending instructions while the remaining V-box subunits are executing others.
Without the queue, the V-box could not accept pending instructions when all of its subunits are busy, thus propagating a stall condition to the scalar execution unit and resulting in poor performance. The scalar data that is required by a vector instruction is placed in the queue one location behind the instruction quadword. Whenever the queue contains two entries, the vector control unit returns a signal to the scalar execution unit and requests that subsequent instruction issue be delayed until the number of entries in the queue has diminished to one or less. The queue is circular in nature and wraps around to the beginning automatically.

When an instruction is loaded into the queue, a pointer directs the instruction to the decode logic shown in Figure 5. If there is enough instruction data available in the queue and the necessary subunit is not busy, then the vector control unit sends the instruction data from the queue to the register conflict logic. The register conflict logic determines if the vector registers required by the instruction are already in use by the other subunits, a condition called register conflict. The determination is made by comparing the vector register addresses that are to be used by already executing vector instructions in the next cycle against the vector register addresses required by the new instruction. If none of the addresses overlap, then the instruction is free to issue. If an overlap does exist, the instruction is held until the next cycle, when it can then be issued to the appropriate subunit. (The lack of significant cycle delay in this case is due to the optimal design of the vector register unit.) If there are no register conflicts, the instruction is issued immediately to the appropriate subunit.
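The queueing and register conflict behavior described above can be sketched as a toy model. This is a simplified illustration under stated assumptions, not the V-box logic: a Python deque stands in for the circular hardware queue, conflicts are checked per register rather than per bank and cycle, subunit-busy checks are omitted, and the class and method names are hypothetical.

```python
from collections import deque

# From the text: the E-box is asked to delay issue at two queued entries.
HOLD_OFF_THRESHOLD = 2


class VectorInstructionQueue:
    """Toy model of the instruction queue and register conflict check."""

    def __init__(self):
        self.entries = deque()  # stand-in for the circular hardware queue

    def push(self, opcode, registers):
        """Queue an instruction with the vector registers it requires."""
        self.entries.append({"opcode": opcode, "registers": frozenset(registers)})

    def hold_off(self):
        """True when the scalar unit should delay further instruction issue."""
        return len(self.entries) >= HOLD_OFF_THRESHOLD

    def try_issue(self, busy_next_cycle):
        """Issue the oldest entry unless it needs a register that is in use."""
        if not self.entries:
            return None
        head = self.entries[0]
        if head["registers"] & set(busy_next_cycle):
            return None  # register conflict: hold the instruction this cycle
        return self.entries.popleft()
```

A caller would invoke try_issue once per simulated cycle, passing the set of registers that executing instructions will use in the next cycle.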
As the vector control unit issues the instruction to the subunit, it also sends scalar source operands, if any, and the addresses of the vector registers required by the instruction to the vector register unit. The vector register unit latches the scalar data for the duration of that instruction. For each cycle of the instruction's execution, the register unit then sends the necessary scalar and register data to the appropriate subunit.

The vector control unit also contains the vector length register and sends a copy of it with every instruction that is issued to a subunit. By supplying each subunit with a copy of the vector length register, writes to the register by MTVP instructions do not affect instructions currently executing under the register's previous value. Without this mechanism, writes to the vector length register would be delayed until previously executing instructions had finished, which would result in poor performance.

[Figure 5. Vector Control Unit. Recoverable labels: E-box vector data; buffer; scalar data to vector register file; source/destination vector register addresses; vector instruction; ADD, MUL, GEN; buffer valid; buffer counter; instruction issue decision logic; vector register conflict check logic; no conflict; issue new instruction.]

Upon reaching the subunit, most vector instructions execute at one cycle per element, after the initial pipeline latency. However, the vector divide instructions (VSDIV and VVDIV) execute at a varying number of cycles, depending on the floating point format (F, D, or G). (To simplify the vector control logic, no other vector instructions are issued once a vector divide starts.) Results are returned to the vector register unit or vector mask unit as they are generated, depending on the instruction.
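The vector length register handling described above amounts to taking a snapshot of the register at issue time. The following is a minimal sketch of that idea; the class and method names are hypothetical, and the 0 to 64 range is an assumption based on the 64-element vector registers.

```python
class VectorControlUnit:
    """Toy model: each issued instruction carries its own copy of VLR."""

    MAX_VECTOR_LENGTH = 64  # assumption: one entry per vector register element

    def __init__(self, vlr=64):
        self.vlr = 0
        self.write_vlr(vlr)

    def write_vlr(self, value):
        """Model a write to the vector length register (as by MTVP)."""
        if not 0 <= value <= self.MAX_VECTOR_LENGTH:
            raise ValueError("vector length must be in the range 0..64")
        self.vlr = value

    def issue(self, opcode):
        """Issue an instruction together with a snapshot of the current VLR."""
        return {"opcode": opcode, "vlr": self.vlr}
```

Because the integer is copied into the issued record, a later write_vlr call changes only instructions issued afterward, which mirrors the behavior the text describes.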
As described earlier, microcode in the scalar execution engine encodes vector instructions into an instruction quadword before passing them to the V-box. Table 2 shows the high-order 32 bits of the format used for every instruction sent to the V-box. This quadword contains fields that indicate the instruction, appropriate V-box subunit to execute the instruction, and format of the vector control word. The low-order 32 bits of the instruction quadword contain the vector control word for the vector instruction.

The instruction quadwords present the V-box with a fixed format instruction that smoothly fits into a fixed-length instruction queue, requires little subsequent decoding, and has fields that can be directly gated to selection logic. As a result, the time needed by the V-box to decode vector instructions is reduced and performance is increased.

Vector Register Unit

The vector register unit or file, as its name implies, contains the logic and fast memory that implement the 16 VAX vector registers on the V-box. The block diagram of the vector register file is shown in Figure 6. The vector register file has three write ports and five read ports. By using the innovative technique described below, these ports provide the multiple accesses needed to feed two operands per cycle to the vector add and multiply units, and one operand to the vector address-mask unit. This unit is the single largest contributor to the excellent vector performance of the VAX 9000 system.

The file consists of 16 vector registers. Each register contains 64 elements, and each element is 72 bits wide (64 data, 8 parity). The vector register file is implemented as a byte-sliced custom chip, which has a single parity bit per data port. Three writes and five reads to the file can occur simultaneously in any cycle. All writes must be to different register banks. However, multiple reads can occur to the same bank if the same element is required by each read access.
Internally to the vector register unit, reads occur during the first half of the cycle, and writes occur during the last half. A write and read enabling signal is generated for each register bank every cycle. Each cycle, data is selected from one of the three write ports to be written into any enabled register banks. Write port 0 has a four-stage pipe to buffer data coming from the E-box, through the control logic, which cannot be written due to a register bank conflict. The vector register file also has three scalar registers (one each for the vector address-mask unit, vector add unit, and vector multiply unit) to hold scalar source operands for vector-scalar instructions. Write port 0 is used to write these registers. Each enabled read port selects an element from one of the 16 register banks or scalar registers (for vector-scalar instructions) and transfers it to one of the other subunits.

The vector register file uses a technique referred to as "barber poling" to improve the use of chaining and overlapped instruction execution. As Figure 7 shows, barber poling spreads each architecturally defined vector register across all vector register banks. Elements are laid out such that the first element of vector register n is in location 0 of physical register bank n and, in general, element b of vector register n is in location b of vector register bank ((n + b) modulo 16). By using this technique, a vector register conflict causes the vector control unit to delay the issuing of a new vector instruction for no more than three cycles. If the more standard technique of placing all elements of one vector register in the same bank were used, a vector register conflict could cause the execution of a new instruction to be delayed by 64 cycles. The 64-cycle delay would have frustrated attempts at overlapping and severely degraded the vector performance of the VAX 9000 system.
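The barber-pole mapping can be written down directly from the formula in the text. The sketch below is illustrative only; the function and constant names are hypothetical, and the constants follow the 16-register, 16-bank, 64-element organization described above.

```python
NUM_REGISTERS = 16          # architecturally defined vector registers
NUM_BANKS = 16              # physical register banks
ELEMENTS_PER_REGISTER = 64  # elements in each vector register


def bank_and_location(register, element):
    """Map (vector register n, element b) to (bank, location in bank)."""
    if not 0 <= register < NUM_REGISTERS:
        raise ValueError("register must be in the range 0..15")
    if not 0 <= element < ELEMENTS_PER_REGISTER:
        raise ValueError("element must be in the range 0..63")
    # Element b of register n lives at location b of bank (n + b) mod 16.
    return (register + element) % NUM_BANKS, element
```

Two consequences follow from the mapping: consecutive elements of one register fall in consecutive banks, so two instructions walking the same register one or more cycles apart (but not a multiple of 16 apart) touch different banks in each cycle; and for a fixed element index, the 16 registers occupy 16 distinct banks.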
Table 2  Encoded Instruction Quadword (bits <63:32>)

  Vector Instruction         Opcode     Control Word Type   Dispatch Type
                             <39:32>    <42:40>             <46:43>
  VVSUBF/VSSUBF              0F9        2/6                 0
  VVSUBG/VSSUBG              0DB        2/6                 0
  VVSUBD/VSSUBD              0D2        2/6                 0
  VVSUBL/VSSUBL              0F6        2/6                 0
  VVCMPL/VSCMPL              0F5        3/7                 0
  VVSLLL/VSSLLL              034        2/6                 0
  VVSRLL/VSSRLL              026        2/6                 0
  VVBISL/VSBISL              086        2/6                 0
  VVBICL/VSBICL              08E        2/6                 0
  VVXORL/VSXORL              088        2/6                 0
  VVMERGE/VSMERGE            0AE        5/1                 0
  VVADDD/VSADDD              092        2/6                 0
  VVADDF/VSADDF              089        2/6                 0
  VVADDG/VSADDG              098        2/6                 0
  VVADDL/VSADDL              086        2/6                 0
  VVCMPD/VSCMPD              0D5        3/7                 0
  VVCMPF/VSCMPF              0FD        3/7                 0
  VVCMPG/VSCMPG              0DD        3/7                 0
  VVCMPD/VSCMPD              0D5        3/7                 0
  VVCVTDF                    011        4                   0
  VVCVTDL                    016        4                   0
  VVCVTFD                    03A        4                   0
  VVCVTFG                    038        4                   0
  VVCVTFL                    03E        4                   0
  VVCVTGF                    019        4                   0
  VVCVTGL                    01E        4                   0
  VVCVTLD                    032        4                   0
  VVCVTLF                    031        4                   0
  VVCVTLG                    033        4                   0
  VVCVTDL                    017        4                   0
  VVCVTFL                    03F        4                   0
  VVCVTGL                    01F        4                   0
  VVMULL/VSMULL              003        2/6                 1
  VVMULF/VSMULF              004        2/6                 1
  VVMULD/VSMULD              005        2/6                 1
  VVMULG/VSMULG              006        2/6                 1
  VVDIVF/VSDIVF              00C        2/6                 4
  VVDIVD/VSDIVD              00D        2/6                 4
  VVDIVG/VSDIVG              00E        2/6                 4
  VLDL                       001        0                   2
  VLDQ                       002        0                   2
  Block load                 00C        0                   2
  VSTL                       003        0                   2
  VSTQ                       004        0                   2
  VGATHL                     005        1                   2
  VGATHQ                     006        1                   2
  VSCATL                     010        1                   2
  VSCATQ                     011        1                   2
  IOTA                       012        0                   2
  Load VLR                   007        0                   2
  Load low VMR               009        0                   2
  Load high VMR              00A        0                   2
  Store low VMR              00D        0                   3
  Store high VMR             00E        0                   3
  Store unaligned address    013        0                   3
  Load VPSR                  014        0                   2
  Load VAER                  015        0                   3
  Store VAER                 008        0                   3
  RESET                      00F        2                   3

  Bits <63:47> are reserved.

[Figure 6. Vector Register Unit. Recoverable labels: write ports 0-2 (from control, VAD result on write port 1, VML result on write port 2); scalar registers 0, 2, and 4; write enable logic; write address logic; memory array (register banks 0-15); read enable logic; read address logic; read ports 0-4 (to mask logic, to VML logic, to adder logic).]

[Figure 7. Barber Poling. Recoverable labels: bank 0 through bank 15.]

Vector Add Unit

The vector add unit executes most vector instructions, including both floating point and integer addition, subtraction, comparison; vector convert; vector shift logical; vector logical operations; and vector merges. For brevity, these instructions are referred to as add-class instructions. One of the challenges in designing the vector add unit was the need to perform both integer and floating point arithmetic.

The organization of the vector add unit is shown in Figure 8. It is a pipelined structure that comprises two identical chips for unpacking and aligning operands (VFSA and VFSB); one chip for performing arithmetic and logical operations (VFAD); and a remaining chip for normalizing, rounding, and packing the result (VFPK). The data paths between the chips are all 64 bits wide. The pipeline latency through this unit for both single-precision (integer and F_floating) and double-precision (G_floating and D_floating) formats is only three cycles. Thus, the vector/scalar cross-over number for add-class instructions (that is, the minimum number of vector elements needed for the V-box to surpass the performance of the remainder of the VAX 9000 CPU for this class of instructions) is quite small. As a result, the V-box achieves good performance for add-class instructions with small-sized vectors and large-sized vectors (large-sized vectors being naturally favored by the technique of pipelining).

When the vector add unit begins to execute an instruction, it receives two source elements from the vector register unit each cycle.
The elements are latched into the unpacking logic, one element for each of the two chips. During the next cycle, each unpacking chip concurrently unpacks and aligns its source element, if necessary, and forwards the result to the addition or logical-operation logic, depending on the instruction. Within the same cycle, the addition chip uses the two sources from the unpacking logic to generate a result, which is then latched. During the final cycle, the result is sent to the packing chip, which normalizes, rounds, and packs, if necessary, the result and sends it to the vector register unit to be written. Exception checking and reporting are also done in the last cycle by the packing chip, which maintains the vector add unit's copy of the vector arithmetic exception register (VAER). When the instruction completes, the vector add unit sends its VAER copy to the vector mask unit to be merged with the VAER copy from the vector multiply unit.

The vector add unit does not differentiate between masked and unmasked vector instructions. The complexity of skipping over masked-out elements would have added extra cycles of pipeline latency and resulted in less performance for small-sized vectors. For masked as well as unmasked instructions, the vector add unit operates from the first up to the last element (as indicated by the vector length register) of both source registers. The actual masking of results is handled by the vector control unit, which blocks the vector register unit from receiving masked-out results as they are being sent by the vector add unit.
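The division of labor described above, in which every element is computed and masking is applied only at write-back, can be sketched as follows. This is a behavioral illustration with hypothetical names, using plain Python integers in place of hardware elements; the MOE and MTF semantics follow the control word description earlier in the paper.

```python
def masked_vector_add(src_a, src_b, dest, vmr, vlr, moe=False, mtf=False):
    """Return the destination register contents after a (masked) vector add.

    src_a, src_b, dest: lists of element values
    vmr: list of 0/1 vector mask register bits
    vlr: number of elements to process (vector length register)
    moe: operate under control of the mask when True
    mtf: mask bit value that means "write this element" (True selects 1)
    """
    result = list(dest)
    match = 1 if mtf else 0
    for i in range(vlr):
        value = src_a[i] + src_b[i]       # computed for every element
        write_enabled = (not moe) or (vmr[i] == match)
        if write_enabled:                 # masking is applied only at write-back
            result[i] = value
    return result
```

Issuing the same operation twice with mtf=True and then mtf=False is one way a compiler could cover both branches of a vectorized if-then-else.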
However, the packing chip does use vector mask register bits to suppress exception generation for results that are masked out.

[Figure 8. Vector Add Unit. Recoverable labels: source A to VFSA; source B to VFSB; exponent; VFAD; VFPK; VAER to VMKB; mask bit; adder logic; 64-bit data paths.]

Floating Point Operation

When executing vector floating point instructions, the unpacking logic takes the various fields of a floating point element and expands and rearranges it into a more convenient format for the addition logic, i.e., the element is "unpacked." As a result of this process, the addition logic is simplified because all VAX floating point formats (F, D, and G) are unpacked into an identical format. The unpacking involves decoding the sign, inserting the hidden bit, and rearranging the fraction bits. For all VAX floating point formats, the fractional part is expanded to 56 bits. (F_floating and G_floating are expanded with zeros on the right.) The fractional part is then surrounded on the right with two guard bits and a rounding bit to form a 59-bit fraction. The overflow and guard bits ensure the accuracy of rounded results.

After the elements are unpacked, the unpacking chips align the elements by taking the fractional part of the smaller magnitude number and shifting it to the right until its exponent is equal to that of the larger magnitude number. Each unpacking chip also receives the exponent bits of the other chip's element. Therefore, the alignment process can be done in parallel before the elements are sent to the addition logic that requires the alignment. If, during the alignment of an element for a vector floating point subtract instruction, a one is shifted out of the 59-bit fraction field, then a "sticky bit" is generated. This sticky bit is used by the addition logic in the next cycle as a carry into the subtraction.
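The alignment step with sticky bit generation can be sketched with integer arithmetic. This is an illustrative model only: the fraction is treated as an unsigned 59-bit integer, the function name is hypothetical, and the sticky bit is computed for every alignment shift, whereas the text states that the hardware generates it for subtract instructions.

```python
FRACTION_BITS = 59  # width of the unpacked fraction field given in the text


def align_fraction(fraction, exponent, other_exponent):
    """Shift a fraction right until its exponent matches the larger exponent.

    Returns (aligned_fraction, resulting_exponent, sticky_bit). The sticky
    bit is 1 if any one bit was shifted out of the fraction field.
    """
    shift = other_exponent - exponent
    if shift <= 0:
        return fraction, exponent, 0      # this operand needs no alignment
    if shift >= FRACTION_BITS:
        shifted_out = fraction            # every bit leaves the field
        aligned = 0
    else:
        shifted_out = fraction & ((1 << shift) - 1)
        aligned = fraction >> shift
    sticky = 1 if shifted_out else 0
    return aligned, other_exponent, sticky
```

In this sketch each operand is aligned against the other operand's exponent, which loosely mirrors the way each unpacking chip receives the exponent bits of the other chip's element.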
The unpacked, aligned elements are then sent to the add chip, which produces a result and then partially normalizes the result before sending it to the packing chip. Again, if the shifting during normalization shifts a one out of the fraction field, a sticky bit is generated. Finally, the partially normalized result and the second sticky bit are sent to the packing chip, which completes the normalization and rounding and adjusts the exponent field accordingly. To save an extra cycle, the packing chip computes two exponent values, one for each value of the carry-over in the rounding process. Final selection of the exponent and its exception is done using the actual carry-over of the rounding logic. The proper exponent and the normalized fraction are then rearranged into the appropriate floating point format, and the assembled element is sent to the vector register unit.

Integer and Logical Instructions

For vector integer and logical instructions, the elements bypass the alignment logic and are sent to the add chip (VFAD) for all but the logical shift right instructions (VVSRLL and VSSRLL). For logical shift right instructions, the alignment logic does the shifting because the shifting circuitry is already needed for the alignment of fractions in floating point elements. The exponent unpacking logic is used to pass on the logical shift right count to the alignment logic, which then sends the shifted result to the add chip. The add chip operates on the low-order 32 bits of these elements and passes through the high-order 32 bits unchanged to the packing chip.

For logical shift left instructions (VVSLLL and VSSLLL), the low-order 32 bits also pass through the add chip unchanged. On the packing chip, the floating point normalize logic is used to perform logical shift-left operations. The shift count is passed to the normalize logic from the unpacking logic during the first cycle.
For all other integer and logical instructions, the normalize count is forced to zero to pass the add chip result through. Finally, just before sending the result to the vector register unit, the packing chip checks for integer overflow exceptions.

Merge Instructions

For vector merge instructions (VVMERGE and VSMERGE), the unpacking chip with the masked-out element, based on the appropriate vector mask register bit, zeros that element out before sending it to the addition logic. The addition logic adds the zero to the other element, which has the effect of passing the value of the other element on to the packing chip.

Vector Memory Operation

Because vector applications tend to issue many vector memory instructions, the execution time of these instructions is a critical factor in the performance of a vector processor. Therefore, the V-box was designed to minimize the execution time by taking advantage of the VAX 9000 CPU's large 128KB data cache, by prefetching vector data, and by fetching it in blocks instead of element by element.

Memory requests by the V-box are sent through the VAX 9000 CPU to the cache and address translation hardware (M-box) of the VAX 9000 CPU. The M-box translates the 32-bit virtual addresses for vector data into physical addresses and accesses the proper locations in the data cache. The vector address-mask unit generates the virtual addresses for the vector elements.

For vector load and gather instructions, the vector data is returned to the V-box through the E-box and written to the proper vector registers. The M-box returns 64 bits of data each cycle. For vector store and scatter instructions, the vector elements are sent through the E-box to the M-box. Although the vector register unit is capable of sending 64 bits at a time, the E-box need only forward 32 bits per cycle to the M-box.
The M-box requires two cycles to write the cache and does not actually write the 64-bit data until the second cycle. (The first cycle performs the cache tag lookup.) Because the V-box implements synchronous memory management exception reporting, once a vector memory instruction begins execution, no other vector instruction may be issued until the memory instruction completes.

The VAX 9000 CPU prefetches vector data. This mechanism is used to move data from main memory to the cache in a manner that optimizes memory bandwidth. By using this method, a 25 percent improvement in the performance of vector load instructions is achieved. The prefetching starts when the scalar microcode on the VAX 9000 CPU checks the stride of a VLDQ instruction. If this stride is 8 bytes long (quadwords are contiguous in memory), the microcode converts the instruction into a block load instruction and sends it to the V-box. The block load instruction directs the V-box to issue a series of block load requests for vector data. A block load request moves an entire cache block from memory into the vector registers. These blocks are loaded into both the cache and the vector registers when they come from main memory. (Bypassing the cache to load the vector registers directly reduces the effect of a cache miss for vector data.) If the stride is not 8 bytes, the memory requests are made one register element at a time.

In addition to converting the VLDQ to a block load instruction, the scalar microcode also issues prefetch requests to the M-box. The M-box determines if the data is valid in the cache. If so, no further action is taken on the request. If not, the data is requested from main memory. In this manner, several prefetch requests are started in successive cycles. This method results in multiple memory banks being used in parallel. Vector data comes back to the cache at a rate of 500 megabytes (MB) per second.
The microcode stops issuing prefetch requests when all the vector data has been requested. This ensures that the requests from the V-box do not encounter many cache misses.

Vector Address-Mask Unit

The vector address-mask unit performs the address generation and memory requests needed to execute the vector memory instructions VLD, VST, VSCAT, and VGATH. It also contains the vector mask register and support logic for masked instructions. Further, it contains the complete vector arithmetic exception register (VAER), which it updates based on the status sent by the vector add and vector multiply units.

For vector memory instructions, the vector address-mask unit receives the base (starting memory address of the vector) and stride (distance between vector elements in memory) of the instruction from the vector control unit in an indirect manner through the vector register unit. Both the base and stride are 32 bits long.

For most vector load and store instructions, the memory addresses for the vector data are generated in an iterative fashion. During the first cycle of execution, the base address bypasses the address adder and is immediately sent to the M-box to request the first element. Concurrently, the base and stride are added together by the address adder and latched to provide the address of the next element. In the next cycle, the latched address is sent to the M-box and to the address adder, where it is added to the stride to generate the next address. The process repeats until all element addresses have been issued. In tandem with the address generation, the vector control unit directs the vector register unit to send or receive the appropriate vector register element.

For vector gather and scatter instructions, the memory addresses for the vector data are also issued in an iterative fashion. During the first cycle of execution, the base address is sent to the vector address unit.
In the second cycle, the vector control unit directs the vector register unit to send the first element of the offset vector to the vector address unit, which adds it to the base and latches the result. In the third and subsequent cycles, the resulting address is sent to the M-box while the base and next offset are added together. The process repeats until all element addresses have been issued. In tandem with the address generation, the vector control unit directs the vector register unit to send or receive the appropriate vector register element.

For masked vector load and gather instructions, addresses for all elements, masked and unmasked, are sent to the M-box. However, for masked-out elements, the request is modified from read to read no-op (i.e., do not actually perform the read). This process prevents the M-box from taking cache misses and address translation exceptions on masked-out elements. For masked-out elements, the M-box returns a dummy value to the V-box, which blocks the value from being written to the vector register unit. The vector address unit directs the control unit to block writes, based on the value of the appropriate vector mask register bit.

For masked vector store and scatter instructions, although both masked and unmasked elements are read from the vector register unit, masked-out elements are stopped from reaching the M-box. The vector address unit, based on the vector mask register, causes the E-box to discard the masked-out element instead of forwarding it to the M-box.

As described earlier, a VLDQ instruction with a stride of 8 bytes (unity stride) is converted by the VAX 9000 scalar processor into a block load instruction when sent to the V-box. The vector address unit, in turn, issues a number of block load requests, each of which is for 64 bytes of data, to the M-box with the appropriate address and selection bits.
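The strided and gather address sequences described above, together with the read no-op treatment of masked-out elements, can be summarized in a short Python sketch. The function names and the string request labels are hypothetical, and the cycle-by-cycle pipelining of the address adder is not modeled; only the sequence of addresses and request types is shown.

```python
ADDRESS_MASK = 0xFFFFFFFF  # base and stride are 32 bits long


def strided_addresses(base, stride, length):
    """Yield element addresses for a strided load or store."""
    address = base & ADDRESS_MASK
    for _ in range(length):
        yield address
        # The adder produces the next address from the latched one.
        address = (address + stride) & ADDRESS_MASK


def gather_addresses(base, offsets):
    """Yield element addresses for a gather or scatter: base plus each offset."""
    for offset in offsets:
        yield (base + offset) & ADDRESS_MASK


def masked_load_requests(addresses, mask_bits, match=1):
    """Pair every address with 'read' or, for masked-out elements, 'read no-op'."""
    for address, bit in zip(addresses, mask_bits):
        yield address, ("read" if bit == match else "read no-op")
```

Note that every address is still issued for a masked load; only the request type changes, which mirrors the behavior described for the M-box interface.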
There are eight selection bits, one for each quadword in the block, which tell the M-box whether to return the corresponding quadword to the V-box for that block load request. Generation of these selection bits by the vector address unit is complicated because the starting address of a vector in memory may not be aligned on a block boundary (i.e., the vector may start in the middle of a block). The bits also depend on the vector mask register (for masked block loads).

To handle unaligned, masked block loads, the vector address unit must generate selection bits that deselect those quadwords which are not part of the vector but lie within the same blocks as the first and last elements of the vector. In addition, it must deselect those quadwords within the vector that are masked out by the vector mask register. Both of these requirements are handled by using an extended version of the vector mask register to generate the selection bits. This process involves conceptually extending the vector mask register on both ends with enough selection bits so that each quadword has a corresponding selection bit. For example, a vector starting at the last quadword of one block requires that seven selection bits be added at the beginning of the vector mask register and one bit be added after the end. (For a full 64-element vector, this gives 7 + 64 + 1 = 72 selection bits, which covers nine blocks of eight quadwords.)

Vector Multiply Unit

The vector multiply unit performs all of the vector multiply and vector divide operations defined by the VAX vector architecture: VVMUL, VSMUL, VVDIV, and VSDIV. The unit can perform either one multiply instruction or one divide instruction at a time, but cannot perform both types of instructions simultaneously. In addition, the unit performs exception checking and reporting, as required, including floating overflow, floating underflow, and divide by zero exceptions. The unit consists of four custom multipliers, a custom divider, a divide unpack chip, and two packing chips. Physically, these chips reside on the VML multichip unit of the VAX 9000 CPU. The custom multipliers and divider are identical to those used in the scalar execution engine (E-box).

Multiplication

By using four parallel multipliers, the pipeline latency through the multiplication logic for both single precision (integer and F_floating) and double precision (G_floating and D_floating) is only three cycles. Thus, the vector/scalar cross-over number for multiplication is quite small. As a result, the V-box achieves good performance for vector multiply instructions with small vectors as well as large ones.

As a double-precision vector multiply instruction executes, two 64-bit elements are received from the vector register unit each cycle and are latched in the four custom multipliers, each of which does a 32-bit by 32-bit multiplication. As shown in Figure 9, the element bits are distributed in such a way that one multiplier operates on the high-order bits of both elements; one multiplier operates on the low-order bits of operand one and the high-order bits of operand two; one multiplier operates on the high-order bits of operand one and the low-order bits of operand two; and one multiplier operates on the low-order bits of both elements.

During the next clock cycle, each of the four multipliers unpacks its inputs and sends them through a large multiplication array, which produces one 64-bit partial product and latches the product.

During the third cycle, the pack chips (VMLA and VMLB) add the four 64-bit partial products together to produce one result and prepare the result to be written back to the vector register unit. In this cycle, the four partial products are shifted according to their weight. Weight is determined by which bits the multiplier used to produce its result.
For example, the multiplier that operated on the high-order 32 bits (most significant bits) of both elements produces the most significant partial product bits, and the multiplier that operated on the low-order 32 bits (least significant bits) of both elements produces the least significant partial product bits. The partial products must be aligned or shifted properly before they are added together. Once the partial products have been added, the final product is rounded, normalized, and packed into the appropriate VAX integer or floating point format before being written into the vector register unit in the next cycle.

Figure 9  Vector Multiply Unit (block diagram: the high and low halves of VREG_SOURCE1 and VREG_SOURCE2 feed four custom multipliers; the partial products are accumulated in VMLA/VMLB, which also handle division results and the final exponent and exception data, producing VML_RESULT[63:0] to VREG)

The process and pipeline stages for single-precision multiplication (VVMULF and VSMULF) are similar to the process used for double-precision multiplication. However, in single-precision multiplication, only one multiplier chip is needed to produce the result, and the pack chips do not need to sum partial products.

Integer multiplication is slightly different from floating point multiplication because it does not need to be accumulated or rounded. Thus, the correct product is produced by one multiplier. The result bypasses the accumulation and rounding logic and proceeds directly into the packing logic to be sent to the vector register unit.

The exponent handling for both multiplication and division is performed by the same logic on the packing chips.
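The double-precision scheme described earlier in this section, in which four 32-bit by 32-bit multipliers each produce a 64-bit partial product that is then shifted by its weight and summed, can be illustrated with ordinary integer arithmetic. The sketch below is an assumption-laden illustration, not a model of the hardware: `multiply_64x64` is a hypothetical name, the operands are treated as plain unsigned integers rather than unpacked fractions, and the full 128-bit product is kept instead of being rounded and normalized.

```python
MASK32 = (1 << 32) - 1


def multiply_64x64(a, b):
    """Form a 64-bit by 64-bit product from four 32-bit by 32-bit products."""
    a_hi, a_lo = a >> 32, a & MASK32
    b_hi, b_lo = b >> 32, b & MASK32

    # One partial product per multiplier, as in the text.
    hi_hi = a_hi * b_hi   # high-order bits of both operands
    lo_hi = a_lo * b_hi   # low-order bits of operand one, high-order of two
    hi_lo = a_hi * b_lo   # high-order bits of operand one, low-order of two
    lo_lo = a_lo * b_lo   # low-order bits of both operands

    # Shift each partial product according to its weight, then add.
    return (hi_hi << 64) + ((hi_lo + lo_hi) << 32) + lo_lo
```

The result matches Python's built-in multiplication for any pair of 64-bit unsigned operands, which is an easy way to convince oneself that the weights (64, 32, 32, and 0 bit positions) are the right ones.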
Depending on the instruction being executed, the exponent is either added (multiplication) or subtracted (division). The result of this operation is then piped to the next stage, and the position of the hidden bit is determined. If the fractional portion of the data must be shifted to ensure the hidden bit is in the correct position, the exponent is then incremented or decremented accordingly. The normalize count (i.e., shift count) is used to select the correct final exponent. Overflow and underflow exceptions can be detected and reported only after the final exponent is selected. If an exception is detected, then a reserved operand is written to the appropriate vector register element. The first stage of the exponent logic also checks for divide by zero and reserved operand exceptions.

Division

Vector division is a variable-cycle function. The number of cycles depends on the format of the operands. The custom divider is capable of producing six quotient bits per cycle. Therefore, F_floating point division is performed in 7 cycles, G_floating point in 12 cycles, and D_floating point in 13 cycles. Because of the variable number of cycles in a divide instruction, no other instruction can execute in the V-box while a divide is in process. Also, because of the iterative nature of division (i.e., one division must be completed before another can be started), the instruction cannot be pipelined.

As a vector divide instruction executes, two 64-bit elements are received from the vector register unit each cycle and are latched in the divide unpack chip. The elements are unpacked, and the fractional portion of the elements is sent to the custom divider in 32-bit slices. The exponent portion is sent to the shared exponent logic on the packing chips, as described in the Multiplication section.
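The cycle counts quoted above for the three formats are consistent with a simple back-of-the-envelope model: the number of cycles needed to produce the fraction at six quotient bits per cycle, plus a fixed overhead. The fraction widths below (including the hidden bit) are the standard VAX values; the three-cycle overhead is purely an assumption, chosen because it reproduces the stated figures, and is not taken from the text.

```python
import math

QUOTIENT_BITS_PER_CYCLE = 6          # stated in the text
ASSUMED_OVERHEAD_CYCLES = 3          # assumption: unpack, round/normalize, pack

# Fraction widths in bits, counting the hidden bit.
FRACTION_BITS = {"F_floating": 24, "G_floating": 53, "D_floating": 56}


def divide_cycles(fmt):
    """Estimate divide latency in cycles under the assumptions above."""
    quotient_cycles = math.ceil(FRACTION_BITS[fmt] / QUOTIENT_BITS_PER_CYCLE)
    return quotient_cycles + ASSUMED_OVERHEAD_CYCLES
```

Under this model, F_floating needs 4 quotient cycles, G_floating 9, and D_floating 10, which with the assumed overhead gives the 7, 12, and 13 cycles reported.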
During this cycle, time-critical values, such as complemented element values and first-cycle quotient bits, are calculated and forwarded to the custom divider. When the divider receives the data, it uses an iterative algorithm to produce six quotient bits per cycle. The quotient bits produced are then sent to the packing chips, which may have to increment the quotient, depending on the value of subsequent quotient bits. The divider instructs the quotient accumulation logic whether or not incrementing is necessary. The partial quotient, once decided, is held in a bank of latches until all the quotient bits are received. When the entire quotient is available, the result is rounded, normalized, and packed by using the same logic path as multiplication. A multiplexer switches this packing logic between the multiplication and division logic.

Performance Characteristics

As of this writing, testing of the vector performance of the VAX 9000 system has only just begun. However, some preliminary results are presented in Table 3. We expect that these results will improve as testing continues and more code is optimized to take advantage of the chaining and overlapping provided by the V-box.

Table 3  VAX 9000 Model 210 Preliminary Performance
         (Double-precision MFLOPS, Uniprocessor)

                             Size          Vector
    Peak rate                NA            125
    LFK (Geometric mean)     441           13.2
    LFK (Arithmetic average) 441           20.6
    LINPACK                  1000^2        80
    FFT                      4096          26
    Convolution              150 x 1500    99.15
    Matrix multiply          64^2          111.36

Chaining and Overlapping

Because of the design of the vector register unit, the V-box can concurrently execute a vector add-class instruction, vector multiply instruction, and vector memory instruction. Unlike the VAX 6000 Model 400 system, vector register conflicts between these instructions have little effect on overlapping. With the VAX 9000 system, a conflict only delays the execution of the subsequent vector instruction by one or two cycles at most.

However, the overlapping behavior of the V-box is sensitive to the issue order of vector instructions. If two vector instructions executed by the same V-box unit are issued one after the other, the second instruction is delayed until the V-box unit has finished executing the first. In addition, vector instructions issued after a vector memory instruction or divide instruction do not begin execution until the previous instruction completes. A general rule in scheduling code for the VAX 9000 V-box is to generate, whenever possible, instruction triples, where the first two instructions are a vector add-class and a vector multiply instruction and the last instruction is a vector memory or vector divide instruction. Failing that, at least one vector add-class or vector multiply instruction should be issued before a vector memory or vector divide instruction.

The following code examples demonstrate the usage of the VAX vector instruction set and the overlapping behavior of the VAX 9000 V-box. (Note: It should be assumed in the examples that all arrays are 8-byte double precision.)

In the following DAXPY inner loop example, the first two VLDQ instructions do not overlap. However, the VSMULD, VVADDD, and VSTQ instructions do overlap.

    Do i = 1, 64
      DY(i) = DY(i) + DA * DX(i)
    enddo

vectorizes as:

    VLDQ      DX, #8, V0      ; Load vector DX
    VLDQ/M    DY, #8, V2      ; Load vector DY
                              ; with modify intent
    VSMULD    DA, V0, V1      ; V1 = DA*DX
    VVADDD    V1, V2, V3      ; V3 = V1 + DY
    VSTQ      V3, DY, #8      ; Store vector DY

The first two VLDQ instructions do not overlap in the following MERGE example,

    Do i = 1, 64
      a(i) = b(i) - c(i)
      if (a(i) .gt. 0) then
        b(i) = a(i)
      else
        b(i) = c(i)
      endif
    enddo

which vectorizes as:

    VLDQ      b, #8, V0       ; Load vector b
    VLDQ      c, #8, V1       ; Load vector c
    VVSUBD    V0, V1, V2      ; b-c
    VSTQ      V2, a, #8       ; Store vector a
    VSLSSD    #^X0, V2        ; Test a(*) and set mask
                              ; in VMR. (VSCMP
                              ; pseudo-op doing Less
                              ; Than Signed test)
    VVMERGE   V1, V2, V0      ; Merge a and c into b
                              ; using mask in VMR
    VSTQ      V0, b, #8       ; Store vector b

However, the VVSUBD instruction does overlap with the VSTQ instruction. Both the VSLSSD (VSCMP) and VVMERGE instructions are executed by the vector add unit. Therefore, these two instructions do not overlap. However, the VVMERGE instruction does overlap with the VSTQ instruction.

An IF-THEN-ELSE example, such as the following,

    Do i = 1, 64
      if (a(i) .gt. 0) then
        b(i) = c(i)
      else
        b(i) = c(i) / a(i)
      endif
    enddo

vectorizes as:

    VLDQ      a, #8, V0       ; Load vector a
    VSLSSD    #^X0, V0        ; Test a(*) and set mask
                              ; in VMR. (VSCMP
                              ; pseudo-op doing Less
                              ; Than Signed test)
    VLDQ      c, #8, V1       ; Load vector c
    VVDIVD/0  V1, V0, V2      ; Masked divide of c by a(i)
                              ; for VMR(i) = 0
    VSTQ/1    V1, b, #8       ; Store "then" part of b(*)
    VSTQ/0    V2, b, #8       ; Store "else" part of b(*)

Nothing overlaps the first VLDQ instruction, but the VSLSSD instruction does overlap the second VLDQ instruction. Nothing can overlap with the VVDIVD instruction. Thus, the VSTQ instruction does not begin execution until the VVDIVD instruction completes. The remaining VSTQ instruction waits for the first VSTQ instruction to complete.

In the following scatter-gather example, none of the instructions is overlapped.

    Do i = 1, 64
      if (a(i) .eq. 0) then
        b(i) = c(i)/d(i)
      endif
    enddo

vectorizes as:

    VLDQ      a, #8, V0       ; Load vector a
    VSEQLD    #^X0, V0        ; Test a(*) for zero and
                              ; set mask. (VSCMP pseudo-
                              ; op doing Equal test)
    IOTA      #8, V1          ; Make compressed
                              ; vector of offsets,
                              ; write size of vector
                              ; to VCR
    MFVCR     R0              ; Move VCR into R0
                              ; (MFVP pseudo-op)
    MTVLR     R0              ; Load new VLR value
                              ; (MTVP pseudo-op)
    VGATHQ    c, V1, V2       ; Gather vector c
                              ; using offsets in V1
    VGATHQ    d, V1, V3       ; Gather vector d
                              ; using offsets in V1
    VVDIVD    V2, V3, V4      ; Divide c by d
    VSCATQ    V4, b, V1       ; Scatter vector b using
                              ; offsets in V1

It should be noted in this example that the VSEQLD and the IOTA instructions do not overlap. This lack of overlap occurs because the IOTA instruction is actually done with microcode on the E-box, and the IOTA instruction cannot begin execution until the VSEQLD instruction has computed all the new vector mask register bits. The vector register access instructions (MFVCR and MTVLR) take only a few cycles and do not significantly affect the overlapping of other vector instructions.

Summary

By taking advantage of key features of the VAX vector architecture, such as instruction overlapping, imprecise exceptions, and asynchronous interaction with the scalar processor, the vector processor of the VAX 9000 system provides supercomputing performance for computationally intensive applications. Through the use of barber poling, the vector processor can overlap two vector arithmetic instructions with one memory instruction to deliver a peak double-precision performance of 125 MFLOPS.
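The 125 MFLOPS peak figure can be checked against the 16-nanosecond machine cycle time quoted elsewhere in this issue. The sketch below assumes that "two vector arithmetic instructions" overlapped means one add-class result and one multiply result delivered per cycle; that reading, and the helper name `peak_mflops`, are assumptions made for illustration.

```python
CYCLE_TIME_NS = 16       # VAX 9000 machine cycle time
RESULTS_PER_CYCLE = 2    # assumption: one add-class plus one multiply result


def peak_mflops(cycle_time_ns, results_per_cycle):
    """Peak floating point rate in MFLOPS for a given cycle time."""
    cycles_per_second = 1e9 / cycle_time_ns
    return results_per_cycle * cycles_per_second / 1e6
```

With these inputs, `peak_mflops(CYCLE_TIME_NS, RESULTS_PER_CYCLE)` evaluates to 125.0, since a 16-nanosecond cycle corresponds to 62.5 million cycles per second.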
Acknowledgments

The authors wish to acknowledge the technical contributions of the following individuals to the VAX vector architecture and the VAX 9000 V-box design: Wayne Cardoza, Dave Cutler, Tryggve Fossum, Rich Grove, Kevin Harris, Steve Hobbs, Brian Koblenz, Dwight Manley, Dave Orbits, Bob Supnik, Mike Tehranian, Cheryl Wiecek, and Rich Witek.

References

1. R. Russell, "The CRAY-1 Computer System," ACM Proceedings, vol. 21, no. 1 (January 1978): 63-72.

2. VAX Vector Processing Handbook (Maynard: Digital Equipment Corporation, Order No. EC-H0419-46/89, 1989).

3. R. Brunner, VAX Architecture Reference Manual (Bedford: Digital Press, Order No. EY-F576E-DP, 1990).

4. D. Fenwick et al., "A VLSI Implementation of the VAX Vector Architecture," Proceedings of COMPCON '90 (IEEE, Spring 1990).

5. CRAY-2 Computer System Functional Description (Cray Research, Inc., 1985).

6. W. Buchholz, "The IBM System/370 Vector Architecture," IBM Systems Journal, vol. 25, no. 1 (1986): 51-62.

7. M. Adiletta et al., "Semiconductor Technology in a High-performance VAX System," Digital Technical Journal, vol. 2, no. 4 (Fall 1990, this issue): 43-60.

8. D. Marshall and J. McElroy, "VAX 9000 Packaging - The Multichip Unit," Proceedings of COMPCON '90 (IEEE, Spring 1990).

Peter B. Dunbeck
Richard J. Dischler
James B. McElroy
Frank J. Swiatowiec

HDSC and Multichip Unit Design and Manufacture

The VAX 9000 system effectively integrates state-of-the-art packaging and interconnects with advanced integrated circuits to achieve a short machine cycle time (16 nanoseconds) and a high rate of instruction execution. To meet high-frequency electrical signal and pin count requirements for the system, engineers chose tape automated bonding technology and consequently conceived and developed the high density signal carrier (HDSC).
The HDSC offers densities three to five times greater than conventional printed circuit boards. This unique technology is manufactured using semiconductor and advanced printed circuit board techniques. The HDSC is at the heart of the multichip unit, a high-performance logic module, with which the VAX 9000 CPUs and system control unit are constructed.

Over the past decade, advances in the performance of integrated circuits (ICs) have outpaced advances in packaging and interconnect technologies. Thus a high-performance mainframe with conventionally packaged bipolar integrated circuits would experience interconnect delays that account for more than 50 percent of the system cycle time. Key to optimizing high-end mainframe performance, then, is the effective integration of state-of-the-art packaging and interconnects with advanced integrated circuits. The high-density signal carrier (HDSC) and the multichip unit (MCU) are proprietary technologies that shrink interconnect paths and thus reduce the distance and electrical loading of signals between chips. These technologies use conventional semiconductor and printed circuit board (PCB) equipment in many areas of manufacturing to improve reliability at a competitive cost. The result is shorter machine cycle time and higher instruction execution rate.

The VAX 9000 CPUs and system control unit (SCU) are constructed entirely of multichip units on large planar modules. The SCU is composed of arrays of 6 multichip units, and the CPUs are composed of arrays of 16.

Multichip Unit Design Goals

Beginning at the concept level and throughout the development and test phase, signal integrity considerations guided the development of the HDSC and the multichip unit. Designers had to ensure that the fast signals would not be disturbed by noise. The cycle time goal for the VAX 9000 system, 16 nanoseconds (ns), allows the system to operate at 30 VAX units of performance (VUPs).
To transmit electrical signals quickly between chips, wiring paths must have controlled ratios of wire size to distance from voltage planes. These impedance-controlled paths allow radio-frequency computer signals to propagate with minimal distortion. Prevention of noise on the signals is paramount, and many details of the physical implementation, including spacings between wires, are critical to ensuring signal integrity.

To meet the cycle time goal, high-frequency electrical signal concerns needed to be considered in the design, concerns that would have been negligible for slower speed signals. Due to the physics of electrical fields, as electrical signals switch at high frequencies, they succeed in holding their shape (data) only if they are fed power extremely quickly, and if they are given short paths of uniform properties on which to travel. Due to the amount of power and the short amount of time a signal is given to arrive on chip, conventional chip carrier packages were disallowed for the VAX 9000 system. The signal paths had to be very short to be virtually noiseless. To achieve this objective, engineers decided to enhance tape automated bonding (TAB) technology with a ground plane for electrical control of the wire impedances (paths). This reduction in chip package size also allowed all of the chips for the system to be packaged into a tight area. Consequently, to fit wires between chips, extremely dense HDSC technology was conceived and developed.

The multichip unit also required careful thermal design attention because each chip consumes up to 30 watts. Moreover, most multichip units contain four to eight of these chips plus self-timed RAMs (STRAMs). The key to success for the VAX 9000 program was balancing the trade-offs between performance requirements and technology development risks.
To meet the electrical and density requirements for the machine, engineers specified the following for the multichip unit:

1. Series-terminated output drivers were required on chip. Therefore, external resistors are not needed on the multichip units or programmed into the design elsewhere. These external resistors take up space and lower reliability.

2. TAB was specified for manufacturing reasons. Short TAB tape was required to reduce switching noise on chips. Noise would have been generated if the TAB wires were longer. In the case of the noisiest chips, a ground plane was added to the tape to reduce noise.

3. HDSC etch had to be two routing layers of 18-micron by 9-micron wires on 75-micron centers to meet the density, resistivity, crosstalk, impedance, and other goals.

4. Four power planes, each one powered from two sides, were required to distribute three voltage rails with acceptably high conductivity.

5. Thin dielectric separates the power planes and produces high capacitance, which filters noise and improves performance. This capacitance eliminates the need for discrete parts, which consume valuable space and lower reliability.

6. Impedance control of the connectors on the multichip unit was needed to prevent signal disturbance. Rules were generated for the number of ground pins.

The heart of the multichip unit is the HDSC. The HDSC is an interconnect technology consisting of nine metal layers separated by polyimide dielectric and mounted on a copper baseplate. The top metal layer is a pad layer used to solder-attach all of the integrated circuits and connectors. The four metal layers below make up the signal core. The signal core is a controlled-impedance, dual buried strip line interconnect system used to wire all integrated circuits to each other and to the connectors. The power is brought from the perimeter of the HDSC to the integrated circuits through the bottom four metal layers.

All integrated circuits in the multichip unit are attached to the HDSC by a tape automated bonding (TAB) process. The VAX 9000 system uses four types of chips, all of which have emitter coupled logic (ECL): gate arrays, custom chips, and two types of STRAMs. At each chip site, a cutout in the HDSC allows the chip to attach directly to the baseplate.

The signals on and off the multichip unit are carried by four signal flex connectors, which attach to the perimeter of the HDSC. The signal flex connector provides a separable interface to the planar board and extends the controlled-impedance electrical environment of the HDSC. Power is brought through two power connectors attached to opposite sides of the HDSC. The signal flexes, the power connectors, and the baseplate are attached to the multichip unit housing. The housing provides the structure for the multichip unit and holds the components needed to position and wipe the signal flex. The chips and HDSC surface are covered by a plastic lid.

The high-powered chips are efficiently cooled by a short conductive path through the back of each chip. The thermal power is conducted from the chip to the baseplate and into a pin fin heat sink over which air is impinged to remove the heat.

The following sections describe the implementation of the technology.

The HDSC Design and Manufacturing Process

The goal for the HDSC project was to produce a high-density, high-performance, manufacturable printed circuit board. This goal was achieved. The density of the HDSC is three to five times greater than that of conventional printed circuit boards. Even at this density, the HDSC maintains the signal integrity of bipolar integrated circuits with edge speeds of 200 picoseconds. This section describes how the manufacture of the HDSC pushes the limits of printed circuit board and semiconductor equipment into new types of applications. We also address the integration of computer-aided design (CAD) tools, process controls, and test feedback, which helped us to achieve the results we sought.

HDSC Technology

As noted earlier, the HDSC has nine copper layers for power and signal distribution. The insulating material, polyimide, has a low dielectric constant of 3.5 as compared with the oxide or nitrides used in integrated circuits or with ceramic, which is used for hybrid circuits. The interconnect is laminated to a copper baseplate to provide mechanical structure as well as attachment of the multichip unit heat sink. The conducting layers consist of the following:

• Two layers for signal distribution
• Two layers that serve as signal reference planes
• Four layers for power distribution
• One layer with bonding pads to attach the TAB and connectors

The signal distribution is a single x-y pair that uses the reference planes to create a dual strip line interconnect. This interconnect provides a controlled-impedance signal path with minimal crosstalk. Table 1 lists the electrical and physical design parameters of the HDSC.

Process Overview

The HDSC is manufactured by two types of processes: core processing and assembly processing. Figure 1 is a diagram of the HDSC process flow.

The core process, described further below, uses semiconductor manufacturing equipment and is similar to the manufacturing process for the back end of an integrated circuit. Two cores are manufactured: a signal core for strip-line signal interconnect, and a power core for the four planes (or layers) that distribute power throughout the finished HDSC.

The second process, assembly, uses advanced printed circuit board techniques to laminate and interconnect the signal core and power core. The completed HDSC has solder pads to accept the outer lead bond of TAB integrated circuits, signal flex, and power flex.

The HDSC is tested with a custom flying probe tester.
Tests are made to ensure the HDSC is functional and meets electrical parameters.

Figure 1 HDSC Core and Assembly Process Flow

Core Processing

The process for the manufacture of the signal and power cores, or the core process, consists of alternating between copper deposition and polyimide coating until the completed interconnect layers are built on the metal wafer. The process is performed on a metal substrate shaped like a 6-inch semiconductor wafer. Copper layers are deposited by a combination of sputtering and plating techniques. Patterns in the copper that become signal traces are generated by a semiconductor photolithographic technique. First, a photoresist is applied to the metal wafer. The resist is then exposed to the pattern in the mask that is held by the semiconductor wafer aligner. This pattern is then developed in the resist and etched into the copper. The remaining copper thickness is then added by plating. Another resist pattern is developed over the plated signal traces to define where a copper connection between interconnect layers will occur. This connection is called a via post, and it is also formed by a plating process. Polyimide is spun on to the wafers by integrated circuit photoresist spin tracks.

Table 1 HDSC Physical and Electrical Design Parameters

Line pitch               75 microns
Line width               18 microns
Line thickness           10 microns
Dielectric thickness     25 microns
Dielectric constant      3.5
Line impedance           60 ohms
Line resistance          1.0 ohm/centimeter
Crossover capacitance    3.6 femtofarads
Crosstalk                5.1 percent maximum
Propagation delay        66 picoseconds/centimeter
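As a rough cross-check of two Table 1 entries, the sketch below estimates line resistance from the 18-micron by 10-micron conductor cross section and propagation delay from the dielectric constant of 3.5. The copper resistivity value, the function names, and the ideal homogeneous-dielectric (TEM) delay model are assumptions made for illustration; they are not taken from the paper.

```python
import math

RHO_COPPER_OHM_M = 1.68e-8    # assumed bulk copper resistivity near room temperature
C_CM_PER_PS = 0.0299792458    # speed of light in centimeters per picosecond


def line_resistance_ohm_per_cm(width_um, thickness_um, rho=RHO_COPPER_OHM_M):
    """DC resistance per centimeter of a rectangular conductor."""
    area_m2 = (width_um * 1e-6) * (thickness_um * 1e-6)
    return rho / area_m2 / 100.0   # ohm per meter converted to ohm per centimeter


def tem_delay_ps_per_cm(dielectric_constant):
    """Delay of an ideal TEM line fully embedded in one dielectric."""
    return math.sqrt(dielectric_constant) / C_CM_PER_PS


resistance = line_resistance_ohm_per_cm(18, 10)   # Table 1 line width and thickness
delay = tem_delay_ps_per_cm(3.5)                  # Table 1 dielectric constant
print(f"{resistance:.2f} ohm/cm, {delay:.1f} ps/cm")
```

Under these assumptions the resistance comes out a little under 1 ohm per centimeter, in line with the table, and the ideal delay comes out a few picoseconds per centimeter below the 66 listed. The paper does not say how its delay figure was derived, so the gap should not be read as an error in either number.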
The relatively thick polyimide (25 microns at signal layers) helps to planarize the surface of the wafers and also to cover [...]

Figure 3 Isometric of a Gate Array Showing Features of the TAB

[...] no plating is required for epoxy die attach. The epoxy die attach is filled with microscopic particles to enhance the thermal conductivity while maintaining electrical isolation between chips.

Signal Flex Connector

The signal flex connector is a high-density, controlled-impedance connector used to transmit signals between the HDSCs and the planar module. Each multichip unit has four flex connectors with a combined signal I/O of 800 in an area less than 40 square centimeters. Figure 4 shows a cross section of one signal flex connector. The body of the connector is a two-metal-layer flex print with 50- and 60-ohm signal lines. The ground plane in the flex circuit is used as an AC return path. No power is carried through the signal flex. The signal plane contains 200 etch lines with a raised gold bump on each at the planar module interface. The connection to the HDSC is a solder bond similar to the solder bonds for the TAB device. A window is opened through the polyimide to allow the formation of cantilevered, exposed, solder-plated leads.

The raised bump on the flex circuit concentrates the contact force into a small area. The bump is solid copper that is plated over with nickel and hard gold. The force on the bump is generated by compressing a molded silicone rubber elastomer. The compression of the connector causes the flex frame to engage a cam on the housing and wipe the contacts across the planar module pads. The connector is compressed, nominally, 1.27 mm and wipes 0.46 mm.
The bottom of the elastomer mates with a tray which has a contoured surface to vary the compression along the length of the elastomer. This contoured surface improves the uniformity of the force that the bumps exert on their pads. The connector has been designed to generate 100 grams minimum load on all bumps. The wipe action and the bump force of the connector minimize the effect of dust and environmental films on the mating surfaces.

Figure 4 Signal Flex Connector with Detail of Bump

Power Connector

The power consumed by the multichip unit is brought in through two power connectors mounted on opposite sides of the HDSC. The connector is composed of a flex circuit, a connector, and decoupling capacitors. The flex circuit is solder bonded to large pads on the HDSC surface. The flex has three copper conductive planes separated by polyimide dielectric. The connector has stamped metal contacts soldered into the flex circuit and assembled into a plastic housing. The connector plugs into flat blades on the bus bar of the planar module assembly. The decoupling capacitors on the power flex circuit filter the medium-frequency switching noise on the MCU and the MCU power bus.

Thermal Design

The multichip unit was designed from conception to provide an efficient cooling path for the integrated circuits. Figure 5 shows a cross section of the multichip unit. The heat dissipated by the chips is conducted through the silicon and the die attach into the baseplate. As mentioned above, the die attach is an epoxy heavily filled with microscopic diamond particles to increase thermal conductivity. The heat spreads out in the copper alloy baseplate and is conducted across a dry interface to an aluminum base of the pin fin heat sink.
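The junction-temperature arithmetic for this conduction path can be sketched as follows, using the figures this section quotes for a gate array: 30 watts dissipated, a bound of 2.0 on thermal resistance, and room air at 25 degrees Celsius. The sketch assumes the 2.0 figure is in degrees Celsius per watt, the only reading consistent with the quoted 85-degree junction temperature; the function names are illustrative.

```python
def junction_temp_c(power_w, theta_c_per_w, ambient_c):
    """Steady-state junction temperature: ambient plus power times thermal resistance."""
    return ambient_c + power_w * theta_c_per_w


def max_theta_c_per_w(power_w, junction_limit_c, ambient_c):
    """Largest junction-to-air thermal resistance that keeps the junction at or below a limit."""
    return (junction_limit_c - ambient_c) / power_w


tj = junction_temp_c(power_w=30.0, theta_c_per_w=2.0, ambient_c=25.0)
theta = max_theta_c_per_w(power_w=30.0, junction_limit_c=85.0, ambient_c=25.0)
print(tj, theta)
```

Because the quoted resistance is an upper bound, the 85-degree figure is a worst case at this power and air temperature; a lower resistance gives a proportionally cooler junction.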
The heat sink has 600 aluminum pins, each 0.20 centimeters in diameter, pressed into the base. Air plenums in the cabinets direct at least 14.6 liters per second of air into each multichip unit heat sink. The thermal resistance for a 30-watt gate array is less than 2.0 degrees Celsius per watt, which gives a junction temperature of 85 degrees Celsius with room air at 25 degrees Celsius. This low junction temperature is a critical part of the high reliability of the multichip unit.

Figure 5 Thermal Path

Clock Distribution

The system clock on the VAX 9000 system is distributed to each of the multichip unit clock distribution chips (CDxx). The CDxx generates 40 differential outputs which are routed through equal-length etch to the other chips. The CDxx also distributes and controls the scan lines that test the unit both in manufacturing and in the field. The scan lines also allow the unit serial number and revision status to be read by the system console.

Multichip Unit Manufacturing

Figure 6 shows the manufacturing process flow, which has three major work centers:

• 54-class assembly and inspection
• P1000 assembly and inspection
• Test and diagnose

In the 54-class process, TAB semiconductor devices are assembled to the HDSC substrate, resulting in the subassembly known internally as a 54-class module. In the P1000 process, connector and housing components are assembled. At the last major center, the test process, final units are tested and, if necessary, diagnosed. A shop floor control system tracks the units through the line and provides critical component and process trace information. In addition, this control system is used to monitor process parameters to ensure control of the line and consistent product quality. The following sections provide insight into several of the process technologies we used to meet the manufacturing goals of the VAX 9000 system.

Figure 6 Manufacturing Process Flow

TAB and Flex Circuit Bonding

The insertion and soldering of leads is the most critical step in the multichip unit manufacturing process. Single-lead and multiple-lead gang bonding approaches were both considered. Gang reflow soldering is an effective way to achieve repeatable, reliable connections for both the TAB semiconductors and the signal flex circuits. Early development work on manual machines required operator action for lead forming, lead alignment, and gang bonding. Today, critical process parameters (time, pressure, temperature) are computer controlled to specified values, and the process uses tools to assist the operator in material movement and vision systems to improve alignment of leads. Before bonding, the leads are covered with a low activation flux which is removed later in the process.

Die Attach

Another critical manufacturing step is the die attach process. The excellent thermal performance of the multichip unit is achieved by following these steps:

• Careful control of the die attach materials with feedback to our suppliers.
• Surface cleanliness specified and also managed with our suppliers.
• Dispensing of epoxy. The filled epoxy is dispensed by an x-y table that is computer controlled to supply the correct pattern for the particular multichip unit type.
• Establishment of bond line thickness and epoxy cure. Bond line thickness is accomplished by mechanically applying pressure while curing in a purged belt furnace.

Inspection

To ensure that all soldered leads are reliably bonded, leads must be inspected for shorts, misalignments, opens, and weak joints. Shorts and misalignments are discovered by an automated vision system that calls marginal points to the operator's attention. The operator can then determine if repair action is warranted. Inspection for opens and weak joints is done by striking the leads with a pulse of laser energy and then measuring the thermal decay profile. Repair is typically made by localized short removal or single-point bonding. Over time, we believe that our materials and processes can be controlled to the point at which inspection and repair can be dramatically reduced.

Final Test

The goal of our test process was to ensure that multichip units would operate successfully in a system environment. Since no test equipment manufacturer offered a system that met our needs, we developed our own by working with several Digital groups as well as outside suppliers. The system contains three major stations. The first provides alignment information and can also read visual serial and part numbers. In the second station, low voltage shorts are determined between nearest neighbor leads. This step supplements our inspection for shorts described above. In the final station, we test for connector opens, thermal measurement (die attach integrity), scan chain integrity, and scan pattern data. The scan pattern testing is done in several bursts of the clock at system speed. In addition, diagnostic capability is provided by flying probes, voltage and clock margining, and a thermal chuck to vary temperature.

Conclusion

Successful use of advanced interconnect technologies requires a seamless phased development process that begins with advanced development and continues through volume manufacture. The HDSC and multichip unit technologies have successfully achieved the volume manufacturing phase. Using the products and technologies described in this paper, we have played a key role in the introduction of the VAX 9000 system to the marketplace. Extensions of this manufacturing process will ensure that this technology base can be applied across a wide spectrum of products of both higher and lower performance.

Matthew S. Goldman
Paul H. Dormitzer
Paul A. Leveille

The VAX 9000 Service Processor Unit

The VAX 9000 service processor unit provides the front-end services needed to support a highly available and reliable mainframe system. The unit is closely linked to the VAX 9000 system to provide real-time detection and recovery of system failures. However, the unit is independent enough to be isolated for maintenance without affecting normal system processor operation. This combination is a first for VAX systems. The service processor also provides various debugging features that were essential for development and early manufacture of the VAX 9000 system. These features utilize a system-wide scan architecture to achieve direct access to machine state, which provides extensive visibility and control of system logic functions. The inclusion and use of such a scan architecture is a new feature for a Digital processor.

The VAX 9000 service processor unit (SPU) is designed to provide a dedicated subsystem for service and maintenance support for the VAX 9000 family. The SPU serves two distinct roles. It functions as the familiar operator interface (i.e., VAX console) and as a maintenance vehicle used to diagnose and isolate system processor hardware faults.
The SPU performs the following major front-end services:

• System initialization
• Power system control and monitoring
• Environmental monitoring
• Clock control and monitoring
• VAX 9000 operating system access to SPU mass storage devices (disk and tape)
• Remote diagnosis port support
• System error detection, recovery, and reporting

The SPU also provides or assists in the following system diagnosis functions:

• SPU module self-tests
• Scan system diagnostics
• Clock system diagnostics
• Scan pattern structural diagnostics
• Structure cell (e.g., self-timed random-access memory [RAM]) diagnostics
• XMI-to-system control unit adapter interface test
• Symptom-directed diagnosis support

In addition to its use as the front-end processor for the VAX 9000 system, the SPU was embedded in several manufacturing and engineering test vehicles. In the Debugging Features section of this paper, we describe how the SPU was used as a debugging tool during VAX 9000 product development and the various debugging features we provide to help locate design and fabrication problems.

A major goal of the SPU was to perform system-wide error detection and recovery functions for the VAX 9000 processor. In the Error Handling section of this paper, we detail the types of errors that the SPU handles and how error detection, reporting, and recovery occur.

Another of our design goals was to be able to service the SPU without adversely affecting the operation of the system processor. This feature was needed to support the high availability requirements of a mainframe system. To meet this goal, we designed mechanisms to enable the VAX 9000 operating system to determine that the SPU is not functional (whereupon the operating system takes the appropriate action to secure its own operation), as well as to recognize and reintegrate with the SPU when the SPU is functional again.
If the VAX 9000 operating system attempts to access one of the SPU-based processor registers and the SPU does not respond, the failure is detected by using the usual register time-out mechanism. However, because the SPU is responsible for system error handling, SPU failures must be detected quickly to enable the SPU to respond to a system error should one occur. Consequently, we developed a keep-alive protocol with which the VAX 9000 operating system can determine SPU failures without relying on operating system accesses to SPU-based processor registers. The keep-alive mechanism is described in more detail under the Error Handling section of this paper. Both the time-out and keep-alive mechanisms work regardless of whether the SPU has an unexpected failure or undergoes a scheduled power-down.

Should the SPU require service, field upgrades may be performed easily and quickly because of the modularity of the hardware, which is primarily VAXBI bus interface-based adapters. The VAXBI backplane minimizes downtime because modules can be removed or inserted without requiring recabling. When power to the SPU is restored, SPU self-tests are performed. The SPU's operating system then boots automatically and signals its availability to the VAX 9000 operating system.

The SPU is designed to continue operation even if the SPU primary storage device, an RD54 Winchester disk drive, fails, which further increases the availability of the SPU. For customers who require data security and high availability, we designed a system configuration option that does not use a disk drive. In this case, the SPU boots from TK50 cartridge tape. The SPU functions that require a disk drive for data storage (e.g., SPU-generated error logs) are disabled in this configuration.

SPU Architecture

A block diagram of the SPU architecture is shown in Figure 1.
The service processor module, scan control module, and power and environmental monitor were designed uniquely for the VAX 9000 system. The disk controller, tape controller, as well as the memory daughter board were available from other Digital products. Every SPU VAXBI adapter provides its own built-in self-test diagnostics.

Figure 1 VAX 9000 SPU Block Diagram

SPU hardware is based on either industry-proven (e.g., 7400-series TTL components, complementary metal oxide semiconductor [CMOS] gate arrays) or Digital-proven technology (e.g., VAXBI, Digital custom CMOS devices) to ensure that the unit is an effective debugging platform for a system processor based on leading edge technology. As a result, the inherent risk and learning curve associated with new technology were avoided and the SPU was ready and available during the VAX 9000 system prototype debugging process.

The SPU also was made available to manufacturing process and tester groups (e.g., multichip unit tester) for use with their designs. The advantages to this approach were that technicians became familiar with the same subsystem that would be used in the VAX 9000 family, and the test programs could be transferred for use in other test environments that also used the SPU, including the VAX 9000 system itself.

The service processor module is the primary processing element of the SPU and is the VAXBI host adapter. Based on the MicroVAX 78032 chip and several custom-designed application-specific integrated circuits (e.g., SPU-to-system control unit adapter, SPU memory controller), the module contains all the hardware necessary to store and execute the SPU operating system. The on-board firmware contains a VAX standard console interface to load the SPU operating system during initialization and to assist in subsystem debugging. The SPU-to-system control unit interface (SJI) connects the service processor module to the system control unit and is the primary communication path between the SPU and the VAX 9000 operating system.

The scan control module is the control interface to the VAX 9000 scan system, which is the visibility and maintenance path to the system processor. Like the service processor module, the scan control module is based on the MicroVAX 78032 chip and several custom-designed application-specific integrated circuits (e.g., scan control chip, scan distribution chip). On-board firmware provides high-level functions that allow the service processor module to continue processing while scan-related operations, including logical-to-physical signal translations, are performed concurrently by the scan control module. The scan interconnect (SCI) connects the scan control module to the system processor (i.e., one to four CPUs and the system control unit) and the master clock module. Using this interface, the system processor may also interrupt the SPU when the processor needs service. This type of interrupt request is known as an attention.

The SPU is integrated into the system cabinet to better meet the performance requirements necessary for system error recovery and VAX 9000 operating system boot. Cabinet integration substantially decreases interconnect distances to processor logic and ensures that all cables are kept internal to the cabinet. Another reason for choosing the VAXBI backplane card cage is that its form factor is small, which reduces the cabinet area needed (cabinet area is always in high demand), yet the user-definable zones provide the high pin density required for interconnects (i.e., 180 I/O pins per VAXBI slot).

Communication Path

The SPU communicates with the system processor using the SJI. This interface is used to load the primary bootstrap into the VAX 9000 main memory, transfer error and machine-check information to the VAX 9000 operating system, provide file transfer access between the VAX 9000 operating system and the SPU's RD54 disk drive, access system main memory, and access system I/O registers.

The VAX 9000 operating system accesses the SPU as if it were a standard I/O device. The SPU is an independent subsystem and does not rely on the execution unit of the system processor to be a console processing engine, as was done in previous VAX systems. There are several benefits to this design approach. Each CPU has equal access to the SPU and may interrupt the SPU to request service. In addition, the SPU may interrupt any of the CPUs to request an operating system service. The SPU may be used as a debugging tool during system processor debugging because it does not require that any portion of the system processor be operational.

The fact that the SPU could be used as a debugging tool was an extremely important benefit for the VAX 9000 system debugging effort. The debugger did not have easy access to the logic elements because of the advanced packaging and circuit integration of the VAX 9000 system. Therefore, SPU services were utilized in lieu of logic probes. Further, because the SPU no longer uses the CPU for system access, console support microcode (i.e., the collection of microcode procedures traditionally used for access to the system processor, memory, and I/O registers) is not required. The benefit of this approach is that valuable VAX 9000 control store space could be used for system microcode or to reduce the control store size. For example, in the VAX 8650 system, console support microcode occupies approximately 180 microword locations.

VAX 9000 operating system access to the SPU is through the VAX console register set. We extended the VAX console register set to provide access to the enhanced capabilities of the SPU. Additional registers include transmit function request and parameter and receive function request and parameter (i.e., TXFCT, TXPRM, RXFCT, RXPRM). Table 1 lists the functions provided by these registers.

SJI communications are in the form of 14-byte packets that contain the command (i.e., function), address, and data. Packets are sent and received over two 8-bit data paths that provide full duplex operation. Data transfers peak at 3.5 megabytes (MB) per second for quadword transfers. When the VAX 9000 operating system executes a Move_to/from_Processor_Register instruction that specifies an SPU register, the system control unit sends an I/O command packet, through the SJI, to the SPU to initiate the system request. Then the SPU typically uses an interrupt command packet, which generates an interrupt to the specified CPU. The two other packet types are direct memory access and error correction code.
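To make the fixed-size framing concrete, the following sketch packs and unpacks a 14-byte packet carrying a command, an address, and data. Only the 14-byte size, the three kinds of content, and the four packet type names come from the text. The field widths, field order, byte order, numeric type codes, and all identifiers are hypothetical, chosen so that the fields add up to 14 bytes with room for one quadword of data.

```python
import struct
from enum import IntEnum


class PacketType(IntEnum):
    # The four packet types named in the text; the numeric codes are hypothetical.
    IO_COMMAND = 0
    INTERRUPT_COMMAND = 1
    DMA = 2
    ECC = 3


# Hypothetical layout: 1-byte type, 1-byte function, 4-byte address, 8-byte data.
SJI_FORMAT = ">BBI8s"
SJI_PACKET_BYTES = struct.calcsize(SJI_FORMAT)   # 14 with this layout


def pack_packet(ptype, function, address, data=b""):
    """Build one fixed-size packet, zero-padding the data field to a quadword."""
    if len(data) > 8:
        raise ValueError("data field holds at most one quadword (8 bytes)")
    return struct.pack(SJI_FORMAT, ptype, function, address, data.ljust(8, b"\x00"))


def unpack_packet(raw):
    """Split a 14-byte packet back into its four fields."""
    if len(raw) != SJI_PACKET_BYTES:
        raise ValueError("an SJI packet is exactly 14 bytes")
    ptype, function, address, data = struct.unpack(SJI_FORMAT, raw)
    return PacketType(ptype), function, address, data


packet = pack_packet(PacketType.IO_COMMAND, 0x01, 0x1000, b"\xAA\xBB")
fields = unpack_packet(packet)
```

A fixed packet size keeps the receive logic simple: each side always reads the same number of bytes and never needs a length field or delimiter.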
Table 1 RXFCT/RXPRM and TXFCT/TXPRM Functions

RXFCT/RXPRM Functions (SPU to System Processor)
  Remove processor
  Add processor
  Mark memory page bad
  Request pages of memory
  Send error log entry
  Send OPCOM message
  Get datagram buffer
  Send datagram
  Return datagram status
  Set keep-alive state
  Abort datalink
  Error interrupt

TXFCT/TXPRM Functions (System Processor to SPU)
  Get hardware context (of a halted CPU)
  Virtual block file operation (access to SPU disk and tape)
  Keep-alive
  Send datagram
  Return datagram status
  Switch primary
  Reboot system request
  Clear warm start flag
  Clear cold start flag
  Boot secondary processor
  Halt CPU and remove from available set
  Halt CPU and keep in available set
  Console quiet
  Set interrupt mode
  Abort datalink
  Reset I/O system
  Disable vector unit
  Set keep-alive state
  Start processor
  Margin power
  Margin clock
  Fault signal
  Start error window
  End error window
  Report error in window
  Get error log entry
  Get unmarked error log entry and mark
  Enable halt restart
  Get I/O physical address memory map configuration
  Get physical address memory map configuration

Visibility Path

In the development and manufacture of a complex computer system, extensive testing methods must be available to ensure functional operation and product quality. Design engineering no longer can use manual probing techniques in prototype debugging. Space limitations have resulted from advanced packaging and the close pitch of integrated circuit I/O pins, which is due to high integration levels. Failure isolation must be performed in the manufacturing process, often without an extensive knowledge of the machine design.

A separate visibility and control path in the system processor of the VAX 9000 system provides nearly 100 percent visibility to the machine-state. The visibility path eliminates the need to select a subset of visibility points to meet all test needs, as was done with previous VAX systems. In addition, the path allows designers to directly alter the entire machine-state, which is a major advantage for design and process debugging. A VAX 9000 uniprocessor (i.e., one CPU and system control unit) contains over 26,000 access points.

The path is called the VAX 9000 scan system and is controlled by the scan control module. The scan system is the foundation for direct access by prototype debuggers, system error recovery software, and diagnostics to observe and alter the VAX 9000 machine-state. Some functions provided by the scan control module and supporting SPU software are

• Load and save processor state
• Scan pattern execution
• Continuity testing of the processor's scan hardware
• Multichip unit type and revision information extraction
• Processor attention notification

A block diagram of the VAX 9000 scan system is shown in Figure 2. The scan control module connects to the system planar module over the SCI. Scan and clock distribution logic, contained in a macrocell array on the planar module, distributes data and control signals over the scan bus to each of the multichip units. A clock distribution chip at the hub of each multichip unit further distributes the scan bus signals to the macrocell arrays, which are integrated circuits that contain system logic.

As shown in Figure 3, the state devices within a macrocell array are scan latches. The latches are connected serially to form a ring or chain by connecting the Scan_Data_Out line of each latch to the Scan_Data_In line of the next latch. The end links are connected to the clock distribution chip. When the system clocks are running, data is loaded into the latch from the system data input. During scan operation, system clocks are not active.
Generated by the scan control module, the scan clocks load the latch with data from the scan data input. Consequently, the scan control module reads system state by issuing scan clocks, which serially shift system data to the scan control module. System state is changed when the scan control module drives new data to the system latches while issuing scan clocks.

An architectural feature permits each multichip unit to generate an attention interrupt directly to the scan control module over the scan data return line. Attentions notify the SPU of system events, such as processor errors, memory self-test completion, CPU halts, and keep-alive responses.

System diagnostics can diagnose the SCI by using the same control signals as used for scan system operation. Dedicated logic and special routing of the scan lines provide failure isolation. Stuck-at faults and disconnect conditions can be isolated to the multichip unit.

Debugging Features

In addition to its use as the VAX 9000 front-end processor, the SPU provides a variety of features for debugging and troubleshooting multichip unit logic configurations. These features were required because all multichip unit logic visibility and control is handled through the SCI, which connects directly to the SPU. The use of scan latches to access internal logic states is a first for VAX systems and challenged the designers to define and deliver the necessary tools and features to assist the multichip unit debugging effort. Furthermore, the features provided by the SPU had to apply to various tester environments, ranging from single multichip units mounted in probe stations to full system configurations. Additional requirements to support the clock and power system test stations made it clear that the SPU would have to be adaptable to a variety of environments.
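The serial scan-latch chain described in the Visibility Path section can be modeled in a few lines. This is a minimal sketch and not the scan control module's actual procedure: the chain is represented as a list of bits, index 0 is taken as the latch fed by the scan data input, and recirculating shifted-out bits back into the chain during a read (so that the state is restored) is an assumption. All names are illustrative.

```python
def scan_clock(chain, scan_data_in):
    """Apply one scan clock: every latch takes its predecessor's value.

    Returns the bit that leaves the last latch (its Scan_Data_Out).
    """
    scan_data_out = chain[-1]
    chain[1:] = chain[:-1]          # shift each bit one latch down the chain
    chain[0] = scan_data_in & 1     # first latch loads the scan data input
    return scan_data_out


def scan_read(chain):
    """Read the full state by shifting it out, feeding each bit back in.

    After len(chain) clocks the chain holds its original contents again.
    """
    shifted_out = []
    for _ in range(len(chain)):
        shifted_out.append(scan_clock(chain, chain[-1]))
    return shifted_out[::-1]        # last latch's bit emerges first, so reverse


def scan_write(chain, new_state):
    """Load a new state by shifting it in, last latch's bit first."""
    for bit in reversed(new_state):
        scan_clock(chain, bit)


latches = [1, 0, 1, 1]
observed = scan_read(latches)       # observe state without disturbing it
scan_write(latches, [0, 1, 1, 0])   # alter state while system clocks are stopped
```

The sketch shows why a chain of N latches costs N scan clocks to read or to write in full, and why system clocks must be stopped while shifting: every scan clock moves every bit to a different latch.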
[Figure 2: VAX 9000 Scan System. Block diagram showing the service processor, the SCI, the scan control module, the scan and clock distribution (SCD) logic on the planar module, and multichip units MCU0 through MCUn, connected by scan data in and control lines and a scan data return line.]

The translation from a logical signal to its associated scan latch uses data structures supplied in a configuration database file, which is loaded into SPU memory during SPU initialization. All CPUs with identical multichip unit configurations (i.e., same CPU revision) share the same configuration database memory image. The system control unit always requires its own database. Only two CPU revisions can be supported at one time because of SPU memory constraints for storing the separate configuration databases. However, by providing for two CPU revisions, the needs of single and dual CPU configurations were completely satisfied. Further, it was possible to upgrade homogeneous triple and quadruple configurations in a stepwise manner.

Macrocode Execution

Initial system-level multichip unit configurations consisted only of a scalar CPU. The system control unit was not yet available as a result of the extended simulation of the design. Fortunately, we had anticipated the possibility of running partial configurations and could provide modes within the SPU software to redirect commands that normally access main memory (e.g., EXAMINE, LOAD) to access the CPU's 128 kilobyte (KB) system cache or 8 KB virtual instruction cache instead. The first VAX macroinstructions were loaded and executed on the VAX 9000 system using this technique.

An additional feature, which involved minor hooks in the system microcode, provided a means for the VAX instruction set diagnostic, EVKAA, to communicate with the console terminal through scan attentions rather than by using the system control unit. Thus, the diagnostic could run to completion.
Advanced Debugging Features

Although not obvious aids to VAX 9000 debug, the following features were indispensable or, at the least, reduced debugging time and effort:

• A character-cell windowing capability that allows system microcode sources to be automatically located, displayed, and updated on the screen as the system is single-stepped. We modeled this feature after the VAX debugger's windowing capability because most VAX engineers are familiar with this capability. Windowing eliminated the need for hard-copy microcode listings and the logistical problems associated with their use.

• By connecting the SPU to the engineering network during development, timely updates of SPU software were made possible. This kept the VAX 9000 debugging effort, which was occurring simultaneously on several systems, up to date with the latest SPU software fixes and enhancements. Together with the multisession capability of the SPU operating system, the use of the network made remote debugging a reality throughout the VAX 9000 debug phase.

• Because the SPU had to initialize the VAX 9000 system thousands of times during system debugging, the unit was designed to perform system initialization as efficiently as possible. For example, the loading of structures (e.g., control stores or cache tags) was optimized by overlapping the operation of three MicroVAX-based processors: the service processor module, the scan control module, and the disk controller.

The debugging features located early design and fabrication problems in the clock, power, scan, and processor logic areas. Ultimately, the features were used to initialize and run the first VAX 9000 system.

Error Handling

To support high system availability, accurate and timely error detection and logging are required.
Error data collection cannot depend upon host system availability, and the data must be available when the system is not functional. Therefore, an independent service subsystem that can collect data from all system components, render it into a useful format, and store and display the information is needed. The service subsystem must also be organized in such a way that if it fails, it does not directly cause system processor failures. Repair, reboot, and system reintegration must occur without interfering with system processor operation. The SPU meets these requirements; it is a fully independent computer that runs its own operating system with dedicated peripherals. The SPU performs system-wide error detection and reporting functions and provides advanced error recovery features for the system processor.

Error Detection

The SPU reports errors in its own VAXBI adapters, the service processor module, the scan control module, the power and environmental monitor, the disk controller, and the tape controller. It also reports errors in various parts of the VAX 9000 system, such as the system control unit, the CPUs, the memory system, the master clock module, and the power and environmental systems. Because failures in any of these subsystems can incapacitate the VAX 9000 system, none of them reports its errors directly to the VAX 9000 operating system.

SPU Errors. The disk controller, tape controller, and scan control module use the VAXBI VAX port protocol to report errors. The power and environmental monitor passes error information to the service processor module through its private bus, the SPU-to-power control system interface.

Environmental Exceptions. The power and environmental monitor monitors the regulator intelligence cards, airflow sensors, and temperature sensors throughout the system.
When it detects any problems in operating voltages, currents, temperatures, or airflow, it notifies the service processor operating system, which logs the error condition.

Clock Exceptions. When the master clock module detects an error in either the clock phase or the clock frequency lock, it generates an attention to the scan control module, which interrupts the service processor module. The SPU operating system logs the error condition.

Memory Error Correction Code Events. The main memory of the VAX 9000 system contains error correcting logic to correct single-bit errors and detect double-bit errors. When a memory location with a single-bit error is read, the system control unit corrects the error and passes the corrected data to the requesting device. It also writes an SPU register with the error type and the failing memory address. The SPU operating system writes this information to the error log. If the system control unit detects a double-bit error or reads a marked-bad location, it passes the bad data, marked as bad, to the requesting device and notifies the service processor operating system, which logs the error. The bad data is handled locally by the requesting device, usually by generating an error of its own.

CPU and System Control Unit Errors. When a CPU detects an error in a parity checker, it attempts to come to an instruction boundary and halt. Once it has halted, the CPU sweeps its cache. When the cache sweep is completed, the CPU asserts an attention to the scan control module to inform the SPU that recovery is required. When the system control unit detects an error, it first asserts a fatal error signal to each of the CPUs, and then asserts an attention. When the CPUs receive the fatal error signal, they attempt to come to an instruction boundary and halt. Once halted, the CPUs assert attention lines to the scan control module.
The caches are not swept since their path to memory, the system control unit, is not working.

Keep-alive Timeout. To ensure that a CPU is not hung by an undetected error, the SPU periodically sends a keep-alive interrupt to each CPU. CPU microcode services the interrupt at the next macroinstruction boundary by asserting an attention to the scan control module. If the CPU should be hung by an undetected error, the SPU times out while it waits for the keep-alive reply attention and, thus, determines that there has been an error. Similarly, the primary CPU monitors the SPU by sending it a keep-alive request through the TXFCT register. If the SPU does not respond to this request within a timeout period, the VAX 9000 operating system assumes that the SPU is hung and reboots it using a VAXBI reset. When the SPU reboots, it reintegrates itself with the rest of the VAX 9000 system without interfering with system operation.

Error Reporting

When errors are reported to the SPU operating system, the error formatting facility logs the error information locally and reliably transmits it to all intended receivers. The error formatter maintains the error log file ERRLOG.SYS on the SPU RD54 drive, passes error log entries to the VAX 9000 operating system to be logged in the system error log, and also passes the entries to any SPU software that requests them. The error formatter writes the error log file using the SPU operating system disk I/O functions, passes the error log entries to the VAX 9000 operating system using an RXFCT function, and passes the error log entries to other SPU processes using the SPU port protocol. If the RD54 drive is not available, which prevents access to the SPU error log, the error formatter continues to send error log entries to the VAX 9000 operating system and to other SPU processes.
The SPU error log contains all the error log entries collected by the SPU (but not those collected by the VAX 9000 operating system) and time stamps, which are logged every ten minutes. Should an SPU operating system crash occur, the time stamps may be used to determine the approximate time of the crash. Errors are logged regardless of the state of the system processor. As a result, information is available for analysis even in the event of a total processor failure. The error log file may also be transferred to TK50 tape for off-site analysis.

The error formatter passes error information to the VAX 9000 operating system by copying the error log entry to system memory and then invoking the RXFCT function to notify the VAX 9000 operating system that the entry is available. Should the operating system not respond to this notification, the error formatter assumes that the operating system has crashed and writes the error log entry to a temporary data file. When the VAX 9000 operating system reboots, it notifies the SPU by using a TXFCT function. The error formatter then reads any saved error log entries from the data file and transmits them to the VAX 9000 operating system. This protocol ensures that all collected error data is eventually reported in the system error log.

The error formatter also maintains an SPU port to which any process running on the SPU may connect. Connected processes receive copies of all error log entries as the entries are logged. This port is used by EWKCA, the symptom-directed diagnosis tool, which analyzes errors as they occur and determines which system components might have caused the failure. The port is also used for system debugging by the error insertion program to verify that errors are being logged and analyzed correctly.
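The store-and-forward behavior described above can be sketched in a few lines. The following is a minimal illustration, not the SPU software: the class name, the `notify_host` callback (standing in for the RXFCT notification and its acknowledgment), and the in-memory lists (standing in for the error log file, the temporary data file, and the SPU port) are all hypothetical.

```python
class ErrorFormatter:
    """Toy store-and-forward error formatter (hypothetical names throughout)."""

    def __init__(self, notify_host):
        # notify_host(entry) returns True if the host acknowledged the entry.
        self.notify_host = notify_host
        self.local_log = []     # stands in for the SPU error log file
        self.spool = []         # stands in for the temporary data file
        self.subscribers = []   # callables for processes connected to the port

    def log(self, entry):
        """Log locally, copy to connected processes, then try the host."""
        self.local_log.append(entry)
        for deliver in self.subscribers:
            deliver(entry)
        if not self.notify_host(entry):
            # Host did not respond: assume it crashed and save the entry.
            self.spool.append(entry)

    def host_rebooted(self):
        """Called when the host announces a reboot: replay saved entries."""
        pending, self.spool = self.spool, []
        for entry in pending:
            if not self.notify_host(entry):
                self.spool.append(entry)
```

The point of the design is that local logging and delivery to SPU processes never wait on the host, while entries the host missed are kept until it can accept them.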
Snapshots

In addition to its error logging facilities, the SPU operating system provides the ability to take "snapshots" of the system processor state. The snapshot file provides a detailed record of system context, which allows engineers to take a snapshot of a hung system and reboot it, and then analyze the snapshot file while the system proceeds to perform other useful work. The snapshot display utility is used to examine the data in a snapshot file. In addition to formatting the data in the snapshot file, the snapshot display utility can be used to examine any scan latch in the file, by name, in the same fashion as the console EXAMINE command is used on the actual hardware. The data available in a snapshot file is summarized in Table 2.

Table 2  Snapshot File Contents

Revision Section
    All multichip unit revisions
    All SPU adapter revisions
    Microcode revisions
    All XMI adapter revisions
    All VAXBI adapter revisions

Power Section
    All power control system registers
    "Sense power" results

Clock Section
    All master clock module registers

SPU Section
    All SPU-to-system control unit adapter registers

I/O Section
    XMI device error registers
    VAXBI device error registers
    XMI-to-system control unit error registers

System Control Unit Section
    All scan latches
    Last 50 entries from system control unit microprogram counter history buffer
    All cache tags
    All other logical structures (e.g., control stores)
    Configuration database version
    I/O physical address memory map
    Memory physical address memory map
    Nonexistent physical address memory map

CPU Section (Repeated Once for Each CPU)
    All scan latches
    Last 50 entries from program counter history buffer
    All cache tags
    All general-purpose registers
    All internal processor registers
    All other logical structures (e.g., control stores)
    Top 50 longwords of current mode stack
    Top 50 longwords of interrupt stack
    32 bytes of instruction stream around each program counter in history buffer
    Configuration database version
    50 microprogram counters, collected by stepping the clocks

Error Recovery

The high level of visibility achieved by the scan system allows the SPU to provide extensive error recovery facilities for the VAX 9000 processor. SPU-based recovery offers several advantages over traditional microcode-based error handling. The CPU hardware resources that might otherwise be used for error handling were available for the logic designers to improve the system performance. Because the error data is processed external to the failing component, the recovery process itself is not suspect. Finally, because the system clocks are stopped while recovery takes place, erroneous data does not propagate throughout the system.

Traditionally, many microwords in the CPU control store (approximately 500 in the VAX 8600 system) are used for error recovery microcode. However, because the SPU is responsible for VAX 9000 error recovery, additional control store space is available for instruction microcode. If this had not been the case, we might have had to make a space trade-off between instruction and recovery microcode, which could have resulted in more emulated instructions and a performance penalty for VAX instruction execution speed.

Because the scan system allows the SPU to determine the state of every scan latch in the CPUs and system control unit, logic designers were able to place error detectors anywhere in the design without organizing the detectors into microcode-readable error registers.
As a result, significantly more error detectors were used for precise error analysis than would have been possible if the scan system were not available. Each VAX 9000 CPU contains over 450 error detector latches.

Several advantages are derived from performing error recovery independently from a failed component. The most obvious advantage is that hardware, which may be failing, is not used to control the recovery. Once the system processor state has been scanned out into SPU memory, analysis is a function of software running on a known good processor. The SPU analyzes the data and then scans a correct state into the system processor. The entire process is performed while the system clocks are stopped. Therefore, processor errors cannot cause "error loops," that is, situations in which the error recovery process itself gets errors from a corrupt processor state. SPU-based error recovery can completely reset a corrupt system, regardless of the degree of corruption.

The VAX 9000 error-handling facility takes advantage of many advanced software features that are available in the SPU operating system. It uses configuration database information to access system processor signals by name rather than by scan ring locations. Thus, one version of the error handling code can handle several different physical processor variations. The error handler also uses the SPU operating system structure access routines to read and write the processor structures, again, by burying the physical implementation in the configuration database. As a result, the error handler can look at the architectural features of the VAX processor rather than at the gate-level design of the VAX 9000 system when performing error analysis. The benefit of this approach is that recovery procedures are based on the system architecture, rather than on the machine implementation.
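To illustrate why name-based access lets one version of the error handling code serve several processor variations, the following sketch resolves a signal name to a scan ring location through a per-revision table. Everything here is hypothetical: the signal name, the revision keys, the ring numbers, and the table layout are invented for the example and do not describe the actual configuration database format.

```python
from typing import NamedTuple


class LatchLocation(NamedTuple):
    ring: int      # which scan ring holds the latch
    offset: int    # bit position within that ring


# Hypothetical configuration databases, one per CPU revision.
CONFIG_DB = {
    "rev_a": {"EBOX.PARITY_ERR": LatchLocation(ring=3, offset=17)},
    "rev_b": {"EBOX.PARITY_ERR": LatchLocation(ring=4, offset=2)},
}


def examine(scanned_rings, revision, signal):
    """Return the value of a named signal from scanned-out ring data.

    scanned_rings maps a ring number to the list of bits scanned out of it.
    The caller never sees the physical location, only the signal name.
    """
    loc = CONFIG_DB[revision][signal]
    return scanned_rings[loc.ring][loc.offset]
```

Code written against `examine` stays the same when a latch moves between revisions; only the table entry changes.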
One of our design goals for the VAX 9000 error handling system was to recover from most errors in under 500 milliseconds. Longer delays increase the probability that I/O devices will time out while waiting for the operating system to respond to requests and cause the operating system to crash, even if the error-handling system successfully recovers from the error. The error handler meets this goal by taking maximum advantage of the multiprocessing capabilities of the tightly coupled hardware design of the service processor module and scan control module. Error recovery is split into a multistep process that keeps both SPU processors working on the problem simultaneously.

The error handler recovers a failed system in five phases: data collection, data analysis, error recovery, macrostep, and cleanup. In the data collection phase, the scan control module scans out all scan rings of the failed CPU or system control unit. In the analysis phase, the scanned data is used to determine which architectural features of the system have been corrupted (e.g., caches, general-purpose registers, internal processor registers, microcode stores, and the translation buffer).

In the recovery phase, the error handler attempts to restore the system to a state in which no software-visible data is corrupt. Therefore, the software running on the VAX 9000 system, including the operating system, is unaware that an error has occurred. The error handler determines whether the system state can be restored successfully or if a machine check must be generated to allow the VAX 9000 operating system to attempt to handle the error on a higher level. It then restores the CPU to a known good operating state, by using latch data from the configuration database, and corrects any corrupted software-visible data.

In the macrostep phase, the error handler turns on the system clocks to allow the failed CPU to attempt to macrostep one instruction.
If the macrostep completes successfully, the recovery is considered successful and system operation is allowed to continue. In the cleanup phase, the SPU processes the data from the data collection phase into an error log entry, posts the entry, and cleans up the data structures that will be used to recover from the next error.

Errors that are too severe for the error handler to handle are signaled to the SPU command interpreter, which can run command scripts to completely reinitialize the machine and reboot the VAX 9000 operating system. Examples of such severe errors are hard errors that prevent VAX 9000 operating system machine check code from running and errors that cause a CPU to fail its macrostep.

Summary

The SPU is a dedicated subsystem for service and maintenance support for the VAX 9000 family. It is closely linked to the VAX 9000 processor to provide system error recovery. It also presents a high-level interface with which debuggers may observe and control system processor activity. Through the use of a system-wide scan architecture, the SPU provides access to nearly 100 percent of processor machine state. Finally, the use of the SPU in various tester environments greatly assisted the multichip unit debugging effort and provided advanced training for VAX 9000 system debuggers.

Acknowledgments

The authors wish to thank Michael Evans, the SPU project leader, whose drive and ambition provided the force behind the project's success. We also wish to acknowledge the other members of the SPU design team: Karen Barnard, Stephen Conway, David D'Antonio, Susan DesMarais, and Brian Rost.

Reference

1. D. Chin et al., "The Unique Features of the VAX 9000 Power System Design," Digital Technical Journal, vol. 2, no. 4 (Fall 1990, this issue): 102-117.

The Unique Features of the VAX 9000 Power System Design

Derrick J. Chin
Barry G. Brown
Charles F. Butala
Luke L. Chang
Steven J. Chenetz
Gerald E. Cotter
Brian T. Lynch
Thiagarajan Natarajan
Leonard J. Salafia

The VAX 9000 series represents Digital's first implementation of a mainframe computer system. To be competitive in this market, the power system for the VAX 9000 series had to provide high system availability. To meet this goal, the system includes features neither considered nor found in previous large Digital computer systems. Some of these features are the use of redundancy in parts of the design and the addition of more power system diagnosis capability for quicker fault isolation and faulty unit replacement. Other features provide competitive advantages in specific marketplaces, such as meeting low harmonic distortion for AC input current, which is an emerging European AC power quality standard. Simulation tools, which are used more prevalently in digital logic, were used to improve the power design.

The two key requirements of the VAX 9000 power system are high availability and the inclusion of competitive features. High availability for the power system means we had to achieve the highest unit regulator reliability possible by using the appropriate technology available. Further, we had to deliver both more power system and cabinet environmental monitoring and diagnostic capability that could reduce the time spent in isolating and replacing a malfunctioning unit. Competitive features mean designing into the system features that would be either better than expected or advantageous to the VAX 9000 system in certain markets.

A full discussion of all the methods used to meet these requirements is too long for this paper.
Therefore, the discussion in this paper focuses on some of the unique applications of the power technology and tools used in the design of the VAX 9000 system:

• Power system architecture
• Improved load sharing
• Simulation
• Increased control and monitoring
• Low harmonic distortion

One of the issues we had to decide in designing the power system architecture was how many regulators should be used. A large number of regulators in a power system can cause the mean time between failures (MTBF) to be lower than desired. Therefore, we chose to use redundant regulators in the power system architecture for improved availability.

Another means of increasing the MTBF was achieved by improving the load sharing among the parallel regulators that power a low-voltage current load. With this feature, no one regulator operates at a percentage of maximum rating much higher than its parallel regulators, which eliminates the higher operating temperatures that would otherwise occur and lower the MTBF.

High regulator reliability results from good circuit design. Three examples of the unique simulation features that were used as checks on circuit designs are discussed in the Simulation section of this paper. In one case, simulation pointed the way to a circuit problem that was not initially apparent. In another case, simulation was used to verify on paper that the number of regulators chosen to power a specific load was sufficient.

High availability can be achieved by reducing the time to isolate a system problem and replace the malfunctioning unit. A power and cabinet monitoring module, EMM, fulfilled this purpose in the VAX 8000 systems. The power control subsystem, PCS, used for this purpose in the VAX 9000 systems, expands on the diagnostic and monitoring features of the EMM.
Meeting emerging European AC power quality standards was viewed by the European sales force as a distinct competitive advantage for the VAX 9000 system. A proposed standard we wanted to meet was to achieve low harmonic distortion of the input AC current waveform, which was met in the utility power conditioner (UPC) front-end design of the power system. High availability was designed into the UPC through such features as redundancy and increased immunity to power line disturbances from a commonly accepted industry practice of one AC cycle to teo AC cycles.

VAX 9000 Power System Architecture

The discussion of the power system architecture will focus on some of the architecture's major features: power zoning, N + 1 redundancy, and decoupling.

• Power zoning enables parts of the system to be powered off for maintenance while the rest of the system remains operational.

• N + 1 redundancy provides higher perceived system availability to counteract the impact of low system mean time between failures, which is a result of the large number of regulators.

• Decoupling major sections of the power system allows future upgrades to be made without requiring significant changes to the rest of the system.

The basic power system architecture for the VAX 9000 Model 200 and Model 400 series is shown in Figures 1 and 2, respectively. Power processing in each model occurs in two distinct stages. First, an AC front end processes and converts AC utility input power to high-voltage DC, which is then bused about the power system. Second, DC-to-DC switching regulators convert the high-voltage DC to low-voltage outputs, which are then distributed through high-current-carrying busbars to the various logic loads. An intelligent power control subsystem (PCS) provides control, sequencing, monitoring, and diagnostic capabilities. Dedicated bias regulators, which are powered from the high-voltage DC, provide housekeeping control (i.e., low power) and start-up power to each bank of output regulators.

The high-voltage DC bus permits low-voltage output regulators to be added or removed for different system configurations. The high-voltage DC bus also can be backed up with a battery unit that produces high-voltage DC from 48-volt batteries through a step-up switching regulator. This approach allows any specific low-voltage output to be produced, as needed, during the battery backup period without using specific battery-to-logic voltage output DC-to-DC regulators. The battery required to back up the entire computer system would be larger than the computer itself. Therefore, diodes are inserted into the high-voltage DC distribution to partition the high-voltage DC bus, and only sections, such as the memory refresh operation and PCS control, are backed up.

[Figure 1: VAX 9000 Model 200 Series Power System. Recoverable labels: PCS (power control subsystem), environmental monitors, utility power 120/208 VAC 3 phase.]

[Figure 2: VAX 9000 Model 400 Series Power System. Recoverable labels: PCS (power control subsystem).]

Power Zoning

The power-zoning feature meets the maintainability and high availability goals in the VAX 9000 Model 400 series of triple and quadruple processors. In the power system's configuration, one dual-processor pair can be powered off for maintenance, while the remaining powered-on processors maintain system operation.

A quadruple processor configuration is not composed of two identical dual processors. Some functions of a quadruple processor are not replicated. The system control unit, the memory, the service processor unit, and the PCS are common to both dual processors. Therefore, these functions are powered up by either front end.
The high-voltage DC power bus is diode OR'd from either AC power source, through the dual diode, CR1, and then fed to the output stages that power the common elements listed above.

The diode-OR process in the VAX 9000 system does not provide for active load sharing. Active load sharing between the AC front ends would increase the overall actual power system reliability because it would ensure that each AC front end supplies half the load. Otherwise, one AC front end could take most of the load (and be stressed higher), which would leave the other unit too lightly loaded. However, active load sharing is complicated by the physical distances between the AC front ends and the complex handling of faults and partial faults in each AC front end. The load of the common elements in the VAX 9000 system is only 20 percent of the total system. Therefore, the worst load imbalance does not justify the added complexity.

The diode does not have a significant impact on overall power load reliability because conservative derating of the diode results in a lower diode operating temperature and hence higher reliability.

We were concerned that power zoning could have an impact on the rest of the system as a result of powering down part of the system. However, analysis of the results showed that such a concern was unfounded. The high-voltage DC bus has relatively long time constants (i.e., slow to react to changes). Therefore, turn-on and turn-off transients on the bus are smooth and gradual and do not generate quick-changing electromagnetic fields that could affect the operation of the sections of the system that are still functioning.

N + 1 Redundancy

Each processor in the VAX 9000 power system uses approximately 400 amperes from each of the two supply voltages. The ratings of the power semiconductors used in the outputs of the DC-to-DC regulators deliver an optimal regulator rating of approximately 240 amperes.
Based on these ratings, powering a CPU in the VAX 9000 system would require two regulators for each voltage. However, in a large system, such as the VAX 9000 system, the number of regulators can quickly add up, which would result in an equally quick drop in overall system reliability. Powering two CPUs from the same voltage bus reduces the number of regulators.

Redundancy is then used to minimize the impact of the large number of regulators in the bus. By using redundancy, additional regulators on a voltage bus increase the perceived time between complete failures. For example, consider a voltage bus that requires two regulators to supply the load current. A failure in either regulator causes a complete failure. If another parallel regulator is added to supply the load current, the probability of a complete failure significantly decreases. In this case, if one regulator fails, the other two could supply the load. The statistical probability that another failure would occur before the failed regulator is replaced is very small.

A system of N regulators at an individual failure rate of lambda (λ) would have a system failure rate of N times λ, or an MTBF of 1 divided by N times λ.¹ The actual calculations are

$$\lambda_{\text{total}} = N \times \lambda$$

or

$$\text{MTBF} = \frac{1}{\lambda_{\text{total}}} = \frac{1}{N \times \lambda}$$

The failure rate calculation for a system that contains one regulator more than required (N + 1) is

$$\lambda_{\text{total observed}} = \frac{(N + 1) \times N \times \lambda \times \lambda}{[(N + 1) \times \lambda] + (N \times \lambda) + \mu}$$

$$\text{MTBF}_{\text{observed}} = \frac{[(N + 1) \times \lambda] + (N \times \lambda) + \mu}{(N + 1) \times N \times \lambda \times \lambda}$$

It should be noted for the above equation that μ equals 1 divided by the time between fault and repair (service interval).

Using this calculation, if a bus required 4 regulators and each regulator had an MTBF of 400,000 hours, the observed MTBF would be 100,000 hours. The observed MTBF with five regulators (i.e., N + 1) would be 23,989,000 hours, which is 239 times longer than the four regulator case. The maximum time between the fault occurrence and repair would be 2 weeks, or 336 hours. The observed MTBF is so large, compared to other elements in the system, that the redundant regulators have an extremely small effect on the overall reliability.

The number of redundant regulators per output voltage bus is limited to one in the VAX 9000 power system for space, weight, and cost reasons. N is the number of regulators required to supply the maximum current of a bus, and the addition of one more regulator is called N + 1 redundancy.

N + 1 redundancy relies on the good regulators on the output bus to pick up the load from the failed unit. This reliance has a significant impact on the design of the regulator, the regulator response time, and how the regulator handles the faults that can cause a failure. Fast regulator response (the time it takes to respond to a change in input or output) is needed to ensure that the output voltage does not dip too much when each regulator picks up its share of the load from the failed regulator. However, the faster response time makes it more difficult to keep the control functions of the unit stable. Moreover, the regulator input voltage range is designed to be relatively wide to tolerate wide swings in the high-voltage DC input.

When one regulator in a bank of regulators operated in parallel fails, the output bus voltage dips until the other regulators, which are connected in parallel, can react and pick up the load currents. The magnitude of the dip depends on the time the input fuses in each regulator take to open and on the values of the input capacitors and the distribution impedances. Fast-opening fuses allow smaller voltage dips but are more prone to false nuisance openings. Slow-opening fuses do not open for normal or nuisance surges, but allow a greater voltage dip. Large values of input capacitance provide the energy to open the fuses quickly, but the voltage recharging of the capacitors is longer. A high distribution impedance decouples the faults from other units but has a high power loss.

Simulation and testing showed that the wide input range design of the regulators is sufficient to tolerate the high-voltage input dips caused by other faults. The regulator control and response time keep the low-voltage outputs within specification when the input voltage is within its range.

Other faults within the regulator can cause it to fail, but the load is picked up by the other regulators, operating in parallel, on the bus. Clearly, faults such as a permanent short on the output bus cannot be survived. Because the low-voltage output regulators operate in parallel and in an N + 1 redundancy mode, the output voltage is not affected by most common single-fault conditions in the power system hardware.

Decoupling

A key feature of the power system's architecture is that each major subsystem is relatively decoupled from the other subsystems. Decoupling permits each subsystem to be designed for its own requirements and to be changed or upgraded as the requirements change (e.g., more cost effective, improved technology, or different output voltage), provided the interface and critical function remain the same. For example, two significantly different cost and performance options, H7392 or H7390, for the AC front end can be used in different configurations, and the rest of the power system does not need to be changed.
Thus, power platforms can be flexibly tailored to meet the needs of different computer systems.

Achieving Low Harmonic Distortion

The AC front end of the VAX 9000 power system processes and converts public utility AC power to high-voltage DC. Our goal was to design the AC front end to be highly reliable, have a high availability, and meet the emerging European AC power quality standards. One of those standards is to have low harmonic distortion of the input AC current waveform. These features were essential to support the VAX 9000 system's entry into the mainframe computer market. We also decided to meet the low harmonic distortion standard of the AC front end because the European marketing and sales force viewed compliance with this standard as a distinct competitive advantage.

Design Factors

The dominating design factor for the AC front end was the size of the input power level, which was approximately 20,000 watts. This size significantly exceeded the power levels of previous AC circuit designs for a single unit. The high power consumption was a result of the use of 250,000 emitter-coupled logic (ECL) gates in the CPU and 512 megabytes (MB) of memory.

High Reliability and Availability

To achieve high reliability, we used conservative power derating levels and good thermal management for key devices. Typically, the device voltage ratings used are 80 percent of rating. The main switches and rectifiers used in the power stages used 40 percent of rating. Current derating is also conservatively placed at 40 percent. Stress is lessened because of lower device junction temperatures, which results in a longer operational life, which equates to higher reliability.

We designed two approaches to attain high availability. First, redundant circuitry was used for the AC-to-DC circuit function. Second, we increased immunity-to-line outage from the standard practice of one cycle of outage protection to ten cycles.
The increase from one cycle to ten cycles of outage immunity provides the VAX 9000 system with a 300 percent improvement in mean time between observed system power outages over standard Digital systems. This feature improves system availability to the customer.

Harmonic Distortion

The power system's design had to meet the increasing restrictions on the interface with the public power utility and be able to withstand the occasional availability of only poor power. Utility power is generated as a relatively pure (i.e., low harmonic distortion) sine wave. AC front ends and power supplies must convert this sine wave of voltage to a ripple-free DC voltage for ultimate consumption by the logic chips within the computer system. Standard methods used for this conversion create a nonlinear load on the sine wave of voltage. This nonlinear load distorts the utility's sine wave of voltage for other users, because of the distribution system impedance, and usually appears as interference for other users. In Europe, the occurrence of this type of interference is planned to be limited by restricting how much nonlinear load current an AC front end can have. Therefore, we had to design a unique circuitry that could convert AC power to DC power at 20,000 watts without high levels of current distortion to meet this European requirement.

A design based on commercially available control technology could not meet the stringent technical requirements of high overall conversion efficiency and stability of operation because conventional AC-to-DC circuitry produces up to 30 percent distortion. Our goal was to comply with emerging European requirements of harmonic current distortion levels in the 5 percent range. However, at the time we were designing the system, no circuitry at this power level existed in the power conversion industry.
Therefore, we had to develop a unique pulse-width modulator (PWM) circuit and control equations for the input power conversion stage, which is shown in Figure 3.

The pulse-width modulator combines the advantages of low switching frequency, which reduces switching losses in the converter, with exceptionally short response time to all input line voltage disturbances and to rapid changes in the required computer power. The final design produces less than 5 percent total harmonic distortion of the input line current when the UPC is operated at 20,000 watts load. The uniqueness of the PWM increased the immunity-to-line voltage outages from one cycle of outage protection to ten cycles. Furthermore, the increase was achieved without a corresponding tenfold increase in storage capacitors.

[Figure 3: UPC Block Diagram]

Flexible Line Cord

The high power level and the requirements for a flexible line cord and plug required that the Underwriters Laboratory (UL) and Canadian Standards Association (CSA) agencies expand the regulations that governed the size of power cordage allowed in a computer room. A flexible line cord connected to the AC service is a requirement by Digital for all its products. This feature is deemed valuable because it is used both to facilitate the initial installation of the computer and possible relocation at the customer's site. Although delays can occur while waiting for a national agency to amend one of its national regulatory codes, the approvals were received in time to maintain the project's schedule.
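To make the distortion figures quoted in this section concrete: total harmonic distortion (THD) of a current waveform is commonly taken as the root-sum-square of the harmonic amplitudes divided by the amplitude of the fundamental. The Python sketch below is illustrative only. The waveforms are synthetic, the harmonic amplitudes are hypothetical values chosen to land near the 5 percent and 30 percent levels mentioned in the text, and none of the function names come from the VAX 9000 design.

```python
import math

def harmonic_amplitude(samples, k):
    """Amplitude of the k-th harmonic, from one full period of samples (direct DFT bin)."""
    n = len(samples)
    re = sum(s * math.cos(2 * math.pi * k * i / n) for i, s in enumerate(samples))
    im = sum(s * math.sin(2 * math.pi * k * i / n) for i, s in enumerate(samples))
    return 2.0 * math.hypot(re, im) / n

def total_harmonic_distortion(samples, max_harmonic=40):
    """Root-sum-square of harmonics 2..max_harmonic, relative to the fundamental."""
    fundamental = harmonic_amplitude(samples, 1)
    harmonics = [harmonic_amplitude(samples, k) for k in range(2, max_harmonic + 1)]
    return math.sqrt(sum(a * a for a in harmonics)) / fundamental

def synthesize(n, components):
    """One period of a waveform built from {harmonic number: amplitude} sine terms."""
    return [
        sum(a * math.sin(2 * math.pi * k * i / n) for k, a in components.items())
        for i in range(n)
    ]

# Hypothetical line-current spectra (per-unit amplitudes), not measured data.
low_distortion = synthesize(1000, {1: 1.0, 3: 0.03, 5: 0.04})
high_distortion = synthesize(1000, {1: 1.0, 3: 0.25, 5: 0.15, 7: 0.07})

print(f"low-distortion case:  {100 * total_harmonic_distortion(low_distortion):.1f} percent")
print(f"high-distortion case: {100 * total_harmonic_distortion(high_distortion):.1f} percent")
```

The first spectrum gives a THD close to 5 percent and the second close to 30 percent, which shows how much smaller the low-order harmonics must be to reach the lower level.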
Improving Load Sharing

Detailed stress analyses show that when regulators are operated in parallel, maximum reliability is achieved when the load current is shared equally among them.

Traditional Approach

A traditional approach to running regulators in parallel may be seen in VAX 8000 series machines. In these processors, regulators that are designed for standalone operation are placed in a parallel configuration. Current sharing is forced by modifying each supply's individual reference voltage through external monitoring and control. In the case of VAX 8000 machines, a maximum of four units may be coupled in this way. Figure 4 shows that this method essentially uses equipment that was designed to function as standalone regulated voltage sources. By adding external control loops, the equipment is forced to provide identical output voltages, as measured at some defined point in the system. If precise voltage matching is not achieved, whichever supply had the higher voltage consumes the load, up to its overcurrent sense point. Thus, equal load sharing cannot happen. Individual external controllers are required for each converter, which makes the system more complex.

The VAX 9000 system requires up to five converters per bus, and we could not achieve better than 20 percent power sharing between modules by using this method. No traditional methods could support the number of converters in the VAX 9000 system. Also, most methods had a master-slave relationship that precluded maximizing a regulator's reliability potential.

New Approach

As a result of the limitations of the traditional methods, we developed a new, less complex approach to current sharing between parallel converters. Although developed specifically for the VAX 9000 program, the features and utility of this approach have universal application.
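The sensitivity of the traditional voltage-matching method can be illustrated with a simple numerical sketch. The Python model below is an assumption made for illustration, not the VAX 8000 circuit: each regulator is treated as an ideal voltage source in series with a small effective output resistance, the load is a constant current, current limiting is ignored, and all values are hypothetical.

```python
def share_currents(setpoints_v, output_res_ohm, load_a):
    """Bus voltage and per-regulator currents for voltage sources feeding one bus.

    Each regulator is modeled as an ideal source (setpoint) behind a series
    output resistance; the load draws a constant current from the common bus.
    """
    conductances = [1.0 / r for r in output_res_ohm]
    # Kirchhoff's current law at the bus: sum((v_i - v_bus) * g_i) = load current.
    bus_v = (sum(v * g for v, g in zip(setpoints_v, conductances)) - load_a) / sum(conductances)
    currents = [(v - bus_v) * g for v, g in zip(setpoints_v, conductances)]
    return bus_v, currents

# Hypothetical case: two regulators whose setpoints differ by 10 millivolts,
# each with 0.1 milliohm effective output resistance, sharing a 200-ampere load.
bus_v, currents = share_currents([5.000, 5.010], [0.0001, 0.0001], 200.0)
for i, amps in enumerate(currents, start=1):
    print(f"regulator {i}: {amps:.1f} A")
```

Under these assumptions a setpoint mismatch of 0.2 percent is enough to shift the split from 100/100 amperes to roughly 50/150 amperes, and a stiffer (lower resistance) source makes the imbalance worse. This is the behavior the text describes when precise voltage matching is not achieved.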
The essential technological shift from prior practice is that in this system the regulators are current sources rather than voltage sources.

[Figure 4: Load Sharing by Voltage Control of Voltage Sources]

We designed the current sources to have a compliance range that covers a band of voltages that are normally found in logic circuits. By making the VAX 9000 regulator outputs fully floating, the system requirements for +5-volt, -3.4-volt, and -5.2-volt buses are met with only one regulator design, rather than a separate design for each voltage. The VAX 9000 design is simpler and has a lower manufacturing cost. The regulator is voltage and polarity "blind" over its compliance range, and any number of regulators may operate in parallel to provide any amount of power required at any voltage within the compliance range. Also, this method automatically compensates for the effects of stray resistances and different path lengths from individual regulators on a bus.

The basic features of this new approach are shown in Figure 5. Individual regulators behave as externally programmed current sources controlled by a common control signal, such that each regulator delivers the same current. If the outputs are connected to a common load, the current in that load is the sum of the individual regulator output currents. The resulting voltage that appears across the load is the product of that current and the equivalent resistance of the load. Furthermore, if that voltage is compared with a reference voltage in a conventional error amplifier and the resulting error signal is used to derive the regulators' external programming source, then a voltage control loop exists around the regulator system. Thus, although each regulator acts as a current source, the system acts as a controlled and regulated voltage source. Because the voltage control loop only contains one pole, the bandwidth of the control loop can be increased by up to a factor of at least 15. As a result, the substantially high current change requirements imposed by high-speed memories, such as those used in the VAX 9000 system, can be accommodated.

[Figure 5: Load Sharing by Current Control of Current Sources. The figure is annotated V = (I1 + I2 + I3) x Z(load).]

Principle of Operation

A two-transistor forward regulator is shown in Figure 6. In this regulator, S1 and S2 are switched into conduction simultaneously, which causes the current to flow in the primary winding of transformer T1 at a level that is directly proportional to the output current Iout plus the slope of the current due to Lout. This current also flows in the primary winding of current sense transformer T2. The resulting current that flows in the T2 secondary winding develops a voltage across the load resistor, RL, which is amplified in A1 and applied to the input of comparator C1. Therefore, at this point, a voltage pulse appears, the amplitude and shape of which are directly proportional to the current flowing in the output choke Lout during the S1-to-S2 conduction period.

A conventional reference source/error amplifier combination is placed across the output of the supply. The resulting error signal, called Vcontrol, is applied to the other input of comparator C1 as a DC level. The comparator is followed by gating and drive circuits to the power switches.

Switching is initiated by a pulse within the gating circuit that drives the power switches on. The current flows in the output choke, Lout, and a proportional voltage appears at the output of the amplifier A1. As this voltage ramps, it crosses the threshold set by Vcontrol at the C1 input. The comparator output then changes state and causes the drive pulse to the switches to cease.

If Vcontrol were a fixed value, the system would be a constant current source. Therefore, the voltage that would appear at its output would be the result of that constant current and whatever load is placed across those terminals (i.e., Vout would be determined by the load value). By using an error amplifier and reference, Vcontrol can be made a variable quantity. Therefore, the regulator transfer function can control its output current to any level necessary to produce the desired voltage.

[Figure 6: Two-transistor Forward Regulator]

In such a system, a control voltage, which is derived from a single error amplifier and reference, can be used as the control input for several regulators that are running in parallel. Thus, the current from multiple regulators that feed a common bus can be shared.

Increased Control and Monitoring

In the VAX 8000 series, power and environmental monitoring and control is provided by the H7188 environmental monitoring module (EMM). In the VAX 9000 system, these functions are provided by the power control system (PCS).

Basic Design of EMM and PCS

The EMM monitors the DC-to-DC regulator control, air flow sensor, and cabinet temperature. It is also the interface between the system console and the power system.
Conceptually, the EMM functions as a peripheral device to the console similar to the way an intelligent disk controller is a peripheral to a CPU. The EMM is a single module that plugs into a power back panel.

The PCS is a distributed data acquisition and control system. It also interfaces between the power and environmental systems and other parts of the computer system. The PCS takes commands from, and reports status changes to, the service processor unit. However, in the PCS, the conceptual model of the EMM is extended to provide additional support in hardware and firmware to off-load the service processor unit and to simplify the software interface to the PCS. The PCS includes many features that enhance testability, fault coverage, fault isolation, and system availability. The relationship of the PCS modules to one another and to other system components is illustrated in Figure 7. There are five PCS modules:

• Power and environmental monitor (PEM)
• CPU regulator intelligence card (CPURIC)
• I/O regulator intelligence card (IORIC)
• Signal interface panel (SIP)
• Operator control panel (OCP)

[Figure 8: H7380 Output Switching Stage]

The initial model of the H7380 inverter stage used simple component models and did not consider any printed circuit board inductances or transistor capacitances because they seemed negligible compared to other elements. We noted a discrepancy in the voltage across the transistor Q1 (Vds) during the turn-off process between the simulated waveform, shown in Figure 9, and the measured waveform, shown in Figure 10. Figure 9 shows that the voltage is initially zero while the transistor is conducting but rises to 200 volts when the transistor is turned off. Figure 10 shows that ringing occurs as the voltage approaches 200 volts, with an overshoot to 240 volts. The ringing and overshoot, not shown in Figure 9, are caused by the circuit board inductance, transformer leakage inductance, and the capacitance of the transistor.

[Figure 9: Vds (Q1) Simulated Turnoff without Parasitics]

[Figure 10: Vds (Q1) Measured Turnoff (time axis: 200 nanoseconds per division)]

Figure 11 shows a more accurate model of the output stage because the L1 through L4 etch inductances and C1 and C2 transistor capacitances are included. The current source, IPULSE, and the resistor, RT, approximate the transformer. Figure 12 shows the result of the simulation model that includes the L and C values shown in Figure 11.

[Figure 11: Final Model of H7380 Output Switching Stage (simulation model of output circuit with parasitics)]

When the simulation and the measured data are correlated, the advantage of accurate simulation becomes apparent. By using worst-case values for the circuit parameters, the simulation can determine the maximum peak voltage. The model depicted in Figure 12 shows that a device capable of withstanding the expected 240 volts is needed. Reliance on a less accurate model without parasitics could lead to the selection of a device capable of withstanding only 200 volts. Thus, accurate simulation allows the correct components and component ratings to be chosen and ensures a robust design.

Transient Analysis

A memory system that includes dynamic random access memory (RAM) chips presents a difficult transient load problem to its power supply. The problem arises from a combination of very high changes in dynamic RAM supply current and current change rise times that are typically more than a thousand times faster than the reaction time of a power system. The result is a temporary change in the load supply voltage. To handle these fast current edges, high-frequency capacitors are mounted on memory boards near the dynamic RAMs. Also, low-frequency, electrolytic capacitors, which provide a source of local charge storage, are mounted on the memory boards to handle the magnitude of the change. The capacitors help keep the supply voltage within its operating range until the power supply can react and sufficiently change the current it supplies to the memory to stabilize the supply voltage. An adequate supply design with specified capacitors can keep the supply voltage within its operating tolerance. Simulation is used to determine the correct mix of high- and low-frequency capacitors and the number of regulators required to support this high transient load.

Another power supply problem arises from the use of N + 1 redundancy for parallel regulators. When one of the regulators in a parallel regulator configuration fails, the remaining regulators must be able to take on the load from the failed regulator and keep the supply voltage within operating tolerance. Because the remaining regulators cannot react instantaneously, the load voltage drops until a sufficient increase in current can be provided by the remaining regulators.
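The failure transient is accepted because of the reliability gain that N + 1 redundancy provides, as quantified in the N + 1 Redundancy section of this paper. As a rough check of the figures quoted there (four required regulators, 400,000 hours MTBF per regulator, a 336-hour repair interval), the following Python sketch evaluates the two MTBF expressions given in that section. The function and variable names are ours, chosen for illustration.

```python
def mtbf_without_redundancy(n_required, unit_mtbf_hours):
    """MTBF of a bus that fails when any one of n_required regulators fails."""
    failure_rate = 1.0 / unit_mtbf_hours           # lambda, per hour
    return 1.0 / (n_required * failure_rate)

def mtbf_n_plus_1(n_required, unit_mtbf_hours, repair_hours):
    """Observed MTBF of a bus with one regulator more than required.

    Uses the expression from the N + 1 Redundancy section:
    ((N + 1) * lambda + N * lambda + mu) / ((N + 1) * N * lambda**2),
    where mu is 1 divided by the time between fault and repair.
    """
    failure_rate = 1.0 / unit_mtbf_hours           # lambda, per hour
    repair_rate = 1.0 / repair_hours               # mu, per hour
    n = n_required
    numerator = (n + 1) * failure_rate + n * failure_rate + repair_rate
    denominator = (n + 1) * n * failure_rate * failure_rate
    return numerator / denominator

baseline = mtbf_without_redundancy(4, 400_000)
redundant = mtbf_n_plus_1(4, 400_000, 336)

print(f"four regulators:       {baseline:,.0f} hours")
print(f"five regulators (N+1): {redundant:,.0f} hours")
print(f"improvement factor:    {redundant / baseline:.1f}")
```

With these inputs the sketch reproduces the 100,000-hour figure for four regulators and a value just under 24 million hours for five, consistent with the roughly 239-fold improvement stated in the text.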
[Figure 12: Vds (Q1) Simulated Turnoff with Parasitics]

For the VAX 9000 series memory system, a proposed dynamic RAM power supply design consisted of three H7380 DC-to-DC regulators, which would operate in parallel (including N + 1 redundancy) and be connected to the memory through power distribution busbars. The numbers of high- and low-frequency capacitors were also proposed. The power supply was expected to be ready for load testing before the memory or the busbars would be available. Therefore, we had to verify that this design could keep the memory supply voltage within operating tolerance. We verified the design by simulating the performance of the power system and measuring the performance of the actual power supply with a simulated load.

Power Supply Operating Voltage Tolerance

The memory designers specified the operating tolerance of the dynamic RAM supply as +5 volts, ±10 percent. Using 10 percent as the supply tolerance budget, the supply designer made the allocations shown in Table 2 to all the factors that would cause the load voltage to deviate from its nominal value of +5 volts. As can be seen from this table, the sum of x and y must be less than 350 millivolts or 7 percent of +5 volts.

Memory Load

The dynamic RAM supply current was calculated to be a steady-state pulsed current of 256 amperes that would last for 92 nanoseconds (ns) and with rise and fall times of 20 ns, as shown in Figure 13. The initial pulse magnitude was 1024 amperes.
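The budget arithmetic described under Power Supply Operating Voltage Tolerance can be written out as a short Python sketch. The fixed allocations (100 millivolts for regulator tolerance, 50 millivolts for back panel distribution) are the Table 2 entries, and the two peak deviations are the values reported in the Simulation and Laboratory Measurements section of this paper. Treating those two peaks as the x and y entries and simply adding them is our reading of the budget, not a calculation stated by the authors.

```python
NOMINAL_MV = 5000.0          # +5 volt dynamic RAM supply, in millivolts
TOLERANCE_PERCENT = 10.0     # operating tolerance specified by the memory designers

# Fixed allocations from Table 2, in millivolts.
fixed_allocations_mv = {
    "regulator tolerance": 100.0,
    "back panel distribution": 50.0,
}

def remaining_budget_mv(nominal_mv, tolerance_percent, fixed_mv):
    """Millivolts left for the transient (x) and regulator-failure (y) deviations."""
    total_budget = nominal_mv * tolerance_percent / 100.0
    return total_budget - sum(fixed_mv.values())

remaining = remaining_budget_mv(NOMINAL_MV, TOLERANCE_PERCENT, fixed_allocations_mv)

# Peak deviations reported later in the paper, in millivolts.
reported_peaks_mv = {
    "transient load with two regulators (x)": 80.0,
    "failure of one regulator (y)": 100.0,
}
margin = remaining - sum(reported_peaks_mv.values())

print(f"budget left for x + y: {remaining:.0f} mV ({100 * remaining / NOMINAL_MV:.0f} percent)")
print(f"margin after reported peaks: {margin:.0f} mV")
```

The remaining budget comes out at 350 millivolts, or 7 percent of +5 volts, which matches the limit stated in the text.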
Table 2 Supply Tolerance Budget Allocation

Causes of Voltage Deviation           Millivolts   Percentage of +5 Volts
Regulator tolerance                   100          2
Back panel distribution               50           1
Transient load with two regulators    x
Failure of one regulator              y
Total deviation budget                500          10

[Figure 13: VAX 9000 Model 400 Series Memory Power System Dynamic RAM Load. Key: A = amperes, NS = nanoseconds.]

Memory Power System SPICE Model

In the SPICE model of the supply, busbar, load and capacitors that is shown in Figure 14, the three regulators are modeled as a current source, Gout, controlled by the regulator feedback voltage, Vf. Cout and Rout represent the regulators' combined output capacitors and resistors. Most of the other elements in the model are determined from component specifications. The relationship between Gout and Vf was determined by laboratory measurements on a regulator and resulted in the following equations.

For two regulators,

Gout = 339 × Vf = 339 × (V8 - 2.5)

For three regulators,

Gout = 678 × Vf = 678 × (V8 - 2.5)

The load is represented as two current sources, IA and IR, the characteristics of which were obtained from the loads shown in Figure 13.

[Figure 14: SPICE Model of VAX 9000 Memory Power System (schematic with element key)]
Simulation and Laboratory Measurements

The two previously stated conditions of interest resulting in large load voltage changes are the transient load with two regulators and the failure of one regulator. For transient loads, a larger voltage change occurs with two regulators rather than with three because two regulators take longer than three to adjust the supply current to the new load value.

Simulated Load

For laboratory measurements, the actual dynamic RAM load, as shown in Figure 13, is difficult to design and build in a reasonable time because of the magnitude and rise time combination. However, a load with a much slower rise time could be easily built. Such a load (I in Figure 14) is expected through the busbar as the capacitors and busbar slowed down the fast edges of the dynamic RAM load. This simulated load was built and connected to two regulators. The predicted waveform and the measured waveform showed that the initial shapes of the peak change, the peak magnitudes (80 millivolts), and the times of occurrence of the peak (300 microseconds) were all similar. However, we could not measure the overshoot and ringing after the peak because the busbar was not available.

Failure of One Regulator

When one of the three regulators fails, the other two regulators cannot meet the increased load instantaneously. As a result, the load voltage drops until the two regulators can increase their output current sufficiently to reverse the direction of the drop. The SPICE model for this condition was run and the load voltage of the drop was predicted. Laboratory measurements were then taken with the simulated load and one regulator was turned off. Both the predicted and measured waveforms had the same shapes, peak magnitudes (100 millivolts), and times of occurrence of the peak (200 microseconds) after the regulator was turned off. Therefore, we concluded that the proposed design could meet the load requirements.

References

1. P. O'Connor, Practical Reliability Engineering, 2d ed. (New York: John Wiley and Sons, 1985).

2. SPICE is a general-purpose circuit simulator program developed by Lawrence Nagel and Ellis Cohen of the Department of Electrical Engineering and Computer Sciences, University of California, Berkeley.

Donald F. Hooper
John C. Eck

Synthesis in the CAD System Used to Design the VAX 9000 System

The design of the VAX 9000 system represents a sixfold increase in complexity over the 8600/8650 system. This increased complexity posed a significant challenge because of the concurrent need to shorten the duration of the project design cycle and convert all high-performance systems computer-aided design (CAD) software from the DECSYSTEM-20 system to the VAX system. As part of the task of meeting these challenges, the CAD Group proposed the implementation of a design methodology that used logic synthesis for the first time in the development of a major product for Digital. The primary objectives of this methodology were to increase the productivity of the logic designers and to reduce the number of errors introduced during conversion of high-level designs into gate-level structural designs.

Methodologies

Previous Methodology

In the previous development methodology, as shown in Figure 1, logic designers specified high-level designs on paper, and simulation engineers transferred this rendition into a behavioral model. Technology engineers developed the gate-level cells. After the cells were defined and characterized for function and timing, the logic designers generated schematic drawings by using graphical bodies that represented the cells. As changes were made to the schematics, the simulation engineers attempted to reflect these in the behavioral model. Finally, a gate-level simulation model was assembled from the completed schematics to verify that the design represented a valid VAX system.

This process was extremely laborious, error-prone, and time-consuming. Therefore, we concluded it could not be used to develop the VAX 9000 system, which is a 700,000 gate design and for which the technology cells would not be defined and characterized until late in the design stage.

[Figure 1: Previous Design Methodology]

Logic Synthesis

Our early research into logic synthesis began in 1982. Over the next two years, we explored new synthesis ideas and constructed prototypes to determine the feasibility of those ideas. For example, one of our early logic minimization efforts was a program that emulated Brown's Laws of Form for transformations of Boolean logic to reduce gate counts and improve critical timing paths.¹ However, this program has had only limited success and is not really usable as a released computer-aided design (CAD) product. For example, the program does not deal with selections of cells for combinational logic nor does it consider the myriad problems involved in assembling a database for a buildable gate array chip.

During 1984 and 1985, new artificial intelligence (AI) and synthesis ideas were being developed. Universities and technical communities were exploring the potential of object-oriented databases, rule-based AI, data flow design entry, and algorithmic minimizations. We began the prototype development of our system for integral design (SID) at approximately the same time as the ideas for the VAX 9000 hardware architecture were beginning to be developed. In 1985, the SID program became an internal CAD product for use in the development of the VAX 9000 system. By combining the most advanced rule-based AI techniques with an object-oriented database, the core SID was designed to be a repository of logic design knowledge. We hoped that, over the years, SID would mature to perform many highly repetitive logic design tasks at an expert level.

From 1985 to 1988, the capabilities of the SID system gradually improved until it was producing gate array chips that met the VAX 9000 machine cycle time, power, and electrical rules requirements.

New Methodology

The VAX 9000 development methodology, shown in Figure 2, circumvents the need to wait for the technology cells to be completely specified before beginning logic design. This methodology uses schematic entry and simulates the technology-independent, register transfer level (RTL) bodies. The RTL library for this type of entry includes MUXes, latches, adders, comparators, incrementers, decoders, and simple Boolean gates. The entry is extracted to a common database format, called CADEX, from which a simulation model is built. A behavior model still exists, but its hierarchy matches the RTL schematic hierarchy at key physical boundaries. Thus, simulation models can be built that consist of a hierarchy of mixed behavior and RTL models.

While logic designers are creating the RTL design, technology engineers are defining the technology cells. In parallel with these activities, synthesis knowledge engineers are writing rules to transform the RTL design into technology cells. These three activities should be completed at the same time, at which point, synthesis produces each of the VAX 9000 system's 77 gate array chips.

The goals for the synthesis program were to

• Simplify design entry and thereby reduce schematic complexity by a factor of 4
• Generate 90 percent of the VAX 9000 system's logic through synthesis
• Reduce the number of simulation errors introduced in the design
• Reduce the number of electrical rules violations in the design

To generate a database for a buildable gate array chip, the synthesis tool is required to

• Read technology-independent input standard net list format, which can be in DECSIM behavioral notation or CADEX common database format
• Minimize Boolean gates through state-of-the-art minimization techniques
• Improve timing-critical paths through Boolean transformations, cell/pin selections, power settings, and net load allocations
• Choose the best available technology cells based on timing, size (area), and power estimates
• Insert the clock system for the gate array chip
• Insert testability access logic for the service processor unit
• Obey all electrical design rules for the gate array chip
• Make it easy to detect whether the tool has performed well
• Simplify the improvement of the tool

SID Database

The design of the SID database is fundamental to the robustness of the CAD system. Previous CAD databases have all assumed that the data is stable at the time that the CAD tools are working with it. Simulation, timing verification, design rule checkers (DRCs), and many other CAD tools assume that net lists and components are fixed and unchanging. In synthesis, although the data is maintained in a form that makes it easy to update its parameter values, the basic structure of gates, pins, and nets remains the same. However, throughout most of the synthesis process, the basic structures are in a state of change.
In fact, it is a characteristic of synthesis that logic functions are removed and replaced with new, functionally equivalent logic. Because of this difference, we designed basic data structures and [...]

Figure 2 VAX 9000 Development Methodology

[...] IS_BOOLEAN, IS_A_NUMBER; adjectives are words such as ANY, ALL, NO. Dbobjects are database objects or the parameters of these objects.

The command forms used for right-side actions are
command dbobject
and
command dbobject preposition dbobject.
Commands are words such as INSERT, REMOVE, REPLACE, MODIFY; prepositions are words such as WITH, TO, FROM. The dbobject can be any of the primary database objects, secondary objects, or their parameters.

For more complex operations, we also allowed LISP functions to be called by prefixing them with the keyword LISP, or by insertion of a LISP expression. Thus, if the rule language cannot implement a required function, a LISP algorithmic routine is called. We used algorithmic transforms in the generation of adder carry-lookahead.

Ruleform Database Access
Because the database could be traversed in any direction for any arbitrary distance through the multidirectional pointer system, rules had to have the same traversal capability. Therefore, the dbobject of the Ruleform language is a shorthand notation of the "database walk." Dbobject can be used in a sentence to compare two database objects by walking to both of them and using a predicate for the comparison.

Had the database access been implemented in pure LISP programming notation, the sentence form would be lost in the many levels of expressions enclosed in parentheses.
One test would occupy many lines of code and would read more like a software program than an English sentence. In this case, the chain of thought of the rule writer, the purpose of which is to capture the step-by-step thoughts of a logic designer in words, would probably be broken.

To improve the comprehension of the notation used for identifying the database object, we developed an [...]
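
The ideas in this section (a netlist whose structure changes during synthesis, pointers that can be followed in any direction, and rules that pair a left-side test with a right-side action) can be illustrated with a small Python sketch. It is not SID code and does not reproduce Ruleform syntax. The class names, the `walk` helper, and the double-inverter rule are all hypothetical, chosen only to show the general shape of a database walk followed by a structural replacement with functionally equivalent logic.

```python
class Net:
    def __init__(self, name):
        self.name = name
        self.driver = None   # Pin that drives this net (None for a primary input)
        self.loads = []      # Pins that read this net


class Pin:
    def __init__(self, name, gate, net):
        self.name = name
        self.gate = gate     # pointer back to the owning gate
        self.net = net       # pointer to the connected net


class Gate:
    def __init__(self, name, kind):
        self.name = name
        self.kind = kind     # e.g. "INV", "AND" (illustrative cell kinds)
        self.inputs = []     # input Pins
        self.output = None   # output Pin


def make_gate(name, kind, in_nets, out_net):
    """Create a gate and wire pointers in both directions (gate<->pin<->net)."""
    gate = Gate(name, kind)
    for i, net in enumerate(in_nets):
        pin = Pin(f"in{i}", gate, net)
        net.loads.append(pin)
        gate.inputs.append(pin)
    gate.output = Pin("out", gate, out_net)
    out_net.driver = gate.output
    return gate


def walk(obj, path):
    """Hypothetical 'database walk': follow a dotted path of pointer names.

    Lists met along the way are flattened so a walk can fan out; links that
    are None (for example, an undriven net) are dropped.
    """
    current = [obj]
    for step in path.split("."):
        following = []
        for item in current:
            value = getattr(item, step)
            if isinstance(value, list):
                following.extend(value)
            elif value is not None:
                following.append(value)
        current = following
    return current


def double_inverter_rule(gate, gates):
    """Illustrative rule: remove two inverters in series.

    Left side: tests expressed as walks plus predicates.
    Right side: actions that change the netlist structure.
    Returns True if the rule fired.
    """
    # IF the gate is an inverter ...
    if gate.kind != "INV":
        return False
    # ... AND the gate driving its input is also an inverter ...
    upstream = walk(gate, "inputs.net.driver.gate")
    if len(upstream) != 1 or upstream[0].kind != "INV":
        return False
    first = upstream[0]
    # ... AND no other pin reads the net between the two inverters ...
    if len(gate.inputs[0].net.loads) != 1:
        return False
    # ... AND the second inverter drives at least one load (this sketch
    # does not handle nets that are primary outputs).
    loads = walk(gate, "output.net.loads")
    if not loads:
        return False
    # THEN reconnect those loads to the original signal and remove both gates.
    source = first.inputs[0].net
    for pin in loads:
        pin.net = source
        source.loads.append(pin)
    source.loads.remove(first.inputs[0])
    gates.remove(first)
    gates.remove(gate)
    return True
```

As a usage example under the same assumptions: build nets `a`, `b`, `n1`, `n2`, and `y`, chain two `INV` gates from `a` through `n1` to `n2`, and feed `n2` and `b` into an `AND` gate driving `y`. Calling `double_inverter_rule` on the second inverter removes both inverters and reconnects the `AND` input directly to `a`. Writing each test as a single `walk` call keeps the rule close to a sentence, which is the readability concern the text raises about deeply nested expressions.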